subscribe to arXiv mailings

arXiv:2407.01972 [pdf, other]

doi 10.1145/3626772.3657662

MeMemo: On-device Retrieval Augmentation for Private and Personalized Text Generation

Abstract: Retrieval-augmented text generation (RAG) addresses the common limitations of large language models (LLMs), such as hallucination, by retrieving information from an updatable external knowledge base. However, existing approaches often require dedicated backend servers for data storage and retrieval, thereby limiting their applicability in use cases that require strict data privacy, such as persona… ▽ More Retrieval-augmented text generation (RAG) addresses the common limitations of large language models (LLMs), such as hallucination, by retrieving information from an updatable external knowledge base. However, existing approaches often require dedicated backend servers for data storage and retrieval, thereby limiting their applicability in use cases that require strict data privacy, such as personal finance, education, and medicine. To address the pressing need for client-side dense retrieval, we introduce MeMemo, the first open-source JavaScript toolkit that adapts the state-of-the-art approximate nearest neighbor search technique HNSW to browser environments. Developed with modern and native Web technologies, such as IndexedDB and Web Workers, our toolkit leverages client-side hardware capabilities to enable researchers and developers to efficiently search through millions of high-dimensional vectors in the browser. MeMemo enables exciting new design and research opportunities, such as private and personalized content creation and interactive prototyping, as demonstrated in our example application RAG Playground. Reflecting on our work, we discuss the opportunities and challenges for on-device dense retrieval. MeMemo is available at https://github.com/poloclub/mememo. △ Less

Submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted to SIGIR 2024. 6 pages, 2 figures. For a live demo, visit https://poloclub.github.io/mememo/. Code is open-source at https://github.com/poloclub/mememo

arXiv:2405.03546 [pdf, other]

CCDM: Continuous Conditional Diffusion Models for Image Generation

Authors: Xin Ding, Yongwei Wang, Kao Zhang, Z. Jane Wang

Abstract: Continuous Conditional Generative Modeling (CCGM) aims to estimate the distribution of high-dimensional data, typically images, conditioned on scalar continuous variables known as regression labels. While Continuous conditional Generative Adversarial Networks (CcGANs) were initially designed for this task, their adversarial training mechanism remains vulnerable to extremely sparse or imbalanced da… ▽ More Continuous Conditional Generative Modeling (CCGM) aims to estimate the distribution of high-dimensional data, typically images, conditioned on scalar continuous variables known as regression labels. While Continuous conditional Generative Adversarial Networks (CcGANs) were initially designed for this task, their adversarial training mechanism remains vulnerable to extremely sparse or imbalanced data, resulting in suboptimal outcomes. To enhance the quality of generated images, a promising alternative is to replace CcGANs with Conditional Diffusion Models (CDMs), renowned for their stable training process and ability to produce more realistic images. However, existing CDMs encounter challenges when applied to CCGM tasks due to several limitations such as inadequate U-Net architectures and deficient model fitting mechanisms for handling regression labels. In this paper, we introduce Continuous Conditional Diffusion Models (CCDMs), the first CDM designed specifically for the CCGM task. CCDMs address the limitations of existing CDMs by introducing specially designed conditional diffusion processes, a modified denoising U-Net with a custom-made conditioning mechanism, a novel hard vicinal loss for model fitting, and an efficient conditional sampling procedure. With comprehensive experiments on four datasets with varying resolutions ranging from 64x64 to 192x192, we demonstrate the superiority of the proposed CCDM over state-of-the-art CCGM models, establishing new benchmarks in CCGM. Extensive ablation studies validate the model design and implementation configuration of the proposed CCDM. Our code is publicly available at https://github.com/UBCDingXin/CCDM. △ Less

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2404.16069 [pdf, other]

Interactive Visual Learning for Stable Diffusion

Authors: Seongmin Lee, Benjamin Hoover, Hendrik Strobelt, Zijie J. Wang, ShengYun Peng, Austin Wright, Kevin Li, Haekyu Park, Haoyang Yang, Polo Chau

Abstract: Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex internal structures and operations often pose challenges for non-experts to grasp. We introduce Diffusion Explainer, the first interactive visualization tool designed to elucidate how Stable Diffusion transforms text prompts into images. It tightly integrates a vi… ▽ More Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex internal structures and operations often pose challenges for non-experts to grasp. We introduce Diffusion Explainer, the first interactive visualization tool designed to elucidate how Stable Diffusion transforms text prompts into images. It tightly integrates a visual overview of Stable Diffusion's complex components with detailed explanations of their underlying operations. This integration enables users to fluidly transition between multiple levels of abstraction through animations and interactive elements. Offering real-time hands-on experience, Diffusion Explainer allows users to adjust Stable Diffusion's hyperparameters and prompts without the need for installation or specialized hardware. Accessible via users' web browsers, Diffusion Explainer is making significant strides in democratizing AI education, fostering broader public access. More than 7,200 users spanning 113 countries have used our open-sourced tool at https://poloclub.github.io/diffusion-explainer/. A video demo is available at https://youtu.be/MbkIADZjPnA. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: 4 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2305.03509

arXiv:2404.01361 [pdf, other]

LLM Attributor: Interactive Visual Attribution for LLM Generation

Authors: Seongmin Lee, Zijie J. Wang, Aishwarya Chakravarthy, Alec Helbling, ShengYun Peng, Mansi Phute, Duen Horng Chau, Minsuk Kahng

Abstract: While large language models (LLMs) have shown remarkable capability to generate convincing text across diverse domains, concerns around its potential risks have highlighted the importance of understanding the rationale behind text generation. We present LLM Attributor, a Python library that provides interactive visualizations for training data attribution of an LLM's text generation. Our library o… ▽ More While large language models (LLMs) have shown remarkable capability to generate convincing text across diverse domains, concerns around its potential risks have highlighted the importance of understanding the rationale behind text generation. We present LLM Attributor, a Python library that provides interactive visualizations for training data attribution of an LLM's text generation. Our library offers a new way to quickly attribute an LLM's text generation to training data points to inspect model behaviors, enhance its trustworthiness, and compare model-generated text with user-provided text. We describe the visual and interactive design of our tool and highlight usage scenarios for LLaMA2 models fine-tuned with two different datasets: online articles about recent disasters and finance-related question-answer pairs. Thanks to LLM Attributor's broad support for computational notebooks, users can easily integrate it into their workflow to interactively visualize attributions of their models. For easier access and extensibility, we open-source LLM Attributor at https://github.com/poloclub/ LLM-Attribution. The video demo is available at https://youtu.be/mIG2MDQKQxM. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: 8 pages, 3 figures, For a video demo, see https://youtu.be/mIG2MDQKQxM

arXiv:2403.19754 [pdf, other]

GOLD: Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation

Authors: Mohsen Gholami, Mohammad Akbari, Cindy Hu, Vaden Masrani, Z. Jane Wang, Yong Zhang

Abstract: Knowledge distillation from LLMs is essential for the efficient deployment of language models. Prior works have proposed data generation using LLMs for preparing distilled models. We argue that generating data with LLMs is prone to sampling mainly from the center of original content distribution. This limitation hinders the distilled model from learning the true underlying data distribution and to… ▽ More Knowledge distillation from LLMs is essential for the efficient deployment of language models. Prior works have proposed data generation using LLMs for preparing distilled models. We argue that generating data with LLMs is prone to sampling mainly from the center of original content distribution. This limitation hinders the distilled model from learning the true underlying data distribution and to forget the tails of the distributions (samples with lower probability). To this end, we propose GOLD, a task-agnostic data generation and knowledge distillation framework, which employs an iterative out-of-distribution-guided feedback mechanism for the LLM. As a result, the generated data improves the generalizability of distilled models. An energy-based OOD evaluation approach is also introduced to deal with noisy generated data. Our extensive experiments on 10 different classification and sequence-to-sequence tasks in NLP show that GOLD respectively outperforms prior arts and the LLM with an average improvement of 5% and 14%. We will also show that the proposed method is applicable to less explored and novel tasks. The code is available. △ Less

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2402.15350 [pdf, other]

doi 10.1145/3613904.3642335

Farsight: Fostering Responsible AI Awareness During AI Application Prototyping

Authors: Zijie J. Wang, Chinmay Kulkarni, Lauren Wilcox, Michael Terry, Michael Madaio

Abstract: Prompt-based interfaces for Large Language Models (LLMs) have made prototyping and building AI-powered applications easier than ever before. However, identifying potential harms that may arise from AI applications remains a challenge, particularly during prompt-based prototyping. To address this, we present Farsight, a novel in situ interactive tool that helps people identify potential harms from… ▽ More Prompt-based interfaces for Large Language Models (LLMs) have made prototyping and building AI-powered applications easier than ever before. However, identifying potential harms that may arise from AI applications remains a challenge, particularly during prompt-based prototyping. To address this, we present Farsight, a novel in situ interactive tool that helps people identify potential harms from the AI applications they are prototyping. Based on a user's prompt, Farsight highlights news articles about relevant AI incidents and allows users to explore and edit LLM-generated use cases, stakeholders, and harms. We report design insights from a co-design study with 10 AI prototypers and findings from a user study with 42 AI prototypers. After using Farsight, AI prototypers in our user study are better able to independently identify potential harms associated with a prompt and find our tool more useful and usable than existing resources. Their qualitative feedback also highlights that Farsight encourages them to focus on end-users and think beyond immediate harms. We discuss these findings and reflect on their implications for designing AI prototyping experiences that meaningfully engage with AI harms. Farsight is publicly accessible at: https://PAIR-code.github.io/farsight. △ Less

Submitted 2 July, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: Accepted to CHI 2024 (Best Paper, Honorable Mention). 40 pages, 19 figures, 5 tables. For a demo video, see https://youtu.be/BlSFbGkOlHk. For a live demo, visit https://PAIR-code.github.io/farsight. The source code is available at https://github.com/PAIR-code/farsight

arXiv:2401.14447 [pdf, other]

Wordflow: Social Prompt Engineering for Large Language Models

Authors: Zijie J. Wang, Aishwarya Chakravarthy, David Munechika, Duen Horng Chau

Abstract: Large language models (LLMs) require well-crafted prompts for effective use. Prompt engineering, the process of designing prompts, is challenging, particularly for non-experts who are less familiar with AI technologies. While researchers have proposed techniques and tools to assist LLM users in prompt design, these works primarily target AI application developers rather than non-experts. To addres… ▽ More Large language models (LLMs) require well-crafted prompts for effective use. Prompt engineering, the process of designing prompts, is challenging, particularly for non-experts who are less familiar with AI technologies. While researchers have proposed techniques and tools to assist LLM users in prompt design, these works primarily target AI application developers rather than non-experts. To address this research gap, we propose social prompt engineering, a novel paradigm that leverages social computing techniques to facilitate collaborative prompt design. To investigate social prompt engineering, we introduce Wordflow, an open-source and social text editor that enables everyday users to easily create, run, share, and discover LLM prompts. Additionally, by leveraging modern web technologies, Wordflow allows users to run LLMs locally and privately in their browsers. Two usage scenarios highlight how social prompt engineering and our tool can enhance laypeople's interaction with LLMs. Wordflow is publicly accessible at https://poloclub.github.io/wordflow. △ Less

Submitted 25 January, 2024; originally announced January 2024.

Comments: 8 pages, 7 figures. Wordflow is available at: https://poloclub.github.io/wordflow. The code is available at: https://github.com/poloclub/wordflow/. For a demo video, see: https://youtu.be/3dOcVuofGVo

arXiv:2401.10029 [pdf]

Cardiac Digital Twin Pipeline for Virtual Therapy Evaluation

Authors: Julia Camps, Zhinuo Jenny Wang, Ruben Doste, Maxx Holmes, Brodie Lawson, Jakub Tomek, Kevin Burrage, Alfonso Bueno-Orovio, Blanca Rodriguez

Abstract: Cardiac digital twins are computational tools capturing key functional and anatomical characteristics of patient hearts for investigating disease phenotypes and predicting responses to therapy. When paired with large-scale computational resources and large clinical datasets, digital twin technology can enable virtual clinical trials on virtual cohorts to fast-track therapy development. Here, we pr… ▽ More Cardiac digital twins are computational tools capturing key functional and anatomical characteristics of patient hearts for investigating disease phenotypes and predicting responses to therapy. When paired with large-scale computational resources and large clinical datasets, digital twin technology can enable virtual clinical trials on virtual cohorts to fast-track therapy development. Here, we present an automated pipeline for personalising ventricular anatomy and electrophysiological function based on routinely acquired cardiac magnetic resonance (CMR) imaging data and the standard 12-lead electrocardiogram (ECG). Using CMR-based anatomical models, a sequential Monte-Carlo approximate Bayesian computational inference method is extended to infer electrical activation and repolarisation characteristics from the ECG. Fast simulations are conducted with a reaction-Eikonal model, including the Purkinje network and biophysically-detailed subcellular ionic current dynamics for repolarisation. For each patient, parameter uncertainty is represented by inferring a population of ventricular models rather than a single one, which means that parameter uncertainty can be propagated to therapy evaluation. Furthermore, we have developed techniques for translating from reaction-Eikonal to monodomain simulations, which allows more realistic simulations of cardiac electrophysiology. The pipeline is demonstrated in a healthy female subject, where our inferred reaction-Eikonal models reproduced the patient's ECG with a Pearson's correlation coefficient of 0.93, and the translated monodomain simulations have a correlation coefficient of 0.89. We then apply the effect of Dofetilide to the monodomain population of models for this subject and show dose-dependent QT and T-peak to T-end prolongations that are in keeping with large population drug response data. △ Less

Submitted 18 January, 2024; originally announced January 2024.

arXiv:2312.14915 [pdf, other]

PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF

Authors: Mohsen Gholami, Rabab Ward, Z. Jane Wang

Abstract: This paper proposes an end-to-end framework for generating 3D human pose datasets using Neural Radiance Fields (NeRF). Public datasets generally have limited diversity in terms of human poses and camera viewpoints, largely due to the resource-intensive nature of collecting 3D human pose data. As a result, pose estimators trained on public datasets significantly underperform when applied to unseen… ▽ More This paper proposes an end-to-end framework for generating 3D human pose datasets using Neural Radiance Fields (NeRF). Public datasets generally have limited diversity in terms of human poses and camera viewpoints, largely due to the resource-intensive nature of collecting 3D human pose data. As a result, pose estimators trained on public datasets significantly underperform when applied to unseen out-of-distribution samples. Previous works proposed augmenting public datasets by generating 2D-3D pose pairs or rendering a large amount of random data. Such approaches either overlook image rendering or result in suboptimal datasets for pre-trained models. Here we propose PoseGen, which learns to generate a dataset (human 3D poses and images) with a feedback loss from a given pre-trained pose estimator. In contrast to prior art, our generated data is optimized to improve the robustness of the pre-trained model. The objective of PoseGen is to learn a distribution of data that maximizes the prediction error of a given pre-trained model. As the learned data distribution contains OOD samples of the pre-trained model, sampling data from such a distribution for further fine-tuning a pre-trained model improves the generalizability of the model. This is the first work that proposes NeRFs for 3D human data generation. NeRFs are data-driven and do not require 3D scans of humans. Therefore, using NeRF for data generation is a new direction for convenient user-specific data generation. Our extensive experiments show that the proposed PoseGen improves two baseline models (SPIN and HybrIK) on four datasets with an average 6% relative improvement. △ Less

Submitted 22 December, 2023; originally announced December 2023.

arXiv:2311.13196 [pdf, other]

Optimal Time of Arrival Estimation for MIMO Backscatter Channels

Authors: Chen He, Luyang Han, Z. Jane Wang

Abstract: In this paper, we propose a novel time of arrival (TOA) estimator for multiple-input-multiple-output (MIMO) backscatter channels in closed form. The proposed estimator refines the estimation precision from the topological structure of the MIMO backscatter channels, and can considerably enhance the estimation accuracy. Particularly, we show that for the general $M \times N$ bistatic topology, the m… ▽ More In this paper, we propose a novel time of arrival (TOA) estimator for multiple-input-multiple-output (MIMO) backscatter channels in closed form. The proposed estimator refines the estimation precision from the topological structure of the MIMO backscatter channels, and can considerably enhance the estimation accuracy. Particularly, we show that for the general $M \times N$ bistatic topology, the mean square error (MSE) is $\frac{M+N-1}{MN}σ^2_0$, and for the general $M \times M$ monostatic topology, it is $\frac{2M-1}{M^2}σ^2_0$ for the diagonal subchannels, and $\frac{M-1}{M^2}σ^2_0$ for the off-diagonal subchannels, where $σ^2_0$ is the MSE of the conventional least square estimator. In addition, we derive the Cramer-Rao lower bound (CRLB) for MIMO backscatter TOA estimation which indicates that the proposed estimator is optimal. Simulation results verify that the proposed TOA estimator can considerably improve both estimation and positioning accuracy, especially when the MIMO scale is large. △ Less

Submitted 22 November, 2023; originally announced November 2023.

arXiv:2310.12347 [pdf, other]

VisGrader: Automatic Grading of D3 Visualizations

Authors: Matthew Hull, Vivian Pednekar, Hannah Murray, Nimisha Roy, Emmanuel Tung, Susanta Routray, Connor Guerin, Justin Chen, Zijie J. Wang, Seongmin Lee, Mahdi Roozbahani, Duen Horng Chau

Abstract: Manually grading D3 data visualizations is a challenging endeavor, and is especially difficult for large classes with hundreds of students. Grading an interactive visualization requires a combination of interactive, quantitative, and qualitative evaluation that are conventionally done manually and are difficult to scale up as the visualization complexity, data size, and number of students increase… ▽ More Manually grading D3 data visualizations is a challenging endeavor, and is especially difficult for large classes with hundreds of students. Grading an interactive visualization requires a combination of interactive, quantitative, and qualitative evaluation that are conventionally done manually and are difficult to scale up as the visualization complexity, data size, and number of students increase. We present VisGrader, a first-of-its kind automatic grading method for D3 visualizations that scalably and precisely evaluates the data bindings, visual encodings, interactions, and design specifications used in a visualization. Our method enhances students learning experience, enabling them to submit their code frequently and receive rapid feedback to better inform iteration and improvement to their code and visualization design. We have successfully deployed our method and auto-graded D3 submissions from more than 4000 students in a visualization course at Georgia Tech, and received positive feedback for expanding its adoption. △ Less

Submitted 19 October, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.12243 [pdf, other]

REVAMP: Automated Simulations of Adversarial Attacks on Arbitrary Objects in Realistic Scenes

Authors: Matthew Hull, Zijie J. Wang, Duen Horng Chau

Abstract: Deep Learning models, such as those used in an autonomous vehicle are vulnerable to adversarial attacks where an attacker could place an adversarial object in the environment, leading to mis-classification. Generating these adversarial objects in the digital space has been extensively studied, however successfully transferring these attacks from the digital realm to the physical realm has proven c… ▽ More Deep Learning models, such as those used in an autonomous vehicle are vulnerable to adversarial attacks where an attacker could place an adversarial object in the environment, leading to mis-classification. Generating these adversarial objects in the digital space has been extensively studied, however successfully transferring these attacks from the digital realm to the physical realm has proven challenging when controlling for real-world environmental factors. In response to these limitations, we introduce REVAMP, an easy-to-use Python library that is the first-of-its-kind tool for creating attack scenarios with arbitrary objects and simulating realistic environmental factors, lighting, reflection, and refraction. REVAMP enables researchers and practitioners to swiftly explore various scenarios within the digital realm by offering a wide range of configurable options for designing experiments and using differentiable rendering to reproduce physically plausible adversarial objects. We will demonstrate and invite the audience to try REVAMP to produce an adversarial texture on a chosen object while having control over various scene parameters. The audience will choose a scene, an object to attack, the desired attack class, and the number of camera positions to use. Then, in real time, we show how this altered texture causes the chosen object to be mis-classified, showcasing the potential of REVAMP in real-world scenarios. REVAMP is open-source and available at https://github.com/poloclub/revamp. △ Less

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.05123 [pdf, other]

Distribution-Based Trajectory Clustering

Authors: Zi Jing Wang, Ye Zhu, Kai Ming Ting

Abstract: Trajectory clustering enables the discovery of common patterns in trajectory data. Current methods of trajectory clustering rely on a distance measure between two points in order to measure the dissimilarity between two trajectories. The distance measures employed have two challenges: high computational cost and low fidelity. Independent of the distance measure employed, existing clustering algori… ▽ More Trajectory clustering enables the discovery of common patterns in trajectory data. Current methods of trajectory clustering rely on a distance measure between two points in order to measure the dissimilarity between two trajectories. The distance measures employed have two challenges: high computational cost and low fidelity. Independent of the distance measure employed, existing clustering algorithms have another challenge: either effectiveness issues or high time complexity. In this paper, we propose to use a recent Isolation Distributional Kernel (IDK) as the main tool to meet all three challenges. The new IDK-based clustering algorithm, called TIDKC, makes full use of the distributional kernel for trajectory similarity measuring and clustering. TIDKC identifies non-linearly separable clusters with irregular shapes and varied densities in linear time. It does not rely on random initialisation and is robust to outliers. An extensive evaluation on 7 large real-world trajectory datasets confirms that IDK is more effective in capturing complex structures in trajectories than traditional and deep learning-based distance measures. Furthermore, the proposed TIDKC has superior clustering performance and efficiency to existing trajectory clustering algorithms. △ Less

Submitted 30 October, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

arXiv:2306.09328 [pdf, other]

WizMap: Scalable Interactive Visualization for Exploring Large Machine Learning Embeddings

Authors: Zijie J. Wang, Fred Hohman, Duen Horng Chau

Abstract: Machine learning models often learn latent embedding representations that capture the domain semantics of their training data. These embedding representations are valuable for interpreting trained models, building new models, and analyzing new datasets. However, interpreting and using embeddings can be challenging due to their opaqueness, high dimensionality, and the large size of modern datasets.… ▽ More Machine learning models often learn latent embedding representations that capture the domain semantics of their training data. These embedding representations are valuable for interpreting trained models, building new models, and analyzing new datasets. However, interpreting and using embeddings can be challenging due to their opaqueness, high dimensionality, and the large size of modern datasets. To tackle these challenges, we present WizMap, an interactive visualization tool to help researchers and practitioners easily explore large embeddings. With a novel multi-resolution embedding summarization method and a familiar map-like interaction design, WizMap enables users to navigate and interpret embedding spaces with ease. Leveraging modern web technologies such as WebGL and Web Workers, WizMap scales to millions of embedding points directly in users' web browsers and computational notebooks without the need for dedicated backend servers. WizMap is open-source and available at the following public demo link: https://poloclub.github.io/wizmap. △ Less

Submitted 15 June, 2023; originally announced June 2023.

Comments: 8 pages, 8 figures, Accepted to ACL 2023. For a demo video, see https://youtu.be/8fJG87QVceQ. For a live demo, see https://poloclub.github.io/wizmap. Code is available at https://github.com/poloclub/wizmap

arXiv:2305.03509 [pdf, other]

Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion

Authors: Seongmin Lee, Benjamin Hoover, Hendrik Strobelt, Zijie J. Wang, ShengYun Peng, Austin Wright, Kevin Li, Haekyu Park, Haoyang Yang, Duen Horng Chau

Abstract: Diffusion-based generative models' impressive ability to create convincing images has captured global attention. However, their complex internal structures and operations often make them difficult for non-experts to understand. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly… ▽ More Diffusion-based generative models' impressive ability to create convincing images has captured global attention. However, their complex internal structures and operations often make them difficult for non-experts to understand. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex components with detailed explanations of their underlying operations, enabling users to fluidly transition between multiple levels of abstraction through animations and interactive elements. By comparing the evolutions of image representations guided by two related text prompts over refinement timesteps, users can discover the impact of prompts on image generation. Diffusion Explainer runs locally in users' web browsers without the need for installation or specialized hardware, broadening the public's education access to modern AI techniques. Our open-sourced tool is available at: https://poloclub.github.io/diffusion-explainer/. A video demo is available at https://youtu.be/Zg4gxdIWDds. △ Less

Submitted 8 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

Comments: 5 pages, 5 figures

arXiv:2305.03039 [pdf, other]

doi 10.1145/3613905.3650848

SuperNOVA: Design Strategies and Opportunities for Interactive Visualization in Computational Notebooks

Authors: Zijie J. Wang, David Munechika, Seongmin Lee, Duen Horng Chau

Abstract: Computational notebooks, such as Jupyter Notebook, have become data scientists' de facto programming environments. Many visualization researchers and practitioners have developed interactive visualization tools that support notebooks, yet little is known about the appropriate design of these tools. To address this critical research gap, we investigate the design strategies in this space by analyzi… ▽ More Computational notebooks, such as Jupyter Notebook, have become data scientists' de facto programming environments. Many visualization researchers and practitioners have developed interactive visualization tools that support notebooks, yet little is known about the appropriate design of these tools. To address this critical research gap, we investigate the design strategies in this space by analyzing 163 notebook visualization tools. Our analysis encompasses 64 systems from academic papers and 105 systems sourced from a pool of 55k notebooks containing interactive visualizations that we obtain via scraping 8.6 million notebooks on GitHub. Through this study, we identify key design implications and trade-offs, such as leveraging multimodal data in notebooks as well as balancing the degree of visualization-notebook integration. Furthermore, we provide empirical evidence that tools compatible with more notebook platforms have a greater impact. Finally, we develop SuperNOVA, an open-source interactive browser to help researchers explore existing notebook visualization tools. SuperNOVA is publicly accessible at: https://poloclub.github.io/supernova/. △ Less

Submitted 28 March, 2024; v1 submitted 4 May, 2023; originally announced May 2023.

Comments: Accepted at CHI 2024 (Late-Breaking Work). 17 pages, 11 figures, 1 table. SuperNOVA is available at: http://poloclub.github.io/supernova/. The code is available at: https://github.com/poloclub/supernova

arXiv:2304.05967 [pdf, other]

doi 10.1145/3544548.3580790

Angler: Helping Machine Translation Practitioners Prioritize Model Improvements

Authors: Samantha Robertson, Zijie J. Wang, Dominik Moritz, Mary Beth Kery, Fred Hohman

Abstract: Machine learning (ML) models can fail in unexpected ways in the real world, but not all model failures are equal. With finite time and resources, ML practitioners are forced to prioritize their model debugging and improvement efforts. Through interviews with 13 ML practitioners at Apple, we found that practitioners construct small targeted test sets to estimate an error's nature, scope, and impact… ▽ More Machine learning (ML) models can fail in unexpected ways in the real world, but not all model failures are equal. With finite time and resources, ML practitioners are forced to prioritize their model debugging and improvement efforts. Through interviews with 13 ML practitioners at Apple, we found that practitioners construct small targeted test sets to estimate an error's nature, scope, and impact on users. We built on this insight in a case study with machine translation models, and developed Angler, an interactive visual analytics tool to help practitioners prioritize model improvements. In a user study with 7 machine translation experts, we used Angler to understand prioritization practices when the input space is infinite, and obtaining reliable signals of model quality is expensive. Our study revealed that participants could form more interesting and user-focused hypotheses for prioritization by analyzing quantitative summary statistics and qualitatively assessing data by reading sentences. △ Less

Submitted 12 April, 2023; originally announced April 2023.

Comments: Accepted to CHI 2023. 20 pages, 6 figures

arXiv:2303.16972 [pdf, other]

doi 10.1145/3593013.3594134

Queer In AI: A Case Study in Community-Led Participatory AI

Authors: Organizers Of QueerInAI, :, Anaelia Ovalle, Arjun Subramonian, Ashwin Singh, Claas Voelcker, Danica J. Sutherland, Davide Locatelli, Eva Breznik, Filip Klubička, Hang Yuan, Hetvi J, Huan Zhang, Jaidev Shriram, Kruno Lehman, Luca Soldaini, Maarten Sap, Marc Peter Deisenroth, Maria Leonor Pacheco, Maria Ryskina, Martin Mundt, Milind Agarwal, Nyx McLean, Pan Xu, A Pranav , et al. (26 additional authors not shown)

Abstract: We present Queer in AI as a case study for community-led participatory design in AI. We examine how participatory design and intersectional tenets started and shaped this community's programs over the years. We discuss different challenges that emerged in the process, look at ways this organization has fallen short of operationalizing participatory and intersectional principles, and then assess th… ▽ More We present Queer in AI as a case study for community-led participatory design in AI. We examine how participatory design and intersectional tenets started and shaped this community's programs over the years. We discuss different challenges that emerged in the process, look at ways this organization has fallen short of operationalizing participatory and intersectional principles, and then assess the organization's impact. Queer in AI provides important lessons and insights for practitioners and theorists of participatory methods broadly through its rejection of hierarchy in favor of decentralization, success at building aid and programs by and for the queer community, and effort to change actors and institutions outside of the queer community. Finally, we theorize how communities like Queer in AI contribute to the participatory design in AI more broadly by fostering cultures of participation in AI, welcoming and empowering marginalized participants, critiquing poor or exploitative participatory practices, and bringing participation to institutions outside of individual research projects. Queer in AI's work serves as a case study of grassroots activism and participatory methods within AI, demonstrating the potential of community-led participatory methods and intersectional praxis, while also providing challenges, case studies, and nuanced insights to researchers developing and using participatory methods. △ Less

Submitted 8 June, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: To appear at FAccT 2023

Journal ref: 2023 ACM Conference on Fairness, Accountability, and Transparency

arXiv:2303.09545 [pdf, other]

doi 10.1145/3543873.3587362

WebSHAP: Towards Explaining Any Machine Learning Models Anywhere

Authors: Zijie J. Wang, Duen Horng Chau

Abstract: As machine learning (ML) is increasingly integrated into our everyday Web experience, there is a call for transparent and explainable web-based ML. However, existing explainability techniques often require dedicated backend servers, which limit their usefulness as the Web community moves toward in-browser ML for lower latency and greater privacy. To address the pressing need for a client-side expl… ▽ More As machine learning (ML) is increasingly integrated into our everyday Web experience, there is a call for transparent and explainable web-based ML. However, existing explainability techniques often require dedicated backend servers, which limit their usefulness as the Web community moves toward in-browser ML for lower latency and greater privacy. To address the pressing need for a client-side explainability solution, we present WebSHAP, the first in-browser tool that adapts the state-of-the-art model-agnostic explainability technique SHAP to the Web environment. Our open-source tool is developed with modern Web technologies such as WebGL that leverage client-side hardware capabilities and make it easy to integrate into existing Web ML applications. We demonstrate WebSHAP in a usage scenario of explaining ML-based loan approval decisions to loan applicants. Reflecting on our work, we discuss the opportunities and challenges for future research on transparent Web ML. WebSHAP is available at https://github.com/poloclub/webshap. △ Less

Submitted 16 March, 2023; originally announced March 2023.

Comments: 5 pages, 4 figures. Accepted at the ACM Web Conference 2023 (WWW 2023). For a live demo, visit https://poloclub.github.io/webshap/. Code is open-source at https://github.com/poloclub/webshap

arXiv:2302.14165 [pdf, other]

doi 10.1145/3544548.3580816

GAM Coach: Towards Interactive and User-centered Algorithmic Recourse

Authors: Zijie J. Wang, Jennifer Wortman Vaughan, Rich Caruana, Duen Horng Chau

Abstract: Machine learning (ML) recourse techniques are increasingly used in high-stakes domains, providing end users with actions to alter ML predictions, but they assume ML developers understand what input variables can be changed. However, a recourse plan's actionability is subjective and unlikely to match developers' expectations completely. We present GAM Coach, a novel open-source system that adapts i… ▽ More Machine learning (ML) recourse techniques are increasingly used in high-stakes domains, providing end users with actions to alter ML predictions, but they assume ML developers understand what input variables can be changed. However, a recourse plan's actionability is subjective and unlikely to match developers' expectations completely. We present GAM Coach, a novel open-source system that adapts integer linear programming to generate customizable counterfactual explanations for Generalized Additive Models (GAMs), and leverages interactive visualizations to enable end users to iteratively generate recourse plans meeting their needs. A quantitative user study with 41 participants shows our tool is usable and useful, and users prefer personalized recourse plans over generic plans. Through a log analysis, we explore how users discover satisfactory recourse plans, and provide empirical evidence that transparency can lead to more opportunities for everyday users to discover counterintuitive patterns in ML models. GAM Coach is available at: https://poloclub.github.io/gam-coach/. △ Less

Submitted 28 February, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

Comments: Accepted to CHI 2023. 20 pages, 12 figures. For a demo video, see https://youtu.be/ubacP34H9XE. For a live demo, visit https://poloclub.github.io/gam-coach/

arXiv:2212.04029 [pdf, other]

Occlusion-Robust FAU Recognition by Mining Latent Space of Masked Autoencoders

Authors: Minyang Jiang, Yongwei Wang, Martin J. McKeown, Z. Jane Wang

Abstract: Facial action units (FAUs) are critical for fine-grained facial expression analysis. Although FAU detection has been actively studied using ideally high quality images, it was not thoroughly studied under heavily occluded conditions. In this paper, we propose the first occlusion-robust FAU recognition method to maintain FAU detection performance under heavy occlusions. Our novel approach takes adv… ▽ More Facial action units (FAUs) are critical for fine-grained facial expression analysis. Although FAU detection has been actively studied using ideally high quality images, it was not thoroughly studied under heavily occluded conditions. In this paper, we propose the first occlusion-robust FAU recognition method to maintain FAU detection performance under heavy occlusions. Our novel approach takes advantage of rich information from the latent space of masked autoencoder (MAE) and transforms it into FAU features. Bypassing the occlusion reconstruction step, our model efficiently extracts FAU features of occluded faces by mining the latent space of a pretrained masked autoencoder. Both node and edge-level knowledge distillation are also employed to guide our model to find a mapping between latent space vectors and FAU features. Facial occlusion conditions, including random small patches and large blocks, are thoroughly studied. Experimental results on BP4D and DISFA datasets show that our method can achieve state-of-the-art performances under the studied facial occlusion, significantly outperforming existing baseline methods. In particular, even under heavy occlusion, the proposed method can achieve comparable performance as state-of-the-art methods under normal conditions. △ Less

Submitted 7 December, 2022; originally announced December 2022.

arXiv:2211.04020 [pdf, other]

Generating counterfactual explanations of tumor spatial proteomes to discover effective strategies for enhancing immune infiltration

Authors: Zitong Jerry Wang, Alexander M. Xu, Aman Bhargava, Matt W. Thomson

Abstract: The tumor microenvironment (TME) significantly impacts cancer prognosis due to its immune composition. While therapies for altering the immune composition, including immunotherapies, have shown exciting results for treating hematological cancers, they are less effective for immunologically-cold, solid tumors. Spatial omics technologies capture the spatial organization of the TME with unprecedented… ▽ More The tumor microenvironment (TME) significantly impacts cancer prognosis due to its immune composition. While therapies for altering the immune composition, including immunotherapies, have shown exciting results for treating hematological cancers, they are less effective for immunologically-cold, solid tumors. Spatial omics technologies capture the spatial organization of the TME with unprecedented molecular detail, revealing the relationship between immune cell localization and molecular signals. Here, we formulate T-cell infiltration prediction as a self-supervised machine learning problem and develop a counterfactual optimization strategy that leverages large scale spatial omics profiles of patient tumors to design tumor perturbations predicted to boost T-cell infiltration. A convolutional neural network predicts T-cell distribution based on signaling molecules in the TME provided by imaging mass cytometry. Gradient-based counterfactual generation, then, computes perturbations predicted to boost T-cell abundance. We apply our framework to melanoma, colorectal cancer liver metastases, and breast tumor data, discovering combinatorial perturbations predicted to support T-cell infiltration across tens to hundreds of patients. This work presents a paradigm for counterfactual-based prediction and design of cancer therapeutics using spatial omics data. △ Less

Submitted 13 October, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

arXiv:2210.14896 [pdf, other]

DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models

Authors: Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, Duen Horng Chau

Abstract: With recent advancements in diffusion models, users can generate high-quality images by writing text prompts in natural language. However, generating images with desired details requires proper prompts, and it is often unclear how a model reacts to different prompts or what the best prompts are. To help researchers tackle these critical challenges, we introduce DiffusionDB, the first large-scale t… ▽ More With recent advancements in diffusion models, users can generate high-quality images by writing text prompts in natural language. However, generating images with desired details requires proper prompts, and it is often unclear how a model reacts to different prompts or what the best prompts are. To help researchers tackle these critical challenges, we introduce DiffusionDB, the first large-scale text-to-image prompt dataset totaling 6.5TB, containing 14 million images generated by Stable Diffusion, 1.8 million unique prompts, and hyperparameters specified by real users. We analyze the syntactic and semantic characteristics of prompts. We pinpoint specific hyperparameter values and prompt styles that can lead to model errors and present evidence of potentially harmful model usage, such as the generation of misinformation. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models. DiffusionDB is publicly available at: https://poloclub.github.io/diffusiondb. △ Less

Submitted 6 July, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: Accepted to ACL 2023 (nominated for best paper, top 1.6% of submissions, oral presentation). 17 pages, 11 figures. The dataset is available at https://huggingface.co/datasets/poloclub/diffusiondb. The code is at https://github.com/poloclub/diffusiondb. The interactive visualization demo is at https://poloclub.github.io/diffusiondb/explorer/

arXiv:2210.00160 [pdf, other]

Explaining Website Reliability by Visualizing Hyperlink Connectivity

Authors: Seongmin Lee, Sadia Afroz, Haekyu Park, Zijie J. Wang, Omar Shaikh, Vibhor Sehgal, Ankit Peshin, Duen Horng Chau

Abstract: As the information on the Internet continues growing exponentially, understanding and assessing the reliability of a website is becoming increasingly important. Misinformation has far-ranging repercussions, from sowing mistrust in media to undermining democratic elections. While some research investigates how to alert people to misinformation on the web, much less research has been conducted on ex… ▽ More As the information on the Internet continues growing exponentially, understanding and assessing the reliability of a website is becoming increasingly important. Misinformation has far-ranging repercussions, from sowing mistrust in media to undermining democratic elections. While some research investigates how to alert people to misinformation on the web, much less research has been conducted on explaining how websites engage in spreading false information. To fill the research gap, we present MisVis, a web-based interactive visualization tool that helps users assess a website's reliability by understanding how it engages in spreading false information on the World Wide Web. MisVis visualizes the hyperlink connectivity of the website and summarizes key characteristics of the Twitter accounts that mention the site. A large-scale user study with 139 participants demonstrates that MisVis facilitates users to assess and understand false information on the web and node-link diagrams can be used to communicate with non-experts. MisVis is available at the public demo link: https://poloclub.github.io/MisVis. △ Less

Submitted 30 September, 2022; originally announced October 2022.

Comments: Accepted at IEEE VIS 2022, 5 pages, 4 figures, For a live demo, visit https://poloclub.github.io/MisVis

arXiv:2209.09227 [pdf, other]

doi 10.1109/VIS54862.2022.00021

TimberTrek: Exploring and Curating Sparse Decision Trees with Interactive Visualization

Authors: Zijie J. Wang, Chudi Zhong, Rui Xin, Takuya Takagi, Zhi Chen, Duen Horng Chau, Cynthia Rudin, Margo Seltzer

Abstract: Given thousands of equally accurate machine learning (ML) models, how can users choose among them? A recent ML technique enables domain experts and data scientists to generate a complete Rashomon set for sparse decision trees--a huge set of almost-optimal interpretable ML models. To help ML practitioners identify models with desirable properties from this Rashomon set, we develop TimberTrek, the f… ▽ More Given thousands of equally accurate machine learning (ML) models, how can users choose among them? A recent ML technique enables domain experts and data scientists to generate a complete Rashomon set for sparse decision trees--a huge set of almost-optimal interpretable ML models. To help ML practitioners identify models with desirable properties from this Rashomon set, we develop TimberTrek, the first interactive visualization system that summarizes thousands of sparse decision trees at scale. Two usage scenarios highlight how TimberTrek can empower users to easily explore, compare, and curate models that align with their domain knowledge and values. Our open-source tool runs directly in users' computational notebooks and web browsers, lowering the barrier to creating more responsible ML models. TimberTrek is available at the following public demo link: https://poloclub.github.io/timbertrek. △ Less

Submitted 19 September, 2022; originally announced September 2022.

Comments: Accepted at IEEE VIS 2022. 5 pages, 6 figures. For a demo video, see https://youtu.be/3eGqTmsStJM. For a live demo, visit https://poloclub.github.io/timbertrek

arXiv:2209.04966 [pdf, other]

Multi-modal Streaming 3D Object Detection

Authors: Mazen Abdelfattah, Kaiwen Yuan, Z. Jane Wang, Rabab Ward

Abstract: Modern autonomous vehicles rely heavily on mechanical LiDARs for perception. Current perception methods generally require 360° point clouds, collected sequentially as the LiDAR scans the azimuth and acquires consecutive wedge-shaped slices. The acquisition latency of a full scan (~ 100ms) may lead to outdated perception which is detrimental to safe operation. Recent streaming perception works prop… ▽ More Modern autonomous vehicles rely heavily on mechanical LiDARs for perception. Current perception methods generally require 360° point clouds, collected sequentially as the LiDAR scans the azimuth and acquires consecutive wedge-shaped slices. The acquisition latency of a full scan (~ 100ms) may lead to outdated perception which is detrimental to safe operation. Recent streaming perception works proposed directly processing LiDAR slices and compensating for the narrow field of view (FOV) of a slice by reusing features from preceding slices. These works, however, are all based on a single modality and require past information which may be outdated. Meanwhile, images from high-frequency cameras can support streaming models as they provide a larger FoV compared to a LiDAR slice. However, this difference in FoV complicates sensor fusion. To address this research gap, we propose an innovative camera-LiDAR streaming 3D object detection framework that uses camera images instead of past LiDAR slices to provide an up-to-date, dense, and wide context for streaming perception. The proposed method outperforms prior streaming models on the challenging NuScenes benchmark. It also outperforms powerful full-scan detectors while being much faster. Our method is shown to be robust to missing camera images, narrow LiDAR slices, and small camera-LiDAR miscalibration. △ Less

Submitted 11 September, 2022; originally announced September 2022.

arXiv:2206.15465 [pdf, other]

doi 10.1145/3534678.3539074

Interpretability, Then What? Editing Machine Learning Models to Reflect Human Knowledge and Values

Authors: Zijie J. Wang, Alex Kale, Harsha Nori, Peter Stella, Mark E. Nunnally, Duen Horng Chau, Mihaela Vorvoreanu, Jennifer Wortman Vaughan, Rich Caruana

Abstract: Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions--potentially causing harms once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientists, we develop GAM Changer, the first interactive… ▽ More Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions--potentially causing harms once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientists, we develop GAM Changer, the first interactive system to help domain experts and data scientists easily and responsibly edit Generalized Additive Models (GAMs) and fix problematic patterns. With novel interaction techniques, our tool puts interpretability into action--empowering users to analyze, validate, and align model behaviors with their knowledge and values. Physicians have started to use our tool to investigate and fix pneumonia and sepsis risk prediction models, and an evaluation with 7 data scientists working in diverse domains highlights that our tool is easy to use, meets their model editing needs, and fits into their current workflows. Built with modern web technologies, our tool runs locally in users' web browsers or computational notebooks, lowering the barrier to use. GAM Changer is available at the following public demo link: https://interpret.ml/gam-changer. △ Less

Submitted 30 June, 2022; originally announced June 2022.

Comments: Accepted at KDD 2022. 11 pages, 19 figures. For a demo video, see https://youtu.be/D6whtfInqTc. For a live demo, visit https://interpret.ml/gam-changer

arXiv:2206.13801 [pdf, other]

Joint Precoding for Active Intelligent Transmitting Surface Empowered Outdoor-to-Indoor Communication in mmWave Cellular Networks

Authors: Xie Xie, Chen He, Feifei Gao, Zhu Han, Z. Jane Wang

Abstract: Outdoor-to-indoor communications in millimeter-wave (mmWave) cellular networks have been one challenging research problem due to the severe attenuation and the high penetration loss caused by the propagation characteristics of mmWave signals. We propose a viable solution to implement the outdoor-to-indoor mmWave communication system with the aid of an active intelligent transmitting surface (activ… ▽ More Outdoor-to-indoor communications in millimeter-wave (mmWave) cellular networks have been one challenging research problem due to the severe attenuation and the high penetration loss caused by the propagation characteristics of mmWave signals. We propose a viable solution to implement the outdoor-to-indoor mmWave communication system with the aid of an active intelligent transmitting surface (active-ITS), where the active-ITS allows the incoming signal from an outdoor base station (BS) to pass through the surface and be received by the indoor user-equipments (UEs) after shifting its phase and magnifying its amplitude. Then, the problem of joint precoding of the BS and active-ITS is investigated to maximize the weighted sum-rate (WSR) of the communication system. An efficient block coordinate descent (BCD) based algorithm is developed to solve it with the suboptimal solutions in nearly closed-forms. In addition, to reduce the size and hardware cost of an active-ITS, we provide a block-amplifying architecture to partially remove the circuit components for power-amplifying, where multiple transmissive-type elements (TEs) in each block share a same power amplifier. Simulations indicate that active-ITS has the potential of achieving a given performance with much fewer TEs compared to the passive-ITS under the same total system power consumption, which makes it suitable for application to the size-limited and aesthetic-needed scenario, and the inevitable performance degradation caused by the block-amplifying architecture is acceptable. △ Less

Submitted 28 June, 2022; originally announced June 2022.

Comments: 30 pages, 8 figures

arXiv:2206.12540 [pdf, other]

Visual Auditor: Interactive Visualization for Detection and Summarization of Model Biases

Authors: David Munechika, Zijie J. Wang, Jack Reidy, Josh Rubin, Krishna Gade, Krishnaram Kenthapadi, Duen Horng Chau

Abstract: As machine learning (ML) systems become increasingly widespread, it is necessary to audit these systems for biases prior to their deployment. Recent research has developed algorithms for effectively identifying intersectional bias in the form of interpretable, underperforming subsets (or slices) of the data. However, these solutions and their insights are limited without a tool for visually unders… ▽ More As machine learning (ML) systems become increasingly widespread, it is necessary to audit these systems for biases prior to their deployment. Recent research has developed algorithms for effectively identifying intersectional bias in the form of interpretable, underperforming subsets (or slices) of the data. However, these solutions and their insights are limited without a tool for visually understanding and interacting with the results of these algorithms. We propose Visual Auditor, an interactive visualization tool for auditing and summarizing model biases. Visual Auditor assists model validation by providing an interpretable overview of intersectional bias (bias that is present when examining populations defined by multiple features), details about relationships between problematic data slices, and a comparison between underperforming and overperforming data slices in a model. Our open-source tool runs directly in both computational notebooks and web browsers, making model auditing accessible and easily integrated into current ML development workflows. An observational user study in collaboration with domain experts at Fiddler AI highlights that our tool can help ML practitioners identify and understand model biases. △ Less

Submitted 24 June, 2022; originally announced June 2022.

arXiv:2206.05375 [pdf, other]

Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer

Authors: Dan Wang, Xinrui Cui, Septimiu Salcudean, Z. Jane Wang

Abstract: We propose a Transformer-based NeRF (TransNeRF) to learn a generic neural radiance field conditioned on observed-view images for the novel view synthesis task. By contrast, existing MLP-based NeRFs are not able to directly receive observed views with an arbitrary number and require an auxiliary pooling-based operation to fuse source-view information, resulting in the missing of complicated relatio… ▽ More We propose a Transformer-based NeRF (TransNeRF) to learn a generic neural radiance field conditioned on observed-view images for the novel view synthesis task. By contrast, existing MLP-based NeRFs are not able to directly receive observed views with an arbitrary number and require an auxiliary pooling-based operation to fuse source-view information, resulting in the missing of complicated relationships between source views and the target rendering view. Furthermore, current approaches process each 3D point individually and ignore the local consistency of a radiance field scene representation. These limitations potentially can reduce their performance in challenging real-world applications where large differences between source views and a novel rendering view may exist. To address these challenges, our TransNeRF utilizes the attention mechanism to naturally decode deep associations of an arbitrary number of source views into a coordinate-based scene representation. Local consistency of shape and appearance are considered in the ray-cast space and the surrounding-view space within a unified Transformer network. Experiments demonstrate that our TransNeRF, trained on a wide variety of scenes, can achieve better performance in comparison to state-of-the-art image-based neural rendering methods in both scene-agnostic and per-scene finetuning scenarios especially when there is a considerable gap between source views and a rendering view. △ Less

Submitted 10 June, 2022; originally announced June 2022.

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur… ▽ More Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting. △ Less

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2205.09744 [pdf, other]

Overcoming Language Disparity in Online Content Classification with Multimodal Learning

Authors: Gaurav Verma, Rohit Mujumdar, Zijie J. Wang, Munmun De Choudhury, Srijan Kumar

Abstract: Advances in Natural Language Processing (NLP) have revolutionized the way researchers and practitioners address crucial societal problems. Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. However, the development of advanced computational techniques and resources is disproportionately focused on the English language, side… ▽ More Advances in Natural Language Processing (NLP) have revolutionized the way researchers and practitioners address crucial societal problems. Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. However, the development of advanced computational techniques and resources is disproportionately focused on the English language, sidelining a majority of the languages spoken globally. While existing research has developed better multilingual and monolingual language models to bridge this language disparity between English and non-English languages, we explore the promise of incorporating the information contained in images via multimodal machine learning. Our comparative analyses on three detection tasks focusing on crisis information, fake news, and emotion recognition, as well as five high-resource non-English languages, demonstrate that: (a) detection frameworks based on pre-trained large language models like BERT and multilingual-BERT systematically perform better on the English language compared against non-English languages, and (b) including images via multimodal learning bridges this performance gap. We situate our findings with respect to existing work on the pitfalls of large language models, and discuss their theoretical and practical implications. Resources for this paper are available at https://multimodality-language-disparity.github.io/. △ Less

Submitted 19 May, 2022; originally announced May 2022.

Comments: Accepted for publication at ICWSM 2022 as a full paper

arXiv:2205.03963 [pdf, other]

NOVA: A Practical Method for Creating Notebook-Ready Visual Analytics

Authors: Zijie J. Wang, David Munechika, Seongmin Lee, Duen Horng Chau

Abstract: How can we develop visual analytics (VA) tools that can be easily adopted? Visualization researchers have developed a large number of web-based VA tools to help data scientists in a wide range of tasks. However, adopting these standalone systems can be challenging, as they require data scientists to create new workflows to streamline the VA processes. Recent surveys suggest computational notebooks… ▽ More How can we develop visual analytics (VA) tools that can be easily adopted? Visualization researchers have developed a large number of web-based VA tools to help data scientists in a wide range of tasks. However, adopting these standalone systems can be challenging, as they require data scientists to create new workflows to streamline the VA processes. Recent surveys suggest computational notebooks have been dominating data scientists' analytical workflows, as these notebooks seamlessly combine text, code, and visualization, allowing users to rapidly iterate code experiments. To help visualization researchers develop VA tools that can be easily integrated into existing data science workflows, we present NOVA, a simple and flexible method to adapt web-based VA systems for notebooks. We provide detailed examples of using this method with diverse web development technologies and different types of computational notebooks. Deployed application examples highlight that NOVA is easy to adopt, and data scientists appreciate in-notebook VA. NOVA is available at https://github.com/poloclub/nova. △ Less

Submitted 15 May, 2023; v1 submitted 8 May, 2022; originally announced May 2022.

Comments: Accepted to IEEE VIS 2022 (poster). 2 pages, 1 figure. For a live demo, visit https://poloclub.github.io/nova. For method application examples, see https://github.com/poloclub/nova

arXiv:2204.05899 [pdf, other]

VisCUIT: Visual Auditor for Bias in CNN Image Classifier

Authors: Seongmin Lee, Zijie J. Wang, Judy Hoffman, Duen Horng Chau

Abstract: CNN image classifiers are widely used, thanks to their efficiency and accuracy. However, they can suffer from biases that impede their practical applications. Most existing bias investigation techniques are either inapplicable to general image classification tasks or require significant user efforts in perusing all data subgroups to manually specify which data attributes to inspect. We present Vis… ▽ More CNN image classifiers are widely used, thanks to their efficiency and accuracy. However, they can suffer from biases that impede their practical applications. Most existing bias investigation techniques are either inapplicable to general image classification tasks or require significant user efforts in perusing all data subgroups to manually specify which data attributes to inspect. We present VisCUIT, an interactive visualization system that reveals how and why a CNN classifier is biased. VisCUIT visually summarizes the subgroups on which the classifier underperforms and helps users discover and characterize the cause of the underperformances by revealing image concepts responsible for activating neurons that contribute to misclassifications. VisCUIT runs in modern browsers and is open-source, allowing people to easily access and extend the tool to other model architectures and datasets. VisCUIT is available at the following public demo link: https://poloclub.github.io/VisCUIT. A video demo is available at https://youtu.be/eNDbSyM4R_4. △ Less

Submitted 13 April, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

Comments: 9 pages, 4 figures

arXiv:2203.11490 [pdf, other]

SSD-KD: A Self-supervised Diverse Knowledge Distillation Method for Lightweight Skin Lesion Classification Using Dermoscopic Images

Authors: Yongwei Wang, Yuheng Wang, Tim K. Lee, Chunyan Miao, Z. Jane Wang

Abstract: Skin cancer is one of the most common types of malignancy, affecting a large population and causing a heavy economic burden worldwide. Over the last few years, computer-aided diagnosis has been rapidly developed and make great progress in healthcare and medical practices due to the advances in artificial intelligence. However, most studies in skin cancer detection keep pursuing high prediction acc… ▽ More Skin cancer is one of the most common types of malignancy, affecting a large population and causing a heavy economic burden worldwide. Over the last few years, computer-aided diagnosis has been rapidly developed and make great progress in healthcare and medical practices due to the advances in artificial intelligence. However, most studies in skin cancer detection keep pursuing high prediction accuracies without considering the limitation of computing resources on portable devices. In this case, knowledge distillation (KD) has been proven as an efficient tool to help improve the adaptability of lightweight models under limited resources, meanwhile keeping a high-level representation capability. To bridge the gap, this study specifically proposes a novel method, termed SSD-KD, that unifies diverse knowledge into a generic KD framework for skin diseases classification. Our method models an intra-instance relational feature representation and integrates it with existing KD research. A dual relational knowledge distillation architecture is self-supervisedly trained while the weighted softened outputs are also exploited to enable the student model to capture richer knowledge from the teacher model. To demonstrate the effectiveness of our method, we conduct experiments on ISIC 2019, a large-scale open-accessed benchmark of skin diseases dermoscopic images. Experiments show that our distilled lightweight model can achieve an accuracy as high as 85% for the classification tasks of 8 different skin diseases with minimal parameters and computing requirements. Ablation studies confirm the effectiveness of our intra- and inter-instance relational knowledge integration strategy. Compared with state-of-the-art knowledge distillation techniques, the proposed method demonstrates improved performances for multi-diseases classification on the large-scale dermoscopy database. △ Less

Submitted 29 March, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

Comments: 14 pages, 5 figures

arXiv:2203.08176 [pdf, other]

SemiPFL: Personalized Semi-Supervised Federated Learning Framework for Edge Intelligence

Authors: Arvin Tashakori, Wenwen Zhang, Z. Jane Wang, Peyman Servati

Abstract: Recent advances in wearable devices and Internet-of-Things (IoT) have led to massive growth in sensor data generated in edge devices. Labeling such massive data for classification tasks has proven to be challenging. In addition, data generated by different users bear various personal attributes and edge heterogeneity, rendering it impractical to develop a global model that adapts well to all users… ▽ More Recent advances in wearable devices and Internet-of-Things (IoT) have led to massive growth in sensor data generated in edge devices. Labeling such massive data for classification tasks has proven to be challenging. In addition, data generated by different users bear various personal attributes and edge heterogeneity, rendering it impractical to develop a global model that adapts well to all users. Concerns over data privacy and communication costs also prohibit centralized data accumulation and training. We propose SemiPFL that supports edge users having no label or limited labeled datasets and a sizable amount of unlabeled data that is insufficient to train a well-performing model. In this work, edge users collaborate to train a Hyper-network in the server, generating personalized autoencoders for each user. After receiving updates from edge users, the server produces a set of base models for each user, which the users locally aggregate them using their own labeled dataset. We comprehensively evaluate our proposed framework on various public datasets from a wide range of application scenarios, from wearable health to IoT, and demonstrate that SemiPFL outperforms state-of-art federated learning frameworks under the same assumptions regarding user performance, network footprint, and computational consumption. We also show that the solution performs well for users without label or having limited labeled datasets and increasing performance for increased labeled data and number of users, signifying the effectiveness of SemiPFL for handling data heterogeneity and limited annotation. We also demonstrate the stability of SemiPFL for handling user hardware resource heterogeneity in three real-time scenarios. △ Less

Submitted 19 November, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

arXiv:2202.11086 [pdf, other]

doi 10.1145/3491101.3519653

StickyLand: Breaking the Linear Presentation of Computational Notebooks

Authors: Zijie J. Wang, Katie Dai, W. Keith Edwards

Abstract: How can we better organize code in computational notebooks? Notebooks have become a popular tool among data scientists, as they seamlessly weave text and code together, supporting users to rapidly iterate and document code experiments. However, it is often challenging to organize code in notebooks, partially because there is a mismatch between the linear presentation of code and the non-linear pro… ▽ More How can we better organize code in computational notebooks? Notebooks have become a popular tool among data scientists, as they seamlessly weave text and code together, supporting users to rapidly iterate and document code experiments. However, it is often challenging to organize code in notebooks, partially because there is a mismatch between the linear presentation of code and the non-linear process of exploratory data analysis. We present StickyLand, a notebook extension for empowering users to freely organize their code in non-linear ways. With sticky cells that are always shown on the screen, users can quickly access their notes, instantly observe experiment results, and easily build interactive dashboards that support complex visual analytics. Case studies highlight how our tool can enhance notebook users's productivity and identify opportunities for future notebook designs. StickyLand is available at https://github.com/xiaohk/stickyland. △ Less

Submitted 22 February, 2022; originally announced February 2022.

Comments: Accepted at CHI 2022 (Late-Breaking Work). 7 pages, 6 figures. For a demo video, see https://youtu.be/OKaPmEBzEX0. For a live demo, visit https://zijie.wang/#stickyland-demo

arXiv:2201.09685 [pdf, other]

Robust Joint Design for Intelligent Reflecting Surfaces Assisted Cell-Free Networks

Authors: Xie Xie, Chen He, Xiaoya Li, Zhu Han, Z. Jane Wang

Abstract: Intelligent reflecting surfaces (IRSs) have emerged as a promising economical solution to implement cell-free networks. However, the performance gains achieved by IRSs critically depend on smartly tuned passive beamforming based on the assumption that the accurate channel state information (CSI) knowledge is available, which is practically impossible. Thus, in this paper, we investigate the impact… ▽ More Intelligent reflecting surfaces (IRSs) have emerged as a promising economical solution to implement cell-free networks. However, the performance gains achieved by IRSs critically depend on smartly tuned passive beamforming based on the assumption that the accurate channel state information (CSI) knowledge is available, which is practically impossible. Thus, in this paper, we investigate the impact of the CSI uncertainty on IRS-assisted cell-free networks. We adopt a stochastic programming method to cope with the CSI uncertainty by maximizing the expectation of the sum-rate, which guarantees robust performance over the average. Accordingly, an average sum-rate maximization problem is formulated, which is non-convex and arduous to obtain its optimal solution due to the coupled variables and the expectation operation with respect to CSI uncertainties. As a compromising approach, we develop an efficient robust joint design algorithm with low-complexity. Particularly, the original problem is equivalently transformed into a tractable form, and then, the locally optimal solution can be obtained by employing the block coordinate descent method. We further prove that the CSI uncertainty impacts the design of the active transmitting beamforming of APs, but surprisingly does not directly impact the design of the passive reflecting beamforming of IRSs. It is worth noting that the investigated scenario is flexible and general, and thus the proposed algorithm can act as a general framework to solve various sum-rate maximization problems. Simulation results demonstrate that IRSs can achieve considerable data rate improvement for conventional cell-free networks, and confirm the resilience of the proposed algorithm against the CSI uncertainty. △ Less

Submitted 20 February, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: 30 pages

arXiv:2112.11593 [pdf, other]

AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation

Authors: Mohsen Gholami, Bastian Wandt, Helge Rhodin, Rabab Ward, Z. Jane Wang

Abstract: This paper addresses the problem of cross-dataset generalization of 3D human pose estimation models. Testing a pre-trained 3D pose estimator on a new dataset results in a major performance drop. Previous methods have mainly addressed this problem by improving the diversity of the training data. We argue that diversity alone is not sufficient and that the characteristics of the training data need t… ▽ More This paper addresses the problem of cross-dataset generalization of 3D human pose estimation models. Testing a pre-trained 3D pose estimator on a new dataset results in a major performance drop. Previous methods have mainly addressed this problem by improving the diversity of the training data. We argue that diversity alone is not sufficient and that the characteristics of the training data need to be adapted to those of the new dataset such as camera viewpoint, position, human actions, and body size. To this end, we propose AdaptPose, an end-to-end framework that generates synthetic 3D human motions from a source dataset and uses them to fine-tune a 3D pose estimator. AdaptPose follows an adversarial training scheme. From a source 3D pose the generator generates a sequence of 3D poses and a camera orientation that is used to project the generated poses to a novel view. Without any 3D labels or camera information AdaptPose successfully learns to create synthetic 3D poses from the target dataset while only being trained on 2D poses. In experiments on the Human3.6M, MPI-INF-3DHP, 3DPW, and Ski-Pose datasets our method outperforms previous work in cross-dataset evaluations by 14% and previous semi-supervised learning methods that use partial 3D annotations by 16%. △ Less

Submitted 15 March, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

arXiv:2112.06654 [pdf, other]

doi 10.1109/MSP.2021.3134629

Toward Open-World Electroencephalogram Decoding Via Deep Learning: A Comprehensive Survey

Authors: Xun Chen, Chang Li, Aiping Liu, Martin J. McKeown, Ruobing Qian, Z. Jane Wang

Abstract: Electroencephalogram (EEG) decoding aims to identify the perceptual, semantic, and cognitive content of neural processing based on non-invasively measured brain activity. Traditional EEG decoding methods have achieved moderate success when applied to data acquired in static, well-controlled lab environments. However, an open-world environment is a more realistic setting, where situations affecting… ▽ More Electroencephalogram (EEG) decoding aims to identify the perceptual, semantic, and cognitive content of neural processing based on non-invasively measured brain activity. Traditional EEG decoding methods have achieved moderate success when applied to data acquired in static, well-controlled lab environments. However, an open-world environment is a more realistic setting, where situations affecting EEG recordings can emerge unexpectedly, significantly weakening the robustness of existing methods. In recent years, deep learning (DL) has emerged as a potential solution for such problems due to its superior capacity in feature extraction. It overcomes the limitations of defining `handcrafted' features or features extracted using shallow architectures, but typically requires large amounts of costly, expertly-labelled data - something not always obtainable. Combining DL with domain-specific knowledge may allow for development of robust approaches to decode brain activity even with small-sample data. Although various DL methods have been proposed to tackle some of the challenges in EEG decoding, a systematic tutorial overview, particularly for open-world applications, is currently lacking. This article therefore provides a comprehensive survey of DL methods for open-world EEG decoding, and identifies promising research directions to inspire future studies for EEG decoding in real-world applications. △ Less

Submitted 16 December, 2021; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: Accepted by the IEEE Signal Processing Magazine

arXiv:2112.03245 [pdf, other]

GAM Changer: Editing Generalized Additive Models with Interactive Visualization

Authors: Zijie J. Wang, Alex Kale, Harsha Nori, Peter Stella, Mark Nunnally, Duen Horng Chau, Mihaela Vorvoreanu, Jennifer Wortman Vaughan, Rich Caruana

Abstract: Recent strides in interpretable machine learning (ML) research reveal that models exploit undesirable patterns in the data to make predictions, which potentially causes harms in deployment. However, it is unclear how we can fix these models. We present our ongoing work, GAM Changer, an open-source interactive system to help data scientists and domain experts easily and responsibly edit their Gener… ▽ More Recent strides in interpretable machine learning (ML) research reveal that models exploit undesirable patterns in the data to make predictions, which potentially causes harms in deployment. However, it is unclear how we can fix these models. We present our ongoing work, GAM Changer, an open-source interactive system to help data scientists and domain experts easily and responsibly edit their Generalized Additive Models (GAMs). With novel visualization techniques, our tool puts interpretability into action -- empowering human users to analyze, validate, and align model behaviors with their knowledge and values. Built using modern web technologies, our tool runs locally in users' computational notebooks or web browsers without requiring extra compute resources, lowering the barrier to creating more responsible ML models. GAM Changer is available at https://interpret.ml/gam-changer. △ Less

Submitted 6 December, 2021; originally announced December 2021.

Comments: 7 pages, 15 figures, accepted to the Research2Clinics workshop at NeurIPS 2021. For a demo video, see https://youtu.be/2gVSoPoSeJ8. For a live demo, visit https://interpret.ml/gam-changer/

arXiv:2112.02721 [pdf, other]

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Authors: Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo , et al. (101 additional authors not shown)

Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data split… ▽ More Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter). △ Less

Submitted 11 October, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

Comments: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter

arXiv:2109.02688 [pdf, other]

doi 10.1109/TGRS.2022.3191735

Rethinking Crowdsourcing Annotation: Partial Annotation with Salient Labels for Multi-Label Image Classification

Authors: Jianzhe Lin, Tianze Yu, Z. Jane Wang

Abstract: Annotated images are required for both supervised model training and evaluation in image classification. Manually annotating images is arduous and expensive, especially for multi-labeled images. A recent trend for conducting such laboursome annotation tasks is through crowdsourcing, where images are annotated by volunteers or paid workers online (e.g., workers of Amazon Mechanical Turk) from scrat… ▽ More Annotated images are required for both supervised model training and evaluation in image classification. Manually annotating images is arduous and expensive, especially for multi-labeled images. A recent trend for conducting such laboursome annotation tasks is through crowdsourcing, where images are annotated by volunteers or paid workers online (e.g., workers of Amazon Mechanical Turk) from scratch. However, the quality of crowdsourcing image annotations cannot be guaranteed, and incompleteness and incorrectness are two major concerns for crowdsourcing annotations. To address such concerns, we have a rethinking of crowdsourcing annotations: Our simple hypothesis is that if the annotators only partially annotate multi-label images with salient labels they are confident in, there will be fewer annotation errors and annotators will spend less time on uncertain labels. As a pleasant surprise, with the same annotation budget, we show a multi-label image classifier supervised by images with salient annotations can outperform models supervised by fully annotated images. Our method contributions are 2-fold: An active learning way is proposed to acquire salient labels for multi-label images; and a novel Adaptive Temperature Associated Model (ATAM) specifically using partial annotations is proposed for multi-label image classification. We conduct experiments on practical crowdsourcing data, the Open Street Map (OSM) dataset and benchmark dataset COCO 2014. When compared with state-of-the-art classification methods trained on fully annotated images, the proposed ATAM can achieve higher accuracy. The proposed idea is promising for crowdsourcing data annotation. Our code will be publicly available. △ Less

Submitted 6 September, 2021; originally announced September 2021.

arXiv:2108.06810 [pdf, other]

doi 10.1109/TGRS.2022.3170357

SCIDA: Self-Correction Integrated Domain Adaptation from Single- to Multi-label Aerial Images

Authors: Tianze Yu, Jianzhe Lin, Lichao Mou, Yuansheng Hua, Xiaoxiang Zhu, Z. Jane Wang

Abstract: Most publicly available datasets for image classification are with single labels, while images are inherently multi-labeled in our daily life. Such an annotation gap makes many pre-trained single-label classification models fail in practical scenarios. This annotation issue is more concerned for aerial images: Aerial data collected from sensors naturally cover a relatively large land area with mul… ▽ More Most publicly available datasets for image classification are with single labels, while images are inherently multi-labeled in our daily life. Such an annotation gap makes many pre-trained single-label classification models fail in practical scenarios. This annotation issue is more concerned for aerial images: Aerial data collected from sensors naturally cover a relatively large land area with multiple labels, while annotated aerial datasets, which are publicly available (e.g., UCM, AID), are single-labeled. As manually annotating multi-label aerial images would be time/labor-consuming, we propose a novel self-correction integrated domain adaptation (SCIDA) method for automatic multi-label learning. SCIDA is weakly supervised, i.e., automatically learning the multi-label image classification model from using massive, publicly available single-label images. To achieve this goal, we propose a novel Label-Wise self-Correction (LWC) module to better explore underlying label correlations. This module also makes the unsupervised domain adaptation (UDA) from single- to multi-label data possible. For model training, the proposed model only uses single-label information yet requires no prior knowledge of multi-labeled data; and it predicts labels for multi-label aerial images. In our experiments, trained with single-labeled MAI-AID-s and MAI-UCM-s datasets, the proposed model is tested directly on our collected Multi-scene Aerial Image (MAI) dataset. △ Less

Submitted 29 November, 2021; v1 submitted 15 August, 2021; originally announced August 2021.

arXiv:2108.00180 [pdf, other]

Delving into Deep Image Prior for Adversarial Defense: A Novel Reconstruction-based Defense Framework

Authors: Li Ding, Yongwei Wang, Xin Ding, Kaiwen Yuan, Ping Wang, Hua Huang, Z. Jane Wang

Abstract: Deep learning based image classification models are shown vulnerable to adversarial attacks by injecting deliberately crafted noises to clean images. To defend against adversarial attacks in a training-free and attack-agnostic manner, this work proposes a novel and effective reconstruction-based defense framework by delving into deep image prior (DIP). Fundamentally different from existing reconst… ▽ More Deep learning based image classification models are shown vulnerable to adversarial attacks by injecting deliberately crafted noises to clean images. To defend against adversarial attacks in a training-free and attack-agnostic manner, this work proposes a novel and effective reconstruction-based defense framework by delving into deep image prior (DIP). Fundamentally different from existing reconstruction-based defenses, the proposed method analyzes and explicitly incorporates the model decision process into our defense. Given an adversarial image, firstly we map its reconstructed images during DIP optimization to the model decision space, where cross-boundary images can be detected and on-boundary images can be further localized. Then, adversarial noise is purified by perturbing on-boundary images along the reverse direction to the adversarial image. Finally, on-manifold images are stitched to construct an image that can be correctly predicted by the victim classifier. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art reconstruction-based methods both in defending white-box attacks and defense-aware attacks. Moreover, the proposed method can maintain a high visual quality during adversarial image reconstruction. △ Less

Submitted 31 July, 2021; originally announced August 2021.

Comments: To be publish in ACM MM 2021

arXiv:2105.14545 [pdf, other]

A Joint Power Splitting, Active and Passive Beamforming Optimization Framework for IRS Assisted MIMO SWIPT System

Authors: Chen He, Xie Xie, Kun Yang, Z. Jane Wang

Abstract: This paper considers an intelligent reflecting surface (IRS) assisted multi-input multi-output (MIMO) power splitting (PS) based simultaneous wireless information and power transfer (SWIPT) system with multiple PS receivers (PSRs). The objective is to maximize the achievable data rate of the system by jointly optimizing the PS ratios at the PSRs, the active transmit beamforming (ATB) at the access… ▽ More This paper considers an intelligent reflecting surface (IRS) assisted multi-input multi-output (MIMO) power splitting (PS) based simultaneous wireless information and power transfer (SWIPT) system with multiple PS receivers (PSRs). The objective is to maximize the achievable data rate of the system by jointly optimizing the PS ratios at the PSRs, the active transmit beamforming (ATB) at the access point (AP), and the passive reflective beamforming (PRB) at the IRS, while the constraints on maximum transmission power at the AP, the reflective phase shift of each element at the IRS, the individual minimum harvested energy requirement of each PSR, and the domain of PS ratio of each PSR are all satisfied. For this unsolved problem, however, since the optimization variables are intricately coupled and the constraints are conflicting, the formulated problem is non-convex, and cannot be addressed by employing exist approaches directly. To this end, we propose a joint optimization framework to solve this problem. Particularly, we reformulate it as an equivalent form by employing the Lagrangian dual transform and the fractional programming transform, and decompose the transformed problem into several sub-problems. Then, we propose an alternate optimization algorithm by capitalizing on the dual sub-gradient method, the successive convex approximation method, and the penalty-based majorization-minimization approach, to solve the sub-problems iteratively, and obtain the optimal solutions in nearly closed-forms. Numerical simulation results verify the effectiveness of the IRS in SWIPT system and indicate that the proposed algorithm offers a substantial performance gain. △ Less

Submitted 30 May, 2021; originally announced May 2021.

Comments: 13 pages, 7 figures

arXiv:2105.06599 [pdf, other]

TriPose: A Weakly-Supervised 3D Human Pose Estimation via Triangulation from Video

Authors: Mohsen Gholami, Ahmad Rezaei, Helge Rhodin, Rabab Ward, Z. Jane Wang

Abstract: Estimating 3D human poses from video is a challenging problem. The lack of 3D human pose annotations is a major obstacle for supervised training and for generalization to unseen datasets. In this work, we address this problem by proposing a weakly-supervised training scheme that does not require 3D annotations or calibrated cameras. The proposed method relies on temporal information and triangulat… ▽ More Estimating 3D human poses from video is a challenging problem. The lack of 3D human pose annotations is a major obstacle for supervised training and for generalization to unseen datasets. In this work, we address this problem by proposing a weakly-supervised training scheme that does not require 3D annotations or calibrated cameras. The proposed method relies on temporal information and triangulation. Using 2D poses from multiple views as the input, we first estimate the relative camera orientations and then generate 3D poses via triangulation. The triangulation is only applied to the views with high 2D human joint confidence. The generated 3D poses are then used to train a recurrent lifting network (RLN) that estimates 3D poses from 2D poses. We further apply a multi-view re-projection loss to the estimated 3D poses and enforce the 3D poses estimated from multi-views to be consistent. Therefore, our method relaxes the constraints in practice, only multi-view videos are required for training, and is thus convenient for in-the-wild settings. At inference, RLN merely requires single-view videos. The proposed method outperforms previous works on two challenging datasets, Human3.6M and MPI-INF-3DHP. Codes and pretrained models will be publicly available. △ Less

Submitted 13 May, 2021; originally announced May 2021.

arXiv:2104.03164 [pdf, other]

doi 10.1016/j.eswa.2022.119060

Distilling and Transferring Knowledge via cGAN-generated Samples for Image Classification and Regression

Authors: Xin Ding, Yongwei Wang, Zuheng Xu, Z. Jane Wang, William J. Welch

Abstract: Knowledge distillation (KD) has been actively studied for image classification tasks in deep learning, aiming to improve the performance of a student based on the knowledge from a teacher. However, applying KD in image regression with a scalar response variable has been rarely studied, and there exists no KD method applicable to both classification and regression tasks yet. Moreover, existing KD m… ▽ More Knowledge distillation (KD) has been actively studied for image classification tasks in deep learning, aiming to improve the performance of a student based on the knowledge from a teacher. However, applying KD in image regression with a scalar response variable has been rarely studied, and there exists no KD method applicable to both classification and regression tasks yet. Moreover, existing KD methods often require a practitioner to carefully select or adjust the teacher and student architectures, making these methods less flexible in practice. To address the above problems in a unified way, we propose a comprehensive KD framework based on cGANs, termed cGAN-KD. Fundamentally different from existing KD methods, cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples. This novel mechanism makes cGAN-KD suitable for both classification and regression tasks, compatible with other KD methods, and insensitive to the teacher and student architectures. An error bound for a student model trained in the cGAN-KD framework is derived in this work, providing a theory for why cGAN-KD is effective as well as guiding the practical implementation of cGAN-KD. Extensive experiments on CIFAR-100 and ImageNet-100 show that we can combine state of the art KD methods with the cGAN-KD framework to yield a new state of the art. Moreover, experiments on Steering Angle and UTKFace demonstrate the effectiveness of cGAN-KD in image regression tasks, where existing KD methods are inapplicable. △ Less

Submitted 26 December, 2022; v1 submitted 7 April, 2021; originally announced April 2021.

arXiv:2103.14625 [pdf, other]

Dodrio: Exploring Transformer Models with Interactive Visualization

Authors: Zijie J. Wang, Robert Turko, Duen Horng Chau

Abstract: Why do large pre-trained transformer-based models perform so well across a wide variety of NLP tasks? Recent research suggests the key may lie in multi-headed attention mechanism's ability to learn and represent linguistic information. Understanding how these models represent both syntactic and semantic knowledge is vital to investigate why they succeed and fail, what they have learned, and how th… ▽ More Why do large pre-trained transformer-based models perform so well across a wide variety of NLP tasks? Recent research suggests the key may lie in multi-headed attention mechanism's ability to learn and represent linguistic information. Understanding how these models represent both syntactic and semantic knowledge is vital to investigate why they succeed and fail, what they have learned, and how they can improve. We present Dodrio, an open-source interactive visualization tool to help NLP researchers and practitioners analyze attention mechanisms in transformer-based models with linguistic knowledge. Dodrio tightly integrates an overview that summarizes the roles of different attention heads, and detailed views that help users compare attention weights with the syntactic structure and semantic information in the input text. To facilitate the visual comparison of attention weights and linguistic knowledge, Dodrio applies different graph visualization techniques to represent attention weights scalable to longer input text. Case studies highlight how Dodrio provides insights into understanding the attention mechanism in transformer-based models. Dodrio is available at https://poloclub.github.io/dodrio/. △ Less

Submitted 5 June, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: 10 pages, 8 figures, Accepted to ACL 2021. For a demo video, see https://youtu.be/qB-T9j7UTgE . For a live demo, see https://poloclub.github.io/dodrio/

arXiv:2103.12957 [pdf, ps, other]

Multi-view 3D Reconstruction with Transformer

Authors: Dan Wang, Xinrui Cui, Xun Chen, Zhengxia Zou, Tianyang Shi, Septimiu Salcudean, Z. Jane Wang, Rabab Ward

Abstract: Deep CNN-based methods have so far achieved the state of the art results in multi-view 3D object reconstruction. Despite the considerable progress, the two core modules of these methods - multi-view feature extraction and fusion, are usually investigated separately, and the object relations in different views are rarely explored. In this paper, inspired by the recent great success in self-attentio… ▽ More Deep CNN-based methods have so far achieved the state of the art results in multi-view 3D object reconstruction. Despite the considerable progress, the two core modules of these methods - multi-view feature extraction and fusion, are usually investigated separately, and the object relations in different views are rarely explored. In this paper, inspired by the recent great success in self-attention-based Transformer models, we reformulate the multi-view 3D reconstruction as a sequence-to-sequence prediction problem and propose a new framework named 3D Volume Transformer (VolT) for such a task. Unlike previous CNN-based methods using a separate design, we unify the feature extraction and view fusion in a single Transformer network. A natural advantage of our design lies in the exploration of view-to-view relationships using self-attention among multiple unordered inputs. On ShapeNet - a large-scale 3D reconstruction benchmark dataset, our method achieves a new state-of-the-art accuracy in multi-view reconstruction with fewer parameters ($70\%$ less) than other CNN-based methods. Experimental results also suggest the strong scaling capability of our method. Our code will be made publicly available. △ Less

Submitted 23 March, 2021; originally announced March 2021.

Showing 1–50 of 86 results for author: Wang, Z J