-
Studying the Performance of the Jellyfish Search Optimiser for the Application of Projection Pursuit
Authors:
H. Sherry Zhang,
Dianne Cook,
Nicolas Langrené,
Jessica Wai Yin Leung
Abstract:
The projection pursuit (PP) guided tour interactively optimises a criteria function known as the PP index, to explore high-dimensional data by revealing interesting projections. The optimisation in PP can be non-trivial, involving non-smooth functions and optima with a small squint angle, detectable only from close proximity. To address these challenges, this study investigates the performance of…
▽ More
The projection pursuit (PP) guided tour interactively optimises a criteria function known as the PP index, to explore high-dimensional data by revealing interesting projections. The optimisation in PP can be non-trivial, involving non-smooth functions and optima with a small squint angle, detectable only from close proximity. To address these challenges, this study investigates the performance of a recently introduced swarm-based algorithm, Jellyfish Search Optimiser (JSO), for optimising PP indexes. The performance of JSO for visualising data is evaluated across various hyper-parameter settings and compared with existing optimisers. Additionally, this work proposes novel methods to quantify two properties of the PP index, smoothness and squintability that capture the complexities inherent in PP optimisation problems. These two metrics are evaluated along with JSO hyper-parameters to determine their effects on JSO success rate. Our numerical results confirm the positive impact of these metrics on the JSO success rate, with squintability being the most significant. The JSO algorithm has been implemented in the tourr package and functions to calculate smoothness and squintability are available in the ferrn package.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction
Authors:
Yini Fang,
Jingling Yu,
Haozheng Zhang,
Ralf van der Lans,
Bertram Shi
Abstract:
Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than pixel intensities. Thi…
▽ More
Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than pixel intensities. This paper introduces the Object-level Attention Transformer (OAT), which predicts human scanpaths as they search for a target object within a cluttered scene of distractors. OAT uses an encoder-decoder architecture. The encoder captures information about the position and appearance of the objects within an image and about the target. The decoder predicts the gaze scanpath as a sequence of object fixations, by integrating output features from both the encoder and decoder. We also propose a new positional encoding that better reflects spatial relationships between objects. We evaluated OAT on the Amazon book cover dataset and a new dataset for visual search that we collected. OAT's predicted gaze scanpaths align more closely with human gaze patterns, compared to predictions by algorithms based on spatial attention on both established metrics and a novel behavioural-based metric. Our results demonstrate the generalization ability of OAT, as it accurately predicts human scanpaths for unseen layouts and target objects.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
STS MICCAI 2023 Challenge: Grand challenge on 2D and 3D semi-supervised tooth segmentation
Authors:
Yaqi Wang,
Yifan Zhang,
Xiaodiao Chen,
Shuai Wang,
Dahong Qian,
Fan Ye,
Feng Xu,
Hongyuan Zhang,
Qianni Zhang,
Chengyu Wu,
Yunxiang Li,
Weiwei Cui,
Shan Luo,
Chengkai Wang,
Tianhao Li,
Yi Liu,
Xiang Feng,
Huiyu Zhou,
Dongyun Liu,
Qixuan Wang,
Zhouhao Lin,
Wei Song,
Yuanlin Li,
Bing Wang,
Chunshi Wang
, et al. (2 additional authors not shown)
Abstract:
Computer-aided design (CAD) tools are increasingly popular in modern dental practice, particularly for treatment planning or comprehensive prognosis evaluation. In particular, the 2D panoramic X-ray image efficiently detects invisible caries, impacted teeth and supernumerary teeth in children, while the 3D dental cone beam computed tomography (CBCT) is widely used in orthodontics and endodontics d…
▽ More
Computer-aided design (CAD) tools are increasingly popular in modern dental practice, particularly for treatment planning or comprehensive prognosis evaluation. In particular, the 2D panoramic X-ray image efficiently detects invisible caries, impacted teeth and supernumerary teeth in children, while the 3D dental cone beam computed tomography (CBCT) is widely used in orthodontics and endodontics due to its low radiation dose. However, there is no open-access 2D public dataset for children's teeth and no open 3D dental CBCT dataset, which limits the development of automatic algorithms for segmenting teeth and analyzing diseases. The Semi-supervised Teeth Segmentation (STS) Challenge, a pioneering event in tooth segmentation, was held as a part of the MICCAI 2023 ToothFairy Workshop on the Alibaba Tianchi platform. This challenge aims to investigate effective semi-supervised tooth segmentation algorithms to advance the field of dentistry. In this challenge, we provide two modalities including the 2D panoramic X-ray images and the 3D CBCT tooth volumes. In Task 1, the goal was to segment tooth regions in panoramic X-ray images of both adult and pediatric teeth. Task 2 involved segmenting tooth sections using CBCT volumes. Limited labelled images with mostly unlabelled ones were provided in this challenge prompt using semi-supervised algorithms for training. In the preliminary round, the challenge received registration and result submission by 434 teams, with 64 advancing to the final round. This paper summarizes the diverse methods employed by the top-ranking teams in the STS MICCAI 2023 Challenge.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
HPPP: Halpern-type Preconditioned Proximal Point Algorithms and Applications to Image Restoration
Authors:
Shuchang Zhang,
Hui Zhang,
Hongxia Wang
Abstract:
Preconditioned Proximal Point (PPP) algorithms provide a unified framework for splitting methods in image restoration. Recent advancements with RED (Regularization by Denoising) and PnP (Plug-and-Play) priors have achieved state-of-the-art performance in this domain, emphasizing the need for a meaningful particular solution. However, degenerate PPP algorithms typically exhibit weak convergence in…
▽ More
Preconditioned Proximal Point (PPP) algorithms provide a unified framework for splitting methods in image restoration. Recent advancements with RED (Regularization by Denoising) and PnP (Plug-and-Play) priors have achieved state-of-the-art performance in this domain, emphasizing the need for a meaningful particular solution. However, degenerate PPP algorithms typically exhibit weak convergence in infinite-dimensional Hilbert space, leading to uncertain solutions. To address this issue, we propose the Halpern-type Preconditioned Proximal Point (HPPP) algorithm, which leverages the strong convergence properties of Halpern iteration to achieve a particular solution. Based on the implicit regularization defined by gradient RED, we further introduce the Gradient REgularization by Denoising via HPPP called GraRED-HP3 algorithm. The HPPP algorithm is shown to have the regularity converging to a particular solution by a toy example. Additionally, experiments in image deblurring and inpainting validate the effectiveness of GraRED-HP3, showing it surpasses classical methods such as Chambolle-Pock (CP), PPP, RED, and RED-PRO.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
PQCache: Product Quantization-based KVCache for Long Context LLM Inference
Authors:
Hailin Zhang,
Xiaodong Ji,
Yilin Chen,
Fangcheng Fu,
Xupeng Miao,
Xiaonan Nie,
Weipeng Chen,
Bin Cui
Abstract:
As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), a crucial component in LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either…
▽ More
As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), a crucial component in LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either fall short in maintaining model quality or result in high serving latency. Drawing inspiration from advanced embedding retrieval techniques used in the database community, we consider the storage and searching of KVCache as a typical embedding retrieval problem. We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, we apply PQ to tokens' keys for each LLM layer and head. During the autoregressive decoding phase, for each newly generated token, we first identify important tokens through Maximum Inner-Product Search (MIPS) using PQ codes and centroids, then fetch the corresponding key-value pairs for self-attention computation. Through meticulous design of overlapping and caching, we minimize any additional computation and communication overhead during both phases. Extensive experiments show that PQCache achieves both effectiveness and efficiency. It maintains model quality even when only 1/5 of the tokens are involved in attention, while attaining acceptable system latency.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
The Future of Learning: Large Language Models through the Lens of Students
Authors:
He Zhang,
Jingyi Xie,
Chuhao Wu,
Jie Cai,
ChanMin Kim,
John M. Carroll
Abstract:
As Large-Scale Language Models (LLMs) continue to evolve, they demonstrate significant enhancements in performance and an expansion of functionalities, impacting various domains, including education. In this study, we conducted interviews with 14 students to explore their everyday interactions with ChatGPT. Our preliminary findings reveal that students grapple with the dilemma of utilizing ChatGPT…
▽ More
As Large-Scale Language Models (LLMs) continue to evolve, they demonstrate significant enhancements in performance and an expansion of functionalities, impacting various domains, including education. In this study, we conducted interviews with 14 students to explore their everyday interactions with ChatGPT. Our preliminary findings reveal that students grapple with the dilemma of utilizing ChatGPT's efficiency for learning and information seeking, while simultaneously experiencing a crisis of trust and ethical concerns regarding the outcomes and broader impacts of ChatGPT. The students perceive ChatGPT as being more "human-like" compared to traditional AI. This dilemma, characterized by mixed emotions, inconsistent behaviors, and an overall positive attitude towards ChatGPT, underscores its potential for beneficial applications in education and learning. However, we argue that despite its human-like qualities, the advanced capabilities of such intelligence might lead to adverse consequences. Therefore, it's imperative to approach its application cautiously and strive to mitigate potential harms in future developments.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
$\textit{GeoHard}$: Towards Measuring Class-wise Hardness through Modelling Class Semantics
Authors:
Fengyu Cai,
Xinran Zhao,
Hongming Zhang,
Iryna Gurevych,
Heinz Koeppl
Abstract:
Recent advances in measuring hardness-wise properties of data guide language models in sample selection within low-resource scenarios. However, class-specific properties are overlooked for task setup and learning. How will these properties influence model learning and is it generalizable across datasets? To answer this question, this work formally initiates the concept of…
▽ More
Recent advances in measuring hardness-wise properties of data guide language models in sample selection within low-resource scenarios. However, class-specific properties are overlooked for task setup and learning. How will these properties influence model learning and is it generalizable across datasets? To answer this question, this work formally initiates the concept of $\textit{class-wise hardness}$. Experiments across eight natural language understanding (NLU) datasets demonstrate a consistent hardness distribution across learning paradigms, models, and human judgment. Subsequent experiments unveil a notable challenge in measuring such class-wise hardness with instance-level metrics in previous works. To address this, we propose $\textit{GeoHard}$ for class-wise hardness measurement by modeling class geometry in the semantic embedding space. $\textit{GeoHard}$ surpasses instance-level metrics by over 59 percent on $\textit{Pearson}$'s correlation on measuring class-wise hardness. Our analysis theoretically and empirically underscores the generality of $\textit{GeoHard}$ as a fresh perspective on data diagnosis. Additionally, we showcase how understanding class-wise hardness can practically aid in improving task learning.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations
Authors:
Tomáš Chobola,
Yu Liu,
Hanyi Zhang,
Julia A. Schnabel,
Tingying Peng
Abstract:
Current deep learning-based low-light image enhancement methods often struggle with high-resolution images, and fail to meet the practical demands of visual perception across diverse and unseen scenarios. In this paper, we introduce a novel approach termed CoLIE, which redefines the enhancement process through mapping the 2D coordinates of an underexposed image to its illumination component, condi…
▽ More
Current deep learning-based low-light image enhancement methods often struggle with high-resolution images, and fail to meet the practical demands of visual perception across diverse and unseen scenarios. In this paper, we introduce a novel approach termed CoLIE, which redefines the enhancement process through mapping the 2D coordinates of an underexposed image to its illumination component, conditioned on local context. We propose a reconstruction of enhanced-light images within the HSV space utilizing an implicit neural function combined with an embedded guided filter, thereby significantly reducing computational overhead. Moreover, we introduce a single image-based training loss function to enhance the model's adaptability to various scenes, further enhancing its practical applicability. Through rigorous evaluations, we analyze the properties of our proposed framework, demonstrating its superiority in both image quality and scene adaptability. Furthermore, our evaluation extends to applications in downstream tasks within low-light scenarios, underscoring the practical utility of CoLIE. The source code is available at https://github.com/ctom2/colie.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Voltage-Controlled Magnetoelectric Devices for Neuromorphic Diffusion Process
Authors:
Yang Cheng,
Qingyuan Shu,
Albert Lee,
Haoran He,
Ivy Zhu,
Haris Suhail,
Minzhang Chen,
Renhe Chen,
Zirui Wang,
Hantao Zhang,
Chih-Yao Wang,
Shan-Yi Yang,
Yu-Chen Hsin,
Cheng-Yi Shih,
Hsin-Han Lee,
Ran Cheng,
Sudhakar Pamarti,
Xufeng Kou,
Kang L. Wang
Abstract:
Stochastic diffusion processes are pervasive in nature, from the seemingly erratic Brownian motion to the complex interactions of synaptically-coupled spiking neurons. Recently, drawing inspiration from Langevin dynamics, neuromorphic diffusion models were proposed and have become one of the major breakthroughs in the field of generative artificial intelligence. Unlike discriminative models that h…
▽ More
Stochastic diffusion processes are pervasive in nature, from the seemingly erratic Brownian motion to the complex interactions of synaptically-coupled spiking neurons. Recently, drawing inspiration from Langevin dynamics, neuromorphic diffusion models were proposed and have become one of the major breakthroughs in the field of generative artificial intelligence. Unlike discriminative models that have been well developed to tackle classification or regression tasks, diffusion models as well as other generative models such as ChatGPT aim at creating content based upon contexts learned. However, the more complex algorithms of these models result in high computational costs using today's technologies, creating a bottleneck in their efficiency, and impeding further development. Here, we develop a spintronic voltage-controlled magnetoelectric memory hardware for the neuromorphic diffusion process. The in-memory computing capability of our spintronic devices goes beyond current Von Neumann architecture, where memory and computing units are separated. Together with the non-volatility of magnetic memory, we can achieve high-speed and low-cost computing, which is desirable for the increasing scale of generative models in the current era. We experimentally demonstrate that the hardware-based true random diffusion process can be implemented for image generation and achieve comparable image quality to software-based training as measured by the Frechet inception distance (FID) score, achieving ~10^3 better energy-per-bit-per-area over traditional hardware.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs
Authors:
Pinxue Zhao,
Hailin Zhang,
Fangcheng Fu,
Xiaonan Nie,
Qibin Liu,
Fang Yang,
Yuanbo Peng,
Dian Jiao,
Shuaipeng Li,
Jinbao Xue,
Yangyu Tao,
Bin Cui
Abstract:
Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing f…
▽ More
Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelisms. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence lengths when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing the reuse of memory across transformer layers. Empirical results demonstrate that MEMO achieves an average of 2.42x and 2.26x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays associated with the memory reorganization process due to fragmentation. By leveraging fine-grained activation memory management, MEMO facilitates efficient training of 7B LLM with 1 million sequence length on just 8 A800 GPUs, achieving an MFU of 52.30%.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
IPA-NeRF: Illusory Poisoning Attack Against Neural Radiance Fields
Authors:
Wenxiang Jiang,
Hanwei Zhang,
Shuo Zhao,
Zhongwen Guo,
Hao Wang
Abstract:
Neural Radiance Field (NeRF) represents a significant advancement in computer vision, offering implicit neural network-based scene representation and novel view synthesis capabilities. Its applications span diverse fields including robotics, urban mapping, autonomous navigation, virtual reality/augmented reality, etc., some of which are considered high-risk AI applications. However, despite its wi…
▽ More
Neural Radiance Field (NeRF) represents a significant advancement in computer vision, offering implicit neural network-based scene representation and novel view synthesis capabilities. Its applications span diverse fields including robotics, urban mapping, autonomous navigation, virtual reality/augmented reality, etc., some of which are considered high-risk AI applications. However, despite its widespread adoption, the robustness and security of NeRF remain largely unexplored. In this study, we contribute to this area by introducing the Illusory Poisoning Attack against Neural Radiance Fields (IPA-NeRF). This attack involves embedding a hidden backdoor view into NeRF, allowing it to produce predetermined outputs, i.e. illusory, when presented with the specified backdoor view while maintaining normal performance with standard inputs. Our attack is specifically designed to deceive users or downstream models at a particular position while ensuring that any abnormalities in NeRF remain undetectable from other viewpoints. Experimental results demonstrate the effectiveness of our Illusory Poisoning Attack, successfully presenting the desired illusory on the specified viewpoint without impacting other views. Notably, we achieve this attack by introducing small perturbations solely to the training set. The code can be found at https://github.com/jiang-wenxiang/IPA-NeRF.
△ Less
Submitted 18 July, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Authors:
Zehan Wang,
Ziang Zhang,
Hang Zhang,
Luping Liu,
Rongjie Huang,
Xize Cheng,
Hengshuang Zhao,
Zhou Zhao
Abstract:
Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joi…
▽ More
Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters, which support 3D, audio, image, and language inputs. Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together. This approach enables "scaling up" by indirectly increasing the model parameters and the amount of seen data. To effectively integrate various spaces, we dynamically assign weights to different spaces by learning routers with two objectives: cross-modal overall alignment and language representation decoupling. Notably, since binding and routing spaces both only require lightweight networks, OmniBind is extremely training-efficient. Learning the largest 30B model requires merely unpaired unimodal data and approximately 3 days on a single 8-4090 node. Extensive experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection
Authors:
Jingwen Yu,
Hanjing Ye,
Jianhao Jiao,
Ping Tan,
Hong Zhang
Abstract:
Visual loop closure detection is an important module in visual simultaneous localization and mapping (SLAM), which associates current camera observation with previously visited places. Loop closures correct drifts in trajectory estimation to build a globally consistent map. However, a false loop closure can be fatal, so verification is required as an additional step to ensure robustness by rejecti…
▽ More
Visual loop closure detection is an important module in visual simultaneous localization and mapping (SLAM), which associates current camera observation with previously visited places. Loop closures correct drifts in trajectory estimation to build a globally consistent map. However, a false loop closure can be fatal, so verification is required as an additional step to ensure robustness by rejecting the false positive loops. Geometric verification has been a well-acknowledged solution that leverages spatial clues provided by local feature matching to find true positives. Existing feature matching methods focus on homography and pose estimation in long-term visual localization, lacking references for geometric verification. To fill the gap, this paper proposes a unified benchmark targeting geometric verification of loop closure detection under long-term conditional variations. Furthermore, we evaluate six representative local feature matching methods (handcrafted and learning-based) under the benchmark, with in-depth analysis for limitations and future directions.
△ Less
Submitted 16 July, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen
Authors:
Alessandro Palma,
Till Richter,
Hanyi Zhang,
Manuel Lubetzki,
Alexander Tong,
Andrea Dittadi,
Fabian Theis
Abstract:
Generative modeling of single-cell RNA-seq data has shown invaluable potential in community-driven tasks such as trajectory inference, batch effect removal and gene expression generation. However, most recent deep models generating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, ignoring the inherently discrete and over-dispersed nature of sing…
▽ More
Generative modeling of single-cell RNA-seq data has shown invaluable potential in community-driven tasks such as trajectory inference, batch effect removal and gene expression generation. However, most recent deep models generating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, ignoring the inherently discrete and over-dispersed nature of single-cell data, which limits downstream applications and hinders the incorporation of robust noise models. Moreover, crucial aspects of deep-learning-based synthetic single-cell generation remain underexplored, such as controllable multi-modal and multi-label generation and its role in the performance enhancement of downstream tasks. This work presents Cell Flow for Generation (CFGen), a flow-based conditional generative model for multi-modal single-cell counts, which explicitly accounts for the discrete nature of the data. Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks such as conditioning on multiple attributes and boosting rare cell type classification via data augmentation. By showcasing CFGen on a diverse set of biological datasets and settings, we provide evidence of its value to the fields of computational biology and deep generative models.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation
Authors:
Xiaoshuai Hao,
Ruikai Li,
Hui Zhang,
Dingzhe Li,
Rong Yin,
Sangil Jung,
Seung-In Park,
ByungIn Yoo,
Haimei Zhao,
Jing Zhang
Abstract:
Online high-definition (HD) map construction is an important and challenging task in autonomous driving. Recently, there has been a growing interest in cost-effective multi-view camera-based methods without relying on other sensors like LiDAR. However, these methods suffer from a lack of explicit depth information, necessitating the use of large models to achieve satisfactory performance. To addre…
▽ More
Online high-definition (HD) map construction is an important and challenging task in autonomous driving. Recently, there has been a growing interest in cost-effective multi-view camera-based methods without relying on other sensors like LiDAR. However, these methods suffer from a lack of explicit depth information, necessitating the use of large models to achieve satisfactory performance. To address this, we employ the Knowledge Distillation (KD) idea for efficient HD map construction for the first time and introduce a novel KD-based approach called MapDistill to transfer knowledge from a high-performance camera-LiDAR fusion model to a lightweight camera-only model. Specifically, we adopt the teacher-student architecture, i.e., a camera-LiDAR fusion model as the teacher and a lightweight camera model as the student, and devise a dual BEV transform module to facilitate cross-modal knowledge distillation while maintaining cost-effective camera-only deployment. Additionally, we present a comprehensive distillation scheme encompassing cross-modal relation distillation, dual-level feature distillation, and map head distillation. This approach alleviates knowledge transfer challenges between modalities, enabling the student model to learn improved feature representations for HD map construction. Experimental results on the challenging nuScenes dataset demonstrate the effectiveness of MapDistill, surpassing existing competitors by over 7.7 mAP or 4.5X speedup.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Strategic Littlestone Dimension: Improved Bounds on Online Strategic Classification
Authors:
Saba Ahmadi,
Kunhe Yang,
Hanrui Zhang
Abstract:
We study the problem of online binary classification in settings where strategic agents can modify their observable features to receive a positive classification. We model the set of feasible manipulations by a directed graph over the feature space, and assume the learner only observes the manipulated features instead of the original ones. We introduce the Strategic Littlestone Dimension, a new co…
▽ More
We study the problem of online binary classification in settings where strategic agents can modify their observable features to receive a positive classification. We model the set of feasible manipulations by a directed graph over the feature space, and assume the learner only observes the manipulated features instead of the original ones. We introduce the Strategic Littlestone Dimension, a new combinatorial measure that captures the joint complexity of the hypothesis class and the manipulation graph. We demonstrate that it characterizes the instance-optimal mistake bounds for deterministic learning algorithms in the realizable setting. We also achieve improved regret in the agnostic setting by a refined agnostic-to-realizable reduction that accounts for the additional challenge of not observing agents' original features. Finally, we relax the assumption that the learner knows the manipulation graph, instead assuming their knowledge is captured by a family of graphs. We derive regret bounds in both the realizable setting where all agents manipulate according to the same graph within the graph family, and the agnostic setting where the manipulation graphs are chosen adversarially and not consistently modeled by a single graph in the family.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Cross-Phase Mutual Learning Framework for Pulmonary Embolism Identification on Non-Contrast CT Scans
Authors:
Bizhe Bai,
Yan-Jie Zhou,
Yujian Hu,
Tony C. W. Mok,
Yilang Xiang,
Le Lu,
Hongkun Zhang,
Minfeng Xu
Abstract:
Pulmonary embolism (PE) is a life-threatening condition where rapid and accurate diagnosis is imperative yet difficult due to predominantly atypical symptomatology. Computed tomography pulmonary angiography (CTPA) is acknowledged as the gold standard imaging tool in clinics, yet it can be contraindicated for emergency department (ED) patients and represents an onerous procedure, thus necessitating…
▽ More
Pulmonary embolism (PE) is a life-threatening condition where rapid and accurate diagnosis is imperative yet difficult due to predominantly atypical symptomatology. Computed tomography pulmonary angiography (CTPA) is acknowledged as the gold standard imaging tool in clinics, yet it can be contraindicated for emergency department (ED) patients and represents an onerous procedure, thus necessitating PE identification through non-contrast CT (NCT) scans. In this work, we explore the feasibility of applying a deep-learning approach to NCT scans for PE identification. We propose a novel Cross-Phase Mutual learNing framework (CPMN) that fosters knowledge transfer from CTPA to NCT, while concurrently conducting embolism segmentation and abnormality classification in a multi-task manner. The proposed CPMN leverages the Inter-Feature Alignment (IFA) strategy that enhances spatial contiguity and mutual learning between the dual-pathway network, while the Intra-Feature Discrepancy (IFD) strategy can facilitate precise segmentation of PE against complex backgrounds for single-pathway networks. For a comprehensive assessment of the proposed approach, a large-scale dual-phase dataset containing 334 PE patients and 1,105 normal subjects has been established. Experimental results demonstrate that CPMN achieves the leading identification performance, which is 95.4\% and 99.6\% in patient-level sensitivity and specificity on NCT scans, indicating the potential of our approach as an economical, accessible, and precise tool for PE identification in clinical practice.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Incremental high average-utility itemset mining: survey and challenges
Authors:
Jing Chen,
Shengyi Yang,
Weiping Ding,
Peng Li,
Aijun Liu,
Hongjun Zhang,
Tian Li
Abstract:
The High Average Utility Itemset Mining (HAUIM) technique, a variation of High Utility Itemset Mining (HUIM), uses the average utility of the itemsets. Historically, most HAUIM algorithms were designed for static databases. However, practical applications like market basket analysis and business decision-making necessitate regular updates of the database with new transactions. As a result, researc…
▽ More
The High Average Utility Itemset Mining (HAUIM) technique, a variation of High Utility Itemset Mining (HUIM), uses the average utility of the itemsets. Historically, most HAUIM algorithms were designed for static databases. However, practical applications like market basket analysis and business decision-making necessitate regular updates of the database with new transactions. As a result, researchers have developed incremental HAUIM (iHAUIM) algorithms to identify HAUIs in a dynamically updated database. Contrary to conventional methods that begin from scratch, the iHAUIM algorithm facilitates incremental changes and outputs, thereby reducing the cost of discovery. This paper provides a comprehensive review of the state-of-the-art iHAUIM algorithms, analyzing their unique characteristics and advantages. First, we explain the concept of iHAUIM, providing formulas and real-world examples for a more in-depth understanding. Subsequently, we categorize and discuss the key technologies used by varying types of iHAUIM algorithms, encompassing Apriori-based, Tree-based, and Utility-list-based techniques. Moreover, we conduct a critical analysis of each mining method's advantages and disadvantages. In conclusion, we explore potential future directions, research opportunities, and various extensions of the iHAUIM algorithm.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
Authors:
Jinrui Zhang,
Teng Wang,
Haigang Zhang,
Ping Lu,
Feng Zheng
Abstract:
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. However, they remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. While various mitigation strategies have been proposed, they often neglect a key contributor to hallucinations: lack of fine-grained reasoning supervision during training.…
▽ More
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. However, they remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. While various mitigation strategies have been proposed, they often neglect a key contributor to hallucinations: lack of fine-grained reasoning supervision during training. Without intermediate reasoning steps, models may establish superficial shortcuts between instructions and responses, failing to internalize the inherent reasoning logic. To address this challenge, we propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning. Unlike previous methods that learning from responses only, our approach entails the model predicting rationales justifying why responses are correct or incorrect. This fosters a deeper engagement with the fine-grained reasoning underlying each response, thus enhancing the model's reasoning proficiency. To facilitate this approach, we propose REVERIE, the first large-scale instruction-tuning dataset with ReflEctiVE RatIonalE annotations. REVERIE comprises 115k machine-generated reasoning instructions, each meticulously annotated with a corresponding pair of correct and confusing responses, alongside comprehensive rationales elucidating the justification behind the correctness or erroneousness of each response. Experimental results on multiple LVLM benchmarks reveal that reflective instruction tuning with the REVERIE dataset yields noticeable performance gain over the baseline model, demonstrating the effectiveness of reflecting from the rationales. Project page is at https://zjr2000.github.io/projects/reverie.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts
Authors:
Jianhao Li,
Tianyu Sun,
Zhongdao Wang,
Enze Xie,
Bailan Feng,
Hongbo Zhang,
Ze Yuan,
Ke Xu,
Jiaheng Liu,
Ping Luo
Abstract:
This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, especially focusing on applications in autonomous driving. Unlike previous arts, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. Firstly, we segment high-quali…
▽ More
This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, especially focusing on applications in autonomous driving. Unlike previous arts, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. Firstly, we segment high-quality instance masks from the prompts using the Segment Anything Model (SAM) and transform the remaining problem into predicting 3D shapes from given 2D masks. Due to the ill-posed nature of this problem, it presents a significant challenge as multiple 3D shapes can project into an identical mask. To tackle this issue, we then lift 2D masks to 3D forms and employ gradient descent to adjust their poses and shapes until the projections fit the masks and the surfaces conform to surrounding LiDAR points. Notably, since we do not train on a specific dataset, the SLF auto-labeler does not overfit to biased annotation patterns in the training set as other methods do. Thus, the generalization ability across different datasets improves. Experimental results on the KITTI dataset demonstrate that the SLF auto-labeler produces high-quality bounding box annotations, achieving an AP@0.5 IoU of nearly 90\%. Detectors trained with the generated pseudo-labels perform nearly as well as those trained with actual ground-truth annotations. Furthermore, the SLF auto-labeler shows promising results in detailed shape predictions, providing a potential alternative for the occupancy annotation of dynamic objects.
△ Less
Submitted 17 July, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Restore-RWKV: Efficient and Effective Medical Image Restoration with RWKV
Authors:
Zhiwen Yang,
Hui Zhang,
Dan Zhao,
Bingzheng Wei,
Yan Xu
Abstract:
Transformers have revolutionized medical image restoration, but the quadratic complexity still poses limitations for their application to high-resolution medical images. The recent advent of RWKV in the NLP field has attracted much attention as it can process long sequences efficiently. To leverage its advanced design, we propose Restore-RWKV, the first RWKV-based model for medical image restorati…
▽ More
Transformers have revolutionized medical image restoration, but the quadratic complexity still poses limitations for their application to high-resolution medical images. The recent advent of RWKV in the NLP field has attracted much attention as it can process long sequences efficiently. To leverage its advanced design, we propose Restore-RWKV, the first RWKV-based model for medical image restoration. Since the original RWKV model is designed for 1D sequences, we make two necessary modifications for modeling spatial relations in 2D images. First, we present a recurrent WKV (Re-WKV) attention mechanism that captures global dependencies with linear computational complexity. Re-WKV incorporates bidirectional attention as basic for a global receptive field and recurrent attention to effectively model 2D dependencies from various scan directions. Second, we develop an omnidirectional token shift (Omni-Shift) layer that enhances local dependencies by shifting tokens from all directions and across a wide context range. These adaptations make the proposed Restore-RWKV an efficient and effective model for medical image restoration. Extensive experiments demonstrate that Restore-RWKV achieves superior performance across various medical image restoration tasks, including MRI image super-resolution, CT image denoising, PET image synthesis, and all-in-one medical image restoration. Code is available at: \href{https://github.com/Yaziwel/Restore-RWKV.git}{https://github.com/Yaziwel/Restore-RWKV}.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
Residual resampling-based physics-informed neural network for neutron diffusion equations
Authors:
Heng Zhang,
Yun-Ling He,
Dong Liu,
Qin Hang,
He-Min Yao,
Di Xiang
Abstract:
The neutron diffusion equation plays a pivotal role in the analysis of nuclear reactors. Nevertheless, employing the Physics-Informed Neural Network (PINN) method for its solution entails certain limitations. Traditional PINN approaches often utilize fully connected network (FCN) architecture, which is susceptible to overfitting, training instability, and gradient vanishing issues as the network d…
▽ More
The neutron diffusion equation plays a pivotal role in the analysis of nuclear reactors. Nevertheless, employing the Physics-Informed Neural Network (PINN) method for its solution entails certain limitations. Traditional PINN approaches often utilize fully connected network (FCN) architecture, which is susceptible to overfitting, training instability, and gradient vanishing issues as the network depth increases. These challenges result in accuracy bottlenecks in the solution. In response to these issues, the Residual-based Resample Physics-Informed Neural Network(R2-PINN) is proposed, which proposes an improved PINN architecture that replaces the FCN with a Convolutional Neural Network with a shortcut(S-CNN), incorporating skip connections to facilitate gradient propagation between network layers. Additionally, the incorporation of the Residual Adaptive Resampling (RAR) mechanism dynamically increases sampling points, enhancing the spatial representation capabilities and overall predictive accuracy of the model. The experimental results illustrate that our approach significantly improves the model's convergence capability, achieving high-precision predictions of physical fields. In comparison to traditional FCN-based PINN methods, R2-PINN effectively overcomes the limitations inherent in current methods, providing more accurate and robust solutions for neutron diffusion equations.
△ Less
Submitted 23 June, 2024;
originally announced July 2024.
-
ARA-O-RAN: End-to-End Programmable O-RAN Living Lab for Agriculture and Rural Communities
Authors:
Tianyi Zhang,
Joshua Ofori Boateng,
Taimoor UI Islam,
Arsalan Ahmad,
Hongwei Zhang,
Daji Qiao
Abstract:
As wireless networks evolve towards open architectures like O-RAN, testing, and integration platforms are crucial to address challenges like interoperability. This paper describes ARA-O-RAN, a novel O-RAN testbed established through the NSF Platforms for Advanced Wireless Research (PAWR) ARA platform. ARA provides an at-scale rural wireless living lab focused on technologies for digital agricultur…
▽ More
As wireless networks evolve towards open architectures like O-RAN, testing, and integration platforms are crucial to address challenges like interoperability. This paper describes ARA-O-RAN, a novel O-RAN testbed established through the NSF Platforms for Advanced Wireless Research (PAWR) ARA platform. ARA provides an at-scale rural wireless living lab focused on technologies for digital agriculture and rural communities. As an O-RAN Alliance certified Open Testing and Integration Centre (OTIC), ARA launched ARA-O-RAN -- the first public O-RAN testbed tailored to rural and agriculture use cases, together with the end-to-end, whole-stack programmability. ARA-O-RAN uniquely combines support for outdoor testing across a university campus, surrounding farmlands, and rural communities with a 50-node indoor sandbox. The testbed facilitates vital R\&D to implement open architectures that can meet rural connectivity needs. The paper outlines ARA-O-RAN's hardware system design, software architecture, and enabled research experiments. It also discusses plans aligned with national spectrum policy and rural spectrum innovation. ARA-O-RAN exemplifies the value of purpose-built wireless testbeds in accelerating impactful wireless research.
△ Less
Submitted 14 June, 2024;
originally announced July 2024.
-
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Authors:
Yaoting Wang,
Peiwen Sun,
Dongzhan Zhou,
Guangyao Li,
Honggang Zhang,
Di Hu
Abstract:
Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions…
▽ More
Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. Dataset is available at \href{https://gewu-lab.github.io/Ref-AVS}{https://gewu-lab.github.io/Ref-AVS}.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
Authors:
Ruisheng Cao,
Fangyu Lei,
Haoyuan Wu,
Jixuan Chen,
Yeqiao Fu,
Hongcheng Gao,
Xinzhuang Xiong,
Hanchong Zhang,
Yuchen Mao,
Wenjing Hu,
Tianbao Xie,
Hongshen Xu,
Danyang Zhang,
Sida Wang,
Ruoxi Sun,
Pengcheng Yin,
Caiming Xiong,
Ansong Ni,
Qian Liu,
Victor Zhong,
Lu Chen,
Kai Yu,
Tao Yu
Abstract:
Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivit…
▽ More
Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documents of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform in tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) and involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflow. Our code and data are available at https://spider2-v.github.io.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
Authors:
Yaoting Wang,
Peiwen Sun,
Yuanchao Li,
Honggang Zhang,
Di Hu
Abstract:
The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, it is recognized that previous AVS methods show a heavy reliance on detrimental segmentation preferences related to audible objects, rather than precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especi…
▽ More
The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, it is recognized that previous AVS methods show a heavy reliance on detrimental segmentation preferences related to audible objects, rather than precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the the fact that text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, leading to representative sounding object features. These features not only encompass audio cues but also possess vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: \href{https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference}{https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference}
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
Authors:
Anni Zou,
Wenhao Yu,
Hongming Zhang,
Kaixin Ma,
Deng Cai,
Zhuosheng Zhang,
Hai Zhao,
Dong Yu
Abstract:
Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, m…
▽ More
Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, multi-modal information understanding and long-context reading. However, no current benchmark exists to evaluate their performance in such scenarios, where a raw file and questions are provided as input, and a corresponding response is expected as output. In this paper, we introduce DocBench, a new benchmark designed to evaluate LLM-based document reading systems. Our benchmark involves a meticulously crafted process, including the recruitment of human annotators and the generation of synthetic questions. It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions. We evaluate both proprietary LLM-based systems accessible via web interfaces or APIs, and a parse-then-read pipeline employing open-source LLMs. Our evaluations reveal noticeable gaps between existing LLM-based document reading systems and human performance, underscoring the challenges of developing proficient systems. To summarize, DocBench aims to establish a standardized benchmark for evaluating LLM-based document reading systems under diverse real-world scenarios, thereby guiding future advancements in this research area.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
$\texttt{MixGR}$: Enhancing Retriever Generalization for Scientific Domain through Complementary Granularity
Authors:
Fengyu Cai,
Xinran Zhao,
Tong Chen,
Sihao Chen,
Hongming Zhang,
Iryna Gurevych,
Heinz Koeppl
Abstract:
Recent studies show the growing significance of document retrieval in the generation of LLMs, i.e., RAG, within the scientific domain by bridging their knowledge gap. However, dense retrievers often struggle with domain-specific retrieval and complex query-document relationships, particularly when query segments correspond to various parts of a document. To alleviate such prevalent challenges, thi…
▽ More
Recent studies show the growing significance of document retrieval in the generation of LLMs, i.e., RAG, within the scientific domain by bridging their knowledge gap. However, dense retrievers often struggle with domain-specific retrieval and complex query-document relationships, particularly when query segments correspond to various parts of a document. To alleviate such prevalent challenges, this paper introduces $\texttt{MixGR}$, which improves dense retrievers' awareness of query-document matching across various levels of granularity in queries and documents using a zero-shot approach. $\texttt{MixGR}$ fuses various metrics based on these granularities to a united score that reflects a comprehensive query-document similarity. Our experiments demonstrate that $\texttt{MixGR}$ outperforms previous document retrieval by 24.7% and 9.8% on nDCG@5 with unsupervised and supervised retrievers, respectively, averaged on queries containing multiple subqueries from five scientific retrieval datasets. Moreover, the efficacy of two downstream scientific question-answering tasks highlights the advantage of $\texttt{MixGR}$to boost the application of LLMs in the scientific domain.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems
Authors:
Yunxiao Shi,
Xing Zi,
Zijing Shi,
Haimin Zhang,
Qiang Wu,
Min Xu
Abstract:
Retrieval-augmented generation (RAG) techniques leverage the in-context learning capabilities of large language models (LLMs) to produce more accurate and relevant responses. Originating from the simple 'retrieve-then-read' approach, the RAG framework has evolved into a highly flexible and modular paradigm. A critical component, the Query Rewriter module, enhances knowledge retrieval by generating…
▽ More
Retrieval-augmented generation (RAG) techniques leverage the in-context learning capabilities of large language models (LLMs) to produce more accurate and relevant responses. Originating from the simple 'retrieve-then-read' approach, the RAG framework has evolved into a highly flexible and modular paradigm. A critical component, the Query Rewriter module, enhances knowledge retrieval by generating a search-friendly query. This method aligns input questions more closely with the knowledge base. Our research identifies opportunities to enhance the Query Rewriter module to Query Rewriter+ by generating multiple queries to overcome the Information Plateaus associated with a single query and by rewriting questions to eliminate Ambiguity, thereby clarifying the underlying intent. We also find that current RAG systems exhibit issues with Irrelevant Knowledge; to overcome this, we propose the Knowledge Filter. These two modules are both based on the instruction-tuned Gemma-2B model, which together enhance response quality. The final identified issue is Redundant Retrieval; we introduce the Memory Knowledge Reservoir and the Retriever Trigger to solve this. The former supports the dynamic expansion of the RAG system's knowledge base in a parameter-free manner, while the latter optimizes the cost for accessing external knowledge, thereby improving resource utilization and response efficiency. These four RAG modules synergistically improve the response quality and efficiency of the RAG system. The effectiveness of these modules has been validated through experiments and ablation studies across six common QA datasets. The source code can be accessed at https://github.com/Ancientshi/ERM4.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Visual Prompt Selection for In-Context Learning Segmentation
Authors:
Wei Suo,
Lanqing Lai,
Mengyang Sun,
Hanwang Zhang,
Peng Wang,
Yanning Zhang
Abstract:
As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual promp…
▽ More
As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual prompts or simply apply similarity sorting to select contextual examples. In this paper, we focus on rethinking and improving the example selection strategy. By comprehensive comparisons, we first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation. Based on the above insights, we propose a new stepwise context search method. Different from previous works, we construct a small yet rich candidate pool and adaptively search the well-matched contexts. More importantly, this method effectively reduces the annotation cost by compacting the search space. Extensive experiments show that our method is an effective strategy for selecting examples and enhancing segmentation performance.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
Adaptive Model Predictive Control with Data-driven Error Model for Quadrupedal Locomotion
Authors:
Xuanqi Zeng,
Hongbo Zhang,
Linzhu Yue,
Zhitao Song,
Linwei Zhang,
Yun-Hui Liu
Abstract:
Model Predictive Control (MPC) relies heavily on the robot model for its control law. However, a gap always exists between the reduced-order control model with uncertainties and the real robot, which degrades its performance. To address this issue, we propose the controller of integrating a data-driven error model into traditional MPC for quadruped robots. Our approach leverages real-world data fr…
▽ More
Model Predictive Control (MPC) relies heavily on the robot model for its control law. However, a gap always exists between the reduced-order control model with uncertainties and the real robot, which degrades its performance. To address this issue, we propose the controller of integrating a data-driven error model into traditional MPC for quadruped robots. Our approach leverages real-world data from sensors to compensate for defects in the control model. Specifically, we employ the Autoregressive Moving Average Vector (ARMAV) model to construct the state error model of the quadruped robot using data. The predicted state errors are then used to adjust the predicted future robot states generated by MPC. By such an approach, our proposed controller can provide more accurate inputs to the system, enabling it to achieve desired states even in the presence of model parameter inaccuracies or disturbances. The proposed controller exhibits the capability to partially eliminate the disparity between the model and the real-world robot, thereby enhancing the locomotion performance of quadruped robots. We validate our proposed method through simulations and real-world experimental trials on a large-size quadruped robot that involves carrying a 20 kg un-modeled payload (84% of body weight).
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
All Roads Lead to Rome: Unveiling the Trajectory of Recommender Systems Across the LLM Era
Authors:
Bo Chen,
Xinyi Dai,
Huifeng Guo,
Wei Guo,
Weiwen Liu,
Yong Liu,
Jiarui Qin,
Ruiming Tang,
Yichao Wang,
Chuhan Wu,
Yaxiong Wu,
Hao Zhang
Abstract:
Recommender systems (RS) are vital for managing information overload and delivering personalized content, responding to users' diverse information needs. The emergence of large language models (LLMs) offers a new horizon for redefining recommender systems with vast general knowledge and reasoning capabilities. Standing across this LLM era, we aim to integrate recommender systems into a broader pic…
▽ More
Recommender systems (RS) are vital for managing information overload and delivering personalized content, responding to users' diverse information needs. The emergence of large language models (LLMs) offers a new horizon for redefining recommender systems with vast general knowledge and reasoning capabilities. Standing across this LLM era, we aim to integrate recommender systems into a broader picture, and pave the way for more comprehensive solutions for future research. Therefore, we first offer a comprehensive overview of the technical progression of recommender systems, particularly focusing on language foundation models and their applications in recommendation. We identify two evolution paths of modern recommender systems -- via list-wise recommendation and conversational recommendation. These two paths finally converge at LLM agents with superior capabilities of long-term memory, reflection, and tool intelligence. Along these two paths, we point out that the information effectiveness of the recommendation is increased, while the user's acquisition cost is decreased. Technical features, research methodologies, and inherent challenges for each milestone along the path are carefully investigated -- from traditional list-wise recommendation to LLM-enhanced recommendation to recommendation with LLM agents. Finally, we highlight several unresolved challenges crucial for the development of future personalization technologies and interfaces and discuss the future prospects.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
Stabilizing Dynamic Systems through Neural Network Learning: A Robust Approach
Authors:
Yu Zhang,
Haoyu Zhang,
Yongxiang Zou,
Houcheng Li,
Long Cheng
Abstract:
Point-to-point and periodic motions are ubiquitous in the world of robotics. To master these motions, Autonomous Dynamic System (DS) based algorithms are fundamental in the domain of Learning from Demonstration (LfD). However, these algorithms face the significant challenge of balancing precision in learning with the maintenance of system stability. This paper addresses this challenge by presentin…
▽ More
Point-to-point and periodic motions are ubiquitous in the world of robotics. To master these motions, Autonomous Dynamic System (DS) based algorithms are fundamental in the domain of Learning from Demonstration (LfD). However, these algorithms face the significant challenge of balancing precision in learning with the maintenance of system stability. This paper addresses this challenge by presenting a novel ADS algorithm that leverages neural network technology. The proposed algorithm is designed to distill essential knowledge from demonstration data, ensuring stability during the learning of both point-to-point and periodic motions. For point-to-point motions, a neural Lyapunov function is proposed to align with the provided demonstrations. In the case of periodic motions, the neural Lyapunov function is used with the transversal contraction to ensure that all generated motions converge to a stable limit cycle. The model utilizes a streamlined neural network architecture, adept at achieving dual objectives: optimizing learning accuracy while maintaining global stability. To thoroughly assess the efficacy of the proposed algorithm, rigorous evaluations are conducted using the LASA dataset and a manually designed dataset. These assessments were complemented by empirical validation through robotic experiments, providing robust evidence of the algorithm's performance
△ Less
Submitted 13 July, 2024;
originally announced July 2024.
-
Minimizing PLM-Based Few-Shot Intent Detectors
Authors:
Haode Zhang,
Xiao-Ming Wu,
Albert Y. S. Lam
Abstract:
Recent research has demonstrated the feasibility of training efficient intent detectors based on pre-trained language model~(PLM) with limited labeled data. However, deploying these detectors in resource-constrained environments such as mobile devices poses challenges due to their large sizes. In this work, we aim to address this issue by exploring techniques to minimize the size of PLM-based inte…
▽ More
Recent research has demonstrated the feasibility of training efficient intent detectors based on pre-trained language model~(PLM) with limited labeled data. However, deploying these detectors in resource-constrained environments such as mobile devices poses challenges due to their large sizes. In this work, we aim to address this issue by exploring techniques to minimize the size of PLM-based intent detectors trained with few-shot data. Specifically, we utilize large language models (LLMs) for data augmentation, employ a cutting-edge model compression method for knowledge distillation, and devise a vocabulary pruning mechanism called V-Prune. Through these approaches, we successfully achieve a compression ratio of 21 in model memory usage, including both Transformer and the vocabulary, while maintaining almost identical performance levels on four real-world benchmarks.
△ Less
Submitted 13 July, 2024;
originally announced July 2024.
-
Prism XR -- A Curated Exhibition Experience in Virtual Reality with Peer Annotation Features and Virtual Guides for Art and Archaeology Classes
Authors:
Huopu Zhang
Abstract:
The Prism XR project is a curated exhibition experience in virtual reality (VR) for art and archaeology education with features designed for the enhancement of interactivity and collaborative learning. The project integrates peer annotations and a virtual exhibition guide to augment educational experiences. The peer annotation features are intended for facilitating visitor critiques and comments p…
▽ More
The Prism XR project is a curated exhibition experience in virtual reality (VR) for art and archaeology education with features designed for the enhancement of interactivity and collaborative learning. The project integrates peer annotations and a virtual exhibition guide to augment educational experiences. The peer annotation features are intended for facilitating visitor critiques and comments pivotal in fostering a dialog between the curator and the audience and a dialogue between the visitors in art and archaeology education, which are demonstrated to have positive impacts on the learning motivations and learning outcomes. The virtual exhibition guide is intended to address the issue of isolation in the virtual exhibition space and to increase interactivity in the virtual curatorial experiences.
△ Less
Submitted 15 July, 2024; v1 submitted 24 June, 2024;
originally announced July 2024.
-
Region Attention Transformer for Medical Image Restoration
Authors:
Zhiwen Yang,
Haowei Chen,
Ziniu Qian,
Yang Zhou,
Hui Zhang,
Dan Zhao,
Bingzheng Wei,
Yan Xu
Abstract:
Transformer-based methods have demonstrated impressive results in medical image restoration, attributed to the multi-head self-attention (MSA) mechanism in the spatial dimension. However, the majority of existing Transformers conduct attention within fixed and coarsely partitioned regions (\text{e.g.} the entire image or fixed patches), resulting in interference from irrelevant regions and fragmen…
▽ More
Transformer-based methods have demonstrated impressive results in medical image restoration, attributed to the multi-head self-attention (MSA) mechanism in the spatial dimension. However, the majority of existing Transformers conduct attention within fixed and coarsely partitioned regions (\text{e.g.} the entire image or fixed patches), resulting in interference from irrelevant regions and fragmentation of continuous image content. To overcome these challenges, we introduce a novel Region Attention Transformer (RAT) that utilizes a region-based multi-head self-attention mechanism (R-MSA). The R-MSA dynamically partitions the input image into non-overlapping semantic regions using the robust Segment Anything Model (SAM) and then performs self-attention within these regions. This region partitioning is more flexible and interpretable, ensuring that only pixels from similar semantic regions complement each other, thereby eliminating interference from irrelevant regions. Moreover, we introduce a focal region loss to guide our model to adaptively focus on recovering high-difficulty regions. Extensive experiments demonstrate the effectiveness of RAT in various medical image restoration tasks, including PET image synthesis, CT image denoising, and pathological image super-resolution. Code is available at \href{https://github.com/Yaziwel/Region-Attention-Transformer-for-Medical-Image-Restoration.git}{https://github.com/RAT}.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Disassembling Obfuscated Executables with LLM
Authors:
Huanyao Rong,
Yue Duan,
Hang Zhang,
XiaoFeng Wang,
Hongbo Chen,
Shengchen Duan,
Shen Wang
Abstract:
Disassembly is a challenging task, particularly for obfuscated executables containing junk bytes, which is designed to induce disassembly errors. Existing solutions rely on heuristics or leverage machine learning techniques, but only achieve limited successes. Fundamentally, such obfuscation cannot be defeated without in-depth understanding of the binary executable's semantics, which is made possi…
▽ More
Disassembly is a challenging task, particularly for obfuscated executables containing junk bytes, which is designed to induce disassembly errors. Existing solutions rely on heuristics or leverage machine learning techniques, but only achieve limited successes. Fundamentally, such obfuscation cannot be defeated without in-depth understanding of the binary executable's semantics, which is made possible by the emergence of large language models (LLMs). In this paper, we present DisasLLM, a novel LLM-driven dissembler to overcome the challenge in analyzing obfuscated executables. DisasLLM consists of two components: an LLM-based classifier that determines whether an instruction in an assembly code snippet is correctly decoded, and a disassembly strategy that leverages this model to disassemble obfuscated executables end-to-end. We evaluated DisasLLM on a set of heavily obfuscated executables, which is shown to significantly outperform other state-of-the-art disassembly solutions.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Emerging Practices for Large Multimodal Model (LMM) Assistance for People with Visual Impairments: Implications for Design
Authors:
Jingyi Xie,
Rui Yu,
He Zhang,
Sooyeon Lee,
Syed Masum Billah,
John M. Carroll
Abstract:
People with visual impairments perceive their environment non-visually and often use AI-powered assistive tools to obtain textual descriptions of visual information. Recent large vision-language model-based AI-powered tools like Be My AI are more capable of understanding users' inquiries in natural language and describing the scene in audible text; however, the extent to which these tools are usef…
▽ More
People with visual impairments perceive their environment non-visually and often use AI-powered assistive tools to obtain textual descriptions of visual information. Recent large vision-language model-based AI-powered tools like Be My AI are more capable of understanding users' inquiries in natural language and describing the scene in audible text; however, the extent to which these tools are useful to visually impaired users is currently understudied. This paper aims to fill this gap. Our study with 14 visually impaired users reveals that they are adapting these tools organically -- not only can these tools facilitate complex interactions in household, spatial, and social contexts, but they also act as an extension of users' cognition, as if the cognition were distributed in the visual information. We also found that although the tools are currently not goal-oriented, users accommodate this limitation and embrace the tools' capabilities for broader use. These findings enable us to envision design implications for creating more goal-oriented, real-time processing, and reliable AI-powered assistive technology.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification
Authors:
Wenshuo Peng,
Kaipeng Zhang,
Yue Yang,
Hao Zhang,
Yu Qiao
Abstract:
Vision-language foundation models have been incredibly successful in a wide range of downstream computer vision tasks using adaptation methods. However, due to the high cost of obtaining pre-training datasets, pairs with weak image-text correlation in the data exist in large numbers. We call them weak-paired samples. Due to the limitations of these weak-paired samples, the pre-training model are u…
▽ More
Vision-language foundation models have been incredibly successful in a wide range of downstream computer vision tasks using adaptation methods. However, due to the high cost of obtaining pre-training datasets, pairs with weak image-text correlation in the data exist in large numbers. We call them weak-paired samples. Due to the limitations of these weak-paired samples, the pre-training model are unable to mine all the knowledge from pre-training data. The existing adaptation methods do not consider the missing knowledge, which may lead to crucial task-related knowledge for the downstream tasks being ignored. To address this issue, we propose a new adaptation framework called Data Adaptive Traceback (DAT). Specifically, we utilize a zero-shot-based method to extract the most downstream task-related subset of the pre-training data to enable the downstream tasks. Furthermore, we adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning. We conduct extensive experiments that show our proposed DAT approach meaningfully improves various benchmark datasets performance over traditional adaptation methods by simply.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene
Authors:
Ruiyang Zhang,
Hu Zhang,
Hang Yu,
Zhedong Zheng
Abstract:
The unsupervised 3D object detection is to accurately detect objects in unstructured environments with no explicit supervisory signals. This task, given sparse LiDAR point clouds, often results in compromised performance for detecting distant or small objects due to the inherent sparsity and limited spatial resolution. In this paper, we are among the early attempts to integrate LiDAR data with 2D…
▽ More
The unsupervised 3D object detection is to accurately detect objects in unstructured environments with no explicit supervisory signals. This task, given sparse LiDAR point clouds, often results in compromised performance for detecting distant or small objects due to the inherent sparsity and limited spatial resolution. In this paper, we are among the early attempts to integrate LiDAR data with 2D images for unsupervised 3D detection and introduce a new method, dubbed LiDAR-2D Self-paced Learning (LiSe). We argue that RGB images serve as a valuable complement to LiDAR data, offering precise 2D localization cues, particularly when scarce LiDAR points are available for certain objects. Considering the unique characteristics of both modalities, our framework devises a self-paced learning pipeline that incorporates adaptive sampling and weak model aggregation strategies. The adaptive sampling strategy dynamically tunes the distribution of pseudo labels during training, countering the tendency of models to overfit easily detected samples, such as nearby and large-sized objects. By doing so, it ensures a balanced learning trajectory across varying object scales and distances. The weak model aggregation component consolidates the strengths of models trained under different pseudo label distributions, culminating in a robust and powerful final model. Experimental evaluations validate the efficacy of our proposed LiSe method, manifesting significant improvements of +7.1% AP$_{BEV}$ and +3.4% AP$_{3D}$ on nuScenes, and +8.3% AP$_{BEV}$ and +7.4% AP$_{3D}$ on Lyft compared to existing techniques.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Enhancing Thermal Infrared Tracking with Natural Language Modeling and Coordinate Sequence Generation
Authors:
Miao Yan,
Ping Zhang,
Haofei Zhang,
Ruqian Hao,
Juanxiu Liu,
Xiaoyang Wang,
Lin Liu
Abstract:
Thermal infrared tracking is an essential topic in computer vision tasks because of its advantage of all-weather imaging. However, most conventional methods utilize only hand-crafted features, while deep learning-based correlation filtering methods are limited by simple correlation operations. Transformer-based methods ignore temporal and coordinate information, which is critical for TIR tracking…
▽ More
Thermal infrared tracking is an essential topic in computer vision tasks because of its advantage of all-weather imaging. However, most conventional methods utilize only hand-crafted features, while deep learning-based correlation filtering methods are limited by simple correlation operations. Transformer-based methods ignore temporal and coordinate information, which is critical for TIR tracking that lacks texture and color information. In this paper, to address these issues, we apply natural language modeling to TIR tracking and propose a novel model called NLMTrack, which enhances the utilization of coordinate and temporal information. NLMTrack applies an encoder that unifies feature extraction and feature fusion, which simplifies the TIR tracking pipeline. To address the challenge of low detail and low contrast in TIR images, on the one hand, we design a multi-level progressive fusion module that enhances the semantic representation and incorporates multi-scale features. On the other hand, the decoder combines the TIR features and the coordinate sequence features using a causal transformer to generate the target sequence step by step. Moreover, we explore an adaptive loss aimed at elevating tracking accuracy and a simple template update strategy to accommodate the target's appearance variations. Experiments show that NLMTrack achieves state-of-the-art performance on multiple benchmarks. The Code is publicly available at \url{https://github.com/ELOESZHANG/NLMTrack}.
△ Less
Submitted 18 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
MemWarp: Discontinuity-Preserving Cardiac Registration with Memorized Anatomical Filters
Authors:
Hang Zhang,
Xiang Chen,
Renjiu Hu,
Dongdong Liu,
Gaolei Li,
Rongguang Wang
Abstract:
Many existing learning-based deformable image registration methods impose constraints on deformation fields to ensure they are globally smooth and continuous. However, this assumption does not hold in cardiac image registration, where different anatomical regions exhibit asymmetric motions during respiration and movements due to sliding organs within the chest. Consequently, such global constraint…
▽ More
Many existing learning-based deformable image registration methods impose constraints on deformation fields to ensure they are globally smooth and continuous. However, this assumption does not hold in cardiac image registration, where different anatomical regions exhibit asymmetric motions during respiration and movements due to sliding organs within the chest. Consequently, such global constraints fail to accommodate local discontinuities across organ boundaries, potentially resulting in erroneous and unrealistic displacement fields. In this paper, we address this issue with MemWarp, a learning framework that leverages a memory network to store prototypical information tailored to different anatomical regions. MemWarp is different from earlier approaches in two main aspects: firstly, by decoupling feature extraction from similarity matching in moving and fixed images, it facilitates more effective utilization of feature maps; secondly, despite its capability to preserve discontinuities, it eliminates the need for segmentation masks during model inference. In experiments on a publicly available cardiac dataset, our method achieves considerable improvements in registration accuracy and producing realistic deformations, outperforming state-of-the-art methods with a remarkable 7.1\% Dice score improvement over the runner-up semi-supervised method. Source code will be available at https://github.com/tinymilky/Mem-Warp.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Source Code Summarization in the Era of Large Language Models
Authors:
Weisong Sun,
Yun Miao,
Yuekang Li,
Hongyu Zhang,
Chunrong Fang,
Yi Liu,
Gelei Deng,
Yang Liu,
Zhenyu Chen
Abstract:
To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systemat…
▽ More
To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top\_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types. Finally, we unexpectedly find that CodeLlama-Instruct with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Authors:
Feng Li,
Renrui Zhang,
Hao Zhang,
Yuanhan Zhang,
Bo Li,
Wei Li,
Zejun Ma,
Chunyuan Li
Abstract:
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi-image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capa…
▽ More
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, their applications to multi-image scenarios remains less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize cross scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation
Authors:
Zikai Huang,
Xuemiao Xu,
Cheng Xu,
Huaidong Zhang,
Chenxi Zheng,
Jing Qin,
Shengfeng He
Abstract:
Dance, as an art form, fundamentally hinges on the precise synchronization with musical beats. However, achieving aesthetically pleasing dance sequences from music is challenging, with existing methods often falling short in controllability and beat alignment. To address these shortcomings, this paper introduces Beat-It, a novel framework for beat-specific, key pose-guided dance generation. Unlike…
▽ More
Dance, as an art form, fundamentally hinges on the precise synchronization with musical beats. However, achieving aesthetically pleasing dance sequences from music is challenging, with existing methods often falling short in controllability and beat alignment. To address these shortcomings, this paper introduces Beat-It, a novel framework for beat-specific, key pose-guided dance generation. Unlike prior approaches, Beat-It uniquely integrates explicit beat awareness and key pose guidance, effectively resolving two main issues: the misalignment of generated dance motions with musical beats, and the inability to map key poses to specific beats, critical for practical choreography. Our approach disentangles beat conditions from music using a nearest beat distance representation and employs a hierarchical multi-condition fusion mechanism. This mechanism seamlessly integrates key poses, beats, and music features, mitigating condition conflicts and offering rich, multi-conditioned guidance for dance generation. Additionally, a specially designed beat alignment loss ensures the generated dance movements remain in sync with the designated beats. Extensive experiments confirm Beat-It's superiority over existing state-of-the-art methods in terms of beat alignment and motion controllability.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification
Authors:
Zhenyu Kuang,
Hongyang Zhang,
Lidong Cheng,
Yinhao Liu,
Yue Huang,
Xinghao Ding
Abstract:
Generalizable vehicle re-identification (ReID) aims to enable the well-trained model in diverse source domains to broadly adapt to unknown target domains without additional fine-tuning or retraining. However, it still faces the challenges of domain shift problem and has difficulty accurately generalizing to unknown target domains. This limitation occurs because the model relies heavily on primary…
▽ More
Generalizable vehicle re-identification (ReID) aims to enable the well-trained model in diverse source domains to broadly adapt to unknown target domains without additional fine-tuning or retraining. However, it still faces the challenges of domain shift problem and has difficulty accurately generalizing to unknown target domains. This limitation occurs because the model relies heavily on primary domain-invariant features in the training data and pays less attention to potentially valuable secondary features. To solve this complex and common problem, this paper proposes the two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method, which incorporates multiple experts with unique perspectives into Contrastive Language-Image Pretraining (CLIP) and fully leverages high-level semantic knowledge for comprehensive feature representation. Specifically, we propose to construct the learnable prompt set of all specific-perspective experts by adversarial learning in the latent space of visual features during the first stage of training. The learned prompt set with high-level semantics is then utilized to guide representation learning of the multi-level features for final knowledge fusion in the next stage. In this process of knowledge fusion, although multiple experts employ different assessment ways to examine the same vehicle, their common goal is to confirm the vehicle's true identity. Their collective decision can ensure the accuracy and consistency of the evaluation results. Furthermore, we design different image inputs for two-stage training, which include image component separation and diversity enhancement in order to extract the ID-related prompt representation and to obtain feature representation highlighted by all experts, respectively. Extensive experimental results demonstrate that our method achieves state-of-the-art recognition performance.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Flow to Rare Events: An Application of Normalizing Flow in Temporal Importance Sampling for Automated Vehicle Validation
Authors:
Yichun Ye,
He Zhang,
Ye Tian,
Jian Sun
Abstract:
Automated Vehicle (AV) validation based on simulated testing requires unbiased evaluation and high efficiency. One effective solution is to increase the exposure to risky rare events while reweighting the probability measure. However, characterizing the distribution of risky events is particularly challenging due to the paucity of samples and the temporality of continuous scenario variables. To so…
▽ More
Automated Vehicle (AV) validation based on simulated testing requires unbiased evaluation and high efficiency. One effective solution is to increase the exposure to risky rare events while reweighting the probability measure. However, characterizing the distribution of risky events is particularly challenging due to the paucity of samples and the temporality of continuous scenario variables. To solve it, we devise a method to represent, generate, and reweight the distribution of risky rare events. We decompose the temporal evolution of continuous variables into distribution components based on conditional probability. By introducing the Risk Indicator Function, the distribution of risky rare events is theoretically precipitated out of naturalistic driving distribution. This targeted distribution is practically generated via Normalizing Flow, which achieves exact and tractable probability evaluation of intricate distribution. The rare event distribution is then demonstrated as the advantageous Importance Sampling distribution. We also promote the technique of temporal Importance Sampling. The combined method, named as TrimFlow, is executed to estimate the collision rate of Car-following scenarios as a tentative practice. The results showed that sampling background vehicle maneuvers from rare event distribution could evolve testing scenarios to hazardous states. TrimFlow reduced 86.1% of tests compared to generating testing scenarios according to their exposure in the naturalistic driving environment. In addition, the TrimFlow method is not limited to one specific type of functional scenario.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules
Authors:
Zhuocheng Gong,
Ang Lv,
Jian Guan,
Junxi Yan,
Wei Wu,
Huishuai Zhang,
Minlie Huang,
Dongyan Zhao,
Rui Yan
Abstract:
Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted "yes". In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by an intuition that any layer, regardless of its position, ca…
▽ More
Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted "yes". In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by an intuition that any layer, regardless of its position, can be used to compute a token as long as it possesses the needed processing capabilities. The construction of MoM starts from a finite set of modules defined by multi-head attention and feed-forward networks, each distinguished by its unique parameterization. Two routers then iteratively select attention modules and feed-forward modules from the set to process a token. The selection dynamically expands the computation graph in the forward pass of the token, culminating in an assembly of modules. We show that MoM provides not only a unified framework for Transformers and their numerous variants but also a flexible and learnable approach for reducing redundancy in Transformer parameterization. We pre-train various MoMs using OpenWebText. Empirical results demonstrate that MoMs, of different parameter counts, consistently outperform vanilla transformers on both GLUE and XSUM benchmarks. More interestingly, with a fixed parameter budget, MoM-large enables an over 38% increase in depth for computation graphs compared to GPT-2-large, resulting in absolute gains of 1.4 on GLUE and 1 on XSUM. On the other hand, MoM-large also enables an over 60% reduction in depth while involving more modules per layer, yielding a 16% reduction in TFLOPs and a 43% decrease in memory usage compared to GPT-2-large, while maintaining comparable performance.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Economic span selection of bridge based on deep reinforcement learning
Authors:
Leye Zhang,
Xiangxiang Tian,
Chengli Zhang,
Hongjun Zhang
Abstract:
Deep Q-network algorithm is used to select economic span of bridge. Selection of bridge span has a significant impact on the total cost of bridge, and a reasonable selection of span can reduce engineering cost. Economic span of bridge is theoretically analyzed, and the theoretical solution formula of economic span is deduced. Construction process of bridge simulation environment is described in de…
▽ More
Deep Q-network algorithm is used to select economic span of bridge. Selection of bridge span has a significant impact on the total cost of bridge, and a reasonable selection of span can reduce engineering cost. Economic span of bridge is theoretically analyzed, and the theoretical solution formula of economic span is deduced. Construction process of bridge simulation environment is described in detail, including observation space, action space and reward function of the environment. Agent is constructed, convolutional neural network is used to approximate Q function,ε greedy policy is used for action selection, and experience replay is used for training. The test verifies that the agent can successfully learn optimal policy and realize economic span selection of bridge. This study provides a potential decision-making tool for bridge design.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Not all explicit cues help communicate: Pedestrians' perceptions, fixations, and decisions toward automated vehicles with varied appearance
Authors:
Wei Lyu,
Yaqin Cao,
Yi Ding,
Jingyu Li,
Kai Tian,
Hui Zhang
Abstract:
Given pedestrians' vulnerability in road traffic, it remains unclear how novel AV appearances will impact pedestrians crossing behaviour. To address this gap, this study pioneers an investigation into the influence of AVs' exterior design, correlated with their kinematics, on pedestrians' road-crossing perception and decision-making. A video-based eye-tracking experimental study was conducted with…
▽ More
Given pedestrians' vulnerability in road traffic, it remains unclear how novel AV appearances will impact pedestrians crossing behaviour. To address this gap, this study pioneers an investigation into the influence of AVs' exterior design, correlated with their kinematics, on pedestrians' road-crossing perception and decision-making. A video-based eye-tracking experimental study was conducted with 61 participants who responded to video stimuli depicting a manipulated vehicle approaching a predefined road-crossing location on an unsignalized, two-way road. The vehicle's kinematic pattern was manipulated into yielding and non-yielding, and its external appearances were varied across five types: with a human driver (as a conventional vehicle), with no driver (as an AV), with text-based identity indications, with roof radar sensors, with dynamic eHMIs adjusted to vehicle kinematics. Participants' perceived clarity, crossing initiation distance (CID), crossing decision time (CDT), and gaze behaviour, during interactions were recorded and reported. The results indicated that AVs' kinematic profiles play a dominant role in pedestrians' road-crossing decisions, supported by their subjective evaluations, CID, CDT, and gaze patterns during interactions. Moreover, the use of clear eHMI, such as dynamic pedestrian icons, reduced pedestrians' visual load, enhanced their perceptual clarity, expedited road-crossing decisions, and thereby improved overall crossing efficiency. However, it was found that both textual identity indications and roof radar sensors have no significant effect on pedestrians' decisions but negatively impact pedestrians' visual attention, as evidenced by heightened fixation counts and prolonged fixation durations, particularly under yielding conditions. Excessive visual and cognitive resource occupation suggests that not all explicit cues facilitate human-vehicle communication.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.