Skip to main content

Showing 1–50 of 3,422 results for author: Zhang, L

  1. arXiv:2407.13748  [pdf, other

    cs.CV

    General Geometry-aware Weakly Supervised 3D Object Detection

    Authors: Guowen Zhang, Junsong Fan, Liyi Chen, Zhaoxiang Zhang, Zhen Lei, Lei Zhang

    Abstract: 3D object detection is an indispensable component for scene understanding. However, the annotation of large-scale 3D datasets requires significant human effort. To tackle this problem, many methods adopt weakly supervised 3D object detection that estimates 3D boxes by leveraging 2D boxes and scene/class-specific priors. However, these approaches generally depend on sophisticated manual priors, whi… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV24

  2. arXiv:2407.13362  [pdf, other

    cs.CV

    Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation

    Authors: Pengfei Wang, Yuxi Wang, Shuai Li, Zhaoxiang Zhang, Zhen Lei, Lei Zhang

    Abstract: The scarcity of large-scale 3D-text paired data poses a great challenge on open vocabulary 3D scene understanding, and hence it is popular to leverage internet-scale 2D data and transfer their open vocabulary capabilities to 3D models through knowledge distillation. However, the existing distillation-based 3D scene understanding approaches rely on the representation capacity of 2D models, disregar… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  3. arXiv:2407.13217  [pdf, other

    cs.CV

    LIDIA: Precise Liver Tumor Diagnosis on Multi-Phase Contrast-Enhanced CT via Iterative Fusion and Asymmetric Contrastive Learning

    Authors: Wei Huang, Wei Liu, Xiaoming Zhang, Xiaoli Yin, Xu Han, Chunli Li, Yuan Gao, Yu Shi, Le Lu, Ling Zhang, Lei Zhang, Ke Yan

    Abstract: The early detection and precise diagnosis of liver tumors are tasks of critical clinical value, yet they pose significant challenges due to the high heterogeneity and variability of liver tumors. In this work, a precise LIver tumor DIAgnosis network on multi-phase contrast-enhance CT, named LIDIA, is proposed for real-world scenario. To fully utilize all available phases in contrast-enhanced CT, L… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted to MICCAI 2024

  4. arXiv:2407.13210  [pdf, other

    eess.IV cs.CV

    Improved Esophageal Varices Assessment from Non-Contrast CT Scans

    Authors: Chunli Li, Xiaoming Zhang, Yuan Gao, Xiaoli Yin, Le Lu, Ling Zhang, Ke Yan, Yu Shi

    Abstract: Esophageal varices (EV), a serious health concern resulting from portal hypertension, are traditionally diagnosed through invasive endoscopic procedures. Despite non-contrast computed tomography (NC-CT) imaging being a less expensive and non-invasive imaging modality, it has yet to gain full acceptance as a primary clinical diagnostic tool for EV evaluation. To overcome existing diagnostic challen… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Early accepted to MICCAI 2024

  5. arXiv:2407.13158  [pdf, other

    cs.LG cs.DB

    HHGT: Hierarchical Heterogeneous Graph Transformer for Heterogeneous Graph Representation Learning

    Authors: Qiuyu Zhu, Liang Zhang, Qianxiong Xu, Kaijun Liu, Cheng Long, Xiaoyang Wang

    Abstract: Despite the success of Heterogeneous Graph Neural Networks (HGNNs) in modeling real-world Heterogeneous Information Networks (HINs), challenges such as expressiveness limitations and over-smoothing have prompted researchers to explore Graph Transformers (GTs) for enhanced HIN representation learning. However, research on GT in HINs remains limited, with two key shortcomings in existing work: (1) A… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  6. arXiv:2407.12449  [pdf, other

    cs.CV cs.AI

    Close the Sim2real Gap via Physically-based Structured Light Synthetic Data Simulation

    Authors: Kaixin Bai, Lei Zhang, Zhaopeng Chen, Fang Wan, Jianwei Zhang

    Abstract: Despite the substantial progress in deep learning, its adoption in industrial robotics projects remains limited, primarily due to challenges in data acquisition and labeling. Previous sim2real approaches using domain randomization require extensive scene and model optimization. To address these issues, we introduce an innovative physically-based structured light simulation system, generating both… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: 7 pages, 2024 IEEE International Conference on Robotics and Automation

  7. arXiv:2407.11503  [pdf, other

    cs.CV

    Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation

    Authors: Shijie Chang, Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu

    Abstract: Existing few-shot segmentation (FSS) methods mainly focus on prototype feature generation and the query-support matching mechanism. As a crucial prompt for generating prototype features, the pair of image-mask types in the support set has become the default setting. However, various types such as image, text, box, and mask all can provide valuable information regarding the objects in context, clas… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Preprint under review

  8. arXiv:2407.11300  [pdf, other

    cs.CV cs.AI

    Large Vision-Language Models as Emotion Recognizers in Context Awareness

    Authors: Yuxuan Lei, Dingkang Yang, Zhaoyu Chen, Jiawei Chen, Peng Zhai, Lihua Zhang

    Abstract: Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues. Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images. However, their knowledge is confined to specific training datasets and may reflect the subjective emotional biases of the annotators. Furthermore… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

  9. arXiv:2407.11044  [pdf, other

    cs.LG cs.AI

    Generalizing soft actor-critic algorithms to discrete action spaces

    Authors: Le Zhang, Yong Gu, Xin Zhao, Yanshuo Zhang, Shu Zhao, Yifei Jin, Xinxin Wu

    Abstract: ATARI is a suite of video games used by reinforcement learning (RL) researchers to test the effectiveness of the learning algorithm. Receiving only the raw pixels and the game score, the agent learns to develop sophisticated strategies, even to the comparable level of a professional human games tester. Ideally, we also want an agent requiring very few interactions with the environment. Previous co… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Chinese Conference on Pattern Recognition and Computer Vision (PRCV) 2024. GitHub Repo https://github.com/lezhang-thu/bigger-better-faster-SAC

  10. arXiv:2407.10833  [pdf, other

    eess.IV cs.CV

    MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

    Authors: Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen

    Abstract: We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, our MoE-DiffIR develo… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  11. arXiv:2407.10510  [pdf, other

    cs.CL cs.AI cs.CE

    TCM-FTP: Fine-Tuning Large Language Models for Herbal Prescription Prediction

    Authors: Xingzhi Zhou, Xin Dong, Chunhao Li, Yuning Bai, Yulong Xu, Ka Chun Cheung, Simon See, Xinpeng Song, Runshun Zhang, Xuezhong Zhou, Nevin L. Zhang

    Abstract: Traditional Chinese medicine (TCM) relies on specific combinations of herbs in prescriptions to treat symptoms and signs, a practice that spans thousands of years. Predicting TCM prescriptions presents a fascinating technical challenge with practical implications. However, this task faces limitations due to the scarcity of high-quality clinical datasets and the intricate relationship between sympt… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

  12. arXiv:2407.10157  [pdf, other

    eess.IV cs.CV

    SACNet: A Spatially Adaptive Convolution Network for 2D Multi-organ Medical Segmentation

    Authors: Lin Zhang, Wenbo Gao, Jie Yi, Yunyun Yang

    Abstract: Multi-organ segmentation in medical image analysis is crucial for diagnosis and treatment planning. However, many factors complicate the task, including variability in different target categories and interference from complex backgrounds. In this paper, we utilize the knowledge of Deformable Convolution V3 (DCNv3) and multi-object segmentation to optimize our Spatially Adaptive Convolution Network… ▽ More

    Submitted 14 July, 2024; originally announced July 2024.

  13. arXiv:2407.10124  [pdf, other

    cs.RO

    Adaptive Model Predictive Control with Data-driven Error Model for Quadrupedal Locomotion

    Authors: Xuanqi Zeng, Hongbo Zhang, Linzhu Yue, Zhitao Song, Linwei Zhang, Yun-Hui Liu

    Abstract: Model Predictive Control (MPC) relies heavily on the robot model for its control law. However, a gap always exists between the reduced-order control model with uncertainties and the real robot, which degrades its performance. To address this issue, we propose the controller of integrating a data-driven error model into traditional MPC for quadruped robots. Our approach leverages real-world data fr… ▽ More

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: 7 Pages, 7 figures, conference(ICRA 2024)

  14. arXiv:2407.10048  [pdf, other

    cs.SD eess.AS

    Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification

    Authors: Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie

    Abstract: Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are… ▽ More

    Submitted 13 July, 2024; originally announced July 2024.

  15. arXiv:2407.09793  [pdf, other

    cs.SE

    Uncovering Weaknesses in Neural Code Generation

    Authors: Xiaoli Lian, Shuaisong Wang, Jieping Ma, Fang Liu, Xin Tan, Li Zhang, Lin Shi, Cuiyun Gao

    Abstract: Code generation, the task of producing source code from prompts, has seen significant advancements with the advent of pre-trained large language models (PLMs). Despite these achievements, there lacks a comprehensive taxonomy of weaknesses about the benchmark and the generated code, which risks the community's focus on known issues at the cost of under-explored areas. Our systematic study aims to… ▽ More

    Submitted 17 July, 2024; v1 submitted 13 July, 2024; originally announced July 2024.

  16. arXiv:2407.09781  [pdf, other

    cs.CV

    Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

    Authors: Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M. Patel, Lei Zhang

    Abstract: Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. Previous open-vocabulary 3D scene understanding methods mostly focus on training 3D models using either image or text supervision while neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework to densely co-em… ▽ More

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  17. arXiv:2407.09540  [pdf, other

    eess.IV cs.CE cs.CV cs.LG q-bio.TO

    Prompting Whole Slide Image Based Genetic Biomarker Prediction

    Authors: Ling Zhang, Boxiang Yun, Xingran Xie, Qingli Li, Xinxing Li, Yan Wang

    Abstract: Prediction of genetic biomarkers, e.g., microsatellite instability and BRAF in colorectal cancer is crucial for clinical decision making. In this paper, we propose a whole slide image (WSI) based genetic biomarker prediction method via prompting techniques. Our work aims at addressing the following challenges: (1) extracting foreground instances related to genetic biomarkers from gigapixel WSIs, a… ▽ More

    Submitted 26 June, 2024; originally announced July 2024.

    Comments: 11 pages, 3 figures, MICCAI2024

  18. arXiv:2407.08966  [pdf, other

    cs.CV cs.AI cs.LG

    LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models

    Authors: Yabin Zhang, Wenjie Zhu, Chenhang He, Lei Zhang

    Abstract: Out-of-distribution (OOD) detection is crucial for model reliability, as it identifies samples from unknown classes and reduces errors due to unexpected inputs. Vision-Language Models (VLMs) such as CLIP are emerging as powerful tools for OOD detection by integrating multi-modal information. However, the practical application of such systems is challenged by manual prompt engineering, which demand… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: ECCV2024; Codes and Supp. are available at: https://github.com/YBZh/LAPT

  19. arXiv:2407.08195  [pdf

    cs.AI cs.CL cs.MA

    A Text-to-Game Engine for UGC-Based Role-Playing Games

    Authors: Lei Zhang, Xuezheng Peng, Shuyi Yang, Feiyang Wang

    Abstract: The shift from professionally generated content (PGC) to user-generated content (UGC) has revolutionized various media formats, from text to video. With the rapid advancements in generative AI, a similar shift is set to transform the game industry, particularly in the realm of role-playing games (RPGs). This paper introduces a new framework for a text-to-game engine that utilizes foundation models… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 13 pages,11 figures

  20. arXiv:2407.08109  [pdf, other

    cs.CV cs.AI cs.LG

    Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter

    Authors: Suqi Song, Chenxu Zhang, Peng Zhang, Pengkun Li, Fenglong Song, Lei Zhang

    Abstract: Urban waterlogging poses a major risk to public safety and infrastructure. Conventional methods using water-level sensors need high-maintenance to hardly achieve full coverage. Recent advances employ surveillance camera imagery and deep learning for detection, yet these struggle amidst scarce data and adverse environmental conditions. In this paper, we establish a challenging Urban Waterlogging Be… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  21. arXiv:2407.07614  [pdf, other

    cs.CV

    MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

    Authors: Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, Ziwei Huang, LeiLei Gan, Hao Jiang

    Abstract: Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by in… ▽ More

    Submitted 11 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: 14 pages, 9 figures

  22. arXiv:2407.06807  [pdf, other

    cs.AI

    A Hybrid Training-time and Run-time Defense Against Adversarial Attacks in Modulation Classification

    Authors: Lu Zhang, Sangarapillai Lambotharan, Gan Zheng, Guisheng Liao, Ambra Demontis, Fabio Roli

    Abstract: Motivated by the superior performance of deep learning in many applications including computer vision and natural language processing, several recent studies have focused on applying deep neural network for devising future generations of wireless networks. However, several recent works have pointed out that imperceptible and carefully designed adversarial examples (attacks) can significantly deter… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: Published in IEEE Wireless Communications Letters, vol. 11, no. 6, pp. 1161-1165, June 2022

  23. arXiv:2407.06796  [pdf, other

    cs.AI

    Countermeasures Against Adversarial Examples in Radio Signal Classification

    Authors: Lu Zhang, Sangarapillai Lambotharan, Gan Zheng, Basil AsSadhan, Fabio Roli

    Abstract: Deep learning algorithms have been shown to be powerful in many communication network design problems, including that in automatic modulation classification. However, they are vulnerable to carefully crafted attacks called adversarial examples. Hence, the reliance of wireless networks on deep learning algorithms poses a serious threat to the security and operation of wireless networks. In this let… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: Published in IEEE Wireless Communications Letters, vol. 10, no. 8, pp. 1830-1834, Aug. 2021

  24. arXiv:2407.06778  [pdf, other

    cs.CR cs.AI

    A BERT-based Empirical Study of Privacy Policies' Compliance with GDPR

    Authors: Lu Zhang, Nabil Moukafih, Hamad Alamri, Gregory Epiphaniou, Carsten Maple

    Abstract: Since its implementation in May 2018, the General Data Protection Regulation (GDPR) has prompted businesses to revisit and revise their data handling practices to ensure compliance. The privacy policy, which serves as the primary means of informing users about their privacy rights and the data practices of companies, has been significantly updated by numerous businesses post-GDPR implementation. H… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: Published in IEEE Conference on Communications and Network Security (CNS), 2023

  25. arXiv:2407.06612  [pdf

    eess.IV cs.CV cs.LG

    AI-based Automatic Segmentation of Prostate on Multi-modality Images: A Review

    Authors: Rui Jin, Derun Li, Dehui Xiang, Lei Zhang, Hailing Zhou, Fei Shi, Weifang Zhu, Jing Cai, Tao Peng, Xinjian Chen

    Abstract: Prostate cancer represents a major threat to health. Early detection is vital in reducing the mortality rate among prostate cancer patients. One approach involves using multi-modality (CT, MRI, US, etc.) computer-aided diagnosis (CAD) systems for the prostate region. However, prostate segmentation is challenging due to imperfections in the images and the prostate's complex tissue structure. The ad… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

  26. arXiv:2407.06507  [pdf

    cs.LG cs.AI

    Economic span selection of bridge based on deep reinforcement learning

    Authors: Leye Zhang, Xiangxiang Tian, Chengli Zhang, Hongjun Zhang

    Abstract: Deep Q-network algorithm is used to select economic span of bridge. Selection of bridge span has a significant impact on the total cost of bridge, and a reasonable selection of span can reduce engineering cost. Economic span of bridge is theoretically analyzed, and the theoretical solution formula of economic span is deduced. Construction process of bridge simulation environment is described in de… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 7 pages, 6 figures

  27. arXiv:2407.06152  [pdf, other

    physics.chem-ph cs.AI

    Uni-ELF: A Multi-Level Representation Learning Framework for Electrolyte Formulation Design

    Authors: Boshen Zeng, Sian Chen, Xinxin Liu, Changhong Chen, Bin Deng, Xiaoxu Wang, Zhifeng Gao, Yuzhi Zhang, Weinan E, Linfeng Zhang

    Abstract: Advancements in lithium battery technology heavily rely on the design and engineering of electrolytes. However, current schemes for molecular design and recipe optimization of electrolytes lack an effective computational-experimental closed loop and often fall short in accurately predicting diverse electrolyte formulation properties. In this work, we introduce Uni-ELF, a novel multi-level represen… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  28. arXiv:2407.06053  [pdf, other

    cond-mat.mtrl-sci cs.LG quant-ph

    Learning local equivariant representations for quantum operators

    Authors: Zhanghao Zhouyin, Zixi Gan, Shishir Kumar Pandey, Linfeng Zhang, Qiangqiang Gu

    Abstract: Predicting quantum operator matrices such as Hamiltonian, overlap, and density matrices in the density functional theory (DFT) framework is crucial for understanding material properties. Current methods often focus on individual operators and struggle with efficiency and scalability for large systems. Here we introduce a novel deep learning model, SLEM (strictly localized equivariant message-passi… ▽ More

    Submitted 16 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: 11 pages, 5 figures and 5 tables

  29. arXiv:2407.05875  [pdf, other

    cs.CV

    Minutes to Seconds: Speeded-up DDPM-based Image Inpainting with Coarse-to-Fine Sampling

    Authors: Lintao Zhang, Xiangcheng Du, LeoWu TomyEnrique, Yiqun Wang, Yingbin Zheng, Cheng Jin

    Abstract: For image inpainting, the existing Denoising Diffusion Probabilistic Model (DDPM) based method i.e. RePaint can produce high-quality images for any inpainting form. It utilizes a pre-trained DDPM as a prior and generates inpainting results by conditioning on the reverse diffusion process, namely denoising process. However, this process is significantly time-consuming. In this paper, we propose an… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: The code is avaliable at: https://github.com/linghuyuhangyuan/M2S

  30. arXiv:2407.05418  [pdf, other

    cs.CV cs.AI

    EMBANet: A Flexible Efffcient Multi-branch Attention Network

    Authors: Keke Zu, Hu Zhang, Jian Lu, Lei Zhang, Chen Xu

    Abstract: This work presents a novel module, namely multi-branch concat (MBC), to process the input tensor and obtain the multi-scale feature map. The proposed MBC module brings new degrees of freedom (DoF) for the design of attention networks by allowing the type of transformation operators and the number of branches to be flexibly adjusted. Two important transformation operators, multiplex and split, are… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  31. arXiv:2407.05131  [pdf, other

    cs.LG cs.AI cs.CL cs.CV cs.CY

    RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

    Authors: Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, Huaxiu Yao

    Abstract: The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challen… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

  32. arXiv:2407.05106  [pdf, other

    cs.CV

    DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

    Authors: Qi Wang, Zhou Xu, Yuming Lin, Jingtao Ye, Hongsheng Li, Guangming Zhu, Syed Afaq Ali Shah, Mohammed Bennamoun, Liang Zhang

    Abstract: Neuromorphic sensors, specifically event cameras, revolutionize visual data acquisition by capturing pixel intensity changes with exceptional dynamic range, minimal latency, and energy efficiency, setting them apart from conventional frame-based cameras. The distinctive capabilities of event cameras have ignited significant interest in the domain of event-based action recognition, recognizing thei… ▽ More

    Submitted 13 July, 2024; v1 submitted 6 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  33. arXiv:2407.04963  [pdf, other

    cs.CV

    Towards Context-Aware Emotion Recognition Debiasing from a Causal Demystification Perspective via De-confounded Training

    Authors: Dingkang Yang, Kun Yang, Haopeng Kuang, Zhaoyu Chen, Yuzheng Wang, Lihua Zhang

    Abstract: Understanding emotions from diverse contexts has received widespread attention in computer vision communities. The core philosophy of Context-Aware Emotion Recognition (CAER) is to provide valuable semantic cues for recognizing the emotions of target persons by leveraging rich contextual information. Current approaches invariably focus on designing sophisticated structures to extract perceptually… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

    Comments: TPAMI 2024

  34. arXiv:2407.04955  [pdf, other

    cs.CV

    Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

    Authors: Dingkang Yang, Mingcheng Li, Linhao Qu, Kun Yang, Peng Zhai, Song Wang, Lihua Zhang

    Abstract: Understanding human intentions (e.g., emotions) from videos has received considerable attention recently. Video streams generally constitute a blend of temporal data stemming from distinct modalities, including natural language, facial expressions, and auditory clues. Despite the impressive advancements of previous works via attention-based paradigms, the inherent temporal asynchrony and modality… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

    Comments: TCSVT 2024

  35. arXiv:2407.04923  [pdf, other

    cs.CV cs.CL

    OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

    Authors: Tiancheng Zhao, Qianqian Zhang, Kyusong Lee, Peng Liu, Lu Zhang, Chunxin Fang, Jiajia Liao, Kelei Jiang, Yibo Ma, Ruochen Xu

    Abstract: We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an ac… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: 14 pages

  36. arXiv:2407.04396  [pdf, other

    cs.CV cs.AI

    Graph-Guided Test-Time Adaptation for Glaucoma Diagnosis using Fundus Photography

    Authors: Qian Zeng, Le Zhang, Yipeng Liu, Ce Zhu, Fan Zhang

    Abstract: Glaucoma is a leading cause of irreversible blindness worldwide. While deep learning approaches using fundus images have largely improved early diagnosis of glaucoma, variations in images from different devices and locations (known as domain shifts) challenge the use of pre-trained models in real-world settings. To address this, we propose a novel Graph-guided Test-Time Adaptation (GTTA) framework… ▽ More

    Submitted 9 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: 11 pages, 3 figures, 3 tables, submitted to MICCAI

  37. arXiv:2407.04162  [pdf, other

    eess.IV cs.CV

    Measurement Embedded Schrödinger Bridge for Inverse Problems

    Authors: Yuang Wang, Pengfei Jin, Siyeop Yoon, Matthew Tivnan, Quanzheng Li, Li Zhang, Dufan Wu

    Abstract: Score-based diffusion models are frequently employed as structural priors in inverse problems. However, their iterative denoising process, initiated from Gaussian noise, often results in slow inference speeds. The Image-to-Image Schrödinger Bridge (I$^2$SB), which begins with the corrupted image, presents a promising alternative as a prior for addressing inverse problems. In this work, we introduc… ▽ More

    Submitted 22 May, 2024; originally announced July 2024.

    Comments: 14 pages, 2 figures, Neurips preprint

  38. arXiv:2407.03891  [pdf, other

    cs.SE cs.PL

    AutoBench: Automatic Testbench Generation and Evaluation Using LLMs for HDL Design

    Authors: Ruidi Qiu, Grace Li Zhang, Rolf Drechsler, Ulf Schlichtmann, Bing Li

    Abstract: In digital circuit design, testbenches constitute the cornerstone of simulation-based hardware verification. Traditional methodologies for testbench generation during simulation-based hardware verification still remain partially manual, resulting in inefficiencies in testing various scenarios and requiring expensive time from designers. Large Language Models (LLMs) have demonstrated their potentia… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  39. arXiv:2407.03738  [pdf, other

    eess.SY cs.LG

    BasisN: Reprogramming-Free RRAM-Based In-Memory-Computing by Basis Combination for Deep Neural Networks

    Authors: Amro Eldebiky, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ing-Chao Lin, Ulf Schlichtmann, Bing Li

    Abstract: Deep neural networks (DNNs) have made breakthroughs in various fields including image recognition and language processing. DNNs execute hundreds of millions of multiply-and-accumulate (MAC) operations. To efficiently accelerate such computations, analog in-memory-computing platforms have emerged leveraging emerging devices such as resistive RAM (RRAM). However, such accelerators face the hurdle of… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: accepted by ICCAD2024

  40. arXiv:2407.03469  [pdf

    cs.SE cs.AI

    Scaling Data-Driven Building Energy Modelling using Large Language Models

    Authors: Sunil Khadka, Liang Zhang

    Abstract: Building Management System (BMS) through a data-driven method always faces data and model scalability issues. We propose a methodology to tackle the scalability challenges associated with the development of data-driven models for BMS by using Large Language Models (LLMs). LLMs' code generation adaptability can enable broader adoption of BMS by "automating the automation," particularly the data han… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

  41. arXiv:2407.03331  [pdf, other

    cs.CV cs.AI cs.DC

    Anole: Adapting Diverse Compressed Models For Cross-Scene Prediction On Mobile Devices

    Authors: Yunzhe Li, Hongzi Zhu, Zhuohong Deng, Yunlong Cheng, Liang Zhang, Shan Chang, Minyi Guo

    Abstract: Emerging Artificial Intelligence of Things (AIoT) applications desire online prediction using deep neural network (DNN) models on mobile devices. However, due to the movement of devices, unfamiliar test samples constantly appear, significantly affecting the prediction accuracy of a pre-trained DNN. In addition, unstable network connection calls for local model inference. In this paper, we propose… ▽ More

    Submitted 9 May, 2024; originally announced July 2024.

  42. arXiv:2407.03008  [pdf, other

    cs.CV

    Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering

    Authors: Zhaohe Liao, Jiangtong Li, Li Niu, Liqing Zhang

    Abstract: Despite the recent progress made in Video Question-Answering (VideoQA), these methods typically function as black-boxes, making it difficult to understand their reasoning processes and perform consistent compositional reasoning. To address these challenges, we propose a \textit{model-agnostic} Video Alignment and Answer Aggregation (VA$^{3}$) framework, which is capable of enhancing both compositi… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: 10 pages,CVPR

    Journal ref: CVPR (2024) 13395-13404

  43. arXiv:2407.02505  [pdf, other

    cs.CE cs.LG physics.flu-dyn

    A MgNO Method for Multiphase Flow in Porous Media

    Authors: Xinliang Liu, Xia Yang, Chen-Song Zhang, Lian Zhang, Li Zhao

    Abstract: This research investigates the application of Multigrid Neural Operator (MgNO), a neural operator architecture inspired by multigrid methods, in the simulation for multiphase flow within porous media. The architecture is adjusted to manage a variety of crucial factors, such as permeability and porosity heterogeneity. The study extendes MgNO to time-dependent porous media flow problems and validate… ▽ More

    Submitted 16 June, 2024; originally announced July 2024.

  44. arXiv:2407.02392  [pdf, other

    cs.CV

    TokenPacker: Efficient Visual Projector for Multimodal LLM

    Authors: Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, Lei Zhang

    Abstract: The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significa… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 16 pages, Codes:https://github.com/CircleRadon/TokenPacker

  45. arXiv:2407.02049  [pdf, other

    eess.AS cs.CL cs.SD

    Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

    Authors: Ruiqi Li, Zhiqing Hong, Yongqi Wang, Lichao Zhang, Rongjie Huang, Siqi Zheng, Zhou Zhao

    Abstract: Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achie… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Working in progress

  46. arXiv:2407.02040  [pdf, other

    cs.CV cs.AI cs.MM

    ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

    Authors: Zhiyuan Ma, Yuxiang Wei, Yabin Zhang, Xiangyu Zhu, Zhen Lei, Lei Zhang

    Abstract: By leveraging the text-to-image diffusion priors, score distillation can synthesize 3D contents without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have been focused on learning a text-to-3D generative network for amortizing multiple text-3D relations, which can synthesize 3D contents in seconds. However, existing score distillatio… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024. Code available at https://github.com/theEricMa/ScaleDreamer

  47. arXiv:2407.02031  [pdf, other

    cs.DC cs.AI cs.LG

    SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

    Authors: Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang

    Abstract: This paper documents our characterization study and practices for serving text-to-image requests with stable diffusion models in production. We first comprehensively analyze inference request traces for commercial text-to-image applications. It commences with our observation that add-on modules, i.e., ControlNets and LoRAs, that augment the base stable diffusion models, are ubiquitous in generatin… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  48. arXiv:2407.01928  [pdf, other

    cs.CV

    SymPoint Revolutionized: Boosting Panoptic Symbol Spotting with Layer Feature Enhancement

    Authors: Wenlong Liu, Tianyu Yang, Qizhi Yu, Lei Zhang

    Abstract: SymPoint is an initial attempt that utilizes point set representation to solve the panoptic symbol spotting task on CAD drawing. Despite its considerable success, it overlooks graphical layer information and suffers from prohibitively slow training convergence. To tackle this issue, we introduce SymPoint-V2, a robust and efficient solution featuring novel, streamlined designs that overcome these l… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: code at https://github.com/nicehuster/SymPointV2

  49. arXiv:2407.01749  [pdf, other

    cs.LG cs.AI

    Invariant Correlation of Representation with Label

    Authors: Gaojie Jin, Ronghui Mu, Xinping Yi, Xiaowei Huang, Lijun Zhang

    Abstract: The Invariant Risk Minimization (IRM) approach aims to address the challenge of domain generalization by training a feature representation that remains invariant across multiple environments. However, in noisy environments, IRM-related techniques such as IRMv1 and VREx may be unable to achieve the optimal IRM solution, primarily due to erroneous optimization directions. To address this issue, we i… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  50. arXiv:2407.01731  [pdf, other

    cs.CV

    Uncertainty Quantification in Table Structure Recognition

    Authors: Kehinde Ajayi, Leizhen Zhang, Yi He, Jian Wu

    Abstract: Quantifying uncertainties for machine learning models is a critical step to reduce human verification effort by detecting predictions with low confidence. This paper proposes a method for uncertainty quantification (UQ) of table structure recognition (TSR). The proposed UQ method is built upon a mixture-of-expert approach termed Test-Time Augmentation (TTA). Our key idea is to enrich and diversify… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 7 Figures