-
CookAR: Affordance Augmentations in Wearable AR to Support Kitchen Tool Interactions for People with Low Vision
Authors:
Jaewook Lee,
Andrew D. Tjahjadi,
Jiho Kim,
Junpu Yu,
Minji Park,
Jiawen Zhang,
Jon E. Froehlich,
Yapeng Tian,
Yuhang Zhao
Abstract:
Cooking is a central activity of daily living, supporting independence and both mental and physical health. However, prior work has highlighted key barriers for people with low vision (LV) to cook, particularly around safely interacting with cooking tools, such as sharp knives or hot pans. Drawing on recent advancements in computer vision (CV) and robotics, we present CookAR, a head-mounted AR sys…
▽ More
Cooking is a central activity of daily living, supporting independence and both mental and physical health. However, prior work has highlighted key barriers for people with low vision (LV) to cook, particularly around safely interacting with cooking tools, such as sharp knives or hot pans. Drawing on recent advancements in computer vision (CV) and robotics, we present CookAR, a head-mounted AR system with real-time object affordance augmentations to support safe and efficient interactions with kitchen tools. To design and implement CookAR, we manually collected and annotated the first egocentric dataset of kitchen tool affordances, fine-tuned an affordance segmentation model, and leveraged a stereo camera attached to an AR headset to generate the visual augmentations. To validate CookAR, we conducted a technical performance evaluation and a three-part qualitative lab study with ten LV participants. Our technical evaluation demonstrates that our fine-tuned model outperforms the base model on our class-specific dataset, while our user study indicates a preference for affordance augmentations over the traditional whole object augmentations. Code is available at: https://github.com/makeabilitylab/CookAR
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
DeepClair: Utilizing Market Forecasts for Effective Portfolio Selection
Authors:
Donghee Choi,
Jinkyu Kim,
Mogan Gim,
Jinho Lee,
Jaewoo Kang
Abstract:
Utilizing market forecasts is pivotal in optimizing portfolio selection strategies. We introduce DeepClair, a novel framework for portfolio selection. DeepClair leverages a transformer-based time-series forecasting model to predict market trends, facilitating more informed and adaptable portfolio decisions. To integrate the forecasting model into a deep reinforcement learning-driven portfolio sele…
▽ More
Utilizing market forecasts is pivotal in optimizing portfolio selection strategies. We introduce DeepClair, a novel framework for portfolio selection. DeepClair leverages a transformer-based time-series forecasting model to predict market trends, facilitating more informed and adaptable portfolio decisions. To integrate the forecasting model into a deep reinforcement learning-driven portfolio selection framework, we introduced a two-step strategy: first, pre-training the time-series model on market data, followed by fine-tuning the portfolio selection architecture using this model. Additionally, we investigated the optimization technique, Low-Rank Adaptation (LoRA), to enhance the pre-trained forecasting model for fine-tuning in investment scenarios. This work bridges market forecasting and portfolio selection, facilitating the advancement of investment strategies.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Correcting the Mythos of KL-Regularization: Direct Alignment without Overparameterization via Chi-squared Preference Optimization
Authors:
Audrey Huang,
Wenhao Zhan,
Tengyang Xie,
Jason D. Lee,
Wen Sun,
Akshay Krishnamurthy,
Dylan J. Foster
Abstract:
Language model alignment methods, such as reinforcement learning from human feedback (RLHF), have led to impressive advances in language model capabilities, but existing techniques are limited by a widely observed phenomenon known as overoptimization, where the quality of the language model plateaus or degrades over the course of the alignment process. Overoptimization is often attributed to overf…
▽ More
Language model alignment methods, such as reinforcement learning from human feedback (RLHF), have led to impressive advances in language model capabilities, but existing techniques are limited by a widely observed phenomenon known as overoptimization, where the quality of the language model plateaus or degrades over the course of the alignment process. Overoptimization is often attributed to overfitting to an inaccurate reward model, and while it can be mitigated through online data collection, this is infeasible in many settings. This raises a fundamental question: Do existing offline alignment algorithms make the most of the data they have, or can their sample-efficiency be improved further?
We address this question with a new algorithm for offline alignment, $χ^2$-Preference Optimization ($χ$PO). $χ$PO is a one-line change to Direct Preference Optimization (DPO; Rafailov et al., 2023), which only involves modifying the logarithmic link function in the DPO objective. Despite this minimal change, $χ$PO implicitly implements the principle of pessimism in the face of uncertainty via regularization with the $χ^2$-divergence -- which quantifies uncertainty more effectively than KL-regularization -- and provably alleviates overoptimization, achieving sample-complexity guarantees based on single-policy concentrability -- the gold standard in offline reinforcement learning. $χ$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm that is provably robust to overoptimization.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
PG-Rainbow: Using Distributional Reinforcement Learning in Policy Gradient Methods
Authors:
WooJae Jeon,
KanJun Lee,
Jeewoo Lee
Abstract:
This paper introduces PG-Rainbow, a novel algorithm that incorporates a distributional reinforcement learning framework with a policy gradient algorithm. Existing policy gradient methods are sample inefficient and rely on the mean of returns when calculating the state-action value function, neglecting the distributional nature of returns in reinforcement learning tasks. To address this issue, we u…
▽ More
This paper introduces PG-Rainbow, a novel algorithm that incorporates a distributional reinforcement learning framework with a policy gradient algorithm. Existing policy gradient methods are sample inefficient and rely on the mean of returns when calculating the state-action value function, neglecting the distributional nature of returns in reinforcement learning tasks. To address this issue, we use an Implicit Quantile Network that provides the quantile information of the distribution of rewards to the critic network of the Proximal Policy Optimization algorithm. We show empirical results that through the integration of reward distribution information into the policy network, the policy agent acquires enhanced capabilities to comprehensively evaluate the consequences of potential actions in a given state, facilitating more sophisticated and informed decision-making processes. We evaluate the performance of the proposed algorithm in the Atari-2600 game suite, simulated via the Arcade Learning Environment (ALE).
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
On the modification and revocation of open source licences
Authors:
Paul Gagnon,
Misha Benjamin,
Justine Gauthier,
Catherine Regis,
Jenny Lee,
Alexei Nordell-Markovits
Abstract:
Historically, open source commitments have been deemed irrevocable once materials are released under open source licenses. In this paper, the authors argue for the creation of a subset of rights that allows open source contributors to force users to (i) update to the most recent version of a model, (ii) accept new use case restrictions, or even (iii) cease using the software entirely. While this w…
▽ More
Historically, open source commitments have been deemed irrevocable once materials are released under open source licenses. In this paper, the authors argue for the creation of a subset of rights that allows open source contributors to force users to (i) update to the most recent version of a model, (ii) accept new use case restrictions, or even (iii) cease using the software entirely. While this would be a departure from the traditional open source approach, the legal, reputational and moral risks related to open-sourcing AI models could justify contributors having more control over downstream uses. Recent legislative changes have also opened the door to liability of open source contributors in certain cases. The authors believe that contributors would welcome the ability to ensure that downstream users are implementing updates that address issues like bias, guardrail workarounds or adversarial attacks on their contributions. Finally, this paper addresses how this license category would interplay with RAIL licenses, and how it should be operationalized and adopted by key stakeholders such as OSS platforms and scanning tools.
△ Less
Submitted 28 May, 2024;
originally announced July 2024.
-
Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models
Authors:
Jung Hyun Lee,
June Yong Yang,
Byeongho Heo,
Dongyoon Han,
Kang Min Yoo
Abstract:
Large Language Models (LLMs) have demonstrated impressive problem-solving capabilities in mathematics through step-by-step reasoning chains. However, they are susceptible to reasoning errors that impact the quality of subsequent reasoning chains and the final answer due to language models' autoregressive token-by-token generating nature. Recent works have proposed adopting external verifiers to gu…
▽ More
Large Language Models (LLMs) have demonstrated impressive problem-solving capabilities in mathematics through step-by-step reasoning chains. However, they are susceptible to reasoning errors that impact the quality of subsequent reasoning chains and the final answer due to language models' autoregressive token-by-token generating nature. Recent works have proposed adopting external verifiers to guide the generation of reasoning paths, but existing works utilize models that have been trained with step-by-step labels to assess the correctness of token-by-token reasoning chains. Consequently, they struggle to recognize discriminative details of tokens within a reasoning path and lack the ability to evaluate whether an intermediate reasoning path is on a promising track toward the correct final answer. To amend the lack of sound and token-grained math-verification signals, we devise a novel training scheme for verifiers that apply token-level supervision with the expected cumulative reward (i.e., value). Furthermore, we propose a practical formulation of the cumulative reward by reducing it to finding the probability of future correctness of the final answer and thereby enabling the empirical estimation of the value. Experimental results on mathematical reasoning benchmarks show that Token-Supervised Value Model (TVM) can outperform step-by-step verifiers on GSM8K and MATH with Mistral and Llama.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients
Authors:
Dohyung Kim,
Junghyup Lee,
Jeimin Jeon,
Jaehyeon Moon,
Bumsub Ham
Abstract:
Network quantization generally converts full-precision weights and/or activations into low-bit fixed-point values in order to accelerate an inference process. Recent approaches to network quantization further discretize the gradients into low-bit fixed-point values, enabling an efficient training. They typically set a quantization interval using a min-max range of the gradients or adjust the inter…
▽ More
Network quantization generally converts full-precision weights and/or activations into low-bit fixed-point values in order to accelerate an inference process. Recent approaches to network quantization further discretize the gradients into low-bit fixed-point values, enabling an efficient training. They typically set a quantization interval using a min-max range of the gradients or adjust the interval such that the quantization error for entire gradients is minimized. In this paper, we analyze the quantization error of gradients for the low-bit fixed-point training, and show that lowering the error for large-magnitude gradients boosts the quantization performance significantly. Based on this, we derive an upper bound of quantization error for the large gradients in terms of the quantization interval, and obtain an optimal condition for the interval minimizing the quantization error for large gradients. We also introduce an interval update algorithm that adjusts the quantization interval adaptively to maintain a small quantization error for large gradients. Experimental results demonstrate the effectiveness of our quantization method for various combinations of network architectures and bit-widths on various tasks, including image classification, object detection, and super-resolution.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Crafting the Path: Robust Query Rewriting for Information Retrieval
Authors:
Ingeol Baek,
Jimin Lee,
Joonho Yang,
Hwanhee Lee
Abstract:
Query rewriting aims to generate a new query that can complement the original query to improve the information retrieval system. Recent studies on query rewriting, such as query2doc (Q2D), query2expand (Q2E) and querey2cot (Q2C), rely on the internal knowledge of Large Language Models (LLMs) to generate a relevant passage to add information to the query. Nevertheless, the efficacy of these methodo…
▽ More
Query rewriting aims to generate a new query that can complement the original query to improve the information retrieval system. Recent studies on query rewriting, such as query2doc (Q2D), query2expand (Q2E) and querey2cot (Q2C), rely on the internal knowledge of Large Language Models (LLMs) to generate a relevant passage to add information to the query. Nevertheless, the efficacy of these methodologies may markedly decline in instances where the requisite knowledge is not encapsulated within the model's intrinsic parameters. In this paper, we propose a novel structured query rewriting method called Crafting the Path tailored for retrieval systems. Crafting the Path involves a three-step process that crafts query-related information necessary for finding the passages to be searched in each step. Specifically, the Crafting the Path begins with Query Concept Comprehension, proceeds to Query Type Identification, and finally conducts Expected Answer Extraction. Experimental results show that our method outperforms previous rewriting methods, especially in less familiar domains for LLMs. We demonstrate that our method is less dependent on the internal parameter knowledge of the model and generates queries with fewer factual inaccuracies. Furthermore, we observe that Crafting the Path has less latency compared to the baselines.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Graph Signal Processing for Cross-Domain Recommendation
Authors:
Jeongeun Lee,
Seongku Kang,
Won-Yong Shin,
Jeongwhan Choi,
Noseong Park,
Dongha Lee
Abstract:
Cross-domain recommendation (CDR) extends conventional recommender systems by leveraging user-item interactions from dense domains to mitigate data sparsity and the cold start problem. While CDR offers substantial potential for enhancing recommendation performance, most existing CDR methods suffer from sensitivity to the ratio of overlapping users and intrinsic discrepancy between source and targe…
▽ More
Cross-domain recommendation (CDR) extends conventional recommender systems by leveraging user-item interactions from dense domains to mitigate data sparsity and the cold start problem. While CDR offers substantial potential for enhancing recommendation performance, most existing CDR methods suffer from sensitivity to the ratio of overlapping users and intrinsic discrepancy between source and target domains. To overcome these limitations, in this work, we explore the application of graph signal processing (GSP) in CDR scenarios. We propose CGSP, a unified CDR framework based on GSP, which employs a cross-domain similarity graph constructed by flexibly combining target-only similarity and source-bridged similarity. By processing personalized graph signals computed for users from either the source or target domain, our framework effectively supports both inter-domain and intra-domain recommendations. Our empirical evaluation demonstrates that CGSP consistently outperforms various encoder-based CDR approaches in both intra-domain and inter-domain recommendation scenarios, especially when the ratio of overlapping users is low, highlighting its significant practical implication in real-world applications.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Improving Unsupervised Video Object Segmentation via Fake Flow Generation
Authors:
Suhwan Cho,
Minhyeok Lee,
Jungho Lee,
Donghyeong Kim,
Seunghoon Lee,
Sungmin Woo,
Sangyoun Lee
Abstract:
Unsupervised video object segmentation (VOS), also known as video salient object detection, aims to detect the most prominent object in a video at the pixel level. Recently, two-stream approaches that leverage both RGB images and optical flow maps have gained significant attention. However, the limited amount of training data remains a substantial challenge. In this study, we propose a novel data…
▽ More
Unsupervised video object segmentation (VOS), also known as video salient object detection, aims to detect the most prominent object in a video at the pixel level. Recently, two-stream approaches that leverage both RGB images and optical flow maps have gained significant attention. However, the limited amount of training data remains a substantial challenge. In this study, we propose a novel data generation method that simulates fake optical flows from single images, thereby creating large-scale training data for stable network learning. Inspired by the observation that optical flow maps are highly dependent on depth maps, we generate fake optical flows by refining and augmenting the estimated depth maps of each image. By incorporating our simulated image-flow pairs, we achieve new state-of-the-art performance on all public benchmark datasets without relying on complex modules. We believe that our data generation method represents a potential breakthrough for future VOS research.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
Authors:
Jung Hyun Lee,
Jeonghoon Kim,
June Yong Yang,
Se Jung Kwon,
Eunho Yang,
Kang Min Yoo,
Dongsoo Lee
Abstract:
With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language underst…
▽ More
With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ) $-$ a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights, thus boosting the generalization capability of quantized LLMs. We show the superiority of LRQ over prior LLM PTQ works under (i) $8$-bit weight and per-tensor activation quantization, (ii) $4$-bit weight and $8$-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes. Our code is available at \url{https://github.com/onliwad101/FlexRound_LRQ} to inspire LLM researchers and engineers.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Isometric Representation Learning for Disentangled Latent Space of Diffusion Models
Authors:
Jaehoon Hahm,
Junho Lee,
Sunghyun Kim,
Joonseok Lee
Abstract:
The latent space of diffusion model mostly still remains unexplored, despite its great success and potential in the field of generative modeling. In fact, the latent space of existing diffusion models are entangled, with a distorted mapping from its latent space to image space. To tackle this problem, we present Isometric Diffusion, equipping a diffusion model with a geometric regularizer to guide…
▽ More
The latent space of diffusion model mostly still remains unexplored, despite its great success and potential in the field of generative modeling. In fact, the latent space of existing diffusion models are entangled, with a distorted mapping from its latent space to image space. To tackle this problem, we present Isometric Diffusion, equipping a diffusion model with a geometric regularizer to guide the model to learn a geometrically sound latent space of the training data manifold. This approach allows diffusion models to learn a more disentangled latent space, which enables smoother interpolation, more accurate inversion, and more precise control over attributes directly in the latent space. Our extensive experiments consisting of image interpolations, image inversions, and linear editing show the effectiveness of our method.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Navigating the swarm: Deep neural networks command emergent behaviours
Authors:
Dongjo Kim,
Jeongsu Lee,
Ho-Young Kim
Abstract:
Interacting individuals in complex systems often give rise to coherent motion exhibiting coordinated global structures. Such phenomena are ubiquitously observed in nature, from cell migration, bacterial swarms, animal and insect groups, and even human societies. Primary mechanisms responsible for the emergence of collective behavior have been extensively identified, including local alignments base…
▽ More
Interacting individuals in complex systems often give rise to coherent motion exhibiting coordinated global structures. Such phenomena are ubiquitously observed in nature, from cell migration, bacterial swarms, animal and insect groups, and even human societies. Primary mechanisms responsible for the emergence of collective behavior have been extensively identified, including local alignments based on average or relative velocity, non-local pairwise repulsive-attractive interactions such as distance-based potentials, interplay between local and non-local interactions, and cognitive-based inhomogeneous interactions. However, discovering how to adapt these mechanisms to modulate emergent behaviours remains elusive. Here, we demonstrate that it is possible to generate coordinated structures in collective behavior at desired moments with intended global patterns by fine-tuning an inter-agent interaction rule. Our strategy employs deep neural networks, obeying the laws of dynamics, to find interaction rules that command desired collective structures. The decomposition of interaction rules into distancing and aligning forces, expressed by polynomial series, facilitates the training of neural networks to propose desired interaction models. Presented examples include altering the mean radius and size of clusters in vortical swarms, timing of transitions from random to ordered states, and continuously shifting between typical modes of collective motions. This strategy can even be leveraged to superimpose collective modes, resulting in hitherto unexplored but highly practical hybrid collective patterns, such as protective security formations. Our findings reveal innovative strategies for creating and controlling collective motion, paving the way for new applications in robotic swarm operations, active matter organisation, and for the uncovering of obscure interaction rules in biological systems.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
Authors:
George Tsoukalas,
Jasper Lee,
John Jennings,
Jimmy Xin,
Michelle Ding,
Michael Jennings,
Amitayush Thakur,
Swarat Chaudhuri
Abstract:
We present PutnamBench, a new multilingual benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems. PutnamBench consists of 1697 hand-constructed formalizations of 640 theorems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America. All the theorems have formalization…
▽ More
We present PutnamBench, a new multilingual benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems. PutnamBench consists of 1697 hand-constructed formalizations of 640 theorems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America. All the theorems have formalizations in Lean 4 and Isabelle; a substantial subset also has Coq formalizations. Proving the theorems requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses. We use PutnamBench to evaluate several established neural and symbolic theorem-provers. These approaches can only solve a handful of the PutnamBench problems, establishing the benchmark as a difficult open challenge for research on neural theorem-proving. PutnamBench is available at https://github.com/trishullab/PutnamBench.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
The Life Cycle of Large Language Models: A Review of Biases in Education
Authors:
Jinsook Lee,
Yann Hicke,
Renzhe Yu,
Christopher Brooks,
René F. Kizilcec
Abstract:
Large Language Models (LLMs) are increasingly adopted in educational contexts to provide personalized support to students and teachers. The unprecedented capacity of LLM-based applications to understand and generate natural language can potentially improve instructional effectiveness and learning outcomes, but the integration of LLMs in education technology has renewed concerns over algorithmic bi…
▽ More
Large Language Models (LLMs) are increasingly adopted in educational contexts to provide personalized support to students and teachers. The unprecedented capacity of LLM-based applications to understand and generate natural language can potentially improve instructional effectiveness and learning outcomes, but the integration of LLMs in education technology has renewed concerns over algorithmic bias which may exacerbate educational inequities. In this review, building on prior work on mapping the traditional machine learning life cycle, we provide a holistic map of the LLM life cycle from the initial development of LLMs to customizing pre-trained models for various applications in educational settings. We explain each step in the LLM life cycle and identify potential sources of bias that may arise in the context of education. We discuss why current measures of bias from traditional machine learning fail to transfer to LLM-generated content in education, such as tutoring conversations because the text is high-dimensional, there can be multiple correct responses, and tailoring responses may be pedagogically desirable rather than unfair. This review aims to clarify the complex nature of bias in LLM applications and provide practical guidance for their evaluation to promote educational equity.
△ Less
Submitted 3 June, 2024;
originally announced July 2024.
-
Algorithms for College Admissions Decision Support: Impacts of Policy Change and Inherent Variability
Authors:
Jinsook Lee,
Emma Harvey,
Joyce Zhou,
Nikhil Garg,
Thorsten Joachims,
Rene F. Kizilcec
Abstract:
Each year, selective American colleges sort through tens of thousands of applications to identify a first-year class that displays both academic merit and diversity. In the 2023-2024 admissions cycle, these colleges faced unprecedented challenges. First, the number of applications has been steadily growing. Second, test-optional policies that have remained in place since the COVID-19 pandemic limi…
▽ More
Each year, selective American colleges sort through tens of thousands of applications to identify a first-year class that displays both academic merit and diversity. In the 2023-2024 admissions cycle, these colleges faced unprecedented challenges. First, the number of applications has been steadily growing. Second, test-optional policies that have remained in place since the COVID-19 pandemic limit access to key information historically predictive of academic success. Most recently, longstanding debates over affirmative action culminated in the Supreme Court banning race-conscious admissions. Colleges have explored machine learning (ML) models to address the issues of scale and missing test scores, often via ranking algorithms intended to focus on 'top' applicants. However, the Court's ruling will force changes to these models, which were able to consider race as a factor in ranking. There is currently a poor understanding of how these mandated changes will shape applicant ranking algorithms, and, by extension, admitted classes. We seek to address this by quantifying the impact of different admission policies on the applications prioritized for review. We show that removing race data from a developed applicant ranking algorithm reduces the diversity of the top-ranked pool without meaningfully increasing the academic merit of that pool. We contextualize this impact by showing that excluding data on applicant race has a greater impact than excluding other potentially informative variables like intended majors. Finally, we measure the impact of policy change on individuals by comparing the arbitrariness in applicant rank attributable to policy change to the arbitrariness attributable to randomness. We find that any given policy has a high degree of arbitrariness and that removing race data from the ranking algorithm increases arbitrariness in outcomes for most applicants.
△ Less
Submitted 24 June, 2024;
originally announced July 2024.
-
VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
Authors:
Bocheng Zou,
Mu Cai,
Jianrui Zhang,
Yong Jae Lee
Abstract:
In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more c…
▽ More
In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons or sketches. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced at https://vgbench.github.io.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
3D Geometric Shape Assembly via Efficient Point Cloud Matching
Authors:
Nahyuk Lee,
Juhong Min,
Junha Lee,
Seungwook Kim,
Kanghee Lee,
Jaesik Park,
Minsu Cho
Abstract:
Learning to assemble geometric shapes into a larger target structure is a pivotal task in various practical applications. In this work, we tackle this problem by establishing local correspondences between point clouds of part shapes in both coarse- and fine-levels. To this end, we introduce Proxy Match Transform (PMT), an approximate high-order feature transform layer that enables reliable matchin…
▽ More
Learning to assemble geometric shapes into a larger target structure is a pivotal task in various practical applications. In this work, we tackle this problem by establishing local correspondences between point clouds of part shapes in both coarse- and fine-levels. To this end, we introduce Proxy Match Transform (PMT), an approximate high-order feature transform layer that enables reliable matching between mating surfaces of parts while incurring low costs in memory and computation. Building upon PMT, we introduce a new framework, dubbed Proxy Match TransformeR (PMTR), for the geometric assembly task. We evaluate the proposed PMTR on the large-scale 3D geometric shape assembly benchmark dataset of Breaking Bad and demonstrate its superior performance and efficiency compared to state-of-the-art methods. Project page: https://nahyuklee.github.io/pmtr.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Deflated Dynamics Value Iteration
Authors:
Jongmin Lee,
Amin Rakhsha,
Ernest K. Ryu,
Amir-massoud Farahmand
Abstract:
The Value Iteration (VI) algorithm is an iterative procedure to compute the value function of a Markov decision process, and is the basis of many reinforcement learning (RL) algorithms as well. As the error convergence rate of VI as a function of iteration $k$ is $O(γ^k)$, it is slow when the discount factor $γ$ is close to $1$. To accelerate the computation of the value function, we propose Defla…
▽ More
The Value Iteration (VI) algorithm is an iterative procedure to compute the value function of a Markov decision process, and is the basis of many reinforcement learning (RL) algorithms as well. As the error convergence rate of VI as a function of iteration $k$ is $O(γ^k)$, it is slow when the discount factor $γ$ is close to $1$. To accelerate the computation of the value function, we propose Deflated Dynamics Value Iteration (DDVI). DDVI uses matrix splitting and matrix deflation techniques to effectively remove (deflate) the top $s$ dominant eigen-structure of the transition matrix $\mathcal{P}^π$. We prove that this leads to a $\tilde{O}(γ^k |λ_{s+1}|^k)$ convergence rate, where $λ_{s+1}$is $(s+1)$-th largest eigenvalue of the dynamics matrix. We then extend DDVI to the RL setting and present Deflated Dynamics Temporal Difference (DDTD) algorithm. We empirically show the effectiveness of the proposed algorithms.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors
Authors:
Jae Joong Lee,
Bosheng Li,
Sara Beery,
Jonathan Huang,
Songlin Fei,
Raymond A. Yeh,
Bedrich Benes
Abstract:
We introduce Tree D-fusion, featuring the first collection of 600,000 environmentally aware, 3D simulation-ready tree models generated through Diffusion priors. Each reconstructed 3D tree model corresponds to an image from Google's Auto Arborist Dataset, comprising street view images and associated genus labels of trees across North America. Our method distills the scores of two tree-adapted diffu…
▽ More
We introduce Tree D-fusion, featuring the first collection of 600,000 environmentally aware, 3D simulation-ready tree models generated through Diffusion priors. Each reconstructed 3D tree model corresponds to an image from Google's Auto Arborist Dataset, comprising street view images and associated genus labels of trees across North America. Our method distills the scores of two tree-adapted diffusion models by utilizing text prompts to specify a tree genus, thus facilitating shape reconstruction. This process involves reconstructing a 3D tree envelope filled with point markers, which are subsequently utilized to estimate the tree's branching structure using the space colonization algorithm conditioned on a specified genus.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
Disrupting Diffusion-based Inpainters with Semantic Digression
Authors:
Geonho Son,
Juhun Lee,
Simon S. Woo
Abstract:
The fabrication of visual misinformation on the web and social media has increased exponentially with the advent of foundational text-to-image diffusion models. Namely, Stable Diffusion inpainters allow the synthesis of maliciously inpainted images of personal and private figures, and copyrighted contents, also known as deepfakes. To combat such generations, a disruption framework, namely Photogua…
▽ More
The fabrication of visual misinformation on the web and social media has increased exponentially with the advent of foundational text-to-image diffusion models. Namely, Stable Diffusion inpainters allow the synthesis of maliciously inpainted images of personal and private figures, and copyrighted contents, also known as deepfakes. To combat such generations, a disruption framework, namely Photoguard, has been proposed, where it adds adversarial noise to the context image to disrupt their inpainting synthesis. While their framework suggested a diffusion-friendly approach, the disruption is not sufficiently strong and it requires a significant amount of GPU and time to immunize the context image. In our work, we re-examine both the minimal and favorable conditions for a successful inpainting disruption, proposing DDD, a "Digression guided Diffusion Disruption" framework. First, we identify the most adversarially vulnerable diffusion timestep range with respect to the hidden space. Within this scope of noised manifold, we pose the problem as a semantic digression optimization. We maximize the distance between the inpainting instance's hidden states and a semantic-aware hidden state centroid, calibrated both by Monte Carlo sampling of hidden states and a discretely projected optimization in the token space. Effectively, our approach achieves stronger disruption and a higher success rate than Photoguard while lowering the GPU memory requirement, and speeding the optimization up to three times faster.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
Dominant Design Prediction with Phylogenetic Networks
Authors:
Youwei He,
Jeong-Dong Lee,
Dawoon Jeong,
Sungjun Choi,
Jiyong Kim
Abstract:
This study proposes an effective method to predict technology development from an evolutionary perspective. Product evolution is the result of technological evolution and market selection. A phylogenetic network is the main method to study product evolution. The formation of the dominant design determines the trajectory of technology development. How to predict future dominant design has become a…
▽ More
This study proposes an effective method to predict technology development from an evolutionary perspective. Product evolution is the result of technological evolution and market selection. A phylogenetic network is the main method to study product evolution. The formation of the dominant design determines the trajectory of technology development. How to predict future dominant design has become a key issue in technology forecasting and new product development. We define the dominant product and use machine learning methods, combined with product evolutionary theory, to construct a Fully Connected Phylogenetic Network dataset to effectively predict the future dominant design.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
MATE: Meet At The Embedding -- Connecting Images with Long Texts
Authors:
Young Kyun Jang,
Junmo Kang,
Yong Jae Lee,
Donghyun Kim
Abstract:
While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this pape…
▽ More
While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this paper, we introduce Meet At The Embedding (MATE), a novel approach that combines the capabilities of VLMs with Large Language Models (LLMs) to overcome this challenge without the need for additional image-long text pairs. Specifically, we replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts. To bridge the gap between VLM and LLM, MATE incorporates a projection module that is trained in a multi-stage manner. It starts by aligning the embeddings from the VLM text encoder with those from the LLM using extensive text pairs. This module is then employed to seamlessly align image embeddings closely with LLM embeddings. We propose two new cross-modal retrieval benchmarks to assess the task of connecting images with long texts (lengthy captions / documents). Extensive experimental results demonstrate that MATE effectively connects images with long texts, uncovering diverse semantic relationships.
△ Less
Submitted 26 June, 2024;
originally announced July 2024.
-
GPTutor: Great Personalized Tutor with Large Language Models for Personalized Learning Content Generation
Authors:
Eason Chen,
Jia-En Lee,
Jionghao Lin,
Kenneth Koedinger
Abstract:
We developed GPTutor, a pioneering web application designed to revolutionize personalized learning by leveraging the capabilities of Generative AI at scale. GPTutor adapts educational content and practice exercises to align with individual students' interests and career goals, enhancing their engagement and understanding of critical academic concepts. The system uses a serverless architecture to d…
▽ More
We developed GPTutor, a pioneering web application designed to revolutionize personalized learning by leveraging the capabilities of Generative AI at scale. GPTutor adapts educational content and practice exercises to align with individual students' interests and career goals, enhancing their engagement and understanding of critical academic concepts. The system uses a serverless architecture to deliver personalized and scalable learning experiences. By integrating advanced Chain-of-Thoughts prompting methods, GPTutor provides a personalized educational journey that not only addresses the unique interests of each student but also prepares them for future professional success. This demo paper presents the design, functionality, and potential of GPTutor to foster a more engaging and effective educational environment.
△ Less
Submitted 16 May, 2024;
originally announced July 2024.
-
TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models
Authors:
Jeongho Kim,
Min-Jung Kim,
Junsoo Lee,
Jaegul Choo
Abstract:
Pose-driven human-image animation diffusion models have shown remarkable capabilities in realistic human video synthesis. Despite the promising results achieved by previous approaches, challenges persist in achieving temporally consistent animation and ensuring robustness with off-the-shelf pose detectors. In this paper, we present TCAN, a pose-driven human image animation method that is robust to…
▽ More
Pose-driven human-image animation diffusion models have shown remarkable capabilities in realistic human video synthesis. Despite the promising results achieved by previous approaches, challenges persist in achieving temporally consistent animation and ensuring robustness with off-the-shelf pose detectors. In this paper, we present TCAN, a pose-driven human image animation method that is robust to erroneous poses and consistent over time. In contrast to previous methods, we utilize the pre-trained ControlNet without fine-tuning to leverage its extensive pre-acquired knowledge from numerous pose-image-caption pairs. To keep the ControlNet frozen, we adapt LoRA to the UNet layers, enabling the network to align the latent space between the pose and appearance features. Additionally, by introducing an additional temporal layer to the ControlNet, we enhance robustness against outliers of the pose detector. Through the analysis of attention maps over the temporal axis, we also designed a novel temperature map leveraging pose information, allowing for a more static background. Extensive experiments demonstrate that the proposed method can achieve promising results in video synthesis tasks encompassing various poses, like chibi. Project Page: https://eccv2024tcan.github.io/
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
DIOR-ViT: Differential Ordinal Learning Vision Transformer for Cancer Classification in Pathology Images
Authors:
Ju Cheon Lee,
Keunho Byeon,
Boram Song,
Kyungeun Kim,
Jin Tae Kwak
Abstract:
In computational pathology, cancer grading has been mainly studied as a categorical classification problem, which does not utilize the ordering nature of cancer grades such as the higher the grade is, the worse the cancer is. To incorporate the ordering relationship among cancer grades, we introduce a differential ordinal learning problem in which we define and learn the degree of difference in th…
▽ More
In computational pathology, cancer grading has been mainly studied as a categorical classification problem, which does not utilize the ordering nature of cancer grades such as the higher the grade is, the worse the cancer is. To incorporate the ordering relationship among cancer grades, we introduce a differential ordinal learning problem in which we define and learn the degree of difference in the categorical class labels between pairs of samples by using their differences in the feature space. To this end, we propose a transformer-based neural network that simultaneously conducts both categorical classification and differential ordinal classification for cancer grading. We also propose a tailored loss function for differential ordinal learning. Evaluating the proposed method on three different types of cancer datasets, we demonstrate that the adoption of differential ordinal learning can improve the accuracy and reliability of cancer grading, outperforming conventional cancer grading approaches. The proposed approach should be applicable to other diseases and problems as they involve ordinal relationship among class labels.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Probability of Differentiation Reveals Brittleness of Homogeneity Bias in Large Language Models
Authors:
Messi H. J. Lee,
Calvin K. Lai
Abstract:
Homogeneity bias in Large Language Models (LLMs) refers to their tendency to homogenize the representations of some groups compared to others. Previous studies documenting this bias have predominantly used encoder models, which may have inadvertently introduced biases. To address this limitation, we prompted GPT-4 to generate single word/expression completions associated with 18 situation cues - s…
▽ More
Homogeneity bias in Large Language Models (LLMs) refers to their tendency to homogenize the representations of some groups compared to others. Previous studies documenting this bias have predominantly used encoder models, which may have inadvertently introduced biases. To address this limitation, we prompted GPT-4 to generate single word/expression completions associated with 18 situation cues - specific, measurable elements of environments that influence how individuals perceive situations and compared the variability of these completions using probability of differentiation. This approach directly assessed homogeneity bias from the model's outputs, bypassing encoder models. Across five studies, we find that homogeneity bias is highly volatile across situation cues and writing prompts, suggesting that the bias observed in past work may reflect those within encoder models rather than LLMs. Furthermore, these results suggest that homogeneity bias in LLMs is brittle, as even minor and arbitrary changes in prompts can significantly alter the expression of biases. Future work should further explore how variations in syntactic features and topic choices in longer text generations influence homogeneity bias in LLMs.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization
Authors:
Jeongseok Hyun,
Su Ho Han,
Hyolim Kang,
Joon-Young Lee,
Seon Joo Kim
Abstract:
The vocabulary size in temporal action localization (TAL) is constrained by the scarcity of large-scale annotated datasets. To address this, recent works incorporate powerful pre-trained vision-language models (VLMs), such as CLIP, to perform open-vocabulary TAL (OV-TAL). However, unlike VLMs trained on extensive image/video-text pairs, existing OV-TAL methods still rely on small, fully labeled TA…
▽ More
The vocabulary size in temporal action localization (TAL) is constrained by the scarcity of large-scale annotated datasets. To address this, recent works incorporate powerful pre-trained vision-language models (VLMs), such as CLIP, to perform open-vocabulary TAL (OV-TAL). However, unlike VLMs trained on extensive image/video-text pairs, existing OV-TAL methods still rely on small, fully labeled TAL datasets for training an action localizer. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our self-training approach consists of two stages. First, a class-agnostic action localizer is trained on a human-labeled TAL dataset and used to generate pseudo-labels for unlabeled videos. Second, the large-scale pseudo-labeled dataset is combined with the human-labeled dataset to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we highlighted issues with existing OV-TAL evaluation schemes and proposed a new evaluation protocol. Code is released at https://github.com/HYUNJS/STOV-TAL
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Forcing quasirandomness with 4-point permutations
Authors:
Daniel Kráľ,
Jae-baek Lee,
Jonathan A. Noel
Abstract:
A combinatorial object is said to be quasirandom if it exhibits certain properties that are typically seen in a truly random object of the same kind. It is known that a permutation is quasirandom if and only if the pattern density of each of the twenty-four 4-point permutations is close to 1/24, which is its expected value in a random permutation. In other words, the set of all twenty-four 4-point…
▽ More
A combinatorial object is said to be quasirandom if it exhibits certain properties that are typically seen in a truly random object of the same kind. It is known that a permutation is quasirandom if and only if the pattern density of each of the twenty-four 4-point permutations is close to 1/24, which is its expected value in a random permutation. In other words, the set of all twenty-four 4-point permutations is quasirandom-forcing. Moreover, it is known that there exist sets of eight 4-point permutations that are also quasirandom-forcing. Breaking the barrier of linear dependency of perturbation gradients, we show that every quasirandom-forcing set of 4-point permutations must have cardinality at least five.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Sparse-DeRF: Deblurred Neural Radiance Fields from Sparse View
Authors:
Dogyoon Lee,
Donghyeong Kim,
Jungho Lee,
Minhyeok Lee,
Seunghoon Lee,
Sangyoun Lee
Abstract:
Recent studies construct deblurred neural radiance fields (DeRF) using dozens of blurry images, which are not practical scenarios if only a limited number of blurry images are available. This paper focuses on constructing DeRF from sparse-view for more pragmatic real-world scenarios. As observed in our experiments, establishing DeRF from sparse views proves to be a more challenging problem due to…
▽ More
Recent studies construct deblurred neural radiance fields (DeRF) using dozens of blurry images, which are not practical scenarios if only a limited number of blurry images are available. This paper focuses on constructing DeRF from sparse-view for more pragmatic real-world scenarios. As observed in our experiments, establishing DeRF from sparse views proves to be a more challenging problem due to the inherent complexity arising from the simultaneous optimization of blur kernels and NeRF from sparse view. Sparse-DeRF successfully regularizes the complicated joint optimization, presenting alleviated overfitting artifacts and enhanced quality on radiance fields. The regularization consists of three key components: Surface smoothness, helps the model accurately predict the scene structure utilizing unseen and additional hidden rays derived from the blur kernel based on statistical tendencies of real-world; Modulated gradient scaling, helps the model adjust the amount of the backpropagated gradient according to the arrangements of scene objects; Perceptual distillation improves the perceptual quality by overcoming the ill-posed multi-view inconsistency of image deblurring and distilling the pre-filtered information, compensating for the lack of clean information in blurry images. We demonstrate the effectiveness of the Sparse-DeRF with extensive quantitative and qualitative experimental results by training DeRF from 2-view, 4-view, and 6-view blurry images.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Authors:
Weijia Shi,
Jaechan Lee,
Yangsibo Huang,
Sadhika Malladi,
Jieyu Zhao,
Ari Holtzman,
Daogao Liu,
Luke Zettlemoyer,
Noah A. Smith,
Chiyuan Zhang
Abstract:
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approxim…
▽ More
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approximate unlearning algorithms. The evaluation of the efficacy of these algorithms has traditionally been narrow in scope, failing to precisely quantify the success and practicality of the algorithm from the perspectives of both the model deployers and the data owners. We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3) no privacy leakage, (4) utility preservation on data not intended for removal, (5) scalability with respect to the size of removal requests, and (6) sustainability over sequential unlearning requests. Using these criteria, we benchmark how effectively eight popular unlearning algorithms on 7B-parameter LMs can unlearn Harry Potter books and news articles. Our results demonstrate that most algorithms can prevent verbatim memorization and knowledge memorization to varying degrees, but only one algorithm does not lead to severe privacy leakage. Furthermore, existing algorithms fail to meet deployer's expectations because they often degrade general model utility and also cannot sustainably accommodate successive unlearning requests or large-scale content removal. Our findings identify key issues with the practicality of existing unlearning algorithms on language models, and we release our benchmark to facilitate further evaluations: muse-bench.github.io
△ Less
Submitted 14 July, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models
Authors:
Messi H. J. Lee,
Jacob M. Montgomery,
Calvin K. Lai
Abstract:
Vision Language Models (VLMs), exemplified by GPT-4V, adeptly integrate text and vision modalities. This integration enhances Large Language Models' ability to mimic human perception, allowing them to process image inputs. Despite VLMs' advanced capabilities, however, there is a concern that VLMs inherit biases of both modalities in ways that make biases more pervasive and difficult to mitigate. O…
▽ More
Vision Language Models (VLMs), exemplified by GPT-4V, adeptly integrate text and vision modalities. This integration enhances Large Language Models' ability to mimic human perception, allowing them to process image inputs. Despite VLMs' advanced capabilities, however, there is a concern that VLMs inherit biases of both modalities in ways that make biases more pervasive and difficult to mitigate. Our study explores how VLMs perpetuate homogeneity bias and trait associations with regards to race and gender. When prompted to write stories based on images of human faces, GPT-4V describes subordinate racial and gender groups with greater homogeneity than dominant groups and relies on distinct, yet generally positive, stereotypes. Importantly, VLM stereotyping is driven by visual cues rather than group membership alone such that faces that are rated as more prototypically Black and feminine are subject to greater stereotyping. These findings suggest that VLMs may associate subtle visual cues related to racial and gender groups with stereotypes in ways that could be challenging to mitigate. We explore the underlying reasons behind this behavior and discuss its implications and emphasize the importance of addressing these biases as VLMs come to mirror human perception.
△ Less
Submitted 21 May, 2024;
originally announced July 2024.
-
Scaling Exponents Across Parameterizations and Optimizers
Authors:
Katie Everett,
Lechao Xiao,
Mitchell Wortsman,
Alexander A. Alemi,
Roman Novak,
Peter J. Liu,
Izzeddin Gur,
Jascha Sohl-Dickstein,
Leslie Pack Kaelbling,
Jaehoon Lee,
Jeffrey Pennington
Abstract:
Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results unde…
▽ More
Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
△ Less
Submitted 16 July, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
Read, Watch and Scream! Sound Generation from Text and Video
Authors:
Yujin Jeong,
Yunji Kim,
Sanghyuk Chun,
Jiyoung Lee
Abstract:
Multimodal generative models have shown impressive advances with the help of powerful diffusion models. Despite the progress, generating sound solely from text poses challenges in ensuring comprehensive scene depiction and temporal alignment. Meanwhile, video-to-sound generation limits the flexibility to prioritize sound synthesis for specific objects within the scene. To tackle these challenges,…
▽ More
Multimodal generative models have shown impressive advances with the help of powerful diffusion models. Despite the progress, generating sound solely from text poses challenges in ensuring comprehensive scene depiction and temporal alignment. Meanwhile, video-to-sound generation limits the flexibility to prioritize sound synthesis for specific objects within the scene. To tackle these challenges, we propose a novel video-and-text-to-sound generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Our method estimates the structural information of audio (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-sound model to consolidate the video control, which is much more efficient for training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency. Our demo is available at https://naver-ai.github.io/rewas
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation
Authors:
Jin Woo Lee,
Jaehyun Park,
Min Jun Choi,
Kyogu Lee
Abstract:
While significant advancements have been made in music generation and differentiable sound synthesis within machine learning and computer audition, the simulation of instrument vibration guided by physical laws has been underexplored. To address this gap, we introduce a novel model for simulating the spatio-temporal motion of nonlinear strings, integrating modal synthesis and spectral modeling wit…
▽ More
While significant advancements have been made in music generation and differentiable sound synthesis within machine learning and computer audition, the simulation of instrument vibration guided by physical laws has been underexplored. To address this gap, we introduce a novel model for simulating the spatio-temporal motion of nonlinear strings, integrating modal synthesis and spectral modeling within a neural network framework. Our model leverages physical properties and fundamental frequencies as inputs, outputting string states across time and space that solve the partial differential equation characterizing the nonlinear string. Empirical evaluations demonstrate that the proposed architecture achieves superior accuracy in string motion simulation compared to existing baseline architectures. The code and demo are available online.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
Neural varifolds: an aggregate representation for quantifying the geometry of point clouds
Authors:
Juheon Lee,
Xiaohao Cai,
Carola-Bibian Schönlieb,
Simon Masnou
Abstract:
Point clouds are popular 3D representations for real-life objects (such as in LiDAR and Kinect) due to their detailed and compact representation of surface-based geometry. Recent approaches characterise the geometry of point clouds by bringing deep learning based techniques together with geometric fidelity metrics such as optimal transportation costs (e.g., Chamfer and Wasserstein metrics). In thi…
▽ More
Point clouds are popular 3D representations for real-life objects (such as in LiDAR and Kinect) due to their detailed and compact representation of surface-based geometry. Recent approaches characterise the geometry of point clouds by bringing deep learning based techniques together with geometric fidelity metrics such as optimal transportation costs (e.g., Chamfer and Wasserstein metrics). In this paper, we propose a new surface geometry characterisation within this realm, namely a neural varifold representation of point clouds. Here the surface is represented as a measure/distribution over both point positions and tangent spaces of point clouds. The varifold representation quantifies not only the surface geometry of point clouds through the manifold-based discrimination, but also subtle geometric consistencies on the surface due to the combined product space. This study proposes neural varifold algorithms to compute the varifold norm between two point clouds using neural networks on point clouds and their neural tangent kernel representations. The proposed neural varifold is evaluated on three different sought-after tasks -- shape matching, few-shot shape classification and shape reconstruction. Detailed evaluation and comparison to the state-of-the-art methods demonstrate that the proposed versatile neural varifold is superior in shape matching and few-shot shape classification, and is competitive for shape reconstruction.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images
Authors:
Jisu Shin,
Junmyeong Lee,
Seongmin Lee,
Min-Gyu Park,
Ju-Mi Kang,
Ju Hong Yoon,
Hae-Gon Jeon
Abstract:
We present a novel framework for reconstructing animatable human avatars from multiple images, termed CanonicalFusion. Our central concept involves integrating individual reconstruction results into the canonical space. To be specific, we first predict Linear Blend Skinning (LBS) weight maps and depth maps using a shared-encoder-dual-decoder network, enabling direct canonicalization of the 3D mesh…
▽ More
We present a novel framework for reconstructing animatable human avatars from multiple images, termed CanonicalFusion. Our central concept involves integrating individual reconstruction results into the canonical space. To be specific, we first predict Linear Blend Skinning (LBS) weight maps and depth maps using a shared-encoder-dual-decoder network, enabling direct canonicalization of the 3D mesh from the predicted depth maps. Here, instead of predicting high-dimensional skinning weights, we infer compressed skinning weights, i.e., 3-dimensional vector, with the aid of pre-trained MLP networks. We also introduce a forward skinning-based differentiable rendering scheme to merge the reconstructed results from multiple images. This scheme refines the initial mesh by reposing the canonical mesh via the forward skinning and by minimizing photometric and geometric errors between the rendered and the predicted results. Our optimization scheme considers the position and color of vertices as well as the joint angles for each image, thereby mitigating the negative effects of pose errors. We conduct extensive experiments to demonstrate the effectiveness of our method and compare our CanonicalFusion with state-of-the-art methods. Our source codes are available at https://github.com/jsshin98/CanonicalFusion.
△ Less
Submitted 15 July, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
Variational Partial Group Convolutions for Input-Aware Partial Equivariance of Rotations and Color-Shifts
Authors:
Hyunsu Kim,
Yegon Kim,
Hongseok Yang,
Juho Lee
Abstract:
Group Equivariant CNNs (G-CNNs) have shown promising efficacy in various tasks, owing to their ability to capture hierarchical features in an equivariant manner. However, their equivariance is fixed to the symmetry of the whole group, limiting adaptability to diverse partial symmetries in real-world datasets, such as limited rotation symmetry of handwritten digit images and limited color-shift sym…
▽ More
Group Equivariant CNNs (G-CNNs) have shown promising efficacy in various tasks, owing to their ability to capture hierarchical features in an equivariant manner. However, their equivariance is fixed to the symmetry of the whole group, limiting adaptability to diverse partial symmetries in real-world datasets, such as limited rotation symmetry of handwritten digit images and limited color-shift symmetry of flower images. Recent efforts address this limitation, one example being Partial G-CNN which restricts the output group space of convolution layers to break full equivariance. However, such an approach still fails to adjust equivariance levels across data. In this paper, we propose a novel approach, Variational Partial G-CNN (VP G-CNN), to capture varying levels of partial equivariance specific to each data instance. VP G-CNN redesigns the distribution of the output group elements to be conditioned on input data, leveraging variational inference to avoid overfitting. This enables the model to adjust its equivariance levels according to the needs of individual data points. Additionally, we address training instability inherent in discrete group equivariance models by redesigning the reparametrizable distribution. We demonstrate the effectiveness of VP G-CNN on both toy and real-world datasets, including MNIST67-180, CIFAR10, ColorMNIST, and Flowers102. Our results show robust performance, even in uncertainty metrics.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Detect Closer Surfaces that can be Seen: New Modeling and Evaluation in Cross-domain 3D Object Detection
Authors:
Ruixiao Zhang,
Yihong Wu,
Juheon Lee,
Adam Prugel-Bennett,
Xiaohao Cai
Abstract:
The performance of domain adaptation technologies has not yet reached an ideal level in the current 3D object detection field for autonomous driving, which is mainly due to significant differences in the size of vehicles, as well as the environments they operate in when applied across domains. These factors together hinder the effective transfer and application of knowledge learned from specific d…
▽ More
The performance of domain adaptation technologies has not yet reached an ideal level in the current 3D object detection field for autonomous driving, which is mainly due to significant differences in the size of vehicles, as well as the environments they operate in when applied across domains. These factors together hinder the effective transfer and application of knowledge learned from specific datasets. Since the existing evaluation metrics are initially designed for evaluation on a single domain by calculating the 2D or 3D overlap between the prediction and ground-truth bounding boxes, they often suffer from the overfitting problem caused by the size differences among datasets. This raises a fundamental question related to the evaluation of the 3D object detection models' cross-domain performance: Do we really need models to maintain excellent performance in their original 3D bounding boxes after being applied across domains? From a practical application perspective, one of our main focuses is actually on preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the size of vehicles is much more difficult. In other words, as long as a model can accurately identify the closest surfaces to the ego vehicle, it is sufficient to effectively avoid obstacles. In this paper, we propose two metrics to measure 3D object detection models' ability of detecting the closer surfaces to the sensor on the ego vehicle, which can be used to evaluate their cross-domain performance more comprehensively and reasonably. Furthermore, we propose a refinement head, named EdgeHead, to guide models to focus more on the learnable closer surfaces, which can greatly improve the cross-domain performance of existing models not only under our new metrics, but even also under the original BEV/3D metrics.
△ Less
Submitted 12 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
-
CRiM-GS: Continuous Rigid Motion-Aware Gaussian Splatting from Motion Blur Images
Authors:
Junghe Lee,
Donghyeong Kim,
Dogyoon Lee,
Suhwan Cho,
Sangyoun Lee
Abstract:
Neural radiance fields (NeRFs) have received significant attention due to their high-quality novel view rendering ability, prompting research to address various real-world cases. One critical challenge is the camera motion blur caused by camera movement during exposure time, which prevents accurate 3D scene reconstruction. In this study, we propose continuous rigid motion-aware gaussian splatting…
▽ More
Neural radiance fields (NeRFs) have received significant attention due to their high-quality novel view rendering ability, prompting research to address various real-world cases. One critical challenge is the camera motion blur caused by camera movement during exposure time, which prevents accurate 3D scene reconstruction. In this study, we propose continuous rigid motion-aware gaussian splatting (CRiM-GS) to reconstruct accurate 3D scene from blurry images with real-time rendering speed. Considering the actual camera motion blurring process, which consists of complex motion patterns, we predict the continuous movement of the camera based on neural ordinary differential equations (ODEs). Specifically, we leverage rigid body transformations to model the camera motion with proper regularization, preserving the shape and size of the object. Furthermore, we introduce a continuous deformable 3D transformation in the \textit{SE(3)} field to adapt the rigid body transformation to real-world problems by ensuring a higher degree of freedom. By revisiting fundamental camera theory and employing advanced neural network training techniques, we achieve accurate modeling of continuous camera trajectories. We conduct extensive experiments, demonstrating state-of-the-art performance both quantitatively and qualitatively on benchmark datasets.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Green Multigrid Network
Authors:
Ye Lin,
Young Ju Lee,
Jiwei Jia
Abstract:
GreenLearning networks (GL) directly learn Green's function in physical space, making them an interpretable model for capturing unknown solution operators of partial differential equations (PDEs). For many PDEs, the corresponding Green's function exhibits asymptotic smoothness. In this paper, we propose a framework named Green Multigrid networks (GreenMGNet), an operator learning algorithm designe…
▽ More
GreenLearning networks (GL) directly learn Green's function in physical space, making them an interpretable model for capturing unknown solution operators of partial differential equations (PDEs). For many PDEs, the corresponding Green's function exhibits asymptotic smoothness. In this paper, we propose a framework named Green Multigrid networks (GreenMGNet), an operator learning algorithm designed for a class of asymptotically smooth Green's functions.
Compared with the pioneering GL, the new framework presents itself with better accuracy and efficiency, thereby achieving a significant improvement. GreenMGNet is composed of two technical novelties. First, Green's function is modeled as a piecewise function to take into account its singular behavior in some parts of the hyperplane. Such piecewise function is then approximated by a neural network with augmented output(AugNN) so that it can capture singularity accurately. Second, the asymptotic smoothness property of Green's function is used to leverage the Multi-Level Multi-Integration (MLMI) algorithm for both the training and inference stages. Several test cases of operator learning are presented to demonstrate the accuracy and effectiveness of the proposed method. On average, GreenMGNet achieves $3.8\%$ to $39.15\%$ accuracy improvement. To match the accuracy level of GL, GreenMGNet requires only about $10\%$ of the full grid data, resulting in a $55.9\%$ and $92.5\%$ reduction in training time and GPU memory cost for one-dimensional test problems, and a $37.7\%$ and $62.5\%$ reduction for two-dimensional test problems.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment
Authors:
Janghwan Lee,
Seongmin Park,
Sukjin Hong,
Minsoo Kim,
Du-Seong Chang,
Jungwook Choi
Abstract:
The rapid advancement of large language models (LLMs) has facilitated their transformation into conversational chatbots that can grasp contextual nuances and generate pertinent sentences, closely mirroring human values through advanced techniques such as instruction tuning and reinforcement learning from human feedback (RLHF). However, the computational efficiency required for LLMs, achieved throu…
▽ More
The rapid advancement of large language models (LLMs) has facilitated their transformation into conversational chatbots that can grasp contextual nuances and generate pertinent sentences, closely mirroring human values through advanced techniques such as instruction tuning and reinforcement learning from human feedback (RLHF). However, the computational efficiency required for LLMs, achieved through techniques like post-training quantization (PTQ), presents challenges such as token-flipping that can impair chatbot performance. In response, we propose a novel preference alignment approach, quantization-aware direct preference optimization (QDPO), that aligns quantized LLMs with their full-precision counterparts, improving conversational abilities. Evaluated on two instruction-tuned LLMs in various languages, QDPO demonstrated superior performance in improving conversational abilities compared to established PTQ and knowledge-distillation fine-tuning techniques, marking a significant step forward in the development of efficient and effective conversational LLMs.
△ Less
Submitted 18 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA
Authors:
Xuqi Zhu,
Huaizhi Zhang,
JunKyu Lee,
Jiacheng Zhu,
Chandrajit Pal,
Sangeet Saha,
Klaus D. McDonald-Maier,
Xiaojun Zhai
Abstract:
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations, constituting the predominant computational cost. Therefore, this paper proposes a high-throughput, scalable and energy efficient non-element-wise matrix multiplication unit on FPGAs as a basic component of the NNs. We firstly streamline inter-layer and intra-layer redundancies of MAD…
▽ More
Modern Neural Network (NN) architectures heavily rely on vast numbers of multiply-accumulate arithmetic operations, constituting the predominant computational cost. Therefore, this paper proposes a high-throughput, scalable and energy efficient non-element-wise matrix multiplication unit on FPGAs as a basic component of the NNs. We firstly streamline inter-layer and intra-layer redundancies of MADDNESS algorithm, a LUT-based approximate matrix multiplication, to design a fast, efficient scalable approximate matrix multiplication module termed "Approximate Multiplication Unit (AMU)". The AMU optimizes LUT-based matrix multiplications further through dedicated memory management and access design, decoupling computational overhead from input resolution and boosting FPGA-based NN accelerator efficiency significantly. The experimental results show that using our AMU achieves up to 9x higher throughput and 112x higher energy efficiency over the state-of-the-art solutions for the FPGA-based Quantised Neural Network (QNN) accelerators.
△ Less
Submitted 7 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
Safe CoR: A Dual-Expert Approach to Integrating Imitation Learning and Safe Reinforcement Learning Using Constraint Rewards
Authors:
Hyeokjin Kwon,
Gunmin Lee,
Junseo Lee,
Songhwai Oh
Abstract:
In the realm of autonomous agents, ensuring safety and reliability in complex and dynamic environments remains a paramount challenge. Safe reinforcement learning addresses these concerns by introducing safety constraints, but still faces challenges in navigating intricate environments such as complex driving situations. To overcome these challenges, we present the safe constraint reward (Safe CoR)…
▽ More
In the realm of autonomous agents, ensuring safety and reliability in complex and dynamic environments remains a paramount challenge. Safe reinforcement learning addresses these concerns by introducing safety constraints, but still faces challenges in navigating intricate environments such as complex driving situations. To overcome these challenges, we present the safe constraint reward (Safe CoR) framework, a novel method that utilizes two types of expert demonstrations$\unicode{x2013}$reward expert demonstrations focusing on performance optimization and safe expert demonstrations prioritizing safety. By exploiting a constraint reward (CoR), our framework guides the agent to balance performance goals of reward sum with safety constraints. We test the proposed framework in diverse environments, including the safety gym, metadrive, and the real$\unicode{x2013}$world Jackal platform. Our proposed framework enhances the performance of algorithms by $39\%$ and reduces constraint violations by $88\%$ on the real-world Jackal platform, demonstrating the framework's efficacy. Through this innovative approach, we expect significant advancements in real-world performance, leading to transformative effects in the realm of safe and reliable autonomous agents.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Efficient Extrinsic Self-Calibration of Multiple IMUs using Measurement Subset Selection
Authors:
Jongwon Lee,
David Hanley,
Timothy Bretl
Abstract:
This paper addresses the problem of choosing a sparse subset of measurements for quick calibration parameter estimation. A standard solution to this is selecting a measurement only if its utility -- the difference between posterior (with the measurement) and prior information (without the measurement) -- exceeds some threshold. Theoretically, utility, a function of the parameter estimate, should b…
▽ More
This paper addresses the problem of choosing a sparse subset of measurements for quick calibration parameter estimation. A standard solution to this is selecting a measurement only if its utility -- the difference between posterior (with the measurement) and prior information (without the measurement) -- exceeds some threshold. Theoretically, utility, a function of the parameter estimate, should be evaluated at the estimate obtained with all measurements selected so far, hence necessitating a recalibration with each new measurement. However, we hypothesize that utility is insensitive to changes in the parameter estimate for many systems of interest, suggesting that evaluating utility at some initial parameter guess would yield equivalent results in practice. We provide evidence supporting this hypothesis for extrinsic calibration of multiple inertial measurement units (IMUs), showing the reduction in calibration time by two orders of magnitude by forgoing recalibration for each measurement.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
SPARKLE: Enhancing SPARQL Generation with Direct KG Integration in Decoding
Authors:
Jaebok Lee,
Hyeonjeong Shin
Abstract:
Existing KBQA methods have traditionally relied on multi-stage methodologies, involving tasks such as entity linking, subgraph retrieval and query structure generation. However, multi-stage approaches are dependent on the accuracy of preceding steps, leading to cascading errors and increased inference time. Although a few studies have explored the use of end-to-end models, they often suffer from l…
▽ More
Existing KBQA methods have traditionally relied on multi-stage methodologies, involving tasks such as entity linking, subgraph retrieval and query structure generation. However, multi-stage approaches are dependent on the accuracy of preceding steps, leading to cascading errors and increased inference time. Although a few studies have explored the use of end-to-end models, they often suffer from lower accuracy and generate inoperative query that is not supported by the underlying data. Furthermore, most prior approaches are limited to the static training data, potentially overlooking the evolving nature of knowledge bases over time. To address these challenges, we present a novel end-to-end natural language to SPARQL framework, SPARKLE. Notably SPARKLE leverages the structure of knowledge base directly during the decoding, effectively integrating knowledge into the query generation. Our study reveals that simply referencing knowledge base during inference significantly reduces the occurrence of inexecutable query generations. SPARKLE achieves new state-of-the-art results on SimpleQuestions-Wiki and highest F1 score on LCQuAD 1.0 (among models not using gold entities), while getting slightly lower result on the WebQSP dataset. Finally, we demonstrate SPARKLE's fast inference speed and its ability to adapt when the knowledge base differs between the training and inference stages.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
Guided Trajectory Generation with Diffusion Models for Offline Model-based Optimization
Authors:
Taeyoung Yun,
Sujin Yun,
Jaewoo Lee,
Jinkyoo Park
Abstract:
Optimizing complex and high-dimensional black-box functions is ubiquitous in science and engineering fields. Unfortunately, the online evaluation of these functions is restricted due to time and safety constraints in most cases. In offline model-based optimization (MBO), we aim to find a design that maximizes the target function using only a pre-existing offline dataset. While prior methods consid…
▽ More
Optimizing complex and high-dimensional black-box functions is ubiquitous in science and engineering fields. Unfortunately, the online evaluation of these functions is restricted due to time and safety constraints in most cases. In offline model-based optimization (MBO), we aim to find a design that maximizes the target function using only a pre-existing offline dataset. While prior methods consider forward or inverse approaches to address the problem, these approaches are limited by conservatism and the difficulty of learning highly multi-modal mappings. Recently, there has been an emerging paradigm of learning to improve solutions with synthetic trajectories constructed from the offline dataset. In this paper, we introduce a novel conditional generative modeling approach to produce trajectories toward high-scoring regions. First, we construct synthetic trajectories toward high-scoring regions using the dataset while injecting locality bias for consistent improvement directions. Then, we train a conditional diffusion model to generate trajectories conditioned on their scores. Lastly, we sample multiple trajectories from the trained model with guidance to explore high-scoring regions beyond the dataset and select high-fidelity designs among generated trajectories with the proxy function. Extensive experiment results demonstrate that our method outperforms competitive baselines on Design-Bench and its practical variants. The code is publicly available in \texttt{https://github.com/dbsxodud-11/GTG}.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
Deep Learning Models for Flapping Fin Unmanned Underwater Vehicle Control System Gait Optimization
Authors:
Brian Zhou,
Kamal Viswanath,
Jason Geder,
Alisha Sharma,
Julian Lee
Abstract:
The last few decades have led to the rise of research focused on propulsion and control systems for bio-inspired unmanned underwater vehicles (UUVs), which provide more maneuverable alternatives to traditional UUVs in underwater missions. Recent work has explored the use of time-series neural network surrogate models to predict thrust and power from vehicle design and fin kinematics. We develop a…
▽ More
The last few decades have led to the rise of research focused on propulsion and control systems for bio-inspired unmanned underwater vehicles (UUVs), which provide more maneuverable alternatives to traditional UUVs in underwater missions. Recent work has explored the use of time-series neural network surrogate models to predict thrust and power from vehicle design and fin kinematics. We develop a search-based inverse model that leverages kinematics-to-thrust and kinematics-to-power neural network models for control system design. Our inverse model finds a set of fin kinematics with the multi-objective goal of reaching a target thrust under power constraints while creating a smooth kinematics transition between flapping cycles. We demonstrate how a control system integrating this inverse model can make online, cycle-to-cycle adjustments to prioritize different system objectives, with improvements in increasing thrust generation or reducing power consumption of any given movement upwards of 0.5 N and 3.0 W in a range of 2.2 N and 9.0 W. As propulsive efficiency is of utmost importance for flapping-fin UUVs in order to extend their range and endurance for essential operations but lacks prior research, we develop a non-dimensional figure of merit (FOM), derived from measures of propulsive efficiency, that is able to evaluate different fin designs and kinematics, and allow for comparison with other bio-inspired platforms. We use the developed FOM to analyze optimal gaits and compare the performance between different fin materials, providing a better understanding of how fin materials affect thrust generation and propulsive efficiency and allowing us to inform control systems and weight for efficiency on the developed inverse gait-selector model.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Locate&Edit: Energy-based Text Editing for Efficient, Flexible, and Faithful Controlled Text Generation
Authors:
Hye Ryung Son,
Jay-Yoon Lee
Abstract:
Recent approaches to controlled text generation (CTG) often involve manipulating the weights or logits of base language models (LMs) at decoding time. However, these methods are inapplicable to latest black-box LMs and ineffective at preserving the core semantics of the base LM's original generations. In this work, we propose Locate&Edit(L&E), an efficient and flexible energy-based approach to CTG…
▽ More
Recent approaches to controlled text generation (CTG) often involve manipulating the weights or logits of base language models (LMs) at decoding time. However, these methods are inapplicable to latest black-box LMs and ineffective at preserving the core semantics of the base LM's original generations. In this work, we propose Locate&Edit(L&E), an efficient and flexible energy-based approach to CTG, which edits text outputs from a base LM using off-the-shelf energy models. Given text outputs from the base LM, L&E first locates spans that are most relevant to constraints (e.g., toxicity) utilizing energy models, and then edits these spans by replacing them with more suitable alternatives. Importantly, our method is compatible with black-box LMs, as it requires only the text outputs. Also, since L&E doesn't mandate specific architecture for its component models, it can work with a diverse combination of available off-the-shelf models. Moreover, L&E preserves the base LM's original generations, by selectively modifying constraint-related aspects of the texts and leaving others unchanged. These targeted edits also ensure that L&E operates efficiently. Our experiments confirm that L&E achieves superior semantic preservation of the base LM generations and speed, while simultaneously obtaining competitive or improved constraint satisfaction. Furthermore, we analyze how the granularity of energy distribution impacts CTG performance and find that fine-grained, regression-based energy models improve constraint satisfaction, compared to conventional binary classifier energy models.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models
Authors:
Jaeyoung Lee,
Ximing Lu,
Jack Hessel,
Faeze Brahman,
Youngjae Yu,
Yonatan Bisk,
Yejin Choi,
Saadia Gabriel
Abstract:
Given the growing influx of misinformation across news and social media, there is a critical need for systems that can provide effective real-time verification of news claims. Large language or multimodal model based verification has been proposed to scale up online policing mechanisms for mitigating spread of false and harmful content. While these can potentially reduce burden on human fact-check…
▽ More
Given the growing influx of misinformation across news and social media, there is a critical need for systems that can provide effective real-time verification of news claims. Large language or multimodal model based verification has been proposed to scale up online policing mechanisms for mitigating spread of false and harmful content. While these can potentially reduce burden on human fact-checkers, such efforts may be hampered by foundation model training data becoming outdated. In this work, we test the limits of improving foundation model performance without continual updating through an initial study of knowledge transfer using either existing intra- and inter- domain benchmarks or explanations generated from large language models (LLMs). We evaluate on 12 public benchmarks for fact-checking and misinformation detection as well as two other tasks relevant to content moderation -- toxicity and stance detection. Our results on two recent multi-modal fact-checking benchmarks, Mocheg and Fakeddit, indicate that knowledge transfer strategies can improve Fakeddit performance over the state-of-the-art by up to 1.7% and Mocheg performance by up to 2.9%.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.