Showing 1–50 of 56 results for author: Lin, B Y

  1. arXiv:2407.10457  [pdf, other]

    cs.CL cs.AI

    The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

    Authors: Yifan Song, Guoyin Wang, Sujian Li, Bill Yuchen Lin

    Abstract: Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the performance differences between greedy decoding and sampling, identifying benchmarks' consistency regarding…

    Submitted 15 July, 2024; originally announced July 2024.

  2. arXiv:2406.18495  [pdf, other]

    cs.CL

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

    Authors: Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri

    Abstract: We introduce WildGuard -- an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced a…

    Submitted 9 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: First two authors contributed equally. Third and fourth authors contributed equally

  3. arXiv:2406.12935  [pdf, other]

    cs.CR cs.AI cs.LG

    ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

    Authors: Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

    Abstract: Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understo…

    Submitted 16 June, 2024; originally announced June 2024.

  4. arXiv:2406.11069  [pdf, other]

    cs.CV cs.AI cs.CL

    WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

    Authors: Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, Bill Yuchen Lin

    Abstract: Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: link: https://hf.co/spaces/WildVision/vision-arena

  5. arXiv:2406.08464  [pdf, other]

    cs.CL cs.AI

    Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin

    Abstract: High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Link: https://magpie-align.github.io/

  6. arXiv:2406.05761  [pdf, other]

    cs.CL

    The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

    Authors: Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, et al. (7 additional authors not shown)

    Abstract: As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on spec…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Work in Progress

  7. arXiv:2406.04770  [pdf, other]

    cs.CL cs.AI

    WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

    Authors: Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi

    Abstract: We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs su…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Link: https://hf.co/spaces/allenai/WildBench

  8. arXiv:2405.01535  [pdf, other]

    cs.CL

    Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

    Authors: Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo

    Abstract: Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those ass…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Work in Progress

  9. arXiv:2404.10199  [pdf, other]

    cs.CL cs.AI

    CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting

    Authors: Huihan Li, Liwei Jiang, Jena D. Hwang, Hyunwoo Kim, Sebastin Santy, Taylor Sorensen, Bill Yuchen Lin, Nouha Dziri, Xiang Ren, Yejin Choi

    Abstract: As the utilization of large language models (LLMs) has proliferated worldwide, it is crucial for them to have adequate knowledge and fair representation for diverse global cultures. In this work, we uncover culture perceptions of three SOTA models on 110 countries and regions on 8 culture-related topics through culture-conditioned generations, and extract symbols from these generations that are as…

    Submitted 26 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

  10. arXiv:2404.05955  [pdf, other]

    cs.CL cs.AI

    VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

    Authors: Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

    Abstract: Multimodal Large Language models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focus on end-to-end web agent tasks, unable to measure fine-grained a…

    Submitted 8 April, 2024; originally announced April 2024.

  11. arXiv:2403.13787  [pdf, other]

    cs.LG

    RewardBench: Evaluating Reward Models for Language Modeling

    Authors: Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

    Abstract: Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training a…

    Submitted 8 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: 44 pages, 19 figures, 12 tables

  12. arXiv:2403.02502  [pdf, other]

    cs.CL cs.AI cs.LG

    Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents

    Authors: Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

    Abstract: Large Language Models (LLMs) have become integral components in various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to enhance the performance of open LLM agents. Contrary to previous studies that exclusively train on successful expert trajectories, our method allows agents to learn…

    Submitted 10 July, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted to ACL 2024 Main Conference; Camera Ready

  13. arXiv:2402.15610  [pdf, other]

    cs.CL

    Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning

    Authors: Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, Khyathi Raghavi Chandu

    Abstract: Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to…

    Submitted 12 June, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: Accepted to ACL Findings 2024

  14. arXiv:2402.14658  [pdf, other]

    cs.SE cs.AI cs.CL

    OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

    Authors: Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue

    Abstract: The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Co…

    Submitted 27 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

  15. arXiv:2402.08983  [pdf, other]

    cs.CR cs.AI cs.CL

    SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

    Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran

    Abstract: As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, aiming to provoke unintended and unsafe behaviors from LLMs, remain a significant LLM safety threat. In this paper, we aim to defend LLMs aga…

    Submitted 7 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

    Comments: To appear in ACL 2024

  16. arXiv:2312.05979  [pdf, other]

    cs.CL

    NovaCOMET: Open Commonsense Foundation Models with Symbolic Knowledge Distillation

    Authors: Peter West, Ronan Le Bras, Taylor Sorensen, Bill Yuchen Lin, Liwei Jiang, Ximing Lu, Khyathi Chandu, Jack Hessel, Ashutosh Baheti, Chandra Bhagavatula, Yejin Choi

    Abstract: We present NovaCOMET, an open commonsense knowledge model, that combines the best aspects of knowledge and general task models. Compared to previous knowledge models, NovaCOMET allows open-format relations enabling direct application to reasoning tasks; compared to general task models like Flan-T5, it explicitly centers knowledge, enabling superior performance for commonsense reasoning. NovaCOME…

    Submitted 10 December, 2023; originally announced December 2023.

  17. arXiv:2312.01552  [pdf, other]

    cs.CL cs.AI

    The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

    Authors: Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi

    Abstract: The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tunin…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: 26 pages, 8 figures. Project website: https://allenai.github.io/re-align/

  18. arXiv:2311.05657  [pdf, other]

    cs.AI cs.CL cs.LG

    Agent Lumos: Unified and Modular Training for Open-Source Language Agents

    Authors: Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, Bill Yuchen Lin

    Abstract: Closed-source agents suffer from several issues such as a lack of affordability, transparency, and reproducibility, particularly on complex interactive tasks. This motivates the development of open-source alternatives. We introduce LUMOS, one of the first frameworks for training open-source LLM-based agents. LUMOS features a learnable, unified, and modular architecture with a planning module that…

    Submitted 10 July, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: Accepted to ACL 2024 Main Conference; Camera Ready. Project website: https://allenai.github.io/lumos/

  19. arXiv:2310.11564  [pdf, other]

    cs.CL

    Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

    Authors: Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu

    Abstract: While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Preprint

  20. arXiv:2310.00752  [pdf, other]

    cs.CL cs.AI

    TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

    Authors: Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, Wenhu Chen

    Abstract: We present TIGERScore, a Trained metric that follows Instruction Guidance to perform Explainable, and Reference-free evaluation over a wide spectrum of text generation tasks. Different from other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instruction to provide error analysis to pinpoint the mi…

    Submitted 9 May, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

  21. arXiv:2309.17277  [pdf, other]

    cs.AI

    Suspicion-Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4

    Authors: Jiaxian Guo, Bo Yang, Paul Yoo, Bill Yuchen Lin, Yusuke Iwasawa, Yutaka Matsuo

    Abstract: Unlike perfect information games, where all elements are known to every player, imperfect information games emulate the real-world complexities of decision-making under uncertain or incomplete information. GPT-4, the recent breakthrough in large language models (LLMs) trained on massive passive data, is notable for its knowledge retrieval and reasoning abilities. This paper delves into the applica…

    Submitted 6 October, 2023; v1 submitted 29 September, 2023; originally announced September 2023.

  22. arXiv:2307.13269  [pdf, other]

    cs.CL cs.AI

    LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition

    Authors: Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, Min Lin

    Abstract: Low-rank adaptations (LoRA) are often employed to fine-tune large language models (LLMs) for new tasks. This paper investigates LoRA composability for cross-task generalization and introduces LoraHub, a simple framework devised for the purposive assembly of LoRA modules trained on diverse given tasks, with the objective of achieving adaptable performance on unseen tasks. With just a few examples f…

    Submitted 18 January, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: Add more related work and experimental results

  23. arXiv:2306.02561  [pdf, other]

    cs.CL cs.AI cs.LG

    LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

    Authors: Dongfu Jiang, Xiang Ren, Bill Yuchen Lin

    Abstract: We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). Our framework consists of two modules: PairRanker and GenFuser, addressing the observation that optimal LLMs for different examples can significantly vary. PairRanker employs a specialized pairwise comparison…

    Submitted 30 June, 2023; v1 submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted to ACL 2023 (main conference); Project website: https://yuchenlin.xyz/LLM-Blender/ V3 update: fix a few typos and update a few citations; V2 update: The experiments on summarization, translation, and constrained generation tasks in the prior version have been moved to the appendix

  24. arXiv:2305.18654  [pdf, other]

    cs.CL cs.AI cs.LG

    Faith and Fate: Limits of Transformers on Compositionality

    Authors: Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, Yejin Choi

    Abstract: Transformer large language models (LLMs) have sparked admiration for their exceptional performance on tasks that demand intricate multi-step reasoning. Yet, these models simultaneously show failures on surprisingly trivial problems. This begs the question: Are these errors incidental, or do they signal more substantial limitations? In an attempt to demystify transformer LLMs, we investigate the li…

    Submitted 31 October, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: 10 pages + appendix (40 pages)

  25. arXiv:2305.17390  [pdf, other]

    cs.CL cs.AI cs.LG cs.MA cs.RO

    SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks

    Authors: Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, Xiang Ren

    Abstract: We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition, designed to excel in action planning for complex interactive reasoning tasks. SwiftSage integrates the strengths of behavior cloning and prompting large language models (LLMs) to enhance task completion performance. The framework comprises two primary modules: the Swift module, representing fast…

    Submitted 6 December, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: Accepted to NeurIPS 2023 (spotlight). Project website: https://swiftsage.github.io

  26. arXiv:2305.15065  [pdf, other]

    cs.CL

    Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning

    Authors: Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, Nouha Dziri, Jillian Fisher, Bill Yuchen Lin, Skyler Hallinan, Xiang Ren, Sean Welleck, Yejin Choi

    Abstract: While extreme-scale language models have demonstrated exceptional performance on a variety of language tasks, the degree of control over these language models through pure prompting can often be limited. Directly fine-tuning such language models can be effective for tailoring them, but it can be either extremely costly (e.g., GPT-3) or not even feasible for the broader community (e.g., GPT-4). W…

    Submitted 6 December, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  27. arXiv:2212.10555  [pdf, other]

    cs.CL cs.AI cs.LG

    PairReranker: Pairwise Reranking for Natural Language Generation

    Authors: Dongfu Jiang, Bill Yuchen Lin, Xiang Ren

    Abstract: Pre-trained language models have been successful in natural language generation (NLG) tasks. While various decoding methods have been employed, they often produce suboptimal results. We first present an empirical analysis of three NLG tasks: summarization, machine translation, and constrained text generation. We found that selecting the best output from the results of multiple decoding methods can…

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: We will release our code and data at https://inklab.usc.edu/PairReranker

  28. arXiv:2211.09267  [pdf, other]

    cs.CL cs.AI cs.LG

    Reflect, Not Reflex: Inference-Based Common Ground Improves Dialogue Response Quality

    Authors: Pei Zhou, Hyundong Cho, Pegah Jandaghi, Dong-Ho Lee, Bill Yuchen Lin, Jay Pujara, Xiang Ren

    Abstract: Human communication relies on common ground (CG), the mutual knowledge and beliefs shared by participants, to produce coherent and interesting conversations. In this paper, we demonstrate that current response generation (RG) models produce generic and dull responses in dialogues because they act reflexively, failing to explicitly model CG, both due to the lack of CG in training data and the stand…

    Submitted 16 November, 2022; originally announced November 2022.

    Comments: Accepted at EMNLP-2022. 19 pages, 17 figures, 4 tables

  29. arXiv:2209.00465  [pdf, other]

    cs.AI cs.CL cs.LG cs.RO

    On Grounded Planning for Embodied Tasks with Language Models

    Authors: Bill Yuchen Lin, Chengsong Huang, Qian Liu, Wenda Gu, Sam Sommerer, Xiang Ren

    Abstract: Language models (LMs) have demonstrated their capability in possessing commonsense knowledge of the physical world, a crucial aspect of performing tasks in everyday life. However, it remains unclear whether LMs have the capacity to generate grounded, executable plans for embodied tasks. This is a challenging task as LMs lack the ability to perceive the environment through vision and feedback f…

    Submitted 15 July, 2023; v1 submitted 29 August, 2022; originally announced September 2022.

    Comments: Accepted to AAAI 2023. Project website: https://yuchenlin.xyz/g-planet/

  30. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  31. arXiv:2205.02014  [pdf, other]

    cs.CL cs.AI cs.LG

    On Continual Model Refinement in Out-of-Distribution Data Streams

    Authors: Bill Yuchen Lin, Sida Wang, Xi Victoria Lin, Robin Jia, Lin Xiao, Xiang Ren, Wen-tau Yih

    Abstract: Real-world natural language processing (NLP) models need to be continually updated to fix the prediction errors in out-of-distribution (OOD) data streams while overcoming catastrophic forgetting. However, existing continual learning (CL) problem setups cannot cover such a realistic and complex scenario. In response to this, we propose a new CL problem formulation dubbed continual model refinement…

    Submitted 4 May, 2022; originally announced May 2022.

    Comments: Accepted to ACL 2022; Project website: https://cmr-nlp.github.io/

  32. arXiv:2204.07937  [pdf, other]

    cs.CL cs.AI cs.LG

    Unsupervised Cross-Task Generalization via Retrieval Augmentation

    Authors: Bill Yuchen Lin, Kangmin Tan, Chris Miller, Beiwen Tian, Xiang Ren

    Abstract: Humans can perform unseen tasks by recalling relevant skills acquired previously and then generalizing them to the target tasks, even if there is no supervision at all. In this paper, we aim to improve this kind of cross-task generalization ability of massive multi-task language models, such as T0 and FLAN, in an unsupervised setting. We propose a retrieval-augmentation method named ReCross that t…

    Submitted 17 October, 2022; v1 submitted 17 April, 2022; originally announced April 2022.

    Comments: Accepted to NeurIPS 2022. Website: https://inklab.usc.edu/ReCross/

  33. arXiv:2110.08555  [pdf, other]

    cs.CL

    On the Robustness of Reading Comprehension Models to Entity Renaming

    Authors: Jun Yan, Yang Xiao, Sagnik Mukherjee, Bill Yuchen Lin, Robin Jia, Xiang Ren

    Abstract: We study the robustness of machine reading comprehension (MRC) models to entity renaming -- do models make more wrong predictions when the same questions are asked about an entity whose name has been changed? Such failures imply that models overly rely on entity information to answer questions, and thus may generalize poorly when facts about the world change or questions are asked about novel enti…

    Submitted 4 May, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: Accepted to NAACL 2022

  34. arXiv:2109.05620  [pdf, other]

    cs.CL

    RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models

    Authors: Bill Yuchen Lin, Wenyang Gao, Jun Yan, Ryan Moreno, Xiang Ren

    Abstract: To audit the robustness of named entity recognition (NER) models, we propose RockNER, a simple yet effective method to create natural adversarial examples. Specifically, at the entity level, we replace target entities with other entities of the same semantic class in Wikidata; at the context level, we use pre-trained language models (e.g., BERT) to generate word substitutions. Together, the two le…

    Submitted 12 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP 2021 as a short paper. Project website: https://inklab.usc.edu/rockner/

  35. arXiv:2109.04726  [pdf, other]

    cs.CL cs.IR

    AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction

    Authors: Dong-Ho Lee, Ravi Kiran Selvam, Sheikh Muhammad Sarwar, Bill Yuchen Lin, Fred Morstatter, Jay Pujara, Elizabeth Boschee, James Allan, Xiang Ren

    Abstract: Deep neural models for named entity recognition (NER) have shown impressive results in overcoming label scarcity and generalizing to unseen entities by leveraging distant supervision and auxiliary information such as explanations. However, the costs of acquiring such additional information are generally prohibitive. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER…

    Submitted 18 May, 2023; v1 submitted 10 September, 2021; originally announced September 2021.

    Comments: 15 pages, 13 figures, EACL 2023

  36. arXiv:2106.06937  [pdf, other]

    cs.CL cs.AI

    Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning

    Authors: Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, Xiang Ren

    Abstract: Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey Corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-agnostic probing tas…

    Submitted 13 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL-IJCNLP 2021 (long paper at main conference). Project website: https://inklab.usc.edu/XCSR/

  37. arXiv:2104.09574  [pdf, other]

    cs.CL cs.AI cs.LG

    Probing Commonsense Explanation in Dialogue Response Generation

    Authors: Pei Zhou, Pegah Jandaghi, Bill Yuchen Lin, Justin Cho, Jay Pujara, Xiang Ren

    Abstract: Humans use commonsense reasoning (CSR) implicitly to produce natural and coherent responses in conversations. Aiming to close the gap between current response generation (RG) models and human communication abilities, we want to understand why RG models respond as they do by probing RG models' understanding of commonsense reasoning that elicits proper responses. We formalize the problem by framing…

    Submitted 9 September, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

    Comments: Accepted in EMNLP 2021-Findings. 15 pages, 12 figures, 3 tables

  38. arXiv:2104.08835  [pdf, other]

    cs.CL cs.LG

    CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP

    Authors: Qinyuan Ye, Bill Yuchen Lin, Xiang Ren

    Abstract: Humans can learn a new language task efficiently with only a few examples, by leveraging their knowledge obtained when learning prior tasks. In this paper, we explore whether and how such cross-task generalization ability can be acquired, and further applied to build better few-shot learners across diverse NLP tasks. We introduce CrossFit, a problem setup for studying cross-task generalization abili…

    Submitted 30 September, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: Accepted to EMNLP 2021. Camera-ready version. Code: https://github.com/INK-USC/CrossFit

  39. arXiv:2104.08815  [pdf, other]

    cs.CL cs.AI cs.LG

    FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks

    Authors: Bill Yuchen Lin, Chaoyang He, Zihang Zeng, Hulin Wang, Yufen Huang, Christophe Dupuy, Rahul Gupta, Mahdi Soltanolkotabi, Xiang Ren, Salman Avestimehr

    Abstract: Increasing concerns and regulations about data privacy and sparsity necessitate the study of privacy-preserving, decentralized learning methods for natural language processing (NLP) tasks. Federated learning (FL) provides promising approaches for a large number of clients (e.g., personal devices or organizations) to collaboratively learn a shared global model to benefit all clients while allowing…

    Submitted 6 May, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: Accepted to NAACL 2022 Findings. Github: https://github.com/FedML-AI/FedNLP

  40. arXiv:2104.08808  [pdf, other]

    cs.CL

    Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning

    Authors: Xisen Jin, Bill Yuchen Lin, Mohammad Rostami, Xiang Ren

    Abstract: The ability to continuously expand knowledge over time and utilize it to rapidly generalize to new tasks is a key feature of human linguistic intelligence. Existing models that pursue rapid generalization to new tasks (e.g., few-shot learning methods), however, are mostly trained in a single shot on fixed datasets, unable to dynamically expand their knowledge; while continual learning algorithms a…

    Submitted 20 August, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: Accepted at Findings of EMNLP 2021; Fixed an error in Table 3 (see footnote 4); Updated Q3 in Sec. 4.2

  41. arXiv:2101.00376  [pdf, other]

    cs.CL cs.AI

    RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge

    Authors: Bill Yuchen Lin, Ziyi Wu, Yichi Yang, Dong-Ho Lee, Xiang Ren

    Abstract: Question: I have five fingers but I am not alive. What am I? Answer: a glove. Answering such a riddle-style question is a challenging cognitive process, in that it requires complex commonsense reasoning abilities, an understanding of figurative language, and counterfactual reasoning skills, which are all important abilities for advanced natural language understanding (NLU). However, there are curr…

    Submitted 4 July, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

    Comments: Accepted to ACL 2021 (Findings). Project page: https://inklab.usc.edu/RiddleSense

  42. arXiv:2011.07956  [pdf, other]

    cs.CL cs.AI cs.LG

    Pre-training Text-to-Text Transformers for Concept-centric Common Sense

    Authors: Wangchunshu Zhou, Dong-Ho Lee, Ravi Kiran Selvam, Seyeon Lee, Bill Yuchen Lin, Xiang Ren

    Abstract: Pre-trained language models (PTLM) have achieved impressive results in a range of natural language understanding (NLU) and generation (NLG) tasks. However, current pre-training objectives such as masked token prediction (for BERT-style PTLMs) and masked span infilling (for T5-style PTLMs) do not explicitly model the relational commonsense knowledge about everyday concepts, which is crucial to many…

    Submitted 24 November, 2020; v1 submitted 24 October, 2020; originally announced November 2020.

    Comments: 15 pages, 4 figures. Code and Data: https://github.com/INK-USC/CALM/

  43. arXiv:2010.14439  [pdf, other]

    cs.CL cs.AI cs.LG

    Differentiable Open-Ended Commonsense Reasoning

    Authors: Bill Yuchen Lin, Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Xiang Ren, William W. Cohen

    Abstract: Current commonsense reasoning research focuses on developing models that use commonsense knowledge to answer multiple-choice questions. However, systems designed to answer multiple-choice questions may not be useful in applications that do not provide a small list of candidate answers to choose from. As a step towards making commonsense reasoning research more realistic, we propose to study open-e…

    Submitted 6 June, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

    Comments: Accepted to NAACL 2021. Project website: https://open-csr.github.io

  44. FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

    Authors: Bill Yuchen Lin, Ying Sheng, Nguyen Vo, Sandeep Tata

    Abstract: Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: in Proc. of KDD 2020 (Research Track). Figure 5 updated

  45. arXiv:2005.02178  [pdf, other]

    cs.CL cs.LG

    IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization

    Authors: Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren

    Abstract: Fine-tuning pre-trained language models (PTLMs), such as BERT and its better variant RoBERTa, has been a common practice for advancing performance in natural language understanding (NLU) tasks. Recent advance in representation learning shows that isotropic (i.e., unit-variance and uncorrelated) embeddings can significantly improve performance on downstream tasks with faster convergence and better…

    Submitted 3 February, 2021; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: AAAI 2021

  46. arXiv:2005.00782  [pdf, other]

    cs.CL cs.AI cs.LO

    RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms

    Authors: Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay Pujara, Xiang Ren

    Abstract: Pre-trained language models (PTLMs) have achieved impressive performance on commonsense inference benchmarks, but their ability to employ commonsense to make robust inferences, which is crucial for effective communications with humans, is debated. In the pursuit of advancing fluid human-AI communication, we propose a new challenge, RICA: Robust Inference capability based on Commonsense Axioms, tha…

    Submitted 9 September, 2021; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted in EMNLP 2021 main conference. 20 pages, 8 figures

  47. arXiv:2005.00683  [pdf, other]

    cs.CL cs.AI

    Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models

    Authors: Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, Xiang Ren

    Abstract: Recent works show that pre-trained language models (PTLMs), such as BERT, possess certain commonsense and factual knowledge. They suggest that it is promising to use PTLMs as "neural knowledge bases" via predicting masked words. Surprisingly, we find that this may not work for numerical commonsense knowledge (e.g., a bird usually has two legs). In this paper, we investigate whether and to what ext…

    Submitted 17 September, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: To appear in Proceedings of EMNLP 2020. Project page: http://inklab.usc.edu/NumerSense/

  48. arXiv:2005.00646  [pdf, other]

    cs.CL cs.LG

    Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering

    Authors: Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, Xiang Ren

    Abstract: Existing work on augmenting question answering (QA) models with external knowledge (e.g., knowledge graphs) either struggle to model multi-hop relations efficiently, or lack transparency into the model's prediction rationale. In this paper, we propose a novel knowledge-aware approach that equips pre-trained language models (PTLMs) with a multi-hop relational reasoning module, named multi-hop graph…

    Submitted 18 September, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: Accepted to EMNLP 2020. Project page: https://github.com/INK-USC/MHGRN

  49. arXiv:2004.07499  [pdf, other]

    cs.CL cs.AI cs.LG

    LEAN-LIFE: A Label-Efficient Annotation Framework Towards Learning from Explanation

    Authors: Dong-Ho Lee, Rahul Khanna, Bill Yuchen Lin, Jamin Chen, Seyeon Lee, Qinyuan Ye, Elizabeth Boschee, Leonardo Neves, Xiang Ren

    Abstract: Successfully training a deep neural network demands a huge corpus of labeled data. However, each label only provides limited information to learn from and collecting the requisite number of labels involves massive human effort. In this work, we introduce LEAN-LIFE, a web-based, Label-Efficient AnnotatioN framework for sequence labeling and classification tasks, with an easy-to-use UI that not only…

    Submitted 16 April, 2020; originally announced April 2020.

    Comments: Accepted to the ACL 2020 (demo). The first two authors contributed equally. Project page: http://inklab.usc.edu/leanlife/

  50. arXiv:2004.07493  [pdf, other]

    cs.CL cs.IR cs.LG

    TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition

    Authors: Bill Yuchen Lin, Dong-Ho Lee, Ming Shen, Ryan Moreno, Xiao Huang, Prashant Shiralkar, Xiang Ren

    Abstract: Training neural models for named entity recognition (NER) in a new domain often requires additional human annotations (e.g., tens of thousands of labeled instances) that are usually expensive and time-consuming to collect. Thus, a crucial research question is how to obtain supervision in a cost-effective way. In this paper, we introduce "entity triggers," an effective proxy of human explanations f…

    Submitted 6 July, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

    Comments: Accepted to the ACL 2020. Project page: https://inklab.usc.edu/TriggerNER/ (Fixed a few typos and added a new figure.)

    Journal ref: Proc. of ACL 2020, page 8503--8511