Showing 1–50 of 154 results for author: Hajishirzi, H

  1. arXiv:2407.12043  [pdf, other]

    cs.CL cs.AI cs.HC

    The Art of Saying No: Contextual Noncompliance in Language Models

    Authors: Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

    Abstract: Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a…

    Submitted 2 July, 2024; originally announced July 2024.

  2. arXiv:2407.07087  [pdf, other]

    cs.CL cs.LG

    CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation

    Authors: Tong Chen, Akari Asai, Niloofar Mireshghallah, Sewon Min, James Grimmelmann, Yejin Choi, Hannaneh Hajishirzi, Luke Zettlemoyer, Pang Wei Koh

    Abstract: Evaluating the degree of reproduction of copyright-protected content by language models (LMs) is of significant interest to the AI and legal communities. Although both literal and non-literal similarities are considered by courts when assessing the degree of reproduction, prior research has focused only on literal similarities. To bridge this gap, we introduce CopyBench, a benchmark designed to me…

    Submitted 9 July, 2024; originally announced July 2024.

  3. arXiv:2406.18853  [pdf, other]

    cs.LG

    Decoding-Time Language Model Alignment with Multiple Objectives

    Authors: Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon Du

    Abstract: Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a lin…

    Submitted 28 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

  4. arXiv:2406.09279  [pdf, other]

    cs.CL

    Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

    Authors: Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

    Abstract: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core a…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Preprint

  5. arXiv:2406.08446  [pdf, other]

    cs.CL cs.AI

    OLMES: A Standard for Language Model Evaluations

    Authors: Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, Hannaneh Hajishirzi

    Abstract: Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models in particular is challenging, as small changes to how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, leading to…

    Submitted 12 June, 2024; originally announced June 2024.

  6. arXiv:2406.07835  [pdf, other]

    cs.CL cs.AI

    SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

    Authors: David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, Arman Cohan

    Abstract: We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed t…

    Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Submitted to NeurIPS Datasets and Benchmarks 2024

  7. arXiv:2406.06469  [pdf, other]

    cs.AI cs.CL cs.LG

    Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

    Authors: Joongwon Kim, Bhargavi Paranjape, Tushar Khot, Hannaneh Hajishirzi

    Abstract: Language agents perform complex tasks by using tools to execute each step precisely. However, most existing agents are based on proprietary models or designed to target specific tasks, such as mathematics or multi-hop question answering. We introduce Husky, a holistic, open-source language agent that learns to reason over a unified action space to address a diverse set of complex tasks involving n…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 50 pages, 42 figures. Project webpage available at https://agent-husky.github.io/

  8. arXiv:2404.01197  [pdf, other]

    cs.CV

    Getting it Right: Improving Spatial Consistency in Text-to-Image Models

    Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang

    Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: Project webpage: https://spright-t2i.github.io/

  9. arXiv:2403.13787  [pdf, other]

    cs.LG

    RewardBench: Evaluating Reward Models for Language Modeling

    Authors: Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

    Abstract: Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training a…

    Submitted 8 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: 44 pages, 19 figures, 12 tables

  10. arXiv:2403.03187  [pdf, other]

    cs.CL cs.AI cs.LG

    Reliable, Adaptable, and Attributable Language Models with Retrieval

    Authors: Akari Asai, Zexuan Zhong, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi, Wen-tau Yih

    Abstract: Parametric language models (LMs), which are trained on vast amounts of web data, exhibit remarkable flexibility and capability. However, they still face practical challenges such as hallucinations, difficulty in adapting to new data distributions, and a lack of verifiability. In this position paper, we advocate for retrieval-augmented LMs to replace parametric LMs as the next generation of LMs. By…

    Submitted 5 March, 2024; originally announced March 2024.

  11. arXiv:2402.16797  [pdf, other]

    cs.CL

    Set the Clock: Temporal Alignment of Pretrained Language Models

    Authors: Bowen Zhao, Zander Brumbaugh, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith

    Abstract: Language models (LMs) are trained on web text originating from many points in time and, in general, without any explicit temporal grounding. This work investigates the temporal chaos of pretrained LMs and explores various methods to align their internal knowledge to a target time, which we call "temporal alignment." To do this, we first automatically construct a dataset containing 20K time-sensiti…

    Submitted 9 June, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: Accepted as Findings of ACL 2024. Our code and data are available at https://github.com/yizhongw/llm-temporal-alignment

  12. arXiv:2402.10171  [pdf, other]

    cs.CL cs.AI

    Data Engineering for Scaling Language Models to 128K Context

    Authors: Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng

    Abstract: We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contex…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: Code at https://github.com/FranxYao/Long-Context-Data-Engineering

  13. arXiv:2402.07841  [pdf, other]

    cs.CL

    Do Membership Inference Attacks Work on Large Language Models?

    Authors: Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, Hannaneh Hajishirzi

    Abstract: Membership inference attacks (MIAs) attempt to predict whether a particular datapoint is a member of a target model's training data. Despite extensive research on traditional machine learning models, there has been limited work studying MIA on the pre-training data of large language models (LLMs). We perform a large-scale evaluation of MIAs over a suite of language models (LMs) trained on the Pile…

    Submitted 12 February, 2024; originally announced February 2024.

  14. arXiv:2402.00838  [pdf, other]

    cs.CL

    OLMo: Accelerating the Science of Language Models

    Authors: Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, et al. (18 additional authors not shown)

    Abstract: Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models…

    Submitted 7 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  15. arXiv:2402.00159  [pdf, other]

    cs.CL

    Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

    Authors: Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, et al. (11 additional authors not shown)

    Abstract: Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training dat…

    Submitted 6 June, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

    Comments: Accepted at ACL 2024; Dataset: https://hf.co/datasets/allenai/dolma; Code: https://github.com/allenai/dolma

  16. arXiv:2401.17377  [pdf, other]

    cs.CL cs.AI cs.IR

    Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

    Authors: Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi

    Abstract: Are $n$-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gra…

    Submitted 4 April, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

  17. arXiv:2401.12200  [pdf, other]

    cs.CL cs.LG

    APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

    Authors: Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao

    Abstract: Fine-tuning and inference with large Language Models (LM) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To…

    Submitted 4 June, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted to ICML 2024 Oral; code available at https://github.com/ROIM1998/APT

  18. arXiv:2401.06855  [pdf, other]

    cs.CL

    Fine-grained Hallucination Detection and Editing for Language Models

    Authors: Abhika Mishra, Akari Asai, Vidhisha Balachandran, Yizhong Wang, Graham Neubig, Yulia Tsvetkov, Hannaneh Hajishirzi

    Abstract: Large language models (LMs) are prone to generate factual errors, which are often called hallucinations. In this paper, we introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms, each requiring varying degrees of careful assessments to verify factuality. We propose a novel task of automatic fine-grained hallucination detection and construct a n…

    Submitted 21 February, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: Our code, data, and demo are available at https://fine-grained-hallucination.github.io. Expanded human annotations to add a new LM and included more baselines for comparison

  19. arXiv:2312.10523  [pdf, other]

    cs.CL cs.AI cs.LG

    Paloma: A Benchmark for Evaluating Language Model Fit

    Authors: Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge

    Abstract: Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains–varying distributions of language. Rather than assuming perplexity on one distribution extrapolates to others, Perplexity Analysis for Language Model Assessment (Paloma), measures LM fit to 585 text domains, ranging from nytimes.com…

    Submitted 16 December, 2023; originally announced December 2023.

    Comments: Project Page: https://paloma.allen.ai/

  20. arXiv:2311.10702  [pdf, other]

    cs.CL

    Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2

    Authors: Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

    Abstract: Since the release of TÜLU [Wang et al., 2023b], open resources for instruction tuning have developed quickly, from better base models to new finetuning techniques. We test and incorporate a number of these advances into TÜLU, resulting in TÜLU 2, a suite of improved TÜLU models for advancing the understanding and best practices of adapting pretrained language models to downstream tasks and user pr…

    Submitted 19 November, 2023; v1 submitted 17 November, 2023; originally announced November 2023.

    Comments: technical report; fixed zephyr numbers

  21. arXiv:2310.20707  [pdf, other]

    cs.CL cs.LG

    What's In My Big Data?

    Authors: Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge

    Abstract: Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corp…

    Submitted 5 March, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

    Comments: Published at ICLR 2024 spotlight

  22. arXiv:2310.12126  [pdf, other]

    cs.LG cs.AI cs.CL

    SHARCS: Efficient Transformers through Routing with Dynamic Width Sub-networks

    Authors: Mohammadreza Salehi, Sachin Mehta, Aditya Kusupati, Ali Farhadi, Hannaneh Hajishirzi

    Abstract: We introduce SHARCS for adaptive inference that takes into account the hardness of input samples. SHARCS can train a router on any transformer network, enabling the model to direct different samples to sub-networks with varying widths. Our experiments demonstrate that: (1) SHARCS outperforms or complements existing per-sample adaptive inference methods across various classification tasks in terms…

    Submitted 18 October, 2023; originally announced October 2023.

  23. arXiv:2310.11564  [pdf, other]

    cs.CL

    Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

    Authors: Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu

    Abstract: While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Preprint

  24. arXiv:2310.11513  [pdf, other]

    cs.CV cs.LG

    GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

    Authors: Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt

    Abstract: Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic me…

    Submitted 17 October, 2023; originally announced October 2023.

  25. arXiv:2310.11511  [pdf, other]

    cs.CL cs.AI cs.LG

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    Authors: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi

    Abstract: Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed numb…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: 30 pages, 2 figures, 12 tables

  26. arXiv:2310.07707  [pdf, other]

    cs.LG cs.CL cs.CV

    MatFormer: Nested Transformer for Elastic Inference

    Authors: Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain

    Abstract: Transformer models are deployed in a wide range of settings, from multi-accelerator clusters to standalone mobile phones. The diverse inference constraints in these scenarios necessitate practitioners to train foundation models such as PaLM 2, Llama, & ViTs as a series of models of varying sizes. Due to significant training costs, only a select few model sizes are trained and supported, limiting m…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: 31 pages, 12 figures, first three authors contributed equally

  27. arXiv:2310.04921  [pdf, other]

    cs.AI cs.CL cs.LG

    Crystal: Introspective Reasoners Reinforced with Self-Feedback

    Authors: Jiacheng Liu, Ramakanth Pasunuru, Hannaneh Hajishirzi, Yejin Choi, Asli Celikyilmaz

    Abstract: Extensive work has shown that the performance and interpretability of commonsense reasoning can be improved via knowledge-augmented reasoning methods, where the knowledge that underpins the reasoning process is explicitly verbalized and utilized. However, existing implementations, including "chain-of-thought" and its variants, fall short in capturing the introspective nature of knowledge required…

    Submitted 18 October, 2023; v1 submitted 7 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 main conference

  28. arXiv:2310.02255  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao

    Abstract: Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived…

    Submitted 20 January, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: 116 pages, 120 figures. Accepted to ICLR 2024

  29. arXiv:2310.01329  [pdf, other]

    cs.CL cs.AI

    BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models

    Authors: Qingqing Cao, Sewon Min, Yizhong Wang, Hannaneh Hajishirzi

    Abstract: Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly…

    Submitted 3 May, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 camera-ready version

  30. arXiv:2309.15028  [pdf, other]

    cs.CL cs.AI cs.LG

    Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding

    Authors: Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz

    Abstract: Inference-time search algorithms such as Monte-Carlo Tree Search (MCTS) may seem unnecessary when generating natural language text based on state-of-the-art reinforcement learning such as Proximal Policy Optimization (PPO). In this paper, we demonstrate that it is possible to get extra mileage out of PPO by integrating MCTS on top. The key idea is not to throw out the value network, a byproduct of…

    Submitted 2 April, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

  31. arXiv:2308.04430  [pdf, other]

    cs.CL cs.AI cs.LG

    SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

    Authors: Sewon Min, Suchin Gururangan, Eric Wallace, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer

    Abstract: The legality of training language models (LMs) on copyrighted or otherwise restricted data is under intense debate. However, as we show, model performance significantly degrades if trained only on low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage. We present SILO, a new language model that manages this risk-performance tradeoff during…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: 27 pages; 6 figures. Code, models, and data available at https://github.com/kernelmachine/silo-lm

  32. arXiv:2307.09701  [pdf, other]

    cs.CL

    Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation

    Authors: Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, Sam Skjonsberg, Emma Strubell, Darrell Plessas, Iz Beltagy, Evan Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi

    Abstract: Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across diffe…

    Submitted 18 July, 2023; originally announced July 2023.

  33. arXiv:2306.04751  [pdf, other]

    cs.CL

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

    Authors: Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

    Abstract: In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a la…

    Submitted 30 October, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: 18 pages, 6 figures, 10 tables. NeurIPS 2023 Datasets and Benchmarks Track Camera Ready

  34. arXiv:2306.01693  [pdf, other]

    cs.CL

    Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

    Authors: Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi

    Abstract: Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF) - where human preference judgments on LM outputs are transformed into a learning signal - has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information on long text…

    Submitted 30 October, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 camera-ready

  35. arXiv:2305.17530  [pdf, other]

    cs.CV cs.AI cs.CL

    PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

    Authors: Qingqing Cao, Bhargavi Paranjape, Hannaneh Hajishirzi

    Abstract: Large-scale vision language (VL) models use Transformers to perform cross-modal interactions between the input text and image. These cross-modal interactions are computationally expensive and memory-intensive due to the quadratic complexity of processing the input image and text. We present PuMer: a token reduction framework that uses text-informed Pruning and modality-aware Merging strategies to…

    Submitted 27 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 Main Conference

  36. arXiv:2305.14857  [pdf, other]

    cs.CL

    BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer

    Authors: Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, Hannaneh Hajishirzi

    Abstract: Despite remarkable advancements in few-shot generalization in natural language processing, most models are developed and evaluated primarily in English. To facilitate research on few-shot cross-lingual transfer, we introduce a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples and instructi…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: The data and code are available at https://buffetfs.github.io/

  37. arXiv:2305.14815  [pdf, other]

    cs.CL cs.IR

    Machine Reading Comprehension using Case-based Reasoning

    Authors: Dung Thai, Dhruv Agarwal, Mudit Chaudhary, Wenlong Zhao, Rajarshi Das, Manzil Zaheer, Jay-Yoon Lee, Hannaneh Hajishirzi, Andrew McCallum

    Abstract: We present an accurate and interpretable method for answer extraction in machine reading comprehension that is reminiscent of case-based reasoning (CBR) from classical AI. Our method (CBR-MRC) builds upon the hypothesis that contextualized answers to similar questions share semantic similarities with each other. Given a test question, CBR-MRC first retrieves a set of similar cases from a nonparame…

    Submitted 5 December, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: 9 pages, 2 figures

  38. arXiv:2305.14251  [pdf, other]

    cs.CL cs.AI cs.LG

    FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

    Authors: Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi

    Abstract: Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of…

    Submitted 11 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 25 pages; 7 figures. Published as a main conference paper at EMNLP 2023. Code available at https://github.com/shmsw25/FActScore

  39. arXiv:2305.13256  [pdf, other]

    cs.CL cs.AI

    TaskWeb: Selecting Better Source Tasks for Multi-task NLP

    Authors: Joongwon Kim, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi

    Abstract: Recent work in NLP has shown promising results in training models on large amounts of tasks to achieve better generalization. However, it is not well-understood how tasks are related, and how helpful training tasks can be chosen for a new task. In this work, we investigate whether knowing task relationships via pairwise task transfer improves choosing one or more source tasks that help to learn a…

    Submitted 3 December, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: 21 pages, 14 figures

  40. arXiv:2305.11744  [pdf, other]

    cs.IR cs.CL

    ReFIT: Relevance Feedback from a Reranker during Inference

    Authors: Revanth Gangi Reddy, Pradeep Dasigi, Md Arafat Sultan, Arman Cohan, Avirup Sil, Heng Ji, Hannaneh Hajishirzi

    Abstract: Retrieve-and-rerank is a prevalent framework in neural information retrieval, wherein a bi-encoder network initially retrieves a pre-defined number of candidates (e.g., K=100), which are then reranked by a more powerful cross-encoder model. While the reranker often yields improved candidate scores compared to the retriever, its scope is confined to only the top K retrieved candidates. As a result,…

    Submitted 28 May, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: Preprint

  41. arXiv:2305.03695  [pdf, other]

    cs.CL cs.AI

    Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements

    Authors: Jiacheng Liu, Wenya Wang, Dianzhuo Wang, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

    Abstract: Despite the much discussed capabilities of today's language models, they are still prone to silly and unexpected commonsense failures. We consider a retrospective verification approach that reflects on the correctness of LM outputs, and introduce Vera, a general-purpose model that estimates the plausibility of declarative statements based on commonsense knowledge. Trained on ~7M commonsense statem…

    Submitted 18 October, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 main conference

  42. arXiv:2304.14108  [pdf, other]

    cs.CV cs.CL cs.LG

    DataComp: In search of the next generation of multimodal datasets

    Authors: Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, et al. (9 additional authors not shown)

    Abstract: Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Commo…

    Submitted 20 October, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023 Datasets and Benchmarks Track

  43. arXiv:2303.09014  [pdf, other]

    cs.CL

    ART: Automatic multi-step reasoning and tool-use for large language models

    Authors: Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, Marco Tulio Ribeiro

    Abstract: Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations…

    Submitted 15 March, 2023; originally announced March 2023.

  44. arXiv:2301.12050  [pdf, other]

    cs.LG cs.CL

    Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling

    Authors: Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, Roy Fox

    Abstract: Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world. However, if initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, that will be verified through world ex…

    Submitted 27 April, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

    Comments: in proceedings of ICML 23

  45. arXiv:2212.10560  [pdf, other]

    cs.CL cs.AI

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Authors: Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi

    Abstract: Large "instruction-tuned" language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. We introduce Self-Instruct, a framework for improvi…

    Submitted 25 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023 camera ready, 23 pages, 9 figures, 11 tables

  46. arXiv:2212.10511  [pdf, other]

    cs.CL cs.AI cs.LG

    When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    Authors: Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi

    Abstract: Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 m…

    Submitted 2 July, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023; Code and data available at https://github.com/AlexTMallen/adaptive-retrieval

  47. arXiv:2212.10315  [pdf, other]

    cs.CL

    HINT: Hypernetwork Instruction Tuning for Efficient Zero- & Few-Shot Generalisation

    Authors: Hamish Ivison, Akshita Bhagia, Yizhong Wang, Hannaneh Hajishirzi, Matthew Peters

    Abstract: Recent NLP models have shown the remarkable ability to effectively generalise `zero-shot' to new tasks using only natural language instructions as guidance. However, many of these approaches suffer from high computational costs due to their reliance on concatenating lengthy instructions with every input example, resulting in costly reprocessing of the instruction. To avoid this, we introduce Hyper…

    Submitted 24 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  48. arXiv:2212.09865  [pdf, other]

    cs.CL cs.AI

    Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations

    Authors: Xinxi Lyu, Sewon Min, Iz Beltagy, Luke Zettlemoyer, Hannaneh Hajishirzi

    Abstract: Although large language models can be prompted for both zero- and few-shot learning, performance drops significantly when no demonstrations are available. In this paper, we introduce Z-ICL, a new zero-shot method that closes the gap by constructing pseudo-demonstrations for a given test input using a raw text corpus. Concretely, pseudo-demonstrations are constructed by (1) finding the nearest neig…

    Submitted 3 June, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: 11 pages; 9 figures

  49. arXiv:2212.04089  [pdf, other]

    cs.LG cs.CL cs.CV

    Editing Models with Task Arithmetic

    Authors: Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, Ali Farhadi

    Abstract: Changing how pre-trained models behave -- e.g., improving their performance on a downstream task or mitigating biases learned during pre-training -- is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around \textit{task vectors}. A task vector specifies a direction in the weight space of a pr…

    Submitted 31 March, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

    Comments: In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023)

  50. arXiv:2212.01349  [pdf, other]

    cs.CL cs.AI cs.LG

    Nonparametric Masked Language Modeling

    Authors: Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, Luke Zettlemoyer

    Abstract: Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which can make it difficult to predict rare tokens or phrases. We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. NPM fills in the [MASK] solely from retrieving a token from a text corpus. We show t…

    Submitted 25 May, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

    Comments: 20 pages; 9 figures. Published at ACL 2023 Findings. Code available at https://github.com/facebookresearch/NPM