Showing 1–15 of 15 results for author: Shleifer, S

  1. arXiv:2304.11277  [pdf, other]

    cs.DC cs.AI cs.LG cs.PF

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Authors: Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, Shen Li

    Abstract: It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit tech…

    Submitted 12 September, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

  2. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  3. arXiv:2205.01068  [pdf, other]

    cs.CL cs.LG

    OPT: Open Pre-trained Transformer Language Models

    Authors: Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer

    Abstract: Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open…

    Submitted 21 June, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

  4. arXiv:2203.06850  [pdf, other]

    cs.CL cs.AI

    Efficient Language Modeling with Sparse all-MLP

    Authors: Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

    Abstract: All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input…

    Submitted 31 May, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

  5. arXiv:2112.10684  [pdf, other]

    cs.CL cs.AI cs.LG

    Efficient Large Scale Language Modeling with Mixtures of Experts

    Authors: Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov

    Abstract: Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning, we…

    Submitted 26 October, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: EMNLP 2022

  6. arXiv:2112.10668  [pdf, other]

    cs.CL cs.AI

    Few-shot Learning with Multilingual Language Models

    Authors: Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li

    Abstract: Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study t…

    Submitted 10 November, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: Accepted to EMNLP 2022; 34 pages

  7. arXiv:2110.09456  [pdf, other]

    cs.CL cs.AI

    NormFormer: Improved Transformer Pretraining with Extra Normalization

    Authors: Sam Shleifer, Jason Weston, Myle Ott

    Abstract: During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers. These issues can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first…

    Submitted 1 November, 2021; v1 submitted 18 October, 2021; originally announced October 2021.

  8. arXiv:2110.02861  [pdf, other]

    cs.LG

    8-bit Optimizers via Block-wise Quantization

    Authors: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer

    Abstract: Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In t…

    Submitted 20 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: ICLR2022 spotlight version

  9. arXiv:2010.13002  [pdf, other]

    cs.CL cs.AI

    Pre-trained Summarization Distillation

    Authors: Sam Shleifer, Alexander M. Rush

    Abstract: Recent state-of-the-art approaches to summarization utilize large pre-trained Transformer models. Distilling these models to smaller student models has become critically important for practical use; however there are many different distillation methods proposed by the NLP literature. Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowle…

    Submitted 28 October, 2020; v1 submitted 24 October, 2020; originally announced October 2020.

  10. arXiv:1912.07390  [pdf, other]

    eess.SP cs.LG

    Incrementally Improving Graph WaveNet Performance on Traffic Prediction

    Authors: Sam Shleifer, Clara McCreery, Vamsi Chitters

    Abstract: We present a series of modifications which improve upon Graph WaveNet's previously state-of-the-art performance on the METR-LA traffic prediction task. The goal of this task is to predict the future speed of traffic at each sensor in a network using the past hour of sensor readings. Graph WaveNet (GWN) is a spatio-temporal graph neural network which interleaves graph convolution to aggregate infor…

    Submitted 11 December, 2019; originally announced December 2019.

  11. arXiv:1911.08554  [pdf, other]

    cs.CL cs.AI cs.LG

    Classification as Decoder: Trading Flexibility for Control in Medical Dialogue

    Authors: Sam Shleifer, Manish Chablani, Anitha Kannan, Namit Katariya, Xavier Amatriain

    Abstract: Generative seq2seq dialogue systems are trained to predict the next word in dialogues that have already occurred. They can learn from large unlabeled conversation datasets, build a deeper understanding of conversational context, and generate a wide variety of responses. This flexibility comes at the cost of control, a concerning tradeoff in doctor/patient interactions. Inaccuracies, typos, or unde…

    Submitted 15 November, 2019; originally announced November 2019.

    Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract. arXiv admin note: substantial text overlap with arXiv:1910.03476

  12. arXiv:1910.03771  [pdf, other]

    cs.CL

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Authors: Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, Alexander M. Rush

    Abstract: Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the…

    Submitted 13 July, 2020; v1 submitted 8 October, 2019; originally announced October 2019.

    Comments: 8 pages, 4 figures, more details at https://github.com/huggingface/transformers

  13. arXiv:1910.03476  [pdf, other]

    cs.CL cs.LG

    Classification As Decoder: Trading Flexibility For Control In Neural Dialogue

    Authors: Sam Shleifer, Manish Chablani, Namit Katariya, Anitha Kannan, Xavier Amatriain

    Abstract: Generative seq2seq dialogue systems are trained to predict the next word in dialogues that have already occurred. They can learn from large unlabeled conversation datasets, build a deep understanding of conversational context, and generate a wide variety of responses. This flexibility comes at the cost of control. Undesirable responses in the training data will be reproduced by the model at infere…

    Submitted 17 October, 2019; v1 submitted 4 October, 2019; originally announced October 2019.

  14. arXiv:1906.04887  [pdf, other]

    cs.LG cs.CV stat.ML

    Using Small Proxy Datasets to Accelerate Hyperparameter Search

    Authors: Sam Shleifer, Eric Prokop

    Abstract: One of the biggest bottlenecks in a machine learning workflow is waiting for models to train. Depending on the available computing resources, it can take days to weeks to train a neural network on a large dataset with many classes such as ImageNet. For researchers experimenting with new algorithmic approaches, this is impractically time consuming and costly. We aim to generate smaller "proxy datas…

    Submitted 11 June, 2019; originally announced June 2019.

  15. arXiv:1903.09244  [pdf, other]

    cs.CL cs.LG

    Low Resource Text Classification with ULMFit and Backtranslation

    Authors: Sam Shleifer

    Abstract: In computer vision, virtually every state-of-the-art deep learning system is trained with data augmentation. In text classification, however, data augmentation is less widely practiced because it must be performed before training and risks introducing label noise. We augment the IMDB movie reviews dataset with examples generated by two families of techniques: random token perturbations introduced…

    Submitted 25 March, 2019; v1 submitted 21 March, 2019; originally announced March 2019.