Showing 1–27 of 27 results for author: Keskar, N S

  1. arXiv:2303.08774  [pdf, other]

    cs.CL cs.AI

    GPT-4 Technical Report

    Authors: OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko , et al. (256 additional authors not shown)

    Abstract: We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based mo…

    Submitted 4 March, 2024; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: 100 pages; updated authors list; fixed author names and added citation

  2. arXiv:2208.03645  [pdf, other]

    cs.IR cs.AI

    Generating Negative Samples for Sequential Recommendation

    Authors: Yongjun Chen, Jia Li, Zhiwei Liu, Nitish Shirish Keskar, Huan Wang, Julian McAuley, Caiming Xiong

    Abstract: To make Sequential Recommendation (SR) successful, recent works focus on designing effective sequential encoders, fusing side information, and mining extra positive self-supervision signals. The strategy of sampling negative items at each time step is less explored. Due to the dynamics of users' interests and model updates during training, considering randomly sampled items from a user's non-inter…

    Submitted 7 August, 2022; originally announced August 2022.

  3. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  4. arXiv:2205.09226  [pdf, other]

    cs.CL

    Modeling Multi-hop Question Answering as Single Sequence Prediction

    Authors: Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, Nitish Shirish Keskar, Caiming Xiong

    Abstract: Fusion-in-decoder (Fid) (Izacard and Grave, 2020) is a generative question answering (QA) model that leverages passage retrieval with a pre-trained transformer and pushed the state of the art on single-hop QA. However, the complexity of multi-hop QA hinders the effectiveness of the generative QA approach. In this work, we propose a simple generative approach (PathFid) that extends the task beyond…

    Submitted 18 May, 2022; originally announced May 2022.

    Comments: ACL 2022

  5. arXiv:2111.10497  [pdf, ps, other]

    cs.CL

    Combining Data-driven Supervision with Human-in-the-loop Feedback for Entity Resolution

    Authors: Wenpeng Yin, Shelby Heinecke, Jia Li, Nitish Shirish Keskar, Michael Jones, Shouzhong Shi, Stanislav Georgiev, Kurt Milich, Joseph Esposito, Caiming Xiong

    Abstract: The distribution gap between training datasets and data encountered in production is well acknowledged. Training datasets are often constructed over a fixed period of time and by carefully curating the data to be labeled. Thus, training datasets may not contain all possible variations of data that could be encountered in real-world production environments. Tasked with building an entity resolution…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Camera-ready for Data-Centric AI Workshop at NeurIPS 2021

  6. arXiv:2010.12885  [pdf, other]

    cs.CL

    Unsupervised Paraphrasing with Pretrained Language Models

    Authors: Tong Niu, Semih Yavuz, Yingbo Zhou, Nitish Shirish Keskar, Huan Wang, Caiming Xiong

    Abstract: Paraphrase generation has benefited extensively from recent progress in the designing of training objectives and model architectures. However, previous explorations have largely focused on supervised methods, which require a large amount of labeled data that is costly to collect. To address this drawback, we adopt a transfer learning approach and propose a training pipeline that enables pre-traine…

    Submitted 10 September, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

    Comments: Accepted at EMNLP 2021 main conference

  7. arXiv:2009.06367  [pdf, other]

    cs.CL cs.LG

    GeDi: Generative Discriminator Guided Sequence Generation

    Authors: Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, Nazneen Fatema Rajani

    Abstract: While large-scale language models (LMs) are able to imitate the distribution of natural language well enough to generate realistic text, it is difficult to control which regions of the distribution they generate. This is especially problematic because datasets used for training large LMs usually contain significant toxicity, hate, bias, and negativity. We propose GeDi as an efficient method for us…

    Submitted 22 October, 2020; v1 submitted 14 September, 2020; originally announced September 2020.

  8. arXiv:2007.14966  [pdf, other]

    cs.CL cs.IT

    Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity

    Authors: Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, Lav R. Varshney

    Abstract: Neural text decoding is important for generating high-quality texts using language models. To generate high-quality text, popular decoding algorithms like top-k, top-p (nucleus), and temperature-based sampling truncate or distort the unreliable low probability tail of the language model. Though these methods generate high-quality text after parameter tuning, they are ad hoc. Not much is known abou…

    Submitted 14 January, 2021; v1 submitted 29 July, 2020; originally announced July 2020.

    Comments: 25 pages, 12 figures

  9. arXiv:2004.03497  [pdf, other]

    q-bio.BM cs.LG stat.ML

    ProGen: Language Modeling for Protein Generation

    Authors: Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, Richard Socher

    Abstract: Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations. We train a 1.2B-parameter language model, ProGen, on ~280M protein sequences condit…

    Submitted 7 March, 2020; originally announced April 2020.

  10. arXiv:2002.03438  [pdf, ps, other]

    cs.CL cs.CY cs.LG

    Limits of Detecting Text Generated by Large-Scale Language Models

    Authors: Lav R. Varshney, Nitish Shirish Keskar, Richard Socher

    Abstract: Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated. We show that error exponents for particular language models are bounded in terms of their perplexity, a s…

    Submitted 9 February, 2020; originally announced February 2020.

    Comments: ITA 2020

  11. arXiv:1910.10245  [pdf, other]

    stat.ML cs.LG

    Global Capacity Measures for Deep ReLU Networks via Path Sampling

    Authors: Ryan Theisen, Jason M. Klusowski, Huan Wang, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

    Abstract: Classical results on the statistical complexity of linear models have commonly identified the norm of the weights $\|w\|$ as a fundamental capacity measure. Generalizations of this measure to the setting of deep networks have been varied, though a frequently identified quantity is the product of weight norms of each layer. In this work, we show that for a large class of networks possessing a posit…

    Submitted 22 October, 2019; originally announced October 2019.

  12. arXiv:1909.05858  [pdf, other]

    cs.CL

    CTRL: A Conditional Transformer Language Model for Controllable Generation

    Authors: Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher

    Abstract: Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw t…

    Submitted 20 September, 2019; v1 submitted 11 September, 2019; originally announced September 2019.

  13. arXiv:1909.03290  [pdf, other]

    cs.CY cs.AI cs.LG

    Pretrained AI Models: Performativity, Mobility, and Change

    Authors: Lav R. Varshney, Nitish Shirish Keskar, Richard Socher

    Abstract: The paradigm of pretrained deep learning models has recently emerged in artificial intelligence practice, allowing deployment in numerous societal settings with limited computational resources, but also embedding biases and enabling unintended negative uses. In this paper, we treat pretrained models as objects of study and discuss the ethical impacts of their sociological position. We discuss how…

    Submitted 7 September, 2019; originally announced September 2019.

  14. arXiv:1908.08960  [pdf, other]

    cs.CL

    Neural Text Summarization: A Critical Evaluation

    Authors: Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher

    Abstract: Text summarization aims at compressing long documents into a shorter form that conveys the most important parts of the original document. Despite increased interest in the community and notable research effort, progress on benchmark datasets has stagnated. We critically evaluate key ingredients of the current research setup: datasets, evaluation metrics, and models, and highlight three primary sho…

    Submitted 23 August, 2019; originally announced August 2019.

    Comments: To appear in EMNLP 2019, 13 pages, 2 figures, 6 tables

  15. arXiv:1905.11471  [pdf, other]

    cs.CL cs.AI cs.LG

    XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering

    Authors: Jasdeep Singh, Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

    Abstract: While natural language processing systems often focus on a single language, multilingual transfer learning has the potential to improve performance, especially for low-resource languages. We introduce XLDA, cross-lingual data augmentation, a method that replaces a segment of the input text with its translation in another language. XLDA enhances performance of all 14 tested languages of the cross-l…

    Submitted 27 May, 2019; originally announced May 2019.

  16. arXiv:1904.09286  [pdf, other]

    cs.CL

    Unifying Question Answering, Text Classification, and Regression via Span Extraction

    Authors: Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher

    Abstract: Even as pre-trained language encoders such as BERT are shared across many tasks, the output layers of question answering, text classification, and regression models are significantly different. Span decoders are frequently used for question answering, fixed-class classification layers for text classification, and similarity-scoring layers for regression tasks. We show that this distinction is not…

    Submitted 20 September, 2019; v1 submitted 19 April, 2019; originally announced April 2019.

    Comments: updating paper to also include regression tasks

  17. arXiv:1901.00603  [pdf, other]

    cs.CL cs.AI

    Coarse-grain Fine-grain Coattention Network for Multi-evidence Question Answering

    Authors: Victor Zhong, Caiming Xiong, Nitish Shirish Keskar, Richard Socher

    Abstract: End-to-end neural models have made significant progress in question answering, however recent studies show that these models implicitly assume that the answer and evidence appear close together in a single document. In this work, we propose the Coarse-grain Fine-grain Coattention Network (CFC), a new question answering model that combines information from evidence across multiple documents. The CF…

    Submitted 13 May, 2019; v1 submitted 2 January, 2019; originally announced January 2019.

    Comments: ICLR 2019; 9 pages, 7 figures

  18. arXiv:1810.13243  [pdf, other]

    cs.LG stat.ML

    A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

    Authors: Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

    Abstract: The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining these strategies can aid our understanding of deep learning landscapes and the training dyna…

    Submitted 29 October, 2018; originally announced October 2018.

    Comments: We use empirical tools of mode connectivity and SVCCA to investigate neural network training heuristics of learning rate restarts, warmup and knowledge distillation. arXiv admin note: text overlap with arXiv:1806.06977

  19. arXiv:1809.07402  [pdf, other]

    cs.LG stat.ML

    Identifying Generalization Properties in Neural Networks

    Authors: Huan Wang, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

    Abstract: While it has not yet been proven, empirical evidence suggests that model generalization is related to local properties of the optima which can be described via the Hessian. We connect model generalization with the local property of a solution under the PAC-Bayes paradigm. In particular, we prove that model generalization ability is related to the Hessian, the higher-order "smoothness" terms charac…

    Submitted 19 September, 2018; originally announced September 2018.

    Comments: 23 pages

  20. arXiv:1806.08730  [pdf, other]

    cs.CL cs.AI cs.LG stat.ML

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Authors: Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

    Abstract: Deep learning has improved performance on many natural language processing (NLP) tasks individually. However, general NLP models cannot emerge within a paradigm that focuses on the particularities of a single metric, dataset, and task. We introduce the Natural Language Decathlon (decaNLP), a challenge that spans ten tasks: question answering, machine translation, summarization, natural language in…

    Submitted 20 June, 2018; originally announced June 2018.

  21. arXiv:1806.06977  [pdf, ps, other]

    cs.LG stat.ML

    Using Mode Connectivity for Loss Landscape Analysis

    Authors: Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

    Abstract: Mode connectivity is a recently introduced framework that empirically establishes the connectedness of minima by finding a high accuracy curve between two independently trained models. To investigate the limits of this setup, we examine the efficacy of this technique in extreme cases where the input models are trained or initialized differently. We find that the procedure is resilient to such…

    Submitted 18 June, 2018; originally announced June 2018.

    Comments: Accepted as a workshop paper at ICML's Workshop on Modern Trends in Nonconvex Optimization for Machine Learning, 2018

  22. arXiv:1803.08240  [pdf, other]

    cs.CL cs.AI cs.NE

    An Analysis of Neural Language Modeling at Multiple Scales

    Authors: Stephen Merity, Nitish Shirish Keskar, Richard Socher

    Abstract: Many of the leading approaches in language modeling introduce novel, complex and specialized architectures. We take existing state-of-the-art word level language models based on LSTMs and QRNNs and extend them to both larger vocabularies as well as character-level granularity. When properly tuned, LSTMs and QRNNs achieve state-of-the-art results on character-level (Penn Treebank, enwik8) and word-…

    Submitted 22 March, 2018; originally announced March 2018.

  23. arXiv:1712.07628  [pdf, other]

    cs.LG math.OC

    Improving Generalization Performance by Switching from Adam to SGD

    Authors: Nitish Shirish Keskar, Richard Socher

    Abstract: Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to Stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches…

    Submitted 20 December, 2017; originally announced December 2017.

  24. arXiv:1711.02132  [pdf, other]

    cs.AI cs.CL

    Weighted Transformer Network for Machine Translation

    Authors: Karim Ahmed, Nitish Shirish Keskar, Richard Socher

    Abstract: State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recursion. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine tran…

    Submitted 6 November, 2017; originally announced November 2017.

  25. arXiv:1708.02182  [pdf, ps, other]

    cs.CL cs.LG cs.NE

    Regularizing and Optimizing LSTM Language Models

    Authors: Stephen Merity, Nitish Shirish Keskar, Richard Socher

    Abstract: Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose th…

    Submitted 7 August, 2017; originally announced August 2017.

  26. arXiv:1609.04836  [pdf, other]

    cs.LG math.OC

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    Authors: Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang

    Abstract: The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the mod…

    Submitted 9 February, 2017; v1 submitted 15 September, 2016; originally announced September 2016.

    Comments: Accepted as a conference paper at ICLR 2017

  27. arXiv:1511.01169  [pdf, ps, other]

    cs.LG math.OC stat.ML

    adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs

    Authors: Nitish Shirish Keskar, Albert S. Berahas

    Abstract: Recurrent Neural Networks (RNNs) are powerful models that achieve exceptional performance on several pattern recognition problems. However, the training of RNNs is a computationally difficult task owing to the well-known "vanishing/exploding" gradient problem. Algorithms proposed for training RNNs either exploit no (or limited) curvature information and have cheap per-iteration complexity, or atte…

    Submitted 23 February, 2016; v1 submitted 3 November, 2015; originally announced November 2015.