
Showing 1–49 of 49 results for author: Hendrycks, D

  1. arXiv:2406.04313  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.CY

    Improving Alignment and Robustness with Circuit Breakers

    Authors: Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks

    Abstract: AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers." Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to…

    Submitted 12 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Code and models are available at https://github.com/GraySwanAI/circuit-breakers

  2. arXiv:2403.15447  [pdf, other]

    cs.CL cs.AI

    Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

    Authors: Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li

    Abstract: Compressing high-capability Large Language Models (LLMs) has emerged as a favored strategy for resource-efficient inferences. While state-of-the-art (SoTA) compression methods boast impressive advancements in preserving benign task performance, the potential risks of compression in terms of safety and trustworthiness have been largely neglected. This study conducts the first, thorough evaluation o…

    Submitted 4 June, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

    Comments: Accepted to ICML'24

  3. arXiv:2403.03218  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

    Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, et al. (32 additional authors not shown)

    Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing furthe…

    Submitted 15 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: See the project page at https://wmdp.ai

  4. arXiv:2402.11777  [pdf, other]

    cs.CL cs.AI cs.LG

    Uncovering Latent Human Wellbeing in Language Model Embeddings

    Authors: Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons

    Abstract: Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT…

    Submitted 18 February, 2024; originally announced February 2024.

    Comments: 10 pages, 5 figures, 1 table

    ACM Class: I.2.7

  5. arXiv:2402.04249  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Authors: Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

    Abstract: Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties prev…

    Submitted 26 February, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: Website: https://www.harmbench.org

  6. arXiv:2311.04235  [pdf, other]

    cs.AI cs.CL cs.LG

    Can LLMs Follow Simple Rules?

    Authors: Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner

    Abstract: As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Existing evaluations of adversarial attack…

    Submitted 8 March, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: Project website: https://eecs.berkeley.edu/~normanmu/llm_rules; revised content

  7. arXiv:2310.01405  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.CY

    Representation Engineering: A Top-Down Approach to AI Transparency

    Authors: Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

    Abstract: In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive p…

    Submitted 10 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Code is available at https://github.com/andyzoujm/representation-engineering

  8. Identifying and Mitigating the Security Risks of Generative AI

    Authors: Clark Barrett, Brad Boyd, Elie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, Diyi Yang

    Abstract: Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well…

    Submitted 28 December, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

    Journal ref: Foundations and Trends in Privacy and Security 6 (2023) 1-52

  9. arXiv:2308.14752  [pdf, other]

    cs.CY cs.AI cs.HC

    AI Deception: A Survey of Examples, Risks, and Potential Solutions

    Authors: Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks

    Abstract: This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth. We first survey empirical examples of AI deception, discussing both special-use AI systems (including Meta's CICERO) built for specific competitive situations, and general-purpose AI systems (…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: 18 pages (not including executive summary, references, and appendix), six figures

  10. arXiv:2306.12001  [pdf, other]

    cs.CY cs.AI cs.LG

    An Overview of Catastrophic AI Risks

    Authors: Dan Hendrycks, Mantas Mazeika, Thomas Woodside

    Abstract: Rapid advancements in artificial intelligence (AI) have sparked growing concerns among experts, policymakers, and world leaders regarding the potential for increasingly advanced AI systems to pose catastrophic risks. Although numerous risks have been detailed separately, there is a pressing need for a systematic discussion and illustration of the potential dangers to better inform efforts to mitig…

    Submitted 9 October, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

  11. arXiv:2306.11698  [pdf, other]

    cs.CL cs.AI cs.CR

    DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

    Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

    Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To thi…

    Submitted 26 February, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 Outstanding Paper (Datasets and Benchmarks Track)

  12. arXiv:2304.03279  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

    Authors: Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

    Abstract: Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI,…

    Submitted 12 June, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

    Comments: ICML 2023 Oral (camera-ready); 31 pages, 5 figures

  13. arXiv:2303.16200  [pdf, other]

    cs.CY cs.AI cs.LG cs.NE

    Natural Selection Favors AIs over Humans

    Authors: Dan Hendrycks

    Abstract: For billions of years, evolution has been the driving force behind the development of life, including humans. Evolution endowed humans with high intelligence, which allowed us to become one of the most successful species on the planet. Today, humans aim to create artificial intelligence systems that surpass even our own intelligence. As artificial intelligences (AIs) evolve and eventually surpass…

    Submitted 18 July, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

    Comments: An explainer video corresponding to the paper is available at https://www.youtube.com/watch?v=48h-ySTggE8

  14. arXiv:2301.00876  [pdf, other]

    cs.CL

    MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding

    Authors: Steven H. Wang, Antoine Scardigli, Leonard Tang, Wei Chen, Dimitry Levkin, Anya Chen, Spencer Ball, Thomas Woodside, Oliver Zhang, Dan Hendrycks

    Abstract: Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 3…

    Submitted 24 November, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

    Comments: EMNLP 2023. 5 pages + appendix. Code and dataset are available at https://github.com/TheAtticusProject/maud

  15. arXiv:2210.10039  [pdf, other]

    cs.CV cs.CY cs.LG

    How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

    Authors: Mantas Mazeika, Eric Tang, Andy Zou, Steven Basart, Jun Shern Chan, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks

    Abstract: In recent years, deep neural networks have demonstrated increasingly strong abilities to recognize objects and activities in videos. However, as video understanding becomes widely used in real-world applications, a key consideration is developing human-centric systems that understand not only the content of the video but also how it would affect the wellbeing and emotional state of viewers. To fac…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022; datasets available at https://github.com/hendrycks/emodiversity/

  16. arXiv:2210.07242  [pdf, other]

    cs.CV cs.AI cs.LG

    OpenOOD: Benchmarking Generalized Out-of-Distribution Detection

    Authors: Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Dan Hendrycks, Yixuan Li, Ziwei Liu

    Abstract: Out-of-distribution (OOD) detection is vital to safety-critical machine learning applications and has thus been extensively studied, with a plethora of methods developed in the literature. However, the field currently lacks a unified, strictly formulated, and comprehensive benchmark, which often results in unfair comparisons and inconclusive results. From the problem setting perspective, OOD detec…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted by NeurIPS 2022 Datasets and Benchmarks Track. Codebase: https://github.com/Jingkang50/OpenOOD

  17. arXiv:2206.15474  [pdf, other]

    cs.LG cs.CL

    Forecasting Future World Events with Neural Networks

    Authors: Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

    Abstract: Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing tho…

    Submitted 9 October, 2022; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022; our dataset is available at https://github.com/andyzoujm/autocast

  18. arXiv:2206.08966  [pdf]

    cs.CY cs.AI cs.LG

    Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks

    Authors: Anthony M. Barrett, Dan Hendrycks, Jessica Newman, Brandie Nonnecke

    Abstract: Artificial intelligence (AI) systems can provide many beneficial capabilities but also risks of adverse events. Some AI systems could present risks of events with very high or catastrophic consequences at societal scale. The US National Institute of Standards and Technology (NIST) has been developing the NIST Artificial Intelligence Risk Management Framework (AI RMF) as voluntary guidance on AI ri…

    Submitted 23 February, 2023; v1 submitted 17 June, 2022; originally announced June 2022.

    Comments: 56 pages; updated throughout for general consistency with NIST AI RMF 1.0

  19. arXiv:2206.05862  [pdf, other]

    cs.CY cs.AI cs.LG

    X-Risk Analysis for AI Research

    Authors: Dan Hendrycks, Mantas Mazeika

    Abstract: Artificial intelligence (AI) has the potential to greatly improve society, but as with any powerful technology, it comes with heightened risks and responsibilities. Current AI research lacks a systematic discussion of how to manage long-tail risks from AI systems, including speculative long-term risks. Keeping in mind the potential benefits of AI, there is some concern that building ever more inte…

    Submitted 20 September, 2022; v1 submitted 12 June, 2022; originally announced June 2022.

  20. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  21. arXiv:2112.05135  [pdf, other]

    cs.LG cs.CV

    PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

    Authors: Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt

    Abstract: In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard test set accuracy. These other goals include out-of-distribution (OOD) robustness, prediction consistency, resilience to adversaries, calibrated uncertainty estimates, and the ability to detect anomalous inputs. However, improving performance towards these goals is often…

    Submitted 29 March, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

    Comments: CVPR 2022. Code and models are available at https://github.com/andyzoujm/pixmix

  22. arXiv:2112.00659  [pdf, other]

    cs.LG cs.AI cs.CR

    Certified Adversarial Defenses Meet Out-of-Distribution Corruptions: Benchmarking Robustness and Simple Baselines

    Authors: Jiachen Sun, Akshay Mehra, Bhavya Kailkhura, Pin-Yu Chen, Dan Hendrycks, Jihun Hamm, Z. Morley Mao

    Abstract: Certified robustness guarantee gauges a model's robustness to test-time attacks and can assess the model's readiness for deployment in the real world. In this work, we critically examine how the adversarial robustness guarantees from randomized smoothing-based certification methods change when state-of-the-art certifiably robust models encounter out-of-distribution (OOD) data. Our analysis demonst…

    Submitted 1 December, 2021; originally announced December 2021.

    Comments: 21 pages, 15 figures, and 9 tables

  23. arXiv:2110.14051  [pdf, other]

    cs.CV cs.LG

    A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges

    Authors: Mohammadreza Salehi, Hossein Mirzaei, Dan Hendrycks, Yixuan Li, Mohammad Hossein Rohban, Mohammad Sabokrou

    Abstract: Machine learning models often encounter samples that are diverged from the training distribution. Failure to recognize an out-of-distribution (OOD) sample, and consequently assign that sample to an in-class label significantly compromises the reliability of a model. The problem has gained significant attention due to its importance for safety deploying models in open-world settings. Detecting OOD…

    Submitted 3 December, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: Published in Transaction on Machine Learning (TMLR)

  24. arXiv:2110.13136  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    What Would Jiminy Cricket Do? Towards Agents That Behave Morally

    Authors: Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt

    Abstract: When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environme…

    Submitted 7 February, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2021. Environments available here https://github.com/hendrycks/jiminy-cricket

  25. arXiv:2109.13916  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Unsolved Problems in ML Safety

    Authors: Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

    Abstract: Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the tec…

    Submitted 16 June, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: Position Paper

  26. arXiv:2107.11011  [pdf, other]

    cs.LG

    VisDA-2021 Competition Universal Domain Adaptation to Improve Performance on Out-of-Distribution Data

    Authors: Dina Bashkirova, Dan Hendrycks, Donghyun Kim, Samarth Mishra, Kate Saenko, Kuniaki Saito, Piotr Teterwak, Ben Usman

    Abstract: Progress in machine learning is typically measured by training and testing a model on the same distribution of data, i.e., the same domain. This over-estimates future accuracy on out-of-distribution data. The Visual Domain Adaptation (VisDA) 2021 competition tests models' ability to adapt to novel test distributions and handle distributional shift. We set up unsupervised domain adaptation challeng…

    Submitted 22 July, 2021; originally announced July 2021.

    Comments: Neurips 2021 Competition Track

  27. arXiv:2105.09938  [pdf, other]

    cs.SE cs.CL cs.LG

    Measuring Coding Challenge Competence With APPS

    Authors: Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

    Abstract: While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for c…

    Submitted 8 November, 2021; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: NeurIPS 2021. Code and the APPS dataset is available at https://github.com/hendrycks/apps

  28. arXiv:2103.06268  [pdf, other]

    cs.CL cs.LG

    CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review

    Authors: Dan Hendrycks, Collin Burns, Anya Chen, Spencer Ball

    Abstract: Many specialized domains remain untouched by deep learning, as large labeled datasets require expensive expert annotators. We address this bottleneck within the legal domain by introducing the Contract Understanding Atticus Dataset (CUAD), a new dataset for legal contract review. CUAD was created with dozens of legal experts from The Atticus Project and consists of over 13,000 annotations. The tas…

    Submitted 8 November, 2021; v1 submitted 10 March, 2021; originally announced March 2021.

    Comments: NeurIPS 2021. Code and the CUAD dataset are available at https://github.com/TheAtticusProject/cuad/

  29. arXiv:2103.03874  [pdf, other]

    cs.LG cs.AI cs.CL

    Measuring Mathematical Problem Solving With the MATH Dataset

    Authors: Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

    Abstract: Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanati…

    Submitted 8 November, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

    Comments: NeurIPS 2021. Code and the MATH dataset is available at https://github.com/hendrycks/math/

  30. arXiv:2009.03300  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Measuring Massive Multitask Language Understanding

    Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

    Abstract: We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over…

    Submitted 12 January, 2021; v1 submitted 7 September, 2020; originally announced September 2020.

    Comments: ICLR 2021; the test and code is available at https://github.com/hendrycks/test

  31. arXiv:2008.02275  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Aligning AI With Shared Human Values

    Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

    Abstract: We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable…

    Submitted 17 February, 2023; v1 submitted 5 August, 2020; originally announced August 2020.

    Comments: ICLR 2021; the ETHICS dataset is available at https://github.com/hendrycks/ethics/

  32. arXiv:2006.16241  [pdf, other]

    cs.CV cs.LG stat.ML

    The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

    Authors: Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer

    Abstract: We introduce four new real-world distribution shift datasets consisting of changes in image style, image blurriness, geographic location, camera operation, and more. With our new datasets, we take stock of previously proposed methods for improving out-of-distribution robustness and put them to the test. We find that using larger models and artificial data augmentations can improve robustness on re…

    Submitted 24 July, 2021; v1 submitted 29 June, 2020; originally announced June 2020.

    Comments: ICCV 2021; Datasets, code, and models available at https://github.com/hendrycks/imagenet-r

  33. arXiv:2004.06100  [pdf, other]

    cs.CL cs.LG

    Pretrained Transformers Improve Out-of-Distribution Robustness

    Authors: Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song

    Abstract: Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and…

    Submitted 16 April, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

    Comments: ACL 2020

  34. arXiv:1912.02781  [pdf, other]

    stat.ML cs.CV cs.LG

    AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

    Authors: Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan

    Abstract: Modern deep neural networks can achieve high accuracy when the training distribution and test distribution are identically distributed, but this assumption is frequently violated in practice. When the train and test distributions are mismatched, accuracy can plummet. Currently there are few techniques that improve robustness to unforeseen data shifts encountered during deployment. In this work, we…

    Submitted 17 February, 2020; v1 submitted 5 December, 2019; originally announced December 2019.

    Comments: Code available at https://github.com/google-research/augmix

  35. arXiv:1911.11132  [pdf, other]

    cs.CV cs.LG

    Scaling Out-of-Distribution Detection for Real-World Settings

    Authors: Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joe Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song

    Abstract: Detecting out-of-distribution examples is important for safety-critical machine learning applications such as detecting novel biological phenomena and self-driving cars. However, existing research mainly focuses on simple small-scale settings. To set the stage for more realistic out-of-distribution detection, we depart from small-scale settings and explore large-scale multiclass and multi-label se…

    Submitted 15 May, 2022; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: ICML 2022; The Species dataset and code are available at https://github.com/hendrycks/anomaly-seg

  36. arXiv:1908.08016  [pdf, other]

    cs.LG cs.CR cs.CV stat.ML

    Testing Robustness Against Unforeseen Adversaries

    Authors: Max Kaufmann, Daniel Kang, Yi Sun, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks

    Abstract: Adversarial robustness research primarily focuses on L_p perturbations, and most defenses are developed with identical training-time and test-time adversaries. However, in real-world applications developers are unlikely to have access to the full range of attacks or corruptions their system will face. Furthermore, worst-case inputs are likely to be diverse and need not be constrained to the L_p ba…

    Submitted 30 October, 2023; v1 submitted 21 August, 2019; originally announced August 2019.

    Comments: Datasets available at https://github.com/centerforaisafety/adversarial-corruptions

  37. arXiv:1907.07174  [pdf, other]

    cs.LG cs.CV stat.ML

    Natural Adversarial Examples

    Authors: Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, Dawn Song

    Abstract: We introduce two challenging datasets that reliably cause machine learning model performance to substantially degrade. The datasets are collected with a simple adversarial filtration technique to create datasets with limited spurious cues. Our datasets' real-world, unmodified examples transfer to various unseen models reliably, demonstrating that computer vision models have shared weaknesses. The…

    Submitted 4 March, 2021; v1 submitted 16 July, 2019; originally announced July 2019.

    Comments: CVPR 2021; dataset and code available at https://github.com/hendrycks/natural-adv-examples

  38. arXiv:1906.12340  [pdf, other]

    cs.LG cs.CV stat.ML

    Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

    Authors: Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, Dawn Song

    Abstract: Self-supervision provides effective representations for downstream tasks without requiring labels. However, existing approaches lag behind fully supervised training and are often not thought beneficial beyond obviating or reducing the need for annotations. We find that self-supervision can benefit robustness in a variety of ways, including robustness to adversarial examples, label corruption, and…

    Submitted 29 October, 2019; v1 submitted 28 June, 2019; originally announced June 2019.

    Comments: NeurIPS 2019; code and data available at https://github.com/hendrycks/ss-ood

  39. arXiv:1905.01034  [pdf, other]

    cs.LG cs.AI cs.CR stat.ML

    Transfer of Adversarial Robustness Between Perturbation Types

    Authors: Daniel Kang, Yi Sun, Tom Brown, Dan Hendrycks, Jacob Steinhardt

    Abstract: We study the transfer of adversarial robustness of deep neural networks between different perturbation types. While most work on adversarial examples has focused on $L_\infty$ and $L_2$-bounded perturbations, these do not capture all types of perturbations available to an adversary. The present work evaluates 32 attacks of 5 different types against models adversarially trained on a 100-class subse…

    Submitted 3 May, 2019; originally announced May 2019.

    Comments: 11 pages, 6 figures

  40. arXiv:1903.12261  [pdf, other]

    cs.LG cs.CV stat.ML

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    Authors: Dan Hendrycks, Thomas Dietterich

    Abstract: In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Then we propose a new dataset called ImageNet-P which enables researchers to benchmark a classifier's robustness to common perturbations. Unlike rece…

    Submitted 28 March, 2019; originally announced March 2019.

    Comments: ICLR 2019 camera-ready; datasets available at https://github.com/hendrycks/robustness ; this article supersedes arXiv:1807.01697

  41. arXiv:1901.09960  [pdf, other]

    cs.LG cs.CV stat.ML

    Using Pre-Training Can Improve Model Robustness and Uncertainty

    Authors: Dan Hendrycks, Kimin Lee, Mantas Mazeika

    Abstract: He et al. (2018) have called into question the utility of pre-training by showing that training from scratch can often yield similar performance to pre-training. We show that although pre-training may not improve performance on traditional classification metrics, it improves model robustness and uncertainty estimates. Through extensive experiments on adversarial examples, label corruption, class i…

    Submitted 20 October, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

    Comments: ICML 2019. PyTorch code here: https://github.com/hendrycks/pre-training ; Figure 3 updated

  42. arXiv:1812.04606  [pdf, other]

    cs.LG cs.CL cs.CV stat.ML

    Deep Anomaly Detection with Outlier Exposure

    Authors: Dan Hendrycks, Mantas Mazeika, Thomas Dietterich

    Abstract: It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training ano…

    Submitted 28 January, 2019; v1 submitted 11 December, 2018; originally announced December 2018.

    Comments: ICLR 2019; PyTorch code available at https://github.com/hendrycks/outlier-exposure

  43. arXiv:1808.00529  [pdf, other]

    cs.LG stat.ML

    Open Category Detection with PAC Guarantees

    Authors: Si Liu, Risheek Garrepalli, Thomas G. Dietterich, Alan Fern, Dan Hendrycks

    Abstract: Open category detection is the problem of detecting "alien" test instances that belong to categories or classes that were not present in the training data. In many applications, reliably detecting such aliens is central to ensuring the safety and accuracy of test set predictions. Unfortunately, there are no algorithms that provide theoretical guarantees on their ability to detect aliens under gene…

    Submitted 1 August, 2018; originally announced August 2018.

  44. arXiv:1807.01697  [pdf, other]

    cs.LG cs.AI cs.CV cs.NE stat.ML

    Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations

    Authors: Dan Hendrycks, Thomas G. Dietterich

    Abstract: In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Unlike recent robustness research, this benchmark evaluates performance on commonplace corruptions not worst-case adversarial corruptions. We find th…

    Submitted 27 April, 2019; v1 submitted 4 July, 2018; originally announced July 2018.

    Comments: Superseded by _Benchmarking Neural Network Robustness to Common Corruptions and Perturbations_ arXiv:1903.12261

  45. arXiv:1802.05300  [pdf, other]

    cs.LG cs.CL cs.CV cs.NE

    Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise

    Authors: Dan Hendrycks, Mantas Mazeika, Duncan Wilson, Kevin Gimpel

    Abstract: The growing importance of massive datasets used for deep learning makes robustness to label noise a critical property for classifiers to have. Sources of label noise include automatic labeling, non-expert labeling, and label corruption by data poisoning adversaries. Numerous previous works assume that no source of labels can be trusted. We relax this assumption and assume that a small subset of th…

    Submitted 28 January, 2019; v1 submitted 14 February, 2018; originally announced February 2018.

    Comments: NeurIPS 2018. PyTorch code available at https://github.com/mmazeika/glc

  46. arXiv:1610.02136  [pdf, other]

    cs.NE cs.CV cs.LG

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    Authors: Dan Hendrycks, Kevin Gimpel

    Abstract: We consider the two related problems of detecting if an example is misclassified or out-of-distribution. We present a simple baseline that utilizes probabilities from softmax distributions. Correctly classified examples tend to have greater maximum softmax probabilities than erroneously classified and out-of-distribution examples, allowing for their detection. We assess performance by defining sev…

    Submitted 3 October, 2018; v1 submitted 7 October, 2016; originally announced October 2016.

    Comments: Published as a conference paper at ICLR 2017. 1 Figure in 1 Appendix. Minor changes from the previous version

    Journal ref: International Conference on Learning Representations 2017

  47. arXiv:1608.00530  [pdf, other]

    cs.LG cs.CR cs.CV cs.NE

    Early Methods for Detecting Adversarial Images

    Authors: Dan Hendrycks, Kevin Gimpel

    Abstract: Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will…

    Submitted 23 March, 2017; v1 submitted 1 August, 2016; originally announced August 2016.

    Comments: ICLR 2017 Workshop Contribution

  48. arXiv:1607.02488  [pdf, other]

    cs.LG cs.NE

    Adjusting for Dropout Variance in Batch Normalization and Weight Initialization

    Authors: Dan Hendrycks, Kevin Gimpel

    Abstract: We show how to adjust for the variance introduced by dropout with corrections to weight initialization and Batch Normalization, yielding higher accuracy. Though dropout can preserve the expected input to a neuron between train and test, the variance of the input differs. We thus propose a new weight initialization by correcting for the influence of dropout rates and an arbitrary nonlinearity's inf…

    Submitted 23 March, 2017; v1 submitted 8 July, 2016; originally announced July 2016.

  49. arXiv:1606.08415  [pdf, other]

    cs.LG

    Gaussian Error Linear Units (GELUs)

    Authors: Dan Hendrycks, Kevin Gimpel

    Abstract: We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $xΦ(x)$, where $Φ(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by their sign as in ReLUs ($x\mathbf{1}_{x>0}$). We perform an empirical evaluation of the GELU nonlinearity…

    Submitted 5 June, 2023; v1 submitted 27 June, 2016; originally announced June 2016.

    Comments: Trimmed version of 2016 draft
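
    The GELU definition quoted in entry 49, $xΦ(x)$, can be illustrated with a minimal sketch. It assumes scalar inputs and computes Φ with the error function from Python's standard library. The function names `gelu` and `relu` are illustrative choices for this listing, not taken from the paper's code.

```python
import math


def gelu(x: float) -> float:
    """GELU as stated in the abstract: x * Phi(x), where Phi is the
    standard Gaussian CDF, written here via the error function."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


def relu(x: float) -> float:
    """ReLU for comparison: x * 1_{x > 0}, which gates the input by its sign."""
    return x if x > 0 else 0.0


if __name__ == "__main__":
    # Print both activations side by side on a few sample inputs.
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={value:+.1f}  gelu={gelu(value):+.6f}  relu={relu(value):+.6f}")
```

    Unlike ReLU, this function returns small negative values for negative inputs, because the input is scaled by Φ(x) instead of being cut off at zero.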