
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2012.06793 [pdf, other]

Fine-grained Classification via Categorical Memory Networks

Authors: Weijian Deng, Joshua Marsh, Stephen Gould, Liang Zheng

Abstract: Motivated by the desire to exploit patterns shared across classes, we present a simple yet effective class-specific memory module for fine-grained feature learning. The memory module stores the prototypical feature representation for each category as a moving average. We hypothesize that the combination of similarities with respect to each category is itself a useful discriminative cue. To detect these similarities, we use attention as a querying mechanism. The attention scores with respect to each class prototype are used as weights to combine prototypes via weighted sum, producing a uniquely tailored response feature representation for a given input. The original and response features are combined to produce an augmented feature for classification. We integrate our class-specific memory module into a standard convolutional neural network, yielding a Categorical Memory Network. Our memory module significantly improves accuracy over baseline CNNs, achieving competitive accuracy with state-of-the-art methods on four benchmarks, including CUB-200-2011, Stanford Cars, FGVC Aircraft, and NABirds.

Submitted 12 December, 2020; originally announced December 2020.

Comments: 10 pages, 9 figures, 7 tables; this version is not fully edited and will be updated soon
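
Note: the memory mechanism described in the abstract above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the dot-product similarity, the softmax normalization, the momentum value of 0.9, and the use of concatenation to "combine" the original and response features are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_prototype(prototype, feature, momentum=0.9):
    """Moving-average update of one class prototype (momentum is an assumed value)."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def augment(feature, prototypes):
    """Query the memory with attention and build an augmented feature.

    Dot-product similarity and concatenation are assumptions of this sketch.
    """
    # Attention score of the input against each class prototype.
    scores = [sum(f * p for f, p in zip(feature, proto)) for proto in prototypes]
    weights = softmax(scores)
    # Response feature: attention-weighted sum of the prototypes.
    response = [
        sum(w * proto[i] for w, proto in zip(weights, prototypes))
        for i in range(len(feature))
    ]
    # Augmented feature: original and response features side by side.
    return feature + response, weights


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical class prototypes
    augmented, attn = augment([2.0, 0.0], memory)
    print(augmented, attn)
```

In a real network the prototypes would be updated from features of training samples of the matching class, and the augmented feature would feed the classifier head.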

arXiv:cond-mat/0701184 [pdf, ps, other]

doi 10.1063/1.2731718

Transport optimization on complex networks

Authors: Bogdan Danila, Yong Yu, John A. Marsh, Kevin E. Bassler

Abstract: We present a comparative study of the application of a recently introduced heuristic algorithm to the optimization of transport on three major types of complex networks. The algorithm balances network traffic iteratively by minimizing the maximum node betweenness with as little path lengthening as possible. We show that by using this optimal routing, a network can sustain significantly higher traffic without jamming than in the case of shortest path routing. A formula is proved that allows quick computation of the average number of hops along the path and of the average travel times once the betweennesses of the nodes are computed. Using this formula, we show that routing optimization preserves the small-world character exhibited by networks under shortest path routing, and that it significantly reduces the average travel time on congested networks with only a negligible increase in the average travel time at low loads. Finally, we study the correlation between the weights of the links in the case of optimal routing and the betweennesses of the nodes connected by them.

Submitted 9 January, 2007; originally announced January 2007.

Comments: 19 pages, 7 figures

Journal ref: Chaos 17 (2), 026102 (2007)

arXiv:cond-mat/0607017 [pdf, ps, other]

doi 10.1103/PhysRevE.74.046106

Optimal routing on complex networks

Authors: Bogdan Danila, Yong Yu, John A. Marsh, Kevin E. Bassler

Abstract: We present a novel heuristic algorithm for routing optimization on complex networks. Previously proposed routing optimization algorithms aim at avoiding or reducing link overload. Our algorithm balances traffic on a network by minimizing the maximum node betweenness with as little path lengthening as possible, thus being useful in cases when networks are jamming due to queuing overload. By using the resulting routing table, a network can sustain significantly higher traffic without jamming than in the case of traditional shortest path routing.

Submitted 8 July, 2006; v1 submitted 1 July, 2006; originally announced July 2006.

Comments: 4 pages, 5 figures

Journal ref: Phys Rev E 74, 046106 (2006)

arXiv:cond-mat/0605570 [pdf]

Generalized Box-Muller method for generating q-Gaussian random deviates

Authors: William Thistleton, Kenric Nelson, John A. Marsh, Constantino Tsallis

Abstract: Addendum: The generalized Box-Müller algorithm provides a methodology for generating q-Gaussian random variates. The parameter $-\infty<q\leq3$ is related to the shape of the tail decay; $q<1$ for compact-support including parabola $(q=0)$; $1<q\leq3$ for heavy-tail including Cauchy $(q=2)$. This addendum clarifies the transformation $q'=((3q-1)/(q+1))$ within the algorithm is due to a difference in the dimensions d of the generalized logarithm and the generalized distribution. The transformation is clarified by the decomposition of $q=1+2κ/(1+dκ)$, where the shape parameter $-1<κ\leq\infty$ quantifies the magnitude of the deformation from exponential. A simpler specification for the generalized Box-Müller algorithm is provided using the shape of the tail decay. Original: The q-Gaussian distribution is known to be an attractor of certain correlated systems, and is the distribution which, under appropriate constraints, maximizes the entropy Sq, basis of nonextensive statistical mechanics. This theory is postulated as a natural extension of the standard (Boltzmann-Gibbs) statistical mechanics, and may explain the ubiquitous appearance of heavy-tailed distributions in both natural and man-made systems. The q-Gaussian distribution is also used as a numerical tool, for example as a visiting distribution in Generalized Simulated Annealing. We develop and present a simple, easy to implement numerical method for generating random deviates from a q-Gaussian distribution based upon a generalization of the well known Box-Muller method. Our method is suitable for a larger range of q values, q<3, than has previously appeared in the literature, and can generate deviates from q-Gaussian distributions of arbitrary width and center. MATLAB code showing a straightforward implementation is also included.

Submitted 10 February, 2021; v1 submitted 23 May, 2006; originally announced May 2006.

Comments: Addendum: 7 pages including 2 figures; Original: 14 pages including 8 figures and code. Nelson and Thistleton are the authors of the addendum; Thistleton, Marsh, Nelson and Tsallis are the authors of the original
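
Note: the generalized Box-Muller idea in the abstract above can be sketched in a few lines of Python (the paper itself ships MATLAB code, which is not reproduced here). This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the usual q-logarithm $\ln_q(x) = (x^{1-q}-1)/(1-q)$, and it assumes that a target index q is reached by using the auxiliary index $(1+q)/(3-q)$ inside the logarithm, which is the inverse of the transformation $q'=(3q-1)/(q+1)$ quoted in the abstract. The function names are hypothetical, and only standard (unit-width, zero-center) deviates with q<3 are produced.

```python
import math
import random


def q_log(x, q):
    """q-logarithm; reduces to the natural log at q = 1 (assumed definition)."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def q_gaussian(q, rng=random):
    """Draw one standard q-Gaussian deviate for q < 3 (sketch, see assumptions above)."""
    if not q < 3.0:
        raise ValueError("this sketch only handles q < 3")
    # Auxiliary index used inside the q-logarithm (assumed inverse of q' = (3q-1)/(q+1)).
    q_aux = (1.0 + q) / (3.0 - q)
    u1 = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    u2 = rng.random()        # uniform on [0, 1)
    # Radius replaces sqrt(-2 ln u1) of the classical Box-Muller method.
    radius = math.sqrt(max(0.0, -2.0 * q_log(u1, q_aux)))
    return radius * math.cos(2.0 * math.pi * u2)


if __name__ == "__main__":
    generator = random.Random(42)
    print([q_gaussian(2.0, generator) for _ in range(3)])  # heavy-tailed (Cauchy) case
```

At q = 1 the sketch reduces to the classical Box-Muller transform. For q < 1 the deviates are confined to a bounded interval, in line with the compact-support case mentioned in the abstract. Values of q very close to 3 can overflow in the power evaluation and are not handled here.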

arXiv:cond-mat/0603861 [pdf, ps, other]

doi 10.1103/PhysRevE.74.046114

Congestion-gradient driven transport on complex networks

Authors: Bogdan Danila, Yong Yu, Samuel Earl, John A. Marsh, Zoltan Toroczkai, Kevin E. Bassler

Abstract: We present a study of transport on complex networks with routing based on local information. Particles hop from one node of the network to another according to a set of routing rules with different degrees of congestion awareness, ranging from random diffusion to rigid congestion-gradient driven flow. Each node can be either source or destination for particles and all nodes have the same routing capacity, which are features of ad-hoc wireless networks. It is shown that the transport capacity increases when a small amount of congestion awareness is present in the routing rules, and that it then decreases as the routing rules become too rigid when the flow becomes strictly congestion-gradient driven. Therefore, an optimum value of the congestion awareness exists in the routing rules. It is also shown that, in the limit of a large number of nodes, networks using routing based on local information jam at any nonzero load. Finally, we study the correlation between congestion at node level and a betweenness centrality measure.

Submitted 31 March, 2006; originally announced March 2006.

Comments: 11 pages, 8 figures

Journal ref: Phys Rev E 74, 046114 (2006)

Showing 1–6 of 6 results for author: Marsh, J