-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Fine-grained Classification via Categorical Memory Networks
Authors:
Weijian Deng,
Joshua Marsh,
Stephen Gould,
Liang Zheng
Abstract:
Motivated by the desire to exploit patterns shared across classes, we present a simple yet effective class-specific memory module for fine-grained feature learning. The memory module stores the prototypical feature representation for each category as a moving average. We hypothesize that the combination of similarities with respect to each category is itself a useful discriminative cue. To detect these similarities, we use attention as a querying mechanism. The attention scores with respect to each class prototype are used as weights to combine prototypes via weighted sum, producing a uniquely tailored response feature representation for a given input. The original and response features are combined to produce an augmented feature for classification. We integrate our class-specific memory module into a standard convolutional neural network, yielding a Categorical Memory Network. Our memory module significantly improves accuracy over baseline CNNs, achieving competitive accuracy with state-of-the-art methods on four benchmarks, including CUB-200-2011, Stanford Cars, FGVC Aircraft, and NABirds.
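The memory module described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CategoricalMemory`, the momentum value, the dot-product softmax attention, and the use of concatenation to combine original and response features are all hypothetical choices that the abstract does not specify.

```python
import math


class CategoricalMemory:
    """Hypothetical per-class prototype memory queried with attention."""

    def __init__(self, num_classes, dim, momentum=0.9):
        self.momentum = momentum  # assumed moving-average coefficient
        self.prototypes = [[0.0] * dim for _ in range(num_classes)]

    def update(self, feature, label):
        """Move the prototype of `label` toward `feature` (moving average)."""
        m = self.momentum
        old = self.prototypes[label]
        self.prototypes[label] = [m * p + (1.0 - m) * f for p, f in zip(old, feature)]

    def attention(self, feature):
        """Softmax over dot-product similarities to each class prototype."""
        scores = [sum(f * p for f, p in zip(feature, proto)) for proto in self.prototypes]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def respond(self, feature):
        """Attention-weighted sum of prototypes: the response feature."""
        weights = self.attention(feature)
        return [
            sum(w * proto[i] for w, proto in zip(weights, self.prototypes))
            for i in range(len(feature))
        ]

    def augment(self, feature):
        """Combine original and response features (concatenation is assumed)."""
        return list(feature) + self.respond(feature)
```

In a full network the input feature would come from a CNN backbone and the augmented feature would feed a classifier; both are omitted here.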
Submitted 12 December, 2020;
originally announced December 2020.
-
Transport optimization on complex networks
Authors:
Bogdan Danila,
Yong Yu,
John A. Marsh,
Kevin E. Bassler
Abstract:
We present a comparative study of the application of a recently introduced heuristic algorithm to the optimization of transport on three major types of complex networks. The algorithm balances network traffic iteratively by minimizing the maximum node betweenness with as little path lengthening as possible. We show that by using this optimal routing, a network can sustain significantly higher traffic without jamming than in the case of shortest path routing. A formula is proved that allows quick computation of the average number of hops along the path and of the average travel times once the betweennesses of the nodes are computed. Using this formula, we show that routing optimization preserves the small-world character exhibited by networks under shortest path routing, and that it significantly reduces the average travel time on congested networks with only a negligible increase in the average travel time at low loads. Finally, we study the correlation between the weights of the links in the case of optimal routing and the betweennesses of the nodes connected by them.
Submitted 9 January, 2007;
originally announced January 2007.
-
Optimal routing on complex networks
Authors:
Bogdan Danila,
Yong Yu,
John A. Marsh,
Kevin E. Bassler
Abstract:
We present a novel heuristic algorithm for routing optimization on complex networks. Previously proposed routing optimization algorithms aim at avoiding or reducing link overload. Our algorithm balances traffic on a network by minimizing the maximum node betweenness with as little path lengthening as possible, thus being useful in cases when networks are jamming due to queuing overload. By using the resulting routing table, a network can sustain significantly higher traffic without jamming than in the case of traditional shortest path routing.
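The idea of balancing traffic by lowering the maximum node betweenness can be illustrated with a small sketch. The abstract does not spell out the update rule, so the rule used here is an assumption: start from unit link weights, find the node with the highest betweenness under weighted shortest-path routing, add one to the weight of each of its links, and repeat, keeping the best weights seen. The function names `node_betweenness` and `balance_routing` are hypothetical, and betweenness is computed Brandes-style, excluding path endpoints.

```python
import heapq


def node_betweenness(adj, weight):
    """Betweenness of each node under weighted shortest-path routing."""
    bet = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order, done = [], set()
        heap = [(0, s)]
        while heap:
            d, v = heapq.heappop(heap)
            if v in done:
                continue  # stale heap entry
            done.add(v)
            order.append(v)
            for w in adj[v]:
                nd = d + weight[frozenset((v, w))]
                if w not in dist or nd < dist[w]:
                    dist[w], sigma[w], preds[w] = nd, sigma[v], [v]
                    heapq.heappush(heap, (nd, w))
                elif nd == dist[w] and w not in done:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bet[w] += delta[w]
    return bet


def balance_routing(adj, iterations):
    """Assumed heuristic: penalize links of the busiest node, keep the best."""
    weight = {frozenset((v, w)): 1 for v in adj for w in adj[v]}
    best_max, best_weight = None, None
    for _ in range(iterations + 1):
        bet = node_betweenness(adj, weight)
        hub = max(bet, key=bet.get)
        if best_max is None or bet[hub] < best_max:
            best_max, best_weight = bet[hub], dict(weight)
        for w in adj[hub]:
            weight[frozenset((hub, w))] += 1  # lengthen paths through the hub
    return best_max, best_weight
```

On a small wheel graph (a hub joined to every node of a four-node ring), one round of this rule moves traffic off the hub and onto the ring, which lowers the maximum betweenness relative to unit-weight shortest-path routing.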
Submitted 8 July, 2006; v1 submitted 1 July, 2006;
originally announced July 2006.
-
Generalized Box-Muller method for generating q-Gaussian random deviates
Authors:
William Thistleton,
Kenric Nelson,
John A. Marsh,
Constantino Tsallis
Abstract:
Addendum: The generalized Box-Müller algorithm provides a methodology for generating q-Gaussian random variates. The parameter $-\infty<q\leq3$ is related to the shape of the tail decay; $q<1$ gives compact support, including the parabola $(q=0)$; $1<q\leq3$ gives heavy tails, including the Cauchy $(q=2)$. This addendum clarifies that the transformation $q'=(3q-1)/(q+1)$ within the algorithm is due to a difference in the dimensions $d$ of the generalized logarithm and the generalized distribution. The transformation is clarified by the decomposition $q=1+2\kappa/(1+d\kappa)$, where the shape parameter $-1<\kappa\leq\infty$ quantifies the magnitude of the deformation from exponential. A simpler specification for the generalized Box-Müller algorithm is provided using the shape of the tail decay.
Original: The q-Gaussian distribution is known to be an attractor of certain correlated systems, and is the distribution which, under appropriate constraints, maximizes the entropy $S_q$, the basis of nonextensive statistical mechanics. This theory is postulated as a natural extension of the standard (Boltzmann-Gibbs) statistical mechanics, and may explain the ubiquitous appearance of heavy-tailed distributions in both natural and man-made systems. The q-Gaussian distribution is also used as a numerical tool, for example as a visiting distribution in Generalized Simulated Annealing. We develop and present a simple, easy-to-implement numerical method for generating random deviates from a q-Gaussian distribution based upon a generalization of the well-known Box-Muller method. Our method is suitable for a larger range of q values, q<3, than has previously appeared in the literature, and can generate deviates from q-Gaussian distributions of arbitrary width and center. MATLAB code showing a straightforward implementation is also included.
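A generalized Box-Muller step can be sketched as follows. This is a hedged illustration in Python, not the MATLAB code the abstract mentions. It assumes the standard Box-Muller form with the natural logarithm replaced by the generalized logarithm $\ln_q(x)=(x^{1-q}-1)/(1-q)$, and it inverts the transformation $q'=(3q-1)/(q+1)$ quoted above to pick the logarithm index for a requested distribution index. The names `q_log`, `q_gaussian_pair`, `q_target`, and `q_of_log` are hypothetical, only standard deviates are produced (no width or center), and the sketch is not tuned for values very close to 3.

```python
import math
import random
from typing import Tuple


def q_log(x: float, q: float) -> float:
    """Generalized logarithm; reduces to the natural log at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def q_gaussian_pair(q_target: float, rng: random.Random) -> Tuple[float, float]:
    """Return two standard q-Gaussian deviates for a distribution index < 3."""
    if q_target >= 3.0:
        raise ValueError("q_target must be less than 3")
    # Logarithm index that maps to q_target under q' = (3q - 1) / (q + 1).
    q_of_log = (1.0 + q_target) / (3.0 - q_target)
    u1 = 1.0 - rng.random()  # uniform on (0, 1], so the logarithm is finite
    u2 = rng.random()
    # The generalized log is non-positive on (0, 1]; clamp rounding noise.
    radius = math.sqrt(max(0.0, -2.0 * q_log(u1, q_of_log)))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)
```

With `q_target = 1.0` the logarithm index is also 1 and the function reduces to the ordinary Box-Muller method. With `q_target = 0.0` the radius is bounded, so every deviate falls inside a compact interval, consistent with the compact-support regime described in the addendum.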
Submitted 10 February, 2021; v1 submitted 23 May, 2006;
originally announced May 2006.
-
Congestion-gradient driven transport on complex networks
Authors:
Bogdan Danila,
Yong Yu,
Samuel Earl,
John A. Marsh,
Zoltan Toroczkai,
Kevin E. Bassler
Abstract:
We present a study of transport on complex networks with routing based on local information. Particles hop from one node of the network to another according to a set of routing rules with different degrees of congestion awareness, ranging from random diffusion to rigid congestion-gradient driven flow. Each node can be either source or destination for particles and all nodes have the same routing capacity, which are features of ad-hoc wireless networks. It is shown that the transport capacity increases when a small amount of congestion awareness is present in the routing rules, and that it then decreases as the routing rules become too rigid when the flow becomes strictly congestion-gradient driven. Therefore, an optimum value of the congestion awareness exists in the routing rules. It is also shown that, in the limit of a large number of nodes, networks using routing based on local information jam at any nonzero load. Finally, we study the correlation between congestion at node level and a betweenness centrality measure.
Submitted 31 March, 2006;
originally announced March 2006.