-
Empirical Comparison between Cross-Validation and Mutation-Validation in Model Selection
Authors:
Jinyang Yu,
Sami Hamdan,
Leonard Sasse,
Abigail Morrison,
Kaustubh R. Patil
Abstract:
Mutation validation (MV) is a recently proposed approach for model selection, garnering significant interest due to its unique characteristics and potential benefits compared to the widely used cross-validation (CV) method. In this study, we empirically compared MV and $k$-fold CV using benchmark and real-world datasets. By employing Bayesian tests, we compared generalization estimates yielding three posterior probabilities: practical equivalence, CV superiority, and MV superiority. We also evaluated the differences in the capacity of the selected models and computational efficiency. We found that both MV and CV select models with practically equivalent generalization performance across various machine learning algorithms and the majority of benchmark datasets. MV exhibited advantages in terms of selecting simpler models and lower computational costs. However, in some cases MV selected overly simplistic models leading to underfitting and showed instability in hyperparameter selection. These limitations of MV became more evident in the evaluation of a real-world neuroscientific task of predicting sex at birth using brain functional connectivity.
Submitted 15 February, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
-
On Leakage in Machine Learning Pipelines
Authors:
Leonard Sasse,
Eliana Nicolaisen-Sobesky,
Juergen Dukart,
Simon B. Eickhoff,
Michael Götz,
Sami Hamdan,
Vera Komeyer,
Abhijit Kulkarni,
Juha Lahnakoski,
Bradley C. Love,
Federico Raimondo,
Kaustubh R. Patil
Abstract:
Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding associated with causes leading to leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.
Submitted 5 March, 2024; v1 submitted 7 November, 2023;
originally announced November 2023.
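One common form of the leakage discussed in the entry above is computing preprocessing statistics on the full dataset before splitting it. The following is a minimal sketch, not taken from the paper, of $k$-fold splitting in which standardization statistics are estimated on the training portion of each fold only. The function names `k_fold_indices` and `standardize_per_fold` and the toy data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-shuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def standardize_per_fold(values, k):
    """Standardize a 1-D feature fold by fold, using training statistics only."""
    results = []
    for train_idx, test_idx in k_fold_indices(len(values), k):
        train = [values[i] for i in train_idx]
        mu, sigma = mean(train), pstdev(train)
        sigma = sigma or 1.0  # guard against a constant training fold
        # The held-out samples never contribute to mu or sigma.
        scaled_test = [(values[i] - mu) / sigma for i in test_idx]
        results.append((mu, sigma, scaled_test))
    return results


# Hypothetical toy feature with one extreme value.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0]
folds = standardize_per_fold(values, k=3)
```

In the fold where the extreme value is held out, the training mean is unaffected by it, whereas a mean computed once on all six values would carry information from the held-out samples into training.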
-
Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models
Authors:
Sami Hamdan,
Shammi More,
Leonard Sasse,
Vera Komeyer,
Kaustubh R. Patil,
Federico Raimondo
Abstract:
The fast-paced development of machine learning (ML) methods coupled with its increasing adoption in research poses challenges for researchers without extensive training in ML. In neuroscience, for example, ML can help understand brain-behavior relationships, diagnose diseases, and develop biomarkers using various data sources like magnetic resonance imaging and electroencephalography. The primary objective of ML is to build models that can make accurate predictions on unseen data. Researchers aim to prove the existence of such generalizable models by evaluating performance using techniques such as cross-validation (CV), which uses systematic subsampling to estimate the generalization performance. Choosing a CV scheme and evaluating an ML pipeline can be challenging and, if used improperly, can lead to overestimated results and incorrect interpretations.
We created julearn, an open-source Python library that allows researchers to design and evaluate complex ML pipelines without running into common pitfalls. In this manuscript, we present the rationale behind julearn's design and its core features, and showcase three examples of previously published research projects that can be easily implemented using this novel library. Julearn aims to simplify the entry into the ML world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls. With its design, unique features, and simple interface, it serves as a useful Python-based library for research projects.
Submitted 19 October, 2023;
originally announced October 2023.
-
Confound-leakage: Confound Removal in Machine Learning Leads to Leakage
Authors:
Sami Hamdan,
Bradley C. Love,
Georg G. von Polier,
Susanne Weis,
Holger Schwender,
Simon B. Eickhoff,
Kaustubh R. Patil
Abstract:
Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed as is commonly done by featurewise removal of their variance by linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that what are null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset where accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for implementation and deployment of ML workflows and beg caution against naïve use of standard confound removal approaches.
Submitted 27 October, 2022; v1 submitted 17 October, 2022;
originally announced October 2022.
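For readers unfamiliar with the procedure examined in the entry above, featurewise confound removal by linear regression can be sketched as follows. This is an illustrative assumption-laden sketch, not the authors' code: it handles a single feature and a single confound, and the names `fit_confound_model` and `remove_confound` and the toy numbers are hypothetical. It shows only the standard procedure; according to the abstract, this procedure can itself leak information when nonlinear models are applied afterwards, so the sketch is not a mitigation.

```python
from statistics import mean


def fit_confound_model(confound, feature):
    """Fit feature ~ intercept + slope * confound by ordinary least squares."""
    c_mean, f_mean = mean(confound), mean(feature)
    var_c = sum((c - c_mean) ** 2 for c in confound)
    if var_c == 0:
        # A constant confound explains nothing beyond the feature mean.
        return f_mean, 0.0
    cov = sum((c - c_mean) * (f - f_mean) for c, f in zip(confound, feature))
    slope = cov / var_c
    intercept = f_mean - slope * c_mean
    return intercept, slope


def remove_confound(confound, feature, intercept, slope):
    """Return the residuals of the feature after subtracting the fitted confound effect."""
    return [f - (intercept + slope * c) for c, f in zip(confound, feature)]


# Hypothetical training data in which the feature equals 1 + 2 * confound.
train_confound = [0.0, 1.0, 2.0, 3.0]
train_feature = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_confound_model(train_confound, train_feature)

# Apply the model fitted on training data to held-out samples.
test_residuals = remove_confound([4.0, 5.0], [9.5, 10.0], intercept, slope)
```

The residuals, rather than the raw feature values, are what would be passed on to the downstream ML model.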
-
Self-Supervised Learning with an Information Maximization Criterion
Authors:
Serdar Ozsoy,
Shadi Hamdan,
Sercan Ö. Arik,
Deniz Yuret,
Alper T. Erdogan
Abstract:
Self-supervised learning allows AI systems to learn effective representations from large amounts of data using tasks that do not require costly labeling. Mode collapse, i.e., the model producing identical representations for all inputs, is a central problem to many self-supervised learning approaches, making self-supervised tasks, such as matching distorted variants of the inputs, ineffective. In this article, we argue that a straightforward application of information maximization among alternative latent representations of the same input naturally solves the collapse problem and achieves competitive empirical results. We propose a self-supervised learning method, CorInfoMax, that uses a second-order statistics-based mutual information measure that reflects the level of correlation among its arguments. Maximizing this correlative information measure between alternative representations of the same input serves two purposes: (1) it avoids the collapse problem by generating feature vectors with non-degenerate covariances; (2) it establishes relevance among alternative representations by increasing the linear dependence among them. An approximation of the proposed information maximization objective simplifies to a Euclidean distance-based objective function regularized by the log-determinant of the feature covariance matrix. The regularization term acts as a natural barrier against feature space degeneracy. Consequently, beyond avoiding complete output collapse to a single point, the proposed approach also prevents dimensional collapse by encouraging the spread of information across the whole feature space. Numerical experiments demonstrate that CorInfoMax achieves better or competitive performance results relative to the state-of-the-art SSL approaches.
Submitted 16 September, 2022;
originally announced September 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza,
et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
A containerized proof-of-concept implementation of LightChain system
Authors:
Yahya Hassanzadeh-Nazarabadi,
Nazir Nayal,
Shadi Sameh Hamdan,
Öznur Özkasap,
Alptekin Küpçü
Abstract:
LightChain is the first Distributed Hash Table (DHT)-based blockchain with a logarithmic asymptotic message and memory complexity. In this demo paper, we present the software architecture of our open-source implementation of LightChain, as well as a novel deployment scenario of the entire LightChain system on a single machine aiming at results reproducibility.
Submitted 26 July, 2020;
originally announced July 2020.
-
Detecting Sybil Attacks in Vehicular Ad Hoc Networks
Authors:
Salam Hamdan,
Amjad Hudaib,
Arafat Awajan
Abstract:
Ad hoc networks are vulnerable to numerous attacks due to their infrastructure-less nature; one of these is the Sybil attack. The Sybil attack is a severe attack on vehicular ad hoc networks (VANETs) in which the intruder maliciously claims or steals multiple identities and uses them to disrupt the functionality of the VANET by disseminating false identities. Many solutions have been proposed to defend VANETs against the Sybil attack. In this research, a hybrid algorithm is proposed that combines the footprint and privacy-preserving detection of abuses of pseudonyms (P2DAP) methods. The hybrid detection algorithm is implemented using the ns2 simulator. The proposed algorithm works as follows: P2DAP performs better than footprint when the number of vehicles increases, whereas the footprint algorithm performs better when the speed of vehicles increases. The hybrid algorithm depends on encryption, authentication, and the trajectory of the vehicle. The scenarios will be generated using the SUMO and MOVE tools.
Submitted 9 May, 2019;
originally announced May 2019.