subscribe to arXiv mailings

Empirical Comparison between Cross-Validation and Mutation-Validation in Model Selection

Authors: Jinyang Yu, Sami Hamdan, Leonard Sasse, Abigail Morrison, Kaustubh R. Patil

Abstract: Mutation validation (MV) is a recently proposed approach for model selection, garnering significant interest due to its unique characteristics and potential benefits compared to the widely used cross-validation (CV) method. In this study, we empirically compared MV and $k$-fold CV using benchmark and real-world datasets. By employing Bayesian tests, we compared generalization estimates yielding th… ▽ More Mutation validation (MV) is a recently proposed approach for model selection, garnering significant interest due to its unique characteristics and potential benefits compared to the widely used cross-validation (CV) method. In this study, we empirically compared MV and $k$-fold CV using benchmark and real-world datasets. By employing Bayesian tests, we compared generalization estimates yielding three posterior probabilities: practical equivalence, CV superiority, and MV superiority. We also evaluated the differences in the capacity of the selected models and computational efficiency. We found that both MV and CV select models with practically equivalent generalization performance across various machine learning algorithms and the majority of benchmark datasets. MV exhibited advantages in terms of selecting simpler models and lower computational costs. However, in some cases MV selected overly simplistic models leading to underfitting and showed instability in hyperparameter selection. These limitations of MV became more evident in the evaluation of a real-world neuroscientific task of predicting sex at birth using brain functional connectivity. △ Less

Submitted 15 February, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

arXiv:2311.04179 [pdf]

On Leakage in Machine Learning Pipelines

Authors: Leonard Sasse, Eliana Nicolaisen-Sobesky, Juergen Dukart, Simon B. Eickhoff, Michael Götz, Sami Hamdan, Vera Komeyer, Abhijit Kulkarni, Juha Lahnakoski, Bradley C. Love, Federico Raimondo, Kaustubh R. Patil

Abstract: Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to ne… ▽ More Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding associated with causes leading to leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines. △ Less

Submitted 5 March, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

Comments: second draft

arXiv:2310.12568 [pdf, other]

Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models

Authors: Sami Hamdan, Shammi More, Leonard Sasse, Vera Komeyer, Kaustubh R. Patil, Federico Raimondo

Abstract: The fast-paced development of machine learning (ML) methods coupled with its increasing adoption in research poses challenges for researchers without extensive training in ML. In neuroscience, for example, ML can help understand brain-behavior relationships, diagnose diseases, and develop biomarkers using various data sources like magnetic resonance imaging and electroencephalography. The primary… ▽ More The fast-paced development of machine learning (ML) methods coupled with its increasing adoption in research poses challenges for researchers without extensive training in ML. In neuroscience, for example, ML can help understand brain-behavior relationships, diagnose diseases, and develop biomarkers using various data sources like magnetic resonance imaging and electroencephalography. The primary objective of ML is to build models that can make accurate predictions on unseen data. Researchers aim to prove the existence of such generalizable models by evaluating performance using techniques such as cross-validation (CV), which uses systematic subsampling to estimate the generalization performance. Choosing a CV scheme and evaluating an ML pipeline can be challenging and, if used improperly, can lead to overestimated results and incorrect interpretations. We created julearn, an open-source Python library, that allow researchers to design and evaluate complex ML pipelines without encountering in common pitfalls. In this manuscript, we present the rationale behind julearn's design, its core features, and showcase three examples of previously-published research projects that can be easily implemented using this novel library. Julearn aims to simplify the entry into the ML world by providing an easy-to-use environment with built in guards against some of the most common ML pitfalls. With its design, unique features and simple interface, it poses as a useful Python-based library for research projects. △ Less

Submitted 19 October, 2023; originally announced October 2023.

Comments: 13 pages, 5 figures

arXiv:2210.09232 [pdf, other]

Confound-leakage: Confound Removal in Machine Learning Leads to Leakage

Authors: Sami Hamdan, Bradley C. Love, Georg G. von Polier, Susanne Weis, Holger Schwender, Simon B. Eickhoff, Kaustubh R. Patil

Abstract: Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed as is commonly done by featurewise removal of their variance by linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifical… ▽ More Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed as is commonly done by featurewise removal of their variance by linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that what are null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset where accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for implementation and deployment of ML workflows and beg caution against naïve use of standard confound removal approaches. △ Less

Submitted 27 October, 2022; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: Revised Introduction, added CoI, results unchanged

arXiv:2209.07999 [pdf, other]

Self-Supervised Learning with an Information Maximization Criterion

Authors: Serdar Ozsoy, Shadi Hamdan, Sercan Ö. Arik, Deniz Yuret, Alper T. Erdogan

Abstract: Self-supervised learning allows AI systems to learn effective representations from large amounts of data using tasks that do not require costly labeling. Mode collapse, i.e., the model producing identical representations for all inputs, is a central problem to many self-supervised learning approaches, making self-supervised tasks, such as matching distorted variants of the inputs, ineffective. In… ▽ More Self-supervised learning allows AI systems to learn effective representations from large amounts of data using tasks that do not require costly labeling. Mode collapse, i.e., the model producing identical representations for all inputs, is a central problem to many self-supervised learning approaches, making self-supervised tasks, such as matching distorted variants of the inputs, ineffective. In this article, we argue that a straightforward application of information maximization among alternative latent representations of the same input naturally solves the collapse problem and achieves competitive empirical results. We propose a self-supervised learning method, CorInfoMax, that uses a second-order statistics-based mutual information measure that reflects the level of correlation among its arguments. Maximizing this correlative information measure between alternative representations of the same input serves two purposes: (1) it avoids the collapse problem by generating feature vectors with non-degenerate covariances; (2) it establishes relevance among alternative representations by increasing the linear dependence among them. An approximation of the proposed information maximization objective simplifies to a Euclidean distance-based objective function regularized by the log-determinant of the feature covariance matrix. The regularization term acts as a natural barrier against feature space degeneracy. Consequently, beyond avoiding complete output collapse to a single point, the proposed approach also prevents dimensional collapse by encouraging the spread of information across the whole feature space. Numerical experiments demonstrate that CorInfoMax achieves better or competitive performance results relative to the state-of-the-art SSL approaches. △ Less

Submitted 16 September, 2022; originally announced September 2022.

ACM Class: I.2; I.4; I.5

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur… ▽ More Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting. △ Less

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2007.13203 [pdf, other]

A containerized proof-of-concept implementation of LightChain system

Authors: Yahya Hassanzadeh-Nazarabadi, Nazir Nayal, Shadi Sameh Hamdan, Öznur Özkasap, Alptekin Küpçü

Abstract: LightChain is the first Distributed Hash Table (DHT)-based blockchain with a logarithmic asymptotic message and memory complexity. In this demo paper, we present the software architecture of our open-source implementation of LightChain, as well as a novel deployment scenario of the entire LightChain system on a single machine aiming at results reproducibility. LightChain is the first Distributed Hash Table (DHT)-based blockchain with a logarithmic asymptotic message and memory complexity. In this demo paper, we present the software architecture of our open-source implementation of LightChain, as well as a novel deployment scenario of the entire LightChain system on a single machine aiming at results reproducibility. △ Less

Submitted 26 July, 2020; originally announced July 2020.

arXiv:1905.03507 [pdf]

doi 10.1080/17445760.2019.1617865

Detecting Sybil Attacks in Vehicular Ad Hoc Networks

Authors: Salam Hamdan, Amjad Hudaib, Arafat Awajan

Abstract: Ad hoc networks is vulnerable to numerous number of attacks due to its infrastructure-less nature, one of these attacks is the Sybil attack. Sybil attack is a severe attack on vehicular ad hoc networks (VANET) in which the intruder maliciously claims or steals multiple identities and use these identities to disturb the functionality of the VANET network by disseminating false identities. Many solu… ▽ More Ad hoc networks is vulnerable to numerous number of attacks due to its infrastructure-less nature, one of these attacks is the Sybil attack. Sybil attack is a severe attack on vehicular ad hoc networks (VANET) in which the intruder maliciously claims or steals multiple identities and use these identities to disturb the functionality of the VANET network by disseminating false identities. Many solutions have been proposed in order to defense the VANET network against the Sybil attack. In this research a hybrid algorithm is proposed, by combining footprint and privacy-preserving detection of abuses of pseudonyms (P2DAP) methods. The hybrid detection algorithm is implemented using the ns2 simulator. The proposed algorithm is working as follows, P2DAP acting better than footprint when the number of vehicles increases. On the other hand, the footprint algorithm acting better when the speed of vehicles increases. The hybrid algorithm depends on encryption, authentication and on the trajectory of the vehicle. The scenarios will be generated using SUMO and MOVE tools. △ Less

Submitted 9 May, 2019; originally announced May 2019.

Showing 1–8 of 8 results for author: Hamdan, S