subscribe to arXiv mailings

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Authors: Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi

Abstract: Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a leng… ▽ More Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named \textit{OmegaPRM} for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM). Utilizing this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction tuned Gemini Pro model's math reasoning performance, achieving a 69.4\% success rate on the MATH benchmark, a 36\% relative improvement from the 51\% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: 18 pages, 5 figures, 1 table

arXiv:2405.18368 [pdf, other]

The 2024 Brain Tumor Segmentation (BraTS) Challenge: Glioma Segmentation on Post-treatment MRI

Authors: Maria Correia de Verdier, Rachit Saluja, Louis Gagnon, Dominic LaBella, Ujjwall Baid, Nourel Hoda Tahon, Martha Foltyn-Dumitru, Jikai Zhang, Maram Alafif, Saif Baig, Ken Chang, Gennaro D'Anna, Lisa Deptula, Diviya Gupta, Muhammad Ammar Haider, Ali Hussain, Michael Iv, Marinos Kontzialis, Paul Manning, Farzan Moodi, Teresa Nunes, Aaron Simon, Nico Sollmann, David Vu, Maruf Adewole , et al. (60 additional authors not shown)

Abstract: Gliomas are the most common malignant primary brain tumors in adults and one of the deadliest types of cancer. There are many challenges in treatment and monitoring due to the genetic diversity and high intrinsic heterogeneity in appearance, shape, histology, and treatment response. Treatments include surgery, radiation, and systemic therapies, with magnetic resonance imaging (MRI) playing a key r… ▽ More Gliomas are the most common malignant primary brain tumors in adults and one of the deadliest types of cancer. There are many challenges in treatment and monitoring due to the genetic diversity and high intrinsic heterogeneity in appearance, shape, histology, and treatment response. Treatments include surgery, radiation, and systemic therapies, with magnetic resonance imaging (MRI) playing a key role in treatment planning and post-treatment longitudinal assessment. The 2024 Brain Tumor Segmentation (BraTS) challenge on post-treatment glioma MRI will provide a community standard and benchmark for state-of-the-art automated segmentation models based on the largest expert-annotated post-treatment glioma MRI dataset. Challenge competitors will develop automated segmentation models to predict four distinct tumor sub-regions consisting of enhancing tissue (ET), surrounding non-enhancing T2/fluid-attenuated inversion recovery (FLAIR) hyperintensity (SNFH), non-enhancing tumor core (NETC), and resection cavity (RC). Models will be evaluated on separate validation and test datasets using standardized performance metrics utilized across the BraTS 2024 cluster of challenges, including lesion-wise Dice Similarity Coefficient and Hausdorff Distance. Models developed during this challenge will advance the field of automated MRI segmentation and contribute to their integration into clinical practice, ultimately enhancing patient care. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: 10 pages, 4 figures, 1 table

arXiv:2405.16981 [pdf, other]

Characterising Developer Sentiment in Software Components: An Exploratory Study of Gentoo

Authors: Tien Rahayu Tulili, Ayushi Rastogi, Andrea Capiluppi

Abstract: Collaborative software development happens in teams, that cooperate on shared artefacts, and discuss development on online platforms. Due to the complexity of development and the variety of teams, software components often act as effective containers for parallel work and teams. Past research has shown how communication between team members, especially in an open-source environment, can become e… ▽ More Collaborative software development happens in teams, that cooperate on shared artefacts, and discuss development on online platforms. Due to the complexity of development and the variety of teams, software components often act as effective containers for parallel work and teams. Past research has shown how communication between team members, especially in an open-source environment, can become extremely toxic, and lead to members leaving the development team. This has a direct effect on the evolution and maintenance of the project in which the former members were active in. The purpose of our study is two-fold: first, we propose an approach to evaluate, at a finer granularity, the positive and negative emotions in the communication between developers; and second, we aim to characterise a project's development paths, or components, as more or less impacted by the emotions. Our analysis evaluates single sentences rather than whole messages as the finest granularity of communication. The previous study found that the high positivity or negativity at the sentence level may indirectly impact the writer him/herself, or the reader. In this way, we could highlight specific paths of Gentoo as the most affected by negative emotions, and show how negative emotions have evolved and changed along the same paths. By joining the analysis of the mailing lists, from which we derive the sentiment of the developers, with the information derived from the development logs, we obtained a longitudinal picture of how development paths have been historically affected by positive or negative emotions. Our study shows that, in recent years, negative emotions have generally decreased in the communication between Gentoo developers. We also show how file paths, as collaborative software development artefacts, were more or less impacted by the emotions of the developers. △ Less

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2404.01096 [pdf, other]

Enabling Memory Safety of C Programs using LLMs

Authors: Nausheen Mohammed, Akash Lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma

Abstract: Memory safety violations in low-level code, written in languages like C, continues to remain one of the major sources of software vulnerabilities. One method of removing such violations by construction is to port C code to a safe C dialect. Such dialects rely on programmer-supplied annotations to guarantee safety with minimal runtime overhead. This porting, however, is a manual process that impose… ▽ More Memory safety violations in low-level code, written in languages like C, continues to remain one of the major sources of software vulnerabilities. One method of removing such violations by construction is to port C code to a safe C dialect. Such dialects rely on programmer-supplied annotations to guarantee safety with minimal runtime overhead. This porting, however, is a manual process that imposes significant burden on the programmer and, hence, there has been limited adoption of this technique. The task of porting not only requires inferring annotations, but may also need refactoring/rewriting of the code to make it amenable to such annotations. In this paper, we use Large Language Models (LLMs) towards addressing both these concerns. We show how to harness LLM capabilities to do complex code reasoning as well as rewriting of large codebases. We also present a novel framework for whole-program transformations that leverages lightweight static analysis to break the transformation into smaller steps that can be carried out effectively by an LLM. We implement our ideas in a tool called MSA that targets the CheckedC dialect. We evaluate MSA on several micro-benchmarks, as well as real-world code ranging up to 20K lines of code. We showcase superior performance compared to a vanilla LLM baseline, as well as demonstrate improvement over a state-of-the-art symbolic (non-LLM) technique. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.20120 [pdf, ps, other]

Privacy-Preserving Data Aggregation Techniques for Enhanced Efficiency and Security in Wireless Sensor Networks: A Comprehensive Analysis and Evaluation

Authors: Ayush Rastogi, Harsh Rastogi, Yash Rastogi, Divyansh Dubey

Abstract: In this paper, we present a multidimensional, highly effective method for aggregating data for wireless sensor networks while maintaining privacy. The suggested system is resistant to data loss and secure against both active and passive privacy compromising attacks, such as the coalition attack from a rogue base station and kidnapped sensor nodes. With regard to cluster size, it achieves consisten… ▽ More In this paper, we present a multidimensional, highly effective method for aggregating data for wireless sensor networks while maintaining privacy. The suggested system is resistant to data loss and secure against both active and passive privacy compromising attacks, such as the coalition attack from a rogue base station and kidnapped sensor nodes. With regard to cluster size, it achieves consistent communication overhead, which is helpful in large-scale WSNs. Due to its constant size communication overhead, the suggested strategy outperforms the previous privacy-preserving data aggregation scheme not only in terms of privacy preservation but also in terms of communication complexity and energy costs. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: 4 pages

arXiv:2403.10704 [pdf, other]

PERL: Parameter Efficient Reinforcement Learning from Human Feedback

Authors: Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu, Simral Chaudhary, Bowen Li, Saravanan Ganesh, Bill Byrne, Jessica Hoffmann, Hassan Mansoor, Wei Li, Abhinav Rastogi, Lucas Dixon

Abstract: Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Pretrained Large Language Models (LLMs) with human preferences. But training models with RLHF is computationally expensive, and an overall complex process. In this work, we study RLHF where the underlying models are trained using the parameter efficient method of Low-Rank Adaptation (LoRA) introduced by Hu… ▽ More Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Pretrained Large Language Models (LLMs) with human preferences. But training models with RLHF is computationally expensive, and an overall complex process. In this work, we study RLHF where the underlying models are trained using the parameter efficient method of Low-Rank Adaptation (LoRA) introduced by Hu et al. [2021]. We investigate the setup of "Parameter Efficient Reinforcement Learning" (PERL), in which we perform reward model training and reinforcement learning using LoRA. We compare PERL to conventional fine-tuning (full-tuning) across various configurations for 7 benchmarks, including 2 novel datasets, of reward modeling and reinforcement learning. We find that PERL performs on par with the conventional RLHF setting, while training faster, and with less memory. This enables the high performance of RLHF, while reducing the computational burden that limits its adoption as an alignment technique for Large Language Models. We also release 2 novel thumbs up/down preference datasets: "Taskmaster Coffee", and "Taskmaster Ticketing" to promote research around RLHF. △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2402.19038 [pdf, other]

Understanding Fairness in Software Engineering: Insights from Stack Exchange

Authors: Emeralda Sesari, Federica Sarro, Ayushi Rastogi

Abstract: Software practitioners discuss problems at work with peers, in-person and online. These discussions can be technical (e.g., how to fix a bug?) and social (e.g., how to assign work fairly?). While there is a growing body of knowledge exploring fairness problems and solutions in the human and social factors of software engineering, most focus has been on specific problems. This study provides fairne… ▽ More Software practitioners discuss problems at work with peers, in-person and online. These discussions can be technical (e.g., how to fix a bug?) and social (e.g., how to assign work fairly?). While there is a growing body of knowledge exploring fairness problems and solutions in the human and social factors of software engineering, most focus has been on specific problems. This study provides fairness discussions by software practitioners on Stack Exchange sites. We present an exploratory study presenting the fairness experience of software practitioners and fairness expectations in software teams. We also want to identify the fairness aspects software practitioners talk about the most. For example, do they care more about fairness in income or how they are treated in the workplace? Our investigation of fairness discussions on eight Stack Exchange sites resulted in a list of 136 posts (28 questions and 108 answers) manually curated from 4,178 candidate posts. The study reveals that the majority of fairness discussions (24 posts) revolve around the topic of income suggesting that many software practitioners are highly interested in matters related to their pay and how it is fairly distributed. Further, we noted that while not discussed as often, discussions on fairness in recruitment tend to receive the highest number of views and scores. Interestingly, the study shows that unfairness experiences extend beyond the protected attributes. In this study, only 25 out of 136 posts mention protected attributes, with gender mainly being discussed. △ Less

Submitted 21 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: To be published in 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024

arXiv:2312.13463 [pdf, other]

doi 10.1145/3639477.3639737

The Devil Is in the Command Line: Associating the Compiler Flags With the Binary and Build Metadata

Authors: Gunnar Kudrjavets, Aditya Kumar, Jeff Thomas, Ayushi Rastogi

Abstract: Engineers build large software systems for multiple architectures, operating systems, and configurations. A set of inconsistent or missing compiler flags generates code that catastrophically impacts the system's behavior. In the authors' industry experience, defects caused by an undesired combination of compiler flags are common in nontrivial software projects. We are unaware of any build and CI/C… ▽ More Engineers build large software systems for multiple architectures, operating systems, and configurations. A set of inconsistent or missing compiler flags generates code that catastrophically impacts the system's behavior. In the authors' industry experience, defects caused by an undesired combination of compiler flags are common in nontrivial software projects. We are unaware of any build and CI/CD systems that track how the compiler produces a specific binary in a structured manner. We postulate that a queryable database of how the compiler compiled and linked the software system will help to detect defects earlier and reduce the debugging time. △ Less

Submitted 20 December, 2023; originally announced December 2023.

Comments: 3 pages. To be published in the 46th International Conference on Software Engineering (ICSE 2024), April 14 - April 20 2024, Lisbon, Portugal

arXiv:2312.13462 [pdf, other]

doi 10.1145/3639477.3639735

What Do You Mean by Memory? When Engineers Are Lost in the Maze of Complexity

Authors: Gunnar Kudrjavets, Aditya Kumar, Jeff Thomas, Ayushi Rastogi

Abstract: An accepted practice to decrease applications' memory usage is to reduce the amount and frequency of memory allocations. Factors such as (a) the prevalence of out-of-memory (OOM) killers, (b) memory allocations in modern programming languages done implicitly, (c) overcommitting being a default strategy in the Linux kernel, and (d) the rise in complexity and terminology related to memory management… ▽ More An accepted practice to decrease applications' memory usage is to reduce the amount and frequency of memory allocations. Factors such as (a) the prevalence of out-of-memory (OOM) killers, (b) memory allocations in modern programming languages done implicitly, (c) overcommitting being a default strategy in the Linux kernel, and (d) the rise in complexity and terminology related to memory management makes the existing guidance inefficient. The industry needs detailed guidelines for optimizing memory usage targeting specific operating systems (OS) and programming language types. △ Less

Submitted 20 December, 2023; originally announced December 2023.

Comments: 3 pages. To be published in the 46th International Conference on Software Engineering (ICSE 2024), April 14 - April 20 2024, Lisbon, Portugal

arXiv:2311.07948 [pdf, other]

Finding Inductive Loop Invariants using Large Language Models

Authors: Adharsh Kamath, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K. Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma

Abstract: Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop's behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program's runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting t… ▽ More Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop's behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program's runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting the entire program, thus are indispensable artifacts in a formal proof of correctness. Finding inductive loop invariants is an undecidable problem, and despite a long history of research towards practical solutions, it remains far from a solved problem. This paper investigates the capabilities of the Large Language Models (LLMs) in offering a new solution towards this old, yet important problem. To that end, we first curate a dataset of verification problems on programs with loops. Next, we design a prompt for exploiting LLMs, obtaining inductive loop invariants, that are checked for correctness using sound symbolic tools. Finally, we explore the effectiveness of using an efficient combination of a symbolic tool and an LLM on our dataset and compare it against a purely symbolic baseline. Our results demonstrate that LLMs can help improve the state-of-the-art in automated program verification. △ Less

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2311.02489 [pdf, other]

doi 10.1007/s10664-023-10401-z

Does Code Review Speed Matter for Practitioners?

Authors: Gunnar Kudrjavets, Ayushi Rastogi

Abstract: Increasing code velocity is a common goal for a variety of software projects. The efficiency of the code review process significantly impacts how fast the code gets merged into the final product and reaches the customers. We conducted a survey to study the code velocity-related beliefs and practices in place. We analyzed 75 completed surveys from 39 participants from the industry and 36 from the o… ▽ More Increasing code velocity is a common goal for a variety of software projects. The efficiency of the code review process significantly impacts how fast the code gets merged into the final product and reaches the customers. We conducted a survey to study the code velocity-related beliefs and practices in place. We analyzed 75 completed surveys from 39 participants from the industry and 36 from the open-source community. Our critical findings are (a) the industry and open-source community hold a similar set of beliefs, (b) quick reaction time is of utmost importance and applies to the tooling infrastructure and the behavior of other engineers, (c) time-to-merge is the essential code review metric to improve, (d) engineers have differing opinions about the benefits of increased code velocity for their career growth, and (e) the controlled application of the commit-then-review model can increase code velocity. Our study supports the continued need to invest in and improve code velocity regardless of the underlying organizational ecosystem. △ Less

Submitted 4 November, 2023; originally announced November 2023.

Comments: 29 pages, 7 figures. To be published in Empirical Software Engineering An International Journal

arXiv:2310.09342 [pdf, other]

Ranking LLM-Generated Loop Invariants for Program Verification

Authors: Saikat Chakraborty, Shuvendu K. Lahiri, Sarah Fakhoury, Madanlal Musuvathi, Akash Lal, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, Nikhil Swamy

Abstract: Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an… ▽ More Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. The source code and the experimental data for this paper are available in \url{https://github.com/microsoft/NeuralInvariantRanker}. △ Less

Submitted 12 February, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-findings 2023)

arXiv:2309.00267 [pdf, other]

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Authors: Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash

Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human preference labels can be a time-consuming and expensive endeavor. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lie… ▽ More Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human preference labels can be a time-consuming and expensive endeavor. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lieu of human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, RLAIF achieves comparable or superior performance to RLHF, as rated by human evaluators. Furthermore, RLAIF demonstrates the ability to outperform a supervised fine-tuned baseline even when the LLM preference labeler is the same size as the policy. In another experiment, directly prompting the LLM for reward scores achieves superior performance to the canonical RLAIF setup, where LLM preference labels are first distilled into a reward model. Finally, we conduct extensive studies on techniques for generating aligned AI preferences. Our results suggest that RLAIF can achieve human-level performance, offering a potential solution to the scalability limitations of RLHF. △ Less

Submitted 30 November, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

Comments: Added two more tasks and many more experiments and analyses (e.g. same-size RLAIF, direct RLAIF, cost analysis)

arXiv:2308.05177 [pdf, other]

Fixing Rust Compilation Errors using LLMs

Authors: Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, Aseem Rastogi

Abstract: The Rust programming language, with its safety guarantees, has established itself as a viable choice for low-level systems programming language over the traditional, unsafe alternatives like C/C++. These guarantees come from a strong ownership-based type system, as well as primitive support for features like closures, pattern matching, etc., that make the code more concise and amenable to reasonin… ▽ More The Rust programming language, with its safety guarantees, has established itself as a viable choice for low-level systems programming language over the traditional, unsafe alternatives like C/C++. These guarantees come from a strong ownership-based type system, as well as primitive support for features like closures, pattern matching, etc., that make the code more concise and amenable to reasoning. These unique Rust features also pose a steep learning curve for programmers. This paper presents a tool called RustAssistant that leverages the emergent capabilities of Large Language Models (LLMs) to automatically suggest fixes for Rust compilation errors. RustAssistant uses a careful combination of prompting techniques as well as iteration with an LLM to deliver high accuracy of fixes. RustAssistant is able to achieve an impressive peak accuracy of roughly 74% on real-world compilation errors in popular open-source Rust repositories. We plan to release our dataset of Rust compilation errors to enable further research. △ Less

Submitted 9 August, 2023; originally announced August 2023.

arXiv:2305.13725 [pdf, other]

Conversational Recommendation as Retrieval: A Simple, Strong Baseline

Authors: Raghav Gupta, Renat Aksitov, Samrat Phatale, Simral Chaudhary, Harrison Lee, Abhinav Rastogi

Abstract: Conversational recommendation systems (CRS) aim to recommend suitable items to users through natural language conversation. However, most CRS approaches do not effectively utilize the signal provided by these conversations. They rely heavily on explicit external knowledge e.g., knowledge graphs to augment the models' understanding of the items and attributes, which is quite hard to scale. To allev… ▽ More Conversational recommendation systems (CRS) aim to recommend suitable items to users through natural language conversation. However, most CRS approaches do not effectively utilize the signal provided by these conversations. They rely heavily on explicit external knowledge e.g., knowledge graphs to augment the models' understanding of the items and attributes, which is quite hard to scale. To alleviate this, we propose an alternative information retrieval (IR)-styled approach to the CRS item recommendation task, where we represent conversations as queries and items as documents to be retrieved. We expand the document representation used for retrieval with conversations from the training set. With a simple BM25-based retriever, we show that our task formulation compares favorably with much more complex baselines using complex external knowledge on a popular CRS benchmark. We demonstrate further improvements using user-centric modeling and data augmentation to counter the cold start problem for CRSs. △ Less

Submitted 23 May, 2023; originally announced May 2023.

Comments: To appear at the 5th NLP4ConvAI workshop

arXiv:2303.04293 [pdf, other]

doi 10.1109/MSR59073.2023.00046

Are We Speeding Up or Slowing Down? On Temporal Aspects of Code Velocity

Authors: Gunnar Kudrjavets, Nachiappan Nagappan, Ayushi Rastogi

Abstract: This paper investigates how the duration of various code review periods changes over a projects' lifetime. We study four open-source software (OSS) projects: Blender, FreeBSD, LLVM, and Mozilla. We mine and analyze the characteristics of 283,235 code reviews that cover, on average, seven years' worth of development. Our main conclusion is that neither the passage of time or the project's size impa… ▽ More This paper investigates how the duration of various code review periods changes over a projects' lifetime. We study four open-source software (OSS) projects: Blender, FreeBSD, LLVM, and Mozilla. We mine and analyze the characteristics of 283,235 code reviews that cover, on average, seven years' worth of development. Our main conclusion is that neither the passage of time or the project's size impact code velocity. We find that (a) the duration of various code review periods (time-to-first-response, time-to-accept, and time-to-merge) for FreeBSD, LLVM, and Mozilla either becomes shorter or stays the same; no directional trend is present for Blender, (b) an increase in the size of the code bases (annually 3-17%) does not accompany a decrease in code velocity, and (c) for FreeBSD, LLVM, and Mozilla, the 30-day moving median stays in a fixed range for time-to-merge. These findings do not change with variabilities in code churn metrics, such as the number of commits or distinct authors of code changes. △ Less

Submitted 7 March, 2023; originally announced March 2023.

Comments: 5 pages. To be published in Proceedings of MSR '23: Proceedings of the 20th International Conference on Mining Software Repositories (MSR 2023). May 15-16, 2023, Melbourne, Australia

arXiv:2303.01954 [pdf, other]

Synthetic Data Generator for Adaptive Interventions in Global Health

Authors: Aditya Rastogi, Juan Francisco Garamendi, Ana Fernández del Río, Anna Guitart, Moiz Hassan Khan, Dexian Tang, África Periáñez

Abstract: Artificial Intelligence and digital health have the potential to transform global health. However, having access to representative data to test and validate algorithms in realistic production environments is essential. We introduce HealthSyn, an open-source synthetic data generator of user behavior for testing reinforcement learning algorithms in the context of mobile health interventions. The gen… ▽ More Artificial Intelligence and digital health have the potential to transform global health. However, having access to representative data to test and validate algorithms in realistic production environments is essential. We introduce HealthSyn, an open-source synthetic data generator of user behavior for testing reinforcement learning algorithms in the context of mobile health interventions. The generator utilizes Markov processes to generate diverse user actions, with individual user behavioral patterns that can change in reaction to personalized interventions (i.e., reminders, recommendations, and incentives). These actions are translated into actual logs using an ML-purposed data schema specific to the mobile health application functionality included with HealthKit, and open-source SDK. The logs can be fed to pipelines to obtain user metrics. The generated data, which is based on real-world behaviors and simulation techniques, can be used to develop, test, and evaluate, both ML algorithms in research and end-to-end operational RL-based intervention delivery frameworks. △ Less

Submitted 27 April, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

arXiv:2212.11866 [pdf, other]

doi 10.1109/ICSE-SEIP58684.2023.00040

Who Ate My Memory? Towards Attribution in Memory Management

Authors: Gunnar Kudrjavets, Ayushi Rastogi, Jeff Thomas, Nachiappan Nagappan

Abstract: To understand applications' memory usage details, engineers use instrumented builds and profiling tools. Both approaches are impractical for use in production environments or deployed mobile applications. As a result, developers can gather only high-level memory-related statistics for deployed software. In our experience, the lack of granular field data makes fixing performance and reliability-rel… ▽ More To understand applications' memory usage details, engineers use instrumented builds and profiling tools. Both approaches are impractical for use in production environments or deployed mobile applications. As a result, developers can gather only high-level memory-related statistics for deployed software. In our experience, the lack of granular field data makes fixing performance and reliability-related defects complex and time-consuming. The software industry needs lightweight solutions to collect detailed data about applications' memory usage to increase developer productivity. Current research into memory attribution-related data structures, techniques, and tools is in the early stages and enables several new research avenues. △ Less

Submitted 22 December, 2022; originally announced December 2022.

Comments: 3 pages. To be published in the 45th International Conference on Software Engineering (ICSE 2023), May 14 - May 20 2023, Melbourne, Australia

arXiv:2212.09939 [pdf, other]

AnyTOD: A Programmable Task-Oriented Dialog System

Authors: Jeffrey Zhao, Yuan Cao, Raghav Gupta, Harrison Lee, Abhinav Rastogi, Mingqiu Wang, Hagen Soltau, Izhak Shafran, Yonghui Wu

Abstract: We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training. We view TOD as a program executed by a language model (LM), where program logic and ontology is provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach. A ne… ▽ More We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training. We view TOD as a program executed by a language model (LM), where program logic and ontology is provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach. A neural LM keeps track of events occurring during a conversation and a symbolic program implementing the dialog policy is executed to recommend next actions AnyTOD should take. This approach drastically reduces data annotation and model training requirements, addressing the enduring challenge of rapidly adapting a TOD system to unseen tasks and domains. We demonstrate state-of-the-art results on STAR, ABCD and SGD benchmarks. We also demonstrate strong zero-shot transfer ability in low-resource settings, such as zero-shot on MultiWOZ. In addition, we release STARv2, an updated version of the STAR dataset with richer annotations, for benchmarking zero-shot end-to-end TOD models. △ Less

Submitted 13 February, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: v2, update with Multiwoz, SGD results

arXiv:2212.08704 [pdf, other]

Speech Aware Dialog System Technology Challenge (DSTC11)

Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Abhinav Rastogi, Jeffrey Zhao, Ye Jia, Wei Han, Yuan Cao, Aramys Miranda

Abstract: Most research on task oriented dialog modeling is based on written text input. However, users interact with practical dialog systems often using speech as input. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences in written and spoken language. The research on this topic is st… ▽ More Most research on task oriented dialog modeling is based on written text input. However, users interact with practical dialog systems often using speech as input. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences in written and spoken language. The research on this topic is stymied by the lack of a public corpus. Motivated by these considerations, our goal in hosting the speech-aware dialog state tracking challenge was to create a public corpus or task which can be used to investigate the performance gap between the written and spoken forms of input, develop models that could alleviate this gap, and establish whether Text-to-Speech-based (TTS) systems is a reasonable surrogate to the more-labor intensive human data collection. We created three spoken versions of the popular written-domain MultiWoz task -- (a) TTS-Verbatim: written user inputs were converted into speech waveforms using a TTS system, (b) Human-Verbatim: humans spoke the user inputs verbatim, and (c) Human-paraphrased: humans paraphrased the user inputs. Additionally, we provided different forms of ASR output to encourage wider participation from teams that may not have access to state-of-the-art ASR systems. These included ASR transcripts, word time stamps, and latent representations of the audio (audio encoder outputs). In this paper, we describe the corpus, report results from participating teams, provide preliminary analyses of their results, and summarize the current state-of-the-art in this domain. △ Less

Submitted 16 December, 2022; originally announced December 2022.

arXiv:2208.13289 [pdf, other]

doi 10.1016/j.jco.2024.101824

Statistical Inverse Problems in Hilbert Scales

Authors: Abhishake Rastogi

Abstract: In this paper, we study the Tikhonov regularization scheme in Hilbert scales for the nonlinear statistical inverse problem with a general noise. The regularizing norm in this scheme is stronger than the norm in Hilbert space. We focus on developing a theoretical analysis for this scheme based on the conditional stability estimates. We utilize the concept of the distance function to establish the h… ▽ More In this paper, we study the Tikhonov regularization scheme in Hilbert scales for the nonlinear statistical inverse problem with a general noise. The regularizing norm in this scheme is stronger than the norm in Hilbert space. We focus on developing a theoretical analysis for this scheme based on the conditional stability estimates. We utilize the concept of the distance function to establish the high probability estimates of the direct and reconstruction error in Reproducing kernel Hilbert space setting. Further, the explicit rates of convergence in terms of sample size are established for the oversmoothing case and the regular case over the regularity class defined through appropriate source condition. Our results improve and generalize previous results obtained in related settings. △ Less

Submitted 28 August, 2022; originally announced August 2022.

Journal ref: Journal of Complexity 82 (2024) 101824

arXiv:2208.09628 [pdf, other]

Are You Comfortable Now: Deep Learning the Temporal Variation in Thermal Comfort in Winters

Authors: Betty Lala, Srikant Manas Kala, Anmol Rastogi, Kunal Dahiya, Aya Hagishima

Abstract: Indoor thermal comfort in smart buildings has a significant impact on the health and performance of occupants. Consequently, machine learning (ML) is increasingly used to solve challenges related to indoor thermal comfort. Temporal variability of thermal comfort perception is an important problem that regulates occupant well-being and energy consumption. However, in most ML-based thermal comfort s… ▽ More Indoor thermal comfort in smart buildings has a significant impact on the health and performance of occupants. Consequently, machine learning (ML) is increasingly used to solve challenges related to indoor thermal comfort. Temporal variability of thermal comfort perception is an important problem that regulates occupant well-being and energy consumption. However, in most ML-based thermal comfort studies, temporal aspects such as the time of day, circadian rhythm, and outdoor temperature are not considered. This work addresses these problems. It investigates the impact of circadian rhythm and outdoor temperature on the prediction accuracy and classification performance of ML models. The data is gathered through month-long field experiments carried out in 14 classrooms of 5 schools, involving 512 primary school students. Four thermal comfort metrics are considered as the outputs of Deep Neural Networks and Support Vector Machine models for the dataset. The effect of temporal variability on school children's comfort is shown through a "time of day" analysis. Temporal variability in prediction accuracy is demonstrated (up to 80%). Furthermore, we show that outdoor temperature (varying over time) positively impacts the prediction performance of thermal comfort models by up to 30%. The importance of spatio-temporal context is demonstrated by contrasting micro-level (location specific) and macro-level (6 locations across a city) performance. The most important finding of this work is that a definitive improvement in prediction accuracy is shown with an increase in the time of day and sky illuminance, for multiple thermal comfort metrics. △ Less

Submitted 20 August, 2022; originally announced August 2022.

Comments: Accepted for publication in IEEE SMC 2022

arXiv:2208.08484 [pdf, other]

doi 10.1109/ISSREW55968.2022.00035

When malloc() Never Returns NULL -- Reliability as an Illusion

Authors: Gunnar Kudrjavets, Jeff Thomas, Aditya Kumar, Nachiappan Nagappan, Ayushi Rastogi

Abstract: For decades, the guidance given to software engineers has been to check the memory allocation results. This validation step is necessary to avoid crashes. However, in user mode, in modern operating systems (OS), such as Android, FreeBSD, iOS, and macOS, the caller does not have an opportunity to handle the memory allocation failures. This behavioral trait results from the actions of a system compo… ▽ More For decades, the guidance given to software engineers has been to check the memory allocation results. This validation step is necessary to avoid crashes. However, in user mode, in modern operating systems (OS), such as Android, FreeBSD, iOS, and macOS, the caller does not have an opportunity to handle the memory allocation failures. This behavioral trait results from the actions of a system component called an out-of-memory (OOM) killer. We identify that the only mainstream OS that, by default, lets applications detect memory allocation failures is Microsoft Windows. The false expectation that an application can handle OOM errors can negatively impact its design. The presence of error-handling code creates an illusion of reliability and is wasteful in terms of lines of code and code size. We describe the current behavior of a sample of popular OSs during low-memory conditions and provide recommendations for engineering practices going forward. △ Less

Submitted 17 August, 2022; originally announced August 2022.

Comments: 6 pages. To be published in the 33rd IEEE International Symposium on Software Reliability Engineering (ISSRE 2022), Oct 31 - Nov 3 2022, Charlotte, North Carolina, USA

arXiv:2206.14202 [pdf, other]

Building Matters: Spatial Variability in Machine Learning Based Thermal Comfort Prediction in Winters

Authors: Betty Lala, Srikant Manas Kala, Anmol Rastogi, Kunal Dahiya, Hirozumi Yamaguchi, Aya Hagishima

Abstract: Thermal comfort in indoor environments has an enormous impact on the health, well-being, and performance of occupants. Given the focus on energy efficiency and Internet-of-Things enabled smart buildings, machine learning (ML) is being increasingly used for data-driven thermal comfort (TC) prediction. Generally, ML-based solutions are proposed for air-conditioned or HVAC ventilated buildings and th… ▽ More Thermal comfort in indoor environments has an enormous impact on the health, well-being, and performance of occupants. Given the focus on energy efficiency and Internet-of-Things enabled smart buildings, machine learning (ML) is being increasingly used for data-driven thermal comfort (TC) prediction. Generally, ML-based solutions are proposed for air-conditioned or HVAC ventilated buildings and the models are primarily designed for adults. On the other hand, naturally ventilated (NV) buildings are the norm in most countries. They are also ideal for energy conservation and long-term sustainability goals. However, the indoor environment of NV buildings lacks thermal regulation and varies significantly across spatial contexts. These factors make TC prediction extremely challenging. Thus, determining the impact of the building environment on the performance of TC models is important. Further, the generalization capability of TC prediction models across different NV indoor spaces needs to be studied. This work addresses these problems. Data is gathered through month-long field experiments conducted in 5 naturally ventilated school buildings, involving 512 primary school students. The impact of spatial variability on student comfort is demonstrated through variation in prediction accuracy (by as much as 71%). The influence of building environment on TC prediction is also demonstrated through variation in feature importance. Further, a comparative analysis of spatial variability in model performance is done for children (our dataset) and adults (ASHRAE-II database). Finally, the generalization capability of thermal comfort models in NV classrooms is assessed and major challenges are highlighted. △ Less

Submitted 28 June, 2022; originally announced June 2022.

Comments: Accepted in SmartSys SMARTCOMP 2022

arXiv:2206.11728 [pdf, ps, other]

doi 10.1109/ICSME55016.2022.00079

There Ain't No Such Thing as a Free Custom Memory Allocator

Authors: Gunnar Kudrjavets, Jeff Thomas, Aditya Kumar, Nachiappan Nagappan, Ayushi Rastogi

Abstract: Using custom memory allocators is an efficient performance optimization technique. However, dependency on a custom allocator can introduce several maintenance-related issues. We present lessons learned from the industry and provide critical guidance for using custom memory allocators and enumerate various challenges associated with integrating them. These recommendations are based on years of expe… ▽ More Using custom memory allocators is an efficient performance optimization technique. However, dependency on a custom allocator can introduce several maintenance-related issues. We present lessons learned from the industry and provide critical guidance for using custom memory allocators and enumerate various challenges associated with integrating them. These recommendations are based on years of experience incorporating custom allocators into different industrial software projects. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: 4 pages. To be published in 38th IEEE International Conference on Software Maintenance and Evolution (ICSME 2022), Oct 3-7, 2022, Limassol, Cyprus

arXiv:2206.05616 [pdf, other]

doi 10.1109/ICSME55016.2022.00027

Is Kernel Code Different From Non-Kernel Code? A Case Study of BSD Family Operating Systems

Authors: Gunnar Kudrjavets, Jeff Thomas, Nachiappan Nagappan, Ayushi Rastogi

Abstract: Code churn and code velocity describe the evolution of a code base. Current research quantifies and studies code churn and velocity at a high level of abstraction, often at the overall project level or even at the level of an entire company. We argue that such an approach ignores noticeable differences among the subsystems of large projects. We conducted an exploratory study on four BSD family ope… ▽ More Code churn and code velocity describe the evolution of a code base. Current research quantifies and studies code churn and velocity at a high level of abstraction, often at the overall project level or even at the level of an entire company. We argue that such an approach ignores noticeable differences among the subsystems of large projects. We conducted an exploratory study on four BSD family operating systems: DragonFlyBSD, FreeBSD, NetBSD, and OpenBSD. We mine 797,879 commits to characterize code churn in terms of the annual growth rate, commit types, change type ratio, and size taxonomy of commits for different subsystems (kernel, non-kernel, and mixed). We also investigate differences among various code review periods, i.e., time-to-first-response, time-to-accept, and time-to-merge, as indicators of code velocity. Our study provides empirical evidence that quantifiable evolutionary code characteristics at a global system scope fail to take into account significant individual differences that exist at a subsystem level. We found that while there exist similarities in the code base growth rate and distribution of commit types (neutral, additive, and subtractive) across BSD subsystems, (a) most commits contain kernel or non-kernel code exclusively, (b) kernel commits are larger than non-kernel commits, and (c) code reviews for kernel code take longer than non-kernel code. △ Less

Submitted 11 June, 2022; originally announced June 2022.

Comments: 13 pages. To be published in 38th IEEE International Conference on Software Maintenance and Evolution (ICSME 2022), Oct 3-7, 2022, Limassol, Cyprus

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur… ▽ More Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting. △ Less

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2204.04327 [pdf, other]

doi 10.18653/v1/2022.naacl-main.336

Show, Don't Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue

Authors: Raghav Gupta, Harrison Lee, Jeffrey Zhao, Abhinav Rastogi, Yuan Cao, Yonghui Wu

Abstract: Building universal dialogue systems that operate across multiple domains/APIs and generalize to new ones with minimal overhead is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions only indirectly convey schema semantics. In this work, we propose Show, Don't Tell, which prompts seq2seq models with a label… ▽ More Building universal dialogue systems that operate across multiple domains/APIs and generalize to new ones with minimal overhead is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions only indirectly convey schema semantics. In this work, we propose Show, Don't Tell, which prompts seq2seq models with a labeled example dialogue to show the semantics of schema elements rather than tell the model through descriptions. While requiring similar effort from service developers as generating descriptions, we show that using short examples as schema representations with large language models results in state-of-the-art performance on two popular dialogue state tracking benchmarks designed to measure zero-shot generalization - the Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark. △ Less

Submitted 17 October, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: NAACL 2022

Journal ref: In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4541-4549, Seattle, United States. Association for Computational Linguistics

arXiv:2203.07473 [pdf, other]

doi 10.1145/3524842.3528005

The Unexplored Treasure Trove of Phabricator Code Review

Authors: Gunnar Kudrjavets, Nachiappan Nagappan, Ayushi Rastogi

Abstract: Phabricator is a modern code collaboration tool used by popular projects like FreeBSD and Mozilla. However, unlike the other well-known code review environments, such as Gerrit or GitHub, there is no readily accessible public code review dataset for Phabricator. This paper describes our experience mining code reviews from five different projects that use Phabricator (Blender, FreeBSD, KDE, LLVM, a… ▽ More Phabricator is a modern code collaboration tool used by popular projects like FreeBSD and Mozilla. However, unlike the other well-known code review environments, such as Gerrit or GitHub, there is no readily accessible public code review dataset for Phabricator. This paper describes our experience mining code reviews from five different projects that use Phabricator (Blender, FreeBSD, KDE, LLVM, and Mozilla). We discuss the challenges associated with the data retrieval process and our solutions, resulting in a dataset with details regarding 317,476 Phabricator code reviews. Our dataset is available in both JSON and MySQL database dump formats. The dataset enables analyses of the history of code reviews at a more granular level than other platforms. In addition, given that the projects we mined are publicly accessible via the Conduit API, our dataset can be used as a foundation to fetch additional details and insights. △ Less

Submitted 14 March, 2022; originally announced March 2022.

Comments: 5 pages. To be published in Proceedings of MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories (MSR 2022). ACM, New York, NY, USA

arXiv:2203.05048 [pdf, other]

doi 10.1145/3524842.3528432

Mining Code Review Data to Understand Waiting Times Between Acceptance and Merging: An Empirical Analysis

Authors: Gunnar Kudrjavets, Aditya Kumar, Nachiappan Nagappan, Ayushi Rastogi

Abstract: Increasing code velocity (or the speed with which code changes are reviewed and merged) is integral to speeding up development and contributes to the work satisfaction of engineers. While factors affecting code change acceptance have been investigated in the past, solutions to decrease the code review lifetime are less understood. This study investigates the code review process to quantify delays… ▽ More Increasing code velocity (or the speed with which code changes are reviewed and merged) is integral to speeding up development and contributes to the work satisfaction of engineers. While factors affecting code change acceptance have been investigated in the past, solutions to decrease the code review lifetime are less understood. This study investigates the code review process to quantify delays and investigate opportunities to potentially increase code velocity. We study the temporal characteristics of half a million code reviews hosted on Gerrit and Phabricator, starting from the first response, to a decision to accept or reject the changes, and until the changes are merged into a target branch. We identified two types of time delays: (a) the wait time from the proposal of code changes until first response, and (b) the wait time between acceptance and merging. Our study indicates that reducing the time between acceptance and merging has the potential to speed up Phabricator code reviews by 29-63%. Small code changes and changes made by authors with a large number of previously accepted code reviews have a higher chance of being immediately accepted, without code review iterations. Our analysis suggests that switching from manual to automatic merges can help increase code velocity. △ Less

Submitted 9 March, 2022; originally announced March 2022.

Comments: 12 pages. To be published in Proceedings of MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories (MSR 2022). ACM, New York, NY, USA

arXiv:2203.05045 [pdf, other]

doi 10.1145/3524842.3528448

Do Small Code Changes Merge Faster? A Multi-Language Empirical Investigation

Authors: Gunnar Kudrjavets, Nachiappan Nagappan, Ayushi Rastogi

Abstract: Code velocity, or the speed with which code changes are integrated into a production environment, plays a crucial role in Continuous Integration and Continuous Deployment. Many studies report factors influencing code velocity. However, solutions to increase code velocity are unclear. Meanwhile, the industry continues to issue guidelines on "ideal" code change size, believing it increases code velo… ▽ More Code velocity, or the speed with which code changes are integrated into a production environment, plays a crucial role in Continuous Integration and Continuous Deployment. Many studies report factors influencing code velocity. However, solutions to increase code velocity are unclear. Meanwhile, the industry continues to issue guidelines on "ideal" code change size, believing it increases code velocity despite lacking evidence validating the practice. Surprisingly, this fundamental question has not been studied to date. This study investigates the practicality of improving code velocity by optimizing pull request size and composition (ratio of insertions, deletions, and modifications). We start with a hypothesis that a moderate correlation exists between pull request size and time-to-merge. We selected 100 most popular, actively developed projects from 10 programming languages on GitHub. We analyzed our dataset of 845,316 pull requests by size, composition, and context to explore its relationship to time-to-merge - a proxy to measure code velocity. Our study shows that pull request size and composition do not relate to time-to-merge. Regardless of the contextual factors that can influence pull request size or composition (e.g., programming language), the observation holds. Pull request data from two other platforms: Gerrit and Phabricator (401,790 code reviews) confirms the lack of relationship. This negative result as in "... eliminate useless hypotheses ..." challenges a widespread belief by showing that small code changes do not merge faster to increase code velocity. △ Less

Submitted 9 March, 2022; originally announced March 2022.

Comments: 12 pages. To be published in Proceedings of MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories (MSR 2022). ACM, New York, NY, USA

arXiv:2203.04394 [pdf, other]

doi 10.1145/3524613.3527803

Quantifying Daily Evolution of Mobile Software Based on Memory Allocator Churn

Authors: Gunnar Kudrjavets, Jeff Thomas, Aditya Kumar, Nachiappan Nagappan, Ayushi Rastogi

Abstract: The pace and volume of code churn necessary to evolve modern software systems present challenges for analyzing the performance impact of any set of code changes. Traditional methods used in performance analysis rely on extensive data collection and profiling, which often takes days. For large organizations utilizing Continuous Integration (CI) and Continuous Deployment (CD), these traditional tech… ▽ More The pace and volume of code churn necessary to evolve modern software systems present challenges for analyzing the performance impact of any set of code changes. Traditional methods used in performance analysis rely on extensive data collection and profiling, which often takes days. For large organizations utilizing Continuous Integration (CI) and Continuous Deployment (CD), these traditional techniques often fail to provide timely and actionable data. A different impact analysis method that allows for more efficient detection of performance regressions is needed. We propose the utilization of user mode memory allocator churn as a novel approach to performance engineering. User mode allocator churn acts as a proxy metric to evaluate the relative change in the cost of specific tasks. We prototyped the memory allocation churn methodology while engaged in performance engineering for a major iOS application. We find that calculating and analyzing memory allocator churn (a) results in deterministic measurements, (b) is efficient for determining the presence of both individual performance regressions and general performance-related trends, and (c) is a suitable alternative to measuring the task completion time. △ Less

Submitted 6 May, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: 5 pages. To be published in Proceedings of The 9th International Conference on Mobile Software Engineering and Systems (MobileSoft '22). ACM, New York, NY, USA

arXiv:2201.12409 [pdf, other]

A Unified Approach to Entity-Centric Context Tracking in Social Conversations

Authors: Ulrich Rückert, Srinivas Sunkara, Abhinav Rastogi, Sushant Prakash, Pranav Khaitan

Abstract: In human-human conversations, Context Tracking deals with identifying important entities and keeping track of their properties and relationships. This is a challenging problem that encompasses several subtasks such as slot tagging, coreference resolution, resolving plural mentions and entity linking. We approach this problem as an end-to-end modeling task where the conversational context is repres… ▽ More In human-human conversations, Context Tracking deals with identifying important entities and keeping track of their properties and relationships. This is a challenging problem that encompasses several subtasks such as slot tagging, coreference resolution, resolving plural mentions and entity linking. We approach this problem as an end-to-end modeling task where the conversational context is represented by an entity repository containing the entity references mentioned so far, their properties and the relationships between them. The repository is updated turn-by-turn, thus making training and inference computationally efficient even for long conversations. This paper lays the groundwork for an investigation of this framework in two ways. First, we release Contrack, a large scale human-human conversation corpus for context tracking with people and location annotations. It contains over 7000 conversations with an average of 11.8 turns, 5.8 entities and 15.2 references per conversation. Second, we open-source a neural network architecture for context tracking. Finally we compare this network to state-of-the-art approaches for the subtasks it subsumes and report results on the involved tradeoffs. △ Less

Submitted 26 April, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

Comments: Published at LREC 2022

arXiv:2201.10599 [pdf, other]

doi 10.1145/3510457.3513057

The Unexplored Terrain of Compiler Warnings

Authors: Gunnar Kudrjavets, Aditya Kumar, Nachiappan Nagappan, Ayushi Rastogi

Abstract: The authors' industry experiences suggest that compiler warnings, a lightweight version of program analysis, are valuable early bug detection tools. Significant costs are associated with patches and security bulletins for issues that could have been avoided if compiler warnings were addressed. Yet, the industry's attitude towards compiler warnings is mixed. Practices range from silencing all compi… ▽ More The authors' industry experiences suggest that compiler warnings, a lightweight version of program analysis, are valuable early bug detection tools. Significant costs are associated with patches and security bulletins for issues that could have been avoided if compiler warnings were addressed. Yet, the industry's attitude towards compiler warnings is mixed. Practices range from silencing all compiler warnings to having a zero-tolerance policy as to any warnings. Current published data indicates that addressing compiler warnings early is beneficial. However, support for this value theory stems from grey literature or is anecdotal. Additional focused research is needed to truly assess the cost-benefit of addressing warnings. △ Less

Submitted 25 January, 2022; originally announced January 2022.

Comments: 2 pages. To be published in 44nd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '22), May 21-29, 2022, Pittsburgh, PA, USA

arXiv:2201.08904 [pdf, other]

Description-Driven Task-Oriented Dialog Modeling

Authors: Jeffrey Zhao, Raghav Gupta, Yuan Cao, Dian Yu, Mingqiu Wang, Harrison Lee, Abhinav Rastogi, Izhak Shafran, Yonghui Wu

Abstract: Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not con… ▽ More Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not convey their semantics effectively. This can lead to models memorizing arbitrary patterns in data, resulting in suboptimal performance and generalization. In this paper, we propose that schemata should be modified by replacing names or notations entirely with natural language descriptions. We show that a language description-driven system exhibits better understanding of task specifications, higher performance on state tracking, improved data efficiency, and effective zero-shot transfer to unseen tasks. Following this paradigm, we present a simple yet effective Description-Driven Dialog State Tracking (D3ST) model, which relies purely on schema descriptions and an "index-picking" mechanism. We demonstrate the superiority in quality, data efficiency and robustness of our approach as measured on the MultiWOZ (Budzianowski et al.,2018), SGD (Rastogi et al., 2020), and the recent SGD-X (Lee et al., 2021) benchmarks. △ Less

Submitted 21 January, 2022; originally announced January 2022.

arXiv:2111.15149 [pdf, other]

doi 10.1145/3409003

SteelCore: An Extensible Concurrent Separation Logic for Effectful Dependently Typed Programs

Authors: Nikhil Swamy, Aseem Rastogi, Aymeric Fromherz, Denis Merigoux, Danel Ahman, Guido Martínez

Abstract: Much recent research has been devoted to modeling effects within type theory. Building on this work, we observe that effectful type theories can provide a foundation on which to build semantics for more complex programming constructs and program logics, extending the reasoning principles that apply within the host effectful type theory itself. Concretely, our main contribution is a semantics for c… ▽ More Much recent research has been devoted to modeling effects within type theory. Building on this work, we observe that effectful type theories can provide a foundation on which to build semantics for more complex programming constructs and program logics, extending the reasoning principles that apply within the host effectful type theory itself. Concretely, our main contribution is a semantics for concurrent separation logic (CSL) within the F* proof assistant in a manner that enables dependently typed, effectful F* programs to make use of concurrency and to be specified and verified using a full-featured, extensible CSL. In contrast to prior approaches, we directly derive the partial-correctness Hoare rules for CSL from the denotation of computations in the effectful semantics of non-deterministically interleaved atomic actions. Demonstrating the flexibility of our semantics, we build generic, verified libraries that support various concurrency constructs, ranging from dynamically allocated, storable spin locks, to protocol-indexed channels. We conclude that our effectful semantics provides a simple yet expressive basis on which to layer domain-specific languages and logics for verified, concurrent programming. △ Less

Submitted 30 November, 2021; originally announced November 2021.

Comments: ICFP 2020 camera-ready version

arXiv:2110.06800 [pdf, other]

doi 10.1609/aaai.v36i10.21341

SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems

Authors: Harrison Lee, Raghav Gupta, Abhinav Rastogi, Yuan Cao, Bin Zhang, Yonghui Wu

Abstract: Zero/few-shot transfer to unseen services is a critical challenge in task-oriented dialogue research. The Schema-Guided Dialogue (SGD) dataset introduced a paradigm for enabling models to support any service in zero-shot through schemas, which describe service APIs to models in natural language. We explore the robustness of dialogue systems to linguistic variations in schemas by designing SGD-X -… ▽ More Zero/few-shot transfer to unseen services is a critical challenge in task-oriented dialogue research. The Schema-Guided Dialogue (SGD) dataset introduced a paradigm for enabling models to support any service in zero-shot through schemas, which describe service APIs to models in natural language. We explore the robustness of dialogue systems to linguistic variations in schemas by designing SGD-X - a benchmark extending SGD with semantically similar yet stylistically diverse variants for every schema. We observe that two top state tracking models fail to generalize well across schema variants, measured by joint goal accuracy and a novel metric for measuring schema sensitivity. Additionally, we present a simple model-agnostic data augmentation method to improve schema robustness. △ Less

Submitted 23 August, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: AAAI 2022

Journal ref: Lee, H., Gupta, R., Rastogi, A., Cao, Y., Zhang, B., & Wu, Y. (2022). SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 10938-10946

arXiv:2108.09946 [pdf, other]

Pull Request Latency Explained: An Empirical Overview

Authors: Xunhui Zhang, Yue Yu, Tao Wang, Ayushi Rastogi, Huaimin Wang

Abstract: Pull request latency evaluation is an essential application of effort evaluation in the pull-based development scenario. It can help the reviewers sort the pull request queue, remind developers about the review processing time, speed up the review process and accelerate software development. There is a lack of work that systematically organizes the factors that affect pull request latency. Also, t… ▽ More Pull request latency evaluation is an essential application of effort evaluation in the pull-based development scenario. It can help the reviewers sort the pull request queue, remind developers about the review processing time, speed up the review process and accelerate software development. There is a lack of work that systematically organizes the factors that affect pull request latency. Also, there is no related work discussing the differences and variations in characteristics in different scenarios and contexts. In this paper, we collected relevant factors through a literature review approach. Then we assessed their relative importance in five scenarios and six different contexts using the mixed-effects linear regression model. We find that the relative importance of factors differs in different scenarios, e.g., the first response time of the reviewer is most important when there exist comments. Meanwhile, the number of commits in a pull request has a more significant impact on pull request latency when closing than submitting due to changes in contributions brought about by the review process. △ Less

Submitted 23 August, 2021; originally announced August 2021.

arXiv:2107.13731 [pdf, other]

UIBert: Learning Generic Multimodal Representations for UI Understanding

Authors: Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Aguera y Arcas

Abstract: To improve the accessibility of smart devices and to simplify their usage, building models which understand user interfaces (UIs) and assist users to complete their tasks is critical. However, unique challenges are proposed by UI-specific characteristics, such as how to effectively leverage multimodal UI features that involve image, text, and structural metadata and how to achieve good performance… ▽ More To improve the accessibility of smart devices and to simplify their usage, building models which understand user interfaces (UIs) and assist users to complete their tasks is critical. However, unique challenges are proposed by UI-specific characteristics, such as how to effectively leverage multimodal UI features that involve image, text, and structural metadata and how to achieve good performance when high-quality labeled data is unavailable. To address such challenges we introduce UIBert, a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data to learn generic feature representations for a UI and its components. Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components, are predictive of each other. We propose five pretraining tasks utilizing this self-alignment among different features of a UI component and across various components in the same UI. We evaluate our method on nine real-world downstream UI tasks where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy. △ Less

Submitted 10 August, 2021; v1 submitted 28 July, 2021; originally announced July 2021.

Comments: 8 pages, IJCAI 2021

arXiv:2107.05829 [pdf, other]

Promises and Perils of Inferring Personality on GitHub

Authors: Frenk van Mil, Ayushi Rastogi, Andy Zaidman

Abstract: Personality plays a pivotal role in our understanding of human actions and behavior. Today, the applications of personality are widespread, built on the solutions from psychology to infer personality. In software engineering, for instance, one widely used solution to infer personality uses textual communication data. As studies on personality in software engineering continue to grow, it is imperat… ▽ More Personality plays a pivotal role in our understanding of human actions and behavior. Today, the applications of personality are widespread, built on the solutions from psychology to infer personality. In software engineering, for instance, one widely used solution to infer personality uses textual communication data. As studies on personality in software engineering continue to grow, it is imperative to understand the performance of these solutions. This paper compares the inferential ability of three widely studied text-based personality tests against each other and the ground truth on GitHub. We explore the challenges and potential solutions to improve the inferential ability of personality tests. Our study shows that solutions for inferring personality are far from being perfect. Software engineering communications data can infer individual developer personality with an average error rate of 41%. In the best case, the error rate can be reduced up to 36% by following our recommendations. △ Less

Submitted 15 July, 2021; v1 submitted 12 July, 2021; originally announced July 2021.

arXiv:2106.01885 [pdf, other]

How does Software Change?

Authors: Ayushi Rastogi, Georgios Gousios

Abstract: Software evolves with changes to its codebase over time. Internally, software changes in response to decisions to include some code change into the codebase and discard others. Explaining the mechanism of software evolution, this paper presents a theory of software change. Our theory is grounded in multiple evidence sources (e.g., GitHub documentation and relevant scientific literature) relating t… ▽ More Software evolves with changes to its codebase over time. Internally, software changes in response to decisions to include some code change into the codebase and discard others. Explaining the mechanism of software evolution, this paper presents a theory of software change. Our theory is grounded in multiple evidence sources (e.g., GitHub documentation and relevant scientific literature) relating to the pull-based development model in GitHub. The resulting theory explains the influence of project-related core concepts (e.g., people and governance) as well as its ecosystem on the decision of software change. △ Less

Submitted 3 June, 2021; originally announced June 2021.

arXiv:2105.13970 [pdf, other]

Pull Request Decision Explained: An Empirical Overview

Authors: Xunhui Zhang, Yue Yu, Georgios Gousios, Ayushi Rastogi

Abstract: Context: Pull-based development model is widely used in open source, leading the trends in distributed software development. One aspect which has garnered significant attention is studies on pull request decision - identifying factors for explanation. Objective: This study builds on a decade long research on pull request decision to explain it. We empirically investigate how factors influence pull… ▽ More Context: Pull-based development model is widely used in open source, leading the trends in distributed software development. One aspect which has garnered significant attention is studies on pull request decision - identifying factors for explanation. Objective: This study builds on a decade long research on pull request decision to explain it. We empirically investigate how factors influence pull request decision and scenarios that change the influence of factors. Method: We identify factors influencing pull request decision on GitHub through a systematic literature review and infer it by mining archival data. We collect a total of 3,347,937 pull requests with 95 features from 11,230 diverse projects on GitHub. Using this data, we explore the relations of the factors to each other and build mixed-effect logistic regression models to empirically explain pull request decision. Results: Our study shows that a small number of factors explain pull request decision with the integrator same or different from the submitter as the most important factor. We also noted that some factors are important only in special cases e.g., the percentage of failed builds is important for pull request decision when continuous integration is used. △ Less

Submitted 28 May, 2021; originally announced May 2021.

arXiv:2105.04236 [pdf, other]

SIRNN: A Math Library for Secure RNN Inference

Authors: Deevashwer Rathee, Mayank Rathee, Rahul Kranti Kiran Goli, Divya Gupta, Rahul Sharma, Nishanth Chandran, Aseem Rastogi

Abstract: Complex machine learning (ML) inference algorithms like recurrent neural networks (RNNs) use standard functions from math libraries like exponentiation, sigmoid, tanh, and reciprocal of square root. Although prior work on secure 2-party inference provides specialized protocols for convolutional neural networks (CNNs), existing secure implementations of these math operators rely on generic 2-party… ▽ More Complex machine learning (ML) inference algorithms like recurrent neural networks (RNNs) use standard functions from math libraries like exponentiation, sigmoid, tanh, and reciprocal of square root. Although prior work on secure 2-party inference provides specialized protocols for convolutional neural networks (CNNs), existing secure implementations of these math operators rely on generic 2-party computation (2PC) protocols that suffer from high communication. We provide new specialized 2PC protocols for math functions that crucially rely on lookup-tables and mixed-bitwidths to address this performance overhead; our protocols for math functions communicate up to 423x less data than prior work. Some of the mixed bitwidth operations used by our math implementations are (zero and signed) extensions, different forms of truncations, multiplication of operands of mixed-bitwidths, and digit decomposition (a generalization of bit decomposition to larger digits). For each of these primitive operations, we construct specialized 2PC protocols that are more communication efficient than generic 2PC, and can be of independent interest. Furthermore, our math implementations are numerically precise, which ensures that the secure implementations preserve model accuracy of cleartext. We build on top of our novel protocols to build SIRNN, a library for end-to-end secure 2-party DNN inference, that provides the first secure implementations of an RNN operating on time series sensor data, an RNN operating on speech data, and a state-of-the-art ML architecture that combines CNNs and RNNs for identifying all heads present in images. Our evaluation shows that SIRNN achieves up to three orders of magnitude of performance improvement when compared to inference of these models using an existing state-of-the-art 2PC framework. △ Less

Submitted 10 May, 2021; originally announced May 2021.

Comments: IEEE Security and Privacy 2021

arXiv:2104.10824 [pdf, other]

doi 10.1371/journal.pone.0255809

Colonoscopy Polyp Detection and Classification: Dataset Creation and Comparative Evaluations

Authors: Kaidong Li, Mohammad I. Fathan, Krushi Patel, Tianxiao Zhang, Cuncong Zhong, Ajay Bansal, Amit Rastogi, Jean S. Wang, Guanghui Wang

Abstract: Colorectal cancer (CRC) is one of the most common types of cancer with a high mortality rate. Colonoscopy is the preferred procedure for CRC screening and has proven to be effective in reducing CRC mortality. Thus, a reliable computer-aided polyp detection and classification system can significantly increase the effectiveness of colonoscopy. In this paper, we create an endoscopic dataset collected… ▽ More Colorectal cancer (CRC) is one of the most common types of cancer with a high mortality rate. Colonoscopy is the preferred procedure for CRC screening and has proven to be effective in reducing CRC mortality. Thus, a reliable computer-aided polyp detection and classification system can significantly increase the effectiveness of colonoscopy. In this paper, we create an endoscopic dataset collected from various sources and annotate the ground truth of polyp location and classification results with the help of experienced gastroenterologists. The dataset can serve as a benchmark platform to train and evaluate the machine learning models for polyp classification. We have also compared the performance of eight state-of-the-art deep learning-based object detection models. The results demonstrate that deep CNN models are promising in CRC screening. This work can serve as a baseline for future research in polyp detection and classification. △ Less

Submitted 5 August, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

arXiv:2012.05064 [pdf, other]

Secure Medical Image Analysis with CrypTFlow

Authors: Javier Alvarez-Valle, Pratik Bhatu, Nishanth Chandran, Divya Gupta, Aditya Nori, Aseem Rastogi, Mayank Rathee, Rahul Sharma, Shubham Ugare

Abstract: We present CRYPTFLOW, a system that converts TensorFlow inference code into Secure Multi-party Computation (MPC) protocols at the push of a button. To do this, we build two components. Our first component is an end-to-end compiler from TensorFlow to a variety of MPC protocols. The second component is an improved semi-honest 3-party protocol that provides significant speedups for inference. We empi… ▽ More We present CRYPTFLOW, a system that converts TensorFlow inference code into Secure Multi-party Computation (MPC) protocols at the push of a button. To do this, we build two components. Our first component is an end-to-end compiler from TensorFlow to a variety of MPC protocols. The second component is an improved semi-honest 3-party protocol that provides significant speedups for inference. We empirically demonstrate the power of our system by showing the secure inference of real-world neural networks such as DENSENET121 for detection of lung diseases from chest X-ray images and 3D-UNet for segmentation in radiotherapy planning using CT images. In particular, this paper provides the first evaluation of secure segmentation of 3D images, a task that requires much more powerful models than classification and is the largest secure inference task run till date. △ Less

Submitted 9 December, 2020; originally announced December 2020.

Comments: 6 pages. PPML NeurIPS 2020 Workshop, Vancouver, Canada. arXiv admin note: substantial text overlap with arXiv:1909.07814

arXiv:2011.06486 [pdf, ps, other]

Overview of the Ninth Dialog System Technology Challenge: DSTC9

Authors: Chulaka Gunasekara, Seokhwan Kim, Luis Fernando D'Haro, Abhinav Rastogi, Yun-Nung Chen, Mihail Eric, Behnam Hedayatnia, Karthik Gopalakrishnan, Yang Liu, Chao-Wei Huang, Dilek Hakkani-Tür, Jinchao Li, Qi Zhu, Lingxiao Luo, Lars Liden, Kaili Huang, Shahin Shayandeh, Runze Liang, Baolin Peng, Zheng Zhang, Swadheen Shukla, Minlie Huang, Jianfeng Gao, Shikib Mehri, Yulan Feng , et al. (14 additional authors not shown)

Abstract: This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies for four distinct tasks in dialog systems, namely, 1. Task-oriented dialog Modeling with unstructured knowledge access, 2. Multi-domain task-oriented dialog, 3. Interactive evaluation of dialog, and 4. Situated interactive multi-modal dialog. This… ▽ More This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies for four distinct tasks in dialog systems, namely, 1. Task-oriented dialog Modeling with unstructured knowledge access, 2. Multi-domain task-oriented dialog, 3. Interactive evaluation of dialog, and 4. Situated interactive multi-modal dialog. This paper describes the task definition, provided datasets, baselines and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks. △ Less

Submitted 12 November, 2020; originally announced November 2020.

arXiv:2010.06457 [pdf, other]

doi 10.1145/3372297.3417274

CrypTFlow2: Practical 2-Party Secure Inference

Authors: Deevashwer Rathee, Mayank Rathee, Nishant Kumar, Nishanth Chandran, Divya Gupta, Aseem Rastogi, Rahul Sharma

Abstract: We present CrypTFlow2, a cryptographic framework for secure inference over realistic Deep Neural Networks (DNNs) using secure 2-party computation. CrypTFlow2 protocols are both correct -- i.e., their outputs are bitwise equivalent to the cleartext execution -- and efficient -- they outperform the state-of-the-art protocols in both latency and scale. At the core of CrypTFlow2, we have new 2PC proto… ▽ More We present CrypTFlow2, a cryptographic framework for secure inference over realistic Deep Neural Networks (DNNs) using secure 2-party computation. CrypTFlow2 protocols are both correct -- i.e., their outputs are bitwise equivalent to the cleartext execution -- and efficient -- they outperform the state-of-the-art protocols in both latency and scale. At the core of CrypTFlow2, we have new 2PC protocols for secure comparison and division, designed carefully to balance round and communication complexity for secure inference tasks. Using CrypTFlow2, we present the first secure inference over ImageNet-scale DNNs like ResNet50 and DenseNet121. These DNNs are at least an order of magnitude larger than those considered in the prior work of 2-party DNN inference. Even on the benchmarks considered by prior work, CrypTFlow2 requires an order of magnitude less communication and 20x-30x less time than the state-of-the-art. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Comments: To appear at ACM CCS 2020. Code available at: https://github.com/mpc-msri/EzPC

arXiv:2010.03165 [pdf, other]

doi 10.1145/3368089.3409717

Questions for Data Scientists in Software Engineering: A Replication

Authors: Hennie Huijgens, Ayushi Rastogi, Ernst Mulders, Georgios Gousios, Arie van Deursen

Abstract: In 2014, a Microsoft study investigated the sort of questions that data science applied to software engineering should answer. This resulted in 145 questions that developers considered relevant for data scientists to answer, thus providing a research agenda to the community. Fast forward to five years, no further studies investigated whether the questions from the software engineers at Microsoft h… ▽ More In 2014, a Microsoft study investigated the sort of questions that data science applied to software engineering should answer. This resulted in 145 questions that developers considered relevant for data scientists to answer, thus providing a research agenda to the community. Fast forward to five years, no further studies investigated whether the questions from the software engineers at Microsoft hold for other software companies, including software-intensive companies with different primary focus (to which we refer as software-defined enterprises). Furthermore, it is not evident that the problems identified five years ago are still applicable, given the technological advances in software engineering. △ Less

Submitted 4 January, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

arXiv:2010.00822 [pdf, other]

doi 10.1109/TSE.2021.3092813

Including Everyone, Everywhere: Understanding Opportunities and Challenges of Geographic Gender-Inclusion in OSS

Authors: Gede Artha Azriadi Prana, Denae Ford, Ayushi Rastogi, David Lo, Rahul Purandare, Nachiappan Nagappan

Abstract: The gender gap is a significant concern facing the software industry as the development becomes more geographically distributed. Widely shared reports indicate that gender differences may be specific to each region. However, how complete can these reports be with little to no research reflective of the Open Source Software (OSS) process and communities software is now commonly developed in? Our st… ▽ More The gender gap is a significant concern facing the software industry as the development becomes more geographically distributed. Widely shared reports indicate that gender differences may be specific to each region. However, how complete can these reports be with little to no research reflective of the Open Source Software (OSS) process and communities software is now commonly developed in? Our study presents a multi-region geographical analysis of gender inclusion on GitHub. This mixed-methods approach includes quantitatively investigating differences in gender inclusion in projects across geographic regions and investigate these trends over time using data from contributions to 21,456 project repositories. We also qualitatively understand the unique experiences of developers contributing to these projects through a survey that is strategically targeted to developers in various regions worldwide. Our findings indicate that gender diversity is low across all parts of the world, with no substantial difference across regions. However, there has been statistically significant improvement in diversity worldwide since 2014, with certain regions such as Africa improving at faster pace. We also find that most motivations and barriers to contributions (e.g., lack of resources to contribute and poor working environment) were shared across regions, however, some insightful differences, such as how to make projects more inclusive, did arise. From these findings, we derive and present implications for tools that can foster inclusion in open source software communities and empower contributions from everyone, everywhere. △ Less

Submitted 15 September, 2021; v1 submitted 2 October, 2020; originally announced October 2020.

Comments: 19 pages, 16 tables, 3 figures, Includes appendices

Journal ref: IEEE Transactions on Software Engineering 2021

arXiv:2007.12720 [pdf, other]

MultiWOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines

Authors: Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, Jindong Chen

Abstract: MultiWOZ is a well-known task-oriented dialogue dataset containing over 10,000 annotated dialogues spanning 8 domains. It is extensively used as a benchmark for dialogue state tracking. However, recent works have reported presence of substantial noise in the dialogue state annotations. MultiWOZ 2.1 identified and fixed many of these erroneous annotations and user utterances, resulting in an improv… ▽ More MultiWOZ is a well-known task-oriented dialogue dataset containing over 10,000 annotated dialogues spanning 8 domains. It is extensively used as a benchmark for dialogue state tracking. However, recent works have reported presence of substantial noise in the dialogue state annotations. MultiWOZ 2.1 identified and fixed many of these erroneous annotations and user utterances, resulting in an improved version of this dataset. This work introduces MultiWOZ 2.2, which is a yet another improved version of this dataset. Firstly, we identify and fix dialogue state annotation errors across 17.3% of the utterances on top of MultiWOZ 2.1. Secondly, we redefine the ontology by disallowing vocabularies of slots with a large number of possible values (e.g., restaurant name, time of booking). In addition, we introduce slot span annotations for these slots to standardize them across recent models, which previously used custom string matching heuristics to generate them. We also benchmark a few state of the art dialogue state tracking models on the corrected dataset to facilitate comparison for future work. In the end, we discuss best practices for dialogue data collection that can help avoid annotation errors. △ Less

Submitted 10 July, 2020; originally announced July 2020.

Journal ref: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI (2020) 109-117

Showing 1–50 of 80 results for author: Rastogi, A