
arXiv:2406.20015 [pdf, other]

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

Authors: Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, Hayato Yamana

Abstract: Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community still needs to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM's hallucinations through two perspectives: depth and breadth. In terms of depth, we propose a multi-level diagnostic process, including (1) solvability detection, (2) solution planning, and (3) missing-tool analysis. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. Furthermore, we developed seven tasks and collected 700 evaluation samples through multiple rounds of manual annotation. The results show the significant challenges presented by the ToolBH benchmark. The current advanced models Gemini-1.5-Pro and GPT-4o only achieve a total score of 45.3 and 37.0, respectively, on a scale of 100. In this benchmark, larger model parameters do not guarantee better performance; the training data and response strategies also play a crucial role in tool-enhanced LLM scenarios. Our diagnostic analysis indicates that the primary reason for model errors lies in assessing task solvability. Additionally, open-weight models suffer from performance drops with verbose replies, whereas proprietary models excel with longer reasoning.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2405.12174 [pdf, other]

CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

Authors: Haoxiang Shi, Jiaan Wang, Jiarong Xu, Cen Wang, Tetsuya Sakai

Abstract: Text-to-Table aims to generate structured tables to convey the key information from unstructured documents. Existing text-to-table datasets are typically English-oriented, limiting the research in non-English languages. Meanwhile, the emergence of large language models (LLMs) has shown great success as general task solvers in multi-lingual settings (e.g., ChatGPT), theoretically enabling text-to-table in other languages. In this paper, we propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task. Our preliminary analysis of English text-to-table datasets highlights two key factors for dataset construction: data diversity and data hallucination. Inspired by this, the CT-Eval dataset selects a popular Chinese multidisciplinary online encyclopedia as the source and covers 28 domains to ensure data diversity. To minimize data hallucination, we first train an LLM to judge and filter out the task samples with hallucination, then employ human annotators to clean the hallucinations in the validation and testing sets. After this process, CT-Eval contains 88.6K task samples. Using CT-Eval, we evaluate the performance of open-source and closed-source LLMs. Our results reveal that zero-shot LLMs (including GPT-4) still have a significant performance gap compared with human judgment. Furthermore, after fine-tuning, open-source LLMs can significantly improve their text-to-table ability, outperforming GPT-4 by a large margin. In short, CT-Eval not only helps researchers evaluate and quickly understand the Chinese text-to-table ability of existing LLMs but also serves as a valuable resource to significantly improve the text-to-table performance of LLMs.

Submitted 20 May, 2024; originally announced May 2024.

Comments: 10 pages

arXiv:2405.03110 [pdf, other]

Vector Quantization for Recommender Systems: A Review and Outlook

Authors: Qijiong Liu, Xiaoyu Dong, Jiaren Xiao, Nuo Chen, Hengchang Hu, Jieming Zhu, Chenxu Zhu, Tetsuya Sakai, Xiao-Ming Wu

Abstract: Vector quantization, renowned for its unparalleled feature compression capabilities, has been a prominent topic in signal processing and machine learning research for several decades and remains widely utilized today. With the emergence of large models and generative AI, vector quantization has gained popularity in recommender systems, establishing itself as a preferred solution. This paper starts with a comprehensive review of vector quantization techniques. It then explores systematic taxonomies of vector quantization methods for recommender systems (VQ4Rec), examining their applications from multiple perspectives. Further, it provides a thorough introduction to research efforts in diverse recommendation scenarios, including efficiency-oriented approaches and quality-oriented approaches. Finally, the survey analyzes the remaining challenges and anticipates future trends in VQ4Rec, including the challenges associated with the training of vector quantization, the opportunities presented by large language models, and emerging trends in multimodal recommender systems. We hope this survey can pave the way for future researchers in the recommendation community and accelerate their exploration in this promising field.

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2404.13556 [pdf, other]

ChatRetriever: Adapting Large Language Models for Generalized and Robust Conversational Dense Retrieval

Authors: Kelong Mao, Chenlong Deng, Haonan Chen, Fengran Mo, Zheng Liu, Tetsuya Sakai, Zhicheng Dou

Abstract: Conversational search requires accurate interpretation of user intent from complex multi-turn contexts. This paper presents ChatRetriever, which inherits the strong generalization capability of large language models to robustly represent complex conversational sessions for dense retrieval. To achieve this, we propose a simple and effective dual-learning approach that adapts LLMs for retrieval via contrastive learning while enhancing complex session understanding through masked instruction tuning on high-quality conversational instruction tuning data. Extensive experiments on five conversational search benchmarks demonstrate that ChatRetriever substantially outperforms existing conversational dense retrievers, achieving state-of-the-art performance on par with LLM-based rewriting approaches. Furthermore, ChatRetriever exhibits superior robustness in handling diverse conversational contexts. Our work highlights the potential of adapting LLMs for retrieval with complex inputs like conversational search sessions and proposes an effective approach to advance this research direction.

Submitted 21 April, 2024; originally announced April 2024.

arXiv:2403.18462 [pdf, other]

Decoy Effect In Search Interaction: Understanding User Behavior and Measuring System Vulnerability

Authors: Nuo Chen, Jiqun Liu, Hanpei Fang, Yuankai Luo, Tetsuya Sakai, Xiao-Ming Wu

Abstract: This study examines the decoy effect's underexplored influence on user search interactions and methods for measuring information retrieval (IR) systems' vulnerability to this effect. It explores how decoy results alter users' interactions on search engine result pages, focusing on metrics like click-through likelihood, browsing time, and perceived document usefulness. By analyzing user interaction logs from multiple datasets, the study demonstrates that decoy results significantly affect users' behavior and perceptions. Furthermore, it investigates how different levels of task difficulty and user knowledge modify the decoy effect's impact, finding that easier tasks and lower knowledge levels lead to higher engagement with target documents. In terms of IR system evaluation, the study introduces the DEJA-VU metric to assess systems' susceptibility to the decoy effect, testing it on specific retrieval tasks. The results show differences in systems' effectiveness and vulnerability, contributing to our understanding of cognitive biases in search behavior and suggesting pathways for creating more balanced and bias-aware IR evaluations.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2311.02362 [pdf, other]

doi 10.20736/0002001351

Decoy Effect in Search Interaction: A Pilot Study

Authors: Nuo Chen, Jiqun Liu, Tetsuya Sakai, Xiao-Ming Wu

Abstract: In recent years, the influence of cognitive effects and biases on users' thinking, behaving, and decision-making has garnered increasing attention in the field of interactive information retrieval. The decoy effect, one of the main empirically confirmed cognitive biases, refers to the shift in preference between two choices when a third option (the decoy), which is inferior to one of the initial choices, is introduced. However, it is not clear how the decoy effect influences user interactions with and evaluations of Search Engine Result Pages (SERPs). To bridge this gap, our study seeks to understand how the decoy effect at the document level influences users' interaction behaviors on SERPs, such as clicks, dwell time, and usefulness perceptions. We conducted experiments on two publicly available user behavior datasets, and the findings reveal that, compared to cases where no decoy is present, a document is more likely to be clicked and its usefulness score can be higher when a decoy is associated with it.

Submitted 4 November, 2023; originally announced November 2023.

arXiv:2310.00970 [pdf, other]

EALM: Introducing Multidimensional Ethical Alignment in Conversational Information Retrieval

Authors: Yiyao Yu, Junjie Wang, Yuxiang Zhang, Lin Zhang, Yujiu Yang, Tetsuya Sakai

Abstract: Artificial intelligence (AI) technologies should adhere to human norms to better serve our society and avoid disseminating harmful or misleading information, particularly in Conversational Information Retrieval (CIR). Previous work, including approaches and datasets, has not always been successful or sufficiently robust in taking human norms into consideration. To this end, we introduce a workflow that integrates ethical alignment, with an initial ethical judgment stage for efficient data screening. To address the need for ethical judgment in CIR, we present the QA-ETHICS dataset, adapted from the ETHICS benchmark, which serves as an evaluation tool by unifying scenarios and label meanings. However, each scenario only considers one ethical concept. Therefore, we introduce the MP-ETHICS dataset to evaluate a scenario under multiple ethical concepts, such as justice and deontology. In addition, we suggest a new approach that achieves top performance in both binary and multi-label ethical judgment tasks. Our research provides a practical method for introducing ethical alignment into the CIR workflow. The data and code are available at https://github.com/wanng-ide/ealm.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2310.00410 [pdf, other]

doi 10.1145/3624918.3625338

Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores

Authors: Rikiya Takehi, Akihisa Watanabe, Tetsuya Sakai

Abstract: Existing dialogue quality evaluation systems can return a score for a given system turn from a particular viewpoint, e.g., engagingness. However, to improve dialogue systems by locating exactly where in a system turn potential problems lie, a more fine-grained evaluation may be necessary. We therefore propose an evaluation approach where a turn is decomposed into nuggets (i.e., expressions associated with a dialogue act), and nugget-level evaluation is enabled by leveraging an existing turn-level evaluation system. We demonstrate the potential effectiveness of our evaluation method through a case study.

Submitted 30 September, 2023; originally announced October 2023.

Journal ref: In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP '23), November 26-28, 2023, Beijing, China. ACM, New York, NY, USA, 6 pages

arXiv:2308.02926 [pdf, other]

Towards Consistency Filtering-Free Unsupervised Learning for Dense Retrieval

Authors: Haoxiang Shi, Sumio Fujita, Tetsuya Sakai

Abstract: Domain transfer is a prevalent challenge in modern neural Information Retrieval (IR). To overcome this problem, previous research has utilized domain-specific manual annotations and synthetic data produced by consistency filtering to finetune a general ranker and produce a domain-specific ranker. However, training such consistency filters is computationally expensive, which significantly reduces the model efficiency. In addition, consistency filtering often struggles to identify retrieval intentions and recognize query and corpus distributions in a target domain. In this study, we evaluate a more efficient solution: replacing the consistency filter with either direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation methods for achieving consistency filtering-free unsupervised dense retrieval. Our extensive experimental evaluations demonstrate that, on average, TextRank-based pseudo-relevance feedback outperforms other methods. Furthermore, we analyzed the training and inference efficiency of the proposed paradigm. The results indicate that filtering-free unsupervised learning can continuously improve training and inference efficiency while maintaining retrieval performance. In some cases, it can even improve performance on particular datasets.

Submitted 5 August, 2023; originally announced August 2023.

arXiv:2307.02936 [pdf, other]

A Meta-Evaluation of C/W/L/A Metrics: System Ranking Similarity, System Ranking Consistency and Discriminative Power

Authors: Nuo Chen, Tetsuya Sakai

Abstract: Recently, Moffat et al. proposed an analytic framework, namely C/W/L/A, for offline evaluation metrics. This framework allows information retrieval (IR) researchers to design evaluation metrics through the flexible combination of user browsing models and user gain aggregations. However, the statistical stability of C/W/L/A metrics with different aggregations has not yet been investigated. In this study, we investigate the statistical stability of C/W/L/A metrics from the perspective of: (1) the system ranking similarity among aggregations, (2) the system ranking consistency of aggregations and (3) the discriminative power of aggregations. More specifically, we combined various aggregation functions with the browsing models of Precision, Discounted Cumulative Gain (DCG), Rank-Biased Precision (RBP), INST, Average Precision (AP) and Expected Reciprocal Rank (ERR), examining their performance in terms of system ranking similarity, system ranking consistency and discriminative power on two offline test collections. Our experimental results suggest that, in terms of system ranking consistency and discriminative power, the aggregation function of expected rate of gain (ERG) performs outstandingly while the aggregation function of maximum relevance usually performs insufficiently. The results also suggest that Precision, DCG, RBP, INST and AP with their canonical aggregations all perform favourably in system ranking consistency and discriminative power; but for ERR, replacing its canonical aggregation with ERG can further strengthen the discriminative power while obtaining a system ranking list similar to the canonical version at the same time.

Submitted 5 August, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

arXiv:2305.08290 [pdf, ps, other]

SWAN: A Generic Framework for Auditing Textual Conversational Systems

Authors: Tetsuya Sakai

Abstract: We present a simple and generic framework for auditing a given textual conversational system, given some samples of its conversation sessions as its input. The framework computes a SWAN (Schematised Weighted Average Nugget) score based on nugget sequences extracted from the conversation sessions. Following the approaches of S-measure and U-measure, SWAN utilises nugget positions within the conversations to weight the nuggets based on a user model. We also present a schema of twenty (+1) criteria that may be worth incorporating in the SWAN framework. In our future work, we plan to devise conversation sampling methods that are suitable for the various criteria, construct seed user turns for comparing multiple systems, and validate specific instances of SWAN for the purpose of preventing negative impacts of conversational systems on users and society. This paper was written while preparing for the ICTIR 2023 keynote (to be given on July 23, 2023).

Submitted 14 May, 2023; originally announced May 2023.

Comments: 13 pages

arXiv:2305.06566 [pdf, other]

ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models

Authors: Qijiong Liu, Nuo Chen, Tetsuya Sakai, Xiao-Ming Wu

Abstract: Personalized content-based recommender systems have become indispensable tools for users to navigate through the vast amount of content available on platforms like daily news websites and book recommendation services. However, existing recommenders face significant challenges in understanding the content of items. Large language models (LLMs), which possess deep semantic comprehension and extensive knowledge from pretraining, have proven to be effective in various natural language processing tasks. In this study, we explore the potential of leveraging both open- and closed-source LLMs to enhance content-based recommendation. With open-source LLMs, we utilize their deep layers as content encoders, enriching the representation of content at the embedding level. For closed-source LLMs, we employ prompting techniques to enrich the training data at the token level. Through comprehensive experiments, we demonstrate the high effectiveness of both types of LLMs and show the synergistic relationship between them. Notably, we observed a significant relative improvement of up to 19.32% compared to existing state-of-the-art recommendation models. These findings highlight the immense potential of both open- and closed-source LLMs in enhancing content-based recommendation systems. We will make our code and LLM-generated data available for other researchers to reproduce our results.

Submitted 31 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

arXiv:2305.03970 [pdf, other]

NER-to-MRC: Named-Entity Recognition Completely Solving as Machine Reading Comprehension

Authors: Yuxiang Zhang, Junjie Wang, Xinyu Zhu, Tetsuya Sakai, Hayato Yamana

Abstract: Named-entity recognition (NER) detects texts with predefined semantic labels and is an essential building block for natural language processing (NLP). Notably, recent NER research focuses on utilizing massive extra data, including pre-training corpora and incorporating search engines. However, these methods suffer from high costs associated with data collection and pre-training, and from the additional training process required for the data retrieved from search engines. To address the above challenges, we completely frame NER as a machine reading comprehension (MRC) problem, called NER-to-MRC, by leveraging MRC with its ability to exploit existing data efficiently. Although several prior works have been dedicated to employing MRC-based solutions for tackling the NER problem, several challenges persist: i) the reliance on manually designed prompts; ii) the limited MRC approaches to data reconstruction, which fail to achieve performance on par with methods utilizing extensive additional data. Thus, our NER-to-MRC conversion consists of two components: i) transform the NER task into a form suitable for the model to solve with MRC in an efficient manner; ii) apply the MRC reasoning strategy to the model. We experiment on 6 benchmark datasets from three domains and achieve state-of-the-art performance without external data, with up to 11.24% improvement on the WNUT-16 dataset.

Submitted 6 May, 2023; originally announced May 2023.

arXiv:2301.03793 [pdf, other]

Estimation of User's World Model Using Graph2vec

Authors: Tatsuya Sakai, Takayuki Nagai

Abstract: To obtain advanced interaction between autonomous robots and users, robots should be able to distinguish their state space representations (i.e., world models). Herein, a novel method was proposed for estimating the user's world model based on queries. In this method, the agent learns the distributed representation of world models using graph2vec and generates concept activation vectors that represent the meaning of queries in the latent space. Experimental results revealed that the proposed method can estimate the user's world model more efficiently than the simple method of using the "AND" search of queries.

Submitted 10 January, 2023; originally announced January 2023.

arXiv:2211.00981 [pdf, other]

Relevance Assessments for Web Search Evaluation: Should We Randomise or Prioritise the Pooled Documents? (CORRECTED VERSION)

Authors: Tetsuya Sakai, Sijie Tao, Zhaohao Zeng

Abstract: In the context of depth-$k$ pooling for constructing web search test collections, we compare two approaches to ordering pooled documents for relevance assessors: the prioritisation strategy (PRI) used widely at NTCIR, and the simple randomisation strategy (RND). In order to address research questions regarding PRI and RND, we have constructed and released the WWW3E8 data set, which contains eight independent relevance labels for 32,375 topic-document pairs, i.e., a total of 259,000 labels. Four of the eight relevance labels were obtained from PRI-based pools; the other four were obtained from RND-based pools. Using WWW3E8, we compare PRI and RND in terms of inter-assessor agreement, system ranking agreement, and robustness to new systems that did not contribute to the pools. We also utilise an assessor activity log we obtained as a byproduct of WWW3E8 to compare the two strategies in terms of assessment efficiency.

Submitted 2 November, 2022; originally announced November 2022.

Comments: 30 pages. This is a corrected version of an open-access TOIS paper ( https://dl.acm.org/doi/pdf/10.1145/3494833 )
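
The pooling setup described in the abstract above can be sketched roughly as follows. This is an illustrative sketch only, not the NTCIRPOOL script: the function names, the toy runs, and in particular the prioritisation rule (more contributing runs first, then smaller sum of ranks) are assumptions made for the example. The last lines simply check the label count stated in the abstract.

```python
import random
from collections import defaultdict


def depth_k_pool(runs, k):
    """Depth-k pooling: union of the top-k documents of every run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


def order_rnd(pool, seed=0):
    """RND: present the pooled documents in a (seeded) random order."""
    docs = sorted(pool)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(docs)
    return docs


def order_pri(runs, k):
    """PRI (assumed rule): documents returned by more runs come first,
    ties broken by smaller sum of ranks, then by document ID."""
    votes = defaultdict(int)
    rank_sum = defaultdict(int)
    for ranking in runs.values():
        for rank, doc in enumerate(ranking[:k], start=1):
            votes[doc] += 1
            rank_sum[doc] += rank
    return sorted(votes, key=lambda d: (-votes[d], rank_sum[d], d))


# Hypothetical runs for a single topic.
runs = {
    "A": ["d1", "d2", "d3"],
    "B": ["d2", "d4", "d1"],
    "C": ["d2", "d5", "d6"],
}
pool = depth_k_pool(runs, k=2)
pri_order = order_pri(runs, k=2)
rnd_order = order_rnd(pool)

# Label count from the abstract: eight labels per topic-document pair.
total_labels = 32_375 * 8
```

Both orderings cover exactly the same pooled documents; only the order in which an assessor would see them differs.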

arXiv:2210.10266 [pdf, ps, other]

Corrected Evaluation Results of the NTCIR WWW-2, WWW-3, and WWW-4 English Subtasks

Authors: Tetsuya Sakai, Sijie Tao, Maria Maistro, Zhumin Chu, Yujing Li, Nuo Chen, Nicola Ferro, Junjie Wang, Ian Soboroff, Yiqun Liu

Abstract: Unfortunately, the official English (sub)task results reported in the NTCIR-14 WWW-2, NTCIR-15 WWW-3, and NTCIR-16 WWW-4 overview papers are incorrect due to noise in the official qrels files; this paper reports results based on the corrected qrels files. The noise is due to a fatal bug in the backend of our relevance assessment interface. More specifically, at WWW-2, WWW-3, and WWW-4, two versions of pool files were created for each English topic: a PRI ("prioritised") file, which uses the NTCIRPOOL script to prioritise likely relevant documents, and a RND ("randomised") file, which randomises the pooled documents. This was done for the purpose of studying the effect of document ordering for relevance assessors. However, the programmer who wrote the interface backend assumed that a combination of a topic ID and a document rank in the pool file uniquely determines a document ID; this is obviously incorrect as we have two versions of pool files. The outcome is that all the PRI-based relevance labels for the WWW-2 test collection are incorrect (while all the RND-based relevance labels are correct), and all the RND-based relevance labels for the WWW-3 and WWW-4 test collections are incorrect (while all the PRI-based relevance labels are correct). This bug was finally discovered at the NTCIR-16 WWW-4 task when the first seven authors of this paper served as Gold assessors (i.e., topic creators who define what is relevant) and closely examined the disagreements with Bronze assessors (i.e., non-topic-creators; non-experts). We would like to apologise to the WWW participants and the NTCIR chairs for the inconvenience and confusion caused due to this bug.

Submitted 18 October, 2022; originally announced October 2022.

Comments: 24 pages
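
The class of bug described in the abstract above (assuming that a topic ID plus a pool rank identifies a document when two pool files exist per topic) can be illustrated with a small sketch. Everything here is hypothetical: the abstract does not describe the backend's actual data structures, so the dictionaries, function names, topic ID, and document IDs are assumptions chosen only to show how such a key collision mislabels documents.

```python
def build_lookup_buggy(pools):
    """Map (topic_id, rank) -> doc_id, ignoring the pool version.
    Entries from the later pool file silently overwrite the earlier ones."""
    lookup = {}
    for (topic_id, _version), docs in pools.items():
        for rank, doc_id in enumerate(docs, start=1):
            lookup[(topic_id, rank)] = doc_id
    return lookup


def build_lookup_fixed(pools):
    """Map (topic_id, version, rank) -> doc_id, so each pool file keeps its own ranks."""
    lookup = {}
    for (topic_id, version), docs in pools.items():
        for rank, doc_id in enumerate(docs, start=1):
            lookup[(topic_id, version, rank)] = doc_id
    return lookup


# Two hypothetical orderings of the same pooled documents for one topic.
pools = {
    ("0001", "PRI"): ["d3", "d1", "d4", "d2"],
    ("0001", "RND"): ["d2", "d4", "d1", "d3"],
}
buggy = build_lookup_buggy(pools)
fixed = build_lookup_fixed(pools)

# An assessor working through the PRI file judges the document at rank 1 (d3),
# but the buggy lookup attributes that label to whatever RND has at rank 1.
judged_doc = fixed[("0001", "PRI", 1)]
recorded_doc = buggy[("0001", 1)]

# Count how many PRI ranks resolve to the wrong document under the buggy key.
mislabelled = sum(
    1
    for rank, doc_id in enumerate(pools[("0001", "PRI")], start=1)
    if buggy[("0001", rank)] != doc_id
)
```

In this toy case every PRI rank resolves to a different document, which mirrors how one pool version's labels can be entirely wrong while the other's remain correct.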

arXiv:2210.08590 [pdf, other]

Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective

Authors: Ping Yang, Junjie Wang, Ruyi Gan, Xinyu Zhu, Lin Zhang, Ziwei Wu, Xinyu Gao, Jiaxing Zhang, Tetsuya Sakai

Abstract: We propose a new paradigm for zero-shot learners that is format agnostic, i.e., it is compatible with any format and applicable to a list of language tasks, such as text classification, commonsense reasoning, coreference resolution, and sentiment analysis. Zero-shot learning aims to train a model on a given task such that it can address new learning tasks without any additional training. Our approach converts zero-shot learning into multiple-choice tasks, avoiding problems in commonly used large-scale generative models such as FLAN. It not only adds generalization ability to models but also significantly reduces the number of parameters. Our method shares the merits of efficient training and deployment. Our approach shows state-of-the-art performance on several benchmarks and produces satisfactory results on tasks such as natural language inference and text classification. Our model achieves this success with only 235M parameters, which is substantially smaller than state-of-the-art models with billions of parameters. The code and pre-trained models are available at https://github.com/IDEA-CCNL/Fengshenbang-LM.

Submitted 18 October, 2022; v1 submitted 16 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2210.05335 [pdf, other]

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

Authors: Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang

Abstract: Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our interpretation, including inter- and intra-modal uncertainty. Little effort has studied the modeling of this uncertainty, particularly in pre-training on unlabeled datasets and fine-tuning in task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.

Submitted 20 July, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: CVPR 2023 Main Track Long Paper

arXiv:2204.07304 [pdf, ps, other]

On Variants of Root Normalised Order-aware Divergence and a Divergence based on Kendall's Tau

Authors: Tetsuya Sakai

Abstract: This paper reports on a follow-up study of the work reported in Sakai, which explored suitable evaluation measures for ordinal quantification tasks. More specifically, the present study defines and evaluates, in addition to the quantification measures considered earlier, a few variants of an ordinal quantification measure called Root Normalised Order-aware Divergence (RNOD), as well as a measure which we call Divergence based on Kendall's $τ$ (DNKT). The RNOD variants represent alternative design choices based on the idea of Sakai's Distance-Weighted sum of squares (DW), while DNKT is designed to ensure that the system's estimated distribution over classes is faithful to the target priorities over classes. As this Priority Preserving Property (PPP) of DNKT may be useful in some applications, we also consider combining some of the existing quantification measures with DNKT. Our experiments with eight ordinal quantification data sets suggest that the variants of RNOD do not offer any benefit over the original RNOD at least in terms of system ranking consistency, i.e., robustness of the system ranking to the choice of test data. Of all ordinal quantification measures considered in this study (including Normalised Match Distance, a.k.a. Earth Mover's Distance), RNOD is the most robust measure overall. Hence the design choice of RNOD is a good one from this viewpoint. Also, DNKT is the worst performer in terms of system ranking consistency. Hence, if DNKT seems appropriate for a task, sample size design should take its statistical instability into account.

Submitted 14 April, 2022; originally announced April 2022.

arXiv:2204.00280 [pdf, other]

A Versatile Framework for Evaluating Ranked Lists in terms of Group Fairness and Relevance

Authors: Tetsuya Sakai, Jin Young Kim, Inho Kang

Abstract: We present a simple and versatile framework for evaluating ranked lists in terms of group fairness and relevance, where the groups (i.e., possible attribute values) can be either nominal or ordinal in nature. First, we demonstrate that, if the attribute set is binary, our framework can easily quantify the overall polarity of each ranked list. Second, by utilising an existing diversified search test collection and treating each intent as an attribute value, we demonstrate that our framework can handle soft group membership, and that our group fairness measures are highly correlated with both adhoc IR and diversified IR measures under this setting. Third, we demonstrate how our framework can quantify intersectional group fairness based on multiple attribute sets. We also show that the similarity function for comparing the achieved and target distributions over the attribute values should be chosen carefully.

Submitted 1 April, 2022; originally announced April 2022.

arXiv:2203.16062 [pdf, other]

AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval

Authors: Riku Togashi, Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkila, Tetsuya Sakai

Abstract: Evaluation measures have a crucial impact on the direction of research. Therefore, it is of utmost importance to develop appropriate and reliable evaluation measures for new applications where conventional measures are not well suited. Video Moment Retrieval (VMR) is one such application, and the current practice is to use R@$K,θ$ for evaluating VMR systems. However, this measure has two disadvantages. First, it is rank-insensitive: it ignores the rank positions of successfully localised moments in the top-$K$ ranked list by treating the list as a set. Second, it binarizes the Intersection over Union (IoU) of each retrieved video moment using the threshold $θ$, thereby ignoring the fine-grained localisation quality of ranked moments. We propose an alternative measure for evaluating VMR, called Average Max IoU (AxIoU), which is free from the above two problems. We show that AxIoU satisfies two important axioms for VMR evaluation, namely, Invariance against Redundant Moments and Monotonicity with respect to the Best Moment, and also that R@$K,θ$ satisfies the first axiom only. We also empirically examine how AxIoU agrees with R@$K,θ$, as well as its stability with respect to change in the test data and human-annotated temporal boundaries.

Submitted 30 March, 2022; originally announced March 2022.

Comments: Accepted by CVPR2022
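
Illustration (not part of the arXiv record): the AxIoU abstract above contrasts a thresholded, set-based measure with a rank-sensitive one. The sketch below is a minimal reading of that contrast for a single query with one ground-truth moment. The function names are hypothetical, and `average_max_iou` assumes that "Average Max IoU" means the mean, over cutoffs k = 1..K, of the best IoU found in the top k; the paper's exact definition may differ.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Moment, truth: Moment) -> float:
    """Intersection over Union of two time intervals."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k_theta(ranked: List[Moment], truth: Moment, k: int, theta: float) -> float:
    """R@K,theta for one query: 1 if any top-K moment reaches IoU >= theta.

    The top-K list is treated as a set, so rank order inside it is ignored.
    """
    return 1.0 if any(temporal_iou(m, truth) >= theta for m in ranked[:k]) else 0.0


def average_max_iou(ranked: List[Moment], truth: Moment, k: int) -> float:
    """Assumed form of AxIoU@K: mean over cutoffs 1..K of the best IoU so far."""
    best, total = 0.0, 0.0
    for i in range(k):
        if i < len(ranked):
            best = max(best, temporal_iou(ranked[i], truth))
        total += best
    return total / k


truth = (0.0, 10.0)
good_last = [(20.0, 30.0), (0.0, 10.0)]   # perfect moment at rank 2
good_first = [(0.0, 10.0), (20.0, 30.0)]  # perfect moment at rank 1
```

Under these assumptions, both rankings receive the same R@2,0.5, while the averaged-max variant rewards placing the well-localised moment first.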

arXiv:2108.05995 [pdf]

Screenline-based Two-step Calibration and its application to an agent-based urban freight simulator

Authors: Yusuke Hara, Takanori Sakai, André Romano Alho, Moshe Ben-Akiva

Abstract: Calibration is an essential process to make an agent-based simulator operational. Especially, the calibration for freight demand is challenging due to the model complexity and the shortage of available freight demand data compared with passenger data. This paper proposes a novel calibration method that relies solely on screenline counts, named Screenline-based Two-step Calibration (SLTC). SLTC consists of two parts: (1) tour-based demand adjustment and (2) model parameter updates. The former generates screenline-based tours by cloning/removing instances of the simulated goods vehicle tours, aiming to minimize the gaps between the observed and the simulated screenline counts. The latter updates the parameters of the commodity flow model which generates inputs to simulate goods vehicle tours. To demonstrate the practicality of the proposed method, we apply it to an agent-based urban freight simulator, SimMobility Freight. The result shows that SLTC allows the simulator to replicate the observed screenline counts with reasonable computational cost for calibration.

Submitted 12 August, 2021; originally announced August 2021.

arXiv:2106.10923 [pdf, other]

Unsupervised Deep Learning by Injecting Low-Rank and Sparse Priors

Authors: Tomoya Sakai

Abstract: What if deep neural networks can learn from sparsity-inducing priors? When the networks are designed by combining layer modules (CNN, RNN, etc.), engineers make little use of inductive bias, i.e., existing well-known rules or prior knowledge, other than annotated training data sets. We focus on employing sparsity-inducing priors in deep learning to encourage the network to concisely capture the nature of high-dimensional data in an unsupervised way. In order to use non-differentiable sparsity-inducing norms as loss functions, we plug their proximal mappings into the automatic differentiation framework. We demonstrate unsupervised learning of U-Net for background subtraction using low-rank and sparse priors. The U-Net can learn moving objects in a training sequence without any annotation, and successfully detect the foreground objects in test sequences.

Submitted 21 June, 2021; originally announced June 2021.
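
Illustration (not part of the arXiv record): the abstract above refers to proximal mappings of non-differentiable sparsity-inducing norms. As a minimal sketch, assuming the sparse prior is an l1 norm, its proximal mapping is the well-known soft-thresholding operator shown below. The function names are hypothetical, and how the paper wires such operators into automatic differentiation is not specified in the abstract and is not shown here.

```python
from typing import List


def soft_threshold(x: float, lam: float) -> float:
    """Proximal mapping of lam * |x| for a scalar: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0  # entries with magnitude at most lam are set exactly to zero


def prox_l1(vec: List[float], lam: float) -> List[float]:
    """Proximal mapping of lam * ||vec||_1, applied element-wise."""
    return [soft_threshold(v, lam) for v in vec]


shrunk = prox_l1([3.0, -0.5, -2.0], 1.0)
```

Small entries are zeroed and large ones are shrunk, which is what makes the operator sparsity-inducing.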

arXiv:2105.04769 [pdf, other]

doi 10.1145/3404835.3462933

Scalable Personalised Item Ranking through Parametric Density Estimation

Authors: Riku Togashi, Masahiro Kato, Mayu Otani, Tetsuya Sakai, Shin'ichi Satoh

Abstract: Learning from implicit feedback is challenging because of the difficult nature of the one-class problem: we can observe only positive examples. Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem. However, such methods have two main drawbacks, particularly in large-scale applications: (1) the pairwise approach is severely inefficient due to the quadratic computational cost; and (2) even recent model-based samplers (e.g. IRGAN) cannot achieve practical efficiency due to the training of an extra model. In this paper, we propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart while performing similarly to the pairwise counterpart in terms of ranking effectiveness. Our approach estimates the probability densities of positive items for each user within a rich class of distributions, viz. exponential family. In our formulation, we derive a loss function and the appropriate negative sampling distribution based on maximum likelihood estimation. We also develop a practical technique for risk approximation and a regularisation scheme. We then discuss that our single-model approach is equivalent to an IRGAN variant under a certain condition. Through experiments on real-world datasets, our approach outperforms the pointwise and pairwise counterparts in terms of effectiveness and efficiency.

Submitted 10 May, 2021; originally announced May 2021.

Comments: Accepted by SIGIR'21

arXiv:2105.02670 [pdf, ps, other]

A Framework of Explanation Generation toward Reliable Autonomous Robots

Authors: Tatsuya Sakai, Kazuki Miyazawa, Takato Horii, Takayuki Nagai

Abstract: To realize autonomous collaborative robots, it is important to increase the trust that users have in them. Toward this goal, this paper proposes an algorithm which endows an autonomous agent with the ability to explain the transition from the current state to the target state in a Markov decision process (MDP). According to cognitive science, to generate an explanation that is acceptable to humans, it is important to present the minimum information necessary to sufficiently understand an event. To meet this requirement, this study proposes a framework for identifying important elements in the decision-making process using a prediction model for the world and generating explanations based on these elements. To verify the ability of the proposed method to generate explanations, we conducted an experiment using a grid environment. It was inferred from the result of a simulation experiment that the explanation generated using the proposed method was composed of the minimum elements important for understanding the transition from the current state to the target state. Furthermore, subject experiments showed that the generated explanation was a good summary of the process of state transition, and that a high evaluation was obtained for the explanation of the reason for an action.

Submitted 6 May, 2021; originally announced May 2021.

arXiv:2105.02658 [pdf, ps, other]

Explainable Autonomous Robots: A Survey and Perspective

Authors: Tatsuya Sakai, Takayuki Nagai

Abstract: Advanced communication protocols are critical to enable the coexistence of autonomous robots with humans. Thus, the development of explanatory capabilities is an urgent first step toward autonomous robots. This survey provides an overview of the various types of "explainability" discussed in machine learning research. Then, we discuss the definition of "explainability" in the context of autonomous robots (i.e., explainable autonomous robots) by exploring the question "what is an explanation?" We further conduct a research survey based on this definition and present some relevant topics for future research.

Submitted 6 May, 2021; originally announced May 2021.

arXiv:2104.08755 [pdf, other]

DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels

Authors: Zhaohao Zeng, Tetsuya Sakai

Abstract: We introduce a data set called DCH-2, which contains 4,390 real customer-helpdesk dialogues in Chinese and their English translations. DCH-2 also contains dialogue-level annotations and turn-level annotations obtained independently from either 19 or 20 annotators. The data set was built through our effort as organisers of the NTCIR-14 Short Text Conversation and NTCIR-15 Dialogue Evaluation tasks, to help researchers understand what constitutes an effective customer-helpdesk dialogue, and thereby build efficient and helpful helpdesk systems that are available to customers at all times. In addition, DCH-2 may be utilised for other purposes, for example, as a repository for retrieval-based dialogue systems, or as a parallel corpus for machine translation in the helpdesk domain.

Submitted 30 May, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: 6 pages, 3 figures

arXiv:2101.06233 [pdf, other]

Predictive Optimization with Zero-Shot Domain Adaptation

Authors: Tomoya Sakai, Naoto Ohsaka

Abstract: Prediction in a new domain without any training sample, called zero-shot domain adaptation (ZSDA), is an important task in domain adaptation. While prediction in a new domain has gained much attention in recent years, in this paper, we investigate another potential of ZSDA. Specifically, instead of predicting responses in a new domain, we find a description of a new domain given a prediction. The task is regarded as predictive optimization, but existing predictive optimization methods have not been extended to handling multiple domains. We propose a simple framework for predictive optimization with ZSDA and analyze the condition in which the optimization problem becomes convex optimization. We also discuss how to handle the interaction of characteristics of a domain in predictive optimization. Through numerical experiments, we demonstrate the potential usefulness of our proposed framework.

Submitted 15 January, 2021; originally announced January 2021.

Comments: SDM2021. Full version including appendix

arXiv:2010.13447 [pdf, other]

doi 10.1145/3397271.3401036

How to Measure the Reproducibility of System-oriented IR Experiments

Authors: Timo Breuer, Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, Philipp Schaer, Ian Soboroff

Abstract: Replicability and reproducibility of experimental results are primary concerns in all the areas of science and IR is not an exception. Besides the problem of moving the field towards more reproducible experimental practices and protocols, we also face a severe methodological issue: we do not have any means to assess when reproduced is reproduced. Moreover, we lack any reproducibility-oriented dataset, which would allow us to develop such methods. To address these issues, we compare several measures to objectively quantify to what extent we have replicated or reproduced a system-oriented IR experiment. These measures operate at different levels of granularity, from the fine-grained comparison of ranked lists, to the more general comparison of the obtained effects and significant differences. Moreover, we also develop a reproducibility-oriented dataset, which allows us to validate our measures and which can also be used to develop future measures.

Submitted 26 October, 2020; originally announced October 2020.

Comments: SIGIR2020 Full Conference Paper

arXiv:2010.11585 [pdf]

doi 10.3390/futuretransp1030034

A simulation-based evaluation of a Cargo-Hitching service for E-commerce using mobility-on-demand vehicles

Authors: Andre Alho, Takanori Sakai, Simon Oh, Cheng Cheng, Ravi Seshadri, Wen Han Chong, Yusuke Hara, Julia Caravias, Lynette Cheah, Moshe Ben-Akiva

Abstract: Time-sensitive parcel deliveries, shipments requested for delivery in a day or less, are an increasingly important research subject. It is challenging to deal with these deliveries from a carrier perspective since it entails additional planning constraints, preventing an efficient consolidation of deliveries which is possible when demand is well known in advance. Furthermore, such time-sensitive deliveries are requested to a wider spatial scope than retail centers, including homes and offices. Therefore, an increase in such deliveries is considered to exacerbate negative externalities such as congestion and emissions. One of the solutions is to leverage spare capacity in passenger transport modes. This concept is often denominated as cargo-hitching. While there are various possible system designs, it is crucial that such a solution does not deteriorate the quality of service of passenger trips. This research aims to evaluate the use of Mobility-On-Demand services to perform same-day parcel deliveries. For this purpose, we use SimMobility, a high-resolution agent-based simulation platform of passenger and freight flows, applied in Singapore. E-commerce demand carrier data are used to characterize simulated parcel delivery demand. Operational scenarios that aim to minimize the adverse effect of fulfilling deliveries with Mobility-On-Demand vehicles on Mobility-On-Demand passenger flows (fulfillment, wait and travel times) are explored. Results indicate that the Mobility-On-Demand services have potential to fulfill a considerable amount of parcel deliveries and decrease freight vehicle traffic and total vehicle-kilometers-travelled without compromising the quality of Mobility-On-Demand for passenger travel.

Submitted 22 October, 2020; originally announced October 2020.

Comments: 19 pages, 4 tables, 7 figures. Submitted to Transportation (Springer)

Journal ref: Future Transp. 2021, 1, 639-656

arXiv:2006.05616 [pdf, other]

Regret Minimization for Causal Inference on Large Treatment Space

Authors: Akira Tanimoto, Tomoya Sakai, Takashi Takenouchi, Hisashi Kashima

Abstract: Predicting which action (treatment) will lead to a better outcome is a central task in decision support systems. To build a prediction model in real situations, learning from biased observational data is a critical issue due to the lack of randomized controlled trial (RCT) data. To handle such biased observational data, recent efforts in causal inference and counterfactual machine learning have focused on debiased estimation of the potential outcomes on a binary action space and the difference between them, namely, the individual treatment effect. When it comes to a large action space (e.g., selecting an appropriate combination of medicines for a patient), however, the regression accuracy of the potential outcomes is no longer sufficient in practical terms to achieve a good decision-making performance. This is because the mean accuracy on the large action space does not guarantee the nonexistence of a single potential outcome misestimation that might mislead the whole decision. Our proposed loss minimizes a classification error of whether or not the action is relatively good for the individual target among all feasible actions, which further improves the decision-making performance, as we prove. We also propose a network architecture and a regularizer that extracts a debiased representation not only from the individual feature but also from the biased action for better generalization in large action spaces. Extensive experiments on synthetic and semi-synthetic datasets demonstrate the superiority of our method for large combinatorial action spaces.

Submitted 9 June, 2020; originally announced June 2020.

arXiv:2003.04345 [pdf, other]

A Parallelizable Energy-Preserving Integrator MB4 and Its Application to Quantum-Mechanical Wavepacket Dynamics

Authors: Tsubasa Sakai, Shuhei Kudo, Hiroto Imachi, Yuto Miyatake, Takeo Hoshi, Yusaku Yamamoto

Abstract: In simulating physical systems, conservation of the total energy is often essential, especially when energy conversion between different forms of energy occurs frequently. Recently, a new fourth order energy-preserving integrator named MB4 was proposed based on the so-called continuous stage Runge-Kutta methods (Y. Miyatake and J. C. Butcher, SIAM J. Numer. Anal., 54(3), 1993-2013). A salient feature of this method is that it is parallelizable, which makes its computational time for one time step comparable to that of second order methods. In this paper, we illustrate how to apply the MB4 method to a concrete ordinary differential equation using the nonlinear Schrödinger-type equation on a two-dimensional grid as an example. This system is a prototypical model of two-dimensional disordered organic material and is difficult to solve with standard methods like the classical Runge-Kutta methods due to the nonlinearity and the $δ$-function like potential coming from defects. Numerical tests show that the method can solve the equation stably and preserves the total energy to 16-digit accuracy throughout the simulation. It is also shown that parallelization of the method yields up to 2.8 times speedup using 3 computational nodes.

Submitted 9 March, 2020; originally announced March 2020.

arXiv:2002.08709 [pdf, other]

Do We Need Zero Training Loss After Achieving Zero Training Error?

Authors: Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, Masashi Sugiyama

Abstract: Overparameterized deep networks have the capacity to memorize training data with zero training error. Even after memorization, the training loss continues to approach zero, making the model overconfident and the test performance degraded. Since existing regularizers do not directly aim to avoid zero training loss, it is hard to tune their hyperparameters in order to maintain a fixed/preset level of training loss. We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value, which we call the flood level. Our approach makes the loss float around the flood level by doing mini-batched gradient descent as usual but gradient ascent if the training loss is below the flood level. This can be implemented with one line of code and is compatible with any stochastic optimizer and other regularizers. With flooding, the model will continue to "random walk" with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization. We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.

Submitted 31 March, 2021; v1 submitted 20 February, 2020; originally announced February 2020.

Comments: ICML 2020 camera ready version
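
Illustration (not part of the arXiv record): the flooding abstract above describes gradient descent above the flood level and gradient ascent below it. A minimal sketch of one way to express that is to wrap the mini-batch loss as |loss - b| + b, where b is the flood level. The function names are hypothetical and plain floats stand in for a framework's loss tensor; this is an assumed reading of the "one line of code", not a copy of the authors' implementation.

```python
def flooded_loss(loss: float, flood_level: float) -> float:
    """Reflect the loss around the flood level: |loss - b| + b.

    Above b the value is unchanged; below b it rises again as the raw loss
    shrinks, so minimising it pushes the raw loss back up toward b.
    """
    return abs(loss - flood_level) + flood_level


def flooded_gradient(grad: float, loss: float, flood_level: float) -> float:
    """Gradient of flooded_loss given the gradient of the raw loss.

    Same sign above the flood level (descent), flipped sign below it (ascent),
    and zero exactly at the flood level.
    """
    sign = (loss > flood_level) - (loss < flood_level)
    return sign * grad


above = flooded_loss(0.5, 0.25)  # raw loss above the flood level
below = flooded_loss(0.1, 0.25)  # raw loss below the flood level
```

In an automatic differentiation framework, only the wrapped loss would need to be written; the sign flip in `flooded_gradient` then follows from differentiating the absolute value.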

arXiv:2002.03582 [pdf, other]

Different Types of Voice User Interface Failures May Cause Different Degrees of Frustration

Authors: Shiyoh Goetsu, Tetsuya Sakai

Abstract: We report on an investigation into how different types of failures in a voice user interface (VUI) affect user frustration. To this end, we conducted a pilot user study ($n=10$) and a main user study ($n=30$), both with a simple voice-operated calendar application that we built using the Alexa Skills Kit. In our pilot study, we identified three major failure types as perceived by the users, namely, Reason Unknown, Speech Misrecognition, and Utterance Pattern Match Failure, along with more fine-grained failure types from the developer's viewpoint such as Intent Pattern Match Failure and Intent Misclassification. Then, in our main study, we set up three user tasks that were designed to each induce a specific failure type, and collected user frustration ratings for each task. Our main findings are: (a) Users may be relatively tolerant to user-perceived Speech Misrecognition, and not so to user-perceived Reason Unknown and Utterance Pattern Match Failures; (b) Regarding the relationship between developer-perceived and user-perceived failure types, 68.8% of developer-perceived Intent Misclassification instances caused user-perceived Reason Unknown failures. From (a) and (b), a practical design implication would be to try to prevent Intent Misclassification from happening by carefully crafting the utterance patterns for each intent.

Submitted 10 February, 2020; originally announced February 2020.

Comments: 5 pages; 1 figure

arXiv:1910.08280 [pdf, other]

Robust modal regression with direct log-density derivative estimation

Authors: Hiroaki Sasaki, Tomoya Sakai, Takafumi Kanamori

Abstract: Modal regression is aimed at estimating the global mode (i.e., global maximum) of the conditional density function of the output variable given input variables, and has led to regression methods robust against heavy-tailed or skewed noises. The conditional mode is often estimated through maximization of the modal regression risk (MRR). In order to apply a gradient method for the maximization, the fundamental challenge is accurate approximation of the gradient of MRR, not MRR itself. To overcome this challenge, in this paper, we take a novel approach of directly approximating the gradient of MRR. To approximate the gradient, we develop kernelized and neural-network-based versions of the least-squares log-density derivative estimator, which directly approximates the derivative of the log-density without density estimation. With direct approximation of the MRR gradient, we first propose a modal regression method with kernels, and derive a new parameter update rule based on a fixed-point method. Then, the derived update rule is theoretically proved to have a monotonic hill-climbing property towards the conditional mode. Furthermore, we indicate that our approach of directly approximating the gradient is compatible with recent sophisticated stochastic gradient methods (e.g., Adam), and then propose another modal regression method based on neural networks. Finally, the superior performance of the proposed methods is demonstrated on various artificial and benchmark datasets.

Submitted 18 October, 2019; originally announced October 2019.

arXiv:1905.01799 [pdf, other]

RSL19BD at DBDC4: Ensemble of Decision Tree-based and LSTM-based Models

Authors: Chih-Hao Wang, Sosuke Kato, Tetsuya Sakai

Abstract: RSL19BD (Waseda University Sakai Laboratory) participated in the Fourth Dialogue Breakdown Detection Challenge (DBDC4) and submitted five runs to both English and Japanese subtasks. In these runs, we utilise the Decision Tree-based model and the Long Short-Term Memory-based (LSTM-based) model following the approaches of RSL17BD and KTH in the Third Dialogue Breakdown Detection Challenge (DBDC3) respectively. The Decision Tree-based model follows the approach of RSL17BD but utilises RandomForestRegressor instead of ExtraTreesRegressor. In addition, instead of predicting the mean and the variance of the probability distribution of the three breakdown labels, it predicts the probability of each label directly. The LSTM-based model follows the approach of KTH with some changes in the architecture and utilises Convolutional Neural Network (CNN) to perform text feature extraction. In addition, instead of targeting the single breakdown label and minimising the categorical cross entropy loss, it targets the probability distribution of the three breakdown labels and minimises the mean squared error. Run 1 utilises a Decision Tree-based model; Run 2 utilises an LSTM-based model; Run 3 performs an ensemble of 5 LSTM-based models; Run 4 performs an ensemble of Run 1 and Run 2; Run 5 performs an ensemble of Run 1 and Run 3. Run 5 statistically significantly outperformed all other runs in terms of MSE (NB, PB, B) for the English data and all other runs except Run 4 in terms of MSE (NB, PB, B) for the Japanese data (alpha level = 0.05).

Submitted 18 November, 2019; v1 submitted 5 May, 2019; originally announced May 2019.

Comments: 21 pages, 7 figures, Proceedings of Chatbots and Conversational Agents and Dialogue Breakdown Detection Challenge (WOCHAT+DBDC), IWSDS 2019; proceedings updated

arXiv:1903.11272 [pdf, ps, other]

Graded Relevance Assessments and Graded Relevance Measures of NTCIR: A Survey of the First Twenty Years

Authors: Tetsuya Sakai

Abstract: NTCIR was the first large-scale IR evaluation conference to construct test collections with graded relevance assessments: the NTCIR-1 test collections from 1998 already featured relevant and partially relevant documents. In this paper, I first describe a few graded-relevance measures that originated from NTCIR (and a few variants) which are used across different NTCIR tasks. I then provide a survey on the use of graded relevance assessments and of graded relevance measures in the past NTCIR tasks which primarily tackled ranked retrieval. My survey shows that the majority of the past tasks fully utilised graded relevance by means of graded evaluation measures, but not all of them; interestingly, even a few relatively recent tasks chose to adhere to binary relevance measures. I conclude this paper by a summary of my survey in table form, and a brief discussion on what may lie beyond graded relevance.

Submitted 27 March, 2019; originally announced March 2019.

Comments: 31 pages; full length version of a book chapter (Evaluating Information Retrieval and Access Tasks: NTCIR's Legacy of Research Impact)

arXiv:1803.04663 [pdf, ps, other]

Binary Matrix Completion Using Unobserved Entries

Authors: Masayoshi Hayashi, Tomoya Sakai, Masashi Sugiyama

Abstract: A matrix completion problem, which aims to recover a complete matrix from its partial observations, is one of the important problems in the machine learning field and has been studied actively. However, there is a discrepancy between the mainstream problem setting, which assumes continuous-valued observations, and some practical applications such as recommendation systems and SNS link predictions where observations take discrete or even binary values. To cope with this problem, Davenport et al. (2014) proposed a binary matrix completion (BMC) problem, where observations are quantized into binary values. Hsieh et al. (2015) proposed a PU (Positive and Unlabeled) matrix completion problem, which is an extension of the BMC problem. This problem targets the setting where we cannot observe negative values, such as SNS link predictions. In the construction of their method for this setting, they introduced a methodology of the classification problem, regarding each matrix entry as a sample. Their risk, which defines losses over unobserved entries as well, indicates the possibility of the use of unobserved entries. In this paper, motivated by a semi-supervised classification method recently proposed by Sakai et al. (2017), we develop a method for the BMC problem which can use all of positive, negative, and unobserved entries, by combining the risks of Davenport et al. (2014) and Hsieh et al. (2015). To the best of our knowledge, this is the first BMC method which exploits all kinds of matrix entries. We experimentally show that an appropriate mixture of risks improves the performance.

Submitted 13 March, 2018; originally announced March 2018.

arXiv:1710.05359 [pdf, other]

doi 10.1162/neco_a_01337

Information-Theoretic Representation Learning for Positive-Unlabeled Classification

Authors: Tomoya Sakai, Gang Niu, Masashi Sugiyama

Abstract: Recent advances in weakly supervised classification allow us to train a classifier only from positive and unlabeled (PU) data. However, existing PU classification methods typically require an accurate estimate of the class-prior probability, which is a critical bottleneck particularly for high-dimensional data. This problem has been commonly addressed by applying principal component analysis in advance, but such unsupervised dimension reduction can collapse underlying class structure. In this paper, we propose a novel representation learning method from PU data based on the information-maximization principle. Our method does not require class-prior estimation and thus can be used as a preprocessing method for PU classification. Through experiments, we demonstrate that our method combined with deep neural networks highly improves the accuracy of PU class-prior estimation, leading to state-of-the-art PU classification performance.

Submitted 18 June, 2022; v1 submitted 15 October, 2017; originally announced October 2017.

Journal ref: Neural Computation (2021) 33 (1) 244-268

arXiv:1708.01321 [pdf, other]

doi 10.1016/j.comgeo.2014.09.004

On balanced 4-holes in bichromatic point sets

Authors: S. Bereg, J. M. Díaz-Báñez, R. Fabila-Monroy, P. Pérez-Lantero, A. Ramírez-Vigueras, T. Sakai, J. Urrutia, I. Ventura

Abstract: Let $S=R\cup B$ be a point set in the plane in general position such that each of its elements is colored either red or blue, where $R$ and $B$ denote the points colored red and the points colored blue, respectively. A quadrilateral with vertices in $S$ is called a $4$-hole if its interior is empty of elements of $S$. We say that a $4$-hole of $S$ is balanced if it has $2$ red and $2$ blue points of $S$ as vertices. In this paper, we prove that if $R$ and $B$ contain $n$ points each then $S$ has at least $\frac{n^2-4n}{12}$ balanced $4$-holes, and this bound is tight up to a constant factor. Since there are two-colored point sets with no balanced convex $4$-holes, we further provide a characterization of the two-colored point sets having this type of $4$-holes.

Submitted 3 August, 2017; originally announced August 2017.

Comments: this is an arxiv version of our paper

Journal ref: Computational Geometry: Theory and Applications, 48 (3): 169-179 (2015)

arXiv:1705.01708 [pdf, other]

doi 10.1007/s10994-017-5678-9

Semi-Supervised AUC Optimization based on Positive-Unlabeled Learning

Authors: Tomoya Sakai, Gang Niu, Masashi Sugiyama

Abstract: Maximizing the area under the receiver operating characteristic curve (AUC) is a standard approach to imbalanced classification. So far, various supervised AUC optimization methods have been developed and they are also extended to semi-supervised scenarios to cope with small sample problems. However, existing semi-supervised AUC optimization methods rely on strong distributional assumptions, which are rarely satisfied in real-world problems. In this paper, we propose a novel semi-supervised AUC optimization method that does not require such restrictive assumptions. We first develop an AUC optimization method based only on positive and unlabeled data (PU-AUC) and then extend it to semi-supervised learning by combining it with a supervised AUC optimization method. We theoretically prove that, without the restrictive distributional assumptions, unlabeled data contribute to improving the generalization performance in PU and semi-supervised AUC optimization methods. Finally, we demonstrate the practical usefulness of the proposed methods through experiments.

Submitted 11 April, 2022; v1 submitted 4 May, 2017; originally announced May 2017.

Comments: Fixed typos in Appendix

arXiv:1704.06767 [pdf, other]

Convex Formulation of Multiple Instance Learning from Positive and Unlabeled Bags

Authors: Han Bao, Tomoya Sakai, Issei Sato, Masashi Sugiyama

Abstract: Multiple instance learning (MIL) is a variation of traditional supervised learning problems where data (referred to as bags) are composed of sub-elements (referred to as instances) and only bag labels are available. MIL has a variety of applications such as content-based image retrieval, text categorization and medical diagnosis. Most previous work on MIL assumes that the training bags are fully labeled. However, it is often difficult to obtain a sufficient number of labeled bags in practical situations, while many unlabeled bags are available. A learning framework called PU learning (positive and unlabeled learning) can address this problem. In this paper, we propose a convex PU learning method to solve an MIL problem. We experimentally show that the proposed method achieves better performance with significantly lower computational costs than an existing method for PU-MIL.

Submitted 1 May, 2018; v1 submitted 22 April, 2017; originally announced April 2017.

arXiv:1605.06955 [pdf, other]

Semi-Supervised Classification Based on Classification from Positive and Unlabeled Data

Authors: Tomoya Sakai, Marthinus Christoffel du Plessis, Gang Niu, Masashi Sugiyama

Abstract: Most of the semi-supervised classification methods developed so far use unlabeled data for regularization purposes under particular distributional assumptions such as the cluster assumption. In contrast, recently developed methods of classification from positive and unlabeled data (PU classification) use unlabeled data for risk evaluation, i.e., label information is directly extracted from unlabeled data. In this paper, we extend PU classification to also incorporate negative data and propose a novel semi-supervised classification approach. We establish generalization error bounds for our novel methods and show that the bounds decrease with respect to the number of unlabeled data without the distributional assumptions that are required in existing semi-supervised classification methods. Through experiments, we demonstrate the usefulness of the proposed methods.

Submitted 16 June, 2017; v1 submitted 23 May, 2016; originally announced May 2016.

Comments: Accepted to the 34th International Conference on Machine Learning (ICML 2017)

arXiv:1603.03130 [pdf, other]

Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning

Authors: Gang Niu, Marthinus Christoffel du Plessis, Tomoya Sakai, Yao Ma, Masashi Sugiyama

Abstract: In PU learning, a binary classifier is trained from positive (P) and unlabeled (U) data without negative (N) data. Although N data is missing, it sometimes outperforms PN learning (i.e., ordinary supervised learning). Hitherto, neither theoretical nor experimental analysis has been given to explain this phenomenon. In this paper, we theoretically compare PU (and NU) learning against PN learning based on the upper bounds on estimation errors. We find simple conditions when PU and NU learning are likely to outperform PN learning, and we prove that, in terms of the upper bounds, either PU or NU learning (depending on the class-prior probability and the sizes of P and N data) given infinite U data will improve on PN learning. Our theoretical findings agree well with the experimental results on artificial and benchmark data even when the experimental setup does not match the theoretical assumptions exactly.

Submitted 28 October, 2016; v1 submitted 9 March, 2016; originally announced March 2016.

Comments: NIPS 2016 camera-ready version

arXiv:0907.5321 [pdf, ps, other]

doi 10.1109/ICCVW.2009.5457702

Multiple pattern classification by sparse subspace decomposition

Authors: Tomoya Sakai

Abstract: A robust classification method is developed on the basis of sparse subspace decomposition. This method tries to decompose a mixture of subspaces of unlabeled data (queries) into as few class subspaces as possible. Each query is classified into the class whose subspace significantly contributes to the decomposed subspace. Multiple queries from different classes can be simultaneously classified into their respective classes. A practical greedy algorithm of the sparse subspace decomposition is designed for the classification. The present method achieves a high recognition rate and robust performance by exploiting joint sparsity.

Submitted 4 August, 2009; v1 submitted 30 July, 2009; originally announced July 2009.

Comments: 8 pages, 3 figures, 2nd IEEE International Workshop on Subspace Methods, Workshop Proceedings of ICCV 2009

Showing 1–45 of 45 results for author: Sakai, T