Showing 1–50 of 58 results for author: Potthast, M

  1. Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR

    Authors: Nandan Thakur, Luiz Bonifacio, Maik Fröbe, Alexander Bondarenko, Ehsan Kamalloo, Martin Potthast, Matthias Hagen, Jimmy Lin

    Abstract: The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark -- a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: SIGIR 2024 (Resource & Reproducibility Track)

  2. arXiv:2405.07920  [pdf, other]

    cs.IR

    A Systematic Investigation of Distilling Large Language Models into Cross-Encoders for Passage Re-ranking

    Authors: Ferdinand Schlatt, Maik Fröbe, Harrisen Scells, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Benno Stein, Martin Potthast, Matthias Hagen

    Abstract: Cross-encoders distilled from large language models (LLMs) are often more effective re-rankers than cross-encoders fine-tuned on manually labeled data. However, the distilled models usually do not reach their teacher LLM's effectiveness. To investigate whether best practices for fine-tuning cross-encoders on manually labeled data (e.g., hard-negative sampling, deep sampling, and listwise loss func…

    Submitted 16 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  3. arXiv:2404.09615  [pdf, other]

    cs.CL cs.CY

    If there's a Trigger Warning, then where's the Trigger? Investigating Trigger Warnings at the Passage Level

    Authors: Matti Wiegmann, Jennifer Rakete, Magdalena Wolska, Benno Stein, Martin Potthast

    Abstract: Trigger warnings are labels that preface documents with sensitive content if this content could be perceived as harmful by certain groups of readers. Since warnings about a document intuitively need to be shown before reading it, authors usually assign trigger warnings at the document level. What parts of their writing prompted them to assign a warning, however, remains unclear. We investigate for…

    Submitted 15 April, 2024; originally announced April 2024.

  4. arXiv:2404.06912  [pdf, other]

    cs.IR

    Set-Encoder: Permutation-Invariant Inter-Passage Attention for Listwise Passage Re-Ranking with Cross-Encoders

    Authors: Ferdinand Schlatt, Maik Fröbe, Harrisen Scells, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Benno Stein, Martin Potthast, Matthias Hagen

    Abstract: Existing cross-encoder re-rankers can be categorized as pointwise, pairwise, or listwise models. Pair- and listwise models allow passage interactions, which usually makes them more effective than pointwise models but also less efficient and less robust to input order permutations. To enable efficient permutation-invariant passage interactions during re-ranking, we propose a new cross-encoder archi…

    Submitted 16 June, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

  5. arXiv:2403.17564  [pdf, other]

    cs.CL

    Task-Oriented Paraphrase Analytics

    Authors: Marcel Gohsen, Matthias Hagen, Martin Potthast, Benno Stein

    Abstract: Since paraphrasing is an ill-defined task, the term "paraphrasing" covers text transformation tasks with different characteristics. Consequently, existing paraphrasing studies have applied quite different (explicit and implicit) criteria as to when a pair of texts is to be considered a paraphrase, all of which amount to postulating a certain level of semantic or lexical similarity. In this paper,…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  6. arXiv:2403.07654  [pdf, other]

    cs.IR

    Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models

    Authors: Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen

    Abstract: Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding targe…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 13 pages, 3 figures, Accepted at ECIR 2024 as a Full Paper

  7. arXiv:2402.06913  [pdf, other]

    cs.CL

    TL;DR Progress: Multi-faceted Literature Exploration in Text Summarization

    Authors: Shahbaz Syed, Khalid Al-Khatib, Martin Potthast

    Abstract: This paper presents TL;DR Progress, a new tool for exploring the literature on neural text summarization. It organizes 514 papers based on a comprehensive annotation scheme for text summarization approaches and enables fine-grained, faceted search. Each paper was manually annotated to capture aspects such as evaluation metrics, quality dimensions, learning paradigms, challenges addressed, datasets…

    Submitted 10 February, 2024; originally announced February 2024.

    Comments: EACL 2024 System Demonstration

  8. Detecting Generated Native Ads in Conversational Search

    Authors: Sebastian Schmidt, Ines Zelch, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: Conversational search engines such as YouChat and Microsoft Copilot use large language models (LLMs) to generate responses to queries. It is only a small step to also let the same technology insert ads within the generated responses - instead of separately placing ads next to a response. Inserted ads would be reminiscent of native advertising and product placement, both of which are very effective…

    Submitted 30 April, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: WWW'24 Short Papers Track; 4 pages

  9. arXiv:2401.06320  [pdf, other]

    cs.IR cs.CL

    Zero-shot Generative Large Language Models for Systematic Review Screening Automation

    Authors: Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan Koopman, Guido Zuccon

    Abstract: Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs…

    Submitted 31 January, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: Accepted to ECIR2024 full paper (findings)

  10. Evaluating Generative Ad Hoc Information Retrieval

    Authors: Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: Recent advances in large language models have enabled the development of viable generative retrieval systems. Instead of a traditional document ranking, generative retrieval systems often directly return a grounded generated text as a response to a query. Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval. Yet, the establishe…

    Submitted 22 May, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: 14 pages, 6 figures, 1 table. Published at SIGIR'24 perspective paper track

  11. arXiv:2311.02408  [pdf, other]

    cs.CL

    Citance-Contextualized Summarization of Scientific Papers

    Authors: Shahbaz Syed, Ahmad Dawar Hakimi, Khalid Al-Khatib, Martin Potthast

    Abstract: Current approaches to automatic summarization of scientific papers generate informative summaries in the form of abstracts. However, abstracts are not intended to show the relationship between a paper and the references cited in it. We propose a new contextualized summarization approach that can generate an informative summary conditioned on a given sentence containing the citation of a reference…

    Submitted 13 November, 2023; v1 submitted 4 November, 2023; originally announced November 2023.

    Comments: Accepted at EMNLP 2023 Findings

  12. arXiv:2311.01882  [pdf, other]

    cs.CL

    Indicative Summarization of Long Discussions

    Authors: Shahbaz Syed, Dominik Schwabe, Khalid Al-Khatib, Martin Potthast

    Abstract: Online forums encourage the exchange and discussion of different stances on many topics. Not only do they provide an opportunity to present one's own arguments, but may also gather a broad cross-section of others' arguments. However, the resulting long discussions are difficult to overview. This paper presents a novel unsupervised approach using large language models (LLMs) to generating indicativ…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: Accepted at EMNLP 2023 Main Conference

  13. arXiv:2310.04892  [pdf, other]

    cs.IR

    Commercialized Generative AI: A Critical Study of the Feasibility and Ethics of Generating Native Advertising Using Large Language Models in Conversational Web Search

    Authors: Ines Zelch, Matthias Hagen, Martin Potthast

    Abstract: How will generative AI pay for itself? Unless charging users for access, selling advertising is the only alternative. Especially in the multi-billion dollar web search market with ads as the main source of revenue, the introduction of a subscription model seems unlikely. The recent disruption of search by generative large language models could thus ultimately be accompanied by generated ads. Our c…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: Presented at OSSYM 2023

  14. arXiv:2309.05238  [pdf, other]

    cs.IR cs.AI

    Generating Natural Language Queries for More Effective Systematic Review Screening Prioritisation

    Authors: Shuai Wang, Harrisen Scells, Martin Potthast, Bevan Koopman, Guido Zuccon

    Abstract: Screening prioritisation in medical systematic reviews aims to rank the set of documents retrieved by complex Boolean queries. Prioritising the most important documents ensures that subsequent review steps can be carried out more efficiently and effectively. The current state of the art uses the final title of the review as a query to rank the documents using BERT-based neural rankers. However, th…

    Submitted 23 November, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: Preprints for Accepted paper in SIGIR-AP-2023, note that this is updated from ACM published paper. The working title was wrong in the ACM-published version due to a bug in data preprocessing; however, this does not have any influence on the final conclusion/observation made from the paper

  15. arXiv:2308.12059  [pdf, other]

    cs.CV cs.LG

    Manipulating Embeddings of Stable Diffusion Prompts

    Authors: Niklas Deckers, Julia Peters, Martin Potthast

    Abstract: Prompt engineering is still the primary way for users of generative text-to-image models to manipulate generated images in a targeted way. Based on treating the model as a continuous function and by passing gradients between the image space and the prompt embedding space, we propose and analyze a new method to directly manipulate the embedding of a prompt instead of the prompt text. We then derive…

    Submitted 22 June, 2024; v1 submitted 23 August, 2023; originally announced August 2023.

    Comments: IJCAI 2024 camera ready version

  16. arXiv:2308.04226  [pdf, other]

    cs.HC cs.CL cs.IR cs.LG

    OpinionConv: Conversational Product Search with Grounded Opinions

    Authors: Vahid Sadiri Javadi, Martin Potthast, Lucie Flek

    Abstract: When searching for products, the opinions of others play an important role in making informed decisions. Subjective experiences about a product can be a valuable source of information. This is also true in sales conversations, where a customer and a sales assistant exchange facts and opinions about products. However, training an AI for such conversations is complicated by the fact that language mo…

    Submitted 8 August, 2023; originally announced August 2023.

  17. arXiv:2306.01481  [pdf, other]

    cs.CL

    GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

    Authors: Aleksandra Piktus, Odunayo Ogundepo, Christopher Akiki, Akintunde Oladipo, Xinyu Zhang, Hailey Schoelkopf, Stella Biderman, Martin Potthast, Jimmy Lin

    Abstract: Noticing the urgent need to provide tools for fast and user-friendly qualitative analysis of large-scale textual corpora of the modern NLP, we propose to turn to the mature and well-tested methods from the domain of Information Retrieval (IR) - a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini - a widely used toolkit for reproducible IR researc…

    Submitted 2 June, 2023; originally announced June 2023.

  18. The Information Retrieval Experiment Platform

    Authors: Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier's interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 11 pages. To be published in the proceedings of SIGIR 2023

  19. arXiv:2305.14935  [pdf, other]

    cs.CL

    Modeling Appropriate Language in Argumentation

    Authors: Timon Ziegenbein, Shahbaz Syed, Felix Lange, Martin Potthast, Henning Wachsmuth

    Abstract: Online discussion moderators must make ad-hoc decisions about whether the contributions of discussion participants are appropriate or should be removed to maintain civility. Existing research on offensive language and the resulting tools cover only one aspect among many involved in such decisions. The question of what is considered appropriate in a controversial discussion has not yet been systema…

    Submitted 24 May, 2023; originally announced May 2023.

  20. arXiv:2305.02350  [pdf, other]

    cs.CL cs.LG

    Using Language Models on Low-end Hardware

    Authors: Fabian Ziegner, Janos Borst, Andreas Niekler, Martin Potthast

    Abstract: This paper evaluates the viability of using fixed language models for training text classification networks on low-end hardware. We combine language models with a CNN architecture and put together a comprehensive benchmark with 8 datasets covering single-label and multi-label classification of topic, sentiment, and genre. Our observations are distilled into a list of trade-offs, concluding that th…

    Submitted 8 May, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

    Comments: 5+4 pages, 6 tables; fixed affiliation

  21. Perspectives on Large Language Models for Relevance Judgment

    Authors: Guglielmo Faggioli, Laura Dietz, Charles Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, Henning Wachsmuth

    Abstract: When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments along with concerns and issues that arise. We devise a human-machine collaboration spectrum th…

    Submitted 18 November, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    ACM Class: H.3.3

  22. The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

    Authors: Jan Heinrich Reimer, Sebastian Schmidt, Maik Fröbe, Lukas Gienapp, Harrisen Scells, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish the…

    Submitted 31 July, 2023; v1 submitted 1 April, 2023; originally announced April 2023.

    Comments: SIGIR 2023 resource paper, 13 pages

  23. arXiv:2302.14534  [pdf, other]

    cs.IR cs.CL

    Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face

    Authors: Christopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin, Martin Potthast

    Abstract: We present Spacerini, a tool that integrates the Pyserini toolkit for reproducible information retrieval research with Hugging Face to enable the seamless construction and deployment of interactive search engines. Spacerini makes state-of-the-art sparse and dense retrieval models more accessible to non-IR practitioners while minimizing deployment effort. This is useful for NLP researchers who want…

    Submitted 24 March, 2024; v1 submitted 28 February, 2023; originally announced February 2023.

  24. arXiv:2301.11030  [pdf, other]

    cs.CL

    Paraphrase Acquisition from Image Captions

    Authors: Marcel Gohsen, Matthias Hagen, Martin Potthast, Benno Stein

    Abstract: We propose to use image captions from the Web as a previously underutilized resource for paraphrases (i.e., texts with the same "message") and to create and analyze a corresponding dataset. When an image is reused on the Web, an original caption is often assigned. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of…

    Submitted 15 February, 2023; v1 submitted 26 January, 2023; originally announced January 2023.

  25. arXiv:2301.09759  [pdf, other]

    cs.CL

    Topic Ontologies for Arguments

    Authors: Yamen Ajjour, Johannes Kiesel, Benno Stein, Martin Potthast

    Abstract: Many computational argumentation tasks, like stance classification, are topic-dependent: the effectiveness of approaches to these tasks significantly depends on whether the approaches were trained on arguments from the same topics as those they are tested on. So, which are these topics that researchers train approaches on? This paper contributes the first comprehensive survey of topic coverage, as…

    Submitted 23 January, 2023; originally announced January 2023.

  26. arXiv:2212.07476  [pdf, other]

    cs.IR cs.CL cs.CV

    The Infinite Index: Information Retrieval on Generative Text-To-Image Models

    Authors: Niklas Deckers, Maik Fröbe, Johannes Kiesel, Gianluca Pandolfo, Christopher Schröder, Benno Stein, Martin Potthast

    Abstract: Conditional generative models such as DALL-E and Stable Diffusion generate images based on a user-defined text, the prompt. Finding and refining prompts that produce a desired image has become the art of prompt engineering. Generative models do not provide a built-in retrieval model for a user's information need expressed through prompts. In light of an extensive literature review, we reframe prom…

    Submitted 21 January, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

    Comments: Final version for CHIIR 2023

  27. arXiv:2211.02477  [pdf, other]

    cs.CL cs.DL

    SMAuC -- The Scientific Multi-Authorship Corpus

    Authors: Janek Bevendorff, Philipp Sauer, Lukas Gienapp, Wolfgang Kircheis, Erik Körner, Benno Stein, Martin Potthast

    Abstract: The rapidly growing volume of scientific publications offers an interesting challenge for research on methods for analyzing the authorship of documents with one or more authors. However, most existing datasets lack scientific documents or the necessary metadata for constructing new experiments and test cases. We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorsh…

    Submitted 10 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

  28. arXiv:2210.09587  [pdf, other]

    cs.CL

    Summary Workbench: Unifying Application and Evaluation of Text Summarization Models

    Authors: Shahbaz Syed, Dominik Schwabe, Martin Potthast

    Abstract: This paper presents Summary Workbench, a new tool for developing and evaluating text summarization models. New models and evaluation measures can be easily integrated as Docker-based plugins, allowing to examine the quality of their summaries against any input and to evaluate them using various evaluation measures. Visual analyses combining multiple measures provide insights into the models' stren…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted as system demonstration at EMNLP 2022

  29. arXiv:2210.06970  [pdf, other]

    cs.CL cs.IR

    Differential Bias: On the Perceptibility of Stance Imbalance in Argumentation

    Authors: Alonso Palomino, Martin Potthast, Khalid Al-Khatib, Benno Stein

    Abstract: Most research on natural language processing treats bias as an absolute concept: Based on a (probably complex) algorithmic analysis, a sentence, an article, or a text is classified as biased or not. Given the fact that for humans the question of whether a text is biased can be difficult to answer or is answered contradictory, we ask whether an "absolute bias classification" is a promising goal at…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted at AACL-IJCNLP 2022, Findings Volume

  30. arXiv:2209.12299  [pdf, other]

    cs.DL cs.IR

    WARC-DL: Scalable Web Archive Processing for Deep Learning

    Authors: Niklas Deckers, Martin Potthast

    Abstract: Web archives have grown to petabytes. In addition to providing invaluable background knowledge on many social and cultural developments over the last 30 years, they also provide vast amounts of training data for machine learning. To benefit from recent developments in Deep Learning, the use of web archives requires a scalable solution for their processing that supports inference with and training…

    Submitted 25 September, 2022; originally announced September 2022.

    Comments: Submitted to OSSYM 2022 - 4th International Open Search Symposium

  31. arXiv:2209.04409  [pdf, other]

    cs.CL

    Trigger Warnings: Bootstrapping a Violence Detector for FanFiction

    Authors: Magdalena Wolska, Christopher Schröder, Ole Borchardt, Benno Stein, Martin Potthast

    Abstract: We present the first dataset and evaluation results on a newly defined computational task of trigger warning assignment. Labeled corpus data has been compiled from narrative works hosted on Archive of Our Own (AO3), a well-known fanfiction site. In this paper, we focus on the most frequently assigned trigger type--violence--and define a document-level binary classification task of whether or not t…

    Submitted 9 September, 2022; originally announced September 2022.

    Comments: 5 pages

  32. Sparse Pairwise Re-ranking with Pre-trained Transformers

    Authors: Lukas Gienapp, Maik Fröbe, Matthias Hagen, Martin Potthast

    Abstract: Pairwise re-ranking models predict which of two documents is more relevant to a query and then aggregate a final ranking from such preferences. This is often more effective than pointwise re-ranking models that directly predict a relevance value for each document. However, the high inference overhead of pairwise models limits their practical application: usually, for a set of $k$ documents to be r…

    Submitted 10 July, 2022; originally announced July 2022.

    Comments: Accepted at ICTIR 2022

  33. arXiv:2206.14759  [pdf, other]

    cs.IR

    How Train-Test Leakage Affects Zero-shot Retrieval

    Authors: Maik Fröbe, Christopher Akiki, Martin Potthast, Matthias Hagen

    Abstract: Neural retrieval models are often trained on (subsets of) the millions of queries of the MS MARCO / ORCAS datasets and then tested on the 250 Robust04 queries or other TREC benchmarks with often only 50 queries. In such setups, many of the few test queries can be very similar to queries from the huge training data -- in fact, 69% of the Robust04 queries have near-duplicates in MS MARCO / ORCAS. We…

    Submitted 30 August, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: To appear at the 29th International Symposium on String Processing and Information Retrieval (SPIRE 2022)

  34. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  35. arXiv:2203.10282  [pdf, other]

    cs.CL

    Clickbait Spoiling via Question Answering and Passage Retrieval

    Authors: Matthias Hagen, Maik Fröbe, Artur Jurk, Martin Potthast

    Abstract: We introduce and study the task of clickbait spoiling: generating a short text that satisfies the curiosity induced by a clickbait post. Clickbait links to a web page and advertises its contents by arousing curiosity instead of providing an informative summary. Our contributions are approaches to classify the type of spoiler needed (i.e., a phrase or a passage), and to generate appropriate spoiler…

    Submitted 19 March, 2022; originally announced March 2022.

    Comments: Accepted at ACL 2022

  36. arXiv:2202.02081  [pdf, other]

    cs.CL cs.CY

    Tracking Discourse Influence in Darknet Forums

    Authors: Christopher Akiki, Lukas Gienapp, Martin Potthast

    Abstract: This technical report documents our efforts in addressing the tasks set forth by the 2021 AMoC (Advanced Modelling of Cyber Criminal Careers) Hackathon. Our main contribution is a joint visualisation of semantic and temporal features, generating insight into the supplied data on darknet cybercrime through the aspects of novelty, transience, and resonance, which describe the potential impact a mess…

    Submitted 4 February, 2022; originally announced February 2022.

    Comments: Submitted as an entry by Leipzig University's TEMIR group to the Bristol Cyber Security Group's AMoC (Advanced Modelling of Cyber Criminal Careers) project hackathon

  37. arXiv:2112.11800  [pdf, other]

    cs.DL cs.CL cs.IR

    STEREO: Scientific Text Reuse in Open Access Publications

    Authors: Lukas Gienapp, Wolfgang Kircheis, Bjarne Sievers, Benno Stein, Martin Potthast

    Abstract: We present the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains more than 91 million cases of reused text passages found in 4.2 million unique open-access publications. Featuring a high coverage of scientific disciplines and varieties of reuse, as well as comprehensive metadata to contextualize each case, our dataset addresses the most…

    Submitted 13 December, 2022; v1 submitted 22 December, 2021; originally announced December 2021.

    Comments: 14 pages, 3 figures, 4 tables

  38. arXiv:2112.03103  [pdf, other]

    cs.IR

    FastWARC: Optimizing Large-Scale Web Archive Analytics

    Authors: Janek Bevendorff, Martin Potthast, Benno Stein

    Abstract: Web search and other large-scale web data analytics rely on processing archives of web pages stored in a standardized and efficient format. Since its introduction in 2008, the IIPC's Web ARCive (WARC) format has become the standard format for this purpose. As a list of individually compressed records of HTTP requests and responses, it allows for constant-time random access to all kinds of web data…

    Submitted 22 November, 2021; originally announced December 2021.

    Journal ref: OSSYM 2021 - 3rd International Open Search Symposium

  39. arXiv:2111.10864  [pdf, other]

    cs.IR

    The Impact of Main Content Extraction on Near-Duplicate Detection

    Authors: Maik Fröbe, Matthias Hagen, Janek Bevendorff, Michael Völske, Benno Stein, Christopher Schröder, Robby Wagner, Lukas Gienapp, Martin Potthast

    Abstract: Commercial web search engines employ near-duplicate detection to ensure that users see each relevant result only once, albeit the underlying web crawls typically include (near-)duplicates of many web pages. We revisit the risks and potential of near-duplicates with an information retrieval focus, motivating that current efforts toward an open and independent European web search infrastructure shou…

    Submitted 21 November, 2021; originally announced November 2021.

  40. arXiv:2110.15181  [pdf, ps, other]

    cs.CL

    BERTian Poetics: Constrained Composition with Masked LMs

    Authors: Christopher Akiki, Martin Potthast

    Abstract: Masked language models have recently been interpreted as energy-based sequence models that can be generated from using a Metropolis-Hastings sampler. This short paper demonstrates how this can be instrumentalized for constrained composition and explores the poetics implied by such a usage. Our focus on constraints makes it especially apt to understand the generated text through the poetics of the…

    Submitted 28 October, 2021; originally announced October 2021.

    Comments: Accepted as a poster at the 2021 NeurIPS Workshop on Machine Learning for Creativity and Design

  41. arXiv:2110.08011  [pdf, other]

    cs.CL

    Modeling Proficiency with Implicit User Representations

    Authors: Kim Breitwieser, Allison Lahnala, Charles Welch, Lucie Flek, Martin Potthast

    Abstract: We introduce the problem of proficiency modeling: Given a user's posts on a social media platform, the task is to identify the subset of posts or topics for which the user has some level of proficiency. This enables the filtering and ranking of social media posts on a given topic as per user proficiency. Unlike experts on a given topic, proficient users may not have received formal training and po…

    Submitted 15 October, 2021; originally announced October 2021.

  42. arXiv:2109.15086  [pdf, other]

    cs.CL

    Key Point Analysis via Contrastive Learning and Extractive Argument Summarization

    Authors: Milad Alshomary, Timon Gurcke, Shahbaz Syed, Philipp Heinrich, Maximilian Spliethöver, Philipp Cimiano, Martin Potthast, Henning Wachsmuth

    Abstract: Key point analysis is the task of extracting a set of concise and high-level statements from a given collection of arguments, representing the gist of these arguments. This paper presents our proposed approach to the Key Point Analysis shared task, collocated with the 8th Workshop on Argument Mining. The approach integrates two complementary components. One component employs contrastive learning v…

    Submitted 22 October, 2021; v1 submitted 30 September, 2021; originally announced September 2021.

  43. arXiv:2108.01879  [pdf, other]

    cs.CL

    Summary Explorer: Visualizing the State of the Art in Text Summarization

    Authors: Shahbaz Syed, Tariq Yousef, Khalid Al-Khatib, Stefan Jänicke, Martin Potthast

    Abstract: This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems by compiling the outputs of 55 state-of-the-art single document summarization approaches on three benchmark datasets, and visually exploring them during a qualitative assessment. The underlying design of the tool considers three well-known summary quality criteria (coverage, faithfulne…

    Submitted 24 September, 2021; v1 submitted 4 August, 2021; originally announced August 2021.

    Comments: Accepted as system demonstration at EMNLP 2021

  44. Small-Text: Active Learning for Text Classification in Python

    Authors: Christopher Schröder, Lydia Müller, Andreas Niekler, Martin Potthast

    Abstract: We introduce small-text, an easy-to-use active learning library, which offers pool-based active learning for single- and multi-label text classification in Python. It features numerous pre-implemented state-of-the-art query strategies, including some that leverage the GPU. Standardized interfaces allow the combination of a variety of classifiers, query strategies, and stopping criteria, facilitati…

    Submitted 7 October, 2023; v1 submitted 21 July, 2021; originally announced July 2021.

    Comments: This revision fixes the number of query strategies for modAL, which had remained unchanged from an earlier iteration of the table that did not yet include multi-label strategies

  45. arXiv:2107.05687  [pdf, other]

    cs.CL cs.LG

    Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers

    Authors: Christopher Schröder, Andreas Niekler, Martin Potthast

    Abstract: Active learning is the iterative construction of a classification model through targeted labeling, enabling significant labeling cost savings. As most research on active learning has been carried out before transformer-based language models ("transformers") became popular, despite its practical importance, comparatively few papers have investigated how transformers can be combined with active learnin…

    Submitted 20 March, 2022; v1 submitted 12 July, 2021; originally announced July 2021.

    Comments: ACL 2022 Findings

  46. arXiv:2107.00893  [pdf, other]

    cs.DL cs.NI cs.SI

    Web Archive Analytics

    Authors: Michael Völske, Janek Bevendorff, Johannes Kiesel, Benno Stein, Maik Fröbe, Matthias Hagen, Martin Potthast

    Abstract: Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes, to the extent organizationally possible for researchers. In order to better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data s…

    Submitted 2 July, 2021; originally announced July 2021.

    Comments: 12 pages, 5 figures. Published in the proceedings of INFORMATIK 2020

    Journal ref: INFORMATIK 2020. Gesellschaft für Informatik, Bonn. (pp. 61-72)

  47. Generating Informative Conclusions for Argumentative Texts

    Authors: Shahbaz Syed, Khalid Al-Khatib, Milad Alshomary, Henning Wachsmuth, Martin Potthast

    Abstract: The purpose of an argumentative text is to support a certain conclusion. Yet conclusions are often omitted, with readers expected to infer them instead. While appropriate when reading an individual text, this rhetorical device limits accessibility when browsing many texts (e.g., on a search engine or on social media). In these scenarios, an explicit conclusion makes for a good candidate summary of an argumen…

    Submitted 2 June, 2021; originally announced June 2021.

  48. arXiv:2105.11752  [pdf, other]

    cs.CL

    Argument Undermining: Counter-Argument Generation by Attacking Weak Premises

    Authors: Milad Alshomary, Shahbaz Syed, Arkajit Dhar, Martin Potthast, Henning Wachsmuth

    Abstract: Text generation has recently received a lot of attention in computational argumentation research. A particularly challenging task is the generation of counter-arguments. So far, approaches primarily focus on rebutting a given conclusion, yet other ways to counter an argument exist. In this work, we go beyond previous research by exploring argument undermining, that is, countering an argument b…

    Submitted 31 May, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

    Comments: 9 pages, 3 figures

  49. arXiv:2005.14714  [pdf, other]

    cs.CL

    The Importance of Suppressing Domain Style in Authorship Analysis

    Authors: Sebastian Bischoff, Niklas Deckers, Marcel Schliebs, Ben Thies, Matthias Hagen, Efstathios Stamatatos, Benno Stein, Martin Potthast

    Abstract: The prerequisite of many approaches to authorship analysis is a representation of writing style. But despite decades of research, it still remains unclear to what extent commonly used and widely accepted representations like character trigram frequencies actually represent an author's writing style, in contrast to more domain-specific style components or even topic. We address this shortcoming for…

    Submitted 29 May, 2020; originally announced May 2020.

  50. Abstractive Snippet Generation

    Authors: Wei-Fan Chen, Shahbaz Syed, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: An abstractive snippet is an originally created piece of text to summarize a web page on a search engine results page. Compared to the conventional extractive snippets, which are generated by extracting phrases and sentences verbatim from a web page, abstractive snippets circumvent copyright issues; even more interesting is the fact that they open the door for personalization. Abstractive snippets…

    Submitted 15 March, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

    Comments: Accepted by WWW 2020