Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection to mitigate risks like the spread of fake news and plagiarism. Existing research has been constrained by evaluating detection methods on specific domains or particular language models. In practical scenarios, however, the detector faces texts from various domains or LLMs without knowing their sources. To this end, we build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs. Empirical results show challenges in distinguishing machine-generated texts from human-authored ones across various scenarios, especially out-of-distribution ones. These challenges are due to the decreasing linguistic distinctions between the two sources. Despite these challenges, the top-performing detector can identify 86.54% of out-of-domain texts generated by a new LLM, indicating feasibility for application scenarios. We release our resources at

Overview

  • Large language models (LLMs) have achieved human-level text generation, highlighting the need for effective AI-generated text detection to mitigate risks like the spread of fake news and plagiarism.
  • Existing research has been constrained by evaluating detection methods on specific domains or particular language models.
  • This paper builds a comprehensive testbed to evaluate text detection methods across diverse human writings and texts generated by different LLMs.

Plain English Explanation

Large language models are artificial intelligence (AI) systems that can generate human-like text. While this is an impressive technological achievement, it also poses risks, such as the spread of fake news and plagiarism.

Previous research on detecting AI-generated text has been limited in scope, focusing on specific types of text or particular language models. In the real world, however, a text detection system might encounter content from a wide range of sources, including both human-written and AI-generated text.

To address this challenge, the researchers created a comprehensive test dataset, gathering text from diverse human writings as well as texts generated by different language models. Evaluating detection methods across these sources showed how hard it is to distinguish machine-generated text from human-authored content, especially when the text comes from domains or models that were absent from the detector's training data.

Technical Explanation

The researchers built a comprehensive testbed by gathering texts from diverse human writings and texts generated by different large language models. This allowed them to evaluate the performance of text detection methods across a wide range of scenarios, including when the detector faces texts from sources it was not trained on.
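One way to picture the "sources it was not trained on" setting is a split that holds out both a domain and a language model. The sketch below is only an illustration of that idea, not the paper's actual procedure: the `Sample` record, its field names, and the domain and model labels are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are assumptions for illustration.
# `source` is either "human" or the name of the generating model.
Sample = namedtuple("Sample", ["text", "domain", "source"])


def held_out_split(samples, held_out_domain, held_out_model):
    """Build a train/test split where the test set holds an unseen domain and an unseen model."""
    train, test = [], []
    for s in samples:
        in_held_domain = s.domain == held_out_domain
        from_held_model = s.source == held_out_model
        if in_held_domain and (s.source == "human" or from_held_model):
            # Test set: human and held-out-model texts from the held-out domain.
            test.append(s)
        elif not in_held_domain and not from_held_model:
            # Training set: seen domains only, no text from the held-out model.
            train.append(s)
        # Samples that mix a seen and an unseen factor are dropped from both sets.
    return train, test


# Toy data with made-up domain and model names.
samples = [
    Sample("a", "news", "human"),
    Sample("b", "news", "model_a"),
    Sample("c", "reviews", "human"),
    Sample("d", "reviews", "model_b"),
    Sample("e", "reviews", "model_a"),
    Sample("f", "news", "model_b"),
]
train, test = held_out_split(samples, held_out_domain="reviews", held_out_model="model_b")
```

Dropping the mixed samples keeps the test set strictly out-of-distribution on both axes; a looser split that holds out only the domain or only the model would be an equally valid design for studying each factor separately.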

The empirical results revealed significant challenges in distinguishing machine-generated texts from human-authored ones, especially in out-of-distribution scenarios. This is due to the decreasing linguistic distinctions between the two sources as language models become more advanced.

Despite these challenges, the researchers found that the top-performing text detection method could still identify 86.54% of out-of-domain texts generated by a new language model. This indicates the feasibility of using text detection methods in practical application scenarios, though further research is needed to address the challenges identified in the study.
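A figure like 86.54% can be read as the share of machine-generated texts that the detector flags correctly. That reading is an assumption, since the summary does not define the metric precisely. Under that assumption, a minimal sketch of the computation looks like this (the function name and the label strings are hypothetical):

```python
def detection_rate(labels, predictions, positive="machine"):
    """Fraction of machine-generated texts the detector flags (recall on the positive class)."""
    # Keep only the predictions made for texts whose true label is machine-generated.
    preds_on_machine = [p for y, p in zip(labels, predictions) if y == positive]
    if not preds_on_machine:
        raise ValueError("no machine-generated samples in the evaluation set")
    return sum(p == positive for p in preds_on_machine) / len(preds_on_machine)


# Toy labels and detector outputs, invented for illustration only.
labels = ["machine", "machine", "human", "machine", "human"]
predictions = ["machine", "human", "human", "machine", "machine"]
rate = detection_rate(labels, predictions)
```

This number alone says nothing about how often human-written text is wrongly flagged, so in practice it would be reported alongside the same rate computed with `positive="human"`.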

Critical Analysis

The researchers acknowledge the limitations of their study, noting that the testbed they created, while comprehensive, may not capture all possible real-world scenarios. Additionally, the performance of text detection methods may continue to evolve as language models become more advanced.

One potential area for further research is exploring generalized detection strategies that can adapt to a wider range of text sources, rather than relying on detection methods trained on specific datasets or language models.

It is also important to consider the societal implications of AI-generated text detection, particularly regarding privacy, transparency, and the potential for abuse. Adapting fake news detection methods to the era of large language models may require additional ethical considerations and safeguards.


Conclusion

This research highlights the growing challenge of distinguishing machine-generated text from human-authored content, particularly as language models become more advanced. The researchers' comprehensive testbed and evaluation of text detection methods provide valuable insights into the current state of the field and the need for continued innovation to address this challenge.

As large language models become more prevalent, the ability to reliably detect AI-generated text will be crucial in mitigating the risks of fake news, plagiarism, and other potential misuses of this technology. The findings of this study contribute to the ongoing efforts to develop effective AI-generated text detection methods that can adapt to the ever-evolving landscape of language generation.

