Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap

Read original: arXiv:2211.13087 - Published 8/20/2024 by Mengmi Zhang, Giorgia Dellaferrera, Ankur Sikarwar, Caishun Chen, Marcelo Armendariz, Noga Mudrik, Prachi Agrawal, Spandan Madan, Mranmay Shetty, Andrei Barbu and 11 others

Overview

As AI becomes more advanced, it's crucial to determine whether the agents we interact with are human or not.
The researchers used the Turing test to systematically benchmark current AI systems across various language and vision tasks.
The experiments involved 549 human agents and 26 AI agents, judged by 1,126 human judges and 10 AI judges, across 25,650 Turing-like tests.

Plain English Explanation

The researchers wanted to understand how close current AI systems are to being able to imitate humans in different tasks. They used the Turing test as a way to do this, which involves evaluating whether a human judge can distinguish an AI's responses from a human's.

The researchers had both human and AI agents perform a variety of language and vision tasks, like captioning images, associating words, and detecting objects. They then had human and AI judges try to determine which responses were from humans and which were from AIs.

The results showed that current AIs are getting quite good at imitating humans in these complex tasks. While human judges were often fooled, simple AI judges were actually better than humans at telling human and AI responses apart. The researchers also found that the AIs' performance on the imitation tests was only minimally correlated with their performance on standard AI benchmarks.

Technical Explanation

The researchers conducted a series of experiments to systematically benchmark the ability of current AI systems to imitate humans across a range of language and vision tasks. They used the Turing test as a framework, involving both human and AI agents as "test-takers" as well as human and AI judges.

The language tasks included image captioning, word association, and open-ended conversation, while the vision tasks involved object detection, color estimation, and attention prediction. The experiments were large-scale, involving 549 human agents, 26 AI agents, 1,126 human judges, and 10 AI judges, across 25,650 Turing-like tests.

The results revealed that current AI systems are surprisingly close to being able to impersonate humans in these complex language and vision challenges. While human judges were often deceived, simple AI judges were able to outperform them in distinguishing human and AI responses.
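To make the judging setup concrete, here is a minimal sketch of how a judge's verdicts could be scored. The `Trial` record, the function names, and the sample verdicts are all hypothetical and invented for illustration; they are not the paper's data or its exact metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One Turing-like test (hypothetical record layout)."""
    true_source: str    # who actually produced the response: "human" or "ai"
    judged_source: str  # what the judge guessed: "human" or "ai"


def judge_accuracy(trials):
    """Fraction of trials in which the judge named the true source."""
    if not trials:
        raise ValueError("need at least one trial")
    correct = sum(t.true_source == t.judged_source for t in trials)
    return correct / len(trials)


def deception_rate(trials):
    """Fraction of AI-produced responses that the judge labeled as human."""
    ai_trials = [t for t in trials if t.true_source == "ai"]
    if not ai_trials:
        raise ValueError("need at least one AI-produced trial")
    fooled = sum(t.judged_source == "human" for t in ai_trials)
    return fooled / len(ai_trials)


# Made-up verdicts from a single judge, for illustration only.
trials = [
    Trial("ai", "human"),
    Trial("ai", "ai"),
    Trial("human", "human"),
    Trial("human", "ai"),
    Trial("ai", "human"),
    Trial("human", "human"),
]

print("judge accuracy:", judge_accuracy(trials))
print("deception rate:", deception_rate(trials))
```

Under this framing, a judge whose accuracy sits near 0.5 is doing no better than guessing, and a high deception rate means the AI agent frequently passes as human.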

Interestingly, the researchers found that the AIs' performance on the imitation tests was only minimally correlated with their standard AI benchmark scores. This suggests that evaluating whether a machine can pass as human constitutes an important independent test for assessing AI capabilities.
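One way to check a claim like this is to correlate each AI agent's imitation score with its benchmark score. The sketch below assumes a Pearson correlation and uses made-up numbers; the summary does not say which correlation measure the authors used, and the scores here are not from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four AI agents (illustrative values only).
benchmark_scores = [0.60, 0.70, 0.80, 0.90]  # e.g. a standard task metric
imitation_scores = [0.40, 0.30, 0.45, 0.35]  # e.g. share of judges fooled

r = pearson(benchmark_scores, imitation_scores)
print(f"correlation between benchmark and imitation scores: {r:.3f}")
```

A coefficient close to zero, as in this toy data, would mean that a higher benchmark score tells you little about how human-like an agent appears to judges.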

Critical Analysis

The researchers acknowledge several caveats and limitations to their work. For example, they note that the AI systems tested were limited to specific tasks and may not generalize to broader conversational or visual capabilities. Additionally, the imitation tests focused on relatively short-term interactions, and it's unclear how well the AIs would perform in longer, more open-ended exchanges.

One potential concern is the reliance on human judges, who may have biases or preconceptions that could influence their assessments. The researchers attempt to address this by including AI judges, but the sample size of 10 is relatively small.

Furthermore, the researchers do not delve deeply into the potential societal implications of AIs becoming increasingly adept at impersonating humans. As these systems become more advanced, there may be ethical and practical concerns around transparency, trust, and the potential for deception.

Conclusion

This study provides important insights into the current state of AI's ability to imitate human behavior across a range of language and vision tasks. The large-scale, systematic approach and the introduction of new benchmark datasets and evaluation metrics represent valuable contributions to the field.

The findings suggest that while current AIs are not yet indistinguishable from humans, they are not far from it in the tasks tested. This raises significant questions about the ethical and societal implications of such advanced AI systems and the need for rigorous, ongoing evaluation of their abilities and limitations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap

Mengmi Zhang, Giorgia Dellaferrera, Ankur Sikarwar, Caishun Chen, Marcelo Armendariz, Noga Mudrik, Prachi Agrawal, Spandan Madan, Mranmay Shetty, Andrei Barbu, Haochen Yang, Tanishq Kumar, Shui'Er Han, Aman Raj Singh, Meghna Sadwani, Stella Dellaferrera, Michele Pizzochero, Brandon Tang, Yew Soon Ong, Hanspeter Pfister, Gabriel Kreiman

As AI algorithms increasingly participate in daily activities, it becomes critical to ascertain whether the agents we interact with are human or not. To address this question, we turn to the Turing test and systematically benchmark current AIs in their abilities to imitate humans in three language tasks (Image captioning, Word association, and Conversation) and three vision tasks (Object detection, Color estimation, and Attention prediction). The experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges, in 25,650 Turing-like tests. The results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges. While human judges were often deceived, simple AI judges outperformed human judges in distinguishing human answers from AI answers. The results of imitation tests are only minimally correlated with standard performance metrics in AI. Thus, evaluating whether a machine can pass as a human constitutes an important independent test to evaluate AI algorithms. The curated, large-scale, Turing datasets introduced here and their evaluation metrics provide new benchmarks and insights to assess whether an agent is human or not and emphasize the relevance of rigorous, systematic, and quantitative imitation tests in these and other AI domains.

8/20/2024

Passed the Turing Test: Living in Turing Futures

Bernardo Gonçalves

The world has seen the emergence of machines based on pretrained models, transformers, also known as generative artificial intelligences for their ability to produce various types of content, including text, images, audio, and synthetic data. Without resorting to preprogramming or special tricks, their intelligence grows as they learn from experience, and to ordinary people, they can appear human-like in conversation. This means that they can pass the Turing test, and that we are now living in one of many possible Turing futures where machines can pass for what they are not. However, the learning machines that Turing imagined would pass his imitation tests were machines inspired by the natural development of the low-energy human cortex. They would be raised like human children and naturally learn the ability to deceive an observer. These "child machines," Turing hoped, would be powerful enough to have an impact on society and nature.

9/14/2024

People cannot distinguish GPT-4 from a human in a Turing test

Cameron R. Jones, Benjamin K. Bergen

We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.

5/15/2024

Attributions toward Artificial Agents in a modified Moral Turing Test

Eyal Aharoni, Sharlene Fernandes, Daniel J. Brady, Caelan Alexander, Michael Criner, Kara Queen, Javier Rando, Eddy Nahmias, Victor Crespo

Advances in artificial intelligence (AI) raise important questions about whether people view moral evaluations by AI systems similarly to human-generated moral evaluations. We conducted a modified Moral Turing Test (m-MTT), inspired by Allen and colleagues' (2000) proposal, by asking people to distinguish real human moral evaluations from those made by a popular advanced AI language model: GPT-4. A representative sample of 299 U.S. adults first rated the quality of moral evaluations when blinded to their source. Remarkably, they rated the AI's moral reasoning as superior in quality to humans' along almost all dimensions, including virtuousness, intelligence, and trustworthiness, consistent with passing what Allen and colleagues call the comparative MTT. Next, when tasked with identifying the source of each evaluation (human or computer), people performed significantly above chance levels. Although the AI did not pass this test, this was not because of its inferior moral reasoning but, potentially, its perceived superiority, among other possible explanations. The emergence of language models capable of producing moral responses perceived as superior in quality to humans' raises concerns that people may uncritically accept potentially harmful moral guidance from AI. This possibility highlights the need for safeguards around generative language models in matters of morality.

6/19/2024