Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

2406.13763

Published 6/21/2024 by Zhawnen Chen, Tianchun Wang, Yizhou Wang, Michal Kosinski, Xiang Zhang, Yun Fu, Sheng Li

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Abstract

Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.

Create account to get full access

Overview

This paper explores the use of multimodal video large language models (LLMs) for understanding and predicting human behavior based on the "theory of mind" - the ability to attribute mental states like beliefs, desires, and intentions to others.
The researchers developed a multimodal video question answering (MMTOM-QA) dataset and used it to evaluate whether LLMs can exhibit human-like theory of mind reasoning.
The results suggest that while LLMs can make some inferences about the mental states of individuals in videos, they still struggle with more complex aspects of theory of mind reasoning compared to humans.

Plain English Explanation

The paper is looking at whether large language models (LLMs) - powerful AI systems that can understand and generate human-like text - are able to "read minds" in a way that humans can. Humans have a remarkable ability called the "theory of mind" - we can imagine what other people are thinking, feeling, and intending, and use that to predict and understand their behavior.

The researchers wanted to see if LLMs could exhibit similar theory of mind capabilities when analyzing videos of people. They created a special dataset of video clips and questions that test different aspects of theory of mind reasoning. They then evaluated how well different LLMs could answer those questions and demonstrate an understanding of the characters' mental states.

The results suggest that while LLMs can make some basic inferences about what people in the videos are thinking or feeling, they still fall short of human-level theory of mind. There are many nuanced aspects of understanding others' minds that current LLMs have difficulty with. The paper provides important insights into the limits of current AI systems when it comes to the very human-like ability to "read minds."

Technical Explanation

The researchers developed a new multimodal video question answering (MMTOM-QA) dataset [link] to evaluate whether large language models (LLMs) can exhibit human-like "theory of mind" reasoning. Theory of mind refers to the ability to attribute mental states like beliefs, desires, and intentions to others in order to understand and predict their behavior.

They tested several state-of-the-art multimodal video LLMs on the MMTOM-QA dataset, including models from the Do LLMs Exhibit Human-like Reasoning? and Emotional Theory of Mind papers. The results showed that while the LLMs could make some inferences about the mental states of individuals in the videos, they struggled with more complex aspects of theory of mind reasoning compared to humans.

The paper also discusses how LLMs' representations of beliefs, desires, and intentions, as explored in the Language Models Represent Beliefs of Self and Others paper, may be limited compared to human-level theory of mind.

Critical Analysis

The paper provides valuable insights into the current limitations of LLMs when it comes to theory of mind reasoning. While the models can make some basic inferences, they fall short of human-level understanding of others' mental states. The authors acknowledge that there is still much work to be done to achieve AI systems with robust theory of mind capabilities.

One potential limitation of the study is the relatively small size of the MMTOM-QA dataset. Expanding the dataset with a greater diversity of video clips and theory of mind scenarios could provide more comprehensive insights. Additionally, the paper does not explore how different model architectures or training approaches might impact theory of mind performance.

Further research is needed to better understand the underlying mechanisms and representations that enable human-level theory of mind, and how to imbue LLMs with similar capabilities. As discussed in the LLM Theory of Mind paper, achieving strong theory of mind in AI systems could have important implications for safety and alignment with human values.

Conclusion

This paper takes an important step towards understanding the limits of current LLMs when it comes to the very human-like ability to "read minds" and reason about the mental states of others. While the models can make some inferences, they still struggle with the nuanced and complex aspects of theory of mind that come so naturally to humans.

Advancing AI systems' theory of mind capabilities could have significant implications for fields like psychology, cognitive science, and human-AI interaction. The insights from this paper highlight the need for continued research to develop LLMs that can better understand and predict human behavior by "seeing through the theory of mind's eye."

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MMToM-QA: Multimodal Theory of Mind Question Answering

Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

6/18/2024

cs.AI cs.CL cs.CV cs.LG

Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses

Maryam Amirizaniani, Elias Martin, Maryna Sivachenko, Afra Mashhadi, Chirag Shah

Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts, which is vital for guiding one's own thought processes. Although large language models (LLMs) excel in tasks such as summarization, question answering, and translation, they still face challenges with ToM reasoning, especially in open-ended questions. Despite advancements, the extent to which LLMs truly understand ToM reasoning and how closely it aligns with human ToM reasoning remains inadequately explored in open-ended scenarios. Motivated by this gap, we assess the abilities of LLMs to perceive and integrate human intentions and emotions into their ToM reasoning processes within open-ended questions. Our study utilizes posts from Reddit's ChangeMyView platform, which demands nuanced social reasoning to craft persuasive responses. Our analysis, comparing semantic similarity and lexical overlap metrics between responses generated by humans and LLMs, reveals clear disparities in ToM reasoning capabilities in open-ended questions, with even the most advanced models showing notable limitations. To enhance LLM capabilities, we implement a prompt tuning method that incorporates human intentions and emotions, resulting in improvements in ToM reasoning performance. However, despite these improvements, the enhancement still falls short of fully achieving human-like reasoning. This research highlights the deficiencies in LLMs' social reasoning and demonstrates how integrating human intentions and emotions can boost their effectiveness.

6/11/2024

cs.CL cs.AI

Language Models Represent Beliefs of Self and Others

Wentao Zhu, Zhining Zhang, Yizhou Wang

Understanding and attributing mental states, known as Theory of Mind (ToM), emerges as a fundamental capability for human social reasoning. While Large Language Models (LLMs) appear to possess certain ToM abilities, the mechanisms underlying these capabilities remain elusive. In this study, we discover that it is possible to linearly decode the belief status from the perspectives of various agents through neural activations of language models, indicating the existence of internal representations of self and others' beliefs. By manipulating these representations, we observe dramatic changes in the models' ToM performance, underscoring their pivotal role in the social reasoning process. Additionally, our findings extend to diverse social reasoning tasks that involve different causal inference patterns, suggesting the potential generalizability of these representations.

5/31/2024

cs.AI cs.CL

⚙️

Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning

Yasaman Etesam, Ozge Nilay Yalc{c}{i}n, Chuxuan Zhang, Angelica Lim

The emotional theory of mind problem requires facial expressions, body pose, contextual information and implicit commonsense knowledge to reason about the person's emotion and its causes, making it currently one of the most difficult problems in affective computing. In this work, we propose multiple methods to incorporate the emotional reasoning capabilities by constructing narrative captions relevant to emotion perception, that includes contextual and physical signal descriptors that focuses on Who, What, Where and How questions related to the image and emotions of the individual. We propose two distinct ways to construct these captions using zero-shot classifiers (CLIP) and fine-tuning visual-language models (LLaVA) over human generated descriptors. We further utilize these captions to guide the reasoning of language (GPT-4) and vision-language models (LLaVa, GPT-Vision). We evaluate the use of the resulting models in an image-to-language-to-emotion task. Our experiments showed that combining the Fast narrative descriptors and Slow reasoning of language models is a promising way to achieve emotional theory of mind.

6/18/2024

cs.CV