Identifying Narrative Patterns and Outliers in Holocaust Testimonies Using Topic Modeling

Read original: arXiv:2405.02650 - Published 5/7/2024 by Maxim Ifergan, Renana Keydar, Omri Abend, Amit Pinchevski

Overview

This paper explores the use of topic modeling to identify narrative patterns and outliers in a large corpus of Holocaust testimonies.
The researchers analyzed over 50,000 testimonies from the USC Shoah Foundation's Visual History Archive to uncover common themes and unique stories within the data.
The findings provide insights into the experiences and perspectives of Holocaust survivors, which could aid in historical and psychological research on the topic.

Plain English Explanation

The researchers in this study analyzed a large collection of personal accounts from Holocaust survivors to see if they could identify any common patterns or unique stories. They used a technique called "topic modeling" to automatically detect themes and topics that appeared throughout the testimonies.

By applying this data-driven approach, the researchers were able to find recurring narrative structures as well as outliers: stories that stood out as being quite different from the majority. This could help historians and psychologists better understand the diverse experiences of Holocaust survivors and how they chose to recount their stories.

For example, the analysis might reveal that many survivors talked about the specific challenges of hiding from the Nazis, while others focused more on the trauma of being separated from their families. The researchers could then dive deeper into these themes to uncover important insights.

Overall, this work demonstrates how advanced text analysis techniques can shed new light on large-scale historical collections, moving beyond individual accounts to identify broader patterns and noteworthy exceptions.

Technical Explanation

According to the paper's abstract, the researchers treated each testimony as a sequence of question-and-answer sections and applied topic modeling to those sections, experimenting with BERTopic, a method that builds on recent advances in language modeling. Like older approaches such as Latent Dirichlet Allocation (LDA), it is designed to automatically discover the underlying "topics" that are present across a large set of documents, where each topic is represented by a cluster of semantically related words.
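To make the idea of a topic as a ranked list of characteristic words concrete, here is a minimal sketch of class-based TF-IDF, the word-weighting scheme that the BERTopic documentation describes for labelling clusters of documents. It is an illustration under stated assumptions and not the authors' pipeline: the cluster labels, the tiny tokenized "sections", and the function names `class_tfidf` and `top_words` are all hypothetical, and the clusters are assigned by hand instead of being found by embedding and clustering.

```python
import math
from collections import Counter


def class_tfidf(clusters):
    """Score words per cluster with a class-based TF-IDF weighting.

    clusters: dict mapping a cluster label to a list of tokenized documents.
    Returns a dict mapping each label to {word: score}.
    """
    # Merge every cluster's documents into one bag of words per cluster.
    counts = {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in clusters.items()
    }
    # Frequency of each word across all clusters combined.
    total = Counter()
    for bag in counts.values():
        total.update(bag)
    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for label, bag in counts.items():
        n_words = sum(bag.values())
        # Normalized in-cluster frequency, damped for words common everywhere.
        scores[label] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in bag.items()
        }
    return scores


def top_words(scores, k=3):
    """Return the k highest-scoring words per cluster (ties broken A-Z)."""
    return {
        label: [w for w, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, s in scores.items()
    }


# Hypothetical, hand-clustered toy sections (stop words already removed).
toy_clusters = {
    "hiding": [["hid", "cellar", "farm"], ["farmer", "hid", "family"]],
    "separation": [["mother", "sister", "taken"], ["family", "mother", "separated"]],
}
print(top_words(class_tfidf(toy_clusters), k=2))
```

Words that occur in every cluster, such as "family" in the toy data, receive a lower weight than words concentrated in one cluster, which is why the top-ranked words tend to read as a label for the theme.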

By applying topic modeling to the testimony transcripts, the researchers were able to identify the most prominent topical themes that emerged, as well as quantify how prevalent each theme was across the full corpus. This allowed them to map out the narrative landscape and identify testimonies that diverged significantly from the main patterns.
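One simple way to picture "diverging from the main patterns" is to compare each testimony's topic distribution with the corpus average. The sketch below does this with the Jensen-Shannon divergence. It is only an illustration: the choice of divergence, the function names, and the toy distributions are assumptions, and the authors' own outlier method (which, per the abstract, looks for testimonies whose topic distributions resemble those of other groups) may differ.

```python
import math


def mean_distribution(dists):
    """Element-wise mean of equally long topic distributions."""
    k = len(dists[0])
    return [sum(d[i] for d in dists) / len(dists) for i in range(k)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_outliers(testimonies):
    """Sort testimony ids by divergence from the corpus mean, largest first.

    testimonies: dict mapping an id to a topic distribution summing to 1.
    Returns a list of (divergence, id) pairs.
    """
    centroid = mean_distribution(list(testimonies.values()))
    scored = [(js_divergence(d, centroid), tid) for tid, d in testimonies.items()]
    return sorted(scored, reverse=True)


# Hypothetical topic shares over three topics for four testimonies.
toy = {
    "t1": [0.70, 0.20, 0.10],
    "t2": [0.60, 0.30, 0.10],
    "t3": [0.65, 0.25, 0.10],
    "t4": [0.05, 0.05, 0.90],
}
for score, tid in rank_outliers(toy):
    print(tid, round(score, 3))
```

The corpus mean also doubles as a prevalence estimate: each entry of `mean_distribution` is the average share of one topic across all testimonies.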

The researchers also conducted additional statistical analyses to further characterize the corpus, such as examining the distribution of testimony lengths, the most common named entities referenced, and changes in topical focus over time. These corpus-level insights provided helpful context for interpreting the topic modeling results.
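The paper's abstract describes aligning testimony sections into fixed parts to show how topics evolve across a testimony. A minimal sketch of that idea follows, assuming each testimony is already an ordered list of per-section topic labels. The number of parts, the topic names, and both function names are hypothetical, and the authors' actual alignment procedure may be more involved.

```python
from collections import Counter


def topics_by_part(section_topics, n_parts=4):
    """Bin one testimony's ordered section topics into n_parts equal slices.

    Returns a list of n_parts Counters holding topic counts per slice.
    """
    n = len(section_topics)
    parts = [Counter() for _ in range(n_parts)]
    for i, topic in enumerate(section_topics):
        # Map position i in [0, n) to a slice index in [0, n_parts).
        parts[i * n_parts // n][topic] += 1
    return parts


def corpus_evolution(testimonies, n_parts=4):
    """Topic shares per slice, pooled over all testimonies.

    Sections are pooled directly, so longer testimonies weigh more.
    Returns a list of {topic: share} dicts, one per slice.
    """
    totals = [Counter() for _ in range(n_parts)]
    for topics in testimonies:
        for pooled, counts in zip(totals, topics_by_part(topics, n_parts)):
            pooled.update(counts)
    evolution = []
    for pooled in totals:
        size = sum(pooled.values())
        evolution.append({t: c / size for t, c in pooled.items()} if size else {})
    return evolution


# Hypothetical per-section topic labels for two testimonies.
toy_corpus = [
    ["prewar", "prewar", "ghetto", "camp", "camp", "liberation",
     "emigration", "emigration"],
    ["prewar", "ghetto", "camp", "liberation"],
]
for index, shares in enumerate(corpus_evolution(toy_corpus)):
    print(index, shares)
```

Comparing the per-slice shares of one subgroup (for example, by age or gender) against another would be one way to surface the kind of divergence from a common narrative schema that the abstract mentions.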

Overall, the study demonstrates the value of leveraging natural language processing and machine learning techniques to gain a more holistic, data-driven understanding of large-scale historical archives like the USC Shoah Foundation's Visual History Archive.

Critical Analysis

The researchers acknowledge several limitations in their work, including the fact that the topic modeling approach relies on certain assumptions and parameter choices that can impact the resulting themes. There is also the potential for bias in the testimony transcripts themselves, as the collection may not be fully representative of all Holocaust survivor experiences.

Additionally, while the topic modeling reveals high-level narrative patterns, it does not provide the rich contextual details and nuances that would be needed for a deeper qualitative analysis. The researchers recommend combining this computational approach with close readings of individual testimonies to obtain a more holistic understanding.

Further research could explore ways to incorporate additional metadata (e.g., survivor demographics, interview settings) into the analyses, as well as investigate how the topical structures and outlier narratives evolve over time as new testimonies are added to the archive. Benchmarking against other text analysis methods could also help validate the insights gained from the topic modeling.

Overall, this study represents an important step towards leveraging large-scale historical data in more systematic and scalable ways. By combining computational techniques with domain expertise, researchers can uncover patterns and outliers that might otherwise be difficult to detect through traditional qualitative approaches alone.

Conclusion

This paper demonstrates the potential of topic modeling to provide a data-driven lens for analyzing a vast corpus of Holocaust testimonies. By surfacing common narrative themes as well as unique survivor stories, the researchers have laid the groundwork for further historical and psychological investigation into the experiences and perspectives of those who lived through this tragic period.

While the computational approach has limitations, it serves as a valuable complement to more traditional qualitative analyses. By integrating these different methodologies, scholars can gain a richer, more holistic understanding of large-scale historical archives and the human stories they contain.

Ultimately, this work highlights the power of applying advanced text analysis techniques to unlock insights from massive collections of personal narratives. As historians, psychologists, and other researchers continue to digitize and preserve such invaluable sources, tools like topic modeling will become increasingly important for making sense of these vast troves of data.

Related Papers

Identifying Narrative Patterns and Outliers in Holocaust Testimonies Using Topic Modeling

Maxim Ifergan, Renana Keydar, Omri Abend, Amit Pinchevski

The vast collection of Holocaust survivor testimonies presents invaluable historical insights but poses challenges for manual analysis. This paper leverages advanced Natural Language Processing (NLP) techniques to explore the USC Shoah Foundation Holocaust testimony corpus. By treating testimonies as structured question-and-answer sections, we apply topic modeling to identify key themes. We experiment with BERTopic, which leverages recent advances in language modeling technology. We align testimony sections into fixed parts, revealing the evolution of topics across the corpus of testimonies. This highlights both a common narrative schema and divergences between subgroups based on age and gender. We introduce a novel method to identify testimonies within groups that exhibit atypical topic distributions resembling those of other groups. This study offers unique insights into the complex narratives of Holocaust survivors, demonstrating the power of NLP to illuminate historical discourse and identify potential deviations in survivor experiences.

5/7/2024

Mapping News Narratives Using LLMs and Narrative-Structured Text Embeddings

Jan Elfes

Given the profound impact of narratives across various societal levels, from personal identities to international politics, it is crucial to understand their distribution and development over time. This is particularly important in online spaces. On the Web, narratives can spread rapidly and intensify societal divides and conflicts. While many qualitative approaches exist, quantifying narratives remains a significant challenge. Computational narrative analysis lacks frameworks that are both comprehensive and generalizable. To address this gap, we introduce a numerical narrative representation grounded in structuralist linguistic theory. Chiefly, Greimas' Actantial Model represents a narrative through a constellation of six functional character roles. These so-called actants are genre-agnostic, making the model highly generalizable. We extract the actants using an open-source LLM and integrate them into a Narrative-Structured Text Embedding that captures both the semantics and narrative structure of a text. We demonstrate the analytical insights of the method on the example of 5000 full-text news articles from Al Jazeera and The Washington Post on the Israel-Palestine conflict. Our method successfully distinguishes articles that cover the same topics but differ in narrative structure.

9/11/2024

Analyzing Narrative Processing in Large Language Models (LLMs): Using GPT4 to test BERT

Patrick Krauss, Jannik Hosch, Claus Metzner, Andreas Maier, Peter Uhrig, Achim Schilling

The ability to transmit and receive complex information via language is unique to humans and is the basis of traditions, culture and versatile social interactions. Through the disruptive introduction of transformer based large language models (LLMs) humans are not the only entity to understand and produce language any more. In the present study, we have performed the first steps to use LLMs as a model to understand fundamental mechanisms of language processing in neural networks, in order to make predictions and generate hypotheses on how the human brain does language processing. Thus, we have used ChatGPT to generate seven different stylistic variations of ten different narratives (Aesop's fables). We used these stories as input for the open source LLM BERT and have analyzed the activation patterns of the hidden units of BERT using multi-dimensional scaling and cluster analysis. We found that the activation vectors of the hidden units cluster according to stylistic variations in earlier layers of BERT (1) than narrative content (4-5). Despite the fact that BERT consists of 12 identical building blocks that are stacked and trained on large text corpora, the different layers perform different tasks. This is a very useful model of the human brain, where self-similar structures, i.e. different areas of the cerebral cortex, can have different functions and are therefore well suited to processing language in a very efficient way. The proposed approach has the potential to open the black box of LLMs on the one hand, and might be a further step to unravel the neural processes underlying human language processing and cognition in general.

5/6/2024

Unveiling the Potential of BERTopic for Multilingual Fake News Analysis -- Use Case: Covid-19

Karla Schafer, Jeong-Eun Choi, Inna Vogel, Martin Steinebach

Topic modeling is frequently being used for analysing large text corpora such as news articles or social media data. BERTopic, consisting of sentence embedding, dimension reduction, clustering, and topic extraction, is the newest and currently the SOTA topic modeling method. However, current topic modeling methods have room for improvement because, as unsupervised methods, they require careful tuning and selection of hyperparameters, e.g., for dimension reduction and clustering. This paper aims to analyse the technical application of BERTopic in practice. For this purpose, it compares and selects different methods and hyperparameters for each stage of BERTopic through density based clustering validation and six different topic coherence measures. Moreover, it also aims to analyse the results of topic modeling on real world data as a use case. For this purpose, the German fake news dataset (GermanFakeNCovid) on Covid-19 was created by us and in order to experiment with topic modeling in a multilingual (English and German) setting combined with the FakeCovid dataset. With the final results, we were able to determine thematic similarities between the United States and Germany. Whereas, distinguishing the topics of fake news from India proved to be more challenging.

7/12/2024