Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?

Read original: arXiv:2311.07564 - Published 6/17/2024 by Cristina Aggazzotti, Nicholas Andrews, Elizabeth Allyn Smith


Overview

This paper explores authorship verification for transcribed speech, which poses different challenges from written text.
Many stylistic features used in written text attribution, such as punctuation and capitalization, are not informative for transcribed speech, but transcripts exhibit other patterns, such as filler words and backchannels, that may be characteristic of different speakers.
The paper proposes a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts, controlling for topic to avoid spurious associations.
The authors establish the state of the art on this benchmark by comparing neural and non-neural baselines, finding that written text attribution models do well in some settings but perform markedly worse as conversational topic is increasingly controlled.

Plain English Explanation

Authorship verification is the task of determining whether two writing samples were written by the same person. This paper explores a new variant of that task: verifying who produced transcribed speech rather than written text.

When analyzing written text, things like punctuation and capitalization can provide clues about the author. But these stylistic features aren't as informative for transcribed speech. However, transcribed speech does have other patterns, like use of filler words and backchannels (e.g. "um," "uh-huh"), that may be characteristic of different speakers.
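To make the idea of speaker-specific patterns concrete, here is a minimal sketch that measures how often a speaker uses a handful of fillers and backchannels. The marker list, the tokenization, and the example utterances are all assumptions chosen for illustration; none of them come from the paper.

```python
import re
from collections import Counter

# Hypothetical marker list for illustration; not taken from the paper.
MARKERS = ("um", "uh", "uh-huh", "mm-hmm", "yeah")

def marker_rates(utterances):
    """Return each marker's frequency per 100 tokens across a speaker's utterances."""
    tokens = []
    for utt in utterances:
        # Lowercase and keep hyphenated tokens such as "uh-huh" intact.
        tokens.extend(re.findall(r"[a-z]+(?:-[a-z]+)*", utt.lower()))
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return {m: 100.0 * counts[m] / total for m in MARKERS}

# Made-up utterances from two hypothetical speakers.
speaker_a = ["um I think so", "uh-huh", "um yeah that was um great"]
speaker_b = ["I think so", "yeah", "that was great"]

print(marker_rates(speaker_a))
print(marker_rates(speaker_b))
```

Two speakers saying roughly the same thing can end up with quite different profiles, which is the kind of signal an attribution model might pick up once punctuation and capitalization are no longer available.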

To study this, the researchers built a new benchmark from human-transcribed conversational speech. They designed it to control for the topic of the conversations, so that any speaker differences a model detects are not simply due to people talking about different subjects.

The researchers then tested a variety of neural and non-neural models for attributing the transcribed speech to different speakers. They found that models trained on written text performed surprisingly well in some cases, but their performance dropped significantly as the conversational topic was more controlled.

The paper also looks at how the style of the transcription itself impacts the model performance, and whether fine-tuning on speech transcripts can improve the results.

Technical Explanation

The paper proposes a new benchmark for speaker attribution focused on human-transcribed conversational speech. To construct this benchmark, the authors used conversation prompts and ensured that speakers participated in the same conversations, in order to control for topic and limit spurious associations between speakers and content.
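One way to picture the topic control is as a labeling of verification trials by how much the two samples share. The sketch below is an assumption about what such a construction could look like, not the authors' procedure: the record fields (`speaker`, `conv`, `prompt`, `text`), the three difficulty labels, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: one entry per speaker's side of a conversation.
sides = [
    {"speaker": "A", "conv": 1, "prompt": "pets", "text": "um i have a dog"},
    {"speaker": "B", "conv": 1, "prompt": "pets", "text": "uh-huh me too"},
    {"speaker": "A", "conv": 2, "prompt": "sports", "text": "um i like tennis"},
    {"speaker": "C", "conv": 2, "prompt": "sports", "text": "yeah same here"},
    {"speaker": "D", "conv": 3, "prompt": "pets", "text": "we had cats"},
]

def build_trials(sides):
    """Pair up conversation sides and label each pair as a verification trial."""
    trials = []
    for x, y in combinations(sides, 2):
        same_speaker = x["speaker"] == y["speaker"]
        if same_speaker and x["conv"] == y["conv"]:
            continue  # a side should not be compared with its own conversation
        if same_speaker:
            kind = "positive"
        elif x["conv"] == y["conv"]:
            kind = "same_conversation"  # different speakers, identical topic
        elif x["prompt"] == y["prompt"]:
            kind = "same_prompt"        # different speakers, shared prompt
        else:
            kind = "unrelated"          # different speakers, different prompts
        trials.append({"pair": (x, y), "same_speaker": same_speaker, "kind": kind})
    return trials

print(Counter(t["kind"] for t in build_trials(sides)))
```

In this framing, negative trials drawn from the same conversation are the hardest, because topic no longer separates the two speakers and a model has to rely on how each person talks.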

The authors compare a suite of neural and non-neural baselines on this new benchmark, including models trained on written text as well as models fine-tuned on the speech transcripts. They find that while written text attribution models achieve surprisingly good performance in certain settings, their performance drops markedly as conversational topic is increasingly controlled.
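This summary does not list the specific baselines, so the following is only a stand-in for what a simple non-neural verifier can look like: it scores a pair of transcripts by the cosine similarity of their character trigram counts. The function names and the choice of trigrams are assumptions made for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def same_speaker_score(text_a, text_b, n=3):
    """Higher scores suggest the two transcripts are more likely from one speaker."""
    return cosine(char_ngrams(text_a, n), char_ngrams(text_b, n))

# Made-up transcripts; a real trial would compare much longer samples.
print(same_speaker_score("um i think so yeah", "um yeah i think that works"))
print(same_speaker_score("um i think so yeah", "we had two cats growing up"))
```

A verifier like this makes a decision by thresholding the score. Because character n-grams also capture topical vocabulary, it is easy to see how such a score could look good when topics differ between speakers and degrade once topic is held constant.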

The paper also presents analyses of the impact of transcription style on model performance, as well as the ability of fine-tuning on speech transcripts to improve performance. Overall, the results highlight the novel challenges posed by the attribution of transcribed speech compared to written text.

Critical Analysis

The paper acknowledges some limitations of the proposed benchmark, including the reliance on human-transcribed speech rather than automatically transcribed speech, which may differ in stylistic patterns. Additionally, the authors note that the benchmark only covers a limited set of conversation prompts and speakers, and that further work is needed to test the generalization of the models.

One issue that deserves more attention is bias introduced by the human transcribers, whose own stylistic preferences and idiosyncrasies could be reflected in the transcripts. The paper analyzes the effect of transcription style on performance, but individual transcriber habits could still make it harder to separate a speaker's patterns from a transcriber's.

Additionally, the paper does not explore the potential impact of speaker demographics, such as age, gender, or regional accent, on the performance of the attribution models. These factors could also play a role in shaping the linguistic patterns observed in the transcripts.

Overall, the paper presents a valuable contribution to the field of authorship verification, highlighting the unique challenges and considerations involved in analyzing transcribed speech. The new benchmark and the insights gained from the comparative analysis of baselines provide a solid foundation for future research in this area.

Conclusion

This paper tackles the novel challenge of authorship verification for transcribed speech, which differs from the more typical task of written text attribution. Many stylistic features used in written text attribution are not informative for transcribed speech, while other patterns, such as filler words and backchannels, may be characteristic of different speakers. By building a topic-controlled benchmark and evaluating a range of models, the authors show that written text attribution models perform markedly worse as conversational topic is increasingly controlled.

The findings could matter for applications such as speaker diarization, where accurately attributing speech to individual speakers is crucial, as well as for citation generation and question-answering systems that rely on accurate speaker attribution. The insights from this work can help guide future research and development in these areas.


Related Papers


Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?

Cristina Aggazzotti, Nicholas Andrews, Elizabeth Allyn Smith

Authorship verification is the task of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not informative in this setting. On the other hand, transcribed speech exhibits other patterns, such as filler words and backchannels (e.g., 'um', 'uh-huh'), which may be characteristic of different speakers. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts. To limit spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct verification trials of varying difficulties. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they perform markedly worse as conversational topic is increasingly controlled. We present analyses of the impact of transcription style on performance as well as the ability of fine-tuning on speech transcripts to improve performance.

6/17/2024

Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models

Minh Nguyen, Franck Dernoncourt, Seunghyun Yoon, Hanieh Deilamsalehy, Hao Tan, Ryan Rossi, Quan Hung Tran, Trung Bui, Thien Huu Nguyen

We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives. Despite the advancements in speech recognition, the task of text-based speaker identification (SpeakerID) has received limited attention, lacking large-scale, diverse datasets for effective model training. Addressing these gaps, we present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources. We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names. Through extensive experiments, our best model achieves a great precision of 80.3%, setting a new benchmark for SpeakerID. The data and code are publicly available here: https://github.com/adobe-research/speaker-identification

7/18/2024

Speaker Verification in Agent-Generated Conversations

Yizhe Yang, Palakorn Achananuparp, Heyan Huang, Jing Jiang, Ee-Peng Lim

The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers to enhance their abilities to perform both general and special purpose dialogue tasks. However, the ability to personalize the generated utterances to speakers, whether conducted by human or LLM, has not been well studied. To bridge this gap, our study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker. To this end, we assemble a large dataset collection encompassing thousands of speakers and their utterances. We also develop and evaluate speaker verification models under experiment setups. We further utilize the speaker verification models to evaluate the personalization abilities of LLM-based role-playing models. Comprehensive experiments suggest that the current role-playing models fail in accurately mimicking speakers, primarily due to their inherent linguistic characteristics.

6/7/2024

Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization?

Roshan Sharma, Suwon Shon, Mark Lindsey, Hira Dhamyal, Rita Singh, Bhiksha Raj

Reference summaries for abstractive speech summarization require human annotation, which can be performed by listening to an audio recording or by reading textual transcripts of the recording. In this paper, we examine whether summaries based on annotators listening to the recordings differ from those based on annotators reading transcripts, using existing intrinsic evaluation based on human evaluation, automatic metrics, LLM-based evaluation, and a retrieval-based reference-free method. We find that summaries are indeed different based on the source modality, and that speech-based summaries are more factually consistent and information-selective than transcript-based summaries. Meanwhile, transcript-based summaries are impacted by recognition errors in the source, and expert-written summaries are more informative and reliable. We make all the collected data and analysis code public (https://github.com/cmu-mlsp/interview_humanssum) to facilitate the reproduction of our work and advance research in this area.

8/15/2024