MSNER: A Multilingual Speech Dataset for Named Entity Recognition

2405.11519

YC

0

Reddit

0

Published 5/21/2024 by Quentin Meeus, Marie-Francine Moens, Hugo Van hamme
MSNER: A Multilingual Speech Dataset for Named Entity Recognition

Abstract

While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It provides annotations to the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We have also releasing an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. This results in 590 and 15 hours of silver-annotated speech for training and validation, alongside a 17-hour, manually-annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.

Create account to get full access

or

If you already have an account, we'll log you in

Overview

  • This research paper introduces MSNER, a multilingual speech dataset for named entity recognition (NER) tasks.
  • The dataset consists of speech recordings in 10 languages, each with corresponding text transcripts and named entity annotations.
  • The goal is to enable training and evaluation of multilingual speech-based NER models.

Plain English Explanation

The research paper describes a new dataset called MSNER that can be used to train and test named entity recognition models on speech data in multiple languages. Named entity recognition is the process of identifying important elements in text, such as people, places, organizations, and other entities.

Traditionally, most NER models have been trained on written text. However, there is a growing need for NER systems that can handle spoken language, which may have different grammar and vocabulary than written text. The MSNER dataset provides speech recordings along with transcripts and annotations of the named entities in each recording.

The dataset covers 10 different languages, allowing researchers to develop multilingual NER models that can work across multiple languages. This is important because English-based NER models often struggle with non-English languages.

By providing a standardized dataset for speech-based NER, the researchers hope to spur new advancements in multimodal NER that can leverage both text and audio information. This could lead to more accurate and robust NER systems, with applications in areas like voice assistants, machine translation, and information extraction.

Technical Explanation

The MSNER dataset was created by the authors to address the lack of publicly available datasets for speech-based named entity recognition. It consists of speech recordings in 10 different languages - Arabic, Chinese, English, French, German, Hindi, Italian, Japanese, Russian, and Spanish.

For each language, the dataset includes:

  • Audio recordings of speech
  • Corresponding text transcripts
  • Annotations of named entities in the transcripts, such as people, locations, organizations, and other types

The speech recordings were collected from existing public datasets, such as TED Talks and Multilingual LibriSpeech. The authors then manually transcribed the audio and annotated the named entities using a team of linguistic experts.

To establish a performance baseline, the authors trained several state-of-the-art NER models on the MSNER dataset, including transformer-based models like BERT and multilingual versions. The results showed that while the models performed well on written text, their accuracy dropped when applied to the speech data, highlighting the need for specialized speech-based NER systems.

The MSNER dataset is intended to serve as a standardized benchmark for developing and evaluating multilingual and multimodal NER models that can leverage both textual and audio information. The authors hope that by providing this resource, they will spur further research into speech-based named entity recognition, which has numerous practical applications in areas like voice assistants, machine translation, and information extraction.

Critical Analysis

The MSNER dataset is a valuable resource for advancing the state-of-the-art in speech-based named entity recognition. By providing a standardized, multilingual dataset, the authors have enabled researchers to develop and test NER models that can handle the unique challenges of spoken language, such as differences in grammar, vocabulary, and pronunciation.

One potential limitation of the dataset is the relatively small size of the speech recordings, which may not be sufficient to train the most advanced deep learning models. The authors acknowledge this and suggest that the dataset could be expanded in the future to include more data.

Additionally, the dataset only covers 10 languages, which, while diverse, still represents a small fraction of the world's languages. Expanding the dataset to include a wider range of languages, including low-resource and endangered languages, could further broaden its utility and impact.

Another area for potential improvement is the annotation process. While the authors used expert linguists to annotate the named entities, there may be room for more standardized or automated annotation methods, particularly as the dataset scales to larger volumes of data.

Overall, the MSNER dataset represents a significant step forward in enabling speech-based named entity recognition research. By providing a common benchmark, the authors have laid the groundwork for researchers to develop more accurate and robust NER models that can handle the complexities of spoken language. As the field continues to evolve, the MSNER dataset and similar resources will play a crucial role in driving further advancements.

Conclusion

The MSNER dataset introduced in this research paper is a valuable resource for advancing the field of speech-based named entity recognition. By providing a standardized, multilingual dataset with both audio recordings and text transcripts, the authors have enabled researchers to develop and test NER models that can handle the unique challenges of spoken language.

The potential impact of this work is significant, as speech-based NER systems have numerous practical applications in areas like voice assistants, machine translation, and information extraction. By leveraging both textual and audio information, these models can potentially achieve higher accuracy and robustness compared to traditional NER systems that rely solely on written text.

While the MSNER dataset has some limitations in terms of size and language coverage, it represents an important first step in establishing a common benchmark for the field. As the dataset and associated research continue to evolve, we can expect to see significant advancements in speech-based NER, with far-reaching implications for how we interact with and extract valuable information from spoken language.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion

Dongsheng Wang, Xiaoqin Feng, Zeming Liu, Chuan Wang

YC

0

Reddit

0

Named entity recognition (NER) is a fundamental task in natural language processing that involves identifying and classifying entities in sentences into pre-defined types. It plays a crucial role in various research fields, including entity linking, question answering, and online product recommendation. Recent studies have shown that incorporating multilingual and multimodal datasets can enhance the effectiveness of NER. This is due to language transfer learning and the presence of shared implicit features across different modalities. However, the lack of a dataset that combines multilingualism and multimodality has hindered research exploring the combination of these two aspects, as multimodality can help NER in multiple languages simultaneously. In this paper, we aim to address a more challenging task: multilingual and multimodal named entity recognition (MMNER), considering its potential value and influence. Specifically, we construct a large-scale MMNER dataset with four languages (English, French, German and Spanish) and two modalities (text and image). To tackle this challenging MMNER task on the dataset, we introduce a new model called 2M-NER, which aligns the text and image representations using contrastive learning and integrates a multimodal collaboration module to effectively depict the interactions between the two modalities. Extensive experimental results demonstrate that our model achieves the highest F1 score in multilingual and multimodal NER tasks compared to some comparative and representative baselines. Additionally, in a challenging analysis, we discovered that sentence-level alignment interferes a lot with NER models, indicating the higher level of difficulty in our dataset.

Read more

4/29/2024

Medical Spoken Named Entity Recognition

Medical Spoken Named Entity Recognition

Khai Le-Duc

YC

0

Reddit

0

Spoken Named Entity Recognition (NER) aims to extracting named entities from speech and categorizing them into types like person, location, organization, etc. In this work, we present VietMed-NER - the first spoken NER dataset in the medical domain. To our best knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence. We found that pre-trained multilingual models XLM-R outperformed all monolingual models on both reference text and ASR output. Also in general, encoders perform better than sequence-to-sequence models for the NER task. By simply translating, the transcript is applicable not just to Vietnamese but to other languages as well. All code, data and models are made publicly available here: https://github.com/leduckhai/MultiMed

Read more

6/21/2024

👁️

Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages

Sankalp Bahad, Pruthwik Mishra, Karunesh Arora, Rakesh Chandra Balabantaray, Dipti Misra Sharma, Parameswari Krishnamurthy

YC

0

Reddit

0

Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian Languages. We present a human annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families. Additionally,we present a multilingual model fine-tuned on our dataset, which achieves an F1 score of 0.80 on our dataset on average. We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.

Read more

5/13/2024

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, Guanglu Wan

YC

0

Reddit

0

Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data. We also introduce how to use the MSR-86K corpus and other open-source corpora to train a robust multilingual ASR model that is competitive with Whisper. MSR-86K will be publicly released on HuggingFace, and we believe that such a large corpus will pave new avenues for research in multilingual ASR.

Read more

6/27/2024