SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset

2405.07354

Published 5/14/2024 by Sushant Gautam, Mehdi Houshmand Sarkhoosh, Jan Held, Cise Midoglu, Anthony Cioppa, Silvio Giancola, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen, Mubarak Shah

cs.SD cs.IR cs.LG cs.MM eess.AS

Abstract

The application of Automatic Speech Recognition (ASR) technology in soccer offers numerous opportunities for sports analytics. Specifically, extracting audio commentaries with ASR provides valuable insights into the events of the game, and opens the door to several downstream applications such as automatic highlight generation. This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts, enhancing video content with rich layers of textual information derived from the game audio using ASR. These textual commentaries, generated using the Whisper model and translated with Google Translate, extend the usefulness of the SoccerNet dataset in diverse applications such as enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games. We detail the methods involved in the curation of this dataset and the integration of ASR. We also highlight the implications of a multimodal approach in sports analytics, and how the enriched dataset can support diverse applications, thus broadening the scope of research and development in the field of sports analytics.

Overview

This paper explores the application of Automatic Speech Recognition (ASR) technology in soccer analytics.
It presents SoccerNet-Echoes, a dataset that augments the existing SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts.
The textual commentaries, generated using the Whisper model and translated with Google Translate, provide valuable insights into the events of the game and enable diverse applications such as enhanced action spotting, automatic caption generation, and game summarization.

Plain English Explanation

Soccer is loved by millions of people around the world. Broadcasts of a game usually include commentators who provide play-by-play description and analysis, and this commentary carries a wealth of information about what is happening on the field.

The researchers in this paper wanted to see if they could use Automatic Speech Recognition (ASR) technology to automatically transcribe these commentaries and add them to a dataset of soccer videos called SoccerNet. By doing this, they could create a more comprehensive dataset that includes not only the video of the game, but also the textual information from the commentaries.

This textual information could then be used to develop advanced algorithms that can better understand and analyze the dynamics of a soccer game. For example, the algorithms could be used to automatically generate highlights or summaries of the game, or to understand the different tactics and strategies used by the teams.

The researchers also wanted to make the dataset more accessible to a global audience, so they used Google Translate to translate the commentaries into different languages. This means that researchers from around the world can use the dataset to develop their own soccer analytics tools and applications.

Technical Explanation

The researchers used the Whisper model, a state-of-the-art ASR system, to automatically transcribe the audio commentaries from the soccer game broadcasts. They then used Google Translate to translate the transcriptions into multiple languages, creating a multilingual dataset.
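The two-stage pipeline described above (transcribe, then translate) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: `transcribe` and `translate` are hypothetical callables supplied by the caller, and the `start`/`end`/`text` fields are an assumed shape for timestamped ASR output (similar to the `segments` list returned by the open-source `whisper` package's `transcribe` method). The stand-in functions and the file name exist only so the sketch runs without a model or network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CommentarySegment:
    start: float       # segment start time in seconds
    end: float         # segment end time in seconds
    text: str          # ASR transcript in the spoken language
    translation: str   # machine translation of `text`


def build_commentary(
    audio_path: str,
    transcribe: Callable[[str], List[Dict]],
    translate: Callable[[str], str],
) -> List[CommentarySegment]:
    """Transcribe one broadcast, then translate each non-empty segment."""
    segments = []
    for raw in transcribe(audio_path):
        text = raw["text"].strip()
        if not text:
            continue  # skip segments that contain only whitespace
        segments.append(
            CommentarySegment(
                start=float(raw["start"]),
                end=float(raw["end"]),
                text=text,
                translation=translate(text),
            )
        )
    return segments


# Hypothetical stand-ins for an ASR model and a translation service.
def fake_transcribe(path: str) -> List[Dict]:
    return [
        {"start": 0.0, "end": 2.5, "text": " Gol de Messi! "},
        {"start": 2.5, "end": 3.0, "text": "   "},
    ]


def fake_translate(text: str) -> str:
    return {"Gol de Messi!": "Goal by Messi!"}.get(text, text)


commentary = build_commentary("game_half1.wav", fake_transcribe, fake_translate)
```

Passing the two stages in as callables keeps the sketch independent of any particular model version or translation client; a real run would swap the stand-ins for wrappers around the chosen tools.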

The resulting SoccerNet-Echoes dataset includes the original video footage from the SoccerNet dataset, as well as the automatically generated textual commentaries in multiple languages. This allows researchers to leverage both the visual and textual information to develop more sophisticated sports analytics algorithms.

The researchers highlight several potential applications of the SoccerNet-Echoes dataset, including enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, the dataset aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games.
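As one illustration of how timestamped commentary might support action spotting, the sketch below looks up the commentary spoken around a candidate event and flags segments that mention event-related words. The segment layout `(start, end, text)`, the 10-second window, and the keyword list are all assumptions made for the example; the source does not describe a specific method.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def commentary_near(
    segments: List[Segment], event_time: float, window: float = 10.0
) -> List[str]:
    """Return the text of segments overlapping event_time +/- window seconds."""
    lo, hi = event_time - window, event_time + window
    return [text for start, end, text in segments if start <= hi and end >= lo]


def keyword_candidates(
    segments: List[Segment], keywords: Sequence[str]
) -> List[float]:
    """Return start times of segments mentioning any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [
        start
        for start, _end, text in segments
        if any(k in text.lower() for k in wanted)
    ]


# Made-up commentary segments for illustration only.
segments = [
    (12.0, 15.0, "Long ball forward."),
    (61.0, 64.5, "What a goal!"),
    (130.0, 133.0, "Yellow card for the defender."),
]

nearby = commentary_near(segments, event_time=60.0)
candidates = keyword_candidates(segments, ["goal", "card"])
```

Simple substring matching like this would only be a weak signal, since commentators often describe an event some seconds after it happens; a learned model would more plausibly consume the text as an extra input alongside the video.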

Critical Analysis

The researchers acknowledge that the quality of the automatically generated transcriptions and translations may not be perfect, and that there is room for improvement in the ASR and machine translation technologies used. They also note that the dataset is limited to a specific set of soccer games and may not be representative of all soccer matches.
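Transcription quality of this kind is commonly quantified with word error rate (WER), the word-level edit distance between a human reference transcript and the ASR output, divided by the reference length. The sketch below is a generic implementation of that standard metric; the source does not say whether or how the authors measured it, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("goal" -> "gold") and one deletion ("the") over 6 words.
wer = word_error_rate("what a goal by the striker", "what a gold by striker")
```

Computing WER requires manually verified reference transcripts for at least a sample of games, which is itself a cost to weigh when assessing an automatically generated dataset.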

Additionally, the researchers do not address the potential privacy and ethical concerns associated with using broadcast commentaries without the explicit consent of the commentators or the broadcasting companies. This is an important consideration that should be addressed in future work.

Despite these limitations, the SoccerNet-Echoes dataset represents a significant step forward in the field of sports analytics, and the researchers' multimodal approach to understanding the dynamics of soccer games is a promising avenue for further research.

Conclusion

This paper presents an approach that uses Automatic Speech Recognition technology to enhance sports analytics. By augmenting the SoccerNet dataset with automatically generated textual commentaries, the authors have created a valuable resource for researchers and developers working on a wide range of applications, from enhanced action spotting to automatic game summarization.

The multimodal nature of the SoccerNet-Echoes dataset opens up new possibilities for understanding the complex dynamics of soccer games, and the researchers' attention to making the dataset accessible to a global audience is commendable. While there are still some challenges to address, this work represents an important step forward in the field of sports analytics and the application of advanced speech recognition technologies to enhance our understanding of the world's most popular sport.

Related Papers

MatchTime: Towards Automatic Soccer Game Commentary Generation

Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, Weidi Xie

Soccer is a globally popular sport with a vast audience, in this paper, we consider constructing an automatic soccer game commentary model to improve the audiences' viewing experience. In general, we make the following contributions: First, observing the prevalent video-text misalignment in existing datasets, we manually annotate timestamps for 49 matches, establishing a more robust benchmark for soccer game commentary generation, termed as SN-Caption-test-align; Second, we propose a multi-modal temporal alignment pipeline to automatically correct and filter the existing dataset at scale, creating a higher-quality soccer game commentary dataset for training, denoted as MatchTime; Third, based on our curated dataset, we train an automatic commentary generation model, named MatchVoice. Extensive experiments and ablation studies have demonstrated the effectiveness of our alignment pipeline, and training model on the curated datasets achieves state-of-the-art performance for commentary generation, showcasing that better alignment can lead to significant performance improvements in downstream tasks.

6/27/2024

cs.CV

Commentary Generation from Data Records of Multiplayer Strategy Esports Game

Zihan Wang, Naoki Yoshinaga

Esports, a sports competition on video games, has become one of the most important sporting events. Although esports play logs have been accumulated, only a small portion of them accompany text commentaries for the audience to retrieve and understand the plays. In this study, we therefore introduce the task of generating game commentaries from esports' data records. We first build large-scale esports data-to-text datasets that pair structured data and commentaries from a popular esports game, League of Legends. We then evaluate Transformer-based models to generate game commentaries from structured data records, while examining the impact of the pre-trained language models. Evaluation results on our dataset revealed the challenges of this novel task. We will release our dataset to boost potential research in the data-to-text generation community.

5/9/2024

cs.CL

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

Ara Yeroyan (Data Science Department, American University of Armenia), Nikolay Karpov (Nvidia, NeMo Conversational AI team)

In recent years, automatic speech recognition (ASR) systems have significantly improved, especially in languages with a vast amount of transcribed speech data. However, ASR systems tend to perform poorly for low-resource languages with fewer resources, such as minority and regional languages. This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audios. The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments, whereas optimal ASR training requires segments ranging from 4 to 15 seconds. To address this, we propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training. Our approach simplifies data preparation for ASR systems in low-resource languages and demonstrates its application through a case study involving the Armenian language. Our method, which is portable to many low-resource languages, not only mitigates the issue of data scarcity but also enhances the performance of ASR models for underrepresented languages.

6/4/2024

cs.CL cs.LG eess.AS eess.SP

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann

We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling in 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics. In addition, we conduct a listening test with 20 participants for the speech enhancement task, where a generative method is preferred. We introduce a blind test set that allows for automatic online evaluation of uploaded data. Dataset download links and automatic evaluation server can be found online.

6/13/2024

eess.AS cs.LG cs.SD