Zero-Shot End-To-End Spoken Question Answering In Medical Domain






Published 6/11/2024 by Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier


In the rapidly evolving landscape of spoken question-answering (SQA), the integration of large language models (LLMs) has emerged as a transformative development. Conventional approaches often entail the use of separate models for question audio transcription and answer selection, resulting in significant resource utilization and error accumulation. To tackle these challenges, we explore the effectiveness of end-to-end (E2E) methodologies for SQA in the medical domain. Our study introduces a novel zero-shot SQA approach, compared to traditional cascade systems. Through a comprehensive evaluation conducted on a new open benchmark of 8 medical tasks and 48 hours of synthetic audio, we demonstrate that our approach requires up to 14.7 times fewer resources than a combined 1.3B parameters LLM with a 1.55B parameters ASR model while improving average accuracy by 0.5%. These findings underscore the potential of E2E methodologies for SQA in resource-constrained contexts.

  • This paper presents a zero-shot end-to-end spoken question answering system for the medical domain.
  • The system is designed to enable users to ask questions in natural spoken language and receive relevant answers directly, without the need for manual transcription or multi-stage processing.
  • The authors build this end-to-end system by integrating pre-trained speech and language models.
  • The system is evaluated on LibriQA, a new open benchmark for spoken question answering that, per the abstract, covers 8 medical tasks and 48 hours of synthetic audio.

Plain English Explanation

The paper describes a new system that can answer medical questions asked out loud, without any need for the questions to be manually typed or transcribed first. The system uses a combination of speech recognition and language models to understand the spoken questions and provide relevant answers directly.

This is a valuable capability, as it allows users to interact with the system more naturally, without the overhead of transcribing their questions. The system leverages recent advancements in integrating speech and language models to achieve this end-to-end functionality.

The authors test the system on a new dataset called LibriQA, which is designed specifically for evaluating spoken question answering in the medical domain. Because the benchmark audio is synthetic, per the abstract, the test is controlled and repeatable, although it is not the same as testing on real recorded speech.

Technical Explanation

The key technical innovation of this paper is the development of a zero-shot end-to-end spoken question answering system for the medical domain. The system takes a spoken question as input and directly outputs the most relevant answer, without requiring any intermediate manual transcription or multi-stage processing.

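To achieve this, the authors leverage recent advancements in integrating pre-trained speech and language models. A conventional cascade, which the paper uses as its point of comparison, runs a speech recognition model to transcribe the spoken question and then feeds the transcript to a separate language model that selects the answer. The end-to-end approach removes that hand-off between separate models, which the abstract identifies as a source of significant resource usage and error accumulation. Because the approach is zero-shot, it is applied to the medical tasks without task-specific fine-tuning.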
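The structural difference between a cascade and an end-to-end system can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, type aliases, and toy scorers below are all hypothetical, and the answer-selection step is assumed to be a simple "score each option, keep the best" rule.

```python
from typing import Callable, Sequence

# Hypothetical type aliases for illustration; none of these come from the paper.
Audio = Sequence[float]                       # raw waveform samples
Transcriber = Callable[[Audio], str]          # ASR model: audio -> transcript
TextScorer = Callable[[str, str], float]      # (transcript, option) -> score
AudioScorer = Callable[[Audio, str], float]   # (audio, option) -> score


def cascade_answer(audio: Audio, options: Sequence[str],
                   transcribe: Transcriber, score_text: TextScorer) -> str:
    """Two-stage baseline: transcribe first, then rank options on the text."""
    transcript = transcribe(audio)  # an ASR mistake here is inherited by stage two
    return max(options, key=lambda option: score_text(transcript, option))


def end_to_end_answer(audio: Audio, options: Sequence[str],
                      score_audio: AudioScorer) -> str:
    """Single-stage variant: one model scores each option against the audio."""
    return max(options, key=lambda option: score_audio(audio, option))


# Toy stand-ins so the sketch runs; real systems would call trained models.
def good_asr(audio: Audio) -> str:
    return "which hormone lowers blood sugar"


def noisy_asr(audio: Audio) -> str:
    return "which hormone lowers blood shooter"  # simulated transcription error


def toy_text_scorer(transcript: str, option: str) -> float:
    return 1.0 if "sugar" in transcript and option == "insulin" else 0.0


def toy_audio_scorer(audio: Audio, option: str) -> float:
    return 1.0 if option == "insulin" else 0.0


audio = [0.0, 0.1, -0.1]
options = ["cortisol", "insulin"]

print(cascade_answer(audio, options, good_asr, toy_text_scorer))
print(cascade_answer(audio, options, noisy_asr, toy_text_scorer))
print(end_to_end_answer(audio, options, toy_audio_scorer))
```

In this toy setup the simulated transcription error removes the keyword the text scorer relies on, so the cascade falls back to the first option, while the single-stage path never depends on a transcript. That is the error-accumulation effect in miniature; it says nothing about how large the effect is in practice.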

The authors evaluate their system on the LibriQA dataset, a novel dataset designed for spoken question answering in the medical domain. According to the abstract, the benchmark is open and covers 8 medical tasks with 48 hours of synthetic audio, which lets the authors assess the system across several tasks under controlled conditions.
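The abstract reports an average accuracy over several medical tasks. A minimal sketch of how such a figure is commonly computed follows; it assumes an unweighted (macro) mean over per-task accuracies, which this summary does not confirm, and the task names and answers are invented for illustration.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of questions where the predicted option matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty task")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def macro_average(per_task_accuracy: Dict[str, float]) -> float:
    """Unweighted mean over tasks (ASSUMPTION: every task counts equally)."""
    if not per_task_accuracy:
        raise ValueError("at least one task is required")
    return sum(per_task_accuracy.values()) / len(per_task_accuracy)


# Hypothetical task names and answers, invented for illustration only.
results = {
    "task_a": accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]),
    "task_b": accuracy(["A", "A"], ["B", "A"]),
}
print(results, macro_average(results))
```

A macro average keeps a small task from being drowned out by a large one; a question-weighted average would instead pool all questions before dividing, and the two can differ when task sizes are uneven.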

The results support the proposed zero-shot end-to-end approach on the LibriQA benchmark: according to the abstract, it requires up to 14.7 times fewer resources than a cascade combining a 1.3B-parameter LLM with a 1.55B-parameter ASR model, while improving average accuracy by 0.5%. This suggests that the integration of speech and language models can be a powerful technique for building natural, conversational interfaces for medical information access.
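The resource figures quoted in the abstract can be made concrete with a little arithmetic. The sketch below adds up the two cascade models and then asks what a 14.7-fold reduction would mean if "resources" were measured purely in parameter count. That last step is an assumption made only for illustration: this summary does not say which resource the authors measured.

```python
LLM_PARAMS_M = 1300      # 1.3B-parameter LLM, from the abstract (in millions)
ASR_PARAMS_M = 1550      # 1.55B-parameter ASR model, from the abstract (in millions)
REDUCTION_FACTOR = 14.7  # "up to 14.7 times fewer resources", from the abstract


def cascade_params_millions() -> int:
    """Combined size of the two-model cascade, in millions of parameters."""
    return LLM_PARAMS_M + ASR_PARAMS_M


def equivalent_budget_millions(factor: float = REDUCTION_FACTOR) -> float:
    """Cascade size divided by the reduction factor.

    ASSUMPTION: treats "resources" as parameter count, which the summary
    does not state; the real metric could be memory, compute, or something else.
    """
    return cascade_params_millions() / factor


print(f"cascade: {cascade_params_millions()}M parameters")
print(f"budget under the parameter-count assumption: {equivalent_budget_millions():.1f}M")
```

Under that assumption the cascade totals 2.85B parameters, and the reduced budget would be on the order of a couple of hundred million. Treat this only as a way to read the ratio, not as the size of the authors' model.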

Critical Analysis

The paper presents a compelling approach to spoken question answering in the medical domain, but there are a few potential limitations and areas for further research worth considering.

First, the authors acknowledge that the LibriQA dataset, while a valuable resource, is still relatively small in scale; the abstract reports 48 hours of audio, all of it synthetic. Evaluating the system on larger and more diverse datasets, including real recorded speech, would help further validate its performance and generalizability.

Additionally, the authors do not provide a detailed analysis of the system's error patterns or failure cases. Understanding the specific challenges and limitations of the approach could inform future improvements and adaptations to other domains.

The abstract does report a comparison against a traditional cascade system, with up to 14.7 times fewer resources and a 0.5% gain in average accuracy. It would be useful to extend that comparison to user experience and to other state-of-the-art systems, which would further contextualize the contributions of this work.

Finally, the authors do not discuss the potential ethical and societal implications of deploying such a system in real-world medical settings. Considerations around data privacy, bias, and accessibility should be carefully addressed before any practical deployments.


Conclusion

This paper presents a novel zero-shot end-to-end spoken question answering system for the medical domain, which enables users to interact with the system more naturally through spoken language. By integrating pre-trained speech and language models, the authors have developed a system that can directly answer spoken questions without the need for manual transcription or multi-stage processing.

The evaluation on the LibriQA dataset demonstrates the system's strong performance, suggesting that this approach could be a valuable tool for improving access to medical information and expertise.

As language models continue to advance and integrate with speech capabilities, we may see a growing number of zero-shot and few-shot conversational AI systems that can seamlessly bridge the gap between spoken language and information retrieval. This could have significant implications for the accessibility and usability of medical and other knowledge-intensive domains.

