70B-parameter large language models in Japanese medical question-answering

Read original: arXiv:2406.14882 - Published 6/24/2024 by Issey Sukeda, Risa Kishikawa, Satoshi Kodera

Overview

This paper explores the use of large language models (LLMs) with 70 billion parameters for medical question-answering in Japanese.
The researchers investigate the potential benefits of "medical instruction tuning" - fine-tuning LLMs on domain-specific medical data to improve their performance on medical tasks.
They evaluate multiple 70B-parameter models on Japanese medical license exam questions and compare how much Japanese-centric and English-centric models gain from this tuning.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. In this research, the authors focused on using LLMs with 70 billion parameters for answering medical questions in Japanese.

The key idea is to "fine-tune" these large models on domain-specific medical data, a process called "medical instruction tuning." This helps the models apply medical knowledge when answering questions. The researchers wanted to see whether this approach improves performance on Japanese medical license exam questions, and whether models built around the Japanese language benefit more than English-centric ones.

Technical Explanation

The researchers used multiple 70-billion-parameter LLMs as starting points for their experiments. They instruction-tuned these models on a Japanese medical question-answering dataset, a process this summary calls "medical instruction tuning." The aim is to adapt each model to medical terminology, concepts, and reasoning in Japanese.
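To make the instruction-tuning step concrete, here is a minimal sketch of how one multiple-choice question could be rendered into a prompt/completion pair for supervised fine-tuning. The `ExamQuestion` schema, the `### Instruction:` template, and the sample question are all assumptions made for illustration; the paper's actual dataset fields and prompt formats are not given in this summary.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ExamQuestion:
    """One multiple-choice question (hypothetical schema, not from the paper)."""
    question: str
    choices: Dict[str, str]  # choice label -> choice text
    answer: str              # label of the correct choice


def build_prompt(q: ExamQuestion) -> str:
    """Render the question and its choices as an instruction-style prompt."""
    lines = [
        "### Instruction:",
        "Answer the following medical question by choosing one option.",
        "",
        "### Question:",
        q.question,
        "",
    ]
    lines += [f"{label}. {text}" for label, text in q.choices.items()]
    lines += ["", "### Response:"]
    return "\n".join(lines)


def to_training_record(q: ExamQuestion) -> Dict[str, str]:
    """Pair the prompt with the gold label as the target completion."""
    return {"prompt": build_prompt(q), "completion": " " + q.answer}


# Illustrative, made-up question (not taken from any exam or dataset).
example = ExamQuestion(
    question="Deficiency of which vitamin causes scurvy?",
    choices={"a": "Vitamin A", "b": "Vitamin C", "c": "Vitamin D"},
    answer="b",
)
record = to_training_record(example)
print(record["prompt"])
print("target:", record["completion"])
```

Small changes to a template like this (for example, where the choices appear or how the response marker is worded) are the kind of prompt-format variation that can shift measured accuracy.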

The team then evaluated the fine-tuned models on questions from Japanese medical license exams. They compared results before and after instruction tuning, and across Japanese-centric and English-centric base models.

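The findings showed that instruction tuning significantly improved the models' ability to answer Japanese medical license exam questions, with accuracy surpassing 50%. Japanese-centric models gained more from instruction tuning than their English-centric counterparts, and the choice between two slightly different prompt formats also made a non-negligible difference in performance.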
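Scoring a multiple-choice benchmark of this kind usually comes down to exact-match accuracy over choice labels. The sketch below shows one plausible way to do that, assuming the model returns free text from which a label must be extracted. The `extract_choice` heuristic and the label set `abcde` are assumptions for illustration, not the authors' evaluation code.

```python
import re
from typing import List, Optional


def extract_choice(output: str, labels: str = "abcde") -> Optional[str]:
    """Return the first standalone choice label in the model output, if any.

    Naive heuristic: a label counts only when it is surrounded by word
    boundaries, so it would need adapting for real model outputs.
    """
    match = re.search(rf"\b([{labels}])\b", output.strip().lower())
    return match.group(1) if match else None


def accuracy(outputs: List[str], gold: List[str]) -> float:
    """Fraction of outputs whose extracted label equals the gold label."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


# Made-up model outputs and gold labels, for illustration only.
outputs = ["b", "The answer is c.", "I am not sure"]
gold = ["b", "c", "a"]
print(f"accuracy = {accuracy(outputs, gold):.3f}")
```

An unparseable output is counted as wrong here, which is a design choice: it penalizes models that fail to follow the answer format rather than silently dropping those questions.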

Critical Analysis

The paper provides a valuable contribution to the field of large language models and their application to medical tasks in non-English languages. The researchers' approach of "medical instruction tuning" is a promising technique that could potentially be applied to other specialized domains as well.

However, the paper does not address certain limitations and potential issues. For example, it does not discuss the computational and resource requirements of training such a large 70 billion parameter model, which may limit its practical deployment, especially in resource-constrained settings. Additionally, the paper does not explore potential biases or ethical considerations that may arise from using large language models for medical decision-making.
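A back-of-the-envelope calculation gives a sense of the resource concern. The sketch below estimates the memory needed just to hold model weights at several numeric precisions. It is a lower bound under stated assumptions: it ignores activations, the key-value cache, and (for training) gradients and optimizer state, and none of the figures come from the paper.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Bytes per parameter for common storage precisions.
PRECISIONS = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for model_name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    for precision, nbytes in PRECISIONS.items():
        gb = weight_memory_gb(n_params, nbytes)
        print(f"{model_name} weights at {precision}: {gb:.1f} GB")
```

At 16-bit precision, 70 billion parameters take about 140 GB for the weights alone, which is more than a single commonly available GPU holds, so such a model has to be sharded across devices or quantized.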

Further research is needed to better understand the robustness, generalizability, and potential risks of these large medical language models, particularly when deployed in real-world clinical settings. Ongoing work on developing multilingual medical language models and analyzing social biases in Japanese LLMs may also provide valuable insights.

Conclusion

This paper demonstrates the potential of 70-billion-parameter language models for medical question-answering in Japanese, particularly when they are fine-tuned on domain-specific medical data. The findings suggest that this "medical instruction tuning" approach improves accuracy on Japanese medical license exam questions, with Japanese-centric models benefiting more than English-centric ones.

While the results are promising, further research is needed to address the limitations and potential risks associated with deploying such large medical language models in real-world settings. Ongoing work in multilingual medical language models and analyzing biases in Japanese LLMs may also offer valuable insights to advance this field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

70B-parameter large language models in Japanese medical question-answering

Issey Sukeda, Risa Kishikawa, Satoshi Kodera

Since the rise of large language models (LLMs), domain adaptation has been one of the hot topics in various domains. Many medical LLMs trained with English medical datasets have been made public recently. However, research on Japanese LLMs in the medical domain is still lacking. Here we utilize multiple 70B-parameter LLMs for the first time and show that instruction tuning using a Japanese medical question-answering dataset significantly improves the ability of Japanese LLMs to solve Japanese medical license exams, surpassing 50% in accuracy. In particular, the Japanese-centric models exhibit a more significant leap in improvement through instruction tuning compared to their English-centric counterparts. This underscores the importance of continual pretraining and the adjustment of the tokenizer in our local language. We also examine two slightly different prompt formats, resulting in non-negligible performance improvement.

6/24/2024

Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources

Issey Sukeda

The recent success of large language models (LLMs) and the scaling law has led to a widespread adoption of larger models. Particularly in the healthcare industry, there is an increasing demand for locally operated LLMs due to security concerns. However, the majority of high quality open-source LLMs have a size of 70B parameters, imposing significant financial burdens on users for GPU preparation and operation. To overcome these issues, we present a medical adaptation based on the recent 7B models, which enables operation with low computational resources. We compare the performance on medical question-answering benchmarks in two languages (Japanese and English), demonstrating that its scores reach parity with or surpass those of currently existing medical LLMs that are ten times larger. We find that fine-tuning an English-centric base model on a Japanese medical dataset improves the score in both languages, supporting the effect of cross-lingual knowledge transfer. We hope that this study will alleviate financial challenges, serving as a stepping stone for clinical institutions to practically utilize LLMs locally. Our evaluation code is available at https://github.com/stardust-coder/japanese-lm-med-harness.

9/23/2024

Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further, we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.

4/17/2024

JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models

Junfeng Jiang, Jiahao Huang, Akiko Aizawa

Recent developments in Japanese large language models (LLMs) primarily focus on general domains, with fewer advancements in Japanese biomedical LLMs. One obstacle is the absence of a comprehensive, large-scale benchmark for comparison. Furthermore, the resources for evaluating Japanese biomedical LLMs are insufficient. To advance this field, we propose a new benchmark including eight LLMs across four categories and 20 Japanese biomedical datasets across five tasks. Experimental results indicate that: (1) LLMs with a better understanding of Japanese and richer biomedical knowledge achieve better performance in Japanese biomedical tasks, (2) LLMs that are not mainly designed for Japanese biomedical domains can still perform unexpectedly well, and (3) there is still much room for improving the existing LLMs in certain Japanese biomedical tasks. Moreover, we offer insights that could further enhance development in this field. Our evaluation tools tailored to our benchmark as well as the datasets are publicly available in https://huggingface.co/datasets/Coldog2333/JMedBench to facilitate future research.

9/23/2024