Economy Watchers Survey provides Datasets and Tasks for Japanese Financial Domain

Read original: arXiv:2407.14727 - Published 7/23/2024 by Masahiro Suzuki, Hiroki Sakaji

Overview

  • The paper notes that natural language processing (NLP) tasks for Japanese and for the financial domain are scarce compared with those for English and general domains.
  • The researchers constructed two large datasets using materials from a Japanese government agency, providing three Japanese financial NLP tasks.
  • These tasks include sentence categorization (3-class and 12-class) and sentiment analysis (5-class).
  • The datasets are designed to be comprehensive and up-to-date, with an automatic update framework to ensure the latest task datasets are publicly available.

Plain English Explanation

Most NLP tasks and datasets are focused on the English language or general domains. However, there are fewer resources available for languages other than English, particularly in specialized domains like finance. To address this gap, the researchers created two large datasets using materials from a Japanese government agency, which provide three NLP tasks in the Japanese financial domain.

Two of the tasks are sentence categorization: one classifies each sentence into one of three categories, the other into one of twelve. The third task is sentiment analysis, where the goal is to classify a sentence into one of five sentiment classes.

These datasets are designed to be comprehensive and up-to-date, using an automatic update framework to ensure the latest task data is publicly available at all times. This makes them a valuable resource for researchers and developers working on Japanese and financial NLP applications.

Technical Explanation

The paper addresses the lack of NLP tasks and datasets available for languages other than English and for specialized domains like finance. To address this gap, the researchers constructed two large datasets using materials published by a Japanese central government agency.

The datasets provide three Japanese financial NLP tasks:

  1. 3-class sentence categorization: Classifying sentences into one of three categories.
  2. 12-class sentence categorization: Classifying sentences into one of twelve categories.
  3. 5-class sentiment analysis: Classifying the sentiment of a sentence into one of five classes.
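
To make the three label spaces above concrete, here is a minimal sketch of how a single task sample could be represented and validated. The task keys (`category_3`, `category_12`, `sentiment_5`), the `Sample` type, and the integer label encoding are assumptions made for illustration; this summary does not give the datasets' actual field names or label names.

```python
from dataclasses import dataclass

# Hypothetical task keys. Only the class counts (3, 12, 5) come from the
# summary; the names and integer encoding are assumptions.
NUM_CLASSES = {"category_3": 3, "category_12": 12, "sentiment_5": 5}


@dataclass(frozen=True)
class Sample:
    """One labeled sentence for one of the three tasks."""
    text: str
    task: str
    label: int

    def __post_init__(self):
        # Reject unknown tasks and labels outside the task's class range.
        if self.task not in NUM_CLASSES:
            raise ValueError(f"unknown task: {self.task}")
        if not 0 <= self.label < NUM_CLASSES[self.task]:
            raise ValueError(
                f"label {self.label} out of range for task {self.task}"
            )


# Usage: a sentiment sample with a label in the range 0..4.
sample = Sample(text="景気は緩やかに回復している", task="sentiment_5", label=3)
```

Validating the label range at construction time catches a common mistake when the same text is reused across tasks with different class counts.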

The researchers designed the datasets to be comprehensive and up-to-date, using an automatic update framework to ensure the latest task datasets are publicly available. This framework keeps the published datasets current over time, which makes them a useful resource for researchers and developers working on Japanese and financial NLP applications.
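
This summary does not describe how the update framework works internally. As a rough illustration of the general idea only, the sketch below merges a new batch of records into an existing dataset without creating duplicates. The function name `merge_release` and the record fields `period` and `text` are hypothetical and are not taken from the paper.

```python
def merge_release(existing, new_release):
    """Return the existing records plus any unseen records from new_release.

    Records are plain dicts. A record is treated as already present when
    its (period, text) pair has been seen; this key is an assumption.
    """
    seen = {(r["period"], r["text"]) for r in existing}
    merged = list(existing)
    for record in new_release:
        key = (record["period"], record["text"])
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged


# Usage: applying the same release twice leaves the dataset unchanged.
dataset = [{"period": "2024-05", "text": "来客数が増えている", "label": 3}]
release = [
    {"period": "2024-05", "text": "来客数が増えている", "label": 3},
    {"period": "2024-06", "text": "受注が減少している", "label": 1},
]
dataset = merge_release(dataset, release)
dataset = merge_release(dataset, release)
```

Making the merge idempotent means a scheduled job can rerun safely if a release is fetched more than once.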

Critical Analysis

The researchers have addressed an important gap in the availability of NLP tasks and datasets for languages other than English and specialized domains like finance. By leveraging materials from a Japanese government agency, they have created a comprehensive set of tasks that can be used to evaluate and develop Japanese and financial NLP models.

However, the paper does not provide a detailed analysis of the datasets, such as the distribution of samples across classes or the complexity of the tasks. Additionally, while the automatic update framework is a promising feature, the paper does not discuss how the quality and consistency of the datasets will be maintained over time.
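
The class distribution mentioned above is straightforward to inspect once the labels are loaded. The following is a minimal sketch that assumes labels are available as a plain list of integers; the helper name `class_distribution` is hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Map each label to its share of the samples, in sorted label order."""
    counts = Counter(labels)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {label: counts[label] / total for label in sorted(counts)}


# Usage with made-up labels for a 3-class task.
shares = class_distribution([0, 0, 1, 2])
```

A heavily skewed distribution would suggest reporting a class-balanced metric alongside accuracy.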

Further research could explore the performance of existing NLP models on these tasks, as well as the challenges and opportunities presented by the financial domain and the Japanese language. Evaluating the datasets' usefulness for real-world applications would also be valuable.
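
When comparing models on multi-class tasks like these, macro-averaged F1 is one common choice because it weights every class equally. The sketch below is a dependency-free implementation for integer labels; this summary does not state which metric the paper reports, so treat the choice of metric as an assumption.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores for labels 0..num_classes-1.

    A class with no true and no predicted samples contributes 0.0, which
    is one of several possible conventions.
    """
    scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Usage with made-up predictions for a 2-class toy problem.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```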

Conclusion

This paper addresses a significant gap in the availability of NLP tasks and datasets for languages other than English and specialized domains like finance. By creating two large datasets with three Japanese financial NLP tasks, the researchers have provided a valuable resource for researchers and developers working in these areas.

The comprehensive and up-to-date nature of the datasets, enabled by the automatic update framework, ensures that the latest task data is publicly available at all times. This makes the datasets a promising tool for advancing NLP research and development in the Japanese language and financial domain.



Related Papers

Economy Watchers Survey provides Datasets and Tasks for Japanese Financial Domain

Masahiro Suzuki, Hiroki Sakaji

Many natural language processing (NLP) tasks in English or general domains are widely available and are often used to evaluate pre-trained language models. In contrast, there are fewer tasks available for languages other than English and for the financial domain. In particular, tasks in Japanese and the financial domain are limited. We construct two large datasets using materials published by a Japanese central government agency. The datasets provide three Japanese financial NLP tasks, which include a 3-class and 12-class classification for categorizing sentences, as well as a 5-class classification task for sentiment analysis. Our datasets are designed to be comprehensive and up-to-date, leveraging an automatic update framework that ensures the latest task datasets are publicly available anytime.

7/23/2024

JaFIn: Japanese Financial Instruction Dataset

Kota Tanabe, Masahiro Suzuki, Hiroki Sakaji, Itsuki Noda

We construct an instruction dataset for the large language model (LLM) in the Japanese finance domain. Domain adaptation of language models, including LLMs, is receiving more attention as language models become more popular. This study demonstrates the effectiveness of domain adaptation through instruction tuning. To achieve this, we propose an instruction tuning data in Japanese called JaFIn, the Japanese Financial Instruction Dataset. JaFIn is manually constructed based on multiple data sources, including Japanese government websites, which provide extensive financial knowledge. We then utilize JaFIn to apply instruction tuning for several LLMs, demonstrating that our models specialized in finance have better domain adaptability than the original models. The financial-specialized LLMs created were evaluated using a quantitative Japanese financial benchmark and qualitative response comparisons, showing improved performance over the originals.

7/23/2024

Construction of Domain-specified Japanese Large Language Model for Finance through Continual Pre-training

Masanori Hirano, Kentaro Imajo

Large language models (LLMs) are now widely used in various fields, including finance. However, Japanese financial-specific LLMs have not been proposed yet. Hence, this study aims to construct a Japanese financial-specific LLM through continual pre-training. Before tuning, we constructed Japanese financial-focused datasets for continual pre-training. As a base model, we employed a Japanese LLM that achieved state-of-the-art performance on Japanese financial benchmarks among the 10-billion-class parameter models. After continual pre-training using the datasets and the base model, the tuned model performed better than the original model on the Japanese financial benchmarks. Moreover, the outputs comparison results reveal that the tuned model's outputs tend to be better than the original model's outputs in terms of the quality and length of the answers. These findings indicate that domain-specific continual pre-training is also effective for LLMs. The tuned model is publicly available on Hugging Face.

4/17/2024

Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of a model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.

4/17/2024