CHisIEC: An Information Extraction Corpus for Ancient Chinese History

Read original: arXiv:2403.15088 - Published 4/23/2024 by Xuemei Tang, Zekun Deng, Qi Su, Hao Yang, Jun Wang

Overview

  • Introduces a new corpus called CHisIEC (Chinese Historical Information Extraction Corpus) for named entity recognition (NER) and relation extraction (RE) on ancient Chinese historical texts
  • Describes the process of creating this corpus by collecting and annotating relevant texts
  • Evaluates models of various sizes and paradigms, including Large Language Models (LLMs), on the corpus

Plain English Explanation

The researchers have created a new dataset called CHisIEC that focuses on ancient Chinese history. It consists of historical texts annotated with entities (for example, the names of people and places) and with the relations that hold between those entities. The goal is to provide a resource for training and testing information extraction models, which are algorithms that automatically pull structured facts out of unstructured text.

With a specialized dataset like CHisIEC, researchers and developers can build and compare information extraction systems for ancient Chinese history more reliably. Such systems could support applications like building structured knowledge bases from historical documents or summarizing important events and figures. This makes the dataset a useful contribution to natural language processing and its application to historical research.

Technical Explanation

The CHisIEC corpus was created by collecting a variety of ancient Chinese historical texts covering 13 dynasties and more than 1,830 years. These texts were then annotated with named entities and with the relations between them, the targets of the NER and RE tasks. The annotation scheme has four entity types and twelve relation types, and the resulting corpus contains 14,194 labeled entities and 8,609 labeled relations.
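To make the idea of an entity-and-relation annotation concrete, here is a minimal Python sketch of how one annotated sentence might be represented. It is an illustration only: the class names, the `PER`/`LOC` tags, the `born_in` relation label, and the example sentence are all hypothetical placeholders, not the corpus's actual file format or label set.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset into the sentence
    end: int    # exclusive character offset
    label: str  # hypothetical entity tag, not the corpus's real label set


@dataclass(frozen=True)
class Relation:
    head: int   # index of the head entity in AnnotatedSentence.entities
    tail: int   # index of the tail entity
    label: str  # hypothetical relation name


@dataclass
class AnnotatedSentence:
    text: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Return (head text, relation label, tail text) for each relation."""
        result = []
        for rel in self.relations:
            head = self.entities[rel.head]
            tail = self.entities[rel.tail]
            result.append(
                (self.text[head.start:head.end], rel.label, self.text[tail.start:tail.end])
            )
        return result


# Made-up example sentence, roughly "Liu Bang was a man of Pei".
sample = AnnotatedSentence(
    text="劉邦沛人也",
    entities=[Entity(0, 2, "PER"), Entity(2, 3, "LOC")],
    relations=[Relation(0, 1, "born_in")],
)

print(sample.triples())
```

Storing relations as indices into the entity list, rather than as copied strings, keeps each relation tied to a specific mention, which matters when the same name appears more than once in a sentence.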

The researchers ran experiments on the CHisIEC corpus with models of various sizes and paradigms, and also evaluated how well Large Language Models (LLMs) handle tasks related to ancient Chinese history. The results suggest that the models achieve reasonably good performance at extracting named entities from the historical texts, but that there is still room for improvement, particularly in accurately identifying the relations between entities.
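Extraction systems of this kind are commonly scored with micro-averaged precision, recall, and F1 under exact matching. The sketch below shows that calculation in Python; it is a generic illustration and an assumption about the scoring style, not the paper's confirmed evaluation protocol. The function name `micro_prf` and the toy triples are hypothetical.

```python
from typing import Hashable, List, Set, Tuple


def micro_prf(
    gold: List[Set[Hashable]], pred: List[Set[Hashable]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 with exact matching.

    gold and pred hold one set per sentence; an item counts as correct
    only if it appears in both the gold set and the predicted set.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: three gold triples, three predicted, two of them exactly right.
gold = [{("劉邦", "born_in", "沛")}, {("A", "r", "B"), ("C", "r", "D")}]
pred = [{("劉邦", "born_in", "沛")}, {("A", "r", "B"), ("C", "r", "E")}]

precision, recall, f1 = micro_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because a relation triple is only correct when both entities and the relation label match, relation scores computed this way are typically lower than entity scores on the same text, which is consistent with the gap described above.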

Critical Analysis

The authors acknowledge that the CHisIEC corpus is a relatively small dataset compared to some of the larger information extraction benchmarks, and that the texts are predominantly in classical Chinese, which can present additional challenges for natural language processing models. They also note that the manual annotation process, while rigorous, may have introduced some inconsistencies or biases.

Additionally, the paper does not provide a deep analysis of the specific errors or shortcomings of the evaluated models, which would be helpful for guiding future research and development. It would also be interesting to see how the performance of these models compares to human experts in accurately extracting information from ancient Chinese historical texts.

Conclusion

The CHisIEC corpus is a step forward in the development of information extraction tools for ancient Chinese history. By providing a domain-specific dataset with labeled entities and relations, the researchers have created a resource for training and evaluating natural language processing models in this area. The evaluation results suggest that current models can achieve reasonable performance, but there is still significant room for improvement, particularly in identifying the relations between entities.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

CHisIEC: An Information Extraction Corpus for Ancient Chinese History

Xuemei Tang, Zekun Deng, Qi Su, Hao Yang, Jun Wang

Natural Language Processing (NLP) plays a pivotal role in the realm of Digital Humanities (DH) and serves as the cornerstone for advancing the structural analysis of historical and cultural heritage texts. This is particularly true for the domains of named entity recognition (NER) and relation extraction (RE). In our commitment to expediting ancient history and culture, we present the "Chinese Historical Information Extraction Corpus" (CHisIEC). CHisIEC is a meticulously curated dataset designed to develop and evaluate NER and RE tasks, offering a resource to facilitate research in the field. Spanning a remarkable historical timeline encompassing data from 13 dynasties spanning over 1830 years, CHisIEC epitomizes the extensive temporal range and text heterogeneity inherent in Chinese historical documents. The dataset encompasses four distinct entity types and twelve relation types, resulting in a meticulously labeled dataset comprising 14,194 entities and 8,609 relations. To establish the robustness and versatility of our dataset, we have undertaken comprehensive experimentation involving models of various sizes and paradigms. Additionally, we have evaluated the capabilities of Large Language Models (LLMs) in the context of tasks related to ancient Chinese history. The dataset and code are available at https://github.com/tangxuemei1995/CHisIEC.

4/23/2024

Assessing the Performance of Chinese Open Source Large Language Models in Information Extraction Tasks

Yida Cai, Hao Sun, Hsiu-Yuan Huang, Yunfang Wu

Information Extraction (IE) plays a crucial role in Natural Language Processing (NLP) by extracting structured information from unstructured text, thereby facilitating seamless integration with various real-world applications that rely on structured data. Despite its significance, recent experiments focusing on English IE tasks have shed light on the challenges faced by Large Language Models (LLMs) in achieving optimal performance, particularly in sub-tasks like Named Entity Recognition (NER). In this paper, we delve into a comprehensive investigation of the performance of mainstream Chinese open-source LLMs in tackling IE tasks, specifically under zero-shot conditions where the models are not fine-tuned for specific tasks. Additionally, we present the outcomes of several few-shot experiments to further gauge the capability of these models. Moreover, our study includes a comparative analysis between these open-source LLMs and ChatGPT, a widely recognized language model, on IE performance. Through meticulous experimentation and analysis, we aim to provide insights into the strengths, limitations, and potential enhancements of existing Chinese open-source LLMs in the domain of Information Extraction within the context of NLP.

6/5/2024

IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Honghao Gui, Lin Yuan, Hongbin Ye, Ningyu Zhang, Mengshu Sun, Lei Liang, Huajun Chen

Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). Note that high-quality instruction data is the vital key for enhancing the specific capabilities of LLMs, while current IE datasets tend to be small in scale, fragmented, and lack standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.

5/28/2024

InstructIE: A Bilingual Instruction-based Information Extraction Dataset

Honghao Gui, Shuofei Qiao, Jintian Zhang, Hongbin Ye, Mengshu Sun, Lei Liang, Jeff Z. Pan, Huajun Chen, Ningyu Zhang

Large language models can perform well on general natural language tasks, but their effectiveness is still suboptimal for information extraction (IE). Recent works indicate that the main reason lies in the lack of extensive data on IE instructions. Note that the existing datasets on IE instructions not only have limited coverage but also involve high construction costs. To address this issue, we introduce InstructIE, a bilingual instruction-based IE dataset, which covers 12 diverse domains. We propose KG2Instruction, a framework specifically for the automatic generation of such datasets. Additionally, we manually annotate the test set. Experimental results demonstrate that large language models trained with InstructIE can not only obtain better IE capabilities but also enhance zero-shot performance compared with baselines.

7/30/2024