Historical German Text Normalization Using Type- and Token-Based Language Modeling

Read original: arXiv:2409.02841 - Published 9/5/2024 by Anton Ehrmanntraut

Overview

Historic variations in spelling pose challenges for full-text search and natural language processing on digitized historical texts.
Automatic orthographic normalization of historical source material is often pursued to minimize the gap between historic and contemporary spelling.
This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.

Plain English Explanation

The paper addresses the problem of historical spelling variation in digitized texts. Such variation makes it difficult to search or analyze these texts with modern language processing tools. To address this, the researchers developed a system that normalizes the spelling of German literary texts from roughly 1700 to 1900.

The key idea is to train a machine learning model on a parallel corpus that pairs historical spellings with their modern equivalents, so that the model learns to convert historical spellings into modern ones. The researchers used a Transformer-based approach that combines an encoder-decoder model, which normalizes individual word types, with a pre-trained causal language model, which adjusts these normalizations in context.

The researchers found that their proposed system achieves state-of-the-art accuracy, performing comparably to a larger, end-to-end normalization system. However, they note that normalizing historical text remains a challenge due to difficulties in generalization and the lack of high-quality parallel training data.

Technical Explanation

The paper presents a machine learning-based approach for normalizing the spelling of historical German texts. The researchers trained a model on a parallel corpus of historical and modern spellings, using a Transformer-based architecture.
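
As a rough illustration of what "training on a parallel corpus" can mean at the word-type level, the sketch below counts how often each historical spelling maps to each modern form. The toy sentences, the assumption that the corpus is already aligned token by token, and the names `parallel_corpus` and `collect_type_pairs` are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each sentence is a list of (historical, modern) token pairs.
# A token-aligned corpus is an assumption made for this sketch.
parallel_corpus = [
    [("Die", "Die"), ("Thür", "Tür"), ("war", "war"), ("offen", "offen")],
    [("Er", "Er"), ("that", "tat"), ("es", "es"), ("bey", "bei"), ("der", "der"), ("Thür", "Tür")],
]


def collect_type_pairs(corpus):
    """Count how often each historical word type maps to each modern form."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for historical, modern in sentence:
            counts[historical][modern] += 1
    return counts


type_pairs = collect_type_pairs(parallel_corpus)

# Most frequent modern form per historical type: a simple lookup baseline,
# not the encoder-decoder model described in the paper.
most_common = {hist: forms.most_common(1)[0][0] for hist, forms in type_pairs.items()}
```

Pairs collected this way could serve as training examples for a type-level model; a plain lookup table like `most_common` cannot handle spellings it has never seen, which is where a learned model would come in.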

The key components of the model are:

Encoder-Decoder for Word Normalization: An encoder-decoder model is used to normalize individual word types, converting historical spellings to their modern equivalents.
Pre-trained Language Model for Context-Aware Adjustments: A pre-trained causal language model is used to adjust the normalizations of words based on their surrounding context.
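
The interaction between the two components can be sketched as follows. This is a minimal toy version under stated assumptions: a hand-written candidate table stands in for the encoder-decoder, a tiny bigram table stands in for the pre-trained causal language model, and decoding is greedy from left to right. All names and numbers (`TYPE_CANDIDATES`, `BIGRAM_LOGPROB`, `normalize`) are hypothetical, and the paper's actual scoring and decoding may differ.

```python
import math

# Stand-in for the type-level model: candidate modern forms with scores (hypothetical values).
TYPE_CANDIDATES = {
    "wider": [("wider", 0.5), ("wieder", 0.5)],  # ambiguous without context
    "seyn": [("sein", 1.0)],
}

# Stand-in for the causal language model: log-probabilities of a word given the previous word.
BIGRAM_LOGPROB = {
    ("kam", "wieder"): math.log(0.2),
    ("kam", "wider"): math.log(0.001),
}
UNSEEN_LOGPROB = math.log(1e-4)  # fallback for bigrams missing from the toy table


def candidates_for(token):
    """Return candidate normalizations; unknown types pass through unchanged."""
    return TYPE_CANDIDATES.get(token, [(token, 1.0)])


def context_score(previous, candidate):
    """Score a candidate given the previously chosen word."""
    return BIGRAM_LOGPROB.get((previous, candidate), UNSEEN_LOGPROB)


def normalize(tokens):
    """Greedily pick, for each token, the candidate with the best combined score."""
    output = []
    previous = "<s>"  # sentence-start marker
    for token in tokens:
        best_form, _ = max(
            candidates_for(token),
            key=lambda pair: math.log(pair[1]) + context_score(previous, pair[0]),
        )
        output.append(best_form)
        previous = best_form
    return output


result = normalize(["Er", "kam", "wider"])
```

In this toy setup the type-level scores for "wider" are tied, so the context score decides in favor of "wieder" after "kam". Splitting the work this way keeps the type-level model small while leaving ambiguous cases to a model that sees the surrounding words.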

The researchers extensively evaluated their proposed system, finding that it achieves state-of-the-art accuracy, comparable to a much larger, end-to-end normalization system that fine-tunes a pre-trained Transformer language model.
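
The summary does not spell out how accuracy is computed. One common choice for normalization systems, shown here only as an assumption, is word-level accuracy over aligned tokens; the function name `word_accuracy` and the example data are hypothetical.

```python
def word_accuracy(predicted, reference):
    """Fraction of positions where the predicted token equals the reference token."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of tokens")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Hypothetical system output and gold normalization for one sentence.
predicted = ["Die", "Thür", "war", "offen"]
reference = ["Die", "Tür", "war", "offen"]
score = word_accuracy(predicted, reference)
```

Here three of four tokens match, so `score` is 0.75. A metric like this requires that system output and reference are aligned token by token.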

Critical Analysis

The researchers acknowledge that the normalization of historical text remains a challenging task, despite the strong performance of their proposed system. Two key limitations are:

Generalization Difficulties: The models struggle to generalize well, likely due to the complexities and variations in historical spelling.
Lack of High-Quality Parallel Data: The researchers note the lack of extensive, high-quality parallel data (historical and modern spellings) as a significant constraint.
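
One way to make the generalization problem measurable, offered here as a sketch rather than as the paper's evaluation protocol, is to report accuracy separately for word types that appeared in the training data and for those that did not. The function `accuracy_by_seen_status` and the example data are hypothetical.

```python
def accuracy_by_seen_status(examples, training_types):
    """Compute accuracy separately for seen and unseen historical word types.

    examples: iterable of (historical, predicted, reference) triples.
    training_types: set of historical word types observed during training.
    """
    totals = {"seen": [0, 0], "unseen": [0, 0]}  # [correct, total] per bucket
    for historical, predicted, reference in examples:
        bucket = "seen" if historical in training_types else "unseen"
        totals[bucket][0] += int(predicted == reference)
        totals[bucket][1] += 1
    # None marks an empty bucket, so it is not confused with an accuracy of zero.
    return {
        name: (correct / total if total else None)
        for name, (correct, total) in totals.items()
    }


# Hypothetical predictions: two seen types, two unseen types (one of them wrong).
examples = [
    ("Thür", "Tür", "Tür"),
    ("bey", "bei", "bei"),
    ("Thau", "Thau", "Tau"),
    ("Noth", "Not", "Not"),
]
report = accuracy_by_seen_status(examples, training_types={"Thür", "bey"})
```

A large gap between the two numbers would indicate that a system mostly memorizes training pairs instead of learning spelling patterns that transfer to new words.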

These limitations highlight the ongoing challenges in this domain and the need for further research and data collection efforts to improve the performance and robustness of historical text normalization systems.

Conclusion

This paper presents a Transformer-based approach for normalizing the spelling of German literary texts from roughly 1700 to 1900. The proposed system achieves state-of-the-art accuracy, demonstrating the potential of machine learning techniques for addressing the challenges posed by historical spelling variation.

While the results are promising, the researchers acknowledge the ongoing difficulties in this domain, such as the need for better generalization and the lack of high-quality parallel data. Continued research and data collection efforts in this area could further enhance the capabilities of historical text normalization systems, ultimately improving the accessibility and analysis of digitized historical documents.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Historical German Text Normalization Using Type- and Token-Based Language Modeling

Anton Ehrmanntraut

Historic variations of spelling pose a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, usually an automatic orthographic normalization of the historical source material is pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge due to difficulties for models to generalize, and the lack of extensive high-quality parallel data.

9/5/2024

Is text normalization relevant for classifying medieval charters?

Florian Atzenhofer-Baumgartner, Tamás Kovács

This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.

8/30/2024

Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers

Frederick Riemenschneider, Kevin Krahn

Historical languages present unique challenges to the NLP community, with one prominent hurdle being the limited resources available in their closed corpora. This work describes our submission to the constrained subtask of the SIGTYP 2024 shared task, focusing on PoS tagging, morphological tagging, and lemmatization for 13 historical languages. For PoS and morphological tagging we adapt a hierarchical tokenization method from Sun et al. (2023) and combine it with the advantages of the DeBERTa-V3 architecture, enabling our models to efficiently learn from every character in the training data. We also demonstrate the effectiveness of character-level T5 models on the lemmatization task. Pre-trained from scratch with limited data, our models achieved first place in the constrained subtask, nearly reaching the performance levels of the unconstrained task's winner. Our code is available at https://github.com/bowphs/SIGTYP-2024-hierarchical-transformers

5/31/2024

Medical Concept Normalization in a Low-Resource Setting

Tim Patzelt

In the field of biomedical natural language processing, medical concept normalization is a crucial task for accurately mapping mentions of concepts to a large knowledge base. However, this task becomes even more challenging in low-resource settings, where limited data and resources are available. In this thesis, I explore the challenges of medical concept normalization in a low-resource setting. Specifically, I investigate the shortcomings of current medical concept normalization methods applied to German lay texts. Since there is no suitable dataset available, a dataset consisting of posts from a German medical online forum is annotated with concepts from the Unified Medical Language System. The experiments demonstrate that multilingual Transformer-based models are able to outperform string similarity methods. The use of contextual information to improve the normalization of lay mentions is also examined, but led to inferior results. Based on the results of the best performing model, I present a systematic error analysis and lay out potential improvements to mitigate frequent errors.

9/24/2024