Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Read original: arXiv:2407.12838 - Published 7/19/2024 by Laura Manrique-Gómez, Tony Montes, Rubén Manrique

Overview

  • This paper presents a corpus of 19th-century Latin American Spanish newspapers whose optical character recognition (OCR) output has been corrected with a large language model (LLM).
  • The corpus covers 19th-century newspapers from various Latin American countries, providing a valuable resource for researchers studying semantic shift, language change, and historical linguistics in the region.
  • The authors use an LLM-based framework to correct OCR errors in the digitized newspaper text, improving its accuracy and reliability for downstream analysis.

Plain English Explanation

This research paper describes the creation of a new historical dataset of 19th century Latin American Spanish newspapers. This dataset is important because it provides researchers with a valuable resource for studying how the Spanish language evolved and changed over time in various parts of Latin America.

The text of these old newspaper pages was digitized with optical character recognition (OCR), and the researchers then used a large language model to correct the recognition errors. This improves the accuracy and quality of the data, making it more useful for researchers conducting linguistic analysis.

By making this historical newspaper corpus available, the researchers are enabling new discoveries and a better understanding of how the Spanish language developed across Latin America in the 1800s.

Technical Explanation

The key aspects of this research paper are:

  1. Corpus Creation: The authors compiled a corpus of 19th century Latin American Spanish newspapers from various countries in the region. This provides a rich dataset for studying linguistic changes and patterns during this historical period.

  2. OCR Correction: To improve the accuracy and reliability of the digitized newspaper text, the authors apply an LLM-based framework for post-OCR error correction. This helps correct the errors and inconsistencies that arise when OCR is applied to historical typeset print.

  3. Potential Applications: The cleaned and corrected corpus opens up new research opportunities, such as the semantic shift detection explored in the authors' companion paper on 19th-century Spanish. Other related work, including CLOCR-C and the LiMe corpus of medieval criminal sentences, tackles comparable problems in different historical collections.

The authors thoroughly describe their data collection and curation process, as well as the OCR correction methodology. They also provide insights into the characteristics and composition of the final corpus, making it a valuable resource for the research community.
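
As a rough illustration of what LLM-based post-OCR correction can look like, the sketch below splits a page into chunks, wraps each chunk in a prompt, and passes it to a corrector function. This is not the authors' implementation: the prompt wording, the chunk size, and the names `split_into_chunks`, `correct_text`, and `toy_corrector` are all assumptions made for this example, and `corrector` stands in for whatever LLM call a real pipeline would use.

```python
from typing import Callable, List

# Hypothetical prompt; the paper's actual prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "The following text comes from a 19th-century Latin American newspaper "
    "and contains OCR errors. Fix only recognition errors and keep the "
    "original spelling and wording.\n\nText:\n{chunk}"
)


def split_into_chunks(text: str, max_chars: int = 1000) -> List[str]:
    """Group whitespace-separated words into chunks of about max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def correct_text(text: str, corrector: Callable[[str], str], max_chars: int = 1000) -> str:
    """Send each chunk through `corrector` (a stand-in for an LLM call) and rejoin the results."""
    corrected = [
        corrector(PROMPT_TEMPLATE.format(chunk=chunk))
        for chunk in split_into_chunks(text, max_chars)
    ]
    return " ".join(corrected)


def toy_corrector(prompt: str) -> str:
    """Placeholder for a model: fixes one known OCR confusion in the chunk."""
    chunk = prompt.split("Text:\n", 1)[1]
    return chunk.replace("Repüblica", "República")


if __name__ == "__main__":
    print(correct_text("la Repüblica de Colombia", toy_corrector))
```

Passing the corrector in as a plain callable keeps the chunking and prompt logic independent of any particular model or API, so the same loop could be tested with a toy function, as here, before a real model is connected.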

Critical Analysis

The authors acknowledge several limitations and areas for further research:

  • The corpus is limited to 19th century newspapers, so expanding the temporal coverage could provide additional insights into language evolution.
  • The OCR correction process, while state-of-the-art, may still contain some residual errors, which could impact certain types of linguistic analysis.
  • Incorporating additional metadata, such as publication details and article-level annotations, could further enhance the utility of the corpus.
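
Residual OCR errors of the kind mentioned above are often quantified with the character error rate (CER): the edit distance between a trusted transcription and the OCR output, divided by the length of the transcription. This summary does not say which metric the authors report, so the sketch below is only a generic illustration; the function names are chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between the texts, normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "la nación"
    raw_ocr = "1a nacion"
    corrected = "la nacion"
    print("CER before correction:", character_error_rate(reference, raw_ocr))
    print("CER after correction:", character_error_rate(reference, corrected))
```

Comparing the CER of the raw OCR output with the CER of the corrected text against the same reference gives a simple measure of how much a correction step helped. For accented Spanish text, both strings should use the same Unicode normalization form before comparison, otherwise identical-looking characters may be counted as errors.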

Additionally, researchers may want to consider the potential biases and representational issues inherent in historical newspaper archives, which may not fully capture the linguistic diversity of the region during this period.

Conclusion

This research paper presents a significant contribution to the field of historical linguistics, providing a high-quality corpus of 19th century Latin American Spanish newspapers with LLM-based OCR correction. This dataset opens up new avenues for studying semantic shift, language change, and other linguistic phenomena in the region during a critical period of political and social transformation.

The authors' efforts to curate and clean this corpus demonstrate a commitment to supporting the research community and enabling new discoveries. By making this resource publicly available, they are facilitating interdisciplinary collaboration and advancing our understanding of the rich linguistic heritage of Latin America.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gómez, Tony Montes, Rubén Manrique

This paper presents two significant contributions: first, a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region. Second, it introduces a framework for OCR error correction and linguistic surface form detection in digitized corpora, utilizing a Large Language Model. This framework is adaptable to various contexts and, in this paper, is specifically applied to the newly created dataset.

7/19/2024

Historical Ink: Semantic Shift Detection for 19th Century Spanish

Tony Montes, Laura Manrique-Gómez, Rubén Manrique

This paper explores the evolution of word meanings in 19th-century Spanish texts, with an emphasis on Latin American Spanish, using computational linguistics techniques. It addresses the Semantic Shift Detection (SSD) task, which is crucial for understanding linguistic evolution, particularly in historical contexts. The study focuses on analyzing a set of Spanish target words. To achieve this, a 19th-century Spanish corpus is constructed, and a customizable pipeline for SSD tasks is developed. This pipeline helps find the senses of a word and measure their semantic change between two corpora using fine-tuned BERT-like models with old Spanish texts for both Latin American and general Spanish cases. The results provide valuable insights into the cultural and societal shifts reflected in language changes over time.

7/22/2024

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

Jonathan Bourne

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine if LMs can perform post-OCR correction, improve downstream NLP tasks, and the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.

9/2/2024

LiMe: a Latin Corpus of Late Medieval Criminal Sentences

Alessandra Bassani, Beatrice Del Bo, Alfio Ferrara, Marta Mangini, Sergio Picascia, Ambra Stefanello

The Latin language has received attention from the computational linguistics research community, which has built, over the years, several valuable resources, ranging from detailed annotated corpora to sophisticated tools for linguistic analysis. With the recent advent of large language models, researchers have also started developing models capable of generating vector representations of Latin texts. The performances of such models remain behind the ones for modern languages, given the disparity in available data. In this paper, we present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani, and thoroughly annotated by experts, in order to be employed for masked language model, as well as supervised natural language processing tasks.

4/22/2024