LiMe: a Latin Corpus of Late Medieval Criminal Sentences

Read original: arXiv:2404.12829 - Published 4/22/2024 by Alessandra Bassani, Beatrice Del Bo, Alfio Ferrara, Marta Mangini, Sergio Picascia, Ambra Stefanello

Overview

  • This paper introduces LiMe, a new Latin corpus of late medieval criminal sentences.
  • According to the paper's abstract, the corpus comprises 325 documents extracted from a series of medieval manuscripts, the Libri sententiarum potestatis Mediolani, and annotated by experts.
  • The authors describe how the corpus was constructed, its key features, and its intended uses, which include masked language modeling and supervised natural language processing tasks.

Plain English Explanation

The paper describes the creation of the LiMe corpus, a collection of Latin texts recording criminal sentences from the late medieval period.

The researchers gathered these documents, which were originally written in Latin, and converted them into a digital format that can be easily searched and analyzed by scholars. This allows historians and linguists to study the language, legal procedures, and social contexts of criminal justice during that time period in a more systematic way.

The LiMe corpus provides researchers with a valuable resource for understanding how the criminal justice system operated in late medieval Europe, as well as how the Latin language was used in an official, legal setting. By analyzing the patterns and details in these historical records, scholars can gain new insights into the social and cultural history of the era.

Technical Explanation

The paper introduces the LiMe corpus, a new collection of Latin texts recording late medieval criminal sentences. According to the abstract, the corpus consists of 325 documents extracted from a series of medieval manuscripts, the Libri sententiarum potestatis Mediolani, and thoroughly annotated by experts.

The key features of the LiMe corpus include:

  • Texts dating from the late medieval period
  • Documents focused on criminal cases, sentences, and legal procedures
  • Texts written in Latin, the dominant language of administration and scholarship at the time
  • 325 documents drawn from a single manuscript series, the Libri sententiarum potestatis Mediolani, with expert annotation

The authors describe the process they used to build the corpus, which involved searching archives, transcribing handwritten documents, and converting the texts into a standardized digital format. They also discuss potential applications of the LiMe corpus in history, linguistics, and natural language processing; the abstract specifically names masked language modeling and supervised tasks.
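
To make the masked language modeling use case concrete, here is a minimal Python sketch of how input for that task is typically prepared: some tokens are hidden and a model is trained to recover them. This is not the authors' pipeline. The function name `mask_tokens`, the 15% masking rate, the whitespace tokenization, and the sample sentence (invented for illustration, not taken from the corpus) are all assumptions, and the sketch uses plain replacement with a mask symbol rather than the fuller replacement scheme used by models such as BERT.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a fraction of tokens and return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None
    elsewhere, so a model is only scored on the hidden tokens.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)  # seeded for reproducible masking
    # Always mask at least one token, even in very short inputs.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Invented example sentence, used only to exercise the function.
sentence = "iudex condemnavit reum in libris decem"
tokens = sentence.split()  # naive whitespace tokenization (assumption)
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

A real setup would more likely use a subword tokenizer and the masking utilities of a training library. The point of the sketch is only that the labels keep the original token at masked positions and nothing elsewhere.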

Critical Analysis

The LiMe corpus represents a valuable addition to the resources available for scholars studying late medieval Europe. By providing access to a large and diverse collection of criminal justice documents in their original Latin form, the corpus enables new avenues of research that were previously difficult or impossible.

However, the authors acknowledge some potential limitations of the corpus. For example, the availability and preservation of historical documents can be uneven across regions and time periods, which may lead to gaps or biases in the corpus's coverage. In addition, transcribing and digitizing handwritten Latin texts can introduce errors or inconsistencies that affect the accuracy of the corpus.

Despite these caveats, the LiMe corpus appears to be a well-designed and carefully curated resource that will likely prove invaluable for researchers studying the linguistic, legal, and social aspects of late medieval criminal justice. The authors' commitment to making the corpus openly accessible and providing detailed documentation on its construction and content is also commendable.

Conclusion

The LiMe corpus represents a significant contribution to the field of late medieval studies by providing scholars with a comprehensive collection of criminal justice documents in their original Latin form. This resource will enable new lines of research into the language, legal procedures, and social contexts of crime and punishment during this crucial period of European history.

The LiMe corpus has the potential to shed light on many aspects of late medieval society, from the evolution of legal institutions to the day-to-day realities of criminal activity and its consequences. As a freely available and well-documented resource, it will likely become an essential tool for historians, linguists, and other scholars seeking to deepen our understanding of this pivotal era.

Related Papers

LiMe: a Latin Corpus of Late Medieval Criminal Sentences

Alessandra Bassani, Beatrice Del Bo, Alfio Ferrara, Marta Mangini, Sergio Picascia, Ambra Stefanello

The Latin language has received attention from the computational linguistics research community, which has built, over the years, several valuable resources, ranging from detailed annotated corpora to sophisticated tools for linguistic analysis. With the recent advent of large language models, researchers have also started developing models capable of generating vector representations of Latin texts. The performances of such models remain behind the ones for modern languages, given the disparity in available data. In this paper, we present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani, and thoroughly annotated by experts, in order to be employed for masked language model, as well as supervised natural language processing tasks.

4/22/2024

MultiLegalPile: A 689GB Multilingual Legal Corpus

Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, Daniel E. Ho

Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, so far, there are few datasets available for specialized critical domains such as law and the available ones are often only for the English language. We curate and release MultiLegalPile, a 689GB corpus in 24 languages from 17 jurisdictions. The MultiLegalPile corpus, which includes diverse legal data sources with varying licenses, allows for pretraining NLP models under fair use, with more permissive licenses for the Eurlex Resources and Legal mC4 subsets. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on each of the language-specific subsets and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, the trained models, and all of the code under the most open possible licenses.

5/21/2024

HLDC: Hindi Legal Documents Corpus

Arnav Kapoor, Mudit Dhawan, Anmol Goel, T. H. Arjun, Akshala Bhatnagar, Vibhu Agrawal, Amul Agrawal, Arnab Bhattacharya, Ponnurangam Kumaraguru, Ashutosh Modi

Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of high-quality corpora that is needed to develop such data-driven systems. The problem gets even more pronounced in the case of low resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use-case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for the same. MTL models use summarization as an auxiliary task along with bail prediction as the main task. Experiments with different models are indicative of the need for further research in this area. We release the corpus and model implementation code with this paper: https://github.com/Exploration-Lab/HLDC

5/27/2024

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gómez, Tony Montes, Rubén Manrique

This paper presents two significant contributions: first, a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region. Second, it introduces a framework for OCR error correction and linguistic surface form detection in digitized corpora, utilizing a Large Language Model. This framework is adaptable to various contexts and, in this paper, is specifically applied to the newly created dataset.

7/19/2024