Pleias

Models by this creator


OCRonos-Vintage

The OCRonos-Vintage model is a small, specialized model for OCR (Optical Character Recognition) correction of cultural heritage archives. It was pre-trained by the maintainer PleIAs using the llm.c framework. The model has only 124 million parameters, so it can run efficiently on CPU or provide high-speed correction on GPUs (over 10,000 tokens per second), while maintaining quality comparable to larger models like GPT-4 or the llama version of OCRonos for English-language cultural archives.

Model inputs and outputs

The OCRonos-Vintage model takes OCRized text as input and generates corrected text as output. It was trained specifically on a dataset of cultural heritage archives from sources like the Library of Congress, Internet Archive, and HathiTrust.

Inputs

OCRized text: Text that has been processed by an optical character recognition (OCR) system and may contain errors or irregularities.

Outputs

Corrected text: A corrected and refined version of the OCRized input.

Capabilities

The OCRonos-Vintage model corrects errors and improves the quality of OCRized text from cultural heritage archives. It was trained on a large corpus of historical documents, which lets it handle the varied and often challenging text styles and structures common in these archives.

What can I use it for?

The OCRonos-Vintage model is well suited to projects that process and enhance digitized cultural heritage materials, such as books, manuscripts, and historical documents. It can improve the accuracy and readability of OCR output, which matters for text mining, indexing, and making these resources more accessible to researchers and the public.

Things to try

Experiment with the OCRonos-Vintage model on different types of cultural heritage documents, such as newspapers, journals, or archival records, and observe how it handles variations in font, layout, and language. You could also try fine-tuning the model on domain-specific datasets to further improve its performance on particular types of materials.
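One way to make such experiments comparable across document types is to score the raw OCR text and the model's correction against a trusted reference transcription. The following is a minimal sketch using character error rate (CER); the function names and sample strings are illustrative assumptions, not part of the model or its documentation, and the "corrected" string stands in for whatever the model actually returns.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: a hand-checked reference, the raw OCR text,
    # and a placeholder for the model's corrected output.
    reference = "The quick brown fox"
    ocr_text = "Tlie qu1ck brovvn fox"
    corrected = "The quick brown fox"
    print("CER before:", character_error_rate(ocr_text, reference))
    print("CER after: ", character_error_rate(corrected, reference))
```

A lower CER after correction than before suggests the model helped on that sample; running the same comparison per document type (newspapers, journals, archival records) shows where it struggles.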

Updated 9/6/2024

Image-to-Text


OCRonos

PleIAs

OCRonos is a series of specialized language models trained by PleIAs for the correction of badly digitized texts, as part of the Bad Data Toolbox. The models are versatile tools that correct OCR errors, incorrectly split or merged words, and broken text structure more generally. They were trained on a highly diverse set of OCRized texts in multiple languages, drawn from cultural heritage sources and financial/administrative documents. The current release features a model based on the llama-3-8b architecture, which is the most tested to date. Future releases will focus on smaller internal models that offer a better ratio of generation cost to quality. OCRonos is generally faithful to the original material: it provides sensible restitution of deteriorated text and rarely rewrites correct words. On highly deteriorated content, it can act as a synthetic rewriting tool rather than a strict correction tool.

Model inputs and outputs

Inputs

Corrupted/Broken Text: Text that has been poorly digitized, with errors, missing words, and other structural issues.

Outputs

Corrected Text: A corrected version of the input text, with OCR errors fixed, words merged or split correctly, and the overall structure improved.

Capabilities

OCRonos can correct a wide range of digitization artifacts, including common OCR mistakes, word segmentation issues, and other text degradation problems. It performs particularly well on cultural heritage archives and financial/administrative documents, which were the focus of its training data. The model retains the original meaning and intent while restoring the text to a more readable and usable form.

What can I use it for?

OCRonos can make challenging digitized resources more accessible and usable for language model applications and search retrieval. It is especially suited to situations where the original PDF sources are too damaged for correct OCRization or are difficult to retrieve. The model can be used to pre-process text before feeding it into other NLP pipelines, improving the quality and reliability of the results.

Things to try

One interesting aspect of OCRonos is its ability to act as a synthetic rewriting tool on highly deteriorated content, rather than only as a strict correction tool. This can be useful for generating more readable versions of severely damaged texts where the original meaning needs to be preserved. Experimenting with the model on different types of corrupted text, from historical archives to modern administrative documents, can reveal both its capabilities and its limitations.
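When experimenting, it can help to flag outputs where the model has drifted from correction into rewriting. The sketch below is one hedged way to do that with Python's standard `difflib`: it compares input and output and labels low-similarity pairs for manual review. The function names, the sample strings, and the 0.6 threshold are assumptions chosen for illustration, not values from PleIAs; a suitable threshold would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(original: str, corrected: str) -> float:
    """Similarity in [0, 1] between the OCR input and the model output."""
    return SequenceMatcher(None, original, corrected, autojunk=False).ratio()


def classify_output(original: str, corrected: str,
                    rewrite_threshold: float = 0.6) -> str:
    """Label an output as unchanged, a correction, or a possible rewrite.

    rewrite_threshold is an arbitrary illustrative value (assumption).
    """
    if original == corrected:
        return "unchanged"
    if similarity(original, corrected) >= rewrite_threshold:
        return "correction"
    return "possible rewrite"


if __name__ == "__main__":
    # Hypothetical (input, output) pairs; real outputs would come from the model.
    pairs = [
        ("Tlie qu1ck brovvn fox", "The quick brown fox"),
        ("~~ c0 ^^ ## 7u", "The committee met on Tuesday."),
    ]
    for original, corrected in pairs:
        print(classify_output(original, corrected), "|", corrected)
```

Outputs labelled as possible rewrites are not necessarily wrong, but they are the ones worth checking against the source scan before the text enters a downstream pipeline.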

Updated 9/16/2024

Text-to-Text