OCRonos

Maintainer: PleIAs

Total Score

48

Last updated 9/18/2024

🧠

Property: Value
Run this model: Run on HuggingFace
API spec: View on HuggingFace
Github link: No Github link provided
Paper link: No paper link provided


Model overview

OCRonos is a series of specialized language models trained by PleIAs for the correction of badly digitized texts, as part of the Bad Data Toolbox. The models are versatile tools that support the correction of OCR errors, incorrect word splits and merges, and otherwise broken text structures. They were trained on a highly diverse set of OCRized texts in multiple languages, drawn from cultural heritage sources and financial/administrative documents.

The current release features a model based on the llama-3-8b architecture, the most thoroughly tested to date. Future releases will focus on smaller internal models that offer a better ratio of generation cost to quality. OCRonos is generally faithful to the original material, providing sensible restitution of deteriorated text and rarely rewriting words that are already correct. On highly deteriorated content, it can act as a synthetic rewriting tool rather than a strict correction tool.

Model inputs and outputs

Inputs

  • Corrupted/Broken Text: OCRonos takes in text that has been poorly digitized, with errors, missing words, and other structural issues.

Outputs

  • Corrected Text: The model outputs a corrected version of the input text, with OCR errors fixed, words merged/split correctly, and the overall structure improved.
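Long documents typically have to be corrected in chunks that fit the model's context window before reassembly. A minimal pre-processing sketch; the chunk size and the `correct_chunk` callable (standing in for an actual call to OCRonos) are illustrative assumptions, not part of the model's documented API:

```python
def chunk_text(text: str, max_words: int = 300) -> list[str]:
    """Split a corrupted document into word-bounded chunks small
    enough to fit the model's context window."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def correct_document(text: str, correct_chunk) -> str:
    """Apply a chunk-level correction function (e.g. a call to
    OCRonos) to each chunk and reassemble the document."""
    return " ".join(correct_chunk(chunk) for chunk in chunk_text(text))
```

In practice `correct_chunk` would wrap an inference call to the hosted model; the scaffolding above just keeps each request within a workable window.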

Capabilities

OCRonos is capable of reliably correcting a wide range of digitization artifacts, including common OCR mistakes, word segmentation issues, and other text degradation problems. It performs particularly well on cultural heritage archives and financial/administrative documents, where the training data was focused. The model is able to retain the original meaning and intent while restoring the text to a more readable and usable form.

What can I use it for?

OCRonos can be a valuable tool for making challenging digitized resources more accessible and usable for language model applications and search retrieval. It is especially suited for situations where the original PDF sources are too damaged for correct OCRization or difficult to retrieve. The model can be used to pre-process text before feeding it into other NLP pipelines, improving the overall quality and reliability of the results.

Things to try

One interesting aspect of OCRonos is its ability to act as a synthetic rewriting tool on highly deteriorated content, rather than just a strict correction tool. This can be useful for generating more readable versions of severely damaged texts where the original meaning needs to be preserved. Experimenting with the model's behavior on different types of corrupted text, from historical archives to modern administrative documents, can yield interesting insights into its capabilities and limitations.



This summary was produced with help from an AI and may contain inaccuracies; check out the links to read the original source documents!

Related Models

🗣️

OCRonos-Vintage

PleIAs

Total Score

64

The OCRonos-Vintage model is a small specialized model for OCR (Optical Character Recognition) correction of cultural heritage archives. It was pre-trained by the maintainer PleIAs using the llm.c framework. At only 124 million parameters, it can run efficiently on CPU or provide high-speed correction on GPUs (over 10,000 tokens per second) while maintaining quality comparable to larger models like GPT-4 or the llama version of OCRonos for English-language cultural archives.

Model inputs and outputs

The OCRonos-Vintage model takes OCRized text as input and generates corrected text as output. It was specifically trained on a dataset of cultural heritage archives from sources like the Library of Congress, Internet Archive, and Hathi Trust.

Inputs

  • OCRized text: Text that has been processed by an optical character recognition (OCR) system, which may contain errors or irregularities.

Outputs

  • Corrected text: Text that has been corrected and refined compared to the input OCRized version.

Capabilities

The OCRonos-Vintage model excels at correcting errors and improving the quality of OCRized text from cultural heritage archives. It was trained on a large corpus of historical documents, allowing it to handle the varied and challenging text styles and structures common in these archives.

What can I use it for?

The OCRonos-Vintage model is well-suited for projects that involve processing and enhancing digitized cultural heritage materials, such as books, manuscripts, and historical documents. It can be used to improve the accuracy and readability of OCR output, which is crucial for tasks like text mining, indexing, and making these valuable resources more accessible to researchers and the public.

Things to try

Experiment with the OCRonos-Vintage model on different types of cultural heritage documents, such as newspapers, journals, or archival records. Observe how the model handles variations in font, layout, and language. You could also try fine-tuning the model on domain-specific datasets to further improve its performance on particular types of materials.
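OCRonos-Vintage is a completion-style model: the OCR text is wrapped in section markers and the model completes the correction. A sketch of building such a prompt; the exact marker strings are an assumption reproduced from memory of the model card and should be verified there before use:

```python
def build_vintage_prompt(ocr_text: str) -> str:
    """Wrap OCRized text in the section markers OCRonos-Vintage
    is believed to expect, leaving the correction for the model
    to generate as a completion."""
    return f"### Text ###\n{ocr_text}\n\n### Correction ###\n"
```

The resulting string would be passed to the tokenizer and generation loop; everything the model emits after the final marker is the corrected text.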


👨‍🏫

orca_mini_13b

pankajmathur

Total Score

98

orca_mini_13b is an OpenLLaMa-13B model fine-tuned on explain-tuned datasets. The dataset was created using instructions and input from the WizardLM, Alpaca, and Dolly-V2 datasets, applying approaches from the Orca Research Paper. This helps the model learn the thought process of the teacher model, the GPT-3.5-turbo-0301 version of ChatGPT.

Model inputs and outputs

The orca_mini_13b model takes a combination of system prompts and user instructions as input, and generates relevant text responses as output.

Inputs

  • System prompt: A prompt that sets the context for the model, describing the role and goals of the AI assistant.
  • User instruction: The task or query that the user wants the model to address.
  • Input (optional): Additional context or information that the user provides to help the model complete the task.

Outputs

  • Response: The model's generated text response to the user's instruction, which aims to provide a detailed, thoughtful, step-by-step explanation.

Capabilities

The orca_mini_13b model is capable of generating high-quality, explain-tuned responses to a variety of tasks and queries. It demonstrates strong performance on reasoning-based benchmarks like BigBench-Hard and AGIEval, indicating its ability to engage in complex, logical thinking.

What can I use it for?

The orca_mini_13b model can be used for a range of applications that require detailed, step-by-step explanations, such as:

  • Educational or tutoring applications
  • Technical support and customer service
  • Research and analysis tasks
  • General question-answering and information retrieval

By leveraging the model's explain-tuned capabilities, users can gain a deeper understanding of the topics and concepts being discussed.

Things to try

One interesting thing to try with the orca_mini_13b model is to provide it with prompts or instructions that require it to take on different expert roles, such as a logician, mathematician, or physicist. This can help uncover the model's breadth of knowledge and its ability to tailor its responses to the specific needs of the task at hand. Another approach is to explore the model's performance on open-ended, creative tasks, such as generating poetry or short stories. The model's strong grounding in language and reasoning may translate into engaging and insightful creative output.


Nous-Hermes-Llama2-13b-GGML

NousResearch

Total Score

51

The Nous-Hermes-Llama2-13b is a state-of-the-art language model fine-tuned by Nous Research on over 300,000 instructions. It was developed through a collaborative effort with Teknium, Karan4D, Emozilla, Huemin Art, and Redmond AI. It builds upon the original Nous-Hermes-Llama2-7b and Nous-Hermes-13b models, inheriting their strengths while further improving capabilities.

Model inputs and outputs

Inputs

  • Instruction: A natural language description of a task for the model to complete.
  • Additional context: Optional additional information provided to the model to aid in understanding the task.

Outputs

  • Response: The model's generated output answering or completing the provided instruction.

Capabilities

The Nous-Hermes-Llama2-13b model stands out for its ability to provide long, coherent responses with a low rate of hallucination. It was also trained without the censorship mechanisms present in some other language models, allowing for more open-ended and creative outputs. Benchmark results show the model performing exceptionally well on a variety of tasks, scoring first place on ARC-c, ARC-e, Hellaswag, and OpenBookQA, and second place on Winogrande.

What can I use it for?

The Nous-Hermes-Llama2-13b model is suitable for a wide range of language tasks, from generating creative text to understanding and following complex instructions. Example use cases include building chatbots, virtual assistants, and content generation tools. The LM Studio and alpaca-discord projects provide examples of how this model can be integrated into practical applications.

Things to try

One key aspect of the Nous-Hermes-Llama2-13b model is its ability to provide long, thoughtful responses. This can be leveraged for tasks that require extended reasoning or exploration of a topic. Additionally, the model's lack of censorship mechanisms opens up possibilities for more open-ended and creative applications, such as roleplaying chatbots or speculative fiction generation.


🧪

30B-Lazarus

CalderaAI

Total Score

119

The 30B-Lazarus model is the result of an experimental approach to combining several large language models and specialized LoRAs (Low-Rank Adaptations) to create an ensemble model with enhanced capabilities. The composition includes models such as SuperCOT, gpt4xalpaca, and StoryV2, along with the manticore-30b-chat-pyg-alpha and Vicuna Unlocked LoRA models. The maintainer, CalderaAI, indicates that this experimental approach aims to additively apply desired features without paradoxically watering down the model's effective behavior.

Model inputs and outputs

The 30B-Lazarus model is a text-to-text AI model: it takes text as input and generates text as output. The model is primarily instruction-based, with the Alpaca instruct format being the primary input format, though the maintainer suggests the Vicuna instruct format may also work.

Inputs

  • Instruction: Text prompts or instructions for the model to follow, often in the Alpaca or Vicuna instruct format.
  • Context: Additional context or information provided to the model to inform its response.

Outputs

  • Generated text: The model's response to the provided input, ranging from short answers to longer, more detailed text.

Capabilities

The 30B-Lazarus model is designed to have enhanced capabilities in areas like reasoning, storytelling, and task completion compared to the base LLaMA model. By combining several specialized models and LoRAs, the maintainer aims to create a more comprehensive and capable language model. However, the maintainer notes that further experimental testing and evaluation is required to fully understand the model's capabilities and limitations.

What can I use it for?

The 30B-Lazarus model could potentially be used for a variety of natural language processing tasks, such as question answering, text generation, and problem-solving. The maintainer suggests that it may be particularly well-suited for text-based adventure games or interactive storytelling applications, where its enhanced storytelling and task-completion capabilities can be leveraged.

Things to try

When using the 30B-Lazarus model, the maintainer recommends experimenting with different presets and instructions to see how the model responds. They suggest trying the "Godlike" and "Storywriter" presets in tools like KoboldAI or Text-Generation-WebUI, and adjusting parameters like output length and temperature to find the best settings for your use case. Additionally, exploring the model's ability to follow chain-of-thought reasoning or provide detailed, creative responses to open-ended prompts could be an interesting area to investigate further.
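Presets like those mentioned above ultimately reduce to a handful of sampling parameters. A sketch of two illustrative configurations; the names and values here are placeholders for experimentation, not the actual KoboldAI preset values:

```python
# Illustrative sampling configurations; tune per use case.
PRESETS = {
    # Higher temperature favours varied, creative continuations.
    "storywriter-like": {"temperature": 1.0, "top_p": 0.95,
                         "max_new_tokens": 512},
    # Lower temperature favours focused, deterministic output.
    "precise": {"temperature": 0.3, "top_p": 0.9,
                "max_new_tokens": 256},
}


def get_preset(name: str) -> dict:
    """Return a copy of a named sampling configuration so callers
    can tweak values without mutating the shared defaults."""
    return dict(PRESETS[name])
```

Sweeping temperature between such endpoints while holding the prompt fixed is a quick way to map where a merged model like 30B-Lazarus shifts from literal task-following to freer storytelling.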
