TFT-ID-1.0

Maintainer: yifeihu

Total Score: 76

Last updated: 9/17/2024

🧪

  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided

Model Overview

The TFT-ID (Table/Figure/Text IDentifier) model is an object detection model fine-tuned to extract tables, figures, and text sections from academic papers. Developed by Yifei Hu, the model is based on the microsoft/Florence-2 checkpoints and was trained on over 36,000 manually annotated bounding boxes from the Hugging Face Daily Papers dataset.

The model takes an image of a single paper page as input and returns bounding boxes for all tables, figures, and text sections present, along with the corresponding labels. This makes it a useful tool for academic document processing workflows, as the extracted text sections can be easily fed into downstream OCR systems.

Model Inputs and Outputs

Inputs

  • Paper page image: The model takes an image of a single page from an academic paper as input.

Outputs

  • Object detection results: The model outputs a dictionary containing the bounding boxes and labels for all detected tables, figures, and text sections in the input image. The format is:
    {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]}}
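
For reference, the snippet below is a minimal inference sketch following the standard Florence-2 prompting pattern with the "<OD>" task prompt. It assumes the Hugging Face transformers library and the yifeihu/TFT-ID-1.0 checkpoint; the image path is a placeholder, and details may differ slightly from the official model card.

    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Load the TFT-ID checkpoint (Florence-2-based, so trust_remote_code is required)
    model_id = "yifeihu/TFT-ID-1.0"
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # Placeholder path: an image of a single paper page
    image = Image.open("paper_page.png").convert("RGB")

    # "<OD>" is the object-detection task prompt used by Florence-2-style models
    inputs = processor(text="<OD>", images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Convert the raw generation into the {'<OD>': {'bboxes': ..., 'labels': ...}} format
    result = processor.post_process_generation(
        generated_text, task="<OD>", image_size=(image.width, image.height)
    )
    print(result)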
    

Capabilities

The TFT-ID model is highly accurate, achieving a 96.78% success rate on a test set of 373 paper pages. It is particularly effective at identifying tables and figures, reaching a 98.84% success rate on a subset of 258 images.

The model's ability to extract clean text content from the identified text sections makes it a valuable tool for academic document processing pipelines. However, it is important to note that the TFT-ID model is not an OCR system itself, and the extracted text may still require further processing.

What Can I Use It For?

The TFT-ID model is well-suited for automating the extraction of tables, figures, and text sections from academic papers. This can be particularly useful for researchers, publishers, and academic institutions looking to streamline their document processing workflows.

Some potential use cases include:

  • Academic document processing: Integrating the TFT-ID model into a document processing pipeline to automatically identify and extract relevant content from academic papers.
  • Literature review automation: Using the model to rapidly locate and extract tables, figures, and key text sections from a large corpus of academic literature, facilitating more efficient literature reviews.
  • Dataset curation: Employing the TFT-ID model to generate structured datasets of tables, figures, and text from academic papers, which can then be used to train other AI models (see the cropping sketch after this list).
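
As a rough illustration of the dataset-curation workflow, the sketch below crops each detected region out of a page image and saves it under its label. It assumes a result dict in the format shown in the Outputs section above; the output directory name is arbitrary, and the exact label strings should be checked against what the model actually returns.

    import os
    from PIL import Image

    def save_detected_regions(image_path, result, out_dir="paper_crops"):
        """Crop every detected region from a page image and save it by label.

        `result` is the {'<OD>': {'bboxes': [...], 'labels': [...]}} dict
        returned by the model for that page."""
        os.makedirs(out_dir, exist_ok=True)
        image = Image.open(image_path).convert("RGB")
        detections = result["<OD>"]
        for i, (bbox, label) in enumerate(zip(detections["bboxes"], detections["labels"])):
            crop = image.crop(tuple(bbox))  # bbox is [x1, y1, x2, y2]
            # Label strings are illustrative; inspect the model output for the
            # exact names it uses for tables, figures, and text sections.
            safe_label = label.replace(" ", "_")
            crop.save(os.path.join(out_dir, f"page_{i:03d}_{safe_label}.png"))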

Things to Try

One interesting aspect of the TFT-ID model is its ability to handle a variety of table and figure formats, including both bordered and borderless elements. This robustness can be further explored by testing the model on academic papers with diverse layouts and visual styles.

Additionally, the model's integration with downstream OCR workflows presents opportunities for experimentation. Users could, for example, evaluate the quality and accuracy of the extracted text sections, and explore ways to optimize the overall document processing pipeline.
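
Along those lines, here is a small, hypothetical sketch of handing detected text sections to a downstream OCR step in parallel. The run_ocr callable is a placeholder for whatever OCR engine you choose, and the label filter is an assumption to adjust to the labels the model actually emits.

    from concurrent.futures import ThreadPoolExecutor
    from PIL import Image

    def ocr_text_sections(image_path, result, run_ocr, max_workers=4):
        """Crop detected text sections and run a downstream OCR callable over
        them in parallel. `run_ocr` should accept a PIL image and return text."""
        image = Image.open(image_path).convert("RGB")
        detections = result["<OD>"]
        crops = [
            image.crop(tuple(bbox))
            for bbox, label in zip(detections["bboxes"], detections["labels"])
            if "text" in label.lower()  # assumed label filter for text sections
        ]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(run_ocr, crops))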

Finally, the TFT-ID model's exceptional performance on table and figure detection tasks suggests that it could be a valuable component in more complex academic document understanding systems, such as those focused on automated summarization or knowledge extraction.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

🎲

TB-OCR-preview-0.1

yifeihu

Total Score: 115

TB-OCR-preview-0.1 is an end-to-end optical character recognition (OCR) model developed by Yifei Hu that can handle text, math LaTeX, and Markdown formats simultaneously. It takes an image of a block of text as input and returns clean Markdown output, with headers marked by ## and math expressions wrapped in \( inline math \) and \[ display math \] delimiters for easy parsing. The model does not require separate line detection or math formula detection.

Model inputs and outputs

Inputs

  • A block of text (as an image) containing a mix of regular text, math LaTeX, and Markdown formatting.

Outputs

  • Clean Markdown output with headers, math expressions, and other formatting properly identified.

Capabilities

TB-OCR-preview-0.1 can accurately extract and format text, math, and Markdown elements from a given block of text. This is particularly useful for tasks like digitizing scientific papers, notes, or other documents that contain a mix of these elements.

What can I use it for?

TB-OCR-preview-0.1 is well-suited for use cases where you need to convert scanned or photographed text, math, and Markdown content into a more structured, machine-readable format. This could include tasks like automating the digitization of research papers, lecture notes, or other technical documents.

Things to try

Consider combining TB-OCR-preview-0.1 with the TFT-ID-1.0 model, which specializes in text, table, and figure detection for full-page OCR. Running TFT-ID-1.0 first lets you split a page into smaller text blocks and process them with TB-OCR-preview-0.1 in parallel, which can be more efficient than running OCR on entire pages.

🔮

table-detection-and-extraction

foduucom

Total Score: 52

The table-detection-and-extraction model is an object detection model based on the YOLO (You Only Look Once) framework. It is designed to detect tables, whether they are bordered or borderless, in images. The model has been fine-tuned on a vast dataset and achieved high accuracy in detecting tables and distinguishing between bordered and borderless ones.

The model serves as a versatile solution for precisely identifying tables within images. Its capabilities extend beyond mere detection: it plays a crucial role in addressing the complexities of unstructured documents by enabling the isolation of tables of interest. Seamless integration with Optical Character Recognition (OCR) technology empowers the model to not only locate tables but also extract the pertinent data contained within.

Model inputs and outputs

Inputs

  • Images: The model takes image data as input and is capable of detecting and extracting tables from it.

Outputs

  • Bounding boxes: The model outputs bounding box information that delineates the location of tables within the input image.
  • Table data: By coupling the bounding box information with OCR, the model can extract the textual data contained within the detected tables.

Capabilities

The table-detection-and-extraction model excels at identifying tables, whether they have borders or not, within images. Its advanced techniques allow users to isolate tables of interest and extract the relevant data, streamlining the process of information retrieval from unstructured documents.

What can I use it for?

The table-detection-and-extraction model can be utilized in a variety of applications that involve processing unstructured documents. It can be particularly useful for tasks such as automated data extraction from financial reports, invoices, or other tabular documents. By integrating the model's capabilities, users can streamline their document analysis workflows and quickly retrieve important information.

Things to try

One key aspect to explore is the model's integration with OCR technology. By leveraging the bounding box information provided by the model, users can efficiently crop and extract the textual data within the detected tables. This combined approach can significantly enhance the accuracy and efficiency of document processing tasks. Additionally, you may want to experiment with customizing the model's parameters or fine-tuning it on your specific dataset to optimize its performance for your unique use case. The model's versatility allows for adaptations to address a wide range of unstructured document analysis challenges.

📈

Taiyi-Stable-Diffusion-XL-3.5B

IDEA-CCNL

Total Score: 53

The Taiyi-Stable-Diffusion-XL-3.5B is a powerful text-to-image model developed by IDEA-CCNL that builds upon the foundations of models like Google's Imagen and OpenAI's DALL-E 3. Unlike previous Chinese text-to-image models, which had moderate effectiveness, Taiyi-XL focuses on enhancing Chinese text-to-image generation while retaining English proficiency, addressing the unique challenges of bilingual language processing.

The training of the Taiyi-Diffusion-XL model involved several key stages. First, a high-quality dataset of image-text pairs was created, with advanced vision-language models generating accurate captions to enrich the dataset. Then, the model expanded the vocabulary and position encoding of a pre-trained English CLIP model to better support Chinese and longer texts. Finally, based on Stable-Diffusion-XL, the text encoder was replaced, and multi-resolution, aspect-ratio-variant training was conducted on the prepared dataset.

Similar models include the Taiyi-Stable-Diffusion-1B-Chinese-v0.1, which was the first open-source Chinese Stable Diffusion model, and AltDiffusion, a bilingual text-to-image diffusion model developed by BAAI.

Model inputs and outputs

Inputs

  • Prompt: A text description of the desired image, which can be in English or Chinese.

Outputs

  • Image: A visually compelling image generated based on the input prompt.

Capabilities

The Taiyi-Stable-Diffusion-XL-3.5B model excels at generating high-quality, detailed images from both English and Chinese text prompts. It can create a wide range of content, from realistic scenes to fantastical illustrations. The model's bilingual capabilities make it a valuable tool for artists and creators working with both languages.

What can I use it for?

The Taiyi-Stable-Diffusion-XL-3.5B model can be used for a variety of creative and professional applications. Artists and designers can leverage the model to generate concept art, illustrations, and other digital assets. Educators and researchers can use it to explore the capabilities of text-to-image generation and its applications in areas like art, design, and language learning. Developers can integrate the model into creative tools and applications to empower users with powerful image generation capabilities.

Things to try

One interesting aspect of the Taiyi-Stable-Diffusion-XL-3.5B model is its ability to generate high-resolution, long-form images. Try experimenting with prompts that describe complex scenes or panoramic views to see the model's capabilities in this area. You can also explore the model's performance on specific types of images, such as portraits, landscapes, or fantasy scenes, to understand its strengths and limitations.
