NuExtract

Maintainer: numind

Total Score: 140

Last updated: 7/31/2024


Run this model: Run on HuggingFace
API spec: View on HuggingFace
GitHub link: No GitHub link provided
Paper link: No paper link provided


Model overview

NuExtract is a version of the phi-3-mini model, fine-tuned by numind on a private, high-quality synthetic dataset for information extraction tasks. Compared to the base model, NuExtract is tailored to extracting specific information from input text. numind also publishes the larger NuExtract-large and the smaller NuExtract-tiny variants.

Model inputs and outputs

The NuExtract model takes two main inputs: a text passage (up to 2000 tokens) and a JSON template describing the information to extract. The model is purely extractive, meaning its output consists of text directly present in the original input. Users can also provide an example output format to help the model understand the task more precisely. A usage sketch follows the lists below.

Inputs

  • Text passage: A text document up to 2000 tokens in length
  • JSON template: A JSON object describing the information to extract from the text

Outputs

  • Extracted information: The relevant text from the input passage, formatted according to the provided JSON template or example
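
For illustration, here is a minimal usage sketch, assuming the model is loaded through the standard transformers API and follows the prompt layout described on its HuggingFace model card (template, then text, between input/output markers); the template fields and example text are invented:

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "numind/NuExtract"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

# Empty strings in the template mark the fields the model should fill in.
template = json.dumps({"Company": {"Name": "", "Founded": ""}}, indent=4)
text = "NuMind was founded in 2022 and is headquartered in Paris."  # invented example

# Prompt layout per the model card: template, then text, wrapped in
# <|input|>/<|output|> markers. Treat the exact delimiters as an assumption
# if you adapt this to a different checkpoint.
prompt = f"<|input|>\n### Template:\n{template}\n### Text:\n{text}\n<|output|>\n"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200)

# Decode only the newly generated tokens, i.e. the filled-in JSON.
generated = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```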

Capabilities

The NuExtract model excels at extracting specific pieces of information from input text. It can handle a variety of extraction tasks, such as pulling key facts, entities, or other structured data from documents. By fine-tuning the base phi-3-mini model, NuExtract has gained specialized capabilities for this type of information extraction while maintaining the strong reasoning and language understanding abilities of the original model.

What can I use it for?

The NuExtract model could be useful for any application that requires extracting structured data from text, such as:

  • Automating information retrieval from business documents or reports
  • Populating databases or knowledge graphs from unstructured data sources
  • Powering intelligent search or question-answering systems
  • Summarizing key details from lengthy technical or scientific papers

Since NuExtract is a fine-tuned version of a larger language model, it can also serve as a starting point for further customization and fine-tuning to meet the needs of specific domains or use cases.

Things to try

One interesting aspect of NuExtract is its ability to handle both the text input and the JSON template in a unified way. This allows for greater flexibility in how the extraction task is specified, as users can experiment with different template formats or even provide examples to guide the model's output. Developers could also explore combining NuExtract with other numind models, such as the SOTA Multilingual Entity Recognition Foundation Model, to tackle more complex information extraction challenges.
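
For example, an optional example block can be placed between the template and the text to pin down the output format. A minimal sketch, assuming the same prompt delimiters as in the earlier example (the field names and values here are invented):

```python
template = '{"Company": "", "Founded": ""}'
example = '{"Company": "Acme Corp", "Founded": "1999"}'  # hypothetical guiding example

text = "NuMind was founded in 2022 and is headquartered in Paris."

prompt = (
    "<|input|>\n"
    f"### Template:\n{template}\n"
    f"### Example:\n{example}\n"
    f"### Text:\n{text}\n"
    "<|output|>\n"
)
```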



This summary was produced with help from an AI and may contain inaccuracies; check out the links to read the original source documents!

Related Models


NuExtract-large

Maintainer: numind

Total Score: 102

NuExtract-large is a version of the Phi-3-small model, fine-tuned by NuMind on a private high-quality synthetic dataset for information extraction. It is a text-to-text model designed for extracting structured information from input text. Compared to similar models like NuNER-v0.1 and NuNER-multilingual-v0.1, which focus on entity recognition, NuExtract-large is specialized for more general information extraction tasks: it extracts relevant information from input text based on a provided JSON template.

Model inputs and outputs

NuExtract-large takes input text and a JSON template as input, and generates the extracted information as output.

Inputs

  • Input text: The text to extract from, up to 2000 tokens long.
  • JSON template: A JSON template that describes the information the user wants to extract from the input text.
  • Example output: An optional example of the desired output formatting to help the model understand the task.

Outputs

  • Extracted information: The model's attempt at extracting the requested information from the input text, formatted according to the provided JSON template.

Capabilities

NuExtract-large is capable of extracting structured information from input text based on a provided template. It can handle a variety of information extraction tasks, from extracting key entities and facts to summarizing longer passages of text. Its fine-tuning on a high-quality synthetic dataset gives it strong performance on information extraction: it outperforms the base Phi-3-small model on these tasks.

What can I use it for?

NuExtract-large could be useful for a variety of applications that require extracting structured information from text, such as:

  • Automating data entry from documents or web pages
  • Summarizing long passages of text into key facts and entities
  • Powering intelligent search and question-answering systems
  • Streamlining business processes by extracting relevant information

Companies could potentially monetize NuExtract-large by building applications and services that leverage its information extraction capabilities, as NuMind, the model's maintainer, does with NuExtract.

Things to try

One interesting thing to try with NuExtract-large is extracting information from longer, more complex input texts; its fine-tuning on a high-quality dataset suggests it may handle such inputs well, going beyond simple entity extraction to summarize key facts and relationships. Another idea is to experiment with different levels of detail in the JSON template and example output to see how they affect the model's performance. This can help refine the template and instructions to get the most accurate extractions for your specific use case.



NuNER-v0.1

Maintainer: numind

Total Score: 57

The NuNER-v0.1 model is an English-language entity recognition model fine-tuned from the RoBERTa-base model by the team at NuMind. It provides strong token embeddings for entity recognition tasks in English and was the prototype for the NuNER v1.0 model, which is the version reported in the paper introducing the model. NuNER-v0.1 outperforms the base RoBERTa-base model on entity recognition, achieving an F1 macro score of 0.7500 compared to 0.7129 for RoBERTa-base; combining the last and second-to-last hidden states further improves performance to 0.7686 F1 macro. Other notable entity recognition models include bert-base-NER, a BERT-base model fine-tuned on the CoNLL-2003 dataset, and roberta-large-ner-english, a RoBERTa-large model fine-tuned for English NER.

Model inputs and outputs

Inputs

  • Text: The model takes raw text as input, which it tokenizes and encodes for processing.

Outputs

  • Entity predictions: A sequence of entity predictions for the input text, classifying each token as one of four entity types: location (LOC), organization (ORG), person (PER), or miscellaneous (MISC).
  • Token embeddings: The model can also be used to extract token-level embeddings for downstream tasks. The author suggests using the concatenation of the last and second-to-last hidden states for better-quality embeddings.

Capabilities

The NuNER-v0.1 model is highly capable at recognizing entities in English text, surpassing the base RoBERTa model on the CoNLL-2003 NER dataset. It can accurately identify locations, organizations, people, and miscellaneous entities within input text. This makes it a powerful tool for applications that require understanding the entities mentioned in documents, such as information extraction, knowledge graph construction, or content analysis.

What can I use it for?

The NuNER-v0.1 model can be used for a variety of applications that involve identifying and extracting entities from English text. Some potential use cases include:

  • Information extraction: Automatically extracting key entities (people, organizations, locations, etc.) from documents, articles, or other text-based data sources.
  • Knowledge graph construction: Using the entity predictions from the model to populate a knowledge graph with structured information about the entities mentioned in a corpus.
  • Content analysis: Enabling more sophisticated content analysis tasks, such as topic modeling, sentiment analysis, or text summarization, by understanding the entities present in text.
  • Chatbots and virtual assistants: Leveraging the model's entity recognition capabilities to improve the natural language understanding of chatbots and virtual assistants, allowing them to better comprehend user queries and respond appropriately.

Things to try

One interesting aspect of the NuNER-v0.1 model is its ability to produce high-quality token embeddings by concatenating the last and second-to-last hidden states. These embeddings can be used as input features for a wide range of downstream NLP tasks, such as text classification, named entity recognition, or relation extraction; a minimal sketch follows below. Experimenting with different ways of utilizing these embeddings, such as fine-tuning on domain-specific datasets or combining them with other model architectures, could lead to new applications and performance improvements.

Another avenue to explore is comparing the model's performance on different types of text, beyond the news-based CoNLL-2003 dataset used for evaluation. Trying the model on more informal, conversational text (e.g., social media, emails, chat logs) could uncover interesting insights about its generalization capabilities and potential areas for improvement.
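
As a concrete illustration of the embedding trick, here is a minimal sketch, assuming the model is published under the numind/NuNER-v0.1 repo id and exposes hidden states through the standard transformers API; the input sentence is invented:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; substitute the actual repo id if it differs.
model_name = "numind/NuNER-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "NuMind is an AI company based in Paris."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# "Two emb trick": concatenate the last and second-to-last hidden states
# to obtain richer token embeddings for a downstream NER head.
embeddings = torch.cat(
    [outputs.hidden_states[-1], outputs.hidden_states[-2]], dim=-1
)
print(embeddings.shape)  # (batch, seq_len, 2 * hidden_size)
```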



NuNER_Zero

Maintainer: numind

Total Score: 57

NuNER Zero is a zero-shot Named Entity Recognition (NER) model developed by numind. It uses the GLiNER architecture, which takes a concatenation of entity types and text as input. Unlike GLiNER, NuNER Zero is a token classifier, allowing it to detect arbitrarily long entities. The model was trained on the NuNER v2.0 dataset, which combines subsets of Pile and C4 annotated using Large Language Models (LLMs). At the time of its release, NuNER Zero was the best compact zero-shot NER model, outperforming GLiNER-large-v2.1 by 3.1% token-level F1 score on GLiNER's benchmark.

Model inputs and outputs

Inputs

  • Text: The input text for named entity recognition.
  • Entity types: The set of entity types to detect in the input text.

Outputs

  • Entities: A list of detected entities, where each entity contains the text of the detected entity, its label (entity type), and its start and end indices in the input text.

Capabilities

NuNER Zero can detect a wide range of entity types in text, including organizations, initiatives, projects, and more. It achieves this through its zero-shot capabilities, which allow it to identify entities without being trained on a specific set of predefined types. The model's token-level classification approach also enables it to detect long entities that span multiple tokens, a limitation of traditional NER models.

What can I use it for?

NuNER Zero can be a valuable tool for a variety of natural language processing tasks, such as:

  • Content analysis: Extracting relevant entities from text, such as news articles, research papers, or social media posts, to gain insights and understand the key topics and concepts.
  • Knowledge graph construction: Building knowledge graphs by identifying and linking entities in large text corpora, which can be used for tasks like question answering and recommendation systems.
  • Business intelligence: Automating the extraction of relevant entities from customer support tickets, financial reports, or product descriptions to support decision-making and process optimization.

Things to try

One interesting aspect of NuNER Zero is its ability to detect entities without being trained on a predefined set of types. This makes it a versatile tool that can be applied to a wide range of domains and use cases. To get the most out of the model, experiment with different entity types and see how it performs on your specific data and requirements. You could also explore combining NuNER Zero with other natural language processing models, such as relation extraction or sentiment analysis, to build more comprehensive text understanding pipelines.
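
A minimal usage sketch via the gliner Python package, through which GLiNER-family models are typically loaded; the entity types and text are invented for illustration, and the exact API may differ across gliner versions:

```python
# pip install gliner
from gliner import GLiNER

model = GLiNER.from_pretrained("numind/NuNER_Zero")

text = (
    "The Global Partnership for Sustainable Development launched "
    "its flagship clean-water initiative in 2021."
)
# Arbitrary, user-chosen types; the model card suggests lower-cased labels.
labels = ["organization", "initiative", "date"]

for entity in model.predict_entities(text, labels):
    print(f'{entity["text"]} => {entity["label"]} [{entity["start"]}:{entity["end"]}]')
```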



NuNER-multilingual-v0.1

Maintainer: numind

Total Score: 57

The NuNER-multilingual-v0.1 model is a multilingual entity recognition foundation model developed by NuMind. It is built on top of the Multilingual BERT (mBERT) model and has been fine-tuned on an artificially annotated subset of the OSCAR dataset. The model provides domain- and language-independent embeddings for the entity recognition task, supporting over 9 languages. Compared to the base mBERT model, NuNER-multilingual-v0.1 demonstrates superior performance, with an F1 macro score of 0.5892 versus 0.5206 for mBERT. Using the "two emb trick" (concatenating the hidden states from the last and second-to-last layers) further improves performance to an F1 macro score of 0.6231.

Model inputs and outputs

Inputs

  • Textual data in one of the supported languages

Outputs

  • Embeddings that can be used for downstream entity recognition tasks

Capabilities

The NuNER-multilingual-v0.1 model excels at providing high-quality embeddings for the entity recognition task, with the ability to generalize across different languages and domains. This makes it a valuable tool for a wide range of natural language processing applications, including named entity recognition, knowledge extraction, and information retrieval.

What can I use it for?

The NuNER-multilingual-v0.1 model can be leveraged in various use cases, such as:

  • Developing multilingual information extraction systems
  • Building knowledge graphs and knowledge bases from unstructured text
  • Enhancing search and recommendation engines with entity-based features
  • Improving chatbots and virtual assistants with better understanding of named entities

Things to try

One interesting aspect of the NuNER-multilingual-v0.1 model is the "two emb trick": by concatenating the hidden states from the last and second-to-last layers of the model, you can obtain embeddings with even better performance for your entity recognition tasks (the same technique sketched in code for NuNER-v0.1 above).
