Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset

2310.10118

Published 4/9/2024 by Arthur Amalvy (LIA), Vincent Labatut (LIA), Richard Dufour (LS2N - équipe TALN)

Abstract

While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, a solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.

Overview

Recent transformer-based models excel at named entity recognition (NER), but struggle with long documents like novels.
To address this, the research proposes using an instruction-tuned large language model (LLM) to generate a synthetic dataset for training a context retriever.
This context retriever, based on a BERT model, can then find relevant context to improve NER performance on long-form text.

Plain English Explanation

Neural networks like BERT have become very good at recognizing named entities (people, places, organizations, etc.) in short text. However, they often struggle when applied to longer documents like novels.

The key insight of this research is that the model needs more "context" - background information about the document - to accurately identify entities in lengthy text. But obtaining this context supervision is difficult, as it requires manual labeling.

Instead, the researchers used an instruction-tuned language model (a model fine-tuned to follow natural-language instructions) to generate a synthetic dataset of context retrieval examples. They then trained a separate BERT-based model on this dataset to select the passages of a document that are most useful as context for named entity recognition on long-form text.

By leveraging the capabilities of large language models, the researchers were able to sidestep the challenge of costly manual supervision and create an effective context retrieval system.

Technical Explanation

The paper first acknowledges the strong performance of modern transformer-based models on named entity recognition (NER) tasks. However, it notes that these models struggle when applied to longer documents, as they have difficulty capturing the relevant context needed for accurate entity recognition.

To address this, the researchers propose an approach that first generates a synthetic dataset for training a context retrieval model. They use Alpaca, an instruction-tuned LLM, to generate the context retrieval examples that make up this dataset.
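As a rough illustration of what such a synthetic dataset could look like, the sketch below pairs each sentence with one generated "helpful" context and one unrelated context. Everything here is an assumption made for illustration: the `RetrievalExample` structure, the prompt wording, the `fake_llm` stand-in, and the way negatives are chosen are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalExample:
    sentence: str  # sentence on which NER will be run
    context: str   # candidate context sentence
    label: int     # 1 if the context is considered helpful, 0 otherwise


def build_synthetic_dataset(
    sentences: List[str],
    generate: Callable[[str], str],
    negatives: List[str],
) -> List[RetrievalExample]:
    """Pair each sentence with one generated positive and one negative context."""
    examples = []
    for i, sentence in enumerate(sentences):
        # Hypothetical prompt; the actual prompts are not given in this summary.
        prompt = f"Write a sentence that helps identify the entities in: {sentence}"
        examples.append(RetrievalExample(sentence, generate(prompt), 1))
        # Assumption: negatives are drawn from a pool of unrelated sentences.
        examples.append(RetrievalExample(sentence, negatives[i % len(negatives)], 0))
    return examples


def fake_llm(prompt: str) -> str:
    """Stand-in for an instruction-tuned LLM such as Alpaca (no model is called)."""
    return "Generated context for: " + prompt.split(": ", 1)[1]


dataset = build_synthetic_dataset(
    ["Elizabeth walked to Netherfield."],
    fake_llm,
    ["It rained all day."],
)
```

In a real pipeline, `generate` would wrap a call to the LLM, and the resulting labeled pairs would serve as supervision for the retriever.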

This generated dataset is then used to train a BERT-based neural context retriever. This model learns to identify the most salient passages to include as context for the NER task on long-form text.
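The retrieval step can be sketched as scoring every other sentence in the document against the target sentence, keeping the top k, and adding them to the NER input. This is a minimal sketch under stated assumptions: `overlap_score` is a toy word-overlap function standing in for the trained BERT scorer, and `retrieve_context`, the value of `k`, and the simple concatenation are hypothetical choices, not the paper's implementation.

```python
from typing import Callable, List


def overlap_score(sentence: str, candidate: str) -> float:
    """Toy relevance score: share of the sentence's words found in the candidate.

    Stand-in for a trained neural scorer; used only to keep the example runnable.
    """
    sent_words = set(sentence.lower().split())
    cand_words = set(candidate.lower().split())
    if not sent_words:
        return 0.0
    return len(sent_words & cand_words) / len(sent_words)


def retrieve_context(
    sentence: str,
    document: List[str],
    score: Callable[[str, str], float],
    k: int = 2,
) -> List[str]:
    """Return the k document sentences with the highest score for `sentence`."""
    candidates = [s for s in document if s != sentence]
    ranked = sorted(candidates, key=lambda c: score(sentence, c), reverse=True)
    return ranked[:k]


document = [
    "Mr. Darcy arrived at the ball.",
    "The weather was pleasant.",
    "Darcy refused to dance.",
    "Everyone noticed Darcy at the ball.",
]
target = "Darcy refused to dance."
context = retrieve_context(target, document, overlap_score, k=2)

# Assumption: retrieved context is simply placed before the target sentence.
ner_input = " ".join(context + [target])
```

Swapping `overlap_score` for a learned scorer leaves the rest of the loop unchanged, which is why the scoring function is passed in as a parameter.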

The researchers evaluate their approach on an English literary dataset composed of the first chapters from 40 books. They show that their context retrieval model outperforms several baseline retrieval methods in terms of improving NER performance on this challenging long-form text.

Critical Analysis

The paper presents a clever and well-executed solution to the problem of applying NER models to long-form documents. By leveraging the capabilities of instruction-tuned language models, the researchers are able to sidestep the challenge of obtaining manual supervision for context retrieval.

However, the research is limited to a relatively small dataset of book excerpts. It would be valuable to see how the approach generalizes to larger and more diverse long-form corpora, such as full novels or academic papers.

Additionally, the paper does not delve into the potential biases or limitations of the Alpaca model used for generating the synthetic dataset. The quality and representativeness of this dataset could have a significant impact on the performance of the downstream context retriever.

Further research could also explore more sophisticated context retrieval strategies, such as reinforcement learning or few-shot learning techniques, to improve the model's ability to identify the most relevant passages for NER.

Conclusion

This research demonstrates an innovative approach to enhancing named entity recognition for long-form text by leveraging the power of large language models. The use of a synthetic context retrieval dataset, generated by an instruction-tuned LLM, allows the researchers to sidestep the challenge of manual supervision and create an effective context-aware NER system.

While the evaluation is limited to a specific literary dataset, the core ideas presented in this paper could have broader implications for improving the performance of natural language processing models on lengthy, complex documents. As language models continue to advance, we may see more creative solutions that harness their capabilities to tackle challenging real-world problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LTNER: Large Language Model Tagging for Named Entity Recognition with Contextualized Entity Marking

Faren Yan, Peng Yu, Xin Chen

The use of LLMs for natural language processing has become a popular trend in the past two years, driven by their formidable capacity for context comprehension and learning, which has inspired a wave of research from academics and industry professionals. However, for certain NLP tasks, such as NER, the performance of LLMs still falls short when compared to supervised learning methods. In our research, we developed a NER processing framework called LTNER that incorporates a revolutionary Contextualized Entity Marking Gen Method. By leveraging the cost-effective GPT-3.5 coupled with context learning that does not require additional training, we significantly improved the accuracy of LLMs in handling NER tasks. The F1 score on the CoNLL03 dataset increased from the initial 85.9% to 91.9%, approaching the performance of supervised fine-tuning. This outcome has led to a deeper understanding of the potential of LLMs.

4/9/2024

cs.CL cs.AI

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos

Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs' information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., 10.5% improvement on 20 documents MDQA at position 10 for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data cause no performance drop while other baseline data can cause a drop that ranges from 2.33% to 6.19%). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.

6/28/2024

cs.LG cs.AI cs.CL

Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel

We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator". We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.

6/4/2024

cs.CL cs.AI

A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition

Haojie Zhang, Yimeng Zhuang

Few-shot Named Entity Recognition (NER) aims to extract named entities using only a limited number of labeled examples. Existing contrastive learning methods often suffer from insufficient distinguishability in context vector representation because they either solely rely on label semantics or completely disregard them. To tackle this issue, we propose a unified label-aware token-level contrastive learning framework. Our approach enriches the context by utilizing label semantics as suffix prompts. Additionally, it simultaneously optimizes context-context and context-label contrastive learning objectives to enhance generalized discriminative contextual representations. Extensive experiments on various traditional test domains (OntoNotes, CoNLL'03, WNUT'17, GUM, I2B2) and the large-scale few-shot NER dataset (FEWNERD) demonstrate the effectiveness of our approach. It outperforms prior state-of-the-art models by a significant margin, achieving an average absolute gain of 7% in micro F1 scores across most scenarios. Further analysis reveals that our model benefits from its powerful transfer capability and improved contextual representations.

5/9/2024

cs.CL