Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages

2405.04829

Published 5/13/2024 by Sankalp Bahad, Pruthwik Mishra, Karunesh Arora, Rakesh Chandra Balabantaray, Dipti Misra Sharma, Parameswari Krishnamurthy

cs.CL

👁️

Abstract

Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian Languages. We present a human annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families. Additionally,we present a multilingual model fine-tuned on our dataset, which achieves an F1 score of 0.80 on our dataset on average. We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.

Create account to get full access

Overview

Named Entity Recognition (NER) is a crucial component of Natural Language Processing (NLP) applications such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems.
While NER research has primarily focused on English and other major languages, limited attention has been given to Indian languages.
This paper analyzes the challenges and proposes techniques tailored for Multilingual Named Entity Recognition for Indian Languages.
The researchers present a human-annotated named entity corpus of 40K sentences across 4 Indian languages from two major Indian language families.
Additionally, they present a multilingual model fine-tuned on their dataset, which achieves an F1 score of 0.80 on average, and performs comparably on completely unseen benchmark datasets for Indian languages.

Plain English Explanation

Named Entity Recognition (NER) is a technique used in Natural Language Processing (NLP) to identify and classify important words or phrases in text, such as names of people, organizations, locations, and more. This is a useful tool for a variety of applications, like machine translation, text summarization, and question-answering systems.

Most of the research on NER has focused on English and some other major languages, but has not given much attention to Indian languages. This paper looks at the challenges involved in doing NER for Indian languages and proposes some techniques to address them.

The researchers first created a dataset of 40,000 sentences in 4 different Indian languages, with humans annotating the named entities in the text. This gives them a high-quality dataset to train and test NER models on.

They then developed a multilingual NER model that can be used across these Indian languages. This model achieved an average F1 score of 0.80 on their dataset, which is a strong performance. Importantly, the model also did well on completely new datasets for Indian languages, showing that it can be applied more broadly.

Overall, this research helps advance the state-of-the-art for NER in Indian languages, which is an important step in making NLP technologies more accessible and useful for a wider range of languages and communities.

Technical Explanation

This paper focuses on the challenge of Multilingual Named Entity Recognition (NER) for Indian languages. The researchers present a human-annotated named entity corpus of 40,000 sentences across 4 Indian languages from two major language families (Indo-Aryan and Dravidian). This provides a high-quality dataset to train and evaluate NER models.

The researchers then develop a multilingual NER model that is fine-tuned on their dataset. This model uses a contrastive learning approach to learn common representations across the different languages. The model achieves an average F1 score of 0.80 on their dataset, which is a strong performance.

Importantly, the researchers also evaluate their model on completely unseen benchmark datasets for Indian languages. They find that the model achieves comparable performance on these external datasets, demonstrating its broader applicability.

The paper also includes an analysis of the challenges involved in NER for Indian languages, such as the rich morphology, complex scripts, and lack of large annotated datasets. The proposed techniques, including the multilingual model and dataset, aim to address these challenges.

Critical Analysis

The paper makes a valuable contribution by addressing the underexplored area of NER for Indian languages. The creation of a high-quality annotated dataset is a significant achievement, as the lack of such resources has been a major bottleneck in this field.

The multilingual NER model presented in the paper also shows promising results, with its ability to generalize to unseen datasets being a notable strength. However, the paper could have provided more details on the model architecture and training process, which would allow for a deeper technical understanding and potentially easier replication.

While the researchers mention the challenges of Indian language NER, such as rich morphology and complex scripts, the paper could have delved deeper into these issues and how the proposed techniques specifically address them. Additionally, further analysis of the model's performance across different entity types and language families could have provided additional insights.

Overall, this research represents an important step forward in expanding NER capabilities to a broader range of languages. The dataset and model can serve as valuable resources for the research community, inspiring further work in this area.

Conclusion

This paper tackles the important problem of Multilingual Named Entity Recognition (NER) for Indian languages, an area that has received limited attention in the past. By creating a high-quality annotated dataset and developing a strong multilingual NER model, the researchers have made significant progress in this underexplored field.

The key contributions of this work include:

Presentation of a human-annotated named entity corpus of 40,000 sentences across 4 Indian languages, providing a valuable resource for the research community.
Development of a multilingual NER model that achieves strong performance, with an average F1 score of 0.80, and demonstrates the ability to generalize to unseen datasets.
Analysis of the unique challenges involved in NER for Indian languages, such as rich morphology and complex scripts, and the proposed techniques to address them.

These advancements have the potential to significantly improve the performance of various NLP applications, such as machine translation, text summarization, and question-answering systems, for a wider range of Indian language users. The dataset and model presented in this paper can serve as valuable resources for further research and development in this important field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Deep Learning Based Named Entity Recognition Models for Recipes

Mansi Goel, Ayush Agarwal, Shubham Agrawal, Janak Kapuriya, Akhil Vamshi Konam, Rishabh Gupta, Shrey Rastogi, Niharika, Ganesh Bagler

Food touches our lives through various endeavors, including flavor, nourishment, health, and sustainability. Recipes are cultural capsules transmitted across generations via unstructured text. Automated protocols for recognizing named entities, the building blocks of recipe text, are of immense value for various applications ranging from information extraction to novel recipe generation. Named entity recognition is a technique for extracting information from unstructured or semi-structured data with known labels. Starting with manually-annotated data of 6,611 ingredient phrases, we created an augmented dataset of 26,445 phrases cumulatively. Simultaneously, we systematically cleaned and analyzed ingredient phrases from RecipeDB, the gold-standard recipe data repository, and annotated them using the Stanford NER. Based on the analysis, we sampled a subset of 88,526 phrases using a clustering-based approach while preserving the diversity to create the machine-annotated dataset. A thorough investigation of NER approaches on these three datasets involving statistical, fine-tuning of deep learning-based language models and few-shot prompting on large language models (LLMs) provides deep insights. We conclude that few-shot prompting on LLMs has abysmal performance, whereas the fine-tuned spaCy-transformer emerges as the best model with macro-F1 scores of 95.9%, 96.04%, and 95.71% for the manually-annotated, augmented, and machine-annotated datasets, respectively.

6/7/2024

cs.CL cs.AI cs.IR

💬

2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion

Dongsheng Wang, Xiaoqin Feng, Zeming Liu, Chuan Wang

Named entity recognition (NER) is a fundamental task in natural language processing that involves identifying and classifying entities in sentences into pre-defined types. It plays a crucial role in various research fields, including entity linking, question answering, and online product recommendation. Recent studies have shown that incorporating multilingual and multimodal datasets can enhance the effectiveness of NER. This is due to language transfer learning and the presence of shared implicit features across different modalities. However, the lack of a dataset that combines multilingualism and multimodality has hindered research exploring the combination of these two aspects, as multimodality can help NER in multiple languages simultaneously. In this paper, we aim to address a more challenging task: multilingual and multimodal named entity recognition (MMNER), considering its potential value and influence. Specifically, we construct a large-scale MMNER dataset with four languages (English, French, German and Spanish) and two modalities (text and image). To tackle this challenging MMNER task on the dataset, we introduce a new model called 2M-NER, which aligns the text and image representations using contrastive learning and integrates a multimodal collaboration module to effectively depict the interactions between the two modalities. Extensive experimental results demonstrate that our model achieves the highest F1 score in multilingual and multimodal NER tasks compared to some comparative and representative baselines. Additionally, in a challenging analysis, we discovered that sentence-level alignment interferes a lot with NER models, indicating the higher level of difficulty in our dataset.

4/29/2024

cs.CL cs.AI

llmNER: (Zero|Few)-Shot Named Entity Recognition, Exploiting the Power of Large Language Models

Fabi'an Villena, Luis Miranda, Claudio Aracena

Large language models (LLMs) allow us to generate high-quality human-like text. One interesting task in natural language processing (NLP) is named entity recognition (NER), which seeks to detect mentions of relevant information in documents. This paper presents llmNER, a Python library for implementing zero-shot and few-shot NER with LLMs; by providing an easy-to-use interface, llmNER can compose prompts, query the model, and parse the completion returned by the LLM. Also, the library enables the user to perform prompt engineering efficiently by providing a simple interface to test multiple variables. We validated our software on two NER tasks to show the library's flexibility. llmNER aims to push the boundaries of in-context learning research by removing the barrier of the prompting and parsing steps.

6/10/2024

cs.CL

MSNER: A Multilingual Speech Dataset for Named Entity Recognition

Quentin Meeus, Marie-Francine Moens, Hugo Van hamme

While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It provides annotations to the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We have also releasing an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. This results in 590 and 15 hours of silver-annotated speech for training and validation, alongside a 17-hour, manually-annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.

5/21/2024

cs.CL cs.LG