ReALM: Reference Resolution As Language Modeling

Read original: arXiv:2403.20329 - Published 8/20/2024 by Joel Ruben Antony Moniz, Soundarya Krishnan, Melis Ozyildirim, Prathamesh Saraf, Halim Cagri Ates, Yuan Zhang, Hong Yu

Introduction

The paper discusses the importance of understanding context, including ambiguous references, for conversational assistants to communicate effectively with users. While recent large language models (LLMs) can handle some contextual understanding, the authors argue for the continued value of traditional NLP pipelines in certain scenarios.

Specifically, they highlight four cases where a pipeline approach may be preferable: 1) on low-power devices with limited computing resources, large end-to-end models may be infeasible; 2) when a system must integrate with existing APIs or components, overhauling it into a single LLM may be cumbersome; 3) a modular approach allows improved reference resolution modules to be swapped in transparently; and 4) reference resolution needs to handle not just conversational context but also on-screen context.

The authors propose using smaller, fine-tuned language models specifically for reference resolution. To handle on-screen context, they suggest reconstructing the screen as a textual representation with entities tagged, allowing the language model to "see" and resolve references to on-screen elements. This is presented as a novel approach for enabling LLMs to understand on-screen context.

Related Work and Motivation

The paper discusses the need for conversational agents to understand on-screen references, which differ from visual and deictic references. On-screen references tend to be more structured and textual, enabling a text-only approach without visual components. The tasks built on them are often action-oriented rather than question-answering based, and they involve synthetic screens rather than natural images.

The paper notes that jointly handling conversational and on-screen references has been relatively unexplored. While vision transformers and pre-trained models have gained prominence for visual understanding tasks, they are trained on natural images rather than screenshots, have different distributions, and can be computationally expensive.

The paper identifies limitations in existing approaches, such as relying on manually onboarding new entity types, treating each type distinctly without leveraging similarities, using hand-crafted rules and heuristics that lack robustness and semantic understanding, and classifying entity relevance independently without considering the whole screen context.

Task

The task involves identifying relevant entities of three types: on-screen entities currently displayed, conversational entities mentioned previously, and background entities from processes not directly visible. The goal is to extract the entities pertinent to the user's current query. The task is framed as a multiple-choice problem, where the model outputs the relevant entities from the options shown on the user's screen, or "None of these" if applicable. During evaluation, any permutation of the correct entities is accepted as a valid answer.
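The permutation-invariant scoring described above can be sketched as a set comparison. This is a minimal illustration, not the authors' evaluation code: the comma-separated answer format and the helper names `parse_answer` and `is_correct` are assumptions made for the example.

```python
from typing import Iterable, List

NONE_OF_THESE = "None of these"  # label taken from the task description


def parse_answer(raw: str) -> List[str]:
    """Split a comma-separated answer into option labels (hypothetical output format)."""
    raw = raw.strip()
    if not raw or raw == NONE_OF_THESE:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_correct(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """Return True when prediction and ground truth contain the same entities, in any order."""
    return set(predicted) == set(gold)


print(is_correct(parse_answer("3, 1"), ["1", "3"]))        # order does not matter
print(is_correct(parse_answer("None of these"), []))       # empty gold set
```

Comparing sets also ignores repeated labels in the prediction; a stricter scorer could reject duplicates instead.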

Datasets

The datasets used contain user queries and a list of entities, along with the ground-truth relevant entity(ies) for each query. Each entity has information like type, name, and other details. If on-screen context exists, the entity's bounding box and surrounding objects/properties are included.

For conversational data, annotators were shown synthetic entity lists and asked to provide queries that unambiguously reference a chosen entity. For example, referring to a specific business by saying "Take me to the one second from the bottom."

Synthetic data was generated from two templates: a base template containing mentions, entities, and slot values, and a second template containing query variations for the entity references defined in the first. Queries were created by substituting the references into these variations.
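A rough sketch of this two-template generation follows. The template contents, field names, and the `expand` helper are all hypothetical; the paper's actual templates are not reproduced in this summary.

```python
from itertools import product

# Hypothetical base template: a mention pattern plus the slot values that fill it.
base_template = {
    "mention": "the {ordinal} one",
    "slot_values": {"ordinal": ["first", "second", "last"]},
}
# Hypothetical second template: query variations that embed the mention.
query_variations = ["Call {mention}", "Take me to {mention}"]


def expand(base, variations):
    """Fill slot values into the mention, then substitute each mention into each query."""
    slots = base["slot_values"]
    names = list(slots)
    mentions = [
        base["mention"].format(**dict(zip(names, combo)))
        for combo in product(*(slots[name] for name in names))
    ]
    return [query.format(mention=m) for query in variations for m in mentions]


queries = expand(base_template, query_variations)
for q in queries:
    print(q)
```

With three slot values and two query variations, the sketch yields six synthetic queries.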

For on-screen data, annotators first classified displayed information into entity types like phone numbers and emails, then provided queries for that information. In a second phase, other annotators identified which entity in the list was referenced by each query.

Models

The paper compares the proposed model ReALM with two baseline approaches: a re-implementation of the reference resolver from a previous paper (MARRS), and ChatGPT using GPT-3.5 and GPT-4.

For the MARRS baseline, they trained a re-implementation of the system proposed in a previous paper, which is not based on large language models. This baseline was specifically designed for reference resolution.

For the ChatGPT baseline, they used GPT-3.5 and GPT-4. GPT-3.5 received only the text prompt, while GPT-4, which accepts images, received the prompt together with a screenshot for on-screen reference resolution.

Their approach, ReALM, involves fine-tuning a FLAN-T5 language model. They convert the data into a sentence format to feed to the model, shuffling the entities to prevent overfitting to positions. For conversational references, they assume two types: type-based (relying on entity types) and descriptive (relying on properties). Their encoding captures both types and properties.
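To make the idea of a shuffled, sentence-style input concrete, here is a minimal sketch. The prompt wording, the field names, and the `build_prompt` helper are assumptions for illustration only; the summary does not give the exact format used for FLAN-T5.

```python
import random
from typing import Dict, List, Tuple


def build_prompt(
    query: str, entities: List[Dict[str, str]], seed: int = 0
) -> Tuple[str, List[Dict[str, str]]]:
    """Serialize entities as numbered options in a shuffled order.

    Returns the prompt and the shuffled entity list, so that option
    numbers can be mapped back to entities afterwards.
    """
    rng = random.Random(seed)
    shuffled = entities[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)   # avoid the model keying on entity position
    lines = [f"User request: {query}", "Entities:"]
    for i, ent in enumerate(shuffled, start=1):
        # Include the type (for type-based references) and the remaining
        # properties (for descriptive references).
        props = ", ".join(f"{k}: {v}" for k, v in ent.items() if k != "type")
        lines.append(f"{i}. type: {ent['type']} | {props}")
    lines.append("Which entities does the request refer to?")
    return "\n".join(lines), shuffled
```

During training, a different seed per example (or per epoch) would vary the option order; the ground-truth labels must be remapped to the shuffled numbering.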

For on-screen references, they assume upstream detectors can parse screen text and extract entities with types, bounding boxes, and surrounding text. They use a novel algorithm to encode the screen layout as text based on sorting bounding box centers top-to-bottom and left-to-right with line breaks.
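The layout encoding can be sketched roughly as below. This is an illustrative reconstruction from the description (sort by bounding-box centers, break lines, and, per the ablation listed in the appendix, use a margin for line grouping and a tab between elements on the same line). The `Box` class, the `margin` default, and the rule of comparing against the first box of the current line are assumptions, not the paper's exact algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def cx(self) -> float:
        return self.left + self.width / 2

    @property
    def cy(self) -> float:
        return self.top + self.height / 2


def encode_screen(boxes: List[Box], margin: float = 10.0) -> str:
    """Encode a screen as text: one output line per visual row, tab-separated."""
    # Sort by vertical center first, then horizontal center.
    ordered = sorted(boxes, key=lambda b: (b.cy, b.cx))
    rows: List[List[Box]] = []
    for box in ordered:
        # Boxes whose vertical centers lie within `margin` of the row's
        # first box are treated as being on the same line.
        if rows and abs(box.cy - rows[-1][0].cy) <= margin:
            rows[-1].append(box)
        else:
            rows.append([box])
    return "\n".join(
        "\t".join(b.text for b in sorted(row, key=lambda b: b.cx)) for row in rows
    )


screen = [
    Box("555-0100", left=60, top=42, width=80, height=20),
    Box("Contact Us", left=0, top=0, width=100, height=20),
    Box("Phone", left=0, top=40, width=50, height=20),
]
print(encode_screen(screen))
```

Here "Phone" and "555-0100" have vertical centers two units apart, so they share a line, while "Contact Us" sits on its own line above them.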

Results

The experimental results indicate that the proposed approach outperforms the MARRS model across all types of datasets. It also surpasses GPT-3.5, which has a significantly larger number of parameters, and performs comparably to GPT-4 despite being a much lighter and faster model. The gains are particularly notable on the on-screen datasets, where the model with textual encoding performs almost as well as GPT-4, even though the latter is provided with screenshots.

The authors also experiment with models of different sizes, observing that performance improves across all datasets as model size increases, with the most pronounced difference on the complex on-screen datasets.

In an analysis section, the paper explores the zero-shot performance of their model on an unseen domain, Alarms. Their approach and GPT-4 perform similarly well on this unseen test set, outperforming a finetuned model.

The paper also highlights that their model, ReALM, demonstrates superior understanding of domain-specific queries compared to GPT-4 due to finetuning on user requests. An example illustrates GPT-4 incorrectly assuming a reference is only about a setting, while the ground truth includes a home automation device, which ReALM can recognize due to its domain-specific training.

Conclusion and Future Work

The paper demonstrates how large language models can be used for reference resolution by encoding entities as natural text. A novel approach represents on-screen entities and their relative positions in a textual format, which is then passed to the language model. This method, called ReALM, outperforms previous approaches and performs comparably to GPT-4, despite having fewer parameters. ReALM even surpasses GPT-4 for domain-specific user utterances, making it an ideal choice for practical reference resolution systems that can run on-device without compromising performance.

However, while ReALM effectively encodes the position of entities, it may lose nuanced positional information required for complex user queries. The authors suggest exploring more complex approaches, such as dividing the screen into a grid and encoding relative spatial positions into text, as a promising avenue for future research.
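One way such a grid-based encoding might look is sketched below. This is purely speculative: the paper only names the idea as future work, so the 3x3 grid, the cell names, and the `grid_label` helper are all assumptions for illustration.

```python
ROW_NAMES = ["top", "middle", "bottom"]
COL_NAMES = ["left", "center", "right"]


def grid_label(cx: float, cy: float, screen_w: float, screen_h: float) -> str:
    """Map a bounding-box center to a coarse 3x3 grid cell described in words."""
    # Clamp so that points on the right/bottom edge fall into the last cell.
    row = min(int(cy / screen_h * 3), 2)
    col = min(int(cx / screen_w * 3), 2)
    return f"{ROW_NAMES[row]}-{COL_NAMES[col]}"


print(grid_label(50, 50, screen_w=300, screen_h=600))
print(grid_label(150, 300, screen_w=300, screen_h=600))
```

A label like this could be appended to each entity in the textual encoding, so that queries such as "the one in the bottom right" have an explicit textual anchor.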

Ethics Statement

The system allows for constraining the language model's output or applying post-processing to prevent unexpected generations. However, the authors state that in practice, they encounter very little hallucination or fabricated content from the language model. As a result, they do not constrain the model's decoding or generation process.
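A post-processing step of the kind mentioned here could be as simple as the following sketch, which discards any option numbers that do not exist. The numeric answer format and the `postprocess` helper are assumptions; the authors state they do not apply such constraints in practice.

```python
import re
from typing import List


def postprocess(raw_output: str, num_options: int) -> List[int]:
    """Keep only valid option indices from the model output, without duplicates."""
    kept: List[int] = []
    for token in re.findall(r"\d+", raw_output):
        idx = int(token)
        # Drop indices outside the offered options and repeated indices.
        if 1 <= idx <= num_options and idx not in kept:
            kept.append(idx)
    return kept


print(postprocess("2, 7, 2 and 1", num_options=3))
```

An empty result can then be treated as "None of these".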

Acknowledgements

The authors express gratitude to Stephen Pulman, Leon Liyang Zhang, Jiarui Lu, Jeff Nichols, Shruti Bhargava, Dhivya Piraviperumal, and Junhan Chen for their assistance and feedback throughout the research process.

Appendix A Encoding onscreen entities

The paper presents visual examples of how screen grabs might appear when parsed and processed by the system. These sample representations are displayed in Figure 2 of the paper.

Figure 2 (a): Onscreen Capture 1

The paper explores different strategies for encoding on-screen elements.

Clustering: Objects on the screen are spatially clustered into semantic groups. Users can refer to nearby bounding boxes by a descriptive title. However, as the number of entities in a cluster increases, the prompt length explodes since each object lists all other objects as surrounding entities.
Onscreen Grab: The screen is parsed, but turn objects are provided as a separate list instead of being annotated within the parse.
Onscreen Grab with Injected Turn Objects: This is the final approach used. The screen is parsed, and turn objects are annotated within the parse itself.
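The difference between the last two strategies can be illustrated with a small sketch. The tag syntax `[[n. text]]`, the "Entities:" header, and both helper functions are hypothetical and are not the paper's exact encoding.

```python
from typing import List


def separate_list(screen_lines: List[str], turn_objects: List[str]) -> str:
    """'Onscreen Grab' style: the parse is followed by a separate candidate list."""
    options = [f"{i}. {obj}" for i, obj in enumerate(turn_objects, start=1)]
    return "\n".join(screen_lines + ["Entities:"] + options)


def injected(screen_lines: List[str], turn_objects: List[str]) -> str:
    """'Injected turn objects' style: candidates are tagged in place within the parse."""
    out = []
    for line in screen_lines:
        for i, obj in enumerate(turn_objects, start=1):
            # Tag the first occurrence of the entity text on this line.
            # Simplification: assumes no entity text is a substring of another.
            line = line.replace(obj, f"[[{i}. {obj}]]", 1)
        out.append(line)
    return "\n".join(out)


screen = ["Contact Us", "Phone\t555-0100", "Email\thelp@example.com"]
objects = ["555-0100", "help@example.com"]
print(separate_list(screen, objects))
print(injected(screen, objects))
```

In the injected form each candidate keeps its on-screen neighbours (such as the "Phone" label) right next to it, which is the context a separate list loses.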

The paper presents an algorithm for the final approach and provides sample encodings for each strategy. It also includes an ablation study comparing the performance of the different encoding approaches.

Figure 3: Performance improvements with each experiment – (a) Baseline Finetuned LLM, (b) Obtaining screen elements through OCR, (c) Obtaining screen elements through UI elements and Clustering, (d) Adding an extra newline between the instruction and user request, (e) Onscreen Grab, (f) Onscreen Grab with injected turn objects, (g) Onscreen Grab with injected turn objects + needing lines to be separated by at least Margin, (h) Separating elements in the same line by a tab

The algorithm encodes the elements displayed on the screen as text. It takes the entities and surrounding text extracted from the user interface, orders them by the positions of their bounding boxes, and emits a textual layout of the screen.

Appendix B Entity Representations

In Table 8, the paper presents examples of different domains and their corresponding representations used as input for the large language model (LLM). These examples illustrate how entities from each domain are encoded into a textual format suitable for processing by the LLM.

Appendix C Sample Inputs

This appendix shows sample inputs that illustrate how the data is encoded before being passed to the model.

