Large Language Models for Relevance Judgment in Product Search

2406.00247

Published 6/4/2024 by Navid Mehrdad, Hrushikesh Mohapatra, Mossaab Bagdouri, Prijith Chandran, Alessandro Magnani, Xunfan Cai, Ajit Puthenputhussery, Sachin Yadav, Tony Lee, ChengXiang Zhai and 1 other

cs.IR cs.AI

Large Language Models for Relevance Judgment in Product Search

Abstract

High relevance of retrieved and re-ranked items to the search query is the cornerstone of successful product search, yet measuring relevance of items to queries is one of the most challenging tasks in product information retrieval, and quality of product search is highly influenced by the precision and scale of available relevance-labelled data. In this paper, we present an array of techniques for leveraging Large Language Models (LLMs) for automating the relevance judgment of query-item pairs (QIPs) at scale. Using a unique dataset of multi-million QIPs, annotated by human evaluators, we test and optimize hyper parameters for finetuning billion-parameter LLMs with and without Low Rank Adaption (LoRA), as well as various modes of item attribute concatenation and prompting in LLM finetuning, and consider trade offs in item attribute inclusion for quality of relevance predictions. We demonstrate considerable improvement over baselines of prior generations of LLMs, as well as off-the-shelf models, towards relevance annotations on par with the human relevance evaluators. Our findings have immediate implications for the growing field of relevance judgment automation in product search.

Create account to get full access

Overview

This paper explores the use of large language models (LLMs) for relevance judgment in product search.
The researchers investigate how LLMs can be leveraged to automate the process of relevance labeling, which is crucial for training effective product search models.
The paper presents a low-rank adaptation approach to fine-tune LLMs for relevance judgment and demonstrates its effectiveness on a real-world product search dataset.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. In the context of product search, these models can be used to assess how relevant a given product is to a user's search query. This is an important task, as it helps train the algorithms that power product search engines to deliver the most relevant results.

The researchers in this paper explore a way to use LLMs to automate the process of relevance labeling. Traditionally, this task has been done manually by human raters, which can be time-consuming and costly. By fine-tuning LLMs to perform relevance judgment, the researchers aim to make this process more efficient and scalable.

The key innovation in this paper is a "low-rank adaptation" approach, which allows the LLMs to be adapted to the specific task of relevance judgment without having to retrain the entire model from scratch. This makes the process more practical and accessible for real-world product search applications.

Technical Explanation

The researchers conducted experiments using a real-world product search dataset to evaluate the performance of their low-rank adaptation approach. They fine-tuned several popular LLMs, including BERT and GPT-3, on the relevance judgment task and compared their performance to traditional machine learning models.

The results showed that the fine-tuned LLMs were able to achieve strong performance on the relevance judgment task, outperforming the traditional models. This suggests that LLMs can be effectively leveraged to automate the process of relevance labeling, which is a crucial component of building effective product search engines.

The low-rank adaptation approach proved to be particularly useful, as it allowed the researchers to quickly and efficiently adapt the LLMs to the specific task at hand without having to retrain the entire model from scratch. This makes it easier to deploy these models in real-world product search applications.

Critical Analysis

The paper provides a compelling demonstration of how LLMs can be used to automate the relevance judgment process in product search. However, the researchers acknowledge that their approach has some limitations. For example, the performance of the fine-tuned LLMs may be sensitive to the specific dataset used for training, and further research may be needed to understand how well the models generalize to different product domains.

Additionally, while the low-rank adaptation approach is efficient, it may not capture all the nuances of the relevance judgment task. There may be room for further refinement and optimization of the fine-tuning process to improve the overall performance of the models.

Overall, this paper represents an important step forward in the application of LLMs to product search and e-commerce. The researchers have demonstrated the potential of these powerful AI models to streamline and automate key tasks, such as relevance judgment, which are essential for building effective and user-friendly product search experiences.

Conclusion

This paper presents a novel approach to leveraging large language models (LLMs) for relevance judgment in product search. The researchers show that by fine-tuning LLMs using a low-rank adaptation technique, they can achieve strong performance on the relevance judgment task, outperforming traditional machine learning models.

The ability to automate the relevance labeling process has significant implications for the development of more effective and scalable product search engines. By reducing the reliance on manual labeling, the researchers have opened the door for faster and more efficient training of product search models, ultimately leading to better user experiences.

While the paper highlights the promise of this approach, it also acknowledges some limitations and areas for further research. Nonetheless, this work represents an important step forward in the application of large language models to real-world e-commerce challenges, and it will likely inspire further innovation in this rapidly evolving field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Exploring Large Language Models for Relevance Judgments in Tetun

Gabriel de Jesus, S'ergio Nunes

The Cranfield paradigm has served as a foundational approach for developing test collections, with relevance judgments typically conducted by human assessors. However, the emergence of large language models (LLMs) has introduced new possibilities for automating these tasks. This paper explores the feasibility of using LLMs to automate relevance assessments, particularly within the context of low-resource languages. In our study, LLMs are employed to automate relevance judgment tasks, by providing a series of query-document pairs in Tetun as the input text. The models are tasked with assigning relevance scores to each pair, where these scores are then compared to those from human annotators to evaluate the inter-annotator agreement levels. Our investigation reveals results that align closely with those reported in studies of high-resource languages.

6/12/2024

cs.IR

Can We Use Large Language Models to Fill Relevance Judgment Holes?

Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, Mohammad Aliannejadi

Incomplete relevance judgments limit the re-usability of test collections. When new systems are compared against previous systems used to build the pool of judged documents, they often do so at a disadvantage due to the ``holes'' in test collection (i.e., pockets of un-assessed documents returned by the new system). In this paper, we take initial steps towards extending existing test collections by employing Large Language Models (LLM) to fill the holes by leveraging and grounding the method using existing human judgments. We explore this problem in the context of Conversational Search using TREC iKAT, where information needs are highly dynamic and the responses (and, the results retrieved) are much more varied (leaving bigger holes). While previous work has shown that automatic judgments from LLMs result in highly correlated rankings, we find substantially lower correlates when human plus automatic judgments are used (regardless of LLM, one/two/few shot, or fine-tuned). We further find that, depending on the LLM employed, new runs will be highly favored (or penalized), and this effect is magnified proportionally to the size of the holes. Instead, one should generate the LLM annotations on the whole document pool to achieve more consistent rankings with human-generated labels. Future work is required to prompt engineering and fine-tuning LLMs to reflect and represent the human annotations, in order to ground and align the models, such that they are more fit for purpose.

5/10/2024

cs.IR cs.CL

Large language models can accurately predict searcher preferences

Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra

Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels are systematically disagreeing with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops an large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. To measure agreement with real searchers needs high-quality gold labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.

5/20/2024

cs.IR cs.AI cs.CL cs.LG

Query Performance Prediction using Relevance Judgments Generated by Large Language Models

Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, Maarten de Rijke

Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.

6/18/2024

cs.IR cs.AI cs.CL cs.LG