TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendation

Read original: arXiv:2406.10450 - Published 8/20/2024 by Haohao Qu, Wenqi Fan, Zihuai Zhao, Qing Li

Overview

The paper presents TokenRec, a framework that learns to tokenize user and item IDs so that large language models (LLMs) can be used for generative recommendation.
Its Masked Vector-Quantized (MQ) Tokenizer quantizes masked user and item representations learned from collaborative filtering into discrete tokens that an LLM can consume.
The method targets two problems: capturing high-order collaborative knowledge in LLM-compatible tokens, and generalizing to users or items that were not seen during training. A generative retrieval step recommends the top-K items without auto-regressive decoding or beam search, which reduces inference time.

Plain English Explanation

The paper introduces a new technique called TokenRec that helps large language models (LLMs) provide better personalized recommendations. LLMs are powerful AI models that can generate human-like text, but they often struggle to incorporate individual user preferences and information, like user IDs, into their recommendations.

TokenRec addresses this by learning how to convert user and item IDs into a small set of discrete "tokens" that the LLM can understand and use to generate recommendations tailored to each individual. It does this with vector quantization, a technique that maps a continuous representation of each user or item to the nearest entries in a learned codebook, yielding a compact, discrete form.

By bridging the gap between user IDs and LLM-based recommendation systems, TokenRec allows for more personalized and contextual recommendations that better meet the needs of individual users.

Technical Explanation

The TokenRec method consists of two key components:

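A learnable tokenizer, the Masked Vector-Quantized (MQ) Tokenizer, that quantizes masked user and item representations learned from collaborative filtering into discrete tokens. This lets the LLM incorporate collaborative knowledge and handle users and items that were not in the training data.
A generative retrieval paradigm in which the LLM-based recommender retrieves the top-K items for a user directly, avoiding the auto-regressive decoding and beam search that LLMs normally use and thereby reducing inference time.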
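The vector-quantization step of the tokenizer can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy codebooks, the token format `<u0-1>`, and the choice to split the embedding into equal chunks (a product-quantization-style simplification) are all assumptions made for illustration, and the masking and codebook learning are omitted.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(embedding, codebooks, prefix="u"):
    """Quantize one embedding into len(codebooks) discrete tokens.

    Assumes len(embedding) is divisible by len(codebooks); chunk k of the
    embedding is matched against codebook k (hypothetical simplification).
    """
    chunk = len(embedding) // len(codebooks)
    tokens = []
    for k, codebook in enumerate(codebooks):
        part = embedding[k * chunk:(k + 1) * chunk]
        tokens.append(f"<{prefix}{k}-{nearest_code(part, codebook)}>")
    return tokens


# Toy codebooks (hypothetical values); in practice these would be learned.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codebook for the first chunk
    [[0.0, 1.0], [1.0, 0.0]],  # codebook for the second chunk
]
# Stand-in for a user representation from a collaborative filtering model.
user_embedding = [0.9, 0.8, 0.1, 0.7]
user_tokens = tokenize(user_embedding, codebooks)
print(user_tokens)  # ['<u0-1>', '<u1-0>']
```

Each resulting token could then be added to the LLM vocabulary, so that a user or item is referred to by a short sequence of discrete symbols rather than by a raw ID string.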

The authors evaluate TokenRec on several benchmark datasets and report that it outperforms both traditional recommender systems and other LLM-based recommender systems. The results support the effectiveness of the proposed approach in aligning user and item IDs with LLM-based recommendation and in improving recommendation quality.

Critical Analysis

The paper provides a promising approach for enhancing LLM-based recommendation systems by effectively incorporating user-specific information. However, the authors acknowledge that TokenRec may have limited performance on datasets with sparse user-item interactions, since the tokenizer builds on representations learned from those interactions through collaborative filtering.

Additionally, the paper does not extensively explore the potential trade-offs between the compactness of the tokenized representation and the preservation of user-specific information. Further research could investigate the optimal balance between these factors and the impact on recommendation quality.

The authors also note that TokenRec has been evaluated on relatively small-scale datasets, and its performance on larger, more diverse datasets remains to be explored.

Conclusion

The TokenRec method presents an innovative approach to aligning user IDs with LLM-based recommendation systems, enabling more personalized and contextual recommendations. By learning a compact representation of user IDs through vector quantization, the method bridges the gap between user-specific information and the power of LLMs, improving the quality of recommendations for individual users.

As recommender systems enter the era of large language models, techniques like TokenRec will play a crucial role in unleashing the full potential of these advanced AI models for personalized and contextual recommendations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendation

Haohao Qu, Wenqi Fan, Zihuai Zhao, Qing Li

There is a growing interest in utilizing large-scale language models (LLMs) to advance next-generation Recommender Systems (RecSys), driven by their outstanding language understanding and in-context learning capabilities. In this scenario, tokenizing (i.e., indexing) users and items becomes essential for ensuring a seamless alignment of LLMs with recommendations. While several studies have made progress in representing users and items through textual contents or latent representations, challenges remain in efficiently capturing high-order collaborative knowledge into discrete tokens that are compatible with LLMs. Additionally, the majority of existing tokenization approaches often face difficulties in generalizing effectively to new/unseen users or items that were not in the training corpus. To address these challenges, we propose a novel framework called TokenRec, which introduces not only an effective ID tokenization strategy but also an efficient retrieval paradigm for LLM-based recommendations. Specifically, our tokenization strategy, Masked Vector-Quantized (MQ) Tokenizer, involves quantizing the masked user/item representations learned from collaborative filtering into discrete tokens, thus achieving a smooth incorporation of high-order collaborative knowledge and a generalizable tokenization of users and items for LLM-based RecSys. Meanwhile, our generative retrieval paradigm is designed to efficiently recommend top-$K$ items for users to eliminate the need for the time-consuming auto-regressive decoding and beam search processes used by LLMs, thus significantly reducing inference time. Comprehensive experiments validate the effectiveness of the proposed methods, demonstrating that TokenRec outperforms competitive benchmarks, including both traditional recommender systems and emerging LLM-based recommender systems.
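The retrieval idea in this abstract, recommending the top-K items without auto-regressive decoding or beam search, can be sketched as a single scoring pass over all item representations. This is a minimal illustration under stated assumptions, not the paper's method: the function name `top_k_items`, the dot-product score, the toy item vectors, and the idea that the LLM yields one query vector per user are all hypothetical.

```python
import heapq


def top_k_items(query, item_embeddings, k):
    """Rank items by dot product with `query` and return the k best item IDs.

    `query` stands in for a vector derived from the LLM output for one user
    (assumption); no token-by-token decoding or beam search is involved.
    """
    scores = {
        item_id: sum(q * e for q, e in zip(query, embedding))
        for item_id, embedding in item_embeddings.items()
    }
    # nlargest returns the keys with the highest scores, best first.
    return heapq.nlargest(k, scores, key=scores.get)


# Toy item representations (hypothetical values).
item_embeddings = {
    "item_a": [1.0, 0.0],
    "item_b": [0.0, 1.0],
    "item_c": [0.7, 0.7],
}
recommended = top_k_items([0.9, 0.2], item_embeddings, k=2)
print(recommended)  # ['item_a', 'item_c']
```

Because every item is scored in one pass, the cost per request does not depend on the length of the generated item identifier, which is consistent with the reduced inference time the abstract describes.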

8/20/2024

Learnable Tokenizer for LLM-based Generative Recommendation

Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See-Kiong Ng, Tat-Seng Chua

Utilizing powerful Large Language Models (LLMs) for generative recommendation has attracted much attention. Nevertheless, a crucial challenge is transforming recommendation data into the language space of LLMs through effective item tokenization. Current approaches, such as ID, textual, and codebook-based identifiers, exhibit shortcomings in encoding semantic information, incorporating collaborative signals, or handling code assignment bias. To address these limitations, we propose LETTER (a LEarnable Tokenizer for generaTivE Recommendation), which integrates hierarchical semantics, collaborative signals, and code assignment diversity to satisfy the essential requirements of identifiers. LETTER incorporates Residual Quantized VAE for semantic regularization, a contrastive alignment loss for collaborative regularization, and a diversity loss to mitigate code assignment bias. We instantiate LETTER on two models and propose a ranking-guided generation loss to augment their ranking ability theoretically. Experiments on three datasets validate the superiority of LETTER, advancing the state-of-the-art in the field of LLM-based generative recommendation.
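The residual quantization that underlies a Residual Quantized VAE can be sketched as follows. This shows only the generic coarse-to-fine code assignment, not LETTER itself: the function name, the toy codebooks, and the input vector are assumptions, and the encoder, decoder, and the regularization losses named in the abstract are left out.

```python
def residual_quantize(vector, codebooks):
    """Assign one code per level, each level quantizing what the previous left over.

    Returns the list of code indices and the final residual.
    """
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        # Pick the entry closest to the current residual (squared L2 distance).
        index = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        codes.append(index)
        # Subtract the chosen entry so the next level refines the remainder.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes, residual


# Toy two-level codebooks (hypothetical values): coarse first, then fine.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.25, 0.0], [0.0, -0.25]],
]
codes, residual = residual_quantize([1.25, 1.0], codebooks)
print(codes, residual)  # [1, 0] [0.0, 0.0]
```

The sequence of indices forms a hierarchical identifier: items that share the first code are close at the coarse level and are separated only by later codes.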

8/20/2024

STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM

Qijiong Liu, Jieming Zhu, Lu Fan, Zhou Zhao, Xiao-Ming Wu

Traditional recommendation models often rely on unique item identifiers (IDs) to distinguish between items, which can hinder their ability to effectively leverage item content information and generalize to long-tail or cold-start items. Recently, semantic tokenization has been proposed as a promising solution that aims to tokenize each item's semantic representation into a sequence of discrete tokens. In this way, it preserves the item's semantics within these tokens and ensures that semantically similar items are represented by similar tokens. These semantic tokens have become fundamental in training generative recommendation models. However, existing generative recommendation methods typically involve multiple sub-models for embedding, quantization, and recommendation, leading to an overly complex system. In this paper, we propose to streamline the semantic tokenization and generative recommendation process with a unified framework, dubbed STORE, which leverages a single large language model (LLM) for both tasks. Specifically, we formulate semantic tokenization as a text-to-token task and generative recommendation as a token-to-token task, supplemented by a token-to-text reconstruction task and a text-to-token auxiliary task. All these tasks are framed in a generative manner and trained using a single LLM backbone. Extensive experiments have been conducted to validate the effectiveness of our STORE framework across various recommendation tasks and datasets. We will release the source code and configurations for reproducible research.

9/16/2024

IDGenRec: LLM-RecSys Alignment with Textual ID Learning

Juntao Tan, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Zelong Li, Yongfeng Zhang

Generative recommendation based on Large Language Models (LLMs) have transformed the traditional ranking-based recommendation style into a text-to-text generation paradigm. However, in contrast to standard NLP tasks that inherently operate on human vocabulary, current research in generative recommendations struggles to effectively encode recommendation items within the text-to-text framework using concise yet meaningful ID representations. To better align LLMs with recommendation needs, we propose IDGen, representing each item as a unique, concise, semantically rich, platform-agnostic textual ID using human language tokens. This is achieved by training a textual ID generator alongside the LLM-based recommender, enabling seamless integration of personalized recommendations into natural language generation. Notably, as user history is expressed in natural language and decoupled from the original dataset, our approach suggests the potential for a foundational generative recommendation model. Experiments show that our framework consistently surpasses existing models in sequential recommendation under standard experimental setting. Then, we explore the possibility of training a foundation recommendation model with the proposed method on data collected from 19 different datasets and tested its recommendation performance on 6 unseen datasets across different platforms under a completely zero-shot setting. The results show that the zero-shot performance of the pre-trained foundation model is comparable to or even better than some traditional recommendation models based on supervised training, showing the potential of the IDGen paradigm serving as the foundation model for generative recommendation. Code and data are open-sourced at https://github.com/agiresearch/IDGenRec.

5/20/2024