Learning Mutually Informed Representations for Characters and Subwords

2311.07853

Published 4/9/2024 by Yilin Wang, Xinyi Hu, Matthew R. Gormley

Abstract

Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them output useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling (intraword code-switching). Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pre-trained models on all English sequence labeling tasks and classification tasks. We make our code publicly available.

Overview

Most pretrained language models use subword tokenization, which processes text as a sequence of subword tokens: units that sit between single characters and whole words.
Different text granularities (characters, subwords, words) can contain different types of information.
Prior studies show that using multiple input granularities improves model performance, but few output representations for each granularity.
This paper introduces the "entanglement model" to combine character and subword language models, generating mutually informed representations for both.
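To make the two granularities concrete, here is a minimal sketch of how the same word looks as a character sequence and as a subword sequence, together with a map from each character to the subword that covers it. The greedy longest-match segmentation, the `##` continuation prefix, and the toy vocabulary are illustrative assumptions, not the tokenizers used in the paper.

```python
def chars(text):
    """Character-level view: one token per character."""
    return list(text)


def subwords(word, vocab):
    """Greedy longest-match-first segmentation of a single word.

    Continuation pieces carry a '##' prefix (a WordPiece-style convention,
    assumed here for illustration). Returns ['[UNK]'] when the word cannot
    be segmented with the given vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def char_to_subword(pieces):
    """Map each character position to the index of the subword covering it."""
    mapping = []
    for i, piece in enumerate(pieces):
        surface = piece[2:] if piece.startswith("##") else piece
        mapping.extend([i] * len(surface))
    return mapping


# Hypothetical toy vocabulary, chosen only for this example.
TOY_VOCAB = {"token", "##ization", "##ize", "un", "##seen"}

word = "tokenization"
pieces = subwords(word, TOY_VOCAB)
print(chars(word))              # 12 character tokens
print(pieces)                   # subword tokens
print(char_to_subword(pieces))  # which subword each character belongs to
```

The character view keeps spelling detail that the two subword tokens hide, while the subword view is shorter and closer to meaningful units; a model that outputs representations at both levels can serve tasks defined at either granularity.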

Plain English Explanation

The researchers behind this paper noticed that most language models - AI systems that can understand and generate human language - work by breaking text down into smaller pieces called "subwords." This subword tokenization can help the models handle words they haven't seen before.

However, the researchers also realized that different levels of text - like individual characters, subwords, or full words - can actually contain distinct types of information that could be useful for the models. Previous studies had shown that incorporating multiple input granularities (character, subword, word) can improve a model's overall performance, but these models typically didn't produce useful representations for each granularity as output.

To address this, the researchers developed a new model called the "entanglement model." Inspired by vision-language models that can process both images and text, the entanglement model treats characters and subwords as separate "modalities" and learns to generate mutually informed representations for both.

The researchers evaluated this entanglement model on a variety of language tasks like text classification, named entity recognition, and sequence labeling. Notably, they found that the entanglement model outperformed its underlying language models, especially when dealing with noisy text or low-resource languages. It even beat larger pre-trained models on all of the English sequence labeling and classification tasks they tested.

Technical Explanation

The key innovation in this paper is the "entanglement model," which aims to combine character-level and subword-level language models to generate more informative representations for both granularities.

Inspired by vision-language models, the entanglement model treats characters and subwords as separate "modalities" and learns to generate mutually informed representations for them. This is in contrast to typical language models that use subword tokenization but don't explicitly model the interactions between different text granularities.

The entanglement model has two main components: a character-level encoder and a subword-level encoder. The character encoder takes a sequence of characters as input and produces a character-level representation. The subword encoder takes a sequence of subwords and generates a subword-level representation. Crucially, the two encoders are connected, allowing the representations to inform each other.
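The summary does not spell out how the two encoders are connected. One plausible mechanism, borrowed from co-attention designs in vision-language models, is cross-attention in both directions: character states attend over subword states and vice versa. The sketch below is a simplified illustration under that assumption (single head, no learned projections, random vectors standing in for encoder outputs); the names `cross_attend` and `random_states` are hypothetical and this is not the paper's implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention with a residual connection.

    Each query vector (one granularity) is updated with a weighted average
    of the vectors from the other granularity. Learned query/key/value
    projections are omitted to keep the sketch short.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        context = [sum(w * kv[j] for w, kv in zip(weights, keys_values))
                   for j in range(d)]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


def random_states(n, d, rng):
    """Stand-in for encoder outputs: n vectors of dimension d."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
dim = 8
char_states = random_states(12, dim, rng)    # e.g. 12 characters
subword_states = random_states(2, dim, rng)  # e.g. 2 subwords

# Each side is updated using information from the other side.
char_out = cross_attend(char_states, subword_states)
subword_out = cross_attend(subword_states, char_states)
print(len(char_out), len(char_out[0]))        # one vector per character
print(len(subword_out), len(subword_out[0]))  # one vector per subword
```

Note that the output keeps one vector per character and one per subword, so both granularities remain available for downstream tasks, which is the property the paper emphasizes.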

The researchers evaluated the entanglement model on a range of language tasks, including text classification, named entity recognition, POS-tagging, and character-level sequence labeling (intraword code-switching). They found that the entanglement model outperformed its backbone character-level and subword-level language models, especially on noisy text and low-resource languages. Remarkably, the entanglement model also surpassed larger pre-trained language models on all English sequence labeling and classification tasks.

Critical Analysis

One potential limitation of the entanglement model is that it requires training separate character-level and subword-level encoders, which could be more computationally expensive than a single, unified language model. The paper does not provide a detailed analysis of the model's training time or inference latency compared to other approaches.

Additionally, the researchers only evaluated the entanglement model on a limited set of tasks, primarily focusing on sequence labeling and classification. It would be interesting to see how the model performs on other language understanding and generation tasks, such as translation, summarization, or open-ended dialogue.

The paper also does not provide a thorough analysis of the learned character-level and subword-level representations, nor does it explore the specific types of information captured by each granularity. A more in-depth examination of the internal workings of the entanglement model could yield additional insights.

Overall, the entanglement model presents a promising approach to leveraging multiple text granularities, and the strong performance on the evaluated tasks suggests it could be a valuable tool for a variety of language-related applications. However, further research is needed to fully understand its capabilities and limitations.

Conclusion

This paper introduces the "entanglement model," a novel language model that combines character-level and subword-level representations to improve performance on a range of language tasks. By treating characters and subwords as separate modalities and learning mutually informed representations for both, the entanglement model outperforms its underlying language models, particularly on noisy text and low-resource languages.

The strong results on sequence labeling and classification tasks demonstrate the potential of the entanglement model to enhance language understanding and processing. While more research is needed to fully explore its capabilities, this work highlights the benefits of incorporating multiple text granularities and cross-modal interactions in language models, potentially leading to more robust and versatile AI systems for a variety of real-world applications.

Related Papers

Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning

Huiming Wang, Zhaodonghui Li, Liying Cheng, Soh De Wen, Lidong Bing

Recently, large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. Existing methods have explored utilizing LLMs as data annotators to generate synthesized data for training contrastive learning based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely influenced by the content generated from LLMs, highlighting the need for more refined generation in the context of sentence representation learning. Building upon this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus for training base sentence embedding models into three stages (i.e., sentence generation, sentence pair construction, in-batch training) and refines the generated content at these three distinct stages, ensuring only high-quality sentence pairs are utilized to train a base contrastive learning model. Our extensive experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results. Comprehensive analyses further underscore the potential of our framework in various application scenarios and achieving better sentence representation learning with LLMs.

5/20/2024

cs.CL

A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation

Francois Meyer, Jan Buys

Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.

4/1/2024

cs.CL

A Decoupling and Aggregating Framework for Joint Extraction of Entities and Relations

Yao Wang, Xin Liu, Weikun Kong, Hai-Tao Yu, Teeradaj Racharak, Kyoung-Sook Kim, Minh Le Nguyen

Named Entity Recognition and Relation Extraction are two crucial and challenging subtasks in the field of Information Extraction. Despite the successes achieved by the traditional approaches, fundamental research questions remain open. First, most recent studies use parameter sharing for a single subtask or shared features for both subtasks, ignoring their semantic differences. Second, information interaction mainly focuses on the two subtasks, leaving the fine-grained information interaction among the subtask-specific features of encoding subjects, relations, and objects unexplored. Motivated by the aforementioned limitations, we propose a novel model to jointly extract entities and relations. The main novelties are as follows: (1) We propose to decouple the feature encoding process into three parts, namely encoding subjects, encoding objects, and encoding relations. Thanks to this, we are able to use fine-grained subtask-specific features. (2) We propose novel inter-aggregation and intra-aggregation strategies to enhance the information interaction and construct individual fine-grained subtask-specific features, respectively. The experimental results demonstrate that our model outperforms several previous state-of-the-art models. Extensive additional experiments further confirm the effectiveness of our model.

5/15/2024

cs.CL cs.AI

Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems

Frank Palma Gomez, Ramon Sanabria, Yun-hsuan Sung, Daniel Cer, Siddharth Dalmia, Gustavo Hernandez Abrego

Large language models (LLMs) are trained on text-only data that go far beyond the languages with paired speech and text data. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi-modal DE retrieval systems. Unlike traditional methods, our system doesn't require speech data during LLM pre-training and can exploit LLM's multilingual text understanding capabilities to match speech and text in languages unseen during retrieval training. Our multi-modal LLM-based retrieval system is capable of matching speech and text in 102 languages despite only training on 21 languages. Our system outperforms previous systems trained explicitly on all 102 languages. We achieve a 10% absolute improvement in Recall@1 averaged across these languages. Additionally, our model demonstrates cross-lingual speech and text matching, which is further enhanced by readily available machine translation data.

4/5/2024

cs.CL cs.IR cs.SD eess.AS