How to Compute the Probability of a Word

Read original: arXiv:2406.14561 - Published 6/21/2024 by Tiago Pimentel, Clara Meister

Overview

Explains how to compute the probability of a word using language models
Discusses the concept of a "word" and how language models represent and work with them
Examines the role of word distributions in language modeling and how these can be used to evaluate the probability of a given word

Plain English Explanation

The paper explores the fundamental question of how to compute the probability of a word using language models. Language models are machine learning models that are trained on large text corpora to understand and generate human language. A key aspect of these models is how they represent and work with "words" - the basic units of language.

The paper first delves into the definition of a "word", noting that this can be a complex concept depending on the language and context. It then examines how language models typically represent words: most operate over subword units rather than whole words, so a single word may be split into several tokens.

With this foundation, the paper explores the role of word distributions in language modeling. These distributions describe the likelihood of different words appearing in a given context. By understanding these distributions, language models can evaluate the probability of a particular word appearing in a sentence or piece of text.

The paper touches on related concepts, such as how language models can assess their own confidence in a given prediction and the potential for unused information in the token probability distribution. Overall, the paper provides a deep dive into the fundamental mechanisms underlying language models and how they can be used to reason about the probability of words.

Technical Explanation

The paper begins by discussing the concept of a "word" and how it can be defined in the context of language modeling. It notes that while words may seem like a basic unit of language, their definition can be complex, especially when dealing with morphologically rich languages or handling things like compound words.

The paper then explores how language models typically represent words. Most operate over a subword vocabulary, so the model assigns probabilities to subword tokens, and the probability of a word has to be assembled from the probabilities of its pieces.
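As a rough illustration of assembling a probability from subword pieces, the sketch below applies the chain rule to one word split into three pieces. The probabilities and the names `subword_probs`, `sequence_probability` and `surprisal_bits` are invented for this example and do not come from the paper. The result is only the probability of the piece sequence; as the abstract notes, turning that into a word probability requires additional care.

```python
import math

# Hypothetical conditional probabilities for the subword pieces of one word,
# each conditioned on the context and the pieces before it.
# The numbers are invented for illustration.
subword_probs = [0.02, 0.30, 0.85]


def sequence_probability(probs):
    """Chain rule: multiply each piece's probability given everything before it."""
    return math.prod(probs)


def surprisal_bits(prob):
    """Surprisal in bits: the negative base-2 logarithm of a probability."""
    return -math.log2(prob)


p_pieces = sequence_probability(subword_probs)

# Surprisal is additive over pieces, so summing per-piece surprisals
# matches the surprisal of the product.
total_surprisal = sum(surprisal_bits(p) for p in subword_probs)

print(f"probability of the piece sequence: {p_pieces:.4f}")
print(f"surprisal in bits: {total_surprisal:.3f}")
```

Summing log-probabilities rather than multiplying raw probabilities is the usual choice in practice, since products of many small numbers underflow quickly.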

With this foundation, the paper delves into the role of word distributions in language modeling. These distributions give the probability of each word appearing in a given context, as estimated by the model. The paper shows how to evaluate the probability of a particular word from the model's subword probabilities, a conversion that, according to the abstract, requires care and has been computed incorrectly in many recent linguistic studies.
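To see why the conversion from subword to word probabilities needs care, consider a toy model whose tokens mark the beginning of a word. The sketch below is an assumption-laden illustration, not the paper's derivation: `TOY_LM`, its probabilities, the `_` beginning-of-word marker and all function names are invented. It compares the naive product of piece probabilities with a version that also requires the word to end (the next token starts a new word or ends the text) and normalises by the probability that a word starts at the current position.

```python
# Toy next-token distributions keyed by the tokens seen so far. A leading "_"
# marks a beginning-of-word token and "<eos>" ends the text.
# All values are invented for illustration.
TOY_LM = {
    (): {"_the": 1.0},
    ("_the",): {"y": 0.4, "_the": 0.5, "<eos>": 0.1},
    ("_the", "y"): {"_the": 0.7, "<eos>": 0.3},
}


def closes_word(token):
    """True if the token signals that the current word is over."""
    return token.startswith("_") or token == "<eos>"


def pieces_probability(context, pieces):
    """Naive estimate: chain-rule product over the word's subword pieces."""
    prob, ctx = 1.0, tuple(context)
    for piece in pieces:
        prob *= TOY_LM[ctx].get(piece, 0.0)
        ctx += (piece,)
    return prob


def boundary_mass(context):
    """Probability that the next token starts a new word or ends the text."""
    dist = TOY_LM[tuple(context)]
    return sum(p for tok, p in dist.items() if closes_word(tok))


def word_probability(context, pieces):
    """Pieces, times 'the word ends here', divided by 'a word starts here'."""
    ended = boundary_mass(tuple(context) + tuple(pieces))
    started = boundary_mass(context)
    return pieces_probability(context, pieces) * ended / started


# The only two words that can open a text in this toy model.
WORDS = {"the": ["_the"], "they": ["_the", "y"]}

naive = {w: pieces_probability((), s) for w, s in WORDS.items()}
corrected = {w: word_probability((), s) for w, s in WORDS.items()}

print("naive total:", round(sum(naive.values()), 6))
print("corrected total:", round(sum(corrected.values()), 6))
```

In this toy setting the naive estimates total more than one, because the probability of "the" also counts every continuation that begins with the same piece, while the boundary-aware estimates form a proper distribution over the two possible first words.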

The paper also touches on related topics, such as how language models can assess their own confidence in a given prediction and the potential for unused information in the token probability distribution.

Critical Analysis

The paper provides a thorough and well-grounded exploration of the fundamental mechanisms underlying language models and how they can be used to reason about the probability of words. However, the paper does not delve into some of the potential limitations or caveats of this approach.

For example, the paper does not address the challenges that language models can face when dealing with rare or out-of-vocabulary words, which can significantly impact the accuracy of their probability estimates. Additionally, the paper does not discuss the potential biases or inconsistencies that can arise in the word distributions learned by language models, which could lead to skewed or inaccurate probability estimates.

Furthermore, the paper does not explore the implications of these probability estimates for downstream applications, such as text generation or language understanding. Understanding how these probability estimates are used and their potential impact on real-world applications would be a valuable addition to the analysis.

Conclusion

The paper provides a comprehensive overview of how to compute the probability of a word using language models. It delves into the nuances of word representation, the role of word distributions, and related concepts like model confidence and unused information. While the technical explanation is thorough, the paper could benefit from a more in-depth discussion of the limitations and potential implications of this approach.

Overall, the paper offers a valuable contribution to the understanding of language modeling and the fundamental mechanisms underlying these powerful AI systems. By exploring these core concepts, the research paves the way for further advancements in natural language processing and generation, with applications across a wide range of domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

How to Compute the Probability of a Word

Tiago Pimentel, Clara Meister

Language models (LMs) estimate the probability distribution over sequences of natural language; these distributions are crucial for computing perplexity and surprisal in linguistics research. While we are usually concerned with measuring these values for words, most LMs operate over subwords. Despite seemingly straightforward, accurately computing probabilities over one unit given probabilities over the other requires care. Indeed, we show here that many recent linguistic studies have been incorrectly computing these values. This paper derives the correct methods for computing word probabilities, highlighting issues when relying on language models that use beginning-of-word (bow)-marking tokenisers, e.g., the GPT family. Empirically, we show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.

6/21/2024

What Are the Odds? Language Models Are Capable of Probabilistic Reasoning

Akshay Paruchuri, Jake Garrison, Shun Liao, John Hernandez, Jacob Sunshine, Tim Althoff, Xin Liu, Daniel McDuff

Language models (LM) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper, we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We evaluate three ways to provide context to LMs 1) anchoring examples from within a distribution or family of distributions, 2) real-world context, 3) summary statistics on which to base a Normal approximation. Models can make inferences about distributions, and can be further aided by the incorporation of real-world context, example shots and simplified assumptions, even if these assumptions are incorrect or misspecified. To conduct this work, we developed a comprehensive benchmark distribution dataset with associated question-answer pairs that we will release publicly.

6/19/2024

Leading Whitespaces of Language Models' Subword Vocabulary Poses a Confound for Calculating Word Probabilities

Byung-Doh Oh, William Schuler

Word-by-word conditional probabilities from Transformer-based language models are increasingly being used to evaluate their predictions over minimal pairs or to model the incremental processing difficulty of human readers. In this paper, we argue that there is a confound posed by the subword tokenization scheme of such language models, which has gone unaddressed thus far. This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this can result in word probabilities that sum to more than one, thereby violating the axiom that $\mathsf{P}(\Omega) = 1$. This property results in a misallocation of word-by-word surprisal, where the unacceptability of the current 'end of word' is incorrectly carried over to the next word. Additionally, language models' such implicit prediction of word boundaries is incongruous with psycholinguistic experiments where human subjects directly observe upcoming word boundaries. We present a simple decoding technique to reaccount the probability of the trailing whitespace into that of the current word, which resolves this confound. As a case study, we show that this results in significantly different estimates of garden-path effects in transitive/intransitive sentences, where a comma is strongly expected before the critical word.

6/18/2024

Probabilistic Medical Predictions of Large Language Models

Bowen Gu, Rishi J. Desai, Kueiyu Joshua Lin, Jie Yang

Large Language Models (LLMs) have demonstrated significant potential in clinical applications through prompt engineering, which enables the generation of flexible and diverse clinical predictions. However, they pose challenges in producing prediction probabilities, which are essential for transparency and allowing clinicians to apply flexible probability thresholds in decision-making. While explicit prompt instructions can lead LLMs to provide prediction probability numbers through text generation, LLMs' limitations in numerical reasoning raise concerns about the reliability of these text-generated probabilities. To assess this reliability, we compared explicit probabilities derived from text generation to implicit probabilities calculated based on the likelihood of predicting the correct label token. Experimenting with six advanced open-source LLMs across five medical datasets, we found that the performance of explicit probabilities was consistently lower than implicit probabilities with respect to discrimination, precision, and recall. Moreover, these differences were enlarged on small LLMs and imbalanced datasets, emphasizing the need for cautious interpretation and applications, as well as further research into robust probability estimation methods for LLMs in clinical contexts.

8/22/2024