This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models






Published 4/3/2024 by Bryan Li, Samar Haider, Chris Callison-Burch



Do the Spratly Islands belong to China, the Philippines, or Vietnam? A pretrained large language model (LLM) may answer differently if asked in the languages of each claimant country: Chinese, Tagalog, or Vietnamese. This contrasts with a multilingual human, who would likely answer consistently. In this paper, we show that LLMs recall certain geographical knowledge inconsistently when queried in different languages -- a phenomenon we term geopolitical bias. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We introduce BorderLines, a dataset of territorial disputes which covers 251 territories, each associated with a set of multiple-choice questions in the languages of each claimant country (49 languages in total). We also propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages. We then evaluate various multilingual LLMs on our dataset and metrics to probe their internal knowledge and use the proposed metrics to discover numerous inconsistencies in how these models respond in different languages. Finally, we explore several prompt modification strategies, aiming to either amplify or mitigate geopolitical bias, which highlights how brittle LLMs are and how they tailor their responses depending on cues from the interaction context. Our code and data are available at



  • This paper examines how large language models (LLMs) recall geographical knowledge differently depending on the language they are queried in, a phenomenon the authors call "geopolitical bias."
  • The researchers focus on territorial disputes as a case study, introducing a new dataset called "BorderLines" that covers 251 territories and associated multiple-choice questions in 49 languages.
  • The paper proposes metrics to quantify bias and inconsistencies in how LLMs respond across different languages, and explores strategies for amplifying or mitigating this bias.

Plain English Explanation

Imagine you ask a computer program a question about a place like the Spratly Islands - a disputed territory in the South China Sea. Depending on the language you use to ask the question (Chinese, Tagalog, or Vietnamese, for example), the program might give you a different answer. This is because the program, or large language model, has developed certain biases about geopolitical issues that can influence how it recalls information.

The researchers in this paper wanted to study this phenomenon in more detail. They created a dataset called "BorderLines" that covers 251 territories around the world, each with multiple-choice questions about who claims ownership in the languages of the countries involved. By evaluating how different language models perform on this dataset, the researchers were able to identify numerous inconsistencies in how the models respond across languages.

For instance, a language model might say the Spratly Islands belong to China when asked in Chinese, but to the Philippines when asked in Tagalog. This is problematic, as a human with knowledge of the region would likely give a consistent answer regardless of the language used.

The researchers also explored ways to either amplify or reduce this geopolitical bias in language models, highlighting how sensitive these models can be to subtle changes in how questions are phrased or presented. This suggests language models may not have a robust, unbiased understanding of sensitive geopolitical topics.

Technical Explanation

The paper begins by introducing the concept of "geopolitical bias" in large language models (LLMs). The authors hypothesize that LLMs may recall certain geographical knowledge inconsistently when queried in different languages, in contrast with a multilingual human who would likely provide a consistent answer.

To investigate this, the researchers created BorderLines, a dataset covering 251 territories with associated multiple-choice questions in the languages of the claimant countries (49 languages total). This allows them to systematically evaluate how various multilingual LLMs perform on this task and quantify any inconsistencies in their responses across languages.
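The dataset layout described above can be pictured with a small sketch. This is an illustration only: the class name `DisputeEntry`, its fields, and the helper `build_prompts` are hypothetical and are not taken from the BorderLines release, and the question texts are placeholders rather than real dataset entries.

```python
from dataclasses import dataclass, field


@dataclass
class DisputeEntry:
    """One territory with a question per claimant language (hypothetical schema)."""
    territory: str
    claimants: list[str]
    # Maps a language code to the question text written in that language.
    questions: dict[str, str] = field(default_factory=dict)


def build_prompts(entry: DisputeEntry) -> dict[str, str]:
    """Format one multiple-choice prompt per language, with lettered options."""
    options = "\n".join(
        f"{chr(ord('A') + i)}. {name}" for i, name in enumerate(entry.claimants)
    )
    return {lang: f"{question}\n{options}" for lang, question in entry.questions.items()}


# Placeholder entry based on the Spratly Islands example in the abstract.
entry = DisputeEntry(
    territory="Spratly Islands",
    claimants=["China", "Philippines", "Vietnam"],
    questions={
        "zh": "[question text in Chinese]",
        "tl": "[question text in Tagalog]",
        "vi": "[question text in Vietnamese]",
    },
)
prompts = build_prompts(entry)
print(prompts["tl"])
```

For simplicity the sketch lists the answer options in English for every language; a real multilingual prompt would presumably translate the options as well.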

The paper proposes several evaluation metrics to measure bias and consistency, including language-specific accuracy, cross-lingual accuracy (accuracy when the model is queried in a different language), and a new metric called "flip rate" that captures how often a model's prediction changes when the language changes.
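As a rough illustration of the "flip rate" idea, the sketch below counts how often a model's answer for the same territory differs between two query languages. The pairwise definition, the function names, and the sample answers are all assumptions made for this example; they are not the paper's formula or real model outputs.

```python
from itertools import combinations


def flip_rate(answers_by_language: dict[str, str]) -> float:
    """Fraction of language pairs whose answers differ for one territory."""
    pairs = list(combinations(sorted(answers_by_language), 2))
    if not pairs:
        return 0.0  # Fewer than two languages: nothing can flip.
    flips = sum(answers_by_language[a] != answers_by_language[b] for a, b in pairs)
    return flips / len(pairs)


def mean_flip_rate(responses: dict[str, dict[str, str]]) -> float:
    """Average the per-territory flip rate over all territories."""
    rates = [flip_rate(answers) for answers in responses.values()]
    return sum(rates) / len(rates) if rates else 0.0


# Made-up answers for illustration; these are not real model outputs.
responses = {
    "Spratly Islands": {"zh": "China", "tl": "Philippines", "vi": "Vietnam"},
    "Hypothetical Territory": {"en": "Country X", "fr": "Country X"},
}
print(flip_rate(responses["Spratly Islands"]))
print(mean_flip_rate(responses))
```

Under this definition a fully consistent model scores 0, and a model that gives a different answer in every language scores 1.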

The researchers then evaluate several prominent multilingual LLMs on the BorderLines dataset using these metrics. They find numerous instances where the models exhibit geopolitical bias, providing different answers depending on the language used. The paper also explores prompt engineering strategies that can either amplify or mitigate this bias, further demonstrating the sensitivity of these models to contextual cues.

Critical Analysis

The paper provides a valuable contribution by rigorously documenting the phenomenon of geopolitical bias in LLMs and introducing a dataset and evaluation framework to study it. The authors acknowledge that territorial disputes are a complex, sensitive topic, and that their dataset and findings may not generalize to all geopolitical issues.

One potential limitation is the reliance on multiple-choice questions, which may not fully capture the nuances of territorial claims. Additionally, the dataset is limited to 251 territories, and expanding the coverage could yield additional insights.

The paper would also benefit from a more in-depth discussion of the potential societal implications of geopolitical bias in language models. As these models become more pervasive, it is crucial to understand how they might amplify or obfuscate sensitive political issues, potentially shaping public discourse and opinion.

Overall, this research highlights the importance of critically evaluating the robustness and consistency of language models, especially when it comes to complex, multilingual, and politically charged topics. The authors have laid the groundwork for further exploration in this area, which could lead to more accountable and reliable AI systems.


Conclusion

This paper sheds light on an important and underexplored issue in the field of large language models: their tendency to exhibit geopolitical bias when responding to queries in different languages. By introducing the BorderLines dataset and a suite of evaluation metrics, the researchers have provided a framework for rigorously studying this phenomenon.

The findings suggest that current multilingual language models may not have a robust, unbiased understanding of sensitive geopolitical topics, and that their responses can be heavily influenced by subtle contextual cues. This raises significant concerns about the potential societal impacts of these models, as they become more widely deployed in applications that involve political and geographical knowledge.

Moving forward, this research highlights the need for greater scrutiny and accountability in the development of language models, particularly when it comes to their handling of complex, multilingual, and politically charged information. By addressing geopolitical bias, the AI community can work towards creating more reliable and trustworthy systems that serve the public good.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Evaluation of Geographical Distortions in Language Models: A Crucial Step Towards Equitable Representations

Rémy Decoupes, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire, Sarah Valentin





Language models now constitute essential tools for improving efficiency for many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language Processing, five sources of bias are well-identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, thus leading to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions, by comparing geographical and semantic distances. Experiments are conducted from these four indicators with ten widely used language models. Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.




Assessing Political Bias in Large Language Models

Luca Rettenberger, Markus Reischl, Mark Schutera





The assessment of bias within Large Language Models (LLMs) has emerged as a critical concern in the contemporary discourse surrounding Artificial Intelligence (AI) in the context of their potential impact on societal dynamics. Recognizing and considering political bias within LLM applications is especially important when closing in on the tipping point toward performative prediction. Then, being educated about potential effects and the societal behavior LLMs can drive at scale due to their interplay with human operators. In this way, the upcoming elections of the European Parliament will not remain unaffected by LLMs. We evaluate the political bias of the currently most popular open-source LLMs (instruct or assistant models) concerning political issues within the European Union (EU) from a German voter's perspective. To do so, we use the Wahl-O-Mat, a voting advice application used in Germany. From the voting advice of the Wahl-O-Mat we quantize the degree of alignment of LLMs with German political parties. We show that larger models, such as Llama3-70B, tend to align more closely with left-leaning political parties, while smaller models often remain neutral, particularly when prompted in English. The central finding is that LLMs are similarly biased, with low variances in the alignment concerning a specific party. Our findings underline the importance of rigorously assessing and making bias transparent in LLMs to safeguard the integrity and trustworthiness of applications that employ the capabilities of performative prediction and the invisible hand of machine learning prediction and language generation.




Distortions in Judged Spatial Relations in Large Language Models

Nir Fulman, Abdulkadir Memduhoğlu, Alexander Zipf





We present a benchmark for assessing the capability of Large Language Models (LLMs) to discern intercardinal directions between geographic locations and apply it to three prominent LLMs: GPT-3.5, GPT-4, and Llama-2. This benchmark specifically evaluates whether LLMs exhibit a hierarchical spatial bias similar to humans, where judgments about individual locations' spatial relationships are influenced by the perceived relationships of the larger groups that contain them. To investigate this, we formulated 14 questions focusing on well-known American cities. Seven questions were designed to challenge the LLMs with scenarios potentially influenced by the orientation of larger geographical units, such as states or countries, while the remaining seven targeted locations were less susceptible to such hierarchical categorization. Among the tested models, GPT-4 exhibited superior performance with 55 percent accuracy, followed by GPT-3.5 at 47 percent, and Llama-2 at 45 percent. The models showed significantly reduced accuracy on tasks with suspected hierarchical bias. For example, GPT-4's accuracy dropped to 33 percent on these tasks, compared to 86 percent on others. However, the models identified the nearest cardinal direction in most cases, reflecting their associative learning mechanism, thereby embodying human-like misconceptions. We discuss avenues for improving the spatial reasoning capabilities of LLMs.



A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias


Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Yuqi Ye, Hanwen Gu





Based on the foundation of Large Language Models (LLMs), Multilingual Large Language Models (MLLMs) have been developed to address the challenges of multilingual natural language processing tasks, hoping to achieve knowledge transfer from high-resource to low-resource languages. However, significant limitations and challenges still exist, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we aim to provide a comprehensive analysis of MLLMs, delving deeply into discussions surrounding these critical issues. First of all, we start by presenting an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Secondly, we explore widely utilized multilingual corpora for MLLMs' training and multilingual datasets oriented for downstream tasks that are crucial for enhancing the cross-lingual capability of MLLMs. Thirdly, we survey the existing studies on multilingual representations and investigate whether the current MLLMs can learn a universal language representation. Fourthly, we discuss bias on MLLMs including its category and evaluation metrics, and summarize the existing debiasing techniques. Finally, we discuss existing challenges and point out promising research directions. By demonstrating these aspects, this paper aims to facilitate a deeper understanding of MLLMs and their potentiality in various domains.

