LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild

2405.20363

Published 6/3/2024 by Zhiqiang Wang, Dejia Xu, Rana Muhammad Shahroz Khan, Yanbin Lin, Zhiwen Fan, Xingquan Zhu

LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild

Abstract

Image geolocation is a critical task in various image-understanding applications. However, existing methods often fail when analyzing challenging, in-the-wild images. Inspired by the exceptional background knowledge of multimodal language models, we systematically evaluate their geolocation capabilities using a novel image dataset and a comprehensive evaluation framework. We first collect images from various countries via Google Street View. Then, we conduct training-free and training-based evaluations on closed-source and open-source multi-modal language models. we conduct both training-free and training-based evaluations on closed-source and open-source multimodal language models. Our findings indicate that closed-source models demonstrate superior geolocation abilities, while open-source models can achieve comparable performance through fine-tuning.

Create account to get full access

Overview

This paper, titled "LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild", examines the performance of large language models (LLMs) on the task of predicting the geographic location of images.
The researchers created a new dataset called LLMGeo, which contains over 1 million diverse in-the-wild images with associated geolocation information.
They evaluated several state-of-the-art LLMs, including GPT-3, DALL-E, and Flamingo, on this dataset to understand the models' capabilities in geographic image localization.

Plain English Explanation

In this paper, the researchers wanted to see how well large language models (LLMs) - which are AI systems trained on vast amounts of text data - can figure out where images were taken. They created a new dataset called LLMGeo that has over 1 million real-world photos with information about their geographic locations.

The researchers then tested several top LLMs, like GPT-3, DALL-E, and Flamingo, to see how accurately they could predict the location of the images in the LLMGeo dataset. This allowed them to understand the strengths and limitations of these powerful AI models when it comes to guessing the geographic origin of pictures.

Understanding how well LLMs can perform geographic image localization is important for applications like travel planning, urban mapping, and disaster response, where being able to quickly and accurately determine where a photo was taken can be very useful.

Technical Explanation

The researchers created the LLMGeo dataset - a large, diverse collection of over 1 million in-the-wild images with associated geolocation information. This dataset was designed to be challenging and representative of real-world conditions, going beyond prior geolocation datasets that were more constrained.

They then evaluated the performance of several state-of-the-art LLMs, including GPT-3, DALL-E, and Flamingo, on the LLMGeo dataset. The models were tested on their ability to predict the geographic location of the images, using both GPS coordinates and coarser regional labels.

The results showed that while the LLMs exhibited some capability in geographic image localization, their performance was limited compared to specialized computer vision models. The models tended to struggle with regional biases and had difficulty generalizing to diverse, unconstrained image data.

Critical Analysis

The researchers acknowledge several limitations and caveats in their work. First, the LLMs evaluated were not specifically trained for geographic image localization, so their performance may not reflect their full potential if fine-tuned on this task. Additionally, the researchers note that the LLMGeo dataset, while more challenging than prior benchmarks, may still not capture the full diversity and complexity of real-world images.

Furthermore, the paper does not delve into potential biases or fairness issues that may arise from these LLMs' performance on geographic localization. As these models become more widely deployed, it will be important to carefully examine their geographic biases and ensure equitable performance across different regions and populations.

Overall, this work provides a valuable benchmark for understanding the current limitations of LLMs in geographic image localization. The results highlight the need for further research and development to improve these models' capabilities in this domain, which has significant practical applications.

Conclusion

This paper presents the LLMGeo dataset and uses it to evaluate the performance of state-of-the-art large language models on the task of geographic image localization. The results show that while LLMs exhibit some capability in this domain, their performance is limited compared to specialized computer vision models, particularly when it comes to handling diverse, unconstrained image data.

The findings of this research have important implications for the development and deployment of LLMs in real-world applications that rely on accurate geographic information, such as travel planning, urban mapping, and disaster response. By identifying the current limitations of these models, the paper paves the way for further research and improvements that could unlock the full potential of LLMs in geographic image understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

New!GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

Ling Li, Yu Ye, Bingchuan Jiang, Wei Zeng

This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.

6/28/2024

cs.CV cs.LG

Quantifying Geospatial in the Common Crawl Corpus

Ilya Ilyankou, Meihui Wang, James Haworth, Stefano Cavazzi

Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that between 1 in 5 and 1 in 6 documents contain geospatial information such as coordinates and street addresses. Our findings provide quantitative insights into the nature and extent of geospatial data within Common Crawl, and web crawl data in general. Furthermore, we formulate questions to guide future investigations into the geospatial content of available web crawl datasets and its influence on LLMs.

6/10/2024

cs.CL cs.AI

PIGEON: Predicting Image Geolocations

Lukas Haas, Michal Skreta, Silas Alberti, Chelsea Finn

Planet-scale image geolocalization remains a challenging problem due to the diversity of images originating from anywhere in the world. Although approaches based on vision transformers have made significant progress in geolocalization accuracy, success in prior literature is constrained to narrow distributions of images of landmarks, and performance has not generalized to unseen places. We present a new geolocalization system that combines semantic geocell creation, multi-task contrastive pretraining, and a novel loss function. Additionally, our work is the first to perform retrieval over location clusters for guess refinements. We train two models for evaluations on street-level data and general-purpose image geolocalization; the first model, PIGEON, is trained on data from the game of Geoguessr and is capable of placing over 40% of its guesses within 25 kilometers of the target location globally. We also develop a bot and deploy PIGEON in a blind experiment against humans, ranking in the top 0.01% of players. We further challenge one of the world's foremost professional Geoguessr players to a series of six matches with millions of viewers, winning all six games. Our second model, PIGEOTTO, differs in that it is trained on a dataset of images from Flickr and Wikipedia, achieving state-of-the-art results on a wide range of image geolocalization benchmarks, outperforming the previous SOTA by up to 7.7 percentage points on the city accuracy level and up to 38.8 percentage points on the country level. Our findings suggest that PIGEOTTO is the first image geolocalization model that effectively generalizes to unseen places and that our approach can pave the way for highly accurate, planet-scale image geolocalization systems. Our code is available on GitHub.

4/9/2024

cs.CV cs.LG

🎯

Evaluating Tool-Augmented Agents in Remote Sensing Platforms

Simranjit Singh, Michael Fore, Dimitrios Stamoulis

Tool-augmented Large Language Models (LLMs) have shown impressive capabilities in remote sensing (RS) applications. However, existing benchmarks assume question-answering input templates over predefined image-text data pairs. These standalone instructions neglect the intricacies of realistic user-grounded tasks. Consider a geospatial analyst: they zoom in a map area, they draw a region over which to collect satellite imagery, and they succinctly ask Detect all objects here. Where is `here`, if it is not explicitly hardcoded in the image-text template, but instead is implied by the system state, e.g., the live map positioning? To bridge this gap, we present GeoLLM-QA, a benchmark designed to capture long sequences of verbal, visual, and click-based actions on a real UI platform. Through in-depth evaluation of state-of-the-art LLMs over a diverse set of 1,000 tasks, we offer insights towards stronger agents for RS applications.

5/3/2024

cs.CL cs.AI cs.LG