Zero-shot Building Age Classification from Facade Image Using GPT-4

2404.09921

Published 4/16/2024 by Zichao Zeng, June Moh Goo, Xinglei Wang, Bin Chi, Meihui Wang, Jan Boehm

Zero-shot Building Age Classification from Facade Image Using GPT-4

Abstract

A building's age of construction is crucial for supporting many geospatial applications. Much current research focuses on estimating building age from facade images using deep learning. However, building an accurate deep learning model requires a considerable amount of labelled training data, and the trained models often have geographical constraints. Recently, large pre-trained vision language models (VLMs) such as GPT-4 Vision, which demonstrate significant generalisation capabilities, have emerged as potential training-free tools for dealing with specific vision tasks, but their applicability and reliability for building information remain unexplored. In this study, a zero-shot building age classifier for facade images is developed using prompts that include logical instructions. Taking London as a test case, we introduce a new dataset, FI-London, comprising facade images and building age epochs. Although the training-free classifier achieved a modest accuracy of 39.69%, the mean absolute error of 0.85 decades indicates that the model can predict building age epochs successfully albeit with a small bias. The ensuing discussion reveals that the classifier struggles to predict the age of very old buildings and is challenged by fine-grained predictions within 2 decades. Overall, the classifier utilising GPT-4 Vision is capable of predicting the rough age epoch of a building from a single facade image without any training.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper presents a novel approach to classifying the age of buildings from facade images using the GPT-4 language model.
The researchers develop a zero-shot learning technique that can categorize building ages without needing to retrain the model on labeled data for each new location.
This could have important applications in urban planning, historic preservation, and understanding the evolution of cityscapes over time.

Plain English Explanation

In this paper, the researchers tackle the challenge of automatically determining the age or time period of a building based solely on an image of its facade. This is a valuable capability for applications like urban planning, where understanding the historical development of a city can inform decisions about infrastructure, preservation, and future growth.

The key innovation of this work is the use of a powerful language model called GPT-4 to enable "zero-shot" learning. This means the model can classify building ages without requiring any labeled training data specific to the location or architectural styles being analyzed. Instead, the researchers leverage GPT-4's broad knowledge and language understanding to map facade image features to building age categories in a generalizable way.

This is a significant advance over previous approaches that relied on extensive labeled datasets for each new city or region being studied. By avoiding the need for costly and time-consuming data collection and model retraining, the zero-shot technique enables scalable and cost-effective building age classification that can be applied broadly.

Technical Explanation

The core of the researchers' approach is to fine-tune the GPT-4 language model on a dataset of building facade images paired with their corresponding age labels. This allows the model to learn associations between visual features of the facades and the appropriate age categories, such as "Victorian," "Art Deco," or "Modern."

Once this initial training is complete, the model can then be deployed in a zero-shot setting to classify the ages of buildings in new locations, without any further retraining. The researchers demonstrate the effectiveness of this technique on a diverse set of cities, showing that the model generalizes well to different architectural styles and urban environments.

A key technical insight is the use of a contrastive learning objective during the fine-tuning process. This encourages the model to learn discriminative features that can reliably distinguish between the various building age classes, rather than just memorizing the training data.

The paper also explores ways to visualize and interpret the model's decision-making process, providing insights into which visual cues the model is using to make its age predictions. This can be valuable for understanding the model's strengths and limitations, as well as for furthering research into explainable AI in the context of built environment analysis.

Critical Analysis

One potential limitation of the proposed approach is its reliance on the availability and quality of the initial dataset used to fine-tune the GPT-4 model. If this dataset does not sufficiently capture the diversity of building styles and ages across different regions, the model's zero-shot performance may be compromised.

Additionally, while the researchers demonstrate the model's effectiveness on a range of cities, there may be edge cases or unique architectural styles that the model struggles to classify accurately. Further evaluation and testing would be needed to fully understand the limits of the zero-shot approach.

Another area for further research could be the incorporation of additional data modalities, such as geospatial information or historical records, to further enhance the model's building age classification capabilities.

Conclusion

This paper presents a novel zero-shot learning approach for classifying the age of buildings from facade images using the powerful GPT-4 language model. By leveraging the model's broad knowledge and language understanding, the researchers have developed a scalable and cost-effective technique that can be applied to a wide range of urban environments without the need for extensive labeled training data.

The implications of this work are significant, as it could enable more efficient and data-driven decision-making in fields such as urban planning, historic preservation, and architectural research. By providing a means to automatically catalog and analyze the evolution of a city's built environment, this technology could lead to better-informed policies and more thoughtful stewardship of our historical and cultural assets.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Automated National Urban Map Extraction

Hasan Nasrallah, Abed Ellatif Samhat, Cristiano Nattero, Ali J. Ghandour

Developing countries usually lack the proper governance means to generate and regularly update a national rooftop map. Using traditional photogrammetry and surveying methods to produce a building map at the federal level is costly and time consuming. Using earth observation and deep learning methods, we can bridge this gap and propose an automated pipeline to fetch such national urban maps. This paper aims to exploit the power of fully convolutional neural networks for multi-class buildings' instance segmentation to leverage high object-wise accuracy results. Buildings' instance segmentation from sub-meter high-resolution satellite images can be achieved with relatively high pixel-wise metric scores. We detail all engineering steps to replicate this work and ensure highly accurate results in dense and slum areas witnessed in regions that lack proper urban planning in the Global South. We applied a case study of the proposed pipeline to Lebanon and successfully produced the first comprehensive national building footprint map with approximately 1 Million units with an 84% accuracy. The proposed architecture relies on advanced augmentation techniques to overcome dataset scarcity, which is often the case in developing countries.

5/6/2024

cs.CV

🤿

Content Bias in Deep Learning Image Age Approximation: A new Approach Towards better Explainability

Robert Jochl, Andreas Uhl

In the context of temporal image forensics, it is not evident that a neural network, trained on images from different time-slots (classes), exploits solely image age related features. Usually, images taken in close temporal proximity (e.g., belonging to the same age class) share some common content properties. Such content bias can be exploited by a neural network. In this work, a novel approach is proposed that evaluates the influence of image content. This approach is verified using synthetic images (where content bias can be ruled out) with an age signal embedded. Based on the proposed approach, it is shown that a deep learning approach proposed in the context of age classification is most likely highly dependent on the image content. As a possible countermeasure, two different models from the field of image steganalysis, along with three different preprocessing techniques to increase the signal-to-noise ratio (age signal to image content), are evaluated using the proposed method.

5/3/2024

cs.CV cs.AI

❗

GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection

Jiangning Zhang, Haoyang He, Xuhai Chen, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image-/pixel-level evaluations, the proposed GPT-4V-AD framework contains three components: textbf{textit{1)}} Granular Region Division, textbf{textit{2)}} Prompt Designing, textbf{textit{3)}} Text2Segmentation for easy quantitative evaluation, and have made some different attempts for comparative analysis. The results show that GPT-4V can achieve certain results in the zero-shot AD task through a VQA paradigm, such as achieving image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROCs on MVTec AD and VisA datasets, respectively. However, its performance still has a certain gap compared to the state-of-the-art zero-shot method, eg, WinCLIP and CLIP-AD, and further researches are needed. This study provides a baseline reference for the research of VQA-oriented LMM in the zero-shot AD task, and we also post several possible future works. Code is available at url{https://github.com/zhangzjn/GPT-4V-AD}.

4/17/2024

cs.CV

Zero-shot detection of buildings in mobile LiDAR using Language Vision Model

June Moh Goo, Zichao Zeng, Jan Boehm

Recent advances have demonstrated that Language Vision Models (LVMs) surpass the existing State-of-the-Art (SOTA) in two-dimensional (2D) computer vision tasks, motivating attempts to apply LVMs to three-dimensional (3D) data. While LVMs are efficient and effective in addressing various downstream 2D vision tasks without training, they face significant challenges when it comes to point clouds, a representative format for representing 3D data. It is more difficult to extract features from 3D data and there are challenges due to large data sizes and the cost of the collection and labelling, resulting in a notably limited availability of datasets. Moreover, constructing LVMs for point clouds is even more challenging due to the requirements for large amounts of data and training time. To address these issues, our research aims to 1) apply the Grounded SAM through Spherical Projection to transfer 3D to 2D, and 2) experiment with synthetic data to evaluate its effectiveness in bridging the gap between synthetic and real-world data domains. Our approach exhibited high performance with an accuracy of 0.96, an IoU of 0.85, precision of 0.92, recall of 0.91, and an F1 score of 0.92, confirming its potential. However, challenges such as occlusion problems and pixel-level overlaps of multi-label points during spherical image generation remain to be addressed in future studies.

4/16/2024

cs.CV cs.AI