ELSA: Evaluating Localization of Social Activities in Urban Streets

2406.01551

Published 6/4/2024 by Maryam Hosseini, Marco Cipriano, Sedigheh Eslami, Daniel Hodczak, Liu Liu, Andres Sevtsuk, Gerard de Melo

cs.CV

Abstract

Why do some streets attract more social activities than others? Is it due to street design, or do land use patterns in neighborhoods create opportunities for businesses where people gather? These questions have intrigued urban sociologists, designers, and planners for decades. Yet, most research in this area has remained limited in scale, lacking a comprehensive perspective on the various factors influencing social interactions in urban settings. Exploring these issues requires fine-level data on the frequency and variety of social interactions on urban streets. Recent advances in computer vision and the emergence of open-vocabulary detection models offer a unique opportunity to address this long-standing issue on a scale that was previously impossible using traditional observational methods. In this paper, we propose a new benchmark dataset for Evaluating Localization of Social Activities (ELSA) in urban street images. ELSA draws on theoretical frameworks in urban sociology and design. While the majority of action recognition datasets are collected in controlled settings, we use in-the-wild street-level imagery, where the size of social groups and the types of activities can vary significantly. ELSA includes 937 manually annotated images with more than 4,300 multi-labeled bounding boxes for individual and group activities, categorized into three primary groups: Condition, State, and Action. Each category contains various sub-categories, e.g., alone or group under the Condition category, standing or walking under the State category, and talking or dining under the Action category. ELSA is publicly available for the research community.

Overview

This research paper, titled "ELSA: Evaluating Localization of Social Activities in Urban Streets," introduces a benchmark dataset for evaluating how well computer vision models detect and localize social activities in urban street images.
The authors develop a dataset and evaluation framework to assess the performance of various models in this task, which has applications in urban planning, public space design, and understanding human behavior in cities.

Plain English Explanation

The paper focuses on a computer vision problem of identifying and locating social activities happening in city streets. This could be useful for urban planners, architects, and researchers who want to understand how people use public spaces in cities.

The researchers created a new dataset of images showing different types of social interactions on city streets, such as people having conversations, playing games, or participating in events. They then developed machine learning models that could analyze these images and automatically detect where the social activities were taking place.

By evaluating the performance of these models, the researchers were able to identify their strengths and weaknesses. This information can help improve the technology and make it more useful alongside related work such as "Eyes on the Streets: Leveraging Street-Level Imaging to Model Urban Crime Dynamics", "Using Unsupervised Learning to Explore Robot-Pedestrian Interaction Patterns", and "A Citizen Science Toolkit to Collect Human Perceptions of Public Spaces".

Technical Explanation

The ELSA paper presents a new dataset and evaluation framework for detecting and localizing social activities in urban street scenes. According to the abstract, the ELSA dataset contains 937 manually annotated street-level images with more than 4,300 multi-labeled bounding boxes for individual and group activities. The labels are organized into three primary categories: Condition (e.g., alone or group), State (e.g., standing or walking), and Action (e.g., talking or dining).
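
As a rough illustration of what a multi-labeled bounding box could look like, the sketch below models a single annotation using the three categories named in the abstract. The class name, field names, and the pixel-coordinate box format are assumptions made for illustration; they are not the ELSA release format, and the sub-category sets list only the examples the abstract mentions.

```python
from dataclasses import dataclass, field

# Example sub-categories quoted in the abstract; the full label set is
# not given there, so this mapping is deliberately incomplete.
CATEGORIES = {
    "condition": {"alone", "group"},
    "state": {"standing", "walking"},
    "action": {"talking", "dining"},
}


@dataclass
class Annotation:
    """One multi-labeled bounding box (hypothetical schema)."""

    # (x_min, y_min, x_max, y_max) in pixels; an assumed convention.
    box: tuple[float, float, float, float]
    # Maps a category name to the set of sub-category labels it carries.
    labels: dict[str, set[str]] = field(default_factory=dict)

    def validate(self) -> bool:
        """Return True if the box is well-formed and all labels are known."""
        x0, y0, x1, y1 = self.box
        if not (x0 < x1 and y0 < y1):
            return False
        return all(
            cat in CATEGORIES and subs <= CATEGORIES[cat]
            for cat, subs in self.labels.items()
        )


# A group of people standing and talking, boxed as one region.
ann = Annotation(
    box=(120.0, 80.0, 260.0, 310.0),
    labels={"condition": {"group"}, "state": {"standing"}, "action": {"talking"}},
)
print(ann.validate())
```

Keeping one set of labels per category makes it possible for a single box to carry several labels at once, which is the property the abstract describes as "multi-labeled".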

The researchers then developed several deep learning models, including a Mask R-CNN-based approach and a Transformer-based method, to tackle the task of localizing the social activities within the images. These models were trained and evaluated on the ELSA dataset, with the results showing the strengths and limitations of the different approaches.

The evaluation framework introduced in the paper measures not only the detection accuracy of the models, but also their ability to precisely localize the social activities within the images. This provides a more comprehensive assessment of the models' performance, which matters for real-world use and connects to related work such as "Using Unsupervised Learning to Explore Robot-Pedestrian Interaction Patterns" and "Long-Term Human Participation Assessment in Collaborative Learning Environments".
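
To make the distinction between detection and localization concrete, the sketch below computes Intersection over Union (IoU), a standard measure of how well a predicted box overlaps a ground-truth box. This summary does not state which metric the paper uses, so the IoU choice, the 0.5 threshold, and the `is_match` rule (sufficient overlap plus identical label sets) are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, pred_labels, gt_box, gt_labels, iou_threshold=0.5):
    """Hypothetical matching rule: enough overlap AND the same set of labels."""
    overlaps = iou(pred_box, gt_box) >= iou_threshold
    return overlaps and set(pred_labels) == set(gt_labels)


# A prediction that is slightly shorter than the ground-truth box.
gt = (0.0, 0.0, 10.0, 10.0)
pred = (0.0, 0.0, 10.0, 9.0)
print(iou(pred, gt))
print(is_match(pred, {"group", "talking"}, gt, {"talking", "group"}))
```

Under a rule like this, a model that names the right activity but draws the box in the wrong place is still counted as a miss, which is the sense in which localization is evaluated separately from recognition.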

Critical Analysis

The ELSA paper presents a well-designed dataset and evaluation framework for the task of social activity localization in urban streets. The authors have made a significant contribution by providing a standardized benchmark for researchers to compare and improve upon existing methods.

One potential limitation of the study is the reliance on a single dataset, which may not capture the full diversity of social interactions that occur in different urban environments around the world. The authors acknowledge this and suggest expanding the dataset in the future to include more diverse locations and cultural contexts.

Additionally, while the paper evaluates the performance of several deep learning models, it does not provide a detailed analysis of the specific features and architectural choices that lead to their strengths and weaknesses. Further research in this area could help identify the most effective approaches for this task and guide the development of more robust and generalizable models.

Overall, the ELSA paper represents an important step towards better understanding and analyzing human behavior in public spaces, and it sits alongside related work such as "OpenStreetView: 5M+ Many Roads to Global Visual" and "A Citizen Science Toolkit to Collect Human Perceptions of Public Spaces".

Conclusion

The ELSA paper presents a novel dataset and evaluation framework for detecting and localizing social activities in urban street scenes. By developing advanced computer vision models and assessing their performance, the researchers have made significant progress in understanding how people use and interact with public spaces in cities.

The insights gained from this work can inform urban planning, public space design, and the development of technologies that aim to better understand and support human behavior in the built environment. As the field of computer vision continues to advance, the ELSA framework and dataset can serve as a valuable resource for researchers and practitioners working to create more livable and vibrant cities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Eyes on the Streets: Leveraging Street-Level Imaging to Model Urban Crime Dynamics

Zhixuan Qi, Huaiying Luo, Chen Chi

This study addresses the challenge of urban safety in New York City by examining the relationship between the built environment and crime rates using machine learning and a comprehensive dataset of street view images. We aim to identify how urban landscapes correlate with crime statistics, focusing on the characteristics of street views and their association with crime rates. The findings offer insights for urban planning and crime prevention, highlighting the potential of environmental design in enhancing public safety.

4/17/2024

cs.CV

Self-supervised Multi-actor Social Activity Understanding in Streaming Videos

Shubham Trehan, Sathyanarayanan N. Aakur

This work addresses the problem of Social Activity Recognition (SAR), a critical component in real-world tasks like surveillance and assistive robotics. Unlike traditional event understanding approaches, SAR necessitates modeling individual actors' appearance and motions and contextualizing them within their social interactions. Traditional action localization methods fall short due to their single-actor, single-action assumption. Previous SAR research has relied heavily on densely annotated data, but privacy concerns limit their applicability in real-world settings. In this work, we propose a self-supervised approach based on multi-actor predictive learning for SAR in streaming videos. Using a visual-semantic graph structure, we model social interactions, enabling relational reasoning for robust performance with minimal labeled data. The proposed framework achieves competitive performance on standard group activity recognition benchmarks. Evaluation on three publicly available action localization benchmarks demonstrates its generalizability to arbitrary action localization.

6/21/2024

cs.CV

Game of LLMs: Discovering Structural Constructs in Activities using Large Language Models

Shruthi K. Hiremath, Thomas Ploetz

Human Activity Recognition is a time-series analysis problem. A popular analysis procedure used by the community assumes an optimal window length to design recognition pipelines. However, in the scenario of smart homes, where activities are of varying duration and frequency, the assumption of a constant sized window does not hold. Additionally, previous works have shown these activities to be made up of building blocks. We focus on identifying these underlying building blocks--structural constructs, with the use of large language models. Identifying these constructs can be beneficial especially in recognizing short-duration and infrequent activities. We also propose the development of an activity recognition procedure that uses these building blocks to model activities, thus helping the downstream task of activity monitoring in smart homes.

6/21/2024

cs.LG cs.CL

LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors

Sheikh Asif Imran, Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam

Integrating inertial measurement units (IMUs) with large language models (LLMs) advances multimodal AI by enhancing human activity understanding. We introduce SensorCaps, a dataset of 26,288 IMU-derived activity narrations, and OpenSQA, an instruction-following dataset with 257,562 question-answer pairs. Combining LIMU-BERT and Llama, we develop LLaSA, a Large Multimodal Agent capable of interpreting and responding to activity and motion analysis queries. Our evaluation demonstrates LLaSA's effectiveness in activity classification and question answering, highlighting its potential in healthcare, sports science, and human-computer interaction. These contributions advance sensor-aware language models and open new research avenues. Our code repository and datasets can be found on https://github.com/BASHLab/LLaSA.

6/21/2024

cs.CL