A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data

Read original: arXiv:2404.02187 - Published 4/4/2024 by Junlan Chen, Ziyuan Pu, Nan Zheng, Xiao Wen, Hongliang Ding, Xiucheng Guo


Overview

Crash data is often highly imbalanced, with most crashes being non-fatal and only a small number being fatal.
This data imbalance poses challenges for modeling crash severity, as there are limited samples of fatal crashes to analyze.
Common resampling methods such as SMOTE and GANs were designed for continuous variables, and variants that also handle discrete variables can still struggle with the collapse issue associated with sparse discrete risk factors.
Few studies have comprehensively compared the performance of different resampling methods for crash severity modeling.

Plain English Explanation

Crash data collected by authorities contains information about different types of crashes, from minor fender-benders to devastating fatal collisions. However, this data tends to be heavily skewed - the vast majority of crashes are non-fatal, while fatal crashes are quite rare. This imbalance in the data makes it difficult for researchers to accurately model and understand the factors that contribute to the most serious crashes.

Imagine you're trying to bake a cake, but you only have a tiny amount of one key ingredient. It would be very hard to get the recipe right and produce a good cake. Similarly, with crash data, researchers struggle to properly analyze and interpret the limited information they have about fatal crashes, since these make up such a small portion of the overall data.

To address this challenge, researchers have tried using different data resampling techniques. These methods either add more examples of the rare, fatal crashes (oversampling) or remove some of the more common, non-fatal crashes (undersampling). While these approaches can help balance the data, the existing techniques have limitations when it comes to the unique characteristics of crash data, which often involves a mix of continuous variables (like speed) and discrete, categorical factors (like road conditions).
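As a rough illustration of the two basic strategies, the following sketch balances a toy dataset by random oversampling and by random undersampling. The row IDs, class labels, and function names are hypothetical and are not taken from the paper; this is the simplest form of each strategy, not the method the study proposes.

```python
import random
from collections import Counter


def group_by_class(records, labels):
    """Collect the rows that belong to each class label."""
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    return by_class


def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = group_by_class(records, labels)
    target = max(len(rows) for rows in by_class.values())
    out_records, out_labels = [], []
    for lab, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out_records.extend(rows + extra)
        out_labels.extend([lab] * target)
    return out_records, out_labels


def random_undersample(records, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = group_by_class(records, labels)
    target = min(len(rows) for rows in by_class.values())
    out_records, out_labels = [], []
    for lab, rows in by_class.items():
        out_records.extend(rng.sample(rows, target))
        out_labels.extend([lab] * target)
    return out_records, out_labels


# Toy data (hypothetical): 95 non-fatal crashes and 5 fatal crashes.
records = list(range(100))
labels = ["non_fatal"] * 95 + ["fatal"] * 5

over_records, over_labels = random_oversample(records, labels)
under_records, under_labels = random_undersample(records, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Oversampling here only repeats existing fatal rows, and undersampling throws away most of the non-fatal rows, which is one reason generative approaches that create new records are of interest.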

The current study proposes a new method, based on a type of machine learning model called a Conditional Tabular GAN, to generate synthetic crash records that balance the fatal and non-fatal classes. The researchers then train and evaluate crash severity models on this balanced data, comparing their performance with models trained on the original, imbalanced data or on data resampled using other techniques.

Technical Explanation

This study addresses the challenge of highly imbalanced crash data, where fatal crashes are much rarer than non-fatal crashes. Such imbalance makes it difficult for crash severity models to accurately capture and interpret the factors contributing to the most serious crashes.
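A small worked example may make the difficulty concrete. The counts below are invented for illustration and do not come from the paper: a model that never predicts a fatal crash still scores high accuracy while missing every fatal case.

```python
def accuracy(y_true, y_pred):
    """Share of all cases that were predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of the actual `positive` cases that the model found."""
    actual = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in actual) / len(actual)


# Hypothetical toy data: 2 fatal crashes out of 100.
y_true = ["fatal"] * 2 + ["non_fatal"] * 98
# A degenerate model that always predicts the majority class.
y_pred = ["non_fatal"] * 100

acc = accuracy(y_true, y_pred)
fatal_recall = recall(y_true, y_pred, positive="fatal")
print(acc, fatal_recall)  # 0.98 0.0
```

This is why overall accuracy alone says little about how well a model handles the rare class.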

The researchers propose using a Conditional Tabular GAN (CTGAN) to generate synthetic crash data that is more balanced between fatal and non-fatal cases. CTGAN is a type of generative adversarial network (GAN) that can handle both continuous and discrete variables, which is important given the mixed nature of crash data.
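To see why support for discrete variables matters, consider what interpolation-based oversampling (the idea behind SMOTE) does to a mixed record. The sketch below is a simplification: the two features, the integer coding of road condition, and the fixed interpolation weight are all assumptions made for illustration, and the neighbour search and random weight used by real SMOTE are omitted.

```python
# Hypothetical coding of a discrete risk factor: 0 = dry, 1 = wet, 2 = icy.
VALID_ROAD_CODES = {0, 1, 2}


def interpolate(a, b, lam):
    """Return the point a + lam * (b - a), computed feature by feature."""
    return tuple(x + lam * (y - x) for x, y in zip(a, b))


# Two fatal-crash records as (speed_kmh, road_condition_code).
crash_a = (60.0, 0)
crash_b = (100.0, 2)

synthetic = interpolate(crash_a, crash_b, lam=0.25)
speed, road_code = synthetic
is_valid_category = road_code in VALID_ROAD_CODES

print(synthetic)           # (70.0, 0.5)
print(is_valid_category)   # False
```

The interpolated speed of 70.0 is plausible, but a road condition code of 0.5 corresponds to no real category. A generator built for tabular data avoids this by producing discrete columns as categories rather than as points on a line.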

The study compares crash severity models trained on the original imbalanced data, on data resampled using traditional techniques like SMOTE, and on data generated by the proposed CTGAN-based method. The comparison uses a 4-year imbalanced crash dataset collected in Washington State, U.S., and assesses both classification accuracy and how consistent the generated data distributions are with the original data.
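This summary does not say which measure the paper uses for distribution consistency. As a stand-in, the sketch below uses total variation distance between the category shares of one discrete column in real and synthetic data. The column values and the three toy samples are hypothetical.

```python
from collections import Counter


def category_shares(values):
    """Map each category to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}


def total_variation_distance(real, synthetic):
    """Half the summed absolute difference in category shares (0 = identical, 1 = disjoint)."""
    p = category_shares(real)
    q = category_shares(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical road-condition column in the real data.
real = ["dry"] * 70 + ["wet"] * 25 + ["icy"] * 5
# A synthetic sample that tracks the real shares closely.
synthetic_good = ["dry"] * 68 + ["wet"] * 26 + ["icy"] * 6
# A synthetic sample from a generator that only ever emits the common category.
synthetic_collapsed = ["dry"] * 100

tvd_good = total_variation_distance(real, synthetic_good)
tvd_collapsed = total_variation_distance(real, synthetic_collapsed)
print(round(tvd_good, 2), round(tvd_collapsed, 2))  # 0.02 0.3
```

A lower value means the synthetic column is closer to the real one. The collapsed sample scores much worse because the sparse categories have disappeared from it.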

Additionally, the researchers employ Monte Carlo simulation to evaluate the models' ability to estimate crash probabilities and parameters, considering both two-class (fatal vs. non-fatal) and three-class (fatal, serious injury, minor injury) imbalance scenarios.
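The general idea of using Monte Carlo simulation to check parameter estimation can be sketched as follows: simulate many datasets from a model whose parameters are known, re-estimate the parameters each time, and compare the estimates with the truth. Everything in this sketch is an assumption made for illustration (the binary logit with a single binary risk factor, the true coefficients, the sample sizes, and the 0.5 continuity correction); the paper's actual simulation design is not described in this summary.

```python
import math
import random

# Assumed "true" parameters of a binary logit model: P(fatal) = sigmoid(B0 + B1 * x).
TRUE_B0 = -3.0  # baseline log-odds, so fatal crashes are rare
TRUE_B1 = 1.0   # effect of a hypothetical binary risk factor x


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, rng):
    """Draw n crashes; return counts[x] = [fatal, non_fatal] for x in {0, 1}."""
    counts = {0: [0, 0], 1: [0, 0]}
    for i in range(n):
        x = i % 2  # half of the crashes have the risk factor
        is_fatal = rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * x)
        counts[x][0 if is_fatal else 1] += 1
    return counts


def fit(counts):
    """Closed-form logit estimates for one binary covariate.

    0.5 is added to every cell (Haldane-Anscombe correction) so that an
    empty cell does not lead to log(0).
    """
    def log_odds(fatal, non_fatal):
        return math.log((fatal + 0.5) / (non_fatal + 0.5))

    b0 = log_odds(*counts[0])
    b1 = log_odds(*counts[1]) - b0
    return b0, b1


def monte_carlo(reps=200, n=2000, seed=42):
    """Average the estimates over many simulated datasets."""
    rng = random.Random(seed)
    estimates = [fit(simulate(n, rng)) for _ in range(reps)]
    mean_b0 = sum(b0 for b0, _ in estimates) / reps
    mean_b1 = sum(b1 for _, b1 in estimates) / reps
    return mean_b0, mean_b1


mean_b0, mean_b1 = monte_carlo()
print("bias in B0:", mean_b0 - TRUE_B0)
print("bias in B1:", mean_b1 - TRUE_B1)
```

In a resampling study, the same loop would be repeated with each balancing method applied before fitting, so that the bias each method introduces into the parameters and predicted probabilities can be compared against the known truth.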

The results indicate that crash severity models trained on synthetic data from the proposed method, which the paper calls CTGAN-RU, outperform models trained on the original data or on synthetic data from the other resampling methods. This suggests the proposed data generation technique can effectively address the challenges posed by imbalanced crash data.

Critical Analysis

The study provides a comprehensive evaluation of the proposed CTGAN-based data generation method and its application to crash severity modeling. The use of Monte Carlo simulation to assess parameter and probability estimation is a valuable addition, as it allows for a more thorough understanding of the models' capabilities.

However, the paper does not delve into potential limitations or caveats of the research. For example, the performance of the CTGAN-based method may be dependent on the specific characteristics of the dataset used, and its effectiveness could vary with different crash data sources or jurisdictions.

Additionally, while the CTGAN approach addresses the challenge of mixed continuous and discrete variables, it does not explicitly consider the potential spatial and temporal dependencies in crash data. These factors could also play a role in accurately modeling crash severity and may warrant further investigation.

Lastly, the paper does not discuss the computational complexity or training time requirements of the CTGAN-based method compared to other resampling techniques. This information could be useful for practitioners considering the practical implementation of these approaches.

Overall, the study presents a promising solution to the problem of imbalanced crash data, but additional research exploring the method's robustness, generalizability, and feasibility would strengthen the findings and their implications for the field.

Conclusion

This research tackles the critical challenge of highly imbalanced crash data, where fatal crashes are vastly outnumbered by non-fatal crashes. The proposed CTGAN-based data generation method offers a solution to this problem, enabling the creation of synthetic, balanced crash data that can be used to train more accurate and informative crash severity models.

By comparing the performance of models trained on the original data, resampled data, and the CTGAN-generated data, the study demonstrates the benefits of the new approach. The ability to better capture the factors contributing to the most serious crashes has important implications for road safety research and initiatives aimed at reducing fatalities and severe injuries.

While the study provides a robust evaluation, further research exploring the method's limitations and potential enhancements would strengthen the findings and help guide its practical application. Nonetheless, this work represents a valuable contribution to the field, highlighting the potential of advanced data generation techniques to overcome the challenges posed by imbalanced crash data.



Related Papers


A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data

Junlan Chen, Ziyuan Pu, Nan Zheng, Xiao Wen, Hongliang Ding, Xiucheng Guo

Crash data is often greatly imbalanced, with the majority of crashes being non-fatal crashes, and only a small number being fatal crashes due to their rarity. Such data imbalance issue poses a challenge for crash severity modeling since it struggles to fit and interpret fatal crash outcomes with very limited samples. Usually, such data imbalance issues are addressed by data resampling methods, such as under-sampling and over-sampling techniques. However, most traditional and deep learning-based data resampling methods, such as synthetic minority oversampling technique (SMOTE) and generative Adversarial Networks (GAN) are designed dedicated to processing continuous variables. Though some resampling methods have improved to handle both continuous and discrete variables, they may have difficulties in dealing with the collapse issue associated with sparse discrete risk factors. Moreover, there is a lack of comprehensive studies that compare the performance of various resampling methods in crash severity modeling. To address the aforementioned issues, the current study proposes a crash data generation method based on the Conditional Tabular GAN. After data balancing, a crash severity model is employed to estimate the performance of classification and interpretation. A comparative study is conducted to assess classification accuracy and distribution consistency of the proposed generation method using a 4-year imbalanced crash dataset collected in Washington State, U.S. Additionally, Monte Carlo simulation is employed to estimate the performance of parameter and probability estimation in both two- and three-class imbalance scenarios. The results indicate that using synthetic data generated by CTGAN-RU for crash severity modeling outperforms using original data or synthetic data generated by other resampling methods.

4/4/2024

Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study

Emmanouil Panagiotou, Arjun Roy, Eirini Ntoutsi

Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data, especially in classification problems where class and group imbalances are prevalent. Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness. Although class and group imbalances commonly coincide in real-world tabular datasets, limited methods address this scenario. While most methods use oversampling techniques, like interpolation, to mitigate imbalances, recent advancements in synthetic tabular data generation offer promise but have not been adequately explored for this purpose. To this end, this paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models for synthetic tabular data generation and various sampling strategies. Experimental results on four datasets, demonstrate the effectiveness of generative models for bias mitigation, creating opportunities for further exploration in this direction.

9/10/2024

Systematic Evaluation of Synthetic Data Augmentation for Multi-class NetFlow Traffic

Maximilian Wolf, Dieter Landes, Andreas Hotho, Daniel Schlor

The detection of cyber-attacks in computer networks is a crucial and ongoing research challenge. Machine learning-based attack classification offers a promising solution, as these models can be continuously updated with new data, enhancing the effectiveness of network intrusion detection systems (NIDS). Unlike binary classification models that simply indicate the presence of an attack, multi-class models can identify specific types of attacks, allowing for more targeted and effective incident responses. However, a significant drawback of these classification models is their sensitivity to imbalanced training data. Recent advances suggest that generative models can assist in data augmentation, claiming to offer superior solutions for imbalanced datasets. Classical balancing methods, although less novel, also provide potential remedies for this issue. Despite these claims, a comprehensive comparison of these methods within the NIDS domain is lacking. Most existing studies focus narrowly on individual methods, making it difficult to compare results due to varying experimental setups. To close this gap, we designed a systematic framework to compare classical and generative resampling methods for class balancing across multiple popular classification models in the NIDS domain, evaluated on several NIDS benchmark datasets. Our experiments indicate that resampling methods for balancing training data do not reliably improve classification performance. Although some instances show performance improvements, the majority of results indicate decreased performance, with no consistent trend in favor of a specific resampling technique enhancing a particular classifier.

8/30/2024

GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics Data

Daniel Platnick, Sourena Khanzadeh, Alireza Sadeghian, Richard Anthony Valenzano

Microplastic particle ingestion or inhalation by humans is a problem of growing concern. Unfortunately, current research methods that use machine learning to understand their potential harms are obstructed by a lack of available data. Deep learning techniques in particular are challenged by such domains where only small or imbalanced data sets are available. Overcoming this challenge often involves oversampling underrepresented classes or augmenting the existing data to improve model performance. This paper proposes GANsemble: a two-module framework connecting data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data. First, the data chooser module automates augmentation strategy selection by searching for the best data augmentation strategy. Next, the cGAN module uses this strategy to train a cGAN for generating enhanced synthetic data. We experiment with the GANsemble framework on a small and imbalanced microplastics data set. A Microplastic-cGAN (MPcGAN) algorithm is introduced, and baselines for synthetic microplastics (SYMP) data are established in terms of Frechet Inception Distance (FID) and Inception Scores (IS). We also provide a synthetic microplastics filter (SYMP-Filter) algorithm to increase the quality of generated SYMP. Additionally, we show the best amount of oversampling with augmentation to fix class imbalance in small microplastics data sets. To our knowledge, this study is the first application of generative AI to synthetically create microplastics data.

5/2/2024