Predictive Modelling of Air Quality Index (AQI) Across Diverse Cities and States of India using Machine Learning: Investigating the Influence of Punjab's Stubble Burning on AQI Variability

2404.08702

Published 4/16/2024 by Kamaljeet Kaur Sidhu, Habeeb Balogun, Kazeem Oluwakemi Oseni


Abstract

Air pollution is a common and serious problem nowadays and it cannot be ignored as it has harmful impacts on human health. To address this issue proactively, people should be aware of their surroundings, which means the environment where they survive. With this motive, this research has predicted the AQI based on different air pollutant concentrations in the atmosphere. The dataset used for this research has been taken from the official website of CPCB. The dataset has the air pollutant concentration from 22 different monitoring stations in different cities of Delhi, Haryana, and Punjab. This data is checked for null values and outliers. But, the most important thing to note is the correct understanding and imputation of such values rather than ignoring or doing wrong imputation. The time series data has been used in this research which is tested for stationarity using The Dickey-Fuller test. Further different ML models like CatBoost, XGBoost, Random Forest, SVM regressor, time series model SARIMAX, and deep learning model LSTM have been used to predict AQI. For the performance evaluation of different models, I used MSE, RMSE, MAE, and R2. It is observed that Random Forest performed better as compared to other models.


Overview

This research paper aims to predict the Air Quality Index (AQI) based on different air pollutant concentrations in the atmosphere.
The dataset used for this research is from the official website of the Central Pollution Control Board (CPCB) and covers air pollutant concentrations from 22 different monitoring stations in Delhi, Haryana, and Punjab.
The researchers have employed various machine learning (ML) models, including CatBoost, XGBoost, Random Forest, Support Vector Machine (SVM) regressor, the time series model SARIMAX, and the deep learning model LSTM, to predict AQI.
The performance of these models is evaluated using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²).
The researchers found that the Random Forest model performed better compared to the other models.

Plain English Explanation

Air pollution is a serious problem that can harm human health, so it's important for people to understand the quality of the air around them. This research aimed to predict the Air Quality Index (AQI), which measures how polluted the air is, based on the concentrations of different air pollutants.

The researchers used data on air pollutant levels from monitoring stations in several cities in northern India. They checked the data for any missing or unusual values and made sure to handle them correctly, as that's an important step in working with real-world data.

Next, the researchers tried out several different machine learning models to predict the AQI. These included three tree-based ensemble models (CatBoost, XGBoost, and Random Forest), a Support Vector Machine (SVM) regressor, a time series model called SARIMAX, and a deep learning model called LSTM.

To evaluate how well each model performed, the researchers looked at several common metrics, such as the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). They also calculated the R-squared (R²) value, which shows how much of the variation in the data the model can explain.

Ultimately, the researchers found that the Random Forest model performed the best out of all the models they tried. This suggests that the Random Forest algorithm was able to capture the relationships between the air pollutant concentrations and the AQI more accurately than the other approaches.

Technical Explanation

The researchers used a dataset of air pollutant concentrations from 22 different monitoring stations in Delhi, Haryana, and Punjab, which they obtained from the official website of the Central Pollution Control Board (CPCB). They checked the dataset for any null values or outliers and handled them appropriately, as this is a crucial step in working with real-world data.
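This summary does not say which imputation or outlier rule the authors applied, so the following is only a minimal sketch under stated assumptions: gaps are filled by linear interpolation and outliers are flagged with the 1.5 x IQR rule. The function names and the PM2.5 readings are hypothetical.

```python
from statistics import quantiles


def interpolate_gaps(values):
    """Fill None entries by linear interpolation between the nearest observed
    neighbours; gaps at either end take the nearest observed value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def iqr_outlier_flags(values, k=1.5):
    """Return True for every value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical daily PM2.5 readings; None marks a missing measurement.
pm25 = [42.0, 45.0, None, 51.0, 48.0, 390.0, 55.0, None, None, 61.0]
filled = interpolate_gaps(pm25)
flags = iqr_outlier_flags(filled)
print(filled)
print(flags)
```

Flagging is kept separate from removal on purpose: a flagged spike may be a sensor fault or a real pollution episode, and that judgement needs domain knowledge rather than a fixed rule.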

The researchers then tested the time series for stationarity using the Dickey-Fuller test. This matters for time-dependent data: a non-stationary series has statistical properties, such as its mean or variance, that change over time, and models that assume stationarity can produce inaccurate predictions on it.
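To make the stationarity check concrete, here is a simplified sketch of the basic Dickey-Fuller regression with a constant and no lagged differences. It is not the authors' code; in practice the augmented version is usually run through a library such as statsmodels (`statsmodels.tsa.stattools.adfuller`). The function name `df_statistic` and the synthetic series are assumptions made for illustration.

```python
import math
import random


def df_statistic(series):
    """t-statistic of gamma in the regression
        dy_t = alpha + gamma * y_{t-1} + e_t
    A strongly negative value is evidence against a unit root."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(diffs)
    mean_x = sum(lagged) / n
    mean_d = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in zip(lagged, diffs))
    gamma = sxd / sxx
    alpha = mean_d - gamma * mean_x
    ssr = sum((d - alpha - gamma * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return gamma / std_err


# Asymptotic 5% critical value for the constant-only Dickey-Fuller case.
CRITICAL_5PCT = -2.86

rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(300)]

# Stationary AR(1) series: each value keeps half of the previous one.
stationary = [0.0]
for e in noise:
    stationary.append(0.5 * stationary[-1] + e)

# Random walk: shocks accumulate, so the series has a unit root.
walk = [0.0]
for e in noise:
    walk.append(walk[-1] + e)

for name, series in [("stationary AR(1)", stationary), ("random walk", walk)]:
    stat = df_statistic(series)
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name}: statistic = {stat:.2f} -> {verdict}")
```

The statistic is compared against Dickey-Fuller critical values rather than the usual t-distribution, because under the unit-root null hypothesis the statistic does not follow a t-distribution.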

Next, the researchers applied several different machine learning models to the data, including CatBoost, XGBoost, Random Forest, SVM regressor, the time series model SARIMAX, and the deep learning model LSTM. These models were chosen to represent a diverse range of approaches, from tree-based algorithms to more complex neural network architectures.

To evaluate the performance of these models, the researchers used four common regression metrics: MSE, RMSE, MAE, and R². The first three measure the size of the prediction errors (lower is better), while R² measures the proportion of the variance in the observed AQI that the model explains (closer to 1 is better).
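The four metrics can be written out in a few lines. The sketch below uses their standard definitions; the function name and the AQI values are hypothetical and are not results from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical observed and predicted AQI values.
observed = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

In this toy case every prediction is off by 10 AQI points, so MSE is 100, RMSE and MAE are both 10, and R² is about 0.968. RMSE and MAE are in the same units as the AQI, which makes them easier to interpret than MSE.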

The results showed that the Random Forest model outperformed the other models, suggesting that this algorithm was able to capture the underlying relationships between the air pollutant concentrations and the AQI more effectively than the other approaches.

Critical Analysis

The researchers have provided a thorough and well-designed study to predict AQI based on air pollutant concentrations. However, there are a few potential limitations and areas for further research that could be considered:

Geographical Limitations: The dataset used in this study was limited to Delhi, Haryana, and Punjab. It would be interesting to see how the models perform on data from other regions or a more diverse geographical range, as air pollution patterns can vary significantly across different locations.
Pollutant Selection: The researchers focused on using the concentrations of various air pollutants as input features. While this is a reasonable approach, there may be other relevant factors, such as meteorological conditions or land use patterns, that could improve the model's predictive power.
Model Interpretability: The study used several black-box models, such as CatBoost and LSTM, which can be difficult to interpret. Incorporating more explainable AI techniques could help provide insights into the relationships between the input features and the predicted AQI.
Temporal Dynamics: The researchers used time series analysis techniques, but there may be opportunities to further explore the temporal dynamics of air pollution, such as incorporating historical trends or seasonal patterns.
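As an illustration of the temporal-dynamics point, historical trends and seasonality can be given to a tabular model such as Random Forest as extra input columns. The following is a minimal sketch under assumptions: the lag choices (1 and 7 days), the sine/cosine month encoding, and the function name `make_features` are illustrative and are not taken from the paper.

```python
import math


def make_features(aqi, months, lags=(1, 7)):
    """Build one feature row per day from past AQI values and a cyclical
    encoding of the calendar month (1-12). Returns (rows, targets)."""
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(aqi)):
        # Map the month onto a circle so December and January are neighbours.
        angle = 2 * math.pi * (months[t] - 1) / 12
        row = [aqi[t - lag] for lag in lags]
        row += [math.sin(angle), math.cos(angle)]
        rows.append(row)
        targets.append(aqi[t])
    return rows, targets


# Hypothetical ten days of AQI readings, all in October (month 10).
aqi = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109]
months = [10] * len(aqi)
rows, targets = make_features(aqi, months)
print(len(rows), rows[0], targets[0])
```

The first `max(lags)` days are dropped because they have no complete history, so each row only uses information that would have been available before the day being predicted.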

Overall, this research provides a valuable contribution to the field of air quality prediction and demonstrates the potential of machine learning techniques to address this important environmental and public health issue. By considering the limitations and areas for further research, future studies can build upon this work and continue to advance our understanding of air pollution patterns and mitigation strategies.

Conclusion

This research paper presents a comprehensive study on predicting the Air Quality Index (AQI) based on air pollutant concentrations in the atmosphere. The researchers used a dataset from the Central Pollution Control Board (CPCB) covering monitoring stations in Delhi, Haryana, and Punjab, and applied various machine learning models, including CatBoost, XGBoost, Random Forest, SVM regressor, SARIMAX, and LSTM, to forecast the AQI.

The results showed that the Random Forest model outperformed the other models in terms of predictive accuracy, as measured by metrics like MSE, RMSE, MAE, and R². This suggests that the Random Forest algorithm was able to effectively capture the relationships between the air pollutant concentrations and the AQI.

While the study provides valuable insights into air quality prediction, there are opportunities for further research to address limitations, such as expanding the geographical scope, incorporating additional relevant factors, and exploring more interpretable modeling approaches. By building on this work, future studies can contribute to a deeper understanding of air pollution patterns and support the development of effective strategies to improve air quality and protect public health.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Predicting Lung Disease Severity via Image-Based AQI Analysis using Deep Learning Techniques

Anvita Mahajan, Sayali Mate, Chinmayee Kulkarni, Suraj Sawant

Air pollution is a significant health concern worldwide, contributing to various respiratory diseases. Advances in air quality mapping, driven by the emergence of smart cities and the proliferation of Internet-of-Things sensor devices, have led to an increase in available data, fueling momentum in air pollution forecasting. The objective of this study is to devise an integrated approach for predicting air quality using image data and subsequently assessing lung disease severity based on Air Quality Index (AQI). The aim is to implement an integrated approach by refining existing techniques to improve accuracy in predicting AQI and lung disease severity. The study aims to forecast additional atmospheric pollutants like AQI, PM10, O3, CO, SO2, NO2 in addition to PM2.5 levels. Additionally, the study aims to compare the proposed approach with existing methods to show its effectiveness. The approach used in this paper uses VGG16 model for feature extraction in images and neural network for predicting AQI. In predicting lung disease severity, Support Vector Classifier (SVC) and K-Nearest Neighbors (KNN) algorithms are utilized. The neural network model for predicting AQI achieved training accuracy of 88.54% and testing accuracy of 87.44%, which was measured using loss function, while the KNN model used for predicting lung disease severity achieved training accuracy of 98.4% and testing accuracy of 97.5%. In conclusion, the integrated approach presented in this study forecasts air quality and evaluates lung disease severity, achieving high testing accuracies of 87.44% for AQI and 97.5% for lung disease severity using neural network, KNN, and SVC models. The future scope involves implementing transfer learning and advanced deep learning modules to enhance prediction capabilities. While the current study focuses on India, the objective is to expand its scope to encompass global coverage.

5/8/2024

cs.CV cs.LG

Urban Air Pollution Forecasting: a Machine Learning Approach leveraging Satellite Observations and Meteorological Forecasts

Giacomo Blanco, Luca Barco, Lorenzo Innocenti, Claudio Rossi

Air pollution poses a significant threat to public health and well-being, particularly in urban areas. This study introduces a series of machine-learning models that integrate data from the Sentinel-5P satellite, meteorological conditions, and topological characteristics to forecast future levels of five major pollutants. The investigation delineates the process of data collection, detailing the combination of diverse data sources utilized in the study. Through experiments conducted in the Milan metropolitan area, the models demonstrate their efficacy in predicting pollutant levels for the forthcoming day, achieving a percentage error of around 30%. The proposed models are advantageous as they are independent of monitoring stations, facilitating their use in areas without existing infrastructure. Additionally, we have released the collected dataset to the public, aiming to stimulate further research in this field. This research contributes to advancing our understanding of urban air quality dynamics and emphasizes the importance of amalgamating satellite, meteorological, and topographical data to develop robust pollution forecasting models.

5/31/2024

cs.LG

Physics-based deep learning reveals rising heating demand heightens air pollution in Norwegian cities

Cong Cao, Ramit Debnath, R. Michael Alvarez

Policymakers frequently analyze air quality and climate change in isolation, disregarding their interactions. This study explores the influence of specific climate factors on air quality by contrasting a regression model with K-Means Clustering, Hierarchical Clustering, and Random Forest techniques. We employ Physics-based Deep Learning (PBDL) and Long Short-Term Memory (LSTM) to examine the air pollution predictions. Our analysis utilizes ten years (2009-2018) of daily traffic, weather, and air pollution data from three major cities in Norway. Findings from feature selection reveal a correlation between rising heating degree days and heightened air pollution levels, suggesting increased heating activities in Norway are a contributing factor to worsening air quality. PBDL demonstrates superior accuracy in air pollution predictions compared to LSTM. This paper contributes to the growing literature on PBDL methods for more accurate air pollution predictions using environmental variables, aiding policymakers in formulating effective data-driven climate policies.

5/9/2024

cs.CY cs.AI cs.LG cs.NE


Indoor PM2.5 forecasting and the association with outdoor air pollution: a modelling study based on sensor data in Australia

Wenhua Yu, Bahareh Nakisa, Seng W. Loke, Svetlana Stevanovic, Yuming Guo, Mohammad Naim Rastgoo

Exposure to poor indoor air quality poses significant health risks, necessitating thorough assessment to mitigate associated dangers. This study aims to predict hourly indoor fine particulate matter (PM2.5) concentrations and investigate their correlation with outdoor PM2.5 levels across 24 distinct buildings in Australia. Indoor air quality data were gathered from 91 monitoring sensors in eight Australian cities spanning 2019 to 2022. Employing an innovative three-stage deep ensemble machine learning framework (DEML), comprising three base models (Support Vector Machine, Random Forest, and eXtreme Gradient Boosting) and two meta-models (Random Forest and Generalized Linear Model), hourly indoor PM2.5 concentrations were predicted. The model's accuracy was evaluated using a rolling windows approach, comparing its performance against three benchmark algorithms (SVM, RF, and XGBoost). Additionally, a correlation analysis assessed the relationship between indoor and outdoor PM2.5 concentrations. Results indicate that the DEML model consistently outperformed benchmark models, achieving an R2 ranging from 0.63 to 0.99 and RMSE from 0.01 to 0.663 mg/m3 for most sensors. Notably, outdoor PM2.5 concentrations significantly impacted indoor air quality, particularly evident during events like bushfires. This study underscores the importance of accurate indoor air quality prediction, crucial for developing location-specific early warning systems and informing effective interventions. By promoting protective behaviors, these efforts contribute to enhanced public health outcomes.

5/14/2024

cs.LG cs.AI