Uncertainty estimation of machine learning spatial precipitation predictions from satellite data

Read original: arXiv:2311.07511 - Published 8/23/2024 by Georgia Papacharalampous, Hristos Tyralis, Nikolaos Doulamis, Anastasios Doulamis

Overview

Combining satellite and gauge data with machine learning can create high-resolution precipitation datasets, but these datasets often lack uncertainty estimates.
This research benchmarked six algorithms to find out how best to provide predictive uncertainty estimates for spatial precipitation data.
The authors evaluated the algorithms on 15 years of monthly data across the contiguous United States (CONUS).
The algorithms included quantile regression, quantile regression forests, generalized random forests, gradient boosting machines, light gradient boosting machine, and quantile regression neural networks.

Plain English Explanation

Precipitation (rain, snow, and other water falling from the atmosphere) is an important environmental factor that affects many aspects of our lives. However, accurately measuring and predicting precipitation can be challenging, especially across large geographic areas.

To address this, researchers have started combining data from satellite observations and ground-based rain gauges, using machine learning techniques to create high-resolution precipitation datasets. These datasets can provide more detailed and accurate information than relying on just one data source.

However, an important limitation of these machine learning-based precipitation datasets is that they often do not include estimates of the uncertainty in the predictions. Knowing the uncertainty is crucial for making informed decisions based on the data, such as in weather forecasting or water resource management.

This research aimed to find the best ways to provide reliable uncertainty estimates for spatial precipitation predictions. The researchers tested six machine learning algorithms, most of them new to this kind of spatial uncertainty estimation, on 15 years of monthly precipitation data across the contiguous United States. They evaluated how well each algorithm predicts a set of quantiles that together approximate the full probability distribution of precipitation at each location, rather than just providing a single predicted value.

The results showed that the light gradient boosting machine algorithm (LightGBM) outperformed the other five methods, including the random forest models that are the current standard, in the quality of its predicted quantiles. This suggests that light gradient boosting machines could be a useful tool for creating precipitation datasets that not only provide accurate predictions, but also clearly communicate the level of confidence in those predictions.

Technical Explanation

The researchers benchmarked six machine learning algorithms for their ability to estimate predictive uncertainty in spatial precipitation data:

Quantile Regression (QR): A statistical method for modeling the conditional quantiles of a response variable.
Quantile Regression Forests (QRF): An extension of random forests that can model the full conditional distribution.
Generalized Random Forests (GRF): A flexible framework for training random forest models to estimate various statistical quantities, including quantiles.
Gradient Boosting Machines (GBM): An ensemble method that combines many weak prediction models (like decision trees) into a strong predictive model.
Light Gradient Boosting Machine (LightGBM): A variant of gradient boosting that is computationally efficient and can handle large-scale data.
Quantile Regression Neural Networks (QRNN): A neural network architecture designed to model conditional quantiles.
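
To make "modeling a conditional quantile" concrete, here is a minimal sketch of the quantile (pinball) loss that loss-based quantile methods typically minimize. It fits the simplest possible model, a single constant, so it is not any of the six benchmarked algorithms; forest-based methods such as QRF and GRF instead read quantiles off a weighted empirical distribution. The function names and the precipitation values are hypothetical and chosen only for illustration.

```python
def pinball_loss(y, q, tau):
    """Quantile (pinball) loss of prediction q for observation y at level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def fit_constant_quantile(ys, tau):
    """Return the observed value that minimises the mean pinball loss.

    For a constant (intercept-only) model, a minimiser always exists among
    the observed values, so searching over them is sufficient.
    """
    def mean_loss(q):
        return sum(pinball_loss(y, q, tau) for y in ys) / len(ys)

    return min(sorted(set(ys)), key=mean_loss)


# Hypothetical monthly precipitation totals (mm) at one site.
monthly_precip = [12.0, 30.5, 45.2, 8.1, 60.3, 22.7, 95.4, 15.0, 38.6, 51.9, 27.3]

low = fit_constant_quantile(monthly_precip, 0.1)
median = fit_constant_quantile(monthly_precip, 0.5)
high = fit_constant_quantile(monthly_precip, 0.9)
```

Because the loss penalises under-prediction and over-prediction asymmetrically, changing `tau` moves the fitted value through the distribution. The regression methods replace the constant with a function of the predictors.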

The researchers trained these models on 15 years of monthly precipitation data from over 12,000 locations across the contiguous United States. The predictors at each site were nearby values from two satellite precipitation retrievals, PERSIANN and IMERG, together with the site's elevation. The target variable was the monthly mean gauge precipitation.
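
As a rough illustration of how such a predictor vector might be assembled for one gauge site, the sketch below picks the nearest satellite grid cells by great-circle distance and appends the elevation. The number of neighbours (`k=4`), the distance ordering, the grid coordinates, and all values are assumptions made for the example; the summary only states that "nearby" satellite values and elevation were used.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_values(site_lat, site_lon, grid_cells, k):
    """Values of the k grid cells closest to the site, nearest first.

    grid_cells is a list of (lat, lon, value) tuples.
    """
    ranked = sorted(grid_cells, key=lambda c: haversine_km(site_lat, site_lon, c[0], c[1]))
    return [value for _, _, value in ranked[:k]]


def build_predictors(site_lat, site_lon, elevation_m, persiann_cells, imerg_cells, k=4):
    """Concatenate nearby PERSIANN values, nearby IMERG values, and elevation."""
    return (
        nearest_values(site_lat, site_lon, persiann_cells, k)
        + nearest_values(site_lat, site_lon, imerg_cells, k)
        + [elevation_m]
    )


# Hypothetical grid cells: (lat, lon, monthly precipitation in mm).
persiann = [(40.0, -105.0, 31.0), (40.0, -105.25, 28.0), (40.25, -105.0, 35.0),
            (40.25, -105.25, 33.0), (41.0, -104.0, 50.0)]
imerg = [(40.0, -105.0, 29.5), (40.0, -105.25, 27.0), (40.25, -105.0, 36.5),
         (40.25, -105.25, 30.0), (41.0, -104.0, 48.0)]

features = build_predictors(40.1, -105.1, 1650.0, persiann, imerg)
```

Each row built this way would be paired with the gauge measurement for the same site and month to form one training example.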

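The models' performance was evaluated using quantile scoring functions and the quantile scoring rule, which measure how well the predicted quantiles match the observed precipitation. Predictions were issued at nine quantile levels (0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975). With respect to QR, LightGBM improved the quantile scoring rule by 11.10%, also surpassing QRF (7.96%), GRF (7.44%), GBM (4.64%), and QRNN (1.73%). LightGBM thus outperformed the random forest variants (QRF and GRF), which the authors describe as the current standard for spatial prediction with machine learning.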
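
The evaluation idea can be sketched in a few lines. The code below assumes the quantile scoring function is the usual pinball loss, that the quantile scoring rule is its plain average over the quantile levels, and that a percentage improvement is the relative reduction in that average with respect to a reference model. These definitions, the three levels used, and the toy numbers are assumptions for illustration, not values from the paper.

```python
def quantile_score(y, q, tau):
    """Quantile scoring function (pinball loss); lower is better."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_quantile_score(observations, predictions, tau):
    """Average quantile score at one level over all cases."""
    total = sum(quantile_score(y, q, tau) for y, q in zip(observations, predictions))
    return total / len(observations)


def quantile_scoring_rule(observations, predicted_quantiles):
    """Average of the per-level mean scores.

    predicted_quantiles maps each quantile level to a list of predictions,
    one per observation.
    """
    per_level = [
        mean_quantile_score(observations, preds, tau)
        for tau, preds in predicted_quantiles.items()
    ]
    return sum(per_level) / len(per_level)


def relative_improvement(score_model, score_reference):
    """Percent reduction in score relative to a reference model."""
    return 100.0 * (1.0 - score_model / score_reference)


# Toy data: three observations, three quantile levels, two hypothetical models.
observed = [10.0, 20.0, 30.0]
reference = {0.25: [10.0, 10.0, 10.0], 0.5: [20.0, 20.0, 20.0], 0.75: [30.0, 30.0, 30.0]}
candidate = {0.25: [8.0, 16.0, 26.0], 0.5: [10.0, 20.0, 30.0], 0.75: [12.0, 24.0, 34.0]}

reference_score = quantile_scoring_rule(observed, reference)
candidate_score = quantile_scoring_rule(observed, candidate)
improvement = relative_improvement(candidate_score, reference_score)
```

In this toy case the reference ignores the predictors entirely, so the candidate's quantiles track the observations much more closely and its score is far lower.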

Critical Analysis

The researchers acknowledge several limitations and areas for further research:

The study was limited to monthly precipitation data, and the performance of the algorithms may differ for other temporal resolutions (e.g., daily or hourly).
The predictors used were relatively simple (satellite data and elevation), and incorporating additional variables, such as weather model outputs, may improve the uncertainty estimates.
The evaluation focused on quantile scoring, but other metrics, such as the coverage of prediction intervals, could provide additional insights.
The benchmarking was conducted on data from the contiguous United States, and the algorithms' performance may vary in other geographic regions with different precipitation patterns.
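
The interval-coverage check mentioned in the list above could look like the following sketch. It treats the predicted 0.025 and 0.975 quantiles as the bounds of a central 95% prediction interval and compares the observed hit rate with that nominal level. The function names and all numbers are hypothetical.

```python
def interval_coverage(observations, lower, upper):
    """Fraction of observations that fall inside their [lower, upper] interval."""
    hits = sum(1 for y, lo, hi in zip(observations, lower, upper) if lo <= y <= hi)
    return hits / len(observations)


def mean_interval_width(lower, upper):
    """Average width of the prediction intervals (a simple sharpness measure)."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Hypothetical gauge observations (mm) with predicted 0.025 and 0.975 quantiles.
observed = [12.0, 30.5, 45.2, 8.1, 60.3, 22.7, 95.4, 15.0]
q025 = [5.0, 10.0, 20.0, 9.0, 30.0, 10.0, 40.0, 5.0]
q975 = [40.0, 70.0, 90.0, 35.0, 120.0, 60.0, 90.0, 45.0]

nominal = 0.975 - 0.025
coverage = interval_coverage(observed, q025, q975)
width = mean_interval_width(q025, q975)
```

Coverage well below the nominal level points to intervals that are too narrow, and coverage well above it points to intervals that are wider than necessary. Reporting the mean width alongside coverage helps separate the two cases.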

Additionally, while the researchers demonstrate the superior performance of LightGBM compared to the other algorithms, they do not provide a detailed analysis of why this method outperformed the others. Further research could explore the specific model characteristics and inductive biases that make LightGBM well-suited for this task.

Conclusion

This research presents a comprehensive evaluation of six machine learning algorithms for quantifying predictive uncertainty in spatial precipitation data. The key finding is that the light gradient boosting machine algorithm outperformed the other methods, including the random forest models that are the current standard.

This work provides a valuable framework for incorporating uncertainty estimates into high-resolution precipitation datasets, which can improve decision-making in a wide range of applications, such as water resource management, agriculture, and disaster response. The proposed suite of machine learning algorithms, along with the formal evaluation methodology, can serve as a starting point for further research and practical implementation in operational settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Uncertainty estimation of machine learning spatial precipitation predictions from satellite data

Georgia Papacharalampous, Hristos Tyralis, Nikolaos Doulamis, Anastasios Doulamis

Merging satellite and gauge data with machine learning produces high-resolution precipitation datasets, but uncertainty estimates are often missing. We addressed the gap of how to optimally provide such estimates by benchmarking six algorithms, mostly novel even for the more general task of quantifying predictive uncertainty in spatial prediction settings. On 15 years of monthly data from across the contiguous United States (CONUS), we compared quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machine (LightGBM), and quantile regression neural networks (QRNN). Their ability to issue predictive precipitation quantiles at nine quantile levels (0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975), approximating the full probability distribution, was evaluated using quantile scoring functions and the quantile scoring rule. Predictors at a site were nearby values from two satellite precipitation retrievals, namely PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals for GPM), and the site's elevation. The dependent variable was the monthly mean gauge precipitation. With respect to QR, LightGBM showed improved performance in terms of the quantile scoring rule by 11.10%, also surpassing QRF (7.96%), GRF (7.44%), GBM (4.64%) and QRNN (1.73%). Notably, LightGBM outperformed all random forest variants, the current standard in spatial prediction with machine learning. To conclude, we propose a suite of machine learning algorithms for estimating uncertainty in spatial data prediction, supported with a formal evaluation framework based on scoring functions and scoring rules.

8/23/2024

Using Long Short-term Memory (LSTM) to merge precipitation data over mountainous area in Sierra Nevada

Yihan Wang, Lujun Zhang

Obtaining reliable precipitation estimates at high resolution in time and space is of great importance to hydrological studies. However, accurately estimating precipitation is a challenging task over high, complex mountainous terrain. The three widely used precipitation measurement approaches, namely rain gauges, precipitation radars, and satellite-based precipitation sensors, each have their own pros and cons in producing reliable precipitation products over complex areas. One way to decrease the detection error probability and improve data reliability is precipitation data merging. With the rapid advancements in computational capabilities and the escalating volume and diversity of earth observational data, Deep Learning (DL) models have gained considerable attention in geoscience. In this study, a deep learning technique, namely Long Short-term Memory (LSTM), was employed to merge a radar-based precipitation product with a satellite-based one, the Global Precipitation Measurement (GPM) product Integrated Multi-Satellite Retrievals for GPM (IMERG), at hourly scale. The merged results are compared with the widely used reanalysis precipitation product, Multi-Radar Multi-Sensor (MRMS), and assessed against gauge observational data from the California Data Exchange Center (CDEC). The findings indicated that the LSTM-based merged precipitation notably underestimated gauge observations and, at times, failed to provide meaningful estimates, showing predominantly near-zero values. Relying solely on individual Quantitative Precipitation Estimates (QPEs) without additional meteorological input proved insufficient for generating reliable merged QPE. However, the merged results effectively captured the temporal trends of the observations, outperforming MRMS in this aspect. This suggested that incorporating bias correction techniques could potentially enhance the accuracy of the merged product.

4/23/2024

Interpolation of mountain weather forecasts by machine learning

Kazuma Iwase, Tomoyuki Takenawa

Recent advances in numerical simulation methods based on physical models and their combination with machine learning have improved the accuracy of weather forecasts. However, the accuracy decreases in complex terrain such as mountainous regions because these methods usually use grid cells several kilometers on a side and simple machine learning models. While deep learning has also made significant progress in recent years, applying it directly makes it difficult to use the physical knowledge built into the simulations. This paper proposes a method that uses machine learning to interpolate future weather in mountainous regions from forecast data for the surrounding plains and past observed data, with the aim of improving weather forecasts in mountainous regions. We focus on mountainous regions in Japan and predict temperature and precipitation mainly using LightGBM as a machine learning model. Despite the use of a small dataset, through feature engineering and model tuning, our method partially achieves improvements in the RMSE with significantly less training time.

8/15/2024

Machine learning models for daily rainfall forecasting in Northern Tropical Africa using tropical wave predictors

Athul Rasheeda Satheesh, Peter Knippertz, Andreas H. Fink

Numerical weather prediction (NWP) models often underperform compared to simpler climatology-based precipitation forecasts in northern tropical Africa, even after statistical postprocessing. AI-based forecasting models show promise but have avoided precipitation due to its complexity. Synoptic-scale forcings like African easterly waves and other tropical waves (TWs) are important for predictability in tropical Africa, yet their value for predicting daily rainfall remains unexplored. This study uses two machine-learning models--gamma regression and a convolutional neural network (CNN)--trained on TW predictors from satellite-based GPM IMERG data to predict daily rainfall during the July-September monsoon season. Predictor variables are derived from the local amplitude and phase information of seven TW from the target and up-and-downstream neighboring grids at 1-degree spatial resolution. The ML models are combined with Easy Uncertainty Quantification (EasyUQ) to generate calibrated probabilistic forecasts and are compared with three benchmarks: Extended Probabilistic Climatology (EPC15), ECMWF operational ensemble forecast (ENS), and a probabilistic forecast from the ENS control member using EasyUQ (CTRL EasyUQ). The study finds that downstream predictor variables offer the highest predictability, with downstream tropical depression (TD)-type wave-based predictors being most important. Other waves like mixed-Rossby gravity (MRG), Kelvin, and inertio-gravity waves also contribute significantly but show regional preferences. ENS forecasts exhibit poor skill due to miscalibration. CTRL EasyUQ shows improvement over ENS and marginal enhancement over EPC15. Both gamma regression and CNN forecasts significantly outperform benchmarks in tropical Africa. This study highlights the potential of ML models trained on TW-based predictors to improve daily precipitation forecasts in tropical Africa.

8/30/2024