Explainable machine learning for predicting shellfish toxicity in the Adriatic Sea using long-term monitoring data of HABs

Read original: arXiv:2405.04372 - Published 5/10/2024 by Martin Marzidovšek, Janja Francé, Vid Podpečan, Stanka Vadnjal, Jožica Dolenc, Patricija Mozetič

Overview

This study applied explainable machine learning techniques to predict the toxicity of mussels in the Gulf of Trieste, Adriatic Sea, caused by harmful algal blooms.
The researchers trained and evaluated machine learning models to accurately predict diarrhetic shellfish poisoning (DSP) events based on a 28-year dataset of toxic phytoplankton records and mussel toxin concentrations.
The random forest model provided the best prediction of positive toxicity results, and explainability methods identified key algal species and environmental factors as the best predictors of DSP outbreaks.

Plain English Explanation

Mussels are a popular seafood, but they can sometimes become toxic and cause illness in people who eat them. This happens when certain toxin-producing algae multiply in the water where the mussels live, an event known as a harmful algal bloom. The researchers in this study wanted to find a way to better predict when mussels are likely to become toxic, so they could warn mussel farmers and consumers ahead of time.

To do this, the researchers analyzed a large dataset that included information about the different types of algae present in mussel farming areas, as well as the levels of toxins found in the mussels themselves over a 28-year period. They then used machine learning models to identify patterns in the data that could help predict when mussels might become toxic.

The best-performing model was the random forest model, which was able to accurately predict when mussels would be unsafe to eat based on the types of algae present and other environmental factors like water salinity, river discharge, and precipitation.

The researchers also used techniques to explain how the models were making their predictions, which identified two specific types of algae, Dinophysis fortii and D. caudata, as the most important factors in predicting when mussels would be toxic.

Overall, this research could help improve early warning systems for mussel farms and support more sustainable aquaculture practices, ensuring that seafood is safe for consumers.

Technical Explanation

This study applied explainable machine learning techniques to predict the toxicity of mussels (Mytilus galloprovincialis) in the Gulf of Trieste, Adriatic Sea, caused by harmful algal blooms. The researchers created a 28-year dataset containing records of toxic phytoplankton in mussel farming areas and toxin concentrations in mussels, and used this data to train and evaluate the performance of various machine learning models in predicting diarrhetic shellfish poisoning (DSP) events.

The random forest model provided the best prediction of positive toxicity results, achieving the highest F1 score. Explainability methods such as permutation importance and SHAP were then used to identify the key features driving the model's predictions. These analyses revealed that the presence of specific algal species, namely Dinophysis fortii and D. caudata, as well as environmental factors like salinity, river discharge, and precipitation, were the most important predictors of DSP outbreaks.
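To make the permutation importance idea concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the data are synthetic, the feature names and thresholds (200 cells/L, salinity 37) are made up for illustration, and a hand-written threshold rule stands in for the trained random forest. It shows only the mechanics: shuffle one feature column at a time and measure how much the F1 score drops.

```python
import random


def f1_score(y_true, y_pred):
    """F1 score for the positive class (1 = toxic sample)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in F1 when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = f1_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = f1_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Hypothetical feature names; the values below are synthetic, not monitoring data.
FEATURES = ["dinophysis_cells_per_l", "salinity", "unrelated_noise"]


def toy_model(row):
    """Stand-in for a fitted classifier (assumed rule, not from the paper)."""
    return 1 if row[0] > 200 and row[1] < 37.0 else 0


data_rng = random.Random(42)
X = [
    [data_rng.uniform(0, 1000), data_rng.uniform(33, 38), data_rng.random()]
    for _ in range(200)
]
y = [toy_model(row) for row in X]  # labels follow the same rule by construction

baseline, importances = permutation_importance(toy_model, X, y)
print(f"baseline F1: {baseline:.3f}")
for name, score in zip(FEATURES, importances):
    print(f"{name}: mean F1 drop {score:.3f}")
```

Because the stand-in model never reads the noise column, shuffling it leaves the predictions unchanged and its importance is zero, while shuffling either of the two features the rule depends on lowers the F1 score. With a real fitted model the same procedure would normally be run on held-out data.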

Critical Analysis

The researchers acknowledge several limitations in their study. Firstly, the dataset was limited to a single geographic region, the Gulf of Trieste, and the findings may not be directly transferable to other mussel farming areas. Additionally, the dataset only covered a 28-year period, which may not be sufficient to capture the full range of variability in harmful algal bloom dynamics and their impacts on mussel toxicity.

While the explainable machine learning techniques provided valuable insights into the key predictors of DSP events, the study did not explore the underlying mechanisms linking these factors to mussel toxicity. Further research may be needed to better understand the complex ecological and physiological processes involved.

Furthermore, the study focused on predicting the occurrence of DSP events, but did not address the potential for false positive or false negative predictions, which could have important implications for the practical implementation of an early warning system.
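For context, the F1 score used to rank the models does combine both error types into a single number, which is why it cannot show how they are balanced. The standard definitions, with TP, FP, and FN denoting true positives, false positives, and false negatives for the toxic class, are:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```

FP and FN enter the last expression symmetrically, so two models with the same F1 score can differ in whether they tend to miss toxic samples or to raise false alarms, and those two errors may carry different costs in an early warning system.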

Despite these limitations, this research represents an important step towards developing more accurate and interpretable models for predicting the safety of mussels and supporting sustainable aquaculture practices.

Conclusion

This study demonstrated the value of applying explainable machine learning techniques to predict the toxicity of mussels caused by harmful algal blooms. By analyzing a comprehensive dataset of toxic phytoplankton and mussel toxin concentrations, the researchers were able to train a random forest model that could accurately predict diarrhetic shellfish poisoning events.

The use of explainability methods, such as permutation importance and SHAP, provided valuable insights into the key drivers of mussel toxicity, identifying specific algal species and environmental factors as the most important predictors. These findings could inform the development of early warning systems to protect public health and support the sustainability of mussel farming operations.

Overall, this research highlights the potential of machine learning and explainable AI to tackle difficult environmental challenges and improve our understanding of the interactions between marine ecosystems and human activities.

Related Papers

Explainable machine learning for predicting shellfish toxicity in the Adriatic Sea using long-term monitoring data of HABs

Martin Marzidovšek, Janja Francé, Vid Podpečan, Stanka Vadnjal, Jožica Dolenc, Patricija Mozetič

In this study, explainable machine learning techniques are applied to predict the toxicity of mussels in the Gulf of Trieste (Adriatic Sea) caused by harmful algal blooms. By analysing a newly created 28-year dataset containing records of toxic phytoplankton in mussel farming areas and toxin concentrations in mussels (Mytilus galloprovincialis), we train and evaluate the performance of ML models to accurately predict diarrhetic shellfish poisoning (DSP) events. The random forest model provided the best prediction of positive toxicity results based on the F1 score. Explainability methods such as permutation importance and SHAP identified key species (Dinophysis fortii and D. caudata) and environmental factors (salinity, river discharge and precipitation) as the best predictors of DSP outbreaks. These findings are important for improving early warning systems and supporting sustainable aquaculture practices.

5/10/2024

Utilising Explainable Techniques for Quality Prediction in a Complex Textiles Manufacturing Use Case

Briony Forsberg, Dr Henry Williams, Prof Bruce MacDonald, Tracy Chen, Dr Reza Hamzeh, Dr Kirstine Hulse

This paper develops an approach to classify instances of product failure in a complex textiles manufacturing dataset using explainable techniques. The dataset used in this study was obtained from a New Zealand manufacturer of woollen carpets and rugs. In investigating the trade-off between accuracy and explainability, three different tree-based classification algorithms were evaluated: a Decision Tree and two ensemble methods, Random Forest and XGBoost. Additionally, three feature selection methods were also evaluated: the SelectKBest method, using chi-squared as the scoring function, the Pearson Correlation Coefficient, and the Boruta algorithm. Not surprisingly, the ensemble methods typically produced better results than the Decision Tree model. The Random Forest model yielded the best results overall when combined with the Boruta feature selection technique. Finally, a tree ensemble explaining technique was used to extract rule lists to capture necessary and sufficient conditions for classification by a trained model that could be easily interpreted by a human. Notably, several features that were in the extracted rule lists were statistical features and calculated features that were added to the original dataset. This demonstrates the influence that bringing in additional information during the data preprocessing stages can have on the ultimate model performance.

7/29/2024

A Critical Assessment of Interpretable and Explainable Machine Learning for Intrusion Detection

Omer Subasi, Johnathan Cree, Joseph Manzano, Elena Peterson

There has been a large number of studies in interpretable and explainable ML for cybersecurity, in particular, for intrusion detection. Many of these studies have a significant amount of overlapping and repeated evaluations and analysis. At the same time, these studies overlook crucial model, data, learning process, and utility related issues and many times completely disregard them. These issues include the use of overly complex and opaque ML models, unaccounted data imbalances and correlated features, inconsistent influential features across different explanation methods, the inconsistencies stemming from the constituents of a learning process, and the implausible utility of explanations. In this work, we empirically demonstrate these issues, analyze them and propose practical solutions in the context of feature-based model explanations. Specifically, we advise avoiding complex opaque models such as Deep Neural Networks and instead using interpretable ML models such as Decision Trees as the available intrusion datasets are not difficult for such interpretable models to classify successfully. Then, we bring attention to the binary classification metrics such as Matthews Correlation Coefficient, which are well-suited for imbalanced datasets. Moreover, we find that feature-based model explanations are most often inconsistent across different settings. In this respect, to further gauge the extent of inconsistencies, we introduce the notion of cross explanations which corroborates that the features that are determined to be impactful by one explanation method most often differ from those by another method. Furthermore, we show that strongly correlated data features and the constituents of a learning process, such as hyper-parameters and the optimization routine, become yet another source of inconsistent explanations. Finally, we discuss the utility of feature-based explanations.

7/8/2024

Evaluating Explanatory Capabilities of Machine Learning Models in Medical Diagnostics: A Human-in-the-Loop Approach

José Bobes-Bascarán (University of Coruña), Eduardo Mosqueira-Rey (University of Coruña), Ángel Fernández-Leal (University of Coruña), Elena Hernández-Pereira (University of Coruña), David Alonso-Ríos (University of Coruña), Vicente Moret-Bonillo (University of Coruña), Israel Figueirido-Arnoso (University of Coruña), Yolanda Vidal-Ínsua (Complejo Hospitalario)

This paper presents a comprehensive study on the evaluation of explanatory capabilities of machine learning models, with a focus on Decision Trees, Random Forest and XGBoost models using a pancreatic cancer dataset. We use Human-in-the-Loop related techniques and medical guidelines as a source of domain knowledge to establish the importance of the different features that are relevant to establish a pancreatic cancer treatment. These features are not only used as a dimensionality reduction approach for the machine learning models, but also as a way to evaluate the explainability capabilities of the different models using agnostic and non-agnostic explainability techniques. To facilitate interpretation of explanatory results, we propose the use of similarity measures such as the Weighted Jaccard Similarity coefficient. The goal is to not only select the best performing model but also the one that can best explain its conclusions and aligns with human domain knowledge.

4/1/2024