Interpreting artificial neural networks to detect genome-wide association signals for complex traits

Read original: arXiv:2407.18811 - Published 7/29/2024 by Burak Yelmen, Maris Alver, Estonian Biobank Research Team, Flora Jay, Lili Milani

Overview

Researchers trained artificial neural networks (ANNs) to predict complex traits from genotype data
They then applied post hoc interpretability methods to the trained networks to identify potentially associated loci for the target trait
The results suggest that interpretable ANNs can complement conventional genome-wide association studies (GWAS) and serve as initial screening tools for follow-up functional studies

Plain English Explanation

The researchers in this study used a type of artificial intelligence called artificial neural networks to identify genetic variants (differences in DNA) that are associated with complex traits, such as height or disease risk. Complex traits are influenced by many different genes working together in complex ways.

The key innovation here was that the researchers didn't just use the neural network as a "black box" to make predictions. Instead, they also tried to interpret how the neural network was making its decisions - what specific genetic variants it was focusing on to detect associations with the traits. This interpretation step is important because it can provide insights into the underlying genomic architecture - the complex web of genetic factors - that shape these complex traits.

By opening up the "black box" of the neural network, the researchers could see which regions of the genome drove its predictions. That gives geneticists concrete candidate regions to follow up, rather than a prediction alone.

Technical Explanation

The researchers developed a method to interpret the inner workings of artificial neural networks (ANNs) trained to detect genome-wide association signals for complex traits. They used an ANN architecture with multiple hidden layers to model the complex, nonlinear relationships between genetic variants and trait values.

To interpret the ANN models, the researchers applied post hoc interpretability methods from the field of explainable AI, such as saliency mapping and layer visualization, to extract a feature importance score for each genetic variant. These scores indicate which variants the ANN relied on most when predicting the trait, and the highest-scoring regions are flagged as potentially associated loci.
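Gradient-based saliency is one common post hoc method of this kind: it scores each input by how strongly the network output changes when that input changes. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It uses a hypothetical one-hidden-layer network with hand-set weights (W1, b1, w2, b2 are invented for the example), genotypes coded as 0/1/2 allele counts, and the mean absolute gradient across individuals as the importance score.

```python
import math

def forward(x, W1, b1, w2, b2):
    """One tanh hidden layer followed by a linear output (toy network)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sum(w * h for w, h in zip(w2, hidden)) + b2
    return output, hidden

def saliency(x, W1, b1, w2, b2):
    """Analytic gradient of the output with respect to each input variant."""
    _, hidden = forward(x, W1, b1, w2, b2)
    grads = [0.0] * len(x)
    for row, h, w in zip(W1, hidden, w2):
        scale = w * (1.0 - h * h)  # chain rule: d tanh(z)/dz = 1 - tanh(z)^2
        for j, wj in enumerate(row):
            grads[j] += scale * wj
    return grads

def mean_abs_saliency(samples, W1, b1, w2, b2):
    """Average |gradient| per variant over all individuals."""
    totals = [0.0] * len(samples[0])
    for x in samples:
        for j, g in enumerate(saliency(x, W1, b1, w2, b2)):
            totals[j] += abs(g)
    return [t / len(samples) for t in totals]

# Hypothetical weights: hidden unit 0 reads only variant 0,
# hidden unit 1 reads only variant 2; variants 1 and 3 are ignored.
W1 = [[1.5, 0.0, 0.0, 0.0],
      [0.0, 0.0, -1.0, 0.0]]
b1 = [0.0, 0.1]
w2 = [1.0, 0.5]
b2 = 0.0

# Made-up genotypes: rows are individuals, columns are variants (0/1/2).
genotypes = [[0, 1, 2, 0], [1, 0, 1, 2], [2, 2, 0, 1], [0, 0, 0, 0]]

scores = mean_abs_saliency(genotypes, W1, b1, w2, b2)
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print("importance per variant:", scores)
print("variants ranked by importance:", ranked)
```

In this toy setup the two variants the network never reads receive a score of exactly zero, while the two it does read are ranked first. A trained network would give noisier scores, which is why some selection rule is needed on top of the raw values.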

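The researchers applied their interpretable ANN approach to both simulated and real genotype/phenotype datasets. In simulations run with various parameters, associated loci were detected with good precision under strict selection criteria, although linkage disequilibrium means that downstream analyses are still required to fine-map the exact variants, as in conventional GWAS. Applied to the schizophrenia cohort in the Estonian Biobank, the method detected multiple potentially associated loci (PAL) for this highly polygenic and heritable disorder, and enrichment analyses of PAL in genic regions predominantly identified terms associated with brain morphology.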
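One way to turn per-variant importance scores into a short list of candidate loci is to keep only variants that rank highly in every one of several independently trained models, and then to group nearby survivors into a single locus. The sketch below illustrates that idea only. The consensus rule, the value of k, the 50 kb window, and all scores and positions are assumptions made for the example and are not taken from the paper.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring variants in one run."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])

def consensus_variants(runs, k):
    """Variants that fall in the top k of every run (a strict criterion)."""
    selected = top_k(runs[0], k)
    for scores in runs[1:]:
        selected &= top_k(scores, k)
    return sorted(selected)

def merge_into_loci(indices, positions, window):
    """Group selected variants lying within `window` base pairs of the
    previous selected variant. Assumes positions are sorted ascending."""
    loci = []
    for j in indices:
        if loci and positions[j] - positions[loci[-1][-1]] <= window:
            loci[-1].append(j)
        else:
            loci.append([j])
    return loci

# Made-up importance scores from three hypothetical training runs.
runs = [
    [0.10, 0.90, 0.80, 0.05, 0.20, 0.70],
    [0.20, 0.80, 0.90, 0.10, 0.60, 0.75],
    [0.05, 0.95, 0.70, 0.30, 0.10, 0.80],
]
# Made-up base-pair positions on one chromosome.
positions = [1000, 52000, 60000, 300000, 900000, 905000]

selected = consensus_variants(runs, k=4)
loci = merge_into_loci(selected, positions, window=50000)
print(selected)  # [1, 2, 5]
print(loci)      # [[1, 2], [5]]
```

Variant 4 makes the top four in two runs but not the third, so the consensus rule drops it. Variants 1 and 2 sit 8 kb apart and are reported as one locus, which mirrors the point that correlated neighbouring variants cannot be told apart without fine-mapping.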

Critical Analysis

The researchers acknowledge several limitations of their approach. First, the interpretation techniques they used, while powerful, do not provide a complete picture of the ANN's decision-making process. There may be complex, higher-order interactions between genetic variants that are difficult to fully capture.
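A small worked example shows why interactions are hard to summarize with one score per variant. It is a constructed toy case, not data or analysis from the study. Two hypothetical variants a and b are coded as -1 or +1, and the phenotype is their product, which is a pure interaction with no individual effect from either variant.

```python
from itertools import product

def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# All four combinations of two variants, each coded -1 or +1.
pairs = list(product([-1, 1], repeat=2))
a = [p[0] for p in pairs]
b = [p[1] for p in pairs]

# Phenotype depends only on the interaction between the two variants.
y = [ai * bi for ai, bi in zip(a, b)]

cov_a = covariance(a, y)
cov_b = covariance(b, y)
cov_ab = covariance([ai * bi for ai, bi in zip(a, b)], y)

print(cov_a, cov_b, cov_ab)  # 0.0 0.0 1.0
```

Each variant on its own has zero covariance with the phenotype, while the product term explains it completely. Any summary that reports a single number per variant has to decide how to credit such joint effects, and different interpretability methods make that choice differently.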

Additionally, the researchers note that their method primarily identifies common genetic variants associated with complex traits. Rare variants, which may also play an important role, are not as easily detected by the ANN models.
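The common versus rare distinction is usually drawn from the minor allele frequency (MAF), the frequency of the less common allele at a variant. The sketch below computes MAF from 0/1/2 genotype counts and splits variants at a cutoff. The 0.05 cutoff and the genotype matrix are assumptions chosen for illustration. Studies use different thresholds, and the summary does not state which one applies here.

```python
def minor_allele_frequency(genotypes):
    """MAF for one variant, given alternate-allele counts (0, 1 or 2)
    for diploid individuals."""
    n_alleles = 2 * len(genotypes)
    alt_freq = sum(genotypes) / n_alleles
    return min(alt_freq, 1.0 - alt_freq)

def split_by_frequency(matrix, threshold=0.05):
    """Return (common, rare) lists of variant indices.
    `matrix` rows are individuals, columns are variants."""
    common, rare = [], []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        if minor_allele_frequency(column) >= threshold:
            common.append(j)
        else:
            rare.append(j)
    return common, rare

# Made-up cohort of 20 individuals and 3 variants:
#   variant 0 cycles through 0/1/2, variant 1 has a single heterozygous
#   carrier, variant 2 is mostly homozygous for the alternate allele.
matrix = [[i % 3, 1 if i == 0 else 0, 1 if i < 4 else 2] for i in range(20)]

common, rare = split_by_frequency(matrix)
print(common, rare)  # [0, 2] [1]
```

Variant 1 has one carrier among 20 people, so a model sees a single example of its effect. That scarcity of carriers is the usual reason rare variants are harder to detect, for neural networks and for conventional association tests alike.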

Further research is needed to extend this interpretable AI approach beyond schizophrenia to other complex diseases, including other psychiatric and neurological disorders, where the underlying genetics may be equally intricate. The authors point to improvements in model optimization and confidence measures as the next steps toward making ANN-based detection of associated loci more reliable.

Conclusion

This study demonstrates the potential of using interpretable artificial neural networks to uncover the genetic underpinnings of complex human traits. By going beyond the "black box" of the ANN models and analyzing the specific genetic variants they focus on, the researchers were able to gain insights into the genomic architecture shaping these important characteristics.

This type of interpretable AI approach holds great promise for accelerating genetic research and enhancing our understanding of the fundamental biological mechanisms that contribute to complex human traits and health conditions. As the field of explainable AI continues to evolve, we can expect to see more innovative applications of these techniques in the life sciences and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Interpreting artificial neural networks to detect genome-wide association signals for complex traits

Burak Yelmen, Maris Alver, Estonian Biobank Research Team, Flora Jay, Lili Milani

Investigating the genetic architecture of complex diseases is challenging due to the highly polygenic and interactive landscape of genetic and environmental factors. Although genome-wide association studies (GWAS) have identified thousands of variants for multiple complex phenotypes, conventional statistical approaches can be limited by simplified assumptions such as linearity and lack of epistasis models. In this work, we trained artificial neural networks for predicting complex traits using both simulated and real genotype/phenotype datasets. We extracted feature importance scores via different post hoc interpretability methods to identify potentially associated loci (PAL) for the target phenotype. Simulations we performed with various parameters demonstrated that associated loci can be detected with good precision using strict selection criteria, but downstream analyses are required for fine-mapping the exact variants due to linkage disequilibrium, similarly to conventional GWAS. By applying our approach to the schizophrenia cohort in the Estonian Biobank, we were able to detect multiple PAL related to this highly polygenic and heritable disorder. We also performed enrichment analyses with PAL in genic regions, which predominantly identified terms associated with brain morphology. With further improvements in model optimization and confidence measures, artificial neural networks can enhance the identification of genomic loci associated with complex diseases, providing a more comprehensive approach for GWAS and serving as initial screening tools for subsequent functional studies. Keywords: Deep learning, interpretability, genome-wide association studies, complex diseases

7/29/2024

Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

Thomas Le Menestrel, Erin Craig, Robert Tibshirani, Trevor Hastie, Manuel Rivas

Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS score was the primary driver behind prediction. Our findings indicate that both interaction terms and pre-training can enhance prediction accuracy but for a limited set of diseases and moderate improvements in accuracy.

5/8/2024

AI-driven multi-omics integration for multi-scale predictive modeling of causal genotype-environment-phenotype relationships

You Wu (Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, USA), Lei Xie (Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York, USA, Ph.D. Program in Biology and Biochemistry, The Graduate Center, The City University of New York, New York, New York, USA, Department of Computer Science, Hunter College, The City University of New York, New York, New York, USA, Helen and Robert Appel Alzheimers Disease Research Institute, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, New York, USA)

Despite the wealth of single-cell multi-omics data, it remains challenging to predict the consequences of novel genetic and chemical perturbations in the human body. It requires knowledge of molecular interactions at all biological levels, encompassing disease models and humans. Current machine learning methods primarily establish statistical correlations between genotypes and phenotypes but struggle to identify physiologically significant causal factors, limiting their predictive power. Key challenges in predictive modeling include scarcity of labeled data, generalization across different domains, and disentangling causation from correlation. In light of recent advances in multi-omics data integration, we propose a new artificial intelligence (AI)-powered biology-inspired multi-scale modeling framework to tackle these issues. This framework will integrate multi-omics data across biological levels, organism hierarchies, and species to predict causal genotype-environment-phenotype relationships under various conditions. AI models inspired by biology may identify novel molecular targets, biomarkers, pharmaceutical agents, and personalized medicines for presently unmet medical needs.

7/10/2024

Semantically Rich Local Dataset Generation for Explainable AI in Genomics

Pedro Barbosa, Rosina Savisaar, Alcides Fonseca

Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms. Therefore, interpreting these models may provide novel insights into the underlying biology, supporting downstream biomedical applications. Due to their complexity, interpretable surrogate models can only be built for local explanations (e.g., a single instance). However, accomplishing this requires generating a dataset in the neighborhood of the input, which must maintain syntactic similarity to the original data while introducing semantic variability in the model's predictions. This task is challenging due to the complex sequence-to-function relationship of DNA. We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity. Our custom, domain-guided individual representation effectively constrains syntactic similarity, and we provide two alternative fitness functions that promote diversity with no computational effort. Applied to the RNA splicing domain, our approach quickly achieves good diversity and significantly outperforms a random baseline in exploring the search space, as shown by our proof-of-concept, short RNA sequence. Furthermore, we assess its generalizability and demonstrate scalability to larger sequences, resulting in a ~30% improvement over the baseline.

7/18/2024