Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

Read original: arXiv:2404.17626 - Published 5/8/2024 by Thomas Le Menestrel, Erin Craig, Robert Tibshirani, Trevor Hastie, Manuel Rivas

Overview

This paper evaluates two methods, Group-LASSO INTERaction-NET (glinternet) and the pretrained lasso, for predicting disease risk across diverse ancestries using multiomic data from the UK Biobank.
The models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases.
Of the 96 models trained, 16 showed a statistically significant improvement in ROC-AUC over the baseline (p-value < 0.05), so the gains are moderate and limited to a subset of diseases.

Plain English Explanation

The researchers in this study tackled the challenge of predicting disease risk for people from different ancestral backgrounds. They used a large dataset called the UK Biobank, which contains genetic, health, and lifestyle information for hundreds of thousands of participants.

Rather than relying on a single model for everyone, the researchers tested two more nuanced approaches. In the first, a model is pre-trained on the overall dataset to capture general patterns and is then adapted to a specific ancestry group. In the second, "interaction" terms are added to the model, allowing it to account for how genetic and environmental factors may influence disease risk differently for people of different ancestries.

The results were mixed. Of the 96 models trained, 16 predicted disease significantly better than the baseline, with improvements found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. In the models that did improve, the polygenic risk score (PRS), a single number summarizing a person's inherited genetic risk, was the main driver of the prediction. Accounting for ancestry can therefore help, but only for some diseases and by a moderate amount.

This research is significant because it highlights the importance of developing personalized, ancestry-aware healthcare solutions. Traditional one-size-fits-all models may overlook important population-specific differences, leading to suboptimal predictions and potentially exacerbating health disparities. The techniques demonstrated in this paper could pave the way for more equitable and effective disease prevention and management strategies.

Technical Explanation

The researchers evaluated two modeling approaches, the pretrained lasso and Group-LASSO INTERaction-NET (glinternet), on UK Biobank multiomic data for ancestry-specific disease prediction.

The first approach, the pretrained lasso, fits a base model on the entire dataset so that it learns general patterns shared across groups. This pre-trained model then serves as the foundation for a second, group-specific fit.
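
The pre-train-then-adapt idea can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes a squared-error loss with no intercept (disease outcomes would normally use a logistic loss), and the function names, the toy data, and the values of `lam` and `alpha` are all hypothetical. The second stage follows one common description of the pretrained lasso, in which the pooled prediction is scaled by `1 - alpha` and used as an offset, and features outside the pooled model's support are penalized by `1 / alpha`; treat that scheme as an assumption here.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (the lasso proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, penalty_factor=None, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam * sum_j pf_j*|b_j| (no intercept)."""
    n, p = len(X), len(X[0])
    pf = penalty_factor if penalty_factor is not None else [1.0] * p
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 at the start
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual (feature j added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam * pf[j]) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def dot(row, beta):
    """Inner product of one feature row with a coefficient vector."""
    return sum(x * b for x, b in zip(row, beta))


def pretrained_lasso(X_all, y_all, X_grp, y_grp, lam, alpha):
    """Two-stage fit: pooled lasso, then a group-specific lasso guided by it.

    alpha in (0, 1]: alpha = 1 ignores the pooled model, small alpha leans on it.
    """
    # Stage 1: pooled model fitted on everyone.
    beta_pooled = lasso_cd(X_all, y_all, lam)
    # Stage 2: the group model explains what the scaled pooled prediction leaves over.
    offset = [(1.0 - alpha) * dot(row, beta_pooled) for row in X_grp]
    target = [yi - oi for yi, oi in zip(y_grp, offset)]
    # Features the pooled model did not select are penalized more heavily (assumed scheme).
    pf = [1.0 if b != 0.0 else 1.0 / alpha for b in beta_pooled]
    beta_group = lasso_cd(X_grp, target, lam, pf)
    return beta_pooled, beta_group


def predict(row, beta_pooled, beta_group, alpha):
    """Group-specific prediction = scaled pooled prediction + group correction."""
    return (1.0 - alpha) * dot(row, beta_pooled) + dot(row, beta_group)


# Hypothetical toy data: the pooled cohort and a smaller group with its own pattern.
X_all = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
y_all = [1.0, 0.0, 1.0, 2.0, 0.0, 2.0]
X_grp = [[1.0, 1.0], [0.0, 2.0], [2.0, 1.0]]
y_grp = [1.5, 1.0, 2.5]
beta_pooled, beta_group = pretrained_lasso(X_all, y_all, X_grp, y_grp, lam=0.05, alpha=0.5)
print(beta_pooled, beta_group, predict([1.0, 1.0], beta_pooled, beta_group, alpha=0.5))
```

Setting `alpha` to 1 removes the offset and the extra penalty, so the second stage reduces to an ordinary lasso fitted on the group alone. Smaller values pull the group model toward the pooled one.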

The second approach, glinternet, adds an "interaction" component to the model, which captures how genetic and environmental factors may differentially influence disease risk across ancestral groups. This is achieved by incorporating interaction terms into the model alongside the main effects.
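
To make the idea of interaction terms concrete, the sketch below expands a feature matrix with ancestry indicators and feature-by-ancestry products. It only illustrates what an interaction column is. It is not glinternet, which selects interactions with a group-lasso penalty rather than adding all of them, and the helper names, feature names, and group labels are hypothetical.

```python
def add_ancestry_interactions(X, ancestry, groups):
    """Return rows of [features | one-hot ancestry | feature x ancestry products]."""
    expanded = []
    for row, label in zip(X, ancestry):
        onehot = [1.0 if label == g else 0.0 for g in groups]
        # One copy of the features per group, zeroed out unless the row belongs to it.
        products = [x * d for d in onehot for x in row]
        expanded.append(list(row) + onehot + products)
    return expanded


def interaction_column_names(feature_names, groups):
    """Column labels in the same order as add_ancestry_interactions emits them."""
    names = list(feature_names) + [f"ancestry={g}" for g in groups]
    names += [f"{f}:ancestry={g}" for g in groups for f in feature_names]
    return names


# Hypothetical features and group labels, for illustration only.
features = ["prs", "bmi"]
groups = ["group_a", "group_b"]
X = [[0.8, 27.0], [-0.3, 31.0]]
ancestry = ["group_a", "group_b"]

design = add_ancestry_interactions(X, ancestry, groups)
for name, value in zip(interaction_column_names(features, groups), design[1]):
    print(name, value)
```

A coefficient on a column such as `prs:ancestry=group_b` lets the effect of that feature differ for one group. Keeping every group's indicator makes the columns redundant, which a penalized model can tolerate but an unpenalized regression cannot.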

The researchers evaluated the models on 8 diseases, validating across a cohort of over 96,000 individuals. Of the 96 models trained, 16 showed statistically significant incremental predictive performance over the baseline in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. Interaction terms and pre-training can therefore improve prediction accuracy, but only for a limited set of diseases and with moderate gains.
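
The comparison against a baseline can be sketched as follows. ROC-AUC is computed as the fraction of case-control pairs in which the case receives the higher score, and the gain of a new model over the baseline is tested with a paired bootstrap. This summary does not say which significance test the paper used, so the bootstrap, the function names, and the toy scores are assumptions.

```python
import random


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random case outscores a random control."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    # Ties between a case and a control count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_gain(y_true, base_scores, new_scores, n_boot=1000, seed=0):
    """Observed AUC gain of new over base, plus a one-sided bootstrap p-value."""
    rng = random.Random(seed)
    n = len(y_true)
    observed = roc_auc(y_true, new_scores) - roc_auc(y_true, base_scores)
    not_better = 0
    done = 0
    while done < n_boot:
        # Resample individuals with replacement, keeping both models' scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y_true[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # resample lacks cases or controls, so draw again
        gain = (roc_auc(yb, [new_scores[i] for i in idx])
                - roc_auc(yb, [base_scores[i] for i in idx]))
        not_better += gain <= 0.0
        done += 1
    return observed, not_better / n_boot


# Hypothetical labels and scores for eight individuals.
y = [0, 0, 1, 1, 0, 1, 0, 1]
baseline = [0.1, 0.4, 0.35, 0.8, 0.5, 0.45, 0.2, 0.6]
candidate = [0.1, 0.3, 0.55, 0.8, 0.4, 0.6, 0.2, 0.7]
gain, p_value = bootstrap_auc_gain(y, baseline, candidate)
print(gain, p_value)
```

The pairwise AUC above is quadratic in the cohort size, so a cohort of tens of thousands would call for a rank-based implementation. With 96 models tested, some p-values below 0.05 are expected by chance alone, which is worth keeping in mind when reading the count of significant models.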

Furthermore, the researchers examined which features drove the predictions. For the interaction and pretrained models that outperformed the baseline, the polygenic risk score (PRS) was the primary driver behind prediction.

Critical Analysis

The researchers acknowledge several limitations in their study. First, the UK Biobank dataset, while extensive, may not fully capture the diversity of global populations, potentially limiting the generalizability of the findings. Additionally, the interactions between genetic and environmental factors are known to be highly complex, and the models in this study may not fully capture the nuances of these relationships.

Another potential concern is the reliance on self-reported ancestry information, which can be subject to inaccuracies and biases. Incorporating more robust methods for determining genetic ancestry, such as those used in Applying BioBERT to Extract Germline Gene-Disease Associations or Integrating Heterogeneous Gene Expression Data Through Knowledge Graphs, could further strengthen the ancestry-specific insights.

Additionally, the paper does not address potential issues of fairness and bias in the models. As with any machine learning system, there is a risk of perpetuating or even amplifying existing health disparities if the models are not carefully designed and evaluated for their impact on different populations. Techniques like those demonstrated in GestaltMML: Enhancing Rare Genetic Disease Diagnosis Through Multi-Modal Machine Learning could be leveraged to ensure more equitable outcomes.

Overall, this study represents an important step towards more personalized and ancestry-aware disease prediction models. However, further research is needed to address the limitations and potential pitfalls, as highlighted above, to ensure the responsible development and deployment of these tools in real-world healthcare applications.

Conclusion

This paper evaluates interaction modeling (glinternet) and the pretrained lasso for disease prediction across diverse ancestries using multiomic data from the UK Biobank. Both techniques improved prediction accuracy, but only for a limited set of diseases and with moderate gains.

The key insights from this study underscore the importance of considering population-specific factors in disease prediction and prevention. Traditional one-size-fits-all models may overlook crucial genetic and environmental interactions that drive health disparities across different ancestral groups. The techniques demonstrated in this paper could pave the way for more equitable and effective healthcare solutions, tailored to the unique needs of diverse populations.

As the field of precision medicine continues to evolve, research like this will be crucial in addressing the challenges of health equity and ensuring that advancements in predictive modeling benefit people of all backgrounds equally. By further refining and validating these approaches, the medical community can work towards a future where disease risk prediction is truly personalized and inclusive.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

Thomas Le Menestrel, Erin Craig, Robert Tibshirani, Trevor Hastie, Manuel Rivas

Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS score was the primary driver behind prediction. Our findings indicate that both interaction terms and pre-training can enhance prediction accuracy but for a limited set of diseases and moderate improvements in accuracy.

5/8/2024

Interpreting artificial neural networks to detect genome-wide association signals for complex traits

Burak Yelmen, Maris Alver, Estonian Biobank Research Team, Flora Jay, Lili Milani

Investigating the genetic architecture of complex diseases is challenging due to the highly polygenic and interactive landscape of genetic and environmental factors. Although genome-wide association studies (GWAS) have identified thousands of variants for multiple complex phenotypes, conventional statistical approaches can be limited by simplified assumptions such as linearity and lack of epistasis models. In this work, we trained artificial neural networks for predicting complex traits using both simulated and real genotype/phenotype datasets. We extracted feature importance scores via different post hoc interpretability methods to identify potentially associated loci (PAL) for the target phenotype. Simulations we performed with various parameters demonstrated that associated loci can be detected with good precision using strict selection criteria, but downstream analyses are required for fine-mapping the exact variants due to linkage disequilibrium, similarly to conventional GWAS. By applying our approach to the schizophrenia cohort in the Estonian Biobank, we were able to detect multiple PAL related to this highly polygenic and heritable disorder. We also performed enrichment analyses with PAL in genic regions, which predominantly identified terms associated with brain morphology. With further improvements in model optimization and confidence measures, artificial neural networks can enhance the identification of genomic loci associated with complex diseases, providing a more comprehensive approach for GWAS and serving as initial screening tools for subsequent functional studies. Keywords: Deep learning, interpretability, genome-wide association studies, complex diseases

7/29/2024

U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks

Zhe Fei, Yi Li

Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.

7/23/2024

FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313

Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida

Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, leading to challenges such as the need for substantial computational resources to train and inefficiency for client-sided deployment. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-sided imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. Utilizing consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights by allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores when calculated using SNPs imputed from consumer gene panels, such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and light-weight alternative to reference based imputation.

7/15/2024