FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313

Read original: arXiv:2407.09355 - Published 7/15/2024 by Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida

Overview

This paper presents a novel genotype imputation pipeline that supports client-sided imputation models, enhancing patient privacy and accessibility.
The approach focuses on improving the accuracy of PRS313, a polygenic risk score for breast cancer risk prediction, when it is computed from consumer genetic panels like 23andMe.
The study demonstrates that simple linear regression can significantly enhance the accuracy of PRS313 scores calculated from imputed SNPs, providing a lightweight alternative to reference-based imputation methods.

Plain English Explanation

Genotype imputation is a technique for predicting missing genetic variants, known as single nucleotide polymorphisms (SNPs), in genetic data. Traditional methods rely on linkage disequilibrium (LD), the tendency of nearby SNPs to be inherited together, and assume that LD patterns in the target data resemble those in fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged as a promising alternative, allowing for imputation without the need for external databases, which enhances privacy and accessibility.

However, these deep learning models often require substantial computational resources to train and deploy, posing challenges for client-sided applications. This study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-sided models, which can be used across different genotyping chips and genomic regions.

As a case study, the researchers focus on improving the accuracy of a polygenic risk score (PRS313) for breast cancer risk prediction, which is based on 313 SNPs. By utilizing consumer genetic panels like 23andMe, the researchers demonstrate that a simple linear regression model can significantly enhance the accuracy of PRS313 scores calculated from imputed SNPs, compared to using the raw 23andMe data or simple imputation methods.
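A polygenic risk score is commonly computed as a weighted sum of allele dosages, so any missing SNP has to be filled in before the sum can be taken. The sketch below illustrates that calculation on made-up numbers. The function name `polygenic_score`, the weights, and the dosages are illustrative assumptions, not the published PRS313 effect sizes or the authors' code.

```python
import numpy as np

def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 copies of the effect allele)."""
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if dosages.shape[-1] != weights.shape[0]:
        raise ValueError("need exactly one weight per SNP")
    return dosages @ weights

# Hypothetical example: 2 individuals and 4 SNPs (PRS313 would use 313 SNPs).
weights = np.array([0.10, -0.05, 0.20, 0.03])
dosages = np.array([[0, 1, 2, 1],
                    [2, 0, 0, 1]])
scores = polygenic_score(dosages, weights)
print(scores)
```

Because the score is linear in the dosages, fractional imputed dosages (for example 1.4 copies) can be plugged in directly without rounding.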

These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression-based genotype imputation, providing a lightweight and accessible alternative to resource-intensive reference-based imputation methods.

Technical Explanation

The paper introduces a novel genotype imputation pipeline that supports client-sided imputation models, addressing the limitations of resource-intensive deep learning-based methods. The proposed approach aims to enhance the accuracy of PRS313, a polygenic risk score for breast cancer risk prediction, when it is computed from consumer genetic panels such as 23andMe.

The researchers demonstrate that a simple linear regression model can significantly improve the accuracy of PRS313 scores calculated from imputed SNPs, compared to using the raw 23andMe data or simple imputation methods like substituting missing SNPs with the minor allele frequency.
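To make the comparison concrete, here is a minimal sketch of regression-based imputation next to a frequency-based fill, run on synthetic data. Everything in it is an assumption made for illustration: the data generator (untyped SNPs that mostly copy a typed SNP, as a crude proxy for LD), the sample sizes, and the use of the training mean dosage (twice the allele frequency) as a stand-in for the simple substitution baseline. It is not the authors' pipeline or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 600 individuals, 20 typed SNPs, 5 untyped SNPs.
n, n_typed, n_untyped = 600, 20, 5
freqs = rng.uniform(0.2, 0.8, n_typed)
typed = rng.binomial(2, freqs, size=(n, n_typed))

# Each untyped SNP mostly copies one typed SNP (a crude proxy for LD),
# with about 10% of entries replaced by random dosages.
source = rng.integers(0, n_typed, n_untyped)
untyped = typed[:, source].copy()
noise = rng.random((n, n_untyped)) < 0.1
untyped[noise] = rng.binomial(2, 0.5, size=int(noise.sum()))

n_train = 400
train, test = slice(0, n_train), slice(n_train, n)

def add_intercept(x):
    """Prepend a column of ones so the model can learn an offset."""
    return np.column_stack([np.ones(len(x)), x])

# Ordinary least squares: one linear model per untyped SNP, fitted jointly.
coef, *_ = np.linalg.lstsq(
    add_intercept(typed[train]), untyped[train].astype(float), rcond=None
)
pred_lr = np.clip(add_intercept(typed[test]) @ coef, 0.0, 2.0)

# Frequency-based baseline: give everyone the training mean dosage.
pred_mean = np.tile(untyped[train].mean(axis=0), (n - n_train, 1))

mse_lr = np.mean((untyped[test] - pred_lr) ** 2)
mse_mean = np.mean((untyped[test] - pred_mean) ** 2)
print(f"linear regression MSE: {mse_lr:.3f}, mean-dosage MSE: {mse_mean:.3f}")
```

The fitted model is just a coefficient matrix with one column per untyped SNP, which is what makes this kind of approach plausible to ship to a browser or other edge device.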

Specifically, the linear regression model achieved an R^2 of 0.86, indicating a strong correlation between the imputed PRS313 scores and the reference values. This is a substantial improvement over the R^2 of 0.33 without imputation and 0.28 with simple imputation. Notably, simple imputation scored slightly lower than using no imputation at all.
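For readers who want to reproduce this kind of comparison, the sketch below computes R^2 between a set of reference scores and a set of estimated scores. This summary does not say which definition of R^2 the paper uses, so the coefficient of determination (1 minus residual over total sum of squares) is assumed here; the function name `r_squared` and the toy scores are hypothetical.

```python
import numpy as np

def r_squared(reference, estimate):
    """Coefficient of determination of `estimate` against `reference`."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    ss_res = np.sum((reference - estimate) ** 2)          # unexplained variation
    ss_tot = np.sum((reference - reference.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Toy scores: the estimates are close to, but not equal to, the references.
reference_scores = [1.0, 2.0, 3.0, 4.0]
imputed_scores = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(reference_scores, imputed_scores)
print(round(r2, 2))  # 0.98
```

Under this definition a poor estimator can produce a negative value, which is one way it differs from a squared Pearson correlation.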

These results suggest that popular SNP analysis libraries could benefit from integrating linear regression-based genotype imputation, offering a lightweight and accessible alternative to resource-intensive reference-based imputation methods. By performing imputation directly on edge devices, the proposed approach also enhances patient privacy by avoiding the need for external databases.

Critical Analysis

The paper presents a promising approach to improving the accuracy of polygenic risk scores using client-sided genotype imputation models. The use of a simple linear regression model as a baseline is a compelling and practical solution, addressing the limitations of resource-intensive deep learning-based methods.

However, the study is limited to a specific use case (PRS313 for breast cancer risk prediction) and a specific consumer genetic panel (23andMe). It would be valuable to evaluate the performance of the linear regression model across a wider range of polygenic risk scores and genotyping platforms to assess its broader applicability.

Additionally, while the paper mentions the potential for enhanced privacy by performing imputation on edge devices, it does not provide a detailed discussion of the privacy implications or compare the proposed approach to other privacy-preserving imputation methods. Further research is needed to thoroughly evaluate the privacy guarantees and potential trade-offs of the client-sided imputation model.

The paper also lacks a comprehensive evaluation of the computational efficiency and model size of the linear regression-based imputation compared to other imputation techniques or benchmark frameworks. This information would be crucial for assessing the practical viability and scalability of the proposed approach.

Overall, the study presents a promising baseline for client-sided genotype imputation, but further research is needed to explore its broader applicability, privacy implications, and computational efficiency compared to existing methods.

Conclusion

This study introduces a novel genotype imputation pipeline that supports client-sided models, addressing the limitations of resource-intensive deep learning-based methods. By focusing on improving the accuracy of a polygenic risk score (PRS313) for breast cancer risk prediction using consumer genetic panels, the researchers demonstrate that a simple linear regression model can significantly enhance the accuracy of the PRS313 scores calculated from imputed SNPs.

These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression-based genotype imputation, providing a lightweight and accessible alternative to reference-based imputation methods. Additionally, the client-sided approach enhances patient privacy by avoiding the need for external databases.

While the study is limited to a specific use case and genetic panel, the baseline framework presented in this work lays the foundation for further research into more generalized and privacy-preserving genotype imputation techniques. Exploring the performance of the linear regression model across a broader range of applications and comparing it to other imputation methods could lead to valuable insights for improving the accessibility and privacy of personalized genetic insights.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313

Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida

Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, leading to challenges such as the need for substantial computational resources to train and inefficiency for client-sided deployment. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-sided imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. Utilizing consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights by allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores when calculated using SNPs imputed from consumer gene panels, such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and light-weight alternative to reference based imputation.

7/15/2024

Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank

Thomas Le Menestrel, Erin Craig, Robert Tibshirani, Trevor Hastie, Manuel Rivas

Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals, underscoring a critical gap in genetic research. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data. We evaluate the performance of Group-LASSO INTERaction-NET (glinternet) and pretrained lasso in disease prediction focusing on diverse ancestries in the UK Biobank. Models were trained on data from White British and other ancestries and validated across a cohort of over 96,000 individuals for 8 diseases. Out of 96 models trained, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores (p-value < 0.05), found for diabetes, arthritis, gall stones, cystitis, asthma and osteoarthritis. For the interaction and pretrained models that outperformed the baseline, the PRS score was the primary driver behind prediction. Our findings indicate that both interaction terms and pre-training can enhance prediction accuracy, but only for a limited set of diseases and with moderate improvements in accuracy.

5/8/2024

U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks

Zhe Fei, Yi Li

Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.

7/23/2024

Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records

Wenrui Li, Xiaoyu Wang, Yuetian Sun, Snezana Milanovic, Mark Kon, Julio Enrique Castrillon-Candas

It has long been a recognized problem that many datasets contain significant levels of missing numerical data. A potentially critical predicate for application of machine learning methods to datasets involves addressing this problem. However, this is a challenging task. In this paper, we apply a recently developed multi-level stochastic optimization approach to the problem of imputation in massive medical records. The approach is based on computational applied mathematics techniques and is highly accurate. In particular, for the Best Linear Unbiased Predictor (BLUP) this multi-level formulation is exact, and is significantly faster and more numerically stable. This permits practical application of Kriging methods to data imputation problems for massive datasets. We test this approach on data from the National Inpatient Sample (NIS) data records, Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality. Numerical results show that the multi-level method significantly outperforms current approaches and is numerically robust. It has superior accuracy as compared with methods recommended in the recent report from HCUP. Benchmark tests show up to 75% reductions in error. Furthermore, the results are also superior to recent state of the art methods such as discriminative deep learning.

4/4/2024