Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry

Read original: arXiv:2402.11363 - Published 6/27/2024 by Shiva Ebrahimi, Xuan Guo

Overview

• This paper presents a Transformer-based de novo peptide sequencing method for data-independent acquisition (DIA) mass spectrometry.
• The research was supported by funding from the National Institutes of Health.
• The key idea is to use a Transformer model to predict peptide sequences directly from DIA mass spectrometry data, without the need for database matching.

Plain English Explanation

Mass spectrometry is a powerful analytical technique used to identify and quantify molecules in complex samples. One common application is proteomic analysis, where the goal is to determine the proteins present in a biological sample.

Traditional proteomic analysis relies on database searching - matching the observed mass spectrometry data to known protein sequences in a database. However, this approach has limitations, particularly for samples containing unknown or novel proteins.

The authors of this paper propose using a Transformer-based machine learning model to directly predict the amino acid sequences of peptides (short chains of amino acids that make up proteins) from the mass spectrometry data, without the need for database matching. This "de novo" sequencing approach can potentially identify novel peptides and proteins.

The key innovation is applying Transformer models, which have shown great success in natural language processing, to the problem of peptide sequencing from mass spectrometry data. By learning the complex patterns in the data, the Transformer model can accurately predict the amino acid sequence of the peptides.

Technical Explanation

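The authors developed a Transformer-based model for de novo peptide sequencing from DIA mass spectrometry data. In DIA, the instrument fragments all precursor ions that fall within each of a series of wide mass-to-charge (m/z) isolation windows, in contrast to traditional data-dependent acquisition (DDA), which selects specific precursor ions for fragmentation. As a result, each DIA MS/MS spectrum contains fragment ions from multiple precursor peptides, which is what makes de novo sequencing from DIA data difficult.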
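The multiplexing that DIA introduces can be illustrated with a small sketch. The window range (400 to 1000 m/z), the 25 m/z window width, the function name `assign_to_windows`, and the precursor values are all assumptions chosen for illustration; real instruments use a variety of window schemes, and this is not the authors' preprocessing code.

```python
from collections import defaultdict


def assign_to_windows(precursor_mz, start=400.0, stop=1000.0, width=25.0):
    """Group precursor m/z values by the fixed-width isolation window they fall in.

    The window layout is a hypothetical example, not a scheme taken from the paper.
    """
    windows = defaultdict(list)
    for mz in precursor_mz:
        if start <= mz < stop:
            index = int((mz - start) // width)
            lo = start + index * width
            windows[(lo, lo + width)].append(mz)
    return dict(windows)


# Made-up precursor m/z values; precursors sharing a window are fragmented together,
# so their fragment ions end up in the same MS/MS spectrum.
precursors = [402.3, 410.8, 455.1, 467.9, 468.2, 730.4]
for (lo, hi), members in sorted(assign_to_windows(precursors).items()):
    print(f"{lo:.0f}-{hi:.0f} m/z: {len(members)} co-fragmented precursor(s) {members}")
```

Any window holding more than one precursor yields a spectrum that mixes fragments from several peptides, which is the situation a DIA de novo sequencing model has to untangle.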

The Transformer model takes as input the DIA mass spectra and outputs the most likely amino acid sequence for each peptide. This is done in an end-to-end fashion, without the need for database matching or other intermediate steps.

The model architecture consists of an encoder that processes the mass spectra and a decoder that generates the amino acid sequence token-by-token. The authors trained the model on a large dataset of DIA mass spectra with known peptide sequences.
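Token-by-token generation can be sketched as a greedy decoding loop. Everything here is an assumption made for illustration: `score_next` stands in for a trained decoder, `toy_scorer` is a hypothetical stub that simply replays a known target, the alphabet is cut down to five residues, and the final precursor-mass check is a common sanity check in de novo sequencing rather than something this summary attributes to the model. The summary also does not say which decoding strategy the authors use.

```python
# Monoisotopic residue masses in Da for a few amino acids (the full alphabet has 20).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841}
WATER = 18.01056  # added once per peptide for the terminal H and OH
STOP = "<eos>"


def greedy_decode(score_next, precursor_mass, max_len=30, tol=0.02):
    """Build a peptide one residue at a time, always taking the top-scoring token.

    score_next(prefix) is a hypothetical decoder returning a dict of token -> score.
    Returns the peptide string and whether its mass matches the precursor mass.
    """
    prefix = []
    for _ in range(max_len):
        scores = score_next(prefix)
        token = max(scores, key=scores.get)
        if token == STOP:
            break
        prefix.append(token)
    mass = sum(RESIDUE_MASS[aa] for aa in prefix) + WATER
    return "".join(prefix), abs(mass - precursor_mass) <= tol


def toy_scorer(target):
    """Stand-in for a trained decoder: always favors the next residue of `target`."""
    def score_next(prefix):
        scores = {aa: 0.0 for aa in RESIDUE_MASS}
        scores[STOP] = 0.0
        nxt = target[len(prefix)] if len(prefix) < len(target) else STOP
        scores[nxt] = 1.0
        return scores
    return score_next


peptide, mass_ok = greedy_decode(toy_scorer("GASP"), precursor_mass=330.15392)
print(peptide, mass_ok)
```

In a real model the scores would come from the decoder attending to the encoded spectra, and beam search is a frequent alternative to the greedy choice shown here.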

According to the paper's abstract, the model (named DiaTrans, and also referred to there as Casanovo-DIA) improves on existing de novo sequencing methods, including DeepNovo-DIA and PepNet. It raises amino acid-level precision by 15.14% to 34.8% and recall by 11.62% to 31.94%, and peptide-level precision by 59% to 81.36%. The authors highlight the potential of this technology to enable more comprehensive proteomic analysis.
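De novo sequencing results are usually reported as precision and recall at the amino acid level and precision at the peptide level. The sketch below shows one simplified way to compute them. It compares residues position by position, whereas published benchmarks typically match amino acids by mass within a tolerance, so treat it as an approximation; the function names and the example peptides are hypothetical.

```python
def aa_matches(pred, true):
    """Count positions where predicted and reference residues agree (simplified)."""
    return sum(p == t for p, t in zip(pred, true))


def evaluate(pairs):
    """Compute simplified metrics from (predicted, reference) peptide pairs.

    An empty predicted string means the model made no prediction for that spectrum.
    """
    matched = sum(aa_matches(p, t) for p, t in pairs)
    n_pred_aa = sum(len(p) for p, _ in pairs)
    n_true_aa = sum(len(t) for _, t in pairs)
    n_pred_pep = sum(1 for p, _ in pairs if p)
    n_correct_pep = sum(1 for p, t in pairs if p and p == t)
    return {
        "aa_precision": matched / n_pred_aa if n_pred_aa else 0.0,
        "aa_recall": matched / n_true_aa if n_true_aa else 0.0,
        "peptide_precision": n_correct_pep / n_pred_pep if n_pred_pep else 0.0,
    }


# Made-up example: one exact match, one single-residue error, one missing prediction.
pairs = [("PEPTIDE", "PEPTIDE"), ("PEPSIDE", "PEPTIDE"), ("", "SAMPLER")]
print(evaluate(pairs))
```

Precision divides by what the model predicted and recall by what was actually there, so a model that skips difficult spectra can keep precision high while recall drops.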

Critical Analysis

The paper presents a promising approach to address the limitations of database-dependent proteomic analysis. Using a Transformer model, the authors show that direct de novo peptide sequencing from DIA data is feasible and can outperform existing de novo sequencing methods.

However, the authors acknowledge several caveats and areas for further research. The model performance is still not perfect, and there may be challenges in scaling the approach to handle the complexity of real-world biological samples. Additionally, the interpretability of the Transformer model's predictions is an area that requires further investigation.

While the results are encouraging, it will be important to continue evaluating the method on diverse datasets and real-world applications to fully assess its capabilities and limitations. Extending the approach to handle post-translational modifications and other complex peptide features would also be a valuable direction for future research.

Conclusion

This paper presents a novel Transformer-based approach for de novo peptide sequencing from DIA mass spectrometry data. By directly predicting amino acid sequences without the need for database matching, the method has the potential to enable more comprehensive proteomic analysis, including the identification of novel and modified peptides.

The results demonstrate the power of Transformer models in extracting meaningful patterns from complex mass spectrometry data, opening up new possibilities for advancing mass spectrometry-based proteomics. While further research is needed to address the remaining challenges, this work represents an important step forward in the field of computational proteomics.

Related Papers

Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry

Shiva Ebrahimi, Xuan Guo

Tandem mass spectrometry (MS/MS) stands as the predominant high-throughput technique for comprehensively analyzing protein content within biological samples. This methodology is a cornerstone driving the advancement of proteomics. In recent years, substantial strides have been made in Data-Independent Acquisition (DIA) strategies, facilitating impartial and non-targeted fragmentation of precursor ions. The DIA-generated MS/MS spectra present a formidable obstacle due to their inherent high multiplexing nature. Each spectrum encapsulates fragmented product ions originating from multiple precursor peptides. This intricacy poses a particularly acute challenge in de novo peptide/protein sequencing, where current methods are ill-equipped to address the multiplexing conundrum. In this paper, we introduce DiaTrans, a deep-learning model based on transformer architecture. It deciphers peptide sequences from DIA mass spectrometry data. Our results show significant improvements over existing state-of-the-art (SOTA) methods, including DeepNovo-DIA and PepNet. Casanovo-DIA enhances precision by 15.14% to 34.8%, recall by 11.62% to 31.94% at the amino acid level, and boosts precision by 59% to 81.36% at the peptide level. Integrating DIA data and our DiaTrans model holds considerable promise to uncover novel peptides and more comprehensive profiling of biological samples. Casanovo-DIA is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/DiaTrans.

6/27/2024

NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics

Jingbo Zhou, Shaorong Chen, Jun Xia, Sizhe Liu, Tianze Ling, Wenjie Du, Yue Liu, Jianwei Yin, Stan Z. Li

Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the high-throughput analysis of protein composition in biological tissues. Many deep learning methods have been developed for the de novo peptide sequencing task, i.e., predicting the peptide sequence for the observed mass spectrum. However, two key challenges seriously hinder the further advancement of this important task. Firstly, since there is no consensus for the evaluation datasets, the empirical results in different research papers are often not comparable, leading to unfair comparison. Secondly, the current methods are usually limited to amino acid-level or peptide-level precision and recall metrics. In this work, we present the first unified benchmark NovoBench for de novo peptide sequencing, which comprises diverse mass spectrum data, integrated models, and comprehensive evaluation metrics. Recent impressive methods, including DeepNovo, PointNovo, Casanovo, InstaNovo, AdaNovo and π-HelixNovo are integrated into our framework. In addition to amino acid-level and peptide-level precision and recall, we evaluate the models' performance in terms of identifying post-translational modifications (PTMs), efficiency and robustness to peptide length, noise peaks and missing fragment ratio, which are important influencing factors but are seldom considered. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark will be open-sourced to facilitate future research and application.

6/19/2024

Towards Less Biased Data-driven Scoring with Deep Learning-Based End-to-end Database Search in Tandem Mass Spectrometry

Yonghan Yu, Ming Li

Peptide identification in mass spectrometry-based proteomics is crucial for understanding protein function and dynamics. Traditional database search methods, though widely used, rely on heuristic scoring functions, and statistical estimations have to be introduced to achieve a higher identification rate. Here, we introduce DeepSearch, the first deep learning-based end-to-end database search method for tandem mass spectrometry. DeepSearch leverages a modified transformer-based encoder-decoder architecture under the contrastive learning framework. Unlike conventional methods that rely on ion-to-ion matching, DeepSearch adopts a data-driven approach to score peptide spectrum matches. DeepSearch is also the first deep learning-based method that can profile variable post-translational modifications in a zero-shot manner. We showed that DeepSearch's scoring scheme expressed less bias and did not require any statistical estimation. We validated DeepSearch's accuracy and robustness across various datasets, including those from species with diverse protein compositions and a modification-enriched dataset. DeepSearch sheds new light on database search methods in tandem mass spectrometry.

5/13/2024

Peptide Sequencing Via Protein Language Models

Thuong Le Hoai Pham, Jillur Rahman Saurav, Aisosa A. Omere, Calvin J. Heyl, Mohammad Sadegh Nasr, Cody Tyler Reynolds, Jai Prakash Yadav Veerla, Helen H Shang, Justyn Jaworski, Alison Ravenscraft, Joseph Anthony Buonomo, Jacob M. Luber

We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids. To date, protein sequencing relies on mass spectrometry, with some novel Edman degradation-based platforms able to sequence non-native peptides. Current protein sequencing techniques face limitations in accurately identifying all amino acids, hindering comprehensive proteome analysis. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify in protein sequences from the UniRef database. This targeted masking mimics real-world sequencing limitations. We then modify and finetune a ProtBert-derived transformer-based model for a new downstream task, predicting these masked residues, providing an approximation of the complete sequence. Evaluating on three bacterial Escherichia species, we achieve per-amino-acid accuracy up to 90.5% when only four amino acids ([KCYM]) are known. Structural assessment using AlphaFold and TM-score validates the biological relevance of our predictions. The model also demonstrates potential for evolutionary analysis through cross-species performance. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis, potentially accelerating advancements in proteomics and structural biology by providing a probabilistic reconstruction of the complete protein sequence from limited experimental data.

8/6/2024