A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

Read original: arXiv:2405.18749 - Published 6/4/2024 by Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Akihiro Imura

Overview

This paper presents AVIDa-SARS-CoV-2, a labeled dataset of interactions between SARS-CoV-2 spike proteins and camelid nanobodies (VHHs), and VHHCorpus-2M, a corpus of VHH sequences for pre-training antibody language models.
AVIDa-SARS-CoV-2 was obtained from two alpacas immunized with SARS-CoV-2 spike proteins and contains binary labels indicating whether diverse VHH sequences bind to 12 SARS-CoV-2 mutants, including the Delta and Omicron variants. VHHCorpus-2M contains over two million VHH sequences.
The authors use these resources to benchmark VHHBERT, a model pre-trained on VHHCorpus-2M, against existing general protein and antibody-specific language models on binding prediction.

Plain English Explanation

The paper describes two resources for building machine learning models that support antibody discovery. The first is a dataset recording which nanobodies (VHHs), taken from two alpacas immunized with the SARS-CoV-2 spike protein, bind to which versions of the virus. It covers 12 SARS-CoV-2 mutants, including Delta and Omicron, and labels each tested pair as binding or non-binding. Labeled data of this kind has been scarce, which has made it hard to test how well language models work for antibody discovery.

The second resource is a collection of over two million nanobody (VHH) sequences. Nanobodies are antibody fragments derived from the heavy-chain-only antibodies of camelids such as alpacas. The collection is meant for pre-training: a language model first learns general patterns of VHH sequences from this corpus and can then be evaluated on tasks such as predicting whether a nanobody binds to a given spike protein variant.

By making these resources publicly available, the authors aim to give researchers a common benchmark for antibody language models and, through it, to support AI-driven discovery of antibody therapeutics.

Technical Explanation

The paper presents two resources for developing and evaluating antibody language models:

AVIDa-SARS-CoV-2, a labeled antigen-VHH interaction dataset. The VHHs were obtained from two alpacas immunized with SARS-CoV-2 spike proteins, and binary labels indicate binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. The dataset addresses the scarcity of labeled data that has limited the evaluation of pre-trained language models for antibody discovery.
VHHCorpus-2M, a pre-training dataset for antibody language models containing over two million VHH sequences. The authors pre-train VHHBERT on this corpus and report benchmark results for predicting SARS-CoV-2-VHH binding, comparing it with existing general protein and antibody-specific pre-trained language models.
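
To make the idea of binary-labeled VHH-antigen pairs concrete, the following is a minimal sketch of how such records could be represented and summarized per antigen. The record layout, field names, and toy sequences are assumptions made for illustration; they are not the published schema of the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionRecord:
    """One labeled VHH-antigen pair (hypothetical field names)."""
    vhh_sequence: str  # amino-acid sequence of the VHH
    antigen: str       # name of the spike protein variant
    binds: bool        # binary label: True = binder, False = non-binder


def binder_fraction_by_antigen(records):
    """Return {antigen: fraction of its records labeled as binders}."""
    totals = defaultdict(int)
    binders = defaultdict(int)
    for rec in records:
        totals[rec.antigen] += 1
        binders[rec.antigen] += int(rec.binds)
    return {antigen: binders[antigen] / totals[antigen] for antigen in totals}


# Toy records with made-up sequences, for illustration only.
toy_records = [
    InteractionRecord("QVQLVESGGG", "Delta", True),
    InteractionRecord("EVQLVESGGA", "Delta", False),
    InteractionRecord("QVQLQESGGG", "Omicron", False),
    InteractionRecord("EVQLVESGGA", "Omicron", False),
]

fractions = binder_fraction_by_antigen(toy_records)
print(fractions)
```

A per-antigen summary like this is a quick way to check class balance before training a binding classifier, since a heavily skewed label distribution changes which evaluation metrics are informative.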

The datasets are publicly available at https://datasets.cognanous.com, and the authors present AVIDa-SARS-CoV-2 as a benchmark for evaluating how well antibody language models represent sequences for binding prediction.

Critical Analysis

The paper provides both labeled and unlabeled data for researchers building antibody language models. The interaction dataset offers a benchmark for binding prediction across 12 SARS-CoV-2 mutants, while the VHH sequence corpus can be used to pre-train language models on nanobody sequences.
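
Training a language model on a sequence corpus is often done with a masked-language-modeling objective, in which some residues are hidden and the model learns to recover them. The sketch below shows only the masking step, under the assumption of a BERT-style setup with one token per amino acid. It is a simplified illustration, not the authors' pre-training pipeline: the function name, the 15% masking rate, and the toy sequence are all assumptions, and the original BERT recipe additionally replaces some selected tokens with random or unchanged tokens.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_sequence(sequence, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one amino-acid sequence.

    labels[i] holds the original residue at masked positions and
    None at positions that were left unchanged.
    """
    rng = random.Random(seed)
    tokens = list(sequence)  # one token per residue
    labels = [None] * len(tokens)
    for i, residue in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = residue
            tokens[i] = MASK_TOKEN
    return tokens, labels


# Made-up sequence fragment, for illustration only.
toy_sequence = "QVQLVESGGGLVQAGGSLRLSCAAS"
masked, labels = mask_sequence(toy_sequence, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

During pre-training, the model's loss would be computed only at positions where the label is not None, so the model is rewarded for reconstructing hidden residues from their context.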

However, the interaction data come from only two alpacas and a single antigen family (SARS-CoV-2 spike proteins), and the labels are binary rather than quantitative measures of binding strength. Models trained or evaluated on the dataset may therefore not generalize to other antigens, and any biases in how sequences were sampled or labeled would carry over to conclusions drawn from the benchmark.

Furthermore, binding prediction is only one step in antibody discovery. The benchmarks measure how well models predict binding labels, not whether predicted binders would succeed as therapeutics. Additional research would be needed to assess the feasibility and effectiveness of using these resources for antibody design and development.

Overall, the paper provides a valuable contribution by making these important datasets publicly available, but more work may be needed to fully realize their potential impact on COVID-19 treatment development.

Conclusion

This paper presents AVIDa-SARS-CoV-2, a dataset of binary-labeled interactions between VHHs from two immunized alpacas and 12 SARS-CoV-2 mutants, and VHHCorpus-2M, a pre-training corpus of over two million VHH sequences. Together they allow antibody language models such as VHHBERT to be pre-trained on VHH sequences and then compared with existing protein and antibody language models on SARS-CoV-2-VHH binding prediction.

By making these datasets publicly available, the authors aim to accelerate AI-driven antibody discovery. While further work is needed before such models translate into therapies, the paper gives the field a shared benchmark for comparing antibody language models on binding prediction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Akihiro Imura

Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances and have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models using antibody sequences. However, the applicability of pre-trained language models for antibody discovery has not been thoroughly evaluated due to the scarcity of labeled datasets. To overcome these limitations, we introduce AVIDa-SARS-CoV-2, a dataset featuring the antigen-variable domain of heavy chain of heavy chain antibody (VHH) interactions obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset for antibody language models, containing over two million VHH sequences. We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT pre-trained on VHHCorpus-2M and existing general protein and antibody-specific pre-trained language models. These results confirm that AVIDa-SARS-CoV-2 provides valuable benchmarks for evaluating the representation capabilities of antibody language models for binding prediction, thereby facilitating the development of AI-driven antibody discovery. The datasets are available at https://datasets.cognanous.com.

6/4/2024

Improving Antibody Humanness Prediction using Patent Data

Talip Ucar, Aubin Ramon, Dino Oglic, Rebecca Croasdale-Wood, Tom Diethe, Pietro Sormanni

We investigate the potential of patent data for improving the antibody humanness prediction using a multi-stage, multi-loss training process. Humanness serves as a proxy for the immunogenic response to antibody therapeutics, one of the major causes of attrition in drug discovery and a challenging obstacle for their use in clinical settings. We pose the initial learning stage as a weakly-supervised contrastive-learning problem, where each antibody sequence is associated with possibly multiple identifiers of function and the objective is to learn an encoder that groups them according to their patented properties. We then freeze a part of the contrastive encoder and continue training it on the patent data using the cross-entropy loss to predict the humanness score of a given antibody sequence. We illustrate the utility of the patent data and our approach by performing inference on three different immunogenicity datasets, unseen during training. Our empirical results demonstrate that the learned model consistently outperforms the alternative baselines and establishes new state-of-the-art on five out of six inference tasks, irrespective of the used metric.

6/11/2024

Decoupled Sequence and Structure Generation for Realistic Antibody Design

Nayoung Kim, Minsu Kim, Sungsoo Ahn, Jinkyoo Park

Antibody design plays a pivotal role in advancing therapeutics. Although deep learning has made rapid progress in this field, existing methods jointly generate antibody sequences and structures, limiting task-specific optimization. In response, we propose an antibody sequence-structure decoupling (ASSD) framework, which separates sequence generation and structure prediction. Although our approach is simple, such a decoupling strategy has been overlooked in previous works. We also find that the widely used non-autoregressive generators promote sequences with overly repeating tokens. Such sequences are both out-of-distribution and prone to undesirable developability properties that can trigger harmful immune responses in patients. To resolve this, we introduce a composition-based objective that allows an efficient trade-off between high performance and low token repetition. Our results demonstrate that ASSD consistently outperforms existing antibody design models, while the composition-based objective successfully mitigates token repetition of non-autoregressive models. Our code is available at https://github.com/lkny123/ASSD_public.

5/28/2024

Constructing the CORD-19 Vaccine Dataset

Manisha Singh, Divy Sharma, Alonso Ma, Bridget Tyree, Margaret Mitchell

We introduce new dataset 'CORD-19-Vaccination' to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns) we processed the JSON file for each paper and then further enhanced using Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one used in the CORD-19 Kaggle challenge [Goldbloom et al., 2022]. For further evaluation, sequential sentence classification was performed on each paper's abstract using the model from Dernoncourt et al. [2016]. We partially hand annotated the training dataset and used a pre-trained BERT-PubMed layer. 'CORD-19-Vaccination' contains 30k research papers and can be immensely valuable for NLP research such as text mining, information extraction, and question answering, specific to the domain of COVID-19 vaccine research.

7/29/2024