'Bingo' for Predicting Essential Genes from Protein Data

‘Bingo’—a large language model- and graph neural network-based workflow for the prediction of essential genes from protein data

Have you ever wondered how scientists can predict which genes are essential for the survival of an organism? Essential genes are those that are required for the basic functions of life, and they can be potential targets for drug development or genetic engineering. However, identifying essential genes is not an easy task, as it requires costly and time-consuming experiments.

But what if there were a way to predict essential genes using only protein data? That is the goal of a study by Ma et al. (2024), published in the journal Briefings in Bioinformatics, which I recently read. The researchers developed a workflow called Bingo that combines two powerful techniques: large language models (LLMs) and graph neural networks (GNNs).

LLMs are deep learning models that learn from large amounts of sequential text-like data, such as scientific literature or protein sequences. GNNs are another type of deep learning model that learn from graph data, such as protein structures or interactions. By combining the two, Bingo can capture patterns in protein data that neither model would pick up on its own and use them to predict gene essentiality.
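To make the combination a little more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that every residue already has an embedding vector (random placeholders stand in for language-model outputs), that the protein graph comes from a list of residue contacts, and that one mean-aggregation step followed by mean pooling and a logistic score is enough for illustration. The function names, the contact list, and the weights are all hypothetical.

```python
import math
import random

def build_adjacency(n_residues, contacts):
    """Turn a list of (i, j) residue contacts into neighbour lists with self-loops."""
    neighbours = [{i} for i in range(n_residues)]
    for i, j in contacts:
        neighbours[i].add(j)
        neighbours[j].add(i)
    return [sorted(s) for s in neighbours]

def message_pass(features, neighbours):
    """One graph step: each residue becomes the mean of itself and its neighbours."""
    dim = len(features[0])
    updated = []
    for nbrs in neighbours:
        updated.append([sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)])
    return updated

def essentiality_score(features, neighbours, weights, bias=0.0):
    """Mean-pool residue vectors into one protein vector, then apply a logistic layer."""
    hidden = message_pass(features, neighbours)
    dim = len(weights)
    pooled = [sum(row[d] for row in hidden) / len(hidden) for d in range(dim)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy protein: 5 residues with 4-dimensional placeholder embeddings.
# Assumption: in a real workflow these would come from a protein language model.
rng = random.Random(0)
embeddings = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
contacts = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]  # hypothetical contact list
weights = [0.5, -0.25, 0.75, 0.1]                    # untrained, illustrative only

adjacency = build_adjacency(5, contacts)
score = essentiality_score(embeddings, adjacency, weights)
print(f"predicted essentiality probability: {score:.3f}")
```

In a trained model the weights would be learned from genes with known essentiality labels, and the graph layers would be richer than a simple neighbourhood mean. The sketch only shows how sequence-derived features and graph structure can feed a single prediction.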

The researchers tested Bingo on four different species: C. elegans, D. melanogaster, M. musculus, and H. sapiens (a HepG2 cell line). They found that Bingo achieved high predictive performance, outperforming existing methods and baselines. Moreover, Bingo was able to transfer its knowledge across species, meaning that a model trained on one species could predict essential genes for a new species without any training data from that species. This is especially useful for non-model organisms that lack high-quality genomic and proteomic datasets.
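The idea of cross-species transfer can be sketched as "fit on one species, score another". The example below is a toy illustration in plain Python, not the paper's protocol: it assumes each gene is described by a two-number feature vector, uses a simple logistic regression in place of the real model, and measures ranking quality with ROC AUC, which is my choice of metric here. All data values and the "source" and "target" species are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit logistic-regression weights with plain batch gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for d in range(dim):
                grad_w[d] += error * x[d] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict(features, weights, bias):
    """Return an essentiality probability for each feature vector."""
    return [sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) for x in features]

def roc_auc(scores, labels):
    """Fraction of (essential, non-essential) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical source species: labelled genes used for training (1 = essential).
source_x = [[2.0, 0.5], [1.5, 1.0], [1.8, 0.2], [2.2, 0.9],
            [0.2, 0.8], [0.5, 0.1], [0.1, 0.4], [0.4, 1.0]]
source_y = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical target species: its labels are used only to evaluate, never to train.
target_x = [[1.7, 0.3], [2.1, 0.7], [1.4, 0.6], [0.6, 0.5], [0.3, 0.9], [0.7, 0.2]]
target_y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(source_x, source_y)
auc = roc_auc(predict(target_x, weights, bias), target_y)
print(f"ROC AUC on the unseen species: {auc:.2f}")
```

The toy data are deliberately easy to separate. Real cross-species transfer only works to the extent that the features linked to essentiality are shared between the two organisms.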

Bingo can also offer biological insight into how the model reaches its decisions. By using the attention mechanism and a tool called GNNExplainer, the researchers were able to identify key functional sites and structural domains that are linked to gene essentiality. These include binding sites, catalytic sites, post-translational modifications, and DNA binding regions. These findings suggest that Bingo can not only predict essential genes, but also point to the protein features that may explain why they are essential.
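As a rough illustration of the attention side of this, the sketch below turns per-residue scores into normalized weights and lists the highest-weighted positions. It assumes one raw score per residue is already available. The sequence and the scores are invented, and `top_residues` is a hypothetical helper. GNNExplainer itself works differently (it learns a mask over the graph's edges and node features) and is not reproduced here.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1 (shifted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_residues(sequence, raw_scores, k=3):
    """Return the k highest-weighted residues as (1-based position, residue, weight)."""
    weights = softmax(raw_scores)
    ranked = sorted(range(len(sequence)), key=lambda i: weights[i], reverse=True)
    return [(i + 1, sequence[i], weights[i]) for i in ranked[:k]]

sequence = "MKTAYIAKQR"  # hypothetical peptide
raw_scores = [0.1, 2.3, 0.4, 0.2, 1.9, 0.3, 0.1, 3.0, 0.2, 0.5]  # made-up values

top = top_residues(sequence, raw_scores)
for position, residue, weight in top:
    print(f"residue {residue}{position}: attention weight {weight:.3f}")
```

Positions that stand out in this way could then be compared against annotated binding or catalytic sites, which is the kind of check that turns a high weight into a biological hypothesis.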

Bingo is a promising tool for the prediction of essential genes from protein data, and it could have many applications in biomedical research and biotechnology. By using Bingo, scientists could discover novel intervention candidates for diseases, parasites, or pests, or engineer new traits or functions in organisms. Bingo could also help to fill knowledge gaps for non-model organisms that are poorly characterized by experimental data.

If you are interested in learning more about Bingo, you can read the full paper here: https://doi.org/10.1093/bib/bbad472. You can also check out the code and data here: https://github.com/Bingo-LLM-GNN/Bingo.
