Proteins: The Workhorses of the Cell and Their Many Applications
Proteins play a vital role in many applications, including new materials and therapeutics. Made up of amino acid chains that fold into specific shapes, they are the workhorses that carry out most functions in the cell. Recently, the number of newly discovered protein sequences has grown rapidly thanks to advances in low-cost sequencing technology.
However, the functions of most of these novel protein sequences remain unknown, because functional annotation is expensive and time-consuming. Bridging this gap calls for accurate and efficient protein function annotation methods. Many data-driven approaches rely on learning representations of protein structures, since a protein's function is largely determined by how it folds. Such representations can be used for protein design, structure classification, model quality assessment, and function prediction.
One challenge is the limited number of published protein structures compared to datasets in other machine learning domains. Experimentally determining a protein structure is difficult and time-consuming: the Protein Data Bank contains only 182K experimentally confirmed structures, while Pfam holds 47M protein sequences and ImageNet provides 10M annotated images. To compensate, researchers have leveraged large volumes of unlabeled protein sequence data to learn representations of existing proteins.
Recent advances in deep learning-based protein structure prediction have made it possible to accurately predict the structures of many protein sequences. However, sequence-based methods do not directly exploit the structural information that ultimately determines protein function. To address this, structure-based protein encoders have been proposed that operate on the predicted or experimental 3D structures themselves.
The researchers have developed a protein encoder called the GeomEtry-Aware Relational Graph Neural Network (GearNet), which performs relational message passing on protein residue graphs. They have also introduced a sparse edge message passing technique that further improves the structure encoder: inspired by the triangle attention design in Evoformer, they implement edge-level message passing on GNNs for protein structure encoding.
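To make the idea of relational message passing on a residue graph concrete, here is a minimal NumPy sketch. It assumes a toy setup with two illustrative edge types (sequential neighbors along the chain and hypothetical spatial contacts) and one weight matrix per relation, in the spirit of relational GNNs; the names, dimensions, and edge lists are invented for illustration and are not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
num_residues, dim, num_relations = 5, 4, 2

# Node features: one vector per residue (random stand-ins here).
h = rng.normal(size=(num_residues, dim))

# edges[r] lists (src, dst) pairs for relation type r, e.g.
# relation 0 = sequential neighbor, relation 1 = spatial contact.
edges = {
    0: [(0, 1), (1, 2), (2, 3), (3, 4)],   # chain order
    1: [(0, 3), (1, 4)],                   # hypothetical contacts
}

# One learnable weight matrix per relation, R-GCN style.
W = {r: rng.normal(size=(dim, dim)) for r in range(num_relations)}

def relational_message_passing(h, edges, W):
    """One round: each residue sums relation-specific messages
    from its neighbors, followed by a ReLU nonlinearity."""
    out = np.zeros_like(h)
    for r, pairs in edges.items():
        for src, dst in pairs:
            out[dst] += h[src] @ W[r]
    return np.maximum(out, 0.0)

h_next = relational_message_passing(h, edges, W)
print(h_next.shape)  # new per-residue representations
```

The key design point is that each edge type gets its own transformation, so the model can treat a sequential neighbor differently from a residue that is merely close in 3D space.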
To further enhance the structure encoder, they have also devised a geometric pretraining approach based on the contrastive learning framework. It includes novel augmentation functions that discover biologically correlated protein substructures co-occurring in proteins. By benchmarking their methods on several downstream property prediction tasks, they establish a strong baseline for pretraining protein structure representations.
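The contrastive pretraining idea can be sketched as follows: two substructures cropped from the same protein form a positive pair, substructures from other proteins serve as negatives, and an InfoNCE-style loss pulls the positive pair together in embedding space. This is a minimal NumPy sketch; the crop augmentation and the mean-pool "encoder" are simplified stand-ins, not the paper's actual augmentation functions or model.

```python
import numpy as np

rng = np.random.default_rng(1)

def crop(features, lo, hi):
    """Augmentation sketch: take a contiguous residue window."""
    return features[lo:hi]

def encode(features):
    """Stand-in encoder: mean-pool residue features, L2-normalize."""
    v = features.mean(axis=0)
    return v / np.linalg.norm(v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: positive pair scores high relative to negatives."""
    logits = np.array([anchor @ positive] + [anchor @ n for n in negatives])
    logits /= temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive is at index 0

protein = rng.normal(size=(50, 8))              # 50 residues, 8-dim features
z1 = encode(crop(protein, 0, 30))               # two overlapping crops form
z2 = encode(crop(protein, 20, 50))              # a positive pair
negatives = [encode(rng.normal(size=(50, 8))) for _ in range(4)]

loss = info_nce(z1, z2, negatives)
print(float(loss))
```

Minimizing this loss over many proteins encourages the encoder to map related substructures to nearby embeddings without any function labels, which is what makes the pretraining self-supervised.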
Their model, GearNet, consistently outperforms existing protein encoders on supervised tasks such as Enzyme Commission number prediction, Gene Ontology term prediction, fold classification, and reaction classification. Even when trained on fewer samples, GearNet performs as well as or better than the most advanced sequence-based encoders pretrained on larger datasets.
The codebase for GearNet is publicly available on GitHub and is implemented in PyTorch and TorchDrug. For more information, check out the research paper and the GitHub link. Credit for this research goes to the researchers on the project.
About the Author:
Aneesh Tickoo is a consulting intern at MarktechPost, currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He is passionate about building solutions in image processing and enjoys collaborating on interesting projects.