ProtST: Enhancing Protein Sequence Pre-training and Understanding with Biomedical Texts

Large Language Models (LLMs) have driven significant advances in Artificial Intelligence (AI), and analogous models for proteins, known as Protein Language Models (PLMs), have the potential to enhance protein structure and function prediction. Proteins play a crucial role in biological growth, cell regeneration, and drug discovery. However, existing PLMs are trained on sequences alone and lack explicit information about protein function.

To address this gap, a team of researchers developed ProtST, a framework that improves the pre-training and comprehension of protein sequences using biomedical texts. They also built a dataset called ProtDescribe, which pairs protein sequences with text descriptions of their functions and properties. ProtST is designed to inject this textual supervision during pre-training while preserving the representation power of conventional PLMs in capturing co-evolutionary information.

ProtST pre-training consists of three tasks. The first, Unimodal Mask Prediction, masks certain regions of protein sequences so that the PLM retains its ability to represent co-evolutionary information. The second, Multimodal Representation Alignment, aligns protein sequences with their related text representations to capture the semantic relationship between sequences and their descriptions. The third, Multimodal Mask Prediction, models fine-grained dependencies between residues in protein sequences and words in the descriptions of protein properties.
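The second task, aligning the two modalities, is typically implemented with a symmetric contrastive (InfoNCE-style) objective: matched sequence-text pairs in a batch are pulled together while mismatched pairs are pushed apart. Below is a minimal sketch of such a loss in plain NumPy; the function name, the temperature value, and the use of cosine similarity are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def contrastive_alignment_loss(seq_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for sequence-text alignment (sketch).

    seq_emb, txt_emb: (batch, dim) arrays where row i of each is a matched
    protein/description pair; all other rows serve as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities
    seq = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = seq @ txt.T / temperature        # (batch, batch) similarity matrix
    idx = np.arange(len(logits))              # diagonal entries are positives

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()        # negative log-prob of the match

    # Average the sequence-to-text and text-to-sequence directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With matched pairs on the diagonal, the loss is small when each sequence embedding is closest to its own description and grows as pairs are shuffled.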

In evaluations, ProtST outperformed previous models on representation learning benchmarks. It also performed well on zero-shot protein classification and enables the retrieval of functional proteins from a database without any function annotation.
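Zero-shot classification in this setting works by embedding each candidate function description with the text encoder and scoring a protein's sequence embedding against them, so no annotated training examples are required. The sketch below illustrates the idea with cosine similarity; the function and the embeddings are hypothetical stand-ins for the model's actual encoders.

```python
import numpy as np

def zero_shot_classify(protein_emb, label_text_embs):
    """Pick the candidate function description whose text embedding is
    most similar to the protein's sequence embedding (sketch).

    protein_emb: (dim,) sequence embedding from a hypothetical encoder.
    label_text_embs: (n_labels, dim) embeddings of label descriptions.
    Returns (best_label_index, per-label cosine-similarity scores).
    """
    p = protein_emb / np.linalg.norm(protein_emb)
    t = label_text_embs / np.linalg.norm(label_text_embs, axis=1, keepdims=True)
    scores = t @ p                      # cosine similarity to each description
    return int(np.argmax(scores)), scores
```

The same scoring rule supports retrieval: rank every protein in a database against one text query and return the top matches.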

Overall, ProtST shows promise in enhancing protein sequence pre-training and understanding using biomedical texts. This framework could be a valuable addition to the field of AI. For more information, you can check out the Paper and Github link.
