New AI Tool Classifies Effects of 71 Million ‘Missense’ Mutations
Uncovering the root causes of disease is one of the biggest challenges in human genetics. Millions of mutations are possible, but there is limited data on which ones actually cause disease. Knowing which ones do is important for faster diagnosis and for developing life-saving treatments. To help researchers learn more about the effects of these mutations, we have created a catalogue of ‘missense’ mutations. These mutations can affect the function of human proteins and, in some cases, lead to diseases such as cystic fibrosis, sickle-cell anemia, or cancer.
We developed this catalogue using AlphaMissense, our new AI tool. In a recent study published in Science, we show that AlphaMissense classified 89% of the 71 million possible missense variants as either likely pathogenic or likely benign. By contrast, only 0.1% have been confirmed by human experts.
AI tools like AlphaMissense have the power to accelerate research in fields ranging from molecular biology to clinical and statistical genetics. Traditional experiments to find disease-causing mutations can be expensive and time-consuming. With AI predictions, researchers can get a preview of results for thousands of proteins at once, saving time and resources.
We have made all our predictions freely available to the research community and open sourced the model code for AlphaMissense. The model predicted the pathogenicity of all 71 million possible missense variants and classified 89% of them: 57% as likely benign and 32% as likely pathogenic.
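As a rough illustration of scale, the percentages quoted in this article can be turned into approximate variant counts. This is a back-of-the-envelope sketch: the published percentages are rounded, so the counts are approximate, and the variable names are our own.

```python
# Approximate counts implied by the rounded percentages in the article.
TOTAL_VARIANTS = 71_000_000  # all possible missense variants

likely_benign = TOTAL_VARIANTS * 57 // 100
likely_pathogenic = TOTAL_VARIANTS * 32 // 100
classified = likely_benign + likely_pathogenic
unclassified = TOTAL_VARIANTS - classified  # the remaining 11%

# 0.1% of all possible variants have been confirmed by human experts.
expert_confirmed = TOTAL_VARIANTS // 1000

print(f"likely benign:     {likely_benign:,}")      # 40,470,000
print(f"likely pathogenic: {likely_pathogenic:,}")  # 22,720,000
print(f"unclassified:      {unclassified:,}")       # 7,810,000
print(f"expert confirmed:  {expert_confirmed:,}")   # 71,000
```

In other words, roughly 63 million variants receive a confident label, while about 8 million remain uncertain.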
So, what exactly is a missense variant? It is a single-letter substitution in DNA that changes an amino acid within a protein. Think of DNA as a language: changing one letter can change a word and alter the meaning of a whole sentence. Similarly, a missense variant changes which amino acid is translated, potentially affecting protein function. On average, a person carries over 9,000 missense variants. Most are harmless, but some can disrupt protein function and cause disease.
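The single-letter idea can be made concrete with a small Python sketch. The codon table here is deliberately tiny (only the codons needed for the example), and the function name `classify_substitution` is a hypothetical helper, not part of AlphaMissense. The example uses the well-known sickle-cell substitution in the beta-globin gene, where the codon GAG (glutamic acid) becomes GTG (valine).

```python
# Tiny excerpt of the standard genetic code: only the codons used below.
CODON_TABLE = {
    "GAG": "Glu",   # glutamic acid
    "GAA": "Glu",   # glutamic acid (same amino acid, different codon)
    "GTG": "Val",   # valine
    "TAG": "Stop",  # stop codon
}

def classify_substitution(codon, position, new_base):
    """Apply a single-base substitution to a codon and label its effect."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    before = CODON_TABLE[codon]
    after = CODON_TABLE[mutant]
    if after == before:
        effect = "synonymous"  # amino acid unchanged
    elif after == "Stop":
        effect = "nonsense"    # protein is cut short
    else:
        effect = "missense"    # one amino acid swapped for another
    return mutant, before, after, effect

# Sickle-cell substitution: the middle letter A becomes T.
print(classify_substitution("GAG", 1, "T"))  # ('GTG', 'Glu', 'Val', 'missense')
# Changing the last letter leaves the amino acid unchanged.
print(classify_substitution("GAG", 2, "A"))  # ('GAA', 'Glu', 'Glu', 'synonymous')
```

Only the first case is a missense variant. The second shows that not every single-letter change alters the protein.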
Classifying missense variants is crucial for understanding which protein changes can lead to disease. So far, experts have classified only 2% of the more than 4 million known missense variants as pathogenic or benign. The rest are considered ‘variants of unknown significance’ due to a lack of data. AlphaMissense changes that by classifying 89% of all possible variants, using a score threshold chosen to give 90% precision on a database of known disease variants.
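To show what picking a threshold for a target precision can mean in practice, here is a minimal Python sketch. It is not the procedure used in the study: the scores and labels are made-up toy data, the function names are hypothetical, and only the pathogenic side of the classification is covered.

```python
def precision_at(threshold, scores, labels):
    """Precision of calling 'pathogenic' for every score >= threshold."""
    called = [label for s, label in zip(scores, labels) if s >= threshold]
    if not called:
        return None  # nothing called, so precision is undefined
    return sum(called) / len(called)

def lowest_threshold_for_precision(scores, labels, target=0.9):
    """Smallest observed score whose cutoff reaches the target precision."""
    for t in sorted(set(scores)):
        p = precision_at(t, scores, labels)
        if p is not None and p >= target:
            return t
    return None

# Toy data (invented for illustration): label 1 = known pathogenic, 0 = known benign.
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

threshold = lowest_threshold_for_precision(toy_scores, toy_labels, target=0.9)
print(threshold)  # 0.6
```

A lower cutoff would label more variants but let in more false positives. A higher one leaves more variants unclassified, which is the trade-off behind the 89% figure.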
AlphaMissense is based on our previous model, AlphaFold, which predicted protein structures from amino acid sequences. We trained AlphaMissense by fine-tuning AlphaFold on labels distinguishing variants seen in human and closely related primate populations: variants commonly seen are treated as benign, and variants never seen are treated as pathogenic. The model uses databases of related protein sequences and the structural context of a variant to predict the likelihood that the variant is pathogenic.
Comparisons with other computational methods show that AlphaMissense outperforms them at predicting variant effects. We are excited to see how AlphaMissense can contribute to the understanding of proteins and genetics. To make the predictions easier for researchers to use, we have partnered with EMBL-EBI.
Our hope is that AlphaMissense, along with other tools, will help researchers better understand diseases and develop life-saving treatments.