In a crucial step for the future of medicine, biotechnology and artificial intelligence, Google has released a database (available for a few years, but now making an immense quantitative leap) that records the structures of almost all the proteins that exist on the face of the Earth: 200 million, corresponding to 1 million species. All of them are based on the predictions of an algorithm designed by the company. Here is why this is so relevant to science and what it could be used for.
The news comes from Alphabet, Google's parent company, and more specifically from the division of the company dedicated to artificial intelligence: DeepMind, the same one that once designed AlphaZero, the algorithm that has been beating the world champions of the Chinese game Go for years, and that is now in the news with an obsessive protein modeling program: AlphaFold.
Any list is unfair: there are proteins in that glass of fresh milk, in a malignant tumor, in the endorphins that surge with excitement or pleasure, in a butterflied bife de chorizo steak, in Sergio Massa, in the bacteria that cause diarrhoea, in the new and contagious Omicron variant and also in the vaccine designed to combat it.
Proteins are everything. They stand behind the genes (the cellular fingerprint) as the “building” that gives them material form. There is a reason why scientists often say that genes “are translated into” or “code for” proteins (how can we forget the enigmatic “of course… it’s the part of the genome that codes for Spike”?).
The scientific importance of proteins is unquestionable. The problem is that it is not possible to make progress on them without knowing the function each one performs. And understanding function depends on understanding protein structure.
The scope of an algorithm capable of predicting the shape of each protein, a kind of origami made up of chains of amino acids, is enormous.
Protein even in the soup
Whether the aim is to elucidate the nutritional action of the proteins in a barbecue, to understand the therapeutic effect of a drug against an altered brain protein that produces Alzheimer’s in a patient, or to understand the changes in the star of SARS-CoV-2, the famous Spike protein, it is essential for scientists to reconstruct the three-dimensional structure that gives existence to these molecules.
To develop the AlphaFold algorithm, Google relied on 21 genomes from different species, information provided by an institution with which it had to form an alliance, the European Bioinformatics Institute (EMBL-EBI).
An illustration of the amino acid chain that makes up a protein. Photo: Shutterstock
All of the above comes from the patient explanations of two experts on these issues. On the one hand, Javier Santos, PhD in Biological Sciences, principal researcher at Conicet and adjunct professor in the Department of Biological Chemistry at Exactas (the UBA's Faculty of Exact Sciences), who works on the challenge of “trying to establish relationships between conformation, dynamics, stability and biological function” in the proteins involved in a neurological disease called Friedreich’s Ataxia.
On the other, from Switzerland, Luciano Abriata, biotechnologist and PhD in Chemistry from the University of Rosario, who works in a biomolecular modeling laboratory and on the study of protein structures and production in a section of the Swiss Federal Institute of Technology, in Lausanne. “I work in virtual reality, doing modeling to understand how life works, but at the atomic level,” he summarized.
The amino acids behind proteins
To understand the origami-type folding that defines proteins, Santos clarified that “they are made up of chains that combine, in different ways, 20 amino acids. Some are small, with 50 to 100 amino acids, and others are huge, with 1,000 to 1,500 or even more.”
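To give a sense of the scale Santos describes, here is a minimal sketch in Python. It assumes the standard one-letter codes for the 20 amino acids; the helper names and the example fragment are hypothetical and chosen only for illustration.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_protein(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only the 20 standard letters."""
    return len(sequence) > 0 and all(residue in AMINO_ACIDS for residue in sequence)


def possible_sequences(length: int) -> int:
    """Count the distinct chains of a given length: 20 choices at each position."""
    return len(AMINO_ACIDS) ** length


example = "MKTAYIAKQR"  # hypothetical 10-residue fragment, for illustration only
print(is_valid_protein(example))
print(possible_sequences(len(example)))
# Number of decimal digits in the count for a small, 100-residue protein.
print(len(str(possible_sequences(100))))
```

Even for a small protein of 100 amino acids, the number of possible chains has more than a hundred digits, which is part of why predicting a shape from a sequence is such a hard problem.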
“These chains fold (Editor's note: hence the name AlphaFold) and adopt specific shapes that depend on how the amino acid sequence is presented in each case … The sequences are a consequence of evolution over thousands and thousands of years,” he said.
The achievement of AlphaFold is that the algorithm they developed “makes it possible to predict, extremely precisely, the specific structure from those amino acid sequences.”
So precise, explained Abriata, that “AlphaFold not only stands out as a great predictor of structures, but also adds a metric that indicates how good the prediction is. And when there is not so much certainty, they flag it”.
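The metric Abriata describes is a per-residue confidence score known as pLDDT, on a scale from 0 to 100, which AlphaFold model files store in the B-factor column of each atom record. The sketch below, in Python, reads that column from a few PDB-format lines and flags the uncertain residues. The sample records, the helper names and the cut-off of 70 are illustrative assumptions; the band limits follow the ones commonly shown alongside AlphaFold models.

```python
def make_atom_line(serial, res_name, res_seq, plddt):
    """Build a minimal CA ATOM record in fixed-width PDB layout (made-up data)."""
    return (
        f"ATOM  {serial:>5}  CA  {res_name:>3} A{res_seq:>4}    "
        f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{plddt:>6.2f}"
    )


def read_ca_plddt(pdb_lines):
    """Return {residue number: pLDDT}, read from the CA atom of each residue."""
    scores = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            # Columns 23-26 hold the residue number, columns 61-66 the B-factor.
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Map a pLDDT score to a confidence label."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


# Hypothetical four-residue model, for illustration only.
sample = [
    make_atom_line(1, "MET", 1, 96.50),
    make_atom_line(2, "LYS", 2, 82.10),
    make_atom_line(3, "THR", 3, 61.30),
    make_atom_line(4, "ALA", 4, 38.75),
]

scores = read_ca_plddt(sample)
for residue, score in scores.items():
    print(residue, score, confidence_band(score))

# Residues below the (assumed) cut-off of 70 are the ones to treat with caution.
flagged = [residue for residue, score in scores.items() if score < 70]
print(flagged)
```

Reading the score from a single atom per residue is a simplification that works here because the confidence value is reported per residue rather than per atom.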
According to an article in the journal Nature, about 35% of the more than 214 million predictions offered by the database “are considered highly accurate, which means that they are as good as experimentally determined structures. Another 45% are considered accurate enough for many applications.”
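In absolute numbers, those percentages work out roughly as follows. This is a back-of-the-envelope sketch in Python that takes 214 million as the total (the article says “more than”), so the results are approximate lower bounds; the variable and function names are illustrative.

```python
TOTAL_PREDICTIONS = 214_000_000  # "more than 214 million", so a lower bound


def share(total, percent):
    """Return the given whole-number percentage of a total, using integer arithmetic."""
    return total * percent // 100


highly_accurate = share(TOTAL_PREDICTIONS, 35)
accurate_enough = share(TOTAL_PREDICTIONS, 45)
remainder = TOTAL_PREDICTIONS - highly_accurate - accurate_enough

print(f"{highly_accurate:,} highly accurate")
print(f"{accurate_enough:,} accurate enough for many applications")
print(f"{remainder:,} in neither group")
```

That leaves about one in five predictions outside both groups.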
The protein prediction game
A valid question is how and why Google got into all this. We will answer only how.
For more than 25 years there has been a competition called CASP (Critical Assessment of Techniques for Protein Structure Prediction), in which the competitors develop protein structure prediction models and a group of evaluators (who hold the experimental information, not yet publicly released) judge who got it right.
The image, from 2017, shows one of the Chinese Go champions trying to beat Google’s algorithm. AFP Photo
“AlphaFold won in 2018”, said Abriata, who, in fact, was one of the evaluators of that contest.
If a competition like this has existed for a quarter of a century, it is because predicting the 3D shape of proteins is a long-standing problem in biology.
Abriata summed it up: “Having computational models to predict structures of this type saves a lot of money and time in experiments. For some proteins, it took ten to twenty years to figure out the structure!”
“The shape of proteins defines their function, a central issue if you are developing a drug to treat a clinical problem, or pursuing many biotechnological developments. Even energy-related ones,” he added.
The enthusiasm is remarkable. Not for nothing did Abriata exclaim several times that “it’s a great goal”.
How Google got involved with proteins
Let’s see why the announcement of this news was described as a “quantitative leap”.
These days the 15th CASP competition is taking place, but it all started in the 13th, when in 2018 “DeepMind, with a whole new technology, entered with AlphaFold I and won”, Abriata recalled.
However, the milestone came at the end of 2020, in the 14th edition: “They brought a new AlphaFold (the second) made from scratch, with a new computational model, and they crushed it. They cracked the problem”.
At that point, the algorithm had managed to model some 350,000 proteins. In 2021 a couple of papers came out in Nature explaining the method used by AlphaFold, along with the news that they had managed to model the structure of a couple of million proteins.
But now the algorithm has predicted no fewer than 200 million, and the database has been opened to the public.
Querying those structures takes seconds. But, of course, the database carries some weight: no less than 23 terabytes.
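To put that figure in perspective, a quick calculation in Python gives the average size per structure. It assumes decimal units (1 terabyte = 10^12 bytes) and uses the rounded figures quoted here, so it is only an order-of-magnitude estimate.

```python
DATABASE_BYTES = 23 * 10**12  # 23 terabytes, assuming decimal units
STRUCTURES = 200_000_000      # rounded number of predicted structures

average_bytes = DATABASE_BYTES // STRUCTURES
print(average_bytes)         # average bytes per structure
print(average_bytes / 1000)  # the same figure in kilobytes
```

The average hides a wide spread, since proteins range from a few dozen amino acids to well over a thousand.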
Where the algorithm falls short
Two more facts. The first is that a number of proteins exist in an “unpredictable” way, as if their structure were “fluid” or “fluctuating”.
According to Santos, “the system makes it possible to predict folded proteins, that is, ‘structured’ proteins, extremely precisely, but there are others that are intrinsically disordered, with much more mobility and a highly fluctuating number of conformations that interconvert. That’s where dynamics plays a key role.”
The second is that AlphaFold is publicly accessible.
Abriata acknowledged that “many researchers have been concerned about these advances in machine learning and artificial intelligence, especially if they are in private hands”.
Go player Lee Se-Dol and DeepMind boss Demis Hassabis after their last duel against AlphaGo.
However, since every algorithm needs to learn from somewhere, AlphaFold was built on top of a previous database, which limits the company’s possibilities.
In this case, these data were produced “the hard way”, that is, painstakingly generated in laboratories around the world and altruistically gathered in the Protein Data Bank (PDB), “which already has 50 years of investment in research”.
Given that Google's algorithm was built on that resource, and given that the data deposited there was produced with taxpayer money, there is a legal obligation to make public everything generated from that prior information.
Is this the end of experimental biology? Abriata did not hesitate: “Not at all! But the dream is… perhaps it will never be achieved. It is to stop experimenting and model everything.”