In recent years, models that can predict the structure or function of proteins have been widely used for a variety of biological applications, such as identifying drug targets and designing new therapeutic antibodies.
These models, which are based on large language models (LLMs), can make highly accurate predictions of a protein’s suitability for a given application. However, there is no way to determine how these models make their predictions or which protein features play the most important role in those decisions.
In a new study, MIT researchers used a novel technique to open up that “black box” and determine what features a protein language model takes into account when making predictions. Understanding what is happening inside the black box could help researchers choose better models for a particular task, helping to streamline the process of identifying new drugs or vaccine targets.
“Our work has broad implications for enhanced explainability in downstream tasks that rely on these representations. Additionally, identifying the features that protein language models track has the potential to reveal novel biological insights from these representations.”
Bonnie Berger, senior author of the study and the Simons Professor of Mathematics at the Massachusetts Institute of Technology
Berger is also the head of the Computation and Biology group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
Onkar Gujral, an MIT graduate student, is the lead author of the study, which appears this week in the Proceedings of the National Academy of Sciences. Mihir Bafna, an MIT graduate student, and Eric Alm, a professor of biological engineering at MIT, are also authors of the paper.
Opening the black box
In 2018, Berger and former MIT graduate student Tristan Bepler PhD ’20 introduced the first protein language model. Their model, like subsequent protein models that accelerated the development of AlphaFold, such as ESM2 and OmegaFold, was based on LLMs. These models, which include ChatGPT, can analyze huge amounts of text and figure out which words are most likely to appear together.
Protein language models use a similar approach, but instead of analyzing words, they analyze amino acid sequences. Researchers have used these models to predict the structure and function of proteins, and for applications such as identifying proteins that might bind to particular drugs.
In a 2021 study, Berger and her colleagues used a protein language model to predict which sections of viral surface proteins are less likely to mutate in a way that enables viral escape. This allowed them to identify possible targets for vaccines against influenza, HIV, and SARS-CoV-2.
However, in all these studies, it was impossible to know how the models made their predictions.
“We would make predictions at the end, but we had absolutely no idea what was going on in the individual components of this black box,” says Berger.
In the new study, the researchers wanted to dig into how protein language models make their predictions. Like LLMs, protein language models encode information as representations that consist of a pattern of activation of different “nodes” within a neural network. These nodes are analogous to the networks of neurons that store memories and other information within the brain.
The inner workings of LLMs are not easy to interpret, but within the past couple of years, researchers have begun using a type of algorithm known as a sparse autoencoder to help shed some light on how those models make their predictions. The new study from Berger’s lab is the first to use this algorithm on protein language models.
Sparse autoencoders work by adjusting how a protein is represented within a neural network. Typically, a given protein will be represented by a pattern of activation of a constrained number of neurons, for example, 480. A sparse autoencoder will expand that representation into a much larger number of nodes, say 20,000.
When information about a protein is encoded by only 480 neurons, each node lights up for multiple features, making it very difficult to know what features each node is encoding. However, when the neural network is expanded to 20,000 nodes, this extra space, along with a sparsity constraint, gives the information room to “spread out.” Now, a feature of the protein that was previously encoded by multiple nodes can occupy a single node.
“In a sparse representation, the neurons lighting up are doing so in a more meaningful manner,” explains Gujral. “Before the sparse representations are created, the networks pack information so tightly together that it’s hard to interpret the neurons.”
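To make that idea concrete, the sketch below implements a sparse autoencoder in Python with PyTorch, expanding a 480-dimensional embedding into 20,000 latent features. The architecture, the L1 sparsity penalty, and all hyperparameter values here are illustrative assumptions, not the study’s actual implementation.

```python
# A minimal sparse autoencoder sketch (illustrative, not the study's code).
# It expands a 480-dimensional protein embedding into 20,000 latent features
# and uses an L1 penalty so that only a few features activate per protein.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, embed_dim: int = 480, latent_dim: int = 20_000):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, embed_dim)

    def forward(self, x: torch.Tensor):
        # ReLU keeps latent activations non-negative, which lets the
        # sparsity penalty push most of them toward exactly zero.
        latent = torch.relu(self.encoder(x))
        reconstruction = self.decoder(latent)
        return reconstruction, latent

model = SparseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
l1_weight = 1e-3  # strength of the sparsity penalty (illustrative value)

# Random stand-ins for protein-language-model embeddings.
embeddings = torch.randn(64, 480)

optimizer.zero_grad()
reconstruction, latent = model(embeddings)
loss = nn.functional.mse_loss(reconstruction, embeddings) \
       + l1_weight * latent.abs().mean()
loss.backward()
optimizer.step()
```

After training, the activations in `latent` serve as the sparse representation of each protein, which can then be matched against known protein annotations.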
Interpretable models
Once the researchers obtained sparse representations of many proteins, they used an AI assistant called Claude (related to the popular Anthropic chatbot of the same name) to analyze the representations. In this case, they asked Claude to compare the sparse representations with the known features of each protein, such as molecular function, protein family, or location within a cell.
By analyzing thousands of representations, Claude can determine which nodes correspond to specific protein features, then describe them in plain English. For example, the algorithm might say, “This neuron appears to be detecting proteins involved in transmembrane transport of ions or amino acids, particularly those located in the plasma membrane.”
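That labeling step might look something like the following sketch, which uses Anthropic’s Python SDK; the prompt, the model name, the `describe_neuron` helper, and the annotation format are all hypothetical, since the study’s actual pipeline is not reproduced here.

```python
# Hypothetical sketch of asking Claude to label a latent neuron
# (the study's actual prompts and pipeline are not shown here).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def describe_neuron(neuron_id: int, top_proteins: list[dict]) -> str:
    """Ask Claude what a neuron encodes, given the annotated proteins
    that activate it most strongly."""
    protein_lines = "\n".join(
        f"- {p['name']}: function={p['function']}, "
        f"family={p['family']}, location={p['location']}"
        for p in top_proteins
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model choice
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                f"Neuron {neuron_id} activates most strongly on these proteins:\n"
                f"{protein_lines}\n"
                "In one sentence, what shared feature might this neuron detect?"
            ),
        }],
    )
    return message.content[0].text
```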
This process makes the nodes much more “interpretable,” meaning the researchers can tell what each node is encoding. They found that the features most likely to be encoded by these nodes were protein family and certain functions, including several different metabolic and biosynthetic processes.
“When you train a sparse autoencoder, you aren’t training it to be interpretable, but it turns out that by incentivizing the representation to be really sparse, that ends up resulting in interpretability,” explains Gujral.
Understanding what features a particular protein model is encoding could help researchers choose the right model for a particular task, or tweak the type of input they give the model, to generate the best results. Additionally, analyzing the features that a model encodes could one day help biologists learn more about the proteins they are studying.
“At some point, when the models get a lot more powerful, you could learn more biology than you already know, from opening up the models,” explains Gujral.