
A new computational technique could make it easier to engineer useful proteins

To engineer proteins with useful functions, researchers typically start with a natural protein that has a desirable function, such as emitting fluorescent light, and subject it to many rounds of random mutation, eventually producing an optimized version of the protein.

This process has produced optimized versions of many important proteins, including green fluorescent protein (GFP). For other proteins, however, creating an optimized version has proven difficult. MIT researchers have now developed a computational approach that makes it easier to predict mutations that lead to better proteins, based on a relatively small amount of data.

Using this model, the researchers generated proteins with mutations predicted to lead to improved versions of GFP and of a protein from adeno-associated virus (AAV), which is used to deliver DNA for gene therapy. They hope it could also be used to develop additional tools for neuroscience research and medical applications.

“Protein design is a hard problem because the mapping from DNA sequence to protein structure and function is very complex. There might be a great protein 10 changes away in the sequence, but each intermediate change might correspond to a totally nonfunctional protein. It's like trying to find your way to the river basin in a mountain range when craggy peaks along the way block your view. The current work tries to make the riverbed easier to find,” says Ila Fiete, a professor of brain and cognitive sciences at MIT, a member of MIT’s McGovern Institute for Brain Research, director of the K. Lisa Yang Integrative Computational Neuroscience Center, and one of the senior authors of the study.

Regina Barzilay, professor of AI and health in the School of Engineering at MIT, and Tommi Jaakkola, the Thomas Siebel Professor of Electrical Engineering and Computer Science at MIT, are also senior authors of an open-access paper on the work, which will be presented at the International Conference on Learning Representations in May. MIT graduate students Andrew Kirjner and Jason Yim are the lead authors of the study. Other authors include Shahar Bracha, a postdoc at MIT, and Raman Samusevich, a doctoral student at the Czech Technical University.

Optimization of proteins

Many naturally occurring proteins have functions that could make them useful for research or medical applications, but they need a little extra engineering to optimize them. In this study, the researchers were originally interested in developing proteins that could be used as voltage indicators in living cells. These proteins, produced by some bacteria and algae, emit fluorescent light when an electric potential is detected. If engineered for use in mammalian cells, such proteins could allow researchers to measure neuron activity without using electrodes.

Although decades of research have gone into engineering these proteins to produce a stronger fluorescent signal on a faster timescale, they have not become effective enough for widespread use. Bracha, who works in Edward Boyden's lab at the McGovern Institute, reached out to Fiete's lab to see if they could work together on a computational approach that might help speed up the process of optimizing the proteins.

“This work exemplifies the human serendipity that characterizes so much scientific discovery,” says Fiete. “It grew out of the Yang Tan Collective retreat, a scientific meeting of researchers from multiple centers at MIT with distinct missions, unified by the shared support of K. Lisa Yang. We learned that some of our interests and tools in modeling how brains learn and optimize could be applied in the totally different domain of protein design, as practiced in the Boyden lab.”

For any given protein that researchers might want to optimize, there is a nearly infinite number of possible sequences that could be generated by swapping in different amino acids at each point within the sequence. With so many possible variants, it is impossible to test all of them experimentally. That is why researchers have turned to computational modeling to predict which variants will work best.
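The scale of that search space can be made concrete with a little arithmetic. The sketch below counts all sequences of a given length, and the sequences that differ from a fixed starting point at no more than a few positions. The length of 238 amino acids for wild-type GFP and the function names are illustrative assumptions, not details taken from the study.

```python
from math import comb

AMINO_ACIDS = 20  # size of the standard amino-acid alphabet


def all_sequences(length: int) -> int:
    """Number of distinct protein sequences of the given length."""
    return AMINO_ACIDS ** length


def variants_within(length: int, max_mutations: int) -> int:
    """Sequences differing from one fixed starting sequence at no more
    than max_mutations positions (the starting sequence is included)."""
    return sum(
        comb(length, k) * (AMINO_ACIDS - 1) ** k
        for k in range(max_mutations + 1)
    )


GFP_LENGTH = 238  # assumed length of wild-type GFP, for illustration only

print(variants_within(GFP_LENGTH, 1))  # → 4523
print(variants_within(GFP_LENGTH, 7))  # variants up to seven changes away
print(len(str(all_sequences(GFP_LENGTH))), "decimal digits in the full count")
```

Even a single substitution yields thousands of candidates, and the count grows combinatorially with each additional change, which is why exhaustive laboratory testing is not an option.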

In this study, the researchers set out to overcome those challenges, using data from GFP to develop and test a computational model that could predict better versions of the protein.

They began by training a type of model known as a convolutional neural network (CNN) on experimental data consisting of GFP sequences and their brightness, the feature that they wanted to optimize.

The model was able to create a “fitness landscape” (a three-dimensional map that depicts the fitness of a given protein and how much it differs from the original sequence) based on a relatively small amount of experimental data, from about 1,000 variants of GFP.
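To give a rough sense of what a convolutional model sees, the sketch below one-hot encodes a short amino-acid string and slides a single hand-written filter along it. This is only a minimal illustration of the input representation and the convolution step: the example sequence, the filter weights, and the function names are all made up, and a real CNN would learn many filters from the brightness data rather than use fixed ones.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-element 0/1 vectors."""
    rows = []
    for aa in sequence:
        row = [0.0] * len(ALPHABET)
        row[INDEX[aa]] = 1.0
        rows.append(row)
    return rows


def conv1d(encoded, kernel):
    """Slide one filter (width x 20 weights) along the sequence, no padding."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            total += sum(
                w * x for w, x in zip(kernel[offset], encoded[start + offset])
            )
        outputs.append(total)
    return outputs


# Hypothetical width-2 filter that responds to "G" followed by "Y".
kernel = [[0.0] * len(ALPHABET) for _ in range(2)]
kernel[0][INDEX["G"]] = 1.0
kernel[1][INDEX["Y"]] = 1.0

features = conv1d(one_hot("MSKGYGE"), kernel)  # made-up toy sequence
print(features)  # → [0.0, 0.0, 0.0, 2.0, 0.0, 1.0]
```

The filter responds most strongly where its full pattern appears and partially where only one position matches. A trained network combines many such local responses to predict a single property, such as brightness.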

These landscapes contain peaks that represent fitter proteins and valleys that represent less fit proteins. Predicting the path that a protein needs to follow to reach the peaks of fitness can be difficult, because a protein often has to undergo a mutation that makes it less fit before it can reach a nearby peak of higher fitness. To overcome this problem, the researchers used an existing computational technique to “smooth” the fitness landscape.

Once these small bumps in the landscape were smoothed out, the researchers retrained the CNN model and found that it could reach higher fitness peaks more easily. The model was able to predict optimized GFP sequences that differed by as many as seven amino acids from the protein sequence they started with, and the best of these proteins were estimated to be about 2.5 times fitter than the original.

“Once we have this landscape that represents what the model thinks is nearby, we smooth it out and then retrain the model on the smoother version of the landscape,” says Kirjner. “Now there is a smooth path from your starting point to the top, which the model is able to reach by iteratively making small improvements. The same is often impossible for unsmoothed landscapes.”
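One simple way to picture smoothing is to blend each sequence's fitness with the average fitness of the sequences one mutation away. The sketch below does this on a tiny made-up landscape over a two-letter alphabet. It is a stand-in chosen for clarity: the article does not specify the technique the researchers used, and the alphabet, fitness values, and function names here are all illustrative assumptions.

```python
from itertools import product

ALPHABET = "AB"  # toy two-letter alphabet; real proteins use 20 amino acids
LENGTH = 4


def neighbors(seq):
    """All sequences exactly one substitution away from seq."""
    result = []
    for i, current in enumerate(seq):
        for letter in ALPHABET:
            if letter != current:
                result.append(seq[:i] + letter + seq[i + 1:])
    return result


def smooth(fitness, weight=0.5, rounds=1):
    """Blend each sequence's fitness with the mean of its neighbors' fitness."""
    for _ in range(rounds):
        updated = {}
        for seq, value in fitness.items():
            near = [fitness[n] for n in neighbors(seq)]
            updated[seq] = (1 - weight) * value + weight * sum(near) / len(near)
        fitness = updated
    return fitness


def roughness(fitness):
    """Sum of squared fitness differences over all one-mutation pairs."""
    total = sum(
        (fitness[s] - fitness[n]) ** 2 for s in fitness for n in neighbors(s)
    )
    return total / 2  # each pair was counted from both ends


# Made-up landscape: fitness rises with the number of B's, except for two
# nonfunctional intermediates that create valleys on the way up.
sequences = ["".join(p) for p in product(ALPHABET, repeat=LENGTH)]
rugged = {s: float(s.count("B")) for s in sequences}
rugged["BBAA"] = 0.0
rugged["ABAB"] = 0.0

smoothed = smooth(rugged, weight=0.5, rounds=2)
print("roughness before:", roughness(rugged))
print("roughness after: ", roughness(smoothed))
```

With a blending weight of 0.5 the overall average fitness is preserved while differences between neighboring sequences shrink, so the valleys become shallower without the broad trend being erased.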
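The effect described in that quote can be seen in a one-dimensional toy example. In the sketch below, a greedy climber that only accepts strict improvements stalls at a small local bump on a rugged landscape, but reaches the far end once the same landscape is smoothed with a moving average. The fitness values, the moving-average smoother, and the function names are illustrative assumptions; the study itself works on protein sequences and retrains a neural network rather than averaging a list.

```python
def moving_average(values, radius=1):
    """Average each point with its neighbors (window is truncated at the ends)."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def climb(landscape, start):
    """Greedy local search: step to the better adjacent position until
    neither neighbor is a strict improvement, then return that position."""
    position = start
    while True:
        candidates = [
            p for p in (position - 1, position + 1) if 0 <= p < len(landscape)
        ]
        best = max(candidates, key=lambda p: landscape[p])
        if landscape[best] <= landscape[position]:
            return position
        position = best


# Made-up fitness values along a path of single mutations, with small dips.
rugged = [1.0, 2.0, 3.0, 2.5, 4.0, 5.0, 4.5, 6.0, 7.0]
smoothed = moving_average(rugged)

print(climb(rugged, 0))    # → 2 (stuck before the first dip)
print(climb(smoothed, 0))  # → 8 (reaches the end of the path)
```

On the rugged list the climber stops at the first point where both neighbors look worse, even though much fitter positions lie a few steps further on. After smoothing, every step along the path is a small improvement, so the same greedy rule carries it to the top.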

Proof of concept

The researchers also showed that this approach worked well in identifying new sequences for the viral capsid of adeno-associated virus (AAV), a viral vector commonly used to deliver DNA. In that case, they optimized the capsid for its ability to package a DNA payload.

“We used GFP and AAV as a proof of concept to show that this is a method that works on data sets that are very well characterized, and because of that, it should be applicable to other protein engineering problems,” says Bracha.

The researchers now plan to apply this computational technique to data that Bracha has been generating on voltage indicator proteins.

“Dozens of labs have been working on that for two decades, and there still isn't anything better,” she says. “The hope is that now, by generating a smaller data set, we can train a model in silico and make predictions that could be better than the past two decades of manual testing.”

The research was funded partly by the US National Science Foundation, the Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, the Abdul Latif Jameel Clinic for Machine Learning in Health, the DTRA Discovery of Medical Countermeasures Against New and Emerging Threats Program and the DARPA Accelerated Molecular Discovery Program, the Sanofi Computational Antibody Design Grant, the US Office of Naval Research, the Howard Hughes Medical Institute, the National Institutes of Health, the K. Lisa Yang ICoN Center, and the K. Lisa Yang and Hock E. Tan Center for Molecular Therapeutics at MIT.

