“OK. Here we go.” David Juergens, a computational chemist at the University of Washington (UW) in Seattle, is about to design a protein that, in 3-billion-plus years of tinkering, evolution has never produced.
On a video call, Juergens opens a cloud-based version of an artificial intelligence (AI) tool he helped to develop, called RFdiffusion. This neural network, and others like it, are helping to bring the creation of custom proteins — until recently a highly technical and often unsuccessful pursuit — to mainstream science.
These proteins could form the basis for vaccines, therapeutics and biomaterials. “It’s been a completely transformative moment,” says Gevorg Grigoryan, the co-founder and chief technical officer of Generate Biomedicines in Somerville, Massachusetts, a biotechnology company applying protein design to drug development.
The tools are inspired by AI software that synthesizes realistic images, such as the Midjourney software that, this year, was famously used to produce a viral image of Pope Francis wearing a designer white puffer jacket. A similar conceptual approach, researchers have found, can churn out realistic protein shapes to criteria that designers specify — meaning, for instance, that it’s possible to speedily draw up new proteins that should bind tightly to another biomolecule. And early experiments show that when researchers manufacture these proteins, a useful fraction do perform as the software suggests.
The tools have revolutionized the process of designing proteins in the past year, researchers say. “It is an explosion in capabilities,” says Mohammed AlQuraishi, a computational biologist at Columbia University in New York City, whose team has developed one such tool for protein design. “You can now create designs that have sought-after qualities.”
“You’re building a protein structure customized for a problem,” says David Baker, a computational biophysicist at UW whose group, which includes Juergens, developed RFdiffusion. The team released the software in March 2023, and a paper describing the neural network appears this week in Nature [1]. (A preprint version was released in late 2022, at around the same time that several other teams, including AlQuraishi’s [2] and Grigoryan’s [3], reported similar neural networks.)
For the first time, protein designers now have the kinds of reproducible and robust tools around which a new industry can be created, Grigoryan adds. “The next challenge becomes, what do you do with it?”
Juergens inputs a few specifications for the protein he wants into a web form resembling an online tax calculator. It must be 100 amino acids long and form a symmetrical two-protein complex called a homodimer. Many cell receptors adopt this configuration, and a new homodimer could be a synthetic cell-signalling molecule, chimes in Joe Watson, a UW computational biochemist who co-developed RFdiffusion, and is also on the video call. But this morning’s design isn’t meant to do anything except resemble a realistic protein.
Researchers have struggled for decades to build new proteins. At first, they tried to cobble together useful parts of existing proteins, such as a pocket of an enzyme in which a chemical reaction is catalysed. This approach relied on understanding how proteins fold up and work, as well as intuition and a lot of trial and error. Scientists sometimes screened thousands of designs to identify one that worked as hoped.
A light-bulb moment came with AlphaFold (developed by the London-based AI firm DeepMind, now Google DeepMind) and other AI-based models that could accurately predict protein structures from amino-acid sequences, says Baker. Designers realized that these neural networks, trained on real protein sequences and structures, could also help to create proteins from scratch.
In the past few years, Baker’s team and others in the field have released a slew of AI-based protein-design tools. One approach these tools use, called hallucination, involves creating a random string of amino acids that is then optimized by AlphaFold, or a similar tool called RoseTTAFold, until it resembles something that the neural network suggests is likely to fold into a specific structure. Another, called inpainting, takes a specified snippet of a protein sequence or structure and builds the rest of the molecule around it using RoseTTAFold.
But these tools are far from perfect. Experiments tended to show that structures designed by hallucination methods didn’t always form well-folded proteins when they were made in the laboratory; some ended up as gunk at the bottom of a test tube, for instance. Hallucination methods also struggled to make anything but small proteins (although other researchers showed, in a February preprint, how the technique could be used to design longer molecules [4]). Inpainting also did a poor job of forming proteins when given shorter snippets. Even when the approach did produce a theoretical protein structure, it wasn’t able to come up with diverse solutions to a problem that would increase the odds of success.
That is where RFdiffusion and similar protein-designing AIs, released in recent months, come in. They are based on the same principles as neural networks that generate realistic images, such as Stable Diffusion, DALL-E and Midjourney. These ‘diffusion’ networks are trained on data, be they images or protein structures, which are then made progressively noisier, eventually bearing no resemblance to the starting image or structure. The network then learns to ‘denoise’ the data, performing the task in reverse.
Networks such as RFdiffusion are trained on tens of thousands of real protein structures stored in a repository called the Protein Data Bank (PDB). When the network makes a new protein, it begins with total noise: a random assortment of amino acids. “You’re asking what is the protein that gave rise to the noise,” explains Watson. After rounds of denoising, it produces something resembling a real — but new — protein.
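As a rough illustration of the noising half of this process, the sketch below applies the standard Gaussian forward-noising formula used by image diffusion models to a handful of 3D points. It is a toy under stated assumptions, not RFdiffusion's code: the linear noise schedule, the function names `make_alpha_bars` and `add_noise`, and the five-point "backbone" are all hypothetical, and the learned denoising network that runs the process in reverse is omitted entirely.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return the cumulative signal-retention factor after each noising step.

    Assumes a linear schedule of per-step noise levels (beta), a common
    default in image diffusion models; the values here are illustrative.
    """
    alpha_bars = []
    running = 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(coords, alpha_bar, rng):
    """Blend clean coordinates with Gaussian noise.

    Implements x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps,
    where eps is drawn from a standard normal distribution.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(signal * c + noise * rng.gauss(0.0, 1.0) for c in point)
        for point in coords
    ]


rng = random.Random(0)

# Hypothetical toy "backbone": five points spaced along the x axis.
clean = [(float(i), 0.0, 0.0) for i in range(5)]

alpha_bars = make_alpha_bars(1000)
slightly_noisy = add_noise(clean, alpha_bars[0], rng)   # still close to the input
mostly_noise = add_noise(clean, alpha_bars[-1], rng)    # almost no signal left
```

In this toy, the early steps barely perturb the points, whereas by the final step almost none of the original signal survives. A trained network would be asked to step back from that noise towards something protein-like.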
When Baker’s team tested RFdiffusion without providing any guidance except the length of the protein, the network generated diverse, realistic-looking proteins, different from anything it had been trained on in the PDB.
But the researchers are also able to direct the program to make proteins according to specific design constraints during the denoising process, a process called conditioning.
For instance, Baker’s team conditioned RFdiffusion to make proteins that include a specific fold, or that can nestle against the surface of another molecule (an interaction that underlies binding). Grigoryan’s team even developed a diffusion network called Chroma and then conditioned it to make proteins shaped to resemble the 26 capital letters used in English, as well as the Arabic numerals [3].
Signal from noise
Juergens’ computer screen initially shows noise, the random assortment of amino acids that the AI system starts with. They are represented as red, smudgy squiggles that resemble a toddler’s fingerpainting. They morph, frame by frame, into ever-more-complex shapes, with protein-like features such as tight spirals known as α-helices and ribbony shapes that double back on themselves, called β-sheets. “It’s a nice mixed alpha–beta topology,” says Juergens, smiling as he admires a creation that took only a few minutes to make. “This is looking good.”
The tool has gained widespread use in Baker’s laboratory. “The design process is almost unrecognizable compared to a year ago,” he says. The neural network has excelled in design challenges that have been inefficient, difficult or impossible using other approaches.
In one analysis reported in their study [1], the researchers started with a snippet from another protein, such as a portion of a viral protein recognized by immune cells, and tasked AI-based tools with churning out 100 different new proteins, to see how many would incorporate the desired motif. The team carried out this challenge for 25 different initial shapes. The results didn’t always incorporate the starting snippet, but RFdiffusion produced at least one protein that did for 23 of the motifs, compared with 15 for hallucination and 12 for inpainting.
RFdiffusion has also proved adept at making proteins that self-assemble into complex nanoparticles that might be able to deliver drugs or vaccine components. Previous AI approaches [5] can also make these kinds of protein, but Watson says RFdiffusion’s designs are much more sophisticated.
Neural networks such as RFdiffusion seem to really shine when tasked with designing proteins that can stick to another specified protein. Baker’s team has used the network to create proteins that bind strongly to proteins implicated in cancers, autoimmune diseases and other conditions. One as-yet unpublished success, he says, was to design strong binders for a hard-to-target immune-signalling molecule called the tumour necrosis factor receptor — the target for antibody drugs that generate billions of dollars in revenue each year. “It is broadening the space of proteins we can make binders to and make meaningful therapies” for, Watson says.
Baker’s team is cranking out so many designs that testing whether they work as intended has become a serious bottleneck. “One machine-learning person can generate enough designs to keep 100 biologists busy for months,” says Kevin Yang, a biomedical machine-learning researcher at Microsoft Research in Cambridge, Massachusetts, whose team has developed its own diffusion-based protein design tool [6].
But early signs suggest that RFdiffusion’s creations are the real deal. In another challenge described in their study, Baker’s team tasked the tool with designing proteins containing a key stretch of p53, a signalling molecule that is overactive in many cancers (and a sought-after drug target). When the researchers made 95 of the software’s designs (by engineering bacteria to express the proteins), more than half maintained p53’s ability to bind to its natural target, MDM2. The best designs did so around 1,000 times more strongly than did natural p53. When the researchers attempted this task with hallucination, the designs — although predicted to work — did not pan out in the test tube, says Watson.
Overall, Baker says his team has found that 10–20% of RFdiffusion’s designs bind to their intended target strongly enough to be useful, compared with less than 1% for earlier, pre-AI methods. (Previous machine-learning approaches were not able to reliably design binders, Watson says.) Biochemist Matthias Gloegl, a colleague at UW, says that lately he has been hitting success rates approaching 50%, which means it can take just a week or two to come up with working designs, as opposed to months. “It’s really insane,” he says.
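To see why the jump in hit rate matters so much at the bench, a back-of-the-envelope calculation helps. The sketch below assumes, purely for illustration, that each design succeeds independently with the same probability. The function names are hypothetical, and 1% is used as a generous stand-in for the "less than 1%" quoted for older methods.

```python
import math


def chance_of_at_least_one_hit(success_rate, num_designs):
    """Probability that at least one of num_designs independent designs works."""
    return 1.0 - (1.0 - success_rate) ** num_designs


def designs_needed(success_rate, target_chance=0.95):
    """Smallest number of designs giving at least target_chance of one hit.

    Assumes independent designs with an identical per-design success rate,
    which is a simplification of real screening campaigns.
    """
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    if not 0.0 < target_chance < 1.0:
        raise ValueError("target_chance must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target_chance) / math.log(1.0 - success_rate))


# Per-design hit rates taken from the figures quoted in the text.
for label, rate in [("pre-AI (upper bound)", 0.01),
                    ("RFdiffusion, low end", 0.10),
                    ("RFdiffusion, high end", 0.20),
                    ("Gloegl's recent runs", 0.50)]:
    print(f"{label}: test {designs_needed(rate)} designs for a 95% chance of a hit")
```

Under these assumptions, a 95% chance of finding one working binder takes about 299 designs at a 1% hit rate, 29 at 10%, 14 at 20% and only 5 at 50%.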
The cloud-based version of RFdiffusion had around 100 users each day by late June, according to Sergey Ovchinnikov, an evolutionary biologist at Harvard University in Cambridge, Massachusetts. Joel Mackay, a biochemist at the University of Sydney in Australia, has been dabbling with RFdiffusion to design proteins capable of binding to other proteins that his lab studies, which include molecules called transcription factors that control gene activity in cells. He found the design process simple, and used computer modelling to validate that, in theory, the proteins should bind to the transcription factors.
Mackay is now testing whether the proteins can alter gene expression as intended when they are produced in cells. He has his fingers crossed, because such a finding would amount to a simple way to switch specific transcription factors on and off within cells, instead of using drugs that can take years to identify, if they can be discovered at all. “If this method works reliably for our types of proteins, it would be a total game-changer,” he says.
The latest models such as RFdiffusion are a “step change”, says Charlotte Deane, an immune informatician at the University of Oxford, UK. But key challenges remain. “What it will do is inspire people to see how far we can push these diffusion methods,” she says.
One application that she and other scientists and biotechnology companies are particularly interested in is designing more complex binding proteins such as antibodies, or the protein receptors used by T cells (a type of immune cell). These proteins have flexible loops that interlock with their targets, as opposed to the sandwich-like, flat interfaces that RFdiffusion has excelled at so far. Baker says they are making progress with antibodies.
Ovchinnikov and others say it’s challenging, in general, to design biomolecules whose function depends on floppy regions that give them the ability to adopt many different shapes. These are features that have proved difficult to model using AI. “If the problem is, can we bind to something else and inhibit it,” says Ovchinnikov, “I think that problem is going to be solved with these methods. But in order to do something more complex, more like what nature does, you need to introduce some flexibility.”
Tanja Kortemme, a computational biologist at the University of California, San Francisco, is using RFdiffusion to design proteins that can be used as sensors or as switches to control cells. She says that if a protein’s active site depends on the placement of a few amino acids, the AI network does well, but it struggles to design proteins with more-complex active sites, requiring many more key amino acids to be in place — a challenge she and her colleagues are trying to tackle.
Another limitation of the latest diffusion methods is their inability to create proteins that are vastly different from natural proteins, says Yang. That is because the AI systems have been trained only on existing proteins that scientists have characterized, he says, and tend to create proteins that resemble those. Generating more-alien-looking proteins might require a better understanding of the physics that imbues proteins with their function.
That could make it easier to design proteins to carry out tasks no natural protein has ever evolved to do. “There’s still a lot of room to grow,” Yang says.
The latest protein-design tools have proved to be extremely powerful at creating proteins that can do a particular task — so long as that function can be described in terms of a shape, such as the surface of a protein to bind to, says AlQuraishi. But, he adds, tools such as RFdiffusion aren’t yet able to handle other kinds of specifications, such as making a protein that can carry out a particular reaction regardless of its shape — when “you know what you want but you don’t know what the geometry is”.
Future protein-design tools will also need the capacity to churn out proteins that meet numerous different criteria at once, says Grigoryan. A potential therapeutic protein must not only bind to its target, but also avoid binding to others, and it should possess properties that make it easy to mass-produce.
One direction that researchers are exploring is whether proteins could be designed using plain language text descriptions, similar to the prompts fed to image-generation tools such as Midjourney. “You can really imagine we will be able to write descriptions of a protein and have them synthesized and tested,” says Watson.
Grigoryan and his colleagues have taken a step towards this goal. In their December 2022 preprint [3], they trained Chroma to attach descriptions to its designs and spit out designs to text-based specifications, including ‘protein with a CHAD domain’ (a protein shape incorporating multiple helices) or ‘crystal structure of aminotransferases’ (enzymes involved in making and breaking down proteins).
The protein Juergens created in a few minutes this morning is only a model of a protein’s 3D structure. Juergens then uses another AI tool to come up with sequences of amino acids that should fold up into that structure. As a final check, he plugs the sequences into AlphaFold to see whether the software predicts folded structures that match the design. They’re spot on, with the AlphaFold predictions differing from the design by an average of just 1 ångström (the width of a hydrogen atom).
“This is at the accuracy that we would class as a design success,” says Watson. The only thing left to do, he says, is to see how the protein performs in real life.
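The final comparison in the walkthrough above, between the designed structure and AlphaFold's prediction, is commonly summarized as a root-mean-square deviation (RMSD) over matched atoms. The article does not say which metric the team used, so treat this as one plausible reading. The sketch assumes the two structures are already superimposed (real pipelines first align them, for example with the Kabsch algorithm), and the coordinates are made-up values, not data from the design.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of 3D points.

    Assumes the structures are already superimposed; no alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for point_a, point_b in zip(coords_a, coords_b):
        total += sum((a - b) ** 2 for a, b in zip(point_a, point_b))
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates in angstroms: three atoms spaced 3.8 apart,
# and a "prediction" in which every atom is displaced by 1.0 along y.
designed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]

deviation = rmsd(designed, predicted)
print(f"RMSD: {deviation:.2f} angstroms")
```

Here every atom is off by exactly 1 ångström, so the RMSD is also 1 ångström, the scale of agreement Watson describes as a design success.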