Machine-learning-guided directed evolution for protein engineering

Proteins are sequences of amino acids. The amino-acid sequence determines how the protein folds into a 3D structure, and that structure in turn determines what the protein does. In principle, then, a protein’s amino-acid sequence determines its function. In practice, we cannot reliably predict structure or function from sequence: the mapping from sequence to function is unknown, and our current rational methods aren’t good enough to work it out. Engineering is even harder, because now we want to go backwards, from a desired function to a sequence.

Directed evolution works by making small changes to (mutating) a parent protein and then keeping the variant(s) that improve the desired function. This allows us to engineer proteins without knowing how sequence determines function. This works really well! But you need a starting point (the parent) and the ability to screen lots (>>10^3 / wk) of variants. Importantly, the variants that aren’t the best in each round carry a lot of information that we’re not using. Historically, people haven’t even spent the time and money to determine what all those sequences are. Machine learning allows us to make better use of the information gathered from the screen.
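
The mutate, screen, keep-the-best loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_screen` is a hypothetical stand-in for a lab assay (it counts matches to a made-up target sequence), and the parent sequence, library size, and number of rounds are arbitrary choices, not values from the review.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mutate(parent, rng):
    """Return a copy of parent with one random position changed to a different residue."""
    pos = rng.randrange(len(parent))
    choices = [aa for aa in AMINO_ACIDS if aa != parent[pos]]
    return parent[:pos] + rng.choice(choices) + parent[pos + 1:]


def toy_screen(seq, target="MKVLAAGW"):
    """Hypothetical assay: number of positions matching a made-up target sequence."""
    return sum(a == b for a, b in zip(seq, target))


def directed_evolution(parent, screen, rounds=10, library_size=50, seed=0):
    """Greedy directed evolution: each round, screen a library and keep the best variant."""
    rng = random.Random(seed)
    history = []  # every (sequence, score) pair measured, including the non-winners
    for _ in range(rounds):
        library = [mutate(parent, rng) for _ in range(library_size)]
        scored = [(variant, screen(variant)) for variant in library]
        history.extend(scored)
        best, best_score = max(scored, key=lambda pair: pair[1])
        # Only replace the parent if the best variant is a strict improvement.
        if best_score > screen(parent):
            parent = best
    return parent, history


final, history = directed_evolution("AAAAAAAA", toy_screen)
```

Note that `history` holds every screened variant, not just the winners. In classic directed evolution that record is discarded; it is exactly the data a sequence-function model could be trained on.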


Machine-learning methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. We can therefore use the information from the screen to train an ML sequence-function model, and then use that model to select the next set of variants. This review analyzes machine-learning methods for protein engineering and provides two case studies showing how machine learning can speed up engineering or enable the engineering of more difficult properties than directed evolution alone.
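
As a rough sketch of the train-then-select step described above, the example below fits a linear model on one-hot encoded sequences and uses it to rank unscreened variants. Everything here is an assumption made for illustration: the linear model trained by stochastic gradient descent is a deliberately simple stand-in for whatever model a real study would use, `toy_screen` is a hypothetical assay, and the parent sequence and batch size are arbitrary.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_screen(seq, target="MKVLAAGW"):
    """Hypothetical assay: number of positions matching a made-up target sequence."""
    return sum(a == b for a, b in zip(seq, target))


def encode(seq):
    """Indices of the active one-hot features, one (position, residue) slot per site."""
    return [i * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa) for i, aa in enumerate(seq)]


def fit_linear(seqs, scores, epochs=200, lr=0.05, seed=0):
    """Fit score ~ bias + sum of per-(position, residue) weights by SGD on squared error."""
    rng = random.Random(seed)
    data = [(encode(s), y) for s, y in zip(seqs, scores)]
    weights = [0.0] * (len(seqs[0]) * len(AMINO_ACIDS))
    bias = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the training examples in a new order each epoch
        for idx, y in data:
            err = bias + sum(weights[j] for j in idx) - y
            bias -= lr * err
            for j in idx:  # only the active one-hot features have a nonzero gradient
                weights[j] -= lr * err
    return weights, bias


def predict(model, seq):
    """Predicted score of seq under the fitted (weights, bias) model."""
    weights, bias = model
    return bias + sum(weights[j] for j in encode(seq))


def variants(parent, n_mutations):
    """All sequences that differ from parent at exactly n_mutations positions."""
    out = []
    for positions in itertools.combinations(range(len(parent)), n_mutations):
        options = [[aa for aa in AMINO_ACIDS if aa != parent[p]] for p in positions]
        for residues in itertools.product(*options):
            seq = list(parent)
            for p, aa in zip(positions, residues):
                seq[p] = aa
            out.append("".join(seq))
    return out


parent = "AAAAAAAA"
screened = variants(parent, 1)  # round 1: screen every single mutant
model = fit_linear(screened, [toy_screen(s) for s in screened])
candidates = variants(parent, 2)  # double mutants that were never screened
next_batch = sorted(candidates, key=lambda s: predict(model, s), reverse=True)[:10]
```

The model never sees a double mutant during training, yet it can rank them by combining what it learned about individual mutations. The toy landscape is additive by construction, which is why a linear model is enough here; real sequence-function data may not be so well behaved.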