Mathematics-assisted Directed Evolution and Protein Engineering
Climate change is contributing to more frequent and intense heat waves, droughts, and other extreme weather events, in addition to rising sea levels and an overall loss of biodiversity. These changes significantly impact food security, the environment, the economy, and society as a whole. Directed evolution, a powerful molecular biology technique that can engineer organisms for environmental remediation (see Figure 1), might be able to mitigate some of the effects of climate change. This approach has revolutionized the field of protein engineering: it allows researchers to develop novel therapeutics, control harmful pathogens, and design enzymes for industrial processes by creating proteins with tailored properties and functions that would be difficult or impossible to achieve with traditional methods [1]. However, directed evolution also faces significant challenges due to the astronomically large mutational space; for example, a relatively small protein with 200 amino acids has \(20^{200}\) possible sequences. Because organisms typically contain many different proteins, exhaustively exploring this space is intractable for most directed evolution tasks.
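To put the size of this space in perspective, the short Python sketch below counts the full sequence space for a 200-residue protein alongside the much smaller single- and double-mutant neighborhoods of one parent sequence. The function names are illustrative, and the counts assume the 20 canonical amino acids with substitutions only (no insertions or deletions).

```python
from math import comb

AMINO_ACIDS = 20  # assumption: canonical alphabet, substitutions only


def sequence_space_size(length: int) -> int:
    """Number of distinct amino acid sequences of the given length."""
    return AMINO_ACIDS ** length


def k_point_mutants(length: int, k: int) -> int:
    """Number of variants that differ from one parent at exactly k positions."""
    # Choose which k positions change, then one of 19 alternatives at each.
    return comb(length, k) * (AMINO_ACIDS - 1) ** k


if __name__ == "__main__":
    n = 200
    total = sequence_space_size(n)
    print(f"all sequences:  a {len(str(total))}-digit number")
    print(f"single mutants: {k_point_mutants(n, 1)}")
    print(f"double mutants: {k_point_mutants(n, 2)}")
```

Under these assumptions, a single parent has 3,800 single mutants and roughly 7.2 million double mutants, while the full space is a 261-digit number. Screening therefore has to be guided rather than exhaustive.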
The research community has recently developed a significant interest in artificial intelligence (AI)-assisted directed evolution (AIDE) [9]. AI is an effective tool for exploring the protein fitness landscape and identifying optimal evolutionary paths. When combined with experimental validation, AIDE offers an alternative to rational protein design and could potentially yield more effective antibodies and enzymes as well as flood-, drought-, and disease-resistant plants.
AIDE has also inspired a deeper understanding of evolutionary principles. Although this technique primarily concentrates on protein sequence data, structure data offers a wealth of biophysical information that cannot be extracted from sequences — like crucial structure-function relationships. The main challenges that accompany the utilization of structure data stem from its structural complexity, high dimensionality, and nonlinearity, along with the multiscale and multiphysical interactions that are inherent in biological systems.
In the last few years, topological data analysis (TDA) has had a rapidly growing impact on science and engineering. Persistent homology (PH), the main tool of TDA, uses filtration to bridge the gap between complex geometry and abstract topology [5, 14]. PH successfully handles complex, high-dimensional, nonlinear, and multiscale data, including data from computer-aided drug design [12]. However, it is limited by the following shortcomings:
- Inability to handle heterogeneous information (e.g., different types of atoms in proteins)
- Qualitative nature (e.g., ignoring the difference between a five- and six-membered ring)
- Lack of description of non-topological changes (i.e., homotopic shape evolution)
- Inability to cope with directed networks and digraphs (e.g., polarization and gene regulation)
- Incapacity to characterize structured data (e.g., functional groups and protein domains/motifs).
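To make the filtration idea behind PH concrete, the following minimal sketch computes dimension-0 persistence (the birth and death of connected components) for a small point cloud. It assumes a Vietoris-Rips filtration in which an edge appears once the filtration value reaches the distance between its endpoints; the function name `h0_persistence` is hypothetical, and practical work would normally rely on a dedicated TDA library that also handles higher dimensions.

```python
from itertools import combinations
from math import dist, inf


def h0_persistence(points):
    """Return (birth, death) bars for connected components (dimension 0)
    of a Vietoris-Rips filtration on the given points."""
    n = len(points)
    if n == 0:
        return []
    parent = list(range(n))  # union-find forest, one tree per component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Edges enter the filtration in order of increasing length.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j      # two components merge ...
            bars.append((0.0, length))   # ... so one of them dies here
    bars.append((0.0, inf))  # the final component never dies
    return bars


if __name__ == "__main__":
    cloud = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    print(h0_persistence(cloud))  # [(0.0, 1.0), (0.0, 2.0), (0.0, inf)]
```

Each bar records only when a component disappears. The output carries no information about atom types, ring sizes, or edge directions, which is the kind of gap that the shortcomings above describe.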
Scientists have introduced many persistent topological Laplacians to address these challenges, including persistent Laplacians (PLs) [11], persistent path Laplacians, persistent sheaf Laplacians, persistent hypergraph Laplacians [6], persistent hyperdigraph Laplacians, and evolutionary de Rham-Hodge theory [3]. For example, PLs—also known as persistent spectral graphs—can capture the homotopic shape evolution of data that PH cannot describe (see Figure 2). This approach can generate accurate forecasts of the future dominant variants of SARS-CoV-2 [13].
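The difference between topological invariants and spectral information can be illustrated with the five- and six-membered rings mentioned earlier. The sketch below is a deliberate simplification: it uses only the ordinary graph Laplacian of a single ring (order 0, one filtration value), not the full persistent Laplacian construction of [11], and the function names and the small Jacobi eigenvalue routine are illustrative choices. The number of zero eigenvalues recovers the Betti numbers, which are identical for both rings, while the smallest nonzero eigenvalue tells the rings apart.

```python
from math import hypot, sqrt


def cycle_laplacian(n):
    """Graph Laplacian (degree minus adjacency) of an n-membered ring."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
    return lap


def symmetric_eigenvalues(matrix, tol=1e-18, max_sweeps=50):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                sign = 1.0 if theta >= 0 else -1.0
                t = sign / (abs(theta) + hypot(theta, 1.0))
                c = 1.0 / sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def ring_summary(n, zero_tol=1e-9):
    """Return (betti_0, betti_1, smallest nonzero eigenvalue) for a ring."""
    eig = symmetric_eigenvalues(cycle_laplacian(n))
    betti0 = sum(1 for v in eig if abs(v) < zero_tol)  # zero eigenvalues
    vertices, edges = n, n
    betti1 = edges - vertices + betti0  # independent loops in a graph
    smallest_nonzero = min(v for v in eig if abs(v) >= zero_tol)
    return betti0, betti1, smallest_nonzero


if __name__ == "__main__":
    for size in (5, 6):
        b0, b1, lam = ring_summary(size)
        print(f"{size}-ring: betti_0={b0}, betti_1={b1}, lambda={lam:.4f}")
```

Both rings have one component and one loop, so their Betti numbers coincide, but the smallest nonzero eigenvalue is about 1.382 for the five-membered ring and 1 for the six-membered ring.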
The effectiveness and practicality of new TDA tools are evident in AIDE and AI-assisted protein engineering (AIPE) research [10]. Element-specific PH and PLs simplify molecular geometric complexes; reduce protein dimensionality; and capture topological invariants, shape evolution, and sequence disparities in the protein fitness landscape. Researchers have combined these structure embeddings with sequence embeddings from advanced natural language processing models, such as autoencoders, long short-term memory networks, and transformers [9]. The subsequent integration of these embeddings with cutting-edge machine learning algorithms established a new generation of AIDE and AIPE approaches [10].
Scientists are enthusiastic and optimistic about mathematics-assisted directed evolution and protein engineering [7]. Mathematicians can play a crucial role in this burgeoning field by developing statistical models and mathematical frameworks that improve the efficiency and accuracy of machine learning algorithms [8]. Moreover, mathematicians can help characterize the behavior of AI and machine learning systems and analyze their computational complexity, thereby providing valuable insights into their capabilities and limitations. Continued advances in mathematical techniques and innovative AI algorithms will shape the future of directed evolution and protein engineering, with the potential to address grand challenges like climate change, food security, and global pandemics.
References
[1] Arnold, F.H. (1998). Design by directed evolution. Accounts Chem. Res., 31(3), 125-131.
[2] Brouwer, N., Connuck, H., Dubniczki, H., Gownaris, N., Howard, A., Olmsted, C., … Zallek, T. (2023). Ecology for All! LibreTexts. Retrieved from https://bio.libretexts.org/Courses/Gettysburg_College/01%3A_Ecology_for_All.
[3] Chen, J., Zhao, R., Tong, Y., & Wei, G.-W. (2021). Evolutionary de Rham-Hodge method. Discrete Contin. Dynam. Syst. Series B, 26(7), 3785-3821.
[4] Chu, J. (2017, March 16). Climate change to worsen drought, diminish corn yields in Africa. MIT News. Retrieved from https://news.mit.edu/2017/climate-change-drought-corn-yields-africa-0316.
[5] Edelsbrunner, H., & Harer, J. (2008). Persistent homology — a survey. Contemp. Math., 453, 257-282.
[6] Liu, X., Feng, H., Wu, J., & Xia, K. (2021). Persistent spectral hypergraph based machine learning (PSH-ML) for protein-ligand binding affinity prediction. Brief. Bioinform., 22(5), bbab127.
[7] Luo, Y. (2023). Sensing the shape of functional proteins with topology. Nature Comput. Sci., 3, 124-125.
[8] Mucllari, E., Zadorozhnyy, V., Pospisil, C., Nguyen, D., & Ye, Q. (2022). Orthogonal gated recurrent unit with Neumann-Cayley transformation. Preprint, arXiv:2208.06496.
[9] Narayanan, H., Dingfelder, F., Butté, A., Lorenzen, N., Sokolov, M., & Arosio, P. (2021). Machine learning for biologics: Opportunities for protein engineering, developability, and formulation. Trends Pharmacol. Sci., 42(3), 151-165.
[10] Qiu, Y., & Wei, G.-W. (2023). Persistent spectral theory-guided protein engineering. Nature Comput. Sci., 3, 149-163.
[11] Wang, R., Nguyen, D.D., & Wei, G.-W. (2020). Persistent spectral graph. Inter. J. Num. Methods Biomed. Eng., 36(9), e3376.
[12] Wei, G.-W. (2017, December 1). Persistent homology analysis of biomolecular data. SIAM News, 50(10), p. 10.
[13] Wei, G.-W. (2022, August 9). Topological artificial intelligence forecasting of future dominant viral variants. SIAM News Online. Retrieved from https://siam.org/publications/siam-news/articles/topological-artificial-intelligence-forecasting-of-future-dominant-viral-variants.
[14] Zomorodian, A., & Carlsson, G. (2004). Computing persistent homology. In Proceedings of the twentieth annual symposium on computational geometry (pp. 347-356). New York, NY: Association for Computing Machinery.
About the Authors
Yuchi Qiu
Research Associate, Michigan State University
Yuchi Qiu is a research associate at Michigan State University. His research focuses on artificial intelligence and mathematics-assisted directed evolution and protein engineering.
Guo-Wei Wei
MSU Foundation Professor, Michigan State University
Guo-Wei Wei is an MSU Foundation Professor at Michigan State University. His research explores the mathematical foundations of bioscience and data science.