Combining mechanistic and statistical models for predicting reaction outcomes in organic synthesis

Computational modeling and machine learning tools have assisted in the fundamental challenge of predicting the optimal "over-the-arrow" reaction conditions that maximize the output (e.g., yield and selectivity). The work presented here explores reaction optimization for multiple challenging synthetic reactions: (i) precise photocatalytic transformations in chemical biology, (ii) new reactivity using organobismuth(V) reagents, (iii) challenging reversible nucleophilic alcohol addition reactions influenced by equilibrium, and (iv) a late-stage key reaction step in a total synthesis project. Overall, this dissertation aims to assist in predicting optimal reaction outcomes by understanding and formulating reaction mechanisms from quantum mechanics and statistical methods while using open-source automated workflows to improve transparency and reproducibility within data-chemistry fields. Chapter 1 provides the background on the computational and statistical models that help address the challenges of the optimization process, along with the limitations of each strategy. First, it gives a brief overview of the computational protocols used to generate and understand reaction mechanisms with quantum mechanical methods. Then, a summary of the data-driven approach introduces the statistical methods and metrics that build relationships to chemical reactivity using computer-readable, mechanistically derived molecular descriptions. Chapter 2 tackles the challenge of studying chemical reactivity in large biological systems (e.g., peptides and proteins) with quantum mechanical methods. First, the precise photocatalytic functionalization reaction at selenocysteine developed by the Payne lab is simulated using a simplified model substrate, followed by a more realistic model that generates the final energy profile.
Based on the resulting computational analysis, the utility of this late-stage functionalization reaction is later demonstrated on large polypeptide chains. Chapters 3 and 4 turn to new bismuth chemistry developed by the Ball group. The bismuth arylation reaction published in Nature shaped the collaborative work discussed here, which ranges from the computational protocols implemented for selectivity problems to the versatile chemical reactivity originating from bismuth(V) reagents. Because of previously reported but otherwise unexplored DFT integration grid effects, the free energies computed for the organobismuth reactions explored here would otherwise have contained significant errors and led to incorrectly predicted selectivities. With the optimal computational protocols, new reactivity using organobismuth reagents is investigated in Chapter 3 to propose a reaction mechanism for the selective arylation of 2- and 4-pyridones. Chapter 4 describes the mechanistic investigation of the developed palladium-catalyzed cross-coupling reaction, which achieves challenging C-C couplings under mild reaction conditions with the amino-bridged bismacycle reagent. A statistical modeling approach using automated workflows discussed in Chapter 7 is applied here to predict an optimal reaction design and capture the origin of the reactivity for various coupling substrates and modified organobismuth(V) reagents in the developed Bisma-Stille cross-coupling reaction. Chapter 5 describes a mechanistic investigation to optimize a challenging key reaction in the total synthesis of the natural product allopupukeanane developed by the Sarpong group. The success of reactions in late-stage synthetic plans becomes critical as the availability of reactants in a multistep natural product synthesis becomes limiting. The elementary step influencing the reactivity is identified in the palladium-mediated cascade reaction.
Then, a data-driven approach is implemented to screen various ligands and collect mechanistically derived molecular DFT features to incorporate into a Bayesian optimization tool developed by the Doyle lab. Automated workflows discussed in Chapter 7 were used to collect the features. This approach successfully identified more suitable and efficient reaction conditions that address challenges with racemic mixtures, byproduct formation, and catalyst decomposition. The overall synthesis plan to access multiple natural products via the bridged bicycle scaffold highlighted in this chapter is an ongoing project by the Sarpong group. Chapter 6 pivots to data-driven approaches that formulate statistical relationships sampled over small and large datasets. First, the collaborative research in section 6.2 builds a multivariate linear regression model from a small dataset to explain the reaction performance in various solvents for the challenging reversible nucleophilic alcohol addition reaction developed by the Bandar group. The statistical conclusions provide the basis for modeling the solvent effects via DFT methods. Next, in section 6.3, a machine learning model is trained on a large, diverse molecular dataset to predict NMR chemical shifts that closely match DFT-derived NMR values at only a fraction of the cost of DFT methods. These are two examples in which a successful prediction is evaluated according to the research goal, whether model accuracy or interpretability. Chapter 7 focuses on facilitating transparency and reproducibility in collecting data and generating meaningful statistical models for the data chemist in low- and high-throughput studies. The open-source, automated workflows, DISCO and REGGAE, enabled the execution of the projects mentioned in Chapters 4 to 6 at different stages of the research process (e.g., chemical data collection, feature selection, and then statistical modeling).
2023 Summer.
Includes bibliographical references.
Rights Access
Embargo expires: 08/28/2024.
Associated Publications