Brian Neltner has been interested in applying machine learning and automation tools to catalyst screening since 2010, after seeing the effectiveness of Symyx robotic screening tools. He has one issued patent in the area of automated catalyst manufacturing and testing equipment related to advancing the state of the art in catalyst design.

Since 2010, great strides have been made in using machine learning and adaptive learning to help improve these techniques, and Brian is very excited by the many new avenues for research this opens up. Below, there are several publications which Brian has found particularly inspiring and would like to incorporate into future work, with a brief description and explanation of how these pieces of research relate to long term project goals.

His current GitHub repository can be found at https://github.com/neltnerb/machine-learning and contains some very basic work on chemical solubility, following the general concepts explained in Kearnes, et al. below.

Molecular graph convolutions: moving beyond fingerprints, by Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley.

This paper is a fascinating effort to use neural networks to produce a molecular fingerprint for chemicals based on nothing but a graph of the atoms present. This approach is particularly interesting in that it did not require the use of physically measured features such as charge, electronegativity, hybridization, hydrogen bonding, or other high level features that chemists would typically use to predict the properties of a given molecule.

Despite limiting the features analyzed to just the atom type (with no information about what an "atom" means), bond types between atoms, and graph distance, they were able to achieve phenomenal performance in a very generalizable way. This opens up a lot of doors to solving the kinds of problems Brian is interested in by developing a robust "reaction fingerprint" to predict the products given features based on the reactants, a heterogeneous catalyst, and reaction conditions.

Brian's attempt at replicating the work of Kearnes, et al. Although this is a much smaller dataset, and the neural network used to train on the data is far simpler, the results clearly demonstrate the feasibility of the general approach.

This is the paper Brian first tried to replicate, both to become familiar with TensorFlow, a machine learning library for Python produced by Google, and to learn machine learning for computational chemistry. In his first project, he built a small dataset in the way prescribed by Kearnes, et al., starting from 1000 molecules with their solubilities and their structures in SMILES format. He converted each SMILES string into a feature set, using rdkit to turn the molecule into a list of atom and bond features and networkx to calculate graph distances between atoms. He then trained a simple deep neural network which, in a very short training time, produced some reasonable results as shown to the right.
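The feature construction described above can be sketched in a few lines. This is a minimal illustration and not the code from the repository: the molecule (ethanol, heavy atoms only) is written out by hand so that the sketch runs on the Python standard library alone, whereas the actual project used rdkit to parse SMILES and networkx for graph distances. The names `graph_distances`, `one_hot`, and the three-element vocabulary are assumptions made for the example.

```python
from collections import deque

# Ethanol (SMILES "CCO"), heavy atoms only, written out by hand.
# In practice a toolkit such as rdkit would produce these from the SMILES string.
atoms = ["C", "C", "O"]
bonds = {(0, 1): 1.0, (1, 2): 1.0}  # (atom index, atom index) -> bond order


def graph_distances(n_atoms, bonds):
    """Number of bonds on the shortest path between every pair of atoms (BFS)."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = [[None] * n_atoms for _ in range(n_atoms)]
    for start in range(n_atoms):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbors[current]:
                if dist[start][nxt] is None:
                    dist[start][nxt] = dist[start][current] + 1
                    queue.append(nxt)
    return dist


def one_hot(symbol, vocabulary=("C", "N", "O")):
    """Atom type as a one-hot vector; the vocabulary is illustrative only."""
    return [1 if symbol == s else 0 for s in vocabulary]


atom_features = [one_hot(symbol) for symbol in atoms]
distances = graph_distances(len(atoms), bonds)

# Pair features: (bond order or 0.0 if not bonded, graph distance).
pair_features = {
    (i, j): (bonds.get((i, j), 0.0), distances[i][j])
    for i in range(len(atoms))
    for j in range(i + 1, len(atoms))
}
```

Here the only inputs are atom type, bond type, and graph distance, which mirrors the restricted feature set discussed for Kearnes, et al.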

The results from the toy model shown to the right are Brian's partial replication of the results of Kearnes, et al. This simple proof of concept, along with Kearnes's more robust results, shows the fundamental feasibility of this approach for predicting physical properties using machine learning on molecular graphs.

Neural Networks for the Prediction of Organic Chemistry Reactions, by Jennifer N. Wei, David Duvenaud, and Alán Aspuru-Guzik

This paper predicts reaction types from a concatenated reaction fingerprint that combines the fingerprints of two reactants, each with one functional group, and a reagent, and then uses the predicted reaction type to determine the products. In this work, a database of chemicals and chemical transformations was constructed from textbook examples, and a neural network was trained on fingerprints of those examples. Subsequently, it was possible to use this system to predict the reaction type, and therefore the products, of a set of homework problems from the same textbook.

There are some limitations to this approach, mostly surrounding the limitation on number of reactants, number of functional groups, and the use of theoretical rather than experimental results in training. These problems are addressable by automation and screening technology which allow for very high throughput dataset generation. However, at present there are not very many good datasets for chemical reactions to train on, and those that exist are generally for processes carried out in batch (i.e. not a continuous production process such as those used in industrial chemistry). Those results using continuous production processes are generally not comparable to one another easily due to the differences in reactor design for different facilities, which makes analysis difficult.

As a proof of concept, this research seems like it can be extended by data mining research literature and patents, but ultimately what is needed is a robust dataset over many different catalysts and reactions which will be best synthesized with automated chemical reactors. The use of a chemical reactor also allows the incorporation of heterogeneous catalysts commonly used in industry and the ability to produce datasets that use the catalyst features as inputs to describe the overall reaction. Ultimately, extremely high throughput tools like Symyx wafer screening (sadly now defunct) or new automated systems can provide this training data.

Accelerated search for materials with targeted properties by adaptive design, by Dezhen Xue, Prasanna V Balachandran, John Hogden, James Theiler, Deqing Xue, and Turab Lookman

This paper fills in the third major piece of the puzzle of catalyst design. It describes "Adaptive Design", which closely couples the experimental design process with the experiments being carried out. Rather than always choosing the candidate predicted to perform best, it balances improving the experimental results against exploring poorly understood areas of the feature space. This approach goes beyond prediction: it helps the system essentially answer the question, "What experiment should I do next in order to increase my understanding the most?"

The ability to pick high information experiments is the essence of experimental design, and this paper shows a way to improve how we explore the parameter space when experiments are very expensive. This is certainly the case for chemical manufacturing processes: an experiment may take hours or days to complete, so experimental time is a dominant cost limiting the knowledge of such a system. In this publication, the authors used a variety of basic materials properties, such as the metallic radius and valence electron number, chosen based on scientific understanding of how such properties relate to the target, in order to identify a shape memory alloy with low thermal hysteresis.
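The trade-off between exploring uncertain regions and improving on the best known result can be made concrete with an acquisition function. The sketch below uses expected improvement, a standard choice for this kind of problem; treating it as the selection rule here is an assumption for illustration and not a statement of the paper's exact procedure. The candidate names and their predicted means and uncertainties are made up, and the target (thermal hysteresis) is minimized.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement over best_so_far when lower values are better."""
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_so_far - mean) * cdf + std * pdf


# Hypothetical model predictions: candidate -> (predicted mean, uncertainty).
candidates = {
    "A": (1.9, 0.05),  # slightly better than the best so far, well understood
    "B": (2.2, 1.0),   # slightly worse on average, poorly understood
    "C": (3.0, 0.1),   # clearly worse, well understood
}
best_measured = 2.0

scores = {
    name: expected_improvement(mean, std, best_measured)
    for name, (mean, std) in candidates.items()
}
choice = max(scores, key=scores.get)
```

With these illustrative numbers the poorly understood candidate "B" is selected ahead of "A", even though "A" has the better predicted mean, because its large uncertainty leaves more room for a substantial improvement.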

In the more general case, Brian believes that molecular fingerprints and reaction fingerprinting can be combined with a newly developed set of basic catalyst features, such as atomic composition, surface area, and pKa. The result would be a hybrid reaction fingerprint that includes both the catalyst and the molecular features, which could be used to rapidly identify the catalysts and reactants whose testing will generate the largest increase in model confidence.

Automatic chemical design using a data-driven continuous representation of molecules, by Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik

This paper adds the concept of an autoencoder to the chemical data analysis. An autoencoder is a network trained to reproduce its own inputs at its outputs, with a small number of nodes as a pinch point in the middle. By analogy to image processing, this is effectively a compression and decompression scheme trained simultaneously, where the compressed form is represented by the outputs at the pinch point. In this paper, because the intermediate latent representation uses continuous variables, the authors are able to introduce variations to the representation of a training example, predict the properties of the resulting "similar" molecules, and then decode the fingerprints of the identified candidates into molecular graphs.
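The pinch point idea can be shown with a deliberately tiny example. The sketch below is a linear autoencoder with two inputs, one latent value, and two outputs, trained by plain gradient descent on points that lie along a line. It is far simpler than the model in the paper and uses no machine learning library; the data, initial weights, learning rate, and function names are all assumptions chosen for illustration.

```python
# Two-dimensional points that really only carry one degree of freedom: (t, 2t).
DATA = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def encode(x, w):
    """Compress a 2D point to a single latent value (the pinch point)."""
    return w[0] * x[0] + w[1] * x[1]


def decode(z, v):
    """Expand the latent value back into a 2D point."""
    return [v[0] * z, v[1] * z]


def train_autoencoder(data, steps=2000, lr=0.02):
    """Minimize mean squared reconstruction error by gradient descent."""
    w = [0.5, 0.1]  # encoder weights (arbitrary starting values)
    v = [0.1, 0.5]  # decoder weights (arbitrary starting values)
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        for x in data:
            z = encode(x, w)
            rebuilt = decode(z, v)
            err = [rebuilt[i] - x[i] for i in range(2)]
            # Error passed back through the decoder to the latent value.
            back = err[0] * v[0] + err[1] * v[1]
            for i in range(2):
                grad_v[i] += 2.0 * err[i] * z / n
                grad_w[i] += 2.0 * back * x[i] / n
        for i in range(2):
            w[i] -= lr * grad_w[i]
            v[i] -= lr * grad_v[i]
    return w, v


w, v = train_autoencoder(DATA)
latent = encode((0.5, 1.0), w)      # compressed form: a single number
reconstructed = decode(latent, v)   # decompressed back to two numbers
```

Because the data has only one underlying degree of freedom, a single latent value is enough to reconstruct each point; the same principle, with nonlinear networks and far larger inputs, is what lets a latent vector stand in for a molecule.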

The advantage of this approach with molecular representation is that, unlike the approaches above, you can decode the generated fingerprint to interpret the results clearly. Given a new reaction fingerprint, it would be possible to decode this into a set of physical catalyst features and a list of reactants and conditions necessary to carry out further experimental testing.

This ability fits nicely with an adaptive design approach. For instance, the reaction fingerprint may be varied statistically, as done in this publication, and the products estimated rapidly using the trained neural network. Subsequently, the variant with the highest score on making a desired product is decoded and tested experimentally, both verifying the result and informing the model with more data to refine future results. It can also be used to identify areas of the reaction space where there is very sparse data, allowing the adaptive learning system to explore novel catalysts and reactions which may be of societal importance.
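The vary-then-score loop described above might look like the following sketch. The function `predicted_yield` is a hypothetical stand-in for the trained neural network (here just a smooth function with a known peak), and the fingerprint length, noise scale, and number of variants are arbitrary choices for illustration.

```python
import random


def predicted_yield(fingerprint):
    """Hypothetical stand-in for the trained network; peaks at (1.0, -1.0, 0.5)."""
    target = (1.0, -1.0, 0.5)
    return -sum((f - t) ** 2 for f, t in zip(fingerprint, target))


def propose_variants(fingerprint, n_variants, scale, rng):
    """Add Gaussian noise to each element to produce 'similar' fingerprints."""
    return [
        [f + rng.gauss(0.0, scale) for f in fingerprint]
        for _ in range(n_variants)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
seed_fingerprint = [0.0, 0.0, 0.0]

# Keep the original in the pool so the selected candidate is never worse than it.
pool = [seed_fingerprint] + propose_variants(seed_fingerprint, 50, 0.3, rng)
best = max(pool, key=predicted_yield)
```

In the full scheme, `best` would be passed to the decoder to recover a catalyst, reactants, and conditions, tested experimentally, and the measured result added to the training data.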

As an example of these pieces fitting together, consider a dataset with features of:

Reaction Conditions (flow rate, temperature, pressure)
Reactants (input as a molecular graph)
Catalyst (atomic composition, surface area, crystalline phase, oxidation state, ...)

In this example, consider an arbitrary number of reactants, each encoded by identical networks A to produce fingerprints for each reactant. An autoencoder is used during training to penalize fingerprints without enough information to decode back into the original molecular graph. The same autoencoder is used for every reactant, which allows it to be extensible to many reactants as long as they are subsequently concatenated and reduced to a fixed length "reactant fingerprint" containing the features of all of the reactants together.

A separate autoencoder B encodes the catalyst features, which is particularly important here because we would like to be able to decode the algorithm's suggested new catalysts back into features which we may attempt to create in the physical world. The fingerprints of the catalyst and all of the reactants are then concatenated with the reaction conditions and put through a final autoencoder C, which allows a reaction with an arbitrary number of reactants to be expressed as a single fixed length representation. Generally speaking, our ability to synthesize a specified catalyst is substantially greater than our ability to predict its performance, so given a set of surface areas, atomic concentrations, and so on, it is likely that many of the predicted catalysts can be manufactured with straightforward existing techniques.

The final concatenated and fingerprinted representation of the entire reaction is fed to a neural network trained with the product distribution as labels. Training the pieces together produces a model that allows decoding of the reactants, catalyst, and reaction conditions while also ensuring that the intermediate representations contain all the information critical to predicting the reaction products accurately.
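The data flow of this example can be sketched without any trained networks. In the code below, `encode_reactant` is a placeholder for the shared network A (it only counts atoms by element), and the reduction to a fixed length is done with an element-wise sum, which is one possible choice assumed here for simplicity. The feature names, key orderings, and numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

ATOM_VOCAB = ("C", "H", "O", "N")  # illustrative vocabulary only


@dataclass
class Reaction:
    conditions: Dict[str, float]  # flow rate, temperature, pressure
    reactants: List[List[str]]    # each reactant as a list of atom symbols
    catalyst: Dict[str, float]    # e.g. surface area, composition fractions


def encode_reactant(atom_symbols):
    """Placeholder for shared network A: counts atoms of each element."""
    return [float(atom_symbols.count(s)) for s in ATOM_VOCAB]


def reactant_fingerprint(reactants):
    """Reduce any number of reactant fingerprints to one fixed-length vector."""
    pooled = [0.0] * len(ATOM_VOCAB)
    for reactant in reactants:
        for i, value in enumerate(encode_reactant(reactant)):
            pooled[i] += value
    return pooled


def reaction_input(reaction, condition_keys, catalyst_keys):
    """Concatenate pooled reactants, catalyst features, and conditions."""
    return (
        reactant_fingerprint(reaction.reactants)
        + [reaction.catalyst[k] for k in catalyst_keys]
        + [reaction.conditions[k] for k in condition_keys]
    )


CONDITION_KEYS = ("flow_rate", "temperature", "pressure")
CATALYST_KEYS = ("surface_area", "pt_fraction")

conditions = {"flow_rate": 1.0, "temperature": 500.0, "pressure": 2.0}
catalyst = {"surface_area": 120.0, "pt_fraction": 0.05}

one_reactant = Reaction(conditions, [["C", "O"]], catalyst)
three_reactants = Reaction(
    conditions, [["C", "O"], ["H", "H"], ["O", "O"]], catalyst
)

vec_one = reaction_input(one_reactant, CONDITION_KEYS, CATALYST_KEYS)
vec_three = reaction_input(three_reactants, CONDITION_KEYS, CATALYST_KEYS)
```

Both vectors have the same length regardless of how many reactants are present, which is the property the downstream network relies on. A sum discards the order of the reactants; a learned reduction, as in the text, could retain more information.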

With this tool, it may be feasible to vary the fingerprinted reaction to identify "similar" reactions, test these slight variations of the original training examples with the downstream neural network to predict their performance, and use a gradient method to explore the latent feature space. After identifying reaction fingerprints with a great deal of promise in producing our desired products, the decoder can tell us the reactant list, catalyst features, and conditions. More ambitiously, identifying clusters of reaction types using real experimental data may provide unique new guidance about what these reactions actually do, for instance by identifying new reaction mechanisms which are currently unknown.
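A gradient method over the latent space could be sketched as follows. Here `predicted_score` is a hypothetical stand-in for the downstream network, and the gradient is estimated by central finite differences so the example needs no automatic differentiation library; a real implementation would more likely take gradients directly from the trained model. The step count and learning rate are arbitrary.

```python
def predicted_score(z):
    """Hypothetical stand-in for the downstream network; peaks at (1.0, -1.0, 0.5)."""
    target = (1.0, -1.0, 0.5)
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def numerical_gradient(f, z, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        down = list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def gradient_ascent(f, z, steps=100, lr=0.1):
    """Move the latent vector uphill on the predicted score."""
    z = list(z)
    for _ in range(steps):
        grad = numerical_gradient(f, z)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


start = [0.0, 0.0, 0.0]
optimized = gradient_ascent(predicted_score, start)
```

The optimized latent vector would then be handed to the decoder to recover the reactant list, catalyst features, and conditions it corresponds to. On a real model the score surface is not a single smooth peak, so the result would depend on the starting point.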

By combining a system like the above example with an adaptive training system to design and carry out tests, it would be possible to rapidly explore the space of catalysts and generate a predictive model for which catalysts and conditions to use for a novel product molecule. The value of such a model would be incalculable to the greater chemical industry. By providing a way to rapidly pick and improve upon existing industrial catalysts, or enable entire new processes, it can provide a way to increase the energy and resource efficiency of chemical manufacturing while lowering production costs. Such a tool would allow chemical manufacturers to rapidly scale new processes up to commercial scale by speeding up the typically multi-year process of catalyst development.

Further, by incorporating information about the price, toxicity, and availability of different chemicals along with predictions of the cost of different industrial processes it becomes feasible to not just identify the best catalyst and set of reactants for a target molecule, but to identify the full industrial chemical process best suited to produce the target molecule at the lowest cost. This would allow for incredible improvements to the chemical supply chain, a key foundation upon which modern society is based.