Purdue University

Data + Code

Savoie group datasets and software projects

Yet Another Reaction Program (YARP)

YARP is a Python library developed by the Savoie group to predict reaction outcomes and elucidate reaction networks (pub1, pub2, pub3). The current GitHub project can be found here. It contains an object-oriented refactoring of many of the YARP routines and objects so that users can more easily incorporate parts of YARP into their own workflows.

Edge-Featured Graph Attention (EGAT)

EGAT is a PyTorch project developed by the Savoie group that implements edge-featured graph attention networks relevant to molecular property prediction and reaction prediction (pub).

Reaction Graph Depth (RGD) Dataset(s)

The reaction graph depth dataset(s) contain optimized reactant, product, and transition state geometries, intrinsic reaction coordinate (IRC) calculations, activation energies, and heats of reaction calculated using the Yet Another Reaction Program (YARP) method developed by our group. The reactions in the RGD datasets are enumerated using graph-based reaction rules, and model reactants/products are generated by truncating each molecule at a fixed number of bonds away from the reactive atoms. The initial RGD dataset, RGD1, used a truncation depth of one applied to CHON-containing molecules from PubChem. Expansions to other chemical scopes, molecular sizes, and levels of theory are ongoing and will be posted here as they become ready for public use.
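The truncation step described above can be illustrated with a breadth-first search over a molecular graph. This is only a minimal sketch and not YARP's implementation: the adjacency-dictionary representation, the function name `atoms_within_depth`, and the toy six-atom chain are all assumptions made for illustration, and how the cut bonds are treated afterwards is not covered here.

```python
from collections import deque


def atoms_within_depth(adjacency, reactive_atoms, depth):
    """Return the atoms at most `depth` bonds away from any reactive atom.

    `adjacency` maps each atom index to a list of bonded atom indices
    (a hypothetical representation chosen for this sketch).
    """
    # Distance (in bonds) from the nearest reactive atom, filled in by BFS.
    distance = {atom: 0 for atom in reactive_atoms}
    queue = deque(reactive_atoms)
    while queue:
        atom = queue.popleft()
        if distance[atom] == depth:
            continue  # do not expand past the truncation depth
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return set(distance)


# Hypothetical toy molecule: a linear chain of six atoms, 0-1-2-3-4-5,
# in which atoms 2 and 3 take part in the bond change.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
kept = atoms_within_depth(chain, reactive_atoms=[2, 3], depth=1)
print(sorted(kept))
```

With a depth of one, as used for RGD1, only the reactive atoms and their direct neighbors are retained in this toy example; atoms 0 and 5 are dropped.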

  1. RGD1 DFT  (RGD1_CHNO.h5)  (download): Contains reactant, product, and transition state geometries, heat of reaction, and activation energy for 126,857 distinct reactions calculated at the B3LYP-D3/TZVP level. A total of 176,992 transition states are present due to the discovery of multiple transition state conformations for 33,032 of the reactions.

    NOTE: An earlier version of item 1 uploaded to figshare reported activation energies for some reactions that corresponded to the reverse of the stated reaction (i.e., reactants and products were swapped). This issue has been corrected both on figshare and here.
  2. RGD1 DFT Raw Output Files  (RGD1_rawoutput.zip)  (download): Contains output files for the DFT-level refinements that the data in 1 are derived from.
  3. RGD1 xTB-level IRCs  (RGD1_xTB-IRCs.zip)  (download): Contains the low-level IRC calculations performed at the GFN2-xTB level of theory on the RGD1 reactions. These IRCs are calculated by YARP as a feature for identifying localization to unintended transition states. The IRCs are calculated with a fixed number of steps, so there is no guarantee that the endpoints correspond to GFN2-xTB minima. Nevertheless, these non-equilibrium geometries are potentially useful as a starting point for reactive complex optimization or machine learning.
  4. RGD1 CCSD(T)-F12 comparisons  (RGD1_CCSDpT-F12_test.csv)  (download): Contains a subset of reactions calculated at the CCSD(T)-F12/cc-pVDZ-F12//B3LYP-D3/TZVP level for comparison and potential use in transfer learning.
  5. RGD1 GFN2-xTB Reactant and Product  (Delta2ML_RPs.h5)  (download): Contains reactant, product, and transition state geometries, heat of reaction, and activation energy for 126,857 distinct reactions calculated at the GFN2-xTB level. A total of 176,992 transition states are present due to the discovery of multiple transition state conformations for 33,032 of the reactions.
  6. RGD1 GFN2-xTB Transition States  (Delta2ML_TSs.h5)  (download): Contains transition state geometries and activation energy for 126,857 distinct reactions calculated at the GFN2-xTB level. A total of 176,992 transition states are present due to the discovery of multiple transition state conformations for 33,032 of the reactions.
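The note under item 1 concerns reaction direction, which matters because the activation energy and heat of reaction both change when reactants and products are swapped. The sketch below shows the standard relationship between forward and reverse values. It assumes both quantities are energy differences of the same type and in the same units; the function name and the numbers are hypothetical and are not taken from the dataset.

```python
def swap_direction(forward_barrier, heat_of_reaction):
    """Return (reverse_barrier, reverse_heat) for the reversed reaction.

    Assumes both inputs are measured on the same energy scale:
      forward_barrier  = E(TS) - E(reactant)
      heat_of_reaction = E(product) - E(reactant)
    so that
      reverse_barrier  = E(TS) - E(product) = forward_barrier - heat_of_reaction
    """
    reverse_barrier = forward_barrier - heat_of_reaction
    reverse_heat = -heat_of_reaction
    return reverse_barrier, reverse_heat


# Hypothetical values (arbitrary but consistent energy units):
# an exothermic reaction with a forward barrier of 50.0 and a heat of -20.0.
reverse_barrier, reverse_heat = swap_direction(50.0, -20.0)
print(reverse_barrier, reverse_heat)
```

Applying the function twice returns the original pair, which gives a simple consistency check when comparing entries from different versions of a file.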

∆2 Model of Reaction Properties

This project provides a PyTorch implementation of a ∆2-learning model that uses GFN2-xTB-level optimized geometries and the corresponding single-point energies to predict DFT (B3LYP-D3/TZVP) and Gaussian-4 (G4) level single-point energies (publication). The model is trained on both equilibrium structures (e.g., reactants and products) and transition states, and thus can be used to predict activation energies for systems containing C, H, O, and N.
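The way such a model yields an activation energy can be sketched as follows, assuming the common additive form of ∆-learning in which a learned correction is added to the low-level energy. This is not the project's API: `high_level_energy`, `predicted_barrier`, and `toy_correction` are hypothetical names, and the toy correction is an arbitrary stand-in for the trained network.

```python
def high_level_energy(geometry, xtb_energy, predict_correction):
    """Low-level (GFN2-xTB) energy plus a learned correction (assumed additive form)."""
    return xtb_energy + predict_correction(geometry)


def predicted_barrier(reactant, transition_state, predict_correction):
    """Activation energy as E(TS) - E(reactant), both at the corrected level."""
    e_reactant = high_level_energy(
        reactant["geometry"], reactant["xtb_energy"], predict_correction
    )
    e_ts = high_level_energy(
        transition_state["geometry"], transition_state["xtb_energy"], predict_correction
    )
    return e_ts - e_reactant


def toy_correction(geometry):
    """Stand-in for the trained model: NOT a physical correction.

    Returns half the x-coordinate of the last atom so that the two toy
    structures below receive different corrections.
    """
    _, x, _, _ = geometry[-1]
    return 0.5 * x


# Hypothetical structures as lists of (element, x, y, z); energies are made up.
reactant = {"geometry": [("H", 0.0, 0.0, 0.0), ("H", 1.0, 0.0, 0.0)], "xtb_energy": -10.0}
ts = {"geometry": [("H", 0.0, 0.0, 0.0), ("H", 1.5, 0.0, 0.0)], "xtb_energy": -9.75}

barrier = predicted_barrier(reactant, ts, toy_correction)
print(barrier)
```

Because the correction is evaluated separately for the reactant and the transition state, the predicted barrier differs from the raw GFN2-xTB barrier only by the difference between the two corrections.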