Predictive Chemistry
Three panels comparing sampling strategies in a multi-modal fitness landscape: exhaustive enumeration, genetic algorithm clustering around a few peaks, and ACSESS covering all peaks with fewer evaluations

ACSESS: Diverse Optimal Molecules in the SMU

Property-optimizing ACSESS combines diversity-biased sampling with iterative fitness thresholding to discover diverse sets of molecules with favorable properties from the small molecule universe (SMU). Tested on GDB-9 (dipole moment optimization) and NKp fitness landscapes, it outperforms standard genetic algorithms in diversity while matching or exceeding their fitness, using only ~30,000 evaluations to navigate a 300,000-molecule space.
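The core idea, diversity-biased selection restricted to a fitness-thresholded pool, can be sketched with a greedy maximin rule. This is a toy illustration of the principle, not the published ACSESS algorithm; the 1-D "molecules", peak positions, and threshold are all hypothetical.

```python
def select_diverse_fit(pool, fitness, distance, k, threshold):
    """Greedy maximin selection under a fitness threshold: keep only
    candidates at or above the threshold, seed with the fittest, then
    repeatedly add whichever survivor is farthest from the current
    picks. A toy stand-in for ACSESS-style diversity-biased sampling."""
    survivors = [m for m in pool if fitness(m) >= threshold]
    if not survivors:
        return []
    picked = [max(survivors, key=fitness)]
    while len(picked) < min(k, len(survivors)):
        rest = [m for m in survivors if m not in picked]
        picked.append(max(rest, key=lambda m: min(distance(m, p) for p in picked)))
    return picked

# Toy multi-modal landscape: 1-D "molecules", fitness peaks at 2, 5, 8.
peaks = [2.0, 5.0, 8.0]
fitness = lambda x: max(1.0 - abs(x - p) for p in peaks)
pool = [i * 0.1 for i in range(100)]
hits = select_diverse_fit(pool, fitness, lambda a, b: abs(a - b),
                          k=3, threshold=0.8)
# Unlike a GA converging on one peak, the three picks land near
# three different peaks.
```

A plain genetic algorithm applied to the same landscape would tend to crowd around a single mode; the maximin step is what forces coverage of all three.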

Predictive Chemistry
Diagram showing AllChem's combinatorial synthon assembly pipeline: 7,000 building blocks transformed by 100 reactions into 5 million synthons, which combine in A-B-C topology to represent 10^20 structures

AllChem: Generating and Searching 10^20 Structures

AllChem generates ~5 million synthons by recursively applying ~100 reactions to ~7,000 building blocks; combined in an A-B-C topology, these synthons represent up to 10^20 complete structures. Topomer shape similarity enables efficient searching of this space, and every hit comes with a proposed synthetic route.
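The combinatorial leverage of the A-B-C topology is simple to see: the represented space is the product of the three fragment pools, so it never has to be enumerated. A minimal sketch with hypothetical synthon labels (the real pools, pool sizes per slot, and linkage chemistry are not reproduced here):

```python
from itertools import product

# Hypothetical synthon pools; AllChem holds ~5 million synthons total.
A = ["a1", "a2", "a3"]   # left-hand fragments
B = ["b1", "b2"]         # core fragments
C = ["c1", "c2", "c3"]   # right-hand fragments

# Every A-B-C combination is one virtual product: |A| * |B| * |C| structures.
virtual_products = [f"{a}-{b}-{c}" for a, b, c in product(A, B, C)]
assert len(virtual_products) == len(A) * len(B) * len(C)  # 18

# At scale: even three pools of ~1.7 million synthons each already
# represent on the order of 10^18 combinations without enumeration.
n = 1_700_000
print(f"{n**3:.1e}")
```

Searching by topomer shape similarity then operates on the fragments rather than the 10^20 products, which is what makes the space tractable.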

Predictive Chemistry
CHX8 enumeration pipeline from 77,524 structures to 31,497 stable molecules, example strained scaffolds with RSE values, and box plots of relative strain energy distribution by heavy atom count

CHX8: Complete Eight-Carbon Hydrocarbon Space

CHX8 exhaustively enumerates all mathematically feasible hydrocarbons with up to eight carbon atoms (77,524 structures), then DFT-optimizes them to identify 31,497 stable molecules. A universal relative strain energy (RSE) metric referenced to cyclohexane serves as a synthesizability proxy. CHX8 covers 16x more C8 hydrocarbons than GDB-13 and reveals that over 90% of novel structures should be synthetically accessible.
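A strain metric "referenced to cyclohexane" can be illustrated as a per-carbon energy excess over the cyclohexane increment. This is an assumed, simplified definition for illustration only, not the formula from the CHX8 paper, and the energy values below are hypothetical.

```python
def rse_per_carbon(e_mol, n_carbons, e_cyclohexane):
    """Toy relative strain energy per carbon, referenced to cyclohexane.

    Assumes RSE compares a molecule's per-carbon energy against the
    per-carbon energy of (strain-free) cyclohexane; an illustrative
    definition, not the one published with CHX8. Energies in kcal/mol."""
    return e_mol / n_carbons - e_cyclohexane / 6

# Hypothetical energies: a molecule whose per-carbon energy matches
# cyclohexane's has zero strain by this definition; anything higher
# in energy per carbon is strained.
assert rse_per_carbon(-600.0, 6, -600.0) == 0.0
```

Under any such reference scheme, a low RSE flags a ring system as plausibly synthesizable, which is how CHX8 uses the metric as a proxy.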

Predictive Chemistry
Six molecules with atoms colored by divalent (blue, simple) vs non-divalent (red, complex) nodes, showing increasing MC1 complexity from hexane to pivaloyl methylamine

Molecular Complexity from the GDB Chemical Space

Buehler and Reymond introduce two molecular complexity measures, MC1 (fraction of non-divalent nodes) and MC2 (count of non-divalent nodes excluding carboxyl groups), derived from analyzing synthesizability patterns in GDB-enumerated molecules. They compare these measures against existing complexity scores across GDB-13s, ZINC, ChEMBL, and COCONUT.
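MC1 is essentially a graph-degree statistic, so it can be sketched on a hydrogen-suppressed adjacency list. A minimal sketch under the assumption that "divalent" means heavy-atom degree exactly two; the paper's exact treatment of terminal atoms and MC2's carboxyl-group exclusion are not reproduced here.

```python
def mc1(adjacency):
    """MC1-style complexity: fraction of heavy atoms that are not
    divalent (heavy-atom degree != 2). A sketch of the idea only."""
    non_divalent = [a for a in adjacency if len(adjacency[a]) != 2]
    return len(non_divalent) / len(adjacency)

# n-Hexane as a C6 chain (hydrogens suppressed): only the two chain
# ends are non-divalent, so the score stays low.
hexane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

# Neopentane: one quaternary carbon with four methyl neighbors; every
# node is non-divalent, so branching drives the score to its maximum.
neopentane = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
```

The appeal of such a measure is that it needs nothing beyond connectivity, so it scales to GDB-sized enumerations.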

Predictive Chemistry
Three-stage canonical generation pipeline (geng, vcolg, multig) alongside a log-scale speed comparison showing Surge outperforming MOLGEN 5.0 by 42-161x across natural product molecular formulas

Surge: Fastest Open-Source Chemical Graph Generator

Surge is a constitutional isomer generator based on the canonical generation path method, using nauty for graph automorphism computation. Its three-stage pipeline (simple graph generation, vertex coloring for atom assignment, edge multiplicity for bond orders) generates 7-22 million molecules per second, outperforming MOLGEN 5.0 by 42-161x on natural product molecular formulas.

Predictive Chemistry
Grid of heteroaromatic ring systems rendered with RDKit, showing known ring systems in blue-tinted panels and predicted tractable rings in amber-tinted panels

VEHICLe: Heteroaromatic Rings of the Future

VEHICLe (Virtual Exploratory Heterocyclic Library) is a complete enumeration of 24,867 mono- and bicyclic heteroaromatic ring systems built from C, N, O, S, and H. Of these, only 1,701 have ever appeared in published compounds. A random forest classifier trained on known vs. unknown ring systems predicts that over 3,000 additional ring systems are synthetically tractable.
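The enumeration behind such a library has to deduplicate rings that are equivalent under rotation and reflection. A toy sketch of that canonicalization step, restricted to six-membered rings over a two-atom {C, N} alphabet (VEHICLe itself covers C/N/O/S mono- and bicyclic systems with valence and aromaticity rules this sketch ignores):

```python
from itertools import product

def canonical_ring(ring):
    """Canonical form of a cyclic atom sequence: the lexicographically
    smallest string over all rotations and reflections."""
    n = len(ring)
    forms = []
    for seq in (ring, ring[::-1]):
        forms += ["".join(seq[i:] + seq[:i]) for i in range(n)]
    return min(forms)

# Enumerate all 2^6 labelings, then collapse symmetry duplicates.
rings = {canonical_ring(list(p)) for p in product("CN", repeat=6)}
print(len(rings))  # 13 distinct rings, from benzene (CCCCCC) to hexazine
```

Scaling the same idea to more elements, two fused rings, and chemical validity filters is what yields VEHICLe's 24,867 systems.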

Predictive Chemistry
Three data transfer methods for retrosynthesis: pre-training plus fine-tuning, multi-task learning, and self-training

Data Transfer Approaches for Seq-to-Seq Retrosynthesis

A systematic study of data transfer techniques (multi-task joint training, self-training, and pre-training plus fine-tuning) applied to Transformer-based retrosynthesis. Pre-training on USPTO-Full followed by fine-tuning on USPTO-50K achieves the best results, improving top-1 accuracy from 35.3% to 57.4%.

Predictive Chemistry
Seq2seq encoder-decoder translating reactant SMILES to product SMILES for reaction prediction

Neural Machine Translation for Reaction Prediction

This 2016 paper first proposed treating organic reaction prediction as a neural machine translation problem, using a GRU-based sequence-to-sequence model with attention to translate reactant SMILES strings into product SMILES strings.

Predictive Chemistry
ReactionT5 two-stage pretraining from CompoundT5 to ReactionT5 with product prediction and yield results

ReactionT5: Pre-trained T5 for Reaction Prediction

ReactionT5 introduces a two-stage pretraining pipeline (compound then reaction) on the Open Reaction Database, enabling competitive product and yield prediction with as few as 30 fine-tuning reactions.

Predictive Chemistry
Bar chart showing RMSE improvement from SMILES augmentation across ESOL, FreeSolv, and lipophilicity datasets

Maxsmi: SMILES Augmentation for Property Prediction

A systematic study of SMILES augmentation strategies for molecular property prediction, showing that augmentation consistently improves CNN and RNN performance and that the variance of predictions across different SMILES of the same molecule serves as a practical uncertainty estimate.
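The augmentation itself amounts to writing the same molecule as many different valid SMILES strings. A minimal sketch using RDKit's randomized SMILES output; this illustrates the idea only and is not the Maxsmi code.

```python
from rdkit import Chem

def augment_smiles(smiles, n_variants=10):
    """Generate randomized SMILES spellings of one molecule, the core
    idea behind Maxsmi-style augmentation (a sketch, not Maxsmi)."""
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(n_variants):
        # doRandom picks a random atom ordering for the output string
        variants.add(Chem.MolToSmiles(mol, doRandom=True))
    return sorted(variants)

variants = augment_smiles("CCO")  # ethanol, e.g. "OCC", "C(C)O", ...
# Every variant parses back to the same canonical molecule.
canonical = Chem.MolToSmiles(Chem.MolFromSmiles("CCO"))
```

Training on all spellings exposes the model to string-level variation of identical chemistry, and the spread of its predictions over those spellings is what Maxsmi uses as an uncertainty signal.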

Predictive Chemistry
Bar chart showing MTL-BERT combining pretraining, multitask learning, and SMILES enumeration for best improvement

MTL-BERT: Multitask BERT for Property Prediction

MTL-BERT pretrains a BERT model on 1.7M unlabeled SMILES, then fine-tunes jointly on 60 ADMET and molecular property tasks using SMILES enumeration as data augmentation in all phases.

Predictive Chemistry
Bar chart comparing LLM-Prop band gap MAE against CGCNN, SchNet, MEGNet, and ALIGNN

LLM-Prop: Predicting Crystal Properties from Text

LLM-Prop uses the encoder half of T5, fine-tuned on Robocrystallographer text descriptions, to predict crystal properties. It outperforms GNN baselines like ALIGNN on band gap and volume prediction while using fewer parameters.