<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Method Papers: New Algorithms, Architectures, and Mechanisms on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/paper-types/method/</link><description>Recent content in Method Papers: New Algorithms, Architectures, and Mechanisms on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 12 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/paper-types/method/index.xml" rel="self" type="application/rss+xml"/><item><title>MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/</link><pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/</guid><description>MB-nrg decomposes polyalanine into n-mer building blocks fit to DLPNO-CCSD(T) references, reaching coupled-cluster accuracy for gas-phase peptide dynamics.</description><content:encoded><![CDATA[<h2 id="a-modular-mb-nrg-method-for-biomolecular-potentials">A Modular MB-nrg Method for Biomolecular Potentials</h2>
<p>This is a <strong>Method</strong> paper. Zhou and colleagues extend the MB-nrg (many-body energy) formalism to covalently bonded biomolecules and build the first coupled-cluster-accurate potential energy function (PEF) for polyalanine in the gas phase. The contribution has three parts: a generalization of the MB-nrg decomposition from whole-molecule 1-mers to functional-group &ldquo;natural building blocks,&rdquo; a DLPNO-CCSD(T)/aug-cc-pVTZ training protocol driven by parallel-bias metadynamics sampling, and a demonstration that the resulting PEF reproduces alanine dipeptide energetics and AceAla$_9$Nme secondary-structure dynamics more faithfully than the Amber ff14SB and ff19SB force fields.</p>
<h2 id="why-empirical-force-fields-fall-short-for-protein-dynamics">Why Empirical Force Fields Fall Short for Protein Dynamics</h2>
<p>Protein dynamics span femtosecond vibrations to millisecond conformational changes, and capturing them at atomic resolution is central to understanding catalysis, allostery, and ligand binding. Classical force fields such as CHARMM, OPLS, and Amber approximate the potential energy surface with pairwise-additive analytical terms. This functional form struggles with the many-body interactions that shape disordered regions of proteins, including exchange-repulsion, charge transfer, charge penetration, and cooperative hydrogen bonding. Polarizable force fields add induced dipoles but remain empirically parameterized and fail to capture short-range many-body effects from electron-density overlap.</p>
<p>Quantum-mechanical methods avoid this, but <a href="https://en.wikipedia.org/wiki/Coupled_cluster">coupled cluster theory</a> scales as $\mathcal{O}(N^7)$ in the number of electrons and even DFT remains $\mathcal{O}(N^3)$ to $\mathcal{O}(N^4)$, ruling out direct ab initio molecular dynamics for biomolecules. Fragmentation methods like molecular fractionation with conjugate caps (MFCC) mitigate the cost, but they truncate the many-body expansion at two bodies and miss long-range hydrogen bonding. <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">Machine-learned force fields (MLFFs)</a> reach near-QM accuracy at lower cost, yet they typically train on DFT data (inheriting delocalization errors and poor dispersion), struggle with interpretability, and extrapolate unreliably. Existing permutationally invariant polynomial (PIP) approaches scale factorially in the number of atoms, capping direct applicability at roughly ten to fifteen atoms per fragment.</p>
<p>MB-nrg PEFs based on the many-body expansion and PIPs have successfully modeled water, halides in water, carbon dioxide, methane, ammonia, dinitrogen pentoxide, and N-methylacetamide. Extending them to covalently bonded biomolecules requires rethinking what counts as a &ldquo;body.&rdquo;</p>
<h2 id="building-polyalanine-from-functional-group-n-mers">Building Polyalanine from Functional-Group n-mers</h2>
<p>The MB-nrg formalism starts from the many-body expansion of the total energy,</p>
<p>$$
E_N(1, \dots, N) = \sum_{i=1}^{N} \varepsilon^{1\mathrm{B}}(i) + \sum_{i&lt;j}^{N} \varepsilon^{2\mathrm{B}}(i,j) + \sum_{i&lt;j&lt;k}^{N} \varepsilon^{3\mathrm{B}}(i,j,k) + \dots + \varepsilon^{N\mathrm{B}}(1, \dots, N)
$$</p>
<p>where each $n$-body contribution is defined recursively as the $n$-mer energy minus all lower-order terms. The full PEF combines physics-based and data-driven components,</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}} = V_{\mathrm{ML}} + V_{\mathrm{phys}}
$$</p>
<p>with $V_{\mathrm{ML}} = V_{\mathrm{ML}}^{1\mathrm{B}} + V_{\mathrm{ML}}^{2\mathrm{B}} + V_{\mathrm{ML}}^{3\mathrm{B}}$ capturing short-range quantum-mechanical interactions, and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}} + V_{\mathrm{rep}}$ supplying electrostatics, dispersion, and repulsion. Dispersion follows a Tang-Toennies damped $C_6/R^6$ form with XDM-derived coefficients; electrostatics uses a Thole-modified self-consistent polarization model inherited from MB-pol; the repulsion term is a Lennard-Jones $R^{-12}$ contribution borrowed from Amber ff14SB, activated only for non-bonded atom pairs not covered by a PIP.</p>
<p>Each data-driven $n$-body term is expressed as</p>
<p>$$
V_{\mathrm{ML}}^{n\mathrm{B}} = \sum_{\mathrm{M}_1 &lt; \dots &lt; \mathrm{M}_n}^{N} s^{n\mathrm{B}}(\mathrm{M}_1, \dots, \mathrm{M}_n) \, V_{\mathrm{PIP}}^{n\mathrm{B}}(\mathrm{M}_1, \dots, \mathrm{M}_n)
$$</p>
<p>where $V_{\mathrm{PIP}}^{n\mathrm{B}}$ is a permutationally invariant polynomial in Morse-like variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$ and $s^{n\mathrm{B}}$ is a switching function.</p>
<p>The key extension in this paper, building on earlier work on linear alkanes, is to treat functional groups (not whole molecules) as 1-mers. An Ace-capped, Nme-capped polyalanine chain decomposes into three distinct 1-mer types (-CH-, CH$_3$-, -CONH-), five distinct 2-mer types, and six distinct 3-mer types, for 14 unique PIPs that cover every $n$-mer appearing in any AceAla$_n$Nme chain. Cleaving covalent bonds between 1-mers would produce radicals, so the authors cap dangling valences with &ldquo;ghost&rdquo; hydrogen atoms at fixed C-H (1.14 Å) and N-H (1.09 Å) distances. Each $n$-mer energy is then referenced to its own optimized H-capped structure,</p>
<p>$$
E_n(1, \dots, n) = E_n^{\mathrm{H\text{-}capped}}(1, \dots, n) - E_n^{\mathrm{H\text{-}capped,opt}}(1, \dots, n).
$$</p>
<p>In the current implementation, only covalently bonded $n$-mers receive PIPs, the 2-body contribution from a dimer with one intervening 1-mer is folded into the corresponding 3-body term, and non-bonded 1-mers interact through the Lennard-Jones repulsion alone. Crucially, no whole-chain polyalanine data enters any stage of training: every PIP is parameterized on isolated $n$-mer configurations, and the total energy is reconstructed through the many-body expansion.</p>
<h2 id="training-on-dlpno-ccsdt-with-metadynamics-sampling">Training on DLPNO-CCSD(T) with Metadynamics Sampling</h2>
<p>Training sets are generated for each of the 14 $n$-mer types using <a href="https://en.wikipedia.org/wiki/Metadynamics">parallel-bias metadynamics (PBMetaD)</a> with partitioned families, biasing heavy-atom bonds, angles, and dihedrals across 300 K, 500 K, and 700 K in LAMMPS interfaced with PLUMED and modified OPLS/CM1A and Amber ff14SB force fields. For each $n$-mer, 200,000 candidate configurations are sampled, then reduced to roughly 10,000-20,000 training configurations (and about 1,000 test configurations) through Mini-batch K-means clustering on chemically equivalent pairwise distances. Reference energies are computed at the DLPNO-CCSD(T)/aug-cc-pVTZ level in ORCA.</p>
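<p>The pool-reduction step can be pictured with a minimal mini-batch k-means over per-configuration descriptor vectors, followed by keeping the configuration nearest each centroid. This is an illustrative sketch, not the authors' pipeline; the initialization and parameters are assumptions:</p>

```python
import numpy as np

def minibatch_kmeans_select(X, n_clusters, batch=256, iters=50, seed=0):
    """Reduce a configuration pool to one representative per cluster:
    minimal mini-batch k-means on descriptor vectors (e.g. chemically
    equivalent pairwise distances), then pick the nearest pool member
    to each centroid."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # greedy farthest-point initialization keeps initial centers spread out
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    counts = np.zeros(n_clusters)
    for _ in range(iters):
        B = X[rng.choice(len(X), min(batch, len(X)), replace=False)]
        assign = np.argmin(((B[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for x, c in zip(B, assign):
            counts[c] += 1
            centers[c] += (x - centers[c]) / counts[c]  # per-center learning rate
    # representative configuration = nearest pool member to each centroid
    return np.unique(np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=0))
```

<p>In the paper's setting the 200,000-configuration pools are reduced this way to the 10,000-20,000 centroids' nearest neighbors that form the training sets.</p>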
<p>Each PIP minimizes a weighted, ridge-regularized sum of squared errors,</p>
<p>$$
\chi^2 = \sum_{k \in \mathcal{S}} w_k \left[ V^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}(k) \right]^2 + \Gamma^2 \sum_l c_l^2
$$</p>
<p>with $\Gamma = 0.0005$ throughout and low-energy bias weights</p>
<p>$$
w_k = \left( \frac{\delta E}{\varepsilon^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}_{\min} + \delta E} \right)^2.
$$</p>
<p>MB-Fit handles the fit, combining simplex optimization for non-linear parameters $k_{\tau(ij)}$ with ridge regression for the linear coefficients $c_l$.</p>
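<p>With the non-linear decay constants $k_{\tau(ij)}$ held fixed by the simplex outer loop, the inner linear solve is a weighted ridge regression. A minimal sketch (illustrative; the $\delta E$ value is a placeholder, since Paper I does not state it in this summary and Paper II reports 40 kcal/mol for the water dimers):</p>

```python
import numpy as np

def fit_pip_coeffs(A, energies, gamma=0.0005, delta_e=40.0):
    """Solve the weighted, ridge-regularized least squares for the linear
    PIP coefficients c_l. Rows of A hold the symmetrized monomials
    evaluated at each training configuration."""
    A = np.asarray(A, dtype=float)
    eps = np.asarray(energies, dtype=float)
    w = (delta_e / (eps - eps.min() + delta_e)) ** 2  # low-energy bias weights
    sqw = np.sqrt(w)
    Aw, bw = A * sqw[:, None], eps * sqw
    # normal equations with Tikhonov term Gamma^2 * I
    return np.linalg.solve(Aw.T @ Aw + gamma**2 * np.eye(A.shape[1]), Aw.T @ bw)
```

<p>The weights down-weight high-energy configurations smoothly rather than discarding them, so repulsive walls still constrain the fit.</p>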
<p>Table 1 in the paper reports, for each of the 14 PIPs, the polynomial degree (5 for the smaller -CH- and CH$_3$- 1-mers, 3 for the larger -CONH- 1-mer and for all 2-mers and 3-mers), the number of symmetrized monomials (ranging from 635 for the -CH- and CH$_3$- 1-mers to 2871 for the -CONH-CH-CONH- 3-mer), the training-set size, and RMSDs for the train and test splits. All training RMSDs stay below 0.4 kcal/mol and all test RMSDs below 0.5 kcal/mol, with the smallest errors for the -CH- and CH$_3$- 1-mers (0.05 kcal/mol train, 0.14 kcal/mol test) and the largest test RMSD (0.47 kcal/mol) for the -CONH-CH- 2-mer.</p>
<p>MD validations run in LAMMPS interfaced with MBX and PLUMED. For alanine dipeptide metadynamics, bias potentials on the backbone $\varphi$ and $\psi$ angles are deposited every 500 steps with a 1.0 kJ/mol height and 11.46° width over 10 ns trajectories in the NVT ensemble, using the velocity-Verlet integrator with a 0.5 fs time step. Analogous MetaD runs with Amber ff14SB and ff19SB are performed in Amber23. The longer AceAla$_9$Nme trajectories start from fully extended structures and run in a 100 Å × 100 Å × 100 Å gas-phase box.</p>
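<p>As a rough illustration, the stated MetaD settings map onto a PLUMED input along these lines. This is a hypothetical fragment, not the authors' actual file: the atom indices are placeholders for the real topology, and the 11.46° Gaussian width becomes 0.2 rad in PLUMED's radian units for torsions.</p>

```
# hypothetical PLUMED input mirroring the stated MetaD parameters
phi: TORSION ATOMS=5,7,9,15
psi: TORSION ATOMS=7,9,15,17
METAD ARG=phi,psi PACE=500 HEIGHT=1.0 SIGMA=0.2,0.2 FILE=HILLS
PRINT ARG=phi,psi STRIDE=500 FILE=COLVAR
```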
<h2 id="ccsdt-energy-landscapes-free-energy-surfaces-and-helix-dynamics">CCSD(T) Energy Landscapes, Free-Energy Surfaces, and Helix Dynamics</h2>
<p><strong>Alanine dipeptide 2D PES.</strong> Alanine dipeptide geometries are optimized on a <a href="https://en.wikipedia.org/wiki/Ramachandran_plot">Ramachandran</a> grid with 10° spacing at the RI-MP2/def2-TZVP level and then evaluated at DLPNO-CCSD(T)/aug-cc-pVTZ. Despite never seeing whole alanine dipeptide in training, MB-nrg closely matches the reference locations and relative energies of four minima ($m_1$ to $m_4$), three maxima ($M_1$ to $M_3$), and one saddle point ($X$). Amber ff14SB and ff19SB capture the minima reasonably but badly overshoot the barriers: at $M_1$, MB-nrg undershoots the reference by only 2.41 kcal/mol, while ff14SB and ff19SB overshoot it by 7.50 and 7.83 kcal/mol. The authors also note that ff19SB incorrectly orders the secondary minima by predicting $m_3$ lower than $m_2$.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>RMSD overall (kcal/mol)</th>
          <th>RMSD $\leq 10$ kcal/mol</th>
          <th>RMSD $&gt; 10$ kcal/mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MB-nrg</td>
          <td>1.27</td>
          <td>1.18</td>
          <td>1.59</td>
      </tr>
      <tr>
          <td>Amber ff14SB</td>
          <td>6.33</td>
          <td>5.72</td>
          <td>8.44</td>
      </tr>
      <tr>
          <td>Amber ff19SB</td>
          <td>5.23</td>
          <td>4.79</td>
          <td>6.81</td>
      </tr>
  </tbody>
</table>
<p>The authors attribute MB-nrg&rsquo;s residual high-energy error to terminal methyl groups approaching the backbone in conformations where non-bonded 1-mer interactions are modeled by the simple LJ repulsion rather than an explicit PIP.</p>
<p><strong>Harmonic vibrations.</strong> Normal modes for the $m_1$ and $m_4$ alanine dipeptide conformers, computed by diagonalizing the Hessian, match RI-MP2/def2-TZVP references with mean deviations of 17.41 cm$^{-1}$ and 21.07 cm$^{-1}$ across all 60 modes. The authors acknowledge that some of this discrepancy reflects differences in theoretical levels (MB-nrg is trained to CCSD(T)/aug-cc-pVTZ, while the reference normal modes are computed at RI-MP2/def2-TZVP).</p>
<p><strong>Free-energy surfaces.</strong> Well-tempered metadynamics at 300 K produces 2D free-energy surfaces over $(\varphi, \psi)$. MB-nrg yields a smoother FES whose extrema line up with the DLPNO-CCSD(T) reference PES. Amber ff14SB and ff19SB remain reasonable near the low-energy $m_1$ and $m_2$ minima but systematically overestimate the barriers near $M_1$, $M_2$, and $M_3$, which the authors argue artificially confines the dipeptide and suppresses conformational transitions.</p>
<p><strong>Secondary structure in AceAla$_9$Nme.</strong> In 600 ps NVT MD starting from a fully extended structure, the <a href="https://en.wikipedia.org/wiki/STRIDE_(algorithm)">STRIDE algorithm</a> tracks residue-level secondary structures. Amber ff14SB and ff19SB collapse into $\alpha$-helices at roughly 40 ps and 80 ps, respectively, with ff19SB remaining especially rigid. MB-nrg takes about 100 ps before helices begin to form and then exhibits continuous oscillations between $3_{10}$- and $\alpha$-helical conformations. Ramachandran plots over the nine alanine residues show MB-nrg exploring the &ldquo;bridge&rdquo; region ($\varphi &lt; 0°$, $-20° \leq \psi \leq 20°$) associated with $3_{10}$-helices and sampling the left-handed $\alpha_L$ region that Amber rarely visits. The authors tie this flexibility to experimental observations of alanine-rich peptides in the gas phase and to similar predictions from GEMS and MACE-OFF.</p>
<h2 id="transferability-without-whole-chain-training-data">Transferability Without Whole-Chain Training Data</h2>
<p>The paper demonstrates that a modular, bottom-up PEF built from functional-group $n$-mers can reach CCSD(T) accuracy for polyalanine in the gas phase without ever training on whole-chain data. Truncating explicit data-driven terms at the 3-body level appears to balance cost and fidelity, with long-range effects handled by many-body polarization in $V_{\mathrm{elec}}$ and by Amber-derived repulsion between distant 1-mers. The 2D PES, harmonic frequencies, free-energy surface, and secondary-structure dynamics each validate a different facet of the model.</p>
<p>The authors are explicit about limitations. The current PEF applies only to gas-phase polyalanine; solvent effects and other amino acids remain open. The Lennard-Jones repulsion for non-bonded 1-mers is a placeholder for eventual 2-body PIPs that should capture short-range interactions during folding. Long-range hydrogen bonding in compact secondary structures (π-helices, $3_{10}$-helices, $\alpha$-helices) may produce non-negligible higher-order many-body contributions that the current 3-body truncation omits. The 2-body contribution from a dimer with one intervening monomer is currently folded into the 3-body term because of steric conflicts between capping hydrogens, and a systematic fix is flagged for future work. The authors position this paper as the first in a series (the &ldquo;I.&rdquo; in the title refers to &ldquo;Polyalanine in the Gas Phase&rdquo;) that will extend MB-nrg to broader biomolecular systems under physiological conditions. The follow-up, <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/">MB-nrg in Solution: Polyalanine in Water with CCSD(T) PEFs</a>, adds explicit 1-mer/water 2-body PIPs and benchmarks alanine dipeptide solvation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Per $n$-mer pools from PBMetaD in LAMMPS/PLUMED</td>
          <td>200,000 configurations each, reduced to ~10-20k via Mini-batch K-means</td>
          <td>OPLS/CM1A and Amber ff14SB sampled at 300 K, 500 K, 700 K</td>
      </tr>
      <tr>
          <td>Training labels</td>
          <td>DLPNO-CCSD(T)/aug-cc-pVTZ in ORCA</td>
          <td>14 unique $n$-mer types</td>
          <td>Domain-based local pair natural orbital approximation to canonical CCSD(T)</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Held-out $n$-mer configurations</td>
          <td>~1,000 per $n$-mer</td>
          <td>Same clustering protocol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide benchmark</td>
          <td>Ramachandran grid at 10° spacing, RI-MP2/def2-TZVP geometries</td>
          <td>1,296 grid points (approximate)</td>
          <td>Single-point energies at DLPNO-CCSD(T)/aug-cc-pVTZ, ff14SB, ff19SB, MB-nrg</td>
      </tr>
      <tr>
          <td>AceAla$_9$Nme dynamics</td>
          <td>600 ps NVT MD from fully extended start</td>
          <td>Single trajectory per model</td>
          <td>STRIDE for secondary-structure assignment</td>
      </tr>
  </tbody>
</table>
<p>Per the Data Availability statement, &ldquo;any data generated and analyzed in this study are available from the authors upon request.&rdquo; No public release is announced in the text.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Many-body expansion of the energy with 1-, 2-, and 3-body data-driven terms.</li>
<li>Permutationally invariant polynomials in Morse-exponential variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$, symmetrized over chemically equivalent atoms.</li>
<li>&ldquo;Ghost&rdquo; H-capping at cleaved covalent bonds, with fixed C-H (1.14 Å) and N-H (1.09 Å) bond lengths and per-$n$-mer optimized-structure referencing.</li>
<li>Non-linear parameters fit by simplex minimization, linear coefficients by ridge regression with $\Gamma = 0.0005$.</li>
<li>Low-energy weighting in the loss through $w_k = (\delta E / (\varepsilon^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}_{\min} + \delta E))^2$.</li>
<li>Tang-Toennies damped dispersion with XDM-derived $C_6$ and damping parameters, Thole-modified many-body polarization, and LJ repulsion borrowed from Amber ff14SB.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>14 PIPs total covering three 1-mer types, five 2-mer types, and six 3-mer types. Polynomial degree is 5 for the -CH- and CH$_3$- 1-mers, and 3 for the -CONH- 1-mer together with all 2-mers and 3-mers. Term counts range from 635 (-CH-, CH$_3$-) to 2871 (-CONH-CH-CONH-).</li>
<li>MB-nrg PEF implemented in the MBX code and exercised through LAMMPS and PLUMED.</li>
<li>Training set sizes per $n$-mer range from roughly 12,000 to 47,000 configurations (the -CONH- 1-mer dataset is the largest at 47,438).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MB-nrg</th>
          <th>Amber ff14SB</th>
          <th>Amber ff19SB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$n$-mer training RMSD</td>
          <td>$\leq 0.35$ kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>$n$-mer test RMSD</td>
          <td>$\leq 0.47$ kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide 2D PES RMSD (overall)</td>
          <td>1.27 kcal/mol</td>
          <td>6.33 kcal/mol</td>
          <td>5.23 kcal/mol</td>
      </tr>
      <tr>
          <td>Same, $\leq 10$ kcal/mol region</td>
          <td>1.18 kcal/mol</td>
          <td>5.72 kcal/mol</td>
          <td>4.79 kcal/mol</td>
      </tr>
      <tr>
          <td>Same, $&gt; 10$ kcal/mol region</td>
          <td>1.59 kcal/mol</td>
          <td>8.44 kcal/mol</td>
          <td>6.81 kcal/mol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide $m_1$ normal-mode mean deviation vs RI-MP2/def2-TZVP</td>
          <td>17.41 cm$^{-1}$</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide $m_4$ normal-mode mean deviation vs RI-MP2/def2-TZVP</td>
          <td>21.07 cm$^{-1}$</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>AceAla$_9$Nme helix-formation onset (from extended start)</td>
          <td>~100 ps ($\alpha$/$3_{10}$ mix)</td>
          <td>~40 ps ($\alpha$)</td>
          <td>~80 ps ($\alpha$)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Computational resources came from the Air Force Office of Scientific Research (FA9550-20-1-0351), NSF award 2311260, the DoD High Performance Computing Modernization Program, the San Diego Supercomputer Center via ACCESS allocation CHE240114, and NERSC (contract DE-AC02-05CH11231, award BES-ERCAP0030920). Specific wall-clock and node-hour figures are not reported in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhou, R., Bull-Vulpe, E. F., Pan, Y., &amp; Paesani, F. (2025). Data-Driven Many-Body Simulations of Biomolecules with CCSD(T) Accuracy: I. Polyalanine in the Gas Phase. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv-2025-b05k5">https://doi.org/10.26434/chemrxiv-2025-b05k5</a></p>
<p><strong>Publication</strong>: ChemRxiv preprint, 25 March 2025.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/paesanilab/MBX">MBX software (Paesani group)</a></li>
<li><a href="https://github.com/paesanilab/MB-Fit">MB-Fit (training pipeline)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{zhou2025data,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data-Driven Many-Body Simulations of Biomolecules with CCSD(T) Accuracy: I. Polyalanine in the Gas Phase}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhou, Ruihan and Bull-Vulpe, Ethan F. and Pan, Yuanhui and Paesani, Francesco}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv-2025-b05k5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">howpublished</span>=<span style="color:#e6db74">{\url{https://doi.org/10.26434/chemrxiv-2025-b05k5}}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MB-nrg in Solution: Polyalanine in Water with CCSD(T) PEFs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/</link><pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/</guid><description>Zhou and Paesani extend MB-nrg to peptide-water interactions, training 1-mer-water 2-body PIPs on DLPNO-CCSD(T) and benchmarking alanine dipeptide solvation.</description><content:encoded><![CDATA[<h2 id="extending-mb-nrg-from-gas-phase-polyalanine-to-aqueous-solution">Extending MB-nrg from Gas-Phase Polyalanine to Aqueous Solution</h2>
<p>This is a <strong>Method</strong> paper, the second installment in Zhou and Paesani&rsquo;s MB-nrg-for-biomolecules series. Paper I (covered in <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/">MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</a>) decomposed gas-phase polyalanine into functional-group $n$-mers and fit permutationally invariant polynomials (PIPs) to DLPNO-CCSD(T)/aug-cc-pVTZ reference data. This sequel adds the missing piece: explicit, machine-learned 2-body interactions between every polyalanine functional-group 1-mer and a water molecule, trained on the same <a href="https://en.wikipedia.org/wiki/Coupled_cluster">coupled-cluster</a> reference. The resulting PEF couples the gas-phase intramolecular MB-nrg term, the MB-pol water model, and a new MB-nrg ala-water cross term within a single modular many-body decomposition.</p>
<h2 id="why-empirical-force-fields-struggle-with-hydrated-peptides">Why Empirical Force Fields Struggle with Hydrated Peptides</h2>
<p>Biomolecular function in water emerges from a coupling of intramolecular flexibility with solvent-mediated interactions, including hydrogen-bond networks, cooperative polarization, dispersion, and short-range exchange-repulsion. Empirical force fields such as AMBER, CHARMM, and OPLS approximate the multidimensional PES with pairwise-additive analytical terms whose parameters are tuned to experimental observables or low-level quantum data. The authors note that this functional form leads to systematic errors in predicted conformational ensembles for short peptides and <a href="https://en.wikipedia.org/wiki/Intrinsically_disordered_proteins">intrinsically disordered proteins (IDPs)</a>, with reported overpopulation of polyproline II (pPII) basins and antiparallel $\beta$ regions for alanine residues, plus underrepresentation of the transitional $\beta$ basin compared to experiment.</p>
<p>Polarizable force fields recover dielectric and hydration trends through induced dipoles, but still lean on empirical functional forms and miss short-range quantum effects (charge transfer, charge penetration, exchange-repulsion) that arise from electron-density overlap. <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">Machine-learned force fields</a> like MACE-OFF, GEMS, and FeNNix-Bio1 have improved bio-organic accuracy, but they still depend critically on the diversity and quality of training data, struggle to decompose energies into physically interpretable components, and most rely on DFT references that inherit delocalization errors and incomplete long-range correlation. Local descriptors common to MLFFs also limit treatment of long-range electrostatics and many-body correlations, both essential for biomolecular solvation.</p>
<p>The MB-nrg formalism, originally developed for water and small molecules and recently extended to alkanes and gas-phase polyalanine, offers an alternative: a rigorous many-body expansion (MBE) of the energy combined with both data-driven $n$-body PIPs and physics-based long-range terms. Paper II asks whether this modular gas-phase scaffold can be cleanly extended to aqueous environments by adding only short-range peptide-water 2-body PIPs.</p>
<h2 id="a-modular-mb-nrg-pef-for-polyalanine-in-water">A Modular MB-nrg PEF for Polyalanine in Water</h2>
<p>The MBE writes the total energy of a system of $N$ 1-mers as</p>
<p>$$
E_N(1, \dots, N) = \sum_{i=1}^{N} \varepsilon^{1\mathrm{B}}(i) + \sum_{i&lt;j}^{N} \varepsilon^{2\mathrm{B}}(i,j) + \sum_{i&lt;j&lt;k}^{N} \varepsilon^{3\mathrm{B}}(i,j,k) + \dots + \varepsilon^{N\mathrm{B}}(1, \dots, N)
$$</p>
<p>with each $n$-body term defined recursively as the $n$-mer energy minus all lower-order contributions. The MBE converges quickly for insulating molecular systems with large electronic band gaps (such as water and peptides), so explicit PIP corrections are typically truncated at $n \leq 4$, with higher-order effects absorbed into many-body polarization.</p>
<p>For polyalanine in water, the total potential is partitioned into three modular blocks:</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}}^{\mathrm{tot}} = V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}} + V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}} + V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}
$$</p>
<p>where $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}}$ is the gas-phase intramolecular polyalanine PEF from Paper I, $V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}}$ is the MB-pol water model, and $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}$ is the new peptide-water cross term. The cross term itself follows the MB-nrg recipe of splitting machine-learned and physics-based contributions:</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}} = V_{\mathrm{ML}} + V_{\mathrm{phys}}
$$</p>
<p>with $V_{\mathrm{ML}} = V_{\mathrm{ML}}^{2\mathrm{B}}$ (only 2-body PIPs in this implementation) and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}}$. The 2-body machine-learned term sums switched PIPs over every (1-mer, water) dimer:</p>
<p>$$
V_{\mathrm{ML}}^{2\mathrm{B}} = \sum_{i=1}^{N} s^{2\mathrm{B}}(\mathrm{M}_i, \mathrm{WAT}) \, V_{\mathrm{PIP}}^{2\mathrm{B}}(\mathrm{M}_i, \mathrm{WAT})
$$</p>
<p>where $\mathrm{M}_i$ is the $i$-th polyalanine functional-group 1-mer (-CH-, CH$_3$-, or -CONH-), WAT is a water molecule, and $s^{2\mathrm{B}}$ is a cosine switching function</p>
<p>$$
s^{2\mathrm{B}}(x) = \begin{cases} 1 &amp; x &lt; 0 \\ \left(1 + \cos(\pi x)\right)/2 &amp; 0 \leq x &lt; 1 \\ 0 &amp; 1 \leq x \end{cases}, \quad x = \frac{R - R_{\mathrm{in}}}{R_{\mathrm{out}} - R_{\mathrm{in}}}
$$</p>
<p>that smoothly attenuates the short-range PIP beyond a defined distance to preserve energy conservation in MD. The physics-based block uses a Thole-modified self-consistent polarization model (inherited from MB-pol) for $V_{\mathrm{elec}}$ and a Tang-Toennies damped dispersion sum</p>
<p>$$
V_{\mathrm{disp}} = -\sum_{\substack{\alpha \in 1\text{-mers} \\ \beta \in \mathrm{water}}} f(\mathrm{b}_{\alpha\beta} R_{\alpha\beta}) \, \frac{C_{6, \alpha\beta}}{R_{\alpha\beta}^{6}}
$$</p>
<p>with $C_{6, \alpha\beta}$ coefficients and atomic polarizabilities derived from the exchange-hole dipole moment (XDM) method, and atomic charges fit to reproduce the permanent multipole moments of each $n$-mer&rsquo;s optimized structure.</p>
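<p>Both short-range attenuation pieces above are simple closed forms. A minimal sketch, with illustrative parameters (the production values live in MBX):</p>

```python
import math

def switch_2b(R, R_in, R_out):
    """Cosine switching function: 1 below R_in, 0 beyond R_out, and a
    smooth (1 + cos(pi * x)) / 2 ramp in between."""
    x = (R - R_in) / (R_out - R_in)
    if x < 0.0:
        return 1.0
    if x < 1.0:
        return 0.5 * (1.0 + math.cos(math.pi * x))
    return 0.0

def tang_toennies_f6(b, R):
    """Tang-Toennies damping factor of order 6:
    f6(bR) = 1 - exp(-bR) * sum_{k=0}^{6} (bR)^k / k!"""
    bR = b * R
    partial = sum(bR**k / math.factorial(k) for k in range(7))
    return 1.0 - math.exp(-bR) * partial

def damped_dispersion_pair(c6, b, R):
    """One damped -C6/R^6 pair term of V_disp."""
    return -tang_toennies_f6(b, R) * c6 / R**6
```

<p>The damping factor goes to 1 at large separations (recovering the bare $-C_6/R^6$ tail) and to 0 as $R \to 0$, which removes the unphysical short-range divergence.</p>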
<p>The authors stress that explicit 3-body and higher peptide-water PIPs are deliberately omitted in this first implementation; their effects are absorbed into the classical polarization term. They flag that strongly hydrogen-bonded or cooperative configurations may benefit from adding higher-body corrections in future work, following the precedent of MB-pol(2023) for water.</p>
<h2 id="training-set-generation-and-dlpno-ccsdt-reference-data">Training Set Generation and DLPNO-CCSD(T) Reference Data</h2>
<p>Training pools for the three 1-mer-water dimers (CH$_3$-H$_2$O, -CH&ndash;H$_2$O, -CONH&ndash;H$_2$O) extend the <a href="https://en.wikipedia.org/wiki/Metadynamics">parallel-bias metadynamics with partitioned families (PBMetaD+PFs)</a> protocol from Paper I. Covalent boundaries are capped with &ldquo;ghost&rdquo; hydrogens at fixed C-H (1.14 Å) and N-H (1.09 Å) distances to preserve closed-shell character; each 2-body energy is referenced to the corresponding optimized capped 1-mer-water geometry to remove constant offsets.</p>
<p>PBMetaD simulations are run in LAMMPS interfaced with PLUMED, using Amber ff14SB for the alanine 1-mers and TIP4P/2005f for water. Collective variables span all heavy-atom bonds, angles, and dihedrals in each dimer. To target distinct interaction regimes, three separate biased runs apply upper and lower walls on the 1-mer/water center-of-mass distance: 0-4 Å (short-range repulsion), 4-7 Å (mid-range attraction), and 7-10 Å (long-range orientation-dependent interactions). Each dimer yields about 600,000 configurations, reduced to roughly 40,000 training and 2,000 test configurations per type by K-means clustering.</p>
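<p>The distance-windowed runs can be pictured as a restrained PLUMED fragment like the following. This is hypothetical, not the authors' input: the atom group indices are placeholders, the wall strength is illustrative, and with PLUMED's default nm units the 4-7 Å window becomes 0.4-0.7 nm.</p>

```
# hypothetical fragment for the 4-7 A mid-range window
m: COM ATOMS=1-6      # capped functional-group 1-mer
w: COM ATOMS=7-9      # water molecule
d: DISTANCE ATOMS=m,w
LOWER_WALLS ARG=d AT=0.4 KAPPA=1000.0
UPPER_WALLS ARG=d AT=0.7 KAPPA=1000.0
```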
<p>Reference 2-body energies are computed at the DLPNO-CCSD(T)/aug-cc-pVTZ level in ORCA, using the aug-cc-pVTZ/C auxiliary basis, the RIJCOSX approximation, TightSCF, TightPNO, and the PModel pair-selection option. The counterpoise method corrects every 2-body energy for <a href="https://en.wikipedia.org/wiki/Basis_set_superposition_error">basis set superposition error</a>.</p>
<p>Each PIP minimizes a weighted, ridge-regularized least-squares objective:</p>
<p>$$
\chi^2 = \sum_{k \in \mathcal{S}} w_k \left[ V^{2\mathrm{B}}(k) - \varepsilon^{2\mathrm{B}}(k) \right]^2 + \Gamma^2 \sum_l c_l^2
$$</p>
<p>with $\Gamma = 0.0005$ throughout. Training weights bias the fit toward low-energy configurations,</p>
<p>$$
w_k = \left( \frac{\delta E}{\varepsilon^{2\mathrm{B}}(k) - \varepsilon_{\mathrm{min}}^{2\mathrm{B}} + \delta E} \right)^2
$$</p>
<p>with $\delta E = 40$ kcal/mol for all 1-mer-water pairs. MB-Fit handles the optimization, combining simplex minimization for non-linear parameters (Morse decay constants) with ridge regression for the linear coefficients.</p>
<p>Table 1 reports the PIP specifications. All three PIPs use polynomial degree 3 with a complete, unscreened basis. The -CH- and CH$_3$- dimers each require 710 symmetrized terms; the chemically richer -CONH- dimer requires 1,267 terms to capture its dipolar character and directional hydrogen bonding. Training-set sizes range from 41,781 to 43,174 configurations.</p>
<table>
  <thead>
      <tr>
          <th>1-mer type</th>
          <th>PIP degree</th>
          <th>PIP terms</th>
          <th>Training configs</th>
          <th>Train RMSD (kcal/mol)</th>
          <th>Test RMSD (kcal/mol)</th>
          <th>Train MAE</th>
          <th>Test MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>-CH-</td>
          <td>3</td>
          <td>710</td>
          <td>43,174</td>
          <td>0.07</td>
          <td>0.08</td>
          <td>0.06</td>
          <td>0.06</td>
      </tr>
      <tr>
          <td>CH$_3$-</td>
          <td>3</td>
          <td>710</td>
          <td>43,172</td>
          <td>0.08</td>
          <td>0.08</td>
          <td>0.05</td>
          <td>0.05</td>
      </tr>
      <tr>
          <td>-CONH-</td>
          <td>3</td>
          <td>1,267</td>
          <td>41,781</td>
          <td>0.18</td>
          <td>0.20</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
  </tbody>
</table>
<p>All RMSDs sit below 0.20 kcal/mol on both train and test splits, comfortably inside the 1 kcal/mol chemical-accuracy threshold.</p>
<h2 id="validation-dimer-scans-free-energy-surfaces-and-hydration">Validation: Dimer Scans, Free-Energy Surfaces, and Hydration</h2>
<p>The authors stage four validation studies of increasing complexity, each touching a distinct facet of the new PEF.</p>
<p><strong>Alanine dipeptide-water dimer scans.</strong> One-dimensional scans probe the interaction energy along four hydrogen-bonding coordinates of an alanine dipeptide-water dimer: O$_1$-H$_w$, H$_1$-O$_w$, O$_2$-H$_w$, and H$_2$-O$_w$, where subscripts 1 and 2 mark the acetyl and N-methyl termini. The dipeptide is constrained to four representative <a href="https://en.wikipedia.org/wiki/Ramachandran_plot">Ramachandran</a> conformations: C5 ($\varphi = -150°$, $\psi = 150°$), pPII ($\varphi = -80°$, $\psi = 150°$), C7$_{\mathrm{eq}}$ ($\varphi = -80°$, $\psi = 70°$), and right-handed $\alpha$-helix $\alpha_R$ ($\varphi = -80°$, $\psi = -30°$). MB-nrg closely tracks the DLPNO-CCSD(T)/aug-cc-pVTZ reference curves across all 16 (4 conformation $\times$ 4 site) scans, despite never seeing the full dipeptide-water surface during training. Amber ff14SB/TIP3P and ff19SB/OPC underestimate hydrogen-bond depths and miss curvature near equilibrium, with the ff14SB/TIP3P combination yielding slightly better overall agreement than ff19SB/OPC even though TIP3P is the less accurate water model.</p>
<p>Two specific failure modes of the empirical force fields stand out. In the pPII conformation, both ff14SB and ff19SB predict significantly deeper interaction wells than the reference, overstabilizing several hydrogen bonds. In the H$_2$-O$_w$ scan of the $\alpha_R$ conformation, both empirical FFs exhibit a spurious 2.5-4.0 Å energy barrier that the authors trace to the simple Lennard-Jones repulsion between the acetyl carbonyl oxygen and water; MB-nrg and DLPNO-CCSD(T) instead show a smoothly decaying profile. The one MB-nrg deviation noted is the C5 H$_1$-O$_w$ scan in the 1.5-2.5 Å range, where MB-nrg predicts a slightly more attractive interaction than the reference. Here the H$_1$-O$_2$ distance is 2.3 Å and water acts simultaneously as acceptor at H$_1$ and donor to O$_2$, a cooperative pattern the authors expect would require explicit 2-mer-water or 3-mer-water terms to fully reproduce.</p>
<p><strong>Free-energy surface in explicit MB-pol water.</strong> Four-walker well-tempered metadynamics (WT-MetaD) simulations explore the conformational landscape of alanine dipeptide as a function of $(\varphi, \psi)$, biasing the central alanine residue&rsquo;s backbone dihedrals every 500 steps with 1.0 kJ/mol Gaussians of 11.46° width. The free-energy section reports 2.5 ns per replica across four parallel walkers (10 ns aggregate, matching the Figure 6 caption); the methods section states 8 ns total, an internal inconsistency in the paper. The MB-nrg FES recovers all major low-energy conformers identified by NMR and prior MP2/DFT studies: a global minimum at $\alpha_R$, additional local minima in C5, $\beta_2$, and $\alpha_L$, and a metastable pPII basin. The C7$_{\mathrm{eq}}$ minimum that dominates the gas-phase Ramachandran surface in Paper I is significantly destabilized in solution, consistent with experiment.</p>
<p>Quantitatively, MB-nrg predicts $\alpha_R$ and $\beta_2$ as isoenergetic global minima, with C5 about 3 kcal/mol higher in free energy. Prior DFT-with-implicit-solvation studies (Mironov et al., Yang and Honig) report C5, $\alpha_R$, and $\beta_2$ as nearly isoenergetic, and the authors note that the discrepancy may reflect the explicit MB-pol water treatment, residual DFT errors in the reference, or both. They flag a planned systematic benchmarking of MB-nrg PEFs for diverse polypeptides against both DFT and DLPNO-CCSD(T) data in future work. The Amber FESs over-stabilize pPII relative to C5/$\alpha_R$, contradicting experimental and DFT benchmarks; ff19SB/OPC also exhibits a spurious C7$_{\mathrm{eq}}$ minimum that is absent from MB-nrg.</p>
<p><strong>Hydration radial distribution functions.</strong> Site-site RDFs at 300 K for the same hydrogen-bond contacts (O$_1$-H$_w$, O$_2$-H$_w$, H$_1$-O$_w$, H$_2$-O$_w$) are computed from NVT MD trajectories. All three models reproduce well-defined first-shell peaks near 2.0 Å. For the O-H$_w$ pairs, MB-nrg shows a broader, slightly right-shifted second-shell peak, indicating less rigid water structure beyond the first shell. The amide-hydrogen RDFs are nearly identical between ff14SB/TIP3P and ff19SB/OPC, while MB-nrg reveals subtle first-shell shifts (shorter H$_1$-O$_w$, longer H$_2$-O$_w$) and weaker, less-defined second-shell features near 3.7-3.8 Å that are absent from the empirical force fields and consistent with prior ab initio MD on alanine dipeptide.</p>
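<p>The site-site RDFs follow the standard histogram-over-ideal-gas normalization, sketched below with toy inputs; periodic imaging and the paper's actual trajectory handling are assumed upstream:</p>

```python
import numpy as np

def site_site_rdf(d_frames, n_pairs, box_volume, r_max=6.0, n_bins=60):
    """Site-site g(r) from per-frame site-site distances.

    d_frames: list of 1D arrays of, e.g., O..Hw distances, one per frame.
    Counts are normalized by the ideal-gas expectation
    n_frames * n_pairs * V_shell / V_box.  Illustrative, not the paper's code.
    """
    edges = np.linspace(0.0, r_max, n_bins + 1)
    hist = np.zeros(n_bins)
    for d in d_frames:
        h, _ = np.histogram(d, bins=edges)
        hist += h
    shell_vol = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    ideal = len(d_frames) * n_pairs * shell_vol / box_volume
    r = 0.5 * (edges[:-1] + edges[1:])
    return r, hist / ideal

# one toy "frame" with three hydrogen-bond-range distances (angstrom)
r, g = site_site_rdf([np.array([2.0, 2.05, 3.9])], n_pairs=3, box_volume=1000.0)
```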
<h2 id="a-modular-path-to-chemically-accurate-biomolecular-simulations">A Modular Path to Chemically Accurate Biomolecular Simulations</h2>
<p>Across the four benchmarks, the same picture emerges: a modular, bottom-up MB-nrg PEF built from functional-group $n$-mers and trained only on isolated 1-mer-water dimers can reach DLPNO-CCSD(T) accuracy for both energetic and structural observables of alanine dipeptide in explicit water. The decomposition into a gas-phase intramolecular term, an MB-pol water model, and an MB-nrg cross term keeps each piece interpretable and individually replaceable; the gas-phase polyalanine PEF from Paper I drops in unchanged, and the new ala-water PIPs were fit without ever seeing the full alanine dipeptide-water PES.</p>
<p>The authors are explicit about limitations:</p>
<ul>
<li>The cross term currently includes only 2-body PIPs (one 1-mer with one water). Higher-body peptide-water terms ($n &gt; 2$) are folded into the classical polarization, which the authors expect will be inadequate for strongly cooperative configurations such as the C5 H$_1$-O$_w$ scan where one water bridges H$_1$ and O$_2$.</li>
<li>Quantitative differences between the MB-nrg FES and prior implicit-solvation DFT studies (relative depths of $\alpha_R$, $\beta_2$, and C5) remain to be reconciled through systematic benchmarking against higher-level reference data.</li>
<li>Only polyalanine is considered. The framework is designed to generalize to other amino acids and side-chain-water interactions, but sequence- and side-chain-specific PIPs are still to be fit.</li>
<li>No public release of the parameterized PEF or training data is announced; the data availability statement says &ldquo;available from the authors upon request.&rdquo;</li>
</ul>
<p>The paper positions MB-nrg as a transferable, interpretable strategy for chemically accurate biomolecular simulations in solution, with future work aimed at heteropolypeptides and explicit higher-order many-body cross terms.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training pools</td>
          <td>PBMetaD+PFs in LAMMPS/PLUMED</td>
          <td>~600,000 configs per dimer, reduced to ~40,000</td>
          <td>ff14SB for alanine 1-mers, TIP4P/2005f for water; 300 K, 500 K, 700 K</td>
      </tr>
      <tr>
          <td>Distance regimes</td>
          <td>Walls on 1-mer/water COM distance</td>
          <td>0-4, 4-7, and 7-10 Å</td>
          <td>Short-range repulsion, mid-range attraction, long-range orientation</td>
      </tr>
      <tr>
          <td>Training labels</td>
          <td>DLPNO-CCSD(T)/aug-cc-pVTZ in ORCA</td>
          <td>3 unique 1-mer-water dimer types</td>
          <td>RIJCOSX, TightSCF, TightPNO, PModel; counterpoise BSSE correction</td>
      </tr>
      <tr>
          <td>Test sets</td>
          <td>Held-out clustered configs</td>
          <td>~2,000 per dimer</td>
          <td>Same K-means clustering protocol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide-water scans</td>
          <td>1D scans along 4 H-bond coordinates in 4 conformations</td>
          <td>16 scans total</td>
          <td>C5, pPII, C7$_{\mathrm{eq}}$, and $\alpha_R$ conformations</td>
      </tr>
      <tr>
          <td>Alanine dipeptide FES</td>
          <td>WT-MetaD on $\varphi$, $\psi$ in MB-pol water</td>
          <td>4 walkers, 2.5 ns each (10 ns total per the results section and Figure 6 caption; methods section states 8 ns)</td>
          <td>1.0 kJ/mol height, 11.46° width, deposition every 500 steps</td>
      </tr>
      <tr>
          <td>Hydration RDFs</td>
          <td>NVT MD at 300 K</td>
          <td>Single trajectory per model</td>
          <td>Same H-bond sites as the dimer scans</td>
      </tr>
  </tbody>
</table>
<p>Per the data availability statement, &ldquo;any data generated and analyzed in this study, including the MB-nrg PEF, are available from the authors upon request.&rdquo; The MBX engine is publicly available on <a href="https://github.com/paesanilab/MBX">GitHub</a> under a UC Regents custom license that grants free use for educational, research, and non-profit purposes but restricts commercial use. No public release of the new ala-water PIPs is announced in the text.</p>
<h4 id="artifacts-table">Artifacts table</h4>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/paesanilab/MBX">MBX</a></td>
          <td>Code</td>
          <td>UC Regents custom (academic/non-profit only; no SPDX-recognized OSS license)</td>
          <td>C++ many-body potential engine; runs the MB-nrg PEF via LAMMPS and PLUMED</td>
      </tr>
      <tr>
          <td><a href="https://github.com/paesanilab/MB-Fit">MB-Fit</a></td>
          <td>Code</td>
          <td>Check repo</td>
          <td>Training pipeline for PIP fitting; used to fit the new 1-mer-water PIPs</td>
      </tr>
      <tr>
          <td>MB-nrg ala-water PIPs (this paper)</td>
          <td>Model</td>
          <td>Not released</td>
          <td>&ldquo;Available from the authors upon request&rdquo; per the data availability statement</td>
      </tr>
      <tr>
          <td>DLPNO-CCSD(T) training/test sets</td>
          <td>Dataset</td>
          <td>Not released</td>
          <td>Same statement; ~600,000 raw configs per dimer reduced to ~40,000 train + ~2,000 test</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Many-body expansion of the energy partitioned into three modular blocks: $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}} + V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}} + V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}$.</li>
<li>Cross term split into $V_{\mathrm{ML}}^{2\mathrm{B}}$ (PIPs over every 1-mer-water dimer) and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}}$.</li>
<li>Permutationally invariant polynomials in Morse-exponential variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$, symmetrized over chemically equivalent atoms; same construction as the NMA-water PIPs.</li>
<li>Cosine switching function $s^{2\mathrm{B}}$ smoothly attenuates short-range PIPs between user-defined inner and outer cutoffs.</li>
<li>Dispersion: Tang-Toennies damped $C_6/R^6$ with XDM-derived coefficients and damping parameters.</li>
<li>Electrostatics: modified Thole model with self-consistent induced dipoles for many-body polarization; per-atom charges fit to reproduce permanent multipole moments of each $n$-mer&rsquo;s optimized structure.</li>
<li>Ghost-H capping at cleaved covalent boundaries with fixed C-H (1.14 Å) and N-H (1.09 Å) distances; per-dimer optimized-structure referencing.</li>
<li>Training with simplex minimization for non-linear parameters and ridge regression for linear coefficients via MB-Fit, with low-energy weighting and $\Gamma = 0.0005$, $\delta E = 40$ kcal/mol.</li>
<li>WT-MetaD with four parallel walkers for the alanine dipeptide FES.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Three new 1-mer-water 2-body PIPs covering -CH-/H$_2$O, CH$_3$-/H$_2$O, and -CONH-/H$_2$O dimers.</li>
<li>All three PIPs use polynomial degree 3 with a complete, unscreened basis (no term screening).</li>
<li>Term counts: 710 for -CH-/H$_2$O and CH$_3$-/H$_2$O, 1,267 for -CONH-/H$_2$O.</li>
<li>Combined with the gas-phase polyalanine MB-nrg PEF from Paper I and the MB-pol water model, exercised through MBX, LAMMPS, and PLUMED.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MB-nrg</th>
          <th>Amber ff14SB/TIP3P</th>
          <th>Amber ff19SB/OPC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>-CH-/H$_2$O 2-body train/test RMSD</td>
          <td>0.07 / 0.08 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>CH$_3$-/H$_2$O 2-body train/test RMSD</td>
          <td>0.08 / 0.08 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>-CONH-/H$_2$O 2-body train/test RMSD</td>
          <td>0.18 / 0.20 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide-water 1D scans (qualitative)</td>
          <td>Tracks DLPNO-CCSD(T) curves across 16 scans</td>
          <td>Underestimates H-bond depths; spurious $\alpha_R$ H$_2$-O$_w$ barrier</td>
          <td>Same shape as ff14SB/TIP3P</td>
      </tr>
      <tr>
          <td>Alanine dipeptide FES global minima</td>
          <td>Isoenergetic $\alpha_R$ and $\beta_2$; C5 ~3 kcal/mol higher</td>
          <td>Over-stabilizes pPII</td>
          <td>Over-stabilizes pPII; spurious C7$_{\mathrm{eq}}$ minimum</td>
      </tr>
      <tr>
          <td>O-H$_w$ second shell</td>
          <td>Broader, right-shifted; finer detail consistent with prior AIMD</td>
          <td>Sharper, less detail</td>
          <td>Sharper, less detail</td>
      </tr>
      <tr>
          <td>H-O$_w$ second shell</td>
          <td>Weak features near 3.7-3.8 Å</td>
          <td>Absent</td>
          <td>Absent</td>
      </tr>
  </tbody>
</table>
<p>Quantitative RMSD or KL-divergence values for the FES and RDF benchmarks are not reported in the main text.</p>
<h3 id="hardware">Hardware</h3>
<p>The authors acknowledge support from the Air Force Office of Scientific Research (FA9550-20-1-0351, theoretical development) and NSF (award 2311260, MBX implementation). Computational resources came from the DoD High Performance Computing Modernization Program, the San Diego Supercomputer Center via ACCESS allocation CHE240114, and NERSC (contract DE-AC02-05CH11231, award BES-ERCAP0030920). Specific wall-clock and node-hour figures are not reported in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhou, R., &amp; Paesani, F. (2025). Toward Chemical Accuracy in Biomolecular Simulations through Data-Driven Many-Body Potentials: II. Polyalanine in Water. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv-2025-j6cwv-v2">https://doi.org/10.26434/chemrxiv-2025-j6cwv-v2</a></p>
<p><strong>Publication</strong>: ChemRxiv preprint (version 2), 10 October 2025.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/paesanilab/MBX">MBX software (Paesani group)</a></li>
<li><a href="https://github.com/paesanilab/MB-Fit">MB-Fit (training pipeline)</a></li>
<li>Companion paper: <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/">MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</a> (Paper I)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhou2025toward,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Toward Chemical Accuracy in Biomolecular Simulations through Data-Driven Many-Body Potentials: II. Polyalanine in Water}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhou, Ruihan and Paesani, Francesco}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv-2025-j6cwv-v2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph Grammar and ILP for Carbon Fixation Pathways</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/carbon-fixation-pathway-design/</link><pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/carbon-fixation-pathway-design/</guid><description>Graph-based chemical space expansion with ILP flow queries discovers novel autocatalytic carbon fixation pathways competitive with CETCH and rTCA.</description><content:encoded><![CDATA[<h2 id="a-graph-grammar-and-ilp-framework-for-pathway-discovery">A Graph-Grammar and ILP Framework for Pathway Discovery</h2>
<p>Abel et al. present a Method paper that couples generative chemical space expansion with <a href="https://en.wikipedia.org/wiki/Integer_programming">integer linear programming</a> (ILP) pathway queries to systematically propose artificial carbon fixation pathways. The workflow uses the cheminformatics package MØD to iteratively expand a reaction hypergraph from a seed set of metabolites and rule-based enzyme reactions, then queries the resulting network for autocatalytic flows producing a chosen target molecule. Post-hoc annotation with eQuilibrator Gibbs energies and cofactor accounting ranks candidates by thermodynamic feasibility. Applied to the Acetyl-CoA-Succinyl-CoA pathway family plus selected synthetic and theoretical pathways, the framework recovers the natural pathways and proposes two new theoretical autocatalytic cycles (an 11-step Acetyl-CoA cycle and a 12-step Malate cycle) whose efficiency, measured in ATP and redox cofactors per fixed carbon, is comparable to the synthetic CETCH cycle and the natural <a href="https://en.wikipedia.org/wiki/Reverse_Krebs_cycle">rTCA</a>.</p>
<h2 id="why-computational-pathway-design-for-carbon-fixation">Why Computational Pathway Design for Carbon Fixation</h2>
<p>Fixing atmospheric CO$_2$ or bicarbonate into value-added chemicals is a thermodynamically unfavorable process that nature solves through enzymatic cascades coupled to cofactor-driven reactions. Seven natural carbon fixation pathways are known, along with several artificial proposals, and the Acetyl-CoA-Succinyl-CoA family is particularly appealing as a design template because each member overlaps structurally with at least one other and each exhibits <a href="https://en.wikipedia.org/wiki/Autocatalysis">autocatalysis</a>. Prior approaches to artificial pathway design (e.g., Erb Lab CETCH, HOPAC) rely heavily on manual heuristics, database searches, and extensive in-vitro optimization including <a href="https://en.wikipedia.org/wiki/Directed_evolution">directed evolution</a>. Earlier computational work (Löwe and Kremling, 2021) uses <a href="https://en.wikipedia.org/wiki/Flux_balance_analysis">flux balance analysis</a> and expert curation that requires complete kinetic parameterization, making generative exploration infeasible. Abel et al. target the design stage directly: a computational approach that can quickly enumerate many topologically distinct pathway candidates without requiring a priori kinetic parameters.</p>
<h2 id="generative-chemical-space-expansion-with-graph-grammar-rules">Generative Chemical Space Expansion with Graph-Grammar Rules</h2>
<p>The core innovation is treating the chemical reaction network (CRN) as a <a href="https://en.wikipedia.org/wiki/Hypergraph">directed multi-hypergraph</a> $H = (V, E)$ where vertices in $V$ are molecules and each hyperedge $e \in E$ is a directed pair $(e_{tail}, e_{head})$ of multisets representing reactants and products. This hyperedge formalization directly captures the many-to-many nature of biochemical reactions.</p>
<p>Reactions are specified as graph transformation rules written in the Graph Modeling Language (GML). A rule defines the bond rewiring at a reaction center plus a tunable molecular context around that center. A rule with no context is fully promiscuous (every oxidoreductase class reaction, say); a rule with rich context mimics a specific enzyme. This rule-based formalism lets one rule represent an entire reaction class, so the CRN can be expanded without enumerating every possible enzyme-substrate pair in advance. Expansion proceeds iteratively: the rules act on the current molecule pool, producing new molecules and new hyperedges, until a user-defined step count is reached. Two biochemical sanity constraints bound the combinatorial explosion: molecules are restricted to at most 6 carbon atoms in the backbone (excluding the CoA moiety), and at most one CoA group per molecule.</p>
<p>Pathway discovery is then an ILP flow query over the CRN. A pathway is a hyperflow: an assignment of integer flow values to hyperedges such that internal molecules balance between production and consumption, leaving only designated source and sink molecules with net flow. The main optimization objective minimizes the number of reactions used and, as a tiebreaker, the magnitude of flow on those reactions:</p>
<p>$$
\min \sum_{e \in E} \left( w \, z_e + x_e \right)
$$</p>
<p>where $z_e$ is a boolean indicator that hyperedge $e$ carries flow, $x_e$ is the integer flow on $e$, and the weight $w = 1000$ prioritizes minimizing the edge count over the total flow magnitude. Autocatalysis is encoded as a constraint on the autocatalyst molecule $a$: its inflow and outflow are both positive, with outflow strictly exceeding inflow so the cycle nets at least one additional molecule of the autocatalyst.</p>
<p>$$
0 &lt; x_a^{in} &lt; x_a^{out}
$$</p>
<p>Only the autocatalyst itself, cofactors, and CO$_2$/HCO$_3^-$ are permitted as sources and sinks, so any valid flow represents a net reaction that fixes carbon and regenerates the autocatalyst. Unlike classical flux balance analysis, which optimizes continuous flux distributions at steady state, the integer-valued ILP formulation emphasizes pathway structure (which reactions are active) rather than flux magnitude.</p>
<p>Solutions are post-annotated with two feasibility measures. The first is cofactor accounting, split into ATP/ADP as an energy proxy and reduced redox cofactors (NAD(P)H, ubiquinone, Ferredoxin) as an electron proxy. The second is the standard Gibbs free energy of the net reaction computed via the eQuilibrator 3.0 component-contribution method at pH 7 and ionic strength 0.1 M using the eQuilibrator API 0.6.0:</p>
<p>$$
\Delta_r G'^{\circ} = \sum \Delta_f G'^{\circ}_{\text{products}} - \sum \Delta_f G'^{\circ}_{\text{reactants}}
$$</p>
<h2 id="experimental-setup-queries-and-comparison-to-literature">Experimental Setup, Queries, and Comparison to Literature</h2>
<p>The seed pool for expansion contains 49 intermediates drawn from the Acetyl-CoA-Succinyl-CoA family (rTCA, DC/4-HB, 3-HP/4-HB, 3-HP bicycle), the synthetic CETCH cycle, and theoretical pathways proposed by Bar-Even et al., plus 20 helper molecules (cofactors, water, CO$_2$). Rule contexts were derived from <a href="https://en.wikipedia.org/wiki/KEGG">KEGG</a> enzyme entries. The <a href="https://en.wikipedia.org/wiki/Calvin_cycle">Calvin-Benson-Basham cycle</a> and the non-autocatalytic <a href="https://en.wikipedia.org/wiki/Wood%E2%80%93Ljungdahl_pathway">Wood-Ljungdahl</a> and reductive glycine pathways were excluded.</p>
<p>Expansion statistics (Table 4 in the paper):</p>
<table>
  <thead>
      <tr>
          <th>Expansion steps</th>
          <th>Molecules (vertices)</th>
          <th>Reactions (hyperedges)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>165</td>
          <td>220</td>
      </tr>
      <tr>
          <td>2</td>
          <td>318</td>
          <td>942</td>
      </tr>
      <tr>
          <td>5</td>
          <td>996</td>
          <td>29,266</td>
      </tr>
  </tbody>
</table>
<p>At one expansion step, flow queries recover only the input pathways, with no recombinations. Two expansion steps produce enough novelty for recombined pathways while keeping ILP runtimes tractable. At five steps, flow queries become computationally prohibitive without adding biological insight. All reported analyses therefore use the two-step CRN.</p>
<p>Three benchmark flow queries target autocatalytic pathways producing Acetyl-CoA, Malate, and Propionyl-CoA. Each query is run to return 1000 topologically distinct optimal solutions (under the ILP objective, solutions with equal length are equally optimal). All flow queries were solved with Gurobi 11.0.3 under an academic license on a consumer laptop (AMD Ryzen 7 5700U, 16 GB RAM, Windows 11). The full 1000-solution search took just under 18 hours.</p>
<h2 id="two-novel-autocatalytic-cycles-competitive-with-synthetic-pathways">Two Novel Autocatalytic Cycles Competitive with Synthetic Pathways</h2>
<p>The shortest-pathway queries yield two novel theoretical autocatalytic cycles: an 11-step Acetyl-CoA cycle and a 12-step Malate cycle. Comparison to natural, theoretical, and synthetic pathways on the four standard measures (steps, ATP units, cofactors, carbon units fixed per cycle):</p>
<table>
  <thead>
      <tr>
          <th>Pathway</th>
          <th>Status</th>
          <th>Steps</th>
          <th>ATP</th>
          <th>Cofactors</th>
          <th>C fixed</th>
          <th>ATP/C</th>
          <th>Cof/C</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Shortest Acetyl-CoA (this work)</td>
          <td>Theoretical</td>
          <td>11</td>
          <td>2</td>
          <td>5</td>
          <td>2</td>
          <td>1</td>
          <td>2.5</td>
      </tr>
      <tr>
          <td>Shortest Malate (this work)</td>
          <td>Theoretical</td>
          <td>12</td>
          <td>3</td>
          <td>8</td>
          <td>4</td>
          <td>0.75</td>
          <td>2</td>
      </tr>
      <tr>
          <td>CETCH</td>
          <td>Synthetic</td>
          <td>11</td>
          <td>1</td>
          <td>4</td>
          <td>2</td>
          <td>0.5</td>
          <td>2</td>
      </tr>
      <tr>
          <td>rGPS-MCG</td>
          <td>Synthetic</td>
          <td>18</td>
          <td>4</td>
          <td>6</td>
          <td>3</td>
          <td>1.33</td>
          <td>2</td>
      </tr>
      <tr>
          <td>C4-glyoxylate / alanine</td>
          <td>Theoretical</td>
          <td>9</td>
          <td>2</td>
          <td>2</td>
          <td>2</td>
          <td>1</td>
          <td>1</td>
      </tr>
      <tr>
          <td>rTCA</td>
          <td>Natural</td>
          <td>12</td>
          <td>4</td>
          <td>7</td>
          <td>4</td>
          <td>1</td>
          <td>1.75</td>
      </tr>
      <tr>
          <td>3HP/4HB</td>
          <td>Natural</td>
          <td>16</td>
          <td>4</td>
          <td>6</td>
          <td>2</td>
          <td>2</td>
          <td>3</td>
      </tr>
      <tr>
          <td>DC/4HB</td>
          <td>Natural</td>
          <td>14</td>
          <td>4</td>
          <td>7</td>
          <td>2</td>
          <td>2</td>
          <td>3.5</td>
      </tr>
      <tr>
          <td>3HP-bicycle</td>
          <td>Natural</td>
          <td>19</td>
          <td>3</td>
          <td>4</td>
          <td>2</td>
          <td>1.5</td>
          <td>2</td>
      </tr>
  </tbody>
</table>
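<p>The per-carbon columns follow directly from the raw counts, which makes the table easy to sanity-check:</p>

```python
# ATP and cofactor cost per fixed carbon, from the comparison table above.
pathways = {
    "Shortest Acetyl-CoA": dict(atp=2, cof=5, c=2),
    "Shortest Malate":     dict(atp=3, cof=8, c=4),
    "CETCH":               dict(atp=1, cof=4, c=2),
    "rTCA":                dict(atp=4, cof=7, c=4),
}

per_carbon = {name: (p["atp"] / p["c"], p["cof"] / p["c"])
              for name, p in pathways.items()}
```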
<p>The 11-step Acetyl-CoA cycle matches CETCH in length and carbon units fixed while using one more ATP and one more redox cofactor. The Malate cycle is the same length as rTCA (12 steps) but uses one fewer ATP and one fewer cofactor while fixing the same four carbons.</p>
<p>Across the 1000-solution benchmarks (Table 2 of the paper), the Acetyl-CoA cycle is the most cofactor-efficient per step (0.69 cofactors/step; average 7.6 total), while Propionyl-CoA and Malate average 0.89 and 0.88 cofactors/step. Gibbs energies average $\Delta_r G'^{\circ} = -150.66$ kJ/mol for Acetyl-CoA, $-165.82$ for Propionyl-CoA, and $-196.98$ for Malate, making the Malate query the most thermodynamically driven even after accounting for its higher cofactor count. Three specific Acetyl-CoA solutions inspected in detail share a common rTCA-like core with a glyoxylate shunt and differ mainly along the oxaloacetate-to-malyl-CoA branch; their totals range from $\Delta_r G'^{\circ}_{\mathrm{total}} = -80$ kJ/mol (the one-ATP solution) to $-168$ kJ/mol.</p>
<p>All solutions rely on <a href="https://en.wikipedia.org/wiki/Ferredoxin">Ferredoxin</a>-dependent carboxylating enzymes (pyruvate:ferredoxin oxidoreductase and 2-ketoglutarate:ferredoxin oxidoreductase), which have higher reduction potentials than NAD(P) but are oxygen-sensitive and would restrict wet-lab implementation to anaerobic conditions or engineered anaerobic strains.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The workflow produces pathway candidates whose efficiency approaches the best synthetic designs while running on a consumer laptop, and it generalizes to any chemical space that can be formalized by graph-transformation rules. Because the ILP returns many equally optimal solutions, a downstream filtering step can select candidates matching user criteria (oxygen sensitivity, specific cofactor preference, enzyme availability).</p>
<p>Acknowledged limitations include: the topology-only search ignores enzyme kinetics, so candidates that look thermodynamically favorable might be bottlenecked in practice; the carbon-count and CoA restrictions are necessary to bound combinatorial blow-up but also constrain the discoverable space; reliance on Ferredoxin complicates implementation; and enzyme availability varies across organisms, which matters for recombination-based designs. The authors point to kinetic modeling, cofactor-recycling system inclusion, and incorporation of metabolic reactions outside the canonical carbon fixation space as future directions.</p>
<p>The paper positions itself as a design-stage tool rather than an end-to-end in-vitro pipeline. The authors frame the contribution as idea generation that complements, not replaces, the subsequent experimental optimization (enzyme engineering, directed evolution) that has carried prior synthetic pathway work from theory to in-vitro success.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Seed molecules</td>
          <td>Curated Acetyl-CoA-Succinyl-CoA family + CETCH + Bar-Even theoretical</td>
          <td>49 metabolites + 20 cofactors</td>
          <td>Tables S1-S2</td>
      </tr>
      <tr>
          <td>Reaction rules</td>
          <td>KEGG enzyme entries, GML-encoded</td>
          <td>Rules listed in Figure S1</td>
          <td>Conservative context</td>
      </tr>
      <tr>
          <td>CRN (2-step expansion)</td>
          <td>Generated by MØD</td>
          <td>318 molecules, 942 reactions</td>
          <td>Primary analysis space</td>
      </tr>
      <tr>
          <td>Thermodynamic data</td>
          <td>eQuilibrator 3.0 component-contribution</td>
          <td>All molecules in space</td>
          <td>pH 7, ionic strength 0.1 M</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Graph-grammar rule expansion via MØD 1.0.0 with a 6-carbon backbone cap and at most one CoA moiety per molecule. ILP flow queries formulated with the edge-minimization objective in Equation (1) and the autocatalysis constraint in Equation (2). Natural pathway presence first verified via set operations on the CRN, then reconfirmed by constraining the ILP to pass through core intermediates. The pathway solution enumeration is structural: 1000 topologically distinct solutions per query at the optimal objective value.</p>
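<p>The MØD/Gurobi pipeline itself is not reproduced here, but the shape of the edge-minimization query with an autocatalysis constraint can be illustrated with a brute-force stand-in on a toy reaction network. All species and reaction names below are invented for illustration; the real system solves this as an ILP over hypergraph flows.</p>

```python
from itertools import combinations

# Toy reaction network: each reaction maps a set of consumed species to a set
# of produced species. This is an illustrative stand-in for the paper's ILP
# edge-minimization query over the MOD-generated CRN, not their code.
reactions = {
    "r1": ({"A", "CO2"}, {"B"}),
    "r2": ({"B", "CO2"}, {"C"}),
    "r3": ({"C"}, {"A", "product"}),   # regenerates seed A: autocatalysis
    "r4": ({"A"}, {"D"}),              # dead end
}

def find_min_pathway(reactions, seed, target):
    """Smallest reaction set (the edge-minimization objective) whose inputs
    are covered by the seed, CO2, and its own outputs, that regenerates the
    seed (autocatalysis constraint), and that yields the target."""
    names = list(reactions)
    for k in range(1, len(names) + 1):          # increasing pathway length
        for subset in combinations(names, k):
            consumed = set().union(*(reactions[r][0] for r in subset))
            produced = set().union(*(reactions[r][1] for r in subset))
            feasible = consumed <= produced | {seed, "CO2"}
            if feasible and seed in produced and target in produced:
                return subset
    return None

pathway = find_min_pathway(reactions, "A", "product")
```

<p>Here the minimal autocatalytic pathway is the three-reaction cycle r1&ndash;r3; the dead-end reaction r4 is correctly excluded.</p>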
<h3 id="models">Models</h3>
<p>No machine-learning models. The pipeline is symbolic: graph transformations, hypergraph flow constraints, and component-contribution free energy estimates.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Acetyl-CoA</th>
          <th>Propionyl-CoA</th>
          <th>Malate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Avg steps</td>
          <td>11</td>
          <td>15</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Avg cofactors</td>
          <td>7.6</td>
          <td>13.3</td>
          <td>10.6</td>
      </tr>
      <tr>
          <td>Cofactors/step</td>
          <td>0.69</td>
          <td>0.89</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>Avg $\Delta_r G&rsquo;^{\circ}$ (kJ/mol)</td>
          <td>$-150.66$</td>
          <td>$-165.82$</td>
          <td>$-196.98$</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Gurobi 11.0.3 (academic license) on a consumer laptop: AMD Ryzen 7 5700U, 16 GB RAM, Windows 11. Full 1000-solution runs for the three benchmark queries completed in just under 18 hours total.</p>
<h3 id="artifacts-and-licensing">Artifacts and licensing</h3>
<ul>
<li>Code and output pathways: <a href="https://github.com/anne-susann/C_fixation_pathway_design">github.com/anne-susann/C_fixation_pathway_design</a> (MIT License)</li>
<li>MØD cheminformatics package (version 1.0.0)</li>
<li>eQuilibrator API version 0.6.0</li>
<li>Gurobi 11.0.3</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Abel, A.-S., Lauber, N., Andersen, J. L., Fagerberg, R., Merkle, D. E., &amp; Flamm, C. (2026). Computational approaches in chemical space exploration for carbon fixation pathways. <em>npj Systems Biology and Applications</em>, 12(1), 17. <a href="https://doi.org/10.1038/s41540-025-00641-8">https://doi.org/10.1038/s41540-025-00641-8</a></p>
<p><strong>Publication</strong>: npj Systems Biology and Applications, 2026</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{abel2026computational,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Computational approaches in chemical space exploration for carbon fixation pathways}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Abel, Anne-Susann and Lauber, Nino and Andersen, Jakob Lykke and Fagerberg, Rolf and Merkle, Daniel Elmar and Flamm, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Systems Biology and Applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17--17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Portfolio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41540-025-00641-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Surge: Fastest Open-Source Chemical Graph Generator</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/surge-chemical-graph-generator/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/surge-chemical-graph-generator/</guid><description>McKay et al. present Surge, an open-source constitutional isomer generator that outperforms MOLGEN by orders of magnitude in speed.</description><content:encoded><![CDATA[<h2 id="a-three-stage-canonical-generation-path">A Three-Stage Canonical Generation Path</h2>
<p>Surge is an open-source constitutional isomer generator that enumerates all possible molecular structures for a given molecular formula. It is built on the <a href="/notes/interdisciplinary/graph-theory/nauty-traces-graph-isomorphism/">nauty</a> package for <a href="https://en.wikipedia.org/wiki/Graph_automorphism">graph automorphism</a> computation and uses a three-stage canonical generation path method that decomposes the enumeration problem into progressively refined graph operations. Surge outperforms the previous state-of-the-art (MOLGEN 5.0) by orders of magnitude in speed while running in under 5 MB of RAM regardless of molecule size.</p>
<h2 id="motivation-the-need-for-fast-open-structure-generators">Motivation: The Need for Fast, Open Structure Generators</h2>
<p>Chemical structure generators are essential for <a href="https://en.wikipedia.org/wiki/Computer-assisted_structure_elucidation">computer-assisted structure elucidation</a> (CASE), virtual library creation, and chemical space enumeration (e.g., <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>&rsquo;s 166.4 billion molecules). MOLGEN had been the gold standard for decades but is closed-source. The previous best open-source alternative, MAYGEN, was roughly 3x slower than MOLGEN. Reymond&rsquo;s lab used an in-house nauty-based generator for GDB-17 but did not release it publicly. Surge fills this gap as a fast, open-source, and extensible alternative.</p>
<h2 id="the-three-stage-algorithm">The Three-Stage Algorithm</h2>
<p>Given a molecular formula (e.g., $\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$), Surge proceeds through three stages:</p>
<p><strong>Stage 1 (geng): Simple graph generation.</strong> Computes all connected simple graphs with the appropriate number of non-hydrogen atoms and edges, subject to maximum degree constraints from the molecular formula. These graphs represent molecular topologies without atom types or bond orders. For Lysopine ($\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$), this produces 534,493 graphs in 1.3 seconds.</p>
<p><strong>Stage 2 (vcolg): Vertex coloring (atom assignment).</strong> Assigns element types (C, N, O, S, etc.) to vertices in all distinct ways, using the automorphism group of each simple graph to avoid generating equivalent assignments. Given a fixed ordering of elements (e.g., $\text{C} &lt; \text{O} &lt; \text{S}$), element assignments are represented as lists $L$ and compared lexicographically. Exactly one representative from each equivalence class is selected by computing the canonical (lexicographically maximal) list:</p>
<p>$$
\text{canon}(L) = \max\{\gamma(L) \mid \gamma \in \text{Aut}(G)\}
$$</p>
<p>A list $L$ is accepted if and only if $\text{canon}(L) = L$, i.e., no automorphism produces a lexicographically larger list. For Lysopine, this expands to 3.0 billion vertex-labeled graphs in 90 seconds.</p>
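<p>The canonicity test can be illustrated in a few lines. This brute-force sketch (our own toy code, not Surge&rsquo;s C implementation) enumerates element assignments on a 3-vertex path and keeps exactly one representative per orbit of the automorphism group:</p>

```python
from itertools import permutations, product

# Path graph 0-1-2; Aut(G) = {identity, reversal}.
edges = {(0, 1), (1, 2)}

def automorphisms(n, edges):
    """All vertex permutations that preserve the edge set (brute force)."""
    norm = {tuple(sorted(e)) for e in edges}
    return [perm for perm in permutations(range(n))
            if {tuple(sorted((perm[u], perm[v]))) for u, v in norm} == norm]

def is_canonical(L, auts):
    """Accept L iff no automorphism produces a lexicographically larger list,
    i.e. canon(L) = L."""
    return all(L >= tuple(L[g[i]] for i in range(len(L))) for g in auts)

auts = automorphisms(3, edges)
# All C/O assignments to the three vertices; exactly one representative per
# equivalence class under Aut(G) survives the canonicity test.
accepted = [L for L in product("CO", repeat=3) if is_canonical(L, auts)]
```

<p>Of the $2^3 = 8$ assignments, two pairs are related by the path reversal, so 6 canonical representatives survive.</p>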
<p><strong>Stage 3 (multig): Edge multiplicity (bond orders).</strong> Assigns bond multiplicities (single, double, triple) to edges, again using automorphism group factorization to avoid duplicates. For Lysopine, this produces 6.0 billion completed molecules in an additional 100 seconds.</p>
<h2 id="efficient-automorphism-handling-via-group-factorization">Efficient Automorphism Handling via Group Factorization</h2>
<p>The key algorithmic innovation is the factorization of the automorphism group:</p>
<p>$$
\text{Aut}(G) = NM = \{\gamma\delta \mid \gamma \in N,\; \delta \in M\}
$$</p>
<p>where $M$ is the &ldquo;minor subgroup&rdquo; generated by transpositions of leaves sharing a common neighbor (&ldquo;flowers&rdquo;), and $N$ is a complete set of coset representatives computed by nauty. A flower is a maximal set of degree-1 vertices (leaves) with the same neighbor. The minor subgroup $M$ is normal in $\text{Aut}(G)$, making the factorization well-defined.</p>
<p><strong>Theorem.</strong> A list $L$ satisfies $L = \text{canon}(L)$ if and only if $L = \max\{\delta(L) \mid \delta \in M\}$ and $L = \max\{\gamma(L) \mid \gamma \in N\}$.</p>
<p>This factorization enables efficient canonicity testing. Maximality under $M$ reduces to enforcing decreasing element order within each flower (simple inequality constraints during recursive assignment). Maximality under $N$ requires explicit testing against the $N$ generators, but $N$ is trivial (identity only) 58% of the time in Stage 2 and 98% of the time in Stage 3.</p>
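<p>The practical payoff is that maximality under $M$ needs no group enumeration at all: it reduces to an ordering constraint inside each flower. A minimal sketch of that check (data layout and names are ours):</p>

```python
# Maximality under the minor subgroup M: within each "flower" (leaves sharing
# a common neighbor), element labels must appear in non-increasing order.
# This replaces explicit enumeration of M with simple inequality checks.
def max_under_minor(L, flowers):
    """L: element label per vertex; flowers: lists of leaf indices that share
    a common neighbor."""
    return all(
        all(L[a] >= L[b] for a, b in zip(f, f[1:]))
        for f in flowers
    )

# Leaves 1, 2, 3 all attached to vertex 0: one flower of size three.
flowers = [[1, 2, 3]]
ok = max_under_minor(("C", "O", "O", "C"), flowers)       # O >= O >= C holds
bad = max_under_minor(("C", "C", "O", "O"), flowers)      # C < O violates order
```
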
<h2 id="benchmark-results">Benchmark Results</h2>
<p>Benchmarked against MOLGEN 5.0 on 30 natural product molecular formulas from the COCONUT database on a compute-optimized c2-standard-4 Google Cloud VM, Surge achieves 7-22 million molecules per second with a memory footprint of at most 5 MB regardless of molecule size. Representative results:</p>
<table>
  <thead>
      <tr>
          <th>Formula</th>
          <th>Isomers</th>
          <th>Surge (s)</th>
          <th>MOLGEN (s)</th>
          <th>Speedup</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$\text{C}_{10}\text{H}_{16}\text{O}_5$</td>
          <td>1.1B</td>
          <td>69</td>
          <td>5,146</td>
          <td>75x</td>
      </tr>
      <tr>
          <td>$\text{C}_9\text{H}_{18}\text{N}_2\text{O}_4$</td>
          <td>6.0B</td>
          <td>289</td>
          <td>27,250</td>
          <td>94x</td>
      </tr>
      <tr>
          <td>$\text{C}_{11}\text{H}_{12}\text{O}_4$</td>
          <td>31.6B</td>
          <td>2,179</td>
          <td>181,725</td>
          <td>83x</td>
      </tr>
      <tr>
          <td>$\text{C}_{10}\text{H}_{13}\text{NO}_5$</td>
          <td>552B</td>
          <td>54,372</td>
          <td>6,325,646</td>
          <td>116x</td>
      </tr>
      <tr>
          <td>$\text{C}_{10}\text{H}_{10}\text{N}_2\text{O}_3$</td>
          <td>1.5T</td>
          <td>83,186</td>
          <td>8,292,585</td>
          <td>100x</td>
      </tr>
      <tr>
          <td>$\text{C}_9\text{H}_{12}\text{N}_2\text{O}_5$</td>
          <td>1.8T</td>
          <td>180,727</td>
          <td>13,983,652</td>
          <td>77x</td>
      </tr>
  </tbody>
</table>
<p>MOLGEN hit its built-in limit of $2^{31} - 1$ structures for most formulas; reported times were linearly extrapolated. Both generators were instructed to generate but not output structures. MOLGEN was run with <code>-noaromaticity</code> for fair comparison since Surge v1.0 lacks aromaticity detection.</p>
<p>Surge supports output in both <a href="https://en.wikipedia.org/wiki/Chemical_table_file">SDfile</a> and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> formats. SMILES output is produced efficiently by constructing a template for each simple graph at Stage 1, so that only atom types and bond multiplicities must be filled in before output.</p>
<p>Surge also supports built-in filters applied during generation (more efficient than post-hoc filtering):</p>
<ul>
<li><code>-p0:1</code>: at most one cycle of length 5</li>
<li><code>-P</code>: the molecule must be planar</li>
<li><code>-B5</code>: no atom has two double bonds and otherwise only hydrogen neighbors</li>
<li><code>-B9</code>: no atom lies on more than one cycle of length 3 or 4</li>
</ul>
<p>These filter options are inspired by corresponding features in MOLGEN. Surge&rsquo;s open-source design also supports a plugin mechanism: users can write small code snippets to insert custom filters into any of the three stages, enabling efficient pruning of the generation tree.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Version 1.0 does not perform <a href="https://en.wikipedia.org/wiki/H%C3%BCckel%27s_rule">Hückel aromaticity</a> detection, so it generates duplicate <a href="https://en.wikipedia.org/wiki/Aromaticity">Kekulé structures</a> for aromatic rings that are graph-theoretically distinct</li>
<li>Benchmarking against MOLGEN required disabling MOLGEN&rsquo;s aromaticity detection (<code>-noaromaticity</code>) for fair comparison</li>
<li>Written in C (from the nauty suite), which limits accessibility compared to Python-based tools, though this is also the source of its speed</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/structuregenerator/surge">Surge on GitHub</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official C implementation from the nauty suite</td>
      </tr>
      <tr>
          <td><a href="https://structuregenerator.github.io">Surge project page</a></td>
          <td>Other</td>
          <td>Apache 2.0</td>
          <td>Project homepage with documentation and binaries</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Status</strong>: Highly Reproducible. Source code, build instructions, and benchmark formulas are all publicly available.</li>
<li><strong>Hardware</strong>: Benchmarks used a compute-optimized c2-standard-4 Google Cloud VM. Surge runs in at most 5 MB of RAM regardless of molecule size.</li>
<li><strong>Build</strong>: Standard Unix Configure/Make scheme producing a standalone command-line executable. Written in portable C from the nauty suite.</li>
<li><strong>Dependencies</strong>: Requires the nauty package (bundled).</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Published</strong>: Journal of Cheminformatics, Volume 14, Article 24, April 23, 2022</li>
<li><strong>Preprint</strong>: ChemRxiv, December 7, 2021</li>
<li><strong>License</strong>: Apache 2.0 (software), Open Access (paper)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mckay2022surge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Surge: a fast open-source chemical graph generator}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{McKay, Brendan D. and Yirik, Mehmet Aziz and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00604-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SpeechT5: Unified Speech-Text Pre-Training Framework</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</guid><description>SpeechT5 introduces a shared encoder-decoder framework with cross-modal vector quantization for joint speech and text pre-training across six tasks.</description><content:encoded><![CDATA[<h2 id="a-unified-encoder-decoder-for-spoken-language-processing">A Unified Encoder-Decoder for Spoken Language Processing</h2>
<p>SpeechT5 is a <strong>Method</strong> paper that introduces a shared encoder-decoder pre-training framework for spoken language processing. Inspired by <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5&rsquo;s</a> text-to-text paradigm, SpeechT5 reformulates all spoken language tasks as &ldquo;speech/text to speech/text&rdquo; problems. The framework uses modal-specific pre-nets and post-nets to interface between raw speech or text and a shared Transformer encoder-decoder, enabling a single pre-trained model to handle six downstream tasks: automatic speech recognition (ASR), text-to-speech synthesis (TTS), speech translation (ST), voice conversion (VC), speech enhancement (SE), and speaker identification (SID).</p>
<h2 id="bridging-the-gap-between-speech-and-text-pre-training">Bridging the Gap Between Speech and Text Pre-Training</h2>
<p>Prior speech pre-training work (wav2vec 2.0, HuBERT) suffered from two key limitations. First, these models learned speech representations from unlabeled audio alone, ignoring the complementary information in text data that is critical for cross-modal tasks like ASR and TTS. Second, they relied on encoder-only architectures with task-specific prediction heads, leaving the decoder un-pretrained for sequence-to-sequence generation tasks.</p>
<p>SpeechT5 addresses both gaps by (1) jointly pre-training on unlabeled speech and text data, and (2) using a full encoder-decoder architecture that benefits generation tasks directly. The approach builds on the observation that speech and text, despite their surface differences, share underlying semantic structure that a unified representation can capture.</p>
<h2 id="cross-modal-vector-quantization-for-alignment">Cross-Modal Vector Quantization for Alignment</h2>
<p>The core innovation in SpeechT5 is a cross-modal <a href="https://en.wikipedia.org/wiki/Vector_quantization">vector quantization</a> (VQ) mechanism that aligns speech and text representations into a shared semantic space. The architecture consists of three components:</p>
<p><strong>Shared encoder-decoder backbone.</strong> A Transformer with 12 encoder blocks and 6 decoder blocks (768-dim, 12 heads), using relative position embeddings.</p>
<p><strong>Modal-specific pre/post-nets.</strong> Six specialized networks handle the conversion between raw modalities and the shared representation space:</p>
<ul>
<li>Speech-encoder pre-net: a convolutional feature extractor (from wav2vec 2.0) downsampling raw waveforms</li>
<li>Speech-decoder pre-net: three FC layers with ReLU, processing 80-dimensional log Mel-filterbank features</li>
<li>Speech-decoder post-net: a linear layer predicting Mel features plus five 1D conv layers (256 channels) for residual refinement, with an x-vector speaker embedding concatenated for multi-speaker support</li>
<li>Text pre/post-nets: shared embedding layers mapping between character-level token indices and hidden states (768-dim)</li>
</ul>
<p><strong>Cross-modal vector quantization.</strong> A shared codebook $\mathbf{C}^{K}$ with $K$ learnable embeddings bridges the two modalities. Encoder outputs $\mathbf{u}_i$ are quantized via nearest-neighbor lookup:</p>
<p>$$
\mathbf{c}_i = \arg\min_{j \in [K]} \| \mathbf{u}_i - \mathbf{c}_j \|_2
$$</p>
<p>A proportion (10%) of contextual representations are randomly replaced with these quantized latent units before being fed to the decoder&rsquo;s cross-attention. This mixing forces the quantizer to capture cross-modal features. A diversity loss encourages full codebook utilization:</p>
<p>$$
\mathcal{L}_d = \frac{1}{K} \sum_{k=1}^{K} p_k \log p_k
$$</p>
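<p>A minimal sketch of the quantization step and diversity loss in the notation above (a pure-Python stand-in, not the released implementation):</p>

```python
import math
import random

# Toy codebook of K entries in D dimensions; dimensions chosen for the sketch.
random.seed(0)
K, D = 4, 3
codebook = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

def quantize(u):
    """Nearest-neighbor lookup: index of the codebook entry minimizing L2 distance."""
    return min(range(K),
               key=lambda j: sum((ui - cj) ** 2 for ui, cj in zip(u, codebook[j])))

def diversity_loss(indices):
    """(1/K) * sum_k p_k log p_k: most negative when codebook usage is uniform,
    so minimizing it encourages full codebook utilization."""
    p = [indices.count(k) / len(indices) for k in range(K)]
    return sum(pk * math.log(pk) for pk in p if pk > 0) / K

encoder_outputs = [[random.gauss(0, 1) for _ in range(D)] for _ in range(32)]
codes = [quantize(u) for u in encoder_outputs]
```

<p>Note that uniform usage of all $K$ entries gives the smallest loss value, which is the behavior the $\gamma$-weighted term rewards during pre-training.</p>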
<h3 id="pre-training-objectives">Pre-Training Objectives</h3>
<p>SpeechT5 combines three pre-training objectives:</p>
<p><strong>Speech pre-training</strong> uses two tasks. A bidirectional masked prediction loss $\mathcal{L}_{mlm}^{s}$ follows HuBERT&rsquo;s approach, masking 8% of timesteps in 10-step spans and predicting frame-level targets from an acoustic unit discovery model:</p>
<p>$$
\mathcal{L}_{mlm}^{s} = \sum_{n \in \mathcal{M}} \log p(\mathbf{z}_n \mid \hat{\mathbf{H}}, n)
$$</p>
<p>A reconstruction loss $\mathcal{L}_{1}^{s}$ minimizes the $L_1$ distance between predicted and original Mel-filterbank features, plus a binary cross-entropy stop-token loss $\mathcal{L}_{bce}^{s}$.</p>
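<p>The masking scheme can be sketched as follows; the 8% rate and 10-step spans are from the paper, while the exact sampling procedure here is an assumption:</p>

```python
import random

# Span masking: each timestep is independently chosen as a span start with
# probability p_start, and each start is expanded to a fixed-length span.
def sample_mask(T, p_start=0.08, span=10, rng=random):
    """Return the sorted set of masked timestep indices for a T-step utterance."""
    masked = set()
    for t in range(T):
        if rng.random() < p_start:
            masked.update(range(t, min(t + span, T)))  # clip at sequence end
    return sorted(masked)

random.seed(0)
mask = sample_mask(1000)
```

<p>Overlapping spans simply merge, so the effective masked fraction is somewhat below <code>p_start * span</code>.</p>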
<p><strong>Text pre-training</strong> uses BART-style denoising, masking 30% of text spans (Poisson $\lambda = 3.5$) and training with maximum likelihood estimation:</p>
<p>$$
\mathcal{L}_{mle}^{t} = \sum_{n=1}^{N^t} \log p(\mathbf{y}_n^t \mid \mathbf{y}_{&lt; n}^t, \hat{\mathbf{X}}^t)
$$</p>
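<p>A rough stand-in for the text corruption step: span lengths drawn from Poisson($\lambda = 3.5$) are replaced by a single mask token until roughly 30% of tokens are covered. The sampling procedure below is our reconstruction, not the released code:</p>

```python
import math
import random

def poisson(lam, rng):
    """Poisson sampling via Knuth's algorithm (stdlib only)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def corrupt(tokens, mask_ratio=0.3, lam=3.5, rng=random):
    """Replace Poisson-length spans with a single <mask> token until the
    masking budget (mask_ratio of the original length) is spent."""
    tokens = list(tokens)
    budget = int(mask_ratio * len(tokens))
    while budget > 0:
        length = max(1, min(poisson(lam, rng), budget))
        start = rng.randrange(len(tokens) - length + 1)
        tokens[start:start + length] = ["<mask>"]
        budget -= length
    return tokens

random.seed(1)
out = corrupt(list("abcdefghijklmnopqrst"))  # 20 tokens, budget = 6
```

<p>Adjacent or overlapping spans can merge into one mask token, as in BART, so the corrupted sequence is shorter than the original.</p>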
<p>The full pre-training loss combines all components:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{mlm}^{s} + \mathcal{L}_{1}^{s} + \mathcal{L}_{bce}^{s} + \mathcal{L}_{mle}^{t} + \gamma \mathcal{L}_d
$$</p>
<p>where $\gamma = 0.1$.</p>
<h2 id="evaluation-across-six-spoken-language-tasks">Evaluation Across Six Spoken Language Tasks</h2>
<p>SpeechT5 was evaluated on six downstream tasks, each using a different combination of the shared encoder-decoder and task-appropriate pre/post-nets:</p>
<h3 id="automatic-speech-recognition-asr">Automatic Speech Recognition (ASR)</h3>
<p>Fine-tuned on LibriSpeech 100h with joint <a href="https://en.wikipedia.org/wiki/Connectionist_temporal_classification">CTC</a>/attention decoding. The decoding objective maximizes a combination of decoder, CTC, and language model log-probabilities:</p>
<p>$$
\alpha \log P_{Dec} + (1 - \alpha) \log P_{CTC} + \beta \log P_{LM}
$$</p>
<p>where $\alpha = 0.5$ and $\beta = 1.0$ for the 100h setting (beam size 30). Results on the test sets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>LM</th>
          <th>test-clean</th>
          <th>test-other</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>-</td>
          <td>6.1</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td>HuBERT BASE</td>
          <td>-</td>
          <td>5.8</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>-</strong></td>
          <td><strong>4.4</strong></td>
          <td><strong>10.4</strong></td>
      </tr>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>Transf.</td>
          <td>2.6</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>Transf.</strong></td>
          <td><strong>2.4</strong></td>
          <td><strong>5.8</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="text-to-speech-synthesis-tts">Text-to-Speech Synthesis (TTS)</h3>
<p>Fine-tuned on LibriTTS 460h clean sets with HiFi-GAN vocoder:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Naturalness</th>
          <th>MOS</th>
          <th>CMOS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ground Truth</td>
          <td>-</td>
          <td>3.87 ± 0.04</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>2.76</td>
          <td>3.56 ± 0.05</td>
          <td>0</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>2.91</strong></td>
          <td><strong>3.65 ± 0.04</strong></td>
          <td><strong>+0.290</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-translation-st">Speech Translation (ST)</h3>
<p>Evaluated on MUST-C English-to-German and English-to-French:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>EN-DE</th>
          <th>EN-FR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fairseq ST</td>
          <td>22.70</td>
          <td>32.90</td>
      </tr>
      <tr>
          <td>Adapter Tuning</td>
          <td>24.63</td>
          <td>34.98</td>
      </tr>
      <tr>
          <td>Baseline (HuBERT init)</td>
          <td>23.43</td>
          <td>33.76</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>25.18</strong></td>
          <td><strong>35.30</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="voice-conversion-vc">Voice Conversion (VC)</h3>
<p>Evaluated on CMU Arctic:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WER (bdl→slt)</th>
          <th>MCD (bdl→slt)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VTN w/ TTS</td>
          <td>7.6%</td>
          <td>6.33</td>
      </tr>
      <tr>
          <td>Many-to-many VTN</td>
          <td>-</td>
          <td>6.13</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>7.8%</strong></td>
          <td><strong>5.93</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-enhancement-se">Speech Enhancement (SE)</h3>
<p>On the WHAM! dataset, SpeechT5 reduced WER from 76.1% (noisy input) to 8.9%, improving on the baseline&rsquo;s 10.9%.</p>
<h3 id="speaker-identification-sid">Speaker Identification (SID)</h3>
<p>On VoxCeleb1, SpeechT5 achieved 96.49% accuracy, outperforming HuBERT LARGE at 90.33% (from SUPERB) and SpeechNet multi-task at 87.90%.</p>
<h2 id="ablation-study-and-key-findings">Ablation Study and Key Findings</h2>
<p>The ablation study reveals the contribution of each pre-training component:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ASR (clean)</th>
          <th>ASR (other)</th>
          <th>VC (MCD)</th>
          <th>SID (ACC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SpeechT5</td>
          <td>4.4</td>
          <td>10.7</td>
          <td>5.93</td>
          <td>96.49%</td>
      </tr>
      <tr>
          <td>w/o Speech PT</td>
          <td>-</td>
          <td>-</td>
          <td>6.49</td>
          <td>38.61%</td>
      </tr>
      <tr>
          <td>w/o Text PT</td>
          <td>5.4</td>
          <td>12.8</td>
          <td>6.03</td>
          <td>95.60%</td>
      </tr>
      <tr>
          <td>w/o Joint PT</td>
          <td>4.6</td>
          <td>11.3</td>
          <td>6.18</td>
          <td>95.54%</td>
      </tr>
      <tr>
          <td>w/o $\mathcal{L}_{mlm}^{s}$</td>
          <td>7.6</td>
          <td>22.4</td>
          <td>6.29</td>
          <td>90.91%</td>
      </tr>
  </tbody>
</table>
<p>Key findings:</p>
<ol>
<li><strong>Speech pre-training is critical</strong>: without it, ASR fails to converge entirely, and SID accuracy drops to 38.61%.</li>
<li><strong>Text pre-training complements speech</strong>: removing it degrades ASR by ~20% relative, confirming that textual knowledge transfers to speech tasks.</li>
<li><strong>Joint pre-training enables cross-modal transfer</strong>: the vector quantization approach is essential for modality-bridging tasks like ASR.</li>
<li><strong>The masked prediction loss $\mathcal{L}_{mlm}^{s}$ is the most important single component</strong>, responsible for learning strong acoustic features.</li>
</ol>
<p>The authors note limitations in the current scope (English-only, BASE model size) and propose scaling to larger models and multilingual settings as future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Speech pre-training</td>
          <td>LibriSpeech</td>
          <td>960 hours</td>
          <td>Full training set</td>
      </tr>
      <tr>
          <td>Text pre-training</td>
          <td>LibriSpeech LM text</td>
          <td>400M sentences</td>
          <td>Normalized language model text</td>
      </tr>
      <tr>
          <td>ASR fine-tuning</td>
          <td>LibriSpeech</td>
          <td>100h / 960h subsets</td>
          <td></td>
      </tr>
      <tr>
          <td>TTS fine-tuning</td>
          <td>LibriTTS</td>
          <td>460h clean sets</td>
          <td></td>
      </tr>
      <tr>
          <td>ST fine-tuning</td>
          <td>MUST-C</td>
          <td>EN-DE, EN-FR</td>
          <td></td>
      </tr>
      <tr>
          <td>VC fine-tuning</td>
          <td>CMU Arctic</td>
          <td>4 speakers</td>
          <td>bdl, clb, slt, rms</td>
      </tr>
      <tr>
          <td>SE fine-tuning</td>
          <td>WHAM!</td>
          <td>16 kHz max</td>
          <td>enhance-single task</td>
      </tr>
      <tr>
          <td>SID fine-tuning</td>
          <td>VoxCeleb1</td>
          <td>100k+ utterances</td>
          <td>1,251 speakers</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with warmup (8% of steps) to peak LR $2 \times 10^{-4}$, then linear decay</li>
<li>Speech masking: 8% of timesteps, 10-step spans</li>
<li>Text masking: 30% of spans, Poisson $\lambda = 3.5$</li>
<li>Vector quantization: 2 codebooks × 100 entries each, giving $100^2 = 10^4$ possible code combinations</li>
<li>CTC/attention joint decoding for ASR (beam size 30)</li>
<li>HiFi-GAN vocoder for TTS and SE waveform generation</li>
<li>Parallel WaveGAN vocoder for VC</li>
</ul>
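<p>As a rough sketch of the pre-training schedule above (linear warmup over the first 8% of steps to a peak of $2 \times 10^{-4}$, then linear decay to zero; the actual fairseq scheduler may differ in details such as a non-zero end learning rate):</p>

```python
def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_frac=0.08):
    """Warmup-then-linear-decay schedule: ramp linearly to peak_lr over the
    first warmup_frac of steps, then decay linearly to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # linear decay from peak_lr (end of warmup) down to 0 (end of training)
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

<p>With the paper's 500k pre-training steps, the peak is reached at step 40k and the rate falls back to zero at step 500k.</p>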
<h3 id="fine-tuning-hyperparameters">Fine-Tuning Hyperparameters</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GPUs</th>
          <th>Steps</th>
          <th>Peak LR</th>
          <th>Batch (per GPU)</th>
          <th>Schedule</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ASR (100h)</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>6e-5</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>ASR (960h)</td>
          <td>8×V100</td>
          <td>320k</td>
          <td>1.3e-4</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>TTS</td>
          <td>8×V100</td>
          <td>120k</td>
          <td>4e-4</td>
          <td>45k tokens</td>
          <td>Warmup 10k steps, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>ST</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>-</td>
          <td>-</td>
          <td>Warmup 10k steps</td>
      </tr>
      <tr>
          <td>VC</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>1e-4</td>
          <td>20k tokens</td>
          <td>6k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SE</td>
          <td>8×V100</td>
          <td>100k</td>
          <td>1e-4</td>
          <td>16k tokens</td>
          <td>10k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SID</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>5e-4</td>
          <td>64 segments (3s each)</td>
          <td>Triangular cyclical (1e-8 to 5e-4)</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<ul>
<li>Encoder: 12 Transformer blocks (768-dim, 3072 FFN, 12 heads)</li>
<li>Decoder: 6 Transformer blocks (same dimensions)</li>
<li>Speech-encoder pre-net: 7 conv blocks (512 channels, strides [5,2,2,2,2,2,2], kernels [10,3,3,3,3,2,2])</li>
<li>Code and pre-trained models available at <a href="https://github.com/microsoft/SpeechT5">github.com/microsoft/SpeechT5</a> (MIT license)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/microsoft/SpeechT5">microsoft/SpeechT5</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official Fairseq-based implementation</td>
      </tr>
      <tr>
          <td>Pre-trained models (via repo)</td>
          <td>Model</td>
          <td>MIT</td>
          <td>SpeechT5 BASE encoder-decoder checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/12">LibriSpeech</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>960h speech pre-training and ASR fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/60">LibriTTS</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>460h TTS fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://ict.fbk.eu/must-c/">MUST-C</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND-4.0</td>
          <td>Speech translation fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://www.festvox.org/cmu_arctic/">CMU Arctic</a></td>
          <td>Dataset</td>
          <td>Free</td>
          <td>Voice conversion fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://wham.whisper.ai/">WHAM!</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-4.0</td>
          <td>Speech enhancement fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html">VoxCeleb1</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA-4.0</td>
          <td>Speaker identification fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 32 NVIDIA V100 GPUs</li>
<li>Batch: ~90s speech per GPU + 12k text tokens per GPU, gradient accumulation 2</li>
<li>Pre-training steps: 500k</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., Wei, Z., Qian, Y., Li, J., &amp; Wei, F. (2022). SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. <em>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, 5723-5738.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ao2022speecht5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5723--5738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2022.acl-long.393}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>nauty and Traces: Graph Isomorphism Algorithms</title><link>https://hunterheidenreich.com/notes/interdisciplinary/graph-theory/nauty-traces-graph-isomorphism/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/graph-theory/nauty-traces-graph-isomorphism/</guid><description>nauty and Traces use individualization-refinement with search tree pruning for graph isomorphism testing and canonical labeling.</description><content:encoded><![CDATA[<h2 id="a-method-paper-on-practical-graph-isomorphism">A Method Paper on Practical Graph Isomorphism</h2>
<p>This is a <strong>Method</strong> paper that brings the published description of nauty (version 2.5) up to date and introduces Traces (version 2.0), a new program for graph isomorphism testing and canonical labeling. The paper provides a unified theoretical framework for the individualization-refinement paradigm that underpins all leading graph isomorphism programs, then details the distinct implementation strategies of nauty and Traces. Extensive benchmarks compare both programs against saucy, Bliss, and conauto across graph families ranging from easy to extremely difficult.</p>
<h2 id="the-graph-isomorphism-problem-in-practice">The Graph Isomorphism Problem in Practice</h2>
<p>An isomorphism between two graphs is a bijection between their vertex sets that preserves adjacency. The graph isomorphism problem (GI) asks whether such a bijection exists. While GI is in NP, it is neither known to be in co-NP nor proven NP-complete. NP-completeness is considered unlikely, as it would imply collapse of the <a href="https://en.wikipedia.org/wiki/Polynomial_hierarchy">polynomial-time hierarchy</a>. The best proven worst-case running time has stood for three decades at $e^{O(\sqrt{n \log n})}$.</p>
<p>In practice, direct isomorphism testing is poorly suited for common tasks like removing duplicates from large graph collections or looking up graphs in databases. The standard approach is <strong>canonical labeling</strong>: relabeling a graph so that isomorphic graphs become identical after relabeling. This allows sorting algorithms and standard data structures to handle isomorph rejection and retrieval.</p>
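<p>To make the canonical-labeling idea concrete, here is a deliberately naive sketch that defines the canonical form as the lexicographically smallest relabeled edge set. It enumerates all $n!$ permutations, which is exactly the cost that individualization-refinement programs like nauty and Traces avoid:</p>

```python
from itertools import permutations

def canonical_form(n, edges):
    """Toy canonical form of an n-vertex graph: the lexicographically smallest
    sorted edge set over all relabelings. Isomorphic graphs map to the same
    value. Exponential in n -- for illustration only."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((perm[u], perm[v])))
                                 for u, v in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best

# Two labelings of the 3-vertex path collapse to one key; the triangle does not.
path_a = [(0, 1), (1, 2)]
path_b = [(2, 0), (0, 1)]
triangle = [(0, 1), (1, 2), (0, 2)]
seen = {canonical_form(3, g) for g in (path_a, path_b, triangle)}
```

<p>With such a key, isomorph rejection reduces to set or dictionary membership, which is exactly the database workflow the paragraph above describes.</p>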
<p>The dominant practical approach is the <strong>individualization-refinement paradigm</strong>, introduced by Parris and Read (1969) and developed by Corneil and Gotlieb (1970). McKay&rsquo;s nauty (1978, 1980) was the first program to handle both structurally regular graphs with hundreds of vertices and graphs with large <a href="https://en.wikipedia.org/wiki/Automorphism_group">automorphism groups</a>. Its key innovation was using discovered automorphisms to prune the search tree. nauty dominated the field for decades until competitors like saucy (2004), Bliss (2007), and conauto (2009) introduced sparse data structures, early refinement abort, and other improvements.</p>
<h2 id="the-individualization-refinement-framework">The Individualization-Refinement Framework</h2>
<p>The paper provides a general formal framework encompassing all leading graph isomorphism algorithms. The core idea has three components: vertex colorings, a search tree built by individualizing vertices, and pruning via node invariants and automorphisms.</p>
<h3 id="colorings-and-refinement">Colorings and Refinement</h3>
<p>A <strong>colouring</strong> of vertex set $V$ is a surjective function $\pi: V \to \{1, 2, \ldots, k\}$. A colouring is <strong>equitable</strong> if any two vertices of the same colour are adjacent to the same number of vertices of each colour. Given any colouring $\pi$, there exists a unique coarsest equitable colouring $\pi'$ with $\pi' \preceq \pi$ (meaning $\pi'$ is finer than or equal to $\pi$). Computing this equitable refinement is the primary computational bottleneck.</p>
<p><strong>Individualization</strong> gives a single vertex a unique colour, then refines:</p>
<p>$$
I(\pi, v)(w) = \begin{cases} \pi(w), &amp; \text{if } \pi(w) &lt; \pi(v) \text{ or } w = v \\ \pi(w) + 1, &amp; \text{otherwise} \end{cases}
$$</p>
<p>The refinement function $R(G, \pi_0, \nu)$ applies equitable refinement after each individualization step for a sequence of vertices $\nu = (v_1, v_2, \ldots)$.</p>
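<p>The individualization operator transcribes directly into code; a minimal sketch, assuming a colouring stored as a vertex-to-colour dictionary with colours $1, \ldots, k$:</p>

```python
def individualize(pi, v):
    """I(pi, v): give vertex v a colour of its own by shifting the colour of
    every other vertex with colour >= pi[v] up by one, per the definition
    I(pi, v)(w) = pi(w) if pi(w) < pi(v) or w == v, else pi(w) + 1."""
    return {w: c if (c < pi[v] or w == v) else c + 1 for w, c in pi.items()}
```

<p>For example, individualizing <code>a</code> in the colouring <code>{"a": 1, "b": 1, "c": 2}</code> leaves <code>a</code> alone in colour 1 and shifts <code>b</code> and <code>c</code> up, yielding <code>{"a": 1, "b": 2, "c": 3}</code>; individualizing a vertex that is already a singleton changes nothing.</p>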
<h3 id="search-tree-and-canonical-forms">Search Tree and Canonical Forms</h3>
<p>The search tree $\mathcal{T}(G, \pi_0)$ is a rooted tree whose nodes are vertex sequences. Starting from the empty sequence at the root, each node extends the sequence by choosing a vertex from a <strong>target cell</strong> (a non-singleton cell of the current colouring). Leaves correspond to discrete colourings (permutations of $V$).</p>
<p>A <strong>canonical form</strong> is a function $C: \mathcal{G} \times \Pi \to \mathcal{G} \times \Pi$ satisfying:</p>
<ul>
<li>$C(G, \pi) \cong (G, \pi)$ (the canonical form is isomorphic to the input)</li>
<li>$C(G^g, \pi^g) = C(G, \pi)$ for all $g \in S_n$ (label-invariance)</li>
</ul>
<p>The canonical form is computed by finding the leaf $\nu^*$ maximizing the node invariant $\phi(G, \pi_0, \nu)$, then applying the corresponding discrete colouring.</p>
<h3 id="tree-pruning">Tree Pruning</h3>
<p>Three pruning operations keep the search tractable:</p>
<ul>
<li><strong>$P_A(\nu, \nu')$</strong>: Remove subtree at $\nu'$ if $\phi(G, \pi_0, \nu) &gt; \phi(G, \pi_0, \nu')$ (invariant comparison)</li>
<li><strong>$P_B(\nu, \nu')$</strong>: Remove subtree at $\nu'$ if $\phi(G, \pi_0, \nu) \neq \phi(G, \pi_0, \nu')$ (inequivalence)</li>
<li><strong>$P_C(\nu, g)$</strong>: Remove subtree at $\nu^g$ if $g \in \text{Aut}(G, \pi_0)$ and $\nu &lt; \nu^g$ (automorphism pruning)</li>
</ul>
<p>Theorem 5 in the paper guarantees that after any sequence of these pruning operations, at least one canonical leaf survives and the discovered automorphisms generate the full automorphism group.</p>
<h2 id="implementation-nauty-vs-traces">Implementation: nauty vs. Traces</h2>
<p>While both programs operate within the same individualization-refinement framework, their implementation strategies differ substantially.</p>
<h3 id="refinement-strategies">Refinement Strategies</h3>
<p>Both nauty and Traces compute equitable colourings using Algorithm 1, which iteratively splits cells based on adjacency counts. For regular graphs (where all vertices have equal degree), the initial colouring is trivially equitable, making these graphs difficult. nauty addresses this with a library of stronger partitioning functions (e.g., triangle counting), which require user expertise to select. Traces instead uses a richer node invariant that often makes stronger refinements unnecessary.</p>
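<p>As a rough illustration of this cell-splitting process (not the optimized implementations in either program), the coarsest equitable refinement can be sketched as repeatedly splitting any cell whose members have differing neighbour counts into some other cell:</p>

```python
def equitable_refinement(adj, cells):
    """Naive equitable refinement in the style of Algorithm 1.
    adj: vertex -> set of neighbours; cells: ordered partition as a list of
    lists. Splits cells by neighbour counts into each cell until stable."""
    changed = True
    while changed:
        changed = False
        for splitter in list(cells):
            new_cells = []
            for cell in cells:
                # group cell members by number of neighbours in the splitter
                groups = {}
                for v in cell:
                    k = len(adj[v] & set(splitter))
                    groups.setdefault(k, []).append(v)
                if len(groups) > 1:
                    changed = True
                new_cells.extend(groups[k] for k in sorted(groups))
            cells = new_cells
            if changed:
                break  # restart with the refined partition
    return cells
```

<p>On the 4-vertex path this separates endpoints from interior vertices; on a cycle (regular, so trivially equitable) it returns the partition unchanged, which is precisely why regular graphs are hard for plain refinement.</p>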
<h3 id="target-cell-selection">Target Cell Selection</h3>
<p>nauty has two strategies: take the first non-singleton cell regardless of size, or take the first cell with the most non-trivial joins to other cells (a join is non-trivial when the two cells share more than zero but fewer than the maximum possible number of edges). An earlier version of nauty preferred the smallest non-singleton cell, hypothesizing that it would more likely correspond to a group orbit, but experiments showed the first non-singleton cell performs better in most cases. Traces prefers <strong>large</strong> target cells, which produce shallower search trees. Specifically, Traces selects the first largest non-singleton cell that is a subset of the parent node&rsquo;s target cell. If no non-singleton cell qualifies, it falls back to the grandparent node&rsquo;s target cell, and so on.</p>
<h3 id="node-invariants-the-trace">Node Invariants: The Trace</h3>
<p>The most consequential difference is in node invariants. nauty computes a single integer $f(\nu)$ at each node, forming a vector $(f([\nu]_0), f([\nu]_1), \ldots, f(\nu))$ for lexicographic comparison. Traces defines $f(\nu)$ as a <strong>vector</strong> encoding the sizes and positions of cells in the order they are created during refinement. This vector-of-vectors structure (the &ldquo;trace,&rdquo; hence the program&rsquo;s name) enables comparison while refinement is still incomplete. For many difficult graph families, only a fraction of refinement operations need to finish before pruning can occur.</p>
<h3 id="tree-scanning-order">Tree Scanning Order</h3>
<p>This is the fundamental architectural difference. nauty uses <strong>depth-first</strong> search, keeping the lexicographically least leaf $\nu_1$ and the leaf $\nu^*$ with the greatest invariant discovered so far. Pruning applies when a node&rsquo;s invariant matches neither.</p>
<p>Traces uses <strong>breadth-first</strong> search, processing all nodes at each level $k$ and retaining only those with the greatest invariant value. By property $(\phi 1)$, the best nodes at level $k$ are children of the best nodes at level $k-1$, so no backtracking is needed. This maximizes pruning operation $P_A$.</p>
<p>To compensate for the fact that breadth-first search delays automorphism discovery (which requires leaves), Traces generates <strong>experimental paths</strong>: random paths from each node down to a leaf. Random experimental paths tend to find automorphisms generating larger subgroups, making more of the group available early for pruning. Both programs maintain discovered automorphisms using the <a href="https://en.wikipedia.org/wiki/Schreier%E2%80%93Sims_algorithm">random Schreier method</a> for efficient orbit computation.</p>
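<p>Discovered automorphisms feed into pruning $P_C$ mainly through the orbits they induce: vertices in the same orbit need not all be explored. A minimal union-find sketch of that orbit computation, standing in for the random Schreier machinery the real programs use:</p>

```python
from collections import defaultdict

def orbits(n, generators):
    """Vertex orbits under the group generated by `generators` (each
    permutation given as a sequence mapping i -> g(i)), via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # union each vertex with its image under every generator
    for g in generators:
        for i in range(n):
            ri, rg = find(i), find(g[i])
            if ri != rg:
                parent[ri] = rg
    classes = defaultdict(list)
    for i in range(n):
        classes[find(i)].append(i)
    return sorted(classes.values())
```

<p>For the 4-vertex path, the single reversal automorphism $(0\,3)(1\,2)$ yields orbits $\{0,3\}$ and $\{1,2\}$; adding a 4-cycle rotation fuses everything into one orbit.</p>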
<h3 id="low-degree-vertex-handling">Low-Degree Vertex Handling</h3>
<p>Traces includes special handling for vertices of degree 0, 1, 2, or $n-1$. After the initial refinement, vertices with equal colours also have equal degrees. The target cell selector never selects cells containing vertices of these low degrees, and nodes whose non-trivial cells consist only of such vertices are not expanded further. Instead, special-purpose code produces generators for the automorphism group fixed by that node and, if needed, a unique discrete colouring. This technique is effective for graphs with many small components and tree-like structures (as in constraint satisfaction problems), though the authors note that such graphs could also benefit from preprocessing that factors out tree-like appendages and replaces vertices with identical neighborhoods.</p>
<h3 id="automorphism-detection">Automorphism Detection</h3>
<p>Beyond leaf comparison, saucy introduced early detection of automorphisms higher in the search tree by checking whether partial mappings between equivalent colourings extend trivially. Traces extends this idea with a heuristic that attempts non-trivial extensions. When computing only the automorphism group (not canonical labeling), Traces employs a strategy where it finds all discrete children of one node and then checks each remaining node for a single matching discrete child, further reducing search effort.</p>
<h2 id="performance-benchmarks">Performance Benchmarks</h2>
<p>The authors compare nauty 2.5, Traces 2.0, saucy 3.0, Bliss 0.72, and conauto 2.0.1 on a MacBook Pro with a 2.66 GHz Intel i7 processor. All graphs were randomly labeled before processing to avoid artifacts from input ordering. The benchmark covers both automorphism group computation and canonical labeling.</p>
<table>
  <thead>
      <tr>
          <th>Graph Family</th>
          <th>Best Program(s)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Random graphs ($p = 1/2$)</td>
          <td>nauty, Traces</td>
          <td>All programs fast; easy class</td>
      </tr>
      <tr>
          <td>Random graphs ($p = n^{-1/2}$)</td>
          <td>nauty</td>
          <td>Sparse random graphs</td>
      </tr>
      <tr>
          <td>Random cubic graphs</td>
          <td>nauty (with invariant)</td>
          <td>nauty benefits from distance invariant</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Hypercube_graph">Hypercubes</a></td>
          <td>Traces</td>
          <td>Vertex-transitive; Traces dramatically faster</td>
      </tr>
      <tr>
          <td>Misc. vertex-transitive</td>
          <td>Traces</td>
          <td>Large automorphism groups</td>
      </tr>
      <tr>
          <td>Unions of tripartite graphs</td>
          <td>conauto, Bliss</td>
          <td>Special handling for disjoint components</td>
      </tr>
      <tr>
          <td>Small strongly-regular graphs</td>
          <td>Traces, nauty</td>
          <td>Both competitive</td>
      </tr>
      <tr>
          <td>Large strongly-regular graphs</td>
          <td>Traces</td>
          <td>Orders of magnitude faster</td>
      </tr>
      <tr>
          <td>Hadamard matrix graphs</td>
          <td>Traces</td>
          <td>Among the hardest known classes</td>
      </tr>
      <tr>
          <td>Random trees</td>
          <td>nauty</td>
          <td>Low-degree preprocessing helps</td>
      </tr>
      <tr>
          <td>Cai-Furer-Immerman graphs</td>
          <td>Traces</td>
          <td>Designed to defeat refinement; Traces still efficient</td>
      </tr>
      <tr>
          <td>Miyazaki graphs</td>
          <td>Traces</td>
          <td>Another hard class; dramatic advantage</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Projective_plane">Projective planes</a> (order 16)</td>
          <td>Traces</td>
          <td>Large automorphism groups on bipartite graphs</td>
      </tr>
      <tr>
          <td>Combinatorial graphs</td>
          <td>Mixed</td>
          <td>Performance varies by instance; Traces generally competitive</td>
      </tr>
  </tbody>
</table>
<p>The results show that nauty is generally fastest for small graphs and some easier families, while Traces dominates on most difficult graph classes, sometimes by orders of magnitude. The breadth-first tree scanning strategy of Traces, combined with its richer node invariant, provides the largest gains on graphs with complex symmetry structure (<a href="https://en.wikipedia.org/wiki/Strongly_regular_graph">strongly-regular graphs</a>, <a href="https://en.wikipedia.org/wiki/Hadamard_matrix">Hadamard matrix</a> graphs, <a href="https://en.wikipedia.org/wiki/Vertex-transitive_graph">vertex-transitive graphs</a>). The exception is graph families with many disjoint or minimally-overlapping components, where conauto and Bliss have specialized handling that nauty and Traces lack.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>The paper establishes several findings:</p>
<ol>
<li>The breadth-first tree scanning approach in Traces, combined with experimental paths for early automorphism discovery, provides large efficiency gains on difficult graph classes.</li>
<li>Traces&rsquo; richer node invariant (the trace) enables early pruning during incomplete refinement, reducing dependence on user-selected invariant functions compared to nauty.</li>
<li>No single program dominates all graph classes. nauty remains preferred for mass processing of small graphs.</li>
<li>The random Schreier method for maintaining the automorphism group is effective in both programs, enabling more complete pruning via orbit computation.</li>
</ol>
<p>Limitations acknowledged by the authors include: nauty and Traces lack specialized code for graphs consisting of disjoint or minimally-overlapping components (where conauto and Bliss excel), and the choice of refinement function in nauty still requires user expertise for certain difficult graph classes.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmarking</td>
          <td>Bliss test collection</td>
          <td>Multiple families</td>
          <td>Graphs ranging from easy to very difficult</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>nauty/Traces website collection</td>
          <td>Multiple families</td>
          <td>All test graphs available at the project website</td>
      </tr>
  </tbody>
</table>
<p>All test graphs are publicly available at the nauty and Traces website. Graphs were randomly labeled before processing to avoid non-typical behavior from input labeling.</p>
<h3 id="algorithms">Algorithms</h3>
<p>The core algorithms are described formally with proofs of correctness (Theorem 5 guarantees pruning validity). Key implementation choices:</p>
<ul>
<li><strong>Refinement</strong>: Equitable colouring via Algorithm 1 (iterated cell splitting by adjacency counts)</li>
<li><strong>Target cell selection</strong>: nauty uses first non-singleton or most non-trivially joined cell; Traces uses first largest cell within parent&rsquo;s target</li>
<li><strong>Tree scanning</strong>: nauty uses depth-first; Traces uses breadth-first with experimental paths</li>
<li><strong>Group maintenance</strong>: Random Schreier method for orbit computation in both programs</li>
</ul>
<h3 id="software">Software</h3>
<table>
  <thead>
      <tr>
          <th>Program</th>
          <th>Version</th>
          <th>Canonical Labeling</th>
          <th>Open Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>nauty</td>
          <td>2.5</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Traces</td>
          <td>2.0</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>saucy</td>
          <td>3.0</td>
          <td>No (v3.0)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Bliss</td>
          <td>0.72</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>conauto</td>
          <td>2.0.1</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://pallini.di.uniroma1.it/">nauty and Traces</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official distribution (v2.9.3 as of 2026); includes gtools graph utilities</td>
      </tr>
      <tr>
          <td><a href="http://pallini.di.uniroma1.it/">Test graphs</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>All benchmark graphs from the paper, available at the project website</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Benchmarks run on a MacBook Pro with 2.66 GHz Intel i7 processor, compiled with gcc 4.7, single-threaded execution.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McKay, B. D., &amp; Piperno, A. (2013). Practical graph isomorphism, II. <em>Journal of Symbolic Computation</em>, 60, 94-112.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mckay2013practical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Practical graph isomorphism, {II}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{McKay, Brendan D. and Piperno, Adolfo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Symbolic Computation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{94--112}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier BV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.jsc.2013.09.003}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Complexity from the GDB Chemical Space</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/gdb-molecular-complexity/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/gdb-molecular-complexity/</guid><description>Buehler &amp; Reymond propose MC1 and MC2, simple graph-based molecular complexity measures derived from GDB chemical space enumeration.</description><content:encoded><![CDATA[<h2 id="molecular-complexity-as-branching-in-the-molecular-graph">Molecular Complexity as Branching in the Molecular Graph</h2>
<p>This paper proposes two simple, interpretable measures of molecular complexity grounded in the observation that most GDB-enumerated molecules are synthetically challenging despite containing only standard functional groups and ring systems. The core insight is that branching points (non-divalent nodes) in the molecular graph correspond to synthesis difficulty: each additional branching point implies a new ring or substituent requiring extra synthetic steps, possible protecting groups, potential stereogenic centers, and increased steric hindrance.</p>
<h2 id="motivation-why-most-gdb-molecules-are-hard-to-make">Motivation: Why Most GDB Molecules Are Hard to Make</h2>
<p>The Generated DataBases (GDBs) enumerate billions of hypothetical small organic molecules by exhaustively substituting atoms and bonds in mathematical graphs. Despite applying filters for ring strain, functional group diversity, <a href="/notes/chemistry/datasets/fdb-17/">fragment-likeness</a>, drug-likeness, and ChEMBL-likeness, most enumerated molecules remain daunting to synthesize. Even in the most restrictive subset (GDB-13s, 99.4 million molecules from the 977 million in GDB-13), practical synthesis remains challenging for most entries. This motivated the search for a complexity measure that captures why these molecules are hard, without relying on reaction databases or machine learning.</p>
<h2 id="mc1-and-mc2-two-graph-based-complexity-measures">MC1 and MC2: Two Graph-Based Complexity Measures</h2>
<p>The two proposed measures are:</p>
<p><strong>MC1</strong> (size-independent): the fraction of non-divalent nodes in the molecular graph.</p>
<p>$$
\text{MC1} = 1 - \text{FDV}
$$</p>
<p>where FDV is the fraction of divalent nodes (e.g., $-\text{CH}_2-$, $=\text{CH}-$, $=\text{C}=$, $-\text{O}-$, $-\text{NH}-$, $=\text{N}-$, $-\text{S}-$) in the molecular graph. The graph is computed by treating the molecule as if all bonds were single and all heavy atoms were carbon. MC1 is independent of molecule size, making it useful for comparing molecules of different sizes.</p>
<p><strong>MC2</strong> (size-dependent): the count of non-divalent nodes, excluding carbonyl carbons in standard carboxyl derivatives.</p>
<p>$$
\text{MC2} = \text{NDV}
$$</p>
<p>where NDV is the number of non-divalent nodes, not counting $\text{C}{=}\text{O}$ in $(\text{X}-\text{C}{=}\text{O})$ for $\text{X} = \text{N}$ or $\text{O}$ (acids, esters, amides, carbonates, carbamates, ureas). MC2 grows with molecule size only when branching increases. Linear extensions (adding divalent atoms to chains or enlarging rings) do not increase MC2.</p>
<p>The rationale for excluding carboxyl groups from MC2 is that their chemistry (amide bond formation, esterification) is well-established and straightforward. Functional groups like amidines, guanidines, thioesters, thiones, sulfoxides, sulfinates, sulfones, and sulfonamides, as well as phosphorus-containing groups, are still counted because their synthesis is less routine.</p>
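<p>A minimal sketch of both measures on a bare heavy-atom adjacency list; note that it deliberately omits the carboxyl-carbonyl exclusion in MC2, which requires element and bond-order information beyond the plain graph:</p>

```python
def mc1_mc2(adj):
    """MC1 and (simplified) MC2 from a heavy-atom adjacency list
    (vertex -> set of neighbours). MC1 = 1 - FDV, the fraction of
    non-divalent (degree != 2) nodes; MC2 here is the raw non-divalent
    count NDV, without the paper's carboxyl-carbonyl exclusion."""
    non_divalent = sum(1 for v in adj if len(adj[v]) != 2)
    return non_divalent / len(adj), non_divalent
```

<p>Cyclohexane (a 6-ring of divalent carbons) scores $\text{MC1} = 0$, $\text{MC2} = 0$; attaching one methyl group creates two non-divalent nodes (the branch carbon and the terminal methyl), giving $\text{MC1} = 2/7 \approx 0.29$ and $\text{MC2} = 2$, matching the intuition that each branch point adds synthetic work.</p>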
<h2 id="design-choices-and-limitations">Design Choices and Limitations</h2>
<p>MC1 and MC2 deliberately do not distinguish between $\text{sp}^2$ and $\text{sp}^3$ branching points or count chiral centers. This choice is motivated by the observation that unusual substitution patterns on aromatic rings in GDB molecules are also synthetically difficult, and that functionalization of aromatic/heteroaromatic rings and control of <a href="https://en.wikipedia.org/wiki/Atropisomer">atropisomerism</a> in biaryls are both challenging. A consequence is that carbohydrates and polyphenols receive high complexity scores despite being abundant in biomass.</p>
<p>MC1 gives uninformative values for very small molecules (trifluoroacetic acid and tert-butanol both score $\text{MC1} = 1$) and for polymers (where the repeating unit dominates). MC2 similarly cannot give useful values for polymers due to its size dependence.</p>
<h2 id="comparison-with-existing-complexity-measures">Comparison with Existing Complexity Measures</h2>
<p>The authors compare MC1 and MC2 against six molecular complexity scores and two synthetic accessibility scores across four databases: GDB-13s, <a href="/notes/chemistry/datasets/zinc-22/">ZINC</a>, ChEMBL, and COCONUT.</p>
<table>
  <thead>
      <tr>
          <th>Measure</th>
          <th>Category</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCFP4</td>
          <td>Complexity</td>
          <td>Number of on-bits in a binary 2048-bit FCFP4 fingerprint</td>
      </tr>
      <tr>
          <td>DataWarrior</td>
          <td>Complexity</td>
          <td>Fractal complexity via Minkowski-Bouligand (box-counting) dimension of distinct substructures up to 7 bonds</td>
      </tr>
      <tr>
          <td>Böttcher</td>
          <td>Complexity</td>
          <td>Shannon entropy using additive atom contributions (valence electrons, atom environment, chirality, symmetry)</td>
      </tr>
      <tr>
          <td>Proudfoot</td>
          <td>Complexity</td>
          <td>Shannon entropy using additive atom contributions (atomic number, connections, paths up to length 2)</td>
      </tr>
      <tr>
          <td>SPS/nSPS</td>
          <td>Complexity</td>
          <td>Spacial score summing heavy atom contributions (hybridization, stereochemistry, nonaromaticity, neighbor count); nSPS normalizes by HAC</td>
      </tr>
      <tr>
          <td>SAscore</td>
          <td>Synthesizability</td>
          <td>Fragment frequency from PubChem combined with complexity penalty (ring types, stereochemistry, size)</td>
      </tr>
      <tr>
          <td>SCS</td>
          <td>Synthesizability</td>
          <td>Machine-learned score from 12 million Reaxys reactions predicting synthesis steps from ECFP4 fingerprint (max value 5)</td>
      </tr>
  </tbody>
</table>
<p>Key findings from the correlation analysis:</p>
<ul>
<li>For GDB-13s (where nearly all molecules have HAC = 13), complexity measures generally do not correlate with each other ($r^2 &lt; 0.6$), except MC1 with MC2 and SPS with nSPS (expected, since each pair differs only in size normalization).</li>
<li>For ZINC, ChEMBL, and COCONUT (spanning a broad range of molecular sizes), several complexity measures correlate with heavy atom count (HAC) and therefore with each other.</li>
<li>Size-independent measures (DataWarrior, nSPS, SCS, SAscore, MC1) are unaffected by molecule size across datasets, while Böttcher and Proudfoot scores are strongly size-dependent. FCFP4 and SPS show partial size dependence.</li>
<li>SPS and nSPS also correlate with SAscore.</li>
</ul>
<p>The analysis is supported by interactive TMAP visualizations (tree-maps organized by MAP4C molecular fingerprint similarity) for 30,000 random molecules from each database, color-coded by each complexity measure. The interactive TMAPs are available online for <a href="https://tm.gdb.tools/MAP4C/GDB-13s_complexity">GDB-13s</a>, <a href="https://tm.gdb.tools/MAP4C/ZINC_complexity">ZINC</a>, <a href="https://tm.gdb.tools/MAP4C/ChEMBL_complexity">ChEMBL</a>, and <a href="https://tm.gdb.tools/MAP4C/COCONUT_complexity">COCONUT</a>.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Ye-Buehler/Molecular_Complexity">Molecular_Complexity</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Python implementation of MC1, MC2, and eight comparison metrics with Jupyter notebooks</td>
      </tr>
  </tbody>
</table>
<p>The paper is open access (hybrid). The GitHub repository provides Python code for computing MC1 and MC2 along with Jupyter notebooks demonstrating all ten complexity and synthesizability measures from Table 1. The four databases used (GDB-13s, ZINC, ChEMBL, COCONUT) are all publicly available. No model training or specialized hardware is involved, as MC1 and MC2 are deterministic graph computations.</p>
<p><strong>Reproducibility status</strong>: Highly Reproducible.</p>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Journal</strong>: Journal of Chemical Information and Modeling, Vol. 65, No. 16, pp. 8405-8410</li>
<li><strong>Published</strong>: May 15, 2025</li>
<li><strong>Part of</strong>: Special issue &ldquo;Chemical Compound Space Exploration by Multiscale High-Throughput Screening and Machine Learning&rdquo;</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{buehler2025view,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A View on Molecular Complexity from the GDB Chemical Space}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Buehler, Ye and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{65}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8405--8410}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.5c00334}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LSTNet: Long- and Short-Term Time Series Network</title><link>https://hunterheidenreich.com/notes/time-series/lstnet-multivariate-time-series/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/time-series/lstnet-multivariate-time-series/</guid><description>LSTNet combines CNNs, recurrent-skip connections, and autoregressive models to capture both short-term and long-term patterns in multivariate time series.</description><content:encoded><![CDATA[<h2 id="a-deep-learning-framework-for-multivariate-forecasting">A Deep Learning Framework for Multivariate Forecasting</h2>
<p>This is a <strong>Method</strong> paper that introduces the Long- and Short-term Time-series Network (LSTNet), a deep learning architecture specifically designed for multivariate time series forecasting. LSTNet combines convolutional neural networks (CNNs), recurrent neural networks (RNNs) with a novel skip-connection structure, and a traditional autoregressive (AR) component into a unified framework. The architecture targets the challenge of simultaneously capturing both short-term local dependencies and long-term periodic patterns in temporal data.</p>
<h2 id="why-short-term-and-long-term-patterns-need-separate-treatment">Why Short-Term and Long-Term Patterns Need Separate Treatment</h2>
<p>Real-world multivariate time series often exhibit a mixture of repeating patterns at different time scales. Highway traffic, for example, shows daily peaks (morning vs. evening commutes) alongside weekly patterns (weekday vs. weekend behavior). Solar energy output varies with cloud movements on short time scales and with seasonal daylight changes on longer ones. Electricity consumption follows similar daily and weekly cycles.</p>
<p>Traditional autoregressive methods (<a href="https://en.wikipedia.org/wiki/Vector_autoregression">VAR</a>, <a href="https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average">ARIMA</a>) and <a href="https://en.wikipedia.org/wiki/Gaussian_process">Gaussian Process</a> models struggle to distinguish and jointly model these two kinds of recurring patterns. Standard RNNs, including LSTM and <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> variants, theoretically handle long-range dependencies but in practice suffer from <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">gradient vanishing</a> when the period length is large (e.g., 24 hours at hourly resolution, or 168 time steps for weekly patterns). The authors also identify a scale sensitivity problem: neural network models can fail when the magnitude of the input signal changes in non-periodic ways, such as sudden shifts in electricity consumption due to holidays or weather events.</p>
<h2 id="combining-cnns-recurrent-skip-connections-and-autoregression">Combining CNNs, Recurrent-Skip Connections, and Autoregression</h2>
<p>The LSTNet architecture consists of four main components that work together.</p>
<h3 id="convolutional-component">Convolutional Component</h3>
<p>The first layer applies 1D convolution without pooling across the multivariate input. Each filter has width $\omega$ (in the time dimension) and height $n$ (spanning all variables), producing feature maps that capture short-term local dependency patterns among variables:</p>
<p>$$h_k = \text{RELU}(W_k * X + b_k)$$</p>
<p>where $*$ denotes convolution and the input is zero-padded so each output vector has length $T$. The output is a $d_c \times T$ matrix where $d_c$ is the number of filters.</p>
<h3 id="recurrent-component">Recurrent Component</h3>
<p>The CNN output feeds into a GRU-based recurrent layer that uses RELU (rather than the standard tanh) as the hidden update activation:</p>
<p>$$\begin{aligned}
r_t &amp;= \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r) \\
u_t &amp;= \sigma(x_t W_{xu} + h_{t-1} W_{hu} + b_u) \\
c_t &amp;= \text{RELU}(x_t W_{xc} + r_t \odot (h_{t-1} W_{hc}) + b_c) \\
h_t &amp;= (1 - u_t) \odot h_{t-1} + u_t \odot c_t
\end{aligned}$$</p>
<h3 id="recurrent-skip-component">Recurrent-Skip Component</h3>
<p>The key architectural innovation is a recurrent structure with temporal skip connections. Instead of connecting to the immediately preceding hidden state $h_{t-1}$, skip links connect to the hidden state from $p$ steps ago ($h_{t-p}$), where $p$ corresponds to the period length of the data (e.g., $p = 24$ for hourly data with daily periodicity):</p>
<p>$$\begin{aligned}
r_t &amp;= \sigma(x_t W_{xr} + h_{t-p} W_{hr} + b_r) \\
u_t &amp;= \sigma(x_t W_{xu} + h_{t-p} W_{hu} + b_u) \\
c_t &amp;= \text{RELU}(x_t W_{xc} + r_t \odot (h_{t-p} W_{hc}) + b_c) \\
h_t &amp;= (1 - u_t) \odot h_{t-p} + u_t \odot c_t
\end{aligned}$$</p>
<p>This design shortens the effective path length for learning periodic dependencies, making optimization easier. A dense layer combines outputs from both recurrent components:</p>
<p>$$h_t^D = W^R h_t^R + \sum_{i=0}^{p-1} W_i^S h_{t-i}^S + b$$</p>
<h3 id="temporal-attention-alternative">Temporal Attention Alternative</h3>
<p>For datasets without clear periodicity, LSTNet offers an attention-based variant (LSTNet-Attn) as an alternative to the recurrent-skip component. The attention mechanism learns to weight hidden representations across the input window adaptively. The attention weights $\alpha_t \in \mathbb{R}^q$ at time $t$ are computed as:</p>
<p>$$\alpha_t = \text{AttnScore}(H_t^R, h_{t-1}^R)$$</p>
<p>where $H_t^R = [h_{t-q}^R, \dots, h_{t-1}^R]$ stacks the RNN hidden representations column-wise and AttnScore is a similarity function (dot product, cosine, or a parameterized MLP). The weighted context vector and final output are:</p>
<p>$$\begin{aligned}
c_t &amp;= H_t \alpha_t \\
h_t^D &amp;= W[c_t; h_{t-1}^R] + b
\end{aligned}$$</p>
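<p>One common reading of this variant, sketched in numpy below, uses dot-product scores softmaxed into weights; the softmax normalization is an assumption on my part, since the text only specifies AttnScore as a similarity function.</p>

```python
import numpy as np

# Sketch of the attention alternative: score each of the q past hidden
# states against the current one, form weights, and mix a context vector.

def attn_context(H, h_last):
    """H: (d, q) past hidden states as columns; h_last: (d,)."""
    scores = H.T @ h_last                 # (q,) dot-product AttnScore
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # attention weights alpha_t (assumed softmax)
    return H @ alpha                      # context vector c_t

def attn_output(H, h_last, W_out, b):
    """Final output W[c_t; h_{t-1}] + b over the concatenated pair."""
    c = attn_context(H, h_last)
    return W_out @ np.concatenate([c, h_last]) + b
```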
<h3 id="autoregressive-component">Autoregressive Component</h3>
<p>To address the scale insensitivity of neural networks, LSTNet adds a classical autoregressive model in parallel:</p>
<p>$$h_{t,i}^L = \sum_{k=0}^{q^{ar}-1} W_k^{ar} y_{t-k,i} + b^{ar}$$</p>
<p>The final prediction integrates both the neural network and AR outputs:</p>
<p>$$\hat{Y}_t = h_t^D + h_t^L$$</p>
<p>This decomposition separates the prediction into a linear part (handling local scale changes) and a non-linear part (capturing recurring patterns).</p>
<h3 id="objective-function">Objective Function</h3>
<p>LSTNet supports two loss functions, selected via validation performance. The default is the squared (L2) loss:</p>
<p>$$\underset{\Theta}{\text{minimize}} \sum_{t \in \Omega_{\text{Train}}} \left| Y_t - \hat{Y}_{t-h} \right|_F^2$$</p>
<p>Motivated by the strong performance of Linear SVR baselines, LSTNet also supports the absolute (L1) loss, which is more robust to anomalies in real time series data:</p>
<p>$$\underset{\Theta}{\text{minimize}} \sum_{t \in \Omega_{\text{Train}}} \sum_{i=0}^{n-1} \left| Y_{t,i} - \hat{Y}_{t-h,i} \right|$$</p>
<p>where $\Theta$ is the full parameter set, $\Omega_{\text{Train}}$ is the set of training time stamps, $|\cdot|_F$ is the Frobenius norm, and $h$ is the forecast horizon.</p>
<h2 id="evaluation-on-four-benchmark-datasets">Evaluation on Four Benchmark Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Length</th>
          <th>Variables</th>
          <th>Sample Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Traffic</td>
          <td>17,544</td>
          <td>862</td>
          <td>1 hour</td>
      </tr>
      <tr>
          <td>Solar-Energy</td>
          <td>52,560</td>
          <td>137</td>
          <td>10 minutes</td>
      </tr>
      <tr>
          <td>Electricity</td>
          <td>26,304</td>
          <td>321</td>
          <td>1 hour</td>
      </tr>
      <tr>
          <td>Exchange-Rate</td>
          <td>7,588</td>
          <td>8</td>
          <td>1 day</td>
      </tr>
  </tbody>
</table>
<p>All datasets are split 60/20/20 (train/validation/test) in chronological order. Traffic, Solar-Energy, and Electricity exhibit clear periodic patterns (daily and weekly), while Exchange-Rate shows only short-term local continuity.</p>
<h3 id="baselines">Baselines</h3>
<p>The authors compare against seven methods: AR (univariate autoregression), LRidge (VAR with L2 regularization), LSVR (VAR with SVR objective), TRMF (temporal regularized matrix factorization), GP (Gaussian Process), VAR-MLP (hybrid MLP-autoregressive), and RNN-GRU (standard GRU).</p>
<h3 id="metrics">Metrics</h3>
<p>Two evaluation metrics are used:</p>
<ul>
<li><strong>Root Relative Squared Error (RSE)</strong> (lower is better): A scaled RMSE that normalizes by the standard deviation of the test data, making comparisons meaningful across datasets regardless of data scale:</li>
</ul>
<p>$$\text{RSE} = \frac{\sqrt{\sum_{(i,t) \in \Omega_{\text{Test}}} (Y_{it} - \hat{Y}_{it})^2}}{\sqrt{\sum_{(i,t) \in \Omega_{\text{Test}}} (Y_{it} - \text{mean}(Y))^2}}$$</p>
<ul>
<li><strong>Empirical Correlation Coefficient (CORR)</strong> (higher is better): The average Pearson correlation between predicted and true time series across all $n$ variables:</li>
</ul>
<p>$$\text{CORR} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_t (Y_{it} - \text{mean}(Y_i))(\hat{Y}_{it} - \text{mean}(\hat{Y}_i))}{\sqrt{\sum_t (Y_{it} - \text{mean}(Y_i))^2 \sum_t (\hat{Y}_{it} - \text{mean}(\hat{Y}_i))^2}}$$</p>
<h3 id="main-results">Main Results</h3>
<p>The models are evaluated at horizons $h \in \{3, 6, 12, 24\}$, corresponding to 3-24 hours for Traffic and Electricity, 30-240 minutes for Solar-Energy, and 3-24 days for Exchange-Rate.</p>
<p>LSTNet-Skip achieved the best result in 17 out of 32 (dataset, metric, horizon) combinations, and LSTNet-Attn won 7 more. No other method won more than 3. At horizon 24, the best LSTNet variant improved over RNN-GRU by 9.2% RSE on Solar-Energy (LSTNet-Attn), 11.7% on Traffic (LSTNet-Skip), and 22.2% on Electricity (LSTNet-Skip). On the Exchange-Rate dataset, which lacks periodic patterns, LSTNet performed comparably to AR and LRidge, as expected.</p>
<h3 id="ablation-study">Ablation Study</h3>
<p>Removing each component individually revealed:</p>
<ul>
<li><strong>Without AR</strong>: The largest performance drops across most datasets, confirming the AR component&rsquo;s role in handling scale changes. Visualization showed that LSTNet-Skip successfully tracks sudden magnitude shifts in electricity consumption around the 1000th hour, while the model without AR fails.</li>
<li><strong>Without Skip/CNN</strong>: Significant drops on datasets with periodic patterns, though less consistent than removing AR.</li>
<li><strong>Full LSTNet</strong>: The most robust configuration across all datasets and horizons.</li>
</ul>
<p>A simulation experiment with synthetic autoregressive data confirmed that standard RNN-GRU fails to track non-periodic scale changes, while LSTNet with its AR component adapts properly.</p>
<h2 id="robust-performance-through-architectural-complementarity">Robust Performance Through Architectural Complementarity</h2>
<p>LSTNet&rsquo;s main strength is the complementarity of its components. The CNN captures short-term local patterns, the recurrent-skip layer captures long-term periodic dependencies, and the AR component provides robustness to scale changes. On datasets with strong periodicity (Traffic, Solar-Energy, Electricity), the skip connections provide large gains. On datasets without periodicity (Exchange-Rate), the AR component prevents degradation below competitive baselines.</p>
<p>The primary limitation is that the skip length $p$ in the recurrent-skip component must be manually specified or tuned. For datasets with known periodicity (e.g., hourly data with daily cycles), $p$ is straightforward to set. For datasets without clear periodicity, $p$ must be tuned as a hyperparameter, and the attention-based variant (LSTNet-Attn) offers an alternative that avoids this requirement. Future work directions include automatic period detection and incorporating variable-level attribute information into the convolutional layer.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>Traffic</td>
          <td>17,544 x 862</td>
          <td>California DoT highway occupancy, hourly, 2015-2016</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Solar-Energy</td>
          <td>52,560 x 137</td>
          <td>Solar power from 137 PV plants in Alabama, 10-min intervals, 2006</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Electricity</td>
          <td>26,304 x 321</td>
          <td>kWh consumption for 321 clients, hourly, 2012-2014</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Exchange-Rate</td>
          <td>7,588 x 8</td>
          <td>Daily exchange rates for 8 countries, 1990-2016</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available via the <a href="https://github.com/laiguokun/LSTNet">GitHub repository</a>.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam</li>
<li>Dropout: 0.1 or 0.2 after each layer except input and output</li>
<li>Window size $q$: grid search over $\{2^0, 2^1, \ldots, 2^9\}$</li>
<li>Skip length $p$: set to 24 for Traffic/Electricity; tuned from $2^1$ to $2^6$ for Solar-Energy/Exchange-Rate</li>
<li>Objective: L2 loss (Eq. 7) or L1 loss (Eq. 9), selected via validation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Hidden dimensions (Recurrent/CNN): $\{50, 100, 200\}$</li>
<li>Hidden dimensions (Recurrent-skip): $\{20, 50, 100\}$</li>
<li>AR regularization: $\{0.1, 1, 10\}$</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best LSTNet RSE</th>
          <th>Baseline (RNN-GRU)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Solar-Energy (h=24)</td>
          <td>0.4403 (Attn)</td>
          <td>0.4852</td>
          <td>9.2%</td>
      </tr>
      <tr>
          <td>Traffic (h=24)</td>
          <td>0.4973 (Skip)</td>
          <td>0.5633</td>
          <td>11.7%</td>
      </tr>
      <tr>
          <td>Electricity (h=24)</td>
          <td>0.1007 (Skip)</td>
          <td>0.1295</td>
          <td>22.2%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/laiguokun/LSTNet">LSTNet (laiguokun/LSTNet)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation (Python 2.7, PyTorch 0.3.0)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/laiguokun/multivariate-time-series-data">Multivariate Time Series Data (laiguokun/multivariate-time-series-data)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Preprocessed benchmark datasets (Traffic, Solar-Energy, Electricity, Exchange-Rate)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code and all four benchmark datasets are publicly available. Hyperparameter search ranges are fully specified.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lai, G., Chang, W.-C., Yang, Y., &amp; Liu, H. (2018). Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. <em>The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval (SIGIR &lsquo;18)</em>, 95-104. <a href="https://doi.org/10.1145/3209978.3210006">https://doi.org/10.1145/3209978.3210006</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lai2018modeling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lai, Guokun and Chang, Wei-Cheng and Yang, Yiming and Liu, Hanxiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The 41st International ACM SIGIR Conference on Research \&amp; Development in Information Retrieval}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{95--104}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3209978.3210006}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DGCNN: Dynamic Graph CNN for Point Cloud Learning</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/dgcnn-dynamic-graph-point-clouds/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/dgcnn-dynamic-graph-point-clouds/</guid><description>EdgeConv module learns point cloud features on dynamically recomputed k-NN graphs in feature space, achieving strong classification and segmentation results.</description><content:encoded><![CDATA[<h2 id="a-general-purpose-edge-convolution-module-for-point-cloud-learning">A General-Purpose Edge Convolution Module for Point Cloud Learning</h2>
<p>This is a <strong>Method</strong> paper that introduces EdgeConv, a neural network module for learning on point clouds. The key idea is to construct a local graph structure and define convolution-like operations over edges connecting neighboring points. Unlike prior <a href="/notes/machine-learning/model-architectures/relational-inductive-biases-deep-learning-graph-networks/">graph neural network approaches</a> that operate on a fixed graph, DGCNN (Dynamic Graph CNN) recomputes the graph at each layer using k-nearest neighbors in feature space. This dynamic graph update allows the network to learn semantic groupings that differ from spatial proximity, enabling information propagation across long distances in the original point cloud. The model achieves strong results on classification (ModelNet40), part segmentation (ShapeNetPart), and semantic segmentation (S3DIS) benchmarks.</p>
<h2 id="why-point-clouds-need-topology-recovery">Why Point Clouds Need Topology Recovery</h2>
<p>Point clouds are the raw output of most 3D acquisition devices (<a href="https://en.wikipedia.org/wiki/Lidar">LiDAR</a>, stereo reconstruction) and serve as the simplest geometric representation for countless applications in graphics, robotics, and autonomous driving. However, point clouds inherently lack topological information: they are unordered sets of points with no connectivity structure.</p>
<p>Standard CNNs require grid-structured input, making them incompatible with irregular point cloud data. Volumetric approaches that discretize point clouds onto 3D grids introduce quantization artifacts and excessive memory usage. PointNet addressed this by operating on each point independently and aggregating with a symmetric function (max pooling), achieving permutation invariance. However, this independence means PointNet cannot capture local geometric structure.</p>
<p>PointNet++ partially addresses this by applying PointNet hierarchically in local neighborhoods, but it constructs neighborhoods based on Euclidean distances in the input space and does not update the graph structure during processing. The fundamental limitation is that treating points independently, even locally, prevents the model from learning the geometric relationships between points that carry important structural and semantic information.</p>
<h2 id="edgeconv-combining-local-geometry-with-global-structure">EdgeConv: Combining Local Geometry with Global Structure</h2>
<p>Given an $F$-dimensional point cloud $\mathbf{X} = \lbrace \mathbf{x}_1, \ldots, \mathbf{x}_n \rbrace \subseteq \mathbb{R}^F$, DGCNN constructs a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ as the $k$-nearest neighbor graph in $\mathbb{R}^F$, including self-loops so each node also points to itself. Edge features are defined as:</p>
<p>$$
\mathbf{x}_i' = \square_{j:(i,j) \in \mathcal{E}} h_\Theta(\mathbf{x}_i, \mathbf{x}_j)
$$</p>
<p>where $h_\Theta$ is a learnable nonlinear function and $\square$ denotes a channel-wise symmetric aggregation operation (e.g., max or sum).</p>
<p>The choice of edge function $h_\Theta$ determines the model&rsquo;s properties. The authors analyze several options:</p>
<table>
  <thead>
      <tr>
          <th>Choice</th>
          <th>Edge function</th>
          <th>Properties</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard convolution</td>
          <td>$\theta_m \cdot \mathbf{x}_j$</td>
          <td>Requires fixed grid structure</td>
      </tr>
      <tr>
          <td>PointNet</td>
          <td>$h_\Theta(\mathbf{x}_i)$</td>
          <td>Global only, ignores local structure</td>
      </tr>
      <tr>
          <td>PointNet++</td>
          <td>$h_\Theta(\mathbf{x}_j)$</td>
          <td>Local only, loses global context</td>
      </tr>
      <tr>
          <td>Local difference</td>
          <td>$h_\Theta(\mathbf{x}_j - \mathbf{x}_i)$</td>
          <td>Local patches without global positioning</td>
      </tr>
      <tr>
          <td><strong>EdgeConv (this work)</strong></td>
          <td>$\bar{h}_\Theta(\mathbf{x}_i, \mathbf{x}_j - \mathbf{x}_i)$</td>
          <td><strong>Both local geometry and global structure</strong></td>
      </tr>
  </tbody>
</table>
<p>The concrete EdgeConv operation uses an asymmetric edge function that combines the point&rsquo;s own features $\mathbf{x}_i$ (global shape structure) with the relative difference $\mathbf{x}_j - \mathbf{x}_i$ (local neighborhood information):</p>
<p>$$
e'_{ijm} = \text{ReLU}(\boldsymbol{\theta}_m \cdot (\mathbf{x}_j - \mathbf{x}_i) + \boldsymbol{\phi}_m \cdot \mathbf{x}_i)
$$</p>
<p>$$
x'_{im} = \max_{j:(i,j) \in \mathcal{E}} e'_{ijm}
$$</p>
<p>where $\boldsymbol{\Theta} = (\theta_1, \ldots, \theta_M, \phi_1, \ldots, \phi_M)$ are learnable parameters. This formulation can be implemented as a shared MLP followed by max pooling over neighbors.</p>
<h3 id="dynamic-graph-recomputation">Dynamic Graph Recomputation</h3>
<p>The defining feature of DGCNN is that the graph $\mathcal{G}^{(l)}$ is recomputed at each layer $l$ using k-NN in the current feature space, rather than being fixed based on input coordinates. This means:</p>
<ul>
<li>The receptive field grows to be as large as the diameter of the point cloud while remaining sparse.</li>
<li>Points that are far apart in Euclidean space but semantically similar (e.g., the two wings of an airplane) become neighbors in deeper feature spaces.</li>
<li>The model learns to construct the graph itself, rather than taking it as a fixed input.</li>
</ul>
<h3 id="permutation-and-translation-invariance">Permutation and Translation Invariance</h3>
<p>EdgeConv is permutation invariant because the max aggregation is a symmetric function. It has a &ldquo;partial&rdquo; translation invariance property: the local difference term $\mathbf{x}_j - \mathbf{x}_i$ is fully translation invariant, while the global term $\boldsymbol{\phi}_m \cdot \mathbf{x}_i$ is translation-dependent. Setting $\boldsymbol{\phi}_m = 0$ yields full translation invariance but loses global positioning information.</p>
<h2 id="benchmarks-classification-part-segmentation-and-scene-segmentation">Benchmarks: Classification, Part Segmentation, and Scene Segmentation</h2>
<h3 id="classification-on-modelnet40">Classification on ModelNet40</h3>
<p>The classification architecture uses four EdgeConv layers with output dimensions (64, 64, 128, 256), $k = 20$ nearest neighbors, and shortcut connections that concatenate all EdgeConv outputs into a $64 + 64 + 128 + 256 = 512$-dimensional per-point feature. A shared fully-connected layer (1024) aggregates these multi-scale features. Global max and sum pooling produce a 1D descriptor, followed by two fully-connected layers (512, 256) with dropout (probability 0.5). All layers use LeakyReLU and batch normalization. Input point clouds are rescaled to fit into the unit sphere.</p>
<p>Training uses SGD with momentum 0.9, initial learning rate 0.1, cosine annealing to 0.001, and batch size 32. Batch normalization momentum is 0.9 with no BN decay. Data augmentation includes random scaling and perturbation of object and point locations. The value of $k$ is selected using an 80/20 train/validation split, then the model is retrained on the full training set.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Class Acc. (%)</th>
          <th>Overall Acc. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PointNet</td>
          <td>86.0</td>
          <td>89.2</td>
      </tr>
      <tr>
          <td>PointNet++</td>
          <td>&ndash;</td>
          <td>90.7</td>
      </tr>
      <tr>
          <td>PointCNN</td>
          <td>88.1</td>
          <td>92.2</td>
      </tr>
      <tr>
          <td>PCNN</td>
          <td>&ndash;</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td><strong>DGCNN (baseline, fixed graph)</strong></td>
          <td><strong>88.9</strong></td>
          <td><strong>91.7</strong></td>
      </tr>
      <tr>
          <td><strong>DGCNN</strong></td>
          <td><strong>90.2</strong></td>
          <td><strong>92.9</strong></td>
      </tr>
      <tr>
          <td><strong>DGCNN (2048 points)</strong></td>
          <td><strong>90.7</strong></td>
          <td><strong>93.5</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="model-complexity">Model Complexity</h3>
<p>DGCNN achieves a favorable tradeoff between model size, inference speed, and accuracy:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Model Size (MB)</th>
          <th>Time (ms)</th>
          <th>Accuracy (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PointNet (baseline)</td>
          <td>9.4</td>
          <td>6.8</td>
          <td>87.1</td>
      </tr>
      <tr>
          <td>PointNet</td>
          <td>40</td>
          <td>16.6</td>
          <td>89.2</td>
      </tr>
      <tr>
          <td>PointNet++</td>
          <td>12</td>
          <td>163.2</td>
          <td>90.7</td>
      </tr>
      <tr>
          <td>PCNN</td>
          <td>94</td>
          <td>117.0</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td>DGCNN (baseline)</td>
          <td>11</td>
          <td>19.7</td>
          <td>91.7</td>
      </tr>
      <tr>
          <td>DGCNN</td>
          <td>21</td>
          <td>27.2</td>
          <td>92.9</td>
      </tr>
  </tbody>
</table>
<p>The DGCNN baseline outperforms PointNet++ by 1.0% while being 7x faster. The full DGCNN outperforms PCNN by 0.6% while being 4x faster with 4.5x fewer parameters.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: center">Centralization</th>
          <th style="text-align: center">Dynamic Graph</th>
          <th style="text-align: center">2048 Points</th>
          <th>Mean Class (%)</th>
          <th>Overall (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td>88.9</td>
          <td>91.7</td>
      </tr>
      <tr>
          <td style="text-align: center">x</td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td>89.3</td>
          <td>92.2</td>
      </tr>
      <tr>
          <td style="text-align: center">x</td>
          <td style="text-align: center">x</td>
          <td style="text-align: center"></td>
          <td>90.2</td>
          <td>92.9</td>
      </tr>
      <tr>
          <td style="text-align: center">x</td>
          <td style="text-align: center">x</td>
          <td style="text-align: center">x</td>
          <td>90.7</td>
          <td>93.5</td>
      </tr>
  </tbody>
</table>
<p>The choice of $k$ also matters:</p>
<table>
  <thead>
      <tr>
          <th>$k$</th>
          <th>Mean Class Acc. (%)</th>
          <th>Overall Acc. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5</td>
          <td>88.0</td>
          <td>90.5</td>
      </tr>
      <tr>
          <td>10</td>
          <td>88.9</td>
          <td>91.4</td>
      </tr>
      <tr>
          <td>20</td>
          <td>90.2</td>
          <td>92.9</td>
      </tr>
      <tr>
          <td>40</td>
          <td>89.4</td>
          <td>92.4</td>
      </tr>
  </tbody>
</table>
<p>$k = 20$ performs best on 1024 points. Larger $k$ (e.g., 40) degrades performance because Euclidean distance poorly approximates geodesic distance at larger scales for a given point density.</p>
<h3 id="part-segmentation-on-shapenetpart">Part Segmentation on ShapeNetPart</h3>
<p>On the ShapeNetPart dataset (16,881 shapes, 16 categories, 50 part labels), DGCNN achieves 85.2% mean IoU, comparable to PointNet++ (85.1%) and PointCNN (86.1%). The model also demonstrates robustness to partial data, maintaining reasonable segmentation quality even when half of the points are removed.</p>
<h3 id="indoor-scene-segmentation-on-s3dis">Indoor Scene Segmentation on S3DIS</h3>
<p>On the Stanford Large-Scale 3D Indoor Spaces Dataset (6 indoor areas, 272 rooms, 13 semantic categories), DGCNN achieves 56.1% mean IoU and 84.1% overall accuracy using 6-fold cross-validation over the areas, outperforming PointNet (47.6% / 78.5%) and producing smoother segmentation boundaries. Each point is represented as a 9D vector (XYZ, RGB, and normalized spatial coordinates), with 4,096 points sampled per $1\text{m} \times 1\text{m}$ block during training.</p>
<h2 id="semantic-feature-spaces-and-future-directions">Semantic Feature Spaces and Future Directions</h2>
<p>A key qualitative finding is that the feature spaces learned by DGCNN in deeper layers capture semantic similarity rather than spatial proximity. Visualizations show that semantically similar structures (e.g., all legs of a table, or all wings of an airplane) are brought close together in feature space, even when they are far apart in the original 3D embedding. This property also transfers across shapes: features from one airplane&rsquo;s wing are close to the wing features of a different airplane in the learned feature space.</p>
<p>The authors identify several directions for future work:</p>
<ul>
<li><strong>Efficiency</strong>: Incorporating fast data structures (e.g., KD-trees) instead of computing pairwise distances for k-NN queries.</li>
<li><strong>Higher-order relationships</strong>: Considering tuples of points rather than only pairwise relationships.</li>
<li><strong>Non-shared transformations</strong>: Applying different transformations to different local patches rather than using shared weights.</li>
<li><strong>Abstract point clouds</strong>: Extending the approach to non-geometric applications like document retrieval and image processing, where the role of geometry in abstract feature spaces may provide new insights.</li>
</ul>
<p>The model has some limitations. On S3DIS, PointCNN achieves notably higher mean IoU (65.39% vs. 56.1%), suggesting room for improvement on large-scale scene segmentation. The dynamic k-NN computation adds overhead relative to fixed-graph approaches, though the overall model remains efficient.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification</td>
          <td>ModelNet40</td>
          <td>12,311 CAD models (40 categories)</td>
          <td>1,024 points uniformly sampled per model</td>
      </tr>
      <tr>
          <td>Part Segmentation</td>
          <td>ShapeNetPart</td>
          <td>16,881 shapes (16 categories, 50 parts)</td>
          <td>2,048 points per shape</td>
      </tr>
      <tr>
          <td>Scene Segmentation</td>
          <td>S3DIS</td>
          <td>272 rooms (13 categories)</td>
          <td>4,096 points per 1m x 1m block</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>k-NN graph construction</strong>: Pairwise distance matrix in feature space, $k = 20$ (classification) or $k = 40$ (2048 points).</li>
<li><strong>EdgeConv</strong>: Shared MLP on concatenated $[\mathbf{x}_i, \mathbf{x}_j - \mathbf{x}_i]$ features, followed by channel-wise max pooling over neighbors.</li>
<li><strong>Dynamic graph update</strong>: Graph recomputed from k-NN in feature space at each EdgeConv layer.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Classification</strong>: 4 EdgeConv layers (64, 64, 128, 256) + shortcut concatenation (512-dim) + shared FC (1024) + global max/sum pooling + FC (512, 256). 21 MB.</li>
<li><strong>Segmentation</strong>: Spatial transformer + 3 EdgeConv layers + shared FC (1024) aggregation + shortcut connections + FC (256, 256, 128).</li>
<li>All layers use LeakyReLU and batch normalization. Dropout 0.5 in final FC layers.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>DGCNN</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ModelNet40 Classification</td>
          <td>Overall Accuracy</td>
          <td>92.9%</td>
          <td>92.3% (PCNN)</td>
      </tr>
      <tr>
          <td>ShapeNetPart Segmentation</td>
          <td>Mean IoU</td>
          <td>85.2%</td>
          <td>86.1% (PointCNN)</td>
      </tr>
      <tr>
          <td>S3DIS Scene Segmentation</td>
          <td>Mean IoU</td>
          <td>56.1%</td>
          <td>65.39% (PointCNN)</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/WangYueFt/dgcnn">WangYueFt/dgcnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official TensorFlow and PyTorch implementations</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training used NVIDIA TITAN X GPUs. Distributed training (2 GPUs) for part segmentation.</li>
<li>Forward pass time: 27.2 ms per sample (1,024 points) on a single GPU.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., &amp; Solomon, J. M. (2019). Dynamic Graph CNN for Learning on Point Clouds. <em>ACM Transactions on Graphics</em>, 38(5), Article 146. <a href="https://doi.org/10.1145/3326362">https://doi.org/10.1145/3326362</a></p>
<p><strong>Code</strong>: <a href="https://github.com/WangYueFt/dgcnn">github.com/WangYueFt/dgcnn</a> (MIT License)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wang2019dynamic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dynamic Graph CNN for Learning on Point Clouds}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Yue and Sun, Yongbin and Liu, Ziwei and Sarma, Sanjay E. and Bronstein, Michael M. and Solomon, Justin M.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACM Transactions on Graphics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">articleno</span>=<span style="color:#e6db74">{146}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3326362}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Conformation Autoencoder for 3D Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/conformation-autoencoder/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/conformation-autoencoder/</guid><description>An autoencoder that maps 3D molecular conformations to a continuous latent space using internal coordinates and graph attention networks.</description><content:encoded><![CDATA[<h2 id="a-method-for-learning-conformation-embeddings">A Method for Learning Conformation Embeddings</h2>
<p>This is a <strong>Method</strong> paper that introduces an autoencoder architecture for molecular conformations. The model converts the discrete 3D spatial arrangement of atoms (a conformation) in a given molecular graph into a continuous, fixed-size latent representation and back. The approach uses <a href="https://en.wikipedia.org/wiki/Z-matrix_(chemistry)">internal coordinates</a> (bond lengths, bond angles, dihedral angles) as input rather than Cartesian coordinates, making the representation inherently invariant to rigid translations and rotations.</p>
<h2 id="why-3d-structure-matters-for-molecular-modeling">Why 3D Structure Matters for Molecular Modeling</h2>
<p>Most deep learning methods for molecules operate on 2D representations: molecular graphs (atoms as nodes, bonds as edges) or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. These representations capture connectivity and atom types but do not encode the 3D spatial arrangement of atoms. Many important molecular properties, such as the ability to fit inside a protein binding pocket or the shape-dependent pharmacological effect, depend on the molecule&rsquo;s possible energetically stable spatial arrangements (conformations).</p>
<p>Prior work has addressed either property prediction from fixed conformations (SchNet, Schütt et al., 2018) or conformation generation for a given molecular graph (Mansimov et al., 2019; Simm and Hernández-Lobato, 2019). This paper addresses a different gap: learning a continuous, fixed-size embedding of a conformation that is independent of molecule size and atom ordering, enabling both reconstruction and generation.</p>
<h2 id="internal-coordinates-and-set-based-encoding">Internal Coordinates and Set-Based Encoding</h2>
<p>The core innovation is a two-part architecture: a conformation-independent graph neural network and a conformation-dependent encoder/decoder that operates on internal coordinates.</p>
<h3 id="internal-coordinate-representation">Internal Coordinate Representation</h3>
<p>Instead of Cartesian coordinates, conformations are represented as a set of internal coordinates:</p>
<p>$$
\Xi = (\mathcal{D}, \Phi, \Psi)
$$</p>
<p>where $\mathcal{D} = \{d_1, \ldots, d_{N_\mathcal{D}}\}$ are bond lengths, $\Phi = \{\phi_1, \ldots, \phi_{N_\Phi}\}$ are bond angles, and $\Psi = \{\psi_1, \ldots, \psi_{N_\Psi}\}$ are dihedral angles. This representation is invariant to rotations and rigid translations and can always be converted to and from Cartesian coordinates.</p>
<h3 id="molecular-graph-encoder">Molecular Graph Encoder</h3>
<p>A Graph Neural Network extracts conformation-independent node embeddings from the molecular graph. The molecular graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ uses node features $v_i \in \mathbb{R}^{F_v}$ encoding atom properties (element type, charge) and edge features $\mathbf{e}_{i,j} \in \mathbb{R}^{F_e}$ encoding bond type (single, double, triple, or aromatic). The architecture combines an edge-conditioned convolution (EConv) layer to encode bond-type information with multiple Graph Attention Network (GAT) layers:</p>
<p>$$
\mathbf{h}_i^l = \mathbf{GAT}^{l-1} \circ \cdots \circ \mathbf{GAT}^1 \circ \text{EConv}(\mathbf{h}_i^0)
$$</p>
<p>where $\mathbf{h}_i^0 = v_i \in \mathbb{R}^{F_v}$ are the initial atom features. The GAT attention coefficients are:</p>
<p>$$
\alpha_{i,j} = \frac{\exp\left(\sigma\left(\mathbf{a}^T [\boldsymbol{\Theta}\mathbf{h}_i | \boldsymbol{\Theta}\mathbf{h}_j]\right)\right)}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\left(\sigma\left(\mathbf{a}^T [\boldsymbol{\Theta}\mathbf{h}_i | \boldsymbol{\Theta}\mathbf{h}_k]\right)\right)}
$$</p>
<p>Each GAT layer updates node embeddings using the attention weights:</p>
<p>$$
\mathbf{h}'_i = \alpha_{i,i}\boldsymbol{\Theta}\mathbf{h}_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\boldsymbol{\Theta}\mathbf{h}_j
$$</p>
<p>The EConv layer incorporates edge (bond-type) information via a learned filter:</p>
<p>$$
\mathbf{h}'_i = \boldsymbol{\Theta}\mathbf{h}_i + \sum_{j \in \mathcal{N}(i)} \mathbf{h}_j \cdot \mathrm{f}_{\boldsymbol{\Theta}}(\mathbf{e}_{i,j})
$$</p>
<p>where $\mathrm{f}_{\boldsymbol{\Theta}}$ is a multi-layer perceptron.</p>
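<p>A single-head NumPy sketch of the GAT update above (with $\sigma$ as GAT&rsquo;s LeakyReLU, and the softmax taken over $\mathcal{N}(i) \cup \{i\}$) may help; this is a toy re-derivation from the formulas, not the authors&rsquo; code:</p>

```python
import numpy as np

def gat_layer(h, adj, theta, a, negative_slope=0.2):
    """One single-head GAT layer: attention over neighbors plus self.

    h: (n, f) node embeddings; adj: (n, n) boolean adjacency (no self-loops);
    theta: (f, f_out) shared linear map; a: (2 * f_out,) attention vector.
    """
    n = h.shape[0]
    th = h @ theta                                        # Θh for all nodes
    # e_ij = LeakyReLU(a^T [Θh_i | Θh_j]) for every ordered pair (i, j)
    e = np.concatenate([np.repeat(th[:, None, :], n, axis=1),
                        np.repeat(th[None, :, :], n, axis=0)], axis=-1) @ a
    e = np.where(e > 0, e, negative_slope * e)
    mask = adj | np.eye(n, dtype=bool)                    # N(i) ∪ {i}
    e = np.where(mask, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))      # stable softmax rows
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ th                                     # h'_i = Σ_j α_ij Θh_j

rng = np.random.default_rng(2)
h = rng.normal(size=(5, 4))
adj = np.zeros((5, 5), dtype=bool)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:  # a 5-atom chain
    adj[i, j] = adj[j, i] = True
out = gat_layer(h, adj, rng.normal(size=(4, 8)), rng.normal(size=16))
assert out.shape == (5, 8)
```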
<h3 id="permutation-invariant-conformation-encoder">Permutation-Invariant Conformation Encoder</h3>
<p>The conformation encoder uses a Deep Sets-style architecture (Zaheer et al., 2017) to achieve permutation invariance. Three separate neural networks encode each type of internal coordinate, conditioned on the corresponding node embeddings:</p>
<p>$$
z_\Xi = \frac{1}{N_\mathcal{D} + N_\Phi + N_\Psi} \left(\sum_{d \in \mathcal{D}} \rho_\Theta^{(\mathcal{D})}(\mathcal{H}, d) + \sum_{\phi \in \Phi} \rho_\Theta^{(\Phi)}(\mathcal{H}, \phi) + \sum_{\psi \in \Psi} \rho_\Theta^{(\Psi)}(\mathcal{H}, \psi)\right)
$$</p>
<p>Each encoding function $\rho_\Theta$ takes both the internal coordinate value and the node embeddings of the involved atoms as input. The resulting conformation embedding $z_\Xi \in \mathbb{R}^{F_z}$ has a fixed dimensionality regardless of molecule size.</p>
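<p>The size-independence of $z_\Xi$ follows directly from the sum-then-average structure. A deliberately crude sketch (a fixed nonlinearity stands in for each $\rho_\Theta$, and the conditioning on node embeddings $\mathcal{H}$ is omitted):</p>

```python
import numpy as np

F_Z = 32  # fixed embedding size (the exact value is unspecified in the paper)
rng = np.random.default_rng(3)

# Stand-ins for the three encoders ρ_Θ: each maps one scalar coordinate to R^F_Z.
encoders = {kind: rng.normal(size=F_Z) for kind in ("bond", "angle", "dihedral")}

def encode_conformation(coords):
    """Deep Sets pooling: mean of per-coordinate encodings -> fixed-size z."""
    parts = [np.tanh(v * encoders[kind])
             for kind, values in coords.items() for v in values]
    return np.mean(parts, axis=0)

small = {"bond": [1.09, 1.53], "angle": [1.91], "dihedral": [3.05]}
big = {"bond": rng.uniform(1.0, 1.8, 40),
       "angle": rng.uniform(1.5, 2.2, 60),
       "dihedral": rng.uniform(-np.pi, np.pi, 30)}

# Embedding dimensionality is independent of molecule size, and summation makes
# the encoding invariant to the order in which coordinates are listed:
assert encode_conformation(small).shape == encode_conformation(big).shape == (F_Z,)
```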
<h3 id="conformation-decoder-and-loss">Conformation Decoder and Loss</h3>
<p>Three decoder networks $\delta_\Theta^{(\mathcal{D})}$, $\delta_\Theta^{(\Phi)}$, and $\delta_\Theta^{(\Psi)}$ reconstruct internal coordinates from the conformation embedding, conditioned on the node embeddings. The reconstruction loss is:</p>
<p>$$
\mathcal{C}_\Xi = \frac{1}{N_\mathcal{D}} \sum_{d \in \mathcal{D}} |d - \hat{d}|_2^2 + \frac{1}{N_\Phi} \sum_{\phi \in \Phi} |\phi - \hat{\phi}|_2^2 + \frac{1}{N_\Psi} \sum_{\psi \in \Psi} \min\left(|\psi - \hat{\psi}|, 2\pi - |\psi - \hat{\psi}|\right)^2
$$</p>
<p>The dihedral angle loss uses a periodic distance to account for angular periodicity. The model can be extended to a <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder (VAE)</a> by applying the reparameterization trick from Kingma and Welling (2013).</p>
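<p>The periodic dihedral term can be sketched directly (a minimal NumPy version; the point is measuring the angular error the short way around the circle before squaring):</p>

```python
import numpy as np

def dihedral_loss(psi, psi_hat):
    """Periodic squared error between two dihedral angles in radians."""
    d = np.abs(psi - psi_hat) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d) ** 2

# Two angles 0.1 rad apart across the ±π wrap-around incur a small loss,
# not the huge naive squared difference:
a, b = np.pi - 0.05, -np.pi + 0.05
assert np.isclose(dihedral_loss(a, b), 0.1 ** 2)
assert (a - b) ** 2 > 9.0  # naive squared error would dominate the loss
```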
<h2 id="conformer-generation-and-spatial-optimization-experiments">Conformer Generation and Spatial Optimization Experiments</h2>
<h3 id="dataset-and-training">Dataset and Training</h3>
<p>The model was trained on the PubChem3D dataset (Bolton et al., 2011), which contains organic molecules with up to 50 heavy atoms, each with multiple conformations generated by the OMEGA conformer-generation software (Hawkins et al., 2010).</p>
<h3 id="reconstruction-quality">Reconstruction Quality</h3>
<p>Upon convergence, the model reconstructs conformations with low RMSD to the input. The median energetic difference between input and reconstructed conformations is approximately 80 kcal/mol (evaluated using the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> forcefield via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>), corresponding to small deviations from local minima without atom clashes.</p>
<h3 id="latent-space-structure">Latent Space Structure</h3>
<p>The learned latent space exhibits meaningful clustering: similar conformations map to nearby points, while distinct conformations separate. Principal component analysis of 200 conformations of a small molecule reveals clear conformational clusters in the first two principal components.</p>
<h3 id="conformer-generation-via-vae">Conformer Generation via VAE</h3>
<p>The variational autoencoder variant can sample diverse conformers from the learned distribution. Comparing the average inter-conformer RMSD (icRMSD) for 200 sampled conformers per molecule against the ETKDG algorithm (Riniker and Landrum, 2015) implemented in RDKit, the model achieves comparable diversity, with an average icRMSD only 0.07 Angstrom higher than ETKDG&rsquo;s.</p>
<h3 id="multi-objective-molecular-optimization">Multi-Objective Molecular Optimization</h3>
<p>By combining the conformation embedding with a continuous molecular structure embedding (<a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a>, Winter et al., 2019), the model enables joint optimization over both molecular graph and conformation. Using <a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">particle swarm optimization</a> (Kennedy and Eberhart, 1995) to maximize QED (drug-likeness, values between 0 and 1) and asphericity (deviation from spherical shape, values between 0 and 1), starting from aspirin (combined score 0.76), the method finds molecules with a combined score of 1.82 after 50 iterations.</p>
<h2 id="compact-conformation-encoding-with-practical-applications">Compact Conformation Encoding with Practical Applications</h2>
<p>The conformation autoencoder produces fixed-size latent representations of molecular 3D structures that are invariant to molecule size, atom ordering, and rigid transformations. The key findings are:</p>
<ol>
<li><strong>Meaningful latent space</strong>: Conformational similarity is preserved in the embedding space, enabling clustering and interpolation.</li>
<li><strong>Diverse conformer generation</strong>: The VAE variant generates conformer ensembles with diversity comparable to established force-field-based methods.</li>
<li><strong>Joint optimization</strong>: Combining conformation and structure embeddings enables multi-objective optimization over both molecular graph and spatial arrangement.</li>
</ol>
<p>Limitations include the limited scope of the energy evaluation (MMFF94 only), the lack of comparison with quantum mechanical references, and the proof-of-concept nature of the spatial optimization experiments. The approach also relies on the quality of the internal coordinate representation, which may lose information about ring conformations and other constrained geometries.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PubChem3D</td>
          <td>Multiple conformations per molecule</td>
          <td>Organic molecules, up to 50 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>PubChem3D holdout</td>
          <td>Subset</td>
          <td>Same distribution as training</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Graph Neural Network: EConv + multiple GAT layers</li>
<li>Conformation encoder: Deep Sets architecture with three coordinate-specific encoders</li>
<li>VAE: Reparameterization trick for probabilistic sampling</li>
<li>Optimization: Particle Swarm Optimization for multi-objective design</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Conformation-independent: EConv + GAT layers for node embeddings</li>
<li>Conformation-dependent: Three encoder/decoder feed-forward networks per coordinate type</li>
<li>Latent dimension $F_z$ is fixed (exact value not specified in the workshop paper)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Median energy difference</td>
          <td>~80 kcal/mol</td>
          <td>Input conformations</td>
          <td>MMFF94 forcefield</td>
      </tr>
      <tr>
          <td>icRMSD difference vs ETKDG</td>
          <td>+0.07 Angstrom</td>
          <td>ETKDG (RDKit)</td>
          <td>200 conformers per molecule</td>
      </tr>
      <tr>
          <td>Combined QED+asphericity</td>
          <td>1.82</td>
          <td>0.76 (aspirin)</td>
          <td>After 50 optimization iterations</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the workshop paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem3D</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>NIH public database; conformations generated by OMEGA (Hawkins et al., 2010)</td>
      </tr>
      <tr>
          <td><a href="https://arxiv.org/abs/2101.01618">arXiv preprint</a></td>
          <td>Paper</td>
          <td>arXiv license</td>
          <td>6-page workshop paper, open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status: Partially Reproducible.</strong> The training dataset (PubChem3D) is publicly available, and the architecture is described in sufficient detail for reimplementation. No source code, pre-trained weights, or exact hyperparameters (latent dimension $F_z$, learning rate, number of GAT layers) are released. The workshop paper format (6 pages) limits the level of experimental detail provided.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Noé, F., &amp; Clevert, D.-A. (2020). Auto-Encoding Molecular Conformations. <em>Machine Learning for Molecules Workshop, NeurIPS 2020</em>.</p>
<p><strong>Publication</strong>: Machine Learning for Molecules Workshop at NeurIPS 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{winter2021auto,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Auto-Encoding Molecular Conformations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and No\&#39;{e}, Frank and Clevert, Djork-Arn\&#39;{e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2101.01618}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AllChem: Generating and Searching 10^20 Structures</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/allchem-synthetically-accessible-structures/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/allchem-synthetically-accessible-structures/</guid><description>AllChem generates and searches 10^20 synthetically accessible structures by combining synthons from recursive reaction application.</description><content:encoded><![CDATA[<h2 id="combinatorial-synthon-assembly-at-scale">Combinatorial Synthon Assembly at Scale</h2>
<p>AllChem is a computer-aided molecular design system that generates and searches an unprecedentedly large space of synthetically accessible structures (on the order of $10^{20}$). Rather than enumerating molecules from mathematical graphs (as in the <a href="/notes/chemistry/datasets/gdb-17/">GDB databases</a>), AllChem builds its chemical space from real synthetic chemistry: it recursively applies known reactions to commercial building blocks, producing <a href="https://en.wikipedia.org/wiki/Synthon">synthons</a> (structures with open valences of defined reactivity) that combinatorially assemble into complete molecules. Every structure found by a search comes paired with a proposed synthetic route.</p>
<h2 id="motivation-costs-and-benefits-together">Motivation: Costs and Benefits Together</h2>
<p>Most computer-aided molecular design methods focus on predicting biological activity (the benefit) while leaving synthesis feasibility (the cost) to the laboratory chemist. AllChem addresses both simultaneously. Its predecessor, ChemSpace, accessed $\sim 10^{14}$ structures built from simple <a href="https://en.wikipedia.org/wiki/Combinatorial_chemistry">combinatorial libraries</a> (chemist-proposed scaffolds plus commercial side chains), but only about 5% of structures in the medicinal chemistry literature fit that template. AllChem aims to cover roughly 50% of published structures by allowing multi-step synthon generation that produces more complex, non-trivial scaffolds.</p>
<h2 id="the-gensyn-synthon-generator">The gensyn Synthon Generator</h2>
<p>The core component is <code>gensyn</code>, a program that recursively applies a curated set of approximately 100 reactions to approximately 7,000 commercially available building blocks. Each product becomes a new building block for subsequent reaction steps, with recursion bounded primarily by a cumulative synthesis &ldquo;cost&rdquo; limit (roughly five AllChem-type steps per sequence). Structures bearing open valences are collected as synthons. A typical run produces around $5 \times 10^6$ synthons, which combinatorially represent $(5 \times 10^6)^3 \approx 10^{20}$ complete structures with an A-B-C topology.</p>
<p>Key design decisions in gensyn:</p>
<ul>
<li><strong>Reaction curation</strong>: All reactions come from external human-readable text files, based on reactions already practiced by laboratory chemists. Scope constraints are calibrated so that at least 90% of randomly sampled reaction applications appear unchallengeable to synthetic chemists.</li>
<li><strong>Reactive intermediates</strong>: Explicitly represented. For example, amide formation requires three steps: acid chloride to electrophilic synthon, amine to nucleophilic synthon, then coupling.</li>
<li><strong>Protective groups</strong>: Addition and removal are treated as standard reactions.</li>
<li><strong>Concerted cyclizations</strong>: Represented by splitting the ring formation across two complementary synthons with specially labeled open valences.</li>
<li><strong>Bimolecular reactions</strong>: In addition to unimolecular transformations, gensyn performs reactions that combine selected synthons with other synthons, increasing overall structural diversity.</li>
<li><strong>Constraints</strong>: Maximum of one prochiral center (to avoid diastereomeric mixtures), heavy atom count limits for lead-likeness, and a cumulative cost bound on synthetic routes. Each reaction step has a default cost of $-5$, with a cumulative cost floor of $-25$ (roughly five steps per sequence).</li>
</ul>
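<p>The recursion and cost bookkeeping described above can be sketched in a few lines. This is a hypothetical Python toy (gensyn itself was proprietary, and real structures are not strings): each reaction application costs $-5$, and a route is abandoned once its cumulative cost would pass $-25$.</p>

```python
# Hypothetical sketch of gensyn's recursion (toy string "structures", not
# the proprietary Tripos code). Each reaction application costs -5 and a
# route is abandoned past a cumulative cost of -25 (roughly five steps).

STEP_COST = -5
COST_LIMIT = -25

def generate_synthons(building_blocks, reactions):
    """Recursively apply reactions; every product becomes a new substrate."""
    synthons = set()
    stack = [(bb, 0) for bb in building_blocks]   # (structure, cumulative cost)
    while stack:
        structure, cost = stack.pop()
        for react in reactions:
            product = react(structure)
            if product is None:                   # pattern did not match
                continue
            new_cost = cost + STEP_COST
            if new_cost < COST_LIMIT:             # route too expensive
                continue
            synthons.add(product)                 # "*" marks an open valence
            stack.append((product, new_cost))     # feed back as a substrate
    return synthons

# Toy reactions: acid -> acid chloride synthon, amine -> nucleophilic synthon.
chlorinate = lambda s: s + "-COCl*" if s.endswith("COOH") else None
aminate = lambda s: s + "-NH*" if s.endswith("NH2") else None

print(sorted(generate_synthons({"R-COOH", "R-NH2"}, [chlorinate, aminate])))
```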
<h2 id="reaction-description-language">Reaction Description Language</h2>
<p>Reactions are described using an extension of Sybyl Line Notation (SLN), a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-like notation. Each reaction description specifies the structural pattern required in the substrate, the transformation to apply, the reactivity class of resulting open valences, the relative cost, incompatible functional groups, and rules for handling multiple equivalent reactive sites. A separate reactivity table defines which valence classes can react with each other (e.g., nucleophilic with electrophilic).</p>
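<p>As an illustration only (the field names below are invented; the real format is SLN-based, human-readable text), the information carried by one reaction description and by the reactivity table might look like:</p>

```python
# Invented field names illustrating what one reaction description specifies;
# the real system stores these in SLN-based, human-readable text files.
amide_coupling = {
    "substrate_pattern": "C(=O)Cl",         # structural pattern required
    "transform": "acid_chloride_to_amide",  # transformation to apply
    "valence_class": "electrophilic",       # reactivity class of open valence
    "cost": -5,                             # relative cost of the step
    "incompatible_groups": ["OH", "SH"],    # functional groups that interfere
    "multi_site_rule": "first_match_only",  # equivalent reactive sites
}

# A separate reactivity table defines which open-valence classes may join.
REACTIVITY = {("nucleophilic", "electrophilic"),
              ("electrophilic", "nucleophilic")}

def can_join(class_a, class_b):
    return (class_a, class_b) in REACTIVITY

print(can_join("nucleophilic", "electrophilic"))  # True
print(can_join("nucleophilic", "nucleophilic"))   # False
```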
<h2 id="topomer-similarity-search">Topomer Similarity Search</h2>
<p>Searching among $10^{20}$ complete structures relies on topomer shape similarity as a branch-and-bound filter. A query structure is fragmented by breaking acyclic single bonds (individually and pairwise), each fragment is converted to a topomer (a canonical 3D shape), and the topomer is compared against all stored synthons. Topomer comparisons run at tens of thousands per second. Because the vast majority of synthons are individually shape-dissimilar enough to eliminate every complete structure containing them, the search space collapses rapidly. To be acceptable, a product must also have been formed by joining open valences with complementary reactivity.</p>
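<p>A minimal sketch of the branch-and-bound pruning, with an invented scalar stand-in for the topomer shape comparison: if a single synthon fails the per-fragment shape filter, every complete structure containing it is eliminated without ever being assembled.</p>

```python
from itertools import product

# Invented scalar "topomer" comparison standing in for real 3D shape overlap.
def shape_difference(topomer_a, topomer_b):
    return abs(topomer_a - topomer_b)

def search(query_fragments, synthon_db, threshold):
    """Prune per position first, then assemble only the survivors."""
    survivors = {}
    for position, query_topomer in query_fragments.items():
        survivors[position] = [
            s for s in synthon_db[position]
            if shape_difference(s, query_topomer) <= threshold
        ]
    # Only surviving synthons are combined into complete A-B-C structures.
    return list(product(*(survivors[p] for p in sorted(survivors))))

query = {"A": 1.0, "B": 2.0, "C": 3.0}
db = {"A": [0.9, 5.0], "B": [2.1, 9.0], "C": [3.05, -4.0]}
print(search(query, db, threshold=0.2))  # most candidates are never assembled
```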
<p>Validation used repeated &ldquo;self-searches,&rdquo; in which a query structure is assembled from randomly chosen synthons and searched for in the database. On the 250,000-synthon leadhopping database, average self-search time was 7.1 minutes; complete searches of the full-scale database take several hours on standard hardware.</p>
<h2 id="applications-lead-hopping-and-scaffold-generation">Applications: Lead Hopping and Scaffold Generation</h2>
<p><strong>Lead hopping</strong>: Finding structurally novel molecules that are shape-similar (and therefore likely biologically similar) to a query lead. Using a 250,000-synthon leadhopping database, 18 of 19 self-search queries recovered the query structure perfectly (shape difference of 0 topomer units). The remaining query also recovered itself as the closest hit.</p>
<p><strong>Scaffold idea generation</strong>: Filtering the synthon collection for small ($\leq$ 14 heavy atoms), low-chirality scaffolds with at least two diversification sites (primarily through nucleophilic heteroatom reactions on activated carbon electrophiles or <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki-type couplings</a>), UV chromophores, minimal freely rotatable bonds (especially between diversification sites and rings), a ring, and short synthetic paths (all branches fewer than about six AllChem steps). Over 20% of gensyn-proposed synthons pass these scaffold filters, suggesting on the order of $10^6$ accessible and structurally distinct scaffolds, compared to the few thousand scaffolds typically represented in large screening collections.</p>
<h2 id="compute-and-infrastructure">Compute and Infrastructure</h2>
<p>Full-scale synthon database recreation takes approximately one week using two standard workstations (one Oracle database server, one compute engine). The codebase was rewritten from Java to Python for portability and performance. All data is managed through an Oracle relational database, including synthons, intermediates, and a reactions table recording every gensyn conversion.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Variable reactivity of open valences (e.g., weakly nucleophilic amines may not form the implied bond readily) is handled only approximately via reagent class annotations.</li>
<li>Stereospecificity and most aromatic electrophilic substitution reactions are omitted.</li>
<li>The system was described as under active development at the time of publication, giving the paper the character of an interim progress report.</li>
<li>Drug-likeness of 3-synthon products (average MW ~800, CLOGP ~8.0) requires careful filtering of the synthon distribution toward smaller, less lipophilic components.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>AllChem was developed as proprietary software at Tripos Inc. (Tripos Discovery Research, Bude, Cornwall, UK). No source code, synthon databases, or reaction files have been publicly released. The paper functions as a description of the system&rsquo;s architecture and early results rather than a reproducibility-oriented publication.</p>
<ul>
<li><strong>Code</strong>: Not publicly available. The system was proprietary to Tripos Inc.</li>
<li><strong>Data</strong>: Synthon databases and reaction description files are not shared.</li>
<li><strong>Hardware</strong>: Two standard workstations (one Oracle server, one compute engine); no specialized hardware required.</li>
<li><strong>Funding</strong>: NIH/GMS SBIR grant 2 R44 GM068359-02.</li>
</ul>
<p><strong>Reproducibility status</strong>: Closed.</p>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Journal</strong>: Journal of Computer-Aided Molecular Design, Vol. 21, No. 6, pp. 341-350</li>
<li><strong>Published</strong>: January 25, 2007</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cramer2007allchem,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AllChem: generating and searching 10^{20} synthetically accessible structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cramer, Richard D. and Soltanshahi, Farhad and Jilek, Robert J. and Campbell, Brian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Computer-Aided Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{341--350}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Science+Business Media}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1007/s10822-006-9093-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ACSESS: Diverse Optimal Molecules in the SMU</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/acsess-diverse-optimal-molecules/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/chemical-space/acsess-diverse-optimal-molecules/</guid><description>Rupakheti et al. extend ACSESS to find diverse molecules with favorable properties without exhaustive enumeration of chemical space.</description><content:encoded><![CDATA[<h2 id="diversity-biased-search-of-the-small-molecule-universe">Diversity-Biased Search of the Small Molecule Universe</h2>
<p>The small molecule universe (SMU), estimated at over $10^{60}$ synthetically feasible organic molecules under ~500 Da, is far too large for exhaustive enumeration and evaluation. This paper extends the ACSESS (Algorithm for Chemical Space Exploration with Stochastic Search) framework to simultaneously optimize molecular diversity and a targeted physical property. The key insight is that enforcing diversity at each iteration prevents the search from collapsing into local optima, a failure mode common in standard <a href="/notes/chemistry/molecular-design/generation/search-based/genetic-algorithms-molecule-generation-baselines/">genetic algorithms</a>.</p>
<h2 id="motivation-diversity-vs-fitness">Motivation: Diversity vs. Fitness</h2>
<p>Standard genetic algorithms optimize fitness effectively but sacrifice diversity: they converge to a few high-fitness regions while ignoring equally good solutions elsewhere. Exhaustive enumeration guarantees completeness but is computationally infeasible beyond ~20 heavy atoms. ACSESS bridges this gap by maintaining a maximally diverse library throughout the optimization process, ensuring coverage of multiple fitness peaks without needing to evaluate every candidate.</p>
<h2 id="the-property-optimizing-acsess-algorithm">The Property-Optimizing ACSESS Algorithm</h2>
<p>The method has four iterative steps:</p>
<ol>
<li><strong>Initialize</strong> a library (from a single molecule or a seed collection)</li>
<li><strong>Breed</strong> new compounds via mutations and crossovers</li>
<li><strong>Filter</strong> by property threshold, removing compounds below a cutoff</li>
<li><strong>Select</strong> a maximally diverse subset of qualifying structures</li>
</ol>
<p>The property threshold increases linearly with each iteration, starting low (to prevent population collapse) and gradually rising until the desired fitness level is reached. Diversity is enforced via either a maximin algorithm (maximizing nearest-neighbor distance) or cell-based partitioning (linear scaling for large libraries).</p>
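<p>The four-step loop with a linearly rising threshold and greedy maximin selection can be sketched on an abstract population (the breed, fitness, and distance functions below are toy stand-ins, not the paper's implementation):</p>

```python
import random

def maximin_select(pool, distance, k):
    """Greedy maximin: repeatedly add the candidate farthest from the set."""
    selected = [max(pool, key=lambda x: sum(distance(x, y) for y in pool))]
    while len(selected) < k and len(selected) < len(pool):
        selected.append(max(
            (x for x in pool if x not in selected),
            key=lambda x: min(distance(x, s) for s in selected),
        ))
    return selected

def acsess(seed, breed, fitness, distance, k, iters, final_threshold):
    library = list(seed)
    for t in range(1, iters + 1):
        threshold = final_threshold * t / iters              # rises linearly
        pool = library + [breed(random.choice(library)) for _ in range(5 * k)]
        pool = [x for x in pool if fitness(x) >= threshold]  # property filter
        if pool:                                             # diverse subset
            library = maximin_select(pool, distance, k)
    return library

random.seed(0)
library = acsess(
    seed=[0.1, 0.2],
    breed=lambda x: min(1.0, x + random.uniform(0.0, 0.2)),  # toy mutation
    fitness=lambda x: x,
    distance=lambda a, b: abs(a - b),
    k=3, iters=30, final_threshold=0.8,
)
print(library)  # a small, diverse set of high-fitness values
```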
<p>Molecules are represented in a 40-dimensional chemical space using Moreau-Broto autocorrelation descriptors. The descriptor encodes correlations of atomic properties as a function of topological distance (bond distance) $d$:</p>
<p>$$
AC(d, p) = \sum_{i \leq j} p_{i} \, p_{j} \, \delta(d_{ij} - d)
$$</p>
<p>where $p_{i}$ is an atomic property of atom $i$ and $d_{ij}$ is the shortest bond path between atoms $i$ and $j$. Five atomic properties are used: atomic number, Gasteiger-Marsili partial charge, atomic polarizability, topological steric index, and unity ($p_{i} = 1$ for all $i$, effectively counting atom pairs at each distance). Topological distance $d$ ranges from 0 to 7, yielding $5 \times 8 = 40$ descriptor components. Descriptors are mean-centered and normalized to unit variance before computing distances.</p>
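<p>A small sketch of the descriptor for a single atomic property, using BFS for shortest bond paths on a toy adjacency list (the full descriptor stacks five such profiles over $d = 0$ to $7$ to give 40 components):</p>

```python
from collections import deque

def bond_distances(adj):
    """All-pairs shortest bond paths via BFS on an adjacency list."""
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist

def autocorrelation(adj, props, max_d=7):
    """AC(d, p) = sum over pairs i <= j at bond distance d of p_i * p_j."""
    dist = bond_distances(adj)
    ac = [0.0] * (max_d + 1)
    for i in range(len(adj)):
        for j in range(i, len(adj)):
            d = dist[i][j]
            if d is not None and d <= max_d:
                ac[d] += props[i] * props[j]
    return ac

# Propane-like chain C0-C1-C2 with the unity property (p_i = 1):
# d=0 counts atoms, d=1 counts bonds, d=2 counts 1,3-pairs.
print(autocorrelation([[1], [0, 2], [1]], [1.0, 1.0, 1.0]))
```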
<p>Chemical space distance is the Euclidean distance between descriptor vectors:</p>
<p>$$
D_{ij} = \sqrt{\sum_{k=1}^{N} (d_{ik} - d_{jk})^2}
$$</p>
<p>Library diversity is measured as the average nearest-neighbor distance:</p>
<p>$$
D_{\min} = \frac{1}{M} \sqrt{\sum_{i=1}^{M} \min_{j \neq i} (D_{ij}^2)}
$$</p>
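<p>Implementing the two equations above directly (plain Python, no chemistry dependencies):</p>

```python
import math

def euclidean(a, b):
    """Chemical-space distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diversity(vectors):
    """D_min: average nearest-neighbor distance over a library of M vectors."""
    m = len(vectors)
    total = sum(
        min(euclidean(vi, vj) ** 2 for j, vj in enumerate(vectors) if j != i)
        for i, vi in enumerate(vectors)
    )
    return math.sqrt(total) / m

# Three points on a line in descriptor space, nearest neighbors 5.0 apart:
print(diversity([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]))
```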
<h2 id="validation-on-nkp-fitness-landscapes">Validation on NKp Fitness Landscapes</h2>
<p>The <a href="https://en.wikipedia.org/wiki/NK_model">NKp model</a> maps binary strings of length $N$ to fitness values in $[0, 1]$. The fitness of a string $g$ is:</p>
<p>$$
\Phi(g) = \frac{1}{N} \sum_{i=1}^{N} \varphi_{i}(g)
$$</p>
<p>where each $\varphi_{i} \in [0, 1]$ is a randomly drawn fitness contribution. Ruggedness is controlled by $K$ (the number of inter-bit associations per position) and $p$ (the probability that a fitness contribution is zero, which tunes the landscape&rsquo;s neutrality). Using $N = 19$, $K = 9$, $p = 0.9$ (524,288 total strings, comparable to GDB-9 size), the global maximum was ~0.3. Both ACSESS and a standard genetic algorithm (SGA) were initialized with the same diverse subset and run for 30 iterations across 10 independent runs:</p>
<ul>
<li>ACSESS found the global optimum in 100% of runs (vs. 60% for SGA)</li>
<li>ACSESS discovered ~15 of 19 globally optimal strings on average (vs. ~3 for SGA)</li>
<li>ACSESS solutions had higher average fitness than SGA solutions</li>
</ul>
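<p>For intuition, a toy NKp landscape can be sketched as follows. This follows the common convention that $p$ is the probability a fitness contribution is zeroed; it is not the paper&rsquo;s exact implementation, whose conventions may differ:</p>

```python
import random

def make_nkp(n, k, p, seed=0):
    """Toy NKp landscape: each phi_i depends on bit i plus k neighbor bits,
    and each lookup-table entry is zeroed with probability p (neutrality)."""
    rng = random.Random(seed)
    neighbors = [rng.sample([j for j in range(n) if j != i], k)
                 for i in range(n)]
    tables = [[0.0 if rng.random() < p else rng.random()
               for _ in range(2 ** (k + 1))] for _ in range(n)]
    def fitness(bits):
        total = 0.0
        for i in range(n):
            idx = bits[i]
            for j in neighbors[i]:
                idx = (idx << 1) | bits[j]   # index into phi_i's lookup table
            total += tables[i][idx]
        return total / n                     # Phi(g) = mean of contributions
    return fitness

f = make_nkp(n=19, k=9, p=0.9)
print(f([0] * 19), f([1] * 19))  # fitness values in [0, 1], mostly small
```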
<h2 id="validation-on-gdb-9-dipole-moments">Validation on GDB-9 Dipole Moments</h2>
<p>The method was tested on all ~300,000 molecules in GDB-9 (up to 9 heavy atoms; allowed atom types: C, N, O, S, Cl). For each molecule, the Boltzmann-averaged dipole moment was computed at the <a href="https://en.wikipedia.org/wiki/Austin_Model_1">AM1 level</a> (Gaussian 09):</p>
<p>$$
D = \frac{\sum_{i \in C} \mu_{i} \, e^{-\beta E_{i}}}{\sum_{i \in C} e^{-\beta E_{i}}}
$$</p>
<p>where $\mu_{i}$ and $E_{i}$ are the dipole moment and internal energy of conformation $i$, and $\beta = 1 / (k_{\text{B}} T)$ at $T = 298$ K. Conformations (including stereoisomers) were generated using OpenEye OMEGA. The target was molecules with dipole moments $\geq 5.5$ D (the 90th percentile). ACSESS first generated a maximally diverse seed set, then ran 60 iterations of fitness-biased optimization. All methods were initialized from the same diverse seed and compared over multiple runs.</p>
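<p>The Boltzmann average itself is straightforward once per-conformer dipoles and energies are in hand; a sketch assuming dipoles in Debye and energies in kcal/mol:</p>

```python
import math

KB_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_dipole(dipoles, energies, temperature=298.0):
    """Boltzmann-weighted average dipole over conformations."""
    beta = 1.0 / (KB_KCAL * temperature)
    e0 = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e0)) for e in energies]
    return sum(m * w for m, w in zip(dipoles, weights)) / sum(weights)

# Two conformers 2 kcal/mol apart: the lower-energy one dominates the average.
print(boltzmann_dipole([6.0, 2.0], [0.0, 2.0]))
```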
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Dipole Moment (D)</th>
          <th>Diversity (eq. 4)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GA-Roulette</td>
          <td>5.8 $\pm$ 0.03</td>
          <td>6.5 $\pm$ 0.7</td>
      </tr>
      <tr>
          <td>GA-Tournament</td>
          <td>6.4 $\pm$ 0.08</td>
          <td>3.5 $\pm$ 0.7</td>
      </tr>
      <tr>
          <td>GA-Elitism</td>
          <td>6.74 $\pm$ 0.08</td>
          <td>5.4 $\pm$ 0.4</td>
      </tr>
      <tr>
          <td><strong>ACSESS</strong></td>
          <td><strong>6.05 $\pm$ 0.05</strong></td>
          <td><strong>9.7 $\pm$ 0.6</strong></td>
      </tr>
  </tbody>
</table>
<p>ACSESS achieved nearly double the diversity of the best SGA variant while maintaining competitive fitness. Its diversity (~9.7) approached the diversity of the full enumerated high-fitness subset of GDB-9 (~12). <a href="https://en.wikipedia.org/wiki/Self-organizing_map">Self-organizing map</a> (SOM) visualizations confirmed that ACSESS covered high-activity regions that SGAs missed entirely.</p>
<p>Only ~30,000 fitness evaluations were needed to locate diverse optimal regions in the 300,000-molecule space, a 10x efficiency gain over exhaustive enumeration.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Tested only on relatively small chemical spaces (GDB-9 with ~300k molecules and 19-bit NKp with ~500k strings); scaling to the full SMU ($10^{60}$) remains a research direction</li>
<li>Property evaluation (AM1 dipole moments with conformer generation) is the computational bottleneck, not the ACSESS algorithm itself</li>
<li>The 40-dimensional autocorrelation descriptor space may not capture all relevant structural features for every optimization target</li>
<li>Comparison is limited to simple genetic algorithms; more sophisticated evolutionary strategies were not benchmarked</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The ACSESS algorithm relies on proprietary software, limiting full reproducibility.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.1038/sdata.2014.22">GDB-9</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Publicly available enumerated chemical universe (~300k molecules)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Code</strong>: No public source code was released. The implementation depends on OpenEye OEChem TK (molecule generation), OpenEye MolProp TK (filtering), and OpenEye OMEGA TK (conformer generation), all of which require commercial licenses.</li>
<li><strong>Property calculations</strong>: Dipole moments were computed at the AM1 level using Gaussian 09, also commercial software.</li>
<li><strong>NKp landscape</strong>: Fully specified by parameters ($N = 19$, $K = 9$, $p = 0.9$) and standard NKp model equations, making this portion independently reproducible.</li>
<li><strong>Hardware</strong>: No specific compute requirements reported.</li>
<li><strong>Reproducibility status</strong>: Partially Reproducible. The algorithm is well-described and the NKp experiments could be reimplemented, but the molecular experiments require OpenEye and Gaussian 09 licenses, and no reference implementation was released.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<ul>
<li><strong>Journal</strong>: Journal of Chemical Information and Modeling, Vol. 55, No. 3, pp. 529-537</li>
<li><strong>Published</strong>: January 16, 2015</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rupakheti2015strategy,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Strategy To Discover Diverse Optimal Molecules in the Small Molecule Universe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rupakheti, Chetan and Virshup, Aaron M. and Yang, Weitao and Beratan, David N.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{55}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{529--537}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/ci500749q}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DoReMi: Optimizing Data Mixtures for LM Pretraining</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</guid><description>DoReMi uses a small proxy model with distributionally robust optimization to learn domain weights that speed up large-scale language model pretraining by 2.6x.</description><content:encoded><![CDATA[<h2 id="a-method-for-automatic-domain-reweighting">A method for automatic domain reweighting</h2>
<p>This is a <strong>method paper</strong> that introduces Domain Reweighting with Minimax Optimization (DoReMi), an algorithm for automatically tuning the mixture proportions of pretraining data domains. Rather than relying on heuristics or expensive downstream-task-based tuning, DoReMi uses a small proxy model trained with <a href="https://en.wikipedia.org/wiki/Robust_optimization">group distributionally robust optimization (Group DRO)</a> to produce domain weights that transfer to much larger models.</p>
<h2 id="why-data-mixture-proportions-matter">Why data mixture proportions matter</h2>
<p>Language model pretraining datasets combine text from many domains: web crawls, Wikipedia, books, code, academic papers, and others. The mixture proportions (how much of each domain to include) significantly affect downstream performance, but existing approaches either set them by hand (<a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)">The Pile</a> uses heuristic weights) or tune them against downstream tasks (GLaM/PaLM), which is expensive and risks overfitting to a specific evaluation set. No principled, task-agnostic method existed for determining mixture proportions.</p>
<h2 id="minimax-optimization-over-domain-excess-loss">Minimax optimization over domain excess loss</h2>
<p>DoReMi&rsquo;s core insight is to frame data mixture optimization as a minimax problem: find domain weights that minimize the worst-case excess loss across all domains. The algorithm has three steps.</p>
<p><strong>Step 1</strong>: Train a small reference model (280M parameters) on some default domain weights $\alpha_{\text{ref}}$ (e.g., proportional to raw token count).</p>
<p><strong>Step 2</strong>: Train a small proxy model $p_{\theta}$ using Group DRO, which solves the minimax objective:</p>
<p>$$
\min_{\theta} \max_{\alpha \in \Delta^{k}} \sum_{i=1}^{k} \alpha_{i} \cdot \frac{1}{\sum_{x \in D_{i}} |x|} \sum_{x \in D_{i}} \left[ \ell_{\theta}(x) - \ell_{\text{ref}}(x) \right]
$$</p>
<p>where $\ell_{\theta}(x) = -\log p_{\theta}(x)$ and $\ell_{\text{ref}}(x) = -\log p_{\text{ref}}(x)$. The excess loss $\ell_{\theta}(x) - \ell_{\text{ref}}(x)$ measures how much headroom the proxy has to improve on each example relative to the reference. The inner maximization upweights domains with high excess loss via exponentiated gradient ascent, while the outer minimization trains the proxy on those upweighted domains.</p>
<p>At each training step, the domain weights update as:</p>
<p>$$
\alpha_{t}' \leftarrow \alpha_{t-1} \exp(\eta \lambda_{t})
$$</p>
<p>where $\lambda_{t}[i]$ is the per-domain excess loss (clipped at zero), followed by renormalization and smoothing with a uniform component: $\alpha_{t} \leftarrow (1-c)\frac{\alpha_{t}'}{\sum_{i} \alpha_{t}'[i]} + c u$, with $c = 10^{-3}$.</p>
<p>The final domain weights are the average over all training steps: $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^{T} \alpha_{t}$.</p>
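<p>The update, clipping, smoothing, and averaging steps above can be sketched in plain Python (toy excess-loss values, not the paper&rsquo;s code):</p>

```python
import math

def update_weights(alpha, excess_losses, eta=1.0, c=1e-3):
    """alpha'_t = alpha_{t-1} * exp(eta * lambda_t), then smooth with uniform."""
    lam = [max(0.0, ell) for ell in excess_losses]  # clip excess loss at zero
    raw = [a * math.exp(eta * l) for a, l in zip(alpha, lam)]
    z = sum(raw)
    k = len(alpha)
    return [(1 - c) * r / z + c / k for r in raw]   # renormalize + c * uniform

def doremi_weights(init_alpha, excess_loss_stream, eta=1.0, c=1e-3):
    """Run updates over per-step excess losses; return the time average."""
    alpha = list(init_alpha)
    history = []
    for losses in excess_loss_stream:
        alpha = update_weights(alpha, losses, eta, c)
        history.append(alpha)
    k = len(init_alpha)
    return [sum(step[i] for step in history) / len(history) for i in range(k)]

# Three domains; domain 0 consistently shows the largest excess loss.
stream = [[0.5, 0.1, -0.2]] * 10
print(doremi_weights([1 / 3] * 3, stream))  # weight shifts toward domain 0
```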
<p><strong>Step 3</strong>: Resample data according to $\bar{\alpha}$ and train the full-scale model using standard procedures.</p>
<p><strong>Iterated DoReMi</strong> extends this by running multiple rounds, using the previous round&rsquo;s optimized weights as the next round&rsquo;s reference weights. This converges within 3 rounds on the GLaM dataset.</p>
<h2 id="experiments-across-the-pile-and-glam-datasets">Experiments across The Pile and GLaM datasets</h2>
<p><strong>Datasets.</strong> The Pile (22 domains, 800GB) and the GLaM dataset (8 domains, also used for PaLM). On The Pile, baseline weights come from the dataset defaults. On GLaM, baseline weights are uniform, with downstream-tuned oracle weights available for comparison.</p>
<p><strong>Setup.</strong> Transformer decoder-only LMs trained with next-token prediction. All models use batch size 512 and sequence length 1024. Proxy and reference models are 280M parameters. Main models are 8B parameters (30x larger). Training runs: 200K steps (Pile) or 300K steps (GLaM). The domain weight optimization cost (training two 280M models) is 8% of the compute for the 8B main model.</p>
<p><strong>Evaluation.</strong> Per-domain held-out perplexity and one-shot generative accuracy on five tasks: TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, and LAMBADA.</p>
<h3 id="key-domain-weight-shifts">Key domain weight shifts</h3>
<p>On The Pile, DoReMi (280M) dramatically upweights diverse web text (Pile-CC: 0.112 to 0.606) while downweighting specialized domains like ArXiv (0.105 to 0.004), PubMed Central (0.107 to 0.005), and StackExchange (0.093 to 0.015). Smaller, underrepresented domains like YouTubeSubtitles and PhilPapers receive proportionally large increases.</p>
<h3 id="scaling-behavior">Scaling behavior</h3>
<p>DoReMi was tested with matched proxy/main model sizes (280M through 1B) and with varying proxy sizes (70M through 1B) feeding into an 8B main model.</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Speedup to baseline accuracy</th>
          <th>Downstream improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DoReMi (280M to 280M)</td>
          <td>4x</td>
          <td>+2% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (280M to 8B)</td>
          <td>2.6x</td>
          <td>+6.5% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (150M to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
      <tr>
          <td>DoReMi (1B to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
  </tbody>
</table>
<p>Improvements are consistent across all tested model scales (280M to 1B matched), with no sign of diminishing returns at larger sizes.</p>
<h2 id="perplexity-improves-everywhere-even-on-downweighted-domains">Perplexity improves everywhere, even on downweighted domains</h2>
<p>The most striking finding is that DoReMi improves perplexity on all 22 domains in The Pile, including domains it downweights. The proposed explanation: the lowest-entropy domains need few samples to learn (they&rsquo;re statistically simple), while the highest-entropy domains have token distributions close to the uniform initialization and also need fewer samples. Reallocating weight to medium-entropy domains generates positive transfer that lifts all domains.</p>
<p>On The Pile, DoReMi reaches the baseline&rsquo;s downstream accuracy in 75K steps versus 200K for the baseline (2.6x speedup) and achieves a 6.5% absolute improvement in average one-shot accuracy at 200K steps.</p>
<p>On the GLaM dataset, iterated DoReMi (round 2) matches the performance of domain weights that were tuned directly on downstream task performance, despite having no knowledge of downstream tasks. Domain weights converge within 3 iterations.</p>
<h3 id="ablations">Ablations</h3>
<p>Using only the proxy model&rsquo;s loss (prefer hardest domains) or only the negative reference loss (prefer easiest domains) both underperform the full excess loss formulation. Both components are necessary: the excess loss identifies domains where the proxy has room to improve relative to what is learnable.</p>
<p>The proxy model itself typically underperforms the main model trained on its weights, and this gap grows at larger proxy scales. A 1B proxy model underperforms the 1B baseline, yet its domain weights still improve 1B main model training by over 2x. This suggests the domain weight signal is robust even when the proxy model itself is not well-trained.</p>
<h3 id="limitations">Limitations</h3>
<p>The domain weight landscape may have multiple local optima: a 280M proxy puts most weight on Pile-CC, while a 1B proxy favors OpenWebText2 instead. Both configurations improve over baseline, but the optimal weights are not unique.</p>
<p>The granularity of &ldquo;domains&rdquo; matters. DoReMi works better with more domains (22 on The Pile versus 8 on GLaM). Domains are defined by data provenance, which is coarse-grained. Fine-grained domain definitions (e.g., via clustering) could improve results but also risk DRO putting all weight on a small set of worst-case examples.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>800 GB, 22 domains</td>
          <td>Default heuristic weights as baseline</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>GLaM dataset</td>
          <td>8 domains</td>
          <td>Uniform weights as baseline; downstream-tuned oracle available</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, LAMBADA</td>
          <td>Standard splits</td>
          <td>One-shot generative evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Group DRO with exponentiated gradient ascent for domain weight updates. Step size $\eta = 1$, smoothing $c = 10^{-3}$. Per-token excess loss clipped at zero. Domain weights averaged over all training steps. Iterated DoReMi converges when $|\bar{\alpha} - \alpha_{\text{ref}}|_{\infty} &lt; 10^{-3}$.</p>
<h3 id="models">Models</h3>
<p>Vanilla Transformer decoder-only models with 256K vocabulary. Sizes: 70M (3 layers), 150M (6 layers), 280M (12 layers), 510M (12 layers), 760M (12 layers), 1B (16 layers), 8B (32 layers). All use 64-dim attention heads except 8B (128-dim).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>DoReMi (280M to 8B)</th>
          <th>Baseline (8B)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Avg one-shot accuracy</td>
          <td>+6.5% over baseline</td>
          <td>Reference</td>
          <td>5 generative tasks</td>
      </tr>
      <tr>
          <td>Worst-case log-perplexity</td>
          <td>1.46</td>
          <td>1.71</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Avg log-perplexity</td>
          <td>1.40</td>
          <td>1.64</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Domains beating baseline</td>
          <td>22/22</td>
          <td>0/22</td>
          <td>Per-domain perplexity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Proxy and reference models (under 1B) trained on TPUv3. Models at 1B and 8B trained on TPUv4. Domain weight optimization (two 280M runs) costs 8% of 8B training FLOPs.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xie2023doremi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Sang Michael and Pham, Hieu and Dong, Xuanyi and Du, Nan and Liu, Hanxiao and Lu, Yifeng and Liang, Percy and Le, Quoc V. and Ma, Tengyu and Yu, Adams Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RWKV: Linear-Cost RNN with Transformer Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</guid><description>RWKV combines parallelizable transformer training with constant-cost RNN inference using linear attention and channel-wise decay.</description><content:encoded><![CDATA[<h2 id="a-new-architecture-bridging-rnns-and-transformers">A New Architecture Bridging RNNs and Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces RWKV (Receptance Weighted Key Value), a novel sequence model architecture that combines the parallelizable training of Transformers with the efficient $O(Td)$ inference of RNNs. RWKV can be formulated equivalently as either a Transformer (for parallel training) or an RNN (for sequential inference), achieving the lowest computational and memory complexity among comparable architectures while matching Transformer-level performance. The authors scale RWKV to 14 billion parameters, making it the largest dense RNN ever trained at the time of publication.</p>
<h2 id="the-quadratic-cost-of-self-attention">The Quadratic Cost of Self-Attention</h2>
<p>Transformers have become the dominant architecture for NLP, powering models like GPT-3, LLaMA, and Chinchilla. Their self-attention mechanism captures both local and long-range dependencies while supporting parallelized training. However, self-attention scales quadratically with sequence length in both time ($O(T^2d)$) and space ($O(T^2 + Td)$), making it computationally and memory intensive for long sequences and resource-constrained deployment.</p>
<p>RNNs, by contrast, offer linear scaling in memory and computation, but suffer from the vanishing gradient problem and cannot parallelize across the time dimension during training. This limits their scalability and makes them unable to match Transformer performance in practice.</p>
<p>Prior work on efficient Transformers (Reformer, Performer, Linformer, AFT, MEGA) has attempted to reduce this quadratic cost, often at the expense of model expressivity. RWKV aims to combine the best of both worlds: Transformer-grade training efficiency with RNN-grade inference cost, without any approximation to the attention mechanism.</p>
<h2 id="linear-attention-via-channel-wise-decay">Linear Attention via Channel-Wise Decay</h2>
<p>RWKV is built on four core vectors that interact multiplicatively at each timestep:</p>
<ul>
<li><strong>R</strong> (Receptance): receives past information, acting as a gating signal</li>
<li><strong>W</strong> (Weight): a trainable positional weight decay vector</li>
<li><strong>K</strong> (Key): analogous to keys in standard attention</li>
<li><strong>V</strong> (Value): analogous to values in standard attention</li>
</ul>
<p>The architecture consists of stacked residual blocks, each containing a <strong>time-mixing</strong> sub-block and a <strong>channel-mixing</strong> sub-block.</p>
<h3 id="token-shift">Token Shift</h3>
<p>All linear projection vectors are produced by interpolating between the current input $x_t$ and the previous input $x_{t-1}$, creating a token shift mechanism:</p>
<p>$$
r_t = W_r \cdot (\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1})
$$</p>
<p>$$
k_t = W_k \cdot (\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1})
$$</p>
<p>$$
v_t = W_v \cdot (\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1})
$$</p>
<p>where $\mu_r$, $\mu_k$, $\mu_v$ are learnable interpolation parameters. This is implemented efficiently as a simple offset in the temporal dimension.</p>
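<p>As a concrete sketch (my own illustration, not the official implementation), token shift amounts to a one-position temporal offset followed by per-channel interpolation and a linear projection; treating the predecessor of the first token as zero is an assumption:</p>

```python
import numpy as np

def token_shift(x, mu, W):
    """Compute W @ (mu * x_t + (1 - mu) * x_{t-1}) for every timestep.

    x: (T, d) inputs; mu: (d,) learned interpolation weights; W: (d_out, d).
    The predecessor of the first token is taken to be zero (an assumption).
    """
    x_prev = np.roll(x, shift=1, axis=0)
    x_prev[0] = 0.0  # no token before t = 0
    mixed = mu * x + (1.0 - mu) * x_prev  # channel-wise interpolation
    return mixed @ W.T

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
mu, W = rng.uniform(size=d), rng.normal(size=(d, d))
r = token_shift(x, mu, W)  # shape (5, 8)
```

<p>With $\mu = 1$ this reduces to an ordinary per-token projection, which is a quick sanity check on the implementation.</p>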
<h3 id="the-wkv-operator">The WKV Operator</h3>
<p>The core attention-like computation replaces standard dot-product attention with a channel-wise weighted sum using exponential decay:</p>
<p>$$
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \odot v_i + e^{u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}
$$</p>
<p>Here $w$ is the channel-wise time decay vector and $u$ is a separate bonus vector that attends specifically to the current token. Unlike AFT where $W$ is a pairwise matrix, RWKV treats $W$ as a channel-wise vector modified by relative position, enabling the recurrent formulation.</p>
<h3 id="output-gating">Output Gating</h3>
<p>The receptance vector gates the WKV output through a sigmoid:</p>
<p>$$
o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)
$$</p>
<p>The channel-mixing block uses a similar gating mechanism with squared ReLU activation:</p>
<p>$$
o&rsquo;_t = \sigma(r&rsquo;_t) \odot (W&rsquo;_v \cdot \max(k&rsquo;_t, 0)^2)
$$</p>
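<p>A minimal sketch of the channel-mixing computation (shapes and names are my own; $r&rsquo;$ and $k&rsquo;$ come from their own token-shift projections):</p>

```python
import numpy as np

def channel_mix(r, k, Wv):
    """Gate a squared-ReLU value projection of k with sigmoid(r)."""
    gate = 1.0 / (1.0 + np.exp(-r))        # sigma(r'_t)
    squared_relu = np.maximum(k, 0.0) ** 2  # max(k'_t, 0)^2
    return gate * (squared_relu @ Wv.T)     # sigma(r') gates W'_v projection

rng = np.random.default_rng(2)
r, k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
Wv = rng.normal(size=(8, 8))
out = channel_mix(r, k, Wv)  # shape (4, 8)
```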
<h3 id="dual-mode-operation">Dual-Mode Operation</h3>
<p>During <strong>training</strong>, RWKV operates in time-parallel mode. The matrix multiplications ($W_\lambda$ for $\lambda \in \{r, k, v, o\}$) dominate at $O(BTd^2)$ and parallelize identically to standard Transformers. The element-wise WKV computation is $O(BTd)$ and parallelizes along batch and channel dimensions.</p>
<p>During <strong>inference</strong>, RWKV switches to time-sequential mode. Each timestep updates a fixed-size state vector, giving constant $O(d)$ memory and $O(Td)$ total time for generating $T$ tokens, compared to $O(T^2d)$ for standard Transformers.</p>
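<p>The equivalence of the two modes can be checked directly. The sketch below (my own, with one decay $w$ and bonus $u$ per channel) evaluates the WKV definition both as the explicit sum and as a recurrence over two running state vectors, one for the numerator and one for the denominator:</p>

```python
import numpy as np

def wkv_direct(k, v, w, u):
    """Direct evaluation of the WKV definition at each t (O(T^2) total)."""
    T, d = k.shape
    out = np.zeros((T, d))
    for t in range(T):
        # weights e^{-(t-1-i)w + k_i} for i < t (0-indexed)
        weights = np.exp(-(t - 1 - np.arange(t))[:, None] * w + k[:t])
        num = (weights * v[:t]).sum(axis=0) + np.exp(u + k[t]) * v[t]
        den = weights.sum(axis=0) + np.exp(u + k[t])
        out[t] = num / den
    return out

def wkv_recurrent(k, v, w, u):
    """Sequential mode: two running sums updated per step (O(T) total)."""
    T, d = k.shape
    a = np.zeros(d)  # numerator state:   sum of decayed e^{k_i} v_i
    b = np.zeros(d)  # denominator state: sum of decayed e^{k_i}
    out = np.zeros((T, d))
    for t in range(T):
        e = np.exp(u + k[t])
        out[t] = (a + e * v[t]) / (b + e)
        # decay old state and absorb the current token for the next step
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        b = np.exp(-w) * b + np.exp(k[t])
    return out

rng = np.random.default_rng(1)
T, d = 6, 4
k, v = rng.normal(size=(T, d)), rng.normal(size=(T, d))
w, u = rng.uniform(0.1, 1.0, size=d), rng.normal(size=d)
assert np.allclose(wkv_direct(k, v, w, u), wkv_recurrent(k, v, w, u))
```

<p>The recurrent form is why generation needs only a fixed-size state: each step decays the two sums by $e^{-w}$ and absorbs the current token.</p>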
<h3 id="optimizations">Optimizations</h3>
<p>Three additional design choices improve training:</p>
<ol>
<li><strong>Custom CUDA kernels</strong> for the sequential WKV computation, fusing it into a single kernel on training accelerators</li>
<li><strong>Small init embedding</strong>: initializing the embedding matrix with small values plus an additional LayerNorm, accelerating convergence</li>
<li><strong>Custom initialization</strong>: most weights initialized to zero with no biases, following identity-mapping principles from residual network design</li>
</ol>
<h2 id="scaling-to-14b-parameters-and-benchmark-evaluation">Scaling to 14B Parameters and Benchmark Evaluation</h2>
<h3 id="model-scaling">Model Scaling</h3>
<p>The authors train six RWKV models from 169M to 14B parameters, all for one epoch (330B tokens) on the Pile:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Dimension</th>
          <th>Parameters</th>
          <th>FLOP/Token</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>12</td>
          <td>768</td>
          <td>$1.69 \times 10^8$</td>
          <td>$2.61 \times 10^8$</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>24</td>
          <td>1024</td>
          <td>$4.30 \times 10^8$</td>
          <td>$7.57 \times 10^8$</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>24</td>
          <td>2048</td>
          <td>$1.52 \times 10^9$</td>
          <td>$2.82 \times 10^9$</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>32</td>
          <td>2560</td>
          <td>$2.99 \times 10^9$</td>
          <td>$5.71 \times 10^9$</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>32</td>
          <td>4096</td>
          <td>$7.39 \times 10^9$</td>
          <td>$1.44 \times 10^{10}$</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>40</td>
          <td>5120</td>
          <td>$1.42 \times 10^{10}$</td>
          <td>$2.78 \times 10^{10}$</td>
      </tr>
  </tbody>
</table>
<p>The parameter count follows: $\text{params} = 2VD + 13D^2L + D(11L + 4)$, where $V = 50277$ is the vocabulary size, $D$ is the model dimension, and $L$ is the number of layers. The FLOP/token column counts the forward pass through the dense matrix multiplications, $2(13D^2L + VD)$; total training compute then follows the standard $\text{FLOP} = 6 \cdot [\text{tokens}] \cdot [\text{params}]$ rule for forward plus backward passes.</p>
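<p>Both closed forms can be checked against the table rows. In the sketch below, the FLOP/token column is reproduced as the forward-pass cost of the dense matrix multiplications, $2(13D^2L + VD)$; this reading is my own interpretation, verified numerically against the table:</p>

```python
def rwkv_params(D, L, V=50277):
    """Embedding + head (2VD), block matmuls (13 D^2 L), per-layer vectors."""
    return 2 * V * D + 13 * D * D * L + D * (11 * L + 4)

def rwkv_flop_per_token(D, L, V=50277):
    """Forward FLOPs per token: 2 FLOPs per weight in the dense matmuls.

    An interpretation of the table's FLOP/token column, not a quoted formula.
    """
    return 2 * (13 * D * D * L + V * D)

assert rwkv_params(768, 12) == 169_342_464            # 169M row
assert rwkv_flop_per_token(768, 12) == 261_250_560    # ~2.61e8
assert rwkv_params(1024, 24) == 430_397_440           # 430M row
assert rwkv_flop_per_token(1024, 24) == 757_278_720   # ~7.57e8
```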
<h3 id="scaling-laws">Scaling Laws</h3>
<p>Training 45 RWKV models across varied (dataset size, parameter count) pairs, the authors find that RWKV follows the same log-log linear scaling law established for Transformers. The linear fit to the Pareto-optimal points achieves $r^2 = 0.994$, and extrapolating an additional order of magnitude still yields $r^2 = 0.875$. This contrasts with prior claims that LSTMs do not follow transformer-like scaling.</p>
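<p>For intuition, a power law $\text{loss} = a \cdot C^{b}$ is a straight line in log-log space, so the fit reduces to linear least squares on logged coordinates. A toy sketch on synthetic points (not the paper&rsquo;s data):</p>

```python
import numpy as np

# Synthetic compute/loss points lying exactly on a power law
C = np.logspace(18, 22, 10)   # compute budgets (FLOPs)
loss = 2.5 * C ** -0.05       # loss = a * C^b with a=2.5, b=-0.05

# Fit log(loss) = log(a) + b * log(C) by ordinary least squares
b, log_a = np.polyfit(np.log(C), np.log(loss), 1)
# Recovers b = -0.05 and a = 2.5 up to floating-point error
```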
<h3 id="nlp-benchmarks">NLP Benchmarks</h3>
<p>RWKV is compared against similarly-sized models trained on comparable token budgets: Pythia, OPT, and BLOOM (all FLOP-matched). Results span twelve benchmarks: ARC (Easy/Challenge), BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, and Winogrande.</p>
<p>RWKV performs competitively with Transformers across all model sizes. On average across benchmarks, RWKV tracks closely with Pythia and outperforms OPT and BLOOM at comparable scales.</p>
<h3 id="long-context-and-extended-finetuning">Long Context and Extended Finetuning</h3>
<p>RWKV can extend its context length after pretraining through progressive finetuning: doubling from 1024 to 2048 (10B tokens), then to 4096 (100B tokens), and finally to 8192 (100B tokens). Each doubling reduces test loss on the Pile, indicating effective use of longer context.</p>
<p>On the Long Range Arena (LRA) benchmark, which tests sequences from 1,000 to 16,000 tokens, RWKV performs second only to S4 across the five datasets.</p>
<h3 id="inference-efficiency">Inference Efficiency</h3>
<p>Benchmarking text generation on CPU (x86) and GPU (NVIDIA A100 80GB) at float32 precision shows that RWKV exhibits linear scaling in generation time, while Transformers scale quadratically. This advantage grows with sequence length: for long outputs, RWKV completes generation substantially faster at equivalent model sizes.</p>
<h2 id="competitive-performance-with-key-caveats">Competitive Performance with Key Caveats</h2>
<p>RWKV demonstrates that RNN-class models can match Transformer performance at scale, while maintaining $O(Td)$ time and $O(d)$ memory during inference. The key findings are:</p>
<ol>
<li><strong>Scaling laws hold</strong>: RWKV follows the same compute-optimal scaling as Transformers ($r^2 = 0.994$), contradicting earlier claims about RNN scaling behavior</li>
<li><strong>Competitive NLP performance</strong>: Across twelve benchmarks, RWKV matches similarly-sized Transformers trained on comparable data</li>
<li><strong>Linear inference cost</strong>: Generation time scales linearly rather than quadratically, with constant memory regardless of sequence length</li>
<li><strong>Context extension</strong>: Progressive finetuning effectively extends the context window post-training</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors identify two primary limitations:</p>
<p><strong>Information compression</strong>: Linear attention funnels all past information through a single fixed-size state vector. For tasks requiring recall of specific details over very long contexts, this is mechanistically more constrained than full self-attention, which maintains direct access to all previous tokens.</p>
<p><strong>Prompt sensitivity</strong>: RWKV is more sensitive to prompt engineering than standard Transformers. The linear attention mechanism limits how much prompt information carries forward, making the order of information in the prompt particularly important. Reordering prompts improved F1 from 44.2% to 74.8% on one task.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest several avenues: applying parallel scan to reduce WKV cost to $O(B \log(T) d)$, extending RWKV to encoder-decoder and multimodal architectures, leveraging hidden states for interpretability and safety, and increasing internal state size to improve long-range recall.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch training and inference implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/BlinkDL/rwkv-4-pile-14b">Pre-trained weights (169M to 14B)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>All six Pile-trained sizes on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>)</td>
      </tr>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>Mixed</td>
          <td>825 GiB pretraining corpus; component licenses vary by source</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Training code (Apache-2.0), pre-trained weights for all six model sizes, the full training corpus, and complete hyperparameters (Appendix G) are all publicly available. The only missing detail is the specific GPU cluster configuration used for pretraining.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>330B tokens</td>
          <td>One full epoch for all model sizes</td>
      </tr>
      <tr>
          <td>Context extension</td>
          <td>The Pile</td>
          <td>210B additional tokens</td>
          <td>Progressive doubling: 1024 to 8192</td>
      </tr>
      <tr>
          <td>NLP evaluation</td>
          <td>ARC, BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, Winogrande</td>
          <td>Various</td>
          <td>Zero-shot evaluation</td>
      </tr>
      <tr>
          <td>Long-range evaluation</td>
          <td>Long Range Arena (LRA)</td>
          <td>1K-16K tokens</td>
          <td>Five sub-tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam ($\beta = (0.9, 0.99)$), no weight decay</li>
<li>Precision: bfloat16</li>
<li>Training context length: 1024 tokens</li>
<li>Learning rate: constant warmup, then exponential decay</li>
<li>Auxiliary loss from PaLM (softmax normalizer regularization)</li>
<li>Batch size: 128 or 256 sequences (dynamically switched)</li>
<li>Training organized into mini-epochs of 40,320 samples each (8,043 mini-epochs per Pile epoch)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Init LR</th>
          <th>Warmup Mini-Epochs</th>
          <th>End LR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>6e-4</td>
          <td>361</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>4e-4</td>
          <td>411</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>3e-4</td>
          <td>443</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>1.5e-4</td>
          <td>451</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>1.5e-4</td>
          <td>465</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>1e-4</td>
          <td>544</td>
          <td>7e-6</td>
      </tr>
  </tbody>
</table>
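<p>The schedule itself is described only qualitatively (constant learning rate through warmup, then exponential decay to the end value). One way to realize it, with the decay horizon and interpolation as assumptions of mine, is:</p>

```python
def lr_schedule(step, init_lr, end_lr, warmup_steps, total_steps):
    """Constant LR during warmup, then exponential decay to end_lr.

    Only the init/end LRs and warmup lengths are specified per model
    (table above); the decay horizon here is an assumption.
    """
    if step <= warmup_steps:
        return init_lr
    frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return init_lr * (end_lr / init_lr) ** frac

# 169M model: 6e-4 -> 1e-5, 361 warmup mini-epochs, ~8043 total
lr_start = lr_schedule(100, 6e-4, 1e-5, 361, 8043)   # still 6e-4
lr_final = lr_schedule(8043, 6e-4, 1e-5, 361, 8043)  # decays to 1e-5
```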
<p>All pretrained models (169M to 14B) are publicly released on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>) under Apache-2.0. Training code is at <a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a> (Apache-2.0).</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>All NLP benchmarks evaluated in zero-shot setting</li>
<li>FLOP-matched comparison against Pythia, OPT, BLOOM</li>
<li>Inference benchmarked on CPU (x86) and GPU (NVIDIA A100 80GB) at float32</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference experiments: NVIDIA A100 80GB GPU</li>
<li>Training hardware details not fully specified; FLOP budgets reported per model</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., &hellip; &amp; Zhu, R.-J. (2023). RWKV: Reinventing RNNs for the Transformer Era. In <em>Findings of the Association for Computational Linguistics: EMNLP 2023</em>, pp. 14048-14077.</p>
<p><strong>Publication</strong>: Findings of EMNLP 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BlinkDL/RWKV-LM">GitHub Repository (Apache-2.0)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{peng2023rwkv,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{RWKV: Reinventing RNNs for the Transformer Era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Derczynski, Leon and Du, Xingjian and Grella, Matteo and GV, Kranthi Kiran and He, Xuzheng and Hou, Haowen and Kazienko, Przemys{\l}aw and Koco{\&#39;n}, Jan and Kong, Jiaming and Koptyra, Bart{\l}omiej and Lau, Hayden and Lin, Jiaju and Mantri, Krishna Sri Ipsit and Mom, Ferdinand and Saito, Atsushi and Song, Guangyu and Tang, Xiangru and Wind, Johan S. and Wo{\&#39;z}niak, Stanis{\l}aw and Zhang, Zhenyuan and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Findings of the Association for Computational Linguistics: EMNLP 2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{14048--14077}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.findings-emnlp.936}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Liquid-S4: Input-Dependent State-Space Models</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/liquid-s4-state-space-models/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/liquid-s4-state-space-models/</guid><description>Liquid-S4 combines liquid time-constant networks with structured state-space models, adding input-dependent kernels for long-range sequence modeling.</description><content:encoded><![CDATA[<h2 id="a-method-for-input-adaptive-sequence-modeling">A Method for Input-Adaptive Sequence Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces Liquid-S4, a new state-space model combining the structured state-space framework (S4) with liquid time-constant (LTC) networks. The primary contribution is an input-dependent state transition mechanism that allows the model to adapt its dynamics based on incoming inputs, while retaining the efficient convolutional kernel computation of S4.</p>
<h2 id="scaling-liquid-networks-to-long-sequences">Scaling Liquid Networks to Long Sequences</h2>
<p>Liquid time-constant (LTC) networks are continuous-time neural networks with input-dependent state transitions, giving them strong generalization and causal modeling properties. However, LTCs rely on ODE solvers that limit their scalability to long sequences. Structured state-space models (S4) solve this scalability problem through HiPPO initialization, diagonal plus low-rank (DPLR) parameterization, and efficient Cauchy kernel computation in the frequency domain, but they use fixed (input-independent) state transitions.</p>
<p>The key question this paper addresses: can the expressivity of LTC networks be combined with the efficiency and scalability of S4 to improve long-range sequence modeling?</p>
<h2 id="the-liquid-kernel-input-dependent-convolutions">The Liquid Kernel: Input-Dependent Convolutions</h2>
<p>The core innovation is a linearized LTC state-space model that replaces the standard SSM dynamics:</p>
<p>$$\dot{x}(t) = \mathbf{A}x(t) + \mathbf{B}u(t)$$</p>
<p>with an input-dependent formulation:</p>
<p>$$\dot{x}(t) = \left[\mathbf{A} + \mathbf{B}u(t)\right]x(t) + \mathbf{B}u(t)$$</p>
<p>where $u(t)$ now modulates the state transition matrix itself. After discretization via the <a href="https://en.wikipedia.org/wiki/Bilinear_transform">bilinear transform</a>, the recurrence becomes:</p>
<p>$$x_{k} = \left(\overline{\mathbf{A}} + \overline{\mathbf{B}}u_{k}\right)x_{k-1} + \overline{\mathbf{B}}u_{k}$$</p>
<p>Unrolling this recurrence reveals that the output $y_{k}$ decomposes into two parts:</p>
<p>$$y = \overline{\mathbf{K}} * u + \overline{\mathbf{K}}_{\text{liquid}} * u_{\text{correlations}}$$</p>
<p>The first term is the standard S4 convolutional kernel $\overline{\mathbf{K}}$, mapping individual input time steps independently. The second term is a new &ldquo;liquid kernel&rdquo; $\overline{\mathbf{K}}_{\text{liquid}}$ that operates on <a href="https://en.wikipedia.org/wiki/Autocorrelation">auto-correlation</a> terms of the input signal (products $u_{i}u_{j}$, $u_{i}u_{j}u_{k}$, etc., up to a chosen order $\mathcal{P}$).</p>
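<p>A scalar toy run (my own illustration, with arbitrary $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$) makes this decomposition concrete: unrolling two steps of the recurrence from $x_0 = 0$ produces exactly the standard convolution terms plus a $u_1 u_2$ correlation term:</p>

```python
import numpy as np

A, B = 0.9, 0.5              # scalar stand-ins for A_bar, B_bar
u = np.array([0.3, -0.7])    # two input steps u_1, u_2

# Input-dependent recurrence: x_k = (A + B*u_k) * x_{k-1} + B*u_k
x = 0.0
for uk in u:
    x = (A + B * uk) * x + B * uk

# Hand expansion of x_2:
#   A*B*u_1 + B*u_2      <- standard S4 kernel terms (linear in u)
#   + B**2 * u_1 * u_2   <- liquid correlation term
expected = A * B * u[0] + B * u[1] + B**2 * u[0] * u[1]
assert np.isclose(x, expected)
```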
<p><strong>Proposition 1</strong> shows that each liquid kernel of order $p$ can be computed from the precomputed S4 kernel via a <a href="https://en.wikipedia.org/wiki/Hadamard_product_(matrices)">Hadamard product</a> with $\overline{\mathbf{B}}^{p-1}$ followed by an anti-diagonal transformation (flip):</p>
<p>$$\overline{\mathbf{K}}_{\text{liquid}=p} = \left[\overline{\mathbf{K}}_{(L-\tilde{L},L)} \odot \overline{\mathbf{B}}_{(L-\tilde{L},L)}^{p-1}\right] * \mathbf{J}_{\tilde{L}}$$</p>
<p>This is the KB (Kernel $\times$ B) mode. The authors also propose a simplified PB (Powers of B) mode that sets the transition matrix $\overline{\mathbf{A}}$ to identity for the correlation terms:</p>
<p>$$\overline{\mathbf{K}}_{\text{liquid}=p} = \overline{\mathbf{C}} \odot \overline{\mathbf{B}}^{p-1}$$</p>
<p>The PB kernel is cheaper to compute and performs equally well or better in practice.</p>
<p>The computational complexity is $\tilde{\mathcal{O}}(N + L + p_{\text{max}}\tilde{L})$, where $N$ is the state size, $L$ the sequence length, $p_{\text{max}}$ the maximum liquid order, and $\tilde{L}$ the liquid kernel length (typically two orders of magnitude smaller than $L$).</p>
<h2 id="benchmarks-across-long-range-sequence-tasks">Benchmarks Across Long-Range Sequence Tasks</h2>
<p>Liquid-S4 is evaluated on four benchmark suites with the PB kernel using the S4-LegS (scaled <a href="https://en.wikipedia.org/wiki/Legendre_polynomials">Legendre</a>) parameterization.</p>
<h3 id="long-range-arena-lra">Long Range Arena (LRA)</h3>
<p>The LRA benchmark contains six tasks with sequence lengths from 1K to 16K. Liquid-S4 achieves state-of-the-art on all six tasks with an average accuracy of 87.32%:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input Length</th>
          <th>Liquid-S4</th>
          <th>S4-LegS</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ListOps</td>
          <td>2048</td>
          <td>62.75%</td>
          <td>59.60%</td>
          <td>+3.15%</td>
      </tr>
      <tr>
          <td>Text (IMDB)</td>
          <td>2048</td>
          <td>89.02%</td>
          <td>86.82%</td>
          <td>+2.20%</td>
      </tr>
      <tr>
          <td>Retrieval (AAN)</td>
          <td>4000</td>
          <td>91.20%</td>
          <td>90.90%</td>
          <td>+0.30%</td>
      </tr>
      <tr>
          <td>Image (CIFAR)</td>
          <td>1024</td>
          <td>89.50%</td>
          <td>88.65%</td>
          <td>+0.85%</td>
      </tr>
      <tr>
          <td>Pathfinder</td>
          <td>1024</td>
          <td>94.80%</td>
          <td>94.20%</td>
          <td>+0.60%</td>
      </tr>
      <tr>
          <td>Path-X</td>
          <td>16384</td>
          <td>96.66%</td>
          <td>96.35%</td>
          <td>+0.31%</td>
      </tr>
      <tr>
          <td><strong>Average</strong></td>
          <td></td>
          <td><strong>87.32%</strong></td>
          <td><strong>86.09%</strong></td>
          <td><strong>+1.23%</strong></td>
      </tr>
  </tbody>
</table>
<p>Liquid orders $p$ range from 2 to 6 across tasks.</p>
<h3 id="bidmc-vital-signs">BIDMC Vital Signs</h3>
<p>On medical time-series regression (heart rate, respiratory rate, <a href="https://en.wikipedia.org/wiki/Oxygen_saturation_(medicine)">SpO2</a> prediction from length-4000 biomarker signals):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Liquid-S4 (RMSE)</th>
          <th>S4-LegS (RMSE)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heart Rate</td>
          <td>0.303</td>
          <td>0.332</td>
          <td>8.7%</td>
      </tr>
      <tr>
          <td>Respiratory Rate</td>
          <td>0.158</td>
          <td>0.247</td>
          <td>36.0%</td>
      </tr>
      <tr>
          <td>SpO2</td>
          <td>0.066</td>
          <td>0.090</td>
          <td>26.7%</td>
      </tr>
  </tbody>
</table>
<h3 id="sequential-cifar-scifar">Sequential CIFAR (sCIFAR)</h3>
<p>Liquid-S4 with $p=3$ achieves 92.02% accuracy on 1-D pixel-level image classification, improving over S4-LegS (91.80%).</p>
<h3 id="speech-commands-full-35-labels">Speech Commands (Full 35 Labels)</h3>
<p>On the raw 16kHz speech recognition task, Liquid-S4 achieves 96.78% accuracy with only 224K parameters, roughly 27% fewer than S4&rsquo;s 307K. On the zero-shot 8kHz experiment, performance drops to 90.00% (vs. 91.32% for S4-LegS), which the authors attribute to the liquid kernel&rsquo;s sensitivity to input covariance structure at different sampling rates.</p>
<h2 id="consistent-improvements-with-smaller-models">Consistent Improvements with Smaller Models</h2>
<p>Liquid-S4 achieves state-of-the-art performance on every benchmark evaluated: all six LRA tasks (87.32% average), all three BIDMC vital signs tasks, sCIFAR, and full Speech Commands recognition. The gains are particularly large on tasks where input correlation structure matters (ListOps +3.15%, IMDB +2.20%, respiratory rate RMSE improvement of 36%).</p>
<p>A practical advantage is that Liquid-S4 works well with smaller state sizes (as low as 7 units for some tasks), reducing parameter counts. The PB kernel is recommended over KB for its simplicity and competitive performance. Higher liquid orders ($p$) consistently improve performance, though $p=3$ is recommended as a default.</p>
<p>Limitations include degraded performance in zero-shot frequency transfer (8kHz Speech Commands), suggesting the liquid kernel&rsquo;s input covariance terms may not generalize well across sampling rate changes. The paper also does not compare against non-SSM approaches beyond the LRA benchmark. The causal (unidirectional) configuration works better than bidirectional for Liquid-S4, which may limit applicability to tasks that benefit from bidirectional context.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Classification: Partially Reproducible.</strong> Code and all benchmark datasets are publicly available, with complete hyperparameters documented. No pre-trained weights are released and hardware requirements are not specified.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/raminmh/liquid-s4">raminmh/liquid-s4</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch implementation; fork of the S4 repo with KB and PB kernels added</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Long Range Arena (LRA)</td>
          <td>6 tasks, 1K-16K seq length</td>
          <td>ListOps, IMDB, AAN, CIFAR, Pathfinder, Path-X</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BIDMC Vital Signs</td>
          <td>4000-length biomarker signals</td>
          <td>Heart rate, respiratory rate, SpO2</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>sCIFAR</td>
          <td>1024-length flattened images</td>
          <td>10-class classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Speech Commands</td>
          <td>16kHz raw audio, 35 labels</td>
          <td>Full dataset with zero-shot 8kHz test</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The Liquid-S4 kernel computation builds on the S4 kernel pipeline:</p>
<ol>
<li>Initialize $\mathbf{A}$ with HiPPO (scaled Legendre) matrix in DPLR form</li>
<li>Compute S4 kernel $\overline{\mathbf{K}}$ via Cauchy kernel and iFFT</li>
<li>For each liquid order $p \in {2, \ldots, \mathcal{P}}$, compute $\overline{\mathbf{K}}_{\text{liquid}=p}$ using either KB or PB mode</li>
<li>Convolve $\overline{\mathbf{K}}_{\text{liquid}}$ with input correlation vector $u_{\text{correlations}}$</li>
</ol>
<p>The PB kernel mode is used in all reported experiments. The PyKeops package is used for large tensor computations.</p>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Depth</th>
          <th>Features</th>
          <th>State Size</th>
          <th>Norm</th>
          <th>LR</th>
          <th>Epochs</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ListOps</td>
          <td>9</td>
          <td>128</td>
          <td>7</td>
          <td>BN</td>
          <td>0.002</td>
          <td>30</td>
      </tr>
      <tr>
          <td>IMDB</td>
          <td>4</td>
          <td>128</td>
          <td>7</td>
          <td>BN</td>
          <td>0.003</td>
          <td>50</td>
      </tr>
      <tr>
          <td>AAN</td>
          <td>6</td>
          <td>256</td>
          <td>64</td>
          <td>BN</td>
          <td>0.005</td>
          <td>20</td>
      </tr>
      <tr>
          <td>CIFAR (LRA)</td>
          <td>6</td>
          <td>512</td>
          <td>512</td>
          <td>LN</td>
          <td>0.01</td>
          <td>200</td>
      </tr>
      <tr>
          <td>Pathfinder</td>
          <td>6</td>
          <td>256</td>
          <td>64</td>
          <td>BN</td>
          <td>0.0004</td>
          <td>200</td>
      </tr>
      <tr>
          <td>Path-X</td>
          <td>6</td>
          <td>320</td>
          <td>64</td>
          <td>BN</td>
          <td>0.001</td>
          <td>60</td>
      </tr>
      <tr>
          <td>Speech Commands</td>
          <td>6</td>
          <td>128</td>
          <td>7</td>
          <td>BN</td>
          <td>0.008</td>
          <td>50</td>
      </tr>
      <tr>
          <td>BIDMC (HR)</td>
          <td>6</td>
          <td>128</td>
          <td>256</td>
          <td>LN</td>
          <td>0.005</td>
          <td>500</td>
      </tr>
      <tr>
          <td>BIDMC (RR)</td>
          <td>6</td>
          <td>128</td>
          <td>256</td>
          <td>LN</td>
          <td>0.01</td>
          <td>500</td>
      </tr>
      <tr>
          <td>BIDMC (SpO2)</td>
          <td>6</td>
          <td>128</td>
          <td>256</td>
          <td>LN</td>
          <td>0.01</td>
          <td>500</td>
      </tr>
      <tr>
          <td>sCIFAR</td>
          <td>6</td>
          <td>512</td>
          <td>512</td>
          <td>LN</td>
          <td>0.01</td>
          <td>200</td>
      </tr>
  </tbody>
</table>
<p>Liquid-S4 generally requires smaller learning rates than S4/S4D. $\Delta t_{\text{max}} = 0.2$ for all experiments; $\Delta t_{\text{min}} \propto 1/\text{seq\_length}$.</p>
<h3 id="evaluation">Evaluation</h3>
<p>All results report validation accuracy (except BIDMC, which reports test RMSE). Experiments use 2-3 random seeds with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hasani, R., Lechner, M., Wang, T.-H., Chahine, M., Amini, A., &amp; Rus, D. (2022). Liquid Structural State-Space Models. <em>arXiv preprint arXiv:2209.12951</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hasani2022liquid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Liquid Structural State-Space Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hasani, Ramin and Lechner, Mathias and Wang, Tsun-Hsuan and Chahine, Makram and Amini, Alexander and Rus, Daniela}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2209.12951}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lagrangian Neural Networks for Physics</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/lagrangian-neural-networks/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/lagrangian-neural-networks/</guid><description>LNNs parameterize arbitrary Lagrangians with neural networks, learning energy-conserving dynamics without requiring canonical coordinates.</description><content:encoded><![CDATA[<h2 id="a-method-for-learning-arbitrary-lagrangians">A Method for Learning Arbitrary Lagrangians</h2>
<p>This is a <strong>Method</strong> paper that introduces Lagrangian Neural Networks (LNNs), a neural network architecture that parameterizes arbitrary Lagrangians to learn energy-conserving dynamics from data. The key contribution is showing that neural networks can learn Lagrangian functions directly, and that the Euler-Lagrange equation can be solved numerically using automatic differentiation to produce physically consistent dynamics. The approach is strictly more general than prior methods: it does not require canonical coordinates (unlike Hamiltonian Neural Networks) and does not restrict the functional form of kinetic energy (unlike Deep Lagrangian Networks).</p>
<h2 id="why-standard-neural-networks-fail-at-conservation-laws">Why Standard Neural Networks Fail at Conservation Laws</h2>
<p>Neural networks struggle to learn fundamental symmetries and conservation laws from data. A standard neural network trained on trajectories of a <a href="https://en.wikipedia.org/wiki/Double_pendulum">double pendulum</a> will gradually dissipate energy over long rollouts, producing physically implausible behavior. This happens because unconstrained function approximators have no inductive bias toward conservation.</p>
<p>Hamiltonian Neural Networks (HNNs) addressed this by learning a Hamiltonian function, which automatically enforces energy conservation. However, the <a href="https://en.wikipedia.org/wiki/Hamiltonian_mechanics">Hamiltonian formalism</a> requires inputs in <a href="https://en.wikipedia.org/wiki/Canonical_coordinates">canonical coordinates</a> $(q, p)$ satisfying strict <a href="https://en.wikipedia.org/wiki/Poisson_bracket">Poisson bracket</a> relations:</p>
<p>$$
p_i \equiv \frac{\partial \mathcal{L}}{\partial \dot{q}_i} \quad \Longleftrightarrow \quad \{q_i, q_j\} = 0, \quad \{p_i, p_j\} = 0, \quad \{q_i, p_j\} = \delta_{ij}
$$</p>
<p>In many real-world settings, the canonical momenta are unknown or difficult to compute. For example, in special relativity the canonical momentum $\dot{q}(1 - \dot{q}^2)^{-3/2}$ is a complicated nonlinear function of the velocity. Deep Lagrangian Networks (DeLaNs) partially addressed this by learning Lagrangians, but they assumed the kinetic energy takes the rigid-body form $T = \frac{1}{2}\dot{q}^{\top} M(q)\dot{q}$, which excludes relativistic and other non-standard systems.</p>
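<p>As a quick check, this momentum follows from the definition $p \equiv \partial \mathcal{L} / \partial \dot{q}$ applied to the relativistic Lagrangian used in the paper's experiment (with $c = m = 1$). The short snippet below, with illustrative parameter values, confirms the derivative numerically:</p>

```python
g = 1.0  # uniform potential strength (illustrative value)

# Relativistic Lagrangian from the paper's experiment, with c = m = 1
L = lambda q, qd: (1.0 - qd**2) ** -0.5 - 1.0 + g * q
# Claimed canonical momentum dL/d(qdot)
p_claimed = lambda qd: qd * (1.0 - qd**2) ** -1.5

qd, eps = 0.4, 1e-6  # arbitrary test velocity, finite-difference step
p_numeric = (L(0.0, qd + eps) - L(0.0, qd - eps)) / (2.0 * eps)
print(p_numeric, p_claimed(qd))  # the two values agree
```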
<h2 id="solving-euler-lagrange-for-a-black-box-lagrangian">Solving Euler-Lagrange for a Black-Box Lagrangian</h2>
<p>The core innovation of LNNs is a method for computing accelerations from a neural network that represents an arbitrary Lagrangian $\mathcal{L}(q, \dot{q})$. Starting from the <a href="https://en.wikipedia.org/wiki/Euler%E2%80%93Lagrange_equation">Euler-Lagrange equation</a>:</p>
<p>$$
\frac{d}{dt} \nabla_{\dot{q}} \mathcal{L} = \nabla_{q} \mathcal{L}
$$</p>
<p>The authors expand the time derivative using the chain rule, yielding:</p>
<p>$$
\left(\nabla_{\dot{q}} \nabla_{\dot{q}}^{\top} \mathcal{L}\right) \ddot{q} + \left(\nabla_{q} \nabla_{\dot{q}}^{\top} \mathcal{L}\right) \dot{q} = \nabla_{q} \mathcal{L}
$$</p>
<p>Solving for the accelerations gives:</p>
<p>$$
\ddot{q} = \left(\nabla_{\dot{q}} \nabla_{\dot{q}}^{\top} \mathcal{L}\right)^{-1} \left[ \nabla_{q} \mathcal{L} - \left(\nabla_{q} \nabla_{\dot{q}}^{\top} \mathcal{L}\right) \dot{q} \right]
$$</p>
<p>This requires computing the Hessian of the neural network with respect to $\dot{q}$ and then inverting it (using a pseudoinverse for numerical stability). JAX&rsquo;s automatic differentiation makes this feasible in just a few lines of code, despite the seemingly complex chain of second-order derivatives. The matrix inverse scales as $\mathcal{O}(d^3)$ with the number of coordinates $d$.</p>
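<p>The acceleration solve above can be sketched in a few lines. This NumPy sketch substitutes central finite differences for the autodiff the paper uses, and a toy harmonic-oscillator Lagrangian for the learned network; all names and values here are illustrative:</p>

```python
import numpy as np

def accelerations(L, q, qdot, eps=1e-4):
    """Solve the expanded Euler-Lagrange system for qddot:
        (grad_qd grad_qd^T L) qddot = grad_q L - (grad_q grad_qd^T L) qdot.
    Central finite differences stand in for the paper's JAX autodiff."""
    d = len(q)
    I = np.eye(d)

    def g_qd(qq, qd):  # gradient of L with respect to qdot at (qq, qd)
        return np.array([(L(qq, qd + eps * I[k]) - L(qq, qd - eps * I[k])) / (2 * eps)
                         for k in range(d)])

    grad_q = np.array([(L(q + eps * I[k], qdot) - L(q - eps * I[k], qdot)) / (2 * eps)
                       for k in range(d)])
    # Hessian w.r.t. qdot and the mixed (q, qdot) Jacobian, column by column
    H = np.stack([(g_qd(q, qdot + eps * I[k]) - g_qd(q, qdot - eps * I[k])) / (2 * eps)
                  for k in range(d)], axis=1)
    J = np.stack([(g_qd(q + eps * I[k], qdot) - g_qd(q - eps * I[k], qdot)) / (2 * eps)
                  for k in range(d)], axis=1)
    # Pseudoinverse for numerical stability, as in the paper
    return np.linalg.pinv(H) @ (grad_q - J @ qdot)

# Toy stand-in for a learned Lagrangian: harmonic oscillator, true qddot = -q
L_toy = lambda q, qd: 0.5 * qd @ qd - 0.5 * q @ q
print(accelerations(L_toy, np.array([0.3]), np.array([0.0])))  # ≈ [-0.3]
```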
<p>A critical implementation detail is the choice of activation function. Since the method takes second-order derivatives of the network, ReLU is unsuitable (its second derivative is zero everywhere). After a hyperparameter search over ReLU$^2$, ReLU$^3$, tanh, sigmoid, and softplus, the authors found <a href="https://en.wikipedia.org/wiki/Softplus">softplus</a> performed best.</p>
<p>The authors also developed a custom initialization scheme, using symbolic regression to find initialization variances that maintain well-conditioned gradients through the Hessian computation:</p>
<p>$$
\sigma = \frac{1}{\sqrt{n}} \begin{cases} 2.2 &amp; \text{First layer} \\ 0.58i &amp; \text{Hidden layer } i \\ n &amp; \text{Output layer} \end{cases}
$$</p>
<h2 id="extension-to-graphs-and-continuous-systems">Extension to Graphs and Continuous Systems</h2>
<p>LNNs extend naturally to graph-structured and continuous systems via Lagrangian <a href="/notes/machine-learning/model-architectures/relational-inductive-biases-deep-learning-graph-networks/">Graph Networks</a>. For a system with $n$ gridpoints, the total Lagrangian is decomposed into local densities:</p>
<p>$$
\mathcal{L} = \sum_{i=1}^{n} \mathcal{L}_i, \quad \text{where} \quad \mathcal{L}_i = \mathcal{L}_{\text{density}}\left(\{\phi_j, \dot{\phi}_j\}_{j \in \mathcal{I}_i}\right)
$$</p>
<p>Here $\mathcal{I}_i$ defines the neighborhood of node $i$ (e.g., $\{i-1, i, i+1\}$ for a 1D grid). The Lagrangian density is modeled as an MLP. The resulting Hessian matrix is sparse, with non-zero entries only at &ldquo;neighbor of neighbor&rdquo; positions, enabling efficient computation: in 1D, only 5 forward-over-backward autodiff passes are needed, and the tridiagonal inverse runs in linear time.</p>
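<p>A minimal sketch of this decomposition, assuming a periodic 1D grid and a hand-written wave-equation density in place of the learned MLP (the grid spacing <code>dx</code> and function names are illustrative):</p>

```python
import numpy as np

def total_lagrangian(phi, phidot, density, dx):
    """Total L as a sum of local densities over {i-1, i, i+1} neighborhoods
    on a periodic 1D grid; `density` stands in for the learned MLP."""
    n = len(phi)
    total = 0.0
    for i in range(n):
        idx = [(i - 1) % n, i, (i + 1) % n]  # neighborhood I_i with wraparound
        total += density(phi[idx], phidot[idx], dx)
    return total

def wave_density(p, pd, dx):
    # Hand-written discretization of the wave Lagrangian density:
    # 1/2 * phidot_i^2 - 1/2 * ((phi_{i+1} - phi_{i-1}) / (2 dx))^2
    return 0.5 * pd[1] ** 2 - 0.5 * ((p[2] - p[0]) / (2 * dx)) ** 2

phi = np.sin(np.linspace(0, 2 * np.pi, 8, endpoint=False))
phidot = np.zeros_like(phi)
print(total_lagrangian(phi, phidot, wave_density, dx=0.1))
```

Because each density only touches a fixed neighborhood, differentiating this sum with respect to $(\phi_i, \dot{\phi}_i)$ yields exactly the banded Hessian structure described above.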
<h2 id="experiments-double-pendulum-relativity-and-waves">Experiments: Double Pendulum, Relativity, and Waves</h2>
<p>All models used 4-layer MLPs with 500 hidden units, softplus activations, a decaying learning rate starting at $10^{-3}$, and batch size 32.</p>
<h3 id="double-pendulum">Double Pendulum</h3>
<p>The LNN and baseline achieved similar instantaneous acceleration losses ($7.3 \times 10^{-2}$ vs. $7.4 \times 10^{-2}$). The key difference appeared in long-term energy conservation: averaged over 40 random initial conditions with 100 time steps, the mean energy discrepancy was 8% of the maximum potential energy for the baseline but only 0.4% for the LNN.</p>
<h3 id="relativistic-particle">Relativistic Particle</h3>
<p>For a particle with Lagrangian $\mathcal{L} = ((1 - \dot{q}^2)^{-1/2} - 1) + gq$, the canonical momentum $\dot{q}(1 - \dot{q}^2)^{-3/2}$ is non-trivial. An HNN trained on non-canonical coordinates $(q, \dot{q})$ failed to learn the dynamics. The LNN succeeded using the same non-canonical coordinates, matching the performance of an HNN given the correct canonical coordinates.</p>
<h3 id="1d-wave-equation">1D Wave Equation</h3>
<p>The Lagrangian Graph Network learned the wave equation dynamics ($\ddot{\phi} = \frac{\partial^2 \phi}{\partial x^2}$ with $c = 1$) on a 100-gridpoint domain with periodic boundary conditions. The network learned the Lagrangian density corresponding to the continuum form $\mathcal{L} = \int (\dot{\phi}^2 - (\partial \phi / \partial x)^2) dx$, accurately modeling wave propagation and conserving energy across the material.</p>
<table>
  <thead>
      <tr>
          <th>Experiment</th>
          <th>Model</th>
          <th>Energy Error (% of max PE)</th>
          <th>Canonical Coords Required</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Double Pendulum</td>
          <td>Baseline</td>
          <td>8%</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Double Pendulum</td>
          <td>LNN</td>
          <td>0.4%</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Relativistic Particle</td>
          <td>HNN (non-canonical)</td>
          <td>Failed</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Relativistic Particle</td>
          <td>HNN (canonical)</td>
          <td>Succeeded</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Relativistic Particle</td>
          <td>LNN</td>
          <td>Succeeded</td>
          <td>No</td>
      </tr>
      <tr>
          <td>1D Wave Equation</td>
          <td>LGN</td>
          <td>Energy conserved</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
<h2 id="findings-and-comparison-to-prior-approaches">Findings and Comparison to Prior Approaches</h2>
<p>LNNs combine several desirable properties that no single prior method offers:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Neural Net</th>
          <th>Neural ODE</th>
          <th>HNN</th>
          <th>DeLaN</th>
          <th>LNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Models dynamical systems</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Learns differential equations</td>
          <td></td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Learns exact conservation laws</td>
          <td></td>
          <td></td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Learns from arbitrary coordinates</td>
          <td>Yes</td>
          <td>Yes</td>
          <td></td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Learns arbitrary Lagrangians</td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p>The main limitation is computational cost: the Hessian computation and inversion scale as $\mathcal{O}(d^3)$ in the number of coordinates. The Lagrangian Graph Network partially mitigates this for spatially extended systems through the sparsity of the resulting Hessian. The method also assumes access to state derivatives ($\dot{q}$) during training, which may not always be directly available from observations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Double pendulum</td>
          <td>600,000 random initial conditions</td>
          <td>Simulated with masses and lengths set to 1</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Relativistic particle</td>
          <td>Random initial conditions and $g$ values</td>
          <td>$c = 1$, mass = 1, uniform potential</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>1D wave equation</td>
          <td>100 gridpoints</td>
          <td>Periodic boundary conditions, $c = 1$</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Forward model: Euler-Lagrange equation solved via Equation 6 using JAX autodiff</li>
<li>Pseudoinverse used for Hessian inversion to handle potential singular matrices</li>
<li>Custom initialization scheme (Equation 16) derived via symbolic regression with eureqa</li>
<li>Softplus activation selected via hyperparameter search</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4-layer MLP with 500 hidden units for all experiments</li>
<li>Softplus activation function</li>
<li>Code: <a href="https://github.com/MilesCranmer/lagrangian_nns">github.com/MilesCranmer/lagrangian_nns</a> (Apache-2.0)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LNN</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Acceleration loss (double pendulum)</td>
          <td>$7.3 \times 10^{-2}$</td>
          <td>$7.4 \times 10^{-2}$</td>
          <td>Similar short-term accuracy</td>
      </tr>
      <tr>
          <td>Energy error (double pendulum)</td>
          <td>0.4%</td>
          <td>8%</td>
          <td>Percentage of max potential energy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. JAX-based implementation supports CPU and GPU execution.</p>
<hr>
<p><strong>Reproducibility Status</strong>: Highly Reproducible</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MilesCranmer/lagrangian_nns">lagrangian_nns</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official JAX implementation with notebooks for all experiments</td>
      </tr>
      <tr>
          <td>Training data</td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Generated procedurally; simulation code included in repository</td>
      </tr>
      <tr>
          <td>Trained models</td>
          <td>Model</td>
          <td>N/A</td>
          <td>Not provided</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D., &amp; Ho, S. (2020). Lagrangian Neural Networks. <em>ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations</em>. arXiv: <a href="https://arxiv.org/abs/2003.04630">2003.04630</a></p>
<p><strong>Publication</strong>: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{cranmer2020lagrangian,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Lagrangian Neural Networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cranmer, Miles and Greydanus, Sam and Hoyer, Stephan and Battaglia, Peter and Spergel, David and Ho, Shirley}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2003.04630}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Ewald Message Passing for Molecular Graphs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/ewald-message-passing-molecular-graphs/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/ewald-message-passing-molecular-graphs/</guid><description>Ewald message passing augments GNNs with Fourier-space long-range interactions, improving energy predictions by 10-16% on OC20 and OE62 benchmarks.</description><content:encoded><![CDATA[<h2 id="a-fourier-space-long-range-correction-for-molecular-gnns">A Fourier-Space Long-Range Correction for Molecular GNNs</h2>
<p>This is a <strong>Method</strong> paper that introduces Ewald message passing (Ewald MP), a general framework for incorporating long-range interactions into message passing neural networks (MPNNs) for molecular <a href="/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/">potential energy surface</a> prediction. The key contribution is a nonlocal Fourier-space message passing scheme, grounded in the classical <a href="https://en.wikipedia.org/wiki/Ewald_summation">Ewald summation</a> technique from computational physics, that complements the short-range message passing of existing GNN architectures.</p>
<h2 id="the-long-range-interaction-problem-in-molecular-gnns">The Long-Range Interaction Problem in Molecular GNNs</h2>
<p>Standard MPNNs for molecular property prediction rely on a spatial distance cutoff to define atomic neighborhoods. While this locality assumption enables favorable scaling with system size and provides a useful inductive bias, it fundamentally limits the model&rsquo;s ability to capture long-range interactions such as electrostatic forces and van der Waals (<a href="https://en.wikipedia.org/wiki/London_dispersion_force">London dispersion</a>) interactions. These interactions decay slowly with distance (e.g., electrostatic energy follows a $1/r$ power law), and truncating them with a distance cutoff can introduce severe artifacts in thermochemical predictions.</p>
<p>This problem is well-known in molecular dynamics, where empirical force fields explicitly separate bonded (short-range) and non-bonded (long-range) energy terms. The Ewald summation technique addresses this by decomposing interactions into a short-range part that converges quickly with a distance cutoff and a long-range part whose Fourier transform converges quickly with a frequency cutoff. The authors propose bringing this same strategy into the GNN paradigm.</p>
<h2 id="from-ewald-summation-to-learnable-fourier-space-messages">From Ewald Summation to Learnable Fourier-Space Messages</h2>
<p>The core insight is a formal analogy between the continuous-filter convolution used in MPNNs and the electrostatic potential computation in Ewald summation. In a standard continuous-filter convolution, the message sum for atom $i$ is:</p>
<p>$$
M_i^{(l+1)} = \sum_{j \in \mathcal{N}(i)} h_j^{(l)} \cdot \Phi^{(l)}(| \mathbf{x}_i - \mathbf{x}_j |)
$$</p>
<p>where $h_j^{(l)}$ are atom embeddings and $\Phi^{(l)}$ is a learned radial filter. Comparing this to the electrostatic potential $V_i^{\text{es}}(\mathbf{x}_i) = \sum_{j \neq i} q_j \cdot \Phi^{\text{es}}(| \mathbf{x}_i - \mathbf{x}_j |)$ reveals a direct correspondence: atom embeddings play the role of partial charges, and learned filters replace the $1/r$ kernel.</p>
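<p>A literal NumPy sketch of this short-range message sum, with a hypothetical <code>radial_filter</code> standing in for the learned filter network $\Phi^{(l)}$ and illustrative coordinates and embeddings:</p>

```python
import numpy as np

def continuous_filter_messages(x, h, radial_filter, cutoff):
    """M_i = sum over neighbors j of h_j * Phi(|x_i - x_j|), with the
    neighborhood N(i) defined by a distance cutoff."""
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # (N, N)
    M = np.zeros_like(h)
    for i in range(len(x)):
        for j in range(len(x)):
            if i != j and dists[i, j] < cutoff:
                M[i] += h[j] * radial_filter(dists[i, j])
    return M

# Three atoms on a line; the third lies beyond the cutoff of the first two
x = np.array([[0.0, 0, 0], [1.0, 0, 0], [10.0, 0, 0]])
h = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(continuous_filter_messages(x, h, lambda r: 1.0 / r, cutoff=2.0))
```

Everything outside the cutoff contributes nothing to $M_i$, which is precisely the truncation that Ewald MP's long-range branch is designed to compensate for.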
<p>Ewald MP decomposes the learned filter into short-range and long-range components. The short-range part is handled by any existing GNN architecture with a distance cutoff. The long-range part is computed as a sum over Fourier frequencies:</p>
<p>$$
M^{\text{lr}}(\mathbf{x}_i) = \sum_{\mathbf{k}} \exp(i \mathbf{k}^T \mathbf{x}_i) \cdot s_{\mathbf{k}} \cdot \hat{\Phi}^{\text{lr}}(| \mathbf{k} |)
$$</p>
<p>where $s_{\mathbf{k}}$ are <strong><a href="https://en.wikipedia.org/wiki/Structure_factor">structure factor</a> embeddings</strong>, computed as:</p>
<p>$$
s_{\mathbf{k}} = \sum_{j \in \mathcal{S}} h_j \exp(-i \mathbf{k}^T \mathbf{x}_j)
$$</p>
<p>These structure factor embeddings are a Fourier-space representation of the atom embedding distribution, and truncating to low frequencies effectively coarse-grains the hidden model state while preserving long-range information. The frequency filters $\hat{\Phi}^{\text{lr}}$ are learned, making the entire scheme data-driven rather than tied to a fixed physical functional form.</p>
<p>The method handles both <strong>periodic</strong> systems (where the <a href="https://en.wikipedia.org/wiki/Reciprocal_lattice">reciprocal lattice</a> provides a natural frequency discretization) and <strong>aperiodic</strong> systems (where the Fourier domain is discretized using a cubic voxel grid with SVD-based rotation alignment to preserve rotation invariance). The combined embedding update becomes:</p>
<p>$$
h_i^{(l+1)} = \frac{1}{\sqrt{3}} \left[ h_i^{(l)} + f_{\text{upd}}^{\text{sr}}(M_i^{\text{sr}}) + f_{\text{upd}}^{\text{lr}}(M_i^{\text{lr}}) \right]
$$</p>
<p>The computational complexity is $\mathcal{O}(N_{\text{at}} N_{\text{k}})$, and by fixing the number of frequency vectors $N_{\text{k}}$, linear scaling $\mathcal{O}(N_{\text{at}})$ is achievable.</p>
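<p>The two-step computation — structure factors first, then per-atom synthesis — can be sketched in NumPy as follows, with <code>freq_filter</code> a stand-in for the learned $\hat{\Phi}^{\text{lr}}$ (in the actual method the filter is learned per interaction block):</p>

```python
import numpy as np

def ewald_long_range_messages(x, h, kvecs, freq_filter):
    """Long-range message sum via structure factors, O(N_at * N_k).
    `freq_filter` is a stand-in for the learned frequency filter."""
    phase = np.exp(-1j * x @ kvecs.T)               # (N_at, N_k): e^{-i k.x_j}
    s = phase.T @ h                                 # (N_k, D) structure factors s_k
    w = freq_filter(np.linalg.norm(kvecs, axis=1))  # (N_k,) filter values
    # M_i = sum_k e^{+i k.x_i} * w_k * s_k; conj(phase) supplies e^{+i k.x_i}
    return (np.conj(phase) * w) @ s                 # (N_at, D) complex messages
```

Expanding the matrix products recovers the pairwise sum $\sum_j h_j \sum_{\mathbf{k}} e^{i\mathbf{k}^\top(\mathbf{x}_i - \mathbf{x}_j)} \hat{\Phi}^{\text{lr}}(|\mathbf{k}|)$ without ever forming it explicitly, which is where the linear scaling in $N_{\text{at}}$ comes from once $N_{\text{k}}$ is fixed.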
<h2 id="experiments-across-four-gnn-architectures-and-two-datasets">Experiments Across Four GNN Architectures and Two Datasets</h2>
<p>The authors test Ewald MP as an augmentation on four baseline architectures: <a href="/notes/chemistry/datasets/marcel/">SchNet, PaiNN, DimeNet++, and GemNet-T</a>. Two datasets are used:</p>
<ul>
<li><strong>OC20</strong> (Chanussot et al., 2021): ~265M periodic structures of adsorbate-catalyst systems with DFT-computed energies and forces. The OC20-2M subsplit is used for training.</li>
<li><strong>OE62</strong> (Stuke et al., 2020): ~62,000 large aperiodic organic molecules with DFT-computed energies that include a DFT-D3 dispersion correction for London dispersion interactions.</li>
</ul>
<p>All baselines use a 6 Å distance cutoff and 50 maximum neighbors. The Ewald modification is minimal: the long-range message sum is added as an additional skip connection term in each interaction block. Comparison studies include: (1) increasing the distance cutoff to match the computational cost of Ewald MP, (2) replacing the Ewald block with a SchNet interaction block at increased cutoff, and (3) increasing atom embedding dimensions to match Ewald MP&rsquo;s parameter count.</p>
<h3 id="key-energy-mae-results-on-oe62">Key Energy MAE Results on OE62</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Baseline (meV)</th>
          <th>Ewald MP (meV)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>133.5</td>
          <td>79.2</td>
          <td>40.7%</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>61.4</td>
          <td>57.9</td>
          <td>5.7%</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>51.2</td>
          <td>46.5</td>
          <td>9.2%</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>51.5</td>
          <td>47.4</td>
          <td>8.0%</td>
      </tr>
  </tbody>
</table>
<h3 id="key-energy-mae-results-on-oc20-averaged-across-test-splits">Key Energy MAE Results on OC20 (Averaged Across Test Splits)</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Baseline (meV)</th>
          <th>Ewald MP (meV)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>895</td>
          <td>830</td>
          <td>7.3%</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>448</td>
          <td>393</td>
          <td>12.3%</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>496</td>
          <td>445</td>
          <td>10.4%</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>346</td>
          <td>307</td>
          <td>11.3%</td>
      </tr>
  </tbody>
</table>
<h2 id="robust-long-range-improvements-and-dispersion-recovery">Robust Long-Range Improvements and Dispersion Recovery</h2>
<p>Ewald MP achieves consistent improvements across all models and both datasets, averaging 16.1% on OE62 and 10.3% on OC20. Several findings stand out:</p>
<ol>
<li>
<p><strong>Robustness</strong>: Unlike the increased-cutoff and SchNet-LR alternatives, Ewald MP never produces detrimental effects in any tested configuration. The increased cutoff setting hurts SchNet and PaiNN on OE62, and the SchNet-LR block fails to improve DimeNet++ and GemNet-T.</p>
</li>
<li>
<p><strong>Long-range specificity</strong>: A binning analysis on OE62 groups molecules by the magnitude of their DFT-D3 dispersion correction. Ewald MP shows an outsize improvement for structures with large long-range energy contributions. It recovers or surpasses a &ldquo;cheating&rdquo; baseline that receives the exact DFT-D3 ground truth as an additional input.</p>
</li>
<li>
<p><strong>Efficiency on periodic systems</strong>: Ewald MP achieves similar relative improvements on OC20 at roughly half the relative computational cost compared to OE62, suggesting periodic structures as a particularly attractive application domain.</p>
</li>
<li>
<p><strong>Force predictions</strong>: Improvements in <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">force MAEs</a> are consistent but small, which is expected since the frequency truncation removes high-frequency contributions to the potential energy surface.</p>
</li>
<li>
<p><strong>Ablation studies</strong>: Results are robust across different frequency cutoffs, voxel resolutions, and filtering strategies, with the non-radial periodic filtering scheme outperforming radial alternatives on out-of-distribution generalization.</p>
</li>
</ol>
<p>Limitations include the current focus on scalar (invariant) embeddings only (PaiNN&rsquo;s equivariant vector embeddings are not augmented), and the potential for a &ldquo;gap&rdquo; of medium-range interactions when $N_{\text{k}}$ is fixed for linear scaling. The authors suggest adapting more efficient Ewald summation variants (e.g., particle mesh Ewald with $\mathcal{O}(N \log N)$ scaling) as future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (periodic)</td>
          <td>OC20-2M</td>
          <td>~2M structures</td>
          <td>Subsplit of OC20; PBC; DFT energies and forces</td>
      </tr>
      <tr>
          <td>Training (aperiodic)</td>
          <td>OE62</td>
          <td>~62,000 molecules</td>
          <td>Large organic molecules; DFT energies with D3 correction</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>OC20-test (4 splits: ID, OOD-ads, OOD-cat, OOD-both)</td>
          <td>Varies</td>
          <td>Evaluated via submission to OC20 evaluation server</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>OE62-val, OE62-test</td>
          <td>~6,000 each</td>
          <td>Direct evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Ewald message passing is integrated as an additional skip connection term in each interaction block</li>
<li>For periodic systems: non-radial filtering with fixed reciprocal lattice positions ($N_x, N_y, N_z$ hyperparameters)</li>
<li>For aperiodic systems: radial Gaussian basis function filtering with frequency cutoff $c_k$ and voxel resolution $\Delta = 0.2$ Å$^{-1}$</li>
<li>SVD-based coordinate alignment for rotation invariance in the aperiodic case</li>
<li>Bottleneck dimension $N_\downarrow = 16$ (GemNet-T) or $N_\downarrow = 8$ (others)</li>
<li>Update function: dense layer + $N_{\text{hidden}}$ residual layers ($N_{\text{hidden}} = 3$, except PaiNN with $N_{\text{hidden}} = 0$)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Embedding Size (OE62)</th>
          <th>Interaction Blocks</th>
          <th>Ewald Params (OE62)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>512</td>
          <td>4</td>
          <td>12.2M total</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>512</td>
          <td>4</td>
          <td>15.7M total</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>256</td>
          <td>3</td>
          <td>4.8M total</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>256</td>
          <td>3</td>
          <td>16.1M total</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Primary metric: Energy mean absolute error (EMAE) in meV</li>
<li>Secondary metric: Force MAE in meV/Å (OC20 only)</li>
<li>Loss: Linear combination of energy and force MAEs (Eq. 15) with model-specific force multipliers</li>
<li>Optimizer: Adam with weight decay ($\lambda = 0.01$)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All runtime measurements on NVIDIA A100 GPUs</li>
<li>Runtimes measured after 50 warmup batches, averaged over 500 batches, minimum of 3 repetitions</li>
<li>Code: <a href="https://github.com/arthurkosmala/EwaldMP">EwaldMP</a> (Hippocratic License 3.0)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/arthurkosmala/EwaldMP">EwaldMP</a></td>
          <td>Code</td>
          <td>Hippocratic License 3.0 (new files) / MIT (OC20 base)</td>
          <td>Official implementation built on the Open Catalyst Project codebase</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md">OC20</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>~265M periodic adsorbate-catalyst structures with DFT energies and forces</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.1038/s41597-020-0385-y">OE62</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>~62,000 large organic molecules with DFT energies including D3 correction</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Source code, both datasets, and detailed hyperparameters (including per-model learning rates, batch sizes, and Ewald-specific settings) are all publicly available. Pre-trained model weights are not provided.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kosmala, A., Gasteiger, J., Gao, N., &amp; Günnemann, S. (2023). Ewald-based Long-Range Message Passing for Molecular Graphs. In <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>.</p>
<p><strong>Publication</strong>: ICML 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kosmala2023ewald,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Ewald-based Long-Range Message Passing for Molecular Graphs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kosmala, Arthur and Gasteiger, Johannes and Gao, Nicholas and G{\&#34;u}nnemann, Stephan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 40th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Block-Recurrent Transformers for Long Sequences</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</guid><description>Block-Recurrent Transformers combine attention and recurrence for linear-complexity language modeling on long documents like books and code.</description><content:encoded><![CDATA[<h2 id="a-method-for-combining-attention-with-block-level-recurrence">A Method for Combining Attention with Block-Level Recurrence</h2>
<p>This is a <strong>Method</strong> paper that introduces the Block-Recurrent Transformer, a model architecture that integrates recurrence into the transformer framework at the block level. Rather than processing tokens one at a time (as in traditional RNNs) or attending over entire sequences (as in standard transformers), this approach applies a transformer layer recurrently across blocks of tokens. The result is a model with linear complexity in sequence length that maintains the parallelism benefits of transformers during training. A related approach, <a href="/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/">RWKV</a>, later explored similar ideas using linear attention with channel-wise decay.</p>
<h2 id="why-transformers-struggle-with-long-documents">Why Transformers Struggle with Long Documents</h2>
<p>Transformers have largely replaced RNNs for sequence modeling tasks, but their quadratic self-attention cost limits the length of sequences they can process. A transformer with a window size of 512 tokens cannot see information beyond that window, making it blind to long-range dependencies in books, technical papers, or source code repositories.</p>
<p>Prior approaches to this problem fall into several categories: sparse attention patterns (BigBird, Routing Transformers, Reformer), sequence compression (Linformer, Funnel Transformers), and linearized attention approximations. These methods either sacrifice the expressiveness of full softmax attention or introduce implementation complexity.</p>
<p>Traditional RNNs like LSTMs offer linear complexity but suffer from three key limitations: sequential processing prevents parallelism on modern hardware, a single state vector bottlenecks information capacity, and vanishing gradients limit effective memory to a few hundred tokens.</p>
<h2 id="block-level-recurrence-with-lstm-style-gates">Block-Level Recurrence with LSTM-Style Gates</h2>
<p>The core innovation is applying a standard transformer layer in a recurrent fashion along the sequence, operating on blocks of $W$ tokens rather than individual tokens. The recurrent cell maintains $S$ state vectors (typically $S = W = 512$) that are updated at each block boundary.</p>
<h3 id="the-recurrent-cell">The Recurrent Cell</h3>
<p>The cell has two processing directions:</p>
<ul>
<li><strong>Vertical direction</strong>: An ordinary transformer layer with self-attention over input tokens and cross-attention to recurrent states, producing output embeddings.</li>
<li><strong>Horizontal direction</strong>: Self-attention over current state vectors and cross-attention to input tokens, producing updated state vectors. Residual connections are replaced with gates.</li>
</ul>
<p>Self-attention and cross-attention are computed in parallel (not sequentially), with results concatenated and fed into a linear projection. Keys and values are shared between directions, while queries are separate, yielding four query sets: $Q_e^v$, $Q_s^v$ (vertical) and $Q_s^h$, $Q_e^h$ (horizontal).</p>
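<p>The two-direction update above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper&rsquo;s implementation: query/key/value projections are identity maps, both directions share one output projection, and multi-head structure and gating are omitted.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    # Scaled dot-product attention for 2-D (length, dim) arrays.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def recurrent_cell_sketch(tokens, states, W_proj):
    """One block step. tokens: (W, D) embeddings for the current block;
    states: (S, D) recurrent state vectors; W_proj: (2D, D) projection.
    Keys/values are shared between directions; the four query sets are
    identity projections here, purely for illustration."""
    # Vertical: token self-attention + cross-attention to states,
    # computed in parallel, concatenated, then projected.
    vertical = np.concatenate(
        [attend(tokens, tokens, tokens), attend(tokens, states, states)],
        axis=-1) @ W_proj
    # Horizontal: state self-attention + cross-attention to tokens.
    horizontal = np.concatenate(
        [attend(states, states, states), attend(states, tokens, tokens)],
        axis=-1) @ W_proj
    return vertical, horizontal  # output embeddings, state-update candidate
```

<p>In the actual architecture the horizontal output is not added residually but fed through the gates described next.</p>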
<h3 id="gating-mechanisms">Gating Mechanisms</h3>
<p>Two gate types are explored. The <strong>fixed gate</strong> uses a learned convex combination:</p>
<p>$$
g = \sigma(b_g)
$$</p>
<p>$$
c_{t+1} = c_t \odot g + z_t \odot (1 - g)
$$</p>
<p>where $g$ is constant after training, implementing an <a href="https://en.wikipedia.org/wiki/Moving_average">exponential moving average</a>.</p>
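<p>A minimal sketch of the fixed gate (the bias values are hypothetical): because $g$ depends only on a learned bias, the state update reduces to a per-channel exponential moving average of the cell&rsquo;s candidate updates.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

b_g = np.array([2.0, 0.0, -2.0])  # hypothetical learned gate biases
g = sigmoid(b_g)                  # per-channel EMA decay, constant after training

c = np.zeros(3)                   # recurrent state
for z in [np.ones(3), np.ones(3)]:  # candidate updates z_t from the cell
    c = c * g + z * (1.0 - g)     # convex combination: old state vs. new input
```
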
<p>The <strong>LSTM gate</strong> uses input and forget gates:</p>
<p>$$
i_t = \sigma(W_i h_t + b_i - 1)
$$</p>
<p>$$
f_t = \sigma(W_f h_t + b_f + 1)
$$</p>
<p>$$
c_{t+1} = c_t \odot f_t + z_t \odot i_t
$$</p>
<p>The bias offsets ($-1$ for input, $+1$ for forget) initialize the model to &ldquo;remember&rdquo; by default, which is critical for training stability. Without careful initialization, the model can fall into a local optimum where it ignores the recurrent state entirely. This echoes the <a href="/notes/machine-learning/model-architectures/can-recurrent-neural-networks-warp-time/">gate initialization challenges studied by Tallec and Ollivier</a>, who derived chrono initialization for LSTMs from time-warping invariance.</p>
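<p>The effect of the bias offsets can be seen in a small sketch (weights at zero, as at initialization; shapes are illustrative): the forget gate starts near $\sigma(+1) \approx 0.73$ and the input gate near $\sigma(-1) \approx 0.27$, so the cell leans toward remembering its state.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gate_step(c, h, z, W_i, W_f, b_i, b_f):
    """One gated state update with the paper's bias offsets."""
    i = sigmoid(W_i @ h + b_i - 1.0)  # input gate, biased toward "ignore input"
    f = sigmoid(W_f @ h + b_f + 1.0)  # forget gate, biased toward "remember"
    return c * f + z * i

# At initialization (all weights and biases zero) the state decays slowly,
# which keeps long-range gradients alive early in training.
D = 4
c_next = lstm_gate_step(np.ones(D), np.zeros(D), np.zeros(D),
                        np.zeros((D, D)), np.zeros((D, D)),
                        np.zeros(D), np.zeros(D))
```
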
<h3 id="gate-configurations">Gate Configurations</h3>
<p>Three configurations are tested: <strong>dual</strong> (gates on both attention and MLP outputs), <strong>single</strong> (gate only on MLP output), and <strong>skip</strong> (gate only on attention output, no MLP). The skip configuration removes the large MLP from the recurrent layer entirely.</p>
<h3 id="learned-state-ids">Learned State IDs</h3>
<p>Since the same weights are applied to all state vectors, learned &ldquo;state IDs&rdquo; (analogous to position embeddings) are added so each state vector can issue distinct queries. <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style relative position bias is used for token self-attention, with no position bias for state-token cross-attention.</p>
<h2 id="language-modeling-on-pg19-arxiv-and-github">Language Modeling on PG19, arXiv, and GitHub</h2>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>The base model is a 12-layer transformer with 150M parameters (8 heads of size 128, embedding dimension 1024, MLP hidden size 4096). The recurrent layer is placed at layer 10 with segment length $N = 4096$ and window size $W = 512$. The architecture is evaluated on three long-document datasets:</p>
<ul>
<li><strong>PG19</strong>: Full-length books from <a href="https://en.wikipedia.org/wiki/Project_Gutenberg">Project Gutenberg</a> (pre-1919)</li>
<li><strong>arXiv</strong>: Mathematics papers in LaTeX</li>
<li><strong>GitHub</strong>: Concatenated source code from open-source repositories</li>
</ul>
<p>All models report bits-per-token ($\log_2$ perplexity, lower is better).</p>
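<p>Bits-per-token and perplexity convert directly, which helps when comparing tables across papers: a quick check using the Rec:fixed:skip PG19 number from the results below.</p>

```python
import math

# Bits-per-token is log2 of token-level perplexity, so ppl = 2 ** bpt.
bpt = 3.53       # Rec:fixed:skip, PG19 token-level bits-per-token
ppl = 2 ** bpt   # corresponding token-level perplexity
```
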
<h3 id="baselines">Baselines</h3>
<p>Five baselines are compared: Transformer-XL with window sizes of 512, 1024, and 2048, plus 12-layer and 13-layer sliding-window models. The 13-layer sliding window (Slide:13L) is the primary comparison, since it matches the recurrent models in computation cost and parameter count.</p>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Step Time</th>
          <th>PG19 (bytes)</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XL:512</td>
          <td>0.88</td>
          <td>1.01</td>
          <td>3.62</td>
          <td>1.45</td>
          <td>1.21</td>
      </tr>
      <tr>
          <td>XL:2048</td>
          <td>2.11</td>
          <td>0.990</td>
          <td>3.58</td>
          <td>1.31</td>
          <td>1.01</td>
      </tr>
      <tr>
          <td>Slide:13L</td>
          <td>1.00</td>
          <td>0.989</td>
          <td>3.58</td>
          <td>1.42</td>
          <td>1.17</td>
      </tr>
      <tr>
          <td>Rec:fixed:skip</td>
          <td>0.99</td>
          <td>0.952</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Rec:fixed:dual</td>
          <td>1.01</td>
          <td>0.957</td>
          <td>3.52</td>
          <td>1.27</td>
          <td>0.991</td>
      </tr>
      <tr>
          <td>Feedback:fixed:skip</td>
          <td>1.35</td>
          <td>0.935</td>
          <td>3.49</td>
          <td>1.24</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memorizing Trans. 64k</td>
          <td>1.94</td>
          <td>0.950</td>
          <td>3.53</td>
          <td>1.22</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The Rec:fixed:skip configuration achieves the best overall results while being slightly faster than the 13-layer baseline. It outperforms XL:2048, which runs over 2x slower. The block feedback variant (allowing all layers to cross-attend to recurrent states) improves perplexity further at ~35-40% higher step time.</p>
<h3 id="scaling-behavior">Scaling Behavior</h3>
<p>Models from 40M to 1.3B parameters show that the benefit of recurrence is <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">consistent across scales</a> and increases with model size. At larger sizes, adding recurrence provides a benefit greater than doubling the number of parameters. The 1.3B parameter model achieves 26.50 word-level perplexity on PG19, setting a new state of the art at the time of publication.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>PG19 Perplexity</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compressive Transformer</td>
          <td>36</td>
          <td>33.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Routing Transformer</td>
          <td>22</td>
          <td>33.2</td>
          <td>490M</td>
      </tr>
      <tr>
          <td>Perceiver AR</td>
          <td>60</td>
          <td>28.9</td>
          <td>974.6M</td>
      </tr>
      <tr>
          <td>Block-Recurrent Transformer</td>
          <td>24</td>
          <td>26.50</td>
          <td>1.3B</td>
      </tr>
  </tbody>
</table>
<h3 id="ablations">Ablations</h3>
<ul>
<li><strong>Multiple recurrent layers</strong>: Two adjacent layers (9, 10) provide no benefit. Two separated layers (4, 10) help but no more than adding another non-recurrent layer.</li>
<li><strong>Number of states</strong>: Improvement up to 1024 states, degradation at 2048.</li>
<li><strong>Window size reduction</strong>: Reducing the sliding window hurts Transformer-XL dramatically but has smaller impact on the recurrent model, which compensates via recurrence.</li>
<li><strong>Gate type</strong>: The fixed gate consistently outperforms the LSTM gate despite being theoretically less expressive.</li>
</ul>
<h3 id="qualitative-analysis">Qualitative Analysis</h3>
<p>Comparing per-token predictions against Transformer-XL on PG19 books, the recurrent model&rsquo;s advantage comes overwhelmingly from predicting proper names (17 of the 20 most-improved tokens). In 19 of 20 cases, the predicted word occurred outside the attention window, confirming that the information was carried in the recurrent state. The model can remember book titles and authors across 60,000+ tokens.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The Block-Recurrent Transformer demonstrates that recurrence at the block level is a cost-effective way to improve language modeling on long sequences. The fixed:skip configuration (the simplest variant) performs best, suggesting the model primarily uses recurrence for long-range name lookup rather than complex reasoning. The fact that removing the MLP from the recurrent layer has minimal impact further supports this interpretation.</p>
<p>Key limitations include: the model was only evaluated on language modeling perplexity (no downstream tasks), the LSTM gate underperforms the simpler fixed gate (suggesting untapped potential for more expressive recurrence), and the authors acknowledge that training the recurrent layer to fully exploit its capacity for knowledge extraction will require further advances.</p>
<p>The authors note that evaluating on downstream tasks requiring long-range context (book summarization, long-document QA, code completion) is an important direction for future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>PG19</td>
          <td>~29k books</td>
          <td>Public domain, freely available</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>arXiv</td>
          <td>Mathematics papers</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>GitHub</td>
          <td>Open-source repos</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adafactor</li>
<li>Learning rate: 1.0 with inverse square root decay (initial experiments), cosine decay with max 0.01 (scaling experiments)</li>
<li>Warmup: 1000 steps</li>
<li>Dropout: 0.05</li>
<li>Vocabulary: 32k SentencePiece (T5 pretrained for initial, custom for scaling)</li>
<li>Gate initialization: bias of $+1$ for forget gate, $-1$ for input gate to ensure initial &ldquo;remember&rdquo; behavior</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Layers</th>
          <th>Parameters</th>
          <th>Recurrent Layers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>12 (+1 recurrent)</td>
          <td>~151-164M</td>
          <td>Layer 10</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>24 (+2 recurrent)</td>
          <td>650M</td>
          <td>Layers 10, 20</td>
      </tr>
      <tr>
          <td>XL</td>
          <td>24 (+2 recurrent)</td>
          <td>1.3B</td>
          <td>Layers 10, 20</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bits-per-token</td>
          <td>Rec:fixed:skip</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Word-level PPL</td>
          <td>1.3B model</td>
          <td>26.50</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Error bars on PG19 are between 0.002 and 0.007 (3 runs with different seeds).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 32 TPU v4 replicas (Google Cloud)</li>
<li>Training time: ~48 hours for 500k steps on PG19</li>
<li>Batch size: 32 (segment length 4096) or 256 (segment length 512), adjusted so each model sees the same tokens per step</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Available</th>
          <th>License</th>
          <th>URL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Code (Meliad)</td>
          <td>Yes</td>
          <td>Apache 2.0</td>
          <td><a href="https://github.com/google-research/meliad">github.com/google-research/meliad</a></td>
      </tr>
      <tr>
          <td>PG19 Dataset</td>
          <td>Yes</td>
          <td>Public Domain</td>
          <td>Public</td>
      </tr>
      <tr>
          <td>arXiv Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>GitHub Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>Pretrained Models</td>
          <td>No</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Assessment</strong>: Partially Reproducible. Source code is available under Apache 2.0 and the PG19 dataset is public. However, two of three evaluation datasets (arXiv, GitHub) were obtained via private channels and are not redistributable. No pretrained model checkpoints are released.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hutchins, D., Schlag, I., Wu, Y., Dyer, E., &amp; Neyshabur, B. (2022). Block-Recurrent Transformers. <em>Advances in Neural Information Processing Systems 35 (NeurIPS 2022)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hutchins2022block,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Block-Recurrent Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hutchins, DeLesley and Schlag, Imanol and Wu, Yuhuai and Dyer, Ethan and Neyshabur, Behnam}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2203.07852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NaViT: Native Resolution Vision Transformer</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</guid><description>NaViT uses sequence packing to train Vision Transformers on images at native resolution and aspect ratio, improving efficiency and flexibility.</description><content:encoded><![CDATA[<h2 id="a-method-for-flexible-resolution-vision-transformers">A Method for Flexible-Resolution Vision Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces NaViT (Native Resolution ViT), a Vision Transformer trained using sequence packing to handle images of arbitrary resolution and aspect ratio. The core idea, called &ldquo;Patch n&rsquo; Pack,&rdquo; borrows example packing from NLP and applies it to vision: patches from multiple images of different sizes are concatenated into a single sequence, enabling native-resolution processing without resizing or padding.</p>
<h2 id="why-fixed-resolution-pipelines-are-suboptimal">Why Fixed-Resolution Pipelines Are Suboptimal</h2>
<p>Standard computer vision pipelines resize all images to a fixed square resolution before processing. This practice originates from convolutional neural network constraints, where fixed spatial dimensions were architecturally required. Even with Vision Transformers, which operate on sequences of patches and could in principle handle variable lengths, the convention of fixed-resolution input persists.</p>
<p>This approach has clear drawbacks. Most images are not square: analysis of ImageNet, LVIS, and WebLI shows that most images deviate more than 20% from a 1:1 aspect ratio. Resizing distorts content and discards information, while padding wastes computation. Prior work like FlexiViT addressed variable patch sizes and Pix2Struct introduced aspect-ratio-preserving patching, but neither fully solved the problem of training efficiently on images at their original resolution.</p>
<h2 id="patch-n-pack-sequence-packing-for-vision">Patch n&rsquo; Pack: Sequence Packing for Vision</h2>
<p>The key insight is that ViT already processes images as sequences of patch tokens, and NLP has long used example packing to handle variable-length sequences efficiently. NaViT applies this directly: patches from multiple images (each at its native resolution and aspect ratio) are packed into a single fixed-length sequence.</p>
<h3 id="architectural-modifications">Architectural Modifications</h3>
<p>Three changes enable Patch n&rsquo; Pack:</p>
<ol>
<li>
<p><strong>Masked self-attention and masked pooling</strong>: Attention masks prevent patches from different images from attending to each other. Masked pooling extracts a single representation per image from the packed sequence.</p>
</li>
<li>
<p><strong>Factorized positional embeddings</strong>: Standard 1D positional embeddings cannot handle arbitrary resolutions. NaViT decomposes position into separate $x$ and $y$ embeddings $\phi_{x}$ and $\phi_{y}$, which are summed together. Two schemes are considered:</p>
<ul>
<li>Absolute embeddings: $\phi(p): [0, \text{maxLen}] \to \mathbb{R}^{D}$, a function of the absolute patch index</li>
<li>Fractional embeddings: $\phi(r): [0, 1] \to \mathbb{R}^{D}$, where $r = p / \text{side-length}$ is the relative position along the image</li>
</ul>
</li>
<li>
<p><strong>Chunked contrastive loss</strong>: For contrastive pretraining, the $\mathcal{O}(n^{2})$ loss computation is handled via chunked computation across device subsets to support the high number of examples per sequence.</p>
</li>
</ol>
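<p>The factorized embedding idea (item 2 above) is simple to sketch for the absolute scheme: every patch looks up an $x$-embedding and a $y$-embedding from two shared tables and sums them, so images of any aspect ratio reuse the same tables. Table sizes and dimensions below are illustrative.</p>

```python
import numpy as np

def factorized_pos_emb(h_patches, w_patches, emb_x, emb_y):
    """Summed x- and y-embeddings for every patch of one image.
    emb_x, emb_y: (max_len, D) lookup tables (the 'absolute' scheme)."""
    ys, xs = np.meshgrid(np.arange(h_patches), np.arange(w_patches),
                         indexing="ij")
    return emb_x[xs.ravel()] + emb_y[ys.ravel()]  # (h_patches * w_patches, D)

rng = np.random.default_rng(0)
max_len, D = 32, 16
emb_x = rng.standard_normal((max_len, D))
emb_y = rng.standard_normal((max_len, D))
# Two images with different aspect ratios share the same tables.
pe_a = factorized_pos_emb(6, 10, emb_x, emb_y)   # 6x10 patch grid
pe_b = factorized_pos_emb(12, 4, emb_x, emb_y)   # 12x4 patch grid
```

<p>The fractional scheme would instead map $p / \text{side-length}$ into $[0, 1]$ before the lookup, decoupling the tables from absolute image size.</p>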
<h3 id="training-innovations">Training Innovations</h3>
<p>Packing enables two techniques that were previously impractical:</p>
<ul>
<li>
<p><strong>Continuous token dropping</strong>: Instead of dropping the same proportion of tokens from every image, the drop rate varies per image. Some images keep all tokens while others have aggressive dropping, reducing the train/inference discrepancy. The drop rate can follow a schedule that decreases over training.</p>
</li>
<li>
<p><strong>Resolution sampling</strong>: Each image&rsquo;s resolution is sampled from a distribution (e.g., $R \sim \mathcal{U}(64, R_{\text{max}})$) while preserving aspect ratio. This mixes the throughput benefits of small images with the detail of large ones.</p>
</li>
</ul>
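<p>Both techniques can be sketched in a few lines. This is an assumption-laden illustration, not the paper&rsquo;s code: the Beta(2, 5) drop-rate parameters are hypothetical, and resolution sampling here draws an area-equivalent side length uniformly while preserving aspect ratio.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_resolution(h, w, r_min=64, r_max=512):
    """Resample an image's size: draw an area-equivalent side length
    R ~ U(r_min, r_max) and rescale both sides, preserving aspect ratio."""
    R = rng.uniform(r_min, r_max)
    scale = R / np.sqrt(h * w)
    return max(1, round(h * scale)), max(1, round(w * scale))

def drop_tokens(n_tokens, a=2.0, b=5.0):
    """Continuous token dropping: a per-image drop rate from Beta(a, b),
    so some images keep nearly all tokens while others drop aggressively."""
    rate = rng.beta(a, b)
    keep = max(1, int(round(n_tokens * (1.0 - rate))))
    return rng.choice(n_tokens, size=keep, replace=False)
```
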
<h3 id="computational-overhead">Computational Overhead</h3>
<p>A natural concern is the $\mathcal{O}(n^{2})$ attention cost of longer packed sequences. In practice, as the transformer hidden dimension scales, attention becomes an increasingly small fraction of total compute (the MLP dominates). With a simple greedy bin-packing algorithm, padding typically accounts for less than 2% of tokens.</p>
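<p>One plausible greedy scheme is first-fit-decreasing; the paper does not spell out its exact packing variant, so treat this as a sketch of the idea rather than the reference algorithm. It packs per-image patch counts into fixed-length sequences and reports the resulting padding fraction.</p>

```python
def greedy_pack(lengths, seq_len):
    """First-fit-decreasing packing of per-image patch counts into
    fixed-length sequences; returns the packs and the padding fraction."""
    packs = []
    for n in sorted(lengths, reverse=True):
        for pack in packs:                 # first pack with enough room
            if sum(pack) + n <= seq_len:
                pack.append(n)
                break
        else:                              # no pack fits: open a new one
            packs.append([n])
    total = len(packs) * seq_len
    return packs, (total - sum(lengths)) / total
```
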
<h2 id="pretraining-and-downstream-evaluation">Pretraining and Downstream Evaluation</h2>
<p>NaViT is evaluated in two pretraining setups:</p>
<ul>
<li><strong>Classification pretraining</strong> on JFT-4B with sigmoid cross-entropy loss, evaluated via linear probing (10 examples per class)</li>
<li><strong>Contrastive pretraining</strong> on WebLI using image-text contrastive loss, evaluated on zero-shot ImageNet classification and COCO retrieval</li>
</ul>
<h3 id="training-efficiency">Training Efficiency</h3>
<p>At fixed compute budget, NaViT consistently outperforms ViT across model scales. The top-performing ViT can be matched by NaViT with 4x less compute. The primary driver is throughput: packing with variable resolution and token dropping enables NaViT-L/16 to process approximately 5x more images during training.</p>
<h3 id="variable-resolution-results">Variable Resolution Results</h3>
<p>Models trained with variable resolution ($R \sim \mathcal{U}(64, R_{\text{max}})$) outperform fixed-resolution models even when evaluated at the fixed models&rsquo; own training resolution. Sampling side lengths from a truncated normal biased toward lower values gives the best cost-performance trade-off.</p>
<p>For fine-tuning on ImageNet-1k, a single NaViT fine-tuned with variable resolutions (64 to 512) matches the performance of models fine-tuned at each specific resolution individually.</p>
<h3 id="positional-embedding-comparison">Positional Embedding Comparison</h3>
<p>Factorized embeddings outperform both standard ViT 1D embeddings (with interpolation) and Pix2Struct&rsquo;s learned 2D embeddings. The factorized approach generalizes to resolutions outside the training range, while 2D embeddings fail because they require seeing all $(x, y)$ coordinate pairs during training. Additive combination of $\phi_{x}$ and $\phi_{y}$ works best.</p>
<h3 id="token-dropping-strategies">Token Dropping Strategies</h3>
<p>Variable token dropping with Beta-distributed rates consistently outperforms constant rates. Resolution-dependent dropping (higher rates for higher-resolution images) further improves performance. Scheduling the drop rate to decrease over training provides additional gains.</p>
<h3 id="downstream-tasks">Downstream Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Setup</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Semantic segmentation</td>
          <td>ADE20k, L/16, linear decoder</td>
          <td>NaViT at $R_{384}$ beats ViT at $R_{512}$ while being 2x faster</td>
      </tr>
      <tr>
          <td>Object detection</td>
          <td>OWL-ViT-L/14 backbone</td>
          <td>NaViT: 28.3% LVIS AP vs. ViT: 23.3%</td>
      </tr>
      <tr>
          <td>Video classification</td>
          <td>Kinetics-400, tubelet extraction</td>
          <td>NaViT-L matches ViViT-L (80.4%) in ~6x fewer epochs</td>
      </tr>
      <tr>
          <td>Fairness annotation</td>
          <td>FairFace, CelebA linear probes</td>
          <td>Statistically significant accuracy improvements ($p = 3 \times 10^{-4}$)</td>
      </tr>
  </tbody>
</table>
<h3 id="out-of-distribution-robustness">Out-of-Distribution Robustness</h3>
<p>NaViT shows strong gains on ImageNet-A (which contains many extreme aspect ratios) when evaluated without center cropping. Performance on ObjectNet is also competitive. The model maintains stable calibration (ECE between 0.045 and 0.047) across a wide range of token counts per image (128 to 1024).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>NaViT demonstrates that sequence packing, when applied to Vision Transformers, yields substantial improvements in training efficiency, inference flexibility, and downstream performance. The approach processes images at their native resolution without the information loss from resizing or the waste from padding.</p>
<p>Key takeaways:</p>
<ul>
<li>4x compute reduction to match top ViT performance</li>
<li>A single model works across a continuous range of resolutions at inference time</li>
<li>Variable-resolution training and token dropping provide complementary efficiency gains</li>
<li>Factorized positional embeddings generalize to unseen resolutions</li>
<li>Benefits transfer to detection, segmentation, video, and fairness tasks</li>
</ul>
<p>Limitations: The paper does not release model weights or code. All experiments use Google-internal datasets (JFT-4B, WebLI) and infrastructure (TPUs, JAX/Scenic), making direct reproduction difficult. The attention masking approach for packing assumes that cross-image attention is undesirable, which may not hold for all tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification pretraining</td>
          <td>JFT-4B</td>
          <td>~4B labeled images</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Contrastive pretraining</td>
          <td>WebLI</td>
          <td>Large-scale web data</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Classification fine-tuning</td>
          <td>ImageNet-1k</td>
          <td>1.28M images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Segmentation</td>
          <td>ADE20k</td>
          <td>20K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Detection</td>
          <td>LVIS</td>
          <td>164K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Video</td>
          <td>Kinetics-400</td>
          <td>~240K videos</td>
          <td>Publicly available (partial)</td>
      </tr>
      <tr>
          <td>Fairness</td>
          <td>FairFace, CelebA</td>
          <td>108K / 200K images</td>
          <td>Publicly available</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Greedy bin-packing for sequence construction (less than 2% padding tokens)</li>
<li>Resolution sampling: side length from truncated normal $\mathcal{N}_{t}(-0.5, 1)$ mapped to $[64, R_{\text{max}}]$</li>
<li>Token dropping: Beta-distributed per-image rates, optionally resolution-dependent</li>
<li>Factorized positional embeddings with additive combination</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>NaViT variants: B/16, L/16, L/14</li>
<li>Based on vanilla ViT with query-key normalization, no biases, attention pooling</li>
<li>Implemented in JAX/FLAX within the Scenic framework</li>
<li>No public model checkpoints available</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NaViT</th>
          <th>ViT Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JFT linear probe (L/16)</td>
          <td>Matches top ViT</td>
          <td>4x more compute</td>
          <td>Compute-matched comparison</td>
      </tr>
      <tr>
          <td>ImageNet zero-shot (L/14)</td>
          <td>72.9%</td>
          <td>68.3%</td>
          <td>Contrastive pretraining</td>
      </tr>
      <tr>
          <td>LVIS AP (L/14)</td>
          <td>28.3%</td>
          <td>23.3%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>LVIS AP rare (L/14)</td>
          <td>24.3%</td>
          <td>17.2%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>ADE20k mIoU (L/16, 384)</td>
          <td>Beats ViT@512</td>
          <td>At 2x cost</td>
          <td>Segmenter linear decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training on Cloud TPUs (specific configuration not detailed)</li>
<li>Inference latency measured on Cloud TPUv3</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I., Oliver, A., Padlewski, P., Gritsenko, A., Lučić, M., &amp; Houlsby, N. (2023). Patch n&rsquo; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. <em>Advances in Neural Information Processing Systems 36 (NeurIPS 2023)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{dehghani2023patch,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Patch n&#39; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lučić, Mario and Houlsby, Neil}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2307.06304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher-2: End-to-End Markush Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</guid><description>MarkushGrapher-2 fuses vision, text, and layout encoders with a dedicated OCR module for end-to-end Markush structure recognition from patent images.</description><content:encoded><![CDATA[<h2 id="a-multimodal-method-for-markush-structure-recognition">A Multimodal Method for Markush Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper that introduces MarkushGrapher-2, a universal encoder-decoder model for recognizing both standard molecular structures and multimodal Markush structures from chemical images. The primary contribution is a dual-encoder architecture that fuses a pretrained OCSR (Optical Chemical Structure Recognition) vision encoder with a Vision-Text-Layout (VTL) encoder, connected through a dedicated ChemicalOCR module for end-to-end processing. The paper also introduces two new resources: a large-scale training dataset (USPTO-MOL-M) of real-world Markush structures extracted from USPTO patent MOL files, and IP5-M, a manually annotated benchmark of 1,000 Markush structures from five major patent offices.</p>
<h2 id="why-markush-structure-recognition-remains-challenging">Why Markush Structure Recognition Remains Challenging</h2>
<p><a href="https://en.wikipedia.org/wiki/Markush_structure">Markush structures</a> are compact representations used in patent documents to describe families of related molecules. They combine a visual backbone (atoms, bonds, variable regions) with textual definitions of substituents that can replace those variable regions. This multimodal nature makes them harder to parse than standard molecular diagrams.</p>
<p>Three factors limit automatic Markush recognition. First, visual styles vary across patent offices and publication years. Second, textual definitions lack standardization and often contain conditional or recursive descriptions. Third, real-world training data with comprehensive annotations is scarce. As a result, Markush structures are currently indexed only in two proprietary, manually curated databases: MARPAT and DWPIM.</p>
<p>Prior work, including the original <a href="/notes/chemistry/optical-structure-recognition/markush/markushgrapher/">MarkushGrapher</a>, required pre-annotated OCR outputs at inference time, limiting practical deployment. General-purpose models like GPT-5 and DeepSeek-OCR produce mostly chemically invalid outputs on Markush images, suggesting these lie outside their training distribution.</p>
<h2 id="dual-encoder-architecture-with-dedicated-chemicalocr">Dual-Encoder Architecture with Dedicated ChemicalOCR</h2>
<p>MarkushGrapher-2 uses two complementary encoding pipelines:</p>
<ol>
<li>
<p><strong>Vision encoder pipeline</strong>: The input image passes through a Swin-B Vision Transformer (taken from <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a>) pretrained for OCSR. This encoder extracts visual features representing molecular structures and remains frozen during training.</p>
</li>
<li>
<p><strong>Vision-Text-Layout (VTL) pipeline</strong>: The same image goes through ChemicalOCR, a compact 256M-parameter vision-language model fine-tuned from SmolDocling for OCR on chemical images. ChemicalOCR extracts character-level text and bounding boxes. These, combined with image patches, feed into a T5-base VTL encoder following the UDOP fusion paradigm, where visual and textual tokens are spatially aligned by bounding box overlap.</p>
</li>
</ol>
<p>The VTL encoder output is concatenated with projected embeddings from the vision encoder. This joint representation feeds a text decoder that auto-regressively generates a CXSMILES (ChemAxon Extended <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) string describing the backbone structure and a substituent table listing variable group definitions.</p>
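<p>The UDOP-style spatial alignment can be made concrete with a small sketch. This is an assumption-laden illustration (function names and the max-overlap rule are mine, not from the paper): each OCR text token, carrying a bounding box, is matched to the image patch whose box overlaps it most, so the two token streams can be fused.</p>

```python
# Hypothetical sketch of bounding-box-based token alignment: each OCR text
# token is assigned to the image patch with the largest box overlap.
# Boxes are (x0, y0, x1, y1) in pixel coordinates.

def overlap_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def align_tokens(text_boxes, patch_boxes):
    """For each text token, return the index of the best-overlapping patch,
    or None if it overlaps no patch at all."""
    assignment = []
    for tb in text_boxes:
        areas = [overlap_area(tb, pb) for pb in patch_boxes]
        best = max(range(len(patch_boxes)), key=lambda i: areas[i])
        assignment.append(best if areas[best] > 0 else None)
    return assignment

# a 2x2 grid of 16x16 image patches
patches = [(0, 0, 16, 16), (16, 0, 32, 16), (0, 16, 16, 32), (16, 16, 32, 32)]
print(align_tokens([(2, 2, 10, 10), (20, 20, 30, 30), (14, 2, 20, 10)], patches))
# [0, 3, 1] -- the third box straddles two patches and goes to the larger overlap
```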
<h3 id="two-stage-training-strategy">Two-Stage Training Strategy</h3>
<p>Training proceeds in two phases:</p>
<ul>
<li>
<p><strong>Phase 1 (Adaptation)</strong>: The vision encoder is frozen. The MLP projector and text decoder train on 243K real-world image-SMILES pairs from MolScribe&rsquo;s USPTO dataset (3 epochs). This aligns the decoder to the pretrained OCSR feature space.</p>
</li>
<li>
<p><strong>Phase 2 (Fusion)</strong>: The vision encoder, projector, and ChemicalOCR are all frozen. The VTL encoder and text decoder train on a mix of 235K synthetic and 145K real-world Markush samples (2 epochs). The VTL encoder learns the features needed for CXSMILES and substituent table prediction without disrupting the established OCSR representations.</p>
</li>
</ul>
<p>The total model has 831M parameters, of which 744M are trainable.</p>
<h2 id="datasets-and-evaluation-benchmarks">Datasets and Evaluation Benchmarks</h2>
<h3 id="training-data">Training Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical structures</td>
          <td>235K</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> SMILES augmented to CXSMILES, rendered with annotations</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>Manual OCR annotations</td>
          <td>7K</td>
          <td>IP5 patent document crops</td>
      </tr>
      <tr>
          <td>Phase 1 (OCSR)</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>Synthetic CXSMILES</td>
          <td>235K</td>
          <td>Same as OCR pretraining set</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>MolParser dataset</td>
          <td>91K</td>
          <td>Real-world Markush, converted to CXSMILES</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>USPTO-MOL-M</td>
          <td>54K</td>
          <td>Real-world, auto-extracted from USPTO MOL files (2010-2025)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-benchmarks">Evaluation Benchmarks</h3>
<p><strong>Markush benchmarks</strong>: M2S (103 samples), USPTO-M (74), WildMol-M (10K, semi-manual), and the new IP5-M (1,000 manually annotated from USPTO, JPO, KIPO, CNIPA, and EPO patents, 1980-2025).</p>
<p><strong>OCSR benchmarks</strong>: USPTO (5,719), JPO (450), UOB (5,740), WildMol (10K).</p>
<p>The primary metric is <strong>CXSMILES Accuracy (A)</strong>: a prediction is correct when (1) the predicted SMILES matches the ground truth by <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChIKey</a> equivalence, and (2) all Markush features (variable groups, positional and frequency variation indicators) are correctly represented. Stereochemistry is ignored during evaluation.</p>
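<p>The two-condition metric can be expressed as a short sketch. In practice the InChIKeys would be computed from the predicted and ground-truth SMILES with a cheminformatics toolkit such as RDKit; here they are treated as precomputed strings, and the feature representation is a hypothetical simplification.</p>

```python
# Illustrative computation of CXSMILES Accuracy: a prediction is correct
# only if (1) the backbone InChIKeys match and (2) every Markush feature
# (variable groups, positional/frequency variation) matches. InChIKeys are
# assumed precomputed; real pipelines derive them from SMILES via RDKit.

def cxsmiles_correct(pred, truth):
    return (pred["inchikey"] == truth["inchikey"]
            and pred["markush_features"] == truth["markush_features"])

def cxsmiles_accuracy(predictions, ground_truths):
    correct = sum(cxsmiles_correct(p, t)
                  for p, t in zip(predictions, ground_truths))
    return correct / len(ground_truths)

preds = [
    {"inchikey": "AAA", "markush_features": {"R1": {"alkyl", "aryl"}}},
    {"inchikey": "BBB", "markush_features": {"R1": {"alkyl"}}},
]
truths = [
    {"inchikey": "AAA", "markush_features": {"R1": {"alkyl", "aryl"}}},
    {"inchikey": "BBB", "markush_features": {"R1": {"alkyl", "halo"}}},
]
print(cxsmiles_accuracy(preds, truths))  # 0.5: second backbone matches, but a feature is missing
```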
<h3 id="results-markush-structure-recognition">Results: Markush Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S</th>
          <th>USPTO-M</th>
          <th>WildMol-M</th>
          <th>IP5-M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>39</td>
          <td>30</td>
          <td>38.1</td>
          <td>47.7</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>21</td>
          <td>7</td>
          <td>28.1</td>
          <td>22.3</td>
      </tr>
      <tr>
          <td>GPT-5</td>
          <td>3</td>
          <td>0</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>0</td>
          <td>0</td>
          <td>1.9</td>
          <td>0.0</td>
      </tr>
      <tr>
          <td>MarkushGrapher-1</td>
          <td>38</td>
          <td>10</td>
          <td>32</td>
          <td>-</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td><strong>56</strong></td>
          <td>13</td>
          <td><strong>55</strong></td>
          <td><strong>48.0</strong></td>
      </tr>
  </tbody>
</table>
<p>On M2S, MarkushGrapher-2 achieves 56% CXSMILES accuracy vs. 38% for MarkushGrapher-1, a relative improvement of 47%. On WildMol-M (the largest benchmark at 10K samples), MarkushGrapher-2 reaches 55% vs. 38.1% for MolParser-Base and 32% for MarkushGrapher-1. GPT-5 and DeepSeek-OCR generate mostly chemically invalid outputs on Markush images: only 30% and 15% of their predictions are valid CXSMILES on M2S, respectively.</p>
<h3 id="results-standard-molecular-structure-recognition">Results: Standard Molecular Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WildMol</th>
          <th>JPO</th>
          <th>UOB</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>76.9</td>
          <td>78.9</td>
          <td>91.8</td>
          <td>93.0</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>66.4</td>
          <td>76.2</td>
          <td>87.4</td>
          <td>93.1</td>
      </tr>
      <tr>
          <td>DECIMER 2.7</td>
          <td>56.0</td>
          <td>64.0</td>
          <td>88.3</td>
          <td>59.9</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/">MolGrapher</a></td>
          <td>45.5</td>
          <td>67.5</td>
          <td>94.9</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>25.8</td>
          <td>31.6</td>
          <td>78.7</td>
          <td>36.9</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td>68.4</td>
          <td>71.0</td>
          <td><strong>96.6</strong></td>
          <td>89.8</td>
      </tr>
  </tbody>
</table>
<p>MarkushGrapher-2 achieves the highest score on UOB (96.6%) and remains competitive on other OCSR benchmarks, despite being primarily optimized for Markush recognition.</p>
<h3 id="chemicalocr-vs-general-ocr">ChemicalOCR vs. General OCR</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S F1</th>
          <th>USPTO-M F1</th>
          <th>IP5-M F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PaddleOCR v5</td>
          <td>7.7</td>
          <td>1.2</td>
          <td>1.9</td>
      </tr>
      <tr>
          <td>EasyOCR</td>
          <td>10.2</td>
          <td>18.0</td>
          <td>18.4</td>
      </tr>
      <tr>
          <td><strong>ChemicalOCR</strong></td>
          <td><strong>87.2</strong></td>
          <td><strong>93.0</strong></td>
          <td><strong>86.5</strong></td>
      </tr>
  </tbody>
</table>
<p>General-purpose OCR tools fail on chemical images because they misinterpret bonds as characters and cannot parse chemical abbreviations. ChemicalOCR outperforms both by a large margin.</p>
<h2 id="ablation-results-and-key-findings">Ablation Results and Key Findings</h2>
<p><strong>OCR input is critical for Markush features.</strong> Without OCR, CXSMILES accuracy drops from 56% to 4% on M2S, and from 53.7% to 15.4% on IP5-M. The backbone structure accuracy ($A_{\text{InChIKey}}$) also drops substantially (from 80% to 39% on M2S), though the vision encoder alone can still recover some structural information. This confirms that textual cues (brackets, indices, variable definitions) are essential for Markush feature prediction.</p>
<p><strong>Two-phase training improves both tasks.</strong> Compared to single-phase (fusion only) training, the two-phase strategy improves CXSMILES accuracy from 44% to 50% on M2S and from 53.0% to 61.5% on JPO after the same number of epochs. Adapting the decoder to OCSR features before introducing the VTL encoder prevents the fusion process from degrading learned visual representations.</p>
<p><strong>Frequency variation indicators remain the hardest feature.</strong> On IP5-M, the per-feature breakdown shows 73.3% accuracy for backbone InChI, 74.8% for variable groups, 78.8% for positional variation, but only 30.7% for frequency variation (Sg groups). These repeating structural units are particularly challenging to represent and predict.</p>
<p><strong>Limitations</strong>: The model relies on accurate OCR as a prerequisite. Performance on USPTO-M (13% CXSMILES accuracy) lags behind other benchmarks, likely due to the older patent styles in that dataset. The paper does not report inference latency.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical images</td>
          <td>235K</td>
          <td>Generated from PubChem SMILES, augmented to CXSMILES</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>IP5 patent crops</td>
          <td>7K</td>
          <td>Manually annotated</td>
      </tr>
      <tr>
          <td>Phase 1 training</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Public, real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 training</td>
          <td>Synthetic + MolParser + USPTO-MOL-M</td>
          <td>380K</td>
          <td>Mix of synthetic (235K), MolParser (91K), USPTO-MOL-M (54K)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>M2S, USPTO-M, WildMol-M, IP5-M</td>
          <td>103 to 10K</td>
          <td>Markush benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>WildMol, JPO, UOB, USPTO</td>
          <td>450 to 10K</td>
          <td>OCSR benchmarks</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vision encoder</td>
          <td>Swin-B ViT (from MolScribe)</td>
          <td>~87M</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td>VTL encoder + decoder</td>
          <td>T5-base</td>
          <td>~744M trainable</td>
          <td>Trained</td>
      </tr>
      <tr>
          <td>ChemicalOCR</td>
          <td>SmolDocling-based VLM</td>
          <td>256M</td>
          <td>Fine-tuned, frozen in Phase 2</td>
      </tr>
      <tr>
          <td>MLP projector</td>
          <td>Linear projection</td>
          <td>-</td>
          <td>Trained in Phase 1, frozen in Phase 2</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>831M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CXSMILES Accuracy (A)</td>
          <td>Percentage of samples where InChIKey matches AND all Markush features correct</td>
      </tr>
      <tr>
          <td>$A_{\text{InChIKey}}$</td>
          <td>Backbone structure accuracy only (ignoring Markush features)</td>
      </tr>
      <tr>
          <td>Table Accuracy</td>
          <td>Percentage of correctly predicted substituent tables</td>
      </tr>
      <tr>
          <td>Markush Accuracy</td>
          <td>Joint CXSMILES + Table accuracy</td>
      </tr>
      <tr>
          <td>OCR F1</td>
          <td>Bounding-box-level precision/recall at IoU &gt; 0.5</td>
      </tr>
  </tbody>
</table>
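<p>The bounding-box matching behind the OCR F1 row can be sketched as follows. This is an illustrative greedy one-to-one matcher under the stated IoU &gt; 0.5 criterion; the paper's exact matching procedure is not spelled out, so treat the details as assumptions.</p>

```python
# Sketch of IoU-thresholded box matching for OCR F1: a predicted box is a
# true positive if it overlaps some still-unmatched ground-truth box with
# IoU > 0.5. Greedy matching; boxes are (x0, y0, x1, y1).

def iou(a, b):
    w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ocr_f1(pred_boxes, gt_boxes, thresh=0.5):
    matched = set()
    tp = 0
    for p in pred_boxes:
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) > thresh:
                matched.add(j)
                tp += 1
                break
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(ocr_f1([(0, 0, 10, 10), (50, 50, 60, 60)],
             [(1, 1, 11, 11), (100, 100, 110, 110)]))  # 0.5
```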
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: NVIDIA A100 GPU</li>
<li>Phase 1: 3 epochs, Adam optimizer, lr 5e-4, 1000 warmup steps, batch size 10, weight decay 1e-3</li>
<li>Phase 2: 2 epochs, batch size 8</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation of MarkushGrapher-2 with models and datasets</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Code, models, and datasets are all publicly released under an MIT license with documented training hyperparameters and a single A100 GPU requirement.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Strohmeyer, T., Morin, L., Meijer, G. I., Weber, V., Nassar, A., &amp; Staar, P. (2026). MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>.</p>
<p><strong>Publication</strong>: CVPR 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository (MIT License)</a></li>
<li><a href="https://arxiv.org/abs/2603.28550">arXiv Preprint</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{strohmeyer2026markushgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Val\&#39;{e}ry and Nassar, Ahmed and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2603.28550}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>REINVENT: Reinforcement Learning for Mol. Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/</guid><description>REINVENT uses augmented episodic likelihood to fine-tune a SMILES-based RNN via reinforcement learning for goal-directed molecular generation.</description><content:encoded><![CDATA[<h2 id="augmented-episodic-likelihood-for-goal-directed-generation">Augmented Episodic Likelihood for Goal-Directed Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces REINVENT, a policy-based reinforcement learning framework for molecular de novo design. The primary contribution is a novel cost function, the <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented episodic likelihood</a>, that fine-tunes a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based recurrent neural network (RNN) pre-trained on ChEMBL toward generating molecules satisfying user-defined property objectives. The method anchors the agent to the prior distribution of valid drug-like molecules, addressing failure modes of standard REINFORCE algorithms (reward exploitation and <a href="/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/">mode collapse</a> to trivially simple structures).</p>
<h2 id="de-novo-design-needs-flexible-data-driven-approaches">De Novo Design Needs Flexible, Data-Driven Approaches</h2>
<p>Traditional de novo design methods fall into three categories, each with limitations:</p>
<ol>
<li><strong>Structure-based approaches</strong> grow ligands to fit binding pockets but often produce molecules with poor DMPK profiles and synthetic intractability.</li>
<li><strong>Ligand-based virtual library</strong> approaches generate large libraries and score them, but are constrained by pre-defined reaction or transformation rules that limit chemical diversity.</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/">Inverse QSAR</a></strong> methods attempt to map favorable activity regions back to molecular structures, but require descriptors suitable for both forward prediction and inverse mapping.</li>
</ol>
<p>RNN-based generative models trained on SMILES offer a data-driven alternative that can learn the underlying distribution of drug-like chemical space without rigid rules. Segler et al. (2017) showed that fine-tuning a pre-trained RNN on focused actives yields high fractions of predicted actives. However, this maximum likelihood fine-tuning cannot use negative or continuous scores and risks catastrophic forgetting.</p>
<p>Prior RL approaches had significant issues. Jaques et al. (2016) used Deep Q-learning with prior likelihood regularization for sequence generation, but reported dependence on hand-written rules to penalize undesirable sequences and still observed reward exploitation producing unrealistically simple molecules. Standard REINFORCE algorithms tend to converge on trivial solutions (e.g., generating only &ldquo;C&rdquo; to satisfy a scoring function).</p>
<h2 id="the-augmented-episodic-likelihood-framework">The Augmented Episodic Likelihood Framework</h2>
<p>The core innovation is a formulation where the agent learns a policy that minimizes the squared difference between its own log-likelihood and an augmented target likelihood.</p>
<p>The RNN is first pre-trained on 1.5 million canonical SMILES from ChEMBL via maximum likelihood estimation:</p>
<p>$$
J(\Theta) = -\sum_{t=1}^{T} \log P(x^{t} \mid x^{t-1}, \dots, x^{1})
$$</p>
<p>The pre-trained model (the Prior) is then used as the starting point for the Agent. For a generated SMILES sequence $A = a_1, a_2, \dots, a_T$, the model likelihood is $P(A) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$, and a scoring function $S(A) \in [-1, 1]$ rates desirability.</p>
<p>The augmented likelihood combines prior likelihood with the score:</p>
<p>$$
\log P(A)_{\mathbb{U}} = \log P(A)_{Prior} + \sigma S(A)
$$</p>
<p>where $\sigma$ is a scalar coefficient controlling the trade-off between prior fidelity and score optimization.</p>
<p>The return is defined as the negative squared difference between the augmented likelihood and the agent&rsquo;s likelihood:</p>
<p>$$
G(A) = -\left[\log P(A)_{\mathbb{U}} - \log P(A)_{\mathbb{A}}\right]^{2}
$$</p>
<p>The agent minimizes $J(\Theta) = -G$, effectively learning a policy whose sequence likelihoods match the prior modulated by the scoring function. The authors show in supplementary material that this is equivalent to a REINFORCE algorithm with a specific final-step reward formulation.</p>
<p>This design has three key advantages over standard REINFORCE:</p>
<ul>
<li>The target policy is explicitly stochastic, preserving diversity in generated molecules</li>
<li>The prior anchoring prevents catastrophic forgetting of SMILES syntax and chemical space coverage</li>
<li>No hand-written rules are needed to penalize degenerate solutions</li>
</ul>
<p>The Agent is trained on-policy with batches of 128 generated sequences, using SGD with learning rate 0.0005 and gradient clipping to $[-3, 3]$.</p>
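<p>Numerically, the training objective combines the three equations above into a single squared error. A minimal sketch (plain floats standing in for the RNN log-likelihoods, which in the actual method come from the Prior and Agent networks and are optimized by SGD):</p>

```python
# Minimal numeric sketch of the augmented episodic likelihood loss:
#   J(Theta) = [log P(A)_Prior + sigma * S(A) - log P(A)_Agent]^2
# i.e. the negative of the return G(A). Log-likelihoods are plain floats
# here; in REINVENT they are computed by the Prior and Agent RNNs.

def augmented_episodic_loss(log_p_prior, log_p_agent, score, sigma):
    augmented = log_p_prior + sigma * score   # log P(A)_U
    return (augmented - log_p_agent) ** 2     # J = -G(A)

# A well-scored sequence (score = 1) pulls the agent's likelihood ABOVE
# the prior's; a poorly scored one (score = -1) pushes it below.
print(augmented_episodic_loss(-20.0, -20.0, 1.0, 15))  # 225.0: agent should raise its likelihood
print(augmented_episodic_loss(-20.0, -5.0, 1.0, 15))   # 0.0: agent already matches the target
```

<p>The prior term acts as the anchor: if the agent drifts far from the prior distribution, the squared error grows regardless of the score, which is what suppresses degenerate solutions.</p>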
<h2 id="three-experiments-sulphur-avoidance-celecoxib-analogues-and-drd2-activity">Three Experiments: Sulphur Avoidance, Celecoxib Analogues, and DRD2 Activity</h2>
<h3 id="prior-network-architecture">Prior Network Architecture</h3>
<p>The Prior is a 3-layer RNN with 1024 Gated Recurrent Units per layer, trained on RDKit canonical SMILES from ChEMBL (molecules with 10-50 heavy atoms, elements from $\{H, B, C, N, O, F, Si, P, S, Cl, Br, I\}$). Training used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) for 50,000 steps with batch size 128 and learning rate decay of 0.02 every 100 steps. The Prior generates 94% valid SMILES, of which 90% are novel.</p>
<h3 id="experiment-1-learning-to-avoid-sulphur">Experiment 1: Learning to Avoid Sulphur</h3>
<p>A proof-of-principle task where the scoring function assigns $S(A) = 1$ for valid sulphur-free molecules, $S(A) = 0$ for invalid SMILES, and $S(A) = -1$ for sulphur-containing molecules.</p>
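<p>The ternary scoring function is straightforward to sketch. In the naive version below, the parsing step is abstracted away: <code>elements</code> is the set of element symbols from a parsed molecule, or <code>None</code> when the SMILES failed to parse (as RDKit's <code>Chem.MolFromSmiles</code> signals by returning <code>None</code>); this framing is my simplification, not the paper's code.</p>

```python
# Naive sketch of the sulphur-avoidance scoring function:
#   +1 for a valid sulphur-free molecule, 0 for an invalid SMILES,
#   -1 when sulphur is present. Parsing/element extraction is assumed
# done upstream (e.g. with RDKit); `elements` is that parsed result.

def sulphur_score(elements):
    """elements: set of element symbols, or None if parsing failed."""
    if elements is None:
        return 0              # invalid SMILES
    return -1 if "S" in elements else 1

print(sulphur_score({"C", "N", "O"}))  # 1
print(sulphur_score({"C", "S"}))       # -1
print(sulphur_score(None))             # 0
```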
<p>The Agent method was compared against three alternatives:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Fraction No S</th>
          <th>Avg MW</th>
          <th>Avg cLogP</th>
          <th>Avg RotBonds</th>
          <th>Avg AromRings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior</td>
          <td>0.94</td>
          <td>0.66</td>
          <td>371</td>
          <td>3.36</td>
          <td>5.39</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Agent</td>
          <td>0.95</td>
          <td>0.98</td>
          <td>367</td>
          <td>3.37</td>
          <td>5.41</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Action basis</td>
          <td>0.95</td>
          <td>0.92</td>
          <td>372</td>
          <td>3.39</td>
          <td>6.08</td>
          <td>2.09</td>
      </tr>
      <tr>
          <td>REINFORCE</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>585</td>
          <td>11.3</td>
          <td>30.0</td>
          <td>0.57</td>
      </tr>
      <tr>
          <td>REINFORCE + Prior</td>
          <td>0.98</td>
          <td>0.92</td>
          <td>232</td>
          <td>3.05</td>
          <td>2.8</td>
          <td>2.11</td>
      </tr>
  </tbody>
</table>
<p>Standard REINFORCE exploited the reward by generating sequences of predominantly &ldquo;C&rdquo; (average MW 585, cLogP 11.3). REINFORCE + Prior avoided this but collapsed to small, simplistic structures (MW 232). The Agent achieved 98% sulphur-free structures while maintaining molecular properties nearly identical to the Prior, demonstrating that augmented episodic likelihood preserves the prior distribution.</p>
<h3 id="experiment-2-similarity-guided-generation-celecoxib-analogues">Experiment 2: Similarity-Guided Generation (Celecoxib Analogues)</h3>
<p>The scoring function uses <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> on FCFP4 fingerprints:</p>
<p>$$
S(A) = -1 + 2 \times \frac{\min\{J_{i,j}, k\}}{k}
$$</p>
<p>where $k$ caps the rewarded similarity. With $k = 1$ and $\sigma = 15$, the Agent recovers <a href="https://en.wikipedia.org/wiki/Celecoxib">Celecoxib</a> itself within 200 training steps. Even when all structures with $J &gt; 0.5$ to Celecoxib (1,804 molecules) were removed from the Prior training set, the Agent still found Celecoxib after 400 steps, despite a 700-fold reduction in prior likelihood ($\log_e P$ from $-12.7$ to $-19.2$).</p>
<p>With moderate similarity targets ($k = 0.7$, $\sigma = 12$), the Agent generates diverse analogues including scaffold hops where functional groups are rearranged.</p>
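<p>The capped score from the equation above can be illustrated directly. The paper computes Jaccard (Tanimoto) similarity on FCFP4 fingerprints (e.g. via RDKit); the sketch below substitutes plain feature sets purely for illustration.</p>

```python
# Sketch of the capped similarity score S(A) = -1 + 2 * min(J, k) / k.
# Real usage computes J as Tanimoto similarity on FCFP4 fingerprints
# (e.g. with RDKit); here J is Jaccard similarity on plain Python sets.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def capped_similarity_score(features, target_features, k):
    j = jaccard(features, target_features)
    return -1 + 2 * min(j, k) / k

target = {1, 2, 3, 4}
print(capped_similarity_score({1, 2, 3, 4}, target, k=0.7))  # 1.0: at or above the cap
print(capped_similarity_score(set(range(100)), target, k=0.7))  # negative: very dissimilar
```

<p>Capping at $k &lt; 1$ means any molecule at least $k$-similar to the target receives the maximum score, so the Agent is rewarded for analogues rather than for reproducing the target exactly.</p>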
<h3 id="experiment-3-target-activity-drd2">Experiment 3: Target Activity (DRD2)</h3>
<p>The most drug-discovery-relevant task: generating molecules predicted active against the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2 (DRD2)</a>. An SVM classifier (Gaussian kernel, $C = 2^7$, $\gamma = 2^{-6}$) was trained on bioactivity data from ExCAPE-DB (7,218 actives with pIC50 &gt; 5, 100,000 sampled inactives). The actives were split by Butina clustering (ECFP6, cutoff 0.4) to decrease nearest-neighbor similarity between train and test sets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Prior</th>
          <th>Agent</th>
          <th>Prior (reduced)</th>
          <th>Agent (reduced)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid SMILES</td>
          <td>0.94</td>
          <td>0.99</td>
          <td>0.94</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Fraction predicted actives</td>
          <td>0.03</td>
          <td>0.97</td>
          <td>0.02</td>
          <td>0.96</td>
      </tr>
      <tr>
          <td>Fraction similar to train active</td>
          <td>0.02</td>
          <td>0.79</td>
          <td>0.02</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>Fraction similar to test active</td>
          <td>0.01</td>
          <td>0.46</td>
          <td>0.01</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>Test actives recovered (x10^-3)</td>
          <td>13.5</td>
          <td>126</td>
          <td>2.85</td>
          <td>72.6</td>
      </tr>
  </tbody>
</table>
<p>The Agent increased the fraction of predicted actives from 2-3% (Prior) to 96-97%, representing a 250-fold enrichment in the probability of generating a test set active. The Agent based on the reduced Prior (DRD2 actives removed from ChEMBL) still recovered 7% of test actives, meaning it generated experimentally confirmed actives that appeared in neither the generative model nor the activity prediction model training data.</p>
<h2 id="anchored-policy-learning-prevents-reward-exploitation">Anchored Policy Learning Prevents Reward Exploitation</h2>
<p>The key finding is that augmented episodic likelihood successfully balances score optimization with prior distribution preservation. The Agent achieves task objectives (sulphur avoidance, similarity targets, activity prediction) while maintaining the molecular property distributions learned from ChEMBL. This is a significant improvement over standard REINFORCE, which either exploits rewards trivially or collapses to simple structures.</p>
<p>Analysis of the conditional probability distributions between the Prior and Agent (for DRD2 active generation) shows that the policy changes are not drastic: most trends learned by the Prior carry over, with targeted modifications at specific steps that substantially alter sequence likelihoods and generated structure types.</p>
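The anchoring mechanism reduces to the paper's augmented episodic likelihood: the Agent's sequence log-likelihood is pulled toward the Prior's log-likelihood plus a scaled score. A minimal sketch (the σ value and the stand-in log-likelihoods are illustrative):

```python
def augmented_loss(prior_loglik: float, agent_loglik: float,
                   score: float, sigma: float = 60.0) -> float:
    """Squared distance between the Agent's episodic log-likelihood and the
    augmented likelihood log pi_prior(seq) + sigma * S(seq)."""
    augmented = prior_loglik + sigma * score
    return (augmented - agent_loglik) ** 2
```

A zero score leaves the Prior itself as the optimum, so the Agent cannot drift toward degenerate high-reward structures without paying a likelihood penalty.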
<p>Limitations acknowledged by the authors:</p>
<ul>
<li>All experiments use single-parameter scoring functions; multi-parametric optimization (activity + DMPK + synthetic accessibility) is left for future work</li>
<li>The quality of generated structures depends heavily on the Prior&rsquo;s coverage of chemical space</li>
<li>The activity model (SVM) has limited domain of applicability, and structures outside this domain may be falsely scored</li>
<li>No exhaustive study of how Prior training set size, model size, and regularization affect generation quality</li>
</ul>
<p>Future directions include multi-parametric scoring functions, exploration of token embeddings, and adversarial training where the scoring function is replaced by a discriminator network (GAN-style training).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>1.5M structures</td>
          <td>10-50 heavy atoms, filtered elements</td>
      </tr>
      <tr>
          <td>DRD2 activity model</td>
          <td>ExCAPE-DB</td>
          <td>7,218 actives + 100K inactives</td>
          <td>Butina clustering split (ECFP6, cutoff 0.4)</td>
      </tr>
      <tr>
          <td>Similarity target</td>
          <td>Celecoxib</td>
          <td>1 query structure</td>
          <td>FCFP4 fingerprints for Jaccard similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prior</strong>: 3-layer GRU RNN (1024 units/layer), Adam optimizer, 50K steps, batch size 128, LR 0.001 with 0.02 decay/100 steps</li>
<li><strong>Agent</strong>: Same architecture, SGD with LR 0.0005, gradient clipping [-3, 3], on-policy batches of 128</li>
<li><strong>DRD2 model</strong>: SVM with Gaussian kernel ($C = 2^7$, $\gamma = 2^{-6}$), grid search on validation set</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MarcusOlivecrona/REINVENT">REINVENT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation in TensorFlow/Python 2.7</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.572576">Archived version</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Zenodo archive (DOI: 10.5281/zenodo.572576)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>SMILES validity rate (RDKit parsing)</li>
<li>Fraction of structures satisfying scoring function</li>
<li>Molecular property distributions (MW, cLogP, rotatable bonds, aromatic rings)</li>
<li>Jaccard similarity on ECFP6/FCFP4 fingerprints</li>
<li>Recovery rate of known actives from test set</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. The implementation uses TensorFlow 1.0.1 with Python 2.7, RDKit, and Scikit-learn.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Olivecrona, M., Blaschke, T., Engkvist, O., &amp; Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. <em>Journal of Cheminformatics</em>, 9(1), 48.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{olivecrona2017molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular de-novo design through deep reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Olivecrona, Marcus and Blaschke, Thomas and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{48}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-017-0235-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ReactionT5: Pre-trained T5 for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</guid><description>ReactionT5 uses two-stage pretraining on ZINC and the Open Reaction Database to enable competitive reaction and yield prediction with minimal fine-tuning data.</description><content:encoded><![CDATA[<h2 id="a-two-stage-pre-trained-transformer-for-chemical-reactions">A Two-Stage Pre-trained Transformer for Chemical Reactions</h2>
<p>ReactionT5 is a <strong>Method</strong> paper that proposes a T5-based pre-trained model for chemical reaction tasks, specifically product prediction and yield prediction. The primary contribution is a two-stage pretraining pipeline: first on a compound library (ZINC, 23M molecules) to learn molecular representations, then on a large-scale reaction database (the Open Reaction Database, 1.5M reactions) to learn reaction-level patterns. The key result is that this pre-trained model can be fine-tuned with very limited target-domain data (as few as 30 reactions) and still achieve competitive performance against models trained on full datasets.</p>
<h2 id="bridging-the-gap-between-single-molecule-and-multi-molecule-pretraining">Bridging the Gap Between Single-Molecule and Multi-Molecule Pretraining</h2>
<p>While transformer-based models pre-trained on compound libraries (e.g., <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, MolGPT) have seen substantial development, most focus on single-molecule inputs and outputs. Pretraining for multi-molecule contexts, such as chemical reactions involving reactants, reagents, catalysts, and products, remains underexplored. T5Chem supports multi-task reaction prediction but focuses on building a single multi-task model rather than investigating the effectiveness of pre-trained models for fine-tuning on limited in-house data.</p>
<p>The authors identify two key gaps:</p>
<ol>
<li>Most pre-trained chemical models do not account for reaction-level interactions between multiple molecules.</li>
<li>In practical settings, target-domain reaction data is often scarce, making transfer learning from large public datasets essential.</li>
</ol>
<h2 id="two-stage-pretraining-with-compound-restoration">Two-Stage Pretraining with Compound Restoration</h2>
<p>The core innovation is a two-stage pretraining procedure built on the <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5 (text-to-text transfer transformer)</a> architecture:</p>
<p><strong>Stage 1: Compound Pretraining (CompoundT5)</strong>. A randomly initialized T5 model is trained on 23M <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> from the ZINC database using span-masked language modeling. The model learns to predict masked subsequences of SMILES tokens. A SentencePiece unigram tokenizer is trained on this compound library, allowing more compact representations than character-level or atom-level tokenizers. After this stage, new tokens are added to the tokenizer to cover metal atoms and other characters present in the reaction database but absent from ZINC.</p>
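A toy version of the span-masking objective on a tokenized SMILES string (single span with T5-style sentinel tokens; the real pretraining masks 15% of tokens in spans of average length 3, and uses SentencePiece rather than character tokens):

```python
import random

def span_corrupt(tokens, span_len=3, seed=0):
    """Replace one contiguous span with a sentinel token; the target
    reconstructs the span (toy, single-span T5 span-masked LM)."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    target = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
    return inputs, target
```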
<p><strong>Stage 2: Reaction Pretraining (ReactionT5)</strong>. CompoundT5 is further pretrained on 1.5M reactions from the Open Reaction Database (ORD) on both product prediction and yield prediction tasks. Reactions are formulated as text-to-text tasks using special tokens:</p>
<ul>
<li><code>REACTANT:</code>, <code>REAGENT:</code>, and <code>PRODUCT:</code> tokens delimit the role of each molecule in the reaction string.</li>
<li>For product prediction, the model takes reactants and reagents as input and generates product SMILES.</li>
<li>For yield prediction, the model takes the full reaction (including products) and outputs a numerical yield value.</li>
</ul>
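A reaction input assembled with these role tokens might look like the following sketch. The role tokens themselves are from the paper; the exact delimiter placement and the '.'-joining of multiple molecules are assumptions for illustration:

```python
def format_reaction(reactants, reagents, product=""):
    """Assemble a reaction string with the ReactionT5 role tokens.
    ('.'-joining of molecules is an assumption, not confirmed by the paper.)"""
    return ("REACTANT:" + ".".join(reactants)
            + "REAGENT:" + ".".join(reagents)
            + "PRODUCT:" + product)
```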
<p><strong>Compound Restoration</strong>. A notable methodological detail is the handling of uncategorized compounds in the ORD. About 31.8% of ORD reactions contain compounds with unknown roles. Simply discarding these reactions introduces severe product bias (only 447 unique products remain vs. 439,898 with uncategorized data included). The authors develop RestorationT5, a binary classifier built from CompoundT5, that assigns uncategorized compounds to either reactant or reagent roles. This classifier uses a sigmoid output layer and achieves an F1 score of 0.1564 at a threshold of 0.97, outperforming a random forest baseline (F1 = 0.1136). The restored dataset (&ldquo;ORD(restored)&rdquo;) is then used for reaction pretraining.</p>
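For reference, the reported F1 follows directly from the restoration classifier's precision and recall (the numeric values plugged in below are the paper's):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With the reported precision (0.0878) and recall (0.7212), this gives roughly 0.156, matching the stated F1 of 0.1564.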
<p>For yield prediction, the loss function is mean squared error:</p>
<p>$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$</p>
<p>where $y_i$ is the true yield (normalized to [0, 1]) and $\hat{y}_i$ is the predicted yield.</p>
<h2 id="experimental-setup-product-and-yield-prediction-benchmarks">Experimental Setup: Product and Yield Prediction Benchmarks</h2>
<h3 id="product-prediction">Product Prediction</h3>
<p>The USPTO dataset (479K reactions) is used for evaluation, with standard train/val/test splits (409K/30K/40K). Reactions overlapping with the ORD (18%) are removed during evaluation. Beam search with beam size 10 is used for decoding, and minimum/maximum output length constraints are set based on the training data distribution. Top-k accuracy (k = 1, 2, 3, 5) and invalidity rate are reported.</p>
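Top-k accuracy over beam outputs is straightforward to compute; the sketch below assumes exact string matching on canonical SMILES (the paper's matching procedure is not spelled out here):

```python
def top_k_accuracy(references, beam_outputs, k):
    """Fraction of reactions whose reference product appears among the
    first k beam-search candidates (exact string match on canonical SMILES)."""
    hits = sum(ref in beams[:k] for ref, beams in zip(references, beam_outputs))
    return hits / len(references)
```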
<p>Baselines include Seq-to-seq, WLDN (graph neural network), <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, and T5Chem.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Train</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Seq-to-seq</td>
          <td>USPTO</td>
          <td>80.3</td>
          <td>84.7</td>
          <td>86.2</td>
          <td>87.5</td>
          <td>-</td>
      </tr>
      <tr>
          <td>WLDN</td>
          <td>USPTO</td>
          <td>85.6</td>
          <td>90.5</td>
          <td>92.8</td>
          <td>93.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Molecular Transformer</td>
          <td>USPTO</td>
          <td>88.8</td>
          <td>92.6</td>
          <td>-</td>
          <td>94.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>USPTO</td>
          <td>90.4</td>
          <td>94.2</td>
          <td>-</td>
          <td>96.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>USPTO</td>
          <td>88.0</td>
          <td>92.4</td>
          <td>93.9</td>
          <td>95.0</td>
          <td>7.5</td>
      </tr>
      <tr>
          <td>ReactionT5 (restored ORD)</td>
          <td>USPTO200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<p>A critical finding: ReactionT5 pre-trained on ORD achieves 0% accuracy on USPTO without fine-tuning due to domain mismatch (ORD includes byproducts; USPTO lists only the main product). Fine-tuning on just 200 USPTO reactions with the restored ORD model produces competitive results.</p>
<p>The few-shot fine-tuning analysis shows rapid performance scaling:</p>
<table>
  <thead>
      <tr>
          <th>Samples</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>10</td>
          <td>9.0</td>
          <td>12.5</td>
          <td>15.3</td>
          <td>19.1</td>
          <td>12.4</td>
      </tr>
      <tr>
          <td>30</td>
          <td>80.5</td>
          <td>87.3</td>
          <td>89.8</td>
          <td>92.0</td>
          <td>17.2</td>
      </tr>
      <tr>
          <td>50</td>
          <td>83.7</td>
          <td>89.9</td>
          <td>92.2</td>
          <td>94.0</td>
          <td>14.8</td>
      </tr>
      <tr>
          <td>100</td>
          <td>85.1</td>
          <td>91.0</td>
          <td>92.8</td>
          <td>94.4</td>
          <td>14.0</td>
      </tr>
      <tr>
          <td>200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<h3 id="yield-prediction">Yield Prediction</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling dataset (3,955 reactions) is used with random 7:3 splits (repeated 10 times) plus four out-of-sample test sets (Tests 1-4) designed so that similar reactions do not appear in both train and test.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Random 7:3</th>
          <th>Test 1</th>
          <th>Test 2</th>
          <th>Test 3</th>
          <th>Test 4</th>
          <th>Avg. Tests 1-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DFT</td>
          <td>0.92</td>
          <td>0.80</td>
          <td>0.77</td>
          <td>0.64</td>
          <td>0.54</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MFF</td>
          <td>0.927</td>
          <td>0.851</td>
          <td>0.713</td>
          <td>0.635</td>
          <td>0.184</td>
          <td>0.596</td>
      </tr>
      <tr>
          <td>Yield-BERT</td>
          <td>0.951</td>
          <td>0.838</td>
          <td>0.836</td>
          <td>0.738</td>
          <td>0.538</td>
          <td>0.738</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>0.970</td>
          <td>0.811</td>
          <td>0.907</td>
          <td>0.789</td>
          <td>0.627</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>0.971</td>
          <td>0.855</td>
          <td>0.852</td>
          <td>0.712</td>
          <td>0.547</td>
          <td>0.741</td>
      </tr>
      <tr>
          <td>ReactionT5</td>
          <td>0.966</td>
          <td>0.914</td>
          <td>0.940</td>
          <td>0.819</td>
          <td>0.896</td>
          <td>0.892</td>
      </tr>
      <tr>
          <td>ReactionT5 (zero-shot)</td>
          <td>0.904</td>
          <td>0.919</td>
          <td>0.927</td>
          <td>0.847</td>
          <td>0.909</td>
          <td>0.900</td>
      </tr>
  </tbody>
</table>
<p>ReactionT5 achieves the highest average $R^2$ across Tests 1-4 (0.892), with the zero-shot variant performing even better (0.900). The improvement is most dramatic on Test 4, the hardest split, where ReactionT5 achieves $R^2 = 0.896$ versus T5Chem&rsquo;s 0.627 and Yield-BERT&rsquo;s 0.538.</p>
<p>In a low-data regime (30% train / 70% test), ReactionT5 ($R^2 = 0.927$) substantially outperforms a random forest baseline ($R^2 = 0.853$), and even zero-shot ReactionT5 ($R^2 = 0.898$) exceeds the random forest.</p>
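For completeness, the coefficient of determination used throughout these comparisons can be sketched as:

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot: 1.0 for perfect prediction,
    0.0 for predicting the mean of y_true everywhere."""
    y = np.asarray(y_true, dtype=float)
    f = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```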
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Two-stage pretraining is effective</strong>: Compound pretraining followed by reaction pretraining produces models with strong generalization, particularly on out-of-distribution test sets.</li>
<li><strong>Few-shot transfer works</strong>: With as few as 30 fine-tuning reactions, ReactionT5 achieves over 80% Top-1 accuracy on product prediction, competitive with models trained on the full USPTO dataset.</li>
<li><strong>Compound restoration matters</strong>: Restoring uncategorized compounds in the ORD is essential for product prediction. Without restoration, fine-tuning on 200 USPTO reactions yields 0% accuracy; with restoration, the same fine-tuning yields 85.5% Top-1.</li>
<li><strong>Zero-shot yield prediction is surprisingly effective</strong>: ReactionT5 achieves $R^2 = 0.900$ on the out-of-sample yield tests without any task-specific fine-tuning, outperforming all fine-tuned baselines.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Product prediction shows a high invalidity rate (12.0% for the best ReactionT5 variant) compared to CompoundT5 (7.5%), suggesting the reaction pretraining may introduce some noise.</li>
<li>The 0% accuracy without fine-tuning on product prediction reveals a significant domain gap between ORD and USPTO annotation conventions (byproducts vs. main products).</li>
<li>The RestorationT5 classifier has low precision (0.0878) despite high recall (0.7212), meaning many compounds are incorrectly assigned roles. The paper does not investigate how this impacts downstream performance.</li>
<li>The paper does not report training times, computational costs, or model sizes, making resource requirements unclear.</li>
<li>Only two downstream tasks (product prediction on USPTO, yield prediction on Buchwald-Hartwig) are evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compound pretraining</td>
          <td>ZINC</td>
          <td>22,992,522 compounds</td>
          <td>SMILES canonicalized with <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a></td>
      </tr>
      <tr>
          <td>Reaction pretraining</td>
          <td>ORD (restored)</td>
          <td>1,505,916 reactions</td>
          <td>Atom mapping removed, compounds canonicalized</td>
      </tr>
      <tr>
          <td>Product prediction eval</td>
          <td>USPTO</td>
          <td>479,035 reactions</td>
          <td>409K/30K/40K train/val/test split</td>
      </tr>
      <tr>
          <td>Yield prediction eval</td>
          <td>Buchwald-Hartwig C-N</td>
          <td>3,955 reactions</td>
          <td>Random 7:3 split (10 repeats) + 4 OOS tests</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Base architecture: T5 (text-to-text transfer transformer)</li>
<li>Tokenizer: SentencePiece unigram, trained on ZINC, extended with special reaction tokens</li>
<li>Compound pretraining: Span-masked language modeling (15% masking rate, average span length 3)</li>
<li>Beam search: size 10 for product prediction</li>
<li>Output length constraints: min/max from training data distribution</li>
<li>Yield normalization: clipped to [0, 100], then scaled to [0, 1]</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>CompoundT5: T5 pretrained on ZINC</li>
<li>RestorationT5: CompoundT5 fine-tuned for binary classification (reactant vs. reagent)</li>
<li>ReactionT5: CompoundT5 pretrained on ORD for product and yield prediction</li>
<li>Pre-trained weights available on Hugging Face</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>Product prediction</td>
          <td>85.5%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>Top-5 accuracy</td>
          <td>Product prediction</td>
          <td>94.9%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (random)</td>
          <td>0.966</td>
          <td>ReactionT5 fine-tuned</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (OOS avg.)</td>
          <td>0.900</td>
          <td>ReactionT5 zero-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Training times and GPU requirements are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sagawatatsuya/ReactionT5v2">ReactionT5v2 (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/sagawa">ReactionT5 models (Hugging Face)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sagawa, T. &amp; Kojima, R. (2023). ReactionT5: a large-scale pre-trained model towards application of limited reaction data. <em>arXiv preprint arXiv:2311.06708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sagawa2023reactiont5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ReactionT5: a large-scale pre-trained model towards application of limited reaction data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sagawa, Tatsuya and Kojima, Ryosuke}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2311.06708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2311.06708}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PharMolixFM: Multi-Modal All-Atom Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/pharmolixfm-all-atom-foundation-models/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/pharmolixfm-all-atom-foundation-models/</guid><description>PharMolixFM unifies diffusion, flow matching, and Bayesian flow networks for all-atom molecular modeling and generation with task-specific denoising priors.</description><content:encoded><![CDATA[<h2 id="a-unified-framework-for-all-atom-molecular-foundation-models">A Unified Framework for All-Atom Molecular Foundation Models</h2>
<p>PharMolixFM is a <strong>Method</strong> paper that introduces a unified framework for constructing all-atom foundation models for molecular modeling and generation. The primary contribution is the systematic implementation of three multi-modal generative model variants (diffusion, flow matching, and Bayesian flow networks) within a single architecture, along with a task-unifying denoising formulation that enables training on multiple structural biology tasks simultaneously. The framework achieves competitive performance on protein-small-molecule docking and structure-based drug design while providing the first empirical analysis of inference scaling laws for molecular generative models.</p>
<h2 id="challenges-in-multi-modal-atomic-modeling">Challenges in Multi-Modal Atomic Modeling</h2>
<p>Existing all-atom foundation models such as AlphaFold3, RoseTTAFold All-Atom, and ESM-AA face two core challenges that limit their generalization across molecular modeling and generation tasks.</p>
<p>First, atomic data is inherently multi-modal: each atom comprises both a discrete atom type and continuous 3D coordinates. This poses challenges for structure models that need to jointly capture and predict both modalities. Unlike text or image data that exhibit a single modality, molecular structures require generative models that can handle discrete categorical variables (atom types, bond types) and continuous variables (coordinates) simultaneously.</p>
<p>Second, there has been no comprehensive analysis of how different training objectives and sampling strategies impact the performance of all-atom foundation models. Prior work has focused on individual model architectures without systematically comparing generative frameworks or studying how inference-time compute scaling affects prediction quality.</p>
<p>PharMolixFM addresses both challenges by providing a unified framework that implements three state-of-the-art multi-modal generative models and formulates all downstream tasks as a generalized denoising process with task-specific priors.</p>
<h2 id="multi-modal-denoising-with-task-specific-priors">Multi-Modal Denoising with Task-Specific Priors</h2>
<p>The core innovation of PharMolixFM is the formulation of molecular tasks as a generalized denoising process where task-specific priors control which parts of the molecular system are noised during training. The framework decomposes a biomolecular system into $N$ atoms represented as a triplet $\bar{\mathbf{S}}_0 = \langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle$, where $\mathbf{X}_0 \in \mathbb{R}^{N \times 3}$ are atom coordinates, $\mathbf{A}_0 \in \mathbb{Z}^{N \times D_1}$ are one-hot atom types, and $\mathbf{E}_0 \in \mathbb{Z}^{N \times N \times D_2}$ are one-hot bond types.</p>
<p>The generative model estimates the density $p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)$ subject to SE(3) invariance:</p>
<p>$$
p_\theta(\langle \mathbf{R}\mathbf{X}_0 + \mathbf{t}, \mathbf{A}_0, \mathbf{E}_0 \rangle) = p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)
$$</p>
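This invariance can be exercised numerically: a random rigid motion (rotation $\mathbf{R}$ plus translation $\mathbf{t}$) preserves all interatomic distances, and the density must not change under it. A sketch of such a transform:

```python
import numpy as np

def random_rigid_transform(X: np.ndarray, rng) -> np.ndarray:
    """Apply a random rotation R (QR decomposition of a Gaussian matrix,
    sign-corrected to det(R) = +1) and a random translation t to coordinates."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0  # flip one axis to obtain a proper rotation
    t = rng.standard_normal(3)
    return X @ Q.T + t
```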
<p>The variational lower bound is optimized over latent variables $S_1, \ldots, S_T$ obtained by adding independent noise to different modalities and atoms:</p>
<p>$$
q(S_{1:T} \mid S_0) = \prod_{i=1}^{T} \prod_{j=1}^{N} q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}, \sigma_{i,j}^{(\mathbf{X})})\, q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}, \sigma_{i,j}^{(\mathbf{A})})\, q(\mathbf{E}_{i,j} \mid \mathbf{E}_{0,j}, \sigma_{i,j}^{(\mathbf{E})})
$$</p>
<p>A key design choice is the noise schedule $\sigma_{i,j}^{(\mathcal{M})} = \frac{i}{T} \cdot \text{fix}_j^{(\mathcal{M})}$, where $\text{fix}_j^{(\mathcal{M})}$ is a scaling factor between 0 and 1 that controls which atoms and modalities receive noise. This &ldquo;Fix&rdquo; mechanism enables multiple training tasks:</p>
<ul>
<li><strong>Docking</strong> ($\text{Fix} = 1$ for protein and molecular graph, $\text{Fix} = 0$ for molecule coordinates): predicts binding pose given known atom/bond types.</li>
<li><strong>Structure-based drug design</strong> ($\text{Fix} = 1$ for protein, $\text{Fix} = 0$ for all molecule properties): generates novel molecules for a given pocket.</li>
<li><strong>Robustness augmentation</strong> ($\text{Fix} = 0.7$ for 15% randomly selected atoms, $\text{Fix} = 0$ for rest): simulates partial structure determination.</li>
</ul>
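As written, the schedule ramps each atom's noise level linearly from 0 at step 0 to its per-atom scaling factor at step $T$. A direct transcription (array shapes are illustrative):

```python
import numpy as np

def noise_schedule(T: int, fix) -> np.ndarray:
    """sigma[i-1, j] = (i / T) * fix_j for steps i = 1..T: components with
    fix_j = 0 are never perturbed; fix_j = 1 reaches full noise at step T."""
    steps = np.arange(1, T + 1)[:, None] / T               # shape (T, 1)
    return steps * np.asarray(fix, dtype=float)[None, :]   # shape (T, n_atoms)
```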
<h3 id="three-generative-model-variants">Three Generative Model Variants</h3>
<p><strong>Multi-modal diffusion (PharMolixFM-Diff)</strong> uses a Markovian forward process. Continuous coordinates follow Gaussian diffusion while discrete variables use a D3PM categorical transition:</p>
<p>$$
q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\sqrt{\alpha_{i,j}}\, \mathbf{X}_{0,j}, (1 - \alpha_{i,j}) \mathbf{I}), \quad \alpha_{i,j} = \prod_{k=1}^{i}(1 - \sigma_{k,j}^{(\mathbf{X})})
$$</p>
<p>$$
q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}) = \text{Cat}(\mathbf{A}_{0,j} \bar{Q}_{i,j}^{(\mathbf{A})}), \quad \bar{Q}_{i,j}^{(\mathbf{A})} = \prod_{k=1}^{i} Q_{k,j}^{(\mathbf{A})}, \quad Q_{i,j}^{(\mathbf{A})} = (1 - \sigma_{i,j}^{(\mathbf{A})}) \mathbf{I} + \frac{\sigma_{i,j}^{(\mathbf{A})}}{D_1} \mathbb{1}\mathbb{1}^T
$$</p>
<p>The training loss combines coordinate MSE with cross-entropy for discrete variables:</p>
<p>$$
\mathcal{L} = \mathbb{E}_{S_0, i, S_i} \left[ \lambda_i^{(\mathbf{X})} \| \tilde{\mathbf{X}}_0 - \mathbf{X}_0 \|_2^2 + \lambda_i^{(\mathbf{A})} \mathcal{L}_{CE}(\tilde{\mathbf{A}}_0, \mathbf{A}_0) + \lambda_i^{(\mathbf{E})} \mathcal{L}_{CE}(\tilde{\mathbf{E}}_0, \mathbf{E}_0) \right]
$$</p>
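The coordinate forward process under this schedule can be sketched with the reparameterization trick, accumulating $\alpha_{i,j}$ as a running product of $(1 - \sigma_{k,j})$ (shapes and step values are illustrative):

```python
import numpy as np

def diffuse_coords(X0: np.ndarray, sigmas, rng) -> np.ndarray:
    """Sample X_i ~ N(sqrt(alpha_i) * X0, (1 - alpha_i) I), where alpha_i
    is the running product of (1 - sigma_k) over the noising steps so far."""
    alpha = float(np.prod(1.0 - np.asarray(sigmas, dtype=float)))
    noise = rng.standard_normal(X0.shape)
    return np.sqrt(alpha) * X0 + np.sqrt(1.0 - alpha) * noise
```

With all sigmas zero (a fixed component), alpha stays 1 and the coordinates pass through unchanged.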
<p><strong>Multi-modal flow matching (PharMolixFM-Flow)</strong> constructs a direct mapping between data and prior distributions using conditional vector fields. For coordinates, the conditional flow uses a Gaussian path $q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}((1 - \sigma_{i,j}^{(\mathbf{X})}) \mathbf{X}_{0,j}, (\sigma_{i,j}^{(\mathbf{X})})^2 \mathbf{I})$, while discrete variables use the same D3PM Markov chain. Sampling proceeds by solving an ODE via Euler integration.</p>
<p><strong>Bayesian flow networks (PharMolixFM-BFN)</strong> perform generative modeling in the parameter space of the data distribution rather than the data space. The Bayesian flow distribution for coordinates is:</p>
<p>$$
p_F(\tilde{\mathbf{X}}_{i,j}^{(\theta)} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\gamma_{i,j} \mathbf{X}_{0,j}, \gamma_{i,j}(1 - \gamma_{i,j}) \mathbf{I}), \quad \gamma_{i,j} = 1 - \alpha^{2(1 - \sigma_{i,j}^{(\mathbf{X})})}
$$</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>The architecture follows PocketXMol with a dual-branch SE(3)-equivariant graph neural network. A protein branch (4-layer GNN with kNN graph) processes pocket atoms, then representations are passed to a molecule branch (6-layer GNN) that captures protein-molecule interactions. Independent prediction heads reconstruct atom coordinates, atom types, and bond types, with additional confidence heads for self-ranking during inference.</p>
<h2 id="docking-and-drug-design-experiments">Docking and Drug Design Experiments</h2>
<h3 id="protein-small-molecule-docking">Protein-Small-Molecule Docking</h3>
<p>PharMolixFM is evaluated on the PoseBusters benchmark (428 protein-small-molecule complexes) in the holo docking setting, with a known protein structure and a 10 Angstrom binding pocket. The metric is the fraction of predictions with RMSD &lt; 2 Angstrom.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Self-Ranking (%)</th>
          <th>Oracle-Ranking (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffDock</td>
          <td>38.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>RFAA</td>
          <td>42.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Vina</td>
          <td>52.3</td>
          <td>-</td>
      </tr>
      <tr>
          <td>UniMol-Docking V2</td>
          <td>77.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>SurfDock</td>
          <td>78.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>AlphaFold3</td>
          <td>90.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>PocketXMol (50 repeats)</td>
          <td>82.2</td>
          <td>95.3</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (50 repeats)</td>
          <td>83.4</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow (50 repeats)</td>
          <td>73.4</td>
          <td>93.7</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN (50 repeats)</td>
          <td>78.5</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (500 repeats)</td>
          <td>83.9</td>
          <td>98.1</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM-Diff achieves the second-best self-ranking result (83.4%), outperforming PocketXMol by 1.2 percentage points but trailing AlphaFold3 (90.4%). The key advantage is inference speed: approximately 4.6 seconds per complex on a single A800 GPU compared to approximately 249.0 seconds for AlphaFold3 (a 54x speedup). Under oracle-ranking with 500 repeats, PharMolixFM-Diff reaches 98.1%, suggesting that better ranking strategies could further improve practical performance.</p>
<h3 id="structure-based-drug-design">Structure-Based Drug Design</h3>
<p>Evaluation uses the CrossDocked test set (100 protein pockets, 100 molecules generated per pocket), measuring Vina binding affinity scores and drug-likeness properties (QED and SA).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score (Avg/Med)</th>
          <th>QED</th>
          <th>SA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-5.14 / -4.70</td>
          <td>0.57</td>
          <td>0.76</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-5.47 / -6.30</td>
          <td>0.48</td>
          <td>0.58</td>
      </tr>
      <tr>
          <td>DecompDiff</td>
          <td>-5.67 / -6.04</td>
          <td>0.45</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>MolCRAFT</td>
          <td>-6.61 / -8.14</td>
          <td>0.46</td>
          <td>0.62</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff</td>
          <td>-6.18 / -6.44</td>
          <td>0.50</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow</td>
          <td>-6.34 / -6.47</td>
          <td>0.49</td>
          <td>0.74</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN</td>
          <td>-6.38 / -6.45</td>
          <td>0.48</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM strikes a better balance between binding affinity and drug-like properties than the diffusion-based baselines. While MolCRAFT achieves the best Vina scores, the PharMolixFM-Diff and -Flow variants show notably higher QED (0.49-0.50 vs. 0.45-0.48) and SA (0.73-0.74 vs. 0.58-0.62) than TargetDiff, DecompDiff, and MolCRAFT, properties that matter for downstream validation and in vivo application. Pocket2Mol still leads on QED and SA, but with the weakest Vina scores.</p>
<h3 id="inference-scaling-law">Inference Scaling Law</h3>
<p>The paper explores whether inference-time scaling holds for molecular generative models, fitting the relationship:</p>
<p>$$
\text{Acc} = a \log(bR + c) + d
$$</p>
<p>where $R$ is the number of sampling repeats. All three PharMolixFM variants exhibit logarithmic improvement in docking accuracy with increased sampling repeats, analogous to inference scaling laws observed in NLP. Performance plateaus eventually due to distributional differences between training and test sets.</p>
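<p>The shape of this law is easy to see numerically. With illustrative coefficients (not the paper's fitted values), equal increments in the repeat budget $R$ buy progressively smaller accuracy gains:</p>

```python
import math

def scaling_law(R, a=4.0, b=0.5, c=1.0, d=70.0):
    """Acc = a * log(b*R + c) + d, with made-up coefficients for illustration."""
    return a * math.log(b * R + c) + d

repeats = [100, 200, 300, 400, 500]
accs = [scaling_law(r) for r in repeats]
# Each extra 100 repeats yields a smaller gain than the last: the curve saturates.
deltas = [y2 - y1 for y1, y2 in zip(accs, accs[1:])]
```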
<h2 id="competitive-docking-with-faster-inference-but-limited-task-scope">Competitive Docking with Faster Inference, but Limited Task Scope</h2>
<p>PharMolixFM demonstrates that multi-modal generative models can achieve competitive all-atom molecular modeling with substantial inference speed advantages over AlphaFold3. The key findings are:</p>
<ol>
<li><strong>Diffusion outperforms flow matching and BFN</strong> for docking under standard sampling budgets. The stochastic nature of diffusion sampling appears beneficial compared to the deterministic ODE integration of flow matching.</li>
<li><strong>Oracle-ranking reveals untapped potential</strong>: the gap between self-ranking (83.4%) and oracle-ranking (98.1%) at 500 repeats indicates that confidence-based ranking is a bottleneck. Better ranking methods could close the gap with AlphaFold3.</li>
<li><strong>The three variants show similar performance for drug design</strong>, suggesting that model architecture and training data may matter more than the choice of generative framework for generation tasks.</li>
<li><strong>Inference scaling laws hold</strong> for molecular generative models, paralleling findings in NLP.</li>
</ol>
<p>Limitations include that the framework is only evaluated on two tasks (docking and SBDD), and the paper does not address protein structure prediction, protein-protein interactions, or nucleic acid modeling, which are part of AlphaFold3&rsquo;s scope. The BFN variant underperforms the diffusion model, which the authors attribute to smaller noise scales at early sampling steps making training less challenging. The paper also does not compare against concurrent work on inference-time scaling for molecular models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PDBBind, Binding MOAD, CrossDocked2020, PepBDB</td>
          <td>Not specified</td>
          <td>Filtered by PocketXMol criteria</td>
      </tr>
      <tr>
          <td>Docking eval</td>
          <td>PoseBusters benchmark</td>
          <td>428 complexes</td>
          <td>Holo docking with known protein</td>
      </tr>
      <tr>
          <td>SBDD eval</td>
          <td>CrossDocked test set</td>
          <td>100 pockets</td>
          <td>100 molecules per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Three generative variants: multi-modal diffusion (D3PM), flow matching, Bayesian flow networks</li>
<li>Task-specific noise via Fix mechanism (0, 0.7, or 1.0)</li>
<li>Training tasks selected with equal probability per sample</li>
<li>AdamW optimizer: weight decay 0.001, $\beta_1 = 0.99$, $\beta_2 = 0.999$</li>
<li>Linear warmup to learning rate 0.001 over 1000 steps</li>
<li>180K training steps with batch size 40</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Dual-branch SE(3)-equivariant GNN (protein: 4-layer, molecule: 6-layer)</li>
<li>kNN graph construction for protein and protein-molecule interactions</li>
<li>Independent prediction heads for coordinates, atom types, bond types</li>
<li>Confidence heads for self-ranking during inference</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharMolixFM-Diff</th>
          <th>AlphaFold3</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSD &lt; 2A self-ranking</td>
          <td>83.4% (50 rep)</td>
          <td>90.4%</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>RMSD &lt; 2A oracle-ranking</td>
          <td>98.1% (500 rep)</td>
          <td>-</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>Inference time (per complex)</td>
          <td>~4.6s</td>
          <td>~249.0s</td>
          <td>Single A800 GPU</td>
      </tr>
      <tr>
          <td>Vina score (avg)</td>
          <td>-6.18</td>
          <td>-</td>
          <td>CrossDocked SBDD</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 4x 80GB A800 GPUs</li>
<li>Inference benchmarked on single A800 GPU</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/PharMolix/OpenBioMed">OpenBioMed (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Wang, J., Fan, S., &amp; Nie, Z. (2025). PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation. <em>arXiv preprint arXiv:2503.21788</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2025pharmolixfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2503.21788}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PharmaGPT: Domain-Specific LLMs for Pharma and Chem</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/pharmagpt-domain-specific-llms-biopharmaceutical/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/pharmagpt-domain-specific-llms-biopharmaceutical/</guid><description>PharmaGPT introduces 13B and 70B parameter LLMs trained on biopharmaceutical and chemical corpora, outperforming GPT-3.5 and rivaling GPT-4 on pharmacy exams.</description><content:encoded><![CDATA[<h2 id="a-domain-specific-llm-suite-for-biopharmaceuticals-and-chemistry">A Domain-Specific LLM Suite for Biopharmaceuticals and Chemistry</h2>
<p>This is a <strong>Method</strong> paper that introduces PharmaGPT, a suite of domain-specific large language models with 13 billion and 70 billion parameters. The models are built on the LLaMA architecture and undergo continued pretraining on a curated corpus of biopharmaceutical and chemical literature, followed by instruction fine-tuning and reinforcement learning from human feedback (RLHF). The primary contribution is demonstrating that domain-specific continued pretraining on a general-purpose LLM backbone can produce models that outperform much larger general-purpose models on pharmaceutical knowledge tasks, using only a fraction of the parameters.</p>
<h2 id="bridging-the-gap-between-general-purpose-llms-and-specialized-pharmaceutical-knowledge">Bridging the Gap Between General-Purpose LLMs and Specialized Pharmaceutical Knowledge</h2>
<p>General-purpose LLMs like GPT-3.5 and GPT-4 show impressive broad capabilities but often fall short in specialized domains requiring precise terminology, deep domain knowledge, and high accuracy. The biopharmaceutical and chemical sectors present particular challenges: intricate terminologies, specialized regulatory knowledge, and a demand for precision that general models cannot consistently deliver. Most state-of-the-art LLMs are proprietary, English-centric, and lack depth in vertical domains. The authors identify a gap in the availability of domain-specific LLMs for biomedicine and chemistry, particularly multilingual models that can handle both English and Chinese pharmaceutical content.</p>
<h2 id="continued-pretraining-with-domain-specific-data-and-weighted-instruction-tuning">Continued Pretraining with Domain-Specific Data and Weighted Instruction Tuning</h2>
<p>PharmaGPT&rsquo;s core innovation lies in its training pipeline, which adapts the LLaMA backbone through three stages:</p>
<p><strong>Extended Tokenizer</strong>: The authors develop a new tokenizer using <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">byte-pair encoding (BPE)</a> from SentencePiece, trained on their pretraining data and merged with the LLaMA2 tokenizer. This extends the vocabulary from 32,000 to 55,296 tokens, improving compression efficiency for Chinese text and specialized domain terminology. The embedding and output layers are resized from $V \times H$ to $V' \times H$ where $V = 32{,}000$ and $V' = 55{,}296$.</p>
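<p>The embedding resize amounts to growing the $V \times H$ matrix while preserving the original rows. A minimal sketch (the initialization scheme for the new rows is an assumption; the paper does not specify it):</p>

```python
import numpy as np

def resize_embeddings(emb: np.ndarray, new_vocab: int, rng=None) -> np.ndarray:
    """Grow a (V, H) embedding matrix to (V', H): old rows are kept so the
    original LLaMA2 token embeddings survive; new rows (for added domain and
    Chinese tokens) get a small random init -- an assumed scheme."""
    rng = rng or np.random.default_rng(0)
    old_vocab, hidden = emb.shape
    new_rows = rng.normal(scale=0.02, size=(new_vocab - old_vocab, hidden))
    return np.concatenate([emb, new_rows], axis=0)

emb = np.random.default_rng(1).normal(size=(32_000, 8))  # toy hidden size
emb2 = resize_embeddings(emb, new_vocab=55_296)
```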
<p><strong>Two-Stage Continued Pretraining</strong>: The models consume 153 billion tokens in Stage 1 (primarily web, news, patents, and papers) and 43 billion tokens in Stage 2 (research reports, exams, books, chats, code, and supervised data). The data distribution shifts between stages to move from general domain knowledge toward specialized biopharmaceutical tasks.</p>
<p><strong>Weighted Instruction Fine-tuning</strong>: Inspired by OpenChat, the authors use a weighted autoregressive objective that zeros out loss on user instruction tokens. The loss function is:</p>
<p>$$\mathcal{L}_{SFT}(\Theta) = \mathbb{E}_{x \sim \mathcal{D}_{SFT}} \left[ -\alpha \sum_{i \in \text{output}} \log p(x_i \mid x_0, x_1, \dots, x_{i-1}; \Theta) \right]$$</p>
<p>where the weight $\alpha$ is set to 1 for expert-curated domain-specific instructions ($\mathcal{D}_{\exp}$) and 0.1 for generic instructions ($\mathcal{D}_{\text{gen}}$). This differential weighting ensures domain-relevant instructions receive higher priority during training.</p>
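<p>Concretely, the weighted objective masks out instruction tokens and scales the remaining negative log-likelihood by $\alpha$. A toy sketch with made-up log-probabilities:</p>

```python
def weighted_sft_loss(token_logprobs, is_output, alpha):
    """Negative log-likelihood over response tokens only, scaled by alpha
    (1.0 for expert-curated examples, 0.1 for generic ones)."""
    nll = -sum(lp for lp, out in zip(token_logprobs, is_output) if out)
    return alpha * nll

# Instruction tokens (is_output=False) contribute no loss.
logprobs = [-0.1, -0.2, -0.5, -0.3]   # per-token log p(x_i | x_<i), toy values
mask = [False, False, True, True]     # last two tokens are the response
expert_loss = weighted_sft_loss(logprobs, mask, alpha=1.0)
generic_loss = weighted_sft_loss(logprobs, mask, alpha=0.1)
```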
<p><strong>RLHF with PPO</strong>: A reward model is initialized from the pretrained PharmaGPT-70B and enhanced with two MLPs to output a scalar preference score. The reward model is trained with a binary ranking loss:</p>
<p>$$\mathcal{L}_{\text{ranking}} = -\log\left(\sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)\right)$$</p>
<p>where $r_\theta(x, y_c)$ is the score for the preferred response and $r_\theta(x, y_r)$ is the score for the rejected response. The RLHF dataset consists of 50,000 human preference expert-annotated instructions with responses from PharmaGPT variants and commercial LLMs (GPT-4, ChatGPT-3.5). <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">Proximal Policy Optimization (PPO)</a> is used for the RL training, selecting the highest-scoring response from four generated candidates at each step.</p>
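<p>The ranking loss itself is a one-liner: a wider margin between the preferred and rejected scores yields a lower loss, and a zero margin gives $-\log(0.5) = \log 2$.</p>

```python
import math

def ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Binary ranking loss: -log(sigmoid(r(x, y_c) - r(x, y_r)))."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

close = ranking_loss(1.0, 0.9)   # small margin -> higher loss
wide = ranking_loss(3.0, -1.0)   # large margin -> lower loss
```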
<h2 id="evaluation-on-pharmacy-licensing-exams-translation-and-mmlu">Evaluation on Pharmacy Licensing Exams, Translation, and MMLU</h2>
<p>The evaluation covers four main benchmarks:</p>
<p><strong>NAPLEX (North American Pharmacist Licensure Examination)</strong>: PharmaGPT is tested across three NAPLEX sections. Results show consistent improvement across model iterations:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>NAPLEX I</th>
          <th>NAPLEX II</th>
          <th>NAPLEX III</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT 0.1</td>
          <td>5.0</td>
          <td>2.5</td>
          <td>3.5</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.3</td>
          <td>42.0</td>
          <td>48.0</td>
          <td>46.5</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.5</td>
          <td>57.0</td>
          <td>59.0</td>
          <td>58.0</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.7</td>
          <td>66.0</td>
          <td>68.0</td>
          <td>76.0</td>
      </tr>
  </tbody>
</table>
<p>PharmaGPT 0.7 scores in the 66-76% range across all three NAPLEX sections, outperforming GPT-3.5-turbo by considerable margins.</p>
<p><strong>Chinese Pharmacist Examination</strong>: PharmaGPT achieves scores in the 70% range across all four exam categories, outperforming both GPT-3.5-turbo and GPT-4 in all categories. This result is notable given GPT-4&rsquo;s much larger scale.</p>
<p><strong>Biomedical Translation</strong>: PharmaGPT 0.7 outperforms GPT-3.5, Claude 3, and Google Translate on biomedical paper translation (English-Chinese), achieving <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a> scores of 30 (paragraph-level), 18 (sentence-level), and 10 (word-level).</p>
<p><strong>MMLU</strong>: On the general Massive Multitask Language Understanding benchmark, PharmaGPT achieves scores in the 80% range across most biomedical and life science tasks, surpassing GPT-3.5-turbo and performing comparably to GPT-4 in areas such as physiology, health sciences, and biology.</p>
<h2 id="strong-domain-performance-with-smaller-scale-but-limited-reproducibility">Strong Domain Performance with Smaller Scale, but Limited Reproducibility</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Domain-specific continued pretraining enables a 70B parameter model to match or exceed GPT-4 on pharmaceutical knowledge tasks, despite having a fraction of GPT-4&rsquo;s parameters</li>
<li>Iterative post-training (versions 0.1 through 0.7) shows consistent improvement, with the largest gains occurring between versions 0.3 and 0.5</li>
<li>The two-stage pretraining strategy, shifting from general domain data to more specialized exam and report data, appears effective for building domain expertise</li>
<li>Scaling laws hold within the PharmaGPT family: larger parameter counts consistently produce better performance on both NAPLEX and Chinese pharmaceutical exams</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>Potential biases in the training data</li>
<li>Model dependency on the quality and diversity of input prompts</li>
<li>Challenges in accurately assessing performance on highly specialized tasks without domain expert evaluation</li>
<li>Interpretability concerns for use in sensitive healthcare and pharmaceutical applications</li>
<li>The 3B model is trained from scratch while the 13B and 70B models use LLaMA as a backbone, making direct comparison across model sizes less straightforward</li>
</ul>
<p><strong>Missing details</strong>: The paper does not release model weights, training code, or the proprietary training dataset. No ablation studies isolate the contribution of each training stage (continued pretraining vs. instruction tuning vs. RLHF). The evaluation is limited to multiple-choice exams and translation, without testing on molecular property prediction, reaction prediction, or other computational chemistry tasks common in this domain.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining Stage 1</td>
          <td>Web, News, Patents, Papers</td>
          <td>153B tokens</td>
          <td>Proprietary corpus; not publicly available</td>
      </tr>
      <tr>
          <td>Pretraining Stage 2</td>
          <td>Research Reports, Exams, Books, Chats, Code</td>
          <td>43B tokens</td>
          <td>Proprietary corpus; not publicly available</td>
      </tr>
      <tr>
          <td>Instruction Tuning</td>
          <td>Manually labeled + synthesized data</td>
          <td>Several hundred thousand instructions</td>
          <td>Includes expert Q&amp;A, patent data, ShareGPT</td>
      </tr>
      <tr>
          <td>RLHF</td>
          <td>Human preference annotations</td>
          <td>50,000 annotated instructions</td>
          <td>Expert annotators ranked responses</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>NAPLEX, Chinese Pharmacist Exam, MMLU, MT</td>
          <td>Not specified</td>
          <td>Exam datasets sourced from public exams</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Base architecture</strong>: LLaMA (13B and 70B variants); 3B model trained from scratch</li>
<li><strong>Tokenizer</strong>: Extended BPE tokenizer (55,296 vocab size) merged with LLaMA2 tokenizer</li>
<li><strong>Training objective</strong>: Standard autoregressive LM (pretraining), weighted autoregressive with $\alpha \in \{0.1, 1.0\}$ (SFT), PPO (RLHF)</li>
<li><strong>Reward model</strong>: Initialized from PharmaGPT-70B with two additional MLPs</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Base</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT-3B</td>
          <td>3B</td>
          <td>Trained from scratch</td>
          <td>Not evaluated in main results</td>
      </tr>
      <tr>
          <td>PharmaGPT-13B</td>
          <td>13B</td>
          <td>LLaMA-13B</td>
          <td>Post-trained</td>
      </tr>
      <tr>
          <td>PharmaGPT-70B</td>
          <td>70B</td>
          <td>LLaMA-70B</td>
          <td>Primary model; versions 0.1-0.7 reported</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharmaGPT 0.7</th>
          <th>GPT-3.5</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NAPLEX I</td>
          <td>66%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>NAPLEX II</td>
          <td>68%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>NAPLEX III</td>
          <td>76%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>Chinese Pharmacist Exam</td>
          <td>~70% range</td>
          <td>Lower</td>
          <td>Outperforms GPT-4</td>
      </tr>
      <tr>
          <td>Biomedical Translation (paragraph BLEU)</td>
          <td>30</td>
          <td>27</td>
          <td>English-Chinese</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify the hardware used for training. Training hyperparameters for the 70B model include tensor parallelism (TP=8) and pipeline parallelism (PP=16) during pretraining, suggesting multi-node GPU training, likely on at least 128 GPUs.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT models</td>
          <td>Model</td>
          <td>Not released</td>
          <td>No public weights or API access</td>
      </tr>
      <tr>
          <td>Training data</td>
          <td>Dataset</td>
          <td>Proprietary</td>
          <td>PatSnap internal data</td>
      </tr>
      <tr>
          <td>Training code</td>
          <td>Code</td>
          <td>Not released</td>
          <td>No public repository</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: <strong>Closed</strong>. Neither the model weights, training data, nor training code are publicly available. The proprietary nature of both the data pipeline and the models makes independent reproduction infeasible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, L., Wang, W., Bai, Z., Xu, P., Fang, Y., Fang, J., &hellip; &amp; Tu, C. (2024). PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry. <em>arXiv preprint arXiv:2406.18045</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chen2024pharmagpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Linqing and Wang, Weilei and Bai, Zilong and Xu, Peng and Fang, Yan and Fang, Jie and Wu, Wentao and Zhou, Lizhi and Zhang, Ruiji and Xia, Yubin and Xu, Chaobo and Hu, Ran and Xu, Licong and Cai, Qijun and Hua, Haoran and Sun, Jing and Liu, Jin and Qiu, Tian and Liu, Haowen and Hu, Meng and Li, Xiuwen and Gao, Fei and Wang, Yufu and Tie, Lin and Wang, Chaochao and Lu, Jianping and Sun, Cheng and Wang, Yixin and Yang, Shengjie and Li, Yuancheng and Jin, Lu and Zhang, Lisha and Bian, Fu and Ye, Zhongkai and Pei, Lidong and Tu, Changyang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2406.18045}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2406.18045}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ORGAN: Objective-Reinforced GANs for Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/</guid><description>ORGAN combines GANs with reinforcement learning to steer SMILES-based molecular generation toward drug-likeness, solubility, and synthesizability objectives.</description><content:encoded><![CDATA[<h2 id="combining-gans-and-reinforcement-learning-for-goal-directed-sequence-generation">Combining GANs and Reinforcement Learning for Goal-Directed Sequence Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces ORGAN (Objective-Reinforced Generative Adversarial Network), a framework for generating sequences that are both realistic (close to the training distribution) and optimized for domain-specific objectives. ORGAN extends SeqGAN by adding external reward functions to the reinforcement learning signal, with a tunable parameter $\lambda$ controlling the balance between adversarial (discriminator) and objective-based rewards. The authors demonstrate ORGAN on two domains: molecular generation using <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings (optimizing druglikeness, solubility, and synthesizability) and musical melody generation (optimizing tonality and step ratios).</p>
<h2 id="exposure-bias-and-mode-collapse-in-discrete-sequence-generation">Exposure Bias and Mode Collapse in Discrete Sequence Generation</h2>
<p>Generating discrete sequences with desirable properties presents two intertwined challenges. First, RNNs trained via maximum likelihood estimation (MLE) suffer from exposure bias, where the model sees only ground-truth prefixes during training but must condition on its own (potentially erroneous) outputs at generation time. Second, while <a href="/posts/what-is-a-gan/">GANs</a> can address some of these issues through adversarial training, they were not initially applicable to discrete data due to non-differentiability of the sampling step. SeqGAN resolved this by framing the generator as an RL agent, but it optimizes only for distributional fidelity (fooling the discriminator) without any mechanism to steer generation toward specific property targets.</p>
<p>In drug discovery, simply generating valid, drug-like molecules is insufficient. Practitioners need to optimize for particular pharmaceutical properties (e.g., solubility, synthesizability, druglikeness) while maintaining structural diversity. Naive RL approaches can optimize properties effectively but tend to collapse onto trivial solutions (e.g., repeating &ldquo;CCCCCCC&rdquo; to maximize solubility). The challenge is to combine the distributional regularization of adversarial training with the goal-directedness of RL.</p>
<h2 id="mixed-reward-interpolating-between-adversarial-and-objective-signals">Mixed Reward: Interpolating Between Adversarial and Objective Signals</h2>
<p>ORGAN&rsquo;s core innovation is a reward function that linearly interpolates between the discriminator score and domain-specific objectives:</p>
<p>$$R(Y_{1:T}) = \lambda \cdot D_{\phi}(Y_{1:T}) + (1 - \lambda) \cdot O_{i}(Y_{1:T})$$</p>
<p>When $\lambda = 1$, the model reduces to SeqGAN (pure adversarial training). When $\lambda = 0$, it becomes naive RL optimizing only the objective. Intermediate values allow the adversarial component to regularize the generator, keeping samples within the distribution while the objective component steers toward desired properties.</p>
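<p>The reward interpolation is simple enough to state directly in code, which makes the two limiting cases explicit:</p>

```python
def organ_reward(d_score: float, objective: float, lam: float) -> float:
    """R(Y) = lambda * D(Y) + (1 - lambda) * O(Y).
    lam=1 recovers SeqGAN; lam=0 is naive objective-only RL."""
    return lam * d_score + (1.0 - lam) * objective

seqgan = organ_reward(0.8, 0.3, lam=1.0)    # discriminator reward only
naive_rl = organ_reward(0.8, 0.3, lam=0.0)  # objective reward only
mixed = organ_reward(0.8, 0.3, lam=0.5)     # the paper's default setting
```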
<p>The generator $G_{\theta}$ is an LSTM-based RNN that produces sequences token-by-token. Training follows the REINFORCE algorithm, where the expected long-term reward is:</p>
<p>$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_{0}, \theta\right] = \sum_{y_{1} \in Y} G_{\theta}(y_{1} \mid s_{0}) \cdot Q(s_{0}, y_{1})$$</p>
<p>For intermediate timesteps (partial sequences), the action-value function $Q$ is estimated via $N$-time Monte Carlo rollouts:</p>
<p>$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), &amp; \text{if } t &lt; T \\ R(Y_{1:T}), &amp; \text{if } t = T \end{cases}$$</p>
<p>where $Y_{1:T}^{n}$ are completions sampled by rolling out the current policy $G_{\theta}$ from state $Y_{1:t}$.</p>
<p>The policy gradient is:</p>
<p>$$\nabla_{\theta} J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_{t} \sim G_{\theta}(y_{t} \mid Y_{1:t-1})} \left[\nabla_{\theta} \log G_{\theta}(y_{t} \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_{t})\right]$$</p>
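<p>The Monte Carlo rollout estimate of $Q$ can be sketched with a toy sequence problem (tokens are bits, the reward is the fraction of ones, and a uniform policy stands in for $G_\theta$; none of these specifics come from the paper):</p>

```python
import random

def mc_q_value(prefix, policy_rollout, reward_fn, n_rollouts=8, T=10):
    """Estimate Q(prefix) by completing the sequence N times with the current
    policy and averaging terminal rewards; terminal states use the exact reward."""
    if len(prefix) == T:
        return reward_fn(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        total += reward_fn(policy_rollout(prefix, T))
    return total / n_rollouts

rng = random.Random(0)
rollout = lambda p, T: p + [rng.randint(0, 1) for _ in range(T - len(p))]
reward = lambda seq: sum(seq) / len(seq)        # fraction of 1s
q = mc_q_value([1, 1, 1], rollout, reward, n_rollouts=200, T=10)
# Expected value: (3 ones + 7 * 0.5) / 10 = 0.65
```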
<p>Two additional mechanisms improve training:</p>
<ol>
<li><strong>Diversity penalty</strong>: Repeated sequences have their reward divided by their copy count, providing diminishing returns for non-unique outputs.</li>
<li><strong>Wasserstein distance</strong>: The authors also implement a variant (OR(W)GAN) that replaces the standard GAN discriminator loss with the Wasserstein-1 distance via Kantorovich-Rubinstein duality, which can improve training stability and diversity.</li>
</ol>
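<p>The diversity penalty in particular is a simple per-batch bookkeeping step, sketched here with toy SMILES strings and reward values:</p>

```python
from collections import Counter

def diversity_penalized_rewards(samples, rewards):
    """Divide each sequence's reward by its copy count in the batch, so
    duplicated outputs earn diminishing returns."""
    counts = Counter(samples)
    return [r / counts[s] for s, r in zip(samples, rewards)]

batch = ["CCO", "CCO", "CCO", "c1ccccc1"]
raw = [0.9, 0.9, 0.9, 0.6]
penalized = diversity_penalized_rewards(batch, raw)
# The triplicated "CCO" drops to 0.3 each; the unique molecule keeps 0.6.
```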
<h2 id="molecular-and-musical-melody-generation-experiments">Molecular and Musical Melody Generation Experiments</h2>
<h3 id="architecture">Architecture</h3>
<p>The generator $G_{\theta}$ is an RNN with LSTM cells. The discriminator $D_{\phi}$ is a CNN for text classification following Kim (2014), with 75% dropout and L2 regularization. All optimization uses Adam. Molecular metrics are computed with RDKit.</p>
<h3 id="molecular-generation-setup">Molecular Generation Setup</h3>
<p>Training data consists of 5,000 random molecules from the <a href="/notes/chemistry/datasets/qm9/">QM9</a> dataset (134k stable small molecules with up to 9 heavy atoms), encoded as SMILES strings with maximum sequence length 51 and alphabet size 43. Each generator is pre-trained for 250 MLE epochs, with the discriminator trained for 10 epochs. Adversarial/RL training then proceeds for up to 100 additional epochs. The default $\lambda$ is 0.5.</p>
<p>Three molecular objectives are evaluated:</p>
<ul>
<li><strong>Solubility (LogP)</strong>: water-octanol partition coefficient via RDKit&rsquo;s Crippen function</li>
<li><strong>Synthesizability</strong>: SA score estimating ease of synthesis (0 = hard, 1 = easy)</li>
<li><strong>Druglikeness</strong>: QED score capturing medicinal chemistry aesthetics</li>
</ul>
<p>Diversity is measured using average Jaccard distance of molecular fingerprints relative to a random training subset.</p>
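<p>The Jaccard-distance diversity metric is simple to compute. A pure-Python sketch using sets of on-bit indices in place of RDKit fingerprint objects (the paper uses RDKit molecular fingerprints; the averaging helper is hypothetical):</p>

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard (Tanimoto) distance between two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union

def avg_diversity(gen_fps, ref_fps):
    """Mean Jaccard distance of generated fingerprints to a reference subset."""
    pairs = [(g, r) for g in gen_fps for r in ref_fps]
    return sum(jaccard_distance(g, r) for g, r in pairs) / len(pairs)
```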
<h3 id="molecular-generation-results">Molecular Generation Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Validity (%)</th>
          <th>Diversity</th>
          <th>Druglikeness</th>
          <th>Synthesizability</th>
          <th>Solubility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>75.9</td>
          <td>0.64</td>
          <td>0.48 (0%)</td>
          <td>0.23 (0%)</td>
          <td>0.30 (0%)</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>80.3</td>
          <td>0.61</td>
          <td>0.49 (+2%)</td>
          <td>0.25 (+6%)</td>
          <td>0.31 (+3%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>ORGAN</td>
          <td>88.2</td>
          <td>0.55</td>
          <td>0.52 (+8%)</td>
          <td>0.32 (+38%)</td>
          <td>0.35 (+18%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>OR(W)GAN</td>
          <td>85.0</td>
          <td>0.95</td>
          <td>0.60 (+25%)</td>
          <td>0.54 (+130%)</td>
          <td>0.47 (+57%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Naive RL</td>
          <td>97.1</td>
          <td>0.80</td>
          <td>0.57 (+19%)</td>
          <td>0.53 (+126%)</td>
          <td>0.50 (+67%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>ORGAN</td>
          <td>96.5</td>
          <td>0.92</td>
          <td>0.51 (+6%)</td>
          <td>0.83 (+255%)</td>
          <td>0.45 (+52%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>OR(W)GAN</td>
          <td>97.6</td>
          <td>1.00</td>
          <td>0.20 (-59%)</td>
          <td>0.75 (+223%)</td>
          <td>0.84 (+184%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>ORGAN</td>
          <td>94.7</td>
          <td>0.76</td>
          <td>0.50 (+4%)</td>
          <td>0.63 (+171%)</td>
          <td>0.55 (+85%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>OR(W)GAN</td>
          <td>94.1</td>
          <td>0.90</td>
          <td>0.42 (-12%)</td>
          <td>0.66 (+185%)</td>
          <td>0.54 (+81%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>Naive RL</td>
          <td>92.7</td>
          <td>0.75</td>
          <td>0.49 (+3%)</td>
          <td>0.70 (+200%)</td>
          <td>0.78 (+162%)</td>
      </tr>
      <tr>
          <td>All (alternated)</td>
          <td>ORGAN</td>
          <td>96.1</td>
          <td>0.92</td>
          <td>0.52 (+9%)</td>
          <td>0.71 (+206%)</td>
          <td>0.53 (+79%)</td>
      </tr>
  </tbody>
</table>
<p>Key observations: OR(W)GAN consistently achieves higher diversity than standard ORGAN. Naive RL often reaches higher raw objective scores, but at the cost of trivial solutions (e.g., simple atom chains for solubility). Multi-objective training, alternating objectives across epochs, achieves gains comparable to those of individually optimized models.</p>
<h3 id="music-generation-setup">Music Generation Setup</h3>
<p>The music experiments use 1,000 melodies from the EsAC folk dataset, each encoded as a 36-token sequence in which tokens represent sixteenth-note events spanning three octaves (C3-B5). Two metrics are optimized: tonality (the proportion of perfect fifths) and ratio of steps (the proportion of conjunct melodic motion). Diversity is measured as the average pairwise edit distance.</p>
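<p>Both music metrics reduce to counting intervals between consecutive notes. A hedged sketch assuming MIDI-style integer pitches; the paper&rsquo;s exact token handling (e.g., holds and rests) is not reproduced here:</p>

```python
def _intervals(pitches):
    """Absolute semitone intervals between consecutive notes."""
    return [abs(b - a) for a, b in zip(pitches, pitches[1:])]

def tonality(pitches):
    """Fraction of intervals that are perfect fifths (7 semitones)."""
    ivs = _intervals(pitches)
    return sum(i == 7 for i in ivs) / len(ivs) if ivs else 0.0

def ratio_of_steps(pitches):
    """Fraction of intervals that are conjunct (1 or 2 semitones)."""
    ivs = _intervals(pitches)
    return sum(i in (1, 2) for i in ivs) / len(ivs) if ivs else 0.0
```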
<h3 id="music-results">Music Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Diversity</th>
          <th>Tonality</th>
          <th>Ratio of Steps</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>0.221</td>
          <td>0.007</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>0.187</td>
          <td>0.005</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Naive RL</td>
          <td>0.100</td>
          <td>0.478</td>
          <td>2.9E-05</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>ORGAN</td>
          <td>0.268</td>
          <td>0.372</td>
          <td>1.78E-04</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>OR(W)GAN</td>
          <td>0.268</td>
          <td>0.177</td>
          <td>2.4E-04</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Naive RL</td>
          <td>0.321</td>
          <td>0.001</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>ORGAN</td>
          <td>0.433</td>
          <td>0.001</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>OR(W)GAN</td>
          <td>0.134</td>
          <td>5.95E-05</td>
          <td>0.622</td>
      </tr>
  </tbody>
</table>
<p>ORGAN outperforms SeqGAN and MLE on all metrics. Naive RL achieves higher raw scores but with lower diversity, producing simpler, less interesting outputs.</p>
<h2 id="capacity-ceilings-trade-offs-and-future-directions">Capacity Ceilings, Trade-offs, and Future Directions</h2>
<p>The authors identify several limitations and findings:</p>
<p><strong>Capacity ceiling</strong>: GAN-based models tend to generate sequences matching the training set&rsquo;s average length (15.42 characters). RL-only approaches can break this constraint, generating shorter (9.4) or longer (21.3) sequences depending on the objective. The upper bound of optimized properties also matches the training data&rsquo;s maximum, suggesting dataset-dependent limits.</p>
<p><strong>Lambda trade-off</strong>: Varying $\lambda$ reveals an optimal balance between objective optimization and distributional fidelity. This optimum depends on the model, dataset, and metric, suggesting that hyperparameter search over $\lambda$ is important in practice.</p>
<p><strong>Tonality vs. steps inverse relationship</strong>: In the music task, optimizing for tonality (perfect fifths) inherently conflicts with optimizing for step ratios (consecutive notes), since consecutive scale notes do not form perfect fifths.</p>
<p><strong>Limitations</strong>: The paper evaluates on relatively small datasets (5k molecules, 1k melodies) and short sequences. The molecular experiments use QM9 (small molecules with up to 9 heavy atoms), which limits the scope of conclusions for drug-like chemical space. The Wasserstein variant sometimes lags behind the standard GAN loss in raw metric scores, though it offers better diversity.</p>
<p><strong>Future directions</strong>: The authors propose extending ORGAN to non-sequential data (images, audio) by framing GANs as RL problems more broadly, and investigating how different heuristic choices affect performance. They also suggest exploring other discrete GAN formulations (MaliGAN, BGAN) with RL extensions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecular training</td>
          <td>QM9 subset</td>
          <td>5,000 molecules</td>
          <td>Random subset from 134k stable small molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Music training</td>
          <td>EsAC folk dataset</td>
          <td>1,000 melodies</td>
          <td>36-token sequences, processed following Chen et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Generator pre-trained for 250 epochs via MLE; discriminator for 10 epochs</li>
<li>Adversarial/RL training for up to 100 epochs</li>
<li>Default $\lambda = 0.5$ for reward mixing</li>
<li>Monte Carlo rollouts for intermediate reward estimation</li>
<li>Duplicate penalty: reward divided by copy count</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Generator</strong>: RNN with LSTM cells</li>
<li><strong>Discriminator</strong>: CNN for text classification (Kim, 2014) with 75% dropout, L2 regularization</li>
<li><strong>Optimizer</strong>: Adam for all gradient descent steps</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Domain</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (%)</td>
          <td>Fraction of generated SMILES that decode to valid molecules</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Average Jaccard distance of fingerprints to training subset</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Druglikeness (QED)</td>
          <td>Quantitative Estimate of Drug-likeness</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Synthesizability (SA)</td>
          <td>Synthetic accessibility score</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Solubility (LogP)</td>
          <td>Water-octanol partition coefficient</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Proportion of perfect fifths</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Proportion of conjunct melodic intervals</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Diversity (edit)</td>
          <td>Average pairwise edit distance</td>
          <td>Music</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gablg1/ORGAN">ORGAN</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Official implementation including metrics for molecules and music</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guimaraes, G. L., Sánchez-Lengeling, B., Outeiral, C., Farias, P. L. C., &amp; Aspuru-Guzik, A. (2017). Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. <em>arXiv preprint arXiv:1705.10843</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guimaraes2017organ,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guimaraes, Gabriel Lima and Sanchez-Lengeling, Benjamin and Outeiral, Carlos and Farias, Pedro Luis Cunha and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1705.10843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/</guid><description>Nam and Kim apply a GRU-based seq2seq model with attention to predict organic reaction products from SMILES, pioneering the NMT approach to chemistry.</description><content:encoded><![CDATA[<h2 id="pioneering-seq2seq-translation-for-reaction-prediction">Pioneering Seq2Seq Translation for Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper. It introduces the idea of applying neural machine translation (NMT) to organic chemistry reaction prediction by framing product prediction as a sequence-to-sequence translation problem from reactant/reagent <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> to product SMILES. This was one of the earliest works to demonstrate that a data-driven encoder-decoder model could predict reaction products without any hand-coded reaction rules or SMARTS transformations.</p>
<h2 id="limitations-of-existing-reaction-prediction-methods">Limitations of Existing Reaction Prediction Methods</h2>
<p>Prior computational approaches to reaction prediction fell into three categories, each with significant drawbacks:</p>
<ol>
<li>
<p><strong>Rule-based methods</strong> (e.g., CAMEO, EROS) relied on manually encoded reaction rules. They performed well on reactions covered by the rules but required continuous manual encoding as new reaction types were discovered. Many older systems became outdated for this reason.</p>
</li>
<li>
<p><strong>Physical calculation methods</strong> computed energies of transition states from plausible reaction pathways using quantum mechanics. While principled, these approaches carried high computational cost. Simplified approaches (ToyChem, ROBIA) traded accuracy for speed.</p>
</li>
<li>
<p><strong>Machine learning methods</strong> at the time either predicted individual mechanistic steps (requiring tree search for multi-step reactions) or classified reaction types and applied SMARTS transformations to generate products. The classification-based approach of Wei et al. still required manual encoding of SMARTS transformations for new reaction types and struggled with ambiguous reaction classes.</p>
</li>
</ol>
<p>The key gap was the absence of a method that could predict reaction products directly from input molecules, learn from data alone, and generalize to new reaction types without manual rule encoding.</p>
<h2 id="core-innovation-reactions-as-machine-translation">Core Innovation: Reactions as Machine Translation</h2>
<p>The central insight is that SMILES strings can be treated as a language with its own grammar: predicting reaction products then becomes a problem of translating &ldquo;reactant and reagent&rdquo; sentences into &ldquo;product&rdquo; sentences.</p>
<p>The model uses a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a>-based encoder-decoder architecture with attention:</p>
<ul>
<li><strong>Encoder</strong>: 3 layers of GRU cells that process the reversed, tokenized SMILES string of reactants and reagents</li>
<li><strong>Decoder</strong>: 3 layers of GRU cells that generate product SMILES tokens autoregressively</li>
<li><strong>Attention mechanism</strong>: allows the decoder to attend to relevant encoder states at each generation step</li>
<li><strong>Embedding dimension</strong>: 600</li>
<li><strong>Vocabulary</strong>: 311 input tokens (reactants/reagents), 180 output tokens (products)</li>
<li><strong>Bucketed sequences</strong>: four bucket sizes handle variable-length inputs and outputs: (54, 54), (70, 60), (90, 65), (150, 80)</li>
</ul>
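<p>Bucketing pads each example to the smallest bucket that fits it, as in the TensorFlow seq2seq tutorial the paper follows. A minimal sketch of the selection rule (the helper name is hypothetical):</p>

```python
# Bucket sizes (source_len, target_len) from the paper.
BUCKETS = [(54, 54), (70, 60), (90, 65), (150, 80)]

def pick_bucket(src_len, tgt_len, buckets=BUCKETS):
    """Return the smallest bucket that fits both sequence lengths;
    sequences exceeding all buckets are filtered out (None)."""
    for b in buckets:
        if src_len <= b[0] and tgt_len <= b[1]:
            return b
    return None
```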
<p>The SMILES tokenization uses a <a href="https://en.wikipedia.org/wiki/Parsing_expression_grammar">PEG</a>-based parser that splits SMILES strings into atoms, bonds, branching symbols, and ring closure numbers. Input sequences are reversed before feeding to the encoder, following standard practice in NMT at the time.</p>
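<p>A simple regex tokenizer approximates this behavior; the paper&rsquo;s actual PEG grammar may differ in details:</p>

```python
import re

# Approximate SMILES tokenizer: bracket atoms first, then two-letter
# elements, ring closures, and finally single-character tokens.
_SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|\d|[A-Za-z]|[-=#$:/\\().+])"
)

def tokenize_smiles(smiles):
    tokens = _SMILES_TOKEN.findall(smiles)
    # Sanity check: tokens must reconstruct the input exactly.
    assert "".join(tokens) == smiles, "unrecognized character in SMILES"
    return tokens
```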
<p>The translation objective finds the product sequence $\mathbf{y}$ that maximizes the conditional probability:</p>
<p>$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x})$$</p>
<p>where $\mathbf{x}$ is the tokenized reactant/reagent sequence and $T$ is the product sequence length.</p>
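<p>Decoding under this factorization picks tokens step by step. A minimal greedy sketch (beam search is the usual alternative; <code>step_fn</code> is a hypothetical callable returning next-token probabilities, and the 80-token cap mirrors the largest product bucket):</p>

```python
def greedy_decode(step_fn, x, eos, max_len=80):
    """Greedy argmax decoding of p(y | x) factorized over timesteps.

    step_fn : callable(x, prefix) -> dict mapping token -> probability
    """
    y = []
    for _ in range(max_len):
        probs = step_fn(x, y)
        tok = max(probs, key=probs.get)   # most probable next token
        y.append(tok)
        if tok == eos:
            break
    return y
```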
<h2 id="training-data-and-experimental-evaluation">Training Data and Experimental Evaluation</h2>
<h3 id="training-sets">Training Sets</h3>
<p>Two training sets were constructed:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Patent reactions (&ldquo;real&rdquo;)</td>
          <td style="text-align: left">1,094,235</td>
          <td style="text-align: left">USPTO patent applications (2001-2013), filtered by length</td>
      </tr>
      <tr>
          <td style="text-align: left">Generated reactions (&ldquo;gen&rdquo;)</td>
          <td style="text-align: left">865,118</td>
          <td style="text-align: left">75 reaction types from Wade&rsquo;s organic chemistry textbook, applied to <a href="/notes/chemistry/datasets/gdb-11/">GDB-11</a> molecules (1-10 atoms)</td>
      </tr>
  </tbody>
</table>
<p>The &ldquo;real&rdquo; set was filtered to exclude reactions with reactant/reagent strings longer than 150 characters, product strings longer than 80 characters, or more than four products. The &ldquo;gen&rdquo; set was constructed by applying reaction templates (encoded as SMARTS) to small molecules from GDB-11, covering five substrate types: acid derivatives, alcohols, aldehydes/ketones, alkenes, and alkynes.</p>
<p>Two models were compared: a &ldquo;gen&rdquo; model (trained only on generated reactions) and a &ldquo;real+gen&rdquo; model (trained on both sets).</p>
<h3 id="textbook-problem-evaluation">Textbook Problem Evaluation</h3>
<p>The models were tested on 10 problem sets from Wade&rsquo;s textbook, following the evaluation approach of Wei et al. Each problem set contained 6-15 reactions. Evaluation metrics included the ratio of fully correct predictions and the average <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between Morgan fingerprints of predicted and actual products.</p>
<p>The &ldquo;real+gen&rdquo; model outperformed the &ldquo;gen&rdquo; model on most problem sets. On problem set 17-44 (aromatic compound reactions, only present in the &ldquo;real&rdquo; training set), the &ldquo;real+gen&rdquo; model correctly answered 4 out of 11 problems while the &ldquo;gen&rdquo; model answered 2. The &ldquo;gen&rdquo; model&rsquo;s ability to correctly predict some aromatic reactions despite never being trained on them suggests the model can extrapolate to unseen reaction patterns.</p>
<p>For <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reactions</a> (problem set 15-30), neither model achieved fully correct predictions for all problems, though the &ldquo;real+gen&rdquo; model showed better Tanimoto scores, indicating partially correct structural predictions even when the exact product was missed.</p>
<h3 id="scalability-testing">Scalability Testing</h3>
<p>A scalability test used generated reactions with substrate molecules containing 11-16 atoms (larger than the training set molecules with fewer than 11 atoms). Results showed:</p>
<ul>
<li>The &ldquo;real+gen&rdquo; model maintained Tanimoto scores around 0.7 and error rates around 0.4 as substrate atom count increased</li>
<li>The ratio of fully correct predictions decreased as atom count increased, revealing that the recurrent network struggled with longer input sequences</li>
<li>The &ldquo;real+gen&rdquo; model produced fewer invalid SMILES strings than the &ldquo;gen&rdquo; model, likely because training on more reactions improved the decoder&rsquo;s ability to generate syntactically valid SMILES</li>
</ul>
<h3 id="attention-analysis">Attention Analysis</h3>
<p>Visualization of attention weights revealed a limitation: the decoder cells predominantly attended to the first few encoder cells rather than distributing attention across the full input sequence. This means the attention mechanism was not learning meaningful &ldquo;alignment&rdquo; between reactant atoms and product atoms. The authors note that if decoder cells generating tokens for unreactive sites could attend to the corresponding encoder cells (analogous to atom mapping), prediction quality on longer sequences could improve.</p>
<h3 id="token-embedding-analysis">Token Embedding Analysis</h3>
<p>t-SNE visualization of the learned token embeddings showed that encoder and decoder tokens clustered primarily by syntactic similarity rather than chemical properties. The model did not learn chemically meaningful embeddings, which the authors identify as an area for future improvement.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>Treating reaction prediction as NMT is viable: the seq2seq model can predict products without any hand-coded rules</li>
<li>Training on real patent data significantly improves prediction over generated data alone</li>
<li>The model can extrapolate to reaction types not seen during training (e.g., the &ldquo;gen&rdquo; model predicting aromatic reactions)</li>
<li>Compared to the fingerprint-based approach of Wei et al., this method performed better on textbook problems and eliminated the need for manual SMARTS encoding</li>
</ul>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Invalid SMILES generation</strong>: the token-by-token generation process can produce syntactically invalid SMILES (e.g., mismatched parentheses), which the authors scored as zero</li>
<li><strong>Sequence length degradation</strong>: prediction accuracy dropped for longer SMILES strings, a known limitation of RNN-based seq2seq models at the time</li>
<li><strong>Poor attention alignment</strong>: attention weights collapsed to the first encoder positions rather than learning meaningful reactant-product correspondences</li>
<li><strong>Chemically naive embeddings</strong>: token embeddings did not capture chemical properties</li>
<li><strong>Multiple reaction pathways</strong>: reactions with competing pathways (e.g., substitution vs. elimination) were difficult for the model to handle</li>
</ul>
<h3 id="historical-significance">Historical Significance</h3>
<p>This paper is historically significant as one of the first (alongside concurrent work) to propose the NMT framing for reaction prediction. This framing was later adopted and refined by the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> (Schwaller et al., 2019), which replaced GRUs with the Transformer architecture and achieved over 90% top-1 accuracy on standard benchmarks. The conceptual contribution of treating SMILES-to-SMILES translation as machine translation became the foundation of an entire subfield.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training (real)</td>
          <td style="text-align: left">USPTO patent reactions</td>
          <td style="text-align: left">1,094,235</td>
          <td style="text-align: left">2001-2013 applications, filtered by length</td>
      </tr>
      <tr>
          <td style="text-align: left">Training (gen)</td>
          <td style="text-align: left">Generated from Wade textbook templates</td>
          <td style="text-align: left">865,118</td>
          <td style="text-align: left">75 reaction types, GDB-11 substrates</td>
      </tr>
      <tr>
          <td style="text-align: left">Testing (textbook)</td>
          <td style="text-align: left">Wade textbook problems</td>
          <td style="text-align: left">~100</td>
          <td style="text-align: left">10 problem sets, 6-15 reactions each</td>
      </tr>
      <tr>
          <td style="text-align: left">Testing (scalability)</td>
          <td style="text-align: left">Generated from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td style="text-align: left">2,400</td>
          <td style="text-align: left">400 per atom count (11-16)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GRU-based encoder-decoder with attention mechanism</li>
<li>PEG-based SMILES tokenizer</li>
<li>Input sequence reversal</li>
<li>Bucketed training with four bucket sizes</li>
<li>TensorFlow seq2seq tutorial implementation with default learning rate</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">GRU layers</td>
          <td style="text-align: left">3</td>
      </tr>
      <tr>
          <td style="text-align: left">Embedding size</td>
          <td style="text-align: left">600</td>
      </tr>
      <tr>
          <td style="text-align: left">Input vocabulary</td>
          <td style="text-align: left">311 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left">Output vocabulary</td>
          <td style="text-align: left">180 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left">Buckets</td>
          <td style="text-align: left">(54,54), (70,60), (90,65), (150,80)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">gen Model</th>
          <th style="text-align: left">real+gen Model</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Textbook correct ratio</td>
          <td style="text-align: left">Variable by set</td>
          <td style="text-align: left">Higher on most sets</td>
          <td style="text-align: left">10 problem sets</td>
      </tr>
      <tr>
          <td style="text-align: left">Average Tanimoto similarity</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">~0.7 on scalability test</td>
          <td style="text-align: left">Morgan fingerprint based</td>
      </tr>
      <tr>
          <td style="text-align: left">Invalid SMILES ratio</td>
          <td style="text-align: left">Higher</td>
          <td style="text-align: left">~0.4 on scalability test</td>
          <td style="text-align: left">Decreases with more training data</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nam, J. &amp; Kim, J. (2016). Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions. <em>arXiv preprint</em>, arXiv:1612.09529. <a href="https://arxiv.org/abs/1612.09529">https://arxiv.org/abs/1612.09529</a></p>
<p><strong>Publication</strong>: arXiv preprint 2016</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{nam2016linking,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nam, Juno and Kim, Jurae}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1612.09529}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1612.09529}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoMu: Bridging Molecular Graphs and Natural Language</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/</guid><description>MoMu bridges molecular graphs and natural language via contrastive pre-training, enabling cross-modal retrieval, captioning, and property prediction.</description><content:encoded><![CDATA[<h2 id="bridging-molecular-graphs-and-natural-language-through-contrastive-learning">Bridging Molecular Graphs and Natural Language Through Contrastive Learning</h2>
<p>MoMu (Molecular Multimodal foundation model) is a <strong>Method</strong> paper that proposes a multimodal pre-training approach to associate molecular graphs with natural language descriptions. The primary contribution is a dual-encoder architecture, consisting of a Graph Isomorphism Network (GIN) for molecular graphs and a BERT-based text encoder, jointly trained through contrastive learning on weakly-correlated graph-text pairs collected from scientific literature. The pre-trained model supports four downstream capabilities: cross-modal retrieval (graph-to-text and text-to-graph), molecule captioning, zero-shot text-to-graph molecule generation, and molecular property prediction.</p>
<h2 id="why-single-modality-models-are-insufficient-for-molecular-understanding">Why Single-Modality Models Are Insufficient for Molecular Understanding</h2>
<p>Existing AI models for molecular tasks generally operate on a single modality and learn a single cognitive ability. Language-based models process <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings or natural language texts and handle tasks like property prediction from strings, literature comprehension, or SMILES-based generation. Graph-based models use molecular graph representations and handle graph-level property prediction or graph generation. Neither category connects structural information from molecular graphs with the rich semantic knowledge encoded in scientific texts.</p>
<p>Prior work by Zeng et al. (KV-PLM) jointly modeled molecule-related texts and SMILES strings, but SMILES representations have inherent drawbacks: they are one-dimensional and may lose structural information, they cannot capture structural similarities between molecules, and a single molecule can have multiple valid SMILES representations. Molecular graphs, by contrast, are more intuitive and better reveal functional structures. Human experts learn molecular knowledge by associating both graphical representations and textual descriptions, yet no prior model bridged these two modalities directly.</p>
<p>The key challenge is the scarcity of paired molecular graph-text data compared to general image-text datasets. Additionally, learning specialized molecular knowledge requires foundational cognitive abilities in both the graph and text domains, making training from scratch infeasible with limited data.</p>
<h2 id="contrastive-pre-training-with-inter-modal-and-intra-modal-objectives">Contrastive Pre-Training with Inter-Modal and Intra-Modal Objectives</h2>
<p>MoMu consists of two encoders initialized from pre-trained unimodal models: a GIN graph encoder initialized from GraphCL self-supervised weights, and a BERT text encoder initialized from either Sci-BERT (yielding MoMu-S) or KV-PLM (yielding MoMu-K).</p>
<h3 id="data-collection">Data Collection</h3>
<p>The authors collect 15,613 molecular graph-document pairs by:</p>
<ol>
<li>Gathering names, synonyms, and SMILES for the top 50K compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></li>
<li>Converting SMILES to molecular graphs using the OGB <code>smiles2graph</code> function</li>
<li>Retrieving related text from the S2ORC corpus (136M+ papers) by querying with molecule names, filtering to Medicine, Biology, Chemistry, and Computer Science fields</li>
<li>Restricting retrieval to abstract, introduction, and conclusion sections to avoid experimental data artifacts</li>
</ol>
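<p>The retrieval step (step 3) can be sketched with toy data; the dictionaries, field names, and helper below are illustrative stand-ins, not the authors' pipeline:</p>

```python
def retrieve_texts(compound_names, paragraphs, allowed_fields):
    """Pair each compound with paragraphs that mention any of its names,
    keeping only the allowed fields and the abstract/intro/conclusion
    sections, as described above."""
    pairs = {}
    for cid, names in compound_names.items():
        matches = [
            p["text"]
            for p in paragraphs
            if p["field"] in allowed_fields
            and p["section"] in {"abstract", "introduction", "conclusion"}
            and any(n.lower() in p["text"].lower() for n in names)
        ]
        if matches:
            pairs[cid] = matches
    return pairs

# toy compound list and corpus (stand-ins for PubChem names and S2ORC)
compounds = {"cid1": ["aspirin", "acetylsalicylic acid"]}
corpus = [
    {"text": "Aspirin inhibits COX enzymes.", "field": "Medicine", "section": "abstract"},
    {"text": "Aspirin tablets, 100 mg.", "field": "Medicine", "section": "methods"},
]
pairs = retrieve_texts(compounds, corpus,
                       {"Medicine", "Biology", "Chemistry", "Computer Science"})
```

<p>The methods-section paragraph is filtered out, mirroring the restriction to abstract, introduction, and conclusion sections.</p>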
<h3 id="contrastive-training-objective">Contrastive Training Objective</h3>
<p>For each graph-text pair in a mini-batch of $N$ pairs, MoMu applies two graph augmentations (node dropping and subgraph extraction) to create two augmented graphs, and randomly samples two sentences from the document. This produces $2N$ graph representations $\{z_1^G, \tilde{z}_1^G, \ldots, z_N^G, \tilde{z}_N^G\}$ and $2N$ text representations $\{z_1^T, \tilde{z}_1^T, \ldots, z_N^T, \tilde{z}_N^T\}$.</p>
<p>The cross-modal contrastive loss for a pair $(z_i^G, z_i^T)$ is:</p>
<p>$$
\ell_i^{(z_i^G, z_i^T)} = -\log \frac{\exp(\text{sim}(z_i^G, z_i^T) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, z_j^T) / \tau)}
$$</p>
<p>where $\tau$ is the temperature parameter and $\text{sim}(\cdot, \cdot)$ projects both representations into a shared 256-dimensional space before computing cosine similarity. The total cross-modal loss includes four contrastive terms for each pair: $(z_i^G, z_i^T)$, $(\tilde{z}_i^G, z_i^T)$, $(z_i^G, \tilde{z}_i^T)$, and $(\tilde{z}_i^G, \tilde{z}_i^T)$.</p>
<p>An intra-modal graph contrastive loss further strengthens the graph encoder:</p>
<p>$$
\ell_i^{(z_i^G, \tilde{z}_i^G)} = -\log \frac{\exp(\text{sim}(z_i^G, \tilde{z}_i^G) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, \tilde{z}_j^G) / \tau)}
$$</p>
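<p>A minimal NumPy sketch of these InfoNCE terms; the batch size, dimensionality, and toy embeddings are illustrative, not the authors' implementation:</p>

```python
import numpy as np

def info_nce(anchors, targets, tau=0.1):
    """Per-row InfoNCE: row i of `anchors` matches row i of `targets`;
    all other rows of `targets` serve as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ t.T / tau                          # N x N cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # -log softmax on the diagonal

rng = np.random.default_rng(0)
z_g = rng.normal(size=(8, 256))   # graph projections in the 256-dim shared space
z_t = rng.normal(size=(8, 256))   # text projections
z_g_aug = z_g + 0.01 * rng.normal(size=z_g.shape)  # augmented-graph views

cross_modal = info_nce(z_g, z_t)       # one of the four cross-modal terms
intra_modal = info_nce(z_g, z_g_aug)   # the intra-modal graph-graph term
```

<p>In the paper each pair contributes four cross-modal terms (over the original and augmented views) plus the intra-modal graph term; the sketch computes one representative of each.</p>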
<h3 id="zero-shot-text-to-graph-generation">Zero-Shot Text-to-Graph Generation</h3>
<p>MoMu enables a zero-shot generation pipeline by combining the pre-trained MoMu encoders with MoFlow, a flow-based molecular generator. Given an input text description $x^T$, the method:</p>
<ol>
<li>Samples a latent variable $q$ from MoFlow&rsquo;s Gaussian prior $P(q)$</li>
<li>Generates a molecular graph through MoFlow&rsquo;s reverse flows: $\hat{E} = f_g^{-1}(q_e)$ and $\hat{V} = f_c^{-1}(q_v \mid GN(\hat{E}))$</li>
<li>Feeds $\hat{V}$ (using soft atom type probabilities instead of hard assignments) into MoMu&rsquo;s graph encoder</li>
<li>Optimizes $q$ to maximize the cosine similarity between the resulting graph and text representations:</li>
</ol>
<p>$$
\ell_q = -\text{sim}(z^G, z^T) / \tau
$$</p>
<p>All MoMu and MoFlow parameters are frozen; only $q$ is updated via Adam for up to 500 iterations. The final molecule is obtained by applying argmax to the optimized probability matrices $\hat{V}$ and $\hat{E}$.</p>
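<p>The frozen-network optimization loop can be sketched as follows. Here a single fixed linear map <code>M</code> is a toy stand-in for MoFlow's decoder composed with MoMu's graph encoder, and plain gradient ascent (with an analytic cosine gradient) replaces Adam; only the latent <code>q</code> is updated, as in the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(64, 16))     # frozen stand-in for decode-then-encode
z_t = rng.normal(size=64)         # text representation of the prompt
q = rng.normal(size=16)           # latent sampled from the prior

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

init_sim = cosine(M @ q, z_t)
lr = 0.1
for _ in range(500):                        # up to 500 iterations, as in the paper
    u = M @ q                               # "graph" representation for current q
    nu, nv = np.linalg.norm(u), np.linalg.norm(z_t)
    grad_u = z_t / (nu * nv) - (u @ z_t) * u / (nu**3 * nv)   # d cos(u, z_t) / d u
    q += lr * (M.T @ grad_u)                # chain rule through the linear map
final_sim = cosine(M @ q, z_t)
```

<p>After optimization the similarity exceeds its initial value; in the real method the decoded probability matrices are finally discretized via argmax.</p>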
<h2 id="evaluation-across-four-downstream-tasks">Evaluation Across Four Downstream Tasks</h2>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>MoMu is evaluated on the PCdes dataset (15K SMILES-description pairs from PubChem, split 10,500/1,500/3,000 for train/val/test). Retrieval is performed in mini-batches of 64 pairs, reporting top-1 accuracy and Recall@20.</p>
<p><strong>Graph-to-Text Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.38</td>
          <td>62.11</td>
          <td>62.57</td>
          <td>60.67</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>53.79</td>
          <td>66.63</td>
          <td>64.81</td>
          <td>63.87</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.92</td>
          <td>68.59</td>
          <td>77.92</td>
          <td>75.93</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>58.64</td>
          <td>80.59</td>
          <td>80.62</td>
          <td>79.11</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>58.74</td>
          <td>81.29</td>
          <td>81.09</td>
          <td>80.15</td>
      </tr>
  </tbody>
</table>
<p><strong>Text-to-Graph Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.12</td>
          <td>68.02</td>
          <td>61.75</td>
          <td>60.77</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>54.22</td>
          <td>71.80</td>
          <td>64.95</td>
          <td>64.27</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.61</td>
          <td>74.77</td>
          <td>77.03</td>
          <td>75.47</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>55.44</td>
          <td>76.92</td>
          <td>80.22</td>
          <td>79.02</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>54.94</td>
          <td>78.29</td>
          <td>81.45</td>
          <td>80.62</td>
      </tr>
  </tbody>
</table>
<p>In zero-shot retrieval (on a separate test set of 5,562 pairs not seen during pre-training), MoMu achieves approximately 39-46% accuracy compared to below 2% for Sci-BERT and KV-PLM, demonstrating strong generalization.</p>
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>MoMu&rsquo;s graph features are appended to MolT5&rsquo;s encoder inputs through a learned MLP mapping module on the ChEBI-20 dataset. Results show improvements in BLEU, METEOR, and Text2Mol scores when incorporating graph features, though ROUGE-L slightly drops. The graph structural information leads to more accurate captions for complex molecular structures.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The pre-trained graph encoder from MoMu is fine-tuned on eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> datasets using scaffold splitting and ROC-AUC evaluation (10 runs).</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>No Pre-Train</th>
          <th>GraphCL</th>
          <th>MoMu-S</th>
          <th>MoMu-K</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>65.8</td>
          <td>69.7</td>
          <td><strong>70.5</strong></td>
          <td>70.1</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>74.0</td>
          <td>73.9</td>
          <td>75.6</td>
          <td>75.6</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>63.4</td>
          <td>62.4</td>
          <td>63.4</td>
          <td>63.0</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>57.3</td>
          <td>60.5</td>
          <td>60.5</td>
          <td>60.4</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>58.0</td>
          <td>76.0</td>
          <td><strong>79.9</strong></td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>71.8</td>
          <td>69.8</td>
          <td>70.5</td>
          <td>71.1</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>75.3</td>
          <td><strong>78.5</strong></td>
          <td>75.9</td>
          <td>76.2</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>70.1</td>
          <td>75.4</td>
          <td>76.7</td>
          <td>77.1</td>
      </tr>
      <tr>
          <td><strong>Average</strong></td>
          <td>66.96</td>
          <td>70.78</td>
          <td><strong>71.63</strong></td>
          <td>71.36</td>
      </tr>
  </tbody>
</table>
<p>MoMu-S achieves the best average ROC-AUC (71.63%) across all eight datasets, outperforming GraphCL (70.78%), the self-supervised method used to initialize MoMu&rsquo;s graph encoder. MoMu outperforms GraphCL on six of eight datasets. Notably, MoMu-S and MoMu-K perform comparably, indicating that KV-PLM&rsquo;s SMILES-based knowledge does not transfer well to graph-based representations.</p>
<h3 id="zero-shot-text-to-graph-generation-1">Zero-Shot Text-to-Graph Generation</h3>
<p>The method generates molecules from three types of text descriptions:</p>
<ol>
<li><strong>High-level vague descriptions</strong> (e.g., &ldquo;The molecule is beautiful&rdquo;): MoMu generates diverse, interpretable molecules where &ldquo;beautiful&rdquo; tends to produce locally symmetric and stretched graphs, &ldquo;versatile&rdquo; produces molecules with varied elements and functional groups, and &ldquo;strange&rdquo; produces cluttered, irregular structures.</li>
<li><strong>Functional descriptions</strong> (e.g., &ldquo;fluorescent molecules&rdquo;, &ldquo;high water solubility and barrier permeability with low toxicity&rdquo;): MoMu successfully generates molecules with appropriate functional groups and properties. For the solubility/permeability/toxicity query, MoMu generates molecules that satisfy all three evaluable properties.</li>
<li><strong>Structural descriptions</strong> (e.g., &ldquo;molecules containing <a href="https://en.wikipedia.org/wiki/Nucleophile">nucleophilic</a> groups&rdquo;): MoMu generates diverse molecules with appropriate functional groups (amino, hydroxyl, carbonyl, halogen atoms).</li>
</ol>
<h2 id="promising-multimodal-transfer-with-clear-data-limitations">Promising Multimodal Transfer with Clear Data Limitations</h2>
<p>MoMu demonstrates that contrastive pre-training on weakly-correlated graph-text data can bridge molecular graphs and natural language in a shared representation space. The key findings are:</p>
<ol>
<li><strong>Cross-modal alignment works with limited data</strong>: With only 15K graph-text pairs (far fewer than the millions used in vision-language models like CLIP), MoMu achieves meaningful cross-modal retrieval and enables zero-shot generation.</li>
<li><strong>Multimodal supervision improves graph representations</strong>: The graph encoder supervised by text descriptions outperforms self-supervised methods (GraphCL, AttrMasking, ContextPred) on average across molecular property prediction benchmarks.</li>
<li><strong>SMILES knowledge does not transfer to graphs</strong>: MoMu-S and MoMu-K perform comparably across all tasks, showing that structural information learned from one-dimensional SMILES strings does not readily generalize to graph neural networks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several important limitations:</p>
<ul>
<li><strong>Data scarcity</strong>: 15K graph-text pairs is substantially smaller than general image-text datasets, potentially leaving the common space insufficiently aligned.</li>
<li><strong>Noisy supervision</strong>: Retrieved texts may mention a molecule by name without describing its properties or structure, leading to spurious correlations.</li>
<li><strong>Generator constraints</strong>: The zero-shot generation method is limited by MoFlow&rsquo;s capacity (maximum 38 atoms, 9 element types from ZINC250K training).</li>
<li><strong>Property coverage</strong>: Generation quality degrades for molecular properties that appear infrequently or not at all in the training texts.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose four avenues: (1) collecting larger-scale multimodal molecular data including 3D conformations, (2) using strongly-correlated paired data with more advanced generators, (3) developing interpretable tools for the learned cross-modal space, and (4) wet-lab validation of generated molecules.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Collected graph-text pairs (PubChem + S2ORC)</td>
          <td>15,613 pairs</td>
          <td>~37M paragraphs total; top 50K PubChem compounds</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>15K pairs (10.5K/1.5K/3K split)</td>
          <td>SMILES-description pairs from PubChem</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>ChEBI-20</td>
          <td>~33K pairs</td>
          <td>Used with MolT5</td>
      </tr>
      <tr>
          <td>Text-to-graph generation</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC250K</a> (MoFlow)</td>
          <td>250K molecules</td>
          <td>Pre-trained generator, max 38 atoms</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>Varies</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Graph augmentations</strong>: Node dropping (10% ratio) and subgraph extraction (80% of original size via random walk)</li>
<li><strong>Contrastive learning</strong>: InfoNCE loss with temperature $\tau = 0.1$, following the DeClip paradigm with both inter-modal and intra-modal objectives</li>
<li><strong>Zero-shot generation</strong>: Adam optimizer on latent variable $q$ for up to 500 iterations; formal charges prohibited in output</li>
</ul>
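<p>The two graph augmentations can be sketched in plain Python on an edge-list graph; the ratios match the paper, while the function signatures and the toy ring graph are illustrative:</p>

```python
import random

def node_drop(nodes, edges, ratio=0.1, rng=random.Random(0)):
    """Drop ~`ratio` of the nodes and all incident edges (paper uses 0.1)."""
    k = max(1, int(len(nodes) * ratio))
    dropped = set(rng.sample(sorted(nodes), k))
    kept = [n for n in nodes if n not in dropped]
    return kept, [(u, v) for (u, v) in edges if u not in dropped and v not in dropped]

def random_walk_subgraph(nodes, edges, keep=0.8, rng=random.Random(0)):
    """Grow a connected subgraph by random walk until ~`keep` of the nodes
    are covered (paper keeps 80% of the original size)."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    target = max(1, int(len(nodes) * keep))
    cur = rng.choice(sorted(nodes))
    visited = {cur}
    while len(visited) < target:
        nbrs = adj[cur]
        if not nbrs:
            break                     # walk is stuck on an isolated node
        cur = rng.choice(nbrs)
        visited.add(cur)
    return sorted(visited), [(u, v) for (u, v) in edges if u in visited and v in visited]

# toy 10-node ring graph
nodes = list(range(10))
edges = [(i, (i + 1) % 10) for i in range(10)]
kept_nodes, kept_edges = node_drop(nodes, edges)
sub_nodes, sub_edges = random_walk_subgraph(nodes, edges)
```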
<h3 id="models">Models</h3>
<ul>
<li><strong>Graph encoder</strong>: GIN with 5 layers, 300-dimensional hidden size, initialized from GraphCL checkpoint</li>
<li><strong>Text encoder</strong>: BERT-base (768 hidden size), initialized from Sci-BERT or KV-PLM</li>
<li><strong>Projection heads</strong>: Two MLPs projecting graph (300-dim) and text (768-dim) features to 256-dimensional shared space</li>
<li><strong>Optimizer</strong>: AdamW, learning rate 0.0001, weight decay 1e-5, 300 epochs, batch size 256</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Best Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>G-T Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.09 / 80.15 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>T-G Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.45 / 80.62 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>Zero-shot G-T Retrieval</td>
          <td>Accuracy</td>
          <td>~46%</td>
          <td>vs. ~1.4% for baselines</td>
      </tr>
      <tr>
          <td>Property Prediction</td>
          <td>ROC-AUC (avg)</td>
          <td>71.63%</td>
          <td>MoMu-S, 8 MoleculeNet datasets</td>
      </tr>
      <tr>
          <td>Molecule Captioning</td>
          <td>Text2Mol</td>
          <td>Improved over MolT5</td>
          <td>MoMu + MolT5-large</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x NVIDIA Tesla V100 PCIe 32GB GPUs</li>
<li>Framework: PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BingSu12/MoMu">MoMu code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Pre-training and downstream task code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/yangzhao1230/GraphTextRetrieval">GraphTextRetrieval</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Data collection and cross-modal retrieval code</td>
      </tr>
      <tr>
          <td><a href="https://pan.baidu.com/s/1aHJoYTTZWDHPCcRuu9I7Fg">Pre-training dataset</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Hosted on Baidu Pan (Chinese cloud storage)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Su, B., Du, D., Yang, Z., Zhou, Y., Li, J., Rao, A., Sun, H., Lu, Z., &amp; Wen, J.-R. (2022). A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language. arXiv preprint arXiv:2209.05481.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{su2022momu,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Su, Bing and Du, Dazhao and Yang, Zhao and Zhou, Yujie and Li, Jiangmeng and Rao, Anyi and Sun, Hao and Lu, Zhiwu and Wen, Ji-Rong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2209.05481}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolFM: Trimodal Molecular Foundation Pre-training</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/</guid><description>MolFM fuses molecular graphs, biomedical text, and knowledge graphs via cross-modal attention for joint molecular representation learning.</description><content:encoded><![CDATA[<h2 id="trimodal-pre-training-for-molecular-understanding">Trimodal Pre-training for Molecular Understanding</h2>
<p>MolFM is a <strong>Method</strong> paper that introduces a multimodal molecular foundation model integrating three distinct sources of molecular knowledge: 2D molecular graphs, biomedical text, and knowledge graphs. The primary contribution is a pre-training framework that uses fine-grained cross-modal attention to fuse information across all three modalities, combined with theoretical justification from a deep metric learning perspective. MolFM achieves the best reported results (at time of publication) on cross-modal retrieval, molecule captioning, text-based molecule generation, and molecular property prediction.</p>
<h2 id="why-existing-molecular-models-fall-short">Why Existing Molecular Models Fall Short</h2>
<p>Prior multimodal molecular foundation models operate on at most two modalities (structures and text) and suffer from two key limitations. First, generative approaches like KV-PLM and MolT5 rely on 1D <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, which cannot capture complex topological and spatial molecular properties such as macrocycles. Contrastive approaches like <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a> and MoleculeSTM learn global alignment between molecule graphs and text but overlook fine-grained connections between specific substructures and textual descriptions.</p>
<p>Second, and more fundamentally, no prior model incorporates <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a> as a third modality. Knowledge graphs encode global-level relationships among molecules, target ligands, diseases, and other biomedical entities. These relationships capture functional and structural similarity patterns that cannot be learned from individual molecule-text pairs alone. MolFM addresses both gaps by introducing cross-modal attention across all three modalities and providing theoretical guarantees about what the pre-training objectives learn.</p>
<h2 id="cross-modal-attention-and-metric-learning-guarantees">Cross-Modal Attention and Metric Learning Guarantees</h2>
<h3 id="architecture">Architecture</h3>
<p>MolFM uses three pre-trained single-modal encoders:</p>
<ul>
<li><strong>Molecular graph encoder</strong>: A 5-layer GIN (1.8M parameters) initialized from GraphMVP, producing atom-level features $h_{SA}$ and a graph-level feature $h_{SM}$</li>
<li><strong>Text encoder</strong>: A 6-layer transformer (61.8M parameters) initialized from KV-PLM&rsquo;s first 6 layers, producing token features $h_T$</li>
<li><strong>Knowledge graph encoder</strong>: A TransE model (12.6M parameters) trained on the knowledge graph for 500 epochs, producing entity features $h_K$</li>
</ul>
<p>A multimodal encoder (61.8M parameters, 6 transformer layers with cross-attention) fuses the three modalities. The cross-attention uses text token features as queries and the concatenation of atom features and knowledge graph neighbor features as keys and values. For each molecule, the knowledge graph input is the molecule&rsquo;s entity and $N=4$ randomly sampled one-hop neighbors.</p>
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>MolFM combines four losses:</p>
<p><strong>Structure-text contrastive (STC)</strong> aligns the global feature spaces of structure and text encoders using a symmetric InfoNCE loss:</p>
<p>$$\mathcal{L}_{stc} = -\frac{1}{2} \left[ \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{S' \in B} \exp(s(z_{S'}, z_T) / \tau)} + \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{T' \in B} \exp(s(z_S, z_{T'}) / \tau)} \right]$$</p>
<p>where $s(\cdot, \cdot)$ is cosine similarity and $\tau = 0.1$ is a temperature parameter.</p>
<p><strong>Cross-modal matching (CMM)</strong> predicts whether a structure-text-knowledge triplet corresponds to the same molecule, using cross-entropy over the multimodal encoder&rsquo;s CLS token:</p>
<p>$$\mathcal{L}_{cmm} = \sum_{(\tilde{S}, \tilde{T}, \tilde{K}) \in \tilde{B}} H\left[y_{cmm}(\tilde{S}, \tilde{T}, \tilde{K}),\; p_{cmm}\left(\mathcal{M}_\theta(h_{\tilde{S}}, h_{\tilde{T}}, h_{\tilde{K}})\right)\right]$$</p>
<p><strong>Masked language modeling (MLM)</strong> predicts masked text tokens conditioned on all three modalities:</p>
<p>$$\mathcal{L}_{mlm} = H\left[y_{mlm}(\hat{T}),\; p_{mlm}\left(\mathcal{M}_\theta(h_S, h_{\hat{T}}, h_K)\right)\right]$$</p>
<p><strong>Knowledge graph embedding (KGE)</strong> regularizes entity embeddings with a max-margin TransE loss:</p>
<p>$$\mathcal{L}_{kge} = \sum_{h \in K} \left[\max(0, d(h,r,t) - d(h,r,\tilde{t}) + \Delta) + \max(0, d(h,r,t) - d(\tilde{h},r,t) + \Delta)\right]$$</p>
<p>where $d(h,r,t) = \| f(h) + g(r) - f(t) \|_2$ and $\Delta = 0.2$.</p>
<p>The total pre-training loss is:</p>
<p>$$\mathcal{L} = \mathbb{E}_{(S,T,K)}\left[\mathcal{L}_{stc} + \mathcal{L}_{cmm} + \mathcal{L}_{mlm} + \mathcal{L}_{kge}\right]$$</p>
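<p>Of the four terms, the KGE loss is the most self-contained to sketch. Below is a minimal NumPy version of the max-margin TransE loss for one triplet with head- and tail-corrupted negatives; the toy embeddings are illustrative, and a perfectly satisfied triplet with distant negatives incurs zero loss:</p>

```python
import numpy as np

def transe_loss(f_h, g_r, f_t, f_h_neg, f_t_neg, margin=0.2):
    """Max-margin TransE loss for one (h, r, t) triplet, matching the
    L_kge term above with Delta = 0.2."""
    d = lambda h, r, t: np.linalg.norm(h + r - t)   # d(h,r,t) = ||f(h)+g(r)-f(t)||_2
    pos = d(f_h, g_r, f_t)
    return (max(0.0, pos - d(f_h, g_r, f_t_neg) + margin)     # tail-corrupted hinge
            + max(0.0, pos - d(f_h_neg, g_r, f_t) + margin))  # head-corrupted hinge

rng = np.random.default_rng(2)
h, r = rng.normal(size=16), rng.normal(size=16)
t = h + r                                  # an exactly-satisfied triplet
h_neg, t_neg = rng.normal(size=16), rng.normal(size=16)
loss = transe_loss(h, r, t, h_neg, t_neg)
```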
<h3 id="theoretical-justifications">Theoretical Justifications</h3>
<p>The authors provide metric learning interpretations for each objective. For CMM, they show that minimizing the loss amounts to assigning higher scores to matched triplets and lower scores to unmatched ones, aligning the feature space across all three modalities.</p>
<p>For KGE, two lemmas provide guarantees about structurally and functionally similar molecules:</p>
<p><strong>Lemma 1</strong> (Structural similarity): For a symmetric structural-similarity relation $r_s$, the KGE loss satisfies:</p>
<p>$$\mathcal{L}_{kge}(h, r_s, t) \propto 2\|f(h) - f(t)\| - \mathbb{E}_{\tilde{t}}\|f(h) - f(\tilde{t})\| - \mathbb{E}_{\tilde{h}}\|f(\tilde{h}) - f(t)\|$$</p>
<p>This shows KGE pulls structurally similar molecules closer while pushing dissimilar ones apart.</p>
<p><strong>Lemma 2</strong> (Functional similarity): For molecules $h$ and $t$ that interact with a common entity $o$, the distance between their embeddings is upper-bounded:</p>
<p>$$\|f(h) - f(t)\| \leq \alpha\,\mathbb{E}_{(e_1, r, e_2) \sim \mathcal{I}}\left[\mathcal{L}_{kge}(e_1, r, e_2)\right] + C$$</p>
<p>where $\alpha \approx 1$ and $C \approx 0$. This guarantees that minimizing KGE also brings functionally similar molecules closer in the embedding space.</p>
<h2 id="experiments-across-four-downstream-tasks">Experiments Across Four Downstream Tasks</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>MolFM pre-trains on 15K molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> paired with 37M paragraphs from S2ORC. The knowledge graph contains 49K entities and 3.2M relations, constructed from <a href="https://en.wikipedia.org/wiki/DrugBank">DrugBank</a>, <a href="https://en.wikipedia.org/wiki/BindingDB">BindingDB</a>, and additional public databases with heuristic augmentation.</p>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>Evaluated on PCdes (paragraph-level) in zero-shot and fine-tuning settings. MolFM uses a re-ranking strategy that linearly combines cosine similarity with CMM logits over the top-$k$ retrieved candidates.</p>
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Model</th>
          <th>S-T MRR</th>
          <th>S-T R@1</th>
          <th>S-T R@10</th>
          <th>T-S MRR</th>
          <th>T-S R@1</th>
          <th>T-S R@10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Zero-shot</td>
          <td>MoMu</td>
          <td>9.89</td>
          <td>5.08</td>
          <td>18.93</td>
          <td>10.33</td>
          <td>4.90</td>
          <td>20.69</td>
      </tr>
      <tr>
          <td>Zero-shot</td>
          <td>MolFM</td>
          <td>21.42</td>
          <td>13.90</td>
          <td>36.21</td>
          <td>23.63</td>
          <td>16.14</td>
          <td>39.54</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MoMu</td>
          <td>34.29</td>
          <td>24.47</td>
          <td>53.84</td>
          <td>34.53</td>
          <td>24.87</td>
          <td>54.25</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MolFM</td>
          <td>39.56</td>
          <td>29.76</td>
          <td>58.63</td>
          <td>39.34</td>
          <td>29.39</td>
          <td>58.49</td>
      </tr>
  </tbody>
</table>
<p>MolFM achieves 12.13% and 5.04% absolute gains over MoMu under zero-shot and fine-tuning settings, respectively.</p>
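<p>The re-ranking strategy described above can be sketched as a two-stage ranker for one query; the weight <code>alpha</code> and candidate count <code>k</code> are stand-ins, since the paper's exact combination rule may differ:</p>

```python
import numpy as np

def rerank(cos_sim, cmm_logits, k=8, alpha=0.5):
    """Cheap first stage ranks all candidates by contrastive cosine similarity;
    the top-k are re-scored by a linear combination with CMM matching logits."""
    top_k = np.argsort(-cos_sim)[:k]
    combined = alpha * cos_sim[top_k] + (1 - alpha) * cmm_logits[top_k]
    return top_k[np.argsort(-combined)]     # top-k candidates, re-ordered

rng = np.random.default_rng(3)
cos_sim = rng.normal(size=100)      # one query's similarity to 100 candidates
cmm_logits = rng.normal(size=100)   # CMM match scores from the multimodal encoder
order = rerank(cos_sim, cmm_logits)
```

<p>This keeps the expensive multimodal encoder off the full candidate set while letting its matching head refine the final ordering.</p>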
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>Evaluated on ChEBI-20 using MolT5 decoders. MolFM&rsquo;s structure encoder features are concatenated with the MolT5 encoder outputs.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>BLEU-4</th>
          <th>ROUGE-L</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.457</td>
          <td>0.578</td>
          <td>0.569</td>
          <td>0.547</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.462</td>
          <td>0.575</td>
          <td>0.576</td>
          <td>0.558</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>GraphMVP</td>
          <td>0.491</td>
          <td>0.592</td>
          <td>0.599</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.498</td>
          <td>0.594</td>
          <td>0.607</td>
          <td>0.576</td>
      </tr>
  </tbody>
</table>
<h3 id="text-based-molecule-generation">Text-Based Molecule Generation</h3>
<p>Also on ChEBI-20 with MolT5 decoders. MolFM&rsquo;s text features are projected and fed to the decoder.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>Exact</th>
          <th>Valid</th>
          <th>Morgan FTS</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.082</td>
          <td>0.786</td>
          <td>0.601</td>
          <td>0.543</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.183</td>
          <td>0.863</td>
          <td>0.678</td>
          <td>0.580</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.210</td>
          <td>0.892</td>
          <td>0.697</td>
          <td>0.583</td>
      </tr>
  </tbody>
</table>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (8 classification datasets), MolFM concatenates the structure feature and the multimodal encoder&rsquo;s CLS feature to predict properties.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>Avg</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GraphMVP</td>
          <td>72.4</td>
          <td>74.4</td>
          <td>77.5</td>
          <td>77.0</td>
          <td>81.2</td>
          <td>73.07</td>
      </tr>
      <tr>
          <td>DeepEIK</td>
          <td>72.1</td>
          <td>72.4</td>
          <td>89.7</td>
          <td>75.0</td>
          <td>80.5</td>
          <td>73.27</td>
      </tr>
      <tr>
          <td>MolFM (w/o T+K)</td>
          <td>72.2</td>
          <td>76.6</td>
          <td>78.6</td>
          <td>78.2</td>
          <td>82.6</td>
          <td>73.95</td>
      </tr>
      <tr>
          <td>MolFM (w/ T+K)</td>
          <td>72.9</td>
          <td>77.2</td>
          <td>79.7</td>
          <td>78.8</td>
          <td>83.9</td>
          <td>74.62</td>
      </tr>
  </tbody>
</table>
<p>With multimodal inputs, MolFM averages 74.62% ROC-AUC, a 1.55% absolute gain over GraphMVP.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Zero-shot retrieval ablations reveal that cross-modal attention to atoms and CMM are the most critical components. Removing either causes a sharp drop (approximately 3% on S-T retrieval). Knowledge graph incorporation yields a 1.5% average improvement, with both attention to neighbors and KGE contributing marginally.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>MolFM demonstrates that incorporating knowledge graphs as a third modality provides consistent improvements across all evaluated tasks. The theoretical analysis connecting pre-training objectives to deep metric learning provides interpretability for why the model works: STC and CMM align representations of the same molecule across modalities, while KGE pulls structurally and functionally similar molecules closer in the embedding space.</p>
<p>The cross-modal attention visualizations show that MolFM learns to associate specific atom substructures with relevant text tokens and knowledge graph entities. For example, the model correctly attends to functional groups mentioned in textual descriptions.</p>
<p>The authors acknowledge several limitations:</p>
<ol>
<li><strong>Data quality</strong>: The pre-training dataset (15K molecules) is small and may introduce biases</li>
<li><strong>Cold-start problem</strong>: MolFM provides limited benefit for newly emerged molecules lacking text and knowledge graph information</li>
<li><strong>Entity scope</strong>: The model focuses on molecules and does not incorporate proteins, genes, or cell lines, which could further improve biomedical understanding</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (molecules)</td>
          <td>PubChem</td>
          <td>15K molecules</td>
          <td>Follows MoMu&rsquo;s pre-training data</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>S2ORC</td>
          <td>37M paragraphs</td>
          <td>Biomedical literature paragraphs</td>
      </tr>
      <tr>
          <td>Knowledge graph</td>
          <td>DrugBank, BindingDB, public DBs</td>
          <td>49K entities, 3.2M relations</td>
          <td>Constructed with heuristics from MoCL</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>Paragraph-level</td>
          <td>Test split</td>
      </tr>
      <tr>
          <td>Captioning/Generation</td>
          <td>ChEBI-20</td>
          <td>-</td>
          <td>Following MolT5 splits</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet</td>
          <td>8 datasets</td>
          <td>Classification tasks, ROC-AUC metric</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: AdamW with weight decay $1 \times 10^{-4}$</li>
<li>Learning rate: linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, cosine annealing to $1 \times 10^{-5}$</li>
<li>Batch size: 128</li>
<li>Pre-training epochs: 300</li>
<li>Knowledge graph neighbors per molecule: $N = 4$</li>
<li>Temperature: $\tau = 0.1$</li>
<li>Margin: $\Delta = 0.2$</li>
</ul>
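<p>The reported schedule (linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, cosine annealing to $1 \times 10^{-5}$) can be sketched as follows; <code>total_steps</code> is a hypothetical training length, not a value from the paper:</p>

```python
import math

def lr_at(step, total_steps, warmup_steps=2000,
          peak_lr=1e-4, final_lr=1e-5):
    """Linear warmup to peak_lr over warmup_steps, then cosine
    annealing down to final_lr over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine
```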
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Initialization</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph encoder</td>
          <td>5-layer GIN</td>
          <td>1.8M</td>
          <td>GraphMVP</td>
      </tr>
      <tr>
          <td>Text encoder</td>
          <td>6-layer Transformer</td>
          <td>61.8M</td>
          <td>KV-PLM (first 6 layers)</td>
      </tr>
      <tr>
          <td>Knowledge encoder</td>
          <td>TransE</td>
          <td>12.6M</td>
          <td>Trained 500 epochs on KG</td>
      </tr>
      <tr>
          <td>Multimodal encoder</td>
          <td>6-layer Transformer + cross-attention</td>
          <td>61.8M</td>
          <td>KV-PLM (last 6 layers)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>~138M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>MRR, Recall@1/5/10</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>BLEU-2/4, ROUGE-1/2/L, METEOR, Text2Mol</td>
      </tr>
      <tr>
          <td>Text-to-molecule generation</td>
          <td>BLEU, Exact ratio, Validity, Levenshtein, Fingerprint Tanimoto (MACCS/RDKit/Morgan), Text2Mol</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>ROC-AUC per dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 NVIDIA A100 GPUs for pre-training</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BioFM/OpenBioMed">OpenBioMed</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation including MolFM</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Yang, K., Hong, M., Liu, X. Y., &amp; Nie, Z. (2023). MolFM: A Multimodal Molecular Foundation Model. <em>arXiv preprint arXiv:2307.09484</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2023molfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolFM: A Multimodal Molecular Foundation Model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Yang, Kai and Hong, Massimo and Liu, Xing Yi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2307.09484}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolecularRNN: Graph-Based Molecular Generation and RL</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</guid><description>MolecularRNN extends GraphRNN with atom and bond type predictions, valency-based rejection sampling, and policy gradient optimization for molecular generation.</description><content:encoded><![CDATA[<h2 id="a-graph-recurrent-model-for-molecular-generation-with-property-optimization">A Graph Recurrent Model for Molecular Generation with Property Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces MolecularRNN, a graph-based recurrent generative model for molecular structures. The model extends GraphRNN to handle typed nodes (atom types) and typed edges (bond types), enabling direct generation of molecular graphs rather than working through string representations like SMILES. Three key contributions are combined: (1) the MolecularRNN architecture for autoregressive graph generation, (2) valency-based rejection sampling for guaranteed 100% validity at inference, and (3) policy gradient reinforcement learning for shifting molecular property distributions toward desired ranges.</p>
<h2 id="why-generate-molecules-as-graphs-rather-than-strings">Why Generate Molecules as Graphs Rather Than Strings</h2>
<p>Computational de novo molecular design aims to create novel molecules with desired properties, a task central to drug discovery. At the time of this work, most deep generative models for molecules operated on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, inheriting the complications of SMILES grammar and the problem that structurally similar molecules can have very different string representations. Graph-based representations are more natural for molecules, with atoms mapping to nodes and bonds to edges, and they allow direct enforcement of chemical constraints during generation.</p>
<p>Existing graph-based methods had their own limitations. Junction tree VAE (JT-VAE) generates molecules from structural fragments, which introduces ambiguity when converting junction trees back to molecules, particularly problematic during property optimization since molecules sharing a junction tree can have very different property values. The GCPN model uses graph convolutional networks with reinforcement learning but was evaluated only on top-3 generated molecules, making it difficult to assess overall distribution quality. Prior atom-level graph generation models like Li et al. (2018a) were restricted to molecules with at most 20 heavy atoms, limiting practical applicability.</p>
<h2 id="core-innovation-extending-graphrnn-with-chemical-constraints-and-rl">Core Innovation: Extending GraphRNN with Chemical Constraints and RL</h2>
<p>MolecularRNN builds on the GraphRNN architecture by introducing atom type predictions alongside edge type predictions. The model generates molecular graphs sequentially: at each step, a NodeRNN predicts the type of the next atom, then an EdgeRNN predicts bond types to all preceding atoms within a BFS-ordered window.</p>
<h3 id="autoregressive-graph-generation">Autoregressive Graph Generation</h3>
<p>The joint likelihood over atom types $C^{\pi}$ and adjacency vectors $S^{\pi}$ under BFS ordering $\pi$ is factorized as:</p>
<p>$$
p\left(S^{\pi}, C^{\pi}\right) = \prod_{i=1}^{n+1} p\left(C_{i}^{\pi} \mid S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right) p\left(S_{i}^{\pi} \mid C_{i}^{\pi}, S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right)
$$</p>
<p>NodeRNN processes embeddings of previous atom types and adjacency vectors to produce a hidden state, from which a two-layer MLP with softmax predicts the next atom type $\psi_{i}$:</p>
<p>$$
h_{i}^{\text{node}} = \text{NodeRNN}\left(h_{i-1}^{\text{node}}, \left[\text{emb}(S_{i-1}^{\pi}), \text{emb}(C_{i-1}^{\pi})\right]\right)
$$</p>
<p>$$
\psi_{i} = \text{NodeMLP}\left(h_{i}^{\text{node}}\right)
$$</p>
<p>EdgeRNN then unrolls across preceding atoms to predict bond types $\phi_{i,j}$, initialized with the NodeRNN hidden state:</p>
<p>$$
h_{i,j}^{\text{edge}} = \text{EdgeRNN}\left(h_{i,j-1}^{\text{edge}}, \text{emb}(S_{i,j-1}^{\pi})\right), \quad h_{i,0}^{\text{edge}} = h_{i}^{\text{node}}
$$</p>
<p>$$
\phi_{i,j} = \text{EdgeMLP}\left(h_{i,j}^{\text{edge}}\right)
$$</p>
<p>Bond types are categorical over {no bond, single, double, triple}, and molecules are represented in kekulized form. BFS ordering limits the EdgeRNN window to $M = 12$ preceding atoms.</p>
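<p>The two-level generation loop has the control flow sketched below, with the learned NodeRNN/EdgeRNN distributions replaced by uniform random choices; <code>sample_graph</code> and its fixed-atom-count interface are illustrative stand-ins, not the paper&rsquo;s implementation:</p>

```python
import random

ATOM_TYPES = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]
BOND_TYPES = [0, 1, 2, 3]  # no bond, single, double, triple
M = 12  # BFS window: bonds only to the M most recent atoms

def sample_graph(n_atoms, seed=0):
    """Toy autoregressive loop with MolecularRNN's structure: at
    each step a 'NodeRNN' picks the next atom type, then an
    'EdgeRNN' picks bond orders to the preceding atoms inside the
    BFS window.  Uniform sampling replaces the learned models."""
    rng = random.Random(seed)
    atoms, bonds = [], {}
    for i in range(n_atoms):
        atoms.append(rng.choice(ATOM_TYPES))      # node step
        for j in range(max(0, i - M), i):         # edge steps
            order = rng.choice(BOND_TYPES)
            if order:
                bonds[(j, i)] = order
    return atoms, bonds

atoms, bonds = sample_graph(20)
```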
<h3 id="valency-based-rejection-sampling">Valency-Based Rejection Sampling</h3>
<p>During inference, each proposed bond of order $k$ between atoms $i$ and $j$ is accepted only if both atoms remain within their allowed valencies:</p>
<p>$$
\sum_{j} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{i}^{\pi}} \quad \text{and} \quad \sum_{i} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{j}^{\pi}}
$$</p>
<p>Atoms that do not fill their valencies are complemented with hydrogens. This constraint can be enforced directly on graphs (unlike SMILES, where intermediate substrings are not chemically meaningful), yielding 100% valid molecules.</p>
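<p>The acceptance rule can be written directly as a predicate. The valency table below is a simplified single-valence-per-element sketch (multi-valent elements such as S and P are not handled as the paper would):</p>

```python
VALENCY = {"C": 4, "N": 3, "O": 2, "F": 1, "P": 3,
           "S": 2, "Cl": 1, "Br": 1, "I": 1}  # simplified table

def accept_bond(order, i, j, atoms, bond_orders):
    """Valency-based rejection: accept a proposed bond of the given
    order between atoms i and j only if neither atom would exceed
    its allowed valency.  bond_orders[k] holds the current total
    bond order on atom k; unfilled valencies are later completed
    with hydrogens."""
    ok_i = bond_orders[i] + order <= VALENCY[atoms[i]]
    ok_j = bond_orders[j] + order <= VALENCY[atoms[j]]
    return ok_i and ok_j

# Example: a C=O double bond is fine, a C#O triple bond is rejected.
atoms = ["C", "O"]
ok_double = accept_bond(2, 0, 1, atoms, [0, 0])  # True
ok_triple = accept_bond(3, 0, 1, atoms, [0, 0])  # False: O exceeds 2
```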
<h3 id="property-optimization-via-policy-gradient">Property Optimization via Policy Gradient</h3>
<p>For property optimization, MolecularRNN is formulated as a policy network in a Markov Decision Process. The loss function uses REINFORCE with a discounted final reward:</p>
<p>$$
L(\theta) = -\sum_{i=1}^{N} r(s_{N}) \cdot \gamma^{i} \cdot \log p(s_{i} \mid s_{i-1}; \theta)
$$</p>
<p>where $r(s_{N})$ is the reward from a property critic and $\gamma$ is a discount factor. The authors also introduce a structural penalty during RL training that assigns a penalty of $-10$ to atoms violating valency constraints, providing a learning signal from invalid intermediate molecules.</p>
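<p>The loss reduces to a short sum over the generation trajectory; a direct transcription of the formula (illustrative inputs, not trained model outputs):</p>

```python
def reinforce_loss(final_reward, step_log_probs, gamma=0.97):
    """L(theta) = -sum_i r(s_N) * gamma**i * log p(s_i | s_{i-1}):
    every step shares the final reward, discounted by gamma**i."""
    return -sum(final_reward * gamma ** i * lp
                for i, lp in enumerate(step_log_probs, start=1))

loss = reinforce_loss(2.0, [-1.0], gamma=0.5)  # -(2.0 * 0.5 * -1.0) = 1.0
```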
<h2 id="experimental-setup-pretraining-and-property-optimization">Experimental Setup: Pretraining and Property Optimization</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MolecularRNN is pretrained on three datasets: ChEMBL (~1.5M bioactive molecules), <a href="/notes/chemistry/datasets/zinc-22/">ZINC 250k</a> (250K randomly selected commercially available compounds), and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> (~1.9M drug-like molecules from ZINC). The model considers 9 atom types (C, N, O, F, P, S, Cl, Br, I), 3 bond types (single, double, triple), and molecules with 10-50 heavy atoms. Architecture: NodeRNN with 4 GRU layers (hidden size 256), EdgeRNN with 4 GRU layers (hidden size 128), node embedding size 128, edge embedding size 16. Training uses Adam with learning rate 0.001 and multiplicative decay on 4 GPUs with batch size 512 per GPU for 250 epochs.</p>
<h3 id="generation-quality-at-scale">Generation Quality at Scale</h3>
<p>The pretrained model generates 1 million molecules per dataset (far larger than prior work: JT-VAE used 5K samples, Li et al. used 100K). Results with valency-based rejection sampling:</p>
<table>
  <thead>
      <tr>
          <th>Training Set</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>IntDiv (p=1)</th>
          <th>IntDiv (p=2)</th>
          <th>SA Score</th>
          <th>QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>100%</td>
          <td>99.2%</td>
          <td>99.3%</td>
          <td>0.895</td>
          <td>0.890</td>
          <td>3.67 +/- 1.20</td>
          <td>0.56 +/- 0.20</td>
      </tr>
      <tr>
          <td>ZINC 250k</td>
          <td>100%</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>0.892</td>
          <td>0.887</td>
          <td>3.60 +/- 1.01</td>
          <td>0.68 +/- 0.16</td>
      </tr>
      <tr>
          <td>MOSES</td>
          <td>100%</td>
          <td>99.4%</td>
          <td>100%</td>
          <td>0.881</td>
          <td>0.876</td>
          <td>3.24 +/- 0.97</td>
          <td>0.74 +/- 0.14</td>
      </tr>
  </tbody>
</table>
<p>Comparison with baselines on ZINC 250k (30K samples):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>SA Score</th>
          <th>QED</th>
          <th>IntDiv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
          <td>3.37</td>
          <td>0.76</td>
          <td>0.85</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>100%</td>
          <td>99.97%</td>
          <td>100%</td>
          <td>4.62</td>
          <td>0.61</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>100%</td>
          <td>99.89%</td>
          <td>100%</td>
          <td>3.59</td>
          <td>0.68</td>
          <td>0.89</td>
      </tr>
  </tbody>
</table>
<p>GCPN generates overly complex molecules (high SA score of 4.62), while MolecularRNN produces more realistic structures with higher internal diversity than JT-VAE.</p>
<h3 id="property-optimization-results">Property Optimization Results</h3>
<p>Policy gradient optimization is run for 300 iterations with batch size 512 and constant learning rate $10^{-5}$, discount factor $\gamma = 0.97$. Top-3 scores for penalized logP and QED:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>logP 1st</th>
          <th>logP 2nd</th>
          <th>logP 3rd</th>
          <th>QED 1st</th>
          <th>QED 2nd</th>
          <th>QED 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></td>
          <td>3.63</td>
          <td>3.49</td>
          <td>3.44</td>
          <td>0.896</td>
          <td>0.824</td>
          <td>0.820</td>
      </tr>
      <tr>
          <td>JT-VAE</td>
          <td>5.30</td>
          <td>4.93</td>
          <td>4.49</td>
          <td>0.925</td>
          <td>0.911</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>7.98</td>
          <td>7.85</td>
          <td>7.80</td>
          <td>0.948</td>
          <td>0.947</td>
          <td>0.946</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>10.34</td>
          <td>10.19</td>
          <td>10.14</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.947</td>
      </tr>
  </tbody>
</table>
<p>MolecularRNN achieves the highest penalized logP scores (10.34 vs. GCPN&rsquo;s 7.98) while matching GCPN on QED. The authors also demonstrate melting temperature optimization using a GCN-based property predictor as the critic (RMSE 39.5 degrees C), showing that the RL framework generalizes to properties that cannot be computed directly from molecular graphs.</p>
<h2 id="distribution-level-evaluation-and-learned-chemical-patterns">Distribution-Level Evaluation and Learned Chemical Patterns</h2>
<p>The authors emphasize that reporting only top-3 scores is not informative, and they compare full property distributions. MolecularRNN shifts the QED distribution further toward maximum values compared to GCPN. They also note that during melting temperature optimization, the model rediscovered two chemical phenomena: fusing aromatic rings increases melting point, and the presence of polar groups (C=O, OH, NH2, heterocyclic nitrogens) enhances dipole-dipole interactions and raises melting temperature.</p>
<p>Without valency-based rejection sampling, the pretrained model achieves 65% validity. After structural penalty training (assigning -10 to valency-violating atoms and optimizing with policy gradient), validity increases to 90%. Enabling rejection sampling then achieves 100%.</p>
<p>Several limitations are worth noting. The BFS ordering introduces an arbitrary sequencing over equivalent graph traversals (the node order permutation problem is not addressed). The evaluation uses top-3 scores for property optimization, though the authors do advocate for distributional evaluation. The molecule size is capped at 50 heavy atoms. The paper does not report training time or wall-clock generation speed. Future directions mentioned include multi-objective property optimization and scaffold completion (graph completion from a given core structure).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>~1.5M molecules</td>
          <td>Bioactive molecules with experimental measurements</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC 250k</td>
          <td>250K molecules</td>
          <td>Random subset of ZINC database</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>MOSES</td>
          <td>~1.9M molecules</td>
          <td>Drug-like subset of ZINC</td>
      </tr>
      <tr>
          <td>Melting point critic</td>
          <td>Custom split</td>
          <td>37,940 train / 9,458 test</td>
          <td>Melting temperatures from -196 to 517 degrees C</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Maximum likelihood with Adam optimizer, learning rate 0.001 with multiplicative decay to $10^{-5}$, 250 epochs</li>
<li><strong>Structural penalty</strong>: Policy gradient with -10 penalty per valency-violating atom</li>
<li><strong>Property optimization</strong>: REINFORCE (policy gradient), 300 iterations, batch size 512, learning rate $10^{-5}$, discount factor $\gamma = 0.97$</li>
<li><strong>Melting point critic</strong>: GCN regression (4 layers, hidden size 128), Adam with learning rate 0.001, exponential decay $\gamma = 0.8$, 30 epochs, batch size 32</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>NodeRNN</strong>: 4 GRU layers, hidden size 256, node embedding 128</li>
<li><strong>EdgeRNN</strong>: 4 GRU layers, hidden size 128, edge embedding 16</li>
<li><strong>NodeMLP/EdgeMLP</strong>: 2-layer MLP with 128 hidden units, ReLU activation, softmax output</li>
<li><strong>BFS window</strong>: $M = 12$ preceding atoms</li>
<li><strong>Atom types</strong>: 9 (C, N, O, F, P, S, Cl, Br, I)</li>
<li><strong>Bond types</strong>: 3 (single, double, triple) + no bond</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>% chemically valid molecules (RDKit)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>% unique in generated pool (up to 1M)</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>% not in training set</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Average pairwise Tanimoto distance</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility (2-4 optimal range)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Drug-likeness score (0-1)</td>
      </tr>
      <tr>
          <td>Penalized logP</td>
          <td>Lipophilicity with ring and SA penalties</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 GPUs (NVIDIA, specific model not stated)</li>
<li>Per-GPU batch size of 512 for pretraining</li>
<li>Training time not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Popova, M., Shvets, M., Oliva, J., &amp; Isayev, O. (2019). MolecularRNN: Generating realistic molecular graphs with optimized properties. <em>arXiv preprint arXiv:1905.13372</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{popova2019molecularrnn,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolecularRNN: Generating realistic molecular graphs with optimized properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Popova, Mariya and Shvets, Mykhailo and Oliva, Junier and Isayev, Olexandr}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1905.13372}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Memory-Assisted RL for Diverse De Novo Mol. Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</guid><description>A memory unit for REINVENT-based RL that tracks generated scaffolds and penalizes repeated solutions, increasing molecular diversity up to fourfold.</description><content:encoded><![CDATA[<h2 id="a-memory-module-for-diverse-molecular-generation-via-rl">A Memory Module for Diverse Molecular Generation via RL</h2>
<p>This is a <strong>Method</strong> paper that introduces a memory unit for reinforcement learning (RL)-based molecular generation. The primary contribution is a hash-table-based memory mechanism that integrates into the REINVENT framework&rsquo;s scoring function. By tracking previously generated high-scoring molecules and penalizing the reward when new molecules are too similar to those already stored, the memory unit forces the generative model to explore different regions of chemical space rather than collapsing onto a single scaffold family.</p>
<h2 id="policy-collapse-limits-rl-based-de-novo-design">Policy Collapse Limits RL-Based De Novo Design</h2>
<p>Recurrent neural networks (RNNs) trained with reinforcement learning can generate novel molecules optimized for desired properties. The <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> algorithm and related approaches (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, GENTRL) demonstrated the viability of coupling a pretrained SMILES-based generative model with a scoring function via RL. However, a persistent problem is <strong>policy collapse</strong> (also called mode collapse): once the model discovers a high-scoring region of chemical space, it continues to exploit that region, producing structurally similar compounds with minor substitution differences. This severely limits the practical utility of RL-based generation in drug design, where medicinal chemists need diverse scaffolds to explore structure-activity relationships and manage intellectual property concerns.</p>
<p>Prior work by Liu et al. [31] attempted to address this by engineering an explorative RNN alongside the standard generative RNN, but it did not substantially increase diversity compared to standard REINVENT. Other approaches like Generative Examination Networks (GEN) performed statistical analysis during training but were not evaluated in optimization scenarios.</p>
<h2 id="core-innovation-hash-table-memory-unit-for-reward-modification">Core Innovation: Hash-Table Memory Unit for Reward Modification</h2>
<p>The key insight is to dynamically modify the reward surface during RL by maintaining a memory of previously explored chemical space. The memory unit is a hash table of index-bucket pairs. Each bucket stores up to a fixed number of high-scoring molecules (default: 25) that are chemically similar to a seed molecule (the index).</p>
<h3 id="integration-with-reinvent">Integration with REINVENT</h3>
<p>The memory unit modifies the augmented likelihood used in REINVENT. For a generated compound $c$, the augmented log-likelihood becomes:</p>
<p>$$
\log P(c)_{Aug} = \log P(c)_{PriorNetwork} + \sigma \times S(c) \times M(c)
$$</p>
<p>where $\sigma$ is a scalar coefficient, $S(c)$ is the scoring function output, and $M(c)$ is the memory unit output (either 0 or 1). The reward is:</p>
<p>$$
R(c) = \left(\log P(c)_{Aug} - \log P(c)_{AgentNetwork}\right)^2
$$</p>
<p>and the loss is $\text{loss} = -R(c)$.</p>
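<p>These two equations can be sketched as plain functions. The default <code>sigma=60.0</code> is an illustrative REINVENT-style value, not a setting reported in this summary:</p>

```python
def augmented_loglik(logp_prior, score, memory_out, sigma=60.0):
    """log P_aug = log P_prior + sigma * S(c) * M(c).  When the
    memory returns M(c) = 0, this collapses back to the prior
    likelihood, removing the scoring-function bonus."""
    return logp_prior + sigma * score * memory_out

def reward(logp_prior, logp_agent, score, memory_out, sigma=60.0):
    """R(c) = (log P_aug - log P_agent)^2; training minimizes -R(c)."""
    diff = augmented_loglik(logp_prior, score, memory_out, sigma) - logp_agent
    return diff * diff
```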
<h3 id="memory-unit-operation">Memory Unit Operation</h3>
<p>When a high-scoring molecule is generated:</p>
<ol>
<li>Its fingerprint or scaffold is compared against all index structures in the memory</li>
<li>If it is similar to an index (above a Tanimoto cutoff, default 0.6) and the corresponding bucket is not full, $M(c) = 1$ and the molecule is added to the bucket</li>
<li>If the bucket is full, $M(c) = 0$, effectively zeroing the reward contribution and discouraging the model from generating similar molecules</li>
<li>If no similar index exists, a new index-bucket pair is created</li>
</ol>
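<p>A minimal sketch of this bookkeeping, with a pluggable similarity function standing in for the paper&rsquo;s fingerprint and scaffold comparisons:</p>

```python
class MemoryUnit:
    """Hash-table memory sketch: each entry pairs an index (seed)
    molecule with a bucket of similar high-scoring molecules."""

    def __init__(self, similarity, cutoff=0.6, bucket_size=25):
        self.similarity = similarity  # any [0, 1] similarity function
        self.cutoff = cutoff
        self.bucket_size = bucket_size
        self.entries = []  # list of (index_mol, bucket) pairs

    def __call__(self, mol):
        """Return M(c) in {0, 1} and update the memory."""
        for index_mol, bucket in self.entries:
            if self.similarity(mol, index_mol) >= self.cutoff:
                if len(bucket) < self.bucket_size:
                    bucket.append(mol)
                    return 1
                return 0  # bucket full: zero out the reward term
        self.entries.append((mol, [mol]))  # new index-bucket pair
        return 1

# Toy demo: exact-match similarity, bucket size 2.
mem = MemoryUnit(lambda a, b: 1.0 if a == b else 0.0, bucket_size=2)
outputs = [mem("A"), mem("A"), mem("A"), mem("B")]
```

<p>In the toy run, the third &ldquo;A&rdquo; finds its bucket full and receives $M(c) = 0$, while the dissimilar &ldquo;B&rdquo; opens a fresh index-bucket pair.</p>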
<h3 id="four-similarity-criteria">Four Similarity Criteria</h3>
<p>The authors evaluate four criteria for grouping molecules in the memory:</p>
<ol>
<li><strong>Compound similarity</strong>: ECFP4 Tanimoto similarity at the whole-molecule level</li>
<li><strong>Identical Bemis-Murcko (BM) scaffold</strong>: exact match of Bemis-Murcko frameworks</li>
<li><strong>Identical carbon skeleton</strong>: exact match of carbon skeletons (BM scaffolds with all heteroatoms replaced by carbon and bonds set to single)</li>
<li><strong>Scaffold similarity</strong>: atom pair fingerprint Tanimoto similarity between carbon skeletons (fuzzy matching)</li>
</ol>
<h3 id="alternative-output-modes">Alternative Output Modes</h3>
<p>Beyond the binary output ($M(c) \in {0, 1}$), the authors also explored smooth output functions. The linear mode:</p>
<p>$$
M(c) = 1 - \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>And the sigmoid mode:</p>
<p>$$
M(c) = 1 - \frac{1}{1 + e^{-\left(\frac{\frac{\text{compounds in bucket}}{\text{bucket size}} \times 2 - 1}{0.15}\right)}}
$$</p>
<p>Both smooth modes yielded slightly fewer analogs than the binary mode and were not pursued further.</p>
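<p>For reference, the two smooth modes transcribe directly from the formulas above:</p>

```python
import math

def m_linear(n_in_bucket, bucket_size=25):
    """Linear memory output: decays from 1 to 0 as the bucket fills."""
    return 1.0 - n_in_bucket / bucket_size

def m_sigmoid(n_in_bucket, bucket_size=25, width=0.15):
    """Sigmoid memory output: near 1 for an empty bucket, near 0 for
    a full one, with the transition centered at a half-full bucket."""
    x = (n_in_bucket / bucket_size) * 2.0 - 1.0
    return 1.0 - 1.0 / (1.0 + math.exp(-x / width))
```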
<h2 id="experimental-setup-logp-optimization-and-target-activity-prediction">Experimental Setup: LogP Optimization and Target Activity Prediction</h2>
<h3 id="case-study-1-logp-optimization">Case Study 1: LogP Optimization</h3>
<p>As a proof of concept, the authors optimized LogP values for known DRD2 inhibitors. Starting from 487 DRD2 compounds with LogP &gt;= 5 (from ExCAPE-DB), they applied transfer learning to the prior model for 20 epochs, then ran RL for 150 iterations (100 compounds per iteration, 15,000 total). The scoring function was:</p>
<p>$$
S = 1 - \tanh\left(\min\left(|2 - \text{AlogP}|, |3 - \text{AlogP}|\right)\right)
$$</p>
<p>targeting LogP values between 2.0 and 3.0.</p>
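<p>The scoring function is a one-liner. Note its shape: $S$ is maximal (1.0) when AlogP is exactly 2 or 3 and decays smoothly with distance to the nearer of the two targets:</p>

```python
import math

def logp_score(alogp):
    """S = 1 - tanh(min(|2 - AlogP|, |3 - AlogP|)): rewards LogP
    values near the [2, 3] target window, penalizing the distance
    to whichever boundary value is closer."""
    return 1.0 - math.tanh(min(abs(2.0 - alogp), abs(3.0 - alogp)))
```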
<h3 id="case-study-2-htr1a-and-drd2-activity-prediction">Case Study 2: HTR1A and DRD2 Activity Prediction</h3>
<p>For a more complex scenario, the authors trained SVM classifiers (with <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt scaling</a> for probabilistic output) on bioactivity data from ExCAPE-DB to predict activity against two neurotransmitter receptors:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/5-HT1A_receptor">HTR1A</a></strong>: 3,599 actives (pIC50 &gt;= 7) and 66,684 inactives</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a></strong>: 2,981 actives (pIC50 &gt;= 7) and 346,206 inactives (100,000 sampled)</li>
</ul>
<p>Data was split using Butina clustering on ECFP6 at a 0.4 Tanimoto cutoff (60/20/20 train/val/test). The SVM models achieved excellent performance:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Set</th>
          <th>Balanced Accuracy</th>
          <th>ROC AUC</th>
          <th>F1</th>
          <th>MCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>Test</td>
          <td>0.96</td>
          <td>0.99</td>
          <td>0.75</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Test</td>
          <td>0.95</td>
          <td>0.99</td>
          <td>0.71</td>
          <td>0.72</td>
      </tr>
  </tbody>
</table>
<p>RL was run for 300 iterations (100 compounds each, 30,000 total). Compounds with predicted activity &gt;= 0.7 were considered active.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN prior model followed the REINVENT architecture: an embedding layer, three GRU layers with 256 dimensions, and a linear output layer. It was pretrained on ~1.5 million ChEMBL 25 compounds (filtered to remove known HTR1A actives and DRD2 analogs) for 10 epochs using Adam with a learning rate of 0.01.</p>
<h3 id="comparisons">Comparisons</h3>
<p>The authors compared memory-assisted RL against:</p>
<ul>
<li>Standard REINVENT RL (no memory)</li>
<li>Experience replay (re-presenting 8 high-scoring compounds per iteration)</li>
<li>Temperature scaling (values from 1.0 to 10.0)</li>
<li>Memory + experience replay combined</li>
</ul>
<h2 id="results-up-to-fourfold-increase-in-diverse-active-compounds">Results: Up to Fourfold Increase in Diverse Active Compounds</h2>
<h3 id="logp-optimization-results">LogP Optimization Results</h3>
<p>Memory-assisted RL increased the number of optimized compounds (LogP 2-3) by roughly threefold:</p>
<table>
  <thead>
      <tr>
          <th>Memory Type</th>
          <th>Optimized Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No memory</td>
          <td>938</td>
          <td>727</td>
          <td>396</td>
      </tr>
      <tr>
          <td>Compound similarity</td>
          <td>3,451</td>
          <td>2,963</td>
          <td>1,472</td>
      </tr>
      <tr>
          <td>Identical BM Scaffold</td>
          <td>3,428</td>
          <td>2,865</td>
          <td>1,398</td>
      </tr>
      <tr>
          <td>Identical Carbon Skeleton</td>
          <td>3,315</td>
          <td>3,002</td>
          <td>1,799</td>
      </tr>
      <tr>
          <td>Scaffold Similarity</td>
          <td>3,591</td>
          <td>3,056</td>
          <td>1,538</td>
      </tr>
  </tbody>
</table>
<p>The memory unit also increased the generation of relevant analogs: ECFP6 analogs (Tanimoto &gt;= 0.4 to the training set) rose from 145 to as many as 549, and shared MMP cores from 5 to as many as 19, confirming that the memory unit promoted exploration of chemically relevant space rather than random drift.</p>
<h3 id="htr1a-and-drd2-activity-optimization-results">HTR1A and DRD2 Activity Optimization Results</h3>
<p>The improvements were even more pronounced for target activity optimization:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Memory Type</th>
          <th>Active Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>No memory</td>
          <td>9,323</td>
          <td>7,312</td>
          <td>5,446</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Compound similarity</td>
          <td>16,779</td>
          <td>13,304</td>
          <td>9,887</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Identical Carbon Skeleton</td>
          <td>17,597</td>
          <td>15,531</td>
          <td>12,408</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>No memory</td>
          <td>5,143</td>
          <td>2,635</td>
          <td>1,949</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Compound similarity</td>
          <td>21,486</td>
          <td>17,844</td>
          <td>12,749</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Scaffold Similarity</td>
          <td>22,784</td>
          <td>20,712</td>
          <td>16,434</td>
      </tr>
  </tbody>
</table>
<p>For DRD2, the effect was particularly striking: standard RL showed clear policy collapse with only 576 ECFP6 analogs to the training set, while memory-assisted RL generated up to 6,315. The compound similarity memory unit produced the most MMP analogs (217 to the training set vs. 7 without memory).</p>
<h3 id="parameter-sensitivity">Parameter Sensitivity</h3>
<p>Bucket size had a modest effect: larger buckets (allowing more compounds before penalization) slightly increased analog generation. The Tanimoto similarity threshold of 0.6 was near-optimal for the scaffold similarity memory; higher thresholds reduced diversity gains. The compound similarity memory showed increasing analogs with higher thresholds, but BM scaffold and carbon skeleton counts plateaued above 0.6.</p>
<h3 id="comparison-with-experience-replay-and-temperature-scaling">Comparison with Experience Replay and Temperature Scaling</h3>
<ul>
<li><strong>Experience replay alone</strong> increased diversity compared to vanilla RL but was less effective than the memory unit alone</li>
<li><strong>Memory + experience replay</strong> achieved the best results overall, as experience replay provided the model with diverse starting points for exploration after the memory unit altered the reward landscape</li>
<li><strong>Temperature scaling</strong> was largely ineffective: only a value of 1.25 showed improvement, and even then it achieved only about 50% of the analogs generated by memory-assisted RL. Temperatures above 2.0 degraded SMILES validity, and above 4.0 prevented valid molecule generation entirely</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>All evaluations are retrospective; no synthesized compounds were experimentally tested</li>
<li>The SVM activity models, while accurate, may have applicability domain limitations for highly novel scaffolds</li>
<li>The binary memory output mode was found to work best, but the transition from exploration to exploitation is abrupt</li>
<li>The method was only tested with two biological targets and one physicochemical property</li>
<li>Computational overhead of the memory unit is not discussed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior model training</td>
          <td>ChEMBL 25</td>
          <td>~1.5M compounds</td>
          <td>Filtered: max 50 heavy atoms, no stereochemistry, removed HTR1A actives and DRD2 analogs</td>
      </tr>
      <tr>
          <td>HTR1A activity data</td>
          <td>ExCAPE-DB</td>
          <td>3,599 actives + 66,684 inactives</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
      <tr>
          <td>DRD2 activity data</td>
          <td>ExCAPE-DB</td>
          <td>2,981 actives + 100,000 inactives (sampled)</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Generative model</strong>: RNN with embedding + 3 GRU layers (256 dim) + linear output (REINVENT architecture)</li>
<li><strong>RL</strong>: Augmented likelihood formulation with sigma scaling coefficient</li>
<li><strong>SVM classifiers</strong>: Non-linear SVM with MinMax kernel, Platt scaling, ECFP6 count-based fingerprints (2048 dim)</li>
<li><strong>Butina clustering</strong>: ECFP6 Tanimoto cutoff 0.4 for train/val/test splitting</li>
</ul>
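<p>The MinMax kernel named above generalizes Tanimoto similarity to count-based fingerprints; a minimal sketch (not the authors' code):</p>

```python
def minmax_kernel(x, y):
    """MinMax kernel on count fingerprints:
    sum(min(x_i, y_i)) / sum(max(x_i, y_i)).
    On binary vectors this reduces to Tanimoto similarity."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 0.0
```

<p>scikit-learn's <code>SVC</code> accepts a callable kernel, so a function like this can be plugged in directly, and Platt scaling corresponds to fitting with <code>probability=True</code>.</p>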
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unique compounds</td>
          <td>Number of distinct valid SMILES generated</td>
      </tr>
      <tr>
          <td>Unique BM scaffolds</td>
          <td>Bemis-Murcko framework diversity</td>
      </tr>
      <tr>
          <td>Unique carbon skeletons</td>
          <td>Carbon skeleton diversity (stripped BM scaffolds)</td>
      </tr>
      <tr>
          <td>ECFP6 analogs</td>
          <td>Compounds with Tanimoto &gt;= 0.4 to known actives</td>
      </tr>
      <tr>
          <td>MMP analogs</td>
          <td>Matched molecular pair relationships with known actives</td>
      </tr>
      <tr>
          <td>Shared MMP cores</td>
          <td>Scaffold cores shared between generated and known compounds</td>
      </tr>
  </tbody>
</table>
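<p>The analog metrics reduce to Tanimoto comparisons against the known actives. A pure-Python sketch with fingerprints represented as sets of on-bit indices (in practice these would be RDKit ECFP6 fingerprints):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def count_analogs(generated, known, threshold=0.4):
    """Count generated fingerprints with Tanimoto >= threshold
    to at least one known active."""
    return sum(
        any(tanimoto(g, k) >= threshold for k in known)
        for g in generated
    )
```
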
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tblaschke/reinvent-memory">reinvent-memory</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with prepared datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blaschke, T., Engkvist, O., Bajorath, J., &amp; Chen, H. (2020). Memory-assisted reinforcement learning for diverse molecular de novo design. <em>Journal of Cheminformatics</em>, 12, 68. <a href="https://doi.org/10.1186/s13321-020-00473-0">https://doi.org/10.1186/s13321-020-00473-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blaschke2020memory,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Memory-assisted reinforcement learning for diverse molecular de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blaschke, Thomas and Engkvist, Ola and Bajorath, J{\&#34;u}rgen and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00473-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LSTM Neural Network for Drug-Like Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/</guid><description>An LSTM neural network trained on 509K ChEMBL SMILES generates one million novel drug-like molecules with realistic substructures and bioactivity profiles.</description><content:encoded><![CDATA[<h2 id="an-early-method-for-lstm-based-molecular-generation">An Early Method for LSTM-Based Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that applies character-level LSTM networks to the task of de novo drug-like molecule generation. The primary contribution is demonstrating that an LSTM trained on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings from a large bioactive compound database (ChEMBL) can produce novel, diverse molecules whose chemical properties closely match those of known drug-like compounds. The paper also validates the generated molecules through virtual screening with profile QSAR models, showing comparable predicted bioactivity to the training set.</p>
<h2 id="the-challenge-of-exploring-drug-like-chemical-space">The Challenge of Exploring Drug-Like Chemical Space</h2>
<p>The theoretical space of drug-like molecules is astronomically large. Brute-force enumeration approaches such as <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> (which catalogued 166 billion molecules) are feasible only for small molecules, and full enumeration of molecules with 25-30 heavy atoms (the typical size of drug molecules) remains computationally intractable. Traditional cheminformatics approaches to sampling this space rely on fragment combination, evolutionary algorithms, or particle swarm optimization.</p>
<p>The authors position LSTM networks as a viable alternative. LSTMs had already demonstrated the ability to learn sequential structure in domains like text and music generation, making them natural candidates for learning SMILES grammar and generating novel valid molecular strings. At the time of writing (late 2017), several groups were exploring this direction, including Bjerrum and Threlfall (ZINC-based generation), <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> (VAE-based latent space design), <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">Olivecrona et al.</a> (RL-guided generation), and Segler et al. (focused library design). This paper contributes a large-scale empirical study with detailed analysis of the generated molecules&rsquo; chemical quality.</p>
<h2 id="character-level-lstm-with-temperature-based-sampling">Character-Level LSTM with Temperature-Based Sampling</h2>
<p>The core approach is straightforward: train an LSTM to predict the next character in a SMILES string, then sample from the trained model to generate new molecules character by character.</p>
<p>The network architecture consists of:</p>
<ul>
<li>Two stacked LSTM layers (which learn the SMILES grammar)</li>
<li>A dropout layer for regularization</li>
<li>A dense output layer with 23 neurons (one per character in the reduced SMILES alphabet) and softmax activation</li>
</ul>
<p>Training used the RMSProp optimizer, with the learning rate gradually decreased from 0.01 to 0.0002 over the course of training. At generation time, a temperature parameter controls the randomness of character sampling, producing more diverse structures rather than reproducing training molecules too closely.</p>
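<p>Temperature sampling divides the logits before the softmax: low temperatures concentrate probability on the most likely character, high temperatures flatten the distribution. A minimal sketch of such a sampler (the paper does not spell out the exact mechanics, so this is an assumption about the standard formulation):</p>

```python
import math
import random

def sample_char(logits, temperature=1.0):
    """Sample an index from a temperature-scaled softmax over logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # stabilise the exponentials
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Inverse-CDF sampling over the softmax probabilities.
    r = random.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1
```
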
<p>A key preprocessing step reduces the SMILES alphabet to 23 characters. Multi-character atom tokens are replaced with single characters (<code>Cl</code> → <code>L</code>, <code>Br</code> → <code>R</code>, <code>[nH]</code> → <code>A</code>). Only the organic atom subset (<code>H</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>S</code>, <code>P</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, <code>I</code>) is retained. Charged molecules, stereo information, and molecules with more than 5 ring closures are excluded. The training corpus totals 23,664,668 characters, with 40-character windows used as input sequences during training.</p>
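<p>The alphabet reduction itself is a handful of string substitutions. The token mapping below comes from the paper; the roundtrip helper is an illustrative addition:</p>

```python
# Single-character substitutions used to shrink the SMILES alphabet.
SUBSTITUTIONS = [("[nH]", "A"), ("Cl", "L"), ("Br", "R")]

def reduce_smiles(smiles: str) -> str:
    """Map multi-character SMILES tokens to single characters."""
    for token, char in SUBSTITUTIONS:
        smiles = smiles.replace(token, char)
    return smiles

def expand_smiles(reduced: str) -> str:
    """Invert the reduction before handing strings to a SMILES parser."""
    for token, char in SUBSTITUTIONS:
        reduced = reduced.replace(char, token)
    return reduced
```
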
<h2 id="training-on-chembl-and-generating-one-million-molecules">Training on ChEMBL and Generating One Million Molecules</h2>
<h3 id="training-data">Training Data</h3>
<p>The training set consists of 509,000 bioactive molecules from ChEMBL with reported activity below 10 micromolar on any target.</p>
<h3 id="generation-and-filtering">Generation and Filtering</h3>
<p>The LSTM generates SMILES strings character by character, and each generated string passes through a two-stage validation:</p>
<ol>
<li><strong>Bracket and ring closure check</strong> (fast, text-based): 54% of generated SMILES are discarded for unpaired brackets or ring closures</li>
<li><strong>Full chemical parsing with RDKit</strong>: a further 14% fail due to unrealistic aromatic systems or incorrect valences</li>
</ol>
<p>The final yield is 32%: roughly one in three generated SMILES strings corresponds to a valid molecule.</p>
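<p>The first stage needs no chemistry at all. A sketch of such a text-level pre-filter (the paper's exact check may differ in detail): balanced parentheses and brackets, and ring-closure digits occurring in pairs:</p>

```python
def quick_smiles_check(smiles: str) -> bool:
    """Cheap text-level filter run before full chemical parsing.

    Rejects strings with unbalanced parentheses/brackets or
    ring-closure digits that do not come in pairs.
    """
    for open_c, close_c in ("()", "[]"):
        depth = 0
        for c in smiles:
            if c == open_c:
                depth += 1
            elif c == close_c:
                depth -= 1
            if depth < 0:          # closing before opening
                return False
        if depth != 0:             # unclosed bracket
            return False
    for digit in "123456789":
        if smiles.count(digit) % 2:
            return False
    return True
```
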
<p>One million valid molecules were generated in under 2 hours on 300 CPUs.</p>
<h3 id="novelty-and-diversity">Novelty and Diversity</h3>
<p>Out of one million generated molecules, only 2,774 (0.28%) were identical to molecules in the training ChEMBL set. The generated set contained 627,000 unique scaffolds compared to 172,000 in ChEMBL, with an overlap of only 18,000 scaffolds. This demonstrates substantial novelty and diversity.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Calculated molecular descriptors (molecular weight, logP, and topological polar surface area) for the generated molecules closely matched the distributions of the ChEMBL training set. The synthetic accessibility score distributions were also practically identical, indicating comparable molecular complexity.</p>
<h3 id="substructure-feature-comparison">Substructure Feature Comparison</h3>
<p>The paper compares substructure features across three molecule sets: ChEMBL training data, LSTM-generated molecules, and a naive SMILES baseline generator. The naive generator uses only character frequency statistics and basic SMILES syntax rules, producing primarily macrocycles with very few fused aromatic systems.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>ChEMBL (%)</th>
          <th>LSTM Generated (%)</th>
          <th>Naive Baseline (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No rings</td>
          <td>0.4</td>
          <td>0.4</td>
          <td>0.1</td>
      </tr>
      <tr>
          <td>1 ring</td>
          <td>2.8</td>
          <td>4.3</td>
          <td>13.2</td>
      </tr>
      <tr>
          <td>2 rings</td>
          <td>14.8</td>
          <td>23.1</td>
          <td>17.7</td>
      </tr>
      <tr>
          <td>3 rings</td>
          <td>32.2</td>
          <td>43.5</td>
          <td>27.3</td>
      </tr>
      <tr>
          <td>4 rings</td>
          <td>32.7</td>
          <td>23.9</td>
          <td>25.2</td>
      </tr>
      <tr>
          <td>&gt;4 rings</td>
          <td>17.2</td>
          <td>4.8</td>
          <td>16.5</td>
      </tr>
      <tr>
          <td>Fused aromatic rings</td>
          <td>38.8</td>
          <td>30.9</td>
          <td>0.2</td>
      </tr>
      <tr>
          <td>Large rings (&gt;8)</td>
          <td>0.4</td>
          <td>1.8</td>
          <td>75.9</td>
      </tr>
      <tr>
          <td>Spiro rings</td>
          <td>1.9</td>
          <td>0.6</td>
          <td>0.6</td>
      </tr>
      <tr>
          <td>Contains N</td>
          <td>96.5</td>
          <td>96.1</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td>Contains O</td>
          <td>93.0</td>
          <td>92.0</td>
          <td>85.5</td>
      </tr>
      <tr>
          <td>Contains S</td>
          <td>35.6</td>
          <td>27.9</td>
          <td>39.6</td>
      </tr>
      <tr>
          <td>Contains halogen</td>
          <td>40.7</td>
          <td>38.8</td>
          <td>49.4</td>
      </tr>
  </tbody>
</table>
<p>The LSTM-generated molecules closely mirror the ChEMBL distributions, while the naive generator fails to capture drug-like structural patterns. The LSTM tends to slightly over-represent 2-3 ring systems and under-represent 4+ ring systems relative to ChEMBL. Functional group distributions also closely matched between ChEMBL and the LSTM output.</p>
<h3 id="virtual-screening-validation">Virtual Screening Validation</h3>
<p>The generated molecules were evaluated using profile QSAR models for 159 ChEMBL kinase assays. The six best models (with realistic test set R-squared &gt; 0.75) were used to predict pIC50 values for both actual ChEMBL compounds and generated compounds. The cumulative frequency distributions of predicted activity were nearly identical between the two sets.</p>
<p>Kolmogorov-Smirnov (KS) tests on random samples of 1,000 compounds confirmed this quantitatively:</p>
<table>
  <thead>
      <tr>
          <th>Assay</th>
          <th>KS D</th>
          <th>Distributions Differ?</th>
          <th>Mean (Real)</th>
          <th>Mean (Gen)</th>
          <th>Stdev (Real)</th>
          <th>Stdev (Gen)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>688395</td>
          <td>6.01%</td>
          <td>No</td>
          <td>4.66</td>
          <td>4.69</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>668624</td>
          <td>3.60%</td>
          <td>No</td>
          <td>4.86</td>
          <td>4.86</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>9.90%</td>
          <td>Yes</td>
          <td>5.33</td>
          <td>5.26</td>
          <td>0.34</td>
          <td>0.30</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>4.30%</td>
          <td>No</td>
          <td>5.18</td>
          <td>5.13</td>
          <td>0.47</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>688781</td>
          <td>2.20%</td>
          <td>No</td>
          <td>4.83</td>
          <td>4.82</td>
          <td>0.26</td>
          <td>0.25</td>
      </tr>
      <tr>
          <td>809170</td>
          <td>8.70%</td>
          <td>Yes</td>
          <td>5.12</td>
          <td>5.07</td>
          <td>0.51</td>
          <td>0.46</td>
      </tr>
  </tbody>
</table>
<p>For 4 of 6 models, the null hypothesis that the distributions are the same could not be rejected at the 95% confidence level (critical D = 6.04%). Even for the two assays where the KS test rejected the null hypothesis, the maximum vertical distance between distributions was below 10%.</p>
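<p>The two-sample KS D statistic is simply the largest vertical gap between the two empirical CDFs; a pure-Python sketch:</p>

```python
def ks_d_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the maximum vertical distance
    between the two empirical cumulative distribution functions."""
    d = 0.0
    for v in set(sample_a) | set(sample_b):
        # Empirical CDF = fraction of points <= v.
        cdf_a = sum(x <= v for x in sample_a) / len(sample_a)
        cdf_b = sum(x <= v for x in sample_b) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

<p>For two samples of 1,000 compounds each, the standard large-sample approximation puts the 95% critical value near 6%, consistent with the 6.04% threshold quoted above.</p>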
<h2 id="generated-molecules-are-novel-drug-like-and-potentially-bioactive">Generated Molecules Are Novel, Drug-Like, and Potentially Bioactive</h2>
<p>The key findings of this study are:</p>
<ol>
<li><strong>High novelty</strong>: Only 0.28% of generated molecules match training compounds; 627K novel scaffolds were produced versus 172K in ChEMBL</li>
<li><strong>Drug-like quality</strong>: Physicochemical properties, substructure features, functional group distributions, and synthetic accessibility scores all closely match the ChEMBL training distribution, without these being explicit constraints</li>
<li><strong>Predicted bioactivity</strong>: Virtual screening with profile QSAR models shows the generated molecules have comparable predicted activity profiles to known bioactive compounds</li>
<li><strong>Scalability</strong>: One million valid molecules in under 2 hours on 300 CPUs, with the potential to scale to billions with GPU acceleration</li>
<li><strong>LSTM superiority over naive baselines</strong>: A simple statistical SMILES generator using only character frequencies produces chemically unrealistic molecules (mostly macrocycles), demonstrating that the LSTM genuinely learns drug-like chemical patterns</li>
</ol>
<p>The main limitations are the 32% validity rate (68% of generated SMILES are invalid), the exclusion of stereochemistry and charged molecules from the training set, and the lack of any goal-directed generation capability (the model produces unconditional samples from the training distribution). The code was described as &ldquo;available on request&rdquo; from the corresponding author rather than publicly released.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL bioactive molecules</td>
          <td>509,000 molecules</td>
          <td>Activity &lt; 10 uM on any target; organic atoms only; no charges or stereo</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Double-stacked LSTM layers with dropout</li>
<li>Softmax output over 23-character reduced SMILES alphabet</li>
<li>RMSProp optimizer with learning rate annealed from 0.01 to 0.0002</li>
<li>Temperature-based sampling at generation time</li>
<li>40-character input windows during training</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture consists of two LSTM layers, a dropout layer, and a 23-neuron dense output layer. Exact hidden unit counts and dropout rates are not specified in the paper.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES rate</td>
          <td>32%</td>
          <td>After bracket check and RDKit parsing</td>
      </tr>
      <tr>
          <td>Novelty (vs. training)</td>
          <td>99.72%</td>
          <td>Only 2,774 of 1M match ChEMBL</td>
      </tr>
      <tr>
          <td>Unique scaffolds</td>
          <td>627,000</td>
          <td>vs. 172,000 in ChEMBL</td>
      </tr>
      <tr>
          <td>KS test (4/6 assays)</td>
          <td>Not significantly different</td>
          <td>At 95% confidence</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Generation: 300 CPUs for under 2 hours (1 million valid molecules)</li>
<li>Training hardware not specified</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ertl, P., Lewis, R., Martin, E., &amp; Polyakov, V. (2017). In silico generation of novel, drug-like chemical matter using the LSTM neural network. <em>arXiv preprint</em>, arXiv:1712.07449.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ertl2017silico,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{In silico generation of novel, drug-like chemical matter using the LSTM neural network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ertl, Peter and Lewis, Richard and Martin, Eric and Polyakov, Valery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.07449}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LatentGAN: Latent-Space GAN for Molecular Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/</guid><description>LatentGAN combines a SMILES heteroencoder with a Wasserstein GAN to generate novel drug-like molecules in latent space, avoiding SMILES syntax issues.</description><content:encoded><![CDATA[<h2 id="a-gan-operating-in-learned-latent-space-for-molecular-design">A GAN Operating in Learned Latent Space for Molecular Design</h2>
<p>LatentGAN is a <strong>Method</strong> paper that introduces a two-stage architecture for de novo molecular generation. The first stage trains a heteroencoder to map SMILES strings into a continuous latent vector space. The second stage trains a Wasserstein GAN with gradient penalty (WGAN-GP) to generate new latent vectors that, when decoded, produce valid and novel molecular structures. The key contribution is decoupling the GAN from direct SMILES string generation, allowing the adversarial training to focus on learning the distribution of molecular latent representations rather than character-level sequence generation.</p>
<h2 id="limitations-of-direct-smiles-generation-with-gans">Limitations of Direct SMILES Generation with GANs</h2>
<p>Prior GAN-based molecular generation methods such as ORGAN and ORGANIC operated directly on SMILES strings. This created a fundamental challenge: the generator had to simultaneously learn valid SMILES syntax and the distribution of chemically meaningful molecules. ORGAN struggled with optimizing discrete molecular properties like Lipinski&rsquo;s Rule of Five, while ORGANIC showed limited success beyond the QED drug-likeness score. Other approaches (RANC, ATNC) substituted more advanced recurrent architectures but still operated in the discrete SMILES space.</p>
<p>Meanwhile, variational autoencoders (VAEs) demonstrated that working in continuous latent space could enable molecular generation, but they relied on forcing the latent distribution to match a Gaussian prior through KL divergence. This assumption is not necessarily appropriate for chemical space, which is inherently discontinuous.</p>
<p>RNN-based methods with transfer learning offered an alternative for target-biased generation, but the authors hypothesized that combining GANs with learned latent representations could produce complementary chemical space coverage.</p>
<h2 id="heteroencoder-plus-wasserstein-gan-architecture">Heteroencoder Plus Wasserstein GAN Architecture</h2>
<p>The core innovation of LatentGAN is separating molecular representation learning from adversarial generation through a two-component pipeline.</p>
<h3 id="heteroencoder">Heteroencoder</h3>
<p>The heteroencoder is an autoencoder trained on pairs of different non-canonical (randomized) SMILES representations of the same molecule. This is distinct from a standard autoencoder because the input and target SMILES are different representations of the same structure.</p>
<p>The encoder uses a two-layer bidirectional LSTM with 512 units per layer (256 forward, 256 backward). The concatenated output feeds into a 512-dimensional feed-forward layer. During training, zero-centered Gaussian noise with $\sigma = 0.1$ is added to the latent vector as regularization. The decoder is a four-layer unidirectional LSTM with a softmax output layer. Batch normalization with momentum 0.9 is applied to all hidden layers except the noise layer.</p>
<p>Training uses teacher forcing with categorical cross-entropy loss for 100 epochs. The learning rate starts at $10^{-3}$ for the first 50 epochs and decays exponentially to $10^{-6}$ by the final epoch. After training, the noise layer is deactivated for deterministic encoding and decoding.</p>
<p>An important design choice is that the heteroencoder makes no assumption about the latent space distribution (unlike VAEs with their KL divergence term). The latent space is shaped purely by reconstruction loss, and the GAN later learns to sample from this unconstrained distribution.</p>
<h3 id="wasserstein-gan-with-gradient-penalty">Wasserstein GAN with Gradient Penalty</h3>
<p>The GAN uses the WGAN-GP formulation. The critic (discriminator) consists of three feed-forward layers of 256 dimensions each with leaky ReLU activations (no activation on the final layer). The generator has five feed-forward layers of 256 dimensions each with batch normalization and leaky ReLU between layers.</p>
<p>The training ratio is 5:1, with five critic updates for every generator update. The generator takes random vectors sampled from a uniform distribution and learns to produce latent vectors indistinguishable from the real encoded molecular latent vectors.</p>
<p>The WGAN-GP loss for the critic is:</p>
<p>$$L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$</p>
<p>where $\lambda$ is the gradient penalty coefficient, $\mathbb{P}_r$ is the real data distribution (encoded latent vectors), $\mathbb{P}_g$ is the generator distribution, and $\mathbb{P}_{\hat{x}}$ samples uniformly along straight lines between pairs of real and generated points.</p>
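<p>To make the three loss terms concrete, here is a toy numerical sketch using a linear critic, whose input-gradient is constant and therefore available in closed form; real implementations obtain $\nabla_{\hat{x}} D(\hat{x})$ by automatic differentiation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.normal(size=8)                 # toy linear critic D(x) = w @ x
D = lambda x: x @ w

real = rng.normal(size=(16, 8))        # encoded latent vectors ~ P_r
fake = rng.normal(size=(16, 8))        # generator outputs ~ P_g

# Sample uniformly along straight lines between real/generated pairs (P_xhat)
eps = rng.uniform(size=(16, 1))
x_hat = eps * real + (1 - eps) * fake

lam = 10.0                             # gradient penalty coefficient lambda
# For a linear critic the input-gradient at every x_hat is just w
grad_norms = np.full(len(x_hat), np.linalg.norm(w))
penalty = lam * np.mean((grad_norms - 1.0) ** 2)

critic_loss = D(fake).mean() - D(real).mean() + penalty
print(critic_loss)
```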
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>At inference time, the full pipeline operates as: (1) sample a random vector, (2) pass through the trained generator to produce a latent vector, (3) decode the latent vector into a SMILES string using the pretrained heteroencoder decoder.</p>
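<p>A minimal sketch of these three steps (the function bodies below are placeholders, not the trained networks):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def generator(z):
    """Placeholder for the trained WGAN-GP generator: random vector -> 512-d latent."""
    return np.tanh(z).mean() * np.ones(512)

def decode(latent):
    """Placeholder for the pretrained heteroencoder decoder: latent -> SMILES."""
    return "CCO" if latent.mean() > 0 else "c1ccccc1"

z = rng.uniform(-1, 1, size=128)  # (1) sample a random vector
latent = generator(z)             # (2) generator produces a latent vector
smiles = decode(latent)           # (3) decoder emits a SMILES string
print(smiles)
```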
<h2 id="experiments-on-drug-like-and-target-biased-generation">Experiments on Drug-Like and Target-Biased Generation</h2>
<h3 id="datasets">Datasets</h3>
<p>The heteroencoder was trained on 1,347,173 SMILES from ChEMBL 25, standardized with MolVS and restricted to molecules with atoms from {H, C, N, O, S, Cl, Br} and at most 50 heavy atoms.</p>
<p>For general drug-like generation, a random subset of 100,000 ChEMBL compounds was used to train the GAN model for 30,000 epochs.</p>
<p>For target-biased generation, three datasets were extracted from ExCAPE-DB for EGFR, HTR1A, and S1PR1 targets. These were clustered into training and test sets to ensure chemical series were not split across sets.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Training Set</th>
          <th>Test Set</th>
          <th>SVM ROC-AUC</th>
          <th>SVM Kappa</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>2,949</td>
          <td>2,326</td>
          <td>0.850</td>
          <td>0.56</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>48,283</td>
          <td>23,048</td>
          <td>0.993</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>49,381</td>
          <td>23,745</td>
          <td>0.995</td>
          <td>0.91</td>
      </tr>
  </tbody>
</table>
<p>SVM target prediction models using 2048-bit FCFP6 fingerprints were built with scikit-learn to evaluate generated compounds.</p>
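<p>A minimal version of this evaluation setup, with random bit vectors standing in for the FCFP6 fingerprints (the real features would come from a cheminformatics toolkit such as RDKit):</p>

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Random 2048-bit vectors stand in for FCFP6 fingerprints of molecules
X = rng.integers(0, 2, size=(200, 2048)).astype(float)
y = rng.integers(0, 2, size=200)          # active / inactive labels

clf = SVC(probability=True).fit(X, y)     # scikit-learn SVM, as in the paper
proba = clf.predict_proba(X[:5])[:, 1]    # predicted probability of activity
print(proba.shape)
```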
<h3 id="baselines">Baselines</h3>
<p>RNN-based generative models with transfer learning served as the primary baseline. A prior RNN model was trained on the same ChEMBL set, then fine-tuned on each target dataset. The LatentGAN was also benchmarked on the MOSES platform against VAE, JTN-VAE, and AAE architectures.</p>
<h3 id="heteroencoder-performance">Heteroencoder Performance</h3>
<p>The heteroencoder achieved 99% valid SMILES on the training set and 98% on the test set. Reconstruction error (decoding to a different molecule) was 18% on training and 20% on test. Notably, decoding to a different valid SMILES of the same molecule is not counted as an error.</p>
<h3 id="target-biased-generation-results">Target-Biased Generation Results</h3>
<p>From 50,000 sampled SMILES per target model:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Arch.</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>Active (%)</th>
          <th>Recovered Actives (%)</th>
          <th>Recovered Neighbors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>GAN</td>
          <td>86</td>
          <td>56</td>
          <td>97</td>
          <td>71</td>
          <td>5.26</td>
          <td>196</td>
      </tr>
      <tr>
          <td>EGFR</td>
          <td>RNN</td>
          <td>96</td>
          <td>46</td>
          <td>95</td>
          <td>65</td>
          <td>7.74</td>
          <td>238</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>GAN</td>
          <td>86</td>
          <td>66</td>
          <td>95</td>
          <td>71</td>
          <td>5.05</td>
          <td>284</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>RNN</td>
          <td>96</td>
          <td>50</td>
          <td>90</td>
          <td>81</td>
          <td>7.28</td>
          <td>384</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>GAN</td>
          <td>89</td>
          <td>31</td>
          <td>98</td>
          <td>44</td>
          <td>0.93</td>
          <td>24</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>RNN</td>
          <td>97</td>
          <td>35</td>
          <td>97</td>
          <td>65</td>
          <td>3.72</td>
          <td>43</td>
      </tr>
  </tbody>
</table>
<h3 id="moses-benchmark">MOSES Benchmark</h3>
<p>On the MOSES benchmark (trained on a ZINC subset of 1,584,663 compounds, sampled 30,000 SMILES), LatentGAN showed comparable or better results than JTN-VAE and AAE on Fréchet ChemNet Distance (FCD), Fragment similarity, and Scaffold similarity, while producing slightly worse nearest-neighbor cosine similarity (SNN). The standard VAE showed signs of mode collapse with high test metric overlap and low novelty.</p>
<h2 id="complementary-generation-and-drug-likeness-preservation">Complementary Generation and Drug-Likeness Preservation</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Validity and novelty</strong>: LatentGAN achieved 86-89% validity on target-biased tasks (lower than RNN&rsquo;s 96-97%) but produced higher uniqueness on two of three targets and comparable or higher novelty (95-98%).</p>
<p><strong>Complementary chemical space</strong>: The overlap between LatentGAN-generated and RNN-generated active compounds was very small at both the compound and scaffold levels. A probabilistic analysis showed that, even with extended sampling, the RNN model would be unlikely to cover the LatentGAN output space. This suggests the two architectures can work complementarily in de novo design campaigns.</p>
<p><strong>Drug-likeness</strong>: QED score distributions of LatentGAN-generated compounds closely matched training set distributions across all three targets, with training compounds showing only slightly higher drug-likeness. SA score distributions were similarly well-preserved.</p>
<p><strong>Chemical space coverage</strong>: PCA analysis using MQN fingerprints confirmed that generated compounds occupy most of the chemical space of the training sets. Some regions of the PCA plots contained compounds predicted as inactive, which corresponded to non-drug-like outliers in the training data.</p>
<p><strong>Novel scaffolds</strong>: About 14% of scaffolds in the sampled sets had similarity below 0.4 to the training set across all three targets, indicating LatentGAN can generate genuinely novel chemical scaffolds. Around 5% of generated compounds were identical to training set compounds, while 21-25% had Tanimoto similarity below 0.4.</p>
<h3 id="limitations">Limitations</h3>
<p>The paper acknowledges several limitations. The 18-20% heteroencoder reconstruction error means a non-trivial fraction of encoded molecules decode to different structures. Validity rates (86-89%) are lower than RNN baselines (96-97%). The S1PR1 target showed notably lower uniqueness (31%) and predicted activity (44%) compared to the other targets, possibly due to the smaller effective training set of active compounds. The paper does not report specific hardware requirements or training times. No wet-lab experimental validation of generated compounds was performed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors envision LatentGAN as a complementary tool to existing RNN-based generative models, with the two architectures covering different regions of chemical space. The approach of operating in learned latent space rather than directly on SMILES strings offers a general framework that could be extended to other molecular representations or generation objectives.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heteroencoder training</td>
          <td>ChEMBL 25 (subset)</td>
          <td>1,347,173 SMILES</td>
          <td>Standardized with MolVS; atoms restricted to H, C, N, O, S, Cl, Br; max 50 heavy atoms</td>
      </tr>
      <tr>
          <td>General GAN training</td>
          <td>ChEMBL 25 (random subset)</td>
          <td>100,000</td>
          <td>Subset of heteroencoder training set</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (EGFR)</td>
          <td>2,949 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (HTR1A)</td>
          <td>48,283 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (S1PR1)</td>
          <td>49,381 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>ZINC (MOSES subset)</td>
          <td>1,584,663</td>
          <td>Canonical SMILES</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Heteroencoder</strong>: Bidirectional LSTM encoder (2 layers, 512 units) + unidirectional LSTM decoder (4 layers), trained with teacher forcing and categorical cross-entropy for 100 epochs</li>
<li><strong>GAN</strong>: WGAN-GP with 5:1 critic-to-generator training ratio. General model trained 30,000 epochs; target models trained 10,000 epochs</li>
<li><strong>Evaluation</strong>: SVM classifiers with FCFP6 fingerprints (2048 bits) for activity prediction; MQN fingerprints for PCA-based chemical space analysis; Murcko scaffolds for scaffold-level analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Heteroencoder: 512-dim latent space, bidirectional LSTM encoder, unidirectional LSTM decoder</li>
<li>Generator: 5 feed-forward layers of 256 dims with batch norm and leaky ReLU</li>
<li>Critic: 3 feed-forward layers of 256 dims with leaky ReLU</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LatentGAN (EGFR)</th>
          <th>RNN Baseline (EGFR)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>86%</td>
          <td>96%</td>
          <td>Percent valid SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>56%</td>
          <td>46%</td>
          <td>Percent unique among valid</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>97%</td>
          <td>95%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Predicted active</td>
          <td>71%</td>
          <td>65%</td>
          <td>By SVM model</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Dierme/latent-gan">LatentGAN source code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Includes trained heteroencoder model and training sets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Prykhodko, O., Johansson, S.V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E.J., Engkvist, O., &amp; Chen, H. (2019). A de novo molecular generation method using latent vector based generative adversarial network. <em>Journal of Cheminformatics</em>, 11(1), 74. <a href="https://doi.org/10.1186/s13321-019-0397-9">https://doi.org/10.1186/s13321-019-0397-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{prykhodko2019latentgan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A de novo molecular generation method using latent vector based generative adversarial network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Prykhodko, Oleksii and Johansson, Simon Viet and Kotsias, Panagiotis-Christos and Ar{\&#39;u}s-Pous, Josep and Bjerrum, Esben Jannik and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{74}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0397-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Grammar VAE: Generating Valid Molecules via CFGs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/</guid><description>The Grammar VAE encodes and decodes molecular parse trees from context-free grammars, guaranteeing syntactically valid SMILES outputs during generation.</description><content:encoded><![CDATA[<h2 id="a-grammar-constrained-vae-for-discrete-data-generation">A Grammar-Constrained VAE for Discrete Data Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Grammar Variational Autoencoder (GVAE), a variant of the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder</a> that operates directly on parse trees from context-free grammars (CFGs) rather than on raw character sequences. The primary contribution is a decoding mechanism that uses a stack and grammar-derived masks to restrict the output at every timestep to only syntactically valid production rules. This guarantees that every decoded output is a valid string under the grammar, addressing a fundamental limitation of character-level VAEs when applied to structured discrete data such as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> molecular strings and arithmetic expressions.</p>
<h2 id="why-character-level-vaes-fail-on-structured-discrete-data">Why Character-Level VAEs Fail on Structured Discrete Data</h2>
<p>Generative models for continuous data (images, audio) had achieved impressive results by 2017, but generating structured discrete data remained difficult. The key challenge is that string representations of molecules and mathematical expressions are brittle: small perturbations to a character sequence often produce invalid outputs. <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> demonstrated a character-level VAE (CVAE) for SMILES strings that could encode molecules into a continuous latent space and decode them back, enabling latent-space optimization for molecular design. However, the CVAE frequently decoded latent points into strings that were not valid SMILES, particularly when exploring regions of latent space far from training data.</p>
<p>The fundamental issue is that character-level decoders must implicitly learn the syntactic rules of the target language from data alone. For SMILES, this includes matching parentheses, valid atom types, proper bonding, and ring closure notation. The GVAE addresses this by giving the decoder explicit knowledge of the grammar, so it can focus entirely on learning the semantic structure of the data.</p>
<h2 id="core-innovation-stack-based-grammar-masking-in-the-decoder">Core Innovation: Stack-Based Grammar Masking in the Decoder</h2>
<p>The GVAE encodes and decodes sequences of production rules from a context-free grammar rather than sequences of characters.</p>
<p><strong>Encoding.</strong> Given an input string (e.g., a SMILES molecule), the encoder first parses it into a parse tree using the CFG, then performs a left-to-right pre-order traversal of the tree to extract an ordered sequence of production rules. Each rule is represented as a one-hot vector of dimension $K$ (total number of production rules in the grammar). The resulting $T(\mathbf{X}) \times K$ matrix is processed by a convolutional neural network to produce the mean and variance of a Gaussian posterior $q_{\phi}(\mathbf{z} \mid \mathbf{X})$.</p>
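<p>The rule-sequence encoding amounts to building a zero-padded one-hot matrix; a toy grammar size and rule sequence are used below for illustration:</p>

```python
import numpy as np

K, T_max = 4, 10              # number of production rules, max sequence length
rule_seq = [0, 1, 3, 2]       # pre-order rule indices from the parse tree (toy)

X = np.zeros((T_max, K))
X[np.arange(len(rule_seq)), rule_seq] = 1.0   # one row per rule, zero-padded
print(X.shape, int(X.sum()))
```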
<p><strong>Decoding with grammar masks.</strong> The decoder maps a latent vector $\mathbf{z}$ through an RNN to produce a matrix of logits $\mathbf{F} \in \mathbb{R}^{T_{max} \times K}$. The key innovation is a last-in first-out (LIFO) stack that tracks the current parsing state. At each timestep $t$, the decoder:</p>
<ol>
<li>Pops the top non-terminal $\alpha$ from the stack</li>
<li>Applies a fixed binary mask $\mathbf{m}_{\alpha} \in \{0, 1\}^K$ that zeros out all production rules whose left-hand side is not $\alpha$</li>
<li>Samples a production rule from the masked softmax distribution:</li>
</ol>
<p>$$
p(\mathbf{x}_{t} = k \mid \alpha, \mathbf{z}) = \frac{m_{\alpha,k} \exp(f_{tk})}{\sum_{j=1}^{K} m_{\alpha,j} \exp(f_{tj})}
$$</p>
<ol start="4">
<li>Pushes the right-hand-side non-terminals of the selected rule onto the stack (right-to-left, so the leftmost is on top)</li>
</ol>
<p>This process continues until the stack is empty or $T_{max}$ timesteps are reached. Because the mask restricts selection to only those rules applicable to the current non-terminal, every generated sequence of production rules is guaranteed to be a valid derivation under the grammar.</p>
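<p>Steps 1&ndash;4 can be sketched directly. The toy grammar and random logits below are illustrative; only the stack-and-mask logic follows the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CFG: each rule is (left-hand side, right-hand-side non-terminals)
rules = [("S", ["S", "T"]),   # S -> S + T
         ("S", ["T"]),        # S -> T
         ("T", []),           # T -> 'x'
         ("T", [])]           # T -> '1'
K = len(rules)

def masked_sample(logits, alpha):
    mask = np.array([1.0 if lhs == alpha else 0.0 for lhs, _ in rules])
    p = mask * np.exp(logits)
    p /= p.sum()                             # masked softmax over rules
    return rng.choice(K, p=p)

def decode(logits_seq):
    stack, derivation = ["S"], []
    for logits in logits_seq:
        if not stack:                        # empty stack: derivation complete
            break
        alpha = stack.pop()                  # (1) pop top non-terminal
        k = masked_sample(logits, alpha)     # (2)-(3) mask logits, sample rule
        derivation.append(k)
        stack.extend(reversed(rules[k][1]))  # (4) push RHS, leftmost on top
    return derivation, stack                 # non-empty stack at T_max => invalid

derivation, leftover = decode(rng.normal(size=(20, K)))
print(derivation, "complete" if not leftover else "incomplete")
```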
<p><strong>Training.</strong> The model is trained by maximizing the ELBO:</p>
<p>$$
\mathcal{L}(\phi, \theta; \mathbf{X}) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{X})} \left[ \log p_{\theta}(\mathbf{X}, \mathbf{z}) - \log q_{\phi}(\mathbf{z} \mid \mathbf{X}) \right]
$$</p>
<p>where the likelihood factorizes as:</p>
<p>$$
p(\mathbf{X} \mid \mathbf{z}) = \prod_{t=1}^{T(\mathbf{X})} p(\mathbf{x}_{t} \mid \mathbf{z})
$$</p>
<p>During training, the masks at each timestep are determined by the ground-truth production rule sequence, so no stack simulation is needed. The stack-based decoding is only required at generation time.</p>
<p><strong>Syntactic vs. semantic validity.</strong> The grammar guarantees syntactic validity but not semantic validity. The GVAE can still produce chemically implausible molecules (e.g., an oxygen atom with three bonds) because such constraints are not context-free. SMILES ring-bond digit matching is also not context-free, so the grammar cannot enforce it. Additionally, sequences that have not emptied the stack by $T_{max}$ are marked invalid.</p>
<h2 id="experiments-on-symbolic-regression-and-molecular-optimization">Experiments on Symbolic Regression and Molecular Optimization</h2>
<p>The authors evaluate the GVAE on two domains: arithmetic expressions and molecules. Both use Bayesian optimization (BO) over the learned latent space.</p>
<p><strong>Setup.</strong> After training each VAE, the authors encode training data into latent vectors and train a sparse Gaussian process (SGP) with 500 inducing points to predict properties from latent representations. They then run batch BO with expected improvement, selecting 50 candidates per iteration.</p>
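<p>Expected improvement has a closed form under the GP&rsquo;s Gaussian predictive distribution. The sketch below states it for minimization (as in the expression task); this is the standard formula, not code from the paper:</p>

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for minimization: E[max(f_best - f, 0)] with f ~ N(mu, sigma^2)."""
    if sigma <= 0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi

# Lower predicted mean and higher uncertainty both raise the acquisition value
print(expected_improvement(mu=0.5, sigma=0.2, f_best=1.0))
```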
<h3 id="arithmetic-expressions">Arithmetic Expressions</h3>
<ul>
<li><strong>Data</strong>: 100,000 randomly generated univariate expressions from a simple grammar (3 binary operators, 2 unary operators, 3 constants), each with at most 15 production rules</li>
<li><strong>Target</strong>: Find an expression minimizing $\log(1 + \text{MSE})$ against the true function $1/3 + x + \sin(x \cdot x)$</li>
<li><strong>BO iterations</strong>: 5, averaged over 10 repetitions</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.99 +/- 0.01</td>
          <td>3.47 +/- 0.24</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.86 +/- 0.06</td>
          <td>4.75 +/- 0.25</td>
      </tr>
  </tbody>
</table>
<p>The GVAE&rsquo;s best expression ($x/1 + \sin(3) + \sin(x \cdot x)$, score 0.04) nearly exactly recovers the true function, while the CVAE&rsquo;s best ($x \cdot 1 + \sin(3) + \sin(3/1)$, score 0.39) misses the sinusoidal component.</p>
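<p>Because the two expressions differ only by the constant $\sin(3) - 1/3$, the reported score is easy to verify numerically (the evaluation grid below is an assumption, though the score is grid-independent in this case):</p>

```python
import numpy as np

x = np.linspace(-10, 10, 1000)              # evaluation grid (assumed)
true = 1/3 + x + np.sin(x * x)              # target function
best = x/1 + np.sin(3) + np.sin(x * x)      # GVAE's best expression

score = np.log(1 + np.mean((true - best) ** 2))
print(round(score, 2))  # matches the reported 0.04
```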
<h3 id="molecular-optimization">Molecular Optimization</h3>
<ul>
<li><strong>Data</strong>: 250,000 SMILES strings from the ZINC database</li>
<li><strong>Target</strong>: Maximize penalized logP (water-octanol partition coefficient penalized for ring size and synthetic accessibility)</li>
<li><strong>BO iterations</strong>: 10, averaged over 5 trials</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.31 +/- 0.07</td>
          <td>-9.57 +/- 1.77</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.17 +/- 0.05</td>
          <td>-54.66 +/- 2.66</td>
      </tr>
  </tbody>
</table>
<p>The GVAE produces roughly twice as many valid molecules as the CVAE and finds molecules with substantially better penalized logP scores (best: 2.94 vs. 1.98).</p>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>Interpolation experiments show that the GVAE produces valid outputs at every intermediate point when linearly interpolating between two encoded expressions, while the CVAE passes through invalid strings. Grid searches around encoded molecules in the GVAE latent space show smooth transitions where neighboring points differ by single atoms.</p>
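<p>The interpolation itself is a simple linear walk in latent space; each intermediate vector would then be decoded (decoder omitted in this sketch):</p>

```python
import numpy as np

def interpolate(z_a, z_b, steps=7):
    """Evenly spaced points on the segment between two encoded latent vectors."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1 - t) * z_a + t * z_b for t in ts]

z_a, z_b = np.zeros(56), np.ones(56)   # 56-dim molecular latent space
path = interpolate(z_a, z_b)
print(len(path))
```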
<h3 id="predictive-performance">Predictive Performance</h3>
<p>Sparse GP models trained on GVAE latent features achieve better test RMSE and log-likelihood than those trained on CVAE features for both expressions and molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE (Expressions)</th>
          <th>CVAE (Expressions)</th>
          <th>GVAE (Molecules)</th>
          <th>CVAE (Molecules)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Test LL</td>
          <td>-1.320 +/- 0.001</td>
          <td>-1.397 +/- 0.003</td>
          <td>-1.739 +/- 0.004</td>
          <td>-1.812 +/- 0.004</td>
      </tr>
      <tr>
          <td>Test RMSE</td>
          <td>0.884 +/- 0.002</td>
          <td>0.975 +/- 0.004</td>
          <td>1.404 +/- 0.006</td>
          <td>1.504 +/- 0.006</td>
      </tr>
  </tbody>
</table>
<h3 id="reconstruction-and-prior-sampling">Reconstruction and Prior Sampling</h3>
<p>On held-out molecules, the GVAE achieves 53.7% reconstruction accuracy vs. 44.6% for the CVAE. When sampling from the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$, 7.2% of GVAE samples are valid molecules vs. 0.7% for the CVAE.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<p><strong>Key findings.</strong> Incorporating grammar structure into the VAE decoder consistently improves validity rates, latent space smoothness, downstream predictive performance, and Bayesian optimization outcomes across both domains. The approach is general: any domain with a context-free grammar can benefit.</p>
<p><strong>Limitations acknowledged by the authors.</strong></p>
<ul>
<li>The GVAE guarantees syntactic but not semantic validity. For molecules, invalid ring-bond patterns and chemically implausible structures can still be generated.</li>
<li>The molecular validity rate during BO (31%) is substantially higher than the CVAE (17%) but still means most decoded molecules are invalid, largely due to non-context-free constraints in SMILES.</li>
<li>The approach requires a context-free grammar for the target domain, which limits applicability to well-defined formal languages.</li>
<li>Sequences that do not complete parsing within $T_{max}$ timesteps are discarded as invalid.</li>
</ul>
<p><strong>Impact.</strong> The GVAE was an influential early contribution to constrained molecular generation. It directly inspired the Syntax-Directed VAE (SD-VAE) by Dai et al. (2018), which uses attribute grammars for tighter semantic constraints, and contributed to the broader movement toward structured molecular generation methods including graph-based approaches. The paper demonstrated that encoding domain knowledge into the decoder architecture is more effective than relying on the model to learn structural constraints from data alone.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (expressions)</td>
          <td>Generated arithmetic expressions</td>
          <td>100,000</td>
          <td>Up to 15 production rules each</td>
      </tr>
      <tr>
          <td>Training (molecules)</td>
          <td>ZINC database subset</td>
          <td>250,000 SMILES</td>
          <td>Same subset as <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 1D convolutional neural network over one-hot rule sequences</li>
<li>Decoder: RNN with stack-based grammar masking</li>
<li>Latent space: 56 dimensions (molecules), isotropic Gaussian prior</li>
<li>Property predictor: Sparse Gaussian process with 500 inducing points</li>
<li>Optimization: Batch Bayesian optimization with expected improvement, 50 candidates per iteration, Kriging Believer for batch selection</li>
</ul>
<h3 id="models">Models</h3>
<p>Architecture details follow <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> with modifications for grammar-based encoding/decoding. Specific layer sizes and hyperparameters are described in the supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE</th>
          <th>CVAE</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid (expressions)</td>
          <td>0.99</td>
          <td>0.86</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Fraction valid (molecules)</td>
          <td>0.31</td>
          <td>0.17</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Best penalized logP</td>
          <td>2.94</td>
          <td>1.98</td>
          <td>Best molecule found</td>
      </tr>
      <tr>
          <td>Reconstruction accuracy</td>
          <td>53.7%</td>
          <td>44.6%</td>
          <td>On held-out molecules</td>
      </tr>
      <tr>
          <td>Prior validity</td>
          <td>7.2%</td>
          <td>0.7%</td>
          <td>Sampling from N(0,I)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mkusner/grammarVAE">grammarVAE</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kusner, M. J., Paige, B., &amp; Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. <em>Proceedings of the 34th International Conference on Machine Learning (ICML)</em>, 1945-1954.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kusner2017grammar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Grammar Variational Autoencoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kusner, Matt J. and Paige, Brooks and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 34th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1945--1954}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v2: Pareto Multi-Objective RL for Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</guid><description>DrugEx v2 extends RNN-based de novo drug design with Pareto ranking and evolutionary exploration for multi-objective molecule generation.</description><content:encoded><![CDATA[<h2 id="multi-objective-de-novo-drug-design-with-pareto-optimization">Multi-Objective De Novo Drug Design with Pareto Optimization</h2>
<p>This is a <strong>Method</strong> paper that extends the DrugEx framework (v1) to handle multi-objective optimization in de novo drug design. The primary contribution is integrating Pareto-based ranking with evolutionary algorithm concepts (crossover and mutation) into an RNN-based reinforcement learning pipeline. The system generates <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based molecules optimized simultaneously for activity toward multiple protein targets while avoiding off-targets, addressing polypharmacology scenarios where drugs must bind multiple specific receptors.</p>
<h2 id="polypharmacology-and-the-limits-of-single-objective-generation">Polypharmacology and the Limits of Single-Objective Generation</h2>
<p>Traditional drug discovery follows the &ldquo;one drug, one target, one disease&rdquo; paradigm, but drug molecules interact with an average of six protein targets. Off-target binding causes side effects that remain a leading cause of clinical failure and post-approval drug withdrawals (over 500 drugs withdrawn due to fatal toxicity). Complex diseases often require modulating multiple targets simultaneously, making polypharmacology an important design objective.</p>
<p>Prior deep learning approaches for de novo design, including DrugEx v1, focused on generating molecules active against a single target. Extending these methods to multiple objectives introduces fundamental challenges: objectives are often contradictory (high affinity for one target may correlate with high affinity for an undesired off-target), and naive weighted-sum approaches can collapse diversity by over-optimizing a single dominant objective. The authors specifically target the <a href="https://en.wikipedia.org/wiki/Adenosine_receptor">adenosine receptor</a> system, where $A_1AR$ and $A_{2A}AR$ selectivity profiles matter for therapeutic efficacy, and <a href="https://en.wikipedia.org/wiki/HERG">hERG</a> channel binding must be avoided to prevent cardiac toxicity.</p>
<h2 id="evolutionary-exploration-and-pareto-ranking-in-rl">Evolutionary Exploration and Pareto Ranking in RL</h2>
<p>The core innovation of DrugEx v2 has two components: an evolutionary exploration strategy and Pareto-based reward assignment.</p>
<h3 id="evolutionary-exploration-strategy">Evolutionary Exploration Strategy</h3>
<p>The generation process uses three RNN networks with identical LSTM architectures:</p>
<ul>
<li><strong>Agent net</strong> ($G_A$): the primary generator, updated at each training epoch via policy gradient</li>
<li><strong>Crossover net</strong> ($G_C$): initialized from the fine-tuned model, updated iteratively from $G_A$ after each convergence period</li>
<li><strong>Mutation net</strong> ($G_M$): initialized from the pre-trained model, parameters fixed throughout training</li>
</ul>
<p>At each token-generation step, a random number determines whether the token probability comes from the combination of $G_A$ and $G_C$ (with probability $1 - \varepsilon$) or from $G_M$ (with probability $\varepsilon$). This mirrors crossover and mutation operations from evolutionary algorithms, maintaining diversity while steering toward desired properties.</p>
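<p>A minimal sketch of this ε-mixed sampling step in pure Python (the equal-weight average of the agent and crossover distributions, and the <code>sample_token</code> helper name, are assumptions for illustration):</p>

```python
import random

def sample_token(p_agent, p_cross, p_mut, eps=1e-2, rng=random):
    """Pick the next token: with probability eps sample from the mutation
    net's distribution, otherwise from a combination of the agent and
    crossover nets (a simple average here -- an illustrative assumption)."""
    if rng.random() < eps:
        probs = p_mut                                   # "mutation"
    else:
        probs = [(a + c) / 2.0 for a, c in zip(p_agent, p_cross)]  # "crossover"
    # sample an index from the chosen categorical distribution
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

Small ε keeps generation close to the policy being optimized while occasionally injecting tokens from the fixed pre-trained net, which is what preserves diversity.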
<h3 id="pareto-front-reward-scheme">Pareto Front Reward Scheme</h3>
<p>For $n$ objectives (three in this study: $A_1AR$, $A_{2A}AR$, hERG), each molecule receives a score $R_i$ based on its predicted bioactivity:</p>
<p>$$
R_{i} = \begin{cases} \text{minmax}(pX_{i}), &amp; \text{if high affinity required} \\ 1 - \text{minmax}(pX_{i}), &amp; \text{if low affinity required} \\ 0, &amp; \text{if SMILES invalid} \end{cases}
$$</p>
<p>where $pX_i$ is the predicted bioactivity (range 3.0 to 10.0), normalized to [0, 1].</p>
<p>For the multi-target case, high affinity is required for both $A_1AR$ and $A_{2A}AR$ while low affinity is required for hERG. For the target-specific case, high affinity is required only for $A_{2A}AR$ while low affinity is required for both $A_1AR$ and hERG.</p>
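<p>The per-objective score is simple to express directly (the <code>objective_score</code> helper name is illustrative, not from the paper):</p>

```python
def minmax(px, lo=3.0, hi=10.0):
    """Normalize a predicted pX in [3.0, 10.0] to [0, 1]."""
    return (px - lo) / (hi - lo)

def objective_score(px, want_high, valid=True):
    """Score R_i from the case equation above: high-affinity objectives use
    minmax(pX), low-affinity objectives use 1 - minmax(pX), invalid SMILES get 0."""
    if not valid:
        return 0.0
    s = minmax(px)
    return s if want_high else 1.0 - s
```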
<p>Molecules are ranked using a <a href="https://en.wikipedia.org/wiki/Multi-objective_optimization">non-dominated sorting</a> algorithm to construct Pareto fronts. Within each front, molecules are ranked by average Tanimoto distance (using ECFP6 fingerprints) rather than crowding distance, favoring chemically diverse solutions. The final reward is:</p>
<p>$$
R_i^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>where $k$ is the molecule&rsquo;s index in the Pareto rank. Rewards for undesired and desired solutions are distributed in $(0, 0.5]$ and $(0.5, 1.0]$, respectively.</p>
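<p>A compact sketch of this reward assignment (pure Python; the paper additionally orders molecules <em>within</em> each front by average Tanimoto distance, which is omitted here, and the sketch assumes both groups are non-empty):</p>

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every objective, better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_rank_rewards(scores, desired):
    """Rank-based rewards from the case equation above.
    scores  : per-objective score tuples, oriented so higher is better
    desired : True if the molecule meets all property thresholds"""
    n = len(scores)
    remaining, front_of, f = set(range(n)), {}, 0
    while remaining:                      # peel off non-dominated fronts
        front = {i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)}
        for i in front:
            front_of[i] = f
        remaining -= front
        f += 1
    n_des = sum(desired)
    n_und = n - n_des
    # global index k: undesired before desired, worse fronts first,
    # so k grows with solution quality
    order = sorted(range(n), key=lambda i: (desired[i], -front_of[i]))
    rewards = [0.0] * n
    for k, i in enumerate(order, start=1):
        rewards[i] = (0.5 + (k - n_und) / (2 * n_des) if desired[i]
                      else k / (2 * n_und))
    return rewards
```

Undesired molecules land in (0, 0.5] and desired molecules in (0.5, 1.0], matching the equation above.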
<p>The agent is trained via policy gradient:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) \middle|\theta\right] = \sum_{t=1}^{T} \log G(y_t | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
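<p>The effect of this objective can be seen with a toy memoryless softmax policy (the real generator is an LSTM over SMILES tokens; <code>theta</code> and <code>lr</code> here are illustrative):</p>

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def reinforce_step(theta, tokens, reward, lr=0.5):
    """One ascent step on J for a categorical policy, using the identity
    d log softmax(theta)[y] / d theta_i = 1{i=y} - p_i."""
    for y in tokens:
        p = softmax(theta)
        theta = [t + lr * reward * ((1.0 if i == y else 0.0) - p_i)
                 for i, (t, p_i) in enumerate(zip(theta, p))]
    return theta
```

A positive sequence reward pushes probability mass toward the sampled tokens; a reward near zero leaves the policy essentially unchanged.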
<h3 id="weighted-sum-alternative">Weighted Sum Alternative</h3>
<p>The authors also implement a weighted sum (WS) scheme with dynamic weights proportional to the ratio of undesired to desired molecules per objective:</p>
<p>$$
w_i = \frac{r_i}{\sum_{k=1}^{M} r_k}, \quad R^{*} = \sum_{i=1}^{n} w_i R_i
$$</p>
<p>This auto-adjusts importance toward under-performing objectives during training.</p>
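<p>The dynamic weighting is a one-liner per equation (function names are illustrative):</p>

```python
def dynamic_weights(undesired_ratio):
    """w_i = r_i / sum_k r_k, where r_i is the fraction of generated molecules
    still undesired on objective i, so lagging objectives get more weight."""
    total = sum(undesired_ratio)
    return [r / total for r in undesired_ratio]

def weighted_sum_reward(scores, weights):
    """R* = sum_i w_i R_i."""
    return sum(w * r for w, r in zip(weights, scores))
```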
<h3 id="molecular-diversity-metric">Molecular Diversity Metric</h3>
<p>Diversity is measured using the Solow-Polasky metric adapted from ecological biodiversity:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\top} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<p>where $F(\mathbf{s})$ is a distance matrix with entries $f(d_{ij}) = e^{-\theta d_{ij}}$ and $d_{ij}$ is the Tanimoto distance between ECFP6 fingerprints of molecules $s_i$ and $s_j$.</p>
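<p>The metric can be computed directly from a pairwise distance matrix; a self-contained sketch with a small Gauss-Jordan solver so no external libraries are needed:</p>

```python
import math

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def solow_polasky_diversity(D, theta=1.0):
    """I(A) = (1/|A|) e^T F^{-1} e with F_ij = exp(-theta * d_ij),
    D holding pairwise Tanimoto distances between molecules."""
    F = [[math.exp(-theta * d) for d in row] for row in D]
    x = solve(F, [1.0] * len(F))   # F^{-1} e
    return sum(x) / len(F)
```

Mutually distant molecules drive the normalized value toward 1; near-duplicates drive it toward 1/|A| (and F becomes singular for exact duplicates).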
<h2 id="multi-target-and-target-specific-experiments">Multi-Target and Target-Specific Experiments</h2>
<h3 id="qsar-environment">QSAR Environment</h3>
<p>Four ML algorithms were benchmarked for the bioactivity prediction environment: Random Forest (RF), SVM, PLS, and Multi-task DNN (MT-DNN). Input features combined 2048-bit ECFP6 fingerprints with 19 physicochemical descriptors (2067D total). The training data came from ChEMBL v26: 25,731 ligands with bioactivity measurements toward $A_1AR$, $A_{2A}AR$, and hERG. RF was selected as the final predictor based on superior performance in temporal-split independent testing ($R^2$ and RMSE), prioritizing robustness over cross-validation metrics.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN generator uses six layers: input, embedding (128D), three LSTM recurrent layers (512 hidden units), and output. LSTM was chosen over GRU based on higher valid SMILES rates (97.5% vs. 93.1% for pre-trained, 97.9% vs. 95.7% for fine-tuned). Pre-training used 1.7M molecules from ChEMBL; fine-tuning used the 25,731 LIGAND set molecules.</p>
<h3 id="baselines">Baselines</h3>
<p>DrugEx v2 was compared against DrugEx v1, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, all using the same RNN architecture and pre-trained/fine-tuned models, with only the RL framework differing. Both Pareto front (PF) and weighted sum (WS) reward schemes were tested.</p>
<h3 id="multi-target-results">Multi-Target Results</h3>
<p>In the multi-target case (high affinity for $A_1AR$ and $A_{2A}AR$, low affinity for hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.57%</td>
          <td>80.81%</td>
          <td>87.29%</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.80%</td>
          <td><strong>97.45%</strong></td>
          <td>89.08%</td>
          <td>0.49</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>PF</td>
          <td>99.54%</td>
          <td>57.43%</td>
          <td><strong>98.84%</strong></td>
          <td><strong>0.77</strong></td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.84%</td>
          <td>66.01%</td>
          <td>82.67%</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>DrugEx v1</td>
          <td>PF</td>
          <td>98.28%</td>
          <td>43.27%</td>
          <td>88.96%</td>
          <td>0.71</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 achieved the highest desirability under both schemes. The WS scheme maximized desirability (97.45%) but at the cost of diversity (0.49). The PF scheme maintained higher diversity (0.70) with still-strong desirability (80.81%).</p>
<h3 id="target-specific-results">Target-Specific Results</h3>
<p>In the target-specific case (high $A_{2A}AR$, low $A_1AR$ and hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.53%</td>
          <td><strong>89.49%</strong></td>
          <td>90.55%</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.62%</td>
          <td><strong>97.86%</strong></td>
          <td>90.54%</td>
          <td>0.31</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>WS</td>
          <td>99.55%</td>
          <td>81.27%</td>
          <td>98.87%</td>
          <td>0.34</td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.29%</td>
          <td>86.98%</td>
          <td>80.30%</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 with PF achieved high desirability (89.49%) while maintaining diversity (0.73), outperforming both the WS scheme&rsquo;s diversity collapse (0.31) and competing methods.</p>
<h3 id="chemical-space-coverage">Chemical Space Coverage</h3>
<p>t-SNE visualization with ECFP6 descriptors showed that the PF scheme guided generators to cover chemical space more broadly than the WS scheme. DrugEx v1 and v2 covered nearly all of the chemical space occupied by known active ligands, while REINVENT and ORGANIC covered only partial regions in the target-specific case.</p>
<h3 id="substructure-distribution">Substructure Distribution</h3>
<p>Generated molecules were evaluated for purine ring, furan ring, and benzene ring frequencies. DrugEx v2 with PF produced substructure distributions closest to the LIGAND set, suggesting it better preserves the chemical characteristics of known active molecules compared to REINVENT (which over-represented benzene rings) and ORGANIC.</p>
<h3 id="guacamol-benchmark">GuacaMol Benchmark</h3>
<p>DrugEx v2 was tested on 20 goal-directed tasks from the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark, achieving the best score on 12 of the 20 tasks and placing second overall. The method struggled with tasks requiring contradictory objectives in narrow chemical spaces (e.g., the Sitagliptin MPO task), reflecting its emphasis on diverse feasible molecules rather than optimal individual solutions.</p>

<h2 id="diversity-desirability-trade-off-and-limitations">Diversity-Desirability Trade-off and Limitations</h2>
<p>The key finding is that the Pareto front scheme and weighted sum scheme offer complementary strengths: PF produces molecules with higher diversity and more realistic substructure distributions, while WS achieves higher raw desirability scores. The Pareto front scheme is preferred for polypharmacology applications where chemical diversity matters for lead optimization.</p>
<p>The mutation rate $\varepsilon$ controls the diversity-desirability trade-off: higher $\varepsilon$ increases diversity at the cost of desirability. The authors tested $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$ and found performance sensitive to this choice, making careful tuning important.</p>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The method is less effective for tasks with contradictory objectives in narrow chemical spaces</li>
<li>Emphasis is on generating diverse feasible molecules rather than individual optimal solutions</li>
<li>REINVENT 2.0 did not converge with the PF scheme, suggesting the Pareto approach may not be universally compatible with all RL frameworks</li>
<li>Bioactivity predictions rely on QSAR models (RF), which may not generalize perfectly to novel chemical scaffolds</li>
</ul>
<p>Future directions mentioned include adopting newer architectures (BERT, Transformer, GPT-2), handling graph and fragment representations, and integrating additional objectives like stability and synthesizability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v26 (ChEMBL set)</td>
          <td>1.7M molecules</td>
          <td>SMILES syntax learning, drug-like molecules</td>
      </tr>
      <tr>
          <td>Fine-tuning / Environment</td>
          <td>LIGAND set</td>
          <td>25,731 ligands</td>
          <td>Bioactivities for $A_1AR$, $A_{2A}AR$, hERG from ChEMBL</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>GuacaMol</td>
          <td>20 tasks</td>
          <td>Goal-directed generation tasks</td>
      </tr>
  </tbody>
</table>
<p>Active/inactive thresholds: $pX \geq 6.5$ (active), $pX &lt; 6.5$ (inactive). Low-quality data points without an exact pX value were assigned $pX = 3.99$ with a sample weight of 0.1.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>QSAR predictor</strong>: Random Forest, 1000 trees, Gini criterion. Input: 2048-bit ECFP6 + 19 physicochemical properties (2067D). MinMax normalization.</li>
<li><strong>Generator</strong>: 6-layer RNN with LSTM cells (512 hidden units), embedding dim 128, vocabulary 84 tokens. Adam optimizer, lr $10^{-3}$, batch size 512, 1000 epochs.</li>
<li><strong>RL training</strong>: Policy gradient with Pareto-based or weighted-sum reward. Mutation rates tested: $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$.</li>
<li><strong>Pareto ranking</strong>: GPU-accelerated non-dominated sorting via PyTorch. Within each front, molecules are ranked by average Tanimoto distance on ECFP6 fingerprints (in place of crowding distance).</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generator</td>
          <td>LSTM (3 layers, 512 hidden)</td>
          <td>Embedding 128D, vocab 84</td>
      </tr>
      <tr>
          <td>Predictor</td>
          <td>Random Forest</td>
          <td>1000 trees, 2067D input</td>
      </tr>
      <tr>
          <td>MT-DNN (alternative)</td>
          <td>3 hidden layers (4000, 2000, 1000)</td>
          <td>ReLU, 20% dropout</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated SMILES that parse to valid molecules</td>
      </tr>
      <tr>
          <td>Desirability</td>
          <td>Fraction of molecules meeting all activity thresholds ($pX \geq 6.5$ on-targets, $pX &lt; 6.5$ off-targets)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of non-duplicate molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Solow-Polasky metric on ECFP6 Tanimoto distances</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Synthetic accessibility (1-10, lower is easier)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative estimate of drug-likeness (0-1, higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>GPU acceleration was used for Pareto optimization via PyTorch. Specific hardware details (GPU model, training time) are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XuhanLiu/DrugEx">DrugEx GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Python, PyTorch)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v26</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source of training molecules and bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., Emmerich, M. T. M., IJzerman, A. P., &amp; van Westen, G. J. P. (2021). DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. <em>Journal of Cheminformatics</em>, 13(1), 85. <a href="https://doi.org/10.1186/s13321-021-00561-9">https://doi.org/10.1186/s13321-021-00561-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2021drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and Emmerich, Michael T. M. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{85}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-021-00561-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugChat: Conversational QA on Drug Molecule Graphs</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/drugchat-chatgpt-drug-molecule-graphs/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/drugchat-chatgpt-drug-molecule-graphs/</guid><description>DrugChat connects a GNN molecular encoder with Vicuna-13B via a linear adaptor, enabling multi-turn conversational QA about drug compound graphs.</description><content:encoded><![CDATA[<h2 id="a-prototype-for-conversational-drug-compound-analysis">A Prototype for Conversational Drug Compound Analysis</h2>
<p><strong>Method ($\Psi_{\text{Method}}$)</strong></p>
<p>DrugChat is a prototype system that enables ChatGPT-like conversational interaction with drug molecule graphs. Users upload a compound&rsquo;s molecular graph and ask free-form, multi-turn questions about its properties, mechanism of action, or therapeutic applications. The system generates natural language answers by combining a graph neural network (GNN) encoder, a large language model (LLM), and a lightweight linear adaptor that bridges the two modalities. The primary contribution is the architecture and the accompanying instruction tuning datasets (10,834 drug compounds, 143,517 QA pairs) that make this graph-to-language interaction possible.</p>
<h2 id="why-conversational-interfaces-for-drug-molecules">Why Conversational Interfaces for Drug Molecules?</h2>
<p>Drug discovery is time-intensive and expensive, often requiring years and billions of dollars to bring a single compound to market. Traditional computational chemistry tools provide specialized outputs but lack the ability to support open-ended, interactive exploration of molecular properties. Researchers working with drug compound data frequently need quick answers to diverse questions: What is the mechanism of action? Are there known drug interactions? What structural modifications could improve efficacy?</p>
<p>At the time of this work, large language models had demonstrated strong conversational capabilities for text, and multimodal extensions (MiniGPT-4, LLaVA) had connected vision encoders to LLMs. However, no system had bridged graph-structured molecular data with LLMs for interactive dialogue. DrugChat addresses this gap by proposing the first system (to the authors&rsquo; knowledge) that connects molecular graph representations directly to an LLM for multi-turn question answering.</p>
<h2 id="architecture-gnn-adaptor-llm-pipeline">Architecture: GNN-Adaptor-LLM Pipeline</h2>
<p>The core innovation is the three-component architecture and its training strategy:</p>
<p><strong>Graph Neural Network (GNN)</strong>: A pre-trained GNN from Hu et al. (2020) processes the compound&rsquo;s molecular graph. At each layer $k$, node representations are updated by aggregating features from neighboring nodes:</p>
<p>$$
h_{v}^{k} = \sigma\left(h_{v}^{k-1}, \text{AGG}\left(\left\{h_{u}^{k-1}, u \in \mathcal{N}(v)\right\}\right)\right)
$$</p>
<p>A permutation-invariant pooling function produces the graph-level representation:</p>
<p>$$
h_{G} = f\left(\left\{h_{v}^{K}, v \in G\right\}\right)
$$</p>
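<p>In sketch form, one such layer and readout look like the following (a generic mean-aggregation stand-in; the actual AGG and $\sigma$ of the pre-trained GNN from Hu et al. differ):</p>

```python
def gnn_layer(h, adj):
    """One message-passing layer: each node combines its own features with the
    mean of its neighbors' features (sigma here is simple addition)."""
    out = []
    for v, h_v in enumerate(h):
        nbrs = [h[u] for u in adj[v]] or [h_v]   # isolated node: use itself
        agg = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        out.append([x + y for x, y in zip(h_v, agg)])
    return out

def readout(h):
    """Permutation-invariant mean pooling to the graph-level vector h_G."""
    return [sum(col) / len(h) for col in zip(*h)]
```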
<p><strong>Linear Adaptor</strong>: A single linear transformation matrix converts the GNN graph representation into a soft prompt vector compatible with the LLM&rsquo;s input space. This is the only component whose weights are updated during training.</p>
<p><strong>Large Language Model (Vicuna-13B)</strong>: The pre-trained Vicuna-13B model takes the transformed graph prompt vector along with user questions and generates answers. Both the GNN and LLM weights remain frozen during training.</p>
<p>The prompt template follows the Vicuna conversational format:</p>
<p>$$
\mathbf{Q}: \langle\text{Graph}\rangle\langle\text{GraphFeature}\rangle\langle/\text{Graph}\rangle\langle\text{Instruction}\rangle \quad \mathbf{A}: \langle\text{Desc}\rangle
$$</p>
<p>During training, the system minimizes a negative log-likelihood loss between generated and ground-truth answers. The entire training procedure updates only the adaptor&rsquo;s parameters, making the approach computationally lightweight compared to full fine-tuning.</p>
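<p>The trainable surface area is tiny; a sketch of the adaptor as a bare linear map (the dimensions and initialization are illustrative assumptions, since the paper does not report them):</p>

```python
import random

class LinearAdaptor:
    """The only trainable component: a single matrix W mapping a GNN graph
    embedding (d_g) into the LLM's input-embedding space (d_llm) as a soft
    prompt vector."""
    def __init__(self, d_g, d_llm, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_g)]
                  for _ in range(d_llm)]

    def __call__(self, h_graph):
        return [sum(w * h for w, h in zip(row, h_graph)) for row in self.W]

adaptor = LinearAdaptor(d_g=4, d_llm=6)
h_graph = [0.1, -0.3, 0.5, 0.2]   # stand-in for the frozen GNN's output h_G
soft_prompt = adaptor(h_graph)    # fills the <GraphFeature> slot in the prompt
```

Because the GNN and Vicuna-13B stay frozen, gradients flow only into <code>W</code>, which is why the approach is so much cheaper than full fine-tuning.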
<h2 id="instruction-tuning-datasets-from-chembl-and-pubchem">Instruction Tuning Datasets from ChEMBL and PubChem</h2>
<p>The authors constructed two instruction tuning datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Drug Compounds</th>
          <th>QA Pairs</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>3,892</td>
          <td>129,699</td>
          <td>ChEMBL database (Feb 2023)</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>6,942</td>
          <td>13,818</td>
          <td>PubChem (May 2023)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>10,834</strong></td>
          <td><strong>143,517</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>ChEMBL Dataset</strong>: Starting from 2,354,965 compounds in <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, the authors identified 14,816 with drug information and filtered to 3,892 with sufficient descriptive content. For each drug, they gathered <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, molecular features (formula, acid/base classification), and drug-specific properties (mechanism of action, therapeutic applications). They manually crafted QA pairs covering topics like rotatable bond count, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski rule</a> violations, <a href="https://en.wikipedia.org/wiki/Chirality_(chemistry)">chirality</a>, <a href="https://en.wikipedia.org/wiki/Polar_surface_area">polar surface area</a>, development stage, approval year, and <a href="https://en.wikipedia.org/wiki/United_States_Adopted_Name">USAN</a> classification.</p>
<p><strong>PubChem Dataset</strong>: From 66,469,244 compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, 19,319 had drug information, and 6,942 were retained after filtering for detailed descriptions. Descriptions were sourced from <a href="https://en.wikipedia.org/wiki/ChEBI">ChEBI</a>, LOTUS, and YMDB databases, yielding 13,818 QA pairs primarily asking for drug descriptions.</p>
<p>The QA pairs are formulaic: the ChEMBL set covers up to 34 question types per drug (an example drug in the paper shows all 34), while PubChem questions ask for descriptive summaries from different source databases.</p>
<h2 id="qualitative-demonstrations-only">Qualitative Demonstrations Only</h2>
<p>The paper presents only qualitative results. Two demonstration examples show DrugChat answering multi-turn questions about test compounds not seen during training. Questions like &ldquo;what makes this compound unique?&rdquo; and &ldquo;what diseases can this compound potentially treat?&rdquo; are answered in natural language.</p>
<p>No systematic quantitative evaluation is reported. The authors state they &ldquo;will perform a systematic quantitative evaluation by collaborating with pharmaceutical scientists,&rdquo; but this evaluation is not included in the technical report.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors identify <strong>language hallucination</strong> as the primary limitation. Since DrugChat incorporates an LLM, it may produce convincing but incorrect text descriptions about drugs, which could mislead decision-makers in real drug discovery pipelines.</p>
<p>Proposed mitigations include:</p>
<ul>
<li>Higher-quality training data and filtering strategies</li>
<li>More advanced GNN encoders and LLMs</li>
<li>Reinforcement learning from human feedback (RLHF) as the user base grows</li>
</ul>
<p>Several additional limitations are worth noting:</p>
<ul>
<li>The QA pairs are largely factoid-style questions with short, formulaic answers, which may not capture the nuanced reasoning needed for real drug discovery tasks</li>
<li>The evaluation is entirely qualitative, with no comparison to baselines or quantitative metrics</li>
<li>The linear adaptor is a minimal alignment mechanism; it remains unclear how much molecular structural information is preserved through this single linear transformation</li>
<li>The training data covers only a small fraction of known chemical space (10,834 compounds out of millions)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL Drug Instruction Tuning</td>
          <td>3,892 drugs, 129,699 QA pairs</td>
          <td>From ChEMBL (Feb 2023 dump)</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem Drug Instruction Tuning</td>
          <td>6,942 drugs, 13,818 QA pairs</td>
          <td>From PubChem (May 2023)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>GNN</strong>: Pre-trained model from Hu et al. (2020), &ldquo;Strategies for Pre-training Graph Neural Networks&rdquo;</li>
<li><strong>Adaptor</strong>: Single linear transformation matrix (only trainable component)</li>
<li><strong>Loss</strong>: Negative log-likelihood between generated and ground-truth answers</li>
<li><strong>Training</strong>: Only adaptor weights updated; GNN and LLM weights frozen</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Model</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GNN Encoder</td>
          <td>Pre-trained GNN (Hu et al., 2020)</td>
          <td>Not specified</td>
          <td>Frozen during training</td>
      </tr>
      <tr>
          <td>LLM</td>
          <td>Vicuna-13B</td>
          <td>~13B</td>
          <td>Frozen during training</td>
      </tr>
      <tr>
          <td>Adaptor</td>
          <td>Linear projection</td>
          <td>Not specified</td>
          <td>Trained</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>No quantitative evaluation metrics are reported. The paper provides only qualitative demonstrations on unseen compounds.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware specifications are reported for training or inference.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/UCSD-AI4H/drugchat">DrugChat Code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation (repository returned 404 as of March 2026)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liang, Y., Zhang, R., Zhang, L., &amp; Xie, P. (2023). DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs. <em>arXiv preprint arXiv:2309.03907</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liang2023drugchat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liang, Youwei and Zhang, Ruiyi and Zhang, Li and Xie, Pengtao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.03907}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugAssist: Interactive LLM Molecule Optimization</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/drugassist-llm-molecule-optimization/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/drugassist-llm-molecule-optimization/</guid><description>DrugAssist fine-tunes Llama2-7B-Chat for interactive molecule optimization via natural language dialogue, releasing the MolOpt-Instructions dataset.</description><content:encoded><![CDATA[<h2 id="an-interactive-llm-for-molecule-optimization">An Interactive LLM for Molecule Optimization</h2>
<p>DrugAssist is a <strong>Method</strong> paper that proposes an interactive molecule optimization model built by fine-tuning Llama2-7B-Chat with LoRA on a newly constructed instruction dataset. The primary contribution is twofold: (1) the MolOpt-Instructions dataset containing over one million molecule pairs with six molecular properties and three optimization task categories, and (2) a dialogue-based molecule optimization system that allows domain experts to iteratively refine molecular modifications through multi-turn natural language conversations.</p>
<h2 id="why-interactive-molecule-optimization-matters">Why Interactive Molecule Optimization Matters</h2>
<p>Molecule optimization is a core step in the drug discovery pipeline, where lead compounds must be modified to improve specific pharmacological properties while maintaining structural similarity. Existing approaches fall into sequence-based methods (treating <a href="/notes/chemistry/molecular-representations/">SMILES</a> optimization as machine translation) and graph-based methods (graph-to-graph translation), but they share a critical limitation: they are non-interactive. These models learn patterns from chemical structure data without incorporating expert feedback.</p>
<p>The drug discovery process is inherently iterative and requires integrating domain expertise. Medicinal chemists typically refine candidates through repeated cycles of suggestion, evaluation, and adjustment. Prior LLM-based approaches like <a href="/notes/chemistry/llm-applications/chatdrug-conversational-drug-editing/">ChatDrug</a> relied on prompt engineering with general-purpose models (GPT-3.5-turbo) rather than fine-tuning, limiting their optimization accuracy. Additionally, most existing molecule optimization benchmarks focus on single-property optimization with vague objectives (e.g., &ldquo;maximize QED&rdquo;), while real-world drug design requires optimizing property values within specific ranges across multiple properties simultaneously.</p>
<h2 id="instruction-based-fine-tuning-with-molopt-instructions">Instruction-Based Fine-Tuning with MolOpt-Instructions</h2>
<p>The core innovation has two components: the MolOpt-Instructions dataset construction pipeline and the multi-task instruction tuning strategy.</p>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>MolOpt-Instructions is built from one million molecules randomly sampled from the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a>. The construction workflow uses mmpdb (an open-source Matched Molecular Pair platform) to generate structurally similar molecule pairs through <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">Matched Molecular Pair Analysis (MMPA)</a>. Pairs are filtered to satisfy two criteria: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> greater than 0.65 and <a href="https://en.wikipedia.org/wiki/Partition_coefficient">logP</a> difference greater than 2.5. Property values for six properties (Solubility, BBBP, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a> inhibition, QED, hydrogen bond donor count, and hydrogen bond acceptor count) are computed using Tencent&rsquo;s iDrug platform. The final dataset contains 1,029,949 unique pairs covering 1,595,839 unique molecules, with mean similarity of 0.69 and mean logP difference of 2.82.</p>
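<p>The two pair-filtering criteria are easy to state concretely. Below is a minimal sketch (function names are ours, and fingerprints are represented as plain bit sets; in practice these would come from standard cheminformatics tooling rather than being hand-built):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def keep_pair(fp_a: set, fp_b: set, logp_a: float, logp_b: float,
              sim_min: float = 0.65, dlogp_min: float = 2.5) -> bool:
    """Apply the paper's two filtering criteria to a candidate MMPA pair:
    Tanimoto similarity above 0.65 and logP difference above 2.5."""
    return (tanimoto(fp_a, fp_b) > sim_min
            and abs(logp_a - logp_b) > dlogp_min)
```

<p>The dataset-level statistics quoted above (mean similarity 0.69, mean logP difference 2.82) are consistent with these thresholds.</p>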
<p>Three categories of optimization tasks are defined:</p>
<ul>
<li><strong>Loose</strong>: Increase or decrease a given property value (no threshold)</li>
<li><strong>Strict</strong>: Increase or decrease by at least a specified threshold</li>
<li><strong>Range</strong>: Optimize the property value to fall within a given interval</li>
</ul>
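<p>The three categories reduce to a single predicate on the old and new property values. A sketch (the function and argument names are ours, not the paper's):</p>

```python
def meets_criterion(old: float, new: float, task: str,
                    direction: str = "increase",
                    threshold: float = 0.0,
                    interval: tuple = (0.0, 0.0)) -> bool:
    """Check an optimized property value against one of the three
    MolOpt-Instructions task categories."""
    # Signed change in the requested direction
    delta = new - old if direction == "increase" else old - new
    if task == "loose":    # any movement in the requested direction
        return delta > 0
    if task == "strict":   # movement by at least a specified threshold
        return delta >= threshold
    if task == "range":    # final value must land inside a target interval
        lo, hi = interval
        return lo <= new <= hi
    raise ValueError(f"unknown task category: {task}")
```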
<p>Instruction templates are generated with ChatGPT assistance and manually refined. To ensure balance, source and target molecules are swapped for some pairs to maintain a roughly 1:1 ratio of property increases to decreases.</p>
<p>Murcko scaffold analysis confirms chemical diversity: scaffolds contain 2.95 molecules on average, and 93.7% of scaffolds contain no more than five molecules.</p>
<h3 id="multi-task-instruction-tuning">Multi-Task Instruction Tuning</h3>
<p>The model is fine-tuned on Llama2-7B-Chat using LoRA (rank 64, alpha 128). To prevent catastrophic forgetting of general language capabilities, the training data combines MolOpt-Instructions with the Stanford Alpaca dataset (52k instruction-following examples, replicated 5x to balance the mixture). The training objective minimizes the negative log-likelihood over the response tokens:</p>
<p>$$L(R; \boldsymbol{\theta}) = -\sum_{u_i \in R} \log \Phi(u_i \mid u_{&lt;i}, I)$$</p>
<p>where $I$ is the instruction, $R$ is the response, and $\Phi$ is the model&rsquo;s conditional probability.</p>
<p>Training runs for 10 epochs with batch size 512, using AdamW ($\beta = (0.9, 0.999)$), learning rate 1e-4, 3% warm-up steps with cosine decay, and no weight decay. The data is split 90/5/5 for train/validation/test.</p>
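<p>In code, this objective amounts to masking out instruction tokens and summing negative log-probabilities over the response tokens only. A sketch, assuming per-token log-probabilities have already been computed by the model:</p>

```python
def response_nll(token_logprobs: list, is_response: list) -> float:
    """Sum -log p(u_i | u_<i, I) over response tokens, matching the
    instruction-tuning objective above (instruction tokens are masked out
    and contribute nothing to the loss)."""
    return -sum(lp for lp, in_resp in zip(token_logprobs, is_response)
                if in_resp)
```

<p>With log-probabilities <code>[-0.1, -0.2, -0.3]</code> and only the last two tokens belonging to the response, the loss is 0.5.</p>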
<h2 id="experimental-setup-and-multi-property-optimization-results">Experimental Setup and Multi-Property Optimization Results</h2>
<h3 id="comparison-with-traditional-approaches">Comparison with Traditional Approaches</h3>
<p>DrugAssist is compared against Mol-Seq2Seq and Mol-Transformer (He et al., 2021) on simultaneous Solubility and BBBP optimization with range constraints. The evaluation prompt asks the model to generate an optimized molecule with solubility within a given range and BBBP category changed from one level to another.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Solubility</th>
          <th>BBBP</th>
          <th>Both</th>
          <th>Valid Rate</th>
          <th>Similarity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mol-Seq2Seq</td>
          <td>0.46</td>
          <td>0.55</td>
          <td>0.35</td>
          <td>0.76</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>Mol-Transformer</td>
          <td>0.70</td>
          <td>0.78</td>
          <td>0.59</td>
          <td>0.96</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugAssist</td>
          <td>0.74</td>
          <td>0.80</td>
          <td>0.62</td>
          <td>0.98</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugAssist achieves the highest success rates in both single-property and multi-property optimization while maintaining high validity (0.98) and comparable structural similarity (0.69).</p>
<h3 id="comparison-with-llms">Comparison with LLMs</h3>
<p>DrugAssist is compared against Llama2-7B-Chat, GPT-3.5-turbo (via ChatDrug), and BioMedGPT-LM-7B on 16 tasks covering all three optimization categories. These comparisons use multi-turn dialogues following the ChatDrug protocol: if the model&rsquo;s output fails to meet requirements, a database-retrieved molecule meeting the criteria and similar to the model&rsquo;s output is provided as a hint for iterative refinement.</p>
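<p>The hint protocol can be sketched as a small feedback loop. Here <code>generate</code>, <code>evaluate</code>, and <code>retrieve_hint</code> are hypothetical stand-ins for the model call, the property checker, and the database lookup; the real protocol also tracks full dialogue history:</p>

```python
def optimize_with_hints(molecule, generate, evaluate, retrieve_hint,
                        max_turns=3):
    """ChatDrug-style multi-turn loop (sketch): if the model's proposal
    fails the criteria, retrieve a similar compliant molecule from a
    database and feed it back as a hint for the next turn."""
    prompt = f"Optimize: {molecule}"
    for _ in range(max_turns):
        proposal = generate(prompt)
        if evaluate(proposal):
            return proposal
        hint = retrieve_hint(proposal)  # compliant neighbour of the proposal
        prompt = (f"Optimize: {molecule}. Your answer {proposal} failed; "
                  f"consider {hint}.")
    return None  # no compliant molecule found within the turn budget
```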
<p>Selected results on single-property tasks (valid ratio / correct ratio, loose/strict):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Llama2-7B-Chat</th>
          <th>GPT-3.5-turbo</th>
          <th>BioMedGPT-LM</th>
          <th>DrugAssist</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED+</td>
          <td>0.17 / 0.16</td>
          <td>0.15 / 0.15</td>
          <td>0.15 / 0.09</td>
          <td>0.76 / 0.63</td>
      </tr>
      <tr>
          <td>Acceptor+</td>
          <td>0.08 / 0.08</td>
          <td>0.04 / 0.06</td>
          <td>0.18 / 0.13</td>
          <td>0.71 / 0.67</td>
      </tr>
      <tr>
          <td>Donor+</td>
          <td>0.15 / 0.08</td>
          <td>0.10 / 0.04</td>
          <td>0.17 / 0.09</td>
          <td>0.72 / 0.76</td>
      </tr>
      <tr>
          <td>Solubility+</td>
          <td>0.36 / 0.20</td>
          <td>0.16 / 0.05</td>
          <td>0.18 / 0.09</td>
          <td>0.80 / 0.41</td>
      </tr>
      <tr>
          <td>BBBP+</td>
          <td>0.19 / 0.14</td>
          <td>0.10 / 0.10</td>
          <td>0.16 / 0.07</td>
          <td>0.82 / 0.61</td>
      </tr>
      <tr>
          <td>hERG-</td>
          <td>0.39 / 0.31</td>
          <td>0.13 / 0.15</td>
          <td>0.13 / 0.12</td>
          <td>0.71 / 0.67</td>
      </tr>
  </tbody>
</table>
<p>Multi-property tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Llama2-7B-Chat</th>
          <th>GPT-3.5-turbo</th>
          <th>BioMedGPT-LM</th>
          <th>DrugAssist</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sol+ &amp; Acc+</td>
          <td>0.15 / 0.04</td>
          <td>0.09 / 0.02</td>
          <td>0.10 / 0.07</td>
          <td>0.50 / 0.27</td>
      </tr>
      <tr>
          <td>QED+ &amp; BBBP+</td>
          <td>0.14 / 0.09</td>
          <td>0.09 / 0.06</td>
          <td>0.16 / 0.11</td>
          <td>0.65 / 0.41</td>
      </tr>
  </tbody>
</table>
<p>DrugAssist outperforms the baselines on every task. BioMedGPT-LM frequently misunderstands the task, generating guidance text rather than molecules, while GPT-3.5-turbo achieves high validity but often returns the input molecule unchanged.</p>
<h2 id="transferability-iterative-refinement-and-limitations">Transferability, Iterative Refinement, and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Zero-shot transferability</strong>: Although DrugAssist trains on single-property optimization data, it successfully handles multi-property optimization requests at inference time. In a case study, the model simultaneously increased both BBBP and QED by at least 0.1 while maintaining structural similarity, without any multi-property training examples.</p>
<p><strong>Few-shot generalization</strong>: DrugAssist optimizes properties not seen during training (e.g., logP) when provided with a few in-context examples of successful optimizations, a capability that traditional sequence-based or graph-based models cannot achieve without retraining.</p>
<p><strong>Iterative optimization</strong>: When an initial optimization fails to meet requirements, DrugAssist can incorporate feedback (a database-retrieved hint molecule) and modify different functional groups in a second attempt to produce a compliant molecule.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge that DrugAssist has a relatively low success rate on the most challenging task category, strict range-constrained solubility optimization (0.41 success rate under strict criteria vs. 0.80 under loose criteria). The model also relies on iDrug for property prediction of Solubility, BBBP, and hERG inhibition, so its optimization quality is bounded by the accuracy of those property predictors. The LLM comparisons use only 500 test molecules, a relatively small evaluation set, and the paper reports no statistical significance tests or confidence intervals.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors plan to improve multimodal data handling to reduce hallucination problems and to further enhance DrugAssist&rsquo;s interactive capabilities for better understanding of user needs and feedback.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>MolOpt-Instructions</td>
          <td>1,029,949 molecule pairs</td>
          <td>Sourced from ZINC via mmpdb; 6 properties</td>
      </tr>
      <tr>
          <td>Training (auxiliary)</td>
          <td>Stanford Alpaca</td>
          <td>52k instructions (5x replicated)</td>
          <td>Mitigates catastrophic forgetting</td>
      </tr>
      <tr>
          <td>Evaluation (traditional)</td>
          <td>From He et al. (2021)</td>
          <td>Not specified</td>
          <td>Multi-property optimization test</td>
      </tr>
      <tr>
          <td>Evaluation (LLM)</td>
          <td>ZINC subset</td>
          <td>500 molecules</td>
          <td>Randomly selected</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Base model</strong>: Llama2-7B-Chat</li>
<li><strong>Fine-tuning</strong>: LoRA with rank 64, alpha 128</li>
<li><strong>Optimizer</strong>: AdamW, $\beta = (0.9, 0.999)$, lr = 1e-4, no weight decay</li>
<li><strong>Schedule</strong>: 3% warm-up, cosine decay</li>
<li><strong>Epochs</strong>: 10</li>
<li><strong>Batch size</strong>: 512</li>
<li><strong>Property calculation</strong>: iDrug (Solubility, BBBP, hERG); RDKit (H-bond donors/acceptors, QED)</li>
<li><strong>Molecular pairs</strong>: mmpdb for Matched Molecular Pair Analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Fine-tuned Llama2-7B-Chat with LoRA adapters</li>
<li>No fine-tuned weights released (code and data available)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success rate</td>
          <td>Fraction of molecules meeting optimization criteria</td>
      </tr>
      <tr>
          <td>Valid rate</td>
          <td>Fraction of generated SMILES that parse as valid molecules</td>
      </tr>
      <tr>
          <td>Similarity</td>
          <td>Tanimoto similarity between input and optimized molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8 NVIDIA Tesla A100-SXM4-40GB GPUs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/blazerye/DrugAssist">DrugAssist Code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/blazerye/DrugAssist">MolOpt-Instructions</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>1M+ molecule pairs, 6 properties</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ye, G., Cai, X., Lai, H., Wang, X., Huang, J., Wang, L., Liu, W., &amp; Zeng, X. (2024). DrugAssist: A Large Language Model for Molecule Optimization. <em>Briefings in Bioinformatics</em>, 26(1), bbae693.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ye2024drugassist,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugAssist: A Large Language Model for Molecule Optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Geyan and Cai, Xibao and Lai, Houtim and Wang, Xing and Huang, Junhong and Wang, Longyue and Liu, Wei and Zeng, Xiangxiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae693}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae693}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Coscientist: Autonomous Chemistry with LLM Agents</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/autonomous-chemical-research-coscientist/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/autonomous-chemical-research-coscientist/</guid><description>Coscientist uses GPT-4 to autonomously design, plan, and execute chemical experiments including Pd-catalysed cross-coupling optimization.</description><content:encoded><![CDATA[<h2 id="an-llm-powered-agent-for-autonomous-chemical-experimentation">An LLM-Powered Agent for Autonomous Chemical Experimentation</h2>
<p>This is a <strong>Method</strong> paper that introduces Coscientist, an AI system driven by GPT-4 that autonomously designs, plans, and performs complex chemical experiments. The primary contribution is a modular multi-LLM agent architecture that integrates internet search, documentation retrieval, code execution, and robotic experimentation APIs into a unified system capable of end-to-end experimental chemistry with minimal human intervention.</p>
<h2 id="bridging-llm-capabilities-and-laboratory-automation">Bridging LLM Capabilities and Laboratory Automation</h2>
<p>Transformer-based large language models had demonstrated strong capabilities in natural language processing, biology, chemistry, and code generation by early 2023. Simultaneously, laboratory automation had progressed with autonomous reaction discovery, automated flow systems, and mobile robotic platforms. However, these two threads remained largely separate: LLMs could reason about chemistry in text, but could not act on that reasoning by controlling physical experiments.</p>
<p>The gap this work addresses is the integration of LLM reasoning with laboratory automation in a closed-loop system. Prior automated chemistry systems relied on traditional optimization algorithms or narrow AI components. The question was whether GPT-4&rsquo;s general reasoning capabilities could be combined with tool access to produce a system that autonomously designs experiments, writes instrument code, executes reactions, and interprets results, all from natural language prompts.</p>
<p>This work was developed independently and in parallel with other autonomous agent efforts (AutoGPT, BabyAGI, LangChain), with <a href="/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/">ChemCrow</a> serving as another chemistry-specific example.</p>
<h2 id="a-modular-multi-llm-architecture-with-tool-access">A Modular Multi-LLM Architecture with Tool Access</h2>
<p>The core innovation is Coscientist&rsquo;s modular architecture, centered on a &ldquo;Planner&rdquo; module (a GPT-4 chat completion instance) that orchestrates four command types:</p>
<ol>
<li><strong>GOOGLE</strong>: A Web Searcher module (itself an LLM) that transforms prompts into search queries, browses results, and funnels answers back to the Planner.</li>
<li><strong>PYTHON</strong>: A Code Execution module running in an isolated Docker container for calculations and data analysis, with no LLM dependency.</li>
<li><strong>DOCUMENTATION</strong>: A Docs Searcher module that retrieves and summarizes technical documentation (e.g., Opentrons Python API, Emerald Cloud Lab Symbolic Lab Language) using ada embeddings and distance-based vector search.</li>
<li><strong>EXPERIMENT</strong>: An Automation module that executes generated code on laboratory hardware or provides synthetic procedures.</li>
</ol>
<p>The system prompts are engineered in a modular fashion, with the Planner receiving initial user input and command outputs as messages. The Planner can iteratively call commands, fix software errors, and refine its approach. This design allows natural language instructions (e.g., &ldquo;perform multiple Suzuki reactions&rdquo;) to be translated into complete experimental protocols.</p>
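<p>The orchestration pattern can be sketched as a dispatch loop. This is a simplified sketch, not the paper's implementation: the real system prompts, error recovery, and message handling are considerably more involved:</p>

```python
def run_planner(task, llm, tools, max_steps=10):
    """Sketch of the Planner loop: the LLM emits one command per step
    (e.g. GOOGLE / PYTHON / DOCUMENTATION / EXPERIMENT), and each
    command's output is appended to the conversation before the next
    step, until the Planner signals completion."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        command, payload = llm(messages)    # e.g. ("PYTHON", "2 + 2")
        if command == "DONE":
            return payload
        result = tools[command](payload)    # dispatch to the matching module
        messages.append({"role": "tool", "content": f"{command}: {result}"})
    return None  # step budget exhausted
```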
<p>For documentation retrieval, all sections of the OT-2 API documentation were embedded using OpenAI&rsquo;s ada model, and relevant sections are retrieved via cosine similarity search. For the Emerald Cloud Lab, the system learned to program in a symbolic lab language (SLL) that was completely unknown to GPT-4 at training time, demonstrating effective in-context learning from supplied documentation.</p>
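<p>The retrieval step amounts to ranking pre-embedded documentation sections by cosine similarity to the query embedding. A pure-Python sketch (the embedding calls themselves are assumed to have happened already; in the paper they use OpenAI's ada model):</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, k=1):
    """Return the names of the k documentation sections whose embeddings
    are most similar to the query embedding."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```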
<h2 id="six-tasks-demonstrating-autonomous-chemistry-capabilities">Six Tasks Demonstrating Autonomous Chemistry Capabilities</h2>
<p>The paper evaluates Coscientist across six tasks of increasing complexity.</p>
<h3 id="task-1-chemical-synthesis-planning">Task 1: Chemical Synthesis Planning</h3>
<p>A benchmark of seven compounds was used to compare synthesis planning across models (GPT-4, GPT-3.5, Claude 1.3, Falcon-40B-Instruct) with and without web search. Outputs were scored on a 1-5 scale:</p>
<table>
  <thead>
      <tr>
          <th>Score</th>
          <th>Meaning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5</td>
          <td>Very detailed and chemically accurate procedure</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Detailed and accurate but without reagent quantities</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Correct chemistry but no step-by-step procedure</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Extremely vague or unfeasible</td>
      </tr>
      <tr>
          <td>1</td>
          <td>Incorrect or failure to follow instructions</td>
      </tr>
  </tbody>
</table>
<p>The GPT-4-powered Web Searcher achieved maximum scores for acetaminophen, aspirin, nitroaniline, and phenolphthalein. It was the only approach to achieve acceptable scores (3+) for ibuprofen, which all non-browsing models synthesized incorrectly. These results highlight the importance of grounding LLMs to avoid hallucinations.</p>
<h3 id="task-2-documentation-search">Task 2: Documentation Search</h3>
<p>The system correctly identified relevant ECL functions from documentation and generated valid SLL code that was successfully executed at ECL, including an <a href="https://en.wikipedia.org/wiki/High-performance_liquid_chromatography">HPLC</a> experiment on a caffeine standard sample.</p>
<h3 id="task-3-cloud-laboratory-execution">Task 3: Cloud Laboratory Execution</h3>
<p>Using prompt-to-function and prompt-to-SLL pipelines, Coscientist generated executable code for the Emerald Cloud Lab. It also searched a catalogue of 1,110 model samples to identify relevant stock solutions from simple search terms.</p>
<h3 id="task-4-liquid-handler-control">Task 4: Liquid Handler Control</h3>
<p>Using the Opentrons OT-2, Coscientist translated natural language prompts (e.g., &ldquo;colour every other line with one colour of your choice,&rdquo; &ldquo;draw a red cross&rdquo;) into accurate liquid handling protocols.</p>
<h3 id="task-5-integrated-multi-module-experiment">Task 5: Integrated Multi-Module Experiment</h3>
<p>The most complex demonstration combined web search, code execution, documentation retrieval, and hardware control to design and execute <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki-Miyaura</a> and <a href="https://en.wikipedia.org/wiki/Sonogashira_coupling">Sonogashira</a> <a href="https://en.wikipedia.org/wiki/Cross-coupling_reaction">cross-coupling</a> reactions. Coscientist:</p>
<ul>
<li>Searched the internet for reaction conditions and stoichiometries</li>
<li>Selected correct coupling partners (never misassigning <a href="https://en.wikipedia.org/wiki/Phenylboronic_acid">phenylboronic acid</a> to Sonogashira)</li>
<li>Calculated reagent volumes and wrote OT-2 protocols</li>
<li>Self-corrected when using an incorrect heater-shaker method by consulting documentation</li>
<li>Successfully produced target products confirmed by <a href="https://en.wikipedia.org/wiki/Gas_chromatography%E2%80%93mass_spectrometry">GC-MS</a> analysis (biphenyl at 9.53 min for Suzuki, diphenylacetylene at 12.92 min for Sonogashira)</li>
</ul>
<h3 id="task-6-reaction-optimization">Task 6: Reaction Optimization</h3>
<p>Coscientist was tested on two fully mapped reaction datasets:</p>
<ol>
<li><strong>Suzuki reaction flow dataset</strong> (Perera et al.): varying ligands, reagents/bases, and solvents</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N coupling dataset</strong> (Doyle et al.): varying ligands, additives, and bases</li>
</ol>
<p>Performance was evaluated using a normalized advantage metric:</p>
<p>$$\text{Normalized Advantage} = \frac{\text{yield}_i - \overline{\text{yield}}}{\text{yield}_{\max} - \overline{\text{yield}}}$$</p>
<p>A value of 1 indicates maximum yield reached, 0 indicates random performance, and negative values indicate worse than random. The normalized maximum advantage (NMA) tracks the best result achieved up to each iteration.</p>
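<p>Both metrics are straightforward to compute when the reaction space is fully mapped, since the mean and maximum yields are known in advance. A sketch:</p>

```python
def normalized_advantage(yield_i, yields):
    """Normalized advantage of one observed yield against the fully
    mapped reaction space: 1 = best condition found, 0 = random-guess
    mean, negative = worse than random."""
    mean_y = sum(yields) / len(yields)
    return (yield_i - mean_y) / (max(yields) - mean_y)

def nma_trajectory(observed, yields):
    """Normalized maximum advantage (NMA): the best advantage reached
    up to and including each iteration."""
    best, out = float("-inf"), []
    for y in observed:
        best = max(best, normalized_advantage(y, yields))
        out.append(best)
    return out
```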
<p>Key findings from the optimization experiments:</p>
<ul>
<li>GPT-4 with prior information (10 random data points) produced better initial guesses than GPT-4 without prior information</li>
<li>Both GPT-4 approaches converged to similar NMA values at the limit</li>
<li>Both GPT-4 approaches outperformed standard <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian optimization</a> in NMA and normalized advantage</li>
<li>GPT-3.5 largely failed due to inability to output correct JSON schemas</li>
<li>On the Buchwald-Hartwig dataset, GPT-4 performed comparably whether given compound names or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, and could reason about electronic properties from SMILES representations</li>
</ul>
<p>All experiments used a maximum of 20 iterations (5.2% and 6.9% of the total reaction space for the two datasets).</p>
<h2 id="demonstrated-versatility-with-safety-considerations">Demonstrated Versatility with Safety Considerations</h2>
<p>Coscientist demonstrated that GPT-4, when equipped with appropriate tool access, can autonomously handle the full experimental chemistry workflow from literature search to reaction execution and data interpretation. The system showed chemical reasoning capabilities, including selecting appropriate reagents, providing justifications for choices based on reactivity and selectivity, and using experimental data to guide subsequent iterations.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li>The experimental setup was not yet fully automated (plates were moved manually between instruments), though no human decision-making was involved</li>
<li>GPT-3.5 consistently underperformed due to inability to follow formatting instructions</li>
<li>The synthesis planning evaluation scale is inherently subjective</li>
<li>It is unclear whether GPT-4&rsquo;s training data contained information from the optimization datasets</li>
<li>The comparison with Bayesian optimization may reflect different exploration/exploitation balances rather than pure capability differences</li>
</ul>
<p>The authors raise safety concerns about dual-use potential and note that full code and prompts were withheld pending development of US AI regulations. A simplified implementation was released for reproducibility purposes.</p>
<p>Future directions include extending the system with reaction databases (Reaxys, SciFinder), implementing advanced prompting strategies (ReAct, Chain of Thought, Tree of Thoughts), and developing automated quality control for cloud laboratory experiments.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Synthesis benchmark</td>
          <td>7 compound set</td>
          <td>7 compounds</td>
          <td>Acetaminophen, aspirin, ibuprofen, nitroaniline, etc.</td>
      </tr>
      <tr>
          <td>Optimization</td>
          <td>Perera et al. Suzuki flow dataset</td>
          <td>Fully mapped condition space</td>
          <td>Varying ligands, bases, solvents</td>
      </tr>
      <tr>
          <td>Optimization</td>
          <td>Doyle Buchwald-Hartwig dataset</td>
          <td>Fully mapped condition space</td>
          <td>Varying ligands, additives, bases</td>
      </tr>
      <tr>
          <td>Reagent selection</td>
          <td><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> compound database</td>
          <td>Not specified</td>
          <td>Used for computational experiments</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Planner</strong>: GPT-4 chat completion with modular system prompts</li>
<li><strong>Web Searcher</strong>: GPT-4 or GPT-3.5-turbo for query generation and result parsing</li>
<li><strong>Documentation embedding</strong>: OpenAI ada model with distance-based vector search</li>
<li><strong>Code execution</strong>: Isolated Docker container (no LLM dependency)</li>
<li><strong>Baseline</strong>: Bayesian optimization with varying initial sample sizes (1-10)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-4 (primary)</li>
<li>GPT-3.5-turbo (baseline)</li>
<li>Claude 1.3 (baseline for synthesis planning)</li>
<li>Falcon-40B-Instruct (baseline for synthesis planning)</li>
<li>OpenAI ada (for documentation embedding)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Context</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Synthesis score (1-5)</td>
          <td>7-compound benchmark</td>
          <td>Subjective expert grading</td>
      </tr>
      <tr>
          <td>Normalized advantage</td>
          <td>Optimization tasks</td>
          <td>Measures improvement over random</td>
      </tr>
      <tr>
          <td>NMA</td>
          <td>Optimization tasks</td>
          <td>Maximum advantage achieved through iteration N</td>
      </tr>
      <tr>
          <td>GC-MS confirmation</td>
          <td>Cross-coupling reactions</td>
          <td>Product formation verified experimentally</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Opentrons OT-2 liquid handler with heater-shaker module</li>
<li>UV-Vis plate reader</li>
<li>Emerald Cloud Lab (cloud-based automation)</li>
<li>Computational requirements not specified (relies on OpenAI API calls)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gomesgroup/coscientist">gomesgroup/coscientist</a></td>
          <td>Code</td>
          <td>Apache-2.0 with Commons Clause</td>
          <td>Simplified implementation; full code withheld for safety</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Boiko, D. A., MacKnight, R., Kline, B. &amp; Gomes, G. (2023). Autonomous chemical research with large language models. <em>Nature</em>, 624(7992), 570-578. <a href="https://doi.org/10.1038/s41586-023-06792-0">https://doi.org/10.1038/s41586-023-06792-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{boiko2023autonomous,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Autonomous chemical research with large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Boiko, Daniil A. and MacKnight, Robert and Kline, Ben and Gomes, Gabriel dos Passos}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{624}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{7992}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{570--578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41586-023-06792-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGE: Molecule Generation via Grammatical Evolution</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/chemge-grammatical-evolution-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/chemge-grammatical-evolution-molecule-generation/</guid><description>ChemGE applies grammatical evolution to SMILES strings for population-based de novo molecule generation with inherent parallelism and diversity.</description><content:encoded><![CDATA[<h2 id="grammatical-evolution-for-de-novo-molecular-design">Grammatical Evolution for De Novo Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemGE, a population-based molecular generation approach built on grammatical evolution. Rather than using deep neural networks, ChemGE evolves populations of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings through a context-free grammar, enabling concurrent evaluation by multiple molecular simulators and producing diverse molecular libraries. The method represents an alternative paradigm for de novo drug design: evolutionary optimization over formal grammars rather than learned latent spaces or autoregressive neural models.</p>
<h2 id="limitations-of-sequential-deep-learning-generators">Limitations of Sequential Deep Learning Generators</h2>
<p>At the time of publication, the dominant approaches to de novo molecular generation included Bayesian optimization over VAE latent spaces (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a>), reinforcement learning with recurrent neural networks (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>), sequential Monte Carlo search, and Monte Carlo tree search (ChemTS). These methods share two practical limitations:</p>
<ol>
<li>
<p><strong>Simulation concurrency</strong>: Most methods generate one molecule at a time, making it difficult to run multiple molecular simulations (e.g., <a href="https://en.wikipedia.org/wiki/Molecular_docking">docking</a>) in parallel. This wastes computational resources in high-throughput virtual screening settings.</p>
</li>
<li>
<p><strong>Molecular diversity</strong>: Deep learning generators tend to exploit narrow regions of chemical space. Deep reinforcement learning methods in particular often generate very similar molecules, requiring special countermeasures to maintain diversity. Since drug discovery is a multi-stage pipeline, limited diversity reduces survival rates in downstream <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> screening.</p>
</li>
</ol>
<p>ChemGE addresses both problems by maintaining a large population of molecules that are evolved and evaluated concurrently.</p>
<h2 id="core-innovation-chromosome-to-smiles-mapping-via-grammar-rules">Core Innovation: Chromosome-to-SMILES Mapping via Grammar Rules</h2>
<p>ChemGE encodes each molecule as a chromosome: a sequence of $N$ integers that deterministically maps to a SMILES string through a context-free grammar. The mapping process works as follows:</p>
<ol>
<li>Start with the grammar&rsquo;s start symbol</li>
<li>At each step $k$, look up the $k$-th integer $c = C[k]$ from the chromosome</li>
<li>Identify the leftmost non-terminal symbol and count its $r$ applicable production rules</li>
<li>Apply the $((c \bmod r) + 1)$-th rule</li>
<li>Repeat until no non-terminal symbols remain or the chromosome is exhausted</li>
</ol>
<p>The context-free grammar is a subset of the OpenSMILES specification, defined formally as $G = (V, \Sigma, R, S)$ where $V$ is the set of non-terminal symbols, $\Sigma$ is the set of terminal symbols, $R$ is the set of production rules, and $S$ is the start symbol.</p>
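<p>The decoding scheme can be sketched in a few lines of Python. The grammar below is a tiny illustrative fragment, not the paper&rsquo;s full OpenSMILES subset, and <code>decode</code> is a hypothetical helper following the modulo rule-selection described above:</p>

```python
# Toy grammar fragment: maps non-terminals to lists of productions.
# This is an illustration only, not the paper's OpenSMILES grammar.
TOY_GRAMMAR = {
    "smiles": [["chain"]],
    "chain": [["atom"], ["atom", "chain"]],
    "atom": [["C"], ["N"], ["O"]],
}

def decode(chromosome, grammar, start="smiles", max_steps=100):
    """Map a list of integers to a string via leftmost derivation.

    At step k, codon c = chromosome[k] selects the ((c mod r) + 1)-th
    of the r productions for the leftmost non-terminal. Returns None
    if the chromosome is exhausted before the derivation terminates.
    """
    symbols = [start]
    k = 0
    for _ in range(max_steps):
        # Find the leftmost non-terminal symbol.
        idx = next((i for i, s in enumerate(symbols) if s in grammar), None)
        if idx is None:
            return "".join(symbols)   # fully terminal: a SMILES string
        if k >= len(chromosome):
            return None               # chromosome exhausted -> invalid
        rules = grammar[symbols[idx]]
        choice = rules[chromosome[k] % len(rules)]
        symbols[idx:idx + 1] = choice
        k += 1
    return None

print(decode([0, 1, 0, 0, 0], TOY_GRAMMAR))  # prints CC
```

<p>Because the mapping is deterministic, mutating a single integer yields a small, reproducible perturbation of the derived molecule.</p>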
<p>Evolution follows the $(\mu + \lambda)$ evolution strategy:</p>
<ol>
<li>Create $\lambda$ new chromosomes by drawing random chromosomes from the population and mutating one integer at a random position</li>
<li>Translate each chromosome to a SMILES string and evaluate fitness (e.g., docking score). Invalid molecules receive fitness $-\infty$</li>
<li>Select the top $\mu$ molecules from the merged pool of $\mu + \lambda$ candidates</li>
</ol>
<p>The authors did not use crossover, as it did not improve performance. Diversity is inherently maintained because a large fraction of molecules are mutated in each generation.</p>
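<p>The selection loop above can be sketched as follows; this is a minimal illustration with a toy fitness function standing in for docking or druglikeness scores, not the authors&rsquo; implementation:</p>

```python
import random

def evolve(population, fitness, mu, lam, generations, mutate):
    """(mu + lambda) evolution strategy over integer chromosomes.

    `fitness` maps a chromosome to a score (invalid molecules would
    receive -inf); `mutate` changes one integer at a random position.
    """
    pop = list(population)
    for _ in range(generations):
        # Draw lam parents at random and mutate each once.
        offspring = [mutate(random.choice(pop)) for _ in range(lam)]
        # Select the top mu from the merged (mu + lam) pool.
        merged = pop + offspring
        merged.sort(key=fitness, reverse=True)
        pop = merged[:mu]
    return pop

def point_mutation(chromosome, n_values=256):
    child = list(chromosome)
    child[random.randrange(len(child))] = random.randrange(n_values)
    return child

# Toy fitness: maximize the sum of codons (stands in for a real score).
best = evolve([[0] * 8 for _ in range(10)], sum, mu=10, lam=20,
              generations=50, mutate=point_mutation)
```

<p>Because the $\lambda$ offspring are independent, their fitness evaluations (e.g., docking runs) can be dispatched to separate workers in parallel.</p>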
<h2 id="experimental-setup-and-benchmark-comparisons">Experimental Setup and Benchmark Comparisons</h2>
<h3 id="druglikeness-score-benchmark">Druglikeness Score Benchmark</h3>
<p>The first experiment optimized the penalized logP score $J^{\log P}$, an indicator of druglikeness defined as:</p>
<p>$$
J^{\log P}(m) = \log P(m) - \text{SA}(m) - \text{ring-penalty}(m)
$$</p>
<p>where $\log P(m)$ is the <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">octanol-water partition coefficient</a>, $\text{SA}(m)$ is the synthetic accessibility score, and $\text{ring-penalty}(m)$ penalizes carbon rings larger than size 6. All terms are normalized to zero mean and unit standard deviation. Initial populations were randomly sampled from the ZINC database (35 million compounds), with fitness set to $-\infty$ for molecules with molecular weight above 500 or duplicate structures.</p>
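<p>The score composition can be illustrated with precomputed components; in practice the raw $\log P$ and SA values come from cheminformatics tooling (e.g., RDKit&rsquo;s Crippen $\log P$ and the SA-score contrib script), and the normalization statistics below are hypothetical placeholders:</p>

```python
def penalized_logp(logp, sa, ring_penalty, stats):
    """J = z(logP) - z(SA) - z(ring_penalty), each term standardized
    with (mean, std) pairs computed over a reference set and passed in
    via `stats`. The raw component values are assumed precomputed."""
    z = lambda x, key: (x - stats[key][0]) / stats[key][1]
    return z(logp, "logp") - z(sa, "sa") - z(ring_penalty, "ring")

# Hypothetical (mean, std) normalization statistics for illustration:
stats = {"logp": (2.46, 1.44), "sa": (3.05, 0.83), "ring": (0.04, 0.29)}
score = penalized_logp(logp=3.1, sa=2.5, ring_penalty=0.0, stats=stats)
```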
<p>ChemGE was compared against CVAE, GVAE, and ChemTS across population sizes $(\mu, \lambda) \in \{(10, 20), (100, 200), (1000, 2000), (10000, 20000)\}$.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>2h</th>
          <th>4h</th>
          <th>6h</th>
          <th>8h</th>
          <th>Mol/Min</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemGE (10, 20)</td>
          <td>4.46 ± 0.34</td>
          <td>4.46 ± 0.34</td>
          <td>4.46 ± 0.34</td>
          <td>4.46 ± 0.34</td>
          <td>14.5</td>
      </tr>
      <tr>
          <td>ChemGE (100, 200)</td>
          <td>5.17 ± 0.26</td>
          <td>5.17 ± 0.26</td>
          <td>5.17 ± 0.26</td>
          <td>5.17 ± 0.26</td>
          <td>135</td>
      </tr>
      <tr>
          <td>ChemGE (1000, 2000)</td>
          <td>4.45 ± 0.24</td>
          <td>5.32 ± 0.43</td>
          <td>5.73 ± 0.33</td>
          <td>5.88 ± 0.34</td>
          <td>527</td>
      </tr>
      <tr>
          <td>ChemGE (10000, 20000)</td>
          <td>4.20 ± 0.33</td>
          <td>4.28 ± 0.28</td>
          <td>4.40 ± 0.27</td>
          <td>4.53 ± 0.26</td>
          <td>555</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>-30.18 ± 26.91</td>
          <td>-1.39 ± 2.24</td>
          <td>-0.61 ± 1.08</td>
          <td>-0.006 ± 0.92</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>GVAE</td>
          <td>-4.34 ± 3.14</td>
          <td>-1.29 ± 1.67</td>
          <td>-0.17 ± 0.96</td>
          <td>0.25 ± 1.31</td>
          <td>1.38</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.91 ± 0.38</td>
          <td>5.41 ± 0.51</td>
          <td>5.49 ± 0.44</td>
          <td>5.58 ± 0.50</td>
          <td>40.89</td>
      </tr>
  </tbody>
</table>
<p>At $(\mu, \lambda) = (1000, 2000)$, ChemGE achieved the highest final score of 5.88 and generated 527 unique molecules per minute, roughly 13x faster than ChemTS and 3700x faster than CVAE. The small population (10, 20) converged prematurely with insufficient diversity, while the overly large population (10000, 20000) could not run enough generations to optimize effectively.</p>
<h3 id="docking-experiment-with-thymidine-kinase">Docking Experiment with Thymidine Kinase</h3>
<p>The second experiment applied ChemGE to generate molecules with high predicted binding affinity for <a href="https://en.wikipedia.org/wiki/Thymidine_kinase">thymidine kinase</a> (KITH), a well-known antiviral drug target. The authors used rDock for docking simulation, taking the best intermolecular score $S_{\text{inter}}$ from three runs with different initial conformations. Fitness was defined as $-S_{\text{inter}}$ (lower scores indicate higher affinity). The protein structure was taken from PDB ID 2B8T.</p>
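<p>As a minimal sketch, the fitness definition amounts to negating the best (lowest) of the three rDock intermolecular scores:</p>

```python
def docking_fitness(inter_scores):
    """Fitness for the docking experiment: rDock is run three times
    with different initial conformations, the best (lowest) score
    S_inter is kept, and fitness is its negation so that stronger
    predicted binding yields higher fitness."""
    return -min(inter_scores)

fitness = docking_fitness([-12.3, -15.1, -14.0])
```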
<p>With 32 parallel cores and $(\mu, \lambda) = (32, 64)$, ChemGE completed 1000 generations in approximately 26 hours, generating 9466 molecules total. Among these, 349 molecules achieved intermolecular scores better than the best known inhibitor in the DUD-E database.</p>
<h3 id="diversity-analysis">Diversity Analysis</h3>
<p>Molecular diversity was measured using internal diversity based on Morgan fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|^2} \sum_{(x,y) \in A \times A} T_d(x, y)
$$</p>
<p>where $T_d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a>.</p>
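<p>A direct transcription of this metric, using fingerprints represented as plain sets of on-bit indices (in the paper these are Morgan fingerprints, which would be computed with a toolkit such as RDKit):</p>

```python
def tanimoto_distance(x, y):
    """Tanimoto distance between fingerprints given as sets of on-bit
    indices: 1 - |x ∩ y| / |x ∪ y|."""
    return 1.0 - len(x & y) / len(x | y)

def internal_diversity(fps):
    """Mean pairwise Tanimoto distance over all ordered pairs,
    including self-pairs, matching I(A) = (1/|A|^2) * sum T_d(x, y)."""
    n = len(fps)
    return sum(tanimoto_distance(x, y) for x in fps for y in fps) / n**2

# Three toy fingerprints as bit-index sets:
fps = [{1, 2, 3}, {2, 3, 4}, {1, 4, 5}]
diversity = internal_diversity(fps)
```

<p>Self-pairs contribute zero distance, so the metric is bounded below by $0$ and approaches $1$ only for large, mutually disjoint sets.</p>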
<p>The 349 &ldquo;ChemGE-active&rdquo; molecules (those scoring better than the best known inhibitor) had an internal diversity of 0.55, compared to 0.46 for known inhibitors and 0.65 for the whole ZINC database. This is a substantial improvement over known actives, achieved without any explicit diversity-promoting mechanism.</p>
<p>ISOMAP visualizations showed that ChemGE populations migrated away from known inhibitors over generations, ultimately occupying a completely different region of chemical space by generation 1000. This suggests ChemGE discovered a novel structural class of potential binders.</p>
<h2 id="high-throughput-and-diversity-without-deep-learning">High Throughput and Diversity Without Deep Learning</h2>
<p>ChemGE demonstrates several notable findings:</p>
<ol>
<li>
<p><strong>Deep learning is not required</strong> for competitive de novo molecular generation. Grammatical evolution over SMILES achieves higher throughput and comparable or better optimization scores than VAE- and RNN-based methods.</p>
</li>
<li>
<p><strong>Population size matters significantly</strong>. Too small a population leads to premature convergence. Too large a population prevents sufficient per-molecule optimization within the computational budget. The $(\mu, \lambda) = (1000, 2000)$ setting provided the best balance.</p>
</li>
<li>
<p><strong>Inherent diversity</strong> is a key advantage of evolutionary methods. Without any explicit diversity loss or penalty, ChemGE maintains diversity comparable to the ZINC database and exceeds that of known active molecules.</p>
</li>
<li>
<p><strong>Parallel evaluation</strong> is naturally supported. Each generation produces $\lambda$ independent molecules that can be evaluated by separate docking simulators simultaneously.</p>
</li>
</ol>
<p>The authors acknowledge several limitations. Synthetic routes and ADMET properties were not evaluated for the generated molecules. The docking scores, while favorable, require confirmation through biological assays. The authors also note that incorporating probabilistic or neural models into the evolutionary process might further improve performance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial population</td>
          <td>ZINC</td>
          <td>~35M compounds</td>
          <td>Randomly sampled starting molecules</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2B8T</td>
          <td>1 structure</td>
          <td>Thymidine kinase-ligand complex</td>
      </tr>
      <tr>
          <td>Baseline actives</td>
          <td>DUD-E (KITH)</td>
          <td>57 inhibitors</td>
          <td>Known thymidine kinase inhibitors</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Grammatical evolution with $(\mu + \lambda)$ evolution strategy</li>
<li>Mutation only (no crossover)</li>
<li>Context-free grammar subset of OpenSMILES specification</li>
<li>Chromosome length: $N$ integers per molecule</li>
<li>Fitness set to $-\infty$ for invalid SMILES, MW &gt; 500, or duplicate molecules</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. ChemGE is purely evolutionary.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max $J^{\log P}$ (8h)</td>
          <td>5.88 ± 0.34</td>
          <td>ChemTS: 5.58 ± 0.50</td>
          <td>ChemGE (1000, 2000)</td>
      </tr>
      <tr>
          <td>Molecules/min</td>
          <td>527</td>
          <td>ChemTS: 40.89</td>
          <td>~13x throughput improvement</td>
      </tr>
      <tr>
          <td>Docking hits</td>
          <td>349</td>
          <td>Best DUD-E inhibitor</td>
          <td>Molecules with better $S_{\text{inter}}$</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.55</td>
          <td>Known inhibitors: 0.46</td>
          <td>Morgan fingerprint Tanimoto distance</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>CPU: Intel Xeon E5-2630 v3 (benchmark experiments, single core)</li>
<li>Docking: 32 cores in parallel (thymidine kinase experiment, ~26 hours for 1000 generations)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tsudalab/ChemGE">ChemGE</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoshikawa, N., Terayama, K., Sumita, M., Homma, T., Oono, K., &amp; Tsuda, K. (2018). Population-based de novo molecule generation, using grammatical evolution. <em>Chemistry Letters</em>, 47(11), 1431-1434. <a href="https://doi.org/10.1246/cl.180665">https://doi.org/10.1246/cl.180665</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yoshikawa2018chemge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Population-based De Novo Molecule Generation, Using Grammatical Evolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yoshikawa, Naruki and Terayama, Kei and Sumita, Masato and Homma, Teruki and Oono, Kenta and Tsuda, Koji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1431--1434}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1246/cl.180665}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemCrow: Augmenting LLMs with 18 Chemistry Tools</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/</guid><description>ChemCrow integrates 18 expert-designed chemistry tools with GPT-4 to enable autonomous synthesis planning, drug discovery, and materials design tasks.</description><content:encoded><![CDATA[<h2 id="an-llm-powered-chemistry-agent">An LLM-Powered Chemistry Agent</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemCrow, an LLM chemistry agent that augments GPT-4 with 18 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. Rather than relying on the LLM&rsquo;s internal knowledge (which is often inaccurate for chemistry), ChemCrow uses the LLM as a reasoning engine that iteratively calls specialized tools to gather information, plan actions, and execute experiments. The system successfully planned and executed real-world chemical syntheses on a robotic platform, demonstrating one of the first chemistry-related LLM agent interactions with the physical world.</p>
<h2 id="bridging-llm-reasoning-and-chemical-expertise">Bridging LLM Reasoning and Chemical Expertise</h2>
<p>Large language models have transformed many domains, but they struggle with chemistry-specific problems. GPT-4 cannot reliably perform basic operations like multiplying large numbers, converting <a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_chemistry">IUPAC names</a> to molecular structures, or predicting reaction outcomes. These limitations stem from the models&rsquo; token-prediction design, which does not encode chemical reasoning or factual chemical knowledge reliably.</p>
<p>Meanwhile, the chemistry community has developed numerous specialized computational tools for reaction prediction, <a href="/notes/chemistry/molecular-design/reaction-prediction/">retrosynthesis</a> planning, molecular property prediction, and de novo molecular generation. These tools exist in isolated environments with steep learning curves, making them difficult for experimental chemists to integrate and use together. The gap between LLM reasoning capabilities and specialized chemistry tools presents an opportunity: augmenting LLMs with these tools could compensate for the models&rsquo; chemical knowledge deficiencies while providing a natural language interface to specialized computational chemistry capabilities.</p>
<h2 id="tool-augmented-reasoning-via-react">Tool-Augmented Reasoning via ReAct</h2>
<p>ChemCrow builds on the ReAct (Reasoning and Acting) framework, where the LLM follows an iterative Thought &rarr; Action &rarr; Action Input &rarr; Observation loop. At each step, the model reasons about the current state of the task, selects an appropriate tool, provides input, pauses while the tool executes, and then incorporates the observation before deciding on the next step. This continues until a final answer is reached.</p>
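<p>A stripped-down sketch of such a loop is shown below. The <code>llm</code> callable and the tool registry are stand-ins (in ChemCrow these are GPT-4 via LangChain and the 18 chemistry tools), and the step format is simplified to a dictionary:</p>

```python
def react_loop(question, llm, tools, max_steps=10):
    """Iterate Thought -> Action -> Action Input -> Observation until
    the model emits a final answer or the step budget is exhausted."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)           # one Thought/Action decision
        transcript += f"Thought: {step['thought']}\n"
        if step["action"] == "Final Answer":
            return step["input"]
        # Run the chosen tool and feed its output back as an observation.
        observation = tools[step["action"]](step["input"])
        transcript += (f"Action: {step['action']}\n"
                       f"Action Input: {step['input']}\n"
                       f"Observation: {observation}\n")
    return None

# Toy run: one tool call, then a final answer.
tools = {"Name2SMILES": lambda name: "CCO" if name == "ethanol" else "?"}
steps = iter([
    {"thought": "Look up the structure.", "action": "Name2SMILES",
     "input": "ethanol"},
    {"thought": "I have the SMILES.", "action": "Final Answer",
     "input": "CCO"},
])
answer = react_loop("What is the SMILES of ethanol?",
                    lambda _transcript: next(steps), tools)
```

<p>In the real system the safety tools are invoked before any <code>ReactionExecute</code> call, which in this sketch would correspond to a guard step inserted before the tool dispatch.</p>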
<p>The system integrates 18 tools organized into four categories:</p>
<p><strong>General tools</strong> include web search (via SerpAPI), literature search (using paper-qa with OpenAI embeddings and FAISS), a Python REPL for arbitrary code execution, and a human interaction interface.</p>
<p><strong>Molecule tools</strong> cover Name2SMILES (converting molecule names to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> via Chem-Space, PubChem, and OPSIN), SMILES2Price (checking purchasability via molbloom and ZINC20), Name2CAS (CAS number lookup via PubChem), molecular Similarity (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> with ECFP2 fingerprints), ModifyMol (local chemical space exploration via SynSpace), PatentCheck (bloom filter patent lookup via molbloom), FuncGroups (functional group identification via SMARTS patterns), and SMILES2Weight (molecular weight calculation via RDKit).</p>
<p><strong>Safety tools</strong> include ControlledChemicalCheck (screening against chemical weapons lists from <a href="https://en.wikipedia.org/wiki/Organisation_for_the_Prohibition_of_Chemical_Weapons">OPCW</a> and the Australia Group), ExplosiveCheck (GHS explosive classification via PubChem), and SafetySummary (comprehensive safety overview from PubChem data).</p>
<p><strong>Chemical reaction tools</strong> include NameRXN (reaction classification via NextMove Software), ReactionPredict (product prediction via IBM&rsquo;s RXN4Chemistry API using the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>), ReactionPlanner (multi-step synthesis planning via RXN4Chemistry), and ReactionExecute (direct synthesis execution on IBM&rsquo;s RoboRXN robotic platform).</p>
<p>A key design feature is that safety checks are automatically invoked before synthesis execution. If a molecule is flagged as a controlled chemical or precursor, execution stops immediately.</p>
<h2 id="experimental-validation-and-evaluation">Experimental Validation and Evaluation</h2>
<h3 id="autonomous-synthesis">Autonomous Synthesis</h3>
<p>ChemCrow autonomously planned and executed four real-world syntheses on the IBM RoboRXN cloud-connected robotic platform:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/DEET">DEET</a></strong> (insect repellent), from the prompt &ldquo;Plan and execute the synthesis of an insect repellent&rdquo;</li>
<li><strong>Three <a href="https://en.wikipedia.org/wiki/Thiourea">thiourea</a> <a href="https://en.wikipedia.org/wiki/Organocatalysis">organocatalysts</a></strong> (Schreiner&rsquo;s, Ricci&rsquo;s, and Takemoto&rsquo;s catalysts), from a prompt asking to find and synthesize a thiourea organocatalyst that accelerates the <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reaction</a></li>
</ul>
<p>All four syntheses yielded the anticipated compounds. ChemCrow demonstrated the ability to autonomously adapt synthesis procedures when the RoboRXN platform flagged issues (such as insufficient solvent or invalid purification actions), iteratively modifying the procedure until it was valid.</p>
<h3 id="novel-chromophore-discovery">Novel Chromophore Discovery</h3>
<p>In a human-AI collaboration scenario, ChemCrow was instructed to train a machine learning model to screen candidate <a href="https://en.wikipedia.org/wiki/Chromophore">chromophores</a>. The system loaded and cleaned data from a chromophore database, trained and evaluated a random forest model, and suggested a molecule with a target absorption maximum of 369 nm. The proposed molecule was subsequently synthesized and characterized, revealing a measured absorption maximum of 336 nm, confirming the discovery of a new chromophore.</p>
<h3 id="expert-vs-llm-evaluation">Expert vs. LLM Evaluation</h3>
<p>The evaluation used 14 use cases spanning synthesis planning, molecular design, and chemical logic. Both ChemCrow and standalone GPT-4 (without tools) were evaluated by:</p>
<ol>
<li><strong>Expert human evaluators</strong> (n=4): Assessed correctness of chemistry, quality of reasoning, and degree of task completion</li>
<li><strong>EvaluatorGPT</strong>: An LLM evaluator prompted to assess responses</li>
</ol>
<p>Key findings from the evaluation:</p>
<table>
  <thead>
      <tr>
          <th>Evaluator</th>
          <th>Preferred System</th>
          <th>Reasoning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Human experts</td>
          <td>ChemCrow</td>
          <td>Better chemical accuracy and task completeness, especially on complex tasks</td>
      </tr>
      <tr>
          <td>EvaluatorGPT</td>
          <td>GPT-4</td>
          <td>Favored fluent, complete-sounding responses despite factual errors</td>
      </tr>
  </tbody>
</table>
<p>Human experts preferred ChemCrow across most tasks, with the exception of very simple tasks where GPT-4 could answer from memorized training data (e.g., synthesis of well-known molecules like paracetamol). GPT-4 without tools consistently produced hallucinations that appeared convincing but were factually incorrect upon expert inspection.</p>
<p>An important finding is that LLM-based evaluation (EvaluatorGPT) cannot replace expert human assessment for scientific tasks. The LLM evaluator lacks the domain knowledge needed to distinguish fluent but incorrect answers from accurate ones, rendering it unsuitable for benchmarking factuality in chemistry.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>ChemCrow demonstrates that augmenting LLMs with expert-designed tools transforms them from &ldquo;hyperconfident, typically wrong information sources&rdquo; into reasoning engines that can gather and act on accurate chemical information. The system lowers the barrier for non-experts to access computational chemistry tools through natural language while serving as an assistant to expert chemists.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Tool dependency</strong>: ChemCrow&rsquo;s performance is bounded by the quality and coverage of its tools. Improved synthesis engines would directly improve synthesis planning capabilities.</li>
<li><strong>Reasoning failures</strong>: Tools become useless if the LLM&rsquo;s reasoning about when and how to use them is flawed, or if garbage inputs are provided.</li>
<li><strong>Reproducibility</strong>: The API-based approach to closed-source LLMs (GPT-4) limits reproducibility of individual results. The authors note that open-source models could address this, potentially at the cost of reasoning quality.</li>
<li><strong>Evaluation scope</strong>: The 14 evaluation tasks, while diverse, represent a limited test set. Standardized benchmarks for LLM-based chemistry tools did not exist at the time of publication.</li>
<li><strong>Safety considerations</strong>: While safety tools prevent execution of controlled chemical syntheses, risks remain from inaccurate reasoning or tool outputs leading to suboptimal conclusions.</li>
</ul>
<p>The authors emphasize that ChemCrow&rsquo;s modular design allows easy extension with new tools, and that future integration of image-processing tools, additional language-based tools, and other capabilities could substantially enhance the system.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chromophore screening</td>
          <td>DB for chromophore (Joung et al.)</td>
          <td>Not specified</td>
          <td>Used for training random forest model</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>14 expert-designed tasks</td>
          <td>14 tasks</td>
          <td>Spanning synthesis, molecular design, and chemical logic</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>OPCW Schedules 1-3, Australia Group lists</td>
          <td>Not specified</td>
          <td>Used for controlled chemical screening</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>LLM</strong>: GPT-4 with temperature 0.1</li>
<li><strong>Framework</strong>: LangChain for tool integration</li>
<li><strong>Reasoning</strong>: ReAct (Reasoning + Acting) framework with chain-of-thought prompting</li>
<li><strong>Synthesis planning</strong>: IBM RXN4Chemistry API (Molecular Transformer-based)</li>
<li><strong>Molecule similarity</strong>: Tanimoto similarity with ECFP2 fingerprints via RDKit</li>
<li><strong>Chemical space exploration</strong>: SynSpace with 50 robust medicinal chemistry reactions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-4 (OpenAI, closed-source) for reasoning</li>
<li>Random forest for chromophore screening (trained on the fly)</li>
<li>Molecular Transformer via RXN4Chemistry API for reaction prediction and retrosynthesis</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Human evaluation</strong>: 4 expert chemists rated responses on chemistry correctness, reasoning quality, and task completion</li>
<li><strong>LLM evaluation</strong>: EvaluatorGPT assessed responses (found unreliable for factuality)</li>
<li><strong>Experimental validation</strong>: 4 syntheses on RoboRXN platform, 1 novel chromophore characterization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The system relies primarily on API calls to GPT-4 and RXN4Chemistry, so local compute requirements are minimal.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-public">chemcrow-public</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source implementation with 12 of 18 tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-runs">chemcrow-runs</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>All experiment outputs and evaluation data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884639">Zenodo release (code)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release v0.3.24</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884645">Zenodo release (runs)</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>Archived experiment runs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., &amp; Schwaller, P. (2024). Augmenting large language models with chemistry tools. <em>Nature Machine Intelligence</em>, 6(5), 525-535. <a href="https://doi.org/10.1038/s42256-024-00832-8">https://doi.org/10.1038/s42256-024-00832-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{bran2024augmenting,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmenting large language models with chemistry tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bran, Andres M. and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{525--535}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00832-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChatDrug: Conversational Drug Editing with ChatGPT</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chatdrug-conversational-drug-editing/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chatdrug-conversational-drug-editing/</guid><description>ChatDrug uses ChatGPT with retrieval and domain feedback for drug editing across small molecules, peptides, and proteins on 39 tasks.</description><content:encoded><![CDATA[<h2 id="a-framework-for-conversational-drug-editing-with-llms">A Framework for Conversational Drug Editing with LLMs</h2>
<p>ChatDrug is a <strong>Method</strong> paper that introduces a parameter-free framework for drug editing using conversational large language models (specifically ChatGPT/GPT-3.5). The primary contribution is a three-module pipeline that combines prompt engineering, retrieval-augmented domain feedback, and iterative conversation to perform text-guided editing of small molecules, peptides, and proteins. The paper also establishes a benchmark of 39 drug editing tasks spanning these three drug types.</p>
<h2 id="bridging-conversational-ai-and-drug-discovery">Bridging Conversational AI and Drug Discovery</h2>
<p>Drug editing (known as <a href="https://en.wikipedia.org/wiki/Hit_to_lead">lead optimization</a> for small molecules and protein design for proteins) is a critical step in the drug discovery pipeline, in which molecular substructures are modified to achieve desired properties. Traditional approaches rely on domain experts for manual editing, which can be subjective and biased. Recent multi-modal approaches such as MoleculeSTM and ProteinDT have begun exploring text-guided drug editing, but they are domain-specific (limited to a single drug type) and lack the conversational capabilities needed for iterative refinement.</p>
<p>The authors identify three properties of conversational LLMs that make them suitable for drug discovery: (1) pretraining on comprehensive knowledge bases covering drug-related concepts, (2) strong few-shot adaptation and generalization abilities, and (3) interactive communication enabling iterative feedback incorporation. However, directly applying LLMs to drug editing yields suboptimal results because the models do not fully utilize prior domain knowledge. ChatDrug addresses this gap through structured retrieval and feedback mechanisms.</p>
<h2 id="three-module-pipeline-pdds-redf-and-conversation">Three-Module Pipeline: PDDS, ReDF, and Conversation</h2>
<p>ChatDrug consists of three modules that operate sequentially without any parameter learning.</p>
<h3 id="pdds-module-prompt-design-for-domain-specific">PDDS Module (Prompt Design for Domain-Specific)</h3>
<p>The PDDS module constructs domain-specific prompts for ChatGPT. Given an input drug $\pmb{x}_{\text{in}}$ and a text prompt $\pmb{x}_t$ describing the desired property change, the goal is:</p>
<p>$$
\pmb{x}_{\text{out}} = \text{ChatDrug}(\pmb{x}_{\text{in}}, \pmb{x}_t)
$$</p>
<p>The prompts are designed around high-level property descriptions (e.g., &ldquo;more soluble in water&rdquo;) rather than exact substructure replacements. The authors argue that ChatDrug is better suited to &ldquo;fuzzy searching&rdquo; (property-based editing with non-deterministic answers) than to &ldquo;exact searching&rdquo; (precise substructure replacement, which experts can perform directly).</p>
<h3 id="redf-module-retrieval-and-domain-feedback">ReDF Module (Retrieval and Domain Feedback)</h3>
<p>The ReDF module retrieves structurally similar examples from a domain-specific database and injects them into the conversation as demonstrations. For an input drug $\pmb{x}_{\text{in}}$, a candidate drug $\tilde{\pmb{x}}$ that failed the desired property change, and a retrieval database, ReDF returns:</p>
<p>$$
\pmb{x}_R = \text{ReDF}(\pmb{x}_{\text{in}}, \tilde{\pmb{x}}; \pmb{x}_t) = \underset{\pmb{x}'_R \in \text{RetrievalDB}}{\arg\max} \langle \tilde{\pmb{x}}, \pmb{x}'_R \rangle \wedge D(\pmb{x}_{\text{in}}, \pmb{x}'_R; \pmb{x}_t)
$$</p>
<p>where $D(\cdot, \cdot; \cdot) \in {\text{True}, \text{False}}$ is a domain feedback function checking whether the retrieved drug satisfies the desired property change, and $\langle \tilde{\pmb{x}}, \pmb{x}&rsquo;_R \rangle$ is a similarity function (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> for small molecules, <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> for peptides and proteins).</p>
<p>The retrieved example $\pmb{x}_R$ is injected into the prompt as: &ldquo;Your provided sequence [$\tilde{\pmb{x}}$] is not correct. We find a sequence [$\pmb{x}_R$] which is correct and similar to the molecule you provided. Can you give me a new molecule?&rdquo;</p>
<h3 id="conversation-module">Conversation Module</h3>
<p>The conversation module enables iterative refinement over $C$ rounds. At each round $c$, if the edited drug $\pmb{x}_c$ does not satisfy the evaluation condition, ChatDrug retrieves a new example via ReDF using $\tilde{\pmb{x}} = \pmb{x}_c$ and continues the conversation. This aligns with the iterative nature of real drug discovery workflows.</p>
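<p>The round structure can be sketched as follows; the <code>llm</code>, <code>evaluate</code>, and <code>redf</code> callables are hypothetical stand-ins for the ChatGPT call, the property oracle, and the ReDF module:</p>

```python
def chatdrug_edit(x_in, prompt, llm, evaluate, redf, rounds=2):
    """Iterative ChatDrug-style loop (sketch): query the LLM, check the
    property with the oracle, and on failure inject a retrieved example."""
    candidate = llm(f"{prompt} Input: {x_in}")
    for _ in range(rounds):
        if evaluate(x_in, candidate):
            return candidate  # property change satisfied
        x_r = redf(x_in, candidate)  # retrieved correct, similar drug
        candidate = llm(
            f"Your provided sequence [{candidate}] is not correct. "
            f"We find a sequence [{x_r}] which is correct and similar. "
            "Can you give me a new molecule?"
        )
    return candidate if evaluate(x_in, candidate) else None
```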
<h2 id="experiments-across-39-drug-editing-tasks">Experiments Across 39 Drug Editing Tasks</h2>
<h3 id="task-design">Task Design</h3>
<p>The benchmark includes 39 tasks across three drug types:</p>
<ul>
<li><strong>Small molecules</strong> (28 tasks): 16 single-objective (tasks 101-108, each with loose and strict thresholds) and 12 multi-objective tasks (tasks 201-206, each with two thresholds). Properties include solubility (<a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>), drug-likeness (QED), permeability (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">tPSA</a>), <a href="https://en.wikipedia.org/wiki/Hydrogen_bond">hydrogen bond</a> acceptors/donors.</li>
<li><strong>Peptides</strong> (9 tasks): 6 single-objective and 3 multi-objective tasks for editing <a href="https://en.wikipedia.org/wiki/Major_histocompatibility_complex">peptide-MHC binding</a> affinity across different <a href="https://en.wikipedia.org/wiki/Human_leukocyte_antigen">HLA allele</a> types.</li>
<li><strong>Proteins</strong> (2 tasks): Editing protein sequences to increase <a href="https://en.wikipedia.org/wiki/Alpha_helix">alpha-helix</a> or <a href="https://en.wikipedia.org/wiki/Beta_sheet">beta-strand</a> secondary structures.</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>For small molecules, baselines include Random, PCA, High-Variance, and GS-Mutate (all based on MegaMolBART), plus MoleculeSTM with <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and Graph representations. For peptides and proteins, random mutation baselines with 1-3 mutated positions are used.</p>
<h3 id="main-results">Main Results</h3>
<p>ChatDrug achieves the best performance on 33 out of 39 tasks. Key results for small molecule editing (hit ratio):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Property</th>
          <th>ChatDrug (loose)</th>
          <th>Best Baseline (loose)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>101</td>
          <td>More soluble</td>
          <td>94.13</td>
          <td>67.86 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>102</td>
          <td>Less soluble</td>
          <td>96.86</td>
          <td>64.79 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>106</td>
          <td>Lower permeability</td>
          <td>77.35</td>
          <td>34.13 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>107</td>
          <td>More HBA</td>
          <td>95.35</td>
          <td>54.01 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>108</td>
          <td>More HBD</td>
          <td>96.54</td>
          <td>60.97 (MoleculeSTM-Graph)</td>
      </tr>
  </tbody>
</table>
<p>ChatDrug underperforms on tasks 104 (less like a drug) and 105 (higher permeability) and most multi-objective tasks involving permeability (205), where MoleculeSTM variants perform better.</p>
<p>For peptide editing, ChatDrug achieves 41-69% hit ratios compared to 0.4-14.4% for random mutation baselines. For protein editing, ChatDrug reaches 34.79% and 51.38% hit ratios on helix and strand tasks respectively, compared to 26.90% and 21.44% for the best random mutation baseline.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Conversation rounds</strong>: Performance increases with more rounds, converging around $C = 2$. For example, on task 101 (loose threshold), zero-shot achieves 78.26%, $C = 1$ reaches 89.56%, and $C = 2$ reaches 93.37%.</p>
<p><strong>ReDF threshold</strong>: Using a stricter threshold in the domain feedback function $D$ (matching the evaluation threshold) yields substantially higher performance than using a loose threshold. For example, on task 107 with strict evaluation, the strict-threshold ReDF achieves 72.60% vs. 14.96% for the loose-threshold ReDF.</p>
<p><strong>Similarity analysis</strong>: Retrieved molecules $\pmb{x}_R$ tend to have lower similarity to input molecules than the intermediate outputs $\pmb{x}_1$, yet they have higher hit ratios. This suggests the ReDF module explores the chemical space effectively, and the conversation module balances similarity preservation with property optimization.</p>
<p><strong>Knowledge extraction</strong>: ChatDrug can articulate domain-specific reasoning for its edits (e.g., summarizing rules for increasing water solubility by introducing polar functional groups), though the extracted knowledge shows some redundancy.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>ChatDrug demonstrates that conversational LLMs can serve as useful tools for drug editing, achieving strong results across diverse drug types with a parameter-free approach. The framework exhibits open vocabulary and compositional properties, allowing it to handle novel drug concepts and multi-objective tasks through natural language.</p>
<p>The authors acknowledge two main limitations. First, ChatDrug struggles with understanding complex 3D drug geometries, which would require deeper geometric modeling. Second, the framework requires multiple conversation rounds to achieve strong performance, adding computational cost through repeated API calls. The authors suggest that knowledge summarization capabilities of LLMs could help reduce this cost.</p>
<p>The evaluation relies entirely on computational oracles (RDKit for small molecules, MHCflurry2.0 for peptides, ProteinCLAP for proteins) rather than wet-lab validation. The hit ratio metric also excludes invalid outputs from the denominator, so the effective success rate on all attempted edits may be lower than reported.</p>
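<p>The denominator effect is easy to illustrate with a toy <code>hit_ratio</code> in the reported style (a hypothetical helper, assuming validity and property checks are available as callables): if 6 of 8 valid outputs succeed but 2 of 10 attempts were invalid, the reported ratio is 75% even though only 60% of all attempts succeeded:</p>

```python
def hit_ratio(outputs, is_valid, satisfies):
    """Hit ratio as reported: successes over *valid* outputs only,
    so invalid generations do not count against the score."""
    valid = [o for o in outputs if is_valid(o)]
    if not valid:
        return 0.0
    return sum(satisfies(o) for o in valid) / len(valid)
```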
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Small molecule inputs</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a></td>
          <td>200 molecules</td>
          <td>Sampled SMILES strings</td>
      </tr>
      <tr>
          <td>Small molecule retrieval DB</td>
          <td>ZINC</td>
          <td>10K molecules</td>
          <td>For ReDF similarity search</td>
      </tr>
      <tr>
          <td>Peptide inputs</td>
          <td>Peptide-MHC binding dataset</td>
          <td>500 peptides per task</td>
          <td>From 30 common MHC alleles</td>
      </tr>
      <tr>
          <td>Peptide retrieval DB</td>
          <td>Experimental binding data</td>
          <td>Varies by allele</td>
          <td>Target allele experimental data</td>
      </tr>
      <tr>
          <td>Protein inputs</td>
          <td>TAPE test set</td>
          <td>Varies</td>
          <td>Secondary structure prediction test data</td>
      </tr>
      <tr>
          <td>Protein retrieval DB</td>
          <td>TAPE training set</td>
          <td>Varies</td>
          <td>Secondary structure prediction training data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-3.5-turbo via OpenAI ChatCompletion API, temperature=0, frequency_penalty=0.2</li>
<li>System prompt: &ldquo;You are an expert in the field of molecular chemistry.&rdquo;</li>
<li>$C = 2$ conversation rounds for main results</li>
<li>5 random seeds (0-4) for small molecule main results, seed 0 for ablations</li>
</ul>
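<p>Assembled as a request payload, these settings look roughly as follows. This is a sketch mirroring the OpenAI ChatCompletion parameters reported above; <code>build_chatdrug_request</code> is a hypothetical helper and no API call is made:</p>

```python
def build_chatdrug_request(task_prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    """Assemble the request payload with the settings reported in the
    paper (temperature=0, frequency_penalty=0.2). The dict mirrors the
    shape of an OpenAI ChatCompletion request."""
    return {
        "model": model,
        "temperature": 0,
        "frequency_penalty": 0.2,
        "messages": [
            {
                "role": "system",
                "content": "You are an expert in the field of molecular chemistry.",
            },
            {"role": "user", "content": task_prompt},
        ],
    }
```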
<h3 id="models">Models</h3>
<ul>
<li>ChatGPT (GPT-3.5-turbo): used as-is, no fine-tuning</li>
<li>MHCflurry 2.0: pseudo-oracle for peptide binding affinity evaluation</li>
<li>ProteinCLAP-EBM-NCE from ProteinDT: protein secondary structure prediction</li>
<li>ESMFold: protein folding for visualization</li>
<li>RDKit: molecular property calculations for small molecules</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit Ratio</td>
          <td>Fraction of valid edits satisfying property requirements</td>
          <td>Invalid sequences excluded from denominator</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All experiments were conducted on a single NVIDIA RTX A6000 GPU (used only for peptide and protein evaluation). Total OpenAI API cost was less than $100.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/chao1224/ChatDrug">ChatDrug GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, S., Wang, J., Yang, Y., Wang, C., Liu, L., Guo, H., &amp; Xiao, C. (2024). Conversational Drug Editing Using Retrieval and Domain Feedback. <em>ICLR 2024</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2024chatdrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Conversational Drug Editing Using Retrieval and Domain Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Shengchao and Wang, Jiongxiao and Yang, Yijin and Wang, Chengpeng and Liu, Ling and Guo, Hongyu and Xiao, Chaowei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BioT5: Cross-Modal Integration of Biology and Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/</guid><description>BioT5 is a T5-based pretraining framework that jointly models molecules, proteins, and natural language using SELFIES for robust molecular generation.</description><content:encoded><![CDATA[<h2 id="a-unified-pretraining-framework-for-molecules-proteins-and-text">A Unified Pretraining Framework for Molecules, Proteins, and Text</h2>
<p>BioT5 is a <strong>Method</strong> paper that introduces a comprehensive <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-based pretraining framework for cross-modal integration of molecules, proteins, and natural language. The primary contribution is a multi-task pretraining approach that uses <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (instead of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) for 100% valid molecular representations, separate tokenization for each modality, and a combination of masked language modeling and translation objectives to connect structured biological data with unstructured scientific text. After fine-tuning, BioT5 (252M parameters) achieves state-of-the-art performance on 10 out of 15 downstream tasks spanning molecule property prediction, protein property prediction, drug-target interaction, protein-protein interaction, molecule captioning, and text-based molecule generation.</p>
<h2 id="bridging-the-gap-between-molecular-sequences-and-scientific-knowledge">Bridging the Gap Between Molecular Sequences and Scientific Knowledge</h2>
<p>Prior cross-modal models in computational biology face three recurring challenges. First, models like MolT5 and MolXPT rely on SMILES to represent molecules, but SMILES strings are syntactically fragile: random perturbations or model-generated sequences frequently produce invalid molecular structures. Edwards et al. (2022) and Li et al. (2023) both highlight this validity problem as a bottleneck for text-to-molecule generation. Second, the contextual information surrounding molecular and protein names in scientific literature (e.g., mentions in <a href="https://en.wikipedia.org/wiki/PubMed">PubMed</a> abstracts that describe properties, interactions, and experimental results) remains underutilized. Most models either ignore this context or treat it identically to structured database entries. Third, existing approaches like MolT5 and <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> share a single tokenizer and embedding space across molecules, proteins, and text. This leads to chemically incorrect tokenization: the bromine atom &ldquo;Br&rdquo; in SMILES gets split into &ldquo;B&rdquo; (boron) and &ldquo;r&rdquo;, producing erroneous downstream predictions.</p>
<p>BioT5 addresses all three issues simultaneously by adopting SELFIES for molecular representation, extracting entity-linked contextual knowledge from PubMed, and employing separate vocabularies for each modality.</p>
<h2 id="selfies-separate-tokenization-and-multi-task-pretraining">SELFIES, Separate Tokenization, and Multi-Task Pretraining</h2>
<p>The core innovations of BioT5 center on three design decisions:</p>
<h3 id="selfies-for-robust-molecular-representation">SELFIES for Robust Molecular Representation</h3>
<p>BioT5 replaces SMILES with SELFIES (Self-referencing Embedded Strings) for all molecular representations. Every permutation of symbols within the SELFIES alphabet generates a chemically valid molecular structure, guaranteeing 100% validity in generation tasks. Molecules from ZINC20 are converted from SMILES to SELFIES during data preprocessing.</p>
<h3 id="modality-specific-tokenization">Modality-Specific Tokenization</h3>
<p>Rather than sharing a single SentencePiece vocabulary across modalities, BioT5 maintains three separate dictionaries:</p>
<ul>
<li><strong>Molecules</strong>: Each SELFIES token corresponds to a chemically meaningful atom group enclosed in brackets (e.g., <code>[C]</code>, <code>[=C]</code>, <code>[Br]</code>).</li>
<li><strong>Proteins</strong>: Amino acids are prefixed with a special <code>&lt;p&gt;</code> token to distinguish them from text characters (e.g., <code>&lt;p&gt;M</code>, <code>&lt;p&gt;K</code>, <code>&lt;p&gt;R</code>).</li>
<li><strong>Text</strong>: The standard T5 vocabulary is retained.</li>
</ul>
<p>This prevents semantic conflation across modalities. The total vocabulary size is 35,073, and the model comprises 252M parameters using the T5-v1.1-base architecture.</p>
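<p>A toy illustration of the modality-aware tokenization (illustrative only, not the released BioT5 tokenizer): SELFIES strings split on brackets, amino acids receive the <code>&lt;p&gt;</code> prefix, and text falls through to a subword vocabulary, here approximated by whitespace splitting:</p>

```python
import re


def tokenize(sequence: str, modality: str):
    """Toy modality-aware tokenizer in the spirit of BioT5's separate
    vocabularies (a sketch, not the released tokenizer)."""
    if modality == "molecule":
        # SELFIES tokens are bracket-enclosed atom groups, e.g. [C], [=C], [Br]
        return re.findall(r"\[[^\]]*\]", sequence)
    if modality == "protein":
        # amino acids get a <p> prefix so 'M' (Met) never collides with text 'M'
        return [f"<p>{aa}" for aa in sequence]
    # stand-in for the standard T5 subword vocabulary
    return sequence.split()
```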
<h3 id="multi-task-pretraining-objectives">Multi-Task Pretraining Objectives</h3>
<p>BioT5 uses six pretraining tasks organized into three categories:</p>
<ol>
<li><strong>Single-modal T5 objective</strong>: Standard span corruption and recovery applied independently to molecule SELFIES (task 1), protein <a href="https://en.wikipedia.org/wiki/FASTA_format">FASTA</a> (task 2), and general text from C4 (task 3).</li>
<li><strong>Wrapped text T5 objective</strong> (task 4): Applied to PubMed articles where molecular names are replaced with corresponding SELFIES strings and gene names are appended with protein FASTA sequences, using BERN2 for named entity recognition and entity linking.</li>
<li><strong>Bidirectional translation</strong> (tasks 5 and 6): Molecule SELFIES to text description and vice versa (using 339K pairs from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>), and protein FASTA to text description and vice versa (using 569K pairs from <a href="https://en.wikipedia.org/wiki/UniProt">Swiss-Prot</a>).</li>
</ol>
<p>The translation direction is randomly sampled with probability 0.5 for each example. For downstream tasks, BioT5 uses prompt-based fine-tuning to cast all tasks into a sequence generation format, reducing the gap between pretraining and fine-tuning.</p>
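<p>The single-modal T5 objective (tasks 1-3) can be sketched as span corruption over modality tokens: contiguous spans are replaced by sentinels in the input, and the target reconstructs the dropped spans. This is a simplified deterministic version taking explicit span indices; the actual objective samples spans randomly:</p>

```python
def span_corrupt(tokens, spans):
    """Build a T5 span-corruption (input, target) pair, where each span
    is a (start, end) half-open index range into `tokens` (sketch)."""
    inp, tgt = [], []
    prev = 0
    for k, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp += tokens[prev:s] + [sentinel]   # drop the span, keep a marker
        tgt += [sentinel] + tokens[s:e]      # target restores it after the marker
        prev = e
    inp += tokens[prev:]
    tgt.append(f"<extra_id_{len(spans)}>")   # closing sentinel, T5 convention
    return inp, tgt
```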
<h2 id="evaluation-across-15-downstream-tasks">Evaluation Across 15 Downstream Tasks</h2>
<p>BioT5 is evaluated on 15 tasks organized into three categories: single-instance prediction, multi-instance prediction, and cross-modal generation.</p>
<h3 id="molecule-property-prediction-moleculenet">Molecule Property Prediction (MoleculeNet)</h3>
<p>BioT5 is evaluated on six binary classification tasks from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> using scaffold splitting: BBBP, Tox21, ClinTox, HIV, BACE, and SIDER. Results are averaged over three random runs.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GEM</th>
          <th>MolXPT</th>
          <th>BioT5</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>72.4</td>
          <td>80.0</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>78.1</td>
          <td>77.1</td>
          <td>77.9</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>90.1</td>
          <td>95.3</td>
          <td>95.4</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>78.1</td>
          <td><strong>81.0</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>85.6</td>
          <td>88.4</td>
          <td><strong>89.4</strong></td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>67.2</td>
          <td>71.7</td>
          <td><strong>73.2</strong></td>
      </tr>
      <tr>
          <td><strong>Avg</strong></td>
          <td>79.0</td>
          <td>81.9</td>
          <td><strong>82.4</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best average AUROC (82.4) across all six datasets, surpassing both GNN-based methods (GEM) and language model baselines (MolXPT).</p>
<h3 id="protein-property-prediction-peer-benchmark">Protein Property Prediction (PEER Benchmark)</h3>
<p>On the PEER benchmark, BioT5 is evaluated on protein solubility and subcellular localization prediction:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>Solubility (Acc)</th>
          <th>Localization (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESM-1b</td>
          <td>652.4M</td>
          <td>70.23</td>
          <td><strong>92.40</strong></td>
      </tr>
      <tr>
          <td>ProtBert</td>
          <td>419.9M</td>
          <td>68.15</td>
          <td>91.32</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252.1M</td>
          <td><strong>74.65</strong></td>
          <td>91.69</td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best solubility prediction accuracy (74.65%) despite being 2-3x smaller than dedicated protein language models like ESM-1b and ProtBert.</p>
<h3 id="drug-target-interaction-prediction">Drug-Target Interaction Prediction</h3>
<p>BioT5 is evaluated on three DTI datasets (BioSNAP, Human, BindingDB) with five random runs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BioSNAP AUROC</th>
          <th>Human AUROC</th>
          <th>BindingDB AUROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugBAN</td>
          <td>0.903</td>
          <td>0.982</td>
          <td>0.960</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.989</strong></td>
          <td><strong>0.963</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 consistently outperforms DrugBAN and other specialized DTI models across all three datasets.</p>
<h3 id="molecule-captioning-and-text-based-molecule-generation">Molecule Captioning and Text-Based Molecule Generation</h3>
<p>On the ChEBI-20 dataset, BioT5 outperforms all baselines in molecule captioning:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>BLEU-4</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-large</td>
          <td>783M</td>
          <td>0.508</td>
          <td>0.614</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>MolXPT</td>
          <td>350M</td>
          <td>0.505</td>
          <td>0.626</td>
          <td>0.594</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td><strong>0.556</strong></td>
          <td><strong>0.656</strong></td>
          <td><strong>0.603</strong></td>
      </tr>
  </tbody>
</table>
<p>For text-based molecule generation, BioT5 achieves an exact match score of 0.413 (vs. 0.311 for MolT5-large) while maintaining 100% validity, compared to 90.5% for MolT5-large. This demonstrates the direct benefit of SELFIES: every generated sequence is a valid molecule.</p>
<h3 id="protein-protein-interaction-prediction">Protein-Protein Interaction Prediction</h3>
<p>On the PEER PPI benchmarks (Yeast and Human), BioT5 achieves competitive results, outperforming fully fine-tuned ProtBert and ESM-1b on the Yeast dataset (64.89% vs. 63.72% for ProtBert) and placing second on Human (86.22% vs. 88.06% for ESM-1b with frozen weights).</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BioT5 demonstrates that integrating molecular, protein, and textual modalities within a single pretraining framework yields consistent improvements across diverse biological tasks. Three factors drive BioT5&rsquo;s performance: (1) SELFIES guarantees 100% molecular validity in generation tasks, eliminating a persistent failure mode of SMILES-based models; (2) separate tokenization preserves the semantic integrity of each modality; (3) wrapped text pretraining on PubMed provides contextual biological knowledge that pure sequence models miss.</p>
<p>The authors acknowledge several limitations. BioT5 requires full-parameter fine-tuning for each downstream task because instruction-tuning does not generalize across tasks, and combining datasets via instructions causes data leakage (the authors note overlaps between BindingDB training data and BioSNAP/Human test sets). The model only handles sequence-format bio-entities and does not incorporate 2D or 3D structural information. Additional biological modalities such as DNA/RNA sequences and cell-level data are also left for future work.</p>
<p>The authors also note risks: BioT5 could potentially be misused to generate dangerous molecules, and it may fail to generate effective therapeutic molecules or produce compounds with adverse side effects.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (molecules)</td>
          <td>ZINC20</td>
          <td>~300M molecules</td>
          <td>Converted from SMILES to SELFIES</td>
      </tr>
      <tr>
          <td>Pretraining (proteins)</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniRef50</a></td>
          <td>27M proteins</td>
          <td>Filtered by length</td>
      </tr>
      <tr>
          <td>Pretraining (text)</td>
          <td>C4</td>
          <td>Large</td>
          <td>Standard T5 corpus</td>
      </tr>
      <tr>
          <td>Pretraining (wrapped text)</td>
          <td>PubMed</td>
          <td>33M articles</td>
          <td>Entity linking via BERN2</td>
      </tr>
      <tr>
          <td>Pretraining (molecule-text pairs)</td>
          <td>PubChem</td>
          <td>339K pairs</td>
          <td>Excludes ChEBI-20 molecules</td>
      </tr>
      <tr>
          <td>Pretraining (protein-text pairs)</td>
          <td>Swiss-Prot</td>
          <td>569K pairs</td>
          <td>High-quality annotations</td>
      </tr>
      <tr>
          <td>Evaluation (molecular properties)</td>
          <td>MoleculeNet</td>
          <td>6 datasets</td>
          <td>Scaffold splitting</td>
      </tr>
      <tr>
          <td>Evaluation (protein properties)</td>
          <td>PEER</td>
          <td>2 tasks</td>
          <td>Solubility and localization</td>
      </tr>
      <tr>
          <td>Evaluation (DTI)</td>
          <td>BioSNAP, Human, BindingDB</td>
          <td>3 datasets</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Evaluation (PPI)</td>
          <td>Yeast, Human</td>
          <td>2 datasets</td>
          <td>From PEER benchmark</td>
      </tr>
      <tr>
          <td>Evaluation (generation)</td>
          <td>ChEBI-20</td>
          <td>33K pairs</td>
          <td>Molecule captioning and text-to-molecule</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5-v1.1-base (encoder-decoder transformer)</li>
<li>Optimizer: AdamW with RMS scaling</li>
<li>Learning rate: cosine annealing, base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$</li>
<li>Warmup steps: 10,000</li>
<li>Dropout: 0.0</li>
<li>Maximum input length: 512 tokens</li>
<li>Pretraining steps: 350K</li>
<li>Batch size: 96 per GPU (6 data types per batch)</li>
<li>Prompt-based fine-tuning for all downstream tasks</li>
</ul>
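<p>The cosine-annealed schedule with linear warmup described above can be sketched in a few lines. This is an illustrative reconstruction from the listed hyperparameters (base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$, 10,000 warmup steps, 350K total steps); the exact warmup shape in the BioT5/nanoT5 codebase may differ.</p>

```python
import math

def lr_schedule(step, warmup=10_000, total=350_000,
                base_lr=1e-2, min_lr=1e-5):
    """Cosine annealing with linear warmup (illustrative sketch)."""
    if step < warmup:
        # linear warmup from 0 to base_lr
        return base_lr * step / warmup
    # cosine decay from base_lr down to min_lr over the remaining steps
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

<p>At step 10,000 the schedule peaks at the base rate and then decays smoothly to the minimum by step 350K.</p>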
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Vocabulary Size</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td>35,073</td>
          <td>T5-v1.1-base</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecule property prediction: AUROC on 6 MoleculeNet tasks (scaffold split, 3 runs)</li>
<li>Protein property prediction: accuracy on PEER benchmark (3 runs)</li>
<li>Drug-target interaction: AUROC, AUPRC, accuracy on 3 DTI datasets (5 runs)</li>
<li>Protein-protein interaction: accuracy on 2 PPI datasets (3 runs)</li>
<li>Molecule captioning: BLEU, ROUGE, METEOR, Text2Mol on ChEBI-20</li>
<li>Text-based molecule generation: BLEU, exact match, fingerprint similarities, FCD, validity on ChEBI-20</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8x NVIDIA A100 80GB GPUs for pretraining</li>
<li>Codebase: nanoT5</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/QizhiPei/BioT5">BioT5 Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Pei, Q., Zhang, W., Zhu, J., Wu, K., Gao, K., Wu, L., Xia, Y., &amp; Yan, R. (2023). BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. <em>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</em>, 1102-1123. <a href="https://doi.org/10.18653/v1/2023.emnlp-main.70">https://doi.org/10.18653/v1/2023.emnlp-main.70</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{pei2023biot5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1102--1123}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.emnlp-main.70}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MTL-BERT: Multitask BERT for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</guid><description>MTL-BERT combines BERT pretraining, multitask learning, and SMILES enumeration for molecular property prediction across 60 drug discovery datasets.</description><content:encoded><![CDATA[<h2 id="a-multitask-bert-framework-for-molecular-property-prediction">A Multitask BERT Framework for Molecular Property Prediction</h2>
<p>MTL-BERT is a <strong>Method</strong> paper that introduces a multitask learning framework built on BERT for predicting molecular properties from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>. The primary contribution is the combination of three strategies to address data scarcity in drug discovery: (1) masked token pretraining on 1.7 million unlabeled molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, (2) multitask fine-tuning across 60 property prediction datasets simultaneously, and (3) <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as a data augmentation technique applied during pretraining, fine-tuning, and inference. The model achieves strong performance across 60 <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> and molecular property datasets (44 classification and 16 regression), outperforming baselines including GNNs, XGBoost with molecular fingerprints, and prior <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> approaches.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p>Deep learning methods for molecular property prediction face a fundamental tension: they require large amounts of labeled data to learn effectively, but labeled bioactivity data is scarce due to the cost and time of laboratory experiments. Existing approaches at the time of publication addressed this in isolation. Graph neural networks (GNNs) learn from molecular graphs but are typically shallow (2-3 layers) and prone to overfitting on small datasets. The original SMILES-BERT model applied masked language modeling to SMILES strings but fine-tuned separately for each task, missing opportunities to share information across related properties. Fixed molecular representations like <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (continuous and data-driven descriptors) cannot be further optimized for specific downstream tasks.</p>
<p>The authors identify three specific gaps: (1) single-task fine-tuning wastes the correlations between related ADMET properties (e.g., <a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a> relates to many ADMET endpoints), (2) using only canonical SMILES limits the model&rsquo;s ability to learn robust molecular features, and (3) no prior work had combined pretraining, multitask learning, and SMILES enumeration into a unified framework.</p>
<h2 id="three-strategies-combined-pretraining-multitask-learning-and-smiles-enumeration">Three Strategies Combined: Pretraining, Multitask Learning, and SMILES Enumeration</h2>
<p>The core innovation of MTL-BERT is the synergistic combination of three strategies in a single pipeline.</p>
<h3 id="masked-smiles-pretraining">Masked SMILES Pretraining</h3>
<p>Following the BERT paradigm, MTL-BERT pretrains on 1.7 million unlabeled molecules from ChEMBL using a masked token recovery task. For each SMILES string, 15% of tokens are randomly selected: 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% remain unchanged. The loss is computed only at masked positions. Unlike the original BERT, MTL-BERT omits the next-sentence prediction task since there is no sequential relationship between SMILES strings (following the RoBERTa finding that this task is unnecessary).</p>
<p>SMILES strings are tokenized with a regular expression that captures multi-character tokens (e.g., Si, Br, Cl) and common SMILES syntax. The model uses positional encoding to capture token order.</p>
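<p>The tokenization-plus-masking pipeline can be sketched as follows. The regex is a simplified illustration of the multi-character tokenizer described above, not the paper's exact pattern, and the 80/10/10 split mirrors the BERT masking recipe.</p>

```python
import random
import re

# Simplified SMILES token pattern (illustrative): bracket atoms,
# two-letter elements (Br, Cl, Si), ring closures, bonds, branches.
TOKEN_RE = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|%\d{2}|[BCNOPSFI]|[bcnops]|[=#\-+()/\\.@:\d]"
)

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking: of the selected 15%, 80% become [MASK],
    10% a random token, 10% stay unchanged. Returns (inputs, labels);
    labels are None at unmasked positions, so loss is computed only
    at masked ones."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            labels.append(tok)            # prediction target
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)        # kept, but still predicted
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

toks = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
inputs, labels = mask_tokens(toks, vocab=sorted(set(toks)))
```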
<h3 id="transformer-architecture">Transformer Architecture</h3>
<p>The model uses a standard Transformer encoder with multihead self-attention. The scaled dot-product attention computes:</p>
<p>$$\mathbf{O}_h = \text{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}}\right) \mathbf{V}_h$$</p>
<p>where $\mathbf{Q}_h$, $\mathbf{K}_h$, and $\mathbf{V}_h$ are the query, key, and value matrices for head $h$, and the scaling factor $\sqrt{d_k}$ is the square root of the key dimension. The outputs from all heads are concatenated and projected. Each attention sublayer is followed by a position-wise feedforward network with GELU activation, layer normalization, and residual connections.</p>
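<p>A minimal single-head version of the attention formula above, written with plain lists so the arithmetic is explicit (real implementations use batched tensor libraries):</p>

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, with matrices
    given as lists of row vectors."""
    d_k = len(K[0])
    KT = [list(col) for col in zip(*K)]            # transpose of K
    scores = matmul(Q, KT)                         # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row])
               for row in scores]                  # row-wise softmax
    return matmul(weights, V)

out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

<p>Each output row is a convex combination of the value rows, so with one-hot values the rows sum to one, and each query attends most strongly to its matching key.</p>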
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
          <th>Fine-tuning Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MTL-BERT_SMALL</td>
          <td>4</td>
          <td>4</td>
          <td>128</td>
          <td>512</td>
          <td>0.931</td>
          <td>0.826</td>
      </tr>
      <tr>
          <td>MTL-BERT_MEDIUM</td>
          <td>8</td>
          <td>8</td>
          <td>256</td>
          <td>1,024</td>
          <td>0.962</td>
          <td>0.852</td>
      </tr>
      <tr>
          <td>MTL-BERT_LARGE</td>
          <td>12</td>
          <td>12</td>
          <td>576</td>
          <td>2,304</td>
          <td>0.974</td>
          <td>0.848</td>
      </tr>
  </tbody>
</table>
<p>The medium model was selected for its best fine-tuning performance with lower computational cost, despite the large model achieving higher pretraining recovery accuracy. The slight performance drop for the large model suggests mild overfitting.</p>
<h3 id="multitask-fine-tuning-with-task-tokens">Multitask Fine-tuning with Task Tokens</h3>
<p>During fine-tuning, task tokens ([T0], [T1], &hellip;) are prepended to each input SMILES string. The Transformer output at each task token position is passed through a task-specific two-layer feedforward network for the corresponding prediction task. An attention mask prevents direct information exchange between task tokens, allowing each task to learn directly from SMILES tokens without interference. This design also reduces the discrepancy between pretraining (no task tokens visible) and fine-tuning.</p>
<p>Cross-entropy loss is used for classification tasks and mean squared error for regression tasks. The total multitask loss is a simple sum of per-task losses without learned weighting.</p>
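<p>The task-token attention mask can be sketched as below. The paper states only that task tokens cannot exchange information directly; the treatment of SMILES-to-task-token attention here is one plausible reading (SMILES tokens attend only to each other, matching what they saw in pretraining), labeled as an assumption.</p>

```python
def task_token_attention_mask(n_tasks, n_smiles):
    """Boolean attention mask (True = attention allowed) for the
    sequence [T0, ..., T{k-1}, s1, ..., sn]. Each task token attends
    to itself and to all SMILES tokens, but not to other task tokens;
    SMILES tokens attend only to SMILES tokens (assumed)."""
    size = n_tasks + n_smiles
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            i_task, j_task = i < n_tasks, j < n_tasks
            if i_task:
                # task token: self plus every SMILES position
                mask[i][j] = (i == j) or not j_task
            else:
                # SMILES token: only other SMILES positions
                mask[i][j] = not j_task
    return mask

m = task_token_attention_mask(2, 3)
```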
<h3 id="smiles-enumeration-as-data-augmentation">SMILES Enumeration as Data Augmentation</h3>
<p>A molecule can be represented by multiple valid SMILES strings by varying starting atoms and traversal orders. MTL-BERT applies SMILES enumeration at all three stages:</p>
<ol>
<li><strong>Pretraining</strong>: Enumerated SMILES increase diversity of the self-supervised training data.</li>
<li><strong>Fine-tuning</strong>: Each dataset is augmented 20x with random SMILES variants, increasing data diversity and helping the model learn position-invariant features.</li>
<li><strong>Inference</strong>: Multiple SMILES are generated per test molecule, and their predictions are fused (averaged) into a more robust final prediction.</li>
</ol>
<p>The 20x augmentation factor was chosen based on prior work showing diminishing returns beyond this level while significantly increasing computational cost.</p>
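<p>The inference-time fusion step reduces to averaging a model's outputs across SMILES variants of the same molecule. A minimal sketch, where <code>model</code> is a hypothetical stand-in for MTL-BERT (any callable mapping a SMILES string to a prediction):</p>

```python
def fuse_predictions(model, enumerated_smiles):
    """Average the model's predictions over multiple enumerated
    SMILES of the same molecule (inference-time fusion)."""
    preds = [model(smi) for smi in enumerated_smiles]
    return sum(preds) / len(preds)

def noisy_model(smi):
    # Toy stand-in whose output depends on the string form of the
    # SMILES, so fusion across variants actually changes the result.
    return len(smi) % 5 / 10 + 0.5

variants = ["CCO", "OCC", "C(O)C"]  # three SMILES for ethanol
fused = fuse_predictions(noisy_model, variants)
```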
<h2 id="experimental-evaluation-across-60-datasets">Experimental Evaluation Across 60 Datasets</h2>
<h3 id="setup">Setup</h3>
<p>MTL-BERT was evaluated on 60 datasets (44 classification, 16 regression) covering ADMET properties and common molecular benchmarks. Datasets were sourced from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>. Each dataset was split 8:1:1 (train/validation/test), and experiments were repeated 10 times with random splits, reporting mean and standard deviation.</p>
<p>Classification tasks were evaluated with <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> and accuracy; regression tasks with $R^2$ and RMSE.</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ul>
<li><strong>ECFP4-XGBoost</strong>: Extended-connectivity fingerprints (diameter 4) with gradient boosting</li>
<li><strong>Graph Attention Network (GAT)</strong></li>
<li><strong>Graph Convolutional Network (GCN)</strong></li>
<li><strong>AttentiveFP</strong>: A GNN with attention for molecular property prediction</li>
<li><strong>CDDD</strong>: Continuous and data-driven descriptors from a pretrained RNN auto-encoder</li>
</ul>
<h3 id="ablation-study">Ablation Study</h3>
<p>Three model variants were compared to isolate contributions:</p>
<ul>
<li><strong>MTL-BERT</strong>: Full model (pretraining + multitask + SMILES enumeration)</li>
<li><strong>STL-BERT</strong>: Single-task fine-tuning with SMILES enumeration (no multitask)</li>
<li><strong>Cano-BERT</strong>: Canonical SMILES only, single-task fine-tuning (equivalent to SMILES-BERT)</li>
</ul>
<p>Cano-BERT showed more than 10% degradation on several datasets (CL, Fu, LC50DM) compared to STL-BERT, demonstrating the importance of SMILES enumeration. MTL-BERT outperformed STL-BERT on most datasets, with improvements exceeding 5% on $F_{20\%}$, SR-ARE, and SR-ATAD5, confirming that multitask learning provides additional benefit on top of enumeration.</p>
<h3 id="results-vs-baselines">Results vs. Baselines</h3>
<p>MTL-BERT outperformed all baselines on nearly all 60 datasets. Specific findings:</p>
<ul>
<li>ECFP4-XGBoost performed inconsistently, doing well on some tasks (e.g., $F_{30\%}$, BACE, CL) but poorly on others, reflecting the limitation of fixed-length fingerprint representations.</li>
<li>GNNs generally improved over fingerprints but still suffered from data scarcity, falling behind ECFP4-XGBoost by more than 3% on $F_{30\%}$, Carcinogenicity, CL, and VD.</li>
<li>MTL-BERT surpassed all baselines except on CYP2C19-sub and BACE (by less than 1.1%).</li>
<li>On 14 tasks (NR-ER, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, Bioconcentration Factor, Fu, LC50FM, Lipophilicity, CL, PPB, VD, LC50DM), MTL-BERT exceeded the best baseline by more than 5-10%.</li>
<li>Improvements were statistically significant at the 95% confidence level (paired t-test, $P \leq 0.001$).</li>
</ul>
<h3 id="representation-analysis">Representation Analysis</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pretrained token embeddings (from 1,000 randomly selected molecules, approximately 35,000 tokens) showed that:</p>
<ul>
<li>Tokens of the same type cluster together (capturing atomic type information).</li>
<li>Within type clusters, sub-groups correspond to different chemical environments (e.g., oxygen atoms in nitrate groups vs. carbonyl groups).</li>
<li>Nearby embeddings share similar molecular neighborhood environments.</li>
</ul>
<h3 id="attention-based-interpretability">Attention-based Interpretability</h3>
<p>The model&rsquo;s attention weights provide interpretability for predictions:</p>
<ul>
<li>For a solubility task (LogS/LogD), attention concentrated on polar groups, which are known determinants of aqueous solubility.</li>
<li>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> (mutagenicity), attention focused on <a href="https://en.wikipedia.org/wiki/Azide">azide</a>, nitrosamide, <a href="https://en.wikipedia.org/wiki/Acyl_chloride">acylchloride</a>, and nitrite groups, which are known mutagenic structural alerts.</li>
</ul>
<h2 id="performance-gains-from-combined-strategies-with-interpretable-attention">Performance Gains from Combined Strategies with Interpretable Attention</h2>
<p>MTL-BERT demonstrates that the combination of pretraining, multitask learning, and SMILES enumeration is more effective than any individual strategy for molecular property prediction. The ablation study provides clear evidence for the additive benefit of each component.</p>
<p>Key strengths include the breadth of evaluation (60 datasets covering diverse ADMET endpoints), the consistent improvement over multiple baseline types (fingerprints, GNNs, pretrained representations), and the interpretable attention mechanism that highlights chemically meaningful substructures.</p>
<p>Limitations to note: the simple sum of multitask losses (no learned task weighting) may not be optimal when tasks have very different scales or when some tasks are unrelated. The authors observe slight degradation on a few datasets (AMES, CYP1A2-Sub, FreeSolv), suggesting negative transfer in those cases. The 20x SMILES enumeration significantly increases computational cost during fine-tuning and inference. The paper does not report wall-clock training times or GPU hours, making it difficult to assess the practical cost of the enumeration strategy. Hardware details are not specified beyond acknowledgment of the High-Performance Computing Center at Central South University.</p>
<p>The hierarchical clustering of task representations reveals meaningful task groupings (e.g., LogD and LogP cluster together due to their shared relationship with water solubility), supporting the premise that multitask learning captures cross-task correlations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>1.7M molecules</td>
          <td>Unlabeled SMILES; 10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning/Evaluation</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>60 datasets (44 classification, 16 regression)</td>
          <td>8:1:1 train/val/test split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Masked token prediction (15% masking rate: 80% [MASK], 10% random, 10% unchanged). Adam optimizer, learning rate 1e-4, batch size 512, 50 epochs.</li>
<li><strong>Fine-tuning</strong>: Adam optimizer, learning rate 5e-5, batch size 64, dropout 0.1. Cross-entropy for classification, MSE for regression. Early stopping with patience 20, max 200 epochs.</li>
<li><strong>SMILES enumeration</strong>: 20x augmentation; generation is retried up to 100 times when an enumerated SMILES duplicates one already produced.</li>
<li><strong>Inference fusion</strong>: Predictions from multiple enumerated SMILES are averaged.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>MTL-BERT_MEDIUM (selected model): 8 layers, 8 attention heads, 256 embedding size, 1,024 FFN size</li>
<li>Pretraining recovery accuracy: 0.962</li>
<li>1,000 task tokens pre-allocated for future tasks</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Classification</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Regression</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Secondary metric</td>
      </tr>
  </tbody>
</table>
<p>All experiments repeated 10 times with random splits; mean and standard deviation reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The authors acknowledge the High-Performance Computing Center of Central South University.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/MTL-BERT">MTL-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Pretraining data source</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Fine-tuning benchmark</td>
      </tr>
      <tr>
          <td><a href="https://admetmesh.scbdd.com/">ADMETlab</a></td>
          <td>Dataset</td>
          <td>Free for academic use</td>
          <td>ADMET property datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yi, J.-C., Zeng, X.-X., Yang, C.-Q., Lu, A.-P., Hou, T.-J., &amp; Cao, D.-S. (2022). Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. <em>Research</em>, 2022, Article 0004. <a href="https://doi.org/10.34133/research.0004">https://doi.org/10.34133/research.0004</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2022mtlbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yi, Jia-Cai and Zeng, Xiang-Xiang and Yang, Can-Qun and Lu, Ai-Ping and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{Article 0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.34133/research.0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science (AAAS)}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mol2vec: Unsupervised ML with Chemical Intuition</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/</guid><description>Mol2vec applies Word2vec to Morgan substructures, learning dense vector representations of molecules that capture chemical similarity for property prediction.</description><content:encoded><![CDATA[<h2 id="word2vec-meets-cheminformatics">Word2vec Meets Cheminformatics</h2>
<p>Mol2vec is a <strong>Method</strong> paper that introduces an unsupervised approach for learning dense vector representations of molecular substructures. The core idea is a direct analogy to <a href="/notes/machine-learning/model-architectures/distributed-representations/">Word2vec</a> from natural language processing: molecular substructures (derived from the Morgan algorithm) are treated as &ldquo;words,&rdquo; and entire molecules are treated as &ldquo;sentences.&rdquo; By training on a large unlabeled corpus of 19.9 million compounds, Mol2vec produces embeddings where chemically related substructures occupy nearby regions of vector space. Compound-level vectors are then obtained by summing constituent substructure vectors, and these can serve as features for downstream supervised learning tasks.</p>
<h2 id="sparse-fingerprints-and-their-limitations">Sparse Fingerprints and Their Limitations</h2>
<p>Molecular fingerprints, particularly Morgan fingerprints (extended-connectivity fingerprints, ECFP), are among the most widely used molecular representations in cheminformatics. They perform well for similarity searching, virtual screening, and activity prediction. However, they suffer from several practical drawbacks:</p>
<ul>
<li><strong>High dimensionality and sparsity</strong>: Morgan fingerprints are typically hashed to fixed-length binary vectors (e.g., 2048 or 4096 bits), resulting in very sparse representations.</li>
<li><strong>Bit collisions</strong>: The hashing step can map distinct substructures to the same bit position, losing structural information.</li>
<li><strong>No learned relationships</strong>: Each bit is independent, so the representation does not encode any notion of chemical similarity between substructures.</li>
</ul>
<p>At the time of this work (2017), NLP techniques had started to appear in cheminformatics. The <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> method had been applied to Morgan fingerprints for compound-protein interaction prediction, and <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a> had been used for chemical topic modeling. The Word2vec concept had been adapted for protein sequences (ProtVec) but had not yet been applied to small molecules. Mol2vec fills this gap.</p>
<h2 id="from-substructure-identifiers-to-dense-embeddings">From Substructure Identifiers to Dense Embeddings</h2>
<p>The central insight of Mol2vec is that the Morgan algorithm already produces a natural &ldquo;vocabulary&rdquo; of molecular substructures, and the order in which these substructures appear in a molecule provides local context, analogous to word order in a sentence.</p>
<h3 id="corpus-construction">Corpus Construction</h3>
<p>The training corpus was assembled from <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> v15 and <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v23, merged and deduplicated, then filtered by molecular weight (12-600), heavy atom count (3-50), clogP (-5 to 7), and allowed elements (H, B, C, N, O, F, P, S, Cl, Br). This yielded 19.9 million compounds.</p>
<h3 id="sentence-generation">Sentence Generation</h3>
<p>For each molecule, the Morgan algorithm generates atom identifiers at radius 0 and radius 1. Each atom contributes two identifiers (one per radius), ordered according to the atom order in the canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>. This sequence of identifiers forms a &ldquo;sentence&rdquo; for Word2vec training.</p>
<h3 id="word2vec-training">Word2vec Training</h3>
<p>The model was trained using the gensim implementation of Word2vec. After evaluating both CBOW and Skip-gram architectures with window sizes of 5, 10, and 20, and embedding dimensions of 100 and 300, the best configuration was:</p>
<ul>
<li><strong>Architecture</strong>: Skip-gram</li>
<li><strong>Window size</strong>: 10</li>
<li><strong>Embedding dimension</strong>: 300</li>
</ul>
<p>Rare identifiers appearing fewer than 3 times in the corpus were replaced with a special &ldquo;UNSEEN&rdquo; token, which learns a near-zero vector. This allows the model to handle novel substructures at inference time.</p>
<h3 id="compound-vector-generation">Compound Vector Generation</h3>
<p>The final vector for a molecule is the sum of all its substructure vectors:</p>
<p>$$\mathbf{v}_{\text{mol}} = \sum_{i=1}^{N} \mathbf{v}_{s_i}$$</p>
<p>where $\mathbf{v}_{s_i}$ is the 300-dimensional embedding of the $i$-th substructure identifier in the molecule. Because repeated substructures are summed, the result implicitly encodes substructure counts and importance through vector magnitude.</p>
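<p>The summation with UNSEEN fallback can be sketched as below. The identifiers and 3-dimensional vectors are toy values for illustration (real Mol2vec uses 300-dimensional Skip-gram embeddings keyed by Morgan identifiers):</p>

```python
def compound_vector(identifiers, embeddings, dim=300):
    """Sum substructure embeddings into a molecule-level vector.
    Identifiers absent from the training vocabulary fall back to the
    near-zero 'UNSEEN' vector, as in the paper."""
    unseen = embeddings.get("UNSEEN", [0.0] * dim)
    vec = [0.0] * dim
    for ident in identifiers:
        emb = embeddings.get(ident, unseen)
        vec = [v + e for v, e in zip(vec, emb)]
    return vec

# Toy 3-d vocabulary with made-up identifier strings:
emb = {"id_a": [1.0, 0.0, 0.5],
       "id_b": [0.0, 1.0, 0.5],
       "UNSEEN": [0.0, 0.0, 0.0]}
v = compound_vector(["id_a", "id_b", "never_seen"], emb, dim=3)
```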
<h2 id="benchmarking-across-regression-and-classification-tasks">Benchmarking Across Regression and Classification Tasks</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors evaluated Mol2vec on four datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>Regression</td>
          <td>1,144</td>
          <td>Aqueous solubility prediction</td>
      </tr>
      <tr>
          <td>Ames</td>
          <td>Classification</td>
          <td>6,511</td>
          <td><a href="https://en.wikipedia.org/wiki/Mutagen">Mutagenicity</a> (balanced: 3,481 positive, 2,990 negative)</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Classification</td>
          <td>8,192</td>
          <td>12 human toxicity targets (imbalanced)</td>
      </tr>
      <tr>
          <td>Kinase</td>
          <td>Classification</td>
          <td>284 kinases</td>
          <td>Bioactivity from ChEMBL v23</td>
      </tr>
  </tbody>
</table>
<h3 id="machine-learning-methods">Machine Learning Methods</h3>
<p>Three ML methods were compared using both Mol2vec and Morgan FP features:</p>
<ul>
<li><strong>Random Forest (RF)</strong>: scikit-learn, 500 estimators</li>
<li><strong>Gradient Boosting Machine (GBM)</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>Deep Neural Network (DNN)</strong>: Keras/TensorFlow, 4 hidden layers with 2000 neurons each for Mol2vec; 1 hidden layer with 512 neurons for Morgan FP</li>
</ul>
<p>All models were validated using 20x 5-fold cross-validation with the <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for statistical comparison.</p>
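<p>The 20x 5-fold validation protocol amounts to reshuffling the data twenty times and taking five folds per shuffle, yielding 100 train/test splits per model. A stdlib-only sketch of the split generator (the statistical comparison itself would then apply the Wilcoxon signed-rank test to the paired per-split scores):</p>

```python
import random

def repeated_kfold(n_samples, k=5, repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold CV,
    reshuffling the indices before each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k near-equal folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

splits = list(repeated_kfold(100))  # 20 repeats x 5 folds = 100 splits
```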
<h3 id="esol-regression-results">ESOL Regression Results</h3>
<table>
  <thead>
      <tr>
          <th>Features</th>
          <th>Method</th>
          <th>$R^2_{\text{ext}}$</th>
          <th>MSE</th>
          <th>MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>MLR</td>
          <td>0.81 +/- 0.01</td>
          <td>0.82</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>Molecular Graph</td>
          <td>CNN</td>
          <td>0.93</td>
          <td>0.31 +/- 0.03</td>
          <td>0.40 +/- 0.00</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>GBM</td>
          <td>0.66 +/- 0.00</td>
          <td>1.43 +/- 0.00</td>
          <td>0.88 +/- 0.00</td>
      </tr>
      <tr>
          <td>Mol2vec</td>
          <td>GBM</td>
          <td>0.86 +/- 0.00</td>
          <td>0.62 +/- 0.00</td>
          <td>0.60 +/- 0.00</td>
      </tr>
  </tbody>
</table>
<p>Mol2vec substantially outperformed Morgan FP ($R^2_{\text{ext}}$ 0.86 vs. 0.66) but did not match the best graph convolution methods ($R^2_{\text{ext}}$ ~0.93).</p>
<h3 id="classification-results-ames-and-tox21">Classification Results (Ames and Tox21)</h3>
<p>On the Ames dataset, Mol2vec and Morgan FP performed comparably (AUC 0.87 vs. 0.88), both matching or exceeding prior SVM and Naive Bayes results. On Tox21, both achieved an average AUC of 0.83, outperforming literature results from graph convolution (0.71) and DNN/SVM approaches (0.71-0.72).</p>
<h3 id="proteochemometric-pcm-extension">Proteochemometric (PCM) Extension</h3>
<p>Mol2vec was combined with ProtVec (protein sequence embeddings using the same Word2vec approach on 3-grams) by concatenating vectors, forming PCM2vec. This was evaluated using a rigorous 4-level cross-validation scheme:</p>
<ul>
<li><strong>CV1</strong>: New compound-target pairs</li>
<li><strong>CV2</strong>: New targets</li>
<li><strong>CV3</strong>: New compounds</li>
<li><strong>CV4</strong>: New compounds and targets</li>
</ul>
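<p>The four levels differ only in which entities are withheld. A minimal sketch, assuming the data arrives as (compound, target) pairs; the hold-out fractions and the function name are illustrative choices, not the paper's exact protocol:</p>

```python
import random

def pcm_splits(pairs, seed=0):
    """Illustrative CV1-CV4 test sets for proteochemometric data."""
    rng = random.Random(seed)
    compounds = sorted({c for c, _ in pairs})
    targets = sorted({t for _, t in pairs})
    held_c = set(rng.sample(compounds, max(1, len(compounds) // 5)))
    held_t = set(rng.sample(targets, max(1, len(targets) // 5)))

    cv1 = rng.sample(pairs, max(1, len(pairs) // 5))               # new pairs only
    cv2 = [p for p in pairs if p[1] in held_t]                     # new targets
    cv3 = [p for p in pairs if p[0] in held_c]                     # new compounds
    cv4 = [p for p in pairs if p[0] in held_c and p[1] in held_t]  # both new
    return cv1, cv2, cv3, cv4
```

<p>By construction, CV4 is the hardest setting: every pair there also appears in the CV2 and CV3 hold-outs.</p>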
<p>On Tox21, PCM2vec improved predictions for new compound-target pairs (CV1: AUC 0.87 vs. 0.79 for Morgan FP) and new compounds (CV3: AUC 0.85 vs. 0.78). On the kinase dataset, PCM2vec approached the performance of classical PCM (Morgan + z-scales) while being alignment-independent, meaning it can be applied to proteins with low sequence similarity.</p>
<h2 id="chemical-intuition-and-practical-value">Chemical Intuition and Practical Value</h2>
<h3 id="embedding-quality">Embedding Quality</h3>
<p>The learned substructure embeddings capture meaningful chemical relationships. Hierarchical clustering of the 25 most common substructures shows expected groupings: aromatic carbons cluster together, aliphatic ring carbons form a separate group, and carbonyl carbons and oxygens are closely related. Similarly, t-SNE projections of amino acid vectors encoded by Mol2vec reproduce known amino acid relationships (e.g., similar distances between Glu/Gln and Asp/Asn pairs, reflecting the carboxylic acid to amide transition).</p>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Skip-gram with 300-dimensional embeddings</strong> provides the best Mol2vec representations, consistent with NLP best practices.</li>
<li><strong>Mol2vec excels at regression tasks</strong>, substantially outperforming Morgan FP on ESOL solubility prediction ($R^2_{\text{ext}}$ 0.86 vs. 0.66).</li>
<li><strong>Classification performance is competitive</strong> with Morgan FP across Ames and Tox21 datasets.</li>
<li><strong>PCM2vec enables alignment-independent proteochemometrics</strong>, extending PCM approaches to diverse protein families with low sequence similarity.</li>
<li><strong>Tree-based methods (RF, GBM) outperformed DNNs</strong> on these tasks, though the authors note further DNN tuning could help.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The compound vector is a simple sum of substructure vectors, which discards information about substructure arrangement and molecular topology.</li>
<li>Only Morgan identifiers at radii 0 and 1 were used. Larger radii might capture more context but would increase vocabulary size.</li>
<li>DNN architectures were not extensively optimized, leaving open the question of how well Mol2vec pairs with deep learning.</li>
<li>The approach was benchmarked against Morgan FP but not against other learned representations such as graph neural networks in a controlled comparison.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC v15 + ChEMBL v23</td>
          <td>19.9M compounds</td>
          <td>Filtered by MW, atom count, clogP, element types</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,144 compounds</td>
          <td>Aqueous solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Ames</td>
          <td>6,511 compounds</td>
          <td>Mutagenicity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>8,192 compounds</td>
          <td>12 toxicity targets, retrieved via DeepChem</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Kinase (ChEMBL v23)</td>
          <td>284 kinases</td>
          <td>IC50/Kd/Ki binding assays</td>
      </tr>
      <tr>
          <td>Protein corpus</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a></td>
          <td>554,241 sequences</td>
          <td>For ProtVec training</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Word2vec</strong>: Skip-gram, window size 10, 300-dimensional embeddings, min count 3</li>
<li><strong>Morgan algorithm</strong>: Radii 0 and 1 (119 and 19,831 unique identifiers, respectively)</li>
<li><strong>UNSEEN token</strong>: Replaces identifiers occurring fewer than 3 times</li>
<li><strong>Compound vector</strong>: Sum of all substructure vectors</li>
</ul>
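<p>The compound-vector readout is just a sum with an UNSEEN fallback. A toy sketch: the real pipeline derives substructure identifiers from RDKit's Morgan algorithm and looks them up in 300-dimensional gensim Word2vec embeddings, whereas the identifiers and 3-dimensional vectors below are made up:</p>

```python
def compound_vector(substructure_ids, embeddings):
    """Mol2vec readout: a compound's vector is the sum of its substructure
    vectors; identifiers absent from the vocabulary (seen < 3 times during
    training) map to the shared UNSEEN vector."""
    dim = len(embeddings["UNSEEN"])
    vec = [0.0] * dim
    for ident in substructure_ids:
        emb = embeddings.get(ident, embeddings["UNSEEN"])
        vec = [v + e for v, e in zip(vec, emb)]
    return vec

# Hypothetical identifiers and tiny embeddings, for illustration only.
emb = {"UNSEEN": [0.0, 0.0, 1.0], "2245384272": [1.0, 0.0, 0.0]}
compound_vector(["2245384272", "rare_id"], emb)  # -> [1.0, 0.0, 1.0]
```

<p>This sum is what makes the representation order-invariant, and also what discards topology, as noted under Limitations.</p>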
<h3 id="models">Models</h3>
<ul>
<li><strong>RF</strong>: scikit-learn, 500 estimators, sqrt features, balanced class weights</li>
<li><strong>GBM</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>DNN</strong>: Keras/TensorFlow, 4 layers x 2000 neurons (Mol2vec) or 1 layer x 512 neurons (Morgan FP), ReLU activation, dropout 0.1</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Mol2vec Best</th>
          <th>Morgan FP Best</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2_{\text{ext}}$</td>
          <td>0.86 (GBM)</td>
          <td>0.66 (GBM)</td>
          <td>ESOL regression</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.87 (RF)</td>
          <td>0.88 (RF)</td>
          <td>Ames classification</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.83 (RF)</td>
          <td>0.83 (RF)</td>
          <td>Tox21 classification</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/samoturk/mol2vec">mol2vec</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Python package with pre-trained model</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jaeger, S., Fulle, S., &amp; Turk, S. (2018). Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. <em>Journal of Chemical Information and Modeling</em>, 58(1), 27-35. <a href="https://doi.org/10.1021/acs.jcim.7b00616">https://doi.org/10.1021/acs.jcim.7b00616</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jaeger2018mol2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jaeger, Sabrina and Fulle, Simone and Turk, Samo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27--35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.7b00616}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MG-BERT: Graph BERT for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/mg-bert-molecular-graph-bert/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/mg-bert-molecular-graph-bert/</guid><description>MG-BERT integrates graph neural network message passing into BERT with masked atom pretraining on 1.7M molecules for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-aware-bert-for-molecular-property-prediction">A Graph-Aware BERT for Molecular Property Prediction</h2>
<p>MG-BERT is a <strong>Method</strong> paper that adapts the BERT pretraining paradigm from NLP to molecular graphs. The primary contribution is a modified Transformer architecture that replaces global self-attention with bond-based local attention, allowing atoms to exchange information only through chemical bonds. This creates a deep message-passing network that avoids the oversmoothing problem of conventional graph neural networks (GNNs). Combined with a masked atom prediction pretraining strategy on 1.7 million unlabeled molecules from ChEMBL, MG-BERT learns context-sensitive atomic representations that transfer effectively to downstream property prediction tasks.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p><a href="/notes/chemistry/molecular-design/property-prediction/">Molecular property prediction</a> is central to drug discovery, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoints. While deep learning has advanced many domains, molecular property prediction faces a persistent challenge: labeled data scarcity. ADMET measurements require expensive, time-consuming experiments, and typical datasets contain only hundreds to thousands of examples.</p>
<p>Prior approaches fall into three categories, each with limitations:</p>
<ol>
<li><strong>Feature engineering</strong> (molecular fingerprints, descriptors): Requires expert design, suffers from low scalability, and fixed representations cannot be optimized for specific tasks.</li>
<li><strong>SMILES-based deep learning</strong> (CNNs, LSTMs, Transformers on SMILES strings): Must learn to parse molecular information from complex string syntax, increasing learning difficulty. Autoencoder-based methods (e.g., <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a>) learn fixed representations that cannot be fine-tuned.</li>
<li><strong>Graph neural networks</strong> (GAT, GCN): Can learn directly from molecular topology, but are limited to 2-3 layers due to oversmoothing, restricting their capacity to capture deep-level patterns.</li>
</ol>
<p>The BERT model from NLP demonstrated that self-supervised pretraining on large unlabeled corpora followed by fine-tuning on small labeled datasets can substantially improve downstream performance. <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> applied this idea to SMILES strings directly, but suffered from interpretability issues due to auxiliary characters in the SMILES syntax. MG-BERT addresses these limitations by operating directly on molecular graphs.</p>
<h2 id="bond-based-local-attention-and-masked-atom-pretraining">Bond-Based Local Attention and Masked Atom Pretraining</h2>
<p>The core innovation of MG-BERT has two components: a modified Transformer architecture for molecular graphs and a self-supervised pretraining strategy.</p>
<h3 id="architecture-modifications">Architecture Modifications</h3>
<p>The original BERT model uses three components: an embedding layer, Transformer encoder layers, and a task-specific output layer. MG-BERT makes three key modifications:</p>
<ol>
<li>
<p><strong>Atom embeddings replace word embeddings.</strong> The dictionary contains 16 tokens: 13 common atom types ([H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se]), plus [UNK] for rare atoms, [MASK] for pretraining, and [GLOBAL] for graph-level readout.</p>
</li>
<li>
<p><strong>No positional encoding.</strong> Unlike sequential text, atoms in a molecular graph have no inherent ordering, so positional embeddings are removed.</p>
</li>
<li>
<p><strong>Local attention replaces global attention.</strong> The adjacency matrix of the molecular graph is used as a visibility matrix to modulate the attention scores. Each atom can only attend to atoms connected by chemical bonds. Formally, the attention is constrained so that:</p>
</li>
</ol>
<p>$$A'_{ij} = \begin{cases} A_{ij} &amp; \text{if a bond exists between } i \text{ and } j \\ -\infty &amp; \text{otherwise} \end{cases}$$</p>
<p>where $A_{ij}$ is the standard scaled dot-product attention score. This local message passing makes MG-BERT a GNN variant, but one that can stack many more layers (six in the medium configuration) without oversmoothing, thanks to the residual connections inherited from the Transformer architecture.</p>
<ol start="4">
<li><strong>Supernode for graph-level readout.</strong> A [GLOBAL] supernode is added to each molecular graph, connected to all atoms. This node aggregates information from the entire molecule and serves as the molecular representation for downstream prediction.</li>
</ol>
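<p>The visibility-matrix mechanism can be sketched in a few lines of NumPy. This single head uses identity Q/K/V projections and skips the multi-head and feed-forward machinery, so it illustrates only the bond-based masking and the supernode convention:</p>

```python
import numpy as np

def local_attention(x, adj):
    """One bond-local self-attention head (MG-BERT-style sketch).

    x:   (n, d) atom features; adj: (n, n) 0/1 adjacency matrix.
    Row 0 is assumed to be the [GLOBAL] supernode, bonded to every atom.
    Returns the updated features and the attention weights."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # scaled dot products
    visible = adj + np.eye(n)                        # bonded neighbors + self
    scores = np.where(visible > 0, scores, -np.inf)  # mask non-bonded pairs
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)             # row-wise softmax
    return w @ x, w
```

<p>Because masked entries receive $-\infty$ before the softmax, their weights are exactly zero, so information flows only along chemical bonds (and through the supernode).</p>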
<h3 id="masked-atom-prediction">Masked Atom Prediction</h3>
<p>The pretraining strategy mirrors BERT&rsquo;s masked language model but operates on atoms:</p>
<ul>
<li>15% of atoms in each molecule are randomly selected (at least one atom per molecule)</li>
<li>Of selected atoms: 80% are replaced with [MASK], 10% are randomly replaced with another atom type, and 10% remain unchanged</li>
<li>The model is trained to predict the original atom type at masked positions</li>
<li>Loss is computed only at masked positions</li>
</ul>
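<p>The corruption step mirrors BERT's recipe closely and can be sketched as follows; <code>mask_atoms</code> is our name and the vocabulary below is truncated for brevity:</p>

```python
import random

def mask_atoms(atoms, rng, mask_rate=0.15):
    """Masked-atom corruption: select ~15% of positions (at least one),
    then apply the 80/10/10 [MASK]/random/keep rule. Returns the
    corrupted sequence and the positions where loss is computed."""
    vocab = ["[H]", "[C]", "[N]", "[O]", "[F]", "[S]", "[Cl]"]  # truncated
    n_pick = max(1, round(mask_rate * len(atoms)))
    picked = rng.sample(range(len(atoms)), n_pick)
    corrupted = list(atoms)
    for i in picked:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # random type (may equal original)
        # else: keep the original atom, but still predict it
    return corrupted, set(picked)
```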
<h3 id="model-configurations">Model Configurations</h3>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MG-BERT Small</td>
          <td>3</td>
          <td>2</td>
          <td>128</td>
          <td>256</td>
          <td>95.27%</td>
      </tr>
      <tr>
          <td>MG-BERT Medium</td>
          <td>6</td>
          <td>4</td>
          <td>256</td>
          <td>512</td>
          <td>98.31%</td>
      </tr>
      <tr>
          <td>MG-BERT Large</td>
          <td>12</td>
          <td>8</td>
          <td>576</td>
          <td>1152</td>
          <td>98.35%</td>
      </tr>
  </tbody>
</table>
<p>The medium configuration was selected for all experiments because it achieved the best downstream performance, despite the large model having slightly higher pretraining recovery accuracy. The authors attribute this to overfitting risk with the larger model.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MG-BERT was pretrained on 1.7 million compounds randomly selected from ChEMBL, with 10% held out for evaluation (1.53M training molecules). Molecules were converted to 2D undirected graphs using RDKit, with hydrogen atoms explicitly included. The model was pretrained for 10 epochs using Adam with learning rate 1e-4 and batch size 256.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<p>Sixteen datasets covering ADMET endpoints and common molecular properties were collected from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Dataset</th>
          <th>Category</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Regression</td>
          <td>Caco2</td>
          <td>Absorption</td>
          <td>979</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logD</td>
          <td>Physicochemical</td>
          <td>10,354</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logS</td>
          <td>Physicochemical</td>
          <td>5,045</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PPB</td>
          <td>Distribution</td>
          <td>1,480</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>tox</td>
          <td>Toxicity</td>
          <td>7,295</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>Physicochemical</td>
          <td>1,128</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>Physicochemical</td>
          <td>642</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipo</td>
          <td>Physicochemical</td>
          <td>4,200</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Ames</td>
          <td>Toxicity</td>
          <td>6,719</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBB</td>
          <td>Distribution</td>
          <td>1,855</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>FDAMDD</td>
          <td>Toxicity</td>
          <td>795</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>H_HT</td>
          <td>Toxicity</td>
          <td>2,170</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_inh</td>
          <td>Absorption</td>
          <td>2,125</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_sub</td>
          <td>Absorption</td>
          <td>1,210</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>Biophysics</td>
          <td>1,513</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
      </tr>
  </tbody>
</table>
<p>Datasets were split 8:1:1 (train:validation:test) with stratified sampling by SMILES length. Each experiment was repeated 10 times with random splits, reporting mean and standard deviation. Regression was evaluated by R-squared, classification by ROC-AUC. Early stopping with a maximum of 100 epochs was used.</p>
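<p>One plausible reading of the stratified 8:1:1 split (the authors' exact stratification code is not specified): sort by SMILES length and deal each consecutive block of ten molecules as eight train, one validation, one test:</p>

```python
import random

def stratified_split(smiles, seed=0):
    """8:1:1 train/valid/test split, stratified by SMILES length.
    A sketch of the paper's protocol, not the authors' code."""
    rng = random.Random(seed)
    order = sorted(smiles, key=len)           # group similar lengths together
    train, valid, test = [], [], []
    for i in range(0, len(order), 10):
        block = order[i:i + 10]
        rng.shuffle(block)                    # randomize within each length block
        train += block[:8]
        valid += block[8:9]
        test += block[9:10]
    return train, valid, test
```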
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ol>
<li><strong>ECFP4-XGBoost</strong>: Extended connectivity fingerprints (diameter 4) with gradient-boosted trees</li>
<li><strong>GAT</strong>: Graph Attention Network</li>
<li><strong>GCN</strong>: Graph Convolutional Network</li>
<li><strong>CDDD</strong>: Continuous and Data-Driven Descriptors (pretrained RNN encoder on SMILES with a fully connected network)</li>
<li><strong>SMILES-BERT</strong>: Original BERT applied directly to SMILES strings</li>
</ol>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Two ablation studies were conducted:</p>
<ol>
<li><strong>Pretraining effectiveness</strong>: Comparing pretrained vs. non-pretrained MG-BERT under identical hyperparameters</li>
<li><strong>Hydrogen atoms</strong>: Comparing MG-BERT with and without explicit hydrogen atoms in the molecular graph</li>
</ol>
<h2 id="consistent-improvements-across-admet-benchmarks">Consistent Improvements Across ADMET Benchmarks</h2>
<h3 id="main-results">Main Results</h3>
<p>MG-BERT consistently outperformed all baselines across all 16 datasets. Key results on the 11 ADMET datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ECFP4-XGBoost</th>
          <th>GAT</th>
          <th>GCN</th>
          <th>CDDD</th>
          <th>SMILES-BERT</th>
          <th>MG-BERT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caco2 (R2)</td>
          <td>61.41</td>
          <td>69.16</td>
          <td>67.15</td>
          <td>73.42</td>
          <td>72.39</td>
          <td><strong>74.68</strong></td>
      </tr>
      <tr>
          <td>logD (R2)</td>
          <td>70.84</td>
          <td>84.62</td>
          <td>86.22</td>
          <td>85.85</td>
          <td>86.31</td>
          <td><strong>87.46</strong></td>
      </tr>
      <tr>
          <td>logS (R2)</td>
          <td>73.73</td>
          <td>84.06</td>
          <td>83.47</td>
          <td>84.01</td>
          <td>85.20</td>
          <td><strong>87.66</strong></td>
      </tr>
      <tr>
          <td>PPB (R2)</td>
          <td>55.11</td>
          <td>59.96</td>
          <td>57.34</td>
          <td>54.12</td>
          <td>62.37</td>
          <td><strong>65.94</strong></td>
      </tr>
      <tr>
          <td>Ames (AUC)</td>
          <td>87.21</td>
          <td>86.38</td>
          <td>87.04</td>
          <td>86.82</td>
          <td>87.69</td>
          <td><strong>89.33</strong></td>
      </tr>
      <tr>
          <td>BBB (AUC)</td>
          <td>94.62</td>
          <td>93.03</td>
          <td>92.67</td>
          <td>94.44</td>
          <td>94.02</td>
          <td><strong>95.41</strong></td>
      </tr>
      <tr>
          <td>BBBP (AUC)</td>
          <td>89.16</td>
          <td>90.33</td>
          <td>90.74</td>
          <td>91.12</td>
          <td>91.32</td>
          <td><strong>92.08</strong></td>
      </tr>
  </tbody>
</table>
<p>The overall improvement across all datasets was 28.1% (7.02% on classification, 21.28% on regression). Improvements were statistically significant at the 95% confidence level (paired t-test, P &lt;= 0.001).</p>
<h3 id="pretraining-ablation">Pretraining Ablation</h3>
<p>Pretraining improved performance by more than 2% on all datasets. The benefit was largest for small datasets: Caco2 improved by approximately 10 percentage points (64.79 to 74.68 R2), and FDAMDD improved by about 7.5 points (80.76 to 88.23 AUC). This confirms that self-supervised pretraining effectively addresses the labeled data scarcity problem.</p>
<h3 id="hydrogen-atom-ablation">Hydrogen Atom Ablation</h3>
<p>Including explicit hydrogen atoms improved pretraining recovery accuracy from 92.25% to 98.31% and consistently improved downstream performance. The authors provide an intuitive explanation: hydrogen atoms help determine bond counts for neighboring atoms, which is critical for the masked atom recovery task. They also show that removing hydrogens can make structurally distinct molecules (e.g., benzene and cyclohexane) indistinguishable at the graph level.</p>
<h3 id="interpretability-via-attention-visualization">Interpretability via Attention Visualization</h3>
<p>The authors provide two forms of interpretability analysis:</p>
<ol>
<li>
<p><strong>t-SNE visualization of atomic representations</strong>: Pretrained atomic representations cluster by atom type and, more specifically, by local chemical environment (e.g., aromatic carbons separate from aliphatic carbons, C-N bonds from C-O bonds). This demonstrates that pretraining captures neighborhood context beyond simple atom identity.</p>
</li>
<li>
<p><strong>Attention weight visualization</strong>: On the logD task, the supernode&rsquo;s attention focuses on polar groups (which govern lipophilicity). On the Ames mutagenicity task, attention concentrates on known mutagenic structural alerts (acylchloride, nitrosamide, azide groups). This provides chemically meaningful explanations for predictions.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The paper does not extensively discuss limitations, but several can be identified:</p>
<ul>
<li>The model uses only 2D molecular topology (atom types and bonds) without 3D conformational information or bond-type features</li>
<li>The atom dictionary is limited to 13 common types plus [UNK], which may lose information for molecules containing rarer elements</li>
<li>Evaluation is limited to ADMET-focused datasets; broader chemical spaces (e.g., materials, catalysts) are not tested</li>
<li>The comparison baselines do not include other graph-based pretraining methods (e.g., the contemporaneous Strategies for Pre-training Graph Neural Networks by Hu et al.)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL (random subset)</td>
          <td>1.7M molecules (1.53M train)</td>
          <td>10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>16 datasets (642-10,354 molecules)</td>
          <td>8:1:1 splits, stratified by SMILES length</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam (pretraining: lr=1e-4, batch=256; fine-tuning: lr from {1e-5, 5e-5, 1e-4}, batch from {16, 32, 64})</li>
<li><strong>Pretraining epochs</strong>: 10</li>
<li><strong>Fine-tuning</strong>: Up to 100 epochs with early stopping</li>
<li><strong>Dropout</strong>: Optimized per task in range [0.0, 0.5]</li>
<li><strong>Masking</strong>: 15% of atoms (80% [MASK], 10% random, 10% unchanged)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: MG-BERT Medium (6 layers, 4 heads, embedding size 256, FFN size 512)</li>
<li><strong>Molecule processing</strong>: RDKit for graph conversion with explicit hydrogens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>R-squared (R2)</td>
          <td>Regression</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Accuracy, RMSE</td>
          <td>Both</td>
          <td>Reported in supplementary Table S1</td>
      </tr>
  </tbody>
</table>
<p>All results averaged over 10 random splits with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements (GPU type, training time, or memory usage).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/Molecular-graph-BERT">Molecular-graph-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation; last code push August 2021</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yang, Z.-J., Wu, Z.-X., Yi, J.-C., Hsieh, C.-Y., Hou, T.-J., &amp; Cao, D.-S. (2021). MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. <em>Briefings in Bioinformatics</em>, 22(6), bbab152. <a href="https://doi.org/10.1093/bib/bbab152">https://doi.org/10.1093/bib/bbab152</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2021mgbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{MG-BERT}: leveraging unsupervised atomic representation learning for molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yang, Zhi-Jiang and Wu, Zhen-Xing and Yi, Jia-Cai and Hsieh, Chang-Yu and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbab152}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbab152}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MAT: Graph-Augmented Transformer for Molecules (2020)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/molecule-attention-transformer/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/molecule-attention-transformer/</guid><description>MAT augments the Transformer self-attention mechanism with inter-atomic distances and molecular graph adjacency for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-augmented-transformer-for-molecular-property-prediction">A Graph-Augmented Transformer for Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that proposes the Molecule Attention Transformer (MAT), a Transformer-based architecture adapted for molecular property prediction. The primary contribution is a modified self-attention mechanism that incorporates inter-atomic distances and molecular graph structure alongside the standard query-key attention. Combined with self-supervised pretraining on 2 million molecules from ZINC15, MAT achieves competitive performance across seven diverse molecular property prediction tasks while requiring minimal hyperparameter tuning.</p>
<h2 id="challenges-in-deep-learning-for-molecular-properties">Challenges in Deep Learning for Molecular Properties</h2>
<p>Predicting molecular properties is central to drug discovery and materials design, yet deep neural networks have struggled to consistently outperform shallow methods like random forests and SVMs on these tasks. Wu et al. (2018) demonstrated through the MoleculeNet benchmark that graph neural networks do not reliably beat classical models. Two recurring problems compound this:</p>
<ol>
<li><strong>Underfitting</strong>: Graph neural networks tend to underfit training data, with performance failing to scale with model complexity (Ishiguro et al., 2019).</li>
<li><strong>Hyperparameter sensitivity</strong>: Deep models for molecule property prediction require extensive hyperparameter search (often 500+ configurations) to achieve competitive results, making them impractical for many practitioners.</li>
</ol>
<p>Concurrent work explored using vanilla Transformers on SMILES string representations of molecules (Honda et al., 2019; Wang et al., 2019), but these approaches discard the explicit structural information encoded in molecular graphs and 3D conformations. The motivation for MAT is to combine the flexibility of the Transformer architecture with domain-specific inductive biases from molecular structure.</p>
<h2 id="molecule-self-attention-combining-attention-distance-and-graph-structure">Molecule Self-Attention: Combining Attention, Distance, and Graph Structure</h2>
<p>The core innovation is the Molecule Self-Attention layer, which replaces standard Transformer self-attention. In a standard Transformer, head $i$ computes:</p>
<p>$$
\mathcal{A}^{(i)} = \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) \mathbf{V}_{i}
$$</p>
<p>MAT augments this with two additional information sources. Let $\mathbf{A} \in \{0, 1\}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the molecular graph adjacency matrix and $\mathbf{D} \in \mathbb{R}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the inter-atomic distance matrix. The modified attention becomes:</p>
<p>$$
\mathcal{A}^{(i)} = \left(\lambda_{a}\, \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) + \lambda_{d}\, g(\mathbf{D}) + \lambda_{g}\, \mathbf{A}\right) \mathbf{V}_{i}
$$</p>
<p>where $\lambda_{a}$, $\lambda_{d}$, and $\lambda_{g}$ are scalar hyperparameters weighting each component, and $g$ is either a row-wise softmax or an element-wise exponential decay $g(d) = \exp(-d)$.</p>
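<p>The combined attention can be sketched in plain Python for a single head. This is a toy illustration under the formula above, not the authors' implementation; the function and variable names are my own:</p>

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def molecule_self_attention(Q, K, V, D, A, lam_a, lam_d, lam_g, d_k):
    """Single-head MAT-style attention on an N-atom molecule (sketch).

    Q, K, V: N x d lists of per-atom vectors; D: inter-atomic distance
    matrix; A: 0/1 adjacency matrix. Combines softmax(QK^T / sqrt(d_k)),
    the distance kernel g(d) = exp(-d), and the adjacency matrix, each
    weighted by its scalar lambda, then applies the result to V.
    """
    n = len(Q)
    # standard scaled dot-product attention scores with row-wise softmax
    attn = [softmax([sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
                     for j in range(n)]) for i in range(n)]
    # weighted combination of the three information sources
    W = [[lam_a * attn[i][j] + lam_d * math.exp(-D[i][j]) + lam_g * A[i][j]
          for j in range(n)] for i in range(n)]
    # weighted sum over value vectors
    return [[sum(W[i][j] * V[j][k] for j in range(n))
             for k in range(len(V[0]))] for i in range(n)]
```

<p>Setting $\lambda_g = 1$ and the other weights to zero, for example, reduces each atom's output to a sum over its graph neighbors' value vectors, which makes the role of each term easy to inspect.</p>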
<p>Key architectural details:</p>
<ul>
<li><strong>Atom embedding</strong>: Each atom is represented as a 26-dimensional vector encoding atomic identity (one-hot over B, N, C, O, F, P, S, Cl, Br, I, dummy, other), number of heavy neighbors, number of hydrogens, formal charge, ring membership, and aromaticity.</li>
<li><strong>Dummy node</strong>: An artificial disconnected node (distance $10^{6}$ from all atoms) is added to each molecule, allowing the model to &ldquo;skip&rdquo; attention heads when no relevant pattern exists, similar to how BERT uses the separation token.</li>
<li><strong>3D conformers</strong>: Distance matrices are computed from RDKit-generated 3D conformers using the Universal Force Field (UFF).</li>
<li><strong>Pretraining</strong>: Node-level masked atom prediction on 2 million ZINC15 molecules (following Hu et al., 2019), where 15% of atom features are masked and the model predicts them.</li>
</ul>
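<p>The atom embedding described above can be sketched as a small feature builder. The 12 + 6 + 5 + 1 + 1 + 1 = 26 slot layout follows the description here, but the exact ordering within the vector is an assumption (the paper's Table 1 fixes the true layout):</p>

```python
ATOM_TYPES = ["B", "N", "C", "O", "F", "P", "S", "Cl", "Br", "I", "dummy", "other"]

def atom_features(symbol, n_heavy, n_hydrogens, formal_charge, in_ring, aromatic):
    """Build a 26-dim atom feature vector (sketch; slot ordering assumed)."""
    v = [0.0] * 26
    idx = ATOM_TYPES.index(symbol) if symbol in ATOM_TYPES else ATOM_TYPES.index("other")
    v[idx] = 1.0                       # one-hot atom identity (12 dims)
    v[12 + min(n_heavy, 5)] = 1.0      # one-hot heavy-neighbor count (6 dims)
    v[18 + min(n_hydrogens, 4)] = 1.0  # one-hot hydrogen count (5 dims)
    v[23] = float(formal_charge)       # formal charge (1 dim)
    v[24] = 1.0 if in_ring else 0.0    # ring membership (1 dim)
    v[25] = 1.0 if aromatic else 0.0   # aromaticity (1 dim)
    return v
```

<p>In practice the per-atom attributes would come from a toolkit such as RDKit rather than being passed by hand.</p>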
<h2 id="benchmark-evaluation-and-ablation-studies">Benchmark Evaluation and Ablation Studies</h2>
<h3 id="experimental-setup">Experimental setup</h3>
<p>MAT is evaluated on seven molecular property prediction datasets spanning regression and classification:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Metric</th>
          <th>Split</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FreeSolv</td>
          <td>Regression (hydration free energy)</td>
          <td>642</td>
          <td>RMSE</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Regression (log solubility)</td>
          <td>1,128</td>
          <td>RMSE</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Classification (BBB permeability)</td>
          <td>2,039</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>Estrogen-alpha</td>
          <td>Classification (receptor activity)</td>
          <td>2,398</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>Estrogen-beta</td>
          <td>Classification (receptor activity)</td>
          <td>1,961</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>MetStab-high</td>
          <td>Classification (metabolic stability)</td>
          <td>2,127</td>
          <td>ROC AUC</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>MetStab-low</td>
          <td>Classification (metabolic stability)</td>
          <td>2,127</td>
          <td>ROC AUC</td>
          <td>Random</td>
      </tr>
  </tbody>
</table>
<p>Baselines include GCN, Weave, EAGCN, Random Forest (RF), and SVM. Each model receives the same hyperparameter search budget (150 or 500 evaluations). Results are averaged over 6 random train/validation/test splits.</p>
<h3 id="main-results">Main results</h3>
<p>MAT achieves the best average rank across all seven tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Avg. Rank (500 budget)</th>
          <th>Avg. Rank (150 budget)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAT</td>
          <td>2.42</td>
          <td>2.71</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>3.14</td>
          <td>3.14</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>3.57</td>
          <td>3.28</td>
      </tr>
      <tr>
          <td>GCN</td>
          <td>3.57</td>
          <td>3.71</td>
      </tr>
      <tr>
          <td>Weave</td>
          <td>3.71</td>
          <td>3.57</td>
      </tr>
      <tr>
          <td>EAGCN</td>
          <td>4.14</td>
          <td>4.14</td>
      </tr>
  </tbody>
</table>
<p>With self-supervised pretraining, Pretrained MAT achieves an average rank of 1.57, outperforming both Pretrained EAGCN (4.0) and SMILES Transformer (4.29). Pretrained MAT requires tuning only the learning rate (7 values tested), compared to 500 hyperparameter combinations for the non-pretrained models.</p>
<h3 id="ablation-results">Ablation results</h3>
<p>Ablation studies on BBBP, ESOL, and FreeSolv reveal:</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>BBBP (AUC)</th>
          <th>ESOL (RMSE)</th>
          <th>FreeSolv (RMSE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAT (full)</td>
          <td>.723</td>
          <td>.286</td>
          <td>.250</td>
      </tr>
      <tr>
          <td>- Graph</td>
          <td>.716</td>
          <td>.316</td>
          <td>.276</td>
      </tr>
      <tr>
          <td>- Distance</td>
          <td>.729</td>
          <td>.281</td>
          <td>.281</td>
      </tr>
      <tr>
          <td>- Attention</td>
          <td>.692</td>
          <td>.306</td>
          <td>.329</td>
      </tr>
      <tr>
          <td>- Dummy node</td>
          <td>.714</td>
          <td>.317</td>
          <td>.249</td>
      </tr>
      <tr>
          <td>+ Edge features</td>
          <td>.683</td>
          <td>.314</td>
          <td>.358</td>
      </tr>
  </tbody>
</table>
<p>Removing any single component degrades performance on at least one task, supporting the value of combining all three information sources. Adding edge features does not help, suggesting the adjacency and distance matrices already capture sufficient bond-level information.</p>
<h3 id="interpretability-analysis">Interpretability analysis</h3>
<p>Individual attention heads in the first layer learn chemically meaningful functions. Six heads were identified that focus on specific chemical patterns: 2-neighbored aromatic carbons, sulfur atoms, non-ring nitrogens, carbonyl oxygens, 3-neighbored aromatic atoms (substitution positions), and aromatic ring nitrogens. Statistical validation using Kruskal-Wallis tests confirmed that atoms matching these SMARTS patterns receive significantly higher attention weights ($p &lt; 0.001$ for all patterns).</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>MAT demonstrates that augmenting Transformer self-attention with molecular graph structure and 3D distance information produces a model that performs consistently well across diverse property prediction tasks. The key practical finding is that self-supervised pretraining dramatically reduces the hyperparameter tuning burden: Pretrained MAT matches or exceeds the performance of extensively tuned models while requiring only learning rate selection.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Fingerprint-based models still win on some tasks</strong>: RF and SVM with extended-connectivity fingerprints outperform MAT on metabolic stability and Estrogen-beta tasks, suggesting that incorporating fingerprint representations could improve MAT further.</li>
<li><strong>Single conformer</strong>: Only one pre-computed 3D conformer is used per molecule. More sophisticated conformer sampling or ensemble strategies were not explored.</li>
<li><strong>Limited pretraining exploration</strong>: Only the masked atom prediction task from Hu et al. (2019) was used. The authors note that exploring additional pretraining objectives is a promising direction.</li>
<li><strong>Scalability</strong>: The pretrained model uses 1024-dimensional embeddings with 8 layers and 16 attention heads, chosen as the largest configuration that fits in GPU memory.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15</td>
          <td>2M molecules</td>
          <td>Sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Log solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Blood-brain barrier classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Estrogen-alpha/beta</td>
          <td>2,398 / 1,961</td>
          <td>Receptor activity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MetStab-high/low</td>
          <td>2,127 each</td>
          <td>Metabolic stability classification</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with Noam learning rate scheduler (warmup then inverse square root decay)</li>
<li>Pretraining: 8 epochs, learning rate 0.001, batch size 256, binary cross-entropy loss</li>
<li>Fine-tuning: 100 epochs, batch size 32, learning rate selected from {1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6}</li>
<li>Distance kernel: exponential decay $g(d) = \exp(-d)$ for pretrained model</li>
<li>Lambda weights: $\lambda_{a} = \lambda_{d} = 0.33$ for pretrained model</li>
</ul>
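<p>The Noam scheduler mentioned above can be sketched with the canonical formula from Vaswani et al. (2017): linear warmup followed by inverse-square-root decay. The <code>warmup</code> and <code>factor</code> defaults below are assumptions, not values reported by the paper:</p>

```python
def noam_lr(step, d_model=1024, warmup=8000, factor=1.0):
    """Noam schedule: lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).

    Rises linearly for the first `warmup` steps, peaks at step == warmup,
    then decays proportionally to the inverse square root of the step.
    """
    step = max(step, 1)  # avoid 0^-0.5 at the first update
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```
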
<h3 id="models">Models</h3>
<ul>
<li>Pretrained MAT: 1024-dim embeddings, 8 layers, 16 attention heads, 1 feed-forward layer per block</li>
<li>Dropout: 0.0, weight decay: 0.0 for pretrained model</li>
<li>Atom featurization: 26-dimensional one-hot encoding (Table 1 in paper)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: RMSE (FreeSolv, ESOL)</li>
<li>Classification: ROC AUC (BBBP, Estrogen-alpha/beta, MetStab-high/low)</li>
<li>All experiments repeated 6 times with different train/validation/test splits</li>
<li>Scaffold split for BBBP, Estrogen, random split for others</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify exact hardware details. The pretrained model is described as &ldquo;the largest model that still fits the GPU memory.&rdquo;</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gmum/MAT">gmum/MAT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Maziarka, Ł., Danel, T., Mucha, S., Rataj, K., Tabor, J., &amp; Jastrzębski, S. (2020). Molecule Attention Transformer. <em>arXiv preprint arXiv:2002.08264</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{maziarka2020molecule,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecule Attention Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Maziarka, {\L}ukasz and Danel, Tomasz and Mucha, S{\l}awomir and Rataj, Krzysztof and Tabor, Jacek and Jastrz{\k{e}}bski, Stanis{\l}aw}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2002.08264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DMP: Dual-View Molecule Pre-training (SMILES+GNN)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/</guid><description>DMP pre-trains molecular encoders using both SMILES Transformer and GNN branches with a BYOL-style dual-view consistency loss for property prediction.</description><content:encoded><![CDATA[<h2 id="a-dual-branch-pre-training-method-for-molecular-property-prediction">A Dual-Branch Pre-training Method for Molecular Property Prediction</h2>
<p>DMP (Dual-view Molecule Pre-training) is a <strong>Method</strong> paper that introduces a pre-training framework combining two complementary molecular encoders: a Transformer operating on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings and a Graph Neural Network (GNN) operating on molecular graphs. The two branches are trained jointly with masked language modeling (MLM) objectives plus a BYOL-style dual-view consistency loss. After pre-training on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> molecules, either branch (or both) can be fine-tuned for downstream tasks. The authors recommend the Transformer branch based on empirical results. DMP achieves the best reported performance on 7 of 9 <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks and 3 retrosynthesis benchmarks (at the time of the 2021 arXiv version).</p>
<h2 id="why-combine-smiles-and-graph-views-for-molecules">Why Combine SMILES and Graph Views for Molecules</h2>
<p>Prior molecule pre-training methods used either graph representations with GNNs or SMILES representations with Transformers, but not both. The authors observe that the two views are complementary: Transformers handle molecules with large atom distances (long chains) well, while GNNs handle molecules with many concatenated rings better. Neither model alone captures the full range of molecular structures effectively.</p>
<p>Existing GNN-based pre-training methods (Hu et al. 2020, MolCLR, GROVER) and SMILES-based methods (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>) each have blind spots dictated by their input representation. DMP addresses this by pre-training both views simultaneously and enforcing representation consistency between them, so each branch benefits from the structural knowledge of the other.</p>
<h2 id="dual-view-consistency-with-byol-style-training">Dual-View Consistency with BYOL-Style Training</h2>
<p>The core innovation is the dual-view consistency objective, inspired by Bootstrap Your Own Latent (BYOL). Given a molecule $M$ with SMILES representation $M_s$ and graph representation $M_g$, DMP obtains high-level features from each branch:</p>
<ul>
<li><strong>Transformer branch</strong>: A RoBERTa-base model encodes the SMILES sequence. The [CLS] token output serves as the molecule representation $f_s$.</li>
<li><strong>GNN branch</strong>: A DeeperGCN network encodes the molecular graph. Mean+max pooling over atom representations yields $f_g$.</li>
</ul>
<p>The dual-view consistency loss uses nonlinear projection heads $\psi_g, \psi_s$ and prediction heads $\rho_g, \rho_s$:</p>
<p>$$
p_g = \psi_g(f_g), \quad q_g = \rho_g(p_g); \quad p_s = \psi_s(f_s), \quad q_s = \rho_s(p_s)
$$</p>
<p>The consistency loss maximizes cross-view <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> with stop-gradient (SG) on the target:</p>
<p>$$
\ell_{\text{dual}}(\tilde{M}_g, \tilde{M}_s) = -\cos(q_s, \text{SG}(p_g)) - \cos(q_g, \text{SG}(p_s))
$$</p>
<p>where $\cos(p, q) = \frac{p^\top q}{\|p\|_2 \|q\|_2}$ and $\tilde{M}_g, \tilde{M}_s$ are the masked versions of the inputs. The stop-gradient prevents representation collapse without requiring negative samples or a momentum encoder.</p>
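<p>The loss above can be sketched in a few lines of plain Python. Stop-gradient is only meaningful under autograd (e.g. <code>tensor.detach()</code> in PyTorch), so it appears here as a labeled no-op; all names are illustrative:</p>

```python
import math

def cos_sim(p, q):
    """Cosine similarity between two vectors given as lists."""
    num = sum(a * b for a, b in zip(p, q))
    return num / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def dual_view_loss(q_s, p_g, q_g, p_s):
    """BYOL-style dual-view consistency loss: each branch's prediction is
    pulled toward the other branch's (gradient-stopped) projection."""
    sg = lambda x: x  # stand-in for stop-gradient; detach() under autograd
    return -cos_sim(q_s, sg(p_g)) - cos_sim(q_g, sg(p_s))
```

<p>When both cross-view pairs are perfectly aligned the loss reaches its minimum of $-2$.</p>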
<p>The full training objective combines three losses:</p>
<ol>
<li><strong>MLM on Transformer</strong>: Recover masked tokens in SMILES sequences</li>
<li><strong>MLM on GNN</strong>: Recover masked atoms in molecular graphs</li>
<li><strong>Dual-view consistency</strong>: The BYOL-style loss above</li>
</ol>
<p>Both MLM objectives and the consistency loss are necessary. Ablations show that removing MLM (using only dual-view loss) degrades performance, and using two branches of the same type (two Transformers or two GNNs) is less effective than the heterogeneous Transformer+GNN combination.</p>
<h2 id="experiments-on-moleculenet-and-retrosynthesis">Experiments on MoleculeNet and Retrosynthesis</h2>
<h3 id="pre-training-setup">Pre-training Setup</h3>
<p>DMP is pre-trained on 10M molecules from PubChem (matching prior work). The Transformer branch uses RoBERTa-base (12 layers, hidden dim 768, 87M parameters). The GNN branch uses DeeperGCN (12 layers, hidden dim 384, 7.4M parameters). Combined, DMP has 104.1M parameters. Training runs for 200K iterations on 8 V100 GPUs over 3.8 days with Adam optimizer (lr = 5e-4, weight decay 0.01).</p>
<h3 id="molecular-property-prediction-moleculenet">Molecular Property Prediction (MoleculeNet)</h3>
<p>DMP is evaluated on 6 binary classification tasks (BBBP, Tox21, ClinTox, HIV, BACE, SIDER) using official DeepChem splits, and on 3 additional tasks (BBBP, SIDER, ClinTox classification + ESOL, QM7, QM8 regression) using scaffold splits from GROVER.</p>
<p>Key results on DeepChem splits (ROC-AUC %):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolCLR</th>
          <th>TF (MLM)</th>
          <th>DMP_TF</th>
          <th>DMP_TF+GNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>73.6</td>
          <td>74.9</td>
          <td><strong>78.1</strong></td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>79.8</td>
          <td>77.6</td>
          <td><strong>78.8</strong></td>
          <td>79.1</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>93.2</td>
          <td>92.9</td>
          <td><strong>95.0</strong></td>
          <td>95.6</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>80.2</td>
          <td><strong>81.0</strong></td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>89.0</td>
          <td>88.0</td>
          <td><strong>89.3</strong></td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>68.0</td>
          <td>68.4</td>
          <td><strong>69.2</strong></td>
          <td>69.8</td>
      </tr>
  </tbody>
</table>
<p>On scaffold splits (comparison with GROVER and MPG):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GROVER</th>
          <th>MPG</th>
          <th>DMP_TF</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP (AUC)</td>
          <td>0.940</td>
          <td>0.922</td>
          <td><strong>0.945</strong></td>
      </tr>
      <tr>
          <td>SIDER (AUC)</td>
          <td>0.658</td>
          <td>0.661</td>
          <td><strong>0.695</strong></td>
      </tr>
      <tr>
          <td>ClinTox (AUC)</td>
          <td>0.944</td>
          <td>0.963</td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>ESOL (RMSE)</td>
          <td>0.831</td>
          <td>0.741</td>
          <td><strong>0.700</strong></td>
      </tr>
      <tr>
          <td>QM7 (MAE)</td>
          <td>72.6</td>
          <td>-</td>
          <td><strong>69.6</strong></td>
      </tr>
      <tr>
          <td>QM8 (MAE)</td>
          <td>0.0125</td>
          <td>-</td>
          <td><strong>0.0124</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis">Retrosynthesis</h3>
<p>DMP is tested on USPTO-50K (reaction type known/unknown) and USPTO-full. Using a &ldquo;DMP fusion&rdquo; approach (fusing pre-trained representations into a Transformer encoder-decoder for <a href="/notes/chemistry/molecular-design/reaction-prediction/">retrosynthesis</a>), DMP improves top-1 accuracy by 2-3 points over the baseline Transformer across all settings:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Transformer</th>
          <th>ChemBERTa fusion</th>
          <th>DMP fusion</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-50K (unknown)</td>
          <td>42.3</td>
          <td>43.9</td>
          <td><strong>46.1</strong></td>
      </tr>
      <tr>
          <td>USPTO-50K (known)</td>
          <td>54.2</td>
          <td>56.4</td>
          <td><strong>57.5</strong></td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>42.9</td>
          <td>-</td>
          <td><strong>45.0</strong></td>
      </tr>
  </tbody>
</table>
<p>For GNN-based retrosynthesis, replacing GLN&rsquo;s GNN modules with DMP&rsquo;s pre-trained GNN branch improves top-1 accuracy from 52.5% to 54.2% (unknown type) and from 64.2% to 66.5% (known type).</p>
<h3 id="representation-quality">Representation Quality</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pre-trained representations shows that DMP produces better scaffold-based clustering than either GNN-only or Transformer-only pre-training. The <a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> improves from 3.56 (GNN) and 3.59 (Transformer) to 2.19 (DMP), indicating much tighter within-scaffold clusters.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Combining heterogeneous views (SMILES + graph) during pre-training is more effective than using two branches of the same type. TF(x2) and GNN(x2) variants show smaller gains.</li>
<li>Both MLM and dual-view consistency loss contribute. Removing MLM (dual-view only) hurts performance, especially on BBBP (71.1 vs 78.1 with both losses).</li>
<li>The Transformer branch alone is recommended for downstream tasks, as it achieves strong results without adding GNN parameters at inference time.</li>
<li>Scaling pre-training data from 10M to 100M compounds yields marginal additional improvement.</li>
</ul>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ol>
<li>Training cost is higher than single-branch methods (3.8 days vs 2.5 days for TF-only on 8 V100s), since both branches must be trained jointly.</li>
<li>A fixed branch selection strategy is used at inference time. The authors note that a meta-controller for dynamic branch selection per molecule would be preferable.</li>
<li>The GNN branch uses simple atom masking without bond deletion or subgraph removal, leaving room for stronger graph-level pre-training objectives.</li>
</ol>
<p><strong>Relation to co-training:</strong> The authors clarify that DMP differs from classical <a href="https://en.wikipedia.org/wiki/Co-training">co-training</a> (Blum and Mitchell 1998) in that it does not require conditional independence between views and produces a pre-trained model rather than additional labeled data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>10M compounds</td>
          <td>Same subset as MolCLR and ChemBERTa</td>
      </tr>
      <tr>
          <td>Pre-training (large)</td>
          <td>PubChem subset</td>
          <td>100M compounds</td>
          <td>Additional scale experiment</td>
      </tr>
      <tr>
          <td>Evaluation (classification)</td>
          <td>MoleculeNet (BBBP, Tox21, ClinTox, HIV, BACE, SIDER)</td>
          <td>1.5K-41K molecules</td>
          <td>Official DeepChem splits</td>
      </tr>
      <tr>
          <td>Evaluation (regression)</td>
          <td>MoleculeNet (ESOL, QM7, QM8)</td>
          <td>Varies</td>
          <td>Scaffold splits from GROVER</td>
      </tr>
      <tr>
          <td>Evaluation (retrosynthesis)</td>
          <td>USPTO-50K, USPTO-full</td>
          <td>50K / 950K reactions</td>
          <td>Splits from Dai et al. (2019)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Transformer branch</strong>: RoBERTa-base with MLM. SMILES tokenized using regex from Schwaller et al. (2019).</li>
<li><strong>GNN branch</strong>: DeeperGCN with 12 layers, atom masking for MLM.</li>
<li><strong>Dual-view loss</strong>: BYOL-style with 3-layer MLP projection heads and 2-layer MLP prediction heads, stop-gradient on targets.</li>
<li><strong>Optimizer</strong>: Adam (lr=5e-4, beta1=0.9, beta2=0.98, epsilon=1e-6), weight decay 0.01, 10K warmup steps, linear decay.</li>
</ul>
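<p>The learning-rate schedule above (peak 5e-4, 10K warmup steps, linear decay over 200K total iterations) can be sketched as follows; the assumption that the decay reaches exactly zero at the final step is mine:</p>

```python
def linear_warmup_decay_lr(step, peak_lr=5e-4, warmup=10_000, total=200_000):
    """Linear warmup to peak_lr over `warmup` steps, then linear decay
    to zero at `total` steps (endpoint assumed)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))
```
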
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transformer branch</td>
          <td>RoBERTa-base (12L, 768H, 12 heads)</td>
          <td>87M</td>
      </tr>
      <tr>
          <td>GNN branch</td>
          <td>DeeperGCN (12L, 384H)</td>
          <td>7.4M</td>
      </tr>
      <tr>
          <td>DMP (total)</td>
          <td>Transformer + GNN + projection/prediction heads</td>
          <td>104.1M</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC, averaged over 3 random seeds</li>
<li>Regression: RMSE (ESOL) or MAE (QM7, QM8)</li>
<li>Retrosynthesis: Top-k exact match accuracy (k=1,3,5,10,20,50)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8 NVIDIA V100 GPUs, batch size 12288 tokens, gradient accumulation 16x</li>
<li>Pre-training time: 3.8 days (DMP), 2.5 days (TF-only), 1.7 days (GNN-only)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained model weights were identified for this paper. The paper references GLN&rsquo;s code repository (<a href="https://github.com/Hanjun-Dai/GLN">https://github.com/Hanjun-Dai/GLN</a>) for the retrosynthesis baseline but does not release DMP-specific code.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Hanjun-Dai/GLN">GLN (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Retrosynthesis baseline, not DMP code</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhu, J., Xia, Y., Wu, L., Xie, S., Zhou, W., Qin, T., Li, H., &amp; Liu, T.-Y. (2023). Dual-view Molecular Pre-training. In <em>Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</em> (pp. 3615-3627). <a href="https://doi.org/10.1145/3580305.3599317">https://doi.org/10.1145/3580305.3599317</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2023dualview,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-view Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhu, Jinhua and Xia, Yingce and Wu, Lijun and Xie, Shufang and Zhou, Wengang and Qin, Tao and Li, Houqiang and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3615--3627}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3580305.3599317}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>X-MOL: Pre-training on 1.1B Molecules for SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/</guid><description>X-MOL pre-trains a shared encoder-decoder Transformer on 1.1 billion molecules, then fine-tunes for property prediction, reaction analysis, and generation.</description><content:encoded><![CDATA[<h2 id="a-unified-molecular-pre-training-framework">A Unified Molecular Pre-training Framework</h2>
<p>X-MOL is a <strong>Method</strong> paper that introduces a large-scale pre-training framework for <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based molecular understanding. The primary contribution is a Transformer encoder-decoder model pre-trained on 1.1 billion molecules from <a href="/notes/chemistry/datasets/zinc-22/">ZINC15</a>, which is then fine-tuned across five distinct molecular analysis tasks: molecular property prediction (classification and regression), chemical reaction productivity prediction, <a href="https://en.wikipedia.org/wiki/Drug_interaction">drug-drug interaction</a> (DDI) prediction, de novo molecule generation (distribution learning and goal-directed), and molecule optimization. The paper demonstrates that a single pre-trained model can serve as a universal foundation for diverse downstream chemistry tasks.</p>
<h2 id="bridging-scale-and-understanding-in-molecular-smiles">Bridging Scale and Understanding in Molecular SMILES</h2>
<p>Prior to X-MOL, most molecular analysis tasks were investigated individually with task-specific models. SMILES-based deep learning methods existed but lacked the benefit of large-scale pre-training that had proven transformative in NLP (BERT, RoBERTa, ERNIE, XLNet, <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>). Two challenges motivated this work:</p>
<ol>
<li><strong>SMILES sacrifices structural information for simplicity.</strong> While SMILES is a convenient linear representation, it does not directly encode molecular topology, making it harder for models to learn 3D structure from string input.</li>
<li><strong>Labelled molecular data is scarce.</strong> Most benchmark datasets (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>) contain only thousands of labelled examples, making it difficult to train large models from scratch without overfitting.</li>
</ol>
<p>The authors hypothesized that massive-scale pre-training on unlabelled SMILES could teach a model the grammar rules and implicit structural information in SMILES, providing a strong initialization for multiple downstream tasks.</p>
<h2 id="generative-pre-training-with-random-smiles">Generative Pre-training with Random SMILES</h2>
<p>The core innovation in X-MOL is a <strong>generative pre-training strategy</strong> that exploits the non-uniqueness of SMILES. A single molecule can be represented by many valid SMILES strings (<a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">random SMILES</a>), depending on the starting atom, main chain selection, and ring-opening position. X-MOL trains the model to generate a valid alternative SMILES given an input SMILES of the same molecule, forcing the model to:</p>
<ol>
<li>Reconstruct the molecular structure from the input SMILES</li>
<li>Generate a valid output SMILES following SMILES grammar rules</li>
</ol>
<p>The architecture uses a shared-parameter encoder-decoder based on the Transformer. Unlike standard encoder-decoder models (e.g., for machine translation), X-MOL shares all parameters between encoder and decoder, forcing both encoding and decoding to occur in the same semantic space. The output SMILES is fully masked during training, and only unidirectional attention is permitted within the output sequence.</p>
<p>The self-attention mechanism computes attention for each character $i$ as:</p>
<p>$$
Z_{i} = \text{SoftMax}\left(\frac{Q_{i} \cdot K^{T}}{\sqrt{D}}\right) \cdot V
$$</p>
<p>where $Q_{i}$ is the query vector for character $i$, $K$ and $V$ are the key and value matrices, and $D$ is the feature dimension. The model uses 12 attention heads to capture different relational patterns.</p>
<h3 id="model-architecture">Model Architecture</h3>
<ul>
<li>12 Transformer encoder layers</li>
<li>768-dimensional hidden units</li>
<li>12 attention heads</li>
<li>Character-level SMILES tokenization (108 chemical characters plus 5 special tokens: [PAD], [CLS], [SEP], [MASK], [UNK])</li>
<li>Characters within square brackets and two-digit ring numbers preceded by &ldquo;%&rdquo; are treated as single tokens</li>
</ul>
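<p>The bracket and &ldquo;%&rdquo; tokenization rules above can be sketched with a single regular expression; this is an illustration of the stated rule, not the paper's exact 108+5-token vocabulary:</p>

```python
import re

# Bracketed atoms such as [NH3+] and two-digit ring numbers such as %10
# are kept as single tokens; every other character is its own token.
TOKEN_RE = re.compile(r"\[[^\]]+\]|%\d{2}|.")

def tokenize_smiles(smiles: str) -> list[str]:
    return TOKEN_RE.findall(smiles)

tokens = tokenize_smiles("c1ccccc1[NH3+]%10")
```

<p>Alternation tries the bracket and &ldquo;%&rdquo; patterns before falling through to single characters, which is what makes the exceptions work.</p>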
<h3 id="data-augmentation-in-pre-training">Data Augmentation in Pre-training</h3>
<p>Because a molecule has multiple valid random SMILES, a generated output can be chemically correct yet differ from the predefined target string. To handle this, X-MOL generates multiple training samples per molecule with the same input SMILES but different target random SMILES, and places these samples in the same mini-batch.</p>
<h2 id="experimental-setup-across-five-tasks">Experimental Setup Across Five Tasks</h2>
<p>X-MOL is fine-tuned with task-specific strategies organized into two categories: prediction tasks and generation tasks.</p>
<h3 id="prediction-tasks">Prediction Tasks</h3>
<p>For prediction tasks, the [CLS] token&rsquo;s output representation is passed through a fully connected network to produce predictions. The input format varies by task:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input Format</th>
          <th>Loss Function</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Property prediction (classification)</td>
          <td>Single SMILES</td>
          <td>Cross-entropy</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Property prediction (regression)</td>
          <td>Single SMILES</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Reaction productivity prediction</td>
          <td>Four SMILES (reactant, additive, base, ligand)</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Two SMILES (drug pair)</td>
          <td>Cross-entropy</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecular Property Prediction (Classification):</strong> Four MoleculeNet benchmarks were used: HIV (41,127 compounds), BACE (1,513), <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBBP</a> (2,039), and ClinTox (1,484). Data were randomly split 20 times, and average ROC-AUC is reported.</p>
<p><strong>Molecular Property Prediction (Regression):</strong> Three MoleculeNet benchmarks: ESOL (1,128), FreeSolv (642), and Lipophilicity (4,200). Data augmentation with random SMILES was applied to the training set. Average RMSE over 20 random splits is reported.</p>
<p><strong>Chemical Reaction Productivity Prediction:</strong> The <a href="https://en.wikipedia.org/wiki/Cross-coupling_reaction">C-N cross-coupling</a> dataset (3,956 reactions) from Ahneman et al. was used with 10-fold cross-validation.</p>
<p><strong>DDI Prediction:</strong> The DeepDDI dataset (192,284 DDI pairs, 86 interaction types) was used as the benchmark.</p>
<h3 id="generation-tasks">Generation Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Generation Source</th>
          <th>Sampling Strategy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distribution learning (DL) generation</td>
          <td>Fixed initial symbol ([CLS])</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Goal-directed (GD) generation</td>
          <td>Unfixed initial symbol</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Molecule optimization</td>
          <td>Input molecule</td>
          <td>Beam search (beam size = 4)</td>
      </tr>
  </tbody>
</table>
<p><strong>DL-based Generation:</strong> Evaluated on ZINC250K (249,456 molecules) using validity, uniqueness, and novelty.</p>
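<p>The three distribution-learning metrics (validity, uniqueness, novelty) reduce to set operations; in this sketch the validity check is injected as a callable, since real validity checking needs a chemistry toolkit such as RDKit:</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity / uniqueness / novelty as used for DL-based generation.

    is_valid: callable SMILES -> bool (in practice an RDKit parse check;
    injected here so the sketch stays toolkit-free).
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)                       # deduplicate valid molecules
    novel = unique - set(training_set)        # not seen during training
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

# toy usage: 4 generated strings, one invalid, one duplicate, one in training
m = generation_metrics(
    generated=["CCO", "CCO", "CCN", "xx"],
    training_set={"CCO"},
    is_valid=lambda s: s != "xx",
)
```

<p>Here validity is 3/4, uniqueness 2/3 (among valid molecules), and novelty 1/2 (among unique valid molecules), matching the usual convention for these metrics.</p>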
<p><strong>GD Generation:</strong> Also on ZINC250K, using QED as the goal property with target QED = 0.948 (the dataset maximum). 10,000 molecules were generated for evaluation.</p>
<p><strong>Molecule Optimization:</strong> Evaluated on ZINC250K with QED as the optimization goal. Molecular pairs were constructed by selecting pairs with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> in [0.6, 0.8], where the lower-QED molecule serves as input and the higher-QED molecule as target.</p>
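<p>The pair-construction rule can be sketched with Tanimoto (Jaccard) similarity over fingerprint bit sets; representing fingerprints as plain Python sets is an assumption for illustration, as the paper's exact fingerprint is not restated here:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def build_optimization_pairs(molecules, lo=0.6, hi=0.8):
    """Pair molecules with Tanimoto similarity in [lo, hi]; the lower-QED
    molecule becomes the input, the higher-QED molecule the target.

    molecules: list of (smiles, fingerprint_set, qed) tuples.
    """
    pairs = []
    for i in range(len(molecules)):
        for j in range(i + 1, len(molecules)):
            (s1, f1, q1), (s2, f2, q2) = molecules[i], molecules[j]
            if lo <= tanimoto(f1, f2) <= hi:
                src, tgt = (s1, s2) if q1 < q2 else (s2, s1)
                pairs.append((src, tgt))
    return pairs

# toy molecules: A and B are similar (Tanimoto 4/6), C is dissimilar
mols = [("A", {1, 2, 3, 4, 5}, 0.5),
        ("B", {1, 2, 3, 4, 6}, 0.7),
        ("C", {9}, 0.9)]
pairs = build_optimization_pairs(mols)
```
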
<h3 id="key-results">Key Results</h3>
<p><strong>Classification (ROC-AUC, higher is better):</strong> X-MOL achieved state-of-the-art on all four datasets, outperforming both shallow learning methods and deep learning baselines including graph convolutional models.</p>
<p><strong>Regression (RMSE, lower is better):</strong> X-MOL achieved the best RMSE on ESOL, FreeSolv, and Lipophilicity.</p>
<p><strong>Reaction Productivity:</strong> X-MOL obtained an average RMSE of 0.0626, compared to the random forest baseline of 0.078.</p>
<p><strong>DDI Prediction:</strong> X-MOL achieved accuracy of 0.952, improving over DeepDDI&rsquo;s 0.924.</p>
<p><strong>DL-based Generation:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>20%</td>
          <td>99.97%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>MRNN</td>
          <td>65%</td>
          <td>99.89%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>GraphAF</td>
          <td>68%</td>
          <td>99.10%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td><strong>X-MOL</strong></td>
          <td><strong>85.28%</strong></td>
          <td><strong>99.91%</strong></td>
          <td><strong>100%</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>GD Generation:</strong> X-MOL&rsquo;s top-3 generated molecules all reached QED = 0.948, matching the dataset maximum. GraphAF reached 0.948/0.948/0.947, while JT-VAE and MRNN fell further behind.</p>
<h3 id="knowledge-embedding-ablation">Knowledge Embedding Ablation</h3>
<p>The paper tested three additional embedding strategies to inject structural information into the model:</p>
<ul>
<li><strong>Link embedding:</strong> Encodes connection information between atoms (position of the previous connected atom)</li>
<li><strong>Ring embedding:</strong> Encodes ring structure information from SMILES number pairs</li>
<li><strong>Type embedding:</strong> Categorizes characters into 9 types (atoms, bonds, structural symbols)</li>
</ul>
<p>None of these additional embeddings improved performance on the HIV or DDI tasks, with or without pre-training. The authors conclude that SMILES already contains sufficient information for molecular understanding and that pre-training effectively extracts it, a finding they summarize as &ldquo;SMILES is all you need.&rdquo;</p>
<h3 id="attention-visualization">Attention Visualization</h3>
<p>The authors provide attention heatmap analysis demonstrating that:</p>
<ul>
<li>Middle layers (e.g., layer 9) reconstruct molecular structure by correctly identifying atom connectivity and ring closures</li>
<li>Later layers abstract higher-level features for property prediction</li>
<li>In multi-input prediction tasks (reaction productivity), attention reveals which reaction components are most important (e.g., the ligand receives highest cross-attention)</li>
<li>In generation tasks, attention patterns differ between DL (self-focused), GD (source-constrained), and optimization (gradual shift from input to output)</li>
</ul>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>X-MOL demonstrates that large-scale pre-training on SMILES can produce a single model that achieves competitive or state-of-the-art performance across five distinct molecular analysis tasks. The key findings are:</p>
<ol>
<li><strong>Scale enables SMILES understanding.</strong> Pre-training on 1.1 billion molecules allows the model to learn SMILES grammar rules well enough to outperform graph-based methods on molecule generation validity.</li>
<li><strong>Unified framework.</strong> A single pre-trained backbone serves classification, regression, reaction prediction, DDI prediction, and generative tasks through different fine-tuning strategies.</li>
<li><strong>SMILES is sufficient.</strong> Additional knowledge embeddings (link, ring, type) do not improve performance, suggesting pre-training extracts the necessary structural information from SMILES alone.</li>
<li><strong>Interpretable attention.</strong> Attention visualization confirms that the model reconstructs molecular structure internally.</li>
</ol>
<p><strong>Limitations</strong> (observed):</p>
<ul>
<li>Property prediction is evaluated on relatively few datasets, all drawn from MoleculeNet. No scaffold or temporal splits are used; all splits are random, which can overestimate performance on structurally novel compounds.</li>
<li>Comparison baselines are somewhat dated (2018-2019 era methods), and the paper does not compare against concurrent SMILES pre-training methods.</li>
<li>The molecule generation validity (85.28%) is much higher than graph baselines like GCPN (20%), but later work achieved near 100% validity with constrained SMILES grammars.</li>
<li>No code or model weights have been publicly released, limiting independent verification.</li>
<li>The paper remains a bioRxiv preprint and has not been published in a peer-reviewed venue.</li>
</ul>
<p><strong>Future directions</strong> proposed by the authors include: better pre-training strategies, extension to graph-based representations, and fine-tuning on additional downstream tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC15</td>
          <td>1.1 billion molecules</td>
          <td>Random SMILES augmentation</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV (MoleculeNet)</td>
          <td>41,127</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE (MoleculeNet)</td>
          <td>1,513</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP (MoleculeNet)</td>
          <td>2,039</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>ClinTox (MoleculeNet)</td>
          <td>1,484</td>
          <td>Two sub-datasets, averaged</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL (MoleculeNet)</td>
          <td>1,128</td>
          <td>Water solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv (MoleculeNet)</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity (MoleculeNet)</td>
          <td>4,200</td>
          <td>logD at pH 7.4</td>
      </tr>
      <tr>
          <td>Reaction</td>
          <td>C-N cross-coupling</td>
          <td>3,956</td>
          <td>From Ahneman et al. (2018)</td>
      </tr>
      <tr>
          <td>DDI</td>
          <td>DeepDDI</td>
          <td>192,284 DDI pairs</td>
          <td>86 interaction types</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>ZINC250K</td>
          <td>249,456</td>
          <td>For DL, GD, and optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Generative SMILES-to-SMILES with shared encoder-decoder Transformer</li>
<li>Fine-tuning prediction tasks: [CLS] token passed through fully connected layers</li>
<li>Fine-tuning generation tasks: Autoregressive generation with random sampling (DL, GD) or beam search (optimization)</li>
<li>Data augmentation: Random SMILES augmentation for regression tasks</li>
<li>Repeated training: 20 random splits with averaged results for classification/regression</li>
<li>10-fold cross-validation for reaction productivity</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>12-layer Transformer, 768 hidden dimensions, 12 attention heads</li>
<li>Character-level tokenization: 108 chemical characters + 5 special tokens</li>
<li>Implemented in PaddlePaddle framework</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>X-MOL</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BACE (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BBBP (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ClinTox (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ESOL (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>FreeSolv (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>Lipophilicity (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>C-N coupling</td>
          <td>RMSE</td>
          <td>0.0626</td>
          <td>0.078 (random forest)</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Accuracy</td>
          <td>0.952</td>
          <td>0.924 (DeepDDI)</td>
      </tr>
      <tr>
          <td>DL generation</td>
          <td>Validity</td>
          <td>85.28%</td>
          <td>68% (GraphAF)</td>
      </tr>
      <tr>
          <td>GD generation</td>
          <td>Top-3 QED</td>
          <td>All 0.948</td>
          <td>0.948/0.948/0.947 (GraphAF)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8-16 Tesla P40 GPUs (24 GB each), approximately 4 days</li>
<li>Data pre-processing: Over 1,000 CPUs with Hadoop</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No code, model weights, or pre-trained checkpoints have been publicly released. The model was implemented in Baidu&rsquo;s PaddlePaddle framework, but no repository is available.</p>
<p><strong>Reproducibility status: Closed.</strong> While the datasets are all publicly available (ZINC15, MoleculeNet, ZINC250K, DeepDDI, C-N coupling), the model implementation, pre-trained weights, and fine-tuning code are not released. The computational requirements (1,000+ CPUs for data processing, 8-16 GPUs for 4 days of pre-training) are substantial.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xue, D., Zhang, H., Xiao, D., Gong, Y., Chuai, G., Sun, Y., Tian, H., Wu, H., Li, Y., &amp; Liu, Q. (2020). X-MOL: Large-scale pre-training for molecular understanding and diverse molecular analysis. <em>bioRxiv</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xue2020xmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xue, Dongyu and Zhang, Han and Xiao, Dongling and Gong, Yukang and Chuai, Guohui and Sun, Yu and Tian, Hao and Wu, Hua and Li, Yukun and Liu, Qi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2020.12.23.424259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Cold Spring Harbor Laboratory}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>VAE for Automatic Chemical Design (2018 Seminal)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/</guid><description>A variational autoencoder maps SMILES strings to a continuous latent space, enabling gradient-based optimization for molecular design and generation.</description><content:encoded><![CDATA[<h2 id="a-foundational-method-for-continuous-molecular-representation">A Foundational Method for Continuous Molecular Representation</h2>
<p>This is a <strong>Method</strong> paper that introduces a variational autoencoder (VAE) framework for mapping discrete molecular representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings) into a continuous latent space. The primary contribution is demonstrating that this continuous representation enables three key capabilities: (1) automatic generation of novel molecules by decoding random or perturbed latent vectors, (2) smooth interpolation between molecules in latent space, and (3) gradient-based optimization of molecular properties using a jointly trained property predictor. This work is widely regarded as one of the earliest and most influential applications of deep generative models to molecular design.</p>
<h2 id="the-challenge-of-searching-discrete-chemical-space">The Challenge of Searching Discrete Chemical Space</h2>
<p>Molecular design is fundamentally an optimization problem: identify molecules that maximize some set of desirable properties. The search space is enormous (estimated $10^{23}$ to $10^{60}$ drug-like molecules) and discrete, making systematic exploration difficult. Prior approaches fell into two categories, each with significant limitations:</p>
<ol>
<li><strong>Virtual screening</strong> over fixed libraries: effective but monolithic, costly to enumerate, and requiring hand-crafted rules to avoid impractical chemistries.</li>
<li><strong>Discrete local search</strong> (e.g., genetic algorithms): requires manual specification of mutation and crossover heuristics, and cannot leverage gradient information to guide the search.</li>
</ol>
<p>The core insight is that mapping molecules into a continuous vector space sidesteps these problems entirely. In a continuous space, new compounds can be generated by vector perturbation (no hand-crafted mutation rules), optimization can follow property gradients (enabling larger and more directed jumps), and large unlabeled chemical databases can be leveraged through unsupervised representation learning.</p>
<h2 id="a-vae-architecture-for-smiles-strings-with-joint-property-prediction">A VAE Architecture for SMILES Strings with Joint Property Prediction</h2>
<p>The architecture consists of three coupled neural networks trained jointly:</p>
<ol>
<li>
<p><strong>Encoder</strong>: Converts SMILES character strings into fixed-dimensional continuous vectors (the latent representation). Uses three 1D convolutional layers followed by a fully connected layer. For ZINC molecules, the latent space has 196 dimensions; for <a href="/notes/chemistry/datasets/qm9/">QM9</a>, 156 dimensions.</p>
</li>
<li>
<p><strong>Decoder</strong>: Converts latent vectors back into SMILES strings character by character using three layers of gated recurrent units (GRUs). The output is stochastic, as each character is sampled from a probability distribution over the SMILES alphabet.</p>
</li>
<li>
<p><strong>Property Predictor</strong>: A multilayer perceptron that predicts molecular properties directly from the latent representation. Joint training with the autoencoder reconstruction loss organizes the latent space so that molecules with similar properties cluster together.</p>
</li>
</ol>
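<p>The stochastic character-by-character decoding can be sketched as repeated sampling from a softmax over the SMILES alphabet; the toy alphabet, end symbol, and logit function below are assumptions for illustration, not the paper's model:</p>

```python
import numpy as np

def sample_smiles(step_logits, alphabet, max_len=50, eos="$", seed=0):
    """Decode character by character, sampling each symbol from a softmax
    over the alphabet -- stochastic, so repeated calls with different
    seeds can yield different strings for the same latent point."""
    rng = np.random.default_rng(seed)
    out = []
    for t in range(max_len):
        logits = np.asarray(step_logits(out, t), dtype=float)
        p = np.exp(logits - logits.max())
        p /= p.sum()  # categorical distribution over the alphabet
        ch = rng.choice(alphabet, p=p)
        if ch == eos:
            break
        out.append(str(ch))
    return "".join(out)

# toy "decoder": strongly prefer 'C' for two steps, then the end symbol
logits_fn = lambda out, t: [10.0, -10.0, -10.0] if t < 2 else [-10.0, -10.0, 10.0]
smi = sample_smiles(logits_fn, ["C", "O", "$"])
```
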
<h3 id="the-vae-objective">The VAE Objective</h3>
<p>The model uses the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder framework of Kingma and Welling</a>. The training objective combines three terms:</p>
<p>$$\mathcal{L} = \mathcal{L}_{recon} + \beta \cdot D_{KL}(q(z|x) \| p(z)) + \lambda \cdot \mathcal{L}_{prop}$$</p>
<p>where $\mathcal{L}_{recon}$ is the reconstruction loss (cross-entropy over SMILES characters), $D_{KL}$ is the KL divergence regularizer that encourages the latent distribution $q(z|x)$ to match a standard Gaussian prior $p(z)$, and $\mathcal{L}_{prop}$ is the property prediction regression loss. Both the variational loss and the property-prediction loss are annealed in via a sigmoid schedule beginning at epoch 29 of the 120 training epochs.</p>
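<p>A sketch of such a sigmoid annealing schedule; the midpoint at epoch 29 comes from the text, while the slope is an assumed free parameter:</p>

```python
import math

def anneal_weight(epoch: int, start: int = 29, slope: float = 1.0) -> float:
    """Sigmoid schedule ramping a loss weight from ~0 to ~1 around `start`.

    Applied to both the KL weight (beta) and the property-prediction
    weight (lambda); only the start epoch is stated in the text.
    """
    return 1.0 / (1.0 + math.exp(-slope * (epoch - start)))
```

<p>Early in training the weight is effectively zero (pure reconstruction), crosses 0.5 at epoch 29, and saturates near 1 well before epoch 120.</p>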
<p>The KL regularization is critical: it forces the decoder to handle a wider variety of latent points, preventing &ldquo;dead areas&rdquo; in latent space that would decode to invalid molecules.</p>
<h3 id="gradient-based-optimization">Gradient-Based Optimization</h3>
<p>After training, a Gaussian process (GP) surrogate model is fit on top of the latent representations to predict the target property. Optimization proceeds by:</p>
<ol>
<li>Encoding a seed molecule into the latent space</li>
<li>Using the GP model to define a smooth property surface over the latent space</li>
<li>Optimizing the latent vector $z$ to maximize the predicted property via gradient ascent</li>
<li>Decoding the optimized $z$ back into a SMILES string</li>
</ol>
<p>The objective used for demonstration is $5 \times \text{QED} - \text{SAS}$, balancing drug-likeness (QED) against synthetic accessibility (SAS).</p>
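<p>The optimization loop can be sketched with a toy differentiable surrogate standing in for the GP posterior mean over latent space; the surrogate shape, step size, and iteration count are assumptions, and the decode step is omitted:</p>

```python
import numpy as np

def gradient_ascent(z0, surrogate, lr=0.1, steps=100, eps=1e-4):
    """Maximize a scalar surrogate over the latent space by gradient
    ascent, using central finite differences so any smooth surrogate
    (e.g. a GP mean) can be plugged in."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros_like(z)
        for i in range(z.size):
            dz = np.zeros_like(z)
            dz[i] = eps
            grad[i] = (surrogate(z + dz) - surrogate(z - dz)) / (2 * eps)
        z += lr * grad  # ascend the predicted-property surface
    return z

# toy surrogate for 5*QED - SAS over a 2D latent space:
# a smooth bump peaked at z* = (1, -2)
z_star = np.array([1.0, -2.0])
surrogate = lambda z: -np.sum((z - z_star) ** 2)
z_opt = gradient_ascent(np.zeros(2), surrogate)
```

<p>In the full method, the final $z$ would then be decoded back into a SMILES string by the trained decoder.</p>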
<h2 id="experiments-on-zinc-and-qm9-datasets">Experiments on ZINC and QM9 Datasets</h2>
<p>Two autoencoder systems were trained:</p>
<ul>
<li><strong>ZINC</strong>: 250,000 drug-like molecules from the ZINC database, with a 196-dimensional latent space. Properties predicted: logP, QED, SAS.</li>
<li><strong>QM9</strong>: 108,000 molecules with fewer than 9 heavy atoms, with a 156-dimensional latent space. Properties predicted: HOMO energy, LUMO energy, electronic spatial extent ($\langle R^2 \rangle$).</li>
</ul>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>The encoded latent dimensions follow approximately normal distributions as enforced by the variational regularizer. Decoding is stochastic: sampling the same latent point multiple times yields different SMILES strings, with the most frequent decoding tending to be closest to the original point in latent space. Decoding validity rates are 73-79% for points near known molecules but only 4% for randomly selected latent points.</p>
<p>Spherical interpolation (slerp) between molecules in latent space produces smooth structural transitions, accounting for the geometry of high-dimensional Gaussian distributions where linear interpolation would pass through low-probability regions.</p>
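<p>A standard slerp sketch (not the paper's code) makes the geometric point concrete: spherical interpolation stays in the typical-probability shell of a high-dimensional Gaussian, whereas linear interpolation would cut through low-density interior regions:</p>

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between latent vectors z0 and z1."""
    z0, z1 = np.asarray(z0, float), np.asarray(z1, float)
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):  # nearly parallel: fall back to lerp
        return (1 - t) * z0 + t * z1
    s = np.sin(omega)
    return np.sin((1 - t) * omega) / s * z0 + np.sin(t * omega) / s * z1

# midpoint between two orthogonal unit vectors stays on the unit sphere,
# while the linear midpoint (0.5, 0.5) would have norm ~0.707
z_mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```
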
<h3 id="molecular-generation-comparison">Molecular Generation Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Dataset</th>
          <th>Samples</th>
          <th>logP</th>
          <th>SAS</th>
          <th>QED</th>
          <th>% in ZINC</th>
          <th>% in eMolecules</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data</td>
          <td>ZINC</td>
          <td>249k</td>
          <td>2.46 (1.43)</td>
          <td>3.05 (0.83)</td>
          <td>0.73 (0.14)</td>
          <td>100</td>
          <td>12.9</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>ZINC</td>
          <td>5303</td>
          <td>2.84 (1.86)</td>
          <td>3.80 (1.01)</td>
          <td>0.57 (0.20)</td>
          <td>6.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>ZINC</td>
          <td>8728</td>
          <td>2.67 (1.46)</td>
          <td>3.18 (0.86)</td>
          <td>0.70 (0.14)</td>
          <td>5.8</td>
          <td>7.0</td>
      </tr>
      <tr>
          <td>Data</td>
          <td>QM9</td>
          <td>134k</td>
          <td>0.30 (1.00)</td>
          <td>4.25 (0.94)</td>
          <td>0.48 (0.07)</td>
          <td>0.0</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>QM9</td>
          <td>5470</td>
          <td>0.96 (1.53)</td>
          <td>4.47 (1.01)</td>
          <td>0.53 (0.13)</td>
          <td>0.018</td>
          <td>3.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>QM9</td>
          <td>2839</td>
          <td>0.30 (0.97)</td>
          <td>4.34 (0.98)</td>
          <td>0.47 (0.08)</td>
          <td>0.0</td>
          <td>8.9</td>
      </tr>
  </tbody>
</table>
<p>The VAE generates molecules whose property distributions closely match the training data, outperforming a genetic algorithm baseline that biases toward higher chemical complexity and decreased drug-likeness. Only 5.8% of VAE-generated ZINC molecules were found in the original ZINC database, indicating genuine novelty.</p>
<h3 id="property-prediction">Property Prediction</h3>
<table>
  <thead>
      <tr>
          <th>Dataset/Property</th>
          <th>Mean Baseline</th>
          <th>ECFP</th>
          <th>Graph Conv.</th>
          <th>1-hot SMILES</th>
          <th>Encoder Only</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC/logP</td>
          <td>1.14</td>
          <td>0.38</td>
          <td>0.05</td>
          <td>0.16</td>
          <td>0.13</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>ZINC/QED</td>
          <td>0.112</td>
          <td>0.045</td>
          <td>0.017</td>
          <td>0.041</td>
          <td>0.037</td>
          <td>0.054</td>
      </tr>
      <tr>
          <td>QM9/HOMO (eV)</td>
          <td>0.44</td>
          <td>0.20</td>
          <td>0.12</td>
          <td>0.12</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/LUMO (eV)</td>
          <td>1.05</td>
          <td>0.20</td>
          <td>0.15</td>
          <td>0.11</td>
          <td>0.14</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/Gap (eV)</td>
          <td>1.07</td>
          <td>0.30</td>
          <td>0.18</td>
          <td>0.16</td>
          <td>0.18</td>
          <td>0.21</td>
      </tr>
  </tbody>
</table>
<p>The VAE latent representation achieves property prediction accuracy comparable to graph convolutions for some properties, though graph convolutions generally perform best. The primary purpose of joint training is not to maximize prediction accuracy but to organize the latent space for optimization.</p>
<h3 id="optimization-results">Optimization Results</h3>
<p>Bayesian optimization with a GP model on the jointly trained latent space consistently produces molecules with higher percentile scores on the $5 \times \text{QED} - \text{SAS}$ objective compared to both random Gaussian search and genetic algorithm baselines. Starting from molecules in the bottom 10th percentile of the ZINC dataset, the optimizer reliably discovers molecules in regions of high objective value. Training the GP with 1000 molecules (vs. 2000) produces a wider diversity of solutions by optimizing to multiple local optima rather than a single global optimum.</p>
<h2 id="key-findings-limitations-and-legacy">Key Findings, Limitations, and Legacy</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>A continuous latent representation of molecules enables gradient-based search through chemical space, a qualitatively different approach from discrete enumeration or genetic algorithms.</li>
<li>Joint training with property prediction organizes the latent space by property values, creating smooth gradients that optimization can follow.</li>
<li>The VAE generates novel molecules with realistic property distributions, and the latent space encodes an estimated 7.5 million molecules despite training on only 250,000.</li>
</ul>
<h3 id="acknowledged-limitations">Acknowledged Limitations</h3>
<ul>
<li>The SMILES-based decoder sometimes produces formally valid but chemically undesirable molecules (acid chlorides, anhydrides, cyclopentadienes, aziridines, etc.) because the grammar of valid SMILES does not capture all synthetic or stability constraints.</li>
<li>Character-level SMILES generation is fragile: the decoder must implicitly learn which strings are valid SMILES, making the learning problem harder than necessary.</li>
<li>Decoding validity drops to only 4% for random latent points far from training data, limiting the ability to explore truly novel regions of chemical space.</li>
</ul>
<h3 id="directions-identified">Directions Identified</h3>
<p>The authors point to several extensions that were already underway at the time of publication:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></strong>: Using an explicitly defined SMILES grammar instead of forcing the model to learn one (Kusner et al., 2017).</li>
<li><strong>Graph-based decoders</strong>: Directly outputting molecular graphs to avoid the SMILES validity problem.</li>
<li><strong>Adversarial training</strong>: Using GANs for molecular generation (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN, ORGANIC</a>).</li>
<li><strong>LSTM/RNN generators</strong>: Applying recurrent networks directly to SMILES for generation and reaction prediction.</li>
</ul>
<p>This paper has been cited over 2,900 times and launched a large body of follow-up work in VAE-based, GAN-based, and reinforcement learning-based molecular generation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ZINC (drug-like subset)</td>
          <td>250,000 molecules</td>
          <td>Randomly sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>QM9</td>
          <td>108,000 molecules</td>
          <td>Molecules with fewer than 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ZINC held-out set</td>
          <td>5,000 molecules</td>
          <td>For latent space analysis</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Encoder</strong>: 3 x 1D convolutional layers (ZINC: filters 9,9,10 with kernels 9,9,11; QM9: filters 2,2,1 with kernels 5,5,4), followed by a fully connected layer</li>
<li><strong>Decoder</strong>: 3 x GRU layers (ZINC: hidden dim 488; QM9: hidden dim 500), trained with teacher forcing</li>
<li><strong>Property Predictor</strong>: 2 fully connected layers of 1000 neurons (dropout 0.20) for prediction; smaller 3-layer MLP of 67 neurons (dropout 0.15) for latent space shaping</li>
<li><strong>Variational loss annealing</strong>: Sigmoid schedule after 29 epochs, total 120 epochs</li>
<li><strong>SMILES validation</strong>: Post-hoc filtering with RDKit; invalid outputs discarded</li>
<li><strong>Optimization</strong>: Gaussian process surrogate model trained on 2000 maximally diverse molecules from latent space</li>
</ul>
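<p>The sigmoid annealing schedule for the variational loss can be sketched as below. The 29-epoch delay and 120-epoch total come from the paper; the steepness value and exact parameterization are assumptions.</p>

```python
import math

def kl_weight(epoch, center=29, steepness=0.5):
    """Sigmoid annealing weight for the KL term of the VAE loss.

    The weight stays near 0 early in training (letting the model learn
    reconstruction first) and ramps toward 1 around `center` epochs.
    `steepness` is a hypothetical value, not reported in the paper.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (epoch - center)))
```

With these values the KL weight is essentially zero for the first ~20 epochs, crosses 0.5 at epoch 29, and saturates near 1 well before epoch 120.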
<h3 id="models">Models</h3>
<p>Built with Keras and TensorFlow. Latent dimensions: 196 (ZINC), 156 (QM9). SMILES alphabet: 35 characters (ZINC), 22 characters (QM9). Maximum string length: 120 (ZINC), 34 (QM9). Only canonicalized SMILES used for training.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>Water-octanol partition coefficient</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimation of Drug-likeness (0-1)</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>Synthetic Accessibility Score</td>
      </tr>
      <tr>
          <td>HOMO/LUMO (eV)</td>
          <td>Frontier orbital energies (QM9)</td>
      </tr>
      <tr>
          <td>Decoding validity</td>
          <td>Fraction of latent points producing valid SMILES</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on the Harvard FAS Odyssey Cluster. Specific GPU types and training times are not reported. The Gaussian process optimization requires only minutes to train on a few thousand molecules.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/chemical_vae">chemical_vae</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with training scripts and pre-trained models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., &amp; Aspuru-Guzik, A. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. <em>ACS Central Science</em>, 4(2), 268-276. <a href="https://doi.org/10.1021/acscentsci.7b00572">https://doi.org/10.1021/acscentsci.7b00572</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gomez2018automatic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{G{\&#39;o}mez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and S{\&#39;a}nchez-Lengeling, Benjam{\&#39;i}n and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{268--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acscentsci.7b00572}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer-CNN: SMILES Embeddings for QSAR Modeling</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</guid><description>Transformer-CNN uses SMILES embeddings from a canonicalization Transformer with a CNN head for interpretable QSAR property prediction.</description><content:encoded><![CDATA[<h2 id="transformer-based-smiles-embeddings-for-property-prediction">Transformer-Based SMILES Embeddings for Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces Transformer-CNN, a two-stage architecture for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> (Quantitative Structure-Activity Relationship) modeling. The primary contribution is a transfer learning approach: a Transformer model is first trained on the task of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> canonicalization (mapping non-canonical SMILES to canonical forms), and the encoder&rsquo;s internal representations are then used as &ldquo;dynamic SMILES embeddings&rdquo; for downstream property prediction via a convolutional neural network (TextCNN). The authors also contribute an interpretability framework based on Layer-wise Relevance Propagation (LRP) that traces predictions back to individual atom contributions.</p>
<h2 id="from-descriptors-to-learned-embeddings-in-qsar">From Descriptors to Learned Embeddings in QSAR</h2>
<p>Traditional QSAR methods rely on hand-engineered molecular descriptors (fragment counts, physicochemical features) coupled with feature selection and classical ML algorithms. While deep learning approaches that operate on raw SMILES strings or molecular graphs have reduced the need for manual feature engineering, they typically require large training datasets to learn effective representations from scratch. QSAR datasets, in contrast, often contain only hundreds of molecules, making it difficult to train end-to-end deep models.</p>
<p>The authors identify two specific gaps. First, existing SMILES-based autoencoders such as <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (Continuous and Data-Driven molecular Descriptors) produce fixed-length latent vectors, discarding positional information that could be useful for property prediction and interpretation. Second, QSAR models built on deep architectures generally lack interpretability, making it hard to verify that predictions rely on chemically meaningful structural features rather than spurious correlations.</p>
<h2 id="dynamic-smiles-embeddings-via-canonicalization-pre-training">Dynamic SMILES Embeddings via Canonicalization Pre-training</h2>
<p>The core insight is that training a Transformer to perform SMILES canonicalization (a Seq2Seq task mapping non-canonical SMILES to canonical SMILES) produces an encoder whose internal states serve as information-rich, position-dependent molecular embeddings.</p>
<h3 id="pre-training-on-smiles-canonicalization">Pre-training on SMILES Canonicalization</h3>
<p>The Transformer encoder-decoder is trained on approximately 17.7 million canonicalization pairs derived from the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database (SMILES with length up to 110 characters). Each molecule is augmented 10 times by generating non-canonical SMILES variants, plus one identity pair where both sides are canonical. The training uses character-level tokenization with a 66-symbol vocabulary covering drug-like molecules including stereochemistry, charges, and inorganic ions.</p>
<p>The Transformer architecture follows Vaswani et al. with 3 layers and 10 self-attention heads. The learning rate schedule follows:</p>
<p>$$\lambda = \text{factor} \cdot \min(1.0,\; \text{step} / \text{warmup}) / \max(\text{step},\; \text{warmup})$$</p>
<p>where factor = 20, warmup = 16,000 steps, and $\lambda$ is clipped at a minimum of $10^{-4}$. Training runs for 10 epochs (275,907 batches per epoch) without early stopping.</p>
<p>On validation with 500,000 generated ChEMBL-like SMILES, the model correctly canonicalizes 83.6% of all samples. Performance drops for stereochemistry (37.2% for @-containing SMILES) and cis/trans notation (73.9%).</p>
<h3 id="from-encoder-states-to-qsar-predictions">From Encoder States to QSAR Predictions</h3>
<p>After pre-training, the encoder&rsquo;s output for a molecule with $N$ characters is a matrix of dimensions $(N, \text{EMBEDDINGS})$. Unlike fixed-length CDDD descriptors, these &ldquo;dynamic embeddings&rdquo; preserve positional information, meaning equivalent characters receive different embedding values depending on their context and position.</p>
<p>To handle variable-length embeddings, the authors use a TextCNN architecture (from DeepChem) with 1D convolutional filters at kernel sizes (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20) producing (100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160) filters respectively. After GlobalMaxPool and concatenation, the features pass through Dropout (rate = 0.25), a Dense layer ($N = 512$), a Highway layer, and finally an output layer (1 neuron for regression, 2 for classification).</p>
<p>The Transformer weights are frozen during QSAR training. The Adam optimizer is used with a fixed learning rate of $10^{-4}$ and early stopping on a 10% held-out validation set. Critically, SMILES augmentation ($n = 10$) is applied during both training and inference, with the final prediction being the average over augmented SMILES for each molecule.</p>
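<p>The inference-time averaging can be sketched as follows; <code>predict_fn</code> stands in for the trained Transformer-CNN model, and returning the spread alongside the mean reflects the variance-as-confidence idea the authors flag as future work.</p>

```python
from statistics import mean, pstdev

def consensus_predict(predict_fn, smiles_variants):
    """Average a property prediction over n augmented SMILES strings of
    one molecule; the spread among variants is a rough confidence signal.
    `predict_fn` is a placeholder for the trained model."""
    preds = [predict_fn(s) for s in smiles_variants]
    return mean(preds), pstdev(preds)
```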
<h3 id="interpretability-via-layer-wise-relevance-propagation">Interpretability via Layer-wise Relevance Propagation</h3>
<p>The LRP algorithm propagates relevance scores from the output back through the CNN layers to the Transformer encoder output (which is position-wise). The relevance conservation property holds:</p>
<p>$$y = R = f(x) = \sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} = \cdots = \sum_{l \in (1)} R_{l}$$</p>
<p>In practice, biases absorb some relevance, so the total propagated to the input is less than the output:</p>
<p>$$\sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} + B$$</p>
<p>For gated connections in the Highway block, the authors implement the signal-take-all redistribution rule. The interpretation algorithm generates one SMILES per non-hydrogen atom (each drawn starting from that atom), runs LRP on each, and averages contributions. If more than 50% of relevance dissipates on biases, the interpretation may be unreliable, serving as an applicability domain indicator.</p>
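<p>For a single bias-free dense layer, a generic epsilon-rule LRP step looks like the sketch below. This illustrates the general technique and the conservation property; it is not the paper's exact implementation (in particular, the signal-take-all rule for Highway gates is not shown).</p>

```python
import numpy as np

def lrp_dense(x, W, R_out, eps=1e-9):
    """Redistribute output relevance R_out to the inputs of a bias-free
    dense layer, proportionally to each contribution z_ij = x_i * W_ij.
    With no bias term, total relevance is conserved: sum(R_in) == sum(R_out)."""
    z = x[:, None] * W                                     # (n_in, n_out)
    denom = z.sum(axis=0)
    denom = denom + eps * np.where(denom >= 0, 1.0, -1.0)  # stabilizer
    return (z / denom * R_out).sum(axis=1)
```

With a nonzero bias, part of the relevance is absorbed by the bias term, which is exactly the dissipation the authors use as an applicability-domain indicator.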
<h2 id="benchmarks-across-18-regression-and-classification-datasets">Benchmarks Across 18 Regression and Classification Datasets</h2>
<p>The authors evaluate on the same 18 datasets (9 regression, 9 classification) used in their previous SMILES augmentation study, enabling direct comparison. All experiments use five-fold cross-validation.</p>
<h3 id="regression-results-r2">Regression Results ($r^2$)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MP (19,104)</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center"><strong>0.86</strong></td>
          <td style="text-align: center">0.85</td>
      </tr>
      <tr>
          <td>BP (11,893)</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.97</td>
          <td style="text-align: center"><strong>0.98</strong></td>
          <td style="text-align: center">0.98</td>
      </tr>
      <tr>
          <td>BCF (378)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center"><strong>0.85</strong></td>
          <td style="text-align: center">0.81</td>
      </tr>
      <tr>
          <td>FreeSolv (642)</td>
          <td style="text-align: center"><strong>0.94</strong></td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
      </tr>
      <tr>
          <td>LogS (1,311)</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.91</td>
      </tr>
      <tr>
          <td>Lipo (4,200)</td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.60</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center"><strong>0.74</strong></td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.66</td>
          <td style="text-align: center"><strong>0.76</strong></td>
          <td style="text-align: center">0.75</td>
      </tr>
      <tr>
          <td>DHFR (739)</td>
          <td style="text-align: center">0.62</td>
          <td style="text-align: center">0.63</td>
          <td style="text-align: center">0.46</td>
          <td style="text-align: center"><strong>0.67</strong></td>
          <td style="text-align: center">0.61</td>
      </tr>
      <tr>
          <td>LEL (483)</td>
          <td style="text-align: center">0.19</td>
          <td style="text-align: center">0.25</td>
          <td style="text-align: center">0.20</td>
          <td style="text-align: center"><strong>0.27</strong></td>
          <td style="text-align: center">0.23</td>
      </tr>
  </tbody>
</table>
<h3 id="classification-results-auc">Classification Results (AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (41,127)</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.74</td>
      </tr>
      <tr>
          <td>AMES (6,542)</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center"><strong>0.89</strong></td>
          <td style="text-align: center">0.86</td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center"><strong>0.91</strong></td>
          <td style="text-align: center">0.90</td>
      </tr>
      <tr>
          <td>ClinTox (1,478)</td>
          <td style="text-align: center"><strong>0.77</strong></td>
          <td style="text-align: center">0.76</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center">0.77</td>
          <td style="text-align: center">0.73</td>
      </tr>
      <tr>
          <td>Tox21 (7,831)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.82</td>
      </tr>
      <tr>
          <td>BBBP (2,039)</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.89</td>
      </tr>
      <tr>
          <td>JAK3 (886)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.80</strong></td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.76</td>
      </tr>
      <tr>
          <td>BioDeg (1,737)</td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center"><strong>0.93</strong></td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.92</td>
      </tr>
      <tr>
          <td>RP AR (930)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center"><strong>0.87</strong></td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.86</td>
      </tr>
  </tbody>
</table>
<h3 id="key-comparisons">Key Comparisons</h3>
<p>Baselines include descriptor-based methods (the best from LibSVM, Random Forest, XGBoost, ASNN, and DNNs), direct SMILES-based models with augmentation, and CDDD descriptors analyzed by the same classical ML methods. CDDD descriptors come from the Sml2canSml autoencoder approach, which produces fixed 512-dimensional vectors.</p>
<p>Transformer-CNN with augmentation matches or exceeds all baselines on 14 of 18 datasets. The effect of augmentation is dramatic: without it, Transformer-CNN underperforms substantially (e.g., BCF drops from 0.85 to 0.71, JAK3 from 0.78 to 0.70). This confirms that the internal consensus from multiple SMILES representations is essential to the method&rsquo;s effectiveness.</p>
<p>A practical advantage over CDDD is that Transformer-CNN imposes no constraints on molecular properties (CDDD requires logP in (-5, 7), molecular weight between 12 and 600 Da, 3-50 heavy atoms, and organic molecules only), since the Transformer was trained on the full diversity of ChEMBL.</p>
<h3 id="interpretability-case-studies">Interpretability Case Studies</h3>
<p>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> mutagenicity, the LRP analysis of 1-Bromo-4-nitrobenzene correctly identifies the nitro group and halogen as structural alerts, consistent with known mutagenicity rules. For aqueous solubility of <a href="https://en.wikipedia.org/wiki/Haloperidol">haloperidol</a>, the model assigns positive contributions to hydroxyl, carbonyl, and aliphatic nitrogen groups (which increase solubility) and negative contributions to aromatic carbons (which decrease it). Both cases align with established chemical knowledge, supporting the trustworthiness of the model.</p>
<h2 id="effective-transfer-learning-for-small-qsar-datasets">Effective Transfer Learning for Small QSAR Datasets</h2>
<p>Transformer-CNN achieves competitive or superior QSAR performance across 18 diverse benchmarks by combining three ingredients: (1) Transformer-based pre-training via SMILES canonicalization, (2) SMILES augmentation during training and inference, and (3) a lightweight CNN head. The method requires minimal hyperparameter tuning, as the Transformer weights are frozen and the CNN architecture is fixed.</p>
<p>The authors acknowledge several limitations and future directions:</p>
<ul>
<li>Stereochemistry canonicalization accuracy is low (37.2%), which could impact models for stereo-sensitive properties</li>
<li>The LRP interpretability depends on sufficient relevance propagation (at least 50% reaching the input layer)</li>
<li>The variance among augmented SMILES predictions could serve as a confidence estimate, but this is left to future work</li>
<li>Applicability domain assessment based on SMILES reconstruction quality is proposed but not fully developed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (SMILES &lt;= 110 chars)</td>
          <td>17.7M pairs</td>
          <td>10x augmentation + 1 identity pair per molecule</td>
      </tr>
      <tr>
          <td>Validation (canon.)</td>
          <td>Generated ChEMBL-like SMILES</td>
          <td>500,000</td>
          <td>From a molecular generator</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>9 regression + 9 classification</td>
          <td>378-41,127</td>
          <td>Available on OCHEM (<a href="https://ochem.eu">https://ochem.eu</a>)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer: 3 layers, 10 self-attention heads, character-level tokenization (66 symbols)</li>
<li>TextCNN: 12 kernel sizes (1-10, 15, 20) with 100-200 filters each, GlobalMaxPool, Dense(512), Highway, Dropout(0.25)</li>
<li>Augmentation: n=10 non-canonical SMILES per molecule during training and inference</li>
<li>LRP: signal-take-all redistribution for Highway gates, standard LRP for Dense and Conv layers</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Transformer encoder weights pre-trained on canonicalization task (frozen during QSAR training)</li>
<li>QSAR CNN trained with Adam optimizer, learning rate $10^{-4}$, early stopping</li>
<li>Pre-trained embeddings and standalone prediction models available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: coefficient of determination $r^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$</li>
<li>Classification: Area Under the ROC Curve (AUC)</li>
<li>Five-fold cross-validation with bootstrap standard errors</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P6000, Titan Xp, and Titan V GPUs (donated by NVIDIA)</li>
<li>TensorFlow v1.12.0, RDKit v2018.09.2</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bigchem/transformer-cnn">transformer-cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Source code, pre-trained embeddings, standalone prediction models</td>
      </tr>
      <tr>
          <td><a href="https://ochem.eu">OCHEM</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online platform hosting the method, training datasets, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Karpov, P., Godin, G., &amp; Tetko, I. V. (2020). Transformer-CNN: Swiss knife for QSAR modeling and interpretation. <em>Journal of Cheminformatics</em>, 12, 17. <a href="https://doi.org/10.1186/s13321-020-00423-w">https://doi.org/10.1186/s13321-020-00423-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{karpov2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-{CNN}: Swiss knife for {QSAR} modeling and interpretation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Karpov, Pavel and Godin, Guillaume and Tetko, Igor V.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00423-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer Name-to-SMILES with Atom Count Losses</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/</guid><description>A Transformer seq2seq model translates chemical compound names to SMILES, using atom-count constraints and SMILES/InChI multi-task learning.</description><content:encoded><![CDATA[<h2 id="translating-chemical-names-to-structures-with-transformers">Translating Chemical Names to Structures with Transformers</h2>
<p>This is a <strong>Method</strong> paper that proposes using Transformer-based sequence-to-sequence models to predict chemical compound structures (represented as SMILES strings) from chemical compound names. The primary contribution is the application of neural machine translation techniques to the name-to-structure problem, along with two domain-specific improvements: an atom-count constraint loss function and a multi-task learning approach that jointly predicts SMILES and InChI strings.</p>
<h2 id="why-rule-based-name-to-structure-fails-for-synonyms">Why Rule-Based Name-to-Structure Fails for Synonyms</h2>
<p>Chemical compound names come in several varieties. IUPAC names follow systematic nomenclature and are well handled by rule-based parsers like OPSIN. Database IDs (e.g., CAS registry numbers) can be resolved by dictionary lookup. The third category, Synonyms (which covers abbreviations, common names, and other informal designations), is problematic because naming patterns are complex and highly variable.</p>
<p>In preliminary experiments, rule-based tools achieved F-measures of 0.878 to 0.960 on IUPAC names but only 0.719 to 0.758 on Synonyms. This performance gap motivates a data-driven approach. The authors frame name-to-SMILES prediction as a machine translation problem: the source language is the chemical compound name and the target language is the SMILES string. A neural model trained on millions of name-SMILES pairs can learn patterns that rule-based systems miss, particularly for non-systematic nomenclature.</p>
<h2 id="atom-count-constraints-and-multi-task-learning">Atom-Count Constraints and Multi-Task Learning</h2>
<p>The paper introduces two improvements over a vanilla Transformer seq2seq model.</p>
<h3 id="atom-count-constraint-loss">Atom-Count Constraint Loss</h3>
<p>A correct structure prediction must contain the right number of atoms of each element. The authors add an auxiliary loss that penalizes the squared difference between the predicted and true atom counts for each element. The predicted atom counts are obtained by summing Gumbel-softmax outputs across all decoded positions.</p>
<p>For the $i$-th output token, the Gumbel-softmax probability vector is:</p>
<p>$$
y_{ij} = \frac{\exp\left((\log(\pi_{ij}) + g_{ij}) / \tau\right)}{\sum_{k=1}^{|\mathcal{V}|} \exp\left((\log(\pi_{ik}) + g_{ik}) / \tau\right)}
$$</p>
<p>where $\pi_{ij}$ is the model&rsquo;s softmax output, $g_{ij}$ is a Gumbel noise sample, and $\tau = 0.1$ is the temperature. The predicted token frequency vector is $\mathbf{y}^{pred} = \sum_{i=1}^{m} \mathbf{y}_i$, and the atom-count loss is:</p>
<p>$$
\mathcal{L}_{atom} = \frac{1}{|A|} \sum_{a \in A} \left(N_a(T) - y_{idx(a)}^{pred}\right)^2
$$</p>
<p>where $A$ is the set of chemical elements in the vocabulary, $N_a(T)$ returns the number of atoms of element $a$ in the correct SMILES string $T$, and $idx(a)$ returns the vocabulary index of element $a$. Only element tokens (e.g., &ldquo;C&rdquo;, &ldquo;O&rdquo;) are counted; bond symbols (e.g., &ldquo;=&rdquo;, &ldquo;#&rdquo;) are excluded.</p>
<p>The combined objective is:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{atom} \mathcal{L}_{atom}
$$</p>
<p>with $\lambda_{atom} = 0.7$.</p>
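<p>The loss above can be sketched in plain Python (a dependency-free illustration, not the authors' implementation; in practice the $\pi_{ij}$ come from the decoder softmax and gradients flow through the Gumbel-softmax relaxation):</p>

```python
import math
import random

def gumbel_softmax(logit_rows, tau=0.1, seed=0):
    """Per-position relaxation: y_ij = softmax((log(pi_ij) + g_ij) / tau)."""
    rng = random.Random(seed)
    rows = []
    for row in logit_rows:
        # Gumbel(0, 1) samples via inverse transform: g = -log(-log U)
        z = [(l - math.log(-math.log(max(rng.random(), 1e-12)))) / tau
             for l in row]
        m = max(z)  # subtract max for numerical stability
        exps = [math.exp(v - m) for v in z]
        s = sum(exps)
        rows.append([e / s for e in exps])
    return rows

def atom_count_loss(logit_rows, true_counts, element_idx, tau=0.1, seed=0):
    """L_atom: mean squared error between soft and true per-element counts.

    true_counts maps element -> N_a(T); element_idx maps element -> idx(a).
    """
    y = gumbel_softmax(logit_rows, tau, seed)
    # y_pred[j] = sum_i y_ij: expected frequency of vocabulary token j
    y_pred = [sum(row[j] for row in y) for j in range(len(y[0]))]
    sq_errs = [(true_counts[a] - y_pred[element_idx[a]]) ** 2
               for a in true_counts]
    return sum(sq_errs) / len(sq_errs)
```

With a sharply peaked distribution at each step, the soft counts approach the true counts and the loss approaches zero; a wrong target count yields a large penalty.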
<h3 id="multi-task-smilesinchi-prediction">Multi-Task SMILES/InChI Prediction</h3>
<p>SMILES and InChI strings encode the same chemical structure in different formats. The authors hypothesize that jointly predicting both representations can improve the shared encoder. The multi-task model shares the encoder between a SMILES decoder and an InChI decoder, minimizing:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{inchi} \mathcal{L}_{inchi}
$$</p>
<p>where $\mathcal{L}_{inchi} = -\log P(I | X; \boldsymbol{\theta}_{enc}, \boldsymbol{\theta}_{inchi})$ and $\lambda_{inchi} = 0.3$.</p>
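<p>Both training objectives are simple weighted sums. A minimal helper with the paper's $\lambda$ values (illustrative only; note the paper trains the atomnum and inchigen variants separately rather than combining both auxiliary losses in one model):</p>

```python
def training_loss(l_smiles, l_atom=None, l_inchi=None,
                  lam_atom=0.7, lam_inchi=0.3):
    """Weighted objective: L_smiles plus whichever auxiliary loss is active."""
    loss = l_smiles
    if l_atom is not None:
        loss += lam_atom * l_atom    # 'atomnum' variant
    if l_inchi is not None:
        loss += lam_inchi * l_inchi  # 'inchigen' variant
    return loss
```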
<h2 id="experimental-setup-and-evaluation">Experimental Setup and Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The dataset was constructed from PubChem dump data (97M compound records). Chemical compound names categorized as Synonyms were paired with canonical SMILES strings (converted via RDKit). Database-like IDs were filtered out using regular expressions. Duplicate names mapping to different CIDs were removed.</p>
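<p>The paper does not reproduce its exact filtering expressions; as an illustration of the idea, a couple of hypothetical regular expressions that flag registry-ID-like names:</p>

```python
import re

# Hypothetical patterns (not the paper's): names that look like database
# registry IDs rather than chemical names.
ID_PATTERNS = [
    re.compile(r"^\d{2,7}-\d{2}-\d$"),       # CAS-style registry numbers
    re.compile(r"^[A-Z]{2,10}[-_ ]?\d+$"),   # e.g. "CHEMBL25", "AC-1234"
]

def looks_like_database_id(name: str) -> bool:
    """True if the name matches any registry-ID-like pattern."""
    return any(p.match(name) for p in ID_PATTERNS)
```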
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>5,000,000</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>1,113</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>11,194</td>
      </tr>
  </tbody>
</table>
<h3 id="model-configuration">Model Configuration</h3>
<p>The Transformer uses 6 encoder/decoder layers, 8 attention heads, 512-dimensional embeddings, and 0.1 dropout. Training used label-smoothing cross-entropy ($\epsilon = 0.1$), Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$), and a warmup schedule with peak learning rate 0.0005 over 4,000 steps followed by inverse square root decay. Models were trained for 300,000 update steps. Final predictions averaged the last 10 checkpoints and used beam search (beam size 4, length penalty $\alpha = 0.6$, max output length 200).</p>
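<p>The warmup schedule is the standard one from Vaswani et al.; a sketch with the paper's settings (linear warmup to the peak over 4,000 steps, then inverse-square-root decay):</p>

```python
def learning_rate(step, peak_lr=5e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then inverse-square-root decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```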
<h3 id="tokenization">Tokenization</h3>
<p>Three tokenization strategies were compared:</p>
<ul>
<li><strong>BPE</strong>: Byte pair encoding learned on chemical compound names (500 merge operations) via fastBPE</li>
<li><strong>OPSIN-TK</strong>: The OPSIN rule-based tokenizer</li>
<li><strong>OPSIN-TK+BPE</strong>: A hybrid where OPSIN handles tokenizable names and BPE handles the rest</li>
</ul>
<p>SMILES tokens were identified by regular expressions (elements as single tokens, remaining symbols as characters). InChI strings were tokenized by SentencePiece (vocabulary size 1,000).</p>
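<p>The paper's exact expressions are not given; a representative regex tokenizer in the same spirit (bracket atoms and two-letter elements as single tokens, everything else as characters):</p>

```python
import re

# Illustrative pattern: bracket atoms, two-letter elements, common organic
# subset atoms, ring-closure percent codes, then any single character
# (bonds, digits, branch parentheses).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|[BCNOPSFIbcnops]|%\d{2}|.)"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into element and symbol tokens."""
    return SMILES_TOKEN.findall(smiles)
```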
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>OPSIN</strong>: Open-source rule-based parser</li>
<li><strong>Tool A</strong> and <strong>Tool B</strong>: Two commercially available name-to-structure tools</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Tokenizer</th>
          <th>Recall</th>
          <th>Precision</th>
          <th>F-measure</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPSIN</td>
          <td>Rule-based</td>
          <td>0.693</td>
          <td>0.836</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Tool A</td>
          <td>Rule-based</td>
          <td>0.711</td>
          <td>0.797</td>
          <td>0.752</td>
      </tr>
      <tr>
          <td>Tool B</td>
          <td>Rule-based</td>
          <td>0.653</td>
          <td>0.800</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>BPE</td>
          <td>0.793</td>
          <td>0.806</td>
          <td>0.799</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>BPE</td>
          <td>0.798</td>
          <td>0.808</td>
          <td>0.803</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>BPE</td>
          <td>0.810</td>
          <td>0.819</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.763</td>
          <td>0.873</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.768</td>
          <td>0.876</td>
          <td>0.818</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.779</td>
          <td>0.886</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK</td>
          <td>0.755</td>
          <td>0.868</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK</td>
          <td>0.757</td>
          <td>0.867</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK</td>
          <td>0.754</td>
          <td>0.869</td>
          <td>0.807</td>
      </tr>
  </tbody>
</table>
<p>The best configuration (inchigen with OPSIN-TK+BPE) achieved an F-measure of 0.829, surpassing OPSIN by 0.071 points. The multi-task learning approach (inchigen) consistently outperformed the atom-count constraint alone (atomnum) across all tokenizer settings.</p>
<h2 id="key-findings-and-error-analysis">Key Findings and Error Analysis</h2>
<p>The Transformer-based approach produced grammatically correct SMILES strings (parseable by RDKit) for 99% of test examples, compared to 81.6-88.4% for the rule-based tools. Even when predictions were incorrect, they tended to be structurally similar to the correct answer. Using MACCS fingerprints and Jaccard (Tanimoto) similarity, the average similarity between incorrectly predicted and correct structures was 0.753.</p>
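<p>In RDKit, this error analysis would use MACCS keys (<code>MACCSkeys.GenMACCSKeys</code>) with <code>DataStructs.TanimotoSimilarity</code>; as a dependency-free sketch, the Jaccard computation on fingerprint on-bit sets:</p>

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Jaccard (Tanimoto) similarity between two fingerprint on-bit sets."""
    if not bits_a and not bits_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(bits_a & bits_b) / len(bits_a | bits_b)
```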
<p>The OPSIN-TK tokenizer yielded higher precision than BPE because approximately 11.5% of test compounds (1,293 of 11,194) could not be tokenized by OPSIN, so the model emitted fewer, but more reliable, outputs. BPE-based tokenizers achieved higher recall by covering all inputs. The hybrid OPSIN-TK+BPE approach balanced both, achieving the highest overall F-measure.</p>
<p><strong>Limitations</strong>: The paper does not evaluate on IUPAC names separately with the Transformer models (only comparing rule-based tools on IUPAC). The atom-count constraint and multi-task learning are not combined in a single model. The dataset is released but the training code is not. Hardware details and training times are not reported. The evaluation uses only exact-match F-measure and Jaccard similarity, without measuring partial credit for nearly-correct structures.</p>
<p><strong>Future work</strong>: The authors plan to explore additional tokenization methods, combine the atom-count constraint with multi-task learning, and apply the constraint loss to other chemistry problems including chemical reaction prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>5,000,000 pairs</td>
          <td>Chemical compound names to canonical SMILES</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>1,113 pairs</td>
          <td>Filtered for duplicates</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>11,194 pairs</td>
          <td>Filtered for duplicates; released as benchmark</td>
      </tr>
  </tbody>
</table>
<p>The authors state the dataset is released for future research. The data was constructed from the PubChem dump (97M compound records) using RDKit for SMILES canonicalization. Database-like IDs were removed with regular expressions and duplicate names across CIDs were filtered.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq2seq (6 layers, 8 heads, 512-dim embeddings)</li>
<li>BPE tokenization via fastBPE (500 merge operations)</li>
<li>SentencePiece for InChI tokenization (vocabulary size 1,000)</li>
<li>Gumbel-softmax atom-count constraint ($\tau = 0.1$, $\lambda_{atom} = 0.7$)</li>
<li>Multi-task SMILES/InChI loss ($\lambda_{inchi} = 0.3$)</li>
<li>Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$)</li>
<li>Label smoothing ($\epsilon = 0.1$), 300K training steps</li>
<li>Beam search (beam size 4, length penalty $\alpha = 0.6$)</li>
</ul>
<h3 id="models">Models</h3>
<p>Standard Transformer architecture following Vaswani et al. (2017). No pre-trained weights or model checkpoints are released.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Model</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>F-measure</td>
          <td>0.829</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Precision</td>
          <td>0.886</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>0.810</td>
          <td>inchigen (BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Grammatical correctness</td>
          <td>99%</td>
          <td>inchigen (BPE)</td>
          <td>SMILES parseable by RDKit</td>
      </tr>
      <tr>
          <td>Avg. Jaccard similarity (errors)</td>
          <td>0.753</td>
          <td>inchigen (BPE)</td>
          <td>On incorrect predictions only</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Omote, Y., Matsushita, K., Iwakura, T., Tamura, A., &amp; Ninomiya, T. (2020). Transformer-based Approach for Predicting Chemical Compound Structures. <em>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</em>, 154-162. <a href="https://doi.org/10.18653/v1/2020.aacl-main.19">https://doi.org/10.18653/v1/2020.aacl-main.19</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{omote2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based Approach for Predicting Chemical Compound Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Omote, Yutaro and Matsushita, Kyoumoto and Iwakura, Tomoya and Tamura, Akihiro and Ninomiya, Takashi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{154--162}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2020.aacl-main.19}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>t-SMILES: Tree-Based Fragment Molecular Encoding</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</guid><description>t-SMILES encodes fragmented molecules as SMILES-type strings via breadth-first traversal of full binary trees, reducing nesting depth and improving generation.</description><content:encoded><![CDATA[<h2 id="a-fragment-based-molecular-representation-method">A Fragment-Based Molecular Representation Method</h2>
<p>This is a <strong>Method</strong> paper that proposes t-SMILES (tree-based SMILES), a framework for representing molecules as SMILES-type strings derived from fragment-based decompositions. The primary contribution is an encoding algorithm that converts fragmented molecular graphs into full binary trees (FBTs) and then traverses them breadth-first to produce linear strings. Three coding variants are introduced: TSSA (shared atom), TSDY (dummy atom without ID), and TSID (dummy atom with ID). The framework achieves 100% theoretical validity, higher novelty scores, and improved distribution-learning metrics compared to classical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> across ChEMBL, ZINC, and <a href="/notes/chemistry/datasets/qm9/">QM9</a> benchmarks.</p>
<h2 id="why-fragment-based-representations-matter-for-molecular-generation">Why Fragment-Based Representations Matter for Molecular Generation</h2>
<p>Classical SMILES encodes molecules via depth-first traversal of the molecular graph, requiring parentheses and ring identifiers to appear in matched pairs with deep nesting. When generative models (LSTM, Transformer) are trained on SMILES, they frequently produce chemically invalid strings, particularly on small datasets, because they struggle to learn these long-range pairing constraints. DeepSMILES addresses some syntactic issues but still permits semantic violations (e.g., oxygen with three bonds). SELFIES guarantees 100% valid strings, but at the cost of readability and, as the authors show, lower <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a> scores, indicating that generated molecules diverge from the training distribution.</p>
<p>Fragment-based approaches reduce the search space compared to atom-level methods and can provide insights into molecular recognition (e.g., protein-ligand interactions). However, existing fragment-based deep learning methods rely on fixed dictionaries of candidate fragments, creating in-vocabulary/out-of-vocabulary problems and high-dimensional sparse representations. The encoding of fragments as SMILES-type strings, rather than dictionary IDs, had not been systematically explored before this work.</p>
<p>The authors draw on the observation that fragments in organic molecules follow a <a href="https://en.wikipedia.org/wiki/Zipf's_law">Zipf-like</a> rank distribution similar to words in natural language, motivating the use of NLP techniques for fragment-based molecular modeling.</p>
<h2 id="core-innovation-binary-tree-encoding-of-fragmented-molecules">Core Innovation: Binary Tree Encoding of Fragmented Molecules</h2>
<p>The t-SMILES algorithm proceeds in three steps:</p>
<ol>
<li><strong>Fragmentation</strong>: A molecule is decomposed into valid chemical fragments using a chosen algorithm (JTVAE, BRICS, <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">MMPA</a>, or Scaffold), producing a fragmented molecular graph.</li>
<li><strong>Tree construction</strong>: The fragmented graph is converted into an Acyclic Molecular Tree (AMT), which is a reduced graph where nodes represent fragments and edges represent bonds between them. The AMT is then transformed into a Full Binary Tree (FBT), where every internal node has exactly two children.</li>
<li><strong>String generation</strong>: The FBT is traversed using breadth-first search (BFS) to produce the t-SMILES string.</li>
</ol>
<p>The framework introduces only two new symbols beyond standard SMILES: <code>&amp;</code> marks empty tree nodes (branch terminators providing global structural information), and <code>^</code> separates adjacent substructure segments (analogous to spaces between words in English).</p>
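<p>The BFS serialization can be sketched on a toy full binary tree (a simplified illustration of the traversal only; the authors' encoder handles fragment connectivity details this sketch omits, and the exact placement of <code>&amp;</code> and <code>^</code> in real t-SMILES strings differs):</p>

```python
from collections import deque

class Node:
    """FBT node holding one fragment's SMILES; missing children are empty."""
    def __init__(self, smiles, left=None, right=None):
        self.smiles, self.left, self.right = smiles, left, right

def bfs_serialize(root):
    """Breadth-first serialization of a full binary tree.

    '&' marks empty tree nodes (branch terminators); '^' separates
    adjacent fragment segments.
    """
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            out.append("&")
            continue
        out.append(node.smiles)
        queue.append(node.left)
        queue.append(node.right)
    return "^".join(out)
```

Decoding reverses the process: the string is consumed level by level to rebuild the tree before fragment assembly.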
<h3 id="three-coding-variants">Three Coding Variants</h3>
<ul>
<li><strong>TSSA</strong> (shared atom): Two fragments share a real atom at their connection point. Produces the highest novelty scores and is recommended for goal-directed tasks.</li>
<li><strong>TSDY</strong> (dummy atom, no ID): Uses dummy atoms (marked with <code>*</code>) to indicate bonding points. Provides a balanced choice between novelty and distribution fidelity.</li>
<li><strong>TSID</strong> (dummy atom with ID): Uses numbered dummy atoms (<code>[n*]</code>) for unambiguous reconstruction. Produces the most faithful distribution reproduction and is recommended for distribution-learning tasks.</li>
</ul>
<h3 id="structural-advantages">Structural Advantages</h3>
<p>The key structural benefit is a dramatic reduction in nesting depth. For TSDY_M on ChEMBL, the proportion of tokens at nesting depth 0-1-2 increases from 68.0% (SMILES) to 99.3%, while depth 3-4-5 drops from 31.9% to 0.7%, and depth 6-11 drops from 0.1% to 0.0002%. The <code>&amp;</code> symbol, which encodes molecular topology, does not need to appear in pairs (unlike parentheses in SMILES), and its high frequency means it does not create a scarcity problem for learning.</p>
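<p>Nesting-depth statistics of this kind can be reproduced with a simple counter over parentheses (treating each character as a token, which approximates the paper's tokenization):</p>

```python
def depth_profile(s: str) -> dict:
    """Fraction of characters at each parenthesis-nesting depth."""
    depth, counts = 0, {}
    for ch in s:
        if ch == ")":
            depth -= 1          # closing parens count at the outer depth
        counts[depth] = counts.get(depth, 0) + 1
        if ch == "(":
            depth += 1
    return {d: c / len(s) for d, c in counts.items()}
```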
<p>The framework also supports a multi-code system where classical SMILES can be integrated as a special case called TS_Vanilla, and multiple fragmentation-based codes can be combined into hybrid models.</p>
<h3 id="reconstruction-and-data-augmentation">Reconstruction and Data Augmentation</h3>
<p>Molecules can be reconstructed from t-SMILES strings by reversing the process: rebuilding the FBT from the string, converting to AMT, and assembling fragments into a molecular graph. This reconstruction process can itself generate novel molecules without any model training by randomly assembling fragments. On ChEMBL, TSSA reconstruction achieves uniqueness above 0.98 and novelty above 0.68 for all four fragmentation algorithms, with 100% validity.</p>
<p>Data augmentation in t-SMILES operates at four levels: (1) different decomposition algorithms, (2) reconstruction, (3) enumeration of fragment strings, and (4) enumeration of FBTs. Unlike <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (which only produces different strings for the same molecule), t-SMILES reconstruction generates genuinely different molecules from the same fragment set.</p>
<h2 id="systematic-evaluation-across-multiple-benchmarks">Systematic Evaluation Across Multiple Benchmarks</h2>
<p>All experiments use MolGPT (a Transformer-decoder model) as the primary generative model. Three types of metrics are employed: distribution-learning benchmarks, goal-directed benchmarks, and Wasserstein distance metrics for physicochemical properties.</p>
<h3 id="low-resource-datasets-jnk3-and-aid1706">Low-Resource Datasets (JNK3 and AID1706)</h3>
<p>On <a href="https://en.wikipedia.org/wiki/MAPK10">JNK3</a> (923 active molecules), the authors investigate overfitting behavior across training epochs:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Novelty</th>
          <th>FCD</th>
          <th>Active Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES [R200]</td>
          <td>0.795</td>
          <td>0.120</td>
          <td>0.584</td>
          <td>0.072</td>
      </tr>
      <tr>
          <td>SMILES [R2000]</td>
          <td>1.000</td>
          <td>0.001</td>
          <td>0.765</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>SELFIES [R200]</td>
          <td>1.000</td>
          <td>0.238</td>
          <td>0.544</td>
          <td>0.148</td>
      </tr>
      <tr>
          <td>SELFIES [R2000]</td>
          <td>1.000</td>
          <td>0.008</td>
          <td>0.767</td>
          <td>0.050</td>
      </tr>
      <tr>
          <td>TSSA_S [R300]</td>
          <td>1.000</td>
          <td>0.833</td>
          <td>0.564</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>TSSA_S [R5000]</td>
          <td>1.000</td>
          <td>0.817</td>
          <td>0.608</td>
          <td>0.564</td>
      </tr>
      <tr>
          <td>TF_TSSA_S [R5]</td>
          <td>1.000</td>
          <td>0.932</td>
          <td>0.483</td>
          <td>0.710</td>
      </tr>
      <tr>
          <td>TSSA_S_Rec50 [R10]</td>
          <td>1.000</td>
          <td>0.962</td>
          <td>0.389</td>
          <td>0.829</td>
      </tr>
  </tbody>
</table>
<p>Key findings: SMILES and DeepSMILES novelty scores collapse to near zero after 200 epochs, while t-SMILES novelty stabilizes around 0.8. The highest active-novel score of 0.829 comes from t-SMILES with reconstruction-based data augmentation. Transfer learning with t-SMILES maintains novelty of 0.710 at 5 epochs versus 0.526 for SMILES, and at 100 epochs the gap widens dramatically (0.569 vs. 0.023).</p>
<h3 id="distribution-learning-on-chembl">Distribution Learning on ChEMBL</h3>
<p>t-SMILES models outperform graph baselines (Graph MCTS, hG2G, MGM) and fragment-based methods (FASMIFRA). TSID_B and TSID_S achieve FCD scores of 0.909 while maintaining novelty of 0.941 and 0.933, surpassing SMILES (FCD 0.906, novelty 0.907) in both dimensions. TSDY and TSID models consistently outperform TSSA on distribution fidelity for larger molecules.</p>
<h3 id="goal-directed-tasks-on-chembl">Goal-Directed Tasks on ChEMBL</h3>
<p>On 20 <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> subtasks, different fragmentation algorithms excel at different tasks. The goal-directed reconstruction algorithm significantly outperforms random reconstruction. On the <a href="https://en.wikipedia.org/wiki/Sitagliptin">Sitagliptin</a> MPO task (T16.SMPO), the TSDY_M model with goal-directed reconstruction achieves a score of 0.930, compared to 0.598 for SMILES and 0.708 for CReM. On <a href="https://en.wikipedia.org/wiki/Valsartan">Valsartan</a> SMARTS (T18.VS), t-SMILES models reach 0.997 versus 0.985 for SMILES.</p>
<h3 id="distribution-learning-on-zinc-and-qm9">Distribution Learning on ZINC and QM9</h3>
<p>On ZINC, t-SMILES models significantly outperform existing fragment-based baselines (JTVAE, FragDgm). Seven t-SMILES models achieve both higher FCD and novelty scores than SELFIES. On QM9 (smaller molecules), all string-based models achieve high FCD scores (above 0.960), with t-SMILES performing better than existing string and graph approaches.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Across ChEMBL and ZINC, TSDY and TSID models capture physicochemical property distributions (MolWt, LogP, SAScore, N_Atoms, N_Rings, etc.) more faithfully than TSSA models. Multiple t-SMILES models outperform SMILES in more than four out of nine property categories. Baseline models hG2G and JTVAE show the weakest pattern learning, producing molecules with fewer atoms and rings than the training data.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<ol>
<li>t-SMILES achieves 100% theoretical validity by fragmenting molecules into chemically valid pieces before encoding.</li>
<li>The framework avoids the overfitting problem on low-resource datasets, maintaining stable novelty scores where SMILES, DeepSMILES, and SELFIES collapse.</li>
<li>The multi-code system allows different coding algorithms to complement each other, with hybrid models accessing broader chemical space.</li>
<li>Goal-directed reconstruction significantly outperforms all baselines on targeted optimization tasks.</li>
<li>TSDY and TSID provide better distribution fidelity than TSSA on larger molecules, while TSSA excels at novelty generation for goal-directed tasks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Whether the tree structure of t-SMILES can be effectively learned by Large Language Models remains unexplored.</li>
<li>Only published fragmentation algorithms were tested; custom fragmentation schemes were not investigated.</li>
<li>Experiments on more complex (larger) molecules were not performed.</li>
<li>The reconstruction algorithm uses simple rules for fragment assembly; more sophisticated assembly methods (Monte Carlo tree search, CReM) could improve quality.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest exploring advanced reconstruction and optimization algorithms, improved generative models, evolutionary techniques, and extending t-SMILES to property prediction, retrosynthesis, and reaction prediction tasks. The framework is also extensible to other string representations (t-DSMILES, t-SELFIES) by changing how fragments are encoded.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low-resource evaluation</td>
          <td>JNK3</td>
          <td>923 active molecules</td>
          <td>Kinase inhibitors</td>
      </tr>
      <tr>
          <td>Low-resource evaluation</td>
          <td>AID1706</td>
          <td>329 active molecules</td>
          <td>SARS 3CLPro inhibitors</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ChEMBL</td>
          <td>Standard split</td>
          <td>Large drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ZINC</td>
          <td>250K subset</td>
          <td>Medium drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>QM9</td>
          <td>~134K molecules</td>
          <td>Small organic molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: JTVAE, BRICS, MMPA, Scaffold (all via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>)</li>
<li><strong>Tree construction</strong>: AMT from reduced graph, then FBT transformation</li>
<li><strong>Traversal</strong>: Breadth-first search on FBT</li>
<li><strong>Generative model</strong>: MolGPT (Transformer decoder)</li>
<li><strong>Discriminative model</strong>: AttentiveFP for activity prediction on JNK3/AID1706</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated strings that decode to valid molecules</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of distinct molecules among valid generations</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
      <tr>
          <td>KLD</td>
          <td>Kullback-Leibler divergence for physicochemical property distributions</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>Frechet ChemNet Distance measuring chemical similarity to training set</td>
      </tr>
      <tr>
          <td>Active Novel</td>
          <td>Novel molecules predicted active by AttentiveFP</td>
      </tr>
  </tbody>
</table>
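<p>The three count-based metrics reduce to set arithmetic; a minimal sketch where <code>is_valid</code> stands in for an RDKit parse check (<code>Chem.MolFromSmiles</code>):</p>

```python
def distribution_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty over a sample of generated SMILES."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated) if generated else 0.0
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = (len(unique - set(training_set)) / len(unique)
               if unique else 0.0)
    return validity, uniqueness, novelty
```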
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/juanniwu/t-SMILES">t-SMILES GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with training/generation scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/ZENODO.10991703">Zenodo deposit</a></td>
          <td>Code + Data</td>
          <td>CC-BY-4.0</td>
          <td>Archived code and data</td>
      </tr>
      <tr>
          <td><a href="https://codeocean.com/capsule/3034546/tree">Code Ocean capsule</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Certified reproducible compute capsule</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions limited computational resources but does not specify exact GPU types or training times.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, J.-N., Wang, T., Chen, Y., Tang, L.-J., Wu, H.-L., &amp; Yu, R.-Q. (2024). t-SMILES: a fragment-based molecular representation framework for de novo ligand design. <em>Nature Communications</em>, 15, 4993.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{t-SMILES: a fragment-based molecular representation framework for de novo ligand design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Juan-Ni and Wang, Tong and Chen, Yue and Tang, Li-Juan and Wu, Hai-Long and Yu, Ru-Qin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4993}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-49388-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPMM: A Bidirectional Molecular Foundation Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/</guid><description>SPMM is a multimodal molecular foundation model that aligns SMILES structures with property vectors for bidirectional generation and prediction tasks.</description><content:encoded><![CDATA[<h2 id="a-multimodal-foundation-model-for-structure-property-comprehension">A Multimodal Foundation Model for Structure-Property Comprehension</h2>
<p>This is a <strong>Method</strong> paper that introduces the Structure-Property Multi-Modal foundation model (SPMM), a transformer-based architecture that treats SMILES strings and molecular property vectors (PVs) as two separate modalities and learns to align them in a shared embedding space. The primary contribution is enabling bidirectional generation through a single pre-trained model: given a property vector, SPMM can generate molecules (inverse-QSAR), and given a SMILES string, it can predict all 53 properties simultaneously. The model also transfers to unimodal downstream tasks including MoleculeNet benchmarks and reaction prediction.</p>
<h2 id="bridging-the-gap-between-molecular-structure-and-properties">Bridging the Gap Between Molecular Structure and Properties</h2>
<p>Existing chemical pre-trained models typically learn representations from a single modality (SMILES, graphs, or fingerprints) and fine-tune for specific downstream tasks. While some approaches have attempted multimodal learning by combining SMILES with graph representations or InChI strings, these modalities encode nearly identical structural information, limiting the potential for emergent cross-modal knowledge.</p>
<p>The key gap SPMM addresses is the lack of multimodal pre-training that incorporates genuinely complementary modalities. Prior conditional molecule generation methods could typically control only a small number of properties simultaneously and required retraining when target properties changed. The authors draw on successes in vision-language pre-training (VLP), where aligning image and text modalities has enabled rich bidirectional understanding, and apply similar ideas to molecular structure and property domains.</p>
<h2 id="treating-property-vectors-as-a-language">Treating Property Vectors as a Language</h2>
<p>The core innovation in SPMM is treating a collection of 53 RDKit-computed molecular properties as a &ldquo;language&rdquo; where each property value is analogous to a word token. This design allows the model to attend to individual properties independently rather than treating the entire property vector as a single fixed-length condition.</p>
<h3 id="dual-stream-architecture">Dual-Stream Architecture</h3>
<p>SPMM follows the dual-stream VLP architecture. The model has three components:</p>
<ol>
<li><strong>SMILES Encoder</strong>: 6 BERT-base layers that encode tokenized SMILES (using a 300-subword BPE vocabulary) via self-attention</li>
<li><strong>PV Encoder</strong>: 6 BERT-base layers that encode the 53 normalized property values (each passed through a linear layer) with learnable positional embeddings</li>
<li><strong>Fusion Encoder</strong>: 6 BERT-base layers with cross-attention that combines both modalities, using one modality&rsquo;s features as queries and the other as keys/values</li>
</ol>
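<p>As an illustrative sketch (not the authors' implementation), a single cross-attention step of the fusion encoder can be written in NumPy, with SMILES token features as queries and PV token features as keys/values; all names and dimensions below are assumptions for illustration:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d) features from one modality (e.g., SMILES tokens)
    keys, values: (n_kv, d) features from the other modality (e.g., the 53 PV tokens)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_kv) attention logits
    return softmax(scores) @ values          # (n_q, d) fused features

rng = np.random.default_rng(0)
smiles_feats = rng.standard_normal((20, 64))  # hypothetical: 20 SMILES tokens
pv_feats = rng.standard_normal((53, 64))      # hypothetical: 53 property tokens
fused = cross_attention(smiles_feats, pv_feats, pv_feats)
assert fused.shape == (20, 64)
```

In the actual model each of the 6 fusion layers interleaves self-attention, cross-attention, and feed-forward sublayers in the BERT-base configuration; the sketch shows only the modality-mixing step.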
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>The model is pre-trained with four complementary losses:</p>
<p><strong>Contrastive Learning</strong> aligns SMILES and PV features in a shared embedding space. For [CLS] token outputs $\mathbf{S}_{cls}$ and $\mathbf{P}_{cls}$:</p>
<p>$$
\text{sim}(\mathbf{S}, \mathbf{P}) = \left(h_{S}(\mathbf{S}_{cls})\right)^{\top} h_{P}(\mathbf{P}_{cls})
$$</p>
<p>These similarities are converted into softmax-normalized distributions with a learnable temperature $\tau$; for the SMILES-to-PV direction:</p>
<p>$$
s_{s2p} = \frac{\exp(\text{sim}(\mathbf{S}, \mathbf{P}) / \tau)}{\sum_{n=1}^{N} \exp(\text{sim}(\mathbf{S}, \mathbf{P}_{n}) / \tau)}
$$</p>
<p>The contrastive loss applies cross-entropy $H$ with one-hot labels (1 for same-molecule pairs) to both the intermodal directions (s2p, p2s) and the intramodal ones (s2s, p2p):</p>
<p>$$
L_{\text{contrastive}} = \frac{1}{2}\left(H(y_{s2p}, s_{s2p}) + H(y_{p2s}, s_{p2s}) + H(y_{s2s}, s_{s2s}) + H(y_{p2p}, s_{p2p})\right)
$$</p>
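<p>A minimal NumPy sketch of the intermodal half of this objective, assuming the projection heads $h_S$, $h_P$ have already been applied and using plain one-hot labels (i.e., without the momentum distillation described below):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def info_nce(s_cls, p_cls, tau=0.07):
    """Symmetric cross-entropy over in-batch similarities.

    s_cls, p_cls: (N, d) projected [CLS] embeddings; row i of each
    comes from the same molecule, so the target is the diagonal.
    """
    s = s_cls / np.linalg.norm(s_cls, axis=1, keepdims=True)
    p = p_cls / np.linalg.norm(p_cls, axis=1, keepdims=True)
    sim = s @ p.T / tau                  # (N, N) temperature-scaled similarities
    s2p = softmax(sim, axis=1)           # SMILES -> PV distributions
    p2s = softmax(sim.T, axis=1)         # PV -> SMILES distributions
    diag = np.arange(len(s))
    return -0.5 * (np.log(s2p[diag, diag]).mean() + np.log(p2s[diag, diag]).mean())

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16))
# Aligned pairs should incur a lower loss than misaligned ones
assert info_nce(x, x) < info_nce(x, np.roll(x, 1, axis=0))
```

The value of $\tau$ here is a conventional default, not one reported in the paper.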
<p><strong>Next Word Prediction (NWP)</strong> trains autoregressive SMILES generation conditioned on the PV:</p>
<p>$$
L_{NWP} = \sum_{i=1}^{n} H\left(y_{i}^{NWP}, p^{NWP}(s_{i} \mid s_{0:i-1}, \mathbf{P})\right)
$$</p>
<p><strong>Next Property Prediction (NPP)</strong> applies the same autoregressive concept to property values, using mean-square-error loss:</p>
<p>$$
L_{NPP} = \sum_{i=1}^{n} \left(p_{i} - \hat{p}_{i}(p_{0:i-1}, \mathbf{S})\right)^{2}
$$</p>
<p><strong>SMILES-PV Matching (SPM)</strong> is a binary classification loss predicting whether a SMILES-PV pair originated from the same molecule, trained with hard-negative mining.</p>
<p>The overall pre-training loss combines all four:</p>
<p>$$
L = \widetilde{L}_{\text{contrastive}} + \widetilde{L}_{NWP} + L_{NPP} + L_{SPM}
$$</p>
<p>where tildes indicate the use of momentum teacher distillation to soften one-hot labels, acknowledging that multiple valid SMILES-PV pairings may exist.</p>
<h3 id="random-property-masking">Random Property Masking</h3>
<p>During pre-training, 50% of property values are randomly replaced with a special [UNK] token. This serves three purposes: preventing overfitting to specific properties, augmenting data, and enabling flexible inference where users can specify any subset of the 53 properties as generation conditions. The model can handle all $2^{53}$ possible property combinations at inference time despite never seeing most of them during training.</p>
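<p>The masking scheme amounts to replacing each property independently with probability 0.5 during training, and letting the user fix an arbitrary subset at inference; a sketch (the helper names are illustrative, not from the paper's code):</p>

```python
import random

UNK = "[UNK]"

def mask_properties(pv, p_mask=0.5, rng=None):
    """Randomly replace property values with [UNK] during pre-training."""
    rng = rng or random
    return [UNK if rng.random() < p_mask else v for v in pv]

def condition_on(subset, n_props=53):
    """Inference-time PV controlling only the given {index: value} subset;
    all unspecified properties are left as [UNK]."""
    return [subset.get(i, UNK) for i in range(n_props)]

pv = condition_on({0: 150.0})  # e.g., fix one property, leave 52 free
assert pv[0] == 150.0 and pv.count(UNK) == 52
```
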
<h2 id="experiments-across-bidirectional-and-unimodal-tasks">Experiments Across Bidirectional and Unimodal Tasks</h2>
<h3 id="pv-to-smiles-generation-conditional-molecule-design">PV-to-SMILES Generation (Conditional Molecule Design)</h3>
<p>The authors evaluate SPMM on multiple generation scenarios using 1000 unseen PubChem PVs:</p>
<table>
  <thead>
      <tr>
          <th>Sampling</th>
          <th>Input PV</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Norm. RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deterministic</td>
          <td>1000 unseen PVs</td>
          <td>0.995</td>
          <td>0.999</td>
          <td>0.961</td>
          <td>0.216</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Full PV (molecule 1)</td>
          <td>0.974</td>
          <td>0.905</td>
          <td>0.998</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Molar mass = 150</td>
          <td>0.974</td>
          <td>0.945</td>
          <td>0.872</td>
          <td>0.192</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>4 properties controlled</td>
          <td>0.998</td>
          <td>0.981</td>
          <td>0.952</td>
          <td>0.257</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>No control (all [UNK])</td>
          <td>0.971</td>
          <td>0.991</td>
          <td>0.950</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The normalized RMSE of 0.216 across 53 properties indicates that generated molecules closely match the input property conditions. The model can also perform unconditional generation (all properties masked) where outputs follow the pre-training distribution. The authors report that SPMM outperforms benchmark models including MolGAN, GraphVAE, and scaffold-based graph generative models in both conditional and unconditional settings (Supplementary Table 1).</p>
<h3 id="smiles-to-pv-generation-multi-property-prediction">SMILES-to-PV Generation (Multi-Property Prediction)</h3>
<p>When given 1000 unseen ZINC15 molecules, SPMM predicts all 53 properties autoregressively with a mean $r^{2}$ of 0.924 across all properties.</p>
<h3 id="moleculenet-benchmarks">MoleculeNet Benchmarks</h3>
<p>Using only the SMILES encoder (6 BERT layers), SPMM achieves best or competitive performance on 9 <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SPMM</th>
          <th>Best Baseline</th>
          <th>Baseline Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.817</td>
          <td>0.798</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>LIPO</td>
          <td>RMSE</td>
          <td>0.681</td>
          <td>0.660</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>1.868</td>
          <td>1.877</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>BACE (reg)</td>
          <td>RMSE</td>
          <td>1.041</td>
          <td>1.047</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Clearance</td>
          <td>RMSE</td>
          <td>42.607</td>
          <td>43.175</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>AUROC</td>
          <td>75.1%</td>
          <td>73.6%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BACE (cls)</td>
          <td>AUROC</td>
          <td>84.4%</td>
          <td>86.3%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>AUROC</td>
          <td>92.7%</td>
          <td>91.2%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>AUROC</td>
          <td>66.9%</td>
          <td>67.2%</td>
          <td>ChemRL-GEM</td>
      </tr>
  </tbody>
</table>
<p>SPMM achieved best performance on 5 of 9 tasks, with notable gains on BBBP (75.1% vs. 73.6%) and ClinTox (92.7% vs. 91.2%). Without pre-training, all scores dropped substantially.</p>
<h3 id="dili-classification">DILI Classification</h3>
<p>On Drug-Induced Liver Injury prediction, SPMM achieved 92.6% AUROC, outperforming the 5-ensemble model of Ai et al. (90.4% AUROC) while using a single model.</p>
<h3 id="reaction-prediction">Reaction Prediction</h3>
<p>On USPTO-480k forward reaction prediction, SPMM achieved 91.5% top-1 accuracy, the highest among all models tested (including <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> at 91.3%). On USPTO-50k retro-reaction prediction, SPMM reached 53.4% top-1 accuracy, second only to Chemformer (54.3%) among string-based models.</p>
<h2 id="bidirectional-generation-from-a-single-pre-trained-model">Bidirectional Generation From a Single Pre-trained Model</h2>
<p>SPMM demonstrates that multimodal pre-training with genuinely complementary modalities (structure and properties, rather than structurally redundant representations) enables a single foundation model to handle both generation directions and downstream unimodal tasks. Key findings include:</p>
<ol>
<li><strong>Flexible conditional generation</strong>: The [UNK] masking strategy allows controlling any subset of 53 properties at inference time without retraining, a capability not demonstrated by prior methods.</li>
<li><strong>Interpretable cross-attention</strong>: Attention visualizations show that the model learns chemically meaningful structure-property relationships (e.g., hydrogen bonding properties attend to oxygen and nitrogen atoms; ring count properties attend to ring tokens).</li>
<li><strong>Competitive unimodal transfer</strong>: Despite using only 6 BERT layers and 50M pre-training molecules (smaller than <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>&rsquo;s 77M or Chemformer&rsquo;s 100M), the SMILES encoder alone achieves best or second-best results on 5 of 9 MoleculeNet tasks and the highest forward reaction prediction accuracy among tested models.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>SMILES representation constraints</strong>: Implicit connectivity information in SMILES means small structural changes can cause drastic string changes. Graph representations could be a complementary alternative.</li>
<li><strong>Stereochemistry blindness</strong>: All 53 RDKit properties used are invariant to stereochemistry, meaning different stereoisomers produce identical PVs. The contrastive loss then forces their SMILES encoder outputs to converge, which the authors identify as the primary factor limiting MoleculeNet performance on stereo-sensitive tasks.</li>
<li><strong>No wet-lab validation</strong>: Generated molecules and predicted properties are not experimentally verified.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>50M molecules</td>
          <td>SMILES + 53 RDKit properties</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>642-4200 per task</td>
          <td>Scaffold split via DeepChem (8:1:1)</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>Ai et al. dataset</td>
          <td>Not specified</td>
          <td>Following published preparation</td>
      </tr>
      <tr>
          <td>Forward reaction</td>
          <td>USPTO-480k</td>
          <td>479,035 pairs</td>
          <td>Reactant-product pairs</td>
      </tr>
      <tr>
          <td>Retro reaction</td>
          <td>USPTO-50k</td>
          <td>50,037 pairs</td>
          <td>Product-reactant pairs, no reaction types used</td>
      </tr>
      <tr>
          <td>SMILES-to-PV test</td>
          <td>ZINC15</td>
          <td>1000 molecules</td>
          <td>Not in pre-training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: BPE with 300-subword dictionary</li>
<li><strong>Property masking</strong>: 50% random replacement with [UNK] during pre-training</li>
<li><strong>Momentum distillation</strong>: EMA parameter $\lambda = 0.995$, soft-label mixing $\alpha$ linearly warmed from 0 to 0.4 over first epoch</li>
<li><strong>Contrastive queue</strong>: Size $k = 24{,}576$ for storing recent SMILES and PV instances</li>
<li><strong>Beam search</strong>: $k = 2$ for PV-to-SMILES generation</li>
<li><strong>SMILES augmentation</strong>: Random non-canonical augmentation with probability 0.5 for reaction tasks</li>
</ul>
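<p>The momentum-distillation bookkeeping listed above follows a standard EMA-teacher recipe; a sketch with the stated $\lambda = 0.995$ and label-mixing weight $\alpha$ (function names are illustrative):</p>

```python
def ema_update(teacher, student, lam=0.995):
    """Exponential moving average of student parameters into the teacher."""
    return {k: lam * teacher[k] + (1 - lam) * student[k] for k in teacher}

def soft_labels(one_hot, teacher_probs, alpha):
    """Mix one-hot targets with teacher predictions; alpha warms from 0 to 0.4."""
    return [(1 - alpha) * y + alpha * q for y, q in zip(one_hot, teacher_probs)]

t = ema_update({"w": 1.0}, {"w": 0.0})
assert abs(t["w"] - 0.995) < 1e-12
assert soft_labels([1.0, 0.0], [0.5, 0.5], alpha=0.4) == [0.8, 0.2]
```
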
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: 6 BERT-base encoder layers each for SMILES encoder, PV encoder, and fusion encoder (18 total layers)</li>
<li><strong>Vocabulary</strong>: 300 BPE subwords for SMILES; 53 property tokens for PV</li>
<li><strong>Pre-trained weights</strong>: Available via GitHub</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Validity</td>
          <td>99.5%</td>
          <td>1000 unseen PubChem PVs</td>
      </tr>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Normalized RMSE</td>
          <td>0.216</td>
          <td>Across 53 properties</td>
      </tr>
      <tr>
          <td>SMILES-to-PV</td>
          <td>Mean $r^{2}$</td>
          <td>0.924</td>
          <td>1000 ZINC15 molecules</td>
      </tr>
      <tr>
          <td>Forward reaction (USPTO-480k)</td>
          <td>Top-1 accuracy</td>
          <td>91.5%</td>
          <td>Best among all tested models</td>
      </tr>
      <tr>
          <td>Retro reaction (USPTO-50k)</td>
          <td>Top-1 accuracy</td>
          <td>53.4%</td>
          <td>Second-best string-based</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>AUROC</td>
          <td>92.6%</td>
          <td>Single model vs. 5-ensemble</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Pre-training</strong>: 8 NVIDIA A100 GPUs, approximately 52,000 batch iterations, roughly 12 hours</li>
<li><strong>Batch size</strong>: 96</li>
<li><strong>Optimizer</strong>: AdamW with weight decay 0.02</li>
<li><strong>Learning rate</strong>: Warmed up to $10^{-4}$, cosine decay to $10^{-5}$</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jinhojsk515/SPMM">SPMM Source Code</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with experimental scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10567599">SPMM Zenodo Archive</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Archived version for reproducibility</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>50M molecules for pre-training</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Varies</td>
          <td>Benchmark datasets via DeepChem</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, J., &amp; Ye, J. C. (2024). Bidirectional generation of structure and properties through a single molecular foundation model. <em>Nature Communications</em>, 15, 2323. <a href="https://doi.org/10.1038/s41467-024-46440-3">https://doi.org/10.1038/s41467-024-46440-3</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chang2024bidirectional,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Bidirectional generation of structure and properties through a single molecular foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chang, Jinho and Ye, Jong Chul}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2323}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-46440-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPE: Data-Driven SMILES Substructure Tokenization</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</guid><description>SMILES Pair Encoding adapts byte pair encoding to learn chemically meaningful substructure tokens from SMILES, improving generation and QSAR prediction.</description><content:encoded><![CDATA[<h2 id="a-data-driven-tokenization-method-for-chemical-deep-learning">A Data-Driven Tokenization Method for Chemical Deep Learning</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Pair Encoding (SPE), a tokenization algorithm adapted from <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">byte pair encoding (BPE)</a> in natural language processing. The primary contribution is a data-driven approach that learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset and then uses that vocabulary to tokenize SMILES for downstream deep learning tasks. The authors provide an open-source Python package (SmilesPE) and demonstrate improvements on both molecular generation and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> prediction benchmarks.</p>
<h2 id="limitations-of-atom-level-smiles-tokenization">Limitations of Atom-Level SMILES Tokenization</h2>
<p>SMILES-based deep learning models require tokenization to convert molecular strings into sequences of discrete units. The standard approaches have well-known drawbacks:</p>
<ul>
<li><strong>Character-level tokenization</strong> breaks SMILES character by character, splitting chemically meaningful multi-character atoms. For example, <code>[C@@H]</code> becomes six separate tokens (<code>[</code>, <code>C</code>, <code>@</code>, <code>@</code>, <code>H</code>, <code>]</code>), losing the stereochemistry information of a single carbon.</li>
<li><strong>Atom-level tokenization</strong> addresses some of these issues by treating multi-character element symbols (Cl, Br) and bracketed atoms ([nH], [O-]) as single tokens. However, these tokens still encode only individual atoms, not substructures.</li>
<li><strong>k-mer tokenization</strong> (sequences of k consecutive overlapping characters) captures some connectivity information but suffers from the out-of-vocabulary problem: the model cannot represent k-mers not seen during training.</li>
</ul>
<p>All three approaches produce relatively long input sequences (mean ~40 tokens per molecule on ChEMBL at the atom level), which increases computational cost for sequential architectures like RNNs and exacerbates long-range dependency issues.</p>
<h2 id="core-innovation-adapting-byte-pair-encoding-for-smiles">Core Innovation: Adapting Byte Pair Encoding for SMILES</h2>
<p>SPE adapts the byte pair encoding algorithm, originally developed for data compression and later adopted for subword tokenization in NLP, to the domain of chemical strings. The algorithm has two phases:</p>
<p><strong>Vocabulary training:</strong></p>
<ol>
<li>Tokenize SMILES from a large dataset (ChEMBL) at the atom level</li>
<li>Initialize the vocabulary with all unique atom-level tokens</li>
<li>Iteratively count the frequency of all adjacent token pairs, merge the most frequent pair into a new token, and add it to the vocabulary</li>
<li>Stop when either the maximum vocabulary size (MVS) or a minimum frequency threshold (FT) is reached</li>
</ol>
<p><strong>Tokenization:</strong> Given a trained SPE vocabulary, a new SMILES string is first tokenized at the atom level, then token pairs are iteratively merged according to their frequency rank in the vocabulary until no further merges are possible.</p>
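<p>Both phases can be sketched in a few lines of Python; this is a simplified illustration of the pair-merging idea, not the SmilesPE package (the atom-level regex is deliberately minimal, and a merge count stands in for the MVS/FT stopping criteria):</p>

```python
import re
from collections import Counter

# Simplified atom-level tokenizer: bracketed atoms, two-letter halogens, then single chars
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def atom_tokenize(smiles):
    return ATOM_PATTERN.findall(smiles)

def merge_pair(toks, pair):
    """Merge every occurrence of an adjacent token pair into one token."""
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
            out.append(toks[i] + toks[i + 1])
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

def train_spe(corpus, num_merges):
    """Learn an ordered list of merges from most frequent adjacent pairs."""
    tokenized = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in tokenized:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        tokenized = [merge_pair(toks, best) for toks in tokenized]
    return merges

def spe_tokenize(smiles, merges):
    """Apply learned merges, in training order, to a new SMILES string."""
    toks = atom_tokenize(smiles)
    for pair in merges:
        toks = merge_pair(toks, pair)
    return toks

merges = train_spe(["CCO", "CCN", "CCC"], num_merges=1)
assert spe_tokenize("CCO", merges) == ["CC", "O"]
```
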
<p>The key hyperparameters are MVS and FT. In the reported experiments, MVS was set to 30,000 and FT was set to 2,000. The vocabulary was trained on ~3.4 million SMILES (both canonical and one non-canonical variant per molecule) from ChEMBL25. The resulting vocabulary contained 3,002 unique SMILES substrings with lengths ranging from 1 to 22 atom-level characters.</p>
<p>The trained SPE vocabulary produces tokens that are human-readable and correspond to chemically meaningful substructures and functional groups. SPE tokenization reduces the mean sequence length from approximately 40 tokens (atom-level) to approximately 6 tokens on ChEMBL, a roughly 6-7x compression. This shorter representation directly reduces computational cost for RNN-based and other sequential models.</p>
<p>The algorithm is also compatible with other text-based molecular representations such as <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, since these share atom-level character structures that can serve as the starting point for pair merging.</p>
<h2 id="molecular-generation-and-qsar-prediction-experiments">Molecular Generation and QSAR Prediction Experiments</h2>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>The authors trained AWD-LSTM language models with SPE and atom-level tokenization on 9 million SMILES (1 canonical + 5 non-canonical per compound from ChEMBL25). Each model sampled 1 million SMILES for evaluation. The AWD-LSTM architecture used an embedding size of 400, three LSTM layers with 1,152 hidden units each, and various dropout settings (embedding: 0.1, input: 0.6, weight: 0.5, hidden: 0.2). Models were trained for 10 epochs with a base learning rate of 0.008 using one-cycle scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SPE</th>
          <th>Atom-level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>0.941</td>
          <td>0.970</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.994</td>
          <td>0.992</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.983</td>
          <td>0.978</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.897</td>
          <td>0.886</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>0.391</td>
          <td>0.386</td>
      </tr>
  </tbody>
</table>
<p>The SPE model generated a more diverse population of novel molecules at the cost of slightly lower validity (94.1% vs. 97.0%). Internal diversity is defined as:</p>
<p>$$
\text{Internal diversity} = 1 - \frac{1}{|G|^{2}} \sum_{(x_1, x_2) \in G \times G} T(x_1, x_2)
$$</p>
<p>where $T(x_1, x_2)$ is the Tanimoto similarity between molecules $x_1$ and $x_2$ using 1024-bit ECFP6 fingerprints. Nearest neighbor similarity (SNN) measures how well the generated set resembles the reference set:</p>
<p>$$
\text{SNN} = \frac{1}{|G|} \sum_{x_G \in G} \max_{x_R \in R} T(x_G, x_R)
$$</p>
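<p>Both metrics reduce to set operations on fingerprint on-bits; a sketch using Python sets in place of 1024-bit ECFP6 vectors (the fingerprints themselves would come from RDKit, which is assumed here rather than shown):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def internal_diversity(gen):
    """One minus the mean pairwise similarity over all ordered pairs in the generated set."""
    n = len(gen)
    total = sum(tanimoto(x1, x2) for x1 in gen for x2 in gen)
    return 1.0 - total / (n * n)

def snn(gen, ref):
    """Mean over generated molecules of the max similarity to any reference molecule."""
    return sum(max(tanimoto(g, r) for r in ref) for g in gen) / len(gen)

G = [{1, 2, 3}, {3, 4, 5}]                    # two toy "fingerprints"
assert abs(internal_diversity(G) - 0.4) < 1e-9  # pairwise sims: 1, 0.2, 0.2, 1
```
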
<p>Substructure coverage analysis showed both models recovered the same top-1000 BRICS fragments (100% coverage), but SPE consistently outperformed atom-level tokenization on top-5000 coverage across all four substructure types: BRICS fragments (0.997 vs. 0.987), functional groups (0.688 vs. 0.659), scaffolds (0.872 vs. 0.825), and ring systems (0.781 vs. 0.761).</p>
<h3 id="qsar-prediction">QSAR Prediction</h3>
<p>QSAR models were built using the <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT transfer learning framework</a>, which pre-trains a language model on ChEMBL and then fine-tunes it for specific prediction tasks. The evaluation used 24 regression benchmarks (pIC50 values) from Cortes-Ciriano et al., covering targets ranging from 199 molecules (alpha-2a adrenergic receptor) to 5,010 molecules (<a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a>). Models were evaluated on 10 random 80:10:10 splits using RMSE, R-squared, and MAE. Random forest models with 1024-bit ECFP6 were included as baseline comparisons.</p>
<p><a href="https://en.wikipedia.org/wiki/Effect_size">Cohen&rsquo;s d</a> effect sizes were computed to quantify performance differences between tokenization methods. SPE performed comparably or better than atom-level tokenization on 23 out of 24 datasets. Notable results with medium or large effect sizes favoring SPE included <a href="https://en.wikipedia.org/wiki/Cannabinoid_receptor_1">cannabinoid CB1 receptor</a> (large effect), A2a adrenergic receptor, LCK, estrogen receptor, and <a href="https://en.wikipedia.org/wiki/Aurora_kinase_A">Aurora-A kinase</a> (all medium effects). Against k-mer tokenization, SPE matched or outperformed on 22 out of 24 datasets.</p>
<p>Cohen&rsquo;s d is defined as:</p>
<p>$$
\text{Cohen&rsquo;s } d = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(\text{SD}_1^2 + \text{SD}_2^2) / 2}}
$$</p>
<p>where $\bar{x}_1, \bar{x}_2$ are the group means and $\text{SD}_1, \text{SD}_2$ are the standard deviations. Thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large) were used following standard recommendations.</p>
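<p>This definition and the thresholds translate directly into code; a minimal implementation using the standard library:</p>

```python
import statistics

def cohens_d(x1, x2):
    """Cohen's d with the average-variance pooled standard deviation."""
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
    sd1, sd2 = statistics.stdev(x1), statistics.stdev(x2)
    return (m1 - m2) / (((sd1 ** 2 + sd2 ** 2) / 2) ** 0.5)

def effect_label(d):
    """Standard 0.2 / 0.5 / 0.8 thresholds for small / medium / large effects."""
    d = abs(d)
    return ("large" if d >= 0.8 else
            "medium" if d >= 0.5 else
            "small" if d >= 0.2 else
            "negligible")

assert cohens_d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
assert effect_label(0.6) == "medium"
```
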
<p>SMILES-based deep learning models generally performed on par with or better than the RF baseline, with particularly strong advantages on the four largest datasets (<a href="https://en.wikipedia.org/wiki/Cyclooxygenase-2">COX-2</a>, <a href="https://en.wikipedia.org/wiki/Acetylcholinesterase">acetylcholinesterase</a>, erbB1, and hERG).</p>
<p>In addition to performance gains, SPE-based models trained on average 5 times faster than atom-level models due to the shorter input sequences.</p>
<h2 id="results-summary-and-future-directions">Results Summary and Future Directions</h2>
<p>The main findings of this study are:</p>
<ol>
<li>
<p><strong>SPE produces chemically meaningful tokens.</strong> The learned vocabulary contains human-readable SMILES substrings that correspond to common substructures and functional groups, making model interpretations more accessible.</p>
</li>
<li>
<p><strong>SPE compresses input sequences by ~6-7x.</strong> Mean token sequence length drops from ~40 (atom-level) to ~6 (SPE) on ChEMBL, yielding a ~5x training speedup.</p>
</li>
<li>
<p><strong>SPE improves molecular generation diversity.</strong> The SPE-based generative model produces molecules with higher novelty (98.3% vs. 97.8%), internal diversity (0.897 vs. 0.886), and substructure coverage, at the cost of slightly lower validity (94.1% vs. 97.0%).</p>
</li>
<li>
<p><strong>SPE matches or outperforms atom-level and k-mer tokenization on QSAR prediction.</strong> Across 24 benchmarks, SPE showed comparable or better performance in 23/24 comparisons against atom-level and 22/24 against k-mer tokenization.</p>
</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The SPE vocabulary is trained on a specific dataset (ChEMBL25) and may not optimally represent chemical spaces that differ significantly from drug-like compounds.</li>
<li>The validity rate for molecular generation is slightly lower than atom-level tokenization (94.1% vs. 97.0%), since longer substructure tokens can introduce invalid fragments.</li>
<li>The k-mer tokenization suffers from an out-of-vocabulary problem, which the authors address by replacing unseen 4-mers with <code>[UNK]</code> tokens, but this is a limitation of the comparison rather than of SPE itself.</li>
</ul>
<p><strong>Future directions:</strong> The authors suggest SPE could serve as a general tokenization method for SMILES-based deep learning, applicable to any task where SMILES strings are used as input (<a href="/notes/chemistry/molecular-design/generation/">generation</a>, <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/">reaction prediction</a>, retrosynthesis). The algorithm can also be applied to DeepSMILES and SELFIES representations without modification.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SPE vocabulary training</td>
          <td>ChEMBL25</td>
          <td>~3.4M SMILES</td>
          <td>1 canonical + 1 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Language model training</td>
          <td>ChEMBL25 augmented</td>
          <td>~9M SMILES</td>
          <td>1 canonical + 5 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Molecular generation evaluation</td>
          <td>Sampled from model</td>
          <td>1M SMILES per model</td>
          <td>Validated with RDKit</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>Cortes-Ciriano et al.</td>
          <td>24 datasets, 199-5010 molecules</td>
          <td>pIC50 regression tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SPE vocabulary training: iterative pair merging with a maximum vocabulary size (MVS) of 30,000 and a token frequency threshold (FT) of 2,000</li>
<li>Language model: AWD-LSTM with embedding size 400, 3 LSTM layers with 1,152 hidden units</li>
<li>Dropout: embedding=0.1, input=0.6, weight=0.5, hidden=0.2</li>
<li>Training: 10 epochs, base learning rate 0.008, one-cycle policy</li>
<li>QSAR: MolPMoFiT transfer learning with 25x training augmentation and 15x validation augmentation</li>
<li>Test time augmentation: average of canonical + 4 augmented SMILES predictions</li>
<li>RF baseline: 500 trees, 1024-bit ECFP6, default scikit-learn parameters</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>AWD-LSTM architecture from Merity et al. (2018)</li>
<li>MolPMoFiT framework from Li and Fourches (2020) for transfer learning QSAR</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity, Uniqueness, Novelty</td>
          <td>Generation</td>
          <td>Basic quality metrics</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>Generation</td>
          <td>1 - mean pairwise Tanimoto (ECFP6)</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>Generation</td>
          <td>Mean max Tanimoto to reference set</td>
      </tr>
      <tr>
          <td>Substructure coverage</td>
          <td>Generation</td>
          <td>BRICS, functional groups, scaffolds, ring systems</td>
      </tr>
      <tr>
          <td>RMSE, R-squared, MAE</td>
          <td>QSAR regression</td>
          <td>10 random 80:10:10 splits</td>
      </tr>
      <tr>
          <td>Cohen&rsquo;s d</td>
          <td>QSAR comparison</td>
          <td>Effect size between tokenization methods</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/SmilesPE">SmilesPE</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>SPE tokenization Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transfer learning QSAR framework</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2021). SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 61(4), 1560-1569. <a href="https://doi.org/10.1021/acs.jcim.0c01127">https://doi.org/10.1021/acs.jcim.0c01127</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2021smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1560--1569}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01127}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Smirk: Complete Tokenization for Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</guid><description>Smirk tokenizer achieves full OpenSMILES coverage with 165 tokens by decomposing bracketed atoms into glyphs, validated via n-gram proxy models.</description><content:encoded><![CDATA[<h2 id="a-method-for-complete-chemical-tokenization">A Method for Complete Chemical Tokenization</h2>
<p>This is a <strong>Method</strong> paper that introduces two new tokenizers for molecular foundation models: Smirk and Smirk-GPE. The primary contribution is a tokenization scheme that achieves complete coverage of the OpenSMILES specification using only 165 tokens, addressing the vocabulary gaps present in existing atom-wise tokenizers. The paper also proposes n-gram language models as low-cost proxy evaluators for tokenizer quality and validates these proxies against 18 transformer-based models across multiple benchmarks.</p>
<h2 id="vocabulary-gaps-in-molecular-tokenization">Vocabulary Gaps in Molecular Tokenization</h2>
<p>Molecular foundation models overwhelmingly use &ldquo;atom-wise&rdquo; tokenization, where SMILES strings are split at atom boundaries using a regular expression first proposed by Schwaller et al. A key pattern in this regex treats all &ldquo;bracketed atoms&rdquo; (e.g., <code>[C@@H]</code>, <code>[18F]</code>, <code>[Au+]</code>) as single, irreducible tokens. Since bracketed atoms encode isotopes, chirality, charge, hydrogen count, and element identity, the number of possible permutations under the OpenSMILES specification exceeds 28 trillion. In practice, existing atom-wise tokenizers maintain vocabularies of fewer than 3,000 tokens, leaving large portions of chemical space unrepresentable.</p>
<p>This gap has real consequences. Many chemistry-specific tokenizers emit the unknown token <code>[UNK]</code> at non-negligible frequencies, particularly on datasets with diverse elements and stereochemistry. For example, <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SPE and APE</a> tokenizers produce <code>[UNK]</code> for roughly 19% of tokens on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and approximately 50% on the tmQM transition metal complex dataset. Even models like <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/">ReactionT5</a> lack tokens for elements such as copper, ruthenium, gold, and uranium.</p>
<p>The authors also note a subtler issue: some open-vocabulary tokenizers (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa&rsquo;s</a> BPE) conflate chemically distinct entities. The same <code>Sc</code> token may represent both a sulfur-carbon bond (in organic SMILES) and the element scandium (in <code>[Sc]</code>), creating ambiguity in downstream analysis.</p>
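<p>As a concrete illustration (not the exact production regex of any particular model), an atom-wise tokenizer in the style of Schwaller et al. can be written with a single regular expression. Note how the entire bracketed atom survives as one token, so any bracketed atom absent from a finite vocabulary collapses to <code>[UNK]</code>; the toy <code>vocab</code> below is invented for the example.</p>

```python
import re

# Atom-wise SMILES tokenizer in the style of the Schwaller et al. regex.
# Each bracketed atom ([...]) is captured as a single, irreducible token.
ATOMWISE = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d"
)

def tokenize(smiles):
    return ATOMWISE.findall(smiles)

tokens = tokenize("C[C@@H](O)[18F]")
# ['C', '[C@@H]', '(', 'O', ')', '[18F]']

# A finite vocabulary without '[18F]' has no choice but to emit [UNK].
vocab = {"C", "O", "(", ")", "[C@@H]"}
mapped = [t if t in vocab else "[UNK]" for t in tokens]
# ['C', '[C@@H]', '(', 'O', ')', '[UNK]']
```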
<h2 id="smirk-glyph-level-decomposition-of-smiles">Smirk: Glyph-Level Decomposition of SMILES</h2>
<p>The core insight behind Smirk is to fully decompose bracketed atoms into their constituent &ldquo;glyphs,&rdquo; the primitive symbols defined by the OpenSMILES specification (element symbols, chirality markers, charges, isotope numbers, hydrogen counts, and brackets themselves). This transforms tokenization from a word-level scheme (one token per bracketed atom) to a character-level scheme over chemically meaningful glyphs.</p>
<p>Smirk uses a two-stage tokenization process:</p>
<ol>
<li><strong>Atom decomposition</strong>: Split a SMILES string into atom-level units using a regex (e.g., <code>OC[C@@H][OH]</code> becomes <code>O C [C@@H] [OH]</code>).</li>
<li><strong>Glyph decomposition</strong>: Further split each unit into its constituent glyphs (e.g., <code>[C@@H]</code> becomes <code>[ C @@ H ]</code>).</li>
</ol>
<p>The two-stage process is necessary to resolve ambiguities. For example, <code>Sc</code> in an unbracketed context represents a sulfur-carbon bond, while <code>[Sc]</code> denotes scandium. This ambiguity occurs over half a million times in PubChem&rsquo;s compound dataset.</p>
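<p>A toy version of the two-stage split can be written in a few lines of Python (the real Smirk tokenizer is implemented in Rust, and the regexes below cover only a fragment of OpenSMILES; they are invented for illustration):</p>

```python
import re

# Stage 1: split into atom-level units (bracketed atoms kept whole for now).
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[()=#+\-\\/%.\d@]")
# Stage 2: split a bracketed atom into glyphs -- isotope digits, element
# symbol, chirality markers, hydrogen count, charge, and the brackets.
GLYPH_RE = re.compile(r"\d+|@@?|[A-Z][a-z]?|[a-z]|[\[\]+\-]")

def smirk_like_tokenize(smiles):
    tokens = []
    for unit in ATOM_RE.findall(smiles):
        if unit.startswith("["):
            tokens.extend(GLYPH_RE.findall(unit))  # glyph decomposition
        else:
            tokens.append(unit)
    return tokens

# The two-stage order resolves the Sc ambiguity from the text:
smirk_like_tokenize("Sc")    # ['S', 'c']       sulfur bonded to aromatic carbon
smirk_like_tokenize("[Sc]")  # ['[', 'Sc', ']'] the element scandium
```

<p>Running the atom split first is what lets the glyph pass treat <code>Sc</code> inside brackets as a single element symbol while leaving the unbracketed sulfur-carbon pair as two tokens.</p>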
<p>The resulting vocabulary contains only 165 tokens, requires no training, and by construction can faithfully tokenize any molecule that conforms to the OpenSMILES specification. The implementation is written in Rust using HuggingFace&rsquo;s Tokenizers library and is available on PyPI.</p>
<p><strong>Smirk-GPE</strong> (Glyph Pair Encoding) extends Smirk with a <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">BPE</a>-like compression step. After Smirk tokenization, adjacent tokens are merged using learned rules, reducing sequence length. Unlike standard BPE, merges operate on token IDs rather than character strings, preserving the distinction between chemically different entities that happen to share the same characters. Smirk-GPE was trained on 262 million molecules from Enamine REAL Space with a target vocabulary of 50,000 tokens, though training terminated at 2,300 tokens after exhausting all possible merges.</p>
<h2 id="evaluation-framework-intrinsic-metrics-n-gram-proxies-and-transformer-benchmarks">Evaluation Framework: Intrinsic Metrics, N-Gram Proxies, and Transformer Benchmarks</h2>
<p>The evaluation covers 34 tokenizers across three datasets (Enamine REALSpace, MoleculeNet, and tmQM) using both intrinsic and extrinsic metrics.</p>
<h3 id="intrinsic-metrics">Intrinsic Metrics</h3>
<p>Four intrinsic metrics are computed for each tokenizer:</p>
<p><strong>Fertility</strong> measures the mean tokenized sequence length. Higher fertility increases computational cost due to the quadratic scaling of attention:</p>
<p>$$
\text{cost} \propto \text{fertility}^2
$$</p>
<p><strong>Normalized entropy</strong> quantifies how close a tokenizer comes to the information-theoretic ideal where all tokens are equally probable:</p>
<p>$$
\eta = \frac{-1}{\log |V|} \sum_{x \in V} p(x) \log p(x)
$$</p>
<p>where $V$ is the vocabulary and $p(x)$ is the observed token probability. Higher normalized entropy correlates with better downstream performance.</p>
<p><strong>Token imbalance</strong> measures the distance between observed token frequencies and a uniform distribution:</p>
<p>$$
D = \frac{1}{2} \sum_{x \in V} \left| p(x) - |V|^{-1} \right|
$$</p>
<p><strong>Unknown token frequency</strong> captures the fraction of emitted tokens that are <code>[UNK]</code>. This metric is particularly revealing: all existing chemistry-specific tokenizers (SPE/APE, atom-wise, BPE, and Unigram variants) emit <code>[UNK]</code> at non-negligible rates, while NLP tokenizers, Smirk, and Smirk-GPE do not.</p>
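<p>The distribution-level metrics are straightforward to compute from token counts. A minimal sketch of fertility, normalized entropy, and token imbalance as defined above (the function name and toy inputs are invented for illustration):</p>

```python
import math
from collections import Counter

def intrinsic_metrics(token_sequences, vocab):
    """Fertility, normalized entropy, and token imbalance for a tokenizer's
    output on a corpus, following the definitions above."""
    counts = Counter(t for seq in token_sequences for t in seq)
    total = sum(counts.values())
    fertility = total / len(token_sequences)  # mean tokens per molecule
    p = {t: c / total for t, c in counts.items()}
    entropy = -sum(px * math.log(px) for px in p.values())
    normalized_entropy = entropy / math.log(len(vocab))
    uniform = 1.0 / len(vocab)
    # Sum over the whole vocabulary: unseen tokens contribute |0 - 1/|V||.
    imbalance = 0.5 * sum(abs(p.get(t, 0.0) - uniform) for t in vocab)
    return fertility, normalized_entropy, imbalance

# Perfectly balanced usage of a 2-token vocabulary:
intrinsic_metrics([["A", "B"], ["B", "A"]], ["A", "B"])  # (2.0, 1.0, 0.0)
```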
<h3 id="n-gram-proxy-language-models">N-Gram Proxy Language Models</h3>
<p>The paper proposes using n-gram models as low-cost proxies for transformer-based evaluation. An n-gram estimates token likelihood with <a href="https://en.wikipedia.org/wiki/Additive_smoothing">add-one smoothing</a>:</p>
<p>$$
P_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}) = \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|}
$$</p>
<p>where $C$ is the count function and $|V|$ is the vocabulary size. N-grams were &ldquo;pretrained&rdquo; on 1.6 billion SMILES from Enamine REAL Space and evaluated on validation splits. Cross-entropy loss and information loss from unknown tokens were computed.</p>
<p>To quantify information lost to <code>[UNK]</code> tokens, the authors compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a> between token distributions with and without unknown tokens, using a bidirectional character n-gram model:</p>
<p>$$
B_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}, x_{i+1}, \dots, x_{i+n-1}) \propto \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|} \times \frac{C(x_{i}, \dots, x_{i+n-1}) + 1}{C(x_{i+1}, \dots, x_{i+n-1}) + |V|}
$$</p>
<h3 id="transformer-experiments">Transformer Experiments</h3>
<p>Eighteen encoder-only RoBERTa models (25M parameters each, excluding embeddings) were pretrained from scratch using masked language modeling on Enamine REAL Space (245M molecules, 30,000 steps). Each model used a different tokenizer, isolating the tokenizer&rsquo;s effect on performance. Finetuning was conducted on six regression and seven classification tasks from MoleculeNet and tmQM.</p>
<p>Linear fixed-effects models were used to estimate the standardized effect of each tokenization scheme relative to an atom-wise SMILES baseline.</p>
<h2 id="key-findings-and-practical-implications">Key Findings and Practical Implications</h2>
<h3 id="tokenizer-performance">Tokenizer Performance</h3>
<ul>
<li><strong>Smirk</strong> shows a positive effect on pretraining quality and downstream performance on tmQM (the dataset with the most bracketed atoms), but performs comparably to atom-wise tokenization on MoleculeNet tasks.</li>
<li><strong>SPE and APE</strong> tokenizers have a negative impact on both pretraining and downstream performance relative to the atom-wise baseline, likely due to their high <code>[UNK]</code> rates.</li>
<li><strong>Molecular encoding choice</strong> (<a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">SMILES vs. SELFIES</a>) has a negligible effect on performance.</li>
<li><strong>NLP tokenizers</strong> (GPT-4o, LLaMA, Gemma) score comparably to chemistry-specific tokenizers on intrinsic metrics and do not emit unknown tokens.</li>
</ul>
<h3 id="n-gram-proxy-validation">N-Gram Proxy Validation</h3>
<p>N-gram cross-entropy and information loss metrics show strong rank correlation (Spearman&rsquo;s $\rho$) with downstream transformer performance, validating their use as low-cost evaluation proxies. The effect sizes from n-gram and transformer experiments are directionally consistent.</p>
<h3 id="information-loss-from-unknown-tokens">Information Loss from Unknown Tokens</h3>
<p>Information loss is minimal for tokenizers with robust coverage but substantial for tokenizers with limited vocabularies on chemically diverse datasets. <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> incurs only 0.1 nats/molecule on MoleculeNet but 40.3 nats/molecule on tmQM. Open-vocabulary tokenizers (Smirk, Smirk-GPE, NLP tokenizers) mitigate this degradation.</p>
<h3 id="practical-recommendations">Practical Recommendations</h3>
<p>The authors argue that molecular foundation models must encode the entire breadth of chemical space or risk obscuring critical features. Bracketed atoms encode information essential to clinically relevant pharmaceuticals (e.g., <a href="https://en.wikipedia.org/wiki/Amoxicillin">Amoxicillin</a>), industrial compounds (e.g., Tricalcium Silicate), and foundational chemistry (e.g., <a href="https://en.wikipedia.org/wiki/Cisplatin">Cisplatin</a>, where omitting the chiral marker erases medically relevant stereochemical information). The paper encourages the community to adopt open-vocabulary tokenizers and develop more chemically diverse benchmarks.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The analysis uses a single-point evaluation for transformer experiments, which may underestimate performance achievable with additional hyperparameter tuning.</li>
<li>Smirk-GPE&rsquo;s learned merges from REALSpace did not fully generalize to tmQM, as indicated by the token imbalance metric.</li>
<li>Current benchmarks (MoleculeNet) lack sufficient diversity to evaluate tokenizer robustness across the full periodic table, isotopes, charged species, and uncommon bond types.</li>
<li>The downstream impact of token ambiguities in BPE-based tokenizers (e.g., ChemBERTa&rsquo;s conflation of <code>Sc</code> as both sulfur-carbon and scandium) remains unclear.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>Enamine REAL Space</td>
          <td>1.6B SMILES (n-gram), 245M molecules (transformer)</td>
          <td>80/10/10 train/val/test split</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet</td>
          <td>Multiple tasks</td>
          <td>6 regression + 7 classification tasks</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>tmQM</td>
          <td>108K transition metal complexes</td>
          <td>OpenSMILES molecular encodings</td>
      </tr>
      <tr>
          <td>Smirk-GPE training</td>
          <td>Enamine REAL Space (subset)</td>
          <td>262M molecules</td>
          <td>Training split only</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Smirk</strong>: Two-stage regex-based tokenization (atom decomposition, then glyph decomposition). No training required. Vocabulary: 165 tokens.</li>
<li><strong>Smirk-GPE</strong>: BPE-like compression on top of Smirk. Operates on token IDs (not strings) to preserve chemical disambiguation. Final vocabulary: 2,300 tokens.</li>
<li><strong>N-gram models</strong>: Add-one smoothing, bidirectional context ($2n - 2$ total context window). Implemented in Julia with exact integer arithmetic.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa-PreLayerNorm, 8 layers, 8 attention heads, hidden size 512, intermediate size 2048, max sequence length 2048. ~25M parameters (excluding embeddings).</li>
<li><strong>Pretraining</strong>: Masked language modeling, 30,000 steps, effective batch size 8192, FusedLamb optimizer, learning rate $1.6 \times 10^{-4}$.</li>
<li><strong>Finetuning</strong>: 100,000 steps, AdamW optimizer, effective batch size 128, learning rate $1.6 \times 10^{-4}$.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>MoleculeNet preferred metrics per task (AUROC for classification, MAE/RMSE for regression)</li>
<li>Fixed-effects models for standardized effect size estimation</li>
<li>Spearman&rsquo;s rank correlation between n-gram and transformer metrics</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pretraining: 2x NVIDIA A100 GPUs (Delta system at NCSA)</li>
<li>Finetuning: 1x NVIDIA A40 GPU</li>
<li>N-gram models: CPU-based (Julia implementation)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BattModels/Smirk">Smirk tokenizer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Rust implementation with Python bindings, available on PyPI</td>
      </tr>
      <tr>
          <td>Model checkpoints</td>
          <td>Model</td>
          <td>Not specified</td>
          <td>Pretrained and finetuned checkpoints included in data release</td>
      </tr>
      <tr>
          <td>N-gram code</td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Julia implementation included in data release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wadell, A., Bhutani, A., &amp; Viswanathan, V. (2026). Tokenization for Molecular Foundation Models. <em>Journal of Chemical Information and Modeling</em>, 66(3), 1384-1393. <a href="https://doi.org/10.1021/acs.jcim.5c01856">https://doi.org/10.1021/acs.jcim.5c01856</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wadell2026tokenization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tokenization for Molecular Foundation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wadell, Alexius and Bhutani, Anoushka and Viswanathan, Venkatasubramanian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{66}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1384--1393}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.5c01856}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES2Vec: Interpretable Chemical Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/</guid><description>SMILES2Vec uses a Bayesian-optimized CNN-GRU architecture to predict chemical properties directly from SMILES strings with an interpretable explanation mask.</description><content:encoded><![CDATA[<h2 id="a-general-purpose-rnn-for-chemical-property-prediction-from-smiles">A General-Purpose RNN for Chemical Property Prediction from SMILES</h2>
<p>SMILES2Vec is a <strong>Method</strong> paper that introduces a deep recurrent neural network architecture for predicting chemical properties directly from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> text representations. The primary contributions are: (1) a Bayesian-optimized CNN-<a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> architecture that serves as a general-purpose predictor for diverse chemical properties (toxicity, activity, solubility, <a href="https://en.wikipedia.org/wiki/Solvation">solvation</a> energy), (2) an explanation mask mechanism that provides interpretable predictions by identifying which SMILES characters drive the network&rsquo;s decisions, and (3) evidence that representation learning from raw SMILES can match or outperform models using hand-crafted molecular descriptors.</p>
<h2 id="motivation-beyond-engineered-features-in-chemical-modeling">Motivation: Beyond Engineered Features in Chemical Modeling</h2>
<p>At the time of writing (2017), deep learning models in chemistry relied heavily on engineered <a href="https://en.wikipedia.org/wiki/Molecular_descriptor">molecular descriptors</a> and fingerprints as input features. Over 5,000 molecular descriptors had been developed since the late 1940s, and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a>/QSPR modeling remained the dominant paradigm. The authors identified two key limitations with this approach:</p>
<ol>
<li><strong>Restricted search space</strong>: Engineered features limit the neural network&rsquo;s ability to discover potentially useful representations that domain experts have not anticipated.</li>
<li><strong>Incomplete domain knowledge</strong>: For complex properties where first-principles understanding is incomplete, the lack of appropriate descriptors constrains model performance.</li>
</ol>
<p>In contrast, computer vision and NLP had shown that deep learning models trained on raw data (unaltered images, raw text) could learn powerful representations without feature engineering. The chemical SMILES notation, a text-based encoding of molecular structure that serves as the standard interchange format in cheminformatics, provided a natural analog to text data for NLP-style modeling.</p>
<p>A secondary motivation was interpretability. Most ML and DL models for chemistry operated as black boxes, which posed particular problems for regulated applications like FDA drug approval where mechanistic explanations are required.</p>
<h2 id="core-innovation-cnn-gru-architecture-with-explanation-masks">Core Innovation: CNN-GRU Architecture with Explanation Masks</h2>
<h3 id="architecture-design-via-bayesian-optimization">Architecture Design via <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a></h3>
<p>SMILES2Vec treats SMILES strings as character-level text input. The network processes one-hot encoded characters (padded to length 250, covering 99.9% of the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database) through three stages:</p>
<ol>
<li><strong>Embedding layer</strong>: Maps one-hot character vectors to a learned embedding space (size 50)</li>
<li><strong>1D convolutional layer</strong>: 192 filters with kernel size 3, stride 1</li>
<li><strong>Bidirectional GRU layers</strong>: Two layers with 224 and 384 units respectively</li>
</ol>
<p>The authors explored four architectural classes (GRU, LSTM, CNN-GRU, CNN-LSTM) using Bayesian optimization via SigOpt. Each class was evaluated over 60 trials, optimizing embedding size, convolutional filter count, and RNN layer widths. The CNN-GRU class was selected as the best compromise: CNN-LSTM performed best on classification (Tox21), while GRU-based networks excelled at regression (FreeSolv). The final architecture is summarized by the hyperparameters:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Embedding</td>
          <td>Size</td>
          <td>50</td>
      </tr>
      <tr>
          <td>Conv1D</td>
          <td>Filters</td>
          <td>192</td>
      </tr>
      <tr>
          <td>BiGRU Layer 1</td>
          <td>Units</td>
          <td>224</td>
      </tr>
      <tr>
          <td>BiGRU Layer 2</td>
          <td>Units</td>
          <td>384</td>
      </tr>
  </tbody>
</table>
<h3 id="explanation-mask-for-interpretability">Explanation Mask for Interpretability</h3>
<p>The explanation mask is a post-hoc interpretability mechanism. Given a trained (frozen) SMILES2Vec base model, a separate explanation network learns to produce a per-character mask over the input SMILES string. The mask is trained to preserve the base model&rsquo;s output while masking as much input as possible. The loss function for a single sample is:</p>
<p>$$
\text{Loss}_i = \left\lVert f(\text{SMILES}_i, \theta) - \text{Sol}(\text{SMILES}_i) \right\rVert_2 + 10^{-6} \left\lVert \text{MASK}_i \right\rVert_2 + 0.05\, H(\text{MASK}_i)
$$</p>
<p>where $f(\text{SMILES}_i, \theta)$ is the base network prediction, $\text{Sol}(\text{SMILES}_i)$ is the ground truth solubility, $H$ is the entropy of the normalized mask, and $\text{MASK}_i$ is the per-character mask vector. The L2 term encourages sparsity and the entropy term penalizes uniform attention distributions.</p>
<p>The explanation network itself is a 20-layer residual network with SELU activations, ending in a 1D convolution with kernel size 1, batch normalization, and a softplus activation. Because the softplus output ranges from 0 (fully masked) upward without bound (amplified attention), the mask can both suppress and emphasize specific SMILES characters.</p>
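<p>The objective can be made concrete with a small numeric sketch in plain Python. The coefficients come from the formula above, but the function itself is illustrative, not the authors' implementation:</p>

```python
import math

def mask_loss(pred, target, mask, l2_coef=1e-6, ent_coef=0.05):
    """Schematic explanation-mask objective: prediction fidelity plus an
    L2 penalty on the mask plus the entropy of the normalized mask.
    `pred` is the frozen base model's output on the masked input and
    `mask` is the per-character attention vector (all entries >= 0)."""
    fidelity = abs(pred - target)             # scalar regression error
    l2 = math.sqrt(sum(m * m for m in mask))  # ||MASK||_2
    total = sum(mask)                         # assumed > 0: some attention kept
    probs = [m / total for m in mask if m > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return fidelity + l2_coef * l2 + ent_coef * entropy

# A mask concentrated on one character vs. spread uniformly:
mask_loss(0.0, 0.0, [1.0, 0.0, 0.0, 0.0])  # ~1e-6 (entropy term vanishes)
mask_loss(0.0, 0.0, [1.0, 1.0, 1.0, 1.0])  # pays the full entropy penalty
```

<p>With prediction error held at zero, the concentrated mask is cheaper than the uniform one, which is exactly the pressure that drives the explanation toward a small set of decisive SMILES characters.</p>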
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The model was evaluated on four datasets from the <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark and the ESOL solubility dataset:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Property</th>
          <th>Task</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>Toxicity</td>
          <td>Multi-task classification</td>
          <td>8,014</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Activity</td>
          <td>Single-task classification</td>
          <td>41,193</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Solvation energy</td>
          <td>Single-task regression</td>
          <td>643</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Solubility</td>
          <td>Single-task regression</td>
          <td>1,128</td>
      </tr>
  </tbody>
</table>
<p>SMILES strings longer than 250 characters were excluded. Classification datasets (Tox21, HIV) used a 1/6 test split with minority-class oversampling; regression datasets (FreeSolv, ESOL) used a 1/10 test split. All experiments used 5-fold cross-validation.</p>
<h3 id="training-protocol">Training Protocol</h3>
<ul>
<li><strong>Optimizer</strong>: RMSprop with learning rate $10^{-3}$, $\rho = 0.9$, $\epsilon = 10^{-8}$</li>
<li><strong>Batch size</strong>: 32</li>
<li><strong>Epochs</strong>: 250 with early stopping (patience of 25 epochs based on validation loss)</li>
<li><strong>Classification loss</strong>: Binary cross-entropy</li>
<li><strong>Regression loss</strong>: Mean absolute error</li>
<li><strong>Metrics</strong>: AUC for classification, RMSE for regression</li>
</ul>
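<p>The early-stopping rule (stop after 25 epochs without validation improvement, keep the best weights) can be sketched as a small helper; the class name and interface here are illustrative, not from the paper:</p>

```python
class EarlyStopping:
    """Stop training once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=25):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run: loss improves for 5 epochs, then plateaus; training halts 25 epochs later.
stopper = EarlyStopping(patience=25)
stopped_at = None
for epoch in range(250):
    val_loss = 1.0 - 0.1 * min(epoch, 5)  # plateaus at epoch 5
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break
```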
<h3 id="baselines">Baselines</h3>
<p>SMILES2Vec was compared against:</p>
<ul>
<li><strong>MLP with engineered features</strong>: Standard multi-layer perceptron using molecular fingerprints (from MoleculeNet)</li>
<li><strong>Molecular graph convolutions</strong>: Graph-based neural network from MoleculeNet</li>
<li><strong>Chemception</strong>: CNN operating on 2D chemical images</li>
</ul>
<h3 id="bayesian-optimization-protocol">Bayesian Optimization Protocol</h3>
<p>Only two datasets were used for architecture optimization: the nr-ahr toxicity task from Tox21 (classification) and FreeSolv (regression). The remaining datasets (full Tox21, HIV, ESOL) served purely for generalization evaluation. A fixed test set was held out during optimization, and the correlation between validation and test metrics (0.54 for Tox21, 0.78 for FreeSolv) suggested limited overfitting to the validation set.</p>
<h2 id="results-competitive-accuracy-with-interpretable-predictions">Results: Competitive Accuracy with Interpretable Predictions</h2>
<h3 id="property-prediction-performance">Property Prediction Performance</h3>
<p>SMILES2Vec achieved the following validation metrics (with a pre-training approach from ChemNet improving performance slightly):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SMILES2Vec</th>
          <th>SMILES2Vec + Pre-training</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>AUC</td>
          <td>0.80</td>
          <td>0.81</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>AUC</td>
          <td>0.78</td>
          <td>0.80</td>
          <td>0.80</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE (kcal/mol)</td>
          <td>1.4</td>
          <td>1.2</td>
          <td>1.3</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.63</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Numbers for the MLP and Chemception baselines appear only in a bar chart (Figure 6) and were not reported as precise values. The paper states that MLP with fingerprints performed worst across all tasks, and that Chemception fell between MLP and the graph/SMILES methods.</p>
<p>Key findings:</p>
<ul>
<li>SMILES2Vec outperformed MLP models using engineered features across all tasks, despite using no feature engineering.</li>
<li>Against graph convolutions (the state-of-the-art at the time), SMILES2Vec matched on classification (Tox21: 0.81 vs 0.81, HIV: 0.80 vs 0.80) and outperformed on regression (FreeSolv: 1.2 vs 1.3).</li>
<li>SMILES2Vec outperformed Chemception (2D image CNN) on classification tasks but slightly underperformed on regression, which the authors attributed to SMILES lacking explicit atomic number information.</li>
</ul>
<h3 id="interpretability-evaluation">Interpretability Evaluation</h3>
<p>On the ESOL solubility dataset, the explanation mask was evaluated against first-principles chemical knowledge. The authors separated compounds into soluble (log solubility &gt; 1.0) and insoluble (log solubility &lt; -5.0) categories and defined a ground truth: soluble compounds should attend to hydrophilic atoms (O, N), while insoluble compounds should attend to hydrophobic atoms (C, F, Cl, Br, I). The top-3 character accuracy was 88%, confirming that SMILES2Vec learned representations consistent with known functional group chemistry.</p>
<p>Qualitative analysis of the masks showed that for low-solubility molecules, characters corresponding to hydrophobic groups (c, C, Cl) received high attention, while high-solubility molecules showed attention focused on hydrophilic groups (O, N).</p>
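<p>One plausible reading of the top-3 metric is sketched below; the character sets follow the paper&rsquo;s ground-truth definition, while the mask scores and the exact hit criterion are illustrative assumptions:</p>

```python
HYDROPHILIC = {"O", "N", "o", "n"}
HYDROPHOBIC = {"C", "c", "F", "Cl", "Br", "I"}

def top3_hit(tokens, mask_scores, soluble):
    """Check whether any of the 3 highest-scoring characters belongs to the
    expected atom set (hydrophilic for soluble molecules, hydrophobic otherwise)."""
    expected = HYDROPHILIC if soluble else HYDROPHOBIC
    ranked = sorted(range(len(tokens)), key=lambda i: mask_scores[i], reverse=True)
    return any(tokens[i] in expected for i in ranked[:3])

# Ethanol (CCO) with the mask attending most to the oxygen: a hit.
assert top3_hit(["C", "C", "O"], [0.1, 0.2, 0.9], soluble=True)
```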
<h3 id="limitations">Limitations</h3>
<ul>
<li>The interpretability evaluation was limited to solubility, a well-understood property with simple first-principles rules. The authors acknowledged that quantifying interpretability for complex properties (toxicity, activity) where no simple ground truth exists is nontrivial.</li>
<li>The Bayesian optimization used only a subset of datasets, so the architecture may not be globally optimal across all chemical tasks.</li>
<li>SMILES strings lack explicit atomic number information, which may limit performance on physical property prediction compared to image or graph representations.</li>
<li>The explanation mask approach requires training a separate 20-layer network per property, adding computational overhead.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Architecture optimization</td>
          <td>Tox21 (nr-ahr task)</td>
          <td>8,014</td>
          <td>Single toxicity task for Bayesian optimization</td>
      </tr>
      <tr>
          <td>Architecture optimization</td>
          <td>FreeSolv</td>
          <td>643</td>
          <td>Solvation free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21 (full, 12 tasks)</td>
          <td>8,014</td>
          <td>Multi-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,193</td>
          <td>Single-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Solubility regression, also used for interpretability</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet. The ESOL dataset is from Delaney (2004).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Bayesian optimization via SigOpt (60 trials per architectural class, 4 classes, 6 manually seeded initial designs per class)</li>
<li>RMSprop optimizer with standard settings</li>
<li>Explanation mask trained with Adam, learning rate annealed from $10^{-2}$ to $10^{-6}$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Final architecture: Embedding(50) -&gt; Conv1D(192, kernel=3, stride=1) -&gt; BiGRU(224) -&gt; BiGRU(384)</li>
<li>Explanation network: 20-layer residual network with SELU activations</li>
<li>No pre-trained weights or code were released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC</td>
          <td>Tox21</td>
          <td>0.81</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>HIV</td>
          <td>0.80</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>FreeSolv</td>
          <td>1.2 kcal/mol</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>ESOL</td>
          <td>0.63</td>
          <td>Base model</td>
      </tr>
      <tr>
          <td>Top-3 accuracy</td>
          <td>ESOL interpretability</td>
          <td>88%</td>
          <td>Explanation mask</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The authors report using TensorFlow with GPU acceleration via NVIDIA cuDNN libraries. Specific GPU models and training times were not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No code, models, or data artifacts were released by the authors. The datasets used are publicly available through MoleculeNet.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Goh, G. B., Hodas, N. O., Siegel, C., &amp; Vishnu, A. (2017). SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties. <em>arXiv preprint arXiv:1712.02034</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{goh2017smiles2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Goh, Garrett B. and Hodas, Nathan O. and Siegel, Charles and Vishnu, Abhinav}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.02034}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1712.02034}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES-BERT: BERT-Style Pre-Training for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-bert/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-bert/</guid><description>SMILES-BERT applies BERT-style masked pre-training to SMILES strings for molecular property prediction, using Transformer encoders fine-tuned on labeled data.</description><content:encoded><![CDATA[<h2 id="pre-training-transformers-on-smiles-for-molecular-properties">Pre-Training Transformers on SMILES for Molecular Properties</h2>
<p>SMILES-BERT is a <strong>Method</strong> paper that introduces a BERT-inspired pre-training and fine-tuning framework for molecular property prediction. The primary contribution is adapting the masked language model paradigm from NLP to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>, enabling a Transformer encoder to learn molecular representations from large-scale unlabeled data before fine-tuning on smaller labeled datasets.</p>
<h2 id="limited-labels-in-molecular-property-prediction">Limited Labels in Molecular Property Prediction</h2>
<p>Molecular property prediction is central to drug discovery and chemical design, but obtaining labeled data requires expensive biological assays. Deep learning methods for this task fall into three categories: manually designed fingerprints (e.g., ECFP), graph-based methods (GCNs operating on molecular graphs), and sequence-based methods (RNNs or CNNs operating on SMILES strings).</p>
<p>Prior unsupervised approaches like <a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Seq2seq Fingerprint</a> used an encoder-decoder architecture to learn representations from unlabeled SMILES, but the decoder acts as scaffolding that consumes GPU memory during pre-training without contributing to downstream prediction. The semi-supervised Seq3seq Fingerprint improved on this by incorporating labeled data, but retained the encoder-decoder inefficiency. RNN-based methods are also difficult to train in parallel and require careful tuning (gradient clipping, early stopping) to converge.</p>
<p>The authors identify two motivations: (1) building a semi-supervised model that effectively leverages large pools of unlabeled SMILES to improve prediction with limited labels, and (2) designing an architecture where the entire pre-trained model participates in fine-tuning (no wasted decoder parameters) and naturally supports parallel training.</p>
<h2 id="masked-smiles-recovery-with-transformer-encoders">Masked SMILES Recovery with Transformer Encoders</h2>
<p>The core innovation is the Masked SMILES Recovery pre-training task, directly analogous to BERT&rsquo;s masked language modeling. The model architecture is a stack of Transformer encoder layers, which contains no recurrence and is therefore fully parallelizable across sequence positions during training.</p>
<h3 id="architecture">Architecture</h3>
<p>SMILES-BERT uses 6 Transformer encoder layers, each with 4-head multi-head self-attention and feed-forward dimension of 1024. Each Transformer layer contains three components: a pre-attention feed-forward network, a self-attention layer, and a post-attention feed-forward network, all followed by layer normalization with residual connections.</p>
<p>The self-attention mechanism uses scaled dot-product attention:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^{Q})(XW^{K})^{T}}{\sqrt{d_{k}}}\right) XW^{V}
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^{Q}$, $W^{K}$, $W^{V} \in \mathbb{R}^{M \times d_{k}}$ are the query, key, and value weight matrices, and $\sqrt{d_{k}}$ is the scaling factor.</p>
<p>Input SMILES are tokenized at the character level with token embeddings and positional embeddings. A special <code>&lt;GO&gt;</code> token is prepended to each SMILES, and its output representation is used for downstream classification/regression after fine-tuning.</p>
<h3 id="pre-training-masked-smiles-recovery">Pre-training: Masked SMILES Recovery</h3>
<p>Following BERT&rsquo;s masking strategy, 15% of tokens in each SMILES are selected for masking (minimum one per SMILES). Of the selected tokens:</p>
<ul>
<li>85% are replaced with a <code>&lt;MASK&gt;</code> token</li>
<li>10% are replaced with a random token from the vocabulary</li>
<li>5% are kept unchanged</li>
</ul>
<p>The model is trained to recover the original tokens at masked positions. The loss is computed only on the masked token outputs.</p>
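<p>A minimal sketch of this corruption step (the vocabulary and helper name are illustrative, not from the paper):</p>

```python
import random

VOCAB = ["C", "c", "O", "N", "Cl", "(", ")", "=", "1", "2"]

def mask_smiles_tokens(tokens, rng, mask_rate=0.15):
    """Corrupt a tokenized SMILES for Masked SMILES Recovery.

    Selects ~15% of positions (at least one); of those, 85% become <MASK>,
    10% become a random vocabulary token, and 5% are left unchanged.
    Returns the corrupted sequence and the positions to predict.
    """
    n_select = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted = list(tokens)
    for pos in positions:
        r = rng.random()
        if r < 0.85:
            corrupted[pos] = "<MASK>"
        elif r < 0.95:
            corrupted[pos] = rng.choice(VOCAB)
        # else: keep the original token (5% of selections)
    return corrupted, sorted(positions)

rng = random.Random(0)
corrupted, positions = mask_smiles_tokens(list("CCOC(=O)C"), rng)
```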
<h3 id="fine-tuning">Fine-tuning</h3>
<p>After pre-training, a classifier or regressor head is added to the <code>&lt;GO&gt;</code> token output. The entire model (all Transformer layers plus the new head) is fine-tuned on the labeled dataset.</p>
<p>Key differences from the original BERT:</p>
<ol>
<li>Only the Masked SMILES Recovery task is used (BERT&rsquo;s next sentence prediction is dropped since SMILES have no consecutive-sentence structure)</li>
<li>Segment embeddings are removed</li>
<li>The architecture is smaller (6 layers, 4 heads, 1024 FFN dim) since SMILES have a much smaller vocabulary and shorter sequences than natural language</li>
</ol>
<p>The authors compared this configuration against a larger BERT-base setup (12 layers, 12 heads, 3072 FFN dim) and found no meaningful performance difference, confirming that the smaller model is sufficient for SMILES.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>SMILES-BERT was pre-trained on the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> with 18,671,355 training SMILES, 10,000 for validation, and 10,000 for evaluation. Pre-training ran for 10 epochs using the Adam optimizer with a warm-up strategy (learning rate from $10^{-9}$ to $10^{-4}$ over 4,000 steps, then inverse-square-root decay). Batch size was 256 and dropout was 0.1. The pre-training masked SMILES exact recovery rate reached 82.85% on the validation set.</p>
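<p>A sketch of the learning-rate schedule; the paper states only the endpoints and the decay family, so the linear warm-up ramp and exact decay constant are assumptions:</p>

```python
def smiles_bert_lr(step, peak=1e-4, floor=1e-9, warmup=4000):
    """Learning rate: linear warm-up from `floor` to `peak` over `warmup`
    steps, then inverse-square-root decay (Transformer-style).
    """
    if step <= warmup:
        return floor + (peak - floor) * step / warmup
    return peak * (warmup / step) ** 0.5

lr_start = smiles_bert_lr(0)      # floor of the ramp
lr_peak = smiles_bert_lr(4000)    # end of warm-up
lr_later = smiles_bert_lr(16000)  # decayed by sqrt(4000/16000) = 1/2
```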
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a></td>
          <td>NCATS/NIH</td>
          <td>10,850</td>
          <td>Classification (threshold 1.88)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PM2</td>
          <td>NCATS/NIH</td>
          <td>323,242</td>
          <td>Classification (threshold 0.024896)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PCBA-686978</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></td>
          <td>302,175</td>
          <td>Classification</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p>All datasets were split 80/10/10 for train/validation/test. Fine-tuning used Adam with a fixed learning rate for 50 epochs, selecting the best model on validation data.</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>Circular Fingerprint (CircularFP)</strong>: Manually designed hash-based fingerprint (ECFP family)</li>
<li><strong>Neural Fingerprint (NeuralFP)</strong>: Graph-based neural network replacing hash functions with learned layers</li>
<li><strong>Seq2seq Fingerprint (Seq2seqFP)</strong>: Unsupervised encoder-decoder model on SMILES</li>
<li><strong>Seq3seq Fingerprint (Seq3seqFP)</strong>: Semi-supervised encoder-decoder model on SMILES</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP</th>
          <th>PM2</th>
          <th>PCBA-686978</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CircularFP</td>
          <td>~0.90</td>
          <td>0.6858</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>NeuralFP</td>
          <td>~0.90</td>
          <td>0.6802</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>Seq2seqFP</td>
          <td>~0.87</td>
          <td>0.6112</td>
          <td>~0.80</td>
      </tr>
      <tr>
          <td>Seq3seqFP</td>
          <td>~0.90</td>
          <td>0.7038</td>
          <td>~0.84</td>
      </tr>
      <tr>
          <td><strong>SMILES-BERT</strong></td>
          <td><strong>0.9154</strong></td>
          <td><strong>0.7589</strong></td>
          <td><strong>0.8784</strong></td>
      </tr>
  </tbody>
</table>
<p>SMILES-BERT outperformed all baselines on all three datasets. The improvement over Seq3seqFP was approximately 2% on LogP, 5.5% on PM2, and 3.8% on PCBA-686978. The results on PM2 (the largest labeled dataset) show that pre-training benefits persist even with substantial labeled data.</p>
<h3 id="structure-study">Structure Study</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>FFN Dim</th>
          <th>LogP Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES-BERT</td>
          <td>6</td>
          <td>4</td>
          <td>1024</td>
          <td>0.9154</td>
      </tr>
      <tr>
          <td>SMILES-BERT (large)</td>
          <td>12</td>
          <td>12</td>
          <td>3072</td>
          <td>0.9147</td>
      </tr>
  </tbody>
</table>
<p>The larger configuration provided no improvement, supporting the choice of the smaller, more efficient architecture.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>SMILES-BERT demonstrated that BERT-style masked pre-training on SMILES strings produces transferable molecular representations that improve property prediction across datasets of varying sizes and property types.</p>
<p>Key findings:</p>
<ul>
<li>The Masked SMILES Recovery pre-training task transfers effectively to molecular property prediction</li>
<li>The full model participates in fine-tuning (no wasted decoder), making SMILES-BERT more parameter-efficient than encoder-decoder alternatives</li>
<li>A smaller Transformer configuration (6 layers, 4 heads) matches the performance of a BERT-base-sized model for SMILES data</li>
<li>Pre-training on ~18.7M SMILES from ZINC provides robust initialization across different downstream tasks</li>
</ul>
<p><strong>Limitations</strong>: The evaluation uses only classification accuracy as the metric, without reporting AUC-ROC, F1, or other metrics common in molecular property prediction. The comparison is limited to four baselines, and two of the three evaluation datasets (LogP, PM2) are non-public NIH datasets. The paper does not explore different pre-training dataset sizes or ablate the masking strategy. Only classification tasks are evaluated, though the architecture supports regression.</p>
<p><strong>Future work</strong>: The authors propose incorporating Quantitative Estimate of Druglikeness (QED) prediction as an additional pre-training task to warm up the model&rsquo;s classification capability, analogous to BERT&rsquo;s next sentence prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC</td>
          <td>18,671,355 SMILES</td>
          <td>Publicly available database</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LogP</td>
          <td>10,850</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PM2</td>
          <td>323,242</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PCBA-686978</td>
          <td>302,175</td>
          <td>Public, from PubChem BioAssay</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Adam optimizer, warm-up for 4,000 steps ($10^{-9}$ to $10^{-4}$), inverse-square-root LR schedule, batch size 256, dropout 0.1, 10 epochs</li>
<li>Fine-tuning: Adam optimizer, fixed LR (insensitive to choice among $10^{-5}$, $10^{-6}$, $10^{-7}$), 50 epochs, best model on validation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>6 Transformer encoder layers, 4-head multi-head attention, FFN dim 1024</li>
<li>Token embedding + positional embedding, <code>&lt;GO&gt;</code> special token</li>
<li>Implemented with FairSeq (Facebook AI Research Sequence-to-Sequence Toolkit)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILES-BERT</th>
          <th>Best Baseline (Seq3seqFP)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP Accuracy</td>
          <td>0.9154</td>
          <td>~0.90</td>
          <td>~2% improvement</td>
      </tr>
      <tr>
          <td>PM2 Accuracy</td>
          <td>0.7589</td>
          <td>0.7038</td>
          <td>~5.5% improvement</td>
      </tr>
      <tr>
          <td>PCBA Accuracy</td>
          <td>0.8784</td>
          <td>~0.84</td>
          <td>~3.8% improvement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions GPU training and NVIDIA GPU donation in acknowledgments but does not specify the exact GPU model or training time beyond noting that pre-training on a single GPU takes over a week for 10 epochs.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No public code or model release identified</td>
          <td>-</td>
          <td>-</td>
          <td>Paper does not provide a GitHub link or model checkpoint</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The ZINC pre-training data is public and the architecture is described in detail, but no code or pre-trained weights are released. Two of three evaluation datasets (LogP, PM2) are non-public.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, S., Guo, Y., Wang, Y., Sun, H., &amp; Huang, J. (2019). SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In <em>Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB &lsquo;19)</em>, 429-436. <a href="https://doi.org/10.1145/3307339.3342186">https://doi.org/10.1145/3307339.3342186</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2019smilesbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Sheng and Guo, Yuzhi and Wang, Yuhong and Sun, Hongmao and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{429--436}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3307339.3342186}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES vs SELFIES Tokenization for Chemical LMs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</guid><description>Atom Pair Encoding (APE) tokenizer outperforms BPE on SMILES and SELFIES in RoBERTa-based chemical language models across MoleculeNet classification tasks.</description><content:encoded><![CDATA[<h2 id="atom-pair-encoding-for-chemical-language-modeling">Atom Pair Encoding for Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom Pair Encoding (APE), a tokenization algorithm designed specifically for chemical string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The primary contribution is demonstrating that a chemistry-aware tokenizer, which preserves atomic identity during subword merging, leads to improved molecular property classification accuracy in transformer-based models compared to the standard Byte Pair Encoding (BPE) approach.</p>
<h2 id="why-tokenization-matters-for-chemical-strings">Why Tokenization Matters for Chemical Strings</h2>
<p>Existing chemical language models based on BERT/RoBERTa architectures have typically relied on BPE for tokenizing SMILES and SELFIES strings. <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte Pair Encoding (BPE)</a> was originally designed for natural language and data compression, where it excels at breaking words into meaningful subword units. When applied to chemical strings, BPE operates at the character level without understanding chemical semantics, leading to several problems:</p>
<ul>
<li><strong>Stray characters</strong>: BPE may create tokens like &ldquo;C)(&rdquo; that have no chemical meaning.</li>
<li><strong>Element splitting</strong>: Multi-character elements like chlorine (&ldquo;Cl&rdquo;) can be split into &ldquo;C&rdquo; and &ldquo;l&rdquo;, so the model misreads chlorine as a carbon atom followed by a dangling character.</li>
<li><strong>Lost structural context</strong>: BPE compresses sequences without considering how character position encodes molecular structure.</li>
</ul>
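<p>The element-splitting problem is avoided by matching two-letter elements before single letters, as in this sketch of an atom-level SMILES tokenizer (the regex covers only the common organic-subset symbols; a full grammar also needs bracket atoms, stereo markers, and more):</p>

```python
import re

# Two-letter halogens first, so "Cl"/"Br" are never split into "C"+"l" / "B"+"r".
SMILES_ATOM_RE = re.compile(r"Cl|Br|[BCNOSPFI]|[bcnops]|[\(\)=#\-\+\[\]@/\\%]|\d")

def atom_tokenize(smiles):
    """Split a SMILES string into atom-level units (APE's starting point)."""
    tokens = SMILES_ATOM_RE.findall(smiles)
    assert "".join(tokens) == smiles, "untokenized characters remain"
    return tokens

# Chlorobenzene: naive character splitting would yield a spurious 'C' + 'l'.
assert atom_tokenize("c1ccccc1Cl") == ["c", "1", "c", "c", "c", "c", "c", "1", "Cl"]
```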
<p>Previous work on <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> attempted to address this by iteratively merging SMILES substrings into chemically meaningful tokens. However, SPE had practical limitations: its Python implementation did not support SELFIES, and it produced a smaller vocabulary (~3000 tokens) than what the data could support. These gaps motivated the development of APE.</p>
<h2 id="the-ape-tokenizer-chemistry-aware-subword-merging">The APE Tokenizer: Chemistry-Aware Subword Merging</h2>
<p>APE draws inspiration from both BPE and SPE but addresses their shortcomings. The key design decisions are:</p>
<ol>
<li>
<p><strong>Atom-level initialization</strong>: Instead of starting from individual characters (as BPE does), APE begins with chemically valid atomic units. For SMILES, this means recognizing multi-character elements (e.g., &ldquo;Cl&rdquo;, &ldquo;Br&rdquo;) as single tokens. For SELFIES, each bracketed string (e.g., [C], [Ring1], [=O]) serves as the fundamental unit.</p>
</li>
<li>
<p><strong>Iterative pair merging</strong>: Like BPE, APE iteratively merges the most frequent adjacent token pairs. The difference is that the initial tokenization preserves atomic boundaries, so merged tokens always represent valid chemical substructures.</p>
</li>
<li>
<p><strong>Larger vocabulary</strong>: Using the same minimum frequency threshold of 2000, APE generates approximately 5300 unique tokens from the PubChem dataset, compared to SPE&rsquo;s approximately 3000. This richer vocabulary provides more expressive power for representing chemical substructures.</p>
</li>
<li>
<p><strong>SELFIES compatibility</strong>: APE natively supports both SMILES and SELFIES, using the bracketed token structure of SELFIES as its starting point for that representation.</p>
</li>
</ol>
<p>The tokenizer was trained on a subset of 2 million molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> (10 million SMILES total). This produced four tokenizer variants: SMILES-BPE, SMILES-APE, SELFIES-BPE, and SELFIES-APE.</p>
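<p>The iterative merge loop shared by BPE, SPE, and APE can be sketched as follows; APE&rsquo;s distinction is that the corpus starts from atom-level tokens rather than raw characters (toy corpus, illustrative helper names, and a tiny frequency threshold in place of the paper&rsquo;s 2000):</p>

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent token pairs across all tokenized molecules."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def ape_merges(corpus, n_merges, min_freq=2):
    """Greedily add the most frequent adjacent pairs to the vocabulary."""
    vocab = []
    for _ in range(n_merges):
        best = most_frequent_pair(corpus)
        if best is None or best[1] < min_freq:
            break
        vocab.append(best[0][0] + best[0][1])
        corpus = [merge_pair(t, best[0]) for t in corpus]
    return vocab, corpus

# Atom-level start: the ("C", "C") pair dominates this toy corpus and is merged;
# no remaining pair clears the threshold, so merging stops after one step.
corpus = [["C", "C", "O"], ["C", "C", "Cl"], ["C", "C", "C", "O"]]
vocab, merged = ape_merges(corpus, n_merges=2)
```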
<h2 id="pre-training-and-evaluation-on-moleculenet-benchmarks">Pre-training and Evaluation on MoleculeNet Benchmarks</h2>
<h3 id="model-architecture">Model architecture</h3>
<p>All four models use the RoBERTa architecture with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. Pre-training used masked language modeling (MLM) with 15% token masking on 1 million molecules from PubChem, with a validation set of 100,000 molecules. Each model was pre-trained for 20 epochs using AdamW, with hyperparameter optimization via Optuna.</p>
<h3 id="downstream-tasks">Downstream tasks</h3>
<p>The models were fine-tuned on three <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Category</th>
          <th>Compounds</th>
          <th>Tasks</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Biophysics</td>
          <td>41,127</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Physiology</td>
          <td>7,831</td>
          <td>12</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p>Data was split 80/10/10 (train/validation/test) following MoleculeNet recommendations. Models were fine-tuned for 5 epochs with early stopping based on validation ROC-AUC.</p>
<h3 id="baselines">Baselines</h3>
<p>Results were compared against two text-based models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> MTR-77M and <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) and two graph-based models (D-MPNN from Chemprop and MoleculeNet Graph-Conv).</p>
<h3 id="main-results">Main results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>HIV ROC</th>
          <th>Tox21 ROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILYAPE-1M</td>
          <td>0.754 +/- 0.006</td>
          <td>0.772 +/- 0.010</td>
          <td>0.838 +/- 0.002</td>
      </tr>
      <tr>
          <td>SMILYBPE-1M</td>
          <td>0.746 +/- 0.006</td>
          <td>0.754 +/- 0.015</td>
          <td>0.849 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYAPE-1M</td>
          <td>0.735 +/- 0.015</td>
          <td>0.768 +/- 0.012</td>
          <td>0.842 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYBPE-1M</td>
          <td>0.676 +/- 0.014</td>
          <td>0.709 +/- 0.012</td>
          <td>0.825 +/- 0.001</td>
      </tr>
      <tr>
          <td>ChemBERTa-2-MTR-77M</td>
          <td>0.698 +/- 0.014</td>
          <td>0.735 +/- 0.008</td>
          <td>0.790 +/- 0.003</td>
      </tr>
      <tr>
          <td>SELFormer</td>
          <td>0.716 +/- 0.021</td>
          <td>0.769 +/- 0.010</td>
          <td>0.838 +/- 0.005</td>
      </tr>
      <tr>
          <td>MoleculeNet-Graph-Conv</td>
          <td>0.690</td>
          <td>0.763</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.737</td>
          <td>0.776</td>
          <td>0.851</td>
      </tr>
  </tbody>
</table>
<p>APE outperforms BPE for both SMILES and SELFIES on BBBP and HIV. SMILYAPE achieves the best BBBP score (0.754), beating D-MPNN (0.737). On HIV, SMILYAPE (0.772) is competitive with D-MPNN (0.776). On Tox21, D-MPNN (0.851) leads, with SMILYBPE (0.849) and SELFYAPE (0.842) close behind; notably, SMILYBPE edges out SMILYAPE here.</p>
<h3 id="statistical-significance">Statistical significance</h3>
<p><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U tests</a> confirmed statistically significant differences between SMILYAPE and SMILYBPE (p &lt; 0.05 on all datasets). Cliff&rsquo;s delta values indicate large effect sizes: 0.74 (BBBP), 0.70 (HIV), and -1.00 (Tox21, favoring BPE). For SELFIES models, SELFYAPE achieved Cliff&rsquo;s delta of 1.00 across all three datasets, indicating complete separation from SELFYBPE.</p>
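<p>Cliff&rsquo;s delta is the normalized difference between the number of pairwise wins and losses across the two samples, ranging from &minus;1 to 1, with |&delta;| = 1 meaning complete separation. A direct sketch (not the authors&rsquo; code):</p>

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#{x > y} - #{x < y}) / (len(xs) * len(ys)).

    +1.0 means every x exceeds every y (complete separation); -1.0 the
    reverse; 0.0 means wins and losses balance out.
    """
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```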
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="ape-outperforms-bpe-by-preserving-atomic-identity">APE outperforms BPE by preserving atomic identity</h3>
<p>APE&rsquo;s general advantage over BPE stems from its atom-level initialization. By starting with chemically valid units rather than individual characters, APE avoids creating nonsensical tokens that split chemical elements or mix structural delimiters with atoms.</p>
<h3 id="smiles-outperforms-selfies-with-ape-tokenization">SMILES outperforms SELFIES with APE tokenization</h3>
<p>SMILYAPE generally outperforms SELFYAPE across tasks. Attention weight analysis revealed that SMILYAPE assigns more weight to immediate neighboring tokens (0.108 vs. 0.096) and less to distant tokens (0.030 vs. 0.043). This pattern aligns with chemical intuition: bonding is primarily determined by directly connected atoms. SMILYAPE also produces more compact tokenizations (8.6 tokens per molecule vs. 11.9 for SELFYAPE), potentially allowing more efficient attention allocation.</p>
<h3 id="selfies-models-show-higher-inter-tokenizer-agreement">SELFIES models show higher inter-tokenizer agreement</h3>
<p>On the BBBP dataset, all true positives identified by SELFYBPE were also captured by SELFYAPE, with SELFYAPE achieving higher recall (61.68% vs. 55.14%). In contrast, SMILES-based models shared only 29.3% of true positives between APE and BPE variants, indicating that tokenization choice has a larger impact on SMILES models.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Pre-training used only 1 million molecules, compared to 77 million for ChemBERTa-2. Despite this, APE models were competitive or superior, but scaling effects remain unexplored.</li>
<li>Evaluation was limited to three binary classification tasks from MoleculeNet. Regression tasks, molecular generation, and reaction prediction were not tested.</li>
<li>The Tox21 result is notable: SMILYBPE outperforms SMILYAPE (0.849 vs. 0.838), suggesting APE&rsquo;s advantage may be task-dependent.</li>
<li>No comparison with recent atom-level tokenizers like <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES</a> or newer approaches beyond SPE.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tokenizer training</td>
          <td>PubChem subset</td>
          <td>2M molecules</td>
          <td>SMILES strings converted to SELFIES via selfies library</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>1M molecules</td>
          <td>100K validation set</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>7,831 compounds</td>
          <td>80/10/10 split, 12 tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Tokenizers: BPE (via Hugging Face), APE (custom implementation, minimum frequency 2000)</li>
<li>Pre-training: Masked Language Modeling (15% masking) for 20 epochs</li>
<li>Optimizer: AdamW with Optuna hyperparameter search</li>
<li>Fine-tuning: 5 epochs with early stopping on validation ROC-AUC</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Architecture: RoBERTa with 6 layers, hidden size 768, intermediate size 1536, 12 attention heads</li>
<li>Four variants: SMILYAPE, SMILYBPE, SELFYAPE, SELFYBPE</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILYAPE</th>
          <th>SMILYBPE</th>
          <th>SELFYAPE</th>
          <th>SELFYBPE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP ROC-AUC</td>
          <td>0.754</td>
          <td>0.746</td>
          <td>0.735</td>
          <td>0.676</td>
      </tr>
      <tr>
          <td>HIV ROC-AUC</td>
          <td>0.772</td>
          <td>0.754</td>
          <td>0.768</td>
          <td>0.709</td>
      </tr>
      <tr>
          <td>Tox21 ROC-AUC</td>
          <td>0.838</td>
          <td>0.849</td>
          <td>0.842</td>
          <td>0.825</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA RTX 3060 GPU with 12 GiB VRAM</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mikemayuare/apetokenizer">APE Tokenizer</a></td>
          <td>Code</td>
          <td>Other (unspecified SPDX)</td>
          <td>Official APE tokenizer implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/mikemayuare/PubChem10M_SMILES_SELFIES">PubChem10M SMILES/SELFIES</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>10M SMILES with SELFIES conversions</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/mikemayuare">Pre-trained and fine-tuned models</a></td>
          <td>Model</td>
          <td>Not specified</td>
          <td>All four model variants on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leon, M., Perezhohin, Y., Peres, F., Popovič, A., &amp; Castelli, M. (2024). Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. <em>Scientific Reports</em>, 14(1), 25016. <a href="https://doi.org/10.1038/s41598-024-76440-8">https://doi.org/10.1038/s41598-024-76440-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leon2024comparing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leon, Miguelangel and Perezhohin, Yuriy and Peres, Fernando and Popovi{\v{c}}, Ale{\v{s}} and Castelli, Mauro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{25016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-024-76440-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Transformer: Low-Data Molecular Fingerprints</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-transformer/</guid><description>SMILES Transformer uses unsupervised Transformer pre-training on SMILES strings to produce molecular fingerprints that excel in low-data drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="a-transformer-approach-to-learned-molecular-fingerprints">A Transformer Approach to Learned Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Transformer (ST), a Transformer-based sequence-to-sequence model pre-trained on unlabeled SMILES strings to produce continuous, data-driven molecular fingerprints. The primary contribution is demonstrating that unsupervised pre-training on chemical text representations yields fingerprints that generalize well under low-data conditions, outperforming both rule-based fingerprints (ECFP) and graph convolution models on several <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. A secondary contribution is the Data Efficiency Metric (DEM), a scalar metric for evaluating model performance across varying training set sizes.</p>
<h2 id="the-low-data-problem-in-molecular-property-prediction">The Low-Data Problem in Molecular Property Prediction</h2>
<p>Machine learning for drug discovery depends on molecular representations, but labeled datasets of experimentally validated properties are typically small. Conventional approaches fall into two camps: rule-based fingerprints like ECFP that hash substructures into sparse binary vectors, and graph-based methods like GraphConv that learn representations end-to-end. Rule-based fingerprints perform poorly with shallow models or limited data, while graph-based methods are designed for large fully-labeled settings.</p>
<p>Pre-training on unlabeled data had shown strong results in NLP (ELMo, BERT, XLNet), and prior work in cheminformatics had explored RNN-based and VAE-based pre-training on SMILES (<a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Seq2Seq fingerprints</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>, heteroencoders). However, none of these studies systematically evaluated performance in small-data settings. Honda et al. fill this gap by applying Transformer-based pre-training to SMILES and measuring data efficiency explicitly.</p>
<h2 id="transformer-pre-training-on-smiles-with-pooled-fingerprint-extraction">Transformer Pre-training on SMILES with Pooled Fingerprint Extraction</h2>
<p>The core innovation is a Transformer encoder-decoder architecture pre-trained as an autoencoder on SMILES strings, with a specific fingerprint extraction strategy that pools the encoder outputs into a fixed-length vector.</p>
<h3 id="architecture">Architecture</h3>
<p>The model uses 4 Transformer blocks for both the encoder and decoder, each with 4-head attention, 256 embedding dimensions, and a two-layer feed-forward sublayer. Input SMILES are tokenized at the symbol level (e.g., &lsquo;c&rsquo;, &lsquo;Br&rsquo;, &lsquo;=&rsquo;, &lsquo;(&rsquo;, &lsquo;2&rsquo;) and one-hot encoded. Following Vaswani et al. (2017), the input is the sum of the token encoding and a positional encoding.</p>
<h3 id="pre-training">Pre-training</h3>
<p>The model is pre-trained on 861,000 unlabeled SMILES sampled from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL24</a> to minimize cross-entropy between input and output SMILES (i.e., reconstruction). <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (Bjerrum, 2017) randomly generates non-canonical SMILES at each epoch to reduce representation bias. Training runs for 5 epochs with Adam optimization, reaching a perplexity of 1.0 (perfect decoding).</p>
<h3 id="fingerprint-extraction">Fingerprint Extraction</h3>
<p>Since the Transformer outputs symbol-level (atom-level) representations, a pooling strategy produces molecule-level fingerprints. Four vectors are concatenated:</p>
<ol>
<li>Mean-pooled output of the last encoder layer</li>
<li>Max-pooled output of the last encoder layer</li>
<li>First output token of the last encoder layer</li>
<li>First output token of the penultimate encoder layer</li>
</ol>
<p>This produces a 1024-dimensional fingerprint, matching the dimensionality of ECFP for fair comparison.</p>
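<p>As an illustrative, pure-Python stand-in for the tensor operations, the concatenation of the four pooled vectors looks like:</p>

```python
def st_fingerprint(last_layer, penult_layer):
    """Concatenate four pooled views of per-token encoder outputs.

    last_layer / penult_layer: lists of per-token vectors (T x d) from the
    final and penultimate encoder layers. Output has dimension 4 * d
    (1024 when d = 256, matching ECFP).
    """
    d = len(last_layer[0])
    mean_pool = [sum(tok[i] for tok in last_layer) / len(last_layer) for i in range(d)]
    max_pool = [max(tok[i] for tok in last_layer) for i in range(d)]
    return mean_pool + max_pool + list(last_layer[0]) + list(penult_layer[0])
```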
<h3 id="data-efficiency-metric">Data Efficiency Metric</h3>
<p>The paper proposes DEM to measure how well a model performs across different training set sizes:</p>
<p>$$
M_{DE}(f, m) = \frac{1}{|I|} \sum_{i \in I} m(f_i, X_i, Y_i)
$$</p>
<p>where $f_i$ is the model trained on the fraction $i$ of training data, $m$ is the task metric, and $I = {0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8}$ doubles the training percentage at each step. This captures average performance across a range of data availability, giving a single scalar that balances accuracy and data efficiency.</p>
<h2 id="benchmarking-across-moleculenet-with-data-efficiency-focus">Benchmarking Across MoleculeNet with Data Efficiency Focus</h2>
<h3 id="datasets">Datasets</h3>
<p>The evaluation uses 10 datasets from MoleculeNet spanning three categories:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Molecules</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>Regression</td>
          <td>1,128</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>Regression</td>
          <td>643</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>1</td>
          <td>Regression</td>
          <td>4,200</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>MUV</td>
          <td>17</td>
          <td>Classification</td>
          <td>93,127</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>HIV</td>
          <td>1</td>
          <td>Classification</td>
          <td>41,913</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>BACE</td>
          <td>1</td>
          <td>Classification</td>
          <td>1,522</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>Classification</td>
          <td>2,053</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>Tox21</td>
          <td>12</td>
          <td>Classification</td>
          <td>8,014</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>SIDER</td>
          <td>27</td>
          <td>Classification</td>
          <td>1,427</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>ClinTox</td>
          <td>2</td>
          <td>Classification</td>
          <td>1,491</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>ECFP4</strong>: Rule-based extended-connectivity fingerprint with 1024 dimensions</li>
<li><strong>RNNS2S</strong>: RNN-based Seq2Seq pre-trained fingerprint (3-layer bidirectional GRU, same pre-training data as ST)</li>
<li><strong>GraphConv</strong>: Graph convolution network trained end-to-end on labeled data</li>
</ul>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>All fingerprint methods use a simple MLP classifier/regressor from scikit-learn with default hyperparameters to isolate the fingerprint quality from model capacity. Datasets are randomly split (stratified for classification), and results are averaged over 20 trials. Note that random splits are used rather than scaffold splits for the DEM experiments.</p>
<h3 id="data-efficiency-results-dem">Data Efficiency Results (DEM)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ST+MLP</th>
          <th>ECFP+MLP</th>
          <th>RNNS2S+MLP</th>
          <th>GraphConv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL (RMSE, lower is better)</td>
          <td><strong>1.144</strong></td>
          <td>1.741</td>
          <td>1.317</td>
          <td>1.673</td>
      </tr>
      <tr>
          <td>FreeSolv (RMSE, lower is better)</td>
          <td><strong>2.246</strong></td>
          <td>3.043</td>
          <td>2.987</td>
          <td>3.476</td>
      </tr>
      <tr>
          <td>Lipophilicity (RMSE, lower is better)</td>
          <td>1.169</td>
          <td><strong>1.090</strong></td>
          <td>1.219</td>
          <td><strong>1.062</strong></td>
      </tr>
      <tr>
          <td>MUV (PRC-AUC, higher is better)</td>
          <td>0.009</td>
          <td><strong>0.036</strong></td>
          <td>0.010</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>HIV (ROC-AUC, higher is better)</td>
          <td>0.683</td>
          <td>0.697</td>
          <td>0.682</td>
          <td><strong>0.723</strong></td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC, higher is better)</td>
          <td>0.719</td>
          <td><strong>0.769</strong></td>
          <td>0.717</td>
          <td>0.744</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC, higher is better)</td>
          <td><strong>0.900</strong></td>
          <td>0.760</td>
          <td>0.884</td>
          <td>0.795</td>
      </tr>
      <tr>
          <td>Tox21 (ROC-AUC, higher is better)</td>
          <td><strong>0.706</strong></td>
          <td>0.616</td>
          <td>0.702</td>
          <td>0.687</td>
      </tr>
      <tr>
          <td>SIDER (ROC-AUC, higher is better)</td>
          <td>0.559</td>
          <td><strong>0.588</strong></td>
          <td>0.558</td>
          <td>0.557</td>
      </tr>
      <tr>
          <td>ClinTox (ROC-AUC, higher is better)</td>
          <td><strong>0.963</strong></td>
          <td>0.515</td>
          <td>0.904</td>
          <td>0.936</td>
      </tr>
  </tbody>
</table>
<p>ST achieves the best DEM in 5 of 10 datasets (ESOL, FreeSolv, BBBP, Tox21, ClinTox), with particularly strong margins on ClinTox (+0.027 over GraphConv) and BBBP (+0.016 over RNNS2S).</p>
<h3 id="linear-model-experiments">Linear Model Experiments</h3>
<p>To further isolate fingerprint quality, the authors replace the MLP with linear models: ridge regression for the regression tasks and L2-penalized logistic regression for classification. On 8 datasets (excluding MUV and SIDER due to class imbalance issues), ST achieves the best DEM in 5 of 8, confirming that the fingerprint quality holds regardless of the downstream model.</p>
<h3 id="stratified-analysis-by-molecule-size">Stratified Analysis by Molecule Size</h3>
<p>On BBBP stratified by SMILES length, ST&rsquo;s ROC-AUC increases with longer SMILES, similar to RNNS2S but unlike GraphConv, whose performance is stable across lengths. This suggests that text-based models extract richer information from longer sequences.</p>
<h3 id="comparison-with-record-scores-large-data">Comparison with Record Scores (Large Data)</h3>
<p>Under the large-data setting (80/10/10 train/val/test split with hyperparameter tuning via Optuna), ST achieves first place only in ClinTox (0.954) but performs comparably to ECFP and graph-based models on the other datasets. This confirms that ST&rsquo;s main advantage is in the low-data regime.</p>
<h2 id="strong-low-data-performance-with-caveats-on-scalability">Strong Low-Data Performance with Caveats on Scalability</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Transformer-based unsupervised pre-training on SMILES produces fingerprints that excel in low-data molecular property prediction, achieving best data efficiency on 5 of 10 MoleculeNet tasks.</li>
<li>The advantage is most pronounced on small datasets (ESOL with 1,128 molecules, FreeSolv with 643, BBBP with 2,053, ClinTox with 1,491) where pre-training enables good generalization.</li>
<li>With sufficient labeled data and hyperparameter tuning, ST fingerprints perform comparably to (but do not surpass) graph-based methods.</li>
<li>Longer SMILES provide richer information for text-based models, as shown by the stratified analysis on BBBP.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Random splits are used for most DEM experiments rather than scaffold splits, which may inflate performance estimates for drug discovery applications where training and test molecules are structurally distinct.</li>
<li>The pre-training corpus (861K SMILES from ChEMBL24) is relatively small by modern standards.</li>
<li>MUV performance is poor across all methods (PRC-AUC near zero), suggesting the DEM framework may not be informative for extremely imbalanced or noisy datasets.</li>
<li>No comparison with BERT-style masked language model pre-training, which later work (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>) would show as a viable alternative.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose three directions: (1) replacing the Transformer with Transformer-XL to handle longer SMILES, (2) multi-task pre-training that jointly predicts molecular descriptors (e.g., molecular weight, <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>) alongside SMILES reconstruction, and (3) better exploitation of enumerated SMILES to constrain the latent space.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL24</td>
          <td>861,000 SMILES</td>
          <td>Unlabeled, randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MoleculeNet (10 datasets)</td>
          <td>643 to 93,127 molecules</td>
          <td>See Table 1 for per-dataset details</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder: 4 blocks each, 4-head attention, 256 embedding dimensions</li>
<li>Pre-training: 5 epochs, Adam optimizer, cross-entropy loss, SMILES enumeration for augmentation</li>
<li>Fingerprint: 1024 dimensions from concatenated mean pool, max pool, and first-token outputs</li>
<li>Downstream: scikit-learn MLP (default hyperparameters) for DEM experiments; ridge/logistic regression for linear model experiments; Optuna for hyperparameter search in large-data comparison</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DSPsleeporg/smiles-transformer">smiles-transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Jupyter notebooks)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>DEM averaged over 7 training fractions (1.25% to 80%), 20 trials each</li>
<li>Random splits for DEM; scaffold splits for HIV, BACE, BBBP in large-data comparison</li>
<li>Metrics: RMSE (regression), ROC-AUC or PRC-AUC (classification) per MoleculeNet conventions</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU type or training time for the pre-training phase.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Honda, S., Shi, S., &amp; Ueda, H. R. (2019). SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. <em>arXiv preprint arXiv:1911.04738</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{honda2019smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Honda, Shion and Shi, Shoi and Ueda, Hiroki R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1911.04738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI+AIS: Hybridizing SMILES with Environment Tokens</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</guid><description>SMI+AIS hybridizes SMILES with Atom-In-SMILES tokens encoding local chemical environments, improving molecular generation binding affinity and synthesizability.</description><content:encoded><![CDATA[<h2 id="a-hybrid-molecular-representation-combining-smiles-and-chemical-environment-tokens">A Hybrid Molecular Representation Combining SMILES and Chemical-Environment Tokens</h2>
<p>This is a <strong>Method</strong> paper that introduces SMI+AIS(N), a hybrid molecular string representation combining standard <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> tokens with <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-In-SMILES (AIS)</a> tokens. AIS tokens encode local chemical environment information (central atom, ring membership, and neighboring atoms) into a single token. The key contribution is a systematic hybridization strategy that selectively replaces the most frequent SMILES tokens with AIS equivalents, preserving SMILES grammar compatibility while enriching token diversity. The method is validated on molecular structure generation via latent space optimization for drug design.</p>
<h2 id="limitations-of-standard-smiles-for-machine-learning">Limitations of Standard SMILES for Machine Learning</h2>
<p>SMILES is the most widely adopted string-based molecular representation, used in major databases like ZINC and PubChem. Despite this ubiquity, SMILES has several well-known limitations for machine learning applications:</p>
<ol>
<li><strong>Non-unique representations</strong>: The same molecule can be encoded as multiple distinct SMILES strings.</li>
<li><strong>Invalid string generation</strong>: Generative models can produce syntactically invalid SMILES that do not correspond to any molecule.</li>
<li><strong>Limited token diversity</strong>: SMILES tokens map one-to-one to atoms or bonds, so the token vocabulary is restricted to the available atom and bond types.</li>
<li><strong>Insufficient chemical context</strong>: Individual SMILES tokens carry no information about the local chemical environment of an atom.</li>
</ol>
<p>Alternative representations like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (guaranteeing validity) and <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> (guaranteeing uniqueness) address some of these issues but share the same fundamental limitation of low token diversity. The Atom-In-SMILES (AIS) representation (Ucak et al., 2023) enriches tokens with neighboring atom and ring information, but using AIS exclusively produces a large vocabulary with many infrequent tokens that can cause data sparsity problems. The authors aim to find a middle ground: adding chemical context to the most common tokens while keeping the vocabulary manageable.</p>
<h2 id="core-innovation-selective-token-hybridization-with-ais">Core Innovation: Selective Token Hybridization with AIS</h2>
<p>The SMI+AIS(N) representation hybridizes standard SMILES with AIS tokens through a frequency-based selection process:</p>
<h3 id="ais-token-structure">AIS Token Structure</h3>
<p>Each AIS token encodes three pieces of information about an atom, delimited by semicolons:</p>
<p>$$
\lbrack \text{central atom} ; \text{ring info} ; \text{neighbor atoms} \rbrack
$$</p>
<p>For example, the oxygen in a carboxyl group of benzoic acid is represented as <code>[O;!R;C]</code>, meaning: oxygen atom, not in a ring, bonded to carbon. In standard SMILES, this would simply be <code>O</code>.</p>
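<p>The three fields split cleanly on the semicolons. A minimal, hypothetical parser (real AIS ring fields may carry more detail than the bare <code>R</code>/<code>!R</code> flag assumed here):</p>

```python
def parse_ais_token(token):
    """Split an AIS token '[central;ring;neighbors]' into its three fields.

    Ring info is '!R' for non-ring atoms, so the flag is True only for 'R'.
    """
    central, ring, neighbors = token.strip('[]').split(';')
    return central, ring == 'R', neighbors
```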
<h3 id="hybridization-procedure">Hybridization Procedure</h3>
<ol>
<li>Convert all SMILES strings in the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> to their full AIS representations.</li>
<li>Count the frequency of each AIS token across the database.</li>
<li>Select the top-N most frequent AIS tokens to form the hybrid vocabulary.</li>
<li>In the hybrid representation, atoms matching these top-N AIS tokens are written in AIS notation; all other atoms use standard SMILES notation.</li>
</ol>
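<p>Steps 2-4 reduce to a frequency count plus a per-atom fallback. A minimal sketch, assuming the AIS and SMILES atom tokens for each molecule have already been aligned one-to-one (the toy corpus below is purely illustrative):</p>

```python
from collections import Counter

def build_hybrid_vocab(ais_corpus, n):
    """Steps 2-3: count AIS token frequencies and keep the top-N."""
    counts = Counter(tok for mol in ais_corpus for tok in mol)
    return {tok for tok, _ in counts.most_common(n)}

def hybridize(ais_atoms, smi_atoms, vocab):
    """Step 4: keep an atom's AIS token if it made the top-N cut,
    otherwise fall back to its plain SMILES token."""
    return [a if a in vocab else s for a, s in zip(ais_atoms, smi_atoms)]

corpus = [["[C;R;CC]", "[C;R;CC]", "[O;!R;C]"], ["[C;R;CC]", "[N;!R;CC]"]]
vocab = build_hybrid_vocab(corpus, n=1)   # only the most frequent token survives
hyb = hybridize(["[O;!R;C]", "[C;R;CC]"], ["O", "c"], vocab)
print(hyb)  # ['O', '[C;R;CC]']: rare oxygen token falls back to SMILES
```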
<p>For benzoic acid, the hybridization produces:</p>
<p>$$
\text{SMI}: \texttt{O=C(O)c1ccccc1}
$$</p>
<p>$$
\text{SMI+AIS}: \texttt{\lbrack O;!R;C\rbrack=\lbrack C;!R;COO\rbrack(\lbrack OH;!R;C\rbrack)c1ccccc1}
$$</p>
<p>The parameter N controls vocabulary size. The authors test N = 50, 100, 150, and 200, finding that N = 100-150 provides the best balance for the ZINC database.</p>
<h3 id="token-frequency-rebalancing">Token Frequency Rebalancing</h3>
<p>A key benefit of hybridization is mitigating the severe token frequency imbalance in standard SMILES. Carbon (C), the most frequent element with ~184 million occurrences in ZINC, is represented by only 16 token types in SMILES. With SMI+AIS(200), carbon is distinguished into 145 token types based on chemical environment, with 74% of carbon occurrences represented by AIS tokens. Less common elements like halogens see minimal change (only 2% AIS representation), which avoids introducing unnecessarily rare tokens.</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Frequency</th>
          <th>SMILES Types</th>
          <th>SMI+AIS(100) Types (AIS %)</th>
          <th>SMI+AIS(200) Types (AIS %)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>C</td>
          <td>183,860,954</td>
          <td>16</td>
          <td>78 (73%)</td>
          <td>145 (74%)</td>
      </tr>
      <tr>
          <td>O</td>
          <td>27,270,229</td>
          <td>8</td>
          <td>16 (11%)</td>
          <td>24 (11%)</td>
      </tr>
      <tr>
          <td>N</td>
          <td>26,022,928</td>
          <td>11</td>
          <td>32 (1%)</td>
          <td>46 (10%)</td>
      </tr>
      <tr>
          <td>X (halogens)</td>
          <td>6,137,030</td>
          <td>7</td>
          <td>10 (2%)</td>
          <td>11 (2%)</td>
      </tr>
      <tr>
          <td>S</td>
          <td>4,581,307</td>
          <td>12</td>
          <td>17 (2%)</td>
          <td>24 (2%)</td>
      </tr>
  </tbody>
</table>
<h2 id="latent-space-optimization-for-molecular-generation">Latent Space Optimization for Molecular Generation</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>The evaluation uses a <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">conditional variational autoencoder (CVAE)</a> with:</p>
<ul>
<li><strong>Encoder</strong>: BERT-style architecture with entity and positional embeddings, 4 multi-head attention layers (8 heads each), producing mean and standard deviation vectors in latent space.</li>
<li><strong>Decoder</strong>: 4 stacked gated recurrent unit (GRU) layers that transform sampled latent vectors (conditioned) back into token sequences.</li>
<li><strong>Training</strong>: 20 epochs on 9 million compounds from the ZINC database (8:1:1 train/valid/test split) under identical conditions for all representations.</li>
</ul>
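<p>The encoder's mean and standard-deviation outputs feed the standard VAE reparameterization step before decoding. A minimal NumPy sketch (variable names and the 128-dimensional latent are illustrative, not from the paper):</p>

```python
import numpy as np

def sample_latent(mu, log_sigma, rng=None):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through the sampling step during training."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

mu = np.zeros(128)             # latent mean from the BERT-style encoder
log_sigma = np.full(128, -2.0) # log standard-deviation head
z = sample_latent(mu, log_sigma)
print(z.shape)  # (128,)
```

The conditioned latent vector is then passed to the stacked GRU decoder for token-by-token generation.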
<h3 id="optimization-setup">Optimization Setup</h3>
<p><a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a> (BO) via BoTorch is applied to the CVAE <a href="/notes/chemistry/molecular-design/generation/latent-space/">latent space</a>, maximizing a multi-objective function:</p>
<p>$$
\text{Obj} = -\text{BA} - 0.5 \times \text{SA}^2
$$</p>
<p>where BA is binding affinity (docking score from QuickVina 2, lower is stronger) and SA is synthetic accessibility score (from RDKit, lower is more synthesizable). Each BO iteration generates 800 candidate latent vectors. Invalid strings receive a penalty objective value of -100.</p>
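<p>The scoring rule, including the invalid-string penalty, can be written directly (a sketch; BA and SA are plain floats produced upstream by the docking and RDKit pipelines):</p>

```python
def objective(ba: float, sa: float, valid: bool = True) -> float:
    """BO target to maximize: reward strong binding (more negative BA)
    and easy synthesis (low SA); fixed penalty for invalid strings."""
    if not valid:
        return -100.0
    return -ba - 0.5 * sa ** 2

print(objective(-9.5, 2.0))              # 7.5  (strong binder, easy to make)
print(objective(-9.5, 4.0))              # 1.5  (same binder, harder synthesis)
print(objective(0.0, 0.0, valid=False))  # -100.0
```

The quadratic SA term means synthesizability penalties grow quickly, which matches the paper's emphasis on avoiding exotic, hard-to-make structures.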
<h3 id="protein-targets">Protein Targets</h3>
<p>Four diverse targets were used to assess generalizability:</p>
<ul>
<li><strong>PDK4</strong> (<a href="https://en.wikipedia.org/wiki/Pyruvate_dehydrogenase_kinase">Pyruvate Dehydrogenase Kinase</a> 4): narrow, deep binding pocket</li>
<li><strong>5-HT1B</strong> (<a href="https://en.wikipedia.org/wiki/5-HT1B_receptor">Serotonin Receptor 1B</a>): shallow, open <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> conformation</li>
<li><strong>PARP1</strong> (<a href="https://en.wikipedia.org/wiki/PARP1">Poly ADP-ribose Polymerase 1</a>): small, flexible molecule binding site</li>
<li><strong>CK1d</strong> (<a href="https://en.wikipedia.org/wiki/Casein_kinase_1">Casein Kinase I</a> Delta): broad, accessible conformation</li>
</ul>
<p>Protein structures were obtained from the <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a> (PDB IDs: 4V26, 4IAQ, 6I8M, 4TN6). Each optimization was run 10 times independently from the same 5 initial compounds selected from BindingDB.</p>
<h3 id="key-results">Key Results</h3>
<p>SMI+AIS(100) consistently achieved the highest objective values across protein targets.</p>
<p><strong>PDK4 Optimization</strong> (Top-1 results over 10 independent runs):</p>
<ul>
<li>SMI+AIS(100) achieved approximately 12% improvement over standard SMILES and 28% improvement over SELFIES based on median Top-1 objective values.</li>
<li>Generated structures exhibited BA scores between -10 and -9 and SA scores between 2.0 and 2.3.</li>
<li>Molecular weights clustered around 400 amu, consistent with the CVAE conditioning.</li>
</ul>
<p><strong>Validity Ratios</strong>: Standard SMILES produced approximately 40% valid structures. SMI+AIS validity improved substantially as N increased, though SMI+AIS(200) saturated slightly, likely because its infrequent tokens were insufficiently trained.</p>
<p><strong>SELFIES</strong>: Despite achieving the highest validity ratio, SELFIES failed to generate chemically meaningful structures with desirable BA and SA scores. The authors attribute this to SELFIES grammar where token meaning is highly context-dependent, causing minor latent space variations to produce large structural changes.</p>
<p><strong>Cross-target consistency</strong>: Improvements held across all four protein targets, with some variation: 5-HT1B showed a smaller Top-1 gap between SMI and SMI+AIS(100), while the other three targets showed clear gains.</p>
<h2 id="improved-molecular-generation-through-chemical-context-enrichment">Improved Molecular Generation Through Chemical Context Enrichment</h2>
<p>The SMI+AIS(N) representation achieves consistent improvements in molecular generation quality compared to both standard SMILES and SELFIES. The core findings are:</p>
<ol>
<li><strong>Binding affinity improvement</strong>: Approximately 7% improvement over standard SMILES for the PDK4 target.</li>
<li><strong>Synthesizability improvement</strong>: Approximately 6% improvement in synthetic accessibility (lower SA scores).</li>
<li><strong>Target independence</strong>: Performance gains transfer across four structurally diverse protein targets.</li>
<li><strong>Preserved structural motifs</strong>: The generative model retains chemically meaningful fragments (e.g., acetamide and <a href="https://en.wikipedia.org/wiki/Piperidine">piperidine</a>) from initial compounds without explicit fragment constraints.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Stereochemistry</strong>: SMI+AIS inherits the limited stereochemistry handling of standard SMILES.</li>
<li><strong>Evaluation scope</strong>: Only molecular generation was tested; property prediction and other ML tasks remain unexplored.</li>
<li><strong>Compute constraints</strong>: Limited computing resources and time confined the study to the generation task.</li>
<li><strong>Single optimization strategy</strong>: Only latent space optimization with Bayesian optimization was evaluated; other generative approaches were not compared.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest extending SMI+AIS to diverse benchmarking tests including molecular property prediction, experimental validation, and broader applications of chemical language models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Vocab</td>
          <td>ZINC Database</td>
          <td>9M compounds</td>
          <td>Canonicalized, deduplicated, split 8:1:1</td>
      </tr>
      <tr>
          <td>Binding targets</td>
          <td>BindingDB</td>
          <td>5 initial compounds per target</td>
          <td>Selected for each protein target</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td>PDB</td>
          <td>4 structures</td>
          <td>IDs: 4V26, 4IAQ, 6I8M, 4TN6</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: AIS token frequency counting on full ZINC database, top-N selection</li>
<li><strong>Generative model</strong>: Conditional VAE with BERT encoder (4 layers, 8 heads) and GRU decoder (4 layers)</li>
<li><strong>Optimization</strong>: Bayesian Optimization via BoTorch (800 candidates per iteration)</li>
<li><strong>Docking</strong>: QuickVina 2 with 25 &Aring; pocket size, 10 docking simulations per ligand</li>
<li><strong>SA scoring</strong>: RDKit SA score</li>
<li><strong>Training</strong>: 20 epochs for all representations under identical conditions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>CVAE architecture details in supplementary (Fig. S9, Tables S2, S4)</li>
<li>No pre-trained weights released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMI+AIS(100) vs SMILES</th>
          <th>SMI+AIS(100) vs SELFIES</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Median Top-1 Obj. Value</td>
          <td>+12%</td>
          <td>+28%</td>
          <td>PDK4 target</td>
      </tr>
      <tr>
          <td>Validity Ratio</td>
          <td>Higher than ~40% (SMILES)</td>
          <td>Lower than SELFIES</td>
          <td>SMI+AIS improves with N</td>
      </tr>
      <tr>
          <td>BA (binding affinity)</td>
          <td>~7% improvement</td>
          <td>Substantial</td>
          <td>Lower (more negative) is better</td>
      </tr>
      <tr>
          <td>SA (synthesizability)</td>
          <td>~6% improvement</td>
          <td>Substantial</td>
          <td>Lower is more synthesizable</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the main text. Optimization wall times are reported in supplementary Table S5.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/herim-han/AIS-Drug-Opt">AIS-Drug-Opt</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Source code and datasets for reproduction</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. Code and processed data are publicly available on GitHub, but no pre-trained model weights are released, the license is unspecified, and hardware requirements are not documented in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Han, H., Yeom, M. S., &amp; Choi, S. (2025). Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation. <em>Scientific Reports</em>, 15, 16892. <a href="https://doi.org/10.1038/s41598-025-01890-7">https://doi.org/10.1038/s41598-025-01890-7</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{han2025hybridization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Han, Herim and Yeom, Min Sun and Choi, Sunghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{16892}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-025-01890-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI-TED: Encoder-Decoder Foundation Models for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/</guid><description>SMI-TED is a family of encoder-decoder transformer models pre-trained on 91M PubChem molecules for molecular property prediction and generation.</description><content:encoded><![CDATA[<h2 id="an-encoder-decoder-chemical-foundation-model-family">An Encoder-Decoder Chemical Foundation Model Family</h2>
<p>SMI-TED is a <strong>Method</strong> paper that introduces a family of encoder-decoder transformer-based foundation models for chemistry. The primary contribution is the SMI-TED289M architecture, a 289-million parameter model pre-trained on 91 million curated SMILES from PubChem, along with a Mixture-of-Experts variant (MoE-OSMI) that scales to 8x289M parameters. The models support molecular property prediction, molecule reconstruction, reaction yield prediction, and few-shot reasoning over molecular embeddings. All model weights and code are open-sourced under an Apache 2.0 license.</p>
<h2 id="bridging-encoding-and-decoding-for-molecular-representations">Bridging Encoding and Decoding for Molecular Representations</h2>
<p>Chemical language models based on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> have gained traction for molecular property prediction and generation. Most existing models, such as <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> and <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, are encoder-only architectures that produce molecular embeddings through mean pooling. While effective for downstream classification and regression, this encoder-only approach has a limitation: mean pooling has no natural inverse, meaning the model cannot reconstruct the input molecule from its latent representation. This restricts the model&rsquo;s utility for generative tasks and limits the interpretability of the learned latent space.</p>
<p>The authors argue that adding a decoder with a reconstruction objective forces the model to encode a more complete set of structural features. Prior work has shown that the quality of pre-training data matters more than the choice of SMILES vs. <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and that large-scale pre-training can yield useful chemical representations. SMI-TED builds on these observations by combining an encoder-decoder architecture with a carefully curated 91-million molecule dataset from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>.</p>
<h2 id="invertible-pooling-and-two-phase-pre-training">Invertible Pooling and Two-Phase Pre-Training</h2>
<p>The core architectural innovation in SMI-TED is a learned pooling mechanism that replaces standard mean or max pooling with an invertible projection. Given token embeddings $\mathbf{x} \in \mathbb{R}^{D \times L}$ (where $D = 202$ is the maximum token count and $L = 768$ is the embedding dimension), the submersion into the latent space $\mathbf{z} \in \mathbb{R}^{L}$ is computed as:</p>
<p>$$
\mathbf{z} = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{W}_1^T \mathbf{x} + \mathbf{b}_1\right)\right)\right) \mathbf{W}_2
$$</p>
<p>where $\mathbf{W}_1 \in \mathbb{R}^{D \times L}$, $\mathbf{b}_1 \in \mathbb{R}^{L}$, and $\mathbf{W}_2 \in \mathbb{R}^{L \times L}$. The immersion (inverse mapping) back to the token space is:</p>
<p>$$
\tilde{\mathbf{x}}^T = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{z} \mathbf{W}_3 + \mathbf{b}_3\right)\right)\right) \mathbf{W}_4
$$</p>
<p>where $\mathbf{W}_3 \in \mathbb{R}^{L \times L}$, $\mathbf{b}_3 \in \mathbb{R}^{L}$, and $\mathbf{W}_4 \in \mathbb{R}^{L \times D}$. A decoder language model then predicts the next token from $\tilde{\mathbf{x}}$.</p>
<p>The encoder uses a modified RoFormer attention mechanism with rotary position embeddings:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>where $R_m$ are position-dependent rotation matrices and $\varphi$ is a random feature map.</p>
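<p>Setting aside the rotary rotations $R_m$, the kernelized attention above can be sketched in NumPy. The feature map below uses $\varphi(x) = \mathrm{ELU}(x) + 1$ as an illustrative positive stand-in for the paper's random feature map:</p>

```python
import numpy as np

def phi(x):
    """Positive feature map (ELU + 1), standing in for random features."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """out_m = sum_n <phi(q_m), phi(k_n)> v_n / sum_n <phi(q_m), phi(k_n)>,
    computed without an N x N matrix by associating (phi(K)^T V) first."""
    qp, kp = phi(q), phi(k)       # (N, d_k)
    num = qp @ (kp.T @ v)         # (N, d_v)
    den = qp @ kp.sum(axis=0)     # (N,)
    return num / den[:, None]

rng = np.random.default_rng(0)
q, k = rng.standard_normal((5, 8)), rng.standard_normal((5, 8))
out = linear_attention(q, k, np.ones((5, 3)))
print(np.allclose(out, 1.0))  # True: each output row is a convex combination
```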
<p><strong>Two-phase pre-training strategy:</strong></p>
<ul>
<li><strong>Phase 1</strong>: The token encoder is pre-trained on 95% of the data using masked language modeling (15% token selection, of which 80% masked, 10% random, 10% unchanged). The remaining 5% trains the encoder-decoder layer, preventing convergence issues from unstable early embeddings.</li>
<li><strong>Phase 2</strong>: After the token embeddings converge, both the encoder and decoder train on 100% of the data jointly.</li>
</ul>
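<p>Phase 1's corruption scheme is the standard BERT recipe. A sketch (integer token IDs and the seeded RNG are illustrative):</p>

```python
import random

def mlm_corrupt(tokens, mask_id, vocab, p_select=0.15, seed=0):
    """Select ~15% of tokens as prediction targets; of those,
    80% become [MASK], 10% a random vocab token, 10% stay unchanged."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [-1] * len(tokens)  # -1 means not a target
    for i, t in enumerate(tokens):
        if rng.random() < p_select:
            labels[i] = t            # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: leave the token unchanged (10% case)
    return corrupted, labels

tokens = list(range(100))
corrupted, labels = mlm_corrupt(tokens, mask_id=-2, vocab=list(range(100)))
print(sum(l != -1 for l in labels), "prediction targets")
```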
<p><strong><a href="https://en.wikipedia.org/wiki/Mixture_of_experts">Mixture-of-Experts</a> (MoE-OSMI):</strong> The MoE variant composes 8 fine-tuned SMI-TED289M expert models with a gating network. Given an input embedding $x$, the output is:</p>
<p>$$
y = \sum_{i=1}^{n} G(x)_i E_i(\hat{x})
$$</p>
<p>where $G(x) = \text{Softmax}(\text{TopK}(x \cdot W_g))$ selects the top $k = 2$ experts per input, setting all other gate values to zero.</p>
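<p>The gating rule keeps only the two largest logits before the softmax, zeroing every other expert. A NumPy sketch (the 16-dimensional input and 8-expert weight shapes are illustrative):</p>

```python
import numpy as np

def top_k_gate(x, W_g, k=2):
    """G(x) = Softmax(TopK(x . W_g)): softmax over the k largest gate
    logits; every other expert's weight is exactly zero."""
    logits = x @ W_g                   # (n_experts,)
    keep = np.argsort(logits)[-k:]     # indices of the k largest logits
    gates = np.zeros_like(logits)
    e = np.exp(logits[keep] - logits[keep].max())  # stable softmax
    gates[keep] = e / e.sum()
    return gates

rng = np.random.default_rng(0)
x, W_g = rng.standard_normal(16), rng.standard_normal((16, 8))
g = top_k_gate(x, W_g)                 # 8 experts, top-2 routing
print(np.count_nonzero(g))             # 2

# The MoE output is then y = sum_i g[i] * expert_i(x_hat), summed over
# the two active fine-tuned SMI-TED experts.
```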
<h2 id="benchmarks-across-property-prediction-generation-and-reaction-yield">Benchmarks Across Property Prediction, Generation, and Reaction Yield</h2>
<h3 id="moleculenet-classification-6-datasets-roc-auc"><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification (6 datasets, ROC-AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BBBP</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
          <th>Tox21</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>73.6 +/- 0.8</td>
          <td>91.2 +/- 1.4</td>
          <td>80.5 +/- 1.65</td>
          <td>86.3 +/- 0.6</td>
          <td>65.5 +/- 0.2</td>
          <td>80.46 +/- 0.2</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>72.9 +/- 0.6</td>
          <td>91.9 +/- 1.8</td>
          <td>80.8 +/- 0.3</td>
          <td>85.7 +/- 0.2</td>
          <td>65.9 +/- 1.3</td>
          <td>79.6 +/- 0.5</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>72.4 +/- 0.4</td>
          <td>90.1 +/- 1.3</td>
          <td>80.6 +/- 0.9</td>
          <td>85.6 +/- 1.1</td>
          <td>67.2 +/- 0.4</td>
          <td>78.1 +/- 0.1</td>
      </tr>
      <tr>
          <td>SMI-TED289M (pre-trained)</td>
          <td>91.46 +/- 0.47</td>
          <td>93.49 +/- 0.85</td>
          <td>80.51 +/- 1.34</td>
          <td>85.58 +/- 0.92</td>
          <td>66.01 +/- 0.88</td>
          <td>81.53 +/- 0.45</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>92.26 +/- 0.57</strong></td>
          <td><strong>94.27 +/- 1.83</strong></td>
          <td>76.85 +/- 0.89</td>
          <td><strong>88.24 +/- 0.50</strong></td>
          <td>65.68 +/- 0.45</td>
          <td><strong>81.85 +/- 1.42</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED achieves the best results in 4 of 6 classification tasks. Notably, the pre-trained version (without fine-tuning) already matches or exceeds many baselines on BBBP, ClinTox, and Tox21.</p>
<h3 id="moleculenet-regression-5-datasets-mae-for-qm9qm8-rmse-for-esolfreesolvlipophilicity">MoleculeNet regression (5 datasets, MAE for QM9/QM8, RMSE for ESOL/FreeSolv/Lipophilicity)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>1.5894</td>
          <td>0.0102</td>
          <td>0.880</td>
          <td>2.342</td>
          <td>0.700</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>3.241</td>
          <td>0.0143</td>
          <td>0.98</td>
          <td>2.18</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>1.3246</strong></td>
          <td><strong>0.0095</strong></td>
          <td><strong>0.6112</strong></td>
          <td><strong>1.2233</strong></td>
          <td><strong>0.5522</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED289M achieves the best results across all 5 regression tasks when fine-tuned. The improvements are substantial on ESOL (0.61 vs. 0.82 for next best) and FreeSolv (1.22 vs. 1.91 for next best).</p>
<h3 id="reaction-yield-prediction-buchwald-hartwig-c-n-cross-coupling">Reaction yield prediction (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling)</h3>
<p>The model was tested on Pd-catalyzed Buchwald-Hartwig reactions with 3,955 reactions across varying train/test splits. Selected $R^2$ results:</p>
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Yield-BERT (Aug)</th>
          <th>DRFP</th>
          <th>SMI-TED289M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>70/30</td>
          <td>0.97</td>
          <td>0.95</td>
          <td><strong>0.984</strong></td>
      </tr>
      <tr>
          <td>10/90</td>
          <td>0.81</td>
          <td>0.81</td>
          <td><strong>0.961</strong></td>
      </tr>
      <tr>
          <td>2.5/97.5</td>
          <td>0.61</td>
          <td>0.62</td>
          <td><strong>0.875</strong></td>
      </tr>
      <tr>
          <td>Test 1-4 avg</td>
          <td>0.58</td>
          <td>0.71</td>
          <td><strong>0.983</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED shows particularly strong performance in low-data regimes. With only 2.5% training data, it achieves $R^2 = 0.875$, compared to 0.61-0.62 for competing methods.</p>
<h3 id="moses-molecular-generation-benchmarks"><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> molecular generation benchmarks</h3>
<p>SMI-TED is competitive with baselines including CharRNN, SMILES VAE, JT-VAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen-7b</a>, and <a href="/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/">GP-MoLFormer</a> on standard metrics (validity, uniqueness, novelty, FCD, internal diversity). It achieves superior scaffold cosine similarity (Scaf) and nearest-neighbor similarity (SNN) scores.</p>
<h3 id="latent-space-compositionality">Latent space compositionality</h3>
<p>Using six families of carbon chains ($\mathcal{F} = \{\text{CC}, \text{CO}, \text{CN}, \text{CS}, \text{CF}, \text{CP}\}$), the authors test whether the embedding space respects hierarchical distance structures. A linear regression on SMI-TED embeddings yields $R^2 = 0.99$ and $MSE = 0.002$, compared to $R^2 = 0.55$ and $MSE = 0.237$ for MoLFormer. This indicates that the SMI-TED latent space captures compositional chemical relationships far more faithfully.</p>
<p>For structure-property analysis on <a href="/notes/chemistry/datasets/qm9/">QM9</a>, nitrogen-containing molecules represent 9.10% of the dataset but account for 32.81% of the top 10% by HOMO energy. In the SMI-TED latent space, these molecules cluster distinctly (<a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> of 2.82 vs. 4.28 for MoLFormer), suggesting the decoder objective encourages encoding of functional group information.</p>
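<p>The compositionality probe amounts to fitting a linear map from embeddings to the target structure and scoring it with $R^2$. A generic NumPy sketch (the synthetic data is illustrative, standing in for the actual embeddings):</p>

```python
import numpy as np

def linear_r2(X, y):
    """Least-squares fit y ~ X w + b, returning the coefficient of
    determination R^2 = 1 - SS_res / SS_tot."""
    A = np.column_stack([X, np.ones(len(X))])  # append intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ w) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))              # stand-in embeddings
y = X @ rng.standard_normal(8) + 0.01 * rng.standard_normal(200)
print(linear_r2(X, y) > 0.99)  # True: near-linear structure is recovered
```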
<h2 id="strong-performance-with-a-compositional-latent-space">Strong Performance with a Compositional Latent Space</h2>
<p>SMI-TED289M demonstrates competitive or superior performance across molecular property prediction, reaction yield prediction, and molecular generation benchmarks. The key findings include:</p>
<ol>
<li><strong>Broad applicability</strong>: The single pre-trained model achieves strong results across classification (4/6 best), regression (5/5 best), reaction yield, and generation tasks.</li>
<li><strong>Low-data robustness</strong>: The pre-training on 91M molecules provides chemical knowledge that transfers well to small training sets, as shown by the reaction yield experiments where SMI-TED maintains high accuracy even at 2.5% training data.</li>
<li><strong>Compositional embeddings</strong>: The encoder-decoder architecture produces a latent space where molecular similarity follows chemical intuition, with near-perfect linear relationships between functional group families ($R^2 = 0.99$).</li>
<li><strong>Structure-property capture</strong>: The reconstruction objective appears to enforce encoding of chemically meaningful features like nitrogen substituent effects on <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO</a> energy, outperforming encoder-only models in latent space organization.</li>
</ol>
<p><strong>Limitations</strong>: The paper evaluates on MoleculeNet benchmarks, which are well-studied but may not reflect performance on more diverse chemical tasks. The BBBP classification result (92.26) shows a large jump from prior methods (73.6 for MoLFormer), which is worth scrutinizing. The MoE variant is evaluated only in supplementary materials, and scaling behavior beyond 8 experts is not explored.</p>
<p><strong>Future directions</strong>: The authors note that compositionality of the learned representations suggests potential for reasoning applications, though they acknowledge that stronger claims require further studies following compositionality analysis methodologies from natural language processing. The model has been integrated into the dZiner agent for inverse molecular design.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (curated)</td>
          <td>91M molecules, 4B tokens</td>
          <td>Deduplicated, canonicalized, validity-checked</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (BBBP, ClinTox, HIV, BACE, SIDER, Tox21)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (QM9, QM8, ESOL, FreeSolv, Lipophilicity)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>MOSES</td>
          <td>1.94M molecules</td>
          <td>Train/test/scaffold test splits</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig HTE</td>
          <td>3,955 reactions</td>
          <td>3x 1536-well plates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Masked language modeling for token encoder (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li>Two-phase pre-training (95/5 split then 100% joint training)</li>
<li>RoFormer attention with rotary position embeddings</li>
<li>Vocabulary: 2,993 tokens (2,988 molecular + 5 special)</li>
<li>Maximum sequence length: 202 tokens (covers 99.4% of PubChem)</li>
<li>Learning rate: 1.6e-4, batch size: 288 molecules</li>
<li>40 epochs over the full PubChem corpus</li>
<li>10 random seeds per experiment for robustness</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Parameters</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMI-TED289M base</td>
          <td>289M</td>
          <td>47M</td>
          <td>242M</td>
          <td>12 layers, 12 attention heads, hidden size 768, dropout 0.2</td>
      </tr>
      <tr>
          <td>MoE-OSMI</td>
          <td>8x289M</td>
          <td>-</td>
          <td>-</td>
          <td>8 experts, top-k=2 routing, gating network</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC</li>
<li>Regression: MAE (QM9, QM8), RMSE (ESOL, FreeSolv, Lipophilicity)</li>
<li>Reaction yield: $R^2$</li>
<li>Generation: Validity, uniqueness, novelty, FCD, IntDiv, Scaf, SNN (MOSES metrics)</li>
<li>Latent space: Linear regression $R^2$, MSE, Davies-Bouldin index, t-SNE visualization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>24 NVIDIA V100 GPUs (16GB)</li>
<li>4 nodes with DDP (Distributed Data Parallel)</li>
<li>Pre-training: 40 epochs on 91M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/materials/tree/main/models/smi_ted">IBM/materials (smi_ted)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Training, fine-tuning scripts, Jupyter notebooks</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/materials.smi-ted">ibm/materials.smi-ted</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.15603701">Zenodo archive</a></td>
          <td>Code + Data</td>
          <td>Apache-2.0</td>
          <td>Archival copy of scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Soares, E., Vital Brazil, E., Shirasuna, V., Zubarev, D., Cerqueira, R., &amp; Schmidt, K. (2025). An open-source family of large encoder-decoder foundation models for chemistry. <em>Communications Chemistry</em>, 8(1). <a href="https://doi.org/10.1038/s42004-025-01585-0">https://doi.org/10.1038/s42004-025-01585-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{soares2025smited,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{An open-source family of large encoder-decoder foundation models for chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Communications Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42004-025-01585-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Seq2seq Fingerprint: Unsupervised Molecular Embedding</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/</guid><description>Seq2seq fingerprint uses a GRU encoder-decoder trained on SMILES self-translation to produce unsupervised molecular embeddings for property prediction.</description><content:encoded><![CDATA[<h2 id="an-unsupervised-seq2seq-method-for-molecular-fingerprints">An Unsupervised Seq2seq Method for Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces seq2seq fingerprint, an unsupervised molecular embedding approach based on sequence-to-sequence learning. The core idea is to train a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> encoder-decoder network to translate <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings to themselves, then extract the intermediate fixed-length vector as a molecular fingerprint. These fingerprints are then used with standard supervised classifiers for downstream property prediction tasks such as lipophilicity (LogP) classification and promiscuity prediction.</p>
<h2 id="the-labeled-data-bottleneck-in-drug-discovery">The Labeled Data Bottleneck in Drug Discovery</h2>
<p>Machine learning approaches to molecular property prediction depend on fixed-length feature vectors as inputs. Traditional molecular fingerprints fall into two categories: hash-based methods like Extended-Connectivity Fingerprints (ECFP) that are fast but lossy and non-invertible, and biologist-guided local-feature fingerprints that require domain expertise and are task-specific. Supervised deep learning fingerprints (e.g., neural fingerprints) can learn representations from data but require large amounts of labeled data, which is expensive to obtain in drug discovery due to the cost of biological experiments.</p>
<p>The authors identify three limitations of existing approaches:</p>
<ol>
<li>Hash-based fingerprints discard information during the hashing process and cannot reconstruct the original molecule</li>
<li>Local-feature fingerprints require expert knowledge and generalize poorly across tasks</li>
<li>Supervised deep learning fingerprints are data-hungry and fail when labeled data is limited</li>
</ol>
<h2 id="self-translation-as-unsupervised-molecular-encoding">Self-Translation as Unsupervised Molecular Encoding</h2>
<p>The key insight is to adapt the <a href="https://en.wikipedia.org/wiki/Seq2seq">sequence-to-sequence</a> learning framework from machine translation (originally English-to-French) to molecular representation learning by setting both the input and output to the same SMILES string. Since the intermediate vector must contain enough information to reconstruct the original SMILES, it serves as a rich, task-agnostic molecular fingerprint.</p>
<p>The architecture consists of two components:</p>
<ul>
<li><strong>Perceiver network</strong>: A multi-layer GRU encoder that reads the SMILES string and compresses it into a fixed-length vector</li>
<li><strong>Interpreter network</strong>: A multi-layer GRU decoder that reconstructs the original SMILES from the fingerprint vector</li>
</ul>
<p>The GRU cell computes a sequence of outputs $(s_1, \ldots, s_T)$ from input sequences $(x_1, \ldots, x_T)$ by iterating:</p>
<p>$$
z_t = \sigma_g(W_z x_t + U_z s_{t-1} + b_z)
$$</p>
<p>$$
r_t = \sigma_r(W_r x_t + U_r s_{t-1} + b_r)
$$</p>
<p>$$
h_t = \tanh(W_h x_t + U_h(s_{t-1} \circ r_t) + b_h)
$$</p>
<p>$$
s_t = (1 - z_t) \circ h_t + z_t \circ s_{t-1}
$$</p>
<p>where $z_t$ is the update gate, $r_t$ is the reset gate, $\circ$ denotes element-wise multiplication, and $W$, $U$, $b$ are trainable parameters.</p>
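<p>As a concrete reference, the gate equations above can be sketched in a few lines of NumPy. This is a minimal, illustrative single-cell implementation with random weights and toy dimensions, not the paper's trained perceiver network.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, s_prev, p):
    """One GRU update following the gate equations above."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ s_prev + p["bz"])  # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ s_prev + p["br"])  # reset gate
    h_t = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (s_prev * r_t) + p["bh"])  # candidate
    return (1.0 - z_t) * h_t + z_t * s_prev  # interpolate candidate and old state

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16  # toy sizes; the paper uses a latent dimension of 256
p = {}
for g in ("z", "r", "h"):
    p["W" + g] = rng.normal(scale=0.5, size=(d_hid, d_in))
    p["U" + g] = rng.normal(scale=0.5, size=(d_hid, d_hid))
    p["b" + g] = np.zeros(d_hid)

s = np.zeros(d_hid)                     # s_0 = 0
for x_t in rng.normal(size=(5, d_in)):  # five embedded input tokens
    s = gru_step(x_t, s, p)
```

<p>Because the state is always a convex combination of a $\tanh$ candidate and the previous state, every entry of the hidden vector stays bounded in $(-1, 1)$.</p>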
<p>Several adaptations to the original seq2seq framework make this work for molecular data:</p>
<ol>
<li><strong>GRU instead of LSTM</strong>: GRU provides comparable performance with faster training, which is important given the large training data pool</li>
<li><strong>Attention mechanism</strong>: Establishes a stronger connection between the perceiver and interpreter networks via soft alignment, addressing the challenge of passing information through hidden memory for long sequences (SMILES can be up to 250 characters)</li>
<li><strong>Dropout layers</strong>: Added to input and output gates (but not hidden memory transfer) following the approach of Zaremba et al. to combat overfitting when training on large datasets</li>
<li><strong>Fingerprint extraction layer</strong>: A fixed-unit fully connected layer combined with a GRU cell state concatenation layer is inserted between encoder and decoder to explicitly output the fingerprint vector</li>
<li><strong>Reverse target sequence</strong>: Following Sutskever et al., the target sequence is reversed to improve SGD optimization</li>
<li><strong>Bucket training</strong>: Sequences are distributed into buckets by length and padded to enable GPU parallelization</li>
</ol>
<h2 id="classification-experiments-on-logp-and-pm2-datasets">Classification Experiments on LogP and PM2 Datasets</h2>
<h3 id="training-setup">Training Setup</h3>
<p>The unsupervised training used 334,092 valid SMILES representations from combined LogP and PM2-full datasets obtained from the National Center for Advancing Translational Sciences (NCATS) at NIH. Three model variants were trained with fingerprint dimensions of 512, 768, and 1024, differing in the number of GRU layers (2, 3, and 4 respectively) while keeping the latent dimension at 256. Each model was trained for 24 hours on a workstation with an Intel i7-6700K CPU, 16 GB RAM, and an NVIDIA GTX 1080 GPU.</p>
<h3 id="reconstruction-performance">Reconstruction Performance</h3>
<p>The models were evaluated on their ability to reconstruct SMILES strings from their fingerprints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>GRU Layers</th>
          <th>Latent Dim</th>
          <th>Perplexity</th>
          <th>Exact Match Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>seq2seq-512</td>
          <td>2</td>
          <td>256</td>
          <td>1.00897</td>
          <td>94.24%</td>
      </tr>
      <tr>
          <td>seq2seq-768</td>
          <td>3</td>
          <td>256</td>
          <td>1.00949</td>
          <td>92.92%</td>
      </tr>
      <tr>
          <td>seq2seq-1024</td>
          <td>4</td>
          <td>256</td>
          <td>1.01472</td>
          <td>90.26%</td>
      </tr>
  </tbody>
</table>
<p>Deeper models showed lower reconstruction accuracy, possibly because larger fingerprint spaces contain more unused (null) regions and require longer training to converge.</p>
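<p>The perplexities in the table follow directly from per-token cross-entropy: perplexity is the exponential of the mean per-token loss. The token losses below are hypothetical values chosen to land near the table's ~1.009 figures; a perplexity this close to 1 means the decoder is nearly certain of each reconstructed token.</p>

```python
import math

# Hypothetical per-token cross-entropy losses in nats (not from the paper)
token_losses = [0.0090, 0.0085, 0.0095, 0.0089]
perplexity = math.exp(sum(token_losses) / len(token_losses))
```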
<h3 id="classification-results">Classification Results</h3>
<p>Two labeled datasets were used for downstream classification:</p>
<ul>
<li><strong>LogP</strong>: 10,850 samples with <a href="https://en.wikipedia.org/wiki/Partition_coefficient">water-octanol partition coefficient</a> values, binarized at a threshold of 1.88</li>
<li><strong>PM2-10k</strong>: 10,000 samples with binary promiscuity class labels</li>
</ul>
<p>The seq2seq fingerprints were evaluated with three ensemble classifiers (<a href="https://en.wikipedia.org/wiki/AdaBoost">AdaBoost</a>, <a href="https://en.wikipedia.org/wiki/Gradient_boosting">GradientBoost</a>, RandomForest) against circular fingerprints (ECFP) and neural fingerprints. Results are 100-run averages of 5-fold cross-validation accuracy.</p>
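<p>The evaluation protocol (fixed fingerprint features fed to an ensemble classifier, scored by k-fold cross-validation accuracy) can be sketched with scikit-learn. The features and labels here are synthetic stand-ins; a real run would use the 512/768/1024-dimensional encoder outputs and LogP labels thresholded at 1.88, and average 100 such runs.</p>

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for fingerprints (X) and binarized property labels (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 32))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
fold_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
mean_acc = fold_acc.mean()  # one of the 100 runs the paper averages
```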
<p><strong>LogP classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3674</td>
          <td>0.0074</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.6080</td>
          <td>0.0135</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.7664</strong></td>
          <td>0.0043</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.7342</td>
          <td>0.0042</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.7350</td>
          <td>0.0060</td>
      </tr>
  </tbody>
</table>
<p><strong>PM2-10k classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3938</td>
          <td>0.0114</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.5227</td>
          <td>0.0112</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.6206</strong></td>
          <td>0.0198</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.6036</td>
          <td>0.0147</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.5741</td>
          <td>0.0086</td>
      </tr>
  </tbody>
</table>
<p>The seq2seq fingerprint outperformed both baselines across all configurations. Despite the seq2seq-1024 model having lower reconstruction accuracy, it provided the best classification performance, suggesting that the longer fingerprint captures more discriminative information for downstream tasks even if the reconstruction is less exact.</p>
<h2 id="unsupervised-transfer-learning-for-molecular-properties">Unsupervised Transfer Learning for Molecular Properties</h2>
<p>The results demonstrate that unsupervised pretraining on large unlabeled molecular datasets can produce fingerprints that transfer well to supervised property prediction with limited labels. The key advantages confirmed by the experiments are:</p>
<ol>
<li><strong>Label-free training</strong>: The unsupervised approach uses essentially unlimited SMILES data, avoiding the expensive label collection process</li>
<li><strong>Task-agnostic representations</strong>: The same fingerprints work across different classification tasks (lipophilicity and promiscuity) without retraining</li>
<li><strong>Invertibility</strong>: The fingerprints contain enough information to reconstruct the original SMILES (up to 94.24% exact match), unlike hash-based methods</li>
</ol>
<p><strong>Limitations</strong> acknowledged by the authors include:</p>
<ul>
<li>Long training times (24 hours per model variant), motivating future work on distributed training</li>
<li>The relationship between fingerprint dimensionality and downstream performance is non-monotonic (768-dim underperforms 512-dim on some tasks), suggesting sensitivity to hyperparameter choices</li>
<li>Only classification tasks were evaluated; regression performance was not assessed</li>
<li>The comparison baselines are limited to ECFP and neural fingerprints from 2015</li>
</ul>
<p><strong>Future directions</strong> proposed include distributed training strategies, hyperparameter optimization methods, and semi-supervised extensions that incorporate label information into the fingerprint training.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unsupervised training</td>
          <td>LogP + PM2-full (combined)</td>
          <td>334,092 SMILES</td>
          <td>Obtained from NCATS at NIH</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>LogP</td>
          <td>10,850 samples</td>
          <td>Binary labels at LogP threshold 1.88</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>PM2-10k</td>
          <td>10,000 samples</td>
          <td>Binary promiscuity labels</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder-decoder: Multi-layer GRU with attention mechanism and dropout</li>
<li>Fingerprint dimensions: 512, 768, 1024 (with 2, 3, 4 GRU layers respectively)</li>
<li>Latent dimension: 256 for all variants</li>
<li>Downstream classifiers: AdaBoost, GradientBoost, RandomForest</li>
<li>Evaluation: 5-fold cross-validation, 100-run averages</li>
<li>Baselines: ECFP via RDKit, Neural Fingerprint from HIPS/neural-fingerprint</li>
</ul>
<h3 id="models">Models</h3>
<p>Three model variants trained for 24 hours each. The paper states code would become publicly available after acceptance, but no public repository has been confirmed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Task</th>
          <th>Configuration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification accuracy</td>
          <td>0.7664</td>
          <td>LogP</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Classification accuracy</td>
          <td>0.6206</td>
          <td>PM2-10k</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Exact match reconstruction</td>
          <td>94.24%</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
      <tr>
          <td>Perplexity</td>
          <td>1.00897</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Intel i7-6700K @ 4.00 GHz, 16 GB RAM, NVIDIA GTX 1080 GPU</li>
<li>Hyperparameter search and classifier training: TACC Lonestar 5 cluster</li>
<li>Training time: 24 hours per model variant</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HIPS/neural-fingerprint">Neural Fingerprint (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline comparison code</td>
      </tr>
  </tbody>
</table>
<p>The authors indicated the seq2seq fingerprint code would be released after acceptance, but no public repository has been found as of this writing. The datasets were sourced from NCATS/NIH.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Wang, S., Zhu, F., &amp; Huang, J. (2017). Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. <em>Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB &lsquo;17)</em>, 285-294. <a href="https://doi.org/10.1145/3107411.3107424">https://doi.org/10.1145/3107411.3107424</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xu2017seq2seq,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Zheng and Wang, Sheng and Zhu, Feiyun and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{285--294}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3107411.3107424}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>S4 Structured State Space Models for De Novo Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/s4-chemical-language-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/s4-chemical-language-modeling/</guid><description>S4 state space models are applied to chemical language modeling for de novo drug design, outperforming LSTMs and GPTs in bioactivity learning from SMILES.</description><content:encoded><![CDATA[<h2 id="structured-state-spaces-meet-chemical-language-modeling">Structured State Spaces Meet Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces structured state space sequence (S4) models to chemical language modeling (CLM) for de novo drug design. S4 models have a dual formulation: they process entire input sequences via convolution during training (like Transformers) and generate sequences element-by-element via recurrence during inference (like LSTMs). The authors benchmark S4 against LSTM and GPT architectures across multiple drug discovery tasks, including drug-like molecule generation, bioactivity learning, chemical space exploration, natural product design, and prospective kinase inhibitor design validated by molecular dynamics simulations.</p>
<h2 id="bridging-the-lstm-transformer-gap-in-molecular-generation">Bridging the LSTM-Transformer Gap in Molecular Generation</h2>
<p>Chemical language models (CLMs) generate molecules by learning the &ldquo;chemical language&rdquo; of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string representations. The two dominant architectures for CLMs are LSTMs and GPTs, each with complementary strengths and limitations:</p>
<ul>
<li><strong>LSTMs</strong> generate sequences recurrently (element-by-element), which enables efficient generation and good learning of local/short-range dependencies. However, their sequential information bottleneck limits learning of global sequence properties.</li>
<li><strong>GPTs</strong> (Transformer decoders) process the entire input at once, better capturing global properties like bioactivity. However, they become increasingly compute-intensive for longer SMILES strings and struggle with chemical space exploration at higher sampling temperatures.</li>
</ul>
<p>Complex molecular properties like bioactivity can emerge from separated portions of a SMILES string (e.g., distant functional groups in the linear notation). Neither architecture fully addresses the need to learn these long-range dependencies while maintaining efficient, robust generation. The chemical space, estimated at up to $10^{60}$ small molecules, demands models that can both capture complex property relationships and explore diverse scaffolds efficiently.</p>
<h2 id="the-dual-nature-of-s4-convolution-meets-recurrence">The Dual Nature of S4: Convolution Meets Recurrence</h2>
<p>S4 models are built on discrete <a href="https://en.wikipedia.org/wiki/State-space_model">state space models</a>, which map an input sequence $\mathbf{u}$ to an output sequence $\mathbf{y}$ through learnable parameters $\overline{\mathbf{A}} \in \mathbb{R}^{N \times N}$, $\overline{\mathbf{B}} \in \mathbb{R}^{N \times 1}$, $\overline{\mathbf{C}} \in \mathbb{R}^{1 \times N}$, and $\overline{\mathbf{D}} \in \mathbb{R}^{1 \times 1}$:</p>
<p>$$
x_{k} = \overline{\mathbf{A}} x_{k-1} + \overline{\mathbf{B}} u_{k}
$$</p>
<p>$$
y_{k} = \overline{\mathbf{C}} x_{k} + \overline{\mathbf{D}} u_{k}
$$</p>
<p>This linear recurrence can equivalently be &ldquo;unrolled&rdquo; into a global convolution:</p>
<p>$$
\mathbf{y} = \mathbf{u} * \overline{\mathbf{K}}
$$</p>
<p>where $\overline{\mathbf{K}}$ is a convolution filter parameterized by $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\overline{\mathbf{C}}$. This duality is the core innovation for CLMs:</p>
<ul>
<li><strong>Training</strong>: S4 uses the convolutional formulation to learn from entire SMILES sequences simultaneously, capturing global molecular properties.</li>
<li><strong>Generation</strong>: S4 switches to the recurrent formulation, producing SMILES tokens one at a time for efficient, robust chemical space exploration.</li>
</ul>
<p>S4 addresses the numerical instabilities of naive state space models through high-order polynomial projection operators (HiPPO) and reduction to the stable Cauchy kernel computation, enabling effective learning of long-range dependencies.</p>
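<p>The recurrence/convolution duality above can be verified numerically: stepping the state equations one input at a time and convolving with the unrolled filter $\overline{K}_j = \overline{\mathbf{C}}\,\overline{\mathbf{A}}^{j}\,\overline{\mathbf{B}}$ (plus the $\overline{\mathbf{D}}$ feedthrough) give the same output. This is a toy linear SSM with random matrices, not S4's HiPPO-initialized parameterization.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 20
A = rng.normal(size=(N, N))
A /= 2 * np.linalg.norm(A, 2)  # spectral norm 0.5, so the recurrence is stable
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
D = rng.normal(size=(1, 1))
u = rng.normal(size=T)

# Recurrent view (generation): step the hidden state one input at a time
x = np.zeros((N, 1))
y_rec = np.empty(T)
for k in range(T):
    x = A @ x + B * u[k]
    y_rec[k] = (C @ x + D * u[k]).item()

# Convolutional view (training): unroll the recurrence into a filter K
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(T)])
y_conv = np.array([sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(T)])
y_conv += D.item() * u  # feedthrough term
```

<p>In S4 itself, materializing $\overline{\mathbf{A}}^{j}$ this way would be unstable and slow; the HiPPO structure and Cauchy-kernel reduction exist precisely to compute $\overline{K}$ efficiently.</p>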
<p>For molecular ranking after fine-tuning, the log-likelihood score subtracts the pre-training likelihood to isolate target-specific information:</p>
<p>$$
\mathcal{L}_{\text{score}}(\mathbf{M}) = \mathcal{L}(\mathbf{M}_{\text{ft}}) - \mathcal{L}(\mathbf{M}_{\text{pt}})
$$</p>
<p>where $\mathcal{L}(\mathbf{M}_{\text{ft}})$ and $\mathcal{L}(\mathbf{M}_{\text{pt}})$ are the fine-tuned and pre-trained model log-likelihoods, respectively.</p>
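<p>The ranking step is then a one-liner: subtract the pre-training log-likelihood per molecule and sort descending. The molecule names and likelihood values below are hypothetical, purely to illustrate the scoring.</p>

```python
# Hypothetical per-molecule log-likelihoods under the fine-tuned (ft) and
# pre-trained (pt) models; higher score = more target-specific signal.
mols = ["mol_a", "mol_b", "mol_c"]
ll_ft = {"mol_a": -12.0, "mol_b": -9.5, "mol_c": -15.0}
ll_pt = {"mol_a": -11.0, "mol_b": -14.0, "mol_c": -15.5}

scores = {m: ll_ft[m] - ll_pt[m] for m in mols}
ranked = sorted(mols, key=lambda m: scores[m], reverse=True)
# mol_b gains the most likelihood from fine-tuning, so it ranks first
```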
<h2 id="benchmarking-s4-across-drug-discovery-tasks">Benchmarking S4 Across Drug Discovery Tasks</h2>
<h3 id="drug-like-molecule-generation">Drug-like molecule generation</h3>
<p>All three CLMs (S4, LSTM, GPT) were pre-trained on 1.9M canonical SMILES from ChEMBL v31 (molecules with fewer than 100 tokens). Each model generated 102,400 SMILES strings de novo.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>S4</td>
          <td>99,268 (97%)</td>
          <td>98,712 (96%)</td>
          <td>95,552 (93%)</td>
      </tr>
      <tr>
          <td>LSTM</td>
          <td>97,151 (95%)</td>
          <td>96,618 (94%)</td>
          <td>82,988 (81%)</td>
      </tr>
      <tr>
          <td>GPT</td>
          <td>93,580 (91%)</td>
          <td>93,263 (91%)</td>
          <td>91,590 (89%)</td>
      </tr>
  </tbody>
</table>
<p>S4 produces the most valid, unique, and novel molecules. Error analysis reveals that each architecture shows different failure modes: LSTMs struggle most with branching errors, GPTs with ring and bond assignment errors, while S4 generates fewer branching and ring errors but more bond assignment errors than LSTM. This pattern supports the hypothesis that S4 captures long-range dependencies (branching, ring opening/closure) better while local dependencies (bond assignment) are handled better by recurrent processing.</p>
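<p>The three headline metrics in the table reduce to simple set logic once a validity check is available. In the sketch below, a toy balanced-branches check stands in for a real SMILES parser (in practice one would parse with a cheminformatics toolkit such as RDKit), and the tiny generated batch and training set are illustrative only.</p>

```python
# Toy generated batch and training set; in practice these would be 102,400
# sampled SMILES and the ChEMBL pre-training corpus.
generated = ["CCO", "c1ccccc1", "CCO", "C1CC1", "CC(C", "CCN"]
training_set = {"CCO", "CCN"}

def is_valid(smi):
    # Stand-in validity check: branch parentheses must balance.
    depth = 0
    for ch in smi:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

valid = [s for s in generated if is_valid(s)]  # validity: parseable strings
unique = set(valid)                            # uniqueness: distinct valid strings
novel = unique - training_set                  # novelty: unseen during training
```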
<h3 id="bioactivity-learning-via-transfer-learning">Bioactivity learning via transfer learning</h3>
<p>Five fine-tuning campaigns were conducted on targets from the LIT-PCBA dataset: PKM2, <a href="https://en.wikipedia.org/wiki/Mitogen-activated_protein_kinase_1">MAPK1</a>, GBA, mTORC1, and TP53. After fine-tuning, models ranked held-out test molecules by learned log-likelihoods to evaluate bioactive compound prioritization.</p>
<p>S4 outperformed both benchmarks across targets. Wilcoxon signed-rank tests on pooled scores confirmed statistically significant superiority:</p>
<ul>
<li>S4 vs. LSTM: $p$ [top 10] = 8.41e-6, $p$ [top 50] = 2.93e-7, $p$ [top 100] = 1.45e-7</li>
<li>S4 vs. GPT: $p$ [top 10] = 2.33e-3, $p$ [top 50] = 3.72e-3, $p$ [top 100] = 2.61e-2</li>
</ul>
<p>TP53 was the most challenging target: no model consistently retrieved actives in the top 10, possibly due to <a href="/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/">activity cliffs</a> in the test set.</p>
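<p>A Wilcoxon signed-rank comparison of this kind can be reproduced with SciPy on paired per-run scores. The score vectors below are hypothetical (fractions of actives retrieved per run), not the paper's pooled data; the point is only the shape of the test.</p>

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired retrieval scores per fine-tuning run; the paper pools
# such scores across targets before testing.
s4 = np.array([0.42, 0.38, 0.51, 0.47, 0.44, 0.40, 0.49, 0.45])
lstm = np.array([0.35, 0.33, 0.44, 0.41, 0.36, 0.37, 0.43, 0.39])

stat, p_value = wilcoxon(s4, lstm, alternative="greater")  # one-sided: S4 > LSTM
```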
<h3 id="chemical-space-exploration-with-temperature-sampling">Chemical space exploration with temperature sampling</h3>
<p>Models were evaluated across sampling temperatures from $T = 1.0$ to $T = 2.0$ on three metrics: SMILES validity, rediscovery rate of known actives, and scaffold diversity. Key findings:</p>
<ul>
<li><strong>Validity</strong>: S4 and LSTM maintain higher validity than GPT at elevated temperatures (GPT median validity drops below 40% at high T).</li>
<li><strong>Rediscovery</strong>: S4 outperforms LSTM in rediscovering bioactive molecules at all temperatures.</li>
<li><strong>Scaffold diversity</strong>: LSTM achieves the highest number of unique scaffold clusters (median 6,602 at $T = 1.75$), with S4 a close second (6,520 clusters).</li>
</ul>
<p>S4 provides the best balance between bioactivity capture and structural diversity.</p>
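<p>Temperature sampling itself is a small transform on the next-token logits: dividing by $T$ before the softmax flattens the distribution as $T$ grows, which is what trades validity for exploration. A minimal sketch with toy logits (not a trained model's outputs):</p>

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Sample one token index from a temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([3.0, 1.0, 0.0, -1.0])  # toy next-token logits
rng = np.random.default_rng(0)
top_frac = {}
for T in (1.0, 2.0):
    draws = [sample_token(logits, T, rng) for _ in range(2000)]
    top_frac[T] = draws.count(0) / len(draws)  # share of the greedy token
# Higher T flattens the distribution, so the greedy token is picked less often.
```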
<h3 id="natural-product-design">Natural product design</h3>
<p>Models were trained on 32,360 large natural product SMILES (length &gt; 100 tokens) from the COCONUT database and used to generate 102,400 designs each.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>S4</th>
          <th>LSTM</th>
          <th>GPT</th>
          <th>Training Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>82,633 (81%)</td>
          <td>76,264 (74%)</td>
          <td>70,117 (68%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Unique</td>
          <td>53,293 (52%)</td>
          <td>51,326 (50%)</td>
          <td>50,487 (49%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Novel</td>
          <td>40,897 (40%)</td>
          <td>43,245 (42%)</td>
          <td>43,168 (42%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>NP-likeness</td>
          <td>1.6 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.6 +/- 0.7</td>
      </tr>
  </tbody>
</table>
<p>S4 designs the most valid molecules (6,000 to 12,000 more than benchmarks) and achieves significantly higher NP-likeness ($p = 1.41 \times 10^{-53}$ vs. LSTM, $p = 1.02 \times 10^{-82}$ vs. GPT). S4 also achieves the lowest Kolmogorov-Smirnov distances to the training/test distributions across multiple structural properties (sp3 carbons, aliphatic rings, spiro atoms, molecular weight, fused ring size, heavy atoms).</p>
<p>In terms of computational efficiency, S4 trains as fast as GPT (both approximately 1.3x faster than LSTM) and is the fastest of the three architectures at generation.</p>
<h3 id="prospective-mapk1-inhibitor-design">Prospective MAPK1 inhibitor design</h3>
<p>The pre-trained S4 model was fine-tuned on 68 manually curated MAPK1 inhibitors ($K_i &lt; 1 \mu M$) from ChEMBL v33. The last five fine-tuning epochs generated 256K molecules across five temperature values. After ranking and filtering by log-likelihood score and scaffold similarity, the top 10 designs were evaluated via <a href="/notes/chemistry/molecular-simulation/classical-methods/umbrella-sampling/">Umbrella Sampling</a> <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> simulations.</p>
<p>Eight out of ten designs showed high predicted affinity, with $\Delta G$ values ranging from $-10.3 \pm 0.6$ to $-23 \pm 4$ kcal/mol. These affinities are comparable to or exceed those of the closest known active neighbors ($\Delta G = -9.1 \pm 0.8$ to $-13 \pm 2$ kcal/mol). The most potent predicted design (molecule 2, $\Delta G = -23 \pm 4$ kcal/mol) engages extensively with the MAPK1 binding pocket, though synthetic accessibility may be limited. Several designs incorporate halogen substitutions favorable for MAPK1 inhibition, consistent with known structure-activity relationships.</p>
<h2 id="s4-combines-the-best-of-lstms-and-gpts-for-molecular-design">S4 Combines the Best of LSTMs and GPTs for Molecular Design</h2>
<p>The main findings of this study are:</p>
<ol>
<li><strong>S4 outperforms both LSTM and GPT</strong> in learning complex molecular properties like bioactivity, while maintaining competitive or superior performance in syntax learning and chemical space exploration.</li>
<li><strong>The dual formulation is key</strong>: holistic training (convolution) enables better capture of global molecular properties, while recurrent generation preserves robust chemical syntax and diverse scaffold exploration.</li>
<li><strong>S4 is especially strong for longer sequences</strong>: natural product design (SMILES &gt; 100 tokens) shows the largest advantages over benchmarks in validity and property matching.</li>
<li><strong>Prospective validation</strong>: 8/10 S4-designed MAPK1 inhibitors are predicted as highly active by molecular dynamics, with affinities comparable to or exceeding known actives.</li>
</ol>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>All evaluations are computational; no wet-lab experimental validation is reported.</li>
<li>Bioactivity evaluation relies on likelihood-based ranking, which is an indirect proxy.</li>
<li>The MD simulations, while more rigorous than simple docking, still represent in silico predictions.</li>
<li>SMILES augmentation and improved ranking protocols could further boost performance.</li>
</ul>
<p><strong>Future directions</strong> include application to macrocyclic peptides and protein sequences, organic reaction planning, structure-based drug design, and integration with wet-lab experimental validation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v31</td>
          <td>1.9M SMILES</td>
          <td>Molecules with SMILES length &lt;= 100 tokens</td>
      </tr>
      <tr>
          <td>Fine-tuning (bioactivity)</td>
          <td>LIT-PCBA (5 targets)</td>
          <td>11-56 actives + ~10K inactives per target</td>
          <td>PKM2, MAPK1, GBA, mTORC1, TP53</td>
      </tr>
      <tr>
          <td>Natural product training</td>
          <td>COCONUT</td>
          <td>32,360 SMILES</td>
          <td>SMILES length &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>Prospective fine-tuning</td>
          <td>ChEMBL v33 (MAPK1)</td>
          <td>68 inhibitors</td>
          <td>$K_i &lt; 1 \mu M$, target ID CHEMBL4040</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: next-token prediction on SMILES strings</li>
<li>Fine-tuning: transfer learning with early stopping (patience 5, tolerance $10^{-5}$)</li>
<li>Molecule ranking: log-likelihood scoring with pre-training bias subtraction (Eq. 5)</li>
<li>Temperature sampling: $T$ from 1.0 to 2.0 (step 0.25) for chemical space exploration</li>
</ul>
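<p>The temperature sweep can be illustrated with a minimal softmax sketch. This is a didactic, pure-Python illustration (the logits are made up, not from the paper); it only shows how raising $T$ flattens the next-token distribution to broaden chemical space exploration.</p>

```python
import math

def temperature_sample_probs(logits, temperature):
    """Convert raw token logits into a sampling distribution at a given temperature.

    Higher temperatures flatten the distribution, encouraging more exploratory
    SMILES generation; T = 1.0 recovers the plain softmax.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Sweep T from 1.0 to 2.0 in steps of 0.25, as in the exploration protocol.
logits = [2.0, 1.0, 0.1]
distributions = {t / 4: temperature_sample_probs(logits, t / 4)
                 for t in range(4, 9)}
```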
<h3 id="models">Models</h3>
<ul>
<li><strong>S4</strong>: Structured state space sequence model with HiPPO initialization; hyperparameter search over 242 + 108 configurations</li>
<li><strong>LSTM</strong>: 40 configurations optimized via random search</li>
<li><strong>GPT</strong>: 35 configurations optimized via random search</li>
<li>All models share the same pre-training data and fine-tuning protocol for fair comparison</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (ChEMBL)</td>
          <td>S4</td>
          <td>97%</td>
          <td>Out of 102,400 generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness (ChEMBL)</td>
          <td>S4</td>
          <td>96%</td>
          <td>Among valid designs</td>
      </tr>
      <tr>
          <td>Novelty (ChEMBL)</td>
          <td>S4</td>
          <td>93%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Bioactivity ranking (top 10)</td>
          <td>S4</td>
          <td>Significant (p = 8.41e-6 vs LSTM)</td>
          <td>Wilcoxon signed-rank test</td>
      </tr>
      <tr>
          <td>NP validity</td>
          <td>S4</td>
          <td>81%</td>
          <td>COCONUT, SMILES &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>MAPK1 inhibitor success</td>
          <td>S4</td>
          <td>8/10 designs active</td>
          <td>Validated by MD (Umbrella Sampling)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Hyperparameter search: NVIDIA A100 40GB GPUs</li>
<li>LSTM/GPT search: 5 days on a single A100</li>
<li>S4 search: 10 days on multiple A100 GPUs</li>
<li>MD simulations: Dutch supercomputer Snellius; 1.2-1.6 microseconds per ligand (<a href="/notes/chemistry/molecular-simulation/classical-methods/umbrella-sampling/">Umbrella Sampling</a>)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/s4-for-de-novo-drug-design">S4 for de novo drug design</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with data and trained models</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.12666371">Zenodo archive</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Source data and molecule designs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ozcelik, R., de Ruiter, S., Criscuolo, E., &amp; Grisoni, F. (2024). Chemical language modeling with structured state space sequence models. <em>Nature Communications</em>, 15, 6176.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ozcelik2024chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language modeling with structured state space sequence models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{{\&#34;O}z{\c{c}}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca}</span>,

</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6176}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-50469-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimensionality of the keys and $\sqrt{d_k}$ acts as a scaling factor. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
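<p>The attention formula above can be sketched directly. This is a didactic pure-Python version for a single head and small lists of vectors (production implementations batch this with tensor libraries, and the projection matrices $W^Q, W^K, W^V, W^O$ are omitted here):</p>

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats); each output row is a
    convex combination of the rows of V.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```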
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
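<p>The sinusoidal encoding is easy to reproduce for a single position; the sketch below follows the two formulas above (sine on even dimensions, cosine on odd dimensions):</p>

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding for one sequence position.

    Even indices 2i get sin(pos / 10000^(2i/d_model)); odd indices 2i+1
    get cos of the same angle.
    """
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe
```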
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
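<p>A score-only Needleman-Wunsch sketch shows the dynamic program behind the split constraint. This is a minimal illustration with unit match/mismatch/gap scores; the paper's EMBOSS-based alignment uses a substitution matrix and affine gap penalties, and normalizes the result into a percent similarity.</p>

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two sequences via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Boundary: aligning a prefix against the empty sequence costs all gaps.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]
```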
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K epochs on a single GPU in Google Colaboratory</li>
<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
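<p>The two decoding modes differ only in the beam width and how many finished beams are kept. A minimal beam-search sketch over an arbitrary next-token log-probability function (the <code>^</code>/<code>$</code> start and end markers and the toy scoring interface are illustrative assumptions, not the paper's tokenization):</p>

```python
import math

def beam_search(next_token_logprobs, beam_size, max_len, bos="^", eos="$"):
    """Keep the `beam_size` highest log-probability partial sequences per step.

    `next_token_logprobs(tokens)` returns a dict mapping each candidate next
    token to its log-probability; finished beams end with `eos`.
    """
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                candidates.append((tokens, score))  # keep finished beams as-is
                continue
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams
```

With <code>beam_size=4</code> and only the top beam kept, this corresponds to the &ldquo;one per one&rdquo; mode; keeping all 10 beams at <code>beam_size=10</code> gives &ldquo;ten per one.&rdquo;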
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K epochs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/chemistry/molecular-representations/notations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}' = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x_i', h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
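<p>The hinge structure of this loss is easy to see element-wise: it is zero whenever the prediction $\hat{\mathbf{c}}$ is at least as close to the RDKit-computed value $\dot{\mathbf{c}}$ as to the input condition $\mathbf{c}$, and positive otherwise. A minimal per-condition sketch (the vectorization over conditions is an assumption for illustration):</p>

```python
def triplet_property_loss(c, c_hat, c_dot):
    """Element-wise hinge max((c_hat - c)^2 - (c_hat - c_dot)^2, 0).

    c: requested condition values, c_hat: MLP-head predictions,
    c_dot: property values computed from the generated SMILES.
    """
    return [max((h - t) ** 2 - (h - d) ** 2, 0.0)
            for t, h, d in zip(c, c_hat, c_dot)]
```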
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
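<p>The relation matrix is approximated with first-order finite differences ($\Delta = 1$), as noted in the limitations. A minimal sketch of one row of that computation, with a toy linear attention function standing in for the model (the real $\mathbf{A}$ comes from the prefix attention map):</p>

```python
def relation_row(attention_fn, conditions, delta=1.0):
    """Finite-difference |dA/dc_i| for each condition c_i, summed over
    the entries of the attention map. `attention_fn` maps a condition
    vector to a flat attention map (list of floats)."""
    base = attention_fn(conditions)
    row = []
    for i in range(len(conditions)):
        bumped = list(conditions)
        bumped[i] += delta
        pert = attention_fn(bumped)
        # Sum of absolute first-order differences over the map.
        row.append(sum(abs(p - b) for p, b in zip(pert, base)) / delta)
    return row


# Toy "attention map": two entries depending linearly on c0 and c1.
toy = lambda c: [0.5 * c[0] + 0.1 * c[1], 0.2 * c[1]]
print([round(v, 3) for v in relation_row(toy, [1.0, 1.0])])  # → [0.5, 0.3]
```

<p>Because the toy map is linear, the finite difference recovers the true derivatives exactly; for the real network it is only a first-order approximation, which is the limitation the authors flag.</p>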
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PASITHEA: Gradient-Based Molecular Design via Dreaming</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/deep-molecular-dreaming-pasithea/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/deep-molecular-dreaming-pasithea/</guid><description>PASITHEA applies inceptionism to molecular design, using gradient-based optimization on SELFIES representations to generate molecules with target properties.</description><content:encoded><![CDATA[<h2 id="inceptionism-applied-to-molecular-inverse-design">Inceptionism Applied to Molecular Inverse Design</h2>
<p>This is a <strong>Method</strong> paper that introduces PASITHEA, a gradient-based approach to de-novo molecular design inspired by inceptionism (deep dreaming) techniques from computer vision. The core contribution is a direct optimization framework that modifies molecular structures by backpropagating through a trained property-prediction network, with the molecular input (rather than weights) serving as the optimizable variable. PASITHEA is enabled by SELFIES, a surjective molecular string representation that guarantees 100% validity of generated molecules.</p>
<h2 id="the-need-for-direct-gradient-based-molecular-optimization">The Need for Direct Gradient-Based Molecular Optimization</h2>
<p>Existing inverse molecular design methods, including variational autoencoders (VAEs), generative adversarial networks (GANs), reinforcement learning (RL), and genetic algorithms (GAs), share a common characteristic: they optimize molecules indirectly. VAEs and GANs learn distributions and scan latent spaces. RL agents learn policies from environmental rewards. GAs iteratively apply mutations and selections. None of these approaches directly maximize an objective function in a gradient-based manner with respect to the molecular representation itself.</p>
<p>This indirection has several consequences. VAE-based methods require learning a latent space, and the optimization happens in that space rather than directly on molecular structures. RL and GA methods require expensive function evaluations for each candidate molecule. The authors identify an opportunity to exploit gradients more directly by reversing the learning process of a neural network trained to predict molecular properties, thereby sidestepping latent spaces, policies, and population-based search entirely.</p>
<p>A second motivation is interpretability. By operating directly on the molecular representation (rather than a learned latent space), PASITHEA can reveal what a regression network has learned about structure-property relationships, a capability the authors frame as analogous to how deep dreaming reveals what image classifiers have learned about visual features.</p>
<h2 id="core-innovation-inverting-regression-networks-on-selfies">Core Innovation: Inverting Regression Networks on SELFIES</h2>
<p>PASITHEA&rsquo;s key insight is a two-phase training procedure that repurposes the standard neural network training loop for molecule generation.</p>
<p><strong>Phase 1: Prediction training.</strong> A fully connected neural network is trained to predict a real-valued chemical property (logP) from one-hot encoded SELFIES strings. The standard feedforward and backpropagation process updates the network weights to minimize mean squared error between predicted and ground-truth property values:</p>
<p>$$
\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} (f_{\theta}(\mathbf{x}_i) - y_i)^2
$$</p>
<p>where $f_{\theta}$ is the neural network with parameters $\theta$, $\mathbf{x}_i$ is the one-hot encoded SELFIES input, and $y_i$ is the target logP value.</p>
<p><strong>Phase 2: Inverse training (deep dreaming).</strong> The network weights $\theta$ are frozen. For a given input molecule $\mathbf{x}$ and a desired target property value $y_{\text{target}}$, the gradients are computed with respect to the input representation rather than the weights:</p>
<p>$$
\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} \mathcal{L}(f_{\theta}(\mathbf{x}), y_{\text{target}})
$$</p>
<p>This gradient descent on the input incrementally modifies the one-hot encoding of the molecular string, transforming it toward a structure whose predicted property matches the target value. At each step, the argmax function converts the continuous one-hot encoding back to a discrete SELFIES string, which always maps to a valid molecular graph due to the surjective property of SELFIES.</p>
<p><strong>The role of SELFIES.</strong> The surjective mapping from strings to molecular graphs is essential. With SMILES, intermediate strings during optimization can become syntactically invalid (e.g., an unclosed ring like &ldquo;CCCC1CCCCC&rdquo;), producing no valid molecule. SELFIES enforces constraints that guarantee every string maps to a valid molecular graph, making the continuous gradient-based optimization feasible.</p>
<p><strong>Input noise injection.</strong> Because inverse training transforms a one-hot encoding from binary values to real numbers, the discrete-to-continuous transition can cause convergence problems. The authors address this by initializing the input with noise: every zero in the one-hot encoding is replaced by a random number in $[0, k]$, where $k$ is a hyperparameter between 0.5 and 0.95. This smooths the optimization landscape and enables incremental molecular modifications rather than abrupt changes.</p>
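<p>The two phases can be sketched end-to-end on a toy differentiable model: a frozen linear &ldquo;property predictor&rdquo; stands in for the trained network, noise injection replaces the zeros of a one-hot input, and gradient descent then runs on the input rather than the weights. This is purely illustrative, not the authors&rsquo; implementation:</p>

```python
import random

# Frozen "predictor": f(x) = w . x, standing in for the trained network.
W = [0.8, -0.3, 0.5]
predict = lambda x: sum(wi * xi for wi, xi in zip(W, x))


def dream(x0, y_target, lr=0.05, steps=200):
    """Phase 2: gradient descent on the INPUT with frozen weights.
    For a linear f, d/dx (f(x) - y)^2 = 2 (f(x) - y) w."""
    x = list(x0)
    for _ in range(steps):
        err = predict(x) - y_target
        x = [xi - lr * 2 * err * wi for xi, wi in zip(x, W)]
    return x


# Noise injection: replace each zero of the one-hot input with U(0, k),
# with k in [0.5, 0.95] as in the paper (k = 0.9 here).
random.seed(0)
k = 0.9
x0 = [xi if xi == 1.0 else random.uniform(0, k) for xi in [1.0, 0.0, 0.0]]

x_star = dream(x0, y_target=2.0)
print(round(predict(x_star), 3))  # → 2.0: the input now "has" the target property
```

<p>In PASITHEA the continuous $\mathbf{x}$ is additionally snapped back to a discrete SELFIES string via argmax at each step, which this scalar toy omits.</p>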
<h2 id="experimental-setup-on-qm9-with-logp-optimization">Experimental Setup on QM9 with LogP Optimization</h2>
<h3 id="dataset-and-property">Dataset and Property</h3>
<p>The experiments use a random subset of 10,000 molecules from the <a href="/notes/chemistry/datasets/qm9/">QM9</a> dataset. The target property is the logarithm of the partition coefficient (logP), computed using RDKit. LogP measures lipophilicity, an important drug-likeness indicator that follows an approximately normal distribution in QM9 and has a nearly continuous range, making it suitable for gradient-based optimization.</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>PASITHEA uses a fully connected neural network with four layers, each containing 500 nodes with ReLU activation. The loss function is mean squared error. Data is split 85%/15% for training/testing. The prediction model trains for approximately 1,500 epochs with an Adam optimizer and a learning rate of $1 \times 10^{-6}$.</p>
<p>For inverse training, the authors select a noise upper-bound of 0.9 and a learning rate of 0.01, chosen from hyperparameter tuning experiments that evaluate the percentage of molecules optimized toward the target property.</p>
<h3 id="optimization-targets">Optimization Targets</h3>
<p>Two extreme logP targets are used: $+6$ (high lipophilicity) and $-6$ (low lipophilicity). These values exceed the range of logP values in the QM9 dataset (minimum: $-2.19$, maximum: $3.08$), testing whether the model can extrapolate beyond the training distribution.</p>
<h2 id="distribution-shifts-and-interpretable-molecular-transformations">Distribution Shifts and Interpretable Molecular Transformations</h2>
<h3 id="distribution-level-results">Distribution-Level Results</h3>
<p>Applying deep dreaming to the full set of 10,000 molecules produces a clear shift in the logP distribution:</p>
<table>
  <thead>
      <tr>
          <th>Statistic</th>
          <th>QM9 Original</th>
          <th>Optimized (target +6)</th>
          <th>Optimized (target -6)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean logP</td>
          <td>0.3909</td>
          <td>1.8172</td>
          <td>-0.3360</td>
      </tr>
      <tr>
          <td>Min logP</td>
          <td>-2.1903</td>
          <td>-0.8240</td>
          <td>-2.452</td>
      </tr>
      <tr>
          <td>Max logP</td>
          <td>3.0786</td>
          <td>4.2442</td>
          <td>0.9018</td>
      </tr>
  </tbody>
</table>
<p>The optimized distributions extend beyond the original dataset&rsquo;s property range. The right-shifted distribution (target +6) produces molecules with logP values up to 4.24, exceeding the original maximum of 3.08. The left-shifted distribution (target -6) reaches -2.45, below the original minimum. This indicates that PASITHEA can generate molecules with properties outside the training data bounds.</p>
<p>Additionally, 97.2% of the generated molecules do not exist in the original training set, indicating that the network is not memorizing data but rather using structural features to guide optimization. Some generated molecules contain more heavy atoms than the QM9 maximum of 9, since the SELFIES string length allows for larger structures.</p>
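<p>Novelty checks of this kind reduce to set membership over canonical string representations (canonicalization, e.g. via RDKit, is assumed upstream); a minimal sketch:</p>

```python
def novelty(generated, training):
    """Fraction of generated molecules absent from the training set,
    comparing canonical string representations."""
    train = set(training)
    return sum(1 for m in generated if m not in train) / len(generated)


# Toy canonical strings: 2 of the 3 generated molecules are novel.
print(novelty(["CCO", "CCN", "CCC"], ["CCO", "C=O"]))
```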
<h3 id="molecule-level-interpretability">Molecule-Level Interpretability</h3>
<p>The stepwise molecular transformations reveal interpretable &ldquo;strategies&rdquo; the network employs:</p>
<ol>
<li>
<p><strong>Nitrogen appendage</strong>: When optimizing for lower logP, the network repeatedly appends nitrogen atoms to the molecule. The authors observe this as a consistent pattern across multiple test molecules, reflecting the known relationship between nitrogen content and reduced lipophilicity.</p>
</li>
<li>
<p><strong>Length modulation</strong>: When optimizing for higher logP, the network tends to increase molecular chain length (e.g., extending a carbon chain). When optimizing for lower logP, it shortens chains. This captures the intuition that larger, more carbon-heavy molecules tend to be more lipophilic.</p>
</li>
<li>
<p><strong>Bond order changes</strong>: The network replaces single bonds with double or triple bonds during optimization, demonstrating an understanding of the relationship between bonding patterns and logP.</p>
</li>
<li>
<p><strong>Consistency across trials</strong>: Because the input initialization includes random noise, repeated trials with the same molecule produce different transformation sequences. Despite this stochasticity, the network applies consistent strategies across trials (e.g., always shortening chains for negative optimization), validating that it has learned genuine structure-property relationships.</p>
</li>
</ol>
<h3 id="thermodynamic-stability">Thermodynamic Stability</h3>
<p>The authors probe thermodynamic stability, a rough proxy for synthesizability, by computing heats of formation using MOPAC2016 at the PM7 level of theory. Some optimization trajectories move toward thermodynamically stable molecules (negative heats of formation), while others produce less stable structures. The authors acknowledge this limitation and propose multi-objective optimization incorporating stability as a future direction.</p>
<h3 id="comparison-to-vaes">Comparison to VAEs</h3>
<p>The key distinction from VAEs is where gradient computation occurs. In VAEs, a latent space is learned through encoding and decoding, and property optimization happens in that latent space. In PASITHEA, gradients are computed directly with respect to the molecular representation (SELFIES one-hot encoding). The authors argue this makes the approach more interpretable, since we can probe what the network learned about molecular structure without the &ldquo;detour&rdquo; through a latent space.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors are forthright about the preliminary nature of these results:</p>
<ul>
<li>The method is demonstrated only on a small subset of QM9 with a single, computationally inexpensive property (logP).</li>
<li>The simple four-layer architecture may not scale to larger molecular spaces or more complex properties.</li>
<li>Generated molecules are not always thermodynamically stable, requiring additional optimization objectives.</li>
<li>The approach has not been benchmarked against established methods (VAEs, GANs, RL) on standard generative benchmarks.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>QM9 (random subset)</td>
          <td>10,000 molecules</td>
          <td>logP values computed via RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prediction training</strong>: 4-layer fully connected NN, 500 nodes/layer, ReLU activation, MSE loss, Adam optimizer, LR $1 \times 10^{-6}$, ~1,500 epochs, 85/15 train/test split</li>
<li><strong>Inverse training</strong>: Frozen weights, Adam optimizer, LR 0.01, noise upper-bound 0.9, logP targets of +6 and -6</li>
<li><strong>Heats of formation</strong>: MOPAC2016, PM7 level, geometry optimization with eigenvector following (EF)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture is a simple 4-layer MLP. No pre-trained weights are distributed, but the full code is available.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Novel molecules</td>
          <td>97.2%</td>
          <td>Generated molecules not in training set</td>
      </tr>
      <tr>
          <td>Max logP (target +6)</td>
          <td>4.2442</td>
          <td>Exceeds QM9 max of 3.0786</td>
      </tr>
      <tr>
          <td>Min logP (target -6)</td>
          <td>-2.452</td>
          <td>Below QM9 min of -2.1903</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Pasithea">Pasithea</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <em>Machine Learning: Science and Technology</em>, 2(3), 03LT02. <a href="https://doi.org/10.1088/2632-2153/ac09d6">https://doi.org/10.1088/2632-2153/ac09d6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2021deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Cynthia and Krenn, Mario and Eppel, Sagi and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{03LT02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/ac09d6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation of Chemical Nomenclature</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/nmt-chemical-nomenclature-en-zh/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/nmt-chemical-nomenclature-en-zh/</guid><description>Xu et al. apply CNN and LSTM seq2seq models to translate chemical nomenclature between English and Chinese, outperforming rule-based tools.</description><content:encoded><![CDATA[<h2 id="a-method-for-neural-translation-of-chemical-names">A Method for Neural Translation of Chemical Names</h2>
<p>This is a <strong>Method</strong> paper that introduces deep learning approaches for translating chemical nomenclature between English and Chinese. The primary contribution is demonstrating that character-level sequence-to-sequence neural networks (both CNN-based and LSTM-based) can serve as viable alternatives to hand-crafted rule-based translation systems for chemical names. The work compares two neural architectures against an existing rule-based tool on bilingual chemical name datasets.</p>
<h2 id="bridging-the-english-chinese-chemical-nomenclature-gap">Bridging the English-Chinese Chemical Nomenclature Gap</h2>
<p>English and Chinese are the two most widely used languages for chemical nomenclature worldwide. Translation between them is important for chemical data processing, especially for converting Chinese chemical names extracted via named entity recognition into English names that existing name-to-structure tools can parse. Rule-based translation between these languages faces considerable challenges:</p>
<ol>
<li>Chinese chemical names lack word boundaries (no spaces), making segmentation difficult.</li>
<li>Word order is often reversed between English and Chinese chemical names (e.g., &ldquo;ethyl acetate&rdquo; maps to characters meaning &ldquo;acetate-ethyl&rdquo; in Chinese).</li>
<li>The same English morpheme can map to different Chinese characters depending on chemical context (e.g., &ldquo;ethyl&rdquo; translates differently in &ldquo;ethyl acetate&rdquo; vs. &ldquo;ethyl alcohol&rdquo;).</li>
<li>Trivial names, especially for natural products, follow irregular translation patterns or are transliterations.</li>
</ol>
<p>Building comprehensive rule sets requires a formally trained chemist fluent in both languages, making rule-based approaches expensive and fragile.</p>
<h2 id="character-level-sequence-to-sequence-translation">Character-Level Sequence-to-Sequence Translation</h2>
<p>The core idea is to treat chemical name translation as a character-level machine translation task, applying encoder-decoder architectures with attention mechanisms. Two architectures are proposed:</p>
<p><strong>CNN-based architecture</strong>: Three 1D convolutional layers encode the input character sequence. A decoder with three 1D convolutional layers processes the target sequence offset by one timestep, combined with attention mechanism layers that connect encoder and decoder outputs. Two additional 1D convolutional layers produce the final decoded output sequence.</p>
<p><strong>LSTM-based architecture</strong>: An LSTM encoder converts the input sequence into two state vectors. An LSTM decoder is trained with teacher forcing, using the encoder&rsquo;s state vectors as its initial state, and generating the target sequence offset by one timestep.</p>
<p>Both models operate at the character level. Input chemical name strings are transformed into embedding vectors, with the vocabulary size equal to the number of unique characters in the respective language (100 unique characters for English names, 2,056 unique characters for Chinese names).</p>
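<p>The character-level preprocessing amounts to building a per-language character vocabulary and one-hot encoding each name into a fixed-length matrix. A minimal sketch (the actual Keras code ships as supplementary material with the paper):</p>

```python
def build_vocab(names):
    """Map each unique character across the corpus to an integer index."""
    chars = sorted({ch for name in names for ch in name})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot(name, vocab, max_len):
    """Encode a name as a max_len x |vocab| one-hot matrix, zero-padded."""
    mat = [[0.0] * len(vocab) for _ in range(max_len)]
    for t, ch in enumerate(name[:max_len]):
        mat[t][vocab[ch]] = 1.0
    return mat


names = ["ethyl acetate", "ethyl alcohol"]
vocab = build_vocab(names)
x = one_hot(names[0], vocab, max_len=16)
print(len(vocab), len(x), len(x[0]))  # → 9 16 9
```

<p>In the paper the vocabulary sizes are 100 unique characters for English and 2,056 for Chinese, so the Chinese one-hot matrices are far wider than the English ones.</p>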
<h2 id="experimental-setup-and-comparison-with-rule-based-tool">Experimental Setup and Comparison with Rule-Based Tool</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors built two directional datasets from a manually curated corpus of scientific literature maintained at their institution:</p>
<ul>
<li><strong>En2Ch (English to Chinese)</strong>: 30,394 name pairs after deduplication</li>
<li><strong>Ch2En (Chinese to English)</strong>: 37,207 name pairs after deduplication</li>
</ul>
<p>The datasets cover systematic compound names through trivial names. For names with multiple valid translations, the most commonly used translation was selected. Each dataset was split 80/20 for training and validation.</p>
<h3 id="model-configuration">Model Configuration</h3>
<p>Both neural network models used the following hyperparameters:</p>
<ul>
<li>Batch size: 64</li>
<li>Epochs: 100</li>
<li>Latent dimensionality: 256 (encoding and decoding space)</li>
<li>Implementation: Python 3.7 with Keras 2.3 and TensorFlow backend</li>
</ul>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The models were evaluated on five metrics across both translation directions:</p>
<ul>
<li><strong>Success Rate</strong>: Percentage of inputs that produced any output</li>
<li><strong>String Matching Accuracy</strong>: Exact match with the single target name</li>
<li><strong>Data Matching Accuracy</strong>: Exact match allowing any valid translation from the corpus</li>
<li><strong>Manual Spot Check</strong>: Blind evaluation of 100 random samples per approach</li>
<li><strong>Running Time</strong>: Wall-clock time on the same hardware</li>
</ul>
<h3 id="baseline">Baseline</h3>
<p>The rule-based comparison system operates in three steps: disassemble the input name into word fragments, translate each fragment, and reassemble into the target language. This tool had been deployed as an online service with over one million uses at the time of publication.</p>
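<p>The three-step rule-based pipeline can be illustrated with a toy fragment dictionary; the fragments, the single reversal rule, and the outputs below are hypothetical stand-ins, not the deployed tool&rsquo;s rules:</p>

```python
# Toy English-to-Chinese fragment dictionary (illustrative only).
FRAGMENTS = {"ethyl": "乙基", "acetate": "乙酸", "methyl": "甲基"}


def disassemble(name):
    """Step 1: split an English name into word fragments."""
    return name.split()


def translate_fragments(frags):
    """Step 2: look up each fragment; return None on any unknown
    fragment, mirroring how the rule-based tool produces no output."""
    out = [FRAGMENTS.get(f) for f in frags]
    return None if None in out else out


def reassemble(frags):
    """Step 3: Chinese ester names put the acid part first, so reverse
    the English fragment order (a simplified stand-in for real rules)."""
    return "".join(reversed(frags))


def rule_translate(name):
    frags = translate_fragments(disassemble(name))
    return None if frags is None else reassemble(frags)


print(rule_translate("ethyl acetate"))   # toy translation with reversed order
print(rule_translate("benzyl acetate"))  # → None: unknown fragment, no output
```

<p>The failure mode in the second call is exactly what the success-rate metric captures: the neural models always emit some output, while the rule-based tool fails outright on fragments (or segmentations) it does not know.</p>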
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>CNN</th>
          <th>LSTM</th>
          <th>Rule-based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate En2Ch</td>
          <td>100%</td>
          <td>100%</td>
          <td>75.97%</td>
      </tr>
      <tr>
          <td>Success Rate Ch2En</td>
          <td>100%</td>
          <td>100%</td>
          <td>59.90%</td>
      </tr>
      <tr>
          <td>String Match En2Ch</td>
          <td>82.92%</td>
          <td>89.64%</td>
          <td>39.81%</td>
      </tr>
      <tr>
          <td>String Match Ch2En</td>
          <td>78.11%</td>
          <td>55.44%</td>
          <td>43.77%</td>
      </tr>
      <tr>
          <td>Data Match En2Ch</td>
          <td>84.44%</td>
          <td>90.82%</td>
          <td>45.15%</td>
      </tr>
      <tr>
          <td>Data Match Ch2En</td>
          <td>80.22%</td>
          <td>57.40%</td>
          <td>44.91%</td>
      </tr>
      <tr>
          <td>Manual Check En2Ch</td>
          <td>90.00%</td>
          <td>89.00%</td>
          <td>80.00%</td>
      </tr>
      <tr>
          <td>Manual Check Ch2En</td>
          <td>82.00%</td>
          <td>61.00%</td>
          <td>78.00%</td>
      </tr>
      <tr>
          <td>Time En2Ch (s)</td>
          <td>1423</td>
          <td>190</td>
          <td>288</td>
      </tr>
      <tr>
          <td>Time Ch2En (s)</td>
          <td>1876</td>
          <td>303</td>
          <td>322</td>
      </tr>
  </tbody>
</table>
<p>Both neural approaches achieved 100% success rate (always producing output), while the rule-based tool failed on 24% and 40% of inputs for En2Ch and Ch2En respectively. The rule-based tool&rsquo;s failures were concentrated on Chinese names lacking word boundaries and on trivial names of natural products.</p>
<p>For English-to-Chinese translation, LSTM performed best at 89.64% string matching accuracy (90.82% data matching), followed by CNN at 82.92%. For Chinese-to-English, CNN substantially outperformed LSTM (78.11% vs. 55.44% string matching), suggesting that LSTM had difficulty with long-term dependencies in Chinese character sequences. The authors observed that many LSTM errors appeared at the ends of chemical names.</p>
<h3 id="analysis-by-name-type">Analysis by Name Type</h3>
<p>The CNN-based approach outperformed LSTM on CAS names (80% vs. 52% in manual checks) and was more robust for longer names. The rule-based tool&rsquo;s performance was consistent regardless of name length: it handled regular systematic names well but struggled with the diversity of real-world chemical nomenclature, trivial names in particular.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Performance depends heavily on training data quality and quantity.</li>
<li>Neither neural approach was validated on an external test set outside the institution&rsquo;s corpus.</li>
<li>The CNN model was considerably slower than the other two approaches (roughly 5-8x; e.g., 1,423 s vs. 190 s for En2Ch).</li>
<li>No comparison against modern transformer-based NMT architectures (the study predates widespread adoption of transformers for this task).</li>
<li>The dataset is relatively small by modern NMT standards (30-37K pairs).</li>
<li>The authors noted that some neural translations were actually better than the target labels, suggesting the evaluation metrics understate true performance.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest that combining CNN and LSTM architectures could yield further improvements, and that the approach has practical applications in scientific publishing (Chinese journals requiring English abstracts) and chemical database interoperability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Validation (En2Ch)</td>
          <td>Curated bilingual corpus</td>
          <td>30,394 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Training/Validation (Ch2En)</td>
          <td>Curated bilingual corpus</td>
          <td>37,207 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Testing (En2Ch)</td>
          <td>Held-out validation split</td>
          <td>6,079 records</td>
          <td>Same source</td>
      </tr>
      <tr>
          <td>Testing (Ch2En)</td>
          <td>Held-out validation split</td>
          <td>7,441 records</td>
          <td>Same source</td>
      </tr>
  </tbody>
</table>
<p>Training data, Python code for both models, and result data are provided as supplementary files with the paper.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level CNN encoder-decoder with attention (3+3+2 conv layers)</li>
<li>Character-level LSTM encoder-decoder with teacher forcing</li>
<li>Batch size: 64, epochs: 100, latent dim: 256</li>
</ul>
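<p>A minimal sketch of the character-level setup implied above: build a character vocabulary over the bilingual name pairs and produce the shifted decoder inputs/targets used for teacher forcing. The start/end markers and function names are illustrative, not taken from the paper's released code.</p>

```python
START, END = "\t", "\n"  # common seq2seq start/end markers for char models

def build_vocab(names):
    """Map every character in the corpus to an integer id."""
    vocab = {START: 0, END: 1}
    for c in sorted({c for name in names for c in name}):
        vocab[c] = len(vocab)
    return vocab

def encode_pair(src, tgt, vocab):
    """Encoder input, decoder input, decoder target (shifted by one step)."""
    enc_in = [vocab[c] for c in src]
    dec_in = [vocab[START]] + [vocab[c] for c in tgt]  # teacher forcing feed
    dec_out = [vocab[c] for c in tgt] + [vocab[END]]   # next-char targets
    return enc_in, dec_in, dec_out
```

<p>During training the decoder sees the ground-truth previous character (<code>dec_in</code>) and is scored against the next character (<code>dec_out</code>), which is what "teacher forcing" refers to.</p>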
<h3 id="models">Models</h3>
<p>Both models implemented in Python 3.7 with Keras 2.3 / TensorFlow. No pre-trained weights are released separately, but the training code is provided as supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value (En2Ch)</th>
          <th>Best Value (Ch2En)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate</td>
          <td>100% (both DL)</td>
          <td>100% (both DL)</td>
          <td>Rule-based: 75.97% / 59.90%</td>
      </tr>
      <tr>
          <td>String Matching</td>
          <td>89.64% (LSTM)</td>
          <td>78.11% (CNN)</td>
          <td>Best neural model per direction</td>
      </tr>
      <tr>
          <td>Data Matching</td>
          <td>90.82% (LSTM)</td>
          <td>80.22% (CNN)</td>
          <td>Allows multiple valid translations</td>
      </tr>
      <tr>
          <td>Manual Spot Check</td>
          <td>90.00% (CNN)</td>
          <td>82.00% (CNN)</td>
          <td>Blind evaluation of 100 samples</td>
      </tr>
  </tbody>
</table>
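<p>A hedged sketch of the two accuracy notions in the table: string matching counts only exact matches against the single reference translation, while data matching credits any prediction that appears in a set of accepted alternatives (hence data matching &ge; string matching). The function names are illustrative.</p>

```python
def string_matching(preds, refs):
    """Fraction of predictions exactly equal to the single reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def data_matching(preds, accepted):
    """Fraction of predictions found in the accepted-translation set.

    accepted[i] is the set of valid translations for example i.
    """
    return sum(p in a for p, a in zip(preds, accepted)) / len(preds)
```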
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper: running times are reported, but no hardware details are provided.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.1186/s13321-020-00457-0">Supplementary files</a></td>
          <td>Code + Data</td>
          <td>CC-BY 4.0</td>
          <td>Training data, CNN/LSTM code, results (Additional files 1-6)</td>
      </tr>
      <tr>
          <td><a href="https://www.organchem.csdb.cn/translate">SIOC Translation Tool</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Rule-based baseline tool, online service</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, T., Chen, W., Zhou, J., Dai, J., Li, Y., &amp; Zhao, Y. (2020). Neural machine translation of chemical nomenclature between English and Chinese. <em>Journal of Cheminformatics</em>, 12, 50. <a href="https://doi.org/10.1186/s13321-020-00457-0">https://doi.org/10.1186/s13321-020-00457-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xu2020neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural machine translation of chemical nomenclature between English and Chinese}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Tingjun and Chen, Weiming and Zhou, Junhong and Dai, Jingfang and Li, Yingyong and Zhao, Yingli}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00457-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>nach0: A Multimodal Chemical and NLP Foundation Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/</guid><description>nach0 is a T5-based encoder-decoder model pre-trained on SMILES, scientific text, and patents, then instruction-tuned for chemical and NLP tasks.</description><content:encoded><![CDATA[<h2 id="a-multi-domain-encoder-decoder-for-chemistry-and-nlp">A Multi-Domain Encoder-Decoder for Chemistry and NLP</h2>
<p>nach0 is a <strong>Method</strong> paper that introduces a unified encoder-decoder foundation model capable of handling both natural language processing (NLP) tasks and chemistry tasks within a single architecture. The primary contribution is demonstrating that a T5-based model pre-trained on scientific text, patents, and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> molecular strings can be instruction-tuned to perform molecular property prediction, reaction prediction, molecular generation, named entity recognition, question answering, and cross-domain translation (text-to-molecule and molecule-to-text) simultaneously. The model is available in base (250M parameters) and large (780M parameters) configurations.</p>
<h2 id="bridging-chemical-and-linguistic-representations">Bridging Chemical and Linguistic Representations</h2>
<p>Most existing biomedical language models (BioBERT, SciFive, BioMegatron) are trained exclusively on natural language text from sources like PubMed, omitting chemical structure information encoded in SMILES strings. Conversely, chemistry-specific models trained on SMILES data often lack the ability to process natural language instructions or perform NLP tasks. Models like <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> and MolT5 attempted to bridge this gap by training on both natural language and chemical data, but they were not fine-tuned on a diverse set of chemical tasks using instruction tuning in a multi-task fashion.</p>
<p>nach0 addresses this by creating a shared representation space for both modalities and fine-tuning across a comprehensive set of tasks spanning three domains: NLP-only tasks, chemistry-only tasks, and cross-domain tasks that require translating between natural language and molecular representations.</p>
<h2 id="unified-text-to-text-framework-with-smiles-tokenization">Unified Text-to-Text Framework with SMILES Tokenization</h2>
<p>The core innovation in nach0 is formulating all chemical and linguistic tasks as text-to-text problems within a single encoder-decoder transformer, combined with a specialized SMILES tokenization strategy.</p>
<h3 id="smiles-token-integration">SMILES Token Integration</h3>
<p>Rather than treating SMILES as plain text, nach0 extends the T5 vocabulary with dedicated SMILES tokens. Each SMILES token is annotated with special symbols in the format <code>&lt;sm_{token}&gt;</code>, creating a distinct vocabulary space for molecular representations while preserving the natural language vocabulary from FLAN-T5. The embedding matrix is initialized by reusing learned embeddings from the pre-trained model for original tokens, with new chemical tokens initialized from the first embeddings.</p>
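<p>An illustrative sketch of the <code>&lt;sm_{token}&gt;</code> annotation scheme described above: split a SMILES string into chemically meaningful tokens and wrap each one so it occupies a vocabulary space distinct from the natural-language tokens. The regex is a common community pattern for SMILES tokenization, not nach0's exact tokenizer.</p>

```python
import re

# Bracket atoms, two-letter elements, aromatic/organic-subset atoms,
# bonds, ring-closure digits, and branch/charge symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOPSFIbcnops]|[=#\+\-\(\)\/\\%@\.]|\d)"
)

def annotate_smiles(smiles):
    """Wrap each SMILES token in the <sm_...> marker format."""
    return [f"<sm_{t}>" for t in SMILES_TOKEN.findall(smiles)]
```

<p>For example, acetic acid (<code>CC(=O)O</code>) becomes seven dedicated chemical tokens rather than a stream of ordinary characters.</p>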
<h3 id="architecture">Architecture</h3>
<p>Both model sizes use the standard <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a> encoder-decoder architecture:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Parameters</th>
          <th>Layers</th>
          <th>Hidden Size</th>
          <th>FFN Size</th>
          <th>Attention Heads</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>250M</td>
          <td>12</td>
          <td>768</td>
          <td>3072</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>780M</td>
          <td>24</td>
          <td>1024</td>
          <td>4096</td>
          <td>16</td>
      </tr>
  </tbody>
</table>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>The model is pre-trained with a language modeling objective on three data sources:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Documents</th>
          <th>Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubMed abstracts (chemistry-filtered)</td>
          <td>13M</td>
          <td>355M</td>
      </tr>
      <tr>
          <td>USPTO patent descriptions</td>
          <td>119K</td>
          <td>2.9B</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> molecular database</td>
          <td>~100M</td>
          <td>4.7B</td>
      </tr>
  </tbody>
</table>
<h3 id="instruction-tuning">Instruction Tuning</h3>
<p>Following the approach of Raffel et al. and Chung et al., nach0 uses natural language prompts to formulate each task. For example, a retrosynthesis task might be phrased as &ldquo;What reactants could be used to synthesize [SMILES]?&rdquo; and a property prediction task as &ldquo;Can [SMILES] penetrate the <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a>?&rdquo; This enables multi-task training across all domains with a single loss function and shared hyperparameters.</p>
<p>Training uses a batch size of 1024, learning rate of $1 \times 10^{-4}$, and weight decay of 0.01. Pre-training runs for one epoch, and fine-tuning for 10 epochs. Data mixing follows the examples-proportional mixing strategy from T5.</p>
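<p>The text-to-text formulation above can be sketched as reducing every task to a (prompt, target) string pair, so a single seq2seq loss covers property prediction, retrosynthesis, and NLP alike. The template wordings here paraphrase the two examples in the text and are not nach0's full prompt set.</p>

```python
# Hypothetical task templates paraphrasing the examples above.
TEMPLATES = {
    "retrosynthesis": "What reactants could be used to synthesize {}?",
    "bbbp": "Can {} penetrate the BBB?",
}

def make_example(task, molecule, target):
    """Format one training example as a (prompt, target) string pair."""
    return TEMPLATES[task].format(molecule), target
```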
<h2 id="multi-task-evaluation-across-nlp-and-chemistry-benchmarks">Multi-Task Evaluation Across NLP and Chemistry Benchmarks</h2>
<p>nach0 is evaluated on a comprehensive set of benchmarks spanning three task categories.</p>
<h3 id="task-categories">Task Categories</h3>
<p><strong>NLP tasks</strong>: Named entity recognition (BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, JNLPBA), PICO extraction (EBM PICO), textual entailment (MedNLI, SciTail), relation extraction (ChemProt, DDI, GAD), sentence similarity (BIOSSES), document classification (HoC), and question answering (PubMedQA, BioASQ, MedMCQA, MMLU).</p>
<p><strong>Chemistry tasks</strong>: Molecular property prediction (ESOL, FreeSolv, Lipophilicity, BBBP, HIV, BACE from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>; <a href="/notes/chemistry/datasets/qm9/">QM9</a> from Mol-Instructions), molecular generation (<a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>), forward reaction prediction, reagent prediction, and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (from Mol-Instructions/USPTO).</p>
<p><strong>Cross-domain tasks</strong>: Description-guided molecule design and molecular description generation (from Mol-Instructions).</p>
<h3 id="baselines">Baselines</h3>
<p>nach0 is compared against FLAN-T5 (250M), SciFive (220M), and MolT5 (220M), all trained in multi-task fashion.</p>
<h3 id="key-results">Key Results</h3>
<p>On chemistry and cross-domain tasks, nach0 base consistently outperforms all base-sized baselines. Selected highlights from Table 3:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>MolT5</th>
          <th>SciFive</th>
          <th>FLAN</th>
          <th>nach0 Base</th>
          <th>nach0 Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Forward reaction</td>
          <td>Acc@1</td>
          <td>27.0%</td>
          <td>60.0%</td>
          <td>59.0%</td>
          <td>88.0%</td>
          <td>89.9%</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>Acc@1</td>
          <td>15.0%</td>
          <td>31.0%</td>
          <td>31.0%</td>
          <td>53.0%</td>
          <td>56.3%</td>
      </tr>
      <tr>
          <td>Reagent prediction</td>
          <td>Acc@1</td>
          <td>1.1%</td>
          <td>3.8%</td>
          <td>4.0%</td>
          <td>6.3%</td>
          <td>13.1%</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>BA</td>
          <td>0.58</td>
          <td>0.65</td>
          <td>0.65</td>
          <td>0.74</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>BA</td>
          <td>0.55</td>
          <td>0.66</td>
          <td>0.60</td>
          <td>0.67</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>HFE (FreeSolv)</td>
          <td>R2</td>
          <td>-0.36</td>
          <td>0.51</td>
          <td>0.55</td>
          <td>0.77</td>
          <td>0.78</td>
      </tr>
      <tr>
          <td>MOSES (FCD)</td>
          <td>FCD/Test</td>
          <td>0.521</td>
          <td>0.578</td>
          <td>0.529</td>
          <td>0.311</td>
          <td>0.304</td>
      </tr>
      <tr>
          <td>Description-guided mol. design</td>
          <td>BLEU-2</td>
          <td>30.3%</td>
          <td>44.2%</td>
          <td>43.6%</td>
          <td>49.0%</td>
          <td>48.8%</td>
      </tr>
      <tr>
          <td>Mol. description gen.</td>
          <td>BLEU-2</td>
          <td>35.6%</td>
          <td>39.6%</td>
          <td>38.6%</td>
          <td>43.9%</td>
          <td>41.7%</td>
      </tr>
  </tbody>
</table>
<p>On NLP tasks, nach0 base performs comparably to FLAN base, with the two models trading wins across different tasks. nach0 large improves substantially over nach0 base on most tasks.</p>
<h3 id="ablation-study">Ablation Study</h3>
<p>The ablation study (Table 4) examines the impact of multi-task training across chemical task groups. Key findings:</p>
<ul>
<li>nach0 trained on all chemical tasks jointly outperforms models trained on individual task groups (prediction-only, reaction-only, or generation-only) on the total set of metrics</li>
<li>The joint model shows lower novelty scores on MOSES compared to the generation-only model, but this reflects less overfitting to training data rather than worse performance</li>
<li>nach0 consistently outperforms MolT5 across all chemical task configurations, demonstrating the benefit of pre-training on both natural language and chemical data with specialized SMILES tokens</li>
</ul>
<h3 id="case-studies">Case Studies</h3>
<p>Two applied case studies demonstrate nach0 in drug discovery scenarios:</p>
<ol>
<li>
<p><strong>End-to-end drug discovery for <a href="https://en.wikipedia.org/wiki/Diabetes">diabetes mellitus</a></strong>: Using a sequence of prompts, nach0 identifies biological targets, analyzes mechanisms of action, generates molecular structures, proposes synthesis routes, and predicts molecular properties.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Janus_kinase_3">JAK3</a> inhibitor generation with Chemistry42</strong>: nach0 replaces 42 specialized generative models in Insilico Medicine&rsquo;s Chemistry42 platform. In 45 minutes, nach0 generates 8 molecules satisfying all 2D and 3D requirements (hinge binding, active site binding), compared to a 0.04% discovery rate from a combinatorial generator over 24 hours. Chemistry42&rsquo;s full pipeline (72 hours) still produces better structures since it uses reinforcement learning feedback and explicit structural constraints.</p>
</li>
</ol>
<h3 id="comparison-with-chatgpt">Comparison with ChatGPT</h3>
<p>On a subset evaluation, fine-tuned nach0 base outperforms GPT-3.5-turbo on all tested tasks: EBM PICO (F1: 67.6% vs. 64.4%), MedMCQA-Open (BLEU-2: 6.3% vs. 1.7%), and molecular description generation (BLEU-2: 42.8% vs. 2.2%).</p>
<h2 id="competitive-multi-task-performance-with-clear-limitations">Competitive Multi-Task Performance with Clear Limitations</h2>
<p>nach0 demonstrates that a single encoder-decoder model can achieve competitive results across both chemical and NLP tasks when pre-trained on mixed-modality data and fine-tuned with instruction tuning. The model&rsquo;s strongest advantages appear on chemistry tasks (reaction prediction, property prediction, molecular generation), where specialized SMILES tokenization and chemical pre-training provide clear benefits over general-purpose models of similar scale.</p>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ol>
<li>
<p><strong>Not at chemist expert level</strong>: Human evaluations indicate the model does not match domain expert performance. Key gaps include chemical reasoning, knowledge alignment with domain-specific knowledge graphs, and the ability to learn from expert feedback.</p>
</li>
<li>
<p><strong>SMILES-only molecular representation</strong>: The model lacks 3D geometric information. SMILES notation is not one-to-one with molecular structures, and the model does not incorporate molecular graphs or 3D coordinates. The authors suggest <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> as a potential alternative representation.</p>
</li>
<li>
<p><strong>Prompt sensitivity</strong>: Performance depends on prompt quality and specificity. Over-reliance on domain-specific prompts may limit response diversity.</p>
</li>
<li>
<p><strong>Limited chemical diversity</strong>: Cross-domain datasets from Mol-Instructions primarily cover known drugs and chemical probes from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, representing only a fraction of predicted chemical space.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending nach0 with protein sequence modalities (using <a href="/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/">Group SELFIES</a>), expanding zero-shot evaluation capabilities, and integrating knowledge graph information through self-supervised approaches.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (text)</td>
          <td>PubMed abstracts</td>
          <td>13M docs, 355M tokens</td>
          <td>Filtered for chemistry-related content</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>USPTO patents</td>
          <td>119K docs, 2.9B tokens</td>
          <td>Patent descriptions</td>
      </tr>
      <tr>
          <td>Pre-training (chemical)</td>
          <td>ZINC</td>
          <td>~100M docs, 4.7B tokens</td>
          <td>Molecular SMILES strings</td>
      </tr>
      <tr>
          <td>Fine-tuning (NLP)</td>
          <td>17 NLP datasets</td>
          <td>Varies</td>
          <td>See Table 1 in paper</td>
      </tr>
      <tr>
          <td>Fine-tuning (chemistry)</td>
          <td>MoleculeNet, MOSES, Mol-Instructions</td>
          <td>Varies</td>
          <td>Predefined or random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5 encoder-decoder (base: 250M, large: 780M parameters)</li>
<li>Pre-training objective: Language modeling (masked span prediction)</li>
<li>Fine-tuning: Multi-task instruction tuning with examples-proportional mixing</li>
<li>Hyperparameters: batch size 1024, learning rate $1 \times 10^{-4}$, weight decay 0.01</li>
<li>Pre-training: 1 epoch; fine-tuning: 10 epochs</li>
</ul>
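<p>The examples-proportional mixing referenced above (from the T5 paper) can be sketched as sampling each fine-tuning dataset in proportion to its size, capped at an artificial limit <code>K</code> so the largest datasets do not drown out the small ones. The value of <code>K</code> below is a placeholder, not the one used by nach0.</p>

```python
def mixing_rates(sizes, K):
    """Sampling probability per dataset: min(size, K) / sum of capped sizes."""
    capped = {name: min(n, K) for name, n in sizes.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}
```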
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_base">nach0 Base (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>250M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_large">nach0 Large (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>780M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://github.com/insilicomedicine/nach0">nach0 GitHub Repository</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation spans 17+ NLP benchmarks and 10+ chemistry benchmarks. Metrics include F1 (NER, RE, classification), accuracy (QA, entailment, reaction prediction), balanced accuracy (molecular property classification), R2/RMSE (regression), BLEU-2 (generation), and FCD/SNN/validity/novelty (molecular generation via MOSES).</p>
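<p>Of these metrics, balanced accuracy (BA, reported for the molecular property classifications) is worth spelling out: it is the unweighted mean of per-class recall, which is robust to the class imbalance common in datasets like HIV. A minimal sketch:</p>

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of recall over the classes present in y_true."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```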
<h3 id="hardware">Hardware</h3>
<ul>
<li>Base models: NVIDIA A4000 and A5000 GPUs</li>
<li>Large models: NVIDIA DGX cloud platform</li>
<li>Training used tensor and pipeline parallelism via NeMo toolkit</li>
<li>Specific GPU counts and training times not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Livne, M., Miftahutdinov, Z., Tutubalina, E., Kuznetsov, M., Polykovskiy, D., Brundyn, A., Jhunjhunwala, A., Costa, A., Aliper, A., Aspuru-Guzik, A., &amp; Zhavoronkov, A. (2024). nach0: Multimodal Natural and Chemical Languages Foundation Model. <em>Chemical Science</em>, 15(22), 8380-8389. <a href="https://doi.org/10.1039/D4SC00966E">https://doi.org/10.1039/D4SC00966E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{livne2024nach0,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{nach0: multimodal natural and chemical languages foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Livne, Micha and Miftahutdinov, Zulfat and Tutubalina, Elena and Kuznetsov, Maksim and Polykovskiy, Daniil and Brundyn, Annika and Jhunjhunwala, Aastha and Costa, Anthony and Aliper, Alex and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8380--8389}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D4SC00966E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolPMoFiT: Inductive Transfer Learning for QSAR</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/</guid><description>MolPMoFiT adapts ULMFiT for QSAR by pre-training an LSTM language model on 1M ChEMBL SMILES and fine-tuning on small molecular property datasets.</description><content:encoded><![CDATA[<h2 id="transfer-learning-meets-molecular-property-prediction">Transfer Learning Meets Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces MolPMoFiT (Molecular Prediction Model Fine-Tuning), a transfer learning approach for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSPR/QSAR</a> modeling. The primary contribution is adapting the ULMFiT framework from NLP to molecular property prediction by treating <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> as a chemical language. A general-purpose molecular structure prediction model (MSPM) is pre-trained on one million <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> molecules via self-supervised next-token prediction, then fine-tuned for specific QSAR endpoints. The approach achieves competitive or superior results to graph neural networks and descriptor-based methods across four benchmark datasets, with particular benefits for small datasets.</p>
<h2 id="the-small-data-problem-in-qsar-modeling">The Small Data Problem in QSAR Modeling</h2>
<p>Deep learning models for molecular property prediction typically require large labeled training sets to learn useful structural representations. While methods like graph convolutional neural networks and SMILES-based models have achieved strong results on well-studied endpoints, they must be trained from scratch for each new task. This presents a challenge for small chemical datasets with limited labeled data, which remain common in drug discovery for specialized endpoints like <a href="https://en.wikipedia.org/wiki/Allosteric_regulation">allosteric inhibition</a>, renal clearance, and inhibitor residence times.</p>
<p>Transfer learning had already shown transformative impact in computer vision (ImageNet pre-training) and NLP (ELMo, BERT, ULMFiT). In chemistry, prior transfer learning efforts included ChemNet (supervised pre-training on computed descriptors), <a href="/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/">Mol2vec</a> (unsupervised substructure embeddings), and pre-trained graph neural networks. However, a systematic application of the ULMFiT self-supervised pre-training pipeline to SMILES-based molecular models had not been explored. MolPMoFiT fills this gap by treating the vast corpus of unlabeled molecular structures as a self-supervised training signal, analogous to how language models learn from unlabeled text.</p>
<h2 id="core-innovation-ulmfit-adapted-for-smiles">Core Innovation: ULMFiT Adapted for SMILES</h2>
<p>MolPMoFiT adapts ULMFiT&rsquo;s three-stage transfer learning pipeline to molecular property prediction:</p>
<p><strong>Stage 1: General-Domain MSPM Pre-training.</strong> A molecular structure prediction model is trained on one million curated ChEMBL molecules to predict the next token in a SMILES string. This is purely self-supervised: the SMILES string provides its own labels. The model learns general chemical syntax and structural patterns.</p>
<p><strong>Stage 2: Task-Specific MSPM Fine-tuning (Optional).</strong> The general MSPM is further fine-tuned on the unlabeled SMILES of the target task dataset. This adapts the language model to the specific chemical distribution of interest (e.g., HIV inhibitors vs. general bioactive molecules). Discriminative fine-tuning adjusts learning rates per layer:</p>
<p>$$\eta^{layer-1} = \eta^{layer} / 2.6$$</p>
<p>where higher layers (containing more task-specific features) receive higher learning rates.</p>
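<p>The rule above can be sketched directly: starting from the top layer's base rate, each lower layer's rate is divided by 2.6, so deeper (more general) layers move more slowly.</p>

```python
def layer_learning_rates(base_lr, n_layers, factor=2.6):
    """Per-layer rates; index 0 = lowest layer, last = top layer (base_lr)."""
    return [base_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]
```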
<p><strong>Stage 3: QSAR/QSPR Model Fine-tuning.</strong> The embedding and encoder weights from the pre-trained MSPM are transferred to a new model with a task-specific classifier head. Fine-tuning uses three key techniques from ULMFiT:</p>
<ul>
<li><strong>Discriminative fine-tuning</strong>: Different learning rates per layer group</li>
<li><strong>Gradual unfreezing</strong>: Layers are unfrozen sequentially (classifier first, then progressively deeper LSTM layers)</li>
<li><strong>One cycle policy</strong>: Learning rate scheduling following Smith&rsquo;s approach</li>
</ul>
<p>The model architecture is AWD-LSTM (ASGD Weight-Dropped LSTM) with an embedding dimension of 400, three LSTM layers with 1152 hidden units, and dropouts applied at every layer (embedding, input, weights, hidden). The QSAR classifier concatenates max pooling, mean pooling, and the last hidden state $h_T$ from the final LSTM layer, feeding this into two feedforward layers.</p>
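<p>The classifier-head input described above can be sketched with NumPy: concatenate the max pool, mean pool, and last hidden state $h_T$ over the final LSTM layer's output sequence (shape: time steps &times; hidden units).</p>

```python
import numpy as np

def pooled_features(hidden_seq):
    """Concat [max pool, mean pool, h_T] over a (T, hidden) output sequence."""
    h_max = hidden_seq.max(axis=0)
    h_mean = hidden_seq.mean(axis=0)
    h_last = hidden_seq[-1]
    return np.concatenate([h_max, h_mean, h_last])
```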
<p><strong>SMILES Augmentation.</strong> Since multiple valid SMILES can represent the same molecule through different atom orderings, the authors use <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as data augmentation. For regression tasks, Gaussian noise ($\sigma_{noise}$) is added to labels of augmented SMILES to simulate experimental error. Test-time augmentation (TTA) averages predictions across the canonical SMILES and four randomized SMILES.</p>
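<p>A hedged sketch of the augmentation and TTA scheme above: each training molecule contributes several randomized-SMILES copies (the enumeration itself would use RDKit and is abstracted away here), each regression copy's label gets Gaussian noise of scale $\sigma_{noise}$, and at test time predictions over the canonical plus randomized SMILES are averaged.</p>

```python
import random

def augment_labels(label, n_copies, sigma_noise, rng):
    """Noisy label per augmented SMILES copy (regression tasks only)."""
    return [label + rng.gauss(0.0, sigma_noise) for _ in range(n_copies)]

def tta_predict(predict, smiles_variants):
    """Average a model's predictions over canonical + randomized SMILES."""
    preds = [predict(s) for s in smiles_variants]
    return sum(preds) / len(preds)
```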
<h2 id="benchmarks-across-four-qsar-datasets">Benchmarks Across Four QSAR Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,200</td>
          <td>Regression (logD)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Solvation">solvation energy</a>)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>Classification (replication inhibition)</td>
          <td>AUROC</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Classification (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>)</td>
          <td>AUROC</td>
      </tr>
  </tbody>
</table>
<p>All datasets use the same ten 80:10:10 splits from <a href="/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/">Yang et al. (2019)</a> for fair comparison. Both random and scaffold splits were evaluated, with scaffold splits representing a more realistic test of generalization to novel chemical scaffolds.</p>
<h3 id="baselines">Baselines</h3>
<p>Models were compared against results reported by Yang et al. (2019): directed message passing neural network (D-MPNN), D-MPNN with RDKit features, random forest on Morgan fingerprints, feed-forward networks on Morgan fingerprints, and feed-forward networks on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> descriptors.</p>
<h3 id="hyperparameters">Hyperparameters</h3>
<p>The same set of fine-tuning hyperparameters was used across all four tasks (tuned on the HIV dataset):</p>
<table>
  <thead>
      <tr>
          <th>Layer Group</th>
          <th>Base Learning Rate</th>
          <th>Epochs</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Linear head only</td>
          <td>3e-2</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final LSTM layer</td>
          <td>5e-3</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final two LSTM layers</td>
          <td>5e-4</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Full model</td>
          <td>5e-5</td>
          <td>6</td>
      </tr>
  </tbody>
</table>
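Within each unfreezing stage, ULMFiT applies discriminative learning rates: the topmost layer group trains at the stage's base rate and each earlier group at the previous rate divided by 2.6. A minimal sketch of that schedule; function and variable names are hypothetical, values are taken from the table above:

```python
def discriminative_lrs(base_lr, n_groups, factor=2.6):
    """Per-layer-group learning rates for one unfreezing stage:
    the last (topmost) group trains at base_lr, each earlier group
    at the next group's rate divided by 2.6 (ULMFiT heuristic)."""
    lrs = [base_lr]
    for _ in range(n_groups - 1):
        lrs.append(lrs[-1] / factor)
    return list(reversed(lrs))  # ordered earliest -> topmost group

# Gradual unfreezing schedule from the table: (unfrozen groups, base LR, epochs)
schedule = [("head", 3e-2, 4), ("+lstm3", 5e-3, 4),
            ("+lstm2-3", 5e-4, 4), ("full", 5e-5, 6)]
```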
<p>Data augmentation settings were task-specific: lipophilicity training SMILES augmented 25x ($\sigma_{noise} = 0.3$); FreeSolv augmented 50x ($\sigma_{noise} = 0.5$); HIV active class augmented 60x and inactive 2x; BBBP positive class 10x and negative 30x.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="benchmark-results">Benchmark Results</h3>
<p><strong>Lipophilicity (random split):</strong> MolPMoFiT achieved RMSE of $0.565 \pm 0.037$ with TTA and $0.625 \pm 0.032$ without, outperforming D-MPNN and other baselines.</p>
<p><strong>FreeSolv (random split):</strong> RMSE of $1.197 \pm 0.127$ with TTA. The small dataset size (642 compounds) led to high variance across splits.</p>
<p><strong>BBBP (random split):</strong> AUROC of $0.950 \pm 0.020$, outperforming all comparison models. Task-specific MSPM fine-tuning showed no clear benefit over the general MSPM.</p>
<p><strong>HIV (random split):</strong> General MolPMoFiT achieved AUROC of $0.828 \pm 0.029$ with TTA. Task-specific fine-tuning yielded a slightly higher $0.834 \pm 0.025$ with TTA.</p>
<p>Scaffold splits consistently produced lower performance than random splits across all datasets, as expected for out-of-distribution generalization.</p>
<h3 id="transfer-learning-impact">Transfer Learning Impact</h3>
<p>Across all four datasets and varying training set sizes, MolPMoFiT consistently outperformed models trained from scratch with the same architecture. The improvement was most pronounced at smaller training set sizes, confirming the utility of pre-trained representations for low-data regimes.</p>
<h3 id="smiles-augmentation-analysis">SMILES Augmentation Analysis</h3>
<p>Training data augmentation provided significant improvements across all tasks. For classification (HIV, BBBP), augmentation improved performance regardless of whether class re-balancing was applied. For regression (lipophilicity, FreeSolv), both SMILES augmentation and label noise were beneficial, with optimal noise levels varying by dataset.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note a fundamental limitation: the model learns mappings from individual SMILES strings to properties rather than from molecular structures to properties. SMILES augmentation acts as a regularization technique that mitigates this by making the model more robust to different SMILES representations of the same molecule. The task-specific MSPM fine-tuning stage did not consistently improve results and warrants further investigation. Finally, all hyperparameters were tuned on a single dataset (HIV) and applied uniformly, which may not be optimal for every endpoint.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (curated)</td>
          <td>1M molecules</td>
          <td>Filtered: no mixtures, max 50 heavy atoms, standardized with MolVS, canonicalized with RDKit</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>MoleculeNet benchmark</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>AWD-LSTM architecture with embedding dim 400, three LSTM layers (1152 hidden units), dropouts at all layers</li>
<li>ULMFiT fine-tuning: discriminative learning rates ($\eta^{layer-1} = \eta^{layer}/2.6$), gradual unfreezing, one cycle policy</li>
<li>SMILES character-level tokenization with special handling for two-character tokens (Cl, Br) and bracket-enclosed tokens</li>
<li>SMILES enumeration for data augmentation with optional Gaussian label noise for regression</li>
</ul>
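The tokenization rules in the list above can be captured with a single regular expression: bracket-enclosed atoms and the two-character halogens become single tokens, everything else splits per character. A minimal sketch, not the authors' exact tokenizer:

```python
import re

# Bracket atoms like [nH] or [O-] first, then two-character
# tokens Cl and Br, then any single character.
SMILES_TOKEN = re.compile(r"(\[[^\]]+\]|Cl|Br|.)")

def tokenize_smiles(smiles):
    """Character-level SMILES tokenization with multi-character atoms
    kept whole, as described in the algorithms list above."""
    return SMILES_TOKEN.findall(smiles)
```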
<h3 id="models">Models</h3>
<ul>
<li>General-domain MSPM pre-trained on 1M ChEMBL molecules (10 epochs)</li>
<li>Task-specific MSPMs fine-tuned per dataset (optional stage)</li>
<li>QSAR models fine-tuned with transferred embeddings and encoder</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Split</th>
          <th>Metric</th>
          <th>MolPMoFiT (TTA)</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lipophilicity</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$0.565 \pm 0.037$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$0.635 \pm 0.031$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$1.197 \pm 0.127$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$2.082 \pm 0.460$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.950 \pm 0.020$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.931 \pm 0.025$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.828 \pm 0.029$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.816 \pm 0.022$</td>
          <td>D-MPNN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P4000 GPU (single GPU)</li>
<li>General-domain MSPM pre-training: approximately 1 day</li>
<li>Pre-training needs to be done only once; fine-tuning is fast per task</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>PyTorch + fastai v1 implementation with curated datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2020). Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. <em>Journal of Cheminformatics</em>, 12, 27. <a href="https://doi.org/10.1186/s13321-020-00430-x">https://doi.org/10.1186/s13321-020-00430-x</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2020molpmofit,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00430-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolBERT: Auxiliary Tasks for Molecular BERT Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/</guid><description>MolBERT applies BERT to SMILES with domain-relevant auxiliary tasks like physicochemical property prediction, improving virtual screening and QSAR.</description><content:encoded><![CDATA[<h2 id="bert-based-molecular-representations-with-auxiliary-pre-training-tasks">BERT-Based Molecular Representations with Auxiliary Pre-Training Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces MolBERT, a bidirectional Transformer (BERT) architecture applied to SMILES-based molecular representations for drug discovery. The primary contribution is a systematic study of how different domain-relevant self-supervised pre-training tasks affect the quality of learned molecular embeddings, paired with a model that achieves state-of-the-art performance on <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-activity relationship (QSAR)</a> benchmarks.</p>
<h2 id="why-domain-relevant-pre-training-matters-for-molecular-language-models">Why Domain-Relevant Pre-Training Matters for Molecular Language Models</h2>
<p>Molecular representations are foundational for predictive, generative, and analytical tasks in drug discovery. Language models applied to text-based molecular representations like SMILES have demonstrated strong performance across property prediction, reaction prediction, and molecular generation. However, several open questions remained at the time of this work:</p>
<ol>
<li><strong>Task selection for pre-training</strong>: Prior work explored masked token prediction, input translation, and property concatenation, but there was no systematic comparison of how different self-supervised tasks affect downstream performance.</li>
<li><strong>SMILES ambiguity</strong>: The same molecule can be encoded as many different SMILES strings depending on how the molecular graph is traversed. Canonicalization algorithms address this but introduce their own artifacts that may distract the model.</li>
<li><strong>Domain knowledge integration</strong>: Standard NLP pre-training objectives (e.g., masked language modeling) do not explicitly encode chemical knowledge. It was unclear whether incorporating chemistry-specific supervision during pre-training could improve representation quality.</li>
</ol>
<p>MolBERT addresses these gaps by evaluating three pre-training tasks, including a novel physicochemical property prediction objective, and measuring their individual and combined effects on downstream drug discovery benchmarks.</p>
<h2 id="three-auxiliary-tasks-for-chemistry-aware-pre-training">Three Auxiliary Tasks for Chemistry-Aware Pre-Training</h2>
<p>MolBERT uses the BERT-Base architecture (12 attention heads, 12 layers, 768-dimensional hidden states, approximately 85M parameters) and explores three self-supervised pre-training tasks:</p>
<p><strong>Masked Language Modeling (MaskedLM)</strong>: The standard BERT objective where 15% of input tokens are masked and the model predicts their identity. The loss is cross-entropy between predicted and true tokens.</p>
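The masking step can be sketched as follows. This is an illustrative simplification with hypothetical names; BERT's 80/10/10 mask/random/keep refinement is omitted:

```python
import random

MASK, MASK_FRAC = "[MASK]", 0.15

def mask_tokens(tokens, seed=0):
    """Mask 15% of input tokens for the MaskedLM objective; returns the
    corrupted sequence and (position, original token) prediction targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(MASK_FRAC * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = []
    for i in positions:
        targets.append((i, corrupted[i]))
        corrupted[i] = MASK
    return corrupted, targets
```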
<p><strong>SMILES Equivalence (SMILES-Eq)</strong>: A binary classification task where the model receives two SMILES strings and predicts whether they represent the same molecule. The second string is either a random permutation of the first (same molecule, different traversal) or a randomly sampled molecule. This is optimized with cross-entropy loss.</p>
<p><strong>Physicochemical Property Prediction (PhysChemPred)</strong>: Using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, a set of 200 real-valued molecular descriptors are computed for each molecule. The model predicts these normalized descriptors from the SMILES input using mean squared error:</p>
<p>$$\mathcal{L}_{\text{PhysChemPred}} = \frac{1}{D} \sum_{d=1}^{D} (y_d - \hat{y}_d)^2$$</p>
<p>where $D = 200$ is the number of descriptors, $y_d$ is the true normalized descriptor value, and $\hat{y}_d$ is the model&rsquo;s prediction.</p>
<p>The final training loss is the arithmetic mean of all active task losses:</p>
<p>$$\mathcal{L}_{\text{total}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathcal{L}_t$$</p>
<p>where $\mathcal{T}$ is the set of active pre-training tasks.</p>
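The two equations above translate directly into code. A minimal sketch (plain Python for clarity; names are hypothetical):

```python
def physchem_loss(y_true, y_pred):
    """MSE over the D normalized RDKit descriptors (D = 200 in the paper)."""
    assert len(y_true) == len(y_pred)
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

def total_loss(task_losses):
    """Arithmetic mean over the active pre-training tasks,
    e.g. {"masked_lm": ..., "physchem": ..., "smiles_eq": ...}."""
    return sum(task_losses.values()) / len(task_losses)
```

Equal weighting means no per-task loss scaling is tuned; each active task contributes identically to the gradient signal.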
<p>Additionally, MolBERT supports SMILES permutation augmentation during training, where each input molecule is represented by a randomly sampled non-canonical SMILES string rather than the canonical form. The model uses a fixed vocabulary of 42 tokens, a sequence length of 128, and relative positional embeddings (from Transformer-XL) to support arbitrary-length SMILES at inference time.</p>
<h2 id="ablation-study-and-benchmark-evaluation">Ablation Study and Benchmark Evaluation</h2>
<h3 id="pre-training-setup">Pre-Training Setup</h3>
<p>All models were pre-trained on the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol benchmark dataset</a>, consisting of approximately 1.6M compounds curated from ChEMBL, using an 80%/5% train/validation split. Training used the Adam optimizer with a learning rate of $3 \times 10^{-5}$ for 20 epochs (ablation) or 100 epochs (final model).</p>
<h3 id="ablation-impact-of-task-combinations-on-virtual-screening">Ablation: Impact of Task Combinations on Virtual Screening</h3>
<p>The ablation study evaluated all seven possible task combinations on the RDKit virtual screening benchmark (69 datasets, 5 query molecules per target). Results measured by AUROC and BEDROC20 (an early enrichment metric with $\alpha = 20$):</p>
<table>
  <thead>
      <tr>
          <th style="text-align: center">MaskedLM</th>
          <th style="text-align: center">PhysChemPred</th>
          <th style="text-align: center">SMILES-Eq</th>
          <th style="text-align: center">AUROC (w/ perm)</th>
          <th style="text-align: center">BEDROC20 (w/ perm)</th>
          <th style="text-align: center">AUROC (w/o perm)</th>
          <th style="text-align: center">BEDROC20 (w/o perm)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.685 +/- 0.069</td>
          <td style="text-align: center">0.246 +/- 0.041</td>
          <td style="text-align: center">0.707 +/- 0.059</td>
          <td style="text-align: center">0.280 +/- 0.042</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.738 +/- 0.060</td>
          <td style="text-align: center">0.323 +/- 0.071</td>
          <td style="text-align: center">0.740 +/- 0.066</td>
          <td style="text-align: center">0.322 +/- 0.065</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.483 +/- 0.092</td>
          <td style="text-align: center">0.092 +/- 0.069</td>
          <td style="text-align: center">0.493 +/- 0.068</td>
          <td style="text-align: center">0.108 +/- 0.070</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.476 +/- 0.077</td>
          <td style="text-align: center">0.064 +/- 0.034</td>
          <td style="text-align: center">0.514 +/- 0.165</td>
          <td style="text-align: center">0.084 +/- 0.014</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.696 +/- 0.058</td>
          <td style="text-align: center">0.283 +/- 0.077</td>
          <td style="text-align: center">0.676 +/- 0.060</td>
          <td style="text-align: center">0.250 +/- 0.073</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.719 +/- 0.057</td>
          <td style="text-align: center">0.293 +/- 0.071</td>
          <td style="text-align: center">0.716 +/- 0.061</td>
          <td style="text-align: center">0.290 +/- 0.076</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.129 +/- 0.067</td>
          <td style="text-align: center">0.005 +/- 0.037</td>
          <td style="text-align: center">0.508 +/- 0.068</td>
          <td style="text-align: center">0.048 +/- 0.035</td>
      </tr>
  </tbody>
</table>
<p>Key findings from the ablation:</p>
<ul>
<li>PhysChemPred had the highest individual impact (average BEDROC20 of 0.292 alone vs. 0.266 for MaskedLM alone).</li>
<li>Combining MaskedLM + PhysChemPred achieved the best performance (BEDROC20 of 0.323), though the additive gain from MaskedLM was modest (+0.031).</li>
<li>The SMILES-Eq task consistently decreased performance when added to other task combinations.</li>
</ul>
<p>A further sub-ablation on PhysChemPred descriptor groups showed that surface descriptors alone (49 of 200 descriptors) achieved nearly the same performance as the full set, suggesting molecular surface properties provide particularly informative supervision.</p>
<h3 id="virtual-screening-results">Virtual Screening Results</h3>
<p>Using the best task combination (MaskedLM + PhysChemPred) trained for 100 epochs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>AUROC</th>
          <th>BEDROC20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolBERT (100 epochs)</td>
          <td>0.743 +/- 0.062</td>
          <td>0.344 +/- 0.062</td>
      </tr>
      <tr>
          <td>CDDD</td>
          <td>0.725 +/- 0.057</td>
          <td>0.310 +/- 0.080</td>
      </tr>
      <tr>
          <td>RDKit descriptors</td>
          <td>0.633 +/- 0.027</td>
          <td>0.217 +/- 0.000</td>
      </tr>
      <tr>
          <td>ECFC4</td>
          <td>0.603 +/- 0.056</td>
          <td>0.170 +/- 0.079</td>
      </tr>
  </tbody>
</table>
<p>MolBERT outperformed all baselines including <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (the prior state of the art), RDKit calculated descriptors, and extended-connectivity fingerprints (ECFC4).</p>
<h3 id="qsar-results">QSAR Results</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> regression tasks (RMSE, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td style="text-align: center">0.687 +/- 0.08</td>
          <td style="text-align: center">0.902 +/- 0.06</td>
          <td style="text-align: center">0.567 +/- 0.06</td>
          <td style="text-align: center">0.552 +/- 0.07</td>
          <td style="text-align: center"><strong>0.531 +/- 0.04</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td style="text-align: center">1.671 +/- 0.45</td>
          <td style="text-align: center">2.876 +/- 0.38</td>
          <td style="text-align: center">1.456 +/- 0.43</td>
          <td style="text-align: center">1.523 +/- 0.66</td>
          <td style="text-align: center"><strong>0.948 +/- 0.33</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td style="text-align: center">0.738 +/- 0.04</td>
          <td style="text-align: center">0.770 +/- 0.03</td>
          <td style="text-align: center">0.669 +/- 0.02</td>
          <td style="text-align: center">0.602 +/- 0.01</td>
          <td style="text-align: center"><strong>0.561 +/- 0.03</strong></td>
      </tr>
  </tbody>
</table>
<p>On MoleculeNet classification tasks (AUROC, higher is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BACE</td>
          <td style="text-align: center">0.831</td>
          <td style="text-align: center">0.845</td>
          <td style="text-align: center">0.833</td>
          <td style="text-align: center">0.849</td>
          <td style="text-align: center"><strong>0.866</strong></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td style="text-align: center">0.696</td>
          <td style="text-align: center">0.678</td>
          <td style="text-align: center">0.761</td>
          <td style="text-align: center">0.750</td>
          <td style="text-align: center"><strong>0.762</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td style="text-align: center">0.708</td>
          <td style="text-align: center">0.714</td>
          <td style="text-align: center">0.753</td>
          <td style="text-align: center">0.747</td>
          <td style="text-align: center"><strong>0.783</strong></td>
      </tr>
  </tbody>
</table>
<p>Fine-tuned MolBERT achieved the best performance on all six QSAR datasets. When used as a fixed feature extractor with an SVM, MolBERT embeddings outperformed other representations on three of six tasks.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Pre-training task selection matters significantly.</strong> The choice of auxiliary tasks during pre-training has a large effect on downstream performance. PhysChemPred provides the strongest individual signal.</li>
<li><strong>Domain-relevant auxiliary tasks improve representation quality.</strong> Predicting physicochemical properties during pre-training encodes chemical knowledge directly into the embeddings, outperforming purely linguistic objectives.</li>
<li><strong>The SMILES equivalence task hurts performance.</strong> Despite being chemically motivated, the SMILES-Eq task consistently degraded results, suggesting it may introduce conflicting learning signals.</li>
<li><strong>PhysChemPred organizes the embedding space.</strong> Analysis of pairwise cosine similarities showed that models trained with PhysChemPred assign high similarity to permutations of the same molecule and low similarity to different molecules, creating a more semantically meaningful representation space.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The paper evaluates only SMILES-based representations, inheriting all limitations of string-based molecular encodings (inability to capture 3D structure, sensitivity to tokenization).</li>
<li>The virtual screening evaluation uses a fixed number of query molecules ($n = 5$), which may not reflect realistic screening scenarios.</li>
<li>Cross-validation splits from ChemBench were used for QSAR evaluation rather than scaffold splits, which may overestimate performance on structurally novel compounds.</li>
<li>The model&rsquo;s 128-token sequence length limit may truncate larger molecules, though relative positional embeddings partially address this at inference time.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending MolBERT to learn representations for other biological entities such as proteins, and developing more advanced pre-training strategies.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>GuacaMol (ChEMBL)</td>
          <td>~1.6M compounds</td>
          <td>80% train / 5% validation split</td>
      </tr>
      <tr>
          <td>Virtual Screening</td>
          <td>RDKit benchmark v1.2</td>
          <td>69 target datasets</td>
          <td>Filtered subset with active/decoy compounds</td>
      </tr>
      <tr>
          <td>QSAR (Regression)</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
      <tr>
          <td>QSAR (Classification)</td>
          <td>BACE, BBBP, HIV</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BERT-Base (12 heads, 12 layers, 768-dim hidden, ~85M params)</li>
<li>Optimizer: Adam, learning rate $3 \times 10^{-5}$</li>
<li>Vocabulary: 42 tokens, sequence length 128</li>
<li>Masking: 15% of tokenized input</li>
<li>Positional encoding: relative positional embeddings (Transformer-XL)</li>
<li>Fine-tuning SVM: $C = 5.0$, RBF kernel (from Winter et al.)</li>
<li>Fine-tuning head: single linear layer on pooled output</li>
<li>Embeddings: pooled output (or average sequence output when only MaskedLM is used)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BERT-Base with ~85M parameters</li>
<li>Pre-trained weights available at <a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>Virtual Screening, Classification QSAR</td>
          <td>Standard area under ROC curve</td>
      </tr>
      <tr>
          <td>BEDROC20</td>
          <td>Virtual Screening</td>
          <td>Early enrichment metric, $\alpha = 20$</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression QSAR</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>2 GPUs, 16 CPUs</li>
<li>Pre-training time: ~40 hours (20 epochs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Official implementation with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fabian, B., Edlich, T., Gaspar, H., Segler, M., Meyers, J., Fiscato, M., &amp; Ahmed, M. (2020). Molecular representation learning with language models and domain-relevant auxiliary tasks. <em>arXiv preprint arXiv:2011.13230</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fabian2020molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular representation learning with language models and domain-relevant auxiliary tasks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fabian, Benedek and Edlich, Thomas and Gaspar, H{\&#39;e}l{\&#39;e}na and Segler, Marwin and Meyers, Joshua and Fiscato, Marco and Ahmed, Mohamed}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2011.13230}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LMs Generate 3D Molecules from XYZ, CIF, PDB Files</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</guid><description>Transformer language models trained on XYZ, CIF, and PDB sequences generate valid 3D molecules, crystals, and protein binding sites.</description><content:encoded><![CDATA[<h2 id="language-models-as-3d-chemical-structure-generators">Language Models as 3D Chemical Structure Generators</h2>
<p>This is a <strong>Method</strong> paper that demonstrates transformer-based language models can generate molecules, crystalline materials, and protein binding sites directly in three dimensions by training on sequences derived from standard chemical file formats (XYZ, CIF, PDB). The key contribution is showing that unmodified autoregressive language models, using only next-token prediction, achieve performance comparable to domain-specific 3D generative models that incorporate SE(3) equivariance and other geometric inductive biases.</p>
<h2 id="beyond-graphs-and-strings-the-need-for-3d-chemical-generation">Beyond Graphs and Strings: The Need for 3D Chemical Generation</h2>
<p>Molecular design with deep learning has largely relied on two representation paradigms: molecular graphs (processed with graph neural networks) and linearized string representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (processed with sequence models). Both approaches have proven effective for drug-like organic molecules, but they share a fundamental limitation: they cannot represent structures whose identity depends on 3D spatial arrangement.</p>
<p>Crystalline materials, for example, have periodic lattice structures that cannot be reduced to simple graphs. Protein binding sites are defined by the 3D arrangement of hundreds of atoms across multiple residues. For tasks like catalysis design or structure-based drug discovery, the geometric positions of atoms are essential information that graphs and strings discard entirely.</p>
<p>Existing 3D generative models address this gap but typically require specialized architectures with SE(3) equivariance to handle rotational and translational symmetries. This work asks whether the general-purpose sequence modeling capability of transformers is sufficient to learn 3D chemical structure distributions without any domain-specific architectural modifications.</p>
<h2 id="direct-tokenization-of-chemical-file-formats">Direct Tokenization of Chemical File Formats</h2>
<p>The core insight is straightforward: any 3D molecule, crystal, or biomolecule is already stored as text in standard file formats (<a href="https://en.wikipedia.org/wiki/XYZ_file_format">XYZ</a>, <a href="https://en.wikipedia.org/wiki/Crystallographic_Information_File">CIF</a>, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)">PDB</a>). These files encode atom types and their Cartesian coordinates as sequences of characters and numbers. Rather than designing specialized architectures for point cloud generation, the authors simply tokenize these files and train a standard GPT-style transformer to predict the next token.</p>
<p>A molecule with $n$ atoms is represented as:</p>
<p>$$
\mathcal{M} = (e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>where $e_i$ is the element type and $(x_i, y_i, z_i)$ are Cartesian coordinates. Crystals additionally include lattice parameters:</p>
<p>$$
\mathcal{C} = (\ell_a, \ell_b, \ell_c, \alpha, \beta, \gamma, e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>Protein binding sites use residue-atom indicators (e.g., HIS-C, CYS-N) instead of bare element symbols:</p>
<p>$$
\mathcal{P} = (a_1, x_1, y_1, z_1, \dots, a_n, x_n, y_n, z_n)
$$</p>
<p>The language model learns the joint distribution over the resulting $N$-token sequence via the standard autoregressive factorization:</p>
<p>$$
p(x) = \prod_{i=1}^{N} p(t_i \mid t_{i-1}, \dots, t_1)
$$</p>
<p>Two tokenization strategies are explored:</p>
<ol>
<li><strong>Character-level (LM-CH)</strong>: Every character in the file is a token, including digits, minus signs, spaces, and newlines. This produces long sequences but uses a small vocabulary (~30 tokens).</li>
<li><strong>Atom+coordinate-level (LM-AC)</strong>: Each atom placement requires exactly 4 tokens: one element/residue token and three coordinate tokens (e.g., &lsquo;-1.98&rsquo;). The vocabulary is larger (~100-10K tokens) but sequences are shorter.</li>
</ol>
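<p>The difference between the two strategies can be sketched on a single toy XYZ line. The token boundaries below are illustrative; the exact delimiter handling and vocabularies used in the paper may differ:</p>

```python
def tokenize_char(xyz_line: str) -> list[str]:
    """LM-CH: every character, including digits, signs, and spaces, is a token."""
    return list(xyz_line)

def tokenize_atom_coord(xyz_line: str) -> list[str]:
    """LM-AC: one element token plus three coordinate tokens per atom."""
    element, x, y, z = xyz_line.split()
    return [element, x, y, z]

line = "C -1.98 0.38 2.00"
print(len(tokenize_char(line)))   # -> 17 tokens for this single atom line
print(tokenize_atom_coord(line))  # -> ['C', '-1.98', '0.38', '2.00']
```

<p>The 4x reduction in sequence length per atom is the main reason LM-AC scales to larger structures than LM-CH.</p>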
<p>Numerical precision is controlled by rounding coordinates to 1, 2, or 3 decimal places. Since the model lacks rotation and translation invariance, random rotation augmentation during training improves performance.</p>
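<p>A minimal, stdlib-only sketch of that augmentation step, using composed Euler-angle rotations (this sampling is not uniform over SO(3), and the paper does not specify its exact rotation scheme):</p>

```python
import math
import random

def rotation_matrix(ax: float, ay: float, az: float) -> list[list[float]]:
    """Compose rotations about the x, y, and z axes (Euler angles)."""
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    rx = [[1, 0, 0], [0, cx, -sx], [0, sx, cx]]
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    rz = [[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]
    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]
    return matmul(rz, matmul(ry, rx))

def augment(coords, decimals=2, rng=random):
    """Randomly rotate a structure, then re-round to the token precision."""
    r = rotation_matrix(*(rng.uniform(0, 2 * math.pi) for _ in range(3)))
    return [tuple(round(sum(r[i][k] * p[k] for k in range(3)), decimals)
                  for i in range(3)) for p in coords]
```

<p>Re-rounding after rotation keeps the augmented coordinates on the same discretized grid as the coordinate vocabulary.</p>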
<h2 id="experiments-across-molecules-crystals-and-protein-binding-sites">Experiments Across Molecules, Crystals, and Protein Binding Sites</h2>
<h3 id="molecular-generation-zinc">Molecular Generation (ZINC)</h3>
<p>The model is evaluated on 250K commercially available molecules from the ZINC dataset, with an average of 23 heavy atoms. XYZ files are generated using RDKit&rsquo;s conformer tools. Coordinates use 2 decimal places of precision. The authors generate 10K molecules and evaluate both 3D geometry quality and standard generative metrics.</p>
<p>For 3D geometry assessment, the root mean squared deviation (RMSD) between language-model-generated and RDKit-generated conformers falls between 1.0 and 2.0 for most molecules, with a heavy tail extending to 4.0.</p>
<p>Standard metrics include validity, uniqueness, novelty, and the Wasserstein (earth mover&rsquo;s) distance (WA) between molecular property distributions (QED, SA score, molecular weight).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>3D</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>WA MW</th>
          <th>WA SA</th>
          <th>WA QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Train</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>0.816</td>
          <td>0.013</td>
          <td>0.002</td>
      </tr>
      <tr>
          <td>SM-LM</td>
          <td>No</td>
          <td>98.35</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.640</td>
          <td>0.049</td>
          <td>0.005</td>
      </tr>
      <tr>
          <td>SF-LM</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.772</td>
          <td>0.085</td>
          <td>0.006</td>
      </tr>
      <tr>
          <td>JTVAE</td>
          <td>No</td>
          <td>100.0</td>
          <td>98.56</td>
          <td>100.0</td>
          <td>22.63</td>
          <td>0.126</td>
          <td>0.023</td>
      </tr>
      <tr>
          <td>ENF</td>
          <td>Yes</td>
          <td>1.05</td>
          <td>96.37</td>
          <td>99.72</td>
          <td>168.5</td>
          <td>1.886</td>
          <td>0.160</td>
      </tr>
      <tr>
          <td>G-SchNet</td>
          <td>Yes</td>
          <td>1.20</td>
          <td>55.96</td>
          <td>98.33</td>
          <td>152.7</td>
          <td>1.126</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>Yes</td>
          <td>77.51</td>
          <td>96.40</td>
          <td>95.30</td>
          <td>101.2</td>
          <td>0.939</td>
          <td>0.093</td>
      </tr>
      <tr>
          <td>LM-CH</td>
          <td>Yes</td>
          <td>90.13</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.912</td>
          <td>2.608</td>
          <td>0.077</td>
      </tr>
      <tr>
          <td>LM-AC</td>
          <td>Yes</td>
          <td>98.51</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>1.811</td>
          <td>0.026</td>
          <td>0.004</td>
      </tr>
  </tbody>
</table>
<p>The atom+coordinate tokenization model (LM-AC) achieves 98.51% validity with 100% uniqueness and novelty. Its WA scores for molecular weight (1.811) and QED (0.004) are substantially better than all other 3D generative baselines and competitive with SMILES/SELFIES language models. The character-level model (LM-CH) at 90.13% validity performs comparably to graph-based models but falls short of the string-based language models.</p>
<h3 id="crystal-generation-perov-5-and-mp-20">Crystal Generation (Perov-5 and MP-20)</h3>
<p>Crystal generation uses CIF-derived sequences with 3 decimal places of precision. Two datasets are used: Perov-5 (18,928 <a href="https://en.wikipedia.org/wiki/Perovskite_(structure)">perovskite</a> materials, 5 atoms per unit cell, 56 elements) and MP-20 (45,231 diverse materials, 1-20 atoms per unit cell, 89 elements).</p>
<p>Evaluation metrics include structural validity (minimum interatomic distance &gt; 0.5 angstrom), compositional validity (charge neutrality via SMACT), coverage (recall and precision between generated and test sets), and earth mover&rsquo;s distance for density and number of unique elements.</p>
<table>
  <thead>
      <tr>
          <th>Data</th>
          <th>Model</th>
          <th>Struc. Valid (%)</th>
          <th>Comp. Valid (%)</th>
          <th>COV-R (%)</th>
          <th>COV-P (%)</th>
          <th>WA density</th>
          <th>WA elements</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perov-5</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>98.59</td>
          <td>99.45</td>
          <td>98.46</td>
          <td>0.126</td>
          <td>0.063</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-CH</td>
          <td>100.0</td>
          <td>98.51</td>
          <td>99.60</td>
          <td>99.42</td>
          <td>0.071</td>
          <td>0.036</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-AC</td>
          <td>100.0</td>
          <td>98.79</td>
          <td>98.78</td>
          <td>99.36</td>
          <td>0.089</td>
          <td>0.028</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>86.70</td>
          <td>99.15</td>
          <td>99.49</td>
          <td>0.688</td>
          <td>1.432</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-CH</td>
          <td>84.81</td>
          <td>83.55</td>
          <td>99.25</td>
          <td>97.89</td>
          <td>0.864</td>
          <td>0.132</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-AC</td>
          <td>95.81</td>
          <td>88.87</td>
          <td>99.60</td>
          <td>98.55</td>
          <td>0.696</td>
          <td>0.092</td>
      </tr>
  </tbody>
</table>
<p>On Perov-5, both language models outperform CDVAE across most metrics. On the more diverse MP-20 dataset, LM-AC achieves the best scores on 3 of 6 metrics and remains competitive on the others. LM-CH struggles more with structural validity on MP-20 (84.81%).</p>
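<p>The structural-validity criterion above (minimum interatomic distance &gt; 0.5 angstrom) can be sketched directly. Note that a faithful check for crystals would also consider periodic images of the atoms under the unit cell, which this sketch omits:</p>

```python
import math

MIN_DIST = 0.5  # angstrom threshold for structural validity

def structurally_valid(coords: list[tuple[float, float, float]]) -> bool:
    """True if no two atoms are closer than MIN_DIST.

    Sketch only: periodic boundary conditions are ignored, so this
    understates clashes near unit-cell edges in real crystals.
    """
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < MIN_DIST:
                return False
    return True
```
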
<h3 id="protein-binding-site-generation-pdb">Protein Binding Site Generation (PDB)</h3>
<p>The most challenging task involves generating protein binding sites (~200-250 atoms each) from PDB-derived sequences. The dataset contains approximately 180K protein-ligand pairs. Residue-atom tokenization is used (e.g., CYS-C, CYS-N), with 2 decimal places of precision.</p>
<p>Validity is assessed per-residue using xyz2mol, with an additional check for inter-residue atomic overlap (atoms from different residues closer than the minimum bond distance). Approximately 99% of generated pockets pass the residue validity check, while about 5% fail the overlap check. Of generated pockets, 89.8% have unique residue orderings, and 83.6% have novel orderings not seen in training, indicating the model is generating novel binding site structures rather than memorizing.</p>
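<p>The uniqueness and novelty figures can be computed as simple fractions over residue orderings; the paper&rsquo;s exact normalization (e.g., whether novelty is measured over unique orderings only) is an assumption here:</p>

```python
def ordering_stats(generated: list[tuple[str, ...]],
                   training: set[tuple[str, ...]]) -> tuple[float, float]:
    """Return (fraction of generated orderings that are distinct,
    fraction of generated orderings absent from the training set)."""
    unique = len(set(generated)) / len(generated)
    novel = sum(1 for g in generated if g not in training) / len(generated)
    return unique, novel
```
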
<h2 id="competitive-3d-generation-without-geometric-inductive-biases">Competitive 3D Generation Without Geometric Inductive Biases</h2>
<p>The central finding is that standard transformer language models, without any equivariance or geometric inductive biases, can generate valid 3D chemical structures across three substantially different domains. The atom+coordinate tokenization (LM-AC) consistently outperforms character-level tokenization (LM-CH), likely because it produces shorter sequences and reduces the number of sequential decisions needed per atom placement.</p>
<p>Several limitations are worth noting. The model generates atoms using absolute Cartesian coordinates, which means it must learn rotation and translation invariance purely from data augmentation rather than having it built into the architecture. The authors acknowledge this becomes increasingly difficult as structure size grows. The vocabulary size also scales with coordinate precision and structure complexity, which could become prohibitive for very large systems.</p>
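<p>The precision/vocabulary tradeoff is easy to quantify: if LM-AC assigns one token per distinct rounded coordinate value, each additional decimal place multiplies the coordinate vocabulary by roughly 10. For an illustrative (assumed) &plusmn;10 angstrom box:</p>

```python
def coord_vocab_size(lo: float, hi: float, decimals: int) -> int:
    """Number of distinct coordinate tokens for range [lo, hi] at a
    given decimal precision, assuming one token per rounded value."""
    step = 10 ** decimals
    return int(round(hi * step)) - int(round(lo * step)) + 1

for d in (1, 2, 3):
    print(d, coord_vocab_size(-10.0, 10.0, d))  # 201, 2001, 20001 tokens
```
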
<p>The paper does not include computational cost comparisons with baseline models, making it difficult to assess the practical tradeoff between the simplicity of the language modeling approach and the efficiency of specialized architectures. The authors also note that further validation through computational simulation and experiment is needed to confirm the physical plausibility of generated structures.</p>
<p>Future directions identified include inverse design of molecules and materials conditioned on target properties, extension to more complex structures (metal-organic frameworks), and exploration of alternative tokenization strategies to handle larger systems.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC</td>
          <td>250K molecules</td>
          <td>~23 heavy atoms avg; XYZ files via RDKit conformer generation</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Perov-5</td>
          <td>18,928 perovskites</td>
          <td>5 atoms/unit cell, 56 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>MP-20</td>
          <td>45,231 materials</td>
          <td>1-20 atoms/unit cell, 89 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Protein binding sites</td>
          <td>~180K protein-ligand pairs</td>
          <td>Processed to 200-250 atoms per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-style transformer with ~1M to 100M parameters</li>
<li><strong>Layers</strong>: 12</li>
<li><strong>Embedding size</strong>: 128 to 1024</li>
<li><strong>Attention heads</strong>: 4 to 12</li>
<li><strong>Batch size</strong>: 4 to 32 structures</li>
<li><strong>Learning rate</strong>: $10^{-4}$ to $10^{-5}$, decayed to $9 \times 10^{-6}$</li>
<li><strong>Data augmentation</strong>: Random rotation of training structures at each epoch</li>
<li><strong>Numerical precision</strong>: 2 decimal places (molecules, proteins), 3 decimal places (crystals)</li>
</ul>
<h3 id="models">Models</h3>
<p>No pre-trained model weights are publicly available. The paper mentions &ldquo;Example code can be found at&rdquo; but the URL appears to be missing from the published version.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Molecules</td>
          <td>xyz2mol produces valid RDKit Mol object</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Crystals</td>
          <td>Structural (min distance &gt; 0.5 angstrom) and compositional (charge neutral)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>All</td>
          <td>Fraction of distinct generated structures</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>All</td>
          <td>Fraction not in training set</td>
      </tr>
      <tr>
          <td>Earth mover&rsquo;s distance</td>
          <td>All</td>
          <td>Distribution match for domain-specific properties</td>
      </tr>
      <tr>
          <td>RMSD</td>
          <td>Molecules</td>
          <td>Deviation from RDKit conformer geometries</td>
      </tr>
      <tr>
          <td>Coverage</td>
          <td>Crystals</td>
          <td>Recall and precision between generated and test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Models were trained on Compute Canada systems. Specific GPU types, counts, and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository, model weights, or datasets specific to this work were found. The ZINC, Perov-5, and MP-20 datasets used for evaluation are publicly available from their original sources.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D. &amp; Aspuru-Guzik, A. (2023). Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files. <em>arXiv preprint arXiv:2305.05708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2023language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can generate molecules, materials, and protein binding sites directly in three dimensions as {XYZ}, {CIF}, and {PDB} files}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2305.05708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM4Mol: ChatGPT Captions as Molecular Representations</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/llm4mol-captions-as-representations/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/llm4mol-captions-as-representations/</guid><description>LLM4Mol uses ChatGPT to generate text explanations for SMILES strings and fine-tunes RoBERTa on these captions for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="llm-generated-text-as-molecular-representations">LLM-Generated Text as Molecular Representations</h2>
<p>This is a <strong>Method</strong> paper that proposes using large language models (specifically ChatGPT) to generate natural language explanations for molecules represented as SMILES strings, and then using those explanations as input representations for downstream molecular property prediction. The approach is called <strong>Captions as new Representations (CaR)</strong>. The authors also evaluate ChatGPT directly on zero-shot and few-shot molecular classification to gauge in-context learning ability on chemical data.</p>
<h2 id="bridging-molecular-data-and-natural-language-understanding">Bridging Molecular Data and Natural Language Understanding</h2>
<p>Molecular property prediction is central to <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a>, drug discovery, and materials design. Molecules are typically represented either as graphs (processed by GNNs) or as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> (processed by NLP-based methods). While both paradigms have shown success, they do not directly use the broad world knowledge embedded in large language models.</p>
<p>LLMs such as ChatGPT demonstrate strong capabilities in text understanding and can generate informative descriptions when given SMILES strings, including functional groups, chemical properties, and potential pharmaceutical applications. The question motivating this work is whether LLM-generated textual descriptions can serve as better molecular representations than raw SMILES or graph encodings for property prediction tasks.</p>
<p>Prior work had not systematically explored two directions: (1) whether LLMs can perform molecular classification via in-context learning, and (2) whether LLM-generated captions can serve as transferable representations for small downstream models.</p>
<h2 id="captions-as-representations-car">Captions as Representations (CaR)</h2>
<p>The core contribution is the CaR framework, which operates in two stages:</p>
<ol>
<li>
<p><strong>Caption generation</strong>: Given a molecule&rsquo;s SMILES string, ChatGPT is prompted to produce a detailed textual explanation covering functional groups, chemical properties, and potential applications.</p>
</li>
<li>
<p><strong>Fine-tuning a small LM</strong>: The generated text explanations replace the original SMILES as input to a pre-trained language model (e.g., RoBERTa). This small LM is then fine-tuned on downstream classification or regression tasks.</p>
</li>
</ol>
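<p>The data-preparation side of stage 1 can be sketched as below. <code>query_chatgpt</code> is a stand-in for the actual API call, and the prompt wording is an assumption, not the authors&rsquo; exact prompt:</p>

```python
# CaR stage 1 sketch: replace each SMILES with a generated caption.
PROMPT = ("You are an expert chemist. Describe the molecule with SMILES "
          "{smiles}: its functional groups, chemical properties, and "
          "potential applications.")

def query_chatgpt(prompt: str) -> str:
    """Placeholder for the real ChatGPT API call (hypothetical stub)."""
    return "Contains a nitro group; potentially mutagenic."

def build_caption_dataset(smiles_labels: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Produce the (caption, label) pairs that the small LM is fine-tuned on."""
    return [(query_chatgpt(PROMPT.format(smiles=s)), y) for s, y in smiles_labels]
```
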
<p>The insight is that ChatGPT&rsquo;s world knowledge can enrich the molecular representation with semantically meaningful features that raw SMILES lack. For example, on the PTC (Predictive Toxicology Challenge) dataset, the authors performed keyword searches for terms like &ldquo;toxicity&rdquo;, &ldquo;cancer&rdquo;, and &ldquo;harmful&rdquo; in the ChatGPT-generated explanations and found that these keywords appeared predominantly in entries labeled as toxic, indicating that the generated captions carry predictive signal.</p>
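<p>That keyword analysis is straightforward to reproduce in spirit; the term list comes from the paper, but the matching rule (lowercase substring search) is an assumption:</p>

```python
KEYWORDS = ("toxicity", "cancer", "harmful")  # terms the authors searched for

def keyword_hit_rate(captions: list[str], labels: list[int]) -> dict[int, float]:
    """Fraction of captions per label containing any keyword, as a rough
    proxy for the caption/label correlation reported on PTC."""
    hits = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for text, y in zip(captions, labels):
        counts[y] += 1
        if any(k in text.lower() for k in KEYWORDS):
            hits[y] += 1
    return {y: hits[y] / counts[y] for y in counts if counts[y]}
```
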
<p>The authors also explore <strong>in-context molecular classification</strong>, where ChatGPT is directly prompted with zero or few examples to classify molecules. This serves as a preliminary evaluation of LLM reasoning capabilities on molecular data.</p>
<h2 id="experimental-setup-and-benchmarks">Experimental Setup and Benchmarks</h2>
<h3 id="datasets">Datasets</h3>
<p>The evaluation spans 9 datasets across classification and regression:</p>
<ul>
<li><strong>Classification (TUDataset)</strong>: MUTAG, PTC, AIDS</li>
<li><strong>Classification (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>)</strong>: SIDER, ClinTox, BACE, BBBP</li>
<li><strong>Regression (MoleculeNet)</strong>: ESOL, <a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Baselines include GNN-based methods (GCN, GIN, ChebyNet, D-MPNN, GraphMVP, InfoGraph, G-Motif, Mole-BERT) and SMILES-based methods (ECFP4-MLP, <a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES-Transformer</a>, MolR, <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, MolKD).</p>
<h3 id="splitting-strategies">Splitting Strategies</h3>
<ul>
<li><strong>Random splitting</strong>: 8/1/1 train/validate/test with 10-fold cross-validation</li>
<li><strong>Scaffold splitting</strong>: 5 random seeds, reported as mean and standard deviation</li>
</ul>
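<p>A minimal sketch of the random 8/1/1 split; scaffold splitting (not shown) instead groups molecules by their Bemis-Murcko scaffolds, typically via RDKit, so that test-set scaffolds never appear in training:</p>

```python
import random

def random_split(items: list, seed: int = 0,
                 frac=(0.8, 0.1, 0.1)) -> tuple[list, list, list]:
    """Shuffle and partition into train/validation/test (8/1/1 by default)."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(frac[0] * n)
    n_val = int(frac[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```
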
<h3 id="key-results-random-splitting">Key Results: Random Splitting</h3>
<p>Under random splitting, CaR-RoBERTa achieves the best results on almost all datasets:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>MUTAG (ACC)</th>
          <th>PTC (ACC)</th>
          <th>AIDS (ACC)</th>
          <th>SIDER (AUC)</th>
          <th>ClinTox (AUC)</th>
          <th>ESOL (RMSE)</th>
          <th>Lipo (RMSE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCN</td>
          <td>90.00</td>
          <td>62.57</td>
          <td>78.68</td>
          <td>64.24</td>
          <td>91.88</td>
          <td>0.77</td>
          <td>0.80</td>
      </tr>
      <tr>
          <td>GIN</td>
          <td>89.47</td>
          <td>58.29</td>
          <td>78.01</td>
          <td>66.19</td>
          <td>92.08</td>
          <td>0.67</td>
          <td>0.79</td>
      </tr>
      <tr>
          <td>ECFP4-MLP</td>
          <td>96.84</td>
          <td>85.71</td>
          <td>94.64</td>
          <td>90.19</td>
          <td>95.81</td>
          <td>0.60</td>
          <td>0.60</td>
      </tr>
      <tr>
          <td>CaR-RoBERTa</td>
          <td>91.05</td>
          <td>93.14</td>
          <td>94.37</td>
          <td>88.81</td>
          <td>99.80</td>
          <td>0.45</td>
          <td>0.47</td>
      </tr>
  </tbody>
</table>
<p>CaR-RoBERTa improves over the best GNN by up to 53% on PTC and reduces RMSE by 35-37% on regression tasks. However, ECFP4-MLP outperforms CaR on MUTAG (96.84 vs. 91.05).</p>
<h3 id="key-results-scaffold-splitting">Key Results: Scaffold Splitting</h3>
<p>Under the more challenging scaffold splitting:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>SIDER (AUC)</th>
          <th>ClinTox (AUC)</th>
          <th>BACE (AUC)</th>
          <th>BBBP (AUC)</th>
          <th>ESOL (RMSE)</th>
          <th>Lipo (RMSE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GraphMVP-C</td>
          <td>63.90</td>
          <td>77.50</td>
          <td>81.20</td>
          <td>72.40</td>
          <td>1.03</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>Mole-BERT</td>
          <td>62.80</td>
          <td>78.90</td>
          <td>80.80</td>
          <td>71.90</td>
          <td>1.02</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>MolKD</td>
          <td>61.30</td>
          <td>83.80</td>
          <td>80.10</td>
          <td>74.80</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>CaR-RoBERTa</td>
          <td>58.06</td>
          <td>84.16</td>
          <td>80.73</td>
          <td>81.99</td>
          <td>0.96</td>
          <td>1.02</td>
      </tr>
  </tbody>
</table>
<p>Results are more mixed under scaffold splitting. CaR achieves the best performance on ClinTox (+30% over GNNs) and BBBP (+15%), but underperforms on SIDER and Lipophilicity.</p>
<h3 id="few-shot-classification-with-chatgpt">Few-Shot Classification with ChatGPT</h3>
<p>Direct few-shot classification with ChatGPT shows mixed results. On MUTAG, ChatGPT underperforms classical methods across all shot counts; on PTC, it outperforms GNNs in the few-shot regime. Performance generally improves as the number of shots increases, but results are inconsistent across different prompts.</p>
<h3 id="replacing-the-small-lm">Replacing the Small LM</h3>
<p>The authors test CaR with different downstream models: RoBERTa, DeBERTa, and an adaptive language model for molecules. Pre-trained models all perform similarly, and all outperform a DeBERTa trained from scratch, validating that CaR&rsquo;s effectiveness comes from the caption quality rather than the specific choice of downstream model.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>ChatGPT-generated text explanations serve as effective molecular representations, outperforming GNNs and SMILES-based methods on most benchmarks under random splitting.</li>
<li>ChatGPT has some capacity for few-shot molecular classification, but performance is inconsistent and prompt-sensitive.</li>
<li>The CaR approach is model-agnostic: different pre-trained small LMs achieve similar results when fine-tuned on the generated captions.</li>
<li>Under scaffold splitting, CaR shows strong results on some datasets (ClinTox, BBBP) but underperforms on others (SIDER, Lipophilicity).</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ul>
<li><strong>Single LLM</strong>: Only ChatGPT was used. Other LLMs (GPT-4, domain-specific models like MolReGPT) were not evaluated.</li>
<li><strong>No graph structure integration</strong>: CaR treats molecular prediction purely as an NLP task and does not incorporate structural graph information, which is known to be important for molecular properties.</li>
<li><strong>Limited to small molecules</strong>: The approach works only for molecules representable as SMILES. Proteins, antibodies, and other large biomolecules with 3D structure are not addressed.</li>
</ul>
<h3 id="additional-considerations">Additional Considerations</h3>
<p>The random splitting results are notably strong, but random splits tend to overestimate performance compared to scaffold splits, which test generalization to structurally novel molecules. The high variance on some scaffold-split results (e.g., ClinTox with 17.63 standard deviation) suggests instability. The reliance on a proprietary API (ChatGPT) also limits reproducibility and introduces cost constraints for large-scale applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification</td>
          <td>MUTAG (TUDataset)</td>
          <td>188 molecules</td>
          <td>Mutagenicity prediction</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>PTC (TUDataset)</td>
          <td>344 molecules</td>
          <td>Predictive toxicology</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>AIDS (TUDataset)</td>
          <td>2,000 molecules</td>
          <td>HIV activity</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>SIDER (MoleculeNet)</td>
          <td>1,427 molecules</td>
          <td>Side effect prediction</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>ClinTox (MoleculeNet)</td>
          <td>1,478 molecules</td>
          <td>Clinical trial toxicity</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE (MoleculeNet)</td>
          <td>1,513 molecules</td>
          <td><a href="https://en.wikipedia.org/wiki/Beta-secretase_1">Beta-secretase</a> inhibition</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP (MoleculeNet)</td>
          <td>2,039 molecules</td>
          <td>Blood-brain barrier penetration</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL (MoleculeNet)</td>
          <td>1,128 molecules</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity (MoleculeNet)</td>
          <td>4,200 molecules</td>
          <td>Lipophilicity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>ChatGPT (GPT-3.5) generates textual explanations for SMILES strings</li>
<li>RoBERTa is fine-tuned on generated captions using HuggingFace Transformers with default parameters</li>
<li>10-fold cross-validation for random split; 5 random seeds for scaffold split</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>ChatGPT (GPT-3.5) for caption generation</li>
<li>RoBERTa-base for downstream fine-tuning (default HuggingFace parameters)</li>
<li>DeBERTa and adaptive-lm-molecules tested as alternatives</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: accuracy (ACC) and ROC-AUC</li>
<li>Regression: RMSE</li>
<li>Mean and standard deviation reported across folds/seeds</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ChnQ/LLM4Mol">LLM4Mol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, C., Tang, H., Yang, Z., Liang, H., &amp; Liu, Y. (2023). Can Large Language Models Empower Molecular Property Prediction? <em>arXiv preprint arXiv:2307.07443</em>. <a href="https://arxiv.org/abs/2307.07443">https://arxiv.org/abs/2307.07443</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2023can,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Can Large Language Models Empower Molecular Property Prediction?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Chen and Tang, Huayi and Yang, Zhirui and Liang, Hong and Liu, Yong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2307.07443}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2307.07443}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM-Prop: Predicting Crystal Properties from Text</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</guid><description>LLM-Prop fine-tunes the T5 encoder on crystal text descriptions to predict band gap, volume, and other properties, outperforming GNN baselines.</description><content:encoded><![CDATA[<h2 id="text-based-crystal-property-prediction-with-llms">Text-Based Crystal Property Prediction with LLMs</h2>
<p>LLM-Prop is a <strong>Method</strong> paper that proposes using the encoder portion of <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5</a> (a general-purpose language model) fine-tuned on crystal text descriptions to predict physical and electronic properties of crystalline materials. The primary contribution is demonstrating that text-based representations of crystals, generated by Robocrystallographer, can serve as effective inputs for <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, outperforming graph neural network (GNN) baselines on several tasks despite using a non-domain-specific pre-trained model with fewer parameters.</p>
<h2 id="why-text-instead-of-crystal-graphs">Why Text Instead of Crystal Graphs?</h2>
<p>Graph neural networks have been the dominant approach for crystal property prediction. Models like CGCNN, MEGNet, and ALIGNN represent crystals as graphs where atoms are nodes and bonds are edges. However, GNNs face several fundamental challenges for crystals:</p>
<ol>
<li><strong>Periodicity encoding</strong>: Crystals have repetitive unit cell arrangements that are distinct from standard molecular graphs, and GNNs struggle to encode this periodicity efficiently.</li>
<li><strong>Information incorporation</strong>: Critical structural information like bond angles, <a href="https://en.wikipedia.org/wiki/Space_group">space group</a> symmetry, and <a href="https://en.wikipedia.org/wiki/Wyckoff_positions">Wyckoff sites</a> is difficult to incorporate into graph representations.</li>
<li><strong>Expressiveness</strong>: Graphs may lack the expressiveness needed to convey complex crystal information relevant to property prediction.</li>
</ol>
<p>Meanwhile, textual descriptions of crystals (generated by tools like Robocrystallographer) naturally encode space group information, bond geometries, coordination environments, and symmetry details in human-readable form. Despite this richness, text-based approaches for crystal property prediction had been largely unexplored.</p>
<h2 id="core-innovation-t5-encoder-with-careful-fine-tuning">Core Innovation: T5 Encoder with Careful Fine-Tuning</h2>
<p>The key insight of LLM-Prop is to take a pre-trained encoder-decoder model (<a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-small) and discard the decoder entirely, using only the encoder with a linear prediction head. This design has several advantages:</p>
<ul>
<li>Discarding the decoder roughly halves the parameter count (from ~60M to ~37M), freeing memory to process longer input sequences</li>
<li>Longer sequences mean more crystal information can be included</li>
<li>The encoder-only approach avoids T5&rsquo;s known weakness at regression in text-to-text format</li>
</ul>
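<p>A toy sketch of this design (the class, layer sizes, and readout below are illustrative stand-ins, not the authors' code; in practice the encoder is HuggingFace's pre-trained <code>T5EncoderModel</code> for <code>t5-small</code>):</p>

```python
import torch
import torch.nn as nn

# Illustrative stand-in for LLM-Prop's architecture: an encoder whose output at
# the prepended [CLS] position feeds a single linear regression head.
class EncoderRegressor(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # scalar property, e.g. band gap in eV

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        return self.head(hidden[:, 0, :]).squeeze(-1)  # read out [CLS] at position 0

model = EncoderRegressor()
pred = model(torch.randint(0, 32_000, (4, 888)))  # batch of 4 crystal descriptions
print(pred.shape)  # torch.Size([4])
```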
<p>The framework applies several preprocessing strategies to the crystal text descriptions:</p>
<ol>
<li><strong>Stopword removal</strong>: Standard English stopwords are removed, except digits and symbols carrying chemical information</li>
<li><strong>Numerical token replacement</strong>: Bond distances are replaced with a <code>[NUM]</code> token and bond angles with <code>[ANG]</code>, reducing sequence length while preserving structural cues</li>
<li><strong>[CLS] token prepending</strong>: A classification token is added at the start, and its learned embedding is used as input to the prediction layer</li>
<li><strong>Label scaling</strong>: For regression tasks, targets are normalized using z-score, min-max, or log normalization</li>
</ol>
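<p>A hedged sketch of the numerical-token replacement and [CLS] prepending (the regular expressions are assumptions about Robocrystallographer's phrasing, not the paper's exact pipeline):</p>

```python
import re

def preprocess(description: str) -> str:
    # Replace bond angles with [ANG] and bond lengths with [NUM], then
    # prepend the [CLS] token whose embedding feeds the prediction head.
    text = re.sub(r"\d+(\.\d+)?\s*(degrees|°)", "[ANG]", description)
    text = re.sub(r"\d+(\.\d+)?\s*(Å|A)\b", "[NUM]", text)
    return "[CLS] " + text

print(preprocess("All Si-O bond lengths are 1.61 Å. The O-Si-O angles are 109 degrees."))
# [CLS] All Si-O bond lengths are [NUM]. The O-Si-O angles are [ANG].
```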
<p>The normalization schemes are defined as:</p>
<p>$$
\hat{Y}_{i}(\text{z-score}) = \frac{Y_{i} - \mu}{\sigma}
$$</p>
<p>$$
\hat{Y}_{i}(\text{min-max}) = \frac{Y_{i} - Y_{\min}}{Y_{\max} - Y_{\min}}
$$</p>
<p>$$
\hat{Y}_{i}(\text{log-norm}) = \log(Y_{i} + 1)
$$</p>
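<p>These three label-scaling schemes translate directly into NumPy (the example targets are made up):</p>

```python
import numpy as np

def z_score(y):
    return (y - y.mean()) / y.std()

def min_max(y):
    return (y - y.min()) / (y.max() - y.min())

def log_norm(y):
    return np.log(y + 1.0)

y = np.array([0.0, 1.0, 3.0])  # e.g. band gaps in eV
print(min_max(y))  # scales the targets into [0, 1]
```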
<p>The tokenizer is also retrained on the crystal text corpus with a vocabulary size of 32k, and the special tokens <code>[NUM]</code>, <code>[ANG]</code>, and <code>[CLS]</code> are added to the vocabulary.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="dataset-textedge">Dataset: TextEdge</h3>
<p>The authors collected data from the <a href="https://en.wikipedia.org/wiki/Materials_Project">Materials Project</a> database (as of November 2022), yielding 144,931 crystal structure-description pairs split into 125,098 training, 9,945 validation, and 9,888 test samples. Crystal text descriptions were generated using Robocrystallographer. The dataset covers six prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Type</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Band gap (eV)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Unit cell volume (Å³/cell)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Formation energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy above hull (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Is-gap-direct</td>
          <td>Classification</td>
          <td>AUC (higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<p>Seven baselines were compared:</p>
<ul>
<li><strong>GNN-based</strong>: CGCNN, MEGNet, ALIGNN, DeeperGATGNN</li>
<li><strong>Classic ML</strong>: XGBoost, Random Forest (on Robocrystallographer features)</li>
<li><strong>Text-based</strong>: MatBERT (domain-specific pre-trained BERT, ~110M parameters)</li>
</ul>
<p>All models were trained and evaluated on the same dataset splits for fair comparison. GNN models were retrained on the new data rather than using results from older, smaller Materials Project versions.</p>
<h3 id="main-results-llm-prop-vs-gnn-baselines">Main Results: LLM-Prop vs. GNN Baselines</h3>
<p>When using crystal text descriptions as input, LLM-Prop achieved:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>0.293</td>
          <td>188.834</td>
          <td>0.046</td>
          <td>0.082</td>
          <td>0.040</td>
          <td>0.830</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>0.304</td>
          <td>297.948</td>
          <td>0.077</td>
          <td>0.056</td>
          <td>0.051</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>0.250</td>
          <td>129.580</td>
          <td>0.027</td>
          <td>0.059</td>
          <td>0.028</td>
          <td>0.678</td>
      </tr>
      <tr>
          <td>DeeperGATGNN</td>
          <td>0.291</td>
          <td>111.857</td>
          <td>0.081</td>
          <td>0.116</td>
          <td>0.045</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>LLM-Prop (Descr.)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.252</strong></td>
          <td>0.056</td>
          <td>0.067</td>
          <td>0.047</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>LLM-Prop outperformed the best GNN baseline on each task by approximately 8% on <a href="https://en.wikipedia.org/wiki/Band_gap">band gap</a> prediction (vs. ALIGNN), 65% on volume prediction (vs. DeeperGATGNN), and 3% on band gap classification, Is-gap-direct (vs. CGCNN). For formation energy per atom, energy per atom, and energy above hull, ALIGNN retained the advantage.</p>
<h3 id="llm-prop-vs-matbert">LLM-Prop vs. MatBERT</h3>
<p>LLM-Prop also outperformed MatBERT (a domain-specific pre-trained BERT) across all tasks despite having roughly 3x fewer parameters. The table below shows the best result for each model across the three input preprocessing strategies (w/ Numbers, w/o Numbers, w/ [NUM]&amp;[ANG]):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MatBERT (best)</td>
          <td>0.258</td>
          <td>54.969</td>
          <td>0.071</td>
          <td>0.098</td>
          <td>0.050</td>
          <td>0.722</td>
      </tr>
      <tr>
          <td>LLM-Prop (best)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.138</strong></td>
          <td><strong>0.056</strong></td>
          <td><strong>0.067</strong></td>
          <td><strong>0.047</strong></td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: LLM-Prop&rsquo;s best band gap (0.231) comes from the &ldquo;w/o Numbers&rdquo; configuration, while the best volume (39.138) comes from &ldquo;w/ Numbers&rdquo;. The best Is-gap-direct AUC (0.857) uses the &ldquo;[NUM]&amp;[ANG]&rdquo; configuration.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The contribution of each preprocessing strategy was evaluated:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Band gap</th>
          <th>Volume</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLM-Prop (baseline)</td>
          <td>0.256</td>
          <td>69.352</td>
          <td>0.796</td>
      </tr>
      <tr>
          <td>+ modified tokenizer</td>
          <td>0.247</td>
          <td>78.632</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>+ label scaling</td>
          <td>0.242</td>
          <td>44.515</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>+ [CLS] token</td>
          <td>0.231</td>
          <td>39.520</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>+ [NUM] token</td>
          <td>0.251</td>
          <td>86.090</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>+ [ANG] token</td>
          <td>0.242</td>
          <td>64.965</td>
          <td>0.810</td>
      </tr>
      <tr>
          <td>- stopwords</td>
          <td>0.252</td>
          <td>56.593</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>LLM-Prop+all (no space group)</td>
          <td>0.235</td>
          <td>97.457</td>
          <td>0.705</td>
      </tr>
      <tr>
          <td>LLM-Prop+all</td>
          <td><strong>0.229</strong></td>
          <td>42.259</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>The [CLS] token provided the single largest improvement across all tasks. Label scaling was critical for volume prediction (reducing MAE from 69.352 to 44.515). Removing space group information from descriptions degraded volume prediction dramatically (from 42.259 to 97.457), confirming that space group symmetry is a key factor.</p>
<h3 id="data-efficiency-and-transfer-learning">Data Efficiency and Transfer Learning</h3>
<p>LLM-Prop achieved SOTA results on band gap and volume prediction with only about 90k training samples (35k fewer than baselines). For volume prediction specifically, LLM-Prop outperformed all GNN baselines with just 30k training samples.</p>
<p>Transfer learning experiments showed that LLM-Prop transferred well between band gap and volume prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Volume-to-Band gap (Test)</th>
          <th>Band gap-to-Volume (Test)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN-transfer</td>
          <td>0.295</td>
          <td>182.997</td>
      </tr>
      <tr>
          <td>ALIGNN-transfer</td>
          <td>0.322</td>
          <td>136.164</td>
      </tr>
      <tr>
          <td>MatBERT-transfer</td>
          <td>0.266</td>
          <td>54.289</td>
      </tr>
      <tr>
          <td>LLM-Prop-transfer</td>
          <td><strong>0.244</strong></td>
          <td><strong>50.753</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Text descriptions of crystals carry rich structural information (space groups, Wyckoff sites, coordination geometries) that is difficult to encode in graphs but naturally expressed in text</li>
<li>A carefully fine-tuned general-purpose LLM encoder can outperform domain-specific pre-trained models, challenging the assumption that in-domain pre-training is always necessary</li>
<li>Removing numerical information (bond distances and angles) from descriptions often improves performance, because current LLMs treat numbers as regular tokens without understanding their quantitative meaning</li>
<li>Longer input sequences correlate with better performance, with 888 tokens as the default maximum on the hardware used</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The origin of LLM-Prop&rsquo;s performance advantage over GNNs is not fully understood. It remains unclear whether the boost comes from additional structured information in text or from the different data modality itself</li>
<li>LLM-Prop cannot perform zero-shot predictions since T5 was not pre-trained on materials science data</li>
<li>The approach depends on Robocrystallographer to generate text descriptions, adding a preprocessing dependency</li>
<li>Current LLMs&rsquo; inability to reason about numerical values limits the use of quantitative information in descriptions</li>
</ul>
<p><strong>Future directions</strong> suggested by the authors include investigating techniques to use <a href="/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">CIF files</a> directly as LLM inputs, developing new GNN architectures that incorporate space group and Wyckoff site information, and further exploring which information in crystal descriptions contributes most to each property prediction task.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>TextEdge</td>
          <td>144,931 crystals</td>
          <td>From Materials Project (Nov 2022), text generated by Robocrystallographer</td>
      </tr>
      <tr>
          <td>Training split</td>
          <td>TextEdge</td>
          <td>125,098</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Validation split</td>
          <td>TextEdge</td>
          <td>9,945</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Test split</td>
          <td>TextEdge</td>
          <td>9,888</td>
          <td>Random split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam with one-cycle learning rate scheduler</li>
<li><strong>Learning rate</strong>: 1e-3 for LLM-Prop, 5e-5 for MatBERT</li>
<li><strong>Dropout</strong>: 0.2 for LLM-Prop, 0.5 for MatBERT</li>
<li><strong>Batch size</strong>: 64 (888 tokens) or 16 (2000 tokens) for LLM-Prop</li>
<li><strong>Epochs</strong>: 200-300 depending on task</li>
<li><strong>Loss</strong>: MAE for regression, BCE for classification</li>
<li><strong>Evaluation</strong>: MAE for regression, AUC for classification</li>
<li><strong>Runs</strong>: each model evaluated 5 times on the test set, with averaged MAE reported</li>
</ul>
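<p>A minimal sketch of the Adam + one-cycle + MAE combination described above (the toy model, data, and step count are placeholders):</p>

```python
import torch

# Toy stand-in for the encoder + head; only the optimizer/scheduler/loss
# wiring mirrors the reported setup.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=100)

for _ in range(100):
    pred = model(torch.randn(64, 10))
    loss = torch.nn.functional.l1_loss(pred, torch.zeros(64, 1))  # MAE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # lr warms up to max_lr, then anneals toward ~0
```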
<h3 id="models">Models</h3>
<ul>
<li><strong>Base model</strong>: T5-small encoder (~60M parameters total, ~37M after discarding decoder and adding prediction head)</li>
<li><strong>Vocabulary size</strong>: 32k (retrained tokenizer)</li>
<li><strong>Max input tokens</strong>: 888 (default) or 2000</li>
<li><strong>Special tokens</strong>: [CLS], [NUM], [ANG]</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/vertaix/LLM-Prop">LLM-Prop</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://drive.google.com/drive/folders/1YCDBzwjwNRIc1FRkB662G3Y5AOWaokUG">TextEdge + Checkpoints</a></td>
          <td>Dataset + Model</td>
          <td>Not specified</td>
          <td>Benchmark dataset and trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: NVIDIA RTX A6000</li>
<li><strong>Training time</strong>: ~40 minutes per epoch for LLM-Prop</li>
<li><strong>Inference</strong>: ~1 minute for 10,000 materials on one GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rubungo, A. N., Arnold, C. B., Rand, B. P., &amp; Dieng, A. B. (2025). LLM-Prop: predicting the properties of crystalline materials using large language models. <em>npj Computational Materials</em>, 11, 186. <a href="https://doi.org/10.1038/s41524-025-01536-2">https://doi.org/10.1038/s41524-025-01536-2</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rubungo2025llmprop,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LLM-Prop: predicting the properties of crystalline materials using large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rubungo, Andre Niyongabo and Arnold, Craig B. and Rand, Barry P. and Dieng, Adji Bousso}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Computational Materials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{186}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41524-025-01536-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Link-INVENT: RL-Driven Molecular Linker Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/link-invent-generative-linker-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/link-invent-generative-linker-design/</guid><description>Link-INVENT extends REINVENT for molecular linker design using RNN-based generation and reinforcement learning with flexible multi-parameter scoring.</description><content:encoded><![CDATA[<h2 id="a-method-for-generative-linker-design-with-reinforcement-learning">A Method for Generative Linker Design with Reinforcement Learning</h2>
<p>Link-INVENT is a <strong>Method</strong> paper that introduces a generative model for molecular linker design built on the <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo design platform. The primary contribution is an encoder-decoder recurrent neural network (RNN) architecture that generates SMILES-based linkers connecting two molecular subunits, combined with a flexible multi-parameter optimization (MPO) scoring function and reinforcement learning (RL) to steer generation toward desired properties. Link-INVENT targets three practical drug discovery tasks: fragment linking, scaffold hopping, and <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">proteolysis targeting chimera</a> (PROTAC) design.</p>
<h2 id="why-linker-design-needs-flexible-multi-parameter-optimization">Why Linker Design Needs Flexible Multi-Parameter Optimization</h2>
<p>Generating suitable chemical linkers between molecular subunits is a central challenge in <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-based drug discovery</a> (FBDD), scaffold hopping, and PROTAC design. Traditional computational approaches rely on database searches, which inherently restricts proposed linkers to a pre-defined collection. Recent deep learning methods (DeLinker, SyntaLinker, 3DLinker, DiffLinker) can generate novel linkers but offer limited support for optimizing specific physicochemical properties. Users can typically control only linker length and a few properties like hydrogen-bond donor count.</p>
<p>The key gaps that Link-INVENT addresses are:</p>
<ol>
<li><strong>Conditioning on both subunits</strong>: Prior RNN-based approaches (SAMOA) generate linkers conditioned only on the SMILES sequence seen so far, which may not account for the second molecular subunit. Link-INVENT conditions on both warheads simultaneously.</li>
<li><strong>Flexible scoring</strong>: Existing DL-based linker design tools lack the ability to define tailored MPO objectives. Link-INVENT inherits <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 4&rsquo;s</a> full scoring infrastructure and adds linker-specific properties.</li>
<li><strong>Generalizability</strong>: A single trained prior handles fragment linking, scaffold hopping, and PROTAC tasks without retraining.</li>
</ol>
<h2 id="core-innovation-conditional-linker-generation-with-augmented-likelihood-rl">Core Innovation: Conditional Linker Generation with Augmented Likelihood RL</h2>
<p>Link-INVENT&rsquo;s architecture is an encoder-decoder RNN adapted from the Lib-INVENT library design model. The encoder processes a pair of warheads (molecular subunits with defined exit vectors), and the decoder generates a linker token by token, yielding a connected molecule in SMILES format. The model uses three hidden layers of 512 LSTM cells with an embedding size of 256.</p>
<h3 id="training">Training</h3>
<p>The prior is trained on ChEMBL v27 data processed through reaction-based slicing to generate (linker, warheads pair, full molecule) tuples. <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES randomization</a> augments the training data at each epoch, improving chemical space generalizability. The prior is trained by maximizing the likelihood of generating a linker conditioned on the input warhead pair, with teacher forcing for stability.</p>
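<p>Under teacher forcing, this objective reduces to a negative log-likelihood over linker tokens; a toy sketch with made-up per-step probabilities:</p>

```python
import math

def teacher_forced_nll(step_probs):
    # NLL of one linker under teacher forcing: at each step the ground-truth
    # prefix is fed back in, and we score the model's probability of the
    # true next token.
    return -sum(math.log(p) for p in step_probs)

probs = [0.9, 0.7, 0.95, 0.8]  # made-up P(true token | warheads, true prefix)
print(round(teacher_forced_nll(probs), 3))  # 0.736
```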
<h3 id="multi-parameter-optimization-via-rl">Multi-Parameter Optimization via RL</h3>
<p>The scoring function $S(x)$ is a weighted geometric mean of individual component scores:</p>
<p>$$
S(x) = \left(\prod_{i=1}^{n} C_{i}(x)^{w_{i}}\right)^{\frac{1}{\sum_{i=1}^{n} w_{i}}}
$$</p>
<p>where $x$ is a sampled linked molecule, $C_{i}(x)$ is the score for the $i$-th component, and $w_{i}$ is its weight.</p>
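<p>A direct implementation of this weighted geometric mean (the component scores and weights below are illustrative):</p>

```python
import math

def mpo_score(component_scores, weights):
    # Weighted geometric mean of per-component scores C_i(x) in [0, 1],
    # computed in log space for numerical stability.
    log_sum = sum(w * math.log(max(c, 1e-12)) for c, w in zip(component_scores, weights))
    return math.exp(log_sum / sum(weights))

# Illustrative components: a docking-score transform at 0.8 (weight 2), QED at 0.5 (weight 1).
print(round(mpo_score([0.8, 0.5], [2, 1]), 3))  # 0.684
```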
<p>The agent (initialized as a copy of the prior) is updated via the Difference of Augmented and Posterior likelihoods (DAP) loss. The <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented log likelihood</a> is:</p>
<p>$$
\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma \cdot S(x)
$$</p>
<p>where $\pi$ denotes a policy (token sampling probabilities conditioned on the sequence so far) and $\sigma$ is a scalar factor. The loss function is:</p>
<p>$$
J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2}
$$</p>
<p>Minimizing $J(\theta)$ steers the agent to generate molecules that satisfy the scoring function while remaining anchored to the prior&rsquo;s chemical space.</p>
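<p>Per sampled molecule, the DAP update can be sketched as follows (the default $\sigma$ below is an illustrative placeholder, not a value from the paper):</p>

```python
def dap_loss(log_p_prior, log_p_agent, score, sigma=120.0):
    # DAP loss for one sampled linked molecule; sigma sets how strongly
    # the reward reshapes the prior likelihood.
    log_p_augmented = log_p_prior + sigma * score
    return (log_p_augmented - log_p_agent) ** 2

# An agent that already matches the reward-reshaped prior incurs zero loss:
print(dap_loss(log_p_prior=-40.0, log_p_agent=-40.0, score=0.0))  # 0.0
```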
<h3 id="diversity-filters">Diversity Filters</h3>
<p>Link-INVENT uses Diversity Filters (DFs) to balance exploration and exploitation. Buckets of limited size track unique <a href="/notes/chemistry/molecular-design/generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/">Bemis-Murcko scaffolds</a>. When a bucket is full, further sampling of that scaffold receives a score of zero, encouraging the agent to explore diverse chemical space regions.</p>
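<p>A minimal sketch of such a bucket-based filter (the bucket size and string scaffold key are placeholders; Link-INVENT keys buckets on Bemis-Murcko scaffolds, e.g. via RDKit):</p>

```python
from collections import defaultdict

class DiversityFilter:
    # Once a scaffold's bucket is full, further molecules with that
    # scaffold receive a score of zero, pushing the agent elsewhere.
    def __init__(self, bucket_size=25):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(int)

    def filter_score(self, scaffold, score):
        self.buckets[scaffold] += 1
        if self.buckets[scaffold] > self.bucket_size:
            return 0.0
        return score

df = DiversityFilter(bucket_size=2)
print([df.filter_score("c1ccccc1", 0.9) for _ in range(3)])  # [0.9, 0.9, 0.0]
```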
<h3 id="linker-specific-scoring-components">Linker-Specific Scoring Components</h3>
<p>New scoring components provide direct control over linker properties:</p>
<ul>
<li><strong>Linker effective length</strong>: number of bonds between attachment atoms</li>
<li><strong>Linker maximum graph length</strong>: bonds in the longest graph traversal path</li>
<li><strong>Linker length ratio</strong>: effective length divided by maximum graph length (controls branching)</li>
<li><strong>Linker ratio of rotatable bonds</strong>: rotatable bonds over total bonds (controls flexibility)</li>
<li><strong>Linker number of rings</strong>: controls linearity vs. cyclicity</li>
<li><strong>Linker number of HBDs</strong>: hydrogen-bond donors in the linker itself</li>
</ul>
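<p>The first three components can be sketched on a plain adjacency-list graph standing in for an RDKit molecule (the atom indices and toy linker below are hypothetical):</p>

```python
from collections import deque

def bfs_dist(adj, start):
    # Breadth-first shortest path lengths (in bonds) from one atom.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def linker_length_scores(adj, attach_a, attach_b):
    # Effective length: bonds between the two attachment atoms.
    # Maximum graph length: longest shortest path (graph diameter).
    # Length ratio: effective / maximum, so branching lowers the ratio.
    effective = bfs_dist(adj, attach_a)[attach_b]
    maximum = max(d for node in adj for d in bfs_dist(adj, node).values())
    return effective, effective / maximum

# Hypothetical 6-atom linker with a two-atom branch hanging off atom 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
print(linker_length_scores(adj, attach_a=0, attach_b=3))  # (3, 0.75)
```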
<h2 id="experimental-evaluation-across-three-drug-discovery-tasks">Experimental Evaluation Across Three Drug Discovery Tasks</h2>
<p>Link-INVENT was evaluated through four experiments across three drug discovery applications, all using the same pre-trained prior.</p>
<h3 id="illustrative-example-two-benzene-rings">Illustrative Example: Two Benzene Rings</h3>
<p>A simple experiment linked two benzene rings with the objectives of limiting HBDs and requiring exactly one ring in the linker. Over 20 epochs, the agent learned to satisfy both objectives, demonstrating the basic RL-guided generation process.</p>
<h3 id="experiment-1a-fragment-linking-ck2-alpha-inhibitors">Experiment 1a: Fragment Linking (CK2 alpha Inhibitors)</h3>
<p>Based on the <a href="https://en.wikipedia.org/wiki/Casein_kinase_2">casein kinase 2</a> (CK2 alpha) fragment linking campaign by Fusco and Brear et al., Link-INVENT was tasked with linking two fragment hits while retaining the Lys68 hydrogen-bond interaction via a DockStream docking constraint (Glide/LigPrep backend). The scoring function also enforced linker length ratio &gt;= 70 and linker MW &lt;= 200 Da.</p>
<p>Over 100 epochs in triplicate, the agent generated molecules with gradually improving docking scores. Key results:</p>
<ul>
<li>Docking score distributions across triplicates were nearly identical, demonstrating reproducibility</li>
<li>Some generated molecules achieved more favorable docking scores than the reference ligand CAM4066 (-15.20 kcal/mol)</li>
<li>More than 5000 unique Bemis-Murcko scaffolds were generated, with minimal overlap across replicates</li>
<li>Binding pose analysis showed the generated linker closely resembled the ground-truth linker, retaining the Lys68 interaction</li>
</ul>
<h3 id="experiment-1b-comparison-fragment-linking-impdh-inhibitors">Experiment 1b: Comparison Fragment Linking (IMPDH Inhibitors)</h3>
<p>Using the IMPDH inhibitor fragment linking case study from Trapero et al., this experiment applied core-constrained docking (fragment pose within 0.3 Å of reference) and compared results to DeLinker and SyntaLinker. The scoring function enforced linker effective length in [3, 5], length ratio &gt;= 70, and linker MW &lt;= 150 Da.</p>
<p>Link-INVENT generated 8960 SMILES across 70 epochs (comparable to DeLinker&rsquo;s 9000 molecular graphs). Results:</p>
<ul>
<li>Link-INVENT generated molecules with more favorable docking scores than the reference ligand across triplicate runs</li>
<li>Of the 20 DeLinker and 3 SyntaLinker example molecules, none of the DeLinker molecules and only one SyntaLinker molecule (the recovered reference) docked as well as or better than the reference ligand</li>
<li>Approximately 3000 unique Bemis-Murcko scaffolds were generated from 5000 total molecules</li>
<li>Link-INVENT&rsquo;s advantage comes from including docking explicitly as a learning objective rather than applying it post hoc</li>
</ul>
<h3 id="experiment-2-scaffold-hopping-dlk-inhibitor-cns-optimization">Experiment 2: Scaffold Hopping (DLK Inhibitor CNS Optimization)</h3>
<p>Based on Patel et al.&rsquo;s <a href="https://en.wikipedia.org/wiki/MAP3K12">dual leucine zipper kinase</a> (DLK) inhibitor campaign, Link-INVENT generated new scaffold ideas to improve CNS penetration while retaining potency. The scoring function included a Cys193 docking constraint plus CNS-compatible properties (HBDs &lt; 2, tPSA &lt;= 90 Å², 3 &lt;= SlogP &lt;= 4, MW &lt;= 450 Da, 1-2 aromatic rings in linker).</p>
<p>The solution space was significantly narrower than fragment linking. The agent still generated diverse scaffolds with favorable docking scores, though fewer exceeded the reference ligand&rsquo;s score. Binding pose analysis confirmed retained Cys193 interactions and predicted additional Gln195 hydrogen bonds.</p>
<h3 id="experiment-3-protac-design-bcl-2mcl-1-dual-degradation">Experiment 3: PROTAC Design (Bcl-2/Mcl-1 Dual Degradation)</h3>
<p>Three sub-experiments demonstrated linker-specific scoring components for PROTAC design based on Wang et al.&rsquo;s Bcl-2/Mcl-1 dual degradation strategy:</p>
<table>
  <thead>
      <tr>
          <th>Sub-Experiment</th>
          <th>Objective</th>
          <th>Key Finding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sub-Exp 1: Linker length</td>
          <td>Generate linkers within specified length intervals [4,6], [7,9], [10,12], [13,15]</td>
          <td>Clear enrichment within target intervals vs. baseline broad distribution</td>
      </tr>
      <tr>
          <td>Sub-Exp 2: Linearity</td>
          <td>Control linear vs. cyclic linkers at fixed length [7,9]</td>
          <td>Baseline ratio ~1:2 linear:cyclic; enforcing linearity or cyclicity achieved strong enrichment</td>
      </tr>
      <tr>
          <td>Sub-Exp 3: Flexibility</td>
          <td>Generate linkers with Low [0,30], Moderate [40,60], or High [70,100] rotatable bond ratios</td>
          <td>Agent learned that rings and sp2 atoms yield rigidity; linear sp3 chains yield flexibility</td>
      </tr>
  </tbody>
</table>
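<p>As a concrete illustration of Sub-Experiment 3, the flexibility objective can be sketched as a desirability function over the rotatable-bond percentage of the linker. The linear 10-point fall-off outside the target interval is an assumption for illustration; Link-INVENT&rsquo;s actual score transformations may be shaped differently:</p>

```python
def flexibility_score(n_rotatable: int, n_bonds: int,
                      low: float, high: float) -> float:
    """Score 1.0 when the rotatable-bond percentage of the linker falls
    inside [low, high]; decay linearly to 0 over a 10-point margin
    outside the interval (the margin width is an illustrative choice)."""
    if n_bonds == 0:
        return 0.0
    pct = 100.0 * n_rotatable / n_bonds
    if low <= pct <= high:
        return 1.0
    dist = (low - pct) if pct < low else (pct - high)
    return max(0.0, 1.0 - dist / 10.0)

# A linker with 2 of 10 rotatable bonds (20%) satisfies the Low [0, 30] target.
print(flexibility_score(2, 10, 0, 30))
```

Under this scheme, rigid ring-rich linkers score high against the Low interval and flexible sp3 chains score high against the High interval, matching the enrichment behavior reported in the table.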
<h2 id="key-findings-and-practical-implications-for-drug-discovery">Key Findings and Practical Implications for Drug Discovery</h2>
<p>Link-INVENT demonstrates several practical advantages for molecular linker design:</p>
<ol>
<li><strong>Single prior, multiple tasks</strong>: The same pre-trained model handles fragment linking, scaffold hopping, and PROTAC design without retraining.</li>
<li><strong>Docking as a learning signal</strong>: Including molecular docking explicitly in the scoring function (via DockStream) during RL yields molecules with more favorable docking scores than approaches that apply docking post hoc.</li>
<li><strong>Implicit 3D awareness</strong>: The docking constraint guides the agent toward 3D structural awareness without explicit 3D coordinate inputs, as demonstrated by the overlap between generated and reference binding poses.</li>
<li><strong>Diverse and reproducible output</strong>: Diversity filters ensure exploration of multiple chemical space regions, and triplicate experiments show consistent docking score distributions with minimal scaffold overlap.</li>
</ol>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The linker flexibility metric (ratio of rotatable bonds) is agnostic to intra-molecular hydrogen bonds and does not account for all rigidity factors</li>
<li>Molecular docking is an approximation that can be exploited (e.g., excessive HBDs achieving favorable scores at the expense of permeability)</li>
<li>Experiments 1a and 1b require a proprietary Schrodinger license for Glide/LigPrep docking</li>
<li>No direct experimental (wet-lab) validation was performed in this study</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL v27 (reaction-sliced)</td>
          <td>Not specified</td>
          <td>Filtered for drug-like compounds, then reaction-based slicing with SMIRKS</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Held-out Bemis-Murcko scaffolds</td>
          <td>287 scaffolds</td>
          <td>Held out from training set</td>
      </tr>
      <tr>
          <td>SMILES augmentation</td>
          <td>Randomized SMILES per epoch</td>
          <td>Same tuples, different representations</td>
          <td>Improves generalizability</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder RNN with 3 hidden layers of 512 LSTM cells, embedding size 256</li>
<li><strong>RL loss</strong>: DAP (Difference of Augmented and Posterior likelihoods)</li>
<li><strong>Batch size</strong>: 128 molecules per epoch</li>
<li><strong>Diversity filter</strong>: Bemis-Murcko scaffold buckets of size 25</li>
<li><strong>Score threshold</strong>: 0 (to store all molecules for analysis)</li>
<li><strong>Scoring function</strong>: Weighted geometric mean of component scores</li>
</ul>
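<p>The weighted geometric mean aggregation in the scoring function can be sketched in a few lines. This is an illustrative implementation, not the REINVENT source; one notable property is that a zero in any component zeroes the aggregate, giving hard-filter behavior:</p>

```python
import math

def weighted_geometric_mean(scores, weights):
    """Aggregate per-component scores in [0, 1] into a single reward.
    Any component at 0 drives the aggregate to 0 (hard-filter behavior)."""
    assert len(scores) == len(weights) and weights
    if any(s <= 0.0 for s in scores):
        return 0.0
    total_w = sum(weights)
    # Compute in log space for numerical stability with many components.
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights))
    return math.exp(log_sum / total_w)

# Equal weights on a perfect docking score and a mediocre property score.
print(weighted_geometric_mean([1.0, 0.25], [1.0, 1.0]))
```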
<h3 id="models">Models</h3>
<ul>
<li>Single pre-trained prior used across all experiments</li>
<li>Agent initialized as copy of prior, updated via RL</li>
<li>Pre-trained prior available at GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecular docking via DockStream with Glide/LigPrep backend</li>
<li>Triplicate runs for all experiments</li>
<li>Metrics: docking scores, unique Bemis-Murcko scaffold counts, binding pose overlap</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT (Link-INVENT code)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Main codebase for Link-INVENT</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity">ReinventCommunity (data + tutorial)</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training/validation data, reaction SMIRKS, pre-trained prior, Jupyter tutorial</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code, training data, and pre-trained prior are publicly available. However, reproducing the docking-based experiments (1a, 1b, and 2) requires a proprietary Schrodinger license for Glide and LigPrep. The PROTAC experiments (Experiment 3) that use only physicochemical scoring are fully reproducible with the open-source code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Knuth, F., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2023). Link-INVENT: generative linker design with reinforcement learning. <em>Digital Discovery</em>, 2, 392-408. <a href="https://doi.org/10.1039/D2DD00115B">https://doi.org/10.1039/D2DD00115B</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2023link,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Link-INVENT: generative linker design with reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Knuth, Franziska and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D2DD00115B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
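<p>The predicted local spherical coordinates can be turned into a candidate global position with a standard internal-to-Cartesian (NeRF-style) placement against the three reference atoms. The frame and sign conventions below are assumptions for illustration and may differ from the paper&rsquo;s exact construction:</p>

```python
import math

def _sub(u, v):  return [u[i] - v[i] for i in range(3)]
def _norm(u):
    n = math.sqrt(sum(x * x for x in u))
    return [x / n for x in u]
def _cross(u, v):
    return [u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0]]

def place_atom(root1, root2, root3, r, theta, phi):
    """Place the new atom at bond length r from root1, bond angle theta
    at root1 (w.r.t. root2), and dihedral phi w.r.t. the root3-root2-root1
    plane. Angles in radians; a NeRF-style construction."""
    bc = _norm(_sub(root1, root2))          # direction the chain extends
    ab = _sub(root2, root3)
    n = _norm(_cross(ab, bc))               # normal of the reference plane
    m = _cross(n, bc)                       # completes a right-handed frame
    d = [-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi)]
    return [root1[i] + d[0]*bc[i] + d[1]*m[i] + d[2]*n[i] for i in range(3)]
```

In this setup, the inference-time search over $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$ amounts to calling <code>place_atom</code> on a small grid of perturbed internal coordinates and scoring each resulting global position.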
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 A of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 A of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ A, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
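<p>The biased attention update can be read off directly from the equation above. <code>biased_attention</code> below is an illustrative single-head implementation with plain Python lists; in the model, $B_D$ and $B_J$ would be learned projections of pairwise distances and edge vectors:</p>

```python
import math

def biased_attention(Q, K, V, B_D, B_J):
    """Scaled dot-product attention with additive structural bias terms
    (distance bias B_D and edge-vector bias B_J added to the logits)."""
    d_k = len(K[0])
    out = []
    for i in range(len(Q)):
        logits = [
            sum(Q[i][t] * K[j][t] for t in range(d_k)) / math.sqrt(d_k)
            + B_D[i][j] + B_J[i][j]
            for j in range(len(K))
        ]
        mx = max(logits)                      # stabilized softmax
        w = [math.exp(l - mx) for l in logits]
        z = sum(w)
        probs = [x / z for x in w]
        out.append([sum(probs[j] * V[j][t] for j in range(len(V)))
                    for t in range(len(V[0]))])
    return out
```

A strongly negative distance bias between two tokens effectively masks their interaction, which is how spatial structure reshapes the attention pattern.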
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
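<p>The drug-likeness gate is a simple filter applied before any binding metric is computed. Assuming QED and SAS values have already been computed per molecule (e.g. with RDKit), it can be sketched as:</p>

```python
def druglike_fraction(mols, qed_min=0.3, sas_max=5.0):
    """Split precomputed per-molecule scores into the drug-like subset
    (QED >= 0.3 and SAS <= 5, per the paper's filter) and its fraction."""
    keep = [m for m in mols if m["qed"] >= qed_min and m["sas"] <= sas_max]
    frac = len(keep) / len(mols) if mols else 0.0
    return keep, frac

# One molecule passes both thresholds; the others fail QED or SAS.
mols = [{"qed": 0.6, "sas": 3.0}, {"qed": 0.2, "sas": 2.0},
        {"qed": 0.5, "sas": 6.0}]
kept, frac = druglike_fraction(mols)
```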
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (A)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
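<p>The Jensen-Shannon divergence used to compare atom-atom distance distributions operates on normalized histograms; a minimal base-2 implementation (so the divergence is bounded by 1) looks like:</p>

```python
import math

def _kl(p, q):
    """Kullback-Leibler divergence in bits; zero-probability bins in p
    contribute nothing by convention."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two normalized
    histograms, e.g. binned atom-atom distance distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

Identical distributions score 0 and fully disjoint ones score 1, so a lower divergence against reference geometries indicates more realistic bond-length and distance statistics.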
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which 1.4M from public sources are shared). Diversity of generated drug-like molecules is slightly lower than baselines, though the authors argue that baseline diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ A), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 A</td>
          <td>1.1 A (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Group SELFIES: Fragment-Based Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</guid><description>Group SELFIES extends SELFIES with fragment-based group tokens for chemically robust molecular string representations that improve distribution learning.</description><content:encoded><![CDATA[<h2 id="a-fragment-aware-extension-of-selfies">A Fragment-Aware Extension of SELFIES</h2>
<p>This is a <strong>Method</strong> paper that introduces Group SELFIES, a molecular string representation extending SELFIES by incorporating group tokens that represent functional groups or entire substructures. The primary contribution is a representation that maintains the 100% chemical validity guarantee of SELFIES while enabling fragment-level molecular encoding. Group SELFIES is shorter, more human-readable, and produces better distribution learning compared to both SMILES and standard SELFIES.</p>
<h2 id="from-atoms-to-fragments-in-molecular-strings">From Atoms to Fragments in Molecular Strings</h2>
<p>Molecular string representations underpin nearly all string-based molecular generation, from chemical language models and VAEs to genetic algorithms. SMILES, the dominant representation, suffers from validity issues: generated strings frequently contain syntax errors or violate valency constraints. SELFIES solved this by guaranteeing that every string decodes to a valid molecule, but both SMILES and SELFIES operate at the atomic level. Human chemists, by contrast, think about molecules in terms of functional groups and substructures.</p>
<p>Fragment-based generative models exploit this inductive bias by constructing custom representations amenable to fragment-based molecular design. However, these approaches are typically graph-based, losing the desirable properties of string representations: easy manipulation and direct input into established language models. Historical string representations like Wiswesser Line Notation (WLN), Hayward Notation, and SYBYL Line Notation (SLN) did use non-atomic tokens, but none provided chemical robustness guarantees.</p>
<p>The gap is clear: no existing string representation combines the chemical robustness of SELFIES with the fragment-level abstraction that captures meaningful chemical motifs.</p>
<h2 id="group-tokens-with-chemical-robustness-guarantees">Group Tokens with Chemical Robustness Guarantees</h2>
<p>The core innovation is the introduction of <strong>group tokens</strong> into the SELFIES framework. Each group token represents a predefined molecular fragment (such as a benzene ring, carboxyl group, or any user-specified substructure) and is treated as a single unit during encoding and decoding.</p>
<h3 id="group-definition">Group Definition</h3>
<p>Each group is defined as a set of atoms and bonds with labeled <strong>attachment points</strong> that specify how the group participates in bonding. Each attachment point has a specified maximum valency, allowing the decoder to continue tracking available valency during string construction. Group tokens take the form <code>[:S&lt;group-name&gt;]</code>, where <code>S</code> is the starting attachment index.</p>
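<p>A token of this form can be recognized with a small regular expression. The precise grammar of group names is not specified here, so the pattern below is a hypothetical approximation for illustration:</p>

```python
import re

# Hypothetical grammar for the [:S<group-name>] token form: a starting
# attachment index S (digits) followed by a group name beginning with a
# letter. The reference implementation's grammar may differ in detail.
GROUP_TOKEN = re.compile(r"\[:(\d+)([A-Za-z][\w\-]*)\]")

def parse_group_token(token):
    """Return (starting_attachment_index, group_name), or None if the
    token is not a group token (e.g. an ordinary atom token)."""
    m = GROUP_TOKEN.fullmatch(token)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)
```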
<h3 id="encoding">Encoding</h3>
<p>To encode a molecule, the encoder first recognizes and replaces substructure matches from the group set. By default, the encoder processes larger groups first, but users can override this with priority values. The encoder then traverses the molecular graph similarly to standard SELFIES encoding, inserting tokens that track attachment indices for entering and exiting groups.</p>
<h3 id="decoding">Decoding</h3>
<p>When the decoder encounters a group token, it looks up the corresponding group in the group set dictionary, places all atoms of the group, and connects the main chain to the starting attachment point. Navigation between attachment points is handled by reading subsequent tokens as relative indices. If an attachment point is occupied, the next available one is used. If all attachment points are exhausted, the group is immediately popped from the stack.</p>
<h3 id="chemical-robustness">Chemical Robustness</h3>
<p>The key property preserved from SELFIES is that <strong>any arbitrary Group SELFIES string decodes to a molecule with valid valency</strong>. This is achieved by maintaining the same two SELFIES decoder features within the group framework:</p>
<ol>
<li>Token overloading: every token can be interpreted as a number when needed (for branch lengths, ring targets, or attachment indices).</li>
<li>Valency tracking: if adding a bond would exceed available valency, the decoder adjusts the bond order or skips the bond.</li>
</ol>
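<p>The valency-tracking rule can be illustrated with a deliberately simplified pure-Python sketch. The atom set, token format, and chain-only topology here are toy assumptions for illustration, not the actual Group SELFIES implementation:</p>

```python
# Toy illustration of SELFIES-style valency tracking: when a requested
# bond order would exceed an atom's remaining valence, the decoder
# lowers the bond order (or skips the bond) instead of failing.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def decode(tokens):
    """Decode (atom, requested_bond_order) pairs into a linear chain,
    clamping each bond so neither endpoint exceeds its valence."""
    atoms, bonds = [], []
    remaining = []  # free valence per placed atom
    for symbol, order in tokens:
        if not atoms:
            atoms.append(symbol)
            remaining.append(MAX_VALENCE[symbol])
            continue
        # Clamp the bond order to what both endpoints can support.
        order = min(order, remaining[-1], MAX_VALENCE[symbol])
        if order == 0:
            continue  # previous atom is saturated: skip this token
        bonds.append((len(atoms) - 1, len(atoms), order))
        remaining[-1] -= order
        atoms.append(symbol)
        remaining.append(MAX_VALENCE[symbol] - order)
    return atoms, bonds

# A triple bond to oxygen is impossible; it is demoted to a double bond.
atoms, bonds = decode([("C", 0), ("O", 3), ("F", 2)])
```

<p>Because every token is interpreted relative to the remaining valence, no input sequence can produce an invalid structure; this is the same guarantee the group framework preserves.</p>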
<p>The authors verified robustness by encoding and decoding 25 million molecules from the eMolecules database.</p>
<h3 id="chirality-handling">Chirality Handling</h3>
<p>Group SELFIES handles chirality differently from SMILES and SELFIES. Rather than using <code>@</code>-notation for tetrahedral chirality, all chiral centers must be specified as groups. An &ldquo;essential set&rdquo; of 23 groups covers all relevant chiral centers in the eMolecules database. This approach also supports extended chirality (axial, helical, planar) by abstracting the entire chiral substructure into a group token.</p>
<h3 id="fragment-selection">Fragment Selection</h3>
<p>The group set is a user-defined dictionary that maps group names to molecular fragments. Users can specify groups manually using SMILES-like syntax, extract them from fragment libraries, or use fragmentation algorithms such as matched molecular pair analysis. The authors tested several approaches, including a naive method that cleaves side chains from rings and methods based on cheminformatics fragmentation tools. A useful group set typically contains fragments that appear in many molecules and replace many atoms, with similar fragments merged to reduce redundancy.</p>
<h2 id="experiments-on-compactness-generation-and-distribution-learning">Experiments on Compactness, Generation, and Distribution Learning</h2>
<h3 id="compactness-section-41">Compactness (Section 4.1)</h3>
<p>Using 53 groups (30 extracted from ZINC-250k plus 23 from the essential set), Group SELFIES strings are shorter than their SMILES and SELFIES equivalents. Despite Group SELFIES having a larger alphabet, the compressed file size of the ZINC-250k dataset is smallest for Group SELFIES, indicating lower information-theoretic complexity.</p>
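<p>The compressed-size comparison can be mimicked with the standard library. The corpora below are placeholders, not actual ZINC-250k encodings; they only illustrate why collapsing a recurring motif into a single group token lowers both raw and compressed size:</p>

```python
import zlib

def compressed_size(strings):
    """Total zlib-compressed size of a newline-joined corpus, a rough
    proxy for the information-theoretic complexity of a representation."""
    return len(zlib.compress("\n".join(strings).encode("utf-8"), level=9))

# Placeholder corpora: a group token collapses a recurring motif
# (here, an atom-level benzene encoding) into one symbol.
atomic = ["[C][C][=C][C][=C][C][=C][Ring1][=Branch1]"] * 1000
grouped = ["[:0benzene]"] * 1000

assert compressed_size(grouped) < compressed_size(atomic)
```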
<h3 id="random-molecular-generation-section-42">Random Molecular Generation (Section 4.2)</h3>
<p>To isolate the effect of the representation from the generative model, the authors use a primitive generative model: sample a random string length from the dataset, draw tokens uniformly from a bag of all tokens, and concatenate. From 100,000 ZINC-250k molecules:</p>
<ul>
<li>Randomly sampled Group SELFIES strings produce molecules whose SAScore and QED distributions more closely overlap with the original ZINC dataset than molecules from randomly sampled SELFIES strings.</li>
<li>The Wasserstein distances to the ZINC distribution are consistently lower for Group SELFIES.</li>
<li>On a nonfullerene acceptor (NFA) dataset, Group SELFIES preserves aromatic rings while SELFIES rarely does.</li>
</ul>
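<p>The primitive generator described above is easy to state in code. This is a sketch with a tiny placeholder dataset; the real experiment draws from the tokenized ZINC-250k strings and decodes the results with the Group SELFIES decoder:</p>

```python
import random

def random_strings(dataset_tokens, n_samples, seed=0):
    """Bag-of-tokens baseline: sample a string length from the dataset's
    empirical length distribution, draw that many tokens uniformly from
    the pooled bag of all tokens, and concatenate them."""
    rng = random.Random(seed)
    bag = [tok for mol in dataset_tokens for tok in mol]
    lengths = [len(mol) for mol in dataset_tokens]
    return ["".join(rng.choice(bag) for _ in range(rng.choice(lengths)))
            for _ in range(n_samples)]

# Tiny illustrative "dataset" of token lists (placeholders, not ZINC).
data = [["[C]", "[O]"], ["[:0benzene]", "[C]", "[N]"]]
samples = random_strings(data, 5)
```

<p>Because Group SELFIES tokens carry whole fragments, even this uniform sampler lands closer to the dataset's property distributions than the same procedure over atomic SELFIES tokens.</p>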
<h3 id="distribution-learning-with-vaes-section-43">Distribution Learning with VAEs (Section 4.3)</h3>
<p>Using the MOSES benchmarking framework, VAEs were trained for 125 epochs on both Group SELFIES and SELFIES representations. The Group SELFIES VAE used 300 groups extracted from the MOSES training set. Results from 100,000 generated molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Group-VAE-125</th>
          <th>SELFIES-VAE-125</th>
          <th>Train (Reference)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>1.0 (0)</td>
          <td>1.0 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@1k</td>
          <td>1.0 (0)</td>
          <td>0.9996 (5)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@10k</td>
          <td>0.9985 (4)</td>
          <td>0.9986 (4)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>FCD (Test)</td>
          <td>0.1787 (29)</td>
          <td>0.6351 (43)</td>
          <td>0.008</td>
      </tr>
      <tr>
          <td>FCD (TestSF)</td>
          <td>0.734 (109)</td>
          <td>1.3136 (128)</td>
          <td>0.4755</td>
      </tr>
      <tr>
          <td>SNN (Test)</td>
          <td>0.6051 (4)</td>
          <td>0.6014 (3)</td>
          <td>0.6419</td>
      </tr>
      <tr>
          <td>Frag (Test)</td>
          <td>0.9995 (0)</td>
          <td>0.9989 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Scaf (Test)</td>
          <td>0.9649 (21)</td>
          <td>0.9588 (15)</td>
          <td>0.9907</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>0.8587 (1)</td>
          <td>0.8579 (1)</td>
          <td>0.8567</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.9623 (7)</td>
          <td>0.96 (4)</td>
          <td>1.0</td>
      </tr>
  </tbody>
</table>
<p>The most notable improvement is in Fréchet ChemNet Distance (FCD), where Group SELFIES achieves 0.1787 versus 0.6351 for SELFIES on the test set. FCD measures the distance between the distributions of penultimate-layer activations of ChemNet, which encode a mixture of biological and chemical properties relevant to drug-likeness. Most other metrics are comparable, with Group SELFIES matching or slightly outperforming SELFIES.</p>
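<p>FCD is the Fréchet distance between two Gaussians fit to ChemNet activations. Under a simplifying diagonal-covariance assumption it reduces to the closed form sketched below; the real metric uses full covariance matrices and a matrix square root:</p>

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)):
    d^2 = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    d2 += sum(v1 + v2 - 2 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return d2

# Identical activation distributions give zero distance.
assert frechet_distance_diag([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]) == 0.0
```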
<h2 id="advantages-limitations-and-future-directions">Advantages, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p>Group SELFIES provides three main advantages over standard SELFIES:</p>
<ol>
<li><strong>Substructure control</strong>: Important scaffolds, chiral centers, and charged groups can be preserved during molecular optimization.</li>
<li><strong>Compactness</strong>: Group tokens represent multiple atoms, yielding shorter strings with lower information-theoretic complexity.</li>
<li><strong>Improved distribution learning</strong>: The FCD metric shows substantial improvement, indicating generated molecules better capture biological and chemical properties of the training set.</li>
</ol>
<p>Both SELFIES and Group SELFIES achieve 100% validity, eliminating the validity issues associated with SMILES-based generation.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational speed</strong>: Encoding and decoding are slower than SELFIES due to RDKit overhead, particularly for the encoder, which performs substructure matching for every group in the set.</li>
<li><strong>No group overlap</strong>: Groups cannot overlap in the current formulation, which limits expressiveness for polycyclic compounds.</li>
<li><strong>Group set design</strong>: Choosing an effective group set remains an open design choice that may require domain expertise or fragmentation algorithm tuning.</li>
<li><strong>Limited generative model evaluation</strong>: The paper focuses on random sampling and VAEs; evaluation with more sophisticated models (GANs, reinforcement learning, genetic algorithms) is left to future work.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose several extensions: flexible scaffold tokens that preserve topology while allowing atom-type variation, representations based on cellular complexes or hypergraphs to handle overlapping groups, and integration with genetic algorithms like JANUS for molecular optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compactness / Generation</td>
          <td>ZINC-250k</td>
          <td>250,000 molecules</td>
          <td>Random subset of 10,000 for fragment extraction; 100,000 for generation</td>
      </tr>
      <tr>
          <td>Distribution Learning</td>
          <td>MOSES benchmark</td>
          <td>~1.9M molecules</td>
          <td>Standard train/test split from MOSES framework</td>
      </tr>
      <tr>
          <td>Robustness Verification</td>
          <td>eMolecules</td>
          <td>25M molecules</td>
          <td>Full database encode-decode round trip</td>
      </tr>
      <tr>
          <td>NFA Generation</td>
          <td>NFA dataset</td>
          <td>Not specified</td>
          <td>Nonfullerene acceptors from Lopez et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: Naive ring-sidechain cleavage, matched molecular pair analysis, and diversity-based selection of 300 groups for VAE experiments.</li>
<li><strong>Essential set</strong>: 23 chiral groups covering all relevant chiral centers in eMolecules.</li>
<li><strong>Random generation</strong>: Bag-of-tokens sampling with length matched to dataset distribution.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>VAE</strong>: Trained for 125 epochs on MOSES dataset using both SELFIES and Group SELFIES tokenizations.</li>
<li>Architecture details follow the MOSES benchmark VAE configuration.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
<td>Fréchet ChemNet Distance (penultimate-layer activations)</td>
      </tr>
      <tr>
          <td>SNN</td>
          <td>Average Tanimoto similarity to nearest neighbor in reference set</td>
      </tr>
      <tr>
          <td>Frag</td>
          <td>Cosine similarity of BRICS fragment distributions</td>
      </tr>
      <tr>
          <td>Scaf</td>
          <td>Cosine similarity of Bemis-Murcko scaffold distributions</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>Internal diversity via Tanimoto similarity</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Percentage passing RDKit parsing</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Percentage of non-duplicate generated molecules</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Robustness verification performed on the Niagara supercomputer (SciNet HPC Consortium).</li>
<li>VAE training hardware not specified.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/group-selfies">group-selfies</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Open-source Python implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cheng, A. H., Cai, A., Miret, S., Malkomes, G., Phielipp, M., &amp; Aspuru-Guzik, A. (2023). Group SELFIES: A robust fragment-based molecular string representation. <em>Digital Discovery</em>, 2(3), 748-758. <a href="https://doi.org/10.1039/D3DD00012E">https://doi.org/10.1039/D3DD00012E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cheng2023group,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Group SELFIES: A Robust Fragment-Based Molecular String Representation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cheng, Austin H. and Cai, Andy and Miret, Santiago and Malkomes, Gustavo and Phielipp, Mariano and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{748--758}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00012E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Evolutionary Molecular Design via Deep Learning + GA</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</guid><description>Kwon et al. combine an RNN decoder for SMILES reconstruction with a genetic algorithm operating on ECFP fingerprints for goal-directed molecular design.</description><content:encoded><![CDATA[<h2 id="fingerprint-based-evolutionary-molecular-design">Fingerprint-Based Evolutionary Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces an evolutionary design methodology (EDM) for goal-directed molecular optimization. The primary contribution is a four-component framework in which (1) molecules are encoded as <a href="https://en.wikipedia.org/wiki/Chemical_similarity">extended-connectivity fingerprint</a> (ECFP) vectors, (2) a genetic algorithm evolves these fingerprint vectors through mutation and crossover, (3) a recurrent neural network (RNN) decodes the evolved fingerprints back into valid SMILES strings, and (4) a deep neural network (DNN) evaluates molecular fitness. The key advantage over prior evolutionary approaches is that no hand-crafted chemical rules or fragment libraries are needed, as the RNN learns valid molecular reconstruction from data.</p>
<h2 id="challenges-in-evolutionary-molecular-optimization">Challenges in Evolutionary Molecular Optimization</h2>
<p>Evolutionary algorithms for molecular design face two core challenges. First, maintaining chemical validity of evolved molecules is difficult when operating on graph or string representations directly. Prior methods rely on predefined chemical rules and fragment libraries to constrain structural modifications (atom/bond additions, deletions, substitutions), but these introduce bias and risk convergence to local optima. Each new application domain requires specifying new chemical rules, which may not exist for emerging areas. Second, fitness evaluation must be both efficient and accurate. Simple evaluation methods like structural similarity indices or semi-empirical quantum chemistry calculations reduce computational cost but may not capture complex property relationships.</p>
<p>High-throughput computational screening (HTCS) is a common alternative, but it depends on the quality of predefined virtual chemical libraries and often requires multiple iterative enumerations, limiting its ability to explore novel chemical space.</p>
<h2 id="core-innovation-evolving-fingerprints-with-neural-decoding">Core Innovation: Evolving Fingerprints with Neural Decoding</h2>
<p>The key insight is to perform genetic operations in fingerprint space rather than in molecular graph or SMILES string space. The framework comprises three learned functions:</p>
<p><strong>Encoding function</strong> $e(\cdot)$: Converts a SMILES string $\mathbf{m}$ into a 5000-dimensional ECFP vector $\mathbf{x}$ using Morgan fingerprints with a neighborhood radius of 6. This is a deterministic hash-based encoding (not learned).</p>
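<p>Conceptually, ECFP hashes each atom's circular neighborhood into a fixed-length bit vector. A toy folding sketch is below; the paper uses RDKit Morgan fingerprints (radius 6, 5000 bits), and the neighborhood strings here are illustrative stand-ins for the canonical environments RDKit enumerates:</p>

```python
def toy_folded_fingerprint(neighborhoods, n_bits=5000):
    """Fold hashed substructure identifiers into a fixed-length bit
    vector, as ECFP does with circular atom neighborhoods."""
    bits = [0] * n_bits
    for env in neighborhoods:
        bits[hash(env) % n_bits] = 1
    return bits

# Illustrative neighborhood descriptors for a small molecule.
fp = toy_folded_fingerprint(["C(C)(N)", "N(C)", "C(N)=O"], n_bits=64)
```

<p>Folding is lossy (distinct neighborhoods can collide on one bit), which is one reason the RNN decoder cannot always recover the original molecule.</p>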
<p><strong>Decoding function</strong> $d(\cdot)$: An RNN with three hidden layers of 500 LSTM units that reconstructs a SMILES string from an ECFP vector. The RNN generates SMILES as a sequence of three-character substrings, conditioning each prediction on the current substring and the input ECFP vector:</p>
<p>$$d(\mathbf{x}) = \mathbf{m}, \qquad \mathbf{m}_{t+1} \sim p(\mathbf{m}_{t+1} \mid \mathbf{m}_{t}, \mathbf{x})$$</p>
<p>The three-character substring approach reduces the ratio of invalid SMILES by imposing additional constraints on subsequent characters.</p>
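<p>The substring chunking itself is simple; a minimal sketch follows (the pad character is an assumed detail, not specified in the paper):</p>

```python
def chunk_smiles(smiles, k=3, pad="_"):
    """Split a SMILES string into fixed-length substrings, padding the
    final chunk so every token has exactly k characters."""
    if len(smiles) % k:
        smiles += pad * (k - len(smiles) % k)
    return [smiles[i:i + k] for i in range(0, len(smiles), k)]

# "c1ccccc1" (benzene) -> ['c1c', 'ccc', 'c1_']
tokens = chunk_smiles("c1ccccc1")
```

<p>Predicting three characters per step forces each output to be locally consistent with two of its neighbors, which is why the substring vocabulary lowers the invalid-SMILES rate relative to character-by-character generation.</p>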
<p><strong>Property prediction function</strong> $f(\cdot)$: A five-layer DNN with 250 hidden units per layer that predicts molecular properties from ECFP vectors:</p>
<p>$$\mathbf{t} = f(e(\mathbf{m}))$$</p>
<p>The RNN is trained by minimizing cross-entropy loss between the softmax output and the target SMILES string $\mathbf{m}_{i}$, learning the relationship $d(e(\mathbf{m}_{i})) = \mathbf{m}_{i}$. The DNN is trained by minimizing mean squared error between predicted and computed property values. Both use the Adam optimizer with mini-batch size 100, 500 training epochs, and dropout rate 0.5.</p>
<h3 id="genetic-algorithm-operations">Genetic Algorithm Operations</h3>
<p>The GA evolves ECFP vectors using the DEAP library with the following parameters:</p>
<ul>
<li><strong>Population size</strong>: 50</li>
<li><strong>Crossover rate</strong>: 0.7 (uniform crossover, mixing ratio 0.2)</li>
<li><strong>Mutation rate</strong>: 0.3 (Gaussian mutation, $N(0, 0.2^{2})$, applied to 1% of elements)</li>
<li><strong>Selection</strong>: Tournament selection with size 3, top 3 individuals as parents</li>
<li><strong>Termination</strong>: 500 generations or 30 consecutive generations without fitness improvement</li>
</ul>
<p>The evolutionary loop proceeds as follows: a seed molecule $\mathbf{m}_{0}$ is encoded to $\mathbf{x}_{0}$, mutated to generate a population $\mathbf{P}^{0} = \{\mathbf{z}_{1}, \mathbf{z}_{2}, \ldots, \mathbf{z}_{L}\}$, each vector is decoded via the RNN, validity is checked with RDKit, fitness is evaluated via the DNN, and the top parents produce the next generation through crossover and mutation.</p>
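<p>The GA operations above can be sketched without DEAP. Everything below beyond the stated rates (the fitness function, vector length, generation count, and initialization noise) is illustrative, not the paper's configuration; in the real pipeline the fitness call wraps RNN decoding, RDKit validity checking, and the DNN predictor:</p>

```python
import random

rng = random.Random(0)

def tournament(pop, fitness, k=3):
    """Tournament selection: best of k randomly drawn individuals."""
    return max(rng.sample(pop, k), key=fitness)

def uniform_crossover(p1, p2, mix=0.2):
    """Swap each element with probability `mix` (as in DEAP's cxUniform)."""
    return [b if rng.random() < mix else a for a, b in zip(p1, p2)]

def gaussian_mutation(x, rate=0.01, sigma=0.2):
    """Add N(0, sigma^2) noise to a small fraction of vector elements."""
    return [v + rng.gauss(0, sigma) if rng.random() < rate else v for v in x]

def evolve(seed_vec, fitness, pop_size=50, generations=10,
           cx_rate=0.7, mut_rate=0.3):
    """Minimal GA loop over real-valued vectors standing in for ECFPs."""
    pop = [gaussian_mutation(seed_vec, rate=0.05) for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[:3]
        nxt = []
        while len(nxt) < pop_size:
            child = tournament(pop, fitness)
            if rng.random() < cx_rate:
                child = uniform_crossover(child, rng.choice(parents))
            if rng.random() < mut_rate:
                child = gaussian_mutation(child)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Placeholder fitness: negative squared distance to an arbitrary target.
target = [1.0] * 8
fit = lambda x: -sum((a - b) ** 2 for a, b in zip(x, target))
best = evolve([0.0] * 8, fit)
```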
<h2 id="experimental-setup-light-absorbing-wavelength-optimization">Experimental Setup: Light-Absorbing Wavelength Optimization</h2>
<h3 id="training-data-and-deep-learning-performance">Training Data and Deep Learning Performance</h3>
<p>The models were trained on 10,000 to 100,000 molecules randomly sampled from PubChem (molecular weight 200-600 g/mol). Each molecule was labeled with DFT-computed excitation energy ($S_{1}$), <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO, and LUMO</a> energies using B3LYP/6-31G.</p>
<table>
  <thead>
      <tr>
          <th>Training Data</th>
          <th>Validity (%)</th>
          <th>Reconstructability (%)</th>
          <th>$S_{1}$ (R, MAE)</th>
          <th>HOMO (R, MAE)</th>
          <th>LUMO (R, MAE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>100,000</td>
          <td>88.8</td>
          <td>62.4</td>
          <td>0.977, 0.185 eV</td>
          <td>0.948, 0.168 eV</td>
          <td>0.960, 0.195 eV</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>86.7</td>
          <td>60.1</td>
          <td>0.973, 0.198 eV</td>
          <td>0.945, 0.172 eV</td>
          <td>0.955, 0.209 eV</td>
      </tr>
      <tr>
          <td>30,000</td>
          <td>85.3</td>
          <td>59.8</td>
          <td>0.930, 0.228 eV</td>
          <td>0.934, 0.191 eV</td>
          <td>0.945, 0.224 eV</td>
      </tr>
      <tr>
          <td>10,000</td>
          <td>83.2</td>
          <td>55.7</td>
          <td>0.913, 0.278 eV</td>
          <td>0.885, 0.244 eV</td>
          <td>0.917, 0.287 eV</td>
      </tr>
  </tbody>
</table>
<p>Validity refers to the proportion of chemically valid SMILES after RDKit inspection. Reconstructability measures how often the RNN can reproduce the original molecule from its ECFP (62.4% at 100k training samples by matching canonical SMILES among 10,000 generated strings).</p>
<h3 id="design-task-1-unconstrained-s1-modification">Design Task 1: Unconstrained S1 Modification</h3>
<p>Fifty seed molecules with $S_{1}$ values between 3.8 eV and 4.2 eV were evolved in both increasing and decreasing directions. With 50,000 training samples, $S_{1}$ increased by approximately 60% on average in the increasing direction and showed slightly lower rates of change in the decreasing direction. The asymmetry is attributed to the skewed $S_{1}$ distribution of training data (average $S_{1}$ of 4.3-4.4 eV, higher than the seed median of 4.0 eV). Performance saturated at approximately 50,000 training samples.</p>
<h3 id="design-task-2-s1-modification-with-homolumo-constraints">Design Task 2: S1 Modification with HOMO/LUMO Constraints</h3>
<p>The same 50 seeds were evolved with constraints: $-7.0 \text{ eV} &lt; \text{HOMO} &lt; -5.0 \text{ eV}$ and $\text{LUMO} &lt; 0.0 \text{ eV}$. In the increasing $S_{1}$ direction, constraints suppressed the rate of change because both HOMO and LUMO bounds limit the achievable HOMO-LUMO gap. In the decreasing direction, constraints had minimal effect because LUMO could freely decrease while HOMO had sufficient room to rise within the allowed range.</p>
<h3 id="design-task-3-extrapolation-beyond-training-data">Design Task 3: Extrapolation Beyond Training Data</h3>
<p>To generate molecules with $S_{1}$ values below 1.77 eV (outside the training distribution, which had mean $S_{1}$ of 4.91 eV), the authors introduced iterative &ldquo;phases&rdquo;: generate molecules, compute their properties via DFT, retrain the models, and repeat. Starting from the 30 lowest-$S_{1}$ seed molecules with 300 generation runs per phase:</p>
<ul>
<li>Phase 1: Average $S_{1}$ = 2.20 eV, 12 molecules below 1.77 eV</li>
<li>Phase 2: Average $S_{1}$ = 2.22 eV, 37 molecules below 1.77 eV</li>
<li>Phase 3: Average $S_{1}$ = 2.31 eV, 58 molecules below 1.77 eV</li>
</ul>
<p>While the average $S_{1}$ rose slightly across phases, variance decreased (from 1.40 to 1.36), indicating the model concentrated its outputs closer to the target range. This active-learning-like loop demonstrates the framework can extend beyond the training distribution.</p>
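<p>The phase structure can be summarized as a short skeleton. The three helper functions below are placeholders for the paper's GA + RNN generator, the B3LYP/6-31G DFT labeling step, and the RNN/DNN retraining; they are stubbed out here purely to show the control flow:</p>

```python
# Placeholder hooks standing in for the paper's actual components.
def run_evolution(train_set):
    return [f"mol_{len(train_set)}_{i}" for i in range(3)]

def compute_dft_s1(molecule):
    return 1.5  # stand-in for a DFT-computed excitation energy (eV)

def retrain_models(train_set):
    pass

def extrapolation_phases(seeds, n_phases=3, target_s1=1.77):
    """Iterative loop: generate candidates, label them with DFT, collect
    hits below the target, fold new data back in, and retrain."""
    train_set, hits = list(seeds), []
    for _ in range(n_phases):
        candidates = run_evolution(train_set)
        labeled = [(m, compute_dft_s1(m)) for m in candidates]
        hits += [m for m, s1 in labeled if s1 < target_s1]
        train_set += [m for m, _ in labeled]
        retrain_models(train_set)
    return hits

hits = extrapolation_phases(["seed_a", "seed_b"])
```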
<h3 id="design-task-4-guacamol-benchmarks">Design Task 4: GuacaMol Benchmarks</h3>
<p>The method was evaluated on the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> goal-directed benchmark suite using the ChEMBL25 training dataset. The RNN model was retrained with three-character substrings.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th><a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a></th>
          <th>SMILES GA</th>
          <th><a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a></th>
          <th><a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph MCTS</a></th>
          <th>cRNN</th>
          <th>EDM (ours)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.607</td>
          <td>1.000</td>
          <td>0.378</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Troglitazone rediscovery</td>
          <td>0.419</td>
          <td>1.000</td>
          <td>0.558</td>
          <td>1.000</td>
          <td>0.312</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Thiothixene rediscovery</td>
          <td>0.456</td>
          <td>1.000</td>
          <td>0.495</td>
          <td>1.000</td>
          <td>0.308</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(-1.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.980</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(8.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.979</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>TPSA(150.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>CNS MPO</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.944</td>
          <td>0.948</td>
          <td>0.948</td>
      </tr>
  </tbody>
</table>
<p>The EDM achieves maximum scores on all eight tasks, matching the cRNN baseline. The 256 highest-scoring molecules from the ChEMBL25 test set were used as seeds, with 500 SMILES strings generated per seed.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="results">Results</h3>
<p>The evolutionary design framework successfully evolved seed molecules toward target properties across all four design tasks. The RNN decoder maintained 88.8% chemical validity at 100k training samples, and the DNN property predictor achieved correlation coefficients above 0.94 for $S_{1}$, HOMO, and LUMO prediction. The iterative retraining procedure enabled exploration outside the training data distribution, generating 58 molecules with $S_{1}$ below 1.77 eV after three phases. On GuacaMol benchmarks, the method achieved maximum scores on all eight tasks, matching <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a>, <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, and cRNN baselines.</p>
<h3 id="limitations">Limitations</h3>
<p>Several limitations are worth noting:</p>
<ol>
<li><strong>Reconstructability ceiling</strong>: Only 62.4% of molecules could be reconstructed from their ECFP vectors, meaning the RNN decoder fails to recover the original molecule approximately 38% of the time. This information loss in the ECFP encoding is a fundamental bottleneck.</li>
<li><strong>Data dependence</strong>: Performance is sensitive to the training data distribution. The asymmetric evolution rates for increasing vs. decreasing $S_{1}$ directly reflect the skewed training data.</li>
<li><strong>Structural constraints</strong>: Three heuristic constraints (fused ring sizes, number of fused rings, alkyl chain lengths) were still needed to maintain reasonable molecular structures, partially undermining the claim of a fully data-driven approach.</li>
<li><strong>DFT reliance</strong>: The extrapolation experiment requires DFT calculations in the loop, which are computationally expensive and may limit scalability.</li>
<li><strong>Limited benchmark scope</strong>: Only 8 GuacaMol tasks were tested, and all achieved perfect scores, making it difficult to differentiate from competing methods. The paper does not report on harder multi-objective benchmarks.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>PubChem random sample</td>
          <td>10,000-100,000 molecules</td>
          <td>MW 200-600 g/mol, labeled with DFT-computed $S_{1}$, HOMO, LUMO</td>
      </tr>
      <tr>
          <td>GuacaMol Benchmark</td>
          <td>ChEMBL25</td>
          <td>Standard split</td>
          <td>Used for retraining RNN; 256 top-scoring seeds</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Genetic algorithm</strong>: DEAP library; population 50, crossover rate 0.7, mutation rate 0.3, tournament size 3</li>
<li><strong>RNN decoder</strong>: 3 hidden layers, 500 LSTM units each, three-character substring generation</li>
<li><strong>DNN predictor</strong>: 5 layers, 250 hidden units, sigmoid activations, linear output</li>
<li><strong>Training</strong>: Adam optimizer, mini-batch 100, 500 epochs, dropout 0.5</li>
</ul>
<h3 id="models">Models</h3>
<p>All neural networks were implemented using Keras with the Theano backend (GPU-accelerated). No pre-trained model weights are publicly available.</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>RNN validity</strong>: Proportion of chemically valid SMILES (RDKit check)</li>
<li><strong>Reconstructability</strong>: Fraction of seed molecules recoverable from ECFP (canonical SMILES match in 10,000 generated strings)</li>
<li><strong>DNN accuracy</strong>: Correlation coefficient (R) and MAE via 10-fold cross-validation</li>
<li><strong>Evolutionary performance</strong>: Average rate of $S_{1}$ change across 50 seeds; molecule count in target range</li>
<li><strong>GuacaMol</strong>: Standard rediscovery and property satisfaction benchmarks</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models, training times, or computational requirements for the evolutionary runs. DFT calculations used the Gaussian 09 program suite with B3LYP/6-31G.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained models are available. The paper is published under a CC-BY 4.0 license as open access in Scientific Reports.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.nature.com/articles/s41598-021-96812-8">Paper (Nature)</a></td>
          <td>Paper</td>
          <td>CC-BY 4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Partially Reproducible. The method is described in sufficient detail for reimplementation, but no code, trained models, or preprocessed datasets are released. The DFT calculations require Gaussian 09, a commercial software package.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kwon, Y., Kang, S., Choi, Y.-S., &amp; Kim, I. (2021). Evolutionary design of molecules based on deep learning and a genetic algorithm. <em>Scientific Reports</em>, 11, 17304. <a href="https://doi.org/10.1038/s41598-021-96812-8">https://doi.org/10.1038/s41598-021-96812-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kwon2021evolutionary,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evolutionary design of molecules based on deep learning and a genetic algorithm}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kwon, Youngchun and Kang, Seokho and Choi, Youn-Suk and Kim, Inkoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-021-96812-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v3: Scaffold-Constrained Graph Transformer</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</guid><description>DrugEx v3 proposes a Graph Transformer with novel positional encoding for scaffold-constrained molecular generation via multi-objective reinforcement learning.</description><content:encoded><![CDATA[<h2 id="a-graph-transformer-method-for-scaffold-constrained-drug-design">A Graph Transformer Method for Scaffold-Constrained Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces DrugEx v3, a Graph Transformer model for scaffold-constrained de novo drug design. The primary contribution is a novel positional encoding scheme for molecular graphs that allows a Transformer architecture to operate on graph-structured molecular data rather than <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. The model takes user-provided scaffold fragments as input and generates complete molecules through growing and connecting operations, trained with multi-objective reinforcement learning to optimize for both target affinity and drug-likeness.</p>
<h2 id="from-fixed-objectives-to-user-guided-scaffold-design">From Fixed Objectives to User-Guided Scaffold Design</h2>
<p>Prior versions of DrugEx (v1 and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">v2</a>) used RNN-based generators trained with reinforcement learning for de novo drug design, but they operated under fixed objectives and could not accept user-provided structural priors. If a medicinal chemist wanted to explore analogs of a specific scaffold, the model needed retraining from scratch. Meanwhile, SMILES-based molecular generators face inherent limitations for scaffold-constrained design: SMILES is a linear notation, so inserting fragments at multiple positions of a scaffold requires complex grammar handling, and small token changes can produce invalid molecules.</p>
<p>Several approaches had been proposed for scaffold-based generation, including graph generative models (Lim et al., 2019), DeepScaffold (Li et al., 2020), SMILES-based scaffold decorators (Arus-Pous et al., 2020), and SyntaLinker for fragment linking (Yang et al., 2020). DrugEx v3 aims to combine the advantages of graph representations (validity guarantees, local invariance, flexible extension) with the Transformer architecture&rsquo;s ability to handle complex dependencies, while maintaining the multi-objective reinforcement learning framework from DrugEx v2.</p>
<h2 id="graph-positional-encoding-for-molecular-transformers">Graph Positional Encoding for Molecular Transformers</h2>
<p>The core innovation is adapting the Transformer architecture to work directly with molecular graph representations. Two key modifications make this possible.</p>
<p><strong>Graph word encoding.</strong> Since atoms and bonds cannot be processed simultaneously in a graph, the authors combine them into a single index:</p>
<p>$$
W = T_{atom} \times 4 + T_{bond}
$$</p>
<p>where $T_{atom}$ is the atom type index and $T_{bond}$ is the bond type index (four bond types: single, double, triple, and none).</p>
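<p>The combined word index follows directly from the formula. A minimal sketch, where the atom and bond vocabularies are illustrative assumptions rather than the paper's actual token tables:</p>

```python
# Sketch of the graph "word" index W = T_atom * 4 + T_bond.
# The vocabularies below are assumed for illustration only.
ATOM_TYPES = {"C": 0, "N": 1, "O": 2, "F": 3}
BOND_TYPES = {"none": 0, "single": 1, "double": 2, "triple": 3}

def graph_word(atom: str, bond: str) -> int:
    # With exactly four bond types, multiplying the atom index by 4
    # gives every (atom, bond) pair a unique integer index.
    return ATOM_TYPES[atom] * 4 + BOND_TYPES[bond]

print(graph_word("O", "double"))  # -> 10 (2 * 4 + 2)
```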
<p><strong>Graph positional encoding.</strong> Standard sequential position encoding does not capture molecular topology. The authors propose an adjacency-matrix-based positional encoding:</p>
<p>$$
P = I_{Atom} \times L_{max} + I_{Connected}
$$</p>
<p>where $I_{Atom}$ is the current atom index, $L_{max}$ is the maximum sequence length, and $I_{Connected}$ is the index of the atom connected by the current bond. This encoding is then processed through the standard sinusoidal positional encoding:</p>
<p>$$
PE_{(p, 2i)} = \sin(p / 10000^{2i / d_{m}})
$$</p>
<p>$$
PE_{(p, 2i+1)} = \cos(p / 10000^{2i / d_{m}})
$$</p>
<p>with $d_{m} = 512$.</p>
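<p>A minimal sketch of the two-step encoding. The $L_{max} = 100$ value here is an assumption chosen for readability; only $d_m = 512$ comes from the paper:</p>

```python
import math

def graph_position(i_atom: int, i_connected: int, l_max: int = 100) -> int:
    # P = I_atom * L_max + I_connected: one integer encoding both the
    # current atom and the atom it bonds to.
    return i_atom * l_max + i_connected

def sinusoidal_pe(pos: int, d_model: int = 512) -> list:
    # Standard Transformer sinusoidal encoding applied to the graph position:
    # even dimensions use sin, odd dimensions use cos, sharing the frequency.
    return [
        math.sin(pos / 10000 ** (2 * (i // 2) / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** (2 * (i // 2) / d_model))
        for i in range(d_model)
    ]

pe = sinusoidal_pe(graph_position(3, 1))  # atom 3 bonded back to atom 1
print(len(pe))  # -> 512
```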
<p><strong>Molecule generation procedure.</strong> Each molecule in the training data is represented as a five-row matrix encoding atom type, bond type, connected atom index, current atom index, and fragment index. The columns are divided into three sections: fragment (the scaffold), growing (new atoms added to fragments), and linking (bonds connecting grown fragments). The decoder uses a GRU-based recurrent layer to sequentially output atom type, bond type, connected atom index, and current atom index at each step, with chemical valence rules enforced at every generation step to guarantee valid molecules.</p>
<p><strong>Multi-objective reinforcement learning.</strong> The generator is trained with a policy gradient objective:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) | \theta\right] = \sum_{t=1}^{T} \log G(y_{t} | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
<p>where $R^{*}$ is a Pareto-based reward combining target affinity and QED drug-likeness score:</p>
<p>$$
R^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>with $k$ being the solution&rsquo;s index in the Pareto rank. An exploration strategy uses two networks: an exploitation network $G_{\theta}$ (updated by policy gradient) and an exploration network $G_{\phi}$ (fixed, pre-trained on ChEMBL), with an exploration rate $\varepsilon$ controlling how many scaffolds are routed to $G_{\phi}$ during training.</p>
<h2 id="experimental-setup-architecture-comparison-and-rl-optimization">Experimental Setup: Architecture Comparison and RL Optimization</h2>
<h3 id="data">Data</h3>
<p>The ChEMBL set (version 27) contained approximately 1.7 million molecules for pre-training, preprocessed via RDKit (charge neutralization, metal/fragment removal). The LIGAND set comprised 10,828 adenosine receptor ligands for fine-tuning. Each molecule was decomposed into fragments using the BRICS algorithm, creating scaffold-molecule pairs (up to 15 pairs per molecule with four fragments). The ChEMBL set yielded 9.3 million training pairs, and the LIGAND set produced 53,888 training pairs.</p>
<h3 id="architecture-comparison">Architecture comparison</h3>
<p>Four architectures were compared:</p>
<ol>
<li><strong>Graph Transformer</strong>: graph input with novel positional encoding</li>
<li><strong>Sequential Transformer</strong>: SMILES input with standard Transformer</li>
<li><strong>LSTM-BASE</strong>: SMILES encoder-decoder with three recurrent layers</li>
<li><strong>LSTM+ATTN</strong>: LSTM-BASE with an attention mechanism between encoder and decoder</li>
</ol>
<p>All models were pre-trained on ChEMBL and fine-tuned on the LIGAND set. The bioactivity predictor was a random forest regression model using 2048D ECFP6 fingerprints and 19D physicochemical descriptors, with an activity threshold of pX = 6.5 for the A2A adenosine receptor.</p>
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>Five metrics were used: validity (parseable molecules), accuracy (scaffold containment), desirability (meeting all objectives), uniqueness, and novelty (not in ChEMBL). Diversity was measured using the Solow-Polasky index with Tanimoto distance on ECFP6 fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\intercal} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
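<p>The index needs no ML machinery: build the similarity matrix $F$, solve $F\mathbf{x} = \mathbf{e}$, and average, since $\mathbf{e}^{\intercal} F^{-1} \mathbf{e} = \sum_i x_i$. A pure-Python sketch in which the decay parameter <code>theta</code> and the toy distance matrices are assumptions (the paper uses Tanimoto distances on ECFP6 fingerprints):</p>

```python
import math

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for small dense systems.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def solow_polasky(dists, theta=1.0):
    # I(A) = (1/|A|) e^T F^{-1} e with F_ij = exp(-theta * d_ij).
    n = len(dists)
    F = [[math.exp(-theta * dists[i][j]) for j in range(n)] for i in range(n)]
    x = solve(F, [1.0] * n)          # F x = e, so sum(x) = e^T F^{-1} e
    return sum(x) / n

# Two maximally distant molecules vs. two near-duplicates:
print(round(solow_polasky([[0, 1], [1, 0]]), 3))        # -> 0.731
print(solow_polasky([[0, 0.01], [0.01, 0]]))            # close to 0.5
```

<p>As the example shows, the normalized index approaches 1 for mutually distant molecules and $1/|A|$ for near-duplicates, matching the 0.84&ndash;0.88 diversity values reported later.</p>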
<h3 id="hardware">Hardware</h3>
<p>Models were benchmarked on a server with NVIDIA Tesla P100 GPUs.</p>
<h2 id="key-results-graph-representation-advantages-and-rl-trade-offs">Key Results: Graph Representation Advantages and RL Trade-offs</h2>
<h3 id="pre-training-and-fine-tuning-performance">Pre-training and fine-tuning performance</h3>
<p>The Graph Transformer achieved the best overall performance across all metrics:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity (PT)</th>
          <th>Accuracy (PT)</th>
          <th>Validity (FT)</th>
          <th>Accuracy (FT)</th>
          <th>Novelty (FT)</th>
          <th>Uniqueness (FT)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph Transformer (512)</td>
          <td>100.0%</td>
          <td>99.3%</td>
          <td>100.0%</td>
          <td>99.2%</td>
          <td>68.9%</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>Seq. Transformer (512)</td>
          <td>96.7%</td>
          <td>74.0%</td>
          <td>99.3%</td>
          <td>92.7%</td>
          <td>8.9%</td>
          <td>28.9%</td>
      </tr>
      <tr>
          <td>LSTM+ATTN (512)</td>
          <td>94.3%</td>
          <td>72.8%</td>
          <td>96.9%</td>
          <td>85.9%</td>
          <td>6.3%</td>
          <td>20.7%</td>
      </tr>
      <tr>
          <td>LSTM-BASE (512)</td>
          <td>93.9%</td>
          <td>52.4%</td>
          <td>98.7%</td>
          <td>81.6%</td>
          <td>3.9%</td>
          <td>19.2%</td>
      </tr>
  </tbody>
</table>
<p>PT = pre-trained, FT = fine-tuned. The Graph Transformer achieved 100% validity due to its explicit valence checking at each generation step. It also produced substantially more novel and unique molecules after fine-tuning compared to SMILES-based methods.</p>
<p>The authors identified four advantages of the graph representation over SMILES: (1) local invariance, where fragment ordering does not affect output; (2) global extendibility, where new atoms can be appended without restructuring existing data; (3) freedom from grammar constraints; and (4) direct accessibility of chemical valence rules for validity enforcement.</p>
<h3 id="reinforcement-learning-results">Reinforcement learning results</h3>
<p>With multi-objective RL (affinity + QED), 74.6% of generated molecules were predicted active at $\varepsilon = 0.0$. The exploration rate $\varepsilon$ trades off desirability against uniqueness:</p>
<table>
  <thead>
      <tr>
          <th>$\varepsilon$</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.0</td>
          <td>74.6%</td>
          <td>60.7%</td>
          <td>60.6%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.1</td>
          <td>66.8%</td>
          <td>75.0%</td>
          <td>74.6%</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>0.2</td>
          <td>61.6%</td>
          <td>80.2%</td>
          <td>79.4%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.3</td>
          <td>56.8%</td>
          <td>89.8%</td>
          <td>88.8%</td>
          <td>0.874</td>
      </tr>
  </tbody>
</table>
<p>The authors report that $\varepsilon = 0.3$ produced the best balance between desirability and uniqueness, with 56.8% desired molecules and 89.8% uniqueness. Diversity remained above 0.84 across all settings.</p>
<h3 id="limitations">Limitations</h3>
<p>The Graph Transformer produced molecules with worse synthetic accessibility (SA) scores than the SMILES-based methods, particularly after fine-tuning on the smaller LIGAND set. The authors attribute this to uncommon ring systems generated when the model handles long-distance dependencies. A kekulization issue also causes a small fraction of molecules to fail scaffold matching: aromatic bond inference during sanitization can alter the scaffold substructure. Under single-objective RL on affinity alone (without the QED objective), the model generates molecules with molecular weight exceeding 500 Da, reducing drug-likeness. All bioactivity predictions rely on a random forest model rather than experimental validation, and the t-SNE analysis suggests some generated molecules fall outside the model&rsquo;s applicability domain.</p>
<h3 id="future-directions">Future directions</h3>
<p>The authors propose extending the Graph Transformer to accept protein information as input via proteochemometric modeling, enabling design of ligands for targets without known ligands. Lead optimization, where a &ldquo;hit&rdquo; serves as input to generate improved analogs, is also identified as a natural extension.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v27</td>
          <td>~1.7M molecules (9.3M scaffold-molecule pairs)</td>
          <td>Preprocessed via RDKit</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LIGAND set (A2A AR ligands from ChEMBL)</td>
          <td>10,828 ligands (53,888 pairs)</td>
          <td>Split 8:1:1 train/val/test</td>
      </tr>
      <tr>
          <td>Bioactivity labels</td>
          <td>ChEMBL A2A AR activity data</td>
          <td>pX threshold = 6.5</td>
          <td>Average pChEMBL values</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fragment decomposition: BRICS algorithm via RDKit (max 4 fragments per molecule)</li>
<li>Optimizer: Adam with learning rate $10^{-4}$, batch size 256</li>
<li>Pre-training: 20 epochs; fine-tuning: up to 1,000 epochs with early stopping (patience: 100 epochs)</li>
<li>Bioactivity predictor: random forest regression (scikit-learn) with 2048D ECFP6 + 19D physicochemical descriptors</li>
<li>Pareto-based multi-objective ranking with GPU acceleration</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Graph Transformer: 512 hidden units, 8 attention heads, $d_{k} = d_{v} = 64$</li>
<li>Sequential Transformer: same hidden size, sinusoidal positional encoding</li>
<li>LSTM-BASE / LSTM+ATTN: 128 embedding units, 512 hidden units, 3 recurrent layers</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Graph Transformer</th>
          <th>Best SMILES Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (fine-tuned)</td>
          <td>100.0%</td>
          <td>99.6% (LSTM-BASE 1024)</td>
          <td>Valence checking guarantees validity</td>
      </tr>
      <tr>
          <td>Accuracy (fine-tuned)</td>
          <td>99.2%</td>
          <td>94.3% (Seq. Transformer 1024)</td>
          <td>Scaffold containment</td>
      </tr>
      <tr>
          <td>Desirability (RL, $\varepsilon$=0.0)</td>
          <td>74.6%</td>
          <td>N/A</td>
          <td>Only Graph Transformer used for RL</td>
      </tr>
      <tr>
          <td>Diversity (RL)</td>
          <td>0.879</td>
          <td>N/A</td>
          <td>Solow-Polasky index</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware-1">Hardware</h3>
<p>NVIDIA Tesla P100 GPUs. Specific training times were not reported, but the Transformer models trained faster than LSTM models with the same hidden layer size.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CDDLeiden/DrugEx">CDDLeiden/DrugEx</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (v1, v2, v3)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v27</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Pre-training data source</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P., &amp; van Westen, G. J. P. (2023). DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. <em>Journal of Cheminformatics</em>, 15, 24. <a href="https://doi.org/10.1186/s13321-023-00694-z">https://doi.org/10.1186/s13321-023-00694-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2023drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00694-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DeepSMILES: Adapting SMILES Syntax for Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</guid><description>DeepSMILES modifies SMILES syntax to eliminate unbalanced parentheses and unpaired ring closures, reducing invalid outputs from generative molecular models.</description><content:encoded><![CDATA[<h2 id="a-new-molecular-string-notation-for-generative-models">A New Molecular String Notation for Generative Models</h2>
<p>This is a <strong>Method</strong> paper that introduces DeepSMILES, a modified SMILES syntax designed to reduce the rate of syntactically invalid strings produced by machine-learning generative models. The primary contribution is a pair of string-level transformations (for ring closures and for branches) that can be applied independently and interconverted with standard SMILES without loss of information, including stereochemistry.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p>Deep neural networks for de novo molecular design commonly operate on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational autoencoders</a> (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al., 2018</a>), recurrent neural networks with LSTM (<a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al., 2018</a>; Olivecrona et al., 2017), and grammar-based approaches (<a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Kusner et al., 2017</a>) all generate molecules by sampling character sequences. A persistent problem is that many generated strings are syntactically invalid SMILES, with reported validity rates ranging from 7% to 80%.</p>
<p>Two structural features of SMILES syntax are responsible for most invalid strings:</p>
<ol>
<li><strong>Balanced parentheses</strong>: Branches require matched open/close parenthesis pairs. A generative model must track nesting state across long sequences to produce valid brackets.</li>
<li><strong>Paired ring closure symbols</strong>: Rings require two identical digit tokens at corresponding positions. The model must remember which digits are &ldquo;open&rdquo; and close them appropriately.</li>
</ol>
<p>Grammar-based approaches (e.g., <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>) can enforce balanced parentheses through a context-free grammar, but they cannot enforce the ring closure pairing constraint because that constraint is context-sensitive. Syntax-directed approaches (Dai et al., 2018) add explicit ring closure constraints but at the cost of significantly more complex decoder architectures.</p>
<h2 id="core-innovation-postfix-branch-notation-and-single-ring-closure-symbols">Core Innovation: Postfix Branch Notation and Single Ring Closure Symbols</h2>
<p>DeepSMILES addresses both syntax problems through two independent string transformations.</p>
<h3 id="ring-closure-transformation">Ring closure transformation</h3>
<p>Standard SMILES uses a pair of identical digits to mark ring openings and closings (e.g., <code>c1ccccc1</code> for benzene). DeepSMILES eliminates the ring-opening digit and replaces the ring-closing digit with the ring size, counting back along the tree path to the ring-opening atom. Benzene becomes <code>cccccc6</code>, where <code>6</code> means &ldquo;connect to the atom 6 positions back.&rdquo;</p>
<p>This transformation has three key properties:</p>
<ul>
<li>Every ring of a given size always uses the same digit, regardless of context. A phenyl ring is always <code>cccccc6</code> in DeepSMILES, whereas in SMILES it might be <code>c1ccccc1</code>, <code>c2ccccc2</code>, <code>c3ccccc3</code>, etc.</li>
<li>A single symbol cannot be &ldquo;unmatched&rdquo; since there is no corresponding opening symbol.</li>
<li>For double-digit ring sizes, the <code>%N</code> notation is used (and <code>%(N)</code> for sizes above 99).</li>
</ul>
<p>Bond stereochemistry is preserved by moving any explicit or stereo bond from the eliminated ring-opening symbol to the ring-closing symbol, with direction adjusted as needed.</p>
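<p>The counting rule is easy to see in code. The sketch below is a toy, not the reference implementation (the <code>deepsmiles</code> package): it handles only single-digit ring closures in branchless SMILES with single-character atoms.</p>

```python
def rings_to_deepsmiles(smiles: str) -> str:
    # Toy encoder: drop the ring-opening digit and replace the closing
    # digit with the ring size, i.e. how many atoms back along the
    # chain the ring-opening atom sits.
    open_at = {}        # ring-closure digit -> index of its opening atom
    atoms, out = [], []
    for ch in smiles:
        if ch.isdigit():
            if ch in open_at:
                out.append(str(len(atoms) - open_at.pop(ch)))
            else:
                open_at[ch] = len(atoms) - 1  # digit follows its atom
        else:
            atoms.append(ch)
            out.append(ch)
    return "".join(out)

print(rings_to_deepsmiles("c1ccccc1"))  # -> cccccc6 (benzene)
print(rings_to_deepsmiles("C1CC1"))     # -> CCC3 (cyclopropane)
```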
<h3 id="branch-parenthesis-transformation">Branch (parenthesis) transformation</h3>
<p>Standard SMILES uses matched open/close parenthesis pairs for branches (e.g., <code>C(OC)(SC)F</code>). DeepSMILES replaces this with a postfix notation inspired by Reverse Polish Notation (RPN). Only close parentheses are used, and the number of consecutive close parentheses indicates how far back on the current branch the next atom attaches.</p>
<p>For example, <code>C(OC)(SC)F</code> becomes <code>COC))SC))F</code>. The interpretation uses a stack: atoms are pushed onto the stack as they are read, each close parenthesis pops one atom from the stack, and the next atom connects to whatever is on top of the stack.</p>
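<p>The stack interpretation can be sketched directly. This toy decoder (an illustration, not the reference implementation) recovers the bond list for the example above, assuming single-character atoms:</p>

```python
def decode_branches(deepsmiles: str):
    # Toy decoder for DeepSMILES branch notation: atoms push onto a
    # stack, each ')' pops one atom, and each new atom bonds to the
    # atom currently on top of the stack.
    stack, atoms, bonds = [], [], []
    for ch in deepsmiles:
        if ch == ")":
            stack.pop()              # step back one atom on the current path
        else:
            idx = len(atoms)
            atoms.append(ch)
            if stack:
                bonds.append((stack[-1], idx))
            stack.append(idx)
    return atoms, bonds

atoms, bonds = decode_branches("COC))SC))F")
print(bonds)  # -> [(0, 1), (1, 2), (0, 3), (3, 4), (0, 5)]
```

<p>The recovered connectivity matches <code>C(OC)(SC)F</code>: the central carbon (index 0) bonds to the O, S, and F branches.</p>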
<h3 id="stereochemistry-preservation">Stereochemistry preservation</h3>
<p>Tetrahedral stereochemistry is fully preserved through the transformations. When ring closure symbol reordering would change the stereo configuration, the <code>@</code>/<code>@@</code> annotation is inverted during encoding to compensate.</p>
<h3 id="independence-of-transformations">Independence of transformations</h3>
<p>The two transformations are independent and can be applied separately or together. Any application of DeepSMILES should specify which transformations were applied.</p>
<h2 id="roundtrip-validation-on-chembl-23">Roundtrip Validation on ChEMBL 23</h2>
<p>The authors validated DeepSMILES by roundtripping all entries in the ChEMBL 23 database through SMILES-to-DeepSMILES-to-SMILES conversion. Canonical SMILES (including stereochemistry) were generated by four independent cheminformatics toolkits: CDK, OEChem, Open Babel, and RDKit. Using multiple toolkits ensures coverage of different traversal orders and ring closure ordering conventions.</p>
<p>All SMILES strings roundtripped without error across all three configurations (branches only, rings only, both). The exact string representation may differ in ring closure digit assignment or digit ordering, sometimes with an associated stereo inversion at tetrahedral centers, but the canonical SMILES of the original and roundtripped molecules are identical.</p>
<h3 id="performance-characteristics">Performance characteristics</h3>
<p>The following table shows the effect of DeepSMILES conversion on string length and throughput, measured on canonical SMILES from Open Babel for ChEMBL 23:</p>
<table>
  <thead>
      <tr>
          <th>Transformation</th>
          <th>Mean % change in length</th>
          <th>Encoding (per sec)</th>
          <th>Decoding (per sec)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Branches only</td>
          <td>+8.2%</td>
          <td>32,000</td>
          <td>16,000</td>
      </tr>
      <tr>
          <td>Rings only</td>
          <td>-6.4%</td>
          <td>26,000</td>
          <td>24,000</td>
      </tr>
      <tr>
          <td>Both</td>
          <td>+1.9%</td>
          <td>26,000</td>
          <td>17,500</td>
      </tr>
  </tbody>
</table>
<p>The ring transformation slightly shortens strings (by removing one digit per ring), while the branch transformation slightly lengthens them (additional close parentheses). Combined, the net effect is a small increase of about 2%. Throughput is in the tens of thousands of conversions per second in pure Python.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>DeepSMILES does not eliminate all invalid strings. Invalid DeepSMILES can still be generated, for example when there are more close parentheses than atoms on the stack, or when a ring size exceeds the number of available atoms. The reference implementation raises a <code>DecodeError</code> in these cases, though the authors note that a more tolerant decoder (ignoring extra parentheses or defaulting to the first atom for oversized rings) could be used during generation.</p>
<p>The paper assumes that input SMILES are generated by a standard cheminformatics toolkit as a depth-first traversal of the molecular graph. Non-standard SMILES (e.g., <code>CC(C1)CCCC1</code>) cannot be directly encoded.</p>
<p>The authors suggest several directions for future work:</p>
<ul>
<li>Investigating whether a preferred traversal order (e.g., shorter branches first) would make DeepSMILES even easier for models to learn.</li>
<li>Exploring notations where atoms in the organic subset explicitly list their hydrogen count, which would allow a fully parenthesis-free representation.</li>
<li>Using SMILES augmentation with random traversal orders (as explored by Bjerrum and Threlfall, 2017) in combination with DeepSMILES.</li>
<li>Designing entirely new line notations optimized for ML, where every string maps to a valid molecule, there are few duplicate representations, small string changes produce small structural changes, and string length correlates with pharmaceutical relevance.</li>
</ul>
<p>The fused ring case presents additional complexity: a bicyclic system has three cycles, and depending on traversal order, the ring size digit may not directly correspond to the ring size of any individual ring. This is an inherent limitation of depth-first traversal-based notations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation</td>
          <td>ChEMBL 23</td>
          <td>~1.7M compounds</td>
          <td>Canonical SMILES from CDK, OEChem, Open Babel, RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The DeepSMILES encoder and decoder are pure string-processing algorithms with no machine-learning components. The transformations operate on SMILES syntax tokens (atoms, bonds, parentheses, ring closure digits) without chemical interpretation.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Roundtrip accuracy</td>
          <td>100%</td>
          <td>All ChEMBL 23 entries across 4 toolkits</td>
      </tr>
      <tr>
          <td>Encoding throughput</td>
          <td>26,000-32,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
      <tr>
          <td>Decoding throughput</td>
          <td>16,000-24,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>No specific hardware requirements. The implementation is a pure Python module with no GPU dependencies.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/nextmovesoftware/deepsmiles">deepsmiles</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Pure Python encoder/decoder</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: O&rsquo;Boyle, N. M., &amp; Dalke, A. (2018). DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv.7097960.v1">https://doi.org/10.26434/chemrxiv.7097960.v1</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oboyle2018deepsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{O&#39;Boyle, Noel M. and Dalke, Andrew}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv.7097960.v1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Curriculum Learning for De Novo Drug Design (REINVENT)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/curriculum-learning-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/curriculum-learning-molecular-design/</guid><description>Curriculum learning applied to REINVENT accelerates convergence on complex multi-parameter drug design objectives compared to standard reinforcement learning.</description><content:encoded><![CDATA[<h2 id="curriculum-learning-as-a-method-for-molecular-generation">Curriculum Learning as a Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces curriculum learning (CL) into the <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo molecular design platform. The primary contribution is a training strategy that decomposes complex multi-parameter optimization (MPO) objectives into sequences of simpler tasks with increasing complexity. The agent learns each simpler task before progressing to the full production objective, accelerating convergence and improving the quality and diversity of generated molecules compared to standard policy-based reinforcement learning (RL).</p>
<h2 id="the-computational-cost-of-complex-reward-functions">The Computational Cost of Complex Reward Functions</h2>
<p>Policy-based RL for molecular design works by training a generative model (the agent) to produce molecules that maximize a reward function. In practice, drug design reward functions often include computationally expensive components such as molecular docking. When the reward landscape is complex and minima are difficult to find, the agent may spend many epochs sampling molecules far from the desired objective. The resulting small gradients cause minimal policy updates, leading to long periods of non-productivity. This is particularly wasteful when each reward evaluation involves expensive physics-based computations.</p>
<p>The core problem is that standard RL treats the full MPO objective as a monolithic task. If the agent cannot find any rewarding molecules early in training, it receives near-zero gradients and makes negligible progress. This creates a bootstrapping problem: the agent needs to already be sampling from favorable regions of chemical space to receive useful learning signals, but it has no guidance on how to get there.</p>
<p>Curriculum learning, originally proposed by Bengio et al. (2009), addresses this by arranging training tasks in order of increasing difficulty. When constituent tasks are correlated with the final objective, the gradients from simpler tasks provide more effective traversal of the optimization landscape.</p>
<h2 id="formalized-curriculum-strategy-for-reinvent">Formalized Curriculum Strategy for REINVENT</h2>
<p>The key innovation is a two-phase training protocol with formal definitions for curriculum progression.</p>
<p>A scoring function maps SMILES strings to desirability scores in $[0, 1]$ using a weighted geometric mean:</p>
<p>$$S(x) = \left(\prod_{i=1}^{n} c_{i}(x)^{w_{i}}\right)^{1 / \sum_{i=1}^{n} w_{i}}$$</p>
<p>where $x$ is a sampled compound in SMILES format, $c_{i}$ is the $i$-th scoring component, and $w_{i}$ is its weight.</p>
<p>A Curriculum $C$ consists of a sequence of Objectives $O = \{O_{C_1}, \ldots, O_{C_n}, O_{P}\}$, where subscripts $C$ and $P$ denote Curriculum and Production Objectives respectively. Each Objective has a corresponding scoring function. Progression is controlled by Curriculum Progression Criteria $P = \{P_{1}, \ldots, P_{n}\}$, where each $P_{i}$ defines a score threshold the agent must achieve before advancing to the next objective.</p>
<p><strong>Curriculum Phase</strong>: The agent trains on sequential Curriculum Objectives with increasing complexity. A diversity filter is not applied during this phase, as it could be counterproductive to guiding the agent toward favorable chemical space. No computationally expensive components (e.g., docking) are used in Curriculum Objectives.</p>
<p><strong>Production Phase</strong>: Activated only when the final Curriculum Progression Criterion $P_{n}$ is satisfied. The agent now optimizes the full Production Objective, which may include expensive components like molecular docking. A new inception memory is initialized (clearing Curriculum Phase compounds), and a Bemis-Murcko scaffold diversity filter is applied to encourage exploration across multiple local minima.</p>
<p>The implementation builds on REINVENT&rsquo;s RNN architecture: three hidden layers of 512 LSTM cells with an embedding size of 256 and a linear layer with softmax activation, pretrained on ChEMBL to learn SMILES syntax.</p>
<h2 id="three-experiments-on-pdk1-inhibitor-design">Three Experiments on PDK1 Inhibitor Design</h2>
<p>The authors evaluate CL on three molecular design tasks of increasing complexity, all centered on designing <a href="https://en.wikipedia.org/wiki/PDPK1">3-phosphoinositide-dependent protein kinase-1</a> (PDK1) inhibitors.</p>
<h3 id="experiment-1-target-scaffold-construction">Experiment 1: Target Scaffold Construction</h3>
<p>The goal is to generate compounds possessing a dihydro-pyrazoloquinazoline scaffold with a phenyl substituent, a scaffold not present in the prior&rsquo;s training set. Standard RL fails entirely over 2000 epochs because the probability of randomly sampling a compound with this scaffold is negligibly small, producing binary rewards (1.0 if scaffold present, 0.5 otherwise) that never rise above 0.5.</p>
<p>CL decomposes the target scaffold into 5 progressively complex substructures. Each Curriculum Progression Criterion threshold is set to 0.8. The agent learns to generate compounds with each substructure before advancing. CL finds the target scaffold within 1750 epochs, while baseline RL cannot find it in the same timeframe.</p>
<h3 id="experiments-2-and-3-molecular-docking-constraints">Experiments 2 and 3: Molecular Docking Constraints</h3>
<p>These experiments use a Production Objective combining a molecular docking constraint (retaining two hydrogen-bonding interactions with Ala 162 of PDK1, PDB ID: 2XCH) and QED (Quantitative Estimate of Druglikeness). Both experiments limit computational cost by capping production epochs at 300.</p>
<p><strong>Experiment 2</strong> uses Tanimoto (2D) similarity to a reference ligand as the Curriculum Objective. Two scenarios are tested: &ldquo;Low&rdquo; (threshold 0.5) and &ldquo;High&rdquo; (threshold 0.8).</p>
<p><strong>Experiment 3</strong> uses ROCS (3D) shape-based similarity to the reference ligand as the Curriculum Objective, with &ldquo;Low&rdquo; (0.5) and &ldquo;High&rdquo; (0.75) thresholds.</p>
<p>All experiments are run in triplicate. The baseline includes both standard RL and RL with Tanimoto/ROCS components added directly to the scoring function (not sequentially), to control for the presence of these components.</p>
<p>Across all CL experiments, CL generates between 2,941 and 9,068 more compounds with docking scores better than the reference ligand (-10.907 kcal/mol) compared to baseline RL, corresponding to 12.42-23.79% improvement in the fraction of compounds exceeding the reference. Comparing the two Curriculum Objective scenarios, the &ldquo;High&rdquo; threshold outperforms the &ldquo;Low&rdquo; threshold by 316-3,415 additional compounds (with percentage improvements ranging from -0.4% to 10.57%).</p>
<p>Baseline RL produces essentially no compounds satisfying the docking constraint for the first 100 epochs. CL agents achieve immediate productivity: in the &ldquo;High&rdquo; Tanimoto scenario, the initial docking score already exceeds the maximum score achieved by baseline RL over 300 epochs.</p>
<h3 id="scaffold-diversity-analysis">Scaffold Diversity Analysis</h3>
<p>CL generates more unique Bemis-Murcko scaffolds than baseline RL in all experiments. The &ldquo;High&rdquo; scenarios produce more unique scaffolds than the &ldquo;Low&rdquo; scenarios. CL also produces a higher fraction of &ldquo;favorable&rdquo; scaffolds (those with better docking scores than the reference ligand).</p>
<h2 id="accelerated-convergence-with-a-diversity-trade-off">Accelerated Convergence with a Diversity Trade-off</h2>
<p>The results demonstrate three consistent findings across all experiments:</p>
<ol>
<li>
<p><strong>Accelerated productivity</strong>: CL agents reach productive sampling of favorable compounds substantially faster than baseline RL. Even a single Curriculum Objective with a computationally inexpensive metric can guide the agent to regions of chemical space where expensive Production Objectives are readily satisfied.</p>
</li>
<li>
<p><strong>Improved output quality</strong>: CL generates more compounds with favorable docking scores, more unique scaffolds, and a higher fraction of scaffolds that outperform the reference ligand.</p>
</li>
<li>
<p><strong>Controllable trade-off</strong>: The Curriculum Progression Criterion threshold provides direct control over agent policy. Higher thresholds produce better Production Objective optimization but reduce intra-set diversity (higher cross-Tanimoto similarities among generated compounds). UMAP visualizations confirm that &ldquo;Low&rdquo; and &ldquo;High&rdquo; scenarios sample from nearby but distinct regions of chemical space.</p>
</li>
</ol>
<p>The authors note that even moderate optimization of similarity-based Curriculum Objectives (the &ldquo;Low&rdquo; scenarios) already substantially narrows the agent&rsquo;s perceived solution space. This suggests that CL inherently regularizes the agent policy, trading some diversity for convergence speed.</p>
<p><strong>Limitations</strong>: The authors acknowledge that data supporting the findings are available only upon request rather than as public deposits. The approach is demonstrated on a single target (PDK1), and the curriculum design requires domain expertise to decompose objectives appropriately. The inverse relationship between Curriculum Objective optimization and solution diversity means practitioners must carefully tune thresholds for their specific applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>Not specified</td>
          <td>Used to pretrain the RNN on SMILES syntax</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 receptor crystal structure</td>
      </tr>
  </tbody>
</table>
<p>Raw data supporting the findings are available from the corresponding author upon request.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>REINVENT platform with LSTM-based RNN (3 hidden layers, 512 cells, embedding size 256)</li>
<li>Scoring function: weighted geometric mean of components</li>
<li>Curriculum Progression Criteria: score thresholds (0.5 or 0.75-0.8 depending on scenario)</li>
<li>Diversity filter: Identical Murcko Scaffold with bucket size 25 (Production Phase only)</li>
<li>Inception (experience replay) for both phases, reset at phase transition</li>
<li>Batch size: 128, learning rate: 0.0001, sigma: 128, Adam optimizer</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Prior: RNN pretrained on ChEMBL SMILES</li>
<li>Agent: Initialized from prior, focused via RL/CL</li>
<li>No pretrained model weights are publicly released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score (Glide SP)</td>
          <td>Predicted binding affinity (kcal/mol)</td>
          <td>Lower is better; reference ligand: -10.907</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimate of Druglikeness</td>
          <td>Range [0, 1]</td>
      </tr>
      <tr>
          <td>Unique Bemis-Murcko scaffolds</td>
          <td>Scaffold diversity measure</td>
          <td>Averaged over triplicates</td>
      </tr>
      <tr>
          <td>Cross-Tanimoto similarity</td>
          <td>Intra-set compound diversity</td>
          <td>Calculated on pooled triplicates</td>
      </tr>
      <tr>
          <td>Tanimoto/ROCS similarity</td>
          <td>Curriculum Objective metrics</td>
          <td>2D fingerprint and 3D shape similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>GPU: NVIDIA Tesla V100 (32 GB)</li>
<li>Docking: AWS p3.8xlarge instance</li>
<li>LigPrep parallelized over 8 CPU cores</li>
<li>Glide docking parallelized over 48 CPU cores via DockStream</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>De novo molecular design platform</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/Automated_Curriculum_Learning_Demo.ipynb">CL Tutorial Notebook</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Jupyter notebook tutorial for curriculum learning</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Fialková, V., Arango, J. D., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2022). Improving de novo molecular design with curriculum learning. <em>Nature Machine Intelligence</em>, 4, 555-563. <a href="https://doi.org/10.1038/s42256-022-00494-4">https://doi.org/10.1038/s42256-022-00494-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2022curriculum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving de novo molecular design with curriculum learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Fialkov{\&#39;a}, Vendy and Arango, Juan Diego and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{555--563}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00494-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CogMol: Controlled Molecule Generation for COVID-19</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/cogmol-target-specific-drug-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/cogmol-target-specific-drug-design/</guid><description>CogMol combines a SMILES VAE with controlled latent space sampling to generate drug-like molecules with target specificity for novel viral proteins.</description><content:encoded><![CDATA[<h2 id="a-controlled-generation-framework-for-target-specific-drug-design">A Controlled Generation Framework for Target-Specific Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.</p>
<h2 id="multi-constraint-drug-design-for-novel-viral-targets">Multi-Constraint Drug Design for Novel Viral Targets</h2>
<p>Traditional drug discovery costs 2-3 billion USD and takes over a decade with less than 10% success rate. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.</p>
<p>The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.</p>
<h2 id="controlled-latent-space-sampling-with-pre-trained-protein-embeddings">Controlled Latent Space Sampling with Pre-trained Protein Embeddings</h2>
<p>CogMol&rsquo;s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:</p>
<p><strong>1. SMILES VAE with adaptive pre-training.</strong> A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:</p>
<p>$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z)) \right\}$$</p>
<p>where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.</p>
<p><strong>2. Protein-molecule binding affinity predictor.</strong> A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.</p>
<p><strong>3. CLaSS controlled sampling.</strong> The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:</p>
<p>$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) \, p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) \, p_\theta(\mathbf{x} | \mathbf{z})]$$</p>
<p>where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, combined via Bayes&rsquo; rule under a conditional independence assumption. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without training a surrogate model or a policy.</p>
<p><strong>Selectivity modeling.</strong> Off-target selectivity for a molecule $m$ against target $T$ is defined as:</p>
<p>$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$</p>
<p>where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.</p>
<h2 id="experimental-setup-covid-19-targets-and-in-silico-screening">Experimental Setup: COVID-19 Targets and In Silico Screening</h2>
<p><strong>Target proteins.</strong> CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.</p>
<p><strong>Training data.</strong> The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.</p>
<p><strong>CLaSS controlled generation.</strong> Molecules were generated with simultaneous constraints on binding affinity (&gt; 0.5 normalized), QED (&gt; 0.8 normalized), and selectivity (&gt; 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.</p>
<p><strong>In silico screening pipeline.</strong> Generated molecules underwent:</p>
<ul>
<li>Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure</li>
<li>Binding affinity rescoring with a higher-accuracy SMILES-level predictor</li>
<li>Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures</li>
<li>Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data</li>
</ul>
<p><strong>Baselines.</strong> VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>CLaSS enrichment (Table 1).</strong> CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity &gt; 0.5, QED &gt; 0.8, selectivity &gt; 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>CLaSS (Aff+QED+Sel)</th>
          <th>Random (Aff+QED+Sel)</th>
          <th>Enrichment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NSP9</td>
          <td>6.9%</td>
          <td>0.7%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>RBD</td>
          <td>9.0%</td>
          <td>0.9%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>Mpro</td>
          <td>10.4%</td>
          <td>1.1%</td>
          <td>~9.5x</td>
      </tr>
  </tbody>
</table>
<p><strong>Docking results (Table 3).</strong> 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.</p>
<p><strong>Novelty.</strong> The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also showed novel chemical scaffolds as confirmed by high Fr&eacute;chet ChemNet Distance.</p>
<p><strong>Synthesizability.</strong> Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.</p>
<p><strong>Toxicity.</strong> Approximately 70% of generated parent molecules and ~80% of predicted metabolites were predicted toxic in at most 1 of the 13 endpoints, comparable to FDA-approved drugs.</p>
<h2 id="generated-molecules-show-favorable-binding-and-drug-like-properties">Generated Molecules Show Favorable Binding and Drug-Like Properties</h2>
<p>CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:</p>
<ol>
<li>CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).</li>
<li>Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.</li>
<li>Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.</li>
<li>The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.</li>
</ol>
<p><strong>Limitations.</strong> The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor&rsquo;s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.</p>
<p><strong>Future directions.</strong> The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VAE pre-training</td>
          <td>MOSES/ZINC</td>
          <td>1.6M train, 176K test</td>
          <td>Publicly available benchmark</td>
      </tr>
      <tr>
          <td>VAE adaptive training</td>
          <td>BindingDB (DeepAffinity split)</td>
          <td>~27K protein-ligand pairs</td>
          <td>Curated IC50 data</td>
      </tr>
      <tr>
          <td>Protein embeddings</td>
          <td>UniRef50 via UniRep</td>
          <td>24M sequences</td>
          <td>Pre-trained, publicly available</td>
      </tr>
      <tr>
          <td>Toxicity prediction</td>
          <td>Tox21 + ClinTox</td>
          <td>12 in vitro + clinical endpoints</td>
          <td>Public benchmark datasets</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>3 SARS-CoV-2 targets</td>
          <td>Public crystal structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors</li>
<li>CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers</li>
<li>Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings</li>
<li>Selectivity: excess binding affinity over average of $k$ random off-targets</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>SMILES VAE with adaptive pre-training (ZINC then BindingDB)</li>
<li>Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints</li>
<li>Binding affinity predictor (latent-level for generation, SMILES-level for screening)</li>
<li>Retrosynthetic predictor based on Molecular Transformer</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>90%</td>
          <td>-</td>
          <td>Generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99%</td>
          <td>-</td>
          <td>Among valid molecules</td>
      </tr>
      <tr>
          <td>Filter pass</td>
          <td>95%</td>
          <td>-</td>
          <td>Relevant chemical filters</td>
      </tr>
      <tr>
          <td>Docking BFE &lt; -6 kcal/mol</td>
          <td>87-95%</td>
          <td>-</td>
          <td>Varies by target</td>
      </tr>
      <tr>
          <td>Synthetic feasibility</td>
          <td>85-90%</td>
          <td>78% (FDA drugs)</td>
          <td>COVID-19 targets</td>
      </tr>
      <tr>
          <td>Low toxicity (0-1 endpoints)</td>
          <td>~70% parent, ~80% metabolite</td>
          <td>Comparable to FDA drugs</td>
          <td>MT-DNN prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU types or training times. The work was funded internally by IBM Research.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">CogMol (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">~3500 generated molecules</a></td>
          <td>Dataset</td>
          <td>Open license</td>
          <td>For three SARS-CoV-2 targets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., &amp; Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. <em>Advances in Neural Information Processing Systems</em>, 33, 4320-4332.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chenthamarakshan2020cogmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\&#39;c}, Aleksandra}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4320--4332}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CDDD: Learning Descriptors by Translating SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/</guid><description>CDDD learns continuous molecular descriptors by translating between SMILES and InChI representations, outperforming fingerprints in virtual screening.</description><content:encoded><![CDATA[<h2 id="a-translation-based-method-for-learned-molecular-descriptors">A Translation-Based Method for Learned Molecular Descriptors</h2>
<p>This is a <strong>Method</strong> paper that introduces Continuous and Data-Driven Descriptors (CDDD), a neural machine translation approach for learning fixed-size, continuous molecular representations. Rather than training an autoencoder to reconstruct <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, Winter et al. train an encoder-decoder model to translate between semantically equivalent but syntactically different molecular representations (e.g., randomized SMILES to canonical SMILES, or <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> to canonical SMILES). The bottleneck latent vector serves as a general-purpose molecular descriptor. Pretrained on approximately 72 million compounds from <a href="/notes/chemistry/datasets/zinc-22/">ZINC15</a> and PubChem, CDDD produces 512-dimensional descriptors that achieve competitive QSAR performance and significantly outperform all tested molecular fingerprints in ligand-based virtual screening.</p>
<h2 id="why-translation-instead-of-reconstruction">Why Translation Instead of Reconstruction?</h2>
<p>Molecular descriptors are central to cheminformatics. Traditional approaches rely on human-engineered fingerprints like ECFPs, which encode structural features as fixed-length bit vectors. While effective, these representations are constrained by predefined feature extraction rules.</p>
<p>Recent work applied deep neural networks directly to molecular graphs or SMILES strings to learn task-specific representations. However, these end-to-end approaches must learn features from scratch for each new dataset, making them prone to overfitting on the small bioactivity datasets typical in drug discovery.</p>
<p>Unsupervised approaches based on autoencoders (notably <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.&rsquo;s VAE</a> and <a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Xu et al.&rsquo;s seq2seq model</a>) offered a path toward general-purpose learned descriptors. These models reconstruct SMILES strings through an information bottleneck, forcing the latent space to capture molecular information. The concern with reconstruction, however, is that the model may focus on syntactic patterns of the string representation rather than the underlying chemical semantics. A model that memorizes SMILES syntax shortcuts can achieve low reconstruction error without truly encoding chemical meaning.</p>
<p>Winter et al. address this by drawing on the analogy to neural machine translation: a translator must understand the meaning of a sentence to produce a correct translation in another language. By training the model to translate between different molecular representations (which share chemical semantics but differ in syntax), the latent space is forced to capture the chemical information common to both representations, rather than representation-specific syntactic artifacts.</p>
<h2 id="translation-as-semantic-compression">Translation as Semantic Compression</h2>
<p>The core insight is that translating between two syntactically different but semantically equivalent representations forces the encoder to capture only the chemical meaning shared by both. The model architecture follows the standard encoder-decoder framework from neural machine translation.</p>
<p>The encoder reads a source molecular string (e.g., a randomized SMILES or InChI) and compresses it into a fixed-size latent vector. The decoder takes this latent vector and generates the target molecular string (canonical SMILES). The model is trained to minimize character-level cross-entropy between the decoder output and the target sequence.</p>
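<p>The character-level training objective can be sketched numerically as follows (toy distributions; a real decoder emits a softmax over the full SMILES vocabulary at every step):</p>

```python
import math

def char_cross_entropy(probs_per_step, target):
    """Character-level cross-entropy between decoder outputs and the target
    SMILES (toy distributions; a real decoder emits a softmax over the
    SMILES vocabulary at each step)."""
    return -sum(math.log(p[ch]) for p, ch in zip(probs_per_step, target)) / len(target)

# One toy probability distribution per decoding step, for the target "CO".
vocab_probs = [
    {"C": 0.7, "O": 0.2, ")": 0.1},
    {"C": 0.1, "O": 0.8, ")": 0.1},
]
loss = char_cross_entropy(vocab_probs, "CO")
print(round(loss, 4))  # 0.2899, the mean of -ln(0.7) and -ln(0.8)
```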
<p>Four translation tasks were evaluated:</p>
<ol>
<li><strong>Randomized SMILES to canonical SMILES</strong> (best performing)</li>
<li><strong>InChI to canonical SMILES</strong></li>
<li><strong>Canonical SMILES to canonical SMILES</strong> (autoencoding baseline)</li>
<li><strong>Canonical SMILES to InChI</strong> (failed to learn)</li>
</ol>
<p>The final model uses an RNN encoder with 3 stacked GRU layers (512, 1024, and 2048 units). The concatenated cell states pass through a fully connected layer with tanh activation to produce a 512-dimensional latent vector. The decoder mirrors this architecture, initializing its GRU states from the latent vector via separate fully connected layers. Teacher forcing is used during training, and left-to-right beam search is used at inference.</p>
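<p>The dimensions above imply the following bottleneck computation, shown here as an illustrative sketch with random weights (not the trained model):</p>

```python
import numpy as np

# Illustrative sketch of the CDDD bottleneck. The GRU layer sizes
# (512, 1024, 2048) and the 512-dim tanh latent follow the paper;
# the weights here are random placeholders, not the trained model.
rng = np.random.default_rng(0)

state_dims = [512, 1024, 2048]
cell_states = [rng.standard_normal(d) for d in state_dims]

concat = np.concatenate(cell_states)            # shape (3584,)
W = rng.standard_normal((concat.size, 512)) * 0.01
b = np.zeros(512)

latent = np.tanh(concat @ W + b)                # 512-dim descriptor-like vector
print(latent.shape)
```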
<p>An auxiliary property prediction network takes the latent vector as input and predicts nine molecular properties (logP, partial charges, valence electrons, H-bond donors/acceptors, Balaban&rsquo;s J, <a href="https://en.wikipedia.org/wiki/Molar_refractivity">molar refractivity</a>, TPSA). This multi-task signal encourages the latent space to encode physically meaningful information. The full training objective combines the translation cross-entropy loss with the property prediction mean squared error:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{translation}} + \mathcal{L}_{\text{properties}}$$</p>
<p>To ensure invariance to input SMILES representation at inference time, the model uses randomized SMILES as input half the time and canonical SMILES the other half during training. Input dropout (15% at the character level) and Gaussian noise (standard deviation 0.05) are applied for regularization.</p>
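<p>The character-level input dropout can be sketched as follows (the mask token is an illustrative assumption; the paper specifies only the 15% rate, and the Gaussian noise is applied separately to the continuous inputs):</p>

```python
import random

def char_dropout(smiles: str, p: float = 0.15, mask: str = "*", seed: int = 0) -> str:
    """Mask roughly 15% of input characters, as in CDDD's input regularization.
    The mask character itself is an illustrative assumption."""
    rng = random.Random(seed)
    return "".join(mask if rng.random() < p else ch for ch in smiles)

noisy = char_dropout("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(noisy)
```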
<h2 id="qsar-benchmarks-virtual-screening-and-latent-space-exploration">QSAR Benchmarks, Virtual Screening, and Latent Space Exploration</h2>
<h3 id="pretraining">Pretraining</h3>
<p>The model was pretrained on approximately 72 million compounds from ZINC15 and PubChem (merged, deduplicated, filtered for organic molecules with MW 12-600, &gt;3 heavy atoms, logP between -7 and 5). All evaluation compounds were removed from the pretraining set.</p>
<h3 id="qsar-experiments">QSAR Experiments</h3>
<p>Ten QSAR datasets were used, spanning classification (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames mutagenicity</a>, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG inhibition</a>, <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB penetration</a>, BACE inhibition, bee toxicity) and regression (EGFR inhibition, <a href="https://en.wikipedia.org/wiki/Plasmodium_falciparum">Plasmodium falciparum</a> inhibition, lipophilicity, aqueous solubility, melting point). Two datasets (Ames and lipophilicity) served as validation for architecture selection; the remaining eight were held out for final evaluation.</p>
<p>CDDD descriptors with an SVM were benchmarked against:</p>
<ul>
<li>Nine circular fingerprint variants (Morgan fingerprints, radius 1-3, folded to 512/1024/2048 bits) with RF, SVM, and GB</li>
<li>Graph convolution models (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">DeepChem</a>)</li>
</ul>
<p>Both random-split and cluster-split (K-means on MACCS fingerprints, K=5) cross-validation were performed.</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Split</th>
          <th>CDDD + SVM</th>
          <th>Best Fingerprint</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ames (ROC-AUC)</td>
          <td>Random</td>
          <td>0.89</td>
          <td>0.89 (ecfc2, RF)</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>hERG (ROC-AUC)</td>
          <td>Random</td>
          <td>0.86</td>
          <td>0.85 (ecfc4, RF)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC)</td>
          <td>Random</td>
          <td>0.93</td>
          <td>0.93 (ecfc2, RF)</td>
          <td>0.92</td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC)</td>
          <td>Random</td>
          <td>0.90</td>
          <td>0.91 (ecfc2, RF)</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Bee toxicity (ROC-AUC)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.91 (ecfc6, RF)</td>
          <td>0.89</td>
      </tr>
      <tr>
          <td>Lipophilicity ($r^2$)</td>
          <td>Random</td>
          <td>0.72</td>
          <td>0.69 (ecfc2, SVM)</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>ESOL ($r^2$)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.58 (ecfc6, SVM)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>Melting point ($r^2$)</td>
          <td>Random</td>
          <td>0.42</td>
          <td>0.38 (ecfc2, SVM)</td>
          <td>0.39</td>
      </tr>
  </tbody>
</table>
<p>CDDD descriptors showed competitive or better performance across all tasks. Notably, CDDD achieved substantially higher $r^2$ on aqueous solubility (0.92 vs. 0.58 for the best fingerprint). The authors emphasize that CDDD&rsquo;s feature extraction was fixed based on two validation tasks, while baseline methods selected the best fingerprint/model combination per task, making the comparison conservative for CDDD.</p>
<h3 id="virtual-screening">Virtual Screening</h3>
<p>Ligand-based virtual screening experiments followed the Riniker et al. benchmarking protocol on 40 DUD targets and 17 MUV targets. Five active compounds were randomly selected per target, and remaining compounds were ranked by similarity (cosine similarity for CDDD, <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto</a> for fingerprints). This process was repeated 50 times per target.</p>
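<p>The two similarity measures used for ranking can be written in a few lines (toy vectors for illustration; the active-set fusion and ranking details follow the Riniker et al. protocol):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity, used to rank compounds by CDDD descriptors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity on the 'on' bits of a binary fingerprint."""
    return len(a & b) / len(a | b)

# Toy query: a descriptor derived from the actives vs. a candidate (illustrative values).
actives_mean = [0.2, -0.1, 0.7]
candidate = [0.1, -0.2, 0.9]
print(round(cosine(actives_mean, candidate), 3))  # 0.983
print(tanimoto({1, 4, 9, 16}, {1, 4, 25}))        # 2 shared of 5 total bits -> 0.4
```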
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>CDDD (ROC-AUC)</th>
          <th>Second Best</th>
          <th>p-value (Wilcoxon)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DUD</td>
          <td>0.949</td>
          <td>0.899 (laval)</td>
          <td>$5 \times 10^{-38}$</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>0.679</td>
          <td>0.677 (ap)</td>
          <td>0.04</td>
      </tr>
  </tbody>
</table>
<p>CDDD significantly outperformed all 14 baseline fingerprints on both databases. The DUD improvement was particularly large (+0.050 ROC-AUC over the next best). On MUV, which is designed to be harder, the advantage was smaller but still statistically significant. Importantly, while the best baseline fingerprint varied between DUD and MUV (laval vs. ap), CDDD ranked first on both, demonstrating consistent performance.</p>
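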
<h3 id="latent-space-exploration">Latent Space Exploration</h3>
<p>The continuous, reversible nature of CDDD enables chemical space navigation. Shifting a molecule&rsquo;s embedding along the first principal component of the pretraining data correlates with molecular size (Spearman $r = 0.947$, $p = 0.00048$), while the second principal component correlates with polarity/logP ($r = -0.916$, $p = 0.00015$).</p>
<p>When shifting 1000 compounds along 100 random directions, the model maintained high valid SMILES generation rates (&gt;97% for the top beam search output, &gt;99% when considering the top 3 outputs). Euclidean distance in the descriptor space correlated smoothly with Tanimoto distance in fingerprint space, confirming that the latent space supports meaningful interpolation.</p>
<h2 id="consistent-learned-descriptors-for-chemistry">Consistent Learned Descriptors for Chemistry</h2>
<p>CDDD demonstrated that translation between molecular representations produces more informative latent spaces than autoencoder reconstruction. The key findings are:</p>
<ol>
<li><strong>Translation outperforms reconstruction</strong>: Models trained on translating between different representations consistently produced better downstream descriptors than autoencoding models, despite autoencoding being an easier task.</li>
<li><strong>Auxiliary property prediction helps</strong>: The additional regression task for molecular properties improved descriptor quality, particularly for physicochemical endpoints correlated with the predicted properties.</li>
<li><strong>Consistent performance</strong>: Unlike baseline methods where the best fingerprint varies by task, CDDD showed consistent performance across all QSAR and VS experiments.</li>
<li><strong>Smooth latent space</strong>: The continuous descriptor space supports meaningful interpolation and chemical space exploration with high valid SMILES rates.</li>
</ol>
<p>The authors acknowledge several limitations. The InChI-to-SMILES translation worked but produced inferior descriptors compared to SMILES-to-SMILES, and SMILES-to-InChI translation failed entirely, likely due to InChI&rsquo;s complex syntax (counting, arithmetic). The approach was only tested with string-based representations; translation between conceptually different representations (e.g., 3D structures) remains future work. The QSAR evaluation, while extensive, used relatively standard datasets, and the method&rsquo;s advantage over graph convolution models was modest on tasks where end-to-end learning had sufficient data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15 + PubChem (merged)</td>
          <td>~72M compounds</td>
          <td>Filtered: organic, MW 12-600, &gt;3 heavy atoms, logP -7 to 5</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Ames mutagenicity</td>
          <td>6,130</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Lipophilicity</td>
          <td>3,817</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>hERG, BBBP, BACE, bee toxicity</td>
          <td>188-3,440</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>EGFR, Plasmodium, ESOL, melting point</td>
          <td>184-4,451</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>DUD</td>
          <td>40 targets</td>
          <td>Ligand-based virtual screening</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>MUV</td>
          <td>17 targets</td>
          <td>Maximum unbiased validation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 3 stacked GRU layers (512, 1024, 2048 units) with tanh bottleneck to 512-dim latent space</li>
<li>Decoder: Matching 3 stacked GRU layers, initialized from latent space</li>
<li>Auxiliary classifier: 3 FC layers (512, 128, 9) predicting molecular properties</li>
<li>Optimizer: Adam, initial LR $5 \times 10^{-4}$, decayed by 0.9 every 50,000 steps</li>
<li>Batch size: 64 with bucketing by sequence length</li>
<li>Input regularization: 15% character dropout + Gaussian noise (std 0.05)</li>
<li>Beam search for decoding at inference</li>
</ul>
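<p>The staircase learning-rate schedule above can be written directly (values from the paper; the function name is ours):</p>

```python
def cddd_lr(step: int, base_lr: float = 5e-4, decay: float = 0.9, every: int = 50_000) -> float:
    """Staircase schedule used to train CDDD: multiply the learning rate
    by 0.9 every 50,000 steps."""
    return base_lr * decay ** (step // every)

print(cddd_lr(0))                   # 0.0005
print(round(cddd_lr(100_000), 8))   # 0.000405
```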
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jrwnter/cddd">CDDD (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pretrained model and extraction code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>QSAR: 5-fold random CV and 5-fold cluster CV (K-means on MACCS, K=5)</li>
<li>Classification metric: ROC-AUC</li>
<li>Regression metric: $r^2$</li>
<li>VS: ROC-AUC averaged over 50 random active set selections per target</li>
<li>Statistical test: <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for VS comparisons</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Framework: TensorFlow 1.4.1</li>
<li>CDDD descriptor extraction on GPU is comparable in speed to RDKit fingerprint extraction on CPU</li>
<li>SVM training on 512-dim CDDD descriptors takes seconds (vs. minutes for 2048-dim fingerprints)</li>
<li>Graph convolution training: ~30 minutes per task on GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Montanari, F., Noe, F., &amp; Clevert, D.-A. (2019). Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. <em>Chemical Science</em>, 10(6), 1692-1701. <a href="https://doi.org/10.1039/C8SC04175J">https://doi.org/10.1039/C8SC04175J</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{winter2019learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and Montanari, Floriane and No{\&#39;e}, Frank and Clevert, Djork-Arn{\&#39;e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1692--1701}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C8SC04175J}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (on the order of $10^6$ seconds to produce 1000 valid molecules for EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
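<p>A sketch of the coordinate tokenization (the 6-tokens-per-atom structure is from the paper; the sign handling and fractional precision here are illustrative assumptions):</p>

```python
def tokenize_xyz(coords, frac_digits=2):
    """Sketch of BindGPT-style coordinate tokenization: each atom position becomes
    6 tokens (integer part and fractional part for each of x, y, z). The sign
    handling and number of fractional digits are illustrative assumptions."""
    tokens = []
    for x, y, z in coords:
        for value in (x, y, z):
            sign = "-" if value < 0 else ""
            whole, frac = divmod(round(abs(value) * 10**frac_digits), 10**frac_digits)
            tokens.append(f"{sign}{whole}")
            tokens.append(f".{frac:0{frac_digits}d}")
    return tokens

toks = tokenize_xyz([(1.25, -0.50, 3.04)])
print(toks)  # ['1', '.25', '-0', '.50', '3', '.04']
```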
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
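<p>The alpha-carbon-only pocket representation amounts to a simple filter (the input tuple format here is an illustrative assumption, not the paper's exact schema):</p>

```python
def pocket_tokens(atoms):
    """Keep only alpha-carbon (CA) atoms of a pocket, as BindGPT does, so the
    pocket token sequence stays compact. The (atom_name, residue, (x, y, z))
    input format is illustrative."""
    return [(name, res, xyz) for name, res, xyz in atoms if name == "CA"]

atoms = [
    ("N",  "ALA1", (0.0, 0.0, 0.0)),
    ("CA", "ALA1", (1.5, 0.0, 0.0)),
    ("C",  "ALA1", (2.1, 1.3, 0.0)),
    ("CA", "GLY2", (4.9, 1.4, 0.2)),
]
print(pocket_tokens(atoms))  # only the two CA entries remain
```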
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
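<p>The rotation augmentation hinges on applying one matrix to both structures (a z-axis rotation is shown for brevity; in practice a uniformly random 3D rotation would be sampled):</p>

```python
import math

def rotation_z(theta: float):
    """3x3 rotation matrix about the z-axis (illustrative; any random
    rotation works, z-axis chosen here for brevity)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotate(points, R):
    return [tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3)) for p in points]

# The SAME matrix is applied to pocket and ligand, so their relative pose
# (and hence the binding geometry) is preserved.
R = rotation_z(math.pi / 2)
pocket = [(1.0, 0.0, 0.0)]
ligand = [(0.0, 2.0, 1.0)]
pocket_rot, ligand_rot = rotate(pocket, R), rotate(ligand, R)
```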
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
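<p>A minimal numeric sketch of this objective (toy values; in practice the gradient flows through the sequence log-probabilities, and the reward sign convention here, a negated QVINA score, is an assumption):</p>

```python
def reinforce_kl_loss(log_probs, rewards, kl, beta=0.1):
    """Sketch of the BindGPT RL objective: REINFORCE on docking rewards plus a
    KL penalty toward the SFT model. All numbers below are illustrative, not
    from the paper."""
    n = len(rewards)
    pg = -sum(r * lp for r, lp in zip(rewards, log_probs)) / n
    return pg + beta * kl

# Toy batch: QVINA scores are negated so that lower binding energy
# (more negative score) yields a higher reward.
vina_scores = [-8.6, -5.4, -7.2]
rewards = [-s for s in vina_scores]
log_probs = [-12.0, -15.0, -13.5]
loss = reinforce_kl_loss(log_probs, rewards, kl=0.02, beta=0.1)
print(round(loss, 3))  # 93.802
```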
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state of the art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite being pre-trained on the same Uni-Mol data as BindGPT.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is orders of magnitude faster</strong> than diffusion baselines (200 s vs. 1.4M s for EDM, a ~7,000x speedup)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters); the authors find it sufficient for the current tasks but do not explore larger scales. The RL optimization uses only the Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. Finally, although BindGPT is the first model to explicitly generate hydrogens at scale, validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
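<p>As an illustration of the coordinate vocabulary, here is one plausible way a 3D position could be split into 6 tokens (two per coordinate: a signed integer part and a two-decimal fraction). The exact split is an assumption for illustration, not the paper&rsquo;s published scheme:</p>

```python
def coord_to_tokens(x: float, y: float, z: float) -> list[str]:
    """Split a 3D position into 6 tokens, two per coordinate:
    a signed integer-part token and a two-decimal fraction token.
    (Illustrative guess at the '6 tokens per position' layout.)"""
    tokens = []
    for v in (x, y, z):
        sign = "-" if v < 0 else ""
        a = abs(v)
        int_part = int(a)
        frac_part = round((a - int_part) * 100)
        if frac_part == 100:  # rounding overflow, e.g. 1.999 -> 2.00
            int_part, frac_part = int_part + 1, 0
        tokens.append(f"{sign}{int_part}")
        tokens.append(f".{frac_part:02d}")
    return tokens

print(coord_to_tokens(1.23, -0.5, 2.0))  # ['1', '.23', '-0', '.50', '2', '.00']
```

<p>Any scheme like this keeps the vocabulary small while letting the autoregressive decoder emit coordinates token by token alongside the SMILES string.</p>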
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
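<p>The RMSD-Coverage metric can be sketched directly, assuming a precomputed matrix of RMSDs between reference and generated conformers (the function name and matrix layout here are illustrative):</p>

```python
def rmsd_coverage(rmsd, threshold):
    """Fraction of reference conformers matched by at least one
    generated conformer within `threshold`; sweeping the threshold
    traces out the coverage CDF.
    rmsd[i][j] = RMSD between reference i and generated conformer j."""
    matched = sum(1 for row in rmsd if min(row) <= threshold)
    return matched / len(rmsd)

# toy matrix: 3 reference conformers x 2 generated conformers
rmsd = [[0.4, 1.2],
        [0.9, 0.7],
        [2.0, 1.8]]
```

<p>Here <code>rmsd_coverage(rmsd, 1.0)</code> is 2/3: two of the three references have a generated conformer within the threshold.</p>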
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified; as of this writing, the project website exists but no source code has been released.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Augmented Hill-Climb for RL-Based Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</guid><description>Augmented Hill-Climb combines REINVENT and Hill-Climb RL strategies to improve sample efficiency ~45-fold for SMILES-based de novo molecule generation.</description><content:encoded><![CDATA[<h2 id="a-hybrid-rl-strategy-for-de-novo-molecule-generation">A Hybrid RL Strategy for De Novo Molecule Generation</h2>
<p>This is a <strong>Method</strong> paper that proposes Augmented Hill-Climb (AHC), a reinforcement learning strategy for conditioning SMILES-based language models during de novo molecule generation. The primary contribution is a simple hybrid between the <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and Hill-Climb (HC) RL strategies that computes the REINVENT loss function only on the top-k highest-scoring molecules per batch (as in HC), thereby removing the counterproductive regularization effect of low-scoring molecules. The authors demonstrate that AHC improves optimization ability ~1.5-fold and sample efficiency ~45-fold compared to REINVENT across docking tasks against four <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> targets, and that the approach generalizes to transformer architectures.</p>
<h2 id="sample-efficiency-bottleneck-in-rl-guided-molecular-generation">Sample Efficiency Bottleneck in RL-Guided Molecular Generation</h2>
<p>Recurrent neural networks trained on SMILES have become a standard approach for de novo molecule generation, with RL strategies like REINVENT and Hill-Climb achieving top performance on benchmarks such as <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>. However, RL-guided generation can be highly <a href="/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/">sample-inefficient</a>, often requiring $10^5$ or more molecules to optimize complex objectives. This is acceptable for cheap scoring functions (e.g., QSAR models, property calculators) but becomes a practical bottleneck when using computationally expensive scoring functions like molecular docking or computer-aided synthesis planning.</p>
<p>The REINVENT strategy regularizes the agent by computing a loss based on the difference between the agent&rsquo;s policy and an &ldquo;augmented likelihood&rdquo; that combines the prior policy with a scaled reward. When low-scoring molecules are sampled ($R_T \approx 0$), the augmented likelihood reduces to the prior likelihood, causing the agent to drift back toward the prior policy. This negates useful learning, especially early in training or when the objective is difficult. Meanwhile, Hill-Climb simply fine-tunes the RNN on the top-k molecules per batch, which is sample-efficient but lacks explicit regularization, leading to mode collapse and generation of invalid SMILES.</p>
<p>Previous work by Neil et al. compared RL strategies but did not clearly quantify sample-efficiency differences, and modifications to the REINVENT loss function by Fialkova et al. showed no significant improvement. The best agent reminder (BAR) mechanism offered modest gains but was originally tested on graph-based models.</p>
<h2 id="core-innovation-filtering-low-scoring-molecules-from-the-reinvent-loss">Core Innovation: Filtering Low-Scoring Molecules from the REINVENT Loss</h2>
<p>Augmented Hill-Climb combines the loss formulation of REINVENT with the top-k selection mechanism of Hill-Climb. The agent samples a batch of molecules, ranks them by reward, and computes the REINVENT loss only on the top-k molecules. This removes the counterproductive regularization caused by low-scoring molecules while retaining the prior-based regularization for high-scoring molecules.</p>
<p>The REINVENT loss defines an augmented likelihood:</p>
<p>$$
\log P_{\mathbb{U}}(A) = \log P_{prior}(A) + \sigma R_T
$$</p>
<p>where $\sigma$ is a scaling coefficient controlling the reward contribution. The agent loss is the squared difference between the augmented likelihood and the agent&rsquo;s log-likelihood:</p>
<p>$$
L(\theta) = \left[\log P_{\mathbb{U}}(A) - \log P_{agent}(A)\right]^2
$$</p>
<p>In standard REINVENT, this loss is computed over all molecules in the batch. When $R_T \approx 0$, the augmented likelihood collapses to the prior likelihood, pushing the agent back toward the prior. AHC avoids this by computing the loss only on the top-k molecules ranked by reward, exactly as Hill-Climb selects molecules for fine-tuning.</p>
<p>The key insight is that high-scoring molecules are still regularized by the prior component of the augmented likelihood ($\log P_{prior}(A)$), preventing catastrophic forgetting. Low-scoring molecules, which would otherwise pull the agent back toward the prior, are simply excluded from the loss computation.</p>
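<p>The resulting update is short enough to sketch in full, assuming per-molecule agent and prior log-likelihoods and rewards are already computed (variable names and the top-k fraction are illustrative, not the authors&rsquo; exact code):</p>

```python
def ahc_loss(logp_agent, logp_prior, rewards, sigma=60.0, topk_frac=0.5):
    """Augmented Hill-Climb loss: rank the batch by reward, keep the
    top-k (as in Hill-Climb), and apply the REINVENT squared-difference
    loss between augmented and agent likelihoods to those only."""
    k = max(1, int(len(rewards) * topk_frac))
    # indices of the top-k molecules by reward
    top = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)[:k]
    loss = 0.0
    for i in top:
        augmented = logp_prior[i] + sigma * rewards[i]  # log P_U(A)
        loss += (augmented - logp_agent[i]) ** 2        # REINVENT loss term
    return loss / k
```

<p>Setting <code>topk_frac=1.0</code> recovers the standard REINVENT loss over the full batch.</p>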
<h3 id="diversity-filters-to-prevent-mode-collapse">Diversity Filters to Prevent Mode Collapse</h3>
<p>AHC is more susceptible to mode collapse than REINVENT because it focuses learning on high-scoring molecules. The authors address this with diversity filters (DFs) that penalize the reward of molecules similar to previously generated ones. Through a hyperparameter search over 825 configurations on three GuacaMol tasks, they identify an optimal DF configuration (DF2) with:</p>
<ul>
<li>Minimum score threshold of 0.5 (lower than DF1&rsquo;s 0.8)</li>
<li>Linear penalization output mode (softer than binary)</li>
<li>Bin size of 50 (larger than DF1&rsquo;s 25)</li>
<li>Scaffold similarity based on ECFP4 fingerprints</li>
</ul>
<p>The authors find that stricter DFs (lower thresholds, smaller bins) better prevent mode collapse but reduce optimization performance, while more lenient DFs enable better learning of chemotype-reward associations. DF2 represents a compromise.</p>
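<p>A minimal sketch of such a scaffold-memory diversity filter with DF2-like settings (score threshold, linear penalization, bin size). The bookkeeping and penalty schedule here are assumptions for illustration, not the authors&rsquo; exact implementation:</p>

```python
class DiversityFilter:
    """Scaffold-memory diversity filter (sketch): molecules scoring
    above `min_score` fill a per-scaffold bin; as a bin fills toward
    `bin_size`, rewards for that scaffold are scaled down linearly."""
    def __init__(self, min_score=0.5, bin_size=50):
        self.min_score = min_score
        self.bin_size = bin_size
        self.bins = {}  # scaffold -> count of high-scoring occurrences

    def penalize(self, scaffold, score):
        if score < self.min_score:
            return score  # low scorers neither fill bins nor get penalized
        n = self.bins.get(scaffold, 0)
        self.bins[scaffold] = n + 1
        # linear output mode: fade the reward as the bin fills
        return score * max(0.0, 1.0 - n / self.bin_size)
```

<p>Lowering <code>min_score</code> or shrinking <code>bin_size</code> makes the filter stricter, matching the trade-off the hyperparameter search explored.</p>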
<h2 id="experimental-setup-docking-tasks-and-benchmark-comparisons">Experimental Setup: Docking Tasks and Benchmark Comparisons</h2>
<p>The evaluation spans five experiments:</p>
<p><strong>Experiment 1</strong>: AHC vs. REINVENT on DRD2 docking over 100 RL updates (6,400 samples), varying $\sigma$ from 30 to 240. RNN trained on the MOSESn dataset (MOSES with neutralized charges, 2.45M molecules).</p>
<p><strong>Experiment 2</strong>: AHC + DF2 vs. REINVENT on four GPCR targets (DRD2, OPRM1, AGTR1, OX1R) over 500 RL updates. Docking performed with Glide-SP after ligand preparation with LigPrep.</p>
<p><strong>Experiment 3</strong>: Diversity filter hyperparameter search (825 configurations) on three GuacaMol tasks (<a href="https://en.wikipedia.org/wiki/Aripiprazole">Aripiprazole</a> similarity, C11H24 isomers, <a href="https://en.wikipedia.org/wiki/Osimertinib">Osimertinib</a> MPO) using the GuacaMol training set (1.27M molecules from ChEMBL24).</p>
<p><strong>Experiment 4</strong>: Benchmark of AHC against REINFORCE, REINVENT (v1 and v2), BAR, and Hill-Climb (with and without KL regularization) on six tasks of varying difficulty:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Difficulty</th>
          <th>Objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy atoms</td>
          <td>Easy</td>
          <td>Maximize number of heavy atoms</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Risperidone">Risperidone</a> similarity</td>
          <td>Easy</td>
          <td>Maximize Tanimoto similarity to Risperidone</td>
      </tr>
      <tr>
          <td>DRD2 activity</td>
          <td>Medium</td>
          <td>Maximize QSAR-predicted DRD2 activity</td>
      </tr>
      <tr>
          <td>DRD2 docking</td>
          <td>Medium</td>
          <td>Minimize Glide-SP docking score</td>
      </tr>
      <tr>
          <td>DRD2-DRD3 dual</td>
          <td>Hard</td>
          <td>Maximize predicted activity against both targets</td>
      </tr>
      <tr>
          <td>DRD2/DRD3 selective</td>
          <td>Hard</td>
          <td>Maximize selective DRD2 activity over DRD3</td>
      </tr>
  </tbody>
</table>
<p><strong>Experiment 5</strong>: AHC vs. REINVENT on transformer (Tr) and gated transformer (GTr) architectures on the same six benchmark tasks. The GTr implements a GRU-style gate in place of residual connections to stabilize RL training.</p>
<h3 id="rnn-and-transformer-architectures">RNN and Transformer Architectures</h3>
<p>Three RNN configurations were used: (1) embedding 128 + 3 GRU layers of 512 (REINVENT v1), (2) embedding 256 + 3 LSTM layers of 512 (REINVENT 2.0), (3) 3 LSTM layers of 512 with dropout 0.2 (GuacaMol). Transformers used 4 encoder layers with hidden dimension 512, 8 attention heads, and feed-forward dimension 1024.</p>
<p>QSAR models for DRD2 and DRD3 activity were random forest classifiers trained on ExCAPE-DB data with GHOST threshold identification for handling class imbalance.</p>
<h2 id="key-findings-45-fold-sample-efficiency-improvement">Key Findings: 45-Fold Sample Efficiency Improvement</h2>
<h3 id="experiment-1-ahc-consistently-outperforms-reinvent">Experiment 1: AHC Consistently Outperforms REINVENT</h3>
<p>AHC improved optimization ability by 1.39-fold over REINVENT averaged across all $\sigma$ values, with maximum optimization of 205% at $\sigma = 240$ (compared to 128% for REINVENT). AHC required ~80 fewer RL steps to match REINVENT&rsquo;s mean docking score at 100 steps. With DF1 applied, the improvement was 1.45-fold.</p>
<p>AHC showed greater sensitivity to $\sigma$, giving practitioners more control over the regularization-optimization trade-off. At $\sigma = 60$ (heavily regularized), AHC still improved 1.47-fold over REINVENT while maintaining property space defined by the MOSESn training set. At higher $\sigma$ values, AHC extrapolated further outside the training distribution, which can be favorable (novel chemical space) or unfavorable (scoring function exploitation, e.g., larger molecules getting better docking scores due to the additive nature of scoring functions).</p>
<h3 id="experiment-2-improvement-across-four-gpcr-targets">Experiment 2: Improvement Across Four GPCR Targets</h3>
<p>Across DRD2, OPRM1, AGTR1, and OX1R, AHC + DF2 required on average 7.4-fold fewer training steps and 45.5-fold fewer samples to reach optimization thresholds. The improvement was largest early in training: 19.8-fold fewer steps to reach 120% optimization, and 71.8-fold fewer samples to first produce a molecule exceeding 160% optimization.</p>
<p>AHC + DF2 surpassed the 80% retrospective precision threshold within 100 RL updates for all targets except the challenging OX1R. DF2 successfully stabilized learning, avoiding the convergence-to-threshold failure mode observed with DF1.</p>
<p>Scaffold analysis showed AHC generates similar chemistry to REINVENT. The top 500 scaffolds produced by REINVENT were also generated by AHC, but typically much sooner.</p>
<h3 id="experiment-4-benchmark-against-all-rl-strategies">Experiment 4: Benchmark Against All RL Strategies</h3>
<p>AHC outperformed all other RL strategies on all six benchmark tasks except maximizing heavy atoms (an extrapolation task of limited practical relevance). AHC was particularly superior during early-stage optimization and for harder objectives (dual activity, selective activity).</p>
<p>Hill-Climb with a smaller batch size (HC*) showed improved early-stage sample efficiency similar to AHC, but rapidly underwent mode collapse. KL regularization did not rescue mode collapse in any case and sometimes worsened performance. BAR performed poorly in most tasks, possibly because the best-agent memory acts as a second regularizer that inhibits learning.</p>
<p>In terms of wall time for the DRD2 docking task, AHC reached 140% optimization in 16 CPU hours vs. 202 CPU hours for <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 2.0</a>. AHC was the only strategy to reach 200% optimization within the allotted time (216 CPU hours). Parallelized over 10 CPUs, this corresponds to ~21.6 hours, making docking-guided generation feasible on local machines.</p>
<h3 id="experiment-5-generalization-to-transformers">Experiment 5: Generalization to Transformers</h3>
<p>AHC outperformed REINVENT on both the standard transformer and the gated transformer architectures. The standard transformer was unstable under RL, readily undergoing mode collapse. The gated transformer (with GRU-style gating replacing residual connections) stabilized RL training. AHC&rsquo;s efficiency gains generalized to both architectures.</p>
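<p>The GRU-style gate can be sketched in scalar form; it combines a sublayer&rsquo;s input $x$ and output $y$ where a vanilla transformer would compute the residual $x + y$. The fixed weights (and the omission of the reset gate) are simplifications for illustration; the real layer is vector-valued with learned matrices:</p>

```python
import math

def gru_gate(x, y, w_z=1.0, u_z=1.0, b_z=2.0, w=1.0, u=1.0):
    """GRU-style gate replacing the residual connection x + y
    (scalar sketch with fixed, illustrative weights)."""
    z = 1.0 / (1.0 + math.exp(-(w_z * y + u_z * x - b_z)))  # update gate
    h = math.tanh(w * y + u * x)                            # candidate value
    return (1.0 - z) * x + z * h                            # gated mixture
```

<p>A positive bias like <code>b_z</code> is commonly used to initialize the gate near the identity map (output close to $x$), which is the usual rationale for why gating stabilizes early RL training.</p>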
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Chemistry quality evaluation is complicated by the interaction between RL strategy and scoring function suitability. Greater optimization may lead to unreasonable chemistry due to scoring function exploitation rather than the RL strategy itself.</li>
<li>The diversity filter hyperparameter search was conducted on GuacaMol toy tasks, which may not fully transfer to docking-based objectives.</li>
<li>The docking scoring function was system-dependent: DRD2 and OPRM1 were optimized effectively, while AGTR1 and OX1R proved more challenging (especially AGTR1, where the docking algorithm targeted the wrong sub-pocket).</li>
<li>KL regularization proved ineffective for HC and REINFORCE, suggesting it is not a sufficient regularization method in this context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN pretraining</td>
          <td>MOSESn (MOSES neutralized)</td>
          <td>2,454,087 molecules</td>
          <td>ZINC15 clean leads with neutralized charges</td>
      </tr>
      <tr>
          <td>RNN pretraining</td>
          <td>GuacaMol train</td>
          <td>1,273,104 molecules</td>
          <td>ChEMBL24 with property filters</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD2)</td>
          <td>4,609 actives / 343,026 inactives</td>
          <td>Random forest with GHOST thresholds</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD3)</td>
          <td>2,758 actives / 402,524 inactives</td>
          <td>Unique subsets for dual/selective tasks</td>
      </tr>
      <tr>
          <td>DF parameter search</td>
          <td>GuacaMol benchmark tasks</td>
          <td>3 tasks</td>
          <td>825 configurations tested</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>AHC</strong>: REINVENT loss computed on top-k molecules per batch, ranked by reward</li>
<li><strong>Baselines</strong>: REINFORCE, REINVENT (v1, v2), BAR, Hill-Climb, Hill-Climb + KL regularization</li>
<li><strong>Hyperparameters</strong>: Default values from each original publication (listed in Supplementary Table S3)</li>
<li><strong>Docking</strong>: Glide-SP with Schrodinger Protein Preparation Wizard, LigPrep for ligand preparation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RNNs</strong>: 3 configurations (GRU/LSTM, 512 hidden units, trained 5-10 epochs)</li>
<li><strong>Transformer</strong>: 4 encoder layers, 512 hidden dim, 8 heads, 1024 FFN dim</li>
<li><strong>Gated Transformer</strong>: Same architecture with GRU-style gating replacing residual connections</li>
<li><strong>QSAR</strong>: Random forest classifiers (100 estimators, max depth 15, min leaf 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AHC + DF2</th>
          <th>REINVENT</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization fold-improvement</td>
          <td>1.45x</td>
          <td>baseline</td>
          <td>DRD2 docking, averaged across sigma values</td>
      </tr>
      <tr>
          <td>Sample efficiency</td>
          <td>45.5x fewer samples</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>Step efficiency</td>
          <td>7.4x fewer steps</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>CPU hours to 140% (DRD2 docking)</td>
          <td>16h</td>
          <td>202h (REINVENT 2.0)</td>
          <td>AMD Threadripper 1920 + RTX 2060 Super</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>AMD Threadripper 1920 CPU</li>
<li>Nvidia GeForce RTX 2060 Super GPU</li>
<li>DRD2 docking benchmark: 216 CPU hours for AHC to reach 200% optimization (~21.6h parallelized over 10 CPUs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/SMILES-RNN">SMILES-RNN</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>RNN and transformer generative model code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/">Scoring function platform</a></td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.19591024.v1">Figshare datasets</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Supporting data (published under same license as paper)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. <em>Journal of Cheminformatics</em>, 14, 68.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2022augmented,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00646-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Atom-in-SMILES: Better Tokens for Chemical Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</guid><description>Atom-in-SMILES replaces generic SMILES tokens with environment-aware atomic tokens, reducing token degeneration and improving chemical translation accuracy.</description><content:encoded><![CDATA[<h2 id="a-new-tokenization-method-for-chemical-language-models">A New Tokenization Method for Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom-in-SMILES (AIS), a tokenization scheme for SMILES strings that replaces generic atomic tokens with environment-aware tokens encoding each atom&rsquo;s local chemical neighborhood. The primary contribution is demonstrating that tokenization quality has a significant impact on chemical language model outcomes across multiple tasks: SMILES canonicalization, <a href="/notes/chemistry/molecular-design/reaction-prediction/">single-step retrosynthesis</a>, and <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a>.</p>
<h2 id="why-standard-smiles-tokenization-falls-short">Why Standard SMILES Tokenization Falls Short</h2>
<p>Standard atom-wise SMILES tokenization treats all atoms of the same element identically. Every carbon is tokenized as &ldquo;C&rdquo; regardless of whether it is part of an aromatic ring, a carbonyl group, or a methyl chain. This creates a highly degenerate token space where chemically distinct atoms share the same representation.</p>
<p>The authors draw an analogy between natural language and chemical language. A typical SMILES sequence is about three times longer than a natural language sentence, yet the token vocabulary is roughly 1000 times smaller. This mismatch leads to extreme token repetition: the same tokens (C, c, N, O) appear many times within a single sequence. In natural language processing, token degeneration (where models repeatedly predict the same token) is a known failure mode of autoregressive decoders. The repetitive nature of SMILES tokens exacerbates this problem in chemical language models.</p>
<p>SMILES also lacks a one-to-one correspondence between tokens and chemical meaning. Two molecules that differ in only one atom substitution (e.g., swapping a carbon for a nitrogen in a ring) produce identical token sets under atom-wise tokenization, making it harder for models to distinguish structurally similar molecules.</p>
<h2 id="core-innovation-encoding-atom-environments-into-tokens">Core Innovation: Encoding Atom Environments into Tokens</h2>
<p>The key insight is to replace each atomic token with a richer token that encodes the atom&rsquo;s local chemical environment, inspired by the <a href="https://en.wikipedia.org/wiki/Atoms_in_molecules">atoms-in-molecules (AIM)</a> concept from quantum chemistry. For a given SMILES string, the AIS mapping function $f$ operates on the token space:</p>
<p>$$
f(X) = \begin{cases} AE|_{X_{\text{central}}} &amp; \text{if } X \text{ is an atom} \\ X &amp; \text{otherwise} \end{cases}
$$</p>
<p>where $AE|_{X_{\text{central}}}$ denotes the atomic environment centered on atom $X$. Non-atomic tokens (brackets, bond symbols, ring closures) pass through unchanged.</p>
<p>Each AIS token is formatted as <code>[Sym;Ring;Neighbors]</code> where:</p>
<ul>
<li><strong>Sym</strong> is the atomic symbol with chirality, aromaticity (lowercase for aromatic), hydrogen count, and formal charge</li>
<li><strong>Ring</strong> indicates whether the atom is in a ring (<code>R</code>) or not (<code>!R</code>)</li>
<li><strong>Neighbors</strong> lists the neighboring atoms interacting with the central atom</li>
</ul>
<p>This mapping is bijective: SMILES strings can be fully recovered from AIS strings via an inverse projection. The algorithm iterates over atoms in a molecule, computes their local environments using RDKit, and produces environment-aware token variants.</p>
<p>As a concrete example, in glycine the two carbons and two oxygens are indistinguishable under atom-wise tokenization. Under AIS, each receives a unique token reflecting its bonding environment (e.g., the carboxyl carbon is distinguished from the alpha carbon).</p>
<p>The AIS tokenization also exhibits a fingerprint-like property. Because each token encodes local structural information, the set of AIS tokens for a molecule functions similarly to circular fingerprints like ECFP2. The authors show that pairwise <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> computed from AIS token sets have resolution comparable to ECFP2 and HashAP fingerprints, and better resolution than MACCS, Avalon, and RDKit fingerprints.</p>
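<p>The Tanimoto similarity over token sets is simply the Jaccard index; a minimal sketch:</p>

```python
# Jaccard/Tanimoto similarity over AIS token sets. The token set acts
# like a structural fingerprint because each token encodes a local
# atomic environment.
def tanimoto(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two molecules sharing one of three distinct environment tokens:
sim = tanimoto(["[C;!R;CO]", "[O;!R;C]"], ["[C;!R;CO]", "[N;!R;C]"])  # 1/3
```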
<p>Token repetition can be quantified as:</p>
<p>$$
\text{rep-}l = \sum_{t=1}^{|s|} \mathbb{1}[s_t \in s_{t-w-1:t-1}]
$$</p>
<p>where $s$ is the predicted sequence, $|s|$ is the token count, and $w$ is the window size. AIS tokens exhibit consistently lower normalized repetition rates compared to SMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> across diverse molecular datasets (drugs, natural products, steroids, lipids, metal complexes, octane isomers).</p>
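<p>A direct reading of the rep-$l$ definition (a 0-indexed sketch; the exact window boundary convention is an assumption):</p>

```python
# rep-l degeneration statistic: count positions whose token already
# appears in the preceding window of w tokens. Normalizing by sequence
# length gives the repetition rates compared across notations.
def rep_l(tokens, w):
    count = 0
    for t, tok in enumerate(tokens):
        window = tokens[max(0, t - w):t]  # up to w preceding tokens
        if tok in window:
            count += 1
    return count

rate = rep_l(["C", "C", "O"], 4) / 3  # one repeated token out of three
```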
<h2 id="experimental-evaluation-across-three-chemical-tasks">Experimental Evaluation Across Three Chemical Tasks</h2>
<h3 id="input-output-equivalent-mapping-smiles-canonicalization">Input-Output Equivalent Mapping (SMILES Canonicalization)</h3>
<p>The first task tests whether a model can translate non-canonical SMILES enumerations into canonical form. The authors constructed deliberately challenging datasets from <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> subsets with cumulative structural constraints (no cyclic heteroatom-heteroatom bonds, stable functional groups only, fragment-like, scaffold-like, etc.), generating training sets of 1M molecules augmented with 150K molecules from the most restrictive subset at 10x, 30x, and 50x augmentation levels.</p>
<table>
  <thead>
      <tr>
          <th>GDB-13 Subset</th>
          <th>Atom-wise (x10)</th>
          <th>Atom-wise (x50)</th>
          <th>AIS (x10)</th>
          <th>AIS (x50)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ab</td>
          <td>34.2%</td>
          <td>33.2%</td>
          <td>37.3%</td>
          <td>34.1%</td>
      </tr>
      <tr>
          <td>abc</td>
          <td>31.0%</td>
          <td>29.6%</td>
          <td>33.7%</td>
          <td>30.4%</td>
      </tr>
      <tr>
          <td>abcde</td>
          <td>48.7%</td>
          <td>45.5%</td>
          <td>53.6%</td>
          <td>47.0%</td>
      </tr>
      <tr>
          <td>abcdef</td>
          <td>41.8%</td>
          <td>39.1%</td>
          <td>52.5%</td>
          <td>46.9%</td>
      </tr>
      <tr>
          <td>abcdefg</td>
          <td>50.9%</td>
          <td>50.0%</td>
          <td>59.9%</td>
          <td>56.8%</td>
      </tr>
  </tbody>
</table>
<p>AIS outperformed atom-wise tokenization on all subsets and augmentation levels. The performance gap widened for the more restrictive (more structurally similar) subsets, reaching 10.7 percentage points on the abcdef subset (52.5% vs. 41.8% at 10x augmentation). This suggests AIS is particularly effective when molecules are structurally similar and therefore harder to distinguish.</p>
<h3 id="single-step-retrosynthesis">Single-Step Retrosynthesis</h3>
<p>The second task uses the USPTO-50K benchmark for single-step <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthetic prediction</a> via a template-free transformer encoder-decoder model. The model was trained for 200,000 steps with Adam optimizer, negative log-likelihood loss, and cyclic learning rate scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Tokenization</th>
          <th>rep-l(Pred) - rep-l(GT) &gt;= 2 (count)</th>
          <th>String Exact (%)</th>
          <th>Tc Exact (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Atom-wise baseline</td>
          <td>&ndash;</td>
          <td>42.00</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Atom-wise (reproduced)</td>
          <td>801</td>
          <td>42.05</td>
          <td>44.72</td>
      </tr>
      <tr>
          <td>SmilesPE</td>
          <td>821</td>
          <td>19.82</td>
          <td>22.74</td>
      </tr>
      <tr>
          <td>SELFIES</td>
          <td>886</td>
          <td>28.82</td>
          <td>30.76</td>
      </tr>
      <tr>
          <td>DeepSMILES</td>
          <td>902</td>
          <td>38.63</td>
          <td>41.20</td>
      </tr>
      <tr>
          <td><strong>Atom-in-SMILES</strong></td>
          <td><strong>727</strong></td>
          <td><strong>46.32</strong></td>
          <td><strong>47.62</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved 46.32% string exact accuracy (4.3 percentage points above the atom-wise baseline) and 47.62% Tanimoto exact accuracy (2.9 points above). AIS also produced the fewest degenerate token repetitions (727 vs. 801 for atom-wise), roughly a 10% reduction. DeepSMILES had the highest repetition count (902) despite reasonable overall accuracy. SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SmilesPE</a> both performed substantially worse than the atom-wise baseline on this task.</p>
<p>The authors identified six common token repetition patterns in retrosynthetic predictions: long head repetitions, long tail repetitions, repetitive rings, repetitive chains, and halogen repetitions on aliphatic and on aromatic carbons (the last two counted as separate patterns).</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The third task evaluates tokenization schemes on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks using Random Forest models with 5-fold cross-validation. AIS tokens were converted to fingerprint-like feature vectors.</p>
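<p>One way to realize the token-to-feature conversion (an illustrative sketch; the token strings and vectorization details are assumptions, not the paper's exact pipeline):</p>

```python
# Turn per-molecule AIS token lists into count vectors that a Random
# Forest (or any tabular model) can consume.
def build_vocab(token_lists):
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}

def featurize(tokens, vocab):
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:          # unseen tokens are dropped
            vec[vocab[tok]] += 1
    return vec

mols = [["[C;!R;CO]", "[O;!R;C]"], ["[C;!R;CO]", "[N;!R;C]"]]
vocab = build_vocab(mols)
X = [featurize(toks, vocab) for toks in mols]
```

<p>The resulting matrix <code>X</code> would then be passed to the Random Forest under 5-fold cross-validation.</p>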
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>SMILES</th>
          <th>DeepSMILES</th>
          <th>SELFIES</th>
          <th>SmilesPE</th>
          <th>AIS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Regression (RMSE, lower is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>0.628</td>
          <td>0.631</td>
          <td>0.675</td>
          <td>0.689</td>
          <td><strong>0.553</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>0.545</td>
          <td>0.544</td>
          <td>0.564</td>
          <td>0.761</td>
          <td><strong>0.441</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>0.924</td>
          <td>0.895</td>
          <td>0.938</td>
          <td>0.800</td>
          <td><strong>0.683</strong></td>
      </tr>
      <tr>
          <td><strong>Classification (ROC-AUC, higher is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>0.758</td>
          <td>0.777</td>
          <td>0.799</td>
          <td>0.847</td>
          <td><strong>0.885</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>0.740</td>
          <td>0.774</td>
          <td>0.746</td>
          <td>0.837</td>
          <td><strong>0.835</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>0.649</td>
          <td>0.648</td>
          <td>0.653</td>
          <td>0.739</td>
          <td><strong>0.729</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved the best performance on all three regression datasets and two of three classification datasets. On ESOL, the RMSE improvement over standard SMILES was 12%. On lipophilicity, the improvement was 26%.</p>
<h2 id="key-findings-better-tokens-yield-better-chemical-models">Key Findings: Better Tokens Yield Better Chemical Models</h2>
<p>The main findings of this work are:</p>
<ol>
<li>
<p><strong>Tokenization significantly impacts chemical language model quality.</strong> The choice of tokenization scheme can change prediction accuracy by over 10 percentage points on equivalent mapping tasks.</p>
</li>
<li>
<p><strong>AIS reduces token degeneration by approximately 10%</strong> compared to atom-wise SMILES tokenization, with consistently lower normalized repetition rates across diverse molecular datasets.</p>
</li>
<li>
<p><strong>AIS outperforms all compared tokenization schemes</strong> (atom-wise SMILES, SmilesPE, SELFIES, DeepSMILES) on canonicalization, retrosynthesis, and property prediction.</p>
</li>
<li>
<p><strong>The fingerprint-like nature of AIS tokens</strong> enables direct use as molecular features for property prediction and provides resolution comparable to established circular fingerprints.</p>
</li>
<li>
<p><strong>The mapping is invertible</strong>, so AIS strings can always be converted back to valid SMILES. This is a practical advantage over approaches that may lose structural information.</p>
</li>
</ol>
<p><strong>Limitations</strong>: AIS cannot distinguish environmentally identical substructures or atoms related by a molecular symmetry plane, since it only considers nearest-neighbor environments. Performance on long-chain molecules (e.g., lipids) is similar across all tokenization schemes, suggesting that local environment encoding is less informative for repetitive linear structures.</p>
<p><strong>Future directions</strong>: The authors suggest AIS has potential for broader adoption in molecular generative models, chemical translation, and property prediction tasks across the cheminformatics community.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonicalization training</td>
          <td>GDB-13 subsets</td>
          <td>1M + 150K augmented</td>
          <td>Cumulative structural constraints a-h</td>
      </tr>
      <tr>
          <td>Canonicalization testing</td>
          <td>GDB-13 disjoint test sets</td>
          <td>20K per subset</td>
          <td>Various restriction levels</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50K</td>
          <td>~50K reactions</td>
          <td>Sequences &gt; 150 tokens removed</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipophilicity, BBBP, BACE, HIV)</td>
          <td>Varies</td>
          <td>Standard benchmark splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder architecture for canonicalization and retrosynthesis tasks</li>
<li>200,000 training steps with Adam optimizer, negative log-likelihood loss, cyclic learning rate scheduler</li>
<li>Random Forest with 5-fold cross-validation for property prediction</li>
<li>AIS tokenization implemented via RDKit for atom environment extraction</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>String exact match (%)</td>
          <td>Canonicalization, Retrosynthesis</td>
          <td>Exact SMILES match</td>
      </tr>
      <tr>
          <td>Tanimoto exactness (Tc)</td>
          <td>Retrosynthesis</td>
          <td>Morgan FP radius 3, 2048 bits</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression property prediction</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification property prediction</td>
          <td>BBBP, BACE, HIV</td>
      </tr>
      <tr>
          <td>rep-l</td>
          <td>Token degeneration</td>
          <td>Single-token repetition count</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/snu-lcbc/atom-in-SMILES">atom-in-SMILES</a></td>
          <td>Code</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>AIS tokenization implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ucak, U. V., Ashyrmamatov, I., &amp; Lee, J. (2023). Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. <em>Journal of Cheminformatics</em>, 15, 55. <a href="https://doi.org/10.1186/s13321-023-00725-9">https://doi.org/10.1186/s13321-023-00725-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ucak2023improving,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00725-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand-molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This lets each decoder layer attend to protein features at its own level of abstraction, rather than all layers sharing the same top-level encoding.</p>
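<p>A NumPy sketch of $f_{ca}$ (shapes and values are illustrative):</p>

```python
import numpy as np

def f_ca(Q_m, K_S, V_S):
    """Scaled dot-product cross-attention: ligand-side queries attend
    over protein keys/values delivered by the skip connection from the
    matching encoder layer."""
    d_k = Q_m.shape[-1]
    scores = Q_m @ K_S.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # softmax stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # rows sum to 1
    return w @ V_S

# Toy shapes: 3 ligand tokens attending over 5 protein residues.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
ctx = f_ca(Q, K, V)   # shape (3, 8)
```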
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = \{S, a_1 a_2 \cdots a_\tau\}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward over the $N_a$ visits of action $a$, and $U(\tilde{C}, a) = c_{puct} \cdot P(a \mid \tilde{C}) \cdot \sqrt{N} / (1 + N_a)$ is an exploration bonus weighted by the LT&rsquo;s predicted probability, with $N$ the parent node&rsquo;s total visit count.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
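<p>The select rule can be sketched as follows (toy statistics; each child record holds the LT prior $P$, visit count $N$, and cumulative reward $W$, and the parent visit count is approximated by the sum over children):</p>

```python
import math

# Toy PUCT selection over a node's children:
# argmax_a  Q(C, a) + c_puct * P(a|C) * sqrt(N_parent) / (1 + N_a)
def select(children, c_puct=1.5):
    total_visits = sum(ch["N"] for ch in children.values())
    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0   # average reward
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])
        return q + u
    return max(children, key=lambda a: score(children[a]))

children = {
    "C": {"P": 0.6, "N": 10, "W": 7.0},  # well-explored, decent reward
    "N": {"P": 0.3, "N": 1,  "W": 0.9},  # lightly explored, high reward
    "O": {"P": 0.1, "N": 0,  "W": 0.0},  # unvisited: pure exploration term
}
chosen = select(children)
```

<p>Here the lightly explored but high-reward symbol wins over the heavily visited one, which is exactly the exploration-exploitation trade-off PUCT encodes.</p>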
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
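<p>The per-pair term of $J(\Theta)$ reduces to a sum of negative log-probabilities of the target tokens; a toy sketch:</p>

```python
import math

# Sequence NLL: sum over positions of -ln P(target token | context).
# `probs` is a toy stand-in for the LT's per-step output distribution.
def sequence_nll(probs, target):
    return -sum(math.log(p[tok]) for p, tok in zip(probs, target))

probs = [{"C": 0.7, "N": 0.2, "O": 0.1},
         {"C": 0.5, "N": 0.4, "O": 0.1}]
nll = sequence_nll(probs, ["C", "N"])   # -ln 0.7 - ln 0.4
```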
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
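<p>The lookup table amounts to memoizing the docking function on the (protein, SMILES) pair; a sketch with a dummy scorer standing in for the actual SMINA call:</p>

```python
# Memoize docking scores on the (protein, canonical SMILES) pair so
# repeated rollouts that regenerate the same molecule never re-dock it.
class DockingCache:
    def __init__(self, dock_fn):
        self.dock_fn = dock_fn   # the expensive docking call
        self.table = {}
        self.calls = 0

    def score(self, protein, smiles):
        key = (protein, smiles)
        if key not in self.table:
            self.calls += 1
            self.table[key] = self.dock_fn(protein, smiles)
        return self.table[key]

cache = DockingCache(lambda p, s: float(len(s)))  # dummy scorer
cache.score("1abc", "CCO")
cache.score("1abc", "CCO")   # cache hit: no second docking call
```

<p>Canonicalizing the SMILES before keying would further collapse duplicate molecules generated under different enumerations.</p>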
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the Mycobacterium tuberculosis ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
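<p>The length-normalized objective for a single sequence can be sketched in a few lines; the per-token probabilities below are hypothetical stand-ins for model outputs:</p>

```python
import math

def smiles_nll(token_probs):
    """Length-normalized negative log-likelihood of one SMILES sequence,
    mirroring the (1/M_y) * sum over i of log P(y_i | y_1..y_{i-1}) term.
    `token_probs` holds the model's probability for each observed token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for a three-token SMILES like "CCO".
loss = smiles_nll([0.9, 0.8, 0.7])  # ≈ 0.2284
```

<p>Averaging this quantity over the 10 million pre-training SMILES gives the full objective above.</p>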
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_{ij} &amp;= \exp\left(-\frac{\lVert r_i - r_j \rVert^2}{\tau}\right)\left(h_i^{(l)\top} W h_j^{(l)}\right) \\
\alpha_{ij} &amp;= \frac{\exp \hat{\alpha}_{ij}}{\sum_{k=1}^{N} \exp \hat{\alpha}_{ik}} \\
h_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_{ij} \left(W_v h_j^{(l)}\right)
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
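<p>A minimal single-head NumPy sketch of this distance-weighted attention (the encoder itself is multi-head; the matrix shapes and random inputs here are illustrative):</p>

```python
import numpy as np

def distance_aware_attention(h, r, W, Wv, tau=1.0):
    """One head of distance-aware self-attention: content scores are scaled
    by exp(-||r_i - r_j||^2 / tau) before the row-wise softmax.
    Shapes: h (N, d) hidden states, r (N, 3) coordinates, W and Wv (d, d)."""
    scores = h @ W @ h.T                            # (N, N) content scores
    d2 = np.sum((r[:, None] - r[None, :])**2, -1)   # squared pairwise distances
    scores = np.exp(-d2 / tau) * scores             # down-weight distant residues
    scores -= scores.max(axis=-1, keepdims=True)    # stability; softmax unchanged
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)      # row-wise softmax
    return alpha @ (h @ Wv)                         # (N, d) updated states

rng = np.random.default_rng(0)
N, d = 5, 8
out = distance_aware_attention(rng.normal(size=(N, d)),
                               rng.normal(size=(N, 3)),
                               rng.normal(size=(d, d)) * 0.1,
                               rng.normal(size=(d, d)) * 0.1)
```

<p>The Gaussian factor biases each residue toward spatially nearby residues, injecting 3D geometry into an otherwise standard attention layer.</p>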
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder determines the mean $\mu$ and standard deviation $\sigma$ of the latent variable $z$ for any (compound, protein) pair. During training, the model reconstructs the input compound; at application time, a seed compound conditions the latent code, enabling compound refinement. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \, \mathcal{D}_{\text{KL}}\left(q(z \mid \mathbf{x}, \mathbf{y}) \,\|\, p(z)\right)
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
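<p>Assuming a diagonal Gaussian posterior, the KL term has a closed form and the objective reduces to a reconstruction loss plus a weighted penalty; a minimal sketch with hypothetical values:</p>

```python
import numpy as np

def beta_vae_loss(recon_nll, mu, log_sigma, beta):
    """Sketch of the training objective: reconstruction NLL plus a
    beta-weighted KL between the diagonal Gaussian q(z|x,y) = N(mu, sigma^2)
    and the standard Gaussian prior p(z), computed in closed form."""
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)
    return recon_nll + beta * kl

# When q matches the prior exactly, the KL term vanishes.
loss = beta_vae_loss(recon_nll=1.25, mu=np.zeros(4), log_sigma=np.zeros(4), beta=0.1)
# → 1.25
```

<p>Sampling $z$ from $\mathcal{N}(\mu, \sigma^2)$ at generation time is what allows a single seed compound to yield many distinct refinements.</p>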
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
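<p>As an illustration, the diversity metric can be sketched with fingerprints represented as sets of on-bits; the paper uses RDKit Morgan fingerprints, so these toy sets are stand-ins:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    inter = len(fp_a.intersection(fp_b))
    return inter / (len(fp_a) + len(fp_b) - inter)

def diversity(fps):
    """Molecular diversity as 1 minus the mean pairwise Tanimoto similarity."""
    pairs = [(i, j) for i in range(len(fps)) for j in range(i + 1, len(fps))]
    return 1.0 - sum(tanimoto(fps[i], fps[j]) for i, j in pairs) / len(pairs)

# Toy on-bit sets standing in for Morgan fingerprints of three molecules.
div = diversity([{1, 2, 3}, {2, 3, 4}, {7, 8}])
```

<p>Higher values indicate a more structurally varied set of generated compounds.</p>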
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods, generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared with tens of minutes to hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted ClpP protease of Mycobacterium tuberculosis, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when removing the geometric attention term from Eq. 2.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>STONED: Training-Free Molecular Design with SELFIES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/</guid><description>STONED uses string mutations in the SELFIES representation for training-free molecular generation, interpolation, and chemical space exploration.</description><content:encoded><![CDATA[<h2 id="a-training-free-algorithm-for-molecular-generation">A Training-Free Algorithm for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces STONED (Superfast Traversal, Optimization, Novelty, Exploration and Discovery), a suite of algorithms for molecular generation and chemical space exploration. STONED operates entirely through string manipulations on the <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> molecular representation, avoiding the need for deep learning models, training data, or GPU resources. The key claim is that simple character-level mutations and interpolations in SELFIES can achieve results competitive with state-of-the-art deep generative models on standard benchmarks.</p>
<h2 id="why-deep-generative-models-may-be-overkill">Why Deep Generative Models May Be Overkill</h2>
<p>Deep generative models (VAEs, GANs, RNNs, reinforcement learning) have become popular for <a href="/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/">inverse molecular design</a>, but they come with practical costs: large training datasets, expensive GPU compute, and long training times. Fragile representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> compound the problem, since large portions of a latent space can map to invalid molecules. Even with the introduction of SELFIES (a 100% valid string representation), prior work still embedded it within neural network architectures.</p>
<p>The authors argue that for tasks like local chemical space exploration and molecular interpolation, the guarantees of SELFIES alone may be sufficient. Because every SELFIES string maps to a valid molecule, random character mutations always produce valid structures. This observation eliminates the need for learned generation procedures entirely.</p>
<h2 id="core-innovation-selfies-string-mutations-as-molecular-operators">Core Innovation: SELFIES String Mutations as Molecular Operators</h2>
<p>STONED relies on four key techniques built on SELFIES string manipulations:</p>
<p><strong>1. Random character mutations.</strong> A point mutation in SELFIES (character replacement, deletion, or addition) always yields a valid molecule. The position of mutations serves as a hyperparameter controlling exploration vs. exploitation: terminal character mutations preserve more structural similarity to the seed, while random mutations explore more broadly.</p>
<p><strong>2. Multiple SMILES orderings.</strong> A single molecule has many valid SMILES strings, each mapping to a different SELFIES. Generating 50,000 SMILES orderings and converting each to SELFIES before mutation substantially increases the diversity of generated structures.</p>
<p><strong>3. Deterministic interpolation.</strong> Given two SELFIES strings (padded to equal length), characters at equivalent positions can be successively replaced from the start molecule to the target molecule. Every intermediate string is a valid molecule. A chemical path is extracted by keeping only those intermediates that increase fingerprint similarity to the target.</p>
<p><strong>4. Fingerprint-based filtering.</strong> Since edit distance in SELFIES does not reflect molecular similarity, STONED uses fingerprint comparisons (ECFP4, FCFP4, atom-pair) to enforce structural similarity constraints.</p>
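<p>The deterministic interpolation of technique 3 can be sketched on token lists. Real use would encode and decode through the <code>selfies</code> package, and STONED additionally keeps only intermediates that increase fingerprint similarity to the target; the tokens and padding symbol here are illustrative:</p>

```python
def interpolate_tokens(start, target, pad="[nop]"):
    """Character-by-character interpolation between two token sequences:
    pad to equal length, then replace one position at a time from the
    start molecule toward the target. With real SELFIES tokens, every
    intermediate string decodes to a valid molecule."""
    n = max(len(start), len(target))
    cur = start + [pad] * (n - len(start))
    tgt = target + [pad] * (n - len(target))
    path = [cur[:]]
    for i in range(n):
        if cur[i] != tgt[i]:
            cur[i] = tgt[i]
            path.append(cur[:])
    return path

# Each step along the path changes exactly one token position.
path = interpolate_tokens(["[C]", "[C]", "[O]"], ["[C]", "[N]"])
```
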
<p>The authors also propose a revised joint molecular similarity metric for evaluating median molecules. Given $n$ reference molecules $M = \{m_1, m_2, \ldots, m_n\}$, the joint similarity of a candidate molecule $m$ is:</p>
<p>$$
F(m) = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(m_i, m) - \left[\max_{i} \text{sim}(m_i, m) - \min_{i} \text{sim}(m_i, m)\right]
$$</p>
<p>This penalizes candidates that are similar to only a subset of references, unlike the geometric mean metric used in GuacaMol which can yield high scores even with lopsided similarities.</p>
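<p>The metric itself is a one-liner over precomputed similarity values; the numbers below are hypothetical:</p>

```python
def joint_similarity(sims):
    """Revised joint similarity F(m): mean similarity to the n references
    minus the spread (max - min), penalizing candidates that resemble
    only a subset of the references. `sims` holds sim(m_i, m) values."""
    return sum(sims) / len(sims) - (max(sims) - min(sims))

balanced = joint_similarity([0.5, 0.5, 0.5])   # 0.5 - 0.0 = 0.5
lopsided = joint_similarity([0.9, 0.2, 0.4])   # ≈ 0.5 - 0.7 = -0.2
```

<p>Both candidates have the same mean similarity, but the spread penalty ranks the balanced one far higher, which is the behavior the geometric-mean metric fails to enforce.</p>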
<h2 id="experimental-setup-and-applications">Experimental Setup and Applications</h2>
<h3 id="local-chemical-subspace-formation">Local chemical subspace formation</h3>
<p>Starting from a single seed molecule (<a href="https://en.wikipedia.org/wiki/Aripiprazole">aripiprazole</a>, albuterol, mestranol, or <a href="https://en.wikipedia.org/wiki/Celecoxib">celecoxib</a>), the algorithm generates 50,000 SMILES orderings and performs 1-5 point mutations per ordering, producing 250,000 candidate strings. Unique valid molecules are filtered by fingerprint similarity thresholds.</p>
<table>
  <thead>
      <tr>
          <th>Starting structure</th>
          <th>Fingerprint</th>
          <th>Molecules at $\delta &gt; 0.75$</th>
          <th>Molecules at $\delta &gt; 0.60$</th>
          <th>Molecules at $\delta &gt; 0.40$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Aripiprazole (SELFIES, random)</td>
          <td>ECFP4</td>
          <td>513 (0.25%)</td>
          <td>4,206 (2.15%)</td>
          <td>34,416 (17.66%)</td>
      </tr>
      <tr>
          <td>Albuterol (SELFIES, random)</td>
          <td>FCFP4</td>
          <td>587 (0.32%)</td>
          <td>4,156 (2.33%)</td>
          <td>16,977 (9.35%)</td>
      </tr>
      <tr>
          <td>Mestranol (SELFIES, random)</td>
          <td>AP</td>
          <td>478 (0.22%)</td>
          <td>4,079 (1.90%)</td>
          <td>45,594 (21.66%)</td>
      </tr>
      <tr>
          <td>Celecoxib (SELFIES, random)</td>
          <td>ECFP4</td>
          <td>198 (0.10%)</td>
          <td>1,925 (1.00%)</td>
          <td>18,045 (9.44%)</td>
      </tr>
      <tr>
          <td>Celecoxib (SELFIES, terminal 10%)</td>
          <td>ECFP4</td>
          <td>864 (2.02%)</td>
          <td>9,407 (21.99%)</td>
          <td>34,187 (79.91%)</td>
      </tr>
  </tbody>
</table>
<p>Key finding: restricting mutations to terminal characters yields a 20x increase in high-similarity molecules compared to random positions. Compared to SMILES mutations (0.30% valid) and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> (1.44% valid), SELFIES mutations are all valid by construction.</p>
<p>A two-step expansion (mutating all unique first-round neighbors) produced over 17 million unique molecules, with 120,000 having similarity greater than 0.4 to celecoxib.</p>
<h3 id="chemical-path-formation-and-drug-design">Chemical path formation and drug design</h3>
<p>Deterministic SELFIES interpolation between <a href="https://en.wikipedia.org/wiki/Tadalafil">tadalafil</a> and <a href="https://en.wikipedia.org/wiki/Sildenafil">sildenafil</a> generated paths where <a href="https://en.wikipedia.org/wiki/Partition_coefficient">logP</a> and QED values varied smoothly. A more challenging application docked intermediates between <a href="https://en.wikipedia.org/wiki/Dihydroergotamine">dihydroergotamine</a> (<a href="https://en.wikipedia.org/wiki/5-HT1B_receptor">5-HT1B</a> binder) and prinomastat (<a href="https://en.wikipedia.org/wiki/CYP2D6">CYP2D6</a> binder), finding molecules with non-trivial binding affinity to both proteins without any optimization routine.</p>
<h3 id="median-molecules-for-photovoltaics">Median molecules for photovoltaics</h3>
<p>Using 100 triplets from the Harvard Clean Energy (HCE) dataset, each with one molecule optimized for high LUMO energy, one for high dipole moment, and one for high <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO-LUMO gap</a>, generalized chemical paths produced median molecules. These were evaluated with GFN2-xTB semiempirical calculations. The generated medians matched or exceeded the best molecules available in the HCE database in both structural similarity and target properties.</p>
<h3 id="guacamol-benchmarks">GuacaMol benchmarks</h3>
<p>Without any training, STONED achieved an overall <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> score of 14.70, competitive with several deep generative models. The approach simply identifies the single best molecule in the benchmark&rsquo;s training set and generates its local chemical subspace. 38% of the top-100 molecules from each benchmark passed compound quality filters, comparable to <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a> and SMILES GA.</p>
<h2 id="results-summary-and-limitations">Results Summary and Limitations</h2>
<p>STONED demonstrates that SELFIES string mutations can match or approach deep generative models on standard molecular design benchmarks while being orders of magnitude faster and requiring no training. The most expensive benchmark (aripiprazole subspace) completed in 500 seconds on a laptop CPU.</p>
<p>The method comparison table from the paper highlights STONED&rsquo;s unique position:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Expert Systems</th>
          <th>VAE</th>
          <th>GAN</th>
          <th>RL</th>
          <th>STONED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Expert rule-free</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Structure coverage</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Partial</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Interpolatability</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Property-based navigation</td>
          <td>Partial</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Partial</td>
      </tr>
      <tr>
          <td>Training-free</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Data independence</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>STONED lacks property-based navigation (gradient-guided optimization toward specific property targets). It can only do stochastic property optimization when wrapped in a genetic algorithm.</li>
<li>The success rate of mutations leading to structurally similar molecules is relatively low (0.1-2% at high similarity thresholds), though speed compensates.</li>
<li>Chemical paths can contain molecules with unstable functional groups or <a href="https://en.wikipedia.org/wiki/Tautomer">tautomerization</a> issues, requiring post-hoc filtering with domain-specific rules.</li>
<li>Fingerprint similarity does not capture all aspects of chemical similarity (3D geometry, reactivity, synthesizability).</li>
<li>The penalized logP and QED benchmarks used by GuacaMol do not represent the full complexity of practical molecular design.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Photovoltaics</td>
          <td>Harvard Clean Energy (HCE) database</td>
          <td>~2.3M molecules</td>
          <td>Used for median molecule triplet experiments</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>GuacaMol benchmark suite</td>
          <td>Varies per task</td>
          <td>Standard benchmarks for generative molecular design</td>
      </tr>
      <tr>
          <td>Comparison</td>
          <td>ChEMBL (SCScore &lt;= 2.5 subset)</td>
          <td>Fragment database</td>
          <td>Used for CReM comparison experiments</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Local subspace formation</strong>: 50,000 SMILES orderings per seed molecule, 1-5 SELFIES point mutations each, totaling 250,000 candidates per experiment.</li>
<li><strong>Chemical paths</strong>: Deterministic character-by-character interpolation between padded SELFIES strings, with monotonic fingerprint similarity filtering.</li>
<li><strong>Median molecules</strong>: Generalized paths between 3+ reference molecules using 10,000 paths per triplet with randomized SMILES orderings.</li>
<li><strong>Docking</strong>: <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a> with crystal structures from PDB (4IAQ for 5-HT1B, 3QM4 for CYP2D6). Top-5 binding poses averaged.</li>
<li><strong>Quantum chemistry</strong>: GFN2-xTB for dipole moments, LUMO energies, and HOMO-LUMO gaps.</li>
</ul>
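<p>As a rough illustration of the local-subspace step, the sketch below applies random insert/replace/delete point mutations to a token list. This is a pure-Python toy, not the authors' code: the tiny <code>ALPHABET</code>, the mutation operators, and the candidate count are stand-ins for the real implementation, which mutates SELFIES strings drawn from the full robust alphabet of the <code>selfies</code> package and randomizes SMILES orderings with RDKit.</p>

```python
import random

# Toy SELFIES-like alphabet (assumption); STONED draws mutations from the
# full semantically robust SELFIES alphabet instead.
ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[Branch1]", "[Ring1]"]

def point_mutations(tokens, n_mutations, rng):
    """Apply random insert/replace/delete point mutations to a token list."""
    tokens = list(tokens)
    for _ in range(n_mutations):
        op = rng.choice(["insert", "replace", "delete"])
        if op == "insert":
            tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(ALPHABET))
        elif op == "replace" and tokens:
            tokens[rng.randrange(len(tokens))] = rng.choice(ALPHABET)
        elif tokens:  # delete
            tokens.pop(rng.randrange(len(tokens)))
    return tokens

def local_subspace(seed_tokens, n_candidates=1000, max_mutations=5, seed=0):
    """Enumerate a local chemical subspace around one seed molecule by
    repeatedly applying 1..max_mutations point mutations."""
    rng = random.Random(seed)
    out = set()
    for _ in range(n_candidates):
        k = rng.randint(1, max_mutations)
        out.add("".join(point_mutations(seed_tokens, k, rng)))
    return out

space = local_subspace(["[C]", "[C]", "[O]"], n_candidates=200)
```

<p>In the full method, each candidate string would be decoded back to a molecule (SELFIES guarantees validity) and filtered by fingerprint similarity to the seed.</p>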
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GuacaMol overall score</td>
          <td>14.70</td>
          <td>Varies by model</td>
          <td>Competitive with deep generative models</td>
      </tr>
      <tr>
          <td>Quality filter pass rate</td>
          <td>38%</td>
          <td>Graph GA/SMILES GA comparable</td>
          <td>Top-100 molecules per benchmark</td>
      </tr>
      <tr>
          <td>Celecoxib neighbors ($\delta &gt; 0.75$)</td>
          <td>198-864</td>
          <td>CReM: 239</td>
          <td>Depends on mutation position strategy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All experiments were run on a laptop with an Intel i7-8750H CPU at 2.20 GHz; no GPU is required. The most expensive single experiment (the aripiprazole subspace) completed in 500 seconds.</p>

<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/stoned-selfies">stoned-selfies</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation of STONED algorithms</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nigam, A. K., Pollice, R., Krenn, M., dos Passos Gomes, G., &amp; Aspuru-Guzik, A. (2021). Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. <em>Chemical Science</em>, 12(20), 7079-7090. <a href="https://doi.org/10.1039/d1sc00231g">https://doi.org/10.1039/d1sc00231g</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{nigam2021stoned,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery ({STONED}) algorithm for molecules using {SELFIES}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nigam, AkshatKumar and Pollice, Robert and Krenn, Mario and dos Passos Gomes, Gabriel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{7079--7090}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d1sc00231g}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPECTRA: Evaluating Generalizability of Molecular AI</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/spectra-evaluating-generalizability-molecular-ai/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/spectra-evaluating-generalizability-molecular-ai/</guid><description>SPECTRA evaluates ML model generalizability on molecular datasets by plotting performance across a spectrum of train-test overlap levels.</description><content:encoded><![CDATA[<h2 id="a-spectral-framework-for-evaluating-molecular-ml-generalizability">A Spectral Framework for Evaluating Molecular ML Generalizability</h2>
<p>This is a <strong>Method</strong> paper that introduces SPECTRA (SPECtral framework for model evaluaTion on moleculaR dAtasets), a systematic approach for evaluating how well machine learning models generalize on molecular sequencing data. The primary contribution is a framework that generates train-test splits with controlled, decreasing levels of overlap, producing a spectral performance curve (SPC) and a single summary metric, the area under the spectral performance curve (AUSPC), for comparing model generalizability across tasks and architectures.</p>
<h2 id="why-existing-molecular-benchmarks-overestimate-generalizability">Why Existing Molecular Benchmarks Overestimate Generalizability</h2>
<p>Deep learning has achieved high performance on molecular sequencing benchmarks, but a persistent gap exists between benchmark performance and real-world deployment. The authors identify the root cause: existing evaluation approaches use either metadata-based (MB) splits or similarity-based (SB) splits, both of which provide an incomplete picture of generalizability.</p>
<p>MB splits partition data by metadata properties (e.g., temporal splits, random splits) without controlling sequence similarity between train and test sets. This means high train-test similarity can inflate performance metrics. SB splits control similarity at a single threshold, but the model&rsquo;s behavior at other similarity levels remains unknown.</p>
<p>For example, the TAPE benchmark&rsquo;s remote homology family split has 97% cross-split overlap, while the superfamily split has 71%. Model accuracy drops by 50% between these two points, yet the full curve of performance degradation is never characterized. This gap between evaluated and real-world overlap levels leads to overoptimistic deployment expectations, as demonstrated by the case of <a href="https://en.wikipedia.org/wiki/Rifampicin">rifampicin</a> resistance prediction in <em>M. tuberculosis</em>, where commercial genotypic assays later proved unreliable in specific geographic regions.</p>
<h2 id="the-spectra-framework-spectral-properties-graphs-and-performance-curves">The SPECTRA Framework: Spectral Properties, Graphs, and Performance Curves</h2>
<p>SPECTRA takes three inputs: a molecular sequencing dataset, a machine learning model, and a spectral property definition. A spectral property (SP) is a molecular sequence property expected to influence model generalizability for a specific task. For sequence-to-sequence datasets, the spectral property is typically sequence identity (proportion of aligned positions &gt; 0.3). For mutational scan datasets, it is defined by sample barcodes (string representations of mutations present in each sample).</p>
<h3 id="spectral-property-graph-construction">Spectral Property Graph Construction</h3>
<p>SPECTRA constructs a spectral property graph (SPG) where nodes represent samples and edges connect samples that share the spectral property. The goal is to generate train-test splits with controlled levels of cross-split overlap by finding approximate <a href="https://en.wikipedia.org/wiki/Maximal_independent_set">maximal independent sets</a> of this graph.</p>
<p>Finding a maximum independent set exactly is NP-hard, so SPECTRA uses a greedy randomized approximation parameterized by a spectral parameter $\mathbf{SP} \in [0, 1]$:</p>
<ol>
<li>Randomly order SPG vertices</li>
<li>Select the first vertex and delete each neighbor with probability equal to $\mathbf{SP}$</li>
<li>Continue until no vertices remain</li>
</ol>
<p>When $\mathbf{SP} = 0$, this produces a random split (maximum cross-split overlap). When $\mathbf{SP} = 1$, it approximates the maximal independent set (minimum cross-split overlap). For each spectral parameter value (incremented by 0.05 from 0 to 1), three splits with different random seeds are generated.</p>
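<p>The three-step greedy procedure above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: <code>adjacency</code> is the SPG as a dict of neighbor sets, and the function returns only the selected vertex set, omitting the downstream assignment of samples to train and test.</p>

```python
import random

def spectra_split(adjacency, sp, seed=0):
    """Greedy randomized independent-set selection on a spectral property
    graph. sp in [0, 1]: sp=0 deletes no neighbors (maximum cross-split
    overlap); sp=1 deletes all neighbors, approximating a maximal
    independent set (minimum cross-split overlap)."""
    rng = random.Random(seed)
    remaining = set(adjacency)
    order = list(adjacency)
    rng.shuffle(order)            # step 1: randomly order SPG vertices
    selected = set()
    for v in order:
        if v not in remaining:
            continue
        selected.add(v)           # step 2: select the next surviving vertex
        remaining.discard(v)
        for u in adjacency[v]:    # ...and delete each neighbor w.p. sp
            if u in remaining and rng.random() < sp:
                remaining.discard(u)
    return selected               # step 3: done when no vertices remain

# Toy SPG: a triangle (mutually overlapping samples) plus an isolated node.
spg = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
mis = spectra_split(spg, sp=1.0)  # one triangle vertex plus the isolated node
```

<p>At <code>sp=1.0</code> every neighbor of a selected vertex is deleted, so the triangle contributes exactly one vertex; at <code>sp=0.0</code> nothing is deleted and all vertices survive.</p>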
<h3 id="the-spectral-performance-curve-and-auspc">The Spectral Performance Curve and AUSPC</h3>
<p>The model is trained and evaluated on each split. Plotting test performance against the spectral parameter produces the spectral performance curve (SPC). The area under this curve, the AUSPC, serves as a single summary metric for model generalizability that captures behavior across the full spectrum of train-test overlap.</p>
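<p>A minimal way to collapse an SPC into the AUSPC is numerical quadrature over the sampled spectral parameters. Trapezoidal integration, as below, is an assumption of this sketch rather than a detail taken from the paper; any standard quadrature gives a comparable summary.</p>

```python
def auspc(spectral_params, performances):
    """Area under the spectral performance curve via the trapezoidal rule.
    Inputs are parallel lists: spectral parameter values and the test
    performance measured on the corresponding split."""
    pairs = sorted(zip(spectral_params, performances))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area

# A flat curve at 0.8 over the full [0, 1] range integrates to 0.8.
sps = [i * 0.05 for i in range(21)]  # 0.05 increments, as in the paper
flat = auspc(sps, [0.8] * 21)
```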
<h3 id="handling-mutational-scan-datasets">Handling Mutational Scan Datasets</h3>
<p>For mutational scan datasets where sample barcodes map to multiple samples, SPECTRA introduces two modifications: (1) weighting nodes in the SPG by the number of samples they represent, and (2) running a subset sum algorithm to ensure 80/20 train-test splits by sample count.</p>
<h2 id="evaluation-across-18-datasets-and-19-models">Evaluation Across 18 Datasets and 19 Models</h2>
<p>The authors apply SPECTRA to 18 molecular sequencing datasets spanning three benchmarks (TAPE, PEER, ProteinGym) plus PDBBind, evaluating 19 models including CNNs, LSTMs, GNNs (GearNet), LLMs (ESM2), diffusion models (DiffDock), variational autoencoders (EVE), and logistic regression.</p>
<h3 id="benchmark-datasets">Benchmark Datasets</h3>
<p>The core evaluation covers five primary tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Dataset</th>
          <th>Type</th>
          <th>Metric</th>
          <th>Samples</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Rifampicin resistance (RIF)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>17,474</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Isoniazid">Isoniazid</a> resistance (INH)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>26,574</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Pyrazinamide">Pyrazinamide</a> resistance (PZA)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>12,146</td>
      </tr>
      <tr>
          <td>Fluorescence prediction</td>
          <td><a href="https://en.wikipedia.org/wiki/Green_fluorescent_protein">GFP</a> variants</td>
          <td>MSD</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>54,024</td>
      </tr>
      <tr>
          <td>Vaccine escape</td>
          <td>SARS-CoV-2 RBD</td>
          <td>MSD</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>438,046</td>
      </tr>
  </tbody>
</table>
<p>Additional benchmarks include remote homology detection, secondary structure prediction, subcellular localization, and protein-ligand binding (PDBBind, Astex diverse set, Posebusters).</p>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>Eight models were evaluated in depth across the five primary tasks: logistic regression, CNN, ESM2 (pretrained), ESM2-Finetuned, GearNet, GearNet-Finetuned, EVE, and SeqDesign. Additional models (LSTM, ResNet, DeepSF, Transformer, HHblits, Equibind, DiffDock, TankBind, Transception, MSA Transformer, ESM1v, Progen2) were evaluated on specific benchmark tasks.</p>
<h3 id="existing-splits-as-points-on-the-spc">Existing Splits as Points on the SPC</h3>
<p>SPECTRA reveals that existing benchmark splits correspond to specific points on the spectral performance curve. For instance:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Benchmark Split</th>
          <th>Cross-Split Overlap</th>
          <th>Spectral Parameter</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Remote homology</td>
          <td>TAPE family</td>
          <td>97%</td>
          <td>0.025</td>
      </tr>
      <tr>
          <td>Remote homology</td>
          <td>TAPE superfamily</td>
          <td>71%</td>
          <td>0.475</td>
      </tr>
      <tr>
          <td>Secondary structure</td>
          <td>CASP12</td>
          <td>48%</td>
          <td>0.5</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>Equibind temporal</td>
          <td>76%</td>
          <td>0.55</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>LPPDBind similarity</td>
          <td>91%</td>
          <td>0.275</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>Posebusters</td>
          <td>70%</td>
          <td>0.575</td>
      </tr>
  </tbody>
</table>
<h2 id="performance-degradation-and-foundation-model-insights">Performance Degradation and Foundation Model Insights</h2>
<h3 id="universal-performance-decline">Universal Performance Decline</h3>
<p>All evaluated models demonstrate decreased performance as cross-split overlap decreases. Logistic regression drops from AUROC &gt; 0.9 to 0.5 for rifampicin resistance. ESM2-Finetuned decreases from Spearman&rsquo;s $\rho &gt; 0.9$ to less than 0.4 for GFP fluorescence prediction.</p>
<p>No single model achieves the highest AUSPC across all tasks. CNN maintains AUSPC &gt; 0.6 across all tasks but is surpassed by ESM2-Finetuned and ESM2 on rifampicin resistance. Some models retain reasonable performance even at $\mathbf{SP} = 1$ (minimal overlap): ESM2, ESM2-Finetuned, and CNN maintain AUROC &gt; 0.7 for RIF and PZA at this extreme.</p>
<h3 id="uncovering-hidden-spectral-properties">Uncovering Hidden Spectral Properties</h3>
<p>SPECTRA can detect unconsidered spectral properties through high variance in model performance at fixed spectral parameters. For rifampicin resistance, the CNN shows high variance at $\mathbf{SP} = 0.9$, $0.95$, and $1.0$ (standard deviations of 0.09, 0.10, and 0.08 respectively).</p>
<p>The authors trace this to the rifampicin resistance determining region (RRDR), a 26-amino-acid region of the rpoB gene. They define diff-RRDR as:</p>
<p>$$
\text{diff-RRDR} = \left(\max\left(\text{position}_{\text{train}}\right) - \max\left(\text{position}_{\text{test}}\right)\right) + \left(\min\left(\text{position}_{\text{train}}\right) - \min\left(\text{position}_{\text{test}}\right)\right)
$$</p>
<p>diff-RRDR correlates with CNN performance variance (Spearman&rsquo;s $\rho = -0.51$, p-value $= 1.79 \times 10^{-5}$) but not with ESM2 performance. The authors attribute this to ESM2&rsquo;s larger context window (512 positions vs. CNN&rsquo;s 12), making it more invariant to positional shifts in resistance-determining mutations.</p>
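<p>The definition translates directly into code; <code>train_positions</code> and <code>test_positions</code> below are hypothetical lists of mutated RRDR positions present in each split.</p>

```python
def diff_rrdr(train_positions, test_positions):
    """diff-RRDR as defined above: the offset between the ranges of
    mutated positions covered by the train and test splits."""
    return ((max(train_positions) - max(test_positions))
            + (min(train_positions) - min(test_positions)))

d = diff_rrdr([5, 10, 20], [8, 12])  # (20 - 12) + (5 - 8)
```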
<h3 id="foundation-model-generalizability">Foundation Model Generalizability</h3>
<p>For protein foundation models, SPECTRA reveals that AUSPC correlates with the similarity between task-specific datasets and the pretraining dataset. ESM2&rsquo;s AUSPC varies from 0.91 (RIF) to 0.26 (SARS-CoV-2). The correlation between UniRef50 overlap and AUSPC is strong (Spearman&rsquo;s $\rho = 0.9$, p-value $= 1.4 \times 10^{-27}$).</p>
<p>This finding holds across multiple foundation models (Transception, MSA Transformer, ESM1v, Progen2) evaluated on five ProteinGym datasets (Spearman&rsquo;s $\rho = 0.9$, p-value $= 0.04$). Fine-tuning improves AUSPC for tasks with low pretraining overlap (PZA, SARS-CoV-2, GFP).</p>
<h3 id="computational-cost">Computational Cost</h3>
<p>Generating SPECTRA splits ranges from 5 minutes (amyloid beta aggregation) to 9 hours (PDBBind). Generating spectral performance curves ranges from 1 hour (logistic regression) to 5 days (ESM2-Finetuned). The authors recommend releasing SPECTRA splits alongside new benchmarks to amortize this cost.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Spectral property selection is pivotal</strong>: The choice of spectral property must be biologically informed and task-specific. Standardized definitions across the community are needed.</li>
<li><strong>Computational cost</strong>: Running SPECTRA is expensive, especially for large models. The authors mitigate this with multi-core CPU parallelization and multi-GPU training.</li>
<li><strong>Not a model ranking tool</strong>: SPECTRA is designed for understanding generalizability patterns, not for ranking models. Proper ranking requires averaging AUSPCs across many tasks in a standardized benchmark.</li>
<li><strong>Spectral parameter vs. cross-split overlap</strong>: The minimal achievable cross-split overlap varies across tasks, so SPECTRA plots performance against the spectral parameter rather than against overlap directly. As a result, the AUSPC reflects the impact on performance per unit change in the spectral parameter, not per unit decrease in overlap.</li>
</ul>
<p>The authors envision SPECTRA as a foundation for next-generation molecular benchmarks that explicitly characterize generalizability across the full spectrum of distribution shift, applicable beyond molecular data to small molecule therapeutics, inverse protein folding, and patient-level clinical datasets.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All data used in this study is publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TB RIF resistance</td>
          <td>17,474 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TB INH resistance</td>
          <td>26,574 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TB PZA resistance</td>
          <td>12,146 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>GFP fluorescence</td>
          <td>54,024 samples</td>
          <td>From Sarkisyan et al. (2016)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>SARS-CoV-2 escape</td>
          <td>438,046 samples</td>
          <td>From Greaney et al. (2021)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>TAPE (remote homology, secondary structure)</td>
          <td>Various</td>
          <td>From Rao et al. (2019)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>PEER (subcellular localization)</td>
          <td>13,949 samples</td>
          <td>From Xu et al. (2022)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>ProteinGym (amyloid, RRM)</td>
          <td>Various</td>
          <td>From Notin et al. (2022)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>PDBBind (protein-ligand binding)</td>
          <td>14,993-16,742 complexes</td>
          <td>From Wang et al. (2005)</td>
      </tr>
  </tbody>
</table>
<p>Data is also available on <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN">Harvard Dataverse</a>.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Spectral property comparison uses Biopython pairwise alignment (match=1, mismatch=-2, gap=-2.5) with a 0.3 similarity threshold for sequence-to-sequence datasets</li>
<li>Greedy randomized maximal independent set approximation for split generation</li>
<li>Spectral parameter incremented in 0.05 steps from 0 to 1</li>
<li>Three random seeds per spectral parameter value</li>
<li>80/20 train-test split ratio enforced via subset sum for mutational scan datasets</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>ESM2: 650M parameter version from Lin et al. (2023)</li>
<li>ESM2-Finetuned: First 30 layers frozen, masked language head replaced with linear prediction layer</li>
<li>GearNet and GearNet-Finetuned: Protein structures generated via ESMFold</li>
<li>CNN: Architecture from Green et al. (2022), one-hot encoded sequences</li>
<li>Logistic regression: One-hot encoded mutational barcodes</li>
<li>EVE and SeqDesign: MSAs constructed via Jackhmmer against UniRef100</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>TB resistance (RIF, INH, PZA)</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>GFP fluorescence, SARS-CoV-2 escape</td>
          <td>Regression tasks</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Remote homology, secondary structure, subcellular localization</td>
          <td>Per-label/class accuracy</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Protein-ligand binding</td>
          <td>Predicted vs. actual complex</td>
      </tr>
      <tr>
          <td>AUSPC</td>
          <td>All tasks</td>
          <td>Area under spectral performance curve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Most models: 1x Tesla A10 GPU</li>
<li>ESM2-Finetuned: 4x Tesla A100 GPUs on Azure cluster</li>
<li>Hyperparameter optimization: Weights &amp; Biases random search over learning rate</li>
<li>All code in PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mims-harvard/SPECTRA">SPECTRA Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework implementation and reproduction scripts</td>
      </tr>
      <tr>
          <td><a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN">Harvard Dataverse</a></td>
          <td>Dataset</td>
          <td>CC0 1.0</td>
          <td>All datasets and generated splits</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ektefaie, Y., Shen, A., Bykova, D., Marin, M. G., Zitnik, M., &amp; Farhat, M. (2024). Evaluating generalizability of artificial intelligence models for molecular datasets. <em>Nature Machine Intelligence</em>, 6(12), 1512-1524. <a href="https://doi.org/10.1038/s42256-024-00931-6">https://doi.org/10.1038/s42256-024-00931-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ektefaie2024evaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evaluating generalizability of artificial intelligence models for molecular datasets}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin, Maximillian G. and Zitnik, Marinka and Farhat, Maha}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1512--1524}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00931-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Perplexity for Molecule Ranking and CLM Bias Detection</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/perplexity-molecule-ranking-bias-clms/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/perplexity-molecule-ranking-bias-clms/</guid><description>Perplexity scoring enables intrinsic molecule ranking and pretraining bias detection in chemical language models for de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-method-for-intrinsic-scoring-and-bias-detection-in-chemical-language-models">A Method for Intrinsic Scoring and Bias Detection in Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces two contributions to the chemical language model (CLM) pipeline for <a href="/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/">de novo molecular design</a>. First, the authors propose using perplexity as a model-intrinsic score to rank generated <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings by how well they match the design objectives encoded in the fine-tuning data. Second, they introduce a &ldquo;delta score&rdquo; that compares molecule rankings from pretrained and fine-tuned CLMs to detect pretraining bias, where molecules are generated primarily based on generic pretraining knowledge rather than task-specific fine-tuning objectives.</p>
<h2 id="the-ranking-and-bias-problem-in-clm-based-molecule-generation">The Ranking and Bias Problem in CLM-Based Molecule Generation</h2>
<p>Chemical language models generate new molecules as SMILES strings by iteratively predicting the next character based on learned probability distributions. After training, CLMs can produce large virtual libraries of candidate molecules via multinomial sampling. However, two key challenges remain: (1) the generated molecules lack a natural ranking, requiring external scoring methods such as similarity assessment or activity prediction for prioritization, and (2) <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">transfer learning</a> (pretraining on a large corpus followed by fine-tuning on a small target set) can introduce &ldquo;pretraining bias,&rdquo; where some generated molecules reflect generic chemical knowledge from pretraining rather than the specific design objectives of the fine-tuning data.</p>
<p>Beam search offers an alternative sampling approach that produces inherently ranked molecules by greedily selecting the most probable SMILES strings. However, beam search explores only a narrow portion of chemical space. The authors sought to combine the ranking advantage of beam search with the chemical space exploration of multinomial sampling by applying perplexity scoring as a post-hoc ranking criterion.</p>
<h2 id="perplexity-scoring-and-the-delta-score-for-bias-estimation">Perplexity Scoring and the Delta Score for Bias Estimation</h2>
<p>The core innovation is the application of <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a>, a standard evaluation metric from natural language processing, to score SMILES strings generated by CLMs. For a SMILES string of length $N$ with character probabilities $p_i$ assigned by the CLM, perplexity is computed as:</p>
<p>$$
\text{perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_{2}(p_{i})}
$$</p>
<p>Low perplexity indicates that the CLM assigns high probability to each character in the SMILES string, suggesting the molecule closely matches the learned distribution of the fine-tuning data. The metric is normalized by string length, making it comparable across molecules of different sizes.</p>
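<p>In code, the length-normalized perplexity of a single SMILES string is a short reduction over the per-character probabilities. This sketch assumes base-2 logarithms to match the $2^{(\cdot)}$ form of the formula:</p>

```python
import math

def smiles_perplexity(char_probs):
    """Length-normalized perplexity of one SMILES string, given the
    probability the CLM assigned to each of its characters."""
    n = len(char_probs)
    return 2 ** (-sum(math.log2(p) for p in char_probs) / n)

# A string whose every character receives probability 0.5 has perplexity 2:
# the model is, on average, choosing between two equally likely characters.
pp = smiles_perplexity([0.5, 0.5, 0.5, 0.5])
```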
<p>To address pretraining bias, the authors introduce a delta score. For each generated molecule, the perplexity-based rank from the fine-tuned model ($\text{rank}_{ft}$) is compared against the rank from the pretrained model ($\text{rank}_{pt}$):</p>
<p>$$
\text{delta} = \text{rank}_{ft} - \text{rank}_{pt}
$$</p>
<p>A positive delta score indicates that the fine-tuned model ranks the molecule higher than the pretrained model, suggesting the molecule was generated based on task-specific fine-tuning knowledge. A negative delta score flags molecules that may have been generated primarily from pretraining information, which do not necessarily match the design objectives.</p>
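<p>A minimal sketch of the delta score: rank the same set of generated molecules by perplexity under each model, then subtract the ranks per molecule. The paper does not pin down the rank orientation here, so the convention below (rank 1 = lowest perplexity) and the resulting signs are illustrative assumptions.</p>

```python
def ranks(values):
    """Rank positions for a list of scores (rank 1 = lowest value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def delta_scores(ppl_finetuned, ppl_pretrained):
    """delta = rank_ft - rank_pt per molecule, ranking by perplexity
    under the fine-tuned and pretrained CLMs respectively."""
    rf, rp = ranks(ppl_finetuned), ranks(ppl_pretrained)
    return [a - b for a, b in zip(rf, rp)]

# Molecule 0 is the fine-tuned model's best match but the pretrained
# model's worst, so its rank shifts the most between the two models.
d = delta_scores([1.2, 3.0, 5.0], [9.0, 2.0, 4.0])
```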
<p>The multinomial sampling probability for each character is computed via the softmax function:</p>
<p>$$
p_{i} = \frac{e^{z_{i}/T}}{\sum_{j} e^{z_{j}/T}}
$$</p>
<p>where $z_{i}$ is the CLM output logit for the $i$th character, $j$ runs over all dictionary characters, and $T$ is the temperature parameter (set to $T = 1$ in this study).</p>
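<p>The temperature-scaled softmax can be sketched as below. The max-subtraction is a standard numerical-stability trick, not something specified in the paper; raising $T$ above 1 flattens the distribution and makes sampling more exploratory.</p>

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Sampling distribution over the character dictionary from CLM logits."""
    m = max(z / T for z in logits)                 # stabilize the exponent
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax_with_temperature([2.0, 1.0, 0.0], T=1.0)
```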
<h2 id="experimental-setup-10-protein-targets-across-four-data-regimes">Experimental Setup: 10 Protein Targets Across Four Data Regimes</h2>
<p>The authors systematically evaluated perplexity scoring across 10 macromolecular targets and four low-data fine-tuning regimes (5, 10, 20, and 40 molecules per target).</p>
<p><strong>Model architecture</strong>: A four-layer LSTM-based RNN (5,820,515 parameters) with batch normalization layers, LSTM layers of 1024 and 256 units, trained using the Adam optimizer with a learning rate of $10^{-4}$.</p>
<p><strong>Pretraining</strong>: The model was pretrained on 1,683,181 molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> (version 28), encoded as canonical SMILES (20-90 characters), for 90 epochs.</p>
<p><strong>Fine-tuning</strong>: For each of 10 randomly selected protein targets (Table 1), bioactive ligands with pChEMBL &gt; 6 were selected. Fine-tuning sets of 5, 10, 20, and 40 molecules were compiled for each target. Fine-tuning ran for 100 epochs, with 1,000 SMILES strings sampled every second epoch via multinomial sampling ($T = 1$).</p>
<table>
  <thead>
      <tr>
          <th>CHEMBL ID</th>
          <th>Target</th>
          <th>Protein Classification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CHEMBL1836</td>
          <td>Prostanoid EP4 receptor</td>
          <td><a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">G protein-coupled receptor</a></td>
      </tr>
      <tr>
          <td>CHEMBL1945</td>
          <td>Melatonin receptor 1A</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL1983</td>
          <td>Serotonin 1D (5-HT1D) receptor</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL202</td>
          <td><a href="https://en.wikipedia.org/wiki/Dihydrofolate_reductase">Dihydrofolate reductase</a></td>
          <td>Oxidoreductase</td>
      </tr>
      <tr>
          <td>CHEMBL3522</td>
          <td><a href="https://en.wikipedia.org/wiki/Cytochrome_P450">Cytochrome P450</a> 17A1</td>
          <td>Cytochrome P450</td>
      </tr>
      <tr>
          <td>CHEMBL4029</td>
          <td>Interleukin-8 receptor A</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL5073</td>
          <td>CaM kinase I delta</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5137</td>
          <td>Metabotropic glutamate receptor 2</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL5408</td>
          <td>Serine/threonine-protein kinase TBK1</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5608</td>
          <td>NT-3 growth factor receptor</td>
          <td>Kinase</td>
      </tr>
  </tbody>
</table>
<p><strong>Sampling comparison</strong>: Beam search sampling was performed with beam widths $k = 10$ and $k = 50$ for comparison against multinomial sampling.</p>
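<p>For intuition about the comparison, beam search over a character model can be sketched as below. Here <code>next_probs</code> is a toy stand-in for the CLM's conditional distribution, not the authors' implementation:</p>

```python
import math

def beam_search(next_probs, k=10, max_len=5, eos="$"):
    """Toy beam search over a character model.

    next_probs(prefix) must return {char: probability} for the next
    character given a prefix. Returns finished sequences sorted by
    cumulative log-probability, best first.
    """
    beams = [("", 0.0)]  # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for ch, p in next_probs(prefix).items():
                cand = (prefix + ch, logp + math.log(p))
                (finished if ch == eos else candidates).append(cand)
        # Keep only the k most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)
```

Because beam search deterministically keeps the highest-probability continuations, it concentrates on a few modes; multinomial sampling instead draws diverse sequences in proportion to their probability.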
<p><strong>Molecular similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> was computed using Morgan fingerprints (radius 2, length 1024) and 2D <a href="https://en.wikipedia.org/wiki/Pharmacophore">pharmacophore</a> fingerprints via RDKit (2019.03.2).</p>
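<p>The Tanimoto similarity itself reduces to a Jaccard index over fingerprint bits. A dependency-free sketch over sets of on-bit indices (the paper computes the bits with RDKit Morgan fingerprints, which are not reproduced here):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given
    as sets of on-bit indices: |intersection| / |union|.

    In the paper the bit sets come from RDKit Morgan fingerprints
    (radius 2, 1024 bits); plain Python sets stand in for them here.
    """
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention for two empty fingerprints
    return len(a & b) / len(a | b)
```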
<h2 id="key-findings-multinomial-sampling-outperforms-beam-search">Key Findings: Multinomial Sampling Outperforms Beam Search</h2>
<p><strong>Perplexity correlates with molecular similarity.</strong> The Pearson correlation between perplexity and Tanimoto distance to the fine-tuning set stabilized at approximately 0.5 across all data regimes. This correlation emerged earlier with larger fine-tuning sets. The result confirms that perplexity captures both substructural and pharmacophore features while also incorporating additional CLM-learned information.</p>
<p><strong>Multinomial sampling produces better-ranked molecules than beam search.</strong> With the smallest fine-tuning sets (5 molecules), the top 50 molecules from multinomial sampling consistently exhibited lower (better) perplexity values than beam search at $k = 10$ or $k = 50$. Increasing the beam width from 10 to 50 did not markedly improve beam search performance. For novel molecules (Tanimoto similarity below 50% to the nearest fine-tuning compound), multinomial sampling identified lower-perplexity molecules in 72% of cases with the smallest fine-tuning sets.</p>
<p><strong>Perplexity scoring narrows the quality distribution.</strong> The top 50 molecules selected by perplexity from multinomial sampling spanned a narrower range of perplexity values compared to beam search, suggesting a more consistent pool of high-quality candidates for follow-up synthesis.</p>
<p><strong>Pretraining bias is substantial.</strong> The delta score analysis revealed that more than 40% of sampled molecules had negative delta scores during the first 20 fine-tuning epochs, meaning they were ranked higher by the pretrained model than the fine-tuned model. This fraction remained above 10% even at the end of 100 fine-tuning epochs across all data regimes, confirming that 10-40% of generated molecules reflect &ldquo;generic&rdquo; pretraining rather than task-focused fine-tuning.</p>
<p><strong>Perplexity alone partially mitigates bias.</strong> Among the top 50 molecules selected by perplexity from multinomial sampling, only up to 3% had negative delta scores, compared to 10-40% in the unfiltered population. This suggests that perplexity-based ranking already reduces pretraining bias, though the delta score provides additional filtering power.</p>
<p><strong>SMILES validity remained high.</strong> Mean SMILES string validity consistently exceeded 90% across all fine-tuned models and fine-tuning epochs.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note several limitations and future directions. The study used a fixed temperature of $T = 1$ for multinomial sampling; combining perplexity with temperature tuning or <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">SMILES augmentation</a> remains unexplored. The evaluation focused on 10 protein targets, and broader validation across diverse target classes would strengthen the conclusions. The authors also suggest that combining CLMs with perplexity scoring could be applied to screen large collections of commercially available compounds, which has not yet been tested.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v28</td>
          <td>1,683,181 molecules</td>
          <td>Canonical SMILES, 20-90 characters, salts and duplicates removed</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>ChEMBL v28 (split)</td>
          <td>84,160 molecules</td>
          <td>Random split from pretraining set</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ChEMBL v28 (per target)</td>
          <td>5, 10, 20, or 40 molecules</td>
          <td>pChEMBL &gt; 6, 10 targets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>LSTM-based CLM with character-level SMILES prediction</li>
<li>Multinomial sampling at $T = 1$</li>
<li>Beam search at $k = 10$ and $k = 50$</li>
<li>Perplexity computed per Equation 1; delta score per Equation 2</li>
<li>Adam optimizer, learning rate $10^{-4}$, 90 pretraining epochs, 100 fine-tuning epochs</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4-layer LSTM RNN: batch normalization, LSTM (1024 units), LSTM (256 units), batch normalization</li>
<li>5,820,515 parameters total</li>
<li>One-hot encoded SMILES input</li>
<li>Pretrained weights available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perplexity</td>
          <td>Model confidence in generated SMILES</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Delta score</td>
          <td>Rank difference between fine-tuned and pretrained models</td>
          <td>Positive indicates task-relevant generation</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Morgan and pharmacophore fingerprints</td>
          <td>Compared to fine-tuning set</td>
      </tr>
      <tr>
          <td>Pearson correlation</td>
          <td>Perplexity vs. Tanimoto distance</td>
          <td>Stabilizes at ~0.5</td>
      </tr>
      <tr>
          <td>SMILES validity</td>
          <td>Fraction of valid SMILES strings</td>
          <td>Consistently &gt; 90%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The implementation uses Keras (v2.2.0) with TensorFlow GPU backend (v1.9.0).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ETHmodlab/CLM_perplexity">CLM_perplexity</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework, pretrained weights, and training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ETHmodlab/molecular_design_with_beam_search">Beam search implementation</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Referenced beam search implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Moret, M., Grisoni, F., Katzberger, P., &amp; Schneider, G. (2022). Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models. <em>Journal of Chemical Information and Modeling</em>, 62(5), 1199-1206. <a href="https://doi.org/10.1021/acs.jcim.2c00079">https://doi.org/10.1021/acs.jcim.2c00079</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ETHmodlab/CLM_perplexity">GitHub: CLM_perplexity (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{moret2022perplexity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Moret, Michael and Grisoni, Francesca and Katzberger, Paul and Schneider, Gisbert}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1199--1206}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c00079}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph-Based GA and MCTS Generative Model for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/</guid><description>Jensen introduces a graph-based genetic algorithm and generative model with MCTS that outperforms ML methods for penalized logP optimization.</description><content:encoded><![CDATA[<h2 id="a-graph-based-approach-to-molecular-optimization">A Graph-Based Approach to Molecular Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces two graph-based approaches for exploring chemical space: a genetic algorithm (GB-GA) and a generative model combined with <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo tree search</a> (GB-GM-MCTS). The primary contribution is demonstrating that these non-ML, graph-based methods can match or exceed the performance of contemporary ML-based generative models for molecular property optimization, while being several orders of magnitude faster. The paper provides open-source implementations built on the RDKit cheminformatics package. The two approaches explore <a href="https://en.wikipedia.org/wiki/Chemical_space">chemical space</a> using direct graph manipulations rather than string-based representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</p>
<h2 id="why-compare-simple-baselines-to-ml-generative-models">Why Compare Simple Baselines to ML Generative Models?</h2>
<p>By 2018, several ML-based generative models for molecules had been published, including VAEs, RNNs, and graph convolutional policy networks. However, these models were rarely compared against traditional optimization approaches such as genetic algorithms. Jensen identifies this gap explicitly: while ML generative model performance had been impressive, the lack of comparison to simpler baselines made it difficult to assess whether the complexity of ML approaches was justified.</p>
<p>A practical barrier to such comparisons was the absence of free, open-source GA implementations for molecular optimization (the existing ACSESS algorithm required proprietary OpenEye toolkits). This paper fills that gap by providing RDKit-based implementations of both the GB-GA and GB-GM-MCTS.</p>
<h2 id="graph-based-crossovers-mutations-and-monte-carlo-tree-search">Graph-Based Crossovers, Mutations, and Monte Carlo Tree Search</h2>
<h3 id="gb-ga-crossovers-and-mutations-on-molecular-graphs">GB-GA: Crossovers and Mutations on Molecular Graphs</h3>
<p>The GB-GA operates directly on molecular graph representations (not string representations like SMILES). It combines ideas from Brown et al. (2004) and the ACSESS algorithm of Virshup et al. (2013).</p>
<p><strong>Crossovers</strong> can occur at two types of positions with equal probability:</p>
<ul>
<li>Non-ring bonds: a molecule is cut at a non-ring bond, and fragments from two parent molecules are recombined</li>
<li>Ring bonds: adjacent bonds or bonds separated by one bond are cut, and fragments are mated using single or double bonds</li>
</ul>
<p><strong>Mutations</strong> include seven operation types, each with specified probabilities:</p>
<ul>
<li>Append atom (15%): adds an atom with a single, double, or triple bond</li>
<li>Insert atom (15%): inserts an atom into an existing bond</li>
<li>Delete atom (14%): removes an atom, reconnecting neighbors</li>
<li>Change atom type (14%): swaps element identity (C, N, O, F, S, Cl, Br)</li>
<li>Change bond order (14%): toggles between single, double, and triple bonds</li>
<li>Delete ring bond (14%): opens a ring</li>
<li>Add ring bond (14%): closes a new ring</li>
</ul>
<p>Molecules with macrocycles (seven or more atoms), allene centers in rings, fewer than five heavy atoms, incorrect valences, or more non-H atoms than the target size are discarded. The target size is sampled from a normal distribution with mean 39.15 and standard deviation 3.50 non-H atoms, calibrated to match the molecules found by Yang et al. (2017).</p>
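<p>The mutation-type probabilities and target-size distribution above can be sketched as follows (illustrative helper names, not the paper's code):</p>

```python
import random

# The seven GB-GA mutation types with the probabilities from the
# paper (0.15 + 0.15 + 5 * 0.14 = 1.00).
MUTATIONS = [("append_atom", 0.15), ("insert_atom", 0.15),
             ("delete_atom", 0.14), ("change_atom", 0.14),
             ("change_bond", 0.14), ("delete_ring_bond", 0.14),
             ("add_ring_bond", 0.14)]

def pick_mutation(rng=random):
    """Choose one mutation type according to its probability."""
    names, weights = zip(*MUTATIONS)
    return rng.choices(names, weights=weights, k=1)[0]

def sample_target_size(rng=random):
    """Target molecule size (non-H atoms) drawn from the paper's
    normal distribution, clamped at the 5-heavy-atom minimum."""
    return max(5, round(rng.gauss(39.15, 3.50)))
```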
<h3 id="gb-gm-mcts-a-probabilistic-growth-model-with-tree-search">GB-GM-MCTS: A Probabilistic Growth Model with Tree Search</h3>
<p>The GB-GM grows molecules one atom at a time, with the choice of bond order and atom type determined probabilistically from a bonding analysis of a reference dataset (the first 1000 molecules from ZINC). Since 63% of atoms in the reference set are ring atoms, ring-creation or ring-insertion mutations are chosen 63% of the time.</p>
<p>The generative model is combined with a <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo tree search</a> where:</p>
<ul>
<li>Each node corresponds to an atom addition step</li>
<li>Leaf parallelization uses a maximum of 25 leaf nodes</li>
<li>The exploration factor is $1 / \sqrt{2}$</li>
<li>Rollout terminates if the molecule exceeds the target size</li>
<li>The reward function returns 1 if the predicted $J(\mathbf{m})$ value exceeds the largest value found so far, and 0 otherwise</li>
</ul>
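<p>The selection step implied by these settings can be sketched with a standard UCT score, assuming the common UCB1-style variant with exploration factor $c = 1/\sqrt{2}$ (the exact formula in the paper's implementation may differ):</p>

```python
import math

def uct_score(child_value, child_visits, parent_visits,
              c=1 / math.sqrt(2)):
    """UCT selection score for one child node.

    child_value accumulates the binary rewards (1 only when a rollout
    beat the best J(m) found so far), so exploitation favors branches
    that have repeatedly produced a new best molecule.
    """
    if child_visits == 0:
        return float("inf")  # unvisited children are tried first
    exploit = child_value / child_visits
    explore = c * math.sqrt(2 * math.log(parent_visits) / child_visits)
    return exploit + explore
```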
<h3 id="the-penalized-logp-objective">The Penalized logP Objective</h3>
<p>Both methods optimize the penalized logP score $J(\mathbf{m})$:</p>
<p>$$
J(\mathbf{m}) = \log P(\mathbf{m}) - \text{SA}(\mathbf{m}) - \text{RingPenalty}(\mathbf{m})
$$</p>
<p>where $\log P(\mathbf{m})$ is the <a href="https://en.wikipedia.org/wiki/Partition_coefficient">octanol-water partition coefficient</a> predicted by RDKit, $\text{SA}(\mathbf{m})$ is a synthetic accessibility score, and $\text{RingPenalty}(\mathbf{m})$ penalizes unrealistically large rings by reducing the score by $\text{RingSize} - 6$ for each oversized ring. Each property is normalized to zero mean and unit standard deviation across the ZINC dataset.</p>
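<p>A minimal sketch of this objective, assuming the per-term normalization statistics are supplied as (mean, std) pairs rather than computed from ZINC, and with logP and SA passed in as plain numbers (in the paper they come from RDKit):</p>

```python
def penalized_logp(logp, sa, ring_sizes, stats):
    """J(m) = logP - SA - RingPenalty, each term z-normalized.

    stats maps "logp", "sa", and "ring" to assumed (mean, std) pairs;
    in the paper these are computed across the ZINC dataset.
    """
    # Each oversized ring (> 6 atoms) contributes (size - 6).
    ring_pen = sum(size - 6 for size in ring_sizes if size > 6)

    def z(value, key):
        mean, std = stats[key]
        return (value - mean) / std

    return z(logp, "logp") - z(sa, "sa") - z(ring_pen, "ring")
```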
<h2 id="experimental-setup-and-comparisons-to-ml-methods">Experimental Setup and Comparisons to ML Methods</h2>
<h3 id="gb-ga-experiments">GB-GA Experiments</h3>
<p>Ten GA simulations were performed with a population size of 20 over 50 generations (1000 $J(\mathbf{m})$ evaluations per run). The initial mating pool was 20 random molecules from the first 1000 molecules in ZINC. Two mutation rates were tested: 50% and 1%.</p>
<h3 id="gb-gm-mcts-experiments">GB-GM-MCTS Experiments</h3>
<p>Ten simulations used ethane as a seed molecule with 1000 tree traversals per run. Additional experiments used 5000 traversals and an adjusted probability of generating $\text{C}=\text{C}-\text{C}$ ring patterns (increased from 62% to 80%).</p>
<h3 id="baselines">Baselines</h3>
<p>Results were compared to those compiled by Yang et al. (2017):</p>
<ul>
<li>ChemTS (RNN + MCTS)</li>
<li>RNN with and without Bayesian optimization</li>
<li><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Continuous VAE (CVAE)</a></li>
<li><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE (GVAE)</a></li>
<li>Graph convolutional policy network (GCPN, from You et al. 2018)</li>
</ul>
<h3 id="key-results">Key Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Average $J(\mathbf{m})$</th>
          <th>Molecules Evaluated</th>
          <th>CPU Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GB-GA (50% mutation)</td>
          <td>6.8 +/- 0.7</td>
          <td>1000</td>
          <td>30 seconds</td>
      </tr>
      <tr>
          <td>GB-GA (1% mutation)</td>
          <td>7.4 +/- 0.9</td>
          <td>1000</td>
          <td>30 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (62%)</td>
          <td>2.6 +/- 0.6</td>
          <td>1000</td>
          <td>90 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (80%)</td>
          <td>3.4 +/- 0.6</td>
          <td>1000</td>
          <td>90 seconds</td>
      </tr>
      <tr>
          <td>GB-GM-MCTS (80%)</td>
          <td>4.3 +/- 0.6</td>
          <td>5000</td>
          <td>9 minutes</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.9 +/- 0.5</td>
          <td>~5000</td>
          <td>2 hours</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>5.6 +/- 0.5</td>
          <td>~20000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>RNN + BO</td>
          <td>4.5 +/- 0.2</td>
          <td>~4000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>Only RNN</td>
          <td>4.8 +/- 0.2</td>
          <td>~20000</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>CVAE + BO</td>
          <td>0.0 +/- 0.9</td>
          <td>~100</td>
          <td>8 hours</td>
      </tr>
      <tr>
          <td>GVAE + BO</td>
          <td>0.2 +/- 1.3</td>
          <td>~1000</td>
          <td>8 hours</td>
      </tr>
  </tbody>
</table>
<p>The GB-GA with 1% mutation rate achieved an average maximum $J(\mathbf{m})$ of 7.4, which is 1.8 units higher than the best ML result (ChemTS at 5.6) while using 20x fewer evaluations and completing in 30 seconds versus 8 hours. The two highest-scoring individual molecules found by GB-GA had $J(\mathbf{m})$ scores of 8.8 and 8.5, exceeding the 7.8-8.0 range found by the GCPN approach. These molecules bore little resemblance to the initial mating pool (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> of 0.27 and 0.12 to the most similar ZINC molecules), indicating that the GA traversed a large distance in chemical space in just 50 generations.</p>
<p>The GB-GM-MCTS performed below ChemTS at equal evaluations (4.3 vs. 4.9 at 5000 evaluations) but was roughly an order of magnitude faster (9 minutes vs. 2 hours). The MCTS approach tended to extract the dominant hydrophobic structural motif (benzene rings) from the training set, making it more dependent on training set composition than the GA.</p>
<h2 id="simple-methods-set-a-high-bar-for-molecular-optimization">Simple Methods Set a High Bar for Molecular Optimization</h2>
<p>The central finding is that a simple graph-based genetic algorithm outperforms all tested ML-based generative models on penalized logP optimization, both in terms of solution quality and computational efficiency. The GB-GA achieves higher $J(\mathbf{m})$ scores with 1000 evaluations in 30 seconds than ML methods achieve with 20,000 evaluations over 8 hours.</p>
<p>Several additional observations emerge:</p>
<ol>
<li><strong>Chemical space traversal</strong>: The GB-GA can reach high-scoring molecules that are structurally distant from the starting population, with Tanimoto similarity as low as 0.12 to the nearest ZINC molecule.</li>
<li><strong>Mutation rate matters</strong>: A 1% mutation rate outperformed a 50% rate (7.4 vs. 6.8), suggesting that preserving more parental structure during crossover is beneficial for this objective.</li>
<li><strong>Training set dependence</strong>: The GB-GM-MCTS is more sensitive to training set composition than the GA. Its preference for benzene-ring-containing molecules (the dominant ZINC motif) limits its ability to discover alternative structural solutions like the long aliphatic chains favored by the GA.</li>
<li><strong>Generalizability caveat</strong>: Jensen explicitly notes that these comparisons cover only one property (penalized logP) and that similar comparisons for other properties are needed before drawing general conclusions.</li>
</ol>
<p>The paper&rsquo;s influence has been substantial: it helped establish the expectation that new molecular generative models should be benchmarked against genetic algorithm baselines, a position subsequently reinforced by Brown et al. (2019) in <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and by <a href="/notes/chemistry/molecular-design/generation/search-based/genetic-algorithms-molecule-generation-baselines/">Tripp and Hernandez-Lobato (2023)</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial mating pool / reference set</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> (subset)</td>
          <td>First 1000 molecules</td>
          <td>Same subset used in previous studies (Gomez-Bombarelli et al., Yang et al.)</td>
      </tr>
      <tr>
          <td>Target molecule size</td>
          <td>Derived from Yang et al. results</td>
          <td>20 molecules</td>
          <td>Mean 39.15, SD 3.50 non-H atoms</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>GB-GA</strong>: Population size 20, 50 generations, mutation rates of 1% and 50% tested. Crossovers at ring and non-ring bonds with equal probability. Seven mutation types with specified probabilities. Molecules selected from mating pool based on normalized logP scores.</li>
<li><strong>GB-GM</strong>: Atom-by-atom growth using probabilistic rules derived from ZINC bonding analysis. Ring creation probability 63% (matching ZINC), with 80% variant also tested. Seed molecule: ethane.</li>
<li><strong>MCTS</strong>: Modified from haroldsultan/MCTS Python implementation. Leaf parallelization with max 25 leaf nodes. Exploration factor $1/\sqrt{2}$. Binary reward function (1 if new best, 0 otherwise).</li>
<li><strong>Property calculation</strong>: logP, SA score, and ring penalty all computed via RDKit. Each property normalized to zero mean and unit standard deviation across ZINC.</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. The GB-GA and GB-GM are purely algorithmic approaches parameterized by bonding statistics from the ZINC dataset.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GB-GA (1%)</th>
          <th>Best ML (ChemTS)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Average max $J(\mathbf{m})$</td>
          <td>7.4 +/- 0.9</td>
          <td>5.6 +/- 0.5</td>
          <td>Over 10 runs</td>
      </tr>
      <tr>
          <td>Single best $J(\mathbf{m})$</td>
          <td>8.8</td>
          <td>~8.0 (GCPN)</td>
          <td>GB-GA vs. You et al.</td>
      </tr>
      <tr>
          <td>Evaluations per run</td>
          <td>1000</td>
          <td>~20,000</td>
          <td>20x fewer for GB-GA</td>
      </tr>
      <tr>
          <td>CPU time per run</td>
          <td>30 seconds</td>
          <td>8 hours</td>
          <td>~960x faster</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All GB-GA and GB-GM experiments were run on a laptop. No GPU required. The GB-GA completes in 30 seconds per run and the GB-GM-MCTS in 90 seconds (1000 traversals) to 9 minutes (5000 traversals).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jensengroup/GB-GA/tree/v0.0">GB-GA (v0.0)</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Graph-based genetic algorithm, RDKit dependency only</td>
      </tr>
      <tr>
          <td><a href="https://github.com/jensengroup/GB-GM/tree/v0.0">GB-GM (v0.0)</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Graph-based generative model + MCTS, RDKit dependency only</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jensen, J. H. (2019). A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. <em>Chemical Science</em>, 10(12), 3567-3572. <a href="https://doi.org/10.1039/c8sc05372c">https://doi.org/10.1039/c8sc05372c</a></p>
<p><strong>Publication</strong>: Chemical Science (Royal Society of Chemistry), 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jensengroup/GB-GA">GB-GA Code (GitHub)</a></li>
<li><a href="https://github.com/jensengroup/GB-GM">GB-GM Code (GitHub)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jensen2019graph,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jensen, Jan H.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3567--3572}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/c8sc05372c}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Frechet ChemNet Distance for Molecular Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/</guid><description>FCD uses ChemNet activations and the Wasserstein-2 distance to evaluate molecular generative models for chemical validity, biological relevance, and diversity.</description><content:encoded><![CDATA[<h2 id="a-unified-evaluation-metric-for-molecular-generation">A Unified Evaluation Metric for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Frechet ChemNet Distance (FCD), a single scalar metric for evaluating generative models that produce molecules for drug discovery. FCD adapts the Frechet Inception Distance (FID) from image generation to the molecular domain. By comparing distributions of learned representations from a drug-activity prediction network (ChemNet), FCD simultaneously captures whether generated molecules are chemically valid, biologically relevant, and structurally diverse.</p>
<h2 id="inconsistent-evaluation-of-molecular-generative-models">Inconsistent Evaluation of Molecular Generative Models</h2>
<p>At the time of this work (2018), deep generative models for molecules were proliferating: RNNs combined with <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoders</a>, reinforcement learning, and <a href="/posts/what-is-a-gan/">GANs</a> all produced <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings representing novel molecules. The evaluation landscape was fragmented. Different papers reported different metrics: percentage of valid SMILES, mean logP, druglikeness, synthetic accessibility (SA) scores, or internal diversity via Tanimoto distance.</p>
<p>This inconsistency created several problems. First, method comparison across publications was difficult because no common metric existed. Second, simple metrics like &ldquo;fraction of valid SMILES&rdquo; could be trivially maximized by generating short, simple molecules (e.g., &ldquo;CC&rdquo; or &ldquo;CCC&rdquo;). Third, individual property metrics (logP, druglikeness) each captured only one dimension of quality. A model could score well on logP but produce molecules that were not diverse or not biologically meaningful.</p>
<p>The authors argued that a good metric should capture three properties simultaneously: (1) chemical validity and similarity to real drug-like molecules, (2) biological relevance, and (3) diversity within the generated set.</p>
<h2 id="core-innovation-frechet-distance-over-chemnet-activations">Core Innovation: Frechet Distance over ChemNet Activations</h2>
<p>The key insight is to use a neural network trained on biological activity prediction as a feature extractor for molecules, then compare distributions of these features using the Frechet (Wasserstein-2) distance.</p>
<h3 id="chemnet-architecture">ChemNet Architecture</h3>
<p>ChemNet is a multi-task neural network trained to predict bioactivities across approximately 6,000 assays from three major drug discovery databases (ChEMBL, ZINC, PubChem). The architecture processes one-hot encoded SMILES strings through:</p>
<ol>
<li>Two 1D convolutional layers with SELU activations</li>
<li>A max-pooling layer</li>
<li>Two stacked LSTM layers</li>
<li>A fully connected output layer</li>
</ol>
<p>The penultimate layer (the second LSTM&rsquo;s hidden state after processing the full input sequence) serves as the molecular representation. Because ChemNet was trained to predict drug activities, its internal representations encode both chemical structure (from the input side) and biological function (from the output side).</p>
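<p>As a concrete illustration of the input side, a minimal one-hot SMILES encoder might look like the following. The character vocabulary and maximum length here are illustrative stand-ins, not the paper's actual tokenization:</p>

```python
import numpy as np

# Toy character set for illustration only -- ChemNet's real vocabulary
# covers the full SMILES alphabet used in its training databases.
VOCAB = sorted(set("CNOc1()=#[]+-Slnos"))
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles, max_len=120):
    """One-hot encode a SMILES string to a (max_len, |vocab|) matrix,
    truncating long strings and zero-padding short ones."""
    x = np.zeros((max_len, len(VOCAB)), dtype=np.float32)
    for t, ch in enumerate(smiles[:max_len]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:  # skip characters outside the toy vocabulary
            x[t, idx] = 1.0
    return x
```

<p>The resulting matrix is what the convolutional front end of such a network would consume, one row per character position.</p>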
<h3 id="the-fcd-formula">The FCD Formula</h3>
<p>Given a set of real molecules and a set of generated molecules, FCD is computed as follows:</p>
<ol>
<li>Pass each molecule (as a SMILES string) through ChemNet and extract penultimate-layer activations.</li>
<li>Fit a multivariate Gaussian to each set by computing the mean $\mathbf{m}$ and covariance $\mathbf{C}$ for the generated set, and mean $\mathbf{m}_w$ and covariance $\mathbf{C}_w$ for the real set.</li>
<li>Compute the squared Frechet distance:</li>
</ol>
<p>$$
d^{2}\left((\mathbf{m}, \mathbf{C}), (\mathbf{m}_w, \mathbf{C}_w)\right) = \lVert \mathbf{m} - \mathbf{m}_w \rVert_2^{2} + \mathrm{Tr}\left(\mathbf{C} + \mathbf{C}_w - 2(\mathbf{C}\mathbf{C}_w)^{1/2}\right)
$$</p>
<p>The Gaussian assumption is justified by the maximum entropy principle: the Gaussian is the maximum-entropy distribution for given mean and covariance. A lower FCD indicates that the generated distribution is closer to the real distribution.</p>
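<p>The distance itself is simple enough to sketch in a few lines of NumPy. The helper below fits the two Gaussians and evaluates the formula above, obtaining the trace of $(\mathbf{C}\mathbf{C}_w)^{1/2}$ from the eigenvalues of $\mathbf{C}\mathbf{C}_w$ rather than an explicit matrix square root. This is a generic sketch, not the official FCD implementation, which additionally requires ChemNet to produce the activations:</p>

```python
import numpy as np

def frechet_distance(act_gen, act_real):
    """Squared Frechet distance between Gaussians fitted to two
    activation matrices of shape (n_samples, n_features)."""
    m, mw = act_gen.mean(axis=0), act_real.mean(axis=0)
    C = np.cov(act_gen, rowvar=False)
    Cw = np.cov(act_real, rowvar=False)
    # Tr((C Cw)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of C @ Cw, which are real and non-negative for
    # positive semi-definite C and Cw (clip guards numerical noise).
    eigvals = np.linalg.eigvals(C @ Cw)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((m - mw) ** 2)
                 + np.trace(C) + np.trace(Cw) - 2.0 * tr_sqrt)
```

<p>Identical activation sets give a distance of zero, and the mean-shift term alone makes clearly displaced distributions score high.</p>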
<h3 id="why-not-just-fingerprints">Why Not Just Fingerprints?</h3>
<p>The authors also define a Frechet Fingerprint Distance (FFD) that replaces ChemNet activations with 2048-bit ECFP_4 fingerprints. FFD captures chemical structure but not biological function. The experimental comparison shows that FCD produces more distinct separations between biased and unbiased molecule sets, particularly for biologically meaningful biases.</p>
<h2 id="detecting-flaws-in-generative-models">Detecting Flaws in Generative Models</h2>
<p>The experiments evaluate whether FCD can detect specific failure modes in generative models. The authors simulate five types of biased generators by selecting molecules from real databases that exhibit particular properties, then compare FCD against individual metrics (logP, druglikeness, SA score, internal diversity) and FFD.</p>
<h3 id="simulated-bias-experiments">Simulated Bias Experiments</h3>
<p>All experiments use 5,000 molecules drawn 5 times each. The reference distribution is 200,000 randomly drawn real molecules not used for ChemNet training.</p>
<table>
  <thead>
      <tr>
          <th>Bias Type</th>
          <th>logP</th>
          <th>Druglikeness</th>
          <th>SA Score</th>
          <th>Int. Diversity</th>
          <th>FFD</th>
          <th>FCD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low druglikeness (&lt;5th pct)</td>
          <td>-</td>
          <td>Detects</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>High logP (&gt;95th pct)</td>
          <td>Detects</td>
          <td>Detects</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Low SA score (&lt;5th pct)</td>
          <td>-</td>
          <td>Partial</td>
          <td>-</td>
          <td>Partial</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Mode collapse (cluster)</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Kinase inhibitors (PLK1)</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
  </tbody>
</table>
<p>FCD is the only metric that detects all five bias types. The biological bias test (kinase inhibitors for PLK1-PBD from PubChem AID 720504) is particularly notable: only FFD and FCD detect this bias, and FCD provides a more distinct separation. This validates the hypothesis that incorporating biological information through ChemNet activations improves evaluation beyond purely chemical descriptors.</p>
<h3 id="sample-size-requirements">Sample Size Requirements</h3>
<p>The authors tested FCD convergence with varying sample sizes (5 to 300,000 molecules). Mean FCD values for samples drawn from the real distribution:</p>
<table>
  <thead>
      <tr>
          <th>Sample Size</th>
          <th>Mean FCD</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5</td>
          <td>76.46</td>
          <td>5.03</td>
      </tr>
      <tr>
          <td>50</td>
          <td>31.86</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>500</td>
          <td>4.41</td>
          <td>0.03</td>
      </tr>
      <tr>
          <td>5,000</td>
          <td>0.42</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>0.05</td>
          <td>0.00</td>
      </tr>
      <tr>
          <td>300,000</td>
          <td>0.02</td>
          <td>0.00</td>
      </tr>
  </tbody>
</table>
<p>A sample size of 5,000 molecules is sufficient for reliable estimation, with the mean FCD approaching zero and negligible variance.</p>
<h3 id="benchmarking-published-generative-models">Benchmarking Published Generative Models</h3>
<p>The authors computed FCD for several published generative methods:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>FCD</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Random real molecules</td>
          <td>0.22</td>
          <td>Baseline (near zero as expected)</td>
      </tr>
      <tr>
          <td>Segler et al. (LSTM)</td>
          <td>1.62</td>
          <td>Trained to approximate full ChEMBL distribution</td>
      </tr>
      <tr>
          <td>DRD2-targeted methods</td>
          <td>24.14 to 47.85</td>
          <td>Olivecrona, RL, and ORGAN agents</td>
      </tr>
      <tr>
          <td>Rule-based baseline</td>
          <td>58.76</td>
          <td>Random concatenation of C, N, O atoms</td>
      </tr>
  </tbody>
</table>
<p>The ranking matches expectations. The Segler model, trained to approximate the overall molecule distribution, achieves the lowest FCD (1.62). Models optimized for a specific target (DRD2), including the Olivecrona RL agents, the RL method by Benhenda, and ORGAN, produce higher FCD values (24.14 to 47.85) against the general distribution. More training iterations push these models further from the general distribution, as they become increasingly DRD2-specific. The canonical and reduced Olivecrona agents learn similar chemical spaces, consistent with the original authors&rsquo; conclusions. The rule-based system scores worst (58.76), confirming FCD as a meaningful quality metric.</p>
<h2 id="conclusions-and-impact">Conclusions and Impact</h2>
<p>FCD provides a single metric that unifies the evaluation of chemical validity, biological relevance, and diversity for molecular generative models. Its main advantages are:</p>
<ol>
<li>It captures multiple quality dimensions in one score, simplifying method comparison.</li>
<li>It detects biases that no single existing metric can catch alone.</li>
<li>It requires only SMILES strings as input, making it applicable to any generative method (including graph-based approaches via SMILES conversion).</li>
<li>It incorporates biological information through ChemNet, distinguishing it from purely chemical metrics like FFD.</li>
</ol>
<p><strong>Limitations</strong>: The metric depends on the ChemNet model, which was trained on a specific set of bioactivity assays. Molecules outside the training distribution of ChemNet may not be well-represented. The Gaussian assumption for the activation distributions may not hold perfectly. FCD measures distance to a reference set, so it evaluates how well a generator approximates a given distribution rather than the absolute quality of individual molecules. When using FCD for targeted generation (e.g., molecules active against a specific protein), the reference set should be chosen accordingly, not the general drug-like molecule distribution.</p>
<p>FCD has since become a standard evaluation metric in the molecular generation community, adopted by benchmarking platforms like <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemNet training</td>
          <td>ChEMBL, ZINC, PubChem</td>
          <td>~6,000 assays</td>
          <td>Two-thirds for training, one-third for testing</td>
      </tr>
      <tr>
          <td>Reference distribution</td>
          <td>Combined databases</td>
          <td>200,000 molecules</td>
          <td>Excluded from ChemNet training</td>
      </tr>
      <tr>
          <td>Bias simulations</td>
          <td>Subsets of combined databases</td>
          <td>5,000 per experiment</td>
          <td>5 repetitions each</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>ChemNet: 2x 1D-conv (SELU), max-pool, 2x stacked LSTM, FC output</li>
<li>FCD: Squared Frechet distance between Gaussian-fitted ChemNet penultimate-layer activations</li>
<li>FFD: Same as FCD but using 2048-bit ECFP_4 fingerprints instead of ChemNet activations</li>
<li>Molecular property calculations: RDKit (logP, druglikeness, SA score, Morgan fingerprints with radius 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
          <td>Frechet distance over ChemNet activations (lower = closer to reference)</td>
      </tr>
      <tr>
          <td>FFD</td>
          <td>Frechet distance over ECFP_4 fingerprints</td>
      </tr>
      <tr>
          <td>logP</td>
          <td>Mean partition coefficient</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Geometric mean of desired molecular properties (QED)</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility score</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Tanimoto distance within generated set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not provided in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bioinf-jku/FCD">FCD Implementation</a></td>
          <td>Code</td>
          <td>LGPL-3.0</td>
          <td>Official Python implementation; requires only SMILES input</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., &amp; Klambauer, G. (2018). Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery. <em>Journal of Chemical Information and Modeling</em>, 58(9), 1736-1741.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{preuer2018frechet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fr{\&#39;e}chet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Preuer, Kristina and Renz, Philipp and Unterthiner, Thomas and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1736--1741}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.8b00234}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Back Translation for Semi-Supervised Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/back-translation-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/back-translation-molecule-generation/</guid><description>A semi-supervised method adapting NLP back translation to molecule generation, improving property optimization and retrosynthesis with unlabeled ZINC data.</description><content:encoded><![CDATA[<h2 id="semi-supervised-data-augmentation-for-molecular-tasks">Semi-Supervised Data Augmentation for Molecular Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces back translation, a semi-supervised technique from neural machine translation, to the domain of molecular generation. The primary contribution is a general-purpose data augmentation strategy that leverages large pools of unlabeled molecules (from databases like ZINC) to improve the performance of both sequence-based and graph-based models on molecule optimization and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> prediction tasks.</p>
<h2 id="bridging-the-labeled-data-gap-in-molecular-generation">Bridging the Labeled Data Gap in Molecular Generation</h2>
<p>Molecular generation tasks, such as property optimization and retrosynthesis, require paired training data: an input molecule (or property specification) mapped to a desired output molecule. Obtaining these labeled pairs is expensive and labor-intensive. Meanwhile, enormous databases of unlabeled molecules exist. ZINC alone contains over 750 million compounds, and PubChem has 109 million.</p>
<p>Prior approaches to using unlabeled molecular data include <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">variational autoencoders (VAEs)</a> for learning latent representations, conditional recurrent neural networks for inverse design, and pretraining techniques borrowed from NLP. However, these methods either focus on representation learning rather than direct generation, or require task-specific architectural modifications. The authors identify back translation, a well-established technique in machine translation, as a natural fit for molecular generation tasks that can be treated as sequence-to-sequence mappings.</p>
<h2 id="back-translation-as-molecular-data-augmentation">Back Translation as Molecular Data Augmentation</h2>
<p>The core idea is straightforward. Given a main task that maps from source domain $\mathcal{X}$ to target domain $\mathcal{Y}$ (e.g., mapping low-QED molecules to high-QED molecules), the method trains a reverse model $g$ that maps from $\mathcal{Y}$ back to $\mathcal{X}$. This reverse model then &ldquo;back translates&rdquo; unlabeled molecules from $\mathcal{Y}$ to generate synthetic source molecules, creating pseudo-labeled training pairs.</p>
<p>The theoretical motivation comes from maximizing the reconstruction probability. Given an unlabeled molecule $y_u \in \mathcal{U}_y$, the logarithmic reconstruction probability through the reverse model $g$ and forward model $f$ is:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) = \log \sum_{\hat{x}_u \in \mathcal{X}} P(\hat{x}_u \mid y_u; g) P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>Since summing over the exponentially large space $\mathcal{X}$ is intractable, the authors apply Jensen&rsquo;s inequality to obtain a lower bound:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) \geq \mathbb{E}_{\hat{x}_u \sim P(\cdot \mid y_u; g)} \log P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>This lower bound is optimized via Monte Carlo sampling in three steps:</p>
<p><strong>Step 1</strong>: Train both forward model $f$ and reverse model $g$ on the labeled data $\mathcal{L}$:</p>
<p>$$
\begin{aligned}
\min_{\theta_f} \sum_{(x,y) \in \mathcal{L}} -\log P(y \mid x; \theta_f) \\
\min_{\theta_g} \sum_{(x,y) \in \mathcal{L}} -\log P(x \mid y; \theta_g)
\end{aligned}
$$</p>
<p><strong>Step 2</strong>: Use the trained reverse model $g$ to back translate each unlabeled molecule $y_u \in \mathcal{U}_y$, producing synthetic pairs:</p>
<p>$$
\hat{\mathcal{L}} = \{ (\hat{x}_u, y_u) \mid y_u \in \mathcal{U}_y,\ \hat{x}_u \text{ sampled from } P(\cdot \mid y_u; \theta_g) \}
$$</p>
<p><strong>Step 3</strong>: Retrain the forward model $f$ on the combined labeled and synthetic data $\mathcal{L} \cup \hat{\mathcal{L}}$, warm-starting from the parameters obtained in Step 1:</p>
<p>$$
\min_{\theta_f^{*}} \sum_{(x,y) \in \mathcal{L} \cup \hat{\mathcal{L}}} -\log P(y \mid x; \theta_f^{*})
$$</p>
<p>A key practical finding is that data filtration matters. When using large amounts of unlabeled data (1M molecules), keeping only the synthetic pairs that satisfy the same constraints as the labeled data (e.g., similarity thresholds and property ranges) significantly improves performance over using all back-translated data unfiltered.</p>
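<p>The three steps, plus the filtration heuristic, can be sketched as a generic training loop. Everything below is a toy stand-in: the <code>train</code>, <code>sample_reverse</code>, and <code>keep</code> callables are placeholders for the actual model training, reverse-model sampling, and constraint check, not the authors' code:</p>

```python
def back_translation_round(labeled, unlabeled_y, train, sample_reverse, keep):
    """One round of back-translation data augmentation.

    labeled        : list of (x, y) pairs
    unlabeled_y    : list of target-domain molecules y_u
    train          : callable(pairs) -> model        (stand-in for SGD training)
    sample_reverse : callable(g, y) -> x_hat         (samples from P(x | y; g))
    keep           : callable(x_hat, y) -> bool      (task-specific filter)
    """
    # Step 1: train forward f on (x, y) and reverse g on (y, x).
    f = train(labeled)
    g = train([(y, x) for x, y in labeled])
    # Step 2: back-translate unlabeled targets into synthetic pairs,
    # then keep only pairs satisfying the labeled data's constraints.
    synthetic = [(sample_reverse(g, y), y) for y in unlabeled_y]
    synthetic = [(x, y) for x, y in synthetic if keep(x, y)]
    # Step 3: retrain f on labeled + filtered synthetic data
    # (the paper warm-starts from Step 1's parameters; omitted here).
    f = train(labeled + synthetic)
    return f, synthetic
```

<p>The structure makes the paper's ablations easy to read off: a weak reverse model in Step 1 produces noisy pairs in Step 2, which is why small labeled sets see no benefit, and why <code>keep</code> matters at the 1M scale.</p>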
<h2 id="experiments-on-property-optimization-and-retrosynthesis">Experiments on Property Optimization and Retrosynthesis</h2>
<h3 id="molecular-property-improvement">Molecular Property Improvement</h3>
<p>The authors evaluate on four tasks from Jin et al. (2019, 2020), each requiring the model to improve a specific molecular property while maintaining structural similarity (measured by Dice similarity on Morgan fingerprints):</p>
<ul>
<li><strong>LogP</strong> (penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">partition coefficient</a>): two settings with similarity thresholds $\delta \geq 0.4$ and $\delta \geq 0.6$</li>
<li><strong>QED</strong> (quantitative estimation of drug-likeness): translate molecules from QED range [0.7, 0.8] to [0.9, 1.0]</li>
<li><strong>DRD2</strong> (<a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine type 2 receptor</a> activity): translate inactive ($P &lt; 0.5$) to active ($P \geq 0.5$)</li>
</ul>
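<p>The similarity constraint is cheap to evaluate once fingerprints are in hand. Below is a minimal Dice coefficient over fingerprints represented as sets of on-bit indices; generating the Morgan bits themselves would require RDKit, which is omitted here:</p>

```python
def dice_similarity(fp_a, fp_b):
    """Dice coefficient between two binary fingerprints given as sets
    of on-bit indices: 2|A intersect B| / (|A| + |B|)."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return 2 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))
```

<p>A generated molecule is accepted for a task like LogP ($\delta \geq 0.4$) only if this value against the input molecule meets the threshold.</p>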
<p>Two backbone architectures are tested: a Transformer (6 layers, 4 heads, 128-dim embeddings, 512-dim FFN) and HierG2G, a hierarchical graph-to-graph translation model. Unlabeled molecules are sampled from ZINC at 250K and 1M scales.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP ($\delta \geq 0.6$)</th>
          <th>LogP ($\delta \geq 0.4$)</th>
          <th>QED (%)</th>
          <th>DRD2 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.28</td>
          <td>1.03</td>
          <td>8.8</td>
          <td>3.4</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>0.79</td>
          <td>2.49</td>
          <td>9.4</td>
          <td>4.4</td>
      </tr>
      <tr>
          <td>JTNN</td>
          <td>2.33</td>
          <td>3.55</td>
          <td>59.9</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Transformer baseline</td>
          <td>2.45</td>
          <td>3.69</td>
          <td>71.9</td>
          <td>60.2</td>
      </tr>
      <tr>
          <td>+BT (1M, filtered)</td>
          <td>2.86</td>
          <td>4.41</td>
          <td>82.9</td>
          <td>67.4</td>
      </tr>
      <tr>
          <td>HierG2G baseline</td>
          <td>2.49</td>
          <td>3.98</td>
          <td>76.9</td>
          <td>85.9</td>
      </tr>
      <tr>
          <td>+BT (250K, filtered)</td>
          <td>2.75</td>
          <td>4.24</td>
          <td>79.1</td>
          <td>87.3</td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis-prediction">Retrosynthesis Prediction</h3>
<p>On the USPTO-50K benchmark (50K reactions, 10 reaction types, 80/10/10 train/val/test split), the method is applied to Transformer and GLN (Graph Logic Network) backbones. For other approaches to this benchmark, see <a href="/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">Data Transfer for Retrosynthesis</a>. Unlabeled reactant sets are constructed by sampling molecules from ZINC and concatenating them following the training data&rsquo;s reactant count distribution ($N_1 : N_2 : N_3 = 29.3\% : 70.4\% : 0.3\%$).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Top-1</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Reaction type given</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>64.2</td>
          <td>79.1</td>
          <td>85.2</td>
          <td>90.0</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>67.9</td>
          <td>82.5</td>
          <td>87.3</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>52.2</td>
          <td>68.2</td>
          <td>72.7</td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>55.9</td>
          <td>72.8</td>
          <td>77.8</td>
          <td>79.7</td>
      </tr>
      <tr>
          <td><strong>Reaction type unknown</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>54.7</td>
          <td>70.2</td>
          <td>77.0</td>
          <td>84.4</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>37.9</td>
          <td>57.3</td>
          <td>62.7</td>
          <td>68.1</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>43.5</td>
          <td>58.8</td>
          <td>64.6</td>
          <td>69.7</td>
      </tr>
  </tbody>
</table>
<p>The improvements are largest at lower $k$ values (top-1 and top-3), suggesting that back translation helps the model make more precise high-confidence predictions.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Effect of unlabeled data size</strong>: On retrosynthesis with Transformer, performance improves as unlabeled data increases from 50K to 250K, then plateaus or declines beyond 250K. The authors attribute this to noise in the back-translated data outweighing the benefits at larger scales.</p>
<p><strong>Effect of labeled data size</strong>: With only 5K labeled samples, adding back-translated data hurts performance because the reverse model is too weak to generate useful synthetic data. As labeled data increases (10K, 25K, 50K), the benefit of back translation grows. This confirms that the method requires a reasonably well-trained reverse model to be effective.</p>
<p><strong>Data filtration</strong>: Unfiltered 1M back-translated data can underperform: on QED it reaches only 75.1%, barely above the 71.9% supervised baseline, versus 82.9% when filtering enforces the same constraints as the labeled data, which recovers and exceeds the 250K filtered results.</p>
<h2 id="consistent-gains-across-architectures-and-tasks">Consistent Gains Across Architectures and Tasks</h2>
<p>The method achieves state-of-the-art results on all four molecular property improvement tasks and the USPTO-50K retrosynthesis benchmark at time of publication. Several observations stand out:</p>
<ol>
<li><strong>Architecture agnosticism</strong>: Back translation improves both sequence-based (Transformer) and graph-based (HierG2G, GLN) models, confirming that the approach is independent of the underlying architecture.</li>
<li><strong>Filtration is essential at scale</strong>: Unfiltered 1M back-translated data can degrade performance, but filtered data at the same scale consistently outperforms smaller unfiltered sets.</li>
<li><strong>Training overhead is moderate</strong>: On the DRD2 task, back translation with Transformer adds about 2.5 hours over supervised training alone (11.0h total vs. 8.5h), with the back-translation step itself taking under 1 hour.</li>
<li><strong>Diversity and novelty increase</strong>: Back translation improves both diversity (average pairwise distance among generated molecules) and novelty (fraction of generated molecules not seen in training) across QED and DRD2 tasks.</li>
</ol>
<p>The authors acknowledge limitations: the method does not form a closed loop between forward and reverse models (as in dual learning approaches), and the data filtration strategy is rule-based rather than learned. They suggest joint training of forward and reverse models and learned filtration as future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (property improvement)</td>
          <td>Jin et al. (2019, 2020) datasets</td>
          <td>34K-99K pairs</td>
          <td>LogP, QED, DRD2 tasks</td>
      </tr>
      <tr>
          <td>Training (retrosynthesis)</td>
          <td>USPTO-50K</td>
          <td>40K reactions</td>
          <td>80/10/10 split from Dai et al. (2019)</td>
      </tr>
      <tr>
          <td>Unlabeled molecules</td>
          <td>ZINC</td>
          <td>250K or 1M</td>
          <td>Randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Same as training</td>
          <td>800-1000 test samples</td>
          <td>Per-task test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Back translation with optional data filtration</li>
<li>Beam search with $k=20$ for inference</li>
<li>Random sampling for back-translation step (Equation 5)</li>
<li>Dice similarity on Morgan fingerprints for similarity constraint</li>
</ul>
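<p>Beam search itself is model-agnostic. A generic sketch over a toy <code>expand</code> function that returns scored next tokens is shown below; the paper uses beam width $k=20$ over SMILES tokens, while the stopping logic here is simplified to a fixed number of steps:</p>

```python
import heapq

def beam_search(expand, start, k=20, steps=5):
    """Generic beam search: keep the k highest-scoring partial sequences.

    expand(seq) -> iterable of (next_token, log_prob) pairs; in the real
    setting this would query the decoder's next-token distribution.
    """
    beam = [(0.0, start)]  # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beam:
            for tok, logp in expand(seq):
                candidates.append((score + logp, seq + [tok]))
        # Prune to the k best-scoring extensions, sorted descending.
        beam = heapq.nlargest(k, candidates)
    return beam
```

<p>With $k=1$ this reduces to greedy decoding; larger beams trade compute for a better chance of ranking the true reactants within the top-$k$ predictions.</p>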
<h3 id="models">Models</h3>
<ul>
<li><strong>Transformer</strong>: 6 layers, 4 attention heads, 128-dim embeddings, 512-dim FFN (for property improvement); 4 layers, 8 heads, 256-dim embeddings, 2048-dim FFN (for retrosynthesis)</li>
<li><strong>HierG2G</strong>: Settings from Jin et al. (2020)</li>
<li><strong>GLN</strong>: Settings from Dai et al. (2019)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.6$)</td>
          <td>2.86</td>
          <td>2.49 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.4$)</td>
          <td>4.41</td>
          <td>3.98 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>QED</td>
          <td>82.9%</td>
          <td>76.9% (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>DRD2</td>
          <td>87.3%</td>
          <td>85.9% (HierG2G)</td>
          <td>HierG2G + BT(250K, filtered)</td>
      </tr>
      <tr>
          <td>Top-1 accuracy</td>
          <td>USPTO-50K (known type)</td>
          <td>67.9%</td>
          <td>64.2% (GLN)</td>
          <td>Ours + GLN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper reports training times (8.5h for Transformer, 16.8h for HierG2G on DRD2 with 1M unlabeled data) but does not specify the GPU hardware used.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fyabc/BT4MolGen">BT4MolGen</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation in Python</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, Y., Xia, Y., Zhu, J., Wu, L., Xie, S., &amp; Qin, T. (2021). Back translation for molecule generation. <em>Bioinformatics</em>, 38(5), 1244-1251. <a href="https://doi.org/10.1093/bioinformatics/btab817">https://doi.org/10.1093/bioinformatics/btab817</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fan2022back,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Back translation for molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fan, Yang and Xia, Yingce and Zhu, Jinhua and Wu, Lijun and Xie, Shufang and Qin, Tao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1244--1251}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bioinformatics/btab817}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tied Two-Way Transformers for Diverse Retrosynthesis</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/</guid><description>Tied two-way transformers with cycle consistency and multinomial latent variables improve retrosynthetic prediction validity, plausibility, and diversity.</description><content:encoded><![CDATA[<h2 id="bridging-forward-and-backward-reaction-prediction">Bridging Forward and Backward Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper that addresses three key limitations of template-free <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> models: invalid <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> outputs, chemically implausible predictions, and lack of diversity in reactant candidates. The solution combines three techniques: (1) cycle consistency checks using a paired forward reaction transformer, (2) parameter tying between the forward and backward transformers, and (3) multinomial latent variables with a learned prior to capture multiple reaction pathways.</p>
<h2 id="three-problems-in-template-free-retrosynthesis">Three Problems in Template-Free Retrosynthesis</h2>
<p>Template-free retrosynthesis models cast retrosynthesis as a <a href="/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">sequence-to-sequence</a> translation problem (product SMILES to reactant SMILES). While these models avoid the cost of hand-coded reaction templates, they suffer from:</p>
<ol>
<li><strong>Invalid SMILES</strong>: predicted reactant strings that contain grammatical errors and cannot be parsed into molecules</li>
<li><strong>Implausibility</strong>: predicted reactants that are valid molecules but cannot actually synthesize the target product</li>
<li><strong>Lack of diversity</strong>: beam search produces duplicate or near-duplicate candidates, reducing the number of useful suggestions</li>
</ol>
<p>Prior work addressed these individually (SCROP adds a syntax corrector for validity, Chen et al. use latent variables for diversity), but this paper tackles all three simultaneously.</p>
<h2 id="model-architecture">Model Architecture</h2>
<h3 id="tied-two-way-transformers">Tied Two-Way Transformers</h3>
<p>The model pairs a retrosynthesis transformer $p(y|z, x)$ (product to reactants) with a forward reaction transformer $p(\tilde{x}|z, y)$ (reactants to product). Both use the standard encoder-decoder transformer architecture with 6 layers, 8 attention heads, and 256-dimensional embeddings.</p>
<p>The key architectural innovation is aggressive parameter tying: the two transformers share the entire encoder and all decoder parameters except layer normalization. This means the two-transformer system has approximately the same parameter count as a single transformer (17.5M vs. 17.4M). The shared parameters force the model to learn bidirectional reaction patterns from both forward and backward training data simultaneously, improving grammar learning and reducing invalid outputs.</p>
<h3 id="multinomial-latent-variables">Multinomial Latent Variables</h3>
<p>A discrete latent variable $z \in \{1, \ldots, K\}$ is introduced to capture multiple reaction modes. Each latent value conditions a different decoding path, encouraging diverse reactant predictions. The decoder initializes with a latent-class-specific start token (e.g., &ldquo;&lt;CLS2&gt;&rdquo;) and then decodes autoregressively.</p>
<p>The prior $p(z|x)$ is a learned multinomial distribution parametrized by a two-layer feed-forward network with tanh activation, taking the mean-pooled encoder output as input. This learned prior outperforms the uniform prior used by Chen et al., producing a smaller trade-off between top-1 and top-10 accuracy as $K$ increases.</p>
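<p>As a shape-level sketch (untrained random weights stand in for the learned parameters, and the hidden dimension is illustrative), the learned prior reduces to a mean-pool, a two-layer tanh network, and a softmax:</p>

```python
import numpy as np

def latent_prior(encoder_out, W1, b1, W2, b2):
    """Learned prior p(z|x): mean-pool the encoder output over the
    sequence axis, apply a two-layer feed-forward network with tanh,
    and softmax over the K latent classes.
    Shapes: encoder_out (L, D), W1 (D, H), W2 (H, K)."""
    h = np.tanh(encoder_out.mean(axis=0) @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
L, D, H, K = 12, 256, 128, 5            # H is an illustrative choice
p_z = latent_prior(rng.normal(size=(L, D)),
                   0.05 * rng.normal(size=(D, H)), np.zeros(H),
                   0.05 * rng.normal(size=(H, K)), np.zeros(K))
# p_z is a length-K probability vector over latent reaction modes
```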
<h3 id="training-with-hard-em">Training with Hard EM</h3>
<p>Since the latent variable $z$ is unobserved during training, the model is trained with the online <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">hard-EM algorithm</a>. The loss function is:</p>
<p>$$\mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim \text{data}} \left[ \min_{z} \mathcal{L}_h(x, y, z; \theta) \right]$$</p>
<p>where $\mathcal{L}_h = -(\log p(z|x) + \log p(y|z,x) + \log p(\tilde{x}=x|z,y))$. The E-step selects the best $z$ for each training pair (with dropout disabled), and the M-step updates parameters given the complete data.</p>
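<p>The E-step is an argmin over the $K$ latent values. A minimal sketch, with toy log-probabilities standing in for the tied transformers&rsquo; outputs:</p>

```python
import math

def hard_em_select(log_prior, log_retro, log_cycle):
    """E-step of online hard EM: pick the latent z minimizing
    L_h = -(log p(z|x) + log p(y|z,x) + log p(x_tilde=x|z,y)).
    Each argument is a list of K per-latent log-probabilities."""
    losses = [-(lp + lr + lc)
              for lp, lr, lc in zip(log_prior, log_retro, log_cycle)]
    z_star = min(range(len(losses)), key=lambda z: losses[z])
    return z_star, losses[z_star]

# Toy example with K = 3 latent values (log-probabilities are made up).
log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]
log_retro = [-4.0, -2.0, -6.0]
log_cycle = [-3.0, -1.5, -5.0]

z_star, loss = hard_em_select(log_prior, log_retro, log_cycle)
# z = 1 has the highest total log-likelihood, so it is selected
```

<p>The M-step then backpropagates $\mathcal{L}_h$ for the selected $z$ only, treating $(x, y, z^\*)$ as complete data.</p>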
<h3 id="inference-with-cycle-consistency-reranking">Inference with Cycle Consistency Reranking</h3>
<p>At inference, the model: (1) generates $K$ sets of beam search hypotheses from the retrosynthesis transformer (one per latent value), (2) scores each candidate with the forward reaction transformer for cycle consistency $p(\tilde{x}=x|z,y)$, and (3) reranks candidates by the full likelihood $p(z|x) \cdot p(y|z,x) \cdot p(\tilde{x}=x|z,y)$. This pushes chemically plausible predictions to higher ranks.</p>
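<p>The reranking step amounts to sorting candidates by the sum of the three log-likelihood terms. A minimal sketch with made-up scores:</p>

```python
def rerank(candidates):
    """Rerank candidate reactant sets by the full log-likelihood
    log p(z|x) + log p(y|z,x) + log p(x_tilde=x|z,y).
    Each candidate is (reactants, log_prior, log_retro, log_cycle)."""
    scored = [(lp + lr + lc, y) for y, lp, lr, lc in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [y for _, y in scored]

# A candidate with a mediocre retrosynthesis score but a strong cycle
# consistency term can be promoted to the top rank.
candidates = [
    ("CC(=O)Cl.OCC", -1.0, -2.0, -9.0),   # poor cycle consistency
    ("CC(=O)O.OCC",  -1.2, -2.5, -0.5),   # strong cycle consistency
]
ranked = rerank(candidates)
```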
<h2 id="results-on-uspto-50k">Results on USPTO-50K</h2>
<p>All results are averaged over 5 random seeds with beam size 10.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-5 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Top-1 Invalid</th>
          <th>Top-10 Invalid</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Liu-LSTM</td>
          <td>37.4%</td>
          <td>57.0%</td>
          <td>61.7%</td>
          <td>12.2%</td>
          <td>22.0%</td>
      </tr>
      <tr>
          <td>SCROP</td>
          <td>43.7%</td>
          <td>65.2%</td>
          <td>68.7%</td>
          <td>0.7%</td>
          <td>2.3%</td>
      </tr>
      <tr>
          <td>Lin-TF</td>
          <td>42.0%</td>
          <td>71.3%</td>
          <td>77.6%</td>
          <td>2.2%</td>
          <td>7.8%</td>
      </tr>
      <tr>
          <td>Base transformer</td>
          <td>44.3%</td>
          <td>68.4%</td>
          <td>72.7%</td>
          <td>1.7%</td>
          <td>12.1%</td>
      </tr>
      <tr>
          <td>Proposed ($K$=5)</td>
          <td>46.8%</td>
          <td>73.5%</td>
          <td>78.5%</td>
          <td>0.1%</td>
          <td>2.6%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model achieves a +3.1% top-1 accuracy improvement over the best previous template-free method and reduces the top-1 invalid rate to 0.1%.</p>
<h3 id="ablation-analysis">Ablation Analysis</h3>
<p>The ablation study isolates the contribution of each component:</p>
<ul>
<li><strong>Base+CC</strong> (cycle consistency only): reranks candidates to improve top-1/3/5 accuracy and validity, but top-10 stays the same since the candidate set is unchanged. Parameter count doubles (34.8M).</li>
<li><strong>Base+PT</strong> (parameter tying only): improves accuracy and validity at all top-$k$ levels with negligible parameter increase. Parameter tying during training improves the retrosynthesis transformer itself, even without cycle consistency at inference.</li>
<li><strong>Proposed ($K$=1)</strong>: combines tying with cycle consistency reranking.</li>
<li><strong>Proposed ($K$=5)</strong>: adds latent diversity, further improving top-10 accuracy (+2.2%) and reducing top-10 invalid rate (from 10.2% to 2.6%).</li>
</ul>
<h3 id="diversity-unique-rate">Diversity: Unique Rate</h3>
<p>As $K$ increases from 1 to 5, the unique molecule rate among 10 predictions rises substantially, confirming that latent modeling produces more diverse candidates. The learned prior reduces the top-1/top-10 accuracy trade-off compared to Chen et al.&rsquo;s uniform prior.</p>
<h2 id="results-on-in-house-multi-pathway-dataset">Results on In-House Multi-Pathway Dataset</h2>
<p>The in-house dataset (162K reactions from <a href="https://en.wikipedia.org/wiki/Reaxys">Reaxys</a>) contains multiple ground-truth reactions per product, enabling direct evaluation of pathway diversity through coverage (proportion of ground-truth pathways correctly predicted in the top-10 candidates).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Unique Rate</th>
          <th>Coverage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>64.2%</td>
          <td>91.6%</td>
          <td>76.1%</td>
          <td>84.4%</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>66.0%</td>
          <td>92.8%</td>
          <td>93.2%</td>
          <td>87.3%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model covers 87.3% of ground-truth reaction pathways on average, compared to 84.4% for the baseline. The unique rate jumps from 76.1% to 93.2%, confirming that the latent variables effectively encourage diverse predictions.</p>
<h2 id="limitations">Limitations</h2>
<p>The model uses SMILES string representation, which linearizes molecules and does not exploit the inherently rich chemical graph structure. Graph-based retrosynthesis models (e.g., GraphRetro at 63.8% top-1) substantially outperform template-free string-based models. The USPTO-50K dataset provides only one ground-truth pathway per product, making diversity evaluation limited on this benchmark. The in-house dataset is not publicly available. The model also does not predict reaction conditions (solvents, catalysts, temperature) or reagents.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ejklike/tied-twoway-transformer">ejklike/tied-twoway-transformer</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: USPTO-50K dataset (public, 50K reactions from USPTO patents). In-house dataset (162K reactions from Reaxys, not publicly available).</p>
<p><strong>Hardware</strong>: 4 NVIDIA Tesla M40 GPUs. Checkpoints saved every 5000 steps, last 5 averaged.</p>
<p><strong>Training</strong>: Adam optimizer ($\beta_1$ = 0.9, $\beta_2$ = 0.98), initial learning rate 2 with 8000 warm-up steps, dropout 0.3, gradient accumulation over 4 batches. Label smoothing set to 0.</p>
<p><strong>Inference</strong>: Beam size 10, generating 10 candidates per product.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, E., Lee, D., Kwon, Y., Park, M. S., &amp; Choi, Y.-S. (2021). Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables. <em>Journal of Chemical Information and Modeling</em>, 61(1), 123-133. <a href="https://doi.org/10.1021/acs.jcim.0c01074">https://doi.org/10.1021/acs.jcim.0c01074</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ejklike/tied-twoway-transformer">GitHub: ejklike/tied-twoway-transformer</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kim2021valid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kim, Eunji and Lee, Dongseon and Kwon, Youngchun and Park, Min Sik and Choi, Youn-Suk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{123--133}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01074}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>UnCorrupt SMILES: Post Hoc Correction for De Novo Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</guid><description>A transformer-based SMILES corrector that fixes invalid outputs from molecular generators, recovering 60-95% of erroneous SMILES strings.</description><content:encoded><![CDATA[<h2 id="a-transformer-based-smiles-error-corrector">A Transformer-Based SMILES Error Corrector</h2>
<p>This is a <strong>Method</strong> paper that proposes a post hoc approach to fixing invalid SMILES produced by de novo molecular generators. Rather than trying to prevent invalid outputs through alternative representations (<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or constrained architectures (graph models), the authors train a transformer model to translate invalid SMILES into valid ones. The corrector is framed as a sequence-to-sequence translation task, drawing on techniques from grammatical error correction (GEC) in natural language processing.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based generative models produce some percentage of invalid outputs that cannot be converted to molecules. The invalidity rate varies substantially across model types:</p>
<ul>
<li><strong>RNN models</strong> (DrugEx): 5.7% invalid (pretrained) and 4.7% invalid (target-directed)</li>
<li><strong>GANs</strong> (ORGANIC): 9.5% invalid</li>
<li><strong>VAEs</strong> (GENTRL): 88.9% invalid</li>
</ul>
<p>These invalid outputs represent wasted computation and potentially introduce bias toward molecules that are easier to generate correctly. Previous approaches to this problem include using alternative representations (<a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or graph-based models, but these either limit the search space or increase computational cost. The authors propose a complementary strategy: fix the errors after generation.</p>
<h2 id="error-taxonomy-across-generator-types">Error Taxonomy Across Generator Types</h2>
<p>The paper classifies invalid SMILES errors into six categories based on RDKit error messages:</p>
<ol>
<li><strong>Syntax errors</strong>: malformed SMILES grammar</li>
<li><strong>Unclosed rings</strong>: unmatched ring closure digits</li>
<li><strong>Parentheses errors</strong>: unbalanced open/close parentheses</li>
<li><strong>Bond already exists</strong>: duplicate bonds between the same atoms</li>
<li><strong>Aromaticity errors</strong>: atoms incorrectly marked as aromatic or kekulization failures</li>
<li><strong>Valence errors</strong>: atoms exceeding their maximum bond count</li>
</ol>
<p>The distribution of error types differs across generators. RNN-based models primarily produce aromaticity errors, suggesting they learn SMILES grammar well but struggle with chemical validity. The GAN (ORGANIC) produces mostly valence errors. The VAE (GENTRL) produces more grammar-level errors (syntax, parentheses, unclosed rings), indicating that sampling from the continuous latent space often produces sequences that violate basic SMILES structure.</p>
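<p>In the paper these categories are assigned from RDKit parser error messages; as a toy illustration only, two of the grammar-level classes can be detected with a few lines of plain Python:</p>

```python
def find_grammar_errors(smiles: str):
    """Toy detector for two grammar-level SMILES error classes:
    unbalanced parentheses and unclosed ring-closure digits.
    (The paper's real classification uses RDKit error messages;
    this ignores multi-digit %nn ring closures, bracket atoms, etc.)"""
    errors = []
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:                 # ")" with no matching "("
                errors.append("parentheses")
                break
    if depth > 0 and "parentheses" not in errors:
        errors.append("parentheses")      # "(" never closed
    # Each single-digit ring closure must appear an even number of times.
    open_rings = set()
    for ch in smiles:
        if ch.isdigit():
            open_rings ^= {ch}            # toggle: open, then close
    if open_rings:
        errors.append("unclosed ring")
    return errors

# "c1ccccc1" is fine; "c1cccc(c1" has an extra "("; "C1CC" never closes ring 1
```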
<h2 id="architecture-and-training">Architecture and Training</h2>
<p>The SMILES corrector uses a standard encoder-decoder transformer architecture based on Vaswani et al., with learned positional encodings. Key specifications:</p>
<ul>
<li>Embedding dimension: 256</li>
<li>Encoder/decoder layers: 3 each</li>
<li>Attention heads: 8</li>
<li>Feed-forward dimension: 512</li>
<li>Dropout: 0.1</li>
<li>Optimizer: Adam (learning rate 0.0005)</li>
<li>Training: 20 epochs, batch size 16</li>
</ul>
<p>Since no dataset of manually corrected invalid-valid SMILES pairs exists, the authors create synthetic training data by introducing errors into valid SMILES from the Papyrus bioactivity dataset (approximately 1.3M pairs). Errors are introduced through random perturbations following SMILES syntax rules: character substitutions, bond order changes, fragment additions from the <a href="/notes/chemistry/datasets/gdb-11/">GDB</a>-8 database to atoms with full valence, and other structural modifications.</p>
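<p>As a loose sketch of the pairing idea only (the delete/duplicate/substitute operations below are simplified stand-ins for the paper&rsquo;s syntax-aware perturbation rules), synthetic (invalid, valid) training pairs can be built like this:</p>

```python
import random

def corrupt(smiles: str, n_errors: int = 1, seed: int = 0) -> str:
    """Introduce n_errors random perturbations into a valid SMILES
    string. The operations here (delete / duplicate / substitute a
    character) are illustrative; the paper uses syntax-aware rules
    such as bond-order changes and fragment additions from GDB-8."""
    rng = random.Random(seed)
    chars = list(smiles)
    vocab = "CNOc()=#123"        # illustrative substitution alphabet
    for _ in range(n_errors):
        op = rng.choice(["delete", "duplicate", "substitute"])
        i = rng.randrange(len(chars))
        if op == "delete" and len(chars) > 1:
            del chars[i]
        elif op == "duplicate":
            chars.insert(i, chars[i])
        else:
            chars[i] = rng.choice(vocab)
    return "".join(chars)

# One synthetic training pair: (corrupted input, valid target).
target = "CC(=O)Oc1ccccc1C(=O)O"          # aspirin
pair = (corrupt(target, n_errors=3), target)
```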
<h2 id="training-with-multiple-errors-improves-correction">Training with Multiple Errors Improves Correction</h2>
<p>A key finding is that training the corrector on inputs with multiple errors per SMILES substantially improves performance on real generator outputs. The baseline model (1 error per input) fixes 35-80% of invalid outputs depending on the generator. Increasing errors per training input to 12 raises this to 62-95%:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>1 error/input</th>
          <th>12 errors/input</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN (DrugEx)</td>
          <td>~60% fixed</td>
          <td>62% fixed</td>
      </tr>
      <tr>
          <td>Target-directed RNN</td>
          <td>~60% fixed</td>
          <td>68% fixed</td>
      </tr>
      <tr>
          <td>GAN (ORGANIC)</td>
          <td>~80% fixed</td>
          <td>95% fixed</td>
      </tr>
      <tr>
          <td>VAE (GENTRL)</td>
          <td>~35% fixed</td>
          <td>80% fixed</td>
      </tr>
  </tbody>
</table>
<p>Training beyond 12 errors per input yields diminishing returns (80% average at 20 errors vs. 78% at 12). The improvement from multi-error training is consistent with GEC literature, where models learn to &ldquo;distrust&rdquo; inputs more when exposed to higher error rates.</p>
<p>The model also shows low overcorrection: only 14% of valid SMILES are altered during translation, comparable to overcorrection rates in spelling correction systems.</p>
<h2 id="fixed-molecules-are-comparable-to-generator-outputs">Fixed Molecules Are Comparable to Generator Outputs</h2>
<p>The corrected molecules are evaluated against both the training set and the readily generated (valid) molecules from each generator:</p>
<ul>
<li><strong>Uniqueness</strong>: 97% of corrected molecules are unique</li>
<li><strong>Novelty vs. generated</strong>: 97% of corrected molecules are novel compared to the valid generator outputs</li>
<li><strong>Similarity to nearest neighbor (SNN)</strong>: 0.45 between fixed and generated sets, indicating the corrected molecules explore different parts of chemical space</li>
<li><strong>Property distributions</strong>: KL divergence scores between fixed molecules and the training set are comparable to those between generated molecules and the training set</li>
</ul>
<p>This demonstrates that SMILES correction produces molecules that are as chemically reasonable as the generator&rsquo;s valid outputs while exploring complementary regions of chemical space.</p>
<h2 id="local-chemical-space-exploration-via-error-introduction">Local Chemical Space Exploration via Error Introduction</h2>
<p>Beyond fixing generator errors, the authors propose using the SMILES corrector for analog generation. The workflow is:</p>
<ol>
<li>Take a known active molecule</li>
<li>Introduce random errors into its SMILES (repeated 1000 times)</li>
<li>Correct the errors using the trained corrector</li>
</ol>
<p>This &ldquo;local sequence exploration&rdquo; generates novel analogs with 97% validity. The uniqueness (39%) and novelty (16-37%) are lower than for generator correction because the corrector often regenerates the original molecule. However, the approach produces molecules that are structurally similar to the starting compound (SNN of 0.85 to known ligands).</p>
<p>The authors demonstrate this on selective <a href="https://en.wikipedia.org/wiki/Aurora_kinase_B">Aurora kinase B</a> (AURKB) inhibitors. The generated analogs occupy the same binding site region as the co-crystallized ligand VX-680 in docking studies, with predicted bioactivities similar to known compounds. Compared to target-directed RNN generation, SMILES exploration produces molecules closer to known actives (higher SNN, scaffold similarity, and KL divergence scores).</p>
<h2 id="limitations">Limitations</h2>
<p>The corrector performance drops when applied to real generator outputs compared to synthetic test data, because the synthetic error distribution does not perfectly match the errors that generators actually produce. Generator-specific correctors trained on actual invalid outputs could improve performance. The local exploration approach has limited novelty since the corrector frequently regenerates the original molecule. The evaluation uses predicted rather than experimental bioactivities for the Aurora kinase case study.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">LindeSchoenmaker/SMILES-corrector</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training code, synthetic error generation, and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Synthetic training pairs derived from the Papyrus bioactivity dataset (v5.5). Approximately 1.3M invalid-valid pairs per error-count setting.</p>
<p><strong>Code</strong>: Transformer implemented in PyTorch, adapted from Ben Trevett&rsquo;s seq2seq tutorial. Generative model baselines use DrugEx, GENTRL, and ORGANIC.</p>
<p><strong>Evaluation</strong>: Validity assessed with RDKit. Similarity metrics (SNN, fragment, scaffold) and KL divergence computed following <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark protocols.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schoenmaker, L., Béquignon, O. J. M., Jespers, W., &amp; van Westen, G. J. P. (2023). UnCorrupt SMILES: a novel approach to de novo design. <em>Journal of Cheminformatics</em>, 15(1), 22. <a href="https://doi.org/10.1186/s13321-023-00696-x">https://doi.org/10.1186/s13321-023-00696-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">GitHub: LindeSchoenmaker/SMILES-corrector</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schoenmaker2023uncorrupt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{UnCorrupt SMILES: a novel approach to de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Schoenmaker, Linde and B{\&#39;e}quignon, Olivier J. M. and Jespers, Willem and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00696-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RetMol: Retrieval-Based Controllable Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/</guid><description>RetMol uses retrieval-augmented generation to steer a pre-trained molecular model toward desired properties using only a handful of exemplar molecules.</description><content:encoded><![CDATA[<h2 id="retrieval-augmented-generation-for-molecules">Retrieval-Augmented Generation for Molecules</h2>
<p>This is a <strong>Method</strong> paper that introduces RetMol, a retrieval-based framework for controllable molecule generation. The key idea is to guide a pre-trained generative model using a small set of exemplar molecules that partially satisfy the desired design criteria, retrieved from a task-specific database. The approach requires no task-specific fine-tuning of the generative backbone and works effectively with very few exemplar molecules (as few as 23).</p>
<h2 id="limitations-of-existing-controllable-generation">Limitations of Existing Controllable Generation</h2>
<p>Existing approaches to controllable molecule generation fall into three categories, each with drawbacks:</p>
<ol>
<li><strong>Reinforcement learning (RL)-based methods</strong> require task-specific fine-tuning of the generative model for each new objective</li>
<li><strong>Supervised learning (SL)-based methods</strong> need molecules with desired properties as training data, which may be scarce</li>
<li><strong>Latent optimization-based methods</strong> require training property predictors in the latent space, which is challenging with limited active molecules and incompatible with variable-length latent spaces like those in transformers</li>
</ol>
<p>RetMol addresses all three issues by keeping the generative backbone frozen and using a lightweight, task-agnostic retrieval module that can be applied to new tasks simply by swapping the retrieval database.</p>
<h2 id="the-retmol-framework">The RetMol Framework</h2>
<p>RetMol consists of four components built around a pre-trained encoder-decoder backbone (<a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>, a BART variant trained on ZINC):</p>
<h3 id="retrieval-database">Retrieval Database</h3>
<p>A task-specific collection of exemplar molecules that at least partially satisfy the design criteria. The database can be very small (e.g., 23 known inhibitors for the SARS-CoV-2 task) and is dynamically updated during inference with newly generated molecules.</p>
<h3 id="molecule-retriever">Molecule Retriever</h3>
<p>A heuristic-based module that selects the $K$ most relevant exemplar molecules (default $K = 10$). It first constructs a feasible set of molecules satisfying all constraints, then selects those with the best property scores. If too few molecules satisfy all constraints, it progressively relaxes constraints until enough candidates are available.</p>
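<p>A minimal sketch of this filter-rank-relax heuristic (the predicate ordering, property names, and scores are made up for illustration):</p>

```python
def retrieve(database, constraints, score, k=10):
    """Heuristic retriever sketch: keep molecules satisfying all
    constraints, rank by property score, and progressively drop
    constraints from the end of the list if too few candidates remain.
    `database` maps molecule -> dict of property values; `constraints`
    is an ordered list of predicates over that dict."""
    active = list(constraints)
    while True:
        feasible = [m for m, props in database.items()
                    if all(c(props) for c in active)]
        if len(feasible) >= k or not active:
            break
        active.pop()       # relax the last (least important) constraint
    feasible.sort(key=lambda m: score(database[m]), reverse=True)
    return feasible[:k]

# Toy database with made-up QED / synthetic-accessibility values.
db = {
    "mol_a": {"qed": 0.92, "sa": 2.1},
    "mol_b": {"qed": 0.55, "sa": 3.9},
    "mol_c": {"qed": 0.71, "sa": 5.2},
}
top = retrieve(db, [lambda p: p["qed"] >= 0.6, lambda p: p["sa"] <= 4],
               score=lambda p: p["qed"], k=2)
# only mol_a satisfies both constraints, so the SA constraint is
# relaxed and mol_c enters the feasible set
```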
<h3 id="information-fusion-via-cross-attention">Information Fusion via Cross-Attention</h3>
<p>The core trainable component. Retrieved exemplar embeddings are fused with the input molecule embedding using cross-attention:</p>
<p>$$\boldsymbol{e} = f_{\text{CA}}(\boldsymbol{e}_{\text{in}}, \boldsymbol{E}_r; \theta) = \text{Attn}(\text{Query}(\boldsymbol{e}_{\text{in}}), \text{Key}(\boldsymbol{E}_r)) \cdot \text{Value}(\boldsymbol{E}_r)$$</p>
<p>where $\boldsymbol{e}_{\text{in}} = \text{Enc}(x_{\text{in}}) \in \mathbb{R}^{L \times D}$ is the input embedding and $\boldsymbol{E}_r = [\boldsymbol{e}_r^1, \ldots, \boldsymbol{e}_r^K]$ are the retrieved exemplar embeddings. This module adds less than 5% parameter overhead (460K parameters over the 10M base model).</p>
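<p>In shapes, the fusion reduces to a single cross-attention. A NumPy sketch, with random projections standing in for the learned Query/Key/Value maps and illustrative dimensions:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(e_in, E_r, Wq, Wk, Wv):
    """Cross-attention fusion e = Attn(Query(e_in), Key(E_r)) . Value(E_r).
    e_in: (L, D) input embedding; E_r: (N, D) concatenated exemplar
    token embeddings; Wq/Wk/Wv: (D, D) projection matrices."""
    Q, K, V = e_in @ Wq, E_r @ Wk, E_r @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (L, N)
    return attn @ V                                           # (L, D)

rng = np.random.default_rng(0)
L, D, n_exemplar_tokens = 8, 16, 30     # illustrative sizes
e_in = rng.normal(size=(L, D))
E_r = rng.normal(size=(n_exemplar_tokens, D))
Wq, Wk, Wv = (0.1 * rng.normal(size=(D, D)) for _ in range(3))
e = fuse(e_in, E_r, Wq, Wk, Wv)         # fused embedding, shape (L, D)
```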
<h3 id="self-supervised-training-nearest-neighbor-prediction">Self-Supervised Training: Nearest Neighbor Prediction</h3>
<p>Rather than reconstructing the input molecule (which would make the retrieval module unnecessary), RetMol trains the fusion module to predict the nearest neighbor of the input:</p>
<p>$$\mathcal{L}(\theta) = \sum_{i=1}^{B} \text{CE}\left(\text{Dec}\left(f_{\text{CA}}(\boldsymbol{e}_{\text{in}}^{(i)}, \boldsymbol{E}_r^{(i)}; \theta)\right), x_{\text{1NN}}^{(i)}\right)$$</p>
<p>The remaining $K - 1$ nearest neighbors serve as the retrieved exemplar molecules. This forces the fusion module to learn how to use exemplar molecules to transform the input toward a related target. Only the fusion module parameters are updated; the encoder and decoder remain frozen.</p>
<h2 id="iterative-refinement-at-inference">Iterative Refinement at Inference</h2>
<p>During inference, RetMol uses an iterative process:</p>
<ol>
<li>Encode the input molecule and retrieved exemplars</li>
<li>Fuse embeddings via cross-attention</li>
<li>Perturb the fused embedding $M$ times with Gaussian noise</li>
<li>Greedily decode $M$ candidate molecules</li>
<li>Replace the input with the best candidate if it improves upon the current score</li>
<li>Add remaining good candidates to the retrieval database</li>
<li>Repeat until convergence or a maximum number of iterations</li>
</ol>
<p>The dynamic update of the retrieval database is critical for extrapolating beyond the initial set of exemplar molecules.</p>
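<p>The loop above can be sketched on a toy problem where "molecules" are scalars; all of <code>encode</code>, <code>fuse</code>, <code>decode</code>, <code>retrieve</code>, and <code>score</code> are placeholders for the real components, not the paper's implementation:</p>

```python
import numpy as np

def iterative_refine(x0, score, encode, fuse, decode, retrieve, db,
                     n_iter=10, M=4, sigma=1.0, seed=0):
    """Sketch of the inference loop: fuse, perturb M times, decode,
    keep the best improving candidate, and grow the retrieval database."""
    rng = np.random.default_rng(seed)
    x, best = x0, score(x0)
    for _ in range(n_iter):
        e = fuse(encode(x), retrieve(x, db))              # steps 1-2
        cands = [decode(e + sigma * rng.normal(size=e.shape))
                 for _ in range(M)]                       # steps 3-4
        top = max(cands, key=score)
        if score(top) > best:                             # step 5
            x, best = top, score(top)
            db.extend(c for c in cands if c is not top)   # step 6
    return x, best
```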
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>RetMol is evaluated on four tasks of increasing difficulty:</p>
<h3 id="qed-optimization-under-similarity-constraint">QED Optimization Under Similarity Constraint</h3>
<p>Goal: generate molecules with QED $\geq$ 0.9 while maintaining <a href="https://en.wikipedia.org/wiki/Tanimoto_coefficient">Tanimoto similarity</a> $\geq$ 0.4 to the input. RetMol achieves 94.5% success rate, compared to 92.8% for the previous best (QMO).</p>
<h3 id="penalized-logp-optimization">Penalized LogP Optimization</h3>
<p>Goal: maximize penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">LogP</a> while maintaining structural similarity. At $\delta = 0.4$, RetMol achieves 11.55 average improvement, compared to 7.71 for QMO.</p>
<h3 id="gsk3beta--jnk3-dual-inhibitor-design"><a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>$\beta$ + <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> Dual Inhibitor Design</h3>
<p>Goal: simultaneously satisfy four constraints (GSK3$\beta$ inhibition $\geq$ 0.5, JNK3 inhibition $\geq$ 0.5, QED $\geq$ 0.6, SA $\leq$ 4). Results:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Success %</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>47.9</td>
          <td>0.561</td>
          <td>0.621</td>
      </tr>
      <tr>
          <td>RationaleRL</td>
          <td>74.8</td>
          <td>0.568</td>
          <td>0.701</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>92.3</td>
          <td>0.824</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>MolEvol</td>
          <td>93.0</td>
          <td>0.757</td>
          <td>0.681</td>
      </tr>
      <tr>
          <td>RetMol</td>
          <td>96.9</td>
          <td>0.862</td>
          <td>0.732</td>
      </tr>
  </tbody>
</table>
<p>RetMol achieves this without task-specific fine-tuning and requires only 80 iterations compared to MARS&rsquo;s 550.</p>
<h3 id="sars-cov-2-main-protease-inhibitor-optimization"><a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 Main Protease</a> Inhibitor Optimization</h3>
<p>A real-world task using only 23 known inhibitors as the retrieval database and optimizing 8 weakly binding drugs. Under the milder similarity constraint ($\delta = 0.4$), RetMol achieves 2.84 kcal/mol average binding affinity improvement versus 1.67 for Graph GA. Under the stricter constraint ($\delta = 0.6$), RetMol succeeds on 5/8 molecules versus 3/8 for Graph GA.</p>
<h2 id="key-analysis-findings">Key Analysis Findings</h2>
<ul>
<li><strong>Database size</strong>: Strong performance even with 100 molecules, already outperforming baselines on success rate</li>
<li><strong>Database quality</strong>: Molecules satisfying all four constraints give the best results (96.9%), but partial satisfaction still works reasonably (84.7% with two properties)</li>
<li><strong>Training objective</strong>: The nearest neighbor prediction objective outperforms conventional reconstruction on validity (0.902 vs. 0.834) and uniqueness (0.922 vs. 0.665)</li>
<li><strong>Dynamic database update</strong>: Essential for extrapolating beyond the initial retrieval database, generating molecules with property values exceeding the best in the original database</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>RetMol requires exemplar molecules that at least partially satisfy the design criteria. When such molecules are entirely unavailable, the framework cannot be applied. The method also relies on property predictors (for scoring and retrieval), whose accuracy directly affects generation quality. The iterative refinement process adds computational overhead at inference time, and the results depend on the Chemformer backbone&rsquo;s generation capabilities.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol</a></td>
          <td>Code</td>
          <td>NVIDIA Source Code License-NC</td>
          <td>Full training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol (checkpoints)</a></td>
          <td>Model</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Pre-trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k and ChEMBL datasets for training. Task-specific retrieval databases constructed from these datasets. COVID-19 task uses 23 known SARS-CoV-2 Mpro inhibitors.</p>
<p><strong>Training</strong>: Information fusion module trained on 4x V100 GPUs (16GB each) for approximately 2 hours. Batch size of 256 per GPU, 50K iterations.</p>
<p><strong>Inference</strong>: Single V100 GPU. Greedy decoding with Gaussian perturbation ($\sigma = 1$) for sampling multiple candidates per iteration.</p>
<p><strong>Backbone</strong>: Chemformer (BART variant) pre-trained on ZINC. Frozen during RetMol training and inference.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Z., Nie, W., Qiao, Z., Xiao, C., Baraniuk, R. G., &amp; Anandkumar, A. (2023). Retrieval-based Controllable Molecule Generation. <em>Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023)</em>.</p>
<p><strong>Publication</strong>: International Conference on Learning Representations (ICLR) 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/NVlabs/RetMol">GitHub: NVlabs/RetMol</a></li>
<li><a href="https://openreview.net/forum?id=vDFA1tpuLvk">OpenReview</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2023retrieval,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Retrieval-based Controllable Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Zichao and Nie, Weili and Qiao, Zhuoran and Xiao, Chaowei and Baraniuk, Richard G. and Anandkumar, Anima}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=vDFA1tpuLvk}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Regression Transformer: Prediction Meets Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</guid><description>The Regression Transformer unifies property prediction and conditional generation in one multitask model by casting regression as sequence modelling.</description><content:encoded><![CDATA[<h2 id="a-multitask-model-that-unifies-regression-and-generation">A Multitask Model That Unifies Regression and Generation</h2>
<p>The Regression Transformer (RT) is a <strong>Method</strong> paper. It introduces a single model architecture that can both predict continuous molecular properties and conditionally generate molecules with desired property values. The core idea is to reformulate regression as a sequence modelling task: instead of training a dedicated regression head, continuous property values are tokenized into sequences of digits and predicted alongside molecular tokens using a cross-entropy loss.</p>
<h2 id="closing-the-gap-between-predictors-and-generators">Closing the Gap Between Predictors and Generators</h2>
<p>Existing transformer-based approaches in computational chemistry develop property predictors and generative models as separate systems. Even when a single architecture like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) addresses both tasks, it does so through task-specific heads. This means the two capabilities remain disjoint, and the generative model cannot use its own property prediction ability during generation.</p>
<p>The RT addresses three specific gaps:</p>
<ol>
<li><strong>No true multitask entanglement</strong>: Prior work either tunes separate heads for prediction and generation or limits communication between modules to a reward signal.</li>
<li><strong>No inductive bias for continuous properties</strong>: Molecular generative models lack mechanisms to condition generation on floating-point property values.</li>
<li><strong>Disconnected workflows</strong>: Property predictors cannot generate molecules, and generators cannot assess whether their outputs satisfy property constraints.</li>
</ol>
<h2 id="core-innovation-regression-as-conditional-sequence-modelling">Core Innovation: Regression as Conditional Sequence Modelling</h2>
<p>The RT&rsquo;s key insight is that regression can be cast as sequential classification over digit tokens while preserving predictive accuracy. This is achieved through three components:</p>
<h3 id="numerical-tokenization">Numerical Tokenization</h3>
<p>Floating-point property values are split into individual digit tokens that preserve decimal order. Each token $t_{v,p}$ encodes a digit value $v \in [0, 9]$ and its decimal place $p \in \mathbb{Z}$. For example, the value 12.3 becomes the token sequence <code>[1_1, 2_0, 3_-1]</code>.</p>
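<p>A sketch of that tokenization, returning (digit, decimal place) pairs rather than the RT's actual token strings:</p>

```python
def tokenize_float(value, precision=1):
    """Split a non-negative float into (digit, decimal place) pairs,
    e.g. 12.3 -> [(1, 1), (2, 0), (3, -1)]."""
    int_part, frac_part = f"{value:.{precision}f}".split(".")
    tokens = [(int(d), len(int_part) - 1 - i) for i, d in enumerate(int_part)]
    tokens += [(int(d), -(i + 1)) for i, d in enumerate(frac_part)]
    return tokens
```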
<h3 id="numerical-encodings">Numerical Encodings</h3>
<p>To provide an inductive bias about the semantic proximity of digit tokens (which cross-entropy loss cannot convey), the RT introduces Numerical Encodings (NEs), analogous to positional encodings. For a token $t_{v,p}$ at embedding dimension $j$:</p>
<p>$$
\text{NE}_{\text{Float}}(v, p, j) = (-1)^j \cdot \frac{v \cdot 10^p}{j + 1}
$$</p>
<p>These encodings ensure that pairwise distances between digit tokens decay monotonically with their floating-point proximity. The model can also learn digit orderings from data alone, but NEs provide a useful inductive bias.</p>
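<p>A direct transcription of the formula, which also makes the distance-decay property easy to check:</p>

```python
import numpy as np

def numerical_encoding(v, p, dim):
    """NE_Float(v, p, j) = (-1)^j * v * 10^p / (j + 1) for j = 0..dim-1."""
    j = np.arange(dim)
    return (-1.0) ** j * (v * 10.0 ** p) / (j + 1)
```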
<h3 id="alternating-training-with-self-consistency">Alternating Training with Self-Consistency</h3>
<p>The RT uses an <a href="https://en.wikipedia.org/wiki/XLNet">XLNet</a> backbone trained with permutation language modelling (PLM). The key is that the same model serves two roles depending on which tokens are masked:</p>
<ul>
<li><strong>Mask numerical tokens</strong>: the model performs property prediction (regression)</li>
<li><strong>Mask textual tokens</strong>: the model performs conditional sequence generation</li>
</ul>
<p>The base PLM objective is:</p>
<p>$$
\mathcal{L}_{\text{PLM}} = \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{i=c+1}^{T} \log p_\theta(x_{z_i} \mid \mathbf{x}_{\mathbf{z}_{&lt; i}}) \right]
$$</p>
<p>This is refined into two specialized objectives: a property prediction objective $\mathcal{L}_P$ that masks only numerical tokens, and a generation objective $\mathcal{L}_G$ that masks only textual tokens. Training alternates between these every 50 steps.</p>
<p>The self-consistency (SC) loss adds a critical feedback loop. After generating a candidate molecule $\hat{\mathbf{x}}$, the model re-evaluates it by predicting the property of the generated sequence:</p>
<p>$$
\mathcal{L}_{\text{SC}} = \mathcal{L}_G(\mathbf{x}) + \alpha \cdot \mathcal{L}_P(\hat{\mathbf{x}})
$$</p>
<p>This rewards generating molecules whose predicted properties match the primed property value, exploiting the RT&rsquo;s dual capability as both predictor and generator.</p>
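<p>Schematically, one training step alternates between the two objectives; here <code>model</code> is a placeholder exposing the two losses and a generate call (not the actual RT API), and <code>alpha</code> weights the self-consistency term:</p>

```python
def training_step(step, x, model, alpha=1.0, switch_every=50):
    """Alternate objectives every `switch_every` steps: property prediction
    (mask numerical tokens) vs. generation with the self-consistency loss."""
    if (step // switch_every) % 2 == 0:
        return model.loss_P(x)                   # L_P: predict the property
    x_hat = model.generate(x)                    # decode a candidate molecule
    return model.loss_G(x) + alpha * model.loss_P(x_hat)   # L_SC
```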
<h2 id="experiments-across-molecules-proteins-and-reactions">Experiments Across Molecules, Proteins, and Reactions</h2>
<h3 id="drug-likeness-qed">Drug Likeness (QED)</h3>
<p>Initial validation on a synthetic QED dataset (~1.4M molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>) demonstrated that the RT can simultaneously learn to predict QED scores (RMSE &lt; 0.06) and generate novel molecules conditioned on desired QED values (Spearman&rsquo;s $\rho$ up to 0.517 between primers and generated molecule properties). Novelty exceeded 99% across all configurations. The alternating training scheme with SC loss outperformed both single-task models and the vanilla PLM objective.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations proved comparable to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> for property prediction and far superior for generation (~100% validity vs. ~40% for SMILES).</p>
<h3 id="moleculenet-regression-benchmarks">MoleculeNet Regression Benchmarks</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks ESOL, FreeSolv, and Lipophilicity, the RT outperformed XGBoost and MPNN baselines despite using only a classification loss. It performed on par with XLNet using a conventional regression head, and was only mildly inferior to models like BERT and BART that used large-scale self-supervised pre-training with regression losses.</p>
<p>Critically, only the RT could also conditionally generate molecules for these tasks. External validation with Grover (a self-supervised Graph Transformer) confirmed high correlation with the RT&rsquo;s own property predictions (0.86, 0.84, and 0.75 for ESOL, FreeSolv, and Lipophilicity respectively).</p>
<h3 id="constrained-property-optimization">Constrained Property Optimization</h3>
<p>On the penalized logP (plogP) benchmark with similarity constraints, the RT outperformed JT-VAE and GCPN by large margins. At similarity threshold $\delta = 0.4$, the RT achieved 3.16 average improvement with 97.1% success rate, while also predicting plogP with PCC of 0.92. Competing methods cannot perform property prediction at all.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Improvement ($\delta$=0.4)</th>
          <th>Success</th>
          <th>Property Prediction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.84</td>
          <td>83.6%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>2.49</td>
          <td>100%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>4.71</td>
          <td>85.7%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td><strong>RT</strong></td>
          <td><strong>3.16</strong></td>
          <td><strong>97.1%</strong></td>
          <td><strong>PCC = 0.92</strong></td>
      </tr>
  </tbody>
</table>
<p>The comparison is not strictly fair: all competing methods are trained specifically to maximize plogP, and some (GCPN, JT-VAE) apply gradient optimization at inference time. The RT is only trained to reconstruct molecules with similar predicted plogP to the seed, so its training objective is property-agnostic rather than directly optimizing for higher plogP values.</p>
<h3 id="protein-language-modelling">Protein Language Modelling</h3>
<p>On the TAPE benchmark, the RT matched or outperformed conventional transformers on fluorescence and stability prediction tasks, despite those baselines being pre-trained on 24-106 million protein sequences (vs. 2.6 million for the RT). The RT also performed conditional protein generation, a task that none of the TAPE baselines can address.</p>
<h3 id="chemical-reaction-modelling">Chemical Reaction Modelling</h3>
<p>The RT was applied to reaction yield prediction on <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig amination</a> and <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> datasets. It matched Yield-BERT performance ($R^2$ = 0.939 and 0.81 respectively) while also enabling novel capabilities: reconstructing missing precursors from partial reactions and decorating existing reactions to achieve higher predicted yields. Across both datasets, over 40% of top-five predicted sequences contained reactions with novel precursors and higher predicted yield.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Regression can be successfully reformulated as sequential classification over digit tokens without losing predictive accuracy compared to models using regression losses.</li>
<li>The alternating training scheme with self-consistency loss enables cross-task benefits, where the model outperforms single-task variants at both prediction and generation.</li>
<li>A single ~27M parameter model handles property prediction, conditional molecular generation, conditional protein generation, and reaction yield prediction with precursor generation.</li>
<li>The model learns the natural ordering of digits from data: 47% of embedding dimensions for the tenths place directly encode digit ordering even without explicit numerical encodings.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ol>
<li><strong>No large-scale pre-training</strong>: The RT uses ~27M parameters trained from scratch on task-specific datasets, unlike <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a> or MoLFormer which pre-train on billions of molecules. Scaling up could improve results.</li>
<li><strong>Fine-grained regression precision</strong>: The model sometimes struggles with intra-mode precision (e.g., on the fluorescence dataset where predictions cluster around bright/dark modes rather than capturing continuous variation).</li>
<li><strong>Single-property focus</strong>: All reported experiments use a single continuous property, though the framework naturally extends to multi-property settings.</li>
<li><strong>SELFIES validity caveats</strong>: While SELFIES are always syntactically valid, they can produce degenerate short molecules (~1.9% defective generations where the output has less than 50% of the seed&rsquo;s atoms).</li>
<li><strong>XLNet backbone limitations</strong>: Results on MoleculeNet regression are slightly below models using BART or BERT backbones with large-scale pre-training, suggesting the RT framework could benefit from stronger base models.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/regression-transformer">Regression Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/GT4SD/gt4sd-core">GT4SD Integration</a></td>
          <td>Code + Models</td>
          <td>MIT</td>
          <td>Pre-trained model inference pipelines</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></td>
          <td>Demo</td>
          <td>-</td>
          <td>Interactive inference webapp</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug likeness</td>
          <td>ChEMBL (QED)</td>
          <td>~1.4M molecules</td>
          <td>Synthetic QED labels computed with RDKit</td>
      </tr>
      <tr>
          <td>Regression benchmark</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipo)</td>
          <td>642-4,200 compounds</td>
          <td>16x SMILES augmentation, 3 random splits</td>
      </tr>
      <tr>
          <td>Property optimization</td>
          <td>ZINC (plogP)</td>
          <td>215,381 train / 799 test</td>
          <td>Fixed split from Jin et al. (2018)</td>
      </tr>
      <tr>
          <td>Protein pre-training</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a> (Boman)</td>
          <td>2,648,205 peptides</td>
          <td>15-45 amino acid peptides</td>
      </tr>
      <tr>
          <td>Protein benchmarks</td>
          <td>TAPE (Fluorescence, Stability)</td>
          <td>21,446-53,416 samples</td>
          <td>Fixed splits</td>
      </tr>
      <tr>
          <td>Reaction pre-training</td>
          <td>USPTO</td>
          <td>2,830,616 reactions</td>
          <td>Molecular weight as numerical property</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig / Suzuki</td>
          <td>3,955 / 5,760 reactions</td>
          <td>Ten 70/30 random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: XLNet (32 hidden layers, 256 hidden dim, 1024 FFN dim, 16 attention heads, 20% dropout)</li>
<li>Parameters: ~27 million</li>
<li>Training: Permutation language modelling pre-training, then alternating objectives (property prediction + conditional generation with SC loss)</li>
<li>Decoding: Greedy for property prediction, beam search for sequence generation</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>RT Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED prediction</td>
          <td>RMSE</td>
          <td>0.037</td>
          <td>Best config (NE + SC)</td>
      </tr>
      <tr>
          <td>QED generation</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.517</td>
          <td>Between primers and generated QED</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>Comparable to XLNet</td>
          <td>Within s.d. of regression-loss XLNet</td>
      </tr>
      <tr>
          <td>plogP optimization ($\delta$=0.4)</td>
          <td>Improvement</td>
          <td>3.16</td>
          <td>Outperforms JT-VAE, GCPN</td>
      </tr>
      <tr>
          <td>Protein fluorescence</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.72</td>
          <td>Outperforms TAPE baselines</td>
      </tr>
      <tr>
          <td>BH yield prediction</td>
          <td>$R^2$</td>
          <td>0.939</td>
          <td>Near Yield-BERT (0.951)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained on single GPUs (NVIDIA A100 or V100)</li>
<li>Training time: ~4 days for pre-training, ~1 day for fine-tuning</li>
<li>Framework: PyTorch 1.3.1 with HuggingFace Transformers 3.1.0</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Born, J. &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <em>Nature Machine Intelligence</em>, 5(4), 432-444. <a href="https://doi.org/10.1038/s42256-023-00639-z">https://doi.org/10.1038/s42256-023-00639-z</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence, April 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/regression-transformer">Regression Transformer GitHub Repository</a></li>
<li><a href="https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer">GT4SD Integration</a></li>
<li><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{born2023regression,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Regression Transformer enables concurrent sequence regression and generation for molecular language modelling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Born, Jannis and Manica, Matteo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{432--444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LIMO: Latent Inceptionism for Targeted Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/</guid><description>LIMO uses gradient-based optimization through a VAE latent space and stacked property predictor to generate drug-like molecules with high binding affinity.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eckmann, P., Sun, K., Zhao, B., Feng, M., Gilson, M. K., &amp; Yu, R. (2022). LIMO: Latent Inceptionism for Targeted Molecule Generation. <em>Proceedings of the 39th International Conference on Machine Learning (ICML 2022)</em>, PMLR 162, 5777&ndash;5792.</p>
<p><strong>Publication</strong>: ICML 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Rose-STL-Lab/LIMO">GitHub: Rose-STL-Lab/LIMO</a></li>
<li><a href="https://arxiv.org/abs/2206.09010">arXiv: 2206.09010</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{eckmann2022limo,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LIMO: Latent Inceptionism for Targeted Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Eckmann, Peter and Sun, Kunyang and Zhao, Bo and Feng, Mudong and Gilson, Michael K and Yu, Rose}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5777--5792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">organization</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="gradient-based-reverse-optimization-in-molecular-latent-space">Gradient-Based Reverse Optimization in Molecular Latent Space</h2>
<p>This is a <strong>Method</strong> paper that introduces LIMO, a framework for generating molecules with desired properties using gradient-based optimization on a VAE latent space. The key innovation is a stacked architecture where a property predictor operates on the decoded molecular representation rather than directly on the latent space, combined with an inceptionism-like technique that backpropagates through the frozen decoder and predictor to optimize the latent code. This approach is 6-8x faster than RL baselines and 12x faster than sampling-based approaches while producing molecules with higher binding affinities.</p>
<h2 id="slow-property-optimization-in-existing-methods">Slow Property Optimization in Existing Methods</h2>
<p>Generating molecules with high binding affinity to target proteins is a central goal of early drug discovery, but existing computational approaches are slow when optimizing for properties that are expensive to evaluate (such as docking-based binding affinity). RL-based methods require many calls to the property function during training. Sampling-based approaches like MARS need hundreds of iterations. Latent optimization methods that predict properties directly from the latent space suffer from poor prediction accuracy because the mapping from latent space to molecular properties is difficult to learn.</p>
<h2 id="the-limo-framework">The LIMO Framework</h2>
<p>LIMO consists of three components: a VAE for learning a molecular latent space, a property predictor with a novel stacked architecture, and a gradient-based reverse optimization procedure.</p>
<h3 id="selfies-based-vae">SELFIES-Based VAE</h3>
<p>The VAE encodes molecules represented as SELFIES strings into a 1024-dimensional latent space $\mathbf{z} \in \mathbb{R}^m$ and decodes to probability distributions over SELFIES symbols. Since all SELFIES strings correspond to valid molecules, this guarantees 100% chemical validity. The output molecule is obtained by taking the argmax at each position:</p>
<p>$$\hat{x}_i = s_{d_i^*}, \quad d_i^* = \operatorname{argmax}_{d} \{y_{i,1}, \ldots, y_{i,d}\}$$</p>
<p>The VAE uses fully-connected layers (not recurrent), with a 64-dimensional embedding layer, four batch-normalized linear layers (2000-dimensional first layer, 1000-dimensional for the rest) with ReLU activation, and is trained with ELBO loss (0.9 weight on reconstruction, 0.1 on KL divergence).</p>
<h3 id="stacked-property-predictor">Stacked Property Predictor</h3>
<p>The critical architectural choice: the property predictor $g_\theta$ takes the decoded molecular representation $\hat{\mathbf{x}}$ as input rather than the latent code $\mathbf{z}$. The predictor is trained after the VAE is frozen by minimizing MSE on VAE-generated molecules:</p>
<p>$$\ell_0(\theta) = \left\| g_\theta\left(f_{\text{dec}}(\mathbf{z})\right) - \pi\left(f_{\text{dec}}(\mathbf{z})\right) \right\|^2$$</p>
<p>where $\pi$ is the ground-truth property function. This stacking improves prediction accuracy from $r^2 = 0.04$ (predicting from $\mathbf{z}$) to $r^2 = 0.38$ (predicting from $\hat{\mathbf{x}}$) on an unseen test set. The improvement comes because the mapping from molecular space to property is easier to learn than the mapping from latent space to property.</p>
<h3 id="reverse-optimization-inceptionism">Reverse Optimization (Inceptionism)</h3>
<p>After training, the decoder and predictor weights are frozen and $\mathbf{z}$ becomes the trainable parameter. For multiple properties with weights $(w_1, \ldots, w_k)$, the optimization minimizes:</p>
<p>$$\ell_1(\mathbf{z}) = -\sum_{i=1}^{k} w_i \cdot g^i\left(f_{\text{dec}}(\mathbf{z})\right)$$</p>
<p>Since both the decoder and predictor are neural networks, gradients flow through the entire chain, enabling efficient optimization with Adam. This is analogous to the &ldquo;inceptionism&rdquo; (DeepDream) technique from computer vision, where network inputs are optimized to maximize specific outputs.</p>
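<p>The mechanics can be seen with a toy stand-in: freeze two maps, treat $\mathbf{z}$ as the parameter, and follow the chained gradient. The linear &ldquo;decoder&rdquo; and &ldquo;predictor&rdquo; below are hypothetical, and plain gradient ascent stands in for the paper&rsquo;s Adam updates:</p>

```python
import numpy as np

# Toy illustration of reverse optimization: decoder f_dec and predictor
# g are frozen (here, hypothetical linear maps, not LIMO's networks),
# z is the only trainable parameter, and gradient ascent increases
# the predicted property g(f_dec(z)).
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(8, 4))   # frozen "decoder": z (4-d) -> x-hat (8-d)
w_pred = rng.normal(size=8)       # frozen "predictor": x-hat -> property

def predicted_property(z):
    return float(w_pred @ (W_dec @ z))

z = rng.normal(size=4)
before = predicted_property(z)
grad = W_dec.T @ w_pred           # chain rule through both frozen maps
for _ in range(50):
    z = z + 0.1 * grad            # ascend the predicted property
after = predicted_property(z)
print(f"{before:.2f} -> {after:.2f}")
```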
<h3 id="substructure-constrained-optimization">Substructure-Constrained Optimization</h3>
<p>For lead optimization, LIMO can fix a molecular substructure during optimization by adding a regularization term:</p>
<p>$$\ell_2(\mathbf{z}) = \lambda \sum_{i=1}^{n} \sum_{j=1}^{d} \left(M_{i,j} \cdot \left(f_{\text{dec}}(\mathbf{z})_{i,j} - (\hat{\mathbf{x}}_{\text{start}})_{i,j}\right)\right)^2$$</p>
<p>where $M$ is a binary mask specifying which SELFIES positions must remain unchanged and $\lambda = 1000$. This capability is enabled by the intermediate decoded representation, which most VAE-based methods lack.</p>
<h2 id="experiments-and-results">Experiments and Results</h2>
<h3 id="benchmark-tasks-qed-and-penalized-logp">Benchmark Tasks (QED and Penalized LogP)</h3>
<p>LIMO achieves competitive results with deep generative and RL-based models in 1 hour, compared to 8-24 hours for baselines. Top QED score: 0.947 (maximum possible: 0.948). Top penalized LogP: 10.5 (among length-limited models, comparable to MolDQN&rsquo;s 11.8).</p>
<p>The ablation study (&ldquo;LIMO on z&rdquo;) confirms the value of the stacked predictor architecture: predicting from $\hat{\mathbf{x}}$ yields a top p-logP of 10.5 versus 6.52 when predicting directly from $\mathbf{z}$.</p>
<h3 id="binding-affinity-maximization">Binding Affinity Maximization</h3>
<p>The primary contribution. LIMO generates molecules with substantially higher computed binding affinities (lower $K_D$) than baselines against two protein targets:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>ESR1 best $K_D$ (nM)</th>
          <th>ACAA1 best $K_D$ (nM)</th>
          <th>Time (hrs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>6.4</td>
          <td>75</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MolDQN</td>
          <td>373</td>
          <td>240</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>17</td>
          <td>163</td>
          <td>6</td>
      </tr>
      <tr>
          <td>GraphDF</td>
          <td>25</td>
          <td>370</td>
          <td>12</td>
      </tr>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>37</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p>For ESR1, LIMO&rsquo;s best molecule has a $K_D$ of 0.72 nM from docking, nearly 10x better than the next method (GCPN at 6.4 nM). When corroborated with more rigorous absolute binding free energy (ABFE) calculations, one LIMO compound achieved a predicted $K_D$ of $6 \times 10^{-14}$ M (0.00006 nM), far exceeding the affinities of approved drugs tamoxifen ($K_D$ = 1.5 nM) and raloxifene ($K_D$ = 0.03 nM).</p>
<h3 id="multi-objective-optimization">Multi-Objective Optimization</h3>
<p>Single-objective optimization produces molecules with high affinity but problematic structures (polyenes, large rings). Multi-objective optimization simultaneously targeting binding affinity, QED ($&gt;$ 0.4), and SA ($&lt;$ 5.5) produces drug-like, synthesizable molecules that still have nanomolar binding affinities. Generated molecules satisfy Lipinski&rsquo;s rule of 5 with zero PAINS alerts.</p>
<h2 id="limitations">Limitations</h2>
<p>The LIMO property predictor achieves only moderate prediction accuracy ($r^2$ = 0.38), meaning the optimization relies on the gradient direction being correct rather than on the absolute predictions being accurate. AutoDock-GPU docking scores do not correlate well with the more accurate ABFE results, a known limitation of docking. The fully-connected VAE architecture limits molecular diversity compared to recurrent or attention-based alternatives (an LSTM decoder produced a max QED of only 0.3). The greedy fine-tuning step (replacing carbons with heteroatoms) is a heuristic rather than a learned procedure.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Rose-STL-Lab/LIMO">Rose-STL-Lab/LIMO</a></td>
          <td>Code</td>
          <td>UC San Diego Custom (non-commercial)</td>
          <td>Full training, optimization, and evaluation code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k dataset for optimization tasks. MOSES dataset for random generation evaluation. Binding affinities computed with AutoDock-GPU.</p>
<p><strong>Hardware</strong>: Two GTX 1080 Ti GPUs (one for PyTorch, one for AutoDock-GPU), 4 CPU cores, 32 GB memory.</p>
<p><strong>Training</strong>: VAE trained for 18 epochs with learning rate 0.0001. Property predictor uses 3 layers of 1000 units, trained for 5 epochs. Reverse optimization uses learning rate 0.1 for 10 epochs.</p>
<p><strong>Targets</strong>: Human estrogen receptor (ESR1, PDB 1ERR) and human peroxisomal acetyl-CoA acyl transferase 1 (ACAA1, PDB 2IIK).</p>
]]></content:encoded></item><item><title>BARTSmiles: BART Pre-Training for Molecular SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/</guid><description>BARTSmiles applies BART-style denoising pre-training to 1.7B SMILES from ZINC20, achieving top results on 11 molecular property and reaction tasks.</description><content:encoded><![CDATA[<h2 id="a-bart-based-method-for-molecular-self-supervised-learning">A BART-Based Method for Molecular Self-Supervised Learning</h2>
<p>BARTSmiles is a <strong>Method</strong> paper. It introduces a self-supervised pre-training approach for molecular representations based on the BART (Bidirectional and Auto-Regressive Transformers) architecture from Lewis et al. (2019). The primary contribution is a pre-training strategy, discovered through systematic ablations, that trains a BART-large model on 1.7 billion deduplicated <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> from the <a href="/notes/chemistry/datasets/zinc-22/">ZINC20 dataset</a>. BARTSmiles achieves the best reported results on 11 tasks spanning molecular property classification, regression, and chemical reaction generation.</p>
<h2 id="scaling-self-supervised-molecular-representations-beyond-prior-work">Scaling Self-Supervised Molecular Representations Beyond Prior Work</h2>
<p>At the time of publication, large-scale self-supervised representation learning had produced significant improvements in NLP, computer vision, and speech, but molecular representation learning had not benefited from comparable scale. Previous SMILES-based pre-trained models such as <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> (Chithrananda et al., 2020) and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a> (Irwin et al., 2022) used encoder-only or encoder-decoder architectures with substantially less compute. ChemFormer, the most closely related prior work, also trained a BART-like model but with a fraction of the compute and data.</p>
<p>The paper argues that three gaps needed to be addressed:</p>
<ol>
<li><strong>Scale</strong>: Prior molecular pre-training used orders of magnitude less compute than NLP pre-training.</li>
<li><strong>Architecture choice</strong>: Encoder-only models like ChemBERTa cannot perform generative fine-tuning (retrosynthesis, reaction prediction), limiting their applicability.</li>
<li><strong>Pre-training recipe</strong>: Standard BART hyperparameters (e.g., 30% mask token budget) were tuned for natural language and had not been validated for molecular SMILES strings.</li>
</ol>
<h2 id="core-innovation-ablation-driven-pre-training-recipe-for-smiles">Core Innovation: Ablation-Driven Pre-Training Recipe for SMILES</h2>
<p>The key insight of BARTSmiles is that the BART denoising objective, when carefully tuned for the molecular domain, learns representations that implicitly encode downstream task information. The authors discover this through a systematic three-stage ablation:</p>
<h3 id="tokenization">Tokenization</h3>
<p>Rather than using hand-crafted tokenization rules that separate individual atoms (C, N, H) and bond symbols (#, =), BARTSmiles uses a learned SentencePiece unigram tokenizer trained on 10 million random SMILES with a vocabulary size of 1,021. On matched compute budgets, learned tokenization achieves 0.801 average AUC-ROC vs. 0.779 for hand-crafted tokenization on the ablation benchmark (HIV, BBBP, ClinTox).</p>
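<p>For contrast with the learned tokenizer, a hand-crafted baseline can be sketched as a regex split. The pattern below is a common SMILES tokenization idiom used purely for illustration, not the paper&rsquo;s exact rule set:</p>

```python
import re

# Hand-crafted baseline tokenization: split SMILES into bracket atoms,
# two-letter elements, single atoms, bond/branch symbols, and ring
# digits. Illustrative pattern only; BARTSmiles replaces this with a
# learned SentencePiece unigram tokenizer.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOPSFIbcnops]|[=#/\\()+\-.%@]|\d)"
)

def hand_crafted_tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

print(hand_crafted_tokenize("CC(=O)Oc1ccccc1"))  # aspirin fragment, 15 tokens
```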
<h3 id="masking-strategy">Masking Strategy</h3>
<p>The BART denoising objective has three main hyperparameters: the mask token budget (fraction of tokens masked), random mask probability, and the Poisson $\lambda$ controlling mask span length. The ablation results show:</p>
<ul>
<li><strong>Mask token budget</strong>: The standard BART value of 0.30 is suboptimal for molecules. A budget of 0.20 performs best (0.821 AUC-ROC), with performance degrading at both lower (0.10: 0.753) and higher (0.40: 0.701) budgets.</li>
<li><strong>Span masking</strong>: The choice of random mask probability and $\lambda$ has a minor effect once the budget is set to 0.20. Values of random mask = 0.10 and $\lambda$ = 2.5 or 3.5 all yield 0.821.</li>
<li><strong>Token randomization</strong>: Disabling the randomize-tokens noise (where some tokens are replaced with random tokens rather than masked) improves performance from 0.821 to 0.835.</li>
</ul>
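<p>The recipe the ablations settle on (20% budget, Poisson spans, no token randomization) can be sketched as follows. Character-level tokens and the fixed seed are purely illustrative; the real pipeline operates on learned tokenizer output:</p>

```python
import numpy as np

# Sketch of the selected noising recipe: mask roughly 20% of tokens in
# Poisson-length spans (lambda = 2.5) and skip random token
# replacement. Illustrative only, not the BARTSmiles implementation.
def span_mask(tokens, budget=0.20, lam=2.5, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    n_to_mask = int(round(budget * len(tokens)))
    masked = 0
    while n_to_mask - masked > 0:
        span = max(1, int(rng.poisson(lam)))
        span = min(span, n_to_mask - masked)
        start = int(rng.integers(0, len(tokens) - span + 1))
        for i in range(start, start + span):
            tokens[i] = "[MASK]"   # spans may overlap earlier masks
        masked += span
    return tokens

noised = span_mask(list("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, char-tokenized
print("".join(noised))
```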
<h3 id="scale">Scale</h3>
<p>Training on the full 1.7 billion molecule ZINC20 dataset (20 hours on 1,024 A100 GPUs, totaling 20,480 A100 GPU-hours) improves performance by 5 absolute AUC-ROC points over the same model trained on 100 million samples. The previous most compute-intensive molecular pre-training used 3,330 V100-hours (Ross et al., 2021).</p>
<h3 id="implicit-task-encoding">Implicit Task Encoding</h3>
<p>The paper provides a quantitative demonstration that frozen BARTSmiles representations encode task-specific information. Using L1-regularized logistic regression on frozen 1,024-dimensional mean-pooled representations, just 7 neurons are sufficient to achieve 0.987 AUC-ROC on ClinTox (within 2 percentage points of full fine-tuning). Even a single neuron achieves 0.77 AUC-ROC on ClinTox subtask 1.</p>
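<p>The probe itself is ordinary sparse logistic regression on frozen features. The toy below uses synthetic 64-d features with 3 informative dimensions and a hand-rolled proximal gradient loop; the paper probes real 1,024-d mean-pooled BARTSmiles representations with L1-regularized logistic regression:</p>

```python
import numpy as np

# Toy sparse-probe: L1-regularized logistic regression via proximal
# gradient descent on synthetic "frozen features". Only 3 of 64
# dimensions carry signal, so the L1 penalty should zero out most
# weights while keeping accuracy high.
rng = np.random.default_rng(1)
n, d = 400, 64
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [3.0, -2.0, 1.5]              # only 3 informative dims
y = (X @ true_w + 0.1 * rng.normal(size=n) > 0).astype(float)

w, lam, lr = np.zeros(d), 0.2, 0.1
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w = w - lr * (X.T @ (p - y) / n)       # logistic gradient step
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # L1 soft-threshold

active = int(np.count_nonzero(w))
accuracy = float(np.mean(((X @ w) > 0) == (y > 0.5)))
print(f"{active} active neurons, train accuracy {accuracy:.2f}")
```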
<h2 id="experimental-setup-moleculenet-toxicology-and-generative-benchmarks">Experimental Setup: MoleculeNet, Toxicology, and Generative Benchmarks</h2>
<h3 id="classification-tasks">Classification Tasks</h3>
<p>BARTSmiles is evaluated on 7 classification datasets from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (SIDER, ClinTox, Tox21, ToxCast, HIV, BACE, BBBP) plus 2 toxicology datasets (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames</a>, <a href="https://en.wikipedia.org/wiki/Micronucleus_test">Micronucleus Assay</a>). All classification tasks use AUC-ROC. Baselines include both supervised graph models (D-MPNN, Attentive FP, 3D InfoMax) and self-supervised methods (ChemBERTa, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer-XL</a>, GROVER-large, MolCLR, iMolCLR).</p>
<p>Selected classification results (AUC-ROC):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td><strong>0.997</strong></td>
          <td>0.954</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td><strong>0.825</strong></td>
          <td>0.805</td>
          <td>Attentive FP</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td><strong>0.705</strong></td>
          <td>0.699</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>0.851</td>
          <td>0.858</td>
          <td>Attentive FP</td>
      </tr>
  </tbody>
</table>
<p>The authors note that three scaffold-split datasets (HIV, BACE, BBBP) are highly sensitive to the specific split used, and they suspect some baseline results use different or random splits. These results are marked with caveats in the paper.</p>
<h3 id="regression-tasks">Regression Tasks</h3>
<p>All three MoleculeNet regression tasks (ESOL, FreeSolv, Lipophilicity) are evaluated using RMSE:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td><strong>0.095</strong></td>
          <td>0.279</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td><strong>0.114</strong></td>
          <td>0.231</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td><strong>0.292</strong></td>
          <td>0.529</td>
          <td>MoLFormer-XL</td>
      </tr>
  </tbody>
</table>
<p>BARTSmiles achieves substantial improvements on all three regression tasks.</p>
<h3 id="generative-tasks">Generative Tasks</h3>
<p><strong><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></strong> (USPTO-50k): BARTSmiles achieves 55.6% Top-1 accuracy using a sample-128 + perplexity re-ranking strategy, compared to 55.3% for Dual-TF and 54.3% for ChemFormer. Top-5 and Top-10 results are 74.2% and 80.9% respectively.</p>
<p><strong>Chemical Reaction Prediction</strong> (USPTO MIT/LEF/STEREO): BARTSmiles with beam search outperforms the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> baseline across all six evaluation settings. On USPTO-MIT (split), BARTSmiles achieves 91.8% vs. 90.4% for the Transformer baseline.</p>
<h3 id="fine-tuning-recipe">Fine-Tuning Recipe</h3>
<p>The fine-tuning approach is designed to minimize hyperparameter tuning:</p>
<ul>
<li>Batch size 16, 10 epochs, polynomial decay learning rate schedule with warmup at 16% of training</li>
<li>Grid search over dropout (0.1, 0.2, 0.3) and learning rate ($5 \times 10^{-6}$, $1 \times 10^{-5}$, $3 \times 10^{-5}$)</li>
<li>Stochastic Weight Averaging (SWA) over three sets of four checkpoints</li>
<li>For generative tasks: R3F regularization (Aghajanyan et al., 2020a) and full fp32 precision</li>
<li>For generation: beam search (beam size 10) or sample 128 sequences with perplexity re-ranking</li>
</ul>
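<p>The SWA step reduces to element-wise averaging of checkpoint weights. A minimal sketch with toy 2-parameter &ldquo;checkpoints&rdquo; (real SWA averages full model state dicts):</p>

```python
import numpy as np

# Stochastic Weight Averaging in miniature: average the weights of
# several late-training checkpoints element-wise. Toy values only.
checkpoints = [np.array([0.9, 1.1]), np.array([1.1, 0.9]),
               np.array([1.0, 1.0]), np.array([1.0, 1.0])]
swa_weights = sum(checkpoints) / len(checkpoints)
print(swa_weights)
```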
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Scale matters for molecular pre-training</strong>: Training on 1.7B molecules with 20,480 A100 GPU-hours yields 5 absolute points of AUC-ROC improvement over training on 100M molecules.</li>
<li><strong>Domain-specific ablation is necessary</strong>: The optimal BART masking configuration for molecules (20% budget, no token randomization) differs from the standard NLP configuration (30% budget, with randomization).</li>
<li><strong>Frozen representations capture task structure</strong>: A small number of neurons from the frozen model can nearly match full fine-tuning performance on certain tasks, suggesting the pre-training objective implicitly encodes molecular properties.</li>
<li><strong>Interpretability aligns with domain knowledge</strong>: Integrated Gradients attribution on fine-tuned BARTSmiles highlights known structural alerts (e.g., <a href="https://en.wikipedia.org/wiki/Nitro_compound">nitro groups</a> in mutagenic compounds, hydroxyl groups in soluble compounds).</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Scaffold split sensitivity</strong>: Results on HIV, BACE, and BBBP are sensitive to the specific scaffold split, making direct comparison with baselines difficult.</li>
<li><strong>Pre-training data distribution</strong>: The <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance">Fréchet distance</a> analysis shows that some downstream datasets (BBBP, SIDER) are far from ZINC20 in representation space, which may explain weaker performance on those tasks.</li>
<li><strong>Fingerprints carry complementary information</strong>: On the Ames and Micronucleus Assay datasets, BARTSmiles alone does not beat fingerprint-based baselines. Combining BARTSmiles with ECFP4 fingerprints closes the gap, implying that SMILES-based pre-training does not fully capture all structural information.</li>
<li><strong>Compute requirements</strong>: Pre-training requires 1,024 A100 GPUs, which limits accessibility.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest investigating the impact of pre-training data composition, noting that ZINC20 contains over a billion molecules but its distribution may be irrelevant for many downstream tasks. They also propose further collaboration between ML and chemistry experts to discover new molecular substructure-property relationships.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pre-training, fine-tuning, and evaluation scripts with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20 (deduplicated)</td>
          <td>~1.7B molecules</td>
          <td>Canonicalized SMILES, 10K validation holdout</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (7 datasets)</td>
          <td>1,427-41,127 compounds</td>
          <td>AUC-ROC metric</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (3 datasets)</td>
          <td>642-4,200 compounds</td>
          <td>RMSE metric</td>
      </tr>
      <tr>
          <td>Toxicology</td>
          <td>Ames, MN Assay</td>
          <td>6,512 / 641 compounds</td>
          <td>Cross-validation for Ames; external test for MN</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50k</td>
          <td>Standard split</td>
          <td>Top-K accuracy</td>
      </tr>
      <tr>
          <td>Reaction prediction</td>
          <td>USPTO (MIT/LEF/STEREO)</td>
          <td>Standard splits</td>
          <td>Top-1 accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BART-Large (pre-layer norm Transformer encoder-decoder)</li>
<li>Tokenizer: SentencePiece unigram, vocabulary size 1,021, max sequence length 128</li>
<li>Pre-training objective: BART denoising (mask token budget 0.20, Poisson span masking with $\lambda$ = 2.5, no token randomization)</li>
<li>Fine-tuning: polynomial decay LR, SWA, grid search over dropout and LR</li>
<li>Generative fine-tuning: R3F regularization, fp32 precision, Adam initialized from pre-training moving averages</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BART-Large architecture (exact parameter count not specified in paper)</li>
<li>Pre-trained checkpoint released on GitHub</li>
<li>Maximum sequence length: 128 tokens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>BARTSmiles</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td>AUC-ROC</td>
          <td>0.997</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>AUC-ROC</td>
          <td>0.825</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.095</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>0.114</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>RMSE</td>
          <td>0.292</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>USPTO-50k Retro (Top-1)</td>
          <td>Accuracy</td>
          <td>55.6%</td>
          <td>New SOTA (sample + re-rank)</td>
      </tr>
      <tr>
          <td>USPTO-MIT Rxn (Split)</td>
          <td>Accuracy</td>
          <td>91.8%</td>
          <td>New SOTA (beam-10)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 1,024 NVIDIA A100 GPUs for 20 hours (20,480 A100 GPU-hours)</li>
<li>Ablation runs: 128 A100 GPUs per run</li>
<li>Framework: FairSeq with FairScale (fully sharded data parallel), automatic mixed precision</li>
<li>Experiment tracking: Aim</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chilingaryan, G., Tamoyan, H., Tevosyan, A., Babayan, N., Khondkaryan, L., Hambardzumyan, K., Navoyan, Z., Khachatrian, H., &amp; Aghajanyan, A. (2024). BARTSmiles: Generative Masked Language Models for Molecular Representations. <em>Journal of Chemical Information and Modeling</em>, 64(15), 5832-5843. <a href="https://doi.org/10.1021/acs.jcim.4c00512">https://doi.org/10.1021/acs.jcim.4c00512</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2024 (preprint: arXiv 2022)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles GitHub Repository (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chilingaryan2024bartsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BARTSmiles: Generative Masked Language Models for Molecular Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chilingaryan, Gayane and Tamoyan, Hovhannes and Tevosyan, Ani and Babayan, Nelly and Khondkaryan, Lusine and Hambardzumyan, Karen and Navoyan, Zaven and Khachatrian, Hrant and Aghajanyan, Armen}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5832--5843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00512}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGen: Molecular Generation with Chemical Feedback</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/</guid><description>MolGen pre-trains on SELFIES molecules and uses chemical feedback to align generated molecules with real-world chemical preferences across domains.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-method-for-molecular-generation">A SELFIES-Based Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces MolGen, a pre-trained molecular language model for generating molecules with desired chemical properties. The primary contribution is a three-part framework: (1) pre-training on 100M+ molecular SELFIES to learn structural and grammatical knowledge, (2) domain-agnostic molecular prefix tuning for cross-domain knowledge transfer, and (3) a chemical feedback paradigm that aligns the model&rsquo;s generative probabilities with real-world chemical preferences. MolGen is the first language model pre-trained on SELFIES rather than SMILES, which guarantees 100% syntactic validity of generated molecules.</p>
<h2 id="challenges-in-language-model-based-molecule-generation">Challenges in Language Model-Based Molecule Generation</h2>
<p>Generating novel molecules with desirable properties is a central task in drug discovery and chemical design. The molecular space is estimated at $10^{33}$ possible structures, making exhaustive search impractical. Prior deep generative approaches face several limitations:</p>
<ol>
<li><strong>Syntactic invalidity</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based language models frequently generate strings that do not correspond to valid molecular graphs. A single random mutation of a SMILES string has only a 9.9% chance of remaining valid.</li>
<li><strong>Narrow domain focus</strong>: Most existing models focus exclusively on synthetic molecules and neglect <a href="https://en.wikipedia.org/wiki/Natural_product">natural products</a>, which have distinct structural complexity and scaffold diversity.</li>
<li><strong>Molecular hallucinations</strong>: Generated molecules may satisfy chemical structural rules yet fail to exhibit anticipated chemical activity in practical applications. The authors formally define this as molecules that &ldquo;comply with chemical structural rules, yet fail to exhibit practical utility or the anticipated properties.&rdquo;</li>
<li><strong>Limited optimization signals</strong>: Existing approaches rely on reinforcement learning (high variance), fixed-dimensional latent spaces, or expert-provided generation rules, all of which impede efficient exploration of chemical space.</li>
</ol>
<h2 id="core-innovation-pre-training-with-selfies-and-chemical-feedback">Core Innovation: Pre-training with SELFIES and Chemical Feedback</h2>
<p>MolGen&rsquo;s novelty rests on three interconnected components.</p>
<h3 id="selfies-based-pre-training">SELFIES-Based Pre-training</h3>
<p>MolGen uses <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (Self-Referencing Embedded Strings) instead of SMILES. SELFIES guarantees that every possible combination of symbols in the alphabet corresponds to a chemically valid molecular graph. The model uses a compact vocabulary of 185 tokens.</p>
<p>The first pre-training stage uses a BART-style encoder-decoder. Tokens from a SELFIES string $S = \{s_1, \ldots, s_l\}$ are randomly replaced with [MASK], then the corrupted input is encoded bidirectionally and decoded left-to-right. The reconstruction loss is:</p>
<p>$$
\mathcal{L}_{\text{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\text{true}}(s \mid S, S_{&lt; j}) \log p_{\theta}(s \mid S, S_{&lt; j}; \theta)
$$</p>
<p>where $S_{&lt; j}$ denotes the partial sequence ${s_0, \ldots, s_{j-1}}$ and $p_{\text{true}}$ is the one-hot distribution under standard maximum likelihood estimation.</p>
<h3 id="domain-agnostic-molecular-prefix-tuning">Domain-Agnostic Molecular Prefix Tuning</h3>
<p>The second pre-training stage introduces shared prefix vectors $P_k, P_v \in \mathbb{R}^{m \times d}$ prepended to the keys and values of multi-head attention at each layer. Unlike conventional prefix tuning that freezes model parameters, MolGen updates the entire model. The attention output becomes:</p>
<p>$$
\text{head} = \text{Attn}\left(xW_q, [P_k, XW_k], [P_v, XW_v]\right)
$$</p>
<p>This decomposes into a linear interpolation between prefix attention and standard attention:</p>
<p>$$
\text{head} = \lambda(x) \cdot \text{Attn}(xW_q, P_k, P_v) + (1 - \lambda(x)) \cdot \text{Attn}(xW_q, XW_k, XW_v)
$$</p>
<p>where $\lambda(x)$ is a scalar representing the sum of normalized attention weights on the prefixes. The prefixes are trained simultaneously across synthetic and natural product domains, acting as a domain instructor.</p>
<h3 id="chemical-feedback-paradigm">Chemical Feedback Paradigm</h3>
<p>To address molecular hallucinations, MolGen aligns the model&rsquo;s probabilistic rankings with chemical preference rankings. Given a molecule $S$ and a set of candidate outputs $\mathcal{S}^*$ with distinct property scores $\text{Ps}(\cdot)$, the target distribution should rank higher-scoring candidates above lower-scoring ones:</p>
<p>$$
p_{\text{true}}(S_i \mid S) &gt; p_{\text{true}}(S_j \mid S), \quad \forall S_i, S_j \in \mathcal{S}^*, \text{Ps}(S_i) &gt; \text{Ps}(S_j)
$$</p>
<p>This is enforced via a rank loss:</p>
<p>$$
\mathcal{L}_{\text{rank}}(S) = \sum_{i} \sum_{j &gt; i} \max\left(0, f(S_j) - f(S_i) + \gamma_{ij}\right)
$$</p>
<p>where $\gamma_{ij} = (j - i) \cdot \gamma$ is a margin scaled by rank difference and $f(S) = \sum_{t=1}^{l} \log p_{\theta}(s_t \mid S, S_{&lt; t}; \theta)$ is the estimated log-probability. The overall training objective combines cross-entropy and rank loss:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{\text{ce}} + \alpha \mathcal{L}_{\text{rank}}
$$</p>
<p>Label smoothing is applied to the target distribution in $\mathcal{L}_{\text{ce}}$, allocating probability mass $\beta$ to non-target tokens to maintain generative diversity.</p>
<h2 id="experiments-across-distribution-learning-and-property-optimization">Experiments Across Distribution Learning and Property Optimization</h2>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>Stage 1 pre-training</strong>: 100M+ unlabeled molecules from ZINC-15 (molecular weight $\leq$ 500 Da, LogP $\leq$ 5)</li>
<li><strong>Stage 2 pre-training</strong>: 2.22M molecules spanning synthetic (ZINC, MOSES) and natural product (NPASS, 30,926 compounds) domains</li>
<li><strong>Downstream evaluation</strong>: MOSES synthetic dataset, ZINC250K, and natural product molecules</li>
</ul>
<h3 id="molecular-distribution-learning">Molecular Distribution Learning</h3>
<p>MolGen generates 10,000 synthetic and 80,000 natural product molecules, evaluated on seven metrics (Validity, Fragment similarity, Scaffold similarity, SNN, Internal Diversity, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, and Novelty). Baselines include AAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>, CharRNN, VAE, JT-VAE, LIMO, and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Validity</th>
          <th>Frag</th>
          <th>Scaf</th>
          <th>SNN</th>
          <th>IntDiv</th>
          <th>FCD</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemformer</td>
          <td>.9843</td>
          <td>.9889</td>
          <td>.9248</td>
          <td>.5622</td>
          <td>.8553</td>
          <td>.0061</td>
          <td>.9581</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>1.000</td>
          <td>.9999</td>
          <td>.9999</td>
          <td>.9996</td>
          <td>.8567</td>
          <td>.0015</td>
          <td>1.000</td>
      </tr>
  </tbody>
</table>
<p>On synthetic molecules, MolGen achieves 100% validity, near-perfect fragment and scaffold similarity, and the lowest FCD (0.0015). For natural products, MolGen achieves FCD of 0.6519 compared to Chemformer&rsquo;s 0.8346.</p>
<h3 id="targeted-molecule-discovery">Targeted Molecule Discovery</h3>
<p>For penalized logP maximization (top-3 scores):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1st</th>
          <th>2nd</th>
          <th>3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MARS (no length limit)</td>
          <td>44.99</td>
          <td>44.32</td>
          <td>43.81</td>
      </tr>
      <tr>
          <td>MolGen (no length limit)</td>
          <td>80.30</td>
          <td>74.70</td>
          <td>69.85</td>
      </tr>
      <tr>
          <td>MolGen (length-limited)</td>
          <td>30.51</td>
          <td>28.98</td>
          <td>28.95</td>
      </tr>
  </tbody>
</table>
<p>For QED maximization, MolGen achieves the maximum score of 0.948 across the top-3.</p>
<h3 id="molecular-docking">Molecular Docking</h3>
<p>MolGen optimizes binding affinity for two protein targets (<a href="https://en.wikipedia.org/wiki/Estrogen_receptor_alpha">ESR1</a> and ACAA1), measured by <a href="https://en.wikipedia.org/wiki/Dissociation_constant">dissociation constant</a> $K_D$ (lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESR1 1st</th>
          <th>ESR1 2nd</th>
          <th>ESR1 3rd</th>
          <th>ACAA1 1st</th>
          <th>ACAA1 2nd</th>
          <th>ACAA1 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>0.89</td>
          <td>1.4</td>
          <td>37</td>
          <td>37</td>
          <td>41</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>0.13</td>
          <td>0.35</td>
          <td>0.47</td>
          <td>3.36</td>
          <td>3.98</td>
          <td>8.50</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the lowest dissociation constants across both targets. Optimization of the 1,000 worst-affinity molecules yields 96.7% relative improvement for ESR1 and 70.4% for ACAA1.</p>
<h3 id="constrained-molecular-optimization">Constrained Molecular Optimization</h3>
<p>Optimizing 800 molecules from ZINC250K with lowest p-logP scores under Tanimoto similarity constraints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$\delta = 0.6$</th>
          <th>$\delta = 0.4$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a></td>
          <td>3.78 (3.29)</td>
          <td>11.55 (11.27)</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>12.08 (0.82)</td>
          <td>12.35 (1.21)</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the highest mean improvement with the lowest standard deviation under both constraints.</p>
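The Tanimoto similarity constraint $\delta$ used in this experiment can be sketched on binary fingerprints. In practice the fingerprints would be Morgan fingerprints from RDKit; here they are represented as plain Python sets of &ldquo;on&rdquo; bit indices, which is an illustrative simplification.

```python
# Tanimoto (Jaccard) similarity on binary fingerprints, modeled as
# sets of "on" bit indices. Real pipelines would compute Morgan
# fingerprints with RDKit; that dependency is assumed, not shown.

def tanimoto(fp_a, fp_b):
    """|A ∩ B| / |A ∪ B| for two binary fingerprints."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def satisfies_constraint(fp_orig, fp_new, delta):
    """Accept an optimized molecule only if it stays within the
    similarity ball sim(m, m') >= delta around the original."""
    return tanimoto(fp_orig, fp_new) >= delta

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(tanimoto(a, b))                    # 2 shared bits of 6 total
print(satisfies_constraint(a, b, 0.4))   # fails the delta = 0.4 ball
```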
<h3 id="ablation-studies">Ablation Studies</h3>
<ul>
<li><strong>Chemical feedback</strong>: Without it, the model generates molecules with property scores similar to initial molecules. With it ($\alpha = 3$), property scores increase progressively across generation rounds.</li>
<li><strong>Prefix tuning</strong>: Removing prefix tuning reduces constrained optimization improvement by 0.45 at $\delta = 0.6$ and 2.12 at $\delta = 0.4$.</li>
<li><strong>Label smoothing</strong>: Enhances diversity of generated molecules as measured by Internal Diversity.</li>
<li><strong>Substructure attention</strong>: MolGen focuses attention on chemically meaningful functional groups (fluoro, phenyl, hydroxyl), while SMILES-based PLMs scatter attention across syntactic tokens. The Substructure Attention Level (SAL) metric confirms MolGen&rsquo;s superior focus.</li>
</ul>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>SELFIES pre-training guarantees 100% molecular validity, eliminating the need for external valency checks.</li>
<li>Domain-agnostic prefix tuning enables effective knowledge transfer between synthetic and natural product domains.</li>
<li>The chemical feedback paradigm aligns model outputs with chemical preferences without requiring external annotated data or reference databases.</li>
<li>MolGen achieves the best or competitive results across all evaluated tasks: distribution learning, targeted molecule discovery, constrained optimization, and molecular docking.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational cost</strong>: Training and fine-tuning on large datasets is computationally intensive.</li>
<li><strong>Model interpretability</strong>: The transformer architecture makes it difficult to understand explicit rationale behind decisions.</li>
<li><strong>Single-target optimization only</strong>: The chemical feedback paradigm handles single-target optimization; multiple conflicting objectives could create ambiguous optimization trajectories.</li>
<li><strong>Task specificity</strong>: MolGen is designed for 2D molecular generation; 3D conformation information is not incorporated.</li>
<li><strong>Reaction prediction</strong>: When applied to reaction prediction (an off-target task), MolGen achieves only 71.4% accuracy on 39,990 reaction samples.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest applying MolGen to retrosynthesis and reaction prediction, exploring multimodal pre-training, and incorporating additional knowledge sources.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 pre-training</td>
          <td>ZINC-15</td>
          <td>100M+ molecules</td>
          <td>MW $\leq$ 500 Da, LogP $\leq$ 5</td>
      </tr>
      <tr>
          <td>Stage 2 pre-training</td>
          <td>ZINC + <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> + NPASS</td>
          <td>2.22M molecules</td>
          <td>Synthetic and natural product domains</td>
      </tr>
      <tr>
          <td>Distribution learning (synthetic)</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></td>
          <td>~1.9M molecules</td>
          <td>Standard benchmark split</td>
      </tr>
      <tr>
          <td>Distribution learning (natural)</td>
          <td>NPASS</td>
          <td>30,926 compounds</td>
          <td>30,126 train / 800 test</td>
      </tr>
      <tr>
          <td>Constrained optimization</td>
          <td>ZINC250K</td>
          <td>800 molecules</td>
          <td>Lowest p-logP scores</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: BART-based encoder-decoder with SELFIES vocabulary (185 tokens)</li>
<li><strong>Prefix length</strong>: 5 tunable vectors per layer</li>
<li><strong>Optimizer</strong>: LAMB (pre-training), AdamW (fine-tuning)</li>
<li><strong>Pre-training</strong>: 600M steps with linear warm-up (180,000 steps) followed by linear decay</li>
<li><strong>Rank loss weight</strong> ($\alpha$): Recommended values of 3 or 5</li>
<li><strong>Candidate generation</strong>: 30 candidates per molecule (synthetic), 8 candidates (natural products)</li>
</ul>
<h3 id="models">Models</h3>
<p>MolGen is publicly available on Hugging Face. The model uses a vocabulary of 185 SELFIES tokens and is comparable in size to Chemformer-large.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>MolGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a> (lower is better)</td>
          <td>Synthetic</td>
          <td>0.0015</td>
          <td>0.0061 (<a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>)</td>
          <td>Distribution learning</td>
      </tr>
      <tr>
          <td>p-logP top-1 (no limit)</td>
          <td>Synthetic</td>
          <td>80.30</td>
          <td>44.99 (MARS)</td>
          <td>Targeted discovery</td>
      </tr>
      <tr>
          <td>QED top-1</td>
          <td>Synthetic</td>
          <td>0.948</td>
          <td>0.948 (several)</td>
          <td>Tied at maximum</td>
      </tr>
      <tr>
          <td>ESR1 $K_D$ top-1</td>
          <td>Docking</td>
          <td>0.13</td>
          <td>0.72 (LIMO)</td>
          <td>Binding affinity</td>
      </tr>
      <tr>
          <td>p-logP improvement ($\delta=0.4$)</td>
          <td>Synthetic</td>
          <td>12.35 (1.21)</td>
          <td>11.55 (11.27) (RetMol)</td>
          <td>Constrained optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>6 NVIDIA V100 GPUs</li>
<li>Pre-training batch size: 256 molecules per GPU</li>
<li>Fine-tuning batch size: 6 (synthetic and natural product)</li>
<li>Training: 100 epochs for fine-tuning tasks</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zjunlp/MolGen">zjunlp/MolGen</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/zjunlp">zjunlp/MolGen-large</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained weights on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, Y., Zhang, N., Chen, Z., Guo, L., Fan, X., &amp; Chen, H. (2024). Domain-Agnostic Molecular Generation with Chemical Feedback. <em>Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)</em>.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zjunlp/MolGen">GitHub: zjunlp/MolGen</a></li>
<li><a href="https://huggingface.co/zjunlp">Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2024domain,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Domain-Agnostic Molecular Generation with Chemical Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Guo, Lingbing and Fan, Xiaohui and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=9rPyHyjfwP}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Transformer: Calibrated Reaction Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/</guid><description>A Transformer seq2seq model for chemical reaction prediction achieving 90.4% top-1 accuracy on USPTO_MIT with calibrated uncertainty estimation.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Method</strong> paper. It adapts the Transformer architecture to chemical reaction prediction, treating it as a machine translation problem from reactant <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> to product SMILES. The key contributions are (1) demonstrating that a fully attention-based model outperforms all prior template-based, graph-based, and RNN-based methods, (2) showing the model works without separating reactants from reagents, and (3) introducing calibrated uncertainty estimation for ranking synthesis pathways.</p>
<h2 id="motivation-limitations-of-existing-reaction-prediction">Motivation: Limitations of Existing Reaction Prediction</h2>
<p>Prior approaches to reaction prediction fell into two broad groups, template-based and template-free, each with fundamental limitations:</p>
<ul>
<li><strong>Template-based methods</strong> rely on libraries of reaction rules, either handcrafted or automatically extracted from atom-mapped data. Automatic template extraction itself depends on atom mapping, which depends on templates, creating a circular dependency.</li>
<li><strong>Graph-based template-free methods</strong> (e.g., WLDN, ELECTRO) avoid explicit templates but still require atom-mapped training data and cannot handle stereochemistry.</li>
<li><strong><a href="/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/">RNN-based seq2seq models</a></strong> (also template-free) treat reactions as SMILES translation but impose a positional inductive bias: tokens far apart in the SMILES string are assumed to be less related. This is incorrect because SMILES position has no relationship to 3D spatial distance.</li>
</ul>
<h2 id="core-innovation-transformer-for-reaction-prediction">Core Innovation: Transformer for Reaction Prediction</h2>
<p>The Molecular Transformer adapts the Transformer architecture to chemical reactions by treating SMILES strings of reactants and reagents as source sequences and product SMILES as target sequences.</p>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder Transformer with 4 layers, 256-dimensional hidden states, 8 attention heads, and 12M parameters (reduced from the original 65M NMT model).</li>
<li><strong>Tokenization</strong>: Atom-wise regex tokenization of SMILES strings, applied uniformly to both reactants and reagents (no special reagent tokens).</li>
<li><strong>Data augmentation</strong>: Training data is doubled by generating <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">random (non-canonical) SMILES</a> for each reaction, which improves top-1 accuracy by roughly 1%.</li>
<li><strong>Weight averaging</strong>: Final model weights are averaged over the last 20 checkpoints, providing a further accuracy boost without the inference cost of ensembling.</li>
<li><strong>Mixed input</strong>: Unlike all prior work that separates reactants from reagents (which implicitly assumes knowledge of the product), the Molecular Transformer operates on mixed inputs where no distinction is made.</li>
</ul>
<p>The multihead attention mechanism is the key architectural advantage over RNNs. It allows the model to attend to any pair of tokens regardless of their position in the SMILES string, correctly capturing long-range chemical relationships that RNNs miss.</p>
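The atom-wise regex tokenization mentioned above can be sketched with the pattern published by Schwaller et al. (2018); the `tokenize` helper and the lossless-reconstruction check are illustrative additions.

```python
import re

# Atom-wise SMILES tokenizer using the regex published by
# Schwaller et al. (2018). Multi-character atoms (Cl, Br, bracket
# atoms like [NH4+]) become single tokens.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|"
    r"\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_REGEX.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(tokenize("C[Br]Cl"))                 # bracket atom kept whole
```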
<h2 id="uncertainty-estimation">Uncertainty Estimation</h2>
<p>A central contribution is calibrated uncertainty scoring. The product of predicted token probabilities serves as a confidence score for each prediction. This score achieves 0.89 AUC-ROC for classifying whether a prediction is correct.</p>
<p>An important finding: <strong>label smoothing hurts uncertainty calibration</strong>. While label smoothing (as used in the original Transformer) marginally improves top-1 accuracy (87.44% vs 87.28%), it destroys the model&rsquo;s ability to distinguish correct from incorrect predictions. Setting the label smoothing parameter to 0.0 preserves calibration.</p>
<p>The confidence score shows no correlation with SMILES length (Pearson $r = 0.06$), confirming it is not biased against predictions of larger molecules.</p>
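The confidence score is just the product of the per-token probabilities of the top-1 prediction; a minimal sketch (with made-up token probabilities, not model outputs) accumulates it in log space for numerical stability:

```python
import math

# Confidence score of a predicted product SMILES: the product of its
# per-token probabilities, computed in log space to avoid underflow
# on long sequences. The probability values below are illustrative.

def confidence(token_probs):
    return math.exp(sum(math.log(p) for p in token_probs))

certain = [0.99, 0.98, 0.99, 0.97]   # sharply peaked predictions
uncertain = [0.6, 0.5, 0.7, 0.55]    # diffuse predictions

print(confidence(certain))
print(confidence(uncertain))
```

Thresholding this score is what yields the 0.89 AUC-ROC reported above for separating correct from incorrect predictions.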
<h2 id="experimental-results">Experimental Results</h2>
<h3 id="forward-synthesis-prediction">Forward Synthesis Prediction</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Setting</th>
          <th style="text-align: left">Top-1 (%)</th>
          <th style="text-align: left">Top-2 (%)</th>
          <th style="text-align: left">Top-5 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">90.4</td>
          <td style="text-align: left">93.7</td>
          <td style="text-align: left">95.3</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">88.6</td>
          <td style="text-align: left">92.4</td>
          <td style="text-align: left">94.2</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">78.1</td>
          <td style="text-align: left">84.0</td>
          <td style="text-align: left">87.1</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">76.2</td>
          <td style="text-align: left">82.4</td>
          <td style="text-align: left">85.8</td>
      </tr>
  </tbody>
</table>
<p>The mixed-input model (88.6%) outperforms all prior methods that used separated inputs (best previous: WLDN5 at 85.6%).</p>
<h3 id="comparison-with-quantum-chemistry">Comparison with Quantum Chemistry</h3>
<p>On <a href="https://en.wikipedia.org/wiki/Regioselectivity">regioselectivity</a> of <a href="https://en.wikipedia.org/wiki/Electrophilic_aromatic_substitution">electrophilic aromatic substitution</a> in heteroaromatics, the Molecular Transformer achieves 83% top-1 accuracy vs 81% for RegioSQM (a quantum-chemistry-based predictor), at a fraction of the computational cost.</p>
<h3 id="comparison-with-human-chemists">Comparison with Human Chemists</h3>
<p>On 80 reactions sampled across rarity bins, the Molecular Transformer achieves 87.5% top-1 accuracy vs 76.5% for the best human chemist and 72.5% for the best graph-based model (WLDN5).</p>
<h3 id="chemically-constrained-beam-search">Chemically Constrained Beam Search</h3>
<p>Constraining beam search to only predict atoms present in the reactants (preventing &ldquo;alchemy&rdquo;) produces no change in accuracy, confirming the model has learned conservation of atoms from data alone.</p>
<h2 id="trade-offs-and-limitations">Trade-offs and Limitations</h2>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/Stereochemistry">Stereochemistry</a></strong>: Accuracy drops significantly on USPTO_STEREO (76-78% vs 88-90% on USPTO_MIT), indicating stereochemical prediction remains challenging.</li>
<li><strong>Resolution reactions</strong>: Accuracy drops to 28.6% on resolution reactions, where reagent information is often missing from patent data.</li>
<li><strong>Unclassified reactions</strong>: Accuracy on &ldquo;unrecognized&rdquo; reaction classes is 46.3%, likely reflecting noisy or mistranscribed data.</li>
<li><strong>No atom mapping</strong>: The model provides no explicit atom mapping between reactants and products, which limits interpretability for understanding reaction mechanisms.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Primary benchmark</strong></td>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">479K</td>
          <td style="text-align: left">Filtered by Jin et al., no stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LEF subset</strong></td>
          <td style="text-align: left">USPTO_LEF</td>
          <td style="text-align: left">350K</td>
          <td style="text-align: left">Subset of MIT with linear electron flow only</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stereo benchmark</strong></td>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">1.0M</td>
          <td style="text-align: left">Patent reactions through Sept 2016, includes stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Time-split test</strong></td>
          <td style="text-align: left">Pistachio_2017</td>
          <td style="text-align: left">15.4K</td>
          <td style="text-align: left">Non-public, reactions from 2017</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: SMILES canonicalized with RDKit. Regex tokenization from Schwaller et al. (2018). Two input modes: &ldquo;separated&rdquo; (reactants &gt; reagents) and &ldquo;mixed&rdquo; (all molecules concatenated).</p>
<h3 id="model">Model</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">4</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model dimension</strong></td>
          <td style="text-align: left">256</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention heads</strong></td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~12M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Label smoothing</strong></td>
          <td style="text-align: left">0.0</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimizer</strong></td>
          <td style="text-align: left">Adam</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Warm-up steps</strong></td>
          <td style="text-align: left">8000</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Batch size</strong></td>
          <td style="text-align: left">~4096 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Beam width</strong></td>
          <td style="text-align: left">5</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (sep)</td>
          <td style="text-align: left"><strong>90.4%</strong></td>
          <td style="text-align: left">85.6% (WLDN5)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (mixed)</td>
          <td style="text-align: left"><strong>88.6%</strong></td>
          <td style="text-align: left">80.3% (S2S RNN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>AUC-ROC</strong></td>
          <td style="text-align: left">Uncertainty calibration</td>
          <td style="text-align: left"><strong>0.89</strong></td>
          <td style="text-align: left">N/A</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Regioselectivity</td>
          <td style="text-align: left"><strong>83%</strong></td>
          <td style="text-align: left">81% (RegioSQM)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Human comparison</td>
          <td style="text-align: left"><strong>87.5%</strong></td>
          <td style="text-align: left">76.5% (best human)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Single Nvidia P100 GPU, 48h for best single model</li>
<li>Inference: 20 min for 40K reactions on single P100</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C. A., Bekas, C., &amp; Lee, A. A. (2019). Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. <em>ACS Central Science</em>, 5(9), 1572-1583. <a href="https://doi.org/10.1021/acscentsci.9b00576">https://doi.org/10.1021/acscentsci.9b00576</a></p>
<p><strong>Publication</strong>: ACS Central Science 2019</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schwallerMolecularTransformerModel2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Schwaller, Philippe and Laino, Teodoro and Gaudin, Th{\&#39;e}ophile and Bolgar, Peter and Hunter, Christopher A. and Bekas, Costas and Lee, Alpha A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2019</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1572--1583}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acscentsci.9b00576}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Umeyama's Method: Corrected SVD for Point Alignment</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/umeyama-similarity-transformation/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/umeyama-similarity-transformation/</guid><description>Umeyama (1991) fixes the SVD-based point set alignment method to always produce proper rotations, jointly solving for rotation, translation, and scale.</description><content:encoded><![CDATA[<h2 id="fixing-the-reflection-problem-in-svd-based-alignment">Fixing the Reflection Problem in SVD-Based Alignment</h2>
<p>This <strong>Method</strong> paper addresses a specific failure mode in prior SVD-based solutions to the point set registration problem. Both <a href="/notes/biology/computational-biology/arun-svd-point-fitting/">Arun et al. (1987)</a> and <a href="/notes/biology/computational-biology/horn-orthonormal-matrices/">Horn, Hilden, and Negahdaripour (1988)</a> presented SVD-based methods for finding the optimal rotation between two point patterns. (Note: this is a different paper from <a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn&rsquo;s 1987 quaternion method</a>, which does not suffer from this issue.) These SVD-based methods can produce a reflection ($\det(R) = -1$) instead of a proper rotation when the data is severely corrupted. Umeyama provides a corrected formulation that always yields a proper rotation matrix.</p>
<h2 id="the-similarity-transformation-problem">The Similarity Transformation Problem</h2>
<p>Given two point sets ${\mathbf{x}_i}$ and ${\mathbf{y}_i}$ ($i = 1, \ldots, n$) in $m$-dimensional space, find the similarity transformation parameters (rotation $R$, translation $\mathbf{t}$, and scale $c$) minimizing the mean squared error:</p>
<p>$$
e^2(R, \mathbf{t}, c) = \frac{1}{n} \sum_{i=1}^{n} \lVert \mathbf{y}_i - (cR\mathbf{x}_i + \mathbf{t}) \rVert^2
$$</p>
<p>This generalizes the <a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch problem</a> (rotation only) and the <a href="/notes/biology/computational-biology/horn-absolute-orientation/">absolute orientation problem</a> (rotation + translation + scale) to arbitrary dimensions $m$.</p>
<h2 id="the-core-lemma-corrected-svd-rotation">The Core Lemma: Corrected SVD Rotation</h2>
<p>The key contribution is a lemma for finding the rotation $R$ minimizing $\lVert A - RB \rVert^2$. Given the SVD of $AB^T = UDV^T$ (with $d_1 \geq d_2 \geq \cdots \geq d_m \geq 0$), define the correction matrix:</p>
<p>$$
S = \begin{cases} I &amp; \text{if } \det(AB^T) \geq 0 \\ \operatorname{diag}(1, 1, \ldots, 1, -1) &amp; \text{if } \det(AB^T) &lt; 0 \end{cases}
$$</p>
<p>The minimum value is:</p>
<p>$$
\min_{R} \lVert A - RB \rVert^2 = \lVert A \rVert^2 + \lVert B \rVert^2 - 2\operatorname{tr}(DS)
$$</p>
<p>When $\operatorname{rank}(AB^T) \geq m - 1$, the optimal rotation is uniquely determined as:</p>
<p>$$
R = USV^T
$$</p>
<p>The critical insight is that when $\det(AB^T) = 0$ (i.e., $\operatorname{rank}(AB^T) = m - 1$), the matrix $S$ must instead be chosen based on $\det(U)\det(V)$:</p>
<p>$$
S = \begin{cases} I &amp; \text{if } \det(U)\det(V) = 1 \\ \operatorname{diag}(1, 1, \ldots, 1, -1) &amp; \text{if } \det(U)\det(V) = -1 \end{cases}
$$</p>
<p>This handles the degenerate case where the sign of $\det(AB^T)$ is unreliable.</p>
<h2 id="complete-similarity-transformation-solution">Complete Similarity Transformation Solution</h2>
<p>Umeyama derives the full solution using centered coordinates and the covariance matrix $\Sigma_{xy} = \frac{1}{n} \sum_i (\mathbf{y}_i - \boldsymbol{\mu}_y)(\mathbf{x}_i - \boldsymbol{\mu}_x)^T$.</p>
<p>Given the SVD $\Sigma_{xy} = UDV^T$:</p>
<p><strong>Rotation</strong>:</p>
<p>$$
R = USV^T
$$</p>
<p><strong>Scale</strong>:</p>
<p>$$
c = \frac{1}{\sigma_x^2} \operatorname{tr}(DS)
$$</p>
<p><strong>Translation</strong>:</p>
<p>$$
\mathbf{t} = \boldsymbol{\mu}_y - cR\boldsymbol{\mu}_x
$$</p>
<p><strong>Minimum error</strong>:</p>
<p>$$
\varepsilon^2 = \sigma_y^2 - \frac{\operatorname{tr}(DS)^2}{\sigma_x^2}
$$</p>
<p>where $\sigma_x^2$ and $\sigma_y^2$ are the variances of the respective point sets around their centroids.</p>
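<p>The closed-form recipe above is short enough to implement directly. Below is a minimal NumPy sketch (function name, argument layout, and comments are mine, not the paper&rsquo;s); it uses the $\det(U)\det(V)$ sign test so that both the generic and the rank-deficient case are handled by the same branch:</p>

```python
import numpy as np

def umeyama(x, y):
    """Closed-form similarity transform (R, t, c) minimizing
    mean ||y_i - (c R x_i + t)||^2, per the formulas above.

    x, y: (n, m) arrays of corresponding points.
    """
    n, m = x.shape
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mu_x, y - mu_y
    sigma_x2 = (xc ** 2).sum() / n          # variance of x about its centroid
    cov = yc.T @ xc / n                     # Sigma_xy, an (m, m) matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(m)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0                    # flip the last singular direction
    R = U @ S @ Vt                          # always a proper rotation
    c = (D * np.diag(S)).sum() / sigma_x2   # tr(DS) / sigma_x^2
    t = mu_y - c * R @ mu_x
    return R, t, c
```

<p>Because the sign test reads $\det(U)\det(V)$ rather than $\det(\Sigma_{xy})$, it remains well-defined even when $\Sigma_{xy}$ is singular, matching the lemma&rsquo;s degenerate-case prescription.</p>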
<h2 id="why-prior-methods-fail">Why Prior Methods Fail</h2>
<p>The methods of Arun et al. and Horn et al. use $R = UV^T$ directly from the SVD. This works when $\det(UV^T) = 1$ (proper rotation). When $\det(UV^T) = -1$, these methods either produce a reflection or apply an ad hoc correction (flipping the sign of the last column of $U$). Umeyama shows that the correct fix depends on $\det(\Sigma_{xy})$:</p>
<ul>
<li>If $\det(\Sigma_{xy}) \geq 0$: set $S = I$, so $R = UV^T$</li>
<li>If $\det(\Sigma_{xy}) &lt; 0$: set $S = \operatorname{diag}(1, \ldots, 1, -1)$, flipping the last singular value&rsquo;s contribution</li>
</ul>
<p>This distinction matters because corrupted data can make $\det(UV^T) = -1$ even when the true transformation is a proper rotation. Simply flipping a column of $U$ does not always yield the correct least-squares solution.</p>
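<p>The failure is easy to reproduce. In the toy 2D example below (data chosen for illustration, not taken from the paper), the point set is matched against its mirror image: the naive $R = UV^T$ returns a reflection, while the corrected $R = USV^T$ is a proper rotation:</p>

```python
import numpy as np

# A 2D pattern and its mirror image across the vertical axis
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y = np.array([[0.0, 0.0], [-1.0, 0.0], [0.0, 2.0]])

xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
cov = yc.T @ xc / len(x)                 # Sigma_xy; det(cov) < 0 here
U, D, Vt = np.linalg.svd(cov)

R_naive = U @ Vt                         # uncorrected SVD rotation: a reflection
S = np.eye(2)
if np.linalg.det(U) * np.linalg.det(Vt) < 0:
    S[-1, -1] = -1.0
R = U @ S @ Vt                           # Umeyama's correction: det(R) = +1
```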
<h2 id="generality">Generality</h2>
<p>The formulation works for any dimension $m$, covering both 2D and 3D registration problems. The proof uses Lagrange multipliers with explicit enforcement of both orthogonality ($R^T R = I$) and the proper rotation constraint ($\det(R) = 1$), which prior methods enforced only partially.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Umeyama, S. (1991). Least-squares estimation of transformation parameters between two point patterns. <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, 13(4), 376-380. <a href="https://doi.org/10.1109/34.88573">https://doi.org/10.1109/34.88573</a></p>
<p><strong>Publication</strong>: IEEE TPAMI, 1991</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with implementations including the Kabsch-Umeyama scaling extension)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{umeyama1991least,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Least-squares estimation of transformation parameters between two point patterns}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Umeyama, Shinji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{IEEE Transactions on Pattern Analysis and Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{376--380}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1991}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1109/34.88573}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SELFormer: A SELFIES-Based Molecular Language Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/selformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/selformer/</guid><description>A SELFIES-based RoBERTa model pretrained on 2M ChEMBL molecules for molecular property prediction on MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-chemical-language-model">A SELFIES-Based Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$) with a secondary <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>SELFormer applies the RoBERTa transformer architecture to <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">SELFIES</a> molecular string representations instead of the <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> notation used by prior chemical language models. The model is pretrained via masked language modeling (MLM) on 2M drug-like compounds from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> and fine-tuned for molecular property prediction tasks on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. The authors release pretrained models, fine-tuning code, and datasets as open-source resources.</p>
<h2 id="why-selfies-over-smiles-for-pretraining">Why SELFIES Over SMILES for Pretraining?</h2>
<p>Existing chemical language models, including <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a>, all use SMILES as their input representation. SMILES has well-documented validity and robustness issues: arbitrary perturbations to a SMILES string frequently produce syntactically invalid outputs. This means a pretrained model must spend capacity learning SMILES grammar rules rather than chemical semantics.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">SELFIES</a> addresses this by construction: every possible SELFIES string decodes to a valid molecule. Despite this theoretical advantage and SELFIES&rsquo; growing adoption in generative chemistry, no prior work had systematically evaluated SELFIES as input for large-scale transformer pretraining. SELFormer fills this gap by providing a direct comparison between SELFIES-based and SMILES-based chemical language models on standard benchmarks.</p>
<h2 id="masked-language-modeling-on-guaranteed-valid-molecular-strings">Masked Language Modeling on Guaranteed-Valid Molecular Strings</h2>
<p>SELFormer uses byte-level Byte-Pair Encoding (BPE) to tokenize SELFIES strings, then pretrains a RoBERTa encoder using the standard MLM objective. 15% of input tokens are masked, and the model minimizes the cross-entropy loss over the masked positions:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $\mathcal{M}$ is the set of masked token indices, $x_i$ is the true token at position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted input context, and $\theta$ are the model parameters.</p>
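<p>As a concrete sketch of this objective (pure NumPy; the function name and array layout are mine), the loss is just an average negative log-likelihood taken over masked positions only:</p>

```python
import numpy as np

def mlm_loss(logits, targets, mask):
    """Masked-LM cross-entropy, matching the formula above.

    logits:  (T, V) unnormalized scores per sequence position
    targets: (T,)   true token ids
    mask:    (T,)   boolean, True at masked positions
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of the true tokens, masked positions only
    return -log_p[mask, targets[mask]].mean()
```

<p>Unmasked positions contribute nothing to the loss, which is why only predictions at masked tokens drive learning.</p>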
<p>The key insight is that because SELFIES guarantees 100% validity, every masked token prediction corresponds to a valid molecular fragment. The model never wastes capacity predicting invalid chemistry. For fine-tuning, a two-layer classification or regression head is added on top of the encoder&rsquo;s output embedding.</p>
<p>Two model sizes were trained. Notably, the larger SELFormer uses fewer attention heads (4) but more hidden layers (12) than SELFormer-Lite (12 heads, 8 layers). This counterintuitive configuration emerged from the authors&rsquo; hyperparameter search over ~100 models, where deeper architectures with fewer heads outperformed wider, shallower ones:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>SELFormer-Lite</th>
          <th>SELFormer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Attention Heads</td>
          <td>12</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Hidden Layers</td>
          <td>8</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Batch Size</td>
          <td>16</td>
          <td>16</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>5e-5</td>
          <td>5e-5</td>
      </tr>
      <tr>
          <td>Weight Decay</td>
          <td>0.01</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>Pretraining Epochs</td>
          <td>100</td>
          <td>100</td>
      </tr>
      <tr>
          <td>Parameters</td>
          <td>58.3M</td>
          <td>86.7M</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarking-against-smiles-transformers-and-graph-models">Benchmarking Against SMILES Transformers and Graph Models</h2>
<p>SELFormer was pretrained on 2.08M drug-like compounds from ChEMBL v30 (converted from SMILES to SELFIES), then fine-tuned on nine MoleculeNet tasks. All evaluations use scaffold splitting via the Chemprop library.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BACE</th>
          <th>BBBP</th>
          <th>HIV</th>
          <th>Tox21</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td>0.832</td>
          <td><strong>0.902</strong></td>
          <td>0.681</td>
          <td>0.653</td>
          <td><strong>0.745</strong></td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>0.799</td>
          <td>0.728</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolBERT</td>
          <td><strong>0.866</strong></td>
          <td>0.762</td>
          <td><strong>0.783</strong></td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.809</td>
          <td>0.710</td>
          <td>0.771</td>
          <td>0.759</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td><strong>0.890</strong></td>
          <td>0.736</td>
          <td><strong>0.806</strong></td>
          <td><strong>0.787</strong></td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.856</td>
          <td>0.724</td>
          <td><strong>0.806</strong></td>
          <td>0.781</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.855</td>
          <td><strong>0.908</strong></td>
          <td>-</td>
          <td><strong>0.848</strong></td>
          <td>0.649</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE, scaffold split, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>PDBbind</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td><strong>0.682</strong></td>
          <td>2.797</td>
          <td>0.735</td>
          <td>1.488</td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>-</td>
          <td>-</td>
          <td>0.986</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>1.050</td>
          <td><strong>2.082</strong></td>
          <td><strong>0.683</strong></td>
          <td><strong>1.397</strong></td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.798</td>
          <td><strong>1.877</strong></td>
          <td>0.660</td>
          <td>-</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.803</td>
          <td>2.121</td>
          <td><strong>0.600</strong></td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The ablation study compared SELFormer vs. SELFormer-Lite across pretrained-only, 25-epoch, and 50-epoch fine-tuning configurations on randomly split datasets. SELFormer consistently outperformed SELFormer-Lite, confirming the benefit of the deeper (12-layer) architecture.</p>
<h2 id="strong-classification-performance-with-compact-pretraining">Strong Classification Performance with Compact Pretraining</h2>
<p>SELFormer&rsquo;s strongest results come on tasks where molecular substructure matters, most of them classification:</p>
<ul>
<li><strong>SIDER</strong>: Best overall ROC-AUC (0.745), outperforming the next best method (MolCLR at 0.652) by 9.3 percentage points. The authors attribute this to SELFIES&rsquo; ability to capture subtle structural differences relevant to drug side effects.</li>
<li><strong>BBBP</strong>: Second best (0.902), behind only KPGT (0.908). SELFormer scored 17.4 percentage points above ChemBERTa-2 (0.728) on this task.</li>
<li><strong>BACE/HIV vs. ChemBERTa-2</strong>: SELFormer outperformed ChemBERTa-2 by 3.3 points on BACE (0.832 vs 0.799), 17.4 on BBBP, and 5.9 on HIV (0.681 vs 0.622). Since both models use similar RoBERTa architectures, this comparison is suggestive of a SELFIES advantage, though differences in pretraining corpus (ChEMBL vs PubChem), corpus size, and training procedure confound a clean attribution to the input representation alone.</li>
<li><strong>ESOL regression</strong>: Best RMSE (0.682) vs GEM (0.798), a 14.5% relative improvement.</li>
</ul>
<p>Limitations are also apparent:</p>
<ul>
<li><strong>HIV and Tox21</strong>: SELFormer underperforms graph-based methods (MolCLR, GEM, KPGT) on these larger datasets. The authors attribute this to insufficient hyperparameter search given computational constraints.</li>
<li><strong>FreeSolv and Lipophilicity regression</strong>: D-MPNN and graph-based methods maintain an edge, suggesting that explicit 2D/3D structural inductive biases remain valuable for certain property types.</li>
<li><strong>Small pretraining corpus</strong>: At 2M molecules, SELFormer&rsquo;s corpus is orders of magnitude smaller than MolFormer&rsquo;s 1.1B. Despite this, SELFormer outperforms MolFormer on SIDER (0.745 vs 0.690), highlighting SELFIES&rsquo; representational advantage.</li>
<li><strong>Single-task ablation scope</strong>: Some architectural claims rest on limited task coverage, and broader benchmarking would strengthen the conclusions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v30</td>
          <td>2,084,725 compounds (2,084,472 after SELFIES conversion)</td>
          <td>Drug-like bioactive small molecules</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>1,513</td>
          <td><a href="https://en.wikipedia.org/wiki/Beta-secretase_1">Beta-secretase 1</a> inhibitor binding</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td><a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">Blood-brain barrier</a> permeability</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>HIV replication inhibition</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>SIDER</td>
          <td>1,427</td>
          <td>Drug side effects (27 classes)</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Tox21</td>
          <td>7,831</td>
          <td>Toxicity (12 targets)</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>Octanol/water distribution coefficient</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PDBbind</td>
          <td>11,908</td>
          <td>Binding affinity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (MLM), 15% token masking</li>
<li><strong>Tokenization</strong>: Byte-level Byte-Pair Encoding (BPE) on SELFIES strings</li>
<li><strong>SMILES to SELFIES conversion</strong>: SELFIES API with Pandaral.lel for parallelization</li>
<li><strong>Splitting</strong>: Scaffold splitting via Chemprop library (80/10/10 train/validation/test)</li>
<li><strong>Fine-tuning</strong>: Two-layer classification/regression head on encoder output; up to 200 epochs with hyperparameter search</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa (HuggingFace Transformers)</li>
<li><strong>SELFormer</strong>: 12 hidden layers, 4 attention heads, 86.7M parameters</li>
<li><strong>SELFormer-Lite</strong>: 8 hidden layers, 12 attention heads, 58.3M parameters</li>
<li><strong>Hyperparameter search</strong>: Sequential search over ~100 configurations on 100K molecule subset</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Area under receiver operating characteristic curve</td>
      </tr>
      <tr>
          <td>PRC-AUC</td>
          <td>Classification</td>
          <td>Area under precision-recall curve (reported for random splits)</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<p>Results reported on scaffold split and random split datasets.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 2x NVIDIA A5000 GPUs</li>
<li><strong>Hyperparameter optimization time</strong>: ~11 days</li>
<li><strong>Full pretraining</strong>: 100 epochs on 2.08M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HUBioDataLab/SELFormer">SELFormer GitHub</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/HUBioDataLab/SELFormer">SELFormer on HuggingFace</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretrained SELFormer weights</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v30</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source pretraining data</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Benchmark</td>
          <td>Unknown</td>
          <td>Downstream evaluation tasks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yüksel, A., Ulusoy, E., Ünlü, A., &amp; Doğan, T. (2023). SELFormer: Molecular Representation Learning via SELFIES Language Models. <em>Machine Learning: Science and Technology</em>, 4(2), 025035. <a href="https://doi.org/10.1088/2632-2153/acdb30">https://doi.org/10.1088/2632-2153/acdb30</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/HUBioDataLab/SELFormer">GitHub Repository (SELFormer)</a></li>
<li><a href="https://huggingface.co/HUBioDataLab/SELFormer">HuggingFace Model Hub (SELFormer)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yuksel2023selformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SELFormer}: Molecular Representation Learning via {SELFIES} Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Y{\&#34;u}ksel, Atakan and Ulusoy, Erva and {\&#34;U}nl{\&#34;u}, Atabey and Do{\u{g}}an, Tunca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{025035}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/acdb30}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoLFormer: Large-Scale Chemical Language Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molformer/</guid><description>A linear-attention transformer pretrained on 1.1B SMILES from PubChem and ZINC for molecular property prediction across MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-billion-scale-chemical-language-model">A Billion-Scale Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>MoLFormer is a transformer encoder pretrained via masked language modeling on 1.1 billion <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a>. The key architectural choices are linear attention (for $O(N)$ complexity instead of $O(N^2)$) and rotary positional embeddings (RoPE). The resulting model, MoLFormer-XL, produces molecular embeddings that outperform or match GNN baselines across a wide range of <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification and regression tasks, including quantum-chemical property prediction from SMILES alone.</p>
<h2 id="bridging-the-gap-between-molecular-languages-and-graph-neural-networks">Bridging the Gap Between Molecular Languages and Graph Neural Networks</h2>
<p>Prior chemical language models like <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> were pretrained on relatively small datasets (10M-77M molecules) and generally underperformed GNNs on molecular property prediction. The core question: does a transformer trained on a sufficiently large SMILES corpus learn enough chemical structure to compete with graph-based methods that have explicit topological inductive biases?</p>
<p>Two specific challenges motivated this work:</p>
<ul>
<li><strong>Scale</strong>: The chemical space spans $10^{60}$ to $10^{100}$ plausible molecules, yet labeled property data is scarce. Self-supervised pretraining on the ~1.1B unlabeled molecules available in public databases could provide a general-purpose representation.</li>
<li><strong>Efficiency</strong>: Standard transformer attention is $O(N^2)$ in sequence length, making billion-scale pretraining impractical without architectural modifications.</li>
</ul>
<h2 id="linear-attention-with-rotary-positional-embeddings">Linear Attention with Rotary Positional Embeddings</h2>
<p>MoLFormer&rsquo;s two key architectural choices are its attention mechanism and positional encoding scheme.</p>
<p><strong>Standard attention</strong> computes:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle) v_n}{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle)}
$$</p>
<p>MoLFormer replaces this with <strong>linear attention</strong> using a generalized feature map $\varphi$, combined with <strong>rotary positional embeddings</strong> $R_m$ applied before the feature map:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>This differs from the original RoFormer formulation, which applies the rotation after the feature map. The authors found that rotating the raw queries and keys before projection led to faster convergence and lower validation loss. The combination of linear attention and adaptive sequence-length bucketing reduces GPU requirements from ~1000 to 16 for training on the full 1.1B corpus.</p>
<p>The model uses masked language modeling (15% token masking, following BERT conventions) with a vocabulary of 2,362 SMILES tokens. Sequence length is capped at 202 tokens, covering 99.4% of all molecules.</p>
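<p>The rotate-then-featurize attention above can be sketched compactly in NumPy. Here $\varphi(x) = \mathrm{elu}(x) + 1$ is used as an illustrative positive feature map (a common choice for linear attention; the paper&rsquo;s exact $\varphi$ may differ), and the RoPE variant shown rotates half-split dimension pairs:</p>

```python
import numpy as np

def rope(x):
    """Rotary positional embedding: rotate paired feature dims by an
    angle proportional to the position index (half-split variant)."""
    T, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = np.outer(np.arange(T), freqs)          # (T, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def linear_attention_rope(Q, K, V):
    """Linear attention with RoPE applied *before* the feature map,
    as in the formula above. O(N d^2) instead of O(N^2 d)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1 > 0
    Qf, Kf = phi(rope(Q)), phi(rope(K))
    KV = Kf.T @ V                                # accumulate sum_n phi(k_n) v_n^T
    Ksum = Kf.sum(axis=0)                        # accumulate sum_n phi(k_n)
    return (Qf @ KV) / (Qf @ Ksum)[:, None]      # normalized readout per query
```

<p>Because $\varphi$ is strictly positive, each output row is a convex combination of the rows of $V$, mirroring the normalization in the softmax case while keeping the cost linear in sequence length.</p>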
<h2 id="broad-moleculenet-benchmarking-with-scaling-ablations">Broad MoleculeNet Benchmarking with Scaling Ablations</h2>
<p>MoLFormer-XL was evaluated on 11 MoleculeNet tasks against supervised GNNs, self-supervised GNNs, and prior language models.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split; values reported as percentages in the original paper, converted to proportions here for consistency):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.847</strong></td>
          <td><strong>0.948</strong></td>
          <td>0.822</td>
          <td>0.882</td>
          <td><strong>0.690</strong></td>
      </tr>
      <tr>
          <td>N-Gram</td>
          <td>0.912</td>
          <td>0.769</td>
          <td>0.855</td>
          <td>0.830</td>
          <td>0.876</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td>0.736</td>
          <td>0.798</td>
          <td>0.932</td>
          <td>0.806</td>
          <td><strong>0.890</strong></td>
          <td>0.680</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.724</td>
          <td>0.781</td>
          <td>0.901</td>
          <td>0.806</td>
          <td>0.856</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>Hu et al.</td>
          <td>0.708</td>
          <td>0.787</td>
          <td>0.789</td>
          <td>0.802</td>
          <td>0.859</td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GeomGCL</td>
          <td>-</td>
          <td>0.850</td>
          <td>0.919</td>
          <td>-</td>
          <td>-</td>
          <td>0.648</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>0.643</td>
          <td>-</td>
          <td>0.906</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE for ESOL/FreeSolv/Lipophilicity, avg MAE for QM9/QM8):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>1.5894</strong></td>
          <td><strong>0.0102</strong></td>
          <td><strong>0.2787</strong></td>
          <td><strong>0.2308</strong></td>
          <td><strong>0.5289</strong></td>
      </tr>
      <tr>
          <td>A-FP</td>
          <td>2.6355</td>
          <td>0.0282</td>
          <td>0.5030</td>
          <td>0.736</td>
          <td>0.578</td>
      </tr>
      <tr>
          <td>MPNN</td>
          <td>3.1898</td>
          <td>0.0143</td>
          <td>0.58</td>
          <td>1.150</td>
          <td>0.7190</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>4.3536</td>
          <td>0.0148</td>
          <td>0.970</td>
          <td>1.40</td>
          <td>0.655</td>
      </tr>
  </tbody>
</table>
<p>MoLFormer-XL also outperforms geometry-aware GNNs (DimeNet, GeomGCL, GEM) on ESOL (0.279 vs 0.575), FreeSolv (0.231 vs 0.866), and Lipophilicity (0.529 vs 0.541).</p>
<p><strong>Key ablation findings</strong>:</p>
<ul>
<li><strong>Data scale matters</strong>: Performance improves monotonically from 10% subsets through the full 1.1B corpus. Training on 100% ZINC alone performed worst, likely due to its smaller vocabulary and less diverse molecule lengths.</li>
<li><strong>Model depth matters</strong>: MoLFormer-Base (6 layers) underperforms MoLFormer-XL (12 layers) on most tasks.</li>
<li><strong>Fine-tuning &raquo; frozen</strong>: Fine-tuning the full encoder consistently outperforms using frozen embeddings with a downstream classifier.</li>
<li><strong>Rotary &gt; absolute at scale</strong>: Rotary embeddings underperform absolute embeddings on smaller pretraining sets but overtake them once the corpus exceeds 1B molecules.</li>
</ul>
<h2 id="smiles-transformers-learn-molecular-geometry">SMILES Transformers Learn Molecular Geometry</h2>
<p>The most striking finding is that MoLFormer&rsquo;s attention patterns correlate with 3D interatomic distances, despite training only on 1D SMILES strings.</p>
<p>Using <a href="/notes/chemistry/datasets/qm9/">QM9</a> molecules with known 3D geometries, the authors computed cosine similarity between attention maps and spatial distance matrices across three distance categories:</p>
<table>
  <thead>
      <tr>
          <th>Distance Category</th>
          <th>Range</th>
          <th>Linear Attention (Rotary)</th>
          <th>Full Attention (Rotary)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Short</td>
          <td>$\leq$ 2 Å</td>
          <td>0.594-0.602</td>
          <td>0.598-0.615</td>
      </tr>
      <tr>
          <td>Medium</td>
          <td>2-4 Å</td>
          <td>0.724-0.730</td>
          <td>0.716-0.727</td>
      </tr>
      <tr>
          <td>Long</td>
          <td>4-10 Å</td>
          <td>0.209-0.211</td>
          <td>0.204-0.210</td>
      </tr>
  </tbody>
</table>
<p>The strong correlation in the short and medium categories indicates the model captures covalent bond connectivity and near-neighbor spatial relationships. Linear attention shows marginally higher cosine similarity than full attention on medium-range distances (0.724-0.730 vs 0.716-0.727).</p>
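<p>A minimal sketch of this probing computation, assuming per-molecule arrays of atom-level attention weights and interatomic distances; the paper's exact construction of the compared matrices may differ, so treat this as illustrative only:</p>

```python
import numpy as np

def masked_cosine(attn, dist, lo, hi):
    """Cosine similarity between an attention map and an interatomic
    distance matrix, restricted to atom pairs whose distance lies in
    (lo, hi] Angstroms.

    attn: (n_atoms, n_atoms) attention weights for atom tokens
    dist: (n_atoms, n_atoms) pairwise 3D distances
    """
    mask = (dist > lo) & (dist <= hi)
    a, d = attn[mask], dist[mask]
    if a.size == 0:          # no pairs fall in this distance category
        return np.nan
    return float(a @ d / (np.linalg.norm(a) * np.linalg.norm(d) + 1e-12))
```

Averaging this quantity over molecules and attention heads per distance category yields numbers comparable to the table above.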
<p>MoLFormer-XL embeddings also correlate more strongly with molecular fingerprint similarity (0.64 vs 0.48 for ChemBERTa) and maximum common subgraph size (-0.60 vs -0.44), confirming that the representations encode structural information.</p>
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Quantum-chemical energies</strong>: SchNet and DimeNet (which encode explicit 3D geometry) outperform MoLFormer-XL on QM9 atomization energy tasks, with DimeNet achieving roughly 10x lower MAE on U0_atom (0.008 vs 0.083 eV). 3D information remains important for these properties.</li>
<li><strong>Sequence length cap</strong>: The 202-token limit excludes 0.6% of molecules, potentially limiting applicability to larger structures.</li>
<li><strong>SMILES canonicalization</strong>: The model depends on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> canonical SMILES; sensitivity to non-canonical forms is not evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>PubChem</td>
          <td>111M molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC</td>
          <td>~1B molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining (combined)</td>
          <td>PubChem + ZINC</td>
          <td>~1.1B molecules</td>
          <td>MoLFormer-XL training set</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP, Tox21, ClinTox, HIV, BACE, SIDER</td>
          <td>1,427-41,127</td>
          <td>MoleculeNet scaffold splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QM9, QM8, ESOL, FreeSolv, Lipophilicity</td>
          <td>642-133,885</td>
          <td>MoleculeNet random splits (QM9/QM8), scaffold (others)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li><strong>Tokenization</strong>: SMILES tokenizer from Schwaller et al., vocabulary of 2,362 tokens</li>
<li><strong>Sequence length</strong>: 1-202 tokens (99.4% coverage)</li>
<li><strong>Optimizer</strong>: Fused LAMB (via APEX), chosen for stability with large batch sizes and no need for learning rate warm-up</li>
<li><strong>Adaptive bucketing</strong>: Sequences grouped by length into buckets to minimize padding waste</li>
</ul>
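<p>The 15%/80/10/10 masking scheme above can be sketched in NumPy; the <code>[MASK]</code> id, the <code>-100</code> ignore label, and the function name are illustrative conventions, not the paper's implementation:</p>

```python
import numpy as np

def mlm_mask(tokens, vocab_size, mask_id, rng, select_p=0.15):
    """BERT-style masking as used for SMILES pretraining: select 15% of
    positions; of those, 80% -> [MASK], 10% -> random token, 10% kept
    unchanged. Returns (corrupted tokens, labels with -100 elsewhere)."""
    tokens = np.asarray(tokens)
    corrupted = tokens.copy()
    labels = np.full_like(tokens, -100)

    selected = rng.random(tokens.shape) < select_p
    labels[selected] = tokens[selected]          # loss computed here only

    roll = rng.random(tokens.shape)
    corrupted[selected & (roll < 0.8)] = mask_id # 80%: replace with [MASK]
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[rand_pos] = rng.integers(          # 10%: random vocab token
        0, vocab_size, size=int(rand_pos.sum()))
    return corrupted, labels                     # remaining 10% unchanged
```

For MoLFormer the vocabulary size would be 2,362 and sequences would be 1-202 tokens, per the details above.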
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Transformer encoder with linear attention and rotary positional embeddings</li>
<li><strong>MoLFormer-XL</strong>: 12 layers, 12 attention heads, hidden size 768</li>
<li><strong>MoLFormer-Base</strong>: 6 layers (ablation only)</li>
<li><strong>Feature map size</strong>: 32 (generalized feature map for linear attention)</li>
<li><strong>Frozen head</strong>: Fully connected model with hyperparameter sweep (learning rate, batch size, hidden dim, number of layers)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Scaffold splits per MoleculeNet</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression (ESOL, FreeSolv, Lipophilicity)</td>
          <td>Scaffold splits</td>
      </tr>
      <tr>
          <td>Avg MAE</td>
          <td>Regression (QM9, QM8)</td>
          <td>Random splits per MoleculeNet</td>
      </tr>
  </tbody>
</table>
<p>QM9 results are also reported with 5-fold cross-validation for robustness.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: GPU cluster with nodes containing either 8 NVIDIA Tesla V100 (32GB) or 8 Ampere A100 (40GB) GPUs connected via NVLink and InfiniBand</li>
<li><strong>GPU reduction</strong>: Linear attention + bucketing reduced GPU requirements from ~1000 to 16</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/molformer">IBM/molformer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Pretraining, fine-tuning, and attention visualization</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">MoLFormer-XL (HuggingFace)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pretrained weights (46.8M parameters)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>111M molecules</td>
      </tr>
      <tr>
          <td><a href="https://zinc.docking.org/">ZINC</a></td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>~1B molecules</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Chenthamarakshan, V., Padhi, I., Mroueh, Y., &amp; Das, P. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. <em>Nature Machine Intelligence</em>, 4, 1256-1264. <a href="https://doi.org/10.1038/s42256-022-00580-7">https://doi.org/10.1038/s42256-022-00580-7</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/molformer">GitHub Repository (MoLFormer)</a></li>
<li><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">HuggingFace Models</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2022molformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large-Scale Chemical Language Representations Capture Molecular Structure and Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1256--1264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00580-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Horn et al.: Absolute Orientation Using Orthonormal Matrices</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/horn-orthonormal-matrices/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/horn-orthonormal-matrices/</guid><description>Horn, Hilden, and Negahdaripour (1988) solve absolute orientation using matrix square roots, providing an orthonormal matrix alternative to quaternions.</description><content:encoded><![CDATA[<h2 id="a-matrix-based-companion-to-the-quaternion-method">A Matrix-Based Companion to the Quaternion Method</h2>
<p>This <strong>Method</strong> paper presents a closed-form solution to the absolute orientation problem using $3 \times 3$ orthonormal matrices directly, complementing <a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn&rsquo;s earlier quaternion-based solution</a> (1987). The authors note that while quaternions are more elegant, orthonormal matrices are more widely used in photogrammetry, graphics, and robotics. The solution relies on the polar decomposition of the cross-covariance matrix via its matrix square root.</p>
<p>The paper also compares two approaches: (1) directly finding the best-fit orthonormal matrix (the main result), and (2) finding an unconstrained best-fit linear transformation and then projecting it onto the nearest orthonormal matrix. These give different results, and only the first approach has the desired symmetry property.</p>
<h2 id="the-rotation-via-polar-decomposition">The Rotation via Polar Decomposition</h2>
<p>As in the quaternion paper, the problem reduces to finding the orthonormal matrix $R$ maximizing $\operatorname{Tr}(R^T M)$, where $M = \sum_{i=1}^{n} \mathbf{r}'_{r,i} (\mathbf{r}'_{l,i})^T$ is the cross-covariance matrix of the centered point sets.</p>
<p>The key insight is the polar decomposition: any matrix $M$ can be written as:</p>
<p>$$
M = U S
$$</p>
<p>where $U$ is orthonormal and $S = (M^T M)^{1/2}$ is positive semidefinite. When $M$ is nonsingular:</p>
<p>$$
U = M (M^T M)^{-1/2}
$$</p>
<p>The matrix square root $(M^T M)^{1/2}$ is computed via eigendecomposition. If $M^T M$ has eigenvalues $\lambda_1, \lambda_2, \lambda_3$ and eigenvectors $\hat{\mathbf{u}}_1, \hat{\mathbf{u}}_2, \hat{\mathbf{u}}_3$:</p>
<p>$$
(M^T M)^{1/2} = \sqrt{\lambda_1}\, \hat{\mathbf{u}}_1 \hat{\mathbf{u}}_1^T + \sqrt{\lambda_2}\, \hat{\mathbf{u}}_2 \hat{\mathbf{u}}_2^T + \sqrt{\lambda_3}\, \hat{\mathbf{u}}_3 \hat{\mathbf{u}}_3^T
$$</p>
<p>The sign of $\det(U)$ equals the sign of $\det(M)$, so $U$ is a proper rotation when $\det(M) &gt; 0$ and a reflection when $\det(M) &lt; 0$.</p>
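<p>A minimal NumPy sketch of this construction for the nonsingular case (function name is ours, not from the paper):</p>

```python
import numpy as np

def rotation_from_polar(M):
    """Best-fit orthonormal matrix U from the polar decomposition M = U S,
    following Horn et al. (1988): S = (M^T M)^{1/2} via eigendecomposition.
    Assumes M is nonsingular (non-coplanar point sets)."""
    lam, V = np.linalg.eigh(M.T @ M)        # eigenvalues of M^T M, all > 0
    S_inv_sqrt = (V / np.sqrt(lam)) @ V.T   # (M^T M)^{-1/2}
    return M @ S_inv_sqrt                   # U = M (M^T M)^{-1/2}
```

When $\det(M) &gt; 0$ the returned $U$ is a proper rotation, consistent with the determinant discussion below.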
<h2 id="handling-the-coplanar-case">Handling the Coplanar Case</h2>
<p>When one set of measurements is coplanar, $M$ is singular ($\operatorname{rank}(M) = 2$) and one eigenvalue of $M^T M$ is zero. The matrix square root still exists (positive semidefinite rather than positive definite), but $S$ is no longer invertible.</p>
<p>In this case, only two of the three columns of $U$ are determined directly. The third column (corresponding to the zero eigenvalue) is fixed by the orthonormality constraint, with a sign ambiguity resolved by requiring $\det(U) = +1$ (proper rotation).</p>
<h2 id="the-nearest-orthonormal-matrix-alternative-approach">The Nearest Orthonormal Matrix (Alternative Approach)</h2>
<p>The paper also derives a closed-form solution for finding the orthonormal matrix nearest to an arbitrary matrix $A$ (minimizing $\lVert A - R \rVert^2$). This uses the same polar decomposition machinery: if $A = U_A S_A$, then $U_A$ is the nearest orthonormal matrix.</p>
<p>This approach (find unconstrained best-fit transform, then project to nearest orthonormal matrix) was used by some earlier methods. Horn et al. show it gives a different result from the direct least-squares solution and lacks the symmetry property: the inverse transformation from right-to-left is generally not the exact inverse of the left-to-right solution.</p>
<h2 id="relationship-to-other-methods">Relationship to Other Methods</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Rotation representation</th>
          <th>Core computation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch (1976)</a></td>
          <td>Orthogonal matrix</td>
          <td>Eigendecomposition of $\tilde{R}R$ ($3 \times 3$)</td>
      </tr>
      <tr>
          <td><a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn (1987)</a></td>
          <td>Unit quaternion</td>
          <td>Eigenvector of $N$ ($4 \times 4$)</td>
      </tr>
      <tr>
          <td>Horn et al. (1988)</td>
          <td>Orthonormal matrix</td>
          <td>Square root of $M^T M$ ($3 \times 3$)</td>
      </tr>
      <tr>
          <td><a href="/notes/biology/computational-biology/arun-svd-point-fitting/">Arun et al. (1987)</a></td>
          <td>Orthonormal matrix</td>
          <td>SVD of $H$ ($3 \times 3$)</td>
      </tr>
  </tbody>
</table>
<p>The polar decomposition approach (this paper) and the SVD approach (<a href="/notes/biology/computational-biology/arun-svd-point-fitting/">Arun et al.</a>) are closely related: the SVD $M = U \Lambda V^T$ gives the polar decomposition as $M = (UV^T)(V \Lambda V^T)$ where $UV^T$ is the orthonormal factor and $V \Lambda V^T$ is the positive semidefinite factor. Both methods can produce reflections under noisy data, which <a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">Umeyama (1991)</a> later addressed.</p>
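<p>This equivalence is easy to verify numerically; the snippet below is an illustrative check, not code from either paper:</p>

```python
import numpy as np

# Random 3x3 matrix standing in for the cross-covariance matrix M
rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))

# SVD route: M = U diag(s) V^T  ->  factors (U V^T) and (V diag(s) V^T)
U, s, Vt = np.linalg.svd(M)
ortho = U @ Vt                      # orthonormal factor
psd = Vt.T @ (s[:, None] * Vt)      # V diag(s) V^T, positive semidefinite

# Direct route: (M^T M)^{1/2} via eigendecomposition, as in Horn et al.
lam, W = np.linalg.eigh(M.T @ M)
psd_direct = (W * np.sqrt(lam)) @ W.T

assert np.allclose(ortho @ psd, M)     # the two factors reassemble M
assert np.allclose(psd, psd_direct)    # same PSD factor either way
```

The positive semidefinite factor is unique, so both routes agree regardless of eigenvector ordering or sign conventions.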
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Horn, B. K. P., Hilden, H. M., &amp; Negahdaripour, S. (1988). Closed-form solution of absolute orientation using orthonormal matrices. <em>Journal of the Optical Society of America A</em>, 5(7), 1127-1135. <a href="https://doi.org/10.1364/josaa.5.001127">https://doi.org/10.1364/josaa.5.001127</a></p>
<p><strong>Publication</strong>: Journal of the Optical Society of America A, 1988</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with differentiable implementations)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{horn1988closed,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Closed-form solution of absolute orientation using orthonormal matrices}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Horn, Berthold K. P. and Hilden, Hugh M. and Negahdaripour, Shahriar}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the Optical Society of America A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1127--1135}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1988}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Optica Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1364/josaa.5.001127}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Arun et al.: SVD-Based Least-Squares Fitting of 3D Points</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/arun-svd-point-fitting/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/arun-svd-point-fitting/</guid><description>Arun, Huang, and Blostein (1987) introduce an SVD-based algorithm for least-squares rotation and translation between two 3D point sets.</description><content:encoded><![CDATA[<h2 id="svd-for-3d-point-set-registration">SVD for 3D Point Set Registration</h2>
<p>This <strong>Method</strong> paper presents a concise algorithm for finding the least-squares rotation and translation between two 3D point sets using the singular value decomposition (SVD) of a $3 \times 3$ cross-covariance matrix. The approach is closely related to the earlier <a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch algorithm</a> (1976), which used eigendecomposition, and was developed independently of <a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn&rsquo;s quaternion method</a> (1987). The paper also identifies a reflection degeneracy that <a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">Umeyama</a> later provided a complete fix for.</p>
<h2 id="problem-formulation">Problem Formulation</h2>
<p>Given two 3D point sets $\{p_i\}$ and $\{p'_i\}$ ($i = 1, \ldots, N$) related by:</p>
<p>$$
p'_i = R p_i + T + N_i
$$</p>
<p>where $R$ is a rotation matrix, $T$ is a translation vector, and $N_i$ is noise, find $\hat{R}$ and $\hat{T}$ minimizing:</p>
<p>$$
\Sigma^2 = \sum_{i=1}^{N} \lVert p'_i - (R p_i + T) \rVert^2
$$</p>
<h2 id="decoupling-translation-and-rotation">Decoupling Translation and Rotation</h2>
<p>The translation is eliminated by centering both point sets at their centroids $p$ and $p'$. Defining centered coordinates $q_i = p_i - p$ and $q'_i = p'_i - p'$, the problem reduces to:</p>
<p>$$
\Sigma^2 = \sum_{i=1}^{N} \lVert q'_i - R q_i \rVert^2
$$</p>
<p>Once $\hat{R}$ is found, the translation follows as $\hat{T} = p' - \hat{R} p$.</p>
<h2 id="the-svd-algorithm">The SVD Algorithm</h2>
<p>The algorithm proceeds in five steps:</p>
<ol>
<li>Center both point sets by subtracting centroids</li>
<li>Compute the $3 \times 3$ cross-covariance matrix: $H = \sum_{i=1}^{N} q_i (q'_i)^t$</li>
<li>Compute the SVD: $H = U \Lambda V^t$</li>
<li>Form the candidate rotation: $X = V U^t$</li>
<li>Check $\det(X)$: if $+1$, then $\hat{R} = X$; if $-1$, the result is a reflection</li>
</ol>
<p>The key insight is that minimizing $\Sigma^2$ is equivalent to maximizing $\operatorname{Trace}(RH)$. Using a lemma based on the Cauchy-Schwarz inequality, Arun et al. show that $X = VU^t$ maximizes this trace over all orthonormal matrices.</p>
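<p>The five steps translate directly into NumPy. This sketch (our naming) also includes the sign-flip fix for the coplanar case discussed below:</p>

```python
import numpy as np

def arun_fit(P, P_prime):
    """Least-squares R, T such that P' ~ R P + T (Arun et al., 1987).
    P, P_prime: (N, 3) arrays of corresponding points."""
    p, p_prime = P.mean(axis=0), P_prime.mean(axis=0)
    Q, Q_prime = P - p, P_prime - p_prime       # step 1: center

    H = Q.T @ Q_prime                           # step 2: sum_i q_i (q'_i)^t
    U, _, Vt = np.linalg.svd(H)                 # step 3: H = U Lambda V^t
    X = Vt.T @ U.T                              # step 4: candidate X = V U^t
    if np.linalg.det(X) < 0:                    # step 5: reflection detected
        Vt[-1, :] *= -1                         # flip column of V for the
        X = Vt.T @ U.T                          # smallest singular value

    R = X
    T = p_prime - R @ p
    return R, T
```

Note that the sign flip is only guaranteed to be correct in the coplanar (zero singular value) case; for noisy non-coplanar data with $\det = -1$, no rotation attains the reflection's error, as the paper discusses below.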
<h2 id="the-reflection-problem">The Reflection Problem</h2>
<p>When $\det(VU^t) = -1$, the SVD produces a reflection rather than a proper rotation. Arun et al. analyze three cases:</p>
<p><strong>Noiseless, non-coplanar points</strong>: The SVD always gives a proper rotation ($\det = +1$). No issue arises.</p>
<p><strong>Coplanar points</strong> (including $N = 3$): One singular value of $H$ is zero. Both a rotation and a reflection achieve $\Sigma^2 = 0$. The fix is to flip the sign of the column of $V$ corresponding to the zero singular value:</p>
<p>$$
V' = [v_1, v_2, -v_3], \quad X' = V' U^t
$$</p>
<p><strong>Noisy, non-coplanar points with $\det = -1$</strong>: The paper acknowledges this case cannot be handled by the algorithm. The reflection genuinely minimizes $\Sigma^2$ over all orthonormal matrices, meaning no rotation achieves a lower error. The authors suggest this only occurs with very large noise and recommend RANSAC-like approaches.</p>
<p>This last case is precisely what <a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">Umeyama (1991)</a> later resolved with a corrected formulation using a sign matrix $S$ conditioned on $\det(\Sigma_{xy})$.</p>
<h2 id="computational-comparison">Computational Comparison</h2>
<p>The paper includes VAX 11/780 benchmarks comparing three methods:</p>
<table>
  <thead>
      <tr>
          <th>Points</th>
          <th>SVD (ms)</th>
          <th>Quaternion (ms)</th>
          <th>Iterative (ms)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>3</td>
          <td>54.6</td>
          <td>26.6</td>
          <td>126.8</td>
      </tr>
      <tr>
          <td>11</td>
          <td>37.0</td>
          <td>41.0</td>
          <td>105.2</td>
      </tr>
      <tr>
          <td>30</td>
          <td>44.2</td>
          <td>48.3</td>
          <td>111.0</td>
      </tr>
  </tbody>
</table>
<p>The SVD and quaternion methods have comparable speed, and both are significantly faster than the iterative approach. The SVD method overtakes the quaternion method beyond the smallest point sets; its core decomposition operates on a $3 \times 3$ matrix independent of $N$, versus the quaternion method's $4 \times 4$ eigenproblem.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Arun, K. S., Huang, T. S., &amp; Blostein, S. D. (1987). Least-Squares Fitting of Two 3-D Point Sets. <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, PAMI-9(5), 698-700. <a href="https://doi.org/10.1109/TPAMI.1987.4767965">https://doi.org/10.1109/TPAMI.1987.4767965</a></p>
<p><strong>Publication</strong>: IEEE TPAMI, 1987</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with differentiable implementations)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{arun1987least,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Least-Squares Fitting of Two 3-D Point Sets}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Arun, K. S. and Huang, T. S. and Blostein, S. D.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{IEEE Transactions on Pattern Analysis and Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{PAMI-9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{698--700}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1987}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1109/TPAMI.1987.4767965}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Uni-Parser: Industrial-Grade Multi-Modal PDF Parsing (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/uni-parser-2025/</guid><description>Uni-Parser is a modular, multi-expert PDF parsing engine for scientific documents with integrated OCSR and chemical structure recognition.</description><content:encoded><![CDATA[<h2 id="an-industrial-grade-multi-modal-document-parser">An Industrial-Grade Multi-Modal Document Parser</h2>
<p>Uni-Parser is a modular, loosely coupled PDF parsing engine built for scientific literature and patents. It routes different content types (text, equations, tables, figures, chemical structures) to specialized expert models, then reassembles the parsed outputs into structured formats (JSON, Markdown, HTML) for downstream consumption by LLMs and other applications.</p>
<p>The system processes up to 20 PDF pages per second on 8 NVIDIA RTX 4090D GPUs and supports over 80 languages for OCR.</p>
<h2 id="a-five-stage-pipeline-architecture">A Five-Stage Pipeline Architecture</h2>
<p>The system is organized into five sequential stages:</p>
<ol>
<li><strong>Document Pre-Processing</strong>: Validates PDFs, extracts metadata, checks text accessibility, and identifies language.</li>
<li><strong>Group-based Layout Detection</strong>: Locates semantic blocks and identifies their categories using a novel tree-structured layout representation. Groups naturally paired elements (image-caption, table-title, molecule-identifier).</li>
<li><strong>Semantic Contents Parsing</strong>: Routes each block to a specialized model: OCR for text, formula recognition for equations, table structure recognition, OCSR for chemical structures, reaction extraction, and chart parsing. Over ten sub-models operate in parallel.</li>
<li><strong>Semantic Contents Gathering</strong>: Filters non-essential elements, reconstructs reading order, merges cross-page and multi-column content, and reintegrates inline multimodal elements.</li>
<li><strong>Output Formatting and Semantic Chunking</strong>: Exports parsed documents in task-specific formats with proper chunking for RAG and other downstream tasks.</li>
</ol>
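<p>The stage-3 routing idea can be illustrated with a toy dispatcher. All category names and expert stubs here are hypothetical, since the Uni-Parser sub-models and their interfaces are not publicly released:</p>

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Block:
    category: str   # e.g. "text", "equation", "table", "molecule"
    payload: bytes  # cropped image region of the layout block

# Hypothetical expert registry mirroring the multi-expert routing idea:
# each content type maps to a specialized recognizer.
EXPERTS: dict[str, Callable[[bytes], str]] = {
    "text": lambda img: "<ocr text>",       # OCR stand-in
    "equation": lambda img: "<latex>",      # formula recognition stand-in
    "table": lambda img: "<table html>",    # table structure stand-in
    "molecule": lambda img: "<smiles>",     # OCSR stand-in
}

def parse_blocks(blocks: list[Block]) -> list[str]:
    """Route each layout block to its expert; unknown categories fall
    back to plain OCR."""
    return [EXPERTS.get(b.category, EXPERTS["text"])(b.payload)
            for b in blocks]
```

In the real system more than ten such sub-models run in parallel, and their outputs are reassembled in stages 4 and 5.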
<h2 id="group-based-layout-detection">Group-Based Layout Detection</h2>
<p>A key contribution is the group-based layout detection model (Uni-Parser-LD), which uses a hierarchical tree structure to represent page layouts. Elements are organized into a bottom layer (parent nodes like paragraphs, tables, images) and a top layer (child nodes like captions, footnotes, identifiers). This preserves semantic associations between paired elements, such as molecules and their identifiers.</p>
<p>The model is trained on 500k pages, including 220k human-annotated pages from scientific journals and patents across 85 languages. A modified DETR-based architecture was selected as the backbone after finding that RT-DETRv2, YOLOv12, and D-FINE exhibited training instability for this task.</p>
<h2 id="chemical-structure-recognition-with-molparser-15">Chemical Structure Recognition with MolParser 1.5</h2>
<p>Uni-Parser integrates MolParser 1.5 for OCSR, an end-to-end model that directly generates molecular representations from images. The authors explicitly note that graph-based (atom-bond) methods were the first direction they explored but ultimately abandoned because of:</p>
<ul>
<li>Strong reliance on rigid, hand-crafted rules that limit scalability</li>
<li>Substantially higher annotation costs (over 20x compared to end-to-end approaches)</li>
<li>Lower performance ceilings despite increasing training data</li>
</ul>
<h3 id="molecule-localization">Molecule Localization</h3>
<p>Uni-Parser-LD achieves strong molecule detection performance:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>mAP@50</th>
          <th>mAP@50-95</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (Uni-Parser Bench)</td>
          <td><strong>0.994</strong></td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.983</td>
          <td>0.919</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.974</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td><strong>Uni-Parser-LD</strong> (BioVista Bench)</td>
          <td><strong>0.981</strong></td>
          <td><strong>0.844</strong></td>
      </tr>
      <tr>
          <td>MolDet-Doc-L</td>
          <td>0.961</td>
          <td>0.871</td>
      </tr>
      <tr>
          <td>MolDet-General-L</td>
          <td>0.945</td>
          <td>0.815</td>
      </tr>
      <tr>
          <td>BioMiner</td>
          <td>0.929</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.899</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h3 id="ocsr-accuracy">OCSR Accuracy</h3>
<p>MolParser 1.5 consistently outperforms prior methods across molecule types:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Full</th>
          <th>Chiral</th>
          <th>Markush</th>
          <th>All</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser 1.5</strong> (Uni-Parser Bench)</td>
          <td><strong>0.979</strong></td>
          <td><strong>0.809</strong></td>
          <td><strong>0.805</strong></td>
          <td><strong>0.886</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.953</td>
          <td>0.676</td>
          <td>0.664</td>
          <td>0.800</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.617</td>
          <td>0.274</td>
          <td>0.168</td>
          <td>0.417</td>
      </tr>
      <tr>
          <td><strong>MolParser 1.5</strong> (BioVista Bench)</td>
          <td><strong>0.795</strong></td>
          <td><strong>0.604</strong></td>
          <td><strong>0.761</strong></td>
          <td><strong>0.780</strong></td>
      </tr>
      <tr>
          <td>MolParser 1.0</td>
          <td>0.669</td>
          <td>0.352</td>
          <td>0.733</td>
          <td>0.703</td>
      </tr>
      <tr>
          <td>MolMiner</td>
          <td>0.774</td>
          <td>0.497</td>
          <td>0.185</td>
          <td>0.507</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>0.703</td>
          <td>0.481</td>
          <td>0.156</td>
          <td>0.455</td>
      </tr>
      <tr>
          <td>MolNexTR</td>
          <td>0.695</td>
          <td>0.419</td>
          <td>0.045</td>
          <td>0.401</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>0.545</td>
          <td>0.326</td>
          <td>0.000</td>
          <td>0.298</td>
      </tr>
  </tbody>
</table>
<p>Chiral molecule recognition remains a significant challenge and is identified as a key area for future work.</p>
<h2 id="document-parsing-benchmarks">Document Parsing Benchmarks</h2>
<p>On the Uni-Parser Benchmark (150 PDFs, 2,887 pages from patents and scientific articles), Uni-Parser (HQ mode) achieves an overall score of 89.74 (excluding molecules), outperforming both pipeline tools (MinerU, PP-StructureV3) and specialized VLMs (MinerU2-VLM, DeepSeek-OCR, PaddleOCR-VL). Competing systems score zero on molecule localization and OCSR because they lack molecular recognition capabilities.</p>
<p>On the general-document OmniDocBench-1.5, a variant (Uni-Parser-G) using a swapped layout module achieves 89.75 overall, competitive with top-performing specialized VLMs.</p>
<h2 id="comparison-with-ocsr-enabled-pdf-parsers">Comparison with OCSR-Enabled PDF Parsers</h2>
<p>On a controlled test set of 141 simple molecules, Uni-Parser outperforms other PDF parsing systems with OCSR support:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Recall</th>
          <th>OCSR Success</th>
          <th>OCSR Acc</th>
          <th>Id Match</th>
          <th>Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Uni-Parser</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>96.5%</strong></td>
          <td><strong>100%</strong></td>
          <td><strong>1.8s</strong></td>
      </tr>
      <tr>
          <td>MathPix</td>
          <td>100%</td>
          <td>75.9%</td>
          <td>59.6%</td>
          <td>-</td>
          <td>66.1s</td>
      </tr>
      <tr>
          <td>MinerU.Chem</td>
          <td>66.7%</td>
          <td>63.1%</td>
          <td>22.7%</td>
          <td>-</td>
          <td>~7 min</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/UniParser">HuggingFace Models</a></td>
          <td>Model/Dataset</td>
          <td>Unknown</td>
          <td>MolDet models and MolParser-7M dataset available</td>
      </tr>
      <tr>
          <td><a href="https://uni-parser.github.io">Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Project website with documentation</td>
      </tr>
  </tbody>
</table>
<p>The Uni-Parser system is deployed on a cluster of 240 NVIDIA L40 GPUs (48 GB each) with 22 CPU cores and 90 GB of host memory per GPU. The reference throughput benchmark (20 pages/second) uses 8 NVIDIA RTX 4090D GPUs. The HuggingFace organization hosts MolDet detection models and several datasets (MolParser-7M, RxnBench, OmniScience), but the full Uni-Parser system code and end-to-end inference pipeline do not appear to be publicly released. MolParser 1.5 model weights are not publicly available as of this writing.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li>Chiral molecule recognition remains a challenge for end-to-end OCSR models</li>
<li>Chemical reaction understanding in real-world literature has substantial room for improvement</li>
<li>Layout models are primarily tailored to scientific and patent documents, with plans to expand to newspapers, slides, books, and financial statements</li>
<li>Chart parsing falls short of industrial-level requirements across the diversity of chart types in scientific literature</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Tao, H., Yang, S., Huang, C., Zhong, S., Lu, H., Lyu, H., Li, X., Zhang, L., &amp; Ke, G. (2025). Uni-Parser Technical Report. <em>arXiv preprint arXiv:2512.15098</em>. <a href="https://arxiv.org/abs/2512.15098">https://arxiv.org/abs/2512.15098</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://uni-parser.github.io">Project Page</a></li>
<li><a href="https://huggingface.co/UniParser">HuggingFace Models</a></li>
</ul>
]]></content:encoded></item><item><title>Latent Diffusion Models for High-Res Image Synthesis</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/latent-diffusion-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/latent-diffusion-models/</guid><description>Latent Diffusion Models train diffusion in a compressed latent space, enabling high-res image synthesis with cross-attention conditioning at reduced compute.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It introduces Latent Diffusion Models (LDMs), which train denoising diffusion models in the latent space of pretrained autoencoders rather than directly in pixel space. The key insight is that separating perceptual compression from generative learning enables high-resolution image synthesis at a fraction of the computational cost of pixel-based diffusion. The paper also introduces a cross-attention conditioning mechanism for flexible multi-modal generation.</p>
<h2 id="computational-cost-of-pixel-space-diffusion">Computational Cost of Pixel-Space Diffusion</h2>
<p>Training diffusion models directly in pixel space is computationally expensive (150 to 1000 V100 GPU-days for leading models at the time) because the model must process high-dimensional RGB data at every denoising step. Much of this compute is spent modeling imperceptible high-frequency details. The authors observe that learning can be split into two stages: a perceptual compression stage that removes high-frequency detail, and a semantic compression stage where the generative model learns the conceptual composition. Prior two-stage approaches (VQGAN, DALL-E) relied on aggressive compression and autoregressive modeling in discrete latent spaces, trading off reconstruction quality for tractability.</p>
<h2 id="core-innovation-diffusion-in-latent-space">Core Innovation: Diffusion in Latent Space</h2>
<p>LDMs decompose image synthesis into two phases:</p>
<p><strong>Phase 1: Perceptual Compression.</strong> A pretrained autoencoder (encoder $\mathcal{E}$, decoder $\mathcal{D}$) maps images $x \in \mathbb{R}^{H \times W \times 3}$ to a lower-dimensional latent representation $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$ with spatial downsampling factor $f = H/h$. The autoencoder is trained with a perceptual loss (matching deep features from a pretrained VGG network) and a patch-based adversarial objective, with either KL or VQ regularization on the latent space.</p>
<p><strong>Phase 2: Latent Diffusion.</strong> A standard denoising diffusion model operates in this latent space. The training objective becomes:</p>
<p>$$L_{\text{LDM}} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0,1), t} \left[ \left| \epsilon - \epsilon_\theta(z_t, t) \right|_2^2 \right]$$</p>
<p>where $z_t$ is the noised latent at timestep $t$, and $\epsilon_\theta$ is a time-conditional UNet.</p>
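<p>As a toy illustration of this objective, the sketch below draws one Monte Carlo sample of $L_{\text{LDM}}$ in NumPy. The zero-predicting stand-in for the UNet $\epsilon_\theta$, the latent shape, and the linear DDPM beta schedule are assumptions for illustration, not the paper's configuration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the time-conditional UNet eps_theta; a trained model
# would predict the added noise from (z_t, t).
def eps_theta(z_t, t):
    return np.zeros_like(z_t)

# Illustrative linear DDPM beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # \bar{alpha}_t

def ldm_loss(z0):
    """One Monte Carlo sample of L_LDM = E[ ||eps - eps_theta(z_t, t)||^2 ]."""
    t = rng.integers(T)
    eps = rng.standard_normal(z0.shape)
    # Forward diffusion applied in *latent* space, not pixel space.
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(z_t, t)) ** 2)

# e.g. a 4-channel f=8 latent of a 256x256 image would be 4 x 32 x 32
z0 = rng.standard_normal((4, 32, 32))
print(ldm_loss(z0))
```

<p>Because the placeholder predicts zero noise, the sampled loss sits near 1 (the variance of $\epsilon$); training $\epsilon_\theta$ drives it toward zero.</p>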
<p><strong>Cross-Attention Conditioning.</strong> To enable conditioning on text, semantic maps, or other modalities, the authors introduce cross-attention layers into the UNet. A domain-specific encoder $\tau_\theta$ maps conditioning input $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which interacts with the UNet features via:</p>
<p>$$Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y)$$</p>
<p>The conditional objective then becomes:</p>
<p>$$L_{\text{LDM}} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0,1), t} \left[ \left| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right|_2^2 \right]$$</p>
<p>Both $\tau_\theta$ and $\epsilon_\theta$ are optimized jointly.</p>
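<p>A minimal single-head sketch of this cross-attention layer in NumPy (the sizes and random weights are toy assumptions; in the paper these layers sit inside a multi-resolution UNet):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, ctx, W_Q, W_K, W_V):
    """phi: (N, d_model) flattened UNet features phi_i(z_t);
    ctx: (M, d_tau) conditioning tokens tau_theta(y)."""
    Q = phi @ W_Q                      # (N, d)
    K = ctx @ W_K                      # (M, d)
    V = ctx @ W_V                      # (M, d)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # (N, M): each location attends over tokens
    return A @ V                       # (N, d)

rng = np.random.default_rng(0)
N, M, d_model, d_tau, d = 64, 7, 32, 16, 8   # toy sizes
phi = rng.standard_normal((N, d_model))
ctx = rng.standard_normal((M, d_tau))
out = cross_attention(phi, ctx,
                      rng.standard_normal((d_model, d)),
                      rng.standard_normal((d_tau, d)),
                      rng.standard_normal((d_tau, d)))
print(out.shape)  # (64, 8)
```

<p>The key point is that $K$ and $V$ come from the conditioning input while $Q$ comes from the latent features, so the same mechanism works for text, layouts, or semantic maps by swapping $\tau_\theta$.</p>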
<h2 id="experimental-setup-and-results">Experimental Setup and Results</h2>
<p>The authors evaluate across multiple tasks and datasets:</p>
<p><strong>Perceptual compression tradeoffs.</strong> Downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ are compared on ImageNet class-conditional generation. LDM-1 (pixel-based) trains slowly; LDM-32 loses too much information. LDM-4 and LDM-8 achieve the best balance, with LDM-8 outperforming pixel-based diffusion by 38 FID points after 2M training steps on a single A100.</p>
<p><strong>Unconditional image synthesis</strong> on CelebA-HQ 256, FFHQ 256, LSUN Churches/Bedrooms 256: LDM-4 achieves FID 5.11 on CelebA-HQ (state of the art at the time), outperforming LSGM, GANs, and other likelihood-based models. On LSUN-Bedrooms, LDM-4 achieves FID 2.95, close to ADM (1.90) with half the parameters and roughly 4x less training compute (see Appendix E.3.5).</p>
<p><strong>Text-to-image synthesis</strong> on MS-COCO: A 1.45B parameter LDM-KL-8 model trained on LAION-400M achieves FID 12.63 with classifier-free guidance (a technique that amplifies the conditioning signal at the cost of diversity, by interpolating between conditional and unconditional predictions) at scale s=1.5, on par with GLIDE (FID 12.24, 6B params) and Make-A-Scene (FID 11.84, 4B params) with substantially fewer parameters.</p>
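<p>The parenthetical guidance rule can be made concrete in a few lines. This is the standard classifier-free guidance combination (the function name and toy vectors are illustrative):</p>

```python
import numpy as np

def cfg(eps_cond, eps_uncond, s):
    """Classifier-free guidance: move the noise prediction along the
    conditional direction; s = 1 recovers the plain conditional model."""
    return eps_uncond + s * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 0.0])   # toy conditional prediction
eps_u = np.array([0.0, 0.0])   # toy unconditional prediction
print(cfg(eps_c, eps_u, 1.5))
```

<p>With $s &gt; 1$ the conditional direction is amplified, which sharpens adherence to the prompt at the cost of sample diversity.</p>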
<p><strong>Class-conditional ImageNet 256:</strong> LDM-4-G achieves FID 3.60, IS 247.67, outperforming ADM-G (FID 4.59) with fewer parameters and less compute.</p>
<p><strong>Super-resolution:</strong> LDM-4 (big) achieves FID 2.4 on ImageNet 64-to-256 upscaling (validation split), outperforming SR3 in FID.</p>
<p><strong>Inpainting</strong> on Places: LDM-4 (big, w/ ft) achieves FID 1.50, setting a new state of the art on image inpainting.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<ul>
<li>LDM-4 and LDM-8 offer the best tradeoff between perceptual compression and generation quality.</li>
<li>The autoencoder only needs to be trained once and can be reused across different diffusion models and tasks.</li>
<li>Cross-attention conditioning generalizes to text, semantic layouts, and bounding boxes without architecture changes.</li>
<li>Convolutional sampling enables generation at resolutions higher than the training resolution (up to 1024x1024).</li>
<li>Sequential sampling remains slower than GANs. The autoencoder reconstruction can become a bottleneck for tasks requiring pixel-level precision.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unconditional</td>
          <td>CelebA-HQ, FFHQ, LSUN</td>
          <td>256x256</td>
          <td>Standard benchmarks</td>
      </tr>
      <tr>
          <td>Class-conditional</td>
          <td>ImageNet</td>
          <td>256x256</td>
          <td>1000 classes</td>
      </tr>
      <tr>
          <td>Text-to-image</td>
          <td>LAION-400M</td>
          <td>256x256</td>
          <td>400M image-text pairs</td>
      </tr>
      <tr>
          <td>Inpainting</td>
          <td>Places</td>
          <td>256x256, 512x512</td>
          <td>Following LaMa protocol</td>
      </tr>
      <tr>
          <td>Super-resolution</td>
          <td>ImageNet</td>
          <td>64 to 256</td>
          <td>Following SR3 pipeline</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Autoencoder regularization</strong>: KL-reg (KL penalty toward standard normal, weighted by ~$10^{-6}$) or VQ-reg (vector quantization layer on the latent space with a learned codebook)</li>
<li><strong>Diffusion</strong>: Standard DDPM denoising with reweighted objective</li>
<li><strong>Sampling</strong>: DDIM sampler with configurable steps (100 to 500 depending on task)</li>
<li><strong>Guidance</strong>: Classifier-free diffusion guidance with scale $s$ (1.5 for class-conditional and text-to-image quantitative evaluation; 10.0 for qualitative text-to-image samples)</li>
</ul>
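<p>As a sketch of the sampler listed above, a single deterministic DDIM update ($\eta = 0$) can be written as follows; the schedule values and shapes are illustrative assumptions:</p>

```python
import numpy as np

def ddim_step(z_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0): predict z_0 from the
    current noise estimate, then re-noise to the previous marginal."""
    z0_pred = (z_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * z0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

# Sanity check: given the exact noise, stepping to abar_prev = 1 recovers z0.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(z0.shape)
abar_t = 0.5
z_t = np.sqrt(abar_t) * z0 + np.sqrt(1.0 - abar_t) * eps
print(np.allclose(ddim_step(z_t, eps, abar_t, 1.0), z0))  # True
```

<p>Chaining this update over a coarse subsequence of timesteps is what lets DDIM trade step count (100 to 500 here) against sample quality.</p>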
<h3 id="models">Models</h3>
<ul>
<li><strong>Autoencoder</strong>: Based on VQGAN architecture with perceptual + adversarial loss</li>
<li><strong>UNet backbone</strong>: Time-conditional with cross-attention layers at multiple resolutions</li>
<li><strong>Text encoder</strong>: BERT-tokenizer with transformer $\tau_\theta$ for LAION text-to-image model</li>
<li><strong>LDM-4-G</strong>: 400M parameters, $f=4$ downsampling</li>
<li><strong>LDM-KL-8 (text)</strong>: 1.45B parameters, $f=8$ downsampling, KL-regularized</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CelebA-HQ unconditional</td>
          <td>5.11</td>
          <td>500 DDIM steps</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet class-conditional</td>
          <td>3.60</td>
          <td>LDM-4-G, cfg s=1.5</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>MS-COCO text-to-image</td>
          <td>12.63</td>
          <td>LDM-KL-8-G, 250 steps, cfg s=1.5</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>Places inpainting</td>
          <td>1.50</td>
          <td>LDM-4 big, w/ ft</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet 4x super-resolution</td>
          <td>2.4</td>
          <td>LDM-4 big, 100 steps</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Perceptual compression tradeoff experiments: single NVIDIA A100</li>
<li>Inpainting model trained on eight NVIDIA V100 GPUs</li>
<li>Training at least 2.7x faster than pixel-based diffusion at equal parameters</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CompVis/latent-diffusion">CompVis/latent-diffusion</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rombach, R., Blattmann, A., Lorenz, D., Esser, P., &amp; Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. <em>CVPR 2022</em>. <a href="https://arxiv.org/abs/2112.10752">https://arxiv.org/abs/2112.10752</a></p>
<p><strong>Publication</strong>: CVPR 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{rombach2022highresolution,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{High-Resolution Image Synthesis with Latent Diffusion Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\&#34;o}rn}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>     = <span style="color:#e6db74">{10684--10695}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/CompVis/latent-diffusion">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>Kabsch Algorithm: Optimal Rotation for Point Set Alignment</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/kabsch-algorithm/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/kabsch-algorithm/</guid><description>Kabsch (1976) derives a closed-form solution for the optimal rotation aligning two weighted vector sets by minimizing squared deviations.</description><content:encoded><![CDATA[<h2 id="a-closed-form-solution-for-optimal-rotation">A Closed-Form Solution for Optimal Rotation</h2>
<p>This short communication presents a <strong>Method</strong> paper: a direct, analytical solution to a constrained optimization problem. Given two sets of vectors, Kabsch derives the orthogonal matrix (rotation) that best superimposes one set onto the other by minimizing a weighted sum of squared deviations. Prior approaches either solved an unconstrained problem and factorized the result (Diamond, 1976) or used iterative methods (McLachlan, 1972). Kabsch shows that a direct, non-iterative solution exists despite the non-linear nature of the orthogonality constraint.</p>
<h2 id="the-superposition-problem">The Superposition Problem</h2>
<p>The core problem arises frequently in crystallography and structural biology: given two sets of corresponding points (e.g., atomic coordinates from a known structure and experimentally measured coordinates), find the rigid rotation that best aligns them. Translations can be removed by centering both point sets at the origin, leaving only the rotational component.</p>
<p>Formally, given vector sets $\mathbf{x}_n$ and $\mathbf{y}_n$ ($n = 1, 2, \ldots, N$) with weights $w_n$, find the orthogonal matrix $\mathsf{U}$ minimizing:</p>
<p>$$
E = \frac{1}{2} \sum_{n} w_n (\mathsf{U} \mathbf{x}_n - \mathbf{y}_n)^2
$$</p>
<p>subject to orthogonality: $\tilde{\mathsf{U}} \mathsf{U} = \mathsf{I}$.</p>
<h2 id="derivation-via-lagrange-multipliers">Derivation via Lagrange Multipliers</h2>
<p>Kabsch introduces a symmetric matrix $\mathsf{L}$ of Lagrange multipliers to enforce orthogonality, forming the Lagrangian:</p>
<p>$$
G = E + \frac{1}{2} \sum_{i,j} l_{ij} \left( \sum_{k} u_{ki} u_{kj} - \delta_{ij} \right)
$$</p>
<p>Setting $\partial G / \partial u_{ij} = 0$ and defining two key matrices:</p>
<p>$$
r_{ij} = \sum_{n} w_n \, y_{ni} \, x_{nj} \qquad s_{ij} = \sum_{n} w_n \, x_{ni} \, x_{nj}
$$</p>
<p>where $\mathsf{R} = (r_{ij})$ is the weighted cross-covariance matrix and $\mathsf{S} = (s_{ij})$ is the weighted auto-covariance matrix, the stationarity condition becomes:</p>
<p>$$
\mathsf{U} \cdot (\mathsf{S} + \mathsf{L}) = \mathsf{R}
$$</p>
<h2 id="eigendecomposition-solution">Eigendecomposition Solution</h2>
<p>The key insight is that multiplying both sides by their transposes eliminates the unknown $\mathsf{U}$:</p>
<p>$$
(\mathsf{S} + \mathsf{L})(\mathsf{S} + \mathsf{L}) = \tilde{\mathsf{R}} \mathsf{R}
$$</p>
<p>Since $\tilde{\mathsf{R}} \mathsf{R}$ is symmetric positive definite, it has positive eigenvalues $\mu_k$ and eigenvectors $\mathbf{a}_k$. The matrix $\mathsf{S} + \mathsf{L}$ shares the same eigenvectors with eigenvalues $\sqrt{\mu_k}$.</p>
<p>From the eigenvectors $\mathbf{a}_k$, a second set of unit vectors $\mathbf{b}_k$ is defined:</p>
<p>$$
\mathbf{b}_k = \frac{1}{\sqrt{\mu_k}} \mathsf{R} \, \mathbf{a}_k
$$</p>
<p>The optimal rotation matrix is then constructed directly:</p>
<p>$$
u_{ij} = \sum_{k} b_{ki} \, a_{kj}
$$</p>
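<p>In practice this construction is usually implemented through the SVD of $\mathsf{R}$, which yields the same optimal rotation; a minimal NumPy sketch under that equivalent formulation, including the commonly added determinant guard that forces a proper rotation rather than a reflection (a detail beyond the 1976 communication itself):</p>

```python
import numpy as np

def kabsch(X, Y, w=None):
    """Optimal rotation U minimizing sum_n w_n ||U x_n - y_n||^2.

    X, Y : (N, 3) centered point sets; w : (N,) optional weights.
    Uses the SVD of the weighted cross-covariance R, the standard
    equivalent of Kabsch's eigendecomposition of R~R.
    """
    if w is None:
        w = np.ones(len(X))
    R = (w[:, None] * Y).T @ X            # r_ij = sum_n w_n y_ni x_nj
    V, S, Wt = np.linalg.svd(R)
    # Determinant guard: ensure det(U) = +1 (proper rotation).
    d = np.sign(np.linalg.det(V @ Wt))
    return V @ np.diag([1.0, 1.0, d]) @ Wt

# Check: recover a known rotation from noiseless correspondences.
rng = np.random.default_rng(0)
theta = 0.7
U_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
X = rng.standard_normal((10, 3))
X -= X.mean(axis=0)          # translations removed by centering, as in the paper
Y = X @ U_true.T
U = kabsch(X, Y)
print(np.allclose(U, U_true))  # True
```

<p>On noiseless data the recovered $\mathsf{U}$ matches the generating rotation to floating-point precision; with noise it returns the least-squares optimum.</p>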
<h2 id="handling-degeneracies-and-generalizations">Handling Degeneracies and Generalizations</h2>
<p>Kabsch addresses two extensions:</p>
<ol>
<li>
<p><strong>Planar point sets</strong>: When all vectors lie in a plane, one eigenvalue of $\tilde{\mathsf{R}} \mathsf{R}$ is zero. The missing eigenvectors are recovered via cross products: $\mathbf{a}_3 = \mathbf{a}_1 \times \mathbf{a}_2$ and $\mathbf{b}_3 = \mathbf{b}_1 \times \mathbf{b}_2$.</p>
</li>
<li>
<p><strong>General metric constraints</strong>: The orthogonality constraint $\tilde{\mathsf{U}} \mathsf{U} = \mathsf{I}$ can be replaced by $\tilde{\mathsf{U}} \mathsf{U} = \mathsf{M}$ for any symmetric positive definite $\mathsf{M}$. By finding any specific solution $\mathsf{B}$ and transforming the input vectors as $\mathbf{x}'_n = \mathsf{B} \mathbf{x}_n$, the problem reduces back to the standard orthogonal case.</p>
</li>
</ol>
<p>The method generalizes naturally to vector spaces of arbitrary dimension.</p>
<h2 id="legacy-and-impact">Legacy and Impact</h2>
<p>This two-page communication became one of the most cited papers in structural biology. The &ldquo;Kabsch algorithm&rdquo; (or &ldquo;Kabsch rotation&rdquo;) is the standard method for computing the root-mean-square deviation (RMSD) between two molecular structures after optimal superposition. It underpins structure comparison tools across crystallography, NMR spectroscopy, cryo-EM, and computational chemistry.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. <em>Acta Crystallographica Section A</em>, 32(5), 922-923. <a href="https://doi.org/10.1107/s0567739476001873">https://doi.org/10.1107/s0567739476001873</a></p>
<p><strong>Publication</strong>: Acta Crystallographica Section A, 1976</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with differentiable implementations)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kabsch1976solution,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A solution for the best rotation to relate two sets of vectors}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kabsch, Wolfgang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{32}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{922--923}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1976}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{International Union of Crystallography}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1107/s0567739476001873}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Horn's Method: Absolute Orientation via Unit Quaternions</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/horn-absolute-orientation/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/horn-absolute-orientation/</guid><description>Horn (1987) presents a closed-form quaternion solution for absolute orientation, finding optimal rotation, translation, and scale between two point sets.</description><content:encoded><![CDATA[<h2 id="a-quaternion-approach-to-point-set-registration">A Quaternion Approach to Point Set Registration</h2>
<p>This <strong>Method</strong> paper presents a closed-form solution to the absolute orientation problem: given corresponding points measured in two different coordinate systems, find the optimal rotation, translation, and scale that maps one set onto the other. While the <a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch algorithm</a> (1976) solved the rotation subproblem via eigendecomposition of $\tilde{\mathsf{R}}\mathsf{R}$, Horn&rsquo;s approach uses unit quaternions to represent rotation, reducing the problem to finding the eigenvector of a $4 \times 4$ symmetric matrix associated with its largest eigenvalue.</p>
<h2 id="the-absolute-orientation-problem">The Absolute Orientation Problem</h2>
<p>Given $n$ point pairs $\{\mathbf{r}_{l,i}\}$ and $\{\mathbf{r}_{r,i}\}$ measured in &ldquo;left&rdquo; and &ldquo;right&rdquo; coordinate systems, find the transformation:</p>
<p>$$
\mathbf{r}_r = s \, R(\mathbf{r}_l) + \mathbf{r}_0
$$</p>
<p>where $s$ is a scale factor, $R$ is a rotation, and $\mathbf{r}_0$ is a translation, minimizing the sum of squared residual errors:</p>
<p>$$
\sum_{i=1}^{n} \lVert \mathbf{r}_{r,i} - s \, R(\mathbf{r}_{l,i}) - \mathbf{r}_0 \rVert^2
$$</p>
<p>Prior methods either used iterative numerical procedures or selectively discarded constraints (e.g., Thompson&rsquo;s and Schut&rsquo;s three-point methods). Horn derives a direct solution that uses all available information from all points simultaneously.</p>
<h2 id="decoupling-translation-scale-and-rotation">Decoupling Translation, Scale, and Rotation</h2>
<p>Horn shows that the three components of the transformation can be solved sequentially.</p>
<p><strong>Translation</strong>: After centering both point sets at their centroids ($\bar{\mathbf{r}}_l$ and $\bar{\mathbf{r}}_r$), the optimal translation is:</p>
<p>$$
\mathbf{r}_0 = \bar{\mathbf{r}}_r - s \, R(\bar{\mathbf{r}}_l)
$$</p>
<p><strong>Scale</strong>: Horn derives three formulations (asymmetric left, asymmetric right, and symmetric). The symmetric version, which ensures the inverse transformation yields the reciprocal scale, is:</p>
<p>$$
s = \left( \frac{\sum_{i=1}^{n} \lVert \mathbf{r}'_{r,i} \rVert^2}{\sum_{i=1}^{n} \lVert \mathbf{r}'_{l,i} \rVert^2} \right)^{1/2}
$$</p>
<p>the ratio of root-mean-square deviations from the respective centroids.</p>
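<p>A minimal NumPy sketch of this symmetric scale (the toy points are illustrative; translation is removed by the centroid subtraction):</p>

```python
import numpy as np

def symmetric_scale(P, Q):
    """Horn's symmetric scale: RMS deviation of the right set from its
    centroid over that of the left set. Swapping P and Q gives the
    reciprocal, so the inverse transform carries the inverse scale."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    return np.sqrt((Qc ** 2).sum() / (Pc ** 2).sum())

P = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
Q = 3.0 * P + np.array([5.0, -1.0, 2.0])   # scaled and translated copy
print(symmetric_scale(P, Q))  # 3.0
```

<p>Unlike the asymmetric formulations, this estimate does not depend on which coordinate system is treated as the reference.</p>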
<p><strong>Rotation</strong>: After removing translation and scale, the remaining problem is to find the rotation $R$ that maximizes:</p>
<p>$$
\sum_{i=1}^{n} \mathbf{r}'_{r,i} \cdot R(\mathbf{r}'_{l,i})
$$</p>
<h2 id="the-quaternion-eigenvector-solution">The Quaternion Eigenvector Solution</h2>
<p>Horn represents rotation using unit quaternions $\dot{q} = q_0 + i q_x + j q_y + k q_z$ with $\lVert \dot{q} \rVert = 1$. A rotation acts on a vector (represented as a purely imaginary quaternion $\dot{r}$) via the composite product:</p>
<p>$$
\dot{r}' = \dot{q} \, \dot{r} \, \dot{q}^*
$$</p>
<p>Using the $4 \times 4$ matrix representations of quaternion products, the objective function becomes a quadratic form:</p>
<p>$$
\dot{q}^T N \dot{q}
$$</p>
<p>where $N$ is a real symmetric $4 \times 4$ matrix whose elements are combinations of the sums of products $S_{xx}, S_{xy}, \ldots, S_{zz}$ from the $3 \times 3$ cross-covariance matrix $M = \sum_i \mathbf{r}'_{l,i} (\mathbf{r}'_{r,i})^T$:</p>
<p>$$
N = \begin{bmatrix} (S_{xx} + S_{yy} + S_{zz}) &amp; S_{yz} - S_{zy} &amp; S_{zx} - S_{xz} &amp; S_{xy} - S_{yx} \\ S_{yz} - S_{zy} &amp; (S_{xx} - S_{yy} - S_{zz}) &amp; S_{xy} + S_{yx} &amp; S_{zx} + S_{xz} \\ S_{zx} - S_{xz} &amp; S_{xy} + S_{yx} &amp; (-S_{xx} + S_{yy} - S_{zz}) &amp; S_{yz} + S_{zy} \\ S_{xy} - S_{yx} &amp; S_{zx} + S_{xz} &amp; S_{yz} + S_{zy} &amp; (-S_{xx} - S_{yy} + S_{zz}) \end{bmatrix}
$$</p>
<p>The trace of $N$ is always zero. The unit quaternion maximizing $\dot{q}^T N \dot{q}$ is the eigenvector corresponding to the most positive eigenvalue of $N$.</p>
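<p>The whole rotation step can be sketched in NumPy: assemble $N$ from the cross-covariance sums, take the eigenvector of the most positive eigenvalue, and convert the unit quaternion to a rotation matrix. The helper name and test data are illustrative assumptions:</p>

```python
import numpy as np

def horn_rotation(P, Q):
    """Rotation mapping centered 'left' points P onto 'right' points Q,
    via the eigenvector of N for its most positive eigenvalue."""
    M = P.T @ Q                        # S_ab = sum_i P[i, a] * Q[i, b]
    Sxx, Sxy, Sxz = M[0]
    Syx, Syy, Syz = M[1]
    Szx, Szy, Szz = M[2]
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
    vals, vecs = np.linalg.eigh(N)
    q0, qx, qy, qz = vecs[:, -1]       # eigh sorts ascending; take the largest
    # Unit quaternion -> rotation matrix (the sign of q is irrelevant).
    return np.array([
        [q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
        [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
        [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]])

# Sanity check on noiseless correspondences with a random proper rotation.
rng = np.random.default_rng(1)
R_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
P = rng.standard_normal((8, 3))
P -= P.mean(axis=0)
R = horn_rotation(P, P @ R_true.T)
print(np.allclose(R, R_true))  # True
```

<p>Because the rotation matrix is quadratic in the quaternion components, the sign ambiguity of the eigenvector has no effect on the result.</p>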
<h2 id="the-characteristic-polynomial">The Characteristic Polynomial</h2>
<p>The eigenvalues satisfy a quartic $\lambda^4 + c_3 \lambda^3 + c_2 \lambda^2 + c_1 \lambda + c_0 = 0$ where:</p>
<ul>
<li>$c_3 = 0$ (trace of $N$ is zero, so the four roots sum to zero)</li>
<li>$c_2 = -2 \operatorname{Tr}(M^T M)$ (always negative, guaranteeing both positive and negative roots)</li>
<li>$c_1 = -8 \det(M)$</li>
<li>$c_0 = \det(N)$</li>
</ul>
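<p>These coefficient identities are straightforward to check numerically against NumPy's characteristic polynomial; in the sketch below, $M$ is an arbitrary random stand-in for the cross-covariance matrix:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))        # arbitrary 3x3 cross-covariance
Sxx, Sxy, Sxz = M[0]
Syx, Syy, Syz = M[1]
Szx, Szy, Szz = M[2]
N = np.array([
    [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
    [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
    [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
    [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])

# np.poly(N) returns [1, c3, c2, c1, c0] for det(lambda*I - N).
c = np.poly(N)
print(np.isclose(c[1], 0.0),                       # c3 = 0 (zero trace)
      np.isclose(c[2], -2.0 * np.trace(M.T @ M)),  # c2
      np.isclose(c[3], -8.0 * np.linalg.det(M)),   # c1
      np.isclose(c[4], np.linalg.det(N)))          # c0
```

<p>All four checks hold for any $M$, since the identities are algebraic consequences of how $N$ is assembled.</p>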
<p>When points are coplanar (including the common case of exactly three points), $\det(M) = 0$, so $c_1 = 0$ and the quartic reduces to a biquadratic solvable in closed form.</p>
<h2 id="coplanar-points-and-the-three-point-case">Coplanar Points and the Three-Point Case</h2>
<p>For coplanar measurements, the quartic simplifies to $\lambda^4 + c_2 \lambda^2 + c_0 = 0$, yielding:</p>
<p>$$
\lambda_m = \left[ \frac{1}{2} \left( (c_2^2 - 4c_0)^{1/2} - c_2 \right) \right]^{1/2}
$$</p>
<p>Horn also provides a geometric interpretation for the coplanar case: first rotate one plane into the other (about their line of intersection), then solve a 2D least-squares rotation within the shared plane.</p>
<h2 id="comparison-with-the-kabsch-algorithm">Comparison with the Kabsch Algorithm</h2>
<p>Both methods solve the same underlying optimization problem but approach it differently:</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>Kabsch (1976)</th>
          <th>Horn (1987)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Rotation representation</td>
          <td>Orthogonal matrix</td>
          <td>Unit quaternion</td>
      </tr>
      <tr>
          <td>Core computation</td>
          <td>SVD or eigendecomposition of $\tilde{R}R$ ($3 \times 3$)</td>
          <td>Eigenvector of $N$ ($4 \times 4$)</td>
      </tr>
      <tr>
          <td>Scale estimation</td>
          <td>Not addressed</td>
          <td>Three formulations (including symmetric)</td>
      </tr>
      <tr>
          <td>Constraint enforcement</td>
          <td>Lagrange multipliers</td>
          <td>Unit quaternion norm</td>
      </tr>
      <tr>
          <td>Symmetry guarantee</td>
          <td>Not addressed</td>
          <td>Proven for symmetric scale</td>
      </tr>
      <tr>
          <td>Degenerate cases</td>
          <td>Cross-product fallback</td>
          <td>Biquadratic closed form</td>
      </tr>
  </tbody>
</table>
<p>Horn emphasizes a symmetry property: the inverse transformation should yield exactly the inverse parameters. This holds automatically for the quaternion rotation but requires a specific (symmetric) choice of scale formula.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Horn, B. K. P. (1987). Closed-form solution of absolute orientation using unit quaternions. <em>Journal of the Optical Society of America A</em>, 4(4), 629-642. <a href="https://doi.org/10.1364/JOSAA.4.000629">https://doi.org/10.1364/JOSAA.4.000629</a></p>
<p><strong>Publication</strong>: Journal of the Optical Society of America A, 1987</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/posts/kabsch-algorithm/">Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</a> (tutorial with differentiable implementations of the related SVD-based method)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{horn1987closed,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Closed-form solution of absolute orientation using unit quaternions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Horn, Berthold K. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the Optical Society of America A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{629--642}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1987}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Optica Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1364/josaa.4.000629}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GraSP: Graph Recognition via Subgraph Prediction (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/grasp-2026/</guid><description>GraSP is a general image-to-graph framework using sequential subgraph prediction, applied to OCSR with 67.5% accuracy on QM9.</description><content:encoded><![CDATA[<h2 id="a-general-framework-for-visual-graph-recognition">A General Framework for Visual Graph Recognition</h2>
<p>GraSP (Graph Recognition via Subgraph Prediction) addresses a fundamental limitation in image-to-graph methods: existing solutions are task-specific and do not transfer between domains. Whether the task is OCSR, scene graph recognition, music notation parsing, or road network extraction, each domain has developed independent solutions despite solving the same conceptual problem of extracting a graph from an image.</p>
<p>The key insight is that graph recognition can be reformulated as sequential subgraph prediction using a binary classifier, sidestepping two core difficulties of using graphs as neural network outputs:</p>
<ol>
<li><strong>Graph isomorphism</strong>: An uncolored graph with $n$ nodes has up to $n!$ equivalent representations, making direct output comparison intractable</li>
<li><strong>Compositional outputs</strong>: Nodes, edges, and features are interdependent, so standard i.i.d. loss functions are insufficient</li>
</ol>
<h2 id="sequential-subgraph-prediction-as-an-mdp">Sequential Subgraph Prediction as an MDP</h2>
<p>GraSP formulates graph recognition as a Markov Decision Process. Starting from an empty graph, the method iteratively expands the current graph by adding one edge at a time (connecting either a new node or two existing nodes). At each step, a binary classifier predicts whether each candidate successor graph is a subgraph of the target graph shown in the image.</p>
<p>The critical observation is that the optimal value function $V^{\pi^*}$ satisfies:</p>
<p>$$V^{\pi^*}(\mathcal{G}_t | \mathcal{I}) = 1 \iff \mathcal{G}_t \subseteq \mathcal{G}_{\mathcal{I}}$$</p>
<p>This means the value function reduces to a subgraph membership test, which can be learned as a binary classifier rather than requiring reinforcement learning. Greedy decoding then suffices: at each step, select any successor that the classifier predicts is a valid subgraph, and terminate when the classifier indicates the current graph is complete.</p>
<p>This formulation decouples <strong>decision</strong> (what to add) from <strong>generation</strong> (in what order), making the same model applicable across different graph types without modification.</p>
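<p>The decoding loop can be sketched in a few lines. This is an illustrative mock, not the authors' code: <code>predict</code> stands in for the learned classifier, and graphs are simplified to edge sets over fixed node IDs, ignoring isomorphism and node/edge features.</p>

```python
def greedy_decode(predict, candidate_edges):
    """Grow a graph one edge at a time, guided by a subgraph classifier.

    predict(edges, terminal) -> probability that `edges` is a subgraph of
    the target (terminal=False) or equals the target (terminal=True).
    """
    graph = frozenset()
    while True:
        if predict(graph, terminal=True) > 0.5:
            return graph                  # classifier says: target reached
        for e in candidate_edges:         # try each one-edge successor
            succ = graph | {e}
            if succ != graph and predict(succ, terminal=False) > 0.5:
                graph = succ
                break
        else:
            return graph                  # no valid successor found

# Mock classifier with a known target, to show the control flow.
target = frozenset({(0, 1), (1, 2), (1, 3)})
oracle = lambda g, terminal: float(g == target if terminal else g <= target)
assert greedy_decode(oracle, sorted(target | {(2, 3)})) == target
```

The invalid candidate edge (2, 3) is rejected at every step, so decoding terminates with exactly the target graph.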
<h2 id="architecture-gnn--film-conditioned-cnn">Architecture: GNN + FiLM-Conditioned CNN</h2>
<p>The architecture has three components:</p>
<ol>
<li>
<p><strong>GNN encoder</strong>: A Message Passing Neural Network processes the candidate subgraph, producing a graph embedding. Messages are constructed as concatenations of source node features, target node features, and connecting edge features.</p>
</li>
<li>
<p><strong>FiLM-conditioned CNN</strong>: A ResNet-v2 processes the image, with FiLM layers placed after every normalization layer within each block. The graph embedding conditions the image processing, producing a joint graph-image representation.</p>
</li>
<li>
<p><strong>MLP classification head</strong>: Takes the conditioned image embedding plus a binary terminal flag (indicating whether this is a termination check) and predicts subgraph membership.</p>
</li>
</ol>
<p>The model uses only 7.25M parameters. Group Normalization (8 groups per layer) is used in the CNN, and Layer Normalization in the GNN and MLP.</p>
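<p>The FiLM mechanism itself is just a feature-wise affine modulation of the CNN channels by the graph embedding. A minimal NumPy sketch, where the shapes and the linear projections are illustrative rather than the paper's exact head:</p>

```python
import numpy as np

def film(feature_maps, graph_embedding, W_gamma, W_beta):
    """FiLM: scale and shift each CNN channel using the graph embedding.

    feature_maps: (C, H, W); graph_embedding: (E,);
    W_gamma, W_beta: (C, E) projections producing per-channel gamma/beta.
    """
    gamma = W_gamma @ graph_embedding     # (C,) multiplicative modulation
    beta = W_beta @ graph_embedding       # (C,) additive modulation
    return gamma[:, None, None] * feature_maps + beta[:, None, None]
```

In GraSP this modulation sits after every normalization layer inside each ResNet-v2 block, so the graph hypothesis conditions the image features at every depth.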
<h2 id="training-via-streaming-data-generation">Training via Streaming Data Generation</h2>
<p>Training uses a streaming architecture rather than a fixed dataset:</p>
<ul>
<li>For each iteration, a target graph $\mathcal{G}_T$ is sampled and rendered as an image</li>
<li><strong>Positive samples</strong> are generated by deleting edges that do not disconnect the graph (yielding valid subgraphs)</li>
<li><strong>Negative samples</strong> are generated by expanding successor states and checking via approximate subgraph matching</li>
<li>Two FIFO buffers (one for positives, one for negatives), each holding up to 25,000 images, maintain diverse and balanced mini-batches of 1024 samples</li>
<li>Training uses the RAdam optimizer with a cosine learning rate schedule (warmup over 50M samples, cycle of 250M samples) on 4 A100 GPUs with a 24h budget</li>
</ul>
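<p>The buffer scheme above can be sketched with two bounded deques. Buffer capacity and batch size follow the text; the sampling details are our own illustration:</p>

```python
import random
from collections import deque

pos_buf = deque(maxlen=25_000)   # FIFO: oldest positives are evicted first
neg_buf = deque(maxlen=25_000)

def balanced_minibatch(batch_size=1024):
    """Draw half the batch from each buffer so classes stay balanced."""
    half = batch_size // 2
    batch = [(x, 1) for x in random.choices(pos_buf, k=half)]
    batch += [(x, 0) for x in random.choices(neg_buf, k=half)]
    random.shuffle(batch)
    return batch
```

A producer process keeps appending freshly rendered (graph, image) samples to the buffers while the trainer repeatedly calls <code>balanced_minibatch</code>, which is what makes the data distribution streaming rather than fixed.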
<h2 id="synthetic-benchmarks-on-colored-trees">Synthetic Benchmarks on Colored Trees</h2>
<p>GraSP is evaluated on increasingly complex synthetic tasks involving colored tree graphs:</p>
<ul>
<li><strong>Small trees (6-9 nodes)</strong>: Tasks with varying numbers of node colors (1, 3, 5) and edge colors (1, 3, 5). The model works well across all configurations, with simpler tasks (fewer colors) converging faster.</li>
<li><strong>Larger trees (10-15 nodes)</strong>: The same trends hold but convergence is slower due to increased structural complexity.</li>
<li><strong>Out-of-distribution generalization</strong>: Models trained on 6-9 node trees show zero-shot generalization to 10-node trees, indicating learned patterns are size-independent.</li>
</ul>
<h2 id="ocsr-evaluation-on-qm9">OCSR Evaluation on QM9</h2>
<p>For the real-world OCSR evaluation, GraSP is applied to <a href="/notes/chemistry/datasets/qm9/">QM9</a> molecular images (grayscale, no stereo-bonds) with a 10,000-molecule held-out test set:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OSRA</td>
          <td>45.61%</td>
      </tr>
      <tr>
          <td>GraSP</td>
          <td>67.51%</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>88.36%</td>
      </tr>
      <tr>
          <td>DECIMER</td>
          <td>92.08%</td>
      </tr>
  </tbody>
</table>
<p>GraSP does not match state-of-the-art OCSR tools, but the authors emphasize that the same model architecture and training procedure transfers directly from synthetic tree tasks to molecular graphs with no task-specific modifications. The only domain knowledge incorporated is a simple chemistry rule: not extending nodes that already have degree four.</p>
<p>The method highlights the practical advantage of decoupling decision from generation. Functional groups can be represented at different granularities (as single nodes to reduce trajectory depth, or expanded to reduce trajectory breadth) without changing the model.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/c72bcbf4/grasp">GraSP Code</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation with pre-trained models</td>
      </tr>
  </tbody>
</table>
<p>The repository includes pre-trained models and example trajectories for interactive exploration. Training requires 4 A100 GPUs with a 24h time budget. The QM9 dataset used for OCSR evaluation is publicly available. No license file is included in the repository.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<ul>
<li><strong>Finite type assumption</strong>: The current framework assumes a finite set of node and edge types, limiting applicability to open-vocabulary tasks like scene graph recognition</li>
<li><strong>Scaling to large graphs</strong>: For very large graphs, the branching factor of successor states becomes expensive. Learned filters to prune irrelevant successor states could help</li>
<li><strong>OCSR performance gap</strong>: While GraSP demonstrates transferability, it falls short of specialized OCSR tools that use domain-specific encodings (SMILES) or pixel-level supervision</li>
<li><strong>Modality extension</strong>: The framework could extend beyond images to other input modalities, such as vector embeddings of graphs</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eberhard, A., Neumann, G., &amp; Friederich, P. (2026). Graph Recognition via Subgraph Prediction. <em>arXiv preprint arXiv:2601.15133</em>. <a href="https://arxiv.org/abs/2601.15133">https://arxiv.org/abs/2601.15133</a></p>
<p><strong>Publication</strong>: arXiv 2026</p>
]]></content:encoded></item><item><title>GraphReco: Probabilistic Structure Recognition (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/graphreco-2026/</guid><description>GraphReco is a rule-based OCSR system using Markov networks for probabilistic atom/bond ambiguity resolution during graph assembly.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, H., Yu, Y., &amp; Liu, J.-C. (2026). GraphReco: Probabilistic Structure Recognition for Chemical Molecules. <em>ChemistryOpen</em>, e202500537. <a href="https://doi.org/10.1002/open.202500537">https://doi.org/10.1002/open.202500537</a></p>
<p><strong>Publication</strong>: ChemistryOpen 2026 (Open Access)</p>
<h2 id="a-rule-based-ocsr-system-with-probabilistic-graph-assembly">A Rule-Based OCSR System with Probabilistic Graph Assembly</h2>
<p>GraphReco tackles a challenge that is rarely addressed explicitly in rule-based OCSR: the ambiguity that arises during graph assembly when lower-level component extraction results are imprecise. Small deviations in bond endpoint locations, false positive detections, and spatial proximity between elements all create uncertainty about which atoms and bonds should be connected, merged, or discarded.</p>
<p>The system introduces two main contributions:</p>
<ol>
<li><strong>Fragment Merging (FM) line detection</strong>: An adaptive three-stage algorithm for precise bond line identification across images of variable resolution</li>
<li><strong>Probabilistic ambiguity resolution</strong>: A Markov network that infers the most likely existence and merging state of atom and bond candidates</li>
</ol>
<h2 id="three-stage-pipeline">Three-Stage Pipeline</h2>
<p>GraphReco follows a three-stage workflow:</p>
<ol>
<li>
<p><strong>Component Extraction</strong>: Detects circles (aromatic bonds), bond lines (via the FM algorithm), and chemical symbols (via Tesseract OCR). Includes detection of solid wedge, dashed wedge, dashed line, and wavy bond styles. A semi-open-loop correction step resolves cases where symbols are misclassified as bonds and vice versa.</p>
</li>
<li>
<p><strong>Atom and Bond Ambiguity Resolution</strong>: Creates atom and bond candidates from detected components, builds a Markov network to infer their most probable states, and resolves candidates through existence and merging decisions.</p>
</li>
<li>
<p><strong>Graph Reconstruction</strong>: Assembles resolved atoms and bonds into a molecule graph, selects the largest connected component, and exports as MDL Molfile.</p>
</li>
</ol>
<h2 id="fragment-merging-line-detection">Fragment Merging Line Detection</h2>
<p>Classical Line Hough Transform (LHT) struggles with chemical structure images because bond lines suffer from pixelization, and algorithm parameters that work for one image resolution fail at others. The FM algorithm addresses this with three stages:</p>
<ol>
<li>
<p><strong>Fragment extraction</strong>: Apply LHT with fine-grained parameters (distance resolution $r = 2$, angular resolution $\theta = 2°$) to detect fine line fragments. Walk along detected theoretical lines to find actual black pixels and group them by connectivity.</p>
</li>
<li>
<p><strong>Fragment grouping</strong>: Pair fragments that share similar angles, are close in the perpendicular direction, and are either overlapping or connected by a path of black pixels.</p>
</li>
<li>
<p><strong>Fragment merging</strong>: Merge grouped fragments into single line segments using the two border pixels farthest from the centroid.</p>
</li>
</ol>
<p>The FM algorithm effectively handles the tradeoff that plagues standard LHT: coarse parameters miss short lines and produce overlaps, while fine parameters return many fragments shorter than actual bonds.</p>
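<p>A toy version of the final merging stage can be written directly. This is a simplification: it fuses an already-grouped set of fragments by keeping the two endpoints farthest apart, a stand-in for the paper's "border pixels farthest from the centroid" rule, and it omits the angle/overlap checks of stage 2.</p>

```python
import numpy as np

def merge_fragments(fragments):
    """Merge a group of roughly collinear fragments into one segment.

    fragments: list of ((x1, y1), (x2, y2)) endpoint pairs that the
    grouping stage already decided belong to the same bond line.
    """
    pts = np.array([p for seg in fragments for p in seg], dtype=float)
    # Pairwise endpoint distances; keep the farthest-apart pair.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    return tuple(pts[i]), tuple(pts[j])
```

Three overlapping horizontal fragments spanning x = 0..3, 2..7, and 6..10, for instance, collapse to the single segment from (0, 0) to (10, 0).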
<h2 id="probabilistic-ambiguity-resolution-via-markov-network">Probabilistic Ambiguity Resolution via Markov Network</h2>
<p>After component extraction, GraphReco creates atom and bond candidates rather than directly assembling the graph. Each bond endpoint generates an atom candidate with a circular bounding area of radius:</p>
<p>$$r_b = \min(l_{\text{bond}}, l_{\text{med}}) / 4$$</p>
<p>where $l_{\text{bond}}$ is the bond length and $l_{\text{med}}$ is the median bond length.</p>
<p>A Markov network is constructed with four types of nodes:</p>
<ul>
<li><strong>Atom nodes</strong>: Boolean existence variables for each atom candidate</li>
<li><strong>Bond nodes</strong>: Boolean existence variables for each bond candidate</li>
<li><strong>Atom merge nodes</strong>: Boolean variables for pairs of overlapping atom candidates</li>
<li><strong>Bond merge nodes</strong>: Boolean variables for pairs of nearby bond candidates</li>
</ul>
<p>Potential functions encode rules about when candidates should exist or merge, with merging likelihood between two bond-ending atom candidates defined as a piecewise function of center distance $d$:</p>
<p>$$P(a_1, a_2) = \begin{cases} 0.9, &amp; \text{if } d \leq Q \\ 0.7 - 0.4(d - Q)/(R - Q), &amp; \text{if } Q &lt; d \leq R \\ 0.1, &amp; \text{if } d &gt; R \end{cases}$$</p>
<p>where $Q = \max(r_1, r_2)$ and $R = \min(1.5Q, r_1 + r_2)$. MAP inference determines the final state of all candidates.</p>
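<p>For concreteness, the piecewise merging likelihood transcribes directly into code, with the breakpoints and values exactly as given above:</p>

```python
def merge_likelihood(d, r1, r2):
    """P(a1, a2): likelihood that two bond-ending atom candidates merge,
    as a piecewise function of their center distance d and radii r1, r2."""
    Q = max(r1, r2)
    R = min(1.5 * Q, r1 + r2)
    if d <= Q:
        return 0.9                              # strongly overlapping: merge
    if d <= R:
        return 0.7 - 0.4 * (d - Q) / (R - Q)    # linear fall-off
    return 0.1                                  # too far apart: keep separate
```

For two candidates of radius 4, the likelihood is 0.9 up to distance 4, falls linearly to 0.3 at distance 6, and drops to 0.1 beyond that.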
<h2 id="evaluation-results">Evaluation Results</h2>
<p>GraphReco is evaluated on USPTO benchmarks with InChI string comparison (stereochemistry removed):</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td><strong>94.2</strong></td>
          <td><strong>86.7</strong></td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>92.4</td>
          <td>70.3</td>
          <td>89.1</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>89.9</td>
          <td>63.0</td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>89.7</td>
          <td>63.9</td>
          <td>89.3</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>93.3</td>
          <td>82.8</td>
          <td><strong>91.5</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>35.4</td>
          <td>13.8</td>
          <td>25.2</td>
      </tr>
  </tbody>
</table>
<p>GraphReco outperforms all rule-based systems and most ML systems, with a particularly large margin on USPTO-10K-Abb (abbreviation-heavy molecules). MolGrapher achieves slightly higher accuracy on the USPTO dataset.</p>
<h3 id="robustness-on-perturbed-images">Robustness on Perturbed Images</h3>
<p>On USPTO-perturbed (rotation and shearing applied), rule-based methods degrade substantially:</p>
<table>
  <thead>
      <tr>
          <th>System</th>
          <th>USPTO-perturbed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolGrapher</td>
          <td><strong>86.7</strong></td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>42.3</td>
      </tr>
      <tr>
          <td><strong>GraphReco</strong></td>
          <td>40.6</td>
      </tr>
      <tr>
          <td>MolVec 0.9.7</td>
          <td>30.7</td>
      </tr>
      <tr>
          <td>OSRA 2.1</td>
          <td>6.4</td>
      </tr>
      <tr>
          <td>Imago 2.0</td>
          <td>5.1</td>
      </tr>
  </tbody>
</table>
<p>GraphReco performs better than other rule-based systems on perturbed inputs (40.6% vs. under 31%) thanks to its probabilistic assembly, but still falls far behind MolGrapher (86.7%), demonstrating the robustness advantage of learned approaches.</p>
<h2 id="ablation-study">Ablation Study</h2>
<p>Each component contributes substantially to overall performance on USPTO-10K:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>USPTO-10K</th>
          <th>USPTO-10K-Abb</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full system</td>
          <td>94.2</td>
          <td>86.7</td>
          <td>89.9</td>
      </tr>
      <tr>
          <td>Without FM line detection</td>
          <td>2.9</td>
          <td>5.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>Without atom candidates</td>
          <td>9.8</td>
          <td>0.4</td>
          <td>5.0</td>
      </tr>
      <tr>
          <td>Without bond candidates</td>
          <td>79.1</td>
          <td>75.8</td>
          <td>75.0</td>
      </tr>
      <tr>
          <td>Without Markov network</td>
          <td>88.2</td>
          <td>81.4</td>
          <td>84.2</td>
      </tr>
  </tbody>
</table>
<p>The FM algorithm and atom candidate mechanism are both critical (accuracy drops below 10% without either). Bond candidates provide a moderate improvement (~15 percentage points), and the Markov network adds ~6 points over hard-threshold alternatives.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Deterministic expert rules limit robustness on perturbed or noisy images, as evidenced by the large accuracy gap with MolGrapher on USPTO-perturbed</li>
<li>The system relies on Tesseract OCR for symbol recognition, which may struggle with unusual fonts or degraded image quality</li>
<li>Only handles single 2D molecule structures per image</li>
<li>Stereochemistry is removed during evaluation, so performance on stereo-bond recognition is not assessed</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>GraphReco is implemented in Python and relies on Tesseract OCR, OpenCV, and RDKit. The authors provided an online demo for testing, but the source code and a public repository have not been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Online Demo</td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Google Cloud Run deployment (no longer available)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components for full reproduction:</strong></p>
<ul>
<li>Source code is not publicly available</li>
<li>No pre-built package or installable library</li>
<li>Hyperparameters for Markov network potential functions are given in the paper (Equations 8-11), but full implementation details are not released</li>
</ul>
<p><strong>Hardware/compute requirements:</strong> Not specified in the paper. The system uses classical computer vision (Hough transforms, thinning) and probabilistic inference (Markov networks), so GPU hardware is likely not required.</p>
]]></content:encoded></item><item><title>D3PM: Discrete Denoising Diffusion Probabilistic Models</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/discrete-diffusion-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/discrete-diffusion-models/</guid><description>D3PMs extend diffusion models to discrete data with structured transition matrices, connecting diffusion to masked language models.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It extends denoising diffusion probabilistic models (DDPMs) from continuous to discrete state-spaces by introducing structured Markov transition matrices for the corruption process. The paper unifies several corruption strategies, draws a formal connection between absorbing-state diffusion and masked language models, and demonstrates competitive results on both image and text generation.</p>
<h2 id="diffusion-beyond-continuous-spaces">Diffusion Beyond Continuous Spaces</h2>
<p>Standard DDPMs operate in continuous state-spaces (e.g., pixel values treated as real numbers) and use Gaussian noise for corruption. Many important data types are inherently discrete: text (tokens from a vocabulary), quantized images (discrete pixel values), molecular structures, and segmentation maps. Prior work by Hoogeboom et al. extended binary diffusion to multinomial diffusion with uniform transition probabilities, but this limits the structure of the corruption process. D3PMs generalize this by allowing arbitrary transition matrices that encode domain-specific inductive biases.</p>
<h2 id="core-innovation-structured-transition-matrices">Core Innovation: Structured Transition Matrices</h2>
<p>D3PMs define a forward corruption process over discrete variables $\mathbf{x} \in \{1, \ldots, K\}^D$ using transition matrices $\mathbf{Q}_t \in \mathbb{R}^{K \times K}$:</p>
<p>$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \text{Cat}(\mathbf{x}_t; \mathbf{p} = \mathbf{x}_{t-1} \mathbf{Q}_t)$$</p>
<p>where $\mathbf{x}_{t-1}$ is a one-hot row vector. The cumulative transition after $t$ steps is $\overline{\mathbf{Q}}_t = \mathbf{Q}_1 \mathbf{Q}_2 \cdots \mathbf{Q}_t$, giving:</p>
<p>$$q(\mathbf{x}_t | \mathbf{x}_0) = \text{Cat}(\mathbf{x}_t; \mathbf{p} = \mathbf{x}_0 \overline{\mathbf{Q}}_t)$$</p>
<p>The paper explores several transition matrix designs:</p>
<p><strong>Uniform diffusion:</strong> $[\mathbf{Q}_t]_{ij} = (1 - \beta_t) \mathbf{1}_{i=j} + \beta_t / K$. Transitions with equal probability to any state. Stationary distribution is uniform.</p>
<p><strong>Absorbing state:</strong> $[\mathbf{Q}_t]_{ij} = (1-\beta_t)\mathbf{1}_{i=j} + \beta_t \mathbf{1}_{j=m}$. Each token transitions to a designated absorbing state $m$ (e.g., [MASK] for text, a gray pixel for images) with probability $\beta_t$ per step, and tokens already at the absorbing state remain there. This establishes a direct connection to masked language models like BERT.</p>
<p><strong>Discretized Gaussian:</strong> Transition probabilities decay as a function of the distance $|i-j|$ between states, mimicking Gaussian diffusion on ordinal data like pixel values.</p>
<p><strong>Embedding-based nearest neighbor:</strong> For text, transitions are weighted by proximity in a pretrained word embedding space, so corruption preferentially swaps words with semantically similar ones.</p>
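<p>The uniform and absorbing-state matrices can be written down explicitly. A small NumPy sketch of the definitions above (function names are our own):</p>

```python
import numpy as np

def uniform_Q(beta, K):
    """Uniform diffusion: stay with prob. (1 - beta) + beta/K, else jump anywhere."""
    return (1 - beta) * np.eye(K) + (beta / K) * np.ones((K, K))

def absorbing_Q(beta, K, m):
    """Absorbing-state diffusion: move to mask state m with prob. beta; m is a trap."""
    Q = (1 - beta) * np.eye(K)
    Q[:, m] += beta          # row m then sums to (1 - beta) + beta = 1
    return Q

def q_xt_given_x0(x0_onehot, Qs):
    """q(x_t | x_0) via the cumulative product Qbar_t = Q_1 Q_2 ... Q_t."""
    Qbar = np.linalg.multi_dot(Qs) if len(Qs) > 1 else Qs[0]
    return x0_onehot @ Qbar
```

Iterating the absorbing matrix drives all probability mass onto the mask state, which is exactly the stationary distribution the text describes.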
<p><strong>Training objective.</strong> The reverse process $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ is parameterized by predicting $\tilde{p}_\theta(\tilde{\mathbf{x}}_0 | \mathbf{x}_t)$ and computing the posterior:</p>
<p>$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \propto \sum_{\tilde{\mathbf{x}}_0} q(\mathbf{x}_{t-1} | \mathbf{x}_t, \tilde{\mathbf{x}}_0)\, \tilde{p}_\theta(\tilde{\mathbf{x}}_0 | \mathbf{x}_t)$$</p>
<p>The loss $L_\lambda$ combines the variational lower bound (VLB) with an auxiliary cross-entropy term:</p>
<p>$$L_\lambda = L_{\text{VLB}} + \lambda \, L_{\text{CE}}$$</p>
<p>where $L_{\text{CE}}$ is a reweighted cross-entropy loss on the $\mathbf{x}_0$ prediction that stabilizes training and improves sample quality. The VLB decomposes into per-timestep KL divergences between the true and predicted reverse transitions.</p>
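<p>The posterior in the sum above is itself categorical and follows from Bayes' rule; a sketch (variable names are ours, and the quantities assume precomputed transition matrices):</p>

```python
import numpy as np

def reverse_posterior(Qt, Qbar_prev, x0_onehot, xt):
    """q(x_{t-1} | x_t, x_0) ∝ q(x_t | x_{t-1}) q(x_{t-1} | x_0).

    Qt: (K, K) one-step matrix at time t; Qbar_prev: cumulative product
    Q_1 ... Q_{t-1}; x0_onehot: (K,); xt: integer state observed at time t.
    """
    unnorm = Qt[:, xt] * (x0_onehot @ Qbar_prev)   # elementwise over x_{t-1}
    return unnorm / unnorm.sum()
```

With a nearly noiseless uniform matrix ($\beta_t$ small) and $x_t = x_0$, the posterior concentrates on that same state, as expected.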
<h2 id="experiments-and-results">Experiments and Results</h2>
<p><strong>Image generation (CIFAR-10):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Loss</th>
          <th>IS</th>
          <th>FID</th>
          <th>NLL (bpd)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>D3PM uniform</td>
          <td>$L_{\text{VLB}}$</td>
          <td>5.99</td>
          <td>51.27</td>
          <td>5.08</td>
      </tr>
      <tr>
          <td>D3PM absorbing</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>6.78</td>
          <td>30.97</td>
          <td>4.40</td>
      </tr>
      <tr>
          <td>D3PM Gauss</td>
          <td>$L_{\text{VLB}}$</td>
          <td>7.75</td>
          <td>15.30</td>
          <td>3.97</td>
      </tr>
      <tr>
          <td>D3PM Gauss</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>8.54</td>
          <td>8.34</td>
          <td>3.98</td>
      </tr>
      <tr>
          <td>D3PM Gauss + logistic</td>
          <td>$L_\lambda$ ($\lambda{=}0.001$)</td>
          <td>8.56</td>
          <td>7.34</td>
          <td>3.44</td>
      </tr>
      <tr>
          <td>DDPM $L_{\text{simple}}$ (continuous)</td>
          <td>&ndash;</td>
          <td>9.46</td>
          <td>3.17</td>
          <td>3.75</td>
      </tr>
  </tbody>
</table>
<p>The best discrete D3PM variant is D3PM Gauss + logistic, which uses the combined $L_\lambda$ loss with a truncated logistic parameterization: the standard softmax output is replaced by a discretized logistic distribution over pixel values, which assigns probability mass to each discrete bin via a continuous logistic CDF and thus better captures the ordinal structure of pixel intensities. This variant exceeds the continuous DDPM in log-likelihood (3.44 vs. 3.75 bpd) while approaching its sample quality (FID 7.34 vs. 3.17).</p>
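<p>The discretization idea is the same one used in PixelCNN++-style output heads; the sketch below shows the per-bin mass computation under an assumed mean and scale, not the paper's exact network head:</p>

```python
import numpy as np

def discretized_logistic(mu, scale, K=256):
    """Probability mass for each of K ordinal bins under a logistic CDF.

    Each bin k receives sigma((k + 0.5 - mu)/scale) - sigma((k - 0.5 - mu)/scale),
    with the edge bins absorbing the remaining tails (the 'truncation').
    """
    def sigma(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))  # clip avoids overflow
    k = np.arange(K)
    upper = np.where(k == K - 1, 1e3, (k + 0.5 - mu) / scale)
    lower = np.where(k == 0, -1e3, (k - 0.5 - mu) / scale)
    return sigma(upper) - sigma(lower)
```

Because adjacent bins share their CDF boundary, the mass automatically decays smoothly with distance from the mean, respecting the ordering of pixel intensities in a way an unconstrained softmax does not.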
<p><strong>Text generation (text8, character-level, 1000 steps):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>bpc</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>D3PM absorbing ($L_\lambda$)</td>
          <td>1.45</td>
      </tr>
      <tr>
          <td>D3PM NN ($L_{\text{VLB}}$)</td>
          <td>1.59</td>
      </tr>
      <tr>
          <td>D3PM uniform</td>
          <td>1.61</td>
      </tr>
      <tr>
          <td>Discrete Flow (Tran et al.)</td>
          <td>1.23</td>
      </tr>
  </tbody>
</table>
<p>Among the D3PM variants and baselines evaluated, D3PM absorbing achieves the best bpc on text8 apart from Discrete Flow (Tran et al., 2019). On LM1B (sentencepiece vocabulary of 8192 tokens), D3PM absorbing achieves a perplexity of 76.9 at 1000 steps, compared to 137.9 for D3PM uniform and 43.6 for a comparable autoregressive transformer, demonstrating that discrete diffusion scales to large vocabularies.</p>
<p><strong>Ablation findings:</strong></p>
<ul>
<li>The combined loss $L_\lambda$ (VLB plus an auxiliary cross-entropy term) is critical: for D3PM Gauss, it improves FID from 15.30 ($L_{\text{VLB}}$) to 8.34 ($L_\lambda$, $\lambda{=}0.001$). Adding the truncated logistic parameterization further improves FID to 7.34.</li>
<li>Discretized Gaussian transitions outperform both uniform and absorbing-state transitions on CIFAR-10 across all metrics.</li>
<li>For text, the absorbing-state (mask) model outperforms uniform and nearest-neighbor models. Nearest-neighbor diffusion provides only marginal improvement over uniform, a surprising negative result.</li>
<li>The $\mathbf{x}_0$-parameterization ensures the learned reverse distribution has the correct sparsity pattern dictated by the transition matrix $\mathbf{Q}_t$.</li>
</ul>
<h2 id="findings-and-limitations">Findings and Limitations</h2>
<ul>
<li>The choice of transition matrix is an important design decision that encodes domain-specific inductive biases. Discretized Gaussian transitions work best for ordinal image data; absorbing-state transitions work best for text.</li>
<li>D3PMs formally unify diffusion models and masked language models: absorbing-state diffusion with a [MASK] token is equivalent to a reweighted BERT-style training objective.</li>
<li>The combined VLB + auxiliary loss ($L_\lambda$) achieves better density estimation (3.44 bpd) than continuous DDPMs (3.75 bpd) while producing competitive samples.</li>
<li>Sample quality (best FID 7.34 for D3PM Gauss + logistic) still lags behind continuous-space DDPMs (FID 3.17) on CIFAR-10, though the gap narrows with structured transitions and the auxiliary loss.</li>
<li>Scaling to very large numbers of categories $K$ requires special techniques (low-rank corruption or matrix exponentials) to manage the $O(K^2 T)$ memory cost of storing transition matrices.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Image generation</td>
          <td>CIFAR-10</td>
          <td>32x32, 256 categories</td>
          <td>Quantized to 256 ordinal values per channel</td>
      </tr>
      <tr>
          <td>Text generation</td>
          <td>text8</td>
          <td>Character-level</td>
          <td>27 character vocabulary, sequences of length 256</td>
      </tr>
      <tr>
          <td>Text generation</td>
          <td>LM1B</td>
          <td>Word-level</td>
          <td>Sentencepiece vocabulary of 8192 tokens, sequence length 128</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Noise schedules</strong>: Linear schedule for D3PM Gauss, cosine schedule for D3PM uniform, and a novel mutual information schedule for absorbing and nearest-neighbor models</li>
<li><strong>Reverse parameterization</strong>: $\mathbf{x}_0$-parameterization with posterior computation via Bayes&rsquo; rule</li>
<li><strong>Loss</strong>: $L_{\text{VLB}} + \lambda L_{\text{CE}}$ with $\lambda = 0.001$ for images and $\lambda = 0.01$ for text absorbing models</li>
<li><strong>Scaling</strong>: Low-rank corruption (absorbing, uniform) scales as $O(r^2 T)$; matrix exponentials for nearest-neighbor transitions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Image models</strong>: Modified U-Net architecture from Ho et al. (2020) adapted for categorical output via softmax over $K$ classes</li>
<li><strong>Text models</strong>: 12-layer <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style transformer encoder with 70M parameters (12 heads, MLP dim 3072, QKV dim 768)</li>
<li><strong>Timesteps</strong>: $T = 1000$ for both images and text, though text models can be evaluated with fewer steps (e.g., 256 or 20)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Best D3PM</th>
          <th>Continuous DDPM</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CIFAR-10</td>
          <td>7.34 (Gauss + logistic)</td>
          <td>3.17</td>
      </tr>
      <tr>
          <td>NLL (bpd)</td>
          <td>CIFAR-10</td>
          <td>3.44 (Gauss + logistic)</td>
          <td>3.75</td>
      </tr>
      <tr>
          <td>BPC</td>
          <td>text8 (char)</td>
          <td>1.45 (absorbing, $L_\lambda$)</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Perplexity</td>
          <td>LM1B</td>
          <td>76.9 (absorbing)</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained for 1M steps with batch size 512 on TPUv2 or TPUv3</li>
<li>Text models: 12-layer transformer encoder (T5 architecture), 70M parameters</li>
<li>Image models: Modified U-Net architecture from Ho et al. (2020)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/google-research/tree/master/d3pm">google-research/d3pm</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official JAX/Flax implementation for image and text experiments</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Austin, J., Johnson, D. D., Ho, J., Tarlow, D., &amp; van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces. <em>NeurIPS 2021</em>. <a href="https://arxiv.org/abs/2107.03006">https://arxiv.org/abs/2107.03006</a></p>
<p><strong>Publication</strong>: NeurIPS 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{austin2021structured,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Structured Denoising Diffusion Models in Discrete State-Spaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Austin, Jacob and Johnson, Daniel D. and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>    = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2021}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>Consistency Models: Fast One-Step Diffusion Generation</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/consistency-models/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/consistency-models/</guid><description>Consistency models enable one-step generation by learning to map any point on a diffusion ODE trajectory to its origin, achieving FID 3.55 on CIFAR-10.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It proposes consistency models, a new class of generative models designed for fast one-step (or few-step) generation. The models can be trained either by distilling pretrained diffusion models (consistency distillation) or as standalone generative models from scratch (consistency training). The paper provides theoretical analysis of both training modes and achieves FID 3.55 on CIFAR-10 for single-step non-adversarial generation (state of the art at the time of publication).</p>
<h2 id="the-slow-sampling-problem-in-diffusion">The Slow Sampling Problem in Diffusion</h2>
<p>Diffusion models produce high-quality samples but require iterating through many denoising steps (often tens to hundreds), making generation slow compared to GANs or VAEs. Previous approaches to speed up sampling include faster ODE/SDE solvers (DDIM, DPM-Solver) and progressive distillation. These either still require multiple steps or depend on a complex multi-stage distillation pipeline. The goal is a model that can generate high-quality samples in a single forward pass while optionally allowing more steps for better quality.</p>
<h2 id="core-innovation-the-self-consistency-property">Core Innovation: The Self-Consistency Property</h2>
<p>The key idea builds on the Probability Flow (PF) ODE from the score-based SDE framework. The PF ODE describes a deterministic trajectory that converts noise into data, governed by the learned score function. For the VE-SDE parameterization used by EDM (Karras et al., 2022), this takes the form:</p>
<p>$$\frac{d\mathbf{x}_t}{dt} = -t \, s_\phi(\mathbf{x}_t, t)$$</p>
<p>where $s_\phi$ is a pretrained score model. A <strong>consistency function</strong> $f(\mathbf{x}_t, t)$ maps any point on an ODE trajectory to the trajectory&rsquo;s origin $\mathbf{x}_\epsilon$. The defining property is self-consistency:</p>
<p>$$f(\mathbf{x}_t, t) = f(\mathbf{x}_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T]$$</p>
<p>for any points $\mathbf{x}_t$ and $\mathbf{x}_{t'}$ on the same PF ODE trajectory.</p>
<p><strong>Parameterization.</strong> The model enforces the boundary condition $f(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon$ using skip connections:</p>
<p>$$f_\theta(\mathbf{x}, t) = c_{\text{skip}}(t)\, \mathbf{x} + c_{\text{out}}(t)\, F_\theta(\mathbf{x}, t)$$</p>
<p>where $c_{\text{skip}}(\epsilon) = 1$ and $c_{\text{out}}(\epsilon) = 0$, ensuring the boundary condition is satisfied by construction.</p>
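<p>A minimal sketch of this boundary-respecting parameterization, assuming EDM-style coefficient forms with $\sigma_{\text{data}} = 0.5$ and $\epsilon = 0.002$ (both values are assumptions here, not taken from the paper):</p>

```python
import numpy as np

SIGMA_DATA = 0.5   # EDM data-std constant (assumed value)
EPS = 0.002        # smallest timestep epsilon (assumed value)

def c_skip(t):
    """Equals 1 at t = EPS, decays toward 0 as t grows."""
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    """Equals 0 at t = EPS, so the network cannot move f at the boundary."""
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)

def consistency_fn(F, x, t):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t)."""
    return c_skip(t) * x + c_out(t) * F(x, t)

F = lambda x, t: np.tanh(x)      # stand-in for the trained network F_theta
x = np.array([0.3, -1.2, 2.0])
f_eps = consistency_fn(F, x, EPS)
```

<p>At $t = \epsilon$ the output equals $\mathbf{x}$ exactly for any network $F_\theta$, which is the point of the construction.</p>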
<p><strong>Consistency Distillation (CD).</strong> Given a pretrained diffusion model, CD trains a consistency model by enforcing self-consistency between adjacent timesteps:</p>
<p>$$\mathcal{L}_{\text{CD}}^N(\theta, \theta^-; \phi) = \mathbb{E}\left[\lambda(t_n)\, d\!\left(f_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\hat{\mathbf{x}}_{t_n}^\phi, t_n)\right)\right]$$</p>
<p>where $\hat{\mathbf{x}}_{t_n}^\phi$ is obtained by running one step of the ODE solver using the pretrained score model, $\theta^-$ is an exponential moving average (EMA) of $\theta$, and $d(\cdot, \cdot)$ is a distance metric. The use of a target network $\theta^-$ (updated via EMA) parallels techniques from deep Q-learning and momentum contrastive learning.</p>
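<p>A single-sample sketch of one CD loss evaluation, with squared $\ell_2$ standing in for LPIPS and a closed-form toy score that is exact for standard-normal data under $\mathbf{x}_t = \mathbf{x} + t\mathbf{z}$ (with the origin taken at $t = 0$ for simplicity); everything here is illustrative, not the paper&rsquo;s implementation:</p>

```python
import numpy as np

def euler_ode_step(x, t_next, t, score):
    """One Euler step of the empirical PF ODE dx/dt = -t * s_phi(x, t),
    integrating from t down to t_next < t."""
    return x + (t_next - t) * (-t * score(x, t))

def cd_loss(f_student, f_target, score, x0, z, t, t_next):
    """Consistency distillation loss for one sample: compare the student
    at (x_t, t) to the EMA target at the solver-estimated (x_hat, t_next)."""
    x_t = x0 + t * z
    x_hat = euler_ode_step(x_t, t_next, t, score)
    return float(np.sum((f_student(x_t, t) - f_target(x_hat, t_next)) ** 2))

# Toy: data ~ N(0, I), so the marginal score at level t is -x / (1 + t^2)
# and the exact consistency function is x / sqrt(1 + t^2).
score = lambda x, t: -x / (1.0 + t**2)
f_true = lambda x, t: x / np.sqrt(1.0 + t**2)
ident = lambda x, t: x
rng = np.random.default_rng(0)
x0, z = rng.normal(size=3), rng.normal(size=3)
loss_true = cd_loss(f_true, f_true, score, x0, z, 1.0, 0.8)
```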
<p><strong>Consistency Training (CT).</strong> CT eliminates the need for a pretrained diffusion model. It replaces the ODE solver step with a score estimate derived from the denoising score matching identity:</p>
<p>$$\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = \mathbb{E}\left[\frac{\mathbf{x} - \mathbf{x}_t}{t^2} \,\middle|\, \mathbf{x}_t\right]$$</p>
<p>Because this identity lets us estimate the score from noisy data alone (without a pretrained model), we can compute the ODE update directly from training samples. This allows training directly on data pairs $(\mathbf{x}, \mathbf{x} + t\mathbf{z})$ where $\mathbf{z} \sim \mathcal{N}(0, I)$.</p>
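<p>The resulting CT objective is just two noisy views of the same clean sample sharing one Gaussian draw $\mathbf{z}$, sketched below with squared $\ell_2$ standing in for the paper&rsquo;s LPIPS distance:</p>

```python
import numpy as np

def ct_loss(f_student, f_target, x0, z, t, t_next):
    """Consistency training loss: both points x0 + t*z and x0 + t_next*z
    use the SAME Gaussian draw z, so no pretrained score model or ODE
    solver is needed."""
    a = f_student(x0 + t * z, t)
    b = f_target(x0 + t_next * z, t_next)
    return float(np.sum((a - b) ** 2))

rng = np.random.default_rng(1)
x0, z = rng.normal(size=3), rng.normal(size=3)
f = lambda x, t: x / np.sqrt(1.0 + t**2)   # exact map for N(0, I) toy data
loss_equal = ct_loss(f, f, x0, z, 0.7, 0.7)
```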
<p><strong>Theoretical guarantee.</strong> If CD achieves zero loss, the consistency model error is bounded by $O((\Delta t)^p)$ where $\Delta t$ is the maximum timestep gap and $p$ is the order of the ODE solver.</p>
<h2 id="experiments-and-benchmarks">Experiments and Benchmarks</h2>
<p><strong>Datasets:</strong> CIFAR-10 (32x32), ImageNet 64x64, LSUN Bedroom 256x256, LSUN Cat 256x256.</p>
<p><strong>Architecture:</strong> All models use the NCSN++/EDM architecture. CD distills from pretrained EDM models.</p>
<p><strong>Key results for consistency distillation (CD):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Steps</th>
          <th>FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CIFAR-10</td>
          <td>1</td>
          <td>3.55</td>
      </tr>
      <tr>
          <td>CIFAR-10</td>
          <td>2</td>
          <td>2.93</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>1</td>
          <td>6.20</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>2</td>
          <td>4.70</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>1</td>
          <td>7.80</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>2</td>
          <td>5.22</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>1</td>
          <td>11.0</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>2</td>
          <td>8.84</td>
      </tr>
  </tbody>
</table>
<p>CD outperforms progressive distillation (PD) across all datasets and sampling steps, with the exception of single-step generation on Bedroom 256x256 where CD with $\ell_2$ slightly underperforms PD with $\ell_2$.</p>
<p><strong>Key results for consistency training (CT):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Steps</th>
          <th>FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CIFAR-10</td>
          <td>1</td>
          <td>8.70</td>
      </tr>
      <tr>
          <td>CIFAR-10</td>
          <td>2</td>
          <td>5.83</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>1</td>
          <td>13.0</td>
      </tr>
      <tr>
          <td>ImageNet 64x64</td>
          <td>2</td>
          <td>11.1</td>
      </tr>
      <tr>
          <td>LSUN Bedroom 256</td>
          <td>1</td>
          <td>16.0</td>
      </tr>
      <tr>
          <td>LSUN Cat 256</td>
          <td>1</td>
          <td>20.7</td>
      </tr>
  </tbody>
</table>
<p>CT outperforms existing single-step non-adversarial models (VAEs, normalizing flows), e.g., improving over DC-VAE&rsquo;s FID of 17.90 on CIFAR-10. Samples from CT share structural similarity with EDM samples from the same initial noise, suggesting CT does not suffer from mode collapse.</p>
<p><strong>Zero-shot editing:</strong> Consistency models support colorization, super-resolution, inpainting, stroke-guided generation, interpolation, and denoising at test time without task-specific training, by modifying the multi-step sampling algorithm.</p>
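<p>The multi-step sampler behind these edits alternates one-step denoising with partial re-noising to a lower noise level; the editing variants constrain $\mathbf{x}$ between steps. A sketch with the constraint logic omitted (the decreasing schedule <code>taus</code> is illustrative):</p>

```python
import numpy as np

def multistep_sample(f, x_T, taus, eps=0.002, seed=0):
    """Multistep consistency sampling: one-step denoise from the highest
    level, then for each lower level tau, re-noise to tau and denoise."""
    rng = np.random.default_rng(seed)
    x = f(x_T, taus[0])                      # first jump from pure noise
    for tau in taus[1:]:
        z = rng.normal(size=np.shape(x))
        x_tau = x + np.sqrt(tau**2 - eps**2) * z   # re-noise to level tau
        x = f(x_tau, tau)                          # denoise again
    return x

f_demo = lambda x, t: np.zeros_like(x)   # stand-in consistency model
sample = multistep_sample(f_demo, np.ones(4), [80.0, 10.0, 1.0])
```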
<h2 id="findings-and-limitations">Findings and Limitations</h2>
<ul>
<li>Consistency distillation achieves state-of-the-art FID for one-step generation (3.55 on CIFAR-10, 6.20 on ImageNet 64x64).</li>
<li>Multi-step sampling provides a smooth quality-compute tradeoff: more steps yield better FID.</li>
<li>CT produces competitive results without any pretrained diffusion model, making consistency models a standalone generative model family.</li>
<li>The LPIPS distance metric $d(\cdot, \cdot)$ generally outperforms $\ell_1$ and $\ell_2$ for training consistency models.</li>
<li>At higher resolutions (LSUN 256x256), the gap between CD/CT and full EDM sampling widens.</li>
<li>CT currently underperforms CD, suggesting room for improvement in the standalone training paradigm.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Primary benchmark</td>
          <td>CIFAR-10</td>
          <td>32x32, 50K train</td>
          <td>FID on 50K samples</td>
      </tr>
      <tr>
          <td>Scaling benchmark</td>
          <td>ImageNet 64x64</td>
          <td>64x64, 1.28M</td>
          <td>Unconditional generation</td>
      </tr>
      <tr>
          <td>High-res benchmark</td>
          <td>LSUN Bedroom, Cat</td>
          <td>256x256</td>
          <td>Unconditional generation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>ODE solver for CD</strong>: Euler and Heun (2nd order) solvers on the empirical PF ODE</li>
<li><strong>EMA for target network</strong>: Decay rate $\mu$ scheduled as a function of training step</li>
<li><strong>Schedule functions</strong>: $N$ (number of discretization steps) and $\mu$ (EMA rate) increase over training following specific schedules (see Appendix C of the paper)</li>
<li><strong>Distance metric</strong>: LPIPS performs best; $\ell_2$ and $\ell_1$ also evaluated</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: NCSN++/EDM architecture from Karras et al. (2022)</li>
<li><strong>CD teacher</strong>: Pretrained EDM models</li>
<li><strong>Parameterization</strong>: Skip-connection formulation with $c_{\text{skip}}(t)$ and $c_{\text{out}}(t)$ from EDM</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>CD 1-step</th>
          <th>CT 1-step</th>
          <th>EDM (full)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FID</td>
          <td>CIFAR-10</td>
          <td>3.55</td>
          <td>8.70</td>
          <td>2.04</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>ImageNet 64</td>
          <td>6.20</td>
          <td>13.0</td>
          <td>2.44</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>LSUN Bedroom</td>
          <td>7.80</td>
          <td>16.0</td>
          <td>3.57</td>
      </tr>
      <tr>
          <td>FID</td>
          <td>LSUN Cat</td>
          <td>11.0</td>
          <td>20.7</td>
          <td>6.69</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training details follow EDM conventions</li>
<li>CD and CT use the same batch sizes and learning rate schedules as EDM training</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/openai/consistency_models">openai/consistency_models</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained checkpoints</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, Y., Dhariwal, P., Chen, M., &amp; Sutskever, I. (2023). Consistency Models. <em>ICML 2023</em>. <a href="https://arxiv.org/abs/2303.01469">https://arxiv.org/abs/2303.01469</a></p>
<p><strong>Publication</strong>: ICML 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{song2023consistency,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Consistency Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Song, Yang and Dhariwal, Prafulla and Chen, Mark and Sutskever, Ilya}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>    = <span style="color:#e6db74">{202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>       = <span style="color:#e6db74">{https://arxiv.org/abs/2303.01469}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/openai/consistency_models">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">Score-Based Generative Modeling with SDEs</a></li>
</ul>
]]></content:encoded></item><item><title>AdaptMol: Domain Adaptation for Molecular OCSR (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</guid><description>AdaptMol is an image-to-graph OCSR model using MMD-based domain adaptation and self-training for hand-drawn molecule recognition.</description><content:encoded><![CDATA[<h2 id="bridging-the-synthetic-to-real-gap-in-graph-based-ocsr">Bridging the Synthetic-to-Real Gap in Graph-Based OCSR</h2>
<p>Most OCSR methods are trained on synthetic molecular images and evaluated on high-quality literature figures, both exhibiting relatively uniform styles. Hand-drawn molecules represent a particularly challenging domain with irregular bond lengths, variable stroke widths, and inconsistent atom symbols. Prior graph reconstruction methods like MolScribe and MolGrapher drop below 15% accuracy on hand-drawn images, despite achieving over 65% on literature datasets.</p>
<p>AdaptMol addresses this with a three-stage pipeline that enables effective transfer from synthetic to real-world data without requiring graph annotations in the target domain:</p>
<ol>
<li><strong>Base model training</strong> on synthetic data with comprehensive augmentation and dual position representation</li>
<li><strong>MMD alignment</strong> of bond-level features between source and target domains</li>
<li><strong>Self-training</strong> with SMILES-validated pseudo-labels on unlabeled target images</li>
</ol>
<h2 id="end-to-end-graph-reconstruction-architecture">End-to-End Graph Reconstruction Architecture</h2>
<p>AdaptMol builds on MolScribe&rsquo;s architecture, using a Swin Transformer base encoder ($384 \times 384$ input) with a 6-layer Transformer decoder (8 heads, hidden dim 256). The model jointly predicts atoms and bonds:</p>
<p><strong>Atom prediction</strong> follows the Pix2Seq approach, autoregressively generating a sequence of atom tokens:</p>
<p>$$S_N = [l_1, x_1, y_1, l_2, x_2, y_2, \dots, l_n, x_n, y_n]$$</p>
<p>where $l_i$ is the atom label and $(x_i, y_i)$ are discretized coordinate bin indices.</p>
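<p>A rough sketch of this flattening (the bin count and tuple layout are illustrative, not AdaptMol&rsquo;s exact tokenizer):</p>

```python
def atoms_to_sequence(atoms, bins=64):
    """Flatten atoms into a Pix2Seq-style token sequence
    [l_1, x_1, y_1, l_2, x_2, y_2, ...], with coordinates in [0, 1)
    discretized into `bins` index bins."""
    seq = []
    for label, x, y in atoms:
        seq += [label, min(int(x * bins), bins - 1), min(int(y * bins), bins - 1)]
    return seq

seq = atoms_to_sequence([("C", 0.10, 0.50), ("O", 0.90, 0.25)], bins=64)
```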
<p><strong>Dual position representation</strong> adds a 2D spatial heatmap on top of token-based coordinate prediction. The heatmap aggregates joint spatial distributions of all atoms:</p>
<p>$$\mathbf{H} = \text{Upsample}\left(\sum_{i=1}^{n} P_y^{(i)} \otimes P_x^{(i)}\right)$$</p>
<p>where $P_x^{(i)}$ and $P_y^{(i)}$ are coordinate probability distributions from the softmax logits. During training, this heatmap is supervised with Gaussian kernels at ground-truth atom positions. This reduces false positive atom predictions substantially (from 356 to 33 false positives at IoU 0.05).</p>
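<p>A sketch of the heatmap aggregation before upsampling, assuming each row of $P_x$ and $P_y$ is already a softmax distribution over coordinate bins:</p>

```python
import numpy as np

def position_heatmap(P_x: np.ndarray, P_y: np.ndarray) -> np.ndarray:
    """Sum of per-atom outer products P_y^(i) (x) P_x^(i).
    P_x, P_y have shape (n_atoms, bins); output is (bins, bins)."""
    return np.einsum('iy,ix->yx', P_y, P_x)

# Two atoms with delta coordinate distributions
P_x = np.zeros((2, 8)); P_y = np.zeros((2, 8))
P_x[0, 2] = P_y[0, 5] = 1.0    # atom 0 at (x-bin 2, y-bin 5)
P_x[1, 6] = P_y[1, 1] = 1.0    # atom 1 at (x-bin 6, y-bin 1)
H = position_heatmap(P_x, P_y)
```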
<p><strong>Bond prediction</strong> extracts atom-level features from decoder hidden states and enriches them with encoder visual features via multi-head attention with a learnable residual weight $\alpha$:</p>
<p>$$\mathbf{F}_{\text{enriched}} = \text{LayerNorm}(\mathbf{F}_{\text{atom}} + \alpha \cdot \text{MHA}(\mathbf{F}_{\text{atom}}, \mathbf{E}_{\text{vis}}))$$</p>
<p>A feed-forward network then predicts bond types between all atom pairs.</p>
<h2 id="bond-level-domain-adaptation-via-mmd">Bond-Level Domain Adaptation via MMD</h2>
<p>The key insight is that bond features are domain-invariant: they encode structural relationships (single, double, triple, aromatic) independent of visual style. Atom-level alignment is problematic due to class imbalance (carbon dominates), multi-token spanning (functional groups), and position-dependent features.</p>
<p>AdaptMol aligns bond-level feature distributions via class-conditional Maximum Mean Discrepancy:</p>
<p>$$L_{\text{MMD}} = \frac{1}{|\mathcal{C}'|} \sum_{c \in \mathcal{C}'} \text{MMD}(F_c^{\text{src}}, F_c^{\text{tgt}})$$</p>
<p>where $\mathcal{C}'$ contains classes with sufficient samples in both domains. Confidence-based filtering retains only high-confidence predictions (confidence &gt; 0.95, entropy &lt; 0.1) for alignment, tightening to 0.98 and 0.05 after the first epoch. Progressive loss weighting follows a schedule of 0.1 (epoch 0), 0.075 (epoch 1), and 0.05 thereafter.</p>
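<p>A dependency-light sketch of the class-conditional MMD with a single RBF kernel (the paper&rsquo;s kernel choice and bandwidth are not reproduced here; this is a biased estimator for illustration):</p>

```python
import numpy as np

def mmd2(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased squared MMD with RBF kernel k(a, b) = exp(-gamma ||a - b||^2)."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())

def class_conditional_mmd(src: dict, tgt: dict, gamma: float = 1.0) -> float:
    """Average MMD over the bond classes C' present in both domains."""
    shared = sorted(set(src) & set(tgt))
    return float(np.mean([mmd2(src[c], tgt[c], gamma) for c in shared]))

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))
loss_same = mmd2(X, X)            # identical feature sets
loss_shifted = mmd2(X, X + 3.0)   # domain-shifted feature sets
```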
<p>An important side effect: MMD alignment improves inter-class bond discrimination, reducing confusion between visually similar bond types (e.g., jagged double bonds vs. aromatic bonds).</p>
<h2 id="self-training-with-smiles-validation">Self-Training with SMILES Validation</h2>
<p>After MMD alignment, the model generates predictions on unlabeled target images. Predicted molecular graphs are converted to SMILES and validated against ground-truth SMILES annotations. Only exact matches are retained as pseudo-labels, providing complete graph supervision (atom coordinates, element types, bond types) that was previously unavailable in the target domain.</p>
<p>This approach is far more data-efficient than alternatives: AdaptMol uses only 4,080 real hand-drawn images vs. DECIMER-Handdraw&rsquo;s 38 million synthetic hand-drawn images.</p>
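<p>A sketch of the pseudo-label filter; in practice canonicalization would round-trip through RDKit (e.g. <code>Chem.MolToSmiles(Chem.MolFromSmiles(s))</code>), stubbed out below as a pluggable hook so the sketch stays dependency-free:</p>

```python
def filter_pseudo_labels(predictions, references, canonicalize=lambda s: s):
    """Keep a predicted graph only when its SMILES exactly matches the
    ground-truth SMILES after canonicalization; matches then supply full
    graph supervision (coordinates, elements, bonds) for self-training."""
    return [graph
            for (graph, smiles), ref in zip(predictions, references)
            if canonicalize(smiles) == canonicalize(ref)]

preds = [({"atoms": 2}, "CCO"), ({"atoms": 3}, "CCN")]
refs = ["CCO", "CCC"]
kept = filter_pseudo_labels(preds, refs)
```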
<h2 id="comprehensive-data-augmentation">Comprehensive Data Augmentation</h2>
<p>Two categories of augmentation are applied during synthetic data generation:</p>
<ul>
<li><strong>Structure-rendering augmentation</strong>: Functional group abbreviation substitution, bond type conversions (single to wavy/aromatic, Kekule to aromatic rings), R-group insertion, and rendering parameter randomization (font family/size, bond width/spacing)</li>
<li><strong>Image-level augmentation</strong>: Geometric operations, quality degradation, layout variations, and chemical document artifacts (caption injection, arrows, marginal annotations)</li>
</ul>
<p>Structure-rendering augmentation provides the larger benefit, contributing ~20% accuracy improvement on JPO and ~30% on ACS benchmarks.</p>
<h2 id="results">Results</h2>
<h3 id="hand-drawn-molecule-recognition">Hand-Drawn Molecule Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>DECIMER test (Acc)</th>
          <th>ChemPix (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>AdaptMol</strong></td>
          <td><strong>82.6</strong></td>
          <td><strong>60.5</strong></td>
      </tr>
      <tr>
          <td>DECIMER v2.2</td>
          <td>71.9</td>
          <td>51.4</td>
      </tr>
      <tr>
          <td>AtomLenz</td>
          <td>30.0</td>
          <td>48.4</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>10.1</td>
          <td>26.1</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>10.7</td>
          <td>14.5</td>
      </tr>
  </tbody>
</table>
<h3 id="literature-and-synthetic-benchmarks">Literature and Synthetic Benchmarks</h3>
<p>AdaptMol achieves state-of-the-art on 4 of 6 literature benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AdaptMol</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>DECIMER v2.2</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CLEF</td>
          <td><strong>92.7</strong></td>
          <td>87.5</td>
          <td>57.2</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td><strong>88.2</strong></td>
          <td>78.8</td>
          <td>73.0</td>
          <td>75.7</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td><strong>89.3</strong></td>
          <td>88.2</td>
          <td>85.1</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td><strong>75.5</strong></td>
          <td>72.8</td>
          <td>41.0</td>
          <td>37.7</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>90.9</td>
          <td><strong>92.6</strong></td>
          <td>74.9</td>
          <td>59.6</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>84.0</td>
          <td><strong>84.4</strong></td>
          <td>0.0</td>
          <td>66.3</td>
      </tr>
  </tbody>
</table>
<p>MolScribe edges out AdaptMol on USPTO and Staker. The authors attribute this to MolScribe training directly on all 680K USPTO samples, which may cause it to specialize to that distribution.</p>
<h3 id="pipeline-ablation">Pipeline Ablation</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Hand-drawn</th>
          <th>ChemDraw</th>
          <th>JPO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base model</td>
          <td>10.4</td>
          <td>92.3</td>
          <td>82.7</td>
      </tr>
      <tr>
          <td>+ Font augmentation</td>
          <td>30.2</td>
          <td>92.5</td>
          <td>82.8</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD</td>
          <td>42.1</td>
          <td>94.0</td>
          <td>83.0</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD + Self-training</td>
          <td><strong>82.6</strong></td>
          <td><strong>95.9</strong></td>
          <td><strong>88.2</strong></td>
      </tr>
  </tbody>
</table>
<p>Each component contributes meaningfully: font augmentation (+19.8), MMD alignment (+11.9), and self-training (+40.5) on hand-drawn accuracy.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fffh1/AdaptMol">AdaptMol Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">Model + Data</a></td>
          <td>Model/Dataset</td>
          <td>MIT</td>
          <td>Pretrained checkpoint and datasets</td>
      </tr>
  </tbody>
</table>
<p>Training uses 2 NVIDIA A100 GPUs (40GB each). Base model trains for 30 epochs on 1M synthetic samples. Domain adaptation involves 3 steps: USPTO self-training (3 iterations of 3 epochs), MMD alignment on hand-drawn data (5 epochs), and hand-drawn self-training (5 iterations).</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Sequence length constraints prevent accurate prediction of very large molecules (&gt;120 atoms), where resizing causes significant information loss</li>
<li>Cannot recognize Markush structures with repeating unit notation (parentheses/brackets), as synthetic training data lacks such cases</li>
<li>Stereochemistry information is lost when stereo bonds connect to abbreviated functional groups due to RDKit post-processing limitations</li>
<li>The retrained baseline (30 epochs from scratch on synthetic + pseudo-labels) achieves higher hand-drawn accuracy (87.2%) but at the cost of cross-domain robustness on literature benchmarks</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, F., He, E., &amp; Verspoor, K. (2026). AdaptMol: Domain Adaptation for Molecular Image Recognition with Limited Supervision. <em>Research Square preprint</em>. <a href="https://doi.org/10.21203/rs.3.rs-8365561/v1">https://doi.org/10.21203/rs.3.rs-8365561/v1</a></p>
<p><strong>Publication</strong>: Research Square preprint, February 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/fffh1/AdaptMol">GitHub</a></li>
<li><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">HuggingFace (model + data)</a></li>
</ul>
]]></content:encoded></item><item><title>Spherical CNNs: Rotation-Equivariant Networks on the Sphere</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/spherical-cnns/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/spherical-cnns/</guid><description>Cohen et al. introduce rotation-equivariant spherical CNNs that define cross-correlation on SO(3), computed via generalized FFT from harmonic analysis.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces the theory and implementation of convolutional neural networks on the sphere. The key contribution is defining spherical cross-correlation that is SO(3)-equivariant and can be computed efficiently using generalized Fast Fourier Transforms from non-commutative harmonic analysis.</p>
<h2 id="why-planar-convolutions-fail-on-spherical-data">Why planar convolutions fail on spherical data</h2>
<p>Many problems require analyzing spherical signals: omnidirectional vision for robots and autonomous vehicles, molecular regression, and global weather modeling. A naive approach of projecting spherical data to a plane introduces space-varying distortions that break translational weight sharing. Rotating a spherical signal cannot be emulated by translating its planar projection.</p>
<p>The fundamental issue is geometric: patterns on a plane move via translations, but patterns on a sphere move via 3D rotations. A spherical CNN should detect patterns regardless of how they are rotated over the sphere. The relevant symmetry group is SO(3) (the group of all 3D rotations).</p>
<h2 id="spherical-cross-correlation-and-the-so3-output-space">Spherical cross-correlation and the SO(3) output space</h2>
<p>The paper defines spherical cross-correlation by replacing filter translations with rotations. For spherical signals $f$ on $S^2$ (the unit sphere) and filter $\psi$, the correlation is:</p>
<p>$$\lbrack\psi \star f\rbrack(R) = \langle L_R \psi, f \rangle = \int_{S^2} \sum_{k=1}^{K} \psi_k(R^{-1}x) f_k(x) \, dx$$</p>
<p>where $L_R$ is the rotation operator $\lbrack L_R f\rbrack(x) = f(R^{-1}x)$.</p>
<p>A crucial subtlety: whereas the space of moves for the plane (2D translations) is isomorphic to the plane itself, the space of moves for the sphere (3D rotations) is SO(3), a different three-dimensional manifold. The output of a spherical correlation is therefore a function on SO(3), not on $S^2$. This means subsequent layers must use SO(3) correlation:</p>
<p>$$\lbrack\psi \star f\rbrack(R) = \int_{\text{SO}(3)} \sum_{k=1}^{K} \psi_k(R^{-1}Q) f_k(Q) \, dQ$$</p>
<h3 id="equivariance-proof">Equivariance proof</h3>
<p>Equivariance follows from the unitarity of $L_R$ in a single line:</p>
<p>$$\lbrack\psi \star \lbrack L_Q f\rbrack\rbrack(R) = \langle L_R \psi, L_Q f \rangle = \langle L_{Q^{-1}R} \psi, f \rangle = \lbrack\psi \star f\rbrack(Q^{-1}R) = \lbrack L_Q\lbrack\psi \star f\rbrack\rbrack(R)$$</p>
<p>This holds for both $S^2$ and SO(3) correlation.</p>
<h2 id="efficient-computation-via-generalized-fft">Efficient computation via generalized FFT</h2>
<p>A naive SO(3) correlation is $O(n^6)$. The paper addresses this using the generalized Fourier transform (GFT) from non-commutative harmonic analysis.</p>
<p>The GFT projects functions onto orthogonal basis functions: spherical harmonics $Y_m^l(x)$ for $S^2$, and Wigner D-functions $D_{mn}^l(R)$ for SO(3). Both satisfy generalized Fourier theorems:</p>
<ul>
<li><strong>SO(3) convolution theorem</strong>: $\widehat{\psi \star f}^l = \hat{f}^l \cdot \hat{\psi}^{l\dagger}$ (blockwise matrix multiplication of SO(3) Fourier coefficients)</li>
<li><strong>$S^2$ convolution theorem</strong>: $\widehat{\psi \star f}^l = \hat{f}^l \cdot \hat{\psi}^{l\dagger}$ (outer product of $S^2$ Fourier coefficient vectors)</li>
</ul>
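<p>The block shapes make the distinction concrete. An illustrative sketch with random real coefficients (the actual transforms are complex-valued; this only shows the block structure, not a real Fourier transform):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
L = 4  # bandwidth: degrees l = 0, ..., L-1

# S^2 signals carry one coefficient vector of length 2l+1 per degree l
f_hat = [rng.normal(size=2 * l + 1) for l in range(L)]
p_hat = [rng.normal(size=2 * l + 1) for l in range(L)]

# S^2 theorem: the correlation lives on SO(3), so each degree-l block of
# its Fourier transform is the OUTER product of the coefficient vectors
s2_blocks = [np.outer(fl, pl) for fl, pl in zip(f_hat, p_hat)]
assert [blk.shape for blk in s2_blocks] == [(2 * l + 1, 2 * l + 1) for l in range(L)]

# SO(3) theorem: inputs already carry (2l+1)x(2l+1) matrix blocks, and the
# blockwise product is an ordinary matrix multiplication
F = [rng.normal(size=(2 * l + 1, 2 * l + 1)) for l in range(L)]
P = [rng.normal(size=(2 * l + 1, 2 * l + 1)) for l in range(L)]
so3_blocks = [Fl @ Pl.T for Fl, Pl in zip(F, P)]
assert [blk.shape for blk in so3_blocks] == [(2 * l + 1, 2 * l + 1) for l in range(L)]
```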
<p>The SO(3) FFT works in two steps: (1) standard 2D FFT over the $\alpha$ and $\gamma$ Euler angles, then (2) linear contraction of the $\beta$ axis with precomputed Wigner-d function samples, implemented as a custom GPU kernel.</p>
<h2 id="experiments">Experiments</h2>
<h3 id="equivariance-error">Equivariance error</h3>
<p>Since the theory applies to continuous functions while the implementation is discretized, the authors measure the equivariance error empirically. The approximation error grows with resolution and depth but stays manageable at practical bandwidths. With ReLU activations the error is higher but stays flat across layers, indicating that it stems from feature-map rotation (exact only for bandlimited functions) rather than accumulating through the network.</p>
<h3 id="spherical-mnist">Spherical MNIST</h3>
<p>MNIST digits projected onto the sphere, tested in non-rotated (NR) and rotated (R) settings with ~165K parameters per model:</p>
<table>
  <thead>
      <tr>
          <th>Train / Test</th>
          <th>Planar CNN</th>
          <th>Spherical CNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NR / NR</td>
          <td>99%</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>R / R</td>
          <td>45%</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>NR / R</td>
          <td>9%</td>
          <td>85%</td>
      </tr>
  </tbody>
</table>
<p>The planar CNN collapses to chance when trained on non-rotated data and tested on rotated data. The spherical CNN maintains strong performance across all settings.</p>
<h3 id="3d-shape-recognition-shrec17">3D shape recognition (SHREC17)</h3>
<p>3D meshes projected onto an enclosing sphere via ray casting. For each point on the sphere, a ray is cast toward the origin, collecting three channels from the intersection: the ray length and the cos/sin of the angle between the ray and the surface normal. The same three channels are computed for the mesh&rsquo;s convex hull, giving six channels total. The network (~1.4M parameters) placed 2nd on recall, mAP, and NDCG, and 3rd on precision and F1 in the SHREC17 competition, competing against methods with highly task-specialized architectures.</p>
<h3 id="molecular-atomization-energy-qm7">Molecular atomization energy (QM7)</h3>
<p>Molecules represented as spherical potential functions around each atom (generalizing the Coulomb matrix). A deep ResNet-style $S^2$CNN with DeepSets-style permutation-invariant aggregation over atoms achieved 8.47 RMSE, outperforming all kernel-based approaches and sorted Coulomb matrix methods.</p>
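<p>The key property of the aggregation can be illustrated in a few lines (a generic DeepSets-style sketch, not the paper's exact readout head): sum-pooling per-atom embeddings makes the molecule-level representation independent of atom ordering.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
atom_features = rng.normal(size=(7, 8))  # 7 atoms, 8-dim embeddings each

# DeepSets-style aggregation: sum-pool over atoms, then the result would be
# passed through a shared network; the pooled vector is permutation-invariant
pooled = atom_features.sum(axis=0)
perm = rng.permutation(7)
assert np.allclose(pooled, atom_features[perm].sum(axis=0))
```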
<h2 id="discussion-and-future-directions">Discussion and future directions</h2>
<p>The authors highlight several avenues for future work. For volumetric tasks like 3D model recognition, extending beyond SO(3) to the roto-translation group SE(3) could improve results. They also note that a Steerable CNN for the sphere would enable analysis of vector fields (e.g., global wind directions). Omnidirectional vision is mentioned as a compelling application as 360-degree sensors become more prevalent.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>The official PyTorch implementation is publicly available, though it does not support recent PyTorch versions because of changes in the FFT interface.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jonkhler/s2cnn">s2cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation (deprecated for modern PyTorch)</td>
      </tr>
  </tbody>
</table>
<p>Hardware requirements from the paper: the SHREC17 model uses 8GB GPU memory at batch size 16 and takes 50 hours to train. The QM7 model uses 7GB at batch size 20 and takes 3 hours to train. Datasets used (Spherical MNIST, SHREC17, QM7) are all publicly available.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cohen, T. S., Geiger, M., Köhler, J., &amp; Welling, M. (2018). Spherical CNNs. <em>International Conference on Learning Representations</em>. <a href="https://arxiv.org/abs/1801.10130">https://arxiv.org/abs/1801.10130</a></p>
<p><strong>Publication</strong>: ICLR 2018</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Hkbd5xZRb">OpenReview</a></li>
<li><a href="https://arxiv.org/abs/1801.10130">arXiv</a></li>
<li><a href="https://github.com/jonkhler/s2cnn">GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{cohen2018spherical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Spherical {CNNs}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cohen, Taco S. and Geiger, Mario and K{\&#34;o}hler, Jonas and Welling, Max}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SE(3)-Transformers: Equivariant Attention for 3D Data</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/se3-transformers/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/se3-transformers/</guid><description>Fuchs et al. combine self-attention with SE(3)-equivariance for 3D point clouds using invariant attention weights and equivariant value messages.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces the SE(3)-Transformer, a self-attention mechanism for 3D point clouds and graphs that is equivariant under continuous 3D rotations and translations. It builds on tensor field networks (TFNs) by adding data-dependent attention weights, resolving a known expressiveness limitation of equivariant convolutions.</p>
<h2 id="why-equivariant-attention-for-point-clouds">Why equivariant attention for point clouds?</h2>
<p>Point cloud data appears in 3D object scans, molecular structures, and particle simulations. Two properties are essential: handling varying numbers of irregularly sampled points, and invariance to global changes in pose (rotations and translations).</p>
<p>Self-attention handles variable-size inputs naturally and has proven effective across many domains. Tensor field networks provide SE(3)-equivariant convolutions but suffer from a key limitation: their filter kernels are decomposed into learnable radial functions and fixed angular components (spherical harmonics). The angular dependence is completely constrained by the equivariance condition, leaving no learnable degrees of freedom in the angular direction. This has been identified in the literature as severely limiting performance.</p>
<p>The SE(3)-Transformer resolves this by introducing data-dependent attention weights that modulate the angular profile of the kernels while maintaining equivariance.</p>
<h2 id="architecture-invariant-attention-meets-equivariant-values">Architecture: invariant attention meets equivariant values</h2>
<p>The core layer combines three components:</p>
<p>$$\mathbf{f}_{\text{out},i}^{\ell} = \underbrace{\mathbf{W}_V^{\ell\ell} \mathbf{f}_{\text{in},i}^{\ell}}_{\text{self-interaction}} + \sum_{k \geq 0} \sum_{j \in \mathcal{N}_i \setminus i} \underbrace{\alpha_{ij}}_{\text{attention}} \underbrace{\mathbf{W}_V^{\ell k}(\mathbf{x}_j - \mathbf{x}_i) \mathbf{f}_{\text{in},j}^k}_{\text{value message}}$$</p>
<h3 id="invariant-attention-weights">Invariant attention weights</h3>
<p>The attention weights use dot-product attention between equivariant queries and keys:</p>
<p>$$\alpha_{ij} = \frac{\exp(\mathbf{q}_i^\top \mathbf{k}_{ij})}{\sum_{j' \in \mathcal{N}_i \setminus i} \exp(\mathbf{q}_i^\top \mathbf{k}_{ij'})}$$</p>
<p>Both $\mathbf{q}_i$ and $\mathbf{k}_{ij}$ are constructed using TFN-type linear embeddings, making them SE(3)-equivariant. Their inner product is invariant because SO(3) representations are orthogonal: $\mathbf{q}^\top \mathbf{S}_g^\top \mathbf{S}_g \mathbf{k} = \mathbf{q}^\top \mathbf{k}$.</p>
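<p>For type-1 features this invariance is easy to verify numerically. A small illustrative sketch (higher-type features would use Wigner-D matrices in place of the 3x3 rotation):</p>

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
S = Rotation.from_euler("xyz", [0.3, -1.2, 2.5]).as_matrix()  # orthogonal representation

q = rng.normal(size=3)        # a type-1 (vector) query
ks = rng.normal(size=(5, 3))  # type-1 keys from 5 neighbours

def attention(q, ks):
    logits = ks @ q
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

alpha = attention(q, ks)
alpha_rot = attention(S @ q, ks @ S.T)  # rotate every query and key
assert np.allclose(alpha, alpha_rot)    # attention weights are rotation-invariant
```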
<h3 id="equivariant-value-messages">Equivariant value messages</h3>
<p>The value messages use the same TFN kernel structure as tensor field networks: weight kernels $\mathbf{W}_V^{\ell k}(\mathbf{x})$ decomposed into learnable radial functions and Clebsch-Gordan/spherical harmonic angular components. Features are typed by irreducible representation degree $\ell$ (the independent matrix blocks into which SO(3) group actions decompose): type-0 vectors are rotation-invariant scalars, type-1 vectors transform as 3D vectors, and so on.</p>
<h3 id="angular-modulation">Angular modulation</h3>
<p>The attention weights $\alpha_{ij}$ multiply the value messages, creating data-dependent kernels $\alpha_{ij} \mathbf{W}_V^{\ell k}(\mathbf{x})$. This effectively modulates the angular profile of the fixed spherical harmonic components, adding learnable angular degrees of freedom while preserving equivariance. The authors describe this as one of the first examples of a nonlinear equivariant layer.</p>
<h3 id="attentive-self-interaction">Attentive self-interaction</h3>
<p>The paper also introduces attentive self-interaction as an alternative to the standard linear self-interaction (analogous to 1x1 convolutions). Instead of fixed learned weights across all points, the weights are generated by an MLP operating on invariant inner products of the input features:</p>
<p>$$w_{i,c'c}^{\ell\ell} = \text{MLP}\left(\bigoplus_{c,c'} \mathbf{f}_{\text{in},i,c'}^{\ell\top} \mathbf{f}_{\text{in},i,c}^{\ell}\right)$$</p>
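<p>A toy numeric check of why this construction stays equivariant (illustrative only; the paper uses a learned MLP per feature degree, and the channel counts here are arbitrary): the Gram matrix of channel inner products is invariant, so any weights computed from it yield an equivariant self-interaction.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
c, dim = 4, 3                      # 4 channels of type-1 (3-dim) features
F = rng.normal(size=(c, dim))
D = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # a random orthogonal matrix

# Gram matrix of channel inner products is invariant under rotating each channel
G = F @ F.T
assert np.allclose(G, (F @ D.T) @ (F @ D.T).T)

# Toy stand-in for the MLP: weights depend only on the invariant G, so the
# resulting self-interaction commutes with the rotation
W = rng.normal(size=(c * c, c * c))
w = np.tanh(G.reshape(-1) @ W).reshape(c, c)
out, out_rot = w @ F, w @ (F @ D.T)
assert np.allclose(out_rot, out @ D.T)  # output transforms like the input
```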
<h2 id="experiments">Experiments</h2>
<h3 id="n-body-particle-simulation">N-body particle simulation</h3>
<p>Five particles carry positive or negative charges and exert attractive or repulsive forces on one another. The task is predicting positions and velocities 500 timesteps ahead. The SE(3)-Transformer achieves 0.0076 MSE on position (vs. 0.0139 for Set Transformer and 0.0151 for TFN), with equivariance error on the order of $10^{-7}$, confirming exact equivariance up to numerical precision.</p>
<h3 id="scanobjectnn-real-world-3d-object-classification">ScanObjectNN (real-world 3D object classification)</h3>
<p>2902 real-world scanned objects across 15 categories. This task is only SO(2)-invariant (gravity axis matters), so the authors provide the z-component as an additional scalar input. With only 128 input points, the SE(3)-Transformer+z achieves 85.0% accuracy, competitive with methods using 1024 points and task-specific architectures. The model learns to ignore the symmetry-breaking z-input when trained on rotation-augmented data.</p>
<h3 id="qm9-molecular-property-regression"><a href="/notes/chemistry/datasets/qm9/">QM9</a> molecular property regression</h3>
<p>134k molecules with up to 29 atoms, predicting 6 quantum chemical properties. The SE(3)-Transformer achieves competitive results against other equivariant models (TFN, Cormorant), with improvements over TFN on all six targets. Across all three experiments, the SE(3)-Transformer outperforms both a non-equivariant attention baseline (Set Transformer) and equivariant models without attention (TFN).</p>
<h3 id="practical-contributions">Practical contributions</h3>
<p>The paper includes a PyTorch spherical harmonics implementation that is 10x faster than Scipy on CPU and 100-1000x faster on GPU. For a ScanObjectNN model, this yields roughly 22x speedup of the forward pass compared to the lie-learn library, directly addressing a major bottleneck of TFN-based architectures.</p>
<h2 id="conclusions-and-limitations">Conclusions and limitations</h2>
<p>Adding attention to a roto-translation-equivariant model consistently led to higher accuracy and increased training stability across all three experiments. For large neighbourhoods, attention proved essential for model convergence. The equivariance constraints also improved performance compared to conventional (non-equivariant) attention in all experiments.</p>
<p>The authors note that the SE(3)-Transformer is inherently suited for classification and regression on molecular data and discuss applications in drug research, including early-stage suitability classification of molecules for inhibiting viral reproductive cycles.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/FabianFuchsML/se3-transformer-public">se3-transformer-public</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch + DGL implementation</td>
      </tr>
  </tbody>
</table>
<p>The repository includes code for N-body simulations and QM9 experiments. Hyperparameters and architecture details are provided in the paper&rsquo;s appendix (4 equivariant layers, representation degrees, channels per degree, learning rates, batch sizes). Hardware requirements are not explicitly stated in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fuchs, F. B., Worrall, D. E., Fischer, V., &amp; Welling, M. (2020). SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks. <em>Advances in Neural Information Processing Systems</em>, 33. <a href="https://arxiv.org/abs/2006.10503">https://arxiv.org/abs/2006.10503</a></p>
<p><strong>Publication</strong>: NeurIPS 2020</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2006.10503">arXiv</a></li>
<li><a href="https://github.com/FabianFuchsML/se3-transformer-public">GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fuchs2020se3,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SE(3)-Transformers}: 3D Roto-Translation Equivariant Attention Networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fuchs, Fabian B. and Worrall, Daniel E. and Fischer, Volker and Welling, Max}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OCSU: Optical Chemical Structure Understanding (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</guid><description>OCSU task for translating molecular images into multi-level descriptions. Introduces Vis-CheBI20 dataset and DoubleCheck/Mol-VL for molecular understanding.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, S., Xie, Y., Cai, B., Xie, A., Liu, G., Qiao, M., Xing, J., &amp; Nie, Z. (2025). OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery. <em>arXiv preprint arXiv:2501.15415</em>. <a href="https://doi.org/10.48550/arXiv.2501.15415">https://doi.org/10.48550/arXiv.2501.15415</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/PharMolix/OCSU">Code and Dataset (GitHub)</a></li>
</ul>
<h2 id="multi-level-chemical-understanding-method-and-resource">Multi-Level Chemical Understanding (Method and Resource)</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution.</p>
<ul>
<li><strong>Methodological</strong>: It proposes two novel architectures, <strong>DoubleCheck</strong> (an enhanced recognition model) and <strong>Mol-VL</strong> (an end-to-end vision-language model), to solve the newly formulated OCSU task.</li>
<li><strong>Resource</strong>: It constructs and releases <strong>Vis-CheBI20</strong>, the first large-scale dataset specifically designed for optical chemical structure understanding, containing 29.7K images and 117.7K image-text pairs.</li>
</ul>
<h2 id="the-motivation-for-ocsu-beyond-basic-graph-recognition">The Motivation for OCSU Beyond Basic Graph Recognition</h2>
<p>Existing methods for processing molecular images focus narrowly on <strong>Optical Chemical Structure Recognition (OCSR)</strong>, which translates an image solely into a machine-readable graph or SMILES string. However, SMILES strings are not chemist-friendly and lack high-level semantic context.</p>
<ul>
<li><strong>Gap</strong>: There is a lack of systems that can translate chemical diagrams into human-readable descriptions (e.g., functional groups, IUPAC names) alongside the graph structure.</li>
<li><strong>Goal</strong>: To enable <strong>Optical Chemical Structure Understanding (OCSU)</strong>, bridging the gap between visual representations and both machine/chemist-readable descriptions to support drug discovery and property prediction.</li>
</ul>
<h2 id="key-innovations-doublecheck-mol-vl-and-the-vis-chebi20-dataset">Key Innovations: DoubleCheck, Mol-VL, and the Vis-CheBI20 Dataset</h2>
<p>The paper introduces the <strong>OCSU task</strong>, enabling multi-level understanding (motif, molecule, and abstract levels). To solve this, it introduces two distinct paradigms:</p>
<ol>
<li><strong>DoubleCheck (OCSR-based)</strong>: An enhancement to standard OCSR models (like MolScribe) that performs a &ldquo;second look&rdquo; at locally ambiguous atoms. It uses attentive feature enhancement to fuse global molecular features with local features from ambiguous regions.</li>
<li><strong>Mol-VL (OCSR-free)</strong>: An end-to-end Vision-Language Model (VLM) based on Qwen2-VL. It uses multi-task learning to directly generate text descriptions from molecular images without an intermediate SMILES step.</li>
<li><strong>Vis-CheBI20 Dataset</strong>: A new benchmark specifically constructed for OCSU, deriving captions and functional group data from ChEBI-20 and PubChem.</li>
</ol>
<h2 id="methodology-and-experimental-evaluation">Methodology and Experimental Evaluation</h2>
<p>The authors evaluated both paradigms on <strong>Vis-CheBI20</strong> and existing benchmarks (USPTO, ACS) across four subtasks:</p>
<ol>
<li><strong>Functional Group Caption</strong>: Retrieval/F1 score evaluation.</li>
<li><strong>Molecule Description</strong>: Natural language generation metrics (BLEU, ROUGE, METEOR).</li>
<li><strong>IUPAC Naming</strong>: Text generation metrics (BLEU, ROUGE).</li>
<li><strong>SMILES Naming (OCSR)</strong>: Exact matching accuracy ($Acc_s$).</li>
</ol>
<p><strong>Baselines</strong>:</p>
<ul>
<li><strong>Task-Specific</strong>: MolScribe, MolVec, OSRA.</li>
<li><strong>LLM/VLM</strong>: Qwen2-VL, BioT5+, Mol-Instructions.</li>
<li><strong>Ablation</strong>: DoubleCheck vs. MolScribe backbone to test the &ldquo;feature enhancement&rdquo; mechanism.</li>
</ul>
<h2 id="results-and-conclusions-paradigm-trade-offs">Results and Conclusions: Paradigm Trade-Offs</h2>
<ul>
<li><strong>DoubleCheck Superiority</strong>: DoubleCheck outperformed MolScribe on OCSR tasks across all benchmarks. On USPTO, it achieved <strong>92.85%</strong> $Acc_s$ (vs. 92.57%), and on the ACS dataset it showed a <strong>+3.12%</strong> gain on chiral molecules. On Vis-CheBI20, DoubleCheck improved over MolScribe by an average of 2.27% across all metrics.</li>
<li><strong>Paradigm Trade-offs</strong>:
<ul>
<li><strong>Mol-VL (OCSR-free)</strong> excelled at semantic tasks like <strong>Functional Group Captioning</strong>, achieving <strong>97.32%</strong> F1 (vs. 93.63% for DoubleCheck &amp; RDKit and 89.60% for MolScribe &amp; RDKit). It benefits from end-to-end learning of structural context.</li>
<li><strong>DoubleCheck (OCSR-based)</strong> performed better on <strong>IUPAC naming recall</strong> and exact SMILES recovery, as explicit graph reconstruction is more precise for rigid nomenclature than VLM generation.</li>
</ul>
</li>
<li><strong>Conclusion</strong>: Enhancing submodules improves OCSR-based paradigms, while end-to-end VLMs offer stronger semantic understanding but struggle with exact syntax generation (SMILES/IUPAC).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Vis-CheBI20 Dataset</strong></p>
<ul>
<li><strong>Source</strong>: Derived from ChEBI-20 and PubChem.</li>
<li><strong>Size</strong>: 29,700 molecular diagrams, 117,700 image-text pairs.</li>
<li><strong>Generation</strong>: Images generated from SMILES using RDKit to simulate real-world journal/patent styles.</li>
<li><strong>Splits</strong> (vary by task, see table below):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Train Size</th>
          <th style="text-align: left">Test Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Functional Group</td>
          <td style="text-align: left">26,144</td>
          <td style="text-align: left">3,269</td>
      </tr>
      <tr>
          <td style="text-align: left">Description</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
      <tr>
          <td style="text-align: left">IUPAC Naming</td>
          <td style="text-align: left">26,200</td>
          <td style="text-align: left">2,680</td>
      </tr>
      <tr>
          <td style="text-align: left">SMILES Naming</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>DoubleCheck (Attentive Feature Enhancement)</strong></p>
<ol>
<li><strong>Ambiguity Detection</strong>: Uses atom prediction confidence to identify &ldquo;ambiguous atoms&rdquo;.</li>
<li><strong>Masking</strong>: Applies a 2D Gaussian mask to the image centered on the ambiguous atom.</li>
<li><strong>Local Encoding</strong>: A Swin-B encoder ($\Phi_l$) encodes the masked image region.</li>
<li><strong>Fusion</strong>: Aligns local features ($\mathcal{F}_l$) with global features ($\mathcal{F}_g$) using a 2-layer MLP and fuses them via weighted summation.</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{F}_e = \mathcal{F}_g + \text{MLP}(\mathcal{F}_g \oplus \hat{\mathcal{F}}_l) \cdot \hat{\mathcal{F}}_l
\end{aligned}
$$</p>
<ol start="5">
<li><strong>Two-Stage Training</strong>:
<ul>
<li>Stage 1: Train atom/bond predictors (30 epochs).</li>
<li>Stage 2: Train alignment/fusion modules with random Gaussian mask noise (10 epochs).</li>
</ul>
</li>
</ol>
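<p>The fusion step above can be sketched as follows (a schematic with made-up layer sizes and single-linear stand-ins for the alignment and gating MLPs; names are illustrative, not from the released code):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_align = rng.normal(scale=0.1, size=(d, d))       # stands in for the 2-layer alignment MLP
W_gate = rng.normal(scale=0.1, size=(2 * d, 1))    # stands in for the gating MLP

def fuse(f_g, f_l):
    f_hat = np.tanh(f_l @ W_align)                 # align local features to the global space
    gate = np.concatenate([f_g, f_hat], axis=-1) @ W_gate  # MLP(F_g concat F_hat_l)
    return f_g + gate * f_hat                      # F_e = F_g + gate * F_hat_l

f_e = fuse(rng.normal(size=(2, d)), rng.normal(size=(2, d)))
assert f_e.shape == (2, d)
```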
<p><strong>Mol-VL (Multi-Task VLM)</strong></p>
<ul>
<li><strong>Prompting</strong>: System prompt: &ldquo;You are working as an excellent assistant in chemistry&hellip;&rdquo;</li>
<li><strong>Tokens</strong>: Uses <code>&lt;image&gt;</code> and <code>&lt;/image&gt;</code> special tokens.</li>
<li><strong>Auxiliary Task</strong>: Functional group recognition (identifying highlighted groups) added to training to improve context learning.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>DoubleCheck</strong>:
<ul>
<li><strong>Backbone</strong>: MolScribe architecture.</li>
<li><strong>Encoders</strong>: Swin-B for both global and local atom encoding.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>:
<ul>
<li><strong>Base Model</strong>: Qwen2-VL (2B and 7B versions).</li>
<li><strong>Vision Encoder</strong>: ViT with naive dynamic resolution and M-RoPE.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Metrics</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Exact Match Accuracy ($Acc_s$), Chiral Accuracy ($Acc_c$).</li>
<li><strong>Functional Groups</strong>: F1 Score (Information Retrieval task).</li>
<li><strong>Text Generation</strong>: BLEU-2/4, METEOR, ROUGE-L.</li>
</ul>
<p><strong>Selected Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left"><strong>92.85%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>MolScribe</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left">92.57%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mol-VL-7B</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left"><strong>97.32%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck &amp; RDKit</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left">93.63%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>DoubleCheck</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>4 days</strong>.
<ul>
<li>Max LR: 4e-4.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>10 days</strong>.
<ul>
<li>Max LR: 1e-5, 50 epochs.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/PharMolix/OCSU">PharMolix/OCSU (GitHub)</a></td>
          <td style="text-align: left">Code, Model, Dataset</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation, Mol-VL-7B weights, and Vis-CheBI20 dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The long-tail distribution of functional groups in training data limits performance on uncommon chemical structures.</li>
<li>Mol-VL struggles with exact syntax generation (SMILES and IUPAC) compared to explicit graph-reconstruction approaches.</li>
<li>Vis-CheBI20 images are synthetically generated via RDKit, which may not fully capture the diversity of real-world journal and patent images.</li>
<li>The authors note that OCSU technologies should be restricted to research purposes, as downstream molecule discovery applications could potentially generate harmful molecules.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{fanOCSUOpticalChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSU}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fan, Siqi and Xie, Yuguang and Cai, Bowen and Xie, Ailin and Liu, Gaochao and Qiao, Mu and Xing, Jie and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GTR-CoT: Graph Traversal Chain-of-Thought for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</guid><description>GTR-VL uses graph traversal chain-of-thought and two-stage training to improve optical chemical structure recognition on printed and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, J., He, Y., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., &amp; He, C. (2025). GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition (arXiv:2506.07553). arXiv. <a href="https://doi.org/10.48550/arXiv.2506.07553">https://doi.org/10.48550/arXiv.2506.07553</a></p>
<p><strong>Publication</strong>: arXiv preprint (2025)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2506.07553">Paper on arXiv</a></li>
</ul>
<h2 id="contribution-vision-language-modeling-for-ocsr">Contribution: Vision-Language Modeling for OCSR</h2>
<p>This is a <strong>method paper</strong> that introduces GTR-VL, a Vision-Language Model for Optical Chemical Structure Recognition (OCSR). The work addresses the persistent challenge of converting molecular structure images into machine-readable formats, with a particular focus on handling chemical abbreviations that cause errors in existing systems.</p>
<h2 id="motivation-the-abbreviation-bottleneck">Motivation: The Abbreviation Bottleneck</h2>
<p>The work tackles a long-standing bottleneck in chemical informatics: most existing OCSR systems produce incorrect structures when they encounter abbreviated functional groups. When a chemist draws &ldquo;Ph&rdquo; for phenyl or &ldquo;Et&rdquo; for ethyl, current models fail because they have been trained on data where images contain abbreviations but the ground-truth labels contain fully expanded molecular graphs.</p>
<p>This creates a fundamental mismatch. The model sees &ldquo;Ph&rdquo; in the image but is told the &ldquo;correct&rdquo; answer is a full benzene ring. The supervision signal is inconsistent with what is actually visible.</p>
<p>Beyond this data problem, existing graph-parsing methods use a two-stage approach: predict all atoms first, then predict all bonds. This is inefficient and ignores the structural constraints that could help during prediction. The authors argue that mimicking how humans analyze molecular structures - following bonds from atom to atom in a connected traversal - would be more effective.</p>
<h2 id="novelty-graph-traversal-as-visual-chain-of-thought">Novelty: Graph Traversal as Visual Chain-of-Thought</h2>
<p>The novelty lies in pairing faithful data annotation with a traversal-based architecture for OCSR. The main contributions are:</p>
<ol>
<li>
<p><strong>Graph Traversal as Visual Chain of Thought</strong>: GTR-VL generates molecular graphs by traversing them sequentially, predicting an atom, then its connected bond, then the next atom, and so on. This mimics how a human chemist would trace through a structure and allows the model to use previously predicted atoms and bonds as context for subsequent predictions.</p>
<p>Formally, the model output sequence for image $I_m$ is generated as:</p>
<p>$$ R_m = \text{concat}(CoT_m, S_m) $$</p>
<p>where $CoT_m$ represents the deterministic graph traversal steps (atoms and bonds) and $S_m$ is the final SMILES representation. This intermediate reasoning step makes the model more interpretable and helps it learn the structural logic of molecules.</p>
</li>
<li>
<p><strong>&ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; Principle</strong>: This addresses the abbreviation problem head-on. The authors correct the ground-truth annotations to match what&rsquo;s actually visible in the image.</p>
<p>They treat abbreviations like &ldquo;Ph&rdquo; as single &ldquo;superatoms&rdquo; and build a pipeline to automatically detect and correct training data. Using OCR to extract visible text from molecular images, they replace the corresponding expanded substructures in the ground-truth with the appropriate abbreviation tokens. This ensures the supervision signal is consistent with the visual input.</p>
</li>
<li>
<p><strong>Large-Scale Dataset (GTR-1.3M)</strong>: To support this approach, the authors created a large-scale dataset combining 1M synthetic molecules from PubChem with 351K corrected real-world patent images from USPTO. The key innovation is the correction pipeline that identifies abbreviations in patent images and fixes the inconsistent ground-truth labels.</p>
</li>
<li>
<p><strong>GRPO for Hand-Drawn OCSR</strong>: Hand-drawn molecular data lacks fine-grained atom/bond coordinate annotations, making SFT-based graph parsing inapplicable. The authors use Group Relative Policy Optimization (GRPO) with a composite reward function that combines format, SMILES, and graph-level rewards. The graph reward computes the maximum common subgraph (MCS) between predicted and ground-truth molecular graphs:</p>
<p>$$ R_{\text{graph}} = \frac{|N_m^a|}{|N_g^a| + |N_p^a|} + \frac{|N_m^b|}{|N_g^b| + |N_p^b|} $$</p>
<p>where $N_m^a$, $N_g^a$, $N_p^a$ are atom counts in the MCS, ground truth, and prediction, and $N_m^b$, $N_g^b$, $N_p^b$ are the corresponding bond counts.</p>
</li>
<li>
<p><strong>Two-Stage Training</strong>: Stage 1 performs SFT on GTR-1.3M for printed molecule recognition. Stage 2 applies GRPO on a mixture of printed data (GTR-USPTO-4K) and hand-drawn data (DECIMER-HD-Train, 4,070 samples) to extend capabilities to hand-drawn structures.</p>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: Traditional SMILES-based evaluation fails for molecules with abbreviations because canonicalization breaks down. The authors created a new benchmark that evaluates graph structure directly, providing three metrics: direct SMILES generation, graph-derived SMILES, and exact graph matching.</p>
</li>
</ol>
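<p>The GRPO graph reward above reduces to simple arithmetic once the MCS has been computed. Here is a minimal Python sketch that assumes the atom and bond counts of the maximum common subgraph are already available (in practice they would come from a cheminformatics toolkit); only the reward formula itself is shown.</p>

```python
def graph_reward(mcs_atoms, gt_atoms, pred_atoms, mcs_bonds, gt_bonds, pred_bonds):
    """Graph-level reward from maximum-common-subgraph (MCS) overlap.

    Each term reaches 1/2 when the prediction matches the ground truth
    exactly (the MCS then equals both graphs), so the reward peaks at 1.0.
    """
    atom_term = mcs_atoms / (gt_atoms + pred_atoms)
    bond_term = mcs_bonds / (gt_bonds + pred_bonds)
    return atom_term + bond_term

# Perfect prediction: a benzene-like graph with 6 atoms and 6 bonds.
assert graph_reward(6, 6, 6, 6, 6, 6) == 1.0
```

<p>Because partial overlap still earns partial credit, the reward gives a useful gradient signal even when the predicted SMILES string is not an exact match.</p>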
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The evaluation focused on demonstrating that GTR-VL&rsquo;s design principles solve real problems that plague existing OCSR systems:</p>
<ol>
<li>
<p><strong>Comprehensive Baseline Comparison</strong>: GTR-VL was tested against three categories of models:</p>
<ul>
<li><strong>Specialist OCSR systems</strong>: MolScribe and MolNexTR</li>
<li><strong>Chemistry-focused VLMs</strong>: ChemVLM, ChemDFM-X, OCSU</li>
<li><strong>General-purpose VLMs</strong>: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: The new benchmark includes two subsets of patent images:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 standard patent images similar to existing benchmarks</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<p>This design directly tests whether models can handle the abbreviation problem that breaks existing systems.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Systematic experiments isolated the contribution of key design choices:</p>
<ul>
<li><strong>Chain-of-Thought vs. Direct</strong>: Comparing graph traversal CoT against direct SMILES prediction</li>
<li><strong>Traversal Strategy</strong>: Graph traversal vs. the traditional &ldquo;atoms-then-bonds&rdquo; approach</li>
<li><strong>Dataset Quality</strong>: Training on corrected vs. uncorrected data</li>
</ul>
</li>
<li>
<p><strong>Retraining Experiments</strong>: Existing specialist models (MolScribe, MolNexTR) were retrained from scratch on the corrected GTR-1.3M dataset to isolate the effect of data quality from architectural improvements.</p>
</li>
<li>
<p><strong>Hand-Drawn OCSR Evaluation</strong>: GTR-VL was also evaluated on the DECIMER Hand-drawn test set and ChemPix dataset, comparing against DECIMER and AtomLenz+EditKT baselines.</p>
</li>
<li>
<p><strong>Qualitative Analysis</strong>: Visual inspection of predictions on challenging cases with heavy abbreviation usage, complex structures, and edge cases to understand failure modes.</p>
</li>
</ol>
<h2 id="results--conclusions-resolving-the-abbreviation-bottleneck">Results &amp; Conclusions: Resolving the Abbreviation Bottleneck</h2>
<ul>
<li>
<p><strong>Performance Gains on Abbreviations</strong>: On MolRec-Abb, GTR-VL-Stage1 achieves 85.49% Graph accuracy, while the original checkpoints of specialist models such as MolScribe and MolNexTR drop below 20% when abbreviations are present. On MolRec-USPTO, GTR-VL-Stage1 reaches 93.45% Graph accuracy.</p>
</li>
<li>
<p><strong>Data Correction is Critical</strong>: When MolScribe and MolNexTR were retrained on GTR-1.3M, their MolRec-Abb Graph accuracy jumped from around 20% to 70.60% and 71.85% respectively. GTR-VL-Stage1 still outperformed these retrained baselines at 85.49%, confirming that both data correction and the graph traversal approach contribute.</p>
</li>
<li>
<p><strong>Chain-of-Thought Helps</strong>: Ablation on GTR-USPTO-351K shows that CoT yields 68.85% Gen-SMILES vs. 66.54% without CoT, a 2.31 percentage point improvement.</p>
</li>
<li>
<p><strong>Graph Traversal Beats Traditional Parsing</strong>: Graph traversal achieves 83.26% Graph accuracy vs. 80.15% for the atoms-then-bonds approach, and 81.88% vs. 79.02% on Gra-SMILES.</p>
</li>
<li>
<p><strong>General VLMs Still Struggle</strong>: General-purpose VLMs like GPT-4o scored near 0% on MolRec-Bench across all metrics, highlighting the importance of domain-specific training for OCSR.</p>
</li>
<li>
<p><strong>Hand-Drawn Recognition via GRPO</strong>: GTR-VL-Stage1 (SFT only) achieves only 9.53% Graph accuracy on DECIMER-HD-Test, but after GRPO training in Stage 2, performance jumps to 75.44%. On ChemPix, Graph accuracy rises from 22.02% to 86.13%. The graph reward is essential: GRPO without graph supervision achieves only 11.00% SMILES on DECIMER-HD-Test, while adding graph reward reaches 75.64%.</p>
</li>
<li>
<p><strong>Evaluation Methodology Matters</strong>: The new graph-based evaluation metrics revealed problems with traditional SMILES-based evaluation that previous work had missed. Many &ldquo;failures&rdquo; in existing benchmarks were actually correct graph predictions that got marked wrong due to canonicalization issues with abbreviations.</p>
</li>
</ul>
<p>The work establishes that addressing the abbreviation problem requires both correcting the training data and rethinking the model architecture. The combination of faithful data annotation and sequential graph generation improves OCSR performance on molecules with abbreviations by a large margin over previous methods.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Base Model</strong>: GTR-VL fine-tunes <strong>Qwen2.5-VL</strong>.</p>
<p><strong>Input/Output Mechanism</strong>:</p>
<ul>
<li><strong>Input</strong>: The model takes an image $I_m$ and a text prompt</li>
<li><strong>Output</strong>: The model generates $R_m = \text{concat}(CoT_m, S_m)$, where it first produces the Chain-of-Thought (the graph traversal steps) followed immediately by the final SMILES string</li>
<li><strong>Traversal Strategy</strong>: Uses <strong>depth-first traversal</strong> to alternately predict atoms and bonds</li>
</ul>
<p><strong>Prompt Structure</strong>: The model is prompted to &ldquo;list the types of atomic elements&hellip; the coordinates&hellip; and the chemical bonds&hellip; then&hellip; output a canonical SMILES&rdquo;. The CoT output is formatted as a JSON list of atoms (with coordinates) and bonds (with indices referring to previous atoms), interleaved.</p>
<h3 id="data">Data</h3>
<p><strong>Training Dataset (GTR-1.3M)</strong>:</p>
<ul>
<li><strong>Synthetic Component</strong>: 1 million molecular SMILES from PubChem, converted to images using Indigo</li>
<li><strong>Real Component</strong>: 351,000 samples from USPTO patents (filtered from an original 680,000)
<ul>
<li>Processed using an OCR pipeline to detect abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Et&rdquo;)</li>
<li>Ground truth expanded structures replaced with superatoms to match visible abbreviations in images</li>
<li>This &ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; correction ensures training supervision matches visual input</li>
</ul>
</li>
</ul>
<p><strong>Evaluation Dataset (MolRec-Bench)</strong>:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 molecular images from USPTO patents</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Graph Traversal Algorithm</strong>:</p>
<ul>
<li>Depth-first traversal strategy</li>
<li>Alternating atom-bond prediction sequence</li>
<li>Each step uses previously predicted atoms and bonds as context</li>
</ul>
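<p>The traversal can be sketched as a standard depth-first search that emits tokens in atom, bond, atom, bond order. This is an illustrative Python reconstruction, not the paper's code: the adjacency representation and token tuples here are assumptions, and the actual model emits a JSON list that also carries atom coordinates.</p>

```python
def traverse(atoms, bonds, start=0):
    """Depth-first traversal emitting an alternating atom/bond sequence.

    `atoms` maps index -> element symbol; `bonds` maps (i, j) -> bond type
    (stored once per undirected edge). Each new atom is emitted right after
    the bond that reaches it, mirroring how a chemist traces a structure.
    """
    adj = {}
    for (i, j), order in bonds.items():
        adj.setdefault(i, []).append((j, order))
        adj.setdefault(j, []).append((i, order))
    seq, seen, stack = [("atom", start, atoms[start])], {start}, [start]
    while stack:
        node = stack.pop()
        for nbr, order in sorted(adj.get(node, [])):
            if nbr not in seen:
                seen.add(nbr)
                seq.append(("bond", (node, nbr), order))
                seq.append(("atom", nbr, atoms[nbr]))
                stack.append(nbr)
    return seq

# Ethanol skeleton: C-C-O
tokens = traverse({0: "C", 1: "C", 2: "O"}, {(0, 1): "single", (1, 2): "single"})
```

<p>Each bond token names the atom it extends from, so every prediction step can condition on the partial graph built so far.</p>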
<p><strong>Two-Stage Training</strong>:</p>
<ul>
<li><strong>Stage 1 (SFT)</strong>: Train on GTR-1.3M to learn visual CoT mechanism for printed molecules (produces GTR-VL-Stage1)</li>
<li><strong>Stage 2 (GRPO)</strong>: Apply GRPO on GTR-USPTO-4K + DECIMER-HD-Train (4,070 samples) for hand-drawn recognition (produces GTR-VL-Stage2, i.e., GTR-VL)</li>
</ul>
<p><strong>Training Procedure</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate (SFT)</strong>: Peak learning rate of $1.6 \times 10^{-4}$ with cosine decay</li>
<li><strong>Learning Rate (GRPO)</strong>: Peak learning rate of $1 \times 10^{-5}$ with cosine decay</li>
<li><strong>Warm-up</strong>: Linear warm-up for the first 10% of iterations</li>
<li><strong>Batch Size (SFT)</strong>: 2 per GPU with gradient accumulation over 16 steps, yielding <strong>effective batch size of 1024</strong></li>
<li><strong>Batch Size (GRPO)</strong>: 4 per GPU with gradient accumulation of 1, yielding <strong>effective batch size of 128</strong></li>
</ul>
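<p>A sketch of the schedule described above: linear warm-up over the first 10% of iterations followed by cosine decay. The decay-to-zero floor is an assumption; the paper specifies the peak rates and schedule shapes but not a final learning-rate floor.</p>

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.10):
    """Linear warm-up for the first `warmup_frac` of steps, then cosine decay.

    Assumes decay to zero at `total_steps`; the true floor is unspecified.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# SFT peak of 1.6e-4: rises linearly, peaks at the end of warm-up, decays to ~0.
assert lr_at(0, 1000, 1.6e-4) == 0.0
```
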
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong> (three complementary measures to handle abbreviation issues):</p>
<ul>
<li><strong>Gen-SMILES</strong>: Exact match ratio of SMILES strings directly generated by the VLM (image-captioning style)</li>
<li><strong>Gra-SMILES</strong>: Exact match ratio of SMILES strings derived from the predicted graph structure (graph-parsing style)</li>
<li><strong>Graph</strong>: Exact match ratio between ground truth and predicted graphs (node/edge comparison, bypassing SMILES canonicalization issues)</li>
</ul>
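<p>The Graph metric can be illustrated with a simple labeled-graph comparison. This sketch assumes both graphs use a consistent atom indexing; a faithful implementation would test graph isomorphism. It shows why comparing nodes and edges directly sidesteps SMILES canonicalization entirely.</p>

```python
def graphs_match(atoms_a, bonds_a, atoms_b, bonds_b):
    """Exact match between two labeled molecular graphs, assuming a shared
    atom indexing. Bonds are normalized to undirected edges, so (0, 1) and
    (1, 0) refer to the same bond."""
    norm = lambda bonds: {(min(i, j), max(i, j), t) for (i, j), t in bonds.items()}
    return atoms_a == atoms_b and norm(bonds_a) == norm(bonds_b)

# Same molecule with bond direction flipped still matches.
assert graphs_match({0: "C", 1: "O"}, {(0, 1): "single"},
                    {0: "C", 1: "O"}, {(1, 0): "single"})
```
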
<p><strong>Baselines Compared</strong>:</p>
<ul>
<li>Specialist OCSR systems: MolScribe, MolNexTR</li>
<li>Chemistry-focused VLMs: ChemVLM, ChemDFM-X, OCSU</li>
<li>General-purpose VLMs: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute</strong>: Training performed on <strong>32 NVIDIA A100 GPUs</strong></p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Status</strong>: Closed. As of the paper&rsquo;s publication, no source code, pre-trained model weights, or dataset downloads (GTR-1.3M, MolRec-Bench) have been publicly released. The paper does not mention plans for open-source release. The training data pipeline relies on PubChem SMILES (public), USPTO patent images (publicly available through prior work), the Indigo rendering tool (open-source), and an unspecified OCR system for abbreviation detection. Without the released code and data corrections, reproducing the full pipeline would require substantial re-implementation effort.</p>
]]></content:encoded></item><item><title>ChemDFM-R: Chemical Reasoning LLM with Atomized Knowledge</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-r/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-r/</guid><description>A 14B-parameter chemical reasoning LLM enhanced with atomized functional group knowledge and mix-sourced distillation strategy.</description><content:encoded><![CDATA[<h2 id="method-and-resource-contributions">Method and Resource Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper with significant <strong>Resource</strong> contributions.</p>
<ul>
<li><strong>Methodological Basis</strong>: The paper introduces a training pipeline (&ldquo;mix-sourced distillation&rdquo;) and domain-specific reinforcement learning to improve reasoning capabilities in chemical LLMs. It validates the approach through ablation studies across training stages.</li>
<li><strong>Resource Contribution</strong>: The authors constructed <strong>ChemFG</strong>, a 101 billion-token corpus annotated with &ldquo;atomized&rdquo; knowledge regarding functional groups and reaction centers.</li>
</ul>
<h2 id="bridging-the-chemical-reasoning-gap">Bridging the Chemical Reasoning Gap</h2>
<p>Current chemical LLMs struggle to reason logically for two main reasons:</p>
<ol>
<li><strong>Shallow Domain Understanding</strong>: Models generally learn molecule-level properties directly, bypassing the intermediate &ldquo;atomized&rdquo; characteristics (e.g., <a href="https://en.wikipedia.org/wiki/Functional_group">functional groups</a>) that ultimately dictate chemical behavior.</li>
<li><strong>Specialized Reasoning Logic</strong>: Chemical logic differs fundamentally from math or code. Distilling reasoning from general teacher models like DeepSeek-R1 frequently fails because the teachers lack the domain intuition required to generate valid chemical rationales.</li>
</ol>
<h2 id="atomized-knowledge-and-mixed-source-distillation">Atomized Knowledge and Mixed-Source Distillation</h2>
<p>The authors introduce three structural innovations to solve the reasoning gap:</p>
<ol>
<li><strong>Atomized Knowledge Enhancement (ChemFG)</strong>: A toolkit was built leveraging SMARTS notations to identify functional group changes during reactions. A critique of this approach is that it relies heavily on 2D cheminformatics abstractions, potentially missing deeper 3D stereochemical interactions.</li>
<li><strong>Mix-Sourced Distillation</strong>: General models (DeepSeek-R1/o3-mini) are fed &ldquo;pseudo-reasoning&rdquo; prompts that include ground truth answers and functional group data. While this forces the teacher to generate high-quality rationales for the student to learn, it introduces a layer of hindsight bias into the generated reasoning chains. During inference, the student model lacks both the pre-calculated functional group metadata and the ground truth, forcing it to bridge an artificially steep generalization gap.</li>
<li><strong>Chemical Reinforcement Learning</strong>: The intermediate model undergoes domain-specific reinforcement learning. The RL details are described in the paper&rsquo;s Appendix D, with the authors citing the open-source DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) framework. The optimization relies on rule-based rewards (format adherence and canonicalized <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> accuracy) across a variety of chemical tasks.</li>
</ol>
<h2 id="benchmark-evaluation-and-ablation-studies">Benchmark Evaluation and Ablation Studies</h2>
<p>The model was evaluated on comprehensive chemical benchmarks: <strong>SciKnowEval</strong> (19 tasks) and <strong><a href="/notes/chemistry/llm-applications/chemeval-multilevel-chemical-evaluation/">ChemEval</a></strong> (36 tasks).</p>
<ul>
<li><strong>Baselines</strong>: Compared against similarly sized open models (Qwen2.5-14B-Instruct, Qwen3-14B), domain models (<a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM</a>, MolInst), and frontier models (GPT-4o, DeepSeek-R1).</li>
<li><strong>Ablation</strong>: Evaluated across training stages (Base → ChemDFM-I → ChemDFM-R) to measure the specific impact of the instruction tuning versus the reasoning stages.</li>
<li><strong>Qualitative Analysis</strong>: The paper includes case studies demonstrating the model&rsquo;s step-by-step chemical reasoning and its potential for human-AI collaboration (Sections 4.2 and 4.3).</li>
</ul>
<h2 id="performance-outcomes-and-numerical-limitations">Performance Outcomes and Numerical Limitations</h2>
<ul>
<li><strong>Performance vs. Baselines</strong>: ChemDFM-R outperforms similarly sized open models and domain models on molecule-centric and reaction-centric tasks, and surpasses the much larger DeepSeek-R1 on ChemEval (0.78 vs. 0.58 overall). It shows competitive results relative to o4-mini, though o4-mini leads on SciKnowEval (0.74 vs. 0.70).</li>
<li><strong>Reasoning Interactivity</strong>: The model generates readable rationales that allow users to catch structural errors or identify reaction mechanisms accurately. Section 4.3 of the paper demonstrates human-AI collaboration scenarios.</li>
<li><strong>Quantitative Limitations</strong>: The model struggles with tasks involving numerical prediction and calculation (e.g., yield extraction, molecular property calculation). The paper notes that all molecule-centric and reaction-centric tasks where ChemDFM-R falls short of Qwen2.5-14B-Instruct involve numerical reasoning.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is constructed in three phases:</p>
<p><strong>1. Domain Pre-training (ChemFG)</strong>:</p>
<ul>
<li><strong>Size</strong>: 101 billion tokens</li>
<li><strong>Composition</strong>:
<ul>
<li>12M literature documents (79B tokens)</li>
<li>30M molecules from PubChem/PubChemQC</li>
<li>7M reactions from USPTO-FULL</li>
</ul>
</li>
<li><strong>Augmentation</strong>: SMILES augmentation (10x) using R-SMILES</li>
<li><strong>Atomized Features</strong>: Annotated with a custom &ldquo;Functional Group Identification Toolkit&rdquo; that identifies 241 functional group types and tracks changes in reaction centers. <em>Note: Data and toolkit are partially reproduced; while the toolkit (<a href="https://github.com/OpenDFM/ChemFG-Tool">ChemFG-Tool</a>) was open-sourced on GitHub, the 101 billion-token ChemFG dataset itself has not been publicly released.</em></li>
</ul>
<p><strong>2. Instruction Tuning</strong>:</p>
<ul>
<li><strong>Sources</strong>: Molecule-centric (<a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>), Reaction-centric (USPTO), and Knowledge-centric (Exams, Literature QA) tasks</li>
<li><strong>Mixing</strong>: Mixed with general instruction data in a 1:2 ratio</li>
</ul>
<p><strong>3. Distillation Dataset</strong>:</p>
<ul>
<li><strong>Sources</strong>:
<ul>
<li>~70% ChemDFM-R instruction data</li>
<li>~22% constructed pseudo-reasoning (functional group descriptions)</li>
<li>~8% teacher rationales (from DeepSeek-R1/o3-mini)</li>
</ul>
</li>
<li><strong>Mixing</strong>: Mixed with general data (including AM-Deepseek-R1-Distill-1.4M) in a 1:2 ratio</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Identification</strong>:</p>
<ul>
<li>Extends the <code>thermo</code> library&rsquo;s SMARTS list</li>
<li>For reactions, identifies &ldquo;reacting functional groups&rdquo; by finding reactants containing atoms involved in bond changes (reaction centers) that do not appear in the product</li>
</ul>
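<p>The reacting-group logic reduces to set operations once SMARTS matching has assigned atoms to groups. This is a set-based sketch under that assumption; the real toolkit does the pattern matching with a cheminformatics library, and the group names and indices below are illustrative.</p>

```python
def reacting_groups(groups, center_atoms, product_groups):
    """Identify 'reacting functional groups': groups in a reactant that
    contain at least one reaction-center atom (an atom whose bonds change)
    and that do not survive into the product.

    `groups` maps a group name to the set of reactant atom indices it covers;
    `product_groups` is the set of group names present in the product.
    """
    return {name for name, atoms in groups.items()
            if atoms & center_atoms and name not in product_groups}

# An ester whose carbonyl carbon sits in the reaction center, absent from the product.
hit = reacting_groups({"ester": {3, 4, 5}, "phenyl": {7, 8}}, {4}, {"phenyl"})
```
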
<p><strong>Mix-Sourced Distillation</strong>:</p>
<ul>
<li>Teacher models (DeepSeek-R1, o3-mini) are prompted with Question + Ground Truth + Functional Group Info to generate high-quality &ldquo;Thoughts&rdquo;</li>
<li>These rationales are distilled into the student model using a supervised fine-tuning loss across target tokens $y_t$:
$$ \mathcal{L}_{\text{SFT}} = - \sum_{t=1}^T \log P_\theta(y_t \mid x, y_{&lt;t}) $$</li>
</ul>
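<p>The SFT objective above is standard token-level negative log-likelihood. A minimal sketch, taking the model's probability for each ground-truth token given its prefix as input (a real implementation computes these from logits):</p>

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood summed over target tokens y_t, matching
    L_SFT = -sum_t log P(y_t | x, y_<t). `token_probs` holds the model's
    probability assigned to each ground-truth token given its prefix."""
    return -sum(math.log(p) for p in token_probs)

# A perfectly confident model incurs zero loss.
assert sft_loss([1.0, 1.0]) == 0.0
assert sft_loss([0.5]) == -math.log(0.5)
```
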
<p><strong>Reinforcement Learning</strong>:</p>
<ul>
<li><strong>Algorithm</strong>: The paper cites DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) as the RL framework; full details are in Appendix D of the paper. <em>Note: While the underlying DAPO framework is open-source, the specific chemistry-oriented RL pipeline and environment used for ChemDFM-R has not been publicly released.</em></li>
<li><strong>Hyperparameters</strong> (from paper appendix): Learning rate <code>5e-7</code>, rollout batch size <code>512</code>, training batch size <code>128</code></li>
<li><strong>Rewards</strong>: The reward system applies rule-based constraints focusing on physical form and chemical validity. The total reward $R(y, y^*)$ for a generated response $y$ given target $y^*$ combines a format adherence reward ($R_{\text{format}}$) and an accuracy reward ($R_{\text{acc}}$) evaluated on canonicalized SMILES:
$$ R(y, y^*) = R_{\text{format}}(y) + R_{\text{acc}}(\text{canonicalize}(y), \text{canonicalize}(y^*)) $$</li>
</ul>
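<p>A sketch of the rule-based reward. Binary 0/1 components are an assumption (the exact magnitudes are in the paper's Appendix D), and `canonicalize` and `format_ok` are stand-ins: a real pipeline would use a cheminformatics toolkit such as RDKit for SMILES canonicalization and a parser for the required output format.</p>

```python
def total_reward(response, target, canonicalize, format_ok):
    """R(y, y*) = R_format(y) + R_acc(canon(y), canon(y*)).

    Format adherence and canonicalized-SMILES accuracy each contribute a
    binary reward in this sketch.
    """
    r_format = 1.0 if format_ok(response) else 0.0
    r_acc = 1.0 if canonicalize(response) == canonicalize(target) else 0.0
    return r_format + r_acc

# Toy stand-ins: uppercasing as "canonicalization", non-empty as format check.
reward = total_reward("cco", "CCO", canonicalize=str.upper, format_ok=bool)
```
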
<h3 id="models">Models</h3>
<ul>
<li><strong>Base Model</strong>: Qwen2.5-14B</li>
<li><strong>ChemDFM-I</strong>: Result of instruction tuning the domain-pretrained model for 2 epochs</li>
<li><strong>ChemDFM-R</strong>: Result of applying mix-sourced distillation (1 epoch) followed by RL on ChemDFM-I. <em>Note: Model weights are publicly available on <a href="https://huggingface.co/OpenDFM/ChemDFM-R-14B">Hugging Face</a>.</em></li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware and training time details are described in the paper&rsquo;s appendices, which are not available in the extracted text. The details below are reported from the paper but could not be independently cross-verified against the main text:</p>
<ul>
<li><strong>Compute</strong>: NVIDIA A800 Tensor Core GPUs</li>
<li><strong>Training Time</strong>: 30,840 GPU hours total (Domain Pretraining: 24,728 hours; Instruction Tuning: 3,785 hours; Distillation: 2,059 hours; Reinforcement Learning: 268 hours)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>SciKnowEval</strong>: 19 tasks (text-centric, molecule-centric, reaction-centric)</li>
<li><strong>ChemEval</strong>: 36 tasks, categorized similarly</li>
</ul>
<p><strong>Key Metrics</strong>: Accuracy, F1 Score, BLEU score (with PRS normalization for ChemEval)</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>SciKnowEval (all)</th>
          <th>ChemEval* (all)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen2.5-14B-Instruct</td>
          <td>0.61</td>
          <td>0.57</td>
          <td>General-domain baseline</td>
      </tr>
      <tr>
          <td>ChemDFM-I</td>
          <td>0.69</td>
          <td>0.72</td>
          <td>After domain pretraining + instruction tuning</td>
      </tr>
      <tr>
          <td>ChemDFM-R</td>
          <td><strong>0.70</strong></td>
          <td><strong>0.78</strong></td>
          <td>After distillation + RL</td>
      </tr>
      <tr>
          <td>DeepSeek-R1</td>
          <td>0.62</td>
          <td>0.58</td>
          <td>General-domain reasoning model</td>
      </tr>
      <tr>
          <td>o4-mini</td>
          <td><strong>0.74</strong></td>
          <td>0.69</td>
          <td>Frontier reasoning model</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/OpenDFM/ChemDFM-R-14B">ChemDFM-R-14B</a></td>
          <td>Model</td>
          <td>AGPL-3.0</td>
          <td>Final reasoning model weights on Hugging Face</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OpenDFM/ChemFG-Tool">ChemFG-Tool</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Functional group identification toolkit (241 groups)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components</strong>: The 101B-token ChemFG pretraining dataset is not publicly released. The chemistry-oriented RL pipeline and training code are not open-sourced. The instruction tuning and distillation datasets are not available.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhao, Z., Chen, B., Wan, Z., Chen, L., Lin, X., Yu, S., Zhang, S., Ma, D., Zhu, Z., Zhang, D., Wang, H., Dai, Z., Wen, L., Chen, X., &amp; Yu, K. (2025). ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge. <em>arXiv preprint arXiv:2507.21990</em>. <a href="https://doi.org/10.48550/arXiv.2507.21990">https://doi.org/10.48550/arXiv.2507.21990</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{zhao2025chemdfmr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zihan Zhao and Bo Chen and Ziping Wan and Lu Chen and Xuanze Lin and Shiyang Yu and Situo Zhang and Da Ma and Zichen Zhu and Danyang Zhang and Huayang Wang and Zhongyang Dai and Liyang Wen and Xin Chen and Kai Yu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2507.21990}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2507.21990}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GP-MoLFormer: Molecular Generation via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/</guid><description>A 46.8M parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient property optimization.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-focus">Contribution and Taxonomic Focus</h2>
<p>This is primarily a <strong>Methodological</strong> paper, as it proposes a specific neural architecture (GP-MoLFormer) and a novel fine-tuning algorithm (Pair-tuning) for molecular generation. It validates these contributions against standard baselines (e.g., JT-VAE, <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b).</p>
<p>It also contains a secondary <strong>Theoretical</strong> contribution by establishing an empirical <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">scaling law</a> that relates inference compute (generation size) to the novelty of the generated molecules.</p>
<h2 id="motivation-data-scale-and-prompt-based-optimization">Motivation: Data Scale and Prompt-Based Optimization</h2>
<p>While large language models (LLMs) have transformed text generation, the impact of training data scale and memorization on <em>molecular</em> generative models remains under-explored. Specifically, there is a need to understand how training on billion-scale datasets affects the novelty of generated molecules and whether biases in public databases (like ZINC and PubChem) perpetuate memorization. Furthermore, existing optimization methods often require computationally expensive property predictors or reinforcement learning loops; there is a practical need for more efficient &ldquo;prompt-based&rdquo; optimization techniques.</p>
<h2 id="core-innovations-architecture-and-pair-tuning">Core Innovations: Architecture and Pair-Tuning</h2>
<ol>
<li><strong>Architecture</strong>: The application of a linear-attention transformer decoder with Rotary Positional Embeddings (RoPE) to generative chemistry, allowing for efficient training on 1.1 billion SMILES.</li>
<li><strong>Pair-Tuning</strong>: A novel, parameter-efficient fine-tuning method that uses property-ordered molecular pairs to learn &ldquo;soft prompts&rdquo; for optimization without updating the base model weights.</li>
<li><strong>Scaling Analysis</strong>: An extensive empirical investigation mapping the trade-off between inference compute (up to 10B generations) and chemical novelty, fitting an exponential decay curve that demonstrates how novelty declines as generation volume grows.</li>
</ol>
<h2 id="experimental-methodology-and-downstream-tasks">Experimental Methodology and Downstream Tasks</h2>
<p>The authors evaluated GP-MoLFormer on three distinct tasks, though the comparisons highlight the difficulty of evaluating foundation models against classical baselines:</p>
<ol>
<li><strong>De Novo Generation</strong>: Comparing validity, uniqueness, and novelty against baselines (CharRNN, VAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a>, MolGen-7b) on a held-out test set. Notably, this is an unequal comparison; most baselines were trained on the 1.6M molecule <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> dataset, whereas GP-MoLFormer uses up to 1.1B molecules, meaning performance gains are heavily driven by data scale.</li>
<li><strong>Scaffold-Constrained Decoration</strong>: Generating molecules from DRD2 active binder scaffolds and measuring the hit rate of active compounds against specialized scaffold decorators.</li>
<li><strong>Property-Guided Optimization</strong>: Using Pair-tuning to optimize for Drug-likeness (QED), Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a>, and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> binding activity, comparing the results to graph-based and reinforcement learning benchmarks.</li>
</ol>
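<p>As a sketch of how the distribution-level generation metrics in the first task are computed, the snippet below derives validity, uniqueness, and novelty from plain string sets. The <code>is_valid</code> check is a placeholder assumption; in practice the MOSES suite defines validity via RDKit parsing.</p>

```python
def is_valid(smiles: str) -> bool:
    # Placeholder validity check: non-empty string.
    # A real implementation would attempt an RDKit parse.
    return bool(smiles)

def generation_metrics(generated, training_set):
    """Return (validity, uniqueness, novelty) as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)                      # uniqueness is over valid outputs
    uniqueness = len(unique) / len(valid)
    novel = unique - set(training_set)       # novelty is over unique outputs
    novelty = len(novel) / len(unique)
    return validity, uniqueness, novelty

gen = ["CCO", "CCO", "c1ccccc1", "CCN", ""]
train = {"CCO", "CCN"}
print(generation_metrics(gen, train))  # → (0.8, 0.75, 0.3333333333333333)
```

All three quantities are computed over nested subsets (valid ⊇ unique ⊇ novel), which is why a model can score high validity yet low novelty, as reported for GP-MoLFormer.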
<p>Additionally, they performed a <strong>Scaling Study</strong>:</p>
<ul>
<li>Comparing models trained on raw (1.1B) vs. de-duplicated (650M) data.</li>
<li>Generating up to 10 billion molecules to fit empirical scaling laws for novelty.</li>
</ul>
<h2 id="key-findings-and-scaling-laws">Key Findings and Scaling Laws</h2>
<ul>
<li><strong>Scale-Driven Performance</strong>: GP-MoLFormer achieves high internal diversity and validity on generation metrics. However, its baseline novelty percentage (~32%) is considerably lower than classical models. The authors attribute this to the massive training scale forcing the model to heavily prioritize matching real-world molecule frequencies over pure exploration. GP-MoLFormer&rsquo;s advantage in generation metrics over LLM baselines like <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b likely stems heavily from its 10x larger training dataset rather than fundamental architectural superiority.</li>
<li><strong>Pair-Tuning Efficacy</strong>: The proposed pair-tuning method effectively optimizes properties (e.g., improving DRD2 activity scores) without requiring full model fine-tuning or external reward loops. While successful, the text-based generation yields ~94.5% validity during optimization, which lags behind graph and SELFIES-based baselines that guarantee 100% structural validity.</li>
<li><strong>Memorization vs. Novelty</strong>: Training on de-duplicated data (GP-MoLFormer-UNIQ) yields higher novelty (approx. 5-8% higher) than training on raw data, confirming that duplication bias in public databases leads directly to memorization.</li>
<li><strong>Inference Scaling Law</strong>: Novelty decays exponentially with generation size ($y = ae^{-bx}$), yet the model maintains generative capability (~16.7% novelty) even after generating an unprecedented 10 billion molecules.</li>
</ul>
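<p>The novelty scaling law $y = ae^{-bx}$ can be fit by ordinary least squares after log-linearizing ($\log y = \log a - bx$). The sketch below uses synthetic data lying exactly on such a curve, not the paper&rsquo;s measurements:</p>

```python
import math

def fit_exp_decay(xs, ys):
    """Fit y = a * exp(-b * x) via least squares on log y = log a - b x."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(logs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (l - my) for x, l in zip(xs, logs))
    slope = sxy / sxx             # equals -b
    intercept = my - slope * mx   # equals log a
    return math.exp(intercept), -slope

# Synthetic illustration: novelty fractions at increasing generation volume.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.32 * math.exp(-0.1 * x) for x in xs]
a, b = fit_exp_decay(xs, ys)
print(round(a, 3), round(b, 3))  # → 0.32 0.1
```

On noisy real data, fitting in log space down-weights large-novelty points; a nonlinear least-squares fit (e.g. <code>scipy.optimize.curve_fit</code>) would weight residuals directly.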
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Sources</strong>: A combination of <strong>PubChem</strong> (111M SMILES) and <strong>ZINC</strong> (1B SMILES) databases. Downloading and pre-training instructions are located in the repository&rsquo;s <code>data/README.md</code>.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>All SMILES were canonicalized using RDKit (no isomeric information).</li>
<li><strong>GP-MoLFormer (Base)</strong>: Trained on the full 1.1B dataset (includes duplicates).</li>
<li><strong>GP-MoLFormer-UNIQ</strong>: Trained on a de-duplicated subset of 650M SMILES.</li>
</ul>
</li>
<li><strong>Tokenization</strong>: Uses the tokenizer from Schwaller et al. (2019) with a vocabulary size of <strong>2,362 tokens</strong>.</li>
<li><strong>Filtering</strong>: Sequences restricted to a maximum length of <strong>202 tokens</strong>.</li>
</ul>
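<p>A minimal sketch of the de-duplication and length filtering described above, assuming canonicalization has already been done upstream with RDKit; the character-level <code>tokenize</code> is a stand-in for the Schwaller et al. regex tokenizer:</p>

```python
MAX_TOKENS = 202  # sequence-length cap reported in the paper

def tokenize(smiles: str):
    """Character-level stand-in for the Schwaller et al. SMILES tokenizer."""
    return list(smiles)

def preprocess(raw_smiles, deduplicate=True):
    """Length-filter (and optionally de-duplicate) a canonical SMILES corpus."""
    seen = set()
    kept = []
    for s in raw_smiles:
        if len(tokenize(s)) > MAX_TOKENS:
            continue  # drop over-length sequences
        if deduplicate:
            if s in seen:
                continue  # UNIQ variant: keep first occurrence only
            seen.add(s)
        kept.append(s)
    return kept

corpus = ["CCO", "CCO", "c1ccccc1", "C" * 300]
print(preprocess(corpus))                     # → ['CCO', 'c1ccccc1']
print(preprocess(corpus, deduplicate=False))  # → ['CCO', 'CCO', 'c1ccccc1']
```

The <code>deduplicate</code> flag mirrors the Base (1.1B, duplicates retained) vs. UNIQ (650M, de-duplicated) training variants.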
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pair-Tuning (Algorithm 1)</strong>:</p>
<ul>
<li><strong>Objective</strong>: Learn task-specific soft prompts $\phi_T$ to maximize the conditional probability of target molecule $b$ given a seed molecule $a$, where pair $(a, b)$ satisfies the property condition $b &gt; a$. The base model parameters $\theta$ remain frozen.</li>
<li><strong>Prompt Structure</strong>: Autoregressive training optimizes the continuous embeddings of $n$ enhancement tokens against the cross-entropy loss of the target sequence:
$$ \mathcal{L}(\phi_T) = - \sum_{i=1}^{|b|} \log P_{\theta}(b_i | \phi_T, a, b_{&lt;i}) $$</li>
<li><strong>Hyperparameters</strong>: Trained for 1,000 epochs with a batch size of 35 and a fixed learning rate of $3 \times 10^{-2}$.</li>
<li><strong>Inference</strong>: The learned prompt $\phi_T$ and seed molecule $a$ are prepended as context, and candidates are sampled autoregressively until a termination token is produced.</li>
</ul>
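<p>To make the objective concrete, the toy sketch below evaluates the pair-tuning loss with the frozen base model stubbed out as a uniform distribution over the 2,362-token vocabulary. <code>frozen_model_logprob</code> and its signature are illustrative assumptions, not the authors&rsquo; API; a real implementation would run the transformer decoder and backpropagate only into the prompt embeddings.</p>

```python
import math

VOCAB_SIZE = 2362  # tokenizer vocabulary size reported for GP-MoLFormer

def frozen_model_logprob(token, prompt, seed, prefix):
    """Stub for log P_theta(b_i | phi_T, a, b_<i) from the frozen base model.
    Here it ignores its context and returns a uniform log-probability."""
    return -math.log(VOCAB_SIZE)

def pair_tuning_loss(prompt, seed_smiles, target_smiles):
    """Cross-entropy of target molecule b given soft prompt + seed a (Eq. above)."""
    return -sum(
        frozen_model_logprob(tok, prompt, seed_smiles, target_smiles[:i])
        for i, tok in enumerate(target_smiles)
    )

# With a uniform stub, the loss reduces to |b| * log(VOCAB_SIZE).
loss = pair_tuning_loss(prompt=[0.0] * 8, seed_smiles="CCO", target_smiles="CCNO")
print(loss)
```

Only <code>prompt</code> would receive gradients during training; the stub makes explicit that $\theta$ never changes.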
<h3 id="models">Models</h3>
<ul>
<li><strong>Availability</strong>: The model trained on deduplicated data (GP-MoLFormer-UNIQ) is publicly available on <a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">Hugging Face</a>. The full 1.1B base model is not explicitly hosted. The source code repository includes a disclosure that IBM will not maintain the code going forward.</li>
<li><strong>Architecture</strong>: Transformer decoder (~47M parameters: 12 layers, 12 heads, hidden size 768).</li>
<li><strong>Attention Mechanism</strong>: Combines Linear Attention (Generalized Random Feature map, $\phi$) with Rotary Positional Embeddings (RoPE). To avoid the quadratic complexity of standard attention while maintaining relative positional awareness, RoPE is applied to queries ($Q$) and keys ($K$) prior to the random feature mapping:
$$ \text{Attention}(Q, K, V)_m = \frac{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle v_n}{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle} $$</li>
<li><strong>Inference Speed</strong>: ~3ms per forward pass on a single A100 GPU.</li>
</ul>
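<p>A self-contained sketch of the attention formula above, with $\mathrm{elu}(x)+1$ standing in for the paper&rsquo;s generalized random feature map $\phi$ (an assumption borrowed from standard linear attention) and a conventional rotary embedding for $R$; all dimensions and inputs are toy values.</p>

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a rotary position embedding R_pos to a vector, pair by pair."""
    out = []
    for i in range(0, len(vec), 2):
        theta = pos / base ** (i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def phi(vec):
    """Positive feature map: elu(x)+1, a stand-in for random features."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_attention(Q, K, V):
    """out_m = sum_n <phi(R_m q_m), phi(R_n k_n)> v_n / sum_n <...>."""
    fK = [phi(rope(k, n)) for n, k in enumerate(K)]
    out = []
    for m, q in enumerate(Q):
        fq = phi(rope(q, m))
        weights = [dot(fq, fk) for fk in fK]  # positive, unnormalized
        z = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / z
                    for d in range(len(V[0]))])
    return out

Q = [[0.1, 0.2, -0.3, 0.4], [0.0, 1.0, 0.5, -0.5]]
K = [[0.2, 0.1, 0.0, -0.1], [0.3, -0.2, 0.4, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(linear_attention(Q, K, V))  # each row is a convex combination of V rows
```

Because $\phi$ is strictly positive, the weights form a proper convex combination; in a causal decoder the sums over $n$ run only up to $m$ and can be accumulated incrementally, which is the source of the linear complexity.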
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Generation Quality Metrics</strong>: Validity, Uniqueness, Novelty (<a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> suite), <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance (FCD)</a>, Scaffold similarity (Scaf), and Similarity to Nearest Neighbor (SNN).</li>
<li><strong>MoLFormer-Based Metrics</strong>: The authors introduce Fréchet <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> Distance (FMD) and MoLFormer-space IntDiv2 to measure distributional similarity using their own pre-trained continuous embeddings instead of standard fingerprints.</li>
<li><strong>Optimization Metrics</strong>: Penalized logP (calculated as $\text{logP} - \text{SA} - \max(\text{maxRingSize} - 6, 0)$), Drug-likeness (QED), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> activity scores.</li>
<li><strong>Scaling Metrics</strong>: Empirical fit for novelty decay: $y = ae^{-bx}$.</li>
</ul>
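<p>For reference, the penalized logP combination reduces to a one-line function; its inputs (logP, synthetic accessibility score, largest ring size) would come from a cheminformatics toolkit such as RDKit:</p>

```python
def penalized_logp(logp: float, sa_score: float, max_ring_size: int) -> float:
    """Penalized logP = logP - SA - max(maxRingSize - 6, 0).
    Penalizes hard-to-synthesize molecules and rings larger than 6 atoms."""
    return logp - sa_score - max(max_ring_size - 6, 0)

print(penalized_logp(2.5, 3.0, 8))  # → -2.5  (ring penalty of 2 applies)
```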
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 16 x NVIDIA A100 (80 GB) GPUs across 2 nodes connected via EDR Infiniband.</li>
<li><strong>Training Time</strong>:
<ul>
<li>GP-MoLFormer (1.1B data): ~115 hours total (28.75 hours/epoch for 4 epochs).</li>
<li>GP-MoLFormer-UNIQ (650M data): ~80 hours total.</li>
</ul>
</li>
<li><strong>Hyperparameters</strong>: Used a batch size of 1,600 molecules per GPU with a fixed learning rate of $1.6 \times 10^{-4}$ (scaled by up to an $8\times$ factor as the number of GPUs increased).</li>
<li><strong>Optimization</strong>: Used distributed data-parallel training and adaptive bucketing by sequence length to handle scale.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/gp-molformer/">GP-MoLFormer (GitHub)</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation; IBM will not maintain going forward</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">GP-MoLFormer-Uniq (Hugging Face)</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained on 650M de-duplicated SMILES</td>
      </tr>
  </tbody>
</table>
<p>The full 1.1B base model weights are not publicly hosted. The training data (PubChem and ZINC) is publicly available, and instructions for downloading and pre-processing are in the repository&rsquo;s <code>data/README.md</code>.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Hoffman, S. C., Chenthamarakshan, V., Navratil, J., Mroueh, Y., &amp; Das, P. (2025). GP-MoLFormer: A Foundation Model For Molecular Generation. <em>Digital Discovery</em>, 4(10), 2684&ndash;2696. <a href="https://doi.org/10.1039/D5DD00122F">https://doi.org/10.1039/D5DD00122F</a></p>
<p><strong>Publication</strong>: Digital Discovery, vol. 4, no. 10, pp. 2684&ndash;2696 (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2025gpmolformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GP-MoLFormer: a foundation model for molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Hoffman, Samuel C and Chenthamarakshan, Vijil and Navratil, Jiri and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2684--2696}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D5DD00122F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-2: Scaling Molecular Transformers to 77M</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-2/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-2/</guid><description>Optimizing transformer pretraining for molecules using MLM vs MTR objectives, scaling to 77M compounds from PubChem for improved property prediction.</description><content:encoded><![CDATA[<h2 id="classifying-chemberta-2s-methodological-contributions">Classifying ChemBERTa-2&rsquo;s Methodological Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper with a secondary <strong>Resource</strong> contribution.</p>
<p>It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine &ldquo;how well&rdquo; these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models and establish a benchmark on a massive 77M compound dataset.</p>
<p><strong>Key methodological indicators</strong>:</p>
<ul>
<li><strong>Baseline comparison</strong>: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa-1</a>) with prominent benchmark tables</li>
<li><strong>Ablation studies</strong>: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size</li>
<li><strong>Scaling analysis</strong>: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance</li>
</ul>
<h2 id="motivations-for-scaling-molecular-transformers">Motivations for Scaling Molecular Transformers</h2>
<p>The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a &ldquo;chemical foundation model&rdquo;.</p>
<p><strong>Key motivations</strong>:</p>
<ul>
<li><strong>Label scarcity</strong>: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant</li>
<li><strong>Scaling hypothesis</strong>: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP</li>
<li><strong>Efficiency</strong>: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and weakly supervised (MTR, using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> computed properties as labels) approaches</li>
</ul>
<h2 id="novelty-in-multi-task-regression-objectives">Novelty in Multi-Task Regression Objectives</h2>
<p><strong>Scale</strong>: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>).</p>
<p><strong>Pipeline optimization</strong>: A direct, controlled comparison of <strong>Masked Language Modeling (MLM)</strong> vs. <strong>Multi-Task Regression (MTR)</strong> pretraining objectives on identical datasets.</p>
<p><strong>Proxy selection</strong>: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.</p>
<h2 id="experimental-pretraining-setup-on-77m-compounds">Experimental Pretraining Setup on 77M Compounds</h2>
<h3 id="pretraining-setup">Pretraining Setup</h3>
<p><strong>Datasets</strong>: Subsets of <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> containing 5M, 10M, and 77M unique SMILES.</p>
<p><strong>Tasks</strong>:</p>
<ul>
<li><strong>MLM</strong>: Masking 15% of tokens (following RoBERTa procedure). The model is optimized by minimizing the cross-entropy loss over the predicted masked tokens:
$$ \mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}) $$
where $\mathcal{M}$ represents the set of masked token indices.</li>
<li><strong>MTR</strong>: Predicting 200 calculated molecular properties (via RDKit) simultaneously using a mean squared error objective:
$$ \mathcal{L}_{MTR} = \frac{1}{200} \sum_{j=1}^{200} \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{ij} - y_{ij} \right)^2 $$
Continuous target labels $y_{ij}$ are mean-normalized prior to training to equilibrate the disparate scales of different chemical properties.</li>
</ul>
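<p>A minimal sketch of the MTR objective above, averaging per-property MSEs over a toy batch (targets assumed already mean-normalized; real training would use 200 properties, not 2):</p>

```python
def mtr_loss(preds, targets):
    """Multi-task regression loss: mean over properties of per-property MSE.
    preds/targets are N molecules x T properties (lists of lists)."""
    n = len(targets)
    t = len(targets[0])
    total = 0.0
    for j in range(t):
        # MSE for property j across the batch of N molecules
        total += sum((preds[i][j] - targets[i][j]) ** 2 for i in range(n)) / n
    return total / t

preds = [[0.1, 0.0], [0.2, 1.0]]
targets = [[0.0, 0.0], [0.0, 1.0]]
# Per-property MSEs are 0.025 and 0.0; their mean is 0.0125.
print(mtr_loss(preds, targets))
```

Mean-normalizing the targets beforehand keeps any single large-scale property (e.g. molecular weight) from dominating this average.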
<p><strong>Hyperparameter search</strong>: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.</p>
<h3 id="downstream-validation">Downstream Validation</h3>
<p><strong>Finetuning</strong>: Evaluated on 8 tasks from <strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).</p>
<p><strong>Analysis</strong>: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.</p>
<h2 id="key-performance-outcomes-and-scaling-realities">Key Performance Outcomes and Scaling Realities</h2>
<p><strong>Highly competitive performance</strong>: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks, though the margins demonstrate that task-specific baselines remain notably robust.</p>
<p><strong>MTR superiority</strong>: Models pretrained on Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained on MLM on every finetuning task evaluated. MTR is substantially slower than MLM due to the larger input size from the 200-element label vector, but MLM loss serves as a reliable proxy for MTR loss, enabling cheaper architecture search before committing to full MTR pretraining.</p>
<p><strong>Scaling laws versus downstream utility</strong>: Pretraining loss improved by 25-35% when increasing the dataset from 5M to 77M compounds. However, this improvement in pretraining loss does not uniformly transfer to downstream tasks. For MTR models, SR-p53 ROC-AUC decreases monotonically from 0.834 (5M) to 0.827 (10M) to 0.817 (77M), and Lipophilicity RMSE is worse at 77M (0.798) than at 5M (0.758), despite a dip at 10M (0.744). This variability in transfer challenges the assumption that pretraining improvements always yield downstream gains.</p>
<p><strong>Transfer learning</strong>: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The pretraining corpus is derived from <strong>PubChem</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>PubChem</td>
          <td>77M SMILES</td>
          <td>Canonicalized and globally shuffled. Subsets of 5M and 10M used. <strong>Note: Exact splits and datasets are not published.</strong></td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>PubChem</td>
          <td>100k SMILES</td>
          <td>A fixed set held out from the 77M corpus. <strong>Note: Exact 100k subset is not published.</strong></td>
      </tr>
      <tr>
          <td><strong>MTR Labels</strong></td>
          <td>RDKit</td>
          <td>200 props</td>
          <td>200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. <strong>Note: Calculated labels are not published and must be re-computed.</strong></td>
      </tr>
      <tr>
          <td><strong>Finetuning</strong></td>
          <td>MoleculeNet</td>
          <td>1.5k - 8k</td>
          <td>Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pretraining Objectives:</strong></p>
<ol>
<li><strong>Masked Language Modeling (MLM)</strong>: Follows RoBERTa procedure. Masks 15% of tokens. Max sequence length 512.</li>
<li><strong>Multi-Task Regression (MTR)</strong>: Predicting 200 RDKit properties. Labels are mean-normalized.</li>
</ol>
<p><strong>Tokenizer:</strong></p>
<ul>
<li>Dictionary of common SMILES characters</li>
<li>Maximum vocabulary size: <strong>591 tokens</strong></li>
</ul>
<p><strong>Optimization:</strong></p>
<ul>
<li><strong>Patience</strong>: Early-stopping patience set to one full pass through the dataset, ensuring every training example is seen at least once</li>
<li><strong>Hyperparameter search</strong>: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate. <strong>Note: The precise configuration of the winning models that were scaled to 77M is absent from the paper.</strong></li>
</ul>
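<p>The random search can be sketched as sampling 50 independent configurations from a discrete space. The candidate values below are illustrative assumptions, since, as noted, the paper does not publish its exact grid or winning configurations:</p>

```python
import random

# Hypothetical search space over the dimensions the paper says were varied.
SPACE = {
    "hidden_size": [384, 512, 768],
    "num_attention_heads": [6, 8, 12],
    "dropout": [0.1, 0.15, 0.2],
    "intermediate_size": [1024, 2048, 3072],
    "num_hidden_layers": [3, 6, 12],
    "learning_rate": [1e-5, 5e-5, 1e-4],
}

def sample_configs(n, seed=0):
    """Draw n random configurations, one choice per hyperparameter."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in SPACE.items()} for _ in range(n)]

configs = sample_configs(50)
print(len(configs), configs[0]["hidden_size"] in SPACE["hidden_size"])  # → 50 True
```

Each sampled configuration would be pretrained on the 5M subset with MLM (the cheap proxy), and the top 5 by validation loss scaled to 10M and 77M.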
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Based on <strong>RoBERTa</strong> (HuggingFace implementation)</li>
<li><strong>Parameter scale</strong>: Models ranged between <strong>5M and 46M parameters</strong></li>
<li><strong>Selection</strong>: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset</li>
<li><strong>Checkpoints</strong>: Pre-trained weights are hosted by DeepChem on <a href="https://huggingface.co/DeepChem">Hugging Face</a>. Direct links include <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MTR">DeepChem/ChemBERTa-77M-MTR</a> and <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MLM">DeepChem/ChemBERTa-77M-MLM</a> (Note: Model cards are currently empty).</li>
<li><strong>Code Reference</strong>: While the <a href="https://github.com/deepchem/deepchem">DeepChem</a> repository is referenced for code, isolated training scripts tailored to recreate ChemBERTa-2&rsquo;s exact pipeline are not separated from the generalized deepchem library tooling.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Benchmarks were performed on <strong>MoleculeNet</strong> using DeepChem.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Tasks</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RMSE</strong> ($\downarrow$)</td>
          <td>Delaney, Lipo, BACE (Reg), Clearance</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.8).</td>
      </tr>
      <tr>
          <td><strong>ROC-AUC</strong> ($\uparrow$)</td>
          <td>BBBP, ClinTox, HIV, Tox21, BACE (Cls)</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: AWS EC2 instances with <strong>Nvidia T4 GPUs</strong></li>
<li><strong>Strategy</strong>: AWS Spot instances were used to reduce cost; implemented frequent checkpointing to handle interruptions.</li>
<li><strong>Note</strong>: For MTR, they wrote a custom data loader wrapper around HuggingFace&rsquo;s text loader to handle CSV parsing efficiency, as the default CSV loader was a major bottleneck for the 200-element target vectors.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. <em>arXiv preprint arXiv:2209.01712</em>. <a href="https://doi.org/10.48550/arXiv.2209.01712">https://doi.org/10.48550/arXiv.2209.01712</a></p>
<p><strong>Publication</strong>: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa-1 Paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{ahmadChemBERTa2ChemicalFoundation2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemformer: A Pre-trained Transformer for Comp Chem</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/chemformer/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/chemformer/</guid><description>BART-based Transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical sequence tasks.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It proposes an architecture adaptation (Chemformer based on BART) and a specific pre-training strategy (&ldquo;Combined&rdquo; masking and augmentation). The paper validates this method by benchmarking against established models on multiple tasks, including direct synthesis, retrosynthesis, and molecular optimization. It also includes a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution by making the pre-trained models and code available.</p>
<h2 id="motivation-computational-bottlenecks-in-cheminformatics">Motivation: Computational Bottlenecks in Cheminformatics</h2>
<p>Existing Transformer models for cheminformatics are often developed for single applications and are computationally expensive to train from scratch. For example, training a Molecular Transformer for reaction prediction can take days, limiting hyperparameter exploration. Self-supervised pre-training (like BERT or T5) has significantly advanced NLP by reducing fine-tuning time and improving performance. In chemistry, applications have traditionally focused on task-specific datasets or encoder-only architectures, which perform poorly on sequence generation tasks. The authors aim to use transfer learning on a large unlabelled dataset to create a model that converges quickly and performs well across diverse sequence-to-sequence and discriminative tasks.</p>
<h2 id="core-innovation-bart-architecture-and-combined-pre-training">Core Innovation: BART Architecture and Combined Pre-training</h2>
<p>The primary insight lies in the adaptation of the <strong>BART architecture</strong> for chemistry and the introduction of a <strong>&ldquo;Combined&rdquo; self-supervised pre-training task</strong>.</p>
<ul>
<li><strong>Architecture</strong>: Chemformer uses the BART encoder-decoder structure, allowing it to handle both discriminative (property prediction) and generative (reaction prediction) tasks efficiently. This provides an alternative to encoder-only (BERT) or decoder-only (GPT) models.</li>
<li><strong>Combined Pre-training</strong>: The authors introduce a task that applies both <strong>Span Masking</strong> (randomly replacing tokens with <code>&lt;mask&gt;</code>) and <strong><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> Augmentation</strong> (permuting atom order, see <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">Randomized SMILES</a>) simultaneously. Formally, given a canonical SMILES sequence $x$, a corrupted sequence $\tilde{x} = \text{Mask}(\text{Augment}(x))$ is generated. The model is trained using an autoregressive cross-entropy loss to reconstruct the canonical sequence from the corrupted input:
$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{&lt;t}, \tilde{x}) $$</li>
<li><strong>Tunable Augmentation</strong>: A downstream augmentation strategy is proposed where the probability of augmenting the input/output SMILES ($p_{aug}$) is a tunable hyperparameter, performed on-the-fly.</li>
</ul>
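<p>The corruption pipeline $\tilde{x} = \text{Mask}(\text{Augment}(x))$ can be sketched as below. Span masking is simplified to per-token masking for brevity, and <code>augment</code> is an identity placeholder, since true atom-order permutation requires generating randomized SMILES with RDKit:</p>

```python
import random

def span_mask(tokens, mask_rate=0.15, rng=None):
    """Replace a random subset of tokens with <mask>
    (span masking simplified to independent token masking)."""
    rng = rng or random.Random(0)
    return [t if rng.random() > mask_rate else "<mask>" for t in tokens]

def augment(smiles):
    """Placeholder for SMILES augmentation (atom-order permutation);
    a real implementation would emit a randomized SMILES via RDKit."""
    return smiles  # identity stand-in

def corrupt(smiles, rng=None):
    """Build the corrupted input x~ = Mask(Augment(x))."""
    return span_mask(list(augment(smiles)), rng=rng)

x = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
x_tilde = corrupt(x, rng=random.Random(42))
print("".join(x_tilde))
```

The decoder is then trained to reconstruct the canonical sequence $x$ from $\tilde{x}$, so the model must simultaneously undo masking and re-canonicalize atom order.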
<h2 id="experimental-setup-and-pre-training-tasks">Experimental Setup and Pre-training Tasks</h2>
<p>The authors pre-trained Chemformer on <strong>100 million molecules</strong> from ZINC-15 and fine-tuned it on three distinct task types:</p>
<ol>
<li><strong>Seq2Seq Reaction Prediction</strong>:
<ul>
<li><em>Direct Synthesis</em>: USPTO-MIT dataset (Mixed and Separated).</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: USPTO-50K dataset (see also <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a>).</li>
</ul>
</li>
<li><strong>Molecular Optimization</strong>: Generating molecules with improved properties (<a href="https://en.wikipedia.org/wiki/Distribution_coefficient">LogD</a>, solubility, clearance) starting from ChEMBL matched molecular pairs.</li>
<li><strong>Discriminative Tasks</strong>:
<ul>
<li><em><a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a></em>: Predicting properties (ESOL, FreeSolv, Lipophilicity) from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
<li><em>Bioactivity</em>: Predicting pXC50 values for 133 genes using ExCAPE data.</li>
</ul>
</li>
</ol>
<p>Ablation studies compared three pre-training strategies (Masking, Augmentation, Combined) against a randomly initialized baseline.</p>
<h2 id="results-trade-offs-and-conclusions">Results, Trade-offs, and Conclusions</h2>
<ul>
<li><strong>Performance</strong>: Chemformer achieved <strong>competitive top-1 accuracy</strong> on USPTO-MIT (91.3% Mixed) and USPTO-50K (53.6-54.3%), outperforming the Augmented Transformer and graph-based models (GLN, GraphRetro).</li>
<li><strong>Convergence Speed</strong>: Pre-training markedly accelerated convergence; fine-tuning for just 20 epochs (~30 minutes) outperformed previous baselines trained for significantly longer.</li>
<li><strong>Pre-training Tasks</strong>: The &ldquo;Combined&rdquo; task generally performed best for reaction prediction and bioactivity, while &ldquo;Masking&rdquo; was superior for molecular optimization.</li>
<li><strong>Augmentation Trade-off</strong>: The augmentation strategy improved top-1 accuracy but significantly degraded top-5/10 accuracy because beam search outputs became populated with augmented versions of the same molecule. This presents a considerable limitation for practical applications like retrosynthesis mapping, where retrieving a diverse set of candidate reactions is often critical.</li>
<li><strong>Discriminative Evaluation Caveats</strong>: Chemformer underperformed specialized baselines (like D-MPNN or <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>) on small discriminative datasets. The authors note that direct comparison is difficult: Chemformer was trained simultaneously on multiple subtasks (multi-task learning), while the literature baselines were trained and tuned on each subtask separately. Additionally, the Chemformer encoder uses fewer than 20M parameters compared to MolBERT&rsquo;s approximately 85M, and Chemformer&rsquo;s pre-training does not include molecular property objectives. For other transfer learning approaches to QSAR, see <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a>.</li>
<li><strong>Pre-training Data Scope</strong>: The 100M pre-training dataset from ZINC-15 was selected with constraints on molecular weight ($\le 500$ Da) and LogP ($\le 5$), focusing the learned representations on small, drug-like molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><em>Note: The primary GitHub repository for Chemformer was officially archived on February 11, 2026. Pre-trained weights and datasets used in the paper are still hosted externally on <a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Box</a>. Active development of Chemformer models has moved to the <a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels</a> repository.</em></p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/Chemformer">Chemformer (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Archived; original PyTorch implementation</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Active successor repository</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Pre-trained weights (Box)</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Base and Large model checkpoints</td>
      </tr>
  </tbody>
</table>
<p>The following datasets were used for pre-training and benchmarking.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Pre-training</strong></td>
          <td style="text-align: left">ZINC-15</td>
          <td style="text-align: left">100M</td>
          <td style="text-align: left">Selected subset (reactive, annotated purchasability, MW $\le 500$, LogP $\le 5$). Split: 99% Train / 0.5% Val / 0.5% Test.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Direct Synthesis</strong></td>
          <td style="text-align: left">USPTO-MIT</td>
          <td style="text-align: left">~470k</td>
          <td style="text-align: left">Evaluated on &ldquo;Mixed&rdquo; and &ldquo;Separated&rdquo; variants.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Retrosynthesis</strong></td>
          <td style="text-align: left">USPTO-50K</td>
          <td style="text-align: left">~50k</td>
          <td style="text-align: left">Standard benchmark for retrosynthesis.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimization</strong></td>
          <td style="text-align: left">ChEMBL MMPs</td>
          <td style="text-align: left">~160k Train</td>
          <td style="text-align: left">Matched Molecular Pairs for LogD, solubility, and clearance optimization.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Properties</strong></td>
          <td style="text-align: left">MoleculeNet</td>
          <td style="text-align: left">Small</td>
          <td style="text-align: left">ESOL (1128), FreeSolv (642), Lipophilicity (4200).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Bioactivity</strong></td>
          <td style="text-align: left">ExCAPE</td>
          <td style="text-align: left">~312k</td>
          <td style="text-align: left">133 gene targets; &gt;1200 compounds per gene.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenization (523 tokens total) derived from ChEMBL 27 canonical SMILES.</li>
<li><strong>Augmentation</strong>: SMILES enumeration (permuting atom order) used for pre-training and on-the-fly during fine-tuning ($p_{aug}=0.5$ for Seq2Seq, $p_{aug}=1.0$ for discriminative).</li>
</ul>
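<p>The on-the-fly augmentation reduces to a per-example coin flip against $p_{aug}$. A sketch of that control flow; <code>enumerate_smiles</code> here is a hypothetical stub standing in for a real SMILES enumerator (in practice an RDKit-style randomized atom traversal would be used):</p>

```python
import random

def enumerate_smiles(smiles, rng):
    # Hypothetical stub: a real implementation would rewrite the SMILES
    # with a randomized atom traversal; reversal is only a placeholder.
    return smiles[::-1]

def maybe_augment(smiles, p_aug, rng):
    """Return an augmented SMILES with probability p_aug, else the original."""
    if rng.random() < p_aug:
        return enumerate_smiles(smiles, rng)
    return smiles

rng = random.Random(0)
# p_aug = 0.5 for Seq2Seq fine-tuning; p_aug = 1.0 always augments
# (the discriminative setup); p_aug = 0.0 disables augmentation.
example = maybe_augment("CCO", 0.5, rng)
```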
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pre-training Tasks</strong>:
<ol>
<li><em>Masking</em>: Span masking (BART style).</li>
<li><em>Augmentation</em>: Input is a randomized SMILES; target is canonical SMILES.</li>
<li><em>Combined</em>: Input is augmented <em>then</em> masked; target is canonical SMILES.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$).</li>
<li>Schedule: Linear warm-up (8000 steps) for pre-training; One-cycle schedule for fine-tuning.</li>
</ul>
</li>
<li><strong>Inference</strong>: <a href="https://en.wikipedia.org/wiki/Beam_search">Beam search</a> with width 10 for Seq2Seq tasks. Used <code>molbart/inference_score.py</code> and <code>molbart/retrosynthesis/round_trip_inference.py</code> for standard and round-trip validation.</li>
</ul>
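<p>The masking half of the corruption schemes can be illustrated concretely. A sketch of BART-style span masking over SMILES tokens, assuming character-level tokens for readability; the span-length draw is a simplification (BART samples Poisson-distributed span lengths), so this is a sketch of the idea, not the paper's exact procedure:</p>

```python
import random

def span_mask(tokens, mask_token="<MASK>", p=0.15, max_span=3, rng=None):
    """Replace contiguous token spans with a single mask token until
    roughly a fraction p of the input tokens has been covered."""
    rng = rng or random.Random()
    out, i, masked = [], 0, 0
    budget = max(1, int(p * len(tokens)))
    while i < len(tokens):
        if masked < budget and rng.random() < p:
            span = 1 + rng.randrange(max_span)  # crude span-length draw
            out.append(mask_token)              # whole span -> one token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("CC(=O)Oc1ccccc1")
corrupted = span_mask(tokens, rng=random.Random(7))
```

<p>Because a multi-token span collapses to a single mask token, the model must also infer how many tokens are missing, which is what makes the BART objective harder than single-token masking.</p>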
<h3 id="models">Models</h3>
<p>Two model sizes were trained. Both use the Pre-Norm Transformer layout with GELU activation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Chemformer (Base)</th>
          <th style="text-align: left">Chemformer-Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">6</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model Dimension</strong></td>
          <td style="text-align: left">512</td>
          <td style="text-align: left">1024</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Feed-forward Dim</strong></td>
          <td style="text-align: left">2048</td>
          <td style="text-align: left">4096</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention Heads</strong></td>
          <td style="text-align: left">8</td>
          <td style="text-align: left">16</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~45M</td>
          <td style="text-align: left">~230M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Pre-training Task</strong></td>
          <td style="text-align: left">All 3 variants</td>
          <td style="text-align: left">Combined only</td>
      </tr>
  </tbody>
</table>
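<p>The parameter counts in the table can be sanity-checked from the other columns. A rough back-of-the-envelope estimate, ignoring biases, layer norms, and positional embeddings, and assuming the 523-token vocabulary described above with tied embeddings:</p>

```python
def approx_bart_params(layers, d, ff, vocab=523):
    """Rough parameter estimate for a BART-style encoder-decoder.
    Encoder layer: self-attention (4*d*d) + feed-forward (2*d*ff).
    Decoder layer: adds cross-attention (another 4*d*d).
    The number of attention heads does not change the count, since
    the Q/K/V/output projections are d x d regardless of head split."""
    enc = layers * (4 * d * d + 2 * d * ff)
    dec = layers * (8 * d * d + 2 * d * ff)
    emb = vocab * d  # token embeddings (tied with the output projection)
    return enc + dec + emb

base = approx_bart_params(6, 512, 2048)    # ~44M vs reported ~45M
large = approx_bart_params(8, 1024, 4096)  # ~235M vs reported ~230M
```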
<h3 id="evaluation">Evaluation</h3>
<p>Comparisons relied on Top-N accuracy for reaction tasks and validity metrics for optimization.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Direct Synthesis (Sep)</td>
          <td style="text-align: left"><strong>92.8%</strong> (Large)</td>
          <td style="text-align: left">91.1% (Aug Transformer)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Retrosynthesis</td>
          <td style="text-align: left"><strong>54.3%</strong> (Large)</td>
          <td style="text-align: left">53.7% (GraphRetro) / 52.5% (GLN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Desirable %</strong></td>
          <td style="text-align: left">Mol Optimization</td>
          <td style="text-align: left"><strong>75.0%</strong> (Base-Mask)</td>
          <td style="text-align: left">70.2% (Transformer-R)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>RMSE</strong></td>
          <td style="text-align: left">Lipophilicity</td>
          <td style="text-align: left">0.598 (Combined)</td>
          <td style="text-align: left">0.555 (D-MPNN)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 NVIDIA V100 GPUs (batch size 128 per GPU).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Pre-training: 2.5 days (Base) / 6 days (Large) for 1M steps.</li>
<li>Fine-tuning: ~20-40 epochs for reaction prediction (&lt;12 hours).</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Irwin, R., Dimitriadis, S., He, J., &amp; Bjerrum, E. J. (2022). Chemformer: a pre-trained transformer for computational chemistry. <em>Machine Learning: Science and Technology</em>, 3(1), 015022. <a href="https://doi.org/10.1088/2632-2153/ac3ffb">https://doi.org/10.1088/2632-2153/ac3ffb</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{irwinChemformerPretrainedTransformer2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemformer: A Pre-Trained Transformer for Computational Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Irwin, Ross and Dimitriadis, Spyridon and He, Jiazhen and Bjerrum, Esben Jannik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{015022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2632-2153}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/ac3ffb}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa: Molecular Property Prediction via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</guid><description>A systematic evaluation of RoBERTa transformers pretrained on 77M PubChem SMILES for molecular property prediction tasks.</description><content:encoded><![CDATA[<h2 id="taxonomy-and-paper-contributions">Taxonomy and Paper Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$), with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>It is a methodological investigation because it systematically evaluates a specific architecture (Transformers/RoBERTa) against established State-of-the-Art (SOTA) baselines like directed Message Passing Neural Networks (D-MPNNs) to determine &ldquo;how well does this work?&rdquo; in the chemical domain. It ablates dataset size, tokenization, and input representation.</p>
<p>It is also a resource paper as it introduces &ldquo;PubChem-77M,&rdquo; a curated dataset of 77 million SMILES strings designed to facilitate large-scale self-supervised pretraining for the community.</p>
<h2 id="overcoming-data-scarcity-in-property-prediction">Overcoming Data Scarcity in Property Prediction</h2>
<p>The primary motivation is <strong>data scarcity</strong> in molecular property prediction. Graph Neural Networks (GNNs) achieve strong performance on property prediction tasks when provided with sufficient labeled data. Generating these labels requires costly and time-consuming laboratory testing, leading to severe data scarcity in specialized chemical domains.</p>
<p>Massive quantities of <strong>unlabeled chemical structure data</strong> exist in the form of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. Inspired by the success of Transformers in NLP, where self-supervised pretraining on large corpora yields strong transfer learning, the authors aim to use these unlabeled datasets to learn effective molecular representations. Additionally, Transformers benefit from a mature software ecosystem (HuggingFace) that offers efficiency advantages over GNNs.</p>
<h2 id="pretraining-scaling-laws-and-novelty">Pretraining Scaling Laws and Novelty</h2>
<p>Previous works applied Transformers to SMILES strings. This paper advances the field by systematically evaluating scaling laws and architectural components for this domain. Specifically:</p>
<ul>
<li><strong>Scaling Analysis</strong>: It explicitly tests how pretraining dataset size (100K to 10M) impacts downstream performance.</li>
<li><strong>Tokenizer Comparison</strong>: It compares standard NLP <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte-Pair Encoding (BPE)</a> against a chemically-aware &ldquo;SmilesTokenizer&rdquo;.</li>
<li><strong>Representation Comparison</strong>: It evaluates if the robust <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representation offers advantages over standard SMILES in a Transformer context.</li>
</ul>
<h2 id="experimental-setup-pretraining-and-finetuning">Experimental Setup: Pretraining and Finetuning</h2>
<p>The authors trained <strong>ChemBERTa</strong> (based on RoBERTa) using Masked Language Modeling (MLM) on subsets of the <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. The core training objective minimizes the cross-entropy loss over a corrupted input in which a subset of input tokens, denoted $\mathcal{M}$, is masked:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $x_i$ is the original token at a masked position, $x_{\setminus \mathcal{M}}$ is the corrupted SMILES context string, and $\theta$ denotes the network parameters.</p>
<ul>
<li><strong>Pretraining</strong>: Models were pretrained on dataset sizes of 100K, 250K, 1M, and 10M compounds.</li>
<li><strong>Baselines</strong>: Performance was compared against D-MPNN (Graph Neural Network), Random Forest (RF), and SVM using 2048-bit Morgan Fingerprints.</li>
<li><strong>Downstream Tasks</strong>: Finetuning was performed individually on small <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks: BBBP (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>), ClinTox (clinical toxicity), HIV, and Tox21 (p53 stress-response). This poses a transfer learning challenge, as the model must adapt from pretraining on 10 million molecules to classifying datasets ranging from ~1.5K to ~41K examples.</li>
<li><strong>Ablations</strong>:
<ul>
<li><strong>Tokenization</strong>: BPE vs. SmilesTokenizer on the 1M dataset, evaluated on Tox21.</li>
<li><strong>Input</strong>: SMILES vs. SELFIES strings on the Tox21 task.</li>
</ul>
</li>
</ul>
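<p>The MLM corruption selects positions independently, in contrast to BART-style span masking. A sketch of the 15% masking step, with character-level tokens for illustration; the full BERT recipe (80% mask / 10% random token / 10% unchanged among selected positions) is omitted for brevity:</p>

```python
import random

def mlm_mask(tokens, mask_token="[MASK]", p=0.15, rng=None):
    """Independently select ~p of the positions, replace them with a
    mask token, and return the corrupted sequence plus the set M of
    masked positions (the positions the loss is computed over)."""
    rng = rng or random.Random()
    corrupted, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            corrupted.append(mask_token)
            masked_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions

tokens = list("c1ccccc1O")  # phenol, character-tokenized for illustration
corrupted, M = mlm_mask(tokens, rng=random.Random(3))
```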
<h2 id="results-vs-graph-neural-network-baselines">Results vs. Graph Neural Network Baselines</h2>
<p>The main comparison between ChemBERTa (pretrained on 10M compounds) and Chemprop baselines on MoleculeNet tasks is summarized below (Table 1 from the paper):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>BBBP PRC</th>
          <th>ClinTox ROC</th>
          <th>ClinTox PRC</th>
          <th>HIV ROC</th>
          <th>HIV PRC</th>
          <th>Tox21 ROC</th>
          <th>Tox21 PRC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemBERTa 10M</td>
          <td>0.643</td>
          <td>0.620</td>
          <td>0.733</td>
          <td>0.975</td>
          <td>0.622</td>
          <td>0.119</td>
          <td>0.728</td>
          <td>0.207</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.708</td>
          <td>0.697</td>
          <td>0.906</td>
          <td>0.993</td>
          <td>0.752</td>
          <td>0.152</td>
          <td>0.688</td>
          <td>0.429</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>0.681</td>
          <td>0.692</td>
          <td>0.693</td>
          <td>0.968</td>
          <td>0.780</td>
          <td>0.383</td>
          <td>0.724</td>
          <td>0.335</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>0.702</td>
          <td>0.724</td>
          <td>0.833</td>
          <td>0.986</td>
          <td>0.763</td>
          <td>0.364</td>
          <td>0.708</td>
          <td>0.345</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Scaling Improvements &amp; Training Dynamics</strong>: Performance scales predictably with pretraining data size. Increasing data from 100K to 10M improved ROC-AUC by +0.110 and PRC-AUC by +0.059 on average across BBBP, ClinTox, and Tox21 (HIV was omitted due to resource constraints). Notably, the authors halted pretraining on the 10M subset after just 3 epochs due to overfitting, suggesting that simple 15% token masking might not provide a sufficiently challenging learning signal for large-scale chemical representation learning.</li>
<li><strong>Performance Limits vs. GNNs</strong>: ChemBERTa generally performs below the D-MPNN baseline. On the Tox21 dataset, ChemBERTa-10M achieved a higher ROC-AUC (0.728) than D-MPNN (0.688); nonetheless, it recorded a substantially lower PRC-AUC (0.207 vs 0.429). This gap indicates that current Transformer iterations lack the explicit inductive biases of graph algorithms and struggle with the severe class imbalances typical of chemical datasets.</li>
<li><strong>Ablation Limitations (Tokenization &amp; SELFIES)</strong>: The authors&rsquo; ablation studies for tokenization (SmilesTokenizer narrowly beating BPE) and input representation (SELFIES performing comparably to SMILES) were evaluated exclusively on the single Tox21 task. Deriving broad architectural conclusions regarding &ldquo;semantically-aware tokenization&rdquo; or string robustness from an $N=1$ empirical evaluation is a significant limitation of the study. Broader benchmarking is required to validate these findings.</li>
<li><strong>Interpretability</strong>: Attention heads learn, without explicit supervision, to track chemically relevant substructures (such as specific functional groups and aromatic rings), mimicking the inductive biases of graph convolutions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors curated a massive dataset for pretraining and utilized standard benchmarks for evaluation.</p>
<ul>
<li><strong>Pretraining Data</strong>: <strong>PubChem-77M</strong>.
<ul>
<li>Source: 77 million unique SMILES from PubChem.</li>
<li>Preprocessing: Canonicalized and globally shuffled.</li>
<li>Subsets used: 100K, 250K, 1M, and 10M subsets.</li>
<li><em>Availability Note</em>: The authors provided a direct link to the <a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">canonicalized 10M compound subset</a> used for their largest experiments. Full reproducibility of the smaller (100K, 250K, 1M) or full 77M sets may require re-extracting from PubChem.</li>
</ul>
</li>
<li><strong>Evaluation Data</strong>: <strong>MoleculeNet</strong>.
<ul>
<li>Tasks: BBBP (2,039), ClinTox (1,478), HIV (41,127), Tox21 (7,831).</li>
<li>Splitting: 80/10/10 train/valid/test split using a <strong>scaffold splitter</strong>, so that molecules sharing a core scaffold do not leak across splits and generalization to structurally novel compounds is tested.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The core training methodology mirrors standard BERT/RoBERTa procedures adapted for chemical strings.</p>
<ul>
<li><strong>Objective</strong>: Masked Language Modeling (MLM) with <strong>15% token masking</strong>.</li>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>BPE</strong>: Byte-Pair Encoder (vocab size 52K).</li>
<li><strong>SmilesTokenizer</strong>: Regex-based custom tokenizer available in DeepChem (documented <a href="https://deepchem.readthedocs.io/en/latest/tokenizers.html#smilestokenizer">here</a>).</li>
</ul>
</li>
<li><strong>Sequence Length</strong>: Maximum sequence length of <strong>512 tokens</strong>.</li>
<li><strong>Finetuning</strong>: Appended a linear classification layer; backpropagated through the base model for up to 25 epochs with early stopping on ROC-AUC.</li>
</ul>
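<p>The contrast between generic BPE and a chemically aware tokenizer is easiest to see with the regex approach. A sketch using a pattern in the style of the DeepChem SmilesTokenizer (the production tokenizer's exact regex and vocabulary handling differ):</p>

```python
import re

# Multi-character atoms (Cl, Br) and bracket atoms ([NH4+]) are kept as
# single tokens instead of being split character-wise or by BPE merges.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_REGEX.findall(smiles)

# 'Cl' survives as one token, unlike a naive character-level split.
tokens = tokenize_smiles("CC(=O)Nc1ccc(Cl)cc1")
```

<p>Because every character is covered by some alternative, joining the tokens reconstructs the original string, which is a useful invariant to test a tokenizer against.</p>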
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: <strong>RoBERTa</strong> (via HuggingFace).
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 12 per layer (72 heads across the 6 layers).</li>
<li><em>Implementation Note</em>: The original training notebooks and scripts are maintained in the authors&rsquo; <a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry repository</a>, alongside the primary downstream tasks integrated into DeepChem. A <a href="https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb">full Tox21 transfer learning tutorial</a> has been incorporated into the DeepChem repository.</li>
</ul>
</li>
<li><strong>Baselines</strong> (via Chemprop library):
<ul>
<li><strong>D-MPNN</strong>: Directed Message Passing Neural Network with default hyperparameters.</li>
<li><strong>RF/SVM</strong>: Scikit-learn Random Forest and SVM using 2048-bit Morgan fingerprints (<a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured using dual metrics to account for class imbalance common in toxicity datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ROC-AUC</strong></td>
          <td>Area Under Receiver Operating Characteristic Curve</td>
      </tr>
      <tr>
          <td><strong>PRC-AUC</strong></td>
          <td>Area Under Precision-Recall Curve (vital for imbalanced data)</td>
      </tr>
  </tbody>
</table>
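<p>ROC-AUC has a clean rank interpretation that makes it computable without a library: it is the probability that a randomly chosen positive is scored above a randomly chosen negative. A self-contained sketch with toy scores (illustrating how an imbalanced dataset can show a high ROC-AUC while precision-recall behavior remains modest):</p>

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank (Mann-Whitney U) formulation:
    the fraction of positive/negative pairs ranked correctly,
    counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced set: 2 positives among 10 examples.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.2, 0.3, 0.35, 0.4, 0.5, 0.8, 0.7, 0.9]
auc = roc_auc(labels, scores)  # 15 of 16 pairs ranked correctly
```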
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Single <strong>NVIDIA V100 GPU</strong>.</li>
<li><strong>Training Time</strong>: Approximately <strong>48 hours</strong> for the 10M compound subset.</li>
<li><strong>Carbon Footprint</strong>: Estimated 17.1 kg $\text{CO}_2\text{eq}$ (offset by Google Cloud).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training notebooks and finetuning scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration of ChemBERTa and SmilesTokenizer</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">ChemBERTa-zinc-base-v1</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained RoBERTa on 100K ZINC SMILES</td>
      </tr>
      <tr>
          <td><a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">PubChem-10M subset</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Canonicalized 10M compound subset used for largest experiments</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code and pre-trained models are available, and the 10M pretraining subset is downloadable. However, smaller subsets (100K, 250K, 1M) may need re-extraction from PubChem, and exact hyperparameter details for finetuning (learning rate, batch size) are not fully specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. <em>arXiv preprint arXiv:2010.09885</em>. <a href="https://doi.org/10.48550/arXiv.2010.09885">https://doi.org/10.48550/arXiv.2010.09885</a></p>
<p><strong>Publication</strong>: arXiv 2020 (Preprint)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">HuggingFace Model Hub (ChemBERTa-zinc-base-v1)</a> - <em>Additional pre-trained variations on PubChem &amp; ZINC datasets are available on the author&rsquo;s <a href="https://huggingface.co/seyonec">seyonec</a> HF profile.</em></li>
<li><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry GitHub Repository</a> - <em>Notebooks and scripts used for MLM pretraining and finetuning evaluations.</em></li>
</ul>
<h3 id="bibtex">BibTeX</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{chithranandaChemBERTaLargeScaleSelfSupervised2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa}}: {{Large-Scale Self-Supervised Pretraining}} for {{Molecular Property Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Score-Based Generative Modeling with SDEs (Song 2021)</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/score-based-generative-modeling-sde/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/score-based-generative-modeling-sde/</guid><description>Unified SDE framework for score-based generative models, introducing Predictor-Corrector samplers and setting CIFAR-10 records with FID 2.20 and 2.99 bits/dim.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a unified framework that generalizes previous discrete score-based models (SMLD and DDPM) into continuous-time Stochastic Differential Equations (SDEs). The paper introduces algorithms for sampling (Predictor-Corrector) and likelihood computation (Probability Flow ODE), validated by setting new records on CIFAR-10 (FID 2.20, IS 9.89 at the time of publication). It also contains elements of <strong>Systematization</strong> by showing how existing methods are special cases of this broader framework.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Prior successful generative models, specifically Score Matching with Langevin Dynamics (SMLD) and Denoising Diffusion Probabilistic Models (DDPM), operate by sequentially corrupting data with slowly increasing noise and learning to reverse the process. Both methods treat the noise scales as a finite set of discrete steps. The authors aim to generalize this to a continuum of noise scales by modeling the diffusion process as a Stochastic Differential Equation (SDE). This continuous formulation enables:</p>
<ul>
<li><strong>Flexible sampling:</strong> Use of general-purpose SDE solvers.</li>
<li><strong>Exact likelihood computation:</strong> Via connection to Neural ODEs.</li>
<li><strong>Controllable generation:</strong> Solving inverse problems (inpainting, colorization) without retraining.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>SDE framework</strong> for score-based generative modeling:</p>
<ul>
<li><strong>Continuous Generalization:</strong> Proving that SMLD and DDPM noise perturbations correspond to discretizations of Variance Exploding (VE) SDEs and Variance Preserving (VP) SDEs, respectively.</li>
<li><strong>Reverse-Time SDE:</strong> Leveraging Anderson&rsquo;s result (Anderson, 1982: a result on time-reversal of diffusion processes showing that the reverse is also a diffusion, with the forward drift reversed and a correction term involving the score of the marginal density) that the reverse of a diffusion process is also a diffusion process, governed by the score (gradient of log density).</li>
<li><strong>Predictor-Corrector (PC) Samplers:</strong> A hybrid sampling strategy where a numerical SDE solver (Predictor) estimates the next step, and a score-based MCMC approach (Corrector) corrects the marginal distribution.</li>
<li><strong>Probability Flow ODE:</strong> Deriving a deterministic ODE that shares the same marginal densities as the SDE, enabling near-exact likelihood computation (accuracy is limited by both numerical ODE solver discretization and variance of the unbiased Hutchinson trace estimator) and latent space manipulation.</li>
<li><strong>Sub-VP SDE:</strong> A new SDE class proposed to improve likelihoods by bounding variance tighter than the VP SDE.</li>
</ul>
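<p>The reverse-time SDE and the probability flow ODE differ only in how they weight the score term. A minimal sketch, assuming a generic forward SDE $dx = f(x,t)dt + g(t)dw$ and a user-supplied <code>score</code> function (all three callables here are illustrative placeholders):</p>

```python
def reverse_sde_drift(f, g, score, x, t):
    # Anderson (1982): reverse diffusion drift = f(x, t) - g(t)^2 * score(x, t)
    return f(x, t) - g(t) ** 2 * score(x, t)

def probability_flow_drift(f, g, score, x, t):
    # Deterministic ODE with the same marginals: halve the score term, drop the noise
    return f(x, t) - 0.5 * g(t) ** 2 * score(x, t)
```

<p>Plugging an Ornstein-Uhlenbeck drift and a standard-normal score into these two functions makes the factor-of-two difference between the stochastic and deterministic reverse processes concrete.</p>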
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the framework on standard image benchmarks:</p>
<ul>
<li><strong>Datasets:</strong> CIFAR-10 (32x32), CelebA (64x64), LSUN (Bedroom, Church), and CelebA-HQ (256x256 and 1024x1024).</li>
<li><strong>Ablation Studies:</strong> Comparing samplers (Ancestral vs. Reverse Diffusion vs. Probability Flow vs. PC) and SDE types (VE, VP, sub-VP).</li>
<li><strong>Architecture Search:</strong> Exploring improvements like FIR up/downsampling, rescaling skip connections, and increasing depth (leading to NCSN++ and DDPM++ architectures).</li>
<li><strong>Likelihood Evaluation:</strong> Computing Negative Log-Likelihood (NLL) in bits/dim using the Probability Flow ODE.</li>
<li><strong>Inverse Problems:</strong> Testing class-conditional generation, inpainting, and colorization using the conditional reverse-time SDE.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Record Performance:</strong> The <strong>NCSN++ cont. (deep, VE)</strong> model achieved an Inception Score of 9.89 and FID of 2.20 on CIFAR-10 (as of ICLR 2021).</li>
<li><strong>High-Fidelity Generation:</strong> First score-based model to generate 1024x1024 images (CelebA-HQ).</li>
<li><strong>Competitive Likelihoods:</strong> The <strong>DDPM++ cont. (deep, sub-VP)</strong> model achieved 2.99 bits/dim on uniformly dequantized CIFAR-10, a record at the time.</li>
<li><strong>Sampling Efficiency:</strong> PC samplers consistently outperformed predictor-only methods (like standard ancestral sampling) for the same computational cost.</li>
<li><strong>Controllable Generation:</strong> Successful application to inpainting and colorization using a single unconditional model.</li>
<li><strong>Limitations:</strong> Sampling remains slower than GANs on the same datasets. The breadth of available samplers introduces many hyperparameters (SDE type, predictor, corrector, signal-to-noise ratio, number of steps) that require tuning.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>CIFAR-10</strong>: Used for main benchmarking (FID, Inception Score, NLL).</li>
<li><strong>CelebA-HQ</strong>: Used for high-resolution experiments at 256x256 and 1024x1024.</li>
<li><strong>LSUN</strong>: Bedroom and Church Outdoor categories (256x256) used for sampler comparison and controllable generation (inpainting, colorization).</li>
<li><strong>Preprocessing</strong>: CIFAR-10 images are 32x32; CelebA pre-processed to 64x64 following Song &amp; Ermon (2020). Data is typically scaled to $[0, 1]$ or standardized depending on the specific SDE config.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Forward SDEs</strong>:</p>
<p>Here $dw$ denotes a Wiener process increment (a small, independent Gaussian noise burst at each timestep).</p>
<ul>
<li><strong>VE SDE (Variance Exploding)</strong>: $dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}} dw$. Corresponds to SMLD. Used with $\sigma_{\min}=0.01$ and $\sigma_{\max}$ chosen via heuristics.</li>
<li><strong>VP SDE (Variance Preserving)</strong>: $dx = -\frac{1}{2}\beta(t)x dt + \sqrt{\beta(t)} dw$. Corresponds to DDPM.</li>
<li><strong>Sub-VP SDE</strong>: $dx = -\frac{1}{2}\beta(t)x dt + \sqrt{\beta(t)(1 - e^{-2\int_0^t \beta(s)ds})} dw$. Bounded variance, good for likelihoods.</li>
</ul>
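<p>Because all three forward processes have Gaussian perturbation kernels, training pairs $(x_0, x_t)$ can be sampled in closed form. A sketch for the VP SDE with a linear $\beta(t)$ schedule; the $\beta$ endpoints shown are the commonly used CIFAR-10 values, treated here as assumptions:</p>

```python
import numpy as np

def vp_perturb(x0, t, beta_min=0.1, beta_max=20.0, rng=None):
    """Sample x_t ~ p_t(x_t | x_0) in closed form for the VP SDE with a
    linear schedule beta(t) = beta_min + t * (beta_max - beta_min)."""
    rng = rng or np.random.default_rng()
    # log of the mean coefficient: -0.5 * int_0^t beta(s) ds for the linear schedule
    log_mean_coeff = -0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min
    mean = np.exp(log_mean_coeff) * np.asarray(x0, dtype=float)
    std = np.sqrt(1.0 - np.exp(2.0 * log_mean_coeff))
    return mean + std * rng.standard_normal(np.shape(x0))
```

<p>At $t=0$ the sample equals $x_0$; at $t=1$ the marginal is approximately standard normal, which is what makes the Gaussian prior a valid starting point for reverse-time sampling.</p>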
<p><strong>Reverse-Time SDE Solver (Predictor)</strong>:</p>
<ul>
<li>Discretized via <strong>Reverse Diffusion Sampling</strong>, which matches the forward discretization.</li>
<li><strong>Euler-Maruyama</strong> solver used for continuously-trained models.</li>
</ul>
<p><strong>Corrector Algorithm</strong>:</p>
<ul>
<li><strong>Langevin MCMC</strong>: Applies annealed Langevin dynamics: adds noise and takes a score-guided gradient step to correct the marginal distribution at each timestep.</li>
<li><strong>PC Sampling</strong>: Alternates between one step of the Predictor and one step of the Corrector.</li>
<li><strong>Signal-to-Noise Ratio ($r$)</strong>: A hyperparameter for the corrector step size. Tuned values: $r \approx 0.16$ for VE SDEs on CIFAR-10.</li>
</ul>
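<p>One PC iteration for a VE SDE can be sketched as follows. The <code>score_fn(x, sigma)</code> interface is an assumption, and the Langevin step size follows the SNR-based heuristic with ratio $r$:</p>

```python
import numpy as np

def pc_sample_step(x, sigma_t, sigma_next, score_fn, r=0.16, rng=None):
    """One Predictor-Corrector iteration for a VE SDE (sketch)."""
    rng = rng or np.random.default_rng()
    # Corrector: one annealed Langevin MCMC step; step size eps set by the SNR r
    grad = score_fn(x, sigma_t)
    z = rng.standard_normal(x.shape)
    eps = 2.0 * (r * np.linalg.norm(z) / np.linalg.norm(grad)) ** 2
    x = x + eps * grad + np.sqrt(2.0 * eps) * z
    # Predictor: reverse-diffusion discretization of the VE reverse SDE
    tau = sigma_t**2 - sigma_next**2
    x = x + tau * score_fn(x, sigma_t) + np.sqrt(tau) * rng.standard_normal(x.shape)
    return x
```

<p>A full sampler would loop this step over a decreasing noise schedule from $\sigma_{\max}$ down to $\sigma_{\min}$.</p>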
<h3 id="models">Models</h3>
<ul>
<li><strong>NCSN++</strong>: Optimized architecture for VE SDEs. Key features:
<ul>
<li>4 residual blocks per resolution.</li>
<li>BigGAN-type residual blocks.</li>
<li>Rescaling skip connections by $1/\sqrt{2}$.</li>
<li>FIR (Finite Impulse Response) up/downsampling.</li>
<li>&ldquo;Residual&rdquo; progressive architecture for input, no progressive growing for output.</li>
</ul>
</li>
<li><strong>DDPM++</strong>: Optimized architecture for VP/sub-VP SDEs. Similar to NCSN++ but without FIR upsampling and no progressive growing.</li>
<li><strong>Deep Variants</strong>: &ldquo;cont. (deep)&rdquo; models double the depth (from 4 to 8 blocks per resolution) for the best reported results.</li>
<li><strong>Conditioning</strong>: Time $t$ is conditioned via random Fourier feature embeddings (scale 16) for continuous models.</li>
</ul>
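<p>The random Fourier feature conditioning on $t$ can be sketched as below; the frequency vector is drawn once at initialization and kept frozen (the embedding dimension here is illustrative):</p>

```python
import numpy as np

def fourier_time_embedding(t, dim=8, scale=16.0, seed=0):
    """Random Fourier features of continuous time t; the Gaussian frequency
    vector W is drawn once and frozen (seed fixed here for reproducibility)."""
    W = np.random.default_rng(seed).standard_normal(dim // 2) * scale
    angles = 2.0 * np.pi * np.atleast_1d(t)[:, None] * W[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

<p>The large scale (16) spreads the frequencies widely so the network can resolve small differences in $t$ across the whole $[0, 1]$ interval.</p>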
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>FID (Fréchet Inception Distance)</strong>: Computed on 50k samples.</li>
<li><strong>Inception Score</strong>: Reported for CIFAR-10.</li>
<li><strong>NLL (Negative Log-Likelihood)</strong>: Reported in bits/dim on uniformly dequantized data using the Probability Flow ODE.</li>
</ul>
<p><strong>Denoising</strong>: A single denoising step using Tweedie&rsquo;s formula is applied at the end of sampling to remove residual noise, which significantly improves FID.</p>
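<p>For a VE model this final step is a one-liner, $\mathbb{E}[x_0 \mid x_t] = x_t + \sigma^2 \nabla_x \log p(x_t)$. A minimal sketch, with <code>score_fn</code> an assumed interface:</p>

```python
def tweedie_denoise(x, sigma, score_fn):
    """Tweedie's formula for a VE model: E[x_0 | x_t] = x_t + sigma^2 * score(x_t)."""
    return x + sigma**2 * score_fn(x, sigma)
```

<p>When the score is exact, this recovers the posterior mean of the clean sample; e.g. for data concentrated at a single point, one step lands exactly on that point.</p>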
<h3 id="hardware">Hardware</h3>
<p><strong>Training</strong>:</p>
<ul>
<li>Batch size: 128 for CIFAR-10, 64 for LSUN, 8 for high-res CelebA-HQ.</li>
<li>Iterations: Discrete-objective models trained for 1.3M iterations during architecture exploration. Continuous-objective models (cont.) trained for 0.95M iterations. High-res CelebA-HQ (1024x1024) trained for approximately 2.4M iterations.</li>
<li><strong>EMA</strong>: Exponential Moving Average rate of 0.999 used for VE models, 0.9999 for VP models.</li>
</ul>
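<p>The EMA update used for evaluation weights is a one-liner per parameter; a plain-Python sketch over flat lists (frameworks apply the same rule per tensor):</p>

```python
def ema_update(ema_params, params, rate=0.999):
    """One EMA step: ema <- rate * ema + (1 - rate) * current.
    rate = 0.999 for VE models, 0.9999 for VP models."""
    return [rate * e + (1.0 - rate) * p for e, p in zip(ema_params, params)]
```

<p>Higher rates average over more iterations, which smooths out optimization noise at the cost of lagging further behind the live weights.</p>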
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/yang-song/score_sde">yang-song/score_sde</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official JAX and PyTorch implementation with pretrained checkpoints</td>
      </tr>
  </tbody>
</table>
<p>All datasets used (CIFAR-10, CelebA-HQ, LSUN) are publicly available. Pretrained model checkpoints for CIFAR-10, CelebA-HQ, and FFHQ are provided in the repository. Specific hardware requirements (GPU type, training time) are not detailed in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., &amp; Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. <em>ICLR 2021</em>. <a href="https://arxiv.org/abs/2011.13456">https://arxiv.org/abs/2011.13456</a></p>
<p><strong>Publication</strong>: ICLR 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{song2021scorebased,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>     = <span style="color:#e6db74">{Score-Based Generative Modeling through Stochastic Differential Equations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>    = <span style="color:#e6db74">{Song, Yang and Sohl-Dickstein, Jascha and Kingma, Diederik P and Kumar, Abhishek and Ermon, Stefano and Poole, Ben}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>      = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>       = <span style="color:#e6db74">{https://openreview.net/forum?id=PxTIG12RRHS}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/yang-song/score_sde">GitHub Repository</a></li>
<li><a href="/notes/machine-learning/generative-models/score-matching-denoising-autoencoders/">Score Matching and Denoising Autoencoders</a></li>
</ul>
]]></content:encoded></item><item><title>Rectified Flow: Learning to Generate and Transfer Data</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/rectified-flow/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/rectified-flow/</guid><description>A unified ODE-based framework for generative modeling and domain transfer that learns straight paths for fast 1-step generation.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, with a significant <strong>Theory</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes &ldquo;Rectified Flow,&rdquo; a novel generative framework that learns ordinary differential equations (ODEs) to transport distributions via straight paths. It introduces the &ldquo;Reflow&rdquo; algorithm to iteratively straighten these paths.</li>
<li><strong>Theory</strong>: It provides rigorous proofs connecting the method to Optimal Transport, showing that the rectification process yields a coupling with non-increasing convex transport costs and that recursive reflow reduces the curvature of trajectories.</li>
</ul>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The work addresses two main challenges in unsupervised learning: generative modeling (generating data from noise) and domain transfer (mapping between two observed distributions).</p>
<ul>
<li><strong>Inefficiency of ODE/SDE Models</strong>: Continuous-time models (like Score-based Generative Models and DDPMs) require simulating diffusions over many steps, resulting in high computational costs during inference.</li>
<li><strong>Complexity of GANs</strong>: GANs offer fast (one-step) generation but suffer from training instability and mode collapse.</li>
<li><strong>Disconnection</strong>: Generative modeling and domain transfer are often treated as separate tasks requiring different techniques.</li>
</ul>
<p>The authors aim to unify these tasks into a single &ldquo;transport mapping&rdquo; problem while bridging the gap between high-quality continuous models and fast one-step models.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Rectified Flow</strong> framework and the <strong>Reflow</strong> procedure.</p>
<ul>
<li><strong>Straight-Line ODEs</strong>: Rectified Flow learns an ODE drift $v$ to follow the straight line connecting data pairs $(X_0, X_1)$, providing an alternative to diffusion models that rely on stochastic paths or specific forward processes. This is achieved via a simple least-squares optimization problem.</li>
<li><strong>Reflow (Iterative Straightening)</strong>: The authors introduce a recursive training procedure where a new flow is trained on the data pairs $(Z_0, Z_1)$ generated by the previous flow. Theoretical analysis shows this reduces the &ldquo;transport cost&rdquo; and straightens the trajectories, allowing for accurate 1-step simulation (effectively converting the ODE into a one-step model).</li>
<li><strong>Unified Framework</strong>: The method uses the exact same algorithm for generation ($\pi_0$ is Gaussian) and domain transfer ($\pi_0$ is a source dataset), removing the need for adversarial losses or cycle-consistency constraints.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the method across image generation, translation, and domain adaptation tasks.</p>
<ul>
<li><strong>Unconditioned Image Generation</strong>:
<ul>
<li><strong>Dataset</strong>: CIFAR-10 ($32\times32$).</li>
<li><strong>Baselines</strong>: Compared against GANs (StyleGAN2, TDPM), Diffusion/SDE Models (VP SDE, sub-VP SDE, VE SDE), ODE methods (VP ODE, sub-VP ODE, VE ODE), and distilled methods (DDIM Distillation).</li>
<li><strong>High-Res</strong>: Validated on LSUN Bedroom/Church, CelebA-HQ, and AFHQ ($256\times256$).</li>
</ul>
</li>
<li><strong>Image-to-Image Translation</strong>:
<ul>
<li><strong>Datasets</strong>: AFHQ (Cat $\leftrightarrow$ Dog/Wild), MetFace $\leftrightarrow$ CelebA-HQ.</li>
<li><strong>Setup</strong>: Transferring styles while preserving semantic identity (using a classifier-based feature mapping metric).</li>
</ul>
</li>
<li><strong>Domain Adaptation</strong>:
<ul>
<li><strong>Datasets</strong>: DomainNet, Office-Home.</li>
<li><strong>Metric</strong>: Classification accuracy on the transferred testing data.</li>
</ul>
</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Superior 1-Step Generation</strong>: On CIFAR-10 with a single Euler step (as of ICLR 2023), the distilled 2-Rectified Flow achieved an FID of <strong>4.85</strong>, beating TDPM (FID 8.91, a truncated diffusion model with a GAN component), the best prior one-step U-Net-based model. The distilled 3-Rectified Flow reached a Recall of <strong>0.51</strong>, beating the GAN baseline StyleGAN2+ADA (Recall 0.49).</li>
<li><strong>Straightening Effect</strong>: The &ldquo;Reflow&rdquo; procedure was empirically shown to reduce the &ldquo;straightness&rdquo; error and transport costs, validating the theoretical claims. &ldquo;Straightness&rdquo; is measured as $S(Z) = \mathbb{E}[\int_0^1 \|\dot{Z}_t - (Z_1 - Z_0)\|^2 \, dt]$ (zero means perfectly straight); &ldquo;transport cost&rdquo; is $\mathbb{E}[c(Z_1 - Z_0)]$ for a convex cost $c$, and Reflow reduces this for all convex costs.</li>
<li><strong>High-Quality Transfer</strong>: The model successfully performed image translation (e.g., Cat to Wild Animal) without paired data or cycle-consistency losses.</li>
<li><strong>Strong Full-Simulation Results</strong>: With RK45 adaptive ODE solving, 1-Rectified Flow achieves FID 2.58 and Recall 0.57 on CIFAR-10 (Table 1a), the best among ODE methods and comparable to fully simulated SDEs (VP SDE: FID 2.55).</li>
<li><strong>Fast Simulation</strong>: The method allows for extremely coarse time discretization (e.g., $N=1$) without significant quality loss after reflow, effectively solving the slow inference speed of standard ODE models.</li>
<li><strong>Domain Adaptation</strong>: On Office-Home, Rectified Flow achieves 69.2% accuracy, outperforming Deep CORAL (68.7%) and other baselines. On DomainNet, it achieves 41.4%, comparable to Deep CORAL (41.5%) and MLDG (41.2%).</li>
</ul>
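<p>The straightness measure can be estimated from any discretized trajectory by finite differences; a sketch assuming a trajectory array sampled at increasing times:</p>

```python
import numpy as np

def straightness(traj, ts):
    """Finite-difference estimate of S(Z) = int_0^1 ||Zdot_t - (Z_1 - Z_0)||^2 dt
    from a trajectory `traj` of shape (T, d) sampled at times `ts`."""
    disp = traj[-1] - traj[0]                  # total displacement Z_1 - Z_0
    dt = np.diff(ts)
    vel = np.diff(traj, axis=0) / dt[:, None]  # finite-difference velocity
    return float(np.sum(np.sum((vel - disp) ** 2, axis=-1) * dt))
```

<p>A perfectly straight path scores exactly zero, since its velocity is constant and equal to the displacement; any curvature contributes positively.</p>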
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper utilizes several standard computer vision benchmarks.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size/Resolution</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generation</td>
          <td><strong>CIFAR-10</strong></td>
          <td>32x32</td>
          <td>Standard split</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td><strong>LSUN</strong> (Bedroom, Church)</td>
          <td>256x256</td>
          <td>High-res evaluation</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td><strong>CelebA-HQ</strong></td>
          <td>256x256</td>
          <td>High-res evaluation</td>
      </tr>
      <tr>
          <td>Gen/Transfer</td>
          <td><strong>AFHQ</strong> (Cat, Dog, Wild)</td>
          <td>512x512</td>
          <td>256x256 for generation, 512x512 for transfer</td>
      </tr>
      <tr>
          <td>Transfer</td>
          <td><strong>MetFace</strong></td>
          <td>1024x1024</td>
          <td>Resized to 512x512 for experiments</td>
      </tr>
      <tr>
          <td>Adaptation</td>
          <td><strong>DomainNet</strong></td>
          <td>Mixed</td>
          <td>345 categories, 6 domains</td>
      </tr>
      <tr>
          <td>Adaptation</td>
          <td><strong>Office-Home</strong></td>
          <td>Mixed</td>
          <td>65 categories, 4 domains</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>
<p><strong>Objective Function</strong>:
The drift $v(Z_t, t)$ is trained by minimizing a least-squares regression objective:
$$\min_{v} \int_{0}^{1} \mathbb{E}\left[\|(X_1 - X_0) - v(X_t, t)\|^2\right] dt$$
where $X_t = tX_1 + (1-t)X_0$ is the linear interpolation.</p>
</li>
<li>
<p><strong>Reflow Procedure</strong>:
Iteratively updates the flow. Let $Z^k$ be the $k$-th rectified flow.</p>
<ol>
<li>Generate 4 million data pairs $(Z_0^k, Z_1^k)$ by simulating the current flow.</li>
<li>Fine-tune the $k$-rectified flow model for 300,000 steps on these pairs to obtain the $(k+1)$-rectified flow.</li>
</ol>
</li>
<li>
<p><strong>Distillation</strong>:
For 1-step distillation ($k=1$), the L2 loss is replaced with LPIPS perceptual similarity, which empirically yields better image quality. For multi-step distillation, the training time $t$ is sampled from $\{0, 1/k, \ldots, (k-1)/k\}$ rather than the full $[0, 1]$ interval.</p>
</li>
<li>
<p><strong>ODE Solver</strong>:</p>
<ul>
<li>Training: Analytical linear interpolation.</li>
<li>Inference: Euler method (constant step size $1/N$) or RK45 (adaptive).</li>
</ul>
</li>
</ul>
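<p>The objective and Euler simulation above reduce to a few lines; a sketch assuming the drift model is exposed as a callable <code>v(x, t)</code>:</p>

```python
import numpy as np

def rf_loss_batch(v, x0, x1, rng=None):
    """Monte-Carlo estimate of E_t ||(X1 - X0) - v(X_t, t)||^2 with
    X_t = t*X1 + (1-t)*X0 the linear interpolation."""
    rng = rng or np.random.default_rng()
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = t * x1 + (1.0 - t) * x0
    resid = (x1 - x0) - v(xt, t)
    return float(np.mean(np.sum(resid**2, axis=-1)))

def euler_sample(v, x0, n_steps=1):
    """Simulate dZ_t = v(Z_t, t) dt with constant step 1/N; N = 1 is the
    one-step regime that reflow makes accurate."""
    x, dt = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = x + v(x, np.full((x.shape[0], 1), i * dt)) * dt
    return x
```

<p>If the learned drift is exactly constant along each path (a perfectly straight flow), the loss is zero and a single Euler step is exact, which is the intuition behind reflow.</p>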
<h3 id="models">Models</h3>
<ul>
<li>
<p><strong>Architecture</strong>:</p>
<ul>
<li>Uses the <strong>DDPM++ U-Net</strong> architecture (from Song et al., 2020) across experiments. Implementation is modified from the open-source code of Song et al.</li>
</ul>
</li>
<li>
<p><strong>Optimization</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam (CIFAR-10) or AdamW (Transfer/Adaptation).</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>LR: $2 \times 10^{-4}$ (CIFAR), Grid search for transfer.</li>
<li>EMA: 0.999999 (CIFAR), 0.9999 (Transfer).</li>
<li>Batch Size: 4 (Transfer), 16 (Domain Adaptation).</li>
<li>Dropout: 0.15 (CIFAR), 0.1 (Transfer).</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (CIFAR-10, N=1)</th>
          <th>Baseline (Best 1-step)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>FID</strong></td>
          <td><strong>4.85</strong> (2-Rectified + Distill)</td>
          <td>8.91 (TDPM)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td><strong>Recall</strong></td>
          <td><strong>0.51</strong> (3-Rectified + Distill)</td>
          <td>0.49 (StyleGAN2+ADA)</td>
          <td>Higher is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models or training times. The DDPM++ U-Net architecture used in the experiments typically requires multi-GPU setups for training on high-resolution datasets.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gnobitab/RectifiedFlow">RectifiedFlow (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation with CIFAR-10 and high-res training code, plus pre-trained checkpoints</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Gong, C., &amp; Liu, Q. (2023). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. <em>International Conference on Learning Representations (ICLR)</em>. <a href="https://openreview.net/forum?id=XVjTT1nw5z">https://openreview.net/forum?id=XVjTT1nw5z</a></p>
<p><strong>Publication</strong>: ICLR 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liuFlowStraightFast2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Flow {{Straight}} and {{Fast}}: {{Learning}} to {{Generate}} and {{Transfer Data}} with {{Rectified Flow}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Liu, Xingchao and Gong, Chengyue and Liu, Qiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://openreview.net/forum?id=XVjTT1nw5z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/gnobitab/RectifiedFlow">Official Code Repository</a></li>
<li><a href="https://openreview.net/forum?id=XVjTT1nw5z">OpenReview Page</a></li>
</ul>
]]></content:encoded></item><item><title>Neural ODEs: Continuous-Depth Deep Learning Models</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/neural-odes/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/neural-odes/</guid><description>Introduces ODE-Nets, a continuous-depth neural network model parameterized by ODEs, enabling constant memory backpropagation and adaptive computation.</description><content:encoded><![CDATA[<blockquote>
<p><strong>Key Prerequisites</strong>: Before diving in, note that for the ODE solver to guarantee a unique solution, the neural network $f(h(t), t, \theta)$ parameterizing the dynamics must be <a href="https://en.wikipedia.org/wiki/Lipschitz_continuity">Lipschitz continuous</a>. This ensures the <a href="https://en.wikipedia.org/wiki/Picard%E2%80%93Lindel%C3%B6f_theorem">Picard-Lindelöf theorem</a> holds, preventing trajectories from crossing and guaranteeing a well-defined backward pass.</p></blockquote>
<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong secondary <strong>Theory</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel family of deep neural network models where the derivative of the hidden state is parameterized by a neural network. It provides specific algorithms (Algorithm 1) for training these models scalably.</li>
<li><strong>Theory</strong>: It derives the adjoint sensitivity method for backpropagating through black-box ODE solvers and proves the &ldquo;Instantaneous Change of Variables&rdquo; theorem (Theorem 1) for continuous normalizing flows.</li>
</ul>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The authors aim to address limitations in discrete deep learning architectures:</p>
<ul>
<li><strong>Discrete vs. Continuous</strong>: Existing models like Residual Networks build transformations by composing discrete steps, which can be seen as an Euler discretization of a continuous transformation. The authors investigate the limit as step sizes go to zero.</li>
<li><strong>Memory Efficiency</strong>: Backpropagating through deep discrete networks requires storing intermediate activations, leading to linear memory cost in terms of depth, which is a major bottleneck.</li>
<li><strong>Irregular Data</strong>: Recurrent Neural Networks (RNNs) struggle with data arriving at arbitrary times, typically requiring discretization into fixed bins.</li>
<li><strong>Normalizing Flow Costs</strong>: Standard normalizing flows have a bottleneck in computing the determinant of the Jacobian, which is computationally expensive ($O(D^3)$).</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core contribution is the <strong>Neural ODE</strong> formulation:
$$\frac{dh(t)}{dt} = f(h(t), t, \theta)$$
where the output is computed using a black-box differential equation solver.</p>
<p>Key technical innovations include:</p>
<ol>
<li><strong>Adjoint Sensitivity Method for Backprop</strong>: The authors treat the solver as a black box and compute gradients by solving a second, augmented ODE backwards in time. This allows for <strong>constant memory cost</strong> regardless of depth.</li>
<li><strong>Adaptive Computation</strong>: The model uses modern ODE solvers that adapt evaluation steps based on error tolerance, allowing the model to trade precision for speed explicitly.</li>
<li><strong>Continuous Normalizing Flows (CNF)</strong>: By moving to continuous time, the change of variables formula simplifies from a log-determinant (cubic cost) to a trace operation (linear cost), enabling scalable generative modeling.</li>
<li><strong>Latent ODEs</strong>: A generative time-series model that represents time-series as latent trajectories determined by a local initial state and global shared dynamics, handling irregular sampling naturally.</li>
</ol>
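<p>The core idea can be sketched without any autograd machinery: replace a stack of residual blocks with a solver integrating learned dynamics. Here a fixed-step RK4 integrator stands in for the paper's adaptive black-box solver, and the dynamics function is an illustrative placeholder:</p>

```python
import numpy as np

def rk4_step(f, h, t, dt, theta):
    """One classical Runge-Kutta 4 step for dh/dt = f(h, t, theta)."""
    k1 = f(h, t, theta)
    k2 = f(h + 0.5 * dt * k1, t + 0.5 * dt, theta)
    k3 = f(h + 0.5 * dt * k2, t + 0.5 * dt, theta)
    k4 = f(h + dt * k3, t + dt, theta)
    return h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def odenet_forward(f, h0, theta, t0=0.0, t1=1.0, n_steps=10):
    """'Depth' becomes an integration interval [t0, t1]: more solver steps
    act like a deeper network at no additional parameter cost."""
    h, t, dt = h0, t0, (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = rk4_step(f, h, t, dt, theta)
        t += dt
    return h
```

<p>With linear dynamics $f(h) = \theta h$ and $\theta = 1$, the forward pass approximates $h(1) = e \cdot h(0)$, so solver accuracy directly controls model output accuracy.</p>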
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the method across three distinct domains:</p>
<ol>
<li><strong>Supervised Learning (MNIST)</strong>:
<ul>
<li>Compared <strong>ODE-Net</strong> against a standard <strong>ResNet</strong> and a Runge-Kutta network (<strong>RK-Net</strong>).</li>
<li>Measured test error, parameter count, and memory usage.</li>
<li>Analyzed the trade-off between numerical precision (tolerance) and speed (NFE).</li>
</ul>
</li>
<li><strong>Continuous Normalizing Flows (Generative)</strong>:
<ul>
<li>Compared CNF against standard Normalizing Flows (NF) on density matching and maximum likelihood estimation tasks using toy 2D datasets (Two Circles, Two Moons, and other target distributions).</li>
<li>Evaluated training loss (KL divergence) and maximum likelihood estimation.</li>
</ul>
</li>
<li><strong>Time-Series Modeling (Latent ODE)</strong>:
<ul>
<li>Tested on a dataset of bi-directional spirals with irregular timestamps and Gaussian noise.</li>
<li>Compared Latent ODEs against an RNN baseline on predictive RMSE. A second RNN variant with time-difference concatenation was also trained.</li>
</ul>
</li>
</ol>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Efficiency</strong>: ODE-Nets achieved roughly equivalent accuracy to ResNets on MNIST (0.42% vs 0.41% error) but with <strong>constant memory cost</strong> ($O(1)$) compared to ResNet&rsquo;s linear cost ($O(L)$).</li>
<li><strong>Adaptive Depth</strong>: The number of function evaluations (NFE) in ODE-Nets increases with training epoch, suggesting the model adapts its complexity as it learns. The backward pass NFE is roughly half the forward pass NFE, indicating that the adjoint method is also more computationally efficient than direct backpropagation through the integrator.</li>
<li><strong>Generative Performance</strong>: Continuous Normalizing Flows (CNF) achieved lower KL divergence loss than standard Normalizing Flows (NF), trained with only 10,000 iterations (Adam) compared to 500,000 iterations (RMSprop) for NF. Note that the two models used different optimizers, so the comparison is not fully controlled. CNF can also expand capacity by increasing width ($M$) without architectural constraints.</li>
<li><strong>Irregular Time-Series</strong>: Latent ODEs significantly outperformed RNNs across all observation counts on irregular spiral data. The advantage is most pronounced with sparse observations (0.1642 vs 0.3937 RMSE at 30 obs), and the model learns interpretable latent trajectories that switch direction smoothly.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>MNIST</strong>: Standard handwritten digit dataset used for supervised learning benchmarks.</li>
<li><strong>Toy 2D Densities</strong>: &ldquo;Two Circles&rdquo; and &ldquo;Two Moons&rdquo; distributions used for visualizing normalizing flows.</li>
<li><strong>Bi-directional Spirals</strong>: A generated dataset of 1,000 2D spirals (half clockwise, half counter-clockwise). Each spiral is sampled at 100 equally-spaced timesteps with added Gaussian noise. For training, each spiral is then subsampled without replacement to $n \in \{30, 50, 100\}$ irregularly-spaced observations, simulating realistic missing data.</li>
</ul>
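<p>A sketch of the spiral data generation; the exact radii and time range are assumptions, while the structure (half clockwise, equally-spaced sampling, Gaussian noise, irregular subsampling) follows the description above:</p>

```python
import numpy as np

def make_spirals(n=1000, n_t=100, noise=0.1, n_obs=30, rng=None):
    """Bi-directional spirals: half clockwise, half counter-clockwise,
    n_t equally-spaced timesteps, Gaussian noise, then irregular
    subsampling to n_obs points per spiral."""
    rng = rng or np.random.default_rng(0)
    t = np.linspace(0.5, 3 * np.pi, n_t)
    data = []
    for i in range(n):
        direction = 1.0 if i < n // 2 else -1.0   # cw vs. ccw
        r = 0.5 * t                                # illustrative radius growth
        xy = np.stack([r * np.cos(direction * t), r * np.sin(direction * t)], axis=-1)
        xy = xy + noise * rng.standard_normal(xy.shape)
        idx = np.sort(rng.choice(n_t, size=n_obs, replace=False))
        data.append((t[idx], xy[idx]))
    return data
```

<p>Subsampling without replacement and then sorting yields strictly increasing, irregularly spaced observation times per spiral, which is exactly the setting where Latent ODEs have an advantage over fixed-bin RNNs.</p>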
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Adjoint Sensitivity Method (Backpropagation)</strong></p>
<p>To optimize the parameters of the ODE-Net, the authors use the adjoint sensitivity method to compute gradients. Standard backpropagation would require storing the activations at every step of the ODE solver, incurring a high memory cost that scales linearly with the number of steps.</p>
<p>Instead, this method treats the ODE solver as a &ldquo;black box&rdquo; and computes gradients by solving a second, <strong>augmented ODE</strong> backwards in time from the final state $t_1$ to the initial state $t_0$.</p>
<p>The augmented state contains three components that are solved simultaneously:</p>
<ol>
<li><strong>The State</strong>: The original hidden state $z(t)$, which is reconstructed backwards.</li>
<li><strong>The Adjoint</strong>: The sensitivity of the loss with respect to the state, $a(t) = \partial L / \partial z(t)$.</li>
<li><strong>The Gradient</strong>: The accumulating gradients with respect to parameters, $\partial L / \partial \theta$.</li>
</ol>
<p>The dynamics of this augmented system are defined as:
$$\frac{d}{dt}\begin{bmatrix} z(t) \\ a(t) \\ \partial L/\partial \theta \end{bmatrix} = \begin{bmatrix} f(z(t), t, \theta) \\ -a(t)^T \frac{\partial f}{\partial z} \\ -a(t)^T \frac{\partial f}{\partial \theta} \end{bmatrix}$$</p>
<p>Using this approach, the vector-Jacobian products (e.g., $a(t)^T \frac{\partial f}{\partial z}$) are evaluated efficiently using automatic differentiation.</p>
<blockquote>
<p><strong>Why:</strong> Reconstructing $z(t)$ backwards avoids storing the forward pass, enabling <strong>constant memory cost</strong> ($O(1)$) regardless of depth.</p>
<p><strong>Origin:</strong> Adapted from Pontryagin&rsquo;s maximum principle (1962) for optimal control.</p></blockquote>
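<p>For intuition, one backward-in-time Euler step of this augmented system can be written directly with autograd vector-Jacobian products (an illustrative sketch with a toy $f$ and a single scalar parameter, not the adaptive-solver implementation):</p>

```python
import torch

def adjoint_euler_step(f, z, a, grad_theta, t, dt, theta):
    """One backward Euler step of the augmented adjoint ODE.
    z: state, a: adjoint dL/dz, grad_theta: accumulated dL/dtheta."""
    z = z.detach().requires_grad_(True)
    fz = f(z, t, theta)
    # Vector-Jacobian products a^T df/dz and a^T df/dtheta in one autograd call
    a_dfdz, a_dfdth = torch.autograd.grad(fz, (z, theta), grad_outputs=a)
    return (z.detach() - dt * fz.detach(),   # reconstruct z backwards: dz/dt = f
            a + dt * a_dfdz,                 # da/dt = -a^T df/dz
            grad_theta + dt * a_dfdth)       # d(dL/dtheta)/dt = -a^T df/dtheta

# Toy dynamics f(z, t, theta) = theta * z, so df/dz = theta*I and df/dtheta = z
theta = torch.tensor(2.0, requires_grad=True)
z, a = torch.ones(3), torch.ones(3)
z1, a1, g1 = adjoint_euler_step(lambda z, t, th: th * z, z, a,
                                torch.tensor(0.0), 1.0, 0.1, theta)
```

<p>The plus signs arise because the step moves from $t$ to $t - dt$, so the minus signs in the augmented dynamics flip.</p>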
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> torch
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> torch.nn <span style="color:#66d9ef">as</span> nn
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> torchdiffeq <span style="color:#f92672">import</span> odeint_adjoint
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ODEFunc</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self, dim):
</span></span><span style="display:flex;"><span>        super(ODEFunc, self)<span style="color:#f92672">.</span><span style="color:#a6e22e">__init__</span>()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>net <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Sequential(
</span></span><span style="display:flex;"><span>            nn<span style="color:#f92672">.</span>Linear(dim, <span style="color:#ae81ff">50</span>),
</span></span><span style="display:flex;"><span>            nn<span style="color:#f92672">.</span>Tanh(),
</span></span><span style="display:flex;"><span>            nn<span style="color:#f92672">.</span>Linear(<span style="color:#ae81ff">50</span>, dim),
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, t, y):
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Defines dy/dt = f(y, t)</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>net(y)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Usage with adjoint method for O(1) memory backprop</span>
</span></span><span style="display:flex;"><span>func <span style="color:#f92672">=</span> ODEFunc(dim<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>y0 <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>tensor([[<span style="color:#ae81ff">1.</span>, <span style="color:#ae81ff">0.</span>]]) <span style="color:#75715e"># Initial state</span>
</span></span><span style="display:flex;"><span>t <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>linspace(<span style="color:#ae81ff">0.</span>, <span style="color:#ae81ff">1.</span>, <span style="color:#ae81ff">10</span>) <span style="color:#75715e"># Time points to solve for</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># &#39;odeint_adjoint&#39; automatically handles the augmented state backward pass</span>
</span></span><span style="display:flex;"><span>out <span style="color:#f92672">=</span> odeint_adjoint(func, y0, t, method<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;dopri5&#39;</span>)
</span></span></code></pre></div><p><strong>2. Instantaneous Change of Variables (CNF)</strong></p>
<p>For generative modeling, the authors introduce <strong>Continuous Normalizing Flows (CNF)</strong>. In discrete normalizing flows, the probability density of a transformed variable is calculated using the change of variables theorem, which requires computing the log-determinant of the Jacobian: $\log p(z_1) = \log p(z_0) - \log |\det \frac{\partial z_1}{\partial z_0}|$. This operation is computationally expensive ($O(D^3)$) and often restricts model architectures to ensure the Jacobian is easy to compute (e.g., triangular).</p>
<p>Moving to continuous time simplifies this requirement. The paper proves that if the transformation is defined by an ODE, the change in log-probability follows a differential equation determined by the <strong>trace</strong> of the Jacobian:
$$\frac{\partial \log p(z(t))}{\partial t} = -\text{tr}\left( \frac{\partial f}{\partial z(t)} \right)$$</p>
<p>The total change in log-density is obtained by integrating this value over time.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_trace</span>(y, f):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes trace of Jacobian df/dy.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    For high dimensions, use Hutchinson&#39;s trace estimator (approximate).
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    tr <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(y<span style="color:#f92672">.</span>size(<span style="color:#ae81ff">1</span>)):
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Gradients of f&#39;s i-th component w.r.t y&#39;s i-th component</span>
</span></span><span style="display:flex;"><span>        tr <span style="color:#f92672">+=</span> torch<span style="color:#f92672">.</span>autograd<span style="color:#f92672">.</span>grad(f[:, i]<span style="color:#f92672">.</span>sum(), y, create_graph<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)[<span style="color:#ae81ff">0</span>][:, i]
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> tr
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># In the ODE function:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># d(log_p)/dt = -trace(df/dy)</span>
</span></span></code></pre></div><blockquote>
<p><strong>Why:</strong> The trace operator has <strong>linear cost</strong> ($O(D)$), whereas the determinant has cubic cost ($O(D^3)$). This allows for unrestricted, &ldquo;wide&rdquo; architectures that are automatically bijective.</p>
<p><strong>Origin:</strong> This is the &ldquo;Instantaneous Change of Variables&rdquo; theorem (Theorem 1), derived in Appendix A of the paper.</p></blockquote>
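<p>The Hutchinson estimator mentioned in the docstring replaces the per-dimension loop of gradient calls with a single vector-Jacobian product per noise sample (a sketch assuming Rademacher noise; not code from the paper):</p>

```python
import torch

def hutchinson_trace(f, y, n_samples=10):
    """Unbiased estimate of tr(df/dy) via tr(J) = E[eps^T J eps],
    using Rademacher noise eps (entries +/-1, so E[eps eps^T] = I)."""
    tr = 0.
    for _ in range(n_samples):
        eps = (torch.randint(0, 2, y.shape) * 2 - 1).to(y.dtype)
        # One vector-Jacobian product eps^T (df/dy) instead of D gradient calls
        vjp = torch.autograd.grad(f, y, grad_outputs=eps, create_graph=True)[0]
        tr = tr + (vjp * eps).sum(dim=1)
    return tr / n_samples

y = torch.randn(2, 4, requires_grad=True)
f = 3.0 * y               # Jacobian is 3I, so the exact trace is 12
est = hutchinson_trace(f, y, n_samples=3)
```

<p>For this linear map the estimator is exact on every draw, since $\epsilon^T (cI) \epsilon = cD$ for Rademacher $\epsilon$; for general $f$ it is unbiased, with variance shrinking as samples are added.</p>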
<h3 id="models">Models</h3>
<p><strong>ODE-Net (MNIST Classification)</strong>:</p>
<ul>
<li><strong>Input</strong>: Two downsampling layers reduce the spatial resolution before the continuous-depth core.</li>
<li><strong>Core</strong>: 6 standard residual blocks replaced by a single <strong>ODESolve</strong> module.</li>
<li><strong>Output</strong>: Global average pooling + Fully connected layer.</li>
<li><strong>Solver</strong>: Implicit Adams method.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ODEBlock</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self, odefunc):
</span></span><span style="display:flex;"><span>        super(ODEBlock, self)<span style="color:#f92672">.</span><span style="color:#a6e22e">__init__</span>()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>odefunc <span style="color:#f92672">=</span> odefunc
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>integration_time <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>tensor([<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>])<span style="color:#f92672">.</span>float()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, x):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>integration_time <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>integration_time<span style="color:#f92672">.</span>type_as(x)
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Returns [x(t0), x(t1)]; we only want final state x(t1)</span>
</span></span><span style="display:flex;"><span>        out <span style="color:#f92672">=</span> odeint_adjoint(self<span style="color:#f92672">.</span>odefunc, x, self<span style="color:#f92672">.</span>integration_time)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> out[<span style="color:#ae81ff">1</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Conv-based dynamics: the Linear-based ODEFunc above operates on</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># flat vectors, so image feature maps need a convolutional variant</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ConvODEFunc</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self, channels):
</span></span><span style="display:flex;"><span>        super(ConvODEFunc, self)<span style="color:#f92672">.</span><span style="color:#a6e22e">__init__</span>()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>conv1 <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Conv2d(channels, channels, <span style="color:#ae81ff">3</span>, padding<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>conv2 <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Conv2d(channels, channels, <span style="color:#ae81ff">3</span>, padding<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, t, y):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>conv2(torch<span style="color:#f92672">.</span>tanh(self<span style="color:#f92672">.</span>conv1(y)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># ResNet-like architecture with ODE block</span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> nn<span style="color:#f92672">.</span>Sequential(
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>Conv2d(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">64</span>, <span style="color:#ae81ff">3</span>, <span style="color:#ae81ff">1</span>),
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>ReLU(inplace<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>),
</span></span><span style="display:flex;"><span>    ODEBlock(ConvODEFunc(<span style="color:#ae81ff">64</span>)), <span style="color:#75715e"># Continuous-depth layer replacement</span>
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>BatchNorm2d(<span style="color:#ae81ff">64</span>),
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>AdaptiveAvgPool2d((<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>)),
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>Flatten(),
</span></span><span style="display:flex;"><span>    nn<span style="color:#f92672">.</span>Linear(<span style="color:#ae81ff">64</span>, <span style="color:#ae81ff">10</span>)
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p><strong>Latent ODE (Time-Series)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: RNN with 25 hidden units that processes the observations in reverse time order to produce $q(z_0|x)$; the final RNN state therefore summarizes the entire sequence at $t_0$, parameterizing the initial latent state $z_0$ for the forward-running ODE.</li>
<li><strong>Latent Space</strong>: 4-dimensional latent state $z_0$.</li>
<li><strong>Dynamics ($f$)</strong>: Neural network with one hidden layer of 20 units.</li>
<li><strong>Decoder</strong>: Neural network with one hidden layer of 20 units computing $p(x_{t_i}|z_{t_i})$.</li>
<li><strong>Likelihood</strong>: Gaussian log-likelihood for the spiral reconstruction task. The paper also describes an optional Poisson process likelihood $\lambda(z(t))$ for event-time data (e.g., medical records), but this is not used in the spiral experiment.</li>
</ul>
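<p>Putting these components together, a compact sketch of the model (layer sizes from the bullets above; class and method names are illustrative, and a fixed-step Euler integrator stands in for the adaptive solver):</p>

```python
import torch
import torch.nn as nn

def euler_odeint(f, z0, t):
    """Fixed-step Euler stand-in for the adaptive solver used in the paper."""
    zs = [z0]
    for i in range(len(t) - 1):
        zs.append(zs[-1] + (t[i + 1] - t[i]) * f(t[i], zs[-1]))
    return torch.stack(zs)                                  # (T, batch, latent)

class LatentODE(nn.Module):
    def __init__(self, obs_dim=2, latent_dim=4, rnn_hidden=25, hidden=20):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, rnn_hidden, batch_first=True)
        self.to_q = nn.Linear(rnn_hidden, 2 * latent_dim)   # mean and log-variance
        self.f = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, latent_dim))
        self.decode = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, obs_dim))

    def forward(self, x, t):
        # Run the RNN over the sequence in reverse so its final state
        # summarizes all observations at t0
        _, h = self.rnn(torch.flip(x, dims=[1]))
        mean, logvar = self.to_q(h[-1]).chunk(2, dim=-1)
        z0 = mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterize
        zt = euler_odeint(lambda ti, z: self.f(z), z0, t)   # latent trajectory
        return self.decode(zt), mean, logvar
```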
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Experiment</th>
          <th>Metric</th>
          <th>Baseline (ResNet/RNN)</th>
          <th>ODE Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MNIST</td>
          <td>Test Error</td>
          <td>0.41%</td>
          <td>0.42%</td>
      </tr>
      <tr>
          <td>MNIST</td>
          <td>Parameters</td>
          <td>0.60 M</td>
          <td>0.22 M</td>
      </tr>
      <tr>
          <td>MNIST</td>
          <td>Memory</td>
          <td>$O(L)$</td>
          <td>$O(1)$</td>
      </tr>
      <tr>
          <td>Spirals (30 obs)</td>
          <td>RMSE</td>
          <td>0.3937</td>
          <td><strong>0.1642</strong></td>
      </tr>
      <tr>
          <td>Spirals (50 obs)</td>
          <td>RMSE</td>
          <td>0.3202</td>
          <td><strong>0.1502</strong></td>
      </tr>
      <tr>
          <td>Spirals (100 obs)</td>
          <td>RMSE</td>
          <td>0.1813</td>
          <td><strong>0.1346</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Implementation</strong>: Hidden state dynamics evaluated on GPU using <strong>TensorFlow</strong>.</li>
<li><strong>Solvers</strong>: Fortran ODE solvers (LSODE, VODE) from <code>scipy.integrate</code> were used for the actual integration.</li>
<li><strong>Note</strong>: While the original paper used TensorFlow/Scipy, the authors later released <code>torchdiffeq</code> (PyTorch), which has become the standard implementation for this architecture. The code samples above reflect this modern standard.</li>
<li><strong>Interface</strong>: Python&rsquo;s <code>autograd</code> framework bridged the TensorFlow dynamics and Scipy solvers.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The paper identifies several practical limitations of Neural ODEs:</p>
<ul>
<li><strong>Minibatching</strong>: Batching requires concatenating states of each batch element into a combined ODE of dimension $D \times K$. Controlling error on all batch elements together can require more evaluations than solving each system individually, though in practice this overhead was not substantial.</li>
<li><strong>Tolerance tuning</strong>: Users must choose error tolerances for both the forward and reverse passes. The paper used 1.5e-8 for sequence modeling, 1e-3 for classification, and 1e-5 for density estimation.</li>
<li><strong>Backward trajectory reconstruction</strong>: Running the dynamics backwards to reconstruct the forward state trajectory can introduce extra numerical error if the reconstructed trajectory diverges from the original. Checkpointing (storing intermediate states) can address this, though the authors did not find it necessary in practice.</li>
<li><strong>Uniqueness requirements</strong>: The neural network $f$ must be Lipschitz continuous (e.g., using tanh or ReLU activations with finite weights) to guarantee a unique solution via the Picard&ndash;Lindel&ouml;f existence and uniqueness theorem.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/rtqichen/torchdiffeq">torchdiffeq</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with GPU-based ODE solvers</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, R. T. Q., Rubanova, Y., Bettencourt, J., &amp; Duvenaud, D. (2018). Neural ordinary differential equations. <em>Proceedings of the 32nd International Conference on Neural Information Processing Systems</em>, 6572-6583.</p>
<p><strong>Publication</strong>: NeurIPS 2018</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chen2018neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural ordinary differential equations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Ricky T. Q. and Rubanova, Yulia and Bettencourt, Jesse and Duvenaud, David}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 32nd International Conference on Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6572--6583}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/rtqichen/torchdiffeq">Official PyTorch Implementation</a></li>
</ul>
]]></content:encoded></item><item><title>Flow Matching for Generative Modeling: Scalable CNFs</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/</guid><description>A simulation-free framework for training Continuous Normalizing Flows using Conditional Flow Matching and Optimal Transport paths.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, as it introduces &ldquo;Flow Matching&rdquo; (FM), a novel simulation-free paradigm for training Continuous Normalizing Flows (CNFs) at scale. It is supported by a strong <strong>Theory</strong> basis, providing formal theorems that allow the intractable marginal vector field regression to be solved via a tractable conditional objective. It also touches on <strong>Systematization</strong> by showing that existing diffusion paths are specific instances of the proposed Gaussian probability path framework.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The paper aims to overcome the scaling limitations of Continuous Normalizing Flows (CNFs).</p>
<ul>
<li><strong>Problem</strong>: Standard Maximum Likelihood training for CNFs requires expensive numerical ODE simulations during training, which scales poorly. Existing simulation-free methods often involve intractable integrals or result in biased gradients.</li>
<li><strong>Gap</strong>: Diffusion models scale well, yet they are restricted to specific, curved probability paths (e.g., VP, VE) that can result in slow sampling and long training times.</li>
<li><strong>Goal</strong>: To develop an efficient, simulation-free training method for CNFs that supports arbitrary probability paths, specifically allowing for straighter, more efficient trajectories like those from Optimal Transport.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is <strong>Flow Matching (FM)</strong> and specifically the <strong>Conditional Flow Matching (CFM)</strong> objective.</p>
<ul>
<li><strong>Direct Vector Field Regression</strong>: The model regresses a target vector field $u_t$ that generates a desired probability path $p_t$.</li>
<li><strong>Conditional Flow Matching (CFM)</strong>: The authors prove that regressing the vector field of <em>conditional</em> paths (e.g., $p_t(x|x_1)$ given a single data point) yields the same gradients as regressing the intractable marginal vector field. This bypasses the need to know the marginal score or vector field.</li>
<li><strong>Optimal Transport Paths</strong>: The framework enables the use of <strong>Optimal Transport (OT)</strong> displacement interpolation for probability paths. OT paths are straight lines with constant speed, leading to faster training and easier sampling.</li>
</ul>
<p><strong>Concurrent work note</strong>: Rectified Flow (Liu et al., 2023) and Stochastic Interpolants (Albergo &amp; Vanden-Eijnden, 2023) were published concurrently at ICLR 2023 with structurally similar contributions under different names. All three independently propose simulation-free training of continuous flows via direct vector field regression; the differences lie in the specific interpolation schemes, theoretical framing, and experimental focus.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<ul>
<li><strong>Domains</strong>: 2D Checkerboard data, CIFAR-10, and ImageNet at resolutions $32 \times 32$, $64 \times 64$, and $128 \times 128$.</li>
<li><strong>Task</strong>: Unconditional generative modeling (density estimation and sample quality) and conditional super-resolution ($64 \times 64 \to 256 \times 256$).</li>
<li><strong>Baselines</strong>: Compared against Diffusion-based methods on the same architecture (U-Net): DDPM, Score Matching (SM), and ScoreFlow.</li>
<li><strong>Ablations</strong>: Specifically compared <strong>FM with Diffusion paths</strong> vs. <strong>FM with Optimal Transport (OT) paths</strong> to isolate the benefit of the training objective vs. the path choice.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Outperforms diffusion baselines</strong>: FM-OT consistently outperforms all diffusion-based methods (DDPM, Score Matching, ScoreFlow) in both Likelihood (NLL) and Sample Quality (FID) across CIFAR-10 and ImageNet, using the same U-Net architecture and training budget. Selected rows from Table 1 (NLL in bits per dimension, BPD; lower is better for all three metrics; &ldquo;FM w/ OT&rdquo; and &ldquo;FM w/ Diffusion&rdquo; refer to FM trained with OT paths and Diffusion paths respectively):</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Method</th>
          <th>NLL (BPD) ↓</th>
          <th>FID ↓</th>
          <th>NFE ↓</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CIFAR-10</td>
          <td>DDPM</td>
          <td>3.12</td>
          <td>7.48</td>
          <td>274</td>
      </tr>
      <tr>
          <td>CIFAR-10</td>
          <td>FM w/ OT</td>
          <td><strong>2.99</strong></td>
          <td><strong>6.35</strong></td>
          <td><strong>142</strong></td>
      </tr>
      <tr>
          <td>ImageNet 64×64</td>
          <td>ScoreFlow</td>
          <td>3.36</td>
          <td>24.95</td>
          <td>601</td>
      </tr>
      <tr>
          <td>ImageNet 64×64</td>
          <td>FM w/ OT</td>
          <td><strong>3.31</strong></td>
          <td><strong>14.45</strong></td>
          <td><strong>138</strong></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Training stability</strong>: FM with diffusion paths (FM w/ Diffusion) is itself a more stable alternative to diffusion training than DDPM and Score Matching, as shown by training curves in the paper (Figure 5), even before switching to OT paths. The OT path then provides further gains.</li>
<li><strong>Sampling speed</strong>: The straight trajectories of OT paths allow accurate sampling with significantly fewer function evaluations (NFE) compared to diffusion paths.</li>
<li><strong>Generality</strong>: Diffusion is a specific instance of Gaussian probability paths within FM. OT paths are a better-optimized alternative available within the same framework.</li>
<li><strong>Downstream adoption</strong>: Flow matching has been adopted beyond image generation. <a href="/notes/biology/computational-biology/dynamicflow/">DynamicFlow</a> uses it as the generative backbone for simultaneously generating ligand molecules and transforming protein pockets, extending flow matching to structure-based drug design.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Datasets</strong>: CIFAR-10, ImageNet ($32 \times 32$, $64 \times 64$, $128 \times 128$).</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>Images are center-cropped and resized.</li>
<li>For $32 \times 32$ and $64 \times 64$, the preprocessing follows Chrabaszcz et al. (2017).</li>
<li>Data is transformed via $\varphi(y) = 2^7(y+1)$ mapping $[-1, 1]$ pixel values to $[0, 256]$ for BPD computation.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Conditional Flow Matching (CFM) Objective</strong></p>
<p>The practical training objective used is the CFM loss, which bypasses intractable marginalization:</p>
<p>$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t(\psi_t(x_0)) - u_t(\psi_t(x_0) \mid x_1) \right\|^2$$</p>
<p>Where $t \sim \mathcal{U}[0,1]$, $x_1 \sim q(x_1)$ (data), and $x_0 \sim p(x_0)$ (noise).</p>
<p><strong>2. Optimal Transport (OT) Probability Path</strong></p>
<p>The authors recommend the OT path for efficiency.</p>
<ul>
<li><strong>Mean/Std Schedule</strong>: $\mu_t(x) = t x_1$ and $\sigma_t(x) = 1 - (1 - \sigma_{min})t$.</li>
<li><strong>Conditional Flow Map</strong>: $\psi_t(x) = (1 - (1 - \sigma_{min})t)x + t x_1$.</li>
<li><strong>Target Vector Field</strong>: The closed-form regression target for OT is:
$$u_t(x|x_1) = \frac{x_1 - (1 - \sigma_{min})x}{1 - (1 - \sigma_{min})t}$$</li>
</ul>
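<p>These formulas combine into a short training objective; the sketch below assumes flat inputs of shape <code>(batch, dim)</code> and an arbitrary vector-field network <code>v_theta(x, t)</code> (an illustrative name, not the paper&rsquo;s code):</p>

```python
import torch

def cfm_ot_loss(v_theta, x1, sigma_min=1e-5):
    """Conditional Flow Matching loss with the OT path."""
    x0 = torch.randn_like(x1)                 # noise x0 ~ p(x0) = N(0, I)
    t = torch.rand(x1.size(0), 1)             # t ~ U[0, 1]
    coef = 1 - (1 - sigma_min) * t
    psi = coef * x0 + t * x1                  # conditional flow map psi_t(x0)
    u = (x1 - (1 - sigma_min) * psi) / coef   # closed-form OT target u_t(psi | x1)
    return ((v_theta(psi, t) - u) ** 2).mean()

loss = cfm_ot_loss(lambda x, t: torch.zeros_like(x), torch.randn(8, 2))
```

<p>Evaluated at $x = \psi_t(x_0)$, the OT target simplifies to $x_1 - (1 - \sigma_{min})x_0$: the regression target is constant along each conditional path.</p>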
<p><strong>3. Sampling</strong></p>
<p>Sampling is performed by solving the ODE $\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x))$ from $t=0$ to $t=1$ using the learned vector field $v_t$.</p>
<ul>
<li><strong>Solver</strong>: <code>dopri5</code> (adaptive) is used for robust evaluation. Fixed-step solvers (Euler, Midpoint) are used for low-NFE efficiency tests.</li>
</ul>
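<p>A fixed-step Euler sampler of the kind used in the low-NFE efficiency tests can be sketched as follows (assuming a learned network <code>v_theta(x, t)</code> on flat inputs):</p>

```python
import torch

@torch.no_grad()
def sample_euler(v_theta, n, dim, steps=100):
    """Integrate dx/dt = v_t(x) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = torch.randn(n, dim)                   # x0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n, 1), i * dt)
        x = x + dt * v_theta(x, t)
    return x
```

<p>Because OT trajectories are straight with constant speed, even a handful of Euler steps traces them accurately, which is where the NFE savings come from.</p>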
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: U-Net architecture from Dhariwal &amp; Nichol (2021) is used for all image experiments.</li>
<li><strong>Toy Data</strong>: 5-layer MLP with 512 neurons.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$, weight decay=0.0).</li>
<li>Learning Rate: Polynomial decay or constant (see Table 3 in paper).</li>
<li>$\sigma_{min}$: Set to a small value (e.g., $10^{-5}$).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><strong>NLL (BPD)</strong>: Computed using the continuous change of variables formula, estimated via the Hutchinson trace estimator to bypass $O(d^3)$ divergence computation.</li>
<li><strong>FID</strong>: Fr&eacute;chet Inception Distance for sample quality.</li>
<li><strong>NFE</strong>: Number of Function Evaluations required by the solver.</li>
</ul>
</li>
<li><strong>Likelihood Computation</strong>: Requires solving an augmented ODE to track the log-density change:
$$\frac{d}{dt} \begin{bmatrix} \phi_t(x) \\ f(t) \end{bmatrix} = \begin{bmatrix} v_t(\phi_t(x)) \\ -\text{div}(v_t(\phi_t(x))) \end{bmatrix}$$</li>
</ul>
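<p>A minimal sketch of this likelihood computation (fixed-step Euler with exact divergence, practical for 2D toys; the paper uses an adaptive solver and the Hutchinson estimator instead):</p>

```python
import torch

def log_prob(v, x, steps=200):
    """Flow samples x backwards from t=1 to t=0, accumulating div(v) along the
    trajectory, then add the standard-normal base log-density at t=0."""
    dt = 1.0 / steps
    int_div = torch.zeros(x.size(0))
    for i in range(steps, 0, -1):
        t = torch.full((x.size(0), 1), i * dt)
        x = x.detach().requires_grad_(True)
        vx = v(x, t)
        # Exact divergence: one gradient call per dimension
        div = sum(torch.autograd.grad(vx[:, d].sum(), x, retain_graph=True)[0][:, d]
                  for d in range(x.size(1)))
        int_div = int_div + dt * div.detach()
        x = x.detach() - dt * vx.detach()         # backwards Euler step of dx/dt = v
    base = torch.distributions.Normal(0., 1.).log_prob(x).sum(dim=1)
    return base - int_div                         # log p1(x) = log p0(x0) - int div dt
```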
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>CIFAR-10</strong>: 2 GPUs.</li>
<li><strong>ImageNet-32</strong>: 4 GPUs.</li>
<li><strong>ImageNet-64</strong>: 16 GPUs.</li>
<li><strong>ImageNet-128</strong>: 32 GPUs.</li>
<li><strong>Precision</strong>: Full 32-bit for CIFAR/IM-32; 16-bit mixed precision for IM-64/128.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/facebookresearch/flow_matching">flow_matching (PyTorch library)</a></td>
          <td>Code</td>
          <td>CC BY-NC 4.0</td>
          <td>Later official library from Meta; not the original experiment code</td>
      </tr>
  </tbody>
</table>
<p>The paper does not release the original training code or model weights used in the experiments. The <code>facebookresearch/flow_matching</code> library was released later as a general-purpose PyTorch implementation of flow matching algorithms. Standard benchmark datasets (CIFAR-10, ImageNet) are publicly available.</p>
<hr>
<h2 id="theoretical-notes-why-cfm-works">Theoretical Notes: Why CFM Works</h2>
<p>The paper relies on three key theorems to make training tractable.</p>
<p><strong>Theorem 1 (Marginal Generation)</strong>:</p>
<p>Marginalizing conditional vector fields $u_t(x|x_1)$ yields the correct marginal vector field $u_t(x)$ that generates the marginal probability path $p_t(x)$.</p>
<p>$$u_t(x) = \int u_t(x|x_1) \frac{p_t(x|x_1)q(x_1)}{p_t(x)} dx_1$$</p>
<blockquote>
<p><strong>Understanding the Proof:</strong></p>
<p>To understand why this theorem holds, we have to look at the <strong>Continuity Equation</strong>, which is the fundamental partial differential equation (PDE) that links a probability density path $p_t$ to a vector field $u_t$.</p>
<p>A vector field $u_t$ is said to &ldquo;generate&rdquo; a probability path $p_t$ if and only if they satisfy the continuity equation:</p>
<p>$$\frac{\partial p_t(x)}{\partial t} + \nabla \cdot (p_t(x) u_t(x)) = 0$$</p>
<p>The proof of Theorem 1 relies on substituting the definitions of the marginal path and vector field into this equation to see if they balance out.</p>
<p><strong>Step-by-Step Proof:</strong></p>
<ol>
<li>
<p><strong>Start with the time derivative of the marginal path</strong>: We begin by differentiating the marginal probability path $p_t(x)$ with respect to time. By definition, the marginal path is the integral of the conditional paths over the data distribution:
$$\frac{\partial p_t(x)}{\partial t} = \frac{\partial}{\partial t} \int p_t(x|x_1) q(x_1) dx_1$$</p>
</li>
<li>
<p><strong>Swap derivative and integral</strong>: Assuming standard regularity conditions (Leibniz Rule), we can move the time derivative inside the integral:
$$\frac{\partial p_t(x)}{\partial t} = \int \frac{\partial p_t(x|x_1)}{\partial t} q(x_1) dx_1$$</p>
</li>
<li>
<p><strong>Apply the Conditional Continuity Equation</strong>: This is the critical step. We know that the conditional vector field $u_t(x|x_1)$ generates the conditional path $p_t(x|x_1)$. Therefore, for every single sample $x_1$, the pair satisfies the continuity equation:
$$\frac{\partial p_t(x|x_1)}{\partial t} = -\nabla \cdot (p_t(x|x_1) u_t(x|x_1))$$</p>
<p>Substituting this into our integral gives:
$$\frac{\partial p_t(x)}{\partial t} = -\int \nabla \cdot (p_t(x|x_1) u_t(x|x_1)) q(x_1) dx_1$$</p>
</li>
<li>
<p><strong>Pull the Divergence out</strong>: Since the divergence operator ($\nabla \cdot$) acts on $x$ and the integral is over $x_1$, we can pull the divergence operator outside the integral (by linearity):
$$\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot \left( \int p_t(x|x_1) u_t(x|x_1) q(x_1) dx_1 \right)$$</p>
</li>
<li>
<p><strong>Match with the Marginal Vector Field Definition</strong>: Now, look at the term inside the parentheses. The paper defines the marginal vector field $u_t(x)$ specifically to make this term simpler. Rearranging the definition of $u_t(x)$ provided in the theorem:
$$p_t(x) u_t(x) = \int p_t(x|x_1) u_t(x|x_1) q(x_1) dx_1$$</p>
<p>Substitute $p_t(x) u_t(x)$ back into our equation from Step 4:
$$\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot (p_t(x) u_t(x))$$</p>
</li>
</ol>
<p><strong>Conclusion</strong>: We have just shown that $\frac{\partial p_t(x)}{\partial t} + \nabla \cdot (p_t(x) u_t(x)) = 0$. This is exactly the continuity equation. Because the marginal path and the aggregated marginal vector field satisfy this equation, the vector field is proven to generate the path.</p></blockquote>
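The theorem can also be checked numerically. The sketch below is a toy 1D setup of my own (not from the paper): two equally likely data points with OT-style Gaussian conditional paths $\mu_t(x_1) = t x_1$, $\sigma_t = 1 - (1-\sigma_{\min})t$. It forms the marginal field as the posterior-weighted average of conditional fields and verifies by finite differences that the marginal pair satisfies the continuity equation:

```python
import numpy as np

# Toy 1D check of Theorem 1 (assumed setup, not from the paper):
# two equally likely data points, OT-style Gaussian conditional paths.
SIGMA_MIN = 0.01
DATA = np.array([-2.0, 1.5])   # data points x_1 with q(x_1) = 1/2 each
W = np.array([0.5, 0.5])

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

def sigma(t):
    return 1 - (1 - SIGMA_MIN) * t

def p_marginal(x, t):
    # p_t(x) = sum_i q(x_1^i) * p_t(x | x_1^i)
    return sum(w * gauss(x, t * x1, sigma(t)) for w, x1 in zip(W, DATA))

def u_marginal(x, t):
    # Theorem 1: posterior-weighted average of the conditional fields
    post = np.array([w * gauss(x, t * x1, sigma(t)) for w, x1 in zip(W, DATA)])
    post /= post.sum(axis=0)
    cond = np.array([(x1 - (1 - SIGMA_MIN) * x) / sigma(t) for x1 in DATA])
    return (post * cond).sum(axis=0)

# residual of the continuity equation dp/dt + d(p*u)/dx via finite differences
x, t, h = np.linspace(-3.0, 3.0, 2001), 0.5, 1e-4
dp_dt = (p_marginal(x, t + h) - p_marginal(x, t - h)) / (2 * h)
div = np.gradient(p_marginal(x, t) * u_marginal(x, t), x)
residual = np.abs(dp_dt + div).max()
print(residual)  # small: the marginal pair satisfies the continuity equation
```

The residual is zero up to discretization error, which is exactly the statement proven above.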
<p><strong>Theorem 2 (Gradient Equivalence)</strong>:</p>
<p>The intractable Flow Matching objective $\mathcal{L}_{FM}$ (which requires $u_t(x)$) has the <strong>same gradients</strong> as the tractable Conditional Flow Matching objective $\mathcal{L}_{CFM}$.</p>
<p>$$\nabla_\theta \mathcal{L}_{FM}(\theta) = \nabla_\theta \mathcal{L}_{CFM}(\theta)$$</p>
<p>This allows the model to learn the marginal vector field by only seeing conditional sample paths.</p>
<blockquote>
<p><strong>Understanding the Proof:</strong></p>
<p>The reason Theorem 2 holds is that the &ldquo;Conditional Flow Matching&rdquo; (CFM) objective is essentially an unbiased estimator of the &ldquo;Flow Matching&rdquo; (FM) objective (up to a constant). When we average over all the conditional data points $x_1$, the &ldquo;cross-term&rdquo; in the loss function aligns perfectly with the marginal vector field.</p>
<p><strong>1. Expand the Loss Functions</strong></p>
<p>First, let&rsquo;s look at the squared error in both objectives. Recall that $v_t$ is our neural network (parameterized by $\theta$), $u_t$ is the intractable marginal target, and $u_t(x|x_1)$ is the tractable conditional target.</p>
<p>Expanding the squared norms:</p>
<ul>
<li>
<p><strong>FM Objective</strong>:
$$\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, p_t(x)} \left[ |v_t(x)|^2 - 2v_t(x) \cdot u_t(x) + |u_t(x)|^2 \right]$$</p>
</li>
<li>
<p><strong>CFM Objective</strong>:
$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \left[ |v_t(x)|^2 - 2v_t(x) \cdot u_t(x|x_1) + |u_t(x|x_1)|^2 \right]$$</p>
</li>
</ul>
<p><strong>Key Insight</strong>: When we take the gradient $\nabla_\theta$, the last term in both equations disappears because the targets ($u_t$) are independent of the network weights $\theta$. We only need to show that the expectations of the first two terms match.</p>
<p><strong>2. Matching the First Term ($|v_t(x)|^2$)</strong></p>
<p>This part is straightforward. The expectation of $|v_t(x)|^2$ is the same in both cases because of how the marginal density $p_t(x)$ is defined.</p>
<ul>
<li><strong>FM</strong>: averages over $p_t(x)$.</li>
<li><strong>CFM</strong>: averages over $p_t(x|x_1)q(x_1)$.</li>
</ul>
<p>Since $p_t(x) = \int p_t(x|x_1) q(x_1) dx_1$ (by definition), averaging over the joint distribution is mathematically identical to averaging over the marginal $p_t(x)$.</p>
<p><strong>3. Matching the Cross Term (The &ldquo;Trick&rdquo;)</strong></p>
<p>This is the critical part of the proof. We need to show that the interaction between the network and the marginal field equals the interaction between the network and the conditional field.</p>
<p><strong>The Goal</strong>: Show $\mathbb{E}_{t, p_t(x)} [v_t(x) \cdot u_t(x)] = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} [v_t(x) \cdot u_t(x|x_1)]$.</p>
<p><strong>The Proof</strong>:</p>
<ol>
<li>
<p>Start with the <strong>FM cross-term</strong> (marginal):
$$\mathbb{E}_{t, p_t(x)} [v_t(x) \cdot u_t(x)]$$</p>
</li>
<li>
<p>Substitute the definition of the marginal vector field $u_t(x)$ derived in <strong>Theorem 1</strong>:
$$u_t(x) = \int u_t(x|x_1) \frac{p_t(x|x_1) q(x_1)}{p_t(x)} dx_1$$</p>
</li>
<li>
<p>Plug this into the expectation, written out as an integral (the $p_t(x)$ factors will cancel in the next step):
$$\mathbb{E}_{t, p_t(x)} [v_t(x) \cdot u_t(x)] = \int_t \int_x p_t(x) v_t(x) \cdot \left[ \int_{x_1} u_t(x|x_1) \frac{p_t(x|x_1) q(x_1)}{p_t(x)} dx_1 \right] dx \, dt$$</p>
</li>
<li>
<p>This simplifies to:
$$= \int_t \int_x \int_{x_1} v_t(x) \cdot u_t(x|x_1) p_t(x|x_1) q(x_1) dx_1 dx dt$$</p>
</li>
<li>
<p>This is exactly the definition of the expectation in the <strong>CFM objective</strong>:
$$= \mathbb{E}_{t, q(x_1), p_t(x|x_1)} [v_t(x) \cdot u_t(x|x_1)]$$</p>
</li>
</ol>
<p><strong>Conclusion</strong>: Because the expectations of all terms involving $\theta$ are identical, the gradients must be identical.</p>
<p>Intuitively, this works like <strong>Denoising Score Matching</strong> or <strong>Stochastic Gradient Descent</strong>: even though each individual conditional vector field $u_t(x|x_1)$ points to a specific data point $x_1$ (which may differ from the true marginal direction), the <em>average</em> of all these pulls equals the true marginal vector field $u_t(x)$.</p></blockquote>
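This can be made concrete with quadrature. In the toy setup below (my own, not the paper's; a fixed $t$ for simplicity, a linear model $v_\theta(x) = \theta x$, and the same OT-style path as above), the gap $\mathcal{L}_{CFM} - \mathcal{L}_{FM}$ comes out as the same constant for every $\theta$, so the gradients coincide:

```python
import numpy as np

# Numerical check of Theorem 2 at a fixed t (assumed toy setup, not from
# the paper): L_CFM(theta) - L_FM(theta) is a theta-independent constant.
SIGMA_MIN, T = 0.01, 0.5
DATA, W = np.array([-1.0, 2.0]), np.array([0.5, 0.5])
X = np.linspace(-6.0, 6.0, 4001)
DX = X[1] - X[0]
SIG = 1 - (1 - SIGMA_MIN) * T

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

cond_p = np.array([gauss(X, T * x1, SIG) for x1 in DATA])             # p_t(x|x1)
cond_u = np.array([(x1 - (1 - SIGMA_MIN) * X) / SIG for x1 in DATA])  # u_t(x|x1)
p = (W[:, None] * cond_p).sum(0)               # marginal path p_t(x)
u = (W[:, None] * cond_p * cond_u).sum(0) / p  # marginal field u_t(x)

def L_FM(theta):   # model v(x) = theta * x regressed onto the marginal field
    return (p * (theta * X - u) ** 2).sum() * DX

def L_CFM(theta):  # same model regressed onto the conditional fields
    return sum(w * (cp * (theta * X - cu) ** 2).sum() * DX
               for w, cp, cu in zip(W, cond_p, cond_u))

gaps = [L_CFM(th) - L_FM(th) for th in (-1.2, 0.3, 2.0)]
print(gaps)  # one theta-independent constant, repeated
```

The constant gap is the expected conditional variance of the targets, which is why CFM behaves like denoising score matching: noisier per-sample targets, identical gradients in expectation.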
<p><strong>Theorem 3 (Gaussian Conditional VFs)</strong>:</p>
<p>For any Gaussian probability path $p_t(x|x_1) = \mathcal{N}(x | \mu_t(x_1), \sigma_t(x_1)^2 I)$, the unique vector field generating it is available in closed form:</p>
<p>$$u_t(x|x_1) = \frac{\sigma'_t(x_1)}{\sigma_t(x_1)}(x - \mu_t(x_1)) + \mu'_t(x_1)$$</p>
<p>This theorem allows explicitly defining targets for both Diffusion (curved) and Optimal Transport (straight) paths.</p>
<blockquote>
<p><strong>Understanding the Proof:</strong></p>
<p>The derivation of Theorem 3 comes from the direct relationship between a flow map $\psi_t$ and its generating vector field. Because we chose a specific, simple path (Gaussian), we can invert the flow map to find the vector field in closed form.</p>
<p><strong>1. Define the Flow Map $\psi_t$</strong></p>
<p>We start by defining the conditional probability path as a Gaussian:</p>
<p>$$p_t(x|x_1) = \mathcal{N}(x | \mu_t(x_1), \sigma_t(x_1)^2 I)$$</p>
<p>The simplest way to &ldquo;push&rdquo; a standard normal distribution (noise) $p_0 = \mathcal{N}(0, I)$ to this Gaussian is using an affine transformation (scaling and shifting). We define the flow map $\psi_t$ as:</p>
<p>$$\psi_t(x_0) = \sigma_t(x_1) x_0 + \mu_t(x_1)$$</p>
<p>This map takes a noise sample $x_0$ and transforms it into a sample $x$ at time $t$.</p>
<p><strong>2. The Definition of a Generating Vector Field</strong></p>
<p>By definition, a vector field $u_t$ generates a flow $\psi_t$ if the vector field describes the instantaneous velocity of the flow at any point. Mathematically:</p>
<p>$$u_t(\psi_t(x_0)) = \frac{d}{dt}\psi_t(x_0)$$</p>
<p>Let $x = \psi_t(x_0)$ be the position of the particle at time $t$. We want to find $u_t(x)$.</p>
<p><strong>3. Invert the Flow Map</strong></p>
<p>To find $u_t(x)$, we must express the equation in terms of $x$ rather than $x_0$. Since our flow map is a simple affine transformation (multiply and add), it is easily invertible (assuming $\sigma_t(x_1) \neq 0$):</p>
<p>$$x_0 = \frac{x - \mu_t(x_1)}{\sigma_t(x_1)}$$</p>
<p>We will call this inverse map $\psi_t^{-1}(x)$.</p>
<p><strong>4. Differentiate the Flow Map</strong></p>
<p>Now we calculate the left side of our definition equation (velocity): $\frac{d}{dt}\psi_t(x_0)$.</p>
<p>Taking the time derivative of $\psi_t(x_0) = \sigma_t(x_1) x_0 + \mu_t(x_1)$:</p>
<p>$$\frac{d}{dt}\psi_t(x_0) = \sigma'_t(x_1) x_0 + \mu'_t(x_1)$$</p>
<p>(Note: $\sigma'_t$ and $\mu'_t$ denote time derivatives).</p>
<p><strong>5. Substitute and Solve</strong></p>
<p>Now we combine everything. We know $u_t(\psi_t(x_0)) = \frac{d}{dt}\psi_t(x_0)$.</p>
<p>Substitute the result from Step 4 into this equation:</p>
<p>$$u_t(\psi_t(x_0)) = \sigma'_t(x_1) x_0 + \mu'_t(x_1)$$</p>
<p>This expresses the vector field in terms of the initial point $x_0$. We must express it in terms of the current point $x$. So, we plug in the inverse formula for $x_0$ derived in Step 3:</p>
<p>$$u_t(x|x_1) = \sigma'_t(x_1) \frac{x - \mu_t(x_1)}{\sigma_t(x_1)} + \mu'_t(x_1)$$</p>
<p>Rearranging terms gives the final closed form:</p>
<p>$$u_t(x|x_1) = \frac{\sigma'_t(x_1)}{\sigma_t(x_1)}(x - \mu_t(x_1)) + \mu'_t(x_1)$$</p>
<p><strong>Why is this useful?</strong></p>
<p>This formula means that as long as you can define a mean schedule $\mu_t(x_1)$ and a standard deviation schedule $\sigma_t(x_1)$ (which is easy to do for both Diffusion and Optimal Transport), you immediately get the exact vector field target $u_t(x|x_1)$ needed to train your neural network, bypassing complex ODE solving or score matching approximations.</p></blockquote>
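As a concrete sketch (with an assumed OT-style schedule $\mu_t(x_1) = t x_1$, $\sigma_t = 1 - (1-\sigma_{\min})t$, not the paper's code), the closed-form conditional field can be integrated as an ODE from a noise sample, and the endpoint lands exactly where the flow map $\psi_1(x_0) = \sigma_{\min} x_0 + x_1$ says it should:

```python
# Theorem 3 in action (assumed OT-style schedule):
# mu_t(x1) = t * x1, sigma_t = 1 - (1 - sigma_min) * t.
SIGMA_MIN = 0.01

def sigma(t):
    return 1 - (1 - SIGMA_MIN) * t

def u_cond(x, t, x1):
    # closed form: (sigma'_t / sigma_t) * (x - mu_t) + mu'_t
    dsigma, dmu = -(1 - SIGMA_MIN), x1
    return (dsigma / sigma(t)) * (x - t * x1) + dmu

# integrate dx/dt = u_t(x | x1) from a noise sample with fixed-step RK4
x1, x = 2.0, -0.7
x0, n = x, 1000
for i in range(n):
    t, h = i / n, 1.0 / n
    k1 = u_cond(x, t, x1)
    k2 = u_cond(x + h / 2 * k1, t + h / 2, x1)
    k3 = u_cond(x + h / 2 * k2, t + h / 2, x1)
    k4 = u_cond(x + h * k3, t + h, x1)
    x += (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)

print(x, SIGMA_MIN * x0 + x1)  # endpoint equals the flow map psi_1(x0)
```

No score estimation or complex solving is needed: the schedule alone determines the target field.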
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., &amp; Le, M. (2023). Flow Matching for Generative Modeling. <em>International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lipmanFlowMatchingGenerative2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Flow Matching for Generative Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Lipman, Yaron and Chen, Ricky T. Q. and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2210.02747">ArXiv</a></li>
</ul>
]]></content:encoded></item><item><title>Building Normalizing Flows with Stochastic Interpolants</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/stochastic-interpolants/</link><pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/stochastic-interpolants/</guid><description>A continuous-time normalizing flow using stochastic interpolants and quadratic loss to bypass costly ODE backpropagation.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Method</strong> paper, with significant <strong>Theory</strong> contributions.</p>
<p>The authors propose a specific algorithm (&ldquo;InterFlow&rdquo;) for constructing generative models based on continuous-time normalizing flows. The work is characterized by the derivation of a new training objective (a simple quadratic loss) that bypasses the computational bottlenecks of previous methods. It includes prominent baseline comparisons against continuous flow methods (FFJORD, OT-Flow) and diffusion models. The theoretical component establishes the validity of the interpolant density satisfying the continuity equation (a conservation law governing how probability mass flows) and bounds the Wasserstein-2 distance (a measure of transport cost between distributions, penalizing squared displacement) of the transport.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The primary motivation is to overcome the computational inefficiency of training Continuous Normalizing Flows (CNFs) using Maximum Likelihood Estimation (MLE). Standard CNF training requires backpropagating through numerical ODE solvers, which is costly and limits scalability.</p>
<p>Additionally, while score-based diffusion models (SDEs) have achieved high sample quality, they theoretically require infinite time integration and rely on specific noise schedules. The authors aim to establish a method that works strictly with Probability Flow ODEs on finite time intervals, retaining the flexibility to connect arbitrary densities without the complexity of SDEs or the cost of standard ODE adjoint methods.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Stochastic Interpolant</strong> framework:</p>
<ul>
<li><strong>Explicit Interpolant Construction</strong>: The method defines a time-dependent interpolant $x_t = I_t(x_0, x_1)$ (e.g., trigonometric interpolation) that connects samples from the base density $\rho_0$ and target $\rho_1$.</li>
<li><strong>Simulation-Free Training</strong>: The velocity field $v_t(x)$ of the probability flow is learned by minimizing a simple quadratic objective: $G(\hat{v}) = \mathbb{E}[|\hat{v}_t(x_t)|^2 - 2\partial_t x_t \cdot \hat{v}_t(x_t)]$. Because $\partial_t I_t$ is known analytically from the interpolant definition, the expectation can be estimated by sampling $(x_0, x_1, t)$ directly. This avoids ODE integration during training (ODE integration is still required at inference).</li>
<li><strong>Decoupling Path and Optimization</strong>: The choice of path (interpolant) is separated from the optimization of the velocity field. MLE methods couple the path and objective.</li>
<li><strong>Connection to Score-Based Models</strong>: The authors show that for Gaussian base densities and trigonometric interpolants, the learned velocity field is explicitly related to the score function $\nabla \log \rho_t$, providing a theoretical bridge between CNFs and diffusion models.</li>
</ul>
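The simulation-free property follows directly from the objective: one Monte Carlo estimate needs only samples $(x_0, x_1, t)$ and the analytic $\partial_t I_t$. A minimal NumPy sketch with toy densities and a stand-in linear model (my own illustration, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(0)

# One Monte Carlo estimate of G(v) = E[|v_t(x_t)|^2 - 2 dI_t/dt . v_t(x_t)]
# with the trigonometric interpolant I_t = cos(pi t/2) x0 + sin(pi t/2) x1.
def interp(t, x0, x1):
    return np.cos(np.pi * t / 2)[:, None] * x0 + np.sin(np.pi * t / 2)[:, None] * x1

def dinterp_dt(t, x0, x1):
    # analytic time derivative of the interpolant -- no ODE solve needed
    return (np.pi / 2) * (-np.sin(np.pi * t / 2)[:, None] * x0
                          + np.cos(np.pi * t / 2)[:, None] * x1)

def v_model(x, t, theta):
    return theta * x  # toy stand-in for the velocity network

def loss_batch(theta, n=4096, d=2):
    x0 = rng.standard_normal((n, d))         # base samples from rho_0
    x1 = rng.standard_normal((n, d)) + 3.0   # toy "data" samples from rho_1
    t = rng.uniform(0.0, 1.0, n)
    xt = interp(t, x0, x1)
    v = v_model(xt, t, theta)
    return np.mean(np.sum(v ** 2 - 2 * dinterp_dt(t, x0, x1) * v, axis=1))

print(loss_batch(0.5))
```

Each training step costs one batch of interpolant evaluations, which is why the per-epoch cost stays constant as the learned dynamics grow more complex.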
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors performed validation across synthetic, tabular, and image domains:</p>
<ul>
<li><strong>2D Density Estimation</strong>: Benchmarked on &ldquo;Checkerboard&rdquo;, &ldquo;8 Gaussians&rdquo;, and anisotropic curved densities to visualize mode coverage and transport smoothness.</li>
<li><strong>High-Dimensional Tabular Data</strong>: Evaluated on standard benchmarks (POWER, GAS, HEPMASS, MINIBOONE, BSDS300) comparing Negative Log Likelihood (NLL) against FFJORD, OT-Flow, and others.</li>
<li><strong>Image Generation</strong>: Trained models on CIFAR-10 ($32 \times 32$), ImageNet ($32 \times 32$), and Oxford Flowers ($128 \times 128$) to test scalability.</li>
<li><strong>Ablations</strong>: Investigated optimizing the interpolant path itself (e.g., learning Fourier coefficients for the path) to approach optimal transport and minimize path length.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Performance</strong>: The method matches or supersedes conventional ODE flows (like FFJORD) in terms of NLL while being significantly cheaper to train.</li>
<li><strong>Efficiency</strong>: The training cost per epoch is constant (simulation-free), whereas MLE-based ODE methods see growing costs as the dynamics become more complex.</li>
<li><strong>Scalability</strong>: The method successfully scales to $128 \times 128$ resolution on a single GPU, a resolution that prior ab-initio ODE flows had not demonstrated.</li>
<li><strong>Flexibility</strong>: The framework can connect <em>any</em> two arbitrary densities (e.g., connecting two different complex 2D distributions) without needing one to be Gaussian.</li>
<li><strong>Optimal Transport</strong>: For a fixed interpolant, minimizing $G(\hat{v})$ over the velocity field recovers the velocity for that specific path. Additionally optimizing over the interpolant family yields a solution to the Benamou-Brenier optimal transport problem.</li>
<li><strong>Limitations</strong>: The authors acknowledge that image FID scores trail dedicated diffusion models, noting that InterFlow was not optimized with standard training tricks such as exponential moving averages, truncation, or learning rate warm-ups. The framework&rsquo;s sample quality could likely improve with these additions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Tabular Datasets</strong>: POWER (6D), GAS (8D), HEPMASS (21D), MINIBOONE (43D), BSDS300 (63D).
<ul>
<li>Training points range from ~30k (MINIBOONE) to ~1.6M (POWER).</li>
</ul>
</li>
<li><strong>Image Datasets</strong>:
<ul>
<li>CIFAR-10 ($32 \times 32$, 50k training points).</li>
<li>ImageNet ($32 \times 32$, ~1.28M training points).</li>
<li>Oxford Flowers ($128 \times 128$, ~315k training points).</li>
</ul>
</li>
<li><strong>Time Sampling</strong>: Time $t$ is sampled from a Beta distribution during training (reweighting) to focus learning near the target.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Interpolant</strong>: The primary interpolant used is trigonometric: $I_t(x_0, x_1) = \cos(\frac{\pi t}{2})x_0 + \sin(\frac{\pi t}{2})x_1$.
<ul>
<li>Alternative linear interpolant: $I_t = a_t x_0 + b_t x_1$.</li>
</ul>
</li>
<li><strong>Loss Function</strong>:
$$G(\hat{v}) = \mathbb{E}_{t, x_0, x_1}[|\hat{v}_t(x_t)|^2 - 2\partial_t I_t(x_0, x_1) \cdot \hat{v}_t(x_t)]$$
<ul>
<li>The expectation is amenable to empirical estimation using batches of $x_0, x_1, t$.</li>
</ul>
</li>
<li><strong>Sampling</strong>: Numerical integration using Dormand-Prince (Runge-Kutta 4/5).</li>
<li><strong>Optimization</strong>: SGD/Adam variants used for optimization.</li>
</ul>
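Inference then reduces to a single ODE solve. A sketch of the sampling step with SciPy's adaptive RK45 solver (SciPy's "RK45" is the Dormand-Prince pair named above); the velocity field here is a hypothetical constant-drift stand-in so the endpoint is checkable, not a trained network:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sampling sketch: integrate dx/dt = v_t(x) from t=0 to t=1.
# `velocity` is a hypothetical stand-in for the trained velocity network.
def velocity(t, x):
    return 3.0 * np.ones_like(x)   # constant drift (toy field)

x0 = np.random.default_rng(1).standard_normal(2)   # draw from the base rho_0
sol = solve_ivp(velocity, (0.0, 1.0), x0, method="RK45", rtol=1e-6, atol=1e-8)
sample = sol.y[:, -1]   # approximate sample from rho_1
print(sample - x0)      # the constant field shifts every coordinate by 3
```

With a real network the call is identical; only `velocity` changes.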
<h3 id="models">Models</h3>
<ul>
<li><strong>Tabular Architectures</strong>:
<ul>
<li>Feed-forward networks with 4-5 hidden layers.</li>
<li>Hidden widths: 512 (POWER, GAS, HEPMASS, MINIBOONE) or 1024 (BSDS300).</li>
<li>Activation: ReLU (general) or ELU (BSDS300).</li>
</ul>
</li>
<li><strong>Image Architectures</strong>:
<ul>
<li>U-Net based on the DDPM implementation.</li>
<li>Dimensions: 256 hidden dimension.</li>
<li>Sinusoidal time embeddings used.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Negative Log Likelihood (NLL) in nats (tabular) or bits per dim (images), Frechet Inception Distance (FID) for images.</li>
<li><strong>Baselines</strong>: FFJORD, Glow, Real NVP, OT-Flow, ScoreFlow, DDPM.</li>
</ul>
<p><strong>Tabular NLL</strong> (nats, lower is better; Table 2 Left):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>POWER</th>
          <th>GAS</th>
          <th>HEPMASS</th>
          <th>MINIBOONE</th>
          <th>BSDS300</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MADE</td>
          <td>3.08</td>
          <td>-3.56</td>
          <td>20.98</td>
          <td>15.59</td>
          <td>-148.85</td>
      </tr>
      <tr>
          <td>Real NVP</td>
          <td>-0.17</td>
          <td>-8.33</td>
          <td>18.71</td>
          <td>13.55</td>
          <td>-153.28</td>
      </tr>
      <tr>
          <td>Glow</td>
          <td>-0.17</td>
          <td>-8.15</td>
          <td>18.92</td>
          <td>11.35</td>
          <td>-155.07</td>
      </tr>
      <tr>
          <td>CPF</td>
          <td>-0.52</td>
          <td>-10.36</td>
          <td>16.93</td>
          <td>10.58</td>
          <td>-154.99</td>
      </tr>
      <tr>
          <td>NSP</td>
          <td>-0.64</td>
          <td>-13.09</td>
          <td>14.75</td>
          <td>9.67</td>
          <td>-157.54</td>
      </tr>
      <tr>
          <td>FFJORD</td>
          <td>-0.46</td>
          <td>-8.59</td>
          <td>14.92</td>
          <td>10.43</td>
          <td>-157.40</td>
      </tr>
      <tr>
          <td>OT-Flow</td>
          <td>-0.30</td>
          <td>-9.20</td>
          <td>17.32</td>
          <td>10.55</td>
          <td>-154.20</td>
      </tr>
      <tr>
          <td><strong>Ours</strong></td>
          <td><strong>-0.57</strong></td>
          <td><strong>-12.35</strong></td>
          <td><strong>14.85</strong></td>
          <td><strong>10.42</strong></td>
          <td><strong>-156.22</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation NLL and FID</strong> (Table 2 Right; NLL in bits per dim, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>CIFAR-10 NLL</th>
          <th>CIFAR-10 FID</th>
          <th>ImageNet-32 NLL</th>
          <th>ImageNet-32 FID</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FFJORD</td>
          <td>3.40</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Glow</td>
          <td>3.35</td>
          <td>-</td>
          <td>4.09</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DDPM</td>
          <td>≤3.75</td>
          <td>3.17</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DDPM++ (Song et al., 2021)</td>
          <td>≤3.37</td>
          <td>2.90</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ScoreSDE (Song et al., 2021)</td>
          <td>2.99</td>
          <td>2.92</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>VDM</td>
          <td>≤2.65</td>
          <td>7.41</td>
          <td>≤3.72</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Soft Truncation</td>
          <td>2.88</td>
          <td>3.45</td>
          <td>3.85</td>
          <td>8.42</td>
      </tr>
      <tr>
          <td>ScoreFlow</td>
          <td>2.81</td>
          <td>5.40</td>
          <td>3.76</td>
          <td>10.18</td>
      </tr>
      <tr>
          <td><strong>Ours</strong></td>
          <td><strong>2.99</strong></td>
          <td><strong>10.27</strong></td>
          <td><strong>3.48</strong></td>
          <td><strong>8.49</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: DDPM++ is from Song et al. (2021), the same work as ScoreSDE (it is the architecture optimized for VP/sub-VP SDEs). InterFlow matches ScoreSDE on CIFAR-10 NLL (2.99 bits per dim) while being simulation-free. FID is weaker than dedicated image models (10.27 vs 2.92 for ScoreSDE), reflecting the paper&rsquo;s primary focus on tractable likelihood rather than sample quality.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: All models were trained on a single NVIDIA A100 GPU.</li>
<li><strong>Training Time</strong>:
<ul>
<li>Tabular: $10^5$ steps.</li>
<li>Images: $1.5 \times 10^5$ to $6 \times 10^5$ steps.</li>
<li>Speedup: ~400x faster training than FFJORD on the MINIBOONE dataset.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>lucidrains/denoising-diffusion-pytorch (link defunct)</td>
          <td>Code</td>
          <td>MIT</td>
          <td>Base U-Net architecture used for image experiments; original GitHub account no longer available</td>
      </tr>
  </tbody>
</table>
<p>No official code release accompanies this paper. All tabular datasets (POWER, GAS, HEPMASS, MINIBOONE, BSDS300) are publicly available from prior work. CIFAR-10 and ImageNet are standard public benchmarks. Oxford Flowers 102 is also publicly available. Hyperparameters and architectures are fully specified in Tables 3 and 4 of the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Albergo, M. S., &amp; Vanden-Eijnden, E. (2023). Building Normalizing Flows with Stochastic Interpolants. <em>The Eleventh International Conference on Learning Representations</em>.</p>
<p><strong>Publication</strong>: ICLR 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{albergoBuildingNormalizingFlows2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Building {{Normalizing Flows}} with {{Stochastic Interpolants}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{The {{Eleventh International Conference}} on {{Learning Representations}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Albergo, Michael Samuel and {Vanden-Eijnden}, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://openreview.net/forum?id=li7qeBbCR1t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=li7qeBbCR1t">OpenReview</a></li>
<li><a href="https://arxiv.org/abs/2209.15571">arXiv</a></li>
</ul>
]]></content:encoded></item><item><title>Translating InChI to IUPAC Names with Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/</guid><description>Sequence-to-sequence Transformer translating InChI identifiers to IUPAC names with 91% accuracy on organic compounds.</description><content:encoded><![CDATA[<h2 id="primary-contribution-a-transformer-based-method">Primary Contribution: A Transformer-Based Method</h2>
<p>This is primarily a <strong>Method</strong> paper. It adapts a specific architecture (Transformer) to a specific task (InChI-to-IUPAC translation) and evaluates its performance against both machine learning and commercial baselines. It also has a secondary <strong>Resource</strong> contribution, as the trained model and scripts are released as open-source software.</p>
<h2 id="motivation-the-bottleneck-in-algorithmic-iupac-nomenclature">Motivation: The Bottleneck in Algorithmic IUPAC Nomenclature</h2>
<p>Generating correct IUPAC names is difficult due to the comprehensive but complex rules defined by the International Union of Pure and Applied Chemistry. Commercial software generates names from structures but remains closed-source with opaque methodologies and frequent inter-package disagreements. Open identifiers like InChI and SMILES lack direct human readability. This creates a need for an open, automated method to generate informative IUPAC names from standard identifiers like InChI, which are ubiquitous in online chemical databases.</p>
<h2 id="novelty-treating-chemical-translation-as-a-character-level-sequence">Novelty: Treating Chemical Translation as a Character-Level Sequence</h2>
<p>The key novelty is treating chemical nomenclature translation as a character-level sequence-to-sequence problem using a Transformer architecture, specifically using <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> as the source language.</p>
<ul>
<li>Standard Neural Machine Translation (NMT) uses sub-word tokenization. This model processes InChI and predicts IUPAC names character-by-character.</li>
<li>It demonstrates that character-level tokenization outperforms byte-pair encoding or unigram models for this specific chemical task.</li>
<li>It uses InChI&rsquo;s standardization to avoid the canonicalization issues inherent in SMILES-based approaches.</li>
<li>The attention mechanism allows the decoder to align specific parts of the generated IUPAC name with corresponding structural features in the source InChI string, operating via the standard scaled dot-product attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$</li>
</ul>
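The attention operation above is compact in code. A minimal NumPy version of scaled dot-product attention (single head, no batching or masking, for illustration only):

```python
import numpy as np

# Minimal scaled dot-product attention, matching the formula above.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))   # 3 decoder positions, d_k = 8
K = rng.standard_normal((5, 8))   # 5 encoder (InChI) positions
V = rng.standard_normal((5, 8))
out = attention(Q, K, V)
print(out.shape)  # (3, 8)
```

The softmax weights are what the authors visualize when aligning generated name fragments to InChI substrings.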
<h2 id="methodology--experimental-validation">Methodology &amp; Experimental Validation</h2>
<ul>
<li><strong>Training:</strong> The model was trained on 10 million InChI/IUPAC pairs sampled from PubChem using a character-level objective. The model is supervised using categorical cross-entropy loss across the vocabulary of characters:
$$ \mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) $$</li>
<li><strong>Ablation Studies:</strong> The authors experimentally validated architecture choices, finding that LSTM models and sub-word tokenization (BPE) performed worse than the Transformer with character tokenization. They also optimized dropout rates.</li>
<li><strong>Performance Benchmarking:</strong> The model was evaluated on a held-out test set of 200,000 samples. Performance was quantified primarily by Whole-Name Accuracy and Normalized Edit Distance (based on the Damerau-Levenshtein distance, scaled by the maximum string length).</li>
<li><strong>Commercial Comparison:</strong> The authors compared their model against four major commercial packages (ACD/I-Labs, ChemAxon, Mestrelab, and PubChem&rsquo;s Lexichem). However, this evaluation used a highly limited test set of only 100 molecules, restricting the statistical confidence of the external baseline.</li>
<li><strong>Error Analysis:</strong> They analyzed performance across different chemical classes (organics, charged species, macrocycles, inorganics) and visualized attention coefficients to interpret model focus.</li>
</ul>
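The normalized edit-distance metric can be sketched as follows. This uses the restricted Damerau-Levenshtein (optimal string alignment) variant, which is an assumption on my part; the normalization by the longer string's length is as described above:

```python
# Sketch of the normalized edit-distance metric: Damerau-Levenshtein
# distance (restricted / optimal string alignment variant, assumed here)
# scaled by the maximum string length.
def dl_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def normalized_edit_distance(pred, ref):
    return dl_distance(pred, ref) / max(len(pred), len(ref), 1)

# one substituted locant out of 19 characters
print(normalized_edit_distance("2-methylpropan-1-ol", "2-methylpropan-2-ol"))
```

A single wrong locant thus costs only ~5% edit distance while failing whole-name accuracy entirely, which is why the two metrics are reported together.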
<h2 id="key-results-and-the-inorganic-challenge">Key Results and the Inorganic Challenge</h2>
<ul>
<li><strong>High Accuracy on Organics:</strong> The model achieved 91% whole-name accuracy on the test set, performing particularly well on organic compounds.</li>
<li><strong>Comparable to Commercial Tools:</strong> On the limited 100-molecule benchmark, the edit distance between the model&rsquo;s predictions and commercial packages (15-23%) was similar to the variation found <em>between</em> the commercial packages themselves (16-21%).</li>
<li><strong>Limitations on Inorganics:</strong> The model performed poorly on inorganic (14% accuracy) and organometallic compounds (20% accuracy). This is attributed to inherent data limitations in the standard InChI format (which deliberately disconnects metal atoms from their ligands) and low training data coverage for those classes.</li>
<li><strong>Character-Level Superiority:</strong> Character-level tokenization was found to be essential; byte-pair encoding reduced accuracy significantly.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was derived from <a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem&rsquo;s public FTP server</a> (<code>CID-SMILES.gz</code> and <code>CID-IUPAC.gz</code>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Raw</strong></td>
          <td>PubChem</td>
          <td>100M pairs</td>
          <td>Filtered for length (InChI &lt; 200 chars, IUPAC &lt; 150 chars). 132k unparseable SMILES dropped.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Subsampled</td>
          <td>10M pairs</td>
          <td>Random sample from the filtered set.</td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>Held-out</td>
          <td>10,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>Held-out</td>
          <td>200,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Tokenization</strong></td>
          <td>Vocab</td>
          <td>InChI: 66 chars<br>IUPAC: 70 chars</td>
          <td>Character-level tokenization. Spaces treated as tokens.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: OpenNMT-py 2.0.0 (using PyTorch). Training scripts and vocabularies are available as supplementary files to the original publication. Pre-trained model weights are hosted on <a href="https://doi.org/10.5281/zenodo.5081159">Zenodo</a>.</li>
<li><strong>Architecture Type</strong>: Transformer Encoder-Decoder.</li>
<li><strong>Optimization</strong>: ADAM optimizer ($\beta_1=0.9, \beta_2=0.998$).</li>
<li><strong>Learning Rate</strong>: Linear warmup over 8,000 steps to a peak of 0.0005, then decayed in proportion to the inverse square root of the step number.</li>
<li><strong>Regularization</strong>:
<ul>
<li>Dropout: 0.1 (applied to dense and attention layers).</li>
<li>Label Smoothing: Magnitude 0.1.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Teacher forcing used for both training and validation.</li>
<li><strong>Gradient Accumulation</strong>: Gradients accumulated over 4 batches before updating parameters.</li>
<li><strong>Inference</strong>: Beam search with width 10 and length penalty 1.0.</li>
</ul>
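<p>The warmup-then-decay rule is a &ldquo;Noam&rdquo;-style schedule; a minimal sketch follows, assuming only the stated 8,000-step warmup and 0.0005 peak (any other constants OpenNMT-py applies are not reproduced here).</p>

```python
import math

PEAK_LR = 5e-4   # rate reached at the end of warmup
WARMUP = 8000    # linear warmup steps

def learning_rate(step):
    """Linear warmup to PEAK_LR over WARMUP steps, then decay
    proportional to 1/sqrt(step)."""
    step = max(step, 1)
    return PEAK_LR * min(step / WARMUP, math.sqrt(WARMUP / step))
```

The two branches meet exactly at step 8,000, where the schedule attains its peak of 0.0005.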
<h3 id="models">Models</h3>
<ul>
<li><strong>Structure</strong>: 6 layers in encoder, 6 layers in decoder.</li>
<li><strong>Attention</strong>: 8 heads per attention sub-layer.</li>
<li><strong>Dimensions</strong>:
<ul>
<li>Feed-forward hidden state size: 2048.</li>
<li>Embedding vector length: 512.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Glorot&rsquo;s method.</li>
<li><strong>Position</strong>: Positional encodings added to the token embeddings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported include <strong>Whole-Name Accuracy</strong> (percentage of exact matches) and <strong>Normalized Edit Distance</strong> (Damerau-Levenshtein, scale 0-1).</p>
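<p>A minimal sketch of the normalized edit distance, assuming the restricted (optimal string alignment) Damerau-Levenshtein variant normalized by the longer string&rsquo;s length; the paper does not specify which variant it uses.</p>

```python
def normalized_edit_distance(a, b):
    """Restricted Damerau-Levenshtein distance divided by the longer
    string's length, giving a score in [0, 1] (0 = identical)."""
    if not a and not b:
        return 0.0
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n] / max(m, n)
```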
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy (All)</td>
          <td>91%</td>
          <td>N/A</td>
          <td>Test set of 200k samples.</td>
      </tr>
      <tr>
          <td>Accuracy (Inorganic)</td>
          <td>14%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Organometallic)</td>
          <td>20%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Charged)</td>
          <td>79%</td>
          <td>N/A</td>
          <td>Test set subset.</td>
      </tr>
      <tr>
          <td>Accuracy (Rajan)</td>
          <td>72%</td>
          <td>N/A</td>
          <td>Comparative ML model (STOUT).</td>
      </tr>
      <tr>
          <td>Edit Dist (Organic)</td>
          <td>$0.02 \pm 0.03$</td>
          <td>N/A</td>
          <td>Very high similarity for organics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Inorganic)</td>
          <td>$0.32 \pm 0.20$</td>
          <td>N/A</td>
          <td>Poor performance on inorganics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Organometallic)</td>
          <td>$0.37 \pm 0.24$</td>
          <td>N/A</td>
          <td>Poor performance on organometallics.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla K80.</li>
<li><strong>Training Time</strong>: 7 days.</li>
<li><strong>Throughput</strong>: ~6000 tokens/sec (InChI) and ~3800 tokens/sec (IUPAC).</li>
<li><strong>Batch Size</strong>: 4096 tokens (approx. 30 compounds).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5081159">InChI to IUPAC model</a></td>
          <td>Model</td>
          <td>CC BY 4.0</td>
          <td>Pre-trained Transformer weights (551 MB), requires OpenNMT-py 2.0.0</td>
      </tr>
      <tr>
          <td><a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem FTP</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source data: CID-SMILES.gz and CID-IUPAC.gz</td>
      </tr>
      <tr>
          <td>Training scripts &amp; vocabularies</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Included as supplementary files with the publication</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Handsel, J., Matthews, B., Knight, N. J., &amp; Coles, S. J. (2021). Translating the InChI: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier. <em>Journal of Cheminformatics</em>, 13(1), 79. <a href="https://doi.org/10.1186/s13321-021-00535-x">https://doi.org/10.1186/s13321-021-00535-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{handselTranslatingInChIAdapting2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Translating the {{InChI}}: Adapting Neural Machine Translation to Predict {{IUPAC}} Names from a Chemical Identifier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Translating the {{InChI}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Handsel, Jennifer and Matthews, Brian and Knight, Nicola J. and Coles, Simon J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00535-x}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine&#39;s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91\%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention,GPU,InChI,IUPAC,seq2seq,Transformer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Struct2IUPAC: Translating SMILES to IUPAC via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/</guid><description>A Transformer-based model for translating between SMILES strings and IUPAC names, trained on 47M PubChem examples, achieving 98.9% accuracy with verification.</description><content:encoded><![CDATA[<h2 id="struct2iupac-as-a-methodological-shift">Struct2IUPAC as a Methodological Shift</h2>
<p>This is primarily a <strong>Method</strong> paper with significant elements of <strong>Position</strong>.</p>
<ul>
<li><strong>Method</strong>: The authors propose a specific neural architecture (Transformer with custom tokenization) and a verification pipeline (round-trip check) to solve the SMILES $\leftrightarrow$ IUPAC translation task. They rigorously benchmark this against rule-based baselines (OPSIN).</li>
<li><strong>Position</strong>: The authors explicitly argue for a paradigm shift, suggesting that &ldquo;heavy&rdquo; neural architectures should replace complex, costly rule-based legacy systems even for &ldquo;exact&rdquo; algorithmic tasks.</li>
</ul>
<h2 id="the-cost-of-rule-based-chemical-naming">The Cost of Rule-Based Chemical Naming</h2>
<ul>
<li><strong>Complexity of Naming</strong>: Generating IUPAC names manually is error-prone and requires deep algorithmic knowledge.</li>
<li><strong>Lack of Open Source Tools</strong>: While open-source tools exist for Name-to-Structure (e.g., OPSIN), there were no open-source tools for the inverse &ldquo;Structure-to-Name&rdquo; conversion at the time of writing.</li>
<li><strong>Cost of Development</strong>: Developing rule-based converters &ldquo;from scratch&rdquo; is prohibitively expensive and time-consuming compared to training a neural model on existing data.</li>
</ul>
<h2 id="struct2iupac-core-innovation">Struct2IUPAC Core Innovation</h2>
<ul>
<li><strong>Struct2IUPAC</strong>: The first effective open-source neural model for <a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">converting SMILES to IUPAC names</a>, treating chemical translation as a Neural Machine Translation (NMT) problem.</li>
<li><strong>Verification Loop</strong>: A novel inference pipeline that generates multiple candidates via beam search and validates them using a reverse converter (OPSIN) to ensure the generated name maps back to the original structure.</li>
<li><strong>Custom Tokenization</strong>: A manually curated rule-based tokenizer for IUPAC names that handles specific chemical suffixes, prefixes, and stereochemical markers.</li>
</ul>
<h2 id="experimental-setup-and-stress-testing">Experimental Setup and Stress Testing</h2>
<ul>
<li><strong>Accuracy Benchmarking</strong>: The models were tested on a held-out subset of 100,000 molecules from PubChem. The authors measured accuracy across different beam sizes (1, 3, 5).</li>
<li><strong>Comparison to Rules</strong>: The neural IUPAC2Struct model was compared directly against the rule-based OPSIN tool.</li>
<li><strong>Stress Testing</strong>:
<ul>
<li><strong>Sequence Length</strong>: Evaluated performance across varying token lengths, identifying a &ldquo;sweet spot&rdquo; (10-60 tokens) and failure modes for very short (e.g., methane) or long molecules.</li>
<li><strong>Stereochemistry</strong>: Tested on &ldquo;stereo-dense&rdquo; compounds. The authors define a &ldquo;stereo-density&rdquo; index ($I$) as the ratio of stereocenters ($S$) to total tokens ($N$):
$$I = \frac{S}{N}$$
They observed a performance drop for these dense molecules, though the model still handled many stereocenters robustly.</li>
<li><strong>Tautomers</strong>: Verified the model&rsquo;s ability to handle different tautomeric forms (e.g., Guanine and Uracil variants).</li>
</ul>
</li>
<li><strong>Latency Analysis</strong>: Benchmarked inference speeds on CPU vs. GPU relative to output sequence length.</li>
</ul>
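<p>The stereo-density index $I = S/N$ can be computed directly from a token list. Counting the SMILES chirality tags <code>@</code>/<code>@@</code> as stereocenters is my illustrative stand-in; the paper counts stereocenters on the molecule itself.</p>

```python
def stereo_density(tokens, stereo_markers=("@", "@@")):
    """Stereo-density index I = S / N: number of stereocenter markers S
    over the total token count N. Marker-counting is a stand-in for a
    proper stereocenter count."""
    s = sum(1 for t in tokens if t in stereo_markers)
    return s / len(tokens)
```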
<h2 id="benchmarks-and-outcomes">Benchmarks and Outcomes</h2>
<ul>
<li><strong>High Accuracy</strong>: The Struct2IUPAC model achieved <strong>98.9% accuracy</strong> (Beam 5 with verification). The reverse model (IUPAC2Struct) achieved <strong>99.1%</strong>, comparable to OPSIN&rsquo;s 99.4%.</li>
<li><strong>Distribution Modeling vs. Intuition</strong>: The authors claim the model infers &ldquo;chemical logic&rdquo; because it correctly generates multiple valid IUPAC names for single molecules where naming ambiguity exists (e.g., parent group selection). However, this more likely reflects the Transformer successfully modeling the high-frequency conditional probability distribution of synonymous names present in the PubChem training data, rather than learning intrinsic chemical rules.</li>
<li><strong>Production Readiness</strong>: Inference on GPU takes less than 0.5 seconds even for long names, making it viable for production use.</li>
<li><strong>Paradigm Shift</strong>: The authors conclude that neural networks are a viable, cost-effective alternative to developing rule-based algorithms for legacy notation conversion.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized the PubChem database.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total</strong></td>
          <td>PubChem</td>
          <td>~95M</td>
          <td>Filtered for RDKit compatibility</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Split A</td>
          <td>47,312,235</td>
          <td>Random 50% split</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Split B</td>
          <td>47,413,850</td>
          <td>Random 50% split</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Cleaning</strong>: Molecules that could not be processed by RDKit were removed. Molecules containing tokens not in the tokenizer (e.g., aromatic selenium) were excluded.</li>
<li><strong>Availability</strong>: A subset of 100,000 test molecules is available on GitHub (<code>data/test_100000.csv</code>) and Zenodo. The full train/test splits are not explicitly provided.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SMILES</strong>: Character-based tokenization.</li>
<li><strong>IUPAC</strong>: Custom rule-based tokenizer splitting suffixes (<code>-one</code>, <code>-al</code>), prefixes (<code>-oxy</code>, <code>-di</code>), and special symbols (<code>(</code>, <code>)</code>, <code>R(S)</code>).</li>
</ul>
</li>
<li><strong>Verification Step</strong>:
<ol>
<li>Generate $N$ names using Beam Search ($N=5$).</li>
<li>Reverse translate the candidate name using OPSIN.</li>
<li>Check if the OPSIN structure matches the original input SMILES.</li>
<li>Display the first verified match; otherwise, report failure.</li>
</ol>
</li>
</ul>
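<p>The four-step verification loop above can be sketched as follows. The three callables are stand-ins: <code>generate_candidates</code> plays the neural Struct2IUPAC model, <code>name_to_smiles</code> plays OPSIN, and <code>canonicalize</code> would be an RDKit canonicalizer; none of these names come from the paper&rsquo;s code.</p>

```python
def verified_name(smiles, generate_candidates, name_to_smiles, canonicalize, beam=5):
    """Round-trip verification: return the first beam-search candidate
    whose back-translation matches the input structure, else None."""
    target = canonicalize(smiles)
    for name in generate_candidates(smiles, beam):
        back = name_to_smiles(name)          # OPSIN's role
        if back is not None and canonicalize(back) == target:
            return name
    return None
```

With stub callables this reproduces the control flow: an unverifiable first candidate is skipped, and failure is reported only when no candidate round-trips.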
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Standard Transformer with 6 encoder layers and 6 decoder layers.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Attention Heads: 8</li>
<li>Attention Dimension ($d_{\text{model}}$): 512</li>
<li>Feed-Forward Dimension ($d_{\text{ff}}$): 2048</li>
</ul>
</li>
<li><strong>Training Objective</strong>: The models were trained using standard autoregressive cross-entropy loss over the target token sequence $y$ given the input string $x$:
$$\mathcal{L} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x)$$</li>
<li><strong>Training</strong>: Two separate models were trained: <code>Struct2IUPAC</code> (SMILES $\to$ IUPAC) and <code>IUPAC2Struct</code> (IUPAC $\to$ SMILES).</li>
<li><strong>Availability</strong>: Code for model architecture is provided in the GitHub repository. Pre-trained weights for the IUPAC2Struct model are available, but the Struct2IUPAC model weights are not publicly released, meaning researchers would need to retrain that model on their own PubChem data to reproduce those results.</li>
</ul>
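<p>Under teacher forcing, the objective above reduces to summing the negative log-probabilities the model assigns to each ground-truth token; a minimal sketch:</p>

```python
import math

def sequence_nll(token_probs):
    """Autoregressive cross-entropy L = -sum_t log P(y_t | y_<t, x),
    where token_probs holds the model's probability for each
    ground-truth target token in order."""
    return -sum(math.log(p) for p in token_probs)
```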
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a random subset of 100,000 molecules from the test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Beam Size</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>1</td>
          <td>96.1%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>5</td>
          <td>98.9%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>1</td>
          <td>96.6%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>5</td>
          <td>99.1%</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Robustness</strong>: Accuracy drops significantly for augmented (non-canonical) SMILES (37.16%) and stereo-enriched compounds (66.52%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Infrastructure</strong>: 4 $\times$ Tesla V100 GPUs and 36 CPUs.</li>
<li><strong>Training Time</strong>: Approximately 10 days under full load.</li>
<li><strong>Inference Speed</strong>: &lt;0.5 s per molecule on GPU; latency scales linearly with the output token length.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sergsb/IUPAC2Struct">IUPAC2Struct (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Transformer code and pre-trained IUPAC2Struct model</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4280814">Test data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>100k test molecules, OPSIN failure cases, model failure cases</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/smiles2iupac">Struct2IUPAC web demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online interface for SMILES to IUPAC conversion</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, L., Khokhlov, I., Fedorov, M. V., &amp; Sosnin, S. (2021). Transformer-based artificial neural networks for the conversion between chemical notations. <em>Scientific Reports</em>, 11(1), 14798. <a href="https://doi.org/10.1038/s41598-021-94082-y">https://doi.org/10.1038/s41598-021-94082-y</a></p>
<p><strong>Publication</strong>: Scientific Reports 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovTransformerbasedArtificialNeural2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Transformer-Based Artificial Neural Networks for the Conversion between Chemical Notations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Lev and Khokhlov, Ivan and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14798}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-021-94082-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/sergsb/IUPAC2Struct">GitHub Repository</a></li>
<li><a href="https://app.syntelly.com/smiles2iupac">Web Demo</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT: SMILES to IUPAC Names via Neural Machine Translation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout/</guid><description>A deep-learning neural machine translation approach to translate between SMILES strings and IUPAC names using the STOUT model.</description><content:encoded><![CDATA[<h2 id="contribution-translating-chemistry-as-a-language">Contribution: Translating Chemistry as a Language</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong secondary contribution as a <strong>Resource</strong> paper.</p>
<ul>
<li><strong>Method</strong>: It proposes a neural machine translation (NMT) architecture to approximate the complex, rule-based algorithm of IUPAC naming, treating it as a language translation task.</li>
<li><strong>Resource</strong>: It provides an open-source tool and trained models to the community, addressing a gap where such functionality was previously limited to proprietary software.</li>
</ul>
<h2 id="motivation-democratizing-iupac-nomenclature">Motivation: Democratizing IUPAC Nomenclature</h2>
<p>The International Union of Pure and Applied Chemistry (IUPAC) naming scheme is universally accepted but algorithmically complex. Generating these names correctly is challenging for humans, and automated generation is largely missing from major open-source toolkits like CDK, RDKit, or Open Babel. While reliable commercial tools exist (e.g., ChemAxon&rsquo;s <code>molconvert</code>), there was a lack of open-source alternatives for the scientific community. STOUT aims to fill this gap using a data-driven approach.</p>
<h2 id="core-innovation-sequence-to-sequence-naming">Core Innovation: Sequence-to-Sequence Naming</h2>
<ul>
<li><strong>Language Translation Approach</strong>: The authors treat chemical representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>/<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) and IUPAC names as two different languages, applying Neural Machine Translation (NMT) to translate between them.</li>
<li><strong>Use of SELFIES</strong>: The work establishes SELFIES (Self-Referencing Embedded Strings) as a robust choice over SMILES for deep learning tokenization in this specific task, capitalizing on its syntactic robustness.</li>
<li><strong>Hardware Acceleration</strong>: The paper benchmarks GPU versus TPU training and highlights the practical necessity of Tensor Processing Units (TPUs) for training large-scale chemical language models, reducing training time by an order of magnitude.</li>
</ul>
<h2 id="methodology--translation-validation">Methodology &amp; Translation Validation</h2>
<ul>
<li><strong>Data Scale</strong>: The model was trained on datasets of 30 million and 60 million molecules derived from PubChem.</li>
<li><strong>Hardware Benchmarking</strong>: Training efficiency was compared between an nVidia Tesla V100 GPU and Google TPU v3-8/v3-32 units.</li>
<li><strong>Bidirectional Translation</strong>: The system was tested on two distinct tasks:
<ol>
<li><strong>Forward</strong>: SELFIES → IUPAC names</li>
<li><strong>Reverse</strong>: IUPAC names → SELFIES</li>
</ol>
</li>
<li><strong>Validation</strong>: Performance was evaluated on a held-out test set of 2.2 million molecules.</li>
</ul>
<h2 id="translation-accuracy--hardware-scaling">Translation Accuracy &amp; Hardware Scaling</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved an average BLEU score of ~90% and a Tanimoto similarity index &gt; 0.9 for both translation directions.</li>
<li><strong>Generalization</strong>: Even when predictions were textually mismatched (low BLEU score), the underlying chemical structures often remained highly similar (high Tanimoto similarity), suggesting the system captures fundamental chemical semantics rather than merely memorizing strings.</li>
<li><strong>Impact of Data Size</strong>: Expanding training from 30 million to 60 million molecules yielded consistent performance gains without saturating.</li>
<li><strong>Hardware Necessity</strong>: Training on TPUs proved up to 54 times faster than a standard GPU baseline (Tesla V100), making large-scale training computationally tractable.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Current repo hosts STOUT V2.0 transformer models; V1 RNN code available in earlier commits</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Public Domain</td>
          <td style="text-align: left">Source of 111M molecules; 30M/60M training subsets not directly provided</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The dataset was curated from PubChem (111 million molecules). Note that the specific 30M and 60M subsets are not directly linked in the publication repository, which means a user would have to reconstruct the filtering process.</p>
<p><strong>Preprocessing &amp; Filtering</strong>:</p>
<ul>
<li>Explicit hydrogens removed; converted to canonical SMILES.</li>
<li><strong>Filtering Rules</strong>: MW &lt; 1500 Da, no counter ions, limited element set (C, H, O, N, P, S, F, Cl, Br, I, Se, B), no hydrogen isotopes, 3-40 bonds, no charged groups.</li>
<li><strong>Ground Truth Generation</strong>: ChemAxon&rsquo;s <code>molconvert</code> (Marvin Suite 20.15) was used to generate target IUPAC names for training.</li>
<li><strong>Representation</strong>: All SMILES were converted to SELFIES for training.</li>
</ul>
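<p>The published filtering rules can be expressed as one predicate. The sketch below applies them to a plain dict of precomputed properties; in practice these would be derived from an RDKit <code>Mol</code>, and the dict keys are my invention.</p>

```python
# Element set permitted by the STOUT filtering rules.
ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_filters(mol):
    """Apply the paper's filtering rules to a dict of precomputed
    molecular properties (a stand-in for a real Mol object)."""
    return (mol["mw"] < 1500                              # MW < 1500 Da
            and not mol["has_counter_ion"]
            and set(mol["elements"]) <= ALLOWED_ELEMENTS
            and not mol["has_h_isotopes"]
            and 3 <= mol["n_bonds"] <= 40
            and not mol["has_charged_groups"])
```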
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">PubChem Filtered</td>
          <td style="text-align: left">30M &amp; 60M</td>
          <td style="text-align: left">Two distinct training sets created.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">PubChem Held-out</td>
          <td style="text-align: left">2.2M</td>
          <td style="text-align: left">Molecules not present in training sets; uniform token frequency.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SELFIES</strong>: Split iteratively by brackets <code>[</code> and <code>]</code>.</li>
<li><strong>IUPAC</strong>: Split via punctuation (<code>(</code>, <code>)</code>, <code>{</code>, <code>}</code>, <code>[</code>, <code>]</code>, <code>-</code>, <code>.</code>, <code>,</code>) and a discrete set of sub-word chemical morphemes (e.g., <code>methyl</code>, <code>benzene</code>, <code>fluoro</code>).</li>
<li><strong>Padding</strong>: SELFIES padded to 48 tokens; IUPAC padded to 78 tokens. Start and end markers are appended to every sequence.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a learning rate of $0.0005$.</li>
<li><strong>Objective Function</strong>: Sparse categorical cross-entropy over the vocabulary $V$, with one-hot target $y$ and predicted distribution $\hat{y}$ for each token:
$$ \mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i) $$</li>
</ul>
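<p>The two tokenization rules can be sketched with regular expressions. The morpheme list here is a tiny toy subset; STOUT&rsquo;s actual sub-word inventory is much larger, and the fallback patterns are my assumptions.</p>

```python
import re

def tokenize_selfies(s):
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", s)

# Toy morpheme inventory for illustration only.
MORPHEMES = ["methyl", "benzene", "fluoro"]
PUNCT = r"[(){}\[\]\-.,]"

def tokenize_iupac(name):
    """Split an IUPAC name on punctuation, known chemical morphemes,
    locant numbers, and (as a fallback) runs of letters."""
    pattern = "|".join(map(re.escape, MORPHEMES)) + "|" + PUNCT + r"|\d+|[a-z]+"
    return re.findall(pattern, name)
```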
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder sequence-to-sequence network with a Bahdanau attention mechanism for context weighting.</li>
<li><strong>Components</strong>:
<ul>
<li><strong>Encoder/Decoder</strong>: Recurrent Neural Networks (RNN) constructed using Gated Recurrent Units (GRU).</li>
<li><strong>Attention</strong>: Bahdanau (additive) soft attention, which computes alignment scores between the previous decoder state $s_{t-1}$ and each encoder hidden state $h_j$ to softly weight the encoder states:
$$ e_{tj} = v_a^\top \tanh(W_a s_{t-1} + U_a h_j) $$</li>
<li><strong>Embedding</strong>: Decoder output passes through a continuous embedding layer before being concatenated with the attention context vector.</li>
</ul>
</li>
<li><strong>Implementation</strong>: Python 3 backend using TensorFlow 2.3.0. <em>Note: The linked GitHub repository currently defaults to the STOUT V2.0 transformer models, so researchers aiming to reproduce this specific V1 RNN paper should reference the older tag/commit history.</em></li>
</ul>
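<p>The additive scoring above can be sketched in NumPy (shapes and dimension names are illustrative; the paper's implementation uses TensorFlow layers):</p>

```python
import numpy as np

def bahdanau_attention(s_prev, H, W_a, U_a, v_a):
    """e_tj = v_a^T tanh(W_a s_{t-1} + U_a h_j), softly weighting encoder states.

    s_prev: (d_s,)   previous decoder hidden state s_{t-1}
    H:      (T, d_h) encoder hidden states h_1..h_T
    """
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a  # (T,) alignment scores e_tj
    a = np.exp(e - e.max())
    a /= a.sum()                                   # softmax -> attention weights
    context = a @ H                                # (d_h,) context vector
    return a, context

rng = np.random.default_rng(0)
a, ctx = bahdanau_attention(rng.normal(size=8), rng.normal(size=(5, 8)),
                            rng.normal(size=(16, 8)), rng.normal(size=(16, 8)),
                            rng.normal(size=16))
```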
<h3 id="evaluation">Evaluation</h3>
<p>The metrics cover both linguistic accuracy and cheminformatic structural correctness:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Details</th>
          <th style="text-align: left">Result (60M Model)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU Score</strong></td>
          <td style="text-align: left">NLTK sentence BLEU (unigram to 4-gram)</td>
          <td style="text-align: left">0.94 (IUPAC $\to$ SELFIES)</td>
          <td style="text-align: left">Exact text overlap. Serves as a strictly syntactic proxy.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto Similarity</strong></td>
          <td style="text-align: left">PubChem fingerprints via CDK</td>
          <td style="text-align: left">0.98 (Valid IUPAC names)</td>
          <td style="text-align: left">Evaluates substructure alignment over bit vectors, $T(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.</td>
      </tr>
  </tbody>
</table>
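<p>The Tanimoto computation over fingerprint bits can be sketched in pure Python; the paper uses CDK PubChem fingerprints, while the bit sets below are toy stand-ins:</p>

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """T(A, B) = |A intersect B| / |A union B| over the indices of set bits."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Toy fingerprints: indices of the bits set for each molecule.
fp_true = {1, 4, 9, 16, 25}
fp_pred = {1, 4, 9, 16, 36}
sim = tanimoto(fp_true, fp_pred)  # 4 shared bits / 6 union bits ~ 0.667
```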
<h3 id="hardware">Hardware</h3>
<p>Comparison of hardware efficiency for training large chemical language models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hardware</th>
          <th style="text-align: left">Batch Size</th>
          <th style="text-align: left">Time per Epoch (15M subset)</th>
          <th style="text-align: left">Speedup Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>GPU (Tesla V100)</strong></td>
          <td style="text-align: left">256</td>
          <td style="text-align: left">~27 hours</td>
          <td style="text-align: left">1x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-8</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~2 hours</td>
          <td style="text-align: left">13x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-32</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~0.5 hours</td>
          <td style="text-align: left">54x</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <em>Journal of Cheminformatics</em>, 13(1), 34. <a href="https://doi.org/10.1186/s13321-021-00512-4">https://doi.org/10.1186/s13321-021-00512-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTSMILESIUPAC2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{STOUT: SMILES to IUPAC Names Using Neural Machine Translation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{STOUT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00512-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-09-22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention mechanism,Chemical language,Deep neural network,DeepSMILES,IUPAC names,Neural machine translation,Recurrent neural network,SELFIES,SMILES}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">STOUT V2.0 Note</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/">Struct2IUPAC Note</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/">HandSEL Note (InChI to IUPAC)</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT V2.0: Transformer-Based SMILES to IUPAC Translation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout-v2/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout-v2/</guid><description>A Transformer-based model for translating SMILES to IUPAC names, trained on ~1 billion molecules, achieving ~0.99 BLEU score on benchmarks.</description><content:encoded><![CDATA[<h2 id="paper-contribution--methodological-scope">Paper Contribution &amp; Methodological Scope</h2>
<p><strong>Method (Primary) / Resource (Secondary)</strong></p>
<p>This paper presents a <strong>Methodological</strong> contribution by developing and validating a Transformer-based neural machine translation model (STOUT V2) for bidirectional chemical nomenclature (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> $\leftrightarrow$ IUPAC). It systematically compares this new architecture against previous RNN-based baselines (<a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT V1</a>) and performs ablation studies on tokenization strategies.</p>
<p>It also serves as a significant <strong>Resource</strong> contribution by generating a massive training dataset of nearly 1 billion SMILES-IUPAC pairs (curated via commercial Lexichem software) and releasing the resulting models and code as open-source tools for chemical naming.</p>
<h2 id="the-need-for-robust-open-source-iupac-nomenclature-rules">The Need for Robust Open-Source IUPAC Nomenclature Rules</h2>
<p>Assigning systematic IUPAC names to chemical structures requires adherence to a complex ruleset, making consistent manual assignment difficult. Deterministic, rule-based commercial tools such as OpenEye Lexichem and ChemAxon handle the task reliably, while existing open-source tools like OPSIN address only the reverse direction: parsing names into structures.</p>
<p>The previous version of STOUT (V1), based on RNNs/GRUs, achieved ~90% BLEU accuracy, with known limitations in capturing long-distance dependencies required for stereochemistry handling. This work uses the sequence-learning capabilities of Transformers combined with large-scale datasets to create a competitive open-source IUPAC naming tool.</p>
<h2 id="architectural-shift-and-billion-scale-training">Architectural Shift and Billion-Scale Training</h2>
<p>The primary advancements over previous iterations address both architecture and dataset scale:</p>
<ol>
<li><strong>Architecture Shift</strong>: Moving from an RNN-based Seq2Seq model to a <strong>Transformer-based architecture</strong> (4 layers, 8 heads), which captures intricate chemical patterns better than GRUs.</li>
<li><strong>Billion-Scale Training</strong>: Training on a dataset of nearly <strong>1 billion molecules</strong> (combining PubChem and ZINC15), significantly larger than the 60 million used for STOUT V1.</li>
<li><strong>Tokenization Strategy</strong>: Determining that <strong>character-wise tokenization</strong> for IUPAC names is superior to word-wise tokenization in terms of both accuracy and training efficiency (15% faster).</li>
</ol>
<h2 id="experimental-validation-and-scaling-limits">Experimental Validation and Scaling Limits</h2>
<p>The authors conducted three primary experiments to validate bidirectional translation (SMILES $\rightarrow$ IUPAC and IUPAC $\rightarrow$ SMILES):</p>
<ul>
<li><strong>Experiment 1 (Optimization)</strong>: Assessed the impact of dataset size (1M vs 10M vs 50M) and tokenization strategy on SMILES-to-IUPAC performance.</li>
<li><strong>Experiment 2 (Scaling)</strong>: Trained models on 110 million PubChem molecules for <strong>both</strong> forward and reverse translation tasks to test performance on longer sequences.</li>
<li><strong>Experiment 3 (Generalization)</strong>: Trained on the full ~1 billion dataset (PubChem + ZINC15) for both translation directions.</li>
<li><strong>External Validation</strong>: Benchmarked against an external dataset from ChEBI (1,485 molecules) and ChEMBL34 to test generalization to unseen data.</li>
</ul>
<p><strong>Evaluation Metrics</strong>:</p>
<ul>
<li><strong>Textual Accuracy</strong>: BLEU scores (1-4) and Exact String Match.</li>
<li><strong>Chemical Validity</strong>: Retranslation of generated names back to SMILES using OPSIN, followed by Tanimoto similarity checks (PubChem fingerprints) against the original input.</li>
</ul>
<h2 id="translation-accuracy-and-structural-validity">Translation Accuracy and Structural Validity</h2>
<ul>
<li><strong>Superior Performance</strong>: STOUT V2 achieved an average BLEU score of <strong>0.99</strong> (vs 0.94 for V1). While exact string matches varied by experiment (83-89%), the model notably achieved a perfect BLEU score (1.0) on <strong>97.49%</strong> of a specific test set where STOUT V1 only reached 66.65%.</li>
<li><strong>Structural Validity (&ldquo;Near Misses&rdquo;)</strong>: When the generated name differed from the ground-truth string, the re-generated structure often remained chemically valid. Across these divergent names, the average Tanimoto similarity between bit-vector fingerprints $A$ and $B$ was <strong>0.68</strong>, with
$$ T(A,B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert} $$
<em>Critique</em>: Note that an average Tanimoto coefficient of 0.68 typically suggests moderate structural similarity/drift, not an almost-identical &ldquo;near miss&rdquo; (which would be $&gt;0.85$). This implies the model constructs chemically related but structurally distinct outputs when it fails exact string matching.</li>
<li><strong>Tokenization</strong>: Character-level splitting for IUPAC names outperformed word-level splitting and was more computationally efficient.</li>
<li><strong>Data Imbalance &amp; Generalization</strong>: The model&rsquo;s drop in performance for sequences &gt;600 characters highlights a systemic issue in open chemical databases: long, highly complex SMILES strings are significantly underrepresented. Even billion-scale training datasets are still bound by the chemical diversity of their source material.</li>
<li><strong>Limitations</strong>:
<ul>
<li><strong>Preferred Names (PINs)</strong>: The model mimics Lexichem&rsquo;s naming conventions, generating valid IUPAC names distinct from strict <em>Preferred IUPAC Names</em> (PINs).</li>
<li><strong>Sequence Length</strong>: Performance degrades for very long SMILES (&gt;600 characters) due to scarcity in the training data.</li>
<li><strong>Algorithmic Distillation Bottleneck</strong>: Because the 1 billion training pairs were generated entirely by OpenEye&rsquo;s Lexichem, STOUT V2 acts as a knowledge distillation of that specific commercial algorithm. The model learns Lexichem’s heuristic mapping, specific dialects, and potential systematic errors, rather than deriving true nomenclature rules from first principles.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was derived from PubChem and ZINC15. Ground truth IUPAC names were generated using OpenEye Lexichem TK 2.8.1 to ensure consistency.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Exp 1)</strong></td>
          <td>PubChem Subset</td>
          <td>1M, 10M, 50M</td>
          <td>Selected via MaxMin algorithm for diversity</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 2)</strong></td>
          <td>PubChem</td>
          <td>110M</td>
          <td>Filtered for SMILES length &lt; 600</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 3)</strong></td>
          <td>PubChem + ZINC15</td>
          <td>~1 Billion</td>
          <td>999,637,326 molecules total</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChEBI</td>
          <td>1,485</td>
          <td>External validation set, non-overlapping with training</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Canonicalized, isomeric, and kekulized using RDKit (v2023.03.1).</li>
<li><strong>Formatting</strong>: Converted to TFRecord format in 100 MB chunks for TPU efficiency.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting. Atoms (e.g., &ldquo;Cl&rdquo;, &ldquo;Au&rdquo;), bonds, brackets, and digits are separate tokens.</li>
<li><strong>IUPAC Tokenization</strong>: <strong>Character-wise split</strong> was selected as the optimal strategy (treating every character as a token).</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler based on model dimensions.</li>
<li><strong>Loss Function</strong>: Trained to minimize a masked sparse categorical cross-entropy $L$ over the $N$ positions of the target sequence, where $p_{i, y_i}$ is the predicted probability of the target token $y_i$ at position $i$:
$$ L = - \sum_{i=1}^{N} m_i \log(p_{i, y_i}) $$
where $m_i \in \{0, 1\}$ masks padded positions.</li>
<li><strong>Code Availability</strong>: The <a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">main STOUT V2 repository</a> contains the inference package. The training pipeline/instructions (originally linked to a separate repo that is currently a 404) can still be found within the <a href="https://doi.org/10.5281/zenodo.6559438">Zenodo archive release</a>.</li>
</ul>
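<p>A NumPy sketch of the masked loss: the paper trains with TensorFlow, so the function below is only a minimal illustration of the masking logic, with made-up probabilities:</p>

```python
import numpy as np

def masked_sparse_ce(probs, targets, pad_id=0):
    """L = -sum_i m_i * log(p_{i, y_i}), normalized over non-padded positions.

    probs:   (N, V) predicted probabilities per sequence position
    targets: (N,)   integer target token ids y_i
    """
    mask = (targets != pad_id).astype(float)          # m_i: 0 at padding
    picked = probs[np.arange(len(targets)), targets]  # p_{i, y_i}
    return -(mask * np.log(picked)).sum() / mask.sum()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.9, 0.05, 0.05]])  # last position is padding
targets = np.array([1, 1, 0])          # token id 0 = padding
loss = masked_sparse_ce(probs, targets)
```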
<h3 id="models">Models</h3>
<p>The model follows the standard Transformer architecture from &ldquo;Attention is All You Need&rdquo; (Vaswani et al.).</p>
<ul>
<li><strong>Architecture</strong>: 4 Transformer layers (encoder/decoder stack).</li>
<li><strong>Attention</strong>: Multi-head attention with <strong>8 heads</strong>.</li>
<li><strong>Dimensions</strong>: Embedding size ($d_{model}$) = 512; Feed-forward dimension ($d_{ff}$) = 2048.</li>
<li><strong>Regularization</strong>: Dropout rate of 0.1.</li>
<li><strong>Context Window</strong>: Max input length (SMILES) = 600; Max output length (IUPAC) = 700-1000.</li>
<li><strong>Weights</strong>: Model weights for forward and reverse architectures are <a href="https://doi.org/10.5281/zenodo.13318286">available via Zenodo (v3)</a>.</li>
</ul>
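<p>The &ldquo;custom learning rate scheduler based on model dimensions&rdquo; mentioned under Algorithms is presumably the warmup schedule from Vaswani et al.; a sketch assuming that formula with $d_{model} = 512$ (the warmup step count is an assumption):</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

<p>The rate rises linearly during warmup, peaks at <code>warmup_steps</code>, then decays proportionally to the inverse square root of the step number.</p>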
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on both string similarity and chemical structural integrity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BLEU Score</strong></td>
          <td>N-gram overlap</td>
          <td>Compared predicted IUPAC string to Ground Truth.</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Accuracy</td>
          <td>Binary 1/0 check for identical strings.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Structural Similarity</td>
          <td>Predicted Name $\rightarrow$ OPSIN $\rightarrow$ SMILES $\rightarrow$ Fingerprint comparison to input.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT V2 GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Inference package (PyPI: STOUT-pypi)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13318286">Model Weights (Zenodo v3)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Forward and reverse translation weights</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6559438">Code Snapshot (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training pipeline archive</td>
      </tr>
      <tr>
          <td><a href="https://stout.decimer.ai">Web Application</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo with Ketcher, bulk submission, DECIMER integration</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was conducted entirely on Google Cloud Platform (GCP) TPUs.</p>
<ul>
<li><strong>STOUT V1</strong>: Trained on TPU v3-8.</li>
<li><strong>STOUT V2</strong>: Trained on <strong>TPU v4-128 pod slices</strong> (128 nodes).</li>
<li><strong>Large Scale (Exp 3)</strong>: Trained on <strong>TPU v4-256 pod slice</strong> (256 nodes).</li>
<li><strong>Training Time</strong>: Average of <strong>15 hours and 2 minutes per epoch</strong> for the 1 billion dataset.</li>
<li><strong>Framework</strong>: TensorFlow 2.15.0-pjrt with Keras.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2024). STOUT V2.0: SMILES to IUPAC name conversion using transformer models. <em>Journal of Cheminformatics</em>, 16(146). <a href="https://doi.org/10.1186/s13321-024-00941-x">https://doi.org/10.1186/s13321-024-00941-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTV20SMILES2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{STOUT V2}}.0: {{SMILES}} to {{IUPAC}} Name Conversion Using Transformer Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{STOUT V2}}.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{146}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00941-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://stout.decimer.ai">Web Application</a> (Includes Ketcher drawing, bulk submission, and DECIMER integration)</li>
<li><a href="https://decimer.ai">DECIMER Project</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT V1 Note</a></li>
<li><a href="https://zenodo.org/records/6559438">Zenodo Archive (Code Snapshot)</a></li>
</ul>
]]></content:encoded></item><item><title>OCSAug: Diffusion-Based Augmentation for Hand-Drawn OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</guid><description>A diffusion-based data augmentation pipeline (OCSAug) using DDPM and RePaint to improve optical chemical structure recognition on hand-drawn images.</description><content:encoded><![CDATA[<h2 id="document-taxonomy-ocsaug-as-a-novel-method">Document Taxonomy: OCSAug as a Novel Method</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>. It proposes a novel data augmentation pipeline (<strong>OCSAug</strong>) that integrates Denoising Diffusion Probabilistic Models (DDPM) and the RePaint algorithm to address the data scarcity problem in hand-drawn optical chemical structure recognition (OCSR). The contribution is validated through systematic benchmarking against existing augmentation techniques (RDKit, Randepict) and ablation studies on mask design.</p>
<h2 id="expanding-hand-drawn-training-data-for-ocsr">Expanding Hand-Drawn Training Data for OCSR</h2>
<p>A vast amount of molecular structure data exists in analog formats, such as hand-drawn diagrams in research notes or older literature. While OCSR models perform well on digitally rendered images, they struggle with hand-drawn images due to noise, varying handwriting styles, and distortions. Current datasets for hand-drawn images (e.g., DECIMER) are too small to train effective models, and existing augmentation tools (RDKit, Randepict) fail to generate sufficiently realistic hand-drawn variations.</p>
<h2 id="ocsaug-pipeline-masked-repaint-via-generative-ai">OCSAug Pipeline: Masked RePaint via Generative AI</h2>
<p>The core novelty is <strong>OCSAug</strong>, a three-phase pipeline that uses generative AI to synthesize training data:</p>
<ol>
<li><strong>DDPM + RePaint</strong>: It utilizes a DDPM to learn the distribution of hand-drawn images and the RePaint algorithm for inpainting.</li>
<li><strong>Structural Masking</strong>: It introduces <strong>vertical and horizontal stripe pattern masks</strong>. These masks selectively obscure parts of atoms or bonds, forcing the diffusion model to reconstruct them with irregular &ldquo;hand-drawn&rdquo; styles while preserving the underlying chemical topology.</li>
<li><strong>Label Transfer</strong>: Because the chemical structure is preserved during inpainting, the SMILES label from the original image is directly transferred to the augmented image, bypassing the need for re-annotation.</li>
</ol>
<h2 id="benchmarking-diffusion-augmentations-on-decimer">Benchmarking Diffusion Augmentations on DECIMER</h2>
<p>The authors evaluated OCSAug using the <strong>DECIMER dataset</strong>, specifically a &ldquo;drug-likeness&rdquo; subset filtered by Lipinski&rsquo;s and Veber&rsquo;s rules.</p>
<ul>
<li><strong>Baselines</strong>: The method was compared against <strong>RDKit</strong> (digital generation) and <strong>Randepict</strong> (rule-based augmentation).</li>
<li><strong>Models</strong>: Four recent OCSR models were fine-tuned: <strong>MolScribe</strong>, <strong>DECIMER 1.0 (I2S)</strong>, <strong>MolNexTR</strong>, and <strong>MPOCSR</strong>.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Tanimoto Similarity</strong>: To measure prediction accuracy against ground truth.</li>
<li><strong>Fréchet Inception Distance (FID)</strong>: To measure the distributional similarity between generated and real hand-drawn images.</li>
<li><strong>RMSE</strong>: To quantify pixel-level structural preservation across different mask thicknesses.</li>
</ul>
</li>
</ul>
<h2 id="improved-generalization-capabilities-and-fid-scores">Improved Generalization Capabilities and FID Scores</h2>
<ul>
<li><strong>Performance Boost</strong>: OCSAug improved recognition accuracy (Tanimoto similarity) by <strong>1.918 to 3.820 times</strong> compared to non-fine-tuned baselines (Improvement Ratio), outperforming traditional augmentation techniques such as RDKit and Randepict (1.570-3.523x).</li>
<li><strong>Data Quality</strong>: OCSAug achieved the lowest FID score (0.471) compared to Randepict (4.054) and RDKit (10.581), indicating its generated images are much closer to the real hand-drawn distribution.</li>
<li><strong>Generalization</strong>: The method showed improved generalization on a newly collected real-world dataset of 463 images from 6 volunteers.</li>
<li><strong>Resolution Mixing</strong>: Training MolScribe and MolNexTR with a mix of $128 \times 128$, $256 \times 256$, and $512 \times 512$ resolution images improved Tanimoto similarity (e.g., MolScribe from 0.585 to 0.640), though this strategy did not help I2S or MPOCSR.</li>
<li><strong>Real-World Evaluation</strong>: On a newly collected dataset of 463 hand-drawn images from 6 volunteers (88 drug compounds), the MPOCSR model fine-tuned with OCSAug achieved 0.367 exact-match accuracy (Tanimoto = 1.0), compared to 0.365 for non-augmented fine-tuning and 0.037 for no fine-tuning. While the exact-match gain is marginal, the area under the accuracy curve improved more noticeably, indicating reduced misrecognition overall.</li>
<li><strong>Limitations</strong>: The generation process is slow (3 weeks for 10k images on a single GPU). The fixed stripe masks may struggle with highly complex, non-drug-like geometries: when evaluated on the full DECIMER dataset (without drug-likeness filtering), OCSAug did not yield uniform improvements across all models.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jjjabcd/OCSAug">OCSAug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using guided-diffusion and RePaint</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6456306">DECIMER Hand-Drawn Dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY 4.0</td>
          <td>5,088 hand-drawn molecular structure images from 24 individuals</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: DECIMER dataset (hand-drawn images).</li>
<li><strong>Filtering</strong>: A &ldquo;drug-likeness&rdquo; filter was applied (Lipinski&rsquo;s rule of 5 + Veber&rsquo;s rules) along with an atom filter (C, H, O, S, F, Cl, Br, N, P only).</li>
<li><strong>Final Size</strong>: 3,194 samples, split into:
<ul>
<li><strong>Training</strong>: 2,604 samples.</li>
<li><strong>Validation</strong>: 290 samples.</li>
<li><strong>Test</strong>: 300 samples.</li>
</ul>
</li>
<li><strong>Resolution</strong>: All images resized to $256 \times 256$ pixels.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DDPM implemented using <code>guided-diffusion</code>.</li>
<li><strong>RePaint Settings</strong>:
<ul>
<li>Total time steps: 250.</li>
<li>Jump length: 10.</li>
<li>Resampling counts: 10.</li>
</ul>
</li>
<li><strong>Masking Strategy</strong>:
<ul>
<li><strong>Vertical Stripes</strong>: Obscure atom symbols to vary handwriting style.</li>
<li><strong>Horizontal Stripes</strong>: Obscure bonds to vary length/thickness/alignment.</li>
<li><strong>Optimal Thickness</strong>: A stripe thickness of <strong>4 pixels</strong> was found to be optimal for balancing diversity and structural preservation.</li>
</ul>
</li>
</ul>
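<p>The stripe masks can be sketched with NumPy; the $256 \times 256$ size and 4-pixel thickness follow the paper, while the inter-stripe spacing is an assumption:</p>

```python
import numpy as np

def stripe_mask(size=256, thickness=4, spacing=8, vertical=True):
    """Binary mask: 1 marks stripe pixels for RePaint to inpaint, 0 keeps the original."""
    mask = np.zeros((size, size), dtype=np.uint8)
    for start in range(0, size, thickness + spacing):
        if vertical:
            mask[:, start:start + thickness] = 1  # vertical stripes obscure atom symbols
        else:
            mask[start:start + thickness, :] = 1  # horizontal stripes obscure bonds
    return mask

v = stripe_mask()                # vertical-stripe mask
h = stripe_mask(vertical=False)  # horizontal-stripe mask
```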
<h3 id="models">Models</h3>
<p>The OCSR models were pretrained on PubChem (digital images) and then fine-tuned on the OCSAug dataset.</p>
<ul>
<li><strong>MolScribe</strong>: Swin Transformer encoder, Transformer decoder. Fine-tuned (all layers) for 30 epochs, batch size 16-128, LR 2e-5.</li>
<li><strong>I2S (DECIMER 1.0)</strong>: Inception V3 encoder (frozen), FC/Decoder fine-tuned. 25 epochs, batch size 64, LR 1e-5.</li>
<li><strong>MolNexTR</strong>: Dual-stream encoder (Swin + CNN). Fine-tuned (all layers) for 30 epochs, batch size 16-64, LR 2e-5.</li>
<li><strong>MPOCSR</strong>: MPViT backbone. Fine-tuned (all layers) for 25 epochs, batch size 16-32, LR 4e-5.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>
<p><strong>Metric</strong>: Improvement Ratio (IR) of Tanimoto Similarity (TS), defined as the ratio of fine-tuned to non-fine-tuned TS:</p>
<p>$$
\text{IR} = \frac{\text{TS}_{\text{finetuned}}}{\text{TS}_{\text{non-finetuned}}}
$$</p>
</li>
<li>
<p><strong>Validation</strong>: Cross-validation on the split DECIMER dataset.</p>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: NVIDIA GeForce RTX 4090.</li>
<li><strong>Training Time</strong>: DDPM training took ~6 days.</li>
<li><strong>Generation Time</strong>: Generating 2,600 augmented images took ~70 hours.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, J. H., &amp; Choi, J. (2025). OCSAug: diffusion-based optical chemical structure data augmentation for improved hand-drawn chemical structure image recognition. <em>The Journal of Supercomputing</em>, 81(8), 926.</p>
<p><strong>Publication</strong>: The Journal of Supercomputing 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jjjabcd/OCSAug">Official Repository</a></li>
<li><a href="https://zenodo.org/records/6456306">DECIMER Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimOCSAugDiffusionbasedOptical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSAug: Diffusion-Based Optical Chemical Structure Data Augmentation for Improved Hand-Drawn Chemical Structure Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSAug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kim, Jin Hyuk and Choi, Jonghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Supercomputing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{81}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{926}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11227-025-07406-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Multimodal Search in Chemical Documents and Reactions</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/shah-multimodal-search-2025/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/shah-multimodal-search-2025/</guid><description>A multimodal search engine that integrates text passages, molecular diagrams, and reaction data to enable passage-level retrieval in chemical literature.</description><content:encoded><![CDATA[<h2 id="contribution-multimodal-synthesis-retrieval">Contribution: Multimodal Synthesis Retrieval</h2>
<p>This paper represents a $\Psi_{\text{Method}}$ projection that proposes a novel architectural pipeline for indexing and searching chemical literature. The framework unifies text, molecular diagrams, and structured reaction records. It also contains a secondary $\Psi_{\text{Resource}}$ projection, providing a functional demonstration tool and curating a specific benchmark dataset for Suzuki coupling reactions.</p>
<h2 id="the-gap-in-passage-level-chemical-retrieval">The Gap in Passage-Level Chemical Retrieval</h2>
<p>Scientific literature documents chemical reactions through a combination of text and visual diagrams. Textual descriptions detail parameters like yield and reaction temperature, whereas diagrams depict the structural transformations graphically. Existing tools such as SciFinder or <a href="https://en.wikipedia.org/wiki/Reaxys">Reaxys</a> perform document-level or individual-compound retrieval and fail to explicitly link molecular figures to localized textual descriptions. This disconnect prevents researchers from retrieving a reaction diagram alongside the exact textual protocol; passage-level retrieval of synthesis protocols is needed for efficient access to complete reaction conditions.</p>
<h2 id="core-innovation-unified-multimodal-indexing">Core Innovation: Unified Multimodal Indexing</h2>
<p>The core methodological innovation is a multimodal passage-level indexing and linking pipeline.</p>
<ul>
<li><strong>Unified Indexing:</strong> The framework processes text and diagrams in parallel and directly links them into a single index structure. This architecture supports search queries utilizing raw text, discrete <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, or multimodal combinations.</li>
<li><strong>Compound-Passage Linking:</strong> The mechanism applies conflict-resolution logic linking chemical diagrams to specific text citations using two parallel heuristics:
<ol>
<li><strong>Token-based Alignment:</strong> Matching parsed diagram labels against documented text strings (e.g., &ldquo;compound 5&rdquo;) using normalized <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>.</li>
<li><strong>Fingerprint-based Alignment:</strong> Matching chemical structures against generated SMILES strings via structural <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto Similarity</a>.</li>
</ol>
</li>
<li><strong>ReactionMiner Integration:</strong> The pipeline parses and incorporates formatted reaction records (reactants, products, catalysts, quantitative yields) directly derived from segmented text passages.</li>
</ul>
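<p>The token-based alignment heuristic can be sketched with a normalized edit distance (an illustrative implementation; the casefolding and exact normalization are assumptions):</p>

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def token_alignment_score(diagram_label, text_mention):
    """Normalized similarity in [0, 1]; 1.0 is an exact match."""
    a, b = diagram_label.lower(), text_mention.lower()
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

score = token_alignment_score("compound 5", "Compound 5")  # case-insensitive match
```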
<h2 id="methodology--expert-evaluation">Methodology &amp; Expert Evaluation</h2>
<p>The authors evaluated the system utilizing a chemical case study targeting specific synthesis domains alongside qualitative expert assessment.</p>
<ul>
<li><strong>Dataset:</strong> Evaluators processed a corpus of 7 research manuscripts and 6 supplementary data documents detailing <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> reactions.</li>
<li><strong>Volume:</strong> The resulting index processed 1,282 extracted passages (indexing 538), extracted 383 unique SMILES, and logged 219 parsed reactions.</li>
<li><strong>Qualitative Evaluation:</strong> Practicing chemists developed real-world queries (such as combining the keyword &ldquo;Burke group&rdquo; with an explicit structural SMARTS pattern) to gauge retrieval capability.</li>
</ul>
<h2 id="key-findings--system-limitations">Key Findings &amp; System Limitations</h2>
<ul>
<li><strong>Diagram-to-Text Linking:</strong> The pipeline accurately paired visual molecular diagrams with structurally derived text details, permitting testers to navigate directly from a molecule query card to the exact origin passage within the source PDF.</li>
<li><strong>Contextual Insight Extraction:</strong> Chemists found the parsed reaction representations (yields, isolated catalysts) practically useful as high-level extractive summaries.</li>
<li><strong>Extrapolative Retrieval:</strong> The architecture permitted the effective retrieval of targeted chemical derivatives (such as benzo[b]thiophen-2-ylboronic acid) via structurally related input queries (dibenzothiophene).</li>
</ul>
<p>The system evaluation highlights several architectural restrictions:</p>
<ul>
<li><strong>Domain-Restricted Validation:</strong> The initial validation is entirely qualitative and bounded to the specific subclass of Suzuki coupling reactions. The evaluation omits standardized quantitative retrieval baselines (e.g., MAP, NDCG) and lacks systematic ablation data for the fusion scoring mechanism.</li>
<li><strong>Algorithmic Transparency:</strong> The multimodal query routing does not indicate which feature dominated a retrieval, obscuring whether keyword text or structural similarity drove the final ranking and limiting operator control.</li>
<li><strong>Optical Processing Brittleness:</strong> The vision inference and primitive-parsing pipelines are brittle, intermittently failing to associate text passages with correctly parsed molecular diagrams.</li>
<li><strong>Metadata Logging Incompleteness:</strong> Practicing chemists requested additional structured metadata (such as molar equivalents and mol% values) to bridge the extracted data directly into electronic lab notebooks.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.cs.rit.edu/~dprl/reactionminer-demo-landing/">ReactionMiner Demo</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Online demo landing page; source code repository not publicly linked</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source:</strong> The corpus comprises 7 research papers and 6 supplementary information documents on Suzuki coupling reactions, sourced from practicing chemists at UIUC. This evaluation dataset is internal and not publicly available.</li>
<li><strong>Preprocessing:</strong>
<ul>
<li>Source PDFs are converted to full-page raster images.</li>
<li>The system extracts page layout and raw text via <strong>PyTesseract</strong>.</li>
<li>The pipeline segments passages, prioritizing reaction-related sentences identified via product-indicative lexicons and topic modeling.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Diagram Extraction:</strong> A <strong>YOLOv8</strong> model identifies and segments molecular regions within structured PDF pages.</li>
<li><strong>Diagram Parsing:</strong> The architecture relies on <strong>ChemScraper</strong> to infer structural semantics from raw diagrams:
<ul>
<li><em>Born-digital PDFs:</em> <strong>SymbolScraper</strong> extracts vector lines and polygons directly from bounding box definitions.</li>
<li><em>Raster images:</em> The system employs the <strong>Line Segment Detector (LSD)</strong> and watershed bounding algorithms to isolate native geometric primitives.</li>
</ul>
</li>
<li><strong>Text Entity Extraction:</strong> The framework deploys <strong>ChemDataExtractor 2.0</strong> to extract explicit molecular aliases. A translation layer maps these entities to string representations via <strong>OPSIN</strong>.</li>
<li><strong>Linking Logic (Fusion Score):</strong>
<ul>
<li><strong>Text Link:</strong> The algorithm computes a normalized Levenshtein ratio between visual diagram labels and nearby text mentions.</li>
<li><strong>Structure Link:</strong> The algorithm computes the Tanimoto Similarity between 2048-bit Morgan fingerprints generated from parsed diagram structures and from text-derived SMILES queries:
$$ T(A, B) = \frac{A \cdot B}{|A|^{2} + |B|^{2} - A \cdot B} $$
where $A$ and $B$ represent the boolean bit vectors of the respective fingerprint pairs.</li>
<li><strong>Conflict Resolution Protocol:</strong> The system fuses the structural and token-based scores, keeping whichever link yields the higher similarity. At retrieval time, the candidate set is re-ranked using a hybrid of the <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a> score and the count of exact SMILES pattern hits in each passage.</li>
</ul>
</li>
</ul>
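<p>The structure-link score can be illustrated with a pure-Python bit-vector Tanimoto that mirrors the formula above (in practice the 2048-bit Morgan fingerprints would come from RDKit; short vectors are used here for clarity):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity for 0/1 bit vectors.

    For boolean vectors, A.B counts shared on-bits, while |A|^2 and |B|^2
    count the on-bits of each vector, giving A.B / (|A|^2 + |B|^2 - A.B).
    """
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    shared = sum(x & y for x, y in zip(a, b))
    denom = sum(a) + sum(b) - shared
    return shared / denom if denom else 1.0

sim = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])  # one shared bit out of three set
```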
<h3 id="models">Models</h3>
<ul>
<li><strong>Reaction Extraction Parameters:</strong> A <strong>LLaMA-3.1-8b</strong> model is fine-tuned with <strong>LoRA</strong> to emit custom tokens for reaction entities (compounds, reagents, temperatures) extracted from text sub-chunks. Exact prompts, the fine-tuning dataset, and the specific LoRA hyperparameters are omitted from the source text.</li>
<li><strong>Diagram Processing:</strong> ChemScraper incorporates a segmentation-aware multi-task neural network for low-level raster image parsing.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Search Engine Base:</strong> The authors built their indexing framework on top of <strong>PyTerrier</strong>.</li>
<li><strong>Text Feature Ranking:</strong> Keyword relevance is scored with standalone <strong>BM25</strong>.</li>
<li><strong>Structure Feature Operations:</strong> <strong>RDKit</strong> provides substructure matching and exact molecular similarity search.</li>
<li><strong>Multimodal Fusion Processing:</strong>
<ul>
<li>Candidate passages are filtered by combining structural matches (SMILES queries) with document-wide lexical relevance (BM25 scores).</li>
<li>The final fusion assigns the strongest weight to passages containing dense local clusters of exactly matched SMILES patterns.</li>
</ul>
</li>
</ul>
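<p>Absent published fusion weights, the re-ranking step can be caricatured as a linear combination (the <code>hit_weight</code> parameter is an assumption, not a value from the paper):</p>

```python
def fused_score(bm25_score, smiles_hits, hit_weight=2.0):
    """Hybrid ranking score: lexical BM25 relevance plus a bonus per
    exact SMILES pattern hit in the passage."""
    return bm25_score + hit_weight * smiles_hits

passages = [
    {"id": "p1", "bm25": 7.2, "smiles_hits": 0},
    {"id": "p2", "bm25": 5.1, "smiles_hits": 3},
]
ranked = sorted(passages,
                key=lambda p: fused_score(p["bm25"], p["smiles_hits"]),
                reverse=True)
# Passages dense in exact SMILES matches outrank purely lexical hits.
```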
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Infrastructure:</strong> The hardware and parameter requirements to host the multi-stage vision extractors (YOLOv8, ChemScraper) alongside a local 8B LLM are entirely unspecified in the paper.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shah, A. K., et al. (2025). Multimodal Search in Chemical Documents and Reactions. In <em>Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR &lsquo;25)</em>. ACM. <a href="https://doi.org/10.48550/arXiv.2502.16865">https://doi.org/10.48550/arXiv.2502.16865</a></p>
<p><strong>Publication</strong>: SIGIR &lsquo;25 (Demo Track), 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{shahMultimodalSearchChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Multimodal {{Search}} in {{Chemical Documents}} and {{Reactions}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Shah, Ayush Kumar and Dey, Abhisek and Luo, Leo and Amador, Bryan and Philippy, Patrick and Zhong, Ming and Ouyang, Siru and Friday, David Mark and Bianchi, David and Jackson, Nick and Zanibbi, Richard and Han, Jiawei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.cs.rit.edu/~dprl/reactionminer-demo-landing/">Online Demo</a> (Note: While the landing page advertises the system as open-source, the exact repository URL and installation prerequisites are omitted from the official manuscript.)</li>
</ul>
]]></content:encoded></item><item><title>MOFFlow: Flow Matching for MOF Structure Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mofflow/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mofflow/</guid><description>A Riemannian flow matching framework for generating Metal-Organic Framework structures by treating building blocks as rigid bodies.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-mofflow-architecture">Methodological Contribution: MOFFlow Architecture</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$).</p>
<p>It introduces <strong>MOFFlow</strong>, a generative architecture and training framework designed specifically for the structure prediction of Metal-Organic Frameworks (MOFs). The paper focuses on the algorithmic innovation of decomposing the problem into rigid-body assembly on a Riemannian manifold, validates this through comparison against existing baselines, and performs ablation studies to justify architectural choices. While it leverages the theory of flow matching, its primary contribution is the application-specific architecture and the handling of modular constraints.</p>
<h2 id="motivation-scaling-limits-of-atom-level-generation">Motivation: Scaling Limits of Atom-Level Generation</h2>
<p>The primary motivation is to overcome the scalability and accuracy limitations of existing methods for MOF structure prediction.</p>
<ul>
<li><strong>Computational Cost of DFT:</strong> Conventional approaches rely on <em>ab initio</em> calculations (DFT) combined with random search, which are computationally prohibitive for large, complex systems like MOFs.</li>
<li><strong>Failure of General CSP:</strong> Existing deep generative models for general Crystal Structure Prediction (CSP) operate on an atom-by-atom basis. They fail to scale to MOFs, which often contain hundreds or thousands of atoms per unit cell, and do not exploit the inherent modular nature (building blocks) of MOFs.</li>
<li><strong>Tunability:</strong> MOFs have applications in carbon capture and drug delivery due to their tunable porosity, making automated design tools valuable.</li>
</ul>
<h2 id="core-innovation-rigid-body-flow-matching-on-se3">Core Innovation: Rigid-Body Flow Matching on SE(3)</h2>
<p>MOFFlow introduces a <strong>hierarchical, rigid-body flow matching framework</strong> tailored for MOFs.</p>
<ul>
<li><strong>Rigid Body Decomposition:</strong> MOFFlow treats metal nodes and organic linkers as rigid bodies, reducing the search space from $3N$ (atoms) to $6M$ (roto-translation of $M$ blocks) compared to atom-based methods.</li>
<li><strong>Riemannian Flow Matching on $SE(3)$:</strong> It is the first end-to-end model to jointly generate block-level rotations ($SO(3)$), translations ($\mathbb{R}^3$), and lattice parameters using <a href="/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/">Riemannian flow matching</a>.</li>
<li><strong>MOFAttention:</strong> A custom attention module designed to encode the geometric relationships between building blocks, lattice parameters, and rotational constraints.</li>
<li><strong>Constraint Handling:</strong> It incorporates domain knowledge by operating on a mean-free system for translation invariance and using canonicalized coordinates for rotation invariance.</li>
</ul>
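<p>The rigid-body reduction can be made concrete: each block stores canonical atom coordinates, and the model predicts only one rotation and translation per block (a hedged sketch; the quaternion convention <code>(w, x, y, z)</code> is an assumption):</p>

```python
import math

def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    t = (2 * (y * v[2] - z * v[1]),   # t = 2 * cross(q_vec, v)
         2 * (z * v[0] - x * v[2]),
         2 * (x * v[1] - y * v[0]))
    cross_qt = (y * t[2] - z * t[1], z * t[0] - x * t[2], x * t[1] - y * t[0])
    return tuple(v[i] + w * t[i] + cross_qt[i] for i in range(3))

def place_block(coords, q, tau):
    """Pose one rigid building block: rotate its canonical coordinates by q,
    then translate by tau -- 6 DoF per block instead of 3 per atom."""
    return [tuple(c + t for c, t in zip(rotate(q, v), tau)) for v in coords]

s = math.sqrt(0.5)  # 90-degree rotation about the z-axis
block = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
posed = place_block(block, (s, 0.0, 0.0, s), (0.0, 0.0, 1.0))
```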
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors evaluated MOFFlow on structure prediction accuracy, physical property preservation, and scalability.</p>
<ul>
<li><strong>Dataset:</strong> The <strong>Boyd et al. (2019)</strong> dataset consisting of 324,426 hypothetical MOF structures, decomposed into building blocks using the <strong>MOFid</strong> algorithm. Filtered to structures with &lt;200 blocks, yielding 308,829 structures (247,066 train / 30,883 val / 30,880 test). Structures contain up to approximately 2,400 atoms per unit cell.</li>
<li><strong>Baselines:</strong>
<ul>
<li><em>Optimization-based:</em> Random Search (RS) and Evolutionary Algorithm (EA) using CrySPY and CHGNet.</li>
<li><em>Deep Learning:</em> DiffCSP (deep generative model for general crystals).</li>
<li><em>Self-Assembly:</em> A heuristic algorithm used in MOFDiff (adapted for comparison).</li>
</ul>
</li>
<li><strong>Metrics:</strong>
<ul>
<li><strong>Match Rate (MR):</strong> Percentage of generated structures matching ground truth within tolerance.</li>
<li><strong>RMSE:</strong> Root mean squared displacement normalized by average free length per atom.</li>
<li><strong>Structural Properties:</strong> Volumetric/Gravimetric Surface Area (VSA/GSA), Pore Limiting Diameter (PLD), Void Fraction, etc., calculated via Zeo++.</li>
<li><strong>Scalability:</strong> Performance vs. number of atoms and building blocks.</li>
</ul>
</li>
</ul>
<h2 id="results-and-generative-performance">Results and Generative Performance</h2>
<p>MOFFlow outperformed all baselines in accuracy and efficiency, particularly for large structures.</p>
<ul>
<li><strong>Accuracy:</strong> With a single sample, MOFFlow achieved a <strong>31.69% match rate</strong> (stol=0.5) and <strong>87.46%</strong> (stol=1.0) on the full test set (30,880 structures). With 5 samples, these rose to <strong>44.75%</strong> (stol=0.5) and <strong>100.0%</strong> (stol=1.0). RS and EA (tested on 100 and 15 samples respectively due to computational cost, generating 20 candidates each) achieved 0.00% MR at both tolerance levels. DiffCSP reached 0.09% (stol=0.5) and 23.12% (stol=1.0) with 1 sample.</li>
<li><strong>Speed:</strong> Inference took <strong>1.94 seconds</strong> per structure, compared to 5.37s for DiffCSP, 332s for RS, and 1,959s for EA.</li>
<li><strong>Scalability:</strong> MOFFlow preserved high match rates across all system sizes, while DiffCSP&rsquo;s match rate dropped sharply beyond 200 atoms.</li>
<li><strong>Property Preservation:</strong> The distributions of physical properties (e.g., surface area, void fraction) for MOFFlow-generated structures closely matched the ground truth. DiffCSP frequently reduced volumetric surface area and void fraction to zero.</li>
<li><strong>Self-Assembly Comparison:</strong> In a controlled comparison where the self-assembly (SA) algorithm received MOFFlow&rsquo;s predicted translations and lattice, MOFFlow (MR=31.69%, RMSE=0.2820) outperformed SA (MR=30.04%, RMSE=0.3084), confirming the value of the learned rotational vector fields. In an extended scalability comparison, SA scaled better for structures with many building blocks, but MOFFlow achieved higher overall match rate (31.69% vs. 27.14%).</li>
<li><strong>Batch Implementation:</strong> A refactored Batch version achieves improved results: <strong>32.73% MR</strong> (stol=0.5), RMSE of 0.2743, inference in <strong>0.19s</strong> per structure (10x faster), and training in roughly 1/3 the GPU hours.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The paper identifies three key limitations:</p>
<ol>
<li><strong>Hypothetical-only evaluation:</strong> All experiments use the Boyd et al. hypothetical database. Evaluation on more challenging real-world datasets remains needed.</li>
<li><strong>Rigid-body assumption:</strong> The model assumes that local building block structures are known, which may be impractical for rare building blocks whose structural information is missing from existing libraries or is inaccurate.</li>
<li><strong>Periodic invariance:</strong> The model is not invariant to periodic transformations of the input. Explicitly modeling periodic invariance could further improve performance.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source:</strong> MOF dataset by Boyd et al. (2019).</li>
<li><strong>Preprocessing:</strong> Structures were decomposed using the metal-oxo decomposition algorithm from <strong>MOFid</strong>.</li>
<li><strong>Filtering:</strong> Structures with fewer than 200 building blocks were used, yielding 308,829 structures.</li>
<li><strong>Splits:</strong> Train/Validation/Test ratio of 8:1:1 (247,066 / 30,883 / 30,880).</li>
<li><strong>Availability:</strong> Pre-processed dataset is available on <a href="https://zenodo.org/records/15187230">Zenodo</a>.</li>
<li><strong>Representations:</strong>
<ul>
<li><em>Atom-level:</em> Tuple $(X, a, l)$ (coordinates, types, lattice).</li>
<li><em>Block-level:</em> Tuple $(\mathcal{B}, q, \tau, l)$ (blocks, rotations, translations, lattice).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework:</strong> Riemannian Flow Matching.</li>
<li><strong>Objective:</strong> Conditional Flow Matching (CFM) loss regressing to clean data $q_1, \tau_1, l_1$.
$$
\begin{aligned}
\mathcal{L}(\theta) = \mathbb{E}_{t, \mathcal{S}^{(1)}} \left[ \frac{1}{(1-t)^2} \left( \lambda_1 |\log_{q_t}(\hat{q}_1) - \log_{q_t}(q_1)|^2 + \dots \right) \right]
\end{aligned}
$$</li>
<li><strong>Priors:</strong>
<ul>
<li>Rotations ($q$): Uniform on $SO(3)$.</li>
<li>Translations ($\tau$): Standard normal on $\mathbb{R}^3$.</li>
<li>Lattice ($l$): Log-normal for lengths, Uniform(60, 120) for angles (Niggli reduced).</li>
</ul>
</li>
<li><strong>Inference:</strong> ODE solver with <strong>50 integration steps</strong>.</li>
<li><strong>Local Coordinates:</strong> Defined using PCA axes, corrected for symmetry to ensure consistency.</li>
</ul>
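<p>For the Euclidean components (translations and lattice), 50-step inference reduces to fixed-step Euler integration of the learned vector field (a sketch; the rotational component additionally needs $SO(3)$ exponential/log maps and is omitted here):</p>

```python
def euler_integrate(x0, vector_field, steps=50):
    """Integrate dx/dt = v(x, t) from t=0 (prior sample) to t=1 (structure)
    with a fixed-step Euler solver."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = [xi + dt * vi for xi, vi in zip(x, vector_field(x, t))]
    return x

# Under the CFM objective, the target field transporting x_t toward clean
# data x1 along straight paths is v(x, t) = (x1 - x_t) / (1 - t).
x1 = [1.0, -2.0, 0.5]
out = euler_integrate([0.0, 0.0, 0.0],
                      lambda x, t: [(a - b) / (1.0 - t) for a, b in zip(x1, x)])
```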
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture:</strong> Hierarchical structure with two key modules.
<ul>
<li><strong>Atom-level Update Layers:</strong> 4-layer EGNN-like structure to encode building block features $h_m$ from atomic graphs (cutoff 5Å).</li>
<li><strong>Block-level Update Layers:</strong> 6 layers that iteratively update $q, \tau, l$ using the <strong>MOFAttention</strong> module.</li>
</ul>
</li>
<li><strong>MOFAttention:</strong> Modified Invariant Point Attention (IPA) that incorporates lattice parameters as offsets to the attention matrix.</li>
<li><strong>Hyperparameters:</strong>
<ul>
<li>Node dimension: 256 (block-level), 64 (atom-level).</li>
<li>Attention heads: 24.</li>
<li>Loss coefficients: $\lambda_1=1.0$ (rot), $\lambda_2=2.0$ (trans), $\lambda_3=0.1$ (lattice).</li>
</ul>
</li>
<li><strong>Checkpoints:</strong> Pre-trained weights and models are openly provided on <a href="https://zenodo.org/records/15187230">Zenodo</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics:</strong>
<ul>
<li><strong>Match Rate:</strong> Using <code>StructureMatcher</code> from <code>pymatgen</code>. Tolerances: <code>stol=0.5/1.0</code>, <code>ltol=0.3</code>, <code>angle_tol=10.0</code>.</li>
<li><strong>RMSE:</strong> Normalized by average free length per atom.</li>
</ul>
</li>
<li><strong>Tools:</strong> <strong>Zeo++</strong> for structural property calculations (Surface Area, Pore Diameter, etc.).</li>
</ul>
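<p>The RMSE normalization can be sketched as follows (assuming the common crystal-structure-prediction convention that the average free length per atom is $(V/N)^{1/3}$ for unit-cell volume $V$ and atom count $N$):</p>

```python
import math

def normalized_rmse(displacements, volume, n_atoms):
    """RMS displacement between matched structures, normalized by the
    average free length per atom, (V / N)^(1/3)."""
    rms = math.sqrt(sum(d * d for d in displacements) / len(displacements))
    return rms / (volume / n_atoms) ** (1.0 / 3.0)

# Example: RMS displacement 1.0 in a cell with free length 2.0 per atom.
nr = normalized_rmse([1.0, 1.0], volume=8.0, n_atoms=1)
```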
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">MOFFlow</th>
          <th style="text-align: left">DiffCSP</th>
          <th style="text-align: left">RS (20 cands)</th>
          <th style="text-align: left">EA (20 cands)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">MR (stol=0.5, k=1)</td>
          <td style="text-align: left"><strong>31.69%</strong></td>
          <td style="text-align: left">0.09%</td>
          <td style="text-align: left">0.00%</td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=1.0, k=1)</td>
          <td style="text-align: left"><strong>87.46%</strong></td>
          <td style="text-align: left">23.12%</td>
          <td style="text-align: left">0.00%</td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=0.5, k=5)</td>
          <td style="text-align: left"><strong>44.75%</strong></td>
          <td style="text-align: left">0.34%</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=1.0, k=5)</td>
          <td style="text-align: left"><strong>100.0%</strong></td>
          <td style="text-align: left">38.94%</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">RMSE (stol=0.5, k=1)</td>
          <td style="text-align: left"><strong>0.2820</strong></td>
          <td style="text-align: left">0.3961</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">Avg. time per structure</td>
          <td style="text-align: left"><strong>1.94s</strong></td>
          <td style="text-align: left">5.37s</td>
          <td style="text-align: left">332s</td>
          <td style="text-align: left">1,959s</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware:</strong> 8 $\times$ NVIDIA RTX 3090 (24GB VRAM).</li>
<li><strong>Training Time:</strong>
<ul>
<li><em>TimestepBatch version (main paper):</em> ~5 days 15 hours.</li>
<li><em>Batch version:</em> ~1 day 17 hours (332.74 GPU hours). The authors also release this refactored implementation, which achieves comparable performance with faster convergence.</li>
</ul>
</li>
<li><strong>Batch Size:</strong> 160 (capped by $N^2$ where $N$ is the number of atoms, for memory management).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/nayoung10/MOFFlow">MOFFlow (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Official implementation built on DiffDock, EGNN, MOFDiff, and protein-frame-flow</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://zenodo.org/records/15187230">Pre-processed dataset and checkpoints (Zenodo)</a></td>
          <td style="text-align: left">Dataset / Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Includes pre-processed MOF structures and trained model weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, N., Kim, S., Kim, M., Park, J., &amp; Ahn, S. (2025). MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks. <em>International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kimMOFFlowFlowMatching2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kim, Nayoung and Kim, Seongsu and Kim, Minsu and Park, Jinkyoo and Ahn, Sungsoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Thirteenth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=dNT3abOsLo}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=dNT3abOsLo">OpenReview Discussion</a></li>
<li><a href="https://github.com/nayoung10/MOFFlow">Official Code Repository</a></li>
</ul>
]]></content:encoded></item><item><title>MERMaid: Multimodal Chemical Reaction Mining from PDFs</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/mermaid/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/mermaid/</guid><description>Vision-language pipeline extracting chemical reaction data from PDF figures and tables into structured knowledge graphs with 87% accuracy.</description><content:encoded><![CDATA[<h2 id="methodological-and-resource-contributions">Methodological and Resource Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$) that introduces a novel pipeline (MERMaid) for extracting structured chemical data from unstructured PDF documents. It proposes a specific architecture combining fine-tuned vision models (VisualHeist) with vision-language models (DataRaider) and a retrieval-augmented generation system (KGWizard) to solve the problem of multimodal data ingestion.</p>
<p>Secondarily, it is a <strong>Resource</strong> paper ($\Psi_{\text{Resource}}$) as it releases the source code, prompts, and a new benchmark dataset (<strong>MERMaid-100</strong>) consisting of annotated reaction data across three chemical domains.</p>
<h2 id="the-inaccessibility-of-diagrammatic-reaction-data">The Inaccessibility of Diagrammatic Reaction Data</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A significant volume of chemical knowledge currently resides in &ldquo;print-optimized&rdquo; PDF formats, specifically within graphical elements like figures, schemes, and tables, which resist standard text mining.</li>
<li><strong>Limitations of Prior Work</strong>: Existing tools (e.g., ChemDataExtractor, <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/">OpenChemIE</a>) focus primarily on text, struggle with multimodal parsing, or lack the &ldquo;contextual awareness&rdquo; needed to interpret implicit information (e.g., &ldquo;standard conditions&rdquo; with modifications in optimization tables).</li>
<li><strong>Need for Structured Data</strong>: To enable <a href="/notes/chemistry/llm-applications/autonomous-chemical-research-coscientist/">self-driving laboratories</a> and data-driven discovery, this unstructured literature must be converted into machine-actionable formats like <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a>.</li>
</ul>
<h2 id="the-mermaid-pipeline-vision-models-and-llm-rag">The MERMaid Pipeline: Vision Models and LLM RAG</h2>
<ul>
<li><strong>VisualHeist (Fine-tuned Segmentation)</strong>: A custom fine-tuned model based on Microsoft&rsquo;s Florence-2 that accurately segments figures, captions, and footnotes, even in messy supplementary materials.</li>
<li><strong>DataRaider (Context-Aware Extraction)</strong>: A VLM-powered module (using GPT-4o) with a <strong>two-step prompt framework</strong> that performs &ldquo;self-directed context completion.&rdquo; It can infer missing reaction parameters from context and resolve footnote labels (e.g., linking &ldquo;condition a&rdquo; in a table to its footnote description).</li>
<li><strong>KGWizard (Schema-Adaptive Graph Construction)</strong>: A text-to-graph engine that uses LLMs as higher-order functions to synthesize parsers dynamically. It employs <strong>Retrieval-Augmented Generation (RAG)</strong> to check for existing nodes during creation, implicitly resolving coreferences (e.g., unifying &ldquo;MeCN&rdquo; and &ldquo;Acetonitrile&rdquo;).</li>
<li><strong>Topic-Agnostic Design</strong>: MERMaid features a flexible design that works across three distinct domains: <a href="https://en.wikipedia.org/wiki/Electrosynthesis">organic electrosynthesis</a>, <a href="https://en.wikipedia.org/wiki/Photocatalysis">photocatalysis</a>, and organic synthesis.</li>
</ul>
<h2 id="benchmarking-segmentation-and-extraction-accuracy">Benchmarking Segmentation and Extraction Accuracy</h2>
<ul>
<li><strong>Segmentation Benchmarking</strong>: The authors compared VisualHeist against OpenChemIE (LayoutParser) and PDFigCapX using a dataset of 121 PDFs from 5 publishers.</li>
<li><strong>End-to-End Extraction</strong>: Evaluated the full pipeline on <strong>MERMaid-100</strong>, a curated dataset of 100 articles across three domains (organic electrosynthesis, photocatalysis, organic synthesis).
<ul>
<li>Extraction of specific parameters (e.g., catalysts, solvents, yields) was validated using &ldquo;hard-match&rdquo; accuracy.</li>
</ul>
</li>
<li><strong>Knowledge Graph Construction</strong>: Automatically generated knowledge graphs for the three domains and assessed the structural integrity and <a href="https://en.wikipedia.org/wiki/Coreference">coreference resolution</a> accuracy.</li>
</ul>
<h2 id="end-to-end-extraction-performance">End-to-End Extraction Performance</h2>
<ul>
<li><strong>Segmentation Results</strong>: VisualHeist achieved &gt;93% F1 score across all document types (including pre-2000 papers and supplementary materials), outperforming OpenChemIE by 15-75% and PDFigCapX by 28-75% across all metrics.</li>
<li><strong>Extraction Accuracy</strong>: DataRaider achieved &gt;92% accuracy for VLM-based parameter extraction and near-unity accuracy for domain-specific reaction parameters (e.g., anode, cathode, photocatalyst).</li>
<li><strong>Graph Building</strong>: KGWizard achieved 96% accuracy in node creation and coreference resolution.</li>
<li><strong>Overall Performance</strong>: The pipeline demonstrated an 87% end-to-end overall accuracy.</li>
<li><strong>Limitations</strong>: The architecture relies heavily on closed-weight models (GPT-4o) for reasoning and graph construction, which risks future reproducibility if API snapshots are deprecated. Additionally, the system remains vulnerable to cumulative error propagation from upstream OCR/OCSR tools like <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">RxnScribe</a>.</li>
<li><strong>Availability</strong>: The authors provide a modular, extensible framework that can be adapted to other scientific domains.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data (VisualHeist)</strong>:
<ul>
<li>Dataset of <strong>3,435 figures</strong> and <strong>1,716 tables</strong> annotated from 3,518 PDF pages.</li>
<li>Includes main text, supplementary materials, and unformatted archive papers.</li>
</ul>
</li>
<li><strong>Evaluation Data (MERMaid-100)</strong>:
<ul>
<li><strong>100 PDF articles</strong> curated from three domains: organic electrosynthesis, photocatalysis, and organic synthesis.</li>
<li>Includes 104 image-caption/table-heading pairs relevant to reaction optimization.</li>
<li>Available for download at Zenodo (DOI: 10.5281/zenodo.14917752).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Two-Step Prompt Framework (DataRaider)</strong>:
<ul>
<li><em>Step 1</em>: Generic base prompt + domain keys to extract &ldquo;reaction dictionaries&rdquo; and &ldquo;footnote dictionaries&rdquo;. Uses &ldquo;fill-in-the-blank&rdquo; inference for missing details.</li>
<li><em>Step 2</em>: Safety check prompt where the VLM updates the reaction dictionary using the footnote dictionary to resolve entry-specific modifications.</li>
</ul>
</li>
<li><strong>LLM-Synthesized Parsers (KGWizard)</strong>:
<ul>
<li>Uses LLM as a function $g_{A,B}: A \times B \rightarrow (X \rightarrow Y)$ to generate Python code (parsers) dynamically based on input schema instructions.</li>
</ul>
</li>
<li><strong>RAG for Coreference</strong>:
<ul>
<li>During graph construction, the system queries the existing database for matching values (e.g., &ldquo;MeCN&rdquo;) before creating new nodes to prevent duplication.</li>
</ul>
</li>
<li><strong>Batching</strong>:
<ul>
<li>Articles processed in dynamic batch sizes (starting at 1, increasing to 30) to balance speed and redundancy checks.</li>
</ul>
</li>
</ul>
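<p>The check-before-create behavior of the RAG step can be sketched as a simple store that canonicalizes values before minting nodes. This is an illustrative sketch, not the authors&rsquo; code: the class, the hard-coded alias table, and the dict-backed store are assumptions standing in for KGWizard&rsquo;s LLM/RAG lookup against a real graph database.</p>

```python
# Hypothetical alias table; KGWizard resolves synonyms via retrieval, not a fixed dict.
ALIASES = {"mecn": "acetonitrile", "ch3cn": "acetonitrile"}

class NodeStore:
    def __init__(self):
        self.nodes = {}  # canonical value -> node id

    def _canonical(self, value):
        key = value.strip().lower()
        return ALIASES.get(key, key)

    def get_or_create(self, value):
        # Retrieval step: reuse an existing node when the canonical form matches,
        # so "MeCN" and "Acetonitrile" collapse to a single node.
        key = self._canonical(value)
        if key not in self.nodes:
            self.nodes[key] = len(self.nodes)
        return self.nodes[key]

store = NodeStore()
id_a = store.get_or_create("MeCN")
id_b = store.get_or_create("Acetonitrile")
id_c = store.get_or_create("DMF")
```

<p>Here <code>id_a</code> and <code>id_b</code> resolve to the same node, while <code>DMF</code> creates a new one; the real system performs the same dedup check during batched graph construction.</p>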
<h3 id="models">Models</h3>
<ul>
<li><strong>VisualHeist</strong>: Fine-tuned <strong>Florence-2-large</strong> (Microsoft vision foundation model).
<ul>
<li><em>Hyperparameters</em>: 12 epochs, learning rate $5 \times 10^{-6}$, batch size 4.</li>
</ul>
</li>
<li><strong>DataRaider &amp; KGWizard</strong>: <strong>GPT-4o</strong> (version <code>gpt-4o-2024-08-06</code>). Note: Requires an active OpenAI API key. The pipeline&rsquo;s long-term reproducibility is currently tied to the continued availability of this specific closed-source endpoint.</li>
<li><strong>RxnScribe</strong>: Used for <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">Optical Chemical Structure Recognition (OCSR)</a> to convert reactant/product images to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><em>Segmentation</em>: Precision, Recall, F1, Accuracy.</li>
<li><em>Caption Extraction</em>: Evaluated via <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a>, mapping predicted token sets $A$ and true token sets $B$ to a threshold condition: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \ge 0.70$$</li>
<li><em>Data Extraction</em>: Evaluated via Hard-Match accuracy, requiring exact correspondence between predicted sets ($\hat{Y}$) and ground-truth parameters ($Y$) for specific roles (e.g., anode vs. cathode): $$\text{HMA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i = \hat{y}_i]$$</li>
</ul>
</li>
<li><strong>Baselines</strong>: OpenChemIE (LayoutParser + EasyOCR) and PDFigCapX.</li>
</ul>
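<p>The two scoring rules above are straightforward to implement. The sketch below is illustrative (function names and toy token sets are mine, not the authors&rsquo; evaluation code), but it follows the stated definitions: Jaccard similarity thresholded at 0.70, and exact per-role matching for hard-match accuracy.</p>

```python
def jaccard_match(pred_tokens, true_tokens, threshold=0.70):
    """J(A, B) = |A ∩ B| / |A ∪ B|, passed if it meets the 0.70 threshold."""
    a, b = set(pred_tokens), set(true_tokens)
    j = len(a & b) / len(a | b) if (a | b) else 1.0
    return j, j >= threshold

def hard_match_accuracy(pred, true):
    """Fraction of roles whose predicted value exactly equals the ground truth."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

# Toy values for illustration only:
sim, passed = jaccard_match(["pt", "anode", "mecn"], ["pt", "anode", "acetonitrile"])
hma = hard_match_accuracy(["Pt", "RVC"], ["Pt", "graphite"])
```

<p>Note that hard-match is role-sensitive: predicting the anode material in the cathode slot scores zero for both roles, which makes the reported near-unity accuracies on electrode parameters a strict result.</p>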
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training (VisualHeist)</strong>: 2x NVLINK Nvidia RTX A6000 GPUs (48GB VRAM) + Intel Xeon w7-2495X CPU (48 cores).</li>
<li><strong>DataRaider Evaluation</strong>: 13th Gen Intel Core i7-1360P CPU (12 cores).</li>
<li><strong>Inference Costs</strong>:
<ul>
<li>DataRaider: ~$0.051 per image.</li>
<li>KGWizard: ~$0.40 per JSON.</li>
</ul>
</li>
<li><strong>Timing</strong>:
<ul>
<li>VisualHeist inference: ~4.5 seconds/image.</li>
<li>DataRaider inference: ~41.3 seconds/image.</li>
<li>KGWizard processing: ~110.6 seconds/file.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leong, S. X., Pablo-García, S., Wong, B., &amp; Aspuru-Guzik, A. (2025). MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models. <em>Matter</em>, 8(12), 102331. <a href="https://doi.org/10.1016/j.matt.2025.102331">https://doi.org/10.1016/j.matt.2025.102331</a></p>
<p><strong>Publication</strong>: Matter, 2025</p>
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/MERMaid">GitHub Repository</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (VisualHeist, DataRaider, KGWizard)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.14917752">Zenodo Data/Prompts</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>MERMaid-100 benchmark, prompts, and raw VLM responses</td>
      </tr>
  </tbody>
</table>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leong2025mermaid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leong, Shi Xuan and Pablo-Garc{\&#39;i}a, Sergio and Wong, Brandon and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Matter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{102331}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.matt.2025.102331}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InvMSAFold: Generative Inverse Folding with Potts Models</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/invmsafold/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/invmsafold/</guid><description>InvMSAFold generates diverse protein sequences from structure by predicting Potts model parameters, enabling orders-of-magnitude faster sampling.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It introduces a novel architecture, <strong>InvMSAFold</strong>, which hybridizes deep learning encoders with statistical physics-based decoders (Potts models). The rhetorical structure focuses on architectural innovation (low-rank parameter generation), ablation of speed/diversity against baselines (ESM-IF1), and algorithmic efficiency.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Standard inverse folding models (like ESM-IF1 or ProteinMPNN) solve a &ldquo;one-to-one&rdquo; mapping: given a structure, predict the <em>single</em> native sequence. However, in nature, folding is &ldquo;many-to-one&rdquo;: many homologous sequences fold into the same structure.</p>
<p>The authors identify two key gaps:</p>
<ol>
<li><strong>Lack of Diversity</strong>: Standard autoregressive models maximize probability for the ground truth sequence, often failing to capture the broad evolutionary landscape of viable homologs.</li>
<li><strong>Slow Inference</strong>: Autoregressive sampling requires a full neural network pass for <em>every amino acid</em>, making high-throughput screening (e.g., millions of candidates) computationally prohibitive.</li>
</ol>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is shifting the learning objective from predicting <em>sequences</em> to predicting <em>probability distributions</em>.</p>
<p>InvMSAFold outputs the parameters (couplings $\mathbf{J}$ and fields $\mathbf{h}$) of a <strong>Potts Model</strong> (a pairwise Markov Random Field).</p>
<ul>
<li><strong>Low-Rank Decomposition</strong>: To handle the massive parameter space of pairwise couplings ($L \times L \times q \times q$), the model predicts a low-rank approximation $\mathbf{V}$ ($L \times K \times q$), reducing complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$.</li>
<li><strong>One-Shot Generation</strong>: The deep network runs only <em>once</em> to generate the Potts parameters. Sampling sequences from this Potts model is then performed on CPU via MCMC (for the PW variant) or direct autoregressive sampling (for the AR variant), which is orders of magnitude faster than running a Transformer decoder for every step.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model on three CATH-based test sets (Inter-cluster, Intra-cluster, MSA) to test generalization at varying levels of homology.</p>
<ul>
<li><strong>Speed Benchmarking</strong>: Compared wall-clock sampling time vs. ESM-IF1 on CPU/GPU.</li>
<li><strong>Covariance Reconstruction</strong>: Checked if generated sequences recover the evolutionary correlations found in natural MSAs (Pearson correlation of covariance matrices).</li>
<li><strong>Structural Fidelity</strong>: Generated sequences with high Hamming distance from native, folded them with AlphaFold 2 (no templates), and measured RMSD to the target structure.</li>
<li><strong>Property Profiling</strong>: Analyzed the distribution of predicted solubility (Protein-Sol) and thermostability (Thermoprot) to show that sequence diversity translates into a wider range of biochemical properties.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Massive Speedup</strong>: InvMSAFold is orders of magnitude faster than ESM-IF1 (CPU vs. GPU; the comparison is not hardware-matched). Because the &ldquo;heavy lifting&rdquo; (generating Potts parameters) happens once, sampling millions of sequences becomes trivial on CPUs.</li>
<li><strong>Better Diversity</strong>: The model captures evolutionary covariances significantly better than ESM-IF1 and ProteinMPNN (whose covariance recovery is comparable to ESM-IF1&rsquo;s). A PCA-based KL-divergence analysis (lower is better; 0 means a perfect match to the natural MSA distribution) shows InvMSAFold-AR scores of $0.49$ (Inter-cluster) and $0.67$ (Intra-cluster), compared to $15.8$ and $11.9$ for ESM-IF1, demonstrating that the generated sequences occupy a distribution much closer to natural MSAs.</li>
<li><strong>Robust Folding</strong>: Sequences generated far from the native sequence (high Hamming distance) still fold into the correct structure (low RMSD), whereas ESM-IF1 struggles to produce diverse valid sequences.</li>
<li><strong>Property Expansion</strong>: The method generates a wider spread of predicted biochemical properties (solubility/thermostability), which could be useful for virtual screening in protein design.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Source</strong>: CATH database (40% non-redundant dataset).</p>
<p><strong>Splits</strong>:</p>
<ul>
<li><strong>Training</strong>: ~22k domains.</li>
<li><strong>Inter-cluster Test</strong>: 10% of sequence clusters held out (unseen clusters, many with superfamilies absent from training).</li>
<li><strong>Intra-cluster Test</strong>: Unseen domains from seen clusters.</li>
<li><strong>Augmentation</strong>: MSAs generated using <strong>MMseqs2</strong> against the Uniprot50 database. Training uses random subsamples of these MSAs ($|M_X| = 64$ for PW, $|M_X| = 32$ for AR) to teach the model evolutionary variance.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Architecture</strong>:</p>
<ul>
<li><strong>Encoder</strong>: Pre-trained <strong>ESM-IF1</strong> encoder (GVP-GNN architecture). The encoder is used to pre-compute structure embeddings, with independent Gaussian noise (std = 5% of the embedding std) added during training.</li>
<li><strong>Decoder</strong>: 6-layer Transformer (8 heads) that outputs a latent tensor.</li>
<li><strong>Projection</strong>: Linear layers project latent tensor to fields $\mathbf{h}$ ($L \times q$) and low-rank tensor $\mathbf{V}$ ($L \times K \times q$).</li>
</ul>
<p><strong>Coupling Construction</strong>:
The full coupling tensor $\mathbf{J}$ is approximated via:
$$\mathbf{J}_{i,a,j,b} = \frac{1}{\sqrt{K}} \sum_{k=1}^{K} \mathbf{V}_{i,k,a} \mathbf{V}_{j,k,b}$$
Rank $K=48$ was used.</p>
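<p>The low-rank construction can be checked numerically with a single einsum (toy sizes below; only the rank $K=48$ and alphabet size $q=21$ match the paper, and materializing the full tensor is done here purely for verification):</p>

```python
import numpy as np

L, K, q = 8, 48, 21  # length, rank, amino-acid alphabet (20 residues + gap)
rng = np.random.default_rng(0)
V = rng.normal(size=(L, K, q))

# J[i,a,j,b] = (1/sqrt(K)) * sum_k V[i,k,a] * V[j,k,b]
J = np.einsum("ika,jkb->iajb", V, V) / np.sqrt(K)
```

<p>In practice the model never forms $\mathbf{J}$ explicitly: energies and pseudo-likelihood terms are contracted directly against $\mathbf{V}$, which is what brings the cost down from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$.</p>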
<p><strong>Loss Functions</strong>:
Two variants were trained:</p>
<ol>
<li><strong>InvMSAFold-PW</strong>: Trained via <strong>Pseudo-Likelihood (PL)</strong>. Computation is optimized to $\mathcal{O}(L)$ time using the low-rank property.</li>
<li><strong>InvMSAFold-AR</strong>: Trained via <strong>Autoregressive Likelihood</strong>. Couplings are masked ($J_{ij} = 0$ if $i &lt; j$) to allow exact likelihood computation and direct sampling without MCMC.</li>
</ol>
<h3 id="models">Models</h3>
<ul>
<li><strong>InvMSAFold-PW</strong>: Requires MCMC sampling (Metropolis-Hastings) at inference.</li>
<li><strong>InvMSAFold-AR</strong>: Allows direct, fast autoregressive sampling.</li>
<li><strong>Hyperparameters</strong>: AdamW optimizer, lr=$10^{-4}$ (PW) / $3.4 \times 10^{-4}$ (AR), 94 epochs. L2 regularization: $\lambda_h = \lambda_J = 10^{-4}$ (PW); $\lambda_J = 3.2 \times 10^{-6}$, $\lambda_h = 5.0 \times 10^{-5}$ (AR).</li>
</ul>
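<p>The MCMC inference used by the PW variant amounts to single-site Metropolis-Hastings over the Potts distribution. The sketch below is a minimal illustration under the low-rank parameterization, not the authors&rsquo; optimized sampler; function names and step counts are assumptions.</p>

```python
import numpy as np

def potts_energy(seq, h, V):
    """E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, s_i, j, s_j], with
    J[i,a,j,b] = (1/sqrt(K)) * sum_k V[i,k,a] * V[j,k,b]."""
    L, K, _ = V.shape
    v = V[np.arange(L), :, seq]                       # v[i] = V[i, :, s_i], shape (L, K)
    tot = v.sum(axis=0)
    # sum_{i<j} v_i . v_j = ((sum_i v_i)^2 - sum_i |v_i|^2) / 2
    pair = (tot @ tot - np.einsum("ik,ik->", v, v)) / (2.0 * np.sqrt(K))
    return -(h[np.arange(L), seq].sum() + pair)

def mh_sample(h, V, n_steps=2000, seed=0):
    """Single-site Metropolis-Hastings chain targeting p(s) ∝ exp(-E(s))."""
    rng = np.random.default_rng(seed)
    L, _, q = V.shape
    seq = rng.integers(q, size=L)
    E = potts_energy(seq, h, V)
    for _ in range(n_steps):
        prop = seq.copy()
        prop[rng.integers(L)] = rng.integers(q)       # propose one mutated site
        E_new = potts_energy(prop, h, V)
        if rng.random() < np.exp(min(0.0, E - E_new)):  # accept with min(1, e^{-ΔE})
            seq, E = prop, E_new
    return seq
```

<p>Because each energy evaluation uses only $\mathbf{V}$, a sweep costs $\mathcal{O}(LK)$ per proposal and runs comfortably on a single CPU core, consistent with the inference setup reported below. The AR variant skips this loop entirely via direct autoregressive sampling.</p>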
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>RMSD</strong>: Structure fidelity (AlphaFold2 prediction vs. native structure).</li>
<li><strong>Covariance Pearson Correlation</strong>: Measures recovery of evolutionary pairwise statistics.</li>
<li><strong>KL Divergence</strong>: Between PCA-projected densities of natural and synthetic sequences (Gaussian KDE, kernel size 1.0).</li>
<li><strong>Sampling Speed</strong>: Wall-clock time vs. sequence length/batch size.</li>
</ul>
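<p>The covariance-recovery metric can be sketched as follows: one-hot encode each MSA, form the column covariance matrix, and correlate the off-diagonal entries of the two matrices. This is an illustrative implementation under those assumptions, not the authors&rsquo; evaluation script.</p>

```python
import numpy as np

def one_hot(msa, q):
    """(N, L) integer MSA -> (N, L*q) one-hot matrix."""
    n, length = msa.shape
    x = np.zeros((n, length * q))
    x[np.arange(n)[:, None], np.arange(length) * q + msa] = 1.0
    return x

def covariance_pearson(msa_a, msa_b, q=21):
    """Pearson correlation between the covariance matrices of two MSAs."""
    ca = np.cov(one_hot(msa_a, q), rowvar=False)
    cb = np.cov(one_hot(msa_b, q), rowvar=False)
    off = ~np.eye(ca.shape[0], dtype=bool)  # drop trivial self-variance entries
    return np.corrcoef(ca[off], cb[off])[0, 1]
```

<p>A generated sequence set that reproduces the evolutionary pairwise statistics of the natural MSA scores close to 1; an identical MSA compared with itself scores exactly 1.</p>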
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Not specified in the paper. The GitHub repository reports testing on an NVIDIA RTX 3090, with training taking 10-24 hours depending on model variant.</li>
<li><strong>Inference</strong>:
<ul>
<li><strong>ESM-IF1</strong>: NVIDIA GeForce RTX 4060 Laptop (8GB).</li>
<li><strong>InvMSAFold</strong>: Single core of Intel i9-13905H CPU.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/luchinoprince/Potts_Inverse_Folding">Potts_Inverse_Folding</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and inference code (PyTorch)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Silva, L. A., Meynard-Piganeau, B., Lucibello, C., &amp; Feinauer, C. (2025). Fast Uncovering of Protein Sequence Diversity from Structure. <em>International Conference on Learning Representations (ICLR)</em>. <a href="https://arxiv.org/abs/2406.11975">https://arxiv.org/abs/2406.11975</a></p>
<p><strong>Publication</strong>: ICLR 2025 (Spotlight)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{silvaFastUncoveringProtein2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Fast Uncovering of Protein Sequence Diversity from Structure}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Silva, Luca Alessandro and {Meynard-Piganeau}, Barthelemy and Lucibello, Carlo and Feinauer, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Statistical Mechanics: Theory and Experiment}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{084003}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/1742-5468/adf0e7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://openreview.net/forum?id=1iuaxjssVp}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=1iuaxjssVp">OpenReview Page</a></li>
<li><a href="https://github.com/luchinoprince/Potts_Inverse_Folding">GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>InstructMol: Multi-Modal Molecular LLM for Drug Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/instructmol/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/instructmol/</guid><description>A multi-modal LLM aligning 2D molecular graphs with text via two-stage instruction tuning for drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="instructmol-framework-overview">InstructMol Framework Overview</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This work proposes <strong>InstructMol</strong>, a novel multi-modal architecture and training paradigm. It focuses on engineering a system that aligns a pre-trained molecular graph encoder with a general-purpose Large Language Model (LLM). The paper&rsquo;s primary contribution is the <strong>Two-Stage Instruction Tuning</strong> strategy (Alignment Pre-training + Task-Specific Tuning) designed to bridge the modality gap between 2D molecular graphs and natural language.</p>
<h2 id="bridging-specialist-and-generalist-models">Bridging Specialist and Generalist Models</h2>
<p>Current AI approaches in drug discovery typically fall into two categories. Specialist models deliver high accuracy on specific tasks (such as property prediction) but require extensive labeled datasets and lack conversational adaptability. Conversely, generalist LLMs offer strong reasoning and dialogue capabilities but struggle to natively interpret complex structural data, often relying on brittle 1D text representations of molecules like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</p>
<p>There is a practical need for a unified &ldquo;Molecular Assistant&rdquo; capable of visually interpreting molecular graphs, reasoning about structure in natural language, and adapting across tasks like synthesis planning and property analysis without training from scratch.</p>
<h2 id="two-stage-modality-alignment">Two-Stage Modality Alignment</h2>
<p>The core novelty lies in the architecture and the <strong>two-stage training pipeline</strong> designed to align differing modalities efficiently:</p>
<ol>
<li><strong>MoleculeSTM Integration</strong>: InstructMol initializes its graph encoder with <strong>MoleculeSTM</strong>, which is already pre-aligned with text via contrastive learning, facilitating easier downstream alignment.</li>
<li><strong>Two-Stage Alignment Strategy</strong>:
<ul>
<li><strong>Stage 1 (Alignment Pre-training)</strong>: Freezes both the LLM and Graph Encoder; trains <em>only</em> a linear projector using a massive dataset of molecule-description pairs to map graph features into the LLM&rsquo;s token space.</li>
<li><strong>Stage 2 (Task-Specific Instruction Tuning)</strong>: Freezes the Graph Encoder; fine-tunes the Projector and the LLM (using <strong>LoRA</strong>) on specific downstream tasks. This allows the model to adapt its reasoning capabilities while preserving the structural understanding gained in Stage 1.</li>
</ul>
</li>
</ol>
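<p>The stage-wise freezing schedule above can be summarized as a simple selector. This is a schematic pure-Python sketch; the component names are illustrative placeholders, not identifiers from the authors&rsquo; implementation.</p>

```python
def trainable_components(stage):
    """Stage 1: train only the projector (graph encoder and LLM frozen).
    Stage 2: train the projector plus LoRA adapters on the LLM (encoder frozen)."""
    if stage == 1:
        return {"projector"}
    if stage == 2:
        return {"projector", "llm_lora_adapters"}
    raise ValueError(f"unknown stage: {stage}")

stage1_train = trainable_components(1)
stage2_train = trainable_components(2)
```

<p>The design keeps the expensive components frozen throughout: only the lightweight projector (Stage 1) and low-rank LoRA adapters (Stage 2) receive gradients, so the structural knowledge in the graph encoder is never overwritten.</p>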
<h2 id="task-evaluation-in-drug-discovery">Task Evaluation in Drug Discovery</h2>
<p>The authors evaluated InstructMol across three distinct categories of drug discovery tasks, comparing it against generalist LLMs (Vicuna, LLaMA, <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a>) and specialist models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, MolT5):</p>
<ol>
<li><strong>Property Prediction</strong>:
<ul>
<li><em>Regression</em>: Predicting quantum mechanical properties (HOMO, LUMO, Gap) using the <a href="/notes/chemistry/datasets/qm9/">QM9</a> dataset.</li>
<li><em>Classification</em>: Predicting biological activity (BACE, BBBP, HIV) using <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
</ul>
</li>
<li><strong>Molecule Description Generation</strong>: Generating natural language descriptions of molecules using the ChEBI-20 dataset.</li>
<li><strong>Chemical Reaction Analysis</strong>:
<ul>
<li><em>Forward Reaction Prediction</em>: Predicting products from reactants.</li>
<li><em>Reagent Prediction</em>: Identifying necessary reagents.</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: Suggesting reactants for a given product.</li>
</ul>
</li>
</ol>
<p><strong>Ablation Studies</strong> tested the impact of the projector type (Linear vs. MLP), LLM scale (7B vs 13B), and the necessity of the two-stage training approach.</p>
<h2 id="core-findings-and-limitations">Core Findings and Limitations</h2>
<ul>
<li><strong>Improvement Over Baseline Generalists</strong>: InstructMol significantly outperformed generalist LLMs (like LLaMA and Galactica) on all tasks, demonstrating the value of incorporating explicit graph modalities.</li>
<li><strong>Reducing the Gap with Specialists</strong>: While InstructMol brings versatile reasoning capabilities, it still trails highly optimized specialist models (such as Uni-Mol and MolT5) on tasks like molecule description generation. This remaining gap likely stems from its reliance on a relatively small alignment pre-training dataset (~264K PubChem pairs) and the information bottleneck of using a simple linear projector, compared to the millions of structures used to train expert foundational models.</li>
<li><strong>Importance of Alignment</strong>: Ablation studies confirmed that skipping Stage 1 (Alignment Pre-training) degraded performance, proving that a dedicated phase for projecting graph features into text space is crucial.</li>
<li><strong>Limitation</strong>: The model struggles with highly imbalanced datasets (e.g., HIV) and complex reaction mixtures where mapping multiple graph tokens to text becomes ambiguous.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline utilizes distinct datasets for the two stages. <strong>Note:</strong> As of the latest repository update, the finely processed instruction-tuning datasets (e.g., the filtered ~264K PubChem pairs and instruction-formatted subset pairs) are listed as &ldquo;coming soon&rdquo;, requiring manual recreation for full reproduction.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Stage 1</strong> (Alignment)</td>
          <td style="text-align: left"><strong><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></strong></td>
          <td style="text-align: left">~264K pairs</td>
          <td style="text-align: left">Molecule-text pairs. Filtered from 330K to remove invalid descriptions and overlap with the ChEBI-20 test set.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Prop. Reg.)</td>
          <td style="text-align: left"><strong>QM9</strong></td>
          <td style="text-align: left">362K samples</td>
          <td style="text-align: left">Quantum mechanics properties (HOMO, LUMO, Gap).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Prop. Class.)</td>
          <td style="text-align: left"><strong>MoleculeNet</strong></td>
          <td style="text-align: left">35K samples</td>
          <td style="text-align: left">BACE, BBBP, HIV datasets. Converted to instruction format (Yes/No answer).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Generation)</td>
          <td style="text-align: left"><strong>ChEBI-20</strong></td>
          <td style="text-align: left">26.5K samples</td>
          <td style="text-align: left">Molecule description generation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Reactions)</td>
          <td style="text-align: left"><strong>USPTO</strong></td>
          <td style="text-align: left">~380K samples</td>
          <td style="text-align: left">Combined datasets for Forward (125K), Retrosynthesis (130K), and Reagent (125K) prediction.</td>
      </tr>
  </tbody>
</table>
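<p>The Stage 1 filtering step (dropping pairs with invalid descriptions or test-set overlap) can be sketched as follows; the predicate shown here is a minimal illustration, not the paper&rsquo;s exact cleaning pipeline:</p>

```python
def filter_pairs(pairs, chebi_test_smiles):
    """Keep only molecule-text pairs with a non-empty description whose
    SMILES does not appear in the ChEBI-20 test set (illustrative sketch;
    the paper's actual validity checks may be stricter)."""
    return [(smi, desc) for smi, desc in pairs
            if desc and smi not in chebi_test_smiles]
```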
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Two-Stage Training</strong>:
<ol>
<li><strong>Alignment Pre-training</strong>: Updates only the Projector. The objective maximizes the probability of generating the target description token sequence $\mathbf{X}_A$ given the molecule input $\mathbf{X}_M$ and instruction $\mathbf{X}_I$:
$$p(\mathbf{X}_A | \mathbf{X}_M, \mathbf{X}_I) = \prod_{i=1}^L p_\theta(x_i | \mathbf{X}_G \parallel \mathbf{X}_S, \mathbf{X}_I, \mathbf{X}_{A,&lt;i})$$
where the molecule input $\mathbf{X}_M$ is the concatenation ($\parallel$) of the projected graph tokens $\mathbf{X}_G$ and the SMILES token sequence $\mathbf{X}_S$.</li>
<li><strong>Instruction Tuning</strong>: Updates Projector + LLM (via LoRA) using standard autoregressive language modeling on task-specific instructions. The objective minimizes the negative log-likelihood of generating the target response $R$ of length $L$:
$$\mathcal{L}(\theta) = -\sum_{i=1}^L \log p(R_i | I, M, R_{&lt;i}; \theta)$$
where $I$ represents the instruction and $M$ is the multi-modal molecular input.</li>
</ol>
</li>
<li><strong>LoRA (Low-Rank Adaptation)</strong>: Applied to the LLM in Stage 2. Rank $r=64$, Scaling $\alpha=16$.</li>
<li><strong>Optimization</strong>: AdamW optimizer. Learning rate starts at 2e-3 (Stage 1) and 8e-5 (Stage 2) with cosine decay. Warm-up ratio 0.03.</li>
</ul>
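<p>The low-rank update in Stage 2 can be illustrated with a pure-Python sketch of the LoRA forward pass (the paper uses $r=64$, $\alpha=16$; the tiny matrices below are hypothetical and only demonstrate the $h = xW + \frac{\alpha}{r}\,xAB$ decomposition):</p>

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16, r=64):
    """LoRA sketch: h = x W + (alpha / r) * x A B, where the low-rank
    factors A (d x r) and B (r x k) are the only trainable weights and
    the base weight W stays frozen."""
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * d for b, d in zip(br, dr)]
            for br, dr in zip(base, delta)]
```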
<h3 id="models">Models</h3>
<p><strong>Note:</strong> The official repository currently lists the final fine-tuned <strong>InstructMol weights</strong> as &ldquo;coming soon.&rdquo; Consequently, one must fine-tune the components using the provided scripts. Base model weights (Vicuna-7B and MoleculeSTM) are publicly available via Hugging Face.</p>
<ul>
<li><strong>Graph Encoder ($f_g$)</strong>:
<ul>
<li>Architecture: Graph Isomorphism Network (GIN) with 5 layers.</li>
<li>Hidden Dimension: 300.</li>
<li>Initialization: <strong>MoleculeSTM</strong> checkpoint (pre-trained via contrastive learning).</li>
<li>Status: <strong>Frozen</strong> during Stage 2.</li>
</ul>
</li>
<li><strong>LLM</strong>:
<ul>
<li>Base: <strong>Vicuna-v1.3-7B</strong>.</li>
<li>Status: Frozen in Stage 1; LoRA fine-tuned in Stage 2.</li>
</ul>
</li>
<li><strong>Projector</strong>:
<ul>
<li>Architecture: Linear Layer.</li>
<li>Function: Maps the node-level graph representation $Z_G \in \mathbb{R}^{N \times d}$ to the dimension of the LLM&rsquo;s word-embedding space.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric Libraries</strong>: RDKit for validity/fingerprints, standard NLP libraries for BLEU/ROUGE.</li>
<li><strong>Reaction Metrics</strong>: Fingerprint <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto Similarity</a> (FTS), Exact Match, Levenshtein distance, and validity (via RDKit).</li>
<li><strong>Description Metrics</strong>: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR.</li>
</ul>
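<p>The fingerprint Tanimoto similarity used for the reaction metrics is the Jaccard index over fingerprint bit sets; a minimal sketch (RDKit computes this over actual molecular fingerprints, the bare-set version here is just the underlying formula):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A intersect B| / |A union B|."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```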
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA RTX A6000 (48GB VRAM).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Stage 1: 5 epochs.</li>
<li>Stage 2: 20-50 epochs (Description Generation), 10 epochs (Properties/Reactions).</li>
</ul>
</li>
<li><strong>Batch Size</strong>: 128 for both stages.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/IDEA-XL/InstructMol">InstructMol (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache 2.0 (code), CC BY-NC 4.0 (data)</td>
          <td style="text-align: left">Training/evaluation scripts provided; fine-tuned weights listed as &ldquo;coming soon&rdquo;</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/lmsys/vicuna-7b-v1.3">Vicuna-7B v1.3</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Non-commercial (LLaMA license)</td>
          <td style="text-align: left">Base LLM; must be downloaded separately</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/chao1224/MoleculeSTM">MoleculeSTM</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Pre-trained graph encoder checkpoint</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cao, H., Liu, Z., Lu, X., Yao, Y., &amp; Li, Y. (2025). InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery. <em>Proceedings of the 31st International Conference on Computational Linguistics</em>, 354-379.</p>
<p><strong>Publication</strong>: COLING 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caoInstructMolMultiModalIntegration2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{InstructMol}}: {{Multi-Modal Integration}} for {{Building}} a {{Versatile}} and {{Reliable Molecular Assistant}} in {{Drug Discovery}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{InstructMol}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st {{International Conference}} on {{Computational Linguistics}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Cao, He and Liu, Zijing and Lu, Xingyu and Yao, Yuan and Li, Yu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and {Al-Khalifa}, Hend and Eugenio, Barbara Di and Schockaert, Steven}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{354--379}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://aclanthology.org/2025.coling-main.25/}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Abu Dhabi, UAE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialists, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IDEA-XL/InstructMol">Official Repository</a></li>
</ul>
]]></content:encoded></item><item><title>DynamicFlow: Integrating Protein Dynamics into Drug Design</title><link>https://hunterheidenreich.com/notes/biology/computational-biology/dynamicflow/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/computational-biology/dynamicflow/</guid><description>Flow matching model that co-generates ligands and flexible protein pockets, addressing rigid-receptor limitations in structure-based drug design.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is primarily a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) with a strong <strong>Resource</strong> ($\Psi_{\text{Resource}}$) component.</p>
<ul>
<li><strong>Method</strong>: It proposes <strong>DynamicFlow</strong>, a novel multiscale architecture combining atom-level SE(3)-equivariant GNNs (SE(3) is the special Euclidean group in 3D, comprising all rigid rotations and translations; equivariance means model outputs transform consistently when the inputs are rotated or translated) and residue-level Transformers within a <a href="/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/">flow matching</a> framework to model the joint distribution of ligand generation and protein conformational change.</li>
<li><strong>Resource</strong>: It curates a significant dataset derived from MISATO, pairing AlphaFold2-predicted apo structures with multiple MD-simulated holo states, specifically filtered for flow matching tasks.</li>
</ul>
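<p>The equivariance property is easy to see on a toy function: the centroid of a point cloud commutes with any rotation, which is exactly the behavior an SE(3)-equivariant layer must satisfy for its position outputs. A minimal self-check (illustrative only, not the paper&rsquo;s architecture):</p>

```python
import math

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis by angle theta."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def centroid(points):
    """Mean position: a trivially SE(3)-equivariant function of a point cloud."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

<p>Rotating the inputs and then taking the centroid gives the same result as rotating the centroid, which is the equivariance condition $f(R\,x) = R\,f(x)$.</p>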
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Traditional Structure-Based Drug Design (SBDD) methods typically assume the protein target is rigid, which limits their applicability because proteins are dynamic and undergo conformational changes (induced fit) upon ligand binding.</p>
<ul>
<li><strong>Biological Reality</strong>: Proteins exist as ensembles of states; binding often involves transitions from &ldquo;apo&rdquo; (unbound) to &ldquo;holo&rdquo; (bound) <a href="/posts/geom-conformer-generation-dataset/">conformational changes</a>, sometimes revealing cryptic pockets.</li>
<li><strong>Computational Bottleneck</strong>: <a href="/notes/chemistry/molecular-simulation/">Molecular Dynamics (MD)</a> simulates these changes but incurs high computational costs due to energy barriers.</li>
<li><strong>Gap</strong>: <a href="/notes/machine-learning/generative-models/">Existing generative models</a> for SBDD mostly condition on a fixed pocket structure, ignoring the co-adaptation of the protein and ligand.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>simultaneous modeling of ligand generation and protein conformational dynamics</strong> using a unified flow matching framework.</p>
<ul>
<li><strong>DynamicFlow Architecture</strong>: A multiscale model that treats the protein as both full-atom (for interaction) and residue-level frames (for large-scale dynamics), utilizing separate flow matching objectives for backbone frames, side-chain torsions, and ligand atoms.</li>
<li><strong>Stochastic Flow (SDE)</strong>: Introduction of a <a href="/notes/machine-learning/generative-models/score-based-generative-modeling-sde/">stochastic variant</a> (DynamicFlow-SDE) that improves robustness and diversity compared to the deterministic ODE flow.</li>
<li><strong>Coupled Generation</strong>: The model learns to transport the <em>apo</em> pocket distribution to the <em>holo</em> pocket distribution while simultaneously denoising the ligand, advancing beyond rigid pocket docking methods.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the method on a curated dataset of 5,692 protein-ligand complexes.</p>
<ul>
<li><strong>Baselines</strong>: Compared against rigid-pocket SBDD methods: Pocket2Mol, TargetDiff, and IPDiff (adapted as TargetDiff* and IPDiff* for fair comparison of atom numbers). Also compared against conformation sampling baselines (Str2Str).</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Ligand Quality</strong>: Vina Score (binding affinity), QED (drug-likeness), SA (synthesizability), Lipinski&rsquo;s rule of 5.</li>
<li><strong>Pocket Quality</strong>: RMSD between generated and ground-truth holo pockets, Cover Ratio (percentage of holo states successfully retrieved), and Pocket Volume distributions.</li>
<li><strong>Interaction</strong>: Protein-Ligand Interaction Profiler (PLIP) to measure specific non-covalent interactions.</li>
</ul>
</li>
<li><strong>Ablations</strong>: Tested the impact of the interaction loss, residue-level Transformer, and SDE vs. ODE formulations.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Improved Affinity</strong>: DynamicFlow-SDE achieved the best (lowest) Vina scores ($-7.65$) compared to baselines like TargetDiff ($-5.09$) and Pocket2Mol ($-5.50$). Note that Vina scores are a computational proxy and do not directly predict experimental binding affinity. Moreover, Vina score optimization is gameable: molecules can achieve strong computed binding energies while remaining synthetically inaccessible. QED and SA scores, which assess drug-likeness and synthesizability respectively, were reported but were not primary optimization targets in the paper, which limits the strength of this affinity claim.</li>
<li><strong>Realistic Dynamics</strong>: The model successfully generated holo-like pocket conformations with volume distributions and interaction profiles closer to ground-truth MD simulations than the initial apo structures.</li>
<li><strong>Enhancing Rigid Methods</strong>: Holo pockets generated by DynamicFlow served as better inputs for rigid-SBDD baselines (e.g., TargetDiff improved from $-5.09$ to $-9.00$ and IPDiff improved from $-7.55$ to $-11.04$ when using &ldquo;Our Pocket&rdquo;), suggesting the method can act as a &ldquo;pocket refiner&rdquo;.</li>
<li><strong>ODE vs. SDE Trade-off</strong>: The deterministic ODE variant achieves better pocket RMSD, while the stochastic SDE variant achieves better Cover Ratio (diversity of holo states captured) and binding affinity. Neither dominates uniformly.</li>
<li><strong>Conformation Sampling Baseline</strong>: Str2Str, a dedicated conformation sampling baseline, performed worse than simply perturbing the apo structure with noise. One interpretation is that this highlights the difficulty of the apo-to-holo prediction task; another is that Str2Str was not designed specifically for apo-to-holo prediction, making it a limited test of its capabilities.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset is derived from <strong>MISATO</strong>, which contains MD trajectories for PDBbind complexes.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td>Curated MISATO</td>
          <td>5,692 complexes</td>
          <td>Filtered for valid MD (<a href="/posts/kabsch-algorithm/">RMSD</a> $&lt; 3\text{\AA}$), clustered to remove redundancy. Contains 46,235 holo-ligand conformations total.</td>
      </tr>
      <tr>
          <td><strong>Apo Structures</strong></td>
          <td>AlphaFold2</td>
          <td>N/A</td>
          <td>Apo structures were obtained by mapping PDB IDs to UniProt and retrieving AlphaFold2 predictions, then aligning to MISATO structures.</td>
      </tr>
      <tr>
          <td><strong>Splits</strong></td>
          <td>Standard</td>
          <td>50 test complexes</td>
          <td>50 complexes with no overlap with the training set selected for testing. Note: 50 is a small held-out set; results should be interpreted cautiously.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Clustering</strong>: Holo-ligand conformations clustered with RMSD threshold $1.0\text{\AA}$; top 10 clusters kept per complex.</li>
<li><strong>Pocket Definition</strong>: Residues within $7\text{\AA}$ of the ligand.</li>
<li><strong>Alignment</strong>: AlphaFold predicted structures (apo) aligned to MISATO holo structures using sequence alignment (Smith-Waterman) to identify pocket residues.</li>
</ul>
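<p>The 7&thinsp;&Aring; pocket definition above amounts to a distance cutoff between residue atoms and ligand atoms; a minimal sketch (the residue/atom data layout is assumed for illustration):</p>

```python
import math

def pocket_residues(residue_atoms, ligand_atoms, cutoff=7.0):
    """Return IDs of residues having any atom within `cutoff` angstroms of
    any ligand atom (sketch of the 7 A pocket definition).
    residue_atoms: dict mapping residue ID -> list of (x, y, z) coordinates."""
    return [rid for rid, atoms in residue_atoms.items()
            if any(math.dist(a, l) <= cutoff
                   for a in atoms for l in ligand_atoms)]
```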
<h3 id="algorithms">Algorithms</h3>
<p><strong>Flow Matching Framework</strong>:</p>
<ul>
<li><strong>Continuous Variables</strong> (Pocket translation/rotation/torsions, Ligand positions): Modeled using <strong>Conditional Flow Matching (CFM)</strong>.
<ul>
<li><em>Prior</em>: Apo state for pocket; Normal distribution for ligand positions.</li>
<li><em>Target</em>: Holo state from MD; Ground truth ligand.</li>
<li><em>Interpolant</em>: Linear interpolation for Euclidean variables; Geodesic for rotations ($SO(3)$, the rotation-only subgroup of SE(3) containing all 3D rotations but not translations); Wrapped linear interpolation for torsions (Torus).</li>
</ul>
</li>
<li><strong>Discrete Variables</strong> (Ligand atom/bond types): Modeled using <strong>Discrete Flow Matching</strong> based on Continuous-Time Markov Chains (CTMC).
<ul>
<li><em>Rate Matrix</em>: Interpolates between mask token and data distribution.</li>
</ul>
</li>
<li><strong>Loss Function</strong>: Weighted sum of 7 losses:
<ol>
<li>Translation CFM (Eq 5)</li>
<li>Rotation CFM (Eq 7)</li>
<li>Torsion CFM (Eq 11)</li>
<li>Ligand Position CFM</li>
<li>Ligand Atom Type CTMC (Eq 14)</li>
<li>Ligand Bond Type CTMC</li>
<li><strong>Interaction Loss</strong> (Eq 18): Explicitly penalizes deviations in pairwise distances between protein and ligand atoms for pairs $\leq 3.5\text{\AA}$.</li>
</ol>
</li>
</ul>
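<p>The interpolants for the continuous variables can be sketched concretely: linear interpolation for Euclidean quantities (with constant target velocity $x_1 - x_0$), and wrapped linear interpolation on the torus for torsions, which follows the shorter angular arc. This is a generic CFM sketch, not the paper&rsquo;s exact parameterization:</p>

```python
import math

def linear_interpolant(x0, x1, t):
    """Euclidean CFM: x_t = (1 - t) x_0 + t x_1; target velocity x_1 - x_0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v

def wrapped_interpolant(phi0, phi1, t):
    """Torsion CFM on the torus: interpolate along the shorter arc by
    wrapping the angular difference into (-pi, pi]."""
    d = (phi1 - phi0 + math.pi) % (2 * math.pi) - math.pi
    return phi0 + t * d, d
```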
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: <strong>DynamicFlow</strong> is a multiscale model with 15.9M parameters.</p>
<ol>
<li><strong>Atom-Level SE(3)-Equivariant GNN</strong>:
<ul>
<li><em>Input</em>: Complex graph (k-NN) and Ligand graph (fully connected).</li>
<li><em>Layers</em>: 6 EGNN blocks modified to maintain node and edge hidden states.</li>
<li><em>Function</em>: Updates ligand positions and predicts ligand atom/bond types.</li>
</ul>
</li>
<li><strong>Residue-Level Transformer</strong>:
<ul>
<li><em>Input</em>: Aggregated atom features from the GNN + Residue frames/torsions.</li>
<li><em>Layers</em>: 4 Transformer blocks with <strong>Invariant Point Attention (IPA)</strong>.</li>
<li><em>Function</em>: Updates protein residue frames (translation/rotation) and predicts side-chain torsions.</li>
</ul>
</li>
</ol>
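<p>The k-NN graph construction feeding the atom-level GNN can be sketched as a brute-force nearest-neighbor edge list (real implementations would use a spatial index; this illustrative version just shows the directed-edge semantics):</p>

```python
import math

def knn_edges(positions, k):
    """Directed k-NN edge list over 3D positions, as used to build a
    complex graph (brute-force sketch)."""
    edges = []
    for i, p in enumerate(positions):
        nearest = sorted((math.dist(p, q), j)
                         for j, q in enumerate(positions) if j != i)
        edges.extend((i, j) for _, j in nearest[:k])
    return edges
```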
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Vina Score</strong>: <code>vina_minimize</code> mode used for binding affinity.</li>
<li><strong>RMSD</strong>: Minimum RMSD between generated pocket and ground-truth holo conformations.</li>
<li><strong>Cover Ratio</strong>: % of ground-truth holo conformations covered by at least one generated sample (threshold $1.42\text{\AA}$).</li>
<li><strong>POVME 3</strong>: For pocket volume calculation.</li>
</ul>
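<p>The Cover Ratio metric reduces to a threshold test over a generated-vs-ground-truth RMSD matrix; a minimal sketch with the paper&rsquo;s $1.42\text{\AA}$ threshold:</p>

```python
def cover_ratio(rmsd_matrix, threshold=1.42):
    """Fraction of ground-truth holo conformations with at least one
    generated sample within `threshold` angstroms RMSD.
    rmsd_matrix[i][j]: RMSD of generated sample j to ground-truth
    conformation i."""
    covered = sum(1 for row in rmsd_matrix if min(row) <= threshold)
    return covered / len(rmsd_matrix)
```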
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Inference Benchmark</strong>: 1x Tesla V100-SXM2-32GB.</li>
<li><strong>Speed</strong>: Generates 10 ligands in ~35-36 seconds (100 NFE), significantly faster than baselines such as Pocket2Mol (980s) or the diffusion-based TargetDiff (156s).</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhou, X., Xiao, Y., Lin, H., He, X., Guan, J., Wang, Y., Liu, Q., Zhou, F., Wang, L., &amp; Ma, J. (2025). Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows. <em>International Conference on Learning Representations (ICLR)</em>. <a href="https://arxiv.org/abs/2503.03989">https://arxiv.org/abs/2503.03989</a></p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhouIntegratingProteinDynamics2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhou, Xiangxin and Xiao, Yi and Lin, Haowei and He, Xinheng and Guan, Jiaqi and Wang, Yang and Liu, Qiang and Zhou, Feng and Wang, Liang and Ma, Jianzhu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://arxiv.org/abs/2503.03989}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2503.03989">arXiv Page</a></li>
<li>Code: no public repository available at time of writing</li>
</ul>
]]></content:encoded></item><item><title>ChemDFM-X: Multimodal Foundation Model for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-x/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-x/</guid><description>Multimodal chemical model integrating 5 modalities (2D graphs, 3D conformations, images, MS2/IR spectra) trained on 7.6M instructions.</description><content:encoded><![CDATA[<h2 id="chemdfm-x-contribution-and-architecture">ChemDFM-X Contribution and Architecture</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> contribution.</p>
<p><strong>Method</strong>: The paper proposes a novel &ldquo;Cross-modal Dialogue Foundation Model&rdquo; architecture that aligns five distinct chemical modalities (2D graphs, 3D conformations, images, MS2 spectra, IR spectra) to a single LLM decoder using separate encoders and projection modules. It establishes strong baseline performance across multiple modalities compared against current generalist models.</p>
<p><strong>Resource</strong>: The paper addresses the scarcity of multimodal chemical data by constructing a <strong>7.6M instruction-tuning dataset</strong>. This dataset is largely synthesized from seed SMILES strings using approximate calculations (MMFF94, CFM-ID, Chemprop-IR) and specialist model predictions.</p>
<h2 id="bridging-experimental-data-and-llms">Bridging Experimental Data and LLMs</h2>
<p>Existing chemical AI models generally fall into two distinct categories. Task-specific specialist models achieve high accuracy on singular objectives, such as property prediction or molecular generation, but require strict formatting and lack conversational flexibility. Conversely, early chemical large language models provide natural language interaction but are restricted to text and SMILES strings. ChemDFM-X addresses this gap by enabling large multimodal models to process the experimental characterization data (<a href="https://en.wikipedia.org/wiki/Tandem_mass_spectrometry">MS2 spectra</a> and <a href="https://en.wikipedia.org/wiki/Infrared_spectroscopy">IR spectra</a>) and visual data routinely used in practical chemistry workflows.</p>
<h2 id="synthetic-data-scaling-for-modality-alignment">Synthetic Data Scaling for Modality Alignment</h2>
<p>The core novelty lies in the <strong>&ldquo;Any-to-Text&rdquo; alignment strategy via synthetic data scaling</strong>:</p>
<ol>
<li>
<p><strong>Comprehensive Modality Support</strong>: ChemDFM-X incorporates experimental characterization data (MS2 and IR spectra) alongside 2D graphs, 3D conformations, and images. The data representations are formally defined mathematically rather than as raw pixels:</p>
<ul>
<li><strong>Molecular Graph</strong>: An undirected graph $G = (\textbf{V}, \textbf{E})$ with atom set $\textbf{V}$ and bond set $\textbf{E}$.</li>
<li><strong>Molecular Conformation</strong>: An undirected graph $G = (\textbf{V}', \textbf{E})$ whose nodes store spatial coordinates: $\textbf{v}_i = (x_i, y_i, z_i, a_i)$.</li>
<li><strong>MS2 Spectrum</strong>: Treated as a point sequence of discrete mass-to-charge ratios and intensities, tokenized via a discrete codebook: $\textbf{M} = ((r_1, I_1), (r_2, I_2), \dots, (r_n, I_n))$.</li>
<li><strong>IR Spectrum</strong>: Treated as a dense sequence of continuous wavelengths and absorption intensities, directly reshaped for feature extraction: $\textbf{R} = ((w_1, t_1), (w_2, t_2), \dots, (w_l, t_l))$.</li>
</ul>
<p>The authors trained new Sequence Transformer encoders from scratch for the MS2 and IR modalities since suitable pre-trained models did not exist.</p>
</li>
<li>
<p><strong>Synthetic Data Generation Pipeline</strong>: The authors generated a 7.6M sample dataset by starting with 1.3M seed SMILES and using &ldquo;approximate calculations&rdquo; to generate missing modalities:</p>
<ul>
<li>3D conformations via <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> force field optimization</li>
<li>MS2 spectra via CFM-ID 4.0 (Competitive Fragmentation Modeling)</li>
<li>IR spectra via Chemprop-IR (Message Passing Neural Network)</li>
</ul>
</li>
<li>
<p><strong>Cross-Modal Synergy</strong>: The model demonstrates that training on reaction images improves recognition performance by leveraging semantic chemical knowledge (reaction rules) to correct visual recognition errors, an emergent capability from multimodal training.</p>
</li>
</ol>
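<p>The MS2 tokenization described above (a point sequence mapped through a discrete codebook) can be sketched as simple m/z binning plus intensity quantization; the bin width and level count here are assumptions for illustration, and the paper&rsquo;s actual codebook construction may differ:</p>

```python
def tokenize_ms2(peaks, mz_bin=1.0, n_levels=10):
    """Map (m/z, relative intensity) peaks to discrete
    (m/z-bin, intensity-level) codebook tokens (illustrative sketch)."""
    tokens = []
    for mz, intensity in peaks:
        mz_id = int(mz // mz_bin)                          # m/z bin index
        level = min(int(intensity * n_levels), n_levels - 1)  # quantized intensity
        tokens.append((mz_id, level))
    return tokens
```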
<h2 id="multimodal-benchmarking-with-chemllmbench">Multimodal Benchmarking with ChemLLMBench</h2>
<p>The model was evaluated using a customized version of <strong><a href="/notes/chemistry/llm-applications/chemllmbench-eight-chemistry-tasks/">ChemLLMBench</a></strong> and <strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> across three modality categories:</p>
<ol>
<li>
<p><strong>Structural Modalities</strong> (2D Graphs &amp; 3D Conformations):</p>
<ul>
<li>Molecule recognition and captioning</li>
<li>Property prediction (MoleculeNet: BACE, BBBP, ClinTox, HIV, Tox21)</li>
<li>Compared against specialist models (Mole-BERT, Uni-Mol, MolXPT, MolCA) and generalist models (3D-MoLM, ChemDFM, <a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM</a>)</li>
</ul>
</li>
<li>
<p><strong>Visual Modalities</strong> (Images):</p>
<ul>
<li>Single molecule image recognition</li>
<li>Reaction image recognition</li>
<li>Compared against GPT-4o, Gemini 1.5 Pro, Qwen-VL, LLaVA, and specialist models <a href="/notes/chemistry/optical-structure-recognition/vision-language/molnextr/">MolNextr</a> and <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a></li>
</ul>
</li>
<li>
<p><strong>Characterization Modalities</strong> (MS2 &amp; IR Spectra):</p>
<ul>
<li>Spectral analysis tasks (identifying molecules from spectra)</li>
<li>Contextualized spectral interpretation (combining spectra with reaction context)</li>
<li>Novel evaluation requiring integration of spectroscopic data with reaction knowledge</li>
</ul>
</li>
</ol>
<h2 id="cross-modal-synergy-and-generalist-performance">Cross-Modal Synergy and Generalist Performance</h2>
<p><strong>Key Findings</strong>:</p>
<ol>
<li>
<p><strong>Leading Generalist Performance</strong>: ChemDFM-X establishes a new benchmark among existing generalist models (such as 3D-MoLM and ChemLLM), achieving performance metrics that match dedicated specialist models across several multimodal tasks.</p>
</li>
<li>
<p><strong>Failure of General LMMs</strong>: General vision models (GPT-4o, Gemini 1.5 Pro, Qwen-VL, LLaVA, InternLM-XComposer2, DocOwl) failed significantly on chemical image recognition tasks (0% accuracy for most models on molecule and reaction recognition, Table 9), demonstrating that chemical domain knowledge cannot be assumed from general pre-training.</p>
</li>
<li>
<p><strong>Cross-Modal Error Correction</strong>: In reaction image recognition, ChemDFM-X achieved higher accuracy (53.0%) than on single molecules (46.0%) (Table 9). The authors conclude the model uses its internal knowledge of chemical reaction rules to correct recognition errors in the visual modality, an emergent capability from multimodal training.</p>
</li>
<li>
<p><strong>Reliance on Reaction Context for Spectra</strong>: In zero-shot scenarios, ChemDFM-X essentially fails at pure spectral recognition (achieving 0% and 1% top-1 accuracy on MS2 and IR spectra alone, Table 11). However, when SMILES-based reaction context is included, performance rises to 45% (MS2) and 64% (IR) on the reaction prediction task, and 29% (MS2) and 60% (IR) on <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (Table 11). This indicates the model uses spectral data as a soft prior to constrain textual deductions. Furthermore, the paper compares ChemDFM-X’s spectral identification performance exclusively against text-only LLMs that cannot process spectra, omitting comparisons against established specialist tools.</p>
</li>
<li>
<p><strong>Surrogate Distillation Trade-offs</strong>: Because the spectral training data relies entirely on outputs from CFM-ID 4.0 and Chemprop-IR, ChemDFM-X effectively distills these surrogate models. Any inherent predictive biases or inaccuracies from these underlying tools are permanently embedded in the new ChemDFM-X encoders.</p>
</li>
</ol>
<p><strong>Main Conclusion</strong>: The &ldquo;separate encoders + unified decoder&rdquo; architecture with synthetic data generation enables effective multimodal chemical understanding, bridging the gap between specialist and generalist AI systems for chemistry.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed a <strong>7.6M sample instruction-tuning dataset</strong> derived from <strong>1.3M seed SMILES</strong> (sourced from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and USPTO). <strong>Note</strong>: The final 7.6M multimodal tuning dataset itself isn&rsquo;t publicly available.</p>
<p><strong>Generation Pipeline</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Generation Method</th>
          <th>Tool/Model</th>
          <th>Sample Count</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Graphs</strong></td>
          <td>Direct extraction from SMILES</td>
          <td>RDKit</td>
          <td>1.1M</td>
      </tr>
      <tr>
          <td><strong>3D Conformations</strong></td>
          <td>Force field optimization</td>
          <td>RDKit + MMFF94</td>
          <td>1.3M (pseudo-optimal)</td>
      </tr>
      <tr>
          <td><strong>Molecule Images</strong></td>
          <td>Rendering with augmentation</td>
          <td>RDKit, Indigo, <a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/">ChemPix</a></td>
          <td>~1M (including handwritten style)</td>
      </tr>
      <tr>
          <td><strong>Reaction Images</strong></td>
          <td>Rendering from reaction SMILES</td>
          <td>RDKit</td>
          <td>300K</td>
      </tr>
      <tr>
          <td><strong>MS2 Spectra</strong></td>
          <td>Computational prediction</td>
          <td>CFM-ID 4.0</td>
          <td>~700K</td>
      </tr>
      <tr>
          <td><strong>IR Spectra</strong></td>
          <td>Computational prediction</td>
          <td>Chemprop-IR</td>
          <td>~1M</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Augmentation</strong>:</p>
<ul>
<li>Molecule images augmented with &ldquo;handwritten&rdquo; style using the ChemPix pipeline</li>
<li>Multiple rendering styles (RDKit default, Indigo clean)</li>
<li>Spectra generated at multiple energy levels (10eV, 20eV, 40eV for MS2)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Architecture</strong>: &ldquo;Separate Encoders + Unified Decoder&rdquo;</p>
<p><strong>Code Availability</strong>: The authors have released only the inference code; the cross-modal projector training scripts and the synthetic data-generation pipeline remain closed.</p>
<p><strong>Modality Alignment</strong>:</p>
<ul>
<li>Each modality has a dedicated encoder (frozen pre-trained models where available)</li>
<li>For graph, conformation, MS2, and IR modalities: <strong>2-layer MLP projector</strong> (Linear, GELU, Linear) maps encoder features to LLM input space</li>
<li>For images: <strong>H-Reducer</strong> module compresses image tokens by factor of $n=8$ to handle high-resolution chemical images, then projects to LLM input space</li>
<li>All projected features are concatenated and fed to the unified LLM decoder</li>
</ul>
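<p>As a concrete sketch, the Linear&ndash;GELU&ndash;Linear projector is small enough to write out in full. The dimensions below are toy values for illustration (the real projectors map, e.g., 300-d Mole-BERT features into the 13B LLM&rsquo;s embedding space), and the random weights are placeholders, not trained parameters:</p>

```python
import math
import random

def gelu(x):
    # Exact GELU via the Gaussian CDF: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    # weight has shape (out_dim, in_dim); x is a flat feature vector
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def project(features, w1, b1, w2, b2):
    """2-layer MLP projector: Linear -> GELU -> Linear."""
    hidden = [gelu(h) for h in linear(features, w1, b1)]
    return linear(hidden, w2, b2)

# Toy dimensions: a 4-d encoder feature mapped into an 8-d "LLM" space.
random.seed(0)
in_dim, hid_dim, out_dim = 4, 8, 8
w1 = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hid_dim)]
b1 = [0.0] * hid_dim
w2 = [[random.uniform(-0.1, 0.1) for _ in range(hid_dim)] for _ in range(out_dim)]
b2 = [0.0] * out_dim

token = project([0.5, -1.2, 0.3, 2.0], w1, b1, w2, b2)
assert len(token) == out_dim
```

<p>The projected vector is then treated as one (or more) soft input tokens for the unified decoder, alongside ordinary text embeddings.</p>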
<h3 id="models">Models</h3>
<p><strong>Base LLM</strong>:</p>
<ul>
<li><strong>ChemDFM (13B)</strong>: LLaMA-based model pre-trained on chemical text and SMILES</li>
</ul>
<p><strong>Modality Encoders</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Encoder</th>
          <th>Pre-training Data</th>
          <th>Parameter Count</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Graph</strong></td>
          <td>Mole-BERT</td>
          <td>2M molecules</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>3D Conformation</strong></td>
          <td>Uni-Mol</td>
          <td>209M conformations</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>Image</strong></td>
          <td>CLIP (ViT)</td>
          <td>General domain</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>MS2 Spectrum</strong></td>
          <td>Transformer (SeqT)</td>
          <td>Trained from scratch</td>
          <td>-</td>
          <td><strong>Trainable</strong></td>
      </tr>
      <tr>
          <td><strong>IR Spectrum</strong></td>
          <td>Transformer (SeqT)</td>
          <td>Trained from scratch</td>
          <td>-</td>
          <td><strong>Trainable</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Design Rationale</strong>: The MS2 and IR encoders are trained from scratch as sequence Transformers (SeqT) that treat spectral peaks as token sequences, since no suitable pre-trained encoders exist for chemical spectra.</p>
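<p>One plausible reading of &ldquo;peaks as token sequences&rdquo; is a simple quantization of (position, intensity) pairs into discrete tokens. The binning scheme below is an illustration under that assumption, not the paper&rsquo;s exact tokenizer:</p>

```python
def peaks_to_tokens(peaks, mz_bin=1.0, intensity_levels=10):
    """Quantize (m/z, intensity) peaks into a discrete token sequence
    for a sequence Transformer. Binning granularity is illustrative."""
    tokens = []
    for mz, inten in sorted(peaks):  # order peaks by position
        tokens.append(f"MZ_{int(mz // mz_bin)}")
        tokens.append(f"I_{min(intensity_levels - 1, int(inten * intensity_levels))}")
    return tokens

# Two MS2 peaks become a four-token sequence.
seq = peaks_to_tokens([(77.0, 0.95), (105.0, 0.40)])
assert seq == ["MZ_77", "I_9", "MZ_105", "I_4"]
```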
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Accuracy (Acc)</strong> for recognition tasks</li>
<li><strong>BLEU-2/4</strong> and <strong>METEOR</strong> for captioning tasks</li>
<li><strong>AUC-ROC</strong> for property prediction (classification)</li>
</ul>
<p><strong>Code Availability</strong>: The adapted code for evaluating on ChemLLMBench and their custom spectral recognition tasks is closed-source.</p>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>ChemLLMBench</strong>: Adapted for multimodal inputs across molecule captioning, property prediction, and reaction understanding</li>
<li><strong>MoleculeNet</strong>: Standard molecular property prediction tasks (BACE, BBBP, ClinTox, HIV, Tox21)</li>
<li><strong>USPTO</strong>: Reaction prediction and retrosynthesis tasks</li>
<li><strong>Custom Spectral Tasks</strong>: Novel evaluations requiring spectral interpretation</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Note</strong>: The type and quantity of GPUs used, along with the total training wall-time, were not published.</p>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Total Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 3</li>
<li><strong>Optimizer</strong>: AdamW</li>
</ul>
<p><strong>Modality-Specific Learning Rates (Peak)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Learning Rate</th>
          <th>Feature Dimension</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph</td>
          <td>1e-5</td>
          <td>300</td>
      </tr>
      <tr>
          <td>Conformation</td>
          <td>2e-4</td>
          <td>512</td>
      </tr>
      <tr>
          <td>Image</td>
          <td>2e-3</td>
          <td>1024</td>
      </tr>
      <tr>
          <td>MS2 / IR</td>
          <td>2e-4</td>
          <td>768</td>
      </tr>
  </tbody>
</table>
<p><strong>Note</strong>: Different learning rates reflect the varying degrees of domain adaptation required. Images (general CLIP) need more adaptation than graphs (chemical Mole-BERT).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OpenDFM/ChemDFM-X">ChemDFM-X (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Inference code only; training and data generation scripts are closed</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/OpenDFM/ChemDFM-X-v1.0-13B">ChemDFM-X-v1.0-13B (HuggingFace)</a></td>
          <td>Model</td>
          <td>AGPL-3.0</td>
          <td>13B parameter multimodal model weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhao, Z., Chen, B., Li, J., Chen, L., Wen, L., Wang, P., Zhu, Z., Zhang, D., Wan, Z., Li, Y., Dai, Z., Chen, X., &amp; Yu, K. (2024). ChemDFM-X: Towards Large Multimodal Model for Chemistry. <em>Science China Information Sciences</em>, 67(12), 220109. <a href="https://doi.org/10.1007/s11432-024-4243-0">https://doi.org/10.1007/s11432-024-4243-0</a></p>
<p><strong>Publication</strong>: Science China Information Sciences, December 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2409.13194">arXiv Version</a></li>
<li><a href="https://github.com/OpenDFM/ChemDFM-X">Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhaoChemDFMXLargeMultimodal2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemDFM-X}}: {{Towards Large Multimodal Model}} for {{Chemistry}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhao, Zihan and Chen, Bo and Li, Jingpiao and Chen, Lu and Wen, Liyang and Wang, Pengyu and Zhu, Zichen and Zhang, Danyang and Wan, Ziping and Li, Yansi and Dai, Zhongyang and Chen, Xin and Yu, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Science China Information Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{220109}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11432-024-4243-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2409.13194}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolSight: OCSR with RL and Multi-Granularity Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</guid><description>A three-stage OCSR framework using SMILES pretraining, auxiliary bond/coordinate tasks, and reinforcement learning to master stereochemistry recognition.</description><content:encoded><![CDATA[<h2 id="contribution-a-framework-for-optical-chemical-structure-recognition">Contribution: A Framework for Optical Chemical Structure Recognition</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a novel three-stage training framework (Pretraining → Fine-tuning → RL Post-training) to improve Optical Chemical Structure Recognition (OCSR). Specifically, it introduces Group Relative Policy Optimization (GRPO) to directly optimize non-differentiable chemical-validity objectives.</p>
<p>It also has a <strong>Resource</strong> component, as the authors construct and release <em>Stereo-200k</em>, a dataset specifically designed to train models on challenging stereoisomeric molecules.</p>
<h2 id="motivation-resolving-stereochemical-cues">Motivation: Resolving Stereochemical Cues</h2>
<p>Existing OCSR systems struggle to accurately recognize stereochemical information (e.g., chirality, geometric isomerism) because the visual cues distinguishing stereoisomers (such as wedge and dash bonds) are subtle. Current methods often fail to capture the geometric relationships required to distinguish molecules with identical connectivity but different spatial arrangements. Accurate recognition is critical for downstream tasks like drug discovery where stereochemistry determines pharmacological effects.</p>
<h2 id="core-innovations-grpo-and-multi-granularity-learning">Core Innovations: GRPO and Multi-Granularity Learning</h2>
<p>MolSight introduces three key technical innovations:</p>
<ol>
<li><strong>Reinforcement Learning for OCSR</strong>: It is the first OCSR system to incorporate RL (specifically GRPO) to directly optimize for chemical semantic correctness.</li>
<li><strong>Multi-Granularity Learning</strong>: It employs auxiliary heads for chemical bond classification and atom localization. Unlike previous approaches that optimize these jointly, MolSight decouples the coordinate head to prevent interference with SMILES generation.</li>
<li><strong>SMILES-M Notation</strong>: A lightweight extension to SMILES to handle Markush structures (common in patents) without significant sequence length increase.</li>
</ol>
<h2 id="experimental-methodology">Experimental Methodology</h2>
<p>The authors evaluated MolSight using a rigorous mix of real and synthetic benchmarks:</p>
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec, Imago) and deep learning methods (MolScribe, MolGrapher, DECIMER).</li>
<li><strong>Benchmarks</strong>: Evaluated on real-world datasets (USPTO, Maybridge UoB, CLEF-2012, JPO) and synthetic datasets (Staker, ChemDraw, Indigo, Stereo-2K).</li>
<li><strong>Ablation Studies</strong>: Tested the impact of the bond head, coordinate head, and RL stages separately.</li>
<li><strong>Transfer Learning</strong>: Assessed the quality of learned representations by using the frozen encoder for molecular property prediction on MoleculeNet.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>SOTA Performance</strong>: MolSight achieved 85.1% stereochemical accuracy on the USPTO dataset, significantly outperforming the previous SOTA (MolScribe), which achieved 69.0%.</li>
<li><strong>RL Effectiveness</strong>: Reinforcement learning post-training specifically improved performance on stereoisomers, raising Tanimoto similarity and exact match rates on the Stereo-2k test set.</li>
<li><strong>Robustness</strong>: On perturbed USPTO images (random rotations and shearing), MolSight achieved 92.3% exact match accuracy (vs. the original 92.0%), while rule-based methods like OSRA dropped from 83.5% to 6.7%. On the low-resolution Staker dataset, MolSight reached 82.1% exact match.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline uses three distinct data sources:</p>
<ol>
<li><strong>Pre-training</strong>: <em>MolParser-7M</em>. Contains diverse images but requires the <strong>SMILES-M</strong> extension to handle Markush structures.</li>
<li><strong>Fine-tuning</strong>: <em>PubChem-1M</em> and <em>USPTO-680K</em>. Used for multi-granularity learning with bond and coordinate labels.</li>
<li><strong>RL Post-training</strong>: <em>Stereo-200k</em>. A self-collected dataset drawn from the first 2M compounds in PubChem, filtered for chirality (&lsquo;@&rsquo;) and cis-trans isomerism (&lsquo;/&rsquo;, &lsquo;\&rsquo;). It uses 5 different RDKit drawing styles to ensure robustness.</li>
</ol>
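<p>The stereochemistry filter amounts to scanning SMILES strings for the markers named above. A minimal sketch (the example molecules are illustrative; a production filter would also parse the strings, e.g., with RDKit, rather than matching raw characters):</p>

```python
def has_stereo_markers(smiles: str) -> bool:
    """Flag SMILES carrying chirality ('@') or cis/trans ('/', '\\') markers.

    Simplified sketch: raw character matching only, with no SMILES parsing.
    """
    return any(ch in smiles for ch in ("@", "/", "\\"))

pool = [
    "CCO",              # ethanol: no stereochemistry
    "C[C@H](N)C(=O)O",  # L-alanine: chiral center
    "C/C=C/C",          # trans-2-butene: cis/trans markers
]
stereo_subset = [s for s in pool if has_stereo_markers(s)]
assert stereo_subset == ["C[C@H](N)C(=O)O", "C/C=C/C"]
```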
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Reinforcement Learning</strong>: Uses <strong>GRPO (Group Relative Policy Optimization)</strong>.
<ul>
<li><strong>Reward Function</strong>: A linear combination of Tanimoto similarity and a graded stereochemistry reward.
$$ R = w_t \cdot r_{\text{tanimoto}} + w_s \cdot r_{\text{stereo}} $$
where $w_t=0.4$ and $w_s=0.6$. The stereochemistry reward $r_{\text{stereo}}$ is 1.0 for an InChIKey exact match, 0.3 if the atom count matches, and 0.1 otherwise.</li>
<li><strong>Sampling</strong>: Samples 4 completions per image with temperature 1.0 during RL training.</li>
</ul>
</li>
<li><strong>Auxiliary Tasks</strong>:
<ul>
<li><strong>Bond Classification</strong>: Concatenates hidden states of two atom queries to predict bond type via MLP.</li>
<li><strong>Atom Localization</strong>: Treated as a classification task (SimCC) but optimized using <strong>Maximum Likelihood Estimation (MLE)</strong> to account for uncertainty.</li>
</ul>
</li>
</ul>
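<p>A minimal sketch of the reward, assuming fingerprints are represented as bit-index sets and InChIKeys as plain strings (both simplifications; in practice these come from standard cheminformatics tooling such as RDKit):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def stereo_reward(pred_key: str, ref_key: str,
                  pred_atoms: int, ref_atoms: int) -> float:
    # Graded reward: InChIKey exact match > atom-count match > everything else.
    if pred_key == ref_key:
        return 1.0
    if pred_atoms == ref_atoms:
        return 0.3
    return 0.1

def grpo_reward(fp_pred, fp_ref, pred_key, ref_key, pred_atoms, ref_atoms,
                w_t=0.4, w_s=0.6) -> float:
    """R = w_t * r_tanimoto + w_s * r_stereo, with the paper's weights."""
    return (w_t * tanimoto(fp_pred, fp_ref)
            + w_s * stereo_reward(pred_key, ref_key, pred_atoms, ref_atoms))

# A fully correct prediction earns the maximum reward of 1.0.
assert grpo_reward({1, 2, 3}, {1, 2, 3}, "KEY", "KEY", 9, 9) == 1.0
# Same atom count but wrong stereochemistry: partial stereo credit (0.3).
r = grpo_reward({1, 2, 3}, {1, 2, 4}, "PRED", "REF", 9, 9)
assert abs(r - (0.4 * 0.5 + 0.6 * 0.3)) < 1e-9
```

<p>During GRPO training, this scalar is computed for each of the 4 sampled completions per image and normalized within the group to form the advantage.</p>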
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer. Input images are preprocessed to $512 \times 512$ resolution.
<ul>
<li><strong>Encoder</strong>: <strong>EfficientViT-L1</strong> (~53M params), chosen for linear attention efficiency.</li>
<li><strong>Decoder</strong>: 6-layer Transformer with <strong>RoPE</strong>, <strong>SwiGLU</strong>, and <strong>RMSNorm</strong>. Randomly initialized (no LLM weights) due to vocabulary mismatch.</li>
<li><strong>Coordinate Head</strong>: Separated from the main decoder. It adds 2 extra Transformer layers to process atom queries before prediction to improve accuracy.</li>
</ul>
</li>
<li><strong>Parameter Tuning</strong>:
<ul>
<li>Stage 3 (RL) uses <strong>LoRA</strong> (Rank=8, Alpha=16) to optimize the decoder.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Exact Match</strong>: Exact recognition accuracy for the full molecular structure.</li>
<li><strong>Tanimoto Coefficient</strong>: Fingerprint similarity for chemical semantics.</li>
<li><strong>OKS (Object Keypoint Similarity)</strong>: Used specifically for evaluating atom localization accuracy.</li>
</ul>
</li>
<li><strong>Perturbation</strong>: Robustness tested with random rotations [-5°, 5°] and xy-shearing [-0.1, 0.1].</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training and inference performed on a single node.</li>
<li><strong>Processors</strong>: Intel Xeon Silver 4210R CPU.</li>
<li><strong>Accelerators</strong>: 4x <strong>NVIDIA GeForce RTX 3090/4090</strong> GPUs.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Stage 1: Batch size 512, LR $4 \times 10^{-4}$.</li>
<li>Stage 2: Batch size 256, Bond head LR $4 \times 10^{-4}$, Coord head LR $4 \times 10^{-5}$.</li>
<li>Stage 3 (RL): Batch size 64, Base LR $1 \times 10^{-4}$.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/hustvl/MolSight">MolSight (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch implementation with training and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, W., Wang, X., Feng, B., &amp; Liu, W. (2025). MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning. In <em>Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026)</em>. <a href="https://doi.org/10.48550/arXiv.2511.17300">https://doi.org/10.48550/arXiv.2511.17300</a></p>
<p><strong>Publication</strong>: AAAI 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/hustvl/MolSight">Official Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2025molsight,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wenrui Zhang and Xinggang Wang and Bin Feng and Wenyu Liu}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2511.17300}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2511.17300}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolScribe: Robust Image-to-Graph Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/</guid><description>Image-to-graph generation model for OCSR that predicts atoms, bonds, and coordinates jointly to better handle stereochemistry and abbreviations.</description><content:encoded><![CDATA[<h2 id="contribution-generative-image-to-graph-modelling">Contribution: Generative Image-to-Graph Modelling</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) with a secondary contribution to Resources ($\Psi_{\text{Resource}}$).</p>
<p>It proposes a novel architecture (image-to-graph generation) to solve the Optical Chemical Structure Recognition (OCSR) task, validating it through extensive ablation studies and comparisons against strong baselines like MolVec and DECIMER. It also contributes a new benchmark dataset of annotated images from ACS journals.</p>
<h2 id="motivation-limitations-in-existing-ocsr-pipelines">Motivation: Limitations in Existing OCSR Pipelines</h2>
<p>Translating molecular images into machine-readable graphs (OCSR) is challenging due to the high variance in drawing styles, stereochemistry conventions, and abbreviated structures found in literature.</p>
<p>Existing solutions face structural bottlenecks:</p>
<ul>
<li><strong>Rule-based systems</strong> (e.g., OSRA) rely on rigid heuristics that fail on diverse styles.</li>
<li><strong>Image-to-SMILES neural models</strong> treat the problem as captioning. They struggle with the geometric reasoning that chirality strictly requires, and because they omit explicit atom locations they cannot easily incorporate chemical constraints or verify correctness.</li>
</ul>
<h2 id="core-innovation-joint-graph-and-coordinate-prediction">Core Innovation: Joint Graph and Coordinate Prediction</h2>
<p>MolScribe introduces an <strong>Image-to-Graph</strong> generation paradigm that combines the flexibility of neural networks with the precision of symbolic constraints. It frames the task probabilistically as:</p>
<p>$$
P(G | I) = P(A | I) P(B | A, I)
$$</p>
<p>Where the model predicts a sequence of atoms $A$ given an image $I$, followed by the bonds $B$ given both the atoms and the image.</p>
<ol>
<li><strong>Explicit Graph Prediction</strong>: It predicts a sequence of atoms (with 2D coordinates) and then predicts bonds between them.</li>
<li><strong>Symbolic Constraints</strong>: It uses the predicted graph structure and coordinates to strictly determine chirality and cis/trans isomerism.</li>
<li><strong>Abbreviation Expansion</strong>: It employs a greedy algorithm to parse and expand &ldquo;superatoms&rdquo; (e.g., &ldquo;CO2Et&rdquo;) into their full atomic structure.</li>
<li><strong>Dynamic Augmentation</strong>: It introduces a data augmentation strategy that randomly substitutes functional groups with abbreviations and adds R-groups during training to improve generalization.</li>
</ol>
<h2 id="methodology-autoregressive-atoms-and-pairwise-bonds">Methodology: Autoregressive Atoms and Pairwise Bonds</h2>
<p>The authors evaluate MolScribe on synthetic and real-world datasets, focusing on <strong>Exact Match Accuracy</strong> of the canonical SMILES string. The model generates atom sequences autoregressively:</p>
<p>$$
P(A | I) = \prod_{i=1}^n P(a_i | A_{&lt;i}, I)
$$</p>
<p>To handle continuous spatial locations, atom coordinates are mapped to discrete bins (e.g., $\hat{x}_i = \lfloor \frac{x_i}{W} \times n_{\text{bins}} \rfloor$) and decoded alongside the element labels. Bonds are then predicted by a pairwise classifier over the hidden states of every atom pair:</p>
<p>$$
P(B | A, I) = \prod_{i=1}^n \prod_{j=1}^n P(b_{i,j} | A, I)
$$</p>
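<p>The coordinate discretization is a one-line quantization; a minimal sketch (the clamp to the last bin for boundary pixels is an assumption, not stated in the paper):</p>

```python
def bin_coordinate(x: float, width: float, n_bins: int = 64) -> int:
    """Map a continuous pixel coordinate to a discrete bin token index,
    following x_hat = floor(x / W * n_bins), clamped to the last bin."""
    return min(n_bins - 1, int(x / width * n_bins))

# With MolScribe's 384-px input and 64 bins, each bin covers 6 pixels.
assert bin_coordinate(0.0, 384) == 0
assert bin_coordinate(191.0, 384) == 31
assert bin_coordinate(383.9, 384) == 63
```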
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (MolVec, OSRA) and neural (Img2Mol, DECIMER, SwinOCSR) systems.</li>
<li><strong>Benchmarks</strong>:
<ul>
<li><strong>Synthetic</strong>: Indigo (in-domain) and ChemDraw (out-of-domain).</li>
<li><strong>Realistic</strong>: Five public benchmarks (CLEF, JPO, UOB, USPTO, Staker).</li>
<li><strong>New Dataset</strong>: 331 images from ACS Publications (journal articles).</li>
</ul>
</li>
<li><strong>Ablations</strong>: Tested performance without data augmentation, with continuous vs. discrete coordinates, and without non-atom tokens.</li>
<li><strong>Human Eval</strong>: Measured the time reduction for chemists using MolScribe to digitize molecules vs. drawing from scratch.</li>
</ul>
<h2 id="results-robust-exact-match-accuracy">Results: Robust Exact Match Accuracy</h2>
<ul>
<li><strong>Strong Performance</strong>: MolScribe achieved <strong>76-93% accuracy</strong> across public benchmarks, outperforming baselines on most datasets. On the ACS dataset of journal article images, MolScribe achieved 71.9% compared to the next best 55.3% (OSRA). On the large Staker patent dataset, MolScribe achieved 86.9%, surpassing MSE-DUDL (77.0%) while using far less training data (1.68M vs. 68M examples).</li>
<li><strong>Chirality Verification</strong>: Explicit geometric reasoning allowed MolScribe to predict chiral molecules significantly better than image-to-SMILES baselines. When chirality is ignored, the performance gap narrows (e.g., on Indigo, baseline accuracy rises from 94.1% to 96.3%), isolating MolScribe&rsquo;s primary advantage to geometric reasoning for stereochemistry.</li>
<li><strong>Hand-Drawn Generalization</strong>: The model achieved <strong>11.2% exact match accuracy</strong> on the DECIMER-HDM dataset despite having no hand-drawn images in its training set, with many errors limited to a few atom-level mismatches.</li>
<li><strong>Robustness</strong>: The model maintained high performance on perturbed images (rotation/shear), whereas rule-based systems degraded severely.</li>
<li><strong>Usability</strong>: The atom-level alignment allows for confidence visualization, and human evaluation showed it reduced digitization time from <strong>137s to 20s</strong> per molecule.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a mix of synthetic and patent data with extensive dynamic augmentation:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem (Synthetic)</strong></td>
          <td>1M</td>
          <td>Molecules randomly sampled from PubChem and rendered via Indigo toolkit; includes atom coords.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO (Patents)</strong></td>
          <td>680K</td>
          <td>Patent data lacks exact atom coordinates; relative coordinates normalized from MOLfiles to image dimensions (often introduces coordinate shifts).</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecule Augmentation</strong>:</p>
<ul>
<li><strong>Functional Groups</strong>: Randomly substituted using 53 common substitution rules (e.g., replacing substructures with &ldquo;Et&rdquo; or &ldquo;Ph&rdquo;).</li>
<li><strong>R-Groups</strong>: Randomly added using vocabulary: <code>[R, R1...R12, Ra, Rb, Rc, Rd, X, Y, Z, A, Ar]</code>.</li>
<li><strong>Styles</strong>: Random variation of aromaticity (circle vs. bonds) and explicit hydrogens.</li>
</ul>
<p><strong>Image Augmentation</strong>:</p>
<ul>
<li><strong>Rendering</strong>: Randomized font (Arial, Times, Courier, Helvetica), line width, and label modes during synthetic generation.</li>
<li><strong>Perturbations</strong>: Applied rotation ($\pm 90^{\circ}$), cropping ($1\%$), padding ($40\%$), downscaling, blurring, and Salt-and-Pepper/Gaussian noise.</li>
</ul>
<p><strong>Preprocessing</strong>: Input images are resized to $384 \times 384$.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Atom Prediction (Pix2Seq-style)</strong>:
<ul>
<li>The model generates a sequence of tokens: $S^A = [l_1, \hat{x}_1, \hat{y}_1, \dots, l_n, \hat{x}_n, \hat{y}_n]$.</li>
<li><strong>Discretization</strong>: Coordinates are binned into integer tokens ($n_{bins} = 64$).</li>
<li><strong>Tokenizer</strong>: Atom-wise tokenizer splits SMILES into atoms; non-atom tokens (parentheses, digits) are kept to help structure learning.</li>
</ul>
</li>
<li><strong>Bond Prediction</strong>:
<ul>
<li>Format: Pairwise classification for every pair of predicted atoms.</li>
<li>Symmetry: For symmetric bonds (single/double), the probability is averaged as:
$$
\hat{P}(b_{i,j} = t) = \frac{1}{2} \big( P(b_{i,j} = t) + P(b_{j,i} = t) \big)
$$
For wedge bonds, which are directional, the directed prediction is used instead of averaging.</li>
</ul>
</li>
<li><strong>Abbreviation Expansion (Algorithm 1)</strong>:
<ul>
<li>A greedy algorithm connects atoms within an expanded abbreviation (e.g., &ldquo;COOH&rdquo;) until valences are full, avoiding the need for a fixed dictionary.</li>
<li><strong>Carbon Chains</strong>: Splits condensed chains like $C_aX_b$ into explicit sequences ($CX_q \dots CX_{q+r}$).</li>
<li><strong>Nested Formulas</strong>: Recursively parses nested structures like $N(CH_3)_2$ by treating them as superatoms attached to the current backbone.</li>
<li><strong>Valence Handling</strong>: Iterates through common valences first to resolve ambiguities.</li>
</ul>
</li>
</ul>
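<p>The symmetric averaging rule can be sketched as follows; the bond-type names and dictionary representation are illustrative simplifications, not MolScribe&rsquo;s actual tensor layout:</p>

```python
def symmetrize_bond_probs(p_ij: dict, p_ji: dict,
                          symmetric_types=("none", "single", "double",
                                           "triple", "aromatic")):
    """Average the two directed distributions for order-independent bond
    types; directional types (wedge/dash) keep their directed probability.
    Returns the highest-probability bond type after merging."""
    merged = {}
    for t in set(p_ij) | set(p_ji):
        if t in symmetric_types:
            merged[t] = 0.5 * (p_ij.get(t, 0.0) + p_ji.get(t, 0.0))
        else:
            merged[t] = p_ij.get(t, 0.0)
    return max(merged, key=merged.get)

# The two directed predictions disagree on confidence but agree on "single".
p_ij = {"single": 0.6, "double": 0.3, "wedge": 0.1}
p_ji = {"single": 0.8, "double": 0.1, "wedge": 0.1}
assert symmetrize_bond_probs(p_ij, p_ji) == "single"
```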
<h3 id="models">Models</h3>
<p>The architecture is an encoder-decoder with a classification head:</p>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer (Swin-B)</strong>, pre-trained on ImageNet-22K (88M params).</li>
<li><strong>Decoder</strong>: 6-layer Transformer, 8 heads, hidden dimension 256.</li>
<li><strong>Bond Predictor</strong>: 2-layer MLP (Feedforward) with ReLU, taking concatenated atom hidden states as input.</li>
<li><strong>Training</strong>: Teacher forcing, Cross-Entropy Loss, Batch size 128, 30 epochs.</li>
</ul>
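<p>The bond predictor above amounts to a plain feedforward pass over an atom pair. The sketch below uses NumPy with illustrative dimensions and random weights; the paper specifies only a 2-layer ReLU MLP over concatenated atom hidden states, so everything else here is an assumption.</p>

```python
import numpy as np

def bond_predictor(h_i, h_j, W1, b1, W2, b2):
    """2-layer MLP with ReLU over the concatenated hidden states of an atom pair."""
    x = np.concatenate([h_i, h_j])          # (2 * d_model,)
    hidden = np.maximum(0.0, W1 @ x + b1)   # ReLU
    logits = W2 @ hidden + b2               # one logit per bond class
    return logits

rng = np.random.default_rng(0)
d_model, d_hidden, n_bond_classes = 256, 256, 5  # illustrative sizes
W1 = rng.standard_normal((d_hidden, 2 * d_model)) * 0.01
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((n_bond_classes, d_hidden)) * 0.01
b2 = np.zeros(n_bond_classes)

logits = bond_predictor(rng.standard_normal(d_model),
                        rng.standard_normal(d_model), W1, b1, W2, b2)
```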
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Exact Match of Canonical SMILES.</p>
<ul>
<li>Stereochemistry: Must match tetrahedral chirality; cis-trans ignored.</li>
<li>R-groups: Replaced with wildcards <code>*</code> or <code>[d*]</code> for evaluation.</li>
</ul>
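<p>The exact-match metric reduces to string equality over canonical SMILES. In practice canonicalization is done with a toolkit such as RDKit; the sketch below assumes both lists already hold canonicalized strings, and the function name is illustrative.</p>

```python
def exact_match_accuracy(pred_smiles, gt_smiles):
    """Fraction of predictions whose canonical SMILES exactly equals the ground truth.
    Assumes both lists already hold canonicalized strings (e.g., via RDKit)."""
    assert len(pred_smiles) == len(gt_smiles)
    hits = sum(p == g for p, g in zip(pred_smiles, gt_smiles))
    return hits / len(gt_smiles)
```

<p>Note that two chemically identical molecules written in different (non-canonical) forms would not match, which is exactly why canonicalization must happen first.</p>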
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training performed on a Linux server with <strong>96 CPUs</strong> and <strong>500 GB RAM</strong>.</li>
<li><strong>GPUs</strong>: <strong>4x NVIDIA A100 GPUs</strong>.</li>
<li><strong>Training Time</strong>: Unspecified; comparative models on large datasets took &ldquo;more than one day&rdquo;.</li>
<li><strong>Inference</strong>: Requires autoregressive decoding for atoms, followed by a single forward pass for bonds.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/thomas0809/MolScribe">MolScribe (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training, inference, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/yujieq/MolScribe">MolScribe (Hugging Face)</a></td>
          <td>Demo</td>
          <td>MIT</td>
          <td>Interactive web demo for molecular image recognition</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Scoped to single-molecule images only; does not handle multi-molecule diagrams or reaction schemes.</li>
<li>Hand-drawn molecule recognition remains weak (the model was not trained on hand-drawn data).</li>
<li>Complex Markush structures (positional variation, frequency variation) are not supported, as these cannot be represented in SMILES or MOLfiles.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, Y., Guo, J., Tu, Z., Li, Z., Coley, C. W., &amp; Barzilay, R. (2023). MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation. <em>Journal of Chemical Information and Modeling</em>, 63(7), 1925-1934. <a href="https://doi.org/10.1021/acs.jcim.2c01480">https://doi.org/10.1021/acs.jcim.2c01480</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/spaces/yujieq/MolScribe">Hugging Face Space</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qianMolScribeRobustMolecular2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolScribe}}: {{Robust Molecular Structure Recognition}} with {{Image-To-Graph Generation}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolScribe}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Qian, Yujie and Guo, Jiang and Tu, Zhengkai and Li, Zhening and Coley, Connor W. and Barzilay, Regina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1925--1934}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.2c01480}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolMole: Unified Vision Pipeline for Molecule Mining</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/</guid><description>A vision-based deep learning framework that unifies molecule detection, reaction parsing, and OCSR for page-level chemical data extraction.</description><content:encoded><![CDATA[<h2 id="molmoles-dual-contribution-unified-ocsr-method-and-page-level-benchmarks">MolMole&rsquo;s Dual Contribution: Unified OCSR Method and Page-Level Benchmarks</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong <strong>Resource</strong> contribution.</p>
<p>It functions as a <strong>Method</strong> paper because it introduces &ldquo;MolMole,&rdquo; a unified deep learning framework that integrates molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline. It validates this method through extensive comparisons against state-of-the-art baselines like DECIMER and OpenChemIE.</p>
<p>It also serves as a <strong>Resource</strong> paper because the authors construct and release a novel page-level benchmark dataset of 550 annotated pages (patents and articles) to address the lack of standardized evaluation metrics for full-page chemical extraction.</p>
<h2 id="addressing-the-limitations-of-fragmented-processing">Addressing the Limitations of Fragmented Processing</h2>
<p>The rapid accumulation of chemical literature has trapped valuable molecular and reaction data in unstructured formats like images and PDFs. Extracting this manually is time-consuming, while existing AI frameworks have significant limitations:</p>
<ul>
<li><strong>DECIMER</strong>: Lacks the ability to process reaction diagrams entirely.</li>
<li><strong>OpenChemIE</strong>: Relies on external layout parser models to crop elements before processing. This dependence often leads to detection failures in documents with complex layouts.</li>
<li><strong>Generative Hallucination</strong>: Existing generative OCSR models (like MolScribe) are prone to &ldquo;hallucinating&rdquo; structures or failing on complex notations like polymers.</li>
</ul>
<h2 id="a-unified-vision-pipeline-for-layout-aware-detection">A Unified Vision Pipeline for Layout-Aware Detection</h2>
<p>MolMole introduces several architectural and workflow innovations:</p>
<ul>
<li><strong>Direct Page-Level Processing</strong>: Unlike OpenChemIE, MolMole processes full document pages directly without requiring an external layout parser, which improves robustness on complex layouts like two-column patents.</li>
<li><strong>Unified Vision Pipeline</strong>: It integrates three specialized vision models into one workflow:
<ul>
<li><strong>ViDetect</strong>: A DINO-based object detector for identifying molecular regions.</li>
<li><strong>ViReact</strong>: An RxnScribe-based model adapted for full-page reaction parsing.</li>
<li><strong>ViMore</strong>: A detection-based OCSR model that explicitly predicts atoms and bonds.</li>
</ul>
</li>
<li><strong>Hallucination Mitigation</strong>: By using a detection-based approach (ViMore), the model avoids hallucinating chemical structures and provides confidence scores.</li>
<li><strong>Advanced Notation Support</strong>: The system explicitly handles &ldquo;wavy bonds&rdquo; (variable attachments in patents) and polymer bracket notations, which confuse standard SMILES-based models.</li>
</ul>
<h2 id="page-level-benchmark-evaluation-and-unified-metrics">Page-Level Benchmark Evaluation and Unified Metrics</h2>
<p>The authors evaluated the framework on both a newly curated benchmark and existing public datasets:</p>
<ul>
<li><strong>New Benchmark Creation</strong>: They curated 550 pages (300 patents, 250 articles) fully annotated with bounding boxes, reaction roles (reactant, product, condition), and MOLfiles.</li>
<li><strong>Baselines</strong>: MolMole was compared against <strong>DECIMER 2.0</strong>, <strong>OpenChemIE</strong>, and <strong>ReactionDataExtractor 2.0</strong>.</li>
<li><strong>OCSR Benchmarking</strong>: ViMore was evaluated against DECIMER, MolScribe, and MolGrapher on four public datasets: <strong>USPTO</strong>, <strong>UOB</strong>, <strong>CLEF</strong>, and <strong>JPO</strong>.</li>
<li><strong>Metric Proposal</strong>: They introduced a combined &ldquo;End-to-End&rdquo; metric that modifies standard object detection Precision/Recall to strictly require correct SMILES conversion for a &ldquo;True Positive&rdquo;.</li>
</ul>
<p>$$ \text{True Positive (End-to-End)} = ( \text{IoU} \geq 0.5 ) \land ( \text{SMILES}_{\text{gt}} == \text{SMILES}_{\text{pred}} ) $$</p>
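<p>A minimal sketch of this combined criterion, assuming axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ form and pre-canonicalized SMILES strings (both function names are illustrative):</p>

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def end_to_end_true_positive(pred_box, gt_box, pred_smiles, gt_smiles, iou_thresh=0.5):
    """True positive only if the box overlaps (IoU >= threshold) AND the SMILES match."""
    return iou(pred_box, gt_box) >= iou_thresh and pred_smiles == gt_smiles
```

<p>The conjunction is the point: a perfectly localized molecule with a wrong SMILES counts as a failure, as does a correct SMILES attached to a poorly localized box.</p>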
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Page-Level Performance</strong>: On the new benchmark, MolMole achieved F1 scores of <strong>89.1%</strong> (Patents) and <strong>86.8%</strong> (Articles) for the combined detection-to-conversion task, compared to 73.8% and 67.3% for DECIMER and 68.8% and 70.6% for OpenChemIE (Table 4).</li>
<li><strong>Reaction Parsing</strong>: ViReact achieved soft-match F1 scores of <strong>98.0%</strong> on patents and <strong>97.0%</strong> on articles, compared to 82.2% and 82.9% for the next best model, RxnScribe (w/o LP). Hard-match F1 scores were 92.5% (patents) and 84.6% (articles).</li>
<li><strong>Public Benchmarks</strong>: ViMore outperformed competitors on 3 out of 4 public OCSR datasets (CLEF, JPO, USPTO).</li>
<li><strong>Layout Handling</strong>: The authors demonstrated that MolMole successfully handles multi-column reaction diagrams where cropping-based models fail and faithfully preserves layout geometry in generated MOLfiles.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://lgai-ddu.github.io/molmole/">MolMole Project Page</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo and project information</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data</strong>: The models (ViDetect and ViMore) were trained on <strong>private/proprietary datasets</strong>, which is a limitation for full reproducibility from scratch.</li>
<li><strong>Benchmark Data</strong>: The authors introduce a test set of <strong>550 pages</strong> (3,897 molecules, 1,022 reactions) derived from patents and scientific articles. This dataset is stated to be made &ldquo;publicly available&rdquo;.</li>
<li><strong>Public Evaluation Data</strong>: Standard OCSR datasets used include USPTO (5,719 images), UOB (5,740 images), CLEF (992 images), and JPO (450 images).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pipeline Workflow</strong>: PDF → PNG Images → Parallel execution of <strong>ViDetect</strong> and <strong>ViReact</strong> → Cropping of molecular regions → <strong>ViMore</strong> conversion → Output (JSON/Excel).</li>
<li><strong>Post-Processing</strong>:
<ul>
<li><em>ViDetect</em>: Removes overlapping proposals based on confidence scores and size constraints.</li>
<li><em>ViReact</em>: Refines predictions by correcting duplicates and removing empty entities.</li>
<li><em>ViMore</em>: Assembles detected atom/bond information into structured representations (MOLfile).</li>
</ul>
</li>
</ul>
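<p>The pipeline workflow above can be sketched as a simple orchestration function. The three stage callables are hypothetical stand-ins for the actual ViDetect, ViReact, and ViMore models (which are not publicly released), and <code>crop</code> is a toy helper over a nested-list image.</p>

```python
def run_molmole_page(page_image, vi_detect, vi_react, vi_more):
    """Sketch of the MolMole page-level workflow: detect molecules and parse
    reactions on the full page, then run OCSR on each cropped molecular region."""
    mol_boxes = vi_detect(page_image)              # molecule bounding boxes
    reactions = vi_react(page_image)               # full-page reaction parsing
    crops = [crop(page_image, box) for box in mol_boxes]
    molecules = [vi_more(c) for c in crops]        # OCSR per cropped region
    return {"molecules": molecules, "reactions": reactions}

def crop(image, box):
    """Toy crop: slice a nested-list 'image' by (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]
```

<p>The key design point mirrored here is that detection and reaction parsing both see the full page, so no external layout parser is needed before cropping.</p>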
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture Basis</th>
          <th>Task</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ViDetect</strong></td>
          <td>DINO (DETR-based)</td>
          <td>Molecule Detection</td>
          <td>End-to-end training; avoids slow autoregressive methods.</td>
      </tr>
      <tr>
          <td><strong>ViReact</strong></td>
          <td>RxnScribe</td>
          <td>Reaction Parsing</td>
          <td>Operates on full pages; autoregressive decoder for structured sequence generation.</td>
      </tr>
      <tr>
          <td><strong>ViMore</strong></td>
          <td>Custom Vision Model</td>
          <td>OCSR</td>
          <td>Detection-based (predicts atom/bond regions).</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Molecule Detection</strong>: Evaluated using COCO metrics (AP, AR, F1) at IoU thresholds 0.50-0.95.</li>
<li><strong>Molecule Conversion</strong>: Evaluated using SMILES exact match accuracy and Tanimoto similarity.</li>
<li><strong>Combined Metric</strong>: A custom metric where a True Positive requires both IoU $\geq 0.5$ and an exact SMILES match ($\text{SMILES}_{\text{gt}} = \text{SMILES}_{\text{pred}}$).</li>
<li><strong>Reaction Parsing</strong>: Evaluated using <strong>Hard Match</strong> (all components correct) and <strong>Soft Match</strong> (molecular entities only, ignoring text labels).</li>
</ul>
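<p>Tanimoto similarity on bit fingerprints is just intersection over union of the on bits. The actual evaluation uses Morgan fingerprints computed with a toolkit like RDKit; here, purely for illustration, a fingerprint is represented as a set of on-bit indices.</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```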
<h3 id="missing-components">Missing Components</h3>
<ul>
<li><strong>Source code</strong>: Not publicly released. The paper states the toolkit &ldquo;will be accessible soon through an interactive demo on the LG AI Research website.&rdquo; For commercial use, the authors direct inquiries to <a href="mailto:ddu@lgresearch.ai">ddu@lgresearch.ai</a>.</li>
<li><strong>Training data</strong>: ViDetect and ViMore are trained on proprietary datasets. Training code and data are not available.</li>
<li><strong>Hardware requirements</strong>: Not specified in the paper.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chun, S., Kim, J., Jo, A., Jo, Y., Oh, S., et al. (2025). MolMole: Molecule Mining from Scientific Literature. <em>arXiv preprint arXiv:2505.03777</em>. <a href="https://doi.org/10.48550/arXiv.2505.03777">https://doi.org/10.48550/arXiv.2505.03777</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://lgai-ddu.github.io/molmole/">Project Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chun2025molmole,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolMole: Molecule Mining from Scientific Literature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chun, Sehyun and Kim, Jiye and Jo, Ahra and Jo, Yeonsik and Oh, Seungyul and Lee, Seungjun and Ryoo, Kwangrok and Lee, Jongmin and Kim, Seung Hwan and Kang, Byung Jun and Lee, Soonyoung and Park, Jun Ha and Moon, Chanwoo and Ham, Jiwon and Lee, Haein and Han, Heejae and Byun, Jaeseung and Do, Soojong and Ha, Minju and Kim, Dongyun and Bae, Kyunghoon and Lim, Woohyung and Lee, Edward Hwayoung and Park, Yongmin and Yu, Jeongsang and Jo, Gerrard Jeongwon and Hong, Yeonjung and Yoo, Kyungjae and Han, Sehui and Lee, Jaewan and Park, Changyoung and Jeon, Kijeong and Yi, Sihyuk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2505.03777}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2505.03777}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGrapher: Graph-based Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/</guid><description>A graph-based deep learning approach for optical chemical structure recognition that outperforms image captioning methods.</description><content:encoded><![CDATA[<h2 id="1-contribution--type">1. Contribution / Type</h2>
<p>This is primarily a <strong>Methodological</strong> paper that proposes a novel neural architecture (MolGrapher), shifting the paradigm of Optical Chemical Structure Recognition (OCSR) from image captioning back to graph reconstruction. It also has a significant <strong>Resource</strong> component, releasing a synthetic data generation pipeline and a new large-scale benchmark (USPTO-30K) to address the scarcity of annotated real-world data.</p>
<h2 id="2-motivation">2. Motivation</h2>
<p>The automatic analysis of chemical literature is critical for accelerating drug and material discovery, but much of this information is locked in 2D images of molecular structures.</p>
<ul>
<li><strong>Problem:</strong> Existing rule-based methods are rigid, while recent deep learning methods based on &ldquo;image captioning&rdquo; (predicting <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings) struggle with complex molecules and fail to exploit the natural graph structure of molecules.</li>
<li><strong>Gap:</strong> There is a lack of diverse, annotated real-world training data, and captioning models suffer from &ldquo;hallucinations&rdquo; where they predict valid SMILES that do not match the image.</li>
</ul>
<h2 id="3-novelty--core-innovation">3. Novelty / Core Innovation</h2>
<p>MolGrapher introduces a <strong>graph-based deep learning pipeline</strong> that explicitly models the molecule&rsquo;s geometry and topology.</p>
<ul>
<li><strong>Supergraph Concept:</strong> It first detects all atom keypoints and builds a &ldquo;supergraph&rdquo; of all plausible bonds.</li>
<li><strong>Hybrid Approach:</strong> It combines a ResNet-based keypoint detector with a Graph Neural Network (GNN) that classifies both atom nodes and bond nodes within the supergraph context. Both atoms and bonds are represented as nodes, with edges only connecting atom nodes to bond nodes.</li>
<li><strong>Synthetic Pipeline:</strong> A data generation pipeline that renders molecules with varying styles (fonts, bond widths) and augmentations (pepper patches, random lines, captions) to simulate real document noise.</li>
</ul>
<p>At the core of the Keypoint Detector&rsquo;s performance is the <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss. Since pixels without an atom drastically outnumber pixels containing an atom, WAHR loss is designed to counter the class imbalance. For ground-truth heatmap $y$ and prediction $p$:</p>
<p>$$ L_{\text{WAHR}}(p, y) = \sum_i \alpha_{y_i} (p_i - y_i)^2 $$</p>
<p>where the per-pixel weight $\alpha_{y_i}$ dynamically down-weights easily classified background pixels.</p>
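<p>A simplified stand-in for this idea, assuming heatmap values in $[0, 1]$: the weight grows where the target (or prediction) is large, so the abundant easy-background pixels contribute little. The exact weighting follows the WAHR formulation; this weighting scheme and the small floor term are illustrative choices, not the paper's.</p>

```python
import numpy as np

def wahr_like_loss(pred, target, gamma=2.0):
    """Illustrative weight-adaptive heatmap regression loss (simplified WAHR stand-in).
    Pixels near atom centers (or confidently mispredicted ones) are up-weighted;
    the small floor keeps background gradients from vanishing entirely."""
    alpha = np.maximum(pred, target) ** gamma + 1e-3
    return float(np.sum(alpha * (pred - target) ** 2))
```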
<h2 id="4-methodology--experiments">4. Methodology &amp; Experiments</h2>
<p>The authors evaluated MolGrapher against both rule-based (OSRA, MolVec) and deep learning baselines (DECIMER, Img2Mol, Image2Graph).</p>
<ul>
<li><strong>Benchmarks:</strong> Evaluated on standard datasets: USPTO, Maybridge UoB, CLEF-2012, and JPO.</li>
<li><strong>New Benchmark:</strong> Introduced and tested on <strong>USPTO-30K</strong>, split into clean, abbreviated, and large molecule subsets.</li>
<li><strong>Ablations:</strong> Analyzed the impact of synthetic augmentations, keypoint loss functions, supergraph connectivity radius, and GNN layers.</li>
<li><strong>Robustness:</strong> Tested on perturbed images (rotations, shearing) to mimic scanned patent quality.</li>
</ul>
<p>The GNN iteratively updates node embeddings through layers $\{g^k\}_{k \in [1, N]}$, where $e^{k+1} = g^k(e^k)$. Final predictions are obtained via two MLPs (one for atoms, one for bonds): $p_i = \mathrm{MLP}_t(e_i^N)$, where $p_i \in \mathbb{R}^{C_t}$ contains the logits for atom or bond classes.</p>
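<p>One such layer can be sketched as a single message-passing step over the atom&ndash;bond supergraph. The mean aggregation, ReLU, and weight shapes below are illustrative assumptions; the paper's custom GNN is only described at the level of the update $e^{k+1} = g^k(e^k)$.</p>

```python
import numpy as np

def gnn_layer(embeddings, adjacency, W_self, W_neigh):
    """One illustrative message-passing step: each node mixes its own embedding
    with the mean of its neighbors' embeddings, followed by ReLU."""
    n = len(embeddings)
    out = np.zeros_like(embeddings)
    for i in range(n):
        neigh = [embeddings[j] for j in range(n) if adjacency[i][j]]
        agg = np.mean(neigh, axis=0) if neigh else np.zeros_like(embeddings[i])
        out[i] = np.maximum(0.0, W_self @ embeddings[i] + W_neigh @ agg)
    return out
```

<p>In the supergraph, edges only connect atom nodes to bond nodes, so each message step lets bonds see their endpoint atoms and vice versa.</p>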
<h2 id="5-results--conclusions">5. Results &amp; Conclusions</h2>
<p>MolGrapher achieved the highest accuracy among synthetic-only deep learning methods on most benchmarks tested.</p>
<ul>
<li><strong>Accuracy:</strong> It achieved <strong>91.5%</strong> accuracy on USPTO, outperforming all other synthetic-only deep learning methods including ChemGrapher (80.9%), Graph Generation (67.0%), and DECIMER 2.0 (61.0%).</li>
<li><strong>Large Molecules:</strong> It demonstrated superior scaling, correctly recognizing large molecules (USPTO-10K-L) where image captioning methods like Img2Mol failed completely (0.0% accuracy).</li>
<li><strong>Generalization:</strong> The method proved robust to image perturbations and style variations without requiring fine-tuning on real data. The paper acknowledges that MolGrapher cannot recognize Markush structures (depictions of sets of molecules with positional and frequency variation indicators).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model relies on synthetic data for training due to the scarcity of annotated real-world images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic Data</td>
          <td>300,000 images</td>
          <td>Generated from PubChem SMILES using RDKit. Augmentations include pepper patches, random lines, and variable bond styles.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>USPTO-30K</td>
          <td>30,000 images</td>
          <td>Created by authors from USPTO patents (2001-2020). Subsets: 10K clean, 10K abbreviated, 10K large (&gt;70 atoms).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Standard Benchmarks</td>
          <td>Various</td>
          <td>USPTO (5,719), Maybridge UoB (5,740), CLEF-2012 (992), JPO (450).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of three distinct algorithmic stages:</p>
<ol>
<li>
<p><strong>Keypoint Detection</strong>:</p>
<ul>
<li>Predicts a heatmap of atom locations using a CNN.</li>
<li>Thresholds heatmaps at the bottom 10th percentile and uses a $5\times5$ window for local maxima.</li>
<li>Uses <strong>Weight-Adaptive Heatmap Regression (WAHR)</strong> loss to handle class imbalance (background vs. atoms).</li>
</ul>
</li>
<li>
<p><strong>Supergraph Construction</strong>:</p>
<ul>
<li>Connects every detected keypoint to neighbors within a radius of $3 \times$ the estimated bond length.</li>
<li>Prunes edges with no filled pixels or if obstructed by a third keypoint.</li>
<li>Keeps a maximum of 6 bond candidates per atom.</li>
</ul>
</li>
<li>
<p><strong>Superatom Recognition</strong>:</p>
<ul>
<li>Detects &ldquo;superatom&rdquo; nodes (abbreviations like <code>COOH</code>).</li>
<li>Uses <strong>PP-OCR</strong> to transcribe the text at these node locations.</li>
</ul>
</li>
</ol>
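<p>The supergraph construction step (stage 2) can be sketched as a radius query with a per-atom candidate cap. The pixel-based pruning (empty or obstructed edges) is omitted for brevity, and the function and variable names are invented for illustration; only the $3\times$ bond-length radius and the cap of 6 candidates come from the paper.</p>

```python
import math

def build_supergraph(keypoints, bond_length, radius_factor=3.0, max_candidates=6):
    """Connect each detected atom keypoint to nearby keypoints, keeping at most
    `max_candidates` bond candidates per atom (nearest first)."""
    radius = radius_factor * bond_length
    edges = {}
    for i, p in enumerate(keypoints):
        cands = sorted(
            (j for j, q in enumerate(keypoints)
             if j != i and math.dist(p, q) <= radius),
            key=lambda j: math.dist(p, keypoints[j]),
        )
        edges[i] = cands[:max_candidates]
    return edges

kps = [(0, 0), (1, 0), (2, 0), (10, 10)]  # three chained atoms plus an outlier
graph = build_supergraph(kps, bond_length=1.0)
```

<p>The GNN then classifies every candidate edge, including into a &ldquo;No Bond&rdquo; class, so over-connecting here is harmless while under-connecting is not.</p>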
<h3 id="models">Models</h3>
<p>The architecture utilizes standard backbones tailored for specific sub-tasks:</p>
<ul>
<li><strong>Keypoint Detector</strong>: <strong>ResNet-18</strong> backbone with $8\times$ dilation to preserve spatial resolution.</li>
<li><strong>Node Classifier</strong>: <strong>ResNet-50</strong> backbone with $2\times$ dilation for extracting visual features at node locations.</li>
<li><strong>Graph Neural Network</strong>: A custom GNN that updates node embeddings based on visual features and neighborhood context. The initial node embedding combines the visual feature vector $v_i$ and a learnable type encoding $w_{t_i}$.</li>
<li><strong>Readout</strong>: MLPs classify nodes into atom types (e.g., C, O, N) and bond types (No Bond, Single, Double, Triple).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Accuracy is defined strictly: the predicted molecule must have an identical <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> string to the ground truth. Stereochemistry and Markush structures are excluded from evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>MolGrapher Score</th>
          <th>Best DL Baseline (Synthetic)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>USPTO</td>
          <td><strong>91.5%</strong></td>
          <td>80.9% (ChemGrapher)</td>
          <td>Full USPTO benchmark</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>USPTO-10K-L</td>
          <td><strong>31.4%</strong></td>
          <td>0.0% (Img2Mol)</td>
          <td>Large molecules (&gt;70 atoms)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>JPO</td>
          <td><strong>67.5%</strong></td>
          <td>64.0% (DECIMER 2.0)</td>
          <td>Challenging, low-quality images</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: Trained on 3 NVIDIA A100 GPUs.</li>
<li><strong>Training Time</strong>: 20 epochs.</li>
<li><strong>Optimization</strong>: ADAM optimizer, learning rate 0.0001, decayed by 0.8 after 5000 iterations.</li>
<li><strong>Loss Weighting</strong>: Atom classifier loss weighted by 1; bond classifier loss weighted by 3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MolGrapher">DS4SD/MolGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with training and inference scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Title</strong>: MolGrapher: Graph-based Visual Recognition of Chemical Structures</p>
<p><strong>Authors</strong>: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valéry Weber, Ingmar Meijer, Peter Staar, Fisher Yu</p>
<p><strong>Citation</strong>: Morin, L., Danelljan, M., Agea, M. I., Nassar, A., Weber, V., Meijer, I., Staar, P., &amp; Yu, F. (2023). MolGrapher: Graph-based Visual Recognition of Chemical Structures. <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, 19552-19561.</p>
<p><strong>Publication</strong>: ICCV 2023</p>
<p><strong>Links</strong>:</p>
<ul>
<li><a href="https://openaccess.thecvf.com/content/ICCV2023/html/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.html">Paper</a></li>
<li><a href="https://github.com/DS4SD/MolGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMolGrapherGraphbasedVisual2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MolGrapher}}: {{Graph-based Visual Recognition}} of {{Chemical Structures}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MolGrapher}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{IEEE}}/{{CVF International Conference}} on {{Computer Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Danelljan, Martin and Agea, Maria Isabel and Nassar, Ahmed and Weber, Valéry and Meijer, Ingmar and Staar, Peter and Yu, Fisher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{19552--19561}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICCV51070.2023.01791}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-18}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MMSSC-Net: Multi-Stage Sequence Cognitive Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</guid><description>A deep learning model for Optical Chemical Structure Recognition (OCSR) using SwinV2 and GPT-2 to convert molecular images to SMILES.</description><content:encoded><![CDATA[<h2 id="contribution-a-multi-stage-architectural-pipeline">Contribution: A Multi-Stage Architectural Pipeline</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.
The paper proposes a deep learning architecture (<strong>MMSSC-Net</strong>) for Optical Chemical Structure Recognition (OCSR). It focuses on architectural innovation, specifically combining a SwinV2 visual encoder with a GPT-2 decoder, and validates this method through extensive benchmarking against existing rule-based and deep-learning baselines. It includes ablation studies to justify the choice of the visual encoder.</p>
<h2 id="motivation-addressing-noise-and-rigid-image-recognition">Motivation: Addressing Noise and Rigid Image Recognition</h2>
<ul>
<li><strong>Data Usage Gap</strong>: Drug discovery relies heavily on scientific literature, but molecular structures are often locked in vector graphics or images that computers cannot easily process.</li>
<li><strong>Limitations of Prior Work</strong>: Existing rule-based methods are rigid and sensitive to noise. Previous deep learning approaches (encoder-decoder &ldquo;image captioning&rdquo; styles) often lack precision and interpretability, and struggle with varying image resolutions or large molecules.</li>
<li><strong>Need for &ldquo;Cognition&rdquo;</strong>: The authors argue that treating the image as a single isolated whole is insufficient; a model needs to &ldquo;perceive&rdquo; fine-grained details (atoms and bonds) to handle noise and varying pixel qualities effectively.</li>
</ul>
<h2 id="novelty-a-fine-grained-perception-pipeline">Novelty: A Fine-Grained Perception Pipeline</h2>
<ul>
<li><strong>Multi-Stage Cognitive Architecture</strong>: MMSSC-Net splits the task into stages:
<ol>
<li><strong>Fine-grained Perception</strong>: Detecting atom and bond sequences (including spatial coordinates) using SwinV2.</li>
<li><strong>Graph Construction</strong>: Assembling these into a molecular graph.</li>
<li><strong>Sequence Evolution</strong>: Converting the graph into a machine-readable format (SMILES).</li>
</ol>
</li>
<li><strong>Hybrid Transformer Model</strong>: It combines a hierarchical vision transformer (<strong>SwinV2</strong>) for encoding with a generative pre-trained transformer (<strong>GPT-2</strong>) and MLPs for decoding atomic and bond targets.</li>
<li><strong>Robustness Mechanisms</strong>: Random noise sequences are included during training to improve generalization to new molecular targets.</li>
</ul>
<h2 id="methodology-and-benchmarks">Methodology and Benchmarks</h2>
<ul>
<li><strong>Baselines</strong>: Compared against 8 other tools:
<ul>
<li><em>Rule-based</em>: MolVec, OSRA.</li>
<li><em>Image-Smiles (DL)</em>: ABC-Net, Img2Mol, MolMiner.</li>
<li><em>Image-Graph-Smiles (DL)</em>: Image-To-Graph, MolScribe, ChemGrapher.</li>
</ul>
</li>
<li><strong>Datasets</strong>: Evaluated on 5 diverse datasets: STAKER (synthetic), USPTO, CLEF, JPO, and UOB (real-world).</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Accuracy</strong>: Exact string match of the predicted SMILES.</li>
<li><strong>Tanimoto Similarity</strong>: Chemical similarity using Morgan fingerprints.</li>
</ul>
</li>
<li><strong>Ablation Study</strong>: Tested different visual encoders (Swin Transformer, ViT-B, ResNet-50) to validate the choice of SwinV2.</li>
<li><strong>Resolution Sensitivity</strong>: Tested model performance across image resolutions from 256px to 2048px.</li>
</ul>
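<p>Both metrics can be sketched in plain Python. Here Tanimoto similarity is computed over precomputed fingerprint bit sets; the paper uses RDKit Morgan fingerprints, whose on-bits can be collected into a Python <code>set</code> (that extraction step is assumed, not shown):</p>

```python
def exact_match_accuracy(predicted, reference):
    """Fraction of predicted SMILES strings that match the reference exactly."""
    assert len(predicted) == len(reference)
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

<p>With RDKit installed, a bit set for a molecule would plausibly be obtained via something like <code>set(AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048).GetOnBits())</code>; radius and bit width here are illustrative assumptions, since the paper does not report them.</p>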
<h2 id="results-and-core-outcomes">Results and Core Outcomes</h2>
<ul>
<li><strong>Strong Performance</strong>: MMSSC-Net achieved 75-98% accuracy across datasets, outperforming baselines on most benchmarks; on the Indigo, RDKit, and USPTO sets it exceeded 94% accuracy.</li>
<li><strong>Resolution Robustness</strong>: The model maintained relatively stable accuracy across varying image resolutions, whereas baselines like Img2Mol showed greater sensitivity to resolution changes (Fig. 4 in the paper).</li>
<li><strong>Efficiency</strong>: The SwinV2 encoder was noted to be more efficient than ViT-B in this context.</li>
<li><strong>Limitations</strong>: The model struggles with stereochemistry, specifically confusing dashed wedge bonds with solid wedge bonds and misclassifying single bonds as solid wedge bonds. It also has difficulty with &ldquo;irrelevant text&rdquo; noise (e.g., unexpected symbols in JPO and DECIMER datasets).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combination of PubChem and USPTO data, augmented to handle visual variability.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>PubChem</strong></td>
          <td>1,000,000</td>
          <td>Converted from <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> to SMILES; random sampling.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO</strong></td>
          <td>600,000</td>
          <td>Patent images; converted from MOL to SMILES.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>STAKER</strong></td>
          <td>40,000</td>
          <td>Synthetic; Avg res $256 \times 256$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>USPTO</strong></td>
          <td>4,862</td>
          <td>Real; Avg res $721 \times 432$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>CLEF</strong></td>
          <td>881</td>
          <td>Real; Avg res $1245 \times 412$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>JPO</strong></td>
          <td>380</td>
          <td>Real; Avg res $614 \times 367$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>UOB</strong></td>
          <td>5,720</td>
          <td>Real; Avg res $759 \times 416$.</td>
      </tr>
  </tbody>
</table>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Image</strong>: Random perturbations using RDKit/Indigo (rotation, filling, cropping, bond thickness/length, font size, Gaussian noise).</li>
<li><strong>Molecular</strong>: Introduction of functional group abbreviations and R-substituents (dummy atoms) using SMARTS templates.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Target Sequence Formulation</strong>: The model predicts a sequence containing bounding box coordinates and type labels: $\{y_{\text{min}}, x_{\text{min}}, y_{\text{max}}, x_{\text{max}}, C_{n}\}$.</li>
<li><strong>Loss Function</strong>: Cross-entropy loss with maximum likelihood estimation.
$$ \max \sum_{i=1}^{N} \sum_{j=1}^{L} \omega_{j} \log P(t_{j}^{i} \mid x_{1}^{i}, x_{2}^{i}, \dots, x_{M}^{i}, t_{1}^{i}, \dots, t_{j-1}^{i}) $$</li>
<li><strong>Noise Injection</strong>: A random sequence $T_r$ is appended to the target sequence during training to improve generalization to new goals.</li>
<li><strong>Graph Construction</strong>: Atoms ($v$) and bonds ($e$) are recognized separately; bonds are defined by connecting spatial atomic coordinates.</li>
</ul>
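<p>A minimal sketch of the target-sequence construction and the weighted objective above. The vocabulary, integer-quantized coordinates, and per-token weights $\omega_j$ here are illustrative assumptions, not the paper's exact values:</p>

```python
import math


def make_target_sequence(detections, vocab):
    """Flatten (bbox, label) detections into the training target:
    [y_min, x_min, y_max, x_max, class_id] per detected atom/bond."""
    seq = []
    for (y0, x0, y1, x1), label in detections:
        seq.extend([y0, x0, y1, x1, vocab[label]])
    return seq


def weighted_token_nll(probs, targets, weights):
    """Weighted negative log-likelihood for one target sequence:
    -sum_j w_j * log P(t_j | context).  `probs` holds the model's
    per-step distributions as {token_id: probability} dicts."""
    return -sum(w * math.log(p[t]) for p, t, w in zip(probs, targets, weights))
```

<p>The weights would let training down-weight the appended random noise tokens $T_r$ relative to real targets, though the paper's exact weighting scheme is not spelled out here.</p>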
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer V2</strong>.
<ul>
<li>Pre-trained on ImageNet-1K.</li>
<li>Window size: $16 \times 16$.</li>
<li>Parameters: 88M.</li>
<li>Input resolution: $256 \times 256$.</li>
<li>Features: Scaled cosine attention; log-space continuous position bias.</li>
</ul>
</li>
<li><strong>Decoder</strong>: <strong>GPT-2</strong> + <strong>MLP</strong>.
<ul>
<li><strong>GPT-2</strong>: Used for recognizing atom types.
<ul>
<li>Layers: 24.</li>
<li>Attention Heads: 12.</li>
<li>Hidden Dimension: 768.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
<li><strong>MLP</strong>: Used for classifying bond types (single, double, triple, aromatic, solid wedge, dashed wedge).</li>
</ul>
</li>
<li><strong>Vocabulary</strong>:
<ul>
<li>Standard: 95 common numbers/characters ([0], [C], [=], etc.).</li>
<li>Extended: 2000 SMARTS-based characters for isomers/groups (e.g., &ldquo;[C2F5]&rdquo;, &ldquo;[halo]&rdquo;).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ol>
<li><strong>Accuracy</strong>: Exact match of the generated SMILES string.</li>
<li><strong>Tanimoto Similarity</strong>: Similarity of Morgan fingerprints between predicted and ground truth molecules.</li>
</ol>
<p><strong>Key Results (Accuracy)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MMSSC-Net</th>
          <th>MolVec (Rule)</th>
          <th>ABC-Net (DL)</th>
          <th>MolScribe (DL)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Indigo</strong></td>
          <td>98.14</td>
          <td>95.63</td>
          <td>96.4</td>
          <td>97.5</td>
      </tr>
      <tr>
          <td><strong>RDKit</strong></td>
          <td>94.91</td>
          <td>86.7</td>
          <td>98.3</td>
          <td>93.8</td>
      </tr>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>94.24</td>
          <td>88.47</td>
          <td>*</td>
          <td>92.6</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>91.26</td>
          <td>81.61</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>92.71</td>
          <td>81.32</td>
          <td>96.1</td>
          <td>87.9</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>89.44</td>
          <td>4.49</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>75.48</td>
          <td>66.8</td>
          <td>*</td>
          <td>76.2</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch Size: 128.</li>
<li>Learning Rate: $4 \times 10^{-5}$.</li>
<li>Epochs: 40.</li>
</ul>
</li>
<li><strong>Inference Speed</strong>: The SwinV2 encoder demonstrated higher efficiency (faster inference time) compared to ViT-B and ResNet-50 baselines during ablation.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Wzew5Lp/MMSSCNet">MMSSCNet (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation; includes training and prediction scripts</td>
      </tr>
  </tbody>
</table>
<p>The paper is published in RSC Advances (open access). Source code is available on GitHub, though the repository has minimal documentation and no explicit license. The training data comes from PubChem (public) and USPTO (public patent data). Pre-trained model weights do not appear to be released. No specific GPU hardware or training time is reported in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, D., Zhao, D., Wang, Z., Li, J., &amp; Li, J. (2024). MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition. <em>RSC Advances</em>, 14(26), 18182-18191. <a href="https://doi.org/10.1039/D4RA02442G">https://doi.org/10.1039/D4RA02442G</a></p>
<p><strong>Publication</strong>: RSC Advances 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangMMSSCNetMultistageSequence2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MMSSC-Net: Multi-Stage Sequence Cognitive Networks for Drug Molecule Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MMSSC-Net}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Dehai and Zhao, Di and Wang, Zhengwu and Li, Junhui and Li, Jin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{RSC Advances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{18182--18191}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D4RA02442G}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.rsc.org/en/content/articlelanding/2024/ra/d4ra02442g}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher: Multi-modal Markush Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</guid><description>Multi-modal transformer combining vision, text, and layout encoding to extract complex Markush structures from patent documents with OCSR.</description><content:encoded><![CDATA[<h2 id="overcoming-unimodal-limitations-for-markush-structures">Overcoming Unimodal Limitations for Markush Structures</h2>
<p>The automated analysis of chemical literature, particularly patents, is critical for drug discovery and material science. A major bottleneck is the extraction of <strong>Markush structures</strong>, which are complex chemical templates that represent families of molecules using a core backbone image and textual variable definitions. Existing methods are limited because they either rely solely on images (OCSR) and miss the textual context, or focus solely on text and miss the structural backbone. This creates a practical need for a unified, multi-modal approach that jointly interprets visual and textual data to accurately extract these structures for prior-art search and database construction. This paper proposes a <strong>Method</strong> and introduces a new <strong>Resource</strong> (M2S dataset) to bridge this gap.</p>
<h2 id="markushgrapher-the-multi-modal-architecture">MarkushGrapher: The Multi-Modal Architecture</h2>
<p>The core innovation is <strong>MarkushGrapher</strong>, a multi-modal architecture that jointly encodes image, text, and layout information. Key contributions include:</p>
<ul>
<li><strong>Dual-Encoder Architecture</strong>: Combines a Vision-Text-Layout (VTL) encoder (based on UDOP) with a specialized, pre-trained Optical Chemical Structure Recognition (OCSR) encoder (MolScribe). Let $E_{\text{VTL}}$ represent the combined sequence embedding and $E_{\text{OCSR}}$ represent the domain-specific visual embeddings.</li>
<li><strong>Joint Recognition</strong>: The model autoregressively generates a sequential graph representation (Optimized CXSMILES) and a substituent table simultaneously. It uses cross-modal dependencies, allowing text to clarify ambiguous visual details like bond types.</li>
<li><strong>Synthetic Data Pipeline</strong>: A comprehensive pipeline generates realistic synthetic Markush structures (images and text) from PubChem data, overcoming the lack of labeled training data.</li>
<li><strong>Optimized Representation</strong>: A compacted version of CXSMILES moves variable groups into the SMILES string and adds explicit atom indexing to handle complex &ldquo;frequency&rdquo; and &ldquo;position&rdquo; variation indicators.</li>
</ul>
<h2 id="experimental-validation-on-the-new-m2s-benchmark">Experimental Validation on the New M2S Benchmark</h2>
<p>The authors validated their approach using the following setup:</p>
<ul>
<li><strong>Baselines</strong>: Compared against image-only chemistry models (DECIMER, MolScribe) and general-purpose multi-modal models (Uni-SMART, GPT-4o, Pixtral, Llama-3.2).</li>
<li><strong>Datasets</strong>: Evaluated on three benchmarks:
<ol>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 generated samples.</li>
<li><strong>M2S</strong>: A new benchmark of 103 manually annotated real-world patent images.</li>
<li><strong>USPTO-Markush</strong>: 74 Markush backbone images from USPTO patents.</li>
</ol>
</li>
<li><strong>Ablation Studies</strong>: Analyzed the impact of the OCSR encoder, late fusion strategies, and the optimized CXSMILES format. Late fusion improved USPTO-Markush EM from 23% (VTL only) to 32% (Table 3). Removing R-group compression dropped M2S EM from 38% to 30%, and removing atom indexing dropped USPTO-Markush EM from 32% to 24% (Table 4).</li>
</ul>
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Performance</strong>: MarkushGrapher outperformed all baselines. On the M2S benchmark, it achieved 38% Exact Match on CXSMILES (compared to 21% for MolScribe) and 29% Exact Match on tables. On USPTO-Markush, it reached 32% CXSMILES EM versus 7% for MolScribe.</li>
<li><strong>Markush Feature Recognition</strong>: The model can recognize complex Markush features like frequency variation (&lsquo;Sg&rsquo;) and position variation (&lsquo;m&rsquo;) indicators. DECIMER and MolScribe scored 0% on both &lsquo;m&rsquo; and &lsquo;Sg&rsquo; sections (Table 2), while MarkushGrapher achieved 76% on &lsquo;m&rsquo; and 31% on &lsquo;Sg&rsquo; sections on M2S.</li>
<li><strong>Cross-Modal Reasoning</strong>: Qualitative analysis showed the model can correctly infer visual details (such as bond order) that appear ambiguous in the image but become apparent with the text description.</li>
<li><strong>Robustness</strong>: The model generalizes well to real-world data despite being trained purely on synthetic data. On augmented versions of M2S and USPTO-Markush simulating low-quality scanned documents, it maintained 31% and 32% CXSMILES EM respectively (Table 6).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>The authors note several limitations:</p>
<ul>
<li>MarkushGrapher does not currently handle abbreviations in chemical structures (e.g., &lsquo;OG&rsquo; for oxygen connected to a variable group).</li>
<li>The model relies on ground-truth OCR cells as input, requiring an external OCR model for practical deployment.</li>
<li>Substituent definitions that combine text with interleaved chemical structure drawings are not supported.</li>
<li>The model is trained to predict &lsquo;m&rsquo; sections connecting to all atoms in a cycle, which can technically violate valence constraints, though the output contains enough information to reconstruct only valid connections.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong></p>
<ul>
<li><strong>Source</strong>: Synthetic dataset generated from PubChem SMILES.</li>
<li><strong>Size</strong>: 210,000 synthetic images.</li>
<li><strong>Pipeline</strong>:
<ol>
<li><strong>Selection</strong>: Sampled SMILES from PubChem based on substructure diversity.</li>
<li><strong>Augmentation</strong>: SMILES augmented to artificial CXSMILES using RDKit (inserting variable groups, frequency indicators).</li>
<li><strong>Rendering</strong>: Images rendered using Chemistry Development Kit (CDK) with randomized drawing parameters (font, bond width, spacing).</li>
<li><strong>Text Generation</strong>: Textual definitions generated using manual templates extracted from patents; 10% were paraphrased using Mistral-7B-Instruct-v0.3 to increase diversity.</li>
<li><strong>OCR</strong>: Bounding boxes extracted via a custom SVG parser aligned with MOL files.</li>
</ol>
</li>
</ul>
<p><strong>Evaluation Data</strong></p>
<ul>
<li><strong>M2S Dataset</strong>: 103 images from USPTO, EPO, and WIPO patents (1999-2023), manually annotated with CXSMILES and substituent tables.</li>
<li><strong>USPTO-Markush</strong>: 74 images from USPTO patents (2010-2016).</li>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 samples generated via the pipeline.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimized CXSMILES</strong>:
<ul>
<li><strong>Compression</strong>: Variable groups moved from the extension block to the main SMILES string as special atoms to reduce sequence length.</li>
<li><strong>Indexing</strong>: Atom indices appended to each atom (e.g., <code>C:1</code>) to explicitly link the graph to the extension block (crucial for <code>m</code> and <code>Sg</code> sections).</li>
<li><strong>Vocabulary</strong>: Specific tokens used for atoms and bonds.</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Standard image augmentations (shift, scale, blur, pepper noise, random lines) and OCR text augmentations (character substitution/insertion/deletion).</li>
</ul>
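<p>The atom-indexing step can be illustrated with a toy parser. This sketch handles only single-letter organic-subset atoms, bonds, and branches (no ring closures, bracket atoms, or two-letter elements), and the 1-based index format is an assumption based on the <code>C:1</code> example above:</p>

```python
def index_atoms(smiles):
    """Append an explicit atom index to each atom of a simple SMILES string,
    e.g. 'CC(=O)N' -> 'C:1C:2(=O:3)N:4'.  Toy sketch: single-letter atoms,
    bonds, and branches only; ring closures are out of scope."""
    out, idx = [], 0
    for ch in smiles:
        out.append(ch)
        if ch.isalpha():      # a single-letter atom symbol
            idx += 1
            out.append(f":{idx}")
    return "".join(out)
```

<p>With explicit indices, the <code>m</code> and <code>Sg</code> extension-block sections can reference atoms unambiguously instead of relying on implicit SMILES atom ordering.</p>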
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer.
<ul>
<li><strong>VTL Encoder</strong>: T5-large encoder (initialized from UDOP) that processes image patches, text tokens, and layout (bounding boxes).</li>
<li><strong>OCSR Encoder</strong>: Vision encoder from MolScribe (Swin Transformer), frozen during training.</li>
<li><strong>Text Decoder</strong>: T5-large decoder.</li>
</ul>
</li>
<li><strong>Fusion Strategy</strong>: <strong>Late Fusion</strong>. The VTL output $e_1(v, t, l)$, which jointly encodes vision, text, and layout, is concatenated with the MLP-projected OCSR output $e_2(v)$ before decoding:
$$ e = e_1(v, t, l) \oplus \text{MLP}(e_2(v)) $$</li>
<li><strong>Parameters</strong>: 831M total (744M trainable).</li>
</ul>
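<p>The late-fusion step is just a concatenation of the two embeddings after a learned projection. A dependency-free sketch on Python lists, with a single linear layer standing in for the MLP (the real projection's depth and dimensions are not reproduced here):</p>

```python
def mlp_project(e2, weight, bias):
    """Single linear layer standing in for the MLP projection of the
    OCSR embedding e2 (a stand-in, not the paper's exact MLP)."""
    return [sum(w * x for w, x in zip(row, e2)) + b
            for row, b in zip(weight, bias)]


def late_fusion(e1, e2, weight, bias):
    """Fused representation e = e1 (+) MLP(e2): the VTL embedding
    concatenated with the projected OCSR embedding."""
    return e1 + mlp_project(e2, weight, bias)
```

<p>The T5 decoder then attends over this fused sequence, so chemically specialized visual features sit alongside the layout-aware tokens.</p>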
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>CXSMILES Exact Match (EM)</strong>: Requires perfect match of SMILES string, variable groups, <code>m</code> sections, and <code>Sg</code> sections (ignoring stereochemistry).</li>
<li><strong>Tanimoto Score</strong>: Similarity of RDKit DayLight fingerprints (Markush features removed).</li>
<li><strong>Table Exact Match</strong>: All variable groups and substituents must match.</li>
<li><strong>Table F1-Score</strong>: Aggregated recall and precision of substituents per variable group.</li>
</ul>
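<p>The table F1-score can be sketched as a micro-averaged F1 over substituent sets per variable group. Micro-averaging is an assumption here; the paper's exact aggregation is not reproduced:</p>

```python
def table_f1(pred_table, gold_table):
    """Micro-averaged F1 over substituent sets per variable group.
    Tables map variable-group name -> set of substituent strings."""
    tp = fp = fn = 0
    for g in set(pred_table) | set(gold_table):
        pred = pred_table.get(g, set())
        gold = gold_table.get(g, set())
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```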
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Trained on a single NVIDIA H100 GPU.</li>
<li><strong>Training Config</strong>: 10 epochs, batch size of 10, ADAM optimizer, learning rate 5e-4, 100 warmup steps, weight decay 1e-3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Weber, V., Nassar, A., Meijer, G. I., Van Gool, L., Li, Y., &amp; Staar, P. (2025). MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures. <em>2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 14505-14515. <a href="https://doi.org/10.1109/CVPR52734.2025.01352">https://doi.org/10.1109/CVPR52734.2025.01352</a></p>
<p><strong>Publication</strong>: CVPR 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMarkushGrapherJointVisual2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MarkushGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Weber, Valéry and Nassar, Ahmed and Meijer, Gerhard Ingmar and Van Gool, Luc and Li, Yawei and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14505--14515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/CVPR52734.2025.01352}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2InChI: SwinTransformer for Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</guid><description>Deep learning model using improved SwinTransformer encoder and attention-based feature fusion to convert molecular images to InChI strings.</description><content:encoded><![CDATA[<h2 id="image2inchi-as-a-methodological-innovation">Image2InChI as a Methodological Innovation</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>. It proposes a specific new deep learning architecture (&ldquo;Image2InChI&rdquo;) to solve the task of Optical Chemical Structure Recognition (OCSR). The rhetorical focus is on engineering a system that outperforms baselines on specific metrics (InChI accuracy, MCS accuracy) and providing a valuable reference for future algorithmic work.</p>
<h2 id="bottlenecks-in-chemical-literature-digitization">Bottlenecks in Chemical Literature Digitization</h2>
<p>The accurate digitization of chemical literature is a bottleneck in AI-driven drug discovery. Chemical structures in patents and papers exist as optical images (pixels), but machine learning models require machine-readable string representations (like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>). Efficiently and automatically bridging this gap is a prerequisite for large-scale data mining in chemistry.</p>
<h2 id="hierarchical-swintransformer-and-attention-integration">Hierarchical SwinTransformer and Attention Integration</h2>
<p>The core novelty is the <strong>Image2InChI</strong> architecture, which integrates:</p>
<ol>
<li><strong>Improved SwinTransformer Encoder</strong>: Uses a hierarchical vision transformer to capture image features.</li>
<li><strong>Feature Fusion with Attention</strong>: A novel network designed to integrate image patch features with InChI prediction steps.</li>
<li><strong>End-to-End InChI Prediction</strong>: The architecture frames the problem as a direct image-to-sequence translation targeting InChI strings directly, diverging from techniques predicting independent graph components. The model is optimized using a standard Cross-Entropy Loss over the token vocabulary:
$$ \mathcal{L}_{\text{CE}} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{X}) $$
where $\mathbf{X}$ represents the input image features, $y_t$ is the predicted token, and $T$ is the sequence length.</li>
</ol>
<h2 id="benchmarking-on-the-bms-dataset">Benchmarking on the BMS Dataset</h2>
<ul>
<li><strong>Benchmark Validation</strong>: The model was trained and tested on the <strong>BMS1000 (Bristol-Myers Squibb)</strong> dataset from a Kaggle competition.</li>
<li><strong>Ablation/Comparative Analysis</strong>: The authors compared their method against other models in the supplement.</li>
<li><strong>Preprocessing Validation</strong>: They justified their choice of denoising algorithms (8-neighborhood vs. Gaussian/Mean) to ensure preservation of bond lines while removing &ldquo;spiky point noise&rdquo;.</li>
</ul>
<h2 id="high-inchi-recognition-metrics">High InChI Recognition Metrics</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved <strong>99.8% InChI accuracy</strong>, 94.8% Maximum Common Substructure (MCS) accuracy, and 96.2% Longest Common Subsequence (LCS) accuracy on the benchmarked dataset. It remains to be seen how well these models generalize to heavily degraded real-world patent images.</li>
<li><strong>Effective Denoising</strong>: The authors concluded that <strong>eight-neighborhood filtering</strong> is superior to mean or Gaussian filtering for this specific domain because it removes isolated noise points without blurring the fine edges of chemical bonds.</li>
<li><strong>Open Source</strong>: The authors stated their intention to release the code, though no public repository has been identified.</li>
</ul>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition</td>
          <td>Bristol-Myers Squibb Molecular Translation competition dataset</td>
      </tr>
  </tbody>
</table>
<p>No public code repository has been identified for Image2InChI despite the authors&rsquo; stated intent to release it.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The primary dataset used is the <strong>BMS (Bristol-Myers Squibb) Dataset</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>Kaggle Competition (BMS-Molecular-Translation)</td>
      </tr>
      <tr>
          <td><strong>Total Size</strong></td>
          <td>2.4 million images</td>
      </tr>
      <tr>
          <td><strong>Training Set</strong></td>
          <td>1.8 million images</td>
      </tr>
      <tr>
          <td><strong>Test Set</strong></td>
          <td>0.6 million images</td>
      </tr>
      <tr>
          <td><strong>Content</strong></td>
          <td>Each image corresponds to a unique International Chemical Identifier (<a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>)</td>
      </tr>
  </tbody>
</table>
<p><strong>Other Datasets</strong>: The authors also utilized JPO (Japanese Patent Office), CLEF (CLEF-IP 2012), UOB (MolrecUOB), and USPTO datasets for broader benchmarking.</p>
<p><strong>Preprocessing Pipeline</strong>:</p>
<ol>
<li><strong>Denoising</strong>: <strong>Eight-neighborhood filtering</strong> (threshold &lt; 4 non-white pixels) is used to remove salt-and-pepper noise while preserving bond lines. Mean and Gaussian filtering were rejected due to blurring.</li>
<li><strong>Sequence Padding</strong>:
<ul>
<li>Analysis showed max InChI length &lt; 270.</li>
<li>Fixed sequence length set to <strong>300</strong>.</li>
<li>Tokens: <code>&lt;sos&gt;</code> (190), <code>&lt;eos&gt;</code> (191), <code>&lt;pad&gt;</code> (192) used for padding/framing.</li>
</ul>
</li>
<li><strong>Numerization</strong>: Characters are mapped to integers based on a fixed vocabulary (e.g., &lsquo;C&rsquo; -&gt; 178, &lsquo;H&rsquo; -&gt; 182).</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Eight-Neighborhood Filtering (Denoising)</strong>:</p>
<p>Pseudocode logic:</p>
<ul>
<li>Iterate through every pixel.</li>
<li>Count non-white neighbors in the 3x3 grid (8 neighbors).</li>
<li>If count &lt; threshold (default 4), treat as noise and remove.</li>
</ul>
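<p>The filtering logic above can be sketched in a few lines of Python (an illustrative reimplementation of the pseudocode, not the authors&rsquo; code); pixels are 0 for white and 1 for non-white:</p>

```python
def denoise(img, threshold=4):
    """Eight-neighborhood filter: drop non-white pixels with fewer than
    `threshold` non-white neighbors in their 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 0:
                continue  # white pixels are left untouched
            neighbors = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors < threshold:
                out[y][x] = 0  # isolated pixel: treat as noise
    return out
```

<p>An isolated pixel (0 non-white neighbors) is removed, while pixels embedded in thick strokes survive.</p>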
<p><strong>InChI Tokenization</strong>:</p>
<ul>
<li>InChI strings are split into character arrays.</li>
<li>Example: Vitamin C <code>InChI=1S/C6H8O6...</code> becomes <code>[&lt;sos&gt;, C, 6, H, 8, O, 6, ..., &lt;eos&gt;, &lt;pad&gt;...]</code>.</li>
<li>Mapped to integer tensor for model input.</li>
</ul>
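<p>A minimal sketch of this tokenization and numerization step (only the <code>&lt;sos&gt;</code>/<code>&lt;eos&gt;</code>/<code>&lt;pad&gt;</code> ids and the &lsquo;C&rsquo;/&lsquo;H&rsquo; mappings come from the paper; the rest of the vocabulary and the function name are illustrative):</p>

```python
SOS, EOS, PAD = 190, 191, 192          # special-token ids from the paper
VOCAB = {"C": 178, "H": 182}           # one integer id per character (partial)

def encode_inchi(inchi, vocab=VOCAB, max_len=300):
    """Map an InChI string to a fixed-length integer sequence."""
    ids = [SOS] + [vocab[ch] for ch in inchi] + [EOS]
    ids += [PAD] * (max_len - len(ids))  # pad to the fixed length of 300
    return ids[:max_len]
```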
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image2InChI</p>
<ul>
<li><strong>Encoder</strong>: Improved SwinTransformer (Hierarchical Vision Transformer).</li>
<li><strong>Decoder</strong>: Transformer Decoder with patch embedding.</li>
<li><strong>Fusion</strong>: A novel &ldquo;feature fusion network with attention&rdquo; integrates the visual tokens with the sequence generation process.</li>
<li><strong>Framework</strong>: PyTorch 1.8.1.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>InChI Acc</strong>: Exact match accuracy of the predicted InChI string (Reported: 99.8%).</li>
<li><strong>MCS Acc</strong>: Maximum Common Substructure accuracy (structural similarity) (Reported: 94.8%).</li>
<li><strong>LCS Acc</strong>: Longest Common Subsequence accuracy (string similarity) (Reported: 96.2%).</li>
<li><strong>Morgan FP</strong>: Morgan Fingerprint similarity (Reported: 94.1%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Specification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GPU</strong></td>
          <td>NVIDIA Tesla P100 (16GB VRAM)</td>
      </tr>
      <tr>
          <td><strong>Platform</strong></td>
          <td>MatPool cloud platform</td>
      </tr>
      <tr>
          <td><strong>CPU</strong></td>
          <td>Intel Xeon Gold 6271</td>
      </tr>
      <tr>
          <td><strong>RAM</strong></td>
          <td>32GB System Memory</td>
      </tr>
      <tr>
          <td><strong>Driver</strong></td>
          <td>NVIDIA-SMI 440.100</td>
      </tr>
      <tr>
          <td><strong>OS</strong></td>
          <td>Ubuntu 18.04</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, D., Xu, X., Pan, J., Gao, W., &amp; Zhang, S. (2024). Image2InChI: Automated Molecular Optical Image Recognition. <em>Journal of Chemical Information and Modeling</em>, 64(9), 3640-3649. <a href="https://doi.org/10.1021/acs.jcim.3c02082">https://doi.org/10.1021/acs.jcim.3c02082</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></li>
</ul>
<p><strong>Note</strong>: These notes are based on the Abstract and Supporting Information files only.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2024image2inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Image2InChI: Automated Molecular Optical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Da-zhou and Xu, Xin and Pan, Jia-heng and Gao, Wei and Zhang, Shi-rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3640--3649}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.3c02082}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Enhanced DECIMER for Hand-Drawn Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</guid><description>An improved encoder-decoder model (EfficientNetV2 + Transformer) converts hand-drawn chemical structures into SMILES strings using synthetic training data.</description><content:encoded><![CDATA[<h2 id="method-contribution-architectural-optimization">Method Contribution: Architectural Optimization</h2>
<p>This is a <strong>Method</strong> paper. It proposes an enhanced neural network architecture (EfficientNetV2 + Transformer) specifically designed to solve the problem of recognizing hand-drawn chemical structures. The primary contribution is architectural optimization and a data-driven training strategy, validated through ablation studies (comparing encoders) and benchmarked against existing rule-based and deep learning tools.</p>
<h2 id="motivation-digitizing-dark-chemical-data">Motivation: Digitizing &ldquo;Dark&rdquo; Chemical Data</h2>
<p>Chemical information in legacy laboratory notebooks and modern tablet-based inputs often exists as hand-drawn sketches.</p>
<ul>
<li><strong>Gap:</strong> Existing Optical Chemical Structure Recognition (OCSR) tools (particularly rule-based ones) lack robustness and fail when images have variability in style, line thickness, or noise.</li>
<li><strong>Need:</strong> There is a critical need for automated tools to digitize this &ldquo;dark data&rdquo; effectively to preserve it and make it machine-readable and searchable.</li>
</ul>
<h2 id="core-innovation-decoder-only-design-and-synthetic-scaling">Core Innovation: Decoder-Only Design and Synthetic Scaling</h2>
<p>The core novelty is the <strong>architectural enhancement</strong> and <strong>synthetic training strategy</strong>:</p>
<ol>
<li><strong>Decoder-Only Transformer:</strong> Using only the decoder part of the Transformer (instead of a full encoder-decoder Transformer) improved average accuracy across OCSR benchmarks from 61.28% to 69.27% (Table 3 in the paper).</li>
<li><strong>EfficientNetV2 Integration:</strong> Replacing standard CNNs or EfficientNetV1 with <strong>EfficientNetV2-M</strong> provided better feature extraction and 2x faster training speeds.</li>
<li><strong>Scale of Synthetic Data:</strong> The authors demonstrate that scaling synthetic training data (up to 152 million images generated by RanDepict) directly correlates with improved generalization to real-world hand-drawn images, without ever training on real hand-drawn data.</li>
</ol>
<h2 id="experimental-setup-ablation-and-real-world-baselines">Experimental Setup: Ablation and Real-World Baselines</h2>
<ul>
<li><strong>Model Selection (Ablation):</strong> Tested three architectures (EfficientNetV2-M + Full Transformer, EfficientNetV1-B7 + Decoder-only, EfficientNetV2-M + Decoder-only) on standard benchmarks (JPO, CLEF, USPTO, UOB).</li>
<li><strong>Data Scaling:</strong> Trained the best model on four progressively larger datasets (from 4M to 152M images) to measure performance gains.</li>
<li><strong>Real-World Benchmarking:</strong> Validated the final model on the <strong>DECIMER Hand-drawn dataset</strong> (5088 real images drawn by volunteers) and compared against 9 other tools (OSRA, MolVec, Img2Mol, MolScribe, etc.).</li>
</ul>
<h2 id="results-and-conclusions-strong-accuracy-on-hand-drawn-scans">Results and Conclusions: Strong Accuracy on Hand-Drawn Scans</h2>
<ul>
<li><strong>Strong Performance:</strong> The final DECIMER model achieved <strong>99.72% valid predictions</strong> and <strong>73.25% exact accuracy</strong> on the hand-drawn benchmark. The next best non-DECIMER tool was MolGrapher at 10.81% accuracy, followed by MolScribe at 7.65%.</li>
<li><strong>Robustness:</strong> Deep learning methods outperform rule-based methods, which scored at most 3% exact accuracy, on hand-drawn data.</li>
<li><strong>Data Saturation:</strong> Quadrupling the dataset from 38M to 152M images yielded only marginal gains (about 3 percentage points in accuracy), suggesting current synthetic data strategies may be hitting a plateau.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official TensorFlow implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained hand-drawn model weights</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/decimer/">DECIMER PyPi Package</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Installable Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic hand-drawn image generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The model was trained entirely on <strong>synthetic data</strong> generated using the <a href="https://github.com/OBrink/RanDepict">RanDepict</a> toolkit. No real hand-drawn images were used for training.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Molecules</th>
          <th>Total Images</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>4,375,338</td>
          <td>1 augmented + 1 clean per molecule</td>
      </tr>
      <tr>
          <td>2</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>13,126,014</td>
          <td>2 augmented + 4 clean per molecule</td>
      </tr>
      <tr>
          <td>3</td>
          <td>PubChem</td>
          <td>9,510,000</td>
          <td>38,040,000</td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
      <tr>
          <td>4</td>
          <td>PubChem</td>
          <td>38,040,000</td>
          <td><strong>152,160,000</strong></td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
  </tbody>
</table>
<p>A separate <strong>model selection</strong> experiment used a 1,024,000-molecule subset of ChEMBL to compare the three architectures (Table 1 in the paper). The <strong>DECIMER Hand-Drawn</strong> evaluation dataset consists of 5,088 real hand-drawn images from 23 volunteers.</p>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings length &lt; 300 characters.</li>
<li>Images resized to $512 \times 512$.</li>
<li>Images generated with and without &ldquo;hand-drawn style&rdquo; augmentations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization:</strong> SMILES split by heavy atoms, brackets, bond symbols, and special characters. Start <code>&lt;start&gt;</code> and end <code>&lt;end&gt;</code> tokens added; padded with <code>&lt;pad&gt;</code>.</li>
<li><strong>Optimization:</strong> Adam optimizer with a custom learning rate schedule (as specified in the original Transformer paper). A dropout rate of 0.1 was used.</li>
<li><strong>Loss Function:</strong> Trained using focal loss to address class imbalance for rare tokens. The focal loss formulation reduces the relative loss for well-classified examples:
$$
\text{FL}(p_{\text{t}}) = -\alpha_{\text{t}} (1 - p_{\text{t}})^\gamma \log(p_{\text{t}})
$$</li>
<li><strong>Augmentations:</strong> RanDepict applied synthetic distortions to mimic handwriting (wobbly lines, variable thickness, etc.).</li>
</ul>
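<p>The focal-loss formula transcribes directly for a single token probability (a sketch; the $\alpha_{\text{t}}$ and $\gamma$ defaults here are the values common in the literature, not necessarily the paper&rsquo;s):</p>

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    With gamma = 0 and alpha_t = 1 this reduces to cross-entropy."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

<p>The $(1 - p_{\text{t}})^\gamma$ factor shrinks the loss for well-classified tokens (e.g. $p_{\text{t}} = 0.9$) by orders of magnitude relative to hard, rare tokens ($p_{\text{t}} = 0.1$), which is what counteracts token-frequency imbalance.</p>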
<h3 id="models">Models</h3>
<p>The final architecture (Model 3) is an Encoder-Decoder structure:</p>
<ul>
<li><strong>Encoder:</strong> <strong>EfficientNetV2-M</strong> (pretrained ImageNet backbone).
<ul>
<li>Input: $512 \times 512 \times 3$ image.</li>
<li>Output Features: $16 \times 16 \times 512$ (reshaped to sequence length 256, dimension 512).</li>
<li><em>Note:</em> The final fully connected layer of the CNN is removed.</li>
</ul>
</li>
<li><strong>Decoder:</strong> <strong>Transformer (Decoder-only)</strong>.
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 8</li>
<li>Embedding Dimension: 512</li>
</ul>
</li>
<li><strong>Output:</strong> Predicted SMILES string token by token.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used for evaluation:</p>
<ol>
<li><strong>Valid Predictions (%):</strong> Percentage of outputs that are syntactically valid SMILES.</li>
<li><strong>Exact Match Accuracy (%):</strong> Canonical SMILES string identity.</li>
<li><strong>Tanimoto Similarity:</strong> Fingerprint similarity (PubChem fingerprints) between ground truth and prediction.</li>
</ol>
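<p>Of these, Tanimoto similarity is the only non-trivial computation; on fingerprints represented as sets of &ldquo;on&rdquo; bits it is simply intersection over union (the paper uses PubChem fingerprints computed by a cheminformatics toolkit; integer sets stand in for them in this sketch):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two bit sets."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)
```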
<p><strong>Data Scaling Results (Hand-Drawn Dataset, Table 4 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Training Images</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (ChEMBL)</td>
          <td>4,375,338</td>
          <td>96.21%</td>
          <td>5.09%</td>
          <td>0.490</td>
      </tr>
      <tr>
          <td>2 (ChEMBL)</td>
          <td>13,126,014</td>
          <td>97.41%</td>
          <td>26.08%</td>
          <td>0.690</td>
      </tr>
      <tr>
          <td>3 (PubChem)</td>
          <td>38,040,000</td>
          <td>99.67%</td>
          <td>70.34%</td>
          <td>0.939</td>
      </tr>
      <tr>
          <td>4 (PubChem)</td>
          <td>152,160,000</td>
          <td>99.72%</td>
          <td>73.25%</td>
          <td>0.942</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with Other Tools (Hand-Drawn Dataset, Table 5 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>OCSR Tool</th>
          <th>Method</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER (Ours)</strong></td>
          <td>Deep Learning</td>
          <td><strong>99.72%</strong></td>
          <td><strong>73.25%</strong></td>
          <td><strong>0.94</strong></td>
      </tr>
      <tr>
          <td>DECIMER.ai</td>
          <td>Deep Learning</td>
          <td>96.07%</td>
          <td>26.98%</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>Deep Learning</td>
          <td>99.94%</td>
          <td>10.81%</td>
          <td>0.51</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>Deep Learning</td>
          <td>95.66%</td>
          <td>7.65%</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>Deep Learning</td>
          <td>98.96%</td>
          <td>5.25%</td>
          <td>0.52</td>
      </tr>
      <tr>
          <td>SwinOCSR</td>
          <td>Deep Learning</td>
          <td>97.37%</td>
          <td>5.11%</td>
          <td>0.64</td>
      </tr>
      <tr>
          <td>ChemGrapher</td>
          <td>Deep Learning</td>
          <td>69.56%</td>
          <td>N/A</td>
          <td>0.09</td>
      </tr>
      <tr>
          <td>Imago</td>
          <td>Rule-based</td>
          <td>43.14%</td>
          <td>2.99%</td>
          <td>0.22</td>
      </tr>
      <tr>
          <td>MolVec</td>
          <td>Rule-based</td>
          <td>71.86%</td>
          <td>1.30%</td>
          <td>0.23</td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td>Rule-based</td>
          <td>54.66%</td>
          <td>0.57%</td>
          <td>0.17</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Google Cloud TPU v4-128 pod slice.</li>
<li><strong>Training Time:</strong>
<ul>
<li>EfficientNetV2-M model trained ~2x faster than EfficientNetV1-B7.</li>
<li>Average training time per epoch: 34 minutes (for Model 3 on 1M dataset subset).</li>
</ul>
</li>
<li><strong>Epochs:</strong> Models trained for 25 epochs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H.O., Zielesny, A. et al. (2024). Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. <em>Journal of Cheminformatics</em>, 16(78). <a href="https://doi.org/10.1186/s13321-024-00872-7">https://doi.org/10.1186/s13321-024-00872-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://pypi.org/project/decimer/">PyPi Package</a></li>
<li><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanAdvancementsHanddrawnChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Advancements in Hand-Drawn Chemical Structure Recognition through an Enhanced {{DECIMER}} Architecture}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{78}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00872-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dual-Path Global Awareness Transformer (DGAT) for OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</guid><description>A Transformer-based OCSR model introducing dual-path modules (CGFE and SDGLA) to improve global context awareness and complex motif recognition.</description><content:encoded><![CDATA[<h2 id="contribution-type-deep-learning-method-for-ocsr">Contribution Type: Deep Learning Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>The classification is based on the proposal of a novel deep learning architecture (DGAT) designed to address specific limitations in existing Optical Chemical Structure Recognition (OCSR) systems. The contribution is validated through benchmarking against external baselines (DeepOCSR, DECIMER, SwinOCSR) and ablation studies that isolate the impact of the new modules.</p>
<h2 id="motivation-addressing-global-context-loss">Motivation: Addressing Global Context Loss</h2>
<p>Existing multimodal fusion methods for OCSR suffer from limited awareness of global context.</p>
<ul>
<li><strong>Problem</strong>: Models often generate erroneous sequences when processing complex motifs, such as rings or long chains, due to a disconnect between local feature extraction and global structural understanding.</li>
<li><strong>Gap</strong>: Current architectures struggle to capture the &ldquo;fine-grained differences between global and local features,&rdquo; leading to topological errors.</li>
<li><strong>Practical Need</strong>: Accurate translation of chemical images to machine-readable sequences (SMILES/SELFIES) is critical for materials science and AI-guided chemical research.</li>
</ul>
<h2 id="core-innovation-dual-path-global-awareness-transformer">Core Innovation: Dual-Path Global Awareness Transformer</h2>
<p>The authors propose the <strong>Dual-Path Global Awareness Transformer (DGAT)</strong>, which redesigns the decoder with two novel mechanisms to better handle global context:</p>
<ol>
<li>
<p><strong>Cascaded Global Feature Enhancement (CGFE)</strong>: This module bridges cross-modal gaps by emphasizing global context. It concatenates global visual features with sequence features and processes them through a Cross-Modal Assimilation MLP and an Adaptive Alignment MLP to align multimodal representations. The feature enhancement conceptually computes:</p>
<p>$$ f_{\text{enhanced}} = \text{MLP}_{\text{align}}(\text{MLP}_{\text{assimilate}}([f_{\text{global}}, f_{\text{seq}}])) $$</p>
</li>
<li>
<p><strong>Sparse Differential Global-Local Attention (SDGLA)</strong>: A module that dynamically captures fine-grained differences between global and local features. It uses sequence features (embedded with global info) as queries, while utilizing local and global visual features as keys/values in parallel attention heads to generate initial multimodal features.</p>
</li>
</ol>
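<p>The CGFE equation can be sketched without any framework (hidden sizes, single-layer MLPs, and the ReLU nonlinearity are assumptions for illustration, not the paper&rsquo;s specification):</p>

```python
def linear(x, weights, bias):
    # weights: out_dim rows of in_dim values each
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def cgfe(f_global, f_seq, assim, align):
    """f_enhanced = MLP_align(MLP_assim([f_global, f_seq]))."""
    fused = list(f_global) + list(f_seq)  # concatenate [f_global, f_seq]
    h = relu(linear(fused, *assim))       # Cross-Modal Assimilation MLP
    return linear(h, *align)              # Adaptive Alignment MLP
```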
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was evaluated on a newly constructed dataset and compared against five major baselines.</p>
<ul>
<li><strong>Baselines</strong>: DeepOCSR, DECIMER 1.0, DECIMER V2, SwinOCSR, and MPOCSR.</li>
<li><strong>Ablation Studies</strong>:
<ul>
<li><strong>Layer Depth</strong>: Tested Transformer depths from 1 to 5 layers; 3 layers proved optimal for balancing gradient flow and parameter sufficiency.</li>
<li><strong>Beam Size</strong>: Tested inference beam sizes 1-5; size 3 achieved the best balance between search depth and redundancy.</li>
<li><strong>Module Contribution</strong>: Validated that removing CGFE results in a drop in structural similarity (Tanimoto), proving the need for pre-fusion alignment.</li>
</ul>
</li>
<li><strong>Robustness Analysis</strong>: Performance broken down by molecule complexity (atom count, ring count, bond count).</li>
<li><strong>Chirality Validation</strong>: Qualitative analysis of attention maps on chiral molecules to verify the model learns stereochemical cues implicitly.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Performance Over Baselines</strong>: DGAT outperformed the MPOCSR baseline across all metrics:
<ul>
<li><strong>BLEU-4</strong>: 84.0% (+5.3 percentage points)</li>
<li><strong>ROUGE</strong>: 90.8% (+1.9 percentage points)</li>
<li><strong>Tanimoto Similarity</strong>: 98.8% (+1.2 percentage points)</li>
<li><strong>Exact Match Accuracy</strong>: 54.6% (+10.9 percentage points over SwinOCSR)</li>
</ul>
</li>
<li><strong>Chiral Recognition</strong>: The model implicitly recognizes chiral centers (e.g., generating <code>[C@@H1]</code> tokens correctly) based on 2D wedge cues without direct stereochemical supervision.</li>
<li><strong>Limitations</strong>: Performance drops for extreme cases, such as molecules with 4+ rings or 4+ double/triple bonds, due to dataset imbalance. The model still hallucinates branches in highly complex topologies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is primarily drawn from PubChem and augmented to improve robustness.</p>
<ul>
<li><strong>Augmentation Strategy</strong>: Each sequence generates three images with random rendering parameters.
<ul>
<li><strong>Rotation</strong>: 0, 90, 180, 270, or random [0, 360)</li>
<li><strong>Bond Width</strong>: 1, 2, or 3 pixels</li>
<li><strong>Bond Offset</strong>: Sampled from 0.08-0.18 (inherited from Image2SMILES)</li>
<li><strong>CoordGen</strong>: Enabled with 20% probability</li>
</ul>
</li>
<li><strong>Evaluation Set</strong>: A newly constructed benchmark dataset was used for final reporting.</li>
</ul>
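<p>As a sketch, one draw from the augmentation recipe above looks like the following (function and key names are illustrative, not from the released code):</p>

```python
import random

def sample_render_params(rng=random):
    """Sample one set of rendering parameters per the recipe above."""
    rotation = rng.choice([0, 90, 180, 270, rng.uniform(0, 360)])
    return {
        "rotation_deg": rotation,
        "bond_width_px": rng.choice([1, 2, 3]),
        "bond_offset": rng.uniform(0.08, 0.18),  # inherited from Image2SMILES
        "use_coordgen": rng.random() < 0.20,     # enabled with 20% probability
    }
```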
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Encoder LR</strong>: $5 \times 10^{-5}$ (Pretrained ResNet-101)</li>
<li><strong>Decoder LR</strong>: $1 \times 10^{-4}$ (Randomly initialized Transformer)</li>
<li><strong>Optimizer</strong>: Not named explicitly; the reported momentum of 0.9 and weight decay of 0.0001 suggest SGD with momentum</li>
<li><strong>Batch Size</strong>: 256</li>
</ul>
</li>
<li><strong>Inference</strong>:
<ul>
<li><strong>Beam Search</strong>: A beam size of <strong>3</strong> is used. Larger beam sizes (4-5) degraded BLEU/ROUGE scores due to increased redundancy.</li>
</ul>
</li>
</ul>
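<p>The beam-search decoding described above can be sketched generically (a toy decoder, not the DGAT implementation; <code>step_logprobs</code> is a stand-in for the model&rsquo;s next-token log-probability function):</p>

```python
def beam_search(step_logprobs, beam_size=3, eos=1, max_len=64):
    """Keep the `beam_size` highest-scoring partial sequences per step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished beams pass through
                continue
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]  # highest-scoring finished sequence
```

<p>Larger beams keep more near-duplicate branches alive, which is consistent with the observation that sizes 4&ndash;5 degraded BLEU/ROUGE through redundancy.</p>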
<h3 id="models">Models</h3>
<ul>
<li><strong>Visual Encoder</strong>:
<ul>
<li><strong>Backbone</strong>: ResNet-101 initialized with ImageNet weights</li>
<li><strong>Structure</strong>: Convolutional layers preserved up to the final module. Classification head removed.</li>
<li><strong>Pooling</strong>: A $7 \times 7$ average pooling layer is used to extract global visual features.</li>
</ul>
</li>
<li><strong>Sequence Decoder</strong>:
<ul>
<li><strong>Architecture</strong>: Transformer-based with CGFE and SDGLA modules.</li>
<li><strong>Depth</strong>: 3 Transformer layers</li>
<li><strong>Dropout</strong>: Not utilized</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is reported using sequence-level and structure-level metrics.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">DGAT Score</th>
          <th style="text-align: left">Baseline (MPOCSR)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU-4</strong></td>
          <td style="text-align: left"><strong>84.0%</strong></td>
          <td style="text-align: left">78.7%</td>
          <td style="text-align: left">Measures n-gram precision</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ROUGE</strong></td>
          <td style="text-align: left"><strong>90.8%</strong></td>
          <td style="text-align: left">88.9%</td>
          <td style="text-align: left">Sequence recall metric</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">97.6%</td>
          <td style="text-align: left">Structural similarity fingerprint</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Accuracy</strong></td>
          <td style="text-align: left"><strong>54.6%</strong></td>
          <td style="text-align: left">35.7%</td>
          <td style="text-align: left">Exact structure match rate</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/Drwr97/DGAT">DGAT</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation with training and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, R., Ji, Y., Li, Y., &amp; Lee, S.-T. (2025). Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition. <em>The Journal of Physical Chemistry Letters</em>, 16(50), 12787-12795. <a href="https://doi.org/10.1021/acs.jpclett.5c03057">https://doi.org/10.1021/acs.jpclett.5c03057</a></p>
<p><strong>Publication</strong>: The Journal of Physical Chemistry Letters 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Drwr97/DGAT">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wang2025dgat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Rui and Ji, Yujin and Li, Youyong and Lee, Shuit-Tong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{The Journal of Physical Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{12787--12795}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jpclett.5c03057}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemVLM: A Multimodal Large Language Model for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemvlm/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemvlm/</guid><description>A 26B parameter multimodal LLM for chemistry, combining InternViT-6B and ChemLLM-20B for molecular structure recognition, property prediction, and reasoning.</description><content:encoded><![CDATA[<h2 id="paper-classification-method-and-resource">Paper Classification: Method and Resource</h2>
<p>This paper is a combination of <strong>Method</strong> (primary) and <strong>Resource</strong> (secondary).</p>
<p>It is primarily a <strong>Method</strong> paper because it proposes <strong>ChemVLM</strong>, a novel multimodal architecture specifically tailored for the chemical domain, utilizing a &ldquo;ViT-MLP-LLM&rdquo; framework. The authors introduce a specific two-stage training strategy to align visual features with chemical text representations.</p>
<p>Secondarily, it is a <strong>Resource</strong> paper as it introduces a comprehensive suite of three new datasets: <strong>ChemOCR</strong>, <strong>MMCR-Bench</strong>, and <strong>MMChemBench</strong>, developed to rigorously evaluate multimodal capabilities in chemistry, covering OCR, reasoning, and property prediction.</p>
<h2 id="bridging-the-visual-gap-in-chemical-llms">Bridging the Visual Gap in Chemical LLMs</h2>
<p>The primary motivation is that existing models struggle to handle the inherently multimodal nature of chemistry.</p>
<ul>
<li><strong>Visual Data Gap</strong>: Chemical tasks heavily rely on visual information (molecular structures, reactions) which purely text-based chemical LLMs cannot process.</li>
<li><strong>Limitations of Generalist Models</strong>: General multimodal models (like GPT-4V or LLaVA) lack specialized chemical domain knowledge, leading to hallucinations or misinterpretations.</li>
<li><strong>Inadequacy of OCR Tools</strong>: Traditional <a href="/notes/chemistry/optical-structure-recognition/">chemical OCR</a> tools (like <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a>) excel at modality conversion (Image-to-<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) but fail at complex reasoning tasks.</li>
</ul>
<h2 id="domain-specific-data-curation-and-benchmarking">Domain-Specific Data Curation and Benchmarking</h2>
<ul>
<li><strong>Data-Driven Alignment</strong>: The underlying &ldquo;ViT-MLP-LLM&rdquo; framework is standard in multimodal modeling, paralleling architectures like LLaVA. The core innovation here is the rigorous creation of a bilingual multimodal dataset spanning hand-drawn molecules, reactions, and exam questions augmented with style transfers. The training data pipeline heavily relies on generating synthetic variance using tools like RanDepict and <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> to introduce distortions, rotations, and handwritten styles, alongside GPT-4 generated prompts to ensure linguistic diversity.</li>
<li><strong>Model Integration</strong>: ChemVLM merges <strong>InternViT-6B</strong> (a large-scale vision transformer) with <strong><a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM-20B</a></strong> (a chemical language model). Visual features $X_v$ are mapped into the linguistic embedding space via an MLP projector, producing aligned token sequences alongside text instructions $X_q$. The joint multimodal sequence is trained using standard autoregressive next-token prediction:
$$ \mathcal{L} = -\sum_{i} \log P(y_i \mid X_v, X_q, y_{&lt;i}) $$</li>
<li><strong>Three Custom Benchmarks</strong>: The authors introduce tailored benchmarks to assess distinct competencies:
<ul>
<li><strong>ChemOCR</strong>: For image-to-SMILES conversion.</li>
<li><strong>MMCR-Bench</strong>: College entrance exam questions testing complex logical reasoning.</li>
<li><strong>MMChemBench</strong>: For molecule captioning and zero-shot property prediction.</li>
</ul>
</li>
</ul>
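<p>The training objective above is ordinary next-token cross-entropy over the concatenated multimodal sequence. A toy sketch of that computation in pure Python (the per-token probabilities here are invented for illustration; this is not the actual ChemVLM training code):</p>

```python
import math

def autoregressive_nll(step_probs):
    """L = -sum_i log P(y_i | X_v, X_q, y_<i).

    `step_probs` holds, for each target token y_i, the model's predicted
    probability of that token given the visual tokens X_v, the text
    instruction X_q, and all previous targets y_<i.
    """
    return -sum(math.log(p) for p in step_probs)

# Hypothetical per-token probabilities for a 4-token target sequence.
probs = [0.9, 0.7, 0.8, 0.95]
loss = autoregressive_nll(probs)
print(round(loss, 4))  # -> 0.7365
```

<p>Higher confidence on every target token drives the sum toward zero; a single near-zero probability dominates the loss, which is what pushes the model to align visual features with the correct chemical text.</p>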
<h2 id="evaluating-chemical-ocr-and-reasoning">Evaluating Chemical OCR and Reasoning</h2>
<p>The authors benchmarked ChemVLM against both open-source (LLaVA, Qwen-VL, InternVL) and proprietary (GPT-4V) models across three primary domains:</p>
<ol>
<li><strong>Chemical OCR</strong>: Evaluated on 1,000 image-text pairs from ChemOCR. The primary metric is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between the Morgan fingerprints of the generated structure ($A$) and the ground-truth SMILES ($B$):
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$
They report both the average Tanimoto similarity and the strict exact-match rate (<code>Tanimoto@1.0</code>).</li>
<li><strong>Multimodal Chemical Reasoning (MMCR)</strong>: Tested on MMCR-Bench (1,000 exam questions), ScienceQA, and CMMU. Performance was scored based on accuracy for multiple-choice and fill-in-the-blank questions.</li>
<li><strong>Multimodal Molecule Understanding</strong>: Evaluated on MMChemBench for molecule captioning and property prediction.</li>
<li><strong>Text-Only Reasoning</strong>: Tested on SciBench, a text-only benchmark for university-level science, to ensure the model retains fundamental linguistic reasoning.</li>
<li><strong>Generalization</strong>: Tested on non-chemistry subjects within the CMMU framework (Biology, Physics, Math) to assess cross-domain competence.</li>
</ol>
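<p>The Tanimoto metric in item 1 reduces to a set operation over fingerprint bits. A minimal sketch using plain Python sets as stand-ins for the Morgan fingerprints the paper computes with RDKit:</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity T(A, B) = |A & B| / (|A| + |B| - |A & B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Toy "fingerprints": sets of on-bit indices standing in for Morgan bits.
pred = {1, 2, 3, 5, 8}
truth = {1, 2, 3, 5, 13}
sim = tanimoto(pred, truth)   # 4 shared bits, 6 distinct bits total
print(round(sim, 3))          # -> 0.667
exact = sim == 1.0            # the strict Tanimoto@1.0 criterion
```

<p>The average of <code>sim</code> over the test set gives the reported mean Tanimoto similarity, while the fraction of pairs with <code>sim == 1.0</code> gives <code>Tanimoto@1.0</code>.</p>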
<h2 id="performance-gains-and-existing-limitations">Performance Gains and Existing Limitations</h2>
<ul>
<li><strong>Multimodal Reasoning Leadership</strong>: ChemVLM achieved state-of-the-art results on MMCR-Bench (41.7%), surpassing generalist models like GPT-4V (40.1%). However, scoring for portions of these benchmarks relied heavily on an LLM-as-a-judge (the Qwen-max API), which can introduce bias as LLM evaluators often favor structural characteristics and verbosity produced by similar autoregressive models. Furthermore, the model was fine-tuned on 200,000 exam questions and tested on MMCR-Bench (also derived from Chinese college entrance exams). While the authors state the data was deduplicated, the potential for data leakage remains a significant unaddressed confounder.</li>
<li><strong>Superior Understanding</strong>: In molecule captioning and prediction, ChemVLM showed significant improvements over general baseline models, scoring 80.9% on prediction compared to GPT-4V&rsquo;s 38.6%. This is a natural consequence of testing a custom-trained model on domain-specific benchmarks.</li>
<li><strong>OCR Capabilities vs. Dedicated Tools</strong>: ChemVLM outperformed generalist MLLMs in chemical structure recognition, achieving an average Tanimoto similarity of 71.0% (vs. GPT-4V&rsquo;s 15.0%). However, it remains significantly inferior to pure structural OCR tools like MolScribe in strict modality conversion tasks, only achieving an exact structural match (<code>Tanimoto@1.0</code>) of 42.9% compared to MolScribe&rsquo;s 89.1%.</li>
<li><strong>Textual Retention and Generalization Claims</strong>: The authors claim the diverse training strategy imparts broad scientific reasoning, pointing to performance retention on non-chemistry subjects (Biology, Physics, Math) and strong results on the purely textual SciBench benchmark. However, this cross-domain generalization most likely stems from the underlying base model (ChemLLM-20B/InternLM2) or the inclusion of 1.3 million &ldquo;General&rdquo; visual QA pairs in their training blend, rather than emergent general scientific skills originating purely from learning chemistry representations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training and evaluation data relied on a mix of open-source repositories and custom curation. Many of the curated datasets have been formally released by the authors on Hugging Face (<a href="https://huggingface.co/datasets/di-zhang-fdu/chemvlm-sft-datasets"><code>di-zhang-fdu/chemvlm-sft-datasets</code></a>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Source/Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/">DECIMER HDM</a></strong></td>
          <td>7,000+ hand-drawn molecular images.</td>
      </tr>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>MolScribe Data</strong></td>
          <td>Scanned/photographed images from literature.</td>
      </tr>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>Synthetic</strong></td>
          <td>Generated via ChemDraw, RDKit, and Indigo with style transfer (blurring, rotation, handwritten styles).</td>
      </tr>
      <tr>
          <td><strong>Training (Reaction)</strong></td>
          <td><strong>PEACE &amp; USPTO-50K</strong></td>
          <td>Inorganic and organic reaction schemes.</td>
      </tr>
      <tr>
          <td><strong>Training (Reasoning)</strong></td>
          <td><strong>Exam Questions</strong></td>
          <td>200,000 questions from OpenDataLab (Chinese education level). <a href="https://huggingface.co/collections/di-zhang-fdu/multi-corpus-datasets-for-chemllm">Available on Hugging Face</a>.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>ChemOCR</strong></td>
          <td>1,000 bilingual image-text pairs for SMILES recognition. Released via Google Drive link in repo.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>MMCR-Bench</strong></td>
          <td>1,000 multimodal chemistry exam questions. <strong>Requires emailing authors directly for access.</strong></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>MMChemBench</strong></td>
          <td>Extension of <a href="/notes/chemistry/llm-applications/chembench-llm-chemistry-evaluation/">ChemBench</a> for captioning and property prediction. Released via Google Drive link in repo.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: Images were augmented using <strong>RanDepict</strong> for style variation. Text data (SMILES) was validated and cleaned. Prompts were diversified using GPT-4 to generate different linguistic styles.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;ViT-MLP-LLM&rdquo; structure.
<ul>
<li><strong>Vision Encoder</strong>: InternViT-6B, processing images at $448 \times 448$ resolution. Images are segmented into tiles (max 12).</li>
<li><strong>Projector</strong>: Multi-Layer Perceptron (MLP) initialized randomly to map visual features to text embedding space.</li>
<li><strong>LLM</strong>: ChemLLM-20B, a domain-specific model.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Two-stage supervised fine-tuning.
<ol>
<li><strong>Modal Alignment</strong>: Freeze LLM and base Vision Encoder weights. Train only the randomly initialized MLP projector and LoRA layers (rank 32) of the Vision Encoder. Uses diverse multimodal data.</li>
<li><strong>Supervised Fine-Tuning (SFT)</strong>: Keep LLM and Vision Encoder base weights frozen, but add LoRA (rank 16) to the LLM and retain LoRA (rank 32) on the Vision Encoder. The MLP projector is fully trained. Data includes specialized chemistry and general corpora.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: AdamW</li>
<li>Context Length: 2048 tokens</li>
<li>Chat Template: InternLM2 dialogue schema</li>
</ul>
</li>
</ul>
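<p>The LoRA layers in both stages follow the standard low-rank adaptation idea: the frozen weight $W$ is perturbed by $\Delta W = BA$ with rank $r$ far smaller than the layer width, so only $B$ and $A$ are trained. A dimension-only sketch in plain Python (the paper uses ranks 32 and 16; everything else here is illustrative):</p>

```python
def matmul(A, B):
    """Plain-Python matrix product over lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_delta(B, A):
    """Low-rank weight update delta_W = B @ A, shape (d_out, d_in)."""
    return matmul(B, A)

d_out, d_in, rank = 8, 8, 2  # toy sizes; ChemVLM uses rank 32 / 16
B = [[0.0] * rank for _ in range(d_out)]  # zero-init: delta starts at 0
A = [[0.1] * d_in for _ in range(rank)]
delta = lora_delta(B, A)

# Only B and A are trained, not the full d_out x d_in matrix.
trainable = d_out * rank + rank * d_in
print(trainable, d_out * d_in)  # -> 32 64
```

<p>Zero-initializing $B$ means the adapted model starts exactly at the frozen base model, which is why stage 1 can train only the projector and vision-encoder LoRA without destabilizing the LLM.</p>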
<h3 id="models">Models</h3>
<ul>
<li><strong>ChemVLM-26B</strong>: The primary model released. It combines the 6B parameter vision encoder and the 20B parameter language model. Weights are fully available at <a href="https://huggingface.co/AI4Chem/ChemVLM-26B-1-2"><code>AI4Chem/ChemVLM-26B-1-2</code></a>. An 8B version is also available.</li>
<li><strong>Baselines</strong>: Comparisons were made against <strong>GPT-4V</strong>, <strong>Qwen-VL-Chat</strong>, <strong>LLaVA-v1.5-13B</strong>, <strong>InternVL-v1.5</strong>, and <strong>Yi-VL-Plus</strong>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured across three distinct task types. Exact <a href="https://github.com/lijunxian111/ChemVlm/tree/master/evaluation">evaluation scripts</a> have been released in the official repository.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto Similarity</strong></td>
          <td>ChemOCR</td>
          <td>Comparison of generated SMILES vs. ground truth using RDKit. Reports Average Similarity and <code>Tanimoto@1.0</code> (exact match).</td>
      </tr>
      <tr>
          <td><strong>Accuracy</strong></td>
          <td>MMCR (Reasoning)</td>
          <td>+1 point for correct multiple-choice/fill-in-the-blank; 0 otherwise. Scored via Qwen-max API prompting.</td>
      </tr>
      <tr>
          <td><strong>Prediction Score</strong></td>
          <td>Property Prediction</td>
          <td>Evaluated on MMChemBench subsets.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Compute</strong>: Training utilized <strong>16 NVIDIA A100 (80GB)</strong> GPUs.</li>
<li><strong>Configuration</strong>:
<ul>
<li>Batch size: 4 (per GPU, resulting in an effective global batch size of 256)</li>
<li>Gradient Accumulation: 4 iterations</li>
<li>Precision: <strong><a href="https://en.wikipedia.org/wiki/DeepSpeed">Deepspeed</a> bfloat16 (bf16)</strong> with <strong>ZeRO-3</strong> offloading strategy</li>
<li>Framework: Training runs on the InternVL-v1.5 codebase rather than standalone scripts.</li>
</ul>
</li>
<li><strong>Inference Compute</strong>: Evaluating the 26B model requires at least one 80GB A100 GPU (with Flash Attention + bfloat16). The 8B variant requires a GPU with at least 48GB of VRAM.</li>
</ul>
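<p>As a quick check, the effective global batch size quoted above follows from per-GPU batch &times; GPU count &times; gradient-accumulation steps:</p>

```python
# Effective global batch size implied by the reported ChemVLM setup.
per_gpu_batch = 4   # sequences per GPU per step
num_gpus = 16       # NVIDIA A100 80GB
grad_accum = 4      # gradient-accumulation iterations
effective = per_gpu_batch * num_gpus * grad_accum
print(effective)  # -> 256
```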
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemVLM-26B">ChemVLM-26B</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Original 26B model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemVLM-26B-1-2">ChemVLM-26B-1-2</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Updated 26B model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/di-zhang-fdu/chemvlm-sft-datasets">chemvlm-sft-datasets</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SFT training data (~51.7k rows)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/lijunxian111/ChemVlm">ChemVlm (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, evaluation, and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, J., et al. (2025). ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(1), 415-423. <a href="https://doi.org/10.1609/aaai.v39i1.32020">https://doi.org/10.1609/aaai.v39i1.32020</a></p>
<p><strong>Publication</strong>: AAAI 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{li2025chemvlm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Junxian and Zhang, Di and Wang, Xunzhi and Hao, Zeying and Lei, Jingdi and Tan, Qian and Zhou, Cai and Liu, Wei and Yang, Yaotian and Xiong, Xinrui and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Li, Wei and Su, Mao and Zhang, Shufei and Ouyang, Wanli and Li, Yuqiang and Zhou, Dongzhan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{415--423}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://doi.org/10.1609/aaai.v39i1.32020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i1.32020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/lijunxian111/ChemVlm">Official Repository</a></li>
</ul>
]]></content:encoded></item><item><title>ChemReco: Hand-Drawn Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</guid><description>A deep learning method using EfficientNet and Transformer to convert hand-drawn chemical structures into SMILES codes, achieving 96.9% accuracy.</description><content:encoded><![CDATA[<h2 id="research-contribution--classification">Research Contribution &amp; Classification</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: The primary contribution is &ldquo;ChemReco,&rdquo; a specific deep learning pipeline (EfficientNet + Transformer) designed to solve the Optical Chemical Structure Recognition (OCSR) task for hand-drawn images. The authors conduct extensive ablation studies on architecture and data mixing ratios to validate performance.</li>
<li><strong>Resource</strong>: The authors explicitly state that &ldquo;the primary focus of this paper is constructing datasets&rdquo; due to the scarcity of hand-drawn molecular data. They introduce a comprehensive synthetic data generation pipeline involving RDKit modifications and image degradation to create training data.</li>
</ul>
<h2 id="motivation-digitizing-hand-drawn-chemical-sketches">Motivation: Digitizing Hand-Drawn Chemical Sketches</h2>
<p>Hand-drawing is the most intuitive method for chemists and students to record molecular structures. However, digitizing these drawings into machine-readable formats (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) usually requires time-consuming manual entry or specialized software.</p>
<ul>
<li><strong>Gap</strong>: Existing OCSR tools and rule-based methods often fail on hand-drawn sketches due to diverse writing styles, poor image quality, and the absence of labeled data.</li>
<li><strong>Application</strong>: Automated recognition enables efficient chemical research and allows for automatic grading in educational settings.</li>
</ul>
<h2 id="core-innovation-synthetic-pipeline-and-hybrid-architecture">Core Innovation: Synthetic Pipeline and Hybrid Architecture</h2>
<p>The paper introduces <strong>ChemReco</strong>, an end-to-end system for recognizing C-H-O structures. Key novelties include:</p>
<ol>
<li><strong>Synthetic Data Pipeline</strong>: A multi-stage generation method that modifies RDKit source code to randomize bond/angle parameters, followed by OpenCV-based augmentation, degradation, and background addition to simulate realistic hand-drawn artifacts.</li>
<li><strong>Architectural Choice</strong>: The specific application of <strong>EfficientNet</strong> (encoder) combined with a <strong>Transformer</strong> (decoder) for this domain, which the authors demonstrate outperforms the more common ResNet+LSTM baselines.</li>
<li><strong>Hybrid Training Strategy</strong>: Finding that a mix of 90% synthetic and 10% real data yields optimal performance, superior to using either dataset alone.</li>
</ol>
<h2 id="methodology--ablation-studies">Methodology &amp; Ablation Studies</h2>
<p>The authors performed a series of ablation studies and comparisons:</p>
<ul>
<li><strong>Synthesis Ablation</strong>: Evaluated the impact of each step in the generation pipeline (RDKit only $\rightarrow$ Augmentation $\rightarrow$ Degradation $\rightarrow$ Background) on validation loss and accuracy.</li>
<li><strong>Dataset Size Ablation</strong>: Tested model performance when trained on synthetic datasets ranging from 100,000 to 1,000,000 images.</li>
<li><strong>Real/Synthetic Ratio</strong>: Investigated the optimal mixing ratio of synthetic to real hand-drawn images (100:0, 90:10, 50:50, 10:90, 0:100), finding that the 90:10 ratio achieved 93.81% exact match, compared to 63.33% for synthetic-only and 65.83% for real-only.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked four encoder-decoder combinations: ResNet vs. EfficientNet encoders paired with LSTM vs. Transformer decoders.</li>
<li><strong>Baseline Comparison</strong>: Compared results against a related study utilizing a CNN+LSTM framework.</li>
</ul>
<h2 id="results--interpretations">Results &amp; Interpretations</h2>
<ul>
<li><strong>Best Performance</strong>: The EfficientNet + Transformer model trained on a 90:10 synthetic-to-real ratio achieved a <strong>96.90% Exact Match</strong> rate on the test set.</li>
<li><strong>Background Robustness</strong>: When training on synthetic data alone (no real images), the best accuracy on background-free test images was approximately 46% (using RDKit-aug-deg), while background test images reached approximately 53% (using RDKit-aug-bkg-deg). Adding random backgrounds during training helped prevent the model from overfitting to clean white backgrounds.</li>
<li><strong>Data Volume</strong>: Increasing the synthetic dataset size from 100k to 1M consistently improved accuracy (average exact match: 49.40% at 100k, 54.29% at 200k, 61.31% at 500k, 63.33% at 1M, all without real images in training).</li>
<li><strong>Encoder-Decoder Comparison</strong> (at 90:10 mix with 1M images):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Encoder</th>
          <th style="text-align: left">Decoder</th>
          <th style="text-align: left">Avg. Exact Match (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">93.81</td>
      </tr>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left">94.76</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">96.31</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left"><strong>96.90</strong></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Superiority over Baselines</strong>: The model outperformed the cited CNN+LSTM baseline from ChemPix (93% vs 76% on the ChemPix test set).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Restricted atom types</strong>: The system only handles molecules composed of carbon, hydrogen, and oxygen (C-H-O), excluding nitrogen, sulfur, halogens, and other heteroatoms commonly found in organic chemistry.</li>
<li><strong>Structural complexity</strong>: Only structures with at most one ring are supported. Complex multi-ring systems and fused ring structures are not covered.</li>
<li><strong>Dataset availability</strong>: The real hand-drawn dataset (2,598 images) is not publicly released and is only available upon request from the corresponding author.</li>
<li><strong>Future directions</strong>: The authors suggest expanding to more heteroatoms, complex ring structures, and applications in automated grading of chemistry exams.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/a-die/hdr-DeepLearning">hdr-DeepLearning</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation in PyTorch</td>
      </tr>
      <tr>
          <td style="text-align: left">Paper</td>
          <td style="text-align: left">Publication</td>
          <td style="text-align: left">CC-BY-4.0</td>
          <td style="text-align: left">Open access via Nature</td>
      </tr>
  </tbody>
</table>
<p>The real hand-drawn dataset (2,598 images) is available upon request from the corresponding author, not publicly downloadable. The synthetic data generation pipeline is described in detail but relies on modified RDKit source code, which is included in the repository.</p>
<h3 id="data">Data</h3>
<p>The study utilizes a combination of collected SMILES data, real hand-drawn images, and generated synthetic images.</p>
<ul>
<li><strong>Source Data</strong>: SMILES codes collected from PubChem, ZINC, <a href="/notes/chemistry/datasets/gdb-11/">GDB-11</a>, and <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>. Filtered for C, H, O atoms and max 1 ring.</li>
<li><strong>Real Dataset</strong>: 670 selected SMILES codes drawn by multiple volunteers, totaling <strong>2,598 images</strong>.</li>
<li><strong>Synthetic Dataset</strong>: Generated up to <strong>1,000,000 images</strong> using the pipeline below.</li>
<li><strong>Training Mix</strong>: The optimal training set used 1 million images with a <strong>90:10 ratio</strong> of synthetic to real images.</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset Type</th>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Real</strong></td>
          <td style="text-align: left">Volunteer Drawings</td>
          <td style="text-align: left">2,598 images</td>
          <td style="text-align: left">Used for mixed training and testing</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Synthetic</strong></td>
          <td style="text-align: left">Generated</td>
          <td style="text-align: left">100k - 1M</td>
          <td style="text-align: left">Generated via modified RDKit + OpenCV augmentation/degradation; optionally enhanced with Stable Diffusion</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>Synthetic Image Generation Pipeline</strong> is critical for reproduction:</p>
<ol>
<li><strong>RDKit Modification</strong>: Modify source code to introduce random keys, character width, length, and bond angles.</li>
<li><strong>Augmentation (OpenCV)</strong>: Apply sequence: Resize ($p=0.5$), Blur ($p=0.4$), Erode/Dilate ($p=0.2$), Distort ($p=0.8$), Flip ($p=0.5$), Affine ($p=0.7$).</li>
<li><strong>Degradation</strong>: Apply sequence: Salt+pepper noise ($p=0.1$), Contrast ($p=0.7$), Sharpness ($p=0.5$), Invert ($p=0.3$).</li>
<li><strong>Background Addition</strong>: Random backgrounds are augmented (Crop, Distort, Flip) and added to the molecular image to prevent background overfitting.</li>
<li><strong>Diffusion Enhancement</strong>: Stable Diffusion (v1-4) is used for image-to-image enhancement to better simulate hand-drawn styles (prompt: &ldquo;A pencil sketch of [Formula]&hellip; without charge distribution&rdquo;).</li>
</ol>
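<p>Steps 2 and 3 apply each transform independently with a fixed probability, in sequence. A minimal sketch of that sampling logic (transform names and probabilities are from the paper; the transform bodies are stubbed out, whereas the real pipeline calls OpenCV):</p>

```python
import random

# (name, probability) pairs for the augmentation and degradation stages.
AUGMENT = [("resize", 0.5), ("blur", 0.4), ("erode_dilate", 0.2),
           ("distort", 0.8), ("flip", 0.5), ("affine", 0.7)]
DEGRADE = [("salt_pepper", 0.1), ("contrast", 0.7),
           ("sharpness", 0.5), ("invert", 0.3)]

def apply_pipeline(image, stages, rng):
    """Fire each transform with its probability, in order; return the
    image plus the list of transforms that fired (bodies are stubs)."""
    applied = []
    for name, p in stages:
        if rng.random() < p:
            applied.append(name)  # real code would transform `image` here
    return image, applied

rng = random.Random(0)
_, fired = apply_pipeline("molecule.png", AUGMENT + DEGRADE, rng)
print(fired)
```

<p>Because each transform is sampled independently, the pipeline yields a combinatorially large space of distortion combinations from a single rendered molecule, which is what lets 1M synthetic images cover varied hand-drawn styles.</p>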
<h3 id="models">Models</h3>
<p>The system uses an encoder-decoder architecture:</p>
<ul>
<li><strong>Encoder</strong>: <strong>EfficientNet</strong> (pre-trained on ImageNet). The last layer is removed, and features are extracted as a NumPy array.</li>
<li><strong>Decoder</strong>: <strong>Transformer</strong>. Utilizes self-attention to generate the SMILES sequence. Chosen over LSTM for better handling of long-range dependencies.</li>
<li><strong>Output</strong>: Canonical SMILES string.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: <strong>Exact Match (EM)</strong>. A strict binary evaluation checking whether the complete generated SMILES perfectly replicates the target string.</li>
<li><strong>Other Metrics</strong>: <strong>Levenshtein Distance</strong> measures edit-level character proximity, while the <strong>Tanimoto coefficient</strong> evaluates structural similarity based on chemical fingerprints. Both were monitored during validation ablation runs.</li>
</ul>
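<p>Exact match and Levenshtein distance are both plain string metrics over the predicted and target SMILES. A compact reference implementation of the standard dynamic-programming edit distance (not the authors' code):</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

pred, truth = "CCO", "CC=O"
print(pred == truth)             # exact match -> False
print(levenshtein(pred, truth))  # one insertion -> 1
```

<p>Exact match is the binary <code>pred == truth</code> check averaged over the test set; Levenshtein gives partial credit for near-miss SMILES, while the Tanimoto coefficient instead compares the decoded structures via fingerprints.</p>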
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Baseline (CNN+LSTM)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Exact Match</strong></td>
          <td style="text-align: left"><strong>96.90%</strong></td>
          <td style="text-align: left">76%</td>
          <td style="text-align: left">Tested on the provided test set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>CPU</strong>: Intel(R) Xeon(R) Gold 6130 (40 GB RAM).</li>
<li><strong>GPU</strong>: NVIDIA Tesla V100 (32 GB video memory).</li>
<li><strong>Framework</strong>: PyTorch 1.9.1.</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Optimizer: Adam (learning rate 1e-4).</li>
<li>Batch size: 32.</li>
<li>Epochs: 100.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, H., Liu, W., Tao, J., et al. (2024). ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning. <em>Scientific Reports</em>, 14, 17126. <a href="https://doi.org/10.1038/s41598-024-67496-7">https://doi.org/10.1038/s41598-024-67496-7</a></p>
<p><strong>Publication</strong>: Scientific Reports 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/a-die/hdr-DeepLearning">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ouyangChemRecoAutomatedRecognition2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemReco: Automated Recognition of Hand-Drawn Carbon--Hydrogen--Oxygen Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Hengjie and Liu, Wei and Tao, Jiajun and Luo, Yanghong and Zhang, Wanjia and Zhou, Jiayu and Geng, Shuqi and Zhang, Chengpeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{17126}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-024-67496-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AtomLenz: Atom-Level OCSR with Limited Supervision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</guid><description>Weakly supervised OCSR framework combining object detection and graph construction to recognize chemical structures from hand-drawn images using SMILES.</description><content:encoded><![CDATA[<h2 id="dual-contribution-method-and-data-resource">Dual Contribution: Method and Data Resource</h2>
<p>The paper proposes an architecture (AtomLenz) and training framework (ProbKT* + Edit-Correction) to solve the problem of Optical Chemical Structure Recognition (OCSR) in data-sparse domains. It also releases a curated, relabeled dataset of hand-drawn molecules with atom-level bounding box annotations.</p>
<h2 id="overcoming-annotation-bottlenecks-in-ocsr">Overcoming Annotation Bottlenecks in OCSR</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for digitizing chemical literature and lab notes. However, existing methods face three main limitations:</p>
<ol>
<li><strong>Generalization Limits:</strong> They struggle with sparse or stylistically unique domains, such as hand-drawn images, where massive datasets for pretraining are unavailable.</li>
<li><strong>Annotation Cost:</strong> &ldquo;Atom-level&rdquo; methods (which detect individual atoms and bonds) require expensive bounding box annotations, which are rarely available for real-world sketch data.</li>
<li><strong>Lack of Interpretability/Localization:</strong> Pure &ldquo;Image-to-SMILES&rdquo; models (like DECIMER) work well but fail to localize the atoms or bonds in the original image, limiting human-in-the-loop review and mechanistic interpretability.</li>
</ol>
<h2 id="atomlenz-probkt-and-graph-edit-correction">AtomLenz, ProbKT*, and Graph Edit-Correction</h2>
<p>The core contribution is <strong>AtomLenz</strong>, an OCSR framework that achieves atom-level entity detection using <strong>only SMILES supervision</strong> on target domains. The authors construct an explicit object detection pipeline using Faster R-CNN trained via a composite multi-task loss. The objective aims to optimize a multi-class log loss $L_{cls}$ for predicted class $\hat{c}$ and a regression loss $L_{reg}$ for predicted bounding box coordinates $\hat{b}$:</p>
<p>$$ \mathcal{L} = L_{cls}(c, \hat{c}) + L_{reg}(b, \hat{b}) $$</p>
<p>To bridge the gap between image inputs and the weakly supervised SMILES labels, the system leverages:</p>
<ul>
<li><strong>ProbKT* (Probabilistic Knowledge Transfer):</strong> Uses probabilistic logic and Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; derived from the SMILES strings, enabling backpropagation without explicit bounding boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Generates pseudo-labels by solving an optimization problem that finds the smallest edit on the predicted graph such that the corrected graph and the ground-truth SMILES graph become isomorphic, which forces fine-tuning on less frequent atom types. The combination of ProbKT* and Edit-Correction is abbreviated as <strong>EditKT</strong>*.</li>
<li><strong>ChemExpert:</strong> A chemically sound ensemble strategy that cascades predictions from multiple models (e.g., passing through DECIMER, then AtomLenz), halting at the first output that clears basic RDKit chemical validity checks.</li>
</ul>
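<p>The cascade logic of ChemExpert can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the toy validity check are ours (the paper relies on RDKit parsing, roughly <code>Chem.MolFromSmiles(s) is not None</code>, for the validity gate).</p>

```python
# Illustrative ChemExpert-style cascade: run predictors in priority order and
# return the first prediction that passes a chemical-validity check.

def chem_expert(image, predictors, is_valid):
    """predictors: ordered list of (name, image -> SMILES) pairs."""
    for name, predict in predictors:
        smiles = predict(image)
        if smiles is not None and is_valid(smiles):
            return name, smiles
    return None, None  # no model produced a chemically valid structure

# Toy validity check standing in for RDKit's SMILES parser.
def toy_is_valid(smiles):
    return len(smiles) > 0 and smiles.count("(") == smiles.count(")")

# Toy predictors: the first returns an invalid string, so the cascade falls through.
predictors = [
    ("decimer", lambda img: "C1=CC=CC1("),   # unbalanced parenthesis -> rejected
    ("atomlenz", lambda img: "c1ccccc1"),    # benzene -> accepted
]
winner, smiles = chem_expert("fake_image", predictors, toy_is_valid)
```

<p>Because the check only gates on validity, the ordering of <code>predictors</code> encodes the user's trust in each model, exactly as described above.</p>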
<h2 id="data-efficiency-and-domain-adaptation-experiments">Data Efficiency and Domain Adaptation Experiments</h2>
<p>The authors evaluated the model specifically on domain adaptation and sample efficiency, treating hand-drawn molecules as the primary low-data target distribution:</p>
<ul>
<li><strong>Pretraining:</strong> Initially trained on ~214k synthetic images from ChEMBL explicitly labeled with bounding boxes (generated via RDKit).</li>
<li><strong>Target Domain Adaptation:</strong> Fine-tuned on the Brinkhaus hand-drawn dataset (4,070 images) using purely SMILES supervision.</li>
<li><strong>Evaluation Sets:</strong>
<ul>
<li><strong>Hand-drawn test set</strong>: 1,018 images.</li>
<li><strong>ChemPix</strong>: 614 out-of-domain hand-drawn images.</li>
<li><strong>Atom Localization set</strong>: 1,000 synthetic images to evaluate precise bounding box capabilities.</li>
</ul>
</li>
<li><strong>Baselines:</strong> Compared against leading OCSR methods, including DECIMER (v2.2.0), Img2Mol, MolScribe, ChemGrapher, and OSRA.</li>
</ul>
<h2 id="state-of-the-art-ensembles-vs-standalone-limitations">State-of-the-Art Ensembles vs. Standalone Limitations</h2>
<ul>
<li><strong>SOTA Ensemble Performance:</strong> The <strong>ChemExpert</strong> module (combining AtomLenz and DECIMER) achieved state-of-the-art accuracy on both hand-drawn (63.5%) and ChemPix (51.8%) test sets.</li>
<li><strong>Data Efficiency under Bottleneck Regimes:</strong> AtomLenz effectively bypassed the massive data constraints of competing models. When all methods were retrained from scratch on the same 4,070-sample hand-drawn training set (enriched with atom-level annotations from EditKT*), AtomLenz achieved 33.8% exact accuracy, outperforming baselines like Img2Mol (0.0%), MolScribe (1.3%), and DECIMER (0.1%), illustrating its sample efficiency.</li>
<li><strong>Localization Success:</strong> The base framework achieved strong localization (mAP 0.801), a capability not provided by end-to-end transformers like DECIMER.</li>
<li><strong>Methodological Tradeoffs:</strong> While AtomLenz is highly sample efficient, its standalone performance when fine-tuned on the target domain (33.8% accuracy) underperforms fine-tuned models trained on larger datasets like DECIMER (62.2% accuracy). AtomLenz achieves state-of-the-art results primarily when deployed as part of the ChemExpert ensemble alongside DECIMER, since errors from the two approaches tend to occur on different samples, allowing them to complement each other.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz">Official Repository (AtomLenz)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Complete pipeline for AtomLenz, ProbKT*, and Graph Edit-Correction.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz/tree/main/models">Pre-trained Models</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Downloadable weights for Faster R-CNN detection backbones.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset (Brinkhaus)</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Images and SMILES used for target domain fine-tuning and evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599172">Relabeled Hand-drawn Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">1,417 images with bounding box annotations generated via EditKT*.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/spaces/moldenhof/atomlenz">AtomLenz Web Demo</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Interactive Hugging Face space for testing model inference.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilizes a mix of large synthetic datasets and smaller curated hand-drawn datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>Synthetic ChEMBL</td>
          <td>~214,000</td>
          <td>Generated via RDKit/Indigo. Annotated with atoms, bonds, charges, stereocenters.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-drawn (Brinkhaus)</td>
          <td>4,070</td>
          <td>Used for weakly supervised adaptation (SMILES only).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Hand-drawn Test</td>
          <td>1,018</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChemPix</td>
          <td>614</td>
          <td>Out-of-distribution hand-drawn images.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Atom Localization</td>
          <td>1,000</td>
          <td>Synthetic images with ground truth bounding boxes.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Molecular Graph Constructor (Algorithm 1):</strong> A rule-based system to assemble the graph from detected objects:
<ol>
<li><strong>Filtering:</strong> Removes overlapping atom boxes (IoU threshold).</li>
<li><strong>Node Creation:</strong> Merges overlapping charge and stereocenter objects with their corresponding atom objects.</li>
<li><strong>Edge Creation:</strong> Iterates over bond objects; if a bond overlaps with exactly two atoms, an edge is added. If &gt;2, it selects the most probable pair.</li>
<li><strong>Validation:</strong> Checks valency constraints; removes bonds iteratively if constraints are violated.</li>
</ol>
</li>
<li><strong>Weakly Supervised Training:</strong>
<ul>
<li><strong>ProbKT*:</strong> Uses Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; implied by the SMILES string, allowing backpropagation without explicit boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Finds the smallest edit on the predicted graph such that the corrected and true SMILES graphs become isomorphic, then uses the correction to generate pseudo-labels for retraining.</li>
</ul>
</li>
</ul>
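<p>A condensed sketch of the graph constructor's filtering and edge-creation steps follows. The box format <code>(x1, y1, x2, y2)</code> and the IoU threshold are illustrative assumptions, and the "most probable pair" tie-break for bonds overlapping more than two atoms is omitted for brevity.</p>

```python
# Rule-based graph assembly from detected boxes (simplified Algorithm 1 sketch).

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def build_graph(atom_boxes, bond_boxes, iou_thresh=0.5):
    atoms = []
    for box in atom_boxes:                 # step 1: drop duplicate atom detections
        if all(iou(box, kept) < iou_thresh for kept in atoms):
            atoms.append(box)
    edges = []
    for bond in bond_boxes:                # step 3: bond overlapping exactly 2 atoms
        hits = [i for i, a in enumerate(atoms) if iou(bond, a) > 0]
        if len(hits) == 2:
            edges.append(tuple(hits))
    return atoms, edges

atoms, edges = build_graph(
    [(0, 0, 10, 10), (1, 1, 10, 10), (20, 0, 30, 10)],  # second box duplicates the first
    [(8, 2, 22, 8)],                                     # one bond spanning both atoms
)
```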
<h3 id="models">Models</h3>
<ul>
<li><strong>Object Detection Backbone:</strong> <strong>Faster R-CNN</strong>.
<ul>
<li>Four distinct models are trained for different entity types: Atoms ($O^a$), Bonds ($O^b$), Charges ($O^c$), and Stereocenters ($O^s$).</li>
<li><strong>Loss Function:</strong> Multi-task loss combining Multi-class Log Loss ($L_{cls}$) and Regression Loss ($L_{reg}$).</li>
</ul>
</li>
<li><strong>ChemExpert:</strong> An ensemble wrapper that prioritizes models based on user preference (e.g., DECIMER first, then AtomLenz). It accepts the first prediction that passes RDKit chemical validity checks.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metrics focused on structural correctness and localization accuracy.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Hand-drawn)</th>
          <th>Baseline (DECIMER FT)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (T=1)</strong></td>
          <td>33.8% (AtomLenz+EditKT*)</td>
          <td>62.2%</td>
          <td>Exact ECFP6 fingerprint match.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto Sim.</strong></td>
          <td>0.484</td>
          <td>0.727</td>
          <td>Average similarity.</td>
      </tr>
      <tr>
          <td><strong>mAP</strong></td>
          <td>0.801</td>
          <td>N/A</td>
          <td>Localization accuracy (IoU 0.05-0.35).</td>
      </tr>
      <tr>
          <td><strong>Ensemble Acc.</strong></td>
          <td><strong>63.5%</strong></td>
          <td>62.2%</td>
          <td>ChemExpert (DECIMER + AtomLenz).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Experiments utilized the Flemish Supercomputer Center (VSC) resources.</li>
<li><strong>Note:</strong> Specific GPU models (e.g., A100/V100) are not explicitly detailed in the text, but Faster R-CNN training is standard on consumer or enterprise GPUs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., De Brouwer, E., Arany, Á., &amp; Moreau, Y. (2024). Atom-Level Optical Chemical Structure Recognition with Limited Supervision. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2024.</p>
<p><strong>Publication venue/year</strong>: CVPR 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molden/atomlenz">Official Repository</a></li>
<li><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset on Figshare</a></li>
</ul>
<p><strong>BibTeX</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{oldenhofAtomLevelOpticalChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Atom-Level Optical Chemical Structure Recognition with Limited Supervision}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Oldenhof, Martijn and De Brouwer, Edward and Arany, {\&#39;A}d{\&#39;a}m and Moreau, Yves}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2404.01743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SwinOCSR: End-to-End Chemical OCR with Swin Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</guid><description>Deep learning model using Swin Transformer and Focal Loss for OCSR, achieving 98.58% accuracy on synthetic benchmarks.</description><content:encoded><![CDATA[<h2 id="contribution-methodological-architecture-and-datasets">Contribution: Methodological Architecture and Datasets</h2>
<p>This is a <strong>Methodological Paper</strong> with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).</li>
<li><strong>Resource</strong>: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.</li>
</ul>
<h2 id="motivation-addressing-visual-context-and-data-imbalance">Motivation: Addressing Visual Context and Data Imbalance</h2>
<ul>
<li><strong>Problem</strong>: OCSR (converting images of chemical structures to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.</li>
<li><strong>Technical Gap</strong>: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and miss global dependencies required for interpreting complex molecular diagrams.</li>
<li><strong>Data Imbalance</strong>: Chemical strings suffer from severe class imbalance (e.g., &lsquo;C&rsquo; and &lsquo;H&rsquo; are frequent; &lsquo;Br&rsquo; or &lsquo;Cl&rsquo; are rare), which causes standard Cross Entropy loss to underperform.</li>
</ul>
<h2 id="core-innovation-swin-transformers-and-focal-loss">Core Innovation: Swin Transformers and Focal Loss</h2>
<ul>
<li><strong>Swin Transformer Backbone</strong>: SwinOCSR replaces the standard CNN backbone with a <strong>Swin Transformer</strong>, using shifted window attention to capture both local and global image features more effectively.</li>
<li><strong>Multi-label Focal Loss (MFL)</strong>: The paper introduces a modified Focal Loss to OCSR, the first explicit attempt to address token imbalance in OCSR (per the authors). This penalizes the model for errors on rare tokens, addressing the &ldquo;long-tail&rdquo; distribution of chemical elements. The standard Focal Loss formulation heavily weights hard-to-classify examples:
$$
FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
$$</li>
<li><strong>Structured Synthetic Dataset</strong>: Creation of a dataset explicitly balanced across four structural categories: Kekule rings, Aromatic rings, and their combinations with substituents.</li>
</ul>
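<p>The focusing behavior of the loss above is easy to verify numerically. The sketch below implements the scalar formulation as written (the paper's MFL variant applies it per token with sigmoid outputs); the <code>alpha</code>/<code>gamma</code> values are the common defaults, not necessarily the paper's.</p>

```python
import math

# Scalar focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).
# The (1 - p_t)^gamma factor suppresses loss on well-classified tokens,
# leaving gradient signal concentrated on rare, hard tokens.

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9)   # confident, correct token: loss is nearly zero
hard = focal_loss(0.1)   # misclassified rare token: loss remains large
```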
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Backbone Comparison</strong>: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).</li>
<li><strong>Loss Function Ablation</strong>: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).</li>
<li><strong>Category Stress Test</strong>: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.</li>
<li><strong>Real-world Evaluation</strong>: The model was tested on 100 images manually extracted from the literature (with manually labeled SMILES), and separately on 100 CDK-generated images from those same SMILES, to measure the domain gap between synthetic and real-world data.</li>
</ul>
<h2 id="results-and-limitations">Results and Limitations</h2>
<ul>
<li><strong>Synthetic test set performance</strong>: With Multi-label Focal Loss (MFL), SwinOCSR achieved <strong>98.58% accuracy</strong> on the synthetic test set, compared to 97.36% with standard CE loss. Both ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones scored lower when using CE loss (Table 3).</li>
<li><strong>Handling of long sequences</strong>: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating effective global feature extraction.</li>
<li><strong>Per-category results</strong>: Performance was consistent across molecule categories: Category 1 (Kekule, 98.20%), Category 2 (Aromatic, 98.46%), Category 3 (Kekule + Substituents, 98.76%), Category 4 (Aromatic + Substituents, 98.89%). The model performed slightly better on molecules with substituents and aromatic rings.</li>
<li><strong>Domain shift</strong>: While performance on synthetic data was strong, accuracy dropped to <strong>25%</strong> on 100 real-world literature images. On 100 CDK-generated images from the same SMILES strings, accuracy was 94%, confirming that the gap stems from stylistic differences between CDK-rendered and real-world images. The authors attribute this to noise, low resolution, and variations such as condensed structural formulas and abbreviations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The first 8.5 million structures from <strong>PubChem</strong> were downloaded, yielding ~6.9 million unique SMILES.</li>
<li><strong>Generation Pipeline</strong>:
<ul>
<li><strong>Tools</strong>: <strong>CDK</strong> (Chemistry Development Kit) for image rendering; <strong>RDKit</strong> for SMILES canonicalization.</li>
<li><strong>Augmentation</strong>: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekule, (2) Aromatic, (3) Kekule + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.</li>
<li><strong>Preprocessing</strong>: Images rendered as binary, resized to <strong>224x224</strong>, and copied to 3 channels (RGB simulation).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>4,500,000</td>
          <td>18:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Loss Function</strong>: <strong>Multi-label Focal Loss (MFL)</strong>. The single-label classification task was cast as multi-label to apply Focal Loss, using a sigmoid activation on logits.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Optimizer</strong>: <strong>Adam</strong> with initial learning rate <code>5e-4</code>.</li>
<li><strong>Schedulers</strong>: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.</li>
<li><strong>Regularization</strong>: Dropout rate of <code>0.1</code>.</li>
</ul>
</li>
</ul>
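<p>The two-scheduler setup can be made concrete with plain decay functions. This is an illustrative sketch: the cosine form is standard, but the step size and decay factor below are assumptions, since the paper does not report them.</p>

```python
import math

# Cosine decay (used for the Swin backbone) and step decay (used for the
# Transformer encoder/decoder), as pure functions of the epoch index.

def cosine_decay(base_lr, epoch, total_epochs):
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

def step_decay(base_lr, epoch, step=10, factor=0.5):
    return base_lr * factor ** (epoch // step)

base_lr = 5e-4                                    # initial Adam learning rate
lr_backbone_mid = cosine_decay(base_lr, 15, 30)   # halfway through 30 epochs
lr_decoder_mid = step_decay(base_lr, 15)          # after one decay step
```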
<h3 id="models">Models</h3>
<ul>
<li><strong>Backbone (Encoder 1)</strong>: <strong>Swin Transformer</strong>.
<ul>
<li>Patch size: $4 \times 4$.</li>
<li>Linear embedding dimension: 192.</li>
<li>Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).</li>
<li>Output: Flattened patch sequence $S_b$.</li>
</ul>
</li>
<li><strong>Transformer Encoder (Encoder 2)</strong>: 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.</li>
<li><strong>Transformer Decoder</strong>: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).</li>
<li><strong>Tokenization</strong>: <strong>DeepSMILES</strong> format used (syntactically more robust than SMILES). Vocabulary size: <strong>76 tokens</strong> (the unique characters found in the dataset). Embedding dimension: 256.</li>
</ul>
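<p>Building a character-level vocabulary of the kind described above is a one-liner over the training strings. The special tokens in this sketch are our assumption; the paper only reports the 76-token size.</p>

```python
# Character-level vocabulary over DeepSMILES strings: special tokens first,
# then every distinct character observed in the corpus, sorted for determinism.

def build_vocab(strings, specials=("<pad>", "<sos>", "<eos>")):
    chars = sorted({ch for s in strings for ch in s})
    return {tok: idx for idx, tok in enumerate(list(specials) + chars)}

# "cccccc6" is the DeepSMILES rendering of benzene (ring size replaces the
# paired ring-closure digits of SMILES).
vocab = build_vocab(["CCO", "cccccc6"])
```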
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy (Exact Match), Tanimoto Similarity (PubChem fingerprints), BLEU, ROUGE.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SwinOCSR (CE)</th>
          <th>SwinOCSR (MFL)</th>
          <th>ResNet-50 (CE)</th>
          <th>EfficientNet-B3 (CE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>97.36%</td>
          <td><strong>98.58%</strong></td>
          <td>89.17%</td>
          <td>86.70%</td>
      </tr>
      <tr>
          <td>Tanimoto</td>
          <td>99.65%</td>
          <td><strong>99.77%</strong></td>
          <td>98.79%</td>
          <td>98.46%</td>
      </tr>
      <tr>
          <td>BLEU</td>
          <td>99.46%</td>
          <td><strong>99.59%</strong></td>
          <td>98.62%</td>
          <td>98.37%</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>99.64%</td>
          <td><strong>99.78%</strong></td>
          <td>98.87%</td>
          <td>98.66%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Trained on <strong>NVIDIA Tesla V100-PCIE</strong>.</li>
<li><strong>Training Time</strong>: 30 epochs.</li>
<li><strong>Batch Size</strong>: 256 images ($224 \times 224$ pixels).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/suanfaxiaohuo/SwinOCSR">SwinOCSR</a></td>
          <td>Code + Data</td>
          <td>Unknown</td>
          <td>Official implementation with dataset and trained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. <em>Journal of Cheminformatics</em>, 14(41). <a href="https://doi.org/10.1186/s13321-022-00624-5">https://doi.org/10.1186/s13321-022-00624-5</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/suanfaxiaohuo/SwinOCSR">GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>One Strike, You're Out: Detecting Markush Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/jurriaans-markush-detection-2023/</guid><description>Patch-based CNN method for detecting Markush structures in chemical images, addressing low signal-to-noise ratios in OCSR.</description><content:encoded><![CDATA[<h2 id="methodology-and-classification">Methodology and Classification</h2>
<p>This is a <strong>Method</strong> paper (Classification: $\Psi_{\text{Method}}$).</p>
<p>It proposes a patch-based classification pipeline to solve a technical failure mode in Optical Chemical Structure Recognition (OCSR). Distinct rhetorical indicators include a baseline comparison (CNN vs. traditional ORB), ablation studies (architecture, pretraining), and a focus on evaluating the filtering efficacy against a known failure mode.</p>
<h2 id="the-markush-structure-challenge">The Markush Structure Challenge</h2>
<p><strong>The Problem</strong>: Optical Chemical Structure Recognition (OCSR) tools convert 2D images of molecules into machine-readable formats. These tools struggle with &ldquo;Markush structures,&rdquo; generic structural templates used frequently in patents that contain variables rather than specific atoms (e.g., $R$, $X$, $Y$).</p>
<p><strong>The Gap</strong>: Markush structures are difficult to detect because they often appear as small indicators (a single &ldquo;R&rdquo; or variable) within a large image, resulting in a very low Signal-to-Noise Ratio (SNR). Existing OCSR research pipelines typically bypass this by manually excluding these structures from their datasets.</p>
<p><strong>The Goal</strong>: To build an automated filter that can identify images containing Markush structures so they can be removed from OCSR pipelines, improving overall database quality without requiring manual data curation.</p>
<h2 id="patch-based-classification-pipeline">Patch-Based Classification Pipeline</h2>
<p>The core technical contribution is an end-to-end deep learning pipeline tailored for low-SNR chemical images where standard global resizing or cropping fails due to large variations in image resolution and pixel scales.</p>
<ul>
<li><strong>Patch Generation</strong>: The system slices input images into overlapping patches generated from two offset grids, ensuring that variables falling on boundaries are fully captured in at least one crop.</li>
<li><strong>Targeted Annotation</strong>: The labels rely on pixel-level bounding boxes around Markush indicators, minimizing the noise that would otherwise overwhelm a full-image classification attempt.</li>
<li><strong>Inference Strategy</strong>: During inference, the query image is broken into patches, each patch is classified individually, and the patch scores are aggregated with a maximum-pooling rule, $X = \max_{i=1}^{n} \{ x_i \}$.</li>
<li><strong>Evaluation</strong>: Provides the first systematic comparison between fixed-feature extraction (ORB + XGBoost) and end-to-end deep learning for this specific domain.</li>
</ul>
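<p>The two-offset-grid patching and max-pool aggregation can be sketched as follows. The patch size and the half-patch offset are illustrative assumptions, not values from the paper.</p>

```python
# Two offset grids of patch corners: an indicator cut by a boundary of one
# grid falls inside the interior of a patch from the other grid.

def patch_coords(width, height, size=256):
    coords = []
    for offset in (0, size // 2):
        for y in range(offset, max(height - size, 0) + 1, size):
            for x in range(offset, max(width - size, 0) + 1, size):
                coords.append((x, y))
    return coords

def classify_image(patch_scores):
    """Image-level score is the max over patch scores: X = max_i x_i."""
    return max(patch_scores)

coords = patch_coords(512, 512)                   # 4 grid-one patches + 1 offset patch
score = classify_image([0.05, 0.10, 0.92, 0.03])  # one positive patch flags the image
```

<p>The max rule makes the filter deliberately trigger-happy: a single confident patch marks the whole image as Markush, which suits the &ldquo;one strike, you're out&rdquo; filtering goal.</p>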
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared two distinct paradigms on a manually annotated dataset:</p>
<ol>
<li>
<p><strong>Fixed-Feature Baseline</strong>: Used <strong>ORB</strong> (Oriented FAST and Rotated BRIEF) to detect keypoints and match them against a template bank of known Markush symbols. Features (match counts, Hamming distances) were fed into an <strong>XGBoost</strong> model.</p>
</li>
<li>
<p><strong>Deep Learning Method</strong>: Fine-tuned <strong>ResNet18</strong> and <strong>Inception V3</strong> models on the generated image patches.</p>
<ul>
<li><strong>Ablations</strong>: Contrasted pretraining sources, evaluating general domain (ImageNet) against chemistry-specific domain (USPTO images).</li>
<li><strong>Fine-tuning</strong>: Compared full-network fine-tuning against freezing all but the fully connected layers.</li>
</ul>
</li>
</ol>
<p>To handle significant class imbalance, the primary evaluation metric was the Macro F1 score, defined as:</p>
<p>$$ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{precision}_i \cdot \text{recall}_i}{\text{precision}_i + \text{recall}_i} $$</p>
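<p>The definition above translates directly into code: compute F1 per class, then average uniformly so the rare Markush class counts as much as the majority class. The counts in the example are made up for illustration.</p>

```python
# Macro F1 from per-class (precision, recall) pairs, averaged uniformly.

def macro_f1(per_class):
    f1s = [2 * p * r / (p + r) if (p + r) else 0.0 for p, r in per_class]
    return sum(f1s) / len(f1s)

# A strong majority class cannot mask a weak minority class.
score = macro_f1([(0.95, 0.90), (0.60, 0.50)])
```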
<h2 id="performance-outcomes">Performance Outcomes</h2>
<ul>
<li>
<p><strong>CNN vs. ORB</strong>: Deep learning architectures outperformed the fixed-feature baseline. The best model (<strong>Inception V3</strong> pretrained on ImageNet) achieved an image-level Macro F1 of <strong>0.928</strong>, compared to <strong>0.701</strong> (image-level) for the ORB baseline, and a patch-level Macro F1 of <strong>0.917</strong>.</p>
</li>
<li>
<p><strong>The Pretraining Surprise</strong>: Counterintuitively, ImageNet pretraining consistently outperformed the domain-specific USPTO pretraining. The authors hypothesize that the filters learned from ImageNet pretraining generalize well outside the ImageNet domain, though why the USPTO-pretrained filters underperform remains unclear.</p>
</li>
<li>
<p><strong>Full Model Tuning</strong>: Unfreezing the entire network yielded higher performance than tuning only the classifier head, indicating that standard low-level visual filters require substantial adaptation to reliably distinguish chemical line drawings.</p>
</li>
<li>
<p><strong>Limitations and Edge Cases</strong>: The best CNN achieved an ROC AUC of <strong>0.97</strong> on the primary patch test set, while the ORB baseline scored <strong>0.81</strong> on the auxiliary dataset (the paper notes these ROC curves are not directly comparable because they come from different evaluation sets). The aggregation rule ($X = \max \{ x_i \}$) is naive and was not optimized. Furthermore, the patching approach introduces inherent label noise when a Markush indicator is cleanly bisected by a patch edge, potentially forcing the network to learn incomplete visual features.</p>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a primary dataset labeled by domain experts and a larger auxiliary dataset for evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Val</strong></td>
          <td><strong>Primary Dataset</strong></td>
          <td>272 Images</td>
          <td>Manually annotated with bounding boxes for Markush indicators. Split 60/20/20.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>Auxiliary Dataset</strong></td>
          <td>~5.4k Images</td>
          <td>5117 complete structures, 317 Markush. Used for image-level testing only (no bbox).</td>
      </tr>
  </tbody>
</table>
<p><strong>Patch Generation</strong>:</p>
<ul>
<li>Images are cropped into patches of size <strong>224x224</strong> (ResNet) or <strong>299x299</strong> (Inception).</li>
<li>Patches are generated from 2 grids offset by half the patch width/height to ensure annotations aren&rsquo;t lost on edges.</li>
<li><strong>Labeling Rule</strong>: A patch is labeled &ldquo;Markush&rdquo; if &gt;50% of an annotation&rsquo;s pixels fall inside it.</li>
</ul>
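<p>The two-grid patching and labeling rule can be sketched as follows (a hypothetical helper, not the authors&rsquo; code; boxes are <code>(x, y, w, h)</code> in pixels):</p>

```python
def patch_origins(width, height, patch=224):
    """Top-left corners for a base grid plus a grid offset by half a patch,
    so an indicator cut by one grid lands whole inside the other."""
    half = patch // 2
    origins = []
    for off in (0, half):
        for y in range(off, height - patch + 1, patch):
            for x in range(off, width - patch + 1, patch):
                origins.append((x, y))
    return origins

def label_patch(patch_box, annotation_boxes, threshold=0.5):
    """Label 'markush' if >50% of any annotation's area falls inside the patch."""
    px, py, pw, ph = patch_box
    for ax, ay, aw, ah in annotation_boxes:
        ix = max(0, min(px + pw, ax + aw) - max(px, ax))  # overlap width
        iy = max(0, min(py + ph, ay + ah) - max(py, ay))  # overlap height
        if ix * iy > threshold * (aw * ah):
            return "markush"
    return "no-markush"

print(len(patch_origins(448, 448)))  # 4 base patches + 1 offset patch
```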
<h3 id="algorithms">Algorithms</h3>
<p><strong>ORB (Baseline)</strong>:</p>
<ul>
<li>Matches query images against a bank of template patches containing Markush indicators.</li>
<li><strong>Features</strong>: Number of keypoints, number of matches, Hamming distance of best 5 matches.</li>
<li><strong>Classifier</strong>: XGBoost trained on these features.</li>
<li><strong>Hyperparameters</strong>: Search over number of features (500-2000) and template patches (50-250).</li>
</ul>
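<p>A toy sketch of the hand-crafted features fed to XGBoost, using small integers in place of ORB&rsquo;s 256-bit binary descriptors (the match threshold and padding value are assumptions for illustration):</p>

```python
def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

def orb_style_features(query_desc, template_desc, k=5):
    """Per image: keypoint count, match count, and best-k Hamming distances."""
    dists = sorted(min(hamming(q, t) for t in template_desc) for q in query_desc)
    n_matches = sum(1 for d in dists if d <= 8)       # assumed match threshold
    best_k = dists[:k] + [32] * max(0, k - len(dists))  # pad with max distance
    return [len(query_desc), n_matches] + best_k

feats = orb_style_features([0b1010, 0b1111], [0b1010, 0b0000])
print(feats)
```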
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Framework</strong>: PyTorch with Optuna for optimization.</li>
<li><strong>Optimization</strong>: 25 trials per configuration.</li>
<li><strong>Augmentations</strong>: Random perspective shift, posterization, sharpness/blur.</li>
</ul>
<h3 id="models">Models</h3>
<p>Two main architectures were compared.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Input Size</th>
          <th>Parameters</th>
          <th>Pretraining Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ResNet18</strong></td>
          <td>224x224</td>
          <td>11.5M</td>
          <td>ImageNet</td>
      </tr>
      <tr>
          <td><strong>Inception V3</strong></td>
          <td>299x299</td>
          <td>23.8M</td>
          <td>ImageNet &amp; USPTO</td>
      </tr>
  </tbody>
</table>
<p><strong>Best Configuration</strong>: Inception V3, ImageNet weights, Full Model fine-tuning (all layers unfrozen).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Primary metric was <strong>Macro F1</strong> due to class imbalance.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best CNN (Inception V3)</th>
          <th>Baseline (ORB)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Patch Test F1</strong></td>
          <td>$0.917 \pm 0.014$</td>
          <td>N/A</td>
          <td>ORB does not support patch-level</td>
      </tr>
      <tr>
          <td><strong>Image Test F1</strong></td>
          <td>$0.928 \pm 0.035$</td>
          <td>$0.701 \pm 0.052$</td>
          <td>CNN aggregates patch predictions</td>
      </tr>
      <tr>
          <td><strong>Aux Test F1</strong></td>
          <td>0.914</td>
          <td>0.533</td>
          <td>Evaluation on large secondary dataset</td>
      </tr>
      <tr>
          <td><strong>ROC AUC</strong></td>
          <td>0.97</td>
          <td>0.81</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla V100-SXM2-16GB</li>
<li><strong>CPU</strong>: Intel Xeon E5-2686 @ 2.30GHz</li>
<li><strong>RAM</strong>: 64 GB</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Thomasjurriaans/markush-recognition-msc-thesis">GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>MSc thesis code: CNN training, ORB baseline, evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p>The primary dataset was manually annotated by Elsevier domain experts and is not publicly available. The auxiliary dataset (from Elsevier) is also not public. Pre-trained model weights are not released in the repository.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jurriaans, T., Szarkowska, K., Nalisnick, E., Schwörer, M., Thorne, C., &amp; Akhondi, S. (2023). One Strike, You&rsquo;re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images. <em>arXiv preprint arXiv:2311.14633</em>. <a href="https://doi.org/10.48550/arXiv.2311.14633">https://doi.org/10.48550/arXiv.2311.14633</a></p>
<p><strong>Publication</strong>: arXiv 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Thomasjurriaans/markush-recognition-msc-thesis">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{jurriaansOneStrikeYoure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{One {{Strike}}, {{You}}&#39;re {{Out}}: {{Detecting Markush Structures}} in {{Low Signal-to-Noise Ratio Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{One {{Strike}}, {{You}}&#39;re {{Out}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Jurriaans, Thomas and Szarkowska, Kinga and Nalisnick, Eric and Schwoerer, Markus and Thorne, Camilo and Akhondi, Saber}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2023</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2311.14633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MICER: Molecular Image Captioning with Transfer Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/</guid><description>Encoder-decoder model using pre-trained ResNet and attention-based LSTM to translate molecular images into SMILES strings, reaching 97.54% sequence accuracy.</description><content:encoded><![CDATA[<h2 id="micers-contribution-to-optical-structure-recognition">MICER&rsquo;s Contribution to Optical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.</p>
<h2 id="the-challenge-of-generalizing-in-ocsr">The Challenge of Generalizing in OCSR</h2>
<p>Chemical structures in scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (like OSRA) rely on hand-crafted rules and expert knowledge. They are brittle, struggle with stylistic variations, and have low generalization ability. While deep learning has been applied (e.g., DECIMER), previous attempts often used frozen pre-trained feature extractors (without fine-tuning) or failed to fully exploit transfer learning, leading to suboptimal performance. The goal of this work is to build an end-to-end &ldquo;image captioning&rdquo; system that translates molecular images directly into <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings without intermediate segmentation steps.</p>
<h2 id="integrating-fine-tuning-and-attention-for-chemistry">Integrating Fine-Tuning and Attention for Chemistry</h2>
<p>The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.</p>
<p>The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes &ldquo;intrinsic features&rdquo; of molecular data (stereochemistry, complexity) to guide the design of the training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.</p>
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.</p>
<p><strong>Factor Comparisons</strong>: They evaluated how performance is affected by:</p>
<ul>
<li><strong>Stereochemistry (SI)</strong>: Comparing models trained on data with and without stereochemical information.</li>
<li><strong>Molecular Complexity (MC)</strong>: Analyzing performance across 5 molecular weight intervals.</li>
<li><strong>Data Volume (DV)</strong>: Training on datasets ranging from 0.64 million to 10 million images.</li>
<li><strong>Pre-trained Models (PTMs)</strong>: Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.</li>
</ul>
<p><strong>Benchmarking</strong>:</p>
<ul>
<li><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).</li>
<li><strong>Datasets</strong>: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).</li>
<li><strong>Metrics</strong>: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).</li>
</ul>
<h2 id="results-and-core-insights">Results and Core Insights</h2>
<p>MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, outperforming rule-based and deep learning baselines across all four test sets.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Method</th>
          <th>SA (%)</th>
          <th>AMFTS (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Uni-style</td>
          <td>OSRA</td>
          <td>23.14</td>
          <td>56.83</td>
      </tr>
      <tr>
          <td>Uni-style</td>
          <td>DECIMER</td>
          <td>35.32</td>
          <td>86.92</td>
      </tr>
      <tr>
          <td>Uni-style</td>
          <td><strong>MICER</strong></td>
          <td><strong>97.54</strong></td>
          <td><strong>99.74</strong></td>
      </tr>
      <tr>
          <td>Multi-style</td>
          <td>OSRA</td>
          <td>15.68</td>
          <td>44.50</td>
      </tr>
      <tr>
          <td>Multi-style</td>
          <td><strong>MICER</strong></td>
          <td><strong>95.09</strong></td>
          <td><strong>99.28</strong></td>
      </tr>
      <tr>
          <td>Noisy</td>
          <td><strong>MICER</strong></td>
          <td><strong>94.95</strong></td>
          <td><strong>99.25</strong></td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td>OSRA</td>
          <td>80.24</td>
          <td>91.17</td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td>DECIMER</td>
          <td>21.75</td>
          <td>65.15</td>
      </tr>
      <tr>
          <td>UOB (real-world)</td>
          <td><strong>MICER</strong></td>
          <td><strong>82.33</strong></td>
          <td><strong>94.47</strong></td>
      </tr>
  </tbody>
</table>
<p>ResNet101 was identified as the most effective encoder (87.58% SA in preliminary tests on 0.8M images), outperforming deeper (DenseNet121 at 81.41%) and lighter (MobileNetV2 at 39.83%) networks. Performance saturates around 6 million training samples, reaching 98.84% SA. Stereochemical information drops accuracy by approximately 6.1% (from 87.61% to 81.50%), indicating wedge and dash bonds are harder to recognize. Visualizing attention maps showed the model correctly attends to specific atoms (e.g., focusing on &lsquo;S&rsquo; or &lsquo;Cl&rsquo; pixels) when generating the corresponding character.</p>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. MICER struggles with superatoms, R-groups, text labels, and uncommon atoms (e.g., Sn) that were not seen during training. On noisy data, noise spots near Cl atoms can cause misclassification as O atoms. Complex molecular images with noise lead to misrecognition of noise points as single bonds and wedge-shaped bonds as double bonds. All methods, including MICER, have substantial room for improvement on real-world datasets that contain these challenging elements.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was curated from the <strong>ZINC20</strong> database.</p>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Filtering</strong>: Removed organometallics, mixtures, and invalid molecules.</li>
<li><strong>Standardization</strong>: SMILES were canonicalized and de-duplicated.</li>
<li><strong>Generation</strong>: Images generated using <strong>Indigo</strong> and <strong>RDKit</strong> toolkits to vary styles.</li>
</ul>
<p><strong>Dataset Size</strong>:</p>
<ul>
<li><strong>Total</strong>: 10 million images selected for the final model.</li>
<li><strong>Composition</strong>: 6 million &ldquo;default style&rdquo; (Indigo) + 4 million &ldquo;multi-style&rdquo; (Indigo + RDKit).</li>
<li><strong>Splits</strong>: 8:1:1 ratio for Training/Validation/Test.</li>
</ul>
<p><strong>Vocabulary</strong>: A token dictionary of 39 SMILES characters plus 3 special tokens: <code>[pad]</code>, <code>[sos]</code>, <code>[eos]</code>, <code>[0]</code>-<code>[9]</code>, <code>[C]</code>, <code>[l]</code>, <code>[c]</code>, <code>[O]</code>, <code>[N]</code>, <code>[n]</code>, <code>[F]</code>, <code>[H]</code>, <code>[o]</code>, <code>[S]</code>, <code>[s]</code>, <code>[B]</code>, <code>[r]</code>, <code>[I]</code>, <code>[i]</code>, <code>[P]</code>, <code>[p]</code>, <code>(</code>, <code>)</code>, <code>[</code>, <code>]</code>, <code>@</code>, <code>=</code>, <code>#</code>, <code>/</code>, <code>-</code>, <code>+</code>, <code>\</code>, <code>%</code>. Two-letter atoms like &lsquo;Br&rsquo; are tokenized as the distinct characters <code>[B]</code>, <code>[r]</code>, and &lsquo;Cl&rsquo; as <code>[C]</code>, <code>[l]</code>.</p>
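<p>A minimal sketch of the character-level tokenization described above (not MICER&rsquo;s actual code):</p>

```python
def tokenize(smiles, max_len=16):
    """Split a SMILES string into single characters, framed by [sos]/[eos]
    and padded to a fixed length; 'Cl' becomes the two tokens 'C', 'l'."""
    tokens = ["[sos]"] + list(smiles) + ["[eos]"]
    tokens += ["[pad]"] * (max_len - len(tokens))
    return tokens

print(tokenize("CCl=O", max_len=10))
```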
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Character-level tokenization (not atom-level); the model learns to assemble &lsquo;C&rsquo; and &lsquo;l&rsquo; into &lsquo;Cl&rsquo;.</li>
<li><strong>Attention Mechanism</strong>: Uses a soft attention mechanism where the decoder calculates an attention score between the encoder&rsquo;s feature map ($8 \times 8 \times 512$) and the current hidden vector. Formula:
$$
\text{att\_score} = \text{softmax}\big(L_a(\tanh(L_f(F) + L_b(b_t)))\big)
$$</li>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Loss Function</strong>: Cross-entropy loss</li>
<li><strong>Optimizer</strong>: Adam optimizer</li>
<li><strong>Learning Rate</strong>: 2e-5</li>
<li><strong>Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 15</li>
</ul>
</li>
</ul>
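<p>The attention score can be sketched in NumPy; the linear maps $L_f$, $L_b$, $L_a$ below are random stand-ins for the learned layers, and the attention dimension is an assumption:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, feat_dim, hid_dim, att_dim = 64, 512, 256, 128

F = rng.standard_normal((n_regions, feat_dim))        # flattened 8x8x512 feature map
b_t = rng.standard_normal(hid_dim)                    # current decoder hidden state
L_f = rng.standard_normal((feat_dim, att_dim)) * 0.05  # stand-in for learned layer
L_b = rng.standard_normal((hid_dim, att_dim)) * 0.05
L_a = rng.standard_normal(att_dim) * 0.05

e = np.tanh(F @ L_f + b_t @ L_b) @ L_a                # one raw score per image region
att = np.exp(e - e.max()); att /= att.sum()           # softmax over the 64 regions
context = att @ F                                     # attention-weighted image context

print(att.shape, context.shape)
```

The context vector lets the decoder condition each emitted SMILES character on the image regions it is currently attending to.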
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: Pre-trained <strong>ResNet101</strong> (trained on ImageNet).</li>
<li><strong>Modifications</strong>: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.</li>
<li><strong>Flattening</strong>: Reshaped to a $64 \times 512$ feature matrix for the decoder.</li>
</ul>
<p><strong>Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Long Short-Term Memory (LSTM) with Attention.</li>
<li><strong>Dropout</strong>: 0.3 applied to minimize overfitting.</li>
</ul>
<p>The encoder uses a pilot network (for universal feature extraction), a max-pooling layer, and multiple feature extraction layers containing convolutional blocks (CBs), feeding into the attention LSTM.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>SA (Sequence Accuracy)</strong>: Strict exact match of SMILES strings.</li>
<li><strong>ALD (Average Levenshtein Distance)</strong>: Edit distance for character-level error analysis.</li>
<li><strong>AMFTS / MFTS@1.0</strong>: Tanimoto similarity of ECFP4 fingerprints to measure structural similarity.</li>
</ul>
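<p>ALD averages the Levenshtein distance over test pairs; a minimal implementation of the per-pair distance:</p>

```python
def levenshtein(a, b):
    """Edit distance between two strings via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Phenol predicted as aniline: one character substitution.
print(levenshtein("c1ccccc1O", "c1ccccc1N"))  # -> 1
```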
<p><strong>Test Sets</strong>:</p>
<ul>
<li><strong>Uni-style</strong>: 100,000 images (Indigo default).</li>
<li><strong>Multi-style</strong>: 100,000 images (&gt;10 styles).</li>
<li><strong>Noisy</strong>: 100,000 images with noise added.</li>
<li><strong>UOB</strong>: 5,575 real-world images from literature.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 GPUs</li>
<li><strong>Training Time</strong>: Approximately 42 hours for the final model</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Jiacai-Yi/MICER">MICER</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<p>The training data (generated from ZINC20) and pre-trained model weights are not publicly released. The repository contains code but has minimal documentation (2 commits, no description).</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., &amp; Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. <em>Bioinformatics</em>, 38(19), 4562-4572. <a href="https://doi.org/10.1093/bioinformatics/btac545">https://doi.org/10.1093/bioinformatics/btac545</a></p>
<p><strong>Publication</strong>: Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Jiacai-Yi/MICER">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yiMICERPretrainedEncoder2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MICER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{19}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4562--4572}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1367-4811}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bioinformatics/btac545}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2SMILES: Transformer OCSR with Synthetic Data Pipeline</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</guid><description>Transformer-based OCSR using a novel synthetic data generation pipeline for robust molecular image interpretation across diverse drawing styles.</description><content:encoded><![CDATA[<h2 id="contribution-image2smiles-as-a-method-and-resource">Contribution: Image2SMILES as a Method and Resource</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a specific neural architecture (ResNet backbone and Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering &ldquo;How well does this work?&rdquo; with extensive benchmarks against rule-based systems like OSRA.</li>
<li><strong>Resource</strong>: A core contribution is the &ldquo;Generate and Train!&rdquo; paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.</li>
</ul>
<h2 id="motivation-bottlenecks-in-recognizing-trapped-chemical-structures">Motivation: Bottlenecks in Recognizing Trapped Chemical Structures</h2>
<p>Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.</p>
<ul>
<li><strong>Problem</strong>: Chemical structures are often &ldquo;trapped&rdquo; in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, &ldquo;Markush&rdquo; structures (templates), or visual contamination.</li>
<li><strong>Gap</strong>: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.</li>
<li><strong>Goal</strong>: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).</li>
</ul>
<h2 id="core-innovation-the-generate-and-train-pipeline-and-fg-smiles">Core Innovation: The &ldquo;Generate and Train!&rdquo; Pipeline and FG-SMILES</h2>
<ul>
<li><strong>&ldquo;Generate and Train!&rdquo; Paradigm</strong>: The authors assert that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like &ldquo;Markush&rdquo; variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual &ldquo;contamination&rdquo; (stray text, arrows).</li>
<li><strong>FG-SMILES</strong>: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.</li>
<li><strong>Encoder-Free Architecture</strong>: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.</li>
</ul>
<h2 id="methodology-and-benchmarking-against-osra">Methodology and Benchmarking Against OSRA</h2>
<ul>
<li><strong>Training</strong>: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.</li>
<li><strong>Validation (Synthetic)</strong>: Evaluated on a hold-out set of 1M synthetic images.</li>
<li><strong>Validation (Real World)</strong>:
<ul>
<li><strong>Dataset A</strong>: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.</li>
<li><strong>Dataset B</strong>: 296 structures systematically extracted from <em>Journal of Organic Chemistry</em> (one paper per issue from 2020) to reduce selection bias.</li>
</ul>
</li>
<li><strong>Comparison</strong>: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.</li>
</ul>
<h2 id="results-high-precision-extraction-and-key-limitations">Results: High-Precision Extraction and Key Limitations</h2>
<ul>
<li><strong>Performance</strong>:
<ul>
<li><strong>Synthetic</strong>: 90.7% exact match accuracy.</li>
<li><strong>Real Data (Dataset A)</strong>: Image2SMILES achieved <strong>79.2%</strong> accuracy compared to OSRA&rsquo;s <strong>62.1%</strong>.</li>
<li><strong>Real Data (Dataset B)</strong>: Image2SMILES achieved <strong>62.5%</strong> accuracy compared to OSRA&rsquo;s <strong>24.0%</strong>.</li>
</ul>
</li>
<li><strong>Confidence Correlation</strong>: There is a strong correlation between the model&rsquo;s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22.5% of data, enabling high-precision automated pipelines.</li>
<li><strong>Key Failures</strong>: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confusion of R-group indices ($R'$ vs $R_1$), and explicit hydrogens rendered as groups.</li>
</ul>
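<p>Confidence-gated extraction can be sketched as a simple filter (the predictions and scores below are illustrative, not the paper&rsquo;s):</p>

```python
def threshold_predictions(preds, tau=0.995):
    """Keep only (SMILES, confidence) pairs that clear the threshold;
    also report coverage, the fraction of images not ignored."""
    kept = [(smiles, conf) for smiles, conf in preds if conf >= tau]
    coverage = len(kept) / len(preds)
    return kept, coverage

preds = [("CCO", 0.999), ("c1ccccc1", 0.42), ("CC(=O)O", 0.997)]
kept, coverage = threshold_predictions(preds)
print(len(kept), round(coverage, 2))
```

Trading coverage for precision this way is what enables the fully automated high-precision pipelines the authors describe.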
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: A subset of 10 million molecules sampled from PubChem.</li>
<li><strong>Selection Logic</strong>: Bias towards complex/rare structures using a &ldquo;Full Coefficient&rdquo; (FC) probability metric based on molecule size and ring/atom rarity.
<ul>
<li>Size term: $BC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$, where $n$ is the molecule size and $n_{\max}=60$.</li>
</ul>
</li>
<li><strong>Generation</strong>: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).</li>
<li><strong>Contamination</strong>: &ldquo;Visual noise&rdquo; is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.</li>
<li><strong>Target Format</strong>: <strong>FG-SMILES</strong> (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a <code>v</code> token.</li>
</ul>
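<p>Evaluating the size-dependent coefficient as printed above (treating $n$ as the molecule size; the ring/atom-rarity terms of FC are omitted here):</p>

```python
def bias_coefficient(n, n_max=60):
    """BC = 0.1 + 1.2 * ((n_max - n) / n_max)^3, clamped so the
    coefficient stays in [0.1, 1.3] for any molecule size."""
    n = min(n, n_max)
    return 0.1 + 1.2 * ((n_max - n) / n_max) ** 3

print(round(bias_coefficient(0), 2), round(bias_coefficient(30), 2), round(bias_coefficient(60), 2))
```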
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Contamination Augmentation</strong>: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.</li>
<li><strong>Functional Group Resolution</strong>: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).</li>
<li><strong>Markush Support</strong>: Stochastic replacement of substituents with R-group labels ($R_1$, $R'$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).</li>
</ul>
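<p>A sketch of the stochastic R-group replacement; only $P(R)=0.2$ and $P(R_1)=0.15$ are given in the text, so the remaining probability mass below is an assumption for illustration:</p>

```python
import random

def sample_r_label(rng):
    """Draw an R-group label (or None to keep the real substituent)
    from an assumed probability table."""
    table = [("R", 0.20), ("R1", 0.15), ("R'", 0.15), (None, 0.50)]
    x = rng.random()
    acc = 0.0
    for label, p in table:
        acc += p
        if x < acc:
            return label
    return None

rng = random.Random(7)
labels = [sample_r_label(rng) for _ in range(1000)]
print(sum(1 for l in labels if l == "R") / 1000)  # should be near 0.2
```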
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;Image-to-Sequence&rdquo; hybrid model.
<ul>
<li><strong>Backbone</strong>: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.</li>
<li><strong>Neck</strong>: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.</li>
<li><strong>Decoder</strong>: Standard Transformer Decoder with parameters from the original Transformer architecture.</li>
</ul>
</li>
<li><strong>Input</strong>: Images resized to $384 \times 384 \times 3$.</li>
<li><strong>Output</strong>: Sequence of FG-SMILES tokens.</li>
</ul>
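<p>The encoder-free wiring amounts to a reshape of the backbone&rsquo;s feature map into a token sequence for the decoder (shapes only):</p>

```python
import numpy as np

# C x H x W map from the truncated ResNet-50 (values are placeholders).
feature_map = np.zeros((512, 48, 48))
# Flatten spatial positions into a sequence: (H*W) tokens of dimension C.
tokens = feature_map.reshape(512, -1).T
print(tokens.shape)  # -> (2304, 512)
```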
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Binary &ldquo;Exact Match&rdquo; (valid/invalid).
<ul>
<li>Strict criteria: Stereochemistry and R-group indices must match exactly (e.g., predicting $R'$ for $R_1$ counts as a failure).</li>
</ul>
</li>
<li><strong>Datasets</strong>:
<ul>
<li><strong>Internal</strong>: 5% random split of generated data (500k samples).</li>
<li><strong>External (Dataset A &amp; B)</strong>: Manually cropped real-world images from specified journals.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.</li>
<li><strong>Duration</strong>: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.</li>
<li><strong>Optimizer</strong>: RAdam with learning rate $3 \cdot 10^{-4}$.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/syntelly/img2smiles_generator">Data Generator (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generator</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5069806">1M Generated Samples (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Randomly generated image-SMILES pairs</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5356500">Real-World Test Images (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Cropped structures from real papers with target FG-SMILES</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></td>
          <td>Other</td>
          <td>Proprietary</td>
          <td>Web demo for PDF-to-SMILES extraction</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Khokhlov, I., Krasnov, L., Fedorov, M. V., &amp; Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. <em>Chemistry-Methods</em>, 2(1), e202100069. <a href="https://doi.org/10.1002/cmtd.202100069">https://doi.org/10.1002/cmtd.202100069</a></p>
<p><strong>Publication</strong>: Chemistry-Methods 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/syntelly/img2smiles_generator">Official Code (Data Generator)</a></li>
<li><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{khokhlovImage2SMILESTransformerBasedMolecular2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image2SMILES: Transformer-Based Molecular Optical Recognition Engine}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Image2SMILES}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Chemistry-Methods}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{e202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2628-9725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1002/cmtd.202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image-to-Graph Transformers for Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/image-to-graph-transformers/</guid><description>A deep learning model that converts molecular images directly into graph structures, enabling recognition of abbreviated non-atomic symbols.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-classification">Contribution and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture designed to extract molecular structures from images by directly predicting the graph topology. The paper validates this approach through ablation studies (comparing ResNet-only baselines to the Transformer-augmented model) and extensive benchmarking against existing tools.</p>
<h2 id="the-challenge-with-smiles-and-non-atomic-symbols">The Challenge with SMILES and Non-Atomic Symbols</h2>
<ul>
<li><strong>Handling Abbreviations:</strong> Chemical structures in scientific literature often use non-atomic symbols (superatoms like &ldquo;R&rdquo; or &ldquo;Ph&rdquo;) to reduce complexity. Standard tools that generate SMILES strings fail here because SMILES syntax does not support arbitrary non-atomic symbols.</li>
<li><strong>Robustness to Style:</strong> Existing rule-based tools are brittle to the diverse drawing styles found in literature.</li>
<li><strong>Data Utilization:</strong> Pixel-wise graph recognition tools (like ChemGrapher) require expensive pixel-level labeling. An end-to-end approach can utilize massive amounts of image-molecule pairs (like USPTO data) without needing exact coordinate labels.</li>
</ul>
<h2 id="the-image-to-graph-i2g-architecture">The Image-to-Graph (I2G) Architecture</h2>
<p>The core novelty is the <strong>Image-to-Graph (I2G)</strong> architecture that bypasses string representations entirely:</p>
<ul>
<li><strong>Hybrid Encoder:</strong> Combines a ResNet backbone (for locality) with a Transformer encoder (for global context), allowing the model to capture relationships between atoms that are far apart in the image.</li>
<li><strong>Graph Decoder (GRAT):</strong> A modified Transformer decoder that generates the graph auto-regressively. It uses feature-wise transformations to modulate attention weights based on edge information (bond types).</li>
<li><strong>Coordinate-Aware Training:</strong> The model is forced to predict the exact 2D coordinates of atoms in the source image. Combined with auxiliary losses, this boosts SMI accuracy from 0.009 to 0.567 on the UoB ablation (Table 1 in the paper).</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Baselines:</strong> The model was compared against OSRA (rule-based), MolVec (rule-based), and ChemGrapher (deep learning pixel-wise).</li>
<li><strong>Benchmarks:</strong> Evaluated on four standard datasets: UoB, USPTO, CLEF, and JPO. Images were converted to PDF and back to simulate degradation.</li>
<li><strong>Large Molecule Test:</strong> A custom dataset (<strong>OLED</strong>) was created from 12 journal papers (434 images) to test performance on larger, more complex structures (average 52.8 atoms).</li>
<li><strong>Ablations:</strong> The authors tested the impact of the Transformer encoder, auxiliary losses, and coordinate prediction.</li>
</ul>
<h2 id="empirical-results-and-robustness">Empirical Results and Robustness</h2>
<ul>
<li><strong>Benchmark Performance:</strong> The proposed model outperformed existing models with a 17.1% relative improvement on benchmark datasets.</li>
<li><strong>Robustness:</strong> On large molecules (OLED dataset), it achieved a 12.8% relative improvement over MolVec (and 20.0% over OSRA).</li>
<li><strong>Data Scaling:</strong> Adding real-world USPTO data to the synthetic training set improved performance by 20.5%, demonstrating the model&rsquo;s ability to learn from noisy, unlabeled coordinates.</li>
<li><strong>Handling Superatoms:</strong> The model successfully recognized pseudo-atoms (e.g., $R_1$, $R_2$, $R_3$) as distinct nodes. OSRA, which outputs SMILES, collapsed them into generic &ldquo;Any&rdquo; atoms since SMILES does not support non-atomic symbols. MolVec could not recognize them properly at all.</li>
</ul>
<h2 id="limitations-and-error-analysis">Limitations and Error Analysis</h2>
<p>The paper identifies two main failure modes on the USPTO, CLEF, and JPO benchmarks:</p>
<ol>
<li><strong>Unrecognized superatoms:</strong> The model struggles with complex multi-character superatoms not seen during training (e.g., NHNHCOCH$_3$ or H$_3$CO$_2$S). The authors propose character-level atom decoding as a future solution.</li>
<li><strong>Caption interference:</strong> The model sometimes misidentifies image captions as atoms, particularly on the JPO dataset. Data augmentation with arbitrary caption text or a dedicated image segmentation step could mitigate this.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors used a combination of synthetic and real-world data for training.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem</strong></td>
          <td>4.6M</td>
          <td>Synthetic images generated using RDKit. Random superatoms (e.g., $CF_3$, $NO_2$) were substituted to simulate abbreviations.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO</strong></td>
          <td>2.5M</td>
          <td>Real image-molecule pairs from patents. Used for robustness; lacks coordinate labels.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Benchmarks</strong></td>
          <td>~5.7k</td>
          <td>UoB, USPTO, CLEF, JPO. Average ~15.8 atoms per molecule.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>OLED</strong></td>
          <td>434</td>
          <td>Manually segmented from 12 journal papers. Large molecules (avg 52.8 atoms).</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing:</strong></p>
<ul>
<li>Input resolution is fixed at $800 \times 800$ pixels.</li>
<li>Images are virtually split into a $25 \times 25$ grid (625 patches total), where each patch is $32 \times 32$ pixels.</li>
</ul>
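<p>The virtual grid split is a plain reshape; <code>to_patch_grid</code> is a hypothetical helper, shown for a single-channel image:</p>

```python
import numpy as np

def to_patch_grid(img: np.ndarray, patch: int = 32) -> np.ndarray:
    """Split an (H, W) image into an (H//patch, W//patch) grid of patches."""
    h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    return img.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)

grid = to_patch_grid(np.zeros((800, 800)), patch=32)
# grid.shape == (25, 25, 32, 32): 625 patches of 32 x 32 pixels
```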
<h3 id="algorithms">Algorithms</h3>
<p><strong>Encoder Logic:</strong></p>
<ul>
<li><strong>Grid Serialization:</strong> The $25 \times 25$ grid is flattened into a 1D sequence. 2D position information is concatenated to ResNet features before the Transformer.</li>
<li><strong>Auxiliary Losses:</strong> To aid convergence, classifiers on the encoder predict three things <em>per patch</em>: (1) number of atoms, (2) characters in atom labels, and (3) edge-sharing neighbors. These losses decrease to zero during training.</li>
</ul>
<p><strong>Decoder Logic:</strong></p>
<ul>
<li><strong>Auto-regressive Generation:</strong> At step $t$, the decoder generates a new node and connects it to existing nodes.</li>
<li><strong>Attention Modulation:</strong> Attention weights are transformed using bond information:
$$
\begin{aligned}
\text{Att}(Q, K, V) = \text{softmax} \left( \frac{\Gamma \odot (QK^T) + B}{\sqrt{d_k}} \right) V
\end{aligned}
$$
where $(\gamma_{ij}, \beta_{ij}) = f(e_{ij})$, with $e_{ij}$ being the edge type (in one-hot representation) between nodes $i$ and $j$, and $f$ is a multi-layer perceptron. $\Gamma$ and $B$ are matrices whose elements at position $(i, j)$ are $\gamma_{ij}$ and $\beta_{ij}$, respectively.</li>
<li><strong>Coordinate Prediction:</strong> The decoder outputs coordinates for each atom, which acts as a mechanism to track attention history.</li>
</ul>
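<p>A minimal numpy sketch of the modulated attention above; with $\Gamma$ all ones and $B$ all zeros it reduces to standard scaled dot-product attention:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulated_attention(Q, K, V, Gamma, B):
    """Scaled dot-product attention whose logits are scaled and shifted
    per node pair; Gamma[i, j] and B[i, j] would come from an MLP applied
    to the one-hot edge type between nodes i and j."""
    d_k = Q.shape[-1]
    logits = (Gamma * (Q @ K.T) + B) / np.sqrt(d_k)
    return softmax(logits, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))
Gamma, B = np.ones((n, n)), np.zeros((n, n))  # identity modulation
out = modulated_attention(Q, K, V, Gamma, B)
```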
<h3 id="models">Models</h3>
<ul>
<li><strong>Image Encoder:</strong> ResNet-34 backbone followed by a Transformer encoder.</li>
<li><strong>Graph Decoder:</strong> A &ldquo;Graph-Aware Transformer&rdquo; (GRAT) that outputs nodes (atom labels, coordinates) and edges (bond types).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on structural identity, as standard string matching (SMILES) is insufficient for graphs with superatoms.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMI</strong></td>
          <td>Canonical SMILES Match</td>
          <td>Correct if predicted SMILES is identical to ground truth.</td>
      </tr>
      <tr>
          <td><strong>TS 1</strong></td>
          <td>Tanimoto Similarity = 1.0</td>
          <td>Ratio of predictions with perfect fingerprint overlap.</td>
      </tr>
      <tr>
          <td><strong>Sim.</strong></td>
          <td>Average Tanimoto Similarity</td>
          <td>Measures average structural overlap across all predictions.</td>
      </tr>
  </tbody>
</table>
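<p>Tanimoto similarity over fingerprints represented as sets of on-bits can be sketched as below; real evaluations compute it on actual molecular fingerprints (e.g. via RDKit), so the set representation is a simplification:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy on-bit sets standing in for molecular fingerprints
a, b = {1, 2, 3, 4}, {2, 3, 4, 5}
sim = tanimoto(a, b)  # 3 shared bits / 5 total bits = 0.6
```

<p>The <strong>TS 1</strong> metric is then the fraction of predictions whose similarity to the ground truth equals 1.0.</p>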
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper does not release source code, pre-trained models, or the custom OLED evaluation dataset. The training data sources (PubChem, USPTO) are publicly available, but the specific image generation pipeline (modified RDKit with coordinate extraction and superatom substitution) is not released. Key architectural details (ResNet-34 backbone, Transformer encoder/decoder configuration) and training techniques are described, but exact hyperparameters for full reproduction are limited.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of 4.6M molecules for synthetic image generation</td>
      </tr>
      <tr>
          <td><a href="https://www.uspto.gov/">USPTO</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>2.5M real image-molecule pairs from patents</td>
      </tr>
      <tr>
          <td><a href="https://www.rdkit.org/">RDKit</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Used (with modifications) for synthetic image generation</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoo, S., Kwon, O., &amp; Lee, H. (2022). Image-to-Graph Transformers for Chemical Structure Recognition. <em>ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 3393-3397. <a href="https://doi.org/10.1109/ICASSP43922.2022.9746088">https://doi.org/10.1109/ICASSP43922.2022.9746088</a></p>
<p><strong>Publication</strong>: ICASSP 2022</p>
]]></content:encoded></item><item><title>ICMDT: Automated Chemical Structure Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</guid><description>A Transformer-based model (ICMDT) for converting chemical structure images into InChI text strings using a novel Deep TNT block.</description><content:encoded><![CDATA[<h2 id="contribution-image-to-text-translation-for-chemical-structures">Contribution: Image-to-Text Translation for Chemical Structures</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel neural network architecture, the <strong>Image Captioning Model based on Deep TNT (ICMDT)</strong>, to solve the specific problem of &ldquo;molecular translation&rdquo; (image-to-text). The classification is supported by the following rhetorical indicators:</p>
<ul>
<li><strong>Novel Mechanism:</strong> It introduces the &ldquo;Deep TNT block&rdquo; to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).</li>
<li><strong>Baseline Comparison:</strong> The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).</li>
<li><strong>Ablation Study:</strong> Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.</li>
</ul>
<h2 id="motivation-digitizing-historical-chemical-literature">Motivation: Digitizing Historical Chemical Literature</h2>
<p>The primary motivation is to speed up chemical research by digitizing historical chemical literature.</p>
<ul>
<li><strong>Problem:</strong> Historical sources often contain corrupted or noisy images, making automated recognition difficult.</li>
<li><strong>Gap:</strong> Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.</li>
<li><strong>Goal:</strong> To build a dependable generative model that can accurately translate these noisy images into <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> (International Chemical Identifier) text strings.</li>
</ul>
<h2 id="novelty-multi-level-feature-fusion-with-deep-tnt">Novelty: Multi-Level Feature Fusion with Deep TNT</h2>
<p>The core contribution is the <strong>Deep TNT block</strong> and the resulting <strong>ICMDT</strong> architecture.</p>
<ul>
<li><strong>Deep TNT Block:</strong> The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
<ol>
<li><strong>Internal Transformer:</strong> Processes pixel embeddings.</li>
<li><strong>Middle Transformer:</strong> Processes small patch embeddings.</li>
<li><strong>Exterior Transformer:</strong> Processes large patch embeddings.</li>
</ol>
</li>
<li><strong>Multi-level Fusion:</strong> The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.</li>
<li><strong>Position Encoding:</strong> A specific strategy of applying shared position encodings to small patches and pixels, while using a learnable 1D encoding for large patches.</li>
</ul>
<h2 id="methodology-benchmarking-on-the-bms-dataset">Methodology: Benchmarking on the BMS Dataset</h2>
<p>The authors evaluated the model on the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset.</p>
<ul>
<li><strong>Baselines:</strong> They constructed four comparative models:
<ul>
<li>EfficientNetb0 + RNN (Bi-LSTM)</li>
<li>ResNet50d + RNN (Bi-LSTM)</li>
<li>EfficientNetb0 + Transformer</li>
<li>ResNet101d + Transformer</li>
</ul>
</li>
<li><strong>Ablation:</strong> They tested the impact of removing the large patch position encoding (ICMDT*), reverting the encoder to a standard TNT-S (TNTD), and setting the patch size to 32 directly on TNT-S without the exterior transformer block (TNTD-B).</li>
<li><strong>Pre-processing Study:</strong> They experimented with denoising ratios and cropping strategies.</li>
</ul>
<h2 id="results--conclusions-improved-inchi-translation-accuracy">Results &amp; Conclusions: Improved InChI Translation Accuracy</h2>
<ul>
<li><strong>Performance:</strong> ICMDT achieved the lowest <strong>Levenshtein distance (0.69)</strong> among all five models tested (Table 3). The best-performing baseline was ResNet101d+Transformer.</li>
<li><strong>Convergence:</strong> The model converged significantly faster than the baselines, outperforming others as early as epoch 6.7.</li>
<li><strong>Ablation Results:</strong> The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance). Removing large patch position encoding (ICMDT*) degraded performance to 1.04, and directly using patch size 32 on TNT-S (TNTD-B) scored 1.37.</li>
<li><strong>Limitations:</strong> The model struggles with <strong>stereochemical layers</strong> (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.</li>
<li><strong>Inference &amp; Fusion:</strong> The multi-model inference and fusion pipeline (beam search, TTA, step-wise logit ensemble, and voting) reduced Levenshtein distance by 0.24 to 2.5 relative to single models.</li>
<li><strong>Future Work:</strong> Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p><strong>Status: Partially Reproducible.</strong> The dataset is publicly available through Kaggle, and the paper provides detailed hyperparameters and architecture specifications. However, no source code or pretrained model weights have been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Molecular Translation (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition Terms</td>
          <td>Training/test images with InChI labels</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components:</strong> No official code repository or pretrained weights. Reimplementation requires reconstructing the Deep TNT block, training pipeline, and inference/fusion strategy from the paper description alone.</p>
<p><strong>Hardware/compute requirements:</strong> Not explicitly stated in the paper.</p>
<h3 id="data">Data</h3>
<p>The experiments used the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset from Kaggle.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BMS Training Set</td>
          <td>2,424,186 images</td>
          <td>Supervised; contains noise and blur</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BMS Test Set</td>
          <td>1,616,107 images</td>
          <td>Higher noise variation than training set</td>
      </tr>
  </tbody>
</table>
<p><strong>Pre-processing Strategy</strong>:</p>
<ul>
<li><strong>Effective:</strong> Padding resizing (reshaping to a square using the longer edge, filling the padded regions with pixels taken from the middle of the image).</li>
<li><strong>Ineffective:</strong> Smart cropping (removing white borders degraded performance).</li>
<li><strong>Augmentation:</strong> GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).</li>
<li><strong>Denoising:</strong> Best results found by mixing denoised and original data (Ratio 2:13) during training.</li>
</ul>
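<p>A minimal sketch of the padding resize, assuming a single-channel image and using a constant white fill where the paper fills with pixels taken from the middle of the image:</p>

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 255) -> np.ndarray:
    """Pad an (H, W) image to a square whose side is the longer edge."""
    h, w = img.shape
    side = max(h, w)
    out = np.full((side, side), fill, dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = img  # center the original content
    return out

sq = pad_to_square(np.zeros((100, 160), dtype=np.uint8))
# sq.shape == (160, 160), with the 100 x 160 image centered vertically
```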
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer:</strong> Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).</li>
<li><strong>Loss Function:</strong> Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss adds a modulating factor $(1-p_t)^\gamma$ to cross-entropy to focus on hard negatives. Anti-Focal Loss (Raunak et al., 2020) modifies this factor to reduce the disparity between training and inference distributions in Seq2Seq models.</li>
<li><strong>Training Schedule:</strong>
<ul>
<li>Initial resolution: $224 \times 224$</li>
<li>Fine-tuning: resolution $384 \times 384$ for labels of length greater than 150.</li>
<li>Batch size: Dynamic, increasing from 16 to 1024 (with proportional learning rate scaling).</li>
<li>Noisy Labels: Randomly replacing chemical elements in labels with a certain probability to improve robustness during inference.</li>
</ul>
</li>
<li><strong>Inference Strategy:</strong>
<ul>
<li>Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).</li>
<li>Test Time Augmentation (TTA): Rotations of $90^\circ$.</li>
<li>Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.</li>
</ul>
</li>
</ul>
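<p>The standard focal loss that the paper starts from can be written directly; anti-focal loss modifies the $(1-p_t)^\gamma$ factor, so this sketch shows only the baseline form:</p>

```python
import math

def focal_loss(p_t: float, gamma: float = 0.5) -> float:
    """Cross-entropy scaled by the focal modulating factor (1 - p_t)**gamma.
    Anti-focal loss (the variant used in the paper) alters this factor."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Confident correct predictions are down-weighted relative to hard ones
easy, hard = focal_loss(0.9), focal_loss(0.1)
```

<p>With $\gamma = 0$ the factor vanishes and the loss reduces to plain cross-entropy.</p>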
<h3 id="models">Models</h3>
<p><strong>ICMDT Architecture:</strong></p>
<ul>
<li><strong>Encoder (Deep TNT)</strong> (Depth: 12 layers):
<ul>
<li><strong>Internal Block:</strong> Dim 160, Heads 4, Hidden size 640, MLP act GELU, Pixel patch size 4.</li>
<li><strong>Middle Block:</strong> Dim 10, Heads 6, Hidden size 128, MLP act GELU, Small patch size 16.</li>
<li><strong>Exterior Block:</strong> Dim 2560, Heads 10, Hidden size 5120, MLP act GELU, Large patch size 32.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Decoder dim: 2560, FFN dim: 1024.</li>
<li>Depth: 3 layers, Heads: 8.</li>
<li>Vocab size: 193 (InChI tokens), text_dim: 384.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric:</strong> Levenshtein Distance (measures single-character edit operations between generated and ground truth InChI strings).</p>
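<p>Levenshtein distance is the classic edit-distance dynamic program; the InChI strings in the example are invented for illustration:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

d = levenshtein("InChI=1S/CH4", "InChI=1S/CH3F")  # -> 2
```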
<p><strong>Ablation Results (Table 3 from paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Levenshtein Distance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td><strong>0.69</strong></td>
      </tr>
      <tr>
          <td>ICMDT*</td>
          <td>138.16</td>
          <td>1.04</td>
      </tr>
      <tr>
          <td>TNTD</td>
          <td>114.36</td>
          <td>1.29</td>
      </tr>
      <tr>
          <td>TNTD-B</td>
          <td>114.36</td>
          <td>1.37</td>
      </tr>
  </tbody>
</table>
<p><strong>Baseline Comparison (from convergence curves, Figure 9):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Convergence (Epochs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td>~9.76</td>
      </tr>
      <tr>
          <td>ResNet101d + Transformer</td>
          <td>302.02</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + Transformer</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ResNet50d + RNN</td>
          <td>90.6</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + RNN</td>
          <td>46.3</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, Y., Chen, G., &amp; Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. <em>Applied Sciences</em>, 12(2), 680. <a href="https://doi.org/10.3390/app12020680">https://doi.org/10.3390/app12020680</a></p>
<p><strong>Publication</strong>: MDPI Applied Sciences 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">Kaggle Competition: BMS Molecular Translation</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liAutomatedRecognitionChemical2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Li, Yanchi and Chen, Guanyu and Li, Xiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Applied Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{680}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Multidisciplinary Digital Publishing Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2076-3417}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3390/app12020680}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Structure Recognition with RCGD</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</guid><description>An end-to-end framework (RCGD) and unambiguous markup language (SSML) for recognizing complex handwritten chemical structures with guided graph traversal.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-framework">Contribution and Methodological Framework</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architectural framework (<strong>RCGD</strong>) and a new representation syntax (<strong>SSML</strong>) to solve the specific problem of handwritten chemical structure recognition.</li>
<li><strong>Resource</strong>: It introduces a new benchmark dataset, <strong>EDU-CHEMC</strong>, containing 50,000 handwritten images to address the lack of public data in this domain.</li>
</ul>
<h2 id="the-ambiguity-of-handwritten-chemical-structures">The Ambiguity of Handwritten Chemical Structures</h2>
<p>Recognizing handwritten chemical structures is significantly harder than printed ones due to:</p>
<ol>
<li><strong>Inherent Ambiguity</strong>: Handwritten atoms and bonds vary greatly in appearance.</li>
<li><strong>Projection Complexity</strong>: Converting 2D projected layouts (like Natta or Fischer projections) into linear strings is difficult.</li>
<li><strong>Limitations of Existing Formats</strong>: Standard formats like SMILES require domain knowledge (valence rules) and have a high semantic gap with the visual image. They often fail to represent &ldquo;invalid&rdquo; structures commonly found in educational/student work.</li>
</ol>
<h2 id="bridging-the-semantic-gap-with-ssml-and-rcgd">Bridging the Semantic Gap with SSML and RCGD</h2>
<p>The paper introduces two core contributions to bridge the semantic gap between image and markup:</p>
<ol>
<li>
<p><strong>Structure-Specific Markup Language (SSML)</strong>: An extension of Chemfig that provides an unambiguous, visual-based graph representation. Unlike SMILES, it describes <em>how to draw</em> the molecule step-by-step, making it easier for models to learn visual alignments. It supports &ldquo;reconnection marks&rdquo; to handle cyclic structures explicitly.</p>
</li>
<li>
<p><strong>Random Conditional Guided Decoder (RCGD)</strong>: A decoder that treats recognition as a graph traversal problem. It introduces three novel mechanisms:</p>
<ul>
<li><strong>Conditional Attention Guidance</strong>: Uses branch angle directions to guide the attention mechanism, preventing the model from getting lost in complex structures.</li>
<li><strong>Memory Classification</strong>: A module that explicitly stores and classifies &ldquo;unexplored&rdquo; branch points to handle ring closures (reconnections).</li>
<li><strong>Path Selection</strong>: A training strategy that randomly samples traversal paths to prevent overfitting to a specific serialization order.</li>
</ul>
</li>
</ol>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p><strong>Datasets</strong>:</p>
<ul>
<li><strong>Mini-CASIA-CSDB</strong> (Printed): A subset of 97,309 printed molecular structure images, upscaled to $500 \times 500$ resolution.</li>
<li><strong>EDU-CHEMC</strong> (Handwritten): A new dataset of 52,987 images collected from educational settings (cameras, scanners, screens), including erroneous/non-existent structures.</li>
</ul>
<p><strong>Baselines</strong>:</p>
<ul>
<li>Compared against standard <strong>String Decoders (SD)</strong> (based on DenseWAP), tested with both SMILES and SSML on Mini-CASIA-CSDB and exclusively with SSML on EDU-CHEMC.</li>
<li>Compared against <strong>BTTR</strong> and <strong>ABM</strong> (recent mathematical expression recognition models) adapted for the chemical structure task, both using SSML on EDU-CHEMC.</li>
<li>On Mini-CASIA-CSDB, also compared against <strong>WYGIWYS</strong> (a SMILES-based string decoder at 300x300 resolution).</li>
</ul>
<p><strong>Ablation Studies</strong>:</p>
<ul>
<li>Evaluated the impact of removing Path Selection (PS) and Memory Classification (MC) mechanisms on EDU-CHEMC.</li>
<li>Tested robustness to image rotation ($180^{\circ}$) on Mini-CASIA-CSDB.</li>
</ul>
<h2 id="recognition-performance-and-robustness">Recognition Performance and Robustness</h2>
<ul>
<li><strong>Superiority of SSML</strong>: Models trained with SSML significantly outperformed those trained with SMILES (92.09% vs 81.89% EM on printed data) due to reduced semantic gap.</li>
<li><strong>Best Performance</strong>: RCGD achieved the highest Exact Match (EM) scores on both datasets:
<ul>
<li><strong>Mini-CASIA-CSDB</strong>: 95.01% EM.</li>
<li><strong>EDU-CHEMC</strong>: 62.86% EM.</li>
</ul>
</li>
<li><strong>EDU-CHEMC Baselines</strong>: On the handwritten dataset, SD (DenseWAP) achieved 61.35% EM, outperforming both BTTR (58.21% EM) and ABM (58.78% EM). The authors note that BTTR and ABM&rsquo;s reverse training mode, which helps in regular formula recognition, does not transfer well to graph-structured molecular data.</li>
<li><strong>Ablation Results</strong> (Table 5, EDU-CHEMC): Removing Path Selection alone dropped EM from 62.86% to 62.15%. Removing both Path Selection and Memory Classification dropped EM further to 60.31%, showing that memory classification has a larger impact.</li>
<li><strong>Robustness</strong>: RCGD showed minimal performance drop (0.85%) on rotated images compared to SMILES-based methods (10.36% drop). The SD with SSML dropped by 2.19%, confirming that SSML itself improves rotation invariance.</li>
<li><strong>Educational Utility</strong>: The method can recognize and reconstruct chemically invalid structures (e.g., a Carbon atom with 5 bonds), making it applicable for correcting and revising handwritten answers in chemistry education.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>1. EDU-CHEMC (Handwritten)</strong></p>
<ul>
<li><strong>Total Size</strong>: 52,987 images.</li>
<li><strong>Splits</strong>: Training (48,998), Validation (999), Test (2,992).</li>
<li><strong>Characteristics</strong>: Real-world educational data, mixture of isolated molecules and reaction equations, includes invalid chemical structures.</li>
</ul>
<p><strong>2. Mini-CASIA-CSDB (Printed)</strong></p>
<ul>
<li><strong>Total Size</strong>: 97,309 images.</li>
<li><strong>Splits</strong>: Training (80,781), Validation (8,242), Test (8,286).</li>
<li><strong>Preprocessing</strong>: Original $300 \times 300$ images were upscaled to $500 \times 500$ RGB to resolve blurring issues.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SSML Generation</strong></p>
<p>To convert a molecular graph to SSML:</p>
<ol>
<li><strong>Traverse</strong>: Start from the left-most atom.</li>
<li><strong>Bonds/Atoms</strong>: Output atom text and bond format <code>&lt;bond&gt;[:&lt;angle&gt;]</code>.</li>
<li><strong>Branches</strong>: At branch points, use phantom symbols <code>(</code> and <code>)</code> to enclose branches, ordered by ascending bond angle.</li>
<li><strong>Reconnections</strong>: Use <code>?[tag]</code> and <code>?[tag, bond]</code> to mark start/end of ring closures.</li>
</ol>
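<p>The four steps above can be sketched in a few lines of Python on a toy acyclic graph. The graph encoding, bond tokens, and angle values below are illustrative assumptions (ring reconnection marks are omitted), not the paper&rsquo;s implementation:</p>

```python
# Sketch of SSML-style serialization for a toy acyclic molecular graph.
# The graph representation and angle conventions are assumptions for
# illustration; the paper's generator also handles ring reconnections.

def to_ssml(graph, atom, parent=None):
    """Depth-first traversal emitting atom text and <bond>[:<angle>] tokens."""
    out = [graph["atoms"][atom]]
    # Neighbors other than the atom we came from, ordered by ascending angle.
    nbrs = sorted(
        (b for b in graph["bonds"].get(atom, []) if b["to"] != parent),
        key=lambda b: b["angle"],
    )
    for i, b in enumerate(nbrs):
        sub = f'{b["bond"]}[:{b["angle"]}]' + to_ssml(graph, b["to"], atom)
        # All but the last branch are wrapped in phantom symbols ( ).
        out.append(sub if i == len(nbrs) - 1 else f"({sub})")
    return "".join(out)

# Toy molecule: CH3-CH(OH)-CH3 drawn with explicit branch angles.
mol = {
    "atoms": {0: "CH_3", 1: "CH", 2: "OH", 3: "CH_3"},
    "bonds": {
        0: [{"to": 1, "bond": "-", "angle": 0}],
        1: [{"to": 0, "bond": "-", "angle": 180},
            {"to": 2, "bond": "-", "angle": 90},
            {"to": 3, "bond": "-", "angle": 0}],
        2: [{"to": 1, "bond": "-", "angle": 270}],
        3: [{"to": 1, "bond": "-", "angle": 180}],
    },
}

print(to_ssml(mol, 0))  # → CH_3-[:0]CH(-[:0]CH_3)-[:90]OH
```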
<p><strong>2. RCGD Specifics</strong></p>
<ul>
<li><strong>RCGD-SSML</strong>: Modified version of SSML for the decoder. Removes <code>(</code> <code>)</code> delimiters; adds <code>\eob</code> (end of branch). Maintains a dynamic <strong>Branch Angle Set ($M$)</strong>.</li>
<li><strong>Path Selection</strong>: During training, when multiple branches exist in $M$, the model randomly selects one to traverse next. During inference, it uses beam search to score candidate paths.</li>
<li><strong>Loss Function</strong>:
$$
\begin{aligned}
L_{\text{total}} = L_{\text{ce}} + L_{\text{bc}}
\end{aligned}
$$
<ul>
<li>$L_{\text{ce}}$: Cross-entropy loss for character sequence generation.</li>
<li>$L_{\text{bc}}$: Multi-label classification loss for the memory module (predicting reconnection bond types for stored branch states).</li>
</ul>
</li>
</ul>
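<p>A minimal numerical sketch of the combined loss, assuming a single decoding step with hand-picked probabilities (the paper trains on batched tensors; this pure-Python version only illustrates the two terms):</p>

```python
import math

def cross_entropy(probs, target):
    """L_ce for one decoding step: negative log-likelihood of the target token."""
    return -math.log(probs[target])

def binary_cross_entropy(preds, labels):
    """L_bc: multi-label loss over reconnection bond-type predictions."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(preds, labels)
    ) / len(preds)

# One step: the decoder assigns 0.7 to the correct SSML token, and the
# memory module scores three candidate reconnection bond types (made-up values).
l_ce = cross_entropy({"C": 0.7, "O": 0.2, "-": 0.1}, "C")
l_bc = binary_cross_entropy([0.9, 0.1, 0.2], [1, 0, 0])
l_total = l_ce + l_bc
```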
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>: DenseNet</p>
<ul>
<li><strong>Structure</strong>: 3 dense blocks.</li>
<li><strong>Growth Rate</strong>: 24.</li>
<li><strong>Depth</strong>: 32 per block.</li>
<li><strong>Output</strong>: High-dimensional feature map $x \in \mathbb{R}^{d_x \times h \times w}$.</li>
</ul>
<p><strong>Decoder</strong>: GRU with Attention</p>
<ul>
<li><strong>Hidden State Dimension</strong>: 256.</li>
<li><strong>Embedding Dimension</strong>: 256.</li>
<li><strong>Attention Projection</strong>: 128.</li>
<li><strong>Memory Classification Projection</strong>: 256.</li>
</ul>
<p><strong>Training Config</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam.</li>
<li><strong>Learning Rate</strong>: 2e-4 with multi-step decay (gamma 0.5).</li>
<li><strong>Dropout</strong>: 15%.</li>
<li><strong>Strategy</strong>: Teacher-forcing used for validation selection.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Exact Match (EM)</strong>: Percentage of samples where the predicted graph structure perfectly matches the label. For SMILES, string comparison; for SSML, converted to graph for isomorphism check.</li>
<li><strong>Structure EM</strong>: Auxiliary metric for samples with mixed content (text + molecules), counting samples where <em>all</em> molecular structures are correct.</li>
</ul>
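<p>Since EM for SSML reduces to a labeled-graph isomorphism check, the idea can be sketched with a brute-force matcher that is only feasible for tiny graphs; the graph encoding here is a hypothetical simplification, not the paper&rsquo;s evaluator:</p>

```python
from itertools import permutations

def graphs_match(g1, g2):
    """Brute-force labeled-graph isomorphism check (tiny graphs only).
    A graph is (atom_labels, edges) with edges as {(i, j): bond_label}, i < j."""
    atoms1, edges1 = g1
    atoms2, edges2 = g2
    if len(atoms1) != len(atoms2) or len(edges1) != len(edges2):
        return False
    n = len(atoms1)
    for perm in permutations(range(n)):
        # Atom labels must agree under the candidate vertex mapping.
        if any(atoms1[i] != atoms2[perm[i]] for i in range(n)):
            continue
        mapped = {tuple(sorted((perm[i], perm[j]))): b
                  for (i, j), b in edges1.items()}
        if mapped == edges2:
            return True
    return False

# Ethanol written with two different atom orderings still counts as a match.
pred = (["C", "C", "O"], {(0, 1): "-", (1, 2): "-"})
gold = (["O", "C", "C"], {(0, 1): "-", (1, 2): "-"})
print(graphs_match(pred, gold))  # True
```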
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">EDU-CHEMC</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Dataset annotations and download links (actual data hosted on Google Drive)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing Components</strong>:</p>
<ul>
<li>No training or inference code is publicly released; only the dataset is available.</li>
<li>Pre-trained model weights are not provided.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, J., Wu, H., Chen, M., Liu, C., Wu, J., Yin, S., Yin, B., Yin, B., Liu, C., Du, J., &amp; Dai, L. (2023). Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder. <em>Proceedings of the 31st ACM International Conference on Multimedia</em> (pp. 8114-8124). <a href="https://doi.org/10.1145/3581783.3612573">https://doi.org/10.1145/3581783.3612573</a></p>
<p><strong>Publication</strong>: ACM Multimedia 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">GitHub Repository / EDU-CHEMC Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{huHandwrittenChemicalStructure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st ACM International Conference on Multimedia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hu, Jinshui and Wu, Hao and Chen, Mingjun and Liu, Chenyu and Wu, Jiajia and Yin, Shi and Yin, Baocai and Yin, Bing and Liu, Cong and Du, Jun and Dai, Lirong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{8114--8124}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Ottawa ON Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3581783.3612573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{979-8-4007-0108-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>End-to-End Transformer for Molecular Image Captioning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</guid><description>Vision Transformer encoder with Transformer decoder for molecular image-to-InChI translation, outperforming CNN baselines on noisy molecular datasets.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological Paper</strong>. It proposes a novel architectural approach to molecular image translation by replacing the standard CNN encoder with a Vision Transformer (ViT). The authors validate this method through comparative benchmarking against standard CNN+RNN baselines (e.g., ResNet+LSTM) and provide optimizations for inference speed.</p>
<h2 id="motivation-and-problem-statement">Motivation and Problem Statement</h2>
<p>The core problem addressed is that existing molecular translation methods (which extract chemical structures from images into the computer-readable InChI format) rely heavily on rule-based systems or CNN+RNN architectures. These approaches often underperform on noisy images (common in scanned old journals) or images with few distinguishable features. There is a significant need in drug discovery to digitize and analyze legacy experimental data locked in image format within scientific publications.</p>
<h2 id="core-innovations-end-to-end-vit-encoder">Core Innovations: End-to-End ViT Encoder</h2>
<p>The primary contribution is the use of a completely convolution-free Vision Transformer (ViT) as the encoder, allowing the model to utilize long-range dependencies among image patches from the very beginning via self-attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
The architecture is a pure Transformer (Encoder-Decoder), treating the molecular image similarly to a sequence of tokens (patches). Furthermore, the authors implement a specific caching strategy for the decoder to avoid recomputing embeddings for previously decoded tokens, reducing the time complexity of the decoding step.</p>
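<p>The effect of the caching strategy can be illustrated by counting the work done per decoding step. The cost model below (with $M$ encoder patches and $N$ decoded tokens) is a rough sketch of the stated complexities, not the authors&rsquo; implementation:</p>

```python
def naive_cost(M, N):
    """Without caching: step t re-embeds all t decoded tokens, each
    attending over M encoder patches and up to t decoder positions."""
    return sum(t * (M + t) for t in range(1, N + 1))

def cached_cost(M, N):
    """With caching: step t only processes the newest token,
    O(M + t) work per step."""
    return sum(M + t for t in range(1, N + 1))

print(naive_cost(100, 50), cached_cost(100, 50))  # 170425 6275
```

Summing the per-step costs recovers the stated totals: on the order of $MN^2 + N^3$ without the cache versus $MN + N^2$ with it.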
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was compared against a standard CNN + RNN baseline and ResNet (18, 34, 50) + LSTM with attention. Ablation studies varied the number of transformer layers (3, 6, 12, 24) and the image resolution (224x224 vs 384x384). The model was trained on a large combined dataset, including Bristol Myers Squibb data, SMILES, GDB-13, and synthetically augmented images containing noise and artifacts. Performance was evaluated using the Levenshtein distance, the minimum number of single-character edits needed to transform the predicted string into the ground truth.</p>
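<p>The Levenshtein metric can be computed with the standard dynamic-programming recurrence; a compact sketch:</p>

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca with cb
            ))
        prev = cur
    return prev[-1]

# Two InChI-like strings differing by one insertion and one substitution.
print(levenshtein("InChI=1S/CH4", "InChI=1S/C2H6"))  # 2
```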
<h2 id="performance-outcomes-and-capabilities">Performance Outcomes and Capabilities</h2>
<p>The proposed 24-layer ViT model (input size 384) achieved the lowest Levenshtein distance of <strong>6.95</strong>, outperforming the ResNet50+LSTM baseline (7.49) and the standard CNN+RNN (103.7). Increasing the number of layers had a strong positive impact, with the 24-layer model becoming competitive with current approaches. The authors note the model was evaluated on datasets with low distinguishable features and noise, where the ViT encoder&rsquo;s self-attention over all patches from the first layer helped capture relevant structure. The proposed caching optimization reduced the total decoding time complexity from $O(MN^2 + N^3)$ to $O(MN + N^2)$ for $N$ timesteps, by reducing the per-timestep cost to $O(M + N)$.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combined dataset randomly split into 70% training, 10% test, and 20% validation.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Bristol Myers Squibb</strong></td>
          <td>~2.4 million synthetic images with InChI labels.</td>
          <td>Provided by BMS global biopharmaceutical company.</td>
      </tr>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>Kaggle contest data converted to InChI.</td>
          <td>Images generated using RDKit.</td>
      </tr>
      <tr>
          <td><strong><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></strong></td>
          <td>Subset of 977 million small organic molecules (up to 13 atoms).</td>
          <td>Converted from SMILES using RDKit.</td>
      </tr>
      <tr>
          <td><strong>Augmented Images</strong></td>
          <td>Synthetic images with salt/pepper noise, dropped atoms, and bond modifications.</td>
          <td>Used to improve robustness against noise.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Objective</strong>: Cross-entropy loss minimization.</li>
<li><strong>Inference Decoding</strong>: Autoregressive decoding predicting the next character of the InChI string.</li>
<li><strong>Positional Encoding</strong>: Standard sine and cosine functions of different frequencies.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Caching</strong>: Caches the output of each layer during decoding to avoid recomputing embeddings for already decoded tokens.</li>
<li><strong>JIT</strong>: PyTorch JIT compiler used for graph optimization (1.2-1.5x speed increase on GPU).</li>
<li><strong>Self-Critical Training</strong>: Finetuning performed using self-critical sequence training (SCST).</li>
</ul>
</li>
</ul>
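<p>The sinusoidal positional encoding mentioned above follows the original Transformer formulation; a minimal sketch (interleaving sin/cos pairs, assuming an even model dimension):</p>

```python
import math

def positional_encoding(pos, d_model):
    """Standard sinusoidal encoding: sin/cos pairs with frequency
    1 / 10000^(2i / d_model) for dimension pair index i."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

print(positional_encoding(0, 8))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```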
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder (Vision Transformer)</strong>:
<ul>
<li>Input: Flattened 2D patches of the image. Patch size: $16 \times 16$.</li>
<li>Projection: Trainable linear projection to latent vector size $D$.</li>
<li>Structure: Alternating layers of Multi-Head Self-Attention (MHSA) and MLP blocks.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Input: Tokenized InChI string + sinusoidal positional embedding.</li>
<li>Vocabulary: 275 tokens (including <code>&lt;SOS&gt;</code>, <code>&lt;PAD&gt;</code>, <code>&lt;EOS&gt;</code>).</li>
</ul>
</li>
<li><strong>Hyperparameters (Best Model)</strong>:
<ul>
<li>Image Size: $384 \times 384$.</li>
<li>Layers: 24.</li>
<li>Feature Dimension: 512.</li>
<li>Attention Heads: 12.</li>
<li>Optimizer: Adam.</li>
<li>Learning Rate: $3 \times 10^{-5}$ (decayed by 0.5 in last 2 epochs).</li>
<li>Batch Size: Varied [64-512].</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Levenshtein Distance (lower is better).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Image Size</th>
          <th>Layers</th>
          <th>Epochs</th>
          <th>Levenshtein Dist.</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard CNN+RNN</td>
          <td>224</td>
          <td>3</td>
          <td>10</td>
          <td>103.7</td>
      </tr>
      <tr>
          <td>ResNet18 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>75.03</td>
      </tr>
      <tr>
          <td>ResNet34 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>45.72</td>
      </tr>
      <tr>
          <td>ResNet50 + LSTM</td>
          <td>224</td>
          <td>5</td>
          <td>10</td>
          <td>7.49</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>3</td>
          <td>5</td>
          <td>79.82</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>6</td>
          <td>5</td>
          <td>54.58</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>12</td>
          <td>5</td>
          <td>31.30</td>
      </tr>
      <tr>
          <td>ViT Transformers (Best)</td>
          <td>384</td>
          <td>24</td>
          <td>10</td>
          <td><strong>6.95</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>System</strong>: 70GB GPU system.</li>
<li><strong>Framework</strong>: PyTorch and PyTorch Lightning.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sundaramoorthy, C., Kelvin, L. Z., Sarin, M., &amp; Gupta, S. (2021). End-to-End Attention-based Image Captioning. <em>arXiv preprint arXiv:2104.14721</em>. <a href="https://doi.org/10.48550/arXiv.2104.14721">https://doi.org/10.48550/arXiv.2104.14721</a></p>
<p><strong>Publication</strong>: arXiv 2021 (preprint)</p>
<p><strong>Note</strong>: This is an arXiv preprint and has not undergone formal peer review.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{sundaramoorthyEndtoEndAttentionbasedImage2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{End-to-{{End Attention-based Image Captioning}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sundaramoorthy, Carola and Kelvin, Lin Ziwen and Sarin, Mahak and Gupta, Shubham}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER 1.0: Transformers for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</guid><description>Transformer-based approach for Optical Chemical Structure Recognition converting chemical images to SELFIES strings with 96% accuracy.</description><content:encoded><![CDATA[<h2 id="evaluating-the-contribution-a-methodological-shift">Evaluating the Contribution: A Methodological Shift</h2>
<p><strong>Method (Dominant)</strong> with strong <strong>Resource</strong> elements.</p>
<p>This is primarily a <strong>Method</strong> paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a <strong>Transformer-based network</strong> to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.</p>
<p>It also serves as a <strong>Resource</strong> contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (&gt;35 million molecules).</p>
<h2 id="motivation-inaccessible-chemical-knowledge">Motivation: Inaccessible Chemical Knowledge</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.</li>
<li><strong>Manual Bottlenecks</strong>: Manual curation and extraction of this data is tedious, slow, and error-prone.</li>
<li><strong>Limitations of Prior Tools</strong>: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.</li>
</ul>
<h2 id="key-innovation-transformer-based-molecular-translation">Key Innovation: Transformer-Based Molecular Translation</h2>
<ul>
<li><strong>Transformer Architecture</strong>: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a <strong>Transformer-based decoder</strong>, significantly improving accuracy.</li>
<li><strong>EfficientNet Backbone</strong>: Replaces the standard InceptionV3 feature extractor with <strong>EfficientNet-B3</strong>, which improved feature extraction quality for chemical images.</li>
<li><strong>SELFIES Representation</strong>: Utilizes <a href="/notes/chemistry/molecular-representations/notations/selfies/"><strong>SELFIES</strong></a> (SELF-referencing Embedded Strings) as the target output. This guarantees 100% robust molecular strings and eliminates the &ldquo;invalid SMILES&rdquo; problem common in generative models.</li>
<li><strong>Massive Scaling</strong>: Trains on synthetic datasets derived from PubChem (up to <strong>39 million molecules</strong> total, with the largest training subset at ~35 million), demonstrating that scaling data size directly correlates with improved model performance.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<ul>
<li><strong>Feature Extractor Ablation</strong>: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints:
$$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$</li>
<li><strong>Data Scaling</strong>: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.</li>
<li><strong>Stereochemistry &amp; Ions</strong>: Tested the model&rsquo;s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.</li>
<li><strong>Augmentation Robustness</strong>: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.</li>
</ul>
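<p>For binary fingerprints, the dot products in the Tanimoto formula reduce to set cardinalities, so the metric can be sketched directly over sets of on-bits (the fingerprint values below are made up for illustration):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity for binary fingerprints given as sets of on-bits:
    |A∩B| / (|A| + |B| - |A∩B|), matching T(A,B) = A·B / (|A|² + |B|² - A·B)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Hypothetical on-bit sets for a predicted and a ground-truth molecule.
fp_pred = {1, 5, 9, 42}
fp_true = {1, 5, 9, 77}
print(tanimoto(fp_pred, fp_true))  # 0.6
```

An exact fingerprint match gives a Tanimoto score of 1.0, the threshold the paper uses for its &ldquo;Tanimoto 1.0&rdquo; exact-match statistic.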
<h2 id="results-and-scaling-observations">Results and Scaling Observations</h2>
<ul>
<li><strong>Architecture Comparison</strong>: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved <strong>74.57%</strong> exact matches (Tanimoto 1.0) compared to only <strong>7.03%</strong> for the Encoder-Decoder (Table 4 in the paper).</li>
<li><strong>High Accuracy at Scale</strong>: With the full 35-million molecule training set (Dataset 1), the model achieved a <strong>Tanimoto 1.0 score of 96.47%</strong> and an average Tanimoto similarity of 0.99.</li>
<li><strong>Isomorphism</strong>: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>).</li>
<li><strong>Stereochemistry Costs</strong>: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).</li>
<li><strong>Hardware Efficiency</strong>: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.</li>
<li><strong>Augmentation Robustness (Dataset 3)</strong>: When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors generated synthetic data from PubChem.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 1 (Clean)</td>
          <td>39M total (35M train)</td>
          <td>No stereo/ions. Filtered for MW &lt; 1500, bond count 3-40, SMILES len &lt; 40.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 2 (Complex)</td>
          <td>37M total (33M train)</td>
          <td>Includes stereochemistry and charged groups (ions).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 3 (Augmented)</td>
          <td>37M total (33M train)</td>
          <td>Dataset 2 with image augmentations applied.</td>
      </tr>
      <tr>
          <td><strong>Preprocessing</strong></td>
          <td>N/A</td>
          <td>N/A</td>
          <td>Molecules converted to <strong>SELFIES</strong>. Images generated via CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs.</td>
      </tr>
      <tr>
          <td><strong>Format</strong></td>
          <td>TFRecords</td>
          <td>75 MB chunks</td>
          <td>128 Data points (image vector + tokenized string) per record.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Text Representation</strong>: <strong>SELFIES</strong> used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
<ul>
<li><em>Dataset 1 Tokens</em>: 27 unique tokens. Max length 47.</li>
<li><em>Dataset 2/3 Tokens</em>: 61 unique tokens (due to stereo/ion tokens).</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Implemented using <code>imgaug</code> python package. Random application of:
<ul>
<li>Gaussian/Average Blur, Additive Gaussian Noise, Salt &amp; Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler (following the &ldquo;Attention is all you need&rdquo; paper).</li>
</ul>
<h3 id="models">Models</h3>
<p>The final architecture is an <strong>Image-to-SELFIES Transformer</strong>.</p>
<ul>
<li><strong>Encoder (Feature Extractor)</strong>:
<ul>
<li><strong>EfficientNet-B3</strong> (pre-trained on Noisy-student).</li>
<li>Input: $299 \times 299 \times 3$ images (normalized -1 to 1).</li>
<li>Output Feature Vector: $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder (Transformer)</strong>:
<ul>
<li>4 Encoder-Decoder layers.</li>
<li>8 Parallel Attention Heads.</li>
<li>Dimension size: 512.</li>
<li>Feed-forward size: 2048.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td><strong>96.47%</strong></td>
          <td>74.57% (1M subset)</td>
          <td>Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Avg Tanimoto</strong></td>
          <td><strong>0.9923</strong></td>
          <td>0.9371 (1M subset)</td>
          <td>Average similarity score (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Isomorphism</strong></td>
          <td><strong>99.75%</strong></td>
          <td>-</td>
          <td>Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware</strong>: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.</li>
<li><strong>Comparison Hardware</strong>: Nvidia Tesla V100 (32GB GPU).</li>
<li><strong>Performance</strong>:
<ul>
<li>TPU v3-8 was ~4x faster than V100 GPU.</li>
<li>Convergence for the 1-million-molecule model: 8 h 41 min on TPU vs. ~29 h 48 min on GPU.</li>
<li>Largest model (35M) took less than 14 days on TPU.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper is open-access, and both code and data are publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER-TPU (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using TensorFlow and TPU training</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archival snapshot of the codebase</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SMILES data used for training (images generated via CDK SDG)</td>
      </tr>
      <tr>
          <td><a href="https://decimer.ai/">DECIMER Project Page</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Project landing page</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Hardware Requirements</strong>: Training requires TPU v3-8 (Google Cloud) or Nvidia V100 GPU. The largest model (35M molecules) took less than 14 days on TPU v3-8.</li>
<li><strong>Missing Components</strong>: None of note. Augmentation parameters are documented in the paper (Table 14), and pre-trained model weights are available through the GitHub repository.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. <em>Journal of Cheminformatics</em>, 13(1), 61. <a href="https://doi.org/10.1186/s13321-021-00538-8">https://doi.org/10.1186/s13321-021-00538-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">GitHub Repository</a></li>
<li><a href="https://decimer.ai/">DECIMER Project Page</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMER10Deep2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{DECIMER 1.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00538-8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1186/s13321-021-00538-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemPix: Hand-Drawn Hydrocarbon Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</guid><description>Deep learning framework using CNN-LSTM image captioning to convert hand-drawn hydrocarbon structures into SMILES strings with 76% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-core-contribution">Paper Classification and Core Contribution</h2>
<p>This is primarily a <strong>Method</strong> paper, with a secondary contribution as a <strong>Resource</strong> paper.</p>
<p>The paper&rsquo;s core contribution is the <strong>ChemPix architecture and training strategy</strong> using neural image captioning (CNN-LSTM) to convert hand-drawn chemical structures to SMILES. The extensive ablation studies on synthetic data generation (augmentation, degradation, backgrounds) and ensemble learning strategies confirm the methodological focus. The secondary resource contribution includes releasing a curated dataset of hand-drawn hydrocarbons and code for generating synthetic training data.</p>
<h2 id="the-structural-input-bottleneck-in-computational-chemistry">The Structural Input Bottleneck in Computational Chemistry</h2>
<p>Inputting molecular structures into computational chemistry software for quantum calculations is often a bottleneck, requiring domain expertise and cumbersome manual entry in drawing software. While optical chemical structure recognition (OCSR) tools exist, they typically struggle with the noise and variability of hand-drawn sketches. There is a practical need for a tool that allows chemists to simply photograph a hand-drawn sketch and immediately convert it into a machine-readable format (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>), making computational workflows more accessible.</p>
<h2 id="cnn-lstm-image-captioning-and-synthetic-generalization">CNN-LSTM Image Captioning and Synthetic Generalization</h2>
<ol>
<li><strong>Image Captioning Paradigm</strong>: The authors treat the problem as <strong>neural image captioning</strong>, using an encoder-decoder (CNN-LSTM) framework to &ldquo;translate&rdquo; an image directly to a SMILES string. This avoids the complexity of explicit atom/bond detection and graph assembly.</li>
<li><strong>Synthetic Data Engineering</strong>: The paper introduces a rigorous synthetic data generation pipeline that transforms clean RDKit-generated images into &ldquo;pseudo-hand-drawn&rdquo; images via randomized backgrounds, degradation, and heavy augmentation. This allows the model to achieve &gt;50% accuracy on real hand-drawn data without ever seeing it during training.</li>
<li><strong>Ensemble Uncertainty Estimation</strong>: The method utilizes a &ldquo;committee&rdquo; (ensemble) of networks to improve accuracy and estimate confidence based on vote agreement, providing users with reliability indicators for predictions.</li>
</ol>
<h2 id="extensive-ablation-and-real-world-evaluation">Extensive Ablation and Real-World Evaluation</h2>
<ol>
<li><strong>Ablation Studies on Data Pipeline</strong>: The authors trained models on datasets generated at different stages of the pipeline (Clean RDKit $\rightarrow$ Augmented $\rightarrow$ Backgrounds $\rightarrow$ Degraded) to quantify the value of each transformation in bridging the synthetic-to-real domain gap.</li>
<li><strong>Sample Size Scaling</strong>: They analyzed performance scaling by training on synthetic dataset sizes ranging from 10,000 to 500,000 images to understand data requirements.</li>
<li><strong>Real-world Validation</strong>: The model was evaluated on a held-out test set of hand-drawn images collected via a custom web app, providing genuine out-of-distribution testing.</li>
<li><strong>Fine-tuning Experiments</strong>: Comparisons of synthetic-only training versus fine-tuning with a small fraction of real hand-drawn data to assess the value of limited real-world supervision.</li>
</ol>
<h2 id="state-of-the-art-hand-drawn-ocsr-performance">State-of-the-Art Hand-Drawn OCSR Performance</h2>
<ol>
<li>
<p><strong>Pipeline Efficacy</strong>: Augmentation and image degradation were the most critical factors for generalization, achieving over 50% accuracy on hand-drawn data when training with 500,000 synthetic images. Adding backgrounds had a negligible effect on accuracy compared to degradation.</p>
</li>
<li>
<p><strong>State-of-the-Art Performance</strong>: The final ensemble model (5 out of 17 trained NNs, selected for achieving &gt;50% individual accuracy) achieved <strong>76% accuracy</strong> (top-1) and <strong>85.5% accuracy</strong> (top-3) on the hand-drawn test set, a significant improvement over the best single model&rsquo;s 67.5%.</p>
</li>
<li>
<p><strong>Synthetic Generalization</strong>: A model trained on 500,000 synthetic images achieved &gt;50% accuracy on real hand-drawn data without any fine-tuning, validating the synthetic data generation strategy as a viable alternative to expensive manual labeling.</p>
</li>
<li>
<p><strong>Ensemble Benefits</strong>: The voting committee approach improved accuracy and provided interpretable uncertainty estimates through vote distributions. When all five committee members agree ($V=5$), the confidence value reaches 98%.</p>
</li>
</ol>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations of the current system:</p>
<ul>
<li><strong>Hydrocarbons only</strong>: The model is restricted to hydrocarbon structures and does not handle heteroatoms or functional groups.</li>
<li><strong>No conjoined rings</strong>: Molecules with multiple conjoined rings are excluded due to limitations of RDKit&rsquo;s image generation, which depicts bridges differently from standard chemistry drawing conventions.</li>
<li><strong>Resonance hybrid notation</strong>: The network struggles with benzene rings drawn in the resonance-hybrid style (with an inscribed circle) compared to the Kekulé structure, since the RDKit training images use exclusively Kekulé representations.</li>
<li><strong>Challenging backgrounds</strong>: Lined and squared paper increase recognition difficulty, and structures bleeding through from the opposite side of the page can confuse the network.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study relies on two primary data sources: a massive synthetic dataset generated procedurally and a smaller collected dataset of real drawings.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic (RDKit)</td>
          <td>500,000 images</td>
          <td>Generated via RDKit with &ldquo;heavy&rdquo; augmentation: rotation ($0-360°$), blur, salt-and-pepper noise, and background texture addition.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-Drawn (Real)</td>
          <td>613 images</td>
          <td>Crowdsourced via a web app from over 100 unique users; split into 200-image test set and 413 training/validation images.</td>
      </tr>
      <tr>
          <td><strong>Backgrounds</strong></td>
          <td>Texture Images</td>
          <td>1,052 images</td>
          <td>A pool of unlabeled texture photos (paper, desks, shadows) used to generate synthetic backgrounds.</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation Parameters</strong>:</p>
<ul>
<li><strong>Augmentations</strong>: Rotation, Resize ($200-300px$), Blur, Dilate, Erode, Aspect Ratio, Affine transform ($\pm 20px$), Contrast, Quantize, Sharpness</li>
<li><strong>Backgrounds</strong>: Randomly translated $\pm 100$ pixels and reflected</li>
</ul>
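<p>Two of these degradation steps can be sketched in pure Python on a grayscale image stored as nested lists. Parameters and helper names are illustrative; the actual pipeline operates on rendered PNGs with the larger battery of transforms listed above.</p>

```python
import random

def salt_and_pepper(img, p, rng=None):
    """Flip each pixel to black (0) or white (255) with probability p each,
    mimicking the degradation used to make clean renders look photographed."""
    rng = rng or random.Random(0)
    out = []
    for row in img:
        new_row = []
        for px in row:
            r = rng.random()
            if r < p:
                new_row.append(0)        # pepper
            elif r < 2 * p:
                new_row.append(255)      # salt
            else:
                new_row.append(px)
        out.append(new_row)
    return out

def translate(img, dx, dy, fill=255):
    """Shift the image by (dx, dy) pixels, padding with background."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out

clean = [[255] * 8 for _ in range(8)]
clean[0][0] = 0                           # a single dark "ink" pixel
noisy = salt_and_pepper(clean, p=0.1)
shifted = translate(clean, dx=2, dy=1)    # ink pixel moves from (0, 0) to (1, 2)
```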
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ensemble Voting</strong><br>
A committee of networks casts votes for the predicted SMILES string. The final prediction is the one with the highest vote count. Validity of SMILES is checked using RDKit.</p>
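<p>A minimal sketch of the committee vote, with a pluggable validity check standing in for the RDKit parse (helper names are illustrative):</p>

```python
from collections import Counter

def committee_vote(predictions, is_valid=lambda s: True):
    """Majority vote over the SMILES strings predicted by an ensemble.
    Invalid strings (checked with RDKit in the paper; `is_valid` is a
    stand-in here) are discarded before counting votes."""
    valid = [p for p in predictions if is_valid(p)]
    if not valid:
        return None, 0
    best, votes = Counter(valid).most_common(1)[0]
    return best, votes

preds = ["CCO", "CCO", "C(C)O", "CCO", "CC"]
smiles, votes = committee_vote(preds)
print(smiles, votes)  # CCO 3
```

<p>The vote count doubles as the confidence signal: the paper reports 98% confidence when all five committee members agree.</p>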
<p><strong>Beam Search</strong><br>
Used in the decoding layer with a beam width of $k=5$ to explore multiple potential SMILES strings. It approximates the sequence $\mathbf{\hat{y}}$ that maximizes the joint probability:</p>
<p>$$ \mathbf{\hat{y}} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
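<p>The decoding loop can be sketched with a toy stand-in for the LSTM&rsquo;s per-step distribution (the transition table below is invented for illustration; only the beam mechanics mirror the paper&rsquo;s setup):</p>

```python
def beam_search(step_logprobs, vocab, k=5, max_len=4, eos="$"):
    """Keep the k prefixes with the highest joint log-probability,
    expanding each by every vocabulary token until they emit `eos`."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished sequence
                continue
            logprobs = step_logprobs(prefix)
            for tok in vocab:
                candidates.append((prefix + (tok,), score + logprobs[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Toy "decoder": prefers C, then O, then end-of-sequence.
def toy_model(prefix):
    table = {0: {"C": -0.1, "O": -3.0, "$": -4.0},
             1: {"C": -2.0, "O": -0.2, "$": -3.0},
             2: {"C": -3.0, "O": -2.5, "$": -0.1}}
    return table.get(len(prefix), {"C": -1.0, "O": -1.0, "$": -0.5})

best_seq, best_score = beam_search(toy_model, ["C", "O", "$"], k=3)[0]
print("".join(best_seq))  # CO$
```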
<p><strong>Optimization</strong>:</p>
<ul>
<li>
<p><strong>Optimizer</strong>: Adam</p>
</li>
<li>
<p><strong>Learning Rate</strong>: $1 \times 10^{-4}$</p>
</li>
<li>
<p><strong>Batch Size</strong>: 20</p>
</li>
<li>
<p><strong>Loss Function</strong>: Cross-entropy loss across the sequence of $T$ tokens, computed as:</p>
<p>$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p>where $\mathbf{x}$ is the image representation and $y_t$ is the predicted SMILES character. This is calculated as perplexity for validation.</p>
</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture is a standard image captioning model (Show, Attend and Tell style) adapted for chemical structures.</p>
<p><strong>Encoder (CNN)</strong>:</p>
<ul>
<li><strong>Input</strong>: 256x256 pixel PNG images</li>
<li><strong>Structure</strong>: 4 blocks of Conv2D + MaxPool
<ul>
<li>Block 1: 64 filters, (3,3) kernel</li>
<li>Block 2: 128 filters, (3,3) kernel</li>
<li>Block 3: 256 filters, (3,3) kernel</li>
<li>Block 4: 512 filters, (3,3) kernel</li>
</ul>
</li>
<li><strong>Activation</strong>: ReLU throughout</li>
</ul>
<p><strong>Decoder (LSTM)</strong>:</p>
<ul>
<li><strong>Hidden Units</strong>: 512</li>
<li><strong>Embedding Dimension</strong>: 80</li>
<li><strong>Attention</strong>: intermediate attention vector dimension of 512</li>
</ul>
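<p>A quick sanity check on the encoder geometry above: with $3 \times 3$ convolutions that preserve spatial size (an assumption; padding is not stated in the summary) and a $2 \times 2$ max-pool per block, four blocks take a $256 \times 256$ input down to a $16 \times 16 \times 512$ feature map.</p>

```python
def encoder_output_shape(h, w, blocks=(64, 128, 256, 512)):
    """Shape bookkeeping for the 4 Conv2D + MaxPool blocks: assuming 'same'
    padding, each block halves the spatial size and sets the channel count."""
    channels = None
    for filters in blocks:
        h, w = h // 2, w // 2   # 2x2 max-pool
        channels = filters
    return h, w, channels

print(encoder_output_shape(256, 256))  # (16, 16, 512)
```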
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Exact SMILES match accuracy (character-by-character identity between predicted and ground truth SMILES)</li>
<li><strong>Perplexity</strong>: Used for saving model checkpoints (minimizing uncertainty)</li>
<li><strong>Top-k Accuracy</strong>: Reported for $k=1$ (76%) and $k=3$ (85.5%)</li>
</ul>
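<p>The accuracy metrics reduce to simple string comparisons over ranked beam candidates. A sketch with invented toy predictions:</p>

```python
def topk_accuracy(ranked_preds, targets, k):
    """Fraction of molecules whose ground-truth SMILES appears among the
    top-k beam-search candidates (exact string match)."""
    hits = sum(target in preds[:k] for preds, target in zip(ranked_preds, targets))
    return hits / len(targets)

# Toy beam outputs for two molecules, ranked by probability.
ranked = [["CCO", "CC", "CCC"], ["CC", "O", "CO"]]
truth = ["CCO", "O"]
print(topk_accuracy(ranked, truth, k=1))  # 0.5
print(topk_accuracy(ranked, truth, k=3))  # 1.0
```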
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mtzgroup/ChemPixCH">ChemPixCH</a></td>
          <td>Code + Dataset</td>
          <td>Apache-2.0</td>
          <td>Official implementation with synthetic data generation pipeline and collected hand-drawn dataset</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weir, H., Thompson, K., Woodward, A., Choi, B., Braun, A., &amp; Martínez, T. J. (2021). ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning. <em>Chemical Science</em>, 12(31), 10622-10633. <a href="https://doi.org/10.1039/D1SC02957F">https://doi.org/10.1039/D1SC02957F</a></p>
<p><strong>Publication</strong>: Chemical Science 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/mtzgroup/ChemPixCH">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{weir2021chempix,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Weir, Hayley and Thompson, Keiran and Woodward, Amelia and Choi, Benjamin and Braun, Augustin and Mart{\&#39;i}nez, Todd J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{10622--10633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D1SC02957F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ABC-Net: Keypoint-Based Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/abc-net/</guid><description>Deep learning OCSR model using keypoint estimation to detect atom and bond centers for graph-based molecular structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-and-paper-type">Contribution and Paper Type</h2>
<p><strong>Method</strong>. The paper proposes a novel architectural framework (ABC-Net) for Optical Chemical Structure Recognition (OCSR). It reformulates the problem from image captioning (sequence generation) to keypoint estimation (pixel-wise detection), backed by ablation studies on noise and comparative benchmarks against state-of-the-art tools.</p>
<h2 id="motivation-for-keypoint-based-ocsr">Motivation for Keypoint-Based OCSR</h2>
<ul>
<li><strong>Inefficiency of Rule-Based Methods</strong>: Traditional tools (OSRA, MolVec) rely on hand-coded rules that are brittle, require domain expertise, and fail to handle the wide variance in molecular drawing styles.</li>
<li><strong>Data Inefficiency of Captioning Models</strong>: Recent Deep Learning approaches (like DECIMER, Img2mol) treat OCSR as image captioning (Image-to-SMILES). This is data-inefficient because canonical SMILES require learning traversal orders, necessitating millions of training examples.</li>
<li><strong>Goal</strong>: To create a scalable, data-efficient model that predicts graph structures directly by detecting atomic/bond primitives.</li>
</ul>
<h2 id="abc-nets-divide-and-conquer-architecture">ABC-Net&rsquo;s Divide-and-Conquer Architecture</h2>
<ul>
<li><strong>Divide-and-Conquer Strategy</strong>: ABC-Net breaks the problem down into detecting <strong>atom centers</strong> and <strong>bond centers</strong> as independent keypoints.</li>
<li><strong>Keypoint Estimation</strong>: A Fully Convolutional Network (FCN) generates heatmaps for object centers. This is inspired by computer vision techniques like CornerNet and CenterNet.</li>
<li><strong>Angle-Based Bond Detection</strong>: To handle overlapping bonds, the model classifies bond angles into 60 distinct bins ($0-360°$) at detected bond centers, allowing separation of intersecting bonds.</li>
<li><strong>Implicit Hydrogen Prediction</strong>: The model explicitly predicts the number of implicit hydrogens for heterocyclic atoms to resolve ambiguity in dearomatization.</li>
</ul>
<h2 id="experimental-setup-and-synthetic-data">Experimental Setup and Synthetic Data</h2>
<ul>
<li><strong>Dataset Construction</strong>: Synthetic dataset of 100,000 molecules from ChEMBL, rendered using two different engines (RDKit and Indigo) to ensure style diversity.</li>
<li><strong>Baselines</strong>: Compared against two rule-based methods (MolVec, OSRA) and one deep learning method (Img2mol).</li>
<li><strong>Robustness Testing</strong>: Evaluated on the external UOB dataset (real-world images) and synthetic images with varying levels of salt-and-pepper noise (up to $p=0.6$).</li>
<li><strong>Data Efficiency</strong>: Analyzed performance scaling with training set size (10k to 160k images).</li>
</ul>
<h2 id="results-generalization-and-noise-robustness">Results, Generalization, and Noise Robustness</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ABC-Net achieved <strong>94-98% accuracy</strong> across all test sets (Table 1), outperforming MolVec (12-45% on synthetic data, ~83% on UOB), OSRA (26-62% on synthetic, ~82% on UOB), and Img2mol (78-93% on non-stereo subsets).</li>
<li><strong>Generalization</strong>: On the external UOB benchmark, ABC-Net achieved <strong>&gt;95% accuracy</strong>, whereas the deep learning baseline (Img2mol) dropped to 78.2%, indicating better generalization.</li>
<li><strong>Data Efficiency</strong>: The model reached ~95% performance with only 80,000 training images, requiring roughly an order of magnitude less data than captioning-based models like Img2mol (which use millions of training examples).</li>
<li><strong>Noise Robustness</strong>: Performance remained stable (&lt;2% drop) with noise levels up to $p=0.1$. Even at extreme noise ($p=0.6$), Tanimoto similarity remained high, suggesting the model recovers most substructures even when exact matches fail.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Drawing style coverage</strong>: The synthetic training data covers only styles available through RDKit and Indigo renderers. Many real-world styles (e.g., hand-drawn structures, atomic group abbreviations) are not represented.</li>
<li><strong>No stereo baseline from Img2mol</strong>: The Img2mol comparison only covers non-stereo subsets because stereo results were not available from the original Img2mol paper.</li>
<li><strong>Scalability to large molecules</strong>: Molecules with more than 50 non-hydrogen atoms are excluded from the dataset, and performance on such large structures is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/zhang-xuan1314/ABC-Net">ABC-Net Repository</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation. Missing requirements.txt and pre-trained weights.</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status: Partially Reproducible</strong>. The code is provided, but key components like the pre-trained weights, exact training environment dependencies, and the generated synthetic datasets are missing from the open-source release, making exact reproduction difficult.</p>
<h3 id="data">Data</h3>
<p>The authors constructed a synthetic dataset because labeled pixel-wise OCSR data is unavailable.</p>
<ul>
<li><strong>Source</strong>: ChEMBL database</li>
<li><strong>Filtering</strong>: Excluded molecules with &gt;50 non-H atoms or rare atom types/charges (&lt;1000 occurrences).</li>
<li><strong>Sampling</strong>: 100,000 unique SMILES selected such that every atom type/charge appears in at least 1,000 compounds.</li>
<li><strong>Generation</strong>: Images generated via <strong>RDKit</strong> and <strong>Indigo</strong> libraries.
<ul>
<li><em>Augmentation</em>: Varied bond thickness, label mode, orientation, and aromaticity markers.</li>
<li><em>Resolution</em>: $512 \times 512$ pixels.</li>
<li><em>Noise</em>: Salt-and-pepper noise added during training ($P$ = probability of flipping a background pixel, with $Q = 50P$).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (RDKit/Indigo)</td>
          <td>80k</td>
          <td>8:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>UOB Dataset</td>
          <td>~5.7k images</td>
          <td>External benchmark from Univ. of Birmingham</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Keypoint Detection (Heatmaps)</strong></p>
<ul>
<li>
<p><strong>Down-sampling</strong>: Input $512 \times 512$ → Output $128 \times 128$ (stride 4).</p>
</li>
<li>
<p><strong>Label Softening</strong>: To handle discretization error, ground truth peaks are set to 1, first-order neighbors to 0.95, others to 0.</p>
</li>
<li>
<p><strong>Loss Function</strong>: Penalty-reduced pixel-wise binary focal loss (variants of CornerNet loss). The loss formulation is given as:</p>
<p>$$ L_{det} = - \frac{1}{N} \sum_{x,y} \begin{cases} (1 - \hat{A}_{x,y})^{\alpha} \log(\hat{A}_{x,y}) &amp; \text{if } A_{x,y} = 1 \\ (1 - A_{x,y}) (\hat{A}_{x,y})^{\alpha} \log(1 - \hat{A}_{x,y}) &amp; \text{otherwise} \end{cases} $$</p>
<ul>
<li>$\alpha=2$ (focal parameter). The $(1 - A_{x,y})$ term reduces the penalty for first-order neighbors of ground truth locations.</li>
<li>Property classification losses use a separate focal parameter $\beta=2$ with weight balancing: classes with &lt;10% frequency are weighted 10x.</li>
</ul>
</li>
</ul>
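<p>The detection loss is easy to read off the formula directly. A pure-Python sketch over a three-pixel &ldquo;heatmap&rdquo; (values invented for illustration):</p>

```python
import math

def focal_loss(pred, gt, alpha=2.0):
    """Penalty-reduced pixel-wise focal loss (CornerNet/CenterNet style).
    `pred` and `gt` are flat lists of per-pixel values; ground-truth peaks
    are 1.0, softened first-order neighbours 0.95, background 0."""
    n = sum(1 for g in gt if g == 1.0) or 1   # number of keypoints
    total = 0.0
    for p, g in zip(pred, gt):
        if g == 1.0:
            total += (1 - p) ** alpha * math.log(p)
        else:
            # the (1 - g) factor shrinks the penalty near softened neighbours
            total += (1 - g) * p ** alpha * math.log(1 - p)
    return -total / n

loss = focal_loss(pred=[0.9, 0.5, 0.1], gt=[1.0, 0.95, 0.0])  # ~0.0108
```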
<p><strong>2. Bond Direction Classification</strong></p>
<ul>
<li><strong>Angle Binning</strong>: $360°$ divided into 60 intervals.</li>
<li><strong>Inference</strong>: A bond is detected if the angle probability is a local maximum and exceeds a threshold.</li>
<li><strong>Non-Maximum Suppression (NMS)</strong>: Required for opposite angles (e.g., $30°$ and $210°$) representing the same non-stereo bond.</li>
</ul>
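<p>A sketch of this inference step, assuming a 60-bin probability vector at a detected bond center (threshold and values invented):</p>

```python
def detect_bond_angles(probs, threshold=0.5):
    """Keep angle bins that are local maxima above a threshold, then merge
    opposite bins (i and i + 30 of 60) that describe the same non-stereo
    bond seen from its two endpoints."""
    n = len(probs)
    peaks = [i for i in range(n)
             if probs[i] > threshold
             and probs[i] >= probs[(i - 1) % n]
             and probs[i] >= probs[(i + 1) % n]]
    merged = set()
    for i in peaks:
        opposite = (i + n // 2) % n
        merged.add(min(i, opposite) if opposite in peaks else i)
    return sorted(merged)

probs = [0.0] * 60
probs[5] = 0.9    # a bond at ~30 degrees ...
probs[35] = 0.85  # ... and its opposite direction at ~210 degrees
print(detect_bond_angles(probs))  # [5]
```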
<p><strong>3. Multi-Task Weighting</strong></p>
<ul>
<li>Uses Kendall&rsquo;s uncertainty weighting to balance 8 different loss terms (atom det, bond det, atom type, charge, H-count, bond angle, bond type, bond length).</li>
</ul>
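<p>The weighting scheme can be sketched as follows, with the eight task losses and the learned log-variances reduced to plain floats (values illustrative):</p>

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Kendall-style homoscedastic uncertainty weighting: each task loss L_i
    is scaled by exp(-s_i) and regularised by +s_i, where s_i = log(sigma_i^2)
    is a learned per-task parameter (plain floats here)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# Eight tasks: atom det, bond det, atom type, charge, H-count,
# bond angle, bond type, bond length (values invented).
losses = [1.2, 0.8, 0.5, 0.3, 0.4, 0.9, 0.6, 0.2]
total = uncertainty_weighted_loss(losses, [0.0] * 8)
# with all s_i = 0 this is just the plain sum of the task losses
```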
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: ABC-Net (Custom U-Net / FCN)</p>
<ul>
<li><strong>Input</strong>: $512 \times 512 \times 1$ (Grayscale).</li>
<li><strong>Contracting Path</strong>: 5 steps. Each step has conv-blocks + $2 \times 2$ MaxPool.</li>
<li><strong>Expansive Path</strong>: 3 steps. Transpose-Conv upsampling + Concatenation (Skip Connections).</li>
<li><strong>Heads</strong>: Separate $1 \times 1$ convs for each task map (Atom Heatmap, Bond Heatmap, Property Maps).</li>
<li><strong>Output Dimensions</strong>:
<ul>
<li>Heatmaps: $(1, 128, 128)$</li>
<li>Bond Angles: $(60, 128, 128)$</li>
</ul>
</li>
<li><strong>Pre-trained Weights</strong>: Not included in the public <a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub repository</a>. The paper&rsquo;s availability statement mentions code and training datasets but not weights.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Detection</strong>: Precision &amp; Recall (Object detection level).</li>
<li><strong>Regression</strong>: Mean Absolute Error (MAE) for bond lengths.</li>
<li><strong>Structure Recovery</strong>:
<ul>
<li><em>Accuracy</em>: Exact SMILES match rate.</li>
<li><em>Tanimoto</em>: ECFP similarity (fingerprint overlap).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>ABC-Net</th>
          <th>Img2mol (Baseline)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (UOB)</strong></td>
          <td><strong>96.1%</strong></td>
          <td>78.2%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Accuracy (Indigo)</strong></td>
          <td><strong>96.4%</strong></td>
          <td>89.5%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (UOB)</strong></td>
          <td><strong>0.989</strong></td>
          <td>0.953</td>
          <td>Higher substructure recovery</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>: 15 epochs, Batch size 64.</li>
<li><strong>Optimization</strong>: Adam Optimizer. LR $2.5 \times 10^{-4}$ (first 5 epochs) → $2.5 \times 10^{-5}$ (last 10).</li>
<li><strong>Repetition</strong>: Every experiment was repeated 3 times with random dataset splitting; mean values are reported.</li>
<li><strong>Compute</strong>: High-Performance Computing Center of Central South University. Specific GPU model not listed.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Yi, J.-C., Yang, G.-P., Wu, C.-K., Hou, T.-J., &amp; Cao, D.-S. (2022). ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition from molecular images. <em>Briefings in Bioinformatics</em>, 23(2), bbac033. <a href="https://doi.org/10.1093/bib/bbac033">https://doi.org/10.1093/bib/bbac033</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangABCNetDivideandconquerBased2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ABC-Net: A Divide-and-Conquer Based Deep Learning Architecture for {SMILES} Recognition from Molecular Images}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Xiao-Chen and Yi, Jia-Cai and Yang, Guo-Ping and Wu, Cheng-Kun and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{bbac033}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bib/bbac033}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Unified Framework for Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</guid><description>A 2009 unified framework for inorganic/organic chemical handwriting recognition using graph search and statistical symbol grouping.</description><content:encoded><![CDATA[<h2 id="addressing-the-complexity-of-handwritten-organic-chemistry">Addressing the Complexity of Handwritten Organic Chemistry</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) from Microsoft Research Asia that addresses the challenge of recognizing complex 2D organic chemistry structures. By 2009, math expression recognition had seen significant commercial progress, but chemical expression recognition remained less developed.</p>
<p>The specific gap addressed is the geometric complexity of organic formulas. While inorganic formulas typically follow a linear, equation-like structure, organic formulas present complex 2D diagrammatic structures with various bond types and rings. Existing work often relied on strong assumptions (like single-stroke symbols) or failed to handle arbitrary compounds. There was a clear need for a unified solution capable of handling both inorganic and organic domains consistently.</p>
<h2 id="the-chemical-expression-structure-graph-cesg">The Chemical Expression Structure Graph (CESG)</h2>
<p>The core innovation is a unified statistical framework that processes inorganic and organic expressions within the same pipeline. Key technical novelties include:</p>
<ol>
<li><strong>Unified Bond Modeling</strong>: Bonds are treated as special symbols. The framework detects &ldquo;extended bond symbols&rdquo; (multi-stroke bonds) and splits them into single, double, or triple bonds using corner detection for consistent processing.</li>
<li><strong>Chemical Expression Structure Graph (CESG)</strong>: A defined graph representation for generic chemical expressions where nodes represent symbols and edges represent bonds or spatial relations.</li>
<li><strong>Non-Symbol Modeling</strong>: During the symbol grouping phase, the system explicitly models &ldquo;invalid groups&rdquo; to reduce over-grouping errors.</li>
<li><strong>Global Graph Search</strong>: Structure analysis is formulated as finding the optimal CESG by searching over a Weighted Direction Graph ($G_{WD}$).</li>
</ol>
<h2 id="graph-search-and-statistical-validation">Graph Search and Statistical Validation</h2>
<p>The authors validated the framework on a proprietary database of 35,932 handwritten chemical expressions collected from 300 writers.</p>
<ul>
<li><strong>Setup</strong>: The data was split into roughly 26,000 training and 6,400 testing samples.</li>
<li><strong>Metric</strong>: Recognition accuracy was measured strictly by expression (all symbols and the complete structure must be correct).</li>
<li><strong>Ablations</strong>: The team evaluated the performance contribution of symbol grouping, structure analysis, and full semantic verification.</li>
</ul>
<h2 id="recognition-accuracy-and-outcomes">Recognition Accuracy and Outcomes</h2>
<p>The full framework achieved a Top-1 accuracy of 75.4% and a Top-5 accuracy of 83.1%.</p>
<ul>
<li><strong>Component Contribution</strong>: Structure analysis is the primary bottleneck. With perfect grouping alone, Top-1 accuracy would reach 85.9%; adding structure analysis lowers it to 74.1% due to structural errors.</li>
<li><strong>Semantic Verification</strong>: Checking valence and grammar rules lifted Top-1 accuracy from 74.1% to 75.4%, a relative improvement of 1.7%.</li>
</ul>
<p>The unified framework effectively handles the variance in 2D space for chemical expressions, demonstrating that delayed decision-making (keeping top-N candidates) improves robustness.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No public artifacts (code, data, models) were released by the authors.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used a private Microsoft Research Asia dataset, making direct reproduction difficult.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total</td>
          <td>Proprietary MSRA DB</td>
          <td>35,932 expressions</td>
          <td>Written by 300 people</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Subset</td>
          <td>25,934 expressions</td>
          <td></td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Subset</td>
          <td>6,398 expressions</td>
          <td></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Content</strong>: 2,000 unique expressions from high school/college textbooks.</li>
<li><strong>Composition</strong>: ~25% of samples are organic expressions.</li>
<li><strong>Vocabulary</strong>: 163 symbol classes (elements, digits, <code>+</code>, <code>↑</code>, <code>%</code>, bonds, etc.).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Symbol Grouping (Dynamic Programming)</strong></p>
<ul>
<li>Objective: Find the optimal symbol sequence $G_{max}$ maximizing the posterior probability given the ink strokes:
$$ G_{max} = \arg\max_{G} P(G | \text{Ink}) $$</li>
<li><strong>Non-symbol modeling</strong>: Models are iteratively trained on &ldquo;incorrect grouping results&rdquo; so the system learns to reject invalid stroke groups.</li>
<li><strong>Inter-group modeling</strong>: Uses Gaussian Mixture Models (GMM) to model spatial relations ($R_j$) between groups.</li>
</ul>
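<p>The GMM spatial-relation term can be sketched in a few lines. This is a minimal illustration, not the paper's trained model: the mixture parameters below are placeholders, and a diagonal covariance is assumed.</p>

```python
import math

def gaussian2d_logpdf(x, y, mean, var):
    """Log-density of a diagonal-covariance 2-D Gaussian (one mixture
    component; a full GMM is a weighted sum of several)."""
    (mx, my), (vx, vy) = mean, var
    return (-0.5 * ((x - mx) ** 2 / vx + (y - my) ** 2 / vy)
            - 0.5 * math.log(4.0 * math.pi ** 2 * vx * vy))

def relation_loglik(dx, dy, components):
    """GMM log-likelihood that the offset (dx, dy) between two symbol
    groups realizes a given spatial relation R_j. `components` is a list
    of (weight, mean, var) tuples; real values would be fit on training
    pairs -- the ones below are illustrative only."""
    return math.log(sum(w * math.exp(gaussian2d_logpdf(dx, dy, m, v))
                        for w, m, v in components))

# Illustrative single-component "subscript" relation: offset down-right
subscript = [(1.0, (0.6, -0.4), (0.05, 0.05))]
```

<p>In the full grouping objective, relation log-likelihoods like these are combined with per-group symbol posteriors and maximized by dynamic programming over stroke partitions.</p>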
<p><strong>2. Bond Processing</strong></p>
<ul>
<li><strong>Extended Bond Symbol</strong>: Recognizes connected strokes (e.g., a messy double bond written in one stroke) as a single &ldquo;extended&rdquo; symbol.</li>
<li><strong>Splitting</strong>: Uses <strong>Curvature Scale Space (CSS)</strong> corner detection to split extended symbols into primitive lines.</li>
<li><strong>Classification</strong>: A Neural Network verifies if the split lines form valid single, double, or triple bonds.</li>
</ul>
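<p>The splitting step hinges on finding corners along the stroke. As a stand-in for full Curvature Scale Space detection (which smooths the curve at multiple scales before thresholding curvature), a simple turning-angle test over a small neighborhood captures the idea; the threshold and window size below are assumed values.</p>

```python
import math

def find_corners(points, angle_thresh=math.radians(60), k=3):
    """Corner-detection sketch for splitting extended bond strokes.
    Flags point i as a corner when the turning angle between the
    incoming chord (i-k -> i) and outgoing chord (i -> i+k) exceeds
    `angle_thresh`. Adjacent indices may co-fire; a real system would
    apply non-maximum suppression afterwards."""
    corners = []
    for i in range(k, len(points) - k):
        ax, ay = points[i][0] - points[i - k][0], points[i][1] - points[i - k][1]
        bx, by = points[i + k][0] - points[i][0], points[i + k][1] - points[i][1]
        na, nb = math.hypot(ax, ay), math.hypot(bx, by)
        if na == 0 or nb == 0:
            continue
        cos_angle = max(-1.0, min(1.0, (ax * bx + ay * by) / (na * nb)))
        if math.acos(cos_angle) > angle_thresh:
            corners.append(i)
    return corners
```

<p>Splitting the stroke at the detected indices yields the primitive lines that the neural-network verifier then classifies as single, double, or triple bonds.</p>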
<p><strong>3. Structure Analysis (Graph Search)</strong></p>
<ul>
<li><strong>Graph Construction</strong>: Builds a Weighted Direction Graph ($G_{WD}$) where nodes are symbol candidates and edges are potential relationships ($E_{c}, E_{nc}, E_{peer}, E_{sub}$).</li>
<li><strong>Edge Weights</strong>: Calculated as the product of observation, spatial, and contextual probabilities:
$$ W(S, O, R) = P(O|S) \times P(\text{Spatial}|R) \times P(\text{Context}|S, R) $$
<ul>
<li>Spatial probability uses rectangular control regions and distance functions.</li>
<li>Contextual probability uses statistical co-occurrence (e.g., &lsquo;C&rsquo; often appears with &lsquo;H&rsquo;).</li>
</ul>
</li>
<li><strong>Search</strong>: Breadth-first search with pruning to find the top-N optimal CESGs.</li>
</ul>
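<p>The edge-weight product and pruned search can be sketched compactly. The log-space weights and beam pruning below are implementation choices of ours (log space avoids underflow when many probabilities are multiplied along a path), not details from the paper.</p>

```python
import heapq
import math

def edge_logweight(p_obs, p_spatial, p_context):
    # Product of observation, spatial, and contextual probabilities,
    # computed in log space to avoid numerical underflow
    return math.log(p_obs) + math.log(p_spatial) + math.log(p_context)

def top_n_paths(graph, start, goal, n=5, beam=50):
    """Breadth-first search with beam pruning over a weighted direction
    graph, returning the top-n highest-weight start->goal paths.
    `graph` maps node -> list of (neighbor, log_weight) pairs."""
    frontier = [(0.0, [start])]
    complete = []
    while frontier:
        nxt = []
        for logw, path in frontier:
            if path[-1] == goal:
                complete.append((logw, path))
                continue
            for nb, w in graph.get(path[-1], []):
                if nb not in path:  # keep candidate paths simple (no cycles)
                    nxt.append((logw + w, path + [nb]))
        # Prune to the highest-scoring partial paths (beam search)
        frontier = heapq.nlargest(beam, nxt)
    return heapq.nlargest(n, complete)
```

<p>In the paper's setting, nodes are symbol candidates and the surviving top-N paths correspond to the candidate CESGs passed on to semantic verification.</p>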
<h3 id="models">Models</h3>
<ul>
<li><strong>Symbol Recognition</strong>: Implementation details not specified, but likely HMM or NN based on the era. Bond verification explicitly uses a <strong>Neural Network</strong>.</li>
<li><strong>Spatial Models</strong>: <strong>Gaussian Mixture Models (GMM)</strong> are used to model the 9 spatial relations (e.g., Left-super, Above, Subscript).</li>
<li><strong>Semantic Model</strong>: A <strong>Context-Free Grammar (CFG)</strong> parser is used for final verification (e.g., ensuring digits aren&rsquo;t reactants).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation is performed using &ldquo;Expression-level accuracy&rdquo;.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Top-1)</th>
          <th>Value (Top-5)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full Framework</td>
          <td>75.4%</td>
          <td>83.1%</td>
          <td></td>
      </tr>
      <tr>
          <td>Without Semantics</td>
          <td>74.1%</td>
          <td>83.0%</td>
          <td></td>
      </tr>
      <tr>
          <td>Grouping Only</td>
          <td>85.9%</td>
          <td>95.6%</td>
          <td>Theoretical max if structure analysis was perfect</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, M., Han, S., &amp; Zhang, D. (2009). A Unified Framework for Recognizing Handwritten Chemical Expressions. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1345&ndash;1349. <a href="https://doi.org/10.1109/ICDAR.2009.64">https://doi.org/10.1109/ICDAR.2009.64</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changUnifiedFrameworkRecognizing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A {{Unified Framework}} for {{Recognizing Handwritten Chemical Expressions}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Ming and Han, Shi and Zhang, Dongmei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2009</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1345--1349}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.64}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SVM-HMM Online Classifier for Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</guid><description>A dual-stage classifier combining SVM and HMM to recognize online handwritten chemical symbols, introducing a reordering algorithm for organic rings.</description><content:encoded><![CDATA[<h2 id="contribution-double-stage-classification-method">Contribution: Double-Stage Classification Method</h2>
<p>This is a <strong>Method</strong> paper. It makes a methodological contribution by proposing a novel &ldquo;double-stage classifier&rdquo; architecture: a specific algorithmic pipeline (SVM rough classification followed by HMM fine classification) plus a novel pre-processing algorithm (Point Sequence Reordering) that resolves technical limitations in recognizing organic ring structures. The contribution is validated through ablation studies (comparing SVM kernels and HMM state/Gaussian counts) and performance benchmarks.</p>
<h2 id="motivation-recognizing-complex-organic-ring-structures">Motivation: Recognizing Complex Organic Ring Structures</h2>
<p>The primary motivation is the complexity of recognizing handwritten chemical symbols, specifically the distinction between <strong>Organic Ring Structures (ORS)</strong> and <strong>Non-Ring Structures (NRS)</strong>. Existing single-stage classifiers are unreliable for ORS because these symbols have arbitrary writing styles, variable stroke numbers, and inconsistent stroke orders due to their 2D hexagonal structure. A robust system is needed to handle this uncertainty and achieve high accuracy.</p>
<h2 id="core-innovation-point-sequence-reordering-psr">Core Innovation: Point Sequence Reordering (PSR)</h2>
<p>The authors introduce two main novelties:</p>
<ol>
<li><strong>Double-Stage Architecture</strong>: A hybrid system where an SVM (using RBF kernel) first roughly classifies inputs as either ORS or NRS, followed by specialized HMMs for fine-grained recognition.</li>
<li><strong>Point Sequence Reordering (PSR) Algorithm</strong>: A stroke-order independent algorithm designed specifically for ORS. It reorders the point sequence of a symbol based on a counter-clockwise scan from the centroid, effectively eliminating the uncertainty caused by variations in stroke number and writing order.</li>
</ol>
<h2 id="methodology--experimental-design">Methodology &amp; Experimental Design</h2>
<p>The authors collected a custom dataset and performed sequential optimizations:</p>
<ul>
<li><strong>SVM Optimization</strong>: Compared Polynomial, RBF, and Sigmoid kernels to find the best rough classifier.</li>
<li><strong>HMM Optimization</strong>: Tested multiple combinations of states (4, 6, 8) and Gaussians (3, 4, 6, 8, 9, 12) to maximize fine classification accuracy.</li>
<li><strong>PSR Validation</strong>: Conducted an ablation study comparing HMM accuracy on ORS symbols &ldquo;Before PSR&rdquo; vs &ldquo;After PSR&rdquo; to quantify the algorithm&rsquo;s impact.</li>
</ul>
<h2 id="results--final-conclusions">Results &amp; Final Conclusions</h2>
<ul>
<li><strong>Architecture Performance</strong>: The RBF-based SVM achieved 99.88% accuracy in differentiating ORS from NRS.</li>
<li><strong>HMM Configuration</strong>: The optimal HMM topology was found to be 8-states and 12-Gaussians for both symbol types.</li>
<li><strong>PSR Impact</strong>: The PSR algorithm improved ORS recognition. Top-1 accuracy shifted from <strong>49.84% (Before PSR)</strong> to <strong>98.36% (After PSR)</strong>.</li>
<li><strong>Overall Accuracy</strong>: The final integrated system achieved a Top-1 accuracy of <strong>93.10%</strong> and Top-3 accuracy of <strong>98.08%</strong> on the test set.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study defined 101 chemical symbols split into two categories.</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Count</th>
          <th>Content</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>NRS</strong> (Non-Ring)</td>
          <td>63</td>
          <td>Digits 0-9, 44 letters, 9 operators</td>
          <td>Operators include +, -, =, $\rightarrow$, etc.</td>
      </tr>
      <tr>
          <td><strong>ORS</strong> (Organic Ring)</td>
          <td>38</td>
          <td>2D hexagonal structures</td>
          <td>Benzene rings, cyclohexane, etc.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Collection</strong>: 12,322 total samples (122 per symbol) collected from 20 writers (teachers and students).</li>
<li><strong>Split</strong>: 9,090 training samples and 3,232 test samples.</li>
<li><strong>Constraints</strong>: Writers produced samples under three writing specifications: normal, standard, and freestyle.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SVM Feature Extraction (Rough Classification)</strong>
The input strokes are scaled, and a 58-dimensional feature vector is calculated:</p>
<ul>
<li><strong>Mesh ($4 \times 4$)</strong>: Ratio of points in 16 grids (16 features).</li>
<li><strong>Outline</strong>: Normalized scan distance from 4 edges with 5 scan lines each (20 features).</li>
<li><strong>Projection</strong>: Point density in 5 bins per edge (20 features).</li>
<li><strong>Aspect Ratio</strong>: Height/Width ratios (2 features).</li>
</ul>
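<p>A sketch of this 58-dimensional extraction follows. The dimension counts match the paper (16 + 20 + 20 + 2), but the exact normalizations, scan-line conventions, and aspect-ratio encoding are our assumptions.</p>

```python
import numpy as np

def extract_features(points):
    """58-dim feature sketch: mesh (16) + outline (20) + projection (20)
    + aspect ratio (2). Normalization details are assumed."""
    pts = np.asarray(points, dtype=float)
    mn, mx = pts.min(axis=0), pts.max(axis=0)
    span = np.where(mx - mn > 0, mx - mn, 1.0)
    norm = (pts - mn) / span  # scale into the unit square

    # Mesh (4x4): fraction of points falling in each of 16 cells
    ix = np.minimum((norm[:, 0] * 4).astype(int), 3)
    iy = np.minimum((norm[:, 1] * 4).astype(int), 3)
    mesh = np.bincount(iy * 4 + ix, minlength=16) / len(pts)

    # Outline: per edge, 5 scan bands; normalized distance from the edge
    # to the nearest ink in each band (4 edges x 5 bands = 20 features)
    outline = []
    for axis, from_end in [(0, False), (0, True), (1, False), (1, True)]:
        other = 1 - axis
        for k in range(5):
            band = norm[(norm[:, other] >= k / 5) & (norm[:, other] < (k + 1) / 5)]
            if len(band) == 0:
                outline.append(1.0)  # no ink in this scan band
            else:
                v = band[:, axis]
                outline.append(1.0 - v.max() if from_end else v.min())

    # Projection: point density in 5 bins per edge direction
    proj = []
    for axis in (0, 1):
        hist = np.histogram(norm[:, axis], bins=5, range=(0, 1))[0]
        proj.extend(hist / len(pts))
    proj = proj * 2  # mirror for the opposite edges (assumed symmetric)

    # Aspect ratio, encoded as two normalized height/width shares
    w, h = span[0], span[1]
    aspect = [h / (h + w), w / (h + w)]

    return np.concatenate([mesh, outline, proj, aspect])  # length 58
```

<p>The resulting vector feeds the Stage-1 SVM that separates ORS from NRS inputs.</p>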
<p><strong>2. Point Sequence Reordering (PSR)</strong>
Used strictly for ORS preprocessing:</p>
<ol>
<li>Calculate the centroid $(x_c, y_c)$ of the symbol.</li>
<li>Initialize a scan line at angle $\theta = 0$.</li>
<li>Traverse points; if a point $p_i = (x_i, y_i)$ satisfies the distance threshold to the scan line, add it to the reordered list. Distance $d_i$ is calculated as:
$$ d_i = |(y_i - y_c)\cos(\theta) - (x_i - x_c)\sin(\theta)| $$</li>
<li>Increment $\theta$ by $\Delta\theta$ and repeat until a full circle ($2\pi$) is completed.</li>
</ol>
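<p>A direct implementation of this sweep might look like the following. The half-plane check (so each scan line acts as a ray from the centroid rather than a full line) and the distance threshold are our additions; the paper's text as summarized gives only the distance formula.</p>

```python
import math

def point_sequence_reorder(points, delta_theta=math.pi / 180, dist_thresh=2.0):
    """Reorder a symbol's points by a counter-clockwise scan-line sweep
    around the centroid, making the sequence independent of stroke
    number and writing order. `dist_thresh` is an assumed tuning value
    for the point-to-scan-line distance test."""
    xc = sum(x for x, _ in points) / len(points)
    yc = sum(y for _, y in points) / len(points)
    reordered, seen = [], set()
    theta = 0.0
    while theta < 2 * math.pi:
        for i, (x, y) in enumerate(points):
            # Perpendicular distance from point i to the scan line at angle theta
            d = abs((y - yc) * math.cos(theta) - (x - xc) * math.sin(theta))
            if d <= dist_thresh and i not in seen:
                # Keep only points on the scan ray's forward half-plane
                if (x - xc) * math.cos(theta) + (y - yc) * math.sin(theta) >= 0:
                    reordered.append((x, y))
                    seen.add(i)
        theta += delta_theta
    return reordered
```

<p>After reordering, two ORS samples drawn with different stroke orders yield nearly identical point sequences, which is what lets a single left-right HMM model them.</p>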
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Stage 1)</strong>: RBF Kernel was selected as optimal with parameters $C=512$ and $\gamma=0.5$.</li>
<li><strong>HMM (Stage 2)</strong>: Left-right continuous HMM trained via Baum-Welch algorithm. The topology is one model per symbol using <strong>8 states and 12 Gaussians</strong>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported are Top-1, Top-2, and Top-3 accuracy on the held-out test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NRS Accuracy</th>
          <th>ORS Accuracy</th>
          <th>Overall Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Top-1</strong></td>
          <td>91.91%</td>
          <td>97.53%</td>
          <td>93.10%</td>
      </tr>
      <tr>
          <td><strong>Top-3</strong></td>
          <td>99.12%</td>
          <td>99.34%</td>
          <td>98.08%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: HP Pavilion tx1000 Tablet PC.</li>
<li><strong>Processor</strong>: 2.00GHz CPU.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Wang, K. (2010). A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols. <em>2010 International Conference on Pattern Recognition</em>, 1888&ndash;1891. <a href="https://doi.org/10.1109/ICPR.2010.465">https://doi.org/10.1109/ICPR.2010.465</a></p>
<p><strong>Publication</strong>: ICPR 2010</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2010svm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2010 International Conference on Pattern Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Wang, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2010}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1888--1891}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2010.465}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recognition of On-line Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</guid><description>A two-level recognition algorithm for on-line handwritten chemical expressions using structural and syntactic features.</description><content:encoded><![CDATA[<h2 id="contribution-on-line-chemical-expression-recognition-framework">Contribution: On-line Chemical Expression Recognition Framework</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline (&ldquo;Algorithm Model&rdquo;) for recognizing on-line handwritten chemical expressions. The paper focuses on detailing the specific mechanisms of this pipeline (pre-processing, segmentation, two-level recognition, and HCI) and validates its effectiveness through quantitative comparison against a conventional baseline. The rhetorical structure aligns with the &ldquo;Methodological Basis&rdquo; of the taxonomy, prioritizing the &ldquo;how well does this work?&rdquo; question over theoretical derivation or dataset curation.</p>
<h2 id="motivation-the-hci-gap-in-chemical-drawing">Motivation: The HCI Gap in Chemical Drawing</h2>
<p>The authors identify a gap in existing human-computer interaction (HCI) for chemistry. While mathematical formula recognition had seen progress, chemical expression recognition was under-researched. Existing tools relied on keyboard/mouse input, which was time-consuming and inefficient for the complex, variable nature of chemical structures. Previous attempts were either too slow (vectorization-based) or failed to leverage specific chemical knowledge effectively. There was a practical need for a system that could handle the specific syntactic rules of chemistry in an on-line (real-time) handwriting setting.</p>
<h2 id="novelty-two-level-recognition-architecture">Novelty: Two-Level Recognition Architecture</h2>
<p>The core contribution is a <strong>two-level recognition algorithm</strong> that integrates chemical domain knowledge.</p>
<ul>
<li><strong>Level 1 (Substance Level):</strong> Treats connected strokes as a potential &ldquo;substance unit&rdquo; (e.g., &ldquo;H2O&rdquo;) and matches them against a dictionary using a modified edit distance algorithm.</li>
<li><strong>Level 2 (Character Level):</strong> If the substance match fails, it falls back to segmenting the unit into isolated characters and reconstructing them using syntactic rules.</li>
<li><strong>Hybrid Segmentation:</strong> Combines structural analysis (using bounding box geometry for super/subscript detection) with &ldquo;partial recognition&rdquo; (identifying special symbols like <code>+</code>, <code>=</code>, <code>-&gt;</code> early to split the expression).</li>
</ul>
<h2 id="methodology-custom-dataset-and-baseline-comparisons">Methodology: Custom Dataset and Baseline Comparisons</h2>
<p>The authors conducted a validation experiment in a laboratory environment with 20 participants (chemistry students and teachers).</p>
<ul>
<li><strong>Dataset:</strong> 1,197 total samples (983 from a standard set of 341 expressions, 214 arbitrary expressions written by users).</li>
<li><strong>Baselines:</strong> They compared their &ldquo;Two-Level&rdquo; algorithm against a &ldquo;Conventional&rdquo; algorithm that skips the substance-level check and directly recognizes characters (&ldquo;Recognize Character Directly&rdquo;).</li>
<li><strong>Conditions:</strong> They also tested the impact of their Human-Computer Interaction (HCI) module which allows user corrections.</li>
</ul>
<h2 id="results-high-accuracy-and-hci-corrections">Results: High Accuracy and HCI Corrections</h2>
<ul>
<li><strong>Accuracy:</strong> The proposed two-level algorithm achieved significantly higher accuracy (<strong>96.4%</strong> for expression recognition) compared to the conventional baseline (<strong>91.5%</strong>).</li>
<li><strong>Robustness:</strong> The method performed well even on &ldquo;arbitrary&rdquo; expressions not in the standard set (92.5% accuracy vs 88.2% baseline).</li>
<li><strong>HCI Impact:</strong> Allowing users to correct results via the HCI module raised final accuracy to <strong>98.8%</strong>.</li>
<li><strong>Conclusion:</strong> The authors concluded the algorithm is reliable for real applications and flexible enough to be extended to other domains like physics or engineering.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a public benchmark but collected its own data for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Custom Lab Dataset</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">Collected from 20 chemistry students/teachers using Tablet PCs. Includes 341 standard expressions + arbitrary user inputs.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct phases with specific algorithmic choices:</p>
<p><strong>1. Pre-processing</strong></p>
<ul>
<li><strong>Smoothing:</strong> Uses a 5-tap Gaussian low-pass filter (Eq. 1) with specific coefficients to smooth stroke data.</li>
<li><strong>Redundancy:</strong> Merges redundant points and removes &ldquo;prickles&rdquo; (isolated noise).</li>
<li><strong>Re-ordering:</strong> Strokes are spatially re-sorted left-to-right, top-to-bottom to correct for arbitrary writing order.</li>
</ul>
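<p>The smoothing step can be sketched as a 5-tap convolution over each stroke coordinate. The binomial kernel and replicate-border handling below are assumptions; the paper's exact Eq. 1 coefficients are not reproduced here.</p>

```python
def smooth_stroke(coords, kernel=(1, 4, 6, 4, 1)):
    """5-tap low-pass smoothing of one stroke coordinate sequence
    (apply separately to the x and y sequences). The binomial kernel
    (1,4,6,4,1)/16 approximates a Gaussian."""
    k = [c / sum(kernel) for c in kernel]
    half = len(k) // 2
    out = []
    for i in range(len(coords)):
        acc = 0.0
        for j, w in enumerate(k):
            # Clamp indices at the stroke ends (replicate-border, assumed)
            idx = min(max(i + j - half, 0), len(coords) - 1)
            acc += w * coords[idx]
        out.append(acc)
    return out
```
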
<p><strong>2. Segmentation</strong></p>
<ul>
<li><strong>Structural Analysis:</strong> Distinguishes relationships (Superscript vs. Subscript vs. Horizontal) using a geometric feature vector $(T, B)$ based on bounding box heights ($h$), vertical centers ($C$), and barycenters ($B_{bary}$):
$$
\begin{aligned}
d &amp;= 0.7 \cdot y_{12} - y_{22} + 0.3 \cdot y_{11} \\
T &amp;= 1000 \cdot \frac{d}{h_1} \\
B &amp;= 1000 \cdot \frac{B_{bary1} - B_{bary2}}{h_1}
\end{aligned}
$$</li>
<li><strong>Partial Recognition:</strong> Detects special symbols (<code>+</code>, <code>=</code>, <code>-&gt;</code>) early to break expressions into &ldquo;super-substance units&rdquo; (e.g., separating reactants from products).</li>
</ul>
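<p>Translating these equations into code is direct once the bounding-box quantities are fixed. The mapping below ($y_{11}, y_{12}$ as the reference symbol's top and bottom, $y_{22}$ as the following symbol's bottom) is our reading of the notation, not stated explicitly in the summary.</p>

```python
def spatial_features(box1, box2):
    """Compute the (T, B) geometric feature pair from two bounding boxes.
    Each box is (y_top, y_bottom, barycenter_y) with y growing downward.
    Coefficients follow the equations in the text; the index-to-box
    mapping is an assumption."""
    y11, y12, bary1 = box1
    _, y22, bary2 = box2
    h1 = y12 - y11                      # height of the reference symbol
    d = 0.7 * y12 - y22 + 0.3 * y11     # weighted vertical offset
    T = 1000 * d / h1
    B = 1000 * (bary1 - bary2) / h1
    return T, B
```

<p>The (T, B) pair is then thresholded (or classified) to decide Superscript vs. Subscript vs. Horizontal placement.</p>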
<p><strong>3. Recognition (Two-Level)</strong></p>
<ul>
<li><strong>Level 1 (Dictionary Match):</strong>
<ul>
<li>Uses a modified <strong>Edit Distance</strong> (Eq. 6) incorporating a specific distance matrix based on chemical syntax.</li>
<li>Similarity $\lambda_{ij}$ is weighted by stroke credibility $\mu_i$ and normalized by string length.</li>
</ul>
</li>
<li><strong>Level 2 (Character Segmentation):</strong>
<ul>
<li>Falls back to this if Level 1 fails.</li>
<li>Segments characters by analyzing pixel density in horizontal/vertical/diagonal directions to find concave/convex points.</li>
<li>Recombines characters using syntactic rules (e.g., valency checks) to verify validity.</li>
</ul>
</li>
</ul>
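<p>The Level 1 match can be sketched as classic edit distance with a per-pair substitution-cost table standing in for the paper's chemistry-specific distance matrix; the stroke-credibility weighting $\mu_i$ is omitted here, and the threshold value is an assumption.</p>

```python
def normalized_edit_similarity(candidate, entry, sub_cost=None):
    """Edit distance with an optional symbol-pair substitution-cost
    table, converted to a length-normalized similarity in [0, 1]."""
    m, n = len(candidate), len(entry)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == entry[j - 1]:
                cost = 0.0
            else:
                # Cheap substitutions for visually confusable symbol pairs
                cost = (sub_cost or {}).get((candidate[i - 1], entry[j - 1]), 1.0)
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return 1.0 - dp[m][n] / max(m, n, 1)

def best_match(candidate, dictionary, threshold=0.8):
    """Return the best dictionary entry if its similarity clears the
    threshold; otherwise fall back to Level 2 (character segmentation)."""
    best = max(dictionary, key=lambda e: normalized_edit_similarity(candidate, e))
    s = normalized_edit_similarity(candidate, best)
    return (best, s) if s >= threshold else (None, s)
```

<p>A failed match (returning <code>None</code>) is exactly the condition that triggers the fall-back to character-level segmentation.</p>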
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on recognition accuracy at both the character and expression level.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Proposed)</th>
          <th style="text-align: left">Value (Baseline)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>96.4%</strong></td>
          <td style="text-align: left">91.5%</td>
          <td style="text-align: left">&ldquo;Standard&rdquo; dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>92.5%</strong></td>
          <td style="text-align: left">88.2%</td>
          <td style="text-align: left">&ldquo;Other&rdquo; (arbitrary) dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>HCI-Assisted Accuracy</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Accuracy after user correction.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Input Devices:</strong> Tablet PCs were used for data collection and testing.</li>
<li><strong>Compute:</strong> Specific training hardware is not listed, but the algorithm is designed for real-time interaction on standard 2008-era computing devices.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, Q., &amp; Zhang, Y. (2008). Recognition of On-line Handwritten Chemical Expressions. <em>2008 IEEE International Joint Conference on Neural Networks</em>, 2360&ndash;2365. <a href="https://doi.org/10.1109/IJCNN.2008.4634125">https://doi.org/10.1109/IJCNN.2008.4634125</a></p>
<p><strong>Publication</strong>: IJCNN 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{jufengyangRecognitionOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Recognition of On-Line Handwritten Chemical Expressions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 {{IEEE International Joint Conference}} on {{Neural Networks}} ({{IEEE World Congress}} on {{Computational Intelligence}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yang, Jufeng and Shi, Guangshun and Wang, Qingren and Zhang, Yong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2360--2365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Hong Kong, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IJCNN.2008.4634125}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-1820-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Online Handwritten Chemical Formula Structure Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</guid><description>A hierarchical grammar-based approach for recognizing and analyzing online handwritten chemical formulas in mobile education contexts.</description><content:encoded><![CDATA[<h2 id="hierarchical-grammatical-framework-contribution">Hierarchical Grammatical Framework Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for processing chemical formulas by decomposing them into three hierarchical levels (Formula, Molecule, Text). The contribution is defined by a specific set of formal grammatical rules and parsing algorithms used to construct a &ldquo;grammar spanning tree&rdquo; and &ldquo;molecule spanning graph&rdquo; from online handwritten strokes.</p>
<h2 id="motivation-for-online-formula-recognition">Motivation for Online Formula Recognition</h2>
<p>The primary motivation is the application of mobile computing in chemistry education, where precise comprehension of casual, <em>online</em> handwritten formulas is a significant challenge.</p>
<ul>
<li><strong>2D Complexity</strong>: Unlike 1D text, chemical formulas utilize complex 2D spatial relationships that convey specific chemical meaning (e.g., bonds, rings).</li>
<li><strong>Format Limitations</strong>: Existing storage formats like CML (Chemical Markup Language) or MDL MOLFILE do not natively record the layout or abbreviated information necessary for recognizing handwritten input.</li>
<li><strong>Online Gap</strong>: Previous research focused heavily on <em>offline</em> (image-based) recognition, lacking solutions for <em>online</em> (stroke-based) handwritten chemical formulas (OHCF).</li>
</ul>
<h2 id="core-novelty-in-three-level-grammatical-analysis">Core Novelty in Three-Level Grammatical Analysis</h2>
<p>The core novelty is the <strong>Three-Level Grammatical Analysis</strong> approach:</p>
<ol>
<li><strong>Formula Level (1D)</strong>: Treats the reaction equation as a linear sequence of components (Reactants, Products, Separators), parsed via a context-free grammar to build a spanning tree.</li>
<li><strong>Molecule Level (2D)</strong>: Treats molecules as graphs where &ldquo;text groups&rdquo; are vertices and &ldquo;bonds&rdquo; are edges. It introduces specific handling for &ldquo;hidden Carbon dots&rdquo; (intersections of bonds without text).</li>
<li><strong>Text Level (1D)</strong>: Analyzes the internal structure of text groups (atoms, subscripts).</li>
</ol>
<p>Unique to this approach is the <strong>formal definition of the chemical grammar</strong> as a 5-tuple $G=(T,N,P,M,S)$ and the generation of an <strong>Adjacency Matrix</strong> directly from the handwritten sketch to represent chemical connectivity.</p>
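<p>The adjacency-matrix representation described above can be sketched in a few lines. This is an illustration of the idea, not the paper's code; the function and variable names are ours.</p>

```python
# Sketch (not the paper's code): building the molecule-level adjacency
# matrix, where recognized "text groups" are vertices and bonds are edges.

def adjacency_matrix(text_groups, bonds):
    """text_groups: list of labels, e.g. ["CH3", "CH2", "OH"].
    bonds: list of (i, j, order) tuples linking group indices."""
    n = len(text_groups)
    matrix = [[0] * n for _ in range(n)]
    for i, j, order in bonds:
        matrix[i][j] = order  # bond order encodes single/double/triple
        matrix[j][i] = order  # undirected: keep the matrix symmetric
    return matrix

# Ethanol drawn as CH3-CH2-OH: two single bonds.
m = adjacency_matrix(["CH3", "CH2", "OH"], [(0, 1, 1), (1, 2, 1)])
# m == [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```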
<h2 id="experimental-validation-on-handwritten-strokes">Experimental Validation on Handwritten Strokes</h2>
<p>The authors validated their model using a custom dataset of online handwritten formulas.</p>
<ul>
<li><strong>Data Source</strong>: 25 formulas were randomly selected from a larger pool of 1,250 samples.</li>
<li><strong>Scope</strong>: The test set included 484 total symbols, comprising generators, separators, text symbols, rings, and various bond types.</li>
<li><strong>Granular Validation</strong>: The system was tested at multiple distinct stages:
<ul>
<li>Key Symbol Extraction (Formula Level)</li>
<li>Text Localization (Molecule Level)</li>
<li>Bond End Grouping (Molecule Level)</li>
<li>Text Recognition (Text Level)</li>
</ul>
</li>
</ul>
<h2 id="downstream-impact-and-parsing-accuracy">Downstream Impact and Parsing Accuracy</h2>
<p>The system achieved high accuracy across all sub-tasks, demonstrating that the hierarchical grammar approach is effective for both inorganic and organic formulas.</p>
<ul>
<li><strong>Formula Level</strong>: 98.3% accuracy for Key Symbols; 100% for State-assisted symbols.</li>
<li><strong>Molecule Level</strong>: 98.8% accuracy for Bond End Grouping; 100% for Free End-Text connection detection.</li>
<li><strong>Text Recognition</strong>: 98.7% accuracy (Top-3) using HMMs.</li>
<li><strong>Impact</strong>: The method successfully preserves the writer&rsquo;s &ldquo;online information&rdquo; (habits/intentions) while converting the handwritten input into standard formats suitable for visual editing or data retrieval.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To replicate this work, one would need to implement the specific grammatical production rules and the geometric thresholds defined for bond analysis.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Symbol HMMs</td>
          <td>5,670 samples</td>
          <td>Used to train the text recognition module</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Text Recognition</td>
          <td>2,016 samples</td>
          <td>Test set for character HMMs</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Formula Analysis</td>
          <td>25 formulas</td>
          <td>Random subset of 1,250 collected samples; contains 484 symbols</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Formula Level Parsing</strong></p>
<ul>
<li><strong>HBL Analysis</strong>: Identify the &ldquo;Horizontal Baseline&rdquo; (HBL) containing the most symbols to locate key operators (e.g., $+$, $\rightarrow$).</li>
<li><strong>Grammar</strong>: Use the productions defined in Figure 4. Example rules include:
<ul>
<li>$Reaction ::= ReactantList \ Generator \ ProductList$</li>
<li>$Reactant ::= BalancingNum \ Molecule \ IonicCharacter$</li>
</ul>
</li>
</ul>
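<p>The formula-level productions above amount to splitting the 1D symbol sequence at the generator and at separators. A minimal sketch, with <code>"-&gt;"</code> standing in for the arrow generator (the implementation details here are ours, not the paper's):</p>

```python
# Illustrative formula-level parse: split at the generator (arrow) and at
# '+' separators, mirroring Reaction ::= ReactantList Generator ProductList.

def parse_reaction(symbols):
    if "->" not in symbols:
        raise ValueError("no generator found on the baseline")
    arrow = symbols.index("->")

    def split_terms(seq):
        terms, current = [], []
        for s in seq:
            if s == "+":
                terms.append(current)
                current = []
            else:
                current.append(s)
        terms.append(current)
        return terms

    return {"reactants": split_terms(symbols[:arrow]),
            "products": split_terms(symbols[arrow + 1:])}

tree = parse_reaction(["2", "H2", "+", "O2", "->", "2", "H2O"])
# tree["reactants"] == [["2", "H2"], ["O2"]]
```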
<p><strong>2. Molecule Level Analysis (Bond Grouping)</strong></p>
<ul>
<li><strong>Endpoint Classification</strong>: Points are classified as <em>free ends</em>, <em>junctions</em> (3+ bonds), or <em>connections</em> (2 bonds).</li>
<li><strong>Grouping Equation</strong>: An endpoint $(x_k, y_k)$ belongs to Group A based on distance thresholding:
$$
\begin{aligned}
Include(x_k, y_k) = \begin{cases} 1, &amp; d_k &lt; t \cdot d_{\max} + \delta \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$
Where $d_k$ is the Euclidean distance from the endpoint to the group center $(x_a, y_a)$, $d_{\max}$ is the largest such distance within the group, $t$ is a scaling factor, and $\delta$ is a small additive tolerance.</li>
</ul>
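<p>The grouping test can be sketched as follows, under the reading that an endpoint joins a group when its distance to the group center is within a scaled maximum plus a tolerance. The threshold values <code>t</code> and <code>delta</code> are illustrative, not the paper's.</p>

```python
import math

# Sketch of the bond-end grouping decision. t and delta are illustrative.

def include(endpoint, group, t=0.5, delta=2.0):
    # Group center (x_a, y_a) is the mean of the member endpoints.
    xa = sum(x for x, _ in group) / len(group)
    ya = sum(y for _, y in group) / len(group)
    # Largest member distance sets the scale of the group.
    d_max = max(math.hypot(x - xa, y - ya) for x, y in group)
    d = math.hypot(endpoint[0] - xa, endpoint[1] - ya)
    return d < t * d_max + delta  # the 1/0 decision from the paper
```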
<p><strong>3. Connection Detection</strong></p>
<ul>
<li><strong>Text-Bond Connection</strong>: A text group is connected to a bond if the free end falls within a bounding box expanded by thresholds $t_W$ and $t_H$:
$$
\begin{aligned}
Con(x,y) = \begin{cases} 1, &amp; \min x - t_W &lt; x &lt; \max x + t_W \text{ AND } \min y - t_H &lt; y &lt; \max y + t_H \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$</li>
</ul>
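<p>The connection test is a point-in-expanded-box check. A minimal sketch, with illustrative threshold values:</p>

```python
# Sketch of the free-end/text connection test: a bond's free end connects
# to a text group when it falls inside the group's bounding box expanded
# by thresholds t_w and t_h (values here are illustrative).

def connected(free_end, box, t_w=5.0, t_h=5.0):
    x, y = free_end
    min_x, min_y, max_x, max_y = box
    return (min_x - t_w < x < max_x + t_w) and (min_y - t_h < y < max_y + t_h)

# A free end 3 px left of a label's box still counts as connected:
# connected((7, 15), (10, 10, 30, 20)) -> True
```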
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Recognition</strong>: Hidden Markov Models (HMM) are used for recognizing individual text symbols.</li>
<li><strong>Grammar</strong>: Context-Free Grammar (CFG) designed with ambiguity elimination to ensure a single valid parse tree for any valid formula.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured by recognition accuracy at specific processing stages:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>F1 (Key Symbol Extraction)</td>
          <td>98.3%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>F2 (State-assisted Symbol)</td>
          <td>100%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M2 (Bond End Grouping)</td>
          <td>98.8%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M3 (Free End-Text Conn)</td>
          <td>100%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>T1 (Text Recognition)</td>
          <td>98.7%</td>
          <td>Top-3 Accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, X., Shi, G., &amp; Yang, J. (2009). The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1056&ndash;1060. <a href="https://doi.org/10.1109/ICDAR.2009.70">https://doi.org/10.1109/ICDAR.2009.70</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wangUnderstandingStructureAnalyzing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The {{Understanding}} and {{Structure Analyzing}} for {{Online Handwritten Chemical Formulas}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Wang, Xin and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1056--1060}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.70}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-4500-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>On-line Handwritten Chemical Expression Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</guid><description>Two-level algorithm for recognizing on-line handwritten chemical expressions using structural analysis, ANNs, and string edit distance.</description><content:encoded><![CDATA[<h2 id="a-methodological-approach-to-chemical-recognition">A Methodological Approach to Chemical Recognition</h2>
<p>This is a <strong>Method</strong> paper. It proposes a specific &ldquo;novel two-level algorithm&rdquo; and a &ldquo;System model&rdquo; for recognizing chemical expressions. The paper focuses on the architectural design of the recognition pipeline (segmentation, substance recognition, symbol recognition) and validates it against a &ldquo;conventional algorithm&rdquo; baseline, fitting the standard profile of a methodological contribution.</p>
<h2 id="bridging-the-gap-in-pen-based-chemical-input">Bridging the Gap in Pen-Based Chemical Input</h2>
<p>While pen-based computing has advanced for text and mathematical formulas, inputting chemical expressions remains &ldquo;time-consuming&rdquo;. Existing research often lacks &ldquo;adequate chemical knowledge&rdquo; or relies on algorithms that are too slow (global optimization) or structurally weak (local optimization). The authors aim to bridge this gap by integrating chemical domain knowledge into the recognition process to improve speed and accuracy.</p>
<h2 id="two-level-recognition-strategy-for-formulas">Two-Level Recognition Strategy for Formulas</h2>
<p>The core novelty is a <strong>two-level recognition strategy</strong>:</p>
<ol>
<li><strong>Level 1 (Substance Recognition)</strong>: Uses global structural information to identify entire &ldquo;substance units&rdquo; (e.g., $H_2SO_4$) by matching against a dictionary.</li>
<li><strong>Level 2 (Symbol Recognition)</strong>: If Level 1 fails, the system falls back to segmenting the substance into isolated characters and recognizing them individually.</li>
</ol>
<p>Additionally, the method integrates <strong>syntactic features</strong> (chemical knowledge) such as element conservation to validate and correct results and uses specific geometric features to distinguish superscript/subscript relationships.</p>
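<p>The element-conservation idea can be illustrated with a small balance check. This is our simplified sketch (no parentheses, charges, or hydrates), not the paper's implementation:</p>

```python
import re
from collections import Counter

# Sketch of element conservation as a syntactic validity feature: a
# recognition result is plausible only if each element's atom count
# balances across the reaction arrow.

def atom_counts(term):
    coeff_match = re.match(r"(\d*)(.*)", term)
    coeff = int(coeff_match.group(1) or 1)  # leading balancing number
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", coeff_match.group(2)):
        counts[elem] += coeff * int(num or 1)
    return counts

def balanced(reactants, products):
    left, right = Counter(), Counter()
    for t in reactants:
        left += atom_counts(t)
    for t in products:
        right += atom_counts(t)
    return left == right

# balanced(["2H2", "O2"], ["2H2O"]) -> True
```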
<h2 id="dataset-collection-and-baseline-comparisons">Dataset Collection and Baseline Comparisons</h2>
<ul>
<li><strong>Dataset Collection</strong>: The authors collected 1,197 handwritten expression samples from 20 chemistry professionals and students. This included 983 &ldquo;standard&rdquo; expressions (from 341 templates) and 214 &ldquo;arbitrary&rdquo; expressions written freely.</li>
<li><strong>Comparison</strong>: They compared their &ldquo;Two-level recognition&rdquo; approach against a &ldquo;conventional algorithm&rdquo; baseline that bypasses the first level and segments directly into characters.</li>
<li><strong>Metrics</strong>: They measured Material Accuracy (MA), the number of correctly recognized expressions (AEN), and Expression Accuracy (EA).</li>
</ul>
<h2 id="high-accuracy-in-formula-recognition">High Accuracy in Formula Recognition</h2>
<ul>
<li><strong>High Accuracy</strong>: The proposed algorithm achieved <strong>96.4% Material Accuracy (MA)</strong> and <strong>95.7% Expression Accuracy (EA)</strong> on the total test set.</li>
<li><strong>Robustness</strong>: The method performed well on both standard (96.3% EA) and arbitrary (92.5% EA) expressions.</li>
<li><strong>Validation</strong>: The authors conclude the algorithm is &ldquo;reliable,&rdquo; &ldquo;flexible,&rdquo; and suitable for real-time applications compared to prior work.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed two distinct datasets for training and evaluation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Symbol Training</strong></td>
          <td style="text-align: left">ISF Files</td>
          <td style="text-align: left">12,240 files</td>
          <td style="text-align: left">Used to train the ANN classifier. Covers 102 symbol classes (numerals, letters, operators, organic loops).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Testing</strong></td>
          <td style="text-align: left">Handwritten Expressions</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">983 standard + 214 arbitrary expressions collected from 20 chemistry teachers/students.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Structural Segmentation (Superscript/Subscript)</strong></p>
<p>To distinguish relationships (superscript, subscript, in-line), the authors define geometric parameters based on the bounding boxes of adjacent symbols ($x_{i1}, y_{i1}, x_{i2}, y_{i2}$):</p>
<p>$$d = 0.7 \times y_{12} - y_{22} + 0.3 \times y_{11}$$
$$T = 1000 \times d/h$$
$$B = 1000 \times (B_1 - B_2)/h_1$$</p>
<p>Where $B_1, B_2$ are the vertical barycenters of the two symbols and $h$, $h_1$ are symbol heights. $(T, B)$ serves as the feature vector for classifying the spatial relationship.</p>
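<p>A direct transcription of the $(T, B)$ formulas as printed above, purely for illustration (argument names are ours, and the formulas are taken verbatim from the text rather than re-derived):</p>

```python
# Direct transcription of the (T, B) feature formulas as printed above,
# for two adjacent symbols with bounding-box y-coordinates y11, y12, y22,
# barycenters b1, b2, and heights h, h1. Purely illustrative.

def relationship_features(y11, y12, y22, h, b1, b2, h1):
    d = 0.7 * y12 - y22 + 0.3 * y11
    t_feat = 1000 * d / h
    b_feat = 1000 * (b1 - b2) / h1
    return t_feat, b_feat  # fed to the classifier as the vector (T, B)
```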
<p><strong>2. Segmentation Reliability</strong></p>
<p>For segmenting strokes into units, the reliability of a segmentation path is calculated as:</p>
<p>$$Cof(K_{i},N)=\sum_{j=0}^{N}P(k_{j},k_{j+1})+P(S_{K_{i}})+\delta(N)$$</p>
<p>Where $P(k_j, k_{j+1})$ is the reliability of strokes being recognized as symbol $S_{k_j}$.</p>
<p><strong>3. Substance Matching (Level 1)</strong></p>
<p>A modified string edit distance is used to match handwritten input against a dictionary:</p>
<p>$$\lambda_{\overline{u}}=\mu_{i} \times f(Dis(i,j,r)/\sqrt{Max(Len_{i},Len_{j})})$$</p>
<p>Where $\mu_i$ is the recognizer credibility and $Dis(i,j,r)$ is the edit distance.</p>
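<p>The dictionary-matching step can be sketched as follows. The recognizer-credibility weight $\mu_i$ and the mapping $f$ are simplified away here; this shows only the length-normalized edit-distance scoring, with names of our choosing.</p>

```python
import math

# Sketch of level-1 substance matching: score each dictionary entry by an
# edit distance normalized by sqrt(max length), keeping the best match.

def edit_distance(a, b):
    # Standard Levenshtein distance via rolling rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def best_match(candidate, dictionary):
    def score(entry):
        return edit_distance(candidate, entry) / math.sqrt(
            max(len(candidate), len(entry)))
    return min(dictionary, key=score)

# best_match("H2SO4", ["H2SO4", "H2O", "NaOH"]) -> "H2SO4"
```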
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: An ANN-based classifier is used for isolated symbol recognition.</li>
<li><strong>Input Features</strong>: A set of ~30 features is extracted from strokes, including writing time, interval time, elastic mesh, and stroke outline.</li>
<li><strong>Performance</strong>: The classifier achieved 92.1% accuracy on a test set of 2,702 isolated symbols.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The system was evaluated on the 1,197 expression samples.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Total)</th>
          <th style="text-align: left">Value (Standard)</th>
          <th style="text-align: left">Value (Other)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Material Accuracy (MA)</strong></td>
          <td style="text-align: left">96.4%</td>
          <td style="text-align: left">97.7%</td>
          <td style="text-align: left">94%</td>
          <td style="text-align: left">Accuracy of substance recognition.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left">95.7%</td>
          <td style="text-align: left">96.3%</td>
          <td style="text-align: left">92.5%</td>
          <td style="text-align: left">Accuracy of full expression recognition.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, K., Geng, Q., &amp; Wang, Q. (2008). A Study of On-Line Handwritten Chemical Expressions Recognition. <em>2008 19th International Conference on Pattern Recognition</em>, 1&ndash;4. <a href="https://doi.org/10.1109/ICPR.2008.4761824">https://doi.org/10.1109/ICPR.2008.4761824</a></p>
<p><strong>Publication</strong>: ICPR 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{yangStudyOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Study of On-Line Handwritten Chemical Expressions Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 19th {{International Conference}} on {{Pattern Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yang, Jufeng and Shi, Guangshun and Wang, Kai and Geng, Qian and Wang, Qingren}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1--4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tampa, FL, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2008.4761824}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Img2Mol: Accurate SMILES Recognition from Depictions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/</guid><description>Two-stage CNN approach for converting molecular images to SMILES using CDDD embeddings and extensive data augmentation.</description><content:encoded><![CDATA[<h2 id="method-classification">Method Classification</h2>
<p>This is a <strong>method paper</strong> that introduces Img2Mol, a deep learning system for Optical Chemical Structure Recognition (OCSR). The work focuses on building a fast, accurate, and robust system for converting molecular structure depictions into machine-readable SMILES strings.</p>
<h2 id="systematization-and-motivation">Systematization and Motivation</h2>
<p>Vast amounts of chemical knowledge exist only as images in scientific literature and patents, making this data inaccessible for computational analysis, database searches, or machine learning pipelines. Manually extracting this information is slow and error-prone, creating a bottleneck for drug discovery and chemical research.</p>
<p>While rule-based OCSR systems like OSRA, MolVec, and Imago exist, they are brittle. Small variations in drawing style or image quality can cause them to fail. The authors argue that a deep learning approach, trained on diverse synthetic data, can generalize better across different depiction styles and handle the messiness of real-world images more reliably.</p>
<h2 id="two-stage-architecture-and-core-novelty">Two-Stage Architecture and Core Novelty</h2>
<p>The novelty lies in a two-stage architecture that separates perception from decoding, combined with aggressive data augmentation to ensure robustness. The key contributions are:</p>
<p><strong>1. Two-Stage Architecture with CDDD Embeddings</strong></p>
<p>Img2Mol uses an intermediate representation to predict SMILES from pixels. A <strong>custom CNN encoder</strong> maps the input image to a 512-dimensional <strong>Continuous and Data-Driven Molecular Descriptor (CDDD)</strong> embedding - a pre-trained, learned molecular representation that smoothly captures chemical similarity. A <strong>pre-trained decoder</strong> then converts this CDDD vector into the final canonical SMILES string.</p>
<p>This two-stage design has several advantages:</p>
<ul>
<li>The CDDD space is continuous and chemically meaningful, so nearby embeddings correspond to structurally similar molecules. This makes the regression task easier than learning discrete token sequences directly.</li>
<li>The decoder is pre-trained and fixed, so the CNN only needs to learn the image → CDDD mapping. This decouples the visual recognition problem from the sequence generation problem.</li>
<li>CDDD embeddings naturally enforce chemical validity constraints, reducing the risk of generating nonsensical structures.</li>
</ul>
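<p>At the interface level, the two-stage design separates cleanly into an encode step and a decode step. The sketch below stubs out both networks; <code>encode_image</code> and <code>decode_cddd</code> are hypothetical names standing in for the trained CNN and the frozen pre-trained decoder.</p>

```python
# Interface-level sketch of the two-stage design: a CNN encoder maps the
# image to a 512-d CDDD vector, and a frozen decoder maps that vector to
# a SMILES string. Both stages are stubs; names are hypothetical.

CDDD_DIM = 512

def encode_image(image):
    # Stand-in for the trained CNN: real code would run a forward pass.
    return [0.0] * CDDD_DIM

def decode_cddd(embedding):
    # Stand-in for the frozen, pre-trained CDDD decoder.
    assert len(embedding) == CDDD_DIM
    return "CCO"  # placeholder SMILES output

def img2mol(image):
    # Only encode_image needs training; the decoder stays fixed.
    return decode_cddd(encode_image(image))
```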
<p><strong>2. Extensive Data Augmentation for Robustness</strong></p>
<p>The model was trained on 11.1 million unique molecules from ChEMBL and PubChem, but the critical insight is how the training images were generated. To expose the CNN to maximum variation in depiction styles, the authors:</p>
<ul>
<li>Used <strong>three different cheminformatics libraries</strong> (RDKit, OEChem, Indigo) to render images, each with its own drawing conventions</li>
<li>Applied <strong>wide-ranging augmentations</strong>: varying bond thickness, font size, rotation, resolution (originally 192-256 px; expanded to 190-2500 px in the final model), and other stylistic parameters</li>
<li><strong>Over-sampled larger molecules</strong> to improve performance on complex structures, which are underrepresented in chemical databases</li>
</ul>
<p>This ensures the network rarely sees the same depiction of a molecule twice, forcing it to learn invariant features.</p>
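<p>The augmentation strategy can be sketched as per-image sampling of a rendering library and style parameters. The resolution range follows the text; the bond-width and font-size ranges are illustrative guesses, and all parameter names are ours.</p>

```python
import random

# Sketch of depiction augmentation: each training image samples a rendering
# library and style parameters so the network rarely sees the same
# depiction of a molecule twice.

def sample_depiction_params(rng):
    return {
        "library": rng.choice(["rdkit", "oechem", "indigo"]),
        "resolution": rng.randint(190, 2500),  # final model's stated range
        "rotation_deg": rng.uniform(-180, 180),
        "bond_width": rng.uniform(0.5, 3.0),   # illustrative range
        "font_size": rng.randint(8, 24),       # illustrative range
    }

params = sample_depiction_params(random.Random(0))
```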
<p><strong>3. Fast Inference</strong></p>
<p>Because the architecture is a simple CNN followed by a fixed decoder, inference is very fast - especially compared to rule-based systems that rely on iterative graph construction algorithms. This makes Img2Mol practical for large-scale document mining.</p>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating that Img2Mol is more accurate, robust, and generalizable than existing rule-based systems:</p>
<ol>
<li>
<p><strong>Benchmark Comparisons</strong>: Img2Mol was tested on several standard OCSR benchmarks, including USPTO (patent images), University of Birmingham (UoB), CLEF, and JPO (Japanese Patent Office) datasets, against three open-source baselines: <strong>OSRA, MolVec, and Imago</strong>. No deep learning baselines were available at the time for comparison.</p>
</li>
<li>
<p><strong>Resolution and Molecular Size Analysis</strong>: The initial model, <code>Img2Mol(no aug.)</code>, was evaluated across different image resolutions and molecule sizes (measured by number of atoms) to understand failure modes. This revealed that:</p>
<ul>
<li>Performance degraded for molecules with &gt;35 atoms</li>
<li>Very high-resolution images lost detail when downscaled to the fixed input size</li>
<li>Low-resolution images (where rule-based methods failed completely) were handled well</li>
</ul>
</li>
<li>
<p><strong>Data Augmentation Ablation</strong>: A final model, <strong>Img2Mol</strong>, was trained with the full augmentation pipeline (wider resolution range, over-sampling of large molecules). Performance was compared to the initial version to quantify the effect of augmentation.</p>
</li>
<li>
<p><strong>Depiction Library Robustness</strong>: The model was tested on images generated by each of the three rendering libraries separately to confirm that training on diverse styles improved generalization.</p>
</li>
<li>
<p><strong>Input Perturbation for Benchmark Fairness</strong>: For the smaller benchmark datasets (USPTO, UoB, CLEF, JPO), the authors applied slight random rotation (within +/-5 degrees) and shearing to each image five times to detect potential overfitting of rule-based methods to well-known benchmarks.</p>
</li>
<li>
<p><strong>Generalization Tests</strong>: Img2Mol was evaluated on real-world patent images from the <strong>STAKER</strong> dataset, which were not synthetically generated. This tested whether the model could transfer from synthetic training data to real documents.</p>
</li>
<li>
<p><strong>Hand-Drawn Molecule Recognition</strong>: As an exploratory test, the authors evaluated performance on hand-drawn molecular structures, a task the model was never trained for, to see if the learned features could generalize to completely different visual styles.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured and compared to rule-based baselines to demonstrate the practical efficiency of the approach.</p>
</li>
</ol>
<h2 id="results-conclusions-and-limitations">Results, Conclusions, and Limitations</h2>
<p>Key benchmark results from Table 1 of the paper (accuracy / Tanimoto similarity, in %):</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Img2Mol</th>
          <th>MolVec 0.9.8</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Img2Mol test set</td>
          <td>88.25 / 95.27</td>
          <td>2.59 / 13.03</td>
          <td>0.02 / 4.74</td>
          <td>2.59 / 13.03</td>
      </tr>
      <tr>
          <td>STAKER</td>
          <td>64.33 / 83.76</td>
          <td>5.32 / 31.78</td>
          <td>0.07 / 5.06</td>
          <td>5.23 / 26.98</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>42.29 / 73.07</td>
          <td>30.68 / 65.50</td>
          <td>5.07 / 7.28</td>
          <td>6.37 / 44.21</td>
      </tr>
      <tr>
          <td>UoB</td>
          <td>78.18 / 88.51</td>
          <td>75.01 / 86.88</td>
          <td>5.12 / 7.19</td>
          <td>70.89 / 85.27</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>48.84 / 78.04</td>
          <td>44.48 / 76.61</td>
          <td>26.72 / 41.29</td>
          <td>17.04 / 58.84</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>45.14 / 69.43</td>
          <td>49.48 / 66.46</td>
          <td>23.18 / 37.47</td>
          <td>33.04 / 49.62</td>
      </tr>
  </tbody>
</table>
<p>Per-library accuracy on a 5,000-compound subset (depicted five times each):</p>
<table>
  <thead>
      <tr>
          <th>Library</th>
          <th>Img2Mol</th>
          <th>MolVec</th>
          <th>Imago</th>
          <th>OSRA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RDKit</td>
          <td>93.4%</td>
          <td>3.7%</td>
          <td>0.3%</td>
          <td>4.4%</td>
      </tr>
      <tr>
          <td>OEChem</td>
          <td>89.5%</td>
          <td>33.4%</td>
          <td>12.3%</td>
          <td>26.3%</td>
      </tr>
      <tr>
          <td>Indigo</td>
          <td>79.0%</td>
          <td>22.2%</td>
          <td>4.2%</td>
          <td>22.6%</td>
      </tr>
  </tbody>
</table>
<ul>
<li>
<p><strong>Substantial Performance Gains</strong>: Img2Mol outperformed all three rule-based baselines on nearly every benchmark; the one exception was JPO, where MolVec scored higher (49.48% vs. 45.14% accuracy). Accuracy was measured both as exact SMILES match and as <strong>Tanimoto similarity</strong> (using ECFP6 1024-bit fingerprints). Even when Img2Mol did not predict the exact molecule, it often predicted a chemically similar one.</p>
</li>
<li>
<p><strong>Robustness Across Conditions</strong>: The full Img2Mol model (with aggressive augmentation) showed consistent performance across all image resolutions and molecule sizes. In contrast, rule-based systems were &ldquo;brittle&rdquo; - performance dropped sharply with minor perturbations to image quality or style.</p>
</li>
<li>
<p><strong>Depiction Library Invariance</strong>: Img2Mol&rsquo;s performance was stable across all three rendering libraries (RDKit, OEChem, Indigo), validating the multi-library training strategy. Rule-based methods struggled particularly with RDKit-generated images.</p>
</li>
<li>
<p><strong>Strong Generalization to Real-World Data</strong>: Despite being trained exclusively on synthetic images, Img2Mol performed well on real patent images from the STAKER dataset. This suggests the augmentation strategy successfully captured the diversity of real-world depictions.</p>
</li>
<li>
<p><strong>Overfitting in Baselines</strong>: Rule-based methods performed surprisingly well on older benchmarks (USPTO, UoB, CLEF) but failed on newer datasets (Img2Mol&rsquo;s test set, STAKER). This suggests they may be implicitly tuned to specific drawing conventions in legacy datasets.</p>
</li>
<li>
<p><strong>Limited Hand-Drawn Recognition</strong>: Img2Mol could recognize simple hand-drawn structures but struggled with complex or large molecules. This is unsurprising given the lack of hand-drawn data in training, but it highlights a potential avenue for future work.</p>
</li>
<li>
<p><strong>Speed Advantage</strong>: Img2Mol processed 5,000 images in approximately 4 minutes at the smallest input size, with compute time mostly independent of input resolution due to the fixed 224x224 rescaling. Rule-based methods showed sharply increasing compute times at higher resolutions.</p>
</li>
</ul>
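<p>The Tanimoto metric used above operates on fingerprint bit vectors, not raw SMILES. As a minimal illustration (the paper computes ECFP6 1024-bit fingerprints with a cheminformatics toolkit; here a fingerprint is simplified to a plain set of on-bit indices):</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each represented as the set of indices of its on bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Identical fingerprints score 1.0; half-overlapping ones score 0.5.
print(tanimoto({1, 5, 9}, {1, 5, 9}))  # 1.0
print(tanimoto({1, 5, 9}, {2, 5, 9}))  # 0.5
```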
<p>The work establishes that deep learning can outperform traditional rule-based OCSR systems when combined with a principled two-stage architecture and comprehensive data augmentation. The CDDD embedding acts as a bridge between visual perception and chemical structure, providing a chemically meaningful intermediate representation that improves both accuracy and robustness. The focus on synthetic data diversity proves to be an effective strategy for generalizing to real-world documents.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Custom 8-layer Convolutional Neural Network (CNN) encoder</p>
<ul>
<li><strong>Input</strong>: $224 \times 224$ pixel grayscale images</li>
<li><strong>Backbone Structure</strong>: 8 convolutional layers organized into 3 stacks, followed by 3 fully connected layers
<ul>
<li><strong>Stack 1</strong>: 3 Conv layers ($7 \times 7$ filters, stride 3, padding 4) + Max Pooling</li>
<li><strong>Stack 2</strong>: 2 Conv layers + Max Pooling</li>
<li><strong>Stack 3</strong>: 3 Conv layers + Max Pooling</li>
<li><strong>Head</strong>: 3 fully connected layers</li>
</ul>
</li>
<li><strong>Output</strong>: 512-dimensional CDDD embedding vector</li>
</ul>
<p><strong>Decoder</strong>: Pre-trained CDDD decoder (from Winter et al.) - fixed during training, not updated</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Loss Function</strong>: Mean Squared Error (MSE) regression minimizing the distance between the predicted and true embeddings:</p>
<p>$$
\mathcal{L} = \frac{1}{512} \sum_{i=1}^{512} \left( \text{cddd}_{\text{true},\,i} - \text{cddd}_{\text{predicted},\,i} \right)^2
$$</p>
<p><strong>Optimizer</strong>: AdamW with initial learning rate $10^{-4}$</p>
<p><strong>Training Schedule</strong>:</p>
<ul>
<li>Batch size: 256</li>
<li>Training duration: 300 epochs</li>
<li>Plateau scheduler: Multiplies learning rate by 0.7 if validation loss plateaus for 10 epochs</li>
<li>Early stopping: Triggered if no improvement in validation loss for 50 epochs</li>
</ul>
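<p>The schedule above can be sketched framework-free. The factor (0.7), LR patience (10 epochs), and stopping patience (50 epochs) come from the paper; the exact bookkeeping (e.g., whether the two patience counters are shared, as assumed here) is not specified:</p>

```python
class PlateauSchedule:
    """Sketch of the reported training schedule: multiply LR by 0.7 after
    10 epochs without validation-loss improvement, stop after 50."""

    def __init__(self, lr=1e-4, factor=0.7, lr_patience=10, stop_patience=50):
        self.lr, self.factor = lr, factor
        self.lr_patience, self.stop_patience = lr_patience, stop_patience
        self.best = float("inf")
        self.since_best = 0  # epochs since the best validation loss

    def step(self, val_loss):
        """Update after one epoch; returns False once training should stop."""
        if val_loss < self.best:
            self.best, self.since_best = val_loss, 0
        else:
            self.since_best += 1
            if self.since_best % self.lr_patience == 0:
                self.lr *= self.factor  # decay on every full patience window
        return self.since_best < self.stop_patience
```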
<p><strong>Noise Tolerance</strong>: The decoder requires the CNN to predict embeddings with noise level $\sigma \le 0.15$ to achieve &gt;90% accuracy</p>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>: 11.1 million unique molecules from ChEMBL and PubChem</p>
<p><strong>Splits</strong>: Approximately 50,000 examples each for validation and test sets</p>
<p><strong>Synthetic Image Generation</strong>:</p>
<ul>
<li>Three cheminformatics libraries: RDKit, OEChem, and Indigo</li>
<li>Augmentations: Resolution (190-2500 pixels), rotation, bond thickness, font size</li>
<li>Salt stripping: Keep only the largest fragment</li>
<li>Over-sampling: Larger molecules (&gt;35 atoms) over-sampled to improve performance</li>
</ul>
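<p>The augmentation strategy amounts to sampling per-image rendering parameters. In the sketch below, the library choice and the 190&ndash;2500&nbsp;px resolution range are from the paper; the rotation, bond-thickness, and font-size ranges are illustrative placeholders, not values reported by the authors:</p>

```python
import random

def sample_depiction_params(rng=None):
    """Sample rendering parameters for one synthetic training depiction."""
    rng = rng or random.Random()
    return {
        "library": rng.choice(["RDKit", "OEChem", "Indigo"]),
        "resolution_px": rng.randint(190, 2500),   # range from the paper
        "rotation_deg": rng.uniform(-15.0, 15.0),  # placeholder range
        "bond_thickness": rng.uniform(0.5, 2.0),   # placeholder range
        "font_size": rng.randint(10, 24),          # placeholder range
    }
```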
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li>Exact SMILES match accuracy</li>
<li>Tanimoto similarity (chemical fingerprint-based structural similarity)</li>
</ul>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li>Img2Mol test set (25,000 synthetic images at 224x224 px)</li>
<li>STAKER (30,000 real-world USPTO patent images at 256x256 px)</li>
<li>USPTO (4,852 patent images, avg. 649x417 px)</li>
<li>UoB (5,716 images from University of Birmingham, avg. 762x412 px)</li>
<li>CLEF (711 images, avg. 1243x392 px)</li>
<li>JPO (365 Japanese Patent Office images, avg. 607x373 px)</li>
<li>Hand-drawn molecular structures (exploratory, no defined benchmark)</li>
</ul>
<p><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based systems)</p>
<h3 id="hardware">Hardware</h3>
<p>⚠️ <strong>Unspecified in paper or supplementary materials.</strong> Inference speed reported as ~4 minutes for 5000 images; training hardware (GPU model, count) is undocumented.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">Img2Mol GitHub</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">Img2Mol model weights</a></td>
          <td>Model</td>
          <td>CC BY-NC 4.0</td>
          <td>Non-commercial use only</td>
      </tr>
  </tbody>
</table>
<h3 id="known-limitations">Known Limitations</h3>
<p><strong>Molecular Size</strong>: Performance degrades for molecules with &gt;35 atoms. This is partly a property of the CDDD latent space itself: for larger molecules, the &ldquo;volume of decodable latent space&rdquo; shrinks, making the decoder more sensitive to small noise perturbations in the predicted embedding.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clevert, D.-A., Le, T., Winter, R., &amp; Montanari, F. (2021). Img2Mol &ndash; accurate SMILES recognition from molecular graphical depictions. <em>Chemical Science</em>, 12(42), 14174&ndash;14181. <a href="https://doi.org/10.1039/d1sc01839f">https://doi.org/10.1039/d1sc01839f</a></p>
<p><strong>Publication</strong>: Chemical Science (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">GitHub Repository</a></li>
<li><a href="https://doi.org/10.1039/d1sc01839f">Paper on Royal Society of Chemistry</a></li>
</ul>
]]></content:encoded></item><item><title>HMM-based Online Recognition of Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</guid><description>Online recognition of handwritten chemical symbols using Hidden Markov Models with 11-dimensional local features, achieving 89.5% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that proposes a specific algorithmic pipeline for the online recognition of handwritten chemical symbols. The core contribution is the engineering of an 11-dimensional feature vector combined with a Hidden Markov Model (HMM) architecture. The paper validates this method through quantitative experiments on a custom dataset, focusing on recognition accuracy as the primary metric.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Recognizing chemical symbols is uniquely challenging due to the complex structure of chemical expressions and the nature of pen-based input, which often results in broken or conglutinate strokes. Additionally, variations in writing style and random noise make the task difficult. While online recognition for Western characters and CJK scripts is well-developed, works specifically targeting online chemical symbol recognition are scarce, with most prior research focusing on offline recognition or global optimization.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The primary novelty is the application of continuous HMMs specifically to the domain of <strong>online</strong> chemical symbol recognition, utilizing a specialized set of <strong>11-dimensional local features</strong>. While HMMs have been used for other scripts, this paper tailors the feature extraction (including curliness, linearity, and writing direction) to capture the specific geometric properties of chemical symbols.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors constructed a specific dataset for this task involving 20 participants (college teachers and students).</p>
<ul>
<li><strong>Dataset</strong>: 64 distinct symbols (digits, English letters, Greek letters, operators)</li>
<li><strong>Volume</strong>: 7,808 total samples (122 per symbol), split into 5,670 training samples and 2,016 testing samples</li>
<li><strong>Model Sweeps</strong>: They evaluated the HMM performance by varying the number of states (4, 6, 8) and the number of Gaussians per state (3, 4, 6, 9, 12)</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Performance</strong>: The best configuration (6 states, 9 Gaussians) achieved a <strong>top-1 accuracy of 89.5%</strong> and a <strong>top-3 accuracy of 98.7%</strong></li>
<li><strong>Scaling</strong>: Results showed that generally, increasing the number of states and Gaussians improved accuracy, though at the cost of computational efficiency</li>
<li><strong>Error Analysis</strong>: The primary sources of error were shape similarities between specific characters (e.g., &lsquo;0&rsquo; vs &lsquo;O&rsquo; vs &lsquo;o&rsquo;, and &lsquo;C&rsquo; vs &lsquo;c&rsquo; vs &lsquo;(&rsquo;)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed / Very Low Reproducibility. This 2009 study relies on a private, custom-collected dataset and does not provide source code, model weights, or an open-access preprint.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None publicly available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No open source code, open datasets, or open-access preprints were released with this publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilized a custom dataset collected in a laboratory environment.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">5,670 samples</td>
          <td style="text-align: left">90 samples per symbol</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">2,016 samples</td>
          <td style="text-align: left">32 samples per symbol</td>
      </tr>
  </tbody>
</table>
<p><strong>Dataset Composition</strong>: The set includes <strong>64 symbols</strong>: Digits (0-9), Uppercase (A-Z, missing Q), Lowercase (a-z, selected), Greek letters ($\alpha$, $\beta$, $\gamma$, $\pi$), and operators ($+$, $=$, $\rightarrow$, $\uparrow$, $\downarrow$, $($ , $)$).</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Preprocessing</strong></p>
<p>The raw tablet data undergoes a 6-step pipeline:</p>
<ol>
<li><strong>Duplicate Point Elimination</strong>: Removing sequential points with identical coordinates</li>
<li><strong>Broken Stroke Connection</strong>: Using Bezier curves to interpolate missing points/connect broken strokes</li>
<li><strong>Hook Elimination</strong>: Removing artifacts at the start/end of strokes characterized by short length and sharp angle changes</li>
<li><strong>Smoothing</strong>: Reducing noise from erratic pen movement</li>
<li><strong>Re-sampling</strong>: Spacing points equidistantly to remove temporal variation</li>
<li><strong>Size Normalization</strong>: Removing variation in writing scale</li>
</ol>
<p><strong>2. Feature Extraction (11 Dimensions)</strong></p>
<p>Features are extracted from a 5-point window centered on $t$ ($t-2$ to $t+2$). The 11 dimensions are:</p>
<ol>
<li><strong>Normalized Vertical Position</strong>: $y(t)$ mapped to $[0,1]$</li>
<li><strong>Normalized First Derivative ($x'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized First Derivative ($y'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized Second Derivative ($x''$)</strong>: Computed using $x'$ values</li>
<li><strong>Normalized Second Derivative ($y''$)</strong>: Computed using $y'$ values</li>
<li><strong>Curvature</strong>: $\frac{x'y'' - x''y'}{(x'^2 + y'^2)^{3/2}}$</li>
<li><strong>Writing Direction (Cos)</strong>: $\cos \alpha(t)$ based on vector from $t-1$ to $t+1$</li>
<li><strong>Writing Direction (Sin)</strong>: $\sin \alpha(t)$</li>
<li><strong>Aspect Ratio</strong>: Ratio of height to width in the 5-point window</li>
<li><strong>Curliness</strong>: Deviation from the straight line connecting the first and last point of the window</li>
<li><strong>Linearity</strong>: Average squared distance of points in the window to the straight line connecting start/end points</li>
</ol>
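<p>A few of these window features can be computed directly from the five points. The sketch below covers the direction, aspect-ratio, and linearity dimensions; the exact normalization constants are assumptions where the paper's formulas are not reproduced here:</p>

```python
import math

def window_features(pts):
    """Direction, aspect ratio, and linearity for a 5-point window
    [p(t-2), ..., p(t+2)] -- a sketch of three of the 11 dimensions."""
    assert len(pts) == 5
    # Writing direction: angle of the vector from p(t-1) to p(t+1).
    (x1, y1), (x3, y3) = pts[1], pts[3]
    dx, dy = x3 - x1, y3 - y1
    norm = math.hypot(dx, dy) or 1.0
    cos_a, sin_a = dx / norm, dy / norm
    # Aspect ratio: bounding-box height over width.
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    width = (max(xs) - min(xs)) or 1.0
    aspect = (max(ys) - min(ys)) / width
    # Linearity: mean squared distance to the chord from first to last point.
    (ax, ay), (bx, by) = pts[0], pts[-1]
    chord = math.hypot(bx - ax, by - ay) or 1.0
    lin = sum(((bx - ax) * (py - ay) - (by - ay) * (px - ax)) ** 2
              for px, py in pts) / (chord ** 2 * len(pts))
    return {"cos": cos_a, "sin": sin_a, "aspect": aspect, "linearity": lin}
```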
<p><strong>3. Feature Normalization</strong></p>
<p>The final feature matrix $V$ is normalized to zero mean and unit standard deviation using the covariance matrix: $o_t = \Sigma^{-1/2}(v_t - \mu)$.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Continuous Hidden Markov Models (HMM)</li>
<li><strong>Topology</strong>: Left-to-right (Bakis model)</li>
<li><strong>Initialization</strong>: Initial distribution $\pi = \{1, 0, \ldots, 0\}$; uniform transition matrix $A$; segmental k-means for observation matrix $B$</li>
<li><strong>Training</strong>: Baum-Welch re-estimation</li>
<li><strong>Decision</strong>: Maximum likelihood classification ($\hat{\lambda} = \arg \max P(O|\lambda)$)</li>
</ul>
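<p>The left-to-right (Bakis) topology with a uniform initial transition matrix can be sketched as follows; restricting each state to self-loops and single-step advances (<code>max_jump=1</code>) is an assumption, since the paper does not state the allowed jump width:</p>

```python
def bakis_transitions(n_states, max_jump=1):
    """Left-to-right (Bakis) transition matrix: state i may stay or advance
    by up to max_jump states, uniform over the allowed moves."""
    A = []
    for i in range(n_states):
        allowed = range(i, min(i + max_jump, n_states - 1) + 1)
        A.append([1.0 / len(allowed) if j in allowed else 0.0
                  for j in range(n_states)])
    return A
```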
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Best Value</th>
          <th style="text-align: left">Configuration</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Accuracy</strong></td>
          <td style="text-align: left"><strong>89.5%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Highest reported accuracy</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-3 Accuracy</strong></td>
          <td style="text-align: left"><strong>98.7%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Top-3 candidate accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Yang, J. (2009). HMM-Based Online Recognition of Handwritten Chemical Symbols. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1255&ndash;1259. <a href="https://doi.org/10.1109/ICDAR.2009.99">https://doi.org/10.1109/ICDAR.2009.99</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2009hmm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{HMM-Based Online Recognition of Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th International Conference on Document Analysis and Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{75}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1255--1259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.99}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Symbol Recognition Using SVMs</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</guid><description>A hybrid SVM and elastic matching approach for recognizing handwritten chemical symbols drawn on touch devices, achieving 89.7% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-taxonomy">Paper Contribution and Taxonomy</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI for Physical Sciences taxonomy</a>.</p>
<ul>
<li><strong>Dominant Basis</strong>: The authors propose a novel hybrid architecture (SVM-EM) that combines two existing techniques to solve a specific recognition problem.</li>
<li><strong>Rhetorical Indicators</strong>: The paper explicitly defines algorithms (Algorithm 1 &amp; 2), presents a system architecture, and validates the method via ablation studies comparing the hybrid approach against its individual components.</li>
</ul>
<h2 id="motivation-for-pen-based-input">Motivation for Pen-Based Input</h2>
<p>Entering chemical expressions on digital devices is difficult due to their complex 2D spatial structure.</p>
<ul>
<li><strong>The Problem</strong>: While handwriting recognition for text and math is mature, chemical structures involve unique symbols and spatial arrangements that existing tools struggle to process efficiently.</li>
<li><strong>Existing Solutions</strong>: Standard tools (like ChemDraw) rely on point-and-click interactions, which are described as complicated and non-intuitive compared to direct handwriting.</li>
<li><strong>Goal</strong>: To enable fluid handwriting input on pen/touch-based devices (like iPads) by accurately recognizing individual chemical symbols in real-time.</li>
</ul>
<h2 id="novelty-hybrid-svm-and-elastic-matching">Novelty: Hybrid SVM and Elastic Matching</h2>
<p>The core contribution is the <strong>Hybrid SVM-EM</strong> approach, which splits recognition into a coarse classification stage and a fine-grained verification stage.</p>
<ul>
<li><strong>Two-Stage Pipeline</strong>:
<ol>
<li><strong>SVM Recognition</strong>: Uses statistical features (stroke count, turning angles) to generate a short-list of candidate symbols.</li>
<li><strong>Elastic Matching (EM)</strong>: Uses a geometric point-to-point distance metric to re-rank these candidates against a library of stored symbol prototypes.</li>
</ol>
</li>
<li><strong>Online Stroke Partitioning</strong>: A heuristic-based method to group strokes into symbols in real-time based on time adjacency (grouping the last $n$ strokes) and spatial intersection checks, without waiting for the user to finish the entire drawing.</li>
</ul>
<h2 id="experimental-design-and-data-collection">Experimental Design and Data Collection</h2>
<p>The authors conducted a user study to collect data and evaluate the system:</p>
<ul>
<li><strong>Participants</strong>: 10 users were recruited to write chemical symbols on an iPad.</li>
<li><strong>Task</strong>: Each user wrote 78 distinct chemical symbols (digits, alphabets, bonds) 3 times each.</li>
<li><strong>Baselines</strong>: The hybrid method was compared against two baselines:
<ol>
<li>SVM only</li>
<li>Elastic Matching only.</li>
</ol>
</li>
<li><strong>Metrics</strong>: Evaluation focused on <strong>Precision@k</strong> (where $k=1, 3, 5$), measuring how often the correct symbol appeared in the top-$k$ suggestions.</li>
</ul>
<h2 id="recognition-performance-and-outcomes">Recognition Performance and Outcomes</h2>
<p>The hybrid approach demonstrated improved performance compared to using either technique in isolation.</p>
<ul>
<li><strong>Key Results</strong>:
<ul>
<li><strong>Hybrid SVM-EM</strong>: 89.7% Precision@1 (Top-1 accuracy).</li>
<li><strong>SVM Only</strong>: 85.1% Precision@1.</li>
<li><strong>EM Only</strong>: 76.7% Precision@1.</li>
</ul>
</li>
<li><strong>Category Performance</strong>: The system performed best on Operators (91.9%) and Digits (91.3%), with slightly lower performance on Alphabetic characters (88.6%).</li>
<li><strong>Impact</strong>: The system was successfully implemented as a real-time iOS application, allowing users to draw complex structures that are then converted to SMILES strings such as <code>C#CC(O)</code>.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study generated a custom dataset for training and evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Stats</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>2,340 samples</td>
          <td>Collected from 10 users. Consists of <strong>78 unique symbols</strong>: 10 digits (0-9), 52 letters (A-Z, a-z), and 16 bonds/operators (e.g., $=$, $+$, hash bonds).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Unspecified size</td>
          <td>A &ldquo;Chemical Elastic Symbol Library&rdquo; was created containing samples of all supported symbols to serve as prototypes for the Elastic Matching step.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct algorithmic steps:</p>
<p><strong>1. Stroke Partitioning</strong></p>
<ul>
<li><strong>Logic</strong>: Groups the most recently written stroke with up to the last 4 previous strokes.</li>
<li><strong>Filtering</strong>: Invalid groups are removed using &ldquo;Spatial Distance Checking&rdquo; (strokes too far apart) and &ldquo;Stroke Intersection Checking&rdquo; (strokes that don&rsquo;t intersect where expected).</li>
</ul>
<p><strong>2. Preprocessing</strong></p>
<ul>
<li><strong>Size Normalization</strong>: Scales symbol to a standard size based on its bounding box.</li>
<li><strong>Smoothing</strong>: Uses average smoothing (replacing mid-points with the average of neighbors) to remove jitter.</li>
<li><strong>Sampling</strong>: Resamples valid strokes to a fixed number of <strong>50 points</strong>.</li>
</ul>
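<p>The resampling step above can be sketched as equidistant arc-length resampling. The target of 50 points is from the paper; the linear-interpolation algorithm itself is an assumption, since the authors do not spell it out:</p>

```python
import math

def resample_stroke(points, n=50):
    """Resample a stroke to n points spaced equally by arc length."""
    if len(points) < 2:
        return list(points) * n if points else []
    # Cumulative arc length at each input point.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment containing the target arc length.
        while j < len(cum) - 2 and cum[j + 1] < target:
            j += 1
        seg = (cum[j + 1] - cum[j]) or 1.0
        t = (target - cum[j]) / seg
        (x0, y0), (x1, y1) = points[j], points[j + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out
```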
<p><strong>3. SVM Feature Extraction</strong></p>
<ul>
<li><strong>Horizontal Angle</strong>: Calculated between two consecutive points ($P_1, P_2$). Values are binned into 12 groups ($30^{\circ}$ each).</li>
<li><strong>Turning Angle</strong>: The difference between two consecutive horizontal angles. Values are binned into 18 groups ($10^{\circ}$ each).</li>
<li><strong>Features</strong>: Input vector consists of stroke count, normalized coordinates, and the percentage of angles falling into the histograms described above.</li>
</ul>
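<p>The angle binning can be sketched directly. The bin widths (12 &times; 30&deg; and 18 &times; 10&deg;) are from the paper; folding the turning angle into [0&deg;, 180&deg;) is an assumption about how direction is handled:</p>

```python
import math

def horizontal_angle_bin(p1, p2, n_bins=12):
    """Bin the angle of segment p1 -> p2 into one of 12 groups of 30 degrees."""
    ang = math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0])) % 360.0
    return int(ang // (360.0 / n_bins))

def turning_angle_bin(a1_deg, a2_deg, n_bins=18):
    """Bin the turning angle between two consecutive segment angles into
    18 groups of 10 degrees, folding the difference into [0, 180)."""
    diff = abs(a1_deg - a2_deg) % 360.0
    if diff >= 180.0:
        diff = 360.0 - diff
    return min(int(diff // (180.0 / n_bins)), n_bins - 1)
```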
<p><strong>4. Elastic Matching (Verification)</strong></p>
<ul>
<li><strong>Distance Function</strong>: Euclidean distance summation between the points of the candidate symbol ($s$) and the partitioned input ($s_p$).
$$
\begin{aligned}
D(s, s_p) = \sum_{j=1}^{n} \sqrt{(x_{s,j} - x_{p,j})^2 + (y_{s,j} - y_{p,j})^2}
\end{aligned}
$$
<em>Note: The paper formula sums the distances; $n$ is the number of points (50).</em></li>
<li><strong>Ranking</strong>: Candidates are re-ranked in ascending order of this elastic distance.</li>
</ul>
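<p>The elastic distance and re-ranking step can be sketched directly from the formula above (the candidate-library representation as a name-to-prototype dict is an illustrative assumption):</p>

```python
import math

def elastic_distance(s, s_p):
    """D(s, s_p): sum of point-to-point Euclidean distances between two
    strokes resampled to the same number of points (50 in the paper)."""
    assert len(s) == len(s_p), "both strokes must be resampled to n points"
    return sum(math.hypot(xa - xb, ya - yb)
               for (xa, ya), (xb, yb) in zip(s, s_p))

def rank_candidates(candidates, s_p):
    """Re-rank SVM candidates (name -> prototype stroke) by ascending
    elastic distance to the partitioned input stroke s_p."""
    return sorted(candidates,
                  key=lambda name: elastic_distance(candidates[name], s_p))
```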
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: Linear Support Vector Machine (SVM) implemented using <strong>LibSVM</strong>.</li>
<li><strong>Symbol Library</strong>: A &ldquo;Chemical Elastic Symbol Library&rdquo; stores the raw stroke point sequences for all 78 supported symbols to enable the elastic matching comparison.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using precision at different ranks (Top-N accuracy).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision@1</strong></td>
          <td><strong>89.7%</strong></td>
          <td>85.1% (SVM)</td>
          <td>Hybrid model reduces error rate significantly over baselines.</td>
      </tr>
      <tr>
          <td><strong>Precision@3</strong></td>
          <td><strong>94.1%</strong></td>
          <td>N/A</td>
          <td>High recall in top 3 allows users to quickly correct errors via UI selection.</td>
      </tr>
      <tr>
          <td><strong>Precision@5</strong></td>
          <td><strong>94.6%</strong></td>
          <td>N/A</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: Apple iPad (iOS platform).</li>
<li><strong>Input</strong>: Touch/Pen-based input recording digital ink (x, y coordinates and pen-up/down events).</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, P., Hui, S. C., &amp; Fu, C. W. (2013). Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition. <em>2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)</em>, 535&ndash;540. <a href="https://doi.org/10.1109/ICIS.2013.6607894">https://doi.org/10.1109/ICIS.2013.6607894</a></p>
<p><strong>Publication</strong>: IEEE ICIS 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{tangOnlineChemicalSymbol2013,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Tang, Peng and Hui, Siu Cheung and Fu, Chi-Wing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2013</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{535--540}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICIS.2013.6607894}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Ring Recognition with Neural Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</guid><description>A two-phase Classifier-Recognizer neural network pipeline for recognizing 23 types of handwritten heterocyclic chemical rings, achieving ~94% accuracy.</description><content:encoded><![CDATA[<h2 id="contribution-recognition-architecture-for-heterocyclic-rings">Contribution: Recognition Architecture for Heterocyclic Rings</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a specific algorithmic architecture (the &ldquo;Classifier-Recognizer Approach&rdquo;) to solve a pattern recognition problem. The rhetorical structure centers on defining three variations of a method, performing ablation-like comparisons between them (Whole Image vs. Lower Part), and demonstrating superior performance metrics (~94% accuracy) for the proposed technique.</p>
<h2 id="motivation-enabling-sketch-based-chemical-search">Motivation: Enabling Sketch-Based Chemical Search</h2>
<p>The authors identify a gap in existing OCR and handwriting recognition research, which typically focuses on alphanumeric characters or whole words.</p>
<ul>
<li><strong>Missing Capability</strong>: Recognition of specific <em>heterocyclic chemical rings</em> (23 types) had not been performed previously.</li>
<li><strong>Practical Utility</strong>: Existing chemical search engines require text-based queries (names); this work enables &ldquo;backward&rdquo; search where a user can draw a ring to find its information.</li>
<li><strong>Educational/Professional Aid</strong>: Useful for chemistry departments and mobile applications where chemists can sketch formulas on screens.</li>
</ul>
<h2 id="innovation-the-classifier-recognizer-pipeline">Innovation: The Classifier-Recognizer Pipeline</h2>
<p>The core novelty is the <strong>two-phase &ldquo;Classifier-Recognizer&rdquo; architecture</strong> designed to handle the visual similarity of heterocyclic rings:</p>
<ol>
<li><strong>Phase 1 (Classifier)</strong>: A neural network classifies the ring into one of four broad categories (S, N, O, Others) based solely on the <em>upper part</em> of the image (40x15 pixels).</li>
<li><strong>Phase 2 (Recognizer)</strong>: A class-specific neural network identifies the exact ring.</li>
<li><strong>Optimization</strong>: The most successful variation (&ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo;) uses only the <em>lower part</em> of the image and <em>odd rows</em> (half-grid) to reduce input dimensionality and computation time while improving accuracy. This effectively subsamples the input grid matrix $M \in \mathbb{R}^{H \times W}$ to a reduced matrix $M_{\text{sub}}$:
$$ M_{\text{sub}} = \{ m_{i,j} \in M \mid i \text{ is odd} \} $$</li>
</ol>
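<p>The half-grid subsampling above amounts to a simple strided row slice. A minimal sketch (the paper gives no code, and which row parity counts as &ldquo;odd&rdquo; is an assumption):</p>

```python
import numpy as np

def half_grid(image: np.ndarray) -> np.ndarray:
    """Keep every other row of the binary grid (the paper's 'half size
    grid'), halving the input dimensionality. Which parity is kept is an
    assumption; the paper only says 'odd rows'."""
    return image[::2, :]

grid = np.zeros((40, 40), dtype=np.uint8)
sub = half_grid(grid)
assert sub.shape == (20, 40)  # 1600 inputs reduced to 800
```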
<h2 id="failed-preliminary-approaches">Failed Preliminary Approaches</h2>
<p>Before arriving at the Classifier-Recognizer architecture, the authors tried three simpler methods that all failed:</p>
<ol>
<li><strong>Ordinary NN</strong>: A single neural network with 1600 inputs (40x40 grid), 1600 hidden units, and 23 outputs. This standard approach achieved only 7% accuracy.</li>
<li><strong>Row/Column pixel counts</strong>: Using the number of black pixels per row and per column as features ($N_c + N_r$ inputs), which dramatically reduced dimensionality. This performed even worse, below 1% accuracy.</li>
<li><strong>Midline crossing count</strong>: Drawing a horizontal midline and counting the number of line crossings. This failed because the crossing count varies between writers for the same ring.</li>
</ol>
<p>These failures motivated the two-phase Classifier-Recognizer design.</p>
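<p>For concreteness, the row/column pixel-count featurization from the second failed attempt collapses a 40x40 binary grid to just 80 numbers, which discards the spatial layout the rings need for discrimination. A sketch (feature ordering is an assumption):</p>

```python
def pixel_count_features(grid):
    """Black-pixel counts per row and per column of a binary grid:
    N_r + N_c = 40 + 40 = 80 features for a 40x40 input."""
    rows = [sum(row) for row in grid]
    cols = [sum(col) for col in zip(*grid)]
    return rows + cols
```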
<h2 id="experimental-setup-and-network-variations">Experimental Setup and Network Variations</h2>
<p>The authors conducted a comparative study of three methodological variations:</p>
<ol>
<li><strong>Whole Image Recognizer</strong>: Uses the full image.</li>
<li><strong>Whole Image (Half Size Grid)</strong>: Uses only odd rows ($20 \times 40$ pixels).</li>
<li><strong>Lower Part (Half Size Grid)</strong>: Uses the lower part of the image with odd rows (the proposed method).</li>
</ol>
<p><strong>Setup</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 23 types of heterocyclic rings.</li>
<li><strong>Training</strong>: 1500 samples (distributed across S, N, O, and Others classes).</li>
<li><strong>Testing</strong>: 1150 samples.</li>
<li><strong>Metric</strong>: Recognition accuracy (Performance %) and Error %.</li>
</ul>
<h2 id="results-high-accuracy-via-dimension-reduction">Results: High Accuracy via Dimension Reduction</h2>
<ul>
<li><strong>Superior Method</strong>: The &ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo; achieved the best performance (~94% overall).</li>
<li><strong>High Classifier Accuracy</strong>: The first phase (classification into S/N/O/Other) achieves 100% accuracy for class S, 98.67% for O, 97.75% for N, and 97.67% for Others (Table 3).</li>
<li><strong>Class &lsquo;Others&rsquo; Difficulty</strong>: The &lsquo;Others&rsquo; class showed lower performance (93% in the best variation, vs. 96&ndash;98% for S/N/O) due to the higher complexity and visual similarity of rings in that category.</li>
<li><strong>Efficiency</strong>: The half-grid approach reduced training time from ~53 hours (Whole Image) to ~35 hours (Lower Part Half Size Grid) while improving accuracy from 87% to 94%.</li>
</ul>
<p><strong>Training/Testing comparison across the three Classifier-Recognizer variations (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Method</th>
          <th style="text-align: left">Hidden Nodes</th>
          <th style="text-align: left">Iterations</th>
          <th style="text-align: left">Training Time (hrs)</th>
          <th style="text-align: left">Error</th>
          <th style="text-align: left">Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Whole Image</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~53</td>
          <td style="text-align: left">13.0%</td>
          <td style="text-align: left">87.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Whole Image (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~41</td>
          <td style="text-align: left">9.0%</td>
          <td style="text-align: left">91.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Lower Part (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~35</td>
          <td style="text-align: left">6.0%</td>
          <td style="text-align: left">94.0%</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset consists of handwritten samples of 23 specific heterocyclic rings.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1500 samples</td>
          <td style="text-align: left">Split: 300 (S), 400 (N), 400 (O), 400 (Others)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1150 samples</td>
          <td style="text-align: left">Split: 150 (S), 300 (O), 400 (N), 300 (Others)</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing Steps</strong>:</p>
<ol>
<li><strong>Monochrome Conversion</strong>: Convert image to monochrome bitmap.</li>
<li><strong>Grid Scaling</strong>: Convert drawing area (regardless of original size) to a fixed <strong>40x40</strong> grid.</li>
<li><strong>Bounding</strong>: Scale the ring shape itself to fit the 40x40 grid.</li>
</ol>
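<p>A minimal sketch of the preprocessing steps above, assuming simple thresholding for the monochrome conversion and nearest-neighbor sampling for the grid scaling (the paper does not specify either, and the bounding step is omitted here):</p>

```python
import numpy as np

def to_grid(image: np.ndarray, size: int = 40, threshold: int = 128) -> np.ndarray:
    """Binarize a grayscale drawing (ink = 1 on a light background) and
    scale it to a fixed size x size grid via nearest-neighbor sampling."""
    binary = (image < threshold).astype(np.uint8)
    h, w = binary.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return binary[np.ix_(rows, cols)]
```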
<h3 id="algorithms">Algorithms</h3>
<p><strong>The &ldquo;Lower Part with Half Size&rdquo; Pipeline</strong>:</p>
<ol>
<li><strong>Cut Point</strong>: A horizontal midline is defined; the algorithm separates the &ldquo;Upper Part&rdquo; and &ldquo;Lower Part&rdquo;.</li>
<li><strong>Phase 1 Input</strong>: The <strong>Upper Part</strong> (approximately the top 15 rows, scaled to the 40x15 classifier input) is fed to the Classifier NN to determine the class (S, N, O, or Others).</li>
<li><strong>Phase 2 Input</strong>:
<ul>
<li>For classes <strong>S, N, O</strong>: The <strong>Lower Part</strong> of the image is used.</li>
<li>For class <strong>Others</strong>: The <strong>Whole Ring</strong> is used.</li>
</ul>
</li>
<li><strong>Dimensionality Reduction</strong>: For the recognizer networks, only <strong>odd rows</strong> are used (effectively a 20x40 input grid) to reduce inputs from 1600 to 800.</li>
</ol>
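<p>The dispatch logic of the pipeline above can be sketched as follows. This is a structural sketch only: <code>classifier</code> and <code>recognizers</code> stand in for the trained networks, and the cut point at row 15 is assumed to match the 40x15 classifier input.</p>

```python
def recognize(grid, classifier, recognizers):
    """Two-phase Classifier-Recognizer dispatch.

    Phase 1 classifies the upper part into S/N/O/Others; Phase 2 runs the
    class-specific recognizer on the lower part (or the whole ring for
    'Others'), subsampled to every other row (the half-size grid).
    """
    upper, lower = grid[:15], grid[15:]
    ring_class = classifier(upper)          # one of 'S', 'N', 'O', 'Others'
    phase2_input = grid if ring_class == "Others" else lower
    return ring_class, recognizers[ring_class](phase2_input[::2])

# toy usage with stand-in "networks"
toy_grid = [[0] * 40 for _ in range(40)]
cls, label = recognize(toy_grid, lambda upper: "S",
                       {"S": lambda x: "thiophene"})
assert (cls, label) == ("S", "thiophene")
```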
<h3 id="models">Models</h3>
<p>The system uses multiple distinct feed-forward neural networks (backpropagation training is implied by the &ldquo;training&rdquo; and &ldquo;epochs&rdquo; terminology, though the paper never names the training algorithm explicitly):</p>
<ul>
<li><strong>Structure</strong>: 1 Classifier NN + 4 Recognizer NNs (one for each class).</li>
<li><strong>Hidden Layers</strong>: The preliminary &ldquo;ordinary method&rdquo; experiment used 1600 hidden units. The Classifier-Recognizer methods all used 50 hidden nodes per Table 2. The paper also notes that the ordinary approach tried various hidden layer sizes.</li>
<li><strong>Input Nodes</strong>:
<ul>
<li>Standard: 1600 (40x40).</li>
<li>Optimized: 800 (20x40 via half-grid).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Classifier Phase Testing Results (Table 3)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">150</td>
          <td style="text-align: left"><strong>100.00%</strong></td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">296</td>
          <td style="text-align: left"><strong>98.67%</strong></td>
          <td style="text-align: left">1.33%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">391</td>
          <td style="text-align: left"><strong>97.75%</strong></td>
          <td style="text-align: left">2.25%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">293</td>
          <td style="text-align: left"><strong>97.67%</strong></td>
          <td style="text-align: left">2.33%</td>
      </tr>
  </tbody>
</table>
<p><strong>Recognizer Phase Testing Results (Lower Part Image Recognizer with Half Size Grid, Table 4)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">147</td>
          <td style="text-align: left"><strong>98.00%</strong></td>
          <td style="text-align: left">2.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">289</td>
          <td style="text-align: left"><strong>96.33%</strong></td>
          <td style="text-align: left">3.67%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">386</td>
          <td style="text-align: left"><strong>96.50%</strong></td>
          <td style="text-align: left">3.50%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">279</td>
          <td style="text-align: left"><strong>93.00%</strong></td>
          <td style="text-align: left">7.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Overall</strong></td>
          <td style="text-align: left"><strong>1150</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
          <td style="text-align: left"><strong>~94.0%</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>No source code, trained models, or datasets were released with this paper. The handwritten ring samples were collected by the authors, and the software described (a desktop application) is not publicly available. The neural network architecture details (50 hidden nodes, 1000 iterations) and preprocessing pipeline are described in sufficient detail for reimplementation, but reproducing results would require collecting a new handwritten dataset of heterocyclic rings.</p>
<p><strong>Status</strong>: Closed (no public code, data, or models).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hewahi, N., Nounou, M. N., Nassar, M. S., Abu-Hamad, M. I., &amp; Abu-Hamad, H. I. (2008). Chemical Ring Handwritten Recognition Based on Neural Networks. <em>Ubiquitous Computing and Communication Journal</em>, 3(3).</p>
<p><strong>Publication</strong>: Ubiquitous Computing and Communication Journal 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{hewahiCHEMICALRINGHANDWRITTEN2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CHEMICAL RING HANDWRITTEN RECOGNITION BASED ON NEURAL NETWORKS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hewahi, Nabil and Nounou, Mohamed N and Nassar, Mohamed S and Abu-Hamad, Mohamed I and Abu-Hamad, Husam I}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2008}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Ubiquitous Computing and Communication Journal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Deep Learning for Molecular Structure Extraction (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/</guid><description>An end-to-end deep learning approach using U-Net segmentation and a CNN encoder with GridLSTM decoder to predict chemical structures from document images.</description><content:encoded><![CDATA[<h2 id="contribution-type-method-and-resource">Contribution Type: Method and Resource</h2>
<p>This is primarily a <strong>methodological</strong> paper with a secondary <strong>resource</strong> contribution.</p>
<p><strong>Method</strong>: It proposes a novel end-to-end deep learning architecture (Segmentation U-Net + Recognition Encoder-Decoder) to replace traditional rule-based optical chemical structure recognition (OCSR) systems.</p>
<p><strong>Resource</strong>: It details a pipeline for generating large-scale synthetic datasets (images overlaying patent/journal backgrounds) necessary to train the deep learning models.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Existing tools for extracting chemical structures from literature (e.g., OSRA, CLIDE) rely on complex, handcrafted rules and heuristics (edge detection, vectorization). These systems suffer from:</p>
<ol>
<li><strong>Brittleness</strong>: They fail when image quality is low (low resolution, noise) or when drawing styles vary (wavy bonds, crossing lines).</li>
<li><strong>Maintenance difficulty</strong>: Improvements require manually codifying new rules for every edge case, which is difficult to scale.</li>
<li><strong>Data volume</strong>: The explosion of published life science papers (2000+ per day in Medline) creates a need for automated, robust curation tools that humans cannot match.</li>
</ol>
<h2 id="core-innovation-end-to-end-pixel-to-smiles-recognition">Core Innovation: End-to-End Pixel-to-SMILES Recognition</h2>
<p>The authors present an <strong>end-to-end deep learning approach</strong> for this task that operates directly on raw pixels without explicit subcomponent recognition (e.g., detecting atoms and bonds separately). Key innovations include:</p>
<ol>
<li><strong>Pixel-to-SMILES</strong>: Treating structure recognition as an image captioning problem using an encoder-decoder architecture with attention, generating SMILES directly.</li>
<li><strong>Low-Resolution Robustness</strong>: The model is trained on aggressively downsampled images (~60 dpi for segmentation, 256x256 for prediction), making it robust to poor quality and noisy inputs from legacy PDF extractions.</li>
<li><strong>Implicit Superatom Handling</strong>: The model learns to recognize and generate sequences for superatoms (e.g., &ldquo;OTBS&rdquo;) contextually.</li>
</ol>
<h2 id="experimental-setup-and-large-scale-synthetic-data">Experimental Setup and Large-Scale Synthetic Data</h2>
<p>The authors validated their approach using a mix of large-scale synthetic training sets and real-world test sets:</p>
<ol>
<li><strong>Synthetic Generation</strong>: They created a segmentation dataset by overlaying USPTO molecules onto &ldquo;whited-out&rdquo; journal pages.</li>
<li><strong>Ablation/Training</strong>: Metrics were tracked on Indigo (synthetic) and USPTO (real patent images) datasets.</li>
<li><strong>External Validation</strong>:
<ul>
<li><strong>Valko Dataset</strong>: A standard benchmark of 454 heterogeneous images from literature.</li>
<li><strong>Proprietary Dataset</strong>: A collection of images from 47 articles and 5 patents to simulate real-world drug discovery curation.</li>
</ul>
</li>
<li><strong>Stress Testing</strong>: They analyzed performance distributions across molecular weight, heavy atom count, and rare elements (e.g., Uranium, Vanadium).</li>
</ol>
<h2 id="results-and-limitations-in-complex-structures">Results and Limitations in Complex Structures</h2>
<ul>
<li><strong>High Accuracy on Standard Sets</strong>: The model achieved <strong>82% accuracy</strong> on the Indigo validation set and <strong>77%</strong> on the USPTO validation set. No apparent overfitting was observed on the Indigo data (57M training examples), though some overfitting occurred on the smaller USPTO set (1.7M training examples).</li>
<li><strong>Real-World Viability</strong>: It achieved <strong>83% accuracy</strong> on the proprietary internal test set, with validation and proprietary accuracies ranging from 77-83%, indicating the training sets reasonably approximate real drug discovery data.</li>
<li><strong>Segmentation Quality</strong>: Low segmentation error rates were observed: only 3.3% of the Valko dataset and 6.6% of the proprietary images failed to segment properly.</li>
<li><strong>Limitations on Complexity</strong>: Performance dropped to <strong>41% on the Valko test set</strong>. Superatoms were the single largest contributor to prediction errors, with 21% of Valko samples containing one or more incorrectly predicted superatoms. Only 6.6% of total training images contained any superatom, limiting the model&rsquo;s exposure.</li>
<li><strong>Stereochemistry Challenges</strong>: 60% of compounds with incorrectly predicted stereochemistry had explicit stereochemistry in both the ground truth and the prediction, but with wrong configurations assigned (e.g., predicting R instead of S). The model often correctly identified which atoms have stereocenters but assigned the wrong direction, suggesting the architecture may not incorporate sufficient spatial context for configuration assignment.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized three primary sources for generating training data. All inputs were strictly downsampled to improve robustness.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>Indigo Set</strong></td>
          <td>57M</td>
          <td>PubChem molecules rendered via Indigo (256x256).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO Set</strong></td>
          <td>1.7M</td>
          <td>Image/SMILES pairs from public patent data.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>OS X Indigo</strong></td>
          <td>10M</td>
          <td>Additional Indigo renders from Mac OS for style diversity.</td>
      </tr>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td><strong>Synthetic Pages</strong></td>
          <td>N/A</td>
          <td>Generated by overlaying USPTO images on text-cleared PDF pages.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Segmentation Inputs</strong>: Grayscale, downsampled to ~60 dpi.</li>
<li><strong>Prediction Inputs</strong>: Resized to 256x256 such that bond lengths are approximately 3-12 pixels.</li>
<li><strong>Augmentation</strong>: Random affine transforms, brightness scaling, and binarization applied during training.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Segmentation Pipeline</strong>:</p>
<ul>
<li><strong>Multi-scale Inference</strong>: Masks generated at resolutions from 30 to 60 dpi (3 dpi increments) and averaged for the final mask.</li>
<li><strong>Post-processing</strong>: Hough transform used to remove long straight lines (table borders). Mask blobs filtered by pixel count thresholds.</li>
</ul>
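<p>The multi-scale averaging step might look like the following sketch. Here <code>predict_mask</code> stands in for U-Net inference at a given dpi and is assumed to return a mask already resampled to the page's shape; the 0.5 threshold is an assumption, as the paper only describes averaging the masks.</p>

```python
import numpy as np

def multiscale_mask(predict_mask, page, dpis=range(30, 61, 3), threshold=0.5):
    """Average segmentation masks predicted at several resolutions
    (30-60 dpi in 3-dpi increments) and binarize the result."""
    masks = [predict_mask(page, dpi) for dpi in dpis]
    return (np.mean(masks, axis=0) >= threshold).astype(np.uint8)
```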
<p><strong>Prediction Pipeline</strong>:</p>
<ul>
<li><strong>Sequence Generation</strong>: SMILES generated character-by-character via greedy decoding. During inference, predictions are made at several low resolutions and the sequence with the highest confidence (product of per-character softmax outputs) is returned.</li>
<li><strong>Attention-based Verification</strong>: Attention weights used to re-project predicted atoms back into 2D space to visually verify alignment with the input image.</li>
</ul>
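<p>The confidence-based selection across resolutions can be sketched as below. The product of per-character softmax probabilities is computed in log space to avoid underflow; the <code>(smiles, per_char_probs)</code> pair format is an assumption for illustration.</p>

```python
import math

def best_prediction(candidates):
    """Return the decoded SMILES whose product of per-character softmax
    probabilities (sequence confidence) is highest, selecting among the
    predictions made at several inference resolutions."""
    def log_confidence(probs):
        return sum(math.log(p) for p in probs)
    return max(candidates, key=lambda c: log_confidence(c[1]))[0]
```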
<h3 id="models">Models</h3>
<p><strong>1. Segmentation Model (U-Net Variant)</strong>:</p>
<ul>
<li><strong>Architecture</strong>: U-Net style with skip connections.</li>
<li><strong>Input</strong>: 128x128x1 grayscale image.</li>
<li><strong>Layers</strong>: Alternating 3x3 Conv and 2x2 Max Pool.</li>
<li><strong>Activation</strong>: Parametric ReLU (pReLU).</li>
<li><strong>Parameters</strong>: ~380,000.</li>
</ul>
<p><strong>2. Structure Prediction Model (Encoder-Decoder)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: CNN with 5x5 convolutions, 2x2 Max Pooling, pReLU. No pooling in first layers to preserve fine features.</li>
<li><strong>Decoder</strong>: 3 layers of <strong>GridLSTM</strong> cells.</li>
<li><strong>Attention</strong>: Soft/Global attention mechanism conditioned on the encoder state.</li>
<li><strong>Input</strong>: 256x256x1 image.</li>
<li><strong>Output</strong>: Sequence of characters (vocab size 65).</li>
<li><strong>Parameters</strong>: ~46.3 million.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation required an exact string match of the Canonical SMILES (including stereochemistry) to the ground truth.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Dataset</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td><strong>82%</strong></td>
          <td>Indigo Val</td>
          <td>Synthetic validation set</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>77%</strong></td>
          <td>USPTO Val</td>
          <td>Real patent images</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>83%</strong></td>
          <td>Proprietary</td>
          <td>Internal pharma dataset (real world)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td><strong>41%</strong></td>
          <td>Valko Test</td>
          <td>External benchmark; difficult due to superatoms</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Segmentation Training</strong>: 1 GPU, ~4 days (650k steps).</li>
<li><strong>Prediction Training</strong>: 8 NVIDIA Pascal GPUs, ~26 days (1M steps).</li>
<li><strong>Framework</strong>: TensorFlow.</li>
<li><strong>Optimizer</strong>: Adam.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code, pre-trained models, or generated datasets were released with this paper. The training pipeline relies on publicly available molecular databases (PubChem, USPTO) and open-source rendering tools (Indigo), but the specific training sets, model weights, and inference code remain unavailable.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Staker, J., Marshall, K., Abel, R., &amp; McQuaw, C. (2019). Molecular Structure Extraction From Documents Using Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1017-1029. <a href="https://doi.org/10.1021/acs.jcim.8b00669">https://doi.org/10.1021/acs.jcim.8b00669</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.schrodinger.com/publications/">Schrödinger Publication Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{stakerMolecularStructureExtraction2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Structure Extraction From Documents Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Staker, Joshua and Marshall, Kyle and Abel, Robert and McQuaw, Carolyn}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{feb}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1017--1029}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acs.jcim.8b00669}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1021/acs.jcim.8b00669}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER: Deep Learning for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/</guid><description>Deep learning method for optical chemical structure recognition using image captioning networks trained on millions of synthetic molecular images.</description><content:encoded><![CDATA[<h2 id="contribution-method-for-optical-chemical-entity-recognition">Contribution: Method for Optical Chemical Entity Recognition</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper with a strong <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (DECIMER) that repurposes &ldquo;show-and-tell&rdquo; image captioning networks for Optical Chemical Entity Recognition (OCER), providing an alternative to traditional rule-based segmentation pipelines.</li>
<li><strong>Resource</strong>: It establishes a framework for generating large-scale synthetic training data using open-source cheminformatics tools (CDK) and databases (PubChem), circumventing the scarcity of manually annotated chemical images.</li>
</ul>
<h2 id="motivation-brittleness-of-heuristic-pipelines">Motivation: Brittleness of Heuristic Pipelines</h2>
<p>The extraction of chemical structures from scientific literature (OCER) is critical for populating open-access databases. Traditional OCER systems (like OSRA or CLiDE) rely on complex multi-step pipelines involving vectorization, character recognition, and graph compilation. These systems are brittle and incorporating new structural features requires laborious engineering. Inspired by the success of deep neural network approaches like AlphaGo Zero, the authors sought to formulate an end-to-end deep learning approach that learns directly from data with minimal prior assumptions.</p>
<h2 id="novelty-image-captioning-for-molecular-graphs">Novelty: Image Captioning for Molecular Graphs</h2>
<ul>
<li><strong>Image-to-Text Formulation</strong>: The paper frames chemical structure recognition as an image captioning problem, translating a bitmap image directly into a SMILES string using an encoder-decoder network. This bypasses explicit segmentation of atoms and bonds entirely.</li>
<li><strong>Synthetic Data Strategy</strong>: The authors generate synthetic images from PubChem using the CDK Structure Diagram Generator, scaling the dataset size to 15 million.</li>
<li><strong>Robust String Representations</strong>: The study performs key ablation experiments on string representations, comparing standard SMILES against DeepSMILES to evaluate how syntactic validity affects the network&rsquo;s learning capability.</li>
</ul>
<h2 id="experimental-setup-and-validation-strategies">Experimental Setup and Validation Strategies</h2>
<ul>
<li><strong>Data Scaling</strong>: Models were trained on dataset sizes ranging from 54,000 to 15 million synthetic images to observe empirical scaling laws regarding accuracy and compute time.</li>
<li><strong>Representation Comparison</strong>: The authors compared the validity of predicted strings and recognition accuracy when training on SMILES versus DeepSMILES. The cross-entropy loss formulation for sequence generation can be represented as:
$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$
where $\mathbf{x}$ is the image representation and $y_t$ are the tokens of the SMILES/DeepSMILES string.</li>
<li><strong>Metric Evaluation</strong>: Performance was measured using Validity (syntactic correctness) and Tanimoto Similarity $T$, computed on molecular fingerprints to capture partial correctness even if the exact string prediction failed:
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</li>
</ul>
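<p>The Tanimoto formula above, applied to fingerprints represented as sets of on-bits, is a few lines of code (the value for two empty fingerprints is a convention, not something the paper specifies):</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets:
    |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    if not a and not b:
        return 1.0  # convention for two empty fingerprints (an assumption)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)
```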
<h2 id="results-and-critical-conclusions">Results and Critical Conclusions</h2>
<ul>
<li><strong>Data Representation</strong>: DeepSMILES proved superior to standard SMILES for training stability and output validity. Preliminary tests suggested SELFIES performs even better (0.78 Tanimoto vs 0.53 for DeepSMILES at 6M images).</li>
<li><strong>Scaling Behavior</strong>: Accuracy improves linearly with dataset size. The authors extrapolate that near-perfect detection would require training on 50 to 100 million structures.</li>
<li><strong>Current Limitations</strong>: At the reported training scale (up to 15M), the model does not yet rival traditional heuristic approaches, but the learning curve suggests it is a viable trajectory given sufficient compute and data.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is synthetic, generated using the Chemistry Development Kit (CDK) Structure Diagram Generator (SDG) based on molecules from PubChem.</p>
<p><strong>Curation Rules</strong> (applied to PubChem data):</p>
<ul>
<li>Molecular weight &lt; 1500 Daltons.</li>
<li>Elements restricted to: C, H, O, N, P, S, F, Cl, Br, I, Se, B.</li>
<li>No counter ions or charged groups.</li>
<li>No isotopes (e.g., D, T).</li>
<li>Bond count between 5 and 40.</li>
<li>SMILES length &lt; 40 characters.</li>
<li>Implicit hydrogens only (except in functional groups).</li>
</ul>
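<p>The curation rules can be sketched as a simple filter. This is an illustrative reconstruction, not code from the DECIMER repository: the record fields and the <code>passes_curation</code> helper are hypothetical stand-ins for whatever representation the CDK pipeline uses.</p>

```python
# Elements permitted by the curation rules above
ALLOWED = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_curation(mol: dict) -> bool:
    """Apply the paper's curation rules to a (hypothetical) molecule record."""
    return (
        mol["mol_weight"] < 1500                 # < 1500 Daltons
        and set(mol["elements"]) <= ALLOWED      # restricted element set
        and not mol["has_charge"]                # no counter ions / charges
        and not mol["has_isotope"]               # no isotopes (D, T, ...)
        and 5 <= mol["bond_count"] <= 40         # bond count between 5 and 40
        and len(mol["smiles"]) < 40              # SMILES length < 40 chars
    )

caffeine = {"mol_weight": 194.19, "elements": ["C", "H", "N", "O"],
            "has_charge": False, "has_isotope": False,
            "bond_count": 15, "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"}
print(passes_curation(caffeine))  # True
```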
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Images</strong>: Generated as 299x299 bitmaps to match Inception V3 input requirements.</li>
<li><strong>Augmentation</strong>: One random rotation applied per molecule; no noise or blurring added in this iteration.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem)</td>
          <td>54k - 15M</td>
          <td>Scaled across 12 experiments</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Independent Set</td>
          <td>6k - 1.6M</td>
          <td>10% of training size</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: <code>&quot;Show, Attend and Tell&quot;</code> (Attention-based Image Captioning).</li>
<li><strong>Optimization</strong>: Adam optimizer with learning rate 0.0005.</li>
<li><strong>Loss Function</strong>: Sparse Categorical Crossentropy.</li>
<li><strong>Training Loop</strong>: Trained for 25 epochs per model. Batch size of 640 images.</li>
</ul>
<h3 id="models">Models</h3>
<p>The network is implemented in TensorFlow 2.0.</p>
<ul>
<li><strong>Encoder</strong>: Inception V3 (Convolutional NN), used unaltered. Extracts feature vectors saved as NumPy arrays.</li>
<li><strong>Decoder</strong>: Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) with soft attention mechanism.</li>
<li><strong>Embeddings</strong>: Image embedding dimension size of 600.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is Tanimoto similarity (Jaccard index) on PubChem fingerprints, which is robust for measuring structural similarity even when exact identity is not reached.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td>Percentage of predictions that are chemically identical to ground truth (isomorphic).</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Mean similarity score across the test set (captures partial correctness).</td>
      </tr>
      <tr>
          <td><strong>Validity</strong></td>
          <td>Percentage of predicted strings that are valid DeepSMILES/SMILES.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER">DECIMER (Java utilities)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>CDK-based data generation and conversion tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER-Image-to-SMILES</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>TensorFlow training and inference scripts (archived)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source of molecular structures for synthetic training data</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a single node.</p>
<ul>
<li><strong>GPU</strong>: 1x NVIDIA Tesla V100.</li>
<li><strong>CPU</strong>: 2x Intel Xeon Gold 6230.</li>
<li><strong>RAM</strong>: 384 GB.</li>
<li><strong>Compute Time</strong>:
<ul>
<li>Linear scaling with data size.</li>
<li>15 million structures took ~27 days (91,881s per epoch).</li>
<li>Projected time for 100M structures: ~4 months on a single GPU.</li>
</ul>
</li>
</ul>
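<p>The reported per-epoch time is consistent with the quoted total for the 15M-structure run, as a quick back-of-the-envelope check shows:</p>

```python
# Sanity check on the reported timings: 91,881 s/epoch over 25 epochs
seconds = 91_881 * 25
days = seconds / 86_400   # seconds per day
print(f"{days:.1f} days")  # ≈ 26.6 days, matching the "~27 days" figure
```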
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2020). DECIMER: towards deep learning for chemical image recognition. <em>Journal of Cheminformatics</em>, 12(1), 65. <a href="https://doi.org/10.1186/s13321-020-00469-w">https://doi.org/10.1186/s13321-020-00469-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER">Official GitHub Repository</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER Image-to-SMILES Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERDeepLearning2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{DECIMER}}: Towards Deep Learning for Chemical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{DECIMER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{65}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00469-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGrapher: Deep Learning for Chemical Graph OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/chemgrapher-2020/</guid><description>Deep learning OCSR method using semantic segmentation and classification CNNs to reconstruct chemical graphs with improved stereochemistry.</description><content:encoded><![CDATA[<h2 id="classifying-the-methodology">Classifying the Methodology</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture and a specific graph-reconstruction algorithm to solve the problem of Optical Chemical Structure Recognition (OCSR). It validates this method by comparing it against the existing standard tool (OSRA), demonstrating superior performance on specific technical challenges like stereochemistry.</p>
<h2 id="the-ocr-stereochemistry-challenge">The OCR Stereochemistry Challenge</h2>
<p>Chemical knowledge is frequently locked in static images within scientific publications. Extracting this structure into machine-readable formats (graphs, SMILES) is essential for drug discovery and database querying. Existing tools, such as OSRA, rely on optical character recognition (OCR) and expert systems or hand-coded rules. These tools struggle with bond multiplicity and stereochemical information, often missing atoms or misinterpreting 3D cues (wedges and dashes). A machine learning approach allows for improvement via data scaling.</p>
<h2 id="decoupled-semantic-segmentation-and-classification-pipeline">Decoupled Semantic Segmentation and Classification Pipeline</h2>
<p>The core novelty is the <strong>segmentation-classification pipeline</strong> which decouples object detection from type assignment:</p>
<ol>
<li><strong>Semantic Segmentation</strong>: The model first predicts pixel-wise maps for atoms, bonds, and charges using a Dense Prediction Convolutional Network built on dilated convolutions.</li>
<li><strong>Graph Building Algorithm</strong>: A specific algorithm iterates over the segmentation maps to generate candidate locations for atoms and bonds.</li>
<li><strong>Refinement via Classification</strong>: Dedicated classification networks take cutouts of the original image combined with the segmentation mask to verify and classify each candidate (e.g., distinguishing a single bond from a double bond, or a wedge from a dash).</li>
</ol>
<p>Additionally, the authors developed a novel method for <strong>synthetic data generation</strong> by modifying the source code of RDKit to output pixel-wise labels during the image drawing process. This solves the lack of labeled training data.</p>
<h2 id="evaluating-synthetics-and-benchmarks">Evaluating Synthetics and Benchmarks</h2>
<ul>
<li><strong>Synthetic Benchmarking</strong>: The authors generated test sets in 3 different stylistic variations. For each style, they tested on both stereo (complex 3D information) and non-stereo compounds.</li>
<li><strong>Baseline Comparison</strong>: They compared the error rates of ChemGrapher against <strong>OSRA</strong> (Optical Structure Recognition Application).</li>
<li><strong>Component-level Evaluation</strong>: They analyzed the F1 scores of the segmentation networks versus the classification networks independently to understand where errors propagated.</li>
<li><strong>Real-world Case Study</strong>: They manually curated 61 images cut from journal articles to test performance on real, non-synthetic data.</li>
</ul>
<h2 id="advancements-over-osra">Advancements Over OSRA</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ChemGrapher consistently achieved lower error rates than OSRA across all synthetic styles, particularly for stereochemical information (wedge and dash bonds).</li>
<li><strong>Component Performance</strong>: The classification networks showed higher F1 scores than the segmentation networks across all prediction types (Figure 4 in the paper). This suggests the two-stage approach allows the classifier to correct segmentation noise.</li>
<li><strong>Real-world Viability</strong>: In the manual case study, ChemGrapher correctly predicted 46 of 61 images, compared to 42 of 61 for OSRA.</li>
<li><strong>Limitations</strong>: The model struggles with thick bond lines in real-world images. Performance is stronger on carbon-only compounds, where no element letters appear in the image.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors created a custom synthetic dataset using ChEMBL and RDKit, as no pixel-wise labeled dataset existed.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>ChEMBL</td>
          <td>1.9M</td>
          <td>Split into training pool (1.5M), val/train pool (300K), and test pools (35K each).</td>
      </tr>
      <tr>
          <td><strong>Segmentation Train</strong></td>
          <td>Synthetic</td>
          <td>~114K</td>
          <td>Sampled from ChEMBL pool such that every atom type appears in &gt;1000 compounds.</td>
      </tr>
      <tr>
          <td><strong>Labels</strong></td>
          <td>Pixel-wise</td>
          <td>N/A</td>
          <td>Generated by modifying <strong>RDKit</strong> source code to output label masks (atom type, bond type, charge) during drawing.</td>
      </tr>
      <tr>
          <td><strong>Candidates (Val)</strong></td>
          <td>Cutouts</td>
          <td>~27K (Atom)<br>~55K (Bond)</td>
          <td>Validation candidates generated from ~450 compounds for evaluating the classification networks.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Algorithm 1: Graph Building</strong></p>
<ol>
<li><strong>Segment</strong>: Apply segmentation network $s(x)$ to get maps $S^a$ (atoms), $S^b$ (bonds), $S^c$ (charges).</li>
<li><strong>Atom Candidates</strong>: Identify candidate blobs in $S^a$.</li>
<li><strong>Classify Atoms</strong>: For each candidate, crop the input image and segmentation map. Feed to $c_A$ and $c_C$ to predict Atom Type and Charge. Add to Vertex set $V$ if valid.</li>
<li><strong>Bond Candidates</strong>: Generate all pairs of nodes in $V$ within $2 \times$ bond length distance.</li>
<li><strong>Classify Bonds</strong>: For each pair, create a candidate mask (two rectangles meeting in the middle to encode directionality). Feed to $c_B$ to predict Bond Type (single, double, wedge, etc.). Add to Edge set $E$.</li>
</ol>
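<p>Steps 4&ndash;5 of the graph-building algorithm can be sketched as follows. This is a hedged illustration, not the authors' code: the coordinates and the <code>bond_length</code> estimate are made up, and in the real pipeline each surviving pair is handed to the bond classification network $c_B$.</p>

```python
import itertools
import math

# Classified atom centers (vertex set V); coordinates are illustrative
atoms = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.9, 0.0), 3: (5.0, 5.0)}
bond_length = 1.0  # typical bond length estimated from the image

def bond_candidates(atoms, bond_length):
    """Propose bond candidates: all atom pairs within 2x the bond length."""
    pairs = []
    for (i, p), (j, q) in itertools.combinations(atoms.items(), 2):
        if math.dist(p, q) <= 2 * bond_length:
            pairs.append((i, j))  # later verified by the bond classifier c_B
    return pairs

print(bond_candidates(atoms, bond_length))  # [(0, 1), (0, 2), (1, 2)]
```

<p>Note how atom 3, far from the rest, generates no candidates; pruning by distance keeps the classifier from scoring every one of the $O(|V|^2)$ pairs.</p>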
<h3 id="models">Models</h3>
<p>The pipeline uses four distinct Convolutional Neural Networks (CNNs).</p>
<p><strong>1. Semantic Segmentation Network ($s$)</strong></p>
<ul>
<li><strong>Architecture</strong>: 8 convolutional layers (3x3) plus a final 1x1 linear layer (Dense Prediction Convolutional Network).</li>
<li><strong>Kernels</strong>: $3 \times 3$ for all convolutional layers; $1 \times 1$ for the final linear layer.</li>
<li><strong>Dilation</strong>: Uses dilated convolutions to expand receptive field without losing resolution. Six of the eight convolutional layers use dilation (factors: 2, 4, 8, 8, 4, 2); the first and last convolutional layers have no dilation.</li>
<li><strong>Input</strong>: Binary B/W image.</li>
<li><strong>Output</strong>: Multi-channel probability maps for Atom Types ($S^a$), Bond Types ($S^b$), and Charges ($S^c$).</li>
</ul>
<p><strong>2. Classification Networks ($c_A, c_B, c_C$)</strong></p>
<ul>
<li><strong>Purpose</strong>: Refines predictions on small image patches.</li>
<li><strong>Architecture</strong>: 5 convolutional layers, followed by a MaxPool layer and a final linear (1x1) layer.
<ul>
<li>Layer 1: <strong>Depthwise separable convolution</strong> (no dilation).</li>
<li>Layers 2-4: Dilated convolutions (factors 2, 4, 8).</li>
<li>Layer 5: Standard convolution (no dilation).</li>
<li>MaxPool: $124 \times 124$.</li>
<li>Final: 1x1 linear layer.</li>
</ul>
</li>
<li><strong>Inputs</strong>:
<ul>
<li>Crop of the binary image ($x^{cut}$).</li>
<li>Crop of the segmentation map ($S^{cut}$).</li>
<li>&ldquo;Highlight&rdquo; mask ($h_L$) indicating the specific candidate location (e.g., a dot for atoms, two rectangles for bonds).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: <strong>F1 Score</strong> for individual network performance (segmentation pixels and classification accuracy).</li>
<li><strong>Metric</strong>: <strong>Error Rate</strong> (percentage of incorrect graphs) for overall system. A graph is &ldquo;incorrect&rdquo; if there is at least one mistake in atoms or bonds.</li>
<li><strong>Baselines</strong>: Compared against <strong>OSRA</strong>.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Training and inference performed on a single <strong>NVIDIA Titan Xp</strong> (donated by NVIDIA).</li>
</ul>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Closed.</strong> The authors did not release source code, pre-trained models, or the synthetic dataset. The data generation pipeline requires modifications to RDKit&rsquo;s internal drawing code, which are not publicly available. The ChEMBL source compounds are public, but the pixel-wise labeling procedure cannot be reproduced without the modified RDKit code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., Arany, Á., Moreau, Y., &amp; Simm, J. (2020). ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 60(10), 4506-4517. <a href="https://doi.org/10.1021/acs.jcim.0c00459">https://doi.org/10.1021/acs.jcim.0c00459</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2020 (arXiv preprint Feb 2020)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2002.09914">arXiv Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oldenhof2020chemgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Oldenhof, Martijn and Arany, Ádám and Moreau, Yves and Simm, Jaak}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4506--4517}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c00459}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Research on Chemical Expression Images Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/hong-chemical-expression-2015/</guid><description>A 2015 methodology for Optical Chemical Structure Recognition (OCSR) focusing on improved handling of adhesive symbols and wedge bonds.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hong, C., Du, X., &amp; Zhang, L. (2015). Research on Chemical Expression Images Recognition. <em>Proceedings of the 2015 Joint International Mechanical, Electronic and Information Technology Conference</em>, 267-271. <a href="https://doi.org/10.2991/jimet-15.2015.50">https://doi.org/10.2991/jimet-15.2015.50</a></p>
<p><strong>Publication</strong>: JIMET 2015 (Atlantis Press)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://jsme-editor.github.io/">JSME Editor (used for visualization)</a></li>
</ul>
<h2 id="contribution-new-ocsr-workflow-for-adhesion-and-wedge-bonds">Contribution: New OCSR Workflow for Adhesion and Wedge Bonds</h2>
<p><strong>Method</strong>. The paper proposes a novel algorithmic pipeline (OCSR) for recognizing 2D organic chemical structures from images. It validates this method by comparing it against an existing tool (OSRA) using a quantitative metric (Tanimoto Coefficient) on a test set of 200 images.</p>
<h2 id="motivation-challenges-with-connecting-symbols-and-stereochemistry">Motivation: Challenges with Connecting Symbols and Stereochemistry</h2>
<p>A vast amount of chemical structural information exists in scientific literature (PDFs/images) that is not machine-readable. Manually converting these images to formats like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or CML is labor-intensive. Existing tools face challenges with:</p>
<ol>
<li><strong>Adhesion</strong>: Poor separation when chemical symbols touch or overlap with bonds.</li>
<li><strong>Stereochemistry</strong>: Incomplete identification of &ldquo;real&rdquo; (solid) and &ldquo;virtual&rdquo; (dashed/hashed) wedge bonds.</li>
</ol>
<h2 id="core-innovation-vector-based-separation-and-stereochemical-logic">Core Innovation: Vector-Based Separation and Stereochemical Logic</h2>
<p>The authors propose a specific <strong>OCSR (Optical Chemical Structure Recognition)</strong> workflow with two key technical improvements:</p>
<ol>
<li><strong>Vector-based Separation</strong>: The method vectorizes the image (using Potrace) to extract straight lines and curves, allowing better separation of &ldquo;adhesive&rdquo; chemical symbols (like H, N, O attached to bonds).</li>
<li><strong>Stereochemical Logic</strong>: Specific rules for identifying wedge bonds:
<ul>
<li><strong>Virtual (Dashed) Wedges</strong>: Identified by grouping connected domains and checking linear correlation of their center points.</li>
<li><strong>Real (Solid) Wedges</strong>: Identified after thinning by analyzing linear correlation and width variance of line segments.</li>
</ul>
</li>
</ol>
<h2 id="methodology--experimental-setup">Methodology &amp; Experimental Setup</h2>
<ul>
<li>
<p><strong>Dataset</strong>: 200 chemical structure images collected from the network.</p>
</li>
<li>
<p><strong>Baselines</strong>: Compared against <strong>OSRA</strong> (Optical Structure Recognition Application), a free online tool.</p>
</li>
<li>
<p><strong>Metric</strong>: <strong>Tanimoto Coefficient</strong>, measuring the similarity of the set of recognized bonds and symbols against the ground truth. The similarity $T(A, B)$ is defined as:</p>
<p>$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</p>
</li>
</ul>
<h2 id="results--conclusions">Results &amp; Conclusions</h2>
<ul>
<li><strong>Performance</strong>: The proposed OCSR method achieved higher recognition rates than OSRA.
<ul>
<li><strong>Exact Match (100%)</strong>: OCSR achieved 90.0% vs. OSRA&rsquo;s 82.2%.</li>
<li><strong>High Similarity (&gt;85%)</strong>: OCSR recognized 157 structures vs. OSRA&rsquo;s 114.</li>
</ul>
</li>
<li><strong>Limitations</strong>: The paper notes that &ldquo;real wedge&rdquo; and &ldquo;virtual wedge&rdquo; identification was a primary focus, but general recognition effectiveness still &ldquo;has room for improvement&rdquo;.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom collection of images, not a standard benchmark.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Web-crawled chemical images</td>
          <td>200 structures</td>
          <td>Images containing 2D organic structures; specific source URLs not provided.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows these specific steps:</p>
<ol>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Grayscale</strong>: via <code>cvCvtColor</code> (OpenCV).</li>
<li><strong>Binarization</strong>: via Otsu&rsquo;s method.</li>
</ul>
</li>
<li><strong>Isolated Symbol Removal</strong>:
<ul>
<li>Identifies connected domains with aspect ratios in <code>[0.8, 3.0]</code>.</li>
<li>Recognizes them using OCR (GOCR, OCRAD, Tesseract) and removes them from the image.</li>
</ul>
</li>
<li><strong>Virtual Wedge Recognition</strong>:
<ul>
<li>Groups small connected domains (points/clumps).</li>
<li>Calculates linear correlation of center points; if collinear, treats as a dashed bond.</li>
</ul>
</li>
<li><strong>Vectorization &amp; Thinning</strong>:
<ul>
<li><strong>Thinning</strong>: Rosenfeld algorithm (optimized) to reduce lines to single pixel width.</li>
<li><strong>Vectorization</strong>: Uses <strong>Potrace</strong> to convert pixels to vector segments.</li>
<li><strong>Merging</strong>: Combines split vector segments based on angle thresholds to form long straight lines.</li>
</ul>
</li>
<li><strong>Adhesive Symbol Separation</strong>:
<ul>
<li>Identifies curves (short segments after vectorization) attached to long lines.</li>
<li>Separates these domains and re-runs OCR.</li>
</ul>
</li>
<li><strong>&ldquo;Super Atom&rdquo; Merging</strong>:
<ul>
<li>Merges adjacent vertical/horizontal symbols (e.g., &ldquo;HO&rdquo;, &ldquo;CH3&rdquo;) based on distance thresholds between bounding boxes.</li>
</ul>
</li>
</ol>
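<p>The "virtual wedge" step (step 3) hinges on a collinearity test over the center points of small connected domains. The paper does not give its exact formulation; a plausible proxy, shown here as an assumed sketch, is Pearson's correlation over the center coordinates (a production version would also need to handle near-vertical dashes, where the $y$-variance dominates):</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation of paired coordinates; |r| near 1 => collinear."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Center points of four small dash marks (illustrative values)
centers_x = [0.0, 1.0, 2.0, 3.0]
centers_y = [0.0, 1.02, 1.98, 3.01]
print(abs(pearson(centers_x, centers_y)) > 0.99)  # True -> one dashed bond
```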
<h3 id="models">Models</h3>
<p>The system relies on off-the-shelf OCR tools for character recognition; no custom ML models were trained.</p>
<ul>
<li><strong>OCR Engines</strong>: GOCR, OCRAD, Tesseract.</li>
<li><strong>Visualization</strong>: JSME (JavaScript Molecule Editor) used to render output strings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (OCSR)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Exact Match (100%)</td>
          <td><strong>90.0%</strong></td>
          <td>82.2%</td>
          <td>Percentage of 200 images perfectly recognized.</td>
      </tr>
      <tr>
          <td>&gt;95% Similarity</td>
          <td><strong>95 images</strong></td>
          <td>71 images</td>
          <td>Count of images with Tanimoto &gt; 0.95.</td>
      </tr>
      <tr>
          <td>&gt;85% Similarity</td>
          <td><strong>157 images</strong></td>
          <td>114 images</td>
          <td>Count of images with Tanimoto &gt; 0.85.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; runs on standard CPU architecture (implied by use of standard libraries like OpenCV and Potrace).</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{hongResearchChemicalExpression2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Research on {{Chemical Expression Images Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2015 {{Joint International Mechanical}}, {{Electronic}} and {{Information Technology Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hong, Chen and Du, Xiaoping and Zhang, Lu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Atlantis Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Chongqing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2991/jimet-15.2015.50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-94-6252-129-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Probabilistic OCSR with Markov Logic Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/mlocsr/</guid><description>A probabilistic approach using Markov Logic Networks to recognize chemical structures from images, improving robustness over rule-based systems.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Frasconi, P., Gabbrielli, F., Lippi, M., &amp; Marinai, S. (2014). Markov Logic Networks for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 54(8), 2380-2390. <a href="https://doi.org/10.1021/ci5002197">https://doi.org/10.1021/ci5002197</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2014</p>
<h2 id="contribution-probabilistic-method-for-ocsr">Contribution: Probabilistic Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a novel algorithmic architecture (<strong>MLOCSR</strong>) that integrates low-level pattern recognition with a high-level probabilistic reasoning engine based on Markov Logic Networks (MLNs). While it contributes to resources by creating a clustered dataset for evaluation, the primary focus is on demonstrating that probabilistic inference offers a superior methodology to the deterministic, rule-based heuristics employed by previous state-of-the-art systems like OSRA and CLiDE.</p>
<h2 id="motivation-overcoming-brittle-rule-based-systems">Motivation: Overcoming Brittle Rule-Based Systems</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for converting the vast archive of chemical literature (bitmap images in patents and papers) into machine-readable formats.</p>
<ul>
<li><strong>Limitation of Prior Work</strong>: Existing systems (OSRA, CLiDE, ChemReader) rely on &ldquo;empirical hard-coded geometrical rules&rdquo; to assemble atoms and bonds. These heuristics are brittle, requiring manual tuning of parameters for different image resolutions and failing when images are degraded or noisy.</li>
<li><strong>Gap</strong>: Chemical knowledge is typically used only in post-processing (e.g., to fix valency errors).</li>
<li><strong>Goal</strong>: To create a resolution-independent system that uses probabilistic reasoning to handle noise and ambiguity in graphical primitives.</li>
</ul>
<h2 id="core-innovation-markov-logic-networks-for-diagram-interpretation">Core Innovation: Markov Logic Networks for Diagram Interpretation</h2>
<p>The core novelty is the application of <strong>Markov Logic Networks (MLNs)</strong> to the problem of diagram interpretation.</p>
<ul>
<li><strong>Probabilistic Reasoning</strong>: The system treats extracted visual elements (lines, text boxes) as &ldquo;evidence&rdquo; and uses weighted first-order logic formulas to infer the most likely molecular graph (Maximum A Posteriori inference). The probability of a state $x$ is defined by the MLN log-linear model:
$$ P(X=x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) $$
where $w_i$ is the weight of the $i$-th formula and $n_i(x)$ is the number of true groundings in $x$.</li>
<li><strong>Unified Knowledge Representation</strong>: Geometric constraints (e.g., collinearity) and chemical rules (e.g., valency) are encoded in the same logic framework.</li>
<li><strong>Resolution Independence</strong>: The low-level extraction module dynamically estimates character size ($T$) and stroke width ($S$) to normalize parameters, removing the dependence on image DPI metadata.</li>
</ul>
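<p>As a concrete illustration of the log-linear model above, the MLN probability can be evaluated directly when the candidate states are small enough to enumerate (real MLN inference avoids computing $Z$ explicitly; the weights and grounding counts below are hypothetical toy values):</p>

```python
import math

def mln_score(weights, counts):
    """Log of the unnormalized MLN probability: sum_i w_i * n_i(x)."""
    return sum(w * n for w, n in zip(weights, counts))

def mln_probabilities(weights, count_vectors):
    """Normalize over an enumerable set of candidate states x."""
    scores = [mln_score(weights, c) for c in count_vectors]
    z = sum(math.exp(s) for s in scores)  # partition function Z
    return [math.exp(s) / z for s in scores]

# Two toy states; counts[i] = number of true groundings n_i(x) of formula i.
weights = [1.5, -0.8]        # hypothetical formula weights w_i
states = [[3, 1], [2, 0]]    # grounding counts per candidate state
probs = mln_probabilities(weights, states)
```

<p>MAP inference then amounts to picking the state with the highest unnormalized score, which is why $Z$ never needs to be computed in practice.</p>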
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated the system on recognition accuracy against the leading open-source baseline, <strong>OSRA (v1.4.0)</strong>.</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>USPTO Clustered</strong>: A non-redundant subset of 937 images derived from a larger set of 5,719 US Patent Office images.</li>
<li><strong>ChemInfty</strong>: 869 images from Japanese patents.</li>
<li><strong>Degraded Images</strong>: The USPTO set was synthetically degraded at three resampling levels (Low, Medium, High degradation) to test robustness.</li>
</ul>
</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Geometric</strong>: Precision, Recall, and $F_1$ scores for individual atoms and bonds.</li>
<li><strong>Chemical</strong>: Tanimoto similarity (using path fingerprints) and InChI string matching (basic and full stereochemistry).</li>
</ul>
</li>
</ul>
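<p>The Tanimoto metric used here reduces to a Jaccard index over fingerprint sets. A minimal sketch (the path strings below are hypothetical stand-ins for real hashed path fingerprints):</p>

```python
def tanimoto(fp_a, fp_b):
    """Jaccard/Tanimoto index of two fingerprint sets (e.g. path fingerprints)."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(a & b) / len(a | b)

# Hypothetical path fingerprints: linear atom/bond paths up to some length.
pred = {"C-C", "C-O", "C-C-O", "C=O"}
true = {"C-C", "C-O", "C-C-O"}
sim = tanimoto(pred, true)  # 3 shared paths / 4 total = 0.75
```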
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Superior Robustness</strong>: MLOCSR significantly outperformed OSRA on degraded images. On high-degradation images, MLOCSR achieved an atom $F_1$ of 80.3% compared to OSRA&rsquo;s 76.0%.</li>
<li><strong>Geometric Accuracy</strong>: In clean datasets (USPTO cluster), MLOCSR achieved higher $F_1$ scores for atoms (99.1% vs 97.5%) and bonds (98.8% vs 97.8%).</li>
<li><strong>Chemical Fidelity</strong>: The system achieved comparable Tanimoto similarity scores (0.948 vs 0.940 for OSRA).</li>
<li><strong>Limitation</strong>: OSRA slightly outperformed MLOCSR on &ldquo;Full <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>&rdquo; matching (81.4% vs 79.4%), indicating the probabilistic model still needs improvement in handling complex stereochemistry.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized public datasets, with specific preprocessing to ensure non-redundancy.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>USPTO Clustered</strong></td>
          <td>937 images</td>
          <td>Selected via spectral clustering from 5,719 raw images to remove near-duplicates.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>ChemInfty</strong></td>
          <td>869 images</td>
          <td>Ground-truthed dataset from Japanese patent applications (2008).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of two distinct phases: Low-Level Vectorization and High-Level Inference.</p>
<p><strong>1. Low-Level Extraction (Image Processing)</strong></p>
<ul>
<li><strong>Binarization</strong>: Global thresholding followed by morphological closing.</li>
<li><strong>Text/Stroke Estimation</strong>:
<ul>
<li>Finds text height ($T$) by locating &ldquo;N&rdquo; or &ldquo;H&rdquo; characters via OCR, or by averaging the heights of compatible connected components.</li>
<li>Estimates stroke width ($S$) by inspecting pixel density on potential segments identified by Hough transform.</li>
</ul>
</li>
<li><strong>Vectorization</strong>:
<ul>
<li><strong>Canny Edge Detection</strong> + <strong>Hough Transform</strong> to find lines.</li>
<li><strong>Douglas-Peucker algorithm</strong> for polygonal approximation of contours.</li>
<li><strong>Circle Detection</strong>: Finds aromatic rings by checking for circular arrangements of carbon candidates.</li>
</ul>
</li>
</ul>
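<p>The Douglas-Peucker step can be sketched as follows; the <code>epsilon</code> tolerance and the sample polyline are illustrative, not values from the paper:</p>

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / norm

def douglas_peucker(points, epsilon):
    """Recursively simplify a polyline: keep the farthest point from the
    chord if it exceeds epsilon, otherwise collapse to the endpoints."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        return [a, b]
    left = douglas_peucker(points[:idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right  # drop duplicated split point

pts = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
simplified = douglas_peucker(pts, 1.0)  # jitter removed, corner at (3, 5) kept
```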
<p><strong>2. High-Level Inference (Markov Logic)</strong></p>
<ul>
<li><strong>Evidence Generation</strong>: Visual primitives (lines, text boxes, circles) are converted into logical ground atoms (e.g., <code>LineBetweenCpoints(c1, c2)</code>).</li>
<li><strong>Inference Engine</strong>: Uses <strong>MaxWalkSAT</strong> for Maximum A Posteriori (MAP) inference to determine the most probable state of query predicates (e.g., <code>DoubleBond(c1, c2)</code>).</li>
<li><strong>Parameters</strong>: MaxWalkSAT run with 3 tries and 1,000,000 steps per try.</li>
</ul>
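<p>A simplified local search in the spirit of MaxWalkSAT, minimizing the total weight of unsatisfied clauses (the toy knowledge base is hypothetical, and Alchemy&rsquo;s implementation adds refinements such as tabu moves that are omitted here):</p>

```python
import random

def max_walksat(clauses, n_vars, max_tries=3, max_flips=1000, p=0.5, seed=0):
    """clauses: list of (weight, literals); literal v > 0 means var v is true,
    literal -v means var v is false. Returns (best assignment, unsat weight)."""
    rng = random.Random(seed)

    def satisfied(lits, assign):
        return any((lit > 0) == assign[abs(lit)] for lit in lits)

    def cost(assign):
        return sum(w for w, lits in clauses if not satisfied(lits, assign))

    best, best_cost = None, float("inf")
    for _ in range(max_tries):
        assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            c = cost(assign)
            if c < best_cost:
                best, best_cost = dict(assign), c
            if c == 0:
                return best, 0.0
            unsat = [lits for w, lits in clauses if not satisfied(lits, assign)]
            lits = rng.choice(unsat)
            if rng.random() < p:                 # random-walk move
                var = abs(rng.choice(lits))
            else:                                # greedy move: cheapest flip
                def delta(v):
                    assign[v] = not assign[v]
                    d = cost(assign)
                    assign[v] = not assign[v]
                    return d
                var = min((abs(l) for l in lits), key=delta)
            assign[var] = not assign[var]
    return best, best_cost

# Toy weighted KB: prefer x1 true (w=2), soft rule x1 -> x2, penalty on x2.
clauses = [(2.0, [1]), (1.0, [-1, 2]), (1.0, [-2])]
assignment, unsat_weight = max_walksat(clauses, n_vars=2)
```

<p>Because the second and third clauses conflict when x1 is true, the optimum leaves weight 1.0 unsatisfied, which is exactly the kind of trade-off MAP inference resolves over the 128 weighted formulas.</p>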
<h3 id="models">Models</h3>
<ul>
<li><strong>Markov Logic Network (MLN)</strong>:
<ul>
<li>Contains <strong>128 first-order logic formulas</strong>.</li>
<li><strong>Geometric Rules</strong>: Example: <code>VeryCloseCpoints(c1, c2) =&gt; SameCarbon(c1, c2)</code> (weighted rule to merge close nodes).</li>
<li><strong>Chemical Rules</strong>: Example: <code>IsHydroxyl(t) ^ Connected(c,t) =&gt; SingleBond(c,t)</code> (imposes valency constraints).</li>
</ul>
</li>
<li><strong>OCR Engine</strong>: Tesseract is used for character recognition on text connected components.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The authors introduced a bipartite graph matching method to evaluate geometric accuracy when superatoms (e.g., &ldquo;COOH&rdquo;) are not expanded.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Atom/Bond $F_1$</strong></td>
          <td>Calculated via minimum-weight bipartite matching between predicted graph and ground truth, weighted by Euclidean distance.</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>Standard unique identifier string. &ldquo;Basic&rdquo; ignores stereochemistry; &ldquo;Full&rdquo; includes it.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Jaccard index of path fingerprints between predicted and ground truth molecules.</td>
      </tr>
  </tbody>
</table>
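<p>The bipartite-matching evaluation can be sketched as follows. This is a brute-force matching over permutations (fine for small molecular graphs; the paper&rsquo;s minimum-weight matching would scale better), and <code>max_dist</code> is a hypothetical cutoff for counting a matched pair as a true positive:</p>

```python
import math
from itertools import permutations

def matched_f1(pred, truth, max_dist=10.0):
    """Minimum-weight bipartite matching between predicted and ground-truth
    atom coordinates; matched pairs within max_dist count as true positives."""
    if not pred or not truth:
        return 0.0, 0.0, 0.0
    small, large = (pred, truth) if len(pred) <= len(truth) else (truth, pred)
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    best = None
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(small[i], large[j]) for i, j in enumerate(perm)]
        w = sum(dist(a, b) for a, b in pairs)
        if best is None or w < best[0]:
            best = (w, pairs)
    tp = sum(1 for a, b in best[1] if dist(a, b) <= max_dist)
    prec = tp / len(pred)
    rec = tp / len(truth)
    f1 = 0.0 if tp == 0 else 2 * prec * rec / (prec + rec)
    return prec, rec, f1

# Three predicted atoms vs. two ground-truth atoms (illustrative coordinates).
prec, rec, f1 = matched_f1([(0, 0), (10, 10), (50, 50)], [(1, 1), (11, 9)])
```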
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software</strong>: Logic inference performed using the <strong>Alchemy</strong> software package (University of Washington).</li>
<li><strong>Web Server</strong>: The system was made available at <code>http://mlocsr.dinfo.unifi.it</code> (Note: URL likely inactive).</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{frasconiMarkovLogicNetworks2014,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Markov {{Logic Networks}} for {{Optical Chemical Structure Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Frasconi, Paolo and Gabbrielli, Francesco and Lippi, Marco and Marinai, Simone}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2014</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{54}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2380--2390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1549-9596, 1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci5002197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-10-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at CLEF-IP 2012: Native TIFF Processing for Patents</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-clef-2012/</guid><description>Evaluation of OSRA on CLEF-IP 2012 patent data showing native TIFF processing outperforms external splitting tools and pairwise-distance segmentation.</description><content:encoded><![CDATA[<h2 id="contribution-evaluating-native-processing-in-osra">Contribution: Evaluating Native Processing in OSRA</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It focuses on evaluating the algorithmic performance of the Optical Structure Recognition Application (OSRA) and justifies specific implementation details (such as pairwise distance clustering) through comparative analysis. The paper systematically compares preprocessing workflows (native vs. <code>tiffsplit</code>) to demonstrate how implementation choices impact precision, recall, and F1 scores.</p>
<h2 id="motivation-advancing-chemical-structure-recognition">Motivation: Advancing Chemical Structure Recognition</h2>
<p>The primary motivation is to solve the <strong>Chemical Structure Recognition</strong> task within the context of the CLEF-IP 2012 challenge. The goal is to accurately convert images of chemical structures found in patent documents into established computerized molecular formats (connection tables).</p>
<p>A secondary technical motivation is to address issues in page segmentation where standard bounding box approaches fail to separate overlapping or nested molecular structures.</p>
<h2 id="core-innovation-pairwise-distance-segmentation">Core Innovation: Pairwise Distance Segmentation</h2>
<p>The core novelty lies in the algorithmic approach to object detection and page segmentation:</p>
<ol>
<li>
<p><strong>Rejection of Bounding Boxes</strong>: Unlike standard OCR approaches, OSRA does not use a bounding box paradigm internally. Instead, it relies on the <strong>minimum pairwise distance</strong> between points of different connected components. This allows the system to correctly handle cases where a larger molecule &ldquo;surrounds&rdquo; a smaller one, which bounding boxes would incorrectly merge.</p>
</li>
<li>
<p><strong>Native TIFF Processing</strong>: The authors identify that external tools (specifically <code>tiffsplit</code>) introduce artifacts during multi-page TIFF conversion. They implement native splitting facilities within OSRA, which substantially improves precision (from 0.433 to 0.708 at tolerance 0).</p>
</li>
</ol>
<h2 id="experimental-setup-segmentation-and-recognition-tracks">Experimental Setup: Segmentation and Recognition Tracks</h2>
<p>The authors performed two specific tracks for the CLEF-IP 2012 challenge:</p>
<ol>
<li>
<p><strong>Page Segmentation</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 5421 ground truth structures.</li>
<li><strong>Comparison</strong>: Run 1 used <code>tiffsplit</code> (external tool) to separate pages; Run 2 used OSRA&rsquo;s native internal page splitting.</li>
<li><strong>Metrics</strong>: Precision, Recall, and F1 scores calculated at varying pixel tolerances (0, 10, 20, 40, 55 pixels).</li>
</ul>
</li>
<li>
<p><strong>Structure Recognition</strong>:</p>
<ul>
<li><strong>Dataset</strong>: A test set split into an &ldquo;Automatic&rdquo; evaluation set (865 structures checkable via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> keys) and a &ldquo;Manual&rdquo; evaluation set (95 structures requiring human review due to Markush labels).</li>
<li><strong>Metric</strong>: Recognition rate (Recalled %).</li>
</ul>
</li>
</ol>
<h2 id="results-and-conclusions-native-processing-gains">Results and Conclusions: Native Processing Gains</h2>
<ul>
<li><strong>Native vs. External Splitting</strong>: The native OSRA page splitting outperformed the external <code>tiffsplit</code> tool by a wide margin. At tolerance 0, native processing achieved <strong>0.708 Precision</strong> compared to <strong>0.433</strong> for <code>tiffsplit</code>. The authors attribute this gap to artifacts introduced during <code>tiffsplit</code>&rsquo;s internal TIFF format conversion. The native run also returned far fewer records (5,254 vs. 8,800 for <code>tiffsplit</code>), indicating fewer false detections.</li>
<li><strong>Recognition Rate</strong>: Across 960 total structures, the system achieved an <strong>83% recognition rate</strong> (88% on the automatic set, 40% on the manual Markush set).</li>
<li><strong>Context</strong>: The results were consistent with OSRA&rsquo;s second-place finish (out of 6 participants) at TREC-CHEM 2011.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The experiments used the CLEF-IP 2012 benchmark datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Set</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Segmentation</strong></td>
          <td>Ground Truth</td>
          <td>5,421 structures</td>
          <td>Used to evaluate bounding box/coordinate accuracy.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Automatic</td>
          <td>865 structures</td>
          <td>Evaluated via InChI key matching.</td>
      </tr>
      <tr>
          <td><strong>Recognition</strong></td>
          <td>Manual</td>
          <td>95 structures</td>
          <td>Evaluated manually due to Markush-style labels.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Component Clustering (Pairwise Distance)</strong></p>
<p>The segmentation algorithm avoids bounding boxes.</p>
<ul>
<li><strong>Logic</strong>: Calculate the minimum pairwise distance between points of distinct graphical components.</li>
<li><strong>Criterion</strong>: If distance $d &lt; \text{threshold}$, components are grouped.</li>
<li><strong>Advantage</strong>: Enables separation of complex geometries where a bounding box $B_1$ might fully encompass $B_2$ (e.g., a large ring surrounding a salt ion), whereas the actual pixels are disjoint.</li>
</ul>
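<p>The grouping criterion above can be sketched as a union-find pass over minimum pixel distances. This is a brute-force illustration with made-up pixel sets (a real implementation would use a spatial index rather than all-pairs distances):</p>

```python
import math

def min_pairwise_distance(comp_a, comp_b):
    """Minimum distance between any two pixels of two connected components."""
    return min(math.hypot(ax - bx, ay - by)
               for ax, ay in comp_a for bx, by in comp_b)

def cluster_components(components, threshold):
    """Merge components whose minimum pairwise pixel distance is below
    the threshold (union-find on component indices)."""
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            if min_pairwise_distance(components[i], components[j]) < threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# A ring of pixels "surrounding" a distant inner component: bounding boxes
# would merge them, but the pixel-wise distance keeps them separate.
ring = [(0, 0), (0, 20), (20, 0), (20, 20)]
inner = [(10, 10)]
nearby = [(21, 20)]
clusters = cluster_components([ring, inner, nearby], threshold=5.0)
```

<p>Here the ring merges with the genuinely adjacent component but not with the enclosed one, which is exactly the case where bounding boxes fail.</p>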
<p><strong>2. Image Pre-processing</strong></p>
<ul>
<li><strong>Workflow A (Run 1)</strong>: Multi-page TIFF → <code>tiffsplit</code> binary → Single TIFFs → OSRA.</li>
<li><strong>Workflow B (Run 2)</strong>: Multi-page TIFF → OSRA Internal Split → Recognition.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Page Segmentation Results (tiffsplit, Run 1)</strong></p>
<p>Using <code>tiffsplit</code> for page splitting returned 8,800 records against 5,421 ground truth structures.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.433</td>
          <td>0.703</td>
          <td>0.536</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.490</td>
          <td>0.795</td>
          <td>0.606</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.507</td>
          <td>0.823</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.536</td>
          <td>0.870</td>
          <td>0.663</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.549</td>
          <td>0.891</td>
          <td>0.679</td>
      </tr>
  </tbody>
</table>
<p><strong>Page Segmentation Results (Native Split, Run 2)</strong></p>
<p>Using OSRA&rsquo;s native TIFF reading returned 5,254 records, with much higher precision.</p>
<table>
  <thead>
      <tr>
          <th>Tolerance (px)</th>
          <th>Precision</th>
          <th>Recall</th>
          <th>F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>0.708</td>
          <td>0.686</td>
          <td>0.697</td>
      </tr>
      <tr>
          <td>10</td>
          <td>0.793</td>
          <td>0.769</td>
          <td>0.781</td>
      </tr>
      <tr>
          <td>20</td>
          <td>0.821</td>
          <td>0.795</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>40</td>
          <td>0.867</td>
          <td>0.840</td>
          <td>0.853</td>
      </tr>
      <tr>
          <td>55</td>
          <td>0.887</td>
          <td>0.860</td>
          <td>0.873</td>
      </tr>
  </tbody>
</table>
<p><strong>Structure Recognition Results</strong></p>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Count</th>
          <th>Recalled</th>
          <th>Percentage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Automatic</td>
          <td>865</td>
          <td>761</td>
          <td>88%</td>
      </tr>
      <tr>
          <td>Manual</td>
          <td>95</td>
          <td>38</td>
          <td>40%</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>960</strong></td>
          <td><strong>799</strong></td>
          <td><strong>83%</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://cactus.nci.nih.gov/osra">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Official project page at NCI/NIH</td>
      </tr>
  </tbody>
</table>
<p>OSRA is described as an open source utility. The CLEF-IP 2012 benchmark datasets were provided as part of the shared task. No hardware or compute requirements are specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2012). Optical Structure Recognition Application entry to CLEF-IP 2012. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>.</p>
<p><strong>Publication</strong>: CLEF 2012</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="http://cactus.nci.nih.gov/osra">Project Home Page</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{filippovOpticalStructureRecognition2012,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry to {{CLEF-IP}} 2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V and Katsubo, Dmitry and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-FilippovEt2012.pdf}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present our entry to CLEF 2012 Chemical Structure Recognition task. Our submission includes runs for both bounding box extraction and molecule structure recognition tasks using Optical Structure Recognition Application. OSRA is an open source utility to convert images of chemical structures to connection tables into established computerized molecular formats. It has been under constant development since 2007.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolRec at CLEF 2012: Rule-Based Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec-clef-2012/</guid><description>Overview and failure analysis of the MolRec rule-based chemical structure recognition system evaluated on the CLEF 2012 chemical structure recognition task.</description><content:encoded><![CDATA[<h2 id="contribution-to-chemical-structure-recognition">Contribution to Chemical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper. It describes the architecture of an engineered artifact (the &ldquo;MolRec&rdquo; system) and evaluates its efficacy on a specific task (Chemical Structure Recognition) using a standardized benchmark. It focuses on the mechanisms of vectorization and rule-based rewriting.</p>
<h2 id="motivation-and-clef-2012-context">Motivation and CLEF 2012 Context</h2>
<p>The work was motivated by the <strong>CLEF 2012 chemical structure recognition task</strong>. The goal was to automatically interpret chemical diagram images clipped from patent documents. This is challenging because real-world patent images contain complex structures, such as bridge bonds and elements not supported by standard conversion tools like OpenBabel.</p>
<h2 id="novelty-in-rule-based-vectorization">Novelty in Rule-Based Vectorization</h2>
<p>The primary contribution is an <strong>improved rule-based rewrite engine</strong> compared to the authors&rsquo; previous TREC 2011 submission, featuring a fully overhauled implementation that improves both recognition performance and computational efficiency. The system uses a two-stage approach:</p>
<ol>
<li><strong>Vectorization</strong>: Extracts geometric primitives (lines, circles, arrows) and characters.</li>
<li><strong>Rule Engine</strong>: Applies 18 specific geometric rewriting rules to transform primitives into a chemical graph, which can then be exported to MOL or SMILES format.</li>
</ol>
<p>Notably, the system explicitly handles &ldquo;bridge bonds&rdquo; (3D perspective structures) by applying specific recognition rules before general bond detection.</p>
<h2 id="experimental-setup-on-the-clef-2012-corpus">Experimental Setup on the CLEF 2012 Corpus</h2>
<p>The system was evaluated on the <strong>CLEF 2012 corpus</strong> of 961 test images, split into two distinct sets to test different capabilities:</p>
<ul>
<li><strong>Automatic Set</strong>: 865 images evaluated automatically using OpenBabel to compare generated MOL files against ground truth.</li>
<li><strong>Manual Set</strong>: 95 &ldquo;challenging&rdquo; images containing elements beyond OpenBabel&rsquo;s scope (e.g., Markush structures), evaluated via manual visual inspection.</li>
</ul>
<p>The authors performed <strong>four runs</strong> with slightly different internal parameters to test system stability.</p>
<h2 id="performance-outcomes-and-failure-analysis">Performance Outcomes and Failure Analysis</h2>
<p><strong>Performance:</strong></p>
<ul>
<li><strong>Automatic Set</strong>: High performance, achieving accuracy between <strong>94.91% and 96.18%</strong>.</li>
<li><strong>Manual Set</strong>: Lower performance, with accuracy between <strong>46.32% and 58.95%</strong>, reflecting the difficulty of complex patent diagrams containing Markush structures and other elements beyond OpenBabel&rsquo;s scope.</li>
</ul>
<p><strong>Failure Analysis:</strong></p>
<p>The authors conducted a detailed error analysis on 52 distinct mis-recognized diagrams from the manual set and 46 from the automatic set. Key failure modes include:</p>
<ul>
<li><strong>Character Grouping</strong>: The largest error source in the manual set (26 images). A bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, and closely-spaced atom groups were incorrectly merged.</li>
<li><strong>Touching Characters</strong>: 8 images in the manual set and 1 in the automatic set. The system lacks segmentation for characters that touch, causing OCR failure.</li>
<li><strong>Four-way Junctions</strong>: 6 manual and 7 automatic images. Vectorization failed to correctly identify junctions where four lines meet.</li>
<li><strong>Missed Wedge Bonds</strong>: 6 images each for missed solid wedge and dashed wedge bonds in the automatic set.</li>
<li><strong>OCR Errors</strong>: 5 manual and 11 automatic images, including misrecognition of &ldquo;G&rdquo; as &ldquo;O&rdquo; and &ldquo;I&rdquo; interpreted as a vertical single bond.</li>
<li><strong>Charge Signs</strong>: MolRec correctly recognized positive charge signs but missed three negative charge signs, including one placed at the top left of an atom name.</li>
<li><strong>Dataset Errors</strong>: The authors identified 11 images where the ground truth MOL files were incorrect, but MolRec&rsquo;s recognition was actually correct.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was provided by CLEF 2012 organizers and consists of images clipped from patent documents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation (Auto)</td>
          <td>CLEF 2012 Set 1</td>
          <td>865 images</td>
          <td>Evaluated via OpenBabel</td>
      </tr>
      <tr>
          <td>Evaluation (Manual)</td>
          <td>CLEF 2012 Set 2</td>
          <td>95 images</td>
          <td>Complex/Markush structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>MolRec</strong> pipeline consists of two primary modules:</p>
<p><strong>1. Vectorization Module</strong></p>
<ul>
<li><strong>Binarization</strong>: Uses <strong>Otsu&rsquo;s method</strong>.</li>
<li><strong>OCR</strong>: Extracts connected components and classifies them using <strong>nearest neighbor classification</strong> with a Euclidean metric. Detected characters are removed from the image.</li>
<li><strong>Bond Separation</strong>:
<ul>
<li>Thins remaining components to single-pixel width.</li>
<li>Builds polyline representations.</li>
<li>Splits polylines at junctions (3+ lines meeting).</li>
<li><strong>Simplification</strong>: Applies the <strong>Douglas-Peucker algorithm</strong> with a threshold of 1-2 average line widths to remove scanning artifacts while preserving corners. The threshold is based on measured average line width, allowing adaptation to different line styles.</li>
<li>Also detects circles, arrow heads, and solid triangles (annotated with direction).</li>
</ul>
</li>
</ul>
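<p>The Otsu binarization step can be sketched directly from a grayscale histogram; the bimodal toy histogram below (ink near bin 20, paper near bin 200) is illustrative:</p>

```python
def otsu_threshold(histogram):
    """Otsu's method: pick the threshold maximizing between-class variance
    over a 256-bin grayscale histogram."""
    total = sum(histogram)
    total_sum = sum(i * h for i, h in enumerate(histogram))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(256):
        w_bg += histogram[t]          # background (dark) class weight
        if w_bg == 0:
            continue
        w_fg = total - w_bg           # foreground (light) class weight
        if w_fg == 0:
            break
        sum_bg += t * histogram[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (total_sum - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

hist = [0] * 256
for b in (18, 20, 22):
    hist[b] = 100     # dark ink pixels
for b in (198, 200, 202):
    hist[b] = 300     # light paper pixels
t = otsu_threshold(hist)  # lands between the two modes
```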
<p><strong>2. Rule Engine</strong></p>
<ul>
<li><strong>Input</strong>: Geometric primitives (segments, circles, triangles, arrows, character groups).</li>
<li><strong>Structure</strong>: 18 rewrite rules.</li>
<li><strong>Priority</strong>: Two rules for <strong>Bridge Bonds</strong> (Open/Closed) are applied <em>first</em>.</li>
<li><strong>Standard Rules</strong>: 16 rules applied in arbitrary order for standard bonds (Single, Double, Triple, Wedge, Dative, etc.).</li>
<li><strong>Implicit Nodes</strong>: Some rules handle cases where carbon atoms are implicit at bond junctions. These rules detect double or triple bonds while producing new geometric objects by splitting bonds at implicit nodes for further processing.</li>
<li><strong>Example Rule (Wavy Bond)</strong>:
<ul>
<li><em>Condition 1</em>: A set of line segments $L$ containing at least three segments ($|L| \ge 3$).</li>
<li><em>Condition 2</em>: Segment lengths match &ldquo;dash length&rdquo; parameter.</li>
<li><em>Condition 3</em>: All elements are connected.</li>
<li><em>Condition 4</em>: Center points are approximately collinear.</li>
<li><em>Condition 5</em>: Endpoints form a single sequence (end elements have 1 neighbor, internal have 2).</li>
<li><em>Condition 6</em>: Two unconnected endpoints must be the pair of endpoints that are furthest apart.</li>
<li><em>Consequence</em>: Replace $L$ with a Wavy Bond between the furthest two endpoints. The bond has unknown direction.</li>
</ul>
</li>
</ul>
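<p>Two of the wavy-bond conditions lend themselves to a short geometric sketch: the collinearity test on segment centers (Condition 4) and the furthest-endpoint pair that spans the bond (Condition 6). The tolerance and dash coordinates below are illustrative, not parameters from the paper:</p>

```python
import math

def centers_collinear(segments, tol=2.0):
    """Condition 4: segment center points lie approximately on the line
    through the first and last centers."""
    centers = [((x1 + x2) / 2, (y1 + y2) / 2)
               for (x1, y1), (x2, y2) in segments]
    (ax, ay), (bx, by) = centers[0], centers[-1]
    norm = math.hypot(bx - ax, by - ay)
    if norm == 0:
        return True
    for px, py in centers[1:-1]:
        d = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax) / norm
        if d > tol:
            return False
    return True

def furthest_endpoints(segments):
    """Condition 6: the wavy bond spans the pair of endpoints furthest apart."""
    pts = [p for seg in segments for p in seg]
    return max(((a, b) for a in pts for b in pts),
               key=lambda ab: math.hypot(ab[0][0] - ab[1][0],
                                         ab[0][1] - ab[1][1]))

# Three short dashes whose centers are roughly collinear.
dashes = [((0, 0), (2, 1)), ((3, -1), (5, 0)), ((6, 1), (8, 0))]
ok = centers_collinear(dashes)
a, b = furthest_endpoints(dashes)  # the bond's two end atoms
```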
<h3 id="models">Models</h3>
<p>MolRec is a <strong>rule-based system</strong> and does not use trained deep learning models or weights.</p>
<ul>
<li><strong>Superatoms</strong>: Uses a dictionary look-up to resolve character groups representing superatoms into subgraphs.</li>
<li><strong>Disambiguation</strong>: Context-based logic is applied <em>after</em> graph construction to resolve ambiguities (e.g., distinguishing vertical bond <code>|</code> from letter <code>I</code> or digit <code>1</code>).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Set</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Run 3</th>
          <th>Run 4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Auto (865 images)</td>
          <td>96.18% (832/865)</td>
          <td>94.91% (821/865)</td>
          <td>94.91% (821/865)</td>
          <td>96.18% (832/865)</td>
      </tr>
      <tr>
          <td>Manual (95 images)</td>
          <td>46.32% (44/95)</td>
          <td>58.95% (56/95)</td>
          <td>46.32% (44/95)</td>
          <td>56.84% (54/95)</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Parameters</strong>:</p>
<ul>
<li><strong>Dash Length</strong>: Range of acceptable values for dashed lines.</li>
<li><strong>Simplification Threshold</strong>: 1-2x average line width for Douglas-Peucker.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">CLEF 2012 Workshop Paper</a></td>
          <td>Other</td>
          <td>Open Access</td>
          <td>CEUR Workshop Proceedings</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-classification-closed">Reproducibility Classification: Closed</h3>
<p>No source code for the MolRec system has been publicly released. The CLEF 2012 evaluation dataset was distributed to task participants and is not openly available. The rule-based algorithm is described in sufficient detail to re-implement, but exact parameter values and the character classification training set are not fully specified. No hardware or compute requirements are reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012 &ndash; Overview and Analysis of Results. <em>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</em>. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawi2012molrec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolRec at CLEF 2012--Overview and Analysis of Results}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sadawi, Noureddin M and Sexton, Alan P and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{CLEF 2012 Evaluation Labs and Workshop, Online Working Notes}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader Image-to-Structure OCR at TREC 2011 Chemical IR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemreader-trec-2011/</guid><description>ChemReader OCR software evaluation on TREC 2011 Chemical IR campaign achieving 93% accuracy on image-to-structure task.</description><content:encoded><![CDATA[<h2 id="methodological-application-applying-chemreader-to-chemical-ocr">Methodological Application: Applying ChemReader to Chemical OCR</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$).</p>
<p>The dominant vector is $\Psi_{\text{Method}}$ because the paper&rsquo;s core contribution is the empirical evaluation and refinement of ChemReader on the Image-to-Structure (I2S) task. The rhetorical indicators align with this classification: the paper reports quantitative performance metrics, conducts a detailed <strong>error analysis</strong>, and focuses on <strong>how well the system works</strong> and where its underlying algorithms need refinement.</p>
<h2 id="motivation-bridging-the-gap-in-image-to-structure-tasks">Motivation: Bridging the Gap in Image-to-Structure Tasks</h2>
<p>The motivation is two-fold:</p>
<ol>
<li>
<p><strong>Scientific Need</strong>: Traditional text-based chemical mining methods cannot utilize image data in scientific literature. Chemical OCR software is required to extract 2D chemical structure diagrams from raster images and convert them into a machine-readable chemical file format, paving the way for advanced chemical literature mining.</p>
</li>
<li>
<p><strong>Benchmark Participation</strong>: The immediate motivation was participation in the <strong>TREC Chemical IR campaign&rsquo;s Image-to-Structure (I2S) task</strong>, which was designed to evaluate existing chemical OCR software and establish a platform for developing chemical information retrieval techniques utilizing image data.</p>
</li>
</ol>
<h2 id="novelty-benchmark-evaluation-and-error-analysis-of-chemreader">Novelty: Benchmark Evaluation and Error Analysis of ChemReader</h2>
<p>ChemReader was previously introduced in earlier publications and is a chemical OCR system tailored to a chemical database annotation scheme. The novelty of this paper lies in <strong>evaluating ChemReader within the formal I2S benchmark setting</strong> and conducting a detailed <strong>error analysis</strong> of its performance. After fixing a stereo bond omission and a corner detection bug discovered during the evaluation, ChemReader achieved 93% accuracy (930/1000) on the benchmark test set.</p>
<h2 id="experimental-setup-the-trec-2011-i2s-challenge">Experimental Setup: The TREC 2011 I2S Challenge</h2>
<p>The experiment was the application of the ChemReader software to the <strong>Image-to-Structure (I2S) task</strong> of the TREC Chemical IR campaign.</p>
<ul>
<li><strong>Setup</strong>: The software was used to process image data provided for the I2S task.</li>
<li><strong>Evaluation</strong>: The system was initially evaluated, revealing two issues: the omission of <strong>bond stereo types</strong> in the output structures and a bug in the <strong>corner detection</strong> code that failed on lines touching the image boundary. Each issue lowered accuracy by approximately 10%.</li>
<li><strong>Analysis</strong>: After fixing these issues, ChemReader was re-evaluated on the full 1000-image test set (<strong>Test III</strong>). A detailed error analysis was then conducted on 20 randomly selected samples from Test III results.</li>
</ul>
<h2 id="training-progress">Training Progress</h2>
<p>The paper reports three rounds of major training, each improving accuracy by approximately 15% (Figure 1 in the paper shows the progression):</p>
<ul>
<li><strong>Initial (untrained)</strong>: 57% accuracy on 100 selected training images.</li>
<li><strong>Key changes</strong>: deactivating unnecessary heuristic algorithms (resizing, de-noising, line merging), limiting the character set, updating the chemical dictionary to a lightweight version, and fixing precision loss from type conversions.</li>
</ul>
<h2 id="outcomes-high-accuracy-hindered-by-complex-connectivity-rules">Outcomes: High Accuracy Hindered by Complex Connectivity Rules</h2>
<ul>
<li>
<p><strong>Submitted Results</strong>: Test I achieved 691/1000 correct outputs (avg. Tanimoto similarity 0.9769), and Test II achieved 689/1000 (avg. Tanimoto similarity 0.9823). Both scored lower than training accuracy due to the stereo bond omission and corner detection bug.</p>
</li>
<li>
<p><strong>Key Finding</strong>: After fixing these two issues, ChemReader achieved <strong>93% accuracy</strong> (930/1000) on the I2S task (Test III), comparable to the highest accuracy among participants.</p>
</li>
<li>
<p><strong>Limitation/Future Direction</strong>: A detailed <strong>error analysis</strong> on 20 randomly selected samples from Test III (Table 2) showed that the software requires the incorporation of <strong>more chemical intelligence in its algorithms</strong> to address remaining systematic errors. The most frequent errors were:</p>
<ul>
<li>Wrongly merged nodes: 6 samples (30%), caused by nodes too close to be distinguished by a distance threshold</li>
<li>Missed bonds: 4 samples (20%), caused by filtering out short line segments</li>
<li>Nonstandard representations: noise symbols confusing the system, nonstandard wedge/hatched bond styles, and 3D crossing bonds that ChemReader cannot interpret</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Training</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Training Set</td>
          <td style="text-align: left">1000 images (100 used for quick eval)</td>
          <td style="text-align: left">TIF format, one chemical structure per image</td>
      </tr>
      <tr>
          <td style="text-align: left">Evaluation</td>
          <td style="text-align: left">TREC 2011 Chemical IR I2S Test Set</td>
          <td style="text-align: left">1000 images (20 sampled for error analysis)</td>
          <td style="text-align: left">Same format constraints; 930/1000 correct in Test III</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>ChemReader is a <strong>chemical Optical Character Recognition (OCR) system</strong> with a 17-step pipeline:</p>
<ol>
<li><strong>Pixel clustering</strong>: Region-growing to identify the chemical structure region</li>
<li><strong>Preprocessing</strong>: Resizing, de-noising, and bond length estimation (deactivated for I2S task)</li>
<li><strong>Text identification</strong>: Connected components with similar heights/areas labeled as characters</li>
<li><strong>Benzene ring detection</strong>: Identifying circles representing aromatic bonds</li>
<li><strong>Hatched bond detection</strong>: Finding short collinear line segments of uniform length</li>
<li><strong>Skeletonization</strong>: Thinning bond pixels for downstream processing</li>
<li><strong>Ring structure detection</strong>: Pentagonal/hexagonal rings via Generalized Hough Transformation (GHT)</li>
<li><strong>Line detection</strong>: Modified Hough Transformation with corner detection for bond extraction</li>
<li><strong>Line filtering</strong>: Removing spurious short segments</li>
<li><strong>Secondary text identification</strong>: Re-examining unidentified fragments for text</li>
<li><strong>Character recognition</strong>: Dual-engine approach (GOCR template matching + Euclidean distance-based engine)</li>
<li><strong>Chemical spell checker</strong>: Matching against a dictionary of 770 chemical abbreviations</li>
<li><strong>Secondary line detection</strong>: Re-running line detection on remaining pixels</li>
<li><strong>Line merging/breaking</strong>: Combining fragmented bonds or splitting at junction nodes</li>
<li><strong>Graph construction</strong>: Creating nodes from bond endpoints and chemical symbol centers, merging nearby nodes</li>
<li><strong>Connected component selection</strong>: Selecting the largest graph component</li>
<li><strong>Output</strong>: Connection table in machine-readable format</li>
</ol>
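<p>Step 15&rsquo;s proximity-based node merging can be sketched as follows (illustrative only; ChemReader&rsquo;s actual threshold and clustering are not published). Note that this is also the failure mode behind the &ldquo;wrongly merged nodes&rdquo; errors: nodes closer than the threshold collapse into one.</p>

```python
from math import dist

def merge_nodes(points, threshold):
    """Greedily merge 2-D endpoints closer than `threshold` into single
    graph nodes, each represented by its cluster centroid (sketch)."""
    clusters = []  # each cluster is a list of member points
    for p in points:
        for c in clusters:
            if any(dist(p, q) < threshold for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return [tuple(sum(coord) / len(c) for coord in zip(*c))
            for c in clusters]
```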
<h3 id="models">Models</h3>
<p>ChemReader is a rule-based system relying on traditional computer vision (Hough Transformation, region growing, skeletonization) and template-based character recognition. It does not use machine learning model architectures such as CNNs or neural networks.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Test</th>
          <th style="text-align: left">Correct Outputs</th>
          <th style="text-align: left">Avg. Tanimoto Similarity</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Test I (submitted)</td>
          <td style="text-align: left">691/1000</td>
          <td style="text-align: left">0.9769</td>
          <td style="text-align: left">Original submission</td>
      </tr>
      <tr>
          <td style="text-align: left">Test II (submitted)</td>
          <td style="text-align: left">689/1000</td>
          <td style="text-align: left">0.9823</td>
          <td style="text-align: left">Alternative parameter setting</td>
      </tr>
      <tr>
          <td style="text-align: left">Test III (post-fix)</td>
          <td style="text-align: left">930/1000 (93%)</td>
          <td style="text-align: left">0.9913</td>
          <td style="text-align: left">After fixing stereo bond omission and corner detection bug</td>
      </tr>
  </tbody>
</table>
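<p>The Tanimoto similarity reported above is the Jaccard coefficient over structure fingerprint bit sets, which is why the average similarity can sit near 0.99 while exact-match accuracy is lower: near-misses still score just under 1.0. A minimal sketch:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |a & b| / |a | b|."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```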
<p><strong>Error Breakdown</strong> (from 20-sample analysis of Test III):</p>
<ul>
<li>Wrongly merged nodes: 6 (30%)</li>
<li>Missed bonds: 4 (20%)</li>
<li>Nonstandard representations (noise symbols, nonstandard wedge/hatched bonds, 3D crossing bonds): remaining errors</li>
</ul>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>ChemReader&rsquo;s source code is not publicly available. The TREC 2011 Chemical IR I2S image sets were distributed to task participants but are not openly hosted. No pre-trained models apply (rule-based system). The paper provides a detailed algorithmic description (17-step pipeline) and parameter values, but full reproduction requires access to both the ChemReader codebase and the TREC image sets.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Li, Y., Rosania, G. R., &amp; Saitou, K. (2011). Image-to-Structure Task by ChemReader. <em>TREC 2011 Chemical IR Track Report</em>.</p>
<p><strong>Publication</strong>: TREC 2011 Chemical IR Track</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://trec.nist.gov/pubs/trec20/papers/CHEM.OVERVIEW.pdf">TREC 2011 Chemical IR Track Overview</a></li>
<li><a href="/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/">ChemReader 2009 original paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{parkImagetoStructureTaskChemReader2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image-to-Structure Task by {ChemReader}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Park, Jungkap and Li, Ye and Rosania, Gus R. and Saitou, Kazuhiro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{University of Michigan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span> = <span style="color:#e6db74">{TREC 2011 Chemical IR Track Report}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Reconstruction with chemoCR (2011)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/chemocr-trec-2011/</guid><description>A hybrid system combining pattern recognition and rule-based expert systems to reconstruct chemical structures from bitmap images.</description><content:encoded><![CDATA[<h2 id="contribution-the-chemocr-architecture">Contribution: The chemoCR Architecture</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper focuses entirely on the architecture and workflow of the <strong>chemoCR</strong> system. It proposes specific algorithmic innovations (texture-based vectorization, graph constraint exploration) and defines a comprehensive pipeline for converting raster images into semantic chemical graphs. The primary contribution is the system design and its operational efficacy.</p>
<h2 id="motivation-digitizing-image-locked-chemical-structures">Motivation: Digitizing Image-Locked Chemical Structures</h2>
<p>Chemical structures are the preferred language of chemistry, yet they are frequently locked in non-machine-readable formats (bitmap images like GIF, BMP) within patents and journals.</p>
<ul>
<li><strong>The Problem:</strong> Once published as images, chemical structure information is &ldquo;dead&rdquo; to analysis software.</li>
<li><strong>The Gap:</strong> Manual reconstruction is slow and error-prone. Existing tools struggled with the diversity of drawing styles (e.g., varying line thickness, font types, and non-standard bond representations).</li>
<li><strong>The Goal:</strong> To automate the conversion of these depictions into connection tables (SDF/MOL files) to make the data accessible for computational chemistry applications.</li>
</ul>
<h2 id="core-innovation-rule-based-semantic-object-identification">Core Innovation: Rule-Based Semantic Object Identification</h2>
<p>The system is based on a &ldquo;Semantic Entity&rdquo; approach that identifies chemically significant objects (chiral bonds, superatoms, reaction arrows) from structural formula depictions. Key technical innovations include:</p>
<ol>
<li><strong>Texture-based Vectorization:</strong> A new algorithm that computes local directions to vectorize lines, robust against varying drawing styles.</li>
<li><strong>Expert System Integration:</strong> A graph constraint exploration algorithm that applies an XML-based rule set to classify geometric primitives into chemical classes such as <code>BOND</code>, <code>DOUBLEBOND</code>, <code>TRIPLEBOND</code>, <code>BONDSET</code>, <code>DOTTED CHIRAL</code>, <code>STRINGASSOCIATION</code>, <code>DOT</code>, <code>RADICAL</code>, <code>REACTION</code>, <code>REACTION ARROW</code>, <code>REACTION PLUS</code>, <code>CHARGE</code>, and <code>UNKNOWN</code>.</li>
<li><strong>Validation Scoring:</strong> A built-in validation module that tests valences, bond lengths and angles, typical atom types, and fragments to assign a confidence score (0 to 1) to the reconstruction.</li>
</ol>
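<p>The paper does not give the aggregation formula behind the confidence score; a hedged sketch with invented equal weights:</p>

```python
def validation_score(checks, weights=None):
    """Combine boolean chemistry checks (valences, bond geometry, atom
    types, fragments) into a 0-1 confidence score. Equal weighting is an
    assumption; the paper does not specify the aggregation."""
    names = list(checks)
    weights = weights or {n: 1.0 for n in names}
    total = sum(weights[n] for n in names)
    return sum(weights[n] for n in names if checks[n]) / total

score = validation_score({
    "valences_ok": True,
    "bond_geometry_ok": True,
    "atom_types_typical": False,
    "fragments_ok": True,
})  # 3 of 4 equally weighted checks pass -> 0.75
```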
<h2 id="experiments-the-trec-2011-image-to-structure-task">Experiments: The TREC 2011 Image-to-Structure Task</h2>
<p>The system was evaluated as part of the <strong>TREC 2011 Image-to-Structure (I2S) Task</strong>.</p>
<ul>
<li><strong>Dataset:</strong> 1,000 unique chemical structure images provided by USPTO.</li>
<li><strong>Configuration:</strong> The authors used chemoCR v0.93 in batch mode with a single pre-configured parameter set (&ldquo;Houben-Weyl&rdquo;), originally developed for the Houben-Weyl book series of organic chemistry reactions published by Thieme.</li>
<li><strong>Process:</strong> The workflow included image binarization, connected component analysis, OCR for atom labels, and final molecule assembly.</li>
<li><strong>Metric:</strong> Perfect match recall against ground-truth MOL files.</li>
</ul>
<h2 id="results-and-conclusions-expert-systems-vs-dirty-data">Results and Conclusions: Expert Systems vs. &ldquo;Dirty&rdquo; Data</h2>
<ul>
<li><strong>Performance:</strong> The system achieved a <strong>perfect match for 656 out of 1,000 structures (65.6%)</strong>.</li>
<li><strong>Error Analysis:</strong> Failures were primarily attributed to &ldquo;unclear semantics&rdquo; in drawing styles, such as:
<ul>
<li>Overlapping objects (e.g., atom labels clashing with bonds).</li>
<li>Ambiguous primitives (dots interpreted as both radicals and chiral centers).</li>
<li>Markush structures (variable groups), which were excluded from the I2S task definition. A prototype for Markush detection existed but was not used.</li>
</ul>
</li>
<li><strong>Limitations:</strong> The vectorizer cannot recognize curves and circles, only straight lines. Aromatic ring detection (via a heuristic that looks for a large &ldquo;O&rdquo; character in the center of a ring system) was switched off for the I2S task. The system maintained 12 different parameter sets for various drawing styles, and selecting the correct set was critical.</li>
<li><strong>Impact:</strong> Demonstrated that rule-based expert systems combined with standard pattern recognition could handle high-quality datasets effectively, though non-standard drawing styles remain a challenge.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper relies on the TREC 2011 I2S dataset, comprising images extracted from USPTO patents.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 I2S</td>
          <td>1,000 images</td>
          <td>Binarized bitmaps from USPTO patents.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Internal Training Set</td>
          <td>Unknown</td>
          <td>Used to optimize parameter sets (e.g., &ldquo;Houben-Weyl&rdquo; set).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper describes three main workflow phases (preprocessing, semantic entity recognition, and molecule reconstruction plus validation), organized into four pipeline sections:</p>
<ol>
<li>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><em>Vaporizer Unit:</em> Erases parts of the image that are presumably not structure diagrams (e.g., text or other human-readable information), isolating the chemical depictions.</li>
<li><em>Connected Components:</em> Groups all foreground pixels that are 8-connected into components.</li>
<li><em>Text Tagging and OCR:</em> Identifies components that map to text areas and converts bitmap letters into characters.</li>
</ul>
</li>
<li>
<p><strong>Vectorization:</strong></p>
<ul>
<li><em>Algorithm:</em> <strong>Compute Local Directions</strong>. It analyzes segment clusters to detect ascending, descending, horizontal, and vertical trends in pixel data, converting them into vectors.</li>
<li><em>Feature:</em> Explicitly handles &ldquo;thick chirals&rdquo; (wedges) by computing orientation.</li>
</ul>
</li>
<li>
<p><strong>Reconstruction (Expert System):</strong></p>
<ul>
<li><em>Core Logic:</em> <strong>Graph Constraint Exploration</strong>. It visits connected components and evaluates them against an XML Rule Set.</li>
<li><em>Classification:</em> Objects are tagged with chemical keywords (e.g., <code>BONDSET</code> for ring systems and chains, <code>STRINGASSOCIATION</code> for atom labels, <code>DOTTED CHIRAL</code> for chiral bonds).</li>
<li><em>Rules:</em> Configurable via <code>chemoCRSettings.xml</code>. The successful rule with the highest priority value defines the annotation for each component.</li>
</ul>
</li>
<li>
<p><strong>Assembly &amp; Validation:</strong></p>
<ul>
<li>Combines classified vectors and OCR text into a semantic graph.</li>
<li><em>Superatoms:</em> Matches text groups against a loaded superatom database (e.g., &ldquo;COOH&rdquo;, &ldquo;Boc&rdquo;).</li>
<li><em>Validation:</em> Calculates a score (0-1) based on chemical feasibility (valences, bond lengths and angles, typical atom types, and fragments).</li>
</ul>
</li>
</ol>
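<p>The &ldquo;successful rule with the highest priority wins&rdquo; logic can be sketched as follows (the predicates and priorities are hypothetical; the real rules live in <code>chemoCRSettings.xml</code>):</p>

```python
def classify_component(component, rules):
    """Apply the highest-priority rule whose predicate matches the
    connected component, falling back to UNKNOWN (illustrative stand-in
    for the XML rule set)."""
    for priority, tag, predicate in sorted(rules, key=lambda r: r[0],
                                           reverse=True):
        if predicate(component):
            return tag
    return "UNKNOWN"

# Hypothetical rules: a parallel vector pair -> DOUBLEBOND, a lone
# vector -> BOND.
rules = [
    (10, "DOUBLEBOND", lambda c: c["parallel_pairs"] >= 1),
    (5, "BOND", lambda c: c["vectors"] == 1),
]
```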
<h3 id="models">Models</h3>
<p>The system is primarily rule-based but utilizes machine learning components for specific sub-tasks:</p>
<ul>
<li><strong>OCR:</strong> A trainable OCR module using supervised machine learning to recognize atom labels (H, C, N, O). The specific classifier is not detailed in the paper.</li>
<li><strong>Rule Base:</strong> An XML file containing the expert system logic. This is the &ldquo;model&rdquo; for structural interpretation.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed strictly within the context of the TREC competition.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall (Perfect Match)</td>
          <td>656 / 1000</td>
          <td>N/A</td>
          <td>Strict structural identity required.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software Stack:</strong> Platform-independent Java libraries.</li>
<li><strong>Compute:</strong> Batch mode processing supported; specific hardware requirements (CPU/RAM) were not disclosed.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>chemoCR (Fraunhofer SCAI)</td>
          <td>Software</td>
          <td>Unknown</td>
          <td>Project page defunct; tool was proprietary</td>
      </tr>
      <tr>
          <td><a href="https://trec.nist.gov/pubs/trec20/papers/chemoCR.chem.update.pdf">TREC 2011 Proceedings Paper</a></td>
          <td>Paper</td>
          <td>Public</td>
          <td>Official NIST proceedings</td>
      </tr>
  </tbody>
</table>
<p>No source code was publicly released. The chemoCR system was a proprietary tool from Fraunhofer SCAI. The TREC 2011 I2S dataset was distributed to competition participants and is not independently hosted.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zimmermann, M. (2011). Chemical Structure Reconstruction with chemoCR. <em>TREC 2011 Proceedings</em>.</p>
<p><strong>Publication</strong>: Text REtrieval Conference (TREC) 2011</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zimmermannChemicalStructureReconstruction2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Structure Reconstruction with {{chemoCR}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Text {{REtrieval Conference}} ({{TREC}}) 2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zimmermann, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Structural Analysis of Handwritten Chemical Formulas</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</guid><description>A 1999 methodology for recognizing handwritten chemical structures using a structural graph representation and recursive specialists.</description><content:encoded><![CDATA[<h2 id="contribution-structural-approach-to-document-analysis">Contribution: Structural Approach to Document Analysis</h2>
<p><strong>Method</strong>.
This paper proposes a system architecture for document analysis. It introduces a specific pipeline (Global Perception followed by Incremental Extraction) and validates this strategy with recognition rates on specific tasks. The core contribution is the shift from bitmap-based processing to a <strong>structural graph representation</strong> of graphical primitives.</p>
<h2 id="motivation-overcoming-bitmap-limitations-in-freehand-drawings">Motivation: Overcoming Bitmap Limitations in Freehand Drawings</h2>
<ul>
<li><strong>Complexity of Freehand</strong>: Freehand drawings contain fluctuating lines and noise that make standard vectorization techniques difficult to apply directly.</li>
<li><strong>Limitation of Bitmap Analysis</strong>: Most existing systems at the time attempted to interpret the document by working directly on the static bitmap image throughout the process.</li>
<li><strong>Need for Context</strong>: Interpretation requires a dynamic resource that can evolve as knowledge is extracted (e.g., recognizing a polygon changes the context for its neighbors).</li>
</ul>
<h2 id="novelty-dynamic-structural-graphs-and-recursive-specialists">Novelty: Dynamic Structural Graphs and Recursive Specialists</h2>
<p>The authors propose a <strong>Structural Representation</strong> as the unique resource for interpretation.</p>
<ul>
<li><strong>Quadrilateral Primitives</strong>: The system builds Quadrilaterals (pairs of vectors) to represent thin shapes, which are robust to handwriting fluctuations.</li>
<li><strong>Structural Graph</strong>: These primitives are organized into a graph where arcs represent geometric relationships (T-junctions, L-junctions, parallels).</li>
<li><strong>Specialist Agents</strong>: Interpretation is driven by independent modules (specialists) that browse this graph recursively to identify high-level chemical entities like rings (polygons) or chains.</li>
</ul>
<h2 id="experimental-setup-and-outcomes">Experimental Setup and Outcomes</h2>
<ul>
<li><strong>Validation Set</strong>: The system was tested on 20 handwritten off-line documents containing chemical formulas at 300 dpi resolution.</li>
<li><strong>Text Database</strong>: A separate base of 328 models was used for the text recognition component.</li>
<li><strong>High Graphical Accuracy</strong>: The system achieved a recognition rate of $\approx 97\%$ for graphical parts (chemical elements like rings and bonds).</li>
<li><strong>Text Recognition</strong>: The text recognition module achieved a success rate of $\approx 93\%$.</li>
<li><strong>Robustness</strong>: The structural graph approach successfully handled multiple liaisons, polygons, and chains, allowing a solution to be built progressively and kept consistent with the evolving context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Handwritten Documents</td>
          <td>20 docs</td>
          <td>Off-line documents at 300 dpi</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Character Models</td>
          <td>328 models</td>
          <td>Used for the Pattern Matching text recognition base</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The interpretation process is divided into two distinct phases:</p>
<p><strong>1. Global Perception (Graph Construction)</strong></p>
<ul>
<li><strong>Vectorization</strong>: Contour tracking produces a chain of vectors, which are simplified via iterative polygonal approximation until fusion stabilizes (2-5 iterations).</li>
<li><strong>Quadrilateral Formation</strong>: Vectors are paired to form quadrilaterals based on Euclidean distance and &ldquo;empirical&rdquo; alignment criteria.</li>
<li><strong>Graph Generation</strong>: Quadrilaterals become nodes. Arcs are created based on &ldquo;zones of influence&rdquo; and classified into 5 types: T-junction, Intersection (X), Parallel (//), L-junction, and Successive (S).</li>
<li><strong>Redraw Heuristic</strong>: A pre-processing step transforms T, X, and S junctions into L or // relations, as chemical drawings primarily consist of L-junctions and parallels.</li>
</ul>
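<p>The iterative fusion step above can be sketched as follows. This is a minimal illustration rather than the authors&rsquo; implementation; the 10-degree collinearity tolerance is an assumed parameter.</p>

```python
import math

def simplify_chain(points, angle_tol_deg=10.0):
    """One fusion pass: drop a vertex when the incoming and outgoing
    directions differ by less than angle_tol_deg (assumed tolerance)."""
    if len(points) < 3:
        return points
    out = [points[0]]
    for i in range(1, len(points) - 1):
        ax, ay = points[i][0] - out[-1][0], points[i][1] - out[-1][1]
        bx, by = points[i + 1][0] - points[i][0], points[i + 1][1] - points[i][1]
        turn = math.degrees(abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by)))
        if turn >= angle_tol_deg:  # keep the vertex: real direction change
            out.append(points[i])
    out.append(points[-1])
    return out

def polygonal_approximation(points):
    """Repeat fusion passes until the chain stabilizes
    (the paper reports convergence in 2-5 iterations)."""
    prev = None
    while prev != points:
        prev, points = points, simplify_chain(points)
    return points
```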
<p><strong>2. Specialists (Interpretation)</strong></p>
<ul>
<li><strong>Liaison Specialist</strong>: Scans the graph for // arcs or quadrilaterals with free extremities to identify bonds.</li>
<li><strong>Polygon/Chain Specialist</strong>: Uses recursive <code>look-left</code> and <code>look-right</code> procedures. If a search returns to the start node after $n$ steps, a polygon is detected.</li>
<li><strong>Text Localization</strong>: Clusters &ldquo;short&rdquo; quadrilaterals by physical proximity into &ldquo;focus zones&rdquo;. Zones are classified as text/non-text based on connected components.</li>
</ul>
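<p>The polygon detection rule above can be sketched as a bounded depth-first walk that reports a cycle when it returns to its start node. The adjacency-dictionary representation and the step bound are assumptions; the original <code>look-left</code>/<code>look-right</code> procedures choose the next arc by turning direction.</p>

```python
def find_polygon(adjacency, start, max_steps=8):
    """Return the node path of a polygon through `start`, or None."""
    def walk(node, path):
        for nxt in adjacency.get(node, []):
            if nxt == start and len(path) >= 3:
                return path          # closed after n >= 3 steps: polygon
            if nxt not in path and len(path) < max_steps:
                found = walk(nxt, path + [nxt])
                if found:
                    return found
        return None
    return walk(start, [start])
```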
<h3 id="models">Models</h3>
<p><strong>Text Recognition Hybrid</strong>:</p>
<ol>
<li><strong>Normalization &amp; Pattern Matching</strong>: A classic method using the database of 328 models.</li>
<li><strong>Structural Rule Base</strong>: Uses &ldquo;significant&rdquo; quadrilaterals (length $\ge 1/3$ of zone dimension) to verify characters. A rule base defines the expected count of horizontal, vertical, right-diagonal, and left-diagonal lines for each character.</li>
</ol>
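<p>A sketch of the structural verification step: count significant strokes by orientation and compare against per-character expectations. The rule entries and the 22.5-degree orientation bins are illustrative assumptions; only the &ldquo;length $\ge 1/3$ of zone dimension&rdquo; filter comes from the paper.</p>

```python
# Hypothetical rule entries: expected (horizontal, vertical,
# right-diagonal, left-diagonal) stroke counts per character.
RULES = {"H": (1, 2, 0, 0), "N": (0, 2, 1, 0)}

def classify_strokes(quads, zone_w, zone_h):
    """quads are (length, angle_deg) pairs for detected quadrilaterals."""
    counts = [0, 0, 0, 0]
    threshold = max(zone_w, zone_h) / 3  # "significant" length filter
    for length, angle in quads:
        if length < threshold:
            continue
        a = angle % 180
        if a < 22.5 or a >= 157.5:
            counts[0] += 1            # horizontal
        elif 67.5 <= a < 112.5:
            counts[1] += 1            # vertical
        elif a < 67.5:
            counts[2] += 1            # right diagonal (assumed convention)
        else:
            counts[3] += 1            # left diagonal
    return tuple(counts)

def verify(char, quads, zone_w, zone_h):
    return RULES.get(char) == classify_strokes(quads, zone_w, zone_h)
```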
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graphical Element Recognition</td>
          <td>~97%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents (Fig. 7 examples)</td>
      </tr>
      <tr>
          <td>Text Recognition</td>
          <td>~93%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ramel, J.-Y., Boissier, G., &amp; Emptoz, H. (1999). Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image. <em>Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR &lsquo;99)</em>, 83-86. <a href="https://doi.org/10.1109/ICDAR.1999.791730">https://doi.org/10.1109/ICDAR.1999.791730</a></p>
<p><strong>Publication</strong>: ICDAR 1999</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ramelAutomaticReadingHandwritten1999,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{Fifth International Conference}} on {{Document Analysis}} and {{Recognition}}. {{ICDAR}} &#39;99 ({{Cat}}. {{No}}.{{PR00318}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ramel, J.-Y. and Boissier, G. and Emptoz, H.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1999</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{83--86}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Bangalore, India}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1999.791730}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-0-7695-0318-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OSRA at TREC-CHEM 2011: Optical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/osra-trec-2011/</guid><description>A methodological overview of OSRA, an open-source pipeline for converting chemical structure images into machine-readable formats.</description><content:encoded><![CDATA[<h2 id="contribution-method-and-resource">Contribution: Method and Resource</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>, with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<p>It is Methodological because it details the specific algorithmic workflow (segmentation, binarization, vectorization, and rule-based recognition) used to translate pixel data into chemical semantics. It specifically addresses the &ldquo;Image2Structure&rdquo; task. It also serves as a Resource contribution by introducing OSRA as a free, open-source utility available to the community.</p>
<h2 id="motivation-limitations-of-standard-ocr-in-chemistry">Motivation: Limitations of Standard OCR in Chemistry</h2>
<p>A vast body of chemical information exists in journal publications and patents as two-dimensional structure diagrams. While human-readable, these images are inaccessible to machine data-mining techniques like virtual screening. Standard Optical Character Recognition (OCR) is insufficient, and widely used techniques such as wavelet transforms or neural networks (as used in face recognition) do not transfer directly: chemical diagrams contain far more structural complexity than alphabet characters, and misinterpreting a single element can yield a valid but incorrect molecule.</p>
<h2 id="core-innovation-chemistry-aware-heuristic-pipeline">Core Innovation: Chemistry-Aware Heuristic Pipeline</h2>
<p>The authors present a specialized pipeline distinct from standard OCR that combines image processing with domain-specific chemical logic. Key technical contributions include:</p>
<ul>
<li><strong>Entropy-based Page Segmentation</strong>: A statistical method using row entropy to distinguish between pages with mixed text/graphics and pages with single structures.</li>
<li><strong>Custom Binarization</strong>: A specific grayscale conversion ($Gr=\min(R,G,B)$).</li>
<li><strong>Heuristic Confidence Scoring</strong>: A linear &ldquo;confidence function&rdquo; derived from atom and ring counts to select the best structure resolution.</li>
<li><strong>Specialized Bond Recognition</strong>: Algorithms to detect bridge bonds, wedge/dashed bonds (3D info), and aromatic rings via inner circles.</li>
</ul>
<h2 id="methodology-evaluation-on-trec-chem-image2structure">Methodology: Evaluation on TREC-CHEM Image2Structure</h2>
<p>The system was validated through submission to the <strong>Image2Structure task of TREC-CHEM</strong>.</p>
<ul>
<li><strong>Version</strong>: OSRA version 1.3.8 was used without modifications.</li>
<li><strong>Setup</strong>: Two runs were submitted: one with default settings (automatic scale selection) and one fixed at 300 dpi.</li>
<li><strong>Data</strong>: The evaluation used a &ldquo;Training set&rdquo; and a &ldquo;Challenge Set&rdquo; provided by the task organizers.</li>
<li><strong>Metric</strong>: Recall rates were measured for both sets.</li>
</ul>
<h2 id="results-and-real-world-impact">Results and Real-World Impact</h2>
<ul>
<li><strong>Performance</strong>: The default settings achieved an <strong>84.3%</strong> recall on the training set and <strong>84.8%</strong> on the challenge set. The 300 dpi run performed slightly better (86.1% training, 85.6% challenge).</li>
<li><strong>Utility</strong>: The tool is widely used by academic and commercial researchers to extract data from patents (USPTO, JPO).</li>
<li><strong>Validation</strong>: Recognition rates have shown steady improvement over a 3-year development period.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://osra.sourceforge.net">OSRA (SourceForge)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Open-source OCSR tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The primary evaluation data came from the TREC-CHEM Image2Structure task.</li>
<li><strong>Reference Datasets</strong>: The paper references the &ldquo;Chem-Infty Dataset&rdquo; as a source of ground-truthed chemical structure images.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The OSRA pipeline is heuristic-heavy. Key implementation details for replication include:</p>
<p><strong>1. Page Segmentation</strong></p>
<ul>
<li><strong>Entropy Calculation</strong>: Used to detect text vs. graphics. Entropy $E = -p \log p$ is calculated for rows in a feature matrix of component distances.</li>
<li><strong>Thresholds</strong>: Max entropy &gt; 6 indicates mixed text/graphics; $\le$ 3 indicates a single structure. A threshold of <strong>4</strong> is used to distinguish the two.</li>
<li><strong>Separator Removal</strong>: Linear separators (aspect ratio above 100 or below 0.01, size above 300 pixels) are deleted early. Table frames are identified as connected components with aspect ratio between 0.1 and 10, with at least 300 pixels lying on the surrounding rectangle.</li>
<li><strong>Text Removal</strong>: Text blocks are identified if a group of segments (distance determined by local minima in distance matrix) contains &gt; 8 segments, has a fill ratio &gt; 0.2, or aspect ratio &gt; 10.</li>
</ul>
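<p>The entropy test above can be sketched directly from the reported thresholds. The base-2 logarithm and the toy feature rows are assumptions; the decision threshold of 4 is the paper&rsquo;s.</p>

```python
import math
from collections import Counter

def row_entropy(row):
    """Shannon entropy E = -sum(p log p) over the value distribution
    of one row of the component-distance feature matrix."""
    total = len(row)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(row).values())

def classify_page(feature_matrix, threshold=4.0):
    """Max row entropy above the threshold indicates mixed
    text/graphics; otherwise a single structure."""
    max_e = max(row_entropy(r) for r in feature_matrix)
    return "mixed" if max_e > threshold else "single"
```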
<p><strong>2. Image Preprocessing</strong></p>
<ul>
<li><strong>Grayscale</strong>: $Gr = \min(R, G, B)$.</li>
<li><strong>Resolutions</strong>: Processed at 72, 150, 300 dpi, and a dynamic resolution between 500-1200 dpi.</li>
<li><strong>Noise Factor</strong>: Ratio of 2-pixel line segments to 3-pixel line segments. If this factor is between <strong>0.5 and 1.0</strong>, anisotropic smoothing (GREYCstoration) is applied.</li>
<li><strong>Thinning</strong>: Uses the method by J. M. Cychosz to reduce lines to 1 pixel width.</li>
</ul>
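<p>Two of the preprocessing rules are simple enough to state in code; a sketch using the reported formulas (the zero-denominator behavior is an assumption):</p>

```python
def to_grayscale(pixel):
    """OSRA's reported grayscale conversion: Gr = min(R, G, B)."""
    r, g, b = pixel
    return min(r, g, b)

def needs_smoothing(two_px_segments, three_px_segments):
    """Noise factor = ratio of 2-pixel to 3-pixel line segments;
    anisotropic smoothing applies when it lies in [0.5, 1.0]."""
    if three_px_segments == 0:
        return False  # assumption: undefined ratio means no smoothing
    factor = two_px_segments / three_px_segments
    return 0.5 <= factor <= 1.0
```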
<p><strong>3. Vectorization &amp; Atom Detection</strong></p>
<ul>
<li><strong>Library</strong>: Potrace is used for vectorization.</li>
<li><strong>Atom Identification</strong>: Atoms are detected at Bezier curve control points if:
<ul>
<li>Potrace classifies it as a corner.</li>
<li>Direction change normal component is $\ge$ 2 pixels.</li>
<li>The distance from the last atom to the next control point is less than the distance from the last atom to the current control point.</li>
</ul>
</li>
<li><strong>OCR</strong>: GOCR and OCRAD are used for label recognition on connected sets smaller than max character dimensions. Tesseract and Cuneiform were also tested but did not improve recognition results.</li>
</ul>
<p><strong>4. Chemical Logic</strong></p>
<ul>
<li><strong>Average Bond Length</strong>: Defined as the value at the <strong>75th percentile</strong> of the sorted bond length list (to avoid bias from small artifacts).</li>
<li><strong>Aromaticity</strong>: Flagged if a circle is found inside a ring, atoms are within half the average bond length of the circle, and bond angles to the center are less than 90 degrees.</li>
<li><strong>Bridge Bonds</strong>: Detected if an atom connected to 4 pairwise collinear single bonds (none terminal) can be removed without changing the fragment count or the number of rotatable bonds, and without reducing the number of 5- and 6-membered rings by 2.</li>
</ul>
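<p>The percentile-based bond length is easy to pin down; the exact index convention is an assumption (the paper only states the 75th percentile of the sorted list):</p>

```python
def average_bond_length(lengths):
    """'Average' bond length taken at the 75th percentile of the
    sorted list, which downweights small artifact segments."""
    s = sorted(lengths)
    return s[int(0.75 * (len(s) - 1))]  # nearest-rank-style index (assumed)
```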
<p><strong>5. Connection Table Compilation</strong></p>
<ul>
<li><strong>Library</strong>: OpenBabel is used for conversion into SMILES or SDF formats.</li>
<li><strong>Process</strong>: A molecular object is constructed from connectivity information along with stereo- and aromaticity flags. Superatom fragments are added at this stage using a user-modifiable dictionary.</li>
</ul>
<h3 id="models">Models</h3>
<p>This is a non-learning based system (Rule-based/Heuristic). However, it uses a tuned linear function for confidence estimation.</p>
<p><strong>Confidence Function</strong>: Used to select the best resolution result.</p>
<p>$$
\begin{aligned}
\text{confidence} &amp;= 0.316030 - 0.016315 N_C + 0.034336 N_N + 0.066810 N_O \\
&amp;+ 0.035674 N_F + 0.065504 N_S + 0.04 N_{Cl} + 0.066811 N_{Br} \\
&amp;+ 0.01 N_R - 0.02 N_{Xx} - 0.212739 N_{rings} + 0.071300 N_{aromatic} \\
&amp;+ 0.329922 N_{rings5} + 0.342865 N_{rings6} - 0.037796 N_{fragments}
\end{aligned}
$$</p>
<p>Where $N_C$ is carbon count, $N_{rings}$ is ring count, etc.</p>
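<p>Since the confidence function is fully specified, it can be transcribed directly; only the dictionary-based interface is an illustrative choice:</p>

```python
# Coefficients exactly as reported in the paper.
INTERCEPT = 0.316030
COEFFS = {
    "C": -0.016315, "N": 0.034336, "O": 0.066810, "F": 0.035674,
    "S": 0.065504, "Cl": 0.04, "Br": 0.066811, "R": 0.01, "Xx": -0.02,
    "rings": -0.212739, "aromatic": 0.071300,
    "rings5": 0.329922, "rings6": 0.342865, "fragments": -0.037796,
}

def confidence(counts):
    """counts maps feature name (e.g. 'C', 'rings6') to its count;
    the resolution with the highest confidence is selected."""
    return INTERCEPT + sum(COEFFS[k] * v for k, v in counts.items())
```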
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run</th>
          <th>Training Set</th>
          <th>Challenge Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recall</td>
          <td>Default Settings</td>
          <td>84.3%</td>
          <td>84.8%</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>Fixed 300 dpi</td>
          <td>86.1%</td>
          <td>85.6%</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Filippov, I. V., Katsubo, D., &amp; Nicklaus, M. C. (2011). Optical Structure Recognition Application entry in Image2Structure task. <em>TREC-CHEM</em>.</p>
<p><strong>Publication</strong>: TREC-CHEM 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://osra.sourceforge.net">SourceForge Project</a></li>
<li><a href="https://launchpad.net/cuneiform-linux">Cuneiform Linux Port</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{filippovOpticalStructureRecognition2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical {{Structure Recognition Application}} Entry in {{Image2Structure}} Task}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Filippov, Igor V. and Katsubo, Dmitry and Nicklaus, Marc C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{National Cancer Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">note</span> = <span style="color:#e6db74">{TREC-CHEM Entry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé-1 System for Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1996/</guid><description>Foundational OCSR method combining neural OCR with chemical rule-based post-processing for automated structure interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1996). Automatic Interpretation of Chemical Structure Diagrams. <em>Graphics Recognition. Methods and Applications</em>, 148-158. <a href="https://doi.org/10.1007/3-540-61226-2_13">https://doi.org/10.1007/3-540-61226-2_13</a></p>
<p><strong>Publication</strong>: Lecture Notes in Computer Science (LNCS), Vol. 1072, Springer, 1996.</p>
<h2 id="system-architecture-and-contribution">System Architecture and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel software architecture (&ldquo;Kekulé-1&rdquo;) designed to solve the specific technical problem of converting rasterized chemical diagrams into machine-readable connection tables. The paper is characterized by:</p>
<ul>
<li><strong>Algorithmic Specification</strong>: It details specific algorithms for vectorization, polygon approximation, and character recognition.</li>
<li><strong>Performance Metrics</strong>: It validates the method using quantitative accuracy (98.9%) and speed comparisons against manual entry.</li>
<li><strong>System Architecture</strong>: It describes the integration of typically disparate components (OCR, vectorization, chemical rules) into a cohesive pipeline.</li>
</ul>
<h2 id="motivation-the-chemical-data-entry-bottleneck">Motivation: The Chemical Data Entry Bottleneck</h2>
<p>Chemical structure diagrams are the primary medium for communication between chemists, but computers cannot natively &ldquo;read&rdquo; these raster images.</p>
<ul>
<li><strong>Efficiency Gap</strong>: Manual redrawing of structures into chemical databases takes 6 to 10 minutes per structure.</li>
<li><strong>Technical Challenge</strong>: Existing commercial OCR systems failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), nor could they recognize small fonts (3-7 points) or chemical symbols accurately.</li>
<li><strong>Goal</strong>: To create an &ldquo;Optical Chemical Structure Recognition&rdquo; (OCSR) system that reduces processing time to seconds while handling complex notation like stereochemistry and group formulas.</li>
</ul>
<h2 id="core-innovations-in-chemical-ocr">Core Innovations in Chemical OCR</h2>
<p>Kekulé-1 represents the &ldquo;first successful attempt&rdquo; to integrate image processing, OCR, and structure editing into a single workflow. Key innovations include:</p>
<ul>
<li><strong>Context-Aware OCR</strong>: Unlike standard OCR, Kekulé-1 uses &ldquo;chemical spell checking&rdquo; by applying valence rules and chemical context to correct raw character recognition errors (e.g., distinguishing &lsquo;5&rsquo; from &lsquo;S&rsquo; based on bonding).</li>
<li><strong>Adaptive Polygon Approximation</strong>: A modified vectorization algorithm that partitions objects at the farthest node to prevent artifact nodes in U-shaped structures.</li>
<li><strong>Hybrid Parsing</strong>: It treats the diagram as a graph where nodes can be explicit atoms or geometric intersections, using rule-based logic to parse &ldquo;group formulas&rdquo; (like $COOH$) recursively.</li>
</ul>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The authors evaluated the system on a private test set to validate robustness and speed.</p>
<ul>
<li><strong>Dataset</strong>: 524 chemical structures chosen from a &ldquo;wide variety of sources&rdquo; specifically to test the system&rsquo;s limits.</li>
<li><strong>Metrics</strong>: Success rate (percentage of structures processed with minimal editing) and processing time per structure.</li>
<li><strong>Comparators</strong>: Performance was compared against the &ldquo;manual redrawing&rdquo; baseline.</li>
</ul>
<h2 id="results-performance-and-conclusions">Results, Performance, and Conclusions</h2>
<ul>
<li><strong>High Accuracy</strong>: 98.9% of the test structures were successfully processed (with an average of 0.74 user prompts per structure).</li>
<li><strong>Speedup</strong>: Processing took 7 to 30 seconds per structure, a significant improvement over the 6 to 10 minute manual baseline.</li>
<li><strong>Robustness</strong>: The system successfully handled pathological cases like broken characters, skew (rotation), and touching characters.</li>
<li><strong>Impact</strong>: The authors conclude that the techniques are generalizable to other domains like electrical circuits and utility maps.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: The evaluation used 524 chemical structures. These were not released publicly but were selected to represent &ldquo;limit&rdquo; cases.</li>
<li><strong>Input format</strong>: Scanned images at 300-400 dpi. The authors note that higher resolutions do not add information due to ink wicking and paper limitations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details several specific algorithmic implementations:</p>
<p><strong>Vectorization (Polygon Approximation)</strong>:</p>
<ul>
<li>Standard thinning and raster-to-vector translation are used.</li>
<li><strong>Innovation</strong>: The algorithm searches for the node <em>farthest</em> from the current start node to partition the object. This prevents artifact nodes in curved lines.</li>
<li><strong>Threshold Formula</strong>: The allowed deviation ($dist$) from a straight line is adaptive based on segment length ($length$):</li>
</ul>
<p>$$\mathit{dist} = \max\left(1,\ \frac{\mathit{length}}{10.0} + 0.4\right)$$</p>
<p>(Units in pixels)</p>
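<p>The adaptive threshold and its use in a straightness test can be sketched as follows; the perpendicular-distance check is an illustrative reading (the original additionally partitions at the node farthest from the start node):</p>

```python
import math

def deviation_threshold(length):
    """Kekulé-1's adaptive tolerance in pixels: max(1, length/10 + 0.4)."""
    return max(1.0, length / 10.0 + 0.4)

def is_straight(points):
    """Accept a chain as one straight segment when every interior point
    lies within the adaptive threshold of the chord."""
    (ax, ay), (bx, by) = points[0], points[-1]
    length = math.hypot(bx - ax, by - ay)
    tol = deviation_threshold(length)
    for px, py in points[1:-1]:
        dev = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / (length or 1.0)
        if dev > tol:
            return False
    return True
```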
<p><strong>Rotation Correction</strong>:</p>
<ul>
<li>The system computes the angle of all &ldquo;long&rdquo; line segments modulo 15 degrees.</li>
<li>It bins these angles; the bin with the highest count (representing &lt; 4 degrees rotation) is treated as the scan skew and corrected.</li>
</ul>
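<p>A sketch of the skew estimate under these rules; the 0.5-degree bin width is an assumed parameter:</p>

```python
import math
from collections import Counter

def estimate_skew(segments, bin_width=0.5):
    """Bin long-segment angles modulo 15 degrees and take the fullest
    bin as the scan skew (corrected when under ~4 degrees)."""
    bins = Counter()
    for (ax, ay), (bx, by) in segments:
        angle = math.degrees(math.atan2(by - ay, bx - ax)) % 15
        bins[round(angle / bin_width)] += 1
    return bins.most_common(1)[0][0] * bin_width
```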
<p><strong>Optical Character Recognition (OCR)</strong>:</p>
<ul>
<li>Uses a neural network with linked/shared weights (similar to Convolutional Neural Networks, though not named as such) acting as a feature detector.</li>
<li><strong>Training</strong>: Trained on specific chemical fonts.</li>
<li><strong>Inference</strong>: Outputs are ranked; if multiple characters (e.g., &lsquo;5&rsquo; and &lsquo;S&rsquo;) exceed a threshold, both are kept, and chemical context resolves the ambiguity later.</li>
</ul>
<p><strong>Chemical Parsing</strong>:</p>
<ul>
<li>Group formulas (e.g., $COOH$) are parsed left-to-right by subtracting valences.</li>
<li>Example: For $COOH$, the external bond reduces Carbon&rsquo;s valence to 3. The first Oxygen takes 2, leaving 1. The final Oxygen takes 1 (attaching to Carbon), and the Hydrogen takes 1 (attaching to Oxygen).</li>
</ul>
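<p>The worked example can be reproduced with a small valence-subtraction routine. The greedy &ldquo;attach to the earliest atom with free valence, taking as many bonds as both allow&rdquo; rule is an assumption that matches the $COOH$ walkthrough; it is not claimed to be the full parser.</p>

```python
VALENCE = {"C": 4, "O": 2, "H": 1}

def parse_group(symbols, external_bonds=1):
    """Left-to-right valence subtraction over a linear group formula.
    Returns bonds as (i, j, order) index triples."""
    free = []    # [atom index, remaining valence]
    bonds = []
    for i, sym in enumerate(symbols):
        v = VALENCE[sym] - (external_bonds if i == 0 else 0)
        for slot in free:
            take = min(slot[1], v)
            if take:                     # bond to earliest free atom
                bonds.append((slot[0], i, take))
                slot[1] -= take
                v -= take
                break
        free.append([i, v])
    return bonds

# COOH yields a C=O double bond, a C-O single bond, and an O-H bond.
```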
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Model</strong>: A neural network with a &ldquo;shared weights&rdquo; paradigm, effectively creating a learned convolution map. It achieves ~99.9% raw accuracy on isolated test sets of chemical fonts.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: The evaluation was performed on an <strong>80486 processor at 33 MHz</strong>.</li>
<li><strong>Time</strong>: Average processing time was 9 seconds per structure.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{mcdanielAutomaticInterpretationChemical1996,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Interpretation of Chemical Structure Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Graphics Recognition. Methods and Applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{O&#39;Gorman, Lawrence and Kasturi, Rangachar}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{Lecture Notes in Computer Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1072}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{148--158}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1996}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/3-540-61226-2_13}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CLiDE Pro: Optical Chemical Structure Recognition Tool</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-pro-2009/</guid><description>A methodological paper presenting CLiDE Pro, an OCSR system for reconstructing chemical graphs from images with ~90% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Valko, A. T., &amp; Johnson, A. P. (2009). CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition. <em>Journal of Chemical Information and Modeling</em>, 49(4), 780-787. <a href="https://doi.org/10.1021/ci800449t">https://doi.org/10.1021/ci800449t</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2009</p>
<h2 id="contribution-robust-algorithmic-pipeline-for-ocsr">Contribution: Robust Algorithmic Pipeline for OCSR</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper, as it proposes a specific algorithmic architecture (CLiDE Pro) for converting raster images of chemical structures into connection tables. It details the procedural steps for segmentation, vectorization, and graph reconstruction.</p>
<p>It also has a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution, as the authors compile and release a validation set of 454 real-world images to serve as a community benchmark for OCSR systems.</p>
<h2 id="motivation-bridging-the-gap-between-legacy-document-images-and-machine-readable-chemistry">Motivation: Bridging the Gap Between Legacy Document Images and Machine-Readable Chemistry</h2>
<p>While modern chemical drawing software captures structural information explicitly, the vast majority of legacy and current chemical literature (journals, patents, reports) exists as images or PDF documents. These images are human-readable but lack the semantic &ldquo;connection table&rdquo; data required for chemical databases and software. Manual redrawing is time-consuming and error-prone. Therefore, there is a critical need for efficient Optical Chemical Structure Recognition (OCSR) systems to automate this extraction.</p>
<h2 id="novelty-integrated-document-segmentation-and-ambiguity-resolution-heuristics">Novelty: Integrated Document Segmentation and Ambiguity Resolution Heuristics</h2>
<p>CLiDE Pro introduces several algorithmic improvements over its predecessor (CLiDE) and contemporary tools:</p>
<ul>
<li><strong>Integrated Document Segmentation</strong>: Unlike page-oriented systems, it processes whole documents to link information across pages.</li>
<li><strong>Robust &ldquo;Difficult Feature&rdquo; Handling</strong>: It implements specific heuristic rules to resolve ambiguities in crossing bonds (bridged structures), which are often misinterpreted as carbon atoms in other systems.</li>
<li><strong>Generic Structure Interpretation</strong>: It includes a module to parse &ldquo;generic&rdquo; (Markush) structures by matching R-group labels in the diagram with text-based definitions found in the document.</li>
<li><strong>Ambiguity Resolution</strong>: It uses context-aware rules to distinguish between geometrically similar features, such as vertical lines representing bonds vs. the letter &lsquo;l&rsquo; in &lsquo;Cl&rsquo;.</li>
</ul>
<h2 id="methodology-and-benchmarking-on-real-world-data">Methodology and Benchmarking on Real-World Data</h2>
<p>The authors conducted a systematic validation on a dataset of <strong>454 images</strong> containing <strong>519 structure diagrams</strong>.</p>
<ul>
<li><strong>Source Material</strong>: Images were extracted from published materials (journals, patents), ensuring &ldquo;real artifacts&rdquo; like noise and scanning distortions were present.</li>
<li><strong>Automation</strong>: The test was fully automated without human intervention.</li>
<li><strong>Metrics</strong>: The primary metric was the &ldquo;success rate,&rdquo; defined as the correct reconstruction of the molecule&rsquo;s connection table. They also performed fine-grained error analysis on specific features (e.g., atom labels, dashed bonds, wavy bonds).</li>
</ul>
<h2 id="results-high-topological-accuracy-and-persistent-ocr-challenges">Results: High Topological Accuracy and Persistent OCR Challenges</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved an <strong>89.79%</strong> retrieval rate (466/519 molecules correctly reconstructed).</li>
<li><strong>Robustness on Primitives</strong>: Solid straight bonds were recognized with 99.92% accuracy.</li>
<li><strong>Key Failure Modes</strong>: The majority of errors (58 cases) occurred in atom label construction, specifically when labels touched nearby bonds or other artifacts, causing OCR failures.</li>
<li><strong>Impact</strong>: The study demonstrated that handling &ldquo;difficult&rdquo; drawing features like crossing bonds and bridged structures significantly reduces topological errors. The authors released the test set to encourage standardized benchmarking in the OCSR field.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized a custom dataset designed to reflect real-world noise.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>CLiDE Pro Validation Set</td>
          <td>454 images (519 structures)</td>
          <td>Extracted from scanned journals and PDFs. Includes noise/artifacts. Available in Supporting Information.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The CLiDE Pro pipeline consists of five distinct phases. To replicate this system, one would need to implement:</p>
<ol>
<li>
<p><strong>Image Binarization</strong>:</p>
<ul>
<li>Input images are binarized using a threshold-based technique to separate foreground (molecule) from background.</li>
<li><strong>Connected Component Analysis (CCA)</strong>: A non-recursive scan identifies connected components (CCs) and generates interpixel contours (using N, S, E, W directions).</li>
</ul>
</li>
<li>
<p><strong>Document Segmentation</strong>:</p>
<ul>
<li><strong>Layout Analysis</strong>: Uses a bottom-up approach building a tree structure. It treats CCs as graph vertices and distances as edges.</li>
<li><strong>Clustering</strong>: A minimal-cost spanning tree (Kruskal&rsquo;s algorithm) groups CCs into words, lines, and blocks.</li>
<li><strong>Classification</strong>: CCs are classified (Character, Dash, Line, Graphics, Noise) based on size thresholds derived from statistical image analysis.</li>
</ul>
</li>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li><strong>Contour Approximation</strong>: Uses a method similar to <strong>Sklansky and Gonzalez</strong> to approximate contours into polygons.</li>
<li><strong>Vector Formation</strong>: Long polygon sides become straight lines; short consecutive sides become curves. Opposing borders of a line are matched to define the bond vector.</li>
<li><strong>Wavy Bonds</strong>: Detected by finding groups of short vectors lying on a straight line.</li>
<li><strong>Dashed Bonds</strong>: Detected using the <strong>Hough transform</strong> to find collinear or parallel dashes.</li>
</ul>
</li>
<li>
<p><strong>Atom Label Construction</strong>:</p>
<ul>
<li><strong>OCR</strong>: An OCR engine (filtering + topological analysis) interprets characters.</li>
<li><strong>Grouping</strong>: Characters are grouped into words based on horizontal and vertical proximity (for vertical labels).</li>
<li><strong>Superatom Lookup</strong>: Labels are matched against a database of elements, functional groups, and R-groups. Unknown linear formulas (e.g., $\text{CH}_2\text{CH}_2\text{OH}$) are parsed.</li>
</ul>
</li>
<li>
<p><strong>Graph Reconstruction</strong>:</p>
<ul>
<li><strong>Connection Logic</strong>: Bond endpoints are joined to atoms if they are within a distance threshold and &ldquo;point toward&rdquo; the label.</li>
<li><strong>Implicit Carbons</strong>: Unconnected bond ends are joined if close; parallel bonds merge into double/triple bonds.</li>
<li><strong>Crossing Bonds</strong>: Rules check proximity, length, and ring membership to determine if crossing lines are valid atoms or 3D visual artifacts.</li>
</ul>
</li>
<li>
<p><strong>Generic Structure Interpretation</strong>:</p>
<ul>
<li><strong>Text Mining</strong>: A lexical/syntactic analyzer extracts R-group definitions (e.g., &ldquo;R = Me or H&rdquo;) from text blocks.</li>
<li><strong>Matching</strong>: The system attempts to match R-group labels in the diagram with the parsed text definitions.</li>
</ul>
</li>
</ol>
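<p>The binarization and connected-component steps (phase 1) can be sketched in a few lines. This is an illustrative reimplementation under simplifying assumptions, not CLiDE Pro's actual code: the paper's non-recursive scan additionally traces interpixel contours, which is omitted here.</p>

```python
from collections import deque

def binarize(gray, threshold=128):
    """Threshold a grayscale image (list of rows) into 0/1 foreground."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def connected_components(binary):
    """Label 4-connected foreground components (N, S, E, W neighbours)
    with an iterative BFS instead of recursion."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 1 and labels[y][x] == 0:
                count += 1
                queue = deque([(y, x)])
                labels[y][x] = count
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                           and binary[ny][nx] == 1 and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

<p>The resulting components would then feed the segmentation phase, where they are classified by size and clustered into words, lines, and blocks.</p>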
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR Engine</strong>: The system relies on a customized OCR engine capable of handling rotation and chemical symbols, though the specific architecture (neural vs. feature-based) is not detailed beyond &ldquo;topological and geometrical feature analysis&rdquo;.</li>
<li><strong>Superatom Database</strong>: A lookup table containing elements, common functional groups, and R-group labels.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The evaluation focused on the topological correctness of the output.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total Success Rate</strong></td>
          <td>89.79%</td>
          <td>466/519 structures perfectly reconstructed.</td>
      </tr>
      <tr>
          <td><strong>Atom Label Accuracy</strong></td>
          <td>98.54%</td>
          <td>3923/3981 labels correct. Main error source: labels touching bonds.</td>
      </tr>
      <tr>
          <td><strong>Solid Bond Accuracy</strong></td>
          <td>99.92%</td>
          <td>16061/16074 solid bonds correct.</td>
      </tr>
      <tr>
          <td><strong>Dashed Bond Accuracy</strong></td>
          <td>98.37%</td>
          <td>303/308 dashed bonds correct.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; described as efficient.</li>
<li><strong>Performance</strong>: The system processed the complex Palytoxin structure &ldquo;within a few seconds&rdquo;. This implies low computational overhead suitable for standard desktop hardware of the 2009 era.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{valkoCLiDEProLatest2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Valko, Aniko T. and Johnson, A. Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{49}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{780--787}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci800449t}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInk: Real-Time Recognition for Chemical Drawings</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</guid><description>A sketch recognition framework for chemical diagrams using a joint CRF model to combine multi-level visual features for real-time interpretation.</description><content:encoded><![CDATA[<h2 id="contribution-real-time-sketch-recognition-method">Contribution: Real-Time Sketch Recognition Method</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for sketch recognition that integrates visual features at three distinct levels (inkpoints, segments, symbols) into a single probabilistic model. The rhetorical structure centers on the proposal of this new architecture, the introduction of a specific &ldquo;trainable corner detector&rdquo; algorithm, and the validation of these methods against existing benchmarks and alternative toolsets (ChemDraw).</p>
<h2 id="motivation-bridging-the-gap-between-sketching-and-cad">Motivation: Bridging the Gap Between Sketching and CAD</h2>
<p>The primary motivation is to bridge the gap between the natural, efficient process of drawing chemical diagrams by hand and the cumbersome &ldquo;point-click-and-drag&rdquo; interactions required by CAD tools like ChemDraw. While chemists prefer sketching for communication, existing digital tools do not offer the same speed or ease of use. The goal is to build an intelligent system that understands freehand sketches in real-time, converting them into structured data suitable for analysis or search.</p>
<h2 id="core-innovation-hierarchical-joint-crf-model">Core Innovation: Hierarchical Joint CRF Model</h2>
<p>The core novelty lies in the <strong>hierarchical joint model</strong>. Unlike previous approaches that might treat stroke segmentation and symbol recognition as separate, isolated steps, ChemInk uses a <strong>Conditional Random Field (CRF)</strong> to jointly model dependencies across three levels:</p>
<ol>
<li><strong>Inkpoints</strong>: Local visual appearance.</li>
<li><strong>Segments</strong>: Stroke fragments separated by corners.</li>
<li><strong>Candidates</strong>: Potential symbol groupings.</li>
</ol>
<p>Additionally, the paper introduces a <strong>trainable corner detector</strong> that learns domain-specific corner definitions from data.</p>
<h2 id="experimental-design-and-baselines">Experimental Design and Baselines</h2>
<p>The authors conducted two primary evaluations:</p>
<ol>
<li><strong>Off-line Accuracy Evaluation</strong>:
<ul>
<li><strong>Dataset</strong>: 12 real-world organic compounds drawn by 10 participants.</li>
<li><strong>Metric</strong>: Recognition accuracy (Recall and Precision).</li>
<li><strong>Baseline</strong>: Comparison against their own previous work (O&amp;D 2009) and ablations (with/without context).</li>
</ul>
</li>
<li><strong>On-line User Study</strong>:
<ul>
<li><strong>Task</strong>: 9 participants (chemistry students) drew 5 diagrams using both ChemInk (Tablet PC) and ChemDraw (Mouse/Keyboard).</li>
<li><strong>Metric</strong>: Time to completion and subjective user ratings (speed/ease of use).</li>
</ul>
</li>
</ol>
<h2 id="results-accuracy-and-user-study-outcomes">Results: Accuracy and User Study Outcomes</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>97.4% symbol recognition accuracy</strong>, slightly outperforming the best prior result (97.1%). The trainable corner detector achieved <strong>99.91% recall</strong>.</li>
<li><strong>Speed</strong>: Users were <strong>twice as fast</strong> using ChemInk (avg. 36s) compared to ChemDraw (avg. 79s).</li>
<li><strong>Usability</strong>: Participants rated ChemInk significantly higher for speed (6.3 vs 4.5) and ease of use (6.3 vs 4.7) on a 7-point scale.</li>
<li><strong>Conclusion</strong>: Sketch recognition is a viable, superior alternative to standard CAD tools for authoring chemical diagrams.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: 12 real-world organic compounds (e.g., Aspirin, Penicillin) drawn by 10 participants familiar with organic chemistry.</li>
<li><strong>Evaluation Split</strong>: User-independent cross-validation (training on 9 users, testing on 1).</li>
<li><strong>Input</strong>: Raw digital ink (strokes) collected on a Tablet PC.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Corner Detection (Trainable)</strong></p>
<ul>
<li><strong>Method</strong>: Iterative vertex elimination.</li>
<li><strong>Cost Function</strong>: $cost(p_{i}) = \sqrt{mse(s_{i}; p_{i-1}, p_{i+1})} \cdot dist(p_{i}; p_{i-1}, p_{i+1})$</li>
<li><strong>Procedure</strong>: Repeatedly remove the vertex with the lowest cost until the classifier (trained on features like cost, diagonal length, ink density) predicts the remaining vertices are corners.</li>
</ul>
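<p>The iterative vertex elimination above can be sketched as follows. Note the simplification: <code>simplify_until</code> uses a fixed cost threshold as a stand-in for the paper's trained classifier, and <code>mse</code> is taken only over the retained vertices rather than all raw stroke points.</p>

```python
import math

def point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((bx - ax) * (py - ay) - (by - ay) * (px - ax))
    den = math.hypot(bx - ax, by - ay)
    return num / den if den else math.hypot(px - ax, py - ay)

def vertex_cost(points, i):
    """cost(p_i) = sqrt(mse(s_i; p_{i-1}, p_{i+1})) * dist(p_i; p_{i-1}, p_{i+1})."""
    a, b = points[i - 1], points[i + 1]
    span = points[i - 1:i + 2]
    mse = sum(point_line_dist(q, a, b) ** 2 for q in span) / len(span)
    return math.sqrt(mse) * point_line_dist(points[i], a, b)

def simplify_until(points, keep_threshold):
    """Repeatedly remove the cheapest interior vertex while its cost stays
    below the threshold; surviving vertices are the detected corners."""
    pts = list(points)
    while len(pts) > 2:
        cost, i = min((vertex_cost(pts, i), i) for i in range(1, len(pts) - 1))
        if cost >= keep_threshold:
            break
        del pts[i]
    return pts
```

<p>On a near-straight stroke the interior vertices have near-zero cost and are eliminated, while a genuine corner (large deviation from the chord between its neighbours) survives.</p>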
<p><strong>2. Feature Extraction</strong></p>
<ul>
<li><strong>Inkpoints</strong>: Sampled at regular intervals. Features = $10 \times 10$ pixel orientation filters (0, 45, 90, 135 degrees) at two scales ($L/2$, $L$), smoothed and downsampled to $5 \times 5$. Total 400 features.</li>
<li><strong>Segments</strong>: Similar image features centered at segment midpoint, plus geometric features (length, ink density).</li>
<li><strong>Candidates</strong>: 5 feature images ($20 \times 20$) including an &ldquo;endpoint&rdquo; image, stretched to normalize aspect ratio.</li>
<li><strong>Dimensionality Reduction</strong>: PCA used to compress feature images to 256 components.</li>
</ul>
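<p>A minimal sketch of the inkpoint feature idea: render a stroke into four orientation-binned feature images. This is a simplified stand-in, not the paper's exact pipeline, which filters at two scales, smooths, and downsamples before PCA.</p>

```python
import math

def orientation_images(points, grid=5):
    """Bin each ink segment of a stroke into the nearest of four
    orientation channels (0, 45, 90, 135 degrees) on a grid x grid image."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    images = [[[0.0] * grid for _ in range(grid)] for _ in range(4)]
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        theta = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180
        k = int(((theta + 22.5) // 45) % 4)  # nearest of 0/45/90/135
        cx = min(grid - 1, int((x1 - min(xs)) / w * grid))
        cy = min(grid - 1, int((y1 - min(ys)) / w * grid))
        images[k][cy][cx] += 1.0
    return images
```

<p>Concatenating such channel images (at multiple scales) yields the 400-dimensional inkpoint descriptors that PCA then compresses.</p>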
<p><strong>3. Structure Generation</strong></p>
<ul>
<li><strong>Clustering</strong>: Agglomerative clustering with a complete-link metric to connect symbols.</li>
<li><strong>Threshold</strong>: Stop clustering at distance $0.4L$.</li>
</ul>
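<p>The structure-generation step can be sketched as naive complete-link agglomerative clustering; this illustrative version assumes plain Euclidean distances between symbol positions and stops at the given threshold (the paper uses $0.4L$, with $L$ a characteristic symbol scale).</p>

```python
import math

def complete_link_clusters(points, stop_distance):
    """Merge the two closest clusters (complete-link metric) until the
    smallest inter-cluster distance exceeds stop_distance."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # Complete link: the farthest pair of members across two clusters.
        return max(math.hypot(p[0] - q[0], p[1] - q[1]) for p in a for q in b)

    while len(clusters) > 1:
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > stop_distance:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```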
<h3 id="models">Models</h3>
<p><strong>Conditional Random Field (CRF)</strong></p>
<ul>
<li><strong>Structure</strong>: 3-level hierarchy (Inkpoints $V_p$, Segments $V_s$, Candidates $V_c$).</li>
<li><strong>Nodes</strong>:
<ul>
<li>$V_p, V_s$ labels: &ldquo;bond&rdquo;, &ldquo;hash&rdquo;, &ldquo;wedge&rdquo;, &ldquo;text&rdquo;.</li>
<li>$V_c$ labels: specific candidate interpretations.</li>
</ul>
</li>
<li><strong>Edges/Potentials</strong>:
<ul>
<li><strong>Entity-Feature</strong>: $\phi(y, x)$ (Linear classifier).</li>
<li><strong>Consistency</strong>: $\psi(y_i, y_j)$ (Hard constraint: child must match parent label).</li>
<li><strong>Spatial Context</strong>: $\psi_{ss}(y_i, y_j)$ (Pairwise geometric relationships between segments: angle, distance).</li>
<li><strong>Overlap</strong>: Prevents conflicting candidates from sharing segments.</li>
</ul>
</li>
<li><strong>Inference</strong>: Loopy Belief Propagation (up to 100 iterations).</li>
<li><strong>Training</strong>: Maximum Likelihood via gradient ascent (L-BFGS).</li>
</ul>
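<p>The inference step can be illustrated with sum-product loopy belief propagation on a generic pairwise model. This is a sketch of the algorithm class only; ChemInk's actual CRF adds the three-level hierarchy, consistency constraints, and overlap potentials described above.</p>

```python
import numpy as np

def loopy_bp(n_states, unary, edges, pair, n_iters=100):
    """Sum-product loopy BP on a pairwise model.  unary[i] is a length
    n_states potential for node i; pair[(i, j)] is an n_states x n_states
    potential for edge (i, j).  Returns approximate node marginals
    (exact on trees)."""
    msgs = {(i, j): np.ones(n_states)
            for (a, b) in edges for (i, j) in ((a, b), (b, a))}
    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            # Message i -> j: combine unary with incoming messages except j's.
            belief = unary[i].copy()
            for (k, l) in msgs:
                if l == i and k != j:
                    belief *= msgs[(k, l)]
            psi = pair[(i, j)] if (i, j) in pair else pair[(j, i)].T
            m = psi.T @ belief
            new[(i, j)] = m / m.sum()
        msgs = new
    marginals = []
    for i in range(len(unary)):
        b = unary[i].copy()
        for (k, l) in msgs:
            if l == i:
                b *= msgs[(k, l)]
        marginals.append(b / b.sum())
    return marginals
```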
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Accuracy (Recall/Precision) of symbol detection.</li>
<li><strong>Comparison</strong>: Compared against Ouyang &amp; Davis 2009 (previous SOTA).</li>
<li><strong>Speed Metric</strong>: Wall-clock time for diagram creation (ChemInk vs. ChemDraw).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Processor</strong>: 3.7 GHz processor (single thread) for base benchmarking (approx. 1 sec/sketch).</li>
<li><strong>Deployment</strong>: Validated on 1.8 GHz Tablet PCs using multi-core parallelization for real-time feedback.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2011). ChemInk: A Natural Real-Time Recognition System for Chemical Drawings. <em>Proceedings of the 16th International Conference on Intelligent User Interfaces</em>, 267&ndash;276. <a href="https://doi.org/10.1145/1943403.1943444">https://doi.org/10.1145/1943403.1943444</a></p>
<p><strong>Publication</strong>: IUI &lsquo;11</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyangChemInkNaturalRealtime2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemInk: A Natural Real-Time Recognition System for Chemical Drawings}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{ChemInk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 16th International Conference on Intelligent User Interfaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Tom Y. and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{267--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Palo Alto, CA, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/1943403.1943444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4503-0419-1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://hdl.handle.net/1721.1/78898}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Structure Recognition (Rule-Based)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/molrec-2012/</guid><description>A strictly rule-based expert system (MolRec) for converting raster chemical diagrams into graph representations.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). Chemical structure recognition: A rule based approach. <em>Proceedings of SPIE</em>, 8297, 82970E. <a href="https://doi.org/10.1117/12.912185">https://doi.org/10.1117/12.912185</a></p>
<p><strong>Publication</strong>: IS&amp;T/SPIE Electronic Imaging 2012</p>
<h2 id="methodological-contribution">Methodological Contribution</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This paper proposes a novel mechanism (MolRec) for Optical Chemical Structure Recognition (OCSR). It focuses on defining a &ldquo;strictly rule based system&rdquo; to transform vectorised molecule images into graph representations, contrasting this declarative approach with procedural or heuristic-heavy methods. The contribution is validated through direct comparison with the leading open-source tool (OSRA).</p>
<h2 id="motivation-overcoming-procedural-heuristics">Motivation: Overcoming Procedural Heuristics</h2>
<p>Chemical literature contains vast amounts of information locked in 2D diagrams. This visual data is generally inaccessible to search tools or electronic processing. While commercial and academic tools existed (e.g., OSRA, Kekulé), they typically relied on procedural heuristics that required experimental tuning and were difficult to extend. The authors sought to create a system based on precise, declarative rewrite rules to handle the ambiguity inherent in chemical drawing conventions.</p>
<h2 id="core-innovation-geometric-rewrite-rules">Core Innovation: Geometric Rewrite Rules</h2>
<p>The core novelty is the <strong>geometric rewrite rule system</strong> (MolRec).</p>
<ul>
<li><strong>Geometric Primitives</strong>: The system operates on five high-level primitives: Line Segment, Arrow, Circle, Triangle, and Character Group.</li>
<li><strong>Fuzzy Parameters</strong>: It introduces formal definitions for &ldquo;fuzzy&rdquo; relationships (e.g., <code>dash-neighbouring</code>, <code>approximate collinearity</code>) to handle drawing irregularities and scanning artifacts.</li>
<li><strong>Ambiguity Resolution</strong>: Specific rules (R4-R6) are designed to disambiguate visual homoglyphs, such as distinguishing a &ldquo;triple bond&rdquo; from a &ldquo;dashed bold bond&rdquo; based on context (connected atoms).</li>
<li><strong>Explicit &ldquo;Cutting&rdquo;</strong>: A mechanism to identify implicit carbon nodes within continuous line segments (e.g., splitting a long line intersected by parallel lines into a double bond).</li>
</ul>
<h2 id="experimental-setup-vs-baselines">Experimental Setup vs. Baselines</h2>
<p>The authors compared their system (MolRec) against <strong>OSRA</strong> (the leading open-source system) on two datasets:</p>
<ol>
<li><strong>OSRA Benchmark</strong>: 5,735 computer-generated diagrams with ground truth MOL files.</li>
<li><strong>Maybridge Dataset</strong>: 5,730 scanned images (300dpi) from a drug catalogue, converted to ground truth MOL files via InChI lookups.</li>
</ol>
<p>Evaluation was semantic: The output MOL files were compared using OpenBabel to check for structural equivalence, ignoring syntactic file differences.</p>
<h2 id="results-and-key-findings">Results and Key Findings</h2>
<p><strong>MolRec outperformed OSRA</strong> on both datasets:</p>
<ul>
<li><strong>OSRA Benchmark</strong>: MolRec achieved <strong>88.46%</strong> accuracy vs. OSRA&rsquo;s 77.23%.</li>
<li><strong>Maybridge Dataset</strong>: MolRec achieved <strong>83.84%</strong> accuracy vs. OSRA&rsquo;s 72.57%.</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>Robustness</strong>: The line thinning + Douglas-Peucker vectorization approach was found to be more robust than Hough transform approaches used by other tools.</li>
<li><strong>Failure Modes</strong>: Major remaining errors were caused by &ldquo;touching components&rdquo; (ligatures, characters touching bonds) and complex &ldquo;superatoms&rdquo; (abbreviations like &ldquo;-Ph&rdquo; or &ldquo;-COOH&rdquo;) with ambiguous connection points.</li>
<li><strong>Triangle Detection</strong>: The &ldquo;expanding disc&rdquo; method for identifying wedge bonds was highly effective.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>Two distinct datasets were used for validation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA Benchmark</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">5,735</td>
          <td style="text-align: left">Computer-generated diagrams provided by the OSRA project.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">5,730</td>
          <td style="text-align: left">Scanned at 300dpi from the Maybridge drug catalogue. Ground truth generated via CAS Registry Number $\to$ <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> $\to$ OpenBabel.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline consists of three stages: <strong>Vectorization</strong>, <strong>Geometric Processing</strong>, and <strong>Rule Application</strong>.</p>
<p><strong>1. Vectorization &amp; Primitives</strong></p>
<ul>
<li><strong>Binarization &amp; OCR</strong>: Connected components are labelled and passed to an OCR engine to extract &ldquo;Character Groups&rdquo;.</li>
<li><strong>Thinning</strong>: Image is thinned to unit width.</li>
<li><strong>Simplification</strong>: Douglas-Peucker algorithm converts pixel paths into straight <strong>Line Segments</strong>.</li>
<li><strong>Triangle Detection</strong>: A disc growing algorithm walks inside black regions to identify <strong>Triangles</strong> (wedges). If the disc cannot grow, it is a thick line (Bold Bond).</li>
</ul>
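<p>The Douglas-Peucker simplification step can be sketched directly; this is the standard recursive algorithm, not MolRec's specific implementation.</p>

```python
import math

def douglas_peucker(points, epsilon):
    """Simplify a pixel path into straight segments: keep the point
    farthest from the end-to-end chord if it deviates by more than
    epsilon, otherwise collapse the run to a single segment."""
    def dist(p, a, b):
        (px, py), (ax, ay), (bx, by) = p, a, b
        num = abs((bx - ax) * (py - ay) - (by - ay) * (px - ax))
        den = math.hypot(bx - ax, by - ay)
        return num / den if den else math.hypot(px - ax, py - ay)

    if len(points) < 3:
        return list(points)
    d_max, idx = max((dist(points[i], points[0], points[-1]), i)
                     for i in range(1, len(points) - 1))
    if d_max <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right  # drop duplicated split point
```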
<p><strong>2. Fuzzy Parameters</strong></p>
<p>The rules rely on tolerating drawing imperfections using defined parameters:</p>
<ul>
<li>$r_e$: Radius of collinearity (strict).</li>
<li>$d_l$ / $d_s$: Dash length / Dash separation (fuzzy).</li>
<li>$bdl$ / $bdw$: Bold dash length / width (fuzzy).</li>
<li>$bs$: Bond separation (max distance between parallel bonds).</li>
<li>$ol$: Minimal overlap.</li>
</ul>
<p><strong>3. The Rule System (R1-R18)</strong></p>
<p>The core logic uses 18 mutual-exclusion rules to rewrite geometric primitives into chemical graph edges.</p>
<ul>
<li><strong>Planar Bonds</strong>:
<ul>
<li><strong>R1-R3 (Single/Double/Triple)</strong>: Identifies parallel lines based on <code>bs</code> and <code>ol</code>. Uses &ldquo;cutting&rdquo; to split lines at implicit nodes.</li>
</ul>
</li>
<li><strong>Ambiguity Resolution (Stereo vs. Planar)</strong>:
<ul>
<li><strong>R4 (Dashed Bold vs. Triple)</strong>: Checks context. If purely geometric measures match both, it defaults to Triple unless specific dash constraints are met.</li>
<li><strong>R5 (Dashed Wedge vs. Triple)</strong>: Similar disambiguation based on length monotonicity.</li>
<li><strong>R6 (Dashed Wedge vs. Double)</strong>: Differentiates based on line length differences ($l_1 &gt; l_2$).</li>
</ul>
</li>
<li><strong>Stereo Bonds</strong>:
<ul>
<li><strong>R7-R9 (Dashed Types)</strong>: Identifies collinear segments with specific neighbor patterns (1 neighbor for ends, 2 for internal).</li>
<li><strong>R10-R11 (Hollow Wedge)</strong>: Detects triangles formed by 3 or 4 lines.</li>
<li><strong>R14 (Solid Wedge)</strong>: Direct mapping from Triangle primitive.</li>
</ul>
</li>
<li><strong>Special Structures</strong>:
<ul>
<li><strong>R12 (Wavy Bond)</strong>: Zig-zag line segments.</li>
<li><strong>R13 (Arrow)</strong>: Dative bond.</li>
<li><strong>R16 (Aromatic Ring)</strong>: Circle inside a cycle of &gt;5 lines.</li>
<li><strong>R17-R18 (Bridge Bonds)</strong>: Handles 2.5D crossing bonds (open or closed gaps).</li>
</ul>
</li>
</ul>
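<p>A reduced sketch of the R2-style parallel-bond check: two line segments are rewritten into a double bond when they are approximately parallel and closer than the bond-separation parameter <code>bs</code>. The real rules also verify the minimal overlap <code>ol</code> and apply &ldquo;cutting&rdquo; at implicit carbon nodes, both omitted here.</p>

```python
import math

def _angle(seg):
    """Undirected orientation of a segment, in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def _midpoint_gap(s1, s2):
    (ax, ay), (bx, by) = s1
    (cx, cy), (dx, dy) = s2
    return math.hypot((ax + bx) / 2 - (cx + dx) / 2,
                      (ay + by) / 2 - (cy + dy) / 2)

def is_double_bond(s1, s2, bs, angle_tol=0.1):
    """Fuzzy parallel-line test: near-equal orientations (mod pi) and
    midpoint separation at most bs."""
    diff = abs(_angle(s1) - _angle(s2))
    parallel = diff < angle_tol or abs(diff - math.pi) < angle_tol
    return parallel and _midpoint_gap(s1, s2) <= bs
```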
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: Semantic graph matching. The output MOL file is compared to the ground truth MOL file using OpenBabel. Success = correct graph isomorphism.</p>
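<p>A lightweight approximation of this semantic check can be sketched with a graph invariant: equal fingerprints are necessary (though not sufficient) for structural equivalence. The paper's actual evaluation converts both MOL files with OpenBabel and compares the resulting structures.</p>

```python
def graph_fingerprint(atoms, bonds):
    """Order-independent invariant of a molecular graph: the sorted
    multiset of (element, sorted neighbour (element, bond order) pairs).
    atoms is a list of element symbols; bonds is a list of (i, j, order)."""
    neigh = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neigh[i].append((atoms[j], order))
        neigh[j].append((atoms[i], order))
    return sorted((atoms[i], tuple(sorted(neigh[i])))
                  for i in range(len(atoms)))

def probably_equivalent(mol_a, mol_b):
    """True when the two (atoms, bonds) graphs share the same invariant;
    a cheap screen, not a full isomorphism test."""
    return graph_fingerprint(*mol_a) == graph_fingerprint(*mol_b)
```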
<p><strong>Results Table</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">System</th>
          <th style="text-align: left">Success Rate</th>
          <th style="text-align: left">Fail Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>OSRA</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>88.46%</strong></td>
          <td style="text-align: left">11.54%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">77.23%</td>
          <td style="text-align: left">22.77%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Maybridge</strong></td>
          <td style="text-align: left">MolRec</td>
          <td style="text-align: left"><strong>83.84%</strong></td>
          <td style="text-align: left">16.16%</td>
      </tr>
      <tr>
          <td style="text-align: left"></td>
          <td style="text-align: left">OSRA</td>
          <td style="text-align: left">72.57%</td>
          <td style="text-align: left">27.43%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Requirements not specified, but the approach (vectorization + rule matching) is computationally lightweight compared to modern deep learning methods.</li>
</ul>
]]></content:encoded></item><item><title>Tea Party in the House: Legislative Ideology via HIPTM</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/tea-party-hiptm/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/tea-party-hiptm/</guid><description>A hierarchical probabilistic model combining roll call votes, bill text, and legislative speeches to analyze political polarization and framing.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p><strong>Method</strong>.</p>
<p>This paper is primarily a <strong>Methodological</strong> contribution. It proposes a novel probabilistic architecture, the Hierarchical Ideal Point Topic Model (HIPTM), designed to address the limitations of existing political science models that typically rely on either voting data or text data in isolation. The paper validates this method by demonstrating its superior performance in predicting &ldquo;Tea Party&rdquo; membership compared to text-only baselines and its ability to provide interpretable &ldquo;framing&rdquo; analysis.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The primary motivation is to better understand political polarization, specifically the &ldquo;Tea Party&rdquo; phenomenon within the Republican party during the 112th Congress.</p>
<p>An ideal point is a scalar score representing a legislator&rsquo;s ideological position, estimated from voting patterns. Standard &ldquo;Ideal Point&rdquo; models (like DW-NOMINATE) typically project legislators onto a single liberal-conservative dimension using only binary voting data. This is insufficient for capturing complex, multi-dimensional intra-party conflicts where legislators might agree on votes but differ on policy &ldquo;framing&rdquo; or specific sub-issues. Furthermore, existing multi-dimensional models often produce dimensions that are difficult for humans to interpret.</p>
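<p>The standard one-dimensional setup can be written as a logistic item-response model; this sketch uses generic parameter names (<code>a_j</code>, <code>b_j</code> for a bill's polarity and popularity) rather than HIPTM's notation, which further decomposes <code>x_i</code> into per-issue ideal points.</p>

```python
import math

def vote_yes_probability(x_i, a_j, b_j):
    """P(legislator i votes yes on bill j) = sigma(a_j * x_i + b_j),
    where x_i is the legislator's scalar ideal point."""
    return 1.0 / (1.0 + math.exp(-(a_j * x_i + b_j)))
```

<p>For example, a legislator at $x_i = 2$ facing a bill with polarity $a_j = 1.5$ votes yes with probability $\sigma(3) \approx 0.95$; mirroring the ideal point to $x_i = -2$ flips that to $\approx 0.05$.</p>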
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Hierarchical Ideal Point Topic Model (HIPTM)</strong>. It distinguishes itself from prior work through three main technical innovations:</p>
<ol>
<li><strong>Joint Modeling of Three Data Sources</strong>: It integrates roll call votes, the text of bills, and the floor speeches of legislators into a single probabilistic framework.</li>
<li><strong>Hierarchical Topic Structure</strong>: It models &ldquo;frames&rdquo; as a second level of the topic hierarchy. &ldquo;Issues&rdquo; (level 1) are fixed and non-polarized, while &ldquo;Frames&rdquo; (level 2) are discovered dynamically and carry polarity (ideal point weights). For example, Health Care is an issue; &ldquo;government overreach&rdquo; vs. &ldquo;patient protection&rdquo; are frames legislators use when debating it.</li>
<li><strong>Text-Based Ideal Point Prediction</strong>: HIPTM regresses ideal points on speech text, allowing it to predict the political alignment of legislators based solely on their writing or speeches without requiring voting records for inference.</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model using data from the 112th U.S. Congress (Republican legislators only).</p>
<ul>
<li><strong>Prediction Task</strong>: Classifying legislators as members of the &ldquo;Tea Party Caucus&rdquo;.</li>
<li><strong>Baselines</strong>: The model was compared against Support Vector Machines (SVM) trained on:
<ul>
<li>TF-IDF vectors (Text only)</li>
<li>Normalized TF-IDF vectors (Text only)</li>
<li>Binary Vote vectors (Vote only)</li>
</ul>
</li>
<li><strong>Metric</strong>: Area Under the Receiver Operating Characteristic Curve (AUC-ROC) via 5-fold cross-validation.</li>
<li><strong>Qualitative Analysis</strong>: The authors examined the &ldquo;span&rdquo; of ideal points within specific topics (e.g., Macroeconomics, Health) to identify which issues were most polarized between Tea Party and Establishment Republicans.</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Quantitative Performance</strong>: HIPTM features combined with voting data (HIPTM-VOTE) achieved the highest classification performance (AUC-ROC in the ~0.70-0.75 range, approximate, read from Figure 2). Vote-only features slightly trail HIPTM-VOTE, while text-only baselines (TF-IDF, normalized TF-IDF) fall considerably lower. The one-dimensional Tea Party ideal points correlate with DW-NOMINATE ($\rho = 0.91$). When voting data was withheld (simulating a candidate without a record), HIPTM&rsquo;s text-based features outperformed standard text baselines TF-IDF and normalized TF-IDF (approximate, read from Figure 3).</li>
<li><strong>Political Insight</strong>: The model identified &ldquo;Government Operations,&rdquo; &ldquo;Macroeconomics,&rdquo; and &ldquo;Transportation&rdquo; as the three most polarized topics between Tea Party and establishment Republicans.</li>
<li><strong>Framing Analysis</strong>: The hierarchical topic structure reveals how legislators frame issues differently. For Macroeconomics, frame M3 (most Tea Party-oriented) focuses on criticizing government overspending, while frame M1 (least Tea Party-oriented) focuses on the downsides of a government shutdown. For Health, frame H3 captures Tea Party framing of the Affordable Care Act as an unconstitutional government takeover, while frame H1 frames opposition in terms of implementation costs and health care exchanges.</li>
<li><strong>Framing vs. Voting Taxonomy</strong>: The authors construct a 2x2 taxonomy of disagreement across issues, crossing whether ideal points are polarized with whether issue frames are polarized. Issues like Civil Rights fall in the &ldquo;neither polarized&rdquo; quadrant, where cooperation is expected. Banking/Finance and Transportation fall in the &ldquo;ideal points polarized, frames not&rdquo; quadrant, where Republicans frame the issue similarly but have underlying policy disagreements. Issues like Health and Public Lands fall in the &ldquo;frames polarized, ideal points not&rdquo; quadrant: Republicans voted similarly but framed the issue very differently. Issues like Macroeconomics and Government Operations fall in the &ldquo;both polarized&rdquo; quadrant, posing the greatest challenge for Republican leadership.</li>
<li><strong>Sub-group Identification</strong>: The model identifies legislators whose language marks them as ideologically aligned with the Tea Party even without formal caucus membership. For example, Jeff Flake (R-AZ) received the second-highest ideal point, disagreeing with Freedom Works on only one of 60 key votes, despite not being a Tea Party Caucus member. Justin Amash (R-MI), founder and chairman of the Liberty Caucus, agreed with Freedom Works on every key vote since 2011. Conversely, some self-identified Tea Partiers like Rodney Alexander (R-LA) only agreed with Freedom Works 48% of the time. Alexander and Ander Crenshaw (R-FL, 50% agreement) are categorized as &ldquo;Green Tea&rdquo; by Gervais and Morris (2014): Republican legislators who associate with the Tea Party on their own initiative but lack support from Tea Party organizations.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<ul>
<li>HIPTM does not formally distinguish frames from other kinds of subtopics. For example, the model discovered a strongly Tea Party-oriented frame under &ldquo;Labor, Employment and Immigration&rdquo; that reflected a Boeing labor dispute specific to South Carolina legislators, capturing geographic rather than ideological framing.</li>
<li>The model is validated only on Republican legislators in the 112th Congress. Generalization to other parties, chambers, or time periods is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study focuses on the <strong>112th U.S. Congress</strong> (Jan 2011 - Jan 2013).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Subjects</strong></td>
          <td>Republican Legislators</td>
          <td>240 Reps</td>
          <td>60 are Tea Party Caucus members.</td>
      </tr>
      <tr>
          <td><strong>Votes</strong></td>
          <td>Roll Call Votes</td>
          <td>13,856 votes</td>
          <td>Agreement/disagreement with Freedom Works on 60 key votes (40 in 2011, 20 in 2012).</td>
      </tr>
      <tr>
          <td><strong>Text</strong></td>
          <td>Floor Speeches</td>
          <td>5,349 word types</td>
          <td>Sourced from GovTrack. Vocabulary size after preprocessing.</td>
      </tr>
      <tr>
          <td><strong>Priors</strong></td>
          <td>Congressional Bills Project</td>
          <td>19 Topics</td>
          <td>Used to set informed priors $\phi^*_k$ for top-level issues.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The model uses a <strong>Stochastic EM</strong> approach for inference.</p>
<ul>
<li><strong>Generative Process</strong>:
<ul>
<li><strong>Speeches</strong>: Modeled as a mixture of $K$ Hierarchical Dirichlet Processes (HDPs). A legislator chooses an issue $z$, then a frame $t$ from a Dirichlet Process, then a word $w$.</li>
<li><strong>Bills</strong>: Modeled using Latent Dirichlet Allocation (LDA). Each bill is a mixture over $K$ issues.</li>
<li><strong>Votes</strong>: Modeled via a probabilistic ideal point function (logistic/inverse-logit). The probability of a &ldquo;Yes&rdquo; vote depends on the bill&rsquo;s polarity $x_b$, popularity $y_b$, and the legislator&rsquo;s issue-specific ideal point $u_{a,k}$.</li>
</ul>
</li>
<li><strong>Optimization Steps</strong>:
<ol>
<li><strong>Sampling</strong>: Issue assignments $z$ and frame assignments $t$ are sampled for tokens in speeches and bills.</li>
<li><strong>Regression</strong>: Frame-specific regression weights $\eta_{k,j}$ are optimized using <strong>L-BFGS</strong>.</li>
<li><strong>Ideal Points</strong>: Legislator ideal points $u_{a,k}$ and bill parameters ($x_b, y_b$) are updated using <strong>Gradient Ascent</strong>.</li>
</ol>
</li>
</ul>
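<p>A minimal sketch of the vote component above, assuming the standard ideal-point parameterization in which the bill&rsquo;s polarity multiplies the legislator&rsquo;s issue-specific ideal point and the bill&rsquo;s popularity enters as an additive offset (the paper&rsquo;s Section 4 gives the exact form):</p>

```python
import math

def p_yes(u_ak, x_b, y_b):
    """Probability that legislator a votes 'Yes' on bill b under a
    logistic ideal-point model: alignment between the legislator's
    issue-specific ideal point u_ak and the bill's polarity x_b,
    plus the bill's popularity y_b, raises the odds of a Yes vote.
    (Parameterization assumed; see the paper for the exact form.)"""
    return 1.0 / (1.0 + math.exp(-(u_ak * x_b + y_b)))

# For a bill with positive polarity, a legislator with a positive ideal
# point is far more likely to vote Yes than one with a negative one.
print(p_yes(1.5, 2.0, 0.0))   # well above 0.5
print(p_yes(-1.5, 2.0, 0.0))  # well below 0.5
```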
<h3 id="models">Models</h3>
<ul>
<li><strong>Ideal Point Definition</strong>: A legislator&rsquo;s ideal point on issue $k$ ($u_{a,k}$) is defined as a linear combination of the ideal points of the <em>frames</em> they use ($\eta_{k,j}$), weighted by their usage frequency ($\hat{\psi}_{a,k,j}$).</li>
<li><strong>Topic Hierarchy</strong>:
<ul>
<li><strong>Level 1 (Issues)</strong>: Fixed at $K=19$ (based on Policy Agendas Project major headings). These nodes use informed Dirichlet priors.</li>
<li><strong>Level 2 (Frames)</strong>: Unbounded number of frames per issue, discovered non-parametrically via Dirichlet Process.</li>
</ul>
</li>
<li><strong>Prediction Features</strong>: The model runs for 1,000 iterations total with a 500-iteration burn-in. After burn-in, the sampled state is kept every 50 iterations, and feature values are averaged over the 10 stored models.</li>
</ul>
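<p>The linear-combination definition of the ideal point can be sketched directly (the frame values below are hypothetical, for illustration only):</p>

```python
def issue_ideal_point(frame_weights, frame_usage):
    """Legislator's ideal point on issue k: the usage-weighted average
    of the ideal points of the frames they use.
    frame_weights: eta_{k,j}, one scalar per frame j under issue k.
    frame_usage:   psi-hat_{a,k,j}, empirical usage frequencies (sum to 1)."""
    return sum(eta * psi for eta, psi in zip(frame_weights, frame_usage))

# Two hypothetical frames under one issue: a strongly Tea Party frame
# (eta = 2.0) and an establishment frame (eta = -1.0). A legislator using
# the first 75% of the time lands at 2.0*0.75 + (-1.0)*0.25 = 1.25.
print(issue_ideal_point([2.0, -1.0], [0.75, 0.25]))  # 1.25
```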
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: AUC-ROC (Area Under the Receiver Operating Characteristic Curve).</li>
<li><strong>Classifier</strong>: $\text{SVM}^{\text{light}}$ (Joachims, 1999).</li>
<li><strong>Cross-Validation</strong>: 5-fold stratified sampling.</li>
</ul>
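<p>For reference, AUC-ROC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal rank-based sketch of the metric itself (not the paper&rsquo;s $\text{SVM}^{\text{light}}$ pipeline):</p>

```python
def auc_roc(labels, scores):
    """AUC-ROC via the rank (Mann-Whitney) formulation: the fraction of
    (positive, negative) pairs where the positive is scored higher,
    counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking gives 1.0; a fully reversed ranking gives 0.0.
print(auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
print(auc_roc([1, 0, 1, 0], [0.2, 0.9, 0.4, 0.6]))  # 0.0
```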
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.govtrack.us/">GovTrack Congressional Speeches</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Source of floor speech text</td>
      </tr>
      <tr>
          <td><a href="http://www.congressionalbills.org/">Congressional Bills Project</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Bill text with Policy Agendas Project topic labels</td>
      </tr>
      <tr>
          <td>Freedom Works Key Votes</td>
          <td>Dataset</td>
          <td>Public</td>
          <td>60 key votes used to define Tea Party alignment (freedomworks.org is no longer available)</td>
      </tr>
  </tbody>
</table>
<p>No official code release accompanies this paper. The inference algorithm (Stochastic EM with Gibbs sampling, L-BFGS, and gradient ascent) is described in detail in Section 4 of the paper, but a full reimplementation would be required.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nguyen, V., Boyd-Graber, J., Resnik, P., &amp; Miler, K. (2015). Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress. <em>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics</em>, 1438-1448. <a href="https://doi.org/10.3115/v1/P15-1139">https://doi.org/10.3115/v1/P15-1139</a></p>
<p><strong>Publication</strong>: ACL 2015</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{nguyenTeaPartyHouse2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Tea {{Party}} in the {{House}}: {{A Hierarchical Ideal Point Topic Model}} and {{Its Application}} to {{Republican Legislators}} in the 112th {{Congress}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Tea {{Party}} in the {{House}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 53rd {{Annual Meeting}} of the {{Association}} for {{Computational Linguistics}} and the 7th {{International Joint Conference}} on {{Natural Language Processing}} ({{Volume}} 1: {{Long Papers}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Nguyen, Viet-An and {Boyd-Graber}, Jordan and Resnik, Philip and Miler, Kristina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1438--1448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Beijing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3115/v1/P15-1139}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2023-11-02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We introduce the Hierarchical Ideal Point Topic Model, which provides a rich picture of policy issues, framing, and voting behavior using a joint model of votes, bill text, and the language that legislators use when debating bills. We use this model to look at the relationship between Tea Party Republicans and ``establishment&#39;&#39; Republicans in the U.S. House of Representatives during the 112th Congress.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://aclanthology.org/P15-1139/">ACL Anthology: Tea Party in the House</a></li>
<li>Gervais, B. T., &amp; Morris, I. L. (2012). Reading the tea leaves: Understanding Tea Party Caucus membership in the US House of Representatives. <em>PS: Political Science &amp; Politics</em>, 45(2), 245-250.</li>
<li>Gervais, B. T., &amp; Morris, I. L. (2014). Black Tea, Green Tea, White Tea, and Coffee: Understanding the variation in attachment to the Tea Party among members of Congress. In <em>Annual Meeting of the American Political Science Association</em>. (Source of the &ldquo;Green Tea&rdquo; Republican taxonomy cited in the paper)</li>
</ul>
]]></content:encoded></item><item><title>Stillinger-Weber Potential for Silicon Simulation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/stillinger-weber-1985/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/stillinger-weber-1985/</guid><description>The 1985 paper introducing the Stillinger-Weber potential, a 3-body interaction model for molecular dynamics of tetrahedral semiconductors.</description><content:encoded><![CDATA[<h2 id="core-methodological-contribution">Core Methodological Contribution</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>Its primary contribution is the formulation of the <strong>Stillinger-Weber potential</strong>, a non-additive potential energy function designed to model tetrahedral semiconductors. The paper also uses molecular dynamics simulation to explore physical properties of silicon in both crystalline and liquid phases, but the methodological contribution (the potential architecture) is what enabled subsequent research on covalent materials.</p>
<h2 id="the-failure-of-pair-potentials-in-silicon">The Failure of Pair Potentials in Silicon</h2>
<p>The authors aimed to simulate the melting and liquid properties of tetrahedral semiconductors (Silicon and Germanium).</p>
<ul>
<li><strong>The Problem:</strong> Standard pair potentials (like Lennard-Jones) favor close-packed structures (12 nearest neighbors) and cannot stabilize the open diamond structure (4 nearest neighbors) of Silicon.</li>
<li><strong>The Gap:</strong> Earlier classical potentials lacked the flexibility to describe the profound structural change where Silicon shrinks upon melting (coordination number increases from 4 to &gt;6) while becoming metallic and electrically conductive.</li>
<li><strong>The Goal:</strong> To construct a potential that spans the entire configuration space, describing both the rigid crystal and the diffusive liquid, without requiring quantum mechanical calculations.</li>
</ul>
<h2 id="the-three-body-interaction-novelty">The Three-Body Interaction Novelty</h2>
<p>The core novelty is the introduction of a stabilizing <strong>three-body interaction term</strong> ($v_3$) to the potential energy function.</p>
<ul>
<li><strong>3-Body Term:</strong> Explicitly penalizes deviations from the ideal tetrahedral angle ($\cos \theta_t = -1/3$).</li>
<li><strong>Unified Model:</strong> The potential handles bond breaking and reforming, allowing simulation of both melting and liquid diffusion; previous &ldquo;Keating&rdquo;-type potentials modeled only small elastic deformations.</li>
<li><strong>Mapping Technique:</strong> The application of &ldquo;steepest-descent mapping&rdquo; to quench dynamical configurations into their underlying &ldquo;inherent structures&rdquo; (local minima), revealing the fundamental topology of the liquid energy landscape.</li>
</ul>
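<p>The steepest-descent mapping can be illustrated on a toy double-well landscape standing in for the full $\Phi$ (the potential and step size here are illustrative choices, not from the paper):</p>

```python
def quench(x, grad, eta=1e-3, steps=20000):
    """Steepest-descent mapping: follow -grad(Phi) from an instantaneous
    configuration down to its inherent structure, i.e. the local minimum
    whose basin of attraction contains it."""
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Toy double well Phi(x) = (x^2 - 1)^2: configurations on either side of
# the barrier at x = 0 quench to their own inherent structure at x = -1
# or x = +1, just as liquid snapshots map to amorphous-network minima.
grad = lambda x: 4.0 * x * (x**2 - 1.0)
print(quench(0.5, grad))   # ~ +1.0
print(quench(-0.3, grad))  # ~ -1.0
```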
<h2 id="molecular-dynamics-validation">Molecular Dynamics Validation</h2>
<p>The authors performed Molecular Dynamics (MD) simulations using the proposed potential.</p>
<ul>
<li><strong>System:</strong> 216 Silicon atoms in a cubic cell with periodic boundary conditions.</li>
<li><strong>State Points:</strong> Fixed density $\rho = 2.53 \text{ g/cm}^3$ (matching experimental liquid density at melting).</li>
<li><strong>Process:</strong>
<ol>
<li>Start with diamond crystal at low temperature.</li>
<li>Systematically heat to induce spontaneous nucleation and melting.</li>
<li>Equilibrate the liquid.</li>
<li>Periodically map configurations to potential minima (inherent structures) using steepest descent.</li>
</ol>
</li>
</ul>
<h2 id="phase-topology-and-inverse-lindemann-criterion">Phase Topology and Inverse Lindemann Criterion</h2>
<ul>
<li><strong>Validation:</strong> The potential successfully stabilizes the diamond structure as the global minimum at zero pressure.</li>
<li><strong>Liquid Structure:</strong> The simulated liquid pair-correlation function $g(r)$ and structure factor $S(k)$ qualitatively match experimental diffraction data, including the characteristic shoulder on the structure factor peak.</li>
<li><strong>Inherent Structure:</strong> The liquid possesses a temperature-independent inherent structure (amorphous network) hidden beneath thermal vibrations.</li>
<li><strong>Melting/Freezing Criteria:</strong> The study proposes an &ldquo;Inverse Lindemann Criterion&rdquo;: while crystals melt when vibration amplitude exceeds ~0.19 lattice spacings, liquids freeze when atom displacements from their inherent minima drop below ~0.30 neighbor spacings.</li>
</ul>
<h2 id="limitations-and-energy-scale-problem">Limitations and Energy Scale Problem</h2>
<p>The authors acknowledge a quantitative energy scale discrepancy. To match the observed melting temperature of Si ($1410°$C), $\epsilon$ would need to be approximately 42 kcal/mol, considerably less than the 50 kcal/mol required to reproduce the correct cohesive energy of the crystal. The authors suggest this could be resolved either by further optimization of $v_2$ and $v_3$, or by adding position-independent single-particle terms $v_1 \approx -16$ kcal/mol arising from the electronic structure. Adding $v_1$ terms only affects the temperature scale and has no influence on local structure at a given reduced temperature.</p>
<p>The simulated liquid coordination number (8.07) is also higher than the experimentally reported value of approximately 6.4, though the authors note that the experimental definition of &ldquo;nearest neighbors&rdquo; was not precisely stated.</p>
<h2 id="bonding-statistics-in-inherent-structures">Bonding Statistics in Inherent Structures</h2>
<p>Analysis of potential-energy minima (inherent structures) using a bond cutoff of $r/\sigma = 1.40$ reveals the coordination distribution in the liquid:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Coordination Number</th>
          <th style="text-align: left">Fraction of Atoms</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">4</td>
          <td style="text-align: left">0.201</td>
      </tr>
      <tr>
          <td style="text-align: left">5</td>
          <td style="text-align: left">0.568</td>
      </tr>
      <tr>
          <td style="text-align: left">6</td>
          <td style="text-align: left">0.205</td>
      </tr>
      <tr>
          <td style="text-align: left">7</td>
          <td style="text-align: left">0.024</td>
      </tr>
  </tbody>
</table>
<p>Five-coordinate atoms dominate the liquid&rsquo;s inherent structure, with four- and six-coordinate atoms each accounting for about 20% of the population. The three-body interactions prevent any occurrence of coordination numbers near 12 that would indicate local close packing.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Integration:</strong> Equations of motion integrated using a <strong>fifth-order Gear algorithm</strong>.</li>
<li><strong>Time Step:</strong> $\Delta t = 5 \times 10^{-3} \tau$ (approx $3.83 \times 10^{-16}$ s), where $\tau = \sigma(m/\epsilon)^{1/2} = 7.6634 \times 10^{-14}$ s.</li>
<li><strong>Minimization:</strong> Steepest-descent mapping utilized <strong>Newton&rsquo;s method</strong> to find limiting solutions ($\nabla \Phi = 0$).</li>
</ul>
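<p>The quoted reduced time unit can be verified from $\tau = \sigma(m/\epsilon)^{1/2}$ using the paper&rsquo;s $\sigma$ and $\epsilon$; the silicon atomic mass is supplied here as an outside value (it is not stated in this note):</p>

```python
import math

# Check the quoted reduced time unit tau = sigma * sqrt(m / epsilon).
sigma = 0.20951e-9            # m   (from the paper)
eps = 3.4723e-19              # J   (50 kcal/mol per particle, from the paper)
m_si = 28.0855 * 1.66054e-27  # kg  (silicon atomic mass; assumed, not in the note)

tau = sigma * math.sqrt(m_si / eps)
print(tau)          # ~7.67e-14 s, matching the quoted 7.6634e-14 s
dt = 5e-3 * tau
print(dt)           # ~3.83e-16 s time step
```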
<h3 id="models">Models</h3>
<p>To reproduce this work, one must implement the potential $\Phi = \sum v_2 + \sum v_3$ with the exact functional forms and parameters provided.</p>















<figure class="post-figure center ">
    <img src="/img/notes/chemistry/stillinger-weber-potential.webp"
         alt="Stillinger-Weber potential visualization"
         title="Stillinger-Weber potential visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Left: Two-body radial potential $v_2(r)$ showing the characteristic well at $r_{min} \approx 1.12\sigma$. Right: Three-body angular penalty $h(r_{min}, r_{min}, \theta)$ demonstrating the minimum at the tetrahedral angle (109.5°), which enforces the diamond crystal structure.</figcaption>
    
</figure>

<h4 id="reduced-units">Reduced Units</h4>
<ul>
<li>$\sigma = 0.20951 \text{ nm}$</li>
<li>$\epsilon = 50 \text{ kcal/mol} = 3.4723 \times 10^{-12} \text{ erg}$</li>
</ul>
<h4 id="two-body-term-v_2">Two-Body Term ($v_2$)</h4>
<p>$$
v_2(r_{ij}) = \epsilon A (B r_{ij}^{-p} - r_{ij}^{-q}) \exp[(r_{ij} - a)^{-1}] \quad \text{for } r_{ij} &lt; a
$$</p>
<p><em>(Vanishes for $r \geq a$)</em></p>
<h4 id="three-body-term-v_3">Three-Body Term ($v_3$)</h4>
<p>$$
v_3(r_i, r_j, r_k) = \epsilon [h(r_{ij}, r_{ik}, \theta_{jik}) + h(r_{ji}, r_{jk}, \theta_{ijk}) + h(r_{ki}, r_{kj}, \theta_{ikj})]
$$</p>
<p>where:</p>
<p>$$
h(r_{ij}, r_{ik}, \theta_{jik}) = \lambda \exp[\gamma(r_{ij}-a)^{-1} + \gamma(r_{ik}-a)^{-1}] (\cos\theta_{jik} + \frac{1}{3})^2
$$</p>
<p><em>(Vanishes if distances $\geq a$)</em></p>
<h4 id="parameters">Parameters</h4>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">$A$</td>
          <td style="text-align: left">$7.049556277$</td>
      </tr>
      <tr>
          <td style="text-align: left">$B$</td>
          <td style="text-align: left">$0.6022245584$</td>
      </tr>
      <tr>
          <td style="text-align: left">$p$</td>
          <td style="text-align: left">$4$</td>
      </tr>
      <tr>
          <td style="text-align: left">$q$</td>
          <td style="text-align: left">$0$</td>
      </tr>
      <tr>
          <td style="text-align: left">$a$</td>
          <td style="text-align: left">$1.80$</td>
      </tr>
      <tr>
          <td style="text-align: left">$\lambda$</td>
          <td style="text-align: left">$21.0$</td>
      </tr>
      <tr>
          <td style="text-align: left">$\gamma$</td>
          <td style="text-align: left">$1.20$</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>The paper evaluates the model against experimental diffraction data.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Simulated Value</th>
          <th style="text-align: left">Experimental Value</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Melting Point ($T_m^*$)</strong></td>
          <td style="text-align: left">$\approx 0.080$</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Reduced units. Requires $\epsilon \approx 42$ kcal/mol to match real $T_m = 1410°$C, vs 50 kcal/mol for correct cohesive energy.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Coordination (Liquid)</strong></td>
          <td style="text-align: left">$8.07$</td>
          <td style="text-align: left">$\approx 6.4$</td>
          <td style="text-align: left">Evaluated at first $g(r)$ minimum ($r/\sigma = 1.625$). Simulated value is higher than experiment.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ First Peak</strong></td>
          <td style="text-align: left">$2.53$ $\AA^{-1}$</td>
          <td style="text-align: left">$2.80$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ Shoulder</strong></td>
          <td style="text-align: left">$3.25$ $\AA^{-1}$</td>
          <td style="text-align: left">$3.25$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I. Exact match with experiment.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ Second Peak</strong></td>
          <td style="text-align: left">$5.35$ $\AA^{-1}$</td>
          <td style="text-align: left">$5.75$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ Third Peak</strong></td>
          <td style="text-align: left">$8.16$ $\AA^{-1}$</td>
          <td style="text-align: left">$8.50$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$S(k)$ Fourth Peak</strong></td>
          <td style="text-align: left">$10.60$ $\AA^{-1}$</td>
          <td style="text-align: left">$11.20$ $\AA^{-1}$</td>
          <td style="text-align: left">From Table I.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Entropy of Melting ($\Delta S / N k_B$)</strong></td>
          <td style="text-align: left">$\approx 3.7$</td>
          <td style="text-align: left">$3.25$</td>
          <td style="text-align: left">Simulated at constant volume; experimental at constant pressure (1 atm).</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Stillinger, F. H., &amp; Weber, T. A. (1985). Computer simulation of local order in condensed phases of silicon. <em>Physical Review B</em>, 31(8), 5262-5271. <a href="https://doi.org/10.1103/PhysRevB.31.5262">https://doi.org/10.1103/PhysRevB.31.5262</a></p>
<p><strong>Publication</strong>: Physical Review B, 1985</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{stillingerComputerSimulationLocal1985,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Computer Simulation of Local Order in Condensed Phases of Silicon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Stillinger, Frank H. and Weber, Thomas A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1985</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Physical Review B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5262--5271}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Physical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1103/PhysRevB.31.5262}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Second-Order Langevin Equation for Field Simulations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/second-order-langevin-1987/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/second-order-langevin-1987/</guid><description>Hyperbolic Algorithm adds second-order derivatives to Langevin dynamics, reducing systematic errors to O(ε²) for lattice field simulations.</description><content:encoded><![CDATA[<h2 id="contribution-and-paper-type">Contribution and Paper Type</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$). It proposes a novel stochastic algorithm, the Hyperbolic Algorithm (HA), and validates its superior efficiency against the existing Langevin Algorithm (LA) through formal error analysis and numerical simulation. It contains significant theoretical derivation (Liouville dynamics) that serves primarily to justify the algorithmic performance claims.</p>
<h2 id="motivation-and-gaps-in-prior-work">Motivation and Gaps in Prior Work</h2>
<p>The standard Langevin Algorithm (LA) for numerical simulation of Euclidean field theories suffers from efficiency bottlenecks. The simplest Euler-discretization of the LA introduces systematic errors of $O(\epsilon)$ (where $\epsilon$ is the step size). To maintain accuracy, $\epsilon$ must be kept small, which increases the sweep-sweep correlation time (autocorrelation time), making simulations computationally expensive.</p>
<h2 id="core-novelty-second-order-dynamics">Core Novelty: Second-Order Dynamics</h2>
<p>The core contribution is the introduction of a <strong>second-order derivative in fictitious time</strong> to the stochastic equation. This converts the parabolic Langevin equation into a hyperbolic equation:</p>
<p>$$
\begin{aligned}
\frac{\partial^{2}\phi}{\partial t^{2}}+\gamma\frac{\partial\phi}{\partial t}=-\frac{\partial S}{\partial\phi}+\eta
\end{aligned}
$$</p>
<h3 id="equation-comparison">Equation Comparison</h3>
<p>The key difference from the standard (first-order) Langevin equation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Equation Type</th>
          <th style="text-align: left">Formula</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Hyperbolic (Second Order)</strong></td>
          <td style="text-align: left">$$\frac{\partial^{2}\phi}{\partial t^{2}}+\gamma\frac{\partial\phi}{\partial t}=-\frac{\partial S}{\partial\phi}+\eta$$</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Langevin (First Order)</strong></td>
          <td style="text-align: left">$$\frac{\partial\phi}{\partial t}=-\frac{\partial S}{\partial\phi}+\eta$$</td>
      </tr>
  </tbody>
</table>
<p>The standard Langevin equation corresponds to the overdamped limit where the acceleration term is absent. Physically, the Hyperbolic equation can be viewed as microcanonical equations of motion with an added friction term.</p>
<h3 id="key-innovations">Key Innovations</h3>
<ul>
<li><strong>Higher Order Accuracy</strong>: The simplest discretization of this equation leads to systematic errors of only $O(\epsilon^2)$ compared to $O(\epsilon)$ for LA.</li>
<li><strong>Tunable Damping</strong>: The addition of the damping parameter $\gamma$ allows tuning to minimize autocorrelation tails.</li>
<li><strong>Uniform Evolution</strong>: The method evolves structures of different wavelengths more uniformly than LA due to the specific dissipation structure.</li>
</ul>
<h2 id="methodology-and-experiments">Methodology and Experiments</h2>
<p>The author validated the method using the <strong>XY Model</strong> on 2D lattices.</p>
<ul>
<li><strong>System</strong>: Euclidean action $S = -\sum_{x,\mu} \cos(\theta_{x+\mu} - \theta_x)$.</li>
<li><strong>Setup</strong>:
<ul>
<li>Lattice sizes: $15^2$ (helical boundary conditions) and $30^2$.</li>
<li>$\beta$ range: 0.9 to 1.2 (crossing the critical point $\approx 1.0$).</li>
<li>Run length: &gt;100,000 updates in equilibrium.</li>
</ul>
</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Autocorrelation time ($\tau$)</strong>: Defined as the number of updates for the time-correlation function to drop to 10% of its initial value.</li>
<li><strong>Systematic Error</strong>: Measured via deviation of average action from Monte Carlo values.</li>
</ul>
</li>
</ul>
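<p>The XY-model action is straightforward to evaluate. A sketch on a small square lattice with periodic boundary conditions (the paper&rsquo;s $15^2$ run uses helical boundaries; periodic ones are used here for simplicity):</p>

```python
import math

def xy_action(theta):
    """Euclidean XY-model action S = -sum_{x,mu} cos(theta_{x+mu} - theta_x)
    on an L x L lattice, summing each bond once via the two forward
    directions, with periodic boundary conditions."""
    L = len(theta)
    S = 0.0
    for x in range(L):
        for y in range(L):
            S -= math.cos(theta[(x + 1) % L][y] - theta[x][y])  # mu = x-hat
            S -= math.cos(theta[x][(y + 1) % L] - theta[x][y])  # mu = y-hat
    return S

# Fully ordered configuration: every bond contributes -1, and a 4x4
# lattice has 2 * 16 = 32 bonds.
print(xy_action([[0.0] * 4 for _ in range(4)]))  # -32.0
```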
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Efficiency</strong>: The Hyperbolic Algorithm (HA) is far more efficient. For equal systematic errors, sweep-sweep correlation times are significantly lower than LA.</li>
<li><strong>Error Scaling</strong>: Numerical results confirmed that HA step size $\epsilon_H = 0.1$ yields systematic errors comparable to LA step size $\epsilon_L \approx 0.008$ ($O(\epsilon^2)$ vs $O(\epsilon)$ scaling).</li>
<li><strong>Speedup</strong>: In the disordered phase, HA is roughly $\epsilon_H / \epsilon_L$ times faster (approximately a factor of 12.5 for $\epsilon_H = 0.1$, $\epsilon_L = 0.008$). In the ordered phase, efficiency gains increase with distance scale, reaching factors of 20 or more for long-range correlations.</li>
<li><strong>Optimal Damping</strong>: For the XY model, the optimal damping parameter was found to be $\gamma \approx 0.4$.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. The Hyperbolic Algorithm (HA)</strong></p>
<p>The discretized update equations for scalar fields are:</p>
<p>$$
\begin{aligned}
\pi_{t+\epsilon} - \pi_{t} &amp;= -\epsilon\gamma\pi_{t} - \epsilon\frac{\partial S}{\partial\phi_{t}} + \sqrt{2\epsilon\gamma/\beta}\xi_{t} \\
\phi_{t+\epsilon} - \phi_{t} &amp;= \epsilon\pi_{t+\epsilon}
\end{aligned}
$$</p>
<ul>
<li><strong>Variables</strong>: $\phi$ is the field, $\pi$ is the conjugate momentum ($\dot{\phi}$).</li>
<li><strong>Parameters</strong>: $\epsilon$ (step size), $\gamma$ (damping constant).</li>
<li><strong>Noise</strong>: $\xi$ is Gaussian noise with $\langle\xi_x \xi_y\rangle = \delta_{x,y}$.</li>
<li><strong>Storage</strong>: Requires storing both $\phi$ and $\pi$ vectors.</li>
</ul>
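<p>For concreteness, the two update equations can be sketched for the 2D XY model used in the paper's experiments (the helper names and the vectorized gradient are my own; the paper states the updates for general scalar fields):</p>

```python
import numpy as np

def grad_xy(theta):
    """dS/dtheta for S = -sum_{x,mu} cos(theta_{x+mu} - theta_x) on a 2D lattice
    with periodic boundaries."""
    g = np.zeros_like(theta)
    for ax in (0, 1):
        fwd = np.roll(theta, -1, axis=ax) - theta   # theta_{x+mu} - theta_x
        bwd = theta - np.roll(theta, 1, axis=ax)    # theta_x - theta_{x-mu}
        g += -np.sin(fwd) + np.sin(bwd)
    return g

def ha_update(phi, pi, eps=0.1, gamma=0.4, beta=1.0, rng=None):
    """One Hyperbolic Algorithm step: momentum first, then the field is
    advanced with the *updated* momentum (this ordering is what the
    O(eps^2) analysis relies on)."""
    rng = rng or np.random.default_rng()
    xi = rng.standard_normal(phi.shape)             # <xi_x xi_y> = delta_{x,y}
    pi = pi - eps * gamma * pi - eps * grad_xy(phi) + np.sqrt(2 * eps * gamma / beta) * xi
    phi = phi + eps * pi
    return phi, pi
```

<p>As noted above, both $\phi$ and $\pi$ must be kept in memory between sweeps.</p>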
<p><strong>2. Non-Abelian Generalization</strong></p>
<p>For Lie group elements $U$ with generators $T^a$:</p>
<p>$$
\begin{aligned}
\pi_{t+\epsilon}^a - \pi_{t}^a &amp;= -\epsilon\gamma\pi_{t}^a - \epsilon\delta^a S[U_t] + \sqrt{2\epsilon\gamma/\beta}\xi_{t}^a \\
U_{t+\epsilon} &amp;= e^{i\epsilon\pi_{t+\epsilon}^a T^a} U_t
\end{aligned}
$$</p>
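<p>The group-multiplication step has a closed form for SU(2), where $T^a = \sigma^a/2$ gives $e^{i\epsilon\pi^a T^a} = \cos(\epsilon|\pi|/2)\,\mathbb{1} + i\sin(\epsilon|\pi|/2)\,\hat{\pi}\cdot\sigma$. A sketch of just this step (the momentum update is as in the scalar case; the force $\delta^a S$ is model-dependent and omitted here, and all names are my own):</p>

```python
import numpy as np

SIGMA = np.array([[[0, 1], [1, 0]],
                  [[0, -1j], [1j, 0]],
                  [[1, 0], [0, -1]]])  # Pauli matrices, generators T^a = sigma^a / 2

def su2_update(U, pi, eps):
    """U <- exp(i * eps * pi^a T^a) U via the closed-form SU(2) exponential."""
    n = np.linalg.norm(pi)
    axis = pi / n if n > 0 else np.zeros(3)
    half = eps * n / 2.0
    rot = np.cos(half) * np.eye(2) + 1j * np.sin(half) * np.tensordot(axis, SIGMA, axes=1)
    return rot @ U
```

<p>Because the update is an exact group element, the link stays in SU(2) to machine precision regardless of step size.</p>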
<h3 id="theoretical-proof-of-oepsilon2-accuracy">Theoretical Proof of $O(\epsilon^2)$ Accuracy</h3>
<p>The derivation relies on the generalized Liouville equation for the probability distribution $P[\phi, \pi; t]$.</p>
<ol>
<li><strong>Transition Probability</strong>: The transition probability $W$ corresponding to a single iteration of the discretized update equations is written down.</li>
<li><strong>Effective Liouville Operator</strong>: The evolution is written as $P(t+\epsilon) = \exp(\epsilon L_{\text{eff}}) P(t)$.</li>
<li><strong>Baker-Hausdorff Expansion</strong>: Using normal ordering of operators, the equilibrium distribution $P_{\text{eq}}$ is derived through $O(\epsilon^2)$:</li>
</ol>
<p>$$
\begin{aligned}
P_{\text{eq}} &amp;= \exp\left\lbrace-\frac{1}{2}\beta_{1}\sum_{x}\pi_{x}^{2} - \beta S[\phi] + \frac{1}{2}\epsilon\beta\sum_{x}\pi_{x}S_{x} + \epsilon^{2}G + O(\epsilon^3)\right\rbrace
\end{aligned}
$$</p>
<p>where $\beta_1 = \beta\left(1 - \frac{1}{2}\epsilon\gamma\right)$.</p>
<ol start="4">
<li><strong>Effective Action</strong>: Integrating out $\pi$ yields the effective action for $\phi$:</li>
</ol>
<p>$$
\begin{aligned}
S_{\text{eff}}[\phi] &amp;= S[\phi] - \frac{1}{8}\epsilon^2 \sum_x S_x^2 + \dots
\end{aligned}
$$</p>
<p>The absence of $O(\epsilon)$ terms proves the higher-order accuracy.</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Model</strong>: XY Model (2D)</li>
<li><strong>Hamiltonian</strong>: $H = \frac{1}{2}\sum \pi^2 + S[\phi]$ where $S = -\sum \cos(\Delta \theta)$.</li>
<li><strong>Observables</strong>:
<ul>
<li>$\Gamma_n = \langle \cos(\theta_{m+n} - \theta_m) \rangle_m$ (averaged over lattice sites $m$).</li>
</ul>
</li>
<li><strong>Comparisons</strong>:
<ul>
<li><strong>LA Step</strong>: $\epsilon_L \approx 0.005 - 0.02$.</li>
<li><strong>HA Step</strong>: $\epsilon_H \approx 0.1 - 0.2$.</li>
<li><strong>Equivalence</strong>: $\epsilon_H = 0.1$ matches error of $\epsilon_L \approx 0.008$.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="terminology-note">Terminology Note</h2>
<p>The naming conventions in this paper differ from those commonly used in molecular dynamics (MD). The following table provides a cross-field mapping:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Concept</th>
          <th style="text-align: left"><strong>Field Theory (This Paper)</strong></th>
          <th style="text-align: left"><strong>Molecular Dynamics</strong></th>
          <th style="text-align: left"><strong>Mathematics</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Equation 1</strong></td>
          <td style="text-align: left">&ldquo;Langevin Equation&rdquo;</td>
          <td style="text-align: left">Brownian Dynamics (BD)</td>
          <td style="text-align: left">Overdamped Langevin</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Equation 2</strong></td>
          <td style="text-align: left">&ldquo;Hyperbolic Equation&rdquo;</td>
          <td style="text-align: left">Langevin Dynamics (LD)</td>
          <td style="text-align: left">Underdamped Langevin</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Integrator 1</strong></td>
          <td style="text-align: left">Euler Discretization</td>
          <td style="text-align: left">Euler Integrator</td>
          <td style="text-align: left">Euler-Maruyama</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Integrator 2</strong></td>
          <td style="text-align: left">Hyperbolic Algorithm (HA)</td>
          <td style="text-align: left">Velocity Verlet / Leapfrog</td>
          <td style="text-align: left">Quasi-Symplectic Splitting</td>
      </tr>
  </tbody>
</table>
<p><strong>Key insight</strong>: The paper&rsquo;s &ldquo;Hyperbolic Algorithm&rdquo; is mathematically equivalent to Langevin Dynamics with a Leapfrog/Verlet integrator, commonly used in MD. The baseline &ldquo;Langevin Algorithm&rdquo; corresponds to Brownian Dynamics. The term &ldquo;Langevin equation&rdquo; is overloaded: field theorists often use it for overdamped dynamics (no inertia), while chemists assume it includes momentum ($F=ma$).</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Horowitz, A. M. (1987). The Second Order Langevin Equation and Numerical Simulations. <em>Nuclear Physics B</em>, 280, 510-522. <a href="https://doi.org/10.1016/0550-3213(87)90159-3">https://doi.org/10.1016/0550-3213(87)90159-3</a></p>
<p><strong>Publication</strong>: Nuclear Physics B 1987</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{horowitzSecondOrderLangevin1987,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The Second Order {{Langevin}} Equation and Numerical Simulations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Horowitz, Alan M.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1987</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Nuclear Physics B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{280}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{510--522}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{05503213}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1016/0550-3213(87)90159-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Reconstruction of Chemical Molecules from Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-reconstruction-2007/</guid><description>A 5-module system converting raster images of chemical structures into machine-readable SDF files with custom vectorization.</description><content:encoded><![CDATA[<h2 id="methodological-basis">Methodological Basis</h2>
<p>This paper is a clear methodological contribution describing a novel system architecture. It proposes a five-stage pipeline to solve a specific engineering problem: converting rasterized chemical images into structured chemical files (SDF). The authors validate the method by benchmarking it against a commercial product (CLIDE) and analyzing performance across multiple databases.</p>
<h2 id="the-inaccessibility-of-raster-chemical-images">The Inaccessibility of Raster Chemical Images</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A massive amount of chemical knowledge (scientific articles, patents) exists only as raster images, rendering it inaccessible to computational analysis.</li>
<li><strong>Inefficiency of Manual Entry</strong>: Manual replication of molecules into CAD programs is the standard but unscalable solution for extracting this information.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous academic and commercial attempts (early-1990s systems like CLIDE) had either been discontinued or remained limited in robustness, leaving the problem &ldquo;wide open&rdquo;.</li>
</ul>
<h2 id="topology-preserving-chemical-vectorization">Topology-Preserving Chemical Vectorization</h2>
<p>The core novelty is the <strong>topology-preserving vectorization</strong> strategy designed specifically for chemical graphs.</p>
<ul>
<li><strong>Graph-Centric Vectorizer</strong>: This system prioritizes graph characteristics over the pixel precision of traditional CAD vectorizers, ensuring one line in the image becomes exactly one vector, regardless of line width or vertex thickness.</li>
<li><strong>Chemical Knowledge Module</strong>: The inclusion of a final validation step that applies chemical rules (valence, charge) to detect and potentially correct reconstruction errors.</li>
<li><strong>Hybrid Recognition</strong>: The separation of the pipeline into a &ldquo;Body&rdquo; path (vectorizer for bonds) and an &ldquo;OCR&rdquo; path (SVM for atomic symbols), which are re-integrated in a reconstruction phase.</li>
</ul>
<h2 id="validating-reconstruction-accuracy">Validating Reconstruction Accuracy</h2>
<p>The authors performed a quantitative validation using <strong>ground-truth SDF files</strong> to verify reconstruction accuracy. The success rate metric evaluated whether the reconstructed graph perfectly matched the true SDF:</p>
<p>$$ \text{Accuracy} = \frac{\text{Correctly Reconstructed SDFs}}{\text{Total Images Evaluated}} $$</p>
<ul>
<li><strong>Baselines</strong>: The system was benchmarked against the commercial software <strong>CLIDE</strong> on &ldquo;Database 1&rdquo;.</li>
<li><strong>Datasets</strong>: Three distinct databases were used:
<ul>
<li><strong>Database 1</strong>: 100 images (varied fonts/line widths).</li>
<li><strong>Database 2</strong>: 100 images.</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale test).</li>
</ul>
</li>
</ul>
<h2 id="system-performance-and-scalability">System Performance and Scalability</h2>
<ul>
<li><strong>Superior Performance</strong>: On Database 1, the proposed system correctly reconstructed <strong>97%</strong> of images, whereas the commercial CLIDE system only reconstructed <strong>25%</strong> (after parameter tuning).</li>
<li><strong>Scalability</strong>: The system maintained reasonable performance on the large dataset (Database 3), achieving <strong>67%</strong> accuracy.</li>
<li><strong>Robustness</strong>: The system can handle varying fonts and line widths via parameterization.</li>
<li><strong>Future Work</strong>: The authors plan to implement a feedback loop where the Chemical Knowledge Module can send error signals back to earlier modules to correct inconsistencies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Reproducible (Paywalled paper, no public code or data).</p>
<h3 id="data">Data</h3>
<p>The paper utilizes three databases for validation. The authors note that for these images, the correct SDF files were already available, allowing for direct automated checking.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Varied line widths, fonts, symbols; used for CLIDE comparison.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>General chemical database.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale database.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The system is composed of five distinct modules executed in sequence:</p>
<p><strong>1. Binarization &amp; Segmentation</strong></p>
<ul>
<li><strong>Preprocessing</strong>: Removal of anti-aliasing effects followed by <strong>adaptive histogram binarization</strong>.</li>
<li><strong>Connected Components</strong>: A non-recursive raster-scan algorithm identifies connected Run-Length Encoded (RLE) segments.</li>
</ul>
<p><strong>2. Optical Character Recognition (OCR)</strong></p>
<ul>
<li><strong>Feature Extraction</strong>: Uses functions similar to <strong>Zernike moments</strong> and a <strong>wavelet transform strategy</strong>.</li>
<li><strong>Classification</strong>: Identifies isolated characters/symbols and separates them from the molecular &ldquo;body&rdquo;.</li>
</ul>
<p><strong>3. Vectorizer</strong></p>
<ul>
<li><strong>Logic</strong>: Assigns local directions to RLE segments based on neighbors, then groups segments with similar local direction patterns.</li>
<li><strong>Constraint</strong>: Enforces a 1-to-1 mapping between visual lines and graph vectors to prevent spurious small vectors at thick joints.</li>
</ul>
<p><strong>4. Reconstruction (Heuristics)</strong></p>
<p>This module annotates vectors with chemical significance:</p>
<ul>
<li><strong>Chiral Bonds (Wedges)</strong>: Identified by registering vectors against original pixel density. If a vector corresponds to a thick geometric form (triangle/rectangle), it is labeled chiral.</li>
<li><strong>Dotted Chiral Bonds</strong>: Identified by clustering isolated vectors (no neighbors) using <strong>quadtree clustering</strong> on geometric centers. Coherent parallel clusters are fused into a single bond.</li>
<li><strong>Double/Triple Bonds</strong>: Detected by checking for parallel vectors within a <strong>Region of Interest (ROI)</strong> defined as the vector&rsquo;s bounding box <strong>dilated by a factor of 2</strong>.</li>
<li><strong>Superatoms</strong>: OCR results are clustered by dilating bounding boxes; overlapping boxes are grouped into names (e.g., &ldquo;COOH&rdquo;).</li>
</ul>
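<p>The double-bond rule can be sketched as a small geometric predicate. This is my own reading of the ROI criterion: the paper only specifies parallelism within a bounding box dilated by a factor of 2, so the function names, the angle tolerance, and the minimum-padding guard for degenerate (axis-aligned) boxes are all assumptions:</p>

```python
import numpy as np

def dilated_bbox(seg, factor=2.0, pad=2.0):
    """Axis-aligned bounding box of segment `seg`, dilated by `factor` about
    its center; `pad` (hypothetical) keeps thin boxes from degenerating."""
    p, q = np.asarray(seg, dtype=float)
    lo, hi = np.minimum(p, q), np.maximum(p, q)
    center = (lo + hi) / 2.0
    half = np.maximum((hi - lo) / 2.0 * factor, pad)
    return center - half, center + half

def parallel(a, b, tol_deg=10.0):
    """True if the two segments' directions differ by less than tol_deg."""
    da = np.subtract(a[1], a[0]).astype(float)
    db = np.subtract(b[1], b[0]).astype(float)
    c = abs(da @ db) / (np.linalg.norm(da) * np.linalg.norm(db))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0))) < tol_deg

def double_bond_candidate(a, b):
    """True if segment b is parallel to a and its midpoint falls inside a's
    dilated bounding box (the ROI criterion)."""
    lo, hi = dilated_bbox(a)
    mid = (np.asarray(b[0], float) + np.asarray(b[1], float)) / 2.0
    return bool(np.all(mid >= lo) and np.all(mid <= hi) and parallel(a, b))
```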
<p><strong>5. Chemical Knowledge</strong></p>
<p>Validates the generated graph against rules for valences and charges. If valid, an SDF file is generated.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Support Vector Machine)</strong>: Used within the OCR module to classify connected components as characters or symbols. It is trained to be tolerant to rotation and font variations.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is binary success rate per molecule (perfect reconstruction of the SDF).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (DB1)</th>
          <th>Value (DB3)</th>
          <th>Baseline (CLIDE on DB1)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Reconstruction</td>
          <td><strong>97%</strong></td>
          <td>67%</td>
          <td>25%</td>
          <td>CLIDE required significant parameter tuning to reach 25%.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., Friedrich, C. M., Akle, S., &amp; Hofmann-Apitius, M. (2007). Reconstruction of Chemical Molecules from Images. <em>Proceedings of the 29th Annual International Conference of the IEEE EMBS</em>, 4609-4612. <a href="https://doi.org/10.1109/IEMBS.2007.4353366">https://doi.org/10.1109/IEMBS.2007.4353366</a></p>
<p><strong>Publication venue</strong>: IEEE EMBS 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriReconstructionChemicalMolecules2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Reconstruction of {{Chemical Molecules}} from {{Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 29th Annual International Conference of the IEEE EMBS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and Friedrich, Christoph M. and Akle, Santiago and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4609--4612}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IEMBS.2007.4353366}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Party Matters: Enhancing Legislative Vote Embeddings</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/party-matters-hiptm/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/party-matters-hiptm/</guid><description>A method for improving legislative vote prediction across sessions by augmenting bill text embeddings with sponsor metadata.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel neural architecture that modifies how bill embeddings are constructed by explicitly incorporating sponsor metadata alongside text. The authors validate this method by comparing it against text-only baselines (MWE and CNN) and demonstrating superior performance in a newly defined &ldquo;out-of-session&rdquo; evaluation setting.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Existing models for predicting legislative roll-call votes rely heavily on text or voting history within a single session. However, these models fail to generalize across sessions because the underlying data generation process changes. Specifically, the ideological position of bills on similar topics shifts depending on which party is in power. A model trained on a single session learns an implicit ideological prior that becomes inaccurate when the political context changes in subsequent sessions.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is a neural architecture that augments bill text representations with sponsor ideology, specifically the percentage of Republican vs. Democrat sponsors.</p>
<ul>
<li><strong>Sponsor-Weighted Embeddings</strong>: They compute a composite embedding where the text representation is weighted by party sponsorship percentages ($p_{r}, p_{d}$) and party-specific influence vectors ($a_{r}, a_{d}$).</li>
<li><strong>Out-of-Session Evaluation</strong>: They introduce a rigorous evaluation setting where models trained on past sessions (e.g., 2005-2012) are tested on future sessions (e.g., 2013-2014) to test generalization, which previous work had ignored.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated their models using a dataset of U.S. Congressional bills from 2005 to 2016.</p>
<ul>
<li><strong>Models Tested</strong>: They compared text-only models (MWE (Mean Word Embedding), CNN) against metadata-augmented versions (MWE+Meta, CNN+Meta) and a &ldquo;Meta-Only&rdquo; baseline (using dummy text).</li>
<li><strong>Settings</strong>:
<ul>
<li><strong>In-Session</strong>: 5-fold cross-validation on 2005-2012 data.</li>
<li><strong>Out-of-Session</strong>: Training on 2005-2012 and testing on 2013-2014 and 2015-2016.</li>
</ul>
</li>
<li><strong>Baselines</strong>: Comparisons included a &ldquo;Guess Yes&rdquo; baseline and an SVM trained on bag-of-words summaries with sponsor indicators.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Metadata is Critical</strong>: Augmenting text with sponsor metadata consistently outperformed text-only models. The <code>CNN+Meta</code> model achieved the highest accuracy in-session (86.21% vs. 83.24% for CNN) and on 2013-2014 out-of-session (83.59%), while <code>MWE+Meta</code> achieved the best 2015-2016 accuracy (71.90%).</li>
<li><strong>Generalization</strong>: Text-only models degraded significantly in out-of-session testing. For example, CNN dropped from 83.24% in-session to 77.49% on 2013-2014 and 69.63% on 2015-2016, confirming that text alone fails to capture shifting ideological contexts.</li>
<li><strong>Sponsor Signal</strong>: The <code>Meta-Only</code> model (using no text) outperformed text-only models in the 2013-2014 out-of-session test (82.28% vs. 77.57% for MWE), suggesting that in some contexts, the sponsors&rsquo; identities provide a stronger predictive signal than the bill&rsquo;s content.</li>
<li><strong>2015-2016 Difficulty</strong>: All models performed worse on the 2015-2016 session, where intra-party divisions within the House Republican caucus disrupted typical voting dynamics.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: Collected from GovTrack. The paper text references the &ldquo;106th to 111th&rdquo; Congressional sessions, but the data tables show coverage from 2005 to 2016, which corresponds to the 109th through 114th sessions.</li>
<li><strong>Content</strong>: Non-unanimous roll call votes, full text of bills/resolutions, and Congressional Research Service (CRS) summaries.</li>
<li><strong>Filtering</strong>: Bills with unanimous votes were excluded.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>Text lowercased and stop-words removed.</li>
<li>Summaries truncated to $N=400$ words; full text truncated to $N=2000$ words (80th percentile lengths).</li>
</ul>
</li>
<li><strong>Splits</strong>:
<ul>
<li><em>Training</em>: Sessions 2005-2012 (1718 bills).</li>
<li><em>Testing</em>: Sessions 2013-2014 (360 bills) and 2015-2016 (382 bills).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Bill Representation ($v_{B}$)</strong>:
$$v_{B}=((a_{r}p_{r})\cdot T_{r})+((a_{d}p_{d})\cdot T_{d})$$
where $T$ is the text embedding (CNN or MWE), $p$ is the percentage of sponsors from a party, and $a$ is a learnable party influence vector. $T_{r}$ and $T_{d}$ are Republican and Democratic copies of the same bill&rsquo;s text representation, each weighted by the corresponding party&rsquo;s sponsorship proportion.</li>
<li><strong>Vote Prediction</strong>:
<ul>
<li>Project bill embedding to legislator space: $v_{BL}=W_{B}v_{B}+b_{B}$.</li>
<li>Alignment score: $W_{v}(v_{BL}\odot v_{L})+b_{v}$ (using element-wise multiplication).</li>
<li>Output: Sigmoid activation.</li>
</ul>
</li>
<li><strong>Optimization</strong>: AdaMax algorithm with binary cross-entropy loss.</li>
</ul>
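<p>The forward pass described above can be sketched in a few lines (dimensions follow the 50-d text / 25-d legislator embeddings reported in the Models section; the weights here are random placeholders, not trained values, and all function names are my own):</p>

```python
import numpy as np

D_TEXT, D_LEG = 50, 25  # text and legislator embedding sizes from the paper

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bill_embedding(T, p_r, p_d, a_r, a_d):
    """v_B = ((a_r p_r) . T_r) + ((a_d p_d) . T_d), with T_r = T_d = T."""
    return (a_r * p_r) * T + (a_d * p_d) * T

def vote_prob(T, p_r, p_d, v_L, a_r, a_d, W_B, b_B, W_v, b_v):
    v_B = bill_embedding(T, p_r, p_d, a_r, a_d)
    v_BL = W_B @ v_B + b_B                      # project bill into legislator space
    return sigmoid(W_v @ (v_BL * v_L) + b_v)    # elementwise product, linear, sigmoid
```

<p>In the paper these parameters are learned end-to-end with AdaMax and a binary cross-entropy loss; the sketch shows only the forward pass.</p>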
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Encoders</strong>:
<ul>
<li><strong>CNN</strong>: 4-grams with 400 filter maps.</li>
<li><strong>MWE</strong>: <a href="/posts/intro-to-word-embeddings/">Mean Word Embedding</a>.</li>
</ul>
</li>
<li><strong>Embeddings</strong>:
<ul>
<li>Initialized with 50-dimensional GloVe vectors.</li>
<li>Embeddings are non-static (updated during training).</li>
<li>Legislator embedding size ($v_{L}$): 25 dimensions.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Weights initialized with Glorot uniform distribution.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy.</li>
<li><strong>Comparison</strong>:
<ul>
<li><strong>In-session</strong>: 5-fold cross-validation.</li>
<li><strong>Out-of-session</strong>: Train on past sessions, predict future sessions.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Config</strong>: Models trained for 50 epochs with mini-batches of size 50. No specific GPU or compute requirements are reported.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.govtrack.us/">GovTrack</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Source for bill texts and roll-call votes</td>
      </tr>
  </tbody>
</table>
<p>No official code repository or pretrained models were released with this paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kornilova, A., Argyle, D., &amp; Eidelman, V. (2018). Party Matters: Enhancing Legislative Embeddings with Author Attributes for Vote Prediction. <em>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</em>, 510-515. <a href="https://doi.org/10.18653/v1/p18-2081">https://doi.org/10.18653/v1/p18-2081</a></p>
<p><strong>Publication</strong>: ACL 2018</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kornilovaPartyMattersEnhancing2018,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Party {{Matters}}: {{Enhancing Legislative Embeddings}} with {{Author Attributes}} for {{Vote Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Party {{Matters}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kornilova, Anastassia and Argyle, Daniel and Eidelman, Vlad}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 56th {{Annual Meeting}} of the {{Association}} for {{Computational Linguistics}} ({{Volume}} 2: {{Short Papers}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{510--515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Melbourne, Australia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.18653/v1/p18-2081}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{1805.08182}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Oscillatory CO Oxidation on Pt(110): Temporal Modeling</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/oscillatory-co-oxidation-pt110-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/oscillatory-co-oxidation-pt110-1992/</guid><description>A kinetic model using coupled ODEs to explain temporal self-organization and mixed-mode oscillations in catalytic CO oxidation on Pt(110).</description><content:encoded><![CDATA[<p><strong>Related Work</strong>: This builds on <a href="/notes/chemistry/molecular-simulation/surface-science/kinetic-oscillations-pt100-1985/">Kinetic Oscillations on Pt(100)</a>, which established that surface phase transitions drive oscillatory catalysis. The Pt(110) system exhibits richer dynamics including mixed-mode oscillations and chaos.</p>
<h2 id="method-presentation-modeling-temporal-self-organization">Method Presentation: Modeling Temporal Self-Organization</h2>
<p>This is primarily a <strong>Method</strong> paper, supported by <strong>Theory</strong>.</p>
<ul>
<li><strong>Method</strong>: The authors construct a specific computational architecture, a set of coupled Ordinary Differential Equations (ODEs), to simulate the catalytic oxidation of CO. They systematically &ldquo;ablate&rdquo; the model, starting with 2 variables (bistability only), adding a 3rd (simple oscillations), and finally a 4th (mixed-mode oscillations) to demonstrate the necessity of each physical component.</li>
<li><strong>Theory</strong>: The model is analyzed using formal bifurcation theory (continuation methods) to map the topology of the phase space (Hopf bifurcations, saddle-node loops, etc.).</li>
</ul>
<h2 id="motivation-bridging-microscopic-structure-and-macroscopic-dynamics">Motivation: Bridging Microscopic Structure and Macroscopic Dynamics</h2>
<p>The Pt(110) surface exhibits complex temporal behavior during CO oxidation, including bistability, sustained oscillations, mixed-mode oscillations (MMOs), and chaos. Previous simple models could explain bistability but failed to capture the oscillatory dynamics observed experimentally. There was a need for a &ldquo;realistic&rdquo; model that used physically derived parameters to quantitatively link microscopic surface changes (structural phase transitions) to macroscopic reaction rates.</p>
<h2 id="novelty-coupling-reaction-kinetics-and-surface-phase-transitions">Novelty: Coupling Reaction Kinetics and Surface Phase Transitions</h2>
<p>The core novelty is the <strong>&ldquo;Reconstruction Model&rdquo;</strong>, which couples the chemical kinetics (Langmuir-Hinshelwood mechanism) with the physical structural phase transition of the platinum surface ($1\times1 \leftrightarrow 1\times2$).</p>
<ul>
<li>They treat the surface structure as a dynamic variable ($w$).</li>
<li>They introduce a fourth variable ($z$) representing &ldquo;faceting&rdquo; to explain complex mixed-mode oscillations, identifying the interplay between two negative feedback loops on different time scales as the driver for this behavior.</li>
</ul>
<h2 id="methodology-experimental-parameters-and-bifurcation-topology">Methodology: Experimental Parameters and Bifurcation Topology</h2>
<p>The validation approach involved a tight loop between numerical simulation and physical experiment:</p>
<ol>
<li><strong>Parameter Determination</strong>: They experimentally measured individual rate constants (sticking coefficients, desorption energies) using Surface Science techniques (LEED, TDS) to ground the model in reality.</li>
<li><strong>Bifurcation Analysis</strong>: They used numerical continuation methods (AUTO package) to compute &ldquo;skeleton bifurcation diagrams,&rdquo; mapping the boundaries between stable states, simple oscillations, and chaos in parameter space ($p_{CO}$ vs $p_{O_2}$).</li>
<li><strong>Physical Validation</strong>: These diagrams were compared directly against experimental work function ($\Delta \phi$) measurements and LEED intensity profiles to verify the existence regions of different dynamic regimes.</li>
</ol>
<h2 id="results-and-limitations-mixed-mode-oscillations-vs-spatiotemporal-chaos">Results and Limitations: Mixed-Mode Oscillations vs. Spatiotemporal Chaos</h2>
<ul>
<li><strong>Successes</strong>: The 3-variable model successfully reproduces bistability and simple oscillations (limit cycles). The extended 4-variable model qualitatively captures mixed-mode oscillations (MMOs).</li>
<li><strong>Mechanism</strong>: Oscillations arise from the delay between CO adsorption and the resulting surface phase transition (which changes oxygen sticking probabilities).</li>
<li><strong>Limitations</strong>: The 4-variable model only reproduces one type of MMO; certain experimental patterns (e.g., square-wave forms with small oscillations on both high and low work-function levels) were not obtained. The oscillatory region also does not extend to low temperatures as observed experimentally. More fundamentally, the ODE model fails to predict the period-doubling cascade to chaos or hyperchaos observed in experiments. The authors conclude these are likely spatiotemporal phenomena (involving wave propagation and pattern formation) that require Partial Differential Equations (PDEs).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The paper provides a complete set of equations and parameters required to reproduce the dynamics.</p>
<h3 id="data-parameters">Data (Parameters)</h3>
<p>The model uses kinetic parameters derived from Pt(110) experiments. Key constants for reproduction:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">$\kappa_c$</td>
          <td style="text-align: left">$3.135 \times 10^5 \, s^{-1} \text{mbar}^{-1}$</td>
          <td style="text-align: left">Rate of CO hitting surface</td>
      </tr>
      <tr>
          <td style="text-align: left">$s_c$</td>
          <td style="text-align: left">$1.0$</td>
          <td style="text-align: left">CO sticking coefficient</td>
      </tr>
      <tr>
          <td style="text-align: left">$q$</td>
          <td style="text-align: left">$3$</td>
          <td style="text-align: left">Mobility parameter of precursor adsorption</td>
      </tr>
      <tr>
          <td style="text-align: left">$u_s$</td>
          <td style="text-align: left">$1.0$</td>
          <td style="text-align: left">Saturation coverage ($CO$)</td>
      </tr>
      <tr>
          <td style="text-align: left">$\kappa_o$</td>
          <td style="text-align: left">$5.858 \times 10^5 \, s^{-1} \text{mbar}^{-1}$</td>
          <td style="text-align: left">Rate of $O_2$ hitting surface</td>
      </tr>
      <tr>
          <td style="text-align: left">$s_{o,1\times2}$</td>
          <td style="text-align: left">$0.4$</td>
          <td style="text-align: left">$O_2$ sticking coeff ($1\times2$ phase)</td>
      </tr>
      <tr>
          <td style="text-align: left">$s_{o,1\times1}$</td>
          <td style="text-align: left">$0.6$</td>
          <td style="text-align: left">$O_2$ sticking coeff ($1\times1$ phase)</td>
      </tr>
      <tr>
          <td style="text-align: left">$v_s$</td>
          <td style="text-align: left">$0.8$</td>
          <td style="text-align: left">Saturation coverage ($O$)</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_{r}^{0}$</td>
          <td style="text-align: left">$3 \times 10^6 \, s^{-1}$</td>
          <td style="text-align: left">Reaction pre-exponential</td>
      </tr>
      <tr>
          <td style="text-align: left">$E_r$</td>
          <td style="text-align: left">$10 \, \text{kcal/mol}$</td>
          <td style="text-align: left">Reaction activation energy</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_{d}^{0}$</td>
          <td style="text-align: left">$2 \times 10^{16} \, s^{-1}$</td>
          <td style="text-align: left">Desorption pre-exponential</td>
      </tr>
      <tr>
          <td style="text-align: left">$E_d$</td>
          <td style="text-align: left">$38 \, \text{kcal/mol}$</td>
          <td style="text-align: left">Desorption activation energy</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_{p}^{0}$</td>
          <td style="text-align: left">$10^2 \, s^{-1}$</td>
          <td style="text-align: left">Phase transition pre-exponential</td>
      </tr>
      <tr>
          <td style="text-align: left">$E_p$</td>
          <td style="text-align: left">$7 \, \text{kcal/mol}$</td>
          <td style="text-align: left">Phase transition activation energy</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_f$</td>
          <td style="text-align: left">$0.03 \, s^{-1}$</td>
          <td style="text-align: left">Rate of facet formation</td>
      </tr>
      <tr>
          <td style="text-align: left">$k_{t}^{0}$</td>
          <td style="text-align: left">$2.65 \times 10^5 \, s^{-1}$</td>
          <td style="text-align: left">Thermal annealing pre-exponential</td>
      </tr>
      <tr>
          <td style="text-align: left">$E_t$</td>
          <td style="text-align: left">$20 \, \text{kcal/mol}$</td>
          <td style="text-align: left">Thermal annealing activation energy</td>
      </tr>
      <tr>
          <td style="text-align: left">$s_{o,3}$</td>
          <td style="text-align: left">$0.2$</td>
          <td style="text-align: left">Increase of $s_o$ for max faceting ($z=1$)</td>
      </tr>
  </tbody>
</table>
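<p>The Arrhenius parameters in the table fix the working rate constants once a temperature is chosen. A quick sketch (evaluated at the 540 K operating point used for the Figure 7 replication later in these notes) makes the separation of time scales explicit:</p>

```python
import math

R = 0.001987  # gas constant, kcal/(mol K)
T = 540.0     # temperature of the Figure 7 oscillation regime

def arrhenius(k0, Ea, T=T):
    """Rate constant k = k0 * exp(-Ea / (R T)), with Ea in kcal/mol."""
    return k0 * math.exp(-Ea / (R * T))

k_r = arrhenius(3.0e6, 10.0)   # LH surface reaction: hundreds per second
k_d = arrhenius(2.0e16, 38.0)  # CO desorption: a few per second
k_p = arrhenius(1.0e2, 7.0)    # phase transition: fraction per second
```

Note the ordering $k_r \gg k_d \gg k_p$: reaction is fast, desorption is intermediate, and the surface reconstruction is slow. This disparity is why stiff integrators are recommended in the evaluation section.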
<h3 id="algorithms-the-equations">Algorithms (The Equations)</h3>
<p>The system is defined by a set of coupled Ordinary Differential Equations (ODEs).</p>
<p><strong>1. Basic 3-Variable Model (Reconstruction Model)</strong></p>
<p>The core system couples three variables: the CO coverage ($u$), the oxygen coverage ($v$), and the fraction of the surface in the $1\times1$ phase ($w$):</p>
<p>$$
\begin{aligned}
\dot{u} &amp;= p_{CO} \kappa_c s_c \left(1 - \left(\frac{u}{u_s}\right)^q \right) - k_d u - k_r u v \\
\dot{v} &amp;= p_{O_2} \kappa_o s_o \left(1 - \frac{u}{u_s} - \frac{v}{v_s}\right)^2 - k_r u v \\
\dot{w} &amp;= k_p (w_{eq}(u) - w)
\end{aligned}
$$</p>
<p><em>Note:</em> The oxygen sticking coefficient $s_o$ dynamically depends on the structure $w$, calculated as $s_o = w \cdot s_{o,1\times1} + (1-w) \cdot s_{o,1\times2}$. The equilibrium function $w_{eq}(u)$ is a polynomial step function that activates the phase transition:</p>
<p>$$
w_{eq}(u) =
\begin{cases}
0 &amp; u \le 0.2 \\
\sum_{i=0}^3 r_i u^i &amp; 0.2 &lt; u &lt; 0.5 \\
1 &amp; u \ge 0.5
\end{cases}
$$</p>
<p>The polynomial coefficients from Table II are: $r_3 = -1/0.0135$, $r_2 = -1.05 r_3$, $r_1 = 0.3 r_3$, $r_0 = -0.026 r_3$.</p>
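<p>As a sanity check, the Table II coefficients make the polynomial branch join the constant branches continuously at $u = 0.2$ and $u = 0.5$ and pass through $1/2$ at the midpoint. A small sketch, assuming the coefficients exactly as transcribed above:</p>

```python
# Table II coefficients for the cubic branch of w_eq(u)
r3 = -1.0 / 0.0135
r2 = -1.05 * r3
r1 = 0.3 * r3
r0 = -0.026 * r3

def w_eq(u):
    """Equilibrium 1x1 surface fraction as a function of CO coverage u."""
    if u <= 0.2:
        return 0.0
    if u >= 0.5:
        return 1.0
    return r0 + r1 * u + r2 * u**2 + r3 * u**3
```

Evaluating near the breakpoints confirms the cubic meets 0 at $u = 0.2$ and 1 at $u = 0.5$, so $w_{eq}$ is a smooth sigmoidal switch rather than a hard step.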
<p><strong>2. Extended 4-Variable Model (Faceting)</strong></p>
<p>To reproduce Mixed-Mode Oscillations, the model adds a faceting variable $z$:</p>
<p>$$
\begin{aligned}
s_o &amp;= w \cdot s_{o,1\times1} + (1-w) \cdot s_{o,1\times2} + s_{o,3} z \\
\dot{z} &amp;= k_f \cdot u \cdot v \cdot w \cdot (1-z) - k_t z (1-u)
\end{aligned}
$$</p>
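<p>The extended sticking coefficient and the faceting dynamics translate directly into code. A minimal sketch, with $k_t$ evaluated at 540 K from the Arrhenius parameters in the table above:</p>

```python
import math

R, T = 0.001987, 540.0
s_o11, s_o12, s_o3 = 0.6, 0.4, 0.2   # sticking coefficients per phase
k_f = 0.03                            # facet formation rate, s^-1
k_t = 2.65e5 * math.exp(-20.0 / (R * T))  # thermal annealing of facets

def sticking(w, z):
    """Oxygen sticking coefficient including the faceting contribution."""
    return w * s_o11 + (1 - w) * s_o12 + s_o3 * z

def dz(u, v, w, z):
    """Facet growth (driven by reaction on the 1x1 phase) vs. annealing."""
    return k_f * u * v * w * (1 - z) - k_t * z * (1 - u)
```

The two terms of <code>dz</code> implement the second, slower negative feedback loop: facets grow only while the reaction runs on the $1\times1$ phase, and anneal away on CO-poor surface.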
<h3 id="models">Models</h3>
<p>The authors define two distinct configurations:</p>
<ol>
<li><strong>3-Variable (u, v, w)</strong>: Sufficient for bistability and simple oscillations (limit cycles).</li>
<li><strong>4-Variable (u, v, w, z)</strong>: Required for mixed-mode oscillations (small oscillations superimposed on large relaxation spikes).</li>
</ol>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Bifurcation Analysis</strong>: The system should be evaluated by computing steady states and detecting Hopf bifurcations as a function of $p_{CO}$ and $p_{O_2}$.</li>
<li><strong>Time Integration</strong>: Stiff ODE solvers (e.g., <code>scipy.integrate.odeint</code> or <code>solve_ivp</code> with &lsquo;Radau&rsquo; or &lsquo;BDF&rsquo; method) are recommended due to the differing time scales of reaction ($u,v$) and reconstruction ($w,z$).</li>
</ul>
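<p>A hedged sketch of the recommended stiff time integration, using the same 3-variable right-hand side and 540 K operating point as the reference implementation below (note that <code>solve_ivp</code> expects the signature <code>f(t, y)</code>, the reverse of <code>odeint</code>):</p>

```python
import numpy as np
from scipy.integrate import solve_ivp

R, T = 0.001987, 540.0
p_CO, p_O2 = 3.0e-5, 6.67e-5
k_c, s_c, q = 3.135e5, 1.0, 3.0
k_o, s_o1, s_o2 = 5.858e5, 0.6, 0.4
u_s, v_s = 1.0, 0.8
k_d = 2.0e16 * np.exp(-38.0 / (R * T))
k_r = 3.0e6 * np.exp(-10.0 / (R * T))
k_p = 1.0e2 * np.exp(-7.0 / (R * T))

def rhs(t, y):
    u, v, w = y
    s_o = w * s_o1 + (1 - w) * s_o2
    # smooth cubic step standing in for the Table II polynomial
    if u <= 0.2:
        weq = 0.0
    elif u >= 0.5:
        weq = 1.0
    else:
        x = (u - 0.2) / 0.3
        weq = 3 * x**2 - 2 * x**3
    du = p_CO * k_c * s_c * (1 - (u / u_s)**q) - k_d * u - k_r * u * v
    dv = p_O2 * k_o * s_o * (1 - u / u_s - v / v_s)**2 - k_r * u * v
    dw = k_p * (weq - w)
    return [du, dv, dw]

# Radau is an implicit method suited to the fast/slow time-scale split
sol = solve_ivp(rhs, (0.0, 300.0), [0.1, 0.1, 0.0],
                method="Radau", max_step=0.5)
```

The same call with <code>method="BDF"</code> should give equivalent trajectories; comparing both against the <code>odeint</code> (LSODA) run below is a cheap consistency check.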
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Original</strong>: VAX 6800 and VAX station 3100.</li>
<li><strong>Modern Reqs</strong>: Minimal. Can be solved in milliseconds on any modern CPU using standard scientific libraries (Python/Matlab).</li>
</ul>
<h3 id="reference-implementation">Reference Implementation</h3>
<p>The following Python script implements the 3-variable Reconstruction Model described in the paper, replicating the stable oscillations shown in Figure 7 (T=540K):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> scipy.integrate <span style="color:#f92672">import</span> odeint
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> matplotlib.pyplot <span style="color:#66d9ef">as</span> plt
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># --- 1. CONSTANTS &amp; PARAMETERS ---</span>
</span></span><span style="display:flex;"><span>R <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.001987</span>
</span></span><span style="display:flex;"><span>k_c, s_c, q <span style="color:#f92672">=</span> <span style="color:#ae81ff">3.135e5</span>, <span style="color:#ae81ff">1.0</span>, <span style="color:#ae81ff">3.0</span>
</span></span><span style="display:flex;"><span>k_o, s_o1, s_o2 <span style="color:#f92672">=</span> <span style="color:#ae81ff">5.858e5</span>, <span style="color:#ae81ff">0.6</span>, <span style="color:#ae81ff">0.4</span>
</span></span><span style="display:flex;"><span>k_d0, E_d <span style="color:#f92672">=</span> <span style="color:#ae81ff">2.0e16</span>, <span style="color:#ae81ff">38.0</span>
</span></span><span style="display:flex;"><span>k_r0, E_r <span style="color:#f92672">=</span> <span style="color:#ae81ff">3.0e6</span>, <span style="color:#ae81ff">10.0</span>
</span></span><span style="display:flex;"><span>k_p0, E_p <span style="color:#f92672">=</span> <span style="color:#ae81ff">100.0</span>, <span style="color:#ae81ff">7.0</span>
</span></span><span style="display:flex;"><span>u_s, v_s <span style="color:#f92672">=</span> <span style="color:#ae81ff">1.0</span>, <span style="color:#ae81ff">0.8</span>
</span></span><span style="display:flex;"><span>T, p_CO, p_O2 <span style="color:#f92672">=</span> <span style="color:#ae81ff">540.0</span>, <span style="color:#ae81ff">3.0e-5</span>, <span style="color:#ae81ff">6.67e-5</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate Arrhenius rates</span>
</span></span><span style="display:flex;"><span>k_d <span style="color:#f92672">=</span> k_d0 <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>exp(<span style="color:#f92672">-</span>E_d <span style="color:#f92672">/</span> (R <span style="color:#f92672">*</span> T))
</span></span><span style="display:flex;"><span>k_r <span style="color:#f92672">=</span> k_r0 <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>exp(<span style="color:#f92672">-</span>E_r <span style="color:#f92672">/</span> (R <span style="color:#f92672">*</span> T))
</span></span><span style="display:flex;"><span>k_p <span style="color:#f92672">=</span> k_p0 <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>exp(<span style="color:#f92672">-</span>E_p <span style="color:#f92672">/</span> (R <span style="color:#f92672">*</span> T))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">model</span>(y, t):
</span></span><span style="display:flex;"><span>    u, v, w <span style="color:#f92672">=</span> y
</span></span><span style="display:flex;"><span>    s_o <span style="color:#f92672">=</span> w <span style="color:#f92672">*</span> s_o1 <span style="color:#f92672">+</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> w) <span style="color:#f92672">*</span> s_o2
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Smooth cubic step in place of the paper&#39;s Table II polynomial for w_eq</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> u <span style="color:#f92672">&lt;=</span> <span style="color:#ae81ff">0.2</span>: weq <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> u <span style="color:#f92672">&gt;=</span> <span style="color:#ae81ff">0.5</span>: weq <span style="color:#f92672">=</span> <span style="color:#ae81ff">1.0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        x <span style="color:#f92672">=</span> (u <span style="color:#f92672">-</span> <span style="color:#ae81ff">0.2</span>) <span style="color:#f92672">/</span> <span style="color:#ae81ff">0.3</span>
</span></span><span style="display:flex;"><span>        weq <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span><span style="color:#f92672">*</span>x<span style="color:#f92672">**</span><span style="color:#ae81ff">2</span> <span style="color:#f92672">-</span> <span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>x<span style="color:#f92672">**</span><span style="color:#ae81ff">3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    r_reac <span style="color:#f92672">=</span> k_r <span style="color:#f92672">*</span> u <span style="color:#f92672">*</span> v
</span></span><span style="display:flex;"><span>    du <span style="color:#f92672">=</span> p_CO <span style="color:#f92672">*</span> k_c <span style="color:#f92672">*</span> s_c <span style="color:#f92672">*</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> (u<span style="color:#f92672">/</span>u_s)<span style="color:#f92672">**</span>q) <span style="color:#f92672">-</span> k_d <span style="color:#f92672">*</span> u <span style="color:#f92672">-</span> r_reac
</span></span><span style="display:flex;"><span>    dv <span style="color:#f92672">=</span> p_O2 <span style="color:#f92672">*</span> k_o <span style="color:#f92672">*</span> s_o <span style="color:#f92672">*</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> u<span style="color:#f92672">/</span>u_s <span style="color:#f92672">-</span> v<span style="color:#f92672">/</span>v_s)<span style="color:#f92672">**</span><span style="color:#ae81ff">2</span> <span style="color:#f92672">-</span> r_reac
</span></span><span style="display:flex;"><span>    dw <span style="color:#f92672">=</span> k_p <span style="color:#f92672">*</span> (weq <span style="color:#f92672">-</span> w)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> [du, dv, dw]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># --- 2. SIMULATION STRATEGY ---</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Simulate for 300 seconds to kill transients</span>
</span></span><span style="display:flex;"><span>t_full <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linspace(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">300</span>, <span style="color:#ae81ff">3000</span>)
</span></span><span style="display:flex;"><span>y0 <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.0</span>]
</span></span><span style="display:flex;"><span>solution <span style="color:#f92672">=</span> odeint(model, y0, t_full)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># --- 3. SLICING FOR FIGURE 7 ---</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Only take the last 60 seconds (stable limit cycle)</span>
</span></span><span style="display:flex;"><span>mask <span style="color:#f92672">=</span> (t_full <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">240</span>) <span style="color:#f92672">&amp;</span> (t_full <span style="color:#f92672">&lt;</span> <span style="color:#ae81ff">300</span>)
</span></span><span style="display:flex;"><span>t_plot <span style="color:#f92672">=</span> t_full[mask]
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Shift time axis to start at 10s (matching Fig 7 style)</span>
</span></span><span style="display:flex;"><span>t_display <span style="color:#f92672">=</span> t_plot <span style="color:#f92672">-</span> t_plot[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">+</span> <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>u_plot <span style="color:#f92672">=</span> solution[mask, <span style="color:#ae81ff">0</span>]
</span></span><span style="display:flex;"><span>v_plot <span style="color:#f92672">=</span> solution[mask, <span style="color:#ae81ff">1</span>]
</span></span><span style="display:flex;"><span>w_plot <span style="color:#f92672">=</span> solution[mask, <span style="color:#ae81ff">2</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># --- 4. VISUALIZATION ---</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>figure(figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">8</span>, <span style="color:#ae81ff">5</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Plot CO (u) and Structure (w) on top (Primary Axis)</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>plot(t_display, w_plot, <span style="color:#e6db74">&#39;g--&#39;</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;1x1 Fraction (w)&#39;</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">1.5</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>plot(t_display, u_plot, <span style="color:#e6db74">&#39;k-&#39;</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;CO Coverage (u)&#39;</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Plot Oxygen (v) on bottom</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>plot(t_display, v_plot, <span style="color:#e6db74">&#39;r-.&#39;</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Oxygen (v)&#39;</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">1.5</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">&#39;Replication of Figure 7: Stable Oscillations&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>xlabel(<span style="color:#e6db74">&#39;Time (s)&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74">&#39;Coverage [ML]&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>legend(loc<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;upper center&#39;</span>, ncol<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>xlim(<span style="color:#ae81ff">10</span>, <span style="color:#ae81ff">60</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>ylim(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1.0</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>grid(<span style="color:#66d9ef">True</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.3</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>show()
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/notes/oscillatory-co-pt110-replication.webp"
         alt="Replication of Figure 7 showing stable oscillations in CO oxidation on Pt(110)"
         title="Replication of Figure 7 showing stable oscillations in CO oxidation on Pt(110)"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Output of the reference implementation showing stable oscillations on Pt(110)</figcaption>
    
</figure>

<p>This plot faithfully replicates the stable limit cycle shown in <strong>Figure 7</strong> of the paper:</p>
<ul>
<li><strong>Timeframe</strong>: Shows a 50-second window (labeled 10-60s) after initial transients have died out.</li>
<li><strong>Period</strong>: Regular oscillations with a period of roughly 7-8 seconds.</li>
<li><strong>Phase Relationship</strong>: The surface phase reconstruction ($w$, green dashed) lags slightly behind the CO coverage ($u$, black solid). This delay is the crucial &ldquo;memory&rdquo; effect that enables the oscillation.</li>
<li><strong>Anticorrelation</strong>: The oxygen coverage ($v$, red dash-dot) spikes exactly when the surface is in the active $1\times1$ phase (high $w$) and CO is low, confirming the &ldquo;Langmuir-Hinshelwood&rdquo; reaction mechanism.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krischer, K., Eiswirth, M., &amp; Ertl, G. (1992). Oscillatory CO oxidation on Pt(110): Modeling of temporal self-organization. <em>The Journal of Chemical Physics</em>, 96(12), 9161-9172. <a href="https://doi.org/10.1063/1.462226">https://doi.org/10.1063/1.462226</a></p>
<p><strong>Publication</strong>: Journal of Chemical Physics 1992</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krischerOscillatoryCOOxidation1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Oscillatory {{CO}} Oxidation on {{Pt}}(110): {{Modeling}} of Temporal Self-organization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Oscillatory {{CO}} Oxidation on {{Pt}}(110)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krischer, K. and Eiswirth, M. and Ertl, G.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1992</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Chemical Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{96}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{9161--9172}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0021-9606, 1089-7690}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1063/1.462226}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Optical Recognition of Chemical Graphics</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/casey-ocsr-1993/</guid><description>A 1993 prototype system for converting scanned chemical diagrams into connection tables using vectorization and heuristic-based structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-early-ocsr-pipeline-methodology">Contribution: Early OCSR Pipeline Methodology</h2>
<p><strong>Method</strong>. This paper proposes a novel architectural pipeline for the automatic recognition of chemical structure diagrams. It defines a specific sequence of algorithmic steps, including diagram separation, vectorization, segmentation, and structural analysis, which converts pixel data into a semantic chemical representation (MDL Molfile).</p>
<h2 id="motivation-digitizing-legacy-chemical-data">Motivation: Digitizing Legacy Chemical Data</h2>
<p><strong>Problem</strong>: In 1993, vast databases of chemical information existed, but the entry of graphical data was significantly less advanced than the facilities for manipulating it.</p>
<p><strong>Gap</strong>: Creating digital chemical structures required trained operators to manually redraw diagrams that already existed in printed journals and catalogs, leading to a costly duplication of effort.</p>
<p><strong>Goal</strong>: To automate the creation of coded representations (connection tables) directly from optically scanned diagrams on printed pages.</p>
<h2 id="novelty-general-document-analysis-integrated-with-chemical-rules">Novelty: General Document Analysis Integrated with Chemical Rules</h2>
<p><strong>Pipeline Approach</strong>: The authors present a complete end-to-end system that integrates general document analysis with domain-specific chemical rules.</p>
<p><strong>Convex Bounding Separation</strong>: A novel use of &ldquo;bounding polygons&rdquo; defined by 8 fixed-direction bands to distinguish diagram components from text with linear computational cost.</p>
<p><strong>Vector-Based Segmentation</strong>: The system uses the output of a vectorizer (GIFTS) to classify diagram elements. It relies on the observation that vectorizers approximate characters with sets of short vectors to distinguish them from bonds.</p>
<h2 id="methodology-and-system-evaluation">Methodology and System Evaluation</h2>
<p><strong>System Implementation</strong>: The algorithm was implemented in &lsquo;C&rsquo; on IBM PS/2 personal computers running OS/2 Presentation Manager.</p>
<p><strong>Input Specification</strong>: The system was tested on documents scanned at 300 dpi using an IBM 3119 scanner.</p>
<p><strong>Qualitative Evaluation</strong>: The authors evaluated the system on &ldquo;typical scanned structures&rdquo; and &ldquo;simple planar diagrams&rdquo;. Large-scale quantitative benchmarking was not conducted in this work.</p>
<h2 id="results-performance-and-limitations">Results, Performance, and Limitations</h2>
<p><strong>Performance</strong>: The prototype processes a typical structure (after extraction) in less than one minute.</p>
<p><strong>Accuracy</strong>: It is reported to be accurate for simple planar diagrams.</p>
<p><strong>Output Format</strong>: The system successfully generates MDL Molfiles that interface with standard chemistry software like REACCS, MACCS, and modeling tools.</p>
<p><strong>Limitations</strong>: The system struggles with broken lines, characters touching bond structures, and requires manual intervention for complex errors.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed (Historical). As an early prototype from 1993, no source code, datasets, or digital models were publicly released. Reproducing this exact system would require recreating the pipeline from the described heuristics and sourcing vintage OCR software.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No digital artifacts were released with this 1993 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The paper does not release a dataset but specifies the input requirements for the system.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input</td>
          <td>Scanned Documents</td>
          <td>N/A</td>
          <td>Black ink on white paper; scanned at 300 dpi bi-level.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper relies on a pipeline of specific heuristics and geometric rules.</p>
<p><strong>1. Diagram Separation (Region Growing)</strong></p>
<ul>
<li><strong>Bounding Polygons</strong>: Uses convex polygons defined by pairs of parallel sides in 8 fixed directions. This approximation improves distance estimation compared to bounding rectangles.</li>
<li><strong>Seed Detection</strong>: Finds a connected component with bounding dimension $D &gt; d_{\text{max char size}}$.</li>
<li><strong>Aggregation</strong>: Iteratively searches for neighboring components within a specific distance threshold $d_t$ (where $d_t$ is smaller than the whitespace margin) and merges them into the bounding polygon.</li>
</ul>
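<p>The 8-direction bounding polygon can be represented by extents along four axes ($x$, $y$, and the two diagonals), each (min, max) pair defining two parallel supporting lines. A sketch of this idea, assuming components given as simple point sets:</p>

```python
def bounding_polygon_8(points):
    """Extents of a point set along x, y, x+y, and x-y.

    Each (min, max) pair defines a band of two parallel supporting
    lines, so the eight values together describe a convex octagonal
    bound computable in one linear pass over the pixels, matching
    the paper's linear-cost claim.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    d1 = [p[0] + p[1] for p in points]  # NE/SW diagonal band
    d2 = [p[0] - p[1] for p in points]  # NW/SE diagonal band
    return {
        "x": (min(xs), max(xs)),
        "y": (min(ys), max(ys)),
        "x+y": (min(d1), max(d1)),
        "x-y": (min(d2), max(d2)),
    }
```

Distance between two components can then be estimated from the gaps between their corresponding bands, which is tighter than axis-aligned bounding rectangles for diagonally adjacent components.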
<p><strong>2. Vectorization &amp; Segmentation</strong></p>
<ul>
<li><strong>Vectorization</strong>: Uses the GIFTS system (IBM Tokyo) to fit lines to pixels.</li>
<li><strong>Classification Heuristics</strong>:
<ul>
<li><strong>Ratio Test</strong>: If the ratio of a group&rsquo;s dimension to the full diagram dimension is below a threshold $\tau$, it is classified as a <strong>Symbol</strong>:
$$ \frac{D_{\text{group}}}{D_{\text{diagram}}} &lt; \tau $$</li>
<li><strong>Context Rule</strong>: Small vector groups near letters are classified as <strong>Characters</strong> (handles &lsquo;l&rsquo; in &lsquo;Cl&rsquo;).</li>
<li><strong>Circle Rule</strong>: A group is a <strong>Circle</strong> (aromatic ring) if it contains $N \ge 8$ vectors in a roughly circular arrangement.</li>
<li><strong>Default</strong>: Otherwise, classified as <strong>Bond Structure</strong>.</li>
</ul>
</li>
</ul>
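<p>Taken together, these heuristics form a small decision cascade. A minimal Python sketch, where <code>tau</code>, the adjacency flag, and the circularity test are illustrative stand-ins for the paper&rsquo;s geometric checks, not the original implementation:</p>

```python
def classify_vector_group(group_dim, diagram_dim, n_vectors,
                          is_circular, near_letters, tau=0.1):
    """Sketch of the classification cascade for a vector group."""
    if group_dim / diagram_dim < tau:
        # Ratio test: small relative to the diagram -> symbol,
        # unless nearby letters say it belongs to an atom label.
        return "Character" if near_letters else "Symbol"
    if is_circular and n_vectors >= 8:
        # Circle rule: >= 8 vectors in a roughly circular arrangement
        return "Circle"
    return "Bond Structure"
```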
<p><strong>3. Cleanup &amp; Structure Recognition</strong></p>
<ul>
<li><strong>Short Vector Removal</strong>: Vectors shorter than a fraction of the median line length $L_{\text{median}}$ are shrunk to their midpoint (fixing broken junctions).</li>
<li><strong>Vertex Merging</strong>: If two vectors meet at an angle $\theta &lt; 35^{\circ}$, the vertex is removed (fixing single lines broken into two).</li>
<li><strong>Aromatic Processing</strong>: If a circle is detected, the system identifies the 6 closest atoms and adds double bonds to every second bond in the ring.</li>
</ul>
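<p>The vertex-merging rule reads as a collinearity test. A sketch under the assumption that &ldquo;meet at an angle $\theta &lt; 35^{\circ}$&rdquo; refers to the bend between consecutive segments at the shared vertex, with points as plain tuples:</p>

```python
import math

def should_merge_vertex(a, b, c, max_bend_deg=35.0):
    """Return True if segments a->b and b->c bend by less than
    max_bend_deg at b, i.e. the vertex should be removed and the
    two vectors fused into a single line."""
    ux, uy = b[0] - a[0], b[1] - a[1]
    vx, vy = c[0] - b[0], c[1] - b[1]
    cos_bend = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    bend = math.degrees(math.acos(max(-1.0, min(1.0, cos_bend))))
    return bend < max_bend_deg
```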
<h3 id="models">Models</h3>
<p><strong>OCR</strong>:</p>
<ul>
<li>The system uses a feature-based, single-font OCR engine.</li>
<li>It assumes non-serif, plain styles typical of drafting standards.</li>
<li>Character images are normalized for size before recognition.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Scanner</strong>: IBM 3119 (300 dpi).</li>
<li><strong>Compute</strong>: IBM PS/2 series running OS/2.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Casey, R., et al. (1993). Optical Recognition of Chemical Graphics. <em>Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR &lsquo;93)</em>, 627-631. <a href="https://doi.org/10.1109/ICDAR.1993.395658">https://doi.org/10.1109/ICDAR.1993.395658</a></p>
<p><strong>Publication</strong>: ICDAR 1993</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caseyOpticalRecognitionChemical1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Optical Recognition of Chemical Graphics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of 2nd {{International Conference}} on {{Document Analysis}} and {{Recognition}} ({{ICDAR}} &#39;93)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Casey, R. and Boyer, S. and Healey, P. and Miller, A. and Oudot, B. and Zilles, K.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{627--631}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE Comput. Soc. Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tsukuba Science City, Japan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1993.395658}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mixture Density Networks: Modeling Multimodal Distributions</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/mixture-density-networks/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/mixture-density-networks/</guid><description>A 1994 technical report introducing Mixture Density Networks (MDNs) to model arbitrary conditional probability distributions using neural networks.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It identifies a specific failure mode in existing neural network methodologies (least-squares regression on multi-valued inverse problems) and proposes a novel architecture (combining MLPs with Mixture Models) to solve it. It derives the mathematical framework for training this architecture via standard back-propagation and validates it against the established baseline.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Standard neural networks trained with sum-of-squares (MSE) or cross-entropy error functions approximate the <strong>conditional average</strong> of the target data, $\langle t|x \rangle$.</p>
<p>While optimal for single-valued functions or classification, this produces completely erroneous results for <strong>inverse problems</strong> where the mapping is multi-valued (one input has multiple valid outputs). For example, in robot inverse kinematics, &ldquo;elbow-up&rdquo; and &ldquo;elbow-down&rdquo; configurations can achieve the same hand position. An MSE-trained network will average these two valid angles, resulting in an invalid configuration (the paper shows this produces end-effector positions at the outer boundary of the accessible region, corresponding to $\theta_2 = \pi$).</p>















<figure class="post-figure center ">
    <img src="/img/notes/single-gaussian-mse-prediction.webp"
         alt="Single Gaussian MSE prediction averaging multimodal distribution"
         title="Single Gaussian MSE prediction averaging multimodal distribution"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">MSE-trained networks predict the mean, which averages across modes and produces invalid outputs for inverse problems.</figcaption>
    
</figure>

<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The introduction of the <strong>Mixture Density Network (MDN)</strong>.</p>
<p>The neural network predicts the <strong>parameters</strong> (mixing coefficients, means, and variances) of a kernel mixture distribution (typically Gaussian).</p>















<figure class="post-figure center ">
    <img src="/img/notes/gaussian-mixture-mdn-prediction.webp"
         alt="Gaussian mixture model prediction capturing multimodal distribution"
         title="Gaussian mixture model prediction capturing multimodal distribution"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">MDNs predict mixture parameters to capture the full conditional probability density, representing all modes.</figcaption>
    
</figure>

<p>Key technical contributions include:</p>
<ol>
<li><strong>Architecture</strong>: Mapping network outputs to mixture parameters using specific activation functions to satisfy constraints (Softmax for priors $\alpha$, Exponential for variances $\sigma$).</li>
<li><strong>Training</strong>: Deriving the error function as the negative log-likelihood of the mixture model.</li>
<li><strong>Optimization</strong>: Deriving the exact derivatives (gradients) of the error with respect to network outputs, allowing the mixture model parameters to be learned via standard back-propagation.</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>Bishop validated the method on two tasks, comparing an MDN against a standard MLP trained with least-squares:</p>
<ol>
<li><strong>Toy Inverse Problem</strong>: A sinusoidal mapping $x = t + 0.3\sin(2\pi t) + \epsilon$. The forward problem ($t \to x$) is single-valued, but the inverse ($x \to t$) is multi-valued.</li>
<li><strong>Robot Kinematics</strong>: A 2-link robot arm simulation. The task is to map end-effector Cartesian coordinates $(x_1, x_2)$ back to joint angles $(\theta_1, \theta_2)$.</li>
</ol>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Toy Problem</strong>: The standard least-squares network failed completely, drawing a smooth curve through the average of the multiple branches, which did not correspond to valid data. The MDN correctly modeled the tri-modal density and discontinuous jumps in the most probable solution.</li>
<li><strong>Robot Kinematics</strong>: The MDN reduced the RMS positioning error by an order of magnitude compared to the standard network (0.0053 vs 0.0578).</li>
<li><strong>Generality</strong>: The paper concludes that MDNs provide a complete description of the conditional probability density, allowing users to calculate any statistic (mean, mode, variance) needed for the application.</li>
</ul>
<h2 id="extracting-predictions">Extracting Predictions</h2>
<p>Once trained, the MDN outputs a full conditional density $p(t|x)$, from which several useful statistics can be derived:</p>
<ul>
<li><strong>Conditional mean</strong>: $\langle t|x \rangle = \sum_i \alpha_i(x) \mu_i(x)$, equivalent to the standard least-squares network output.</li>
<li><strong>Conditional variance</strong>: $s^2(x) = \sum_i \alpha_i(x) \left\{ \sigma_i(x)^2 + \left\| \mu_i(x) - \sum_j \alpha_j(x) \mu_j(x) \right\|^2 \right\}$, which is input-dependent (more general than the constant-variance least-squares assumption).</li>
<li><strong>Most probable branch</strong>: Select the kernel $i$ with the largest mixing coefficient $\alpha_i(x)$, then use its center $\mu_i$ as the prediction. This yields a discontinuous but accurate mapping for multi-valued problems.</li>
</ul>
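<p>A minimal NumPy sketch of these three statistics, assuming for one input $x$ the MDN has produced mixing coefficients <code>alpha</code> of shape <code>(m,)</code>, centers <code>mu</code> of shape <code>(m, c)</code>, and common widths <code>sigma</code> of shape <code>(m,)</code>:</p>

```python
import numpy as np

def mdn_statistics(alpha, mu, sigma):
    """Conditional mean, conditional variance, and most-probable-branch
    prediction from MDN outputs (shapes are assumptions, see above)."""
    mean = (alpha[:, None] * mu).sum(axis=0)            # <t|x>
    # s^2(x): within-kernel variance plus between-kernel spread
    var = (alpha * (sigma ** 2 + ((mu - mean) ** 2).sum(axis=1))).sum()
    branch = mu[np.argmax(alpha)]                       # dominant kernel's center
    return mean, var, branch
```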
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Model order selection</strong>: The number of mixture components $m$ must be chosen in advance. The paper acknowledges this as an open problem and suggests cross-validation or Bayesian model comparison as potential approaches.</li>
<li><strong>Computational overhead</strong>: The number of network outputs grows as $(c + 2) \times m$, where $c$ is the target dimensionality. For high-dimensional targets or many kernels, this can become significant.</li>
<li><strong>Isotropic kernels</strong>: The paper uses a single variance parameter $\sigma_i$ per kernel (shared across target dimensions), which assumes isotropic covariance. The paper notes this can be generalized to full covariance matrices at the cost of additional parameters.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>1. Toy Inverse Problem</strong></p>
<ul>
<li><strong>Function</strong>: $x = t + 0.3\sin(2\pi t) + \epsilon$</li>
<li><strong>Noise</strong>: $\epsilon \sim U(-0.1, 0.1)$</li>
<li><strong>Sampling</strong>: 1,000 points generated by sampling $t$ at equal intervals in the range $(0, 1)$.</li>
<li><strong>Task</strong>: Inverse mapping (predict $t$ given $x$).</li>
</ul>
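<p>Regenerating this dataset is a few lines; the random seed and the use of NumPy are assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)                  # seed is an assumption
t = np.linspace(0.0, 1.0, 1000)                 # equal intervals over (0, 1)
x = t + 0.3 * np.sin(2 * np.pi * t) + rng.uniform(-0.1, 0.1, size=t.shape)
# Forward task: t -> x (single-valued); inverse task: predict t from x.
```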
<p><strong>2. Robot Kinematics</strong></p>
<ul>
<li><strong>System</strong>: 2-link arm with lengths $L_1=0.8, L_2=0.2$.</li>
<li><strong>Forward Kinematics</strong>:
<ul>
<li>$x_1 = L_1 \cos(\theta_1) - L_2 \cos(\theta_1 + \theta_2)$</li>
<li>$x_2 = L_1 \sin(\theta_1) - L_2 \sin(\theta_1 + \theta_2)$</li>
</ul>
</li>
<li><strong>Constraints</strong>: $\theta_1 \in (0.3, 1.2)$, $\theta_2 \in (\pi/2, 3\pi/2)$.</li>
<li><strong>Dataset</strong>: 1,000 training points, 1,000 test points.</li>
</ul>
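<p>The forward kinematics above double as the evaluation harness: predicted angles are pushed back through them and compared with the target position. A direct transcription:</p>

```python
import numpy as np

def forward_kinematics(theta1, theta2, L1=0.8, L2=0.2):
    """End-effector position of the 2-link arm (equations as given above)."""
    x1 = L1 * np.cos(theta1) - L2 * np.cos(theta1 + theta2)
    x2 = L1 * np.sin(theta1) - L2 * np.sin(theta1 + theta2)
    return x1, x2
```

At $\theta_2 = \pi$ the two terms add, placing the end effector on the outer boundary of radius $L_1 + L_2 = 1.0$, which is exactly the invalid-average failure mode of the MSE-trained network described earlier.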
<h3 id="algorithms">Algorithms</h3>
<p><strong>Mixture Model Definition</strong></p>
<p>The conditional density is defined as:</p>
<p>$$p(t|x) = \sum_{i=1}^{m} \alpha_i(x) \phi_i(t|x)$$</p>
<p>Where kernels $\phi_i$ are Gaussians with centers $\mu_i(x)$ and variances $\sigma_i(x)$.</p>
<p><strong>Network Output Mappings</strong></p>
<p>If the network produces raw outputs $z$, they are mapped to parameters as follows to satisfy probability constraints:</p>
<ul>
<li><strong>Mixing Coefficients ($\alpha$)</strong>: Softmax. $\alpha_i = \frac{\exp(z_i^\alpha)}{\sum_j \exp(z_j^\alpha)}$</li>
<li><strong>Variances ($\sigma$)</strong>: Exponential. $\sigma_i = \exp(z_i^\sigma)$</li>
<li><strong>Means ($\mu$)</strong>: Linear/Identity. $\mu_{ik} = z_{ik}^\mu$</li>
</ul>
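<p>Assuming the raw output vector is laid out as $[z^\alpha, z^\sigma, z^\mu]$ (the ordering is an assumption; the paper fixes only the activations), the mapping can be sketched as:</p>

```python
import numpy as np

def split_mdn_outputs(z, m, c):
    """Map raw network outputs z of length (c + 2) * m to valid
    mixture parameters using the activations listed above."""
    z_alpha, z_sigma, z_mu = z[:m], z[m:2 * m], z[2 * m:]
    alpha = np.exp(z_alpha - z_alpha.max())
    alpha /= alpha.sum()                  # softmax: positive, sums to 1
    sigma = np.exp(z_sigma)               # exponential: strictly positive
    mu = z_mu.reshape(m, c)               # identity: unconstrained means
    return alpha, sigma, mu
```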
<p><strong>Loss Function</strong></p>
<p>Negative Log Likelihood:</p>
<p>$$E^q = - \ln \left\{ \sum_{i=1}^{m} \alpha_i(x^q) \phi_i(t^q|x^q) \right\}$$</p>
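<p>With isotropic Gaussian kernels $\phi_i(t|x) = (2\pi)^{-c/2} \sigma_i^{-c} \exp\left(-\|t - \mu_i\|^2 / 2\sigma_i^2\right)$, the per-pattern loss can be sketched as:</p>

```python
import numpy as np

def mdn_nll(t, alpha, mu, sigma):
    """Negative log-likelihood of one target t (shape (c,)) under the
    mixture; mu has shape (m, c), alpha and sigma have shape (m,)."""
    c = t.shape[0]
    sq = ((t - mu) ** 2).sum(axis=1)                   # ||t - mu_i||^2
    phi = np.exp(-sq / (2 * sigma ** 2)) / ((2 * np.pi) ** (c / 2) * sigma ** c)
    return -np.log((alpha * phi).sum())
```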
<h3 id="models">Models</h3>
<p><strong>1. Toy Problem Configuration</strong></p>
<ul>
<li><strong>Structure</strong>: MLP with 1 input ($x$), 1 hidden layer.</li>
<li><strong>Hidden Units</strong>: 20 units (tanh activation).</li>
<li><strong>Outputs</strong>: 9 units.
<ul>
<li>$m=3$ Gaussian kernels.</li>
<li>Parameters per kernel: 1 $\alpha$, 1 $\sigma$, 1 $\mu$. Total = $3 \times 3 = 9$.</li>
</ul>
</li>
<li><strong>Training</strong>: 1,000 cycles of BFGS.</li>
</ul>
<p><strong>2. Robot Kinematics Configuration (Least-Squares Baseline)</strong></p>
<ul>
<li><strong>Structure</strong>: MLP with 2 inputs ($x_1, x_2$), 2 linear outputs ($\theta_1, \theta_2$).</li>
<li><strong>Hidden Units</strong>: Best result with 20 units (tanh activation), tested with 5, 10, 15, 20, 25, 30.</li>
<li><strong>Training</strong>: 3,000 cycles of BFGS.</li>
</ul>
<p><strong>3. Robot Kinematics Configuration (MDN)</strong></p>
<ul>
<li><strong>Structure</strong>: MLP with 2 inputs ($x_1, x_2$).</li>
<li><strong>Hidden Units</strong>: 10 units (tanh activation).</li>
<li><strong>Outputs</strong>: 8 units.
<ul>
<li>$m=2$ Gaussian kernels.</li>
<li>Target dimension $c=2$ (predicting $\theta_1, \theta_2$).</li>
<li>Parameters per kernel: 1 $\alpha$ + 1 $\sigma$ (common variance) + 2 $\mu$ (means for $\theta_1, \theta_2$).</li>
<li>Total = $2 \times (1 + 1 + 2) = 8$.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric</strong>: RMS Euclidean distance between the desired end-effector position and the achieved position (calculated by plugging predicted angles back into forward kinematics).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Hidden Units</th>
          <th>Kernels</th>
          <th>RMS Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Least Squares</td>
          <td>20</td>
          <td>N/A</td>
          <td>0.0578</td>
      </tr>
      <tr>
          <td>MDN</td>
          <td>10</td>
          <td>2</td>
          <td>0.0053</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bishop, C. M. (1994). Mixture Density Networks. <em>Neural Computing Research Group Report: NCRG/94/004</em>, Aston University.</p>
<p><strong>Publication</strong>: Neural Computing Research Group Technical Report 1994</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@techreport</span>{bishopMixtureDensityNetworks1994,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Mixture {{Density Networks}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Bishop, Christopher M.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1994</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{NCRG/94/004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">institution</span> = <span style="color:#e6db74">{Aston University}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kekulé: OCR-Optical Chemical Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/kekule-1992/</guid><description>A seminal 1992 system for Optical Chemical Structure Recognition (OCSR) using neural networks and heuristic graph compilation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: McDaniel, J. R., &amp; Balmuth, J. R. (1992). Kekulé: OCR-Optical Chemical (Structure) Recognition. <em>Journal of Chemical Information and Computer Sciences</em>, 32(4), 373-378. <a href="https://doi.org/10.1021/ci00008a018">https://doi.org/10.1021/ci00008a018</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1992</p>
<h2 id="system-architecture-and-methodological-approach">System Architecture and Methodological Approach</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$). It proposes a novel software architecture (&ldquo;Kekulé&rdquo;) designed to solve a specific technical problem: the automatic conversion of printed chemical structure diagrams into computer-readable connection tables. The paper focuses on the &ldquo;how&rdquo; of the system by detailing the seven-step pipeline from scanning to graph compilation, validating the method through performance testing on a specific dataset.</p>
<h2 id="motivation-bridging-visual-diagrams-and-connection-tables">Motivation: Bridging Visual Diagrams and Connection Tables</h2>
<p>The primary motivation is to bridge the gap between how chemists communicate (structural diagrams) and how chemical databases store information (connection tables like MOLfiles).</p>
<ul>
<li><strong>Inefficiency of Manual Entry</strong>: Manual compilation of structural descriptions is &ldquo;tedious and highly prone to error&rdquo;.</li>
<li><strong>Redrawing Costs</strong>: Even using drawing programs (like ChemDraw ancestors) to capture connectivity is inefficient; redrawing a complex molecule like vitamin $B_{12}$ takes ~20 minutes.</li>
<li><strong>Lack of Existing Solutions</strong>: Existing OCR systems at the time failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), and struggled with small, mixed fonts.</li>
</ul>
<h2 id="novelty-a-hybrid-ocr-and-heuristic-approach">Novelty: A Hybrid OCR and Heuristic Approach</h2>
<p>Kekulé represents the first successful attempt to integrate all of the required elements of image processing, OCR, structure editing, and database communication into a complete system.</p>
<ul>
<li><strong>Hybrid OCR Approach</strong>: Unlike commercial OCR of the time, it used a custom implementation combining rotation correction (for skew) with a <strong>multilayer perceptron neural network</strong> trained specifically on small fonts (down to 3.2 points).</li>
<li><strong>Heuristic Feature Extraction</strong>: The authors developed specific heuristics to handle chemical artifacts, such as an exhaustive search for dashed lines, explicitly rejecting Hough transforms as unreliable for short segments.</li>
<li><strong>Contextual &ldquo;Spell Checking&rdquo;</strong>: The system uses chemical context to verify OCR results, such as checking atom symbols against a valid list and using bond connections to disambiguate characters.</li>
</ul>
<h2 id="experimental-setup-and-dataset-validation">Experimental Setup and Dataset Validation</h2>
<p>The authors performed a validation study on a diverse set of chemical structures to stress-test the system:</p>
<ul>
<li><strong>Dataset</strong>: 444 chemical structures were selected from a wide variety of sources, including the <em>Merck Index</em>, <em>Aldrich Handbook</em>, and <em>ACS Nomenclature Guide</em>, specifically chosen to &ldquo;test Kekulé&rsquo;s limits&rdquo;.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Processing Success</strong>: Percentage of structures processed.</li>
<li><strong>User Intervention</strong>: Average number of prompts per structure for verification.</li>
<li><strong>Editing Time</strong>: Time required to correct interpretation errors (arbitrary &ldquo;good&rdquo; limit set at 30 seconds).</li>
</ul>
</li>
</ul>
<h2 id="results-and-system-performance">Results and System Performance</h2>
<ul>
<li><strong>High Success Rate</strong>: 98.9% of the 444 structures were processed successfully.</li>
<li><strong>Performance Speed</strong>: The average processing time was 9 seconds per structure on an 80486 (33 MHz) processor.</li>
<li><strong>Error Modes</strong>: The primary bottleneck was broken characters in scanned images (e.g., breaks in &lsquo;H&rsquo; or &lsquo;N&rsquo; crossbars), which slowed down the OCR significantly.</li>
<li><strong>Impact</strong>: The system demonstrated that automated interpretation was faster and less error-prone than manual redrawing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The following details outline the specific technical implementation described in the 1992 paper.</p>
<h3 id="data">Data</h3>
<p>The authors did not release a public dataset but described their test set sources in detail.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Mixed Chemical Sources</td>
          <td>444 structures</td>
          <td>Sourced from <em>Merck Index</em>, <em>Aldrich Handbook</em>, <em>ACS Nomenclature Guide</em>, etc.</td>
      </tr>
      <tr>
          <td>Training (OCR)</td>
          <td>Font Exemplars</td>
          <td>Unknown</td>
          <td>&ldquo;Exemplars of characters from numerous serif and sanserif fonts&rdquo;.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a 7-step pipeline. Key algorithmic choices include:</p>
<ul>
<li>
<p><strong>Vectorization</strong>:</p>
<ul>
<li>Images are reduced to 1-pixel width using <strong>thinning</strong> and <strong>raster-to-vector translation</strong>.</li>
<li>An <strong>adaptive smoothing algorithm</strong> is applied to remove pixel-level jitter.</li>
</ul>
</li>
<li>
<p><strong>Feature Extraction (Dashed Lines)</strong>:</p>
<ul>
<li><strong>Hough Transforms</strong> were rejected due to poor performance on short line segments.</li>
<li><strong>Slope sorting</strong> was rejected due to variance in short dashes.</li>
<li><strong>Chosen Method</strong>: Exhaustive search/testing of all features that <em>might</em> be dashed lines (subset of features).</li>
</ul>
</li>
<li>
<p><strong>Graph Compilation</strong>:</p>
<ul>
<li><strong>Character Grouping</strong>: Characters are assembled into strings based on XY adjacency.</li>
<li><strong>Node Creation</strong>: Character strings become nodes. Vectors with endpoints &ldquo;too far&rdquo; from strings create new nodes.</li>
<li><strong>Heuristics</strong>: Circles are converted to alternating single-double bonds; &ldquo;thick&rdquo; bonds between wedges are automatically generated.</li>
</ul>
</li>
</ul>
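<p>The character-grouping step can be approximated greedily. A sketch in which recognized characters arrive as <code>(x, y, symbol)</code> tuples and <code>max_gap</code> is an assumed adjacency threshold; the paper describes only the XY-adjacency idea, not this exact procedure:</p>

```python
def group_characters(chars, max_gap=5):
    """Assemble OCR'd characters into atom-label strings by XY adjacency."""
    chars = sorted(chars)                   # left-to-right by x coordinate
    strings, current = [], [chars[0]]
    for ch in chars[1:]:
        prev = current[-1]
        if ch[0] - prev[0] <= max_gap and abs(ch[1] - prev[1]) <= max_gap:
            current.append(ch)              # close enough: same label
        else:
            strings.append("".join(c[2] for c in current))
            current = [ch]
    strings.append("".join(c[2] for c in current))
    return strings
```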
<h3 id="models">Models</h3>
<p>The core machine learning component is the OCR engine.</p>
<ul>
<li><strong>Architecture</strong>: A <strong>multilayer perceptron neural network</strong> (fully connected).</li>
<li><strong>Input</strong>: Normalized characters. Normalization involves rotation (for skew), scaling, under-sampling, and contrast/density adjustments.</li>
<li><strong>Output</strong>: Ranked probability matches. Outputs above an experimental threshold are retained. If a character is ambiguous (e.g., &lsquo;5&rsquo; vs &lsquo;S&rsquo;), both are kept and resolved via chemical context.</li>
<li><strong>Performance</strong>: Raw accuracy ~96% on small fonts (compared to ~85% for commercial OCR of the era).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The system was developed and tested on hardware typical of the early 1990s.</p>
<ul>
<li><strong>Processor</strong>: Intel 80486 at 33 MHz.</li>
<li><strong>Scanners</strong>: Hewlett-Packard ScanJet (300 dpi) and Logitech ScanMan (400 dpi hand-held).</li>
<li><strong>Platform</strong>: Microsoft Windows.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mcdanielKekuleOCRopticalChemical1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Kekulé: {{OCR-optical}} Chemical (Structure) Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Kekulé}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{McDaniel, Joe R. and Balmuth, Jason R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1992</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{32}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{373--378}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00008a018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>IMG2SMI: Translating Molecular Structure Images to SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</guid><description>Campos &amp; Ji's method for converting 2D molecular images to SMILES strings using Transformers and SELFIES representation.</description><content:encoded><![CDATA[<h2 id="contributions--taxonomy">Contributions &amp; Taxonomy</h2>
<p>This is both a <strong>Method</strong> and <strong>Resource</strong> paper:</p>
<ul>
<li><strong>Method</strong>: It adapts standard image captioning architectures (encoder-decoder) to the domain of Optical Chemical Structure Recognition (OCSR), treating molecule recognition as a translation task.</li>
<li><strong>Resource</strong>: It introduces <strong>MOLCAP</strong>, a large-scale dataset of 81 million molecules aggregated from public chemical databases, addressing the data scarcity that previously hindered deep learning approaches to OCSR.</li>
</ul>
<h2 id="the-bottleneck-in-chemical-literature-translation">The Bottleneck in Chemical Literature Translation</h2>
<p>Chemical literature is &ldquo;full of recipes written in a language computers cannot understand&rdquo; because molecules are depicted as 2D images. This creates a fundamental bottleneck:</p>
<ul>
<li><strong>The Problem</strong>: Chemists must manually redraw molecular structures to search for related compounds or reactions. This is slow, error-prone, and makes large-scale literature mining impossible.</li>
<li><strong>Existing Tools</strong>: Legacy systems like OSRA (Optical Structure Recognition Application) rely on handcrafted rules and often require human correction, making them unfit for unsupervised, high-throughput processing.</li>
<li><strong>The Goal</strong>: An automated system that can translate structure images directly to machine-readable strings (SMILES/SELFIES) without human supervision, enabling large-scale knowledge extraction from decades of chemistry literature and patents.</li>
</ul>
<h2 id="core-innovation-selfies-and-image-captioning">Core Innovation: SELFIES and Image Captioning</h2>
<p>The core novelty is demonstrating that <strong>how you represent the output text is as important as the model architecture itself</strong>. Key contributions:</p>
<ol>
<li>
<p><strong>Image Captioning Framework</strong>: Applies modern encoder-decoder architectures (ResNet-101 + Transformer) to OCSR, treating it as an image-to-text translation problem with a standard cross-entropy loss objective over the generation sequence:
$$ \mathcal{L} = -\sum\limits_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$</p>
</li>
<li>
<p><strong>SELFIES as Target Representation</strong>: The key mechanism relies on using <strong>SELFIES</strong> (Self-Referencing Embedded Strings) as the output format. SELFIES is based on a formal grammar where every possible string corresponds to a valid molecule, eliminating the syntactic invalidity problems (unmatched parentheses, invalid characters) that plague SMILES generation.</p>
</li>
<li>
<p><strong>MOLCAP Dataset</strong>: Created a comprehensive dataset of 81 million unique molecules from PubChem, ChEMBL, <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, and other sources. Generated 256x256 pixel images using RDKit for 1 million training samples and 5,000 validation samples.</p>
</li>
<li>
<p><strong>Task-Specific Evaluation</strong>: Demonstrated that traditional NLP metrics (BLEU) are poor indicators of scientific utility. Introduced evaluation based on <strong>molecular fingerprints</strong> (MACCS, RDK, Morgan) and <strong>Tanimoto similarity</strong>:
$$ T(a, b) = \frac{c}{a + b - c} $$
where $c$ is the number of common fingerprint bits, and $a$ and $b$ are the number of set bits in each respective molecule&rsquo;s fingerprint. This formulation reliably measures functional chemical similarity.</p>
</li>
</ol>
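<p>With fingerprints held as sets of set-bit indices, the Tanimoto metric is a few lines. This is a sketch of the formula above, not the RDKit routine the authors actually used:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints,
    each given as the set of its set-bit indices."""
    c = len(fp_a & fp_b)                  # bits set in both fingerprints
    return c / (len(fp_a) + len(fp_b) - c)
```

Identical fingerprints score 1.0 and disjoint fingerprints score 0.0, which is what makes the metric a graded measure of functional similarity rather than an exact-match test.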
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The evaluation focused on comparing IMG2SMI to existing systems and identifying which design choices matter most:</p>
<ol>
<li>
<p><strong>Baseline Comparisons</strong>: Benchmarked against OSRA (rule-based system) and DECIMER (first deep learning approach) on the MOLCAP dataset to establish whether modern architectures could surpass traditional methods.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Extensive ablations isolating key factors:</p>
<ul>
<li><strong>Decoder Architecture</strong>: Transformer vs. RNN/LSTM decoders</li>
<li><strong>Encoder Fine-tuning</strong>: Fine-tuned vs. frozen pre-trained ResNet weights</li>
<li><strong>Output Representation</strong>: SELFIES vs. character-level SMILES vs. BPE-tokenized SMILES (the most critical ablation)</li>
</ul>
</li>
</ol>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>MACCS FTS</th>
          <th>Valid Captions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN + Fixed Encoder</td>
          <td>0.1526</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RNN + Fine-tuned Encoder</td>
          <td>0.4180</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Transformer + Fixed Encoder</td>
          <td>0.7674</td>
          <td>61.1%</td>
      </tr>
      <tr>
          <td>Transformer + Fine-tuned Encoder</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>Character-level SMILES (fine-tuned)</td>
          <td>N/A</td>
          <td>2.1%</td>
      </tr>
      <tr>
          <td>BPE SMILES (2000 vocab, fine-tuned)</td>
          <td>N/A</td>
          <td>20.0%</td>
      </tr>
      <tr>
          <td>SELFIES (fine-tuned)</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
  </tbody>
</table>
<ol start="3">
<li><strong>Metric Analysis</strong>: Systematic comparison of evaluation metrics including BLEU, ROUGE, Levenshtein distance, exact match accuracy, and molecular fingerprint-based similarity measures.</li>
</ol>
<h2 id="results-findings-and-limitations">Results, Findings, and Limitations</h2>
<p><strong>Performance Gains</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>Random Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>0.0000</td>
          <td>0.3378</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>0.0000</td>
          <td>0.2229</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>0.0000</td>
          <td>0.1081</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>0.0000</td>
          <td>0.0422</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>0.00%</td>
          <td>0.00%</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<ul>
<li>163% improvement over OSRA on MACCS Tanimoto similarity.</li>
<li>Nearly 10x improvement on ROUGE scores (0.6240 vs. 0.0684).</li>
<li>Average Tanimoto similarity exceeds 0.85 (predictions are functionally similar molecules even when they are not exact matches).</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>SELFIES is Critical</strong>: Using SELFIES yields <strong>99.4% valid molecules</strong>, compared to only ~2% validity for character-level SMILES.</li>
<li><strong>Architecture Matters</strong>: Transformer decoder significantly outperforms RNN/LSTM approaches. Fine-tuning the ResNet encoder (vs. frozen weights) yields substantial performance gains (e.g., MACCS FTS: 0.7674 to 0.9475).</li>
<li><strong>Metric Insights</strong>: BLEU is a poor metric for this task. Molecular fingerprint-based Tanimoto similarity is most informative because it measures functional chemical similarity.</li>
</ul>
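<p>The syntactic failure modes of SMILES are easy to see concretely. The minimal checker below is an illustrative sketch only (not RDKit&rsquo;s full sanitization, which also enforces valence); it tests just two bracket-pairing constraints, which character-level generators routinely violate, while every SELFIES string decodes to some valid molecule by construction:</p>

```python
def smiles_syntax_ok(s: str) -> bool:
    """Check two syntactic SMILES constraints: balanced parentheses and
    paired single-digit ring closures. (Real validity additionally
    requires valence checks, e.g. via RDKit sanitization.)"""
    depth = 0
    open_rings = set()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # first occurrence opens a ring bond, second closes it
            open_rings ^= {ch}
    return depth == 0 and not open_rings

print(smiles_syntax_ok("C1CCCCC1"))  # cyclohexane -> True
print(smiles_syntax_ok("C1CCCCC"))   # unclosed ring -> False
print(smiles_syntax_ok("CC(C"))      # unmatched parenthesis -> False
```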
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Low Exact Match</strong>: Only <strong>7.24%</strong> exact matches. The model captures the overarching functional groups and structure but misses fine details like exact double bond placement.</li>
<li><strong>Complexity Bias</strong>: Trained on large molecules (average length &gt;40 tokens), so it performs poorly on very simple structures where OSRA still excels.</li>
</ul>
<p><strong>Conclusion</strong>: The work shows that modern encoder-decoder architectures combined with valid-by-construction molecular representations (SELFIES) can outperform traditional rule-based systems by large margins on fingerprint-based similarity metrics. The system is useful for literature mining where functional similarity matters more than exact matches, though 7.24% exact match accuracy and poor performance on simple molecules indicate clear directions for future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image captioning system based on DETR (Detection Transformer) framework.</p>
<p><strong>Visual Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: ResNet-101 pre-trained on ImageNet</li>
<li><strong>Feature Extraction</strong>: 4th layer extraction (convolutions only)</li>
<li><strong>Output</strong>: 2048-dimensional dense feature vector</li>
</ul>
<p><strong>Caption Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Transformer encoder-decoder</li>
<li><strong>Layers</strong>: 3 stacked encoder layers, 3 stacked decoder layers</li>
<li><strong>Attention Heads</strong>: 8</li>
<li><strong>Hidden Dimensions</strong>: 2048 (feed-forward networks)</li>
<li><strong>Dropout</strong>: 0.1</li>
<li><strong>Layer Normalization</strong>: 1e-12</li>
</ul>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate</strong>: 5e-5 (selected after sweep from 1e-4 to 1e-6)</li>
<li><strong>Weight Decay</strong>: 1e-4</li>
<li><strong>Batch Size</strong>: 32</li>
<li><strong>Epochs</strong>: 5</li>
<li><strong>Codebase</strong>: Built on open-source DETR implementation</li>
</ul>
<h3 id="data">Data</h3>
<p><strong>MOLCAP Dataset</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total Size</td>
          <td>81,230,291 molecules</td>
          <td>Aggregated from PubChem, ChEMBL, GDB13</td>
      </tr>
      <tr>
          <td>Training Split</td>
          <td>1,000,000 molecules</td>
          <td>Randomly selected unique molecules</td>
      </tr>
      <tr>
          <td>Validation Split</td>
          <td>5,000 molecules</td>
          <td>Randomly selected for evaluation</td>
      </tr>
      <tr>
          <td>Image Resolution</td>
          <td>256x256 pixels</td>
          <td>Generated using RDKit</td>
      </tr>
      <tr>
          <td>Median SELFIES Length</td>
          <td>&gt;45 characters</td>
          <td>More complex than typical benchmarks</td>
      </tr>
      <tr>
          <td>Full Dataset Storage</td>
          <td>~16.24 TB</td>
          <td>Necessitated use of 1M subset</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>None</td>
          <td>No cropping, rotation, or other augmentation</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li>Images generated using RDKit at 256x256 resolution</li>
<li>Molecules converted to canonical representations</li>
<li>SELFIES tokenization for model output</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metrics</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI Value</th>
          <th>OSRA Baseline</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>Fingerprint Tanimoto Similarity (functional groups)</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>RDKit fingerprint similarity</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>Morgan fingerprint similarity (circular)</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>Text overlap metric</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>Structural identity (strict)</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>Syntactic validity (with SELFIES)</td>
      </tr>
      <tr>
          <td>Levenshtein Distance</td>
          <td>21.13</td>
          <td>32.76</td>
          <td>String edit distance (lower is better)</td>
      </tr>
  </tbody>
</table>
<p><strong>Secondary Metrics</strong> (shown to be less informative for chemical tasks):</p>
<ul>
<li>BLEU, ROUGE (better suited for natural language)</li>
<li>Levenshtein distance (doesn&rsquo;t capture chemical similarity)</li>
</ul>
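<p>The Levenshtein caveat can be made concrete: a one-character edit to a SMILES string can swap in a chemically unrelated molecule. A standard dynamic-programming implementation (not from the paper) illustrates this with ethanol (<code>CCO</code>) and propane (<code>CCC</code>), which sit at edit distance 1:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# "CCO" (ethanol) vs. "CCC" (propane): edit distance 1, yet the
# molecules share little chemistry -- string metrics mislead here.
print(levenshtein("CCO", "CCC"))  # 1
```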
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Single NVIDIA GeForce RTX 2080 Ti</li>
<li><strong>Training Time</strong>: ~5 hours per epoch, approximately 25 hours total for 5 epochs</li>
<li><strong>Memory</strong>: Sufficient for batch size 32 with ResNet-101 + Transformer architecture</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>The paper mentions releasing both code and the MOLCAP dataset, but no public repository or download link has been confirmed as available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MOLCAP dataset</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>81M molecules; claimed released but no public URL found</td>
      </tr>
      <tr>
          <td>IMG2SMI code</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Built on DETR; claimed released but no public URL found</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Campos, D., &amp; Ji, H. (2021). IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System (No. arXiv:2109.04202). arXiv. <a href="https://doi.org/10.48550/arXiv.2109.04202">https://doi.org/10.48550/arXiv.2109.04202</a></p>
<p><strong>Publication</strong>: arXiv preprint (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2109.04202">Paper on arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{campos2021img2smi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Campos, Daniel and Ji, Heng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2109.04202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2109.04202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Hand-Drawn Chemical Diagram Recognition (AAAI 2007)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</guid><description>A sketch recognition system for organic chemistry that uses domain knowledge (chemical valence) to correct recognition errors.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-approach">Contribution and Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a multi-stage pipeline for interpreting hand-drawn diagrams that integrates a trainable symbol recognizer with a domain-specific verification step. The authors validate the method through an ablation study comparing the full system against a baseline lacking domain knowledge.</p>
<h2 id="motivation-for-sketch-based-interfaces">Motivation for Sketch-Based Interfaces</h2>
<p>Current software for specifying chemical structures (e.g., ChemDraw, ISIS/Draw) relies on mouse and keyboard interfaces, which lack the speed, ease of use, and naturalness of drawing on paper. The goal is to bridge the gap between natural expression and computer interpretation by building a system that understands freehand chemical sketches.</p>
<h2 id="novel-integration-of-chemical-domain-knowledge">Novel Integration of Chemical Domain Knowledge</h2>
<p>The primary novelty is the integration of <strong>domain knowledge</strong> (specifically chemical valence rules) directly into the interpretation loop to resolve ambiguities and correct errors.</p>
<p>Specific technical contributions include:</p>
<ul>
<li><strong>Hybrid Recognizer</strong>: Combines feature-based SVMs, image-based template matching (modified Tanimoto), and off-the-shelf handwriting recognition to handle the mix of geometry and text.</li>
<li><strong>Domain Verification Loop</strong>: A post-processing step that checks the chemical validity of the structure (e.g., nitrogen must have 3 bonds). If an inconsistency is found, the system searches the space of alternative hypotheses generated during the initial parsing phase to find a valid interpretation.</li>
<li><strong>Contextual Parsing</strong>: Uses a sliding window (up to 7 strokes) and spatial context to parse interspersed symbols.</li>
<li><strong>Implicit Structure Handling</strong>: Supports two common chemistry notations: (1) implicit elements, where carbon and hydrogen atoms are omitted and inferred from bond connectivity and valence rules, and (2) aromatic rings, detected as a circle drawn inside a hexagonal 6-carbon cycle.</li>
</ul>
<h2 id="experimental-design-and-user-study">Experimental Design and User Study</h2>
<p>The authors conducted a user study to evaluate the system&rsquo;s robustness on unconstrained sketches.</p>
<ul>
<li><strong>Participants</strong>: 6 users familiar with organic chemistry.</li>
<li><strong>Task</strong>: Each user drew 12 pre-specified molecular compounds on a Tablet PC.</li>
<li><strong>Conditions</strong>: The system was evaluated in two modes:
<ol>
<li><strong>Domain</strong>: The full system with chemical valence checks.</li>
<li><strong>Baseline</strong>: A simplified version with no knowledge of chemical valence/verification.</li>
</ol>
</li>
<li><strong>Data Split</strong>: Evaluated on collected sketches using a leave-one-out style approach (training on 11 examples from the same users).</li>
</ul>
<h2 id="results-and-error-reduction-analysis">Results and Error Reduction Analysis</h2>
<ul>
<li><strong>Performance</strong>: The full system achieved an overall <strong>F-measure of 0.87</strong> (Precision 0.86, Recall 0.89).</li>
<li><strong>Impact of Domain Knowledge</strong>: Using domain knowledge reduced the overall error rate (measured by recall) by <strong>27%</strong> compared to the baseline. The improvement was statistically significant ($p &lt; .05$).</li>
<li><strong>Error Recovery</strong>: The system successfully recovered from interpretations that were geometrically plausible but chemically impossible (e.g., misinterpreting &ldquo;N&rdquo; as bonds), as illustrated in their qualitative analysis.</li>
<li><strong>Output Integration</strong>: Once interpreted, the resulting structure is expressed in a standard chemical specification format that can be passed to tools such as ChemDraw (for rendering) or SciFinder (for database queries).</li>
<li><strong>Limitations</strong>: The system struggled with &ldquo;messy&rdquo; sketches where users drew single bonds with multiple strokes or over-traced lines, as the current bond recognizer assumes single-stroke straight bonds.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study collected a custom dataset of hand-drawn diagrams.</p>
<ul>
<li><strong>Volume</strong>: 6 participants $\times$ 12 molecules = 72 total sketches (implied).</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Scale Normalization</strong>: The system estimates scale based on the average length of straight bonds (chosen because they are easy to identify). This normalizes geometric features for the classifier.</li>
<li><strong>Stroke Segmentation</strong>: Poly-line approximation using recursive splitting (minimizing least squared error) to break multi-segment strokes (e.g., connected bonds) into primitives.</li>
</ul>
</li>
</ul>
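<p>The recursive splitting step can be sketched as follows. The paper splits to minimize least-squared error; the closely related Douglas&ndash;Peucker-style variant below (an assumption, not the authors&rsquo; exact criterion) splits at the point of maximum deviation from the chord:</p>

```python
import math

def _point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / norm

def split_stroke(points, tol=2.0):
    """Recursively split a stroke into line segments: if the farthest
    interior point deviates more than tol from the chord, split there."""
    if len(points) <= 2:
        return points
    dists = [_point_line_dist(p, points[0], points[-1]) for p in points[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[k - 1] <= tol:
        return [points[0], points[-1]]
    left = split_stroke(points[:k + 1], tol)
    right = split_stroke(points[k:], tol)
    return left[:-1] + right  # drop duplicated split point

# An L-shaped stroke (two connected bonds) splits at the corner.
stroke = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
print(split_stroke(stroke))  # [(0, 0), (3, 0), (3, 3)]
```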
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Ink Parsing (Sliding Window)</strong></p>
<ul>
<li>Examines all combinations of up to <strong>$n=7$</strong> sequential strokes.</li>
<li>Classifies each group as a valid symbol or invalid garbage.</li>
</ul>
<p><strong>2. Template Matching (Image-based)</strong></p>
<ul>
<li>Used for resolving ambiguities in text/symbols (e.g., &lsquo;H&rsquo; vs &lsquo;N&rsquo;).</li>
<li><strong>Metric</strong>: Modified <strong>Tanimoto coefficient</strong>. Unlike standard Tanimoto (point overlap), this version accounts for relative angle and curvature at each point.</li>
</ul>
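<p>The standard Tanimoto overlap that the paper modifies can be sketched on binary templates represented as sets of filled cells. The angle and curvature terms of the modified version are omitted here, so this is only a baseline illustration with made-up template data:</p>

```python
def template_tanimoto(img_a: set, img_b: set) -> float:
    """Standard Tanimoto overlap of two binary templates, each a set of
    filled (x, y) cells. The paper's modified coefficient additionally
    compares stroke angle and curvature at each point (omitted here)."""
    common = len(img_a & img_b)
    return common / (len(img_a) + len(img_b) - common)

# Two toy 3x3 templates: an 'H'-like glyph, and the same glyph with one
# extra cell (an 'N'-like diagonal start). They overlap heavily.
tmpl_h = {(0, 0), (0, 1), (0, 2), (1, 1), (2, 0), (2, 1), (2, 2)}
tmpl_n = tmpl_h | {(1, 0)}
print(template_tanimoto(tmpl_h, tmpl_h))  # 1.0
print(template_tanimoto(tmpl_h, tmpl_n))  # 7 / (7 + 8 - 7) = 0.875
```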
<p><strong>3. Domain Verification</strong></p>
<ul>
<li><strong>Trigger</strong>: An element with incorrect valence (e.g., Hydrogen with &gt;1 bond).</li>
<li><strong>Resolution</strong>: Searches stored alternative hypotheses for the affected strokes. It accepts a new hypothesis if it resolves the valence error without introducing new ones.</li>
<li><strong>Constraint</strong>: It keeps an inconsistent structure when the original interpretation&rsquo;s confidence score is significantly higher than any alternative&rsquo;s, on the assumption that the user is still drawing or intentionally left the structure incomplete.</li>
</ul>
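<p>The verification trigger can be sketched with a small valence table (an illustrative subset, not the paper&rsquo;s full element list). This check flags only exceeded valences; under-filled atoms are acceptable because omitted hydrogens fill the remaining bonding slots:</p>

```python
# Typical neutral-atom valences used for the consistency check
# (illustrative subset; the paper covers its full symbol set).
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

def valence_errors(atoms, bonds):
    """Return atoms whose total bond order exceeds their allowed valence.
    atoms: {atom_id: element}; bonds: [(atom_id, atom_id, order), ...]."""
    totals = {a: 0 for a in atoms}
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order
    return [a for a, elem in atoms.items() if totals[a] > VALENCE[elem]]

# A hydrogen drawn with two bonds triggers the hypothesis search.
atoms = {1: "C", 2: "H", 3: "O"}
bonds = [(1, 2, 1), (2, 3, 1)]  # H bonded twice: chemically impossible
print(valence_errors(atoms, bonds))  # [2]
```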
<h3 id="models">Models</h3>
<p><strong>Symbol Recognizer (Discriminative Classifier)</strong></p>
<ul>
<li><strong>Type</strong>: Support Vector Machine (SVM).</li>
<li><strong>Classes</strong>: Element letters, straight bonds, hash bonds, wedge bonds, invalid groups.</li>
<li><strong>Input Features</strong>:
<ol>
<li>Number of strokes</li>
<li>Bounding-box dimensions (width, height, diagonal)</li>
<li>Ink density (ink length / diagonal length)</li>
<li>Inter-stroke distance (max distance between strokes in group)</li>
<li>Inter-stroke orientation (vector of relative orientations)</li>
</ol>
</li>
</ul>
<p><strong>Text Recognition</strong></p>
<ul>
<li><strong>Microsoft Tablet PC SDK</strong>: Used for recognizing alphanumeric characters (elements and subscripts).</li>
<li>Integrated with the SVM and Template Matcher via a combined scoring mechanism.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Overall)</th>
          <th>Baseline Comparison</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision</strong></td>
          <td>0.86</td>
          <td>0.81 (Baseline)</td>
          <td>Full system vs. no domain knowledge</td>
      </tr>
      <tr>
          <td><strong>Recall</strong></td>
          <td>0.89</td>
          <td>0.85 (Baseline)</td>
          <td>27% error reduction</td>
      </tr>
      <tr>
          <td><strong>F-Measure</strong></td>
          <td>0.87</td>
          <td>0.83 (Baseline)</td>
          <td>Statistically significant ($p &lt; .05$)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>True Positive Definition</strong>: Match in both location (stroke grouping) and classification (label).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: 1.5GHz Tablet PC.</li>
<li><strong>Performance</strong>: Real-time feedback.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<p>No source code, trained models, or collected sketch data were publicly released. The paper is openly available through the AAAI digital library. The system depends on the Microsoft Tablet PC SDK (a proprietary, now-discontinued component), which would make exact replication difficult even with the algorithm descriptions provided.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2007). Recognition of Hand Drawn Chemical Diagrams. <em>Proceedings of the 22nd National Conference on Artificial Intelligence</em> (AAAI-07), 846-851.</p>
<p><strong>Publication</strong>: AAAI 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyang2007recognition,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recognition of Hand Drawn Chemical Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ouyang, Tom Y and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 22nd National Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{846--851}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Graph Perception for Chemical Structure OCR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/contreras-ocr-1990/</guid><description>A 1990 methodological paper presenting an early OCR system for digitizing chemical structure images into connectivity tables using C and Prolog.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Contreras, M. L., Allendes, C., Alvarez, L. T., &amp; Rozas, R. (1990). Computational perception and recognition of digitized molecular structures. <em>Journal of Chemical Information and Computer Sciences</em>, 30(3), 302-307. <a href="https://doi.org/10.1021/ci00067a014">https://doi.org/10.1021/ci00067a014</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1990</p>
<h2 id="contribution-graph-perception-and-character-recognition">Contribution: Graph Perception and Character Recognition</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.</p>
<p>It proposes a specific algorithmic pipeline (&ldquo;graph perception and character recognition&rdquo;) to solve the technical problem of converting pixelated images of molecules into machine-readable connectivity tables. The dominant contribution is the novel set of algorithms (contour search, circular inspection, matrix parametrization).</p>
<h2 id="motivation-automating-chemical-database-entry">Motivation: Automating Chemical Database Entry</h2>
<p>The primary motivation is to automate the input of chemical structures into databases.</p>
<ul>
<li><strong>Problem</strong>: Manual input of structures (especially large ones with stereochemistry) is time-consuming and prone to human error.</li>
<li><strong>Gap</strong>: Existing methods required significant human intervention. The authors created a system that handles the &ldquo;graph/skeleton&rdquo; and the &ldquo;alphanumeric characters&rdquo; effectively to speed up entry into systems like ARIUSA or CAD tools.</li>
</ul>
<h2 id="algorithmic-novelty-circular-inspection-processing">Algorithmic Novelty: Circular Inspection Processing</h2>
<p>The paper introduces a unified &ldquo;capture-to-recognition&rdquo; system written in C that handles both type-printed and hand-printed structures. Key novelties include:</p>
<ul>
<li><strong>Circular Inspection Algorithm</strong>: A specific technique for detecting internal rings and multiple bonds by sweeping a radius of 0.3 bond lengths around atoms.</li>
<li><strong>Hybrid Recognition</strong>: Combining &ldquo;graph perception&rdquo; (vectorizing the lines) with &ldquo;character recognition&rdquo; (OCR for atom labels) in a single pipeline.</li>
<li><strong>Matrix Parametrization for OCR</strong>: A feature extraction method that assigns hexadecimal IDs to character matrices based on pixel gradients and &ldquo;semibytes&rdquo;.</li>
</ul>
<h2 id="methodology-validation-via-custom-structure-dataset">Methodology: Validation via Custom Structure Dataset</h2>
<p>The authors validated the system by digitizing and recognizing a set of test structures:</p>
<ul>
<li><strong>Dataset</strong>: 200 type-printed structures and 50 hand-printed structures.</li>
<li><strong>Metric</strong>: &ldquo;Reliability&rdquo; percentage (correct recognition of the connectivity table).</li>
<li><strong>Speed Comparison</strong>: Measured processing time against a &ldquo;qualified person&rdquo; performing manual input for an average 20-atom molecule.</li>
</ul>
<h2 id="results-speed-and-file-size-efficiency">Results: Speed and File Size Efficiency</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>94% reliability</strong> for both type- and hand-printed graphs.</li>
<li><strong>Character Recognition</strong>: Isolated character recognition achieved <strong>&gt;99% reliability</strong>.</li>
<li><strong>Speed</strong>: The system was <strong>3-5 times faster</strong> than manual human input.</li>
<li><strong>Efficiency</strong>: The storage required for a recognized molecule (e.g., $C_{19}H_{31}N$) was significantly smaller (4.1 kb) than the raw image bitmap.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a standard external dataset but rather a custom set of structures for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Type-printed structures</td>
          <td style="text-align: left">200 images</td>
          <td style="text-align: left">Used to test reliability</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Hand-printed structures</td>
          <td style="text-align: left">50 images</td>
          <td style="text-align: left">&ldquo;Straight enough&rdquo; drawings required</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details three specific algorithmic components crucial for replication:</p>
<ol>
<li>
<p><strong>Graph Perception (Contour Search)</strong>:</p>
<ul>
<li><strong>Sweep</strong>: Left-to-right horizontal sweep to find the first pixel.</li>
<li><strong>Contour Follow</strong>: Counter-clockwise algorithm used to trace borders.</li>
<li><strong>Vertex Detection</strong>: A vertex is flagged if the linear trajectory deflection angle is $&gt;18^\circ$.</li>
<li><strong>Atom Localization</strong>: Two or more vertices in a small space indicate an atom position.</li>
</ul>
</li>
<li>
<p><strong>Circular Inspection (Branching/Rings)</strong>:</p>
<ul>
<li><strong>Radius</strong>: A circle is inspected around each atom with $r = 0.3 \times \text{single bond length}$.</li>
<li><strong>Branch Detection</strong>: &ldquo;Unknown border pixels&rdquo; found on this circle trigger new contour searches to find attached bonds or rings.</li>
</ul>
</li>
<li>
<p><strong>Character Recognition (Matrix Feature Extraction)</strong>:</p>
<ul>
<li><strong>Separation</strong>: Characters are separated into isolated matrices and &ldquo;relocated&rdquo; to the top-left corner.</li>
<li><strong>Parametrization</strong>: The matrix is divided into zones. A &ldquo;semibyte&rdquo; (4-bit code) is generated by checking for pixel density in specific directions.</li>
<li><strong>ID Assignment</strong>: Matrices are assigned a Hex ID (e.g., <code>8</code>, <code>1</code>, <code>0</code>, <code>6</code>) based on these semibytes.</li>
<li><strong>Differentiation</strong>: Secondary parameters (concavities, vertical lines) resolve conflicts (e.g., between &lsquo;b&rsquo; and &lsquo;h&rsquo;).</li>
</ul>
</li>
</ol>
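<p>The vertex-detection rule from step 1 can be sketched directly: compute the deflection of the traced trajectory at each interior point and flag those exceeding the paper&rsquo;s $18^\circ$ threshold (helper names here are illustrative, not from the paper):</p>

```python
import math

def deflection_deg(p0, p1, p2):
    """Angle (degrees) by which the trajectory p0->p1 turns at p1 toward p2."""
    a1 = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    a2 = math.atan2(p2[1] - p1[1], p2[0] - p1[0])
    d = abs(math.degrees(a2 - a1)) % 360
    return min(d, 360 - d)

def vertices(trace, threshold=18.0):
    """Flag trace points where deflection exceeds the paper's 18-degree cutoff."""
    return [trace[i] for i in range(1, len(trace) - 1)
            if deflection_deg(trace[i - 1], trace[i], trace[i + 1]) > threshold]

# A contour turning 90 degrees at (2, 0) is flagged as a vertex (atom candidate).
trace = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(vertices(trace))  # [(2, 0)]
```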
<h3 id="models">Models</h3>
<p>The system does not use learned weights (neural networks). It relies on <strong>rule-based topological recognition</strong>.</p>
<ul>
<li><strong>Representation</strong>: The final output is a Prolog data structure converted into a connectivity table.</li>
<li><strong>Atom Recognition</strong>: Terminal atoms are identified by linear projection; if no pixels are found, it defaults to Carbon.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The performance metrics reflect 1990s hardware, useful for historical context or low-resource reimplementation.</p>
<ul>
<li><strong>Capture</strong>: PC-AT microcomputer with HP-Scanjet.</li>
<li><strong>Processing</strong>: MicroVax II (8 MB real memory, 159 MB hard disc) running Ultrix-32.</li>
<li><strong>Memory Usage</strong>: A $300 \times 300$ dpi image required ~175 kb; a recognized graph required ~1.6 kb.</li>
<li><strong>Time</strong>: Processing time per molecule was 0.7 - 1.0 minutes.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{contrerasComputationalPerceptionRecognition1990,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Computational Perception and Recognition of Digitized Molecular Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Contreras, M. Leonor and Allendes, Carlos and Alvarez, L. Tomas and Rozas, Roberto}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1990</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = aug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{30}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{302--307}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00067a014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Evans 1986: Thermal Conductivity of Lennard-Jones Fluid</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/evans-thermal-conductivity-1986/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/evans-thermal-conductivity-1986/</guid><description>A 1986 validation of the Evans NEMD method for simulating heat flow, identifying long-time tail anomalies near the critical point.</description><content:encoded><![CDATA[<h2 id="methodological-validation-and-physical-discovery">Methodological Validation and Physical Discovery</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>, with a significant secondary component of <strong>Discovery ($\Psi_{\text{Discovery}}$)</strong>.</p>
<p>It focuses on validating a specific algorithm (the &ldquo;Evans method&rdquo;) for Non-Equilibrium Molecular Dynamics (NEMD) by comparing its results against experimental benchmarks. However, it also uncovers physical anomalies, specifically &ldquo;long-time tails&rdquo; in the heat flux autocorrelation function that deviate significantly from theoretical predictions, marking a discovery about the physics of the Lennard-Jones fluid itself.</p>
<h2 id="flow-gradients-and-boundary-limitations">Flow Gradients and Boundary Limitations</h2>
<p>The primary motivation is to overcome the limitations of simulating heat flow with physical boundaries (e.g., walls held at different temperatures), which cause severe interpretive difficulties due to the density and temperature gradients they induce.</p>
<p>The &ldquo;Evans method&rdquo; uses a fictitious external field to induce heat flow in a periodic, homogeneous system. This paper serves to:</p>
<ol>
<li>Validate this method across a wide range of state points (temperatures and densities) beyond the triple point.</li>
<li>Investigate the system&rsquo;s behavior near the critical point, where transport properties are known to be anomalous.</li>
</ol>
<h2 id="core-innovations-of-the-evans-algorithm">Core Innovations of the Evans Algorithm</h2>
<p>The core contribution is the rigorous stress-testing of the <strong>homogeneous heat flow algorithm</strong> (Evans method) combined with a <strong>Gaussian thermostat</strong>.</p>
<p>Specific novel insights include:</p>
<ul>
<li><strong>Linearity Validation</strong>: Establishing that, away from phase boundaries, the effective thermal conductivity is a monotonic, virtually linear function of the external field, justifying the extrapolation to zero field.</li>
<li><strong>Critical Anomaly Detection</strong>: Finding that near the critical point, conductivity becomes a non-monotonic function of the field, challenging standard simulation approaches in this regime.</li>
<li><strong>Tail Amplitude Discovery</strong>: Demonstrating that the &ldquo;long-time tails&rdquo; of the heat flux autocorrelation function have amplitudes roughly 6 times larger than those predicted by mode-coupling theory.</li>
</ul>
<h2 id="nemd-simulation-setup">NEMD Simulation Setup</h2>
<p>The author performed <strong>Non-Equilibrium Molecular Dynamics (NEMD)</strong> simulations using the Lennard-Jones potential.</p>
<ul>
<li><strong>System</strong>: Mostly $N=108$ particles, with some checks using $N=256$ to test size dependence.</li>
<li><strong>Thermostat</strong>: A Gaussian thermostat was used to keep the kinetic energy (temperature) constant.</li>
<li><strong>State Points</strong>:
<ul>
<li><strong>Critical Isotherm</strong>: $T=1.35$, varying density.</li>
<li><strong>Supercritical Isotherm</strong>: $T=2.0$.</li>
<li><strong>Freezing Line</strong>: Two points ($T=2.74, \rho=1.113$ and $T=2.0, \rho=1.04$).</li>
</ul>
</li>
<li><strong>Validation</strong>: Results were compared against <strong>experimental data for Argon</strong> (using standard LJ parameters).</li>
<li><strong>Ablation</strong>:
<ul>
<li><strong>Field Strength ($F$)</strong>: Varied to check for linearity/non-linearity.</li>
<li><strong>System Size ($N$)</strong>: Comparison between 108 and 256 particles to rule out finite-size artifacts.</li>
</ul>
</li>
</ul>
<h2 id="linearity-regimes-and-long-time-tail-anomalies">Linearity Regimes and Long-Time Tail Anomalies</h2>
<ul>
<li><strong>Agreement with Experiment</strong>: The Evans method yields thermal conductivities in broad agreement with experimental Argon data for most state points.</li>
<li><strong>Linearity</strong>: Away from the critical point, conductivity is a virtually linear function of the field strength $F$, allowing for accurate zero-field extrapolation.</li>
<li><strong>Critical Region Failure</strong>: Near the critical point ($T=1.35, \rho=0.4$), the method struggles; the conductivity is non-monotonic with respect to $F$, and the zero-field extrapolation underestimates the experimental value by ~11%.</li>
<li><strong>Long-Time Tails</strong>: The decay of the heat flux autocorrelation function follows a $t^{-3/2}$ tail (consistent with mode-coupling theory), but the <strong>amplitude is ~6x larger</strong> than predicted.</li>
<li><strong>Phase Hysteresis</strong>: In high-density regions near the freezing line, the system exhibits hysteresis and bi-stability between solid and liquid phases depending on the field strength.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The simulation relies on the Lennard-Jones (LJ) potential to model Argon. No external training data is used; the &ldquo;data&rdquo; consists of the physical constants defining the system.</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value/Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Potential</strong></td>
          <td>$\Phi(q)=4(q^{-12}-q^{-6})$</td>
          <td>Standard LJ 12-6 potential</td>
      </tr>
      <tr>
          <td><strong>Cutoff</strong></td>
          <td>$r_c = 2.5$</td>
          <td>Truncated at 2.5 distance units</td>
      </tr>
      <tr>
          <td><strong>Comparison</strong></td>
          <td>Argon Experimental Data</td>
          <td>Sourced from NBS recommended values</td>
      </tr>
  </tbody>
</table>
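<p>As a quick sanity check, the truncated potential from the table can be evaluated directly. A minimal Python sketch (assuming plain truncation at $r_c = 2.5$ with no energy shift, since none is stated):</p>

```python
import numpy as np

def lj_potential(q, r_cut=2.5):
    """Reduced-unit LJ 12-6 potential, Phi(q) = 4(q^-12 - q^-6),
    truncated at r_cut (assumption: no energy shift applied)."""
    q = np.asarray(q, dtype=float)
    phi = 4.0 * (q**-12 - q**-6)
    return np.where(q < r_cut, phi, 0.0)

# The well minimum sits at q = 2^(1/6) with Phi = -1 in reduced units.
print(lj_potential(2.0 ** (1.0 / 6.0)))  # ~ -1.0
print(lj_potential(3.0))                 # 0.0 beyond the cutoff
```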
<h3 id="algorithms">Algorithms</h3>
<p>The core algorithm is the <strong>Evans Homogeneous Heat Flow</strong> method. To reproduce this, one must implement the specific Equations of Motion (EOM) derived from linear response theory.</p>
<p><strong>Equations of Motion:</strong></p>
<p>The trajectories are generated by:
$$
\begin{aligned}
\dot{q}_i &amp;= \frac{p_i}{m} \\
\dot{p}_i &amp;= F_i^{\text{inter}} + (E_i - \bar{E})\,F(t) + \frac{1}{2}\sum_{j} F_{ij}\,\big(q_{ij} \cdot F(t)\big) - \frac{1}{2N} \sum_{j,k} F_{jk}\,\big(q_{jk} \cdot F(t)\big) - \alpha p_i
\end{aligned}
$$</p>
<p>Where:</p>
<ul>
<li>$F(t)$ is the fictitious external field driving heat flow.</li>
<li>$E_i$ is the instantaneous energy of particle $i$.</li>
<li>$\alpha$ is the <strong>Gaussian thermostat multiplier</strong> (recomputed at every step so that the kinetic energy, and hence the temperature, is strictly conserved):
$$\alpha = \frac{\sum_i [\dots]_{\text{force terms}} \cdot p_i}{\sum_i p_i \cdot p_i}$$</li>
</ul>
<p><strong>Conductivity Calculation:</strong></p>
<p>The zero-frequency limit is extrapolated as:
$$ \lambda = \lim_{F \to 0} \frac{J_Q}{FT} $$</p>
<p>The frequency-dependent conductivity relies on the heat-flux autocorrelation:
$$ \lambda(\omega) = \frac{V}{3k_B T^2} \int_0^\infty dt \, e^{i\omega t} \langle J_Q(t) \cdot J_Q(0) \rangle $$</p>
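<p>The interplay between the driving terms and the thermostat can be sketched numerically. The snippet below is a schematic illustration of the Gaussian isokinetic constraint only: the total force on each particle (interatomic plus field terms) is replaced by placeholder random values, and $\alpha$ is computed from the same ratio of force-momentum to momentum-momentum sums so that the kinetic energy is stationary:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
p = rng.normal(size=(N, 3))        # momenta (m = 1, reduced units)
F_total = rng.normal(size=(N, 3))  # placeholder for interatomic + field force terms

# Gaussian isokinetic multiplier: chosen so that d/dt [sum_i p_i^2 / 2] = 0
alpha = np.sum(F_total * p) / np.sum(p * p)
pdot = F_total - alpha * p

# Check: the thermostatted momentum derivatives leave the kinetic energy stationary
print(np.sum(pdot * p))  # ~ 0
```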
<h3 id="models">Models</h3>
<p>The &ldquo;model&rdquo; here is the physical simulation setup.</p>
<ul>
<li><strong>Particle Count</strong>: $N = 108$ (primary), $N = 256$ (validation).</li>
<li><strong>Boundary Conditions</strong>: Periodic Boundary Conditions (PBC).</li>
<li><strong>Thermostat</strong>: Gaussian Isokinetic (Temperature is a constant of motion).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the <strong>Thermal Conductivity</strong> ($\lambda$).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
          <th>Baseline</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Thermal Conductivity</strong></td>
          <td>Ratio of heat flux $J_Q$ to field $F$ (extrapolated to $F=0$)</td>
          <td>Experimental Argon (NBS Data)</td>
          <td>Good agreement away from critical point</td>
      </tr>
      <tr>
          <td><strong>Tail Amplitude</strong></td>
          <td>Coefficient of the $\omega^{1/2}$ term in frequency-dependent conductivity</td>
          <td>Mode-Coupling Theory ($\approx 0.05$)</td>
          <td>Simulation value $\approx 0.3$ (6x larger)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: While 1986 hardware is obsolete, reproducing this requires a standard MD code capable of non-conservative forces (NEMD).</li>
<li><strong>Compute Cost</strong>: Low by modern standards. 108 particles for $\sim 10^5$ to $10^6$ steps is trivial on modern CPUs.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Evans, D. J. (1986). Thermal conductivity of the Lennard-Jones fluid. <em>Physical Review A</em>, 34(2), 1449-1453. <a href="https://doi.org/10.1103/PhysRevA.34.1449">https://doi.org/10.1103/PhysRevA.34.1449</a></p>
<p><strong>Publication</strong>: Physical Review A, 1986</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{PhysRevA.34.1449,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Thermal conductivity of the Lennard-Jones fluid}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Evans, Denis J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Phys. Rev. A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1449--1453}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">numpages</span> = <span style="color:#e6db74">{0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1986}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{Aug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Physical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1103/PhysRevA.34.1449}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://link.aps.org/doi/10.1103/PhysRevA.34.1449}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dynamical Corrections to TST for Surface Diffusion</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/self-diffusion-lj-fcc111-1989/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/self-diffusion-lj-fcc111-1989/</guid><description>Application of dynamical corrections formalism to TST for LJ surface diffusion, revealing bounce-back recrossings at low T.</description><content:encoded><![CDATA[<h2 id="bridging-md-and-tst-for-surface-diffusion">Bridging MD and TST for Surface Diffusion</h2>
<p>This is primarily a <strong>Methodological Paper</strong> with a secondary contribution in <strong>Discovery</strong>.</p>
<p>The authors&rsquo; primary goal is to demonstrate the validity of the &ldquo;dynamical corrections formalism&rdquo; for calculating diffusion constants. They validate this by reproducing Molecular Dynamics (MD) results at high temperatures and then extending the method into low-temperature regimes where MD is infeasible.</p>
<p>By applying this method, they uncover a specific physical phenomenon, &ldquo;bounce-back recrossings&rdquo;, that causes a dip in the diffusion coefficient at low temperatures, a detail previously unobserved.</p>
<h2 id="timescale-limits-in-molecular-dynamics">Timescale Limits in Molecular Dynamics</h2>
<p>The authors aim to solve the timescale problem in simulating surface diffusion.</p>
<p><strong>Limit of MD</strong>: Molecular Dynamics (MD) is effective at high temperatures but becomes computationally infeasible at low temperatures because the time between diffusive hops increases drastically.</p>
<p><strong>Limit of TST</strong>: Standard Transition State Theory (TST) can handle long timescales but assumes all barrier crossings are successful, ignoring correlated dynamical events like immediate recrossings or multiple jumps.</p>
<p><strong>Goal</strong>: They seek to apply a formalism that corrects TST using short-time trajectory data, allowing for accurate calculation of diffusion constants across the entire temperature range.</p>
<h2 id="the-bounce-back-mechanism">The Bounce-Back Mechanism</h2>
<p>The core novelty is the rigorous application of the dynamical corrections formalism to a multi-site system (fcc/hcp sites) to characterize non-Arrhenius behavior at low temperatures.</p>
<p><strong>Unified Approach</strong>: They demonstrate that this method works for all temperatures, bridging the gap between the &ldquo;rare-event regime&rdquo; and the high-temperature regime dominated by fluid-like motion.</p>
<p><strong>Bounce-back Mechanism</strong>: They identify a specific &ldquo;dip&rdquo; in the dynamical correction factor ($f_d &lt; 1$) at low temperatures ($T \approx 0.038$), attributed to trajectories where the adatom collides with a substrate atom on the far side of the binding site and immediately recrosses the dividing surface.</p>
<h2 id="simulating-the-lennard-jones-fcc111-surface">Simulating the Lennard-Jones fcc(111) Surface</h2>
<p>The authors performed computational experiments on a Lennard-Jones fcc(111) surface cluster.</p>
<p><strong>System Setup</strong>: A single adatom on a 3-layer substrate (30 atoms/layer) with periodic boundary conditions.</p>
<p><strong>Baselines</strong>: They compared their high-temperature results against standard Molecular Dynamics simulations to validate the method.</p>
<p><strong>Ablation of Substrate Freedom</strong>: They ran a control experiment with a 6-layer substrate (top 3 free, 800 trajectories) to confirm the bounce-back effect persisted independently of the fixed deep layers, obtaining $D/D^{TST} = 0.75 \pm 0.06$, consistent with the original result.</p>
<p><strong>Trajectory Analysis</strong>: They analyzed the angular distribution of initial momenta to characterize the specific geometry of the bounce-back trajectories. Bounce-back trajectories were more strongly peaked at $\phi = 90°$ (perpendicular to the TST gate), confirming the effect arises from interaction with the substrate atom directly across the binding site.</p>
<p><strong>Temperature Range</strong>: The full calculation spanned $0.013 \leq T \leq 0.383$ in reduced units, bridging the rare-event regime and the high-temperature fluid-like regime.</p>
<h2 id="resolving-non-arrhenius-behavior">Resolving Non-Arrhenius Behavior</h2>
<p><strong>Arrhenius Behavior of TST</strong>: The uncorrected TST diffusion constant ($D^{TST}$) followed a near-perfect Arrhenius law, with a linear least-squares fit of $\ln(D^{TST}) = -1.8 - 0.30/T$.</p>
<p><strong>High-Temperature Correction</strong>: At high T, the dynamical correction factor $D/D^{TST} &gt; 1$, indicating correlated multiple forward jumps (long flights).</p>
<p><strong>Low-Temperature Dip</strong>: At low T, $D/D^{TST} &lt; 1$ for $T = 0.013, 0.026, 0.038, 0.051$ (minimum at $T = 0.038$), caused by the bounce-back mechanism.</p>
<p><strong>Validation</strong>: The method successfully reproduced high-T literature values while providing access to low-T dynamics inaccessible to direct MD.</p>
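<p>Because an Arrhenius law is linear in $1/T$, the reported TST fit can be recovered with an ordinary least-squares line. The sketch below uses synthetic points placed exactly on the reported line (these are not the paper's raw data):</p>

```python
import numpy as np

# Reported fit for the uncorrected TST diffusion constant (reduced units):
#   ln(D_TST) = -1.8 - 0.30 / T
T = np.array([0.013, 0.026, 0.038, 0.051, 0.128, 0.255, 0.383])
lnD = -1.8 - 0.30 / T  # synthetic values on the reported Arrhenius line

# An Arrhenius fit is a straight line in 1/T; polyfit recovers slope and intercept
slope, intercept = np.polyfit(1.0 / T, lnD, 1)
print(round(slope, 3), round(intercept, 3))  # -0.3 -1.8
```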
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use external datasets but generates simulation data based on the Lennard-Jones potential.</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Parameter</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Potential</strong></td>
          <td>$\epsilon, \sigma$</td>
          <td>1.0 (Reduced units)</td>
          <td>Standard Lennard-Jones 6-12</td>
      </tr>
      <tr>
          <td><strong>Cutoff</strong></td>
          <td>Spline</td>
          <td>$r_1=1.5\sigma, r_2=2.5\sigma$</td>
          <td>5th-order spline smooths potential to 0 at $r_2$</td>
      </tr>
      <tr>
          <td><strong>Geometry</strong></td>
          <td>Lattice Constant</td>
          <td>$a_0 = 1.549$</td>
          <td>Minimum energy for this potential</td>
      </tr>
      <tr>
          <td><strong>Cluster</strong></td>
          <td>Size</td>
          <td>3 layers, 30 atoms/layer</td>
          <td>Periodic boundary conditions parallel to surface</td>
      </tr>
  </tbody>
</table>
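<p>The quintic-spline cutoff can be sketched by solving for the 5th-order polynomial that matches the LJ potential at $r_1$ and vanishes smoothly at $r_2$. The matching conditions at $r_1$ (value plus first and second derivatives) are an assumption; the paper states only that the spline smooths the potential to zero at $r_2$:</p>

```python
import numpy as np

eps, sigma, r1, r2 = 1.0, 1.0, 1.5, 2.5

def lj(r):   return 4 * eps * ((sigma / r)**12 - (sigma / r)**6)
def dlj(r):  return 4 * eps * (-12 * sigma**12 / r**13 + 6 * sigma**6 / r**7)
def d2lj(r): return 4 * eps * (156 * sigma**12 / r**14 - 42 * sigma**6 / r**8)

# Quintic S(r) = sum_k c_k r^k: match V, V', V'' at r1; force 0, 0, 0 at r2
def rows(r):
    return [
        [r**k for k in range(6)],
        [k * r**(k - 1) if k >= 1 else 0.0 for k in range(6)],
        [k * (k - 1) * r**(k - 2) if k >= 2 else 0.0 for k in range(6)],
    ]

A = np.array(rows(r1) + rows(r2))
b = np.array([lj(r1), dlj(r1), d2lj(r1), 0.0, 0.0, 0.0])
c = np.linalg.solve(A, b)

S = lambda r: sum(ck * r**k for k, ck in enumerate(c))
print(abs(S(r1) - lj(r1)) < 1e-9, abs(S(r2)) < 1e-9)  # True True
```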
<h3 id="algorithms">Algorithms</h3>
<p>The diffusion constant $D$ is calculated as $D = D^{TST} \times (D/D^{TST})$.</p>
<p><strong>1. TST Rate Calculation ($D^{TST}$)</strong></p>
<ul>
<li><strong>Method</strong>: Monte Carlo integration of the flux through the dividing surface.</li>
<li><strong>Technique</strong>: Calculate free energy difference between the entire binding site and the TST dividing region.</li>
<li><strong>Dividing Surface</strong>: Defined geometrically with respect to equilibrium substrate positions (honeycomb boundaries around fcc/hcp sites).</li>
</ul>
<p><strong>2. Dynamical Correction Factor ($D/D^{TST}$)</strong></p>
<p>The method relies on evaluating the dynamical correction factor $f_d$, initialized via a Metropolis walk restricted to the TST boundary region, computed as:</p>
<p>$$
\begin{aligned}
f_d(i\rightarrow j) = \frac{2}{N}\sum_{I=1}^{N}\eta_{ij}(I)
\end{aligned}
$$</p>
<ul>
<li><strong>Initialization</strong>:
<ul>
<li><strong>Position</strong>: Sampled via Metropolis walk restricted to the TST boundary region.</li>
<li><strong>Momentum</strong>: Maxwellian distribution for parallel components; Maxwellian-flux distribution for normal component.</li>
<li><strong>Symmetry</strong>: Trajectories entering hcp sites are generated by reversing momenta of those entering fcc sites.</li>
</ul>
</li>
<li><strong>Integration</strong>:
<ul>
<li><strong>Integrator</strong>: Adams-Bashforth-Moulton predictor-corrector formulas of orders 1 through 12.</li>
<li><strong>Duration</strong>: Integrated until time $t &gt; \tau_{corr}$ (approximately $\tau_{corr} \approx 13$ reduced time units).</li>
<li><strong>Sample Size</strong>: 1400 trajectories per temperature point (700 initially entering each type of site).</li>
</ul>
</li>
</ul>
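<p>The structure of the estimator can be sketched with synthetic trajectory outcomes. Here $\eta$ is simplified to a binary success indicator, the recrossing probability is set by hand to mimic the reported $T = 0.038$ result, and the $2/N$ prefactor plus fcc/hcp bookkeeping of the actual formalism are folded away:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1400  # trajectories per temperature point, as in the paper

# Schematic stand-in: eta[I] = 1 if trajectory I, fired from the dividing
# surface into a site, is still in that site at t > tau_corr; a bounce-back
# recrossing gives 0. The assumed 18% recrossing rate mimics T = 0.038.
eta = (rng.random(N) > 0.18).astype(float)

# Under this simplification, f_d reduces to the fraction of trajectories
# that do not recross, with a Monte Carlo error bar.
f_d = eta.mean()
sem = eta.std(ddof=1) / np.sqrt(N)
print(f"D/D_TST ~ {f_d:.2f} +/- {sem:.2f}")
```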
<h3 id="models">Models</h3>
<ul>
<li><strong>System</strong>: Single component Lennard-Jones solid (Argon-like).</li>
<li><strong>Adsorbate</strong>: Single adatom on fcc(111) surface.</li>
<li><strong>Substrate Flexibility</strong>: Adatom plus top layer atoms are free to move. Layers 2 and 3 are fixed. (Validation run used 6 layers with top 3 free).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the Diffusion Constant $D$, analyzed via the Dynamical Correction Factor.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Slope ($E_a$)</strong></td>
          <td>0.30</td>
          <td>0.303 fcc / 0.316 hcp (Newton-Raphson)</td>
          <td>TST slope in good agreement with static barrier height.</td>
      </tr>
      <tr>
          <td><strong>$D/D^{TST}$ (Low T)</strong></td>
          <td>$0.82 \pm 0.04$</td>
          <td>1.0 (TST)</td>
          <td>At $T=0.038$. Indicates 18% reduction due to recrossing.</td>
      </tr>
      <tr>
          <td><strong>$D/D^{TST}$ (High T)</strong></td>
          <td>$&gt; 1.0$</td>
          <td>MD Literature</td>
          <td>Increases with T due to multiple jumps.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware configurations (e.g., node architectures, supercomputers) or training times were not specified in the original publication, which is typical for 1989 literature. Modern open-source MD engines (e.g., LAMMPS, ASE) could perform identical Lennard-Jones molecular dynamics integrations in negligible time on any consumer workstation.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cohen, J. M., &amp; Voter, A. F. (1989). Self-diffusion on the Lennard-Jones fcc(111) surface: Effects of temperature on dynamical corrections. <em>The Journal of Chemical Physics</em>, 91(8), 5082-5086. <a href="https://doi.org/10.1063/1.457599">https://doi.org/10.1063/1.457599</a></p>
<p><strong>Publication</strong>: The Journal of Chemical Physics 1989</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cohenSelfDiffusionLennard1989,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Self-diffusion on the {{Lennard}}-{{Jones}} Fcc(111) Surface: {{Effects}} of Temperature on Dynamical Corrections}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Self-diffusion on the {{Lennard}}-{{Jones}} Fcc(111) Surface}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Cohen, J. M. and Voter, A. F.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1989}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Chemical Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{91}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5082--5086}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0021-9606, 1089-7690}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1063/1.457599}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReader: Automated Structure Extraction</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemreader-2009/</guid><description>ChemReader extracts chemical structures from raster images using modified Hough transform and chemical spell checking for improved accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Park, J., Rosania, G. R., Shedden, K. A., Nguyen, M., Lyu, N., &amp; Saitou, K. (2009). Automated extraction of chemical structure information from digital raster images. <em>Chemistry Central Journal</em>, 3(1), 4. <a href="https://doi.org/10.1186/1752-153X-3-4">https://doi.org/10.1186/1752-153X-3-4</a></p>
<p><strong>Publication</strong>: Chemistry Central Journal 2009</p>
<h2 id="paper-contribution-method--pipeline">Paper Contribution: Method &amp; Pipeline</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel software system, <strong>ChemReader</strong>, designed to automate the analog-to-digital conversion of chemical structure diagrams. The paper focuses on the algorithmic pipeline, specifically modifying standard computer vision techniques like the Hough Transform to suit chemical graphs. It validates the method through direct performance comparisons against existing State-of-the-Art tools (OSRA, CLiDE, Kekule).</p>
<h2 id="motivation-unlocking-analog-chemical-information">Motivation: Unlocking Analog Chemical Information</h2>
<p>There is a massive amount of chemical information (molecular interactions, pathways, disease processes) locked in scientific literature. However, this information is typically encoded as &ldquo;analog diagrams&rdquo; (raster images) embedded in text. Existing text-based search engines cannot index these structures effectively.</p>
<p>While previous tools existed (Kekule, OROCS, CLiDE), they often required high-resolution images (150&ndash;300 dpi) or manual intervention to separate diagrams from text, making fully automated, large-scale database annotation impractical.</p>
<h2 id="core-innovation-modified-transforms-and-spell-checking">Core Innovation: Modified Transforms and Spell Checking</h2>
<p>The authors introduce <strong>ChemReader</strong>, a fully automated toolkit with several specific algorithmic innovations tailored for chemical diagrams:</p>
<ul>
<li><strong>Modified Hough Transform (HT):</strong> Unlike standard HT, which treats all pixels equally, ChemReader uses a modified weight function that accounts for pixel connectivity and line thickness to better detect chemical bonds.</li>
<li><strong>Chemical Spell Checker:</strong> A post-processing step that uses a dictionary of common chemical abbreviations (770 entries) and n-gram probabilities to correct Optical Character Recognition (OCR) errors (e.g., correcting specific atom labels based on valence rules), improving accuracy from 66% to 87%.</li>
<li><strong>Specific Substructure Detection:</strong> Dedicated algorithms for detecting stereochemical &ldquo;wedge&rdquo; bonds using corner detection and aromatic rings using the Generalized Hough Transform.</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors compared ChemReader against three other systems: <strong>OSRA V1.0.1</strong>, <strong>CLiDE V2.1 Lite</strong>, and <strong>Kekule V2.0 demo</strong>.</p>
<p>They used three distinct datasets to test robustness:</p>
<ol>
<li><strong>Set I (50 images):</strong> Diverse drawing styles and fonts collected via Google Image Search.</li>
<li><strong>Set II (100 images):</strong> Ligand images from the GLIDA database, linked to PubChem for ground truth.</li>
<li><strong>Set III (212 images):</strong> Low-resolution images embedded in 121 scanned journal articles from PubMed.</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Accuracy:</strong> ChemReader significantly outperformed competitors. In the difficult <strong>Set III</strong> (journal articles), ChemReader achieved <strong>30.2%</strong> correct exact output, compared to 17% for OSRA and 6.6% for CLiDE.</li>
<li><strong>Similarity:</strong> Even when exact matches failed, ChemReader maintained high Tanimoto similarity scores (0.74&ndash;0.86), indicating it successfully captured the majority of chemically significant features.</li>
<li><strong>Substructure Recognition:</strong> ChemReader demonstrated higher recall rates across all PubChem fingerprint categories (rings, atom pairs, SMARTS patterns) compared to other tools.</li>
<li><strong>Error Correction:</strong> The &ldquo;Chemical Spell Checker&rdquo; improved character recognition accuracy from <strong>66% to 87%</strong>.</li>
</ul>
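<p>For reference, the Tanimoto score used above is the ratio of shared to total &ldquo;on&rdquo; bits between two fingerprints. A minimal sketch on toy bit sets (real PubChem substructure fingerprints have 881 positions):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Toy bit positions standing in for PubChem substructure fingerprints
print(tanimoto({1, 2, 3, 4}, {2, 3, 4, 5}))  # 0.6
```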
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized three test sets collected from public sources.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set I</strong></td>
          <td>50 images</td>
          <td>Sourced from Google Image Search to vary styles/fonts.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set II</strong></td>
          <td>100 images</td>
          <td>Randomly selected ligands from the GLIDA database; ground truth via PubChem.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Set III</strong></td>
          <td>212 images</td>
          <td>Extracted from 121 PubMed journal articles; specifically excludes non-chemical figures.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of several sequential processing steps:</p>
<ul>
<li><strong>De-noising:</strong> Uses <strong>GREYCstoration</strong>, an anisotropic smoothing algorithm, to regulate image noise.</li>
<li><strong>Segmentation:</strong> Uses an <strong>8-connectivity algorithm</strong> to group pixels. Components are classified as text or graphics based on height/area ratios.</li>
<li><strong>Line Detection (Modified Hough Transform):</strong>
<ul>
<li>Standard Hough Transform is modified to weight pixel pairs $(P_i, P_j)$ based on connectivity.</li>
<li><strong>Weight Function ($W_{ij}$):</strong>
$$W_{ij} = \begin{cases} n_{ij}(P_0 - x_{ij}) &amp; \text{if } x_{ij}/n_{ij} &gt; P_0 \\ 0 &amp; \text{otherwise} \end{cases}$$
Where $n_{ij}$ is the pixel count between points, $x_{ij}$ is the count of black pixels, and $P_0$ is a density threshold.</li>
</ul>
</li>
<li><strong>Wedge Bond Detection:</strong> Uses corner detection to find triangles where the area equals the number of black pixels (isosceles shape check).</li>
<li><strong>Chemical Spell Checker:</strong>
<ul>
<li>Calculates the Maximum Likelihood ($ML$) of a character string being a valid chemical word $T$ from a dictionary.</li>
<li><strong>Similarity Metric:</strong>
$$Sim(S_i, T_i) = 1 - \sqrt{\sum_{j=1}^{M} [I^{S_i}(j) - I^{T_i}(j)]^2}$$
Uses pixel-by-pixel intensity difference between the input segment $S$ and candidate template $T$.</li>
</ul>
</li>
</ul>
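<p>The spell-checker scoring step can be sketched as template matching over pixel intensities. The glyph bitmaps below are toy stand-ins, and the per-pixel normalization (dividing the intensity differences by $\sqrt{M}$ so identical images score exactly 1) is an assumption about how the distance is kept in $[0, 1]$:</p>

```python
import numpy as np

def template_similarity(seg, tmpl):
    """Sim(S, T) = 1 - sqrt(sum_j (I_S(j) - I_T(j))^2) over normalized intensities."""
    s = np.asarray(seg, float).ravel()
    t = np.asarray(tmpl, float).ravel()
    d = (s - t) / np.sqrt(s.size)  # assumption: per-pixel normalization
    return 1.0 - np.sqrt(np.sum(d * d))

def best_match(seg, dictionary):
    """Pick the dictionary template maximizing the similarity score."""
    return max(dictionary, key=lambda name: template_similarity(seg, dictionary[name]))

# Toy 3x3 glyph bitmaps standing in for GOCR character segments
templates = {
    "O": np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
    "C": np.array([[1, 1, 1], [1, 0, 0], [1, 1, 1]]),
}
noisy_O = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 0]])  # one flipped pixel
print(best_match(noisy_O, templates))  # O
```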
<h3 id="models">Models</h3>
<ul>
<li><strong>Character Recognition:</strong> Uses the open-source <strong>GOCR</strong> library. It employs template matching based on features like holes, pixel densities, and transitions.</li>
<li><strong>Chemical Dictionary:</strong> A lookup table containing <strong>770</strong> frequently used chemical abbreviations and fundamental valence rules.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using exact structure matching and fingerprint similarity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Set III)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>% Correct</strong></td>
          <td><strong>30.2%</strong></td>
          <td>17%</td>
          <td>Exact structure match using ChemAxon JChem.</td>
      </tr>
      <tr>
          <td><strong>Avg Similarity</strong></td>
          <td><strong>0.740</strong></td>
          <td>0.526</td>
          <td>Tanimoto similarity on PubChem Substructure Fingerprints.</td>
      </tr>
      <tr>
          <td><strong>Precision (Rings)</strong></td>
          <td><strong>0.87</strong></td>
          <td>0.84</td>
          <td>Precision rate for recognizing ring systems.</td>
      </tr>
      <tr>
          <td><strong>Recall (Rings)</strong></td>
          <td><strong>0.83</strong></td>
          <td>0.73</td>
          <td>Recall rate for recognizing ring systems.</td>
      </tr>
  </tbody>
</table>
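<p>The Tanimoto similarity used in the evaluation is the standard Jaccard coefficient over fingerprint bits. A minimal sketch, representing a binary fingerprint as the set of its on-bit indices:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    each given as a set of on-bit indices (e.g. PubChem substructure
    fingerprint bits): |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```

For example, fingerprints sharing 2 of 4 total on-bits score 0.5.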
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> C++ implementation running on MS Windows.</li>
<li><strong>Dependencies:</strong> GOCR (OCR), GREYCstoration (Image processing).</li>
</ul>
]]></content:encoded></item><item><title>Chemical Machine Vision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/chemical-machine-vision/</guid><description>Machine vision approach using Gabor wavelets and Kohonen networks to classify chemical raster images and extract structural metadata.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gkoutos, G. V., Rzepa, H., Clark, R. M., Adjei, O., &amp; Johal, H. (2003). Chemical Machine Vision: Automated Extraction of Chemical Metadata from Raster Images. <em>Journal of Chemical Information and Computer Sciences</em>, 43(5), 1342-1355. <a href="https://doi.org/10.1021/ci034017n">https://doi.org/10.1021/ci034017n</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 2003</p>
<h2 id="paper-classification-methodological-approach">Paper Classification: Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline applying &ldquo;machine vision&rdquo; techniques (Gabor wavelets and Kohonen networks) to the problem of identifying chemical diagrams in low-resolution raster images. The paper focuses on the &ldquo;how&rdquo; (the algorithm and its parameters) and validates the method through quantitative experiments optimizing feature vectors and masks.</p>
<h2 id="motivation-extracting-legacy-chemical-data">Motivation: Extracting Legacy Chemical Data</h2>
<p>The primary motivation is to unlock the &ldquo;large amount of data&rdquo; trapped in legacy raster images (GIF, JPEG) on the Web that lack semantic metadata.</p>
<ul>
<li><strong>Legacy Data Problem</strong>: Most chemical structural information on the Web is embedded in raster images, not machine-readable formats like Molfiles.</li>
<li><strong>Limitations of Existing Tools</strong>: Previous tools like Kekulé and CLiDE acted as &ldquo;Chemical OCR,&rdquo; attempting to reconstruct exact atom-bond connections. This required high-resolution images (&gt;300 dpi) and human intervention, making them unsuitable for automated Web crawling of low-resolution (72-96 dpi) images.</li>
<li><strong>Goal</strong>: To create a low-cost, automated tool for a &ldquo;robot-based Internet resource discovery tool&rdquo; that can classify images (e.g., &ldquo;is this a molecule?&rdquo;).</li>
</ul>
<h2 id="core-innovation-texture-recognition-over-structural-ocr">Core Innovation: Texture Recognition over Structural OCR</h2>
<p>The core novelty is the shift from &ldquo;Optical Character Recognition&rdquo; (exact reconstruction) to <strong>&ldquo;Texture Recognition&rdquo;</strong> (classification).</p>
<ul>
<li><strong>Texture-Based Approach</strong>: The authors treat chemical diagrams as textures. They use <strong>Gabor wavelets</strong> to extract texture features. <strong>Crucially, this system does not recognize specific chemical structures</strong> (i.e., atom-bond connectivity tables, <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, or Molfiles). It only classifies images into broad categories.</li>
<li><strong>Incremental Learning</strong>: The system uses a <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong> combined with Class Boundary Analysis (CBA). This allows for &ldquo;incremental learning,&rdquo; where new classes (e.g., aromatic vs. non-aromatic) can be added without retraining the entire system.</li>
<li><strong>Optimization for Chemistry</strong>: The authors identify specific parameters (frequency channels, mask sizes) that are optimal for the &ldquo;texture&rdquo; of chemical diagrams.</li>
<li><strong>Integration with ChemDig</strong>: The method was designed to feed into ChemDig, a robot-based index engine for automated web crawling and metadata generation.</li>
</ul>
<h2 id="experimental-setup-parameter-optimization">Experimental Setup: Parameter Optimization</h2>
<p>The authors performed optimization and validation experiments using a dataset of <strong>300 images</strong> divided into three classes: Ring Systems, Non-Ring Systems, and Non-Chemistry (textures, biological figures, etc.).</p>
<ol>
<li><strong>Parameter Optimization</strong>: They systematically varied hyperparameters to find the optimal configuration:
<ul>
<li><strong>Feature Vector Size</strong>: Tested sizes from 100 to 4000 elements.</li>
<li><strong>Energy Mask Size</strong>: Tested windows from $3 \times 3$ to $15 \times 15$ pixels.</li>
<li><strong>Frequency Channels</strong>: Tested seven spatial frequencies ($\sqrt{2}$ to $64\sqrt{2}$).</li>
</ul>
</li>
<li><strong>Classification Performance</strong>: Evaluated the system&rsquo;s ability to classify unseen test images using a 50:50 training/test split.</li>
<li><strong>Comparison</strong>: Qualitatively compared the approach against vectorization tools (Autotrace, CR2V).</li>
</ol>
<h2 id="results-robust-classification-of-low-resolution-images">Results: Robust Classification of Low-Resolution Images</h2>
<ul>
<li><strong>Optimal Configuration</strong>: The system performed best with a feature vector size of ~1500 elements, a $9 \times 9$ energy mask, and frequency channel $4\sqrt{2}$.</li>
<li><strong>High Accuracy</strong>: Achieved a recognition rate of <strong>91%</strong> with a 50:50 training/test split, and up to <strong>92%</strong> with a 70:30 split.</li>
<li><strong>Robustness</strong>: The system successfully distinguished between chemical and non-chemical images (zero false negatives for chemical images).</li>
<li><strong>Limitations</strong>: Misclassifications occurred between &ldquo;ring&rdquo; and &ldquo;non-ring&rdquo; systems when structures had similar visual &ldquo;textures&rdquo; (e.g., similar density or layout).</li>
<li><strong>Impact</strong>: The method is viable for automating metadata generation (e.g., <code>alt</code> tags) for web crawlers, functioning as a coarse-grained filter before more expensive processing.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom dataset of raster images collected from the Web.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><strong>Custom Web Dataset</strong></td>
          <td>300 images</td>
          <td>Split into 3 classes: Ring Systems, Non-Ring Systems, Non-Chemistry.</td>
      </tr>
      <tr>
          <td>Resolution</td>
          <td><strong>Low-Res Web Images</strong></td>
          <td>72-96 dpi</td>
          <td>Deliberately chosen to mimic Web conditions where OCR fails.</td>
      </tr>
      <tr>
          <td>Format</td>
          <td><strong>Raster</strong></td>
          <td>GIF, JPEG</td>
          <td>Typical web formats.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The core pipeline consists of a <strong>Gabor Transform Unit</strong> followed by a <strong>Training/Classification Unit</strong>.</p>
<ul>
<li><strong>Gabor Wavelets</strong>: Used for feature extraction. The 2D Gabor wavelet equation is:
$$h(x,y)=\exp\left\{-\frac{1}{2}\left[\frac{x^{2}}{\sigma_{x}^{2}}+\frac{y^{2}}{\sigma_{y}^{2}}\right]\right\}\cos(2\pi\mu_{\sigma}x+\phi)$$
<ul>
<li><strong>Bank Structure</strong>: 28 filters total (4 orientations $\times$ 7 radial frequencies).</li>
<li><strong>Orientations</strong>: $0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}$.</li>
<li><strong>Frequencies</strong>: 1 octave apart, specifically $\sqrt{2}, \dots, 64\sqrt{2}$.</li>
<li><strong>Selected Frequency</strong>: $4\sqrt{2}$ was found to be optimal for chemistry.</li>
</ul>
</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Buffer Mounting</strong>: Images are mounted in a buffer (set to 0) to handle edge artifacts.</li>
<li><strong>Look-Up-Tables (LUT/LUF)</strong>: A binary Look-Up-Frame (LUF) indicates Regions of Interest (ROI) to avoid computing empty space; values are stored in a Look-Up-Table (LUT) to prevent re-computation of overlapping windows.</li>
</ul>
</li>
<li><strong>Feature Extraction</strong>:
<ul>
<li><strong>Non-linear Thresholding</strong>: $\psi(t) = \tanh(\alpha t)$ with $\alpha = 0.25$.</li>
<li><strong>Energy Function</strong>: Calculated as average absolute deviation from the mean using a window $W_{xy}$.
$$e_{k}(x,y)=\frac{1}{M^{2}}\sum_{(a,b)\in W_{xy}}|\psi(r_{k}(a,b))|$$</li>
<li><strong>Optimal Window</strong>: $9 \times 9$ pixels.</li>
</ul>
</li>
</ul>
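<p>The feature-extraction steps above can be sketched directly from the stated formulas: apply the non-linear threshold $\psi(t) = \tanh(\alpha t)$ to a Gabor filter response, then average $|\psi|$ over an $M \times M$ window. The function name and the brute-force sliding window are illustrative; a real implementation would vectorize the windowed sum.</p>

```python
import numpy as np


def gabor_energy(response: np.ndarray, m: int = 9, alpha: float = 0.25) -> np.ndarray:
    """Energy feature map e_k for one Gabor filter response r_k.

    Applies psi(t) = tanh(alpha * t), then averages |psi| over an
    m x m window centered on each pixel (m = 9 and alpha = 0.25 are
    the values reported as optimal). Windows are clipped at borders.
    """
    psi = np.abs(np.tanh(alpha * response))
    h, w = psi.shape
    r = m // 2
    energy = np.zeros_like(psi)
    for y in range(h):
        for x in range(w):
            win = psi[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            energy[y, x] = win.sum() / (m * m)  # 1/M^2 normalization
    return energy
```

On a constant response, every interior pixel's energy is simply $\tanh(\alpha)$.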
<h3 id="models">Models</h3>
<p>The classification model relies on competitive learning.</p>
<ul>
<li><strong>Architecture</strong>: <strong>Kohonen Self-Organizing Feature Map (KSOFM)</strong>.</li>
<li><strong>Training</strong>:
<ul>
<li><strong>Learning Rate</strong>: Starts at 1.0, decreases to 0.1.</li>
<li><strong>Class Boundary Analysis (CBA)</strong>: Computes the centroid (mean) and variance of each cluster. The variance defines the class boundary.</li>
</ul>
</li>
<li><strong>Classification Metric</strong>: <strong>Euclidean Distance Norm</strong>. An unknown vector is classified based on the shortest distance to a cluster center, provided it falls within the variance boundary.
$$D_{ij}=||x_{i}-x_{j}||$$</li>
</ul>
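<p>The decision rule can be sketched as a nearest-centroid classifier with a variance boundary. The cluster structure shown (one centroid and a scalar boundary radius per class) is an assumed simplification of the KSOFM/CBA machinery; training the map itself is not shown.</p>

```python
import math


def classify(vec, clusters):
    """Nearest-centroid classification with a class-boundary check (CBA).

    clusters : dict label -> (centroid, boundary_radius), where the
               radius stands in for the variance-derived class boundary.
    Returns the label of the closest centroid (Euclidean distance norm)
    if the vector falls inside that cluster's boundary, else None
    (rejected as unknown).
    """
    best_label, best_dist = None, float("inf")
    for label, (centroid, _radius) in clusters.items():
        d = math.dist(vec, centroid)  # D_ij = ||x_i - x_j||
        if d < best_dist:
            best_label, best_dist = label, d
    if best_label is not None and best_dist <= clusters[best_label][1]:
        return best_label
    return None
```

A vector that is closest to a centroid but outside its boundary is rejected rather than forced into a class.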
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using recognition rate ($R_s$) and misclassification error ($E_s$).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Recognition Rate</td>
          <td><strong>91%</strong></td>
          <td>N/A</td>
          <td>Achieved with 50:50 split. 92% with 70:30 split.</td>
      </tr>
      <tr>
          <td>Feature Size</td>
          <td><strong>~1500</strong></td>
          <td>4000</td>
          <td>Reducing vector size from 4000 to 1500 maintained ~80% accuracy while improving speed.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gkoutosChemicalMachineVision2003,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}: {{Automated Extraction}} of {{Chemical Metadata}} from {{Raster Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical {{Machine Vision}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Gkoutos, Georgios V. and Rzepa, Henry and Clark, Richard M. and Adjei, Osei and Johal, Harpal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2003</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{43}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1342--1355}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci034017n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Literature Data Extraction: The CLiDE Project</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/clide-1993/</guid><description>Seminal OCSR system converting scanned chemical diagrams into connection tables via primitive recognition and semantic interpretation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ibison, P., Jacquot, M., Kam, F., Neville, A. G., Simpson, R. W., Tonnelier, C., Venczel, T., &amp; Johnson, A. P. (1993). Chemical Literature Data Extraction: The CLiDE Project. <em>Journal of Chemical Information and Computer Sciences</em>, 33(3), 338-344. <a href="https://doi.org/10.1021/ci00013a010">https://doi.org/10.1021/ci00013a010</a></p>
<p><strong>Publication</strong>: J. Chem. Inf. Comput. Sci. 1993</p>
<h2 id="contribution-and-taxonomy">Contribution and Taxonomy</h2>
<p><strong>Classification: Method ($\Psi_{\text{Method}}$)</strong></p>
<p>This methodological paper proposes a novel software architecture for Optical Chemical Structure Recognition (OCSR). It details specific algorithms for image segmentation, vectorization, and chemical interpretation, validated through the successful extraction of complex structures from literature.</p>
<h2 id="motivation-automating-literature-extraction">Motivation: Automating Literature Extraction</h2>
<p>The manual creation of chemical reaction databases is a time-consuming and expensive process requiring trained chemists to abstract information from literature. While commercial tools existed for interpreting isolated scanned structures (like Kekulé), there was a lack of systems capable of processing whole pages of journals (including embedded text, reaction schemes, and structures) without significant human intervention.</p>
<h2 id="core-innovation-a-three-phase-hybrid-architecture">Core Innovation: A Three-Phase Hybrid Architecture</h2>
<p>CLiDE introduces a comprehensive <strong>three-phase architecture</strong> (Recognition, Grouping, Interpretation) that integrates computer vision with chemical knowledge. Key novelties include:</p>
<ul>
<li><strong>Context-Aware Interpretation:</strong> The use of an extendable <strong>superatom database</strong> to resolve ambiguities in chemical text (e.g., expanding &ldquo;OAc&rdquo; or &ldquo;Me&rdquo; into connection tables).</li>
<li><strong>Hybrid Primitive Detection:</strong> A combination of contour coding for solid lines and a modified Hough transform specifically tuned for detecting dashed chemical bonds.</li>
<li><strong>Semantic Reconstruction:</strong> A scoring system for bond-atom association that considers both distance and vector direction to handle poorly drawn structures.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<p>The authors validated the system on a set of &ldquo;difficult cases&rdquo; selected to test specific capabilities. These included:</p>
<ul>
<li><strong>Crossing Bonds:</strong> Structures where bonds intersect without a central atom (Fig. 9d, 9e).</li>
<li><strong>Stereochemistry:</strong> Identification of wedged, dashed, and wavy bonds.</li>
<li><strong>Generic Structures:</strong> Parsing generic text blocks (e.g., $R^1 = Me$) and performing substitutions.</li>
<li><strong>Accuracy Estimation:</strong> The authors report an approximate 90% recognition rate for distinct characters in literature scans.</li>
</ul>
<h2 id="results-and-structural-reconstruction">Results and Structural Reconstruction</h2>
<p>The system successfully generates connection tables (exported as MOLfiles or ChemDraw files) from scanned bitmaps. It effectively distinguishes between graphical primitives (wedges, lines) and text, accurately reconstructing stereochemistry and resolving superatom synonyms (e.g., converting &ldquo;MeO&rdquo; to &ldquo;OMe&rdquo;). The authors conclude that while character recognition depends heavily on image quality, the graphic primitive recognition is robust for lines above a threshold length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Input Format:</strong> Binary bitmaps scanned from journal pages.</li>
<li><strong>Resolution:</strong> 300 dpi (generating ~1 MB per page).</li>
<li><strong>Superatom Database:</strong> A lookup table containing ~200 entries. Each entry includes:
<ul>
<li><strong>Valency/Charge:</strong> Explicit constraints (e.g., &ldquo;HO&rdquo; takes 1 bond, &ldquo;CO2&rdquo; takes 2).</li>
<li><strong>Bonding Index:</strong> Specifies which letter in the string serves as the attachment point (e.g., letter 2 for &ldquo;HO&rdquo;, letters 1 and 2 for &ldquo;CO2&rdquo;).</li>
<li><strong>Sub-Connection Table:</strong> The internal atomic representation of the group.</li>
</ul>
</li>
</ul>
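<p>A hypothetical sketch of what a superatom entry might look like, following the fields described above. The dictionary layout and helper name are illustrative, not CLiDE's actual format, and the internal sub-connection table is omitted for brevity.</p>

```python
# Each abbreviation stores how many bonds it accepts and which letter(s)
# in the string serve as the attachment point (1-indexed, per the paper's
# examples). Values here mirror the examples given in the text.
SUPERATOMS = {
    "HO":  {"valency": 1, "bonding_letters": [2]},     # bonds through the O
    "CO2": {"valency": 2, "bonding_letters": [1, 2]},  # takes 2 bonds
    "OAc": {"valency": 1, "bonding_letters": [1]},     # illustrative entry
}


def lookup(label: str):
    """Return the superatom entry for a recognized text label, or None
    if the label is unknown (left to downstream valence checking)."""
    return SUPERATOMS.get(label)
```

Resolving a label like &ldquo;HO&rdquo; then amounts to a dictionary lookup plus substitution of the stored sub-connection table.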
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Primitive Recognition (Vectorization)</strong></p>
<ul>
<li><strong>Contour Coding:</strong> Uses the <strong>Ahronovitz-Bertier-Habib</strong> method to generate interpixel contours (directions N, S, E, W) for connected components.</li>
<li><strong>Polygonal Approximation:</strong> A method similar to that of <strong>Sklansky and Gonzalez</strong> breaks contours into &ldquo;fractions&rdquo;.
<ul>
<li><em>Rule:</em> Long sides are &ldquo;straight fractions&rdquo;; consecutive short sides are &ldquo;curved fractions&rdquo;.</li>
<li><em>Reconstruction:</em> Parallel fractions are paired to form bond borders. If a border is split (due to noise or crossing lines), the system attempts to merge collinear segments.</li>
</ul>
</li>
<li><strong>Dash Detection:</strong> A <strong>modified Hough transform</strong> is applied to small connected components. It requires at least <strong>three collinear dashes</strong> to classify a sequence as a dashed bond.</li>
</ul>
<p><strong>2. Interpretation Rules</strong></p>
<ul>
<li><strong>Bond-Atom Association:</strong>
<ul>
<li><em>Candidate Selection:</em> The system identifies $m$ closest bonds for a superatom requiring $n$ connections ($m \ge n$).</li>
<li><em>Scoring Function:</em> Connections are selected based on minimizing <strong>perpendicular distance</strong> (alignment).</li>
</ul>
</li>
<li><strong>Crossing Bonds:</strong> Resolved using rules based on <strong>proximity, length, collinearity, and ring membership</strong> to distinguish actual crossings from central carbon atoms.</li>
</ul>
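<p>The distance-based half of the bond-atom scoring can be sketched as below: from the candidate bonds near a superatom, keep the $n$ whose supporting lines pass closest to the atom's position. This is a simplification (the paper's score also weighs vector direction, omitted here), and the names are illustrative.</p>

```python
import math


def associate_bonds(atom_xy, bonds, n_required):
    """Select the n best bond attachments for a superatom by the
    perpendicular distance from the atom position to each bond's
    supporting line.

    bonds : list of ((x1, y1), (x2, y2)) candidate bond segments.
    """
    ax, ay = atom_xy

    def perp_dist(seg):
        (x1, y1), (x2, y2) = seg
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        if length == 0:
            return math.hypot(ax - x1, ay - y1)  # degenerate segment
        # distance from the atom to the infinite line through the segment
        return abs(dy * (ax - x1) - dx * (ay - y1)) / length

    return sorted(bonds, key=perp_dist)[:n_required]
```

A bond whose line points straight at the atom scores a distance near zero and is preferred over a misaligned one.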
<h3 id="models">Models</h3>
<ul>
<li><strong>OCR:</strong> A neural network trained on alphanumeric characters.
<ul>
<li><strong>Input Representation:</strong> Density matrices derived from character bitmaps.</li>
<li><strong>Post-processing:</strong> Unrecognized characters are flagged for manual correction.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Platform:</strong> SUN SPARC workstation.</li>
<li><strong>Scanner:</strong> Agfa Focus S 800GS.</li>
<li><strong>Implementation Language:</strong> C++.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ibisonChemicalLiteratureData1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction: {{The CLiDE Project}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemical Literature Data Extraction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ibison, P. and Jacquot, M. and Kam, F. and Neville, A. G. and Simpson, R. W. and Tonnelier, C. and Venczel, T. and Johnson, A. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Chemical Information and Computer Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{338--344}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0095-2338, 1520-5142}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/ci00013a010}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Automatic Recognition of Chemical Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/algorri-chemical-image-recognition-2007/</guid><description>A rule-based system for extracting chemical structure information from raster images, validated against commercial baselines.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-image-mining-architecture">Contribution: Rule-Based Image Mining Architecture</h2>
<p><strong>$\Psi_{\text{Method}}$ (Methodological Basis)</strong></p>
<p>This is a methodological paper describing a system architecture for <strong>image mining</strong> in the chemical domain. It focuses on the engineering challenge of converting rasterized depictions of molecules into computer-readable SDF files. The paper details the algorithmic pipeline and validates it through quantitative benchmarking against a commercial alternative.</p>
<h2 id="motivation-digitizing-chemical-literature">Motivation: Digitizing Chemical Literature</h2>
<ul>
<li><strong>Loss of Information</strong>: Chemical software renders molecules as images; once these are published in the scientific literature, the underlying chemical significance is lost, leaving the data &ldquo;dead&rdquo; to computers.</li>
<li><strong>Gap in Technology</strong>: Image mining lags behind advances in text mining; existing commercial solutions (like CLIDE) have either faded away or remained limited.</li>
<li><strong>Scale of Problem</strong>: The colossal production of chemical documents requires automated tools to exploit this information at large scale.</li>
</ul>
<h2 id="core-innovation-graph-preserving-vectorization">Core Innovation: Graph-Preserving Vectorization</h2>
<ul>
<li><strong>Graph-Preserving Vectorization</strong>: The system uses a custom vectorizer designed to preserve the &ldquo;graph characteristics&rdquo; of chemical diagrams (1 vector = 1 line), which avoids creating spurious vectors at thick joints. It aims to generate a mathematical graph, $G = (V, E)$, mapped geometrically to the image lines.</li>
<li><strong>Chemical Knowledge Integration</strong>: A distinct module validates the reconstructed graph against chemical rules (valences, charges) to ensure the output is chemically valid.</li>
<li><strong>Hybrid Processing</strong>: The system splits the image into &ldquo;connected components&rdquo; for an OCR path (text/symbols) and a &ldquo;body&rdquo; path (bonds), reassembling them later.</li>
</ul>
<h2 id="methodology--experiments-benchmark-validation">Methodology &amp; Experiments: Benchmark Validation</h2>
<p>The authors performed a quantitative validation using <strong>three different databases</strong> where ground-truth SDF files were available. They also compared their system against the commercial tool <strong>CLIDE</strong> (Chemical Literature Data Extraction).</p>
<ul>
<li><strong>Database 1</strong>: 100 images (varied line widths/fonts)</li>
<li><strong>Database 2</strong>: 100 images</li>
<li><strong>Database 3</strong>: 7,604 images (large-scale batch processing)</li>
</ul>
<h2 id="results--conclusions-superior-accuracy-over-baselines">Results &amp; Conclusions: Superior Accuracy over Baselines</h2>
<ul>
<li><strong>High Accuracy</strong>: The system achieved <strong>94%</strong> correct reconstruction on Database 1 and <strong>77%</strong> on Database 2. Accuracy was measured as correct recovery of identical geometry and connections.</li>
</ul>
<p>$$ \text{Acc} = \frac{\text{Correct Images}}{\text{Total Images}} $$</p>
<ul>
<li><strong>Baseline Superiority</strong>: The commercial tool CLIDE only successfully reconstructed ~50% of images in Database 1 (compared to the authors&rsquo; 94%).</li>
<li><strong>Scalability</strong>: On the large dataset (Database 3), the system achieved <strong>67%</strong> accuracy in batch mode.</li>
<li><strong>Robustness</strong>: The authors state the system uses a handful of parameters and works robustly across different image types. CLIDE lacked flexibility and required manual intervention.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Reproducibility Status</strong>: Closed / Not Formally Reproducible. As is common with applied research from this era, the source code, training models (SVM), and specific datasets used for benchmarking do not appear to be publicly maintained or available.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">No public code, models, or datasets were released with this 2007 publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Database 1</td>
          <td>100 Images</td>
          <td>Used for comparison with CLIDE; 94% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 2</td>
          <td>100 Images</td>
          <td>77% success rate</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Database 3</td>
          <td>7,604 Images</td>
          <td>Large-scale test; 67% success rate</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper outlines a 5-module pipeline:</p>
<ol>
<li><strong>Pre-processing</strong>: Adaptive histogram binarization and non-recursive connected component labeling using RLE segments.</li>
<li><strong>OCR</strong>: A &ldquo;chemically oriented OCR&rdquo; using wavelet functions for feature extraction and a <strong>Support Vector Machine (SVM)</strong> for classification. It distinguishes characters from molecular structure.</li>
<li><strong>Vectorizer</strong>: Assigns local directions to RLE segments and groups them into patterns. Crucially, it enforces a one-to-one mapping between image lines and graph vectors.</li>
<li><strong>Reconstruction</strong>: A rule-based module that annotates vectors:
<ul>
<li><strong>Stereochemistry</strong>: Registers vectors against original pixels; thick geometric forms (triangles) become chiral wedges.</li>
<li><strong>Dotted Bonds</strong>: Identifies isolated vectors and clusters them using <strong>quadtree clustering</strong>.</li>
<li><strong>Multi-bonds</strong>: Identifies parallel vectors within a dilated bounding box (factor of 2).</li>
</ul>
</li>
<li><strong>Chemical Knowledge</strong>: Validates the graph valences and properties before exporting SDF.</li>
</ol>
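<p>The multi-bond rule in the reconstruction step might be sketched as below: two vectors form a double/triple bond if they are roughly parallel and one lies inside the other's dilated bounding box. The angle tolerance and the exact dilation scheme are assumptions on my part; the paper only states a dilation factor of 2.</p>

```python
import math


def are_multibond(v1, v2, angle_tol=0.1, dilation=2.0):
    """Heuristic multi-bond test for two vectors ((x1, y1), (x2, y2)):
    nearly parallel, and v2's endpoints fall inside v1's bounding box
    dilated by the given factor."""

    def angle(v):
        (x1, y1), (x2, y2) = v
        return math.atan2(y2 - y1, x2 - x1) % math.pi  # undirected angle

    def dilated_box(v, factor):
        (x1, y1), (x2, y2) = v
        xmin, xmax = min(x1, x2), max(x1, x2)
        ymin, ymax = min(y1, y2), max(y1, y2)
        # pad both dimensions so thin (axis-aligned) boxes still grow
        pad = (factor - 1) * max(xmax - xmin, ymax - ymin) / 2
        return xmin - pad, ymin - pad, xmax + pad, ymax + pad

    def inside(v, box):
        xmin, ymin, xmax, ymax = box
        return all(xmin <= x <= xmax and ymin <= y <= ymax for x, y in v)

    d = abs(angle(v1) - angle(v2))
    parallel = min(d, math.pi - d) < angle_tol  # handle wrap-around at pi
    return parallel and inside(v2, dilated_box(v1, dilation))
```

Two parallel strokes one line-width apart pass the test; a stroke at 45&deg; fails the parallelism check immediately.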
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM</strong>: Used in the OCR module to classify text/symbols. It supports dynamic training to correct classification mistakes.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is the percentage of correctly reconstructed images (generating a valid, matching SDF file).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>System Value (DB1)</th>
          <th>Baseline (CLIDE)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reconstruction Accuracy</td>
          <td><strong>94%</strong></td>
          <td>~50%</td>
          <td>CLIDE noted as unsuitable for batch processing</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Algorri, M.-E., Zimmermann, M., &amp; Hofmann-Apitius, M. (2007). Automatic Recognition of Chemical Images. <em>Eighth Mexican International Conference on Current Trends in Computer Science</em>, 41-46. <a href="https://doi.org/10.1109/ENC.2007.25">https://doi.org/10.1109/ENC.2007.25</a></p>
<p><strong>Publication</strong>: ENC 2007 (IEEE Computer Society)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{algorriAutomaticRecognitionChemical2007,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic {{Recognition}} of {{Chemical Images}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Eighth {{Mexican International Conference}} on {{Current Trends}} in {{Computer Science}} ({{ENC}} 2007)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Algorri, Maria-Elena and Zimmermann, Marc and {Hofmann-Apitius}, Martin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{41--46}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ENC.2007.25}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>A Spatial Model for Legislative Roll Call Analysis</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/nominate-1985/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/nominate-1985/</guid><description>Introduces NOMINATE, a probabilistic spatial model estimating legislator ideal points from roll call data via maximum likelihood.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It introduces a &ldquo;general nonlinear logit model&rdquo; and a specific estimation algorithm (<strong>NOMINATE</strong>) to analyze political choice data. The paper focuses on deriving a metric spatial map from nominal data (yea/nay votes). It validates this method by comparing it against existing techniques like Guttman scaling and factor analysis, demonstrating that the new method recovers geometric structures that previous methods obscured.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Prior research relied on &ldquo;black box&rdquo; statistical methods (like factor analysis or nonmetric scaling) or Guttman scaling to analyze legislative behavior. These methods had significant limitations:</p>
<ul>
<li><strong>Metric Recovery</strong>: They struggled to accurately recover the underlying Euclidean coordinates of legislators and choices from nominal data.</li>
<li><strong>Dimensionality</strong>: They tended to exaggerate the number of dimensions (issues) because they did not account for probabilistic error in voting.</li>
<li><strong>Identification</strong>: Pure Guttman scaling (assuming perfect voting) identifies only the order of legislators, leaving the location of policy alternatives unknown.</li>
</ul>
<p>The authors sought to bridge the &ldquo;crucial gap&rdquo; between spatial theory and data by developing a model-driven procedure that simultaneously estimates the locations of choosers and choices while accounting for error.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core contribution is the <strong>NOMINATE</strong> (Nominal Three-step Estimation) procedure. Key innovations include:</p>
<ul>
<li><strong>Simultaneous Estimation</strong>: This method estimates coordinates for <em>both</em> the legislators ($x_i$) and the roll call outcomes ($z_{jl}$) in a common space simultaneously.</li>
<li><strong>Probabilistic Utility</strong>: It employs a specific bell-shaped utility function with a stochastic error term (log of the inverse exponential), allowing for a tractable probabilistic voting model.</li>
<li><strong>Metric Unfolding</strong>: It successfully performs &ldquo;unfolding methodology for nominal level data,&rdquo; recovering metric distances solely from binary choices.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model through both historical data analysis and synthetic testing:</p>
<ul>
<li><strong>US House Analysis (1957-58)</strong>: Analyzed 172 roll calls from the 85th Congress to compare NOMINATE results against Miller and Stokes&rsquo; influential Guttman scales.</li>
<li><strong>US Senate Analysis (1979-1982)</strong>: Performed separate estimations for four years of Senate voting to assess stability and validity.</li>
<li><strong>Monte Carlo Simulations</strong>: Generated synthetic data (98 legislators and 291 roll calls in most runs, 50 legislators in one run) for different values of $\beta$ to test the robustness of parameter recovery under known &ldquo;truth&rdquo; conditions.</li>
<li><strong>Robustness Checks</strong>: Tested sensitivity to &ldquo;perfect&rdquo; legislators (who never vote against their side) and outliers (like Senator Proxmire).</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Unidimensionality</strong>: A single liberal-conservative dimension correctly classified ~80% of individual choices in the US House and Senate.</li>
<li><strong>Dimensionality Reduction</strong>: The model demonstrated that distinct &ldquo;issue scales&rdquo; found in previous research (e.g., social welfare vs. foreign policy) could largely be mapped onto a single dimension when error is accounted for.</li>
<li><strong>Strategic Behavior</strong>: The analysis revealed that majority leadership tends to place roll call midpoints slightly away from the median legislator to increase the probability of passage.</li>
<li><strong>Geometric Mean Probability</strong>: The authors introduced the geometric mean probability as a more robust metric than simple classification error for evaluating probabilistic models.</li>
<li><strong>Limitations</strong>: The authors acknowledge that the model is restricted to one dimension with a common utility function, and that civil rights voting represents a genuinely separate dimension not captured by the liberal-conservative axis. Standard errors computed from the alternating procedure are theoretically approximate (computed from separate information matrices rather than the full joint matrix), though Monte Carlo tests showed them to be reasonably reliable in practice. Extensions to multidimensional models and variable utility functions are deferred to later work.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper analyzes roll call voting matrices (a roll call is a procedure in which each legislator&rsquo;s name is called and their individual vote is recorded, producing a complete public record of who voted which way) where rows are legislators and columns are roll calls.</p>
<table>
  <thead>
      <tr>
          <th>Context</th>
          <th>Size</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>US House (85th)</strong></td>
          <td>440 Legislators x 172 Roll Calls</td>
          <td>68,284 choices; 1957-58</td>
      </tr>
      <tr>
          <td><strong>US Senate</strong></td>
          <td>~100 Senators/year</td>
          <td>Years 1979, 1980, 1981, 1982</td>
      </tr>
      <tr>
          <td><strong>Filtering</strong></td>
          <td>Cutoff &gt; 2.5%</td>
          <td>Roll calls with &lt; 2.5% minority vote are excluded to prevent &ldquo;noise&rdquo; from distorting estimates.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>NOMINATE</strong> algorithm maximizes the log-likelihood of observed choices using a constrained nonlinear maximum likelihood procedure.</p>
<p><strong>Utility Function</strong>:
The utility of legislator $i$ for outcome $j$ on roll call $l$ is:
$$U_{ijl}=\beta~\exp\left[\frac{-\omega^{2}d_{ijl}^{2}}{2}\right]+\epsilon_{ijl}$$
Where $d_{ijl}$ is the Euclidean distance between legislator $i$ and outcome $j$.</p>
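<p>Under the logit error assumption, the implied choice probability has a closed form: the probability of voting Yea is the logistic function of the deterministic utility difference. A minimal one-dimensional sketch (the default $\beta$ and $\omega$ values are illustrative, not the paper's estimates):</p>

```python
import math

def vote_prob_yea(x_i, z_yea, z_nay, beta=15.0, w=0.5):
    """P(legislator at ideal point x_i votes Yea) under the utility
    U = beta * exp(-w^2 * d^2 / 2) + logit error, which yields the
    standard logit choice probability. beta and w defaults are
    illustrative only."""
    u_yea = beta * math.exp(-(w ** 2) * (x_i - z_yea) ** 2 / 2.0)
    u_nay = beta * math.exp(-(w ** 2) * (x_i - z_nay) ** 2 / 2.0)
    return 1.0 / (1.0 + math.exp(u_nay - u_yea))
```

<p><em>A legislator at $x_i = -0.8$ facing a Yea outcome at $-0.5$ and a Nay outcome at $+0.5$ votes Yea with high probability; a legislator exactly at the midpoint between the two outcomes is at 50/50.</em></p>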
<p><strong>Optimization Strategy (Global Iteration)</strong>:
Because estimating ~800 parameters simultaneously is impractical, the algorithm uses an alternating three-step method:</p>
<ol>
<li><strong>Utility Parameters</strong>: Estimate $\beta$ and $\omega$ while holding legislator ($x$) and roll call ($z$) coordinates fixed.</li>
<li><strong>Legislator Coordinates</strong>: Estimate $x_i$ for each legislator (independent of others) holding $\beta, \omega, z$ fixed.</li>
<li><strong>Roll Call Coordinates</strong>: Estimate $z_{yl}, z_{nl}$ for each roll call holding $\beta, \omega, x$ fixed.</li>
</ol>
<p>This cycle repeats until parameters correlate at the 0.99 level between iterations.</p>
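<p>The stopping rule can be sketched directly: compare successive estimates of a parameter vector by Pearson correlation and halt once it reaches 0.99 (a minimal stdlib-only version):</p>

```python
def converged(prev, curr, threshold=0.99):
    """Stopping rule from the paper: iterate the three-step cycle
    until successive parameter estimates correlate at the 0.99 level.
    prev and curr are same-length parameter vectors from consecutive
    global iterations."""
    n = len(prev)
    mp, mc = sum(prev) / n, sum(curr) / n
    cov = sum((p - mp) * (c - mc) for p, c in zip(prev, curr))
    sp = sum((p - mp) ** 2 for p in prev) ** 0.5
    sc = sum((c - mc) ** 2 for c in curr) ** 0.5
    return cov / (sp * sc) >= threshold
```
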
<h3 id="models">Models</h3>
<p>The model estimates the following parameters for a one-dimensional space:</p>
<ul>
<li><strong>Legislator Coordinates ($x_i$)</strong>: The ideal point of each legislator, normalized to the range $[-1, +1]$.</li>
<li><strong>Outcome Coordinates ($z_{yl}, z_{nl}$)</strong>: The spatial location of the &ldquo;Yea&rdquo; and &ldquo;Nay&rdquo; policy outcomes for each vote.</li>
<li><strong>Signal-to-Noise ($\beta$)</strong>: Represents the weight of the spatial component versus the error term.</li>
<li><strong>Weighting ($\omega$)</strong>: A shape parameter for the utility function (often fixed to $0.5$ in practice due to collinearity with $\beta$).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is evaluated primarily via classification accuracy and probabilistic fit.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Context</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Classification</strong></td>
          <td>78.9%</td>
          <td>House (1957-58)</td>
          <td>Correctly predicts Yea/Nay choice</td>
      </tr>
      <tr>
          <td><strong>Classification</strong></td>
          <td>80.3 / 80.6 / 83.2 / 81.7%</td>
          <td>Senate (1979 / 1980 / 1981 / 1982)</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Geo. Mean Prob.</strong></td>
          <td>0.642 (House); 0.654 / 0.638 / 0.657 / 0.637 (Senate 1979 / 1980 / 1981 / 1982)</td>
          <td>Unconstrained roll calls</td>
          <td>Exponential of the average log likelihood</td>
      </tr>
  </tbody>
</table>
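<p>The geometric mean probability reported above is simply the exponential of the average log-likelihood of the observed choices:</p>

```python
import math

def geometric_mean_probability(probs):
    """exp((1/n) * sum(log p_i)) over the model's probabilities for
    the choices actually made. Unlike classification accuracy, one
    confidently wrong prediction (p_i near zero) drags the geometric
    mean toward zero, making it a stricter probabilistic metric."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))
```
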
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Development</strong>: DEC-2060</li>
<li><strong>Production</strong>: VAX-11/780</li>
</ul>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p>This paper predates modern open-source conventions. No original source code was released, and the NOMINATE algorithm was described at an overview level rather than with full pseudocode. However, the underlying roll call voting data for the U.S. Congress is now freely available through the <a href="https://voteview.com/">Voteview</a> project, which Poole and Rosenthal later maintained. Modern open-source reimplementations exist, including the R packages <code>wnominate</code> and <code>pscl</code>. Reproducibility status: <strong>Partially Reproducible</strong> (data available, modern reimplementations exist, but original code not released).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Poole, K. T., &amp; Rosenthal, H. (1985). A Spatial Model for Legislative Roll Call Analysis. <em>American Journal of Political Science</em>, 29(2), 357-384. <a href="https://doi.org/10.2307/2111172">https://doi.org/10.2307/2111172</a></p>
<p><strong>Publication</strong>: American Journal of Political Science 1985</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{pooleSpatialModelLegislative1985,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A {{Spatial Model}} for {{Legislative Roll Call Analysis}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Poole, Keith T. and Rosenthal, Howard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1985</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{American Journal of Political Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{29}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{357--384}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2307/2111172}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/NOMINATE_(scaling_method)">Wikipedia: NOMINATE</a></li>
<li><a href="https://voteview.com/">Voteview (Modern Repository)</a></li>
</ul>
]]></content:encoded></item><item><title>Correlations in the Motion of Atoms in Liquid Argon</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/correlations-motion-atoms-liquid-argon/</link><pubDate>Sat, 13 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/correlations-motion-atoms-liquid-argon/</guid><description>Rahman's 1964 MD simulation of 864 argon atoms with Lennard-Jones potential revealed the cage effect and validated classical molecular dynamics for liquids.</description><content:encoded><![CDATA[<h2 id="contribution-methodological-validation-of-md">Contribution: Methodological Validation of MD</h2>
<p>This is the archetypal <strong>Method</strong> paper (dominant classification with secondary <strong>Theory</strong> contribution). It establishes the architectural validity of Molecular Dynamics (MD) as a scientific tool. Rahman answers the question: &ldquo;Can a digital computer solving classical difference equations faithfully represent a physical liquid?&rdquo;</p>
<p>The paper utilizes specific rhetorical indicators of a methodological contribution:</p>
<ul>
<li><strong>Algorithmic Explication</strong>: A dedicated Appendix details the predictor-corrector difference equations.</li>
<li><strong>Validation against Ground Truth</strong>: Extensive comparison of calculated diffusion constants and pair-correlation functions against experimental neutron and X-ray scattering data.</li>
<li><strong>Robustness Checks</strong>: Ablation studies on the numerical integration stability (one vs. two corrector cycles).</li>
</ul>
<h2 id="motivation-bridging-neutron-scattering-and-many-body-theory">Motivation: Bridging Neutron Scattering and Many-Body Theory</h2>
<p>In the early 1960s, neutron scattering data provided insights into the dynamic structure of liquids, but theorists lacked concrete models to explain the observed two-body dynamical correlations. Analytic theories were limited by the difficulty of the many-body problem.</p>
<p>Rahman sought to bypass these analytical bottlenecks by assuming that <strong>classical dynamics</strong> with a simple 2-body potential (Lennard-Jones) could sufficiently describe the motion of atoms in liquid argon. The goal was to generate &ldquo;experimental&rdquo; data via simulation to test theoretical models (like the Vineyard convolution approximation) and provide a microscopic understanding of diffusion.</p>
<h2 id="core-innovation-system-stability-and-the-cage-effect">Core Innovation: System Stability and the Cage Effect</h2>
<p>This paper is widely considered the birth of modern molecular dynamics for continuous potentials. Its key novelties include:</p>
<ol>
<li><strong>System Size &amp; Stability</strong>: Successfully simulating 864 particles interacting via a continuous Lennard-Jones potential with stable temperature over the full simulation duration (approximately $10^{-11}$ sec, as confirmed by Table I in the paper).</li>
<li><strong>The &ldquo;Cage Effect&rdquo;</strong>: The discovery that the velocity autocorrelation function becomes negative after a short time:
$$ \langle \textbf{v}(0) \cdot \textbf{v}(t) \rangle &lt; 0 \quad \text{for } t &gt; 0.33 \times 10^{-12} \text{ s} $$
This proved that atoms in a liquid &ldquo;rattle&rdquo; against the cage of their nearest neighbors.</li>
<li><strong>Delayed Convolution</strong>: Proposing an improvement to the Vineyard approximation for the distinct Van Hove function $G_d(r,t)$ by introducing a time-delayed convolution to account for the persistence of local structure. Instead of convolving $g(r)$ with $G_s(r,t)$ at the same time $t$, Rahman convolves at a delayed time $t' &lt; t$, using a one-parameter function with $\tau = 1.0 \times 10^{-12}$ sec. This makes $G_d(r,t)$ decay as $t^4$ at short times (instead of $t^2$ in the Vineyard approximation) and as $t$ at long times.</li>
</ol>
<h2 id="methodology-simulating-864-argon-atoms">Methodology: Simulating 864 Argon Atoms</h2>
<p>Rahman performed a &ldquo;computer experiment&rdquo; (simulation) of <strong>Liquid Argon</strong>:</p>
<ul>
<li><strong>System</strong>: 864 particles in a cubic box of side $L=10.229\sigma$.</li>
<li><strong>Conditions</strong>: Temperature $94.4^\circ$K, Density $1.374 \text{ g cm}^{-3}$.</li>
<li><strong>Interaction</strong>: Lennard-Jones potential, truncated at $R=2.25\sigma$.</li>
<li><strong>Time Step</strong>: $\Delta t = 10^{-14}$ s (780 steps total, covering approximately $7.8 \times 10^{-12}$ s).</li>
<li><strong>Output Analysis</strong>:
<ul>
<li>Radial distribution function $g(r)$.</li>
<li>Mean square displacement $\langle r^2 \rangle$.</li>
<li>Velocity autocorrelation function $\langle v(0)\cdot v(t) \rangle$.</li>
<li>Van Hove space-time correlation functions $G_s(r,t)$ and $G_d(r,t)$.</li>
</ul>
</li>
</ul>
<h2 id="results-validation-and-non-gaussian-diffusion-analysis">Results: Validation and Non-Gaussian Diffusion Analysis</h2>
<ul>
<li><strong>Validation</strong>: The calculated pair-distribution function $g(r)$ agreed well with X-ray scattering data from Eisenstein and Gingrich (at $91.8^\circ$K). The self-diffusion constant $D = 2.43 \times 10^{-5} \text{ cm}^2 \text{ sec}^{-1}$ at $94.4^\circ$K matched the experimental value from Naghizadeh and Rice at $90^\circ$K and the same density ($1.374 \text{ g cm}^{-3}$).</li>
<li><strong>Dynamics</strong>: The velocity autocorrelation has a negative region, contradicting simple exponential decay models (Langevin). Its frequency spectrum $f(\omega)$ shows a broad maximum at $\omega \approx 0.25 (k_BT/\hbar)$, reminiscent of solid-like behavior.</li>
<li><strong>Non-Gaussian Behavior</strong>: The self-diffusion function $G_s(r,t)$ attains its maximum departure from a Gaussian shape at about $t \approx 3.0 \times 10^{-12}$ s (with $\langle r^4 \rangle$ departing from its Gaussian value by about 13%), returning to Gaussian form by $\sim 10^{-11}$ s. At that time, the rms displacement ($3.8$ Angstrom) is close to the first-neighbor distance ($3.7$ Angstrom). This indicates that Fickian diffusion is an asymptotic limit and does not apply at short times.</li>
<li><strong>Fourier Transform Validation</strong>: The Fourier transform of $g(r)$ has peaks at $\kappa\sigma = 6.8$, 12.5, 18.5, 24.8, closely matching the X-ray scattering peaks at $\kappa\sigma = 6.8$, 12.3, 18.4, 24.4.</li>
<li><strong>Temperature Dependence</strong>: A second simulation at $130^\circ$K and $1.16 \text{ g cm}^{-3}$ yielded $D = 5.67 \times 10^{-5} \text{ cm}^2 \text{ sec}^{-1}$, compared to the experimental value of $6.06 \times 10^{-5} \text{ cm}^2 \text{ sec}^{-1}$ from Naghizadeh and Rice at $120^\circ$K and $1.16 \text{ g cm}^{-3}$. The paper notes that both calculated values are lower than experiment by about 20%, and suggests that allowing for a softer repulsive part in the interaction potential might reduce this discrepancy.</li>
<li><strong>Vineyard Approximation</strong>: The standard Vineyard convolution approximation ($G_d \approx g * G_s$) produces a too-rapid decay of $G_d(r,t)$ with time. The delayed convolution, matching pairs of $(t', t)$ in units of $10^{-12}$ sec as (0.2, 0.4), (0.5, 0.8), (1.0, 1.6), (1.5, 2.3), (2.0, 2.9), (2.5, 3.5), provides a substantially better fit.</li>
<li><strong>Conclusion</strong>: Classical N-body dynamics with a truncated pair potential is a sufficient model to reproduce both the structural and dynamical properties of simple liquids.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The simulation uses physical constants for Argon:</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Particle Mass ($M$)</td>
          <td>$39.95 \times 1.6747 \times 10^{-24}$ g</td>
          <td>Mass of Argon atom</td>
      </tr>
      <tr>
          <td>Potential Depth ($\epsilon/k_B$)</td>
          <td>$120^\circ$K</td>
          <td>Lennard-Jones parameter</td>
      </tr>
      <tr>
          <td>Potential Size ($\sigma$)</td>
          <td>$3.4$ Å</td>
          <td>Lennard-Jones parameter</td>
      </tr>
      <tr>
          <td>Cutoff Radius ($R$)</td>
          <td>$2.25\sigma$</td>
          <td>Potential truncated beyond this</td>
      </tr>
      <tr>
          <td>Density ($\rho$)</td>
          <td>$1.374$ g cm$^{-3}$</td>
          <td></td>
      </tr>
      <tr>
          <td>Particle Count ($N$)</td>
          <td>864</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Rahman utilized a <strong>Predictor-Corrector</strong> scheme for solving the second-order differential equations of motion.</p>
<p><strong>Step Size</strong>: $\Delta t = 10^{-14}$ sec.</p>
<p><strong>The Algorithm:</strong></p>
<ol>
<li><strong>Predict</strong> positions $\bar{\xi}$ at $t + \Delta t$ based on previous steps:
$$\bar{\xi}_i^{(n+1)} = \xi_i^{(n-1)} + 2\Delta u \eta_i^{(n)}$$</li>
<li><strong>Calculate Forces</strong> (Accelerations $\alpha$) using predicted positions.</li>
<li><strong>Correct</strong> positions and velocities using the trapezoidal rule:
$$
\begin{aligned}
\eta_i^{(n+1)} &amp;= \eta_i^{(n)} + \frac{1}{2}\Delta u (\alpha_i^{(n+1)} + \alpha_i^{(n)}) \\
\xi_i^{(n+1)} &amp;= \xi_i^{(n)} + \frac{1}{2}\Delta u (\eta_i^{(n+1)} + \eta_i^{(n)})
\end{aligned}
$$</li>
</ol>
<p><em>Note: The paper compared one vs. two repetitions of the corrector step, finding that two passes improved precision slightly. The results presented in the paper were obtained using two passes.</em></p>
<h3 id="models">Models</h3>
<p><strong>Interaction Potential</strong>: Lennard-Jones 12-6
$$V(r_{ij}) = 4\epsilon \left[ \left(\frac{\sigma}{r_{ij}}\right)^{12} - \left(\frac{\sigma}{r_{ij}}\right)^6 \right]$$</p>
<p><strong>Boundary Conditions</strong>: Periodic Boundary Conditions (PBC) in 3 dimensions. When a particle moves out of the box ($x &gt; L$), it re-enters at $x - L$.</p>
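<p>The truncated pair interaction and the periodic boundary handling together fit in a few lines; a minimal sketch using the minimum-image convention (equivalent, for pair distances, to the re-entry rule above):</p>

```python
def minimum_image(dx, box_length):
    """Wrap a coordinate difference into [-L/2, L/2] under PBC."""
    return dx - box_length * round(dx / box_length)

def lj_pair_energy(ri, rj, sigma, epsilon, box_length, cutoff):
    """Lennard-Jones 12-6 energy for one pair under periodic boundary
    conditions, truncated at the cutoff (2.25*sigma in the paper)."""
    d2 = sum(minimum_image(a - b, box_length) ** 2 for a, b in zip(ri, rj))
    if d2 > cutoff * cutoff:
        return 0.0
    s6 = (sigma * sigma / d2) ** 3  # (sigma/r)^6
    return 4.0 * epsilon * (s6 * s6 - s6)
```

<p><em>The potential minimum sits at $r = 2^{1/6}\sigma$ with depth $-\epsilon$, and a pair separated by nearly the full box length interacts through the periodic image instead.</em></p>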
<h3 id="hardware">Hardware</h3>
<p>This is a historical benchmark for computational capability in 1964:</p>
<table>
  <thead>
      <tr>
          <th>Resource</th>
          <th>Specification</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Computer</strong></td>
          <td>CDC 3600</td>
          <td>Control Data Corporation mainframe</td>
      </tr>
      <tr>
          <td><strong>Compute Time</strong></td>
          <td>45 seconds / cycle</td>
          <td>Per predictor-corrector cycle for 864 particles (floating point)</td>
      </tr>
      <tr>
          <td><strong>Language</strong></td>
          <td>FORTRAN + Machine Language</td>
          <td>Machine language used for the most time-consuming parts</td>
      </tr>
  </tbody>
</table>
<p><em>Modern Context: Rahman&rsquo;s system (864 Argon atoms, LJ-potential) is highly reproducible today and serves as a classic pedagogical exercise. It can be simulated in standard MD frameworks (LAMMPS, OpenMM) in fractions of a second on consumer hardware.</em></p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rahman, A. (1964). Correlations in the Motion of Atoms in Liquid Argon. <em>Physical Review</em>, 136(2A), A405-A411. <a href="https://doi.org/10.1103/PhysRev.136.A405">https://doi.org/10.1103/PhysRev.136.A405</a></p>
<p><strong>Publication</strong>: Physical Review 1964</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rahman1964correlations,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Correlations in the motion of atoms in liquid argon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rahman, A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Physical Review}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{136}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{A405--A411}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1964}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{APS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1103/PhysRev.136.A405}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Aneesur_Rahman">Aneesur Rahman - Wikipedia</a></li>
</ul>
]]></content:encoded></item><item><title>Importance Weighted Autoencoders (IWAE) for Tighter Bounds</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/importance-weighted-autoencoders/</link><pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/importance-weighted-autoencoders/</guid><description>Summary of Burda, Grosse &amp; Salakhutdinov's ICLR 2016 paper introducing Importance Weighted Autoencoders for tighter variational bounds</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that introduces the <strong>Importance Weighted Autoencoder (IWAE)</strong>, a generative model that shares the same architecture as the Variational Autoencoder (VAE) but uses a different, tighter objective function. The key innovation is using importance weighting to derive a strictly tighter log-likelihood lower bound than the standard VAE objective.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The standard VAE has several limitations that motivated this work:</p>
<ul>
<li><strong>Strong assumptions</strong>: VAEs typically assume the posterior distribution is simple (e.g., approximately factorial) and that its parameters can be easily approximated from observations.</li>
<li><strong>Simplified representations</strong>: The VAE objective can force models to learn overly simplified representations that underutilize the network&rsquo;s full modeling capacity.</li>
<li><strong>Harsh penalization</strong>: The VAE objective harshly penalizes approximate posterior samples that are poor explanations for the data, which can be overly restrictive.</li>
<li><strong>Inactive units</strong>: VAEs tend to learn latent spaces with effective dimensions far below their capacity, where many latent units are ignored (a phenomenon later termed <strong>posterior collapse</strong>, where the approximate posterior collapses to the prior and conveys no information). The authors wanted to investigate whether a new objective could address this issue.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>IWAE objective function</strong>, denoted as $\mathcal{L}_{k}$.</p>
<ul>
<li>
<p><strong>VAE ($\mathcal{L}_{1}$ Bound)</strong>: The standard VAE maximizes $\mathcal{L}(x)=\mathbb{E}_{q(h|x)}[\log\frac{p(x,h)}{q(h|x)}]$. This is equivalent to the new bound when $k=1$.</p>
</li>
<li>
<p><strong>IWAE ($\mathcal{L}_{k}$ Bound)</strong>: The IWAE maximizes a tighter bound that uses $k$ samples drawn from the recognition model $q(h|x)$:</p>
</li>
</ul>
<p>$$\mathcal{L}_{k}(x)=\mathbb{E}_{h_{1},\ldots,h_{k}\sim q(h|x)}\left[\log\frac{1}{k}\sum_{i=1}^{k}\frac{p(x,h_{i})}{q(h_{i}|x)}\right]$$</p>
<ul>
<li>
<p><strong>Tighter Bound</strong>: The authors prove that this bound is always tighter than or equal to the VAE bound ($\mathcal{L}_{k+1} \geq \mathcal{L}_{k}$) and that as $k$ approaches infinity, $\mathcal{L}_{k}$ approaches the true log-likelihood $\log p(x)$.</p>
</li>
<li>
<p><strong>Increased Flexibility</strong>: Using multiple samples gives the IWAE additional flexibility to learn generative models whose posterior distributions are complex and violate the VAE&rsquo;s simplifying assumptions.</p>
</li>
</ul>
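<p>In practice $\mathcal{L}_k$ is estimated from the $k$ log importance weights, computed with a log-sum-exp for numerical stability. A minimal sketch:</p>

```python
import math

def iwae_bound_estimate(log_weights):
    """Monte Carlo estimate of L_k from k log importance weights
    log w_i = log p(x, h_i) - log q(h_i | x):
        L_k ~= logsumexp(log_w) - log(k),
    using the max-shift trick to avoid overflow/underflow."""
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights)) \
             - math.log(len(log_weights))
```

<p><em>With $k = 1$ this reduces to the single-sample ELBO term; with $k > 1$, Jensen's inequality guarantees the log-of-average is at least the average-of-logs.</em></p>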
<h3 id="key-concept-averaging-inside-vs-outside-the-log">Key Concept: Averaging Inside vs. Outside the Log</h3>
<p>A crucial distinction exists between how the VAE and the IWAE use their $k$ samples: increasing $k$ tightens the bound in the IWAE, whereas in the VAE it only reduces the variance of the gradient estimator.</p>
<figure class="post-figure center ">
    <img src="/img/notes/variational-autoencoder-vae-vs-importance-weighted-autoencoder-iwae.webp"
         alt="Flowchart comparing VAE and IWAE computation: VAE takes the log of each weight then averages (average of logs). IWAE averages the weights first then takes the log (log of average)"
         title="Flowchart comparing VAE and IWAE computation: VAE takes the log of each weight then averages (average of logs). IWAE averages the weights first then takes the log (log of average)"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">VAE vs IWAE computation flow. The key difference is where the log operation occurs: VAE computes log(w_i) for each sample then averages. IWAE averages the weights first then applies log to the result.</figcaption>
    
</figure>

<p><strong>VAE (Average of Logs):</strong></p>
<p>For a VAE, the $k$-sample objective has expectation equal to the ELBO:</p>
<p>$$\mathbb{E}\left[ \frac{1}{k} \sum_{i=1}^k \log w_i \right] = \text{ELBO}$$</p>
<p>where $w_i = p(x, h_i) / q(h_i | x)$. Increasing $k$ here only reduces the variance of the gradient estimator; the model still targets the same ELBO bound, so performance gains saturate quickly.</p>
<p><strong>IWAE (Log of Average):</strong></p>
<p>IWAE performs the averaging <em>inside</em> the logarithm:</p>
<p>$$\mathbb{E}\left[ \log \left( \frac{1}{k} \sum_{i=1}^k w_i \right) \right] = \mathcal{L}_k$$</p>
<p>By Jensen&rsquo;s Inequality ($\log(\mathbb{E}[X]) \geq \mathbb{E}[\log(X)]$, since $\log$ is concave), this bound is mathematically guaranteed to be at least as tight as the VAE bound, and each increase in $k$ yields a lower bound on the log-likelihood that is at least as tight as the previous one.</p>
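<p>The inequality holds even for a fixed batch of weights: for any positive numbers, the log of the mean is at least the mean of the logs. A minimal numerical check with hypothetical importance weights (not from a trained model):</p>

```python
import math

# Hypothetical importance weights w_i = p(x, h_i) / q(h_i | x) from 5 samples.
weights = [0.8, 1.2, 0.05, 2.0, 0.5]

avg_of_logs = sum(math.log(w) for w in weights) / len(weights)  # VAE-style term
log_of_avg = math.log(sum(weights) / len(weights))              # IWAE-style term

# Jensen's inequality: log(mean(w)) >= mean(log(w)).
assert log_of_avg >= avg_of_logs
print(log_of_avg - avg_of_logs)  # strictly positive whenever the weights differ
```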
<p><strong>Why This Matters for Gradients:</strong></p>
<p>In IWAE, the gradient weights are normalized importance weights $\tilde{w}_i = w_i / \sum_j w_j$. This means &ldquo;bad&rdquo; samples (those with low $w_i$) contribute very little to the gradient update since they vanish from the weighted sum. VAE uses unweighted samples, so a single sample with extremely low probability produces a massive negative log value that can dominate the loss and harshly penalize the model. IWAE&rsquo;s formulation allows the model to focus learning on the samples that explain the data well.</p>
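<p>A small illustration of this down-weighting, again with hypothetical weights:</p>

```python
# Normalized importance weights: each sample's share of the IWAE gradient.
weights = [0.9, 1.1, 1e-6]  # the third sample explains the data very poorly
total = sum(weights)
normalized = [w / total for w in weights]

# The bad sample contributes a vanishing fraction of the IWAE gradient...
assert normalized[2] < 1e-6
# ...whereas in the unweighted VAE average its log(1e-6) ~ -13.8 term
# would dominate the k=3 Monte Carlo loss.
```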
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors compared VAE and IWAE on density estimation tasks using the MNIST and Omniglot datasets. They evaluated two main network architectures: one with a single stochastic layer and another with two stochastic layers. The models were trained with varying numbers of importance samples ($k \in \{1, 5, 50\}$) to observe the effect on performance and latent space utilization. The primary metrics for evaluation were the test log-likelihood (estimated using 5000 samples) and the number of &ldquo;active&rdquo; latent units, which quantifies the richness of the learned representations.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>















<figure class="post-figure center ">
    <img src="/img/notes/iwae-vs-vae-active-latent-units-comparison.webp"
         alt="Bar chart comparing active latent units between VAE and IWAE across different k values on MNIST and Omniglot datasets"
         title="Bar chart comparing active latent units between VAE and IWAE across different k values on MNIST and Omniglot datasets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Active latent units for VAE vs IWAE (1 stochastic layer). VAE active units remain flat. IWAE increases with k. Data from Table 1 of Burda et al. (2016).</figcaption>
    
</figure>

<ul>
<li>
<p><strong>Better Performance</strong>: IWAE achieved higher log-likelihoods than VAEs across all configurations. On MNIST with two stochastic layers and $k=50$, IWAE reached $-82.90$ nats compared to $-84.78$ for VAE. On Omniglot, the best IWAE achieved $-103.38$ nats versus $-106.30$ for VAE. IWAE performance improved consistently with increasing $k$, while VAE performance benefited only slightly from using more samples ($k&gt;1$).</p>
</li>
<li>
<p><strong>Richer Representations</strong>: In all experiments with $k&gt;1$, IWAE learned more active latent dimensions than VAE, suggesting richer latent representations.</p>
</li>
<li>
<p><strong>Objective Drives Representation</strong>: The authors found that latent dimension inactivation is driven by the objective function. They demonstrated this through an &ldquo;objective swap&rdquo; experiment:</p>
</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/notes/iwae-objective-swap-experiment.webp"
         alt="Bar charts showing the objective swap experiment results with active units and NLL changes when switching between VAE and IWAE objectives"
         title="Bar charts showing the objective swap experiment results with active units and NLL changes when switching between VAE and IWAE objectives"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Objective swap experiment on MNIST (1 stochastic layer). Switching a trained VAE to the IWAE objective improves both metrics. Switching IWAE to VAE degrades them. Data from Table 2 of Burda et al. (2016).</figcaption>
    
</figure>

<p>This experiment provides evidence that the objective function itself influences latent utilization:</p>
<ul>
<li><strong>VAE → IWAE</strong>: A converged VAE model, when fine-tuned with the IWAE objective ($k=50$), gained 3 active units (19 → 22) and improved test NLL from 86.76 to 84.88.</li>
<li><strong>IWAE → VAE</strong>: A converged IWAE model fine-tuned with the VAE objective lost 2 active units (25 → 23) and worsened test NLL from 84.78 to 86.02.</li>
</ul>
<p>These results strongly suggest that inactivation of latent dimensions is driven primarily by the objective function rather than by initialization or architecture. The authors note that optimization dynamics also play a role, as the swap results do not exactly match training from scratch.</p>
<ul>
<li>
<p><strong>Comparison to Other Models</strong>: On MNIST, the best IWAE ($-82.90$ nats) outperformed deep belief networks ($-84.55$ nats) and deep autoregressive networks ($-84.13$ nats), though DRAW ($-80.97$ nats), which exploits spatial structure, achieved better results. On Omniglot, the best IWAE ($-103.38$ nats) fell slightly behind RBMs trained with persistent contrastive divergence ($-100.46$ nats).</p>
</li>
<li>
<p><strong>Conclusion</strong>: IWAEs learn richer latent representations and achieve better generative performance than VAEs with equivalent architectures and training time.</p>
</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>MNIST</strong>: $28 \times 28$ binarized handwritten digits (60,000 training / 10,000 test).</li>
<li><strong>Omniglot</strong>: $28 \times 28$ binarized handwritten characters from various alphabets (24,345 training / 8,070 test).</li>
<li><strong>Binarization</strong>: Dynamic sampling where binary values are sampled with expectations equal to the real pixel intensities (following Salakhutdinov &amp; Murray, 2008).</li>
<li><strong>Fixed Binarization</strong>: Results on a fixed binarization of MNIST (Larochelle, 2011) confirm that IWAE outperforms VAE across preprocessing methods, though fixed binarization exhibits notably more overfitting than dynamic sampling.</li>
</ul>
<h3 id="models">Models</h3>
<p>Two main network architectures were tested:</p>
<ol>
<li>One stochastic layer (50 units) with two deterministic layers (200 units each).</li>
<li>Two stochastic layers (100 and 50 units), with two deterministic layers of 200 units each between $x$ and $h_1$, and two deterministic layers of 100 units each between $h_1$ and $h_2$.</li>
</ol>
<ul>
<li><strong>Activations</strong>: <code>tanh</code> for deterministic layers; <code>exp</code> applied to variance predictions to ensure positivity.</li>
<li><strong>Distributions</strong>: Gaussian latent layers with diagonal covariance; Bernoulli observation layer.</li>
<li><strong>Initialization</strong>: Glorot &amp; Bengio (2010) heuristic.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-4}$).</li>
<li><strong>Batch Size</strong>: 20.</li>
<li><strong>Learning Rate Schedule</strong>: Annealed rate of $0.001 \cdot 10^{-i/7}$ for $3^i$ epochs (where $i = 0, \ldots, 7$), totaling 3,280 passes over the data.</li>
<li><strong>Variance Control</strong>: A common concern with importance sampling is high variance. The authors prove that the Mean Absolute Deviation of their estimator is bounded by $2 + 2\delta$, where $\delta$ is the gap between the bound and true log-likelihood. As the bound tightens, variance remains controlled.</li>
<li><strong>Computational trick</strong>: In the basic IWAE implementation, both forward and backward passes must be done independently for each of the $k$ samples, so the cost scales linearly with $k$. However, the authors describe an optional optimization: stochastically approximate the gradient sum by sampling a single $\epsilon_i$ proportional to its normalized weight $\tilde{w}_i$, then computing only that one backward pass. This reduces the cost to $k$ forward passes and one backward pass. Since the backward pass costs roughly twice the forward pass, this yields approximately a 3x speedup for large $k$ at the cost of increased gradient variance.</li>
</ul>
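<p>The annealing schedule above can be reproduced in a few lines (the per-stage interpretation below is ours, chosen to match the stated 3,280 total passes):</p>

```python
# Annealed learning-rate schedule from Burda et al.: run 3^i epochs at
# rate 0.001 * 10^(-i/7), for i = 0..7.
stages = [(3 ** i, 0.001 * 10 ** (-i / 7)) for i in range(8)]

total_epochs = sum(epochs for epochs, _ in stages)
assert total_epochs == 3280  # matches the "3,280 passes over the data"

# Rates decay from 1e-3 at the first stage down to 1e-4 at the last.
assert abs(stages[0][1] - 1e-3) < 1e-12
assert abs(stages[-1][1] - 1e-4) < 1e-12
```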
<p><strong>Relationship to Reweighted Wake-Sleep (RWS):</strong> Both IWAE and Reweighted Wake-Sleep (Bornschein &amp; Bengio, 2015) use importance-weighted samples and have closely related generative model updates. The key difference is that IWAE derives a single unified lower bound $\mathcal{L}_k$ and uses the reparameterization trick to train the recognition network jointly. RWS instead uses separate wake and sleep phases for the recognition network, which are not derived from $\mathcal{L}_k$.</p>
<h3 id="evaluation">Evaluation</h3>
<ol>
<li><strong>Test Log-Likelihood</strong>: Primary measure of generative performance, estimated as the mean of $\mathcal{L}_{5000}$ (5000 samples) on the test set.</li>
<li><strong>Active Units</strong>: To quantify latent space richness, the authors measured &ldquo;active&rdquo; latent dimensions. A unit $u$ was defined as active if its activity statistic $A_{u}=\text{Cov}_{x}(\mathbb{E}_{u\sim q(u|x)}[u])$ exceeded $10^{-2}$. The $10^{-2}$ threshold is justified by a bimodal distribution of the log activity statistic, showing clear separation between active and inactive units.</li>
</ol>
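<p>The activity statistic is simply the variance, across the dataset, of each unit&rsquo;s posterior mean. A sketch with made-up posterior means (not data from the paper):</p>

```python
# Rows: datapoints; columns: latent units. Entries are posterior means E_q[u|x].
# Unit 0 varies with the input (active); unit 1 is nearly constant (inactive).
posterior_means = [
    [1.0, 0.500],
    [-1.0, 0.501],
    [0.5, 0.499],
    [-0.5, 0.500],
]

def activity(col):
    # A_u = Cov_x(E_q[u|x]): population variance of the posterior mean.
    vals = [row[col] for row in posterior_means]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

active = [u for u in range(2) if activity(u) > 1e-2]
assert active == [0]  # only unit 0 exceeds the 1e-2 threshold
```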
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Hardware</strong>: GPU-based implementation using mini-batch replication to parallelize the $k$ samples. Specific GPU type and training times are not reported.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/yburda/iwae">yburda/iwae</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official Theano implementation for MNIST and Omniglot</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Burda, Y., Grosse, R., &amp; Salakhutdinov, R. (2016). Importance Weighted Autoencoders. <em>International Conference on Learning Representations (ICLR) 2016</em>. <a href="https://arxiv.org/abs/1509.00519">https://arxiv.org/abs/1509.00519</a></p>
<p><strong>Publication</strong>: ICLR 2016</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{burda2016importance,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Importance Weighted Autoencoders}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yuri Burda and Roger Grosse and Ruslan Salakhutdinov}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/1509.00519}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/1509.00519">ArXiv</a></li>
</ul>
]]></content:encoded></item><item><title>Auto-Encoding Variational Bayes: VAE Paper Summary</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/autoencoding-variational-bayes/</link><pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/autoencoding-variational-bayes/</guid><description>Summary of Kingma &amp; Welling's 2013 VAE paper introducing the reparameterization trick and variational autoencoders.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that introduces a generative mechanism (the VAE) and an optimization technique (the reparameterization trick), with formal theoretical derivation. The method, called the Auto-Encoding VB (AEVB) algorithm, leads to what we now know as the <strong>variational auto-encoder (VAE)</strong> when neural networks are used as the recognition model.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The authors address two central intractabilities in directed probabilistic models with continuous latent variables:</p>















<figure class="post-figure center ">
    <img src="/img/notes/autoencoding-variational-bayes-figure-1-model-diagram.webp"
         alt="VAE graphical model showing latent variable z, observed variable x, and parameters phi and theta"
         title="VAE graphical model showing latent variable z, observed variable x, and parameters phi and theta"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Figure 1 from the paper: The directed graphical model. Solid lines denote the generative model $p_\theta(z)p_\theta(x|z)$, dashed lines denote the variational approximation $q_\phi(z|x)$. The variational parameters $\phi$ are learned jointly with the generative parameters $\theta$.</figcaption>
    
</figure>

<ol>
<li>
<p><strong>Intractable Posteriors</strong>: In models with continuous latent variables (like those with non-linear hidden layers), the true posterior $p_{\theta}(z|x)$ cannot be calculated analytically, preventing the use of standard EM algorithms.</p>
</li>
<li>
<p><strong>Large Datasets</strong>: Sampling-based solutions like Monte Carlo EM (MCEM) require expensive sampling loops per datapoint. This makes them too slow for large datasets where batch optimization is too costly and efficient minibatch updates are required.</p>
</li>
</ol>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<h3 id="the-reparameterization-trick-sgvb-estimator">The Reparameterization Trick (SGVB Estimator)</h3>
<p>The core innovation is the <strong>Stochastic Gradient Variational Bayes (SGVB)</strong> estimator. The authors avoid the high variance of naive Monte Carlo gradient estimators by &ldquo;reparameterizing&rdquo; the random variable $\tilde{z}$.</p>
<p>They express $z$ as a deterministic function of the input $x$ and an auxiliary noise variable $\epsilon$:</p>
<p>$$\tilde{z} = g_{\phi}(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon)$$</p>















<figure class="post-figure center ">
    <img src="/img/notes/variational-autoencoder-reparameterization-trick.webp"
         alt="Comparison of standard stochastic node vs reparameterization trick showing gradient flow"
         title="Comparison of standard stochastic node vs reparameterization trick showing gradient flow"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The reparameterization trick. (A) Standard stochastic nodes block gradient flow during backpropagation. (B) By expressing $z = \mu + \sigma \odot \epsilon$ with external noise $\epsilon \sim \mathcal{N}(0,1)$, gradients can flow through the deterministic path to the parameters $\phi$.</figcaption>
    
</figure>

<ul>
<li><strong>Mechanism</strong>: For a Gaussian posterior, $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.</li>
<li><strong>Impact</strong>: This makes the Monte Carlo estimate differentiable with respect to the variational parameters $\phi$, allowing the variational lower bound to be optimized via standard stochastic gradient ascent (like SGD or Adagrad).</li>
</ul>
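<p>A toy check of why this matters (our own example, not from the paper): to differentiate $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]$ with respect to $\mu$, write $z=\mu+\sigma\epsilon$ and push the gradient through the deterministic path. The analytic answer is $2\mu$:</p>

```python
import random

random.seed(0)
mu, sigma = 0.5, 1.0

# Reparameterize: z = mu + sigma * eps with eps ~ N(0, 1), so
# d/d_mu E[z^2] = E[d/d_mu (mu + sigma*eps)^2] = E[2 * (mu + sigma*eps)].
n = 100_000
grad_est = sum(2 * (mu + sigma * random.gauss(0, 1)) for _ in range(n)) / n

# Monte Carlo estimate matches the analytic gradient 2*mu = 1.0.
assert abs(grad_est - 2 * mu) < 0.05
```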
<h3 id="the-aevb-algorithm-the-vae">The AEVB Algorithm (The VAE)</h3>
<p>The <strong>Auto-Encoding VB (AEVB)</strong> algorithm amortizes inference by learning a global recognition model (encoder) $q_{\phi}(z|x)$ jointly with the generative model (decoder) $p_{\theta}(x|z)$.</p>
<p><strong>Objective Function</strong>: Maximize the variational lower bound $\mathcal{L}(\theta, \phi; x^{(i)})$:</p>
<p>$$\mathcal{L} \simeq -D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z)) + \frac{1}{L} \sum_{l=1}^L \log p_\theta(x^{(i)}|z^{(i,l)})$$</p>
<ul>
<li><strong>First Term (Regularizer)</strong>: Forces the approximate posterior to match the prior (integrated analytically for Gaussians).</li>
<li><strong>Second Term (Reconstruction Error)</strong>: The expected negative reconstruction error (estimated via sampling).</li>
</ul>
<p>This mirrors the standard auto-encoder objective, adding a variational regularizer.</p>
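<p>For a Gaussian encoder and standard-normal prior, the regularizer has the closed form from Appendix B, $-D_{KL} = \frac{1}{2}\sum_j \left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$. A sketch (ours) verifying the one-dimensional case against a Monte Carlo estimate:</p>

```python
import math
import random

random.seed(0)
mu, logvar = 0.5, math.log(0.8 ** 2)  # one latent dimension, sigma = 0.8
sigma = math.exp(0.5 * logvar)

# Closed-form KL(q || p) for q = N(mu, sigma^2), p = N(0, 1)  (Appendix B).
kl_closed = -0.5 * (1 + logvar - mu ** 2 - sigma ** 2)

def log_normal(z, m, s):
    return -0.5 * math.log(2 * math.pi * s ** 2) - (z - m) ** 2 / (2 * s ** 2)

# Monte Carlo: E_q[log q(z) - log p(z)] with z = mu + sigma * eps.
n = 200_000
kl_mc = sum(
    log_normal(z, mu, sigma) - log_normal(z, 0.0, 1.0)
    for z in (mu + sigma * random.gauss(0, 1) for _ in range(n))
) / n

assert abs(kl_closed - kl_mc) < 0.01  # estimates agree
```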
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The method was benchmarked against the <strong>Wake-Sleep</strong> algorithm and <strong>Monte Carlo EM (MCEM)</strong> using the <strong>MNIST</strong> (digits) and <strong>Frey Face</strong> (continuous faces) datasets.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li>
<p><strong>Efficiency</strong>: AEVB converged faster and reached a better lower bound than Wake-Sleep (Figure 2). It scaled efficiently to the full MNIST dataset. MCEM&rsquo;s per-datapoint sampling cost made it impractical at full dataset scale, so comparisons were limited to small subsets (Figure 3).</p>
</li>
<li>
<p><strong>Regularization</strong>: The KL-divergence term provided a regularizing effect, preventing overfitting while increasing latent dimensions ($N_z$).</p>
</li>
<li>
<p><strong>Manifold Learning</strong>: The model successfully learned smooth 2D latent manifolds (visualized in Appendix A), grouping similar digits/faces together.</p>
</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Evaluation Data</strong>: For the marginal likelihood comparison (Figure 3), the paper used MNIST with $N_{\text{train}} = 100$ and $N_{\text{train}} = 5000$ to compare data efficiency (marginal log-likelihood vs. training samples seen) across algorithms. A smaller network (100 hidden units, 3 latent variables) was used for this comparison because the marginal likelihood estimator only works reliably in low-dimensional latent spaces.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Algorithm</strong>: Stochastic gradient ascent with <strong>Adagrad</strong> (global stepsizes chosen from $\{0.01, 0.02, 0.1\}$).</li>
<li><strong>Regularization</strong>: The objective included a weight decay term corresponding to a prior $p(\theta)=\mathcal{N}(0,I)$.</li>
<li><strong>Minibatches</strong>: Size $M=100$ with $L=1$ sample per datapoint.</li>
<li><strong>Initialization</strong>: Parameters sampled from $\mathcal{N}(0, 0.01)$.</li>
</ul>
<h3 id="models">Models</h3>
<p>The original VAE used simple Multi-Layered Perceptrons (MLPs):</p>
<ul>
<li><strong>Symmetry</strong>: The encoder and decoder were symmetric, having an equal number of hidden units.</li>
<li><strong>Hidden Units</strong>: 500 units for MNIST, 200 for Frey Face (to prevent overfitting on the smaller dataset).</li>
<li><strong>Activations</strong>: <strong>Tanh</strong> activation functions for the hidden layers.</li>
<li><strong>Latent Space</strong>: Experimented with $N_z$ ranging from 2 to 200.</li>
<li><strong>Outputs</strong>:
<ul>
<li><em>MNIST</em>: <strong>Bernoulli</strong> MLP (sigmoid output).</li>
<li><em>Frey Face</em>: <strong>Gaussian</strong> MLP, with means constrained to $(0,1)$ via sigmoid.</li>
</ul>
</li>
<li><strong>Encoder Architecture</strong>: For the Gaussian encoder, the mean $\mu$ and log-variance $\log(\sigma^2)$ are linear outputs from the shared hidden layer (they share the hidden layer weights and have separate output weights).</li>
<li><strong>Log-Variance</strong>: The encoder predicted $\log(\sigma^2)$ for numerical stability.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The paper distinguishes between two metrics:</p>
<ul>
<li><strong>Variational Lower Bound</strong>: Used as the training objective (what the model optimizes).</li>
<li><strong>Marginal Likelihood</strong>: Used for final evaluation (Figure 3). The true marginal likelihood $p_\theta(x)$ was estimated using an Importance Sampling estimator constructed from samples drawn via Hybrid Monte Carlo (HMC), as detailed in Appendix D. This estimator uses: $p_{\theta}(x^{(i)}) \simeq \left(\frac{1}{L}\sum_{l=1}^{L} \frac{q(z^{(l)})}{p_\theta(z^{(l)})\,p_\theta(x^{(i)}|z^{(l)})}\right)^{-1}$.</li>
</ul>
<p>This distinction is critical: the training metric (lower bound) differs from the evaluation metric (estimated marginal likelihood).</p>
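<p>A toy instance of this estimator (our own conjugate-Gaussian example, not the paper&rsquo;s HMC setup): with prior $p(z)=\mathcal{N}(0,1)$ and likelihood $p(x|z)=\mathcal{N}(z,1)$, the marginal is $p(x)=\mathcal{N}(0,2)$, and if $q$ equals the exact posterior $\mathcal{N}(x/2, 1/2)$ the estimator has zero variance:</p>

```python
import math
import random

random.seed(0)

def normal_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 1.0
# Exact posterior for this conjugate model: N(x/2, 1/2).
post_mean, post_var = x / 2, 0.5

L = 100
ratios = []
for _ in range(L):
    z = random.gauss(post_mean, math.sqrt(post_var))
    q = normal_pdf(z, post_mean, post_var)
    ratios.append(q / (normal_pdf(z, 0.0, 1.0) * normal_pdf(x, z, 1.0)))

p_x_est = 1.0 / (sum(ratios) / L)   # inverse of the averaged ratio
p_x_true = normal_pdf(x, 0.0, 2.0)  # analytic marginal N(0, 2)

# With q equal to the true posterior, every ratio is exactly 1/p(x).
assert abs(p_x_est - p_x_true) < 1e-9
```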
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Hardware</strong>: Trained on a standard Intel Xeon CPU (approx. 40 GFLOPS); no GPUs were used.</li>
<li><strong>Training Time</strong>: Approximately 20-40 minutes per million training samples.</li>
</ul>
<h3 id="key-implementation-details-from-appendices">Key Implementation Details from Appendices</h3>
<ul>
<li><strong>Appendix A</strong>: Visualizations of 2D latent manifolds learned for MNIST and Frey Face datasets.</li>
<li><strong>Appendix B</strong>: Closed-form solution for the KL divergence of two Gaussians, essential for implementing the efficient version of the estimator (Equation 10).</li>
<li><strong>Appendix C</strong>: Exact MLP equations, including the use of tanh hidden layers and specific output layers for Bernoulli vs. Gaussian data. Includes specifications for <strong>Bernoulli MLPs</strong> (binary data) and <strong>Gaussian MLPs</strong> (real-valued data).</li>
<li><strong>Appendix D</strong>: Marginal likelihood estimation protocol using Hybrid Monte Carlo (HMC) and importance sampling for evaluation (Figure 3).</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Diederik P. Kingma and Max Welling. &ldquo;Auto-Encoding Variational Bayes.&rdquo; arXiv:1312.6114 [stat.ML], 2013. <a href="https://doi.org/10.48550/arXiv.1312.6114">https://doi.org/10.48550/arXiv.1312.6114</a></p>
<p><strong>Publication</strong>: ICLR 2014 (arXiv preprint December 2013)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{kingma2022autoencodingvariationalbayes,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Auto-Encoding Variational Bayes}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Diederik P Kingma and Max Welling}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{1312.6114}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{stat.ML}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/1312.6114}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Variational_autoencoder">Wikipedia: Variational Autoencoder</a> - General overview</li>
<li><a href="https://openreview.net/forum?id=33X9fd2-9FyZd">OpenReview</a> - Original peer review with author responses</li>
<li><a href="/posts/modern-variational-autoencoder-in-pytorch/">Modern VAE in PyTorch</a> - Implementation tutorial on this site</li>
</ul>
]]></content:encoded></item><item><title>SMILES Notation: The Original Paper by Weininger (1988)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</guid><description>Weininger's 1988 paper introducing SMILES notation, a string-based molecular representation that became a standard in computational chemistry.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. <em>Journal of Chemical Information and Computer Sciences</em>, 28(1), 31-36. <a href="https://doi.org/10.1021/ci00057a005">https://doi.org/10.1021/ci00057a005</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1988</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation overview</a> - Modern usage summary</li>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES to 2D images</a> - Practical visualization tutorial</li>
</ul>
<h2 id="core-contribution-a-string-based-molecular-notation">Core Contribution: A String-Based Molecular Notation</h2>
<p>This is a <strong>Method</strong> paper that introduces a novel notation system for representing chemical structures as text strings. It establishes the encoding rules and input conventions for SMILES (Simplified Molecular Input Line Entry System), while explicitly deferring the canonicalization algorithm to subsequent papers in the series.</p>
<h2 id="the-computational-complexity-of-chemical-information-in-the-1980s">The Computational Complexity of Chemical Information in the 1980s</h2>
<p>As computers became central to chemical information processing in the 1980s, the field faced a fundamental problem: existing line notations were either too complex for chemists to use practically or too limited for computational applications. Previous systems required extensive training to write correctly and were prone to errors.</p>
<p>The goal was ambitious: create a system that could represent any molecule as a simple text string, making it both human-readable and machine-efficient. This would enable compact database storage, fast processing, and easy exchange between software systems.</p>
<h2 id="separating-input-rules-from-canonicalization">Separating Input Rules from Canonicalization</h2>
<p>Weininger&rsquo;s key insight was to separate the problem into two parts: create simple, flexible rules that chemists could easily learn for input, while deferring to the computer the complex task of generating a unique, canonical representation. This division of labor made SMILES both practical and powerful.</p>
<p>The specific innovations include:</p>
<ol>
<li><strong>Simple input rules</strong> - Chemists could write molecules intuitively (e.g., <code>CCO</code> or <code>OCC</code> for ethanol)</li>
<li><strong>Ring closure notation</strong> - Breaking one bond and marking ends with matching digits</li>
<li><strong>Implicit hydrogens</strong> - Automatic calculation based on standard valences keeps strings compact</li>
<li><strong>Algorithmic aromaticity detection</strong> - Automatic recognition of aromatic systems from Kekulé structures</li>
<li><strong>Human-readable output</strong> - Unlike binary formats, SMILES strings are readable and debuggable</li>
</ol>
<p><strong>Important scope note</strong>: This first paper in the series establishes the input syntax and encoding rules. The canonicalization algorithm (how to generate unique SMILES) is explicitly stated as the subject of following papers: &ldquo;specification of isomerisms, substructures, and unique SMILES generation are the subjects of following papers.&rdquo;</p>
<h2 id="demonstrating-notation-rules-across-molecular-classes">Demonstrating Notation Rules Across Molecular Classes</h2>
<p>The paper is primarily a specification document establishing notation rules. The methodology is demonstrated through worked examples showing how to encode various molecular structures:</p>
<ul>
<li><strong>Basic molecules</strong>: Ethane (<code>CC</code>), ethylene (<code>C=C</code>), acetylene (<code>C#C</code>)</li>
<li><strong>Branches</strong>: Isobutyric acid (<code>CC(C)C(=O)O</code>)</li>
<li><strong>Rings</strong>: Cyclohexane (<code>C1CCCCC1</code>), benzene (<code>c1ccccc1</code>)</li>
<li><strong>Aromatic systems</strong>: Tropone (<code>O=c1cccccc1</code>), quinone (showing exocyclic bond effects)</li>
<li><strong>Complex structures</strong>: Morphine (40 characters vs 1000-2000 for connection tables)</li>
<li><strong>Edge cases</strong>: Salts, isotopes, charged species, tautomers</li>
</ul>
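<p>The ring-closure convention can be checked mechanically: each single-digit ring-bond label must appear exactly twice in the string. A simplified validator (ours; it ignores bracket atoms, the <code>%nn</code> two-digit form, and reuse of a digit after it closes):</p>

```python
from collections import Counter

def ring_digits_paired(smiles: str) -> bool:
    """Check that every single-digit ring-closure label occurs exactly twice.

    Simplified sketch: assumes no bracket atoms, no %nn closures, and that
    each digit labels at most one ring bond.
    """
    counts = Counter(ch for ch in smiles if ch.isdigit())
    return all(n == 2 for n in counts.values())

assert ring_digits_paired("C1CCCCC1")     # cyclohexane: digit 1 opens and closes
assert ring_digits_paired("c1ccccc1")     # benzene (aromatic lowercase atoms)
assert ring_digits_paired("CC(C)C(=O)O")  # acyclic strings trivially pass
assert not ring_digits_paired("C1CCCCC")  # unclosed ring bond
```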
<p>Performance comparisons are mentioned qualitatively: SMILES processing was approximately 100 times faster than traditional connection table methods on the hardware of the era (1988), with dramatic reductions in storage space.</p>
<h2 id="performance-and-practical-viability">Performance and Practical Viability</h2>
<p>The paper successfully establishes SMILES as a practical notation system with several key outcomes:</p>
<p><strong>Practical benefits</strong>:</p>
<ul>
<li><strong>Compactness</strong>: 40 characters for morphine vs 1000-2000 for connection tables</li>
<li><strong>Speed</strong>: ~100x faster processing than traditional methods</li>
<li><strong>Accessibility</strong>: Simple enough for chemists to learn without extensive training</li>
<li><strong>Machine-friendly</strong>: Efficient parsing and string-based operations</li>
</ul>
<p><strong>Design principles validated</strong>:</p>
<ul>
<li>Separating user input from canonical representation makes the system both usable and rigorous</li>
<li>Implicit hydrogens reduce string length without loss of information</li>
<li>Ring closure notation with digit markers is more intuitive than complex graph syntax</li>
<li>Automatic aromaticity detection handles most cases correctly</li>
</ul>
<p><strong>Acknowledged limitations</strong>:</p>
<ul>
<li>Canonicalization algorithm not included in this paper</li>
<li>Stereochemistry handling deferred to subsequent papers</li>
<li>Some edge cases (like unusual valence states) require explicit specification</li>
</ul>
<p>The paper concludes by positioning SMILES as a foundation for database storage, substructure searching, and chemical informatics applications - a vision that proved accurate as SMILES became one of the most widely used molecular representations in computational chemistry.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To implement the method described in this paper, the following look-up tables and algorithms are required. <strong>Note</strong>: These details are critical for replication but are often glossed over in high-level summaries.</p>
<h3 id="1-the-valence-look-up-table">1. The Valence Look-Up Table</h3>
<p>To calculate implicit hydrogens, the system assumes the &ldquo;lowest normal valence&rdquo; greater than or equal to the explicit bond count. The paper explicitly defines these valences:</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Allowed Valences</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>B</td>
          <td>3</td>
      </tr>
      <tr>
          <td>C</td>
          <td>4</td>
      </tr>
      <tr>
          <td>N</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>O</td>
          <td>2</td>
      </tr>
      <tr>
          <td>P</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>S (aliphatic)</td>
          <td>2, 4, 6</td>
      </tr>
      <tr>
          <td>S (aromatic)</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>F, Cl, Br, I</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p><strong>Example</strong>: For sulfur in $\text{H}_2\text{SO}_4$ written as <code>OS(=O)(=O)O</code>, the explicit bond count is 6 (two single bonds + two double bonds to four oxygens), so the system uses valence 6 with zero implicit hydrogens. Without knowing S allows valence 6, the algorithm would fail.</p>
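The "lowest normal valence" rule can be sketched directly from the table. This is a toy illustration only: it assumes the explicit bond-order sum has already been parsed from the string, and the table covers just the organic-subset elements listed above (aliphatic sulfur shown; aromatic sulfur would need its own entry).

```python
# Toy implicit-hydrogen calculator using the paper's valence table.
# A sketch only: it takes an element symbol and the explicit bond-order
# sum already parsed from the SMILES string, not a full parser.

VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),          # aliphatic S
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(element: str, explicit_bonds: int) -> int:
    """Return implicit H count: lowest normal valence >= explicit bond count."""
    for valence in VALENCES[element]:
        if valence >= explicit_bonds:
            return valence - explicit_bonds
    return 0  # hypervalent beyond the table: no implicit hydrogens

# C written bare has 0 explicit bonds -> methane, 4 implicit H
print(implicit_hydrogens("C", 0))   # 4
# S in OS(=O)(=O)O has bond-order sum 1+2+2+1 = 6 -> valence 6, 0 implicit H
print(implicit_hydrogens("S", 6))   # 0
# N with one single bond -> lowest valence 3, so 2 implicit H (CN = methylamine)
print(implicit_hydrogens("N", 1))   # 2
```

Note how nitrogen's two allowed valences matter: an N with four explicit bonds skips valence 3 and lands on valence 5, yielding one implicit hydrogen.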
<h3 id="2-explicit-hydrogen-requirements">2. Explicit Hydrogen Requirements</h3>
<p>The paper lists exactly three cases where hydrogen atoms are retained (not suppressed):</p>
<ol>
<li><strong>Hydrogen connected to other hydrogen</strong> (molecular hydrogen, $\text{H}_2$, written as <code>[H][H]</code>)</li>
<li><strong>Hydrogen connected to zero or more than one other atom</strong> (bridging hydrogens, isolated protons)</li>
<li><strong>Isotopic hydrogen specifications</strong> in isomeric SMILES (deuterium <code>[2H]</code>, tritium <code>[3H]</code>)</li>
</ol>
<p>For all other cases, hydrogens are implicit and calculated from the valence table.</p>
<h3 id="3-ring-closure-notation">3. Ring Closure Notation</h3>
<p>Standard SMILES supports single digits <code>1-9</code> for ring closures. For rings numbered 10 and higher, the notation requires a <strong>percent sign prefix</strong>:</p>
<ul>
<li>Ring closures 1-9: <code>C1CCCCC1</code></li>
<li>Ring closures 10+: <code>C%10CCCCC%10</code>, <code>C2%13%24</code> (ring 2, ring 13, ring 24)</li>
</ul>
<p>Without this rule, a parser would fail on large polycyclic structures.</p>
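The digit-and-percent rule is simple enough to sketch as a scanner. This toy pairs each ring-closure label with its matching reopening and ignores every other character; a real parser would of course also track atoms and bond symbols.

```python
# Sketch of a ring-closure token scanner covering both single digits and
# the %NN two-digit form. It pairs openings with closings by label;
# everything else in the string is skipped for this illustration.

def ring_bond_pairs(smiles: str):
    """Return ring-closure labels paired as (open_pos, close_pos)."""
    open_at = {}    # label -> character position of first occurrence
    pairs = []
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        start = i
        if ch == "%":                       # %NN: two-digit ring label
            label = smiles[i + 1:i + 3]
            i += 3
        elif ch.isdigit():                  # single-digit label 1-9
            label = ch
            i += 1
        else:
            i += 1
            continue
        if label in open_at:                # second occurrence closes the ring
            pairs.append((open_at.pop(label), start))
        else:
            open_at[label] = start
    return pairs

print(ring_bond_pairs("C1CCCCC1"))           # [(1, 7)]  one six-membered ring
print(len(ring_bond_pairs("C%10CCCCC%10")))  # 1
```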
<h3 id="4-aromaticity-detection-algorithm">4. Aromaticity Detection Algorithm</h3>
<p>The system uses an extended version of Hückel&rsquo;s Rule ($4N+2$ π-electrons). The &ldquo;excess electron&rdquo; count for the aromatic system is determined by these rules:</p>
<p><strong>Carbon contribution</strong>:</p>
<ul>
<li><strong>C in aromatic ring</strong>: Contributes 1 electron</li>
<li><strong>C double-bonded to exocyclic electronegative atom</strong> (e.g., $\text{C}=\text{O}$ in quinone): Contributes 0 electrons (the carbon &ldquo;loses&rdquo; its electron to the oxygen)</li>
</ul>
<p><strong>Heteroatom contribution</strong>:</p>
<ul>
<li><strong>O, S in ring</strong>: Contributes 2 electrons (lone pair)</li>
<li><strong>N in ring</strong>: Contributes 1 electron (pyridine-like) or 2 electrons (pyrrole-like, must have explicit hydrogen <code>[nH]</code>)</li>
</ul>
<p><strong>Charge effects</strong>:</p>
<ul>
<li><strong>Positive charge</strong>: Reduces electron count by 1</li>
<li><strong>Negative charge</strong>: Increases electron count by 1</li>
</ul>
<p><strong>Critical example - Quinone</strong>:</p>
<pre tabindex="0"><code>O=C1C=CC(=O)C=C1
</code></pre><p>Quinone has 6 carbons in the ring, but the two carbons bonded to exocyclic oxygens contribute 0 electrons each. The four remaining carbons contribute 4 electrons total (not 6), so quinone is <strong>not aromatic</strong> by this algorithm. This exocyclic bond rule is essential for correct aromaticity detection.</p>
<p><strong>Aromatic ring test</strong>:</p>
<ol>
<li>All atoms must be sp² hybridized</li>
<li>Count excess electrons using the rules above</li>
<li>Check whether the count satisfies Hückel&rsquo;s parity constraint:
$$ \text{Excess Electrons} \equiv 2 \pmod 4 \iff \text{Excess Electrons} = 4N + 2 $$
If the count satisfies this for some non-negative integer $N$, the ring is aromatic.</li>
</ol>
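The counting rules above can be condensed into a small table-driven check. This is a toy version: atoms are labeled with made-up "kind" tags (<code>ring</code>, <code>exo_electroneg</code>, <code>pyridine</code>, <code>pyrrole</code>), and the sp² test is assumed to have passed already; it illustrates the arithmetic, not a real perception algorithm.

```python
# Toy excess-electron counter for the extended Hückel test described above.
# Atoms are (symbol, kind) tuples using invented kind tags; this illustrates
# the counting rules only, not a real aromaticity-perception algorithm.

CONTRIB = {
    ("C", "ring"): 1,            # plain ring carbon
    ("C", "exo_electroneg"): 0,  # C=O style exocyclic double bond
    ("O", "ring"): 2,            # lone-pair donors
    ("S", "ring"): 2,
    ("N", "pyridine"): 1,
    ("N", "pyrrole"): 2,         # [nH]-type nitrogen
}

def is_aromatic(atoms, charge=0):
    """Hückel 4N+2 test on the excess electrons (all atoms assumed sp2)."""
    electrons = sum(CONTRIB[a] for a in atoms) - charge  # + charge removes 1
    return electrons >= 2 and electrons % 4 == 2

benzene = [("C", "ring")] * 6
print(is_aromatic(benzene))          # True  (6 electrons)

quinone = [("C", "ring")] * 4 + [("C", "exo_electroneg")] * 2
print(is_aromatic(quinone))          # False (4 electrons)

pyrrole = [("C", "ring")] * 4 + [("N", "pyrrole")]
print(is_aromatic(pyrrole))          # True  (6 electrons)
```

The charge rule also falls out: cyclopentadienyl anion contributes 5 ring-carbon electrons plus 1 for the negative charge, giving 6 and hence aromaticity.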
<h2 id="encoding-rules-reference">Encoding Rules Reference</h2>
<p>The following sections provide a detailed reference for the six fundamental SMILES encoding rules. These are the rules a user would apply when writing SMILES strings.</p>
<h3 id="1-atoms">1. Atoms</h3>
<p>Atoms use their standard chemical symbols. Elements in the &ldquo;organic subset&rdquo; (B, C, N, O, P, S, F, Cl, Br, I) can be written directly when they have their most common valence - so <code>C</code> automatically means a carbon with enough implicit hydrogens to satisfy its valence.</p>
<p>Everything else goes in square brackets: <code>[Au]</code> for gold, <code>[NH4+]</code> for ammonium ion, or <code>[13C]</code> for carbon-13. Aromatic atoms get lowercase letters: <code>c</code> for aromatic carbon in benzene.</p>
<h3 id="2-bonds">2. Bonds</h3>
<p>Bond notation is straightforward:</p>
<ul>
<li><code>-</code> for single bonds (usually omitted)</li>
<li><code>=</code> for double bonds</li>
<li><code>#</code> for triple bonds</li>
<li><code>:</code> for aromatic bonds (also usually omitted)</li>
</ul>
<p>So <code>CC</code> and <code>C-C</code> both represent ethane, while <code>C=C</code> is ethylene.</p>
<h3 id="3-branches">3. Branches</h3>
<p>Branches use parentheses, just like in mathematical expressions. Isobutyric acid becomes <code>CC(C)C(=O)O</code>: the main chain reads <code>C-C-C(=O)O</code>, with a methyl <code>(C)</code> branch on the second carbon.</p>
<h3 id="4-rings">4. Rings</h3>
<p>This is where SMILES gets clever. You break one bond and mark both ends with the same digit. Cyclohexane becomes <code>C1CCCCC1</code> - the <code>1</code> connects the first and last carbon, closing the ring.</p>
<p>You can reuse digits for different rings in the same molecule, making complex structures manageable.</p>
<h3 id="5-disconnected-parts">5. Disconnected Parts</h3>
<p>Salts and other disconnected structures use periods. Sodium phenoxide: <code>[Na+].[O-]c1ccccc1</code>. The order doesn&rsquo;t matter - you&rsquo;re just listing the separate components.</p>
<h3 id="6-aromaticity">6. Aromaticity</h3>
<p>Aromatic rings can be written directly with lowercase letters. Benzoic acid becomes <code>c1ccccc1C(=O)O</code>. The system can also detect aromaticity automatically from Kekulé structures, so <code>C1=CC=CC=C1C(=O)O</code> works just as well.</p>
<h3 id="simplified-subset-for-organic-chemistry">Simplified Subset for Organic Chemistry</h3>
<p>Weininger recognized that most chemists work primarily with organic compounds, so he defined a simplified subset that covers the vast majority of cases. For organic molecules, you only need four rules:</p>
<ol>
<li><strong>Atoms</strong>: Use standard symbols (C, N, O, etc.)</li>
<li><strong>Multiple bonds</strong>: Use <code>=</code> and <code>#</code> for double and triple bonds</li>
<li><strong>Branches</strong>: Use parentheses <code>()</code></li>
<li><strong>Rings</strong>: Use matching digits</li>
</ol>
<p>This &ldquo;basic SMILES&rdquo; covers the vast majority of organic compounds, making the system immediately accessible without having to learn all the edge cases.</p>
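The four rules are small enough that a minimal parser fits in a few lines. The following sketch handles only this subset, single-letter organic atoms, <code>=</code>/<code>#</code> bonds, parenthesized branches, and digit ring closures, and returns the molecular graph as bond triples; brackets, aromatics, and two-letter elements are deliberately out of scope.

```python
# Minimal parser for the four-rule "basic SMILES" subset: single-letter
# atoms, = and # bonds, () branches, digit ring closures. Returns atoms
# plus bonds as (atom_i, atom_j, order) triples. A sketch only.

def parse_basic_smiles(s):
    atoms, bonds = [], []
    stack, prev, order, rings = [], None, 1, {}
    for ch in s:
        if ch == "(":
            stack.append(prev)          # remember branch point
        elif ch == ")":
            prev = stack.pop()          # return to branch point
        elif ch == "=":
            order = 2
        elif ch == "#":
            order = 3
        elif ch.isdigit():
            if ch in rings:             # close ring back to opening atom
                bonds.append((rings.pop(ch), prev, order))
            else:
                rings[ch] = prev
            order = 1
        else:                           # an organic-subset atom
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, order))
            prev, order = idx, 1
    return atoms, bonds

atoms, bonds = parse_basic_smiles("CC(C)C(=O)O")   # isobutyric acid
print(len(atoms), len(bonds))                      # 6 5
```

Run on <code>C1CCCCC1</code> it yields 6 atoms and 6 bonds, the extra bond being the ring closure.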
<h2 id="design-decisions-and-edge-cases">Design Decisions and Edge Cases</h2>
<p>Beyond the basic rules, the paper established several important conventions for handling ambiguous cases:</p>
<h3 id="hydrogen-handling">Hydrogen Handling</h3>
<p>Hydrogens are usually implicit - the system calculates how many each atom needs based on standard valences. So <code>C</code> represents CH₄, <code>N</code> represents NH₃, and so on. This keeps strings compact and readable.</p>
<p>Explicit hydrogens only appear in special cases: when hydrogen connects to multiple atoms, when you need to specify an exact count, or in isotopic specifications like <code>[2H]</code> for deuterium.</p>
<h3 id="bond-representation">Bond Representation</h3>
<p>The paper made an important choice about how to represent bonds in ambiguous cases. For example, nitromethane could be written as charge-separated <code>C[N+](=O)[O-]</code> or with covalent double bonds <code>CN(=O)=O</code>. Weininger chose to prefer the covalent form when possible, because it preserves the correct topological symmetry.</p>
<p>However, when covalent representation would require unusual valences, charge separation is preferred. Diazomethane becomes <code>C=[N+]=[N-]</code> to avoid forcing carbon into an unrealistic valence state.</p>
<h3 id="tautomers">Tautomers</h3>
<p>SMILES doesn&rsquo;t try to be too clever about tautomers - it represents exactly what you specify. So 2-pyridone can be written as either the enol form <code>Oc1ncccc1</code> or the keto form <code>O=c1[nH]cccc1</code>. The system won&rsquo;t automatically convert between them.</p>
<p>This explicit approach means you need to decide which tautomeric form to represent, but it also means the notation precisely captures what you intend.</p>
<h3 id="aromaticity-detection">Aromaticity Detection</h3>
<p>One of the most sophisticated parts of the original system was automatic aromaticity detection. The algorithm uses an extended Hückel rule: a ring is aromatic if all atoms are sp² hybridized and it contains 4N+2 π-electrons.</p>
<p>This means you can input benzene as the Kekulé structure <code>C1=CC=CC=C1</code> and the system will automatically recognize it as aromatic and convert it to <code>c1ccccc1</code>. The algorithm handles complex cases like tropone (<code>O=c1cccccc1</code>) and correctly identifies them as aromatic.</p>
<h3 id="aromatic-nitrogen">Aromatic Nitrogen</h3>
<p>The system makes an important distinction for nitrogen in aromatic rings. Pyridine-type nitrogen (like in pyridine itself) is written as <code>n</code> and has no attached hydrogens. Pyrrole-type nitrogen has an attached hydrogen that must be specified explicitly: <code>[nH]1cccc1</code> for pyrrole.</p>
<p>This distinction captures the fundamental difference in electron contribution between these two nitrogen types in aromatic systems.</p>
<h2 id="impact-and-legacy">Impact and Legacy</h2>
<p>Nearly four decades later, SMILES remains one of the most widely used molecular notations in computational chemistry. The notation became the foundation for:</p>
<ul>
<li><strong>Database storage</strong> - Compact, searchable molecular representations</li>
<li><strong>Substructure searching</strong> - Pattern matching in chemical databases</li>
<li><strong>Property prediction</strong> - Input format for QSAR models</li>
<li><strong>Chemical informatics</strong> - Standard exchange format between software</li>
<li><strong>Modern ML</strong> - Text-based representation for neural networks</li>
</ul>
<p>While newer approaches like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> have addressed some limitations (like the possibility of invalid strings), SMILES&rsquo; combination of simplicity and power has made it enduringly useful.</p>
<p>The paper established both a notation system and a design philosophy: chemical informatics tools should be powerful enough for computers while remaining accessible to working chemists. That balance remains relevant today as we develop new molecular representations for machine learning and AI applications.</p>
]]></content:encoded></item><item><title>SELFIES: The Original Paper on Robust Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</guid><description>The 2020 paper introducing SELFIES, the 100% robust molecular representation that solves SMILES validity problems in ML applications.</description><content:encoded><![CDATA[<h2 id="contribution-a-100-robust-representation-for-ml">Contribution: A 100% Robust Representation for ML</h2>
<p>This is a <strong>Method</strong> paper that introduces a new molecular string representation designed specifically for machine learning applications.</p>
<h2 id="motivation-the-invalidity-bottleneck">Motivation: The Invalidity Bottleneck</h2>
<p>When neural networks generate molecules using <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation</a>, a huge fraction of output strings are invalid: either syntax errors or chemically impossible structures. This was a fundamental bottleneck: if your generative model produces a large fraction of invalid molecules, you are wasting computational effort and severely limiting chemical space exploration.</p>
<h2 id="novelty-a-formal-grammar-approach">Novelty: A Formal Grammar Approach</h2>
<p>The authors&rsquo; key insight was using a <strong>formal grammar approach</strong> (specifically, a Chomsky type-2, context-free grammar with self-referencing functions) where each symbol is interpreted based on chemical context. The &ldquo;state of the derivation&rdquo; tracks available valence bonds, preventing impossible structures like a carbon with five single bonds.</p>
<p>For example, generating 2-Fluoroethenimine (<code>FC=C=N</code>) follows a state derivation where each step restricts the available valency for the next element:</p>
<p>$$
\mathbf{X}_0 \xrightarrow{[F]} \text{F } \mathbf{X}_1 \xrightarrow{[=C]} \text{FC } \mathbf{X}_3 \xrightarrow{[=C]} \text{FC=C } \mathbf{X}_2 \xrightarrow{[\#N]} \text{FC=C=N}
$$</p>
<p>Note that the final <code>[#N]</code> symbol decodes as a double bond here: state $\mathbf{X}_2$ leaves only two bonds available, so the same symbol is interpreted differently depending on the derivation state. This approach guarantees 100% validity: every SELFIES string corresponds to a valid molecule, and every valid molecule can be represented.</p>
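The state-tracking mechanism can be caricatured in a few lines. The sketch below is not the real SELFIES grammar or alphabet: each toy symbol is just an element plus a requested bond order, and the running state caps how many bonds the next atom may actually form, reproducing the derivation above.

```python
# Toy illustration of the derivation-state idea behind SELFIES. Each symbol
# carries an element and a requested bond order; the current state caps how
# many bonds the next atom may form. A simplified sketch of the mechanism,
# not the real SELFIES grammar.

MAX_BONDS = {"F": 1, "O": 2, "N": 3, "C": 4}
BOND_STR = {1: "", 2: "=", 3: "#"}

def decode(symbols):
    out, state = "", None      # state = bonds still available on last atom
    for element, requested in symbols:
        if state is None:                  # first atom: no incoming bond
            out += element
            state = MAX_BONDS[element]
        else:                              # cap the requested bond order
            order = min(requested, state, MAX_BONDS[element])
            out += BOND_STR[order] + element
            state = MAX_BONDS[element] - order
    return out

# [F][=C][=C][#N]: the final triple-bond request is capped to a double bond
# because the preceding carbon has only two bonds left.
print(decode([("F", 1), ("C", 2), ("C", 2), ("N", 3)]))  # FC=C=N
```

Because every symbol is reinterpreted against the current state, no sequence of these toy symbols can ever demand an impossible valence, which is the intuition behind the 100% validity guarantee.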
<h2 id="methodology--experiments-validating-robustness">Methodology &amp; Experiments: Validating Robustness</h2>
<p>The authors ran several experiments to demonstrate SELFIES&rsquo; robustness:</p>
<h3 id="random-mutation-test">Random Mutation Test</h3>
<p>They took the SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations of MDMA and introduced random changes:</p>
<ul>
<li><strong>SMILES</strong>: After just one random mutation, only 9.9% of strings remained valid (dropping to 1.1% after three mutations).</li>
<li><strong>SELFIES</strong>: 100% of mutated strings still represented valid molecules (though different from the original).</li>
</ul>
<p>This empirical difference demonstrates why SELFIES is well suited for evolutionary algorithms and genetic programming approaches to molecular design, where random mutations of strings are a core operation.</p>
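The mutation protocol itself is easy to reproduce in outline. Real validity checking requires a cheminformatics toolkit; the sketch below substitutes a crude syntactic proxy (balanced parentheses and paired ring digits), so its numbers illustrate the experimental procedure only, not the paper's reported rates.

```python
# Toy version of the mutation experiment. A crude syntax check stands in
# for full chemical validation, so results are illustrative of the
# protocol only, not of the paper's numbers.
import random

ALPHABET = "CNOF=#()123"   # made-up mutation alphabet for this sketch

def syntactically_plausible(s):
    depth = 0
    for ch in s:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:                      # ")" before "("
            return False
    digits = [ch for ch in s if ch.isdigit()]
    return depth == 0 and all(digits.count(d) % 2 == 0 for d in set(digits))

def mutate(s, rng):
    """Replace one random character with a random alphabet symbol."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]

rng = random.Random(0)
base = "CC(C)C(=O)O"
survivors = sum(
    syntactically_plausible(mutate(base, rng)) for _ in range(1000)
)
print(f"{survivors / 10:.1f}% pass the crude syntax check after one mutation")
```

Even this weak syntactic filter rejects many single-character mutants; a chemical validity check is stricter still, which is the fragility the SELFIES mutation test quantifies.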
<h3 id="generative-model-performance">Generative Model Performance</h3>
<p>The real test came with actual machine learning models. The authors trained Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) on both representations:</p>
<p><strong>VAE Results:</strong></p>
<ul>
<li>SMILES-based VAE: Large invalid regions scattered throughout the latent space</li>
<li>SELFIES-based VAE: Every point in the continuous latent space mapped to a valid molecule</li>
<li>The SELFIES model encoded <strong>over 100 times more diverse molecules</strong></li>
</ul>
<p><strong>GAN Results:</strong></p>
<ul>
<li>Best SMILES GAN: 18.6% diverse, valid molecules</li>
<li>Best SELFIES GAN: 78.9% diverse, valid molecules</li>
</ul>
<p><strong>Evaluation Metrics:</strong></p>
<ul>
<li><strong>Validity</strong>: Percentage of generated strings representing valid molecular structures</li>
<li><strong>Diversity</strong>: Number of unique valid molecules produced</li>
<li><strong>Reconstruction Accuracy</strong>: How well the autoencoder reproduced input molecules</li>
</ul>
<h3 id="scalability-test">Scalability Test</h3>
<p>The authors showed SELFIES works beyond toy molecules by successfully encoding and decoding all <strong>72 million molecules</strong> from the PubChem database (with fewer than 500 SMILES characters per molecule), demonstrating practical applicability to real chemical databases.</p>
<h2 id="results--conclusions-chemical-space-exploration">Results &amp; Conclusions: Chemical Space Exploration</h2>
<p><strong>Key Findings:</strong></p>
<ul>
<li>SELFIES achieves 100% validity guarantee: every string represents a valid molecule</li>
<li>SELFIES-based VAEs encode over 100x more diverse molecules than SMILES-based models</li>
<li>SELFIES-based GANs produce 78.9% diverse valid molecules vs. 18.6% for SMILES GANs</li>
<li>Successfully validated on all 72 million PubChem molecules</li>
</ul>
<p><strong>Limitations Acknowledged:</strong></p>
<ul>
<li>No standardization or canonicalization method at time of publication</li>
<li>The initial grammar covered only small biomolecules; extensions for stereochemistry, ions, polyvalency, and full periodic table coverage were planned</li>
<li>Requires community testing and adoption</li>
</ul>
<p><strong>Impact:</strong></p>
<p>This work demonstrated that designing ML-native molecular representations could enable new approaches in drug discovery and materials science. SELFIES was subsequently evaluated as an alternative input representation to SMILES in <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, a transformer pretrained on molecular strings for property prediction, where it performed comparably to SMILES on the Tox21 benchmark, though the comparison was limited to a single task.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The machine learning experiments used two distinct datasets:</p>
<ul>
<li><strong><a href="/notes/chemistry/datasets/qm9/">QM9</a></strong> (134k molecules): Primary training dataset for VAE and GAN models</li>
<li><strong>PubChem</strong> (72M molecules): Used only to test representation coverage and scalability; not used for model training</li>
</ul>
<h3 id="models">Models</h3>
<p>The VAE implementation included:</p>
<ul>
<li><strong>Latent space</strong>: 241-dimensional with Gaussian distributions</li>
<li><strong>Input encoding</strong>: One-hot encoding of SELFIES/SMILES strings</li>
<li>Full architectural details (encoder/decoder structures, layer types) provided in Supplementary Information</li>
</ul>
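One-hot encoding of the token sequence is the only input transformation named here, and it is simple to sketch. The token alphabet below is invented for illustration; the actual alphabet, maximum length, and padding scheme are in the paper's Supplementary Information.

```python
# Minimal one-hot encoding of a tokenized molecular string, as used for
# the VAE inputs. The alphabet here is made up for illustration.

def one_hot(tokens, alphabet, max_len):
    index = {tok: i for i, tok in enumerate(alphabet)}
    rows = []
    for t in range(max_len):
        row = [0] * len(alphabet)
        if t < len(tokens):
            row[index[tokens[t]]] = 1
        rows.append(row)               # all-zero rows act as padding
    return rows

alphabet = ["[C]", "[=C]", "[F]", "[#N]"]
enc = one_hot(["[F]", "[=C]", "[=C]", "[#N]"], alphabet, max_len=6)
print(len(enc), len(enc[0]))   # 6 4
print(enc[0])                  # [0, 0, 1, 0]
```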
<h3 id="algorithms">Algorithms</h3>
<p>The authors found GAN performance was highly sensitive to hyperparameter selection:</p>
<ul>
<li>Searched <strong>200 different hyperparameter configurations</strong> to achieve the reported 78.9% diversity</li>
<li>Specific optimizers, learning rates, and training duration detailed in Supplementary Information</li>
<li>Full rule generation algorithm provided in Table 2</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>All models evaluated on:</p>
<ul>
<li><strong>Validity rate</strong>: Percentage of syntactically and chemically valid outputs</li>
<li><strong>Diversity</strong>: Count of unique valid molecules generated</li>
<li><strong>Reconstruction accuracy</strong>: Fidelity of autoencoder reconstruction (VAEs only)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training performed on the SciNet supercomputing infrastructure.</li>
<li>The paper does not specify GPU types or training times.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation; has evolved significantly since the original paper</td>
      </tr>
  </tbody>
</table>
<h3 id="replication-resources">Replication Resources</h3>
<p>Complete technical replication is straightforward: the paper was published open access in <em>Machine Learning: Science and Technology</em>, and replication primarily requires:</p>
<ul>
<li>The full rule generation algorithm (Table 2 in paper)</li>
<li>Code: <a href="https://github.com/aspuru-guzik-group/selfies">https://github.com/aspuru-guzik-group/selfies</a></li>
<li>Supplementary Information for complete architectural and hyperparameter specifications</li>
</ul>
<p><strong>Note</strong>: The <a href="/notes/chemistry/molecular-representations/notations/selfies/">modern SELFIES library</a> has evolved significantly since this foundational paper, addressing many of the implementation challenges identified by the authors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024. <a href="https://doi.org/10.1088/2632-2153/aba947">https://doi.org/10.1088/2632-2153/aba947</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn_2020,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/aba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1088%2F2632-2153%2Faba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{{IOP} Publishing}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{045024}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Mario Krenn and Florian H{\&#34;{a}}se and AkshatKumar Nigam and Pascal Friederich and Alan Aspuru-Guzik}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Self-referencing embedded strings ({SELFIES}): A 100{\%} robust molecular string representation}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">Modern SELFIES Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>αExtractor: Chemical Info from Biomedical Literature</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</guid><description>αExtractor uses ResNet-Transformer to extract chemical structures from literature images, including noisy and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-a-robust-optical-recognition-system">Methodological Contribution: A Robust Optical Recognition System</h2>
<p>This is primarily a <strong>Method</strong> ($\Psi_{\text{Method}}$) paper with a significant secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contribution (see the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for more on these categories).</p>
<p>The dominant methodological contribution is a ResNet-Transformer recognition architecture that outperforms existing OCSR tools across multiple benchmarks through robustness engineering: training on 20 million synthetic images with aggressive augmentation to handle degraded image conditions. The work answers the core methodological question of &ldquo;how well does this work?&rdquo; through extensive benchmarking against existing OCSR tools and ablation studies validating the architectural choices.</p>
<p>The secondary resource contribution comes from releasing αExtractor as a freely available web service, correcting labeling errors in standard benchmarks (CLEF, UOB, JPO), and providing an end-to-end document processing pipeline for biomedical literature mining.</p>
<h2 id="motivation-extracting-visual-chemical-knowledge-from-biomedical-literature">Motivation: Extracting Visual Chemical Knowledge from Biomedical Literature</h2>
<p>The motivation addresses a familiar pain point in chemical informatics within a biomedical context. Vast amounts of chemical knowledge in biomedical literature exist only as images, such as molecular structures embedded in figures, chemical synthesis schemes, and compound diagrams. This visual knowledge remains effectively invisible to computational methods, which creates a massive bottleneck for drug discovery research, systematic reviews, and large-scale chemical database construction.</p>
<p>Existing OCSR tools face two critical problems when applied to biomedical literature:</p>
<ol>
<li>
<p><strong>Real-world image quality</strong>: Biomedical papers often contain low-resolution figures, images with complex backgrounds, noise from scanning/digitization, and inconsistent drawing styles across different journals and decades of publications.</p>
</li>
<li>
<p><strong>End-to-end extraction</strong>: Most OCSR systems assume the presence of clean, cropped molecular images. In practice, you need to first find the molecular structures within multi-panel figures, reaction schemes, and dense document layouts before you can recognize them.</p>
</li>
</ol>
<p>The authors argue that a practical literature mining system needs to solve both problems simultaneously via robust recognition under noisy conditions and automated detection of molecular images within complex documents.</p>
<h2 id="core-innovation-robust-resnet-transformer-architecture">Core Innovation: Robust ResNet-Transformer Architecture</h2>
<p>The core innovation lies in combining a competition-winning recognition architecture with extensive robustness engineering and end-to-end document processing. The key contributions include:</p>
<ol>
<li>
<p><strong>ResNet-Transformer Recognition Model</strong>: The core recognition system uses a <strong>Residual Neural Network (ResNet)</strong> encoder paired with a <strong>Transformer decoder</strong> in an image-captioning framework. This architecture won first place in a Kaggle molecular translation competition, which provided a strong foundation for the recognition task. Let the input image be $I$. The model maximizes the joint likelihood of the SMILES tokens $T$ and coordinate sequences $X, Y$:
$$
\begin{aligned}
\mathcal{L}_{\text{total}} = - \sum_{i=1}^{L} \log P(T_i \mid I, T_{&lt;i}) - \lambda \sum_{i=1}^{L} \big(\log P(X_i \mid I, X_{&lt;i}) + \log P(Y_i \mid I, Y_{&lt;i})\big)
\end{aligned}
$$
Here, the continuous $X$ and $Y$ atom coordinates are quantized into 200 discrete bins, turning coordinate prediction into a standard classification task alongside SMILES generation.</p>
</li>
<li>
<p><strong>Enhanced Molecular Representation</strong>: The model produces an augmented representation that encompasses:</p>
<ul>
<li>Standard molecular connectivity information</li>
<li><strong>Bond type tokens</strong> (solid wedge bonds, dashed bonds, etc.) that preserve 3D stereochemical information</li>
<li><strong>Atom coordinate predictions</strong> that allow reconstruction of the exact molecular pose from the original image</li>
</ul>
<p>This dual prediction of discrete structure and continuous coordinates keeps the output faithful to the source depiction and enables better quality assessment.</p>
</li>
<li>
<p><strong>Massive Synthetic Training Dataset</strong>: The model was trained on approximately <strong>20 million synthetic molecular images</strong> generated from PubChem SMILES with aggressive data augmentation. The augmentation strategy randomized visual styles, image quality, and rendering parameters to create maximum diversity, ensuring the network rarely saw the same molecular depiction twice. This forces the model to learn robust, style-invariant features.</p>
</li>
<li>
<p><strong>End-to-End Document Processing Pipeline</strong>: αExtractor integrates <strong>object detection</strong> and <strong>structure recognition</strong> into a complete document mining system:</p>
<ul>
<li>An object detection model automatically locates molecular images within PDF documents</li>
<li>The recognition model converts detected images to structured representations</li>
<li>A web service interface makes the entire pipeline accessible to researchers without machine learning expertise</li>
</ul>
</li>
<li>
<p><strong>Robustness-First Design</strong>: The system was explicitly designed to handle degraded image conditions that break traditional OCSR tools, including low resolution, background interference, color variations, and scanning artifacts commonly found in legacy biomedical literature.</p>
</li>
</ol>
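The coordinate discretization in point 1 can be sketched directly. The bin count of 200 matches the paper; the normalization to $[0, 1]$ and the helper names are illustrative assumptions.

```python
# Sketch of the coordinate discretization described above: continuous atom
# positions (assumed normalized to [0, 1]) are quantized into 200 bins so
# coordinate prediction becomes ordinary classification. The bin count
# matches the paper; everything else is illustrative.

N_BINS = 200

def to_bin(coord: float) -> int:
    """Map a normalized coordinate in [0, 1] to a class label in [0, 199]."""
    return min(int(coord * N_BINS), N_BINS - 1)

def from_bin(label: int) -> float:
    """Map a class label back to its bin-center coordinate."""
    return (label + 0.5) / N_BINS

x = 0.7312
label = to_bin(x)
print(label, round(from_bin(label), 4))   # 146 0.7325
```

The round trip loses at most half a bin width (0.0025 in normalized units), which is why 200 bins suffice to reconstruct the molecular pose from the image.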
<h2 id="experimental-methodology-stress-testing-under-real-world-conditions">Experimental Methodology: Stress Testing under Real-World Conditions</h2>
<p>The evaluation focused on demonstrating robust performance across diverse image conditions, from pristine benchmarks to challenging real-world scenarios:</p>
<ol>
<li>
<p><strong>Benchmark Dataset Evaluation</strong>: αExtractor was tested on four standard OCSR benchmarks:</p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
</ul>
<p>Performance was measured using exact SMILES match accuracy.</p>
</li>
<li>
<p><strong>Error Analysis and Dataset Correction</strong>: During evaluation, the researchers discovered numerous labeling errors in the original benchmark datasets. They systematically identified and corrected these errors, then re-evaluated all methods on the cleaned datasets to get more accurate performance measurements.</p>
</li>
<li>
<p><strong>Robustness Stress Testing</strong>: The system was evaluated on two challenging datasets specifically designed to test robustness:</p>
<ul>
<li><strong>Color background images</strong> (200 samples): Molecular structures on complex, colorful backgrounds that simulate real figure conditions</li>
<li><strong>Low-quality images</strong> (200 samples): Degraded images with noise, blur, and artifacts typical of scanned documents</li>
</ul>
<p>These tests compared αExtractor against three open-source tools (OSRA, MolVec, and Imago) under realistic degradation conditions.</p>
</li>
<li>
<p><strong>Generalization Testing</strong>: In the most challenging experiment, αExtractor was tested on the <strong>DECIMER hand-drawn molecule images dataset</strong> (Brinkhaus et al., 2022), representing a completely different visual domain not represented in the training data. This tested whether the learned features could generalize beyond digital rendering styles to human-drawn chemistry.</p>
</li>
<li>
<p><strong>End-to-End Document Extraction</strong>: The complete pipeline was evaluated on 50 PDF files containing 2,336 molecular images. This tested both the object detection component (finding molecules in complex documents) and the recognition component (converting them to SMILES) in a realistic literature mining scenario.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured to demonstrate the practical efficiency needed for large-scale document processing.</p>
</li>
</ol>
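The exact-SMILES-match metric used throughout can be sketched as follows. This is an illustrative reconstruction, assuming canonicalization with RDKit; the paper does not publish its evaluation script, and `exact_match`/`accuracy` are hypothetical helper names.

```python
# Hedged sketch of the exact-SMILES-match metric: canonicalize both
# strings with RDKit and compare; an unparseable prediction counts as a miss.
from rdkit import Chem

def exact_match(pred_smiles: str, true_smiles: str) -> bool:
    try:
        pred = Chem.CanonSmiles(pred_smiles)
    except Exception:
        return False  # invalid SMILES cannot match anything
    return pred == Chem.CanonSmiles(true_smiles)

def accuracy(preds, refs):
    return sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
```

Canonicalization makes the comparison insensitive to atom ordering, so `C(C)O` and `OCC` count as the same molecule.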
<h2 id="results--conclusions-strong-performance-on-degraded-images">Results &amp; Conclusions: Strong Performance on Degraded Images</h2>
<ul>
<li>
<p><strong>Substantial Accuracy Gains</strong>: On the four benchmark datasets, αExtractor achieved accuracies of 91.83% (CLEF), 98.47% (UOB), 88.67% (JPO), and 93.64% (USPTO), compared to previous best results of 84.6%, 90.0%, 72.2%, and 89.9% respectively. After correcting dataset labeling errors, the true accuracies were even higher, reaching <strong>95.77% on CLEF, 99.86% on UOB, and 92.44% on JPO</strong>.</p>
</li>
<li>
<p><strong>Robustness on Degraded Images</strong>: Open-source competitors struggled on degraded images (achieving 5.5% accuracy at best). αExtractor maintained <strong>over 90% accuracy</strong> on both color background and low-quality image datasets, demonstrating the effectiveness of the synthetic training strategy.</p>
</li>
<li>
<p><strong>Generalization to Hand-Drawn Molecules</strong>: On hand-drawn molecules, a domain completely absent from training data, αExtractor achieved <strong>61.4% accuracy</strong> while other tools scored between 0.69% and 2.93%. This suggests the model learned genuine chemical features rather than style-specific patterns.</p>
</li>
<li>
<p><strong>Practical End-to-End Performance</strong>: In the complete document processing evaluation, αExtractor detected <strong>95.1% of molecular images</strong> (2,221 out of 2,336) and correctly recognized <strong>94.5% of detected structures</strong> (2,098 correct predictions). This demonstrates the system&rsquo;s readiness for real-world literature mining applications.</p>
</li>
<li>
<p><strong>Ablation Results</strong>: Ablation experiments confirmed that each architectural component (ResNet backbone, Transformer encoder, Transformer decoder) contributes to performance, with the Transformer decoder having the largest impact. Replacing the Transformer decoder with an LSTM decoder substantially reduced accuracy (Table S6 in the paper).</p>
</li>
<li>
<p><strong>Dataset Quality Issues</strong>: The systematic discovery of labeling errors in standard benchmarks highlights a broader problem in OCSR evaluation. The corrected datasets provide more reliable baselines for future method development.</p>
</li>
<li>
<p><strong>Spatial Layout Limitation</strong>: αExtractor correctly identifies molecular connectivity, but the re-rendered structures may have different spatial layouts than the originals. This could complicate visual verification for complex molecules, even if the chemical information remains accurate.</p>
</li>
<li>
<p><strong>Non-Standard Depiction Handling</strong>: For images with non-standard bond depictions or atomic valences, αExtractor correctly identifies and normalizes them to standard representations. While chemically accurate, this means the re-rendered structure may visually differ from the original image.</p>
</li>
</ul>
<p>Overall, αExtractor combines accurate recognition (over 90% on degraded images), end-to-end document processing, and strong generalization across image conditions. It targets large-scale literature mining tasks where previous tools struggled with degraded inputs. The focus on real-world robustness over benchmark optimization reflects a practical approach to deploying machine learning in scientific workflows.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is <strong>Partially Reproducible</strong>. While the authors detail the model architectures and training techniques, the source code, training dataset (20M synthetic images), and pre-trained weights remain closed-source and proprietary. The authors released a sample of their test data and host an online web server for running inference.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/CLEF_corrected">Corrected CLEF Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the CLEF benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/UOB_corrected">Corrected UOB Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the UOB benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/JPO_corrected">Corrected JPO Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the JPO benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Colored_Background">Color Background Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of molecular structures on complex, colorful backgrounds.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Low_Quality">Low Quality Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of degraded images with noise, blur, and artifacts.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/PDF">PDF Test Set</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Sample PDF files for end-to-end document extraction evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://extractor.alphama.com.cn/csr">αExtractor Web Server</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Online service for running inference using the proprietary system.</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Backbone:</strong> ResNet50 producing output of shape $2048 \times 19 \times 19$, projected to 512 channels via a feed-forward layer</li>
<li><strong>Transformer Architecture:</strong> 3 encoder layers and 3 decoder layers with hidden dimension of 512</li>
<li><strong>Output Format:</strong> Generates SMILES tokens plus two auxiliary coordinate sequences (X-axis and Y-axis) that are length-aligned with the SMILES tokens via padding</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Architecture:</strong> DETR (Detection Transformer) with ResNet101 backbone</li>
<li><strong>Transformer Architecture:</strong> 6 encoder layers and 6 decoder layers with hidden dimension of 256</li>
<li><strong>Purpose:</strong> Locates molecular images within PDF pages before recognition</li>
</ul>
<p><strong>Coordinate Prediction:</strong></p>
<ul>
<li>Continuous X/Y coordinates are discretized into <strong>200 discrete bins</strong></li>
<li>Padding tokens added to coordinate sequences to align perfectly with SMILES token sequence, enabling simultaneous structure and pose prediction</li>
</ul>
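The 200-bin discretization can be sketched in a few lines. The bin count follows the text; the `[0, 1]` normalization convention and the rounding/clamping behavior are assumptions, since the paper does not specify them.

```python
# Sketch of the coordinate discretization: map a normalized coordinate
# to one of 200 integer tokens, and recover the bin center on decode.
NUM_BINS = 200

def discretize(coord: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, 1] to an integer token in [0, num_bins)."""
    b = int(coord * num_bins)
    return min(max(b, 0), num_bins - 1)  # clamp so coord == 1.0 stays in range

def undiscretize(token: int, num_bins: int = NUM_BINS) -> float:
    """Recover the bin-center coordinate from a token."""
    return (token + 0.5) / num_bins
```

Decoding to the bin center bounds the reconstruction error at half a bin width (0.25% of the image extent here).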
<h3 id="data">Data</h3>
<p><strong>Training Data:</strong></p>
<ul>
<li><strong>Synthetic Generation:</strong> Python script rendering PubChem SMILES into 2D images</li>
<li><strong>Dataset Size:</strong> Approximately 20.3 million synthetic molecular images from PubChem</li>
<li><strong>Superatom Handling:</strong> 50% of molecules had functional groups replaced with superatoms (e.g., &ldquo;COOH&rdquo;) or generic labels (R1, X1) to match literature drawing conventions</li>
<li><strong>Rendering Augmentation:</strong> Randomized bond thickness, bond spacing, font size, font color, and padding size</li>
</ul>
<p><strong>Geometric Augmentation:</strong></p>
<ul>
<li>Shear along x-axis: $\pm 15^\circ$</li>
<li>Rotation: $\pm 15^\circ$</li>
<li>Piecewise affine scaling</li>
</ul>
<p><strong>Noise Injection:</strong></p>
<ul>
<li>Pepper noise: 0-2%</li>
<li>Salt noise: 0-40%</li>
<li>Gaussian noise: scale 0-0.16</li>
</ul>
<p><strong>Destructive Augmentation:</strong></p>
<ul>
<li>JPEG compression: severity levels 2-5</li>
<li>Random masking</li>
</ul>
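A minimal sketch of the noise-injection step, using the sampling ranges listed above. The exact augmentation code is not published, so the application order and the white-background convention are assumptions.

```python
import numpy as np

# Hedged sketch of the noise augmentations: pepper (0-2%), salt (0-40%),
# and additive Gaussian noise (scale 0-0.16), on a float grayscale image
# in [0, 1] with a white (1.0) background.
def inject_noise(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    out = img.copy()
    pepper = rng.uniform(0.0, 0.02)   # fraction of pixels forced to black
    salt = rng.uniform(0.0, 0.40)     # fraction of pixels forced to white
    out[rng.random(img.shape) < pepper] = 0.0
    out[rng.random(img.shape) < salt] = 1.0
    sigma = rng.uniform(0.0, 0.16)    # Gaussian noise scale
    out = out + rng.normal(0.0, sigma, img.shape)
    return np.clip(out, 0.0, 1.0)
```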
<p><strong>Evaluation Datasets:</strong></p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
<li><strong>Color background images</strong>: 200 samples</li>
<li><strong>Low-quality images</strong>: 200 samples</li>
<li><strong>Hand-drawn structures</strong>: Test set for generalization</li>
<li><strong>End-to-end document extraction</strong>: 50 PDFs (567 pages, 2,336 molecular images)</li>
</ul>
<h3 id="training">Training</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 100</li>
<li><strong>Epochs:</strong> 5</li>
<li><strong>Loss Function:</strong> Cross-entropy loss for both SMILES prediction and coordinate prediction</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 24</li>
<li><strong>Training Strategy:</strong> Pre-trained on synthetic &ldquo;Lower Quality&rdquo; data for 5 epochs, then fine-tuned on annotated real &ldquo;High Quality&rdquo; data for 30 epochs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics:</strong></p>
<ul>
<li><strong>Recognition</strong>: SMILES accuracy (exact match)</li>
<li><strong>End-to-End Pipeline</strong>:
<ul>
<li><strong>Recall</strong>: 95.1% for detection</li>
<li><strong>Accuracy</strong>: 94.5% for recognition</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Inference Hardware:</strong></p>
<ul>
<li>Cloud CPU server (8 CPUs, 64 GB RAM)</li>
<li><strong>Throughput:</strong> Processed 50 PDFs (567 pages) in 40 minutes</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xiong, J., Liu, X., Li, Z., Xiao, H., Wang, G., Niu, Z., Fei, C., Zhong, F., Wang, G., Zhang, W., Fu, Z., Liu, Z., Chen, K., Jiang, H., &amp; Zheng, M. (2023). αExtractor: a system for automatic extraction of chemical information from biomedical literature. <em>Science China Life Sciences</em>, 67(3), 618-621. <a href="https://doi.org/10.1007/s11427-023-2388-x">https://doi.org/10.1007/s11427-023-2388-x</a></p>
<p><strong>Publication</strong>: Science China Life Sciences (2023)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.1007/s11427-023-2388-x">Paper on Springer</a></li>
</ul>
]]></content:encoded></item><item><title>MolRec: Rule-Based OCSR System at TREC 2011 Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_trec/</guid><description>Rule-based system for optical chemical structure recognition using vectorization and geometric analysis, achieving 95% accuracy on TREC 2011.</description><content:encoded><![CDATA[<h2 id="contribution-rule-based-ocsr-system">Contribution: Rule-Based OCSR System</h2>
<p>This is a <strong>Method</strong> paper that presents and validates MolRec, a rule-based system for Optical Chemical Structure Recognition (OCSR). While the paper emphasizes performance analysis on the TREC 2011 benchmark, the core contribution is the system architecture itself: a multi-stage pipeline using vectorization, geometric rule-based analysis, and graph construction to convert chemical diagram images into machine-readable MOL files.</p>
<h2 id="motivation-robust-conversion-of-chemical-diagrams">Motivation: Robust Conversion of Chemical Diagrams</h2>
<p>Chemical molecular diagrams are ubiquitous in scientific documents across chemistry and life sciences. Converting these static raster images into machine-readable formats (like MOL files) that encode precise spatial and connectivity information is important for cheminformatics applications such as database indexing, similarity searching, and automated literature mining.</p>
<p>While pixel-based pattern matching approaches exist, they struggle with variations in drawing style, image quality, and diagram complexity. An approach that can handle the geometric and topological diversity of real-world chemical diagrams is needed.</p>
<h2 id="novelty-vectorization-and-geometric-rules">Novelty: Vectorization and Geometric Rules</h2>
<p>MolRec uses a <strong>vectorization and geometric rule-based pipeline</strong>. Key technical innovations include:</p>
<p><strong>Disk-Growing Heuristic for Wedge Bonds</strong>: A novel dynamic algorithm to distinguish wedge bonds from bold lines. A disk with radius greater than the average line width is placed inside the connected component and grown to the largest size that still covers only foreground pixels. The disk is then walked in the direction that allows it to continue growing. When it can grow no further, it has reached the base of the triangle (the stereo-center), which identifies the wedge orientation.</p>
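The core test behind this heuristic can be sketched as a maximal-inscribed-disk check. This toy version omits the walking step and the line-width threshold; it is a simplified illustration, not the authors' implementation, and `max_disk_radius` is a hypothetical name.

```python
import numpy as np

# Toy version of the disk test: the largest radius disk centered at
# (r, c) that covers only foreground (ink) pixels. In a wedge bond the
# radius is large near the base and shrinks toward the narrow tip.
def max_disk_radius(mask: np.ndarray, r: int, c: int) -> int:
    """mask: boolean array, True = foreground ink."""
    if not mask[r, c]:
        return 0
    rows, cols = np.indices(mask.shape)
    dist2 = (rows - r) ** 2 + (cols - c) ** 2
    radius = 0
    # Grow while every pixel inside the next-larger disk is still ink.
    while np.all(mask[dist2 <= (radius + 1) ** 2]):
        radius += 1
    return radius
```

Comparing this radius along the component separates the wide base from the narrow tip, which is what distinguishes a wedge from a uniformly bold line.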
<p><strong>Joint Breaking Strategy</strong>: Explicitly breaking all connected joints in the vectorization stage to avoid combinatorial connection complexity. This allows uniform treatment of all line segment connections regardless of junction complexity.</p>
<p><strong>Superatom Dictionary Mining</strong>: The system mines MOL files from the OSRA dataset to build a comprehensive superatom dictionary (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;), supplemented by the Marvin abbreviation collection.</p>
<p><strong>Comprehensive Failure Analysis</strong>: Unlike most OCSR papers that report only aggregate accuracy, this work provides a detailed categorization of all 55 failures, identifying 61 specific error reasons and their root causes.</p>
<h2 id="methodology-and-trec-2011-experiments">Methodology and TREC 2011 Experiments</h2>
<p><strong>Benchmark</strong>: The system was evaluated on the <strong>TREC 2011 Chemical Track</strong> test set consisting of 1,000 molecular diagram images. The authors performed two independent runs with slightly different internal parameter settings to assess reproducibility.</p>
<p><strong>Evaluation Metric</strong>: Correct recall of chemical structures. Output MOL files were compared semantically to ground truth using <strong>OpenBabel</strong>, which ignores syntactically different but chemically equivalent representations.</p>
<p><strong>Failure Analysis</strong>: Across both runs, 55 unique diagrams were misrecognized (50 in run 1, 51 in run 2, with significant overlap). The authors manually examined all 55 and categorized them, identifying 61 specific reasons for mis-recognition. This analysis provides insight into systematic limitations of the rule-based approach.</p>
<h2 id="results-and-top-failure-modes">Results and Top Failure Modes</h2>
<p><strong>High Accuracy</strong>: MolRec achieved a <strong>95% correct recovery rate</strong> on the TREC 2011 benchmark:</p>
<ul>
<li>Run 1: 950/1000 structures correctly recognized (95.0%)</li>
<li>Run 2: 949/1000 structures correctly recognized (94.9%)</li>
</ul>
<p>The near-identical results across runs with slightly different internal parameters show stability of the rule-based approach.</p>
<p><strong>Top Failure Modes</strong> (from detailed analysis of 55 unique misrecognized diagrams, yielding 61 total error reasons):</p>
<ul>
<li><strong>Dashed wedge bond misidentification (15 cases)</strong>: Most common failure. Short dashes at the narrow end were interpreted as a separate dashed bond while longer dashes were treated as a dashed wedge or dashed bold bond, splitting one bond into two with a spurious node.</li>
<li><strong>Incorrect stereochemistry (10 cases)</strong>: Heuristics guessed wrong 3D orientations for ambiguous bold/dashed bonds where syntax alone is insufficient.</li>
<li><strong>Touching components (6 cases)</strong>: Characters touching bonds, letters touching symbols, or ink bleed between close parallel lines caused segmentation failures.</li>
<li><strong>Incorrect character grouping (5 cases)</strong>: Characters too close together for reliable separation.</li>
<li><strong>Solid circles without 3D hydrogen bond (5 cases)</strong>: MolRec correctly interprets solid circles as implying a hydrogen atom via a solid wedge bond, but some solution MOL files in the test set omit this bond, causing a mismatch.</li>
<li><strong>Diagram caption confusion (5 cases)</strong>: Captions appearing within images are mistakenly parsed as part of the molecular structure.</li>
<li><strong>Unrecognised syntax (5 cases)</strong>: User annotations, unusual notations (e.g., wavy line crossing a dashed wedge), and repetition structures.</li>
<li><strong>Broken characters (3 cases)</strong>: Degraded or partial characters without recovery mechanisms.</li>
<li><strong>Connectivity of superatoms (3 cases)</strong>: Ambiguous permutation of connection points for multi-bonded superatoms.</li>
<li><strong>Problematic bridge bonds (3 cases)</strong>: Extreme perspective or angles outside MolRec&rsquo;s thresholds.</li>
<li><strong>Unhandled bond type (1 case)</strong>: A dashed dative bond not previously encountered.</li>
</ul>
<p><strong>System Strengths</strong>:</p>
<ul>
<li>Douglas-Peucker line simplification proves faster and more robust than Hough transforms across different drawing styles</li>
<li>Disk-growing wedge bond detection effectively distinguishes 3D orientations in most cases</li>
<li>Mining MOL files for superatom dictionary captures real-world chemical abbreviation usage patterns</li>
</ul>
<p><strong>Fundamental Limitations Revealed</strong>:</p>
<ul>
<li><strong>Brittleness</strong>: Small variations in drawing style or image quality can cause cascading failures</li>
<li><strong>Stereochemistry ambiguity</strong>: Even humans disagree on ambiguous cases; automated resolution based purely on syntax is inherently limited</li>
<li><strong>Segmentation dependence</strong>: Most failures trace back to incorrect separation of text, bonds, and graphical elements</li>
<li><strong>No error recovery</strong>: Early-stage mistakes propagate through the pipeline with no mechanism for correction</li>
</ul>
<p><strong>Test Set Quality Issues</strong>: The paper also highlights several cases where the TREC 2011 ground truth itself was questionable. Some solution MOL files omitted stereo bond information for solid circle notations, dative (polar) bonds were inconsistently interpreted as either double bonds or single bonds across the training and test sets, and one diagram contained over-connected carbon atoms (5 bonds without the required positive charge indication) that the solution MOL file did not flag.</p>
<p>The systematic error analysis reveals what 95% accuracy means in practice. The failure modes highlight scalability challenges for rule-based systems when applied to diverse real-world documents with noise, artifacts, and non-standard conventions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dictionary Mining</td>
          <td>OSRA Dataset</td>
          <td>Unknown</td>
          <td>Mined to create superatom dictionary for abbreviations like &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;</td>
      </tr>
      <tr>
          <td>Dictionary</td>
          <td>Marvin Collection</td>
          <td>N/A</td>
          <td>Integrated Marvin abbreviation group collection for additional superatoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 Test Set</td>
          <td>1,000 images</td>
          <td>Standard benchmark for Text REtrieval Conference Chemical Track</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The MolRec pipeline consists of sequential image processing and graph construction stages:</p>
<p><strong>1. Preprocessing</strong></p>
<ul>
<li><strong>Binarization</strong>: Input image converted to binary</li>
<li><strong>Connected Component Labeling</strong>: Identifies distinct graphical elements</li>
<li><strong>OCR</strong>: Simple metric space-based engine identifies characters (letters $L$, digits $N$, symbols $S$)</li>
<li><strong>Character Grouping</strong>: Spatial proximity and type-based heuristics group characters:
<ul>
<li>Horizontal: Letter-Letter, Digit-Digit, Letter-Symbol</li>
<li>Vertical: Letter-Letter only</li>
<li>Diagonal: Letter-Digit, Letter-Charge</li>
</ul>
</li>
</ul>
<p><strong>2. Vectorization (Line Finding)</strong></p>
<ul>
<li><strong>Image Thinning</strong>: Reduce lines to unit width</li>
<li><strong>Douglas-Peucker Algorithm</strong>: Simplify polylines into straight line segments</li>
<li><strong>Joint Breaking</strong>: Explicitly split lines at junctions where $&gt;2$ segments meet, avoiding combinatorial connection complexity</li>
</ul>
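The Douglas-Peucker step above is the textbook recursive algorithm; a compact sketch (epsilon tuning and the preceding thinning step not shown):

```python
import math

def _point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / norm

def douglas_peucker(points, epsilon):
    """Simplify a polyline to within `epsilon` of the original points."""
    if len(points) < 3:
        return list(points)
    dists = [_point_line_dist(p, points[0], points[-1]) for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1  # farthest point
    if dists[i - 1] > epsilon:
        # Keep the farthest point and recurse on both halves.
        left = douglas_peucker(points[: i + 1], epsilon)
        right = douglas_peucker(points[i:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]
```

A small epsilon preserves genuine kinks (atom positions) while a large one collapses a thinned stroke to a single segment.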
<p><strong>3. Bond Recognition Rules</strong></p>
<p>After erasing text from the image, remaining line segments are analyzed:</p>
<ul>
<li><strong>Double/Triple Bonds</strong>: Cluster segments with same slope within threshold distance</li>
<li><strong>Dashed Bonds</strong>: Identify repeated short segments of similar length with collinear center points</li>
<li><strong>Wedge/Bold Bonds</strong>: Dynamic disk algorithm:
<ul>
<li>Place disk with radius $&gt;$ average line width inside component</li>
<li>Grow disk to maximum size to locate triangle base (stereo-center)</li>
<li>&ldquo;Walk&rdquo; disk to find narrow end, distinguishing wedge orientation</li>
</ul>
</li>
<li><strong>Wavy Bonds</strong>: Identify sawtooth pattern polylines after thinning</li>
<li><strong>Implicit Nodes</strong>: Split longer segments at points where parallel shorter segments terminate (carbon atoms in chains)</li>
</ul>
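The dashed-bond rule (short segments with collinear center points) can be sketched as a midpoint-collinearity test. The pixel tolerance here is illustrative, not the paper's threshold, and `centers_collinear` is a hypothetical name.

```python
import math

# Sketch of the dashed-bond test: True if the midpoints of the candidate
# segments deviate from the first-to-last midpoint line by at most `tol`.
def centers_collinear(segments, tol=1.5):
    """segments: list of ((x1, y1), (x2, y2)) line segments."""
    mids = [((x1 + x2) / 2, (y1 + y2) / 2) for (x1, y1), (x2, y2) in segments]
    (ax, ay), (bx, by) = mids[0], mids[-1]
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return False  # degenerate: all dashes at one spot
    return all(abs(dy * (mx - ax) - dx * (my - ay)) / norm <= tol
               for mx, my in mids)
```

In the full pipeline this check would be combined with the similar-length requirement before merging the dashes into one bond.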
<p><strong>4. Graph Construction</strong></p>
<ul>
<li><strong>Node Formation</strong>: Group line segment endpoints by distance threshold</li>
<li><strong>Disambiguation</strong>: Logic separates lowercase &ldquo;l&rdquo;, uppercase &ldquo;I&rdquo;, digit &ldquo;1&rdquo;, and vertical bonds</li>
<li><strong>Superatom Expansion</strong>: Replace abbreviations with full structures using mined dictionary</li>
<li><strong>Stereochemistry Resolution</strong>: Heuristics based on neighbor counts determine direction for ambiguous bold/dashed bonds (known limitation)</li>
</ul>
<p><strong>5. MOL File Generation</strong></p>
<ul>
<li>Final graph structure converted to standard MOL file format</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Recall</td>
          <td>950/1000</td>
          <td>949/1000</td>
          <td>Slightly different internal parameters between runs</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>95.0%</td>
          <td>94.9%</td>
          <td>Semantic comparison using OpenBabel</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison Method</strong>: OpenBabel converts graphs to MOL files and compares them semantically to ground truth, ignoring syntactic variations that don&rsquo;t affect chemical meaning.</p>
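The paper performs this semantic comparison with OpenBabel; an equivalent check can be sketched with RDKit instead (a substitution for illustration, not the authors' tooling): parse both MOL blocks and compare canonical SMILES, which ignores atom ordering and other syntactic differences.

```python
# Hedged sketch of semantic MOL-file comparison using RDKit in place of
# OpenBabel: two MOL blocks match if their canonical SMILES agree.
from rdkit import Chem

def same_molecule(molblock_a: str, molblock_b: str) -> bool:
    ma = Chem.MolFromMolBlock(molblock_a)
    mb = Chem.MolFromMolBlock(molblock_b)
    if ma is None or mb is None:
        return False  # unparseable output counts as a mismatch
    return Chem.MolToSmiles(ma) == Chem.MolToSmiles(mb)
```

Note this is stricter than a graph-isomorphism check only in edge cases (e.g., differing stereo annotations), which is exactly where the test-set quality issues discussed below arise.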
<p><strong>Failure Categorization</strong>: 55 unique misrecognized diagrams analyzed across both runs, identifying 61 specific error reasons across 11 categories including dashed wedge bond misidentification (15), incorrect stereochemistry (10), touching components (6), incorrect character grouping (5), solid circles (5), diagram caption confusion (5), unrecognised syntax (5), broken characters (3), superatom connectivity (3), problematic bridge bonds (3), and unhandled bond type (1).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://openbabel.org/">Open Babel</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Used for semantic MOL file comparison</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/projects/osra/">OSRA</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Source of superatom dictionary data (MOL files mined)</td>
      </tr>
      <tr>
          <td>TREC 2011 Chemical Track</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>1,000 molecular diagram images (available via NIST)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The MolRec source code is not publicly available. The evaluation dataset (TREC 2011) is accessible through NIST, and the tools used for comparison (OpenBabel) are open source. However, full reproduction of MolRec&rsquo;s pipeline would require reimplementation from the paper&rsquo;s descriptions.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Details</strong>: Not explicitly specified in the paper</li>
<li><strong>Performance Note</strong>: Vectorization approach noted as &ldquo;proven to be fast&rdquo; compared to Hough transform alternatives</li>
</ul>
<h3 id="references">References</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawiPerformanceMolRecTREC2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of {{MolRec}} at {{TREC}} 2011 {{Overview}} and {{Analysis}} of {{Results}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 20th {{Text REtrieval Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sadawi, Noureddin M. and Sexton, Alan P. and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2011). Performance of MolRec at TREC 2011 Overview and Analysis of Results. <em>Proceedings of the 20th Text REtrieval Conference</em>. <a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf</a></p>
<p><strong>Publication</strong>: TREC 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openbabel.org/">Open Babel</a> - Used for semantic MOL file comparison</li>
<li><a href="https://sourceforge.net/projects/osra/">OSRA Project</a> - Source of superatom dictionary data</li>
</ul>
]]></content:encoded></item><item><title>MolNexTR: A Dual-Stream Molecular Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molnextr/</guid><description>Dual-stream encoder combining ConvNext and ViT for robust optical chemical structure recognition across diverse molecular drawing styles.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Leung, C. T., Huang, Y., Sun, J., Chen, H., &amp; Gao, H. (2024). MolNexTR: a generalized deep learning model for molecular image recognition. <em>Journal of Cheminformatics</em>, 16(141). <a href="https://doi.org/10.1186/s13321-024-00926-w">https://doi.org/10.1186/s13321-024-00926-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/CYF2000127/MolNexTR">GitHub Repository</a></li>
<li><a href="https://huggingface.co/datasets/CYF200127/MolNexTR/tree/main">HuggingFace Dataset/Model</a></li>
</ul>
<h2 id="methodology-overview-and-taxonomic-classification">Methodology Overview and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$). It proposes a neural network architecture (MolNexTR) that integrates ConvNext and Vision Transformers to solve the Optical Chemical Structure Recognition (OCSR) task. The paper validates this method through ablation studies and benchmarking against existing methods including MolScribe and DECIMER.</p>
<h2 id="the-challenge-of-domain-specific-drawing-styles-in-ocsr">The Challenge of Domain-Specific Drawing Styles in OCSR</h2>
<p>Converting molecular images from chemical literature into machine-readable formats (SMILES) is critical but challenging due to the high variance in drawing styles, fonts, and conventions (e.g., Markush structures, abbreviations). Existing methods have limitations:</p>
<ul>
<li>CNN-based and ViT-based models often struggle to generalize across diverse, non-standard drawing styles found in real literature.</li>
<li>Pure ViT methods lack translation invariance and local feature representation, while pure CNNs struggle with global dependencies.</li>
<li>Many models predict SMILES strings directly, making it difficult to enforce chemical validity or resolve complex stereochemistry and abbreviations.</li>
</ul>
<h2 id="core-innovation-dual-stream-encoding-and-image-contamination">Core Innovation: Dual-Stream Encoding and Image Contamination</h2>
<p>MolNexTR introduces three main innovations:</p>
<ol>
<li><strong>Dual-Stream Encoder</strong>: A hybrid architecture processing images simultaneously through a ConvNext stream (for local features) and a Vision Transformer stream (for long-range dependencies), fusing them to capture multi-scale information.</li>
<li><strong>Image Contamination Augmentation</strong>: A specialized data augmentation algorithm that simulates real-world &ldquo;noise&rdquo; found in literature, such as overlapping text, arrows, and partial molecular fragments, to improve robustness.</li>
<li><strong>Graph-Based Decoding with Post-Processing</strong>: Unlike pure image-to-SMILES translation, it predicts atoms and bonds (graph generation) and uses a stereochemical discrimination and abbreviation self-correction module to enforce chemical rules (e.g., chirality) and resolve superatoms (e.g., &ldquo;Ph&rdquo;, &ldquo;Bn&rdquo;).</li>
</ol>
<p>The prediction of atom labels and coordinates is formulated as a conditional autoregressive generation task, optimized via a cross-entropy loss:
$$ \mathcal{L}_{\text{atom}} = -\sum_{t=1}^{T} \log P(x_t \mid \text{Image}, x_{&lt;t}) $$</p>
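<p>Spelled out, the loss is just the summed negative log-probability the decoder assigns to each ground-truth token. A dependency-free sketch (the probability values are placeholders, not MolNexTR outputs):</p>

```python
import math

def atom_sequence_loss(token_log_probs):
    """Negative log-likelihood of the ground-truth atom-token sequence:
    -sum over t of log P(x_t | Image, x_<t)."""
    return -sum(token_log_probs)

# Example: a 3-step decode whose ground-truth tokens received
# probabilities 0.9, 0.5, and 0.8 from the decoder (placeholder values).
probs = [0.9, 0.5, 0.8]
loss = atom_sequence_loss([math.log(p) for p in probs])
```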
<h2 id="experimental-setup-benchmarking-on-synthetic-and-real-data">Experimental Setup: Benchmarking on Synthetic and Real Data</h2>
<p>The model was trained on synthetic data (PubChem) and real patent data (USPTO). It was evaluated on nine benchmarks (three synthetic, six real-world):</p>
<ul>
<li><strong>Synthetic</strong>: Indigo, ChemDraw, RDKit (rendered from 5,719 molecules)</li>
<li><strong>Real-World</strong>: CLEF, UOB, JPO, USPTO, Staker, and a newly curated ACS dataset (diverse styles)</li>
</ul>
<p><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec) and deep learning models (MolScribe, DECIMER, SwinOCSR, Img2Mol).</p>
<p><strong>Ablations</strong>: Tested the impact of the dual-stream encoder vs. single streams, and the contribution of individual augmentation strategies.</p>
<h2 id="empirical-results-and-robustness-findings">Empirical Results and Robustness Findings</h2>
<ul>
<li><strong>Performance</strong>: MolNexTR achieved 81-97% accuracy across test sets, outperforming the second-best method (often MolScribe) by margins ranging from 0.3% to 10.0%, the largest gain coming on the difficult ACS dataset.</li>
<li><strong>Perturbation resilience</strong>: The model maintained higher accuracy under image perturbations (rotation, noise) and &ldquo;curved arrow&rdquo; noise common in reaction mechanisms compared to MolScribe and DECIMER (Table 3).</li>
<li><strong>Ablation Results</strong>: The dual-stream encoder consistently outperformed single CNN or ViT baselines, and the image contamination algorithm significantly boosted performance on noisy real-world data (ACS).</li>
<li><strong>Limitations</strong>: The model still struggles with extremely complex hand-drawn molecules and mechanism diagrams where arrows or text are conflated with structure. The authors also note that R-group information in real literature often appears in separate text or tables, which the model does not incorporate.</li>
</ul>
<p><strong>Key Results (Table 2, SMILES exact match accuracy %)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolScribe</th>
          <th>MolNexTR</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Indigo</td>
          <td>97.5</td>
          <td>97.8</td>
          <td>+0.3</td>
      </tr>
      <tr>
          <td>ChemDraw</td>
          <td>93.8</td>
          <td>95.1</td>
          <td>+1.3</td>
      </tr>
      <tr>
          <td>RDKit</td>
          <td>94.6</td>
          <td>96.4</td>
          <td>+1.8</td>
      </tr>
      <tr>
          <td>CLEF</td>
          <td>88.3</td>
          <td>90.4</td>
          <td>+2.1</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td>87.9</td>
          <td>88.5</td>
          <td>+0.6</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>77.7</td>
          <td>82.1</td>
          <td>+4.4</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>92.6</td>
          <td>93.8</td>
          <td>+1.2</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>86.9</td>
          <td>88.3</td>
          <td>+1.4</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td>71.9</td>
          <td>81.9</td>
          <td>+10.0</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: ~1M molecules randomly selected from PubChem, rendered using RDKit and Indigo with varied styles (thickness, fonts, bond width)</li>
<li><strong>Real</strong>: 0.68M images from USPTO, with coordinates normalized from MOLfiles</li>
</ul>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Render Augmentation</strong>: Randomized drawing styles (line width, font size, label modes)</li>
<li><strong>Image Augmentation</strong>: Rotation, cropping, blurring, noise (Gaussian, salt-and-pepper)</li>
<li><strong>Molecular Augmentation</strong>: Randomly replacing functional groups with abbreviations (from a list of &gt;100) or complex chains (e.g., CH3CH2NH2); adding R-groups</li>
<li><strong>Image Contamination</strong>: Adding &ldquo;noise&rdquo; objects (arrows, lines, text, partial structures) at a minimum distance from the main molecule to simulate literature artifacts</li>
</ul>
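<p>The contamination step can be realized as rejection sampling: draw candidate positions for noise objects and keep only those outside a margin around the molecule's bounding box. A minimal sketch with illustrative function and parameter names (the paper's algorithm also renders the arrows, text, and fragments themselves):</p>

```python
import random

def place_contaminants(mol_bbox, canvas, n_objects, min_dist, rng=None):
    """Sample (x, y) positions for "noise" objects (arrows, text, partial
    structures) at least min_dist pixels from the molecule's bounding box.

    mol_bbox: (x0, y0, x1, y1); canvas: (width, height).
    Illustrative only -- not the authors' implementation.
    """
    rng = rng or random.Random(0)
    x0, y0, x1, y1 = mol_bbox
    w, h = canvas
    placed = []
    while len(placed) < n_objects:
        x, y = rng.uniform(0, w), rng.uniform(0, h)
        # distance from the candidate point to the axis-aligned molecule box
        dx = max(x0 - x, 0, x - x1)
        dy = max(y0 - y, 0, y - y1)
        if (dx * dx + dy * dy) ** 0.5 >= min_dist:
            placed.append((x, y))
    return placed

spots = place_contaminants((100, 100, 250, 200), (384, 384), 5, min_dist=30)
```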
<h3 id="algorithms">Algorithms</h3>
<p><strong>Dual-Stream Encoder</strong>:</p>
<ul>
<li><strong>CNN Stream</strong>: ConvNext backbone (pre-trained on ImageNet), generating feature maps at scales $H/4$ to $H/32$</li>
<li><strong>ViT Stream</strong>: Parallel transformer blocks receiving patches of sizes $p=4, 8, 16, 32$. Uses Multi-Head Self-Attention (MHSA) and Feed-Forward Networks (FFN)</li>
<li><strong>Fusion</strong>: Outputs from both streams are concatenated</li>
</ul>
<p><strong>Decoder (Graph Generation)</strong>:</p>
<ul>
<li><strong>Transformer Decoder</strong>: 6 layers, 8 heads, hidden dim 256</li>
<li><strong>Task 1 (Atoms)</strong>: Autoregressive prediction of atom tokens $(l, x, y)$ (label + coordinates)</li>
<li><strong>Task 2 (Bonds)</strong>: Prediction of bond types between atom pairs (None, Single, Double, Triple, Aromatic, Solid Wedge, Dashed Wedge)</li>
</ul>
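<p>Because bonds are classified per atom pair, the bond head scores every unordered pair of predicted atoms against the seven labels above. A small sketch of the pair enumeration (illustrative, not the authors' code):</p>

```python
BOND_TYPES = ["None", "Single", "Double", "Triple", "Aromatic",
              "Solid Wedge", "Dashed Wedge"]

def bond_pairs(n_atoms):
    """Unordered atom-index pairs the bond head must classify; each pair
    is assigned one of the seven bond-type labels in BOND_TYPES."""
    return [(i, j) for i in range(n_atoms) for j in range(i + 1, n_atoms)]

# a 4-atom molecule yields n*(n-1)/2 = 6 candidate pairs
pairs = bond_pairs(4)
```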
<p><strong>Post-Processing</strong>:</p>
<ul>
<li><strong>Stereochemistry</strong>: Uses predicted coordinates and bond types (wedge/dash) to resolve chirality using RDKit logic</li>
<li><strong>Abbreviation Correction</strong>: Matches superatoms to a dictionary; if unknown, attempts to greedily connect atoms based on valence or finds the nearest match ($\sigma=0.8$ similarity threshold)</li>
</ul>
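<p>The abbreviation-correction fallback can be sketched as a nearest-match dictionary lookup; here Python's <code>difflib</code> string similarity stands in for the paper's $\sigma=0.8$ match criterion, and the dictionary entries are illustrative:</p>

```python
from difflib import SequenceMatcher

# Small illustrative superatom dictionary (the real one has >100 entries).
SUPERATOMS = {"Ph": "c1ccccc1", "Bn": "Cc1ccccc1", "OMe": "OC", "Et": "CC"}

def resolve_superatom(label, dictionary=SUPERATOMS, threshold=0.8):
    """Resolve an abbreviation to a SMILES fragment by exact lookup,
    falling back to the nearest dictionary key above a similarity
    threshold; returns None when nothing matches."""
    if label in dictionary:
        return dictionary[label]
    best_key, best_score = None, 0.0
    for key in dictionary:
        score = SequenceMatcher(None, label, key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return dictionary[best_key] if best_score >= threshold else None

frag = resolve_superatom("Phh")  # near-miss still maps to the phenyl entry
```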
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder (ConvNext + ViT Encoder -&gt; Transformer Decoder)</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Optimizer: Adam (max lr 3e-4, linear warmup for 5% of steps)</li>
<li>Batch Size: 256</li>
<li>Image Size: $384 \times 384$</li>
<li>Dropout: 0.1</li>
</ul>
</li>
<li><strong>Training</strong>: Fine-tuned CNN backbone for 40 epochs on 10 NVIDIA RTX 3090 GPUs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: SMILES sequence exact matching accuracy (canonicalized)</p>
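<p>The metric itself is a straightforward exact match after canonicalization. A dependency-free sketch (in practice <code>canonicalize</code> would round-trip each string through RDKit; the identity default here is a stand-in):</p>

```python
def exact_match_accuracy(preds, refs, canonicalize=lambda s: s):
    """Fraction of predictions whose canonicalized SMILES exactly matches
    the canonicalized reference."""
    hits = sum(canonicalize(p) == canonicalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)

acc = exact_match_accuracy(["CCO", "CCN"], ["CCO", "CCC"])
```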
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>Synthetic</strong>: Indigo (5,719), ChemDraw (5,719), RDKit (5,719)</li>
<li><strong>Real</strong>: CLEF (992), UOB (5,740), JPO (450), USPTO (5,719), Staker (50,000), ACS (331)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: 10 NVIDIA RTX 3090 GPUs</li>
<li><strong>Cluster</strong>: HPC3 Cluster at HKUST (ITSC)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CYF2000127/MolNexTR">MolNexTR GitHub</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation (PyTorch, Jupyter notebooks)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/CYF200127/MolNexTR">MolNexTR HuggingFace</a></td>
          <td>Dataset/Model</td>
          <td>Apache-2.0</td>
          <td>Training data and model checkpoint</td>
      </tr>
  </tbody>
</table>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chenMolNexTRGeneralizedDeep2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{MolNexTR}: a generalized deep learning model for molecular image recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chen, Yufan and Leung, Ching Ting and Huang, Yong and Sun, Jianwei and Chen, Hao and Gao, Hanyu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{141}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00926-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInfty: Chemical Structure Recognition in Patent Images</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/rule-based/cheminfty/</guid><description>Fujiyoshi et al.'s segment-based approach for recognizing chemical structures in challenging Japanese patent images with touching characters and broken lines.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fujiyoshi, A., Nakagawa, K., &amp; Suzuki, M. (2011). Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty. <em>Pre-Proceedings of the 9th IAPR International Workshop on Graphics Recognition, GREC.</em></p>
<p><strong>Publication</strong>: GREC 2011 (Graphics Recognition Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.sciaccess.net/en/InftyReader/">InftyReader Project</a></li>
</ul>
<h2 id="contribution-segment-based-ocsr-method">Contribution: Segment-Based OCSR Method</h2>
<p>This is a <strong>method paper</strong> that introduces ChemInfty, a rule-based system for Optical Chemical Structure Recognition (OCSR) specifically designed to handle the challenging, low-quality images found in Japanese patent applications.</p>
<h2 id="motivation-the-challenge-of-degraded-patent-images">Motivation: The Challenge of Degraded Patent Images</h2>
<p>The motivation is straightforward: Japanese patent applications contain a massive amount of chemical knowledge, but the images are remarkably poor quality. Unlike the relatively clean molecular diagrams in scientific papers, patent images suffer from multiple problems that break conventional OCSR systems.</p>
<p>The authors quantified these issues in a sample of 200 patent images and found that 22% contained touching characters (where atom labels merge together), 19.5% had characters touching bond lines, and 8.5% had broken lines. These are not edge cases; they are pervasive enough to cripple existing recognition tools. Established systems like CLIDE, ChemReader, and OSRA struggle significantly with line-touching characters and broken lines, leading to recognition failures.</p>
<p>The challenge is compounded by the sheer diversity of creation methods. Some structures are drawn with sophisticated molecular editors, others with basic paint programs, and some are even handwritten. This means there&rsquo;s no standardization in fonts, character sizes, or line thickness. Add in the effects of scanning and faxing, and you have images with significant noise, distortion, and degradation.</p>
<p>The goal of ChemInfty is to build a system robust enough to handle these messy real-world conditions and make Japanese patent chemistry computer-searchable.</p>
<h2 id="core-innovation-segment-decomposition-and-dynamic-programming">Core Innovation: Segment Decomposition and Dynamic Programming</h2>
<p>The novelty lies in a segment-based decomposition approach that separates the recognition problem into manageable pieces before attempting to classify them. The key insight is that traditional OCR fails on these images because characters and lines are physically merged. You cannot recognize a character if you cannot cleanly separate it from the surrounding bonds first.</p>
<p>ChemInfty&rsquo;s approach has several distinctive elements:</p>
<ol>
<li>
<p><strong>Line and Curve Segmentation</strong>: The system first decomposes the image into smaller line and curve segments. The decomposition happens at natural breakpoints&mdash;crossings, sharp bends, and other locations where touching is likely to occur. This creates a set of primitive elements that can be recombined in different ways.</p>
</li>
<li>
<p><strong>Linear Order Assumption for Scalability</strong>: To make the dynamic programming approach computationally tractable and avoid combinatorial explosion, the system assumes that segments to be combined are adjacent when sorted in one of four directional orderings ($\perp, \setminus, \triangle, \rightarrow$). This constraint dramatically reduces the search space while still capturing the natural spatial relationships in chemical diagrams.</p>
</li>
<li>
<p><strong>Dynamic Programming for Segment Combination</strong>: Once the image is decomposed, the system faces a combinatorial problem: which segments should be grouped together to form characters, and which should be classified as bonds? The authors use dynamic programming to efficiently search for the &ldquo;most suitable combination&rdquo; of segments. This optimization finds the configuration that maximizes the likelihood of valid chemical structure elements.</p>
</li>
<li>
<p><strong>Two-Pass OCR Strategy</strong>: ChemInfty integrates with InftyReader, a powerful OCR engine. The system uses OCR twice in the pipeline:</p>
<ul>
<li><strong>First pass</strong>: High-confidence character recognition removes obvious atom labels early, simplifying the remaining image</li>
<li><strong>Second pass</strong>: After the segment-based method identifies and reconstructs difficult character regions, OCR is applied again to the cleaned-up character image</li>
</ul>
<p>This two-stage approach handles both easy and hard cases effectively: simple characters are recognized immediately, while complex cases get special treatment.</p>
</li>
<li>
<p><strong>Image Thinning for Structure Analysis</strong>: Before segmentation, the system thins the remaining graphical elements (after removing high-confidence characters) to skeleton lines. This thinning operation reveals the underlying topological structure&mdash;crossings, bends, and endpoints&mdash;making it easier to detect where segments should be divided.</p>
</li>
<li>
<p><strong>Proximity-Based Grouping</strong>: After identifying potential character segments, the system groups nearby segments together. This spatial clustering ensures that parts of the same character that were separated by bonds get recombined correctly.</p>
</li>
</ol>
<h2 id="methodology-real-world-patent-evaluation">Methodology: Real-World Patent Evaluation</h2>
<p>The evaluation focused on demonstrating that ChemInfty could handle real-world patent images at scale:</p>
<ol>
<li>
<p><strong>Large-Scale Patent Dataset</strong>: The system was tested on chemical structure images from Japanese patent applications published in 2008. This represents a realistic deployment scenario with all the messiness of actual documents.</p>
</li>
<li>
<p><strong>Touching Character Separation</strong>: The authors specifically measured the system&rsquo;s ability to separate characters from bonds when they were touching. Success was defined as cleanly extracting the character region so that OCR could recognize it.</p>
</li>
<li>
<p><strong>Recognition Accuracy by Object Type</strong>: Performance was broken down by element type (characters, line segments, solid wedges, and hashed wedges). This granular analysis revealed which components were easier or harder for the system to handle.</p>
</li>
<li>
<p><strong>End-to-End Performance</strong>: The overall recognition ratio was calculated across all object types to establish the system&rsquo;s practical utility for automated patent processing.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Effective Separation for Line-Touching Characters</strong>: The segment-based method successfully separated 63.5% of characters that were touching bond lines. This is a substantial improvement over standard OCR, which typically fails completely on such cases. The authors note that when image quality is reasonable, the separation method works well.</p>
</li>
<li>
<p><strong>Strong Overall Character Recognition</strong>: Character recognition achieved 85.86% accuracy, which is respectable given the poor quality of the input images. Combined with the 90.73% accuracy for line segments, this demonstrates the system can reliably reconstruct the core molecular structure.</p>
</li>
<li>
<p><strong>Weak Performance on Wedges</strong>: The system struggled significantly with stereochemistry notation. Solid wedges were correctly recognized only 52.54% of the time, and hashed wedges fared even worse at 23.63%. This is a critical limitation since stereochemistry is often essential for understanding molecular properties.</p>
</li>
<li>
<p><strong>Image Quality Dependency</strong>: The authors acknowledge that the method&rsquo;s effectiveness is ultimately limited by image quality. When images are severely degraded (blurred to the point where even humans struggle to distinguish characters from noise), the segmentation approach cannot reliably separate touching elements.</p>
</li>
<li>
<p><strong>Overall System Performance</strong>: The combined recognition ratio of 86.58% for all objects indicates that ChemInfty is a working system but not yet production-ready. The authors conclude that further refinement is necessary, particularly for wedge recognition and handling extremely low-quality images.</p>
</li>
</ul>
<p>The work establishes that segment-based decomposition with dynamic programming is a viable approach for handling the specific challenges of patent image OCSR. The two-pass OCR strategy and the use of image thinning to reveal structure are practical engineering solutions that improve robustness. However, the results also highlight that rule-based methods are fundamentally limited by image quality. There is only so much you can do with algorithmic cleverness when the input is severely degraded. This limitation would motivate later work on deep learning approaches that can learn robust feature representations from large datasets.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="technical-paradigm">Technical Paradigm</h3>
<p><strong>This is a pre-deep learning (2011) classical computer vision paper.</strong> The system uses rule-based methods and traditional OCR engines, not neural networks.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>InftyReader</strong>: A mathematical OCR engine used for the initial high-confidence character recognition pass. This is a pre-existing external tool.</li>
<li><strong>DEF-based OCR</strong>: A standard OCR engine based on Directional Element Features (DEF). These are manually engineered statistical features (histograms of edge directions), not learned neural network features.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a multi-step recognition pipeline:</p>
<ol>
<li><strong>Preprocessing</strong>: Binarization and smoothing</li>
<li><strong>Initial Character Removal</strong>: High-confidence characters are recognized by the InftyReader OCR engine and removed from the image to simplify segmentation</li>
<li><strong>Skeletonization</strong>: Thinning using <strong>Hilditch&rsquo;s algorithm</strong> to skeletonize graphical elements, revealing topological structure (crossings, bends, endpoints)</li>
<li><strong>Feature Point Detection</strong>:
<ul>
<li><strong>Crossing points</strong>: Direct detection on skeleton</li>
<li><strong>Bending points</strong>: Detected using the <strong>Hough transformation</strong></li>
</ul>
</li>
<li><strong>Dynamic Programming Search</strong>:
<ul>
<li><strong>Input</strong>: Set of line/curve segments $S$</li>
<li><strong>Procedure</strong>: Sort segments in 4 directions ($\perp, \setminus, \triangle, \rightarrow$). For each direction, use DP to find the grouping that minimizes a heuristic score</li>
<li><strong>Complexity</strong>: $O(n^2)$ where $n$ is the number of segments</li>
<li><strong>Scoring</strong>: Uses a function <code>Measure(S')</code> that returns a score (0-100) indicating if a subset of segments forms a valid character or bond</li>
</ul>
</li>
</ol>
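<p>Under the linear-order assumption, grouping reduces to partitioning a direction-sorted sequence into contiguous blocks, which the textbook $O(n^2)$ interval DP solves. A sketch with a stand-in cost function (the paper never defines <code>Measure(S')</code>, so the scoring here is purely illustrative):</p>

```python
def best_grouping(segments, cost):
    """Partition a direction-sorted list of segments into contiguous
    groups minimizing total cost; cost(group) stands in for the paper's
    unspecified Measure(S'). O(n^2) group evaluations."""
    n = len(segments)
    dp = [0.0] + [float("inf")] * n   # dp[i]: best cost of first i segments
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            c = dp[j] + cost(segments[j:i])
            if c < dp[i]:
                dp[i], back[i] = c, j
    groups, i = [], n
    while i > 0:                      # recover the optimal partition
        groups.append(segments[back[i]:i])
        i = back[i]
    groups.reverse()
    return dp[n], groups

# Toy cost that prefers two-segment groups (e.g. two strokes per character).
total, groups = best_grouping(list("abcd"), lambda g: 1 if len(g) == 2 else 5)
```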
<p>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined in the paper, limiting replicability.</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Dataset</strong>: Chemical structure images from Japanese patent applications published in 2008. The complete 2008 dataset contains 229,969 total images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Japanese Published Patent Applications (2008)</td>
          <td>1,599 images</td>
          <td>Contains 229,969 total images for the year. Format: TIFF, 200-400 dpi.</td>
      </tr>
      <tr>
          <td>Analysis</td>
          <td>Random subset for frequency analysis</td>
          <td>200 images</td>
          <td>Used to estimate frequency of touching/broken characters (found in ~20% of images).</td>
      </tr>
  </tbody>
</table>
<p><strong>No Training Set</strong>: The system is rule-based and uses pre-built OCR engines, so no model training was performed.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: Recognition ratio (percentage of correctly recognized objects)</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Line-touching Separation</td>
          <td>63.5%</td>
          <td>Success rate for separating text glued to lines</td>
      </tr>
      <tr>
          <td>Character Recognition</td>
          <td>85.86%</td>
          <td>For all character sizes</td>
      </tr>
      <tr>
          <td>Line segments</td>
          <td>90.73%</td>
          <td>Standard bond recognition</td>
      </tr>
      <tr>
          <td>Solid Wedge Recognition</td>
          <td>52.54%</td>
          <td>Low performance noted as area for improvement</td>
      </tr>
      <tr>
          <td>Hashed Wedges</td>
          <td>23.63%</td>
          <td>Poorest performing element type</td>
      </tr>
      <tr>
          <td>Overall</td>
          <td>86.58%</td>
          <td>Combined across all object types</td>
      </tr>
  </tbody>
</table>
<p><strong>Total Objects Evaluated</strong>: 742,287 objects (characters, line segments, solid wedges, hashed wedges) extracted from the patent images.</p>
<h3 id="hardware">Hardware</h3>
<p>Not reported. Computational cost was not a primary concern for this classical CV system.</p>
<h3 id="replicability">Replicability</h3>
<p><strong>Low.</strong> The paper does not provide sufficient detail for full replication:</p>
<ul>
<li>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined</li>
<li>Dependency on the proprietary/specialized InftyReader engine</li>
<li>No pseudocode provided for the segment decomposition heuristics</li>
</ul>
<h3 id="notes-on-wedge-recognition">Notes on Wedge Recognition</h3>
<p>The system&rsquo;s poor performance on solid wedges (52.54%) and hashed wedges (23.63%) reflects a fundamental challenge for classical thinning algorithms. Wedge bonds are dense triangular regions that indicate 3D stereochemistry. When skeletonized using algorithms like Hilditch&rsquo;s method, these &ldquo;blob&rdquo; shapes often distort into unrecognizable patterns, unlike the clean thin lines that represent regular bonds.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fujiyoshiRobustMethodSegmentation2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Robust {{Method}} of {{Segmentation}} and {{Recognition}} of {{Chemical Structure Images}} in {{ChemInfty}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fujiyoshi, Akio and Nakagawa, Koji and Suzuki, Masakazu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2011</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Pre-proceedings of the 9th IAPR international workshop on graphics recognition, GREC}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser: End-to-End Molecular Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</guid><description>MolParser converts molecular images from scientific documents to machine-readable formats using end-to-end learning with Extended SMILES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Wang, J., Cai, X., Chen, S., Yang, S., Tao, H., Wang, N., Yao, L., Zhang, L., &amp; Ke, G. (2025). MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild. In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em> (pp. 24528-24538). <a href="https://doi.org/10.48550/arXiv.2411.11098">https://doi.org/10.48550/arXiv.2411.11098</a></p>
<p><strong>Publication</strong>: ICCV 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/">MolParser-7M Dataset</a> - 7M+ image-text pairs for OCSR</li>
<li><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M on HuggingFace</a> - Dataset repository</li>
<li><a href="https://huggingface.co/UniParser/MolDet">MolDet YOLO Detector</a> - Object detection model for extracting molecular images from documents</li>
</ul>
<h2 id="contribution-end-to-end-ocsr-and-real-world-resources">Contribution: End-to-End OCSR and Real-World Resources</h2>
<p>This is primarily a <strong>Method</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a>), with a significant secondary contribution as a <strong>Resource</strong> paper.</p>
<p><strong>Method contribution ($\Psi_{\text{Method}}$)</strong>: The paper proposes a novel end-to-end architecture combining a Swin Transformer encoder with a BART decoder, and crucially introduces <strong>Extended SMILES (E-SMILES)</strong>, a new syntactic extension to standard SMILES notation that enables representation of Markush structures, abstract rings, and variable attachment points found in patents. The work validates this method through extensive ablation studies, achieving the highest accuracy among tested OCSR systems on WildMol-10k (76.9%).</p>
<p><strong>Resource contribution ($\Psi_{\text{Resource}}$)</strong>: The paper introduces <strong>MolParser-7M</strong>, the largest OCSR dataset to date (7.7M image-text pairs), and <strong>WildMol</strong>, a challenging benchmark of 20,000 manually annotated real-world molecular images. The construction of these datasets through an active learning data engine with human-in-the-loop validation represents significant infrastructure that enables future OCSR research.</p>
<h2 id="motivation-extracting-chemistry-from-real-world-documents">Motivation: Extracting Chemistry from Real-World Documents</h2>
<p>The motivation stems from a practical problem in chemical informatics: vast amounts of chemical knowledge remain embedded in unstructured formats. Patents, research papers, and legacy documents depict molecular structures as images. This creates a barrier for large-scale data analysis and prevents Large Language Models from effectively understanding scientific literature in chemistry and drug discovery.</p>
<p>Existing OCSR methods struggle with real-world documents for two fundamental reasons:</p>
<ol>
<li><strong>Representational limitations</strong>: Standard SMILES notation cannot capture complex structural templates like <strong>Markush structures</strong>, which are ubiquitous in patents. These structures define entire families of compounds using variable R-groups and abstract patterns, making them essential for intellectual property but impossible to represent with conventional methods.</li>
<li><strong>Data distribution mismatch</strong>: Real-world molecular images suffer from noise, inconsistent drawing styles, variable resolution, and interference from surrounding text. Models trained exclusively on clean, synthetically rendered molecules fail to generalize when applied to actual documents.</li>
</ol>
<h2 id="novelty-e-smiles-and-human-in-the-loop-data-engine">Novelty: E-SMILES and Human-in-the-Loop Data Engine</h2>
<p>The novelty lies in a comprehensive system that addresses both representation and data quality challenges through four integrated contributions:</p>
<ol>
<li>
<p><strong>Extended SMILES (E-SMILES)</strong>: A backward-compatible extension to the SMILES format that can represent complex structures previously inexpressible in standard chemical notations. E-SMILES uses a separator token <code>&lt;sep&gt;</code> to delineate the core molecular structure from supplementary annotations. These annotations employ XML-like tags to encode Markush structures, polymers, abstract rings, and other complex patterns. Critically, the core structure remains parseable by standard cheminformatics tools like RDKit, while the supplementary tags provide a structured, LLM-friendly format for capturing edge cases.</p>
</li>
<li>
<p><strong>MolParser-7M Dataset</strong>: The largest publicly available OCSR dataset, containing over 7 million image-text pairs. What distinguishes this dataset is both its scale and its composition. It includes 400,000 &ldquo;in-the-wild&rdquo; samples: molecular images extracted from actual patents and scientific papers, then curated by human annotators. This real-world data addresses the distribution mismatch problem directly by exposing the model to the same noise, artifacts, and stylistic variations it encounters in production.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Data Engine</strong>: A systematic approach to collecting and annotating real-world training data. The pipeline begins with an object detection model that extracts molecular images from over a million PDF documents. An active learning algorithm then identifies the most informative samples (those where the current model struggles) for human annotation. The model pre-annotates these images, and human experts review and correct them. This creates an iterative improvement cycle: annotate, train, identify new challenging cases, repeat.</p>
</li>
<li>
<p><strong>Efficient End-to-End Architecture</strong>: The model treats OCSR as an image captioning problem. A Swin-Transformer vision encoder extracts visual features, a simple MLP compresses them, and a BART decoder generates the E-SMILES string autoregressively. The model minimizes the standard negative log-likelihood of the target E-SMILES token sequence $y$ given the sequence history and input image $x$:</p>
</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x; \theta)
\end{aligned}
$$</p>
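<p>Concretely, this objective is just the sum of negative log-probabilities the decoder assigns to each correct token of the target E-SMILES sequence. A minimal Python sketch, using stand-in probabilities rather than real decoder outputs:</p>

```python
import math

def nll_loss(token_probs):
    """Negative log-likelihood of a target token sequence.

    token_probs[t] stands in for P(y_t | y_<t, x): the probability the
    decoder assigns to the *correct* token at step t.
    """
    return -sum(math.log(p) for p in token_probs)

# A confident decoder (probabilities near 1) incurs a small loss;
# an uncertain one incurs a large loss.
confident = nll_loss([0.9, 0.95, 0.99])
uncertain = nll_loss([0.5, 0.4, 0.3])
assert confident < uncertain
```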
<p>The training strategy employs curriculum learning, starting with simple molecules and gradually introducing complexity and heavier data augmentation.</p>
<h2 id="experimental-setup-two-stage-training-and-benchmarking">Experimental Setup: Two-Stage Training and Benchmarking</h2>
<p>The evaluation focused on demonstrating that MolParser generalizes to real-world documents:</p>
<ol>
<li>
<p><strong>Two-Stage Training Protocol</strong>: The model underwent a systematic training process:</p>
<ul>
<li><strong>Pre-training</strong>: Initial training on millions of synthetic molecular images using curriculum learning. The curriculum progresses from simple molecules to complex structures while gradually increasing data augmentation intensity (blur, noise, perspective transforms).</li>
<li><strong>Fine-tuning</strong>: Subsequent training on 400,000 curated real-world samples extracted from patents and papers. This fine-tuning phase is critical for adapting to the noise and stylistic variations of actual documents.</li>
</ul>
</li>
<li>
<p><strong>Benchmark Evaluation</strong>: The model was evaluated on multiple standard OCSR benchmarks to establish baseline performance on clean data. These benchmarks test recognition accuracy on well-formatted molecular diagrams.</p>
</li>
<li>
<p><strong>Real-World Document Analysis</strong>: The critical test involved applying MolParser to molecular structures extracted directly from scientific documents. This evaluation measures the gap between synthetic benchmark performance and real-world applicability (the core problem the paper addresses).</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Experiments isolating the contribution of each component:</p>
<ul>
<li>The impact of real-world training data versus synthetic-only training</li>
<li>The effectiveness of curriculum learning versus standard training</li>
<li>The value of the human-in-the-loop annotation pipeline versus random sampling</li>
<li>The necessity of E-SMILES extensions for capturing complex structures</li>
</ul>
</li>
</ol>
<h2 id="outcomes-and-empirical-findings">Outcomes and Empirical Findings</h2>
<ul>
<li>
<p><strong>Performance on Benchmarks</strong>: MolParser achieves competitive results on standard benchmarks and the best performance on real-world documents. On clean benchmarks like USPTO-10K, MolScribe (96.0%) slightly edges MolParser-Base (94.5%), but on WildMol-10k, MolParser-Base achieved 76.9% accuracy, significantly outperforming MolScribe (66.4%) and MolGrapher (45.5%). This gap validates the core hypothesis that training on actual document images is essential for practical deployment.</p>
</li>
<li>
<p><strong>Real-World Data is Critical</strong>: Models trained exclusively on synthetic data show substantial performance degradation when applied to real documents. The 400,000 in-the-wild training samples bridge this gap, demonstrating that data quality and distribution matching matter as much as model architecture. Ablation experiments showed that pretraining on MolParser-7M synthetic data alone achieved 51.9% accuracy on WildMol, while adding real-world fine-tuning raised this to 76.9%. Using the smaller MolGrapher-300k synthetic dataset without fine-tuning yielded only 22.4%.</p>
</li>
<li>
<p><strong>E-SMILES Enables Broader Coverage</strong>: The extended representation successfully captures molecular structures that were previously inexpressible, particularly Markush structures from patents. This expands the scope of what can be automatically extracted from chemical literature to include patent-style structural templates.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Scales Efficiently</strong>: The active learning pipeline reduces annotation time by approximately 90% while maintaining high quality. This approach makes it feasible to curate large-scale, high-quality datasets for specialized domains where expert knowledge is expensive.</p>
</li>
<li>
<p><strong>Speed and Accuracy</strong>: The end-to-end architecture achieves both high accuracy and fast inference, making it practical for large-scale document processing. MolParser-Base processes roughly 40 images per second on an RTX 4090D, while the Tiny variant achieves 131 FPS. The direct image-to-text approach avoids the error accumulation of multi-stage pipelines.</p>
</li>
<li>
<p><strong>Downstream Applications</strong>: The Swin Transformer encoder, once trained on MolParser-7M, serves as an effective molecular fingerprint for property prediction. Paired with a simple two-layer MLP on MoleculeNet benchmarks, MolParser-pretrained features achieved an average ROC-AUC of 73.7% across five tasks, compared to 68.9% for ImageNet-pretrained Swin-T features. The authors also demonstrate chemical reaction parsing by feeding MolDet detections and MolParser E-SMILES into GPT-4o.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge that molecular chirality is not yet fully exploited by the system. The E-SMILES format does not currently support dashed abstract rings, coordination bonds, special symbol Markush patterns, or replication of long structural segments. Additionally, scaling up the volume of real annotated training data could further improve performance.</p>
</li>
</ul>
<p>The work establishes that practical OCSR requires more than architectural innovations. It demands careful attention to data quality, representation design, and the distribution mismatch between synthetic training data and real-world applications. The combination of E-SMILES, the MolParser-7M dataset, and the human-in-the-loop data engine provides a template for building reliable vision systems in scientific domains where clean training data is scarce but expert knowledge is available.</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>7.7M image-SMILES pairs for OCSR pretraining and fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet</a></td>
          <td>Model</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>YOLO11-based molecule detector for PDF documents</td>
      </tr>
  </tbody>
</table>
<p>No official source code repository has been released. Model weights for MolParser itself are not publicly available as of the dataset release.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is split into a massive synthetic pre-training set and a curated fine-tuning set.</p>
<p><strong>Training Data Composition (MolParser-7M)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Name</th>
          <th>Size</th>
          <th>Composition / Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pre-training</strong></td>
          <td>MolParser-7M (Synthetic)</td>
          <td>~7.7M</td>
          <td><strong>Markush-3M</strong> (40%), <strong>ChEMBL-2M</strong> (27%), <strong>Polymer-1M</strong> (14%), PAH-600k (8%), BMS-360k (5%), MolGrapher-300K (4%), Pauling-100k (2%). Generated via RDKit/Indigo with randomized styles.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-SFT-400k</td>
          <td>400k</td>
          <td>Real images from patents/papers selected via active learning (confidence filtering 0.6-0.9) and manually annotated. 66% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-Gen-200k</td>
          <td>200k</td>
          <td>Subset of synthetic data kept to prevent catastrophic forgetting. 32% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Handwrite-5k</td>
          <td>5k</td>
          <td>Handwritten molecules from Img2Mol to support hand-drawn queries. 1% of fine-tuning mix.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Sources</strong>: 1.2M patents and scientific papers (PDF documents)</li>
<li><strong>Extraction</strong>: MolDet (YOLO11-based detector) identified ~20M molecular images, deduplicated to ~4M candidates</li>
<li><strong>Selection</strong>: Active learning ensemble (5-fold models) identified high-uncertainty samples for annotation</li>
<li><strong>Annotation</strong>: Human experts corrected model pre-annotations (90% time savings vs. from-scratch annotation)</li>
</ul>
<p><strong>Test Benchmarks</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-10k</td>
          <td>10,000</td>
          <td>Standard synthetic benchmark</td>
      </tr>
      <tr>
          <td>Maybridge UoB</td>
          <td>-</td>
          <td>Synthetic molecules</td>
      </tr>
      <tr>
          <td>CLEF-2012</td>
          <td>-</td>
          <td>Patent images</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>-</td>
          <td>Japanese patent office</td>
      </tr>
      <tr>
          <td>ColoredBG</td>
          <td>-</td>
          <td>Colored background molecules</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td>10,000</td>
          <td>Ordinary molecules cropped from real PDFs (new)</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k-M</strong></td>
          <td>10,000</td>
          <td>Markush structures (significantly harder, new)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Extended SMILES (E-SMILES) Encoding</strong>:</p>
<ul>
<li><strong>Format</strong>: <code>SMILES&lt;sep&gt;EXTENSION</code> where <code>&lt;sep&gt;</code> separates core structure from supplementary annotations</li>
<li><strong>Extensions use XML-like tags</strong>:
<ul>
<li><code>&lt;a&gt;index:group&lt;/a&gt;</code> for substituents/variable groups (Markush structures)</li>
<li><code>&lt;r&gt;</code> for groups connected at any ring position</li>
<li><code>&lt;c&gt;</code> for abstract rings</li>
<li><code>&lt;dum&gt;</code> for connection points</li>
</ul>
</li>
<li><strong>Backward compatible</strong>: Core SMILES parseable by RDKit; extensions provide structured format for edge cases</li>
</ul>
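<p>To make the format concrete, here is a minimal parsing sketch. The tag grammar (e.g. <code>&lt;a&gt;index:group&lt;/a&gt;</code>) follows the paper's description, but the exact serialization is an assumption; in practice the extracted core would then be handed to RDKit:</p>

```python
import re

def parse_esmiles(esmiles):
    """Split an E-SMILES string into its RDKit-parseable core and its
    supplementary annotations (a sketch; the exact tag grammar is assumed
    from the paper's examples, not from a released spec)."""
    core, _sep, ext = esmiles.partition("<sep>")
    # Markush substituents encoded as <a>atom_index:group</a>
    subs = {int(i): g for i, g in re.findall(r"<a>(\d+):([^<]+)</a>", ext)}
    return core, subs

# Hypothetical E-SMILES: a benzene core with an R1 group at atom index 6.
core, subs = parse_esmiles("c1ccccc1[*]<sep><a>6:R1</a>")
assert core == "c1ccccc1[*]"
assert subs == {6: "R1"}
```

A plain SMILES string with no <code>&lt;sep&gt;</code> passes through unchanged, which is the backward compatibility the authors emphasize.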
<p><strong>Curriculum Learning Strategy</strong>:</p>
<ul>
<li><strong>Phase 1</strong>: No augmentation, simple molecules (&lt;60 tokens)</li>
<li><strong>Phase 2</strong>: Gradually increase augmentation intensity and sequence length</li>
<li>Progressive complexity allows stable training on diverse molecular structures</li>
</ul>
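<p>The two-phase schedule above can be sketched as a function of training progress. The linear ramp shape, phase boundary, and 480-token cap are illustrative assumptions, not values reported in the paper:</p>

```python
def curriculum(step, total_steps, max_len=480):
    """Map training progress to (max token length, augmentation strength).

    Phase 1 trains on short sequences (<60 tokens) with no augmentation;
    in Phase 2 both the length cap and the augmentation intensity ramp up.
    """
    progress = step / total_steps
    if progress < 0.2:                 # Phase 1: easy samples only
        return 60, 0.0
    ramp = (progress - 0.2) / 0.8      # Phase 2: linear ramp to full difficulty
    return int(60 + ramp * (max_len - 60)), min(1.0, ramp)

assert curriculum(0, 100) == (60, 0.0)     # start: short molecules, no augmentation
assert curriculum(100, 100) == (480, 1.0)  # end: full length, full augmentation
```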
<p><strong>Active Learning Data Selection</strong>:</p>
<ol>
<li>Train 5 model folds on current dataset</li>
<li>Compute pairwise Tanimoto similarity of predictions on candidate images</li>
<li>Select samples with confidence scores <strong>0.6-0.9</strong> for human review (highest learning value)</li>
<li>Human experts correct model pre-annotations</li>
<li>Iteratively expand training set with hard samples</li>
</ol>
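<p>The selection step can be sketched as follows, scoring each candidate by the mean pairwise Tanimoto similarity of its ensemble predictions. The paper does not spell out the exact confidence definition, so representing predictions as fingerprint bit sets and averaging pairwise agreement are assumptions:</p>

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def select_for_annotation(ensemble_fps, low=0.6, high=0.9):
    """Keep candidates whose ensemble agreement falls in the 0.6-0.9 band:
    confident enough that pre-annotations are useful, uncertain enough to
    carry learning value for the next training round."""
    selected = []
    for img_id, fps in ensemble_fps.items():
        sims = [tanimoto(a, b) for a, b in combinations(fps, 2)]
        confidence = sum(sims) / len(sims)
        if low <= confidence <= high:
            selected.append(img_id)
    return selected

# Toy 3-fold ensemble: unanimous predictions are skipped, partial agreement kept.
selected = select_for_annotation({
    "img_easy": [{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 4}],  # agreement 1.0
    "img_hard": [{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 5}],  # agreement ~0.73
})
assert selected == ["img_hard"]
```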
<p><strong>Data Augmentations</strong>:</p>
<ul>
<li>RandomAffine (rotation, scale, translation)</li>
<li>JPEGCompress (compression artifacts)</li>
<li>InverseColor (color inversion)</li>
<li>SurroundingCharacters (text interference)</li>
<li>RandomCircle (circular artifacts)</li>
<li>ColorJitter (brightness, contrast variations)</li>
<li>Downscale (resolution reduction)</li>
<li>Bounds (boundary cropping variations)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard <strong>Image Captioning</strong> (Encoder-Decoder) paradigm.</p>
<p><strong>Architecture Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vision Encoder</strong></td>
          <td>Swin Transformer (ImageNet pretrained)</td>
      </tr>
      <tr>
          <td>- Tiny variant</td>
          <td>66M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Small variant</td>
          <td>108M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Base variant</td>
          <td>216M parameters, $384 \times 384$ input</td>
      </tr>
      <tr>
          <td><strong>Connector</strong></td>
          <td>2-layer MLP reducing channel dimension by half</td>
      </tr>
      <tr>
          <td><strong>Text Decoder</strong></td>
          <td>BART-Decoder (12 layers, 16 attention heads)</td>
      </tr>
  </tbody>
</table>
<p><strong>Training Configuration</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Pre-training</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Hardware</strong></td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
      </tr>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>AdamW</td>
          <td>AdamW</td>
      </tr>
      <tr>
          <td><strong>Learning Rate</strong></td>
          <td>$1 \times 10^{-4}$</td>
          <td>$5 \times 10^{-5}$</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong></td>
          <td>$1 \times 10^{-2}$</td>
          <td>$1 \times 10^{-2}$</td>
      </tr>
      <tr>
          <td><strong>Scheduler</strong></td>
          <td>Cosine with warmup</td>
          <td>Cosine with warmup</td>
      </tr>
      <tr>
          <td><strong>Epochs</strong></td>
          <td>20</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Label Smoothing</strong></td>
          <td>0.01</td>
          <td>0.005</td>
      </tr>
  </tbody>
</table>
<p><strong>Curriculum Learning Schedule</strong> (Pre-training):</p>
<ul>
<li>Starts with simple molecules (&lt;60 tokens, no augmentation)</li>
<li>Gradually adds complexity and augmentation (blur, noise, perspective transforms)</li>
<li>Enables stable learning across diverse molecular structures</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>: Exact match accuracy on predicted E-SMILES strings (molecule-level exact match)</p>
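<p>A minimal sketch of the metric. In the real evaluation the core SMILES would be canonicalized (e.g. via RDKit) before comparison so that chemically equivalent strings match; the trivial default below is only a stand-in for that step:</p>

```python
def exact_match_accuracy(preds, targets, canonicalize=lambda s: s.strip()):
    """Fraction of molecules whose predicted E-SMILES exactly matches the
    reference after canonicalization (here a trivial strip placeholder)."""
    hits = sum(canonicalize(p) == canonicalize(t) for p, t in zip(preds, targets))
    return hits / len(targets)

# Two of three predictions match after normalization.
acc = exact_match_accuracy(["CCO", "c1ccccc1 ", "CCN"],
                           ["CCO", "c1ccccc1", "CCC"])
assert abs(acc - 2 / 3) < 1e-9
```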
<p><strong>Key Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MolParser-Base</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td><strong>76.9%</strong></td>
          <td>66.4%</td>
          <td>45.5%</td>
          <td>Real-world patent/paper crops</td>
      </tr>
      <tr>
          <td><strong>USPTO-10k</strong></td>
          <td>94.5%</td>
          <td><strong>96.0%</strong></td>
          <td>93.3%</td>
          <td>Synthetic benchmark</td>
      </tr>
      <tr>
          <td><strong>Throughput (FPS)</strong></td>
          <td><strong>39.8</strong></td>
          <td>16.5</td>
          <td>2.2</td>
          <td>Measured on RTX 4090D</td>
      </tr>
  </tbody>
</table>
<p><strong>Additional Performance</strong>:</p>
<ul>
<li>MolParser-Tiny: 131 FPS on RTX 4090D (66M params)</li>
<li>Real-world vs. synthetic gap: Fine-tuning on MolParser-SFT-400k closed the performance gap between clean benchmarks and in-the-wild documents</li>
</ul>
<p><strong>Ablation Findings</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-world training data</td>
          <td>Fine-tuning on real data raised accuracy from 51.9% to 76.9% on WildMol-10k</td>
      </tr>
      <tr>
          <td>Curriculum learning</td>
          <td>Augmentation alone raised WildMol-10k from 40.1% to 69.5%; adding curriculum learning further raised it to 76.9%</td>
      </tr>
      <tr>
          <td>Active learning selection</td>
          <td>More effective than random sampling for annotation budget</td>
      </tr>
      <tr>
          <td>E-SMILES extensions</td>
          <td>Essential for Markush structure recognition (impossible with standard SMILES)</td>
      </tr>
      <tr>
          <td>Dataset scale</td>
          <td>Larger pre-training dataset (7M vs 300k) improved WildMol-10k accuracy from 22.4% to 51.9% before fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 8x NVIDIA RTX 4090D GPUs</li>
<li><strong>Inference</strong>: Single RTX 4090D sufficient for real-time processing</li>
<li><strong>Training time</strong>: 20 epochs pre-training + 4 epochs fine-tuning (specific duration not reported)</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2025molparser,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24528--24538}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2411.11098}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2411.11098}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DenoiseVAE: Adaptive Noise for Molecular Pre-training</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/denoise-vae/</link><pubDate>Sun, 24 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/denoise-vae/</guid><description>Liu et al.'s ICLR 2025 paper introducing DenoiseVAE, which learns adaptive, atom-specific noise distributions for better molecular force fields.</description><content:encoded><![CDATA[<h2 id="paper-contribution-type">Paper Contribution Type</h2>
<p>This is a <strong>method paper</strong> with a supporting theoretical component. It introduces a new pre-training framework, DenoiseVAE, that challenges the standard practice of using fixed, hand-crafted noise distributions in denoising-based molecular representation learning.</p>
<h2 id="motivation-the-inter--and-intra-molecular-variations-problem">Motivation: The Inter- and Intra-molecular Variations Problem</h2>
<p>The motivation is to create a more physically principled denoising pre-training task for 3D molecules. The core idea of denoising is to learn molecular force fields by corrupting an equilibrium conformation with noise and then learning to recover it. However, existing methods use a single, hand-crafted noise strategy (e.g., Gaussian noise of a fixed scale) for all atoms across all molecules. This is physically unrealistic for two main reasons:</p>
<ol>
<li><strong>Inter-molecular differences</strong>: Different molecules have unique Potential Energy Surfaces (PES), meaning the space of low-energy (i.e., physically plausible) conformations is highly molecule-specific.</li>
<li><strong>Intra-molecular differences (Anisotropy)</strong>: Within a single molecule, different atoms have different degrees of freedom. For instance, an atom in a rigid functional group can move much less than one connected by a single, rotatable bond.</li>
</ol>
<p>The authors argue that this &ldquo;one-size-fits-all&rdquo; noise approach leads to inaccurate force field learning because it samples many physically improbable conformations.</p>
<h2 id="novelty-a-learnable-atom-specific-noise-generator">Novelty: A Learnable, Atom-Specific Noise Generator</h2>
<p>The core novelty is a framework that learns to generate noise tailored to each specific molecule and atom. This is achieved through three key innovations:</p>
<ol>
<li><strong>Learnable Noise Generator</strong>: The authors introduce a Noise Generator module (a 4-layer Equivariant Graph Neural Network) that takes a molecule&rsquo;s equilibrium conformation $X$ as input and outputs a unique, atom-specific Gaussian noise distribution (i.e., a different variance $\sigma_i^2$ for each atom $i$). This directly addresses the issues of PES specificity and force field anisotropy.</li>
<li><strong>Variational Autoencoder (VAE) Framework</strong>: The Noise Generator (encoder) and a Denoising Module (a 7-layer EGNN decoder) are trained jointly within a VAE paradigm. The noisy conformation is sampled using the reparameterization trick:
$$
\begin{aligned}
\tilde{x}_i &amp;= x_i + \epsilon \sigma_i
\end{aligned}
$$</li>
<li><strong>Principled Optimization Objective</strong>: The training loss balances two competing goals:
$$
\begin{aligned}
\mathcal{L}_{DenoiseVAE} &amp;= \mathcal{L}_{Denoise} + \lambda \mathcal{L}_{KL}
\end{aligned}
$$
<ul>
<li>A denoising reconstruction loss ($\mathcal{L}_{Denoise}$) encourages the Noise Generator to produce physically plausible perturbations from which the original conformation can be recovered. This implicitly constrains the noise to respect the molecule&rsquo;s underlying force fields.</li>
<li>A KL divergence regularization term ($\mathcal{L}_{KL}$) pushes the generated noise distributions towards a predefined prior. This prevents the trivial solution of generating zero noise and encourages the model to explore a diverse set of low-energy conformations.</li>
</ul>
</li>
</ol>
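<p>The sampling step and the KL regularizer can be sketched in NumPy. The closed-form KL between the generated per-atom Gaussian $\mathcal{N}(0, \sigma_i^2 I_3)$ and the prior $\mathcal{N}(0, \sigma_p^2 I_3)$ shows why the trivial near-zero-noise solution is penalized; the EGNN Noise Generator is replaced here by given $\sigma$ values:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy_conformation(X, sigma):
    """Reparameterization trick: x~_i = x_i + eps * sigma_i, eps ~ N(0, I).

    X: (N, 3) equilibrium coordinates; sigma: (N,) atom-specific scales
    (in the paper, predicted by the Noise Generator)."""
    eps = rng.standard_normal(X.shape)
    return X + eps * sigma[:, None]

def kl_to_prior(sigma, sigma_prior=0.1):
    """KL( N(0, sigma_i^2 I_3) || N(0, sigma_p^2 I_3) ), summed over atoms.
    Zero when sigma matches the prior; grows as sigma collapses toward 0."""
    ratio = sigma**2 / sigma_prior**2
    return np.sum(3 * (np.log(sigma_prior / sigma) + 0.5 * ratio - 0.5))

sigma_match = np.full(4, 0.1)
assert abs(kl_to_prior(sigma_match)) < 1e-12   # matching the prior: KL = 0
assert kl_to_prior(np.full(4, 0.01)) > 0       # near-zero noise is penalized
```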
<p>The authors also provide a theoretical analysis showing that optimizing their objective is equivalent to maximizing the Evidence Lower Bound (ELBO) on the log-likelihood of observing physically realistic conformations.</p>
<h2 id="methodology--experimental-baselines">Methodology &amp; Experimental Baselines</h2>
<p>The model was pretrained on the PCQM4Mv2 dataset (approximately 3.4 million organic molecules) and then evaluated on a comprehensive suite of downstream tasks to test the quality of the learned representations:</p>
<ol>
<li><strong>Molecular Property Prediction (<a href="/notes/chemistry/datasets/qm9/">QM9</a>)</strong>: The model was evaluated on 12 quantum chemical property prediction tasks for small molecules (134k molecules; 100k train, 18k val, 13k test split). DenoiseVAE achieved state-of-the-art or second-best performance on 11 of the 12 tasks, with particularly significant gains on $C_v$ (heat capacity), indicating better capture of vibrational modes.</li>
<li><strong>Force Prediction (MD17)</strong>: The task was to predict atomic forces from molecular dynamics trajectories for 8 different small molecules (9,500 train, 500 val split). DenoiseVAE was the top performer on 5 of the 8 molecules (Aspirin, Benzene, Ethanol, Naphthalene, Toluene), though it underperformed Frad on Malonaldehyde, Salicylic Acid, and Uracil by significant margins.</li>
<li><strong>Ligand Binding Affinity (PDBBind v2019)</strong>: On the PDBBind dataset with 30% and 60% protein sequence identity splits, the model showed strong generalization, outperforming baselines like Uni-Mol particularly on the more stringent 30% split across RMSE, Pearson correlation, and Spearman correlation.</li>
<li><strong>PCQM4Mv2 Validation</strong>: DenoiseVAE achieved a validation MAE of 0.0777 on the PCQM4Mv2 HOMO-LUMO gap prediction task with only 1.44M parameters, competitive with models 10-40x larger (e.g., GPS++ at 44.3M params achieves 0.0778).</li>
<li><strong>Ablation Studies</strong>: The authors analyzed the sensitivity to key hyperparameters, namely the prior&rsquo;s standard deviation ($\sigma$) and the KL-divergence weight ($\lambda$), confirming that $\lambda=1$ and $\sigma=0.1$ are optimal. Removing the KL term leads to trivial solutions (near-zero noise). An additional ablation on the Noise Generator depth found 4 EGNN layers optimal over 2 layers. A comparison of independent (diagonal) versus non-independent (full covariance) noise sampling showed comparable results, suggesting the EGNN already captures inter-atomic dependencies implicitly.</li>
<li><strong>Case Studies</strong>: Visualizations of the learned noise variances for different molecules confirmed that the model learns chemically intuitive noise patterns. For example, it applies smaller perturbations to atoms in a rigid bicyclic norcamphor derivative and larger ones to atoms in flexible functional groups of a cyclopropane derivative. Even identical functional groups (e.g., hydroxyl) receive different noise scales in different molecular contexts.</li>
</ol>
<h2 id="key-findings-on-force-field-learning">Key Findings on Force Field Learning</h2>
<ul>
<li><strong>Primary Conclusion</strong>: Learning a <strong>molecule-adaptive and atom-specific</strong> noise distribution is a superior strategy for denoising-based pre-training compared to using fixed, hand-crafted heuristics. This more physically-grounded approach leads to representations that better capture molecular force fields.</li>
<li><strong>Strong Benchmark Performance</strong>: DenoiseVAE achieves best or second-best results on 11 of 12 QM9 tasks, 5 of 8 MD17 molecules, and leads on the stringent 30% LBA split. Performance is mixed on some MD17 molecules (Malonaldehyde, Salicylic Acid, Uracil), where it trails Frad.</li>
<li><strong>Effective Framework</strong>: The proposed VAE-based framework, which jointly trains a Noise Generator and a Denoising Module, is an effective and theoretically sound method for implementing this adaptive noise strategy. The interplay between the reconstruction loss and the KL-divergence regularization is key to its success.</li>
<li><strong>Limitation and Future Direction</strong>: The method is based on classical force field assumptions. The authors note that integrating more accurate force fields represents a promising direction for future work.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Serendipity-r/DenoiseVAE">Serendipity-r/DenoiseVAE</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<ul>
<li><strong>Source Code</strong>: The authors have released their code at <a href="https://github.com/Serendipity-r/DenoiseVAE">Serendipity-r/DenoiseVAE</a> on GitHub. No license is specified in the repository.</li>
<li><strong>Implementation</strong>: Hyperparameters and architectures are detailed in the paper&rsquo;s appendix (A.14), and the repository provides reference implementations.</li>
</ul>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pre-training Dataset</strong>: <a href="https://ogb.stanford.edu/docs/lsc/pcqm4mv2/">PCQM4Mv2</a> (approximately 3.4 million organic molecules)</li>
<li><strong>Property Prediction</strong>: <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html">QM9 dataset</a> (134k molecules; 100k train, 18k val, 13k test split) for 12 quantum chemical properties</li>
<li><strong>Force Prediction</strong>: <a href="http://www.sgdml.org/#datasets">MD17 dataset</a> (9,500 train, 500 val split) for 8 different small molecules</li>
<li><strong>Ligand Binding Affinity</strong>: PDBBind v2019 (4,463 protein-ligand complexes) with 30% and 60% sequence identity splits</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Noise Generator</strong>: 4-layer Equivariant Graph Neural Network (EGNN) that outputs atom-specific Gaussian noise distributions</li>
<li><strong>Denoising Module</strong>: 7-layer EGNN decoder</li>
<li><strong>Training Objective</strong>: $\mathcal{L}_{DenoiseVAE} = \mathcal{L}_{Denoise} + \lambda \mathcal{L}_{KL}$ with $\lambda=1$</li>
<li><strong>Noise Sampling</strong>: Reparameterization trick with $\tilde{x}_i = x_i + \epsilon \sigma_i$</li>
<li><strong>Prior Distribution</strong>: Standard deviation $\sigma=0.1$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Model Size</strong>: 1.44M parameters total</li>
<li><strong>Fine-tuning Protocol</strong>: Noise Generator discarded after pre-training; only the pre-trained Denoising Module (7-layer EGNN) is retained for downstream fine-tuning</li>
<li><strong>Optimizer</strong>: AdamW with cosine learning rate decay (max LR of 0.0005)</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>System Training</strong>: Fine-tuned end-to-end for specific tasks; force prediction involves computing the gradient of the predicted energy</li>
</ul>
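<p>The force-prediction protocol (forces as the negative gradient of a predicted energy) can be illustrated with a toy energy function; autodiff is replaced here by central finite differences, and the harmonic bond energy is purely illustrative, not the model's learned energy:</p>

```python
import numpy as np

def harmonic_energy(X, k=1.0, r0=1.0):
    """Toy diatomic bond energy: E = 1/2 * k * (|x1 - x0| - r0)^2."""
    r = np.linalg.norm(X[1] - X[0])
    return 0.5 * k * (r - r0) ** 2

def forces(energy_fn, X, h=1e-5):
    """F = -dE/dX via central finite differences; in the paper's setup the
    gradient of the predicted energy is taken by autodiff instead."""
    F = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += h
        Xm[idx] -= h
        F[idx] = -(energy_fn(Xp) - energy_fn(Xm)) / (2 * h)
    return F

X = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])  # bond stretched past r0 = 1.0
F = forces(harmonic_energy, X)
assert F[1, 0] < 0 and F[0, 0] > 0   # restoring forces pull the atoms together
assert np.allclose(F[0], -F[1], atol=1e-6)  # Newton's third law
```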
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Ablation Studies</strong>: Sensitivity analysis confirmed $\lambda=1$ and $\sigma=0.1$ as optimal hyperparameters; removing the KL term leads to trivial solutions (near-zero noise)</li>
<li><strong>Noise Generator Depth</strong>: 4 EGNN layers outperformed 2 layers across both QM9 and MD17 benchmarks</li>
<li><strong>Covariance Structure</strong>: Full covariance matrix (non-independent noise sampling) yielded comparable results to diagonal variance (independent sampling), likely because the EGNN already integrates neighboring atom information</li>
<li><strong>O(3) Invariance</strong>: The method satisfies O(3) probabilistic invariance, meaning the noise distribution is unchanged under rotations and reflections</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU Configuration</strong>: Experiments conducted on a single NVIDIA RTX 3090 GPU; six such GPUs (144GB total memory) are sufficient for full reproduction</li>
<li><strong>CPU</strong>: Intel Xeon Gold 5318Y @ 2.10GHz</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, Y., Chen, J., Jiao, R., Li, J., Huang, W., &amp; Su, B. (2025). DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training. <em>The Thirteenth International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2025denoisevae,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yurou Liu and Jiahao Chen and Rui Jiao and Jiangmeng Li and Wenbing Huang and Bing Su}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Thirteenth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=ym7pr83XQr}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://iclr.cc/virtual/2025/poster/27701">ICLR 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=ym7pr83XQr">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=ym7pr83XQr">PDF on OpenReview</a></li>
</ul>
]]></content:encoded></item><item><title>eSEN: Smooth Interatomic Potentials (ICML Spotlight)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/</guid><description>Fu et al. propose energy conservation as a key MLIP diagnostic and introduce eSEN, bridging test accuracy and real performance.</description><content:encoded><![CDATA[<h2 id="paper-overview">Paper Overview</h2>
<p>This is a <strong>method paper</strong>. It addresses a critical disconnect in the evaluation of Machine Learning Interatomic Potentials (MLIPs) and introduces a novel architecture, <strong>eSEN</strong>, designed based on insights from this analysis. The paper proposes a new standard for evaluating MLIPs beyond simple test-set errors.</p>
<h2 id="the-energy-conservation-gap-in-mlip-evaluation">The Energy Conservation Gap in MLIP Evaluation</h2>
<p>The paper tackles a well-known but under-addressed problem in the field: improvements in standard MLIP metrics (lower energy/force MAE on static test sets) do not reliably translate into better performance on complex downstream tasks such as molecular dynamics (MD) simulations, materials stability prediction, and phonon calculations. The authors seek to understand why this gap exists and how to design models that are both accurate on test sets and physically reliable in practical scientific workflows.</p>
<h2 id="the-esen-architecture-and-continuous-representation">The eSEN Architecture and Continuous Representation</h2>
<p>The novelty is twofold, spanning both a conceptual framework for evaluation and a new model architecture:</p>
<ol>
<li>
<p><strong>Energy Conservation as a Diagnostic Test</strong>: The core conceptual contribution is using an MLIP&rsquo;s ability to conserve energy in out-of-distribution MD simulations as a crucial diagnostic test. The authors demonstrate that for models passing this test, a strong correlation between test-set error and downstream task performance is restored.</p>
</li>
<li>
<p><strong>The eSEN Architecture</strong>: The paper introduces the <strong>equivariant Smooth Energy Network (eSEN)</strong>, designed with specific choices to ensure a smooth and well-behaved Potential Energy Surface (PES):</p>
<ul>
<li><strong>Strictly Conservative Forces</strong>: Forces are computed exclusively as the negative gradient of energy ($F = -\nabla E$), using conservative force prediction instead of faster direct-force prediction heads.</li>
<li><strong>Continuous Representations</strong>: Maintains strict equivariance and smoothness by using equivariant gated non-linearities instead of discretizing spherical harmonic representations during nodewise processing.</li>
<li><strong>Smooth PES Construction</strong>: Critical design choices include using distance cutoffs, polynomial envelope functions ensuring derivatives go to zero at cutoffs, and limited radial basis functions to avoid overly sensitive PES.</li>
</ul>
</li>
<li>
<p><strong>Efficient Training Strategy</strong>: A two-stage training regimen with fast pre-training using a non-conservative direct-force model, followed by fine-tuning to enforce energy conservation. This captures the efficiency of direct-force training while ensuring physical robustness.</p>
</li>
</ol>
<h2 id="evaluating-ood-energy-conservation-and-physical-properties">Evaluating OOD Energy Conservation and Physical Properties</h2>
<p>The paper presents a comprehensive experimental validation:</p>
<ol>
<li>
<p><strong>Ablation Studies on Energy Conservation</strong>: MD simulations on out-of-distribution systems (TM23 and MD22 datasets) systematically tested key design choices (direct-force vs. conservative, representation discretization, neighbor limits, envelope functions). This empirically demonstrated which choices lead to energy drift despite negligible impact on test-set MAE.</p>
</li>
<li>
<p><strong>Physical Property Prediction Benchmarks</strong>: The eSEN model was evaluated on challenging downstream tasks:</p>
<ul>
<li><strong>Matbench-Discovery</strong>: Materials stability and thermal conductivity prediction, where eSEN achieved the highest F1 score among compliant models and excelled at both metrics simultaneously.</li>
<li><strong>MDR Phonon Benchmark</strong>: Predicting phonon properties that test accurate second and third-order derivatives of the PES. eSEN achieved state-of-the-art results, particularly outperforming direct-force models.</li>
<li><strong>SPICE-MACE-OFF</strong>: Standard energy and force prediction on organic molecules, demonstrating that physical plausibility design choices enhanced raw accuracy.</li>
</ul>
</li>
<li>
<p><strong>Correlation Analysis</strong>: Explicit plots of test-set energy MAE versus performance on downstream benchmarks showed weak overall correlation that becomes strong and predictive when restricted to models passing the energy conservation test.</p>
</li>
</ol>
<h2 id="outcomes-and-conclusions">Outcomes and Conclusions</h2>
<ul>
<li>
<p><strong>Primary Conclusion</strong>: Energy conservation is a critical, practical property for MLIPs. Using it as a filter re-establishes test-set error as a reliable proxy for model development, dramatically accelerating the innovation cycle. Models that are not conservative, even with low test error, are unreliable for many critical scientific applications.</p>
</li>
<li>
<p><strong>Model Performance</strong>: The eSEN architecture outperforms base models across diverse tasks, from energy/force prediction to geometry optimization, phonon calculations, and thermal conductivity prediction.</p>
</li>
<li>
<p><strong>Actionable Design Principles</strong>: The paper provides experimentally-validated architectural choices that promote physical plausibility. Seemingly minor details, like how atomic neighbors are selected, can have profound impacts on a model&rsquo;s utility in simulations.</p>
</li>
<li>
<p><strong>Efficient Path to Robust Models</strong>: The direct-force pre-training plus conservative fine-tuning strategy offers a practical method for developing physically robust models without incurring the full computational cost of conservative training from scratch.</p>
</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/facebookresearch/fairchem">fairchem (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation within FAIR Chemistry framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/facebook/OMAT24">OMAT24 (Hugging Face)</a></td>
          <td>Model</td>
          <td>FAIR Acceptable Use Policy</td>
          <td>Pre-trained eSEN-30M-MP and eSEN-30M-OAM checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://openreview.net/forum?id=R0PBjxIbgm">OpenReview</a></td>
          <td>Paper</td>
          <td>CC BY 4.0</td>
          <td>ICML 2025 camera-ready paper</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The eSEN architecture builds on components from <strong>eSCN</strong> (Equivariant Spherical Channel Network) and <strong>Equiformer</strong>, combining them with design choices that prioritize smoothness and energy conservation. The implementation integrates into the standard <code>fairchem</code> Open Catalyst experimental framework.</p>
<h4 id="layer-structure">Layer Structure</h4>
<ul>
<li><strong>Edgewise Convolution</strong>: Uses <code>SO2</code> convolution layers (from eSCN) with an envelope function applied. Source and target embeddings are concatenated before convolution.</li>
<li><strong>Nodewise Feed-Forward</strong>: Two equivariant linear layers with an intermediate <strong>SiLU-based gated non-linearity</strong> (from Equiformer).</li>
<li><strong>Normalization</strong>: Equivariant Layer Normalization (from Equiformer).</li>
</ul>
<h4 id="smoothness-design-choices">Smoothness Design Choices</h4>
<p>Several architectural decisions distinguish eSEN from prior work:</p>
<ul>
<li><strong>No Grid Projection</strong>: eSEN performs operations directly in the spherical harmonic space to maintain equivariance and energy conservation, bypassing the projection of spherical harmonics to spatial grids for non-linearity.</li>
<li><strong>Distance Cutoff for Graph Construction</strong>: Uses a strict distance cutoff (6 Å for MPTrj models, 5 Å for SPICE models). Neighbor limits introduce discontinuities that break energy conservation.</li>
<li><strong>Polynomial Envelope Functions</strong>: Ensures derivatives go to zero smoothly at the cutoff radius.</li>
</ul>
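<p>The envelope idea is easy to make concrete (a DimeNet-style polynomial form shown here for illustration; the exact polynomial eSEN uses may differ):</p>

```python
import numpy as np

def poly_envelope(r, r_cut=6.0, p=5):
    """Smoothly damp a radial feature to zero at r_cut: the value and its
    low-order derivatives vanish at the cutoff, so nothing is chopped off
    discontinuously and no kink enters the PES."""
    x = np.asarray(r, dtype=float) / r_cut
    env = (1.0
           - (p + 1) * (p + 2) / 2.0 * x**p
           + p * (p + 2) * x**(p + 1)
           - p * (p + 1) / 2.0 * x**(p + 2))
    return np.where(x < 1.0, env, 0.0)
```

<p>At $r = 0$ the envelope is 1; at and beyond the cutoff it is exactly 0, approached with vanishing slope rather than a hard truncation.</p>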
<h3 id="algorithms">Algorithms</h3>
<h4 id="two-stage-training-esen-30m-mp">Two-Stage Training (eSEN-30M-MP)</h4>
<ol>
<li><strong>Direct-Force Pre-training</strong> (60 epochs): Uses <strong>DeNS</strong> (Denoising Non-equilibrium Structures) to reduce overfitting. This stage is fast because it does not require backpropagation through energy gradients.</li>
<li><strong>Conservative Fine-tuning</strong> (40 epochs): The direct-force head is removed, and forces are calculated via gradients ($F = -\nabla E$). This enforces energy conservation.</li>
</ol>
<p><strong>Important</strong>: DeNS is used exclusively during the direct-force pre-training stage, with a noising probability of 0.5, a standard deviation of 0.1 Å for the added Gaussian noise, and a DeNS loss coefficient of 10. The fine-tuning strategy reduces the wall-clock time for model training by 40%.</p>
<h4 id="optimization">Optimization</h4>
<ul>
<li><strong>Optimizer</strong>: AdamW with cosine learning rate scheduler</li>
<li><strong>Max Learning Rate</strong>: $4 \times 10^{-4}$</li>
<li><strong>Batch Size</strong>: 512 (for MPTrj models)</li>
<li><strong>Weight Decay</strong>: $1 \times 10^{-3}$</li>
<li><strong>Gradient Clipping</strong>: Norm of 100</li>
<li><strong>Warmup</strong>: 0.1 epochs with a factor of 0.2</li>
</ul>
<h4 id="loss-function">Loss Function</h4>
<p>A composite loss combining per-atom energy MAE, force $L_2$ loss, and stress MAE:</p>
<p>$$
\begin{aligned}
\mathcal{L} = \lambda_{\text{e}} \frac{1}{N} \sum_{i=1}^N \lvert E_{i} - \hat{E}_{i} \rvert + \lambda_{\text{f}} \frac{1}{3N} \sum_{i=1}^N \lVert \mathbf{F}_{i} - \hat{\mathbf{F}}_{i} \rVert_2^2 + \lambda_{\text{s}} \lVert \mathbf{S} - \hat{\mathbf{S}} \rVert_1
\end{aligned}
$$</p>
<p>For MPTrj-30M, the weighting coefficients are set to $\lambda_{\text{e}} = 20$, $\lambda_{\text{f}} = 20$, and $\lambda_{\text{s}} = 5$.</p>
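<p>In code, the composite loss is a direct translation (a sketch with hypothetical array shapes: per-structure energies, an $N \times 3$ force array, and a $3 \times 3$ stress tensor):</p>

```python
import numpy as np

def mlip_loss(E, E_hat, F, F_hat, S, S_hat,
              lam_e=20.0, lam_f=20.0, lam_s=5.0):
    """Energy MAE + per-component force MSE + stress L1,
    with the MPTrj-30M weights as defaults."""
    n_atoms = F.shape[0]
    loss_e = lam_e * np.mean(np.abs(E - E_hat))
    loss_f = lam_f * np.sum((F - F_hat) ** 2) / (3 * n_atoms)
    loss_s = lam_s * np.sum(np.abs(S - S_hat))
    return loss_e + loss_f + loss_s
```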
<h3 id="data">Data</h3>
<h4 id="training-data">Training Data</h4>
<ul>
<li><strong>Inorganic</strong>: MPTrj (Materials Project Trajectory) dataset</li>
<li><strong>Organic</strong>: SPICE-MACE-OFF dataset</li>
</ul>
<h4 id="test-data-construction">Test Data Construction</h4>
<ul>
<li><strong>MPTrj Testing</strong>: Since MPTrj lacks an official test split, the authors created a test set using 5,000 random samples from the <strong>subsampled Alexandria (sAlex)</strong> dataset to ensure fair comparison.</li>
<li><strong>Out-of-Distribution Conservation Testing</strong>:
<ul>
<li><em>Inorganic</em>: <strong>TM23</strong> dataset (transition metal defects). Simulation: 100 ps, 5 fs timestep.</li>
<li><em>Organic</em>: <strong>MD22</strong> dataset (large molecules). Simulation: 100 ps, 1 fs timestep.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Compute for training operations predominantly utilizes <strong>80GB NVIDIA A100 GPUs</strong>.</p>
<h4 id="inference-efficiency">Inference Efficiency</h4>
<p>For a periodic system of <strong>216 atoms</strong> on a single A100 (PyTorch 2.4.0, CUDA 12.1, no compile/torchscript), the 2-layer eSEN models achieve approximately <strong>0.4 million steps per day</strong> (3.2M parameters) and <strong>0.8 million steps per day</strong> (6.5M parameters), comparable to MACE-OFF-L at 0.7 million steps per day.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The paper evaluated eSEN across three major benchmark tasks. Key evaluation metrics included energy MAE (meV/atom), force MAE (meV/Å), stress MAE (meV/Å/atom), F1 score for stability prediction, $\kappa_{\text{SRME}}$ for thermal conductivity, and phonon frequency accuracy.</p>
<h4 id="ablation-test-set-mae-table-1">Ablation Test-Set MAE (Table 1)</h4>
<p>Design choices that dramatically affect energy conservation have negligible impact on static test-set MAE, which is precisely why test-set error alone is misleading. All models are 2-layer with 3.2M parameters, $L_{\text{max}} = 2$, $M_{\text{max}} = 2$:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Energy MAE</th>
          <th>Force MAE</th>
          <th>Stress MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>eSEN (default)</td>
          <td>17.02</td>
          <td>43.96</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, direct-force</td>
          <td>18.66</td>
          <td>43.62</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>eSEN, neighbor limit</td>
          <td>17.30</td>
          <td>44.11</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, no envelope</td>
          <td>17.60</td>
          <td>44.69</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, $N_{\text{basis}} = 512$</td>
          <td>19.87</td>
          <td>48.29</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>eSEN, Bessel</td>
          <td>17.65</td>
          <td>44.83</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=6</td>
          <td>17.05</td>
          <td>43.10</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=10</td>
          <td>17.11</td>
          <td>43.13</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=14</td>
          <td>17.12</td>
          <td>43.09</td>
          <td>0.14</td>
      </tr>
  </tbody>
</table>
<p>Energy MAE in meV/atom. Force MAE in meV/Å. Stress MAE in meV/Å/atom.</p>
<h4 id="matbench-discovery-tables-2-and-3">Matbench-Discovery (Tables 2 and 3)</h4>
<p><strong>Compliant models</strong> (trained only on MPTrj or its subset), unique prototype split:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>F1</th>
          <th>DAF</th>
          <th>$\kappa_{\text{SRME}}$</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-MP</strong></td>
          <td><strong>0.831</strong></td>
          <td><strong>5.260</strong></td>
          <td><strong>0.340</strong></td>
          <td><strong>0.0752</strong></td>
      </tr>
      <tr>
          <td>eqV2-S-DeNS</td>
          <td>0.815</td>
          <td>5.042</td>
          <td>1.676</td>
          <td>0.0757</td>
      </tr>
      <tr>
          <td>MatRIS-MP</td>
          <td>0.809</td>
          <td>5.049</td>
          <td>0.861</td>
          <td>0.0773</td>
      </tr>
      <tr>
          <td>AlphaNet-MP</td>
          <td>0.799</td>
          <td>4.863</td>
          <td>1.31</td>
          <td>0.1067</td>
      </tr>
      <tr>
          <td>DPA3-v2-MP</td>
          <td>0.786</td>
          <td>4.822</td>
          <td>0.959</td>
          <td>0.0823</td>
      </tr>
      <tr>
          <td>ORB v2 MPtrj</td>
          <td>0.765</td>
          <td>4.702</td>
          <td>1.725</td>
          <td>0.1007</td>
      </tr>
      <tr>
          <td>SevenNet-13i5</td>
          <td>0.760</td>
          <td>4.629</td>
          <td>0.550</td>
          <td>0.0847</td>
      </tr>
      <tr>
          <td>GRACE-2L-MPtrj</td>
          <td>0.691</td>
          <td>4.163</td>
          <td>0.525</td>
          <td>0.0897</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>0.669</td>
          <td>3.777</td>
          <td>0.647</td>
          <td>0.0915</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>0.613</td>
          <td>3.361</td>
          <td>1.717</td>
          <td>0.0949</td>
      </tr>
      <tr>
          <td>M3GNet</td>
          <td>0.569</td>
          <td>2.882</td>
          <td>1.412</td>
          <td>0.1117</td>
      </tr>
  </tbody>
</table>
<p>eSEN-30M-MP excels at both F1 and $\kappa_{\text{SRME}}$ simultaneously, while all previous models only achieve SOTA on one or the other.</p>
<p><strong>Non-compliant models</strong> (trained on additional datasets):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>F1</th>
          <th>$\kappa_{\text{SRME}}$</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-OAM</strong></td>
          <td><strong>0.925</strong></td>
          <td><strong>0.170</strong></td>
          <td><strong>0.0608</strong></td>
      </tr>
      <tr>
          <td>eqV2-M-OAM</td>
          <td>0.917</td>
          <td>1.771</td>
          <td>0.0691</td>
      </tr>
      <tr>
          <td>ORB v3</td>
          <td>0.905</td>
          <td>0.210</td>
          <td>0.0750</td>
      </tr>
      <tr>
          <td>SevenNet-MF-ompa</td>
          <td>0.901</td>
          <td>0.317</td>
          <td>0.0639</td>
      </tr>
      <tr>
          <td>DPA3-v2-OpenLAM</td>
          <td>0.890</td>
          <td>0.687</td>
          <td>0.0679</td>
      </tr>
      <tr>
          <td>GRACE-2L-OAM</td>
          <td>0.880</td>
          <td>0.294</td>
          <td>0.0666</td>
      </tr>
      <tr>
          <td>MatterSim-v1-5M</td>
          <td>0.862</td>
          <td>0.574</td>
          <td>0.0733</td>
      </tr>
      <tr>
          <td>MACE-MPA-0</td>
          <td>0.852</td>
          <td>0.412</td>
          <td>0.0731</td>
      </tr>
  </tbody>
</table>
<p>The eSEN-30M-OAM model is pre-trained on the OMat24 dataset, then fine-tuned on the subsampled Alexandria (sAlex) dataset and MPTrj dataset.</p>
<h4 id="mdr-phonon-benchmark-table-4">MDR Phonon Benchmark (Table 4)</h4>
<p>Metrics: maximum phonon frequency MAE($\omega_{\text{max}}$) in K, vibrational entropy MAE($S$) in J/K/mol, Helmholtz free energy MAE($F$) in kJ/mol, heat capacity MAE($C_V$) in J/K/mol.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>MAE($\omega_{\text{max}}$)</th>
          <th>MAE($S$)</th>
          <th>MAE($F$)</th>
          <th>MAE($C_V$)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-MP</strong></td>
          <td><strong>21</strong></td>
          <td><strong>13</strong></td>
          <td><strong>5</strong></td>
          <td><strong>4</strong></td>
      </tr>
      <tr>
          <td>SevenNet-13i5</td>
          <td>26</td>
          <td>28</td>
          <td>10</td>
          <td>5</td>
      </tr>
      <tr>
          <td>GRACE-2L (r6)</td>
          <td>40</td>
          <td>25</td>
          <td>9</td>
          <td>5</td>
      </tr>
      <tr>
          <td>SevenNet-0</td>
          <td>40</td>
          <td>48</td>
          <td>19</td>
          <td>9</td>
      </tr>
      <tr>
          <td>MACE</td>
          <td>61</td>
          <td>60</td>
          <td>24</td>
          <td>13</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>89</td>
          <td>114</td>
          <td>45</td>
          <td>21</td>
      </tr>
      <tr>
          <td>M3GNet</td>
          <td>98</td>
          <td>150</td>
          <td>56</td>
          <td>22</td>
      </tr>
  </tbody>
</table>
<p>Direct-force models show dramatically worse performance at the standard 0.01 Å displacement (e.g., eqV2-S-DeNS: 280/224/54/94) but improve at larger displacements (0.2 Å: 58/26/8/8), revealing that their PES is rough near energy minima.</p>
<h4 id="spice-mace-off-table-5">SPICE-MACE-OFF (Table 5)</h4>
<p>Test set MAE for organic molecule energy/force prediction. Energy MAE in meV/atom, force MAE in meV/Å:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MACE-4.7M (E/F)</th>
          <th>EscAIP-45M* (E/F)</th>
          <th>eSEN-3.2M (E/F)</th>
          <th>eSEN-6.5M (E/F)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>0.88 / 14.75</td>
          <td>0.53 / 5.86</td>
          <td>0.22 / 6.10</td>
          <td><strong>0.15</strong> / <strong>4.21</strong></td>
      </tr>
      <tr>
          <td>DES370K M.</td>
          <td>0.59 / 6.58</td>
          <td>0.41 / 3.48</td>
          <td>0.17 / 1.85</td>
          <td><strong>0.13</strong> / <strong>1.24</strong></td>
      </tr>
      <tr>
          <td>DES370K D.</td>
          <td>0.54 / 6.62</td>
          <td>0.38 / 2.18</td>
          <td>0.20 / 2.77</td>
          <td><strong>0.15</strong> / <strong>2.12</strong></td>
      </tr>
      <tr>
          <td>Dipeptides</td>
          <td>0.42 / 10.19</td>
          <td>0.31 / 5.21</td>
          <td>0.10 / 3.04</td>
          <td><strong>0.07</strong> / <strong>2.00</strong></td>
      </tr>
      <tr>
          <td>Sol. AA</td>
          <td>0.98 / 19.43</td>
          <td>0.61 / 11.52</td>
          <td>0.30 / 5.76</td>
          <td><strong>0.25</strong> / <strong>3.68</strong></td>
      </tr>
      <tr>
          <td>Water</td>
          <td>0.83 / 13.57</td>
          <td>0.72 / 10.31</td>
          <td>0.24 / 3.88</td>
          <td><strong>0.15</strong> / <strong>2.50</strong></td>
      </tr>
      <tr>
          <td>QMugs</td>
          <td>0.45 / 16.93</td>
          <td>0.41 / 8.74</td>
          <td>0.16 / 5.70</td>
          <td><strong>0.12</strong> / <strong>3.78</strong></td>
      </tr>
  </tbody>
</table>
<p>*EscAIP-45M is a direct-force model. eSEN-6.5M outperforms MACE-OFF-L and EscAIP on all test splits. The smaller eSEN-3.2M has inference efficiency comparable to MACE-4.7M while achieving lower MAE.</p>
<hr>
<h2 id="why-these-design-choices-matter">Why These Design Choices Matter</h2>
<h3 id="bounded-energy-derivatives-and-the-verlet-integrator">Bounded Energy Derivatives and the Verlet Integrator</h3>
<p>The theoretical foundation for why smoothness matters comes from Theorem 5.1 of Hairer et al. (2003). For the Verlet integrator (the standard NVE integrator), the total energy drift satisfies:</p>
<p>$$
|E(\mathbf{r}_T, \mathbf{a}) - E(\mathbf{r}_0, \mathbf{a})| \leq C \Delta t^2 + C_N \Delta t^N T
$$</p>
<p>where $T$ is the total simulation time ($T \leq \Delta t^{-N}$), $N$ is the highest order for which the $N$th derivative of $E$ is continuously differentiable with bounded derivative, and $C$, $C_N$ are constants independent of $T$ and $\Delta t$. The first term is a time-independent fluctuation of $O(\Delta t^2)$; the second term governs long-term conservation. This means the PES must be continuously differentiable to high order, with bounded derivatives, for energy conservation in long-time simulations.</p>
<h3 id="architectural-choices-that-break-conservation">Architectural Choices That Break Conservation</h3>
<p>The authors provide theoretical justification for why specific architectural choices break energy conservation:</p>
<ul>
<li><strong>Max Neighbor Limit (KNN)</strong>: Introduces discontinuity in the PES. If a neighbor at distance $r$ moves to $r + \epsilon$ and drops out of the top-$K$, the energy changes discontinuously.</li>
<li><strong>Grid Discretization</strong>: Projecting spherical harmonics to a spatial grid introduces discretization errors in energy gradients that break conservation. This can be mitigated with higher-resolution grids but not eliminated.</li>
<li><strong>Direct-Force Prediction</strong>: Imposes no mathematical constraint that forces must be the gradient of an energy scalar field. In other words, $\nabla \times \mathbf{F} \neq 0$ is permitted, violating the requirement for a conservative force field.</li>
</ul>
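<p>The neighbor-limit discontinuity is concrete enough to demonstrate in a few lines (a toy 1-D sketch of my own, not the paper&rsquo;s code; the pair term depends on neighbor species, as learned messages do):</p>

```python
import numpy as np

def knn_energy(x2, k=1):
    """Energy contribution of atom 0 from its k nearest neighbors.
    Atom 1 (species weight 1.0) is fixed at x = 1.0; atom 2 (weight 2.0) at x2."""
    neighbors = [(1.0, 1.0), (abs(x2), 2.0)]   # (distance, species weight)
    neighbors.sort(key=lambda t: t[0])
    return sum(z / d for d, z in neighbors[:k])

# Moving atom 2 infinitesimally across d = 1.0 swaps which neighbor survives
# the top-k cut, so the energy jumps by ~1: a finite discontinuity from an
# infinitesimal displacement. A distance cutoff with a smooth envelope instead
# lets each contribution fade to zero continuously.
jump = abs(knn_energy(0.9999) - knn_energy(1.0001))
```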
<h3 id="displacement-sensitivity-in-phonon-calculations">Displacement Sensitivity in Phonon Calculations</h3>
<p>An important empirical finding concerns how displacement values affect phonon predictions. Conservative models (eSEN, MACE) show convergent phonon band structures as displacement decreases toward zero. In contrast, direct-force models (eqV2-S-DeNS) fail to converge, exhibiting missing acoustic branches and spurious imaginary frequencies at small displacements. While direct-force models achieve competitive thermodynamic property accuracy at large displacements (0.2 Å), this is deceptive: the underlying phonon band structures remain inaccurate, and the apparent accuracy comes from Boltzmann-weighted integrals smoothing over errors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fu, X., Wood, B. M., Barroso-Luque, L., Levine, D. S., Gao, M., Dzamba, M., &amp; Zitnick, C. L. (2025). Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, PMLR 267:17875–17893.</p>
<p><strong>Publication</strong>: ICML 2025 (Spotlight)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fu2025learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fu, Xiang and Wood, Brandon M. and Barroso-Luque, Luis and Levine, Daniel S. and Gao, Meng and Dzamba, Misko and Zitnick, C. Lawrence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17875--17893}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45302">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=R0PBjxIbgm">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=R0PBjxIbgm">PDF on OpenReview</a></li>
<li><a href="https://huggingface.co/facebook/OMAT24">OMAT24 model on Hugging Face</a></li>
<li><a href="https://github.com/facebookresearch/fairchem">Code on GitHub (fairchem)</a></li>
</ul>
]]></content:encoded></item><item><title>Efficient DFT Hamiltonian Prediction via Adaptive Sparsity</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/efficient-dft-hamiltonian-predicton-sphnet/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/efficient-dft-hamiltonian-predicton-sphnet/</guid><description>Luo et al. introduce SPHNet, using adaptive sparsity to achieve up to 7x speedup in SE(3)-equivariant Hamiltonian prediction.</description><content:encoded><![CDATA[<h2 id="core-innovation-adaptive-sparsity-in-se3-networks">Core Innovation: Adaptive Sparsity in SE(3) Networks</h2>
<p>This is a <strong>methodological paper</strong> introducing a novel architecture and training curriculum to solve efficiency bottlenecks in Geometric Deep Learning. It directly tackles the primary computational bottleneck in modern SE(3)-equivariant graph neural networks (the tensor product operation) and proposes a generalizable solution through adaptive network sparsification.</p>
<h2 id="the-computational-bottleneck-in-dft-hamiltonian-prediction">The Computational Bottleneck in DFT Hamiltonian Prediction</h2>
<p>SE(3)-equivariant networks are accurate but unscalable for DFT Hamiltonian prediction due to two key bottlenecks:</p>
<ul>
<li><strong>Atom Scaling</strong>: Tensor Product (TP) operations grow quadratically with the number of atoms ($N^2$).</li>
<li><strong>Basis Set Scaling</strong>: Computational complexity grows with the sixth power of the angular momentum order ($L^6$). Larger basis sets (e.g., def2-TZVP) require higher orders ($L=6$), making them prohibitively slow.</li>
</ul>
<p>Existing SE(3)-equivariant models cannot handle large molecules (40-100 atoms) with high-quality basis sets, limiting their practical applicability in computational chemistry.</p>
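<p>To get a feel for these scalings, here is a back-of-the-envelope estimate (my illustration, not the paper&rsquo;s accounting) of the relative tensor-product cost when moving from a QH9-scale molecule at def2-SVP ($L=4$) to a PubChemQH-scale molecule at def2-TZVP ($L=6$):</p>

```python
# Rough cost model: TP cost ~ N^2 * L^6 (atom-pair scaling times
# angular-momentum scaling). Purely illustrative; constants omitted.

def tp_cost(n_atoms: int, l_max: int) -> float:
    """Relative tensor-product cost for N atoms at angular order L."""
    return n_atoms**2 * l_max**6

small = tp_cost(20, 4)   # QH9-scale molecule, def2-SVP
large = tp_cost(100, 6)  # PubChemQH-scale molecule, def2-TZVP

print(f"relative cost: {large / small:.0f}x")  # ~285x
```

The two factors compound: a 5x larger molecule contributes 25x from the $N^2$ term, and the jump from $L=4$ to $L=6$ contributes another ~11x from the $L^6$ term.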
<h2 id="sphnet-architecture-and-the-three-phase-sparsity-scheduler">SPHNet Architecture and the Three-Phase Sparsity Scheduler</h2>
<p><strong>SPHNet</strong> introduces <strong>Adaptive Sparsity</strong> via two gates that prune redundant computations, together with a training curriculum that optimizes them:</p>
<ol>
<li><strong>Sparse Pair Gate</strong>: Learns which atom pairs to include in message passing, adapting the interaction graph based on importance.</li>
<li><strong>Sparse TP Gate</strong>: Filters which spherical harmonic triplets $(l_1, l_2, l_3)$ are computed in tensor product operations, pruning higher-order combinations that contribute less to accuracy.</li>
<li><strong>Three-Phase Sparsity Scheduler</strong>: A training curriculum (Random → Adaptive → Fixed) that enables stable convergence to high-performing sparse subnetworks.</li>
</ol>
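<p>The gating idea behind the Sparse Pair Gate can be sketched in a few lines of numpy. Everything here (shapes, the random vector standing in for the learned linear layer $F_p$) is illustrative, not the paper&rsquo;s implementation:</p>

```python
import numpy as np

# Sketch of the Sparse Pair Gate idea: score every atom pair with a learned
# sigmoid weight, then keep only the top (1 - k) fraction of pairs.
rng = np.random.default_rng(0)

n_atoms, feat_dim, sparsity = 8, 16, 0.5
pair_feats = rng.normal(size=(n_atoms * n_atoms, feat_dim))
w_gate = rng.normal(size=feat_dim)      # stands in for the linear layer F_p

scores = 1.0 / (1.0 + np.exp(-pair_feats @ w_gate))  # sigmoid gate weights
n_keep = int((1.0 - sparsity) * len(scores))
keep_idx = np.argsort(scores)[-n_keep:]              # top (1 - k) pairs

mask = np.zeros(len(scores), dtype=bool)
mask[keep_idx] = True
print(f"kept {mask.sum()} of {len(mask)} pairs")
```

Message passing then runs only over the pairs selected by the mask, which is where the memory and speed savings come from.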
<p>Key insight: The Sparse Pair Gate learns to preserve long-range interactions (16-25 Angstrom) at higher rates than short-range ones. Short-range pairs are abundant and easier to learn, while rare long-range interactions require more samples for accurate representation, making them more critical to retain.</p>
<h2 id="benchmarks-and-ablation-studies">Benchmarks and Ablation Studies</h2>
<p>The authors evaluated SPHNet on three datasets (MD17, QH9, and PubChemQH) with varying molecule sizes and basis set complexities. Baselines include SchNOrb, PhiSNet, QHNet, and WANet. SchNOrb and PhiSNet results are limited to MD17, as those models are designed for trajectory datasets. WANet was not open-sourced, so only partial metrics from its paper are reported.</p>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<ul>
<li><strong>Hamiltonian MAE ($H$)</strong>: Mean absolute error between predicted and DFT-computed Hamiltonian matrices, in Hartrees ($E_h$)</li>
<li><strong>Occupied Orbital Energy MAE ($\epsilon$)</strong>: Mean absolute error of all occupied molecular orbital energies derived from the predicted Hamiltonian</li>
<li><strong>Orbital Coefficient Similarity ($\psi$)</strong>: Cosine similarity of occupied molecular orbital coefficients between predicted and reference wavefunctions</li>
</ul>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Sparse Gates</strong> (on PubChemQH):</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Both gates</td>
          <td>97.31</td>
          <td>5.62</td>
          <td>7.09x</td>
      </tr>
      <tr>
          <td>Pair Gate only</td>
          <td>87.70</td>
          <td>6.98</td>
          <td>2.73x</td>
      </tr>
      <tr>
          <td>TP Gate only</td>
          <td>94.31</td>
          <td>8.04</td>
          <td>3.98x</td>
      </tr>
      <tr>
          <td>Neither gate</td>
          <td>86.35</td>
          <td>10.91</td>
          <td>1.73x</td>
      </tr>
  </tbody>
</table>
<p>The Sparse Pair Gate contributes a 78% speedup with 30% memory reduction. The Sparse TP Gate (pruning 70% of combinations) yields a 160% speedup. Both gates together achieve the highest speedup, though accuracy slightly decreases compared to no gating.</p>
<p><strong>Three-Phase Scheduler</strong>: Removing the random phase causes convergence to local optima ($112.68 \pm 10.75$ vs $97.31 \pm 0.52$). Removing the adaptive phase increases variance and lowers accuracy ($122.79 \pm 19.02$). Removing the fixed phase has minimal accuracy impact but reduces speedup from 7.09x to 5.45x due to dynamic graph overhead.</p>
<p><strong>Sparsity Rate</strong>: The critical sparsity threshold scales with system complexity: 30% for MD17 (small molecules), 40% for QH9 (medium), and 70% for PubChemQH (large). Beyond the threshold, MAE increases sharply. Computational cost decreases approximately linearly with sparsity rate.</p>
<h3 id="transferability-to-other-models">Transferability to Other Models</h3>
<p>To demonstrate the speedup is architecture-agnostic, the authors applied the Sparse Pair Gate and Sparse TP Gate to the QHNet baseline on PubChemQH:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QHNet baseline</td>
          <td>123.74</td>
          <td>22.50</td>
          <td>1.00x</td>
      </tr>
      <tr>
          <td>+ TP Gate</td>
          <td>128.16</td>
          <td>12.68</td>
          <td>2.04x</td>
      </tr>
      <tr>
          <td>+ Pair Gate</td>
          <td>126.27</td>
          <td>10.07</td>
          <td>1.66x</td>
      </tr>
      <tr>
          <td>+ Both gates</td>
          <td>128.89</td>
          <td>8.46</td>
          <td>3.30x</td>
      </tr>
  </tbody>
</table>
<p>The gates reduced QHNet&rsquo;s memory by 62% and improved speed by 3.3x with modest accuracy trade-off, confirming the gates are portable modules applicable to other SE(3)-equivariant architectures.</p>
<h2 id="performance-results">Performance Results</h2>
<h3 id="qh9-134k-molecules-leq-20-atoms">QH9 (134k molecules, $\leq$ 20 atoms)</h3>
<p>SPHNet achieves 3.3x to 4.0x speedup over QHNet across all four QH9 splits, with improved Hamiltonian MAE and orbital energy MAE. Memory drops to 0.23 GB/sample (33% of QHNet&rsquo;s 0.70 GB). On the stable-iid split, Hamiltonian MAE improves from 76.31 to 45.48 ($10^{-6} E_h$).</p>
<h3 id="pubchemqh-50k-molecules-40-100-atoms">PubChemQH (50k molecules, 40-100 atoms)</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>$\epsilon$ [$E_h$] $\downarrow$</th>
          <th>$\psi$ [$10^{-2}$] $\uparrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QHNet</td>
          <td>123.74</td>
          <td>3.33</td>
          <td>2.32</td>
          <td>22.5</td>
          <td>1.0x</td>
      </tr>
      <tr>
          <td>WANet</td>
          <td>99.98</td>
          <td><strong>1.17</strong></td>
          <td><strong>3.13</strong></td>
          <td>15.0</td>
          <td>2.4x</td>
      </tr>
      <tr>
          <td>SPHNet</td>
          <td><strong>97.31</strong></td>
          <td>2.16</td>
          <td>2.97</td>
          <td><strong>5.62</strong></td>
          <td><strong>7.1x</strong></td>
      </tr>
  </tbody>
</table>
<p>SPHNet achieves the best Hamiltonian MAE and efficiency, though WANet outperforms on orbital energy MAE and coefficient similarity. The higher speedup on PubChemQH (vs QH9) reflects greater computational redundancy in larger systems with higher-order basis sets ($L_{max} = 6$ for def2-TZVP vs $L_{max} = 4$ for def2-SVP).</p>
<h3 id="md17-small-molecule-trajectories">MD17 (Small Molecule Trajectories)</h3>
<p>SPHNet achieves accuracy comparable to QHNet and PhiSNet on four MD17 molecules (water, ethanol, malondialdehyde, uracil; 3-12 atoms). MD17 represents a simpler task where baseline models already perform well, leaving limited room for improvement. For water (3 atoms), the number of interaction combinations is inherently small, limiting the benefit of adaptive sparsification.</p>
<h3 id="scaling-limit">Scaling Limit</h3>
<p>SPHNet can train on systems with approximately 3000 atomic orbitals on a single A6000 GPU; the QHNet baseline runs out of memory at approximately 1800 orbitals. SPHNet&rsquo;s memory consumption also grows more slowly than the baseline&rsquo;s as molecule size increases.</p>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li><strong>Adaptive sparsity scales with system complexity</strong>: The method is most effective for large systems where redundancy is high. For small molecules (e.g., water with only 3 atoms), every interaction is critical, so pruning hurts accuracy and yields negligible speedup.</li>
<li><strong>Long-range pair preservation</strong>: The Sparse Pair Gate selects long-range pairs (16-25 Angstrom) at higher rates than short-range ones. Short-range pairs are numerous and easier to learn, while rare long-range interactions are harder to represent and thus more critical to retain.</li>
<li><strong>Generalizable components</strong>: The sparsification techniques are portable modules, demonstrated by successful integration into QHNet with 3.3x speedup.</li>
<li><strong>Architecture ablation</strong>: Removing one Vectorial Node Interaction block or Spherical Node Interaction block significantly hurts accuracy, confirming the importance of the progressive order-increase design. Removing one Pair Construction block has less impact, suggesting room for further speedup.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/microsoft/SPHNet">SPHNet (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation; archived by Microsoft (Dec 2025), read-only</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/EperLuo/PubChemQH">PubChemQH (Hugging Face)</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>50k molecules, 40-100 atoms, def2-TZVP basis</td>
      </tr>
  </tbody>
</table>
<p>No pre-trained model weights are provided. MD17 and QH9 are publicly available community datasets. Training requires 4x NVIDIA A100 (80GB) GPUs; benchmarking uses a single NVIDIA RTX A6000 (46GB).</p>
<h3 id="data">Data</h3>
<p>The experiments evaluated SPHNet on three datasets with different molecular sizes and basis set complexities. All datasets use DFT calculations as ground truth, with MD17 using the PBE exchange-correlation functional and QH9/PubChemQH using B3LYP.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Molecule Size</th>
          <th>Basis Set</th>
          <th>$L_{max}$</th>
          <th>Functional</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MD17</td>
          <td>4 systems</td>
          <td>3-12 atoms (water, ethanol, malondialdehyde, uracil)</td>
          <td>def2-SVP</td>
          <td>4</td>
          <td>PBE</td>
      </tr>
      <tr>
          <td>QH9</td>
          <td>134k</td>
          <td>$\leq$ 20 atoms (Stable/Dynamic splits)</td>
          <td>def2-SVP</td>
          <td>4</td>
          <td>B3LYP</td>
      </tr>
      <tr>
          <td>PubChemQH</td>
          <td>50k</td>
          <td>40-100 atoms</td>
          <td>def2-TZVP</td>
          <td>6</td>
          <td>B3LYP</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Availability</strong>:</p>
<ul>
<li><strong>MD17 &amp; QH9</strong>: Publicly available</li>
<li><strong>PubChemQH</strong>: Publicly available on Hugging Face (<a href="https://huggingface.co/datasets/EperLuo/PubChemQH">EperLuo/PubChemQH</a>)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Loss Function</strong>:</p>
<p>The model learns the <strong>residual</strong> $\Delta H$:</p>
<p>$$
\begin{aligned}
\Delta H &amp;= H_{\text{ref}} - H_{\text{init}} \\
\mathcal{L} &amp;= \text{MAE}(H_{\text{ref}}, H_{\text{pred}}) + \text{MSE}(H_{\text{ref}}, H_{\text{pred}})
\end{aligned}
$$</p>
<p>where $H_{\text{init}}$ is a computationally inexpensive initial guess computed via PySCF, so the full prediction is reconstructed as $H_{\text{pred}} = H_{\text{init}} + \Delta H_{\text{pred}}$.</p>
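<p>A minimal numpy sketch of this residual setup and the combined MAE + MSE loss, under my reading of the equations above (toy matrices, not the authors&rsquo; code):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6  # toy orbital count

H_init = rng.normal(size=(n, n))                 # cheap initial guess (PySCF in the paper)
H_ref = H_init + 0.1 * rng.normal(size=(n, n))   # DFT reference Hamiltonian
delta_pred = rng.normal(scale=0.1, size=(n, n))  # network output: the residual

H_pred = H_init + delta_pred                     # reconstruct the full Hamiltonian
loss = np.abs(H_ref - H_pred).mean() + ((H_ref - H_pred) ** 2).mean()
print(f"loss = {loss:.4f}")
```

Learning the residual rather than the full matrix means the network only has to model the (much smaller) correction on top of the cheap initial guess.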
<p><strong>Hyperparameters</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>PubChemQH</th>
          <th>QH9</th>
          <th>MD17</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Batch Size</td>
          <td>8</td>
          <td>32</td>
          <td>10 (uracil: 5)</td>
      </tr>
      <tr>
          <td>Training Steps</td>
          <td>300k</td>
          <td>260k</td>
          <td>200k</td>
      </tr>
      <tr>
          <td>Warmup Steps</td>
          <td>1k</td>
          <td>1k</td>
          <td>1k</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>1e-3</td>
          <td>1e-3</td>
          <td>5e-4</td>
      </tr>
      <tr>
          <td>Sparsity Rate</td>
          <td>0.7</td>
          <td>0.4</td>
          <td>0.1-0.3</td>
      </tr>
      <tr>
          <td>TSS Epoch $t$</td>
          <td>3</td>
          <td>3</td>
          <td>3</td>
      </tr>
  </tbody>
</table>
<p><strong>Sparse Pair Gate</strong>: Adapts the interaction graph. It concatenates zero-order features and inner products of atom pairs, then passes them through a linear layer $F_p$ with sigmoid activation to learn a weight $W_p^{ij}$ for every pair. Pairs are kept only if selected by the scheduler ($U_p^{TSS}$). The overhead comes primarily from the linear layer $F_p$.</p>
<p><strong>Sparse TP Gate</strong>: Filters triplets $(l_1, l_2, l_3)$ inside the TP operation. Higher-order combinations are more likely to be pruned. Complexity: $\mathcal{O}(L^3)$.</p>
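<p>The candidate triplets are constrained by the angular-momentum selection rule $|l_1 - l_2| \le l_3 \le l_1 + l_2$. A quick count (my own illustration) shows how the candidate set grows with $L_{max}$, consistent with the cubic complexity noted above:</p>

```python
def count_triplets(l_max: int) -> int:
    """Number of valid (l1, l2, l3) tensor-product paths up to l_max,
    subject to the triangle inequality |l1 - l2| <= l3 <= l1 + l2."""
    return sum(
        1
        for l1 in range(l_max + 1)
        for l2 in range(l_max + 1)
        for l3 in range(abs(l1 - l2), min(l1 + l2, l_max) + 1)
    )

print(count_triplets(4))  # 65 paths at the def2-SVP order
print(count_triplets(6))  # 175 paths at the def2-TZVP order
```

At the PubChemQH sparsity rate of 70%, only about 52 of those 175 paths would survive pruning.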
<p><strong>Three-Phase Sparsity Scheduler</strong>: Training curriculum designed to optimize the sparse gates effectively:</p>
<ul>
<li><strong>Phase 1 (Random)</strong>: Random selection ($1-k$ probability) to ensure unbiased weight updates. Complexity: $\mathcal{O}(|U|)$.</li>
<li><strong>Phase 2 (Adaptive)</strong>: Selects the top $(1-k)$ fraction of units by learned magnitude. Complexity: $\mathcal{O}(|U|\log|U|)$.</li>
<li><strong>Phase 3 (Fixed)</strong>: Freezes the connectivity mask for maximum inference speed. No overhead.</li>
</ul>
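<p>The three selection rules above can be sketched as one function of the training phase, applied to a set of candidate units $U$ (pairs or TP triplets) with learned magnitudes $w$ and sparsity rate $k$. Names and structure are illustrative, not the official implementation:</p>

```python
import numpy as np

def select(phase, w, k, rng, frozen_mask=None):
    """Return a boolean keep-mask over |U| units for the given phase."""
    n_keep = int((1.0 - k) * len(w))
    if phase == "random":                  # unbiased weight updates early on
        idx = rng.choice(len(w), size=n_keep, replace=False)
    elif phase == "adaptive":              # keep top (1 - k) by learned magnitude
        idx = np.argsort(w)[-n_keep:]
    else:                                  # "fixed": reuse the frozen mask
        return frozen_mask
    mask = np.zeros(len(w), dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
w = rng.random(100)                        # learned sparsity magnitudes
m_random = select("random", w, k=0.7, rng=rng)
m_adaptive = select("adaptive", w, k=0.7, rng=rng)
m_fixed = select("fixed", w, k=0.7, rng=rng, frozen_mask=m_adaptive)
```

Freezing the mask in the final phase removes the per-step selection overhead, which matches the ablation result that the fixed phase mainly buys speed rather than accuracy.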
<p><strong>Weight Initialization</strong>: Learnable sparsity weights ($W$) initialized as all-ones vector.</p>
<h3 id="models">Models</h3>
<p>The model predicts the Hamiltonian matrix $H$ from atomic numbers $Z$ and coordinates $r$.</p>
<p><strong>Inputs</strong>: Atomic numbers ($Z$) and 3D coordinates.</p>
<p><strong>Backbone Structure</strong>:</p>
<ol>
<li><strong>Vectorial Node Interaction (x4)</strong>: Uses long-short range message passing. Extracts vectorial representations ($l=1$) without high-order TPs to save cost.</li>
<li><strong>Spherical Node Interaction (x2)</strong>: Projects features to high-order spherical harmonics (up to $L_{max}$). The first block increases the maximum order from 0 to $L_{max}$ without the Sparse Pair Gate; the second block applies the <strong>Sparse Pair Gate</strong> to filter node pairs.</li>
<li><strong>Pair Construction Block (x2)</strong>: Splits into <strong>Diagonal</strong> (self-interaction) and <strong>Non-Diagonal</strong> (cross-interaction) blocks. Both use the <strong>Sparse TP Gate</strong> to prune cross-order combinations $(l_1, l_2, l_3)$. The Non-Diagonal blocks also use the <strong>Sparse Pair Gate</strong> to filter atom pairs. The two Pair Construction blocks receive representations from the two Spherical Node Interaction blocks respectively, and their outputs are summed.</li>
<li><strong>Expansion Block</strong>: Reconstructs the full Hamiltonian matrix from the sparse irreducible representations, exploiting symmetry ($H_{ji} = H_{ij}^T$) to halve computations.</li>
</ol>
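<p>The symmetry trick in the Expansion block can be illustrated with a toy block-structured matrix (block sizes and values are my choice, not the paper&rsquo;s): only atom-pair blocks with $i \le j$ are produced, and the lower triangle is filled by transposition.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, bs = 3, 2                        # 3 atoms, 2 orbitals each (toy sizes)
H = np.zeros((n_atoms * bs, n_atoms * bs))

for i in range(n_atoms):
    for j in range(i, n_atoms):           # only upper-triangular atom pairs
        block = rng.normal(size=(bs, bs))
        if i == j:
            block = 0.5 * (block + block.T)        # diagonal blocks symmetric
        H[i*bs:(i+1)*bs, j*bs:(j+1)*bs] = block
        H[j*bs:(j+1)*bs, i*bs:(i+1)*bs] = block.T  # H_ji = H_ij^T
```

This halves the number of blocks that must be predicted while guaranteeing a symmetric Hamiltonian.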
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4x NVIDIA A100 (80GB)</li>
<li><strong>Benchmarking</strong>: Single NVIDIA RTX A6000 (46GB)</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, E., Wei, X., Huang, L., Li, Y., Yang, H., Xia, Z., Wang, Z., Liu, C., Shao, B., &amp; Zhang, J. (2025). Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity. <em>Proceedings of the 42nd International Conference on Machine Learning</em>, PMLR 267:41368&ndash;41390.</p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{luo2025efficient,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Erpai and Wei, Xinran and Huang, Lin and Li, Yunyang and Yang, Han and Xia, Zaishuo and Wang, Zun and Liu, Chang and Shao, Bin and Zhang, Jia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{41368--41390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45656">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=K3lykWhXON">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=K3lykWhXON">PDF on OpenReview</a></li>
<li><a href="https://github.com/microsoft/SPHNet">GitHub Repository</a> <em>(Note: The official repository was archived by Microsoft in December 2025. It is available for reference but no longer actively maintained.)</em></li>
</ul>
]]></content:encoded></item><item><title>Beyond Atoms: 3D Space Modeling for Molecular Pretraining</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/beyond-atoms/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/beyond-atoms/</guid><description>Lu et al. introduce SpaceFormer, a Transformer that models entire 3D molecular space including atoms for superior representations.</description><content:encoded><![CDATA[<h2 id="paper-typology-and-contribution">Paper Typology and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It challenges the atom-centric paradigm of molecular representation learning by proposing a novel framework that models the continuous 3D space surrounding atoms. The core contribution is <strong>SpaceFormer</strong>, a Transformer-based architecture that discretizes molecular space into grids to capture physical phenomena (electron density, electromagnetic fields) often missed by traditional point-cloud models.</p>
<h2 id="the-physical-intuition-modeling-empty-space">The Physical Intuition: Modeling &ldquo;Empty&rdquo; Space</h2>
<p><strong>The Gap</strong>: Prior 3D molecular representation models, such as Uni-Mol, treat molecules as discrete sets of atoms, essentially point clouds in 3D space. However, from a quantum physics perspective, the &ldquo;empty&rdquo; space between atoms is far from empty. It is permeated by electron density distributions and electromagnetic fields that determine molecular properties.</p>
<p><strong>The Hypothesis</strong>: Explicitly modeling this continuous 3D space alongside discrete atom positions yields superior representations for downstream tasks, particularly for computational properties that depend on electronic structure, such as HOMO/LUMO energies and energy gaps.</p>
<h2 id="a-surprising-observation-virtual-points-improve-representations">A Surprising Observation: Virtual Points Improve Representations</h2>
<p>Before proposing SpaceFormer, the authors present a simple yet revealing experiment. They augment Uni-Mol by adding randomly sampled virtual points (VPs) from the 3D space within the circumscribed cuboid of each molecule. These VPs carry no chemical information whatsoever: they are purely random noise points.</p>
<p>The result is surprising: adding just 10 random VPs already yields a noticeable improvement in validation loss. The improvement remains consistent and gradually increases as the number of VPs grows, eventually reaching a plateau. This observation holds across downstream tasks as well, with Uni-Mol + VPs improving on several quantum property predictions (LUMO, E1-CC2, E2-CC2) compared to vanilla Uni-Mol.</p>
<p>The implication is that even uninformative spatial context helps the model learn better representations, motivating a principled framework for modeling the full 3D molecular space.</p>
<h2 id="spaceformer-voxelization-and-3d-positional-encodings">SpaceFormer: Voxelization and 3D Positional Encodings</h2>
<p>The key innovation is treating the molecular representation problem as <strong>3D space modeling</strong>. SpaceFormer follows these core steps:</p>
<ol>
<li><strong>Voxelizes the entire 3D space</strong> into a grid with cells of $0.49\text{\AA}$ (based on O-H bond length to ensure at most one atom per cell).</li>
<li><strong>Uses adaptive multi-resolution grids</strong> to efficiently handle empty space, keeping it fine-grained near atoms and coarse-grained far away.</li>
<li><strong>Applies Transformers to 3D spatial tokens</strong> with custom positional encodings that achieve linear complexity.</li>
</ol>
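<p>The voxelization step can be sketched with a toy water geometry (approximate coordinates, my illustration): atoms map to integer cell indices at the $0.49\text{\AA}$ resolution, with each atom landing in its own cell and carrying an inner-cell offset.</p>

```python
import numpy as np

CELL = 0.49  # grid cell length in Angstrom, per the paper

coords = np.array([
    [ 0.000, 0.000, 0.0],   # O
    [ 0.757, 0.586, 0.0],   # H
    [-0.757, 0.586, 0.0],   # H
])

cells = np.floor(coords / CELL).astype(int)  # integer cell indices
offsets = coords - cells * CELL              # inner-cell positions, in [0, CELL)

unique_cells = {tuple(c) for c in cells}
print(cells)
```

Because the cell size is below the shortest bond length, no two atoms share a cell, so each token unambiguously encodes one atom (or NULL for empty space).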
<p>Specifically, the model utilizes two forms of 3D Positional Encoding:</p>
<p><strong>3D Directional PE (RoPE Extension)</strong>
They extend Rotary Positional Encoding (RoPE) to 3D continuous space by splitting the Query and Key vectors into three blocks (one for each spatial axis). The directional attention mechanism takes the form:</p>
<p>$$
\begin{aligned}
\mathbf{q}_{i}^{\top} \mathbf{k}_{j} = \sum_{s=1}^{3} \mathbf{q}_{i,s}^{\top} \mathbf{R}(c_{j,s} - c_{i,s}) \mathbf{k}_{j,s}
\end{aligned}
$$</p>
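<p>The relative-position property behind this construction can be checked numerically on a single axis (my illustration, one 2D rotation block): rotating $\mathbf{q}$ and $\mathbf{k}$ by their own coordinates and taking the dot product equals applying a single rotation by the coordinate difference.</p>

```python
import numpy as np

def rot(v, angle):
    """Apply a 2D rotation by `angle` to vector v."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
ci, cj = 0.7, -1.3   # coordinates of tokens i and j along one axis

lhs = rot(q, ci) @ rot(k, cj)   # rotate each by its absolute coordinate
rhs = q @ rot(k, cj - ci)       # single relative rotation R(c_j - c_i)
print(lhs, rhs)
```

This is why the attention score depends only on $c_{j,s} - c_{i,s}$, making the encoding translation-invariant along each axis.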
<p><strong>3D Distance PE (RFF Approximation)</strong>
To compute invariant geometric distance without incurring quadratic memory overhead, they use Random Fourier Features (RFF) to approximate a Gaussian kernel of pairwise distances:</p>
<p>$$
\begin{aligned}
\exp \left( - \frac{| \mathbf{c}_i - \mathbf{c}_j |_2^2}{2\sigma^2} \right) &amp;\approx z(\mathbf{c}_i)^\top z(\mathbf{c}_j) \\
z(\mathbf{c}_i) &amp;= \sqrt{\frac{2}{d}} \cos(\sigma^{-1} \mathbf{c}_i^\top \boldsymbol{\omega} + \mathbf{b})
\end{aligned}
$$</p>
<p>This approach enables the model to natively encode complex field-like phenomena without computing exhaustive $O(N^2)$ distance matrices.</p>
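<p>The quality of the RFF approximation is easy to check numerically. This sketch (my illustration, with rows of $\boldsymbol{\omega}$ drawn from $\mathcal{N}(0, I)$ and $\mathbf{b} \sim U[0, 2\pi]$, as in standard RFF constructions) compares the exact Gaussian kernel with the feature inner product:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 20000, 2.0   # illustrative feature dimension and bandwidth

omega = rng.normal(size=(d, 3))              # frequency samples, scaled by 1/sigma below
b = rng.uniform(0.0, 2.0 * np.pi, size=d)    # random phase offsets

def z(c):
    """Random Fourier feature map for a 3D coordinate c."""
    return np.sqrt(2.0 / d) * np.cos((omega @ c) / sigma + b)

ci = np.array([0.3, -1.2, 0.8])
cj = np.array([1.1, 0.4, -0.5])

exact = np.exp(-np.sum((ci - cj) ** 2) / (2.0 * sigma**2))
approx = z(ci) @ z(cj)
print(f"exact={exact:.4f}  approx={approx:.4f}")
```

Because each token's feature vector $z(\mathbf{c}_i)$ is computed independently, pairwise distances never have to be materialized, which is the source of the linear scaling.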
<h2 id="experimental-setup-and-downstream-tasks">Experimental Setup and Downstream Tasks</h2>
<p><strong>Pretraining Data</strong>: 19 million unlabeled molecules from the same dataset used by Uni-Mol.</p>
<p><strong>Downstream Benchmarks</strong>: The authors propose a new benchmark of 15 tasks, motivated by known limitations of MoleculeNet: invalid structures, inconsistent chemical representations, data curation errors, and an inability to adequately distinguish model performance. The tasks split into two categories:</p>
<ol>
<li>
<p><strong>Computational Properties (Quantum Mechanics)</strong></p>
<ul>
<li>Subsets of <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> (HOMO, LUMO, GAP energy prediction, 20K samples; E1-CC2, E2-CC2, f1-CC2, f2-CC2, 21.7K samples)</li>
<li>Cata-condensed polybenzenoid hydrocarbons (Dipole moment, adiabatic ionization potential, D3 dispersion correction, 8,678 samples)</li>
<li>Metric: Mean Absolute Error (MAE)</li>
</ul>
</li>
<li>
<p><strong>Experimental Properties (Pharma/Bio)</strong></p>
<ul>
<li>MoleculeNet tasks (BBBP, BACE for drug discovery)</li>
<li>Biogen ADME tasks (HLM, MME, Solubility)</li>
<li>Metrics: AUC for classification, MAE for regression</li>
</ul>
</li>
</ol>
<p><strong>Splitting Strategy</strong>: All datasets use 8:1:1 train/validation/test ratio with <strong>scaffold splitting</strong> to test out-of-distribution generalization.</p>
<p><strong>Training Setup</strong>:</p>
<ul>
<li><strong>Objective</strong>: Masked Auto-Encoder (MAE) with 30% random masking. Model predicts whether a cell contains an atom, and if so, regresses both atom type and precise offset position.</li>
<li><strong>Hardware</strong>: ~50 hours on 8 NVIDIA A100 GPUs</li>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9, \beta_2=0.99$)</li>
<li><strong>Learning Rate</strong>: Peak 1e-4 with linear decay and 0.01 warmup ratio</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>Total Updates</strong>: 1 million</li>
</ul>
<p><strong>Baseline Comparisons</strong>: GROVER (2D graph-based MPR), GEM (2D graph enhanced with 3D information), 3D Infomax (GNN with 3D information), Uni-Mol (3D MPR, primary baseline using the same pretraining dataset), and Mol-AE (extends Uni-Mol with atom-based MAE pretraining).</p>
<h2 id="results-and-analysis">Results and Analysis</h2>
<p><strong>Strong Contextual Performance</strong>: SpaceFormer ranked 1st in 10 of 15 tasks and in the top 2 for 14 of 15 tasks. It surpassed the runner-up models by approximately 20% on quantum property tasks (HOMO, LUMO, GAP, E1-CC2, Dipmom), validating that modeling non-atom space captures electronic structure better than atom-only regimes.</p>
<h3 id="key-results-on-quantum-properties">Key Results on Quantum Properties</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GROVER</th>
          <th>GEM</th>
          <th>3D Infomax</th>
          <th>Uni-Mol</th>
          <th>Mol-AE</th>
          <th><strong>SpaceFormer</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HOMO (Ha)</td>
          <td>0.0075</td>
          <td>0.0068</td>
          <td>0.0065</td>
          <td>0.0052</td>
          <td>0.0050</td>
          <td><strong>0.0042</strong></td>
      </tr>
      <tr>
          <td>LUMO (Ha)</td>
          <td>0.0086</td>
          <td>0.0080</td>
          <td>0.0070</td>
          <td>0.0060</td>
          <td>0.0057</td>
          <td><strong>0.0040</strong></td>
      </tr>
      <tr>
          <td>GAP (Ha)</td>
          <td>0.0109</td>
          <td>0.0107</td>
          <td>0.0095</td>
          <td>0.0081</td>
          <td>0.0080</td>
          <td><strong>0.0064</strong></td>
      </tr>
      <tr>
          <td>E1-CC2 (eV)</td>
          <td>0.0101</td>
          <td>0.0090</td>
          <td>0.0089</td>
          <td>0.0067</td>
          <td>0.0070</td>
          <td><strong>0.0058</strong></td>
      </tr>
      <tr>
          <td>Dipmom (Debye)</td>
          <td>0.0752</td>
          <td>0.0289</td>
          <td>0.0291</td>
          <td>0.0106</td>
          <td>0.0113</td>
          <td><strong>0.0083</strong></td>
      </tr>
  </tbody>
</table>
<p>SpaceFormer&rsquo;s advantage is most pronounced on computational properties that depend on electronic structure. On experimental biological tasks (e.g., BBBP), where measurements are noisy, the advantage narrows or reverses: Uni-Mol achieves 0.9066 AUC on BBBP compared to SpaceFormer&rsquo;s 0.8605.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The authors present several ablations that isolate the source of SpaceFormer&rsquo;s improvements:</p>
<p><strong>MAE vs. Denoising</strong>: SpaceFormer with MAE pretraining outperforms SpaceFormer with denoising on all four ablation tasks. The MAE objective requires predicting <em>whether</em> an atom exists in a masked voxel, which forces the model to learn global structural dependencies. In the denoising variant, only atom cells are masked so the model never needs to predict atom existence, reducing the task to coordinate regression.</p>
<p><strong>FLOPs Control</strong>: A SpaceFormer-Large model (4x width, atom-only) trained with comparable FLOPs still falls short of SpaceFormer with 1000 non-atom cells on most downstream tasks. This confirms the improvement comes from modeling 3D space, not from additional compute.</p>
<p><strong>Virtual Points vs. SpaceFormer</strong>: Adding up to 200 random virtual points to Uni-Mol improves some tasks but leaves a significant gap compared to SpaceFormer, demonstrating that principled space discretization outperforms naive point augmentation.</p>
<p><strong>Efficiency Validation</strong>: The Adaptive Grid Merging method reduces the number of cells by roughly 10x with virtually no performance degradation. The 3D positional encodings scale linearly with the number of cells, while Uni-Mol&rsquo;s pretraining cost scales quadratically.</p>
<h3 id="scope-and-future-directions">Scope and Future Directions</h3>
<p>SpaceFormer does not incorporate built-in SE(3) equivariance, relying instead on data augmentation (random rotations and random boundary padding) during training. The authors identify extending SpaceFormer to force field tasks and larger systems such as proteins and complexes as promising future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="code-and-data-availability">Code and Data Availability</h3>
<ul>
<li><strong>Source Code</strong>: At the time of writing, the authors have not released the official source code or pre-trained weights.</li>
<li><strong>Datasets</strong>: Pretraining utilized the same 19M unlabeled molecule dataset as Uni-Mol. Downstream tasks use a newly curated internal benchmark built from subsets of GDB-17, MoleculeNet, and Biogen ADME. The exact customized scaffold splits for these evaluations are pending the official code release.</li>
<li><strong>Compute</strong>: Pretraining the base SpaceFormer encoder (~67.8M parameters, configured to merge level 3) required approximately 50 hours on 8 NVIDIA A100 GPUs.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source code</td>
          <td>Code</td>
          <td>N/A</td>
          <td>Not publicly released as of March 2026</td>
      </tr>
      <tr>
          <td>Pre-trained weights</td>
          <td>Model</td>
          <td>N/A</td>
          <td>Not publicly released</td>
      </tr>
      <tr>
          <td>Pretraining data (19M molecules)</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Same dataset as Uni-Mol; not independently released</td>
      </tr>
      <tr>
          <td>Downstream benchmark splits</td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Custom scaffold splits pending code release</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The model treats a molecule as a 3D &ldquo;image&rdquo; via voxelization, processed by a Transformer.</p>
<p><strong>Input Representation</strong>:</p>
<ul>
<li><strong>Discretization</strong>: 3D space divided into grid cells with length <strong>$0.49\text{\AA}$</strong> (based on O-H bond length to ensure at most one atom per cell)</li>
<li><strong>Tokenization</strong>: Tokens are pairs $(t_i, c_i)$ where $t_i$ is atom type (or NULL) and $c_i$ is the coordinate</li>
<li><strong>Embeddings</strong>: Continuous embeddings with dimension 512. Inner-cell positions discretized with $0.01\text{\AA}$ precision</li>
</ul>
<p><strong>Transformer Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>Embedding Dim</th>
          <th>FFN Dim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Encoder</strong></td>
          <td>16</td>
          <td>8</td>
          <td>512</td>
          <td>2048</td>
      </tr>
      <tr>
          <td><strong>Decoder</strong> (MAE)</td>
          <td>4</td>
          <td>4</td>
          <td>256</td>
          <td>1024</td>
      </tr>
  </tbody>
</table>
<p><strong>Attention Mechanism</strong>: FlashAttention for efficient handling of large sequence lengths.</p>
<p><strong>Positional Encodings</strong>:</p>
<ol>
<li><strong>3D Directional PE</strong>: Extension of Rotary Position Embedding (RoPE) to 3D continuous space, capturing relative directionality</li>
<li><strong>3D Distance PE</strong>: Random Fourier Features (RFF) to approximate Gaussian kernel of pairwise distances with linear complexity</li>
</ol>
<h4 id="visualizing-rff-and-rope">Visualizing RFF and RoPE</h4>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-rff-rope-visualization.webp"
         alt="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         title="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Visual intuition for SpaceFormer&rsquo;s positional encodings: Top row shows RFF distance encoding (Gaussian-like attention decay and high-frequency feature fingerprints). Bottom row shows RoPE directional encoding (vector rotation fields and resulting attention patterns).</figcaption>
    
</figure>

<p><strong>Top Row (Distance / RFF):</strong> Shows how the model learns &ldquo;closeness.&rdquo; Distance is represented by a complex &ldquo;fingerprint&rdquo; of waves that creates a Gaussian-like force field.</p>
<ul>
<li><strong>Top Left (The Force Field):</strong> The attention score (dot product) naturally forms a Gaussian curve. It is high when atoms are close and decays to zero as they move apart. This mimics physical forces without the model needing to learn that math from scratch.</li>
<li><strong>Top Right (The Fingerprint):</strong> Each dimension oscillates at a different frequency. A specific distance (e.g., $d=2$) has a unique combination of high and low values across these dimensions, creating a unique &ldquo;fingerprint&rdquo; for that exact distance.</li>
</ul>
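<p>The Gaussian-like decay in the top row is the standard random-Fourier-feature identity: the dot product of two RFF vectors approximates a Gaussian kernel of the distance between their inputs. A generic numerical check (dimensions and kernel width here are illustrative, not the paper's settings):</p>

```python
import numpy as np

def rff_features(x, W, b):
    """Random Fourier features: phi(x) . phi(y) approximates
    exp(-||x - y||^2 / (2 sigma^2)) in expectation."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(x @ W.T + b)

rng = np.random.default_rng(0)
D, sigma = 4096, 1.0
W = rng.normal(0.0, 1.0 / sigma, size=(D, 3))   # frequencies ~ N(0, 1/sigma^2)
b = rng.uniform(0.0, 2 * np.pi, size=D)         # random phase per dimension

p = np.array([0.0, 0.0, 0.0])
q = np.array([1.0, 0.0, 0.0])
approx = rff_features(p, W, b) @ rff_features(q, W, b)
exact = np.exp(-np.sum((p - q) ** 2) / (2 * sigma ** 2))
# approx converges to exact as D grows, with O(1/sqrt(D)) error
```

<p>Because the kernel is approximated by a plain dot product of per-atom features, pairwise-distance information enters attention at linear rather than quadratic cost in the feature computation.</p>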
<p><strong>Bottom Row (Direction / RoPE):</strong> Shows how the model learns &ldquo;relative position.&rdquo; It visualizes the vector rotation and how that creates a grid-like attention pattern.</p>
<ul>
<li><strong>Bottom Left (The Rotation):</strong> This visualizes the &ldquo;X-axis chunk&rdquo; of the vector. As you move from left ($x=-3$) to right ($x=3$), the arrows rotate. The model compares angles between atoms to determine relative positions.</li>
<li><strong>Bottom Right (The Grid):</strong> The resulting attention pattern when combining X-rotations and Y-rotations. The red/blue regions show where the model pays attention relative to the center, forming a grid-like interference pattern that distinguishes relative positions (e.g., &ldquo;top-right&rdquo; vs &ldquo;bottom-left&rdquo;).</li>
</ul>
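<p>The relative-position property behind the bottom row can be demonstrated on a single axis. This is a minimal sketch of standard RoPE applied to one continuous coordinate (the frequency schedule is illustrative; the paper extends the idea to all three axes):</p>

```python
import numpy as np

def rope_1d(v, pos):
    """Rotate successive 2D chunks of v by angles proportional to a
    continuous coordinate, one frequency per chunk (standard RoPE)."""
    freqs = 1.0 / (10.0 ** np.arange(len(v) // 2))  # per-chunk frequencies
    angles = pos * freqs
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(v)
    out[0::2] = c * v[0::2] - s * v[1::2]
    out[1::2] = s * v[0::2] + c * v[1::2]
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

# The attention score depends only on the relative offset x1 - x2:
lhs = rope_1d(q, 1.3) @ rope_1d(k, 0.4)
rhs = rope_1d(q, 1.3 - 0.4) @ k
```

<p>Since rotations compose, the query and key rotations cancel down to their difference, which is exactly why the attention pattern encodes &ldquo;top-right vs. bottom-left&rdquo; rather than absolute placement.</p>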
<h4 id="adaptive-grid-merging">Adaptive Grid Merging</h4>
<p>To make the 3D grid approach computationally tractable, two key strategies are employed:</p>
<ol>
<li><strong>Grid Sampling</strong>: Randomly selecting 10-20% of empty cells during training</li>
<li><strong>Adaptive Grid Merging</strong>: Recursively merging $2 \times 2 \times 2$ blocks of empty cells into larger &ldquo;coarse&rdquo; cells, creating a multi-resolution view that is fine-grained near atoms and coarse-grained in empty space (merging set to Level 3)</li>
</ol>
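<p>A sketch of the merging logic (my own greedy bottom-up implementation of the idea, not the authors' code): at each level, a block of eight empty cells collapses into one coarse cell, while blocks touching an atom leave their empty children behind as finer tokens.</p>

```python
import numpy as np

def adaptive_merge(occupied, max_level=3):
    """Count tokens per level after recursive 2x2x2 merging of empty cells.

    occupied: boolean 3D array; side length divisible by 2**max_level.
    Level 0 holds every occupied cell plus empty cells whose block
    contains an atom; higher levels hold fully-empty merged blocks.
    """
    counts = {0: int(occupied.sum())}
    avail = ~occupied                       # empty cells, candidates for merging
    for level in range(1, max_level + 1):
        s = avail.shape[0] // 2
        grouped = avail.reshape(s, 2, s, 2, s, 2)
        merged = grouped.all(axis=(1, 3, 5))               # fully-empty blocks
        leftover = grouped.sum(axis=(1, 3, 5)) - 8 * merged  # empties that can't merge
        counts[level - 1] = counts.get(level - 1, 0) + int(leftover.sum())
        avail = merged
    counts[max_level] = int(avail.sum())
    return counts

occ = np.zeros((8, 8, 8), dtype=bool)   # 8x8x8 fine grid, one atom at a corner
occ[0, 0, 0] = True
counts = adaptive_merge(occ)            # token count at each level
```

<p>For this toy grid the dense representation needs 512 tokens while the merged one needs 22, illustrating the order-of-magnitude savings described above.</p>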
<p><strong>Visualizing Adaptive Grid Merging</strong>:</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-merging.webp"
         alt="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         title="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging demonstrated on H₂O. Red cells (Level 0) contain atoms and remain at full resolution. Progressively darker blue cells represent merged empty regions at higher levels, covering the same volume with fewer tokens.</figcaption>
    
</figure>

<p>The adaptive grid process compresses empty space around molecules while maintaining high resolution near atoms:</p>
<ul>
<li><strong>Red Cells (Level 0):</strong> The smallest squares ($0.49$Å) containing atoms. These are kept at highest resolution because electron density changes rapidly here.</li>
<li><strong>Light Blue Cells (Level 0/1):</strong> Small empty regions close to atoms.</li>
<li><strong>Darker Blue Cells (Level 2/3):</strong> Large blocks of empty space further away.</li>
</ul>
<p>If we used a naive uniform grid, we would have to process thousands of empty &ldquo;Level 0&rdquo; cells containing almost zero information. By merging them into larger blocks (the dark blue squares), the model covers the same volume with significantly fewer input tokens, reducing the number of tokens by roughly <strong>10x</strong> compared to a dense grid.</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-benzene.webp"
         alt="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         title="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging for benzene (C₆H₆). The model maintains maximum resolution (red Level 0 cells) only where atoms exist, while merging vast empty regions into large blocks (dark blue L3/L4 cells). This allows the model to focus computational power on chemically active zones.</figcaption>
    
</figure>

<p>The benzene example above demonstrates how this scales to larger molecules. The characteristic hexagonal ring of 6 carbon atoms (black) and 6 hydrogen atoms (white) occupies a small fraction of the total grid. The dark blue corners (L3, L4) represent massive merged blocks of empty space, letting the model concentrate the bulk of its computation on the red &ldquo;active&rdquo; zones where chemistry actually happens.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lu, S., Ji, X., Zhang, B., Yao, L., Liu, S., Gao, Z., Zhang, L., &amp; Ke, G. (2025). Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, 267, 40491-40504. <a href="https://proceedings.mlr.press/v267/lu25e.html">https://proceedings.mlr.press/v267/lu25e.html</a></p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lu2025beyond,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lu, Shuqi and Ji, Xiaohong and Zhang, Bohang and Yao, Lin and Liu, Siyuan and Gao, Zhifeng and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{40491--40504}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Wd9KPQCKwq">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=Wd9KPQCKwq">PDF on OpenReview</a></li>
<li><a href="https://icml.cc/virtual/2025/poster/45004">ICML 2025 poster page</a></li>
</ul>
]]></content:encoded></item><item><title>Embedded-Atom Method: Impurities and Defects in Metals</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/</link><pubDate>Fri, 22 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/</guid><description>Daw and Baskes's foundational 1984 paper introducing the Embedded-Atom Method (EAM), a many-body potential for metal simulations.</description><content:encoded><![CDATA[<h2 id="contribution-adaptive-many-body-potentials">Contribution: Adaptive Many-Body Potentials</h2>
<p>This is a foundational <strong>method paper</strong> that introduces a new class of semi-empirical, many-body interatomic potential: the <strong>Embedded-Atom Method (EAM)</strong>. It is designed for large-scale atomistic simulations of metallic systems, bridging the gap between computationally cheap (but physically limited) pair potentials and accurate (but expensive) quantum mechanical methods. The EAM achieves pair-potential speed while incorporating many-body physics inspired by density functional theory.</p>
<h2 id="motivation-the-geometric-limits-of-pair-potentials">Motivation: The Geometric Limits of Pair Potentials</h2>
<p>The authors sought to overcome the limitations of <strong>pair potentials</strong> (the dominant method of the time), which failed in three key areas:</p>
<ul>
<li><strong>Elastic Anisotropy:</strong> Pair potentials enforce the Cauchy relation ($C_{12} = C_{44}$), which is violated by most transition metals.</li>
<li><strong>Volume Ambiguity:</strong> Pair potentials require a volume-dependent energy term, making them impossible to use accurately on surfaces or cracks where local volume is undefined.</li>
<li><strong>Chemical Incompatibility:</strong> Pair potentials cannot model chemically active impurities like Hydrogen.</li>
</ul>
<p>First-principles quantum mechanical methods (e.g., band theory) are limited by basis-set size and periodicity requirements, making them impractical for the large systems (thousands of atoms) needed to study defects, surfaces, and mechanical properties.</p>
<p>The goal was to create a new model that bridges this gap in accuracy and computational cost.</p>
<h2 id="core-innovation-the-embedding-energy-function">Core Innovation: The Embedding Energy Function</h2>
<p>The EAM postulates that the energy of an atom is determined by the local electron density of its neighbors. The total energy is:</p>
<p>$$E_{tot} = \sum_{i} F_i(\rho_{h,i}) + \frac{1}{2}\sum_{i \neq j} \phi_{ij}(R_{ij})$$</p>
<ul>
<li><strong>$F_i(\rho_{h,i})$ (Embedding Energy):</strong> The energy required to embed atom $i$ into the background electron density $\rho$ provided by its neighbors. This term is non-linear and captures many-body effects.</li>
<li><strong>$\phi_{ij}$ (Pair Potential):</strong> A short-range electrostatic repulsion between cores.</li>
<li><strong>$\rho_{h,i}$ (Host Density):</strong> Approximated as a linear superposition of atomic densities: $\rho_{h,i} = \sum_{j \neq i} \rho^a_j(R_{ij})$.</li>
</ul>
<p>The key innovations are:</p>
<ol>
<li><strong>The Embedding Energy</strong>: Each atom $i$ contributes an energy $F_i$ which is a non-linear function of the local electron density $\rho_{h,i}$ it is embedded in. This density is approximated as a simple linear superposition of the atomic electron densities of all its neighbors. This term captures the crucial many-body effects of metallic bonding.</li>
<li><strong>A Redefined Pair Potential</strong>: A short-range, two-body potential $\phi_{ij}$ is retained, but it primarily models the electrostatic core-core repulsion.</li>
<li><strong>Elimination of the &ldquo;Volume&rdquo; Problem</strong>: Because the embedding energy depends on the local electron density (a quantity that is always well-defined, even at a surface or a crack tip), the method circumvents the ambiguities of volume-dependent pair potentials.</li>
<li><strong>Intrinsic Many-Body Nature</strong>: The non-linearity of the embedding function $F(\rho)$ naturally accounts for why chemically active impurities (like hydrogen) cannot be described by pair potentials and correctly breaks the Cauchy relation for elastic constants.</li>
</ol>
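<p>The decomposition is compact enough to sketch directly. Below, toy exponential functionals stand in for the paper's fitted cubic splines, and <code>eam_energy</code> is an illustrative name, not the authors' code:</p>

```python
import numpy as np

def eam_energy(R, F, phi, rho_a):
    """E_tot = sum_i F(rho_{h,i}) + (1/2) sum_{i != j} phi(R_ij)."""
    d = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-interaction
    host_rho = rho_a(d).sum(axis=1)      # rho_{h,i} = sum_{j != i} rho^a(R_ij)
    return F(host_rho).sum() + 0.5 * phi(d).sum()

# Toy functional forms (illustrative, not the paper's fitted splines)
F = lambda rho: -np.sqrt(rho)            # embedding energy
phi = lambda r: np.exp(-2 * r) / r       # screened core-core repulsion
rho_a = lambda r: np.exp(-r)             # atomic density tail

dimer = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
E = eam_energy(dimer, F, phi, rho_a)
```

<p>The nonlinearity of $F$ is where the many-body physics lives: doubling an atom's neighbor count does not double its embedding energy, which is exactly what pair potentials cannot express.</p>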
<h2 id="experimental-design-robust-parameter-validation">Experimental Design: Robust Parameter Validation</h2>
<p>The authors validated EAM through a rigorous split between parameterization data and prediction tasks:</p>
<p><strong>Fitting Data (Bulk Properties Only):</strong></p>
<p>The model parameters were fitted exclusively to these experimental values for Ni and Pd:</p>
<ul>
<li>Lattice constant ($a_0$)</li>
<li>Elastic constants ($C_{11}, C_{12}, C_{44}$)</li>
<li>Sublimation energy ($E_s$)</li>
<li>Vacancy-formation energy ($E^F_{1V}$)</li>
<li>Hydrogen heat of solution (for fitting H parameters)</li>
</ul>
<p><strong>Validation Tests (No Further Fitting):</strong></p>
<p>The model was then evaluated on its ability to predict these properties without any additional parameter adjustments:</p>
<ul>
<li><strong>Surface Relaxations:</strong> Ni(110) surface contraction</li>
<li><strong>Surface Energy:</strong> Ni(100) surface energy</li>
<li><strong>Hydrogen Migration:</strong> H migration energy in Pd</li>
<li><strong>Fracture Mechanics:</strong> Hydrogen embrittlement in Ni slabs</li>
</ul>
<h2 id="results-extending-predictive-power-to-surfaces-and-defects">Results: Extending Predictive Power to Surfaces and Defects</h2>
<ol>
<li><strong>Many-Body Physics:</strong> The embedding function $F(\rho)$ successfully captures the volume-dependence of metallic cohesion, fixing the &ldquo;Cauchy discrepancy&rdquo; inherent in pair potentials.</li>
<li><strong>Surface Properties:</strong> A single set of functions, fitted only to bulk data, correctly reproduces surface relaxations within 0.1 Å of experiment across three faces (100), (110), and (111) for Ni. The Ni(100) surface energy (1550 erg/cm²) compares well with the measured crystal-vapor average (1725 erg/cm²).</li>
<li><strong>Hydrogen in Bulk:</strong> The method predicts H migration energy in Pd as 0.26 eV, matching experiment exactly. Hydride lattice expansions are also well reproduced: 4.5% for NiH (experiment: 5%) and 4% for PdH (experiment: 3.5% for PdH$_{0.6}$).</li>
<li><strong>Hydrogen on Surfaces:</strong> Calculated adsorption sites on all three Ni and Pd faces agree with experimentally determined sites. Adsorption energies on Ni surfaces are systematically about 0.25 eV too low, while on Pd surfaces the error is much smaller (about 0.05 eV too high on average).</li>
<li><strong>Fracture Mechanics:</strong> Static fracture calculations on Ni slabs demonstrate brittle fracture behavior and show that hydrogen lowers the fracture stress, providing a qualitative model of hydrogen embrittlement.</li>
</ol>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The functions $F$ and $\phi$ are not uniquely determined by the empirical fitting procedure. The short-range pair potential (restricted to first neighbors in fcc metals) may not be the best choice for all crystal structures.</li>
<li>The choice of hydrogen embedding function (Puska et al. vs. Norskov&rsquo;s corrected function) remains undecided and may affect hydrogen binding energies.</li>
<li>The fracture calculations are static, and dynamical effects and plasticity play important roles in real fracture that are not captured.</li>
<li>The method has only been demonstrated for fcc metals (Ni and Pd). Extension to bcc metals and other crystal structures requires further investigation.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="algorithms">Algorithms</h3>
<p>To replicate the method, three specific algorithmic definitions are needed:</p>
<ol>
<li>
<p><strong>Atomic Density Construction</strong>: The electron density $\rho^a(r)$ is a weighted sum of Hartree-Fock $s$ and $d$ orbital densities (from Clementi &amp; Roetti tables), controlled by a parameter $N_s$ (the number of s-like electrons):
$$\rho^a(r) = N_s\rho_s^a(r) + (N-N_s)\rho_d^a(r)$$
For Ni, $N_s = 0.85$; for Pd, $N_s = 0.65$ (fitted to H solution heat).</p>
</li>
<li>
<p><strong>Pair Potential Form</strong>: The short-range pair interaction derives from an effective charge function $Z(r)$ to handle core repulsion:
$$\phi_{ij}(r) = \frac{Z_i(r)Z_j(r)}{r}$$
Splines for $Z(r)$ are provided in Table II.</p>
</li>
<li>
<p><strong>Analytic Forces</strong>: Because the embedding energy depends on neighbor density, the force calculation is many-body:
$$\vec{f}_{k} = -\sum_{j(\neq k)} \left[ F'_{k}\,\rho'_{j}(R_{jk}) + F'_{j}\,\rho'_{k}(R_{jk}) + \phi'_{jk}(R_{jk}) \right] \hat{r}_{jk}$$</p>
</li>
</ol>
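<p>The force expression can be checked against a numerical derivative of the energy. This self-contained sketch uses illustrative exponential functionals with hand-coded derivatives in place of the paper's splines:</p>

```python
import numpy as np

# Illustrative functionals and their derivatives (not the paper's splines)
F = lambda rho: -np.sqrt(rho)
dF = lambda rho: -0.5 / np.sqrt(rho)
rhoa = lambda r: np.exp(-r)
drho = lambda r: -np.exp(-r)
phi = lambda r: np.exp(-2 * r) / r
dphi = lambda r: -np.exp(-2 * r) * (2 / r + 1 / r**2)

def eam_forces(R):
    """Analytic many-body forces from the expression above."""
    N = len(R)
    host = np.array([sum(rhoa(np.linalg.norm(R[i] - R[j]))
                         for j in range(N) if j != i) for i in range(N)])
    f = np.zeros_like(R)
    for k in range(N):
        for j in range(N):
            if j == k:
                continue
            rvec = R[k] - R[j]
            r = np.linalg.norm(rvec)
            mag = (dF(host[k]) + dF(host[j])) * drho(r) + dphi(r)
            f[k] -= mag * rvec / r      # unit vector from j to k
    return f

dimer = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
f = eam_forces(dimer)
```

<p>Note the cross terms: moving atom $k$ changes the host density seen by every neighbor $j$, so both $F'_k$ and $F'_j$ appear, unlike a pure pair force.</p>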
<h3 id="models">Models</h3>
<p>The functions $F(\rho)$ and $\phi(r)$ are modeled using <strong>cubic splines</strong>, with parameters fitted to reproduce bulk experimental constants. The embedding function $F(\rho)$ is constrained to have a single minimum and to be linear at high densities, matching the qualitative form of the first-principles calculations by Puska et al. Energy minimization uses the <strong>conjugate gradients</strong> technique. The paper explicitly lists spline knots, coefficients, and cutoffs in Tables II and IV, making the method fully reproducible.</p>















<figure class="post-figure center ">
    <img src="/img/notes/chemistry/eam-embedding-effective-charge.webp"
         alt="Reproduction of Figures 1 and 2 from Daw &amp; Baskes (1984) showing the embedding energy and effective charge functions for Ni and Pd"
         title="Reproduction of Figures 1 and 2 from Daw &amp; Baskes (1984) showing the embedding energy and effective charge functions for Ni and Pd"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Left:</strong> Dimensionless embedding energy ($E/E_s$) vs. normalized electron density ($\rho/\bar{\rho}$). The minimum near $\rho/\bar{\rho} \approx 1.0$ drives metallic cohesion. <strong>Right:</strong> Normalized effective charge ($Z/Z_0$) vs. normalized distance ($R/a_0$). The charge drops to zero near $R/a_0 = 0.85$, ensuring short-range interactions. Reproduced from Table II spline knots.</figcaption>
    
</figure>

<h3 id="evaluation">Evaluation</h3>
<p><strong>Fitting Data (Used for Parameterization):</strong></p>
<p>Bulk experimental properties for Ni and Pd only:</p>
<ul>
<li>Lattice constant ($a_0$)</li>
<li>Elastic constants ($C_{11}, C_{12}, C_{44}$)</li>
<li>Sublimation energy ($E_s$)</li>
<li>Vacancy-formation energy ($E^F_{1V}$)</li>
<li>Hydrogen heat of solution (for fitting H parameters)</li>
</ul>
<p><strong>Validation Results (Predictions Without Further Fitting):</strong></p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Predicted</th>
          <th>Experimental</th>
          <th>Agreement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ni(110) surface contraction</td>
          <td>-0.11 Å</td>
          <td>-0.06 to -0.10 Å</td>
          <td>Within 0.1 Å</td>
      </tr>
      <tr>
          <td>Ni(100) surface energy</td>
          <td>1550 erg/cm²</td>
          <td>1725 erg/cm² (avg.)</td>
          <td>Close</td>
      </tr>
      <tr>
          <td>H migration in Pd</td>
          <td>0.26 eV</td>
          <td>0.26 eV</td>
          <td>Exact</td>
      </tr>
      <tr>
          <td>NiH lattice expansion</td>
          <td>4.5%</td>
          <td>5%</td>
          <td>Close</td>
      </tr>
      <tr>
          <td>PdH lattice expansion</td>
          <td>4%</td>
          <td>3.5% (PdH$_{0.6}$)</td>
          <td>Close</td>
      </tr>
      <tr>
          <td>H adsorption sites (Ni, Pd)</td>
          <td>Correct on all faces</td>
          <td>Matches experiment</td>
          <td>Exact</td>
      </tr>
      <tr>
          <td>H embrittlement in Ni</td>
          <td>Qualitative model</td>
          <td>-</td>
          <td>Qualitative</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Daw, M. S., &amp; Baskes, M. I. (1984). Embedded-atom method: Derivation and application to impurities, surfaces, and other defects in metals. <em>Physical Review B</em>, 29(12), 6443-6453. <a href="https://doi.org/10.1103/PhysRevB.29.6443">https://doi.org/10.1103/PhysRevB.29.6443</a></p>
<p><strong>Publication</strong>: Physical Review B, 1984</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{daw1984embedded,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Embedded-atom method: Derivation and application to impurities, surfaces, and other defects in metals}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Daw, Murray S and Baskes, Mike I}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Physical Review B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{29}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6443--6453}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1984}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{APS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1103/PhysRevB.29.6443}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-review-1993/">EAM Review (1993)</a></li>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-voter-1994/">EAM User Guide (1994)</a></li>
<li><a href="https://www.ctcms.nist.gov/potentials/">NIST Interatomic Potentials Repository</a></li>
</ul>
]]></content:encoded></item><item><title>Umbrella Sampling: Monte Carlo Free-Energy Estimation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/umbrella-sampling/</link><pubDate>Thu, 21 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/umbrella-sampling/</guid><description>Torrie and Valleau's 1977 paper introducing Umbrella Sampling, an importance sampling technique for Monte Carlo free-energy calculations.</description><content:encoded><![CDATA[<h2 id="a-methodological-shift-in-monte-carlo-simulations">A Methodological Shift in Monte Carlo Simulations</h2>
<p>This is a <strong>Method</strong> paper that introduces a novel computational technique for Monte Carlo simulations. It presents Umbrella Sampling, an importance sampling approach that uses non-physical distributions to calculate free energy differences in molecular systems.</p>
<h2 id="the-sampling-gap-in-phase-transitions">The Sampling Gap in Phase Transitions</h2>
<p>The paper addresses the failure of conventional Boltzmann-weighted Monte Carlo to estimate free energy differences.</p>
<ul>
<li><strong>The Problem</strong>: Free energy depends on the integral of configurations that are rare in the reference system. In a standard simulation, the relevant probability density $f_0(\Delta U^*)$ is too small to be sampled accurately by conventional Boltzmann-weighted Monte Carlo.</li>
<li><strong>Phase Transitions</strong>: Conventional &ldquo;thermodynamic integration&rdquo; fails near phase transitions because it requires a path of integration where ensemble averages can be reliably measured, which is difficult in unstable regions.</li>
</ul>
<h2 id="bridging-states-with-non-physical-distributions">Bridging States with Non-Physical Distributions</h2>
<p>The authors introduce a non-physical distribution $\pi(q^N)$ to bridge the gap between a reference system (0) and a system of interest (1).</p>
<ul>
<li><strong>Arbitrary Weights</strong>: They generate a Markov chain with a limiting distribution $\pi(q^N)$ that differs from the Boltzmann distribution of either system. This distribution is written as $\pi(q^N) = w(q^N)\,\exp(-U_0(q^N)/kT_0)\,/\,Z$, where $w(q^N) = W(\Delta U^*)$ is a weighting function chosen to favor configurations with values of $\Delta U^*$ important to the free-energy integral.</li>
<li><strong>Reweighting Formula</strong>: The unbiased average of any property $\theta$ is recovered via the ratio of biased averages:</li>
</ul>
<p>$$\langle\theta\rangle_{0}=\frac{\langle\theta/w\rangle_{w}}{\langle1/w\rangle_{w}}$$</p>
<ul>
<li><strong>Overlap</strong>: The method allows sampling a range of $\Delta U^*$ up to <strong>three times</strong> that of a conventional Monte Carlo experiment, enabling accurate determination of values of $f_0(\Delta U^*)$ as small as $10^{-8}$. If a single weight function cannot span the entire gap, additional overlapping umbrella-sampling experiments are carried out with different weighting functions exploring successively overlapping ranges of $\Delta U^*$.</li>
</ul>
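<p>The reweighting identity $\langle\theta\rangle_0 = \langle\theta/w\rangle_w / \langle1/w\rangle_w$ is easy to demonstrate on a toy one-dimensional system (the distribution, weight function, and sample size below are all illustrative, not from the paper):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference distribution on a discrete grid (stand-in for the
# Boltzmann-weighted reference system)
x = np.linspace(-4, 4, 401)
p0 = np.exp(-x**2 / 2)
p0 /= p0.sum()

# Umbrella weight w favoring the rarely-sampled right tail
w = np.exp(2 * x)
pi = w * p0
pi /= pi.sum()

# Sample from the *biased* distribution pi, then unbias via the ratio
samples = rng.choice(x, size=200_000, p=pi)
w_s = np.exp(2 * samples)

biased = samples.mean()                               # ~2, badly wrong
unbiased = np.mean(samples / w_s) / np.mean(1 / w_s)  # recovers <x>_0 = 0
```

<p>The biased chain spends its time in the tail that the physical ensemble almost never visits, yet dividing out the weight recovers the unbiased average, which is the entire trick.</p>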
<h2 id="validation-on-lennard-jones-fluids">Validation on Lennard-Jones Fluids</h2>
<p>The authors validated Umbrella Sampling using Monte Carlo simulations of model fluids.</p>
<h3 id="experimental-setup">Experimental Setup</h3>
<ul>
<li><strong>System Specifications</strong>: The study used a <strong>Lennard-Jones (LJ)</strong> fluid and an <strong>inverse-12 &ldquo;soft-sphere&rdquo;</strong> fluid.</li>
<li><strong>System Size</strong>: Simulations were primarily performed with <strong>$N=32$ particles</strong>, with some validation runs at <strong>$N=108$ particles</strong> to check for size dependence.</li>
<li><strong>State Points</strong>: Calculations covered a wide range of densities ($N\sigma^3/V = 0.50$ to $0.85$) and temperatures ($kT/\epsilon = 0.7$ to $2.8$), including the gas-liquid coexistence region.</li>
</ul>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>Baselines</strong>: Results were compared to thermodynamic integration data from <strong>Hansen</strong>, <strong>Levesque</strong>, and <strong>Verlet</strong>.</li>
<li><strong>Quantitative Success</strong>:
<ul>
<li><strong>Agreement</strong>: The free energy estimates agreed with pressure integration results to within statistical uncertainties (e.g., at $kT/\epsilon=1.35$, Umbrella Sampling gave -3.236 vs. Conventional -3.25).</li>
<li><strong>Precision</strong>: Free energy differences were obtained with high precision ($\pm 0.005 NkT$ for $N=108$).</li>
<li><strong>Efficiency</strong>: A single umbrella run could replace the &ldquo;numerous runs&rdquo; required for conventional $1/T$ integrations.</li>
</ul>
</li>
</ul>
<h2 id="temperature-scaling-via-reweighting">Temperature Scaling via Reweighting</h2>
<p>When the reference system has the same internal energy function as the system of interest (i.e., the same fluid at a different temperature), the free-energy expression simplifies to:</p>
<p>$$\frac{A(T)}{kT} = \frac{A(T_0)}{kT_0} - \ln \int f_0(U) \exp\left[-U\left(\frac{1}{kT} - \frac{1}{kT_0}\right)\right] dU$$</p>
<p>This is especially useful because a single determination of $f_0(U)$ over a wide energy range gives the free energy over a whole range of temperatures simultaneously. For 32 Lennard-Jones particles, only two umbrella-sampling experiments are needed to span the temperature range from the triple point ($kT/\epsilon = 0.7$) to twice the critical temperature ($kT/\epsilon = 2.8$). For 108 particles, four experiments suffice.</p>
<h2 id="mapping-the-liquid-gas-free-energy-surface">Mapping the Liquid-Gas Free Energy Surface</h2>
<ul>
<li><strong>Methodological Utility</strong>: The method successfully mapped the free energy of the LJ fluid across the liquid-gas transition, a region where conventional methods face convergence problems.</li>
<li><strong>N-Dependence</strong>: Comparison between $N=32$ and $N=108$ showed no statistically significant size dependence for free energy differences, suggesting small systems are sufficient for these estimates.</li>
<li><strong>Comparison with Gosling-Singer Method</strong>: The paper contrasts its results with free energies derived from Gosling and Singer&rsquo;s entropy estimation technique, finding discrepancies as large as $0.4N\epsilon$ (a 20% error in the nonideal entropy), equivalent to overestimating the configurational integral of a 108-particle system by a factor of $10^{16}$.</li>
<li><strong>Generality</strong>: While demonstrated on energy ($U$), the authors note the weighting function $w$ can be any function of the coordinates, generalizing the technique beyond simple free energy differences.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>This 1977 paper predates modern code-sharing practices, and no source code or data files are publicly available. However, the paper provides sufficient algorithmic detail for reimplementation:</p>
<ul>
<li><strong>Constructing $W$</strong>: The paper does not derive $W$ analytically. It uses a <strong>trial-and-error procedure</strong>: start with a short Boltzmann-weighted experiment, then broaden the distribution in stages through short test runs, adjusting weights to flatten the probability density $f_w(\Delta U^*)$. The paper acknowledges this requires &ldquo;interaction between the trial computer results and human judgment.&rdquo;</li>
<li><strong>Specific Weights</strong>: Table I provides the exact numerical weights used for the 32-particle soft-sphere experiment at $N\sigma^3/V = 0.85$, $kT/\epsilon = 2.74$, with values spanning from $W=1{,}500{,}000$ at the lowest energies down to $W=1.0$ at the center and back up to $W=16.0$ at the highest energies.</li>
<li><strong>Potentials</strong>: The Lennard-Jones and inverse-twelve potentials are fully specified (Eqs. 8 and 9).</li>
<li><strong>State Points</strong>: Densities and temperatures are enumerated in Tables II and III.</li>
<li><strong>Block Averaging</strong>: Errors were estimated by treating sequences of $m$ steps as independent samples, where $m$ is determined by increasing block size until no systematic trends can be detected in either the average or the standard deviation of the mean.</li>
</ul>
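<p>The staged flattening procedure above can be sketched as a loop over short test runs. This is a hypothetical reconstruction, not the paper's code: <code>sample_energies</code> stands in for a short Monte Carlo experiment under the current weights, and the update $w \leftarrow w / f_w(\Delta U^*)$ is one simple way to realize the &ldquo;adjust weights until the sampled distribution is flat&rdquo; step that the paper describes as requiring human judgment.</p>

```python
import numpy as np

def flatten_weights(sample_energies, bins, w=None, n_rounds=5):
    """Iteratively adjust umbrella weights so the sampled distribution
    f_w(dU) becomes roughly flat over the target energy bins.

    `sample_energies` is a stand-in for a short test MC run: a callable
    that returns dU samples drawn under the current weights (hypothetical API).
    """
    if w is None:
        w = np.ones(len(bins) - 1)  # start from Boltzmann sampling (W = 1)
    for _ in range(n_rounds):
        dU = sample_energies(w)                       # short test run
        f_w, _ = np.histogram(dU, bins=bins, density=True)
        f_w = np.clip(f_w, f_w[f_w > 0].min(), None)  # guard empty bins
        w = w / f_w                                   # boost under-sampled bins
        w = w / w.max()                               # normalize for stability
    return w
```

<p>In practice the paper's Table I weights span six orders of magnitude, so this kind of multiplicative update is typically applied in stages, broadening the covered energy range a little at a time.</p>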
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Torrie, G. M., &amp; Valleau, J. P. (1977). Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. <em>Journal of Computational Physics</em>, 23(2), 187-199. <a href="https://doi.org/10.1016/0021-9991(77)90121-8">https://doi.org/10.1016/0021-9991(77)90121-8</a></p>
<p><strong>Publication</strong>: Journal of Computational Physics, 1977</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{torrie1977nonphysical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Torrie, Glenn M and Valleau, John P}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Computational Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{187--199}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1977}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/0021-9991(77)90121-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Contrastive Learning for Variational Autoencoder Priors</title><link>https://hunterheidenreich.com/notes/machine-learning/generative-models/contrastive-learning-for-vae-priors/</link><pubDate>Sun, 17 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/generative-models/contrastive-learning-for-vae-priors/</guid><description>Aneja et al.'s NeurIPS 2021 paper introducing Noise Contrastive Priors (NCPs) to address VAE's 'prior hole' problem with energy-based priors.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces a training approach for Variational Autoencoders (VAEs) to address fundamental limitations in their generative quality through improved prior learning.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The work is motivated by a critical limitation in Variational Autoencoders known as the <strong>&ldquo;prior hole&rdquo; problem</strong>, where the prior distribution $p(z)$ fails to match the aggregate approximate posterior $q(z)$. This mismatch leaves regions of the latent space with high density under the prior that do not map to realistic data samples, resulting in poor generative quality compared to GANs and other generative models.</p>
<figure class="post-figure center ">
    <img src="/img/notes/vae-prior-hole-problem-illustrated.webp"
         alt="Visualization of the VAE prior hole problem showing a ring-shaped aggregate posterior q(z) with an empty center, while the standard Gaussian prior p(z) has highest density at the center where no data exists"
         title="Visualization of the VAE prior hole problem showing a ring-shaped aggregate posterior q(z) with an empty center, while the standard Gaussian prior p(z) has highest density at the center where no data exists"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The &lsquo;prior hole&rsquo; problem: the standard Gaussian prior (red dashed contours) assigns highest probability to the center, but the aggregate posterior (blue dots) forms a ring with no data in that region.</figcaption>
    
</figure>

<p>The figure above illustrates this mismatch. The blue dots represent where a trained encoder actually places data in the latent space (the aggregate posterior $q(z)$), which often forms complex, non-Gaussian shapes. The red dashed contours show the standard Gaussian prior $p(z) = \mathcal{N}(0, I)$, which assumes data is centered at the origin. When generating new samples, we draw from this prior, making it likely to sample from the empty &ldquo;hole&rdquo; where the decoder has never seen training data, producing unrealistic outputs.</p>
<p>A natural question arises: the prior $p(z)$ is used for <em>sampling</em> at inference time, so why does learning a better prior also improve <em>likelihood</em> (NLL)? The answer lies in the VAE objective. VAEs maximize the Evidence Lower Bound (ELBO):</p>
<p>$$ \log p(x) \geq \mathcal{L}_{\text{ELBO}}(x) = \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{Reconstruction}} - \underbrace{\text{KL}(q(z|x) \parallel p(z))}_{\text{Regularization}} $$</p>
<p>The KL divergence term penalizes the mismatch between each data point&rsquo;s approximate posterior $q(z|x)$ and the prior $p(z)$. When the prior is a simple Gaussian but the aggregate posterior forms a complex shape (as in the figure above), this KL term remains unnecessarily high for every data point.</p>
<p>By replacing the simple prior with a learned $p_{\text{NCP}}(z)$ that matches the aggregate posterior, the KL penalty decreases, tightening the ELBO and improving NLL. The learned prior thus provides a <strong>unified solution</strong>: better likelihood during training (tighter bound) and better sampling at inference (no &ldquo;holes&rdquo;).</p>
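<p>For the usual Gaussian VAE posterior $q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and standard Normal prior, the KL term above has a closed form, which makes the penalty easy to see numerically. This is an illustrative sketch, not code from the paper:</p>

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), i.e. the
    regularization term of the ELBO for a diagonal-Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# A posterior pushed away from the origin (as on a ring-shaped aggregate
# posterior) pays a larger KL penalty than one centered on the prior:
centered = kl_diag_gaussian_to_std_normal(np.zeros(2), np.zeros(2))          # -> 0.0
off_ring = kl_diag_gaussian_to_std_normal(np.array([2.0, 0.0]), np.zeros(2)) # -> 2.0
```

<p>A learned prior that tracks where the posteriors actually sit shrinks exactly this term, which is why it tightens the ELBO rather than only improving sample quality.</p>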
<p>The OpenReview discussion contains a significant theoretical debate regarding the paper&rsquo;s core premise. Reviewers argued that the &ldquo;prior hole&rdquo; problem is actually a failure of the posterior to match the prior, or a failure of the encoder. The authors defended their approach by noting that even with a perfect posterior, a simple Normal prior might fail because the decoder lacks capacity to map a simple distribution to complex data without dropping modes. This justifies fixing the prior by making it learned and complex.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The authors propose an <strong>energy-based model (EBM) prior</strong> that is trained using <strong>Noise Contrastive Estimation (NCE)</strong>, which they term a <strong>Noise Contrastive Prior (NCP)</strong>. The key innovations are:</p>
<ul>
<li><strong>Two-Stage Training Process</strong>: First, a standard VAE is trained with a simple base prior. Then, the VAE weights are frozen and a binary classifier learns to distinguish between samples from the aggregate posterior $q(z)$ and the base prior $p(z)$.</li>
<li><strong>Reweighting Strategy</strong>: The core idea is to reweight the base prior $p(z)$ with a learned reweighting factor $r(z)$ to make the resulting prior $p_{\text{NCP}}(z)$ better match the aggregate posterior $q(z)$.</li>
<li><strong>NCE for EBM Training</strong>: The method frames EBM training as a binary classification task to avoid computationally expensive MCMC sampling.</li>
<li><strong>Scalability to Hierarchical Models</strong>: For hierarchical VAEs with multiple latent groups, the NCP approach can be applied independently and in parallel to each group&rsquo;s conditional prior.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The method was evaluated on several standard image generation benchmarks:</p>
<ul>
<li><strong>MNIST</strong> (dynamically binarized): Likelihood evaluation on a controlled, small-latent-space task</li>
<li><strong>CIFAR-10</strong>: Standard computer vision benchmark for generative modeling</li>
<li><strong>CelebA 64x64</strong>: Applied to both standard VAE architectures and more advanced VAEs with GMM priors (RAE model)</li>
<li><strong>CelebA HQ 256x256</strong>: High-resolution face generation task</li>
</ul>
<p>The hierarchical NVAE models used 30 latent groups for CIFAR-10 and CelebA-64, 20 groups for CelebA-HQ-256, and 10 groups of $4 \times 4$ latent variables for MNIST (deliberately small to enable reliable partition function estimation). The experiments compared FID scores, likelihood metrics, and qualitative sample quality between baseline VAEs and NCP-enhanced versions, with particular focus on hierarchical VAEs (NVAE).</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The proposed NCP method demonstrated improvements in generative quality across evaluated datasets, with modest gains on standard VAEs and particularly large gains on hierarchical models like NVAE:</p>
<ul>
<li><strong>CelebA-64</strong>: NCP improved FID scores from 48.12 to 41.28 for standard VAEs, and from 40.95 to 39.00 for RAE models with GMM priors.</li>
<li><strong>Hierarchical Models (NVAE)</strong>: The impact was particularly pronounced on hierarchical VAEs:
<ul>
<li><strong>CIFAR-10</strong>: FID improved from 51.71 to 24.08</li>
<li><strong>CelebA-64</strong>: FID improved from 13.48 to 5.25, making it competitive with GANs</li>
<li><strong>CelebA HQ 256x256</strong>: FID reduced from 40.26 to 24.79</li>
</ul>
</li>
<li><strong>Likelihood Performance</strong>: On MNIST, NCP-VAE achieved 78.10 nats NLL vs. baseline NVAE&rsquo;s 78.67 nats</li>
</ul>
<p>On CIFAR-10 and CelebA-HQ-256, the concurrent VAEBM method (which forms an EBM on the data space rather than the latent space) outperforms NCP-VAE. However, the authors argue the two approaches are complementary: NCP-VAE targets the latent space while VAEBM operates in data space, and combining them could yield further gains. NCP-VAE also has the advantage of applicability to discrete data (e.g., binarized MNIST) and simpler setup since it only requires training binary classifiers rather than MCMC-based training and sampling.</p>
<p>The key conclusions are that <strong>two-stage training with noise contrastive estimation</strong> provides an effective framework for learning expressive energy-based priors that addresses the prior hole problem while scaling efficiently to hierarchical models.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://drive.google.com/drive/folders/15tCGruQcSdm2G4yLkUpKvGASluSZPIBD">Code (Google Drive)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation; hosted on Google Drive (may become inaccessible)</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://openreview.net/forum?id=LcSfRundgwI">OpenReview</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Reviews, author responses, and supplementary material</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<h4 id="the-reweighting-mechanism">The Reweighting Mechanism</h4>
<p>The core innovation is defining the NCP prior as $p_{\text{NCP}}(z) \propto p(z)r(z)$. The reweighting factor $r(z)$ is derived from the binary classifier $D(z)$ using the <strong>likelihood ratio trick</strong>:</p>
<p>$$ r(z) \approx \frac{D(z)}{1 - D(z)} $$</p>
<p>Here, $D(z)$ is the sigmoid output of the trained discriminator, representing the probability that sample $z$ came from the aggregate posterior $q(z)$ (&ldquo;real&rdquo;). For an optimal discriminator $D^*(z)$, this ratio exactly equals $\frac{q(z)}{p(z)}$, allowing the model to approximate the density ratio without explicit density estimation.</p>
<figure class="post-figure center ">
    <img src="/img/notes/ncp-vae-reweighting-the-prior-posterior.webp"
         alt="Visualization of the NCP reweighting mechanism showing three 1D distributions: q(z) the complex bimodal aggregate posterior, p(z) the simple Gaussian prior, and r(z) the learned reweighting factor that transforms p(z) to match q(z)"
         title="Visualization of the NCP reweighting mechanism showing three 1D distributions: q(z) the complex bimodal aggregate posterior, p(z) the simple Gaussian prior, and r(z) the learned reweighting factor that transforms p(z) to match q(z)"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The reweighting mechanism: the learned factor $r(z)$ (bottom) reweights the simple Gaussian prior $p(z)$ (middle) to approximate the complex aggregate posterior $q(z)$ (top). Where $q(z)$ has high density but $p(z)$ is low, $r(z)$ compensates with high values.</figcaption>
    
</figure>

<h4 id="hierarchical-architecture-strategy">Hierarchical Architecture Strategy</h4>
<p>For hierarchical models (like NVAE), the method trains $K$ binary classifiers in parallel (one for each latent group). Crucially, to ensure efficiency, the classifiers reuse the <strong>context feature</strong> $c(z_{&lt;k})$ extracted by the frozen VAE&rsquo;s prior network. This architectural choice provides significant computational savings.</p>
<h4 id="test-time-sampling-inference">Test-Time Sampling (Inference)</h4>
<p>Since $p_{\text{NCP}}(z)$ is an energy-based model with an intractable normalizing constant, it cannot be sampled from directly. The paper employs two methods to generate samples:</p>
<ul>
<li><strong>Sampling-Importance-Resampling (SIR):</strong> Used for most results. It draws $M$ samples (e.g., $M=5000$) from the base prior $p(z)$ and resamples them based on weights $w^{(m)} = r(z^{(m)})$.</li>
<li><strong>Langevin Dynamics (LD):</strong> An iterative refinement method using the gradient of the energy function $E(z) = -\log r(z) - \log p(z)$.</li>
</ul>
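<p>The SIR procedure above can be sketched in a few lines. This is a hedged sketch, not the paper's implementation: <code>base_sampler</code> and <code>r</code> stand in for the trained base prior and the learned reweighting factor.</p>

```python
import numpy as np

def sir_sample(base_sampler, r, n_out, m=5000, rng=None):
    """Sampling-Importance-Resampling from p_NCP(z) proportional to p(z) r(z):
    draw m proposals from the base prior, then resample n_out of them with
    probability proportional to the reweighting factor r(z)."""
    if rng is None:
        rng = np.random.default_rng()
    z = base_sampler(m)                    # proposals from the base prior p(z)
    w = r(z)
    w = w / w.sum()                        # self-normalized importance weights
    idx = rng.choice(m, size=n_out, p=w)   # resample proportionally to r(z)
    return z[idx]
```

<p>When $r$ concentrates on a region the base prior rarely visits, most of the normalized weight falls on a handful of proposals and the effective sample size collapses, which is exactly the SIR failure mode the paper monitors via ESS.</p>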
<h3 id="models">Models</h3>
<h4 id="decoder-architecture">Decoder Architecture</h4>
<p>For RGB datasets (CIFAR-10, CelebA), the output likelihood must be changed from <strong>Discretized Logistic</strong> (standard NVAE) to a <strong>Normal distribution</strong>. The authors note this change alone led to &ldquo;significant improvements in the base model performance.&rdquo; Using the standard NVAE decoder will result in a weaker baseline than reported.</p>
<h4 id="discriminator-architecture">Discriminator Architecture</h4>
<p>The binary classifier uses a ResNet-style architecture with <strong>Squeeze-and-Excitation (SE)</strong> blocks:</p>
<ul>
<li><strong>Activation:</strong> Swish</li>
<li><strong>Normalization:</strong> Batch Normalization</li>
<li><strong>Optimization:</strong> Adam with Cosine Annealing (learning rate: $10^{-3} \to 10^{-7}$)</li>
</ul>
<p>The SE blocks help the model focus on channel-wise feature recalibration, which is important for distinguishing subtle differences between prior and aggregate posterior in high-dimensional latent spaces.</p>
<h3 id="hardware">Hardware</h3>
<p>The main paper is vague on training time, but the OpenReview rebuttal explicitly lists hardware costs:</p>
<ul>
<li><strong>Hardware:</strong> NVIDIA Tesla V100 (32GB) GPUs</li>
<li><strong>Per-Discriminator Training:</strong> ~13 hours for 100 epochs</li>
<li><strong>Parallelization:</strong> Because latent groups are independent, all discriminators can train in parallel</li>
<li><strong>Total Cost (CelebA-64):</strong> ~8.1 GPU-days</li>
<li><strong>Comparison:</strong> The authors argue this is efficient compared to VDVAE, which requires ~560 GPU-days</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<h4 id="inference-speed-vs-quality-trade-off">Inference Speed vs. Quality Trade-off</h4>
<p>Reviewers flagged that SIR sampling can be prohibitively slow. The authors clarified the exact trade-off:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Proposal Samples ($M$)</th>
          <th style="text-align: left">Time per Image</th>
          <th style="text-align: left">FID (CelebA-64)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">5,000 (paper default)</td>
          <td style="text-align: left">~10.11 seconds</td>
          <td style="text-align: left">5.25</td>
      </tr>
      <tr>
          <td style="text-align: left">500 (practical)</td>
          <td style="text-align: left">~1.25 seconds</td>
          <td style="text-align: left">6.76</td>
      </tr>
  </tbody>
</table>
<p>The quality gain from 500 to 5,000 samples is modest (FID difference of 1.51) while inference time increases roughly 8x, suggesting $M=500$ may be a practical default.</p>
<h4 id="hyperparameters">Hyperparameters</h4>
<ul>
<li><strong>FID Calculation:</strong> 50,000 samples</li>
<li><strong>SIR Proposals:</strong> 5,000 samples (paper default) or 500 (practical)</li>
<li><strong>MNIST:</strong> Dynamically binarized version used for likelihood evaluation</li>
<li><strong>Optimizers:</strong> The study largely adopts hyperparameters from baseline papers (e.g., Lawson et al. for MNIST, Ghosh et al. for RAE)</li>
</ul>
<h4 id="debugging-benchmark-25-gaussians">Debugging Benchmark: 25-Gaussians</h4>
<p>The supplement provides a toy experiment ideal for verifying a new implementation before running on expensive image datasets:</p>
<ul>
<li><strong>Setup:</strong> Synthetic dataset of 25 2D-Gaussians arranged on a grid</li>
<li><strong>Target NLL:</strong> ~-0.954 nats (NCP) vs. ~-2.753 nats (Vanilla VAE)</li>
<li><strong>Success Criterion:</strong> Samples should avoid low-density regions between grid points. A standard VAE will generate samples in these &ldquo;prior holes,&rdquo; while a working NCP implementation should cleanly remove these artifacts.</li>
</ul>
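<p>A generator for this toy benchmark is easy to sketch; the grid spacing and component standard deviation below are assumptions for illustration, since the supplement's exact values are not reproduced here:</p>

```python
import numpy as np

def make_25_gaussians(n, scale=2.0, std=0.05, rng=None):
    """Toy 25-Gaussians dataset: 2D points around a 5x5 grid of means.
    `scale` (grid spacing) and `std` (component spread) are assumed values."""
    if rng is None:
        rng = np.random.default_rng()
    means = np.array([(x, y) for x in range(-2, 3) for y in range(-2, 3)], float) * scale
    idx = rng.integers(0, 25, n)           # pick a mixture component per point
    return means[idx] + rng.normal(0.0, std, (n, 2))
```

<p>With data like this, a sanity check is visual: vanilla VAE samples smear into the gaps between grid points, while an NCP-reweighted prior should keep samples on the 25 modes.</p>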
<h4 id="implementation-warnings">Implementation Warnings</h4>
<ul>
<li><strong>SIR Failure Mode:</strong> If the learned prior $p_{\text{NCP}}$ deviates too far from the base prior, SIR sampling collapses (low effective sample size). The paper shows a strong correlation between the NCE classification loss and the effective sample size (Fig. 5(b)), indicating that SIR reliability depends on how well the base prior matches the aggregate posterior.</li>
<li><strong>Temperature Scaling:</strong> The qualitative images in the paper use reduced temperature for improved visual sharpness (Section 5.3). The FID tables do not specify a temperature, so it is unclear whether those results were computed at $T=1.0$.</li>
</ul>
<h3 id="data">Data</h3>
<p>The method was evaluated on several standard image generation benchmarks:</p>
<ul>
<li><strong>MNIST</strong> (dynamically binarized): Likelihood evaluation on a controlled, small-latent-space task</li>
<li><strong>CIFAR-10</strong>: Standard computer vision benchmark for generative modeling (32x32 RGB images)</li>
<li><strong>CelebA 64x64</strong>: Face generation task with moderate resolution</li>
<li><strong>CelebA HQ 256x256</strong>: High-resolution face generation task</li>
</ul>
<p>All datasets use standard train/test splits from the computer vision literature.</p>
<h4 id="additional-metrics">Additional Metrics</h4>
<p>Beyond FID and NLL, the paper uses:</p>
<ul>
<li><strong>Effective Sample Size (ESS):</strong> Validates reliability of the SIR sampling procedure</li>
<li><strong>Maximum Mean Discrepancy (MMD):</strong> Measures distance between aggregate posterior and NCP prior distributions</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Aneja, J., Schwing, A. G., Kautz, J., &amp; Vahdat, A. (2021). A contrastive learning approach for training variational autoencoder priors. <em>Advances in Neural Information Processing Systems</em>, 34, 29604-29616. <a href="https://proceedings.neurips.cc/paper/2021/hash/0496604c1d80f66fbeb963c12e570a26-Abstract.html">https://proceedings.neurips.cc/paper/2021/hash/0496604c1d80f66fbeb963c12e570a26-Abstract.html</a></p>
<p><strong>Publication</strong>: NeurIPS 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{aneja2021contrastive,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Contrastive Learning Approach for Training Variational Autoencoder Priors}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Aneja, Jyoti and Schwing, Alexander G and Kautz, Jan and Vahdat, Arash}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{29604--29616}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=LcSfRundgwI">OpenReview Discussion</a></li>
<li><a href="https://drive.google.com/drive/folders/15tCGruQcSdm2G4yLkUpKvGASluSZPIBD">Code Repository</a> (Google Drive; link may become inaccessible over time)</li>
</ul>
]]></content:encoded></item><item><title>SubGrapher: Visual Fingerprinting of Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</link><pubDate>Mon, 28 Apr 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/subgrapher/</guid><description>SubGrapher creates molecular fingerprints directly from chemical structure images through functional group segmentation for database retrieval.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-taxonomy">Paper Classification and Taxonomy</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution. Using the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> framework:</p>
<p><strong>Primary Classification: Method</strong></p>
<p>The dominant basis vector is Methodological because SubGrapher introduces an architecture that replaces the two-step OCSR workflow (image, then structure, then fingerprint) with single-step fingerprinting (image to visual fingerprint). The paper validates this approach through systematic comparison against state-of-the-art methods (MolGrapher, OSRA, DECIMER, MolScribe), demonstrating superior performance on specific tasks like retrieval and robustness to image quality degradation.</p>
<p><strong>Secondary Classification: Resource</strong></p>
<p>The paper makes non-negligible resource contributions by releasing:</p>
<ul>
<li>Code and model weights on <a href="https://github.com/DS4SD/SubGrapher">GitHub</a> and <a href="https://huggingface.co/docling-project/SubGrapher">HuggingFace</a></li>
<li>Five new visual fingerprinting benchmark datasets for molecule retrieval tasks</li>
<li>Comprehensive functional group knowledge base (1,534 substructures)</li>
</ul>
<h2 id="motivation-extracting-complex-structures-from-noisy-images">Motivation: Extracting Complex Structures from Noisy Images</h2>
<p>The motivation tackles a fundamental challenge in chemical informatics: extracting molecular information from the vast amounts of unstructured scientific literature, particularly patents. Millions of molecular structures exist only as images in these documents, making them inaccessible for computational analysis, database searches, or machine learning applications.</p>
<p>Traditional Optical Chemical Structure Recognition (OCSR) tools attempt to fully reconstruct molecular graphs from images, converting them into machine-readable formats like SMILES. However, these approaches face two critical limitations:</p>
<ol>
<li><strong>Brittleness to image quality</strong>: Poor resolution, noise, or unconventional drawing styles frequently degrade recognition accuracy</li>
<li><strong>Limited handling of complex structures</strong>: Markush structures, generic molecular templates with variable R-groups commonly used in patents, are poorly supported by most conventional OCSR methods</li>
</ol>
<p>The key insight driving SubGrapher is that full molecular reconstruction may be unnecessary for many applications. For tasks like database searching, similarity analysis, or document retrieval, a molecular fingerprint (a vectorized representation capturing structural features) is often sufficient. This realization opens up a new approach: bypass the fragile reconstruction step and create fingerprints directly from visual information.</p>
<h2 id="key-innovation-direct-visual-fingerprinting">Key Innovation: Direct Visual Fingerprinting</h2>
<p>SubGrapher takes a different approach to extracting chemical information from images. It creates &ldquo;visual fingerprints&rdquo; through functional group recognition. The key innovations are:</p>
<ol>
<li>
<p><strong>Direct Image-to-Fingerprint Pipeline</strong>: SubGrapher eliminates the traditional two-step process (image → structure → fingerprint) by generating fingerprints directly from pixels. This single-stage approach avoids error accumulation from failed structure reconstructions and can handle images where conventional OCSR tools produce invalid outputs.</p>
</li>
<li>
<p><strong>Dual Instance Segmentation Architecture</strong>: The system employs two specialized Mask-RCNN networks working in parallel:</p>
<ul>
<li><strong>Functional group detector</strong>: Trained to identify 1,534 expert-defined functional groups using pixel-level segmentation masks</li>
<li><strong>Carbon backbone detector</strong>: Recognizes 27 common carbon chain patterns to capture the molecular scaffold</li>
</ul>
<p>Using instance segmentation provides detailed spatial information and higher accuracy through richer supervision during training.</p>
</li>
<li>
<p><strong>Extensive Functional Group Knowledge Base</strong>: The method uses one of the most comprehensive open-source collections of functional groups, encompassing 1,534 substructures. These were systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Additional halogen substituents and organometallic groups relevant to EUV photoresists</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li>
<p><strong>Substructure-Graph Construction</strong>: After detecting functional groups and carbon backbones, SubGrapher builds a connectivity graph where:</p>
<ul>
<li>Each node represents an identified substructure</li>
<li>Edges connect substructures whose bounding boxes overlap (with 10% margin expansion)</li>
<li>This graph captures both the chemical components and their spatial relationships</li>
</ul>
</li>
<li>
<p><strong>Substructure-based Visual Molecular Fingerprint (SVMF)</strong>: The final output is a continuous, count-based fingerprint formally defined as a matrix $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$ (1,534 functional groups + 27 carbon backbones). The matrix is stored as a compressed upper triangular form:</p>
<p><strong>Diagonal elements</strong> ($i = j$): Weighted count of substructure $i$ plus self-intersection
$$SVMF_{ii}(m) = h_1 \cdot n_i + g_{ii}$$
where $h_1 = 10$ is the diagonal weight hyperparameter, $n_i$ is the instance count, and $g_{ii}$ is the self-intersection coefficient.</p>
<p><strong>Off-diagonal elements</strong> ($i \neq j$): Intersection coefficient based on shortest path distance $d$ in the substructure graph
$$SVMF_{ij}(m) = h_2(d) \cdot \text{intersection}(s_i, s_j)$$
where the distance decay function $h_2(d)$ is:</p>
<ul>
<li>$d \leq 1$: weight = 2</li>
<li>$d = 2$: weight = 2/4 = 0.5</li>
<li>$d = 3$: weight = 2/16 = 0.125</li>
<li>$d = 4$: weight = $2/256 \approx 0.0078$</li>
<li>$d &gt; 4$: weight = 0</li>
</ul>
<p><strong>Key properties</strong>:</p>
<ul>
<li>Carbon chain intersection coefficients are divided by 2, giving functional groups higher effective weight</li>
<li>Similarity between fingerprints calculated using a normalized Euclidean distance (ratio of L2 norm of difference to L2 norm of sum)</li>
<li>Resulting fingerprints are highly sparse (average 0.001% non-zero elements)</li>
<li>Compressed storage enables efficient database searches</li>
</ul>
</li>
<li>
<p><strong>Markush Structure Compatibility</strong>: SubGrapher processes Markush structures by recognizing their constituent functional groups and creating meaningful fingerprints for similarity searches, achieving higher accuracy than existing OCSR methods on the USPTO-Markush benchmark (S-F1: 88).</p>
</li>
</ol>
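<p>The SVMF construction above can be sketched with a toy number of substructure types. The decay table copies the listed $h_2(d)$ values and $h_1 = 10$; the inputs (<code>n</code> counts, <code>dist</code> graph distances, <code>g</code> intersection coefficients) are stand-ins for the segmentation outputs, and this sketch omits the halving of carbon-chain coefficients:</p>

```python
import numpy as np

H1 = 10.0  # diagonal weight hyperparameter from the paper
H2 = {0: 2.0, 1: 2.0, 2: 0.5, 3: 0.125, 4: 2.0 / 256}

def h2(d):
    """Distance-decay weight for substructures d hops apart in the graph."""
    return H2.get(d, 0.0)  # zero beyond d = 4

def svmf(n, dist, g):
    """Fill a toy SVMF-style matrix from substructure counts `n` (length k),
    graph distances `dist` (k x k), and intersection coefficients `g` (k x k).
    The real fingerprint uses k = 1561 types and compressed triangular storage."""
    k = len(n)
    m = np.zeros((k, k))
    for i in range(k):
        m[i, i] = H1 * n[i] + g[i, i]  # diagonal: weighted count + self-intersection
        for j in range(i + 1, k):
            m[i, j] = m[j, i] = h2(dist[i, j]) * g[i, j]
    return m

def similarity_distance(a, b):
    """Normalized Euclidean distance between two fingerprints:
    ||a - b|| / ||a + b||, as described for fingerprint comparison."""
    return np.linalg.norm(a - b) / np.linalg.norm(a + b)
```

<p>Because $h_2(d)$ vanishes beyond four hops and most substructure pairs never co-occur, the resulting matrices are extremely sparse, which is what makes the compressed storage and fast database search practical.</p>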
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating SubGrapher&rsquo;s effectiveness across two critical tasks: accurate substructure detection and robust molecule retrieval from diverse image collections.</p>
<h4 id="substructure-detection-performance">Substructure Detection Performance</h4>
<p>SubGrapher&rsquo;s ability to identify functional groups was tested on three challenging benchmarks that expose different failure modes of OCSR systems:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
          <th>Key Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>JPO</strong></td>
          <td>341 images</td>
          <td>Japanese Patent Office images (molecules with abbreviations removed)</td>
          <td>Low quality, noise, artifacts, non-standard drawing styles</td>
      </tr>
      <tr>
          <td><strong>USPTO-10K-L</strong></td>
          <td>1,000 images</td>
          <td>Large molecules (&gt;70 atoms)</td>
          <td>Scale variation, structural complexity, many functional groups</td>
      </tr>
      <tr>
          <td><strong>USPTO-Markush</strong></td>
          <td>74 images</td>
          <td>Generic Markush structures</td>
          <td>Variable R-groups, abstract patterns, template representation</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings:</strong></p>
<ol>
<li>
<p><strong>JPO Dataset (Low-Quality Patent Images)</strong>: SubGrapher achieved the highest Molecule Exact Match rate (83%), demonstrating robustness to image quality degradation where rule-based methods like OSRA scored lower (67% M-EM).</p>
</li>
<li>
<p><strong>USPTO-10K-L (Large Molecules)</strong>: SubGrapher achieved an S-F1 of 97, matching the rule-based OSRA and outperforming all other learning-based methods (MolScribe: 90, DECIMER: 86, MolGrapher: 56). The object detection approach handled scale variation better than other deep-learning OCSR tools on these challenging targets.</p>
</li>
<li>
<p><strong>USPTO-Markush (Generic Structures)</strong>: SubGrapher achieved the highest Substructure F1-score (88) on this benchmark, outperforming MolScribe (86), OSRA (74), and DECIMER (10). While other OCSR tools can attempt these images, they have limited support for Markush features. SubGrapher&rsquo;s instance segmentation approach handles complex Markush structures more effectively by focusing on relevant image regions.</p>
</li>
</ol>
<p>Qualitative analysis revealed that SubGrapher correctly identified functional groups in scenarios where other methods failed completely: images with captions, unconventional drawing styles, or significant quality degradation.</p>
<h4 id="visual-fingerprinting-for-molecule-retrieval">Visual Fingerprinting for Molecule Retrieval</h4>
<p>The core application was evaluated using a retrieval task designed to simulate real-world database searching:</p>
<ol>
<li>
<p><strong>Benchmark Creation</strong>: Five benchmark datasets were constructed around structurally similar molecules (adenosine, camphor, cholesterol, limonene, and pyridine), each containing 500 molecules sampled from PubChem with at least 90% Tanimoto similarity to the reference molecule, rendered as augmented images.</p>
</li>
<li>
<p><strong>Retrieval Task</strong>: Given a SMILES string as a query, the goal was to find the corresponding molecular image within the dataset of 500 visually similar structures. This tests whether the visual fingerprint can distinguish between closely related molecules.</p>
</li>
<li>
<p><strong>Performance Comparison</strong>: SubGrapher significantly outperformed baseline methods, retrieving the correct molecule at an average rank of 95 out of 500. The key advantage was robustness: SubGrapher generates a unique fingerprint for every image, even with partial or uncertain predictions. In contrast, OCSR-based methods frequently fail to produce valid SMILES, making them unable to generate fingerprints for comparison.</p>
</li>
<li>
<p><strong>Real-World Case Study</strong>: A practical demonstration involved searching a 54-page patent document containing 356 chemical images for a specific Markush structure. SubGrapher successfully located the target structure, highlighting its utility for large-scale document mining.</p>
</li>
</ol>
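<p>The retrieval protocol above can be sketched as ranking database images by fingerprint similarity to the SMILES-derived query (a minimal dense illustration; the real fingerprints are the sparse SVMF matrices):</p>

```python
import numpy as np

def nes(a, b):
    # Normalized Euclidean similarity, as in the paper: 1 - ||a - b|| / ||a + b||.
    denom = np.linalg.norm(a + b)
    return 1.0 if denom == 0 else 1.0 - np.linalg.norm(a - b) / denom

def retrieval_rank(query_fp, database_fps, target_index):
    """1-based rank of the target image's fingerprint when the database
    is sorted by similarity to the query fingerprint (helper names are
    illustrative)."""
    sims = [nes(query_fp, fp) for fp in database_fps]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return order.index(target_index) + 1
```

Averaging this rank over 50 queries per benchmark yields the reported average-rank metric (e.g., 95 out of 500 for SubGrapher).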
<h4 id="training-data-generation">Training Data Generation</h4>
<p>Since no public datasets existed with the required pixel-level mask annotations for functional groups, the researchers developed a comprehensive synthetic data generation pipeline:</p>
<ol>
<li>
<p><strong>Extended MolDepictor</strong>: They enhanced existing molecular rendering tools to create images from SMILES strings and generate corresponding segmentation masks for all substructures present in each molecule.</p>
</li>
<li>
<p><strong>Markush Structure Rendering</strong>: The pipeline was extended to handle complex generic structures using CXSMILES representations and the CDK library for rendering, creating training data for molecular templates with structural, positional, and frequency variation indicators.</p>
</li>
<li>
<p><strong>Diverse Molecular Sources</strong>: Training molecules were sourced from PubChem to ensure broad chemical diversity and coverage of different structural families.</p>
</li>
</ol>
<h2 id="results-impact-and-limitations">Results, Impact, and Limitations</h2>
<ul>
<li><strong>Superior Robustness to Image Quality</strong>: SubGrapher consistently outperformed traditional OCSR methods on degraded images, particularly the JPO patent dataset. SubGrapher&rsquo;s learned representations proved more resilient to noise, artifacts, and unconventional drawing styles than rule-based alternatives like OSRA (M-EM: 83 vs. 67 on JPO).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SubGrapher</th>
          <th>MolScribe</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>MolGrapher</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>S-F1</strong> (JPO)</td>
          <td>92</td>
          <td><strong>94</strong></td>
          <td>81</td>
          <td>86</td>
          <td>89</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (JPO)</td>
          <td><strong>83</strong></td>
          <td>82</td>
          <td>67</td>
          <td>79</td>
          <td>80</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-10K-L)</td>
          <td><strong>97</strong></td>
          <td>90</td>
          <td><strong>97</strong></td>
          <td>86</td>
          <td>56</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-10K-L)</td>
          <td>55</td>
          <td>55</td>
          <td><strong>75</strong></td>
          <td>66</td>
          <td>31</td>
      </tr>
      <tr>
          <td><strong>S-F1</strong> (USPTO-Markush)</td>
          <td><strong>88</strong></td>
          <td>86</td>
          <td>74</td>
          <td>10</td>
          <td>35</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (USPTO-Markush)</td>
          <td>82</td>
          <td><strong>86</strong></td>
          <td>70</td>
          <td>11</td>
          <td>30</td>
      </tr>
      <tr>
          <td><strong>Avg Retrieval Rank</strong></td>
          <td><strong>95/500</strong></td>
          <td>181-241/500</td>
          <td>138-185/500</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<p>Note: Retrieval rank ranges reflect the best and worst fingerprint method pairing for each OCSR model (RDKit Daylight or MHFP).</p>
<ul>
<li>
<p><strong>Effective Handling of Scale and Complexity</strong>: The instance segmentation approach successfully managed large molecules and complex structures where traditional graph-reconstruction methods struggled. The Substructure F1-scores on USPTO-10K-L and USPTO-Markush benchmarks demonstrated clear advantages for challenging molecular targets.</p>
</li>
<li>
<p><strong>Markush Structure Processing</strong>: SubGrapher achieves the highest Substructure F1-score on Markush structures (88 vs. MolScribe&rsquo;s 86 and OSRA&rsquo;s 74). While other OCSR methods can attempt Markush images, they support only limited features such as abbreviation-based variable groups. SubGrapher handles complex Markush features more effectively, expanding the scope of automatically extractable chemical information from patent literature.</p>
</li>
<li>
<p><strong>Robust Molecule Retrieval Performance</strong>: The visual fingerprinting approach achieved reliable retrieval performance (average rank 95/500) across diverse molecular families. The key advantage was consistency: SubGrapher generates meaningful fingerprints even from partial or uncertain predictions, while OCSR-based methods often fail to produce any usable output.</p>
</li>
<li>
<p><strong>Practical Document Mining Capability</strong>: The successful identification of specific Markush structures within large patent documents (54 pages, 356 images) demonstrates real-world applicability for large-scale literature mining and intellectual property analysis.</p>
</li>
<li>
<p><strong>Single-Stage Architecture Benefits</strong>: By eliminating the traditional image → structure → fingerprint pipeline, SubGrapher avoids error accumulation from failed molecular reconstructions. Every input image produces a fingerprint, making the system more reliable for batch processing of diverse document collections.</p>
</li>
<li>
<p><strong>Limitations and Scope</strong>: The method remains focused on common organic functional groups and may struggle with inorganic chemistry, organometallic complexes, or highly specialized molecular classes not well-represented in the training data. The 1,534 functional groups, while extensive, represent a curated subset of chemical space. SubGrapher also cannot distinguish enantiomers, as the detected substructures lack stereochemistry information. Additionally, the method currently cannot recognize substructures in abbreviations or single-atom fragments.</p>
</li>
</ul>
<p>The work demonstrates that direct fingerprint generation can be more robust and practical than traditional structure reconstruction approaches. SubGrapher&rsquo;s ability to handle Markush structures and degraded images makes it particularly valuable for patent analysis and large-scale document mining, where traditional OCSR methods frequently fail. The approach suggests that task-specific learning (fingerprints for retrieval) can outperform general-purpose reconstruction methods in many practical applications.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data Generation</strong>: The paper developed a custom synthetic data pipeline since no public datasets existed with pixel-level mask annotations for functional groups:</p>
<ul>
<li><strong>Extended MolDepictor</strong>: Enhanced molecular rendering tool to generate both images and corresponding segmentation masks for all substructures</li>
<li><strong>Markush Structure Rendering</strong>: Pipeline extended to handle complex generic structures</li>
<li><strong>Source Molecules</strong>: PubChem for broad chemical diversity</li>
</ul>
<p><strong>Evaluation Benchmarks</strong>:</p>
<ul>
<li><strong>JPO Dataset</strong>: Real patent images with poor resolution, noise, and artifacts</li>
<li><strong>USPTO-10K-L</strong>: Large complex molecular structures</li>
<li><strong>USPTO-Markush</strong>: Generic structures with variable R-groups</li>
<li><strong>Retrieval Benchmarks</strong>: Five datasets (adenosine, camphor, cholesterol, limonene, pyridine), each with 500 similar molecular images</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Dual instance segmentation system using Mask-RCNN</p>
<ul>
<li><strong>Functional Group Detector</strong>: Mask-RCNN trained to identify 1,534 expert-defined functional groups</li>
<li><strong>Carbon Backbone Detector</strong>: Mask-RCNN trained to recognize 27 common carbon chain patterns</li>
<li><strong>Backbone Network</strong>: Not specified in the paper</li>
</ul>
<p><strong>Functional Group Knowledge Base</strong>: 1,534 substructures systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (at least ~1,000 occurrences in PubChem)</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Definition</strong>:</p>
<ul>
<li><strong>1,534 Functional Groups</strong>: Defined by manually curated SMARTS patterns
<ul>
<li>Must contain heteroatoms (O, N, S, P, B)</li>
<li>Frequency threshold: at least ~1,000 occurrences in PubChem</li>
<li>Systematically constructed from chemically logical atom combinations</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li><strong>27 Carbon Backbones</strong>: Patterns of 3-6 carbon atoms (rings and chains) to capture molecular scaffolds</li>
</ul>
<p><strong>Substructure-Graph Construction</strong>:</p>
<ol>
<li>Detect functional groups and carbon backbones using Mask-RCNN models</li>
<li>Build connectivity graph:
<ul>
<li>Each node represents an identified substructure instance</li>
<li>Edges connect substructures whose bounding boxes overlap</li>
<li>Bounding boxes expanded by 10% of smallest box&rsquo;s diagonal to ensure connectivity between adjacent groups</li>
<li>Carbon chain intersection coefficients divided by 2, giving functional groups higher effective weight</li>
</ul>
</li>
</ol>
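<p>A minimal sketch of the connectivity rule, assuming detections are simple <code>(label, box)</code> tuples and that the 10% margin is applied to every box (the paper states the margin relative to the smallest box's diagonal; how it is applied per pair is our reading):</p>

```python
import itertools
import math

def boxes_overlap(b1, b2):
    # Axis-aligned overlap test; boxes are (x0, y0, x1, y1).
    return b1[0] < b2[2] and b2[0] < b1[2] and b1[1] < b2[3] and b2[1] < b1[3]

def expand(box, margin):
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)

def build_substructure_graph(detections):
    """Edges connect detected substructures whose expanded bounding
    boxes overlap. detections: list of (label, (x0, y0, x1, y1))."""
    diagonals = [math.hypot(x1 - x0, y1 - y0)
                 for _, (x0, y0, x1, y1) in detections]
    margin = 0.10 * min(diagonals)  # 10% of the smallest box's diagonal
    edges = []
    for (i, (_, bi)), (j, (_, bj)) in itertools.combinations(
            enumerate(detections), 2):
        if boxes_overlap(expand(bi, margin), expand(bj, margin)):
            edges.append((i, j))
    return edges
```

The halving of carbon-chain intersection coefficients happens downstream, when edge weights enter the fingerprint, so it is not shown here.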
<p><strong>SVMF Fingerprint Generation</strong>:</p>
<ul>
<li>Matrix form: $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$</li>
<li>Stored as compressed sparse upper triangular matrix</li>
<li><strong>Diagonal elements</strong>: $SVMF_{ii} = h_1 \cdot n_i + g_{ii}$ where $h_1 = 10$</li>
<li><strong>Off-diagonal elements</strong>: $SVMF_{ij} = h_2(d) \cdot \text{intersection}(s_i, s_j)$ where:
<ul>
<li>$h_2(d) = 2$ for $d = 0, 1$</li>
<li>$h_2(2) = 2/4$, $h_2(3) = 2/16$, $h_2(4) = 2/256$</li>
<li>$h_2(d) = 0$ for $d &gt; 4$</li>
</ul>
</li>
<li>Average sparsity: 0.001% non-zero elements</li>
<li>Similarity metric: Normalized Euclidean distance (L2 norm of difference divided by L2 norm of sum)</li>
</ul>
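<p>The SVMF assembly can be sketched as below. The diagonal self-term $g_{ii}$ is omitted because its definition is not given here, so only the count term ($h_1 \cdot n_i$) and the distance-decayed off-diagonal entries are shown:</p>

```python
import numpy as np
from scipy.sparse import dok_matrix

H1 = 10.0  # diagonal count weight from the paper

def h2(d):
    # Distance decay from the paper: 2 for d in {0, 1}, then 2/4, 2/16,
    # 2/256, and 0 beyond graph distance 4.
    table = {0: 2.0, 1: 2.0, 2: 2.0 / 4, 3: 2.0 / 16, 4: 2.0 / 256}
    return table.get(d, 0.0)

def build_svmf(counts, pair_terms, n=1561):
    """Assemble a sparse upper-triangular SVMF (illustrative sketch).
    counts: {substructure_index: occurrence count n_i}
    pair_terms: {(i, j): (graph_distance, intersection_coefficient)}"""
    mat = dok_matrix((n, n))
    for i, n_i in counts.items():
        mat[i, i] = H1 * n_i  # self-intersection term g_ii omitted
    for (i, j), (d, inter) in pair_terms.items():
        mat[min(i, j), max(i, j)] += h2(d) * inter
    return mat.tocsr()
```

With $n = 1561$ (1,534 functional groups plus 27 carbon backbones) and only a handful of substructures per molecule, the resulting matrix is overwhelmingly zero, matching the reported 0.001% density.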
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Substructure F1-score (S-F1)</strong>: Harmonic mean of precision and recall for individual substructure detection across all molecules in the dataset</li>
<li><strong>Molecule Exact Match (M-EM)</strong>: Percentage of molecules where S-F1 = 1.0 (all substructures correctly identified)</li>
<li><strong>Retrieval Rank</strong>: Average rank of ground truth molecule in candidate list of 500 similar structures when querying with SMILES fingerprint, averaged across 50 queries per benchmark</li>
</ul>
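<p>A sketch of the two detection metrics, reading S-F1 as a multiset precision/recall over predicted substructure labels (the paper aggregates across the whole dataset; this per-molecule version is illustrative):</p>

```python
from collections import Counter

def substructure_f1(pred, true):
    """Per-molecule S-F1: multiset overlap of predicted vs. ground-truth
    substructure labels, combined as harmonic mean of precision/recall."""
    pred_c, true_c = Counter(pred), Counter(true)
    tp = sum((pred_c & true_c).values())  # multiset intersection
    if tp == 0:
        return 0.0
    precision = tp / sum(pred_c.values())
    recall = tp / sum(true_c.values())
    return 2 * precision * recall / (precision + recall)

def molecule_exact_match(pairs):
    """M-EM: fraction of (pred, true) pairs with a perfect S-F1 of 1.0."""
    return sum(substructure_f1(p, t) == 1.0 for p, t in pairs) / len(pairs)
```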
<p><strong>Baselines</strong>: Compared against SOTA OCSR methods:</p>
<ul>
<li>Deep learning: MolScribe, MolGrapher, DECIMER</li>
<li>Rule-based: OSRA</li>
<li>Fingerprint methods: RDKit Daylight, MHFP (applied to OCSR outputs)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified. Hardware details for training and inference are not provided in the main text; if reported at all, they would be in the supplementary materials.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/SubGrapher">SubGrapher (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official inference code</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/docling-project/SubGrapher">SubGrapher (HuggingFace)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/docling-project/SubGrapher-Datasets">SubGrapher-Datasets (HuggingFace)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Visual fingerprinting benchmark datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="implementation-gaps">Implementation Gaps</h3>
<p>The following details are not available in the paper and would require access to the code repository or supplementary information:</p>
<ul>
<li>Specific backbone architecture for Mask-RCNN (ResNet variant, Swin Transformer, etc.)</li>
<li>Optimizer type (AdamW, SGD, etc.)</li>
<li>Learning rate and scheduler</li>
<li>Batch size and number of training epochs</li>
<li>Loss function weights (box loss vs. mask loss)</li>
<li>GPU/TPU specifications used for training</li>
<li>Training time and computational requirements</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Meijer, G. I., Weber, V., Van Gool, L., &amp; Staar, P. W. J. (2025). SubGrapher: Visual fingerprinting of chemical structures. Journal of Cheminformatics, 17(1), 149. <a href="https://doi.org/10.1186/s13321-025-01091-4">https://doi.org/10.1186/s13321-025-01091-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{morinSubGrapherVisualFingerprinting2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SubGrapher: Visual Fingerprinting of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{SubGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Valéry and Van Gool, Luc and Staar, Peter W. J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{149}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-025-01091-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>3D Steerable CNNs: Rotationally Equivariant Features</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/3d-steerable-cnns/</link><pubDate>Thu, 16 Jan 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/3d-steerable-cnns/</guid><description>Weiler et al.'s NeurIPS 2018 paper introducing SE(3)-equivariant CNNs for volumetric data using group theory and spherical harmonics.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces a novel neural network architecture, the 3D Steerable CNN. It provides a comprehensive theoretical derivation for the architecture grounded in group representation theory and demonstrates its practical application.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The work is motivated by the prevalence of <strong>symmetry</strong> in problems from the natural sciences. Standard 3D CNNs lack inherent equivariance to 3D rotations, a fundamental symmetry governed by the SE(3) group in many scientific datasets like molecular or protein structures. Building this symmetry directly into the model architecture as an <strong>inductive bias</strong> is expected to yield more data-efficient, generalizable, and physically meaningful models.</p>















<figure class="post-figure center ">
    <img src="/img/notes/3d-cnn-versus-3d-steerable-cnn.webp"
         alt="Comparison of standard 3D CNN versus 3D Steerable CNN for handling rotational symmetry. Panel A shows how standard CNNs produce distorted outputs when inputs are rotated, requiring data augmentation. Panel B shows how Steerable CNNs use spherical harmonic kernel bases to produce equivariant geometric field outputs that transform predictably under rotation."
         title="Comparison of standard 3D CNN versus 3D Steerable CNN for handling rotational symmetry. Panel A shows how standard CNNs produce distorted outputs when inputs are rotated, requiring data augmentation. Panel B shows how Steerable CNNs use spherical harmonic kernel bases to produce equivariant geometric field outputs that transform predictably under rotation."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Standard 3D CNNs (Panel A) produce inconsistent feature maps when inputs are rotated, requiring expensive data augmentation. 3D Steerable CNNs (Panel B) use analytically-derived spherical harmonic kernels to produce geometric field outputs that transform equivariantly under rotation.</figcaption>
    
</figure>

<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the rigorous and practical construction of a CNN architecture that is equivariant to 3D rigid body motions (SE(3) group). The key contributions are:</p>
<ul>
<li><strong>Geometric Feature Representation</strong>: Features are modeled as geometric <strong>fields</strong> (collections of scalars, vectors, and higher-order tensors) defined over $\mathbb{R}^{3}$. Each type of feature transforms according to an <strong>irreducible representation (irrep)</strong> of the rotation group SO(3).</li>
<li><strong>General Equivariant Convolution</strong>: The paper proves that the most general form of an SE(3)-equivariant linear map between these fields is a convolution with a <strong>rotation-steerable kernel</strong>.</li>
<li><strong>Analytical Kernel Basis</strong>: The main theoretical breakthrough is the analytical derivation of a complete basis for these steerable kernels. They solve the kernel&rsquo;s equivariance constraint, $\kappa(rx) = D^{j}(r)\kappa(x)D^{l}(r)^{-1}$, showing the solutions are functions whose angular components are <strong>spherical harmonics</strong>. The network&rsquo;s kernels are then parameterized as a learnable linear combination of these pre-computed basis functions, making the implementation a minor modification to standard 3D convolutions.</li>
</ul>
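<p>For intuition, the $l=1$ slice of the kernel basis (the three vector-type kernels, proportional to $x/r$, $y/r$, $z/r$ times a Gaussian radial shell) can be sampled on a voxel grid as below; the shell parameters are arbitrary illustrative choices, not the paper's:</p>

```python
import numpy as np

def radial_shell(r, mean, sigma=0.6):
    # Gaussian radial profile exp(-(|x| - m)^2 / (2 sigma^2)).
    return np.exp(-0.5 * ((r - mean) / sigma) ** 2)

def steerable_basis_l1(size=5, shell_mean=1.5):
    """Sample the three l=1 basis kernels on a size^3 voxel grid.
    A learned kernel is a linear combination of such basis elements;
    this sketch covers only the vector (l=1) angular part."""
    c = np.arange(size) - (size - 1) / 2.0
    x, y, z = np.meshgrid(c, c, c, indexing="ij")
    r = np.sqrt(x**2 + y**2 + z**2)
    safe_r = np.where(r == 0, 1.0, r)  # avoid division by zero at origin
    shell = radial_shell(r, shell_mean)
    # Real l=1 spherical harmonics are proportional to x/r, y/r, z/r.
    return np.stack([x / safe_r * shell,
                     y / safe_r * shell,
                     z / safe_r * shell])  # shape (3, size, size, size)
```

Each basis kernel is odd under reflection of its own axis, exactly as a p-orbital is, which is the grid-level fingerprint of its $l=1$ transformation behavior.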















<figure class="post-figure center ">
    <img src="/img/notes/spherical-harmonics.webp"
         alt="Spherical harmonics visualization showing the angular basis functions organized by degree l (rows) and order m (columns). Row 0 shows the single s-type orbital (l=0), row 1 shows three p-type orbitals (l=1), row 2 shows five d-type orbitals (l=2), and row 3 shows seven f-type orbitals (l=3)."
         title="Spherical harmonics visualization showing the angular basis functions organized by degree l (rows) and order m (columns). Row 0 shows the single s-type orbital (l=0), row 1 shows three p-type orbitals (l=1), row 2 shows five d-type orbitals (l=2), and row 3 shows seven f-type orbitals (l=3)."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Spherical harmonics $Y_l^m$ organized by degree $l$ (rows) and order $m$ (columns). These functions form the angular basis for steerable kernels: $l=0$ (scalar), $l=1$ (vector/p-orbital), $l=2$ (rank-2 tensor/d-orbital), $l=3$ (rank-3 tensor/f-orbital). Each degree $l$ has $2l+1$ components.</figcaption>
    
</figure>

<ul>
<li><strong>Equivariant Nonlinearity</strong>: A novel <strong>gated nonlinearity</strong> is proposed for non-scalar features. It preserves equivariance by multiplying a feature field by a separately computed, learned scalar field (the gate).</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The model&rsquo;s performance was evaluated on a series of tasks with inherent rotational symmetry:</p>
<ol>
<li><strong>Tetris Classification</strong>: A toy problem to empirically validate the model&rsquo;s rotational equivariance by training on aligned blocks and testing on randomly rotated ones.</li>
<li><strong>SHREC17 3D Model Classification</strong>: A benchmark for classifying complex 3D shapes that are arbitrarily rotated.</li>
<li><strong>Amino Acid Propensity Prediction</strong>: A scientific application to predict amino acid types from their 3D atomic environments.</li>
<li><strong>CATH Protein Structure Classification</strong>: A challenging task on a new dataset introduced by the authors, requiring classification of global protein architecture, a problem with full SE(3) invariance.</li>
</ol>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The 3D Steerable CNN demonstrated clear advantages due to its built-in equivariance:</p>
<ul>
<li>It was empirically confirmed to be <strong>rotationally equivariant</strong>, achieving $99 \pm 2\%$ test accuracy on the rotated Tetris dataset (averaged over 17 runs), compared to a standard 3D CNN&rsquo;s $27 \pm 7\%$ accuracy.</li>
<li>On the amino acid prediction task, the model achieved 0.58 accuracy, compared to 0.50 (regular-grid) and 0.56 (concentric-grid) baselines, using roughly half the parameters. On SHREC17 it reached a total score (micro + macro mAP) of 1.11, compared to 1.13 for the leading contemporary system.</li>
<li>On the CATH protein classification task, it <strong>outperformed a deep 3D CNN baseline</strong> while using ~110x fewer parameters. This performance gap widened as the training data was reduced, highlighting the model&rsquo;s superior <strong>data efficiency</strong>.</li>
</ul>
<p>The paper concludes that 3D Steerable CNNs provide a universal and effective framework for incorporating SE(3) symmetry into deep learning models, leading to improved accuracy and efficiency for tasks involving volumetric data, particularly in scientific domains.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Input Format</strong>: All inputs must be voxelized. Point clouds require voxelization before use.
<ul>
<li><strong>Proteins (CATH)</strong>: $50^3$ grid, 0.2 nm voxel size. Simplified to $C_\alpha$ atoms only; Gaussian density placed at each atom position.</li>
<li><strong>3D Objects (SHREC17)</strong>: $64^3$ voxel grids.</li>
<li><strong>Tetris</strong>: $36^3$ input grid.</li>
</ul>
</li>
<li><strong>Splitting Strategy</strong>: CATH used a 10-fold split (7 train, 1 val, 2 test) strictly separated by &ldquo;superfamily&rdquo; level to prevent data leakage (&lt;40% sequence identity).</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Kernel Basis Construction</strong>:</p>
<ul>
<li>Constructed from <strong>Spherical Harmonics</strong> multiplied by <strong>Gaussian Radial Shells</strong>: $\exp\left(-\frac{1}{2}(|x|-m)^{2}/\sigma^{2}\right)$</li>
<li><strong>Anti-aliasing</strong>: A radially dependent angular frequency cutoff ($J_{\max}$) is applied to prevent aliasing near the origin.</li>
</ul>
<p><strong>Normalization</strong>: Uses <strong>Equivariant Batch Norm</strong>. Non-scalar fields are normalized by the average of their norms.</p>
<p><strong>Downsampling</strong>: Standard strided convolution breaks equivariance. The architecture uses <strong>low-pass filtering</strong> (Gaussian blur) before downsampling to maintain equivariance.</p>
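<p>A sketch of this antialiased downsampling for a scalar feature field (the <code>sigma = stride / 2</code> default is an illustrative choice, not the paper's stated value):</p>

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_then_stride(field, stride=2, sigma=None):
    """Downsample a scalar field by Gaussian low-pass filtering followed
    by subsampling, rather than bare strided convolution, so that high
    frequencies do not alias and break rotational equivariance."""
    if sigma is None:
        sigma = stride / 2.0  # heuristic: blur radius tied to stride
    return gaussian_filter(field, sigma=sigma)[::stride, ::stride, ::stride]
```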
<p><strong>Exact Architecture Configurations</strong>:</p>
<p><strong>Tetris Architecture</strong> (4 layers):</p>
<table>
  <thead>
      <tr>
          <th>Layer</th>
          <th>Field Types</th>
          <th>Spatial Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Input</td>
          <td>1 scalar</td>
          <td>$36^3$</td>
      </tr>
      <tr>
          <td>Layer 1</td>
          <td>4 scalars, 4 vectors ($l=1$), 4 tensors ($l=2$), 1 tensor ($l=3$)</td>
          <td>$40^3$</td>
      </tr>
      <tr>
          <td>Layer 2</td>
          <td>16 scalars, 16 vectors, 16 tensors ($l=2$)</td>
          <td>$22^3$ (stride 2)</td>
      </tr>
      <tr>
          <td>Layer 3</td>
          <td>32 scalars, 16 vectors, 16 tensors ($l=2$)</td>
          <td>$13^3$ (stride 2)</td>
      </tr>
      <tr>
          <td>Layer 4</td>
          <td>128 scalars</td>
          <td>$17^3$</td>
      </tr>
      <tr>
          <td>Output</td>
          <td>8 scalars (global average pool)</td>
          <td>$1$</td>
      </tr>
  </tbody>
</table>
<p><strong>SHREC17 Architecture</strong> (8 layers):</p>
<table>
  <thead>
      <tr>
          <th>Layers</th>
          <th>Field Types</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1-2</td>
          <td>8 scalars, 4 vectors, 2 tensors ($l=2$)</td>
      </tr>
      <tr>
          <td>3-4</td>
          <td>16 scalars, 8 vectors, 4 tensors</td>
      </tr>
      <tr>
          <td>5-7</td>
          <td>32 scalars, 16 vectors, 8 tensors</td>
      </tr>
      <tr>
          <td>8</td>
          <td>512 scalars</td>
      </tr>
      <tr>
          <td>Output</td>
          <td>55 scalars (classes)</td>
      </tr>
  </tbody>
</table>
<p><strong>CATH Architecture</strong> (ResNet34-inspired):</p>
<p>Block progression: <code>(2,2,2,2)</code>, <code>(4,4,4,4)</code>, <code>(8,8,8,8)</code>, <code>(16,16,16,16)</code></p>
<p>Notation: <code>(a,b,c,d)</code> = $a$ scalars ($l=0$), $b$ vectors ($l=1$), $c$ rank-2 tensors ($l=2$), $d$ rank-3 tensors ($l=3$).</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Parameter Counts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CATH</td>
          <td>3D Steerable CNN</td>
          <td>143,560</td>
      </tr>
      <tr>
          <td>CATH</td>
          <td>Baseline (ResNet34-style)</td>
          <td>15,878,764</td>
      </tr>
      <tr>
          <td>Amino Acid</td>
          <td>3D Steerable CNN</td>
          <td>~32,600,000</td>
      </tr>
      <tr>
          <td>Amino Acid</td>
          <td>Regular grid baseline</td>
          <td>~61,100,000</td>
      </tr>
      <tr>
          <td>Amino Acid</td>
          <td>Concentric grid baseline</td>
          <td>~75,300,000</td>
      </tr>
  </tbody>
</table>
<p>Note: The concentric grid baseline groups voxels by distance from the molecular center, reflecting that atomic interactions are primarily distance-dependent (Torng, W., &amp; Altman, R. B. (2017). 3D deep convolutional neural networks for amino acid environment similarity analysis. <em>BMC Bioinformatics</em>, 18, 302). Amino acid parameter counts are rounded figures as reported in the paper.</p>
<p><strong>Hyperparameters &amp; Training</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>Adam</td>
      </tr>
      <tr>
          <td><strong>LR Scheduler</strong></td>
          <td>Exponential decay (0.94/epoch) after 40 epoch burn-in</td>
      </tr>
      <tr>
          <td><strong>Dropout</strong> (CATH)</td>
          <td>0.1 (Capsule-wide convolutional dropout)</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong> (CATH)</td>
          <td>L1 &amp; L2 regularization: $10^{-8.5}$</td>
      </tr>
  </tbody>
</table>
<p><strong>Mathematical Formulations for Equivariance</strong>:</p>
<p>Standard operations like Batch Normalization and ReLU break rotational equivariance. The paper derives equivariant alternatives:</p>
<p><strong>Equivariant Batch Normalization</strong>:</p>
<p>Standard BN subtracts a mean, which introduces a preferred direction and breaks symmetry. <strong>Norm-based normalization</strong> scales feature fields by the average of their squared norms to preserve symmetry:</p>
<p>$$f_{i}(x) \mapsto f_{i}(x) \left( \frac{1}{|B|} \sum_{j \in B} \frac{1}{V} \int dx |f_{j}(x)|^{2} + \epsilon \right)^{-1/2}$$</p>
<p>This scales vector lengths to unit variance on average while avoiding mean subtraction, preserving directional information and symmetry.</p>
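<p>In code, the norm-based normalization for a single field type might look like this (an inference-style NumPy sketch without running statistics):</p>

```python
import numpy as np

def equivariant_batch_norm(fields, eps=1e-5):
    """Norm-based normalization for one feature field of type l.
    fields: array of shape (batch, 2l+1, D, H, W). Divides by the square
    root of the batch- and space-averaged squared norm; no mean is
    subtracted, so no preferred direction is introduced."""
    sq_norm = (fields ** 2).sum(axis=1)    # |f_j(x)|^2 per voxel
    avg = sq_norm.mean(axis=(0, 1, 2, 3))  # average over batch and space
    return fields / np.sqrt(avg + eps)
```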
<p><strong>Equivariant Nonlinearities</strong>:</p>
<p>Applying ReLU to vector components independently breaks equivariance (this depends on the coordinate frame). Two approaches:</p>
<ol>
<li>
<p><strong>Norm Nonlinearity</strong> (geometric shrinking): Acts on magnitude, preserves direction. Shrinks vectors shorter than learned bias $\beta$ to zero:
$$f(x) \mapsto \text{ReLU}(|f(x)| - \beta) \frac{f(x)}{|f(x)|}$$
<em>Note: Found to converge slowly; omitted from final models.</em></p>
</li>
<li>
<p><strong>Gated Nonlinearity</strong> (used in practice): A separate scalar field $s(x)$ passes through sigmoid to create a gate $\sigma(s(x))$, which multiplies the geometric field:
$$f_{\text{out}}(x) = f_{\text{in}}(x) \cdot \sigma(s(x))$$
<em>Architecture implication: Requires extra scalar channels ($l=0$) specifically for gating higher-order channels ($l&gt;0$).</em></p>
</li>
</ol>
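<p>The gated variant can be sketched under the same assumed array layout, with a scalar ($l=0$) channel gating a vector ($l=1$) channel:</p>

```python
import numpy as np

def gated_nonlinearity(f_geom, s_scalar):
    """Rescale a geometric field by the sigmoid of a scalar gate field.

    f_geom:   (batch, dim, X, Y, Z) vector field (dim=3 for l=1)
    s_scalar: (batch, X, Y, Z) scalar (l=0) gate channel

    Only vector lengths change; directions are untouched, so the
    operation commutes with rotations.
    """
    gate = 1.0 / (1.0 + np.exp(-s_scalar))  # sigmoid
    return f_geom * gate[:, None]           # broadcast over the vector dim
```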
<p><strong>Voxelization Details</strong>:</p>
<p>For CATH protein inputs, Gaussian density is placed at each atom position with standard deviation equal to <strong>half the voxel width</strong> ($0.5 \times 0.2\text{ nm} = 0.1\text{ nm}$).</p>
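<p>The atom-to-voxel smearing can be sketched as follows. The grid size and the assumption that coordinates already lie inside the box are ours; the paper fixes only the 0.2 nm voxel width and the resulting 0.1 nm standard deviation:</p>

```python
import numpy as np

def voxelize(positions, grid_size=16, voxel_width=0.2):
    """Deposit a Gaussian of std = 0.5 * voxel_width at each atom.

    positions: (N, 3) coordinates in nm, assumed inside the grid box.
    Returns a (grid_size, grid_size, grid_size) density volume.
    """
    sigma = 0.5 * voxel_width                      # 0.1 nm for 0.2 nm voxels
    centers = (np.arange(grid_size) + 0.5) * voxel_width
    X, Y, Z = np.meshgrid(centers, centers, centers, indexing="ij")
    density = np.zeros((grid_size,) * 3)
    for x, y, z in positions:
        d2 = (X - x) ** 2 + (Y - y) ** 2 + (Z - z) ** 2
        density += np.exp(-d2 / (2 * sigma ** 2))
    return density
```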
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Steerable CNN</th>
          <th>Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tetris (rotated test)</td>
          <td>Accuracy</td>
          <td>$99 \pm 2\%$</td>
          <td>$27 \pm 7\%$ (standard 3D CNN)</td>
      </tr>
      <tr>
          <td>Amino Acid Propensity</td>
          <td>Accuracy</td>
          <td><strong>0.58</strong> (32.6M params)</td>
          <td>0.50 (regular grid, 61.1M params); 0.56 (concentric grid, 75.3M params)</td>
      </tr>
      <tr>
          <td>SHREC17</td>
          <td>micro + macro mAP (higher is better)</td>
          <td>1.11</td>
          <td>1.13 (SOTA)</td>
      </tr>
      <tr>
          <td>CATH</td>
          <td>Accuracy</td>
          <td>Higher accuracy at every training set size (per-size curves in Figure 4; no single aggregate value reported) (143,560 params)</td>
          <td>Deep 3D CNN (15,878,764 params; ~110x more)</td>
      </tr>
  </tbody>
</table>
<p>Note: On SHREC17, from Table 4 in the supplementary material, the steerable CNN achieves micro mAP = 0.661 and macro mAP = 0.449, for a total of 1.11. On CATH, the steerable CNN outperformed the baseline with ~110x fewer parameters, a gap that widened as training data was reduced.</p>
<h2 id="historical-context-from-peer-reviews">Historical Context (From Peer Reviews)</h2>
<p>The NeurIPS peer reviews reveal important context about the paper&rsquo;s structure and claims:</p>
<ul>
<li>
<p><strong>Evolution of Experiments</strong>: The <strong>SHREC17</strong> experiment and the <strong>arbitrary rotation</strong> test in Tetris were added during the rebuttal phase to address reviewer concerns about the lack of standard computer vision benchmarks. This explains why SHREC17 feels somewhat disconnected from the paper&rsquo;s &ldquo;AI for Science&rdquo; narrative.</p>
</li>
<li>
<p><strong>Continuous vs. Discrete Rotations</strong>: The Tetris experiment validates equivariance to <strong>continuous</strong> ($SO(3)$) rotations alongside discrete 90-degree turns. This distinction is crucial and separates Steerable CNNs from earlier Group CNNs that handled discrete rotation groups exclusively.</p>
</li>
<li>
<p><strong>Terminology Warning</strong>: Reviewers critiqued terms like &ldquo;fiber&rdquo; and &ldquo;induced representation&rdquo; as denser than necessary and inconsistent with related work (e.g., Tensor Field Networks). If you find Section 3 difficult, know that this is a recognized barrier of the paper; focus on the resulting kernel constraints.</p>
</li>
<li>
<p><strong>Parameter Efficiency Quantified</strong>: Reviewers highlighted that the main practical win is <strong>parameter efficiency</strong>. Standard 3D CNNs hit diminishing returns around $10^7$ parameters, while Steerable CNNs achieve better results with ~110x fewer parameters ($10^5$).</p>
</li>
</ul>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/ENLJACPHSEA?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mariogeiger/se3cnn">se3cnn (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation; superseded by <a href="https://github.com/e3nn/e3nn">e3nn</a> for point cloud applications</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wouterboomsma/cath_datasets">CATH Datasets (GitHub)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Protein structure classification dataset introduced in this paper</td>
      </tr>
  </tbody>
</table>
<p>Pre-trained model weights are not publicly released. Hardware and compute requirements are not specified in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weiler, M., Geiger, M., Welling, M., Boomsma, W., &amp; Cohen, T. S. (2018). 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. <em>Advances in Neural Information Processing Systems</em>, 31. <a href="https://proceedings.neurips.cc/paper/2018/hash/488e4104520c6aab692863cc1dba45af-Abstract.html">https://proceedings.neurips.cc/paper/2018/hash/488e4104520c6aab692863cc1dba45af-Abstract.html</a></p>
<p><strong>Publication</strong>: NeurIPS 2018</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{weiler20183d,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Weiler, Maurice and Geiger, Mario and Welling, Max and Boomsma, Wouter and Cohen, Taco S}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/mariogeiger/se3cnn">GitHub Repository</a></li>
<li><a href="https://www.youtube.com/watch?v=ENLJACPHSEA">YouTube Video</a></li>
<li><a href="https://github.com/wouterboomsma/cath_datasets">CATH Dataset</a></li>
</ul>
]]></content:encoded></item><item><title>RFL: Simplifying Chemical Structure Recognition (AAAI 2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/rfl/</guid><description>Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD) for improved optical chemical structure recognition from molecular images.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$). It introduces a novel representation system (Ring-Free Language) and a specialized neural architecture (Molecular Skeleton Decoder) designed to solve specific limitations in converting 2D images to 1D chemical strings. The paper validates this method through direct comparison with existing baselines and ablation studies.</p>
<h2 id="motivation-limitations-of-1d-serialization">Motivation: Limitations of 1D Serialization</h2>
<p>Current Optical Chemical Structure Recognition (OCSR) methods typically rely on &ldquo;unstructured modeling,&rdquo; where 2D molecular graphs are flattened into 1D strings like SMILES or SSML. While simple, these linear formats struggle to explicitly capture complex spatial relationships, particularly in molecules with multiple rings and branches. End-to-end models often fail to &ldquo;understand&rdquo; the graph structure when forced to predict these implicit 1D sequences, leading to error accumulation in complex scenarios.</p>
<h2 id="innovation-ring-free-language-rfl-and-molecular-skeleton-decoder-msd">Innovation: Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD)</h2>
<p>The authors propose two primary contributions to decouple spatial complexity:</p>
<ol>
<li><strong>Ring-Free Language (RFL)</strong>: A divide-and-conquer representation that splits a molecular graph $G$ into three explicit components: a molecular skeleton $\mathcal{S}$, individual ring structures $\mathcal{R}$, and branch information $\mathcal{F}$. This allows rings to be collapsed into &ldquo;SuperAtoms&rdquo; or &ldquo;SuperBonds&rdquo; during initial parsing.</li>
<li><strong>Molecular Skeleton Decoder (MSD)</strong>: A hierarchical architecture that progressively predicts the skeleton first, then the individual rings (using SuperAtom features as conditions), and finally classifies the branch connections.</li>
</ol>
<h2 id="methodology-and-experiments">Methodology and Experiments</h2>
<p>The method was evaluated on both handwritten and printed chemical structures against two baselines: DenseWAP (Zhang et al. 2018) and RCGD (Hu et al. 2023).</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>EDU-CHEMC</strong>: ~49k handwritten samples (challenging, diverse styles)</li>
<li><strong>Mini-CASIA-CSDB</strong>: ~89k printed samples (from ChEMBL)</li>
<li><strong>Synthetic Complexity Dataset</strong>: A custom split of ChEMBL data grouped by structural complexity (atoms + bonds + rings) to test generalization</li>
</ul>
</li>
<li><strong>Ablation Studies</strong> (Table 2, on EDU-CHEMC with MSD-DenseWAP): Without MSD or <code>[conn]</code>, EM=38.70%. Adding <code>[conn]</code> alone raised EM to 44.02%. Adding MSD alone raised EM to 52.76%. Both together achieved EM=64.96%, confirming each component&rsquo;s contribution.</li>
</ul>
<h2 id="outcomes-and-conclusions">Outcomes and Conclusions</h2>
<ul>
<li><strong>New best results</strong>: MSD-RCGD achieved 65.39% EM on EDU-CHEMC (handwritten) and 95.23% EM on Mini-CASIA-CSDB (printed), outperforming the RCGD baseline (62.86% and 95.01%, respectively). MSD-DenseWAP surpassed the previous best on EDU-CHEMC by 2.06% EM (64.92% vs. 62.86%).</li>
<li><strong>Universal improvement</strong>: Applying MSD/RFL to DenseWAP improved its accuracy from 61.35% to 64.92% EM on EDU-CHEMC and from 92.09% to 94.10% EM on Mini-CASIA-CSDB, demonstrating the method is model-agnostic.</li>
<li><strong>Complexity handling</strong>: When trained on low-complexity molecules only (levels 1-2), MSD-DenseWAP still recognized higher-complexity unseen structures, while standard DenseWAP could hardly recognize them at all (Figure 6 in the paper).</li>
</ul>
<p>The authors note that this is the first end-to-end solution that decouples and models chemical structures in a structured form. Future work aims to extend structured modeling to other tasks such as tables, flowcharts, and diagrams.</p>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/JingMog/RFL-MSD">RFL-MSD</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized one handwritten and one printed dataset, plus a synthetic set for stress-testing complexity.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>EDU-CHEMC</strong></td>
          <td>48,998 Train / 2,992 Test</td>
          <td>Handwritten images from educational scenarios</td>
      </tr>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>Mini-CASIA-CSDB</strong></td>
          <td>89,023 Train / 8,287 Test</td>
          <td>Printed images rendered from ChEMBL using RDKit</td>
      </tr>
      <tr>
          <td><strong>Generalization</strong></td>
          <td><strong>ChEMBL Subset</strong></td>
          <td>5 levels of complexity</td>
          <td>Custom split by complexity score $N_{\text{atom}} + N_{\text{bond}} + 12 \times N_{\text{ring}}$</td>
      </tr>
  </tbody>
</table>
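<p>The complexity score used for that split is a direct transcription of the formula above:</p>

```python
def complexity_score(n_atoms: int, n_bonds: int, n_rings: int) -> int:
    """Structural complexity used to bin ChEMBL molecules into five
    difficulty levels: N_atom + N_bond + 12 * N_ring."""
    return n_atoms + n_bonds + 12 * n_rings
```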
<h3 id="algorithms">Algorithms</h3>
<p><strong>RFL Splitting (Encoding)</strong>:</p>
<ol>
<li><strong>Detect Rings</strong>: Use DFS to find all non-nested rings $\mathcal{R}$.</li>
<li><strong>Determine Adjacency ($\gamma$)</strong>: Calculate shared edges between rings.</li>
<li><strong>Merge</strong>:
<ul>
<li>If $\gamma(r_i) = 0$ (isolated), merge ring into a <strong>SuperAtom</strong> node.</li>
<li>If $\gamma(r_i) &gt; 0$ (adjacent), merge ring into a <strong>SuperBond</strong> edge.</li>
</ul>
</li>
<li><strong>Update</strong>: Record connection info in $\mathcal{F}$ and remove ring details from the main graph to form Skeleton $\mathcal{S}$.</li>
</ol>
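<p>Step 3's isolated-vs-adjacent decision reduces to a shared-edge check. A minimal sketch, with ring detection assumed already done and $\gamma$ simplified to a boolean test:</p>

```python
def classify_rings(rings):
    """Classify detected rings as SuperAtoms (isolated) or SuperBonds.

    rings: list of rings, each a list of atom indices in cycle order.
    A ring sharing at least one edge with another ring is merged as a
    SuperBond; otherwise it collapses to a SuperAtom node.
    """
    def edges(ring):
        n = len(ring)
        # undirected edges of the cycle, as frozensets for order-free comparison
        return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}

    edge_sets = [edges(r) for r in rings]
    labels = []
    for i, ei in enumerate(edge_sets):
        shared = any(ei & ej for j, ej in enumerate(edge_sets) if j != i)
        labels.append("SuperBond" if shared else "SuperAtom")
    return labels
```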
<p><strong>MSD Decoding</strong>:</p>
<ul>
<li><strong>Hierarchical Prediction</strong>: The model predicts the Skeleton $\mathcal{S}$ first.</li>
<li><strong>Contextual Ring Prediction</strong>: When a SuperAtom/Bond token is predicted, its hidden state $f^s$ is stored. After the skeleton is finished, $f^s$ is used as a condition to autoregressively decode the specific ring structure.</li>
<li><strong>Token <code>[conn]</code></strong>: A special token separates connected ring bonds from unconnected ones to sparsify the branch classification task.</li>
</ul>
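<p>The two-stage decoding loop above can be sketched as follows. This is a hypothetical control-flow skeleton, not the paper's implementation: <code>decoder_step</code> stands in for the attention-GRU cell, and its interface is our assumption:</p>

```python
def msd_decode(decoder_step, context, super_tokens, end_token, max_len=64):
    """Hypothetical sketch of MSD's hierarchical decoding.

    decoder_step(prev_token, hidden, context) -> (token, hidden) is an
    assumed interface standing in for the attention GRU. Stage 1 decodes
    the skeleton, caching the hidden state f^s at each SuperAtom/SuperBond;
    stage 2 re-enters the decoder once per cached state to expand a ring.
    """
    skeleton, cached = [], []
    token, hidden = "<s>", None
    for _ in range(max_len):                      # stage 1: skeleton
        token, hidden = decoder_step(token, hidden, context)
        if token == end_token:
            break
        skeleton.append(token)
        if token in super_tokens:
            cached.append(hidden)                 # store f^s for this ring

    rings = []
    for f_s in cached:                            # stage 2: ring expansion
        ring, token, hidden = [], "<ring>", f_s
        for _ in range(max_len):
            token, hidden = decoder_step(token, hidden, context)
            if token == end_token:
                break
            ring.append(token)
        rings.append(ring)
    return skeleton, rings
```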
<h3 id="models">Models</h3>
<p>The architecture follows a standard Image-to-Sequence pattern but with a forked decoder.</p>
<ul>
<li><strong>Encoder</strong>: DenseNet (Growth rate=24, Depth=32 per block)</li>
<li><strong>Decoder (MSD)</strong>:
<ul>
<li><strong>Core</strong>: GRU with Attention (Hidden dim=256, Embedding dim=256, Dropout=0.15)</li>
<li><strong>Skeleton Module</strong>: Autoregressively predicts sequence tokens. Uses Maxout activation.</li>
<li><strong>Branch Module</strong>: A binary classifier (MLP) taking concatenated features of skeleton bonds $f_{bs}$ and ring bonds $f_{br}$ to predict connectivity matrix $\mathcal{F}$.</li>
</ul>
</li>
<li><strong>Loss Function</strong>: $\mathcal{O} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{cls}$ (where $\lambda_1 = \lambda_2 = 1$)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on exact image reconstruction and structural validity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>EM (Exact Match)</strong></td>
          <td>% of images where predicted graph exactly matches ground truth.</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td><strong>Struct-EM</strong></td>
          <td>% of correctly identified chemical structures (ignoring non-chemical text).</td>
          <td>Auxiliary metric</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 (32GB RAM)</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch size: 8 (Handwritten), 32 (Printed)</li>
<li>Epochs: 50</li>
<li>Optimizer: Adam ($\text{lr} = 2\times10^{-4}$, decayed by a factor of 0.5 via MultiStepLR)</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, Q., Chen, M., Pi, C., Hu, P., Zhang, Z., Ma, J., Du, J., Yin, B., &amp; Hu, J. (2025). RFL: Simplifying Chemical Structure Recognition with Ring-Free Language. In <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(2), 2007-2015. <a href="https://doi.org/10.1609/aaai.v39i2.32197">https://doi.org/10.1609/aaai.v39i2.32197</a></p>
<p><strong>Publication</strong>: AAAI 2025 (Oral)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/JingMog/RFL-MSD">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changRFLSimplifyingChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{RFL: Simplifying Chemical Structure Recognition with Ring-Free Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{RFL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Qikai and Chen, Mingjun and Pi, Changpeng and Hu, Pengfei and Zhang, Zhenrong and Ma, Jiefeng and Du, Jun and Yin, Baocai and Hu, Jinshui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2007--2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2412.07594}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1609/aaai.v39i2.32197}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>