<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Systematization Papers: Surveys, Reviews, and Taxonomies on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/paper-types/systematization/</link><description>Recent content in Systematization Papers: Surveys, Reviews, and Taxonomies on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 11 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/paper-types/systematization/index.xml" rel="self" type="application/rss+xml"/><item><title>T5: Exploring Transfer Learning Limits</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</guid><description>Raffel et al. systematically study transfer learning for NLP with a text-to-text framework, ablating architectures, objectives, data, and multi-task mixing.</description><content:encoded><![CDATA[<h2 id="a-systematic-study-of-nlp-transfer-learning">A systematic study of NLP transfer learning</h2>
<p>This is a <strong>systematization paper</strong> that provides a comprehensive empirical survey of transfer learning techniques for NLP. Rather than proposing a single new method, T5 introduces a unified text-to-text framework and uses it as a testbed to systematically compare pre-training objectives, architectures, unlabeled data sources, transfer approaches, and multi-task mixing strategies. The scale of the ablation study (covering dozens of configurations) and the release of C4, pre-trained models, and code make it both a reference guide and a practical resource for follow-up research.</p>
<h2 id="unifying-nlp-tasks-as-text-to-text">Unifying NLP tasks as text-to-text</h2>
<p>The core design decision is to cast every NLP task as a text-to-text problem: both the input and output are text strings, with a task-specific prefix. Classification, regression, summarization, translation, and question answering all use the same model, loss function (cross-entropy on output tokens), and decoding procedure. This simplicity enables fair comparison across tasks and training strategies.</p>
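<p>As a minimal illustration, the text-to-text interface amounts to prepending a task prefix to the input string. The prefix strings and the translation example below follow the paper; the helper function and task names are a hypothetical sketch, not the T5 API:</p>

```python
# Illustrative sketch of T5-style text-to-text formatting.
# Prefix strings follow the paper's examples; the helper function
# and the task keys are our own naming, not part of T5's codebase.
PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",
}

def to_text_to_text(task, text, target):
    """Return the (input, target) string pair for one example."""
    return PREFIXES[task] + text, target

inp, tgt = to_text_to_text("translate_en_de", "That is good.", "Das ist gut.")
```

Every task formatted this way shares the same cross-entropy loss on output tokens and the same decoding procedure.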
<p>The model architecture is a standard encoder-decoder Transformer. The paper finds that this form outperforms decoder-only (language model) and encoder-only (BERT-style) variants in the text-to-text setting, while having similar computational cost to decoder-only models despite twice the parameters (the encoder processes the input only once, then the decoder attends to it).</p>
<h2 id="multi-task-mixing-strategies-and-findings">Multi-task mixing: strategies and findings</h2>
<p>The most thesis-relevant contribution is the systematic ablation of multi-task mixing strategies (Section 3.5.2). When training on multiple tasks simultaneously (which in the text-to-text framework simply means mixing data from different sources), the central question is how to set the proportion of data from each task.</p>
<h3 id="three-mixing-strategies">Three mixing strategies</h3>
<p><strong>Examples-proportional mixing.</strong> Sample in proportion to each dataset&rsquo;s size, with an artificial cap $K$ on the maximum dataset size. Without the cap, the unsupervised pre-training data (orders of magnitude larger) would dominate all batches. The mixing rate for task $m$ is:</p>
<p>$$
r_{m} = \frac{\min(e_{m}, K)}{\sum_{n} \min(e_{n}, K)}
$$</p>
<p>where $e_{m}$ is the number of examples in task $m$&rsquo;s dataset.</p>
<p><strong>Temperature-scaled mixing.</strong> Raise each mixing rate $r_{m}$ to the power $1/T$ and renormalize. At $T=1$ this equals examples-proportional mixing; as $T$ increases, the proportions approach equal mixing. This strategy uses a large cap $K = 2^{21}$ when computing the initial rates.</p>
<p><strong>Equal mixing.</strong> Sample uniformly from all tasks. Included as a negative reference: the model overfits on low-resource tasks and underfits on high-resource tasks.</p>
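<p>All three strategies can be expressed as one function of the dataset sizes (a sketch, not the paper's code): at $T=1$ it reduces to examples-proportional mixing with cap $K$, and large $T$ approaches equal mixing.</p>

```python
import numpy as np

def mixing_rates(sizes, K=2**21, T=1.0):
    """Per-task sampling rates from dataset sizes.

    At T=1 this is examples-proportional mixing with cap K;
    raising each rate to 1/T and renormalizing gives
    temperature-scaled mixing, approaching uniform as T grows.
    """
    r = np.minimum(np.asarray(sizes, dtype=float), K)
    r = r / r.sum()
    r = r ** (1.0 / T)
    return r / r.sum()
```

For example, with two small tasks and one huge one, $T=1$ lets the huge (capped) task dominate, while a high temperature pulls the rates back toward uniform.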
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Mixing strategy</th>
          <th>GLUE</th>
          <th>CNN/DM</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
          <th>EnDe</th>
          <th>EnFr</th>
          <th>EnRo</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Baseline (pre-train/fine-tune)</td>
          <td>83.28</td>
          <td>19.24</td>
          <td>80.88</td>
          <td>71.36</td>
          <td>26.98</td>
          <td>39.82</td>
          <td>27.65</td>
      </tr>
      <tr>
          <td>Equal</td>
          <td>76.13</td>
          <td>19.02</td>
          <td>76.51</td>
          <td>63.37</td>
          <td>23.89</td>
          <td>34.31</td>
          <td>26.78</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{18}$</td>
          <td>81.67</td>
          <td>19.07</td>
          <td>78.17</td>
          <td>67.94</td>
          <td>24.57</td>
          <td>35.19</td>
          <td>27.39</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{19}$</td>
          <td>81.42</td>
          <td>19.24</td>
          <td>79.78</td>
          <td>67.30</td>
          <td>25.21</td>
          <td>36.30</td>
          <td>27.76</td>
      </tr>
      <tr>
          <td>Temperature-scaled, $T=2$</td>
          <td>81.90</td>
          <td>19.28</td>
          <td>79.42</td>
          <td>69.92</td>
          <td>25.42</td>
          <td>36.72</td>
          <td>27.20</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings on mixing:</strong></p>
<ol>
<li>
<p><strong>Multi-task training underperforms pre-train-then-fine-tune on most tasks.</strong> No mixing strategy matches the baseline of unsupervised pre-training followed by task-specific fine-tuning.</p>
</li>
<li>
<p><strong>Equal mixing is worst.</strong> It dramatically degrades performance, confirming that proportions matter.</p>
</li>
<li>
<p><strong>There exists a task-specific sweet spot for the cap $K$.</strong> Most tasks have an optimal $K$ value; larger or smaller values hurt. The exception is very high-resource tasks (WMT English-French) that always benefit from higher mixing proportions.</p>
</li>
<li>
<p><strong>Temperature scaling at $T=2$ provides the best single compromise.</strong> It achieves reasonable performance across all tasks without requiring per-task tuning of $K$.</p>
</li>
<li>
<p><strong>Multi-task pre-training followed by fine-tuning closes the gap.</strong> When multi-task training is used as pre-training (not as the final training stage), followed by task-specific fine-tuning, performance becomes comparable to unsupervised pre-training alone. This suggests that multi-task exposure during pre-training provides useful early signal without the negative effects of forcing a single model to perform all tasks simultaneously.</p>
</li>
<li>
<p><strong>&ldquo;Leave-one-out&rdquo; training works.</strong> Pre-training on a multi-task mixture that excludes a target task, then fine-tuning on it, produces only slightly worse results. This indicates that multi-task pre-training builds general capabilities that transfer to unseen tasks without dramatic task interference.</p>
</li>
</ol>
<h2 id="data-repetition-degrades-performance">Data repetition degrades performance</h2>
<p>The paper also systematically tests the effect of pre-training dataset size by truncating C4 and training for the same total token budget over repeated data:</p>
<table>
  <thead>
      <tr>
          <th>Unique tokens</th>
          <th>Repeats</th>
          <th>GLUE</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full dataset</td>
          <td>0</td>
          <td>83.28</td>
          <td>80.88</td>
          <td>71.36</td>
      </tr>
      <tr>
          <td>$2^{29}$</td>
          <td>64</td>
          <td>82.87</td>
          <td>80.97</td>
          <td>72.03</td>
      </tr>
      <tr>
          <td>$2^{27}$</td>
          <td>256</td>
          <td>82.62</td>
          <td>79.78</td>
          <td>69.97</td>
      </tr>
      <tr>
          <td>$2^{25}$</td>
          <td>1,024</td>
          <td>79.55</td>
          <td>76.27</td>
          <td>64.76</td>
      </tr>
      <tr>
          <td>$2^{23}$</td>
          <td>4,096</td>
          <td>76.34</td>
          <td>70.92</td>
          <td>59.29</td>
      </tr>
  </tbody>
</table>
<p>Performance degrades as the unique data shrinks: 64 repeats have limited effect, but 1,024+ repeats cause significant degradation. Training loss curves confirm memorization at high repetition counts. The paper recommends using large, diverse pre-training datasets whenever possible.</p>
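<p>The &ldquo;Repeats&rdquo; column follows directly from the fixed pre-training budget: the baseline trains on $2^{35} \approx 34$B tokens ($2^{19}$ steps at $2^{16}$ tokens per batch), so the repeat count is just total tokens over unique tokens. A sanity-check sketch:</p>

```python
# Baseline pre-training budget from the paper:
# 2^19 steps * 2^16 tokens per batch = 2^35 tokens total.
TOTAL_TOKENS = 2 ** 35

def repeats(unique_tokens):
    """Epochs over a truncated C4 with `unique_tokens` tokens."""
    return TOTAL_TOKENS // unique_tokens
```

Plugging in the table's unique-token counts ($2^{29}$ down to $2^{23}$) recovers 64, 256, 1,024, and 4,096 repeats.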
<h2 id="scaling-and-final-configuration">Scaling and final configuration</h2>
<p>The paper compares scaling strategies: more data, larger models, and ensembles. Training a larger model for fewer steps generally outperforms training a smaller model on more data. Ensembles of independently pre-trained and fine-tuned models provide orthogonal gains.</p>
<p>The final T5-11B model combines the best choices from all ablations: encoder-decoder architecture, span corruption objective, C4 pre-training data, multi-task pre-training followed by fine-tuning, and scaling to 11B parameters trained on over 1 trillion tokens. It achieves state-of-the-art results on GLUE (90.3 average), SuperGLUE (88.9, near human performance of 89.8), SQuAD, and CNN/Daily Mail. It does not achieve state-of-the-art on WMT translation tasks, where methods using backtranslation and cross-lingual pre-training retain the lead.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The T5 paper&rsquo;s multi-task mixing findings are its most enduring contribution beyond the model itself. The core lessons: proportions matter enormously (equal mixing fails), examples-proportional mixing with a cap is a reasonable default, temperature scaling provides a single-knob alternative, and multi-task pre-training followed by fine-tuning can match pure unsupervised pre-training.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>All ablations use the same encoder-decoder architecture. Findings may not transfer to decoder-only models that dominate current practice.</li>
<li>The multi-task mixing experiments treat each task as a separate &ldquo;domain.&rdquo; Interactions between similar tasks (e.g., multiple classification tasks) are not isolated.</li>
<li>The paper does not provide a principled method for choosing $K$ or $T$; both require empirical search.</li>
<li>C4 has known quality issues (templated text, noisy content) that have been addressed in later datasets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, pre-trained models, and the C4 dataset are all publicly released.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>C4 (Colossal Clean Crawled Corpus)</td>
          <td>~750 GB</td>
          <td>Heuristically cleaned Common Crawl</td>
      </tr>
      <tr>
          <td>Downstream</td>
          <td>GLUE, SuperGLUE, SQuAD, CNN/DM, WMT (EnDe, EnFr, EnRo)</td>
          <td>Standard splits</td>
          <td>Text-to-text format</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>Encoder-decoder Transformer. Sizes: Small (60M), Base (220M), Large (770M), 3B, and 11B parameters; the baseline uses the Base size. SentencePiece vocabulary with 32K tokens. Pre-trained for $2^{19}$ steps, fine-tuned for $2^{18}$ steps on individual tasks.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Multi-task mixing: examples-proportional with cap $K \in \{2^{16}, \ldots, 2^{21}\}$, temperature-scaled with $T \in \{2, 4, 8\}$, and equal mixing. Unsupervised objective: span corruption (mean span length 3, 15% corruption rate). Training with Adafactor optimizer, inverse square root learning rate schedule.</p>
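<p>The span-corruption objective replaces contiguous token spans with sentinel tokens; the target sequence reconstructs each span after its sentinel. A minimal sketch with explicit span positions (the actual objective samples spans randomly at a 15% rate with mean length 3):</p>

```python
def span_corrupt(tokens, spans):
    """Replace (start, length) spans with sentinels <X0>, <X1>, ...;
    the target reconstructs each masked span after its sentinel."""
    inp, tgt, prev = [], [], 0
    for k, (start, length) in enumerate(spans):
        sentinel = f"<X{k}>"
        inp += tokens[prev:start] + [sentinel]
        tgt += [sentinel] + tokens[start:start + length]
        prev = start + length
    inp += tokens[prev:]  # keep the uncorrupted tail
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 2), (8, 1)])
```

On the paper's running example this yields the input &ldquo;Thank you &lt;X0&gt; me to your party &lt;X1&gt; week&rdquo; with target &ldquo;&lt;X0&gt; for inviting &lt;X1&gt; last&rdquo;.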
<h3 id="hardware">Hardware</h3>
<p>All models trained using Mesh TensorFlow on TPU slices. T5-11B pre-trained for 1M steps with batch size $2^{11}$ sequences of length 512 (~1 trillion tokens total). Exact TPU pod configurations per experiment not detailed.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer">T5 Code</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official TensorFlow implementation (JAX successor: T5X)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints">T5 Models</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained checkpoints (Small through 11B)</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>~750 GB cleaned Common Crawl, via TensorFlow Datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{raffel2020exploring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{140}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Materials Representations for ML Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</guid><description>Review of representation strategies for encoding solid-state materials as ML inputs, covering structural descriptors, crystal graphs, and generative models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-material-representations">A Systematization of Material Representations</h2>
<p>This paper is a <strong>Systematization</strong> that organizes and categorizes the strategies researchers use to convert solid-state materials into numerical representations suitable for machine learning models. Rather than proposing a new method, the review provides a structured taxonomy of existing approaches, connecting each to the practical constraints of data availability, computational cost, and prediction targets. It covers structural descriptors, graph-based learned representations, compositional features, transfer learning, and generative models for inverse design.</p>
<h2 id="why-material-representations-matter">Why Material Representations Matter</h2>
<p>Machine learning has enabled rapid property prediction for materials, but every ML pipeline depends on how the material is encoded as a numerical input. The authors identify three guiding principles for effective representations:</p>
<ol>
<li><strong>Similarity preservation</strong>: Similar materials should have similar representations, and dissimilar materials should diverge in representation space.</li>
<li><strong>Domain coverage</strong>: The representation should be constructable for every material in the target domain.</li>
<li><strong>Cost efficiency</strong>: Computing the representation should be cheaper than computing the target property directly (e.g., via <a href="https://en.wikipedia.org/wiki/Density_functional_theory">DFT</a>).</li>
</ol>
<p>In practice, materials scientists face several barriers. Atomistic structures span diverse space groups, supercell sizes, and disorder parameters. Real material performance depends on defects, microstructure, and interfaces. Structural information often requires expensive experimental or computational effort to obtain. Datasets in materials science tend to be small, sparse, and biased toward well-studied systems.</p>
<h2 id="structural-descriptors-local-global-and-topological">Structural Descriptors: Local, Global, and Topological</h2>
<p>The review covers three families of hand-crafted structural descriptors that encode atomic positions and types.</p>
<h3 id="local-descriptors">Local Descriptors</h3>
<p>Local descriptors characterize the environment around each atom. Atom-centered symmetry functions (ACSF), introduced by Behler and Parrinello, define radial and angular functions:</p>
<p>$$
G_{i}^{1} = \sum_{j \neq i}^{\text{neighbors}} e^{-\eta(R_{ij} - R_{s})^{2}} f_{c}(R_{ij})
$$</p>
<p>$$
G_{i}^{2} = 2^{1-\zeta} \sum_{j,k \neq i}^{\text{neighbors}} (1 + \lambda \cos \theta_{ijk})^{\zeta} e^{-\eta(R_{ij}^{2} + R_{ik}^{2} + R_{jk}^{2})} f_{c}(R_{ij}) f_{c}(R_{ik}) f_{c}(R_{jk})
$$</p>
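<p>The radial function $G_{i}^{1}$ is straightforward to implement. A NumPy sketch, assuming the standard Behler&ndash;Parrinello cosine cutoff for $f_{c}$ (the review does not fix a specific cutoff form):</p>

```python
import numpy as np

def cosine_cutoff(r, r_c):
    # Behler-Parrinello cutoff: decays smoothly to zero at r_c.
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def acsf_g1(positions, i, eta, r_s, r_c):
    """Radial ACSF G^1 for atom i: a Gaussian-weighted,
    cutoff-smoothed count of neighbors near radius r_s."""
    d = np.linalg.norm(positions - positions[i], axis=1)
    d = np.delete(d, i)  # exclude the central atom itself
    return float(np.sum(np.exp(-eta * (d - r_s) ** 2) * cosine_cutoff(d, r_c)))
```

A single neighbor exactly at $R_s$ contributes $f_c(R_s)$, since the Gaussian factor is 1 there.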
<p>The Smooth Overlap of Atomic Positions (SOAP), proposed by Bartók et al., defines atomic neighborhood density as a sum of Gaussians and computes a rotationally invariant kernel through expansion in radial functions and <a href="https://en.wikipedia.org/wiki/Spherical_harmonics">spherical harmonics</a>:</p>
<p>$$
\rho_{i}(\mathbf{r}) = \sum_{j} \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^{2}}{2\sigma^{2}}\right) = \sum_{nlm} c_{nlm} g_{n}(\mathbf{r}) Y_{lm}(\hat{\mathbf{r}})
$$</p>
<p>The power spectrum $p_{nn'l} \equiv \sum_{m} c_{nlm}(c_{n'lm})^{*}$, collected over radial indices $n, n'$ and angular momentum $l$, serves as a vector descriptor of the local environment. SOAP has seen wide adoption both as a similarity metric and as input to ML models.</p>
<p><a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi tessellation</a> provides another local approach, segmenting space into cells and extracting features like effective coordination numbers, cell volumes, and neighbor properties.</p>
<h3 id="global-descriptors">Global Descriptors</h3>
<p>Global descriptors encode the full structure. The Coulomb matrix models electrostatic interactions between atoms:</p>
<p>$$
M_{i,j} = \begin{cases} \frac{1}{2} Z_{i}^{2.4} &amp; \text{for } i = j \\ \frac{Z_{i}Z_{j}}{|r_{i} - r_{j}|} &amp; \text{for } i \neq j \end{cases}
$$</p>
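<p>A direct NumPy translation of the Coulomb matrix (with the conventional $\tfrac{1}{2} Z_i^{2.4}$ diagonal from Rupp et al.); note the raw matrix is not permutation-invariant, which in practice is handled by sorting rows or using eigenvalues:</p>

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: 0.5 * Z_i^2.4 on the diagonal (Rupp et al.),
    Z_i * Z_j / |r_i - r_j| off the diagonal."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M
```

For an H$_2$ molecule at bond length 0.74 (in whatever length unit the positions use), the off-diagonal entries are $1/0.74$ and the diagonal entries are 0.5.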
<p>Other global methods include partial radial distribution functions (PRDF), the many-body tensor representation (MBTR), and cluster expansions. The Atomic Cluster Expansion (ACE) framework generalizes cluster expansions to continuous environments and has become a foundation for modern deep learning potentials.</p>
<h3 id="topological-descriptors">Topological Descriptors</h3>
<p><a href="https://en.wikipedia.org/wiki/Persistent_homology">Persistent homology</a> from topological data analysis (TDA) identifies geometric features at multiple length scales. Topological descriptors capture pore geometries in porous materials and have outperformed traditional structural descriptors for predicting CO$_{2}$ adsorption in metal-organic frameworks and methane storage in <a href="https://en.wikipedia.org/wiki/Zeolite">zeolites</a>. A caveat is the $O(N^{3})$ worst-case computational cost per filtration.</p>
<h2 id="crystal-graph-neural-networks">Crystal Graph Neural Networks</h2>
<p>Graph neural networks bypass manual feature engineering by learning representations directly from structural data. Materials are converted to graphs $G(V, E)$ where nodes represent atoms and edges connect neighbors within a cutoff radius, with periodic boundary conditions.</p>
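<p>Building the periodic graph reduces to a neighbor search over lattice translations. A brute-force sketch for a single unit cell (production codes use spatial partitioning, and cutoffs larger than the cell would need more image shells than the &plusmn;1 shown here):</p>

```python
import numpy as np
from itertools import product

def periodic_edges(frac_coords, lattice, r_cut):
    """Edges (i, j, distance) between atoms within r_cut,
    counting periodic images in the +/-1 shell of lattice shifts."""
    L = np.asarray(lattice, dtype=float)
    cart = np.asarray(frac_coords, dtype=float) @ L
    edges = []
    for i, j in product(range(len(cart)), repeat=2):
        for shift in product((-1, 0, 1), repeat=3):
            if i == j and shift == (0, 0, 0):
                continue  # skip the self-pair in the home cell
            d = np.linalg.norm(cart[j] + np.asarray(shift, dtype=float) @ L - cart[i])
            if d < r_cut:
                edges.append((i, j, d))
    return edges
```

A simple-cubic cell with one atom and a cutoff between the first and second neighbor shells yields exactly six edges, all to periodic images of the same atom.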
<p>Key architectures discussed include:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>Crystal graph convolutions for broad property prediction</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>Materials graph networks with global state attributes</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>Line graph neural networks incorporating three-body angular features</td>
      </tr>
      <tr>
          <td>Equivariant GNNs</td>
          <td>E(3)-equivariant message passing for tensorial properties</td>
      </tr>
  </tbody>
</table>
<p>The review identifies several limitations. Graph convolutions based on local neighborhoods can fail to capture long-range interactions or periodicity-dependent properties (e.g., lattice parameters, phonon spectra). Strategies to address this include concatenation with hand-tuned descriptors, plane-wave periodic basis modulation, and reciprocal-space features.</p>
<p>A major practical restriction is the requirement for relaxed atomic positions. Graphs built from unrelaxed crystal prototypes lose information about geometric distortions, degrading accuracy. Approaches to mitigate this include data augmentation with perturbed structures, Bayesian optimization of prototypes, and surrogate force-field relaxation.</p>
<p>Equivariant models that introduce higher-order tensors to node and edge features, constrained to transform correctly under E(3) operations, achieve state-of-the-art accuracy and can match structural descriptor performance even in low-data (~100 datapoints) regimes.</p>
<h2 id="compositional-descriptors-without-structure">Compositional Descriptors Without Structure</h2>
<p>When crystal structures are unavailable, representations can be built purely from stoichiometry and tabulated atomic properties (radii, electronegativity, valence electrons). Despite their simplicity, these methods have distinct advantages: zero computational overhead, accessibility to non-experts, and robustness for high-throughput screening.</p>
<p>Key methods include:</p>
<ul>
<li><strong>MagPie</strong>: 145 input features derived from elemental properties</li>
<li><strong>SISSO</strong>: Compressive sensing over algebraic combinations of atomic properties, capable of discovering interpretable descriptors (e.g., a new tolerance factor $\tau$ for perovskite stability)</li>
<li><strong>ElemNet</strong>: Deep neural network using only fractional stoichiometry as input, outperforming MagPie with &gt;3,000 training points</li>
<li><strong>ROOST</strong>: Fully-connected compositional graph with attention-based message passing, achieving strong performance with only hundreds of examples</li>
<li><strong>CrabNet</strong>: Self-attention on element embeddings with fractional encoding, handling dopant-level concentrations via log-scale inputs</li>
</ul>
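<p>The simplest such input is the fractional-stoichiometry vector that ElemNet consumes. An illustrative parser (the element list here is a tiny hypothetical subset; real featurizers such as Matminer cover the full periodic table):</p>

```python
import re

ELEMENTS = ["H", "C", "N", "O", "Si", "Fe"]  # illustrative subset only

def composition_vector(formula):
    """Fractional stoichiometry over ELEMENTS,
    e.g. Fe2O3 -> Fe 0.4, O 0.6."""
    counts = {}
    for el, num in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[el] = counts.get(el, 0.0) + float(num or 1)
    total = sum(counts.values())
    return [counts.get(el, 0.0) / total for el in ELEMENTS]
```

By construction the vector sums to one and discards all structural information, which is exactly why such models cannot distinguish polymorphs.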
<p>Compositional models cannot distinguish polymorphs and generally underperform structural approaches. They are most valuable when atomistic resolution is unavailable.</p>
<h2 id="defects-surfaces-and-grain-boundaries">Defects, Surfaces, and Grain Boundaries</h2>
<p>The review extends beyond idealized unit cells to practical materials challenges:</p>
<p><strong>Point defects</strong>: Representations of the pristine bulk can predict vacancy formation energies through linear relationships with band structure descriptors. Frey et al. proposed using relative differences between defect and parent structure properties, requiring no DFT on the defect itself.</p>
<p><strong>Surfaces and catalysis</strong>: Binding energy prediction for catalysis requires representations beyond the bulk unit cell. The d-band center for metals and oxygen 2p-band center for metal oxides serve as simple electronic descriptors, following the <a href="https://en.wikipedia.org/wiki/Sabatier_principle">Sabatier principle</a> that optimal catalytic activity requires intermediate binding strength. Graph neural networks trained on the Open Catalyst 2020 dataset (&gt;1 million DFT energies) have enabled broader screening, though errors remain high for certain adsorbates and non-metallic surfaces.</p>
<p><strong>Grain boundaries</strong>: SOAP descriptors computed for atoms near grain boundaries and clustered into local environment classes can predict grain boundary energy, mobility, and shear coupling. This approach provides interpretable structure-property relationships.</p>
<h2 id="transfer-learning-across-representations">Transfer Learning Across Representations</h2>
<p>When target datasets are small, transfer learning leverages representations learned from large, related datasets. The standard procedure involves: (1) pretraining on a large dataset (e.g., all Materials Project formation energies), (2) freezing parameters up to a chosen depth, and (3) either fine-tuning remaining layers or extracting features for a separate model.</p>
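<p>The feature-extraction variant of step (3) can be sketched in a few lines: treat the pretrained encoder as frozen and fit only a linear head on the small target set. This toy NumPy stand-in uses random weights in place of a real pretrained network:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder: two frozen layers whose
# weights would, in practice, come from a large source task.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def encode(X):
    """Frozen feature extractor; its parameters are never updated."""
    return np.tanh(np.tanh(X @ W1) @ W2)

# Small target dataset: only the linear head is fit.
X = rng.normal(size=(32, 8))
y = rng.normal(size=32)
features = encode(X)
head, *_ = np.linalg.lstsq(features, y, rcond=None)
predictions = features @ head
```

Fine-tuning differs only in that the layers below the chosen depth are also updated during target-task training.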
<p>Key findings from the review:</p>
<ul>
<li>Transfer learning is most effective when the source dataset is orders of magnitude larger than the target</li>
<li>Physically related tasks transfer better (e.g., Open Catalyst absorption energies transfer well to new adsorbates, less so to unrelated small molecules)</li>
<li>Earlier neural network layers learn more general representations and transfer better across properties</li>
<li>Multi-depth feature extraction, combining activations from multiple layers, can improve transfer</li>
<li>Predictions from surrogate models can serve as additional descriptors, expanding screening domains by orders of magnitude</li>
</ul>
<h2 id="generative-models-for-crystal-inverse-design">Generative Models for Crystal Inverse Design</h2>
<p>Generative models for solid-state materials face challenges beyond molecular generation: more diverse atomic species, the need to specify both positions and lattice parameters, non-unique definitions (rotations, translations, supercell scaling), and large unit cells (&gt;100 atoms for zeolites and MOFs).</p>
<p>The review traces the progression of approaches:</p>
<ol>
<li><strong>Voxel representations</strong>: Discretize unit cells into volume elements. Early work (iMatGen, Court et al.) demonstrated feasibility but was restricted to specific chemistries or cubic systems.</li>
<li><strong>Continuous coordinate models</strong>: Point cloud and invertible representations allowed broader chemical spaces but lacked symmetry invariances.</li>
<li><strong>Symmetry-aware models</strong>: Crystal Diffusion <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAE</a> (CDVAE) uses periodic graphs and SE(3)-equivariant message passing for translationally and rotationally invariant generation, establishing benchmark tasks for the field.</li>
<li><strong>Constrained models for porous materials</strong>: Approaches like SmVAE represent MOFs through their topological building blocks (RFcodes), ensuring all generated structures are physically valid.</li>
</ol>
<h2 id="open-problems-and-future-directions">Open Problems and Future Directions</h2>
<p>The review highlights four high-impact open questions:</p>
<ol>
<li><strong>Local vs. global descriptor trade-offs</strong>: Local descriptors (SOAP) excel for short-range interactions but struggle with long-range physics. Global descriptors model periodicity but lack generality across space groups. Combining local and long-range features could provide more universal models.</li>
<li><strong>Prediction from unrelaxed prototypes</strong>: ML force fields can relax structures at a fraction of DFT cost, potentially expanding screening domains. Key questions remain about required training data scale and generalizability.</li>
<li><strong>Applicability of compositional descriptors</strong>: The performance gap between compositional and structural models may be property-dependent, being smaller for properties like band gap that depend on global features rather than local site energies.</li>
<li><strong>Extensions of generative models</strong>: Diffusion-based architectures have improved on voxel approaches for small unit cells, but extending to microstructure, dimensionality, and surface generation remains open.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is a review and does not present new experimental results or release any novel code, data, or models. The paper is open-access (hybrid OA at Annual Reviews) and the arXiv preprint is freely available. The following artifacts table covers key publicly available resources discussed in the review.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://arxiv.org/abs/2301.08813">arXiv preprint (2301.08813)</a></td>
          <td>Other</td>
          <td>arXiv (open access)</td>
          <td>Free preprint version</td>
      </tr>
      <tr>
          <td><a href="https://materialsproject.org">Materials Project</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>DFT energies, band gaps, structures for &gt;100,000 compounds</td>
      </tr>
      <tr>
          <td><a href="https://oqmd.org">OQMD</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Open Quantum Materials Database, &gt;600,000 DFT entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Open-Catalyst-Project/ocp">Open Catalyst 2020 (OC20)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>&gt;1,000,000 DFT surface adsorption energies</td>
      </tr>
      <tr>
          <td><a href="https://aflowlib.org">AFLOW</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>High-throughput ab initio library, &gt;3,000,000 entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/hackingmaterials/matminer">Matminer</a></td>
          <td>Code</td>
          <td>BSD</td>
          <td>Open-source toolkit for materials data mining and featurization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review covers: ACSF, SOAP, Voronoi tessellation, Coulomb matrices, PRDF, MBTR, cluster expansions, ACE, persistent homology, CGCNN, MEGNet, ALIGNN, E(3)-equivariant GNNs, MagPie, SISSO, ElemNet, ROOST, CrabNet, VAE, GAN, and diffusion-based crystal generators.</p>
<h3 id="hardware">Hardware</h3>
<p>No new experiments are conducted. Hardware requirements vary across the referenced methods: DFT calculations require HPC clusters, while GNN training typically requires 1-8 GPUs.</p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Partially Reproducible</strong>: The review paper itself is open-access. All major datasets discussed (Materials Project, OQMD, OC20, AFLOW) are publicly available under permissive licenses. Most referenced model implementations (CGCNN, MEGNet, ALIGNN, ROOST, CDVAE) have open-source code. No novel artifacts are released by the authors.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Damewood, J., Karaguesian, J., Lunger, J. R., Tan, A. R., Xie, M., Peng, J., &amp; Gómez-Bombarelli, R. (2023). Representations of Materials for Machine Learning. <em>Annual Review of Materials Research</em>, 53. <a href="https://doi.org/10.1146/annurev-matsci-080921-085947">https://doi.org/10.1146/annurev-matsci-080921-085947</a></p>
<p><strong>Publication</strong>: Annual Review of Materials Research, 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{damewood2023representations,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Representations of Materials for Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Damewood, James and Karaguesian, Jessica and Lunger, Jaclyn R. and Tan, Aik Rui and Xie, Mingrou and Peng, Jiayu and G{\&#39;o}mez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Annual Review of Materials Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{53}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1146/annurev-matsci-080921-085947}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformers and LLMs for Chemistry Drug Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/transformers-llms-chemistry-drug-discovery/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/transformers-llms-chemistry-drug-discovery/</guid><description>Bran and Schwaller review transformer architectures for chemistry, from task-specific SMILES models to multimodal LLMs and chemistry agents.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformers-in-chemistry">A Systematization of Transformers in Chemistry</h2>
<p>This book chapter by Bran and Schwaller is a <strong>Systematization</strong> paper that organizes the growing body of work applying transformer architectures to chemistry and drug discovery. Rather than proposing a new method, the authors trace a three-stage evolution: (1) task-specific single-modality models operating on SMILES and reaction strings, (2) multimodal models bridging molecular representations with spectra, synthesis actions, and natural language, and (3) large language models and LLM-powered agents capable of general chemical reasoning.</p>
<h2 id="why-transformers-for-chemistry">Why Transformers for Chemistry?</h2>
<p>The authors motivate the review by drawing analogies between natural language and chemical language. Just as text can be decomposed into subwords and tokens, molecules can be linearized into <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings, and chemical reactions can be encoded as reaction SMILES. This structural parallel enabled direct transfer of transformer architectures, originally designed for machine translation, to chemical prediction tasks.</p>
<p>Several factors accelerated this adoption:</p>
<ul>
<li>The publication of open chemical databases and benchmarks (e.g., <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>, Open Reaction Database, Therapeutics Data Commons)</li>
<li>Improvements in compute infrastructure and training algorithms</li>
<li>The success of attention mechanisms at capturing context-dependent relationships, which proved effective for learning chemical grammar and atom-level correspondences</li>
</ul>
<p>The review positions the transformer revolution in chemistry as a natural extension of NLP advances, noting that the gap between chemical and natural language is progressively closing.</p>
<h2 id="molecular-representations-as-language">Molecular Representations as Language</h2>
<p>A key section of the review covers text-based molecular representations that make transformer applications possible:</p>
<ul>
<li><strong>SMILES</strong> (Simplified Molecular Input Line Entry System): The dominant linearization scheme since the 1980s, encoding molecular graphs as character sequences with special symbols for bonds, branches, and rings.</li>
<li><strong>SELFIES</strong> (Self-Referencing Embedded Strings): A newer representation that guarantees every string maps to a valid molecule, addressing the robustness issues of SMILES in generative settings.</li>
<li><strong>Reaction SMILES</strong>: Extends molecular representations to encode full chemical reactions in the format &ldquo;A.B &gt; catalyst.reagent &gt; C.D&rdquo;, enabling reaction prediction as a sequence-to-sequence task.</li>
</ul>
<p>The authors note that while IUPAC names, InChI, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> exist as alternatives, SMILES and SELFIES dominate practical applications.</p>
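<p>As a minimal illustration of the reaction SMILES format above, the three groups can be recovered with plain string splitting. This is a hedged sketch (the function name is mine; production parsers also handle atom maps and the abbreviated &ldquo;reactants&gt;&gt;products&rdquo; form):</p>

```python
def parse_reaction_smiles(rxn: str):
    """Split a reaction SMILES "reactants>agents>products" into its three
    groups; each group is a '.'-separated list of molecules."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 2 '>' separators, got {len(parts) - 1}")
    # Drop empty entries so the agent-free shorthand "A.B>>C" yields [].
    return tuple([m for m in part.split(".") if m] for part in parts)

# Toy acid-catalyzed esterification written as a reaction SMILES
reactants, agents, products = parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(reactants)  # ['CC(=O)O', 'OCC']
print(agents)     # ['[H+]']
print(products)   # ['CC(=O)OCC', 'O']
```

<p>Framing reaction prediction as sequence-to-sequence translation then amounts to mapping the reactant and agent tokens to the product tokens.</p>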
<h2 id="stage-1-task-specific-transformer-models">Stage 1: Task-Specific Transformer Models</h2>
<p>The first stage of transformer adoption focused on clearly defined chemical tasks, with models trained on a single data modality (molecular strings).</p>
<h3 id="chemical-translation-tasks">Chemical Translation Tasks</h3>
<p>The encoder-decoder architecture was directly applied to tasks framed as translation:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a></strong> (Schwaller et al.): Treated reaction prediction as translation from reactant SMILES to product SMILES, becoming a leading method for forward synthesis prediction.</li>
<li><strong>Retrosynthetic planning</strong>: The reverse task, predicting reactants from products, with iterative application to construct full retrosynthetic trees mapping to commercially available building blocks.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al.): A model pre-trained once and fine-tuned across multiple chemical tasks, transferring to new applications with improved performance.</li>
<li><strong>Graph-to-sequence models</strong> (Tu and Coley): Used a custom graph encoder with a transformer decoder, achieving improvements through permutation-invariant molecular graph encoding.</li>
</ul>
<h3 id="representation-learning-and-feature-extraction">Representation Learning and Feature Extraction</h3>
<p>Encoder-only transformers proved valuable for generating molecular and reaction embeddings:</p>
<ul>
<li><strong>Reaction representations</strong> (Wang et al., SMILES-BERT): Trained models to generate reaction vectors that outperformed hand-engineered features on downstream regression tasks.</li>
<li><strong>Reaction classification</strong> (Schwaller et al.): Replaced the decoder with a classification layer to map chemical reactions by class, revealing clustering patterns by reaction type, data source, and molecular properties.</li>
<li><strong>Yield prediction</strong>: Regression heads attached to encoders achieved strong results on high-throughput experimentation datasets.</li>
<li><strong>Protein language models</strong> (Rives et al., ESM): Trained on 250 million protein sequences using unsupervised learning, achieving strong performance on protein property and structure prediction.</li>
<li><strong>RXNMapper</strong> (Schwaller et al.): A notable application where attention weight analysis revealed that transformers internally learn atom-to-atom mappings in chemical reactions, leading to an open-source atom mapping algorithm that outperformed existing approaches.</li>
</ul>
<h2 id="stage-2-multimodal-chemical-models">Stage 2: Multimodal Chemical Models</h2>
<p>The second stage extended transformers beyond molecular strings to incorporate additional data types:</p>
<ul>
<li><strong>Molecular captioning</strong>: Describing molecules in natural language, covering scaffolds, sources, drug interactions, and other features (Edwards et al.).</li>
<li><strong>Bidirectional molecule-text conversion</strong>: Models capable of generating molecules from text queries and performing molecule-to-molecule tasks (Christofidellis et al.).</li>
<li><strong>Experimental procedure prediction</strong>: Generating actionable synthesis steps from reaction SMILES (Vaucher et al.), bridging the gap between retrosynthetic planning and laboratory execution.</li>
<li><strong>Structural elucidation from IR spectra</strong>: Encoding IR spectra as text sequences alongside chemical formulas, then predicting SMILES from these inputs (Alberts et al.), achieving 45% accuracy in structure prediction and surpassing prior approaches for functional group identification.</li>
</ul>
<h2 id="stage-3-large-language-models-and-chemistry-agents">Stage 3: Large Language Models and Chemistry Agents</h2>
<p>The most recent stage builds on foundation models pre-trained on vast text corpora, adapted for chemistry through fine-tuning and in-context learning.</p>
<h3 id="scaling-laws-and-emergent-capabilities">Scaling Laws and Emergent Capabilities</h3>
<p>The authors discuss how model scaling leads to emergent capabilities relevant to chemistry:</p>
<ul>
<li>Below certain compute thresholds, model performance on chemistry tasks appears random.</li>
<li>Above critical sizes, sudden improvements emerge, along with capabilities like chain-of-thought (CoT) reasoning and instruction following.</li>
<li>These emergent abilities enable chemistry tasks that require multi-step reasoning without explicit training on chemical data.</li>
</ul>
<h3 id="llms-as-chemistry-tools">LLMs as Chemistry Tools</h3>
<p>Key applications of LLMs in chemistry include:</p>
<ul>
<li><strong><a href="/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/">Fine-tuning for low-data chemistry</a></strong> (Jablonka et al.): GPT-3 fine-tuned on limited chemistry datasets performed comparably to, and sometimes exceeded, specialized models with engineered features for tasks like predicting transition wavelengths and phase classification.</li>
<li><strong>In-context learning</strong>: Providing LLMs with a few examples enables prediction on chemistry tasks without any parameter updates, particularly valuable when data is scarce.</li>
<li><strong>Bayesian optimization with LLMs</strong> (Ramos et al.): Using GPT models for uncertainty-calibrated regression, enabling catalyst and molecular optimization directly from synthesis procedures without feature engineering.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">3D structure generation</a></strong> (Flam-Shepherd and Aspuru-Guzik): Using language models to generate molecular structures with three-dimensional atomic positions in XYZ, CIF, and PDB formats, matching graph-based algorithms while overcoming representation limitations.</li>
</ul>
<h3 id="llm-powered-chemistry-agents">LLM-Powered Chemistry Agents</h3>
<p>The review highlights the agent paradigm as the most impactful recent development:</p>
<ul>
<li><strong>14 LLM use-cases</strong> (Jablonka et al.): A large-scale collaborative effort demonstrating applications from computational tool wrappers to reaction optimization assistants and scientific question answering.</li>
<li><strong><a href="/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/">ChemCrow</a></strong> (Bran, Cox et al.): An LLM-powered agent equipped with curated computational chemistry tools, capable of planning and executing tasks across drug design, materials design, and synthesis. ChemCrow demonstrated that tool integration overcomes LLM hallucination issues by grounding responses in reliable data sources.</li>
<li><strong>Autonomous scientific research</strong> (Boiko et al.): Agent systems that plan experiments and execute them through cloud laboratories.</li>
</ul>
<p>The agent paradigm offers tool composability through natural language interfaces, allowing users to chain multiple computational tools into custom pipelines.</p>
<h2 id="outlook-and-limitations">Outlook and Limitations</h2>
<p>The authors identify several themes for the future:</p>
<ul>
<li>The three stages represent increasing generality, from task-specific single-modality models to open-ended agents.</li>
<li>Natural language interfaces are progressively closing the gap between chemical and human language.</li>
<li>Tool integration through agents provides grounding that mitigates hallucination, a known limitation of direct LLM application to chemistry.</li>
<li>The review acknowledges that LLMs have a &ldquo;high propensity to generate false and inaccurate content&rdquo; on chemical tasks, making tool-augmented approaches preferable to direct application.</li>
</ul>
<p>The chapter does not provide quantitative benchmarks or systematic comparisons across the methods discussed, as its goal is to organize the landscape rather than evaluate individual methods.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review/survey chapter and does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the referenced works rather than the review itself.</p>
<h3 id="key-referenced-resources">Key Referenced Resources</h3>
<p>Several open-source tools and datasets discussed in the review are publicly available:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/rxn4chemistry/rxnmapper">RXNMapper</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Attention-based atom mapping</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-public">ChemCrow</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>LLM-powered chemistry agent</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Various</td>
          <td>Molecular ML benchmarks</td>
      </tr>
      <tr>
          <td><a href="https://open-reaction-database.org/">Open Reaction Database</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA-4.0</td>
          <td>Curated reaction data</td>
      </tr>
      <tr>
          <td><a href="https://tdcommons.ai/">Therapeutics Data Commons</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Drug discovery ML datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-classification">Reproducibility Classification</h3>
<p><strong>Not applicable</strong> (review paper). Individual referenced works range from Highly Reproducible (open-source models like RXNMapper, ChemCrow) to Partially Reproducible (some models without released code) to Closed (proprietary LLMs like GPT-3/GPT-4 used in fine-tuning studies).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bran, A. M., &amp; Schwaller, P. (2024). Transformers and Large Language Models for Chemistry and Drug Discovery. In <em>Drug Development Supported by Informatics</em> (pp. 143-163). Springer Nature Singapore. <a href="https://doi.org/10.1007/978-981-97-4828-0_8">https://doi.org/10.1007/978-981-97-4828-0_8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@incollection</span>{bran2024transformers,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformers and Large Language Models for Chemistry and Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bran, Andres M. and Schwaller, Philippe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Drug Development Supported by Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{143--163}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature Singapore}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1007/978-981-97-4828-0_8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformers for Molecular Property Prediction Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</guid><description>A systematic review of 16 transformer models for molecular property prediction, analyzing architecture, data, tokenization, and benchmarking gaps.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformers-for-molecular-property-prediction">A Systematization of Transformers for Molecular Property Prediction</h2>
<p>This is a <strong>Systematization</strong> paper. Sultan et al. provide the first comprehensive, structured review of sequence-based transformer models applied to molecular property prediction (MPP). The review catalogs 16 models published between 2019 and 2023, organizes them by architecture type (encoder-decoder, encoder-only, decoder-only), and systematically examines seven key design decisions that arise when building a transformer for MPP. The paper&rsquo;s primary contribution is identifying gaps in current evaluation practices and articulating what standardization the field needs for meaningful progress.</p>
<h2 id="the-problem-inconsistent-evaluation-hinders-progress">The Problem: Inconsistent Evaluation Hinders Progress</h2>
<p>Molecular property prediction is essential for drug discovery, crop protection, and environmental science. Deep learning approaches, including transformers, have been increasingly applied to this task by learning molecular representations from string notations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. However, the field faces several challenges:</p>
<ol>
<li><strong>Small labeled datasets</strong>: Labeled molecular property datasets typically contain only hundreds or thousands of molecules, making supervised learning alone insufficient.</li>
<li><strong>No standardized evaluation protocol</strong>: Different papers use different data splits (scaffold vs. random), different splitting implementations, different numbers of repetitions (3 to 50), and sometimes do not share their test sets. This makes direct comparison across models infeasible.</li>
<li><strong>Unclear design choices</strong>: With many possible configurations for pre-training data, chemical language, tokenization, positional embeddings, model size, pre-training objectives, and fine-tuning approaches, the field lacks systematic analyses to guide practitioners.</li>
</ol>
<p>The authors note that standard machine learning methods with fixed-size molecular fingerprints remain strong baselines for real-world datasets, illustrating that the promise of transformers for MPP has not yet been fully realized.</p>
<h2 id="seven-design-questions-for-molecular-transformers">Seven Design Questions for Molecular Transformers</h2>
<p>The central organizing framework of this review addresses seven questions practitioners must answer when building a transformer for MPP. For each, the authors synthesize findings across the 16 reviewed models.</p>
<h3 id="reviewed-models">Reviewed Models</h3>
<p>The paper catalogs 16 models organized by architecture:</p>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Base Model</th>
          <th>Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Encoder-Decoder</td>
          <td>Transformer, BART</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">ST</a>, Transformer-CNN, <a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-Mol</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>BERT</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, MAT, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, Mol-BERT, Chen et al., K-BERT, FP-BERT, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>RoBERTa</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
      </tr>
      <tr>
          <td>Decoder-Only</td>
          <td>XLNet</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> (RT)</td>
      </tr>
  </tbody>
</table>
<p>The core attention mechanism shared by all these models is the scaled dot-product attention:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$</p>
<p>where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_{k}$ is the dimension of the key vectors.</p>
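<p>For concreteness, the formula can be evaluated directly on small matrices. A didactic pure-Python sketch (real implementations vectorize this and add masking, multiple heads, and dropout):</p>

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention over plain nested lists.
    Q is n x d_k, K is m x d_k, V is m x d_v."""
    d_k = len(K[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [[sum(qi * kj for qi, kj in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    # Row-wise softmax (subtract the row max for numerical stability)
    weights = []
    for row in scores:
        top = max(row)
        exps = [math.exp(s - top) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # output[i] = sum_j weights[i][j] * v_j
    return [[sum(w * v[c] for w, v in zip(wrow, V)) for c in range(len(V[0]))]
            for wrow in weights]

# The query aligns with the first key, so the output leans toward V[0]
out = attention(Q=[[1.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0], [0.0]])
print(out)
```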
<h3 id="question-1-which-database-and-how-many-molecules">Question 1: Which Database and How Many Molecules?</h3>
<p>Pre-training data sources vary considerably. The three main databases are ZINC (37 billion molecules in ZINC22), ChEMBL (2.4 million unique molecules with 20 million bioactivity measurements), and PubChem (111 million unique molecules). Pre-training set sizes ranged from 900K (ST on ChEMBL) to 1.1B molecules (MolFormer on ZINC + PubChem).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Database</th>
          <th>Size</th>
          <th>Language</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>ChEMBL</td>
          <td>900K</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>ChEMBL (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>)</td>
          <td>1.6M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>PubChem</td>
          <td>100K-10M</td>
          <td>SMILES, SELFIES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>PubChem</td>
          <td>5M-77M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>ZINC</td>
          <td>2M</td>
          <td>List of atoms</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>ZINC + PubChem</td>
          <td>1.1B</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>Chen et al.</td>
          <td>C, CP, CPZ</td>
          <td>2M-775M</td>
          <td>SMILES</td>
      </tr>
  </tbody>
</table>
<p>A key finding is that larger pre-training datasets do not consistently improve downstream performance. MolFormer showed minimal difference between models trained on 100M vs. 1.1B molecules. ChemBERTa-2 found that a model trained with MLM on 5M molecules performed comparably to one trained on 77M molecules on BBBP (both around 0.70 ROC-AUC). Chen et al. reported comparable $R^{2}$ values of $0.925 \pm 0.01$, $0.917 \pm 0.012$, and $0.915 \pm 0.01$ for ESOL across datasets of 2M, 103M, and 775M molecules, respectively. The data composition and covered chemical space appear to matter more than raw size.</p>
<h3 id="question-2-which-chemical-language">Question 2: Which Chemical Language?</h3>
<p>Most models use SMILES. ChemBERTa, RT, and SELFormer also explored SELFIES. MAT uses a simple list of atoms with structural features, while Mol-BERT and FP-BERT use circular fingerprints.</p>
<p>Direct comparisons between SMILES and SELFIES (by ChemBERTa on Tox21 SR-p53 and RT for drug-likeness prediction) found no significant performance difference. The RT authors reported that SELFIES models performed approximately $0.004 \pm 0.01$ better on RMSE, while SMILES models performed approximately $0.004 \pm 0.01$ better on Pearson correlation. The choice of chemical language does not appear to be a major factor in prediction performance, and even non-string representations (atom lists in MAT, fingerprints in Mol-BERT) perform competitively.</p>
<h3 id="question-3-how-to-tokenize">Question 3: How to Tokenize?</h3>
<p>Tokenization methods span atom-level (42-66 vocabulary tokens), regex-based (47-2,362 tokens), BPE (509-52K tokens), and substructure-based (3,357-13,325 tokens) approaches. No systematic comparison of tokenization strategies exists in the literature. The vocabulary size varied dramatically, from 42 tokens for MolBERT to over 52K for ChemBERTa. The authors argue that chemically meaningful tokenization (e.g., functional group-based fragmentation) could improve both performance and explainability.</p>
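<p>As an example of the regex-based family, a tokenizer in the spirit of the pattern published with the Molecular Transformer keeps bracket atoms, two-letter elements, and two-digit ring closures as single tokens. The pattern below is a simplified sketch, not any specific model&rsquo;s vocabulary:</p>

```python
import re

# Order matters: "Br"/"Cl" must precede "B"/"C" so two-letter
# elements win; "%\d{2}" must precede "\d" for ring closures >= 10.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|N|O|S|P|F|I|b|c|n|o|s|p|B|C"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: a lossy tokenizer silently corrupts molecules.
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, 21 tokens
```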
<h3 id="question-4-how-to-add-positional-embeddings">Question 4: How to Add Positional Embeddings?</h3>
<p>Most models inherited the absolute positional embedding from their NLP base models. MolBERT and RT adopted relative positional embeddings. MolFormer combined absolute and Rotary Positional Embedding (RoPE). MAT incorporated spatial information (inter-atomic 3D distances and adjacency) alongside self-attention.</p>
<p>MolFormer&rsquo;s comparison showed that RoPE became superior to absolute embeddings only when the pre-training dataset was very large. The performance difference (MAE on QM9) between absolute and RoPE embeddings for models trained on 111K, 111M, and 1.1B molecules was approximately $-0.20 \pm 0.18$, $-0.44 \pm 0.22$, and $0.27 \pm 0.12$, respectively.</p>
<p>The authors highlight that SMILES and SELFIES are linearizations of a 2D molecular graph, so consecutive tokens in a sequence are not necessarily spatially close. Positional embeddings that reflect 2D or 3D molecular structure remain underexplored.</p>
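<p>The absolute scheme most models inherit is the sinusoidal embedding of the original Transformer, which depends only on the token&rsquo;s index in the string, underscoring the point that sequence position need not track molecular proximity. A minimal sketch:</p>

```python
import math

def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Absolute sinusoidal positional embedding from "Attention Is All
    You Need": PE(pos, 2i) = sin(pos / 10000^(2i/d)) and
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]  # trim the extra cos term when d_model is odd

# Position 0 embeds as alternating sin(0)/cos(0) = [0, 1, 0, 1, ...]
print(sinusoidal_position(0, 8))
```

<p>Two ring-atom tokens separated by a long branch get distant embeddings under this scheme even when the atoms are bonded, which is exactly the mismatch the authors flag.</p>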
<h3 id="question-5-how-many-parameters">Question 5: How Many Parameters?</h3>
<p>Model sizes range from approximately 7M (ST, Mol-BERT) to over 100M parameters (MAT). Most chemical language models operate with 100M parameters or fewer, much smaller than NLP models like BERT (110M-330M) or GPT-3 (175B).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Dimensions</th>
          <th>Heads</th>
          <th>Layers</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>256</td>
          <td>4</td>
          <td>4</td>
          <td>7M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>768</td>
          <td>12</td>
          <td>12</td>
          <td>85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>768</td>
          <td>12</td>
          <td>6, 12</td>
          <td>43M, 85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
          <td>768</td>
          <td>12, 4</td>
          <td>8, 12</td>
          <td>57M, 85M</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>1024</td>
          <td>16</td>
          <td>8</td>
          <td>101M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>768</td>
          <td>12</td>
          <td>6</td>
          <td>43M</td>
      </tr>
  </tbody>
</table>
<p>SELFormer and MolFormer both tested different model sizes. SELFormer&rsquo;s larger model (approximately 86M parameters) showed approximately 0.034 better ROC-AUC for BBBP compared to the smaller model. MolFormer&rsquo;s larger model (approximately 87M parameters) performed approximately 0.04 better ROC-AUC on average for BBBP, HIV, BACE, and SIDER. The field lacks the systematic scaling analyses (analogous to Kaplan et al. and Hoffmann et al. in NLP) needed to establish proper scaling laws for chemical language models.</p>
<h3 id="question-6-which-pre-training-objectives">Question 6: Which Pre-training Objectives?</h3>
<p>Pre-training objectives fall into domain-agnostic and domain-specific categories:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Pre-training Objective</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>MLM</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></td>
          <td>MLM</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>MLM, PhysChemPred, SMILES-EQ</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td>K-BERT</td>
          <td>Atom feature, MACCS prediction, CL</td>
          <td>Update last layer</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>MLM, MTR</td>
          <td>Update</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>MLM, 2D Adjacency, 3D Distance</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></td>
          <td>Denoising Span MLM, Augmentation</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">RT</a></td>
          <td>PLM (Permutation Language Modeling)</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Domain-specific objectives (predicting physico-chemical properties, atom features, or MACCS keys) showed promising but inconsistent results. MolBERT&rsquo;s PhysChemPred objective alone performed comparably to the full three-objective model (approximately $0.72 \pm 0.06$ vs. $0.71 \pm 0.06$ ROC-AUC in virtual screening), while the SMILES-EQ objective (identifying equivalent SMILES) lowered performance when combined with the other objectives. K-BERT&rsquo;s contrastive learning objective made no significant difference (average ROC-AUC of 0.806 with CL vs. 0.807 without).</p>
<p>ChemBERTa-2&rsquo;s Multi-Task Regression (MTR) objective noticeably outperformed MLM-only pre-training on nearly all of the four classification tasks, across pre-training dataset sizes.</p>
<h3 id="question-7-how-to-fine-tune">Question 7: How to Fine-tune?</h3>
<p>Fine-tuning through weight updates generally outperforms frozen representations. SELFormer showed this most dramatically, with a difference of 2.187 RMSE between frozen and updated models on FreeSolv. MolBERT showed a much smaller difference (0.575 RMSE on FreeSolv), likely because its domain-specific pre-training objectives already produced representations closer to the downstream tasks.</p>
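<p>The distinction between the two strategies can be illustrated with a toy optimizer step that either skips or updates the encoder weights. The parameter names below are hypothetical, not drawn from any reviewed model.</p>

```python
def sgd_step(params, grads, lr=0.1, freeze_encoder=False):
    """One SGD step over a flat parameter dict; optionally skip the encoder."""
    new = {}
    for name, value in params.items():
        if freeze_encoder and name.startswith("encoder."):
            new[name] = value  # frozen: keep the pre-trained weight
        else:
            new[name] = value - lr * grads[name]
    return new

params = {"encoder.w": 1.0, "head.w": 0.5}
grads = {"encoder.w": 0.2, "head.w": 0.4}

frozen = sgd_step(params, grads, freeze_encoder=True)   # only head.w moves
updated = sgd_step(params, grads, freeze_encoder=False)  # all weights move
```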
<h2 id="benchmarking-challenges-and-performance-comparison">Benchmarking Challenges and Performance Comparison</h2>
<h3 id="downstream-datasets">Downstream Datasets</h3>
<p>The review focuses on nine benchmark datasets across three categories from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Application</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>LogD at pH 7.4</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,050</td>
          <td>1 classification</td>
          <td>Physiology</td>
          <td>Blood-brain barrier</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>1,484</td>
          <td>2 classification</td>
          <td>Physiology</td>
          <td>Clinical trial toxicity</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>1,427</td>
          <td>27 classification</td>
          <td>Physiology</td>
          <td>Drug side effects</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>7,831</td>
          <td>12 classification</td>
          <td>Physiology</td>
          <td>Nuclear receptor/stress pathways</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>1,513</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Beta-secretase 1 binding</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Anti-HIV activity</td>
      </tr>
  </tbody>
</table>
<h3 id="inconsistencies-in-evaluation">Inconsistencies in Evaluation</h3>
<p>The authors document substantial inconsistencies that prevent fair model comparison:</p>
<ol>
<li><strong>Data splitting</strong>: Models used different splitting methods (scaffold vs. random) and different implementations even when using the same method. Not all models adhered to scaffold splitting for classification tasks as recommended.</li>
<li><strong>Different test sets</strong>: Even models using the same split type may not evaluate on identical test molecules due to different random seeds.</li>
<li><strong>Varying repetitions</strong>: Repetitions ranged from 3 (RT) to 50 (Chen et al.), making some analyses more statistically robust than others.</li>
<li><strong>Metric inconsistency</strong>: Most use ROC-AUC for classification and RMSE for regression, but some models report only averages without standard deviations, while others report standard errors.</li>
</ol>
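<p>To make the first point concrete, here is a minimal sketch of a deterministic scaffold split. The <code>scaffold_fn</code> argument is a placeholder; real pipelines derive Bemis-Murcko scaffolds, for example with RDKit&rsquo;s <code>MurckoScaffold</code> utilities.</p>

```python
from collections import defaultdict

def scaffold_split(smiles_list, scaffold_fn, test_frac=0.2):
    """Group molecules by scaffold, then fill the test set with the
    smallest scaffold groups so train and test share no scaffold."""
    groups = defaultdict(list)
    for smi in smiles_list:
        groups[scaffold_fn(smi)].append(smi)
    ordered = sorted(groups.values(), key=len)  # smallest groups -> test
    n_test = int(test_frac * len(smiles_list))
    train, test = [], []
    for group in ordered:
        (test if len(test) < n_test else train).extend(group)
    return train, test

# Toy scaffold function (purely illustrative): strip trailing digits.
toy_scaffold = lambda s: s.rstrip("0123456789")
train, test = scaffold_split(["c1", "c1", "c1", "N2", "O3"], toy_scaffold, 0.4)
```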
<h3 id="performance-findings">Performance Findings</h3>
<p>When comparing only models evaluated on the same test sets (Figure 2 in the paper), the authors observe that transformer models show comparable, but not consistently superior, performance to existing ML and DL models. The performance varies considerably across models and datasets.</p>
<p>For BBBP, Mol-BERT reported a lower ROC-AUC than its corresponding MPNN (approximately 0.88 vs. 0.91), while MolBERT outperformed both its corresponding CDDD model (approximately 0.86 vs. 0.76 ROC-AUC) and its SVM baseline (approximately 0.86 vs. 0.70 ROC-AUC). A similar mixed pattern appeared for HIV: ChemBERTa performed worse than its corresponding ML models, while MolBERT beat its ML baselines by approximately 0.08 ROC-AUC and its DL baselines by approximately 0.03. On SIDER, Mol-BERT exceeded its corresponding MPNN by approximately 0.1 ROC-AUC. For regression, MAT and MolBERT improved over their ML and DL baselines on ESOL, FreeSolv, and Lipophilicity; on ESOL, for example, MAT&rsquo;s RMSE was approximately 0.2 lower than the SVM model&rsquo;s and approximately 0.03 lower than the Weave model&rsquo;s.</p>
<h2 id="key-takeaways-and-future-directions">Key Takeaways and Future Directions</h2>
<p>The review concludes with six main takeaways:</p>
<ol>
<li><strong>Performance</strong>: Transformers using SMILES show comparable but not consistently superior performance to existing ML and DL models for MPP.</li>
<li><strong>Scaling</strong>: No systematic analysis of model parameter scaling relative to data size exists for chemical language models. Such analysis is essential.</li>
<li><strong>Pre-training data</strong>: Dataset size alone is not the sole determinant of downstream performance. Composition and chemical space coverage matter.</li>
<li><strong>Chemical language</strong>: SMILES and SELFIES perform similarly. Alternative representations (atom lists, fingerprints) also work when the architecture is adjusted.</li>
<li><strong>Domain knowledge</strong>: Domain-specific pre-training objectives show promise, but tokenization and positional encoding remain underexplored.</li>
<li><strong>Benchmarking</strong>: The community needs standardized data splitting, fixed test sets, statistical analysis, and consistent reporting to enable meaningful comparison.</li>
</ol>
<p>The authors also highlight the need for attention visualization and explainability analysis, investigation of NLP-originated techniques (pre-training regimes, fine-tuning strategies like LoRA, explainability methods), and adaptation of these techniques to the specific characteristics of chemical data (smaller vocabularies, shorter sequences).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper. No new data or models are introduced. All analyses use previously reported results from the 16 reviewed papers, with additional visualization and comparison. The authors provide a GitHub repository with the code and data used to generate their comparative figures.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Not applicable (review paper). The paper describes training strategies at a conceptual level, referencing the original publications for implementation details.</p>
<h3 id="models">Models</h3>
<p>Not applicable (review paper). The paper catalogs 16 models with their architecture details, parameter counts, and training configurations across Tables 1, 4, 5, 6, and 7.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The paper compiles performance across nine MoleculeNet datasets. Key comparison figures (Figures 2 and 7) restrict to models evaluated on the same test sets for fair comparison, using ROC-AUC for classification and RMSE for regression.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/Transformers4MPP_review">Transformers4MPP_review</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Figure generation code and compiled data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sultan, A., Sieg, J., Mathea, M., &amp; Volkamer, A. (2024). Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years. <em>Journal of Chemical Information and Modeling</em>, 64(16), 6259-6280. <a href="https://doi.org/10.1021/acs.jcim.4c00747">https://doi.org/10.1021/acs.jcim.4c00747</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sultan2024transformers,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sultan, Afnan and Sieg, Jochen and Mathea, Miriam and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6259--6280}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00747}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer CLMs for SMILES: Literature Review 2024</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformer-clms-smiles-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformer-clms-smiles-review/</guid><description>Review of transformer-based chemical language models for SMILES, covering encoder, decoder, and encoder-decoder architectures for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-based-chemical-language-models">A Systematization of Transformer-Based Chemical Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (literature review) that surveys the landscape of transformer-based chemical language models (CLMs) operating on SMILES representations. It organizes the field into three architectural categories (encoder-only, decoder-only, encoder-decoder), discusses tokenization strategies, pre-training and fine-tuning methodologies, and identifies open challenges and future research directions. The review covers approximately 30 distinct CLMs published through early 2024.</p>
<h2 id="why-review-transformer-clms-for-smiles">Why Review Transformer CLMs for SMILES?</h2>
<p>The chemical space is vast, with databases like ZINC20 exceeding 5.5 billion compounds, and the amount of unlabeled molecular data far outstrips available labeled data for specific tasks like toxicity prediction or binding affinity estimation. Traditional molecular representations (fingerprints, descriptors, graph-based methods) require expert-engineered features and extensive domain knowledge.</p>
<p>Transformer-based language models, originally developed for NLP, have emerged as a compelling alternative. By treating <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings as a &ldquo;chemical language,&rdquo; these models can leverage large-scale unsupervised pre-training on abundant unlabeled molecules, then fine-tune on small labeled datasets for specific downstream tasks. Earlier approaches like Seq2Seq and Seq3Seq fingerprint methods used RNN-based encoder-decoders, but these suffered from vanishing gradients and sequential processing bottlenecks when handling long SMILES sequences.</p>
<p>The authors motivate this review by noting that no prior survey has comprehensively organized transformer-based CLMs by architecture type while simultaneously covering tokenization, embedding strategies, and downstream application domains.</p>
<h2 id="architectural-taxonomy-encoder-decoder-and-encoder-decoder-models">Architectural Taxonomy: Encoder, Decoder, and Encoder-Decoder Models</h2>
<p>The core organizational contribution is a three-way taxonomy of transformer CLMs based on their architectural backbone.</p>
<h3 id="encoder-only-models-bert-family">Encoder-Only Models (BERT Family)</h3>
<p>These models capture bidirectional context, making them well suited for extracting molecular representations for property prediction tasks. The review covers:</p>
<ul>
<li><strong>BERT</strong> (Lee and Nam, 2022): Adapted for SMILES processing with linguistic knowledge infusion, using BPE tokenization</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MOLBERT</a></strong> (Fabian et al., 2020): Chemistry-specific BERT for physicochemical property and bioactivity prediction</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></strong> (Wang et al., 2019): BERT variant designed to learn molecular representations directly from SMILES without feature engineering</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> / <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></strong> (Chithrananda et al., 2020; Ahmad et al., 2022): RoBERTa-based models optimized for chemical property prediction, with ChemBERTa-2 exploring multi-task pre-training</li>
<li><strong>GPT-MolBERTa</strong> (Balaji et al., 2023): Combines GPT molecular features with a RoBERTa backbone</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a></strong> (Ross et al., 2022): Large-scale model trained on 1.1 billion molecules, published in Nature Machine Intelligence</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></strong> (Yuksel et al., 2023): Operates on <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations rather than SMILES</li>
<li><strong>Mol-BERT / MolRoPE-BERT</strong> (Li and Jiang, 2021; Liu et al., 2023): Differ in positional embedding strategy, with MolRoPE-BERT using rotary position embedding to handle longer sequences</li>
<li><strong>BET</strong> (Chen et al., 2021): Extracts predictive representations from hundreds of millions of molecules</li>
</ul>
<h3 id="decoder-only-models-gpt-family">Decoder-Only Models (GPT Family)</h3>
<p>These models excel at generative tasks, including de novo molecular design:</p>
<ul>
<li><strong>GPT-2-based model</strong> (Adilov, 2021): Generative pre-training from molecules</li>
<li><strong>MolXPT</strong> (Liu et al., 2023): Wraps molecules with text for generative pre-training, connecting chemical and natural language</li>
<li><strong>BioGPT</strong> (Luo et al., 2022): Focuses on biomedical text generation and mining</li>
<li><strong>MolGPT</strong> (Haroon et al., 2023): Uses relative attention to capture token distances and relationships for de novo drug design</li>
<li><strong>Mol-Instructions</strong> (Fang et al., 2023): Large-scale biomolecular instruction dataset for LLMs</li>
</ul>
<h3 id="encoder-decoder-models">Encoder-Decoder Models</h3>
<p>These combine encoding and generation capabilities for sequence-to-sequence tasks:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al., 2022): BART-based model for reaction prediction and molecular property prediction</li>
<li><strong>MolT5</strong> (adapted T5): Unified text-to-text framework for molecular tasks</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES Transformer</a></strong> (Honda et al., 2019): Pre-trained molecular fingerprints for low-data drug discovery</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-MOL</a></strong> (Xue et al., 2020): Large-scale pre-training for molecular understanding</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a></strong> (Born and Manica, 2023): Operates on <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, enabling concurrent regression and generation</li>
<li><strong>TransAntivirus</strong> (Mao et al., 2023): Specialized for antiviral drug design using IUPAC nomenclature</li>
</ul>
<h2 id="tokenization-embedding-and-pre-training-strategies">Tokenization, Embedding, and Pre-Training Strategies</h2>
<h3 id="smiles-tokenization">SMILES Tokenization</h3>
<p>The review identifies tokenization as a critical preprocessing step that affects downstream performance. SMILES tokenization differs from standard NLP tokenization because SMILES strings lack whitespace and use parentheses for branching rather than sentence separation. The key approaches include:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Source</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a></td>
          <td>Ucak et al. (2023)</td>
          <td>Atom-level tokens preserving chemical identity</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a></td>
          <td>Li and Fourches (2021)</td>
          <td>BPE-inspired substructure tokenization</td>
      </tr>
      <tr>
          <td>Byte-Pair Encoding (BPE)</td>
          <td>Chithrananda et al. (2020); Lee and Nam (2022)</td>
          <td>Standard subword tokenization adapted for SMILES</td>
      </tr>
      <tr>
          <td>SMILESTokenizer</td>
          <td>Chithrananda et al. (2020)</td>
          <td>Character-level tokenization with chemical adjustments</td>
      </tr>
  </tbody>
</table>
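<p>As an illustration of atom-level tokenization, the sketch below uses the regular expression popularized by Schwaller et al. for splitting SMILES into chemically meaningful tokens; the BPE and SPE strategies in the table would segment the same string into larger subword units.</p>

```python
import re

# Atom-level SMILES tokenizer: bracket atoms, two-letter elements (Br, Cl),
# single atoms, bonds, branches, and ring-closure digits.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_REGEX.findall(smiles)
    # Round-trip check: every character must belong to some token.
    assert "".join(tokens) == smiles, "unrecognized characters in input"
    return tokens

tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, 21 tokens
```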
<h3 id="positional-embeddings">Positional Embeddings</h3>
<p>The models use various positional encoding strategies: absolute, relative key, relative key-query, rotary (RoPE), and sinusoidal. Notably, SMILES-based models omit segmentation embeddings since SMILES data consists of single sequences rather than sentence pairs.</p>
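<p>For reference, here is a minimal sketch of the absolute sinusoidal variant, one of the strategies listed above; rotary embeddings (RoPE) instead rotate query/key pairs by position-dependent angles rather than adding a fixed vector.</p>

```python
import math

def sinusoidal_pe(num_positions, d_model):
    """Sine on even dimensions, cosine on odd, per the original Transformer."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_pe(4, 8)  # position 0 is [0, 1, 0, 1, ...]
```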
<h3 id="pre-training-and-fine-tuning-pipeline">Pre-Training and Fine-Tuning Pipeline</h3>
<p>The standard workflow follows two phases:</p>
<ol>
<li><strong>Pre-training</strong>: Unsupervised training on large unlabeled SMILES databases (ZINC, PubChem, ChEMBL) using masked language modeling (MLM), where the model learns to predict masked tokens within SMILES strings</li>
<li><strong>Fine-tuning</strong>: Supervised adaptation on smaller labeled datasets for specific tasks (classification or regression)</li>
</ol>
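<p>Step 1 can be sketched as BERT-style token corruption applied to a tokenized SMILES string. The 15%/80%/10%/10% ratios below follow the original BERT recipe and may differ across the reviewed CLMs.</p>

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=random.Random(0)):
    """Select ~15% of positions as prediction targets: 80% become [MASK],
    10% a random vocabulary token, 10% are left unchanged."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok  # the model must reconstruct this token
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token
    return corrupted, labels

corrupted, labels = mask_tokens(["C"] * 200, vocab=["C", "N", "O"])
```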
<p>The self-attention mechanism, central to all transformer CLMs, is formulated as:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^Q$, $W^K$, $W^V \in \mathbb{R}^{M \times d_k}$ are learnable weight matrices, and $\sqrt{d_k}$ is the scaling factor.</p>
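<p>For tiny matrices the formula can be evaluated directly in plain Python; here $Q$, $K$, and $V$ stand for the already-projected matrices $XW^Q$, $XW^K$, and $XW^V$. Real implementations batch this over multiple heads with tensor operations.</p>

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(c) for c in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

# One query attending over two keys/values.
Z = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```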
<h2 id="benchmark-datasets-and-evaluation-landscape">Benchmark Datasets and Evaluation Landscape</h2>
<p>The review catalogs the standard evaluation ecosystem for CLMs. Pre-training databases include ZINC, PubChem, and ChEMBL. Fine-tuning and evaluation rely heavily on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Example Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Regression</td>
          <td>642 to 4,200</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA, MUV, HIV, PDBbind, BACE</td>
          <td>Classification/Regression</td>
          <td>11,908 to 437,929</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox</td>
          <td>Classification</td>
          <td>1,427 to 8,575</td>
      </tr>
  </tbody>
</table>
<p>The authors also propose four new fine-tuning datasets targeting diseases: COVID-19 drug compounds, cocrystal formation, antimalarial drugs (Plasmodium falciparum targets), and cancer gene expression/drug response data.</p>
<h2 id="challenges-limitations-and-future-directions">Challenges, Limitations, and Future Directions</h2>
<h3 id="current-challenges">Current Challenges</h3>
<p>The review identifies several persistent limitations:</p>
<ol>
<li><strong>Data efficiency</strong>: Despite transfer learning, transformer CLMs still require substantial pre-training data, and labeled datasets for specific tasks remain scarce</li>
<li><strong>Interpretability</strong>: The complexity of transformer architectures makes it difficult to understand how specific molecular features contribute to predictions</li>
<li><strong>Computational cost</strong>: Training large-scale models demands significant GPU resources, limiting accessibility</li>
<li><strong>Handling rare molecules</strong>: Models struggle with molecular structures that deviate significantly from training data distributions</li>
<li><strong>SMILES limitations</strong>: Non-unique representations, invalid strings, exceeded atom valency, and inadequate spatial information capture</li>
</ol>
<h3 id="smiles-representation-issues">SMILES Representation Issues</h3>
<p>The authors highlight five specific problems with SMILES as an input representation:</p>
<ul>
<li>Non-canonical representations reduce string uniqueness for the same molecule</li>
<li>Many symbol combinations produce chemically invalid outputs</li>
<li>Valid SMILES strings can encode chemically impossible molecules (e.g., exceeded valency)</li>
<li>Spatial information is inadequately captured</li>
<li>Syntactic and semantic robustness is limited</li>
</ul>
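<p>The second and third points are distinct failure modes: a string can be syntactically malformed, or well-formed yet chemically impossible. The toy check below catches only two syntactic issues, unbalanced parentheses and unclosed ring-bond digits; it ignores bracket atoms and <code>%nn</code> ring closures, and full validity checking (valence, aromaticity) requires a chemistry toolkit such as RDKit.</p>

```python
def syntactically_plausible(smiles):
    """Toy syntax check: balanced parentheses and paired ring-closure digits."""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing branch that was never opened
        elif ch.isdigit():
            open_rings ^= {ch}  # each ring digit must appear an even number of times
    return depth == 0 and not open_rings
```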
<h3 id="future-research-directions">Future Research Directions</h3>
<p>The review proposes several directions:</p>
<ul>
<li><strong>Alternative molecular representations</strong>: Exploring <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, IUPAC, and InChI beyond SMILES</li>
<li><strong>Role of SMILES token types</strong>: Strategic masking of metals, non-metals, bonds, and branches during MLM pre-training to identify which components are most critical</li>
<li><strong>Few-shot learning</strong>: Combining few-shot approaches with large-scale pre-trained CLMs for data-scarce scenarios</li>
<li><strong>Drug repurposing</strong>: Training CLMs to distinguish identical compounds with different biological activity profiles across therapeutic domains</li>
<li><strong>Improved benchmarks</strong>: Incorporating disease-specific datasets (malaria, cancer, COVID-19) for more realistic evaluation</li>
<li><strong>Ethical considerations</strong>: Addressing dual-use risks, data biases, and responsible open-source release of CLMs</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a literature review paper. It does not introduce new models, code, or experimental results. The reproducibility assessment focuses on the accessibility of the reviewed works and proposed datasets.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20</td>
          <td>5.5B+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>100M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL</td>
          <td>2M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>642 to 437,929</td>
          <td>Standard benchmark suite</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>COVID-19 drug compounds</td>
          <td>740</td>
          <td>From Harigua-Souiai et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cocrystal formation</td>
          <td>3,282</td>
          <td>From Mswahili et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Antimalarial drugs</td>
          <td>4,794</td>
          <td>From Mswahili et al. (2024)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cancer gene/drug response</td>
          <td>201 drugs, 734 cell lines</td>
          <td>From Kim et al. (2021)</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://dai.chungbuk.ac.kr/">DAI Lab website</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Authors&rsquo; research lab</td>
      </tr>
  </tbody>
</table>
<p>No code, models, or evaluation scripts are released with this review. The paper does not include a supplementary materials section or GitHub repository.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (literature review).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Mswahili, M. E., &amp; Jeong, Y.-S. (2024). Transformer-based models for chemical SMILES representation: A comprehensive literature review. <em>Heliyon</em>, 10(20), e39038. <a href="https://doi.org/10.1016/j.heliyon.2024.e39038">https://doi.org/10.1016/j.heliyon.2024.e39038</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mswahili2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based models for chemical {SMILES} representation: A comprehensive literature review}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Mswahili, Medard Edmund and Jeong, Young-Seob}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Heliyon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e39038}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.heliyon.2024.e39038}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Systematic Review of Deep Learning CLMs (2020-2024)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/systematic-review-deep-learning-clms/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/systematic-review-deep-learning-clms/</guid><description>Systematic review of 72 deep learning molecular generation studies using MOSES and GuacaMol benchmarks across RNNs, transformers, VAEs, and GANs.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-molecular-generation">A Systematization of Chemical Language Models for Molecular Generation</h2>
<p>This paper is a <strong>Systematization</strong> that provides a comprehensive, PRISMA-guided systematic review of deep learning chemical language models (CLMs) used for de novo molecular generation. The primary contribution is a structured statistical analysis of 72 retrieved articles from 2020 to June 2024, comparing architectures (RNNs, transformers, VAEs, GANs, S4 models), molecular representations, biased generation strategies, and quality metrics from the MOSES and GuacaMol benchmarking platforms. The review addresses five research questions about architecture configuration effects, best-performing architectures, impactful hyperparameters, common molecular representations, and effective biased generation methods.</p>
<h2 id="motivation-evaluating-four-years-of-generative-clm-progress">Motivation: Evaluating Four Years of Generative CLM Progress</h2>
<p>Deep learning molecular generation has expanded rapidly since 2018, when <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> and <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al.</a> demonstrated that deep generative models could learn to produce novel molecules from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations. By 2020, multiple architectures (RNNs, transformers, VAEs, GANs) were being applied to chemical language modeling, and benchmarking platforms like <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> had been introduced to enable standardized evaluation.</p>
<p>Despite this growth, existing reviews largely focused on theoretical background or drug development applications rather than systematic statistical comparison of model performance. Few studies had examined how architecture choice, training dataset size, molecular representation format, and biased learning strategies interact to affect generation quality metrics like validity, uniqueness, and novelty. This review fills that gap by restricting the analysis to papers reporting MOSES or GuacaMol metrics, enabling quantitative cross-study comparison.</p>
<h2 id="prisma-based-systematic-review-methodology">PRISMA-Based Systematic Review Methodology</h2>
<p>The review follows the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines. Articles were retrieved from Scopus, Web of Science, and Google Scholar using six Boolean search queries combining terms like &ldquo;Molecule Generation,&rdquo; &ldquo;Chemical Language Models,&rdquo; &ldquo;Deep Learning,&rdquo; and specific architecture names. The search window covered January 2020 to June 2024.</p>
<h3 id="eligibility-criteria">Eligibility Criteria</h3>
<p>Papers were included if they:</p>
<ol>
<li>Were written in English</li>
<li>Explicitly presented at least two metrics of uniqueness, validity, or novelty</li>
<li>Defined these metrics consistent with MOSES or GuacaMol concepts</li>
<li>Used deep learning generative models for de novo molecule design</li>
<li>Used conventional (non-quantum) deep learning methods</li>
<li>Were published between January 2020 and June 2024</li>
</ol>
<p>This yielded 48 articles from the query-based search and 25 from the citation search, for a combined total of 72 articles after deduplication. Of these, 62 used CLM approaches (string-based molecular representations) and 10 used graph-based representations.</p>
<h3 id="data-collection">Data Collection</h3>
<p>For each article, the authors extracted: journal details, database name, training dataset size, molecular representation type (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, InChI, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>), architecture details (embedding length, layers, hidden units, trainable parameters, dropout, temperature, batch size, epochs, learning rate, optimizer), biased method usage (TL, RL, conditional learning), and generation metrics (validity, uniqueness, novelty, scaffold diversity, SNN, FCD).</p>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The review focuses on three core MOSES metrics:</p>
<p>$$
\text{Validity}(V_m) = \frac{\text{Valid molecules}}{\text{Molecules produced}}
$$</p>
<p>$$
\text{Uniqueness} = \frac{|\text{set}(V_m)|}{|V_m|}
$$</p>
<p>$$
\text{Novelty} = 1 - \frac{|V_m \cap T_d|}{|V_m|}
$$</p>
<p>where $V_m$ denotes the valid generated molecules, $\text{set}(V_m)$ their deduplicated set, and $T_d$ the training dataset.</p>
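<p>As a concrete illustration, all three metrics can be computed in a few lines. This is a minimal sketch, not code from the review: the validity check is passed in as a predicate, since a real pipeline would typically attempt to parse each string with RDKit.</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """MOSES-style validity, uniqueness, and novelty.

    generated: list of generated strings (duplicates and invalids allowed)
    training_set: set of training-set molecules
    is_valid: validity predicate (a real pipeline would parse with RDKit)
    """
    valid = [m for m in generated if is_valid(m)]
    if not valid:
        return {"validity": 0.0, "uniqueness": 0.0, "novelty": 0.0}
    unique = set(valid)                 # set(V_m)
    novel = unique - set(training_set)  # molecules absent from T_d
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid),
        "novelty": len(novel) / len(unique),
    }
```

<p>For example, four samples containing one invalid string and one duplicate, scored against a training set that contains one of the two unique valid molecules, yield validity 0.75, uniqueness 2/3, and novelty 0.5.</p>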
<h2 id="architecture-distribution-and-performance-comparison">Architecture Distribution and Performance Comparison</h2>
<h3 id="architecture-trends-2020-2024">Architecture Trends (2020-2024)</h3>
<p>The review found that RNNs and transformers dominate CLM usage, with a growing trend toward transformers over time. The reviewed models break down as 24 RNN-based, 23 transformer-based, 16 VAE-based, 8 GAN-based, and 1 S4-based. Among RNN variants, LSTM was the most common, followed by GRU, despite GRU having fewer trainable parameters.</p>
<p>The increase in transformer adoption is attributed to self-attention mechanisms enabling parallel computation and effective long-range dependency capture. Meanwhile, GANs and VAEs saw lower adoption rates, partly due to higher memory and time complexity and reduced ability to generate large molecules.</p>
<h3 id="molecular-representations-and-databases">Molecular Representations and Databases</h3>
<p>SMILES was used exclusively in 77.27% of CLM articles, reflecting its wide database availability and compact format. <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and InChI each appeared in smaller fractions. The dominant databases were ChEMBL and ZINC (27 articles each), followed by PubChem (4 articles). Approximately 71% of reviewed articles focused on drug discovery applications.</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Molecules (millions)</th>
          <th>Representation</th>
          <th>Articles</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>2.4</td>
          <td>SMILES, InChI</td>
          <td>27</td>
      </tr>
      <tr>
          <td>ZINC</td>
          <td>750</td>
          <td>SMILES</td>
          <td>27</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>115.3</td>
          <td>SMILES, InChI</td>
          <td>4</td>
      </tr>
      <tr>
          <td>COCONUT</td>
          <td>0.695</td>
          <td>SMILES, InChI</td>
          <td>1</td>
      </tr>
      <tr>
          <td>DNA-Encoded Library</td>
          <td>1,040</td>
          <td>SMILES</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<h3 id="unbiased-model-performance">Unbiased Model Performance</h3>
<p><strong>Validity</strong>: No statistically significant differences were observed across architecture families. Transformers generally achieved high validity through self-attention mechanisms that retain uncompressed sequence information. However, one transformer model (TransMol) achieved only 6.9% validity when using stochastic sampling with Gaussian noise to explore unseen chemical space. GANs showed high dispersion, with validity as low as 8.5% when learning from gene expression signatures rather than molecular structures directly.</p>
<p><strong>Uniqueness</strong>: No significant differences in median uniqueness across architectures. Transformer-based models using masked self-attention achieved near-perfect uniqueness scores. Scaffold decoration and fragment-linking approaches sometimes compromised uniqueness due to overfit-driven redundancy.</p>
<p><strong>Validity-Novelty Trade-off</strong>: The authors propose a &ldquo;Valid/Sample&rdquo; metric (Validity x Novelty) and find an inverse trend between validity and novelty (Spearman $\rho = -0.3575$, p-value = 0.0618). Only 17.9% of models achieved above-median values for both validity (95.6%) and novelty (96.5%) simultaneously. SELFIES-based models achieve 100% validity by construction, which can help address this trade-off.</p>
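<p>The trade-off analysis can be sketched as follows. This is an illustrative re-implementation, not the authors' code: in practice <code>scipy.stats.spearmanr</code> would be used (and also returns the p-value), but a tie-aware rank correlation fits in a few lines of plain Python.</p>

```python
def average_ranks(xs):
    """1-based ranks, with ties assigned the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def valid_per_sample(validity, novelty):
    """The review's composite Valid/Sample metric: Validity x Novelty."""
    return [v * n for v, n in zip(validity, novelty)]
```

<p>A perfectly inverse ranking gives $\rho = -1$; the review's reported $\rho = -0.3575$ indicates a weak inverse trend across models.</p>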
<h3 id="biased-model-performance">Biased Model Performance</h3>
<p>The review examines three biased generation strategies:</p>
<p><strong>Transfer Learning (TL)</strong>: The most prevalent biased method, used across all architecture types. Fine-tuning transfers pre-trained parameters to a target model, requiring significantly fewer training molecules (median ~2,507 vs. ~1.1M for unbiased). TL does not significantly affect validity (p = 0.16) or novelty (p = 0.84), but uniqueness decreases significantly (median 90.2% vs. 97.9%, p = 0.014), likely due to overfitting on small target datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>TL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training size</td>
          <td>1,128,920</td>
          <td>2,507</td>
          <td>&lt;0.0001</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>98.05%</td>
          <td>95.5%</td>
          <td>0.1602</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>97.9%</td>
          <td>90.2%</td>
          <td>0.0144</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.6%</td>
          <td>96.0%</td>
          <td>0.8438</td>
      </tr>
  </tbody>
</table>
<p><strong>Reinforcement Learning (RL)</strong>: Applied only to RNNs and transformers in the reviewed set. 90.1% of RL implementations used policy gradient methods with scoring functions for properties like synthesizability, binding affinity, and membrane permeability. No significant effects on generation metrics were observed.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>RL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>91.1%</td>
          <td>96.5%</td>
          <td>0.1289</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>89.7%</td>
          <td>0.0935</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.5%</td>
          <td>93.5%</td>
          <td>0.2500</td>
      </tr>
  </tbody>
</table>
<p><strong>Conditional Learning (CL)</strong>: Integrates domain-specific data (properties, bioactivities, functional groups) directly into training via constraint tokens or property embeddings. Used primarily with encoder-decoder architectures (ARAEs, VAEs, transformers). CL does not significantly degrade generation metrics relative to unbiased models.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>CL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>98.5%</td>
          <td>96.8%</td>
          <td>0.4648</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>97.5%</td>
          <td>0.0753</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>89.3%</td>
          <td>99.6%</td>
          <td>0.2945</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-directions-for-chemical-language-models">Key Findings and Directions for Chemical Language Models</h2>
<h3 id="main-conclusions">Main Conclusions</h3>
<ol>
<li>
<p><strong>Transformers are overtaking RNNs</strong> as the dominant CLM architecture, driven by self-attention mechanisms that capture long-range dependencies without the gradient vanishing issues of recurrent models.</p>
</li>
<li>
<p><strong>SMILES remains dominant</strong> (77% of models) despite known limitations (non-uniqueness, syntax errors). SELFIES shows promise for improving the validity-novelty trade-off.</p>
</li>
<li>
<p><strong>No architecture achieves both high validity and high novelty easily.</strong> Only 17.9% of unbiased models exceeded medians for both metrics simultaneously, highlighting a fundamental tension in generative chemistry.</p>
</li>
<li>
<p><strong>Transfer learning requires only ~2,500 molecules</strong> to generate targeted compounds, compared to ~1.1M for unbiased training, but at the cost of reduced uniqueness.</p>
</li>
<li>
<p><strong>Combining biased methods</strong> (e.g., TL + RL, CL + TL) shows promise for multi-objective optimization and exploring distant regions of chemical space.</p>
</li>
<li>
<p><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/s4-chemical-language-modeling/">S4 models</a></strong> were newly introduced for CLMs in 2023, showing competitive performance thanks to their dual formulation: convolutional during training and recurrent during generation.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The review is restricted to papers reporting MOSES or GuacaMol metrics, which excludes many molecular generation studies that use alternative evaluation frameworks. The statistical comparisons rely on median values reported across different experimental settings, making direct architecture comparisons approximate. Graph-based approaches are included only for coarse comparison (10 of 72 articles) and are not the focus of the analysis.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a systematic review, so no new models were trained. The authors collected metadata from 72 published articles. No datasets were generated or analyzed beyond the literature corpus.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Statistical comparisons used Mann-Whitney U tests, a non-parametric test for independent (unpaired) samples. Spearman rank correlation was used to assess the validity-novelty relationship. Outlier identification used the Valid/Sample (Validity × Novelty) metric with box-plot analysis.</p>
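<p>For intuition, the rank-sum comparison can be sketched using only the standard library and a normal approximation. This is illustrative, not the authors' implementation; <code>scipy.stats.mannwhitneyu</code> is the standard choice and handles ties and small samples exactly.</p>

```python
from statistics import NormalDist

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the large-sample normal approximation."""
    n1, n2 = len(a), len(b)
    # U = number of (a_i, b_j) pairs with a_i > b_j, counting ties as 0.5
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mu = n1 * n2 / 2                                  # mean of U under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5     # std of U under H0
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p
```

<p>Two fully separated samples give a small p-value, while identical samples give $p = 1$, mirroring how the review distinguishes significant shifts (e.g. TL uniqueness) from non-significant ones.</p>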
<h3 id="evaluation">Evaluation</h3>
<p>The review evaluates models using MOSES metrics: validity, uniqueness, novelty, scaffold diversity, scaffold novelty, fragment similarity, SNN, internal diversity, and <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>. Statistical tests were applied to compare medians across architecture families and between biased and unbiased models.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (systematic review, no model training performed).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flores-Hernandez, H., &amp; Martínez-Ledesma, E. (2024). A systematic review of deep learning chemical language models in recent era. <em>Journal of Cheminformatics</em>, 16(1), 129. <a href="https://doi.org/10.1186/s13321-024-00916-y">https://doi.org/10.1186/s13321-024-00916-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{floreshernandez2024systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of deep learning chemical language models in recent era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flores-Hernandez, Hector and Mart{\&#39;i}nez-Ledesma, Emmanuel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{129}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00916-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Survey of Transformer Architectures in Molecular Science</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformers-molecular-science-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformers-molecular-science-review/</guid><description>A comprehensive review of 12 transformer architectures applied to molecular science, covering GPT, BERT, BART, graph transformers, and more.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-architectures-for-molecular-science">A Systematization of Transformer Architectures for Molecular Science</h2>
<p>This paper is a <strong>Systematization</strong> review. It organizes and taxonomizes 12 families of transformer architectures that have been applied across molecular science, including chemistry, biology, and drug discovery. The primary contribution is not a new method or dataset, but a structured technical overview of the algorithmic internals of each transformer variant and their specific applications to molecular problems. The review covers 201 references and provides a unified treatment of how these architectures capture molecular patterns from sequential, graphical, and image-based data.</p>
<h2 id="bridging-the-gap-between-transformer-variants-and-molecular-applications">Bridging the Gap Between Transformer Variants and Molecular Applications</h2>
<p>Transformer-based models have become widespread in molecular science, yet the authors identify a gap: there is no organized taxonomy linking these diverse techniques in the existing literature. Individual papers introduce specific architectures or applications, but practitioners lack a unified reference that explains the technical differences between GPT, BERT, BART, graph transformers, and other variants in the context of molecular data. The review aims to fill this gap by providing an in-depth investigation of the algorithmic components of each model family, explaining how their architectural innovations contribute to processing complex molecular data. The authors note that the success of transformers in molecular science stems from several factors: the sequential nature of chemical and biological molecules (DNA, RNA, proteins, SMILES strings), the attention mechanism&rsquo;s ability to capture long-range dependencies within molecular structures, and the capacity for transfer learning through pre-training on large chemical and biological datasets.</p>
<h2 id="twelve-transformer-families-and-their-molecular-mechanisms">Twelve Transformer Families and Their Molecular Mechanisms</h2>
<p>The review covers transformer preliminaries before diving into 12 specific architecture families. The core self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimension of the key vectors. The position-wise feed-forward network is:</p>
<p>$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$</p>
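<p>The attention equation above can be made concrete in a dependency-free sketch (lists of lists stand in for matrices; real implementations use tensor libraries with batched, multi-head variants):</p>

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
              for qr in Q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v for w, v in zip(wr, vc)) for vc in zip(*V)]
            for wr in weights]
```

<p>With a query that strongly matches the first key, the output is numerically the first row of <code>V</code>, showing how attention routes information between positions.</p>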
<p>The 12 architecture families covered are:</p>
<ol>
<li>
<p><strong>GPT (Generative Pre-trained Transformer)</strong>: Uses the decoder part of the transformer for autoregressive generation. Applications include MolGPT for molecular generation, DrugGPT for protein-ligand binding, and cMolGPT for target-specific de novo molecular generation.</p>
</li>
<li>
<p><strong>BERT (Bidirectional Encoder Representations from Transformers)</strong>: Uses transformer encoders with masked language modeling and next-sentence prediction for pre-training. Molecular applications include FP-BERT for molecular property prediction using composite fingerprint representations, Graph-BERT for protein-protein interaction identification, SMILES-BERT, and Mol-BERT.</p>
</li>
<li>
<p><strong>BART (Bidirectional and Auto-Regressive Transformers)</strong>: Functions as a denoising autoencoder with both encoder and decoder. Molecular applications include Chemformer for sequence-to-sequence chemistry tasks, MS2Mol for mass spectrometry analysis, and MolBART for molecular feature learning.</p>
</li>
<li>
<p><strong>Graph Transformer</strong>: Leverages self-attention on graph-structured data to capture global context. Applications include GraphSite for protein-DNA binding site prediction (using AlphaFold2 structure predictions), KPGT for knowledge-guided molecular graph pre-training, and PAGTN for establishing long-range dependencies in molecular graphs.</p>
</li>
<li>
<p><strong>Transformer-XL</strong>: Incorporates relative positional encoding for modeling long sequences. Used for small-molecule retention time prediction, drug design with ChEMBL data (1.27 million molecules), and Heck reaction generation.</p>
</li>
<li>
<p><strong><a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5 (Text-to-Text Transfer Transformer)</a></strong>: Unifies NLP tasks into text-to-text mapping. T5Chem was pre-trained on 97 million molecules from PubChem and achieved 99.5% accuracy on reaction classification (USPTO 500 MT). C5T5 uses IUPAC naming for molecular optimization in drug discovery.</p>
</li>
<li>
<p><strong>Vision Transformer (ViT)</strong>: Applies transformer architecture to image patches. Used for organic molecule classification (97% accuracy with WGAN-generated data), bacterial identification via SERS, and molecular property prediction from mass spectrometry data (TransG-Net).</p>
</li>
<li>
<p><strong>DETR (Detection Transformer)</strong>: End-to-end object detection using transformers. Applied to cryo-EM particle picking (TransPicker), molecular structure image recognition (IMG2SMI), and cell segmentation (Cell-DETR).</p>
</li>
<li>
<p><strong>Conformer</strong>: Integrates convolutional modules into the transformer structure. Used for DNA storage error correction (RRCC-DNN) and drug-target affinity prediction (NG-DTA, with the Davis and Kiba datasets).</p>
</li>
<li>
<p><strong>CLIP (Contrastive Language-Image Pre-training)</strong>: Multimodal learning linking text and images. Applied to peptide design (Cut&amp;CLIP for protein degradation), gene identification (pathCLIP), and drug discovery (CLOOME for zero-shot transfer learning).</p>
</li>
<li>
<p><strong>Sparse Transformers</strong>: Use sparse attention matrices to reduce complexity to $O(n\sqrt{n})$. Applied to drug-target interaction prediction with gated cross-attention mechanisms.</p>
</li>
<li>
<p><strong>Mobile and Efficient Transformers</strong>: Compressed variants (TinyBERT, MobileBERT) for resource-constrained environments. Molormer uses ProbSparse self-attention for drug-drug interaction prediction. LOGO is a lightweight pre-trained language model for non-coding genome interpretation.</p>
</li>
</ol>
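<p>A key distinction running through this taxonomy, separating GPT-style decoders (family 1) from BERT-style encoders (family 2), is the attention mask. A minimal sketch of the two masking schemes (illustrative, not from the paper):</p>

```python
def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j.

    causal=True  -> GPT-style decoder: attend only to self and the past,
                    enabling autoregressive generation.
    causal=False -> BERT-style encoder: full bidirectional attention,
                    suited to masked-token prediction.
    """
    return [[j <= i if causal else True for j in range(n)] for i in range(n)]
```

<p>The causal variant yields a lower-triangular mask; the bidirectional variant allows every position to see the whole sequence, which is why BERT-style models excel at representation learning while GPT-style models excel at generation.</p>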
<h2 id="survey-organization-and-coverage-of-molecular-domains">Survey Organization and Coverage of Molecular Domains</h2>
<p>As a survey paper, this work does not present new experiments. Instead, it catalogues existing applications across multiple molecular domains:</p>
<p><strong>Drug Discovery and Design</strong>: GPT-based ligand design (DrugGPT), BART-based molecular generation (Chemformer, MolBART), graph transformer pre-training for molecular property prediction (KPGT), T5-based chemical reaction prediction (T5Chem), and sparse transformer methods for drug-target interactions.</p>
<p><strong>Protein Science</strong>: BERT-based protein-protein interaction prediction (Graph-BERT), graph transformer methods for protein-DNA binding (GraphSite with AlphaFold2 integration), conformer-based drug-target affinity prediction (NG-DTA), and CLIP-based peptide design (Cut&amp;CLIP).</p>
<p><strong>Molecular Property Prediction</strong>: FP-BERT for fingerprint-based prediction, SMILES-BERT and Mol-BERT for end-to-end prediction from SMILES, KPGT for knowledge-guided graph pre-training, and Transformer-XL for property modeling with relative positional encoding.</p>
<p><strong>Structural Biology</strong>: DETR-based cryo-EM particle picking (TransPicker), vision transformer applications in cell imaging, and Cell-DETR for instance segmentation in microscopy.</p>
<p><strong>Genomics</strong>: Conformer-based DNA storage error correction (RRCC-DNN), LOGO for non-coding genome interpretation, and MetaTransformer for metagenomic sequencing analysis.</p>
<h2 id="future-directions-and-limitations-of-the-survey">Future Directions and Limitations of the Survey</h2>
<p>The review concludes with four future directions:</p>
<ol>
<li>
<p><strong>ChatGPT integration into molecular science</strong>: Using LLMs for data analysis, literature review, and hypothesis generation in chemistry and biology.</p>
</li>
<li>
<p><strong>Multifunction transformers</strong>: Models that extract features across diverse molecular structures and sequences simultaneously.</p>
</li>
<li>
<p><strong>Molecular-aware transformers</strong>: Architectures that handle multiple data types (text, sequence, structure, image, energy, molecular dynamics, function) in a unified framework.</p>
</li>
<li>
<p><strong>Self-assessment transformers and superintelligence</strong>: Speculative discussion of models that learn from seemingly unrelated data sources.</p>
</li>
</ol>
<p>The review has several limitations worth noting. The coverage is broad but shallow: each architecture family receives only 1-2 pages of discussion, and the paper largely describes existing work rather than critically evaluating it. The review does not systematically compare the architectures against each other on common benchmarks. The future directions section (particularly the superintelligence discussion) is speculative and lacks concrete proposals. The paper also focuses primarily on technical architecture descriptions rather than analyzing failure modes, scalability challenges, or reproducibility concerns across the surveyed methods. As a review article, no new data were created or analyzed.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a survey paper. No new datasets were created or used. The paper reviews applications involving datasets such as PubChem (97 million molecules for T5Chem), ChEMBL (1.27 million molecules for Transformer-XL drug design), USPTO 500 MT (reaction classification), ESOL (5,328 molecules for property prediction), and Davis/Kiba (drug-target affinity).</p>
<h3 id="algorithms">Algorithms</h3>
<p>No new algorithms are introduced. The paper provides mathematical descriptions of the core transformer components (self-attention, positional encoding, feed-forward networks, layer normalization) and describes how 12 architecture families modify these components.</p>
<h3 id="models">Models</h3>
<p>No new models are presented. The paper surveys existing models including MolGPT, DrugGPT, FP-BERT, SMILES-BERT, Chemformer, MolBART, GraphSite, KPGT, T5Chem, TransPicker, Cell-DETR, CLOOME, and Molormer, among others.</p>
<h3 id="evaluation">Evaluation</h3>
<p>No new evaluation is performed. Performance numbers cited from the literature include: T5Chem reaction classification accuracy of 99.5%, ViT organic molecule classification at 97%, Transformer-XL property prediction RMSE of 0.6 on ESOL, and Heck reaction generation feasibility rate of 47.76%.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware requirements are specified, as this is a survey paper.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/wcms.1725">Paper (open access)</a></td>
          <td>Paper</td>
          <td>CC-BY-NC-ND</td>
          <td>Open access via Wiley</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jiang, J., Ke, L., Chen, L., Dou, B., Zhu, Y., Liu, J., Zhang, B., Zhou, T., &amp; Wei, G.-W. (2024). Transformer technology in molecular science. <em>WIREs Computational Molecular Science</em>, 14(4), e1725. <a href="https://doi.org/10.1002/wcms.1725">https://doi.org/10.1002/wcms.1725</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jiang2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer technology in molecular science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jiang, Jian and Ke, Lu and Chen, Long and Dou, Bozheng and Zhu, Yueying and Liu, Jie and Zhang, Bengong and Zhou, Tianshou and Wei, Guo-Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{WIREs Computational Molecular Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e1725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Wiley}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1002/wcms.1725}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Survey of Scientific LLMs in Bio and Chem Domains</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/scientific-llm-survey-bio-chem/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/scientific-llm-survey-bio-chem/</guid><description>Survey of scientific LLMs covering textual, molecular, protein, genomic, and multimodal models for biological and chemical research.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-scientific-language-models">A Systematization of Scientific Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (survey) that provides a comprehensive review of scientific large language models (Sci-LLMs) designed for biological and chemical domains. The survey covers five main branches of scientific language modeling: textual, molecular, protein, genomic, and multimodal LLMs. For each branch, the authors analyze model architectures, capabilities, training datasets, evaluation benchmarks, and assessment criteria, then identify open challenges and future research directions.</p>
<h2 id="motivation-bridging-scientific-languages-and-llms">Motivation: Bridging Scientific Languages and LLMs</h2>
<p>Large language models have demonstrated strong capabilities in natural language understanding, but scientific research involves specialized &ldquo;languages&rdquo; that differ fundamentally from natural text. Chemical molecules are expressed as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings, proteins as amino acid sequences, and genomes as nucleotide sequences. Each of these language systems has its own vocabulary and grammar. General-purpose LLMs like ChatGPT and GPT-4 often fail to properly handle these scientific data types because the semantics and grammar of scientific languages diverge substantially from natural language.</p>
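<p>The vocabulary mismatch is easy to see in code. A chemistry-aware SMILES tokenizer might look like the following sketch (the regex is a simplified, hypothetical variant of patterns common in the CLM literature, not one defined in this survey); a natural-language subword tokenizer would instead split such strings at arbitrary points:</p>

```python
import re

# Multi-character chemical tokens (Cl, Br, bracket atoms) must stay intact;
# this simplified pattern covers common organic-subset SMILES.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|[=#$/\\%@+\-().]|\d)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: every character must be accounted for.
    assert "".join(tokens) == smiles, "unrecognized characters in input"
    return tokens
```

<p>For example, <code>CCl</code> tokenizes to two tokens, <code>C</code> and <code>Cl</code>, rather than three characters, and a bracket atom like <code>[NH4+]</code> stays a single token.</p>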
<p>Prior surveys have focused on individual modalities (molecules, proteins, or genomes) in isolation. No comprehensive review had unified these language modeling advances into a single framework. This survey fills that gap by systematically covering all five modalities and, notably, the emerging area of multimodal Sci-LLMs that integrate multiple scientific languages.</p>
<h2 id="taxonomy-of-scientific-language-models">Taxonomy of Scientific Language Models</h2>
<p>The survey organizes Sci-LLMs into a clear taxonomic framework built on two axes: the scientific language modality and the model architecture type.</p>
<h3 id="scientific-language-modalities">Scientific Language Modalities</h3>
<p>The authors define five categories of Sci-LLMs:</p>
<ol>
<li>
<p><strong>Text-Sci-LLMs</strong>: LLMs trained on scientific textual corpora (medical, biological, chemical, and comprehensive domains). Examples include BioBERT, BioGPT, ChemBERT, SciBERT, and <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a>.</p>
</li>
<li>
<p><strong>Mol-LLMs</strong>: Models that process molecular languages (SMILES, SELFIES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>). These include encoder-only models like <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a> for property prediction, decoder-only models like MolGPT for molecular generation, and encoder-decoder models like Molecular Transformer and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> for reaction prediction.</p>
</li>
<li>
<p><strong>Prot-LLMs</strong>: Models operating on protein amino acid sequences. The ESM series (ESM-1b, ESM-2) and ProtTrans serve as encoders for function and structure prediction, while ProGen and ProtGPT2 generate novel protein sequences.</p>
</li>
<li>
<p><strong>Gene-LLMs</strong>: Models for DNA and RNA sequences, including DNABERT, Nucleotide Transformer, HyenaDNA, and Evo, covering tasks from variant effect prediction to genome-scale sequence modeling.</p>
</li>
<li>
<p><strong>MM-Sci-LLMs</strong>: Multimodal models integrating multiple scientific data types (molecule-text, protein-text, gene-cell-text, molecule-protein), such as MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/">BioT5</a>, Mol-Instructions, and BioMedGPT.</p>
</li>
</ol>
<h3 id="architecture-classification">Architecture Classification</h3>
<p>For each modality, models are categorized into three architecture types:</p>
<ul>
<li><strong>Encoder-only</strong>: Based on BERT/RoBERTa, these models learn fixed-size representations via masked language modeling. They excel at discriminative tasks like property prediction and classification.</li>
<li><strong>Decoder-only</strong>: Based on GPT, these models perform autoregressive generation. They are used for de novo molecule design, protein sequence generation, and DNA sequence generation.</li>
<li><strong>Encoder-decoder</strong>: Based on architectures like <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a> or BART, these handle sequence-to-sequence tasks such as reaction prediction, molecule captioning, and protein sequence-structure translation.</li>
</ul>
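<p>The practical difference between the first two objectives can be sketched with a toy example. This is illustrative only: the tokenization, mask positions, and <code>[MASK]</code> convention below do not follow any specific model in the survey.</p>

```python
# Toy tokenized SMILES sequence (acetic acid, CC(=O)O)
tokens = ["C", "C", "(", "=", "O", ")", "O"]

# Encoder-only (BERT-style) masked language modeling:
# hide some positions and predict them in place from full context.
masked_positions = [1, 4]
mlm_input = [t if i not in masked_positions else "[MASK]"
             for i, t in enumerate(tokens)]
mlm_targets = {i: tokens[i] for i in masked_positions}

# Decoder-only (GPT-style) autoregressive modeling:
# predict each token from everything strictly to its left.
clm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(mlm_input)    # ['C', '[MASK]', '(', '=', '[MASK]', ')', 'O']
print(mlm_targets)  # {1: 'C', 4: 'O'}
print(clm_pairs[0])  # (['C'], 'C')
```

The masked objective yields bidirectional representations suited to discriminative tasks, while the left-to-right objective directly supports generation.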
<h2 id="comprehensive-catalog-of-models-datasets-and-benchmarks">Comprehensive Catalog of Models, Datasets, and Benchmarks</h2>
<p>A central contribution of the survey is its exhaustive cataloging of resources across all five modalities. The authors compile detailed summary tables covering over 100 Sci-LLMs, their parameter counts, base architectures, training data, and capabilities.</p>
<h3 id="molecular-llms">Molecular LLMs</h3>
<p>The survey documents a rich landscape of Mol-LLMs:</p>
<p><strong>Encoder-only models</strong> for property prediction include <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, ChemBERTa, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, MolFormer, MG-BERT, GROVER, MAT, Uni-Mol, and others. These models are pre-trained on ZINC, PubChem, or ChEMBL datasets and fine-tuned for molecular property prediction tasks on benchmarks like <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</p>
<p><strong>Decoder-only models</strong> for molecular generation include MolGPT, SMILES GPT, iupacGPT, cMolGPT, and Taiga. These generate SMILES strings autoregressively, often combining GPT with reinforcement learning for property optimization.</p>
<p><strong>Encoder-decoder models</strong> for reaction prediction include Molecular Transformer, Retrosynthesis Transformer, Chemformer, <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a>, Graph2SMILES, and MOLGEN. These handle forward reaction prediction and retrosynthesis.</p>
<h3 id="key-datasets-surveyed">Key Datasets Surveyed</h3>
<p>The survey catalogs pre-training datasets and benchmarks for each modality:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Pre-training Sources</th>
          <th>Key Benchmarks</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text</td>
          <td>PubMed, PMC, arXiv, Semantic Scholar</td>
          <td>MMLU, MedQA, PubMedQA, SciEval</td>
      </tr>
      <tr>
          <td>Molecule</td>
          <td>ZINC, PubChem, ChEMBL, USPTO, <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>MoleculeNet, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>, SPECTRA</td>
      </tr>
      <tr>
          <td>Protein</td>
          <td>UniRef50/90/100, BFD, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB</a>, <a href="https://en.wikipedia.org/wiki/AlphaFold">AlphaFoldDB</a></td>
          <td><a href="https://en.wikipedia.org/wiki/CASP">CASP</a>, TAPE, ProteinGym, FLIP, PEER</td>
      </tr>
      <tr>
          <td>Genome</td>
          <td>GRCh38, 1000 Genomes, <a href="https://en.wikipedia.org/wiki/ENCODE">ENCODE</a></td>
          <td>NT-Bench, GenBench, BEACON</td>
      </tr>
      <tr>
          <td>Multimodal</td>
          <td>ChEBI-20, PubChemSTM, Mol-Instructions</td>
          <td>Various cross-modal retrieval and generation tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>For molecular generation, the survey details standard metrics:</p>
<ul>
<li><strong>Validity</strong>: percentage of chemically viable molecules</li>
<li><strong>Uniqueness</strong>: fraction of distinct generated structures</li>
<li><strong>Novelty</strong>: fraction not present in the training set</li>
<li><strong>Internal diversity</strong>: measured as</li>
</ul>
<p>$$
\text{IntDiv}_{p}(G) = 1 - \sqrt[p]{\frac{1}{|G|^{2}} \sum_{m_{1}, m_{2} \in G} T(m_{1}, m_{2})^{p}}
$$</p>
<p>where $T(m_{1}, m_{2})$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between molecules $m_{1}$ and $m_{2}$.</p>
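<p>A minimal pure-Python sketch of this metric, assuming binary fingerprints represented as sets of &ldquo;on&rdquo; bit indices (the function names are illustrative; real pipelines compute Tanimoto over RDKit fingerprints):</p>

```python
from itertools import product

def tanimoto(fp1: set, fp2: set) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    represented here as sets of 'on' bit indices."""
    if not fp1 and not fp2:
        return 1.0
    return len(fp1 & fp2) / len(fp1 | fp2)

def internal_diversity(fps: list, p: int = 1) -> float:
    """IntDiv_p over a generated set G: 1 - (mean of T(m1, m2)^p)^(1/p),
    averaging over all |G|^2 ordered pairs (self-pairs included)."""
    n = len(fps)
    mean_sim = sum(tanimoto(a, b) ** p for a, b in product(fps, fps)) / n**2
    return 1.0 - mean_sim ** (1.0 / p)

fps = [{0, 1, 2}, {1, 2, 3}, {4, 5}]
print(round(internal_diversity(fps, p=1), 3))  # 0.556
```

Note that a set of identical molecules gives IntDiv = 0, and increasingly dissimilar sets approach 1.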
<ul>
<li><strong>Frechet ChemNet Distance (FCD)</strong>: comparing distributions of generated and reference molecules</li>
</ul>
<p>$$
\text{FCD}(G, R) = \lVert \mu_{G} - \mu_{R} \rVert^{2} + \text{Tr}\left[\Sigma_{G} + \Sigma_{R} - 2(\Sigma_{G}\Sigma_{R})^{1/2}\right]
$$</p>
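<p>This Fréchet form can be computed directly from the means and covariances of embedding vectors (stand-ins here for ChemNet activations). A sketch using the identity $\text{Tr}[(\Sigma_{G}\Sigma_{R})^{1/2}] = \sum_{i} \sqrt{\lambda_{i}}$, where $\lambda_{i}$ are the eigenvalues of $\Sigma_{G}\Sigma_{R}$; the function name is illustrative:</p>

```python
import numpy as np

def frechet_distance(mu_g, sigma_g, mu_r, sigma_r):
    """Frechet distance between two Gaussians (the form used by FCD/FID):
    ||mu_G - mu_R||^2 + Tr[Sigma_G + Sigma_R - 2 (Sigma_G Sigma_R)^{1/2}]."""
    diff = mu_g - mu_r
    # Eigenvalues of a product of PSD matrices are real and non-negative;
    # clip tiny negative values caused by floating-point noise.
    eigvals = np.linalg.eigvals(sigma_g @ sigma_r)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return diff @ diff + np.trace(sigma_g) + np.trace(sigma_r) - 2 * covmean_trace

mu = np.zeros(4)
sigma = np.eye(4)
print(frechet_distance(mu, sigma, mu, sigma))      # ~0.0 for identical stats
print(frechet_distance(mu + 1.0, sigma, mu, sigma))  # 4.0: mean shift only
```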
<p>For protein generation, analogous metrics include perplexity, Frechet Protein Distance (FPD), foldability (pLDDT), sequence recovery, and novelty (sequence identity).</p>
<h2 id="critical-challenges-and-future-directions">Critical Challenges and Future Directions</h2>
<p>The survey identifies four major challenges and seven future research directions for Sci-LLMs.</p>
<h3 id="challenges">Challenges</h3>
<ol>
<li>
<p><strong>Training data limitations</strong>: Sci-LLM training datasets are orders of magnitude smaller than those for general LLMs. ProGen was trained on 280M protein sequences (tens of billions of tokens), while ChatGPT used approximately 570 billion tokens. Scaling laws suggest larger datasets would improve performance, and advances in sequencing technologies may help close this gap.</p>
</li>
<li>
<p><strong>Architecture mismatch</strong>: Standard Transformer architectures face difficulties with scientific languages. Scientific sequences (proteins with hundreds or thousands of amino acids, DNA with millions of base pairs) are far longer than typical natural language sentences. Additionally, 3D structural information is critical for function prediction but does not naturally map to sequence tokens. Autoregressive generation is also a poor fit since biological sequences function as a whole rather than being read left-to-right.</p>
</li>
<li>
<p><strong>Evaluation gaps</strong>: Computational metrics for generated molecules and proteins provide only indirect quality measures. Wet-lab validation remains the gold standard but is beyond the scope of most AI research teams. Better computational evaluation methods that correlate with experimental outcomes are needed.</p>
</li>
<li>
<p><strong>Ethics</strong>: Sensitive biological data raises privacy concerns. The potential for misuse (e.g., generating harmful substances) requires careful safeguards. Algorithmic bias and equitable access to Sci-LLM benefits also demand attention.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<ol>
<li><strong>Larger-scale, cross-modal training datasets</strong> with strong semantic alignment across modalities</li>
<li><strong>Incorporating 3D structural and temporal information</strong> into language-based modeling, including structural motifs as tokens</li>
<li><strong>Integration with external knowledge sources</strong> such as <a href="https://en.wikipedia.org/wiki/Gene_Ontology">Gene Ontology</a> and chemical knowledge graphs to reduce hallucination</li>
<li><strong>Coupling with physical simulation</strong> (e.g., <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a>) to ground language models in physical reality</li>
<li><strong>Augmenting Sci-LLMs with specialized tools and agents</strong>, following the success of tool-augmented general LLMs like <a href="/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/">ChemCrow</a></li>
<li><strong>Development of computational evaluation metrics</strong> that are both fast and accurate, enabling rapid research iteration</li>
<li><strong>Super-alignment with human ethics</strong>, ensuring ethical reasoning is deeply integrated into Sci-LLM behavior</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a survey paper that does not present new experimental results. The authors catalog extensive datasets across five modalities (see tables in the paper for comprehensive listings). The survey itself is maintained as an open resource.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HICAI-ZJU/Scientific-LLM-Survey">Scientific-LLM-Survey GitHub</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Curated list of papers, models, and resources</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (survey paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Q., Ding, K., Lyv, T., Wang, X., Yin, Q., Zhang, Y., Yu, J., Wang, Y., Li, X., Xiang, Z., Feng, K., Zhuang, X., Wang, Z., Qin, M., Zhang, M., Zhang, J., Cui, J., Huang, T., Yan, P., Xu, R., Chen, H., Li, X., Fan, X., Xing, H., &amp; Chen, H. (2025). Scientific Large Language Models: A Survey on Biological &amp; Chemical Domains. <em>ACM Computing Surveys</em>, 57(6), 1–38. <a href="https://doi.org/10.1145/3715318">https://doi.org/10.1145/3715318</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2025scientific,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scientific Large Language Models: A Survey on Biological \&amp; Chemical Domains}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Qiang and Ding, Keyan and Lyv, Tianwen and Wang, Xinda and Yin, Qingyu and Zhang, Yiwen and Yu, Jing and Wang, Yuhao and Li, Xiaotong and Xiang, Zhuoyi and Feng, Kehua and Zhuang, Xiang and Wang, Zeyuan and Qin, Ming and Zhang, Mengyao and Zhang, Jinlu and Cui, Jiyu and Huang, Tao and Yan, Pengju and Xu, Renjun and Chen, Hongyang and Li, Xiaolin and Fan, Xiaohui and Xing, Huabin and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACM Computing Surveys}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{57}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3715318}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review: Deep Learning for Molecular Design (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/</guid><description>A 2019 review surveying deep generative models for molecular design, covering RNNs, VAEs, GANs, and RL approaches with SMILES and graph representations.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-deep-generative-models-for-molecular-design">A Systematization of Deep Generative Models for Molecular Design</h2>
<p>This is a <strong>Systematization</strong> paper that organizes and compares the rapidly growing literature on deep generative modeling for molecules. Published in 2019, it catalogs 45 papers from the preceding two years, classifying them by architecture (RNNs, VAEs, GANs, reinforcement learning) and molecular representation (SMILES strings, context-free grammars, graph tensors, 3D voxels). The review provides mathematical foundations for each technique, identifies cross-cutting themes, and proposes a framework for reward function design that addresses diversity, novelty, stability, and synthesizability.</p>
<h2 id="the-challenge-of-navigating-vast-chemical-space">The Challenge of Navigating Vast Chemical Space</h2>
<p>The space of potential drug-like molecules has been estimated to contain between $10^{23}$ and $10^{60}$ compounds, while only about $10^{8}$ have ever been synthesized. Traditional approaches to molecular design rely on combinatorial methods, mixing known scaffolds and functional groups, but these generate many unstable or unsynthesizable candidates. High-throughput screening (HTS) and virtual screening (HTVS) help but remain computationally expensive. The average cost to bring a new drug to market exceeds one billion USD, with a 13-year average timeline from discovery to market.</p>
<p>By 2016, <a href="/notes/machine-learning/generative-models/">deep generative models</a> had shown strong results in producing original images, music, and text. The &ldquo;molecular autoencoder&rdquo; of <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016/2018)</a> first applied these techniques to molecular generation, triggering an explosion of follow-up work. By the time of this review, the landscape had grown complex enough, with many architectures, representation schemes, and no agreed-upon benchmarking standards, to warrant systematic organization.</p>
<h2 id="molecular-representations-and-architecture-taxonomy">Molecular Representations and Architecture Taxonomy</h2>
<p>The review&rsquo;s core organizational contribution is a two-axis taxonomy: molecular representations on one axis and deep learning architectures on the other.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The review categorizes representations into 3D and 2D graph-based schemes:</p>
<p><strong>3D representations</strong> include raw voxels (placing nuclear charges on a grid), smoothed voxels (Gaussian blurring around nuclei), and tensor field networks. These capture full geometric information but suffer from high dimensionality, sparsity, and difficulty encoding rotation/translation invariance.</p>
<p><strong>2D graph representations</strong> include:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings</strong>: The dominant representation, encoding molecular graphs as ASCII character sequences via depth-first traversal. Non-unique (each molecule with $N$ heavy atoms has at least $N$ SMILES representations), but invertible and widely supported.</li>
<li><strong>Canonical SMILES</strong>: Unique but potentially encode grammar rules rather than chemical structure.</li>
<li><strong>Context-free grammars (CFGs)</strong>: Decompose SMILES into grammar rules to improve validity rates, though not to 100%.</li>
<li><strong>Tensor representations</strong>: Store atom types in a vertex feature matrix $X \in \mathbb{R}^{N \times |\mathcal{A}|}$ and bond types in an adjacency tensor $A \in \mathbb{R}^{N \times N \times Y}$.</li>
<li><strong>Graph operations</strong>: Directly build molecular graphs by adding atoms and bonds, guaranteeing 100% chemical validity.</li>
</ul>
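<p>The tensor representation can be made concrete with a toy example. The sketch below encodes CO<sub>2</sub> (O=C=O, heavy atoms only) with a deliberately tiny atom alphabet and bond alphabet; real models use larger alphabets and fixed maximum sizes with padding:</p>

```python
import numpy as np

atom_types = ["C", "O"]                      # atom alphabet A
bond_types = ["single", "double", "triple"]  # bond alphabet Y
atoms = ["O", "C", "O"]
bonds = [(0, 1, "double"), (1, 2, "double")]

N = len(atoms)
X = np.zeros((N, len(atom_types)))     # vertex feature matrix, N x |A|
A = np.zeros((N, N, len(bond_types)))  # adjacency tensor, N x N x Y
for i, a in enumerate(atoms):
    X[i, atom_types.index(a)] = 1.0    # one-hot atom type per row
for i, j, b in bonds:
    k = bond_types.index(b)
    A[i, j, k] = A[j, i, k] = 1.0      # undirected graph: symmetric slices

print(X.shape, A.shape)  # (3, 2) (3, 3, 3)
```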
<h3 id="deep-learning-architectures">Deep Learning Architectures</h3>
<p><strong>Recurrent Neural Networks (RNNs)</strong> generate SMILES strings character by character, typically using LSTM or GRU units. Training uses maximum likelihood estimation (MLE) with teacher forcing:</p>
<p>$$
L^{\text{MLE}} = -\sum_{s \in \mathcal{X}} \sum_{t=2}^{T} \log \pi_{\theta}(s_{t} \mid s_{1:t-1})
$$</p>
<p>Thermal rescaling of the output distribution controls the diversity-validity tradeoff via a temperature parameter $T$. RNNs achieved SMILES validity rates of 94-98%.</p>
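<p>Thermal rescaling amounts to dividing the output logits by $T$ before the softmax. A minimal sketch (illustrative logits, not from any trained model):</p>

```python
import numpy as np

def thermal_rescale(logits, T):
    """Softmax of logits / T: T < 1 sharpens the distribution (more
    conservative sampling, higher validity), T > 1 flattens it (more
    diverse but more error-prone sampling)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    print(T, np.round(thermal_rescale(logits, T), 3))
```

As $T \to 0$ sampling approaches greedy argmax decoding; as $T \to \infty$ it approaches uniform sampling over the character vocabulary.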
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$
\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}[q_{\phi}(z|x) \,\|\, p(z)]
$$</p>
<p>The first term encourages accurate reconstruction while the KL divergence term regularizes the latent distribution toward a standard Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$. Variants include <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">grammar VAEs</a> (GVAEs), syntax-directed VAEs, junction tree VAEs, and adversarial autoencoders (AAEs) that replace the KL term with adversarial training.</p>
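<p>For the common diagonal-Gaussian encoder, the KL regularizer has a closed form, $\tfrac{1}{2}\sum_{i}(\sigma_{i}^{2} + \mu_{i}^{2} - 1 - \log \sigma_{i}^{2})$. A sketch (function name illustrative):</p>

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, log_var):
    """Closed-form KL[ N(mu, diag(exp(log_var))) || N(0, I) ], the
    regularization term of the ELBO for a diagonal-Gaussian encoder:
    0.5 * sum(exp(log_var) + mu^2 - 1 - log_var)."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

print(kl_diag_gaussian_to_standard([0.0, 0.0], [0.0, 0.0]))  # 0.0: q equals p
print(kl_diag_gaussian_to_standard([1.0, -1.0], [0.0, 0.0]))  # 1.0: mean shift
```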
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> train a generator against a discriminator using the minimax objective:</p>
<p>$$
\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{d}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]
$$</p>
<p>The review shows that with an optimal discriminator, the generator objective reduces to minimizing the Jensen-Shannon divergence, which captures both forward and reverse KL divergence terms. This provides a more &ldquo;balanced&rdquo; training signal than MLE alone. The Wasserstein GAN (WGAN) uses the Earth mover&rsquo;s distance for more stable training:</p>
<p>$$
W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma} \lVert x - y \rVert
$$</p>
<p><strong>Reinforcement Learning</strong> recasts molecular generation as a sequential decision problem. The policy gradient (REINFORCE) update is:</p>
<p>$$
\nabla J(\theta) = \mathbb{E}\left[G_{t} \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid y_{1:t-1})}{\pi_{\theta}(a_{t} \mid y_{1:t-1})}\right]
$$</p>
<p>To prevent RL fine-tuning from causing the generator to &ldquo;drift&rdquo; away from viable chemical structures, an augmented reward function incorporates the prior likelihood:</p>
<p>$$
R'(S) = [\sigma R(S) + \log P_{\text{prior}}(S) - \log P_{\text{current}}(S)]^{2}
$$</p>
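<p>The augmented objective can be sketched in a few lines. The $\sigma$ value below is an arbitrary illustration (REINVENT-style implementations treat it as a tunable hyperparameter), and the log-likelihoods are made-up numbers:</p>

```python
def augmented_likelihood_loss(log_p_prior, log_p_current, reward, sigma=10.0):
    """Squared-error loss [sigma * R(S) + log P_prior(S) - log P_current(S)]^2:
    pulls the fine-tuned (current) likelihood toward the prior likelihood
    augmented by the scaled reward, preventing the RL-tuned generator from
    drifting away from chemically viable structures."""
    return (sigma * reward + log_p_prior - log_p_current) ** 2

# Current likelihood exactly matches the augmented target -> zero loss
print(augmented_likelihood_loss(-30.0, -22.0, reward=0.8, sigma=10.0))  # 0.0
# Current likelihood unchanged from prior -> loss equals (sigma * R)^2
print(augmented_likelihood_loss(-30.0, -30.0, reward=0.8, sigma=10.0))  # 64.0
```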
<h2 id="cataloging-45-models-and-their-design-choices">Cataloging 45 Models and Their Design Choices</h2>
<p>Rather than running new experiments, the review&rsquo;s methodology involves systematically cataloging and comparing 45 published models. Table 2 in the paper lists each model&rsquo;s architecture, representation, training dataset, and dataset size. Key patterns include:</p>
<ul>
<li><strong>RNN-based models</strong> (16 entries): Almost exclusively use SMILES, trained on ZINC or ChEMBL datasets with 0.1M-1.7M molecules.</li>
<li><strong>VAE variants</strong> (20 entries): The most diverse category, spanning SMILES VAEs, grammar VAEs, junction tree VAEs, graph-based VAEs, and 3D VAEs. Training sets range from 10K to 72M molecules.</li>
<li><strong>GAN models</strong> (7 entries): Include <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, RANC, ATNC, MolGAN, and CycleGAN approaches. Notably, GANs appear to work with fewer training samples.</li>
<li><strong>Other approaches</strong> (2 entries): Pure RL methods from Zhou et al. and Stahl et al. that do not require pretraining on a dataset.</li>
</ul>
<p>The review also catalogs 13 publicly available datasets (Table 3), ranging from <a href="/notes/chemistry/datasets/qm9/">QM9</a> (133K molecules with quantum chemical properties) to <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> (977M combinatorially generated molecules) and ZINC15 (750M+ commercially available compounds).</p>
<h3 id="metrics-and-reward-function-design">Metrics and Reward Function Design</h3>
<p>A significant contribution is the systematic treatment of reward functions. The review argues that generated molecules should satisfy six desiderata: diversity, novelty, stability, synthesizability, non-triviality, and good properties. Key metrics formalized include:</p>
<p><strong>Diversity</strong> using Tanimoto similarity over fingerprints:</p>
<p>$$
r_{\text{diversity}} = 1 - \frac{1}{|\mathcal{G}|^{2}} \sum_{(x_{1}, x_{2}) \in \mathcal{G} \times \mathcal{G}} D(x_{1}, x_{2})
$$</p>
<p><strong>Novelty</strong> measured as the fraction of generated molecules not appearing in a hold-out test set:</p>
<p>$$
r_{\text{novel}} = 1 - \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{G}|}
$$</p>
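<p>Validity, uniqueness, and novelty are simple set computations once molecules are canonicalized. A sketch with a toy validity predicate (real evaluations parse candidates with a cheminformatics toolkit such as RDKit; the function name and inputs are illustrative):</p>

```python
def generation_metrics(generated, reference, is_valid):
    """Validity, uniqueness, and novelty over a list of generated
    (canonicalized) molecule strings. `is_valid` is a caller-supplied
    predicate; `reference` is the training/hold-out set."""
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(reference)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

gen = ["CCO", "CCO", "CCC", "C(("]   # one duplicate, one malformed string
ref = {"CCC"}
m = generation_metrics(gen, ref, is_valid=lambda s: "((" not in s)
print(m)  # validity 0.75, uniqueness 2/3, novelty 0.5
```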
<p><strong>Synthesizability</strong> primarily assessed via the SA score, sometimes augmented with ring penalties and medicinal chemistry filters.</p>
<p>The review also discusses the <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> as an analog of FID for molecular generation, and notes the emergence of standardized benchmarking platforms including <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, and DiversityNet.</p>
<h2 id="key-findings-and-future-directions">Key Findings and Future Directions</h2>
<p>The review identifies several major trends and conclusions:</p>
<p><strong>Shift from SMILES to graph-based representations.</strong> SMILES-based methods struggle with validity (the molecular autoencoder VAE achieved only 0.7-75% valid SMILES depending on sampling strategy). Methods that work directly on molecular graphs with chemistry-preserving operations achieve 100% validity, and the review predicts this trend will continue.</p>
<p><strong>Advantages of adversarial and RL training over MLE.</strong> The mathematical analysis shows that MLE only optimizes forward KL divergence, which can lead to models that place probability mass where the data distribution is zero. GAN training optimizes the Jensen-Shannon divergence, which balances forward and reverse KL terms. RL approaches, particularly pure RL without pretraining, showed competitive performance with much less training data.</p>
<p><strong>Genetic algorithms remain competitive.</strong> The review notes that the latest genetic algorithm approaches (Grammatical Evolution) could match deep learning methods for molecular optimization under some metrics, and at 100x lower computational cost in some comparisons. This serves as an important baseline calibration.</p>
<p><strong>Reward function design is underappreciated.</strong> Early models generated unstable molecules with labile groups (enamines, hemiaminals, enol ethers). Better reward functions that incorporate synthesizability, diversity, and stability constraints significantly improved practical utility.</p>
<p><strong>Need for standardized benchmarks.</strong> The review identifies a lack of agreement on evaluation methodology as a major barrier to progress, noting that published comparisons are often subtly biased toward novel methods.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from early 2019, the work predates several important developments: transformer-based architectures (which would soon dominate), SELFIES representations, diffusion models for molecules, and large-scale pretrained chemical language models. The review focuses primarily on drug-like small molecules and does not deeply cover protein design or materials optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper that does not present new experimental results. The paper catalogs 13 publicly available datasets used across the reviewed works:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>977M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC15</td>
          <td>750M+</td>
          <td>Commercially available compounds</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>50M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>2M</td>
          <td>Curated bioactive molecules</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>QM9</td>
          <td>133,885</td>
          <td>Small organic molecules with DFT properties</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>PubChemQC</td>
          <td>3.98M</td>
          <td>PubChem compounds with DFT data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review provides mathematical derivations for MLE training (Eq. 1), VAE ELBO (Eqs. 9-13), AAE objectives (Eqs. 15-16), GAN objectives (Eqs. 19-22), WGAN (Eq. 24), REINFORCE gradient (Eq. 7), and numerous reward function formulations (Eqs. 26-36).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Key evaluation frameworks discussed:</p>
<ul>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (molecular analog of FID)</li>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> benchmarking platform</li>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmarking suite</li>
<li>Validity rate, uniqueness, novelty, and internal diversity metrics</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Elton, D. C., Boukouvalas, Z., Fuge, M. D., &amp; Chung, P. W. (2019). Deep Learning for Molecular Design: A Review of the State of the Art. <em>Molecular Systems Design &amp; Engineering</em>, 4(4), 828-849. <a href="https://doi.org/10.1039/C9ME00039A">https://doi.org/10.1039/C9ME00039A</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{elton2019deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep Learning for Molecular Design -- A Review of the State of the Art}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Elton, Daniel C. and Boukouvalas, Zois and Fuge, Mark D. and Chung, Peter W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Molecular Systems Design \&amp; Engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{828--849}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C9ME00039A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Inverse Molecular Design with ML Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/</guid><description>Review of inverse molecular design approaches including VAEs, GANs, and RL for navigating chemical space and generating novel molecules with desired properties.</description><content:encoded><![CDATA[<h2 id="a-foundational-systematization-of-inverse-molecular-design">A Foundational Systematization of Inverse Molecular Design</h2>
<p>This paper is a <strong>Systematization</strong> of the nascent field of inverse molecular design using machine learning generative models. Published in <em>Science</em> in 2018, it organizes and contextualizes the rapidly emerging body of work on using deep generative models (variational autoencoders, generative adversarial networks, and reinforcement learning) to navigate chemical space and propose novel molecules with targeted properties. Rather than introducing a new method, the paper synthesizes the conceptual framework connecting molecular representations, generative architectures, and inverse design objectives, establishing a reference point for the field at a critical early stage.</p>
<h2 id="the-challenge-of-navigating-chemical-space">The Challenge of Navigating Chemical Space</h2>
<p>The core problem is the sheer scale of chemical space. For pharmacologically relevant small molecules alone, the number of possible structures is estimated at $10^{60}$. Traditional approaches to materials discovery rely on trial and error or high-throughput virtual screening (HTVS), both of which are fundamentally limited by the need to enumerate and evaluate candidates from a predefined library.</p>
<p>The conventional materials discovery pipeline, from concept to commercial product, historically takes 15 to 20 years, involving iterative cycles of simulation, synthesis, device integration, and characterization. Inverse design offers a conceptual alternative: start from a desired functionality and search for molecular structures that satisfy it. This inverts the standard paradigm where a molecule is proposed first and its properties are computed or measured afterward.</p>
<p>The key distinction the authors draw is between discriminative and generative models. A discriminative model learns $p(y|x)$, the conditional probability of properties $y$ given a molecule $x$. A <a href="/notes/machine-learning/generative-models/">generative model</a> instead learns the joint distribution $p(x,y)$, which can be conditioned to yield either the direct design problem $p(y|x)$ or the inverse design problem $p(x|y)$.</p>
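<p>As a toy numerical illustration of this distinction (the molecules, labels, and probabilities below are invented, not from the paper), a single joint table $p(x,y)$ supports both conditioning directions:</p>

```python
# Toy joint distribution p(x, y) over two molecules and two property labels.
# All numbers are illustrative.
p_xy = {
    ("mol_A", "soluble"): 0.30,
    ("mol_A", "insoluble"): 0.10,
    ("mol_B", "soluble"): 0.15,
    ("mol_B", "insoluble"): 0.45,
}

def condition_on_x(p_xy, x):
    """Direct design direction: p(y | x) = p(x, y) / p(x)."""
    px = sum(v for (xi, _), v in p_xy.items() if xi == x)
    return {y: v / px for (xi, y), v in p_xy.items() if xi == x}

def condition_on_y(p_xy, y):
    """Inverse design direction: p(x | y) = p(x, y) / p(y)."""
    py = sum(v for (_, yi), v in p_xy.items() if yi == y)
    return {x: v / py for (x, yi), v in p_xy.items() if yi == y}

print(condition_on_x(p_xy, "mol_A"))    # properties given a molecule
print(condition_on_y(p_xy, "soluble"))  # molecules given a property
```

<p>Deep generative models play the same role at scale, where the joint over molecules and properties cannot be tabulated and must instead be learned.</p>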
<h2 id="three-pillars-vaes-gans-and-reinforcement-learning">Three Pillars: VAEs, GANs, and Reinforcement Learning</h2>
<p>The review organizes inverse molecular design approaches around three generative paradigms and the molecular representations they operate on.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The paper surveys representations across three broad categories:</p>
<ul>
<li><strong>Discrete (text-based)</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings encode molecular structure as 1D text governed by a formal grammar. Their adoption has been driven by the availability of deep learning tools from NLP.</li>
<li><strong>Continuous (vectors/tensors)</strong>: <a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrices</a>, bag of bonds, fingerprints, symmetry functions, and electronic density representations. These expose different physical symmetries (permutational, rotational, reflectional, translational invariance).</li>
<li><strong>Weighted graphs</strong>: Molecules as undirected graphs where atoms are nodes and bonds are edges, with vectorized features on edges and nodes (bonding type, aromaticity, charge, distance).</li>
</ul>
<p>An ideal representation for inverse design should be invertible, meaning it supports mapping back to a synthesizable molecular structure. SMILES strings and molecular graphs are invertible, while many continuous representations require lookup tables or auxiliary methods.</p>
<h3 id="variational-autoencoders-vaes">Variational Autoencoders (VAEs)</h3>
<p><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAEs</a> encode molecules into a continuous latent space and decode latent vectors back to molecular representations. The key insight is that by constraining the encoder to produce latent vectors following a Gaussian distribution, the model gains the ability to <a href="/posts/modern-variational-autoencoder-in-pytorch/">interpolate between molecules and sample novel structures</a>. The latent space encodes a geometry: nearby points decode to similar molecules, and gradient-based optimization over this continuous space enables direct property optimization.</p>
<p>The VAE training objective, the evidence lower bound (maximized during training), combines a reconstruction term with a KL divergence regularizer:</p>
<p>$$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z))$$</p>
<p>where $q(z|x)$ is the encoder (approximate posterior), $p(x|z)$ is the decoder, and $p(z)$ is the prior (typically Gaussian).</p>
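<p>The two terms can be sketched in a few lines of NumPy. This is a minimal illustration assuming a Gaussian encoder and a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to constants); these modeling choices are ours, not prescribed by the review:</p>

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo(x, x_recon, mu, log_var):
    """One-sample evidence lower bound with a unit-variance Gaussian decoder,
    so the reconstruction term is a negative squared error (up to constants)."""
    recon = -0.5 * np.sum((x - x_recon) ** 2)
    return recon - kl_to_standard_normal(mu, log_var)

# A perfect reconstruction with q(z|x) equal to the prior attains the maximum, 0.
x = np.array([0.5, -1.0])
print(elbo(x, x, np.zeros(3), np.zeros(3)))
```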
<p>Semi-supervised variants jointly train on molecules and properties, reorganizing latent space so molecules with similar properties cluster together. <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> demonstrated local and global optimization across generated distributions using Bayesian optimization over latent space.</p>
<p>The review traces the evolution from character-level SMILES VAEs to <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">grammar-aware and syntax-directed variants</a> that improve the generation of syntactically valid structures.</p>
<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p><a href="/posts/what-is-a-gan/">GANs</a> pit a generator against a discriminator in an adversarial training framework. The generator learns to produce synthetic molecules from noise, while the discriminator learns to distinguish synthetic from real molecules. Training convergence for GANs is challenging, suffering from mode collapse and generator-discriminator imbalance.</p>
<p>For molecular applications, the discreteness of SMILES data makes the generator&rsquo;s outputs non-differentiable, a problem addressed through workarounds such as SeqGAN&rsquo;s policy gradient approach and boundary-seeking GANs.</p>
<h3 id="reinforcement-learning-rl">Reinforcement Learning (RL)</h3>
<p>RL treats molecule generation as a sequential decision process where an agent (the generator) takes actions (adding characters to a SMILES string) to maximize a reward (desired molecular properties). Since rewards can only be assigned after sequence completion, Monte Carlo Tree Search (MCTS) is used to simulate possible completions and weight paths based on their success.</p>
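<p>The rollout idea can be sketched with a toy vocabulary and reward (both invented for illustration; a real system would score molecular properties of completed SMILES strings, and full MCTS additionally maintains tree statistics on top of these plain rollouts):</p>

```python
import random

VOCAB = ["C", "O", "N"]
MAX_LEN = 6

def reward(seq):
    """Toy terminal reward: fraction of carbons in the finished string.
    A real system would score molecular properties instead."""
    return seq.count("C") / len(seq)

def rollout_value(prefix, n_rollouts=200, rng=random.Random(0)):
    """Estimate the value of a partial sequence by averaging the rewards
    of random completions (Monte Carlo rollouts)."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = prefix
        while len(seq) < MAX_LEN:
            seq += rng.choice(VOCAB)
        total += reward(seq)
    return total / n_rollouts

# Score each candidate first character by the estimated value of its rollouts.
values = {c: rollout_value(c) for c in VOCAB}
best = max(values, key=values.get)
print(values, "-> pick", best)
```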
<p>Applications include generation of drug-like molecules and <a href="https://en.wikipedia.org/wiki/Retrosynthesis">retrosynthesis</a> planning. Notable examples cited include RL for optimizing putative <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> inhibitors and molecules active against <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2</a>.</p>
<h3 id="hybrid-approaches">Hybrid Approaches</h3>
<p>The review highlights that these paradigms are not exclusive. Examples include druGAN (adversarial autoencoder) and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a> (combined GAN and RL), which leverage strengths of multiple frameworks.</p>
<h2 id="survey-of-applications-and-design-paradigms">Survey of Applications and Design Paradigms</h2>
<p>Being a review paper, this work does not present new experiments but surveys existing applications across domains:</p>
<p><strong>Drug Discovery</strong>: Most generative model applications at the time of writing targeted pharmaceutical properties, including solubility, melting temperature, synthesizability, and target activity. Popova et al. optimized for JAK2 inhibitors, and Olivecrona et al. targeted dopamine receptor type 2.</p>
<p><strong>Materials Science</strong>: HTVS had been applied to organic photovoltaics (screening by frontier orbital energies and conversion efficiency), organic redox flow batteries (redox potential and solubility), organic LEDs (singlet-triplet gap), and inorganic materials via the Materials Project.</p>
<p><strong>Chemical Space Exploration</strong>: Evolution strategies had been applied to map chemical space, with structured search procedures incorporating genotype representations and mutation operations. Bayesian sampling with sequential Monte Carlo and gradient-based optimization of properties with respect to molecular systems represented alternative inverse design strategies.</p>
<p><strong>Graph-Based Generation</strong>: The paper notes the emerging extension of VAEs to molecular graphs (junction tree VAE) and message passing networks for incremental graph construction, though the graph isomorphism approximation problem remained a practical challenge.</p>
<h2 id="future-directions-and-open-challenges">Future Directions and Open Challenges</h2>
<p>The authors identify several open directions for the field:</p>
<p><strong>Closed-Loop Discovery</strong>: The ultimate goal is to concurrently propose, create, and characterize new materials with simultaneous data flow between components. At the time of writing, very few examples of successful closed-loop approaches existed.</p>
<p><strong>Active Learning</strong>: Combining inverse design with Bayesian optimization enables models that adapt as they explore chemical space, expanding in regions of high uncertainty and discovering molecular regions with desirable properties as a function of composition.</p>
<p><strong>Representation Learning</strong>: No single molecular representation works optimally for all properties. Graph and hierarchical representations were identified as areas needing further study. Representations that encode relevant physics tend to generalize better.</p>
<p><strong>Improved Architectures</strong>: Memory-augmented sequence generation models, Riemannian optimization methods exploiting latent space geometry, multi-level VAEs for structured latent spaces, and inverse RL for learning reward functions were highlighted as promising research directions.</p>
<p><strong>Integration into Education</strong>: The authors advocate for integrating ML into curricula across chemical, biochemical, medicinal, and materials sciences.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from 2018, this work captures the field at an early stage. Several limitations are worth noting:</p>
<ul>
<li>The survey is dominated by SMILES-based approaches, reflecting the state of the field at the time. Graph-based and 3D-aware generative models were just emerging.</li>
<li>Quantitative benchmarking of generative models was not yet standardized. The review does not provide systematic comparisons across methods.</li>
<li>The synthesis feasibility of generated molecules receives limited attention. The gap between computationally generated candidates and experimentally realizable molecules was (and remains) a significant challenge.</li>
<li>Transformer-based architectures, which would come to dominate chemical language modeling, are not discussed, as the Transformer had only been published the year prior.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a review/perspective paper, this work does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the cited primary works rather than the review itself.</p>
<h3 id="key-cited-methods-and-their-resources">Key Cited Methods and Their Resources</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Authors</th>
          <th>Type</th>
          <th>Availability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Automatic Chemical Design (VAE)</a></td>
          <td>Gomez-Bombarelli et al.</td>
          <td>Code + Data</td>
          <td>Published in ACS Central Science</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></td>
          <td>Kusner et al.</td>
          <td>Code</td>
          <td>arXiv:1703.01925</td>
      </tr>
      <tr>
          <td>Junction Tree VAE</td>
          <td>Jin et al.</td>
          <td>Code</td>
          <td>arXiv:1802.04364</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a></td>
          <td>Sanchez-Lengeling et al.</td>
          <td>Code</td>
          <td>ChemRxiv preprint</td>
      </tr>
      <tr>
          <td>SeqGAN</td>
          <td>Yu et al.</td>
          <td>Code</td>
          <td>AAAI 2017</td>
      </tr>
      <tr>
          <td>Neural Message Passing</td>
          <td>Gilmer et al.</td>
          <td>Code</td>
          <td>arXiv:1704.01212</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sánchez-Lengeling, B., &amp; Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning: Generative models for matter engineering. <em>Science</em>, 361(6400), 360-365. <a href="https://doi.org/10.1126/science.aat2663">https://doi.org/10.1126/science.aat2663</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sanchez-lengeling2018inverse,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inverse molecular design using machine learning: Generative models for matter engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{S{\&#39;a}nchez-Lengeling, Benjamin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{361}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{360--365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1126/science.aat2663}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Generative AI Survey for De Novo Molecule and Protein Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/generative-ai-drug-design-survey/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/generative-ai-drug-design-survey/</guid><description>Comprehensive survey of generative AI for de novo drug design covering molecule and protein generation with VAEs, GANs, diffusion, and flow models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-generative-ai-for-drug-design">A Systematization of Generative AI for Drug Design</h2>
<p>This is a <strong>Systematization</strong> paper that provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.</p>
<p>The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.</p>
<h2 id="the-challenge-of-navigating-de-novo-drug-design">The Challenge of Navigating De Novo Drug Design</h2>
<p>The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (such as monoclonal antibodies). Traditional discovery methods are slow and expensive, with preclinical trials costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.</p>
<p>AI-driven generative methods have gained traction in recent years, with over 150 AI-focused biotech companies initiating small-molecule drugs in the discovery phase and 15 in clinical trials. AI-fueled drug design activity has expanded by almost 40% each year.</p>
<p>The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.</p>
<h2 id="unified-taxonomy-two-themes-seven-subtasks">Unified Taxonomy: Two Themes, Seven Subtasks</h2>
<p>The survey&rsquo;s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.</p>
<h3 id="generative-model-architectures">Generative Model Architectures</h3>
<p>The survey covers four main generative model families used across both molecule and protein generation:</p>
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$</p>
<p>where the KL loss is:</p>
<p>$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log(\sigma_k^{(i)2}) - \mu_k^{(i)2} - \sigma_k^{(i)2}\right)$$</p>
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:</p>
<p>$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$</p>
<p><strong>Flow-Based Models</strong> generate data by applying an invertible function $f: z_0 \mapsto x$ to transform a simple latent distribution (Gaussian) to the target distribution. The log-likelihood is computed using the change-of-variable formula:</p>
<p>$$\log p(x) = \log p_0(z) - \log \left| \det \frac{\partial f}{\partial z} \right|, \qquad z = f^{-1}(x)$$</p>
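<p>A minimal sketch of this change-of-variable computation for an elementwise affine flow (the flow and its parameters are invented for illustration; the diagonal Jacobian makes the log-determinant a simple sum):</p>

```python
import numpy as np

def affine_flow_logp(x, scale, shift):
    """Log-density of x under the flow f(z) = scale * z + shift applied to a
    standard normal base. The Jacobian is diagonal, so its log-determinant
    is sum(log|scale|), and p(x) = p0(z) / |det df/dz| with z = f^{-1}(x)."""
    z = (x - shift) / scale
    log_p0 = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))
    log_det = np.sum(np.log(np.abs(scale)))
    return log_p0 - log_det

# Illustrative parameters: this flow yields independent N(shift, scale^2) marginals.
x = np.array([1.0, -0.5])
scale = np.array([2.0, 0.5])
shift = np.array([0.0, 1.0])
print(affine_flow_logp(x, scale, shift))
```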
<p><strong>Diffusion Models</strong> gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:</p>
<p>$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$</p>
<p>The training loss minimizes the difference between the true noise and the predicted noise:</p>
<p>$$L_t = \mathbb{E}_{t \sim [1,T], x_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right]$$</p>
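<p>The forward process and noise-prediction objective can be sketched in a few lines of NumPy (the noise schedule and point-cloud &ldquo;molecule&rdquo; here are invented for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_t, beta_t, rng):
    """One forward (noising) step:
    x_{t+1} = sqrt(1 - beta_t) * x_t + sqrt(beta_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x_t.shape)
    return np.sqrt(1.0 - beta_t) * x_t + np.sqrt(beta_t) * eps, eps

def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between true and predicted noise (the L_t objective)."""
    return np.mean((eps_true - eps_pred) ** 2)

# Repeated noising drives a toy "molecule" (a 5-atom point cloud in 3D)
# toward an isotropic Gaussian.
x = np.ones((5, 3))
for beta in np.linspace(1e-4, 0.5, 50):
    x, eps = forward_step(x, beta, rng)

# A perfect noise predictor would achieve zero loss on the last step.
print(noise_prediction_loss(eps, np.zeros_like(eps)))
```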
<p>Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with these generative methods to handle 2D/3D molecular and protein inputs. Diffusion and flow-based models are often paired with GNNs for processing 2D/3D-based input, while VAEs and GANs are typically used for 1D input.</p>
<h2 id="small-molecule-generation-tasks-datasets-and-models">Small Molecule Generation: Tasks, Datasets, and Models</h2>
<h3 id="target-agnostic-molecule-design">Target-Agnostic Molecule Design</h3>
<p>The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).</p>
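<p>Uniqueness and novelty reduce to set arithmetic over generated strings; the validity check below is a placeholder for illustration (in practice a cheminformatics toolkit such as RDKit attempts to parse each candidate):</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity over all samples, uniqueness among valid samples, and
    novelty (valid, unique strings absent from the training set)."""
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

# Placeholder validity check and tiny sample lists, for illustration only.
gen = ["CCO", "CCO", "CCN", "??", "CCC"]
train = {"CCC"}
print(generation_metrics(gen, train, lambda s: "?" not in s))
```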
<p><strong>Datasets</strong>: <a href="/notes/chemistry/datasets/qm9/">QM9</a> (small stable molecules from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>) and <a href="/notes/chemistry/datasets/geom/">GEOM</a>-Drug (more complex, drug-like molecules).</p>
<p>The field has shifted from SMILES-based VAEs (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a>, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>EGNN, Diffusion</td>
          <td>99.8</td>
          <td>97.5</td>
          <td>97.9</td>
          <td>97.6</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>EGNN, VAE, Diffusion</td>
          <td>99.2</td>
          <td>89.6</td>
          <td>98.6</td>
          <td>94.6</td>
      </tr>
      <tr>
          <td>JODO</td>
          <td>EGNN, Diffusion</td>
          <td>99.2</td>
          <td>93.4</td>
          <td>99.0</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>VAE, Diffusion</td>
          <td>98.9</td>
          <td>89.4</td>
          <td>93.8</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>EGNN, Diffusion</td>
          <td>98.7</td>
          <td>82.0</td>
          <td>91.9</td>
          <td>90.7</td>
      </tr>
  </tbody>
</table>
<p>EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a &ldquo;relaxed&rdquo; EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.</p>
<p>On the larger GEOM-Drugs dataset, performance drops for most models:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>99.8</td>
          <td>91.6</td>
          <td>77.8</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>&ndash;</td>
          <td>62.2</td>
          <td>99.5</td>
          <td>99.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>84.4</td>
          <td>&ndash;</td>
          <td>99.3</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>81.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<p>MiDi distinguishes itself by generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on the more complex GEOM-Drugs molecules.</p>
<h3 id="target-aware-molecule-design">Target-Aware Molecule Design</h3>
<p>Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.</p>
<p><strong>Datasets</strong>: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.</p>
<p><strong>Metrics</strong>: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).</p>
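<p>Diversity as one minus the mean pairwise Tanimoto similarity can be computed directly on bit-set fingerprints (the fingerprints below are stand-ins; real pipelines typically use hashed circular fingerprints from a cheminformatics toolkit):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two bit-set fingerprints."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity over a molecule set."""
    pairs = [(a, b) for i, a in enumerate(fingerprints)
             for b in fingerprints[i + 1:]]
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Stand-in fingerprints: sets of "on" bit indices.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(diversity(fps))
```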
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Vina</th>
          <th>Affinity (%)</th>
          <th>QED</th>
          <th>SA</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffSBDD</td>
          <td>EGNN, Diffusion</td>
          <td>-7.333</td>
          <td>&ndash;</td>
          <td>0.467</td>
          <td>0.554</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Luo et al.</td>
          <td>SchNet</td>
          <td>-6.344</td>
          <td>29.09</td>
          <td>0.525</td>
          <td>0.657</td>
          <td>0.720</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>EGNN, Diffusion</td>
          <td>-6.3</td>
          <td>58.1</td>
          <td>0.48</td>
          <td>0.58</td>
          <td>0.72</td>
      </tr>
      <tr>
          <td>LiGAN</td>
          <td>CNN, VAE</td>
          <td>-6.144</td>
          <td>21.1</td>
          <td>0.39</td>
          <td>0.59</td>
          <td>0.66</td>
      </tr>
      <tr>
          <td>Pocket2Mol</td>
          <td>EGNN, MLP</td>
          <td>-5.14</td>
          <td>48.4</td>
          <td>0.56</td>
          <td>0.74</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).</p>
<h3 id="molecular-conformation-generation">Molecular Conformation Generation</h3>
<p>Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations &ldquo;covered&rdquo; within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).</p>
<p><strong>Datasets</strong>: GEOM-QM9, GEOM-Drugs, ISO17.</p>
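<p>Given a matrix of RMSDs between generated and reference conformers, COV and MAT are simple reductions. A sketch of the recall-style variant (exact conventions and thresholds vary across papers):</p>

```python
import numpy as np

def cov_mat(rmsd, threshold=1.25):
    """COV and MAT from an (n_generated, n_reference) RMSD matrix.
    COV: fraction of reference conformers within `threshold` of some
    generated conformer. MAT: mean over references of the minimum
    RMSD to any generated conformer."""
    min_per_ref = rmsd.min(axis=0)
    cov = float(np.mean(min_per_ref <= threshold))
    mat = float(np.mean(min_per_ref))
    return cov, mat

# 3 generated conformers scored against 2 reference conformers (toy values).
rmsd = np.array([[0.5, 2.0],
                 [1.5, 1.0],
                 [3.0, 2.5]])
cov, mat = cov_mat(rmsd)
print(cov, mat)
```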
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>GEOM-QM9 COV (%)</th>
          <th>GEOM-QM9 MAT</th>
          <th>GEOM-Drugs COV (%)</th>
          <th>GEOM-Drugs MAT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Torsional Diff.</td>
          <td>Diffusion</td>
          <td>92.8</td>
          <td>0.178</td>
          <td>72.7*</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>DGSM</td>
          <td>MPNN, Diffusion</td>
          <td>91.49</td>
          <td>0.2139</td>
          <td>78.73</td>
          <td>1.0154</td>
      </tr>
      <tr>
          <td>GeoDiff</td>
          <td>GFN, Diffusion</td>
          <td>90.07</td>
          <td>0.209</td>
          <td>89.13</td>
          <td>0.8629</td>
      </tr>
      <tr>
          <td>ConfGF</td>
          <td>GIN, Diffusion</td>
          <td>88.49</td>
          <td>0.2673</td>
          <td>62.15</td>
          <td>1.1629</td>
      </tr>
      <tr>
          <td>GeoMol</td>
          <td>MPNN</td>
          <td>71.26</td>
          <td>0.3731</td>
          <td>67.16</td>
          <td>1.0875</td>
      </tr>
  </tbody>
</table>
<p>*Torsional Diffusion uses a 0.75 Å threshold instead of the standard 1.25 Å for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.</p>
<p>Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.</p>
<h2 id="protein-generation-from-sequence-to-structure">Protein Generation: From Sequence to Structure</h2>
<h3 id="protein-representation-learning">Protein Representation Learning</h3>
<p>Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman&rsquo;s $\rho$).</p>
<p>Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.</p>
<h3 id="protein-structure-prediction">Protein Structure Prediction</h3>
<p>Given an amino acid sequence, models predict 3D point coordinates for each residue. Evaluated using RMSD, GDT-TS, TM-score, and LDDT on CASP14 and CAMEO benchmarks.</p>
<p>AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>CAMEO RMSD</th>
          <th>CAMEO TMScore</th>
          <th>CAMEO GDT-TS</th>
          <th>CAMEO lDDT</th>
          <th>CASP14 TMScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AlphaFold2</td>
          <td>Transformer</td>
          <td>3.30</td>
          <td>0.87</td>
          <td>0.86</td>
          <td>0.90</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>ESMFold</td>
          <td>Transformer</td>
          <td>3.99</td>
          <td>0.85</td>
          <td>0.83</td>
          <td>0.87</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>RoseTTAFold</td>
          <td>Transformer</td>
          <td>5.72</td>
          <td>0.77</td>
          <td>0.71</td>
          <td>0.79</td>
          <td>0.37</td>
      </tr>
      <tr>
          <td>EigenFold</td>
          <td>Diffusion</td>
          <td>7.37</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.78</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="sequence-generation-inverse-folding">Sequence Generation (Inverse Folding)</h3>
<p>Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The space of candidate sequences is vast, estimated at between $10^{65}$ and $10^{130}$ possibilities.</p>
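That range corresponds to proteins of roughly 50-100 residues drawn from the 20 standard amino acids, which is easy to verify:

```python
import math

# 20 standard amino acids -> 20**L candidate sequences of length L.
def log10_sequence_space(length, alphabet_size=20):
    """Base-10 logarithm of the number of possible sequences."""
    return length * math.log10(alphabet_size)

print(round(log10_sequence_space(50)))   # 65  -> ~10^65 sequences
print(round(log10_sequence_space(100)))  # 130 -> ~10^130 sequences
```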
<p>Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):</p>
<p>$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, x_2, \ldots, x_{i-1})\right)$$</p>
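As a sanity check on the metric, PPL is the exponential of the negative mean log-likelihood over the sequence; a minimal computation from per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities P(x_i | x_1, ..., x_{i-1})."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model assigning every residue probability 1/20 (uniform over the
# 20 amino acids) has perplexity exactly 20.
print(perplexity([0.05] * 8))  # 20.0 (up to floating point)
```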
<p>ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>AAR (%)</th>
          <th>Div.</th>
          <th>RMSD</th>
          <th>Non.</th>
          <th>Time (s)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ProteinMPNN</td>
          <td>MPNN</td>
          <td>48.7</td>
          <td>0.168</td>
          <td>1.019</td>
          <td>1.061</td>
          <td>112</td>
      </tr>
      <tr>
          <td>ESM-IF1</td>
          <td>Transformer</td>
          <td>47.7</td>
          <td>0.184</td>
          <td>1.265</td>
          <td>1.201</td>
          <td>1980</td>
      </tr>
      <tr>
          <td>GPD</td>
          <td>Transformer</td>
          <td>46.2</td>
          <td>0.219</td>
          <td>1.758</td>
          <td>1.333</td>
          <td>35</td>
      </tr>
      <tr>
          <td>ABACUS-R</td>
          <td>Transformer</td>
          <td>45.7</td>
          <td>0.124</td>
          <td>1.482</td>
          <td>0.968</td>
          <td>233280</td>
      </tr>
      <tr>
          <td>3D CNN</td>
          <td>CNN</td>
          <td>44.5</td>
          <td>0.272</td>
          <td>1.62</td>
          <td>1.027</td>
          <td>536544</td>
      </tr>
      <tr>
          <td>PiFold</td>
          <td>GNN</td>
          <td>42.8</td>
          <td>0.141</td>
          <td>1.592</td>
          <td>1.464</td>
          <td>221</td>
      </tr>
      <tr>
          <td>ProteinSolver</td>
          <td>GNN</td>
          <td>24.6</td>
          <td>0.186</td>
          <td>5.354</td>
          <td>1.389</td>
          <td>180</td>
      </tr>
  </tbody>
</table>
<p>Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.</p>
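AAR itself is straightforward: the percentage of positions at which the designed sequence reproduces the native residue. A toy sketch (the sequences below are illustrative, not from any benchmark):

```python
def amino_acid_recovery(designed, native):
    """AAR: percentage of positions where the designed sequence
    matches the native residue."""
    assert len(designed) == len(native)
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)

print(amino_acid_recovery("MKTAYIA", "MKTAYIA"))  # 100.0
print(amino_acid_recovery("MKTAYIA", "MKSAYLA"))  # 5/7 positions match
```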
<h3 id="backbone-design">Backbone Design</h3>
<p>Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for backbone atoms (nitrogen, alpha-carbon, carbonyl, oxygen) and use external tools like Rosetta for side-chain packing.</p>
<p>Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).</p>
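The TM-score underlying scTM can be sketched for an already-superposed residue pairing (the full metric also searches over superpositions); the $d_0$ length normalization below is the standard one:

```python
def tm_score(deviations, length):
    """TM-score for superposed structures, given per-residue C-alpha
    deviations (Angstroms) and the target length L for normalization."""
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8  # standard length scaling
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in deviations) / length

print(tm_score([0.0] * 100, 100))  # 1.0 for a perfect match
```

Unlike RMSD, large per-residue deviations saturate rather than dominate, which is why TM-score is preferred for judging global fold similarity.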
<p>ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.</p>
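FoldingDiff's angular setup can be illustrated with the standard DDPM forward process plus angle wrapping; the schedule, shapes, and seed below are placeholders, not the paper's values:

```python
import numpy as np

def wrap(a):
    """Wrap angles into [-pi, pi)."""
    return (a + np.pi) % (2.0 * np.pi) - np.pi

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I),
    then wrap, since each residue is represented by angles."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return wrap(xt)

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear beta schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.uniform(-np.pi, np.pi, size=(128, 6))  # six angles per residue
xt = forward_noise(x0, T - 1, alpha_bar, rng)   # near-pure wrapped noise
```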
<p>Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using &ldquo;self-conditioning&rdquo; on predicted structures. Protpardelle co-designs sequence and structure by creating a &ldquo;superposition&rdquo; over possible sidechain states and collapsing them during each iterative diffusion step.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>scTM (%)</th>
          <th>Design. (%)</th>
          <th>PPL</th>
          <th>AAR (%)</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RFDiffusion</td>
          <td>Diffusion</td>
          <td>&ndash;</td>
          <td>95.1</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Protpardelle</td>
          <td>Diffusion</td>
          <td>85</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FrameDiff</td>
          <td>Diffusion</td>
          <td>84</td>
          <td>48.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Genie</td>
          <td>Diffusion</td>
          <td>81.5</td>
          <td>79.0</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>LatentDiff</td>
          <td>EGNN, Diffusion</td>
          <td>31.6</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FoldingDiff</td>
          <td>Diffusion</td>
          <td>14.2</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>ProtDiff</td>
          <td>EGNN, Diffusion</td>
          <td>11.8</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>12.47*</td>
          <td>8.01*</td>
      </tr>
  </tbody>
</table>
<p>*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.</p>
<h3 id="antibody-design">Antibody Design</h3>
<p>The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.</p>
<p>For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) and sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method incorporating structure prediction, docking, and CDR generation into a single framework. Because CDRs are highly variable, informative MSAs cannot be constructed for antibody inputs, which makes general models like AlphaFold2 less effective for antibody prediction. Specialized models like IgFold use sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.</p>
<h3 id="peptide-design">Peptide Design</h3>
<p>The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).</p>
<h2 id="current-trends-challenges-and-future-directions">Current Trends, Challenges, and Future Directions</h2>
<h3 id="current-trends">Current Trends</h3>
<p>The survey identifies several parallel trends across molecule and protein generation:</p>
<ol>
<li>
<p><strong>Shift from sequence to structure</strong>: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.</p>
</li>
<li>
<p><strong>Dominance of E(3) equivariant architectures</strong>: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.</p>
</li>
<li>
<p><strong>Structure-based over ligand-based approaches</strong>: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.</p>
</li>
</ol>
<h3 id="challenges">Challenges</h3>
<p><strong>For small molecule generation:</strong></p>
<ul>
<li><strong>Complexity</strong>: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.</li>
<li><strong>Applicability</strong>: Generating molecules with high binding affinity to targets remains difficult.</li>
<li><strong>Explainability</strong>: Methods are black-box, offering no insight into why generated molecules have desired properties.</li>
</ul>
<p><strong>For protein generation:</strong></p>
<ul>
<li><strong>Benchmarking</strong>: Protein generative tasks lack a standard evaluative procedure, with variance between each model&rsquo;s metrics and testing conditions.</li>
<li><strong>Performance</strong>: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.</li>
</ul>
<p>The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors identify increasing performance in existing tasks, defining more applicable tasks (especially in molecule-protein binding, antibody generation), and exploring entirely new areas of research as key future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.</p>
<h3 id="data">Data</h3>
<p>The survey catalogs the following key datasets across subtasks:</p>
<table>
  <thead>
      <tr>
          <th>Subtask</th>
          <th>Datasets</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target-agnostic molecule</td>
          <td>QM9, <a href="/notes/chemistry/datasets/geom/">GEOM</a>-Drug</td>
          <td>QM9 from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>; GEOM-Drug for complex molecules</td>
      </tr>
      <tr>
          <td>Target-aware molecule</td>
          <td>CrossDocked2020, ZINC20, Binding MOAD</td>
          <td>CrossDocked2020 most used (22.5M pairs)</td>
      </tr>
      <tr>
          <td>Conformation generation</td>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a>-QM9, GEOM-Drugs, ISO17</td>
          <td>Conformer sets for molecules</td>
      </tr>
      <tr>
          <td>Protein structure prediction</td>
          <td>PDB, CASP14, CAMEO</td>
          <td>CASP biennial blind evaluation</td>
      </tr>
      <tr>
          <td>Protein sequence generation</td>
          <td>PDB, UniRef, UniParc, CATH, TS500</td>
          <td>CATH for domain classification</td>
      </tr>
      <tr>
          <td>Backbone design</td>
          <td>PDB, AlphaFoldDB, SCOP, CATH</td>
          <td>AlphaFoldDB for expanded structural coverage</td>
      </tr>
      <tr>
          <td>Antibody structure</td>
          <td>SAbDab, RAB</td>
          <td>SAbDab: all antibody structures from PDB</td>
      </tr>
      <tr>
          <td>Antibody CDR generation</td>
          <td>SAbDab, RAB, SKEMPI</td>
          <td>SKEMPI for affinity optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gersteinlab/GenAI4Drug">GenAI4Drug</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Organized repository of all covered sources</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., &amp; Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. <em>Briefings in Bioinformatics</em>, 25(4), bbae338. <a href="https://doi.org/10.1093/bib/bbae338">https://doi.org/10.1093/bib/bbae338</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2402.08703">arXiv: 2402.08703</a></li>
<li><a href="https://github.com/gersteinlab/GenAI4Drug">GitHub: GenAI4Drug</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247410/">PMC: PMC11247410</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{tang2024survey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae338}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Foundation Models in Chemistry: A 2025 Perspective</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/foundation-models-chemistry-perspective/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/foundation-models-chemistry-perspective/</guid><description>Perspective reviewing foundation models for chemistry across property prediction, MLIPs, inverse design, and multi-domain applications.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-foundation-models-for-chemistry">A Systematization of Foundation Models for Chemistry</h2>
<p>This is a <strong>Systematization</strong> paper. It organizes the rapidly growing landscape of foundation models in chemistry into a coherent taxonomy. The paper distinguishes between &ldquo;small&rdquo; foundation models (pretrained for a single application domain) and &ldquo;big&rdquo; foundation models (adaptable across multiple domains such as property prediction and inverse design). It covers models based on graph neural networks (GNNs) and language models, reviews pretraining strategies (self-supervised, multimodal, supervised), and maps approximately 40 models across four application domains.</p>
<h2 id="why-a-foundation-model-perspective-for-chemistry">Why a Foundation Model Perspective for Chemistry?</h2>
<p>Foundation models have transformed NLP and computer vision through large-scale pretraining and transfer learning. In chemistry, however, several persistent challenges motivate the adoption of this paradigm:</p>
<ol>
<li><strong>Data scarcity</strong>: Chemical datasets are often small and expensive to generate (requiring experiments or quantum mechanical calculations), unlike the large annotated datasets available in NLP/CV.</li>
<li><strong>Poor generalization</strong>: ML models in chemistry frequently need to extrapolate to out-of-domain compounds (e.g., novel drug candidates, unseen crystal structures), where conventional models struggle.</li>
<li><strong>Limited transferability</strong>: Traditional ML interatomic potentials (MLIPs) are trained on system-specific datasets and cannot be easily transferred across different chemical systems.</li>
</ol>
<p>Foundation models address these by learning general representations from large unlabeled datasets, which can then be adapted to specific downstream tasks via finetuning. The paper argues that summarizing this fast-moving field is timely, given the diversity of approaches emerging across molecular property prediction, MLIPs, inverse design, and multi-domain applications.</p>
<h2 id="small-vs-big-foundation-models-a-two-tier-taxonomy">Small vs. Big Foundation Models: A Two-Tier Taxonomy</h2>
<p>The paper&rsquo;s central organizing framework distinguishes two scopes of foundation model:</p>
<p><strong>Small foundation models</strong> are pretrained models adapted to various tasks within a single application domain. Examples include:</p>
<ul>
<li>A model pretrained on large molecular databases that predicts multiple molecular properties (band gap, formation energy, etc.)</li>
<li>A universal MLIP that can simulate diverse chemical systems</li>
<li>A pretrained generative model adapted for inverse design of different target properties</li>
</ul>
<p><strong>Big foundation models</strong> span multiple application domains, handling both property prediction and inverse design within a single framework. These typically use multimodal learning (combining SMILES/graphs with text) or build on large language models.</p>
<h3 id="architectures">Architectures</h3>
<p>The paper reviews two primary architecture families:</p>
<p><strong>Graph Neural Networks (GNNs)</strong> represent molecules and crystals as graphs $G = (V, E)$ with nodes (atoms) and edges (bonds). Node features are updated through message passing:</p>
<p>$$
m_{i}^{t+1} = \sum_{j \in N(i)} M_{t}(v_{i}^{t}, v_{j}^{t}, e_{ij}^{t})
$$</p>
<p>$$
v_{i}^{t+1} = U_{t}(v_{i}^{t}, m_{i}^{t+1})
$$</p>
<p>After $T$ message-passing steps, a readout function produces a graph-level feature:</p>
<p>$$
g = R(\{ v_{i}^{T} \mid i \in G \})
$$</p>
<p>Recent equivariant GNNs (e.g., NequIP, MACE, EquiformerV2) use vectorial features that respect geometric symmetries, improving expressivity for tasks sensitive to 3D structure.</p>
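The update equations above can be made concrete with a toy dense implementation, taking $M_t$ and $U_t$ as small learned maps (weights here are random placeholders, and edge features are omitted for brevity):

```python
import numpy as np

def mp_step(v, edges, W_msg, W_upd):
    """One message-passing step: m_i = sum over neighbors j of M(v_j);
    v_i' = U(v_i, m_i). `edges` lists directed pairs (i, j): j sends to i."""
    n, d = v.shape
    m = np.zeros((n, d))
    for i, j in edges:
        m[i] += np.tanh(v[j] @ W_msg)                       # message function M_t
    return np.tanh(np.concatenate([v, m], axis=1) @ W_upd)  # update function U_t

def readout(v):
    """Permutation-invariant readout R: a sum over node features."""
    return v.sum(axis=0)

rng = np.random.default_rng(0)
d = 8
W_msg = rng.normal(size=(d, d)) / np.sqrt(d)
W_upd = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
v = rng.normal(size=(4, d))                               # 4 atoms
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]  # an undirected chain
g = readout(mp_step(v, edges, W_msg, W_upd))              # graph-level feature
```

The sum readout is what makes $g$ invariant to atom ordering; equivariant variants additionally carry vector-valued features through each step.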
<p><strong>Language Models</strong> operate on string representations of molecules (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or crystal structures. Autoregressive models like GPT maximize:</p>
<p>$$
\prod_{t=1}^{T} P(x_{t} \mid x_{1}, x_{2}, \ldots, x_{t-1})
$$</p>
<p>Transformers use self-attention:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$</p>
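A minimal NumPy rendering of this attention operation (single head, no masking; shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))   # 5 query tokens (e.g., SMILES positions)
K = rng.normal(size=(7, 16))   # 7 key/value tokens
V = rng.normal(size=(7, 16))
out = attention(Q, K, V)       # shape (5, 16): one mixture of V rows per query
```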
<h3 id="pretraining-strategies">Pretraining Strategies</h3>
<p>The paper categorizes pretraining methods into three self-supervised learning (SSL) approaches plus supervised and multimodal strategies:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Mechanism</th>
          <th>Example Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Contrastive learning</td>
          <td>Maximize similarity between positive pairs, minimize for negatives</td>
          <td>GraphCL, MolCLR, GraphMVP, CrysGNN</td>
      </tr>
      <tr>
          <td>Predictive learning</td>
          <td>Predict self-generated labels (node context, functional groups, space group)</td>
          <td>GROVER, Hu et al., CrysGNN</td>
      </tr>
      <tr>
          <td>Generative learning</td>
          <td>Reconstruct masked nodes/edges or entire molecules/SMILES</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a></td>
      </tr>
      <tr>
          <td>Supervised pretraining</td>
          <td>Train on energy, forces, stress from DFT databases</td>
          <td>M3GNet, CHGNet, MACE-MP-0, MatterSim</td>
      </tr>
      <tr>
          <td>Multimodal learning</td>
          <td>Learn joint representations across SMILES/graph + text modalities</td>
          <td>KV-PLM, <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a>, MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/">SPMM</a></td>
      </tr>
  </tbody>
</table>
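The contrastive row can be made concrete with an in-batch InfoNCE-style objective, where the two augmented views of each molecule form the positive pair and all other in-batch pairings act as negatives (a generic sketch, not any specific paper's loss):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss over a batch of positive pairs (z1[i], z2[i])."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau  # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # -log P(positive | row)

rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))                # view A embeddings, 8 molecules
z_b = z_a + 0.05 * rng.normal(size=(8, 16))   # slightly perturbed view B
loss = info_nce(z_a, z_b)                     # small: pairs are well matched
```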
<p>A common finding across studies is that combining local and global information (e.g., via contrastive learning between node-level and graph-level views, or supervised learning on both forces and total energy) produces more transferable representations.</p>
<h2 id="survey-of-models-across-four-domains">Survey of Models Across Four Domains</h2>
<h3 id="property-prediction">Property Prediction</h3>
<p>The paper reviews 13 models for molecular and materials property prediction. Key findings:</p>
<ul>
<li><strong>Contrastive learning approaches</strong> (GraphCL, MolCLR, GraphMVP) achieve strong results by defining positive pairs through augmentation, 2D/3D structure views, or crystal system membership.</li>
<li><strong>Language model approaches</strong> (<a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>) show that transformers trained on SMILES via masked language modeling can compete with GNN-based approaches.</li>
<li><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, pretrained on 1.1 billion SMILES from PubChem and ZINC, outperformed many baselines including GNNs on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and <a href="/notes/chemistry/datasets/qm9/">QM9</a> benchmarks. Its attention maps captured molecular structural features directly from SMILES strings.</li>
<li>For crystalline materials, CrysGNN combined contrastive, predictive, and generative learning, demonstrating improvements even on small experimental datasets.</li>
</ul>
<h3 id="machine-learning-interatomic-potentials-mlips">Machine Learning Interatomic Potentials (MLIPs)</h3>
<p>The paper surveys 10 universal MLIPs, all using supervised learning on DFT-calculated energies, forces, and stresses:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Training Data Size</th>
          <th>Key Capability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>M3GNet</td>
          <td>GNN</td>
          <td>187K (MP)</td>
          <td>First universal MLIP</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>GNN</td>
          <td>1.58M (MPtrj)</td>
          <td>Predicts magnetic moments</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>MACE</td>
          <td>1.58M (MPtrj)</td>
          <td>35 diverse applications</td>
      </tr>
      <tr>
          <td>GNoME potential</td>
          <td>NequIP</td>
          <td>89M</td>
          <td>Zero-shot comparable to trained MLIPs</td>
      </tr>
      <tr>
          <td>MatterSim</td>
          <td>M3GNet/Graphormer</td>
          <td>17M</td>
          <td>SOTA on Matbench Discovery</td>
      </tr>
      <tr>
          <td>eqV2</td>
          <td>EquiformerV2</td>
          <td>118M (OMat24)</td>
          <td>Structural relaxation</td>
      </tr>
  </tbody>
</table>
<p>The GNoME potential, trained on approximately 89 million data points, achieved zero-shot performance comparable to state-of-the-art MLIPs trained from scratch. MatterSim, trained on over 17 million entries across wide temperature (0-5000 K) and pressure (0-1000 GPa) ranges, achieved state-of-the-art on Matbench Discovery and accurately computed thermodynamic and lattice dynamic properties.</p>
<h3 id="inverse-design">Inverse Design</h3>
<p>Few pretrained generative models for inverse design exist. The paper highlights three:</p>
<ul>
<li><strong>MatterGen</strong> (Microsoft): Diffusion model pretrained on Alexandria/MP databases (607K structures), finetuned for conditional generation on band gap, elastic modulus, spacegroup, and composition. Generated S.U.N. (stable, unique, novel) materials at rates more than 2x the previous state of the art.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/">GP-MoLFormer</a></strong> (IBM): MoLFormer pretrained on 1.1B SMILES, finetuned via pair-tuning for property-guided molecular optimization.</li>
<li><strong>CrystalLLM</strong>: Finetuned LLaMA-2 70B for crystal generation with target spacegroup and composition using string representations and prompting.</li>
</ul>
<h3 id="multi-domain-models">Multi-Domain Models</h3>
<p>The paper covers two multi-domain categories:</p>
<p><strong>Property prediction + MLIP</strong>: Denoising pretraining learns virtual forces that guide noisy configurations back to equilibrium, connecting to force prediction. Joint multi-domain pretraining (JMP) from Meta FAIR achieved state-of-the-art on 34 of 40 tasks spanning molecules, crystals, and MOFs by training simultaneously on diverse energy/force databases.</p>
<p><strong>Property prediction + inverse design</strong>: Multimodal models (KV-PLM, <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a>, MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/">MolFM</a>, <a href="/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/">SPMM</a>) learn joint representations from molecular structures and text, enabling text-based inverse design and property prediction in a single framework. LLM-based models (<a href="/notes/chemistry/llm-applications/chemdfm-x/">ChemDFM</a>, <a href="/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/">nach0</a>, <a href="/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/">finetuned GPT-3</a>) can interact with humans and handle diverse chemistry tasks through instruction tuning.</p>
<h2 id="trends-and-future-directions">Trends and Future Directions</h2>
<h3 id="scope-expansion">Scope Expansion</h3>
<p>The authors identify three axes for expanding foundation model scope:</p>
<ol>
<li><strong>Material types</strong>: Most models target molecules or a single material class. Foundation models that span molecules, crystals, surfaces, and MOFs could exploit shared chemistry across materials.</li>
<li><strong>Modalities</strong>: Beyond SMILES, graphs, and text, additional modalities (images, spectral data like XRD patterns) remain underexplored.</li>
<li><strong>Downstream tasks</strong>: Extending to new chemistry and tasks through emergent capabilities, analogous to the capabilities observed in LLMs at scale.</li>
</ol>
<h3 id="performance-and-scaling">Performance and Scaling</h3>
<p>Key scaling challenges include:</p>
<ul>
<li><strong>Data quality vs. quantity</strong>: Noisy DFT labels (e.g., HOMO-LUMO gaps with high uncertainty from different functionals/basis sets) can limit scalability and out-of-distribution performance.</li>
<li><strong>GNN scalability</strong>: While transformers scale to hundreds of billions of parameters, GNNs have rarely been explored above one million parameters due to oversmoothing and the curse of dimensionality. Recent work by Sypetkowski et al. demonstrated scaling GNNs to 3 billion parameters with consistent improvements.</li>
<li><strong>Database integration</strong>: Combining datasets from different DFT codes requires proper alignment (e.g., total energy alignment methods).</li>
</ul>
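One common total-energy alignment approach is to fit per-element reference energies by least squares and subtract the composition baseline, leaving interaction-scale energies that are comparable across codes; the compositions and energies below are toy values:

```python
import numpy as np

def fit_reference_energies(counts, energies):
    """Least-squares per-element reference energies.
    `counts` is an (N, n_elements) matrix of atom counts per structure."""
    refs, *_ = np.linalg.lstsq(counts, energies, rcond=None)
    return refs

# Toy dataset: columns = (H, O) atom counts; energies in arbitrary units.
counts = np.array([[2.0, 1.0], [1.0, 1.0], [4.0, 2.0], [2.0, 2.0]])
energies = counts @ np.array([-13.6, -75.0]) + np.array([0.1, -0.05, 0.2, -0.1])
refs = fit_reference_energies(counts, energies)
aligned = energies - counts @ refs  # interaction-scale residuals
```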
<h3 id="efficiency">Efficiency</h3>
<p>For MLIPs, efficiency is critical since MD simulations require millions of inference steps. Approaches include:</p>
<ul>
<li>Knowledge distillation from expensive teacher models to lighter student models</li>
<li>Model compression techniques (quantization, pruning) adapted for GNNs</li>
<li>Investigating whether strict equivariance is always necessary</li>
</ul>
<h3 id="interpretability">Interpretability</h3>
<p>Foundation models can generate hallucinations or mode-collapsed outputs. The authors highlight recent interpretability advances (feature extraction from Claude 3, knowledge localization and editing in transformers) as promising directions for more reliable chemical applications.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Combining local and global information in pretraining consistently improves downstream performance across all domains reviewed.</li>
<li>Self-supervised pretraining enables effective transfer learning even in low-data regimes, a critical advantage for chemistry.</li>
<li>Universal MLIPs have reached the point where zero-shot performance can be comparable to system-specific trained models.</li>
<li>Multimodal learning is the most promising approach for big foundation models capable of spanning property prediction and inverse design.</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The precise definition of &ldquo;foundation model&rdquo; in chemistry is not established and varies by scope.</li>
<li>Most surveyed models focus on molecules, with crystalline materials less explored.</li>
<li>Benchmarks for low-data regimes and out-of-distribution performance are insufficient.</li>
<li>The paper focuses on three domains (property prediction, MLIPs, inverse design) and does not cover retrosynthesis, reaction prediction, or other chemical tasks in depth.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a perspective/review paper. No new data or models are introduced. The paper surveys existing models and their training datasets, summarized in Table 1 of the paper.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Not applicable (review paper). The paper describes pretraining strategies (contrastive, predictive, generative, supervised, multimodal) at a conceptual level with references to the original works.</p>
<h3 id="models">Models</h3>
<p>Not applicable (review paper). The paper catalogs approximately 40 foundation models across four domains. See Table 1 in the paper for the complete listing.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Not applicable (review paper). The paper references benchmark results from the original studies (MoleculeNet, QM9, Matbench, Matbench Discovery, JARVIS-DFT) but does not perform independent evaluation.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Choi, J., Nam, G., Choi, J., &amp; Jung, Y. (2025). A Perspective on Foundation Models in Chemistry. <em>JACS Au</em>, 5(4), 1499-1518. <a href="https://doi.org/10.1021/jacsau.4c01160">https://doi.org/10.1021/jacsau.4c01160</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{choi2025perspective,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Perspective on Foundation Models in Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Choi, Junyoung and Nam, Gunwook and Choi, Jaesik and Jung, Yousung}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{JACS Au}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1499--1518}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/jacsau.4c01160}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Language Models for De Novo Drug Design Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/</guid><description>Review of chemical language models for de novo drug design covering string representations, architectures, training strategies, and experimental validation.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-drug-design">A Systematization of Chemical Language Models for Drug Design</h2>
<p>This paper is a <strong>Systematization</strong> (minireview) that surveys the landscape of chemical language models (CLMs) for de novo drug design. It organizes the field along three axes: molecular string representations, deep learning architectures, and generation strategies (distribution learning, goal-directed, and conditional). The review also highlights experimental validations, current gaps, and future opportunities.</p>
<h2 id="why-chemical-language-models-matter-for-drug-design">Why Chemical Language Models Matter for Drug Design</h2>
<p>De novo drug design faces an enormous combinatorial challenge: the &ldquo;chemical universe&rdquo; is estimated to contain up to $10^{60}$ drug-like small molecules. Exhaustive enumeration is infeasible, and traditional design algorithms rely on hand-crafted assembly rules. Chemical language models address this by borrowing natural language processing techniques to learn the &ldquo;chemical language,&rdquo; generating molecules as string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, DeepSMILES) that satisfy both syntactic validity (chemically valid structures) and semantic correctness (desired pharmacological properties).</p>
<p>CLMs have gained traction because string representations are readily available for most molecular databases, generation is computationally cheap (one molecule per forward pass through a sequence model), and the same architecture can be applied to diverse tasks (property prediction, de novo generation, reaction prediction). At the time of this review, CLMs had produced experimentally validated bioactive molecules in several prospective studies, establishing them as practical tools for drug discovery.</p>
<h2 id="molecular-string-representations-smiles-deepsmiles-and-selfies">Molecular String Representations: SMILES, DeepSMILES, and SELFIES</h2>
<p>The review covers three main string representations used as input/output for CLMs:</p>
<p><strong>SMILES</strong> (Simplified Molecular-Input Line-Entry System) converts hydrogen-depleted molecular graphs into strings where atoms are denoted by atomic symbols, bonds and branching by punctuation, and ring openings/closures by numbers. SMILES are non-univocal (multiple valid strings per molecule), so canonicalization algorithms are needed to obtain a unique representation. Multiple studies show that using randomized (non-canonical) SMILES for data augmentation improves CLM performance, with diminishing returns beyond 10- to 20-fold augmentation.</p>
<p><strong><a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a></strong> modifies SMILES to improve machine-readability by replacing the paired ring-opening/closure digits with a count-based system and using closing parentheses only (no opening ones). This reduces the frequency of syntactically invalid strings but does not eliminate them entirely.</p>
<p><strong><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a></strong> (Self-Referencing Embedded Strings) use a formal grammar that guarantees 100% syntactic validity of decoded molecules. Every SELFIES string maps to a valid molecular graph. However, SELFIES can produce chemically unrealistic molecules (e.g., highly strained ring systems), and the mapping between string edits and molecular changes is less intuitive than for SMILES.</p>
<p>The review notes a key tradeoff: SMILES offer a richer, more interpretable language with well-studied augmentation strategies, while SELFIES guarantee validity at the cost of chemical realism and edit interpretability.</p>
<h2 id="clm-architectures-and-training-strategies">CLM Architectures and Training Strategies</h2>
<h3 id="architectures">Architectures</h3>
<p>The review describes the main architectures used in CLMs:</p>
<p><strong>Recurrent Neural Networks (RNNs)</strong>, particularly LSTMs and GRUs, dominated early CLM work. These models process SMILES character-by-character and generate new strings autoregressively via next-token prediction. RNNs are computationally efficient and well-suited to the sequential nature of molecular strings.</p>
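<p>The character-by-character (or token-by-token) processing that RNN-based CLMs rely on starts with tokenization. Below is an illustrative sketch of the regex-based SMILES tokenizer commonly used in this literature; the exact pattern varies between implementations, and this one covers only common organic-subset symbols:</p>

```python
import re

# Regex-based SMILES tokenizer (pattern adapted from common usage in the
# CLM literature; illustrative, not exhaustive). Bracket atoms, two-letter
# halogens, and two-digit ring closures are matched before single characters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|N|O|S|P|F|I|B|C|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|\*|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any characters are left over."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The resulting token sequence is what an autoregressive model consumes and emits one step at a time during next-token prediction.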
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode molecules into a continuous latent space and decode them back into strings. This enables smooth interpolation between molecules and latent-space optimization, but generated strings may be syntactically invalid.</p>
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> have been adapted for molecular string generation (e.g., <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>), though they face training instability and mode collapse challenges that limit their adoption.</p>
<p><strong>Transformers</strong> have emerged as an increasingly popular alternative, offering parallelized training and the ability to capture long-range dependencies in molecular strings. The review notes the growing relevance of Transformer-based CLMs, particularly for large-scale pretraining.</p>
<h3 id="generation-strategies">Generation Strategies</h3>
<p>The review organizes CLM generation into three categories:</p>
<ol>
<li>
<p><strong>Distribution learning</strong>: The model learns to reproduce the statistical distribution of a training set of molecules. No explicit scoring function is used during generation. The generated molecules are evaluated post-hoc by comparing their property distributions to the training set. This approach is end-to-end but provides no direct indication of individual molecule quality.</p>
</li>
<li>
<p><strong>Goal-directed generation</strong>: A pretrained CLM is steered toward molecules optimizing a specified scoring function (e.g., predicted bioactivity, physicochemical properties). Common approaches include reinforcement learning (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and variants), hill-climbing, and Bayesian optimization. Scoring functions provide direct quality signals but can introduce biases, shortcuts, and limited structural diversity.</p>
</li>
<li>
<p><strong>Conditional generation</strong>: An intermediate approach that learns a joint semantic space between molecular structures and desired properties. The desired property profile serves as an input &ldquo;prompt&rdquo; for generation (e.g., a protein target, gene expression signature, or 3D shape). This bypasses the need for external scoring functions but has seen limited experimental application.</p>
</li>
</ol>
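<p>The goal-directed strategy above can be sketched as a generic mutate-score-accept loop. This is a toy illustration in plain Python: <code>score</code> and <code>mutate</code> are made-up stand-ins for the property predictor and CLM sampler a real system (e.g., a hill-climbing variant of REINVENT-style optimization) would use:</p>

```python
import random

random.seed(0)

# Toy hill-climbing sketch of goal-directed generation: mutate a string
# "molecule" and keep the mutation whenever a stand-in scoring function
# improves. Real systems replace `score` with a bioactivity/property
# predictor and `mutate` with sampling from a fine-tuned CLM.
ALPHABET = list("CNOcno=#()1")

def score(s: str) -> float:
    # Hypothetical objective: reward a target fraction of aromatic carbons.
    target = 0.4
    frac_c = s.count("c") / max(len(s), 1)
    return -abs(frac_c - target)

def mutate(s: str) -> str:
    i = random.randrange(len(s))
    return s[:i] + random.choice(ALPHABET) + s[i + 1:]

candidate = "CCCCCCCCCC"
best = score(candidate)
for _ in range(500):
    proposal = mutate(candidate)
    if score(proposal) > best:
        candidate, best = proposal, score(proposal)

print(candidate, round(best, 3))
```

The failure modes the review mentions show up even in this toy: a greedy loop happily exploits any artifact of the scoring function and converges to low-diversity solutions.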
<h3 id="transfer-learning-and-chemical-space-exploration">Transfer Learning and Chemical Space Exploration</h3>
<p>Transfer learning is the dominant paradigm for CLM-driven chemical space exploration. A large-scale pretraining step (on $10^5$ to $10^6$ molecules via next-character prediction) is followed by fine-tuning on a smaller set of molecules with desired properties (often only 10 to 100 molecules). Key findings from the literature:</p>
<ul>
<li>The minimum training set size depends on target molecule complexity and heterogeneity.</li>
<li>SMILES augmentation is most beneficial with small training sets (fewer than 10,000 molecules) and plateaus for large, structurally complex datasets.</li>
<li>Fine-tuning with as few as 10 to 100 molecules has produced experimentally validated bioactive designs.</li>
<li>Hyperparameter tuning has relatively little effect on overall CLM performance.</li>
</ul>
<h2 id="evaluating-clm-designs-and-experimental-validation">Evaluating CLM Designs and Experimental Validation</h2>
<p>The review identifies evaluation as a critical gap. CLMs are often benchmarked on &ldquo;toy&rdquo; properties such as calculated logP, molecular weight, or QED (quantitative estimate of drug-likeness). These metrics capture the ability to satisfy predefined criteria but fail to reflect real-world drug discovery complexity and may lead to trivial solutions.</p>
<p>Existing benchmarks (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>) enable comparability across independently developed approaches but do not fully address the quality of generated compounds. The review emphasizes that experimental validation is the ultimate test. At the time of writing, only a few prospective applications had been published:</p>
<ul>
<li>Dual modulator of <a href="https://en.wikipedia.org/wiki/Retinoid_X_receptor">retinoid X</a> and <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> receptors (EC50 ranging from 0.06 to 2.3 uM)</li>
<li>Inhibitor of <a href="https://en.wikipedia.org/wiki/Pim_kinase">Pim1 kinase</a> and <a href="https://en.wikipedia.org/wiki/Cyclin-dependent_kinase_4">CDK4</a> (manually modified from generated design)</li>
<li>Natural-product-inspired <a href="https://en.wikipedia.org/wiki/RAR-related_orphan_receptor_gamma">RORgamma</a> agonist (EC50 = 0.68 uM)</li>
<li>Molecules designed via combined generative AI and on-chip synthesis</li>
</ul>
<p>The scarcity of experimental validations reflects the interdisciplinary expertise required and the time/cost of chemical synthesis.</p>
<h2 id="gaps-limitations-and-future-directions">Gaps, Limitations, and Future Directions</h2>
<p>The review identifies several key gaps and opportunities:</p>
<p><strong>Scoring function limitations</strong>: Current scoring functions struggle with activity cliffs and non-additive structure-activity relationships. Conditional generation methods may help overcome these limitations by learning direct structure-property mappings.</p>
<p><strong>Structure-based design</strong>: Generating molecules that match electrostatic and shape features of protein binding pockets holds promise for addressing unexplored targets. However, prospective applications have been limited, potentially due to bias in existing protein-ligand affinity datasets.</p>
<p><strong>Synthesizability</strong>: Improving the ability of CLMs to propose synthesizable molecules is expected to increase practical relevance. Automated synthesis platforms may help but could also limit accessible chemical space.</p>
<p><strong>Few-shot learning</strong>: Large-scale pretrained CLMs combined with few-shot learning approaches are expected to boost prospective applications.</p>
<p><strong>Extensions beyond small molecules</strong>: Extending chemical languages to more complex molecular entities (proteins with non-natural amino acids, crystals, supramolecular chemistry) is an open frontier.</p>
<p><strong>Failure modes</strong>: Several studies have documented failure modes in goal-directed generation, including model shortcuts (exploiting scoring function artifacts), limited structural diversity, and generation of chemically unrealistic molecules.</p>
<p><strong>Interdisciplinary collaboration</strong>: The review emphasizes that bridging deep learning, cheminformatics, and medicinal chemistry expertise is essential for translating CLM designs into real-world drug candidates.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper and does not present novel experimental data. The paper surveys results from the literature.</p>
<h3 id="algorithms">Algorithms</h3>
<p>No novel algorithms are introduced. The review categorizes existing approaches (RNNs, VAEs, GANs, Transformers) and generation strategies (distribution learning, goal-directed, conditional).</p>
<h3 id="models">Models</h3>
<p>No new models are presented. The paper references existing implementations including REINVENT, ORGAN, and various RNN-based and Transformer-based CLMs.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The review discusses existing benchmarks:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong>: Benchmarking suite for de novo molecular design</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong>: Benchmarking platform for molecular generation models</li>
<li><strong>QED</strong>: Quantitative estimate of drug-likeness</li>
<li>Various physicochemical property metrics (logP, molecular weight)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grisoni, F. (2023). Chemical language models for de novo drug design: Challenges and opportunities. <em>Current Opinion in Structural Biology</em>, 79, 102527. <a href="https://doi.org/10.1016/j.sbi.2023.102527">https://doi.org/10.1016/j.sbi.2023.102527</a></p>
<p><strong>Publication</strong>: Current Opinion in Structural Biology, Volume 79, April 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grisoni2023chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language models for de novo drug design: Challenges and opportunities}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Current Opinion in Structural Biology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{102527}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.sbi.2023.102527}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of Molecular Representation Learning Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</guid><description>A systematic review of molecular representation learning foundation models for drug discovery, covering five modalities and four pretraining strategies.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-molecular-representation-foundation-models">A Systematization of Molecular Representation Foundation Models</h2>
<p>This paper is a <strong>Systematization</strong> that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.</p>
<h2 id="why-a-systematic-review-of-mrl-foundation-models-is-needed">Why a Systematic Review of MRL Foundation Models Is Needed</h2>
<p>Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.</p>
<p>Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.</p>
<h2 id="taxonomy-of-molecular-descriptors-and-model-architectures">Taxonomy of Molecular Descriptors and Model Architectures</h2>
<p>The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.</p>
<h3 id="molecular-descriptors">Molecular Descriptors</h3>
<p>The review identifies five primary descriptor types:</p>
<ol>
<li><strong>Molecular fingerprints</strong>: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and dimensional complexity.</li>
<li><strong>1D sequences</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representations. SMILES is compact and widely used, but models that generate SMILES can emit syntactically invalid strings; SELFIES guarantees valid molecular strings by construction.</li>
<li><strong>2D topological graphs</strong>: Atoms as nodes, bonds as edges. Can be derived from SMILES via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, making graph datasets effectively interchangeable with SMILES datasets.</li>
<li><strong>3D geometry</strong>: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.</li>
<li><strong>Multimodal</strong>: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.</li>
</ol>
<p>The paper also discusses mathematically abstract molecular representations. For example, the <a href="https://en.wikipedia.org/wiki/Wiener_index">Wiener index</a> quantifies structural complexity:</p>
<p>$$
W = \frac{1}{2} \sum_{i \neq j} d_{ij}
$$</p>
<p>where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$.</p>
<p>Degree centrality captures local connectivity:</p>
<p>$$
C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij}
$$</p>
<p>where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.</p>
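<p>Both descriptors can be computed directly from a small adjacency matrix. The sketch below does so for the hydrogen-depleted graph of n-butane (a path of four carbons), whose Wiener index is the textbook value of 10:</p>

```python
from collections import deque

# Compute the Wiener index and degree centralities of a small
# hydrogen-depleted molecular graph (n-butane: C-C-C-C) from its
# adjacency matrix, using BFS for the topological distances d_ij.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

def shortest_paths(A, src):
    """BFS distances from `src` on an unweighted graph."""
    n = len(A)
    dist = [None] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in range(n):
            if A[u][v] and dist[v] is None:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

n = len(A)
# Wiener index: sum of topological distances over unordered atom pairs.
W = sum(shortest_paths(A, i)[j] for i in range(n) for j in range(i + 1, n))
degrees = [sum(row) for row in A]  # degree centrality C_D(v_i) per atom

print(W, degrees)  # butane: W = 10, degrees [1, 2, 2, 1]
```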
<h3 id="model-architectures">Model Architectures</h3>
<p>Models are classified into two primary categories:</p>
<p><strong>Unimodal-based models:</strong></p>
<ul>
<li><strong>Sequence-based</strong>: Transformer models operating on SMILES/SELFIES (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, MolGEN, <a href="/notes/chemistry/llm-applications/llamsmol-instruction-tuning-chemistry/">LlaSMol</a>). These capture syntactic patterns but miss spatial and topological features.</li>
<li><strong>Topological graph-based</strong>: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.</li>
<li><strong>3D geometry-based</strong>: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.</li>
<li><strong>Image-based</strong>: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.</li>
</ul>
<p><strong>Multimodal-based models:</strong></p>
<ul>
<li><strong>Sequence + Graph</strong>: <a href="/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/">DVMP</a>, PanGu Drug Model. Combines the strengths of string and topological representations.</li>
<li><strong>Graph + 3D Geometry</strong>: GraphMVP, Transformer-M. Enriches topological features with spatial information.</li>
<li><strong>Text + Molecular Structure</strong>: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.</li>
</ul>
<h2 id="four-pretraining-paradigms-for-mrl">Four Pretraining Paradigms for MRL</h2>
<p>The review systematically categorizes pretraining strategies into four paradigms:</p>
<h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3>
<p>The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.</p>
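<p>The corruption step of MLM can be sketched in a few lines. The 15% masking rate and <code>[MASK]</code> token below follow BERT conventions and are illustrative rather than any specific model's settings; for simplicity the sequence is tokenized at the character level:</p>

```python
import random

random.seed(7)

# Minimal sketch of MLM-style corruption for a tokenized SMILES sequence:
# mask ~15% of tokens and record the original tokens as prediction targets,
# as a BERT-style molecular encoder would see them during pretraining.
def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok          # the model must predict this token
            corrupted.append(mask_token)
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = list("CC(=O)Oc1ccccc1")      # character-level tokens for simplicity
corrupted, targets = mask_tokens(tokens)
print(corrupted, targets)
```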
<h3 id="contrastive-learning-cl">Contrastive Learning (CL)</h3>
<p>The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.</p>
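<p>The alignment objective behind this strategy is typically an InfoNCE-style contrastive loss: each molecule's embedding in one view should score highest against its own embedding in the other view, among all candidates in the batch. A pure-Python sketch with made-up two-dimensional embeddings standing in for the 2D- and 3D-view encoder outputs:</p>

```python
import math

# InfoNCE-style contrastive objective: the i-th anchor should match the
# i-th positive among all positives in the batch. Embeddings here are
# tiny made-up vectors; real models use encoder outputs.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchor i to positive i."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += log_z - logits[i]
    return loss / len(anchors)

view_2d = [[1.0, 0.0], [0.0, 1.0]]       # e.g. graph-encoder embeddings
view_3d_good = [[1.0, 0.0], [0.0, 1.0]]  # correctly paired 3D embeddings
view_3d_bad = [[0.0, 1.0], [1.0, 0.0]]   # mismatched pairing

print(info_nce(view_2d, view_3d_good) < info_nce(view_2d, view_3d_bad))  # True
```

The loss is near zero when the two views are correctly aligned and large when pairings are shuffled, which is exactly the signal that drives cross-view alignment in models like GraphMVP.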
<h3 id="reconstruction-based-pretraining-rbp">Reconstruction-Based Pretraining (RBP)</h3>
<p>Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.</p>
<h3 id="multimodal-alignment-pretraining-map">Multimodal Alignment Pretraining (MAP)</h3>
<p>Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.</p>
<h2 id="downstream-applications-and-performance-benchmarks">Downstream Applications and Performance Benchmarks</h2>
<p>The review evaluates MRL foundation models across five application domains.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The most common benchmark for MRL models. The review provides comprehensive ROC-AUC comparisons across eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification datasets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>BBBP</th>
          <th>BACE</th>
          <th>ClinTox</th>
          <th>Tox21</th>
          <th>SIDER</th>
          <th>HIV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MGMAE</td>
          <td>Graph</td>
          <td>94.2</td>
          <td>92.7</td>
          <td>96.7</td>
          <td>86.0</td>
          <td>66.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MPG</td>
          <td>Graph</td>
          <td>92.2</td>
          <td>92.0</td>
          <td>96.3</td>
          <td>83.7</td>
          <td>66.1</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GROVER</td>
          <td>Graph+Trans.</td>
          <td>94.0</td>
          <td>89.4</td>
          <td>94.4</td>
          <td>83.1</td>
          <td>65.8</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MoLFormer</td>
          <td>Sequence</td>
          <td>93.7</td>
          <td>88.2</td>
          <td>94.8</td>
          <td>84.7</td>
          <td>69.0</td>
          <td>82.2</td>
      </tr>
      <tr>
          <td>MM-Deacon</td>
          <td>Seq.+IUPAC</td>
          <td>78.5</td>
          <td>-</td>
          <td>99.5</td>
          <td>-</td>
          <td>69.3</td>
          <td>80.1</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>3D</td>
          <td>72.9</td>
          <td>85.7</td>
          <td>91.9</td>
          <td>79.6</td>
          <td>65.9</td>
          <td>80.8</td>
      </tr>
      <tr>
          <td>DVMP</td>
          <td>Seq.+Graph</td>
          <td>77.8</td>
          <td>89.4</td>
          <td>95.6</td>
          <td>79.1</td>
          <td>69.8</td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>TxD-T-LLM</td>
          <td>Seq.+Text</td>
          <td>-</td>
          <td>-</td>
          <td>86.3</td>
          <td>88.2</td>
          <td>-</td>
          <td>73.2</td>
      </tr>
  </tbody>
</table>
<p>The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on synthetic molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on <a href="/notes/chemistry/datasets/qm9/">QM9</a>.</p>
<h3 id="drug-drug-interaction-prediction"><a href="https://en.wikipedia.org/wiki/Drug_interaction">Drug-Drug Interaction</a> Prediction</h3>
<p>MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.</p>
<h3 id="retrosynthesis-prediction"><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a> Prediction</h3>
<p>DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).</p>
<h3 id="drug-synergy-prediction">Drug Synergy Prediction</h3>
<p>SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through contextual learning.</p>
<h2 id="guidelines-limitations-and-future-directions">Guidelines, Limitations, and Future Directions</h2>
<h3 id="model-selection-guidelines">Model Selection Guidelines</h3>
<p>The authors provide structured guidelines for choosing MRL foundation models based on:</p>
<ol>
<li><strong>Task objective</strong>: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.</li>
<li><strong>Data characteristics</strong>: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.</li>
<li><strong>Interpretability needs</strong>: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.</li>
<li><strong>Computational budget</strong>: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.</li>
</ol>
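<p>The last guideline can be made concrete with a back-of-envelope operation count. The numbers below are arbitrary per-layer unit costs implied by the stated asymptotics, not benchmarks:</p>

```python
# Illustrative comparison of the asymptotic per-layer costs quoted above:
# GIN-style message passing scales with graph size, O(|V| + |E|), while
# Transformer self-attention scales as O(n^2 * d) in sequence length.
def gin_ops(num_atoms: int, num_bonds: int) -> int:
    return num_atoms + num_bonds

def attention_ops(num_tokens: int, dim: int) -> int:
    return num_tokens ** 2 * dim

for n in (20, 100, 500):  # roughly drug-like molecule sizes and beyond
    print(n, gin_ops(n, n), attention_ops(n, 64))
```

Even for modest molecules the quadratic attention term dominates, which is why sequence length (and sparse-attention variants) matters more for Transformer-based MRL models than for GNNs.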
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The review identifies five key challenges:</p>
<ol>
<li><strong>Multimodal data integration</strong>: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> trajectories as a dynamic modality and using cross-modal data augmentation.</li>
<li><strong>Data scarcity</strong>: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.</li>
<li><strong>Interpretability</strong>: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.</li>
<li><strong>Training efficiency</strong>: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.</li>
<li><strong>Robustness and generalization</strong>: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.</p>
<h3 id="data">Data</h3>
<p>The review catalogs 28 representative molecular datasets used by the surveyed foundation models:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Descriptor</th>
          <th>Primary Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>~118M</td>
          <td>SMILES, 3D, Image, IUPAC</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ZINC15</td>
          <td>~980M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>~2.4M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>133,884</td>
          <td>SMILES</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450,000</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>950,000</td>
          <td>SMILES</td>
          <td>Reaction prediction</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>4M</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Z-dot-max/MRL_Foundation_Review">Review Materials (GitHub)</a></td>
          <td>Code/Data</td>
          <td>Not specified</td>
          <td>Code and data tables for figures</td>
      </tr>
      <tr>
          <td><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12784970/">Paper (PMC)</a></td>
          <td>Paper</td>
          <td>CC-BY</td>
          <td>Open access via PubMed Central</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model&rsquo;s original setup. The review covers:</p>
<ul>
<li>ROC-AUC for classification tasks (property prediction, DDI, synergy)</li>
<li>RMSE/MAE for regression tasks</li>
<li>Validity and novelty for molecular generation</li>
<li>Top-k accuracy for retrosynthesis</li>
<li>COV and MAT for conformation generation</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., &amp; Liu, Y. (2025). A systematic review of molecular representation learning foundation models. <em>Briefings in Bioinformatics</em>, 27(1), bbaf703. <a href="https://doi.org/10.1093/bib/bbaf703">https://doi.org/10.1093/bib/bbaf703</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{song2025systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of molecular representation learning foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbaf703}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbaf703}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ROGI-XD: Roughness of Pretrained Molecular Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/rogi-xd-roughness-pretrained-representations/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/rogi-xd-roughness-pretrained-representations/</guid><description>ROGI-XD enables cross-representation roughness comparison, showing pretrained chemical models produce no smoother QSPR surfaces than fingerprints.</description><content:encoded><![CDATA[<h2 id="evaluating-chemical-foundation-models-through-surface-roughness">Evaluating Chemical Foundation Models Through Surface Roughness</h2>
<p>This is a <strong>Systematization</strong> paper that introduces a metric reformulation (ROGI-XD) and uses it to evaluate whether pretrained chemical models (PCMs) learn representations that produce smoother <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-property relationship</a> (QSPR) surfaces than simple baselines. The key finding is negative: pretrained representations are no smoother than molecular fingerprints or descriptors, offering a principled explanation for their inconsistent performance on property prediction benchmarks.</p>
<h2 id="the-smoothness-gap-in-chemical-foundation-models">The Smoothness Gap in Chemical Foundation Models</h2>
<p>Chemical foundation models like ChemBERTa, ChemGPT, and graph-based pretrained networks promise to learn meaningful molecular representations from large unlabeled datasets via self-supervised learning. However, empirical benchmarks consistently show mixed results: these learned representations sometimes match and sometimes underperform simple baselines like Morgan fingerprints or RDKit descriptors.</p>
<p>Prior work by Deng et al. demonstrated that a random forest trained on 2048-bit Morgan fingerprints was competitive with, or superior to, pretrained models like <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a> and GROVER on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and opioid bioactivity tasks. The authors sought to explain this pattern through the lens of QSPR surface roughness: if pretrained representations do not produce smoother mappings from molecular structure to property, they cannot consistently outperform baselines.</p>
<h2 id="rogi-xd-a-dimensionality-independent-roughness-metric">ROGI-XD: A Dimensionality-Independent Roughness Metric</h2>
<p>The original ROuGhness Index (ROGI) captures global surface roughness by measuring the loss in property dispersion as a dataset is progressively coarse-grained through <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clustering</a>. However, ROGI values are not comparable across representations of different dimensionalities because distances between randomly sampled points increase with dimension, artificially deflating ROGI for high-dimensional representations.</p>
<p>ROGI-XD addresses this by changing the integration variable. Instead of integrating over normalized distance threshold $t$, ROGI-XD integrates over $1 - \log N_{\text{clusters}} / \log N$, where $N_{\text{clusters}}$ is the number of clusters at a given dendrogram step and $N$ is the dataset size. This variable captures the degree of coarse-graining independent of representation dimensionality, producing comparable roughness values across representations ranging from 14 dimensions (descriptors) to 2048 dimensions (ChemGPT).</p>
<p>The procedure follows five steps: (1) cluster molecules using <a href="https://en.wikipedia.org/wiki/Complete-linkage_clustering">complete linkage</a> at distance threshold $t$, (2) coarse-grain by replacing each property label $y_i$ with its cluster mean $\bar{y}_j$, (3) compute the standard deviation $\sigma_t$ of the coarse-grained dataset, (4) repeat for all dendrogram steps, and (5) compute the area under the curve of $2(\sigma_0 - \sigma_t)$ versus the new integration variable.</p>
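<p>The five steps can be sketched with SciPy&rsquo;s hierarchical clustering (an illustrative reimplementation, not the authors&rsquo; released code; labels are normalized to $[0,1]$ so values across representations land on a common scale):</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def rogi_xd(X, y):
    """Illustrative ROGI-XD: dispersion lost under coarse-graining,
    integrated over 1 - log(n_clusters) / log(N)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    y = (y - y.min()) / (y.max() - y.min())   # normalize labels to [0, 1]
    sigma0 = y.std()

    Z = linkage(pdist(X), method="complete")  # step 1: complete linkage
    xs, fs = [0.0], [0.0]                     # N clusters: no dispersion lost yet
    for h in np.unique(Z[:, 2]):              # step 4: every dendrogram step
        labels = fcluster(Z, t=h, criterion="distance")
        y_cg = y.copy()
        for c in np.unique(labels):           # step 2: cluster-mean labels
            y_cg[labels == c] = y[labels == c].mean()
        xs.append(1.0 - np.log(labels.max()) / np.log(n))
        fs.append(2.0 * (sigma0 - y_cg.std()))  # step 3: dispersion loss
    xs, fs = np.array(xs), np.array(fs)       # step 5: area under the curve
    order = np.argsort(xs)
    xs, fs = xs[order], fs[order]
    return float(np.sum(0.5 * (fs[1:] + fs[:-1]) * np.diff(xs)))
```

<p>Because the integration variable lives in $[0, 1]$ regardless of the dimensionality of <code>X</code>, a rough (e.g., randomly labeled) surface yields a larger value than a smooth one under the same representation.</p>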
<h2 id="representations-and-tasks-evaluated">Representations and Tasks Evaluated</h2>
<p>The study compares seven molecular representations:</p>
<table>
  <thead>
      <tr>
          <th>Representation</th>
          <th>Type</th>
          <th>Dimensionality</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>Fixed</td>
          <td>14</td>
          <td>RDKit (14 properties)</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>Fixed</td>
          <td>512</td>
          <td>Radius 2, 512-bit</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>Pretrained</td>
          <td>128</td>
          <td>Character-based <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> VAE, <a href="/notes/chemistry/datasets/zinc-22/">ZINC 250k</a></td>
      </tr>
      <tr>
          <td>GIN</td>
          <td>Pretrained</td>
          <td>300</td>
          <td>Node attribute masking, ZINC 250k</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>Pretrained</td>
          <td>384</td>
          <td>77M molecules, masked LM</td>
      </tr>
      <tr>
          <td>ChemGPT</td>
          <td>Pretrained</td>
          <td>2048</td>
          <td>PubChem 10M, causal LM</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>Baseline</td>
          <td>128</td>
          <td>Uniform $[0,1]^{128}$</td>
      </tr>
  </tbody>
</table>
<p>These are evaluated on 17 regression tasks drawn from two sources: ADMET datasets from the Therapeutics Data Commons (TDC) and toy datasets generated using <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> oracle functions. Five ML models are used for cross-validation: KNN, MLP, <a href="https://en.wikipedia.org/wiki/Partial_least_squares_regression">PLS</a>, random forest, and SVR.</p>
<h2 id="pretrained-representations-are-not-smoother">Pretrained Representations Are Not Smoother</h2>
<p>ROGI-XD correlates strongly with cross-validated RMSE across representations (median Pearson $r = 0.72$-$0.88$ depending on model), compared to the original ROGI which produces weak cross-representation correlations (median $r \in [-0.32, 0.28]$). When correlating over both representations and tasks simultaneously, ROGI-XD achieves $r = 0.91$-$0.99$ versus $r = 0.68$-$0.84$ for the original ROGI.</p>
<p>Using this validated metric, the authors find that pretrained representations do not produce smoother QSPR surfaces than fingerprints or descriptors. In more than 50% of tasks, both descriptors and fingerprints generate smoother surfaces. The median relative ROGI-XD increase for pretrained representations is 9.1-21.3% compared to descriptors and 2.3-10.1% compared to fingerprints, indicating rougher surfaces.</p>
<p>As a practical tool, ROGI-XD can guide representation selection without exhaustive benchmarking. Selecting the representation with the lowest ROGI-XD for each task and then optimizing over model architecture results in only a 6.8% average relative increase in best-case model error across the 17 tasks. In 8 of 17 tasks, the lowest ROGI-XD correctly identifies the optimal representation.</p>
<p>Fine-tuning can improve smoothness. On the Lipophilicity task ($N_{\text{tot}} = 4200$), fine-tuning the VAE with a contrastive loss reduces ROGI-XD from 0.254 to 0.107 ($\pm 0.02$), well below the descriptor baseline of 0.227. On the smaller CACO2 task ($N_{\text{tot}} = 910$), fine-tuning yields ROGI-XD of 0.143 ($\pm 0.05$), comparable to descriptors at 0.132. The impact of fine-tuning is sensitive to both the task and the amount of labeled data.</p>
<h2 id="implications-for-chemical-foundation-model-development">Implications for Chemical Foundation Model Development</h2>
<p>The lack of smoothness in pretrained QSPR surfaces explains the inconsistent empirical performance of chemical foundation models. The authors note that ROGI-XD is thematically similar to a contrastive loss, as both scale proportionally with the frequency and severity of activity cliffs. This connection suggests that imposing stronger smoothness assumptions during pretraining, for example through weak supervision on calculable molecular properties, could help produce representations that generalize better to downstream property prediction. ROGI-XD provides a practical tool for evaluating new pretraining strategies without exhaustive benchmark testing: a representation with lower ROGI-XD on a given task is likely to yield lower model error.</p>
<p>A limitation is that the study treats pretrained representations as static (frozen features). Fine-tuning introduces many additional design choices and can substantially improve representation quality, but this evaluation is left for future work. Additionally, the survey of pretrained models is not exhaustive and focuses on four representative architectures.</p>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/coleygroup/rogi-xd">coleygroup/rogi-xd</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained models and notebooks; results reproducible via <code>make all</code></td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (VAE, GIN)</td>
          <td>ZINC 250k</td>
          <td>250,000</td>
          <td>80/20 train/val split</td>
      </tr>
      <tr>
          <td>Pretraining (ChemBERTa)</td>
          <td>PubChem</td>
          <td>77M</td>
          <td>Masked language modeling</td>
      </tr>
      <tr>
          <td>Pretraining (ChemGPT)</td>
          <td>PubChem 10M</td>
          <td>10M</td>
          <td>Causal language modeling</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TDC ADMET</td>
          <td>~900-10,000 per task</td>
          <td>12 regression tasks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>GuacaMol oracles</td>
          <td>10,000 per task</td>
          <td>5 synthetic tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>ROGI-XD</strong>: Hierarchical clustering (complete linkage) with integration over $1 - \log N_{\text{clusters}} / \log N$</li>
<li><strong>Cross-validation</strong>: 5-fold CV with KNN, MLP, PLS, RF (n_estimators=50), SVR from scikit-learn</li>
<li><strong>Fine-tuning loss</strong>: $\mathscr{L} = \mathscr{L}_{\text{CE}} + \beta \cdot \mathscr{L}_{\text{KL}} + \gamma \cdot \mathscr{L}_{\text{cont}}$ with $\beta = 0.1$, $\gamma = 50$; the contrastive term uses cosine distance in latent space and absolute difference in target space</li>
</ul>
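<p>One hypothetical reading of the contrastive term, sketched for illustration only (the exact pairing and weighting used by the authors may differ): penalize disagreement between cosine distance in latent space and absolute difference in target space across molecule pairs.</p>

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two latent vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def contrastive_term(Z, y):
    """Hypothetical L_cont: squared mismatch between latent cosine distance
    and absolute target difference, averaged over all molecule pairs."""
    n = len(y)
    total, num_pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (cosine_distance(Z[i], Z[j]) - abs(y[i] - y[j])) ** 2
            num_pairs += 1
    return total / num_pairs
```

<p>A loss of this shape pushes molecules with similar properties together in latent space, which is exactly the smoothness that ROGI-XD measures.</p>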
<h3 id="hardware">Hardware</h3>
<p>Two AMD Ryzen Threadripper PRO 3995WX CPUs, four NVIDIA A5000 GPUs, 512 GB RAM, Ubuntu 20.04 LTS.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Graff, D. E., Pyzer-Knapp, E. O., Jordan, K. E., Shakhnovich, E. I., &amp; Coley, C. W. (2023). Evaluating the roughness of structure-property relationships using pretrained molecular representations. <em>Digital Discovery</em>, 2(5), 1452-1460. <a href="https://doi.org/10.1039/d3dd00088e">https://doi.org/10.1039/d3dd00088e</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/coleygroup/rogi-xd">ROGI-XD Code Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{graff2023roughness,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evaluating the roughness of structure--property relationships using pretrained molecular representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Graff, David E. and Pyzer-Knapp, Edward O. and Jordan, Kirk E. and Shakhnovich, Eugene I. and Coley, Connor W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1452--1460}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d3dd00088e}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGenSurvey: Systematic Survey of ML for Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgensurvey-molecule-design/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgensurvey-molecule-design/</guid><description>Survey of ML molecule design methods across 1D string, 2D graph, and 3D geometry representations with deep generative and optimization approaches.</description><content:encoded><![CDATA[<h2 id="a-taxonomy-for-ml-driven-molecule-design">A Taxonomy for ML-Driven Molecule Design</h2>
<p>This is a <strong>Systematization</strong> paper that reviews machine learning approaches for molecule design across all three major molecular representations (1D string, 2D graph, 3D geometry) and both deep generative and combinatorial optimization paradigms. Prior surveys (including <a href="/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/">Sánchez-Lengeling &amp; Aspuru-Guzik, 2018</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/">Elton et al., 2019</a>, Xue et al. 2019, Vanhaelen et al. 2020, Alshehri et al. 2020, Jiménez-Luna et al. 2020, and Axelrod et al. 2022) each covered subsets of the literature (e.g., only generative methods, or only specific task types). MolGenSurvey extends these by unifying the field into a single taxonomy based on input type, output type, and generation goal, identifying eight distinct molecule generation tasks. It catalogs over 100 methods across these categories and provides a structured comparison of evaluation metrics, datasets, and experimental setups.</p>
<p>The chemical space of drug-like molecules is estimated at $10^{23}$ to $10^{60}$, making exhaustive enumeration computationally infeasible. Traditional high-throughput screening searches existing databases but is slow and expensive. ML-based generative approaches offer a way to intelligently explore this space, either by learning continuous latent representations (deep generative models) or by directly searching the discrete chemical space (combinatorial optimization methods).</p>
<h2 id="molecular-representations">Molecular Representations</h2>
<p>The survey identifies three mainstream featurization approaches for molecules, each carrying different tradeoffs for generation tasks.</p>
<h3 id="1d-string-descriptions">1D String Descriptions</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> are the two dominant string representations. SMILES encodes molecules as character strings following grammar rules for bonds, branches, and ring closures. Its main limitation is that arbitrary strings are often chemically invalid. SELFIES redefines the derivation rules for branches and rings so that every string decodes to a valid molecule, achieving 100% validity by construction.</p>
<p>Other string representations exist (InChI, SMARTS) but are less commonly used for generation. Representation learning over strings has adopted CNNs, RNNs, and Transformers from NLP.</p>
<h3 id="2d-molecular-graphs">2D Molecular Graphs</h3>
<p>Molecules naturally map to graphs where atoms are nodes and bonds are edges. Graph neural networks (GNNs), particularly those following the message-passing neural network (MPNN) framework, have become the standard representation method. An MPNN updates each node&rsquo;s representation by aggregating messages from its immediate neighbors at every layer, so that $K$ stacked layers cover a $K$-hop neighborhood. Notable architectures include D-MPNN (directional message passing), PNA (diverse aggregation methods), AttentiveFP (attention-based), and Graphormer (transformer-based).</p>
<h3 id="3d-molecular-geometry">3D Molecular Geometry</h3>
<p>Molecules are inherently 3D objects with conformations (3D structures at local energy minima) that determine function. Representing 3D geometry requires models that respect E(3) or SE(3) symmetry: outputs must either transform consistently with (equivariance) or remain unchanged under (invariance) rotations and translations of the input coordinates. The survey catalogs architectures along this line, including SchNet, DimeNet, EGNN, SphereNet, and PaiNN.</p>
<p>Additional featurization methods (molecular fingerprints/descriptors, 3D density maps, 3D surface meshes, and chemical images) are noted but have seen limited use in generation tasks.</p>
<h2 id="deep-generative-models">Deep Generative Models</h2>
<p>The survey covers six families of deep generative models applied to molecule design.</p>
<h3 id="autoregressive-models-ars">Autoregressive Models (ARs)</h3>
<p>ARs factorize the joint distribution of a molecule as a product of conditional distributions over its subcomponents:</p>
<p>$$p(\boldsymbol{x}) = \prod_{i=1}^{d} p(\bar{x}_i \mid \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{i-1})$$</p>
<p>For molecular graphs, this means sequentially predicting the next atom or bond conditioned on the partial structure built so far. RNNs, Transformers, and BERT-style models all implement this paradigm.</p>
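<p>A toy bigram model over SMILES characters (not any surveyed architecture) makes the factorization concrete: the probability of a string is the product of next-token conditionals, here estimated by counting with add-one smoothing.</p>

```python
import math
from collections import Counter, defaultdict

# Toy corpus; "^" and "$" mark sequence start and end.
corpus = ["CCO", "CCC", "CCN", "CC(=O)O"]
vocab = sorted(set("".join(corpus)) | {"$"})

# Fit p(x_i | x_{i-1}) by counting bigrams, with add-one smoothing.
counts = defaultdict(Counter)
for s in corpus:
    chars = ["^"] + list(s) + ["$"]
    for prev, cur in zip(chars, chars[1:]):
        counts[prev][cur] += 1


def log_prob(s):
    """log p(x) = sum_i log p(x_i | x_{<i}): the autoregressive factorization
    (each conditional here only looks one token back)."""
    total = 0.0
    chars = ["^"] + list(s) + ["$"]
    for prev, cur in zip(chars, chars[1:]):
        total += math.log((counts[prev][cur] + 1) / (sum(counts[prev].values()) + len(vocab)))
    return total
```

<p>Neural ARs replace the count table with an RNN or Transformer that conditions on the full prefix, but the chain-rule decomposition is identical.</p>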
<h3 id="variational-autoencoders-vaes">Variational Autoencoders (VAEs)</h3>
<p>VAEs learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$\log p(\boldsymbol{x}) \geq \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{KL}(q(\boldsymbol{z}|\boldsymbol{x}) \,\|\, p(\boldsymbol{z}))$$</p>
<p>The first term is the reconstruction objective, and the second is a KL-divergence regularizer encouraging diverse, disentangled latent codes. Key molecular VAEs include <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">ChemVAE</a> (SMILES-based), JT-VAE (junction tree graphs), and <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GrammarVAE</a> (grammar-constrained SMILES).</p>
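<p>A minimal numeric sketch of the two ELBO terms, assuming a diagonal-Gaussian encoder and a standard-normal prior (the reconstruction term is passed in as a scalar stand-in for $\log p(\boldsymbol{x}|\boldsymbol{z})$):</p>

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps so gradients can flow through the sampler."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)


def elbo(log_px_given_z, mu, log_var):
    """ELBO = E_q[log p(x|z)] - KL(q(z|x) || N(0, I)), using the closed-form
    KL for a diagonal Gaussian: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return log_px_given_z - kl
```

<p>Because the KL term is non-negative, the ELBO never exceeds the reconstruction term; it equals it only when the posterior matches the prior exactly.</p>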
<h3 id="normalizing-flows-nfs">Normalizing Flows (NFs)</h3>
<p>NFs model $p(\boldsymbol{x})$ via an invertible, deterministic mapping between data and latent space, using the change-of-variable formula with Jacobian determinants. Molecular applications include GraphNVP, MoFlow (one-shot graph generation), GraphAF (autoregressive flow), and GraphDF (discrete flow).</p>
<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p>GANs use a generator-discriminator game where the generator produces molecules and the discriminator distinguishes real from generated samples. Molecular GANs include MolGAN (graph-based with RL reward), <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> (SMILES-based with RL), and Mol-CycleGAN (molecule-to-molecule translation).</p>
<h3 id="diffusion-models">Diffusion Models</h3>
<p>Diffusion models learn to reverse a gradual noising process. The forward process adds Gaussian noise over $T$ steps; a neural network learns to denoise at each step. The training objective reduces to predicting the noise added at each step:</p>
<p>$$\mathcal{L}_t = \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\boldsymbol{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\right)\right\|^2\right]$$</p>
<p>Diffusion has been particularly successful for 3D conformation generation (ConfGF, GeoDiff, DGSM).</p>
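<p>The objective can be exercised numerically: noise $\boldsymbol{x}_0$ to step $t$ in closed form, then score a noise predictor with a squared error. The <code>eps_model</code> below is a hypothetical stand-in, not any surveyed architecture:</p>

```python
import numpy as np


def q_sample(x0, t, alphas_bar, eps):
    """Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps


def diffusion_loss(eps_model, x0, t, alphas_bar, rng):
    """L_t: mean squared error between the true and predicted noise."""
    eps = rng.normal(size=x0.shape)
    x_t = q_sample(x0, t, alphas_bar, eps)
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```

<p>A useful sanity check when implementing the objective: a predictor that algebraically inverts the forward process drives the loss to zero.</p>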
<h3 id="energy-based-models-ebms">Energy-Based Models (EBMs)</h3>
<p>EBMs define $p(\boldsymbol{x}) = \frac{\exp(-E_\theta(\boldsymbol{x}))}{A}$ where $E_\theta$ is a learned energy function. The challenge is computing the intractable partition function $A$, addressed via contrastive divergence, noise-contrastive estimation, or score matching.</p>
<h2 id="combinatorial-optimization-methods">Combinatorial Optimization Methods</h2>
<p>Unlike DGMs that learn from data distributions, combinatorial optimization methods (COMs) search directly over discrete chemical space using oracle calls to evaluate candidate molecules.</p>
<h3 id="reinforcement-learning-rl">Reinforcement Learning (RL)</h3>
<p>RL formulates molecule generation as a Markov Decision Process: states are partial molecules, actions are adding/removing atoms or bonds, and rewards come from property oracles. Methods include GCPN (graph convolutional policy network), MolDQN (deep Q-network), RationaleRL (property-aware substructure assembly), and REINVENT (SMILES-based policy gradient).</p>
<h3 id="genetic-algorithms-ga">Genetic Algorithms (GA)</h3>
<p>GAs maintain a population of molecules and evolve them through mutation and crossover operations. GB-GA operates on molecular graphs, GA+D uses SELFIES with adversarial discriminator enhancement, and JANUS uses SELFIES with parallel exploration strategies.</p>
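<p>These methods share a common skeleton, sketched here generically (a toy illustration; real GB-GA operators apply chemistry-aware mutations and crossovers to molecular graphs, and <code>mutate</code>/<code>crossover</code> below are placeholders supplied by the caller):</p>

```python
import random


def genetic_search(pop, fitness, mutate, crossover, generations=20, seed=0):
    """Generic GA loop: keep the fitter half as parents (elitism), refill
    the population with mutated crossovers, return the best individual."""
    rng = random.Random(seed)
    for _ in range(generations):
        pop = sorted(pop, key=fitness, reverse=True)
        parents = pop[: max(2, len(pop) // 2)]
        children = []
        while len(parents) + len(children) < len(pop):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pop = parents + children
    return max(pop, key=fitness)
```

<p>Elitism guarantees the best fitness seen so far never decreases, while mutation and crossover keep the search exploring new structures.</p>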
<h3 id="bayesian-optimization-bo">Bayesian Optimization (BO)</h3>
<p>BO builds a Gaussian process surrogate of the objective function and uses an acquisition function to decide which molecules to evaluate next. It is often combined with VAE latent spaces (Constrained-BO-VAE, MSO) to enable continuous optimization.</p>
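<p>A minimal sketch of the surrogate-plus-acquisition loop, with a hand-rolled RBF-kernel GP posterior and an upper-confidence-bound (UCB) acquisition standing in for the more elaborate kernels and acquisitions used by the methods above:</p>

```python
import numpy as np


def rbf(A, B, length_scale=0.5):
    """Squared-exponential kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(X, y, X_query, noise=1e-6):
    """GP surrogate: posterior mean and variance at the query points."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X, X_query)
    mu = K_s.T @ np.linalg.solve(K, y)
    var = rbf(X_query, X_query).diagonal() - (K_s * np.linalg.solve(K, K_s)).sum(axis=0)
    return mu, var


def ucb_pick(X, y, candidates, kappa=2.0):
    """Acquisition: choose the candidate maximizing mean + kappa * std."""
    mu, var = gp_posterior(X, y, candidates)
    return candidates[np.argmax(mu + kappa * np.sqrt(np.maximum(var, 0.0)))]
```

<p>In the VAE-latent-space variants, <code>X</code> holds latent codes of already-evaluated molecules and the chosen candidate is decoded back to a molecule for the next oracle call.</p>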
<h3 id="monte-carlo-tree-search-mcts">Monte Carlo Tree Search (MCTS)</h3>
<p>MCTS explores the molecular construction tree by branching and evaluating promising intermediates. ChemTS and MP-MCTS combine MCTS with autoregressive SMILES generators.</p>
<h3 id="mcmc-sampling">MCMC Sampling</h3>
<p>MCMC methods (MIMOSA, MARS) formulate molecule optimization as sampling from a target distribution defined by multiple property objectives, using graph neural networks as proposal distributions.</p>
<h3 id="other-approaches">Other Approaches</h3>
<p>The survey also identifies two additional paradigms that do not fit neatly into either DGM or COM categories. <strong>Optimal Transport (OT)</strong> is used when matching between groups of molecules, particularly for conformation generation where each molecule has multiple associated 3D structures (e.g., GeoMol, EquiBind). <strong>Differentiable Learning</strong> formulates discrete molecules as differentiable objects, enabling gradient-based continuous optimization directly on molecular graphs (e.g., DST).</p>
<h2 id="task-taxonomy-eight-molecule-generation-tasks">Task Taxonomy: Eight Molecule Generation Tasks</h2>
<p>The survey&rsquo;s central organizational contribution is a unified taxonomy of eight distinct molecule design tasks, defined by three axes: (1) whether generation is <em>de novo</em> (from scratch, no reference molecule) or conditioned on an input molecule, (2) whether the goal is <em>generation</em> (distribution learning, producing valid and diverse molecules) or <em>optimization</em> (goal-directed search for molecules with specific properties), and (3) the input/output data representation (1D string, 2D graph, 3D geometry). The paper&rsquo;s Table 2 maps all combinations of these axes, showing that many are not meaningful (e.g., 1D string input to 2D graph output with no goal). Only eight combinations correspond to active research areas.</p>
<h3 id="1d2d-tasks">1D/2D Tasks</h3>
<ul>
<li><strong>De novo 1D/2D molecule generation</strong>: Generate new molecules from scratch to match a training distribution. Methods span VAEs (ChemVAE, JT-VAE), flows (GraphNVP, MoFlow, GraphAF), GANs (MolGAN, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>), ARs (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/">MolecularRNN</a>), and EBMs (GraphEBM).</li>
<li><strong>De novo 1D/2D molecule optimization</strong>: Generate molecules with optimal properties from scratch, using oracle feedback. Methods include RL (GCPN, MolDQN), GA (GB-GA, JANUS), MCTS (ChemTS), and MCMC (MIMOSA, MARS).</li>
<li><strong>1D/2D molecule optimization</strong>: Optimize properties of a given input molecule via local search. Methods include graph-to-graph translation (VJTNN, CORE, MOLER), VAE+BO (MSO, Constrained-BO-VAE), GANs (Mol-CycleGAN, <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>), and differentiable approaches (DST).</li>
</ul>
<h3 id="3d-tasks">3D Tasks</h3>
<ul>
<li><strong>De novo 3D molecule generation</strong>: Generate novel 3D molecular structures from scratch, respecting geometric validity. Methods include ARs (G-SchNet, G-SphereNet), VAEs (3DMolNet), flows (E-NFs), and RL (MolGym).</li>
<li><strong>De novo 3D conformation generation</strong>: Generate 3D conformations from given 2D molecular graphs. Methods include VAEs (CVGAE, ConfVAE), diffusion models (ConfGF, GeoDiff, DGSM), and optimal transport (GeoMol).</li>
<li><strong>De novo binding-based 3D molecule generation</strong>: Design 3D molecules for specific protein binding pockets. Methods include density-based VAEs (liGAN), RL (DeepLigBuilder), and ARs (3DSBDD).</li>
<li><strong>De novo binding-pose conformation generation</strong>: Find the appropriate 3D conformation of a given molecule for a given protein pocket. Methods include EBMs (DeepDock) and optimal transport (EquiBind).</li>
<li><strong>3D molecule optimization</strong>: Optimize 3D molecular properties (scaffold replacement, conformation refinement). Methods include BO (BOA), ARs (3D-Scaffold, cG-SchNet), and VAEs (Coarse-GrainingVAE).</li>
</ul>
<h2 id="evaluation-metrics">Evaluation Metrics</h2>
<p>The survey organizes evaluation metrics into four categories.</p>
<h3 id="generation-evaluation">Generation Evaluation</h3>
<p>Basic metrics assess the quality of generated molecules:</p>
<ul>
<li><strong>Validity</strong>: fraction of chemically valid molecules among all generated molecules</li>
<li><strong>Novelty</strong>: fraction of generated molecules absent from the training set</li>
<li><strong>Uniqueness</strong>: fraction of distinct molecules among generated samples</li>
<li><strong>Quality</strong>: fraction passing a predefined chemical rule filter</li>
<li><strong>Diversity</strong> (internal/external): measured via pairwise similarity (Tanimoto, scaffold, or fragment) within generated set and between generated and training sets</li>
</ul>
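<p>These set-level metrics are simple to compute once a validity check is available. A minimal Python sketch, where <code>is_valid</code> stands in for a chemistry-toolkit parse (e.g., RDKit) and molecules are compared as canonical strings; the helper names are illustrative, not from the survey:</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty over generated molecule strings.

    `is_valid` is a placeholder for a real chemistry check (e.g., RDKit
    parsing of SMILES). Uniqueness is computed among valid molecules and
    novelty over the unique valid set, one common convention.
    """
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(unique - set(training_set)) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty
```

<p>Diversity metrics additionally require a pairwise similarity (e.g., Tanimoto over fingerprints) averaged within the generated set or against the training set.</p>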
<h3 id="distribution-evaluation">Distribution Evaluation</h3>
<p>Metrics measuring how well generated molecules capture the training distribution: KL divergence over physicochemical descriptors, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD), and Maximum Mean Discrepancy (MMD).</p>
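<p>MMD compares two samples through a kernel. A minimal numpy sketch of the biased squared-MMD estimator with an RBF kernel over descriptor vectors; <code>gamma</code> is an illustrative bandwidth, not a value from the survey:</p>

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased squared-MMD estimate between descriptor sets X, Y (n, d arrays)."""
    def k(A, B):
        # pairwise squared distances, then RBF kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

<p>Identical samples give a value near zero; well-separated samples give a value near 2 for the RBF kernel.</p>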
<h3 id="optimization-evaluation">Optimization Evaluation</h3>
<p>Property oracles used as optimization targets: Synthetic Accessibility (SA), Quantitative Estimate of Drug-likeness (QED), LogP, kinase inhibition scores (GSK3-beta, JNK3), DRD2 activity, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark oracles, and Vina docking scores. Constrained optimization additionally considers structural similarity to reference molecules via Tanimoto, scaffold, or fragment similarity.</p>
<h3 id="3d-evaluation">3D Evaluation</h3>
<p>3D-specific metrics include stability (matching valence rules in 3D), RMSD and Kabsch-RMSD (conformation alignment), and Coverage/Matching scores for conformation ensembles.</p>
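<p>Kabsch-RMSD is RMSD after optimal rigid superposition: center both coordinate sets, take an SVD of their covariance, and correct for reflections. A minimal numpy sketch of the standard algorithm (not code from the survey):</p>

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between (N, 3) coordinate arrays after optimal superposition."""
    P = P - P.mean(axis=0)          # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                     # covariance of corresponding atoms
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])      # avoid improper rotation (reflection)
    R = Vt.T @ D @ U.T              # optimal rotation mapping P onto Q
    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))
```

<p>Plain RMSD skips the alignment step; Coverage/Matching scores for conformer ensembles are built on top of such pairwise RMSDs.</p>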
<h2 id="datasets">Datasets</h2>
<p>The survey catalogs 12 major datasets spanning 1D/2D and 3D molecule generation:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Scale</th>
          <th>Dimensionality</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC</td>
          <td>250K</td>
          <td>1D/2D</td>
          <td>Virtual screening compounds</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>2.1M</td>
          <td>1D/2D</td>
          <td>Bioactive molecules</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></td>
          <td>1.9M</td>
          <td>1D/2D</td>
          <td>Benchmarking generation</td>
      </tr>
      <tr>
          <td>CEPDB</td>
          <td>4.3M</td>
          <td>1D/2D</td>
          <td>Organic photovoltaics</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>970M</td>
          <td>1D/2D</td>
          <td>Enumerated small molecules</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>134K</td>
          <td>1D/2D/3D</td>
          <td>Quantum chemistry properties</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450K/37M</td>
          <td>1D/2D/3D</td>
          <td>Conformer ensembles</td>
      </tr>
      <tr>
          <td>ISO17</td>
          <td>200/431K</td>
          <td>1D/2D/3D</td>
          <td>Molecule-conformation pairs</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>3.9M</td>
          <td>1D/2D/3D</td>
          <td>DFT ground-state geometries</td>
      </tr>
      <tr>
          <td>CrossDock2020</td>
          <td>22.5M</td>
          <td>1D/2D/3D</td>
          <td>Docked ligand poses</td>
      </tr>
      <tr>
          <td>scPDB</td>
          <td>16K</td>
          <td>1D/2D/3D</td>
          <td>Binding sites</td>
      </tr>
      <tr>
          <td>DUD-E</td>
          <td>23K</td>
          <td>1D/2D/3D</td>
          <td>Active compounds with decoys</td>
      </tr>
  </tbody>
</table>
<h2 id="challenges-and-opportunities">Challenges and Opportunities</h2>
<h3 id="challenges">Challenges</h3>
<ol>
<li><strong>Out-of-distribution generation</strong>: Most deep generative models imitate known molecule distributions and struggle to explore truly novel chemical space.</li>
<li><strong>Unrealistic problem formulation</strong>: Many task setups do not respect real-world chemistry constraints.</li>
<li><strong>Expensive oracle calls</strong>: Methods typically assume unlimited access to property evaluators, which is unrealistic in drug discovery.</li>
<li><strong>Lack of interpretability</strong>: Few methods explain why generated molecules have desired properties. Quantitative interpretability evaluation remains an open problem.</li>
<li><strong>No unified evaluation protocols</strong>: The field lacks consensus on what defines a &ldquo;good&rdquo; drug candidate and how to fairly compare methods.</li>
<li><strong>Insufficient benchmarking</strong>: Despite the enormous chemical space ($10^{23}$ to $10^{60}$ drug-like molecules), available benchmarks use only small fractions of large databases.</li>
<li><strong>Low-data regime</strong>: Many real-world applications have limited training data, and generating molecules under data scarcity remains difficult.</li>
</ol>
<h3 id="opportunities">Opportunities</h3>
<ol>
<li><strong>Extension to complex structured data</strong>: Techniques from small molecule generation may transfer to proteins, antibodies, genes, crystal structures, and polysaccharides.</li>
<li><strong>Connection to later drug development phases</strong>: Bridging the gap between molecule design and preclinical/clinical trial outcomes could improve real-world impact.</li>
<li><strong>Knowledge discovery</strong>: Generative models over molecular latent spaces could reveal chemical rules governing molecular properties, and graph structure learning could uncover implicit non-bonded interactions.</li>
</ol>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The survey was published in March 2022, so it does not cover subsequent advances in diffusion models for molecules (e.g., EDM, DiffSBDD), large language models applied to chemistry, or flow matching approaches.</li>
<li>Coverage focuses on small molecules. Macromolecule design (proteins, nucleic acids) is noted as a future direction rather than surveyed.</li>
<li>The survey catalogs methods but does not provide head-to-head experimental comparisons across all 100+ methods. Empirical discussion relies on individual papers&rsquo; reported results.</li>
<li>1D string-based methods receive less detailed coverage than graph and geometry-based approaches, reflecting the field&rsquo;s shift toward structured representations at the time of writing.</li>
<li>As a survey, this paper produces no code, models, or datasets. The surveyed methods&rsquo; individual repositories are referenced in their original publications but are not aggregated here.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Du, Y., Fu, T., Sun, J., &amp; Liu, S. (2022). MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design. <em>arXiv preprint arXiv:2203.14500</em>.</p>
<p><strong>Publication</strong>: arXiv preprint, March 2022. <strong>Note</strong>: This survey covers literature through early 2022 and does not include subsequent advances in diffusion models, LLMs for chemistry, or flow matching.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2203.14500">arXiv: 2203.14500</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{du2022molgensurvey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Du, Yuanqi and Fu, Tianfan and Sun, Jimeng and Liu, Shengchao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2203.14500}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Exposing Limitations of Molecular ML with Activity Cliffs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/</guid><description>A benchmark of 24 ML methods on activity cliff compounds across 30 drug targets, showing descriptor-based models outperform deep learning.</description><content:encoded><![CDATA[<h2 id="a-benchmark-for-activity-cliff-prediction">A Benchmark for Activity Cliff Prediction</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>The paper systematically benchmarks 24 machine learning and deep learning approaches on their ability to predict bioactivity for activity cliff compounds: pairs of structurally similar molecules that exhibit large differences in potency. These cases violate the similarity principle (similar structure implies similar activity) and represent a practical failure mode for <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a> in drug discovery. The authors release MoleculeACE, an open-source benchmarking platform for evaluating ML models on activity cliffs.</p>
<h2 id="activity-cliffs-as-a-blind-spot-in-molecular-ml">Activity Cliffs as a Blind Spot in Molecular ML</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Chemical_similarity">similarity principle</a> underpins most molecular ML: structurally similar compounds should have similar properties. Activity cliffs are the exceptions, where small structural changes cause large potency shifts (e.g., a single substituent change causing a $10\times$ difference in $K_i$).</p>
<p>Despite their importance for <a href="https://en.wikipedia.org/wiki/Hit_to_lead">hit-to-lead optimization</a>, activity cliffs have received limited attention in ML benchmarking. Standard metrics like RMSE computed over entire test sets can mask poor predictions on cliff compounds. A model might achieve low overall error while systematically mispredicting these edge cases, which are precisely the molecules that matter most for medicinal chemistry applications.</p>
<p>The authors identify 7-52% of compounds as activity cliff molecules across their 30 target datasets, showing this is not a rare phenomenon.</p>
<h2 id="defining-and-detecting-activity-cliffs">Defining and Detecting Activity Cliffs</h2>
<p>The authors use three complementary similarity metrics to identify activity cliffs:</p>
<ol>
<li><strong>Substructure similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto coefficient</a> on extended connectivity fingerprints (ECFPs), capturing shared radial substructures</li>
<li><strong>Scaffold similarity</strong>: Tanimoto coefficient on ECFPs computed from molecular graph frameworks, detecting core/decoration differences</li>
<li><strong>SMILES similarity</strong>: <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> on canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, capturing character-level insertions, deletions, and translocations</li>
</ol>
<p>Pairs with $\geq 90\%$ similarity on <strong>any one</strong> of the three metrics and $&gt; 10\times$ difference in bioactivity ($K_i$ or $\text{EC}_{50}$) are classified as activity cliff pairs. This union-based approach (rather than requiring agreement across all metrics) captures different types of structural relationships relevant to medicinal chemistry.</p>
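<p>The detection rule is straightforward to sketch. Below, fingerprints are represented as sets of on-bit indices and only a single fingerprint-based similarity is checked; the paper takes a union over three metrics, and the function names here are illustrative:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def is_activity_cliff(fp_a, fp_b, pot_a, pot_b, sim_cut=0.9, fold=10.0):
    """Cliff pair if similarity >= sim_cut on this metric AND >10x potency gap.

    A single-metric sketch of the paper's union-based rule.
    """
    ratio = max(pot_a, pot_b) / min(pot_a, pot_b)
    return tanimoto(fp_a, fp_b) >= sim_cut and ratio > fold
```

<p>In the full rule, a pair counts as a cliff if any of the substructure, scaffold, or SMILES-based similarities clears the cutoff.</p>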
<h2 id="24-methods-across-30-drug-targets">24 Methods Across 30 Drug Targets</h2>
<p>The benchmark evaluates 16 traditional ML configurations (4 algorithms $\times$ 4 descriptor types) and 8 deep learning approaches across 30 curated <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v29 datasets (48,707 total molecules).</p>
<p><strong>Traditional ML algorithms</strong>: KNN, RF, GBM, SVM, each combined with ECFPs, MACCS keys, WHIM descriptors, or physicochemical properties.</p>
<p><strong>Deep learning methods</strong>: MPNN, GCN, GAT, Attentive FP (graph-based), plus LSTM, CNN, Transformer/<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> (SMILES-based), and an MLP on ECFPs.</p>
<p>Performance is measured with both standard RMSE and a dedicated $\text{RMSE}_{\text{cliff}}$ computed only on activity cliff compounds in the test set:</p>
<p>$$
\text{RMSE}_{\text{cliff}} = \sqrt{\frac{\sum_{j=1}^{n_c} (\hat{y}_j - y_j)^2}{n_c}}
$$</p>
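<p>Restricting RMSE to cliff compounds amounts to masking before averaging. A minimal numpy sketch of both metrics (illustrative, not the MoleculeACE implementation):</p>

```python
import numpy as np

def rmse(y_true, y_pred, mask=None):
    """RMSE, optionally restricted by a boolean mask (e.g., cliff compounds)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if mask is not None:
        y_true, y_pred = y_true[mask], y_pred[mask]
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```

<p>Passing the cliff-compound indicator as <code>mask</code> yields $\text{RMSE}_{\text{cliff}}$; omitting it yields standard RMSE.</p>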
<p>Key results:</p>
<ul>
<li><strong>Molecular descriptors matter more than algorithms</strong>: The choice of descriptor (ECFPs vs. MACCS vs. WHIM vs. physicochemical) had a larger impact on $\text{RMSE}_{\text{cliff}}$ than the choice of ML algorithm ($p &lt; 0.05$, <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Wilcoxon rank-sum test</a> with <a href="https://en.wikipedia.org/wiki/False_discovery_rate">Benjamini-Hochberg correction</a>).</li>
<li><strong>SVM + ECFPs wins on average</strong>: The best overall method for activity cliff prediction, though the difference from RF + ECFPs or GBM + ECFPs was not statistically significant.</li>
<li><strong>Deep learning underperforms</strong>: All graph and SMILES-based deep learning methods performed worse than a simple MLP on ECFPs. Among deep learning, LSTM with transfer learning (pretrained on 36K molecules) was the best, outperforming the ChemBERTa transformer pretrained on 10M compounds.</li>
<li><strong>Large case-by-case variation</strong>: $\text{RMSE}_{\text{cliff}}$ ranged from 0.62 to 1.60 log units across datasets, with no method consistently best. Deep learning methods showed the highest variance across targets.</li>
</ul>
<h2 id="simple-descriptors-beat-complex-architectures-on-cliffs">Simple Descriptors Beat Complex Architectures on Cliffs</h2>
<p>The core finding is that activity cliffs expose a gap in learned molecular representations. Despite graph neural networks and transformers being able to learn directly from molecular structure, they fail to capture the subtle structural differences that drive activity cliffs.</p>
<p>Key observations:</p>
<ul>
<li><strong>RMSE and $\text{RMSE}_{\text{cliff}}$ correlate ($r = 0.81$ on average)</strong>, so optimizing overall error usually helps with cliffs too. But this correlation breaks down for some targets (e.g., CLK4), where methods with similar RMSE can have very different $\text{RMSE}_{\text{cliff}}$.</li>
<li><strong>Training set size matters for the RMSE/$\text{RMSE}_{\text{cliff}}$ correlation</strong>: Datasets with $&gt; 1000$ training molecules show $r &gt; 0.80$ between the two metrics. In low-data regimes, the correlation weakens, making dedicated cliff evaluation more important.</li>
<li><strong>No relationship between % cliff compounds and model performance</strong>, and no target-family-specific effects were found.</li>
<li><strong>Transfer learning helped SMILES models (LSTM) but not graph models</strong>: Self-supervised pretraining strategies (context prediction, infomax, edge prediction, masking) did not improve GNN performance, consistent with findings from other studies.</li>
</ul>
<p>The MoleculeACE platform provides standardized data curation, activity cliff detection, and cliff-specific evaluation, enabling researchers to assess new methods against this benchmark.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Source</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmarking</td>
          <td>ChEMBL v29</td>
          <td>48,707 molecules (35,632 unique) across 30 targets</td>
          <td>Curated for duplicates, salts, outliers</td>
      </tr>
      <tr>
          <td>Smallest dataset</td>
          <td>JAK1</td>
          <td>615 molecules</td>
          <td>7% activity cliffs</td>
      </tr>
      <tr>
          <td>Largest dataset</td>
          <td>DRD3</td>
          <td>3,657 molecules</td>
          <td>39% activity cliffs</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Activity cliff detection</strong>: Pairwise similarity $\geq 0.9$ (Tanimoto on ECFPs, scaffold ECFPs, or Levenshtein on SMILES) with $&gt; 10\times$ potency difference</li>
<li><strong>Splitting</strong>: <a href="https://en.wikipedia.org/wiki/Spectral_clustering">Spectral clustering</a> on ECFPs (5 clusters), 80/20 stratified split preserving cliff proportion</li>
<li><strong>Hyperparameter optimization</strong>: <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian optimization</a> with Gaussian process, max 50 combinations, 5-fold cross-validation</li>
<li><strong>SMILES augmentation</strong>: 10-fold for all SMILES-based methods</li>
<li><strong>Transfer learning</strong>: LSTM pretrained on 36,281 merged training molecules (next-character prediction); ChemBERTa pretrained on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> compounds</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Traditional ML</strong>: KNN, RF, GBM, SVM (scikit-learn v1.0.2)</li>
<li><strong>Descriptors</strong>: ECFPs (1024-bit, radius 2), MACCS keys (166-bit), WHIM (114 descriptors), physicochemical (11 properties)</li>
<li><strong>GNNs</strong>: MPNN, GCN, GAT, AFP (PyTorch Geometric v2.0.4), with graph multiset transformer pooling</li>
<li><strong>SMILES models</strong>: LSTM (4 layers, 5.8M params), 1D CNN, ChemBERTa transformer</li>
<li><strong>Total models trained</strong>: 720 (24 methods $\times$ 30 targets)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE</td>
          <td>All test molecules</td>
          <td>Standard root-mean-square error on $\text{pK}_i$ / $\text{pEC}_{50}$</td>
      </tr>
      <tr>
          <td>$\text{RMSE}_{\text{cliff}}$</td>
          <td>Activity cliff compounds only</td>
          <td>RMSE restricted to cliff molecules in test set</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE">MoleculeACE</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Benchmark platform with all 30 curated datasets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE/tree/main/MoleculeACE/Data/benchmark_data">Curated datasets</a></td>
          <td>Data</td>
          <td>MIT</td>
          <td>Processed ChEMBL bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: van Tilborg, D., Alenicheva, A., &amp; Grisoni, F. (2022). Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. <em>Journal of Chemical Information and Modeling</em>, 62(23), 5938-5951. <a href="https://doi.org/10.1021/acs.jcim.2c01073">https://doi.org/10.1021/acs.jcim.2c01073</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molML/MoleculeACE">MoleculeACE GitHub Repository</a></li>
<li><a href="https://chemrxiv.org/engage/chemrxiv/article-details/630cc44058843b8403a19810">ChemRxiv Preprint</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{vantilborg2022activity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exposing the Limitations of Molecular Machine Learning with Activity Cliffs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{van Tilborg, Derek and Alenicheva, Alisa and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5938--5951}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01073}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Scaling Laws vs Model Architectures: Inductive Bias</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/</guid><description>Tay et al.'s 2022 study comparing scaling behavior across ten model architectures, showing that inductive bias affects scaling properties in distinct ways.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>systematization paper</strong> that conducts a large-scale empirical comparison of how ten different model architectures scale. Rather than proposing a new architecture, it characterizes the relationship between inductive bias and scaling behavior across both upstream (pretraining) and downstream (transfer) performance.</p>
<h2 id="why-architecture-aware-scaling-matters">Why architecture-aware scaling matters</h2>
<p>Prior work on scaling laws (Kaplan et al., 2020) focused almost exclusively on vanilla Transformers, finding that loss scales as a power law with model size, dataset size, and compute. A common assumption in the field is that improvements observed at one scale transfer to other scales, and new architectures are often evaluated at a single compute point (e.g., base size). This paper challenges that assumption by asking whether different inductive biases scale differently.</p>
<h2 id="ten-architectures-one-controlled-setup">Ten architectures, one controlled setup</h2>
<p>All models are implemented in Mesh TensorFlow under a shared encoder-decoder (<a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style) framework, pretrained on C4 for $2^{19}$ steps with the Adafactor optimizer and an inverse square root learning rate schedule, and finetuned for 100K steps on GLUE + SuperGLUE + SQuAD. Models range from 15M to 40B parameters, trained on 16 TPU-v3 chips. The ten architectures span four categories:</p>
<p><strong>Transformer variants</strong>: vanilla Transformer, Evolved Transformer (AutoML-derived), Universal Transformer (parameter sharing + recurrence), Switch Transformer (sparse MoE)</p>
<p><strong>Efficient variants</strong>: Performer (linear attention), Funnel Transformer (sequence downsampling), ALBERT (cross-layer parameter sharing + embedding factorization)</p>
<p><strong>General improvements</strong>: Mixture of Softmaxes (MoS), Gated Linear Units (GLU)</p>
<p><strong>Non-Transformers</strong>: Lightweight Convolutions, Dynamic Convolutions, MLP-Mixer</p>
<h2 id="key-findings-on-scaling-behavior">Key findings on scaling behavior</h2>
<h3 id="architecture-changes-the-scaling-slope">Architecture changes the scaling slope</h3>
<p>The paper fits linear scaling laws in log-log space (i.e., power law fits of the form $L \propto C^{-\alpha}$) for each model across multiple axes (FLOPs vs. upstream, FLOPs vs. downstream, etc.). The vanilla Transformer has the highest scaling coefficient on most reported axes ($\alpha_{F,U} = 0.54$, $\alpha_{F,D} = 0.28$). Models that make minimal changes to the Transformer (GLU, MoS) retain similar scaling behavior. Models with more radical inductive biases show worse scaling:</p>
<ul>
<li><strong>Performer</strong> (linear attention): $\alpha_{F,U} = 0.25$, upstream perplexity decreases only 2.7% from base to large vs. 8.4% for vanilla Transformer</li>
<li><strong>ALBERT</strong>: scales negatively on downstream ($\alpha_{F,D} = -0.12$), getting worse as compute increases. ALBERT was designed for parameter efficiency (cross-layer weight sharing, embedding factorization), not compute efficiency, so this result is expected: additional FLOPs reuse the same parameters without adding capacity</li>
<li><strong>MLP-Mixer</strong>: near-zero downstream scaling ($\alpha_{F,D} = -0.03$)</li>
</ul>
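<p>Fitting such a power law reduces to linear regression in log-log space, with $\alpha$ recovered as minus the slope. A numpy sketch on synthetic data (illustrative values, not the paper's measurements):</p>

```python
import numpy as np

def fit_scaling_exponent(compute, loss):
    """Fit L = a * C^(-alpha) by linear regression in log-log space.

    Returns alpha; np.polyfit gives [slope, intercept], and slope = -alpha.
    """
    slope, _ = np.polyfit(np.log(compute), np.log(loss), 1)
    return -slope
```

<p>On exact power-law data, e.g. $L = 5 C^{-0.5}$, the fit recovers $\alpha = 0.5$; on real measurements the fit quality itself indicates how well a single power law describes the architecture's scaling.</p>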
<h3 id="the-best-architecture-changes-with-scale">The best architecture changes with scale</h3>
<p>Models that perform well at small compute budgets are not necessarily the best at larger budgets. For example, the Evolved Transformer outperforms vanilla Transformers at tiny-to-small scale on downstream tasks but falls behind when scaled up. MoS-Transformer outperforms vanilla Transformers in some compute regions but not others.</p>
<h3 id="upstream-and-downstream-scaling-diverge">Upstream and downstream scaling diverge</h3>
<p>Good upstream perplexity scaling does not guarantee good downstream transfer scaling. Funnel Transformers and Lightweight Convolutions hold up reasonably well on upstream perplexity but suffer substantially on downstream tasks. Switch Transformers show the best upstream-to-downstream transfer ratio ($\alpha_{U,D} = 0.58$).</p>
<h3 id="depth-and-width-affect-architectures-differently">Depth and width affect architectures differently</h3>
<p>Depth scaling has a more substantial impact on downstream performance than width scaling across most architectures. Evolved Transformers are a partial exception, scaling slightly better under width scaling compared to other architectures on downstream tasks.</p>
<h2 id="practical-implications">Practical implications</h2>
<p>The authors offer concrete guidance: practitioners should be cautious about staking expensive large-scale runs on architectures that drastically modify the attention mechanism. Performers and MLP-Mixers are characterized as &ldquo;high risk&rdquo; options. This helps explain why most large language models at the time (PaLM, Gopher, UL2) use relatively vanilla Transformer architectures.</p>
<p>The paper also notes that not every use case requires billion-parameter models. Inductive biases tailored to small or low-compute regimes remain valuable when scaling is not the priority.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>No code or trained model weights were publicly released with this paper. The experiments rely on Google&rsquo;s internal Mesh TensorFlow infrastructure with 16 TPU-v3 chips, and pretraining uses the publicly available C4 corpus. Finetuning benchmarks (GLUE, SuperGLUE, SQuAD) are all publicly available. However, reproducing the full study would require substantial compute resources and re-implementation of all ten architectures within a shared framework.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://arxiv.org/abs/2207.10551">arXiv paper</a></td>
          <td>Paper</td>
          <td>Open access</td>
          <td>Full paper with appendices</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 corpus</a></td>
          <td>Dataset</td>
          <td>ODC-BY</td>
          <td>Pretraining data</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components</strong>: No released code, model checkpoints, or training scripts. Internal Mesh TensorFlow codebase is not publicly available.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., Narang, S., Tran, V. Q., Yogatama, D., &amp; Metzler, D. (2022). Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? <em>EMNLP 2022</em>.</p>
<p><strong>Publication</strong>: EMNLP 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2207.10551">arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{tay2022scaling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tay, Yi and Dehghani, Mostafa and Abnar, Samira and Chung, Hyung Won and Fedus, William and Rao, Jinfeng and Narang, Sharan and Tran, Vinh Q. and Yogatama, Dani and Metzler, Donald}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Relational Inductive Biases in Deep Learning (2018)</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/relational-inductive-biases-deep-learning-graph-networks/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/relational-inductive-biases-deep-learning-graph-networks/</guid><description>Battaglia et al.'s 2018 paper unifying graph neural network variants under a general graph network framework and analyzing relational inductive biases.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>systematization paper</strong> that is part position paper, part review, and part unification. It argues that combinatorial generalization, the ability to construct new inferences from known building blocks, is a top priority for AI. It frames relational inductive biases as the key design principle connecting standard deep learning architectures, presents the graph network (GN) as a general framework subsuming prior graph neural network variants, and advocates for combining structured approaches with deep learning rather than choosing between them.</p>
<h2 id="the-case-for-relational-inductive-biases">The case for relational inductive biases</h2>
<p>Human intelligence relies on representing the world as compositions of entities, relations, and rules. We understand complex systems by decomposing them into parts and their interactions. Modern deep learning&rsquo;s &ldquo;end-to-end&rdquo; philosophy minimizes structural assumptions, relying on data and compute to learn representations from scratch. The paper argues this approach struggles with combinatorial generalization: generalizing beyond one&rsquo;s experiences by composing known elements in new ways.</p>
<p>The authors reject the false dichotomy between &ldquo;hand-engineering&rdquo; and &ldquo;end-to-end&rdquo; learning. Just as biology uses both nature and nurture, they advocate for architectures that bake in useful structural assumptions (inductive biases) while still learning flexibly from data.</p>
<h2 id="inductive-biases-across-standard-architectures">Inductive biases across standard architectures</h2>
<p>The paper provides a systematic analysis of how existing architectures encode relational structure:</p>
<p><strong>Fully connected networks (MLPs)</strong>: The weakest relational inductive bias. All input units can interact with all others, with no reuse of parameters. No assumptions about the structure of the input.</p>
<p><strong>Convolutional networks (CNNs)</strong>: Encode locality (nearby elements interact) and translation invariance (the same local function is applied everywhere). The entities are individual units or grid elements (e.g., pixels), the relations are defined by the grid neighborhood, and the rule (convolution kernel) is shared across all positions.</p>
<p><strong>Recurrent networks (RNNs)</strong>: Encode sequential structure and temporal invariance. The entities are time steps; each step relates to the previous one through a shared transition function. This imposes a Markovian bias (the future depends on the present state, not the full history directly).</p>
<p><strong>Sets and self-attention</strong>: Permutation-invariant architectures impose no ordering on entities. Self-attention (as in Transformers) allows all pairwise interactions but with no structural prior on which interactions matter.</p>
<p>Each architecture can be understood as making specific commitments about what the entities are, what the relations between them are, and what rules govern their interactions.</p>
<h2 id="the-graph-network-framework">The graph network framework</h2>
<p>The paper defines a general &ldquo;graph network&rdquo; (GN) block that operates on graphs with attributes on nodes, edges, and the global graph level. A GN block performs three update steps and three aggregation steps:</p>
<ol>
<li><strong>Edge update</strong>: For each edge, compute updated edge attributes using the current edge attributes, the sender node attributes, the receiver node attributes, and the global attributes</li>
<li><strong>Node update</strong>: For each node, aggregate incoming updated edge attributes, then compute updated node attributes using the aggregated edges, current node attributes, and global attributes</li>
<li><strong>Global update</strong>: Aggregate all updated edge and node attributes, then compute updated global attributes</li>
</ol>
<p>Each update function is learned (typically a small neural network), and each aggregation function must be permutation invariant (typically sum, mean, or max).</p>
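<p>The three update and aggregation steps above can be sketched concretely. The following is a minimal NumPy sketch, not the paper&rsquo;s released library: the toy callables <code>phi_e</code>, <code>phi_v</code>, and <code>phi_u</code> stand in for the learned update networks, and sum serves as the permutation-invariant aggregator.</p>

```python
import numpy as np

def gn_block(V, E, senders, receivers, u, phi_e, phi_v, phi_u):
    """One graph network (GN) block: edge, node, and global updates.

    V: (n_nodes, d) node attributes; E: (n_edges, d) edge attributes;
    senders/receivers: integer arrays of edge endpoints;
    u: (d,) global attributes; phi_*: learned update functions.
    """
    # 1. Edge update: each edge sees its attributes, both endpoints, and u.
    E_new = np.stack([
        phi_e(E[k], V[senders[k]], V[receivers[k]], u)
        for k in range(len(E))
    ])
    # 2. Node update: aggregate incoming updated edges (sum is permutation
    #    invariant), then update each node from its aggregate, itself, and u.
    V_new = np.zeros_like(V)
    for i in range(len(V)):
        incoming = E_new[receivers == i]
        agg = incoming.sum(axis=0) if len(incoming) else np.zeros(E_new.shape[1])
        V_new[i] = phi_v(agg, V[i], u)
    # 3. Global update: aggregate all updated edges and nodes, then update u.
    u_new = phi_u(E_new.sum(axis=0), V_new.sum(axis=0), u)
    return V_new, E_new, u_new
```

<p>Stacking such blocks, with the <code>phi</code> functions implemented as small neural networks and weights optionally shared across blocks, recovers the composable architectures the paper describes.</p>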
<p>This framework generalizes prior work:</p>
<ul>
<li><strong>Message Passing Neural Networks</strong> (Gilmer et al., 2017): edge and node updates with a readout function but no explicit global attribute in message passing</li>
<li><strong>Non-local Neural Networks</strong> (Wang et al., 2018): attention-weighted edge interactions</li>
<li><strong>Interaction Networks</strong> (Battaglia et al., 2016): physics-inspired message passing</li>
<li><strong>Relation Networks</strong> (Santoro et al., 2017): a simple neural network module for relational reasoning</li>
<li><strong>Discovering objects and their relations</strong> (Raposo et al., 2017): discovering objects and their relations from entangled scene representations</li>
<li><strong>Deep Sets</strong> (Zaheer et al., 2017): node-only aggregation without edge structure</li>
<li><strong>CommNet, Structure2Vec, GGNNs</strong>, and others</li>
</ul>
<p>The paper shows how each prior approach corresponds to a specific configuration of which GN components are used and how they are connected.</p>
<h2 id="design-principles-for-graph-networks">Design principles for graph networks</h2>
<p>The paper identifies several key design choices:</p>
<p><strong>Flexible representations</strong>: GN blocks can output graphs with different structure than their input (e.g., predicting edge existence), enabling tasks like link prediction, clustering, or property regression.</p>
<p><strong>Configurable within-block structure</strong>: The internal update and aggregation functions can be swapped freely. The framework separates what is computed (the relational structure) from how it is computed (the function approximators).</p>
<p><strong>Composable multi-block architectures</strong>: GN blocks can be stacked, sharing or not sharing weights across layers. They can be composed with other architectures (e.g., an encoder-GN-decoder pattern) or arranged in recurrent configurations.</p>
<p><strong>Combinatorial generalization</strong>: Because GN blocks share functions across edges and nodes, they can generalize to graphs of different sizes and topologies than those seen during training. A GN trained on small graphs can, in principle, be applied to larger ones.</p>
<h2 id="connections-to-broader-ai-themes">Connections to broader AI themes</h2>
<p>The paper frames graph networks as supporting:</p>
<ul>
<li><strong>Relational reasoning</strong>: Learning about entities and their interactions</li>
<li><strong>Combinatorial generalization</strong>: Applying learned rules to novel combinations</li>
<li><strong>Structured prediction</strong>: Producing complex, structured outputs including graphs and sequences</li>
<li><strong>Interpretable representations</strong>: Graph structure provides a natural vocabulary for understanding what the model has learned</li>
</ul>
<p>The authors also discuss connections to classical AI (logic, planning, causal reasoning) and argue that graph networks provide a bridge between the flexibility of deep learning and the compositionality of symbolic approaches.</p>
<h2 id="limitations-and-open-questions">Limitations and open questions</h2>
<p>The paper identifies several limitations of graph networks:</p>
<ul>
<li><strong>Graph isomorphism</strong>: Learned message-passing cannot be guaranteed to discriminate between certain non-isomorphic graphs. Kondor et al. (2018) suggested that covariance, rather than invariance to permutations, may be preferable.</li>
<li><strong>Expressivity limits of graphs</strong>: Notions like recursion, control flow, and conditional iteration are not straightforward to represent with graphs. Programs and more &ldquo;computer-like&rdquo; processing may offer greater representational and computational expressivity for these concepts.</li>
<li><strong>Where do graphs come from?</strong>: Converting raw sensory data (images, text) into graph-structured representations remains an open problem. Fully connected graphs between spatial or linguistic entities are a common workaround but may not reflect the true underlying structure.</li>
<li><strong>Adaptive graph structure</strong>: How to modify graph topology during computation (e.g., splitting a node when an object fractures, or adding/removing edges based on contact) is an active research direction.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>The authors released an open-source software library for building graph networks in TensorFlow/Sonnet, including demos for shortest-path finding, sorting, and physical prediction tasks.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepmind/graph_nets">Graph Nets library</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official TensorFlow/Sonnet implementation with demos</td>
      </tr>
  </tbody>
</table>
<p>This is a position/systematization paper rather than an empirical one, so reproducibility pertains to the accompanying library rather than experimental results. The library and demos are publicly available, making the framework highly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., &hellip; &amp; Pascanu, R. (2018). Relational inductive biases, deep learning, and graph networks. <em>arXiv preprint arXiv:1806.01261</em>.</p>
<p><strong>Publication</strong>: arXiv 2018</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/1806.01261">arXiv</a></li>
<li><a href="https://github.com/deepmind/graph_nets">Graph Nets library (GitHub)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{battaglia2018relational,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Relational inductive biases, deep learning, and graph networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Battaglia, Peter W. and Hamrick, Jessica B. and Bapst, Victor and Sanchez-Gonzalez, Alvaro and Zambaldi, Vinicius and Malinowski, Mateusz and Tacchetti, Andrea and Raposo, David and Santoro, Adam and Faulkner, Ryan and others}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1806.01261}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of OCSR Techniques and Models (Musazade 2022)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/musazade-ocsr-review-2022/</guid><description>Systematization of OCSR evolution from rule-based systems to deep learning, highlighting the paradigm shift to image captioning approaches.</description><content:encoded><![CDATA[<h2 id="systematization-of-ocsr-evolution">Systematization of OCSR Evolution</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$). It organizes existing literature into two distinct evolutionary phases: <strong>Rule-based systems</strong> (1990s-2010s) and <strong>Machine Learning-based systems</strong> (2015-present). It synthesizes performance metrics across these paradigms to highlight the shift from simple classification to &ldquo;image captioning&rdquo; (sequence generation).</p>
<p><strong>Justification</strong>: The paper focuses on &ldquo;organizing and synthesizing existing literature&rdquo; and answers the core question: &ldquo;What do we know?&rdquo; The dominant contribution is systematization based on several key indicators:</p>
<ol>
<li>
<p><strong>Survey Structure</strong>: The paper explicitly structures content by categorizing the field into two distinct historical and methodological groups: &ldquo;Rule-based systems&rdquo; and &ldquo;ML-based systems&rdquo;. It traces the &ldquo;evolution of approaches from rule-based structure analyses to complex statistical models&rdquo;, moving chronologically from early tools like OROCS and OSRA (1990s-2000s) to modern Deep Learning approaches like DECIMER and Vision Transformers.</p>
</li>
<li>
<p><strong>Synthesis of Knowledge</strong>: The paper aggregates performance metrics from various distinct studies into unified comparison tables (Table 1 for rule-based and Table 2 for ML-based). It synthesizes technical details of different models, explaining how specific architectures (CNNs, LSTMs, Attention mechanisms) are applied to the specific problem of Optical Chemical Structure Recognition (OCSR).</p>
</li>
<li>
<p><strong>Identification of Gaps</strong>: The authors dedicate specific sections to &ldquo;Gaps of rule-based systems&rdquo; and &ldquo;Gaps of ML-based systems&rdquo;. It concludes with recommendations for future development, such as the need for &ldquo;standardized datasets&rdquo; and specific improvements in image augmentation and evaluation metrics.</p>
</li>
</ol>
<h2 id="motivation-for-digitization-in-cheminformatics">Motivation for Digitization in Cheminformatics</h2>
<p>The primary motivation is the need to digitize vast amounts of chemical knowledge locked in non-digital formats (e.g., scanned PDFs, older textbooks). This is challenging because:</p>
<ol>
<li><strong>Representational Variety</strong>: A single chemical formula can be drawn in many visually distinct ways (e.g., different orientations, bond styles, fonts).</li>
<li><strong>Legacy Data</strong>: Older documents contain noise, low resolution, and disconnected strokes that confuse standard computer vision models.</li>
<li><strong>Lack of Standardization</strong>: There is no centralized database or standardized benchmark for evaluating OCSR performance, making comparison difficult.</li>
</ol>
<h2 id="key-insights-and-the-paradigm-shift">Key Insights and the Paradigm Shift</h2>
<p>The paper provides a structured comparison of the &ldquo;evolution&rdquo; of OCSR, specifically identifying the pivot point where the field moved from object detection to <strong>NLP-inspired sequence generation</strong>.</p>
<p>Key insights include:</p>
<ul>
<li><strong>The Paradigm Shift</strong>: Identifying that OCSR has effectively become an &ldquo;image captioning&rdquo; problem where the &ldquo;caption&rdquo; is a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> string.</li>
<li><strong>Metric Critique</strong>: It critically analyzes the flaws in current evaluation metrics, noting that Levenshtein Distance (LD) is better than simple accuracy but still fails to capture the chemical severity of an error (e.g., mistaking &ldquo;F&rdquo; for &ldquo;S&rdquo; is worse than a wrong digit).</li>
<li><strong>Hybrid Potential</strong>: Despite the dominance of ML, the authors argue that rule-based heuristics are still valuable for post-processing validation (e.g., checking element order, sequence structure, and formula correspondence).</li>
</ul>
<h2 id="comparative-analysis-of-rule-based-vs-ml-systems">Comparative Analysis of Rule-Based vs. ML Systems</h2>
<p>As a review paper, it aggregates experimental results from primary sources. It compares:</p>
<ul>
<li><strong>Rule-based systems</strong>: OSRA, chemoCR, Imago, Markov Logic OCSR, and various heuristic approaches.</li>
<li><strong>ML-based systems</strong>: DECIMER (multiple versions), MSE-DUDL, ICMDT (Image Captioning Model based on Deep Transformer-in-Transformer), and other BMS Kaggle competition solutions.</li>
</ul>
<p>It contrasts these systems using:</p>
<ul>
<li><strong>Datasets</strong>: BMS (synthetic, 4M images), PubChem (synthetic), U.S. Patents (real-world scanned).</li>
<li><strong>Metrics</strong>: Tanimoto similarity (structural overlap) and Levenshtein distance (string edit distance).</li>
</ul>
<h2 id="outcomes-critical-gaps-and-recommendations">Outcomes, Critical Gaps, and Recommendations</h2>
<ol>
<li><strong>Transformers are SOTA</strong>: Attention-based encoder-decoder models outperform CNN-RNN hybrids. DECIMER 1.0 achieved Tanimoto $= 1.0$ on 96.47% of its test set using an EfficientNet-B3 encoder and Transformer decoder.</li>
<li><strong>Data Hungry</strong>: Modern approaches require massive datasets (millions of images) and significant compute. DECIMER 1.0 trained on 39M images for 14 days on TPU, while the original DECIMER took 27 days on a single GPU. Rule-based systems required neither large data nor heavy compute but hit a performance ceiling.</li>
<li><strong>Critical Gaps</strong>:
<ul>
<li><strong>Super-atoms</strong>: Current models struggle with abbreviated super-atoms (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;).</li>
<li><strong>Stereochemistry</strong>: 3D information (wedges/dashes) is often lost or misinterpreted.</li>
<li><strong>Resolution</strong>: Models are brittle to resolution changes; some require high-res, others fail if images aren&rsquo;t downscaled.</li>
</ul>
</li>
<li><strong>Recommendation</strong>: Future systems should integrate &ldquo;smart&rdquo; pre-processing (denoising without cropping) and use domain-specific distance metrics. The authors also note that post-processing formula validation (checking element order, sequence structure, and formula correspondence) increases accuracy by around 5-6% on average. They suggest exploring Capsule Networks as an alternative to CNNs, since capsules preserve positional information through routing-by-agreement rather than discarding it via max-pooling.</li>
</ol>
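<p>As a rough illustration of the formula-correspondence check recommended above, here is a deliberately naive sketch over organic-subset SMILES. The regex and helper names are illustrative, not from any reviewed system: bracket atoms, aromatic lowercase forms, charges, and hydrogens are ignored, so this suits a quick sanity check rather than chemistry-aware validation.</p>

```python
import re
from collections import Counter

# Two-letter symbols must be tried before one-letter ones, or "Cl"
# would be miscounted as carbon plus an unmatched character.
ATOM_RE = re.compile(r"Cl|Br|[BCNOSPFI]")

def heavy_atom_counts(smiles: str) -> Counter:
    """Count heavy (non-hydrogen) atoms in a simple SMILES string."""
    return Counter(ATOM_RE.findall(smiles))

def formula_matches(smiles: str, expected: dict) -> bool:
    """Post-processing check: does the predicted string's element tally
    agree with the expected heavy-atom formula?"""
    return heavy_atom_counts(smiles) == Counter(expected)
```

<p>For example, <code>formula_matches("CC(=O)O", {"C": 2, "O": 2})</code> accepts acetic acid, while a prediction that swapped an oxygen for a fluorine would be flagged for review.</p>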
<h2 id="reproducibility">Reproducibility</h2>
<p>As a review paper, this work does not introduce original code, models, or datasets. The paper itself is open access via the Journal of Cheminformatics. This section summarizes the technical details of the systems reviewed.</p>
<h3 id="data">Data</h3>
<p>The review identifies the following key datasets used for training OCSR models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BMS (Bristol-Myers Squibb)</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~4M images</td>
          <td style="text-align: left">2.4M train / 1.6M test. Used for Kaggle competition. Test images contain noise (salt &amp; pepper, blur) and rotations absent from training images.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>PubChem</strong></td>
          <td style="text-align: left">Synthetic</td>
          <td style="text-align: left">~39M</td>
          <td style="text-align: left">Generated via CDK (Chemistry Development Kit). Used by DECIMER 1.0 (90/10 train/test split).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>U.S. Patents (USPTO)</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">Variable</td>
          <td style="text-align: left">Real-world noise, often low resolution. One of several training sources for MSE-DUDL (alongside PubChem and Indigo, totaling 50M+ samples).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ChemInfty</strong></td>
          <td style="text-align: left">Scanned</td>
          <td style="text-align: left">869 images</td>
          <td style="text-align: left">Older benchmark used to evaluate rule-based systems (e.g., Markov Logic OCSR).</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review highlights the progression of algorithms:</p>
<ul>
<li><strong>Rule-Based</strong>: Hough transforms for bond detection, vectorization/skeletonization, and OCR for atom labels.</li>
<li><strong>Sequence Modeling</strong>:
<ul>
<li><strong>Image Captioning</strong>: Encoder (CNN/ViT) → Decoder (RNN/Transformer).</li>
<li><strong>Tokenization</strong>: Parsing InChI/SMILES into discrete tokens (e.g., splitting <code>C13</code> into <code>C</code>, <code>13</code>).</li>
<li><strong>Beam Search</strong>: Used at inference time (typically $k = 15$ to $20$) to find the most likely chemical string.</li>
</ul>
</li>
</ul>
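<p>The tokenization and beam-search steps above can be sketched as follows. This is a minimal illustration, not code from any reviewed system: the regex splits element symbols from multi-digit counts (the <code>C13</code> → <code>C</code>, <code>13</code> case), and the scoring interface is a stand-in for a decoder&rsquo;s next-token distribution.</p>

```python
import re

# Element symbols (capital letter plus optional lowercase), multi-digit
# counts, and single punctuation marks become separate tokens,
# e.g. "C13H18O2" -> ["C", "13", "H", "18", "O", "2"].
TOKEN_RE = re.compile(r"[A-Z][a-z]?|\d+|[^A-Za-z0-9\s]")

def tokenize_formula(layer: str) -> list[str]:
    return TOKEN_RE.findall(layer)

def beam_search(score_next, start, k=15, steps=5):
    """Minimal beam search: keep the k highest-scoring partial sequences
    at each step. score_next(seq) yields (token, log-prob) candidates;
    k = 15 mirrors the typical width cited above."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for tok, lp in score_next(seq):
                candidates.append((logp + lp, seq + [tok]))
        beams = sorted(candidates, reverse=True)[:k]
    return beams[0][1]  # token sequence of the best-scoring beam
```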
<h3 id="models">Models</h3>
<p>Key architectures reviewed:</p>
<ul>
<li><strong>DECIMER 1.0</strong>: Uses <strong>EfficientNet-B3</strong> (Encoder) and <strong>Transformer</strong> (Decoder). Predicts <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings (more robust than <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>).</li>
<li><strong>Swin Transformer</strong>: Often used in Kaggle ensembles as the visual encoder due to better handling of variable image sizes.</li>
<li><strong>Grid LSTM</strong>: Used in older deep learning approaches (MSE-DUDL) to capture spatial dependencies.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics standard in the field:</p>
<ul>
<li><strong>Levenshtein Distance (LD)</strong>: Edit distance between predicted and ground truth strings. Lower is better. Formally, for two sequences $a$ and $b$ (e.g. SMILES strings) of lengths $|a|$ and $|b|$, the distance $LD(a, b)$ is bounded between $0$ and $\max(|a|, |b|)$.</li>
<li><strong>Tanimoto Similarity</strong>: Measures overlap of molecular fingerprints ($0.0 - 1.0$). Higher is better. DECIMER 1.0 achieved a Tanimoto of 0.99 on PubChem data (Table 2). Calculated as:
$$
\begin{aligned}
T(A, B) = \frac{N_c}{N_a + N_b - N_c}
\end{aligned}
$$
where $N_a$ and $N_b$ are the number of bits set to 1 in fingerprints $A$ and $B$, and $N_c$ is the number of common bits set to 1.</li>
<li><strong>1-1 Match Rate</strong>: Exact string matching (accuracy). For DECIMER 1.0, 96.47% of results achieved Tanimoto $= 1.0$.</li>
</ul>
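<p>Both metrics can be implemented directly from the definitions above. The following is a minimal reference sketch, not any reviewed tool&rsquo;s code; fingerprints are represented abstractly as sets of set-bit indices rather than any specific fingerprint format.</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (e.g. predicted vs. reference
    SMILES), bounded by 0 and max(len(a), len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tanimoto(fp_a: set, fp_b: set) -> float:
    """T(A, B) = N_c / (N_a + N_b - N_c), where N_a and N_b count the
    set bits of each fingerprint and N_c counts the shared set bits."""
    n_c = len(fp_a & fp_b)
    return n_c / (len(fp_a) + len(fp_b) - n_c)
```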
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Cost</strong>: High for SOTA. DECIMER 1.0 required ~14 days on TPU. The original DECIMER took ~27 days on a single NVIDIA GPU.</li>
<li><strong>Inference</strong>: Transformer models are heavy; rule-based systems run on standard CPUs but with lower accuracy.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Musazade, F., Jamalova, N., &amp; Hasanov, J. (2022). Review of techniques and models used in optical chemical structure recognition in images and scanned documents. <em>Journal of Cheminformatics</em>, 14(1), 61. <a href="https://doi.org/10.1186/s13321-022-00642-3">https://doi.org/10.1186/s13321-022-00642-3</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{musazadeReviewTechniquesModels2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Review of Techniques and Models Used in Optical Chemical Structure Recognition in Images and Scanned Documents}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Musazade, Fidan and Jamalova, Narmin and Hasanov, Jamaladdin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-022-00642-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>A Review of Optical Chemical Structure Recognition Tools</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-ocsr-review-2020/</guid><description>Comprehensive review and benchmarking of 30 years of Optical Chemical Structure Recognition (OCSR) methods and tools.</description><content:encoded><![CDATA[<h2 id="systematization-and-benchmarking-of-ocsr">Systematization and Benchmarking of OCSR</h2>
<p>This is primarily a <strong>Systematization</strong> paper ($0.7 \Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($0.3 \Psi_{\text{Resource}}$).</p>
<p>It serves as a <strong>Systematization</strong> because it organizes nearly three decades of research in Optical Chemical Structure Recognition (OCSR), categorizing methods into rule-based systems (e.g., Kekulé, CLiDE, OSRA) and emerging machine-learning approaches (e.g., MSE-DUDL, Chemgrapher). It synthesizes information on 16 distinct tools, many of which are commercial or no longer available.</p>
<p>It acts as a <strong>Resource</strong> by defining a benchmark for the field. The authors evaluate the three available open-source tools (Imago, MolVec, OSRA) against four distinct datasets to establish baseline performance metrics for accuracy and speed.</p>
<h2 id="motivation-digitizing-legacy-chemical-literature">Motivation: Digitizing Legacy Chemical Literature</h2>
<p>A vast amount of chemical knowledge remains &ldquo;hidden&rdquo; in the primary scientific literature (printed or PDF), conveyed as 2D images. Because these depictions are not machine-readable, there is a &ldquo;backlog of decades of chemical literature&rdquo; that cannot be easily indexed or searched in open-access databases.</p>
<p>While Chemical Named Entity Recognition (NER) exists for text, translating graphical depictions into formats like SMILES or SDfiles requires specialized OCSR tools. The motivation is to enable the automated curation of this legacy data to feed public databases.</p>
<h2 id="core-innovations-historical-taxonomy-and-open-standards">Core Innovations: Historical Taxonomy and Open Standards</h2>
<p>The primary novelty is the comprehensive aggregation of the history of the field, which had not been thoroughly reviewed recently. It details the algorithmic evolution from the first work in 1990 to deep learning methods in 2019.</p>
<p>Specific contributions include:</p>
<ul>
<li><strong>Historical Taxonomy</strong>: Classification of tools into rule-based vs. machine-learning, and open-source vs. commercial/unavailable.</li>
<li><strong>Open Source Benchmark</strong>: A comparative performance analysis of the only three open-source tools available at the time (Imago, MolVec, OSRA) on standardized datasets.</li>
<li><strong>Algorithmic Breakdown</strong>: Detailed summaries of the workflows for closed-source or lost tools (e.g., Kekulé, OROCS, ChemReader) based on their original publications.</li>
</ul>
<h2 id="benchmarking-methodology-and-open-source-evaluation">Benchmarking Methodology and Open-Source Evaluation</h2>
<p>The authors performed a benchmark study to evaluate the accuracy and speed of three open-source OCSR tools: <strong>MolVec (0.9.7)</strong>, <strong>Imago (2.0)</strong>, and <strong>OSRA (2.1.0)</strong>.</p>
<p>They tested these tools on four datasets of varying quality and origin:</p>
<ol>
<li><strong>USPTO</strong>: 5,719 images from US patents (high quality).</li>
<li><strong>UOB</strong>: 5,740 images from the University of Birmingham, published alongside MolRec.</li>
<li><strong>CLEF 2012</strong>: 961 images from the CLEF-IP evaluation (well-segmented, clean).</li>
<li><strong>JPO</strong>: 450 images from Japanese patents (low quality, noise, Japanese characters).</li>
</ol>
<p>Evaluation metrics were:</p>
<ul>
<li><strong>Accuracy</strong>: Percentage of perfectly recognized structures, defined as exact string matching between generated and reference <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> sequences: $\text{Accuracy} = \frac{\text{Correct InChI Matches}}{\text{Total Images}}$. Each tool&rsquo;s output was converted to an InChI string and matched against the reference InChI.</li>
<li><strong>Speed</strong>: Total processing time for the dataset.</li>
</ul>
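<p>The accuracy computation above reduces to exact string matching over InChI pairs; a minimal sketch (the function name is illustrative, and any InChI strings used with it would be tool outputs, not data from this benchmark):</p>

```python
def ocsr_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of images whose generated InChI exactly matches the
    reference InChI, i.e. the benchmark's definition of a correct result."""
    if len(predicted) != len(reference):
        raise ValueError("expected one prediction per reference image")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)
```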
<h2 id="results-and-general-conclusions">Results and General Conclusions</h2>
<p><strong>Benchmark Results (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>MolVec 0.9.7</th>
          <th>Imago 2.0</th>
          <th>OSRA 2.1.0</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO (5,719 images)</td>
          <td>Time (min)</td>
          <td>28.65</td>
          <td>72.83</td>
          <td>145.04</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.41%</td>
          <td>87.20%</td>
          <td>87.69%</td>
      </tr>
      <tr>
          <td>UOB (5,740 images)</td>
          <td>Time (min)</td>
          <td>28.42</td>
          <td>152.52</td>
          <td>125.78</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>88.39%</td>
          <td>63.54%</td>
          <td>86.50%</td>
      </tr>
      <tr>
          <td>CLEF 2012 (961 images)</td>
          <td>Time (min)</td>
          <td>4.41</td>
          <td>16.03</td>
          <td>21.33</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>80.96%</td>
          <td>65.45%</td>
          <td>94.90%</td>
      </tr>
      <tr>
          <td>JPO (450 images)</td>
          <td>Time (min)</td>
          <td>7.50</td>
          <td>22.55</td>
          <td>16.68</td>
      </tr>
      <tr>
          <td></td>
          <td>Accuracy</td>
          <td>66.67%</td>
          <td>40.00%</td>
          <td>57.78%</td>
      </tr>
  </tbody>
</table>
<p><strong>Key Observations</strong>:</p>
<ul>
<li><strong>MolVec</strong> was the fastest tool, processing datasets significantly quicker than competitors (e.g., 28.65 min for USPTO vs. 145.04 min for OSRA).</li>
<li><strong>OSRA</strong> performed exceptionally well on clean, well-segmented data (94.90% on CLEF 2012) but was slower.</li>
<li><strong>Imago</strong> generally lagged in accuracy compared to the other two, particularly on the UOB dataset (63.54% vs. 88.39% for MolVec and 86.50% for OSRA).</li>
<li><strong>JPO Difficulty</strong>: All tools struggled with the noisy Japanese Patent Office dataset (accuracies ranged from 40.00% to 66.67%), highlighting issues with noise and non-standard labels.</li>
</ul>
<p><strong>General Conclusions</strong>:</p>
<ul>
<li>No &ldquo;gold standard&rdquo; tool existed (as of 2020) that solved all problems (page segmentation, R-groups, NLP integration).</li>
<li>Rule-based approaches dominated the history of the field, but deep-learning methods (MSE-DUDL, Chemgrapher) were emerging, though both were closed-source at the time of writing.</li>
<li>There was a critical need for tools that could handle full-page recognition (combining segmentation and recognition).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The authors provided sufficient detail to replicate the benchmarking study.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/OCSR_Review">OCSR_Review (GitHub)</a></td>
          <td>Code / Data</td>
          <td>MIT</td>
          <td>Benchmark images (PNG, 72 dpi) and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://sourceforge.net/p/osra/wiki/Download/">OSRA</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.1.0 tested; precompiled binaries are commercial</td>
      </tr>
      <tr>
          <td><a href="https://lifescience.opensource.epam.com/download/imago.html">Imago</a></td>
          <td>Code</td>
          <td>Open Source</td>
          <td>Version 2.0 tested; no longer actively developed</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ncats/molvec">MolVec</a></td>
          <td>Code</td>
          <td>LGPL-2.1</td>
          <td>Version 0.9.7 tested; Java-based standalone tool</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used four public datasets. Images were converted to PNG (72 dpi) to ensure compatibility across all tools.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
          <th>Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>5,719</td>
          <td>OSRA Validation Set</td>
          <td>US Patent images, generally clean.</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>5,740</td>
          <td>Univ. of Birmingham</td>
          <td>Published alongside MolRec.</td>
      </tr>
      <tr>
          <td><strong>CLEF 2012</strong></td>
          <td>961</td>
          <td>CLEF-IP 2012</td>
          <td>Well-segmented, high quality.</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>450</td>
          <td>Japanese Patent Office</td>
          <td>Low quality, noisy, contains Japanese text.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper does not propose a new algorithm but benchmarks existing ones. The execution commands for reproducibility were:</p>
<ul>
<li><strong>Imago</strong>: Executed via command line without installation.
<code>./imago_console -dir /image/directory/path</code></li>
<li><strong>MolVec</strong>: Executed as a JAR file.
<code>java -cp [dependencies] gov.nih.ncats.molvec.Main -dir [input_dir] -outDir [output_dir]</code></li>
<li><strong>OSRA</strong>: Installed via Conda (PyOSRA) due to compilation complexity. Required dictionaries for superatoms and spelling.
<code>osra -f sdf -a [superatom_dict] -l [spelling_dict] -w [output_file] [input_file]</code></li>
</ul>
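<p>For batch runs, these invocations can be assembled programmatically. A minimal sketch: the binary names mirror the commands above, while the dictionary filenames, classpath, and output paths are placeholders for a real installation.</p>

```python
def build_command(tool, image_path, out="out", superatoms="superatom.txt",
                  spelling="spelling.txt", classpath="molvec.jar"):
    """Assemble the argv list for each OCSR tool, following the paper's usage.

    Imago and MolVec operate on a directory of images; OSRA is invoked once
    per input file, so `image_path` names a single file in that case.
    """
    if tool == "imago":
        return ["./imago_console", "-dir", image_path]
    if tool == "molvec":
        return ["java", "-cp", classpath, "gov.nih.ncats.molvec.Main",
                "-dir", image_path, "-outDir", out]
    if tool == "osra":
        return ["osra", "-f", "sdf", "-a", superatoms, "-l", spelling,
                "-w", out, image_path]
    raise ValueError(f"unknown tool: {tool}")


# Each list can be handed to subprocess.run(...) once the tools are installed.
print(build_command("imago", "benchmark/uspto"))
```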
<h3 id="models">Models</h3>
<p>The specific versions of the open-source software tested were:</p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Version</th>
          <th>Technology</th>
          <th>License</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolVec</strong></td>
          <td>0.9.7</td>
          <td>Java-based, rule-based</td>
          <td>LGPL-2.1</td>
      </tr>
      <tr>
          <td><strong>Imago</strong></td>
          <td>2.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong></td>
          <td>2.1.0</td>
          <td>C++, rule-based</td>
          <td>Open Source</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Perfect structural match. The output SDfile/SMILES was converted to a Standard InChI string and compared to the ground truth InChI. Any deviation counted as a failure.</li>
<li><strong>Environment</strong>: Linux workstation (Ubuntu 20.04 LTS).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The benchmark was performed on a high-end workstation to measure processing time.</p>
<ul>
<li><strong>CPUs</strong>: 2x Intel Xeon Silver 4114 (40 threads total).</li>
<li><strong>RAM</strong>: 64 GB.</li>
<li><strong>Parallelization</strong>: MolVec&rsquo;s built-in parallelization contributed to its speed advantage.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Zielesny, A., &amp; Steinbeck, C. (2020). A review of optical chemical structure recognition tools. <em>Journal of Cheminformatics</em>, 12(1), 60. <a href="https://doi.org/10.1186/s13321-020-00465-0">https://doi.org/10.1186/s13321-020-00465-0</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanReviewOpticalChemical2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Review of Optical Chemical Structure Recognition Tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00465-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Three Domains of Life: Woese's Phylogenetic Revolution</title><link>https://hunterheidenreich.com/notes/biology/evolutionary-biology/woese-three-domain-1990/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/evolutionary-biology/woese-three-domain-1990/</guid><description>Woese, Kandler, and Wheelis proposed the three-domain system in 1990, replacing the prokaryote-eukaryote dichotomy with Bacteria, Archaea, and Eucarya.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Systematization</strong> paper with a strong <strong>Position</strong> component.</p>
<p><strong>Systematization</strong>: It synthesizes decades of molecular sequence data (specifically rRNA) to propose a &ldquo;formal system of organisms&rdquo; that replaces previous taxonomies.</p>
<p><strong>Position</strong>: It argues that the prevailing &ldquo;Prokaryote-Eukaryote&rdquo; and &ldquo;Five Kingdom&rdquo; models are &ldquo;outmoded,&rdquo; &ldquo;misleading,&rdquo; and based on &ldquo;flawed premises&rdquo; regarding the organization of life.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The authors aim to align formal taxonomy with the &ldquo;natural system emerging from molecular data&rdquo;.</p>
<p><strong>The Problem</strong>: Existing systems (Whittaker&rsquo;s Five Kingdoms) were based on morphology and nutrition, which are insufficient for microbial classification.</p>
<p><strong>The Gap</strong>: The &ldquo;Prokaryote&rdquo; definition was negative (defined by what they <em>lack</em>, a nucleus), obscuring the fact that &ldquo;Archaebacteria&rdquo; are as distinct from &ldquo;Eubacteria&rdquo; as they are from Eukaryotes.</p>
<p><strong>The Goal</strong>: To establish a taxonomic rank above Kingdom that recognizes the three primary evolutionary lineages.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core contribution is the formal proposal of the <strong>Domain</strong> as the highest taxonomic rank. Specific novel definitions include:</p>
<ol>
<li>
<p><strong>Three Domains</strong>:</p>
<ul>
<li><strong>Bacteria</strong> (formerly Eubacteria): Membrane lipids are diacyl glycerol diesters; eubacterial rRNA.</li>
<li><strong>Archaea</strong> (formerly Archaebacteria): Membrane lipids are isoprenoid glycerol diethers/tetraethers; archaeal rRNA. The term &ldquo;archaebacteria&rdquo; is abandoned to emphasize their independence.</li>
<li><strong>Eucarya</strong> (Eukaryotes): Cells with nuclei; glycerol fatty acyl diester lipids; eukaryotic rRNA.</li>
</ul>
</li>
<li>
<p><strong>Subdivision of Archaea</strong>: The domain is formally split into two kingdoms:</p>
<ul>
<li><strong>Euryarchaeota</strong> (methanogens, halophiles, thermoplasms, sulfate-reducing <em>Archaeoglobus</em>, and thermophiles <em>Thermococcus</em> and <em>Pyrococcus</em>).</li>
<li><strong>Crenarchaeota</strong> (sulfur-dependent extreme thermophiles).</li>
</ul>
</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This paper is a synthesis of phylogenetic analysis. It relies on:</p>
<ul>
<li><strong>rRNA Sequencing</strong>: Comparison of 16S (small subunit) ribosomal RNA sequences. The paper cites over 400 known eubacterial cases of a characteristic structural feature (the 6-nucleotide side bulge at positions 500-545).</li>
<li><strong>Phylogenetic Tree Reconstruction</strong>: Analysis of branching orders and lengths based on rRNA sequence comparisons (citing Woese, 1987).</li>
<li><strong>Paralogous Gene Rooting</strong>: Determining the root of the universal tree by comparing duplicated genes (e.g., elongation factors) that diverged before the three lineages separated.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Tripartite Division</strong>: Life divides into three monophyletic groups. The evolutionary differences among the three domains are more profound than those separating traditional kingdoms such as animals and plants.</li>
<li><strong>Archaea-Eucarya Sisterhood</strong>: The root of the tree separates Bacteria from the other two, making Archaea and Eucarya sister groups.</li>
<li><strong>Molecular Definition</strong>: Phenotypes are replaced by molecular signatures. For example, Bacteria are defined by a 6-nucleotide bulge in the 16S rRNA (positions 500-545), whereas Archaea and Eucarya share a 7-nucleotide bulge.</li>
<li><strong>&ldquo;Prokaryote&rdquo; as Invalid Taxon</strong>: The paper explicitly argues that &ldquo;Prokaryote&rdquo; is not a valid natural taxon. Because it is defined by what the organisms <em>lack</em> (a nucleus), it groups together two deeply divergent domains (Bacteria and Archaea) by a plesiomorphic character. The term should be abandoned in natural classification.</li>
<li><strong>Domain Replaces Kingdom</strong>: Introducing the Domain rank above Kingdom resolves the issue. A bacterium is no more related to an archaeon than either is to a eukaryote, so all three deserve equivalent top-level status.</li>
<li><strong>Formal Conclusions (adapted from paper)</strong>:
<ol>
<li>Life comprises three primary groupings, the Domains Bacteria, Archaea, and Eucarya.</li>
<li>None of these is ancestral to the others; all descend from a common ancestor.</li>
<li>The Archaea comprise two kingdoms, Euryarchaeota and Crenarchaeota.</li>
<li>Both Bacteria and Eucarya will contain numerous kingdoms; for Eucarya, the paper anticipates preserving Plantae, Animalia, and Fungi while replacing Protista with several kingdoms.</li>
<li>&ldquo;Prokaryote&rdquo; has no phylogenetic meaning and should not be used as a formal taxon.</li>
</ol>
</li>
</ul>
<p><strong>Reception and ongoing debate</strong>: At publication, abandoning &ldquo;prokaryote&rdquo; was a controversial claim. Most microbiology and cell biology textbooks through the 2000s retained the term, and many introductory curricula continue to use it today. The three-domain framework has since been adopted in modern phylogenetics and comparative genomics, but the transition is not yet universal in pedagogy, and some researchers have proposed alternative deep-tree topologies (e.g., the eocyte hypothesis) that differ from Woese&rsquo;s original Archaea-Eucarya sisterhood.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><em>Note: As a theoretical systematics paper from 1990, &ldquo;reproducibility&rdquo; refers to the data sources and criteria used to construct the taxonomy.</em></p>
<h3 id="data">Data</h3>
<p>The taxonomy rests on comparative analysis of <strong>Ribosomal RNA (rRNA)</strong>, specifically the small subunit (16S in prokaryotes, 18S in eukaryotes).</p>
<table>
  <thead>
      <tr>
          <th>Data Type</th>
          <th>Specific Features</th>
          <th>Source Reference</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>16S rRNA</strong></td>
          <td>Region 500-545 (Hairpin Loop)</td>
          <td>Woese et al., 1983</td>
      </tr>
      <tr>
          <td><strong>16S rRNA</strong></td>
          <td>Region 180-197 &amp; 405-498</td>
          <td>Woese et al., 1983</td>
      </tr>
      <tr>
          <td><strong>Membrane Lipids</strong></td>
          <td>Diacyl esters vs. Isoprenoid ethers</td>
          <td>Used for Domain definition</td>
      </tr>
      <tr>
          <td><strong>RNA Polymerase</strong></td>
          <td>Subunit patterns and complexity</td>
          <td>Schnabel et al., 1983; Puhler et al., 1989</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The paper relies on rRNA sequence comparisons to generate the universal tree in <strong>Figure 1</strong>, using phylogenetic methods standard at the time.</p>
<ul>
<li><strong>Tree Inference</strong>: Branching order/lengths taken from <em>Microbiol. Rev. 51, 221-271 (1987)</em>.</li>
<li><strong>Rooting Strategy</strong>: The &ldquo;Outgroup&rdquo; method using anciently duplicated genes (paralogs) such as Elongation Factors Tu and G, which diverged prior to the Universal Ancestor.</li>
</ul>
<h3 id="models">Models</h3>
<p>The &ldquo;Model&rdquo; proposed is the <strong>Three-Domain System</strong>:</p>
<ol>
<li><strong>Domain Bacteria</strong>: Rooted independently. Includes Thermotogales, Flavobacteria, Cyanobacteria, Purple bacteria, Gram-positives, Green nonsulfur.</li>
<li><strong>Domain Archaea</strong>:
<ul>
<li><em>Kingdom Crenarchaeota</em>: &ldquo;Ancestral&rdquo; phenotype (thermophily). Includes <em>Pyrodictium</em>, <em>Thermoproteus</em>.</li>
<li><em>Kingdom Euryarchaeota</em>: &ldquo;Broad&rdquo; phenotype. Includes Methanogens, Halophiles, <em>Thermoplasma</em>, <em>Archaeoglobus</em> (sulfate-reducing), and <em>Thermococcus</em> and <em>Pyrococcus</em> (thermophilic).</li>
</ul>
</li>
<li><strong>Domain Eucarya</strong>: Includes Animals, Ciliates, Plants, Fungi, Flagellates, Microsporidia.</li>
</ol>
<h3 id="evaluation">Evaluation</h3>
<p>The authors validate the model by demonstrating <strong>Molecular Invariants</strong>: features present in all members of a domain but absent in others:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Bacteria</th>
          <th>Archaea</th>
          <th>Eucarya</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>rRNA Loop (500-545)</strong></td>
          <td>6-nt bulge</td>
          <td>7-nt bulge</td>
          <td>7-nt bulge</td>
      </tr>
      <tr>
          <td><strong>Membrane Lipids</strong></td>
          <td>Glycerol fatty acyl diesters</td>
          <td>Isoprenoid glycerol ethers</td>
          <td>Glycerol fatty acyl diesters</td>
      </tr>
      <tr>
          <td><strong>RNA Polymerase</strong></td>
          <td>Simple subunit pattern</td>
          <td>Complex (Eucarya-like)</td>
          <td>Complex (3 separate pols)</td>
      </tr>
  </tbody>
</table>
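<p>These invariants amount to a decision rule. A hypothetical Python sketch (the feature encoding is invented for illustration) classifies an organism from the two signatures in the table:</p>

```python
def classify_domain(rrna_bulge_nt, lipid_type):
    """Assign a domain from the two molecular signatures in the table.

    rrna_bulge_nt: length of the rRNA bulge at positions 500-545 (6 or 7).
    lipid_type: "diester" (glycerol fatty acyl diesters) or
                "ether" (isoprenoid glycerol ethers).
    """
    if rrna_bulge_nt == 6:
        return "Bacteria"  # the 6-nt bulge is diagnostic of Bacteria
    if rrna_bulge_nt == 7:
        # Archaea and Eucarya share the 7-nt bulge; lipid chemistry separates them.
        return "Archaea" if lipid_type == "ether" else "Eucarya"
    raise ValueError("no domain matches these signatures")


print(classify_domain(7, "ether"))    # Archaea
print(classify_domain(6, "diester"))  # Bacteria
```

<p>The point of the paper&rsquo;s molecular definitions is exactly this: membership is decided by discrete sequence and chemical features rather than by gross phenotype.</p>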
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Woese, C. R., Kandler, O., &amp; Wheelis, M. L. (1990). Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. <em>Proc. Natl. Acad. Sci. USA</em>, 87(12), 4576-4579. <a href="https://doi.org/10.1073/pnas.87.12.4576">https://doi.org/10.1073/pnas.87.12.4576</a></p>
<p><strong>Publication</strong>: Proc. Natl. Acad. Sci. USA, Volume 87, Number 12, 1990</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{woeseNaturalSystemOrganisms1990,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Towards a Natural System of Organisms: Proposal for the Domains {{Archaea}}, {{Bacteria}}, and {{Eucarya}}.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Towards a Natural System of Organisms}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Woese, C R and Kandler, O and Wheelis, M L}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1990}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Proceedings of the National Academy of Sciences of the United States of America}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{87}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4576--4579}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0027-8424}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1073/pnas.87.12.4576}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.pnas.org/doi/10.1073/pnas.74.11.5088">Woese&rsquo;s 1977 Discovery of Archaea</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi">NCBI Taxonomy Browser</a></li>
</ul>
]]></content:encoded></item><item><title>Embedded-Atom Method: Theory and Applications Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-review-1993/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-review-1993/</guid><description>Comprehensive 1993 review of the Embedded-Atom Method (EAM), covering theory, parameterization, and applications to metallic systems.</description><content:encoded><![CDATA[<h2 id="systematizing-the-embedded-atom-method">Systematizing the Embedded-Atom Method</h2>
<p>This is a <strong>Systematization (Review)</strong> paper. It consolidates the theoretical development, semi-empirical parameterization, and broad applications of the Embedded-Atom Method (EAM) into a unified framework. The paper systematizes the field by connecting the EAM to related theories (Effective Medium Theory, Finnis-Sinclair, &ldquo;glue&rdquo; models) and organizing phenomenological results across diverse physical regimes (bulk, surfaces, interfaces).</p>
<p>The authors explicitly frame the work as a survey, stating &ldquo;We review here the history, development, and application of the EAM&rdquo; and &ldquo;This review emphasizes the physical insight that motivated the EAM.&rdquo; The paper follows a classic survey structure, organizing the literature by application domains.</p>
<h2 id="the-failure-of-pair-potentials-in-metallic-systems">The Failure of Pair Potentials in Metallic Systems</h2>
<p>The primary motivation is the failure of pair-potential models to accurately describe metallic bonding, particularly at defects and interfaces.</p>
<p><strong>Physics Gap</strong>: Pair potentials assume bond strength is independent of environment, implying cohesive energy scales linearly with coordination ($Z$), whereas in reality it scales roughly as $\sqrt{Z}$.</p>
<p><strong>Empirical Failures</strong>: Pair potentials incorrectly predict the &ldquo;Cauchy relation&rdquo; ($C_{12} = C_{44}$) and predict a vacancy formation energy equal to the cohesive energy, contradicting experimental data for fcc metals.</p>
<p><strong>Practical Need</strong>: First-principles calculations (like DFT) were computationally too expensive for low-symmetry systems like grain boundaries and fracture tips, creating a need for an efficient, semi-empirical many-body potential.</p>
<h2 id="theoretical-unification--core-innovations">Theoretical Unification &amp; Core Innovations</h2>
<p>The paper&rsquo;s core contribution is the synthesis of the EAM as a practical computational tool that captures &ldquo;coordination-dependent bond strength&rdquo; without the cost of ab initio methods.</p>
<p><strong>Theoretical Unification</strong>: It demonstrates that the EAM ansatz can be derived from Density Functional Theory (DFT) by assuming the total electron density is a superposition of atomic densities.</p>
<p><strong>Environmental Dependence</strong>: It explicitly formulates how the &ldquo;effective&rdquo; pair interaction stiffens and shortens as coordination decreases (e.g., at surfaces), a feature naturally arising from the non-linearity of the embedding function.</p>
<p><strong>Broad Validation</strong>: It provides a centralized evaluation of the method across a vast array of metallic properties, establishing it as the standard for atomistic simulations of face-centered cubic (fcc) metals.</p>
<h2 id="validating-eam-across-application-domains">Validating EAM Across Application Domains</h2>
<p>The authors review computational experiments using Energy Minimization, Molecular Dynamics (MD), and Monte Carlo (MC) simulations across several domains:</p>
<p><strong>Bulk Properties</strong>: Calculation of phonon spectra, liquid structure factors, thermal expansion coefficients, and melting points for fcc metals (Ni, Pd, Pt, Cu, Ag, Au).</p>
<p><strong>Defects</strong>: Computation of vacancy formation/migration energies and self-interstitial geometries.</p>
<p><strong>Grain Boundaries</strong>: Calculation of grain boundary structures, energies, and elastic properties for twist and tilt boundaries in Au and Al. Computed structures show good agreement with X-ray diffraction and HRTEM experiments. The many-body interactions in the EAM produce somewhat better agreement than pair potentials, which tend to overestimate boundary expansion.</p>
<p><strong>Surfaces</strong>: Analysis of surface energies, relaxations, reconstructions (e.g., Au(110) missing row), and surface phonons.</p>
<p><strong>Alloys</strong>: Investigation of heat of solution, surface segregation profiles (e.g., Ni-Cu), and order-disorder transitions.</p>
<p><strong>Mechanical Properties</strong>: Simulation of dislocation mobility, pinning by defects (He bubbles), and crack tip plasticity (ductile vs. brittle fracture modes).</p>
<h2 id="key-outcomes-and-the-limits-of-eam">Key Outcomes and the Limits of EAM</h2>
<p><strong>Many-Body Success</strong>: The EAM successfully reproduces the breakdown of the Cauchy relation and the correct ratio of vacancy formation energy to cohesive energy (~0.35) for fcc metals.</p>
<p><strong>Surface Accuracy</strong>: It correctly predicts that surface bonds are shorter and stiffer than bulk bonds due to lower coordination. It accurately predicts surface reconstructions (e.g., Au(110) $(1 \times 2)$).</p>
<p><strong>Alloy Behavior</strong>: The method naturally captures segregation phenomena, including oscillating concentration profiles in Ni-Cu, driven by the embedding energy.</p>
<p><strong>Limitations</strong>: The method is less accurate for systems with strong directional bonding (covalent materials) or significant Fermi-surface effects, as it assumes spherically averaged electron densities.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Fitting Data</strong>: The semi-empirical functions are fitted to basic bulk properties: lattice constants, cohesive energy, elastic constants ($C_{11}$, $C_{12}$, $C_{44}$), and vacancy formation energy.</p>
<p><strong>Universal Binding Curve</strong>: The cohesive energy as a function of lattice constant is constrained to follow the &ldquo;universal binding curve&rdquo; of Rose et al. to ensure accurate anharmonic behavior.</p>
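<p>One common statement of the Rose et al. relation is $E(a^*) = -E_{coh}(1 + a^*)e^{-a^*}$, where $a^*$ is the deviation of the lattice constant from equilibrium scaled by a length involving the bulk modulus. A brief sketch (the cohesive energy value is illustrative, roughly that of Cu):</p>

```python
import math

def rose_energy(a_star, e_coh=3.54):
    """Universal binding curve: E(a*) = -E_coh * (1 + a*) * exp(-a*).

    a_star is the scaled deviation from the equilibrium lattice constant;
    e_coh (eV/atom) is an illustrative cohesive energy.
    """
    return -e_coh * (1.0 + a_star) * math.exp(-a_star)


# Minimum of -E_coh at equilibrium (a* = 0); the curve is anharmonic,
# so expansion costs less energy than an equal compression.
print(rose_energy(0.0))                      # -3.54
print(rose_energy(0.5) < rose_energy(-0.5))  # True: expansion side lies lower
```

<p>Constraining the fit to this curve, rather than only to the curvature at equilibrium, is what builds in sensible anharmonic (thermal expansion, equation-of-state) behavior.</p>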
<p><strong>Alloy Data</strong>: For binary alloys, dilute heats of alloying are used for fitting cross-interactions.</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Core Ansatz</strong>: The total energy is defined as:</p>
<p>$$E_{coh} = \sum_{i} G_i\left( \sum_{j \neq i} \rho_j^a(R_{ij}) \right) + \frac{1}{2} \sum_{i, j (j \neq i)} U_{ij}(R_{ij})$$</p>
<p>where $G$ is the embedding energy (function of local electron density $\rho$), and $U$ is a pair interaction.</p>
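<p>The ansatz is compact enough to sketch directly. The functional forms below (exponential atomic density, $G(\rho) = -A\sqrt{\rho}$, Born-Mayer-like repulsive pair term) are assumptions chosen to exhibit the $\sqrt{Z}$ coordination scaling, not any of the paper&rsquo;s fitted parameterizations:</p>

```python
import math

def rho_atomic(r, beta=3.0):
    # Assumed exponential atomic density contribution rho^a(r).
    return math.exp(-beta * (r - 1.0))

def embed(rho, A=1.0):
    # Embedding function G(rho) = -A*sqrt(rho): its nonlinearity yields the
    # ~sqrt(Z) scaling of cohesive energy with coordination.
    return -A * math.sqrt(rho)

def pair(r, eps=0.5, alpha=4.0):
    # Short-range repulsive pair term U(r) (assumed Born-Mayer form).
    return eps * math.exp(-alpha * (r - 1.0))

def eam_energy(positions):
    """E = sum_i G(sum_{j!=i} rho_j(R_ij)) + (1/2) sum_{i!=j} U(R_ij)."""
    energy = 0.0
    for i, ri in enumerate(positions):
        rho = 0.0
        for j, rj in enumerate(positions):
            if j == i:
                continue
            r = math.dist(ri, rj)
            rho += rho_atomic(r)          # host density at atom i
            energy += 0.5 * pair(r)       # half to avoid double counting
        energy += embed(rho)
    return energy


print(eam_energy([(0, 0, 0), (1, 0, 0)]))  # dimer at r = 1: -1.5
# Sublinear bond strength: doubling the density costs less than twice
# the single-neighbor embedding energy, unlike any pair potential.
print(embed(2.0) > 2 * embed(1.0))  # True
```

<p>Because $G$ is concave, each additional neighbor strengthens cohesion by less than the last; a pure pair potential ($G \equiv 0$) would instead scale linearly with coordination.</p>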
<p><strong>Simulation Techniques</strong>:</p>
<ul>
<li><strong>Molecular Dynamics (MD)</strong>: Used for liquids, phonons, and fracture simulations.</li>
<li><strong>Monte Carlo (MC)</strong>: Used for phase diagrams and segregation profiles (e.g., approximately $10^5$ iterations per atom).</li>
<li><strong>Phonons</strong>: Calculated via the dynamical matrix derived from the force-constant tensor $K_{ij}$.</li>
<li><strong>Normal-Mode Analysis</strong>: Vibrational normal modes obtained by diagonalizing the dynamical matrix, feasible for unit cells of up to about 260 atoms.</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Parameterizations</strong>: The review lists several specific function sets developed by the authors (Table 2), including:</p>
<ul>
<li><strong>Daw and Baskes</strong>: For Ni, Pd, H (elemental metals and H in solution/on surfaces)</li>
<li><strong>Foiles</strong>: For Cu, Ag, Au, Ni, Pd, Pt (elemental metals)</li>
<li><strong>Foiles</strong>: For Cu, Ni (tailored for the Ni-Cu alloy system)</li>
<li><strong>Foiles, Baskes and Daw</strong>: For Cu, Ag, Au, Ni, Pd, Pt (dilute alloys)</li>
<li><strong>Daw, Baskes, Bisson and Wolfer</strong>: For Ni, H (fracture, dislocations, H embrittlement)</li>
<li><strong>Foiles and Daw</strong>: For Ni, Al (Ni-rich end of the Ni-Al alloy system)</li>
<li><strong>Daw</strong>: For Ni (calculated from first principles, not semi-empirical)</li>
<li><strong>Hoagland, Daw, Foiles and Baskes</strong>: For Al (elemental Al)</li>
</ul>
<p>Many of these historical parameterizations are directly downloadable in machine-readable formats from the NIST Interatomic Potentials Repository (linked in the resources below).</p>
<p><strong>Transferability</strong>: EAM functions are generally <em>not</em> transferable between different parameterization sets; mixing functions from different sets (e.g., Daw-Baskes Ni with Foiles Pd) is invalid.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Bulk Validation</strong>: Phonon dispersion curves for Cu show excellent agreement with experiment across the full Brillouin zone.</p>
<p><strong>Thermal Properties</strong>: Linear thermal expansion coefficients match experiment well (e.g., Cu calculated: $16.4 \times 10^{-6}/K$ vs experimental: $16.7 \times 10^{-6}/K$).</p>
<p><strong>Defect Energetics</strong>: Vacancy migration energies and divacancy binding energies (~0.1-0.2 eV) align with experimental data.</p>
<p><strong>Surface Segregation</strong>: Correctly predicts segregation species for 18 distinct dilute alloy cases (e.g., Cu segregating in Ni).</p>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute Scale</strong>: At the time of publication (1993), Molecular Dynamics simulations of up to 35,000 atoms were possible.</p>
<p><strong>Platforms</strong>: Calculations were performed on supercomputers like the <strong>CRAY-XMP</strong>, though smaller calculations were noted as feasible on high-performance workstations.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Daw, M. S., Foiles, S. M., &amp; Baskes, M. I. (1993). The embedded-atom method: a review of theory and applications. <em>Materials Science Reports</em>, 9(7-8), 251-310. <a href="https://doi.org/10.1016/0920-2307(93)90001-U">https://doi.org/10.1016/0920-2307(93)90001-U</a></p>
<p><strong>Publication</strong>: Materials Science Reports 1993</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{dawEmbeddedatomMethodReview1993,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The embedded-atom method: a review of theory and applications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{The Embedded-Atom Method}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Daw, Murray S. and Foiles, Stephen M. and Baskes, Michael I.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1993</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = mar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Materials Science Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7-8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{251--310}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{0920-2307}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1016/0920-2307(93)90001-U}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/">Original EAM Paper (1984)</a></li>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-voter-1994/">EAM User Guide (1994)</a></li>
<li><a href="https://www.ctcms.nist.gov/potentials/">NIST Interatomic Potentials Repository</a></li>
</ul>
]]></content:encoded></item><item><title>Embedded-Atom Method User Guide: Voter's 1994 Chapter</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-voter-1994/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-voter-1994/</guid><description>Comprehensive user guide for the Embedded-Atom Method (EAM), covering theory, potential fitting, and applications to intermetallics.</description><content:encoded><![CDATA[<h2 id="contribution-systematizing-the-embedded-atom-method">Contribution: Systematizing the Embedded-Atom Method</h2>
<p>This is a <strong>Systematization</strong> paper (specifically a handbook chapter) with a strong secondary <strong>Method</strong> projection.</p>
<p>Its primary goal is to serve as a &ldquo;users&rsquo; guide&rdquo; to the Embedded-Atom Method (EAM). The text organizes existing knowledge:</p>
<ul>
<li>It traces the physical origins of EAM from Density Functional Theory (DFT) and Effective Medium Theory.</li>
<li>It synthesizes &ldquo;closely related methods&rdquo; (Second Moment Approximation, Glue Model), showing they are mathematically equivalent or very similar to EAM.</li>
<li>It provides a pedagogical, step-by-step methodology for fitting potentials to experimental data.</li>
</ul>
<h2 id="motivation-bridging-the-gap-between-dft-and-pair-potentials">Motivation: Bridging the Gap Between DFT and Pair Potentials</h2>
<p>The primary motivation is to bridge the gap between accurate, expensive electronic structure calculations and fast, inaccurate pair potentials.</p>
<ul>
<li><strong>Computational Efficiency</strong>: First-principles methods scale as $O(N^3)$ or worse, limiting simulations to $&lt;100$ atoms (in 1994). Pair potentials scale as $O(N)$ and fail to capture essential many-body physics of metals.</li>
<li><strong>Physical Accuracy</strong>: Simple pair potentials cannot accurately model metallic defects; they predict zero Cauchy pressure ($C_{12} - C_{44} = 0$) and equate vacancy formation energy to cohesive energy, both of which are incorrect for transition metals.</li>
<li><strong>Practical Utility</strong>: There was a need for a clear guide on how to construct and apply these potentials for large-scale simulations ($10^6+$ atoms) of fracture and defects.</li>
</ul>
<h2 id="novelty-a-unified-framework-and-robust-fitting-recipe">Novelty: A Unified Framework and Robust Fitting Recipe</h2>
<p>As a review chapter, the novelty lies in the synthesis and the specific, reproducible recipe for potential construction. Central to this synthesis is the core EAM energy functional:</p>
<p>$$E_{\text{tot}} = \sum_i \left( F(\bar{\rho}_i) + \frac{1}{2} \sum_{j \neq i} \phi(r_{ij}) \right)$$</p>
<p>where the total energy $E_{\text{tot}}$ depends on embedding an atom $i$ into a local background electron density $\bar{\rho}_i = \sum_{j \neq i} \rho(r_{ij})$, plus a repulsive pair interaction $\phi(r_{ij})$.</p>
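<p>As a sketch of the functional above, a direct $O(N^2)$ evaluation can be written with $F$, $\rho$, and $\phi$ left as arbitrary callables. This is an illustrative implementation, not Voter&rsquo;s code; any fitted functional forms would be supplied by the caller.</p>

```python
import numpy as np

def eam_total_energy(positions, F, rho, phi, r_cut):
    """E_tot = sum_i [ F(rho_bar_i) + (1/2) sum_{j != i} phi(r_ij) ].

    positions : (N, 3) array of atomic coordinates
    F         : embedding function F(rho_bar)
    rho       : density contribution rho(r) from a neighbor at distance r
    phi       : pair potential phi(r)
    r_cut     : cutoff radius beyond which rho and phi are taken as zero
    """
    energy = 0.0
    for i, xi in enumerate(positions):
        rho_bar = 0.0  # host electron density at atom i
        pair = 0.0     # atom i's half-share of its pair interactions
        for j, xj in enumerate(positions):
            if i == j:
                continue
            r = float(np.linalg.norm(xi - xj))
            if r < r_cut:
                rho_bar += rho(r)
                pair += 0.5 * phi(r)
        energy += F(rho_bar) + pair
    return energy
```

<p>For a dimer at separation $1.0$ with toy functions $F(\bar{\rho}) = -\sqrt{\bar{\rho}}$, $\rho(r) = e^{-r}$, $\phi(r) = r^{-12}$, this returns $2F(e^{-1}) + \phi(1)$, matching the hand-computed value.</p>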
<ul>
<li><strong>Unified Framework</strong>: It explicitly maps the &ldquo;Second Moment Approximation&rdquo; (Tight Binding) and the &ldquo;Glue Model&rdquo; onto the fundamental EAM framework above, clarifying that they differ primarily in terminology or specific functional choices (e.g., square root embedding functions).</li>
<li><strong>Cross-Potential Fitting Recipe</strong>: It details a robust method for fitting alloy potentials (specifically Ni-Al-B) by using &ldquo;transformation invariance&rdquo;, scaling the density and shifting the embedding function to fit alloy properties without disturbing pure element fits.</li>
<li><strong>Specific Parameters</strong>: It publishes optimized potential parameters for Ni, Al, and B that accurately reproduce properties like the Boron interstitial preference in $\text{Ni}_3\text{Al}$.</li>
</ul>
<h2 id="validation-computational-benchmarks-and-simulations">Validation: Computational Benchmarks and Simulations</h2>
<p>The &ldquo;experiments&rdquo; described are computational validations and simulations using the fitted Ni-Al-B potential:</p>
<ol>
<li>
<p><strong>Potential Fitting</strong>:</p>
<ul>
<li>Pure elements (Ni, Al) were fitted to elastic constants, vacancy formation energies, and diatomic data. The Ni fit achieved $\chi_{\text{rms}} = 0.75\%$ and Al achieved $\chi_{\text{rms}} = 3.85\%$.</li>
<li>Boron was fitted using hypothetical crystal structures (fcc, bcc) calculated via LMTO (Linear Muffin-Tin Orbital) since experimental data for fcc B does not exist.</li>
</ul>
</li>
<li>
<p><strong>Molecular Statics (Validation)</strong>:</p>
<ul>
<li><strong>Surface Relaxation</strong>: Demonstrated that EAM captures the oscillatory relaxation of atomic layers near a free surface, a many-body effect that pair potentials fail to capture.</li>
<li><strong>Defect Energetics</strong>: Calculated formation energies for Boron interstitials in $\text{Ni}_3\text{Al}$. Found the 6Ni-octahedral site is most stable ($-4.59$ eV relative to an isolated B atom and unperturbed crystal), followed by the 4Ni-2Al octahedral site ($-3.65$ eV) and the 3Ni-1Al tetrahedral site ($-2.99$ eV), consistent with channeling experiments.</li>
</ul>
</li>
<li>
<p><strong>Molecular Dynamics (Application)</strong>:</p>
<ul>
<li><strong>Grain Boundary (GB) Cleavage</strong>: Simulated the fracture of a (210) tilt grain boundary in $\text{Ni}_3\text{Al}$ at a strain rate of $5 \times 10^{10}$ s$^{-1}$.</li>
<li><strong>Comparison</strong>: Compared pure $\text{Ni}_3\text{Al}$ boundaries vs. those doped with Boron and substitutional Nickel.</li>
</ul>
</li>
</ol>
<h2 id="key-outcomes-eam-efficiency-and-boron-strengthening">Key Outcomes: EAM Efficiency and Boron Strengthening</h2>
<ul>
<li><strong>EAM Efficiency</strong>: Confirmed that EAM scales linearly with atom count ($N$), requiring only 2-5 times the computational work of pair potentials.</li>
<li><strong>Boron Strengthening Mechanism</strong>: The simulations suggested that Boron segregates to grain boundaries and, specifically when co-segregated with Ni, significantly increases cohesion.
<ul>
<li>The maximum stress for the enriched boundary was approximately 22 GPa, compared to approximately 19 GPa for the clean boundary.</li>
<li>The B-doped boundary required approximately 44% more work to cleave than the undoped boundary.</li>
<li>The fracture mode shifted from cleaving along the GB to failure in the bulk.</li>
</ul>
</li>
<li><strong>Grain Boundary Segregation</strong>: Molecular statics calculations found B interstitial energies at the GB as low as $-6.9$ eV, compared to $-4.59$ eV in the bulk, consistent with experimental observations of boron segregation to grain boundaries.</li>
<li><strong>Limitations</strong>: The author concludes that while EAM is excellent for metals, it lacks the angular dependence required for strongly covalent materials (like $\text{MoSi}_2$) or directional bonding.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>The chapter provides nearly all details required to implement the described potential from scratch.</p>
<h3 id="data">Data</h3>
<ul>
<li><strong>Experimental/Reference Data</strong>: Used for fitting the cost function $\chi_{\text{rms}}$.
<ul>
<li><strong>Pure Elements</strong>: Lattice constants ($a_0$), cohesive energy ($E_{\text{coh}}$), bulk modulus ($B$), elastic constants ($C_{11}, C_{12}, C_{44}$), vacancy formation energy ($E_{\text{vac}}^f$), and diatomic bond length/strength ($R_e, D_e$).</li>
<li><strong>Alloys</strong>: Heat of solution and defect energies (APB, SISF) for $\text{Ni}_3\text{Al}$.</li>
<li><strong>Hypothetical Data</strong>: LMTO first-principles data used for unobserved phases (e.g., fcc Boron, B2 NiB) to constrain the fit.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Component Functions</strong>:
<ul>
<li><strong>Pair Potential $\phi(r)$</strong>: Morse potential form:
$$\phi(r) = D_M \{1 - \exp[-\alpha_M(r - R_M)]\}^2 - D_M$$</li>
<li><strong>Density Function $\rho(r)$</strong>: Modified hydrogenic 4s orbital:
$$\rho(r) = r^6(e^{-\beta r} + 2^9 e^{-2\beta r})$$</li>
<li><strong>Embedding Function $F(\bar{\rho})$</strong>: Derived numerically to force the crystal energy to match the &ldquo;Universal Energy Relation&rdquo; (Rose et al.) as a function of lattice constant.</li>
</ul>
</li>
<li><strong>Fitting Strategy</strong>:
<ul>
<li><strong>Smooth Cutoff</strong>: A polynomial smoothing function ($h_{\text{smooth}}$) applied at $r_{\text{cut}}$ to ensure continuous derivatives.</li>
<li><strong>Simplex Algorithm</strong>: Used to optimize parameters ($D_M, R_M, \alpha_M, \beta, r_{\text{cut}}$).</li>
<li><strong>Alloy Invariance</strong>: Used transformations $F'(\rho) = F(\rho) + g\rho$ and $\rho'(r) = s\rho(r)$ to fit cross-potentials without altering pure-element properties.</li>
</ul>
</li>
</ul>
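<p>The component functions and smooth cutoff can be sketched directly. The Ni values $D_M = 1.5335$ eV, $\alpha_M = 1.7728$ &Aring;$^{-1}$, and $r_{\text{cut}} = 4.7895$ &Aring; are the chapter&rsquo;s published parameters; $R_M$ and $\beta$ below are illustrative stand-ins (the fitted values live in the chapter&rsquo;s tables), and the shift-plus-slope cutoff is one common polynomial choice, not necessarily the chapter&rsquo;s exact $h_{\text{smooth}}$.</p>

```python
import numpy as np

# Chapter values for Ni: D_M (eV), alpha_M (1/Angstrom), r_cut (Angstrom).
# R_M and BETA are illustrative placeholders, not the fitted values.
D_M, ALPHA_M, R_M, R_CUT = 1.5335, 1.7728, 2.05, 4.7895
BETA = 3.6

def morse(r, D=D_M, alpha=ALPHA_M, R=R_M):
    """Morse pair potential: phi(r) = D {1 - exp[-alpha (r - R)]}^2 - D."""
    return D * (1.0 - np.exp(-alpha * (r - R)))**2 - D

def density(r, beta=BETA):
    """Modified hydrogenic 4s form: rho(r) = r^6 (e^{-beta r} + 2^9 e^{-2 beta r})."""
    return r**6 * (np.exp(-beta * r) + 2.0**9 * np.exp(-2.0 * beta * r))

def smooth_cutoff(f, r_cut=R_CUT, m=20):
    """Wrap f so both its value and first derivative vanish at r_cut."""
    eps = 1e-6
    f_rc = f(r_cut)
    df_rc = (f(r_cut + eps) - f(r_cut - eps)) / (2.0 * eps)  # numerical f'(r_cut)
    def g(r):
        r = np.asarray(r, dtype=float)
        val = f(r) - f_rc + (r_cut / m) * (1.0 - (r / r_cut)**m) * df_rc
        return np.where(r < r_cut, val, 0.0)
    return g
```

<p><code>smooth_cutoff(morse)</code> returns a pair potential that goes continuously and differentiably to zero at $r_{\text{cut}}$, which is what guarantees continuous forces (and hence energy conservation) in molecular dynamics.</p>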
<h3 id="models">Models</h3>
<ul>
<li><strong>Parameters</strong>: The text provides the exact optimized parameters for the Ni-Al-B potential in <strong>Table 2</strong> (Pure elements) and <strong>Table 5</strong> (Cross-potentials).
<ul>
<li>Example Ni parameters: $D_M=1.5335$ eV, $\alpha_M=1.7728$ Å$^{-1}$, $r_{\text{cut}}=4.7895$ Å.</li>
</ul>
</li>
</ul>
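<p>The transformation invariance used in the fitting strategy can be checked numerically. The chapter&rsquo;s shift $F'(\rho) = F(\rho) + g\rho$ leaves the total energy unchanged when paired with the compensating pair shift $\phi'(r) = \phi(r) - 2g\rho(r)$, since $\bar{\rho}_i = \sum_{j \neq i} \rho(r_{ij})$. The cluster and component functions below are toy stand-ins, not the fitted Ni-Al-B forms.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 3.0, size=(6, 3))     # small random monatomic cluster

rho = lambda r: np.exp(-2.0 * r)             # toy density contribution
F   = lambda rb: -np.sqrt(rb)                # toy embedding function
phi = lambda r: np.exp(-r) / r               # toy pair potential

def total_energy(F, phi):
    """Brute-force EAM energy of the cluster (all pairs, no cutoff)."""
    E = 0.0
    for i in range(len(pos)):
        rij = np.linalg.norm(pos - pos[i], axis=1)
        rij = rij[rij > 0.0]                 # drop the self-distance
        E += F(rho(rij).sum()) + 0.5 * phi(rij).sum()
    return E

g = 0.7                                      # arbitrary gauge parameter
F_g   = lambda rb: F(rb) + g * rb            # shifted embedding function
phi_g = lambda r: phi(r) - 2.0 * g * rho(r)  # compensating pair shift

# Both parameterizations describe the same physical model:
E1, E2 = total_energy(F, phi), total_energy(F_g, phi_g)
```

<p>Because the energy is identical for any $g$, the gauge can be chosen to simplify the alloy fit (e.g., normalizing the slope of $F$ at the equilibrium density) without disturbing any pure-element predictions.</p>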
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>1994 Context</strong>: Mentions that simulations of $10^6$ atoms were possible on the &ldquo;fastest computers available&rdquo;.</li>
<li><strong>Scaling</strong>: Explicitly notes computational work scales as $O(N)$, roughly 2-5x slower than pair potentials.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Voter, A. F. (1994). Chapter 4: The Embedded-Atom Method. In <em>Intermetallic Compounds: Vol. 1, Principles</em>, edited by J. H. Westbrook and R. L. Fleischer. John Wiley &amp; Sons Ltd.</p>
<p><strong>Publication</strong>: Intermetallic Compounds: Vol. 1, Principles (1994)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@incollection</span>{voterEmbeddedAtomMethod1994,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The Embedded-Atom Method}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Voter, Arthur F.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Intermetallic Compounds: Vol. 1, Principles}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{Westbrook, J. H. and Fleischer, R. L.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1994}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{John Wiley &amp; Sons Ltd}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{77--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">chapter</span> = <span style="color:#e6db74">{4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.ctcms.nist.gov/potentials/">NIST Interatomic Potentials Repository</a> (Modern repository often hosting EAM files)</li>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method/">Original EAM Paper (1984)</a></li>
<li><a href="/notes/chemistry/molecular-simulation/classical-methods/embedded-atom-method-review-1993/">EAM Review (1993)</a></li>
</ul>
]]></content:encoded></item><item><title>Venus Evolution Through Time: Key Questions and Missions</title><link>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/venus-evolution-through-time/</link><pubDate>Sun, 07 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/venus-evolution-through-time/</guid><description>A review of Widemann and colleagues' synthesis of key science questions, mission concepts, and international cooperation for Venus exploration 2020-2050.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Systematization</strong> paper (referencing the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>) that synthesizes the current state of Venus science and organizes future exploration strategies. It serves as a comprehensive roadmap that consolidates knowledge from prior missions, articulates open questions, and coordinates upcoming international mission concepts.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Venus serves as a natural laboratory for understanding terrestrial planet habitability and evolution. While Earth and Venus share similar mass and bulk geophysical properties, they followed radically different evolutionary paths. Venus is the only spatially resolvable, Earth-sized world that allows us to monitor geophysical envelopes (atmosphere, surface, interior) to support long-term evolutionary models. Major gaps remain regarding the stability of water reservoirs, the transition from a potentially habitable state to the current greenhouse state, and the nature of current geological activity. Understanding Venus directly informs the interpretation of Venus-like exoplanets.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The paper provides a coordinated roadmap for Venus exploration by:</p>
<ol>
<li>Synthesizing key science questions across four domains (comparative planetology, primordial history, surface processes, and interior-atmosphere coupling).</li>
<li>Detailing the instrument suites and science goals of three selected missions (VERITAS, DAVINCI, and EnVision) and demonstrating their synergies.</li>
<li>Identifying technology gaps and future mission concepts required to fully answer the habitability question.</li>
</ol>
<p>The novelty lies in the <strong>coordinated, multi-mission approach</strong> where each mission addresses complementary aspects of Venus science.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This is a review/roadmap paper, so it does not present new experimental results. Instead, it:</p>
<ol>
<li><strong>Synthesizes prior mission data</strong>: Reviews findings from Magellan, Venus Express, Akatsuki, and ground-based radar observations.</li>
<li><strong>Analyzes mission concepts</strong>: Evaluates the science objectives and instrument capabilities of VERITAS, DAVINCI, EnVision, Venera-D, and Shukrayaan-1.</li>
<li><strong>Assesses technology readiness</strong>: Identifies gaps in high-temperature electronics, long-duration surface operations, and aerial platform capabilities.</li>
</ol>
<p>The &ldquo;experiments&rdquo; are the planned observations and measurements from the coordinated fleet of missions in the 2030s.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The paper concludes that:</p>
<ol>
<li><strong>Synergistic approach is essential</strong>: No single mission can answer the habitability question. The fleet provides complementary global mapping (VERITAS), atmospheric chemistry (DAVINCI), and targeted geological analysis (EnVision).</li>
<li><strong>Key measurements identified</strong>: Noble gas isotopes (especially Xenon), D/H ratio, tesserae composition, and surface deformation are critical observables.</li>
<li><strong>Technology gaps remain</strong>: Long-lived surface landers and sample return require advances in high-temperature electronics and aerial platforms.</li>
<li><strong>Venus science informs exoplanet interpretation</strong>: Understanding the Venus Zone and the transition from habitable to runaway greenhouse states directly supports exoplanet characterization.</li>
</ol>
<p>The 2030s represent the most coordinated era of Venus exploration to date, with NASA, ESA, Roscosmos, ISRO, the Chinese Academy of Sciences, and private missions all targeting the planet within a decade.</p>
<h2 id="key-science-questions">Key Science Questions</h2>
<p>The paper organizes open questions into four primary domains.</p>
<h3 id="comparative-planetology-and-exoplanets">Comparative Planetology and Exoplanets</h3>
<p>The <strong>Venus Zone</strong> is defined as the orbital region where an Earth-sized planet is more likely to be a Venus analog than an Earth analog. Understanding Venus directly informs the interpretation of exoplanet observations.</p>
<p><strong>Magma Ocean Duration</strong>: Venus may lie at a boundary defined by magma ocean cooling times:</p>
<ul>
<li><strong>Type I</strong>: Short-lived magma ocean ($\sim 1$ Myr), allowing water condensation (Earth-like).</li>
<li><strong>Type II</strong>: Long-lived magma ocean ($\sim 100$ Myr) due to high insolation, leading to desiccation via photodissociation and escape of the steam atmosphere.</li>
</ul>
<p><strong>Rotation Rate</strong>: Slow rotation is critical for maintaining temperate conditions in the Venus Zone via cloud-albedo feedback. This has implications for habitability assessments of tidally locked exoplanets.</p>
<h3 id="accretion-and-primordial-history">Accretion and Primordial History</h3>
<p><strong>Impact History</strong>: Did Venus suffer a moon-forming giant impact? The absence of a moon challenges assumptions about early large-scale melting events.</p>
<p><strong>Differentiation</strong>: Determining the timing of silicate/metal differentiation (core formation) via Hf/W chronometry is essential to constrain the accretion phase.</p>
<p><strong>Volatile Delivery</strong>: Did volatiles arrive via solar nebula, asteroids, or comets? Xenon isotopes are key to detecting cometary contributions.</p>
<h3 id="surface-processes-and-resurfacing">Surface Processes and Resurfacing</h3>
<p>Two competing resurfacing models exist:</p>
<ul>
<li><strong>Catastrophic</strong>: A massive pulse of volcanism $\sim 1$ Ga ago followed by quiescence (suggested by random crater distribution).</li>
<li><strong>Equilibrium</strong>: Continuous resurfacing where craters are modified gradually.</li>
</ul>
<p><strong>Tesserae Terrain</strong>: Complex, highly deformed tectonic terrains that may represent the oldest surface rocks. Near-IR emissivity data suggesting low iron content indicates they may be felsic (silica-rich), potentially analogous to Earth&rsquo;s continental crust formed in the presence of water.</p>
<p><strong>Active Volcanism</strong>: Evidence includes variable $\text{SO}_2$ levels, emissivity anomalies at hotspots (Idunn Mons), and young lava flows.</p>
<h3 id="interior-and-atmosphere-coupling">Interior and Atmosphere Coupling</h3>
<p><strong>Tectonic Regime</strong>: Venus lacks plate tectonics but has deformation zones. It may be in a &ldquo;stagnant lid&rdquo; regime or a transitional state.</p>
<p><strong>Noble Gases</strong>: Abundances and isotopes (Ne, Ar, Kr, Xe) track atmospheric loss and outgassing history.</p>
<p><strong>Water Loss</strong>: The D/H ratio indicates water loss, but does not uniquely constrain <em>when</em> or <em>how fast</em> it happened.</p>
<h2 id="the-new-fleet-of-missions">The New Fleet of Missions</h2>
<p>A synergistic fleet of three selected missions (plus international partners) will address these questions in the 2030s.</p>
<h3 id="veritas-nasa-orbiter">VERITAS (NASA Orbiter)</h3>
<blockquote>
<p><strong>Status note</strong>: VERITAS was selected in 2021 but placed on indefinite hold by NASA in late 2022 due to budget pressures from the Mars Sample Return program. Its launch date and schedule remain uncertain as of 2026. The science case and instrument descriptions below reflect the mission as designed.</p></blockquote>
<p><strong>Primary Goal</strong>: Global mapping of topography, rock type, and active deformation.</p>
<p><strong>Key Instruments</strong>:</p>
<ul>
<li><strong>VISAR (X-band Radar)</strong>: Global DEM with 300m horizontal postings over 90% of the surface, with a height accuracy requirement of $\leq$10m (achieved accuracy of 5.9m for 95% of the mapped area after bundle adjustment), 30m SAR imagery globally (15m for ~27% of the surface), and interferometry (RPI) to detect cm-scale surface deformation.</li>
<li><strong>VEM (Emissivity Mapper)</strong>: 14 bands total: 6 surface bands (0.86, 0.91, 0.99, 1.02, 1.11, 1.18 $\mu$m) plus 8 atmospheric and calibration bands, mapping surface iron content (felsic vs. mafic) through atmospheric windows.</li>
</ul>
<p><strong>Science Target</strong>: Determine if Venus has &ldquo;continents&rdquo; (felsic tesserae), active volcanism, and subduction-like features. VERITAS provides the global geophysical map and target identification.</p>
<h3 id="davinci-nasa-probeflyby">DAVINCI (NASA Probe/Flyby)</h3>
<p><strong>Primary Goal</strong>: <em>In situ</em> chemical analysis of the deep atmosphere and descent imaging.</p>
<p><strong>Descent Probe Instruments</strong>:</p>
<ul>
<li><strong>VMS (Mass Spectrometer)</strong>: All noble gases (Ne, Ar, Kr, Xe isotopes), trace gases, and D/H ratio throughout descent.</li>
<li><strong>VTLS (Tunable Laser Spectrometer)</strong>: High-precision isotopes of H, S, C, O.</li>
<li><strong>VASI (Atmospheric Structure Investigation)</strong>: Temperature, pressure, winds, and turbulence characterization during the approximately one-hour descent from ~67 km to the surface.</li>
<li><strong>VenDI (Descent Imager)</strong>: Near-IR imaging of the western Alpha Regio tesserae landing ellipse (~348 $\times$ 160 km) at 2&ndash;200m imaging scales, with 5&ndash;60m topographic resolution derived via Structure-from-Motion.</li>
<li><strong>VfOx (Venus Oxygen Fugacity)</strong>: Student-built instrument to measure redox state of the near-surface atmosphere.</li>
</ul>
<p><strong>Carrier Instruments</strong> (flyby observations):</p>
<ul>
<li><strong>VISOR (4-camera UV and near-IR system)</strong>: Cloud structure and albedo mapping during two Venus flybys.</li>
<li><strong>CUVIS (Compact Ultraviolet Imaging System)</strong>: UV spectra of Venus upper cloud and haze.</li>
</ul>
<p><strong>Mission Timeline</strong>: Launch June 2029; Venus flyby 1 January 2030; Venus flyby 2 November 2030; probe descent June 2031 targeting western Alpha Regio tesserae.</p>
<p><strong>Science Target</strong>: Definitive atmospheric origin/evolution, history of water, and nature of tesserae. DAVINCI provides the chemical &ldquo;ground truth&rdquo; and high-res &ldquo;spot check&rdquo; of tesserae.</p>
<h3 id="envision-esa-orbiter">EnVision (ESA Orbiter)</h3>
<p><strong>Primary Goal</strong>: Holistic view from inner core to upper atmosphere, focusing on activity and geological history.</p>
<p><strong>Key Instruments</strong>:</p>
<ul>
<li><strong>VenSAR (S-band Radar)</strong>: Polarimetric imaging and stereo topography.</li>
<li><strong>SRS (Subsurface Radar Sounder)</strong>: Penetrates the subsurface (up to 1 km depth, 20m resolution) to map stratigraphy, buried craters, and tesserae edges.</li>
<li><strong>VenSpec Suite</strong>: Spectroscopy (IR and UV) to link surface activity to atmospheric gas variations ($\text{SO}_2$, $\text{H}_2\text{O}$).</li>
</ul>
<p><strong>Science Target</strong>: Characterize the sequence of geological events, subsurface layering, and atmospheric-interior coupling. EnVision provides targeted, multi-scale geological analysis and subsurface sounding.</p>
<h3 id="international-partners">International Partners</h3>
<p><strong>Venera-D (Russia)</strong>: Orbiter + Lander.</p>
<ul>
<li>The lander focuses on surface X-ray diffraction and fluorescence (XRD/XRF) analysis (mineralogy) and surviving 2-3 hours.</li>
<li>Includes an aerial platform (balloon) for cloud layer analysis.</li>
</ul>
<p><strong>Shukrayaan-1 (India)</strong>: Orbiter.</p>
<ul>
<li>Features a polarimetric radar (VSAR) and potentially a low-frequency subsurface sounder.</li>
</ul>
<p><strong>VOICE (China)</strong>: Venus Volcano Imaging and Climate Explorer (Dong et al. 2023), an orbiter carrying a Polarimetric Synthetic Aperture Radar (PolSAR), a Microwave Radiometric Sounder (MWRS), and a UV-Visible-Near IR Multi-Spectral Imager (UVN-MSI) on a ~350 km polar orbit, complementary to VERITAS and EnVision.</p>
<p><strong>Morning Star (Rocket Lab)</strong>: Private low-cost small entry probe mission concept (Seager et al. 2021), the Venus Life Finder mission, carrying an ultraviolet autofluorescence backscatter nephelometer to characterize cloud particles and search for biosignatures during descent through the clouds.</p>
<p><strong>CLOVE (Korea)</strong>: Earth-orbiting CubeSat concept by the Institute for Basic Science (IBS) of South Korea, designed to monitor Venus&rsquo;s long-term atmospheric variability from 320 nm to the near-infrared.</p>
<h2 id="future-concepts-and-technology-gaps">Future Concepts and Technology Gaps</h2>
<p>To fully answer the &ldquo;habitability&rdquo; question, investigations beyond the current fleet are required.</p>
<h3 id="long-lived-surface-landers">Long-Lived Surface Landers</h3>
<p><strong>Challenge</strong>: Electronics cannot survive Venus surface temperatures ($470^{\circ}\text{C}$) for long periods.</p>
<p><strong>Solution</strong>: High-temperature electronics (SiC, GaN) and battery technology.</p>
<p><strong>Science Goal</strong>: Seismology. Measuring &ldquo;Venusquakes&rdquo; is the only way to definitively resolve the core state and interior structure.</p>
<h3 id="aerial-platforms-balloons">Aerial Platforms (Balloons)</h3>
<p><strong>Environment</strong>: The cloud layer (50&ndash;60 km) is the &ldquo;habitable zone&rdquo; ($20^{\circ}\text{C}$, 0.5 atm).</p>
<p><strong>Science Goals</strong>:</p>
<ul>
<li>Long-term monitoring of atmospheric circulation and chemistry.</li>
<li><strong>Aerial Seismology</strong>: Detecting infrasound generated by groundquakes from the air (mechanical coupling is $60\times$ stronger on Venus than Earth).</li>
</ul>
<h3 id="sample-return">Sample Return</h3>
<p><strong>Concept</strong>: Skimming the upper atmosphere ($&lt; 120$ km) to collect noble gases and returning them to Earth for high-precision laboratory analysis.</p>
<h2 id="synergies-with-exoplanet-science">Synergies with Exoplanet Science</h2>
<p>Observations of Venus-like exoplanets (e.g., TRAPPIST-1 system) by JWST provide the statistical context for Venus&rsquo;s divergent evolution. The upcoming decade represents a coordinated campaign:</p>
<ol>
<li><strong>VERITAS</strong> provides the global geophysical map and target identification.</li>
<li><strong>DAVINCI</strong> provides the chemical &ldquo;ground truth&rdquo; and high-res &ldquo;spot check&rdquo; of tesserae.</li>
<li><strong>EnVision</strong> provides targeted, multi-scale geological analysis and subsurface sounding.</li>
</ol>
<p>Understanding Venus allows us to interpret spectra from Venus analogs around other stars, making Venus exploration directly relevant to the search for habitable worlds beyond our solar system.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>This is a review and roadmap paper, so there are no code, model, or dataset artifacts to reproduce. The paper is published open access in <em>Space Science Reviews</em> under a CC license. All referenced mission design documents and companion articles in the same volume are cited and accessible through their respective DOIs.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Widemann, T., Smrekar, S. E., Garvin, J. B., et al. (2023). Venus Evolution Through Time: Key Science Questions, Selected Mission Concepts and Future Investigations. <em>Space Science Reviews</em>, 219(7), 56. <a href="https://doi.org/10.1007/s11214-023-00992-w">https://doi.org/10.1007/s11214-023-00992-w</a></p>
<p><strong>Publication</strong>: <em>Space Science Reviews</em>, 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Widemann2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Widemann, Thomas and Smrekar, Suzanne E. and Garvin, James B. and Straume-Lindner, Anne Grete and Ocampo, Adriana C. and Schulte, Mitchell D. and Voirin, Thomas and Hensley, Scott and Dyar, M. Darby and Whitten, Jennifer L. and Nunes, Daniel C. and Getty, Stephanie A. and Arney, Giada N. and Johnson, Natasha M. and Kohler, Erika and Spohn, Tilman and O&#39;Rourke, Joseph G. and Wilson, Colin F. and Way, Michael J. and Ostberg, Colby and Westall, Frances and H{\&#34;o}ning, Dennis and Jacobson, Seth and Salvador, Arnaud and Avice, Guillaume and Breuer, Doris and Carter, Lynn and Gilmore, Martha S. and Ghail, Richard and Helbert, J{\&#34;o}rn and Byrne, Paul and Santos, Alison R. and Herrick, Robert R. and Izenberg, Noam and Marcq, Emmanuel and Rolf, Tobias and Weller, Matt and Gillmann, Cedric and Korablev, Oleg and Zelenyi, Lev and Zasova, Ludmila and Gorinov, Dmitry and Seth, Gaurav and Rao, C. V. Narasimha and Desai, Nilesh}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Venus Evolution Through Time: Key Science Questions, Selected Mission Concepts and Future Investigations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Space Science Reviews}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{219}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{56}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11214-023-00992-w}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1007/s11214-023-00992-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Life on Venus? Astrobiology and the Habitability Limits</title><link>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/life-on-venus/</link><pubDate>Fri, 05 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/life-on-venus/</guid><description>A systematic analysis of Venus's habitability limits, reviewing temperature, pressure, and acidity constraints against Cockell's predictions.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>synthetic review</strong> that evaluates Venus&rsquo;s past and present habitability by comparing physical conditions against the known limits of terrestrial extremophiles. It is a systematization of knowledge paper that rigorously analyzes environmental constraints based on existing literature.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The core question is: <em>To what degree were past habitats or are present habitats on Venus suitable for life?</em> Beyond the solar system, Cockell frames Venus as a critical <strong>template for extrasolar greenhouse planets</strong>, using it to establish baseline habitability constraints that should guide spectroscopic observations of Venus-like exoplanets. The paper systematically examines each environmental parameter (temperature, pressure, atmospheric composition, UV radiation, pH) to identify which are true biological barriers and which are surmountable based on what we know from terrestrial extremophiles.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The paper provides a rigorous, parameter-by-parameter assessment of Venus&rsquo;s habitability. The key insight is that <strong>temperature acts as the critical constraint</strong>, establishing a hierarchy for greenhouse planets where thermal limits are reached well before pressure limits. This suggests that surface pressure is rarely the primary exclusion factor for life on Venus-like exoplanets. While the surface is sterile, the cloud layers between 48-57 km altitude present a more nuanced picture where temperature and pressure fall within habitable ranges, though extreme acidity and low water activity pose the primary biological challenges.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This review paper evaluates Venus&rsquo;s environmental conditions by synthesizing data from the Venera and Pioneer missions and comparing them against the documented limits of terrestrial extremophiles (thermophiles like <em>Pyrolobus fumarii</em>, acidophiles like <em>Picrophilus</em>, and obligate barophiles from the Mariana Trench). It assesses theoretical metabolic pathways based on available chemical energy sources in the clouds.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The paper concludes that:</p>
<ul>
<li><strong>Surface</strong>: Uninhabitable due to extreme temperature ($464^\circ\text{C}$), which exceeds biochemical limits</li>
<li><strong>Cloud layers (48-57 km)</strong>: Physically compatible with life (temperature, pressure, nutrients) but extreme acidity ($81\text{&ndash;}98\%\ \text{H}_2\text{SO}_4$, $\text{pH} \approx 0$) and low water activity present severe challenges</li>
<li><strong>Early Venus</strong>: May have had habitable oceans during a &ldquo;moist greenhouse&rdquo; period, with possible interplanetary exchange with early Earth</li>
<li><strong>Future missions</strong>: Should target cloud samples between 48-57 km altitude and look for sulfur isotope fractionation as biosignatures</li>
</ul>
<h2 id="how-cockells-1999-predictions-hold-up-today">How Cockell&rsquo;s 1999 Predictions Hold Up Today</h2>
<p>From a modern perspective (2026), Cockell&rsquo;s analysis remains the foundational baseline for Venusian astrobiology, though specific details have evolved:</p>
<ul>
<li><strong>Phosphine Detection (2020)</strong>: Cockell correctly identified the importance of searching for non-equilibrium trace gases. The <a href="https://doi.org/10.1038/s41550-020-1174-4">claimed detection of phosphine</a> ($\text{PH}_3$) in 2020 reignited interest in the cloud layer hypothesis, but subsequent re-analysis reduced the reported abundance from ~20 ppb to ~1 ppb, and multiple independent teams (Snellen et al. 2020; Villanueva et al. 2021; Thompson 2021) disputed the signal entirely as a likely instrument artifact. The current consensus leans toward a non-detection, though the question remains open pending new observations.</li>
<li><strong>Water Activity Limits (2021)</strong>: Later work (e.g., by <a href="https://doi.org/10.1038/s41550-021-01391-3">Hallsworth et al.</a>) quantified the water activity in Venus&rsquo;s clouds as ~0.004, far below the limit for terrestrial life (~0.585). This reinforces Cockell&rsquo;s concern that acidity and desiccation are the primary barriers, potentially even more severe than he estimated.</li>
<li><strong>Upcoming Missions</strong>: <strong><a href="https://en.wikipedia.org/wiki/DAVINCI">DAVINCI</a></strong> (probe descent June 2031) directly targets the deep atmosphere and cloud chemistry, fulfilling the &ldquo;Descent Probe&rdquo; requirement outlined in this 1999 paper. <strong><a href="https://en.wikipedia.org/wiki/VERITAS_(spacecraft)">VERITAS</a></strong> was selected for global surface mapping but was placed on indefinite hold by NASA in late 2022; its schedule remains uncertain.</li>
</ul>
<h2 id="physical-limits-of-the-venusian-surface">Physical Limits of the Venusian Surface</h2>
<p>The paper evaluates surface conditions against the known limits of terrestrial extremophiles.</p>
<h3 id="temperature-critical-constraint">Temperature (Critical Constraint)</h3>















<figure class="post-figure center ">
    <img src="/img/notes/planetary-science/pyrolobus-fumarii.webp"
         alt="Electron microscope image of Pyrolobus fumarii showing irregular coccoid cell structure"
         title="Electron microscope image of Pyrolobus fumarii showing irregular coccoid cell structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Electron microscope image of <em>Pyrolobus fumarii</em>, which grows optimally at 106°C and defines the upper temperature limit for known life at 113°C. (Manfred Rohde, <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>)</figcaption>
    
</figure>

<ul>
<li><strong>Condition</strong>: The surface is almost globally isothermal at <strong>$464^\circ\text{C}$</strong>.</li>
<li><strong>Biological Limit</strong>: While the known limit at the time was <strong>$113^\circ\text{C}$</strong> (<a href="https://en.wikipedia.org/wiki/Pyrolobus_fumarii"><em>Pyrolobus fumarii</em></a>), Cockell posits a <strong>generic theoretical upper limit of $150^\circ\text{C}$</strong> for his analysis.</li>
<li><strong>Biochemical Barrier</strong>: This theoretical limit sits well below <strong>$250^\circ\text{C}$</strong>, where most peptide bonds hydrolyze in less than 11 minutes (aspartate peptide bonds in less than 1 minute) and ATP decomposes in about 1 second.</li>
<li><strong>Conclusion</strong>: The surface temperature is a hard limit to life. Liquid water cannot exist because $464^\circ\text{C}$ exceeds the critical temperature of water ($374^\circ\text{C}$).</li>
</ul>
<h3 id="pressure-habitable-range">Pressure (Habitable Range)</h3>
<ul>
<li><strong>Condition</strong>: Surface pressure is <strong>9.5 MPa</strong> (~93 atm).</li>
<li><strong>Biological Context</strong>: This is equivalent to ~950 m ocean depth on Earth.</li>
<li><strong>Limit</strong>: Life exists at the Mariana Trench (~110 MPa); researchers have isolated obligate barophiles (such as <em>Shewanella</em>, <em>Moritella</em>, and <em>Colwellia</em>) that grow optimally at high pressures.</li>
<li><strong>Conclusion</strong>: Pressure levels on the surface are within the known tolerance range for piezophilic life.</li>
</ul>
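<p>The &ldquo;~950 m ocean depth&rdquo; equivalence can be sanity-checked with the hydrostatic relation $P = \rho g h$. A minimal sketch, assuming typical seawater density and Earth gravity (values are mine, not the paper&rsquo;s):</p>

```python
# Rough hydrostatic check of the "~950 m ocean depth" equivalence.
# Seawater density and g are assumed values, not taken from the paper.
RHO_SEAWATER = 1025.0  # kg/m^3, typical seawater density
G_EARTH = 9.81         # m/s^2

def equivalent_ocean_depth(pressure_pa: float) -> float:
    """Ocean depth (m) whose hydrostatic pressure matches pressure_pa."""
    return pressure_pa / (RHO_SEAWATER * G_EARTH)

venus_surface_pa = 9.5e6  # 9.5 MPa surface pressure
print(f"~{equivalent_ocean_depth(venus_surface_pa):.0f} m")  # ~945 m
```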
<h3 id="atmospheric-composition-bio-compatible">Atmospheric Composition (Bio-Compatible)</h3>















<figure class="post-figure center ">
    <img src="/img/notes/planetary-science/cyanidium-caldarium.webp"
         alt="Microscope image of Cyanidium and Cyanidiococcus cells showing nucleus, plastid, and mitochondria"
         title="Microscope image of Cyanidium and Cyanidiococcus cells showing nucleus, plastid, and mitochondria"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Thermoacidophilic algae <em>Cyanidium</em> (left) and <em>Cyanidiococcus</em> (right), which can tolerate pure CO₂ atmospheres. (Cho et al. 2020, <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>)</figcaption>
    
</figure>

<ul>
<li><strong>Condition</strong>: $96.5\%\ \text{CO}_2$, $3.5\%\ \text{N}_2$.</li>
<li><strong>Biological Context</strong>: Terrestrial algae like <em>Cyanidium caldarium</em> can tolerate pure $\text{CO}_2$. High $\text{CO}_2$ actually makes carbon assimilation energetically easier compared to Earth&rsquo;s 0.03%.</li>
</ul>
<h3 id="surface-acidity-indeterminate">Surface Acidity (Indeterminate)</h3>
<ul>
<li><strong>Condition</strong>: $\text{SO}_2$ and $\text{SO}_3$ in the atmosphere react with surface minerals to form sulfates. The surface lacks liquid acid, and the mineral chemistry is extremely oxidizing and sulfurous.</li>
<li><strong>Biological Context</strong>: Terrestrial thermoacidophiles (e.g., <em>Acidianus infernus</em>, which grows optimally at $88^\circ\text{C}$ with a pH range of 0.5&ndash;5.5) survive in hot, sulfur-rich, acidic environments. However, these organisms all require liquid water.</li>
<li><strong>Conclusion</strong>: Surface acidity is secondary to temperature as a constraint, and the surface provides no supportive chemistry for life.</li>
</ul>
<h3 id="uv-radiation-not-a-constraint">UV Radiation (Not a Constraint)</h3>
<ul>
<li><strong>Condition</strong>: The thick atmosphere ($\text{CO}_2$) scatters most harmful UVC/UVB via Rayleigh scattering, while sulfur-based absorbers in the upper clouds remove the penetrating remainder.</li>
<li><strong>Evolutionary Argument</strong>: The UV flux in the upper clouds is comparable to the surface of <strong>Archean Earth</strong> (when life evolved), despite Venus being closer to the Sun.</li>
<li><strong>Conclusion</strong>: Since life emerged on Earth under similar radiation conditions, UV flux cannot be considered a life-limiting constraint on Venus today or in its past.</li>
</ul>
<h2 id="the-cloud-habitat-a-potential-niche">The Cloud Habitat: A Potential Niche?</h2>
<p>The paper identifies a &ldquo;habitable zone&rdquo; within the lower and middle cloud layers where physical parameters relax.</p>
<h3 id="altitude-and-conditions-48-68-km">Altitude and Conditions (48-68 km)</h3>
<p>The three cloud layers span very different conditions. The <strong>lower and middle layers (48–57 km) are the most relevant for habitability</strong>: temperature and pressure fall within terrestrial extremophile tolerances there. The upper cloud layer (57–68 km) falls below the freezing point, further limiting metabolic activity. Note that H₂SO₄ concentration increases with depth, so the layers with the most favorable temperature and pressure also carry the highest acidity.</p>
<p>Cockell&rsquo;s Table 1 summarizes the key parameters:</p>
<table>
  <thead>
      <tr>
          <th>Layer</th>
          <th>Altitude</th>
          <th>Temperature</th>
          <th>Particle Sizes (modes, $\mu$m)</th>
          <th>Number/cm$^3$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Upper Cloud</td>
          <td>57–68 km</td>
          <td>$-40^\circ\text{C}\text{&ndash;}0^\circ\text{C}$</td>
          <td>0.30, 2.10</td>
          <td>200–350</td>
      </tr>
      <tr>
          <td>Middle Cloud</td>
          <td>51–57 km</td>
          <td>$0^\circ\text{C}\text{&ndash;}38^\circ\text{C}$</td>
          <td>0.30, 2.80, 6.70</td>
          <td>250–350</td>
      </tr>
      <tr>
          <td>Lower Cloud</td>
          <td>48–51 km</td>
          <td>$38^\circ\text{C}\text{&ndash;}60^\circ\text{C}$</td>
          <td>0.30, 2.80, 6.70</td>
          <td>50–150</td>
      </tr>
  </tbody>
</table>
<p>The overall $\text{H}_2\text{SO}_4$ concentration ranges from approximately 81% in the upper cloud layer to 98% in the lower layers. Pressures range from 0.1 to 1.0 MPa across the cloud deck.</p>
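<p>Table 1&rsquo;s thermal split can be made concrete with a toy screen of each layer against a liquid-water window ($0^\circ\text{C}$ up to the $113^\circ\text{C}$ <em>Pyrolobus fumarii</em> limit). The threshold choices are mine, not Cockell&rsquo;s:</p>

```python
# Toy screen of Table 1's layers against a liquid-water thermal window
# (0 C up to the ~113 C Pyrolobus fumarii limit); thresholds are my choice.
LAYERS = {
    "upper (57-68 km)":  (-40, 0),
    "middle (51-57 km)": (0, 38),
    "lower (48-51 km)":  (38, 60),
}

def overlaps_habitable(t_lo: float, t_hi: float,
                       t_min: float = 0.0, t_max: float = 113.0) -> bool:
    """True if the layer's range shares an interior with [t_min, t_max]."""
    return t_hi > t_min and t_lo < t_max

candidates = [name for name, (lo, hi) in LAYERS.items()
              if overlaps_habitable(lo, hi)]
print(candidates)  # the middle and lower layers pass; the upper is subfreezing
```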
<ul>
<li><strong>Droplet Size</strong>: Particles range from 0.3 to 6.7 $\mu$m across three modes, sufficient in diameter to enclose bacteria (0.2–2 $\mu$m) and even bacterial assemblages.</li>
<li><strong>Residence Time</strong>: Using Stokes&rsquo; law, Cockell calculates that an assemblage of 5-10 bacteria (average size 1.1 $\mu$m) would take <strong>over 200 days</strong> to drop through the lower cloud layer. This exceeds the division time of most bacteria by three orders of magnitude or more, meaning a population could reproduce far faster than it rains out.</li>
</ul>
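<p>The residence-time argument can be sketched with Stokes&rsquo; law. Every parameter value below is an illustrative assumption rather than the paper&rsquo;s exact input, but the result lands in the same regime as Cockell&rsquo;s &gt;200 days:</p>

```python
# Stokes-law sketch of the residence-time argument; every parameter value
# here is an illustrative assumption, not the paper's exact input.
def stokes_velocity(radius_m, rho_particle, rho_gas, g, viscosity):
    """Terminal settling velocity (m/s) of a small sphere, Stokes regime."""
    return 2.0 * (rho_particle - rho_gas) * g * radius_m**2 / (9.0 * viscosity)

RADIUS = 0.55e-6   # m: half of a 1.1 um assemblage diameter
RHO_CELL = 1000.0  # kg/m^3: roughly water-density biomass
RHO_CO2 = 1.7      # kg/m^3: CO2 gas near the lower cloud deck (assumed)
G_VENUS = 8.87     # m/s^2: Venus surface gravity
MU_CO2 = 1.5e-5    # Pa*s: approximate CO2 viscosity at ~310 K

v = stokes_velocity(RADIUS, RHO_CELL, RHO_CO2, G_VENUS, MU_CO2)
days = 3_000.0 / v / 86_400  # fall time through the ~3 km lower layer
print(f"v ~ {v:.1e} m/s, ~{days:.0f} days")  # hundreds of days to rain out
```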
<h3 id="the-primary-challenge-acidity-and-water-activity">The Primary Challenge: Acidity and Water Activity</h3>
<ul>
<li><strong>Acidity</strong>: Cloud droplets are composed of concentrated sulfuric acid, ranging from <strong>$\approx 81\%$</strong> in the upper clouds to <strong>$\approx 98\%$</strong> in the lower layers.</li>
<li><strong>pH</strong>: The pH is effectively <strong>0</strong>.</li>
<li><strong>Biological Limit</strong>: While terrestrial acidophiles (e.g., <em>Picrophilus</em>) grow at pH 0, they require high water activity. The hygroscopic nature of concentrated $\text{H}_2\text{SO}_4$ creates extreme desiccation (osmotic) stress. Microbes typically combat this by synthesizing &ldquo;biocompatible solutes&rdquo; (like betaine, proline, or glycerol) to balance internal pressure, but the energy cost at this extreme may be prohibitive.</li>
</ul>
<h2 id="metabolism-in-the-clouds-theoretical">Metabolism in the Clouds (Theoretical)</h2>
<p>If a microbe could survive the acidity, the paper proposes a theoretical metabolism based on the sulfur cycle.</p>
<h3 id="energy-sources">Energy Sources</h3>
<ul>
<li><strong>Photosynthesis</strong>: Solar flux at the bottom of the cloud layer is ~15% of incident light (about half that on Earth&rsquo;s surface on a clear day), sufficient to drive photosynthesis.</li>
</ul>
<h3 id="chemoautotrophy">Chemoautotrophy</h3>
<ul>
<li><strong>Electron Acceptor</strong>: Sulfate ($\text{SO}_4^{2-}$) is abundant.</li>
<li><strong>Electron Donors</strong>: Hydrogen ($\text{H}_2$) exists at ~25 ppm; carbon monoxide ($\text{CO}$) exists at 30&ndash;50 ppm.</li>
<li><strong>Analogs</strong>: Terrestrial sulfate-reducing bacteria (e.g., <em>Desulfobacterium autotrophicum</em>) serve as biochemical templates.</li>
</ul>
<h3 id="nutrients">Nutrients</h3>
<ul>
<li><strong>Phosphorus</strong>: Present (likely as phosphoric acid).</li>
<li><strong>Nitrogen</strong>: 3.5% of atmosphere, available for fixation.</li>
</ul>
<h2 id="early-venus-and-evolutionary-implications">Early Venus and Evolutionary Implications</h2>
<h3 id="moist-greenhouse-model">Moist Greenhouse Model</h3>
<ul>
<li>Deuterium/Hydrogen ratios suggest early Venus had ~100x more water than today.</li>
<li>A &ldquo;moist greenhouse&rdquo; period may have existed with hot oceans ($&lt; 100^\circ\text{C}$) for several hundred million years.</li>
</ul>
<h3 id="interplanetary-ecology">Interplanetary Ecology</h3>
<ul>
<li>High impact rates on early Earth favored thermophiles.</li>
<li>Transfer of material between Earth and Venus suggests a possible early &ldquo;interplanetary ecology&rdquo; where life could have transferred to Venusian oceans before the runaway greenhouse took over.</li>
</ul>
<h2 id="venus-as-an-exoplanet-analog">Venus as an Exoplanet Analog</h2>
<p>Cockell explicitly frames Venus as a template for understanding extrasolar greenhouse planets. By defining the <strong>sequence of habitability constraints</strong>, the paper argues that temperature becomes a limiting factor well before pressure.</p>
<ul>
<li><strong>Hierarchy of Limits</strong>: On runaway greenhouse planets, surface temperatures will exceed biochemical limits ($&gt;150^\circ\text{C}$) long before pressures exceed piezophilic limits (&gt;110 MPa).</li>
<li><strong>Spectroscopic Strategy</strong>: Consequently, exoplanet surveys should prioritize thermal characterization over pressure estimates when screening for surface habitability. High atmospheric pressure is not, in itself, a disqualifier for life.</li>
</ul>
<h2 id="future-directions-and-search-strategies">Future Directions and Search Strategies</h2>
<p>The paper concludes with specific recommendations for exobiology missions.</p>
<h3 id="planetary-protection">Planetary Protection</h3>
<p>The extreme acidity and temperature of the lower atmosphere likely sterilize incoming spacecraft, mitigating contamination risks.</p>
<h3 id="proposed-missions">Proposed Missions</h3>
<ul>
<li><strong>Descent Probe</strong>: Equipped with a sample collector arm to analyze cloud droplets between 48-57 km.</li>
<li><strong>Balloon Mission</strong>: A free-floating platform to study cloud chemistry and potentially culture organisms in situ.</li>
</ul>
<h3 id="key-biomarkers-to-search-for">Key Biomarkers to Search For</h3>
<ul>
<li><strong>Isotopic Fractionation</strong>: Biological sulfate reduction prefers $^{32}\text{S}$ over $^{34}\text{S}$; analyzing sulfur isotopes in rocks could reveal past life.</li>
<li><strong>Trace Gases</strong>: Precise measurement of non-equilibrium gases ($\text{H}_2, \text{CO}$) in the clouds.</li>
</ul>
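<p>The fractionation signal is conventionally reported in delta notation, $\delta^{34}\text{S} = (R_{\text{sample}}/R_{\text{std}} - 1) \times 1000$, where $R$ is the $^{34}\text{S}/^{32}\text{S}$ ratio. A minimal sketch with hypothetical ratios (illustration only; Cockell does not give numerical examples):</p>

```python
# Delta notation behind sulfur-isotope biosignatures: biological sulfate
# reduction favors 32S, leaving sulfide products depleted in 34S.
R_VCDT = 0.0441626  # 34S/32S ratio of the V-CDT reference standard

def delta_34s(r_sample: float) -> float:
    """delta-34S in per mil relative to V-CDT."""
    return (r_sample / R_VCDT - 1.0) * 1000.0

# Hypothetical ratios chosen for illustration only: a sulfate source and a
# biologically reduced sulfide depleted in the heavy isotope.
sulfate_ratio = 0.04425
sulfide_ratio = 0.04330
print(f"sulfate {delta_34s(sulfate_ratio):+.1f} permil, "
      f"sulfide {delta_34s(sulfide_ratio):+.1f} permil")
```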
<h3 id="earth-based-research-the-missing-venus-analog">Earth-Based Research: The Missing Venus Analog</h3>
<p>We have yet to find a terrestrial microbe that is simultaneously <strong>hyperthermophilic</strong>, <strong>acidophilic</strong>, and capable of <strong>extreme osmoregulation</strong>. Cockell identifies four potential explanations for this absence, each with different implications for whether Venusian life is possible:</p>
<ol>
<li><strong>Energetic Limitations</strong>: The adaptations required (synthesis of biocompatible solutes, continuous proton pumping against low pH, and synthesis of heat shock proteins and thermally stable proteins) are likely to be energetically demanding. The cumulative energy cost of multiple extreme adaptations may exceed what phototrophy or chemoautotrophy can supply. Cockell highlights this as an area needing more theoretical and laboratory experimentation.</li>
<li><strong>Biochemical Incompatibilities</strong>: Some adaptations to extreme environmental parameters may be possible individually but not simultaneously at great extremes for all parameters. Since our knowledge of many of these adaptations is still in its infancy, evaluating these interrelationships in detail for Venus is difficult.</li>
<li><strong>Habitat Limitation on Earth</strong>: Earth simply lacks stable environments that combine all Venus-like stressors. Deep-sea hydrothermal vents provide high temperature and pressure but not extreme acidity or low water activity. Hot springs can be acidic but rarely exceed $90\text{&ndash;}95^\circ\text{C}$. The absence of such combined habitats means evolution has not been driven to produce polyextremophiles.</li>
<li><strong>Insufficient Exploration of the Biosphere</strong>: Studies of organisms in hot regions of the deep subsurface through deep-drilling may yield additional insights. Subsurface organisms subjected to high temperatures and low water activities would provide a useful biochemical template for understanding adaptation requirements relevant to Venus-like environments.</li>
</ol>
<h2 id="comparative-parameter-summary">Comparative Parameter Summary</h2>
<p>Cockell&rsquo;s Table 2 provides a side-by-side assessment of key environmental parameters across Venus&rsquo;s surface, lower cloud layer, early Venus, and generic extrasolar Venus-like planets:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Venus Surface</th>
          <th style="text-align: left">Lower Clouds (48–51 km)</th>
          <th style="text-align: left">Early Venus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Temperature</td>
          <td style="text-align: left">$464^\circ\text{C}$ (<strong>lethal</strong>)</td>
          <td style="text-align: left">$38\text{&ndash;}60^\circ\text{C}$ (habitable)</td>
          <td style="text-align: left">Possibly $&lt; 100^\circ\text{C}$ in oceans</td>
      </tr>
      <tr>
          <td style="text-align: left">Pressure</td>
          <td style="text-align: left">~93 bar (habitable)</td>
          <td style="text-align: left">~1 bar (habitable)</td>
          <td style="text-align: left">~93 bar at surface</td>
      </tr>
      <tr>
          <td style="text-align: left">Atmospheric gas</td>
          <td style="text-align: left">$\text{CO}_2$ (tolerable)</td>
          <td style="text-align: left">$\text{CO}_2$ (tolerable)</td>
          <td style="text-align: left">$\text{CO}_2/\text{H}_2\text{O}$</td>
      </tr>
      <tr>
          <td style="text-align: left">$\text{H}_2\text{SO}_4$</td>
          <td style="text-align: left">Minerals only</td>
          <td style="text-align: left">~98% (<strong>lethal water activity</strong>)</td>
          <td style="text-align: left">Absent (water present)</td>
      </tr>
      <tr>
          <td style="text-align: left">UV radiation</td>
          <td style="text-align: left">Absent (shielded)</td>
          <td style="text-align: left">~Archean Earth (tolerable)</td>
          <td style="text-align: left">Unknown</td>
      </tr>
      <tr>
          <td style="text-align: left">Liquid water</td>
          <td style="text-align: left">Absent</td>
          <td style="text-align: left">Absent (acid droplets only)</td>
          <td style="text-align: left">Possibly present</td>
      </tr>
      <tr>
          <td style="text-align: left">Overall verdict</td>
          <td style="text-align: left"><strong>Uninhabitable</strong></td>
          <td style="text-align: left">Physically possible, chemistry severe</td>
          <td style="text-align: left"><strong>Potentially habitable</strong></td>
      </tr>
  </tbody>
</table>
<p>The table highlights that early Venus is the most favorable scenario, while the present surface is definitively uninhabitable and the cloud layer is a physical-but-not-chemical niche.</p>
<h2 id="connecting-habitability-to-terraforming">Connecting Habitability to Terraforming</h2>
<p>Understanding the baseline habitability of Venus is the first step in conceptualizing planetary engineering. The extreme limits identified here, especially the $464^\circ\text{C}$ surface temperature and $81\text{&ndash;}98\%\ \text{H}_2\text{SO}_4$ clouds, must be mitigated before complex life can take hold.</p>
<p>To explore how we might overcome these physical limits and engineer a second Earth, read my notes on:</p>
<ul>
<li><a href="/notes/interdisciplinary/planetary-science/surface-of-venus/">The Surface of Venus</a> for details on the geological constraints.</li>
<li><a href="/notes/interdisciplinary/planetary-science/venus-evolution-through-time/">Venus Evolution Through Time</a> for the history of its climate catastrophe and potential paths to recovery.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>This is a 1999 theoretical review paper with no associated code, datasets, or models. The paper synthesizes existing mission data (Venera, Pioneer) and published extremophile literature. All environmental parameters cited are drawn from publicly available planetary science databases. The paper is published in <em>Planetary and Space Science</em> (Elsevier), which is paywalled, and no open-access preprint exists (pre-arXiv era for this field).</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cockell, C. S. (1999). Life on Venus. <em>Planetary and Space Science</em>, 47(12), 1487-1501. <a href="https://doi.org/10.1016/S0032-0633(99)00036-7">https://doi.org/10.1016/S0032-0633(99)00036-7</a></p>
<p><strong>Publication</strong>: <em>Planetary and Space Science</em>, 1999</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Life_on_Venus">Wikipedia Article</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cockell1999life,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Cockell, Charles S.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Life on {Venus}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Planetary and Space Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1487--1501}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1999}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1016/S0032-0633(99)00036-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NInChI: Toward a Chemical Identifier for Nanomaterials</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</guid><description>NInChI (Nanomaterials InChI) extends chemical identifiers to represent complex, multi-component nanomaterials.</description><content:encoded><![CDATA[<h2 id="a-new-standard-for-nanoinformatics">A New Standard for Nanoinformatics</h2>
<p>This is a <strong>systematization paper</strong> that proposes a new standard, the NInChI, to address a fundamental gap in nanoinformatics: the lack of a standardized identifier for complex nanomaterials. The result of a collaborative workshop organized by the H2020 research infrastructure NanoCommons and the nanoinformatics project NanoSolveIT, this work uses <strong>six detailed case studies</strong> to systematically develop a <strong>hierarchical, machine-readable notation</strong> for complex nanomaterials that could work across experimental research, regulatory frameworks, and computational modeling.</p>
<h2 id="the-breakdown-of-traditional-chemical-identifiers">The Breakdown of Traditional Chemical Identifiers</h2>
<p>Chemoinformatics has fantastic tools for representing small molecules: SMILES strings, InChI identifiers, and standardized databases that make molecular data searchable and shareable. But when you step into nanotechnology, everything breaks down.</p>
<p>Consider trying to describe a gold nanoparticle with a silica shell and organic surface ligands. How do you capture:</p>
<ul>
<li>The gold core composition and size</li>
<li>The silica shell thickness and interface</li>
<li>The surface chemistry and ligand density</li>
<li>The overall shape and morphology</li>
</ul>
<p>There&rsquo;s simply no standardized way to represent this complexity in a machine-readable format. This creates massive problems for:</p>
<ul>
<li><strong>Data sharing</strong> between research groups</li>
<li><strong>Regulatory assessment</strong> where precise identification matters</li>
<li><strong>Computational modeling</strong> that needs structured input</li>
<li><strong>Database development</strong> and search capabilities</li>
</ul>
<p>Without a standard notation, nanomaterials research suffers from the same data fragmentation that plagued small molecule chemistry before SMILES existed.</p>
<h2 id="the-five-tier-nanomaterial-description-hierarchy">The Five-Tier Nanomaterial Description Hierarchy</h2>
<p>The authors propose NInChI (Nanomaterials InChI), a layered extension to the existing InChI system. The core insight is organizing nanomaterial description from the inside out, following the OECD&rsquo;s framework for risk assessment, with a five-tier hierarchy:</p>
<ol>
<li><strong>Tier 1: Chemical Composition</strong>: What is the core made of? This differentiates uniform compositions (Tier 1.1), randomly mixed (Tier 1.2), ordered core-shell materials (Tier 1.3), and onion-like multi-shell morphologies (Tier 1.4).</li>
<li><strong>Tier 2: Morphology</strong>: What shape, size, and dimensionality? This encodes dimension (0D-3D), size and size distribution, and shape information.</li>
<li><strong>Tier 3: Surface Properties</strong>: Physical and chemical surface parameters such as charge, roughness, and hydrophobicity. Many of these depend on external conditions (pH, solvent, temperature).</li>
<li><strong>Tier 4: Surface Functionalization</strong>: How are coatings attached to the core? This includes functionalization density, orientation, and binding type (covalent vs. non-covalent).</li>
<li><strong>Tier 5: Surface Ligands</strong>: What molecules are on the surface, their density, orientation, and distribution?</li>
</ol>
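<p>As a rough illustration (not the paper&rsquo;s actual string grammar), the five tiers map naturally onto a nested record, inside-out from core composition to surface ligands. The field names and example values below are hypothetical:</p>

```python
# Hypothetical illustration of the five-tier layering as a nested record;
# the actual NInChI string grammar is defined in the paper, not here.
from dataclasses import dataclass, field

@dataclass
class NanoDescription:
    composition: str                                       # Tier 1: core chemistry
    morphology: dict = field(default_factory=dict)         # Tier 2: shape/size/dimension
    surface: dict = field(default_factory=dict)            # Tier 3: surface properties
    functionalization: dict = field(default_factory=dict)  # Tier 4: coating attachment
    ligands: list = field(default_factory=list)            # Tier 5: surface molecules

# Example: a citrate-capped gold nanoparticle (illustrative values)
gold_np = NanoDescription(
    composition="Au",
    morphology={"dimension": "0D", "diameter_nm": 15, "shape": "sphere"},
    surface={"charge": "negative"},
    functionalization={"binding": "non-covalent"},
    ligands=["citrate"],
)
print(gold_np.composition, gold_np.ligands)
```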
<p>This hierarchy captures the essential information needed to distinguish between different nanomaterials while building on familiar chemical concepts.</p>
<h2 id="testing-the-standard-six-case-studies">Testing the Standard: Six Case Studies</h2>
<p>The authors tested their concept against six real-world case studies to identify what actually matters in practice.</p>
<p><strong>Case Study 1: Gold Nanoparticles</strong></p>
<p>Gold NPs provided a relatively simple test case: an inert metallic core with various surface functionalizations. Key insights: core composition and size are essential, surface chemistry (what molecules are attached) matters critically, shape affects properties, and dynamic properties like protein corona formation belong outside the intrinsic NInChI representation. This established the boundary: NInChI should capture intrinsic, stable properties.</p>
<p><strong>Case Study 2: Graphene-Family NMs</strong></p>
<p>Carbon nanotubes and graphene introduced additional complexity: dimensionality (1D tubes vs 2D sheets vs 0D fullerenes), chirality (the (n,m) vector that defines a nanotube&rsquo;s structure), defects and impurities that can alter properties, and number of layers (for nanotubes, single-wall vs multi-wall). This case showed that the notation needed to handle both topological complexity and chemical composition.</p>
<p><strong>Case Study 3: Complex Engineered (Doped and Multi-Metallic) NMs</strong></p>
<p>Doped materials, alloys, and core-shell structures revealed key requirements: the notation must distinguish true alloys (homogeneous mixing) from core-shell structures with the same overall composition, crystal structure information becomes crucial, and component ratios must be precisely specified. The case study assessed whether the MInChI extension could represent these solid solutions.</p>
<p><strong>Case Study 4: Database Applications</strong></p>
<p>The FAIR (Findable, Accessible, Interoperable, Reusable) principles guided this analysis. NInChI addresses real database problems: it provides greater specificity than CAS numbers (which lack nanoform distinction), offers a systematic alternative to ad-hoc naming schemes, and enables machine-searchability.</p>
<p><strong>Case Study 5: Computational Modeling</strong></p>
<p>This explored several applications: automated descriptor generation from NInChI structure, read-across predictions for untested materials, and model input preparation from standardized notation. The layered structure provides structured input that computational tools need for both physics-based and data-driven nanoinformatics approaches.</p>
<p><strong>Case Study 6: Regulatory Applications</strong></p>
<p>Under frameworks like REACH, regulators need to distinguish between different &ldquo;nanoforms&rdquo;, which are materials with the same chemical composition but different sizes, shapes, or surface treatments. NInChI directly addresses this by encoding the specific properties that define regulatory categories, providing precision sufficient for legal definitions and risk assessment frameworks.</p>
<h2 id="the-ninchi-alpha-specification-in-practice">The NInChI Alpha Specification in Practice</h2>
<p>Synthesizing insights from all six case studies, the authors propose the <strong>NInChI alpha specification</strong> (version 0.00.1A), a three-layer structure. Importantly, the paper distinguishes the five-tier NM description hierarchy (outlined above) from the three-layer NInChI notation itself: NM properties from the five tiers are encoded into these three notation layers:</p>
<p><strong>Layer 1 (Version Number)</strong>: Standard header indicating the NInChI version, denoted as <code>0.00.1A</code> for the alpha version. This follows the convention of all <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>-based notations.</p>
<p><strong>Layer 2 (Composition)</strong>: Each component (core, shell, ligands, impurities, dopants, linkers) gets described using standard InChI (or PInChI/MInChI) for chemical composition, with additional sublayers for morphology (prefix <code>m</code>, e.g., <code>sp</code> for sphere, <code>sh</code> for shell, <code>tu</code> for tube), size (prefix <code>s</code>, in scientific notation in meters), crystal structure (prefix <code>k</code>), and chirality (prefix <code>w</code> for carbon nanotubes). Components are separated by <code>!</code>.</p>
<p><strong>Layer 3 (Arrangement)</strong>: Specified with prefix <code>y</code>, this layer describes how the components from Layer 2 are combined, proceeding from inside out. A core-shell material is written as <code>y2&amp;1</code> where the numbers reference components in Layer 2. Covalent bonding between components is indicated with parentheses, e.g., <code>(1&amp;2&amp;3)</code> for a nano core with a covalently bound ligand coating.</p>
<p>The paper provides concrete worked examples from the case studies:</p>
<ul>
<li><strong>Silica with gold coating</strong> (20 nm silica, 2 nm gold shell):
<code>NInChI=0.00.1A/Au/msh/s2t10r1-9;12r2-9!/O2Si/c1-3-2/msp/s20d-9/k000/y2&amp;1</code></li>
<li><strong>CTAB-capped gold nanoparticle</strong> (20 nm diameter):
<code>NInChI=0.00.1A/Au/msp/s20d-9!C19H42N.BrH/c1-5-6-7.../y1&amp;2</code></li>
<li><strong>Chiral single-wall nanotube</strong> of the (3,1) type with 0.4 nm diameter:
<code>NInChI=0.00.1A/C/mtu/s4d-10/w(3,1)/y1</code></li>
</ul>
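<p>The layout above can be parsed mechanically. Here is a minimal sketch (my own, not an official parser): it assumes the conventions exactly as described, components separated by <code>!</code>, sublayers tagged by the prefixes <code>m</code>, <code>s</code>, <code>k</code>, <code>w</code>, a trailing <code>y</code> arrangement layer, and ignores PInChI/MInChI nesting:</p>

```python
# Hypothetical parser for the NInChI alpha layout described above;
# not an official implementation.
PREFIXES = {"m": "morphology", "s": "size", "k": "crystal", "w": "chirality"}

def parse_ninchi(string):
    if not string.startswith("NInChI="):
        raise ValueError("not a NInChI string")
    version, _, rest = string[len("NInChI="):].partition("/")
    segments = rest.split("/")
    # The arrangement (y) layer, when present, is the trailing segment.
    arrangement = segments.pop() if segments and segments[-1].startswith("y") else None
    components = []
    # Components (core, shell, ligands, ...) are separated by '!'.
    for comp in "/".join(segments).split("!"):
        fields = [f for f in comp.split("/") if f]
        if not fields:
            continue
        layers = {"composition": fields[0]}
        for f in fields[1:]:
            # Tag each sublayer by its single-letter prefix; anything
            # unrecognized is treated as an ordinary InChI layer.
            layers[PREFIXES.get(f[0], "inchi_" + f[0])] = f[1:]
        components.append(layers)
    return {"version": version, "components": components, "arrangement": arrangement}

# The chiral (3,1) nanotube example from the paper:
parsed = parse_ninchi("NInChI=0.00.1A/C/mtu/s4d-10/w(3,1)/y1")
```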
<p><strong>Property Prioritization</strong>: The case studies produced a prioritization of NM properties into four categories (Table 3 in the paper):</p>
<table>
  <thead>
      <tr>
          <th>Category 1: Must Have</th>
          <th>Category 2a: Nice to Have</th>
          <th>Category 2b: Extrinsic</th>
          <th>Category 3: Out of Scope</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemical composition</td>
          <td>Structural defects</td>
          <td>Surface charge</td>
          <td>Optical properties</td>
      </tr>
      <tr>
          <td>Size/size distribution</td>
          <td>Density</td>
          <td>Corona</td>
          <td>Magnetic properties</td>
      </tr>
      <tr>
          <td>Shape</td>
          <td>Surface composition</td>
          <td>Agglomeration state</td>
          <td>Chemical/oxidation state</td>
      </tr>
      <tr>
          <td>Crystal structure</td>
          <td></td>
          <td>Dispersion</td>
          <td></td>
      </tr>
      <tr>
          <td>Chirality</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>Ligand and ligand binding</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>Implementation</strong>: The authors built a prototype NInChI generation tool using the ZK framework with a Java backend, available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>. The tool lets users specify core composition, morphology, size, crystal structure, and chirality, then build outward by adding shells or clusters. InChIs for shell components are retrieved via the NCI/CADD chemical structure REST API.</p>
<p><strong>Limitations</strong>: The alpha version acknowledges areas for future development: nanocomposite and nanostructured materials, inverse NMs (nano holes in bulk material), and nanoporous materials are beyond current scope. Dynamic properties such as dissolution, agglomeration, and protein corona formation are excluded. The stochastic nature of NMs (e.g., broad size distributions) is not yet fully addressed. Covalent bonding between components needs further refinement.</p>
<p><strong>Impact</strong>: For researchers, NInChI enables precise structural queries for nanomaterials data sharing. For regulators, it provides systematic identification for risk assessment and nanoform classification under frameworks like REACH. For computational modelers, it enables automated descriptor generation and read-across predictions.</p>
<p><strong>Key Conclusions</strong>: The 8-month collaborative process demonstrates that creating systematic notation for nanomaterials is feasible. The hierarchical, inside-out organization provides an approach that satisfies experimentalists, modelers, database owners, and regulators. Testing against six case studies identified the essential features that must be captured. By extending InChI and reusing conventions from MInChI, RInChI, and PInChI, the work builds on existing infrastructure. The proposed NInChI alpha is intended to stimulate further analysis and refinement with the broader community and the InChI Trust.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The paper is fully open-access under the CC BY 4.0 license, allowing for straightforward reading and analysis.</li>
<li><strong>Tools &amp; Code</strong>: The authors provided a prototype NInChI generation tool available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>, built using the ZK framework with a Java backend. The underlying backend code was not released as an open-source library.</li>
<li><strong>Documentation</strong>: The paper serves as the first alpha specification for community discussion and refinement. No formal algorithmic pseudocode for automated string parsing or generation from structured nanomaterials files (like <code>.cif</code>) is provided.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">NInChI Generator (Enalos Cloud)</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Prototype web tool for generating NInChI strings; backend not open-source</td>
      </tr>
      <tr>
          <td><a href="https://www.mdpi.com/2079-4991/10/12/2493">Paper (MDPI)</a></td>
          <td>Other</td>
          <td>CC BY 4.0</td>
          <td>Open-access alpha specification</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lynch, I., Afantitis, A., Exner, T., Himly, M., Lobaskin, V., Doganis, P., &hellip; &amp; Melagraki, G. (2020). Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies? <em>Nanomaterials</em>, <em>10</em>(12), 2493. <a href="https://doi.org/10.3390/nano10122493">https://doi.org/10.3390/nano10122493</a></p>
<p><strong>Publication</strong>: Nanomaterials (2020)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lynch2020inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lynch, Iseult and Afantitis, Antreas and Exner, Thomas and others}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nanomaterials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2493}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{MDPI}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3390/nano10122493}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The Worldwide Chemical Structure Identifier Standard</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</guid><description>Heller et al. (2013) explain how IUPAC's InChI became the global standard for representing chemical structures, its governance, and current limitations.</description><content:encoded><![CDATA[<h2 id="inchi-as-a-resource-and-systematization-standard">InChI as a Resource and Systematization Standard</h2>
<p>This is a <strong>Resource &amp; Systematization Paper</strong> that reviews the history, technical architecture, governance structure, and implementation status of the InChI standard. It documents both the institutional development of an open chemical identifier and the technical specification that enables it.</p>
<h2 id="the-motivation-interoperability-in-chemical-databases">The Motivation: Interoperability in Chemical Databases</h2>
<p>Before InChI, the chemistry community faced a fundamental interoperability problem. Chemical databases used proprietary systems like CAS Registry Numbers, or implementation-dependent representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, whose canonical forms differ between toolkits. These identifiers were expensive, access-restricted, or confined to &ldquo;in-house&rdquo; databases.</p>
<p>The authors argue the Internet and Open Source software acted as a <strong>&ldquo;black swan&rdquo; event</strong> that disrupted this status quo. The Internet created a need to link diverse, free and fee-based resources without a central gatekeeper. InChI was designed as the solution: a non-proprietary, open-source identifier enabling linking of distinct data compilations.</p>
<h2 id="technical-and-institutional-innovations-of-inchi">Technical and Institutional Innovations of InChI</h2>
<p>InChI&rsquo;s innovation is both technical and institutional:</p>
<p><strong>Technical novelty</strong>: A hierarchical &ldquo;layered&rdquo; canonicalization system where structure representations build from basic connectivity to full stereochemistry. This allows flexible matching: a molecule with unknown stereochemistry produces an InChI that&rsquo;s a subset of the same molecule with known stereochemistry.</p>
<p><strong>Institutional novelty</strong>: Creating an open standard governed by a charitable trust (the InChI Trust) that convinced commercial competitors (publishers, databases) to adopt it as a &ldquo;pre-competitive&rdquo; necessity. This solved the political problem of maintaining an open standard in a competitive industry.</p>
<h3 id="technical-architecture-layers-and-hashing">Technical Architecture: Layers and Hashing</h3>
<h4 id="the-inchi-string">The InChI String</h4>
<p>InChI is a <strong>canonicalized structure representation</strong> derived from IUPAC conventions. It uses a hierarchical &ldquo;layered&rdquo; format where specific layers add detail. The exact technical specification includes these string segments:</p>
<ol>
<li><strong>Main Layer</strong>: Chemical Formula</li>
<li><strong>Connectivity Layer (<code>/c</code>)</strong>: Atoms and bonds (excluding bond orders)</li>
<li><strong>Hydrogen Layer (<code>/h</code>)</strong>: Tautomeric and immobile H atoms</li>
<li><strong>Charge (<code>/q</code>) &amp; Proton Balance (<code>/p</code>)</strong>: Accounting for ionization</li>
<li><strong>Stereochemistry</strong>:
<ul>
<li>Double bond (<code>/b</code>) and Tetrahedral (<code>/t</code>) parity</li>
<li>Parity inversion (<code>/m</code>)</li>
<li>Stereo type (<code>/s</code>): absolute, relative, or racemic</li>
</ul>
</li>
<li><strong>Fixed-H Layer (<code>/f</code>)</strong>: Distinguishes specific tautomers if needed</li>
</ol>
<p>This layered approach means that a molecule with unknown stereochemistry will have an InChI that&rsquo;s a subset of the same molecule with known stereochemistry. This allows for flexible matching at the connectivity level even without complete stereochemical information.</p>
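<p>That matching idea can be sketched directly. The snippet below (a simplification, not the official algorithm) keeps the version and formula segments, then filters the top-level stereo layers <code>/b</code>, <code>/t</code>, <code>/m</code>, <code>/s</code>, ignoring fixed-H and isotope sublayers; the standard InChIs for L- and D-alanine are used for illustration:</p>

```python
STEREO = ("b", "t", "m", "s")  # double bond, tetrahedral, parity inversion, stereo type

def connectivity_core(inchi):
    """Drop the stereo layers so two InChIs can be compared at the
    connectivity level -- the flexible matching the layered design permits.
    Simplified sketch: filters top-level stereo layers only."""
    segments = inchi.split("/")
    head, rest = segments[:2], segments[2:]  # version + formula always kept
    return "/".join(head + [seg for seg in rest if seg[:1] not in STEREO])

# L- and D-alanine differ only in their stereo layers:
l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
assert connectivity_core(l_ala) == connectivity_core(d_ala)
```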
<h4 id="the-inchikey">The InChIKey</h4>
<p>Because InChI strings can be too long for search engines (which break at ~30 characters or at symbols like <code>/</code> and <code>+</code>), the InChIKey was created.</p>
<p><strong>Mechanism</strong>: A 27-character string generated via a <strong>SHA-256 hash</strong> of the InChI string. This can be represented as:</p>
<p>$$ \text{InChIKey} = f_{\text{SHA-256}}(\text{InChI}) $$</p>
<p><strong>Structure</strong>:</p>
<ul>
<li><strong>Block 1 (14 characters)</strong>: Encodes the molecular skeleton (connectivity)</li>
<li><strong>Block 2 (10 characters)</strong>: Eight letters encoding stereochemistry and isotopes, plus a flag indicating standard InChI (S) and an InChI version indicator (A for version 1)</li>
<li><strong>Block 3 (1 character)</strong>: Protonation flag (e.g., &lsquo;N&rsquo; for neutral)</li>
</ul>
<p>Because the InChIKey is a hash, it cannot be converted back to a structure (irreversible) and has a theoretical risk of collision. It is important to distinguish between <strong>InChI collisions</strong> (which are due to flaws/bugs and are very rare) and <strong>InChIKey collisions</strong> (which are mathematically inevitable due to hashing).</p>
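<p>The block layout can be pulled apart directly. A small sketch (ethanol&rsquo;s InChIKey is used as the example; the field names here are my own labels for the blocks described above):</p>

```python
def split_inchikey(key):
    """Decompose a 27-character InChIKey into its three hyphen-separated blocks."""
    block1, block2, block3 = key.split("-")
    if len(block1) != 14 or len(block2) != 10 or len(block3) != 1:
        raise ValueError("not a well-formed InChIKey")
    return {
        "skeleton": block1,             # connectivity hash (14 chars)
        "stereo_isotopes": block2[:8],  # stereochemistry/isotope hash (8 chars)
        "kind": block2[8],              # 'S' = standard InChI
        "version": block2[9],           # 'A' = InChI version 1
        "protonation": block3,          # 'N' = neutral
    }

# Ethanol's InChIKey:
parts = split_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
```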
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This is a systematization paper documenting an existing standard. However, the authors provide:</p>
<p><strong>Validation evidence</strong>:</p>
<ul>
<li><strong>Certification Suite</strong>: A test suite that software vendors must pass to display the &ldquo;InChI Certified&rdquo; logo, preventing fragmentation</li>
<li><strong>Round-trip conversion testing</strong>: Demonstrated &gt;99% success rate converting InChI back to structure (100% with AuxInfo layer)</li>
<li><strong>Real-world adoption metrics</strong>: Documented integration across major chemical databases and publishers</li>
</ul>
<p><strong>Known limitations identified</strong>:</p>
<ul>
<li>Tautomer representation issues in Version 1 (different drawings of same tautomer can generate different InChIs)</li>
<li>Edge cases in stereochemistry representation</li>
</ul>
<h3 id="institutional-history--governance">Institutional History &amp; Governance</h3>
<p><strong>Origin</strong>: The project was initiated at a March 2000 IUPAC meeting in Washington, DC. It was originally called the <strong>IUPAC Chemical Identifier Project (IChIP)</strong>.</p>
<p><strong>Development</strong>: Technical work was done by NIST (Stein, Heller, Tchekhovskoi), overseen by the IUPAC <strong>CCINS</strong> committee, which later became the <strong>InChI Subcommittee</strong> of Division VIII.</p>
<p><strong>The InChI Trust</strong>: To ensure the algorithm survived beyond a volunteer organization, the <strong>InChI Trust</strong> was formed in 2009. It is a UK charity supported by publishers and databases (e.g., Nature, RSC) to maintain the standard pre-competitively. This was a critical innovation: getting commercial publishers and software vendors to agree that a non-proprietary standard would benefit everyone.</p>
<h2 id="real-world-impact-and-future-directions">Real-World Impact and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Success through &ldquo;un-coerced adoption&rdquo;</strong>: InChI succeeded because commercial competitors viewed it as a &ldquo;pre-competitive&rdquo; necessity for the Internet age. The open governance model proved durable.</p>
<p><strong>Technical achievements</strong>:</p>
<ul>
<li>Reversible representation (&gt;99% without AuxInfo, 100% with it)</li>
<li>Hierarchical structure enables flexible matching at different levels of detail</li>
<li>InChIKey enables web search despite being a hash (with inherent collision risk)</li>
</ul>
<h3 id="limitations-acknowledged-as-of-2013">Limitations Acknowledged (as of 2013)</h3>
<ul>
<li><strong>Tautomerism Issues</strong>: Different drawings of the same tautomer (e.g., 1,4-oxime vs nitroso) can generate different InChIs in Version 1, which is targeted for Version 2</li>
<li><strong>Hash collision risk</strong>: InChIKey collisions are mathematically inevitable due to SHA-256 hashing, though InChI collisions (actual bugs) are very rare</li>
<li><strong>Certification required</strong>: To prevent fragmentation, software must pass the InChI Certification Suite</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors note that while this paper documents the state as of 2013, InChI continues to evolve. Tautomer handling and edge cases in stereochemistry representation were priorities for future versions. The governance model through the InChI Trust was designed to ensure long-term maintenance beyond the original volunteer contributors.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This systematization paper documents an existing standard. Key implementation resources are openly maintained by the InChI Trust.</p>
<h3 id="code--software">Code &amp; Software</h3>
<ul>
<li><strong>Official Open Source Implementation</strong>: The C source code and pre-compiled binaries for the InChI algorithm are freely available via the <a href="https://www.inchi-trust.org/downloads/">InChI Trust Downloads Page</a> and their <a href="https://github.com/IUPAC-InChI/InChI">official GitHub repository</a>.</li>
<li><strong>Canonicalization algorithm</strong>: Open-source implementation of IUPAC-based rules for generating unique representations from multiple possible drawings of the same molecule.</li>
</ul>
<h3 id="data--validation">Data &amp; Validation</h3>
<ul>
<li><strong>InChI Certification Suite</strong>: A test suite of chemical structures provided by the InChI Trust used to validate that third-party software implementations generate correct InChIs.</li>
<li><strong>Version 1 specification</strong>: Complete technical documentation of the layered format.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Round-trip conversion</strong>: &gt;99% success rate (100% with AuxInfo) as validated by NIST and IUPAC.</li>
<li><strong>Certification testing</strong>: Pass/fail validation for software claiming InChI compliance.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. <em>Journal of Cheminformatics</em>, <em>5</em>(1), 7. <a href="https://doi.org/10.1186/1758-2946-5-7">https://doi.org/10.1186/1758-2946-5-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{heller2013inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{InChI} - the worldwide chemical structure identifier standard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heller, Stephen and McNaught, Alan and Stein, Stephen and Tchekhovskoi, Dmitrii and Pletnev, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/1758-2946-5-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolRec: Chemical Structure Recognition at CLEF 2012</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/molrec_at_clef/</guid><description>MolRec achieves 95%+ accuracy on simple structures but struggles with complex diagrams, revealing rule-based OCSR limits and systematic failures.</description><content:encoded><![CDATA[<h2 id="systematization-of-rule-based-ocsr">Systematization of Rule-Based OCSR</h2>
<p>This is a <strong>Systematization</strong> paper that evaluates and analyzes MolRec&rsquo;s performance in the CLEF 2012 chemical structure recognition competition. The work provides systematic insights into how the improved MolRec system performed on different types of molecular diagrams and reveals structural challenges facing rule-based OCSR approaches through comprehensive failure analysis.</p>
<h2 id="investigating-the-limits-of-rule-based-recognition">Investigating the Limits of Rule-Based Recognition</h2>
<p>This work builds on the TREC 2011 competition, where a previous implementation of MolRec already performed well. The CLEF 2012 competition provided an opportunity to test an improved, more computationally efficient version of MolRec on different datasets and understand how performance varies across complexity levels.</p>
<p>The motivation is to understand exactly where rule-based chemical structure recognition breaks down. Examining the specific types of structures that cause failures provides necessary context for the high accuracy rates achieved on simpler structures.</p>
<h2 id="the-two-stage-molrec-architecture">The Two-Stage MolRec Architecture</h2>
<p>The novelty lies in the systematic evaluation across two different difficulty levels and the comprehensive failure analysis. The authors tested an improved MolRec implementation that was more efficient than the TREC 2011 version, providing insights into both system evolution and the inherent challenges of chemical structure recognition.</p>
<p><strong>MolRec Architecture Overview</strong>: The system follows a two-stage pipeline approach:</p>
<ol>
<li>
<p><strong>Vectorization Stage</strong>: The system preprocesses input images through three steps:</p>
<ul>
<li><strong>Image binarization</strong> using Otsu&rsquo;s method to convert grayscale images to black and white, followed by labelling of connected components</li>
<li><strong>OCR processing</strong> using nearest neighbor classification with a Euclidean metric to identify and remove text components (atom labels, charges, etc.)</li>
<li><strong>Separation of bond elements</strong>: thinning connected components to single-pixel width, building polyline representations, detecting circles, arrows, and solid triangles, and applying the Douglas-Peucker line simplification algorithm to clean up vectorized bonds</li>
</ul>
</li>
<li>
<p><strong>Rule Engine Stage</strong>: A set of 18 chemical rules converts geometric primitives into molecular graphs:</p>
<ul>
<li><strong>Bridge bond recognition</strong> (2 rules applied before all others, handling structures with multiple connection paths depicted in 2.5-dimensional perspective drawings)</li>
<li><strong>Standard bond and atom recognition</strong> (16 rules applied in arbitrary order)</li>
<li><strong>Context-aware disambiguation</strong> resolving ambiguities using the full graph structure and character groups</li>
<li><strong>Superatom expansion</strong> looking up character groups identifying more than one atom in a dictionary and replacing them with molecule subgraphs</li>
</ul>
</li>
</ol>
<p>The system can output results in standard formats like MOL files or SMILES strings.</p>
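<p>Of the vectorization steps above, the Douglas-Peucker simplification is easy to illustrate. A self-contained sketch of the algorithm (not MolRec&rsquo;s implementation):</p>

```python
import math

def _perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def douglas_peucker(points, eps):
    """Simplify a polyline: keep the endpoints, recurse on the interior
    point farthest from the chord if it lies farther than eps, otherwise
    drop all interior points."""
    if len(points) < 3:
        return list(points)
    dists = [_perp_dist(p, points[0], points[-1]) for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__)
    if dists[i] <= eps:
        return [points[0], points[-1]]
    split = i + 1  # index of the farthest point in the full list
    left = douglas_peucker(points[:split + 1], eps)
    right = douglas_peucker(points[split:], eps)
    return left[:-1] + right  # avoid duplicating the split point

# A noisy near-straight segment collapses to its endpoints:
simplified = douglas_peucker([(0, 0), (1, 0.01), (2, -0.01), (3, 0)], eps=0.1)
```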
<h2 id="clef-2012-experimental-design">CLEF 2012 Experimental Design</h2>
<p>The CLEF 2012 organizers provided a set of 961 test images clipped from patent documents, split into two sets:</p>
<ol>
<li>
<p><strong>Automatic Evaluation Set (865 images)</strong>: Images selected for automatic evaluation by comparison of generated MOL files with ground truth MOL files using the OpenBabel toolkit.</p>
</li>
<li>
<p><strong>Manual Evaluation Set (95 images)</strong>: A more challenging collection of images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation. This set was intentionally included to provide a greater challenge.</p>
</li>
</ol>
<p>The authors ran MolRec four times with slightly adjusted internal parameters, then manually examined every incorrect recognition to categorize error types.</p>
<h2 id="performance-divergence-and-critical-failure-modes">Performance Divergence and Critical Failure Modes</h2>
<p>The results reveal a stark performance gap between simple and complex molecular structures:</p>
<p><strong>Performance on Automatic Evaluation Set</strong>: On the 865-image set, MolRec achieved <strong>94.91% to 96.18% accuracy</strong> across four runs with different parameter settings. A total of 46 different diagrams were mis-recognized across all runs.</p>
<p><strong>Performance on Manual Evaluation Set</strong>: On the 95-image set, accuracy dropped to <strong>46.32% to 58.95%</strong>. A total of 52 different diagrams were mis-recognized across all runs. Some diagrams failed for multiple reasons.</p>
<p><strong>Key Failure Modes Identified</strong> (with counts from the paper&rsquo;s Table 3):</p>
<ul>
<li>
<p><strong>Character Grouping</strong> (26 manual, 0 automatic): An implementation bug caused the digit &ldquo;1&rdquo; to be repeated within atom groups, so $R_{21}$ was incorrectly recognized as $R_{211}$. A separate problem was difficulty correctly separating closely spaced atom groups.</p>
</li>
<li>
<p><strong>Touching Characters</strong> (8 manual, 1 automatic): The system does not handle touching characters, so overlapping characters cause mis-recognition.</p>
</li>
<li>
<p><strong>Four-Way Junction Failures</strong> (6 manual, 7 automatic): The vectorization process could not handle junctions where four lines meet, leading to incorrect connectivity.</p>
</li>
<li>
<p><strong>OCR Errors</strong> (5 manual, 11 automatic): Character recognition errors included &ldquo;G&rdquo; interpreted as &ldquo;O&rdquo;, &ldquo;alkyl&rdquo; being mis-recognized, and &ldquo;I&rdquo; interpreted as a vertical single bond.</p>
</li>
<li>
<p><strong>Missed Solid and Dashed Wedge Bonds</strong> (6 solid and 6 dashed in automatic, 0 manual): The system failed to correctly recognize several solid wedge and dashed wedge bonds.</p>
</li>
<li>
<p><strong>Missed Wavy Bonds</strong> (2 manual, 1 automatic): Some wavy bonds were not recognized despite the dedicated wavy bond rule.</p>
</li>
<li>
<p><strong>Missed Charge Signs</strong> (1 manual, 2 automatic): While correctly recognizing positive charge signs, MolRec missed three negative charge signs, including one placed at the top left of an atom name.</p>
</li>
<li>
<p><strong>Other Errors</strong>: An atom too close to a bond endpoint was erroneously considered connected, a solid wedge bond too close to a closed node was incorrectly connected, and a dashed bold bond had its stereocentre incorrectly determined.</p>
</li>
</ul>
<p><strong>Dataset Quality Issues</strong>: The authors discovered 11 images where the ground truth MOL files were incorrect and MolRec&rsquo;s recognition was actually correct. As the authors note, such ground truth errors are very difficult to avoid in this complex task.</p>
<p><strong>Key Insights</strong>:</p>
<ul>
<li>
<p><strong>Performance gap between simple and complex structures</strong>: While MolRec achieved over 94% accuracy on standard molecular diagrams, the performance drop on the more challenging manual evaluation set (down to 46-59%) highlights the difficulty of real patent document images.</p>
</li>
<li>
<p><strong>Many errors are fixable</strong>: The authors note that many mis-recognition problems (such as the character grouping bug and four-way junction vectorization) can be solved with relatively simple enhancements.</p>
</li>
<li>
<p><strong>Touching character segmentation</strong> remains a notoriously difficult open problem that the authors plan to explore further.</p>
</li>
<li>
<p><strong>Evaluation challenges</strong>: The discovery of 11 incorrect ground truth MOL files illustrates how difficult it is to create reliable benchmarks for chemical structure recognition.</p>
</li>
</ul>
<p>The authors conclude that despite the high recognition rates on simpler structures, there is still plenty of room for improvement. They identify future work areas including recognition of more general Markush structures, robust charge sign spotting, and accurate identification of wedge bonds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="system-architecture">System Architecture</h3>
<p><strong>Model Type</strong>: Non-neural, Rule-Based System (Vectorization Pipeline + Logic Engine)</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Datasets (CLEF 2012)</strong>: 961 total test images clipped from patent documents:</p>
<ul>
<li><strong>Automatic Evaluation Set</strong>: 865 images evaluated automatically using OpenBabel for exact structural matching of generated MOL files against ground truth</li>
<li><strong>Manual Evaluation Set</strong>: 95 images containing elements not supported by OpenBabel (typically Markush structures), requiring manual visual evaluation</li>
</ul>
<p><strong>Training Data</strong>: The paper does not describe the reference set used to build OCR character prototypes for the nearest neighbor classifier.</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Vectorization Pipeline</strong> (three steps):</p>
<ul>
<li><strong>Image Binarization</strong>: Otsu&rsquo;s method, followed by connected component labelling</li>
<li><strong>OCR</strong>: Nearest neighbor classification with Euclidean distance metric; recognized characters are removed from the image</li>
<li><strong>Bond Element Separation</strong>: Thinning to single-pixel width, polyline construction, Douglas-Peucker line simplification (threshold set to 1-2x average line width), detection of circles, arrows, and solid triangles</li>
</ul>
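<p>The Douglas-Peucker simplification step admits a compact sketch. The following is an illustrative re-implementation, not MolRec&rsquo;s code; <code>epsilon</code> plays the role of the paper&rsquo;s 1-2x average-line-width threshold:</p>

```python
import math

def perp_distance(p, a, b):
    """Perpendicular distance from point p to the infinite line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / length

def douglas_peucker(points, epsilon):
    """Recursively drop vertices that deviate less than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior vertex farthest from the chord joining the endpoints.
    dists = [perp_distance(p, points[0], points[-1]) for p in points[1:-1]]
    idx = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[idx - 1] > epsilon:
        left = douglas_peucker(points[:idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return left[:-1] + right   # avoid duplicating the split vertex
    return [points[0], points[-1]]

# A nearly straight stroke collapses to its endpoints...
print(douglas_peucker([(0, 0), (1, 0.01), (2, 0)], 0.1))  # [(0, 0), (2, 0)]
# ...while a genuine corner survives.
print(douglas_peucker([(0, 0), (1, 1), (2, 0)], 0.5))     # [(0, 0), (1, 1), (2, 0)]
```

<p>Tying <code>epsilon</code> to the measured stroke width, as the paper does, makes the simplification scale with the drawing&rsquo;s line thickness.</p>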
<p><strong>Rule Engine</strong>: 18 chemical structure rules converting geometric primitives to molecular graphs:</p>
<ul>
<li><strong>Bridge Bond Rules (2 rules)</strong>: Applied before all other rules, handling bridge bond structures depicted in 2.5-dimensional perspective drawings</li>
<li><strong>Wavy Bond Rule</strong>: Detailed in paper, identifies approximately collinear connected line segments with zig-zag patterns ($n \geq 3$ segments)</li>
<li><strong>Standard Recognition Rules</strong>: 16 rules for bonds, atoms, and chemical features (applied in arbitrary order; most not detailed in this paper)</li>
</ul>
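<p>Only the wavy bond rule is spelled out in the paper. A toy version of its zig-zag test might look as follows; the side-alternation criterion and the <code>max_offset</code> threshold are my guesses at &ldquo;approximately collinear,&rdquo; not the paper&rsquo;s exact geometry:</p>

```python
import math

def side_of_chord(p, a, b):
    """Signed perpendicular offset of p from the chord a->b (cross product / chord length)."""
    chord = math.hypot(b[0] - a[0], b[1] - a[1])
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return cross / chord

def looks_like_wavy_bond(polyline, max_offset=0.5):
    """Heuristic wavy-bond check: at least 3 connected segments whose interior
    vertices alternate sides of the end-to-end chord while staying near it."""
    if len(polyline) < 4 or polyline[0] == polyline[-1]:  # n segments = n+1 points
        return False
    offsets = [side_of_chord(p, polyline[0], polyline[-1]) for p in polyline[1:-1]]
    alternating = all(o1 * o2 < 0 for o1, o2 in zip(offsets, offsets[1:]))
    return alternating and all(abs(o) <= max_offset for o in offsets)

print(looks_like_wavy_bond([(0, 0), (1, 0.3), (2, -0.3), (3, 0.3), (4, 0)]))  # True
print(looks_like_wavy_bond([(0, 0), (1, 0), (2, 0), (3, 0)]))                 # False (straight)
```

<p>A plain straight bond fails the alternation test, so the rule fires only on genuine zig-zags.</p>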
<p><strong>Optimization</strong>: Performance tuned via manual adjustment of fuzzy and strict geometric threshold parameters.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Automated</strong>: Exact structural match via OpenBabel MOL file comparison</li>
<li><strong>Manual</strong>: Visual inspection by human experts for structures where OpenBabel fails</li>
</ul>
<p><strong>Results</strong>:</p>
<ul>
<li><strong>Automatic Evaluation Set (865 images)</strong>: 94.91% to 96.18% accuracy across four runs</li>
<li><strong>Manual Evaluation Set (95 images)</strong>: 46.32% to 58.95% accuracy across four runs</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Given the era (2012) and algorithmic nature (Otsu, thinning, geometric analysis), likely ran on standard CPUs.</p>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p><strong>Closed.</strong> No public code, data, or models are available. The paper outlines high-level logic (Otsu binarization, Douglas-Peucker simplification, 18-rule system) but does not provide:</p>
<ul>
<li>The complete specification of all 18 rules (only Rule 2.2 for wavy bonds is detailed)</li>
<li>Exact numerical threshold values for fuzzy/strict parameters used in the CLEF runs</li>
<li>OCR training data or character prototype specifications</li>
</ul>
<p>The authors refer readers to a separate 2012 DRR (SPIE) paper [5] for a more detailed overview of the MolRec system architecture.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012: Overview and Analysis of Results. Working Notes of CLEF 2012 Evaluation Labs and Workshop. CLEF. <a href="https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf">https://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Workshop (ImageCLEF Track)</p>
]]></content:encoded></item><item><title>The Surface of Venus: Stratigraphy and Resurfacing History</title><link>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/surface-of-venus/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/surface-of-venus/</guid><description>A review of Venus's "stagnant lid" geology and global resurfacing history, exploring why Earth's twin diverged so dramatically from our own planet.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Systematization</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>) that organizes and unifies decades of observational data from multiple planetary missions into a coherent geological framework for Venus.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Venus and Earth are planetary twins, nearly identical in size, mass, and bulk composition. Earth developed a biosphere. Venus developed a surface temperature of $\sim 740\;\text{K}$ ($\sim 467^\circ\text{C}$) and a 93-bar $\text{CO}_2$ atmosphere.</p>
<p><strong>Why did two similar planets diverge so drastically?</strong></p>
<p>Basilevsky and Head synthesize decades of data to answer this. By decoding the geological record preserved in the Venusian crust, they aim to reconstruct the planet&rsquo;s thermal evolution and understand why Venus operates under a &ldquo;stagnant lid&rdquo; regime characterized by a geological cycle of catastrophic global resurfacing.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The paper&rsquo;s contribution is a <strong>comprehensive synthesis</strong> integrating findings from Soviet Venera landers and NASA Magellan radar imaging into a unified geological history. Key novelties:</p>
<ul>
<li><strong>Global Stratigraphy</strong>: Establishes a planet-wide sequence of geological units:
<ol>
<li><strong>Tessera Terrain</strong>: The oldest, highly tectonized crust, forming &ldquo;islands&rdquo; and &ldquo;continents&rdquo; above the plains (~8% of the surface).</li>
<li><strong>Densely Fractured Plains</strong>: Widespread, heavily deformed volcanic plains showing global-scale extensional and shear fracturing.</li>
<li><strong>Ridge Belts</strong>: Linear bands of folded, compressed material (~3–5 km wide ridges), a transitional tectonic phase.</li>
<li><strong>Shield Plains</strong>: Widespread fields of small volcanic shields (5–15 km diameter), emplaced after the ridge belts.</li>
<li><strong>Wrinkle-Ridged Plains (Regional Plains)</strong>: The predominant variety of regional plains (which together with shield plains cover ~70% of the surface), consisting of vast basaltic lava flows marked by compressional ridges from gentle horizontal shortening.</li>
<li><strong>Younger Plains (Lobate/Smooth)</strong>: The most recent volcanic flows, showing little deformation and comprising ~10–15% of the surface.</li>
</ol>
</li>
<li>Surface dominated by <strong>widespread basaltic volcanism and tectonic deformation</strong>, operating under a single-plate regime with no evidence of subduction trenches, island arcs, or mid-oceanic ridges.</li>
<li><strong>The Synchronous Model</strong>: Argues geological units (like regional plains) formed <strong>synchronously</strong> planet-wide, supporting global catastrophic resurfacing events rather than geographically asynchronous activity.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This review synthesizes observational data from multiple spacecraft missions spanning four decades:</p>
<ul>
<li><strong>Radar Imaging and Altimetry (Magellan, 1990–1994)</strong>: Global high-resolution mapping (120–220 m/pixel SAR, altimetry, emissivity), revealing the full inventory of volcanoes, tectonic features, and impact craters.</li>
<li><strong>Venera 15/16 (1983)</strong>: First radar imaging of Venus from 30°N to the north pole at 1–2 km resolution, revealing tessera terrain for the first time.</li>
<li><strong>Lander Missions (Venera 9, 10, 13, 14; Vega 1, 2)</strong>: TV panoramic cameras providing the first direct surface images; gamma-ray and x-ray fluorescence (XRF) analysis confirming <strong>tholeiitic basalt</strong> composition at most sites (Venera 9, 10, 14, Vega 1, 2), with Venera 8 and 13 indicating <strong>alkaline basalt</strong> composition. The Venera 8 landing site, dominated by shield plains, showed elevated potassium, uranium, and thorium, suggesting geochemically evolved material.</li>
<li><strong>Atmospheric Probes (Pioneer Venus, Venera 4–12)</strong>: Atmospheric composition, temperature, and pressure profiles, plus a high D/H ratio ($\sim 0.024$, about 150$\times$ that of Earth&rsquo;s oceans) indicating significant primordial water loss.</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/notes/planetary-science/venus-magellan-topography.webp"
         alt="False-color radar topography map of Venus showing elevation data from the Magellan mission, with highlands in pink/white and lowlands in blue/purple"
         title="False-color radar topography map of Venus showing elevation data from the Magellan mission, with highlands in pink/white and lowlands in blue/purple"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Global topography of Venus from Magellan radar altimetry. Colors indicate planetary radius (elevation), with highlands like Ishtar Terra and Aphrodite Terra shown in pink/white. (NASA/JPL-Caltech)</figcaption>
    
</figure>

<h2 id="the-surface-environment">The Surface Environment</h2>
<p>The Venera landers provided the only direct ground-truth of Venusian surface conditions:</p>
<ul>
<li><strong>Temperature</strong>: $\sim 740\;\text{K}$ at mean surface level ($\sim 6051.5$ km radius), varying by altitude. Maxwell Montes (the highest peak, $+11$ km) reaches $\sim 650\;\text{K}$; deep depressions ($-2$ km) reach $\sim 755\;\text{K}$.</li>
<li><strong>Pressure</strong>: 93 bar at mean surface level, 45 bar at Maxwell Montes summit.</li>
<li><strong>Winds</strong>: Very low near-surface wind speeds ($0.3\text{&ndash;}1\;\text{m s}^{-1}$) at lander sites, but the <strong>zonal wind at cloud top</strong> reaches $\sim 100\;\text{m s}^{-1}$, driving the planet-wide atmospheric super-rotation.</li>
<li><strong>Surface appearance</strong>: The solid surface is very dark and reddish (reflectivity only 0.03–0.1 in visible light). All four Venera landers photographed <strong>platy rocks</strong> with prominent fine layering and soil in local depressions, consistent with either lithified aeolian sediment or thin volcanic tuff. At the Venera 13 and 14 sites, rock bearing capacity was measured at only $3\text{&ndash;}10\;\text{kg cm}^{-2}$, implying porous material.</li>
<li><strong>Chemical weathering</strong>: Thermodynamic calculations predict that basaltic minerals react with atmospheric gases to form magnetite, haematite, quartz, magnesite, anhydrite, and pyrite. In the highlands, above a critical altitude (which varies across the planet), iron in silicates segregates into highly conductive iron oxide or sulfide minerals, producing a radar-bright &ldquo;snow line.&rdquo;</li>
</ul>
<h2 id="geological-terrains-and-features">Geological Terrains and Features</h2>
<h3 id="volcanic-plains-80-of-surface">Volcanic Plains (80% of Surface)</h3>
<p>The vast majority of the surface consists of volcanic plains. Regional plains (including both wrinkle-ridged and shield varieties) cover ~70% of the surface. The dominant variety, <strong>plains with wrinkle ridges</strong>, consists of solidified basaltic lava flows deformed by gentle horizontal compression into networks of narrow (1–2 km wide), low ridges tens to hundreds of km long. Within the plains run sinuous channels (lava tubes or thermal erosion channels), including <strong>Baltis Vallis</strong>, the longest channel in the solar system at 6,800 km, about $\frac{1}{6}$ of Venus&rsquo;s circumference.</p>
<p>Younger volcanic units (10–15% of the surface) include <strong>lobate lava fields</strong> (over 200 fields each $&gt; 50{,}000\;\text{km}^2$) and <strong>smooth plains</strong>, representing the most recent volcanism. The highest volcano, <strong>Maat Mons</strong>, stands 9 km above mean planetary radius (MPR), and its lava flows extend 800 km across.</p>
<h3 id="coronae-a-uniquely-venusian-feature">Coronae: A Uniquely Venusian Feature</h3>
<p>Several hundred <strong>coronae</strong> are among the most distinctive structures on Venus. These oval-to-circular volcanic-tectonic features are typically 100–300 km in diameter (a few exceeding 1,000 km) and are unique to Venus in the solar system. A corona typically consists of:</p>
<ul>
<li>A <strong>tectonically deformed annulus</strong> (circular rim of compressed/fractured terrain) standing a few hundred metres above the surrounding plains.</li>
<li>A <strong>depressed interior</strong> flooded with plains-forming volcanics.</li>
<li><strong>Aprons of young lobate lava flows</strong> radiating outward from the annulus.</li>
</ul>
<p>Coronae form from <strong>rising hot mantle diapirs</strong>: the diapir pushes up the overlying lithosphere and crust, producing magmatic melts that reach the surface as lava flows. When the diapir cools, the uplifted surface collapses, creating the annular structure. Their long-lived circular geometry (rather than being deformed into elongated shapes) is strong evidence that plate tectonics did not operate during their formation.</p>
<h3 id="deformed-terrains-20-of-surface">Deformed Terrains (20% of Surface)</h3>
<p>About 20% of the surface is occupied by rough, tectonically deformed terrains:</p>
<ul>
<li><strong>Ridge belts</strong>: Fragments of globally widespread compressed plains-forming material, now partly flooded by regional plains lavas. Their folded ridges (3–5 km wide) indicate past regional-to-global horizontal compression.</li>
<li><strong>Densely fractured plains</strong>: &ldquo;Islands&rdquo; of 100–200 km extent of heavily fractured plains-forming material, elevated slightly above regional plains. Fracture patterns are often subparallel within a given island, implying global-scale deformation events.</li>
<li><strong>Tessera terrain</strong>: The most highly deformed and probably oldest surface unit, forming elevated &ldquo;islands&rdquo; and &ldquo;continents&rdquo; (e.g. Ishtar Terra at 60–70°N, Fortuna, Ovda, Tellus). The surface is dissected by criss-crossing ridges and grooves a few km wide and tens of km long. Composition is unknown; may be basaltic or more feldspathic (resembling lunar anorthosites or terrestrial granites).</li>
<li><strong>Rifts (Chasmata)</strong>: A global system of extensional troughs up to 40,000 km long, with floors a few km below surrounding terrain. Associated with young post-regional-plain volcanism and coronae chains.</li>
<li><strong>Mountain ranges</strong>: The highest topographic features (Maxwell Montes, 11 km above MPR), formed by intense horizontal compression. Lateral merger with tessera suggests mountain range formation may be the initial stage of tessera formation.</li>
</ul>
<h3 id="impact-craters">Impact Craters</h3>
<p>More than <strong>960 impact craters</strong> from 1.5 to 270 km in diameter have been identified on Venus. Their distribution is <strong>indistinguishable from random</strong>, confirming no plate tectonics (which would preferentially destroy craters). Key characteristics:</p>
<ul>
<li><strong>Atmospheric screening</strong>: The dense atmosphere breaks up relatively small projectiles, so craters smaller than $\sim 10\text{&ndash;}20$ km in diameter have irregular floors caused by impacting swarms of fragments rather than single bodies. The observed size-frequency distribution is well-matched by models of the present-day atmosphere.</li>
<li><strong>Impact melt outflows</strong>: Many craters show <strong>flow-like outflow features</strong> (impact melt) extending tens to hundreds of km from the rim, more abundant than on other planets due to the high surface temperature increasing melt production.</li>
<li><strong>Dark parabola halos</strong>: The youngest craters are associated with radar-dark parabolic haloes formed by fine crater ejecta carried westward by the strong zonal upper atmosphere winds and settled in a parabolic pattern. Parabolas degrade to non-parabolic halos with age, providing a crater aging tool.</li>
</ul>
<h3 id="aeolian-features">Aeolian Features</h3>
<p>In the absence of liquid water, <strong>aeolian (wind) processes</strong> dominate exogenic resurfacing:</p>
<ul>
<li><strong>Wind streaks</strong>: The most abundant aeolian features. Elongated radar-dark or bright features a few to tens of km long, originating from topographic obstacles; represent erosional and depositional products of wind turbulence.</li>
<li><strong>Dark mantles</strong>: Fine-grained debris from impact ejecta, deposited atmospherically and redistributed by wind. Common around impact craters as halos.</li>
<li><strong>Dunes</strong>: Only <strong>two dune fields</strong> identified (one near Fortuna–Meshkenet, one in Lavinia Planitia), each associated with large impact craters that provided debris. The scarcity of dunes implies a general <strong>deficit of sand-sized particles</strong> on Venus.</li>
<li><strong>Candidate yardangs</strong>: Wind-erosional grooves observed near the Mead crater (the largest impact crater on Venus).</li>
</ul>
<h2 id="the-age-debate-synchronous-vs-non-synchronous-resurfacing">The Age Debate: Synchronous vs. Non-Synchronous Resurfacing</h2>
<p>A central unresolved question is whether similar geological units across the planet share the same <strong>absolute age</strong> or different ages:</p>
<ul>
<li><strong>Synchronous model</strong> (favored by authors): Similar units are globally contemporaneous. Evidence: mapping of more than half the planet shows consistent age-sequence relationships across province boundaries; a complete latitude band mapped at 30°N shows the unit sequence is laterally traceable around the planet.</li>
<li><strong>Non-synchronous model</strong> (alternative): Unit L in one province may be younger than unit L in a neighboring province (the same stratigraphic position reflects local, not global, timing). Distinction requires fossil biostratigraphy or isotopic dating, neither of which is possible on Venus without sample return.</li>
</ul>
<p>The synchronous model, if correct, implies a brief period of <strong>intense global volcanism and tectonism</strong> (resurfacing 80–85% of the surface), followed by a dramatic drop to the low-activity stagnant lid regime that persists today.</p>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<figure class="post-figure center ">
    <img src="/img/notes/planetary-science/venus-magellan-radar.webp"
         alt="Magellan radar mosaic of Venus showing the northern hemisphere with volcanic plains, tesserae, and lava flows in orange-brown tones"
         title="Magellan radar mosaic of Venus showing the northern hemisphere with volcanic plains, tesserae, and lava flows in orange-brown tones"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Magellan synthetic aperture radar mosaic of Venus&rsquo;s northern hemisphere, centered on the North Pole. The bright, highly deformed tessera terrain is visible at center, surrounded by darker volcanic plains. (NASA/JPL-Caltech)</figcaption>
    
</figure>

<p>The authors conclude that Venus operates under a <strong>&ldquo;stagnant lid&rdquo; regime</strong>, fundamentally different from Earth&rsquo;s plate tectonics.</p>
<ul>
<li><strong>Mean surface age</strong>: $\sim 500\text{&ndash;}800$ Myr (range $\sim 300\;\text{Myr}$ to $\sim 1\;\text{Gyr}$ with uncertainties), inferred from crater density calibrated against lunar chronology. The pre-regional-plains units and regional plains (occupying 80–85% of the surface together) formed in a compressed time window during the early era.</li>
<li><strong>Two-era history</strong>:
<ol>
<li><strong>Global/Early Era</strong>: Intense, planet-wide volcanic and tectonic activity resurfaced 80–85% of the surface. Mean volcanic rate was comparable to Earth&rsquo;s current mid-oceanic ridge volcanism. Tectonic deformation was most intense (tessera-forming) at the start, waning through ridge belts and wrinkle ridges.</li>
<li><strong>Localized/Late Era</strong>: Beginning $\sim 500\text{&ndash;}1000$ Myr ago and continuing to the present, activity dropped to rates <strong>lower than terrestrial intraplate volcanism</strong> and more comparable to lunar mare volcanism. Concentrated in rift zones ($\sim 4\%$ of the surface); lobate and smooth plains occupy only 10–15% of the surface.</li>
</ol>
</li>
<li><strong>The stagnant lid transition</strong>: Earth releases internal heat gradually through plate tectonics. On Venus, when the lithosphere thickened sufficiently, its yield strength exceeded the tectonic driving stresses (Solomatov &amp; Moresi 1996), locking the planet into a single immobile plate. This caused mantle heating, suppressed melting that feeds surface volcanism, and halted the overturning cycle. Volcanic and tectonic activity may still occur at low rates.</li>
<li><strong>No magnetic field</strong>: Cessation of rapid core cooling (which plate tectonics drives on Earth) likely halted geodynamo action, explaining the absence of an intrinsic magnetic field despite Venus having an Earth-like iron core.</li>
<li><strong>Interior structure</strong>: The crust–mantle boundary sits at $\sim 70\;\text{km}$; the mantle–core boundary at $\sim 2840\;\text{km}$. Gravity correlates strongly with topography (unlike Earth), suggesting Venus lacks an <strong>asthenosphere</strong> (mechanically soft upper mantle layer), possibly because the high surface temperature precludes the stability of chlorite and serpentine, the &ldquo;slippery&rdquo; minerals that weaken Earth&rsquo;s lithosphere.</li>
<li><strong>Surface-atmosphere coupling</strong>: Extensive early-era volcanism may have added $\text{H}_2\text{O}$ and $\text{SO}_2$ to the atmosphere, amplifying the greenhouse effect and causing surface temperature excursions of $\pm 100\;\text{K}$ on 100–200 Myr timescales that could in turn have driven tectonic stress and partial crustal melting.</li>
</ul>
<h2 id="connecting-to-venus-as-a-system">Connecting to Venus as a System</h2>
<p>The geological record decoded here provides essential context for understanding the full story of Venus. The stagnant lid and catastrophic resurfacing events explain how the planet lost its early surface water and why the atmosphere evolved into its current extreme state.</p>
<p>To explore what the surface conditions mean for life and planetary engineering, see:</p>
<ul>
<li><a href="/notes/interdisciplinary/planetary-science/life-on-venus/">Life on Venus</a> for how these surface conditions define the hard limits for any biology.</li>
<li><a href="/notes/interdisciplinary/planetary-science/venus-evolution-through-time/">Venus Evolution Through Time</a> for the coordinated mission strategy that will answer whether Venus was ever habitable.</li>
<li><a href="/notes/interdisciplinary/planetary-science/cloud-continents/">Terraforming Venus: The Cloud Continent Proposal</a> for a speculative look at how humanity might one day engineer around these geological constraints.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>This is a review paper synthesizing publicly available mission data. The primary datasets are accessible through NASA&rsquo;s Planetary Data System (PDS):</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pds-geosciences.wustl.edu/missions/magellan/">Magellan SAR and Altimetry (PDS)</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Global radar imagery and topography</td>
      </tr>
      <tr>
          <td><a href="https://www.nasa.gov/nssdc/">Venera/Vega Lander Data (NSSDC)</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Surface composition and imagery</td>
      </tr>
  </tbody>
</table>
<p>No custom software or models are associated with this paper. Reproducing the geological interpretations requires access to the Magellan radar mosaics and familiarity with planetary geological mapping techniques.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Basilevsky, A. T., &amp; Head, J. W., III. (2003). The surface of Venus. <em>Reports on Progress in Physics</em>, 66(10), 1699–1734. <a href="https://doi.org/10.1088/0034-4885/66/10/R04">https://doi.org/10.1088/0034-4885/66/10/R04</a></p>
<p><strong>Publication</strong>: Reports on Progress in Physics, 2003</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{basilevsky2003surface,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The surface of Venus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Basilevsky, Alexander T and Head, James W, III}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Reports on Progress in Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{66}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1699--1734}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2003}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/0034-4885/66/10/R04}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dark Side of Forces: Non-Conservative ML Force Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/</guid><description>Bigi et al. critique non-conservative force models in ML potentials, showing their simulation failures and proposing hybrid solutions.</description><content:encoded><![CDATA[<h2 id="contribution-systematic-assessment-of-non-conservative-ml-force-models">Contribution: Systematic Assessment of Non-Conservative ML Force Models</h2>
<p>This is a <strong>Systematization</strong> paper. It systematically catalogs the exact failure modes of existing non-conservative force approaches, quantifies them with a new diagnostic metric, and proposes a hybrid Multiple Time-Stepping solution combining the speed benefits of direct force prediction with the physical correctness of conservative models.</p>
<h2 id="motivation-the-speed-accuracy-trade-off-in-ml-force-fields">Motivation: The Speed-Accuracy Trade-off in ML Force Fields</h2>
<p>Many recent machine learning interatomic potential (MLIP) architectures predict forces directly ($F_\theta(r)$). This &ldquo;non-conservative&rdquo; approach avoids the computational overhead of automatic differentiation, yielding faster inference (typically 2-3x speedup) and faster training (up to 3x). However, it sacrifices energy conservation and rotational constraints, potentially destabilizing molecular dynamics simulations. The field lacks rigorous quantification of when this trade-off breaks down and how to mitigate the failures.</p>
<h2 id="novelty-jacobian-asymmetry-and-hybrid-architectures">Novelty: Jacobian Asymmetry and Hybrid Architectures</h2>
<p>Four key contributions:</p>
<ol>
<li>
<p><strong>Jacobian Asymmetry Metric ($\lambda$):</strong> A quantitative diagnostic for non-conservation. Since conservative forces derive from a scalar field, their Jacobian (the Hessian of energy) must be symmetric. The normalized norm of the antisymmetric part quantifies the degree of violation:
$$ \lambda = \frac{|| \mathbf{J}_{\text{anti}} ||_F}{|| \mathbf{J} ||_F} $$
where $\mathbf{J}_{\text{anti}} = (\mathbf{J} - \mathbf{J}^\top)/2$. Measured values range from $\lambda \approx 0.004$ (PET-NC) to $\lambda \approx 0.032$ (SOAP-BPNN-NC), with ORB at 0.015 and EquiformerV2 at 0.017. Notably, the pairwise $\lambda_{ij}$ approaches 1 at large interatomic distances, meaning non-conservative artifacts disproportionately affect long-range and collective interactions.</p>
</li>
<li>
<p><strong>Systematic Failure Mode Catalog:</strong> First comprehensive demonstration that non-conservative models cause runaway heating in NVE ensembles (temperature drifts of $\sim 7 \times 10^{12}$ K/s for PET-NC and ~10x larger for ORB) and equipartition violations in NVT ensembles where different atom types equilibrate to different temperatures, a physical impossibility.</p>
</li>
<li>
<p><strong>Theoretical Analysis of Force vs. Energy Training:</strong> Force-only training overemphasizes high-frequency vibrational modes because force labels carry per-atom gradients that are dominated by stiff, short-range interactions. Energy labels provide a more balanced representation across the frequency spectrum. Additionally, conservative models benefit from backpropagation extending the effective receptive field to approximately 2x the interaction cutoff, while direct-force models are limited to the nominal cutoff radius.</p>
</li>
<li>
<p><strong>Hybrid Training and Inference Protocol:</strong> A practical workflow that combines fast direct-force prediction with conservative corrections:</p>
<ul>
<li><strong>Training:</strong> Pre-train on direct forces, then fine-tune on energy gradients (2-4x faster than training conservative models from scratch)</li>
<li><strong>Inference:</strong> Multiple Time-Stepping (MTS) where fast non-conservative forces are periodically corrected by slower conservative forces</li>
</ul>
</li>
</ol>
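<p>The asymmetry metric above is easy to sketch numerically. A minimal NumPy implementation (using toy linear force fields, not the paper's models) builds the Jacobian by central finite differences, splits off the antisymmetric part, and takes the ratio of Frobenius norms:</p>

```python
import numpy as np

def jacobian_asymmetry(force_fn, positions, eps=1e-5):
    """lambda = ||J_anti||_F / ||J||_F for a force model.

    force_fn maps a flat position vector (n,) to forces (n,).
    A conservative model (F = -grad E) has a symmetric Jacobian
    (the negative Hessian of E), so lambda ~ 0 up to numerical noise.
    """
    x = np.asarray(positions, dtype=float).ravel()
    n = x.size
    J = np.empty((n, n))
    for j in range(n):  # central finite differences, one column at a time
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (force_fn(x + dx) - force_fn(x - dx)) / (2 * eps)
    J_anti = 0.5 * (J - J.T)
    return np.linalg.norm(J_anti) / np.linalg.norm(J)

# Toy conservative force: a spring to the origin, F = -x (symmetric J).
conservative = lambda x: -x
# Add a skew (curl-carrying) term to make it non-conservative.
A = np.array([[0.0, 0.2], [-0.2, 0.0]])
nonconservative = lambda x: -x + A @ x

x0 = np.array([0.3, -0.7])
print(jacobian_asymmetry(conservative, x0))     # ~0
print(jacobian_asymmetry(nonconservative, x0))  # ~0.2
```

<p>For a neural force field the columns would come from automatic differentiation rather than finite differences, but the symmetric/antisymmetric split is the same.</p>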
<h2 id="methodology-systematic-failure-mode-analysis">Methodology: Systematic Failure Mode Analysis</h2>
<p>The evaluation systematically tests multiple state-of-the-art models across diverse simulation scenarios:</p>
<p><strong>Models tested:</strong></p>
<ul>
<li><strong>PET-C/PET-NC</strong> (Point Edge Transformer, conservative and non-conservative variants)</li>
<li><strong>PET-M</strong> (hybrid variant jointly predicting both conservative and non-conservative forces)</li>
<li><strong>ORB-v2</strong> (non-conservative, trained on Alexandria/MPtrj)</li>
<li><strong>EquiformerV2</strong> (non-conservative equivariant Transformer)</li>
<li><strong>MACE-MP-0</strong> (conservative message-passing)</li>
<li><strong>SevenNet</strong> (conservative message-passing)</li>
<li><strong>SOAP-BPNN-C/SOAP-BPNN-NC</strong> (descriptor-based baseline, both conservative and non-conservative variants)</li>
</ul>
<p><strong>Test scenarios:</strong></p>
<ol>
<li><strong>NVE stability tests</strong> on bulk liquid water, graphene, amorphous carbon, and FCC aluminum</li>
<li><strong>Thermostat artifact analysis</strong> with Langevin and GLE thermostats</li>
<li><strong>Geometry optimization</strong> on water snapshots and <a href="/notes/chemistry/datasets/qm9/">QM9</a> molecules using FIRE and L-BFGS</li>
<li><strong>MTS validation</strong> on OC20 catalysis dataset</li>
<li><strong>Species-resolved temperature measurements</strong> for equipartition testing</li>
</ol>
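<p>The geometry-optimization scenario highlights why quasi-Newton methods are fragile here: L-BFGS assumes the forces are the gradient of a scalar energy, an assumption a non-conservative model violates. A minimal SciPy sketch on a Lennard-Jones dimer (a stand-in for the paper's water/QM9 systems and MLIP calculators) shows the conservative setting the optimizer relies on:</p>

```python
import numpy as np
from scipy.optimize import minimize

# Lennard-Jones dimer (eps = sigma = 1) as a stand-in potential; the
# paper instead optimizes water snapshots and QM9 molecules with MLIPs.
def lj_energy_and_grad(x):
    r = np.linalg.norm(x)                  # interatomic distance
    energy = 4.0 * (r**-12 - r**-6)
    dedr = 4.0 * (-12.0 * r**-13 + 6.0 * r**-7)
    return energy, dedr * x / r            # gradient of E (= -force)

res = minimize(lj_energy_and_grad, x0=np.array([1.5, 0.0, 0.0]),
               jac=True, method="L-BFGS-B")
print(np.linalg.norm(res.x))  # equilibrium bond length 2**(1/6) ~ 1.122
```

<p>With a non-conservative model there is no scalar objective to hand to the optimizer, so the symmetric Hessian approximation L-BFGS builds loses its footing; FIRE, which only consumes forces, degrades more gracefully, matching the failure pattern reported below.</p>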
<p><strong>Key metrics:</strong></p>
<ul>
<li>Jacobian asymmetry ($\lambda$)</li>
<li>Kinetic temperature drift in NVE</li>
<li>Velocity-velocity correlations</li>
<li>Radial distribution functions</li>
<li>Species-resolved temperatures</li>
<li>Inference speed benchmarks</li>
</ul>
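<p>Species-resolved temperature is a simple equipartition check: compute the kinetic temperature separately for each element. A minimal NumPy sketch (natural units with $k_B = 1$ and a water-like 2:1 H:O composition, both illustrative choices):</p>

```python
import numpy as np

def kinetic_temperature(masses, velocities, k_B=1.0):
    """Instantaneous kinetic temperature from equipartition:
    (3N/2) k_B T = sum_i (1/2) m_i |v_i|^2  (natural units by default)."""
    ke = 0.5 * np.sum(masses[:, None] * velocities**2)
    return 2.0 * ke / (3.0 * len(masses) * k_B)

def species_temperatures(symbols, masses, velocities):
    """Per-element kinetic temperature. In equilibrium every species
    should read the same T; systematic splits (e.g. hot H, cold O)
    are the equipartition violations reported in the paper."""
    symbols, out = np.asarray(symbols), {}
    for s in np.unique(symbols):
        mask = symbols == s
        out[str(s)] = kinetic_temperature(masses[mask], velocities[mask])
    return out

# Maxwell-Boltzmann velocities at T = 300 (k_B = 1): v ~ N(0, sqrt(T/m)).
rng = np.random.default_rng(0)
T, n = 300.0, 10_000
symbols = np.array(["H"] * (2 * n) + ["O"] * n)  # water-like 2:1 ratio
masses = np.where(symbols == "H", 1.0, 16.0)
vel = rng.normal(0.0, np.sqrt(T / masses)[:, None], size=(3 * n, 3))
print(species_temperatures(symbols, masses, vel))  # both near 300
```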
<h2 id="results-simulation-instability-and-hybrid-solutions">Results: Simulation Instability and Hybrid Solutions</h2>
<p>Purely non-conservative models are <strong>unsuitable for production simulations</strong> due to uncontrollable unphysical artifacts that no thermostat can correct. Key findings:</p>
<p><strong>Performance failures:</strong></p>
<ul>
<li>Non-conservative models exhibited catastrophic temperature drift in NVE simulations: ~7,000 billion K/s for PET-NC and ~70,000 billion K/s for ORB, with EquiformerV2 comparable to PET-NC</li>
<li>Strong Langevin thermostats ($\tau=10$ fs) damped diffusion by ~5x, negating the speed benefits of non-conservative models</li>
<li>Advanced GLE thermostats also failed to control non-conservative drift (ORB reached 1181 K vs. 300 K target)</li>
<li>Equipartition violations: under stochastic velocity rescaling, O and H atoms equilibrated at different temperatures. For ORB, H atoms reached 336 K and O atoms 230 K against a 300 K target. For PET-NC, deviations were smaller but still significant (H at 296 K, O at 310 K).</li>
<li>Geometry optimization was more fragile with non-conservative forces: inaccurate NC models (SOAP-BPNN-NC) failed catastrophically, while more accurate ones (PET-NC) could converge with FIRE but showed large force fluctuations with L-BFGS. Non-conservative models consistently had lower success rates across water and QM9 benchmarks.</li>
</ul>
<p><strong>Hybrid solution success:</strong></p>
<ul>
<li>MTS with non-conservative forces corrected every 8 steps ($M=8$) achieved conservative stability with only ~20% overhead compared to a purely non-conservative trajectory. Results were essentially indistinguishable from fully conservative simulations. Higher stride values ($M=16$) became unstable due to resonances between fast degrees of freedom and integration errors.</li>
<li>Conservative fine-tuning achieved the accuracy of from-scratch training in about 1/3 the total training time (2-4x resource reduction)</li>
<li>Validated on OC20 catalysis benchmark</li>
</ul>
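<p>The MTS scheme can be sketched as an r-RESPA-style integrator. This toy version (a 1D harmonic oscillator with a deliberately biased "direct" force, my own illustrative setup rather than the paper's i-PI implementation) takes fast inner steps with the cheap force and applies the conservative correction as a periodic impulse:</p>

```python
def mts_velocity_verlet(x, v, m, f_fast, f_slow, dt, n_outer, M=8):
    """r-RESPA-style multiple time-stepping (schematic, scalar 1D).

    f_fast: cheap per-step force (stands in for direct NC prediction).
    f_slow: slow correction F_conservative - F_fast, evaluated only
            once per outer step and applied as an impulse over M*dt.
    """
    for _ in range(n_outer):
        v += 0.5 * (M * dt) * f_slow(x) / m   # outer half-kick
        for _ in range(M):                     # inner velocity-Verlet
            v += 0.5 * dt * f_fast(x) / m
            x += dt * v
            v += 0.5 * dt * f_fast(x) / m
        v += 0.5 * (M * dt) * f_slow(x) / m   # outer half-kick
    return x, v

# Harmonic oscillator whose "direct" force carries a constant bias;
# the slow correction restores the exact conservative force -k*x.
k = 1.0
f_fast = lambda x: -k * x + 0.05
f_slow = lambda x: -0.05
x1, v1 = mts_velocity_verlet(1.0, 0.0, 1.0, f_fast, f_slow,
                             dt=0.01, n_outer=100, M=8)
energy = 0.5 * v1**2 + 0.5 * k * x1**2
print(energy)  # stays close to the initial total energy of 0.5
```

<p>The cost structure mirrors the paper's numbers: the expensive conservative evaluation is amortized over $M$ cheap steps, and the resonance instability reported for $M=16$ corresponds to the outer impulse period approaching the fastest vibrational period.</p>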
<p><strong>Scaling caveat:</strong> The authors note that as training datasets grow and models become more expressive, non-conservative artifacts should shrink, since a model that fits conservative reference forces more accurately necessarily has a smaller non-conservative residual. They argue, however, that hybrid approaches are the better path forward rather than waiting for scale to solve the problem.</p>
<p><strong>Recommendation:</strong> The optimal production path is hybrid architectures using direct forces for acceleration (via MTS and pre-training) while anchoring models in conservative energy surfaces. This captures computational benefits without sacrificing physical reliability.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Primary training/evaluation:</strong></p>
<ul>
<li><strong>Bulk Liquid Water</strong> (Cheng et al., 2019): revPBE0-D3 calculations with over 250,000 force/energy targets, chosen for rigorous thermodynamic testing</li>
</ul>
<p><strong>Generalization tests:</strong></p>
<ul>
<li>Graphene, amorphous carbon, FCC aluminum (tested with general-purpose foundation models)</li>
</ul>
<p><strong>Benchmarks:</strong></p>
<ul>
<li><strong>QM9</strong>: Geometry optimization tests</li>
<li><strong>OC20</strong> (Open Catalyst): Oxygen on alloy surfaces for MTS validation</li>
</ul>
<p>All datasets are publicly available through the cited sources.</p>
<h3 id="models">Models</h3>
<p><strong>Point Edge Transformer (PET)</strong> variants:</p>
<ul>
<li><strong>PET-C (Conservative)</strong>: Forces via energy backpropagation</li>
<li><strong>PET-NC (Non-Conservative)</strong>: Direct force prediction head, slightly higher parameter count</li>
<li><strong>PET-M (Hybrid)</strong>: Jointly predicts both conservative and non-conservative forces, accuracy within ~10% of the best single-task models</li>
</ul>
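<p>The conservative/direct split can be made concrete with a toy PyTorch model (an illustrative MLP, not the PET architecture): forces obtained by backpropagating a scalar energy have a symmetric Jacobian by construction, while a direct-force head carries no such guarantee:</p>

```python
import torch

torch.manual_seed(0)

# Toy stand-in for an energy model (PET-C style): flat positions -> scalar.
energy_net = torch.nn.Sequential(
    torch.nn.Linear(6, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def conservative_forces(x):
    """F = -dE/dx by backpropagation; the force Jacobian is then a
    (negative) Hessian and symmetric by construction."""
    if not x.requires_grad:
        x = x.detach().requires_grad_(True)
    energy = energy_net(x).sum()
    (grad,) = torch.autograd.grad(energy, x, create_graph=True)
    return -grad

# A direct-force head (PET-NC style): positions -> forces, no energy.
direct_head = torch.nn.Linear(6, 6)

x = torch.randn(6)
J_cons = torch.autograd.functional.jacobian(conservative_forces, x)
J_dir = torch.autograd.functional.jacobian(direct_head, x)
print(torch.allclose(J_cons, J_cons.T, atol=1e-5))  # True: Hessian symmetry
print(torch.allclose(J_dir, J_dir.T, atol=1e-5))    # False for a generic head
```

<p>The extra backward pass is also where the roughly 2x speed gap between PET-NC and PET-C in the benchmark table comes from.</p>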
<p><strong>Baseline comparisons:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Training Data</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ORB-v2</td>
          <td>Non-conservative</td>
          <td>Alexandria/MPtrj</td>
          <td>Rotationally unconstrained</td>
      </tr>
      <tr>
          <td>EquiformerV2</td>
          <td>Non-conservative</td>
          <td>Alexandria/MPtrj</td>
          <td>Equivariant Transformer</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>Conservative</td>
          <td>MPtrj</td>
          <td>Equivariant message-passing</td>
      </tr>
      <tr>
          <td>SevenNet</td>
          <td>Conservative</td>
          <td>MPtrj</td>
          <td>Equivariant message-passing</td>
      </tr>
      <tr>
          <td>SOAP-BPNN-C</td>
          <td>Conservative</td>
          <td>Bulk water</td>
          <td>Descriptor-based baseline</td>
      </tr>
      <tr>
          <td>SOAP-BPNN-NC</td>
          <td>Non-conservative</td>
          <td>Bulk water</td>
          <td>Descriptor-based baseline</td>
      </tr>
  </tbody>
</table>
<p><strong>Training details:</strong></p>
<ul>
<li><strong>Loss functions</strong>: PET-C uses joint Energy + Force $L^2$ loss; PET-NC uses Force-only $L^2$ loss</li>
<li><strong>Fine-tuning protocol</strong>: PET-NC converted to conservative via energy head fine-tuning</li>
<li><strong>MTS configuration</strong>: Non-conservative forces with conservative corrections every 8 steps ($M=8$)</li>
</ul>
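<p>The two training objectives can be written schematically (the relative weights here are illustrative, not the paper's exact settings):</p>

```python
import numpy as np

def ef_loss(E_pred, E_ref, F_pred, F_ref, w_e=1.0, w_f=1.0):
    """Schematic joint L2 objective: PET-C trains with both terms,
    while setting w_e = 0 recovers the force-only loss used for PET-NC."""
    loss_e = np.mean((np.asarray(E_pred) - np.asarray(E_ref)) ** 2)
    loss_f = np.mean((np.asarray(F_pred) - np.asarray(F_ref)) ** 2)
    return w_e * loss_e + w_f * loss_f

print(ef_loss([1.0, 2.0], [0.0, 2.0], np.zeros((2, 3)), np.zeros((2, 3))))
# 0.5: only the first energy contributes
```

<p>The conservative fine-tuning protocol amounts to switching from the $w_e = 0$ objective to the joint one, with forces now taken from the energy gradient.</p>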
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics &amp; Software:</strong>
Molecular dynamics evaluations were performed with <strong>i-PI</strong>, and geometry optimizations with <strong>ASE (Atomic Simulation Environment)</strong>. Code for reproducing the results is provided as an archived Zenodo snapshot; the authors did not link a live, public GitHub repository.</p>
<ol>
<li><strong>Jacobian asymmetry</strong> ($\lambda$): Quantifies non-conservation via antisymmetric component</li>
<li><strong>Temperature drift</strong>: NVE ensemble stability</li>
<li><strong>Velocity-velocity correlation</strong> ($\hat{c}_{vv}(\omega)$): Thermostat artifact detection</li>
<li><strong>Radial distribution functions</strong> ($g(r)$): Structural accuracy</li>
<li><strong>Species-resolved temperature</strong>: Equipartition testing</li>
<li><strong>Inference speed</strong>: Wall-clock time per MD step</li>
</ol>
<p><strong>Key results:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Speed (ms/step)</th>
          <th>NVE Stability</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PET-NC</td>
          <td>8.58</td>
          <td>Failed</td>
          <td>~7,000 billion K/s drift</td>
      </tr>
      <tr>
          <td>PET-C</td>
          <td>19.4</td>
          <td>Stable</td>
          <td>2.3x slower than PET-NC</td>
      </tr>
      <tr>
          <td>SevenNet</td>
          <td>52.8</td>
          <td>Stable</td>
          <td>Conservative baseline</td>
      </tr>
      <tr>
          <td><strong>PET Hybrid (MTS)</strong></td>
          <td><strong>~10.3</strong></td>
          <td><strong>Stable</strong></td>
          <td><strong>~20% overhead vs. pure NC</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Thermostat artifacts:</strong></p>
<ul>
<li>Langevin ($\tau=10$ fs) damped diffusion by ~5x (weaker coupling at $\tau=100$ fs reduced diffusion by ~1.5x)</li>
<li>GLE thermostats also failed to control non-conservative drift</li>
<li>Equipartition violations under SVR: ORB showed H at 336 K and O at 230 K (target 300 K); PET-NC showed smaller but significant species-resolved deviations</li>
</ul>
<p><strong>Optimization failures:</strong></p>
<ul>
<li>Non-conservative models showed lower geometry optimization success rates across water and QM9 benchmarks, with inaccurate NC models failing catastrophically</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute resources:</strong></p>
<ul>
<li><strong>Training</strong>: From-scratch baseline models were trained on 4x Nvidia H100 GPUs for roughly two days.</li>
<li><strong>Fine-tuning</strong>: Conservative fine-tuning used a single Nvidia H100 GPU for one day.</li>
<li>The fine-tuning route therefore cut compute by 2-4x relative to training conservative models from scratch.</li>
</ul>
<p><strong>Reproduction resources:</strong></p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://zenodo.org/records/14778891">Zenodo repository</a></td>
          <td>Code/Data</td>
          <td>Unknown</td>
          <td>Code and data to reproduce all results</td>
      </tr>
      <tr>
          <td><a href="https://atomistic-cookbook.org/examples/pet-mad-nc/pet-mad-nc.html">MTS inference tutorial</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Multiple time-stepping dynamics tutorial</td>
      </tr>
      <tr>
          <td><a href="https://atomistic-cookbook.org/examples/pet-finetuning/pet-ft-nc.html">Conservative fine-tuning tutorial</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Fine-tuning workflow tutorial</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bigi, F., Langer, M. F., &amp; Ceriotti, M. (2025). The dark side of the forces: assessing non-conservative force models for atomistic machine learning. <em>Proceedings of the 42nd International Conference on Machine Learning</em>, PMLR 267.</p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{bigi2025dark,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The dark side of the forces: assessing non-conservative force models for atomistic machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bigi, Filippo and Langer, Marcel F and Ceriotti, Michele}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span>=<span style="color:#e6db74">{Vancouver, Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45458">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/pdf?id=OEl3L8osas">PDF on OpenReview</a></li>
<li><a href="https://zenodo.org/records/14778891">Zenodo repository</a></li>
<li><a href="https://atomistic-cookbook.org/examples/pet-mad-nc/pet-mad-nc.html">MTS Inference Tutorial</a></li>
<li><a href="https://atomistic-cookbook.org/examples/pet-finetuning/pet-ft-nc.html">Conservative Fine-Tuning Tutorial</a></li>
</ul>
]]></content:encoded></item></channel></rss>