<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Discovery Papers: Validated Empirical Findings Using AI on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/paper-types/discovery/</link><description>Recent content in Discovery Papers: Validated Empirical Findings Using AI on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 11 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/paper-types/discovery/index.xml" rel="self" type="application/rss+xml"/><item><title>SlimPajama-DC: Data Combinations for LLM Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</guid><description>Shen et al. study how global deduplication and domain combinations in SlimPajama affect LLM training, finding diversity after dedup is key.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-data-domain-combinations">An empirical study of data domain combinations</h2>
<p>This is a <strong>discovery paper</strong> that empirically investigates how different combinations and proportions of data domains affect language model pretraining. Using the SlimPajama dataset (a globally deduplicated, 627B token refinement of RedPajama), the study trains seven 1.3B model configurations with varying domain mixtures to identify which combinations and deduplication strategies produce the best downstream performance.</p>
<h2 id="why-data-combination-strategy-matters">Why data combination strategy matters</h2>
<p>Multi-source pretraining datasets combine data from web crawls, code repositories, books, academic papers, and other sources. Two underexplored questions drive this work: (1) Does deduplication within each source (local) versus across all sources (global) meaningfully affect model quality? (2) When sources are thoroughly deduplicated, how does the combination and proportion of domains affect downstream performance? Most open-source LLM training datasets (RedPajama, The Pile) perform only local deduplication, leaving cross-source redundancy unaddressed.</p>
<h2 id="global-deduplication-and-the-slimpajama-dataset">Global deduplication and the SlimPajama dataset</h2>
<p>SlimPajama applies global MinHashLSH deduplication (Jaccard similarity threshold 0.8, 13-gram signatures) across all seven data sources simultaneously. This reduces RedPajama&rsquo;s 1.2T tokens to 627B tokens, a roughly 48% reduction. The heaviest deduplication hits CommonCrawl and GitHub, which had the most cross-source overlap.</p>
<p>The key processing steps:</p>
<ol>
<li><strong>Low-length document filtering</strong>: Remove documents below a minimum length threshold.</li>
<li><strong>Global deduplication</strong>: MinHashLSH across all sources simultaneously, requiring 64 CPU cores and 1.4TB peak memory. This removes both within-source and between-source duplicates.</li>
</ol>
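<p>The deduplication pass can be sketched in pure Python. This is a toy illustration, not the SlimPajama pipeline: it uses character 13-grams, a seeded MD5-based MinHash, and a brute-force all-pairs comparison, whereas the real implementation uses MinHashLSH banding to scale to billions of documents. The function names are illustrative.</p>

```python
import hashlib

def shingles(text, n=13):
    """Character n-grams; SlimPajama signatures use 13-grams."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(grams, num_perm=64):
    """One minimum per seeded hash function (MD5 for determinism)."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}|{g}".encode()).digest()[:8], "big")
            for g in grams)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching minhash slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def global_dedup(docs, threshold=0.8):
    """Keep the first copy of each near-duplicate cluster across ALL sources."""
    kept, sigs = [], []
    for key, text in docs:
        sig = minhash(shingles(text))
        if any(est_jaccard(sig, s) >= threshold for s in sigs):
            continue  # near-duplicate of an earlier document: drop it
        sigs.append(sig)
        kept.append(key)
    return kept
```

<p>Because the comparison runs across every source at once, a page that appears in both CommonCrawl and C4 is kept only once; that cross-source case is exactly what local (per-source) deduplication misses.</p>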
<p>The resulting dataset composition:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>SlimPajama</th>
          <th>RedPajama</th>
          <th>LLaMA 1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>52.2% (333B)</td>
          <td>72.6% (878B)</td>
          <td>67.0%</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>26.7% (170B)</td>
          <td>14.4% (175B)</td>
          <td>15.0%</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>5.2% (33B)</td>
          <td>4.9% (59B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>4.2% (27B)</td>
          <td>2.1% (26B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>4.6% (29B)</td>
          <td>2.3% (28B)</td>
          <td>2.5%</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>3.8% (24B)</td>
          <td>2.0% (24B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>3.3% (21B)</td>
          <td>1.7% (20B)</td>
          <td>2.0%</td>
      </tr>
  </tbody>
</table>
<h2 id="seven-domain-combination-configurations">Seven domain combination configurations</h2>
<p>All configurations train 1.3B parameter models on 330B tokens with identical architecture and hyperparameters. The configurations systematically vary domain diversity:</p>
<ul>
<li><strong>DC-1</strong>: CommonCrawl only (single source)</li>
<li><strong>DC-2</strong>: CommonCrawl + C4 (two web sources)</li>
<li><strong>DC-3</strong>: CommonCrawl + C4 with adjusted proportions</li>
<li><strong>DC-4</strong>: Wikipedia + Books + GitHub + ArXiv + StackExchange (no web crawl)</li>
<li><strong>DC-5</strong>: CommonCrawl + C4 + Wikipedia + Books (four sources, no code/academic)</li>
<li><strong>DC-6</strong>: All seven SlimPajama sources (maximum diversity)</li>
<li><strong>DC-7</strong>: RefinedWeb CommonCrawl (external single-source baseline)</li>
</ul>
<p>The experimental design probes: incremental diversity (DC-1 to DC-2 to DC-5 to DC-6), proportion sensitivity (DC-2 vs DC-3), source importance (DC-3 vs DC-4), and specialization vs generalization (individual vs combined).</p>
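<p>For intuition, a domain combination can be treated as a sampling distribution over sources. A minimal sketch using the SlimPajama proportions from the composition table above as the weights; the exact sampling machinery the paper used may differ, and <code>sample_source</code> is an illustrative helper:</p>

```python
import random

# SlimPajama source proportions (from the composition table above).
DC6_MIXTURE = {
    "CommonCrawl": 0.522, "C4": 0.267, "GitHub": 0.052, "Books": 0.042,
    "ArXiv": 0.046, "Wikipedia": 0.038, "StackExchange": 0.033,
}

def sample_source(mixture, rng):
    """Draw the source of the next training document according to the mixture."""
    names = list(mixture)
    return rng.choices(names, weights=[mixture[n] for n in names], k=1)[0]

rng = random.Random(0)  # seeded for reproducibility
draws = [sample_source(DC6_MIXTURE, rng) for _ in range(10_000)]
cc_frac = draws.count("CommonCrawl") / len(draws)  # about 0.52
```

<p>Configurations like DC-1 or DC-4 correspond to zeroing out entries of this dictionary and renormalizing the rest.</p>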
<h2 id="diversity-after-global-deduplication-drives-performance">Diversity after global deduplication drives performance</h2>
<h3 id="hugging-face-leaderboard-results">Hugging Face leaderboard results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Average</th>
          <th>ARC</th>
          <th>HellaSwag</th>
          <th>MMLU</th>
          <th>TruthfulQA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RedPajama-1.3B</td>
          <td>38.0</td>
          <td>37.2</td>
          <td>55.8</td>
          <td>24.9</td>
          <td>34.3</td>
      </tr>
      <tr>
          <td>DC-1 (CC only)</td>
          <td>38.5</td>
          <td>36.3</td>
          <td>56.0</td>
          <td>27.0</td>
          <td>34.8</td>
      </tr>
      <tr>
          <td>DC-4 (no web)</td>
          <td>37.6</td>
          <td>33.4</td>
          <td>53.3</td>
          <td>26.0</td>
          <td>37.6</td>
      </tr>
      <tr>
          <td>DC-6 (all sources)</td>
          <td>40.0</td>
          <td>33.7</td>
          <td>61.0</td>
          <td>26.9</td>
          <td>38.4</td>
      </tr>
      <tr>
          <td>DC-7 (RefinedWeb)</td>
          <td>41.0</td>
          <td>35.1</td>
          <td>64.7</td>
          <td>26.2</td>
          <td>37.9</td>
      </tr>
  </tbody>
</table>
<p><strong>Key patterns:</strong></p>
<ol>
<li>
<p><strong>More domain diversity improves average performance.</strong> The progression DC-1 (38.5) to DC-2 (38.4) to DC-5 (38.6) to DC-6 (40.0) shows that, once global deduplication has removed cross-source redundancy, adding domains lifts average accuracy overall: the near-flat DC-1 to DC-2 step suggests a second web source adds little, while the full seven-source mixture performs best.</p>
</li>
<li>
<p><strong>Global deduplication enables clean combination.</strong> All SlimPajama configurations except DC-4 outperform RedPajama-1.3B (38.0), which uses local deduplication only. The elimination of cross-source overlap means adding sources contributes genuinely new information.</p>
</li>
<li>
<p><strong>Removing web crawl data hurts.</strong> DC-4 (no CommonCrawl/C4) scores lowest (37.6), demonstrating that web text provides essential breadth even when specialized sources are included.</p>
</li>
<li>
<p><strong>Individual domains excel at specific tasks.</strong> DC-1 (CC only) achieves the highest ARC and MMLU scores. DC-4 leads on Winogrande. DC-5 leads on WSC273. No single combination dominates all tasks, reinforcing that diversity trades specialization for generalization.</p>
</li>
<li>
<p><strong>Findings transfer to 7B scale.</strong> Insights from the best 1.3B configurations were applied to a 7B model trained with large batch sizes, which achieves 63.4 average accuracy across the extended benchmark suite.</p>
</li>
</ol>
<h3 id="training-loss-patterns">Training loss patterns</h3>
<p>DC-6 (all sources) achieves the lowest training loss among SlimPajama configurations, consistent with the downstream results. DC-4 (no web crawl) shows the highest training loss, confirming that the large, diverse web crawl data is the most important single component.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The central finding is that <strong>diversity matters most after deduplication</strong>. When cross-source redundancy is removed, each additional source contributes genuinely new signal. Without global deduplication, adding sources may just increase redundancy without proportional benefit.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>Only seven fixed configurations are tested. No systematic search over continuous mixture proportions (contrast with <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> or <a href="/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/">Data Mixing Laws</a>).</li>
<li>The configurations are not independent: DC-6 includes all sources from DC-1 through DC-5, making it difficult to isolate the contribution of any single addition.</li>
<li>Only 1.3B and 7B scales tested. Whether the diversity benefit continues scaling is unverified.</li>
<li>English-only. Cross-lingual diversity effects are not studied.</li>
<li>The paper is a technical report without formal peer review.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> All 1.3B models and datasets are publicly released under MIT license on HuggingFace.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>SlimPajama</td>
          <td>627B tokens</td>
          <td>Globally deduplicated from 1.2T RedPajama</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>RefinedWeb</td>
          <td>600B tokens</td>
          <td>External CC-only baseline</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HF Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA)</td>
          <td>Standard</td>
          <td>4 benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Extended suite</td>
          <td>12 additional benchmarks</td>
          <td>Zero and few-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>1.3B parameter Cerebras-GPT architecture with ALiBi positional encoding and SwiGLU activation. All configurations trained on 330B tokens. 7B model trained with a large-batch-size (LBS) strategy on a Cerebras 16x CS-2 cluster (80 PFLOP/s in bf16).</p>
<h3 id="hardware">Hardware</h3>
<p>Cerebras 16x CS-2 cluster, 80 PFLOP/s in bf16 mixed precision.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/MBZUAI-LLM/SlimPajama-DC">SlimPajama-DC Models</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>All 1.3B DC configurations (select via revision)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC">SlimPajama-627B-DC Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>Source-split version of SlimPajama-627B</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2023slimpajamadc,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SlimPajama-DC: Understanding Data Combinations for LLM Training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Zhiqiang and Tao, Tianhua and Ma, Liqun and Neiswanger, Willie and Liu, Zhengzhong and Wang, Hongyi and Tan, Bowen and Hestness, Joel and Vassilieva, Natalia and Soboleva, Daria and Xing, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.10818}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Scaling Data-Constrained Language Models</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</guid><description>Muennighoff et al. extend Chinchilla scaling laws to repeated data, finding up to 4 epochs cause negligible loss and 16 epochs mark diminishing returns.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-scaling-under-data-constraints">An empirical study of scaling under data constraints</h2>
<p>This is a <strong>discovery paper</strong> that systematically investigates what happens when language models are trained for multiple epochs on repeated data. It extends the Chinchilla scaling laws to the data-constrained regime by proposing a new scaling formula that accounts for the diminishing value of repeated tokens, validated across 400+ training runs ranging from 10M to 9B parameters and up to 1500 epochs.</p>
<h2 id="running-out-of-unique-training-data">Running out of unique training data</h2>
<p>The Chinchilla scaling laws assume unlimited unique data: for a given compute budget, there exists an optimal balance of model parameters and training tokens. But extrapolating these laws to larger models implies data requirements that exceed what is available. Villalobos et al. estimated that high-quality English text would be exhausted by 2024 under Chinchilla-optimal scaling. Most prior large language models trained for a single epoch, and some work explicitly warned against data reuse. The Galactica models (trained for 4.25 epochs) showed that multi-epoch training could work, but no systematic study had quantified the tradeoff between repeated data and fresh data, or how to allocate compute optimally when data is finite.</p>
<h2 id="effective-data-with-exponential-decay-for-repetition">Effective data with exponential decay for repetition</h2>
<p>The paper generalizes the Chinchilla scaling law by replacing raw token count $D$ with an effective data term $D'$ that accounts for the diminishing value of repeated tokens:</p>
<p>$$
L(N, D) = \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} + E
$$</p>
<p>where the effective data is:</p>
<p>$$
D' = U_{D} + U_{D} R_{D}^{*} \left(1 - e^{-R_{D}/R_{D}^{*}}\right)
$$</p>
<p>Here $U_{D}$ is the number of unique tokens, $R_{D}$ is the number of repetitions (epochs minus 1), and $R_{D}^{*}$ is a learned constant representing the &ldquo;half-life&rdquo; of data repetition. When $R_{D} = 0$ (single epoch), $D' = U_{D} = D$ and the formula reduces to standard Chinchilla. When $R_{D} \ll R_{D}^{*}$, repeated data is worth almost the same as fresh data. As $R_{D}$ grows large, the value of repeated tokens decays to zero, and $D'$ saturates at $U_{D}(1 + R_{D}^{*})$, meaning no amount of repetition can substitute for more than $R_{D}^{*}$ epochs&rsquo; worth of fresh data.</p>
<p>A symmetric formula handles excess parameters:</p>
<p>$$
N' = U_{N} + U_{N} R_{N}^{*} \left(1 - e^{-R_{N}/R_{N}^{*}}\right)
$$</p>
<p>where $U_{N}$ is the compute-optimal parameter count for $U_{D}$ unique tokens and $R_{N}$ measures how much the model exceeds that count. The fitted values are $R_{D}^{<em>} \approx 15.0$ (data repetition half-life at ~16 epochs) and $R_{N}^{</em>} \approx 5.3$ (excess parameters decay faster than repeated data).</p>
<h2 id="experiments-across-400-models">Experiments across 400+ models</h2>
<p><strong>Scale.</strong> Models from 10M to 9B parameters, trained for up to 1500 epochs. Three experimental protocols: fixed unique data (100M, 400M, 1.5B tokens), fixed FLOPs, and parametric fitting across all runs. Training on C4 (English web text) with GPT-2 architecture decoder-only transformers.</p>
<h3 id="resource-allocation-epochs-scale-faster-than-parameters">Resource allocation: epochs scale faster than parameters</h3>
<p>With fixed unique data, results show that more than 50% loss reduction is possible by training beyond one epoch and increasing model size beyond the single-epoch optimum. The data-constrained efficient frontier recommends allocating most additional compute to more epochs rather than more parameters, because excess parameters decay faster ($R_{N}^{*} &lt; R_{D}^{*}$). This contrasts with Chinchilla, which recommends scaling both equally.</p>
<p>A concrete validation: training the data-constrained compute-optimal model for $9.3 \times 10^{21}$ FLOPs with 25B unique tokens, the recommended allocation (27% fewer parameters, more epochs) achieves better loss and downstream performance than the Chinchilla-optimal allocation.</p>
<h3 id="resource-return-the-4-epoch-safe-zone-and-16-epoch-half-life">Resource return: the 4-epoch safe zone and 16-epoch half-life</h3>
<table>
  <thead>
      <tr>
          <th>Epochs</th>
          <th>Loss impact</th>
          <th>Downstream impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (baseline)</td>
          <td>Optimal</td>
          <td>Optimal</td>
      </tr>
      <tr>
          <td>Up to 4</td>
          <td>Negligible (+0.5% loss)</td>
          <td>No significant difference</td>
      </tr>
      <tr>
          <td>~16 ($R_{D}^{*}$)</td>
          <td>Diminishing returns begin sharply</td>
          <td>Measurable degradation</td>
      </tr>
      <tr>
          <td>Beyond 16</td>
          <td>Returns decay to near zero</td>
          <td>Significant degradation</td>
      </tr>
      <tr>
          <td>Extreme (44+)</td>
          <td>Training can diverge</td>
          <td>Failure</td>
      </tr>
  </tbody>
</table>
<p>The 8.7B parameter model trained for 4 epochs ($D_{C} = 44$B unique tokens) finishes with only 0.5% higher validation loss than the single-epoch model ($D_{C} = 178$B unique tokens). At the half-life $R_{D}^{*} \approx 15$, the marginal repeated token is worth only $e^{-1} \approx 37\%$ of a fresh token, and the cumulative benefit of repetition has already reached $1 - e^{-1} \approx 63\%$ of its saturation value; beyond roughly 16 epochs, the return on further repetition decays rapidly toward zero.</p>
<h3 id="complementary-strategies-code-augmentation-and-filtering">Complementary strategies: code augmentation and filtering</h3>
<p>When data is limited, two strategies can extend the effective dataset:</p>
<p><strong>Code augmentation.</strong> Mixing Python code from The Stack with natural language data. Up to 50% code (42B tokens) shows no degradation on natural language benchmarks, effectively providing a 2x increase in useful training data. Some tasks (WebNLG generation, bAbI reasoning) actually improve with code, possibly because code trains long-range state-tracking capabilities.</p>
<p><strong>Filtering relaxation.</strong> Perplexity filtering (keeping the 25% lowest-perplexity samples) is effective on noisy datasets, but deduplication filtering does not improve downstream performance (though it may reduce memorization). The recommendation: reserve aggressive filtering for noisy data sources; for clean datasets, more data through reduced filtering is better than less data through strict filtering.</p>
<p><strong>Combined strategy</strong>: doubling available data with code and then repeating for 4 epochs yields 8x more training tokens with performance expected to match 8x more unique data.</p>
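<p>The perplexity-filtering step is straightforward to sketch: score each document with a reference language model, then keep the lowest-perplexity quarter. A minimal illustration over precomputed scores; <code>perplexity_filter</code> is an illustrative name, and the paper's exact scoring model and tie-breaking are not specified here:</p>

```python
def perplexity_filter(docs, perplexities, keep_frac=0.25):
    """Keep the keep_frac of documents with the lowest perplexity scores."""
    k = max(1, int(len(docs) * keep_frac))
    ranked = sorted(zip(perplexities, docs), key=lambda pair: pair[0])
    return [doc for _, doc in ranked[:k]]
```

<p>Per the recommendation above, this kind of aggressive filter pays off on noisy sources; on clean data, keeping more documents beats filtering harder.</p>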
<h2 id="key-findings-and-limitations">Key findings and limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Multi-epoch training is beneficial, not harmful, up to moderate repetition counts.</li>
<li>The data-constrained scaling law accurately predicts loss under repetition using an exponential decay formulation.</li>
<li>Compute should be allocated to epochs faster than parameters when data is constrained.</li>
<li>Code augmentation and selective filtering extend effective data without quality degradation.</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>All experiments use the GPT-2 transformer architecture; applicability to other architectures or modalities is untested.</li>
<li>Only the entire dataset is repeated uniformly. Selectively repeating subsets (e.g., high-value data for more epochs) is not modeled.</li>
<li>Hyperparameter sensitivity (learning rate, dropout) to epoch count is unexplored. Higher learning rates may cause earlier onset of diminishing returns.</li>
<li>Focused on English text. Cross-lingual augmentation effects are not studied.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, models, datasets, and hyperparameters are all publicly released under Apache 2.0.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>C4 (English)</td>
          <td>Varies by experiment</td>
          <td>Fixed unique data: 100M, 400M, 1.5B tokens</td>
      </tr>
      <tr>
          <td>Code augmentation</td>
          <td>The Stack (Python)</td>
          <td>Up to 42B tokens</td>
          <td>Mixed with natural language</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>19 NL tasks</td>
          <td>Standard splits</td>
          <td>Zero to five-shot, 114 scores per model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data-constrained scaling law: $D' = U_{D} + U_{D} R_{D}^{*}(1 - e^{-R_{D}/R_{D}^{*}})$ with $R_{D}^{*} \approx 15.0$, $R_{N}^{*} \approx 5.3$. Fitted using the methodology of Hoffmann et al. (2022) adapted for the repetition terms. 400+ training runs used for fitting.</p>
<h3 id="models">Models</h3>
<p>GPT-2 architecture decoder-only transformers with GPT-2 tokenizer. Sizes: 10M to 8.7B parameters. Cosine learning rate schedule (max 2e-4, decay to 2e-5), Adam optimizer ($\beta_2 = 0.999$), dropout 0.1, weight decay 0.1, gradient clipping at 1.0. bfloat16 precision. Trained using Megatron-DeepSpeed.</p>
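<p>The learning-rate schedule described above can be sketched directly. No warmup is mentioned in this note, so none is modeled; the function name is illustrative:</p>

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=2e-5):
    """Cosine decay from lr_max at step 0 down to lr_min at total_steps."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

<p>The schedule starts at 2e-4, passes through the midpoint value 1.1e-4 halfway through training, and ends at 2e-5.</p>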
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Data-Constrained Optimal</th>
          <th>Chinchilla Optimal</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation loss (9.3e21 FLOPs, 25B unique)</td>
          <td>Lower</td>
          <td>Higher</td>
          <td>27% fewer parameters</td>
      </tr>
      <tr>
          <td>Downstream (4 epochs vs 1)</td>
          <td>No significant difference</td>
          <td>Baseline</td>
          <td>8.7B params, 44B unique tokens</td>
      </tr>
      <tr>
          <td>Code augmentation (50% code)</td>
          <td>No NL degradation</td>
          <td>Baseline</td>
          <td>Some tasks improve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Trained on the LUMI supercomputer (Finland) using AMD Instinct MI250X GPUs with data, tensor, and pipeline parallelism. Up to 256 GPUs (64 nodes) per run, with up to 2,200 nodes (~8,800 GPUs) used in parallel across all concurrent runs. Total compute: approximately 3 million GPU hours. The cluster runs on 100% renewable hydroelectric energy.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/huggingface/datablations">datablations</a></td>
          <td>Code + Models + Data</td>
          <td>Apache 2.0</td>
          <td>All 400+ models, datasets, and training code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/TurkuNLP/Megatron-DeepSpeed">Megatron-DeepSpeed fork</a></td>
          <td>Code</td>
          <td>-</td>
          <td>Training framework adapted for AMD ROCm</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{muennighoff2023scaling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scaling Data-Constrained Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Muennighoff, Niklas and Rush, Alexander M. and Barak, Boaz and Le Scao, Teven and Piktus, Aleksandra and Tazi, Nouamane and Pyysalo, Sampo and Wolf, Thomas and Raffel, Colin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Mixing Laws for LM Pretraining Optimization</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</guid><description>Ye et al. discover that LM loss follows an exponential law over domain mixture proportions, enabling cheap prediction and optimization of data mixtures.</description><content:encoded><![CDATA[<h2 id="an-empirical-discovery-of-predictable-mixture-loss-relationships">An empirical discovery of predictable mixture-loss relationships</h2>
<p>This is a <strong>discovery paper</strong> that identifies a quantitative, functional relationship between pretraining data mixture proportions and language model loss. The key finding is that domain-specific validation loss follows an exponential law over the linear combination of training domain proportions, and this law composes with standard scaling laws to enable cheap prediction of large-model performance under arbitrary mixtures.</p>
<h2 id="the-missing-quantitative-link-between-data-mixtures-and-performance">The missing quantitative link between data mixtures and performance</h2>
<p>Pretraining data for large language models combines text from many domains (web, code, academic, books, etc.), and mixture proportions significantly affect model quality. Existing approaches either set proportions by hand without disclosed criteria (LLaMA, Baichuan) or use algorithmic methods like <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> that optimize qualitatively but cannot predict the quantitative effect of a specific mixture before training. Scaling laws exist for model size and data quantity, but no equivalent existed for mixture proportions. This paper fills that gap.</p>
<h2 id="the-exponential-data-mixing-law">The exponential data mixing law</h2>
<p>The core finding: for a model of fixed size trained for a fixed number of steps, the validation loss on domain $i$ as a function of the training mixture proportions $r_{1 \dots M}$ follows:</p>
<p>$$
L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right)
$$</p>
<p>where $c_{i}$, $k_{i}$, and $t_{ij}$ are fitted parameters. The constant $c_{i}$ represents the irreducible loss (not affected by mixture changes). The interaction coefficients $t_{ij}$ capture how training domain $j$ affects validation loss on domain $i$: negative $t_{ij}$ means domain $j$ helps domain $i$, positive means it hurts.</p>
<p>This was discovered progressively:</p>
<ol>
<li><strong>Two domains</strong>: Log-reducible-loss is linear in domain proportion (univariate exponential).</li>
<li><strong>Three domains</strong>: The exponential generalizes to a linear combination over all domain proportions (Eq. above), outperforming alternatives with comparable parameter count.</li>
<li><strong>General validation</strong>: For a validation set composed of $K$ domains with proportions $s_{1 \dots K}$, the overall loss is:</li>
</ol>
<p>$$
L(r_{1 \dots M}) = \sum_{i=1}^{K} s_{i} \left[ c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right) \right]
$$</p>
<p>When the validation set composition is unknown, implicit domain aggregation treats $s_{i}$ as learnable parameters. Setting the number of implicit domains larger than the true number works well and is robust to overestimation.</p>
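<p>Once fitted, the law is cheap to evaluate and to optimize over. A toy sketch with two training domains, two validation domains, and invented coefficients (all numbers here are hypothetical, chosen so each validation domain is helped by its own training domain; they are not the paper's fits):</p>

```python
import math

def mixture_loss(r, c, k, t):
    """L_i(r) = c_i + k_i * exp(sum_j t_ij * r_j) for one validation domain."""
    return c + k * math.exp(sum(t_j * r_j for t_j, r_j in zip(t, r)))

def total_loss(p):
    """Weighted validation loss over two domains for mixture r = (p, 1 - p).
    Negative t_ij means training domain j helps validation domain i."""
    r = (p, 1.0 - p)
    loss_a = mixture_loss(r, c=2.0, k=1.0, t=(-2.0, 0.0))
    loss_b = mixture_loss(r, c=2.0, k=1.0, t=(0.0, -2.0))
    return 0.5 * loss_a + 0.5 * loss_b

# Search the mixture simplex for the loss-minimizing proportion.
best_p = min(range(101), key=lambda i: total_loss(i / 100)) / 100
```

<p>With these symmetric coefficients the search lands on the balanced mixture $p = 0.5$; with real fitted coefficients the same search yields an asymmetric optimum.</p>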
<h3 id="domain-interaction-patterns">Domain interaction patterns</h3>
<p>Visualizing the fitted $t_{ij}$ coefficients across 5 coarse Pile domains reveals three relationship types: most domain pairs are <strong>unrelated</strong> (sparse interaction matrix where each domain&rsquo;s loss is dominated by its own training proportion), some show <strong>facilitation</strong> (e.g., dialogue data helps internet text), and some show <strong>conflict</strong> (e.g., symbolic data hurts prose). This sparsity explains why the law can be fitted with fewer samples than the quadratic parameter count would suggest.</p>
<h2 id="nested-scaling-pipeline-for-cheap-prediction">Nested scaling pipeline for cheap prediction</h2>
<p>Fitting data mixing laws directly at target scale is too expensive (requires many full training runs at different mixtures). The paper proposes nesting three scaling laws:</p>
<p><strong>Step 1</strong>: For each mixture $r_{i}$ and each small model size $N_{j}$, train for $S_{0}$ steps. Fit a <a href="https://en.wikipedia.org/wiki/Power_law">power law</a> $L(S) = E_{1} + B/S^{\beta}$ over steps to extrapolate to the target step count $S_{\text{target}}$.</p>
<p><strong>Step 2</strong>: With the step-extrapolated losses for each mixture, fit a power law $L(N) = E_{2} + A/N^{\alpha}$ over model sizes to extrapolate to the target model size $N_{\text{target}}$.</p>
<p><strong>Step 3</strong>: With the predicted losses at $(N_{\text{target}}, S_{\text{target}})$ for all sampled mixtures, fit the data mixing law and search for the optimal mixture.</p>
<p>This pipeline requires only training small models (70M to 410M) for short runs (30B tokens) to predict performance of a 1B model trained for 100B tokens.</p>
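<p>Step 1's extrapolation can be sketched as a fit of $L(S) = E_{1} + B/S^{\beta}$: grid-search the irreducible loss $E_{1}$, since for fixed $E_{1}$ the rest, $\log(L - E_{1}) = \log B - \beta \log S$, is a linear least-squares problem. The constants below are synthetic, purely to show the fit recovers them; <code>fit_step_law</code> is an illustrative helper:</p>

```python
import math

def fit_step_law(steps, losses, e_grid):
    """Fit L(S) = E + B / S**beta.  For each candidate irreducible loss E,
    solve the log-linear least-squares fit and keep the E with the smallest
    squared residual."""
    best = None
    for e in e_grid:
        if any(loss <= e for loss in losses):
            continue  # E must lie strictly below every observed loss
        xs = [math.log(s) for s in steps]
        ys = [math.log(loss - e) for loss in losses]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        intercept = my - slope * mx
        resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or resid < best[0]:
            best = (resid, e, math.exp(intercept), -slope)
    _, e1, b, beta = best
    return e1, b, beta

# Synthetic check: generate losses from known constants, then recover them.
true_e, true_b, true_beta = 1.9, 80.0, 0.6
steps = [1e3, 1e4, 1e5, 1e6]
losses = [true_e + true_b / s ** true_beta for s in steps]
e1, b, beta = fit_step_law(steps, losses, e_grid=[1.7, 1.8, 1.9])
```

<p>Extrapolation to the target step count is then just <code>e1 + b / s_target ** beta</code>; Step 2 repeats the same shape of fit over model sizes.</p>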
<h3 id="mixture-sampling-strategy">Mixture sampling strategy</h3>
<p>To get informative samples efficiently, the paper uses double-diminishing proportions: for each domain, enumerate proportions by halving from the maximum available. This distributes losses evenly across the exponential law&rsquo;s range. From 40 candidate mixtures trained at the smallest scale (70M), 20 are selected based on which subset minimizes data mixing law fitting error.</p>
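<p>A minimal sketch of the enumeration, assuming each domain&rsquo;s candidate weights are its maximum available proportion halved repeatedly and each combination is renormalized to sum to one (the paper&rsquo;s exact enumeration and subset-selection procedure may differ):</p>

```python
from itertools import product

def double_diminishing(max_props, levels=3):
    """Candidate mixtures via double-diminishing proportions: each domain's
    candidates halve from its maximum available proportion, and every
    combination is renormalized into a valid mixture."""
    grids = [[p / 2 ** i for i in range(levels)] for p in max_props]
    mixtures = []
    for combo in product(*grids):
        total = sum(combo)
        mixtures.append(tuple(w / total for w in combo))
    return mixtures

# Three hypothetical domains with different maximum available proportions.
cands = double_diminishing([0.67, 0.15, 0.05], levels=3)
```

<p>Halving spreads the candidate losses roughly evenly across the exponential law&rsquo;s range, which is what makes the samples informative for fitting.</p>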
<h2 id="experiments-on-redpajama-and-continual-pretraining">Experiments on RedPajama and continual pretraining</h2>
<p><strong>Main experiment.</strong> Models trained on RedPajama, validated on the Pile (mimicking the common scenario where validation data comes from a different distribution than training). Small models: 70M, 160M, 305M, 410M trained for 30B tokens. Target: 1B model for 100B tokens.</p>
<p>The optimized mixture dramatically redistributes weight compared to RedPajama defaults:</p>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>Default</th>
          <th>Optimized</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>0.670</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>0.150</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>0.045</td>
          <td>0.141</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>0.045</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>0.045</td>
          <td>0.094</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>0.025</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>0.020</td>
          <td>0.016</td>
      </tr>
  </tbody>
</table>
<p>The optimized mixture reaches the default mixture&rsquo;s final performance in 73% of the training steps and eventually achieves performance equivalent to 48% more training on the default mixture.</p>
<p><strong>Comparison to DoReMi and DoGE.</strong> Data mixing laws outperform both: the predicted-optimal mixture achieves lower validation loss than DoReMi and DoGE (both universal and OOD settings) for 1B models trained for 100B tokens on RedPajama.</p>
<p><strong>Continual pretraining.</strong> The law extends to continual pretraining (Pythia-70M on Pile + Python code). It accurately predicts the critical mixture proportion that avoids <a href="https://en.wikipedia.org/wiki/Catastrophic_interference">catastrophic forgetting</a> on the original domain while improving the target domain. This suggests data mixing laws could guide dynamic data schedules across multi-stage pretraining.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The data mixing law provides a predictive framework rather than just an optimization algorithm. Key implications:</p>
<ul>
<li>The interaction coefficients $t_{ij}$ make domain relationships quantitatively observable before full-scale training, identifying facilitation and conflict pairs.</li>
<li>The nested pipeline&rsquo;s cost is dominated by the small-model training runs (40 mixtures at 70M scale), which is orders of magnitude cheaper than even a single target-scale run.</li>
<li>The continual pretraining application opens the door to optimizing dynamic data schedules, where mixture proportions change across training stages.</li>
</ul>
<p><strong>Limitations</strong>: The &ldquo;domain&rdquo; concept remains loosely defined (provenance-based). The nested scaling laws introduce compounding errors at each step, and predictions tend to slightly underestimate actual loss. The number of required fitting samples, while subquadratic in practice due to sparsity, still scales with the number of domains. No theoretical justification for the exponential form is provided; it is a purely empirical finding.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (pilot)</td>
          <td>The Pile (GitHub, Pile-CC, Books3)</td>
          <td>30B tokens</td>
          <td>2-domain and 3-domain experiments</td>
      </tr>
      <tr>
          <td>Training (main)</td>
          <td>RedPajama</td>
          <td>100B tokens</td>
          <td>7 domains</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>The Pile validation set</td>
          <td>Standard split</td>
          <td>Out-of-distribution relative to RedPajama</td>
      </tr>
      <tr>
          <td>Continual pretraining</td>
          <td>Pile + Python code</td>
          <td>10B tokens</td>
          <td>Pythia-70M base model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data mixing law: $L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp(\sum_{j} t_{ij} r_{j})$, fitted via AdaBoost Regressor on sampled mixtures. Step scaling law: $L(S) = E_{1} + B/S^{\beta}$; model size scaling law: $L(N) = E_{2} + A/N^{\alpha}$. Both are fitted via Huber loss minimization with LBFGS, decomposed Chinchilla-style into separate fits rather than one joint fit, for stability. 40 candidate mixtures are sampled via double-diminishing proportions, of which 20 are selected for the final pipeline.</p>
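<p>A sketch of the Huber-plus-LBFGS fit for the step scaling law, restarting from a small grid of initializations in the Chinchilla style; the initialization grid, the delta, and the synthetic curve are all illustrative:</p>

```python
import numpy as np
from scipy.optimize import minimize

def huber(resid, delta=1e-3):
    """Huber penalty: quadratic for |r| <= delta, linear beyond."""
    a = np.abs(resid)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def fit_step_law(steps, losses, delta=1e-3):
    """Fit L(S) = E + B / S^beta by Huber-loss minimization with L-BFGS,
    keeping the best of several starting points."""
    def objective(params):
        E, B, beta = params
        return float(np.sum(huber(E + B / steps ** beta - losses, delta)))
    inits = [(1.0, 1.0, 0.5), (2.0, 2.0, 0.5), (2.0, 1.0, 0.7)]
    best = min((minimize(objective, x0, method="L-BFGS-B") for x0 in inits),
               key=lambda res: res.fun)
    return best.x

# Hypothetical noiseless training curve (steps in thousands).
steps = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 30.0])
losses = 2.0 + 2.0 / steps ** 0.6
E, B, beta = fit_step_law(steps, losses)
```

<p>The Huber loss keeps a handful of badly mispredicted points from dominating the fit, which matters once real (noisy) training curves replace the synthetic one here.</p>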
<h3 id="models">Models</h3>
<p>Transformer decoder-only LMs. Pilot: 70M, 160M. Main pipeline: 70M, 160M, 305M, 410M (for fitting), 1B (target). Batch size: 1M tokens. Cosine learning rate decay with 2K step warmup, decaying to 0.1x at 100K steps.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Optimized Mixture</th>
          <th>Default Mixture</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Steps to match default final loss</td>
          <td>73K (73%)</td>
          <td>100K (100%)</td>
          <td>27% training reduction</td>
      </tr>
      <tr>
          <td>Equivalent extra training</td>
          <td>+48%</td>
          <td>Baseline</td>
          <td>Estimated via step scaling law</td>
      </tr>
      <tr>
          <td>Validation loss (1B, 100B)</td>
          <td>Lowest</td>
          <td>Higher than optimized</td>
          <td>Also beats DoReMi and DoGE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>8 A100 GPUs. Training times per 30B-token run: 3.5 hours (70M), 8 hours (160M), 16 hours (305M), 21 hours (410M).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Pilot and validation data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/togethercomputer/RedPajama-Data">RedPajama</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>Main training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/EleutherAI/pythia">Pythia Suite</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Model architecture configs; Pythia-70M checkpoint for continual pretraining</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status: Partially Reproducible.</strong> Datasets and base model checkpoints are public. No official code release for the data mixing law fitting pipeline, mixture sampling, or the nested scaling law prediction workflow.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ye2025datamixinglaws,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhan, Jun and Zhou, Yunhua and Qiu, Xipeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Randomized SMILES Improve Molecular Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/</guid><description>Randomized SMILES improve RNN molecular generative models by increasing chemical space coverage, uniformity, and completeness versus canonical SMILES.</description><content:encoded><![CDATA[<h2 id="data-augmentation-through-smiles-randomization">Data Augmentation Through SMILES Randomization</h2>
<p>This is an <strong>Empirical</strong> paper that performs an extensive benchmark of RNN-based molecular generative models trained with different SMILES string variants. The primary contribution is demonstrating that randomized SMILES (non-unique molecular string representations obtained by randomizing atom orderings) substantially improve the quality of the generated chemical space compared to canonical SMILES, without requiring any changes to the model architecture.</p>
<p>The paper evaluates three properties of generated chemical spaces: uniformity (equal probability of sampling each molecule), completeness (coverage of the target space), and closedness (generating only molecules within the target space). These are measured using a new composite metric called UC-JSD.</p>
<h2 id="canonical-smiles-bias-in-generative-models">Canonical SMILES Bias in Generative Models</h2>
<p>Recurrent Neural Networks trained on SMILES strings have shown the capacity to create large chemical spaces of valid molecules. However, when trained with canonical SMILES (the unique string representation produced by a canonicalization algorithm), these models exhibit biases. Specifically, prior work by the same group showed that models trained on one million <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> molecules could only recover 68% of GDB-13 when sampled two billion times, compared to the theoretical maximum of 87% from an ideal uniform sampler.</p>
<p>The canonical SMILES representation introduces two problems. First, the canonicalization algorithm constrains how the molecular graph is traversed (e.g., prioritizing sidechains over ring atoms), forcing the model to learn both valid SMILES syntax and the specific canonical ordering rules. Second, structurally similar molecules can have substantially different canonical SMILES, making some molecules harder to sample than others. Molecules with more ring systems and complex topologies are particularly underrepresented.</p>
<p>The authors also note that DeepSMILES, a recently proposed alternative syntax, had not been benchmarked against randomized SMILES, and that the data augmentation capabilities of randomized SMILES at different training set sizes were unexplored.</p>
<h2 id="randomized-smiles-as-non-canonical-representations">Randomized SMILES as Non-Canonical Representations</h2>
<p>The core insight is that by randomizing the atom ordering before SMILES generation, each molecule can be represented by multiple different but equally valid SMILES strings. This effectively provides data augmentation: a molecule with $n$ heavy atoms can theoretically yield up to $n$ different SMILES strings (though the actual number is typically lower due to molecular symmetry).</p>
<p>Two randomized SMILES variants are explored:</p>
<ul>
<li><strong>Restricted randomized SMILES</strong>: Atom ordering is randomized, but RDKit&rsquo;s built-in fixes are applied. These fixes prevent overly complicated traversals, such as prioritizing sidechains before completing ring atoms.</li>
<li><strong>Unrestricted randomized SMILES</strong>: Atom ordering is randomized without any RDKit restrictions, producing a superset of the restricted variant that includes more convoluted SMILES strings.</li>
</ul>
<p>For each training epoch, a new set of randomized SMILES is generated for the same molecules, so a model trained for 300 epochs on one million molecules sees approximately 300 million different SMILES strings (with some overlap due to sampling).</p>
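<p>In practice, randomized SMILES are generated with RDKit by shuffling the atom order before writing a non-canonical SMILES. The core effect can be illustrated without RDKit on a toy acyclic hydrocarbon, where random depth-first traversals of the same molecular graph emit different but equally valid strings:</p>

```python
import random

def random_smiles(adjacency, rng):
    """One randomized SMILES for a toy acyclic all-carbon graph: start at a
    random atom, walk depth-first with shuffled neighbor order, and write
    all but the last child as parenthesized branches."""
    def walk(atom, parent):
        children = [n for n in adjacency[atom] if n != parent]
        rng.shuffle(children)
        out = "C"
        for child in children[:-1]:
            out += "(" + walk(child, atom) + ")"
        if children:
            out += walk(children[-1], atom)
        return out
    return walk(rng.randrange(len(adjacency)), None)

# Isobutane: atom 0 is the central carbon bonded to atoms 1, 2, 3.
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
rng = random.Random(0)
variants = {random_smiles(isobutane, rng) for _ in range(50)}
```

<p>For this highly symmetric molecule only two distinct strings exist, <code>CC(C)C</code> (starting from a methyl carbon) and <code>C(C)(C)C</code> (starting from the central carbon); larger, less symmetric molecules admit up to $n$ variants, which is what makes per-epoch regeneration an effective augmentation.</p>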
<p>The model architecture is a standard RNN with an embedding layer, $l$ layers of LSTM or GRU cells of size $w$, optional dropout, and a linear output layer with softmax. The training objective minimizes the average negative log-likelihood (NLL):</p>
<p>$$
J(T) = -\ln P(X_{0} = x_{0}) - \sum_{t=1}^{T} \ln P(X_{t} = x_{t} \mid X_{t-1} = x_{t-1}, \dots, X_{0} = x_{0})
$$</p>
<p>The key metric is the Uniformity-Completeness JSD (UC-JSD), which extends the Jensen-Shannon Divergence to measure how uniform, complete, and closed the generated chemical space is:</p>
<p>$$
JSD = H\left(\sum_{i=1}^{|D|} \alpha_{i} \, d_{i}\right) - \sum_{i=1}^{|D|} \alpha_{i} \, H(d_{i})
$$</p>
<p>where $H(d)$ is the Shannon entropy of a probability distribution. The UC-JSD is computed over the NLL vectors of the validation, training, and sampled sets. The composite UCC score is defined as:</p>
<p>$$
UCC = \text{completeness} \times \text{uniformity} \times \text{closedness}
$$</p>
<p>where completeness measures coverage of GDB-13, uniformity measures how equal the sampling probabilities are, and closedness measures how few invalid (out-of-target-space) molecules are generated.</p>
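<p>The generalized JSD underlying UC-JSD can be sketched for discrete distributions (e.g., histogram-binned NLL vectors); uniform weights $\alpha_{i}$ over the sets are assumed here:</p>

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def generalized_jsd(dists, weights=None):
    """Generalized JSD over a set D of discrete distributions:
    H(sum_i alpha_i d_i) - sum_i alpha_i H(d_i)."""
    n = len(dists)
    alpha = weights if weights is not None else [1.0 / n] * n
    mixture = [sum(alpha[i] * d[j] for i, d in enumerate(dists))
               for j in range(len(dists[0]))]
    return entropy(mixture) - sum(alpha[i] * entropy(d)
                                  for i, d in enumerate(dists))
```

<p>Identical distributions give a JSD of 0 (the ideal UC-JSD, where training, validation, and sampled NLLs agree); completely disjoint distributions give the maximum, $\ln |D|$.</p>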
<h2 id="benchmark-design-across-smiles-variants-training-sizes-and-architectures">Benchmark Design Across SMILES Variants, Training Sizes, and Architectures</h2>
<p>The benchmark covers a systematic grid of experimental conditions:</p>
<p><strong>SMILES variants</strong>: Canonical, restricted randomized, unrestricted randomized, and three DeepSMILES variants (branch syntax, ring syntax, both).</p>
<p><strong>Training set sizes from GDB-13</strong>: 1,000,000, 10,000, and 1,000 molecules with corresponding validation sets.</p>
<p><strong>Architecture choices</strong>: LSTM vs. GRU cells, with hyperparameter grids over number of layers ($l$), hidden size ($w$), dropout rate ($d$), and batch size ($b$).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers ($l$)</th>
          <th>Hidden ($w$)</th>
          <th>Dropout ($d$)</th>
          <th>Batch ($b$)</th>
          <th>Cell</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GDB-13 1M</td>
          <td>3</td>
          <td>512</td>
          <td>0, 25, 50</td>
          <td>64, 128, 256, 512</td>
          <td>GRU, LSTM</td>
      </tr>
      <tr>
          <td>GDB-13 10K</td>
          <td>2, 3, 4</td>
          <td>256, 384, 512</td>
          <td>0, 25, 50</td>
          <td>8, 16, 32</td>
          <td>LSTM</td>
      </tr>
      <tr>
          <td>GDB-13 1K</td>
          <td>2, 3, 4</td>
          <td>128, 192, 256</td>
          <td>0, 25, 50</td>
          <td>4, 8, 16</td>
          <td>LSTM</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>3</td>
          <td>512</td>
          <td>0, 25, 50</td>
          <td>64, 128, 256, 512</td>
          <td>LSTM</td>
      </tr>
  </tbody>
</table>
<p>Each model&rsquo;s best epoch was selected using a smoothed UC-JSD curve, and that epoch was then sampled with replacement $k = 2 \times 10^{9}$ times for the GDB-13 benchmarks.</p>
<p>For ChEMBL experiments, models were trained on 1,483,943 molecules with a validation set of 78,102 molecules. Evaluation used validity, unique molecule count, and Fr&eacute;chet ChemNet Distance (FCD).</p>
<h2 id="randomized-smiles-produce-more-complete-and-uniform-chemical-spaces">Randomized SMILES Produce More Complete and Uniform Chemical Spaces</h2>
<h3 id="gdb-13-results-1m-training-set">GDB-13 results (1M training set)</h3>
<p>The restricted randomized SMILES model recovered 83.0% of GDB-13, compared to 72.8% for canonical SMILES and 68.4-72.1% for DeepSMILES variants. All three quality metrics improved substantially:</p>
<table>
  <thead>
      <tr>
          <th>SMILES Variant</th>
          <th>% GDB-13</th>
          <th>Uniformity</th>
          <th>Completeness</th>
          <th>Closedness</th>
          <th>UCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>72.8</td>
          <td>0.879</td>
          <td>0.836</td>
          <td>0.861</td>
          <td>0.633</td>
      </tr>
      <tr>
          <td>Rand. restricted</td>
          <td>83.0</td>
          <td>0.977</td>
          <td>0.953</td>
          <td>0.925</td>
          <td>0.860</td>
      </tr>
      <tr>
          <td>Rand. unrestricted</td>
          <td>80.9</td>
          <td>0.970</td>
          <td>0.929</td>
          <td>0.876</td>
          <td>0.790</td>
      </tr>
      <tr>
          <td>DeepSMILES (both)</td>
          <td>68.4</td>
          <td>0.851</td>
          <td>0.785</td>
          <td>0.796</td>
          <td>0.532</td>
      </tr>
  </tbody>
</table>
<p>The NLL distribution of GDB-13 molecules under the randomized SMILES model was centered near $NLL_{GDB13} = -\ln(1/|GDB13|) = 20.6$ with a narrow spread, indicating near-uniform sampling probability. The canonical model showed a much wider NLL distribution, meaning some molecules were orders of magnitude harder to sample.</p>
<p>Randomized SMILES without data augmentation (same SMILES each epoch) still outperformed canonical SMILES (UCC 0.712 vs. 0.633 for restricted), confirming that the non-canonical representation itself is beneficial beyond the augmentation effect.</p>
<h3 id="smaller-training-sets-amplify-the-advantage">Smaller training sets amplify the advantage</h3>
<p>With only 10,000 training molecules (0.001% of GDB-13), the randomized model generated 62.3% of GDB-13 vs. 38.8% for canonical. With 1,000 training molecules, the gap widened further: 34.1% vs. 14.5%. Validity also improved dramatically (81.2% vs. 50.4% for the 1K setting), suggesting randomized SMILES helps the model learn valid SMILES syntax more effectively from limited data.</p>
<h3 id="chembl-results">ChEMBL results</h3>
<p>On the drug-like ChEMBL dataset, the randomized SMILES model generated nearly double the number of unique molecules compared to canonical (64.09% vs. 34.67% unique in a 2B sample), with comparable validity (98.33% vs. 98.26%). The canonical model showed a lower FCD (0.0712 vs. 0.1265), but the authors argue this reflects overfitting: the canonical model&rsquo;s NLL distributions for training and validation sets overlapped tightly, while the randomized model showed more uniform coverage. Physicochemical property distributions (molecular weight, logP, SA score, QED, NP score, internal diversity) were nearly identical across both models.</p>
<h3 id="architecture-findings">Architecture findings</h3>
<p>LSTM cells consistently outperformed GRU cells across all SMILES variants. Despite GRU&rsquo;s faster per-epoch training time, LSTM models converged in fewer epochs, making them faster overall. Dropout improved canonical SMILES models but was less beneficial (or detrimental) for randomized SMILES, suggesting that randomized SMILES themselves serve as a regularization mechanism. Larger batch sizes generally improved performance across all variants.</p>
<h3 id="uc-jsd-as-a-model-selection-metric">UC-JSD as a model selection metric</h3>
<p>The UC-JSD showed strong correlation with UCC ($R^{2} = 0.931$ for canonical, $R^{2} = 0.856$ for restricted randomized, $R^{2} = 0.885$ for unrestricted randomized), validating its use as a model selection criterion without requiring expensive sampling of every model.</p>
<p>The authors interpret randomized SMILES models as occupying a hybrid space between grammar-based and action-based generative models. The vocabulary serves as a fixed action space where atom tokens are &ldquo;add atom&rdquo; actions, bond tokens are &ldquo;add bond&rdquo; actions, and ring/branching tokens enable graph traversal. Canonical SMILES constrain this action space to a single deterministic path, while randomized SMILES allow the model to explore multiple valid traversals. This perspective also explains why DeepSMILES performed worse: its altered syntax creates a more complex action space without compensating benefits.</p>
<p>The authors encourage the use of randomized SMILES across different model architectures and tasks, including classification and property prediction, and suggest that finding optimal restricted variants of randomized SMILES is a promising research direction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>GDB-13 subsets</td>
          <td>1M / 10K / 1K molecules</td>
          <td>Randomly sampled from 975M GDB-13</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>1,483,943 training / 78,102 validation</td>
          <td>Filtered subset of ChEMBL database</td>
      </tr>
  </tbody>
</table>
<p>GDB-13 is available from the <a href="http://gdb.unibe.ch/downloads">Reymond group website</a>. ChEMBL is publicly available.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level tokenization with special handling for multi-character tokens (Cl, Br, bracketed atoms, %-prefixed ring numbers)</li>
<li>Teacher forcing during training with NLL loss</li>
<li>Gradient norm clipping to 1.0</li>
<li>Weight initialization from $\mathcal{U}(-\sqrt{1/w}, \sqrt{1/w})$</li>
<li>Adaptive learning rate decay based on UC-JSD</li>
<li>Best epoch selection via smoothed UC-JSD (window size 4)</li>
</ul>
<h3 id="models">Models</h3>
<p>Standard RNN architecture: embedding layer, stacked LSTM/GRU layers with optional dropout, linear output with softmax. Best models used 3 layers of 512-dimensional LSTM cells. Vocabulary sizes: 26 (GDB-13), 31 (ChEMBL).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Randomized</th>
          <th>Best Canonical</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>% GDB-13 (1M)</td>
          <td>83.0%</td>
          <td>72.8%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>UCC (1M)</td>
          <td>0.860</td>
          <td>0.633</td>
          <td>Composite score</td>
      </tr>
      <tr>
          <td>% GDB-13 (10K)</td>
          <td>62.3%</td>
          <td>38.8%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>% GDB-13 (1K)</td>
          <td>34.1%</td>
          <td>14.5%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>% Unique ChEMBL</td>
          <td>64.09%</td>
          <td>34.67%</td>
          <td>2B sample with replacement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Nvidia Tesla V100 (Volta) 16 GB VRAM with CUDA 9.1, driver 390.30. Training times ranged from 1 minute (1K canonical) to 131 hours (ChEMBL canonical). Randomized SMILES models required longer per-epoch training due to augmentation overhead but converged to better solutions.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/undeadpixel/reinvent-randomized">reinvent-randomized</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and benchmarking code</td>
      </tr>
      <tr>
          <td><a href="http://gdb.unibe.ch/downloads">GDB-13</a></td>
          <td>Dataset</td>
          <td>Academic use</td>
          <td>975 million fragment-like molecules</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molecularsets/moses">MOSES benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Used for FCD and property calculations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Arús-Pous, J., Johansson, S. V., Prykhodko, O., Bjerrum, E. J., Tyrchan, C., Reymond, J.-L., Chen, H., &amp; Engkvist, O. (2019). Randomized SMILES strings improve the quality of molecular generative models. <em>Journal of Cheminformatics</em>, 11(1), 71. <a href="https://doi.org/10.1186/s13321-019-0393-0">https://doi.org/10.1186/s13321-019-0393-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{aruspous2019randomized,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Randomized SMILES strings improve the quality of molecular generative models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ar{\&#39;u}s-Pous, Josep and Johansson, Simon Viet and Prykhodko, Oleksii and Bjerrum, Esben Jannik and Tyrchan, Christian and Reymond, Jean-Louis and Chen, Hongming and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{71}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0393-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Avoiding Failure Modes in Goal-Directed Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/avoiding-failure-modes-goal-directed-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/avoiding-failure-modes-goal-directed-generation/</guid><description>Langevin et al. show that apparent failure modes in goal-directed molecular generation stem from QSAR model disagreement, not algorithmic flaws.</description><content:encoded><![CDATA[<h2 id="reinterpreting-goal-directed-generation-failures-as-qsar-model-issues">Reinterpreting Goal-Directed Generation Failures as QSAR Model Issues</h2>
<p>This is an <strong>Empirical</strong> study that challenges a widely cited finding about failure modes in goal-directed molecular generation. <a href="/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/">Renz et al. (2019)</a> had shown that when molecules are optimized against a machine learning scoring function, control models trained on the same data distribution assign much lower scores to the generated molecules. This was interpreted as evidence that generation algorithms exploit model-specific biases. Langevin et al. demonstrate that this divergence is already present in the original data distribution and is attributable to disagreement among the QSAR classifiers, not to flaws in the generation algorithms themselves.</p>
<h2 id="why-qsar-model-agreement-matters-for-molecular-generation">Why QSAR Model Agreement Matters for Molecular Generation</h2>
<p>Goal-directed generation uses a scoring function (typically a <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> model) to guide the design of molecules that maximize predicted activity. In the experimental framework from Renz et al., three Random Forest classifiers are trained: an optimization model $C_{opt}$ on Split 1, a model control $C_{mc}$ on Split 1 with a different random seed, and a data control $C_{dc}$ on Split 2. Each returns a confidence score ($S_{opt}$, $S_{mc}$, $S_{dc}$). The expectation is that molecules with high $S_{opt}$ should also score highly under $S_{mc}$ and $S_{dc}$, since all three models are trained on the same data distribution for the same target.</p>
<p>Renz et al. observed that during optimization, $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$, reaching substantially lower values. This was interpreted as goal-directed generation exploiting biases unique to the optimization model. The recommendation was to halt generation when control scores stop increasing, requiring a held-out dataset for a control model, which may not be feasible in low-data regimes.</p>
<p>The key insight of Langevin et al. is that nobody had checked whether this score disagreement existed before generation even began. If the classifiers already disagree on high-scoring molecules in the original dataset, the divergence during generation is expected behavior, not evidence of algorithmic failure.</p>
<h2 id="pre-existing-classifier-disagreement-explains-the-divergence">Pre-Existing Classifier Disagreement Explains the Divergence</h2>
<p>The core contribution is showing that the gap between optimization and control scores is a property of the QSAR models, not of the generation algorithms.</p>
<p>The authors introduce a held-out test set (10% of the data, used for neither training split) and augment it via Topliss tree enumeration to produce structural analogs for smoother statistical estimates. On this held-out set, they compute the Mean Average Difference (MAD) between $S_{opt}$ and control scores as a function of $S_{opt}$:</p>
<p>$$
\text{MAD}(x) = \frac{1}{|\{i : S_{opt}(x_i) \geq x\}|} \sum_{i :\, S_{opt}(x_i) \geq x} |S_{opt}(x_i) - S_{dc}(x_i)|
$$</p>
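<p>A minimal numpy sketch of the MAD curve; the scores below are hypothetical:</p>

```python
import numpy as np

def mad_curve(s_opt, s_dc, thresholds):
    """MAD(x): mean |S_opt - S_dc| over held-out molecules with S_opt >= x."""
    curve = []
    for x in thresholds:
        mask = s_opt >= x
        curve.append(np.abs(s_opt[mask] - s_dc[mask]).mean()
                     if mask.any() else np.nan)
    return np.array(curve)

# Hypothetical held-out scores: disagreement concentrated at high S_opt.
s_opt = np.array([0.9, 0.8, 0.2])
s_dc = np.array([0.5, 0.8, 0.2])
curve = mad_curve(s_opt, s_dc, [0.0, 0.7])
```

<p>A MAD curve that rises with the threshold $x$, as here, signals that the classifiers disagree most on exactly the molecules a generator is rewarded for producing.</p>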
<p>On the three original datasets (DRD2, EGFR, JAK2), the MAD between $S_{opt}$ and $S_{dc}$ grows substantially with $S_{opt}$, reaching approximately 0.3 for the highest-scoring molecules. For EGFR, even the top molecules (with $S_{opt}$ between 0.5 and 0.6) have $S_{dc}$ below 0.2. This disagreement exists entirely within the original data distribution, before any generative algorithm is applied.</p>
<p>The authors formalize this with tolerance intervals. At each generation time step $t$, the distribution of optimization scores is $P_t[S_{opt}(x)]$. From the held-out set, the conditional distributions $P[S_{dc}(x) | S_{opt}(x)]$ and $P[S_{mc}(x) | S_{opt}(x)]$ are estimated empirically. The expected control scores at time $t$ are then:</p>
<p>$$
\mathbb{E}_t[S_{dc}] = \int \mathbb{E}[S_{dc} \mid S_{opt}] \, P_t[S_{opt}] \, dS_{opt}
$$</p>
<p>By sampling from these distributions, the authors construct 95% tolerance intervals for the expected control scores at each time step. The observed trajectories of $S_{mc}$ and $S_{dc}$ during generation fall within these intervals, demonstrating that the divergence is fully explained by pre-existing classifier disagreement.</p>
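<p>The tolerance-interval construction can be sketched as follows, using synthetic held-out scores and the paper's binning scheme (25 equal bins over $S_{opt}$, 10 conditional draws per molecule); the dataset sizes and bootstrap count here are illustrative, not the authors' exact settings:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Held-out set: the conditional P[S_dc | S_opt] is approximated by
# binning S_opt into 25 equal-width bins.
held_opt = rng.uniform(0, 1, 5000)
held_dc = np.clip(held_opt - 0.3 * held_opt * rng.uniform(0, 1, 5000), 0, 1)
n_bins = 25
bins = np.minimum((held_opt * n_bins).astype(int), n_bins - 1)
by_bin = [held_dc[bins == b] for b in range(n_bins)]

def tolerance_interval(gen_opt, n_draws=10, n_boot=200, level=0.95):
    """Tolerance interval for the mean control score of a generated
    population, predicted from its optimization scores alone."""
    gb = np.minimum((np.asarray(gen_opt) * n_bins).astype(int), n_bins - 1)
    means = []
    for _ in range(n_boot):
        draws = [rng.choice(by_bin[b], size=n_draws).mean()
                 for b in gb if len(by_bin[b])]
        means.append(np.mean(draws))
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Generated population at some time step t, biased toward high S_opt:
lo, hi = tolerance_interval(rng.uniform(0.7, 1.0, 100))
```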
<h2 id="experimental-setup-original-reproduction-and-corrected-experiments">Experimental Setup: Original Reproduction and Corrected Experiments</h2>
<h3 id="reproduction-of-renz-et-al">Reproduction of Renz et al.</h3>
<p>The original experimental framework uses three datasets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> (842 molecules, 59 actives), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> (842 molecules, 40 actives), and <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> (667 molecules, 140 actives). These are small, noisy, and chemically diverse. Three goal-directed generation algorithms are tested:</p>
<table>
  <thead>
      <tr>
          <th>Algorithm</th>
          <th>Type</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph GA</td>
          <td>Genetic algorithm on molecular graphs</td>
          <td>Mutation and crossover of molecular graphs</td>
      </tr>
      <tr>
          <td>SMILES-LSTM</td>
          <td>Recurrent neural network</td>
          <td>Hill-climbing fine-tuning on best molecules</td>
      </tr>
      <tr>
          <td>MSO</td>
          <td>Particle swarm in CDDD latent space</td>
          <td>Multiple swarm optimization</td>
      </tr>
  </tbody>
</table>
<p>All algorithms are run for 151 epochs with 10 runs each. The reproduction confirms the findings of Renz et al.: $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$ during optimization.</p>
<h3 id="tolerance-interval-analysis">Tolerance interval analysis</h3>
<p>The held-out set is augmented using Topliss tree enumeration on phenyl rings, providing structural analogs that are reasonable from a medicinal chemistry perspective. The optimization score range is divided into 25 equal bins, and for each molecule at each time step, 10 samples from the conditional control score distribution are drawn to construct empirical tolerance intervals.</p>
<h3 id="corrected-experiments-with-adequate-models">Corrected experiments with adequate models</h3>
<p>To test whether generation algorithms actually exploit biases when the classifiers agree, the authors construct two tasks where optimization and control models correlate well:</p>
<ol>
<li><strong>ALDH1 dataset</strong>: 464 molecules from LIT-PCBA, split using similarity-based pairing to maximize intra-pair chemical similarity. This ensures both splits sample similar chemistry.</li>
<li><strong>Modified JAK2</strong>: The same JAK2 dataset but with Random Forest hyperparameters adjusted (200 trees instead of 100, minimum 3 samples per leaf instead of 1) to reduce overfitting to spurious correlations.</li>
</ol>
<p>In both cases, $S_{opt}$, $S_{mc}$, and $S_{dc}$ agree well on the held-out test set. The starting population for generation is set to the held-out test set (rather than random ChEMBL molecules) to avoid building in a distribution shift.</p>
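<p>The similarity-based pairing used for the ALDH1 split can be sketched as a greedy procedure on binary fingerprints: repeatedly take the most similar remaining pair and send one member to each split. This is an illustrative reconstruction under that assumption, not the authors' exact implementation:</p>

```python
import numpy as np

def paired_split(fps, rng):
    """Greedy similarity-based pairing: the most similar remaining pair of
    molecules is split across the two halves, so both halves cover the
    same regions of chemical space."""
    n = len(fps)
    inter = fps @ fps.T                       # Tanimoto on binary fingerprints
    union = fps.sum(1)[:, None] + fps.sum(1)[None, :] - inter
    sim = inter / np.maximum(union, 1)
    np.fill_diagonal(sim, -1)                 # never pair a molecule with itself
    remaining, s1, s2 = set(range(n)), [], []
    while len(remaining) >= 2:
        idx = list(remaining)
        sub = sim[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        a, b = idx[i], idx[j]
        (s1 if rng.random() < 0.5 else s2).append(a)  # random side per pair
        (s2 if a in s1 else s1).append(b)
        remaining -= {a, b}
    return s1, s2

rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(10, 64))  # placeholder binary fingerprints
s1, s2 = paired_split(fps, rng)
```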
<h2 id="findings-no-algorithmic-failure-when-models-agree">Findings: No Algorithmic Failure When Models Agree</h2>
<p>On the corrected experimental setups (ALDH1 and modified JAK2), there is no major divergence between optimization and control scores during generation. The three algorithms produce molecules that score similarly under all three classifiers.</p>
<p>Key findings:</p>
<ol>
<li>
<p><strong>Pre-existing disagreement explains divergence</strong>: On all three original datasets, the divergence between $S_{opt}$ and control scores during generation falls within the tolerance intervals predicted from the initial data distribution alone. The generation algorithms are not exploiting model-specific biases beyond what already exists in the data.</p>
</li>
<li>
<p><strong>Split similarity bias is also pre-existing</strong>: Renz et al. observed that generated molecules are more similar to Split 1 (used to train $C_{opt}$) than Split 2. The authors show this bias is already present in the top-5 percentile of the held-out set: on EGFR and DRD2, high-scoring molecules are inherently more similar to Split 1.</p>
</li>
<li>
<p><strong>Appropriate model design resolves the issue</strong>: When Random Forest hyperparameters are chosen to avoid overfitting (more trees, higher minimum samples per leaf), or when data splits are constructed to be chemically balanced, the classifiers agree and the generation algorithms behave as expected.</p>
</li>
<li>
<p><strong>Quality problems remain independent</strong>: Even when optimization and control scores align, the generated molecules can still be poor drug candidates (unreactive, unsynthesizable, containing unusual fragments). The score divergence issue and the chemical quality issue are separate problems.</p>
</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations acknowledged by the authors</h3>
<ul>
<li>The study focuses on Random Forest classifiers with ECFP fingerprints. The behavior of other model types (e.g., graph neural networks) and descriptor types is not fully explored, though supplementary results show similar patterns with physico-chemical descriptors and Atom-Pair fingerprints.</li>
<li>The corrected ALDH1 task uses a relatively small dataset (464 molecules) with careful split construction. Scaling this approach to larger, more heterogeneous datasets is not demonstrated.</li>
<li>The authors note that their results do not prove generation algorithms never exploit biases; they show that the specific evidence from Renz et al. can be explained without invoking algorithmic failure.</li>
<li>The problem of low-quality generated molecules (poor synthesizability, unusual fragments) remains unresolved and is acknowledged as an open question.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Original tasks</td>
          <td>DRD2, EGFR, JAK2</td>
          <td>842, 842, 667 molecules</td>
          <td>Extracted from ChEMBL; small with few actives</td>
      </tr>
      <tr>
          <td>New task</td>
          <td>ALDH1</td>
          <td>464 molecules (173 with purine substructure)</td>
          <td>Extracted from LIT-PCBA; similarity-based split</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>Topliss tree analogs</td>
          <td>~10x augmentation of held-out set</td>
          <td>Structural analogs via phenyl ring enumeration</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Three goal-directed generation algorithms from the original Renz et al. study:</p>
<ul>
<li><strong>Graph GA</strong>: Genetic algorithm on molecular graphs (Jensen, 2019)</li>
<li><strong>SMILES-LSTM</strong>: Hill-climbing on LSTM-generated SMILES (Segler et al., 2018)</li>
<li><strong>MSO</strong>: Multi-Swarm Optimization in CDDD latent space (Winter et al., 2019)</li>
</ul>
<p>All run for 151 epochs, 10 runs each.</p>
<h3 id="models">Models</h3>
<p>Random Forest classifiers (scikit-learn) with:</p>
<ul>
<li>ECFP fingerprints (radius 2, 1024 bits, RDKit)</li>
<li>Default parameters for original tasks</li>
<li>Modified parameters for JAK2 correction: 200 trees, min 3 samples per leaf</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Purpose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean Average Difference (MAD)</td>
          <td>Measures disagreement between optimization and control scores</td>
          <td>Computed as function of $S_{opt}$ on held-out set</td>
      </tr>
      <tr>
          <td>95% tolerance intervals</td>
          <td>Expected range of control scores given optimization scores</td>
          <td>Empirical, constructed from held-out set</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Split bias assessment</td>
          <td>Morgan fingerprints, radius 2, 1024 bits</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classifier predictive performance</td>
          <td>Used to verify models have comparable accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Sanofi-Public/IDD-papers-avoiding_failure_modes">Code and datasets</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Fork of Renz et al. codebase with modifications</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Langevin, M., Vuilleumier, R., &amp; Bianciotto, M. (2022). Explaining and avoiding failure modes in goal-directed generation of small molecules. <em>Journal of Cheminformatics</em>, 14, 20. <a href="https://doi.org/10.1186/s13321-022-00601-y">https://doi.org/10.1186/s13321-022-00601-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{langevin2022explaining,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Explaining and avoiding failure modes in goal-directed generation of small molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Langevin, Maxime and Vuilleumier, Rodolphe and Bianciotto, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00601-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Failure Modes in Molecule Generation &amp; Optimization</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/</guid><description>Renz et al. show trivial models fool distribution-learning metrics and ML scoring functions introduce exploitable biases in goal-directed molecule generation.</description><content:encoded><![CDATA[<h2 id="an-empirical-critique-of-molecular-generation-evaluation">An Empirical Critique of Molecular Generation Evaluation</h2>
<p>This is an <strong>Empirical</strong> paper that critically examines evaluation practices for molecular generative models. Rather than proposing a new generative method, the paper exposes systematic weaknesses in both distribution-learning metrics and goal-directed optimization scoring functions. The primary contributions are: (1) demonstrating that a trivially simple &ldquo;AddCarbon&rdquo; model can achieve near-perfect scores on widely used distribution-learning benchmarks, and (2) introducing an experimental framework with optimization scores and control scores that reveals model-specific and data-specific biases when ML models serve as scoring functions for goal-directed generation.</p>
<h2 id="evaluation-gaps-in-de-novo-molecular-design">Evaluation Gaps in De Novo Molecular Design</h2>
<p>The rapid growth of deep learning methods for molecular generation (RNN-based SMILES generators, VAEs, GANs, graph neural networks) created a need for standardized evaluation. Benchmarking suites like <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> introduced metrics for validity, uniqueness, novelty, KL divergence over molecular properties, and <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Frechet ChemNet Distance (FCD)</a>. For goal-directed generation, penalized logP became a common optimization target.</p>
<p>However, these metrics leave significant blind spots. Distribution-learning metrics do not detect whether a model merely copies training molecules with minimal modifications. Goal-directed benchmarks often use scoring functions that fail to capture the full requirements of drug discovery (synthetic feasibility, drug-likeness, absence of reactive substructures). When ML models serve as scoring functions, the problem worsens because generated molecules can exploit artifacts of the learned model rather than exhibiting genuinely desirable properties.</p>
<p>At the time of writing, wet-lab validations of generative models remained scarce, with only a handful of studies (Merk et al., Zhavoronkov et al.) demonstrating in vitro activity for generated compounds. The lack of rigorous evaluation left the field unable to distinguish meaningfully innovative methods from those that simply exploit metric weaknesses.</p>
<h2 id="the-copy-problem-and-control-score-framework">The Copy Problem and Control Score Framework</h2>
<p>The paper introduces two key conceptual contributions.</p>
<h3 id="the-addcarbon-model-for-distribution-learning">The AddCarbon Model for Distribution-Learning</h3>
<p>The AddCarbon model is deliberately trivial: it samples a molecule from the training set, inserts a single carbon atom at a random position in its SMILES string, and returns the result if it produces a valid, novel molecule. This model achieves near-perfect scores across most <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> distribution-learning benchmarks:</p>
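<p>A minimal sketch of the AddCarbon model follows. The validity check is stubbed out as a caller-supplied function; a real implementation would parse the candidate SMILES with RDKit's <code>Chem.MolFromSmiles</code>:</p>

```python
import random

def add_carbon(training_smiles, is_valid, max_tries=100, seed=0):
    """AddCarbon: sample a training molecule, insert one 'C' at a random
    position in its SMILES string, and return the result if it is valid
    and not already in the training set."""
    rng = random.Random(seed)
    train = set(training_smiles)
    for _ in range(max_tries):
        smi = rng.choice(training_smiles)
        pos = rng.randrange(len(smi) + 1)
        candidate = smi[:pos] + "C" + smi[pos:]
        if candidate not in train and is_valid(candidate):
            return candidate
    return None

# Toy run with a permissive validity stub (real use: RDKit parsing).
mol = add_carbon(["CCO", "CCN", "c1ccccc1"], is_valid=lambda s: True)
```

Because the edit is a single inserted atom, the output is trivially "novel" under exact-match checks while remaining a near-copy of a training molecule.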
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>RS</th>
          <th>LSTM</th>
          <th>GraphMCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
          <th>AddCarbon</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
          <td>0.871</td>
      </tr>
  </tbody>
</table>
<p>On the FCD metric, the AddCarbon model (0.871) outperforms every trained generative baseline except the LSTM, despite being practically useless (the random sampler scores higher only because it returns training molecules verbatim, hence its novelty of 0.000). This exposes what the authors call the &ldquo;copy problem&rdquo;: current metrics check only for exact matches to training molecules, so minimal edits evade novelty detection. The authors argue that likelihood-based evaluation on hold-out test sets, analogous to standard practice in NLP, would provide a more comprehensive metric.</p>
<h3 id="control-scores-for-goal-directed-generation">Control Scores for Goal-Directed Generation</h3>
<p>For goal-directed generation, the authors introduce a three-score experimental design:</p>
<ul>
<li><strong>Optimization Score (OS)</strong>: Output of a classifier trained on data split 1, used to guide the molecular optimizer.</li>
<li><strong>Model Control Score (MCS)</strong>: Output of a second classifier trained on split 1 with a different random seed. Divergence between OS and MCS quantifies model-specific biases.</li>
<li><strong>Data Control Score (DCS)</strong>: Output of a classifier trained on data split 2. Divergence between OS and DCS quantifies data-specific biases.</li>
</ul>
<p>This mirrors the training/test split paradigm in supervised learning. If a generator truly produces molecules with the desired bioactivity, the control scores should track the optimization score. Divergence between them indicates the optimizer is exploiting artifacts of the specific model or training data rather than learning generalizable chemical properties.</p>
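<p>The three-score design is straightforward to set up with scikit-learn. The sketch below uses random placeholder fingerprints and labels in place of real ECFP4 features and ChEMBL activity data:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 1024)).astype(float)  # placeholder fingerprints
y = rng.integers(0, 2, size=600)                        # placeholder labels

# Two data splits (balanced active/inactive ratios in the paper).
X1, y1, X2, y2 = X[:300], y[:300], X[300:], y[300:]

os_clf  = RandomForestClassifier(random_state=0).fit(X1, y1)  # guides the optimizer
mcs_clf = RandomForestClassifier(random_state=1).fit(X1, y1)  # same data, new seed
dcs_clf = RandomForestClassifier(random_state=0).fit(X2, y2)  # held-out split

def score_triplet(fp):
    """OS, MCS, and DCS for one fingerprint vector."""
    fp = np.atleast_2d(fp)
    return tuple(c.predict_proba(fp)[0, 1] for c in (os_clf, mcs_clf, dcs_clf))

s_opt, s_mc, s_dc = score_triplet(X[0])
```

An optimizer would maximize <code>s_opt</code> only; divergence of <code>s_mc</code> (model bias) or <code>s_dc</code> (data bias) then signals exploitation.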
<h2 id="experimental-setup-three-targets-three-generators">Experimental Setup: Three Targets, Three Generators</h2>
<h3 id="targets-and-data">Targets and Data</h3>
<p>The authors selected three biological targets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">Janus kinase 2</a> (JAK2), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">epidermal growth factor receptor</a> (EGFR), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a> (DRD2). For each target, the data was split into two halves (split 1 and split 2) with balanced active/inactive ratios. Random forest classifiers using binary folded ECFP4 fingerprints (radius 2, size 1024) were trained to produce three scoring functions per target: the OS and MCS on split 1 (different random seeds), and the DCS on split 2.</p>
<h3 id="generators">Generators</h3>
<p>Three molecular generators were evaluated:</p>
<ol>
<li><strong>Graph-based Genetic Algorithm (GA)</strong>: Iteratively applies random mutations and crossovers to a population of molecules, retaining the best in each generation. One of the top performers in GuacaMol.</li>
<li><strong>SMILES-LSTM</strong>: An autoregressive model that generates SMILES character by character, optimized via hill climbing (iteratively sampling, keeping top molecules, fine-tuning). Also a top GuacaMol performer.</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle Swarm Optimization</a> (PS)</strong>: Optimizes molecules in the continuous latent space of a SMILES-based sequence-to-sequence model.</li>
</ol>
<p>Each optimizer was run 10 times per target dataset.</p>
<h2 id="score-divergence-and-exploitable-biases">Score Divergence and Exploitable Biases</h2>
<h3 id="optimization-vs-control-score-divergence">Optimization vs. Control Score Divergence</h3>
<p>Across all three targets and all three generators, the OS consistently outpaced both control scores during optimization. The DCS sometimes stagnated or even decreased while the OS continued to climb. This divergence demonstrates that the generators exploit biases in the scoring function rather than discovering genuinely active compounds.</p>
<p>The MCS also diverged from the OS despite being trained on exactly the same data, confirming model-specific biases: the optimization exploits features unique to the particular random forest instance. The larger gap between OS and DCS (compared to OS and MCS) indicates that data-specific biases contribute more to the divergence than model-specific biases.</p>
<h3 id="chemical-space-migration">Chemical Space Migration</h3>
<p>Optimized molecules migrated toward the region of split 1 actives (used to train the OS), as shown by t-SNE embeddings and nearest-neighbor Tanimoto similarity analysis. Optimized molecules had more similar neighbors in split 1 than in split 2, confirming data-specific bias. By the end of optimization, generated molecules occupied different regions of chemical space than known actives when measured by logP and molecular weight, with compounds from the same optimization run forming distinct clusters.</p>
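<p>The nearest-neighbor analysis reduces to Tanimoto similarity over binary fingerprints. A minimal sketch with placeholder fingerprint matrices for the generated molecules and the two splits:</p>

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def nn_similarity(query, reference):
    """Nearest-neighbor Tanimoto similarity of each query fingerprint
    against a reference split."""
    return np.array([max(tanimoto(q, r) for r in reference) for q in query])

rng = np.random.default_rng(0)
generated = rng.integers(0, 2, size=(20, 128))  # placeholder fingerprints
split1 = rng.integers(0, 2, size=(50, 128))
split2 = rng.integers(0, 2, size=(50, 128))

# Data-specific bias shows up as systematically higher nearest-neighbor
# similarity to split 1 (which trained the OS) than to split 2.
bias = nn_similarity(generated, split1).mean() - nn_similarity(generated, split2).mean()
```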
<h3 id="quality-of-generated-molecules">Quality of Generated Molecules</h3>
<p>High-scoring generated molecules frequently contained problematic substructures: reactive dienes, nitrogen-fluorine bonds, long heteroatom chains that are synthetically infeasible, and highly uncommon functional groups. The LSTM optimizer showed a bias toward high molecular weight, low diversity, and high logP values. These molecules would be rejected by medicinal chemists despite their high optimization scores.</p>
<h3 id="key-takeaways">Key Takeaways</h3>
<p>The authors emphasize several practical implications:</p>
<ol>
<li><strong>Early stopping</strong>: Control scores can indicate when further optimization is exploiting biases rather than finding better molecules. Optimization should stop when control scores plateau.</li>
<li><strong>Scoring function iteration</strong>: In practice, generative models are &ldquo;highly adept at exploiting&rdquo; incomplete scoring functions, necessitating several iterations of generation and scoring function refinement.</li>
<li><strong>Synthetic accessibility</strong>: Even high-scoring molecules are useless if they cannot be synthesized. The authors consider this a major challenge for practical adoption.</li>
<li><strong>Likelihood-based evaluation</strong>: For distribution-learning, the authors recommend reporting test-set likelihoods for likelihood-based models, following standard NLP practice.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bioactivity data</td>
          <td>ChEMBL (JAK2, EGFR, DRD2)</td>
          <td>See Table S1</td>
          <td>Binary classification tasks, split 50/50</td>
      </tr>
      <tr>
          <td>Distribution-learning</td>
          <td>GuacaMol training set</td>
          <td>Subset of ChEMBL</td>
          <td>Used as starting population for GA and PS</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Scoring function</strong>: Random forest classifier (scikit-learn) on binary ECFP4 fingerprints (size 1024, radius 2, RDKit)</li>
<li><strong>GA</strong>: Graph-based genetic algorithm from Jensen (2019)</li>
<li><strong>LSTM</strong>: SMILES-LSTM with hill climbing, pretrained model from GuacaMol</li>
<li><strong>PS</strong>: Particle swarm optimization in latent space of a sequence-to-sequence model (Winter et al. 2019)</li>
<li>Each optimizer run 10 times per target</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization Score (OS)</td>
          <td>RF classifier on split 1</td>
          <td>Guides optimization</td>
      </tr>
      <tr>
          <td>Model Control Score (MCS)</td>
          <td>RF on split 1, different seed</td>
          <td>Detects model-specific bias</td>
      </tr>
      <tr>
          <td>Data Control Score (DCS)</td>
          <td>RF on split 2</td>
          <td>Detects data-specific bias</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> metrics</td>
          <td>Validity, uniqueness, novelty, KL div, FCD</td>
          <td>For distribution-learning</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ml-jku/mgenerators-failure-modes">ml-jku/mgenerators-failure-modes</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Data, code, and results</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{renz2019failure,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{On failure modes in molecule generation and optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Renz, Philipp and Van Rompaey, Dries and Wegner, J{\&#34;o}rg Kurt and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Drug Discovery Today: Technologies}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{32-33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55--63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ddtec.2020.09.003}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S., &amp; Klambauer, G. (2019). On failure modes in molecule generation and optimization. <em>Drug Discovery Today: Technologies</em>, 32-33, 55-63. <a href="https://doi.org/10.1016/j.ddtec.2020.09.003">https://doi.org/10.1016/j.ddtec.2020.09.003</a></p>
<p><strong>Publication</strong>: Drug Discovery Today: Technologies, Volume 32-33, 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ml-jku/mgenerators-failure-modes">Code and data (GitHub)</a></li>
</ul>
]]></content:encoded></item><item><title>Benchmarking Molecular Property Prediction at Scale</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/</guid><description>A study training 62,820 models finds fixed molecular representations often outperform learned representations for property prediction.</description><content:encoded><![CDATA[<h2 id="a-large-scale-empirical-study-of-molecular-property-prediction">A Large-Scale Empirical Study of Molecular Property Prediction</h2>
<p>This is an <strong>Empirical</strong> paper that systematically benchmarks molecular property prediction across multiple dimensions: molecular representations, model architectures, evaluation metrics, data splitting strategies, and chemical space generalization. The primary contribution is a rigorous, large-scale comparison (62,820 trained models) showing that traditional machine learning models on fixed molecular representations frequently outperform recent deep representation learning approaches, and that several overlooked evaluation factors (statistical testing, metric choice, activity cliffs, dataset size) significantly influence conclusions about model performance.</p>
<h2 id="motivation-overlooked-evaluation-pitfalls-in-molecular-property-prediction">Motivation: Overlooked Evaluation Pitfalls in Molecular Property Prediction</h2>
<p>Molecular property prediction is a core task in AI-driven drug discovery, and recent years have seen a proliferation of representation learning methods (transformers on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, GNNs on molecular graphs) claiming improved performance on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet benchmark datasets</a>. However, the authors identify several systemic problems in how these methods are evaluated:</p>
<ol>
<li><strong>Heavy reliance on MoleculeNet benchmarks</strong>, which may not reflect real-world drug discovery challenges. Some benchmark tasks (e.g., SIDER, ClinTox) are arguably unreasonable because they try to predict outcomes from chemical structure alone when other factors (food-drug interactions, patient-level variables) dominate.</li>
<li><strong>Lack of statistical rigor.</strong> Most papers report mean metrics over 3 or 10 splits without statistical tests. Without rigorous analysis, improved metrics could be statistical noise.</li>
<li><strong>Inconsistent data splits.</strong> Across studies, the actual splits vary because seeds and splitting implementations differ, making cross-paper comparisons unreliable.</li>
<li><strong>Inappropriate metrics.</strong> AUROC, the default for classification, can overestimate performance, especially on imbalanced datasets. Precision-oriented metrics (PPV, NPV) may be more relevant for virtual screening.</li>
<li><strong>Neglect of activity cliffs.</strong> Most studies only evaluate inter-scaffold generalization via scaffold splits, ignoring intra-scaffold generalization where structurally similar molecules exhibit drastically different activities (<a href="/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/">activity cliffs</a>).</li>
</ol>
<h2 id="core-contribution-fixed-representations-often-outperform-learned-representations">Core Contribution: Fixed Representations Often Outperform Learned Representations</h2>
<p>The central finding is that traditional ML models (RF, SVM, XGBoost) operating on fixed molecular representations (RDKit2D descriptors, Morgan fingerprints, MACCS keys, AtomPairs) frequently outperform recent self-supervised pretrained models (<a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, GROVER) across diverse datasets. The authors frame the paper around a central thesis:</p>
<blockquote>
<p>&ldquo;A model cannot save an unqualified dataset which cannot remedy an improper evaluation for an ambiguous chemical space generalization claim.&rdquo;</p></blockquote>
<p>Key findings on representations and models:</p>
<ul>
<li><strong>RF on RDKit2D descriptors</strong> achieves the best performance on BACE, BBBP, ESOL, and Lipop under scaffold split. MolBERT only matches RF in HIV.</li>
<li><strong>Concatenating RDKit2D descriptors to GROVER&rsquo;s learned embeddings (GROVER_RDKit)</strong> significantly improves performance, suggesting the learned representations alone are insufficient and that fixed descriptors carry substantial predictive signal.</li>
<li><strong>For binding activity datasets</strong> (<a href="https://en.wikipedia.org/wiki/Opioid_receptor">opioid receptors</a> MOR, DOR, KOR), MorganBits fingerprints outperform other representations, consistent with the structural nature of binding.</li>
<li><strong>PhysChem descriptors</strong> excel on datasets where properties correlate strongly with simple molecular features (e.g., ESOL has a near-linear relationship between MolLogP and solubility), but perform poorly on binding activity datasets where the relationship is more complex.</li>
</ul>
<h2 id="experimental-setup-62820-models-across-diverse-datasets">Experimental Setup: 62,820 Models Across Diverse Datasets</h2>
<h3 id="models-evaluated">Models evaluated</h3>
<p>The study evaluates nine models across three categories:</p>
<ul>
<li><strong>Traditional ML</strong>: Random Forest (RF), Support Vector Machine (SVM), XGBoost</li>
<li><strong>Regular neural networks</strong>: RNN (GRU variant), GCN, GIN</li>
<li><strong>Pretrained models</strong>: MolBERT (SMILES-based, ~85M parameters, pretrained on 1.6M molecules), GROVER (graph-based, ~48M parameters, pretrained on ~10M molecules), and GROVER_RDKit (GROVER with concatenated RDKit2D descriptors)</li>
</ul>
<h3 id="molecular-representations">Molecular representations</h3>
<p>Six fixed representations are evaluated: RDKit2D descriptors (200 features), PhysChem descriptors (11 features), MACCS keys, MorganBits fingerprints, MorganCounts fingerprints, and AtomPairs fingerprints. Morgan fingerprints use radius 2 and 2048 bits; testing showed little difference among common parameter choices.</p>
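<p>Bit-vector fingerprints like MorganBits are built by hashing substructure features and folding the hashes into a fixed-length vector. A minimal sketch of that folding step (and the Tanimoto similarity typically computed on the result) in pure Python; the integer IDs are synthetic stand-ins for the Morgan hashes RDKit would produce:</p>

```python
# Fold a set of hashed substructure identifiers into a fixed-length bit
# vector -- the final step behind 2048-bit MorganBits-style fingerprints.
# The feature IDs below are made-up stand-ins for real Morgan hashes.
def fold_fingerprint(feature_ids, n_bits=2048):
    bits = [0] * n_bits
    for fid in feature_ids:
        bits[fid % n_bits] = 1  # hash collisions are accepted by design
    return bits

def tanimoto(a, b):
    """Tanimoto similarity between two bit vectors."""
    on_both = sum(x & y for x, y in zip(a, b))
    on_any = sum(x | y for x, y in zip(a, b))
    return on_both / on_any if on_any else 0.0

fp1 = fold_fingerprint([98513984, 4048591891, 2245384272])
fp2 = fold_fingerprint([98513984, 4048591891, 864674487])
```

Folding trades a small collision rate for a fixed input size, which is why the paper found little sensitivity to the exact bit-width choice.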
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoleculeNet benchmarks</td>
          <td>BACE, BBBP, HIV</td>
          <td>Classification</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>MoleculeNet benchmarks</td>
          <td>ESOL, FreeSolv, Lipop</td>
          <td>Regression</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>Opioids-related</td>
          <td>MDR1, CYP2D6, CYP3A4, MOR, DOR, KOR</td>
          <td>Classification + Regression</td>
          <td>ChEMBL</td>
      </tr>
      <tr>
          <td>Activity datasets</td>
          <td>24 targets</td>
          <td>Regression</td>
          <td>Cortes-Ciriano et al.</td>
      </tr>
      <tr>
          <td>Activity datasets</td>
          <td>30 targets (MoleculeACE)</td>
          <td>Regression</td>
          <td>Tilborg et al.</td>
      </tr>
      <tr>
          <td>Descriptor datasets</td>
          <td>MolWt, NumAtoms (16 sizes each)</td>
          <td>Regression</td>
          <td>ZINC250k</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-protocol">Evaluation protocol</h3>
<ul>
<li>Both scaffold and random splits (80:10:10 ratio)</li>
<li><strong>30 different random seeds</strong> per experiment for statistical rigor</li>
<li><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U test</a> for pairwise significance ($p &lt; 0.05$, two-sided)</li>
<li>Multiple metrics per task: AUROC, AUPRC, PPV, NPV for classification; RMSE, MAE, $R^2$, Pearson $R$ for regression</li>
</ul>
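<p>The per-seed significance protocol can be sketched with SciPy's Mann-Whitney U test; the two AUROC arrays below are synthetic stand-ins for two models' results over 30 seeds:</p>

```python
# Sketch of the paper's significance protocol: compare two models'
# per-seed AUROC values (30 seeds each) with a two-sided
# Mann-Whitney U test at p < 0.05. All numbers here are synthetic.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
auroc_rf = rng.normal(0.88, 0.02, size=30)       # hypothetical RF scores
auroc_molbert = rng.normal(0.85, 0.03, size=30)  # hypothetical MolBERT scores

stat, p = mannwhitneyu(auroc_rf, auroc_molbert, alternative="two-sided")
significant = p < 0.05
```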
<h3 id="key-metrics">Key metrics</h3>
<p>Classification:</p>
<p>$$
\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$</p>
<p>$$
\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}
$$</p>
<p>Regression:</p>
<p>$$
\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
$$</p>
<p>$$
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|
$$</p>
<p>$$
\text{Pearson}_R = \frac{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})(\hat{y}_i - \bar{y}_{pred})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})^2 \sum_{i=1}^{N} (\hat{y}_i - \bar{y}_{pred})^2}}
$$</p>
<p>$$
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})^2}
$$</p>
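<p>As a minimal NumPy sketch, the classification and regression metrics above (PPV/NPV assume predictions already binarized, e.g. via a Youden's $J$ threshold):</p>

```python
# Minimal NumPy implementations of the paper's headline metrics.
# PPV/NPV assume at least one predicted positive/negative, respectively.
import numpy as np

def ppv(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fp)

def npv(y_true, y_pred):
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tn / (tn + fn)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Toy labels/predictions for illustration.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
```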
<h2 id="key-findings-metrics-activity-cliffs-and-dataset-size">Key Findings: Metrics, Activity Cliffs, and Dataset Size</h2>
<h3 id="statistical-testing-is-essential">Statistical testing is essential</h3>
<p>Without statistical tests, there is a real risk of drawing incorrect conclusions. Analysis of individual splits shows that in certain splits, MolBERT or GROVER can appear to outperform RF, even though on aggregate with proper statistical testing, RF is significantly better. For example, in BBBP, RF dominates in 20 of 30 splits, but the remaining 10 could mislead a researcher using only a single split.</p>
<h3 id="metric-choice-changes-conclusions">Metric choice changes conclusions</h3>
<p>Different evaluation metrics can lead to contradictory conclusions about the same models:</p>
<ul>
<li>In BBBP under scaffold split, RF significantly outperforms other models by AUROC, but shows similar performance when evaluated by PPV or NPV.</li>
<li>In FreeSolv, GROVER outperforms RF by Pearson $R$ ($p &lt; 0.05$) but shows similar performance by $R^2$.</li>
<li>Pearson $R$ can be far more optimistic than $R^2$: even when $R^2$ drops to zero or below, Pearson $R$ can remain around 0.5.</li>
<li>AUROC can be over-optimistic, especially on imbalanced datasets like CYP2D6 and CYP3A4.</li>
</ul>
<p>The authors argue that PPV and NPV are more practically relevant for <a href="/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/">virtual screening</a> than AUROC or AUPRC, since the goal is to identify true hits among predicted positives (or true non-binders among predicted negatives).</p>
<h3 id="activity-cliffs-pose-a-major-challenge">Activity cliffs pose a major challenge</h3>
<p>Activity cliffs (ACs), defined here as <a href="https://en.wikipedia.org/wiki/IC50">IC50</a> values spanning at least two orders of magnitude within one scaffold, are prevalent in the opioid-related datasets. Although AC scaffolds represent only about 10% of scaffolds, they encompass 25-46% of all molecules:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AC scaffolds (%)</th>
          <th>AC molecules (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MDR1</td>
          <td>62 (10.2%)</td>
          <td>594 (41.3%)</td>
      </tr>
      <tr>
          <td>CYP2D6</td>
          <td>124 (9.3%)</td>
          <td>710 (31.0%)</td>
      </tr>
      <tr>
          <td>CYP3A4</td>
          <td>146 (7.2%)</td>
          <td>926 (25.2%)</td>
      </tr>
      <tr>
          <td>MOR</td>
          <td>213 (13.1%)</td>
          <td>1627 (46.1%)</td>
      </tr>
      <tr>
          <td>DOR</td>
          <td>178 (11.6%)</td>
          <td>1342 (41.6%)</td>
      </tr>
      <tr>
          <td>KOR</td>
          <td>218 (13.1%)</td>
          <td>1502 (45.2%)</td>
      </tr>
  </tbody>
</table>
<p>Prediction performance is consistently worse for AC molecules, indicating limited intra-scaffold generalization. Removing edge-case molecules (those sharing scaffolds with pIC50 spanning 5 to 7) from test sets generally improves classification performance, confirming that activity cliffs are a key source of prediction error.</p>
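<p>The activity-cliff definition above amounts to a grouping pass over (scaffold, pIC50) pairs. A sketch in pure Python, assuming scaffold strings are precomputed (e.g. Bemis-Murcko scaffolds from RDKit); the data is synthetic:</p>

```python
# Flag activity-cliff (AC) scaffolds: scaffolds whose member molecules'
# pIC50 values span at least two log units (two orders of magnitude
# in IC50). Scaffold strings and values here are synthetic examples.
from collections import defaultdict

def find_ac_scaffolds(records, span=2.0):
    """records: iterable of (scaffold_smiles, pic50) pairs."""
    by_scaffold = defaultdict(list)
    for scaffold, pic50 in records:
        by_scaffold[scaffold].append(pic50)
    return {s for s, vals in by_scaffold.items()
            if max(vals) - min(vals) >= span}

data = [("c1ccccc1", 5.1), ("c1ccccc1", 7.4),  # spans 2.3 log units -> AC
        ("c1ccncc1", 6.0), ("c1ccncc1", 6.8)]  # spans 0.8 -> not AC
```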
<h3 id="dataset-size-is-critical-for-representation-learning">Dataset size is critical for representation learning</h3>
<p>Experiments on descriptor datasets (predicting MolWt and NumAtoms) reveal clear patterns:</p>
<ul>
<li>With fewer than 1K data points, traditional ML on fixed representations outperforms all neural network models except pretrained GROVER, which shows competitive performance in the low-data regime.</li>
<li>MolBERT shows severely limited performance (RMSE &gt; 200 for MolWt) with fewer than 10K data points.</li>
<li>RNN achieves the best performance when dataset size exceeds 10K, demonstrating the promise of representation learning in the &ldquo;big-data&rdquo; regime.</li>
<li>SVM achieves near-perfect RMSE (close to zero) on datasets larger than 10K when paired with AtomPairs fingerprints.</li>
<li>GROVER&rsquo;s performance does not substantially improve with increasing dataset size, while MolBERT improves at 100K but is slow to benefit from more data.</li>
</ul>
<h3 id="representation-learning-models-show-higher-metric-variability">Representation learning models show higher metric variability</h3>
<p>Representation learning models, particularly GROVER, exhibit higher variability in performance metrics across splits. This variability correlates negatively with mean performance: models with higher variability tend to perform worse on average. The authors emphasize the importance of reporting metric variability alongside means.</p>
<h3 id="scaffold-split-versus-random-split">Scaffold split versus random split</h3>
<p>Prediction performance under scaffold split is consistently worse than under random split, confirming the inter-scaffold generalization challenge. Notably, random split alleviates the intra-scaffold generalization challenge because some AC scaffolds are seen during training.</p>
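<p>A scaffold split of this kind can be sketched as a greedy assignment of whole scaffold groups to train/valid/test, so no scaffold crosses a split boundary. This is a simplified stand-in for the RDKit/Chemprop-style implementation, with molecules represented by index and scaffold keys assumed precomputed:</p>

```python
# Greedy scaffold split at roughly 80:10:10: largest scaffold groups go
# to train first, then valid, with the remainder falling to test.
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    groups = defaultdict(list)
    for idx, s in enumerate(scaffolds):
        groups[s].append(idx)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= frac_train * n:
            train += members
        elif len(valid) + len(members) <= frac_valid * n:
            valid += members
        else:
            test += members
    return train, valid, test

# One dominant scaffold "A" plus two singletons.
scaffolds = ["A"] * 8 + ["B"] + ["C"]
train, valid, test = scaffold_split(scaffolds)
```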
<h3 id="descriptors-correlate-with-specific-properties">Descriptors correlate with specific properties</h3>
<p>PhysChem descriptors excel on datasets where molecular properties correlate with simple descriptors (e.g., MolLogP has near $-1$ correlation with ESOL labels). For binding activity datasets, correlation coefficients mostly fall within $[-0.5, 0.5]$, explaining why PhysChem descriptors show limited performance on those tasks, while structural fingerprints are more useful.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors acknowledge several limitations:</p>
<ol>
<li><strong>Uncertainty from model training</strong> (random initialization, mini-batch shuffling) was not fully addressed. Ensembling was not evaluated due to computational cost.</li>
<li><strong>Experimental uncertainty in labels</strong> (noise, measurement error in pIC50 values) was not modeled, though it can be <a href="https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity">heteroscedastic</a> and impact performance.</li>
<li><strong>Model explainability</strong> was not covered, although it is important for building trust in AI tools for drug discovery.</li>
<li>The study focused on GROVERbase only (not GROVERlarge) due to computational constraints.</li>
</ol>
<p>Future directions include: exploring better ways to use fixed representations alongside learned ones, developing techniques for chemical space generalization (both inter- and intra-scaffold), incorporating experimental uncertainty into model training and evaluation, and generating larger high-quality datasets to fully harness representation learning models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmark</td>
          <td>MoleculeNet (BACE, BBBP, HIV, ESOL, FreeSolv, Lipop)</td>
          <td>642-41,127 molecules</td>
          <td>Downloaded from MolMapNet; max length &lt; 400</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>Opioids-related (MDR1, CYP2D6, CYP3A4, MOR, DOR, KOR)</td>
          <td>Varies</td>
          <td>Collected from ChEMBL27; pIC50 values</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>Cortes-Ciriano et al. 24 targets</td>
          <td>Varies</td>
          <td>Activity data for drug targets</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>MoleculeACE 30 targets</td>
          <td>Varies</td>
          <td>Activity cliffs emphasis</td>
      </tr>
      <tr>
          <td>Descriptor</td>
          <td>MolWt, NumAtoms from <a href="/notes/chemistry/datasets/zinc-22/">ZINC250k</a></td>
          <td>0.1K to 100K</td>
          <td>16 dataset sizes per descriptor</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>RF: 500 trees (following Chemprop)</li>
<li>SVM: linear kernel</li>
<li>XGBoost: gradient boosting regressor/classifier with default hyperparameters</li>
<li>RNN: GRU variant, hidden size 512, 3 fully connected layers</li>
<li>GCN/GIN: embedding dimension 300, 5 convolutional layers, hidden size 512</li>
<li>MolBERT: BERTBase architecture, 768 embedding, 12 layers, 12 heads, ~85M parameters (769 fine-tuned)</li>
<li>GROVER: GROVERbase, ~48M parameters (~5.2M fine-tuned)</li>
<li>All splits repeated 30 times with seeds 0-29</li>
</ul>
<h3 id="models">Models</h3>
<p>All model configurations, splits, and raw predictions are available in the <a href="https://github.com/dengjianyuan/Respite_MPP">GitHub repository</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics: AUROC, AUPRC, PPV, NPV (classification); RMSE, MAE, $R^2$, Pearson $R$ (regression). Statistical testing via Mann-Whitney U test ($p &lt; 0.05$, two-sided). <a href="https://en.wikipedia.org/wiki/Youden%27s_J_statistic">Youden&rsquo;s $J$ statistic</a> used to determine classification threshold for PPV/NPV.</p>
<h3 id="hardware">Hardware</h3>
<p>All neural network experiments run on a single NVIDIA V100 GPU for 100 epochs. Batch size 32 for most experiments; 256 for GROVER on HIV due to compute time (MolBERT takes ~3 hours per split on HIV at batch size 32; GROVER takes ~5 hours at batch size 256). The study is partially funded by Stony Brook University OVPR Seed Grant, using the AI Institute at Stony Brook for computational resources.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dengjianyuan/Respite_MPP">Respite_MPP</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Code, data, and raw predictions</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.1038/s41467-023-41948-6">Nature Communications article</a></td>
          <td>Paper</td>
          <td>CC-BY-4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Deng, J., Yang, Z., Wang, H., Ojima, I., Samaras, D., &amp; Wang, F. (2023). A systematic study of key elements underlying molecular property prediction. <em>Nature Communications</em>, 14, 6395. <a href="https://doi.org/10.1038/s41467-023-41948-6">https://doi.org/10.1038/s41467-023-41948-6</a></p>
<p><strong>Publication</strong>: Nature Communications 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/dengjianyuan/Respite_MPP">Respite_MPP GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{deng2023systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic study of key elements underlying molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Deng, Jianyuan and Yang, Zhibo and Wang, Hehe and Ojima, Iwao and Samaras, Dimitris and Wang, Fusheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6395}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-023-41948-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Scaling of Deep Chemical Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/neural-scaling-of-deep-chemical-models/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/neural-scaling-of-deep-chemical-models/</guid><description>Frey et al. discover neural scaling laws for chemical LLMs and GNN interatomic potentials, showing power-law loss improvements with scale.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>discovery paper</strong> that identifies empirical neural scaling laws in two distinct domains of chemical deep learning: large language models (LLMs) for generative chemistry and graph neural networks (GNNs) for machine-learned interatomic potentials. The paper also introduces training performance estimation (TPE) as a practical tool for accelerating hyperparameter optimization in these domains.</p>
<h2 id="why-scaling-laws-matter-for-chemistry">Why scaling laws matter for chemistry</h2>
<p>Neural scaling laws, first characterized for NLP models by Kaplan et al. (2020), describe how model loss decreases as a power law with increasing model size, dataset size, or compute:</p>
<p>$$
L(R) = \alpha R^{-\beta}
$$</p>
<p>where $\alpha$ is a coefficient, $\beta$ is the scaling exponent, and $R$ is the resource being scaled (parameters, data, or compute). These relationships have guided resource allocation decisions in NLP and computer vision, but their applicability to scientific deep learning was unknown.</p>
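<p>Because $L(R) = \alpha R^{-\beta}$ is linear in log-log space, the exponent can be recovered by ordinary least squares on $\log L$ versus $\log R$. A sketch on synthetic data with a known exponent:</p>

```python
# Recover the scaling exponent beta from loss-vs-resource data by linear
# regression in log-log space: log L = log alpha - beta * log R.
# R and the ground-truth (alpha, beta) are synthetic for this demo.
import numpy as np

R = np.array([1e5, 1e6, 1e7, 1e8, 1e9])  # e.g. non-embedding parameters
alpha, beta = 5.0, 0.17                  # ground truth for the demo
L = alpha * R ** (-beta)

slope, intercept = np.polyfit(np.log(R), np.log(L), deg=1)
beta_hat = -slope
alpha_hat = np.exp(intercept)
```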
<p>Chemical deep learning differs from standard NLP and vision tasks in several key ways. Physics-based priors (like symmetry constraints) may reduce the need for massive scale. The heterogeneity of chemical space and molecular tasks makes general pre-training more challenging. There are no established default architectures, datasets, or training recipes at large scale for chemistry.</p>
<p>This paper asks: do the same scaling behaviors hold for chemical models, and how do physical priors affect them?</p>
<h2 id="training-performance-estimation-for-efficient-scaling">Training performance estimation for efficient scaling</h2>
<p>Before running expensive scaling experiments, the authors needed a way to efficiently select hyperparameters. They introduced TPE, a generalization of training speed estimation (TSE) to new domains. TSE computes the cumulative training loss over the first $T$ epochs:</p>
<p>$$
\text{TSE} = \sum_{t=1}^{T} \left( \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}\left(f_{\theta(t,i)}(\mathbf{X}_i), \mathbf{y}_i\right) \right)
$$</p>
<p>where $B$ is the number of training steps per epoch, $\mathcal{L}$ is the loss function, and $f_{\theta(t,i)}$ is the network at epoch $t$ and mini-batch $i$. A linear regression then predicts converged loss from early-training TSE:</p>
<p>$$
L = m \times \text{TSE} + b
$$</p>
<p>Using only 20% of the total training budget, TPE achieves $R^2 = 0.98$ and Spearman&rsquo;s $\rho = 1.0$ for ChemGPT on the MOSES dataset. For GNNs, it achieves $R^2 \geq 0.86$ and $\rho \geq 0.92$ across SchNet, PaiNN, and SpookyNet. This enables discarding suboptimal configurations early, saving up to 90% of compute.</p>
<h2 id="chemgpt-scaling-chemical-language-models">ChemGPT: scaling chemical language models</h2>
<p>ChemGPT is a GPT-3-style autoregressive transformer for molecular generation. It uses GPT-Neo as its backbone with a SELFIES tokenizer, factorizing the probability of a molecular sequence as:</p>
<p>$$
p(x) = \prod_{i=1}^{n} p\left(s_i \mid s_1, \dots, s_{i-1}\right)
$$</p>
<p>The authors trained ChemGPT models ranging from ~78K to over 1 billion non-embedding parameters on subsets of PubChem10M (up to ~10 million molecules, or ~300 million tokens). Key findings from the scaling experiments:</p>
<ul>
<li><strong>Pre-training loss monotonically improves</strong> with increasing dataset size up to nearly 10 million molecules, with no saturation observed.</li>
<li><strong>For a fixed data budget</strong>, increasing model size provides monotonic improvements until models reach ~1 billion parameters.</li>
<li><strong>The scaling exponent</strong> $\beta = 0.17 \pm 0.01$ for the largest dataset (after excluding the three largest models from the power-law fit), and $\beta = 0.30 \pm 0.01$ for the next largest dataset.</li>
<li><strong>Resolution-limited regimes</strong> appear where the power-law behavior breaks down, indicating either insufficient data for a given model size or vice versa. These regimes shift depending on the data budget.</li>
</ul>
<p>An interesting observation: for small datasets, large models ($10^7$ parameters and above) still provide notable loss improvements, suggesting that scaling up model size helps even when data is limited.</p>
<h2 id="neural-force-field-scaling-with-gnns">Neural force field scaling with GNNs</h2>
<p>For tasks requiring three-dimensional molecular geometry, the authors studied GNN-based neural force fields (NFFs). These models predict energies $\hat{E} = f_\theta(X)$ and derive forces by differentiation:</p>
<p>$$
\hat{F}_{ij} = -\frac{\partial \hat{E}}{\partial r_{ij}}
$$</p>
<p>Training uses an L1 loss over energies and forces:</p>
<p>$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left[ \alpha_E | E_i - \hat{E}_i | + \alpha_F | \mathbf{F}_i - \hat{\mathbf{F}}_i | \right]
$$</p>
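<p>The combined L1 objective can be written directly in NumPy. How the per-frame force error is reduced over atoms and Cartesian components is an assumption here (summed absolute components), and force-only training corresponds to $\alpha_E = 0$:</p>

```python
# Weighted energy/force L1 loss for a batch of N frames.
# E, E_hat: shape (N,); F, F_hat: shape (N, atoms, 3).
import numpy as np

def nff_l1_loss(E, E_hat, F, F_hat, alpha_e=0.1, alpha_f=1.0):
    energy_term = np.abs(E - E_hat)                  # per-frame energy error
    force_term = np.abs(F - F_hat).sum(axis=(1, 2))  # per-frame force error
    return np.mean(alpha_e * energy_term + alpha_f * force_term)
```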
<p>Four NFF architectures were studied, spanning a range of physical priors:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Key Characteristic</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>E(3) invariant</td>
          <td>Continuous filter convolutions</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>E(3) equivariant</td>
          <td>Equivariant message passing</td>
      </tr>
      <tr>
          <td>Allegro</td>
          <td>E(3) equivariant</td>
          <td>Local, learned many-body functions</td>
      </tr>
      <tr>
          <td>SpookyNet</td>
          <td>E(3) equivariant</td>
          <td>Non-local interactions, empirical corrections</td>
      </tr>
  </tbody>
</table>
<p>Model capacity is parameterized as $c = d \times w$ (depth times width). Models were trained on subsets of the ANI-1x dataset (up to 100,000 geometries, corresponding to ~4.5 million force labels).</p>
<p>Key GNN scaling findings:</p>
<ul>
<li><strong>PaiNN shows monotonic loss improvement</strong> with increasing dataset size and strong correlation between converged loss and model capacity (Spearman&rsquo;s $\rho \geq 0.88$).</li>
<li><strong>Equivariant GNNs (PaiNN, Allegro) show better scaling efficiency</strong> than invariant GNNs (SchNet), with larger $\beta$ values.</li>
<li><strong>The scaling exponent for equivariant GNNs</strong> is $\beta = 0.26$, indicating that physics-based equivariance priors provide greater sample efficiency that persists to much larger and more chemically diverse datasets than previously studied.</li>
<li><strong>A transition at $10^4$ datapoints</strong> shows nearly perfect rank correlation between model capacity and converged loss ($\rho \geq 0.93$), suggesting this may be a threshold where models move from memorization to generalization.</li>
</ul>
<h2 id="results-and-practical-implications">Results and practical implications</h2>
<p>The scaling results provide actionable guidance for resource allocation:</p>
<ul>
<li>For <strong>chemical LLMs with large data budgets</strong>, the greatest loss improvements come from scaling up small models (around $10^5$ parameters).</li>
<li>For <strong>small data budgets</strong>, rapid improvements come from scaling medium-sized models ($10^7$ parameters).</li>
<li>For <strong>NFFs</strong>, low-capacity models show diminishing returns with more data, while high-capacity models show rapid improvements with increasing dataset size.</li>
<li><strong>Neither model type has saturated</strong> with respect to model size, dataset size, or compute, suggesting substantial room for improvement with further scaling.</li>
</ul>
<p>The 300-million-parameter ChemGPT trained on 300 million tokens and the PaiNN model with capacity ~1,000 trained on $10^5$ frames achieved the minimum losses in their respective scaling plots, providing concrete targets for practitioners.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Data:</strong></p>
<ul>
<li>PubChem10M (10M SMILES strings, via DeepChem)</li>
<li>MOSES (2M molecules, for TPE validation)</li>
<li>ANI-1x (5M DFT calculations, via Figshare)</li>
<li>Revised MD-17 (10 small organic molecules, 10,000 frames for TPE)</li>
</ul>
<p><strong>Models:</strong></p>
<ul>
<li>ChemGPT: GPT-Neo backbone, 24 layers, widths from 16 to 2,048, sizes from ~78K to ~1.2B non-embedding parameters</li>
<li>SchNet, PaiNN, Allegro, SpookyNet: widths of 16, 64, 256; depths of 2, 3, 4; 5 Angstrom cutoff</li>
</ul>
<p><strong>Training:</strong></p>
<ul>
<li>ChemGPT: AdamW optimizer, learning rate $2 \times 10^{-5}$, batch size 8 per GPU, 10 epochs, cross-entropy loss</li>
<li>GNNs: Adam optimizer, learning rate scheduler (halved after 30 epochs without improvement), early stopping after 50 stagnant epochs, max 1,000 epochs, L1 loss (force-only training)</li>
</ul>
<p><strong>Hardware:</strong></p>
<ul>
<li>NVIDIA Volta V100 GPUs (32 GB), 2 GPUs per node</li>
<li>PyTorch with distributed data parallel (DDP), PyTorch Lightning, LitMatter</li>
</ul>
<p><strong>Code:</strong> <a href="https://github.com/ncfrey/litmatter">LitMatter repository</a></p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation:</strong> Frey, N.C., Soklaski, R., Axelrod, S. et al. Neural scaling of deep chemical models. <em>Nat Mach Intell</em> <strong>5</strong>, 1297-1305 (2023).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{frey2023neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural scaling of deep chemical models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Frey, Nathan C. and Soklaski, Ryan and Axelrod, Simon and Samsi, Siddharth and G{\&#39;o}mez-Bombarelli, Rafael and Coley, Connor W. and Gadepally, Vijay}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1297--1305}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00740-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Language Models Learn Complex Molecular Distributions</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/</guid><description>RNN language models trained on SMILES and SELFIES outperform graph models at learning complex, multi-modal, and large-scale molecular distributions.</description><content:encoded><![CDATA[<h2 id="rnn-language-models-as-flexible-molecular-generators">RNN Language Models as Flexible Molecular Generators</h2>
<p>This is a <strong>discovery paper</strong> that investigates the capacity of simple recurrent neural network (RNN) language models to learn complex molecular distributions. The core finding is that LSTM-based models trained on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> (SM-RNN) or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (SF-RNN) string representations consistently outperform popular graph generative models (JTVAE, CGVAE) across three increasingly challenging generative modeling tasks. The paper positions language models as flexible, scalable alternatives to graph-based approaches for molecular generation.</p>
<h2 id="scaling-beyond-standard-benchmarks">Scaling Beyond Standard Benchmarks</h2>
<p>Most molecular generative models are evaluated on relatively small, drug-like molecules from datasets like <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> or <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>. These standard benchmarks do not test whether models can handle larger, more structurally diverse molecules or distributions with complex shapes (multi-modal, heavy-tailed). This gap matters because there is increasing interest in larger, more complex molecules for therapeutics, including peptides and natural products.</p>
<p>Graph generative models like JTVAE and CGVAE impose structural constraints (tree decompositions, valency restrictions) that help with validity but limit their ability to scale. Language models, by contrast, only need to generate a single character sequence, making them inherently more flexible.</p>
<h2 id="three-challenging-generative-modeling-tasks">Three Challenging Generative Modeling Tasks</h2>
<p>The paper introduces three benchmark tasks designed to stress-test generative models:</p>
<h3 id="task-1-penalized-logp-distribution">Task 1: Penalized LogP Distribution</h3>
<p>A dataset of approximately 160K molecules from ZINC15 with penalized <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a> scores exceeding 4.0. The training distribution is sharply peaked around 4.0 to 4.5 with a subtle tail extending above 6.0. Molecules in the tail tend to have long carbon chains and fewer rings. The challenge is learning this skewed distribution rather than just finding individual high-scoring molecules.</p>
<h3 id="task-2-multi-modal-distribution">Task 2: Multi-Modal Distribution</h3>
<p>A composite dataset of approximately 200K molecules drawn from four sources with distinct molecular weight ranges:</p>
<ul>
<li><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> (MW $\leq$ 185)</li>
<li>ZINC (185 $\leq$ MW $\leq$ 425)</li>
<li>Harvard Clean Energy Project (460 $\leq$ MW $\leq$ 600)</li>
<li>POLYMERS (MW $&gt;$ 600)</li>
</ul>
<p>Models must learn to generate from all four modes simultaneously, each with very different molecular structures.</p>
<h3 id="task-3-large-scale-molecules">Task 3: Large-Scale Molecules</h3>
<p>The largest molecules in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> with more than 100 heavy atoms, yielding approximately 300K molecules with molecular weights ranging from 1,250 to 5,000. These include small biomolecules, photovoltaics, peptides, and cyclic peptides. This task is particularly challenging because the SMILES/SELFIES strings are very long.</p>
<h2 id="evaluation-by-distributional-fidelity">Evaluation by Distributional Fidelity</h2>
<p>The evaluation framework focuses on how well a model learns the full training distribution rather than generating individual good molecules. The primary quantitative metric is the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> (earth mover&rsquo;s distance) between molecular property distributions of generated and training molecules:</p>
<p>$$W(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \int |x - y| \, d\gamma(x, y)$$</p>
<p>Properties evaluated include LogP, synthetic accessibility (SA), quantitative estimate of drug-likeness (QED), molecular weight (MW), Bertz complexity (BCT), and natural product likeness (NP). An oracle baseline is computed by measuring the Wasserstein distance between different random samples of the training data itself.</p>
<p>Standard metrics (validity, uniqueness, novelty) are also reported but are secondary to distributional fidelity.</p>
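<p>For equal-sized 1D samples, the 1-Wasserstein distance reduces to the mean absolute difference of sorted values. A minimal pure-Python sketch of that special case (the paper itself uses SciPy on real property distributions):</p>

```python
def wasserstein_1d(p, q):
    """1-Wasserstein distance between two equal-sized 1D empirical samples."""
    assert len(p) == len(q), "this shortcut requires equal-sized samples"
    # pair sorted values and average their absolute differences
    return sum(abs(a - b) for a, b in zip(sorted(p), sorted(q))) / len(p)

# e.g. generated LogP values uniformly shifted by 0.5 from the training ones
train_logp = [1.0, 2.0, 3.0, 4.0]
gen_logp = [1.5, 2.5, 3.5, 4.5]
print(wasserstein_1d(train_logp, gen_logp))  # → 0.5
```

In practice one computes this per property (LogP, SA, QED, MW, …) and compares against the oracle baseline from resampling the training data.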
<h2 id="architecture-lstm-language-models">Architecture: LSTM Language Models</h2>
<p>The language models use standard LSTM architectures trained autoregressively on molecular strings. Two variants are compared:</p>
<ul>
<li><strong>SM-RNN</strong>: Trained on canonical SMILES</li>
<li><strong>SF-RNN</strong>: Trained on SELFIES representations</li>
</ul>
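<p>Autoregressive generation here just means sampling one token at a time conditioned on the prefix until an end token appears. A toy sketch (assumption: a fixed bigram table with invented <code>^</code>/<code>$</code> start/end tokens stands in for the trained LSTM, which conditions on the full prefix):</p>

```python
import random

def sample_string(next_token_probs, max_len=20, seed=0):
    """Sample a string token-by-token from per-token transition tables."""
    rng = random.Random(seed)
    tokens = ["^"]  # start-of-string token
    while len(tokens) < max_len:
        probs = next_token_probs[tokens[-1]]
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        if tok == "$":  # end-of-string token
            break
        tokens.append(tok)
    return "".join(tokens[1:])

# trivial stand-in "model": carbon chains that terminate stochastically
table = {"^": {"C": 1.0}, "C": {"C": 0.7, "$": 0.3}}
print(sample_string(table))
```

The real models replace the bigram table with LSTM logits over the SMILES or SELFIES vocabulary, but the sampling loop is the same.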
<p>Hyperparameters are tuned via random search over learning rate ($\in [0.0001, 0.001]$), hidden units ($\in [100, 1000]$), layers (1 to 5), and dropout ($\in [0.0, 0.5]$). Model selection uses a combination of standard metrics and Wasserstein distance rankings.</p>
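<p>The random search itself can be sketched in a few lines (assumption: log-uniform sampling for the learning rate and uniform sampling elsewhere; the paper specifies only the ranges, not the sampling scheme):</p>

```python
import random

def sample_config(rng):
    """Draw one hyperparameter configuration from the search ranges above."""
    return {
        "lr": 10 ** rng.uniform(-4, -3),   # learning rate in [1e-4, 1e-3]
        "hidden": rng.randint(100, 1000),  # hidden units per layer
        "layers": rng.randint(1, 5),       # number of LSTM layers
        "dropout": rng.uniform(0.0, 0.5),  # dropout probability
    }

rng = random.Random(42)
for cfg in (sample_config(rng) for _ in range(3)):
    print(cfg)
```

Each sampled configuration would then be trained and ranked by the combination of standard metrics and Wasserstein distances described above.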
<p>The graph model baselines include JTVAE (junction tree VAE) and CGVAE (constrained graph VAE), along with several additional baselines (MolGAN, GraphNVP, and others).</p>
<h2 id="results-language-models-outperform-graph-models-across-all-tasks">Results: Language Models Outperform Graph Models Across All Tasks</h2>
<h3 id="penalized-logp">Penalized LogP</h3>
<p>Both RNN models learn the sharp training distribution far better than graph models. The SM-RNN achieves the lowest Wasserstein distances across most properties. The graph models produce substantial out-of-distribution mass around penalized LogP scores of 1.75 to 2.25, failing to capture the peaked nature of the training distribution.</p>
<p>Critically, the RNNs also learn the subtle tail above penalized LogP of 6.0, generating molecules with long carbon chains and fewer rings that match the structural characteristics of high-scoring training molecules. CGVAE and JTVAE almost entirely miss this tail.</p>
<h3 id="multi-modal-distribution">Multi-Modal Distribution</h3>
<p>Both RNN models capture all four modes of the training distribution. JTVAE entirely misses the GDB13 mode and poorly learns the ZINC and CEP modes. CGVAE learns GDB13 but misses the CEP mode. The SM-RNN again achieves the best Wasserstein metrics.</p>
<h3 id="large-scale-molecules">Large-Scale Molecules</h3>
<p>This is the most discriminating task. Both JTVAE and CGVAE completely fail to train on these large molecules. JTVAE&rsquo;s tree decomposition produces a vocabulary of approximately 11,000 substructures, making training intractable. Only the RNN models succeed, with the SF-RNN achieving slightly better distributional match due to SELFIES guaranteeing 100% validity even for very long strings.</p>
<p>Both RNN models also learn the bimodal LogP structure within the large-molecule distribution and can generate molecules with substructures resembling peptides, including backbone chains and standard amino acid side chains.</p>
<h3 id="summary-of-wasserstein-distance-results">Summary of Wasserstein Distance Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>MW</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>SM-RNN</td>
          <td>0.095</td>
          <td>0.031</td>
          <td>0.007</td>
          <td>3.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>SF-RNN</td>
          <td>0.177</td>
          <td>0.290</td>
          <td>0.010</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>JTVAE</td>
          <td>0.536</td>
          <td>0.289</td>
          <td>0.081</td>
          <td>35.9</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>CGVAE</td>
          <td>1.000</td>
          <td>2.120</td>
          <td>0.115</td>
          <td>69.3</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SM-RNN</td>
          <td>0.081</td>
          <td>0.025</td>
          <td>0.006</td>
          <td>5.5</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SF-RNN</td>
          <td>0.286</td>
          <td>0.179</td>
          <td>0.023</td>
          <td>11.4</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>JTVAE</td>
          <td>0.495</td>
          <td>0.274</td>
          <td>0.034</td>
          <td>27.7</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>CGVAE</td>
          <td>1.617</td>
          <td>1.802</td>
          <td>0.076</td>
          <td>30.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SM-RNN</td>
          <td>1.367</td>
          <td>0.213</td>
          <td>0.003</td>
          <td>124.5</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SF-RNN</td>
          <td>1.095</td>
          <td>0.342</td>
          <td>0.010</td>
          <td>67.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>JTVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>CGVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="smiles-vs-selfies-trade-off">SMILES vs. SELFIES Trade-off</h3>
<p>An interesting finding is that the SMILES and SELFIES RNNs have complementary strengths. The SF-RNN consistently achieves better standard metrics (validity, uniqueness, novelty) across all tasks, while the SM-RNN achieves better Wasserstein distance metrics. The authors suggest that the SELFIES grammar may reduce memorization of the training data, improving novelty but slightly hurting distributional fidelity.</p>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. Language models cannot account for molecular geometry or 3D information, which is important for many applications. The study evaluates distributional fidelity but does not test downstream utility for specific molecular design tasks (e.g., optimizing for a particular biological target). Additionally, while the graph models (JTVAE, CGVAE) are more interpretable, the language models operate as black boxes over string representations. The comparison is also limited to two specific graph model architectures, and more recent or specialized graph models may close the performance gap. Finally, trained model weights are only available upon request rather than being publicly released.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">danielflamshep/genmoltasks</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>Processed training data and generated samples</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Three custom datasets constructed from ZINC15, <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, Harvard Clean Energy Project, POLYMERS, and PubChem. Processed data available at the GitHub repository.</p>
<p><strong>Code</strong>: LSTM networks implemented in PyTorch using the char-rnn code from the <a href="https://github.com/molecularsets/moses">MOSES repository</a>. Baselines use the official <a href="https://github.com/wengong-jin/icml18-jtnn">JTVAE</a> and <a href="https://github.com/microsoft/constrained-graph-variational-autoencoder">CGVAE</a> implementations. No unified training script is provided in the repository.</p>
<p><strong>Evaluation</strong>: Wasserstein distances computed using SciPy. Molecular properties computed using RDKit. 10K molecules generated from each model for evaluation.</p>
<p><strong>Hyperparameters</strong>: Task-specific configurations reported. For example, the LogP task SM-RNN uses 2 hidden layers with 400 units, dropout of 0.2, and learning rate of 0.0001.</p>
<p><strong>Hardware</strong>: Models were trained on Compute Canada systems. Specific GPU types and training times are not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D., Zhu, K., &amp; Aspuru-Guzik, A. (2022). Language models can learn complex molecular distributions. <em>Nature Communications</em>, 13, 3293. <a href="https://doi.org/10.1038/s41467-022-30839-x">https://doi.org/10.1038/s41467-022-30839-x</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/danielflamshep/genmoltasks">GitHub: danielflamshep/genmoltasks</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2022language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can learn complex molecular distributions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Zhu, Kevin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3293}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-022-30839-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>The Nature of LUCA and Its Impact on the Early Earth System</title><link>https://hunterheidenreich.com/notes/biology/evolutionary-biology/nature-of-luca-early-earth-system/</link><pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/biology/evolutionary-biology/nature-of-luca-early-earth-system/</guid><description>Phylogenomic reconstruction of LUCA revealing a complex, 4.2 Ga old acetogen living in an established ecosystem.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Discovery</strong> ($\Psi_{\text{Discovery}}$) paper. While it introduces a refined implementation of molecular clock calibration (&ldquo;cross-bracing&rdquo;), the primary contribution is the biological inference regarding LUCA&rsquo;s age, genome size, and metabolic nature. The computational methods serve to characterize a specific biological entity.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Understanding the Last Universal Common Ancestor (LUCA) is critical for reconstructing the early evolution of life, yet consensus has been elusive due to disparate data and methods.</p>
<ul>
<li><strong>Age Conflicts:</strong> Estimates vary widely depending on fossil interpretation and molecular clock calibrations, particularly regarding the &ldquo;Late Heavy Bombardment&rdquo; (LHB) constraints.</li>
<li><strong>Physiological Uncertainty:</strong> Debates persist over whether LUCA was a simple &ldquo;progenote&rdquo; dependent on geochemistry or a complex prokaryote-grade organism.</li>
<li><strong>Environmental Context:</strong> LUCA is often modeled in isolation, ignoring the ecological interactions that would have shaped its survival and impact on the early Earth system.</li>
</ul>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The study integrates three advanced computational approaches to provide a holistic reconstruction of LUCA:</p>
<ul>
<li><strong>Cross-Braced Dating:</strong> It employs a &ldquo;cross-bracing&rdquo; strategy in Bayesian molecular clocks, using pre-LUCA gene duplications (paralogues) to constrain the root. This allows the same fossil calibrations to be applied to mirrored nodes, significantly reducing uncertainty.</li>
<li><strong>Probabilistic Reconciliation:</strong> It uses the ALE (Amalgamated Likelihood Estimation) algorithm to reconcile 9,365 gene family trees against the species tree. This explicitly models gene transfer, duplication, and loss, allowing for a much broader reconstruction of the proteome.</li>
<li><strong>Ecosystem Modeling:</strong> The physiological reconstruction is coupled with geochemical modeling to propose that LUCA was a member of a productive, hydrogen-recycling early ecosystem.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<ul>
<li><strong>Phylogenomics:</strong> Inferred a species tree from 57 single-copy marker genes across 700 diverse prokaryotic genomes (350 Archaea, 350 Bacteria) using maximum likelihood (IQ-TREE 2).</li>
<li><strong>Molecular Dating:</strong> Estimated divergence times using MCMCtree with a partitioned dataset of 5 pre-LUCA paralogue pairs (e.g., ATP synthase, EF-Tu/G). Calibrations included 13 fossil constraints and a &ldquo;soft&rdquo; maximum bound based on the Moon-forming impact (4.51 Ga).</li>
<li><strong>Metabolic Reconstruction:</strong> Reconciled 9,365 KEGG ortholog families against the species tree to calculate the posterior probability (PP) of each gene&rsquo;s presence in LUCA. Metabolic potential was inferred from genes with high PP (typically &gt;0.75).</li>
<li><strong>Genome Size Prediction:</strong> Trained a LOESS regression model on modern prokaryotes to predict LUCA&rsquo;s genome size based on the inferred number of KEGG families.</li>
</ul>
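<p>The genome-size step can be illustrated with a toy locally weighted regression (a simplified sketch using tricube weights and a local linear fit; the paper&rsquo;s actual smoothing parameters and training data are not reproduced here):</p>

```python
def loess_predict(xs, ys, x0, frac=0.5):
    """Tricube-weighted local linear fit (LOESS-style) evaluated at x0."""
    n = max(2, int(frac * len(xs)))
    # restrict to the n nearest neighbors of the query point
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:n]
    dmax = max(abs(xs[i] - x0) for i in nearest) or 1.0
    w = {i: (1 - (abs(xs[i] - x0) / dmax) ** 3) ** 3 for i in nearest}
    # weighted least squares for the local line y = a + b*x
    sw = sum(w.values())
    sx = sum(w[i] * xs[i] for i in nearest)
    sy = sum(w[i] * ys[i] for i in nearest)
    sxx = sum(w[i] * xs[i] ** 2 for i in nearest)
    sxy = sum(w[i] * xs[i] * ys[i] for i in nearest)
    denom = sw * sxx - sx * sx
    if denom == 0:
        return sy / sw
    b = (sw * sxy - sx * sy) / denom
    a = (sy - b * sx) / sw
    return a + b * x0

# synthetic mapping from KEGG family count to encoded protein count
xs = [500, 1000, 1500, 2000, 2500, 3000]
ys = [600, 1150, 1700, 2300, 2850, 3400]
print(round(loess_predict(xs, ys, 2600)))  # → 2960
```

The published analysis fits the analogous curve on modern prokaryote data and reads off LUCA&rsquo;s predicted genome size at its inferred KEGG family count.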
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Age:</strong> LUCA lived approximately <strong>4.2 Ga</strong> (95% CI: 4.09-4.33 Ga), surprisingly soon after the Moon-forming impact (~4.5 Ga).</li>
<li><strong>Complexity:</strong> LUCA was a complex, prokaryote-grade organism with a genome size of ~2.75 Mb (encoding ~2,600 proteins), comparable to modern prokaryotes.</li>
<li><strong>Physiology:</strong>
<ul>
<li><strong>Metabolism:</strong> Anaerobic acetogen using a complete Wood-Ljungdahl pathway (WLP) for $CO_2$ fixation and an almost complete TCA cycle. Likely thermophilic (reverse gyrase present, PP = 0.97). The paper found <strong>no strong evidence for nitrogenase or nitrogen fixation</strong>.</li>
<li><strong>Immunity:</strong> Possessed 19 Class 1 (Type I and Type III) CRISPR-Cas effector protein families. Cas1 and Cas2 were absent, suggesting an early immune system capable of RNA cleavage and binding but lacking the full CRISPR adaptation machinery.</li>
</ul>
</li>
<li><strong>Ecology:</strong> LUCA likely inhabited one of two major habitats: (1) the deep ocean, where hydrothermal vents and serpentinization provided $H_2$ (supported by reverse gyrase presence, PP = 0.97, consistent with hyperthermophily), or (2) the ocean surface, where atmospheric $H_2$ from volcanism and metamorphism could fuel growth. A shallow hydrothermal vent or hot spring is also considered a possibility. LUCA was part of an established ecosystem whose metabolic by-products would have created niches for other metabolisms, including methanogenesis. If methanogens were also present, the $CH_4$ they produced would have been photochemically recycled to $H_2$ in the atmosphere, boosting biosphere productivity by at least an order of magnitude over abiotic $H_2$ input rates.</li>
<li><strong>Limitation:</strong> The placement of two small-genome lineages (CPR, Candidate Phyla Radiation, and DPANN) remained uncertain. The AU (approximately unbiased) test could not reject either topology (p = 0.517), meaning the data cannot discriminate between the two placements. This phylogenetic uncertainty affects inferences about the early bacterial and archaeal stem lineages.</li>
</ul>
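<p>The gene-presence calls underlying these inferences reduce to a threshold filter over the reconciliation output (toy sketch; the KO identifiers and probabilities below are invented for illustration):</p>

```python
def families_in_luca(pp_by_family, threshold=0.75):
    """Keep gene families whose posterior presence probability exceeds the cutoff."""
    return sorted(k for k, pp in pp_by_family.items() if pp > threshold)

# hypothetical ALE reconciliation output: KO id -> posterior probability in LUCA
pp = {"K00001": 0.92, "K00002": 0.40, "K00003": 0.97, "K00004": 0.10}
print(families_in_luca(pp))  # → ['K00001', 'K00003']
```

The metabolic and physiological claims above (e.g. reverse gyrase at PP = 0.97) come from applying exactly this kind of cutoff to the 9,365 reconciled families.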
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study relied on publicly available genomic data and specific subsets of marker genes.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Phylogeny</strong></td>
          <td>Prokaryotic Genomes</td>
          <td>700 genomes</td>
          <td>350 Archaea, 350 Bacteria selected to maximize diversity</td>
      </tr>
      <tr>
          <td><strong>Dating</strong></td>
          <td>Pre-LUCA Paralogues</td>
          <td>5 gene pairs</td>
          <td>ATP synthase, Elongation Factor Tu/G, SRP/SRPR, Tyr/Trp-tRNA, Leu/Val-tRNA</td>
      </tr>
      <tr>
          <td><strong>Reconciliation</strong></td>
          <td>Gene Families</td>
          <td>9,365 families</td>
          <td>Clustered using KEGG Orthology (KO) identifiers</td>
      </tr>
      <tr>
          <td><strong>Calibration</strong></td>
          <td>Fossil/Isotope Records</td>
          <td>13 constraints</td>
          <td>Includes max bound at 4.51 Ga (Moon formation) and min bound at 2.95 Ga (oxygenic photosynthesis)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Key computational steps involved sequence processing, tree inference, and probabilistic reconciliation.</p>
<ul>
<li><strong>Alignment &amp; Trimming:</strong> sequences aligned with <strong>MAFFT L-INS-i</strong> (v7.407) and trimmed with <strong>BMGE</strong> (v1.12, BLOSUM30 matrix, entropy 0.5).</li>
<li><strong>Tree Inference:</strong> <strong>IQ-TREE 2</strong> (v2.1.2). Species tree: <code>LG+C60+F+G</code> (best-fit by BIC from concatenated 57-marker alignment). Gene family trees for ALE reconciliation (9,365 KEGG families): <code>LG+F+G</code> with 1,000 ultrafast bootstraps.</li>
<li><strong>Reconciliation:</strong> <strong>ALE</strong> (Amalgamated Likelihood Estimation) program <code>ALEml_undated</code> used to calculate gene presence probabilities, accounting for HGT, duplication, and loss.</li>
<li><strong>Genome Prediction:</strong> <strong>LOESS regression</strong> (Locally Estimated Scatterplot Smoothing) used to map KEGG family counts to total protein counts/genome size.</li>
</ul>
<h3 id="models">Models</h3>
<p>The analysis employed sophisticated evolutionary models to handle deep time scales and heterogeneity.</p>
<ul>
<li><strong>Substitution Models:</strong>
<ul>
<li>Species Tree: <code>LG+C60+F+G</code> (mixture model with 60 profiles, best-fit by BIC).</li>
<li>Gene Family Trees (for ALE reconciliation): <code>LG+F+G</code> with 1,000 ultrafast bootstraps.</li>
<li>Timetree inference: <code>LG+F+G4</code> for approximate likelihood calculation (CODEML), as CODEML does not implement the CAT mixture model.</li>
</ul>
</li>
<li><strong>Molecular Clock:</strong>
<ul>
<li><strong>MCMCtree</strong> (PAML v4.10.7).</li>
<li>Relaxed clock models: <strong>GBM</strong> (Geometric Brownian Motion) and <strong>ILN</strong> (Independent Lognormal).</li>
<li><strong>Cross-Bracing:</strong> Specifically models shared divergence times for duplicated nodes (driver and mirror nodes).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Validation focused on robustness across different topologies and clock models.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>LUCA Age (GBM)</strong></td>
          <td>4.18-4.33 Ga</td>
          <td>LHB Hypothesis</td>
          <td>Significantly older than LHB constraints often used</td>
      </tr>
      <tr>
          <td><strong>LUCA Age (ILN)</strong></td>
          <td>4.09-4.32 Ga</td>
          <td>-</td>
          <td>Consistent across clock models</td>
      </tr>
      <tr>
          <td><strong>Genome Size</strong></td>
          <td>2.49-2.99 Mb</td>
          <td>Prior estimates</td>
          <td>Within the range of modern prokaryotes, higher than previous &ldquo;minimal&rdquo; gene set theories</td>
      </tr>
      <tr>
          <td><strong>Topology Test</strong></td>
          <td>p = 0.517</td>
          <td>-</td>
          <td>AU test cannot reject alternative CPR/DPANN topology; placements are statistically indistinguishable</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Software:</strong> PAML v4.10.7 (MCMCtree), IQ-TREE 2, ALE v0.4, HMMER v3.3.2.</li>
<li><strong>Compute:</strong> IQ-TREE runs specified usage of 4 CPUs; MCMCtree used the approximate likelihood calculation (<code>approx</code> method) to reduce computational cost.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Moody, E. R. R., Álvarez-Carretero, S., Mahendrarajah, T. A., et al. (2024). The nature of the last universal common ancestor and its impact on the early Earth system. <em>Nature Ecology &amp; Evolution</em>, 8(9), 1654-1666. <a href="https://doi.org/10.1038/s41559-024-02461-1">https://doi.org/10.1038/s41559-024-02461-1</a></p>
<p><strong>Publication</strong>: Nature Ecology &amp; Evolution, Volume 8, Number 9, 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{moodyTheNatureLast2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The nature of the last universal common ancestor and its impact on the early Earth system}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Moody, Edmund R. R. and Álvarez-Carretero, Sandra and Mahendrarajah, Tara A. and Clark, James W. and Betts, Holly C. and Dombrowski, Nina and Szánthó, Lénárd L. and Boyle, Richard A. and Daines, Stuart and Chen, Xi and Lane, Nick and Yang, Ziheng and Shields, Graham A. and Szöllősi, Gergely J. and Spang, Anja and Pisani, Davide and Williams, Tom A. and Lenton, Timothy M. and Donoghue, Philip C. J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Ecology &amp; Evolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1654--1666}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41559-024-02461-1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Open Access</strong>: This article is published under CC BY 4.0 and is freely available at the paper URL above.</p>
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sabifo4/LUCA-divtimes">LUCA-divtimes (GitHub)</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Molecular clock analysis code and step-by-step tutorials</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.24428659">Figshare Repository</a></td>
          <td>Data</td>
          <td>CC BY 4.0</td>
          <td>Reconciliation and phylogenomic analysis data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5523/bris.405xnm7ei36d2cj65nrirg3ip">Bristol Data Repository</a></td>
          <td>Data</td>
          <td>Unknown</td>
          <td>Additional analysis data</td>
      </tr>
  </tbody>
</table>
]]></content:encoded></item><item><title>MD Simulation of Self-Diffusion on Metal Surfaces (1994)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/self-diffusion-metal-surfaces-1994/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/self-diffusion-metal-surfaces-1994/</guid><description>Molecular dynamics simulation of Iridium surface diffusion confirming atomic exchange mechanisms using EAM and many-body potentials.</description><content:encoded><![CDATA[<h2 id="scientific-typology-computational-discovery">Scientific Typology: Computational Discovery</h2>
<p>This is primarily a <strong>Discovery</strong> ($\Psi_{\text{Discovery}}$) paper, with strong supporting contributions as a <strong>Method</strong> ($\Psi_{\text{Method}}$) evaluation. The primary contribution is the validation and mechanistic visualization of the &ldquo;exchange mechanism&rdquo; for surface diffusion using computational methods (Molecular Dynamics with many-body potentials). This physical phenomenon was previously observed in Field Ion Microscope (FIM) experiments but difficult to characterize dynamically. The paper focuses on determining <em>how</em> atoms move, specifically distinguishing between hopping and exchange mechanisms.</p>
<h2 id="the-field-ion-microscope-fim-observation-gap">The Field Ion Microscope (FIM) Observation Gap</h2>
<p>Surface diffusion is critical for understanding phenomena like crystal growth, epitaxy, and catalysis. Experimental evidence from FIM on fcc(001) surfaces (specifically Pt and Ir) suggested an &ldquo;exchange mechanism&rdquo; where an adatom replaces a substrate atom, challenging the conventional wisdom that adatoms migrate by hopping over potential barriers (bridge sites) between binding sites. The authors sought to:</p>
<ol>
<li>Investigate whether this exchange mechanism could be reproduced dynamically in simulation.</li>
<li>Determine which interatomic potentials (EAM, Sutton-Chen, R-G-L) accurately describe these surface behaviors compared to bulk properties.</li>
</ol>
<h2 id="dynamic-visualization-of-atomic-exchange">Dynamic Visualization of Atomic Exchange</h2>
<p>The study provides a direct dynamic visualization of the &ldquo;concerted motion&rdquo; involved in exchange diffusion events, which occur on timescales too fast for experimental imaging. By comparing three different many-body potentials, the authors demonstrate that the choice of potential is critical for capturing surface phenomena: &ldquo;bulk&rdquo;-derived potentials (like Sutton-Chen) may fail to capture surface exchange events that EAM and R-G-L potentials successfully model.</p>
<h2 id="simulation-protocol--evaluated-potentials">Simulation Protocol &amp; Evaluated Potentials</h2>
<p>The authors performed Molecular Dynamics (MD) simulations on Iridium (Ir) surfaces:</p>
<ul>
<li><strong>Surfaces</strong>: Channeled (110), densely packed (111), and loosely packed (001).</li>
<li><strong>Potentials</strong>: Three many-body models were tested: Embedded Atom Method (EAM), Sutton-Chen (S-C), and Rosato-Guillope-Legrand (R-G-L).</li>
<li><strong>Conditions</strong>: Simulations were primarily run at $T=800$ K to ensure sufficient sampling of diffusion events.</li>
<li><strong>Cross-Validation</strong>: The study extended the analysis to Cu, Rh, and Pt systems to verify the universality of the exchange mechanism against experimental data.</li>
</ul>
<h2 id="confirmation-of-concerted-motion-mechanisms">Confirmation of Concerted Motion Mechanisms</h2>
<ul>
<li><strong>Mechanism Confirmation</strong>: The study confirmed that diffusion on Ir(001) proceeds via an atomic exchange mechanism (concerted motion). The activation energy for exchange ($0.77$ eV) was found to be significantly lower than for hopping over bridge sites ($1.57$ eV).</li>
<li><strong>Surface Structure Dependence</strong>:
<ul>
<li><strong>Ir(111)</strong>: Diffusion is rapid (activation energy $V_a = 0.17$ eV from R-G-L Arrhenius plot) and occurs exclusively via hopping; no exchange events were observed due to the close-packed nature of the surface.</li>
<li><strong>Ir(110)</strong>: Diffusion is anisotropic; atoms hop <em>along</em> channels but use the exchange mechanism to move <em>across</em> channels.</li>
</ul>
</li>
<li><strong>Potential Validity</strong>: The R-G-L and EAM potentials successfully reproduced experimental exchange behaviors, whereas the Sutton-Chen potential failed to predict exchange on Ir(001). The authors attribute the S-C failure primarily to the use of &ldquo;bulk&rdquo; potential parameters to describe interactions at the surface.</li>
<li><strong>Cross-System Comparison</strong>: The study extended the analysis to Cu, Rh, and Pt systems. Both S-C and R-G-L potentials correctly predicted the absence of exchange on all three Rh surfaces and on (111) surfaces of Cu and Pt. Exchange events were correctly predicted on Cu(001), Cu(110), Pt(001), and Pt(110) by both potentials. The sole discrepancy was S-C failing to predict exchange on Ir(001), where R-G-L and EAM succeeded in agreement with experiment.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Integration</strong>: &ldquo;Velocity&rdquo; form of the Verlet algorithm.</li>
<li><strong>Time Step</strong>: $\Delta t = 0.01$ ps ($10^{-14}$ s).</li>
<li><strong>Simulation Protocol</strong>:
<ol>
<li><strong>Quenching</strong>: System relaxed to 0 K by zeroing velocities when $v \cdot F &lt; 0$.</li>
<li><strong>Equilibration</strong>: 5 ps constant-temperature run (renormalizing velocities every step).</li>
<li><strong>Production</strong>: 15 ps constant-energy (microcanonical) run where trajectories are collected.</li>
</ol>
</li>
</ul>
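<p>As a sketch, the three-phase protocol above can be exercised on a toy one-dimensional harmonic oscillator; the spring constant, mass, and temperature below are placeholder reduced units, not the paper's Ir parameters.</p>

```python
import math

# Toy 1D harmonic "lattice" illustrating the quench/equilibrate/produce
# protocol. All parameters are illustrative stand-ins in reduced units.
K, M, DT = 1.0, 1.0, 0.01

def force(x):
    return -K * x

def verlet_step(x, v, f):
    # "Velocity" form of the Verlet algorithm
    v_half = v + 0.5 * DT * f / M
    x = x + DT * v_half
    f = force(x)
    v = v_half + 0.5 * DT * f / M
    return x, v, f

def quench(x, v, steps=2000):
    # Relax toward 0 K: zero the velocity whenever v . F < 0
    f = force(x)
    for _ in range(steps):
        x, v, f = verlet_step(x, v, f)
        if v * f < 0.0:
            v = 0.0
    return x, v

def equilibrate(x, v, t_target, steps=500):
    # Constant-T run: renormalize the velocity every step (k_B = 1, one dof)
    f = force(x)
    for _ in range(steps):
        x, v, f = verlet_step(x, v, f)
        v = math.copysign(math.sqrt(t_target / M), v)
    return x, v

def production(x, v, steps=1500):
    # Constant-energy (microcanonical) run; collect the trajectory
    f = force(x)
    traj = []
    for _ in range(steps):
        x, v, f = verlet_step(x, v, f)
        traj.append(x)
    return traj

x, v = quench(1.0, 0.5)          # step 1: relax to the lattice site
x, v = equilibrate(x, 0.1, 0.5)  # step 2: kinetic energy pinned to T/2
traj = production(x, v)          # step 3: trajectory for later analysis
```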
<h3 id="models">Models</h3>
<p>The study relies on three specific many-body potential formulations:</p>
<ol>
<li><strong>Embedded Atom Method (EAM)</strong>:
<ul>
<li>Total energy:
$$U_{tot} = \sum_i F_i(\rho_i) + \frac{1}{2} \sum_i \sum_{j \neq i} \phi_{ij}(r_{ij})$$</li>
</ul>
</li>
<li><strong>Sutton-Chen (S-C)</strong>:
<ul>
<li>Uses a square-root density dependence in the embedding function and a power-law pair repulsion $\phi(r) \propto (a/r)^{n}$:
$$F(\rho) \propto \rho^{1/2}$$</li>
</ul>
</li>
<li><strong>Rosato-Guillope-Legrand (R-G-L)</strong>:
<ul>
<li>Born-Mayer type repulsion:
$$\phi_{ij}(r) = A \exp[-p(r/r_0 - 1)]$$</li>
<li>Attractive band energy:
$$F_i(\rho_i) = -\left(\sum_{j \neq i} \xi^2 \exp[-2q(r_{ij}/r_0 - 1)]\right)^{1/2}$$</li>
</ul>
</li>
</ol>
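<p>A minimal sketch of the R-G-L energy evaluation follows; the parameters $A$, $\xi$, $p$, $q$, $r_0$ are placeholders, not the paper's fitted metal values.</p>

```python
import math

# R-G-L (tight-binding second-moment) energy: per-atom Born-Mayer repulsion
# minus the square root of a summed band term. Placeholder parameters.
A, XI, P, Q, R0 = 0.1, 1.0, 10.0, 2.5, 1.0

def rgl_energy(positions):
    n = len(positions)
    energy = 0.0
    for i in range(n):
        repulsion = 0.0
        band = 0.0
        for j in range(n):
            if j == i:
                continue
            r = math.dist(positions[i], positions[j])
            repulsion += A * math.exp(-P * (r / R0 - 1.0))
            band += XI ** 2 * math.exp(-2.0 * Q * (r / R0 - 1.0))
        energy += repulsion - math.sqrt(band)
    return energy

# For a dimer at the reference spacing r0 this reduces to 2 * (A - xi)
e_dimer = rgl_energy([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

The dimer value gives a quick correctness check: each atom contributes $A - \xi$ at the reference spacing.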
<h3 id="data">Data</h3>
<ul>
<li><strong>System Size</strong>: 648 classical atoms.</li>
<li><strong>Geometry</strong>:
<ul>
<li>Cubic box with fixed volume.</li>
<li>Periodic boundary conditions in $x$ and $y$ (parallel to surface), free motion in $z$.</li>
<li>Substrate depth: 8, 12, or 9 atomic layers depending on orientation [(001), (110), (111)].</li>
</ul>
</li>
<li><strong>Cutoff Radius</strong>: 14 bohr ($\sim 7.4$ Å).</li>
<li><strong>Initial Conditions</strong>: Velocities initialized from a Maxwellian distribution.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Diffusion Constant ($D$)</strong>: Calculated using the Einstein relation via Mean Square Displacement (MSD):
$$D = \lim_{t \to \infty} \frac{\langle \Delta r^2(t) \rangle}{2td}$$
where $d=2$ for surface diffusion.</li>
<li><strong>Activation Energy ($V_a$)</strong>: Extracted from the slope of Arrhenius plots ($\ln D$ vs $1/T$).</li>
<li><strong>Attempt Frequency ($\nu$)</strong>: Estimated via the harmonic approximation, $\nu = \frac{1}{2\pi}\sqrt{c/M}$, where $c$ is the effective force constant at the adsorption site and $M$ is the atomic mass.</li>
</ul>
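<p>The two extraction steps above can be sketched as follows, on synthetic data rather than the paper's trajectories.</p>

```python
import math

# (1) D from the long-time MSD via the Einstein relation (d = 2 for surface
# diffusion); (2) V_a from the slope of an Arrhenius plot (ln D vs 1/T).
D_DIM = 2
K_B = 8.617e-5  # Boltzmann constant in eV/K

def diffusion_constant(times, msd):
    # Einstein relation evaluated at the longest available time
    return msd[-1] / (2.0 * D_DIM * times[-1])

def activation_energy(temps, d_values):
    # Least-squares fit of ln D = ln D0 - (V_a / k_B)(1/T); V_a = -slope * k_B
    xs = [1.0 / t for t in temps]
    ys = [math.log(d) for d in d_values]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return -slope * K_B

# Synthetic Arrhenius data with V_a = 0.17 eV (the Ir(111) value quoted above)
temps = [600.0, 700.0, 800.0, 900.0]
ds = [1e-3 * math.exp(-0.17 / (K_B * t)) for t in temps]
va = activation_energy(temps, ds)
```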
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shiang, K.-D., Wei, C. M., &amp; Tsong, T. T. (1994). A molecular dynamics study of self-diffusion on metal surfaces. <em>Surface Science</em>, 301(1-3), 136-150. <a href="https://doi.org/10.1016/0039-6028(94)91295-5">https://doi.org/10.1016/0039-6028(94)91295-5</a></p>
<p><strong>Publication</strong>: Surface Science 1994</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shiang1994molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A molecular dynamics study of self-diffusion on metal surfaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shiang, Keh-Dong and Wei, C.M. and Tsong, Tien T.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Surface Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{301}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1-3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{136--150}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1994}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/0039-6028(94)91295-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>In Situ XRD of Oxidation-Reduction Oscillations on Pt/SiO2</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/oxidation-reduction-oscillations-pt-sio2-1994/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/oxidation-reduction-oscillations-pt-sio2-1994/</guid><description>In situ XRD validation of the oxide model driving kinetic rate oscillations in high-pressure CO oxidation on supported platinum.</description><content:encoded><![CDATA[<h2 id="experimental-validation-of-the-oxide-model">Experimental Validation of the Oxide Model</h2>
<p>This is a <strong>Discovery (Translational/Application)</strong> paper.</p>
<p>It is classified as such because the primary contribution is the experimental resolution of a long-standing scientific debate regarding the physical driving force of kinetic oscillations. The authors use established techniques (in situ X-ray diffraction and Debye Function Analysis) to falsify existing hypotheses (reconstruction model, carbon model) and validate a specific physical mechanism (the oxide model).</p>
<h2 id="the-missing-driving-force-in-high-pressure-co-oxidation">The Missing Driving Force in High-Pressure CO Oxidation</h2>
<p>The study addresses the debate surrounding the driving force of kinetic oscillations in CO oxidation on platinum catalysts at high pressures ($p &gt; 10^{-3}$ mbar). While low-pressure oscillations on single crystals were known to be caused by surface reconstruction, the mechanism for high-pressure oscillations on supported catalysts was unresolved. Three main models existed:</p>
<ul>
<li><strong>Reconstruction model</strong>: Structural changes of the substrate</li>
<li><strong>Carbon model</strong>: Periodic deactivation by carbon</li>
<li><strong>Oxide model</strong>: Periodic formation and reduction of surface oxides</li>
</ul>
<p>Prior to this work, there was no conclusive experimental proof demonstrating the periodic oxidation and reduction required by the oxide model.</p>
<h2 id="direct-in-situ-xrd-proof">Direct In Situ XRD Proof</h2>
<p>The core novelty is the <strong>first direct experimental evidence</strong> connecting periodic structural changes in the catalyst to rate oscillations. Using in situ X-ray diffraction (XRD), the authors demonstrated that the intensity of the Pt(111) Bragg peak oscillates in sync with the reaction rate.</p>
<p>By applying Debye Function Analysis (DFA) to the diffraction profiles, they quantitatively showed that the catalyst transitions between a metallic Pt state and a partially oxidized state (containing $\text{PtO}$ and $\text{Pt}_3\text{O}_4$). This definitively ruled out the reconstruction model (which would produce much smaller intensity variations) and confirmed the oxide model.</p>
<h2 id="in-situ-x-ray-diffraction-and-activity-monitoring">In Situ X-ray Diffraction and Activity Monitoring</h2>
<p>The authors performed <strong>in situ X-ray diffraction</strong> experiments on a supported Pt catalyst (EuroPt-1) during the CO oxidation reaction.</p>
<ul>
<li><strong>Reaction Monitoring</strong>: They cycled the temperature and gas flow rates (CO, $\text{O}_2$, He) to induce ignition, extinction, and oscillations.</li>
<li><strong>Activity Metrics</strong>: Catalytic activity was tracked via sample temperature (using thermocouples) and $\text{CO}_2$ production (using a quadrupole mass spectrometer).</li>
<li><strong>Structural Monitoring</strong>: They recorded the intensity of the Pt(111) Bragg peak continuously.</li>
<li><strong>Cluster Analysis</strong>: Detailed angular scans of diffracted intensity were taken at stationary points (active vs. inactive states) and analyzed using Debye functions to determine cluster size and composition.</li>
</ul>
<h2 id="periodic-oxidation-mechanism-and-reversibility">Periodic Oxidation Mechanism and Reversibility</h2>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>Oscillation Mechanism</strong>: Rate oscillations are accompanied by the periodic oxidation and reduction of the Pt catalyst.</li>
<li><strong>Phase Relationship</strong>: The X-ray intensity (oxide amount) oscillates approximately 120° ahead of the temperature (reaction rate), consistent with the oxide model: oxidation deactivates the surface → rate drops → CO reduces the surface → rate rises.</li>
<li><strong>Oxide Composition</strong>: The oxidized state consists of a mixture of metallic clusters, $\text{PtO}$, and $\text{Pt}_3\text{O}_4$. $\text{PtO}_2$ was not found.</li>
<li><strong>Extent of Oxidation</strong>: Approximately 20-30% of the metal atoms are oxidized, corresponding effectively to a shell of oxide on the surface of the nanoclusters.</li>
<li><strong>Reversibility</strong>: The transition between metallic and oxidized states is fully reversible with no sintering observed under the experimental conditions.</li>
<li><strong>Scope Limitation</strong>: The authors note that whether the oxide model also applies to kinetic oscillations on Pt foils or Pt wires remains to be verified, since small Pt clusters likely have a much higher tendency to form oxides than massive Pt metal.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used the <strong>EuroPt-1</strong> standard catalyst.</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Material</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Catalyst</strong></td>
          <td>EuroPt-1 ($\text{Pt/SiO}_2$)</td>
          <td>6.3% Pt loading on silica support</td>
      </tr>
      <tr>
          <td><strong>Particle Size</strong></td>
          <td>Pt Clusters</td>
          <td>Mean diameter ~15.5 Å; dispersion $65 \pm 5\%$</td>
      </tr>
      <tr>
          <td><strong>Sample Prep</strong></td>
          <td>Pellets</td>
          <td>40 mg of catalyst pressed into $15 \times 12 \times 0.3 \text{ mm}^3$ self-supporting pellets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Debye Function Analysis (DFA)</strong></p>
<p>The study used DFA to fit theoretical scattering curves to experimental intensity profiles. This method is suitable for randomly oriented clusters where standard crystallographic methods might fail due to finite size effects.</p>
<p>$$I_{N}(b)=\sum_{m,n=1}^{N}f_{m}f_{n}\frac{\sin(2\pi br_{mn})}{2\pi br_{mn}}$$</p>
<p>Where:</p>
<ul>
<li><strong>$b$</strong>: Scattering vector magnitude, $b=2 \sin \vartheta/\lambda$</li>
<li><strong>$f_m, f_n$</strong>: Atomic scattering amplitudes</li>
<li><strong>$r_{mn}$</strong>: Distance between atom pairs</li>
<li><strong>Shape Assumption</strong>: Cuboctahedral clusters (nearly spherical)</li>
</ul>
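<p>A minimal sketch of the Debye sum for identical atoms ($f_m = f_n = 1$); the geometry below is illustrative, not one of the paper's cuboctahedral Pt clusters.</p>

```python
import math

# Debye scattering sum over all atom pairs, including the diagonal m = n
# terms, for which sin(x)/x -> 1.
def debye_intensity(positions, b):
    n = len(positions)
    total = 0.0
    for m in range(n):
        for k in range(n):
            r = math.dist(positions[m], positions[k])
            if r == 0.0:
                total += 1.0                 # diagonal terms
            else:
                x = 2.0 * math.pi * b * r
                total += math.sin(x) / x
    return total

# As b -> 0 every term tends to 1, so the intensity tends to N^2 (here N = 4)
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
i0 = debye_intensity(square, 1e-9)
```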
<h3 id="models">Models</h3>
<p><strong>1. The Oxide Model (Physical Mechanism)</strong></p>
<p>Proposed by Sales, Turner, and Maple, validated here:</p>
<ol>
<li><strong>Oxidation</strong>: As oxygen coverage increases, the surface forms a catalytically inactive oxide layer ($\text{PtO}_x$).</li>
<li><strong>Deactivation</strong>: The reaction rate drops as the surface deactivates.</li>
<li><strong>Reduction</strong>: CO adsorption leads to the reduction of the oxide layer, restoring the metallic surface.</li>
<li><strong>Reactivation</strong>: The metallic surface is active for CO oxidation, increasing the rate until oxygen coverage builds up again.</li>
</ol>
<p><strong>2. Shell Model (Structural)</strong></p>
<p>The diffraction data was fit using a &ldquo;Shell Model&rdquo; where a metallic Pt core is surrounded by an oxide shell.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Experimental Signatures for Replication</strong>:</p>
<ul>
<li><strong>Ignition Point</strong>: A sharp increase in sample temperature accompanied by a steep 18% decrease in Bragg intensity. After the He flow was switched off, the intensity dropped further to a total decrease of 31.5%.</li>
<li><strong>Oscillation Regime</strong>: Observed at flow rates $\sim 100 \text{ ml/min}$ after cooling the sample to $\sim 375 \text{ K}$. Below $50 \text{ ml/min}$, only bistability is observed. Temperature oscillations had $\sim 50 \text{ K}$ peak-to-peak amplitude.</li>
<li><strong>Magnitude</strong>: Bragg intensity oscillations of ~11% amplitude.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Experimental Setup</strong>:</p>
<ul>
<li><strong>Diffractometer</strong>: Commercial Guinier diffractometer (HUBER) with monochromatized Cu $K_{\alpha1}$ radiation (45° transmission geometry).</li>
<li><strong>Reactor Cell</strong>: Custom 115 $\text{cm}^3$ cell, evacuatable to $10^{-7}$ mbar, equipped with Kapton windows and a Be-cover.</li>
<li><strong>Gases</strong>: CO (4.7 purity), $\text{O}_2$ (4.5 purity), He (4.6 purity) regulated by flow controllers.</li>
<li><strong>Sensors</strong>: Two K-type thermocouples (surface and gas phase) and a differentially pumped Quadrupole Mass Spectrometer (QMS).</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hartmann, N., Imbihl, R., &amp; Vogel, W. (1994). Experimental evidence for an oxidation/reduction mechanism in rate oscillations of catalytic CO oxidation on Pt/SiO2. <em>Catalysis Letters</em>, 28(2-4), 373-381. <a href="https://doi.org/10.1007/BF00806068">https://doi.org/10.1007/BF00806068</a></p>
<p><strong>Publication</strong>: Catalysis Letters 1994</p>
<p><strong>Related Work</strong>: This work complements <a href="/notes/chemistry/molecular-simulation/surface-science/oscillatory-co-oxidation-pt110-1992/">Oscillatory CO Oxidation on Pt(110)</a>, which modeled oscillations via surface reconstruction. Here, the driving force is oxidation/reduction.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{hartmannExperimentalEvidenceOxidation1994,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Experimental Evidence for an Oxidation/Reduction Mechanism in Rate Oscillations of Catalytic {{CO}} Oxidation on {{Pt}}/{{SiO2}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hartmann, N. and Imbihl, R. and Vogel, W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1994</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Catalysis Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{28}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{373--381}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1011-372X, 1572-879X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/BF00806068}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chaotic Evolution of the Solar System (Sussman 1992)</title><link>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/chaotic-solar-system-1992/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/planetary-science/chaotic-solar-system-1992/</guid><description>Numerical study confirming that planetary orbits in the Solar System are chaotic, with a divergence timescale of about 4 million years.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a computational/numerical methods paper that combines algorithm development with scientific discovery. The core contribution is both methodological (demonstrating the effectiveness of symplectic mapping for long-term orbital integration) and empirical (providing the first direct numerical confirmation that the entire Solar System exhibits chaotic dynamics). The work validates theoretical predictions through computational experiments spanning nearly 100 million years of simulated planetary motion, significantly longer than any previous full-system integration.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The authors aimed to address the fundamental question of the long-term stability and predictability of the Solar System. Prior work had limitations:</p>
<ul>
<li><strong>Sussman &amp; Wisdom (1988)</strong>: Found chaos in Pluto&rsquo;s orbit but did not integrate the full system.</li>
<li><strong>Laskar (1989)</strong>: Found evidence for chaos in the whole system (excluding Pluto) but relied on analytically averaged equations, which are perturbative and truncated.</li>
</ul>
<p>A direct integration of the full system without averaging approximations was required to validate these findings and determine whether the observed chaos was a genuine property of the planetary dynamics or an artifact of the approximation methods.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The study represents the first direct, full-system integration spanning nearly 100 million years. Key innovations included:</p>
<ul>
<li><strong>Symplectic Mapping</strong>: Application of the <a href="https://en.wikipedia.org/wiki/Symplectic_integrator">Wisdom-Holman mapping method</a>, which allows for much larger time steps (e.g., 7.2 days) compared to multistep methods (which require ~100 steps/orbit) while maintaining long-term energy conservation and numerical stability.</li>
<li><strong>Custom Hardware</strong>: Use of the <strong>Supercomputer Toolkit</strong>, a multiprocessor system optimized for ODEs, where each processor was 3x faster than the entire previous generation &ldquo;Digital Orrery&rdquo;.</li>
<li><strong>Direct Numerical Validation</strong>: This work directly integrates Newton&rsquo;s equations without analytical approximations, providing independent verification of chaotic behavior, whereas Laskar&rsquo;s approach relied on <a href="https://en.wikipedia.org/wiki/Perturbation_(astronomy)">secular perturbation theory</a>.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<ul>
<li><strong>100 Myr Integration</strong>: Eight separate integrations of the entire planetary system were run for ~100 million years (reversed time) with slightly different initial conditions to measure exponential divergence of trajectories.</li>
<li><strong>Validation</strong>:
<ul>
<li>Compared a 3-million-year segment against the high-precision integration by Quinn, Tremaine, and Duncan (QTD).</li>
<li>Compared results with Laskar&rsquo;s secular resonance angle calculations to verify consistency with the perturbative approach.</li>
</ul>
</li>
<li><strong>Subsystem Analysis</strong>: Additional integrations of just the Jovian planets (outer system) and massless Pluto particles were performed to isolate the source of chaos and determine which subsystems contribute to the overall chaotic behavior.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<p>The key results below are supported by three independent lines of evidence against numerical artifacts (see <a href="#is-the-chaos-real-or-a-numerical-artifact">Is the Chaos Real?</a> below).</p>
<ul>
<li><strong>System-wide Chaos</strong>: The Solar System is chaotic with an exponential divergence timescale (<a href="https://en.wikipedia.org/wiki/Lyapunov_time">Lyapunov time</a>) of approximately <strong>4 million years</strong>, meaning that initial condition uncertainties grow by a factor of $e$ roughly every 4 million years.
<ul>
<li><em>Two-segment divergence</em>: The secular phase space divergence is not uniform. The initial segment is dominated by a slower ~12 Myr timescale, and the latter segment transitions to the ~4 Myr component. The inner planets are more sensitive indicators of the faster 4 Myr process; the outer planets show the 12 Myr component for most of the 100 Myr run with the 4 Myr component only appearing in the final 5 Myr. This suggests at least two distinct mechanisms operating simultaneously.</li>
</ul>
</li>
<li><strong>Jovian Chaos</strong>: The Jovian planets (Jupiter, Saturn, Uranus, Neptune) are independently chaotic. This was a <strong>surprising result</strong>: in the earlier 845 Myr Digital Orrery integration, Neptune&rsquo;s orbital elements showed discrete spectral lines (unlike Pluto&rsquo;s broad-band spectrum), which led the authors to dismiss Jovian chaos at that time. A new 1 billion year integration with a slightly perturbed initial position of Neptune showed exponential divergence with a timescale of only <strong>5 million years</strong>, overturning the earlier conclusion. Subsequent Störmer predictor checks confirmed a timescale of ~19 Myr. Eight additional mapping integrations spanning about 300 Myr each, with step sizes ranging from 0.3 to 1 year, found the Lyapunov timescale varies from 3 to 30 Myr and is not a simple function of initial conditions. One specific run (step size ~0.617979 years) yielded quasiperiodic motion, confirming that nearby initial conditions can give qualitatively different long-term behavior.</li>
<li><strong>Pluto&rsquo;s Robust Chaos</strong>: Pluto&rsquo;s chaotic motion (10-20 Myr timescale) is robust and persists independently of whether the Jovian planets are themselves chaotic or quasiperiodic. In every outer planet integration, massless Pluto pairs diverged with a timescale between 10 and 20 Myr, regardless of how the Jovians were behaving. This robustness distinguishes Pluto&rsquo;s chaotic mechanism as independent.</li>
<li><strong>Secular Resonances</strong>: Laskar identified three resonance angles ($\sigma_1$, $\sigma_2$, $\sigma_3$) that alternately librate (oscillate around a fixed value, like a pendulum) and circulate (rotate through all values, like a spinning top), and proposed resonance overlap as the driving mechanism. The authors find $\sigma_1$ and $\sigma_2$ behave consistently with Laskar, but $\sigma_3$ <strong>only circulates</strong> in their calculation, rather than alternating between libration and circulation as Laskar found. They also identify two additional angles, $\sigma_4 = 3(\omega_4^\circ - \omega_3^\circ) - 2(\Omega_4^\circ - \Omega_3^\circ)$ and $\sigma_5 = (\varpi_1^\circ - \varpi_8^\circ) + (\Omega_1^\circ - \Omega_8^\circ)$, both of which alternate between libration and circulation. The polar plot of $\sigma_1$ is more consistent with a chaotic separatrix; the $\sigma_2$ plot resembles a high-dimensional trajectory projected onto a plane, rather than a signature of a chaotic mechanism. <strong>Conclusion</strong>: resonance overlap is consistent with the data but not unambiguously demonstrated as the sole mechanism.</li>
<li><strong>Two Independent Mechanisms</strong>: The numerical experiments suggest at least two independent sources of chaos: one operating in the Jovian subsystem, and a separate mechanism driving Pluto&rsquo;s chaos independently in the field of the Jovian planets. The simultaneous presence of two different exponential timescales in the full Solar System integration supports this. The authors speculate that Mercury (high eccentricity and inclination) may be involved in the faster mechanism, analogously to how Pluto&rsquo;s high eccentricity and inclination couple eccentricity and inclination secular subsystems.</li>
<li><strong>Predictability Horizon</strong>: The chaotic nature fundamentally limits our ability to predict planetary positions beyond roughly 100 million years into the past or future, regardless of improvements in observational precision or computational power.</li>
</ul>
<h2 id="is-the-chaos-real-or-a-numerical-artifact">Is the Chaos Real or a Numerical Artifact?</h2>
<p>The authors devote substantial effort to ruling out the possibility that the exponential divergence is a numerical artifact rather than a genuine physical property. Three lines of evidence support the conclusion that it is real:</p>
<ol>
<li><strong>Agreement across radically different methods</strong>: The 100 Myr Toolkit integration and Laskar&rsquo;s secular perturbation theory approach use completely different techniques (direct integration vs. analytically averaged equations, different masses, different initial conditions, different physics). Both find consistent chaotic behavior. This cross-method agreement is strong evidence against a method-specific artifact.</li>
<li><strong>Quasiperiodic control integration</strong>: The outer planets were integrated for 250 Myr with Uranus removed. Over this period the remaining Jovian planets showed no sign of exponential divergence. This control demonstrates that long-term numerical integrations of planetary systems do not universally produce spurious chaos; the physical configuration matters.</li>
<li><strong>Isolated quasiperiodic run</strong>: One integration with a particular step size (near 0.617979 years) produced quasiperiodic secular phase space divergence over 300 Myr. This shows the chaos is not a universal artifact of the integration method and that some nearby initial conditions do not lead to chaotic behavior.</li>
</ol>
<p>The authors conclude that the chaotic character is not sensitively dependent on the precise model or numerical methods used, but acknowledge that definitively ruling out subtle numerical artifacts requires identifying an unambiguous physical mechanism, which remains open.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Initial Conditions</strong>: Derived from <strong>JPL Ephemeris DE102</strong>, matching the setup used by Quinn, Tremaine, and Duncan (QTD) for direct comparison.</li>
<li><strong>Masses</strong>: Planetary masses consistent with DE102.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Integrator</strong>: Symplectic n-body mapping (Wisdom &amp; Holman method)</p>
<p>The Hamiltonian is split into Keplerian motion and planetary interactions:</p>
<p>$$H = H_{\text{Kepler}} + H_{\text{Interaction}}$$</p>
<ul>
<li><strong>Structure</strong>: Uses a symplectic mapping (Drift-Kick-Drift) where the system evolves via unperturbed Keplerian orbits (Drift) punctuated by instantaneous velocity changes from planet-planet interactions (Kick). Second order is achieved by a half Kepler step, then alternating full interaction kicks and full Kepler steps, ending with a half Kepler step. This preserves phase space volume, preventing the energy drift common in traditional integrators.</li>
<li><strong>Step Size</strong>: <strong>7.2 days</strong> (chosen somewhat arbitrarily; this step size conveniently aligned with QTD output timestamps for validation).</li>
<li><strong>Precision</strong>: Pseudo-quadruple precision for position accumulation to minimize roundoff errors, though retrospectively deemed unnecessary for this problem.</li>
</ul>
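<p>A generic sketch of this second-order composition on a toy split Hamiltonian (free-motion drift plus a linear-force kick, i.e. a harmonic oscillator), since a self-contained Kepler drift would be lengthy; the splitting structure and the absence of secular energy drift carry over.</p>

```python
# Drift-Kick-Drift composition. In the paper the Drift advances each body
# along its exact Kepler orbit; this toy substitutes free motion (drift) and
# a linear restoring force (kick) to stay self-contained.
def drift(q, p, h):
    return q + h * p, p        # exact flow of the "Kepler" part (toy: free motion)

def kick(q, p, h):
    return q, p - h * q        # exact flow of the interaction part (toy force -q)

def dkd_map(q, p, h, n_steps):
    # half drift, alternating full kicks and full drifts, ending half drift
    q, p = drift(q, p, h / 2.0)
    for _ in range(n_steps - 1):
        q, p = kick(q, p, h)
        q, p = drift(q, p, h)
    q, p = kick(q, p, h)
    q, p = drift(q, p, h / 2.0)
    return q, p

q, p = dkd_map(1.0, 0.0, 0.05, 2000)   # 100 time units of a unit oscillator
energy = 0.5 * (p * p + q * q)         # stays near 0.5: no secular drift
```

Because the map is symplectic, the energy error oscillates at $O(h^2)$ instead of accumulating, which is what permits the large 7.2-day steps.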
<p><strong>Step-size Validation</strong>: A 22 Myr integration with 3.6-day steps (half the primary step size) was performed to verify accuracy of the 7.2-day integration.</p>
<p><strong>Cross-check Integrator</strong>: Traditional linear multistep Störmer predictor (used for independent Jovian planet integrations spanning 100 to 400+ Myr to verify the mapping method&rsquo;s chaotic result).</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>Hamiltonian</strong>: The system is modeled using a split Hamiltonian approach separating two-body Keplerian motion from perturbative interactions.</li>
<li><strong>General Relativity</strong>: Modeled using the potential approximation of <strong>Nobili and Roxburgh</strong> (1986). <strong>Why?</strong> General Relativity adds velocity-dependent terms that make the Keplerian part of the Hamiltonian non-integrable. To use the fast symplectic mapping (which requires an analytically solvable Kepler step), they needed a potential-only approximation that mimics GR effects (precession) without breaking the Hamiltonian splitting.</li>
<li><strong>Earth-Moon</strong>: Treated via a quadrupole approximation similar to QTD, representing the Earth-Moon system as a single body with appropriate mass distribution.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric for chaos was the <strong>Lyapunov Exponent</strong>, estimated via the divergence of nearby trajectories with slightly perturbed initial conditions.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Divergence Timescale (Full System)</td>
          <td>~4 Myr (final segment)</td>
          <td>Initial segment ~12 Myr; inner planets are earlier indicators of 4 Myr component</td>
      </tr>
      <tr>
          <td>Divergence Timescale (Pluto)</td>
          <td>10-20 Myr</td>
          <td>Consistent across methods</td>
      </tr>
      <tr>
          <td>Eccentricity Error (vs QTD)</td>
          <td>$10^{-4}$ to $10^{-6}$</td>
          <td>Excellent agreement over 3 Myr (Table 1)</td>
      </tr>
  </tbody>
</table>
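<p>The divergence-timescale estimate can be sketched on a standard chaotic toy system (the logistic map at $r = 4$, whose exact Lyapunov exponent is $\ln 2$), standing in for the paper's pairs of perturbed integrations.</p>

```python
import math

# Lyapunov-exponent estimate on the logistic map x -> 4x(1-x): accumulate
# the local stretching rate |f'(x)| = |4 - 8x| along one trajectory. This
# is a toy stand-in, not the paper's orbital-divergence calculation.
def lyapunov_exponent(x0, n_steps=100_000):
    x = x0
    total = 0.0
    for _ in range(n_steps):
        total += math.log(abs(4.0 - 8.0 * x))
        x = 4.0 * x * (1.0 - x)
    return total / n_steps

lam = lyapunov_exponent(0.3)
lyapunov_time = 1.0 / lam  # e-folding time in map iterations (cf. ~4 Myr)
```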
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>System</strong>: <strong>Supercomputer Toolkit</strong> (MIT/Hewlett-Packard collaboration)</li>
<li><strong>Configuration</strong>: 8-processor configuration used for parallel integrations</li>
<li><strong>Performance</strong>: ~30 years of Solar System evolution per second per processor</li>
<li><strong>Total Compute</strong>: ~1000 hours of Toolkit time for the main experiments</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sussman, G. J., &amp; Wisdom, J. (1992). Chaotic Evolution of the Solar System. <em>Science</em>, 257(5066), 56-62. <a href="https://doi.org/10.1126/science.257.5066.56">https://doi.org/10.1126/science.257.5066.56</a></p>
<p><strong>Publication</strong>: Science 1992</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sussmanChaoticEvolutionSolar1992,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chaotic {{Evolution}} of the {{Solar System}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sussman, Gerald J. and Wisdom, Jack}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{257}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{5066}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{56--62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1992}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{jul}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1126/science.257.5066.56}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{The evolution of the entire planetary system has been numerically integrated for a time span of nearly 100 million years. This calculation confirms that the evolution of the solar system as a whole is chaotic, with a time scale of exponential divergence of about 4 million years. Additional numerical experiments indicate that the Jovian planet subsystem is chaotic, although some small variations in the model can yield quasiperiodic motion. The motion of Pluto is independently and robustly chaotic.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Symplectic_integrator">Wisdom-Holman Mapping (Wikipedia)</a></li>
<li><a href="https://en.wikipedia.org/wiki/Digital_Orrery">Digital Orrery Project</a></li>
</ul>
]]></content:encoded></item><item><title>Adatom Dimer Diffusion on fcc(111) Crystal Surfaces</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/diffusion-adatom-dimers-1984/</link><pubDate>Sat, 13 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/surface-science/diffusion-adatom-dimers-1984/</guid><description>A 1984 molecular dynamics study identifying simultaneous multiple jumps in adatom dimer diffusion on fcc(111) surfaces.</description><content:encoded><![CDATA[<h2 id="classification-discovery-of-diffusion-mechanisms">Classification: Discovery of Diffusion Mechanisms</h2>
<p><strong>Discovery (Translational Basis)</strong></p>
<p>This paper applies a computational method (Molecular Dynamics) to observe and characterize a physical phenomenon: the specific diffusion mechanisms of adatom dimers on a crystal surface. It focuses on the &ldquo;what was found&rdquo; (simultaneous multiple jumps).</p>
<p>Based on the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI for Physical Sciences Paper Taxonomy</a>, this is best classified as $\Psi_{\text{Discovery}}$ with a minor superposition of $\Psi_{\text{Method}}$ (approximately 80% Discovery, 20% Method). The dominant contribution is the application of computational tools to observe physical phenomena, while secondarily demonstrating MD&rsquo;s capability for surface diffusion problems in an era when the technique was still developing.</p>
<h2 id="bridging-the-intermediate-temperature-data-gap">Bridging the Intermediate Temperature Data Gap</h2>
<p>The study aims to investigate the behavior of adatom dimers in an <strong>intermediate temperature range</strong> ($0.3T_m$ to $0.6T_m$). At the time, Field Ion Microscopy (FIM) provided data at low temperatures ($T \le 0.2T_m$), and previous simulations had studied single adatoms on various surfaces including (111), (110), and (100), but not dimers on (111). The authors sought to compare dimer mobility with single adatom mobility on the (111) surface, where single adatoms move almost like free particles.</p>
<h2 id="observation-of-simultaneous-multiple-jumps">Observation of Simultaneous Multiple Jumps</h2>
<p>The core contribution is the observation of <strong>simultaneous multiple jumps</strong> for dimers on the (111) surface at intermediate temperatures. The study reveals that:</p>
<ol>
<li>Dimers migrate as a whole entity, with both atoms jumping simultaneously</li>
<li>The mobility of dimers (center of mass) is very close to that of single adatoms in this regime.</li>
</ol>
<h2 id="molecular-dynamics-simulation-design">Molecular Dynamics Simulation Design</h2>
<p>The authors performed <strong>Molecular Dynamics (MD) simulations</strong> of a face-centred cubic (fcc) crystallite:</p>
<ul>
<li><strong>System</strong>: A single crystallite of 192 atoms bounded by two free (111) surfaces</li>
<li><strong>Temperature Range</strong>: $T^\ast = 0.22$ to $0.40$ in units of $\epsilon/k_B$ (approximately $0.3T_m$ to $0.6T_m$)</li>
<li><strong>Duration</strong>: Integration over 50,000 time steps</li>
<li><strong>Comparison</strong>: Results were compared against single adatom diffusion data and Einstein&rsquo;s diffusion relation</li>
</ul>
<h2 id="outcomes-on-mobility-and-migration-dynamics">Outcomes on Mobility and Migration Dynamics</h2>
<ul>
<li><strong>Mechanism Transition</strong>: At low temperatures ($T^\ast=0.22$), diffusion occurs via discrete single jumps where adatoms rotate or extend bonds. At higher temperatures, the &ldquo;multiple jump&rdquo; mechanism becomes preponderant.</li>
<li><strong>Migration Style</strong>: The dimer migrates essentially by extending its bond along the $\langle 110 \rangle$ direction.</li>
<li><strong>Mobility</strong>: The diffusion coefficient of dimers is quantitatively similar to single adatoms.</li>
<li><strong>Qualitative Support</strong>: The results support Bonzel&rsquo;s hypothesis of delocalized diffusion involving energy transfer between translation and rotation. The authors attempted to quantify the coupling using the cross-correlation function:</li>
</ul>
<p>$$g(t') = C \langle E_T(t) \, E_R(t + t') \rangle$$</p>
<p>where $C$ is a normalization constant, $E_T$ is the translational energy of the center of mass, and $E_R$ is the rotational energy of the dimer. However, the average lifetime of a dimer (2% to 15% of the total calculation time in the studied temperature range) was too short to allow a statistically significant study of this coupling.</p>
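<p>The lagged average above is straightforward to estimate from sampled energy time series. A minimal sketch; the traces and normalization below are hypothetical stand-ins, not the paper&rsquo;s data:</p>

```python
import random

def cross_correlation(e_t, e_r, lag, c=1.0):
    """Estimate g(t') = C * <E_T(t) E_R(t + t')> by averaging over t."""
    n = len(e_t) - lag
    return c * sum(e_t[i] * e_r[i + lag] for i in range(n)) / n

# Hypothetical noisy energy traces standing in for E_T and E_R
random.seed(0)
e_trans = [random.gauss(1.0, 0.1) for _ in range(1000)]
e_rot = [random.gauss(0.5, 0.1) for _ in range(1000)]
g = [cross_correlation(e_trans, e_rot, lag) for lag in range(10)]
```

For uncorrelated traces, $g$ settles near the product of the means; a genuine translation&ndash;rotation coupling would show structure at short lags, which is exactly what the short dimer lifetimes made hard to resolve.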
<ul>
<li><strong>Dimer Concentration</strong>: The contribution of dimers to mass transport depends on their concentration. As a first approximation, the dimer concentration is expressed as:</li>
</ul>
<p>$$C = C_0 \exp\left[-\frac{2E_f - E_d}{k_B T}\right]$$</p>
<p>where $E_f$ is the formation energy of adatoms and $E_d$ is the binding energy of a dimer. If the binding energy is sufficiently strong, dimer contributions should be accounted for even in the intermediate temperature range ($0.3T_m$ to $0.6T_m$).</p>
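<p>A quick numerical illustration of this Boltzmann form. The energies below are made-up illustrative values in units of $\epsilon$, written with the sign convention under which stronger binding increases the dimer concentration, as the surrounding text assumes:</p>

```python
import math

def dimer_concentration(e_f, e_d, kT, c0=1.0):
    # C = C0 * exp(-(2*E_f - E_d) / (kB*T)): two adatom formation
    # energies are paid, the dimer binding energy is recovered.
    return c0 * math.exp(-(2.0 * e_f - e_d) / kT)

# Hypothetical values for illustration only
weak = dimer_concentration(e_f=0.5, e_d=0.2, kT=0.3)
strong = dimer_concentration(e_f=0.5, e_d=0.8, kT=0.3)
```

Larger $E_d$ gives exponentially more dimers, which is why sufficiently bound dimers can contribute to mass transport even in the intermediate temperature range.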
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-simulation-setup">Data (Simulation Setup)</h3>
<p>Because this is an early computational study, &ldquo;data&rdquo; refers to the initial structural configuration. The simulation begins with an algorithmically generated generic fcc(111) lattice containing two adatoms as the initial state.</p>
<figure class="post-figure center ">
    <img src="/img/notes/chemistry/argon-dimer-diffusion.webp"
         alt="Visualization of argon dimer on fcc(111) surface"
         title="Visualization of argon dimer on fcc(111) surface"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Initial configuration showing an adatom dimer (two adatoms on neighboring sites) on an fcc(111) surface. The crystallite consists of 192 atoms with periodic boundary conditions in the x and y directions.</figcaption>
    
</figure>

<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Particles</strong></td>
          <td>192 atoms</td>
          <td>Single fcc crystallite</td>
      </tr>
      <tr>
          <td><strong>Dimensions</strong></td>
          <td>$4[110] \times 4[112]$</td>
          <td>Thickness of 6 planes</td>
      </tr>
      <tr>
          <td><strong>Boundary</strong></td>
          <td>Periodic (x, y)</td>
          <td>Free surface in z-direction</td>
      </tr>
      <tr>
          <td><strong>Initial State</strong></td>
          <td>Dimer on neighbor sites</td>
          <td>Starts with 2 adatoms</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The simulation relies on standard Molecular Dynamics integration techniques. The original source code is not available, but the study is fully reproducible today with modern open-source tools such as LAMMPS, using the standard <code>lj/cut</code> pair style and NVE/NVT ensembles.</p>
<ul>
<li><strong>Integration Scheme</strong>: Central difference algorithm (Verlet algorithm)</li>
<li><strong>Time Step</strong>: $\Delta t^\ast = 0.01$ (reduced units)</li>
<li><strong>Total Steps</strong>: 50,000 integration steps</li>
<li><strong>Dimer Definition</strong>: Two adatoms are considered a dimer if their distance $r \le r_c = 2\sigma$</li>
</ul>
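<p>The integration scheme and the dimer criterion can be sketched together in reduced units ($\epsilon = \sigma = m = 1$). This is a toy one-dimensional two-body sketch of a velocity-Verlet update, not the paper&rsquo;s 192-atom setup:</p>

```python
DT = 0.01  # reduced time step, as in the paper

def lj_force(r):
    """(12,6) Lennard-Jones force magnitude at separation r (reduced units)."""
    return 24.0 * (2.0 / r**13 - 1.0 / r**7)

def verlet_step(r, v, f):
    """One velocity-Verlet step for the pair separation coordinate."""
    r_new = r + v * DT + 0.5 * f * DT**2
    f_new = lj_force(r_new)
    v_new = v + 0.5 * (f + f_new) * DT
    return r_new, v_new, f_new

def is_dimer(r, sigma=1.0):
    """Two adatoms count as a dimer while r <= r_c = 2*sigma."""
    return r <= 2.0 * sigma

# A pair resting at the LJ minimum (r = 2^(1/6)) stays bound as a dimer
r, v = 2.0 ** (1.0 / 6.0), 0.0
f = lj_force(r)
for _ in range(1000):
    r, v, f = verlet_step(r, v, f)
```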
<h3 id="models-analytic-potential">Models (Analytic Potential)</h3>
<p>The physics are modeled using a classic Lennard-Jones potential.</p>
<p><strong>Potential Form</strong>: (12, 6) Lennard-Jones
$$ V(r) = 4\epsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^6 \right] $$</p>
<p><strong>Parameters (Argon-like)</strong>:</p>
<ul>
<li>$\epsilon/k = 119.5$ K</li>
<li>$\sigma = 3.4478$ Å</li>
<li>$m = 39.948$ amu</li>
<li>Cut-off radius: $2\sigma$</li>
</ul>
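<p>The potential and parameters above can be written down directly. A sketch with the standard sanity checks: $V(\sigma) = 0$, and a minimum of depth $\epsilon$ at $r = 2^{1/6}\sigma \approx 3.87$ Å:</p>

```python
EPS_OVER_K = 119.5     # epsilon / k_B in kelvin (argon-like)
SIGMA = 3.4478         # angstrom

def lj_potential(r, eps=1.0, sigma=SIGMA):
    """(12,6) Lennard-Jones potential; energy in units of epsilon by default."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

r_min = 2.0 ** (1.0 / 6.0) * SIGMA   # location of the minimum, ~3.87 angstrom
```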
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used to quantify the diffusion behavior:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Formula</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Diffusion Coefficient</strong></td>
          <td>$D = \frac{\langle R^2 \rangle}{4t}$</td>
          <td>Calculated from Mean Square Displacement of center of mass</td>
      </tr>
      <tr>
          <td><strong>Trajectory Analysis</strong></td>
          <td>Visual inspection</td>
          <td>Categorized into &ldquo;fast migration&rdquo; (multiple jumps) or &ldquo;discrete jumps&rdquo;</td>
      </tr>
  </tbody>
</table>
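<p>The diffusion estimate is easy to exercise on synthetic data. Below, 2-D Gaussian random walks stand in for the dimer&rsquo;s center-of-mass trajectory (toy data, not the paper&rsquo;s); for these parameters the expected value is $D = \text{step}^2 / (2\,\Delta t) = 0.5$:</p>

```python
import random

def estimate_diffusion(n_walkers=2000, n_steps=250, dt=1.0, step=1.0):
    """Estimate D = <R^2>/(4t) from the ensemble MSD of 2-D random walks."""
    msd = 0.0
    for _ in range(n_walkers):
        x = y = 0.0
        for _ in range(n_steps):
            x += random.gauss(0.0, step)
            y += random.gauss(0.0, step)
        msd += x * x + y * y
    msd /= n_walkers
    return msd / (4.0 * n_steps * dt)

random.seed(2)
D = estimate_diffusion()   # should land near step**2 / (2 * dt) = 0.5
```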
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Specifics</strong>: Unspecified in the original text.</li>
<li><strong>Scale</strong>: 192 particles simulated for 50,000 steps is extremely lightweight by modern standards; a laptop CPU completes the run in seconds, in stark contrast to the mainframe-class resources required in 1984.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ghaleb, D. (1984). Diffusion of adatom dimers on (111) surface of face centred crystals: A molecular dynamics study. <em>Surface Science</em>, 137(2-3), L103-L108. <a href="https://doi.org/10.1016/0039-6028(84)90515-6">https://doi.org/10.1016/0039-6028(84)90515-6</a></p>
<p><strong>Publication</strong>: Surface Science 1984</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ghalebDiffusionAdatomDimers1984,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Diffusion of Adatom Dimers on (111) Surface of Face Centred Crystals: A Molecular Dynamics Study}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ghaleb, Dominique}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{1984}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Surface Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{137}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2-3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{L103-L108}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1016/0039-6028(84)90515-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Invalid SMILES Benefit Chemical Language Models: A Study</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</link><pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</guid><description>Skinnider (2024) shows that generating invalid SMILES actually improves chemical language model performance through quality filtering.</description><content:encoded><![CDATA[<h2 id="core-contribution-repurposing-invalid-smiles">Core Contribution: Repurposing Invalid SMILES</h2>
<p>This is an <strong>Empirical</strong> paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate &ldquo;invalid&rdquo; SMILES strings is beneficial for model performance.</p>
<h2 id="the-problem-with-absolute-validity-in-chemical-lms">The Problem with Absolute Validity in Chemical LMs</h2>
<p>Prior research attempted to eliminate invalid generations using constrained representations like SELFIES. This paper demonstrates that invalid outputs serve as low-likelihood samples whose removal acts as an implicit quality filter, improving distribution learning.</p>
<h2 id="invalid-generation-as-an-implicit-quality-filter">Invalid Generation as an Implicit Quality Filter</h2>
<p>The central insight is counterintuitive: <strong>invalid SMILES generation acts as a built-in quality control mechanism</strong>. The key contributions are:</p>
<ol>
<li>
<p><strong>Empirical Evidence</strong>: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.</p>
</li>
<li>
<p><strong>Mechanistic Explanation</strong>: Invalid SMILES are demonstrated to be low-likelihood samples from the model&rsquo;s probability distribution. When these are filtered out, it&rsquo;s equivalent to removing the model&rsquo;s least confident predictions, a form of automatic quality control.</p>
</li>
<li>
<p><strong>Causal Evidence</strong>: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.</p>
</li>
<li>
<p><strong>Bias Analysis</strong>: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.</p>
</li>
</ol>
<h2 id="experimental-design-and-causal-interventions">Experimental Design and Causal Interventions</h2>
<p>The paper uses a multi-pronged approach to establish both correlation and causation:</p>
<p><strong>Performance Comparisons</strong>: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.</p>
<p><strong>Loss Analysis</strong>: The relationship between SMILES validity and model confidence was examined by analyzing the sequence loss. For a given SMILES string $S$ composed of tokens $t_1, t_2, \ldots, t_N$, the negative log-likelihood acts as a proxy for the model&rsquo;s uncertainty:</p>
<p>$$ \text{NLL}(S) = -\sum_{i=1}^N \log P(t_i \mid t_1, \ldots, t_{i-1}) $$</p>
<p>Invalid SMILES strings consistently register higher $\text{NLL}$ scores, meaning they represent the model&rsquo;s least confident predictions. Filtering them effectively acts as automatic quality control, providing the mechanistic explanation for why invalid filtering improves performance.</p>
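<p>The filtering step is easy to picture: compute each generation&rsquo;s NLL from its per-token probabilities and drop the least-confident samples. A sketch with made-up probabilities; the strings and threshold are illustrative, not the paper&rsquo;s values:</p>

```python
import math

def sequence_nll(token_probs):
    """NLL(S) = -sum_i log P(t_i | t_<i), from per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical generations with per-token model probabilities
samples = {
    "CCO": [0.9, 0.8, 0.85],       # confident, valid-looking
    "C(=O": [0.6, 0.4, 0.3, 0.2],  # low-confidence, invalid-looking
}
nlls = {s: sequence_nll(p) for s, p in samples.items()}

# "Validity filtering" acts as confidence filtering: drop high-NLL samples
threshold = 1.0
kept = [s for s, nll in nlls.items() if nll <= threshold]
```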
<p><strong>Causal Intervention</strong>: A key experiment involved modifying the SELFIES valency constraints at two levels: first allowing pentavalent carbons (&ldquo;Texas SELFIES&rdquo;), then removing all constraints entirely (&ldquo;unconstrained SELFIES&rdquo;). This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.</p>
<p><strong>Structural Bias Analysis</strong>: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model&rsquo;s exploration of chemical space.</p>
<p><strong>Generalization Testing</strong>: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.</p>
<p><strong>Practical Application</strong>: The approach was tested on structure elucidation, using models to identify unknown molecules from minimal experimental data like mass spectrometry.</p>
<h2 id="key-findings-on-validity-constraints-and-bias">Key Findings on Validity Constraints and Bias</h2>
<p><strong>Superior Performance Across the Board</strong>: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. This performance advantage was directly correlated with the proportion of invalid SMILES generated. Models that produced more invalid outputs performed better after filtering.</p>
<p><strong>Invalid SMILES Are Low-Confidence Predictions</strong>: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model&rsquo;s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.</p>
<p><strong>Causal Evidence Through Unconstrained SELFIES</strong>: Direct causal evidence came from modifying SELFIES to allow invalid generation. When &ldquo;unconstrained SELFIES&rdquo; models could generate and discard invalid molecules, their performance improved, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.</p>
<p><strong>Validity Constraints Introduce Systematic Bias</strong>: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model&rsquo;s ability to faithfully represent chemical space.</p>
<p><strong>Reduced Generalization</strong>: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.</p>
<p><strong>Real-World Application Benefits</strong>: In structure elucidation tasks, identifying unknown molecules from experimental data like mass spectrometry, SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.</p>
<p><strong>CASMI 2022 Benchmark</strong>: The language model trained on the LOTUS database was benchmarked against 19 submissions to the CASMI 2022 competition for structure elucidation of unknown compounds. Using only accurate mass as input (no MS/MS data), the model achieved competitive performance, highlighting the practical utility of the sampling-frequency-based approach for de novo structure elucidation.</p>
<p><strong>Computational Efficiency</strong>: Filtering invalid SMILES is computationally trivial. Parsing ten million SMILES strings with RDKit takes approximately 7.5 minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Primary Architecture (LSTM):</strong> The main results rely on a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units.</p>
<ul>
<li><strong>Structure:</strong> Three-layer LSTM with a hidden layer size of 1,024 dimensions</li>
<li><strong>Embedding:</strong> An embedding layer of 128 dimensions</li>
<li><strong>Decoder:</strong> A linear decoder layer outputs token probabilities</li>
</ul>
<p><strong>Secondary Architecture (Transformer/GPT):</strong> To confirm robustness across architectures, the author also used a Generative Pretrained Transformer (GPT) architecture adapted from MolGPT.</p>
<ul>
<li><strong>Structure:</strong> Eight transformer blocks</li>
<li><strong>Internals:</strong> Each block contains eight masked self-attention heads and a feed-forward network (1,024 dimensions) using GELU activation</li>
<li><strong>Embedding:</strong> 256 dimensions, concatenated with learned positional encodings</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Optimizer:</strong> Adam optimizer for both architectures with $\beta_1=0.9$ and $\beta_2=0.999$.</p>
<p><strong>Learning Rate:</strong></p>
<ul>
<li>LSTM: 0.001</li>
<li>Transformer: 0.0005</li>
</ul>
<p><strong>Batch Size:</strong> 64</p>
<p><strong>Loss Function:</strong> Cross-entropy loss on next-token prediction.</p>
<p><strong>Stopping Criteria:</strong> Early stopping using a validation set (10% of training data) with patience of 50,000 minibatches.</p>
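<p>The stopping rule is a standard patience counter over validation measurements. A minimal sketch of the mechanism (counted per validation check here; a tiny patience is used for illustration, not the paper&rsquo;s 50,000-minibatch setting):</p>

```python
class EarlyStopping:
    """Stop once validation loss has not improved for `patience` checks."""

    def __init__(self, patience=50_000):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.9, 0.95, 0.93, 0.91]   # no improvement after 0.9
stops = [stopper.step(l) for l in losses]
```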
<h3 id="data">Data</h3>
<p><strong>Primary Source:</strong> ChEMBL database (version 28).</p>
<p><strong>Preprocessing Pipeline:</strong></p>
<ul>
<li><strong>Cleaning:</strong> Removal of duplicate SMILES, salts, and solvents (retaining heavy fragments with $\geq 3$ heavy atoms)</li>
<li><strong>Filtering:</strong> Molecules with atoms other than {Br, C, Cl, F, H, I, N, O, P, S} were removed</li>
<li><strong>Normalization:</strong> Charged molecules were neutralized and converted to canonical SMILES</li>
</ul>
<p><strong>Training Subsets:</strong> Models were trained on random samples of 30,000, 100,000, and 300,000 molecules to test scalability.</p>
<p><strong>Generalization Data:</strong> To test generalization, models were also trained on the <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> database (enumerating drug-like molecules up to 13 heavy atoms).</p>
<p><strong>Structure Elucidation Data:</strong> For practical application tasks, models were trained on natural products (LOTUS, COCONUT), food compounds (FooDB), and environmental contaminants (NORMAN).</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric:</strong> Fréchet ChemNet Distance (FCD), measuring chemical similarity between generated molecules and the training set (lower is better).</p>
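<p>FCD is the Fréchet (2-Wasserstein) distance between Gaussians fit to ChemNet activations of the two molecule sets. Under a simplifying diagonal-covariance assumption it has a short closed form; the real metric uses full covariance matrices and the pretrained ChemNet, so this is only a structural sketch:</p>

```python
import math

def frechet_distance_diagonal(mu1, sig1, mu2, sig2):
    """Frechet distance between two Gaussians with diagonal covariance:
    d^2 = ||mu1 - mu2||^2 + sum_i (sig1_i - sig2_i)^2."""
    d2 = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    d2 += sum((a - b) ** 2 for a, b in zip(sig1, sig2))
    return math.sqrt(d2)

# Identical distributions give distance 0; lower is better
same = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```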
<p><strong>Secondary Metrics:</strong></p>
<ul>
<li><strong>Validity:</strong> Percentage of outputs parseable by RDKit</li>
<li><strong>Scaffold Similarity:</strong> Jensen-Shannon distances between Murcko scaffold compositions</li>
<li><strong>Physical Properties:</strong> Comparisons of molecular weight, LogP, topological polar surface area (TPSA), and ring counts (aromatic vs. aliphatic)</li>
<li><strong>Structure Elucidation:</strong> &ldquo;Top-k accuracy,&rdquo; the proportion of held-out molecules where the correct structure appeared in the model&rsquo;s top $k$ ranked outputs</li>
</ul>
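<p>Top-$k$ accuracy as described reduces to a membership check over ranked candidate lists. A sketch with hypothetical candidates (ranked, e.g., by sampling frequency):</p>

```python
def top_k_accuracy(rankings, truths, k):
    """Fraction of cases where the true structure is in the top-k ranked list."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)

# Hypothetical ranked candidate lists per held-out molecule
rankings = [["A", "B", "C"], ["X", "Y", "Z"], ["M", "N", "O"]]
truths = ["A", "Z", "Q"]

top1 = top_k_accuracy(rankings, truths, 1)   # only the first case hits
top3 = top_k_accuracy(rankings, truths, 3)   # the second case now hits too
```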
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Nodes:</strong> Dell EMC C4140 GPU compute nodes</li>
<li><strong>GPUs:</strong> NVIDIA Tesla V100</li>
<li><strong>Compute Time:</strong> Parsing 10 million SMILES took ~7.5 minutes on a single CPU; SELFIES models required an average of 0.6 hours longer to train than SMILES models</li>
</ul>
<h3 id="replicability">Replicability</h3>
<p><strong>Code Availability:</strong> Source code and intermediate data are available via <a href="https://doi.org/10.5281/zenodo.10680855">Zenodo</a>. Pre-trained model weights are not provided in the archive, requiring researchers to train models from scratch using the included scripts to fully replicate the study.</p>
<p><strong>Data Availability:</strong> Training datasets and generated molecule samples (10 million from ChEMBL/GDB-13 models, 100 million from LOTUS/COCONUT/FooDB/NORMAN cross-validation folds) are available via <a href="https://doi.org/10.5281/zenodo.8321735">Zenodo</a>.</p>
<p><strong>Software Libraries:</strong></p>
<ul>
<li><strong>PyTorch:</strong> LSTM and Transformer implementations</li>
<li><strong>RDKit:</strong> SMILES parsing, validity checking, and property calculation</li>
<li><strong>SELFIES:</strong> Version 2.1.1 for conversion</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10680855">Source code (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training scripts, analysis code, and intermediate data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.8321735">Training and generated molecules (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Preprocessed training sets and sampled molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="implications-and-takeaways">Implications and Takeaways</h2>
<p>This work reframes how we think about &ldquo;errors&rdquo; in generative models. The key insight is that model outputs appearing incorrect often represent low-likelihood samples whose removal improves overall performance.</p>
<p>The findings suggest that the field&rsquo;s drive toward guaranteed validity leads to systematic biases. Letting models fail informatively and using those failures as quality signals can yield better distribution learning. This is relevant as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.</p>
<p>For practitioners, the takeaway is to consider the role of invalid outputs before eliminating them. Filtering low-confidence generations provides automatic quality control that improves final results.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437-448. <a href="https://doi.org/10.1038/s42256-024-00821-x">https://doi.org/10.1038/s42256-024-00821-x</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence (2024)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{skinnider2024invalid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Invalid SMILES are beneficial rather than detrimental to chemical language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Skinnider, Michael A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{437--448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group UK London}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>