Core Contribution: Repurposing Invalid SMILES

This is a Method and Theory paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate “invalid” SMILES strings is actually beneficial for model performance.

The Problem with Absolute Validity in Chemical LMs

Prior research has attempted to eliminate invalid generations entirely by using constrained representations such as SELFIES, which guarantee syntactic validity by construction. This paper shifts the paradigm by demonstrating that invalid outputs provide a useful signal for distribution learning, enabling implicit uncertainty estimation.

Invalid Generation as an Implicit Quality Filter

The central insight is counterintuitive: invalid SMILES generation acts as a built-in quality control mechanism. The key contributions are:

  1. Empirical Evidence: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.

  2. Mechanistic Explanation: Invalid SMILES are demonstrated to be low-likelihood samples from the model’s probability distribution. Filtering them out is therefore equivalent to removing the model’s least confident predictions, a form of automatic quality control.

  3. Causal Evidence: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.

  4. Bias Analysis: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.

Experimental Design and Causal Interventions

The paper uses a multi-pronged approach to establish both correlation and causation:

Performance Comparisons: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.

Loss Analysis: The relationship between SMILES validity and model confidence was examined by analyzing the sequence loss. For a given SMILES string $S$ composed of tokens $t_1, t_2, …, t_N$, the negative log-likelihood acts as a proxy for the model’s uncertainty:

$$ \text{NLL}(S) = -\sum_{i=1}^N \log P(t_i | t_1, …, t_{i-1}) $$

Invalid SMILES strings consistently register higher $\text{NLL}$ scores, meaning they represent the model’s least confident predictions. Filtering them effectively acts as automatic quality control, providing the mechanistic explanation for why invalid filtering improves performance.
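The NLL computation above can be sketched in a few lines of plain Python. The per-token probabilities here are hypothetical stand-ins for the conditional probabilities $P(t_i \mid t_1, \ldots, t_{i-1})$ that a trained model would assign:

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a sequence, given the model's
    conditional probability for each token in turn."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a confident (valid)
# and an unconfident (invalid) generation.
confident = [0.9, 0.8, 0.95, 0.85]
unconfident = [0.4, 0.3, 0.2, 0.5]

nll_valid = sequence_nll(confident)
nll_invalid = sequence_nll(unconfident)

# The higher-NLL sequence is the one the model was less sure about;
# discarding it mimics the effect of the validity filter.
assert nll_invalid > nll_valid
```

A sequence of all-certain tokens (probability 1.0 each) has an NLL of zero; the less confident the model is at each step, the larger the sum grows.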

Causal Intervention: The cleverest experiment involved creating “unconstrained SELFIES” models by relaxing or removing the valency constraints that normally ensure validity. This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.

Structural Bias Analysis: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model’s exploration of chemical space.

Generalization Testing: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.

Practical Application: The approach was tested on structure elucidation, using models to identify unknown molecules from minimal experimental data like mass spectrometry.

Key Findings on Validity Constraints and Bias

Superior Performance Across the Board: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. Remarkably, this performance advantage was directly correlated with the proportion of invalid SMILES generated. Models that produced more invalid outputs performed better after filtering.

Invalid SMILES Are Low-Confidence Predictions: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model’s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.

Causal Evidence Through Unconstrained SELFIES: The most compelling evidence came from modifying SELFIES to allow invalid generation. When “unconstrained SELFIES” models could generate and discard invalid molecules, their performance improved dramatically, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.

Validity Constraints Introduce Systematic Bias: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model’s ability to faithfully represent chemical space.

Reduced Generalization: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.

Real-World Application Benefits: In structure elucidation tasks, identifying unknown molecules from experimental data like mass spectrometry, SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.

Computational Efficiency: Filtering invalid SMILES is computationally trivial. Parsing one million SMILES strings with RDKit takes only minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.
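The filtering step described above can be sketched with RDKit, the same parser the paper uses for validity checking; the example molecules below are illustrative:

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parse warnings for the invalid strings we expect.
RDLogger.DisableLog("rdApp.*")

def filter_valid(smiles_list):
    """Keep only SMILES strings that RDKit can parse into a molecule."""
    return [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]

generated = [
    "c1ccccc1",  # benzene: valid
    "CC(=O)O",   # acetic acid: valid
    "c1ccccc",   # unclosed aromatic ring: invalid
    "C1CC",      # unclosed ring bond: invalid
]
print(filter_valid(generated))  # ['c1ccccc1', 'CC(=O)O']
```

Because `MolFromSmiles` simply returns `None` for unparseable input, the filter is a single list comprehension, which is why the post-processing cost is negligible.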

Reproducibility Details

Models

Primary Architecture (LSTM): The main results rely on a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units.

  • Structure: Three-layer LSTM with a hidden layer size of 1,024 dimensions
  • Embedding: An embedding layer of 128 dimensions
  • Decoder: A linear decoder layer outputs token probabilities
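The LSTM stack described above can be sketched in PyTorch. The vocabulary size is a placeholder, and details such as dropout or weight initialization may differ from the paper’s implementation:

```python
import torch
import torch.nn as nn

class SmilesLSTM(nn.Module):
    """Three-layer LSTM language model with the dimensions listed above."""
    def __init__(self, vocab_size=64):  # vocab size is a placeholder
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 128)
        self.lstm = nn.LSTM(128, 1024, num_layers=3, batch_first=True)
        self.decoder = nn.Linear(1024, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embedding(tokens)        # (batch, seq, 128)
        out, state = self.lstm(x, state)  # (batch, seq, 1024)
        return self.decoder(out), state   # per-token logits

model = SmilesLSTM()
logits, _ = model(torch.zeros(2, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 10, 64])
```

At sampling time, the logits at each step are converted to a distribution over the token vocabulary, and tokens are drawn autoregressively until an end-of-sequence token appears.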

Secondary Architecture (Transformer/GPT): To prove robustness, the author also used a Generative Pretrained Transformer (GPT) architecture adapted from MolGPT.

  • Structure: Eight transformer blocks
  • Internals: Each block contains eight masked self-attention heads and a feed-forward network (1,024 dimensions) using GELU activation
  • Embedding: 256 dimensions, concatenated with learned positional encodings

Algorithms

Optimizer: Adam optimizer for both architectures with $\beta_1=0.9$ and $\beta_2=0.999$.

Learning Rate:

  • LSTM: 0.001
  • Transformer: 0.0005

Batch Size: 64

Loss Function: Cross-entropy loss on next-token prediction.

Stopping Criteria: Early stopping using a validation set (10% of training data) with patience of 50,000 minibatches.
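The patience-based early stopping can be sketched as a small stdlib helper. The minibatch-level granularity mirrors the paper’s criterion, though the exact bookkeeping here is an assumption:

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for
    `patience` consecutive evaluation steps (here: minibatches)."""
    def __init__(self, patience=50_000):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.9, 0.95, 0.94, 0.93]  # no improvement after 0.9
flags = [stopper.step(l) for l in losses]
print(flags)  # [False, False, False, False, True]
```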

Data

Primary Source: ChEMBL database (version 28).

Preprocessing Pipeline:

  • Cleaning: Removal of duplicate SMILES, salts, and solvents (retaining heavy fragments with $\geq 3$ heavy atoms)
  • Filtering: Molecules with atoms other than {Br, C, Cl, F, H, I, N, O, P, S} were removed
  • Normalization: Charged molecules were neutralized and converted to canonical SMILES
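Two of the pipeline stages, element filtering and canonicalization, can be sketched with RDKit; salt stripping and charge neutralization are omitted here for brevity:

```python
from rdkit import Chem

# The element whitelist from the paper's filtering step.
ALLOWED = {"Br", "C", "Cl", "F", "H", "I", "N", "O", "P", "S"}

def preprocess(smiles):
    """Element-filter and canonicalize one SMILES string. Returns the
    canonical SMILES, or None if the molecule should be discarded.
    (Salt stripping and neutralization are not shown.)"""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    if any(a.GetSymbol() not in ALLOWED for a in mol.GetAtoms()):
        return None
    return Chem.MolToSmiles(mol)  # canonical form

print(preprocess("OC(=O)C"))      # acetic acid, canonicalized
print(preprocess("C[Si](C)(C)C")) # None: Si is not in the whitelist
```

Canonicalization matters because it deduplicates molecules that were written as different but equivalent SMILES strings, and it gives the model a single consistent spelling of each training molecule.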

Training Subsets: Models were trained on random samples of 30,000, 100,000, and 300,000 molecules to test scalability.

Generalization Data: To test generalization, models were also trained on the GDB-13 database, which enumerates drug-like molecules of up to 13 heavy atoms.

Structure Elucidation Data: For practical application tasks, models were trained on natural products (LOTUS, COCONUT), food compounds (FooDB), and environmental contaminants (NORMAN).

Evaluation

Primary Metric: Fréchet ChemNet Distance (FCD), measuring chemical similarity between generated molecules and the training set (lower is better).

Secondary Metrics:

  • Validity: Percentage of outputs parseable by RDKit
  • Scaffold Similarity: Jensen-Shannon distances between Murcko scaffold compositions
  • Physical Properties: Comparisons of molecular weight, LogP, topological polar surface area (TPSA), and ring counts (aromatic vs. aliphatic)
  • Structure Elucidation: “Top-k accuracy,” the proportion of held-out molecules where the correct structure appeared in the model’s top $k$ ranked outputs
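The scaffold-similarity metric can be sketched in plain Python. The scaffold strings below are hypothetical; in practice they would be Murcko scaffolds computed with RDKit, counted over the training and generated sets:

```python
import math
from collections import Counter

def jensen_shannon(p_counts, q_counts):
    """Jensen-Shannon distance (base 2, in [0, 1]) between two
    discrete distributions given as count dictionaries."""
    keys = set(p_counts) | set(q_counts)
    p_tot, q_tot = sum(p_counts.values()), sum(q_counts.values())
    div = 0.0
    for k in keys:
        p = p_counts.get(k, 0) / p_tot
        q = q_counts.get(k, 0) / q_tot
        m = (p + q) / 2  # midpoint distribution
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return math.sqrt(div)

# Hypothetical scaffold counts for training vs. generated molecules.
train = Counter({"c1ccccc1": 60, "C1CCCCC1": 40})
generated = Counter({"c1ccccc1": 30, "C1CCCCC1": 70})
print(round(jensen_shannon(train, generated), 3))
```

A distance of 0 means the generated scaffold composition exactly matches the training set; the systematic aromatic-vs.-aliphatic shift in SELFIES outputs shows up as a larger distance on exactly this kind of comparison.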

Hardware

  • Compute Nodes: Dell EMC C4140 GPU compute nodes
  • GPUs: NVIDIA Tesla V100
  • Compute Time: Parsing 1 million SMILES took ~7.5 minutes on a single CPU; SELFIES models required an average of 0.6 hours longer to train than SMILES models

Replicability

Code Availability: Source code and intermediate data are available via Zenodo. Note that pre-trained model weights are not explicitly provided in the archive, requiring researchers to train models from scratch using the included scripts to fully replicate the study.

Software Libraries:

  • PyTorch: LSTM and Transformer implementations
  • RDKit: SMILES parsing, validity checking, and property calculation
  • SELFIES: Version 2.1.1 for conversion

Implications and Takeaways

This work fundamentally challenges how we think about “errors” in generative models. The key insight is that model outputs that appear incorrect often carry useful uncertainty signals, and that overall performance improves when those outputs are handled properly rather than suppressed.

The findings suggest that the field’s drive toward guaranteed validity leads to systematic biases. It proves advantageous to let models fail informatively and use those failures as quality signals. This applies particularly as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.

For practitioners, the message is clear: pause before “fixing” models that generate invalid outputs. The apparent flaw is a feature in disguise, providing automatic quality control that improves final results. Success often stems from learning to recognize and filter mistakes effectively.

Paper Information

Citation: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437-448. https://doi.org/10.1038/s42256-024-00821-x

Publication: Nature Machine Intelligence (2024)

@article{skinnider2024invalid,
  title={Invalid SMILES are beneficial rather than detrimental to chemical language models},
  author={Skinnider, Michael A},
  journal={Nature Machine Intelligence},
  volume={6},
  number={4},
  pages={437--448},
  year={2024},
  publisher={Nature Publishing Group UK London}
}