Core Contribution: Repurposing Invalid SMILES

This is a Method and Theory paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate “invalid” SMILES strings is actually beneficial for model performance.

The Problem with Absolute Validity in Chemical LMs

Prior research has attempted to eliminate invalid generations entirely by using constrained representations such as SELFIES, which guarantee syntactic validity by construction. This paper shifts the paradigm by demonstrating that invalid outputs provide a useful signal for distribution learning, enabling implicit uncertainty estimation.

Invalid Generation as an Implicit Quality Filter

The central insight is counterintuitive: invalid SMILES generation acts as a built-in quality control mechanism. The key contributions are:

  1. Empirical Evidence: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.

  2. Mechanistic Explanation: Invalid SMILES are demonstrated to be low-likelihood samples from the model’s probability distribution. Filtering them out is therefore equivalent to removing the model’s least confident predictions, a form of automatic quality control.

  3. Causal Evidence: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.

  4. Bias Analysis: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.

Experimental Design and Causal Interventions

The paper uses a multi-pronged approach to establish both correlation and causation:

Performance Comparisons: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.

Loss Analysis: The relationship between SMILES validity and model confidence was examined by analyzing the sequence loss. For a given SMILES string $S$ composed of tokens $t_1, t_2, …, t_N$, the negative log-likelihood acts as a proxy for the model’s uncertainty:

$$ \text{NLL}(S) = -\sum_{i=1}^N \log P(t_i | t_1, …, t_{i-1}) $$

Invalid SMILES strings consistently register higher $\text{NLL}$ scores, meaning they represent the model’s least confident predictions. Filtering them effectively acts as automatic quality control, providing the mechanistic explanation for why invalid filtering improves performance.
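The NLL computation above can be sketched in a few lines of plain Python. The per-token probabilities here are hypothetical stand-ins for the conditional probabilities $P(t_i \mid t_1, \ldots, t_{i-1})$ that a trained model would assign:

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a sequence, given the model's
    conditional probability for each token in turn."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a confident (valid)
# and an unconfident (invalid) generation.
confident = [0.9, 0.8, 0.95, 0.85]
unconfident = [0.4, 0.3, 0.2, 0.5]

nll_valid = sequence_nll(confident)
nll_invalid = sequence_nll(unconfident)

# The higher-NLL sequence is the one the model was less sure about;
# discarding it mimics the effect of the validity filter.
assert nll_invalid > nll_valid
```

A sequence of all-certain tokens (probability 1.0 each) has an NLL of zero; the less confident the model is at each step, the larger the sum grows.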

Causal Intervention: The cleverest experiment involved creating “unconstrained SELFIES” models by relaxing or removing the valency constraints that normally ensure validity. This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.

Structural Bias Analysis: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model’s exploration of chemical space.

Generalization Testing: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.

Practical Application: The approach was tested on structure elucidation, using models to identify unknown molecules from minimal experimental data like mass spectrometry.

Key Findings on Validity Constraints and Bias

Superior Performance Across the Board: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. Remarkably, this performance advantage was directly correlated with the proportion of invalid SMILES generated. Models that produced more invalid outputs performed better after filtering.

Invalid SMILES Are Low-Confidence Predictions: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model’s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.

Causal Evidence Through Unconstrained SELFIES: The most compelling evidence came from modifying SELFIES to allow invalid generation. When “unconstrained SELFIES” models could generate and discard invalid molecules, their performance improved dramatically, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.

Validity Constraints Introduce Systematic Bias: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model’s ability to faithfully represent chemical space.

Reduced Generalization: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.

Real-World Application Benefits: In structure elucidation tasks, identifying unknown molecules from experimental data like mass spectrometry, SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.

Computational Efficiency: Filtering invalid SMILES is computationally trivial. Parsing one million SMILES strings with RDKit takes only minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.
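The filtering step described above can be sketched with RDKit, the same parser the paper uses for validity checking; the example molecules below are illustrative:

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parse warnings for the invalid strings we expect.
RDLogger.DisableLog("rdApp.*")

def filter_valid(smiles_list):
    """Keep only SMILES strings that RDKit can parse into a molecule."""
    return [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]

generated = [
    "c1ccccc1",  # benzene: valid
    "CC(=O)O",   # acetic acid: valid
    "c1ccccc",   # unclosed aromatic ring: invalid
    "C1CC",      # unclosed ring bond: invalid
]
print(filter_valid(generated))  # ['c1ccccc1', 'CC(=O)O']
```

Because `MolFromSmiles` simply returns `None` for unparseable input, the filter is a single list comprehension, which is why the post-processing cost is negligible.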

Reproducibility Details

Models

Primary Architecture (LSTM): The main results rely on a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units.

  • Structure: Three-layer LSTM with a hidden layer size of 1,024 dimensions
  • Embedding: An embedding layer of 128 dimensions
  • Decoder: A linear decoder layer outputs token probabilities
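The LSTM stack described above can be sketched in PyTorch. The vocabulary size is a placeholder, and details such as dropout or weight initialization may differ from the paper’s implementation:

```python
import torch
import torch.nn as nn

class SmilesLSTM(nn.Module):
    """Three-layer LSTM language model with the dimensions listed above."""
    def __init__(self, vocab_size=64):  # vocab size is a placeholder
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 128)
        self.lstm = nn.LSTM(128, 1024, num_layers=3, batch_first=True)
        self.decoder = nn.Linear(1024, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embedding(tokens)        # (batch, seq, 128)
        out, state = self.lstm(x, state)  # (batch, seq, 1024)
        return self.decoder(out), state   # per-token logits

model = SmilesLSTM()
logits, _ = model(torch.zeros(2, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 10, 64])
```

At sampling time, the logits at each step are converted to a distribution over the token vocabulary, and tokens are drawn autoregressively until an end-of-sequence token appears.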

Secondary Architecture (Transformer/GPT): To prove robustness, the author also used a Generative Pretrained Transformer (GPT) architecture adapted from MolGPT.

  • Structure: Eight transformer blocks
  • Internals: Each block contains eight masked self-attention heads and a feed-forward network (1,024 dimensions) using GELU activation
  • Embedding: 256 dimensions, concatenated with learned positional encodings

Algorithms

Optimizer: Adam optimizer for both architectures with $\beta_1=0.9$ and $\beta_2=0.999$.

Learning Rate:

  • LSTM: 0.001
  • Transformer: 0.0005

Batch Size: 64

Loss Function: Cross-entropy loss on next-token prediction.

Stopping Criteria: Early stopping using a validation set (10% of training data) with patience of 50,000 minibatches.
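The patience-based early stopping can be sketched as a small stdlib helper. The minibatch-level granularity mirrors the paper’s criterion, though the exact bookkeeping here is an assumption:

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for
    `patience` consecutive evaluation steps (here: minibatches)."""
    def __init__(self, patience=50_000):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.9, 0.95, 0.94, 0.93]  # no improvement after 0.9
flags = [stopper.step(l) for l in losses]
print(flags)  # [False, False, False, False, True]
```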

Data

Primary Source: ChEMBL database (version 28).

Preprocessing Pipeline:

  • Cleaning: Removal of duplicate SMILES, salts, and solvents (retaining heavy fragments with $\geq 3$ heavy atoms)
  • Filtering: Molecules with atoms other than {Br, C, Cl, F, H, I, N, O, P, S} were removed
  • Normalization: Charged molecules were neutralized and converted to canonical SMILES
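Two of the pipeline stages, element filtering and canonicalization, can be sketched with RDKit; salt stripping and charge neutralization are omitted here for brevity:

```python
from rdkit import Chem

# The element whitelist from the paper's filtering step.
ALLOWED = {"Br", "C", "Cl", "F", "H", "I", "N", "O", "P", "S"}

def preprocess(smiles):
    """Element-filter and canonicalize one SMILES string. Returns the
    canonical SMILES, or None if the molecule should be discarded.
    (Salt stripping and neutralization are not shown.)"""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    if any(a.GetSymbol() not in ALLOWED for a in mol.GetAtoms()):
        return None
    return Chem.MolToSmiles(mol)  # canonical form

print(preprocess("OC(=O)C"))      # acetic acid, canonicalized
print(preprocess("C[Si](C)(C)C")) # None: Si is not in the whitelist
```

Canonicalization matters because it deduplicates molecules that were written as different but equivalent SMILES strings, and it gives the model a single consistent spelling of each training molecule.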

Training Subsets: Models were trained on random samples of 30,000, 100,000, and 300,000 molecules to test scalability.

Generalization Data: To test generalization, models were also trained on the GDB-13 database, which enumerates drug-like molecules of up to 13 heavy atoms.

Structure Elucidation Data: For practical application tasks, models were trained on natural products (LOTUS, COCONUT), food compounds (FooDB), and environmental contaminants (NORMAN).

Evaluation

Primary Metric: Fréchet ChemNet Distance (FCD), measuring chemical similarity between generated molecules and the training set (lower is better).

Secondary Metrics:

  • Validity: Percentage of outputs parseable by RDKit
  • Scaffold Similarity: Jensen-Shannon distances between Murcko scaffold compositions
  • Physical Properties: Comparisons of molecular weight, LogP, topological polar surface area (TPSA), and ring counts (aromatic vs. aliphatic)
  • Structure Elucidation: “Top-k accuracy,” the proportion of held-out molecules where the correct structure appeared in the model’s top $k$ ranked outputs
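The scaffold-similarity metric can be sketched in plain Python. The scaffold strings below are hypothetical; in practice they would be Murcko scaffolds computed with RDKit, counted over the training and generated sets:

```python
import math
from collections import Counter

def jensen_shannon(p_counts, q_counts):
    """Jensen-Shannon distance (base 2, in [0, 1]) between two
    discrete distributions given as count dictionaries."""
    keys = set(p_counts) | set(q_counts)
    p_tot, q_tot = sum(p_counts.values()), sum(q_counts.values())
    div = 0.0
    for k in keys:
        p = p_counts.get(k, 0) / p_tot
        q = q_counts.get(k, 0) / q_tot
        m = (p + q) / 2  # midpoint distribution
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return math.sqrt(div)

# Hypothetical scaffold counts for training vs. generated molecules.
train = Counter({"c1ccccc1": 60, "C1CCCCC1": 40})
generated = Counter({"c1ccccc1": 30, "C1CCCCC1": 70})
print(round(jensen_shannon(train, generated), 3))
```

A distance of 0 means the generated scaffold composition exactly matches the training set; the systematic aromatic-vs.-aliphatic shift in SELFIES outputs shows up as a larger distance on exactly this kind of comparison.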

Hardware

  • Compute Nodes: Dell EMC C4140 GPU compute nodes
  • GPUs: NVIDIA Tesla V100
  • Compute Time: Parsing 1 million SMILES took ~7.5 minutes on a single CPU; SELFIES models required an average of 0.6 hours longer to train than SMILES models

Replicability

Code Availability: Source code and intermediate data are available via Zenodo. Note that pre-trained model weights are not explicitly provided in the archive, requiring researchers to train models from scratch using the included scripts to fully replicate the study.

Software Libraries:

  • PyTorch: LSTM and Transformer implementations
  • RDKit: SMILES parsing, validity checking, and property calculation
  • SELFIES: Version 2.1.1 for conversion

Implications and Takeaways

This work fundamentally challenges how we think about “errors” in generative models. The key insight is that model outputs that appear incorrect often carry useful uncertainty signals, and that overall performance improves when those outputs are handled properly rather than suppressed.

The findings suggest that the field’s drive toward guaranteed validity leads to systematic biases. It proves advantageous to let models fail informatively and use those failures as quality signals. This applies particularly as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.

For practitioners, the message is clear: pause before “fixing” models that generate invalid outputs. The apparent flaw is a feature in disguise, providing automatic quality control that improves final results. Success often stems from learning to recognize and filter mistakes effectively.

Paper Information

Citation: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437-448. https://doi.org/10.1038/s42256-024-00821-x

Publication: Nature Machine Intelligence (2024)

@article{skinnider2024invalid,
  title={Invalid SMILES are beneficial rather than detrimental to chemical language models},
  author={Skinnider, Michael A},
  journal={Nature Machine Intelligence},
  volume={6},
  number={4},
  pages={437--448},
  year={2024},
  publisher={Nature Publishing Group UK London}
}