Paper Summary
Citation: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437–448. https://doi.org/10.1038/s42256-024-00821-x
Publication: Nature Machine Intelligence (2024)
What kind of paper is this?
This is a methods and theory paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate “invalid” SMILES strings is actually beneficial for model performance, not a flaw to be eliminated.
What is the motivation?
Here’s a problem that anyone working with chemical language models has encountered: your carefully trained model sometimes spits out SMILES strings that cannot be parsed into actual molecules. The string might violate valence rules, leave a ring bond unclosed, or break the SMILES grammar in some other way. The natural reaction is to see this as a failure: after all, what good is a model that generates nonsense?
This perceived flaw has driven significant research effort toward creating models that only generate valid outputs. The most prominent example is SELFIES (SELF-referencIng Embedded Strings), a molecular representation designed to guarantee that every generated string corresponds to a valid molecule. The logic seems sound: if invalid outputs are bad, eliminate them entirely.
But Skinnider asks a provocative question: what if we’ve got this backwards? What if the ability to generate invalid SMILES is actually a feature, not a bug?
What is the novelty here?
The central insight is counterintuitive: invalid SMILES generation acts as a built-in quality control mechanism. The key contributions are:
Empirical Evidence: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.
Mechanistic Explanation: Invalid SMILES are demonstrated to be low-likelihood samples from the model’s probability distribution. When these are filtered out, it’s equivalent to removing the model’s least confident predictions—a form of automatic quality control.
Causal Evidence: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.
Bias Analysis: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.
What experiments were performed?
The paper uses a multi-pronged approach to establish both correlation and causation:
Performance Comparisons: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.
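For reference, the Fréchet ChemNet distance summarizes each set of molecules by the mean and covariance of its ChemNet activations and measures the distance between the two resulting Gaussians. Below is a minimal numpy sketch of the distance itself, assuming the activation matrices have already been computed with a ChemNet featurizer (not shown here):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(acts_ref, acts_gen):
    """Fréchet distance between two activation matrices (n_molecules x n_features)."""
    mu_r, mu_g = acts_ref.mean(axis=0), acts_gen.mean(axis=0)
    cov_r = np.cov(acts_ref, rowvar=False)
    cov_g = np.cov(acts_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```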
Loss Analysis: The relationship between SMILES validity and model confidence was examined by analyzing the loss (negative log-likelihood) of generated strings. This provided the mechanistic explanation for why invalid filtering improves performance.
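A rough sketch of this kind of analysis, assuming a hypothetical model_nll function that returns the negative log-likelihood the trained model assigns to a generated string (a placeholder, not part of the paper's code):

```python
import numpy as np
from rdkit import Chem

# `samples` is a list of generated SMILES strings; `model_nll` is a hypothetical
# callable giving each string's negative log-likelihood under the trained model.
def nll_by_validity(samples, model_nll):
    valid_nll, invalid_nll = [], []
    for s in samples:
        # RDKit returns None for strings it cannot parse into a molecule
        (valid_nll if Chem.MolFromSmiles(s) is not None else invalid_nll).append(model_nll(s))
    return {"valid_mean_nll": float(np.mean(valid_nll)),
            "invalid_mean_nll": float(np.mean(invalid_nll))}
```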
Causal Intervention: The most clever experiment involved creating “unconstrained SELFIES” models by relaxing or removing the valency constraints that normally ensure validity. This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.
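The selfies library exposes its valence rules through semantic constraints, so a loose approximation of this idea can be sketched by relaxing them; the paper's exact "unconstrained SELFIES" implementation may differ from this illustration:

```python
import selfies as sf

# The library's default semantic constraints cap each atom's bond count
# (e.g. carbon at 4), which is what guarantees valid decoded molecules.
constraints = sf.get_semantic_constraints()

# Hypothetical relaxation for illustration: raise every per-atom limit so the
# decoder no longer truncates or rewrites bonds to enforce standard valences.
sf.set_semantic_constraints({atom: 8 for atom in constraints})

# Decoding still works as usual, but sampled strings can now map to
# hypervalent molecules that RDKit will reject downstream.
print(sf.decoder("[C][=C][C][=C][C][=C][Ring1][=Branch1]"))  # benzene, for reference
```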
Structural Bias Analysis: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model’s exploration of chemical space.
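This sort of comparison is straightforward with RDKit's ring descriptors; a minimal sketch (the paper examines a broader set of structural features):

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

def ring_profile(smiles_list):
    """Average aromatic vs. aliphatic ring counts across a set of generated molecules."""
    aromatic, aliphatic, parsed = 0, 0, 0
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip invalid strings
            continue
        parsed += 1
        aromatic += rdMolDescriptors.CalcNumAromaticRings(mol)
        aliphatic += rdMolDescriptors.CalcNumAliphaticRings(mol)
    return {"aromatic_rings_per_mol": aromatic / parsed,
            "aliphatic_rings_per_mol": aliphatic / parsed}
```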
Generalization Testing: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.
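One simple way to quantify this kind of coverage, sketched here with placeholder variable names, is to canonicalize both the generated molecules and the held-out portion of the database and measure the fraction of held-out structures recovered:

```python
from rdkit import Chem

def canonical(smiles):
    """Canonical SMILES via RDKit, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

# `generated` and `held_out` are placeholders for the model's samples and
# the portion of the database withheld from training.
def heldout_recall(generated, held_out):
    gen = {c for c in map(canonical, generated) if c is not None}
    ref = {c for c in map(canonical, held_out) if c is not None}
    return len(gen & ref) / len(ref)
```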
Practical Application: The approach was tested on structure elucidation—using models to identify unknown molecules from minimal experimental data like mass spectrometry.
What were the key findings?
Superior Performance Across the Board: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. Remarkably, this performance advantage was directly correlated with the proportion of invalid SMILES generated: models that produced more invalid outputs performed better after filtering.
Invalid SMILES Are Low-Confidence Predictions: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model’s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.
Causal Evidence Through Unconstrained SELFIES: The most compelling evidence came from modifying SELFIES to allow invalid generation. When “unconstrained SELFIES” models could generate and discard invalid molecules, their performance improved dramatically, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.
Validity Constraints Introduce Systematic Bias: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model’s ability to faithfully represent chemical space.
Reduced Generalization: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.
Real-World Application Benefits: In structure elucidation tasks—identifying unknown molecules from experimental data like mass spectrometry—SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.
Computational Efficiency: Filtering invalid SMILES is computationally trivial. Parsing one million SMILES strings with RDKit takes only minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.
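For concreteness, the entire post-processing step amounts to one RDKit parse per generated string; a minimal sketch:

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence parse warnings when checking large batches

def filter_valid(smiles_list):
    """Discard any generated string RDKit cannot parse into a molecule."""
    return [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]

# The second string has a five-bond carbon, so only benzene survives the filter.
print(filter_valid(["c1ccccc1", "C(C)(C)(C)(C)C"]))
```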
Implications and Takeaways
This work fundamentally challenges how we think about “errors” in generative models. The key insight is that not all model outputs that appear incorrect are actually mistakes—some represent useful uncertainty signals that can improve overall performance when properly handled.
The findings suggest that the field’s drive toward guaranteed validity may be misguided. Rather than constraining models to never fail, it might be better to let them fail informatively and use those failures as quality signals. This is particularly relevant as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.
For practitioners, the message is clear: don’t be too quick to “fix” models that generate invalid outputs. That apparent flaw might be a feature in disguise, providing automatic quality control that improves your final results. Sometimes the best path forward isn’t avoiding mistakes—it’s learning to recognize and filter them effectively.