Systematic Benchmarking of SMILES Data Augmentation

This is an Empirical paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the “Maxsmi” models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.

The Data Scarcity Problem in QSAR Modeling

Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique SMILES representation of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.

Five Augmentation Strategies and Test-Time Ensemble Learning

The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:

  1. No augmentation: use only the canonical SMILES (baseline)
  2. Augmentation with duplication: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$
  3. Augmentation without duplication: generate $m$ random SMILES and discard exact duplicates
  4. Augmentation with reduced duplication: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above
  5. Augmentation with estimated maximum: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space

Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:

$$ \hat{y}_i(C) = M_{\Theta}(S_i(C)) $$

The compound-level prediction is an aggregation (mean) over these:

$$ \hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big) $$

The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.

Experimental Design: Three Architectures, Four Datasets

Datasets

DatasetSize (after preprocessing)Train / TestTaskProvenance
ESOL1,128902 / 226Water solubilityMoleculeNet
ESOL_small1,068854 / 214Solubility (max 25 heavy atoms)MoleculeNet
FreeSolv642513 / 129Hydration free energyMoleculeNet
Lipophilicity4,1993,359 / 840Octanol/water distributionChEMBL
Affinity (EGFR)5,8494,679 / 1,170pIC50 against EGFR kinaseKinodata

Architectures

Three shallow neural networks are compared:

  • CONV1D: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers
  • CONV2D: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers
  • RNN: LSTM layer followed by two fully connected layers (128 and 64 units)

All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.

Augmentation sweep

The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.

Key Findings: Augmentation Consistently Improves RMSE

Augmentation always helps

Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.

Best models (Maxsmi)

DatasetModelAugmentation NumberStrategyTest RMSE
ESOLCONV1D70Reduced duplication0.569
FreeSolvCONV1D70With duplication1.032
LipophilicityCONV1D80Without duplication0.593

The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.

No single best augmentation strategy

The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.

Canonical SMILES outperform single random SMILES

When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).

Comparison to prior work

StudyESOLFreeSolvLipophilicityModel
Maxsmi0.5691.0320.593CNN
MoleculeNet0.58 +/- 0.031.15 +/- 0.120.655 +/- 0.036GNN
CNF0.621.110.67CNN
MolPMoFiTN/A1.197 +/- 0.1270.565 +/- 0.037RNN

Maxsmi outperforms or matches MoleculeNet’s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.

Confidence estimation

The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.

EGFR affinity test case

Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and R2 of 0.712, compared to 1.031 RMSE and 0.494 R2 for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 R2) performs comparably, which the authors note without further explanation.

Limitations

  • All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.
  • The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.
  • The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.
  • The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.

Reproducibility Details

Data

PurposeDatasetSizeNotes
Training/EvaluationESOL1,128MoleculeNet, water solubility
Training/EvaluationFreeSolv642MoleculeNet, hydration free energy
Training/EvaluationLipophilicity4,199ChEMBL, logD
Test caseEGFR Affinity5,849Kinodata (ChEMBL v28), pIC50

All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.

Algorithms

  • SMILES generation via RDKit’s random SMILES enumeration
  • One-hot encoding of SMILES characters with padding to max length
  • Five augmentation strategies applied to both training and test sets
  • Mean aggregation for compound-level predictions

Models

ModelArchitectureParameters
CONV1D1D conv (kernel 10, stride 1) + 2 FC layersNot specified
CONV2D2D conv (single channel) + 2 FC layersNot specified
RNNLSTM + FC(128) + FC(64)Not specified
RF BaselineRandom Forest (default sklearn)Morgan FP, radius 2, length 1024

Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.

Evaluation

MetricBest ValueBaselineNotes
RMSE (ESOL)0.5691.102 (RF)CONV1D, 70x reduced dup
RMSE (FreeSolv)1.0322.563 (RF)CONV1D, 70x with dup
RMSE (Lipophilicity)0.5930.860 (RF)CONV1D, 80x without dup
RMSE (EGFR)0.7770.758 (RF)CONV1D, 70x reduced dup

Hardware

Training was performed on a GeForce GTX 1080 Ti, provided by the HPC cluster at Freie Universitat Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours. Training with 19x augmentation achieves RMSE of 0.605 in under 30 minutes.

Artifacts

ArtifactTypeLicenseNotes
volkamerlab/maxsmiCodeMITFull source code, trained models, CLI for prediction
DocumentationDocsN/ARead the Docs documentation
KinodataDatasetN/ACurated kinase bioactivity data from ChEMBL v28

Reproducibility status: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.


Paper Information

Citation: Kimber, T. B., Gagnebin, M., & Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. Artificial Intelligence in the Life Sciences, 1, 100014. https://doi.org/10.1016/j.ailsci.2021.100014

@article{kimber2021maxsmi,
  title={Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning},
  author={Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea},
  journal={Artificial Intelligence in the Life Sciences},
  volume={1},
  pages={100014},
  year={2021},
  publisher={Elsevier},
  doi={10.1016/j.ailsci.2021.100014}
}