Paper Information

Citation: Rajan, K., Brinkhaus, H.O., Zielesny, A., & Steinbeck, C. (2024). Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. Journal of Cheminformatics, 16(1), 78. https://doi.org/10.1186/s13321-024-00872-7

Publication: Journal of Cheminformatics 2024

Method Contribution: Architectural Optimization

This is a Method paper. It proposes an enhanced neural network architecture (EfficientNetV2 + Transformer) specifically designed to solve the problem of recognizing hand-drawn chemical structures. The primary contribution is architectural optimization and a data-driven training strategy, validated through ablation studies (comparing encoders) and benchmarked against existing rule-based and deep learning tools.

Motivation: Digitizing “Dark” Chemical Data

Chemical information in legacy laboratory notebooks and modern tablet-based inputs often exists as hand-drawn sketches.

  • Gap: Existing Optical Chemical Structure Recognition (OCSR) tools (particularly rule-based ones) lack robustness and fail when images have variability in style, line thickness, or noise.
  • Need: There is a critical need for automated tools to digitize this “dark data” effectively to preserve it and make it machine-readable and searchable.

Core Innovation: Decoder-Only Design and Synthetic Scaling

The core novelty is the architectural enhancement and synthetic training strategy:

  1. Decoder-Only Transformer: Using only the decoder part of the Transformer (instead of a full encoder-decoder Transformer) improved average accuracy across OCSR benchmarks from 61.28% to 69.27% (Table 3 in the paper).
  2. EfficientNetV2 Integration: Replacing the EfficientNetV1-B7 encoder with EfficientNetV2-M provided better feature extraction and roughly 2x faster training.
  3. Scale of Synthetic Data: The authors demonstrate that scaling synthetic training data (up to 152 million images generated by RanDepict) directly correlates with improved generalization to real-world hand-drawn images, without ever training on real hand-drawn data.

Experimental Setup: Ablation and Real-World Baselines

  • Model Selection (Ablation): Tested three architectures (EfficientNetV2-M + Full Transformer, EfficientNetV1-B7 + Decoder-only, EfficientNetV2-M + Decoder-only) on standard benchmarks (JPO, CLEF, USPTO, UOB).
  • Data Scaling: Trained the best model on four progressively larger datasets (from 4M to 152M images) to measure performance gains.
  • Real-World Benchmarking: Validated the final model on the DECIMER Hand-drawn dataset (5,088 real images drawn by volunteers) and compared against 9 other tools (OSRA, MolVec, Img2Mol, MolScribe, etc.).

Results and Conclusions: Strong Accuracy on Hand-Drawn Scans

  • Strong Performance: The final DECIMER model achieved 99.72% valid predictions and 73.25% exact accuracy on the hand-drawn benchmark. The next best non-DECIMER tool was MolGrapher at 10.81% accuracy, followed by MolScribe at 7.65%.
  • Robustness: Every deep learning tool with a reported accuracy outperformed the rule-based tools (Imago, MolVec, OSRA), each of which scored below 3% exact accuracy on hand-drawn data.
  • Data Saturation: Quadrupling the dataset from 38M to 152M images raised exact accuracy by only 2.91 percentage points (from 70.34% to 73.25%), suggesting current synthetic data strategies may be hitting a plateau.

Reproducibility

Artifacts

| Artifact | Type | License | Notes |
|---|---|---|---|
| DECIMER Image Transformer (GitHub) | Code | MIT | Official TensorFlow implementation |
| Model Weights (Zenodo) | Model | Unknown | Pre-trained hand-drawn model weights |
| DECIMER PyPi Package | Code | MIT | Installable Python package |
| RanDepict (GitHub) | Code | MIT | Synthetic hand-drawn image generation toolkit |

Data

The model was trained entirely on synthetic data generated using the RanDepict toolkit. No real hand-drawn images were used for training.

| Dataset | Source | Molecules | Total Images | Notes |
|---|---|---|---|---|
| 1 | ChEMBL | 2,187,669 | 4,375,338 | 1 augmented + 1 clean per molecule |
| 2 | ChEMBL | 2,187,669 | 13,126,014 | 2 augmented + 4 clean per molecule |
| 3 | PubChem | 9,510,000 | 38,040,000 | 1 augmented + 3 clean per molecule |
| 4 | PubChem | 38,040,000 | 152,160,000 | 1 augmented + 3 clean per molecule |
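
The image totals in the table above follow from multiplying each molecule count by the number of depictions per molecule. The short sketch below checks that arithmetic. The names `DATASETS` and `total_images` are illustrative only and do not come from the DECIMER or RanDepict code.

```python
# Dataset composition as listed in the table above:
# (source, molecules, augmented images per molecule, clean images per molecule)
DATASETS = {
    1: ("ChEMBL", 2_187_669, 1, 1),
    2: ("ChEMBL", 2_187_669, 2, 4),
    3: ("PubChem", 9_510_000, 1, 3),
    4: ("PubChem", 38_040_000, 1, 3),
}


def total_images(molecules: int, augmented: int, clean: int) -> int:
    """Number of images when every molecule is depicted once per variant."""
    return molecules * (augmented + clean)


if __name__ == "__main__":
    for idx, (source, mols, aug, clean) in DATASETS.items():
        print(f"Dataset {idx} ({source}): {total_images(mols, aug, clean):,} images")
```

Datasets 1 and 2 share the same molecules, so the step between them changes only the number of depictions per molecule, while datasets 3 and 4 also increase the number of distinct molecules.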

A separate model selection experiment used a 1,024,000-molecule subset of ChEMBL to compare the three architectures (Table 1 in the paper). The DECIMER Hand-Drawn evaluation dataset consists of 5,088 real hand-drawn images from 23 volunteers.

Preprocessing:

  • SMILES strings restricted to fewer than 300 characters.
  • Images resized to $512 \times 512$.
  • Images generated with and without “hand-drawn style” augmentations.

Algorithms

  • Tokenization: SMILES split by heavy atoms, brackets, bond symbols, and special characters. Each sequence is wrapped in <start> and <end> tokens and padded with <pad>.
  • Optimization: Adam optimizer with a custom learning rate schedule (as specified in the original Transformer paper). A dropout rate of 0.1 was used.
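  • Loss Function: Trained using focal loss to address class imbalance for rare tokens. The focal loss formulation reduces the relative loss for well-classified examples: $$ \text{FL}(p_{\text{t}}) = -\alpha_{\text{t}} (1 - p_{\text{t}})^\gamma \log(p_{\text{t}}) $$ Here $p_{\text{t}}$ is the predicted probability of the true token, $\alpha_{\text{t}}$ is a class weighting factor, and $\gamma \geq 0$ is the focusing parameter; with $\gamma = 0$ and $\alpha_{\text{t}} = 1$ the expression reduces to standard cross-entropy.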
  • Augmentations: RanDepict applied synthetic distortions to mimic handwriting (wobbly lines, variable thickness, etc.).
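
The tokenization and loss bullets above can be illustrated with a minimal sketch. The regular expression is a hypothetical stand-in for the tokenizer DECIMER actually uses, the helper names are illustrative, and the defaults `alpha_t=0.25` and `gamma=2.0` are the commonly cited values from the original focal loss paper, not values reported in this summary.

```python
import math
import re

# Hypothetical token pattern (an assumption, not DECIMER's vocabulary):
# two-letter halogens first, then single letters, ring-closure digits,
# brackets, bond symbols and other special characters.
TOKEN_PATTERN = re.compile(r"Cl|Br|[A-Za-z]|[0-9]|[\[\]()=#\-+\\/@.%:*~$]")


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens and check that nothing was dropped."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def add_special_tokens(tokens: list[str], max_len: int) -> list[str]:
    """Wrap tokens in <start>/<end> and right-pad with <pad> up to max_len."""
    sequence = ["<start>", *tokens, "<end>"]
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    return sequence + ["<pad>"] * (max_len - len(sequence))


def focal_loss(p_t: float, alpha_t: float = 0.25, gamma: float = 2.0) -> float:
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) for one token."""
    if not 0.0 < p_t <= 1.0:
        raise ValueError("p_t must lie in (0, 1]")
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print(add_special_tokens(tokenize("CC(=O)Cl"), max_len=12))
    # A confident prediction contributes far less loss than an uncertain one.
    print(focal_loss(0.9), focal_loss(0.1))
```

The `(1 - p_t)**gamma` factor is what down-weights tokens the model already predicts confidently, so frequent tokens such as `C` dominate the loss less than they would under plain cross-entropy.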

Models

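The final architecture (Model 3) pairs a CNN image encoder with a Transformer decoder; "decoder-only" refers to omitting the Transformer encoder stack, not the image encoder: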

  • Encoder: EfficientNetV2-M (pretrained ImageNet backbone).
    • Input: $512 \times 512 \times 3$ image.
    • Output Features: $16 \times 16 \times 512$ (reshaped to sequence length 256, dimension 512).
    • Note: The final fully connected layer of the CNN is removed.
  • Decoder: Transformer (Decoder-only).
    • Layers: 6
    • Attention Heads: 8
    • Embedding Dimension: 512
  • Output: Predicted SMILES string token by token.
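
The shape bookkeeping in the list above can be made explicit with a small sketch. It derives the decoder's input sequence length from the feature grid and, assuming standard multi-head attention that splits the embedding evenly across heads, the per-head dimension. `ModelShapes` and `flatten_grid` are illustrative names, not part of the DECIMER code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShapes:
    """Dimensions quoted in the list above for the final model."""
    image_size: int = 512      # input is image_size x image_size x 3
    feature_grid: int = 16     # encoder output is feature_grid x feature_grid
    feature_dim: int = 512     # channels per grid cell
    num_layers: int = 6
    num_heads: int = 8
    embedding_dim: int = 512

    @property
    def downsampling_factor(self) -> int:
        """Input pixels per feature-grid cell along one axis."""
        return self.image_size // self.feature_grid

    @property
    def sequence_length(self) -> int:
        """Number of image 'tokens' after flattening the grid."""
        return self.feature_grid * self.feature_grid

    @property
    def head_dim(self) -> int:
        """Per-head width, assuming an even split across attention heads."""
        if self.embedding_dim % self.num_heads != 0:
            raise ValueError("embedding_dim must be divisible by num_heads")
        return self.embedding_dim // self.num_heads


def flatten_grid(grid: list) -> list:
    """Flatten an H x W x C nested list into (H*W) x C in row-major order."""
    return [cell for row in grid for cell in row]


if __name__ == "__main__":
    shapes = ModelShapes()
    print(shapes.sequence_length, shapes.feature_dim, shapes.head_dim)
```

Because the feature dimension already equals the decoder's embedding dimension, each of the 256 grid cells can serve directly as one position for the decoder to attend over.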

Evaluation

Metrics used for evaluation:

  1. Valid Predictions (%): Percentage of outputs that are syntactically valid SMILES.
  2. Exact Match Accuracy (%): Percentage of predictions whose canonical SMILES string is identical to that of the ground truth.
  3. Tanimoto Similarity: Fingerprint similarity (PubChem fingerprints) between ground truth and prediction.
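
A minimal sketch of the three metrics follows. It assumes that predictions have already been canonicalized by a cheminformatics toolkit, with `None` marking outputs that failed to parse, and that fingerprints are represented as sets of on-bit indices. It also assumes that exact accuracy is computed over all images, including invalid predictions. The function names are illustrative and fingerprint generation itself is not shown.

```python
from typing import Optional, Sequence, Set


def valid_percentage(predictions: Sequence[Optional[str]]) -> float:
    """Share of predictions that parsed as valid SMILES (None = invalid)."""
    return 100.0 * sum(p is not None for p in predictions) / len(predictions)


def exact_match_percentage(
    predictions: Sequence[Optional[str]], truths: Sequence[str]
) -> float:
    """Share of predictions whose canonical SMILES equals the ground truth."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    matches = sum(p is not None and p == t for p, t in zip(predictions, truths))
    return 100.0 * matches / len(truths)


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| on sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(bits_a & bits_b) / len(bits_a | bits_b)


if __name__ == "__main__":
    preds = ["CCO", None, "c1ccccc1", "CC"]
    truths = ["CCO", "CC", "c1ccccc1", "CCC"]
    print(valid_percentage(preds), exact_match_percentage(preds, truths))
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

Tanimoto similarity gives partial credit to predictions that are structurally close but not identical, which is why a tool can post a moderate similarity score alongside a very low exact accuracy.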

Data Scaling Results (Hand-Drawn Dataset, Table 4 in the paper):

| Dataset | Training Images | Valid Predictions | Exact Accuracy | Tanimoto |
|---|---|---|---|---|
| 1 (ChEMBL) | 4,375,338 | 96.21% | 5.09% | 0.490 |
| 2 (ChEMBL) | 13,126,014 | 97.41% | 26.08% | 0.690 |
| 3 (PubChem) | 38,040,000 | 99.67% | 70.34% | 0.939 |
| 4 (PubChem) | 152,160,000 | 99.72% | 73.25% | 0.942 |
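
To make the diminishing returns concrete, the sketch below computes, for each step in the table above, how much the training set grew and how many percentage points of exact accuracy were gained. The numbers are copied from the table; `accuracy_gains` is an illustrative helper, not part of the paper's code.

```python
# (training images, valid predictions %, exact accuracy %, Tanimoto)
SCALING_RESULTS = [
    (4_375_338, 96.21, 5.09, 0.490),
    (13_126_014, 97.41, 26.08, 0.690),
    (38_040_000, 99.67, 70.34, 0.939),
    (152_160_000, 99.72, 73.25, 0.942),
]


def accuracy_gains(results):
    """Return (data growth factor, accuracy gain in points) per step."""
    gains = []
    for previous, current in zip(results, results[1:]):
        growth = current[0] / previous[0]
        gain = round(current[2] - previous[2], 2)
        gains.append((growth, gain))
    return gains


if __name__ == "__main__":
    for step, (growth, gain) in enumerate(accuracy_gains(SCALING_RESULTS), start=1):
        print(f"Dataset {step} -> {step + 1}: x{growth:.2f} images, "
              f"+{gain:.2f} points exact accuracy")
```

The growth factors alone do not separate the effect of more images from the effect of more distinct molecules, since the move from dataset 2 to 3 also switches the source from ChEMBL to PubChem.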

Comparison with Other Tools (Hand-Drawn Dataset, Table 5 in the paper):

| OCSR Tool | Method | Valid Predictions | Exact Accuracy | Tanimoto |
|---|---|---|---|---|
| DECIMER (this work) | Deep Learning | 99.72% | 73.25% | 0.94 |
| DECIMER.ai | Deep Learning | 96.07% | 26.98% | 0.69 |
| MolGrapher | Deep Learning | 99.94% | 10.81% | 0.51 |
| MolScribe | Deep Learning | 95.66% | 7.65% | 0.59 |
| Img2Mol | Deep Learning | 98.96% | 5.25% | 0.52 |
| SwinOCSR | Deep Learning | 97.37% | 5.11% | 0.64 |
| ChemGrapher | Deep Learning | 69.56% | N/A | 0.09 |
| Imago | Rule-based | 43.14% | 2.99% | 0.22 |
| MolVec | Rule-based | 71.86% | 1.30% | 0.23 |
| OSRA | Rule-based | 54.66% | 0.57% | 0.17 |

Hardware

  • Compute: Google Cloud TPU v4-128 pod slice.
  • Training Time:
    • EfficientNetV2-M model trained ~2x faster than EfficientNetV1-B7.
    • Average training time per epoch: 34 minutes (for Model 3 on the 1M-molecule model selection subset).
  • Epochs: Models trained for 25 epochs.

Citation

@article{rajanAdvancementsHanddrawnChemical2024,
  title = {Advancements in Hand-Drawn Chemical Structure Recognition through an Enhanced {{DECIMER}} Architecture},
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph},
  year = 2024,
  month = jul,
  journal = {Journal of Cheminformatics},
  volume = {16},
  number = {1},
  pages = {78},
  issn = {1758-2946},
  doi = {10.1186/s13321-024-00872-7}
}