Enhanced DECIMER for Hand-Drawn Structure Recognition

Paper Information

Citation: Rajan, K., Brinkhaus, H.O., Zielesny, A., & Steinbeck, C. (2024). Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. Journal of Cheminformatics, 16(1), 78. https://doi.org/10.1186/s13321-024-00872-7

Publication: Journal of Cheminformatics 2024

Method Contribution: Architectural Optimization

This is a Method paper. It proposes an enhanced neural network architecture (an EfficientNetV2 image encoder feeding a Transformer decoder) for recognizing hand-drawn chemical structures. The primary contributions are the architectural optimization and a data-driven training strategy, validated through ablation studies that compare encoders and Transformer variants, and benchmarked against existing rule-based and deep learning tools.

Motivation: Digitizing “Dark” Chemical Data

Chemical information in legacy laboratory notebooks and modern tablet-based inputs often exists as hand-drawn sketches.

Gap: Existing Optical Chemical Structure Recognition (OCSR) tools (particularly rule-based ones) lack robustness and fail when images have variability in style, line thickness, or noise.
Need: There is a critical need for automated tools to digitize this “dark data” effectively to preserve it and make it machine-readable and searchable.

Core Innovation: Decoder-Only Design and Synthetic Scaling

The core novelty is the architectural enhancement and synthetic training strategy:

Decoder-Only Transformer: Using only the decoder part of the Transformer (instead of a full encoder-decoder Transformer) improved average accuracy across OCSR benchmarks from 61.28% to 69.27% (Table 3 in the paper).
EfficientNetV2 Integration: Replacing standard CNNs or EfficientNetV1 with EfficientNetV2-M provided better feature extraction and roughly 2x faster training than EfficientNetV1-B7.
Scale of Synthetic Data: The authors demonstrate that scaling synthetic training data (up to 152 million images generated by RanDepict) directly correlates with improved generalization to real-world hand-drawn images, without ever training on real hand-drawn data.

Experimental Setup: Ablation and Real-World Baselines

Model Selection (Ablation): Tested three architectures (EfficientNetV2-M + Full Transformer, EfficientNetV1-B7 + Decoder-only, EfficientNetV2-M + Decoder-only) on standard benchmarks (JPO, CLEF, USPTO, UOB).
Data Scaling: Trained the best model on four progressively larger datasets (from 4M to 152M images) to measure performance gains.
Real-World Benchmarking: Validated the final model on the DECIMER Hand-Drawn dataset (5,088 real images drawn by volunteers) and compared it against 9 other tools (OSRA, MolVec, Img2Mol, MolScribe, etc.).

Results and Conclusions: Strong Accuracy on Hand-Drawn Scans

Strong Performance: The final DECIMER model achieved 99.72% valid predictions and 73.25% exact accuracy on the hand-drawn benchmark. The next best non-DECIMER tool was MolGrapher at 10.81% accuracy, followed by MolScribe at 7.65%.
Robustness: On hand-drawn data, every deep learning tool with a reported exact accuracy outperformed the rule-based tools, none of which exceeded 3% exact accuracy.
Data Saturation: Quadrupling the dataset from 38M to 152M images yielded only marginal gains (exact accuracy rose from 70.34% to 73.25%, about 3 percentage points), suggesting current synthetic data strategies may be hitting a plateau.

Reproducibility

Artifacts

Artifact	Type	License	Notes
DECIMER Image Transformer (GitHub)	Code	MIT	Official TensorFlow implementation
Model Weights (Zenodo)	Model	Unknown	Pre-trained hand-drawn model weights
DECIMER PyPi Package	Code	MIT	Installable Python package
RanDepict (GitHub)	Code	MIT	Synthetic hand-drawn image generation toolkit

Data

The model was trained entirely on synthetic data generated using the RanDepict toolkit. No real hand-drawn images were used for training.

Dataset	Source	Molecules	Total Images	Notes
1	ChEMBL	2,187,669	4,375,338	1 augmented + 1 clean per molecule
2	ChEMBL	2,187,669	13,126,014	2 augmented + 4 clean per molecule
3	PubChem	9,510,000	38,040,000	1 augmented + 3 clean per molecule
4	PubChem	38,040,000	152,160,000	1 augmented + 3 clean per molecule
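
The image totals in the table follow from multiplying each molecule count by the number of depictions generated per molecule. As a quick consistency check (a minimal sketch; the tuple layout and the function name `total_images` are illustrative choices made here, not part of the paper's code):

```python
# Each row: (dataset id, molecules, augmented per molecule, clean per molecule, reported total)
# Values are copied from the table above; the layout is only for illustration.
DATASETS = [
    ("1", 2_187_669, 1, 1, 4_375_338),
    ("2", 2_187_669, 2, 4, 13_126_014),
    ("3", 9_510_000, 1, 3, 38_040_000),
    ("4", 38_040_000, 1, 3, 152_160_000),
]


def total_images(molecules: int, augmented: int, clean: int) -> int:
    """Number of training images when every molecule is depicted (augmented + clean) times."""
    return molecules * (augmented + clean)


if __name__ == "__main__":
    for name, molecules, augmented, clean, reported in DATASETS:
        computed = total_images(molecules, augmented, clean)
        status = "matches" if computed == reported else "differs from"
        print(f"Dataset {name}: computed {computed:,} {status} reported {reported:,}")
```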

A separate model selection experiment used a 1,024,000-molecule subset of ChEMBL to compare the three architectures (Table 1 in the paper). The DECIMER Hand-Drawn evaluation dataset consists of 5,088 real hand-drawn images from 23 volunteers.

Preprocessing:

SMILES strings restricted to fewer than 300 characters.
Images resized to $512 \times 512$.
Images generated with and without “hand-drawn style” augmentations.

Algorithms

Tokenization: SMILES split by heavy atoms, brackets, bond symbols, and special characters. Start <start> and end <end> tokens added; padded with <pad>.
Optimization: Adam optimizer with a custom learning rate schedule (as specified in the original Transformer paper). A dropout rate of 0.1 was used.
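Loss Function: Trained using focal loss to address class imbalance for rare tokens. The focal loss formulation reduces the relative loss for well-classified examples: $$ \text{FL}(p_{\text{t}}) = -\alpha_{\text{t}} (1 - p_{\text{t}})^\gamma \log(p_{\text{t}}) $$ Here $p_{\text{t}}$ is the predicted probability of the correct token, $\alpha_{\text{t}}$ is a class-weighting factor, and $\gamma \geq 0$ controls how strongly well-classified tokens are down-weighted; with $\gamma = 0$ and $\alpha_{\text{t}} = 1$ the expression reduces to standard cross-entropy.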
Augmentations: RanDepict applied synthetic distortions to mimic handwriting (wobbly lines, variable thickness, etc.).
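
To make the effect of the focal loss formula above concrete, the following minimal sketch evaluates it for a single token and compares it with plain cross-entropy. The values `gamma=2.0` and `alpha_t=1.0` are illustrative assumptions; this summary does not state the values used in the paper, and the official implementation is in TensorFlow, not pure Python.

```python
import math


def cross_entropy(p_t: float) -> float:
    """Standard cross-entropy for one token, given the probability of the true class."""
    return -math.log(p_t)


def focal_loss(p_t: float, gamma: float = 2.0, alpha_t: float = 1.0) -> float:
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha_t defaults are assumptions chosen for illustration only.
    """
    if not 0.0 < p_t <= 1.0:
        raise ValueError("p_t must lie in (0, 1]")
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # A confident prediction is down-weighted far more than an uncertain one.
    for p_t in (0.9, 0.5, 0.1):
        ratio = focal_loss(p_t) / cross_entropy(p_t)
        print(f"p_t={p_t}: focal/cross-entropy ratio = {ratio:.4f}")
```

The ratio printed for each probability equals $(1 - p_{\text{t}})^\gamma$, which is the modulating factor that shifts the training signal toward poorly predicted (often rare) tokens.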
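
The learning rate schedule from the original Transformer paper warms up linearly and then decays with the inverse square root of the step number. A minimal sketch is shown below. `d_model=512` matches the embedding dimension reported for this model, while `warmup_steps=4000` is the default from the original Transformer paper and is an assumption here, since this summary does not give the value used for DECIMER.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate: d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5).

    warmup_steps=4000 is an assumed value taken from the original Transformer paper.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # The rate rises during warm-up, peaks at warmup_steps, then decays.
    for step in (1, 1000, 4000, 16000, 64000):
        print(f"step {step:>6}: lr = {transformer_lr(step):.6e}")
```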

Models

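The final architecture (Model 3) pairs a CNN image encoder with a Transformer decoder. "Decoder-only" refers to the Transformer part: the image features go straight to the Transformer decoder without passing through a Transformer encoder stack.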

Encoder: EfficientNetV2-M (pretrained ImageNet backbone).
- Input: $512 \times 512 \times 3$ image.
- Output Features: $16 \times 16 \times 512$ (reshaped to sequence length 256, dimension 512).
- Note: The final fully connected layer of the CNN is removed.
Decoder: Transformer (Decoder-only).
- Layers: 6
- Attention Heads: 8
- Embedding Dimension: 512
Output: Predicted SMILES string token by token.
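
The step from a 2D feature map to a sequence for the decoder can be sketched as follows. This is an illustration of the shape bookkeeping only, not the DECIMER code: the total downsampling factor of 32 is an assumption that is consistent with a $512 \times 512$ input producing a $16 \times 16$ map, and the function names are hypothetical.

```python
def encoder_sequence_shape(image_size: int = 512, total_stride: int = 32,
                           channels: int = 512) -> tuple:
    """Return (sequence_length, feature_dim) after flattening the feature map.

    total_stride=32 is an assumed overall downsampling factor of the CNN.
    """
    if image_size % total_stride != 0:
        raise ValueError("image_size must be divisible by total_stride")
    side = image_size // total_stride  # 512 // 32 = 16
    return side * side, channels       # 16 * 16 = 256 positions


def flatten_grid(grid: list) -> list:
    """Flatten an H x W grid of feature vectors into a row-major sequence of length H * W."""
    return [cell for row in grid for cell in row]


if __name__ == "__main__":
    print(encoder_sequence_shape())
    # Tiny 2 x 2 grid with 1-dimensional "features" to show the ordering.
    print(flatten_grid([[[0], [1]], [[2], [3]]]))
```

Each of the 256 sequence positions corresponds to one spatial cell of the feature map, which is what the decoder attends over while emitting SMILES tokens.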

Evaluation

Metrics used for evaluation:

Valid Predictions (%): Percentage of outputs that are syntactically valid SMILES.
Exact Match Accuracy (%): Canonical SMILES string identity.
Tanimoto Similarity: Fingerprint similarity (PubChem fingerprints) between ground truth and prediction.
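
The exact-match and Tanimoto metrics listed above can be sketched in a few lines. This sketch makes two simplifying assumptions: the SMILES strings are taken to be already canonicalized by a cheminformatics toolkit, and fingerprints are represented as plain Python sets of "on" bit indices instead of actual PubChem fingerprints. The function names are illustrative.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def exact_match_accuracy(predicted: list, reference: list) -> float:
    """Percentage of predictions whose canonical SMILES equals the reference string."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * hits / len(reference)


if __name__ == "__main__":
    # Toy fingerprints sharing two of four distinct bits.
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
    # One of two (assumed canonical) SMILES strings matches exactly.
    print(exact_match_accuracy(["CCO", "c1ccccc1"], ["CCO", "C1CCCCC1"]))
```

Exact match is all-or-nothing per molecule, while Tanimoto gives partial credit for nearly correct structures, which is why a tool can show low exact accuracy alongside a moderate Tanimoto score in the tables below.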

Data Scaling Results (Hand-Drawn Dataset, Table 4 in the paper):

Dataset	Training Images	Valid Predictions	Exact Accuracy	Tanimoto
1 (ChEMBL)	4,375,338	96.21%	5.09%	0.490
2 (ChEMBL)	13,126,014	97.41%	26.08%	0.690
3 (PubChem)	38,040,000	99.67%	70.34%	0.939
4 (PubChem)	152,160,000	99.72%	73.25%	0.942

Comparison with Other Tools (Hand-Drawn Dataset, Table 5 in the paper):

OCSR Tool	Method	Valid Predictions	Exact Accuracy	Tanimoto
DECIMER (this work)	Deep Learning	99.72%	73.25%	0.94
DECIMER.ai	Deep Learning	96.07%	26.98%	0.69
MolGrapher	Deep Learning	99.94%	10.81%	0.51
MolScribe	Deep Learning	95.66%	7.65%	0.59
Img2Mol	Deep Learning	98.96%	5.25%	0.52
SwinOCSR	Deep Learning	97.37%	5.11%	0.64
ChemGrapher	Deep Learning	69.56%	N/A	0.09
Imago	Rule-based	43.14%	2.99%	0.22
MolVec	Rule-based	71.86%	1.30%	0.23
OSRA	Rule-based	54.66%	0.57%	0.17

Hardware

Compute: Google Cloud TPU v4-128 pod slice.
Training Time:
- EfficientNetV2-M model trained ~2x faster than EfficientNetV1-B7.
- Average training time per epoch: 34 minutes (for Model 3 on the 1,024,000-molecule model selection subset).
Epochs: Models trained for 25 epochs.

Citation

@article{rajanAdvancementsHanddrawnChemical2024,
  title = {Advancements in Hand-Drawn Chemical Structure Recognition through an Enhanced {{DECIMER}} Architecture},
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph},
  year = 2024,
  month = jul,
  journal = {Journal of Cheminformatics},
  volume = {16},
  number = {1},
  pages = {78},
  issn = {1758-2946},
  doi = {10.1186/s13321-024-00872-7}
}
