Paper Information

Citation: Kim, J. H., & Choi, J. (2025). OCSAug: diffusion-based optical chemical structure data augmentation for improved hand-drawn chemical structure image recognition. The Journal of Supercomputing, 81, 926.

Publication: The Journal of Supercomputing 2025

What kind of paper is this?

This is a Method paper according to the taxonomy. It proposes a novel data augmentation pipeline (OCSAug) that integrates Denoising Diffusion Probabilistic Models (DDPM) and the RePaint algorithm to address the data scarcity problem in hand-drawn optical chemical structure recognition (OCSR). The contribution is validated through systematic benchmarking against existing augmentation techniques (RDKit, Randepict) and ablation studies on mask design.

What is the motivation?

A vast amount of molecular structure data exists in analog formats, such as hand-drawn diagrams in research notes or older literature. While OCSR models perform well on digitally rendered images, they struggle significantly with hand-drawn images due to noise, varying handwriting styles, and distortions. Current datasets for hand-drawn images (e.g., DECIMER) are too small to train robust models effectively, and existing augmentation tools (RDKit, Randepict) fail to generate sufficiently realistic hand-drawn variations.

What is the novelty here?

The core novelty is OCSAug, a three-phase pipeline that uses generative AI to synthesize training data:

  1. DDPM + RePaint: It trains a DDPM to learn the distribution of hand-drawn images, then applies the RePaint algorithm to inpaint masked regions in that learned hand-drawn style.
  2. Structural Masking: It introduces vertical and horizontal stripe pattern masks. These masks selectively obscure parts of atoms or bonds, forcing the diffusion model to reconstruct them with irregular “hand-drawn” styles while preserving the underlying chemical topology.
  3. Label Transfer: Because the chemical structure is preserved during inpainting, the SMILES label from the original image is directly transferred to the augmented image, bypassing the need for re-annotation.

What experiments were performed?

The authors evaluated OCSAug using the DECIMER dataset, specifically a “drug-likeness” subset filtered by Lipinski’s and Veber’s rules.

  • Baselines: The method was compared against RDKit (digital generation) and Randepict (rule-based augmentation).
  • Models: Four state-of-the-art OCSR models were fine-tuned: MolScribe, Image2SMILES (I2S), MolNexTR, and MPOCSR.
  • Metrics:
    • Tanimoto Similarity: To measure prediction accuracy against ground truth.
    • Fréchet Inception Distance (FID): To measure the distributional similarity between generated and real hand-drawn images.
    • RMSE: To quantify pixel-level structural preservation across different mask thicknesses.

What outcomes/conclusions?

  • Performance Boost: OCSAug improved recognition accuracy (Tanimoto similarity) by a factor of 1.918–3.820 relative to non-augmented baselines, significantly outperforming RDKit and Randepict.
  • Data Quality: OCSAug achieved the lowest FID score (0.471) compared to Randepict (4.054) and RDKit (10.581), indicating its generated images are much closer to the real hand-drawn distribution.
  • Generalization: The method showed improved generalization on a newly collected real-world dataset of 463 images from 6 volunteers.
  • Limitations: The generation process is slow (3 weeks for 10k images on a single GPU) and the fixed stripe masks may struggle with highly complex, non-drug-like geometries.

Reproducibility Details

Data

  • Source: DECIMER dataset (hand-drawn images).
  • Filtering: A “drug-likeness” filter was applied (Lipinski’s rule of 5 + Veber’s rules) along with an atom filter (C, H, O, S, F, Cl, Br, N, P only).
  • Final Size: 3,194 samples, split into:
    • Training: 2,604 samples.
    • Validation: 290 samples.
    • Test: 300 samples.
  • Resolution: All images resized to $256 \times 256$ pixels.
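
The drug-likeness filter reduces to simple threshold checks. A minimal sketch follows, assuming descriptor values (molecular weight, logP, H-bond donors/acceptors, rotatable bonds, TPSA) are computed upstream by a toolkit such as RDKit; the thresholds are the standard Lipinski/Veber formulations, not the paper's own code:

```python
# Sketch of the paper's drug-likeness + atom filters (standard thresholds assumed).
# Descriptor values would normally come from RDKit (e.g., rdkit.Chem.Descriptors).

def passes_lipinski(mw, logp, hbd, hba):
    """Lipinski's rule of 5: MW <= 500, logP <= 5, <= 5 H-bond donors, <= 10 acceptors."""
    return mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10

def passes_veber(rot_bonds, tpsa):
    """Veber's rules: <= 10 rotatable bonds, TPSA <= 140 A^2."""
    return rot_bonds <= 10 and tpsa <= 140

# Element whitelist from the paper's atom filter.
ALLOWED_ATOMS = {"C", "H", "O", "S", "F", "Cl", "Br", "N", "P"}

def passes_atom_filter(atom_symbols):
    """True if every atom in the molecule is in the allowed set."""
    return set(atom_symbols) <= ALLOWED_ATOMS

# Aspirin-like descriptor values pass both rule sets:
print(passes_lipinski(180.2, 1.3, 1, 4) and passes_veber(3, 63.6))  # True
```

A molecule enters the training pool only if all three checks pass.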

Algorithms

  • Framework: DDPM implemented using guided-diffusion.
  • RePaint Settings:
    • Total time steps: 250.
    • Jump length: 10.
    • Resampling counts: 10.
  • Masking Strategy:
    • Vertical Stripes: Obscure atom symbols to vary handwriting style.
    • Horizontal Stripes: Obscure bonds to vary length/thickness/alignment.
    • Optimal Thickness: A stripe thickness of 4 pixels was found to be optimal for balancing diversity and structural preservation.
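
The stripe masks above can be sketched in a few lines of array code. This is a hedged sketch, not the authors' implementation: the paper specifies only the 4-pixel stripe thickness, so the spacing between stripes is an assumption here. By the usual RePaint convention, 1 marks pixels to keep and 0 marks pixels for the diffusion model to regenerate.

```python
import numpy as np

def stripe_mask(size=256, thickness=4, spacing=None, vertical=True):
    """Binary stripe mask for RePaint inpainting.

    1 = keep the original pixel, 0 = region the diffusion model repaints.
    `spacing` (gap between stripes) is an assumption; the paper only
    reports that a stripe thickness of 4 px is optimal.
    """
    if spacing is None:
        spacing = 2 * thickness
    mask = np.ones((size, size), dtype=np.uint8)
    for start in range(0, size, thickness + spacing):
        if vertical:
            mask[:, start:start + thickness] = 0  # obscure atom symbols
        else:
            mask[start:start + thickness, :] = 0  # obscure bonds
    return mask

# Vertical and horizontal variants at the paper's image resolution:
v_mask = stripe_mask(256, thickness=4, vertical=True)
h_mask = stripe_mask(256, thickness=4, vertical=False)
```

Because the unmasked stripes anchor the original topology, the inpainted result keeps the same molecule (and hence the same SMILES label) while the strokes vary.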

Models

The OCSR models were pretrained on PubChem (digital images) and then fine-tuned on the OCSAug dataset.

  • MolScribe: Swin Transformer encoder, Transformer decoder. Fine-tuned (all layers) for 30 epochs, batch size 16-128, LR 2e-5.
  • I2S: Inception V3 encoder (frozen), FC/Decoder fine-tuned. 25 epochs, batch size 64, LR 1e-5.
  • MolNexTR: Dual-stream encoder (Swin + CNN). Fine-tuned (all layers) for 30 epochs, batch size 16-64, LR 2e-5.
  • MPOCSR: MPVIT backbone. Fine-tuned (all layers) for 25 epochs, batch size 16-32, LR 4e-5.

Evaluation

  • Metric: Improvement Ratio (IR) of Tanimoto Similarity (TS), calculated as $IR = TS_{\text{finetuned}} / TS_{\text{non-finetuned}}$.
  • Validation: Cross-validation on the split DECIMER dataset.
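
As a sketch of the scoring: Tanimoto similarity between predicted and ground-truth molecules is conventionally computed over fingerprint bit sets (e.g., Morgan fingerprints from RDKit; plain Python sets stand in for them here), and the improvement ratio is the direct quotient from the formula above.

```python
def tanimoto(fp_pred, fp_true):
    """Tanimoto similarity of two fingerprint bit sets: |A & B| / |A | B|."""
    a, b = set(fp_pred), set(fp_true)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def improvement_ratio(ts_finetuned, ts_baseline):
    """IR = TS_finetuned / TS_non-finetuned (mean TS over the test set)."""
    return ts_finetuned / ts_baseline

# Toy bit sets: 2 shared bits out of 4 total -> TS = 0.5
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```

With real molecules, the bit sets would come from a fingerprint generator applied to the predicted and ground-truth SMILES.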

Hardware

  • GPU: NVIDIA GeForce RTX 4090.
  • Training Time: DDPM training took ~6 days.
  • Generation Time: Generating 2,600 augmented images took ~70 hours.

Citation

@article{kimOCSAugDiffusionbasedOptical2025,
  title = {OCSAug: Diffusion-Based Optical Chemical Structure Data Augmentation for Improved Hand-Drawn Chemical Structure Image Recognition},
  shorttitle = {OCSAug},
  author = {Kim, Jin Hyuk and Choi, Jonghwan},
  year = 2025,
  month = may,
  journal = {The Journal of Supercomputing},
  volume = {81},
  number = {8},
  pages = {926},
  doi = {10.1007/s11227-025-07406-4}
}