<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Hand-Drawn Structure Recognition on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/</link><description>Recent content in Hand-Drawn Structure Recognition on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/index.xml" rel="self" type="application/rss+xml"/><item><title>OCSAug: Diffusion-Based Augmentation for Hand-Drawn OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ocsaug/</guid><description>A diffusion-based data augmentation pipeline (OCSAug) using DDPM and RePaint to improve optical chemical structure recognition on hand-drawn images.</description><content:encoded><![CDATA[<h2 id="document-taxonomy-ocsaug-as-a-novel-method">Document Taxonomy: OCSAug as a Novel Method</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">taxonomy</a>. It proposes a novel data augmentation pipeline (<strong>OCSAug</strong>) that integrates Denoising Diffusion Probabilistic Models (DDPM) and the RePaint algorithm to address the data scarcity problem in hand-drawn optical chemical structure recognition (OCSR). The contribution is validated through systematic benchmarking against existing augmentation techniques (RDKit, RanDepict) and ablation studies on mask design.</p>
<h2 id="expanding-hand-drawn-training-data-for-ocsr">Expanding Hand-Drawn Training Data for OCSR</h2>
<p>A vast amount of molecular structure data exists in analog formats, such as hand-drawn diagrams in research notes or older literature. While OCSR models perform well on digitally rendered images, they struggle with hand-drawn images due to noise, varying handwriting styles, and distortions. Current datasets for hand-drawn images (e.g., DECIMER) are too small to train effective models, and existing augmentation tools (RDKit, RanDepict) fail to generate sufficiently realistic hand-drawn variations.</p>
<h2 id="ocsaug-pipeline-masked-repaint-via-generative-ai">OCSAug Pipeline: Masked RePaint via Generative AI</h2>
<p>The core novelty is <strong>OCSAug</strong>, a three-phase pipeline that uses generative AI to synthesize training data:</p>
<ol>
<li><strong>DDPM + RePaint</strong>: It utilizes a DDPM to learn the distribution of hand-drawn images and the RePaint algorithm for inpainting.</li>
<li><strong>Structural Masking</strong>: It introduces <strong>vertical and horizontal stripe pattern masks</strong>. These masks selectively obscure parts of atoms or bonds, forcing the diffusion model to reconstruct them with irregular &ldquo;hand-drawn&rdquo; styles while preserving the underlying chemical topology.</li>
<li><strong>Label Transfer</strong>: Because the chemical structure is preserved during inpainting, the SMILES label from the original image is directly transferred to the augmented image, bypassing the need for re-annotation.</li>
</ol>
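<p>The RePaint idea behind phase 1 can be sketched as a single composition step. This is a pure-Python schematic, not the paper's guided-diffusion implementation; <code>repaint_compose</code> and the flattened-list image representation are illustrative assumptions:</p>

```python
def repaint_compose(x_known_t, x_sampled_t, mask):
    """One RePaint composition step on flattened images (lists of floats).

    mask[i] == 1 marks a known pixel, kept from the forward-noised
    original; mask[i] == 0 marks a masked pixel, taken from the model's
    reverse-diffusion sample. OCSAug's stripe masks select the regions
    to re-draw, i.e. the mask == 0 pixels here.
    """
    return [k if m == 1 else s
            for k, s, m in zip(x_known_t, x_sampled_t, mask)]
```

Because only masked pixels are resampled, the unmasked chemical structure survives every step, which is what makes the label transfer in phase 3 sound.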
<h2 id="benchmarking-diffusion-augmentations-on-decimer">Benchmarking Diffusion Augmentations on DECIMER</h2>
<p>The authors evaluated OCSAug using the <strong>DECIMER dataset</strong>, specifically a &ldquo;drug-likeness&rdquo; subset filtered by Lipinski&rsquo;s and Veber&rsquo;s rules.</p>
<ul>
<li><strong>Baselines</strong>: The method was compared against <strong>RDKit</strong> (digital generation) and <strong>RanDepict</strong> (rule-based augmentation).</li>
<li><strong>Models</strong>: Four recent OCSR models were fine-tuned: <strong>MolScribe</strong>, <strong>DECIMER 1.0 (I2S)</strong>, <strong>MolNexTR</strong>, and <strong>MPOCSR</strong>.</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Tanimoto Similarity</strong>: To measure prediction accuracy against ground truth.</li>
<li><strong>Fréchet Inception Distance (FID)</strong>: To measure the distributional similarity between generated and real hand-drawn images.</li>
<li><strong>RMSE</strong>: To quantify pixel-level structural preservation across different mask thicknesses.</li>
</ul>
</li>
</ul>
<h2 id="improved-generalization-capabilities-and-fid-scores">Improved Generalization Capabilities and FID Scores</h2>
<ul>
<li><strong>Performance Boost</strong>: OCSAug improved recognition accuracy (Tanimoto similarity) by a factor of <strong>1.918 to 3.820</strong> (Improvement Ratio) over non-fine-tuned baselines, outperforming traditional augmentation techniques such as RDKit and RanDepict (1.570-3.523x).</li>
<li><strong>Data Quality</strong>: OCSAug achieved the lowest FID score (0.471) compared to RanDepict (4.054) and RDKit (10.581), indicating its generated images are much closer to the real hand-drawn distribution.</li>
<li><strong>Resolution Mixing</strong>: Training MolScribe and MolNexTR with a mix of $128 \times 128$, $256 \times 256$, and $512 \times 512$ resolution images improved Tanimoto similarity (e.g., MolScribe from 0.585 to 0.640), though this strategy did not help I2S or MPOCSR.</li>
<li><strong>Real-World Evaluation</strong>: On a newly collected dataset of 463 hand-drawn images from 6 volunteers (88 drug compounds), the MPOCSR model fine-tuned with OCSAug achieved 0.367 exact-match accuracy (Tanimoto = 1.0), compared to 0.365 for non-augmented fine-tuning and 0.037 for no fine-tuning. The area under the accuracy curve showed a more pronounced improvement, indicating a clearer reduction in misrecognitions than the exact-match figures alone suggest.</li>
<li><strong>Limitations</strong>: The generation process is slow (3 weeks for 10k images on a single GPU). The fixed stripe masks may struggle with highly complex, non-drug-like geometries: when evaluated on the full DECIMER dataset (without drug-likeness filtering), OCSAug did not yield uniform improvements across all models.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jjjabcd/OCSAug">OCSAug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using guided-diffusion and RePaint</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6456306">DECIMER Hand-Drawn Dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY 4.0</td>
          <td>5,088 hand-drawn molecular structure images from 24 individuals</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: DECIMER dataset (hand-drawn images).</li>
<li><strong>Filtering</strong>: A &ldquo;drug-likeness&rdquo; filter was applied (Lipinski&rsquo;s rule of 5 + Veber&rsquo;s rules) along with an atom filter (C, H, O, S, F, Cl, Br, N, P only).</li>
<li><strong>Final Size</strong>: 3,194 samples, split into:
<ul>
<li><strong>Training</strong>: 2,604 samples.</li>
<li><strong>Validation</strong>: 290 samples.</li>
<li><strong>Test</strong>: 300 samples.</li>
</ul>
</li>
<li><strong>Resolution</strong>: All images resized to $256 \times 256$ pixels.</li>
</ul>
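<p>As a sketch, the drug-likeness filter can be expressed over precomputed descriptors. The helper below is hypothetical; in practice the descriptor values and atom lists would come from RDKit:</p>

```python
ALLOWED_ATOMS = {"C", "H", "O", "S", "F", "Cl", "Br", "N", "P"}

def passes_drug_likeness(desc, atoms):
    """Lipinski's rule of 5 + Veber's rules + the paper's atom whitelist.

    desc: dict of precomputed molecular descriptors (key names assumed).
    atoms: iterable of element symbols present in the molecule.
    """
    lipinski = (desc["mol_wt"] <= 500
                and desc["logp"] <= 5
                and desc["h_donors"] <= 5
                and desc["h_acceptors"] <= 10)
    veber = desc["rot_bonds"] <= 10 and desc["tpsa"] <= 140
    return lipinski and veber and set(atoms) <= ALLOWED_ATOMS
```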
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DDPM implemented using <code>guided-diffusion</code>.</li>
<li><strong>RePaint Settings</strong>:
<ul>
<li>Total time steps: 250.</li>
<li>Jump length: 10.</li>
<li>Resampling counts: 10.</li>
</ul>
</li>
<li><strong>Masking Strategy</strong>:
<ul>
<li><strong>Vertical Stripes</strong>: Obscure atom symbols to vary handwriting style.</li>
<li><strong>Horizontal Stripes</strong>: Obscure bonds to vary length/thickness/alignment.</li>
<li><strong>Optimal Thickness</strong>: A stripe thickness of <strong>4 pixels</strong> was found to be optimal for balancing diversity and structural preservation.</li>
</ul>
</li>
</ul>
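<p>The stripe masks can be sketched in a few lines of pure Python; the <code>spacing</code> between stripes is an assumed free parameter, as only the optimal thickness is reported:</p>

```python
def stripe_mask(height, width, thickness=4, spacing=8, vertical=True):
    """Binary stripe mask: 1 = pixel to inpaint, 0 = pixel to keep.

    Vertical stripes obscure atom symbols; horizontal stripes obscure
    bonds. A 4-pixel thickness was found optimal in the paper.
    """
    def in_stripe(i):
        return i % (thickness + spacing) < thickness

    if vertical:
        return [[int(in_stripe(x)) for x in range(width)]
                for _ in range(height)]
    return [[int(in_stripe(y)) for _ in range(width)]
            for y in range(height)]
```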
<h3 id="models">Models</h3>
<p>The OCSR models were pretrained on PubChem (digital images) and then fine-tuned on the OCSAug dataset.</p>
<ul>
<li><strong>MolScribe</strong>: Swin Transformer encoder, Transformer decoder. Fine-tuned (all layers) for 30 epochs, batch size 16-128, LR 2e-5.</li>
<li><strong>I2S (DECIMER 1.0)</strong>: Inception V3 encoder (frozen), FC/Decoder fine-tuned. 25 epochs, batch size 64, LR 1e-5.</li>
<li><strong>MolNexTR</strong>: Dual-stream encoder (Swin + CNN). Fine-tuned (all layers) for 30 epochs, batch size 16-64, LR 2e-5.</li>
<li><strong>MPOCSR</strong>: MPViT backbone. Fine-tuned (all layers) for 25 epochs, batch size 16-32, LR 4e-5.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>
<p><strong>Metric</strong>: Improvement Ratio (IR) of Tanimoto Similarity (TS), defined as:</p>
<p>$$
\text{IR} = \frac{\text{TS}_{\text{finetuned}}}{\text{TS}_{\text{non-finetuned}}}
$$</p>
</li>
<li>
<p><strong>Validation</strong>: Cross-validation on the split DECIMER dataset.</p>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: NVIDIA GeForce RTX 4090.</li>
<li><strong>Training Time</strong>: DDPM training took ~6 days.</li>
<li><strong>Generation Time</strong>: Generating 2,600 augmented images took ~70 hours.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, J. H., &amp; Choi, J. (2025). OCSAug: diffusion-based optical chemical structure data augmentation for improved hand-drawn chemical structure image recognition. <em>The Journal of Supercomputing</em>, 81, 926.</p>
<p><strong>Publication</strong>: The Journal of Supercomputing 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/jjjabcd/OCSAug">Official Repository</a></li>
<li><a href="https://zenodo.org/records/6456306">DECIMER Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimOCSAugDiffusionbasedOptical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSAug: Diffusion-Based Optical Chemical Structure Data Augmentation for Improved Hand-Drawn Chemical Structure Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSAug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kim, Jin Hyuk and Choi, Jonghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = may,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{The Journal of Supercomputing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{81}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{926}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11227-025-07406-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Enhanced DECIMER for Hand-Drawn Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</guid><description>An improved encoder-decoder model (EfficientNetV2 + Transformer) converts hand-drawn chemical structures into SMILES strings using synthetic training data.</description><content:encoded><![CDATA[<h2 id="method-contribution-architectural-optimization">Method Contribution: Architectural Optimization</h2>
<p>This is a <strong>Method</strong> paper. It proposes an enhanced neural network architecture (EfficientNetV2 + Transformer) specifically designed to solve the problem of recognizing hand-drawn chemical structures. The primary contribution is architectural optimization and a data-driven training strategy, validated through ablation studies (comparing encoders) and benchmarked against existing rule-based and deep learning tools.</p>
<h2 id="motivation-digitizing-dark-chemical-data">Motivation: Digitizing &ldquo;Dark&rdquo; Chemical Data</h2>
<p>Chemical information in legacy laboratory notebooks and modern tablet-based inputs often exists as hand-drawn sketches.</p>
<ul>
<li><strong>Gap:</strong> Existing Optical Chemical Structure Recognition (OCSR) tools (particularly rule-based ones) lack robustness and fail when images have variability in style, line thickness, or noise.</li>
<li><strong>Need:</strong> There is a critical need for automated tools to digitize this &ldquo;dark data&rdquo; effectively to preserve it and make it machine-readable and searchable.</li>
</ul>
<h2 id="core-innovation-decoder-only-design-and-synthetic-scaling">Core Innovation: Decoder-Only Design and Synthetic Scaling</h2>
<p>The core novelty is the <strong>architectural enhancement</strong> and <strong>synthetic training strategy</strong>:</p>
<ol>
<li><strong>Decoder-Only Transformer:</strong> Using only the decoder part of the Transformer (instead of a full encoder-decoder Transformer) improved average accuracy across OCSR benchmarks from 61.28% to 69.27% (Table 3 in the paper).</li>
<li><strong>EfficientNetV2 Integration:</strong> Replacing standard CNNs or EfficientNetV1 with <strong>EfficientNetV2-M</strong> provided better feature extraction and 2x faster training speeds.</li>
<li><strong>Scale of Synthetic Data:</strong> The authors demonstrate that scaling synthetic training data (up to 152 million images generated by RanDepict) directly correlates with improved generalization to real-world hand-drawn images, without ever training on real hand-drawn data.</li>
</ol>
<h2 id="experimental-setup-ablation-and-real-world-baselines">Experimental Setup: Ablation and Real-World Baselines</h2>
<ul>
<li><strong>Model Selection (Ablation):</strong> Tested three architectures (EfficientNetV2-M + Full Transformer, EfficientNetV1-B7 + Decoder-only, EfficientNetV2-M + Decoder-only) on standard benchmarks (JPO, CLEF, USPTO, UOB).</li>
<li><strong>Data Scaling:</strong> Trained the best model on four progressively larger datasets (from 4M to 152M images) to measure performance gains.</li>
<li><strong>Real-World Benchmarking:</strong> Validated the final model on the <strong>DECIMER Hand-drawn dataset</strong> (5088 real images drawn by volunteers) and compared against 9 other tools (OSRA, MolVec, Img2Mol, MolScribe, etc.).</li>
</ul>
<h2 id="results-and-conclusions-strong-accuracy-on-hand-drawn-scans">Results and Conclusions: Strong Accuracy on Hand-Drawn Scans</h2>
<ul>
<li><strong>Strong Performance:</strong> The final DECIMER model achieved <strong>99.72% valid predictions</strong> and <strong>73.25% exact accuracy</strong> on the hand-drawn benchmark. The next best non-DECIMER tool was MolGrapher at 10.81% accuracy, followed by MolScribe at 7.65%.</li>
<li><strong>Robustness:</strong> Deep learning methods outperform rule-based methods (which scored 3% or less accuracy) on hand-drawn data.</li>
<li><strong>Data Saturation:</strong> Quadrupling the dataset from 38M to 152M images yielded only marginal gains (about 3 percentage points in accuracy), suggesting current synthetic data strategies may be hitting a plateau.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official TensorFlow implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained hand-drawn model weights</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/decimer/">DECIMER PyPi Package</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Installable Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic hand-drawn image generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The model was trained entirely on <strong>synthetic data</strong> generated using the <a href="https://github.com/OBrink/RanDepict">RanDepict</a> toolkit. No real hand-drawn images were used for training.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Molecules</th>
          <th>Total Images</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>4,375,338</td>
          <td>1 augmented + 1 clean per molecule</td>
      </tr>
      <tr>
          <td>2</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>13,126,014</td>
          <td>2 augmented + 4 clean per molecule</td>
      </tr>
      <tr>
          <td>3</td>
          <td>PubChem</td>
          <td>9,510,000</td>
          <td>38,040,000</td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
      <tr>
          <td>4</td>
          <td>PubChem</td>
          <td>38,040,000</td>
          <td><strong>152,160,000</strong></td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
  </tbody>
</table>
<p>A separate <strong>model selection</strong> experiment used a 1,024,000-molecule subset of ChEMBL to compare the three architectures (Table 1 in the paper). The <strong>DECIMER Hand-Drawn</strong> evaluation dataset consists of 5,088 real hand-drawn images from 23 volunteers.</p>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings limited to fewer than 300 characters.</li>
<li>Images resized to $512 \times 512$.</li>
<li>Images generated with and without &ldquo;hand-drawn style&rdquo; augmentations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization:</strong> SMILES split by heavy atoms, brackets, bond symbols, and special characters. Start <code>&lt;start&gt;</code> and end <code>&lt;end&gt;</code> tokens added; padded with <code>&lt;pad&gt;</code>.</li>
<li><strong>Optimization:</strong> Adam optimizer with a custom learning rate schedule (as specified in the original Transformer paper). A dropout rate of 0.1 was used.</li>
<li><strong>Loss Function:</strong> Trained using focal loss to address class imbalance for rare tokens. The focal loss formulation reduces the relative loss for well-classified examples:
$$
\text{FL}(p_{\text{t}}) = -\alpha_{\text{t}} (1 - p_{\text{t}})^\gamma \log(p_{\text{t}})
$$</li>
<li><strong>Augmentations:</strong> RanDepict applied synthetic distortions to mimic handwriting (wobbly lines, variable thickness, etc.).</li>
</ul>
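<p>The tokenization step can be sketched with a regular expression. The pattern below is illustrative; the paper's exact token inventory is not reproduced here:</p>

```python
import re

# Assumed token pattern: bracket atoms, two-letter halogens, organic
# subset atoms (including aromatic lowercase forms), bonds, ring
# digits, branches, and stereo markers.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|[BCNOPSFIbcnops]|[=#$/\\%()+\-.:0-9@])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens and add sequence markers."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"cannot fully tokenize: {smiles!r}")
    return ["<start>"] + tokens + ["<end>"]
```

The round-trip check (<code>join(tokens) == smiles</code>) guards against silently dropping characters the pattern does not cover.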
<h3 id="models">Models</h3>
<p>The final architecture (Model 3) is an Encoder-Decoder structure:</p>
<ul>
<li><strong>Encoder:</strong> <strong>EfficientNetV2-M</strong> (pretrained ImageNet backbone).
<ul>
<li>Input: $512 \times 512 \times 3$ image.</li>
<li>Output Features: $16 \times 16 \times 512$ (reshaped to sequence length 256, dimension 512).</li>
<li><em>Note:</em> The final fully connected layer of the CNN is removed.</li>
</ul>
</li>
<li><strong>Decoder:</strong> <strong>Transformer (Decoder-only)</strong>.
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 8</li>
<li>Embedding Dimension: 512</li>
</ul>
</li>
<li><strong>Output:</strong> Predicted SMILES string token by token.</li>
</ul>
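<p>The encoder-to-decoder handoff is a simple reshape. A pure-Python sketch, with nested lists standing in for tensors:</p>

```python
def grid_to_sequence(features):
    """Flatten an H x W x D feature grid into an (H*W) x D sequence.

    In the paper: the 16 x 16 x 512 EfficientNetV2-M output becomes a
    256-token sequence of dimension 512 for the Transformer decoder.
    """
    return [cell for row in features for cell in row]
```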
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used for evaluation:</p>
<ol>
<li><strong>Valid Predictions (%):</strong> Percentage of outputs that are syntactically valid SMILES.</li>
<li><strong>Exact Match Accuracy (%):</strong> Canonical SMILES string identity.</li>
<li><strong>Tanimoto Similarity:</strong> Fingerprint similarity (PubChem fingerprints) between ground truth and prediction.</li>
</ol>
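<p>Metrics 2 and 3 can be sketched as below. These are hypothetical helpers; in practice SMILES canonicalization and PubChem fingerprints come from a cheminformatics toolkit such as RDKit or the CDK:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of
    on-bit indices: |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def exact_match(pred_smiles, true_smiles, canonicalize=lambda s: s):
    """Exact match compares *canonical* SMILES; the identity default
    assumes inputs were already canonicalized."""
    return canonicalize(pred_smiles) == canonicalize(true_smiles)
```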
<p><strong>Data Scaling Results (Hand-Drawn Dataset, Table 4 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Training Images</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (ChEMBL)</td>
          <td>4,375,338</td>
          <td>96.21%</td>
          <td>5.09%</td>
          <td>0.490</td>
      </tr>
      <tr>
          <td>2 (ChEMBL)</td>
          <td>13,126,014</td>
          <td>97.41%</td>
          <td>26.08%</td>
          <td>0.690</td>
      </tr>
      <tr>
          <td>3 (PubChem)</td>
          <td>38,040,000</td>
          <td>99.67%</td>
          <td>70.34%</td>
          <td>0.939</td>
      </tr>
      <tr>
          <td>4 (PubChem)</td>
          <td>152,160,000</td>
          <td>99.72%</td>
          <td>73.25%</td>
          <td>0.942</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with Other Tools (Hand-Drawn Dataset, Table 5 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>OCSR Tool</th>
          <th>Method</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER (Ours)</strong></td>
          <td>Deep Learning</td>
          <td><strong>99.72%</strong></td>
          <td><strong>73.25%</strong></td>
          <td><strong>0.94</strong></td>
      </tr>
      <tr>
          <td>DECIMER.ai</td>
          <td>Deep Learning</td>
          <td>96.07%</td>
          <td>26.98%</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>Deep Learning</td>
          <td>99.94%</td>
          <td>10.81%</td>
          <td>0.51</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>Deep Learning</td>
          <td>95.66%</td>
          <td>7.65%</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>Deep Learning</td>
          <td>98.96%</td>
          <td>5.25%</td>
          <td>0.52</td>
      </tr>
      <tr>
          <td>SwinOCSR</td>
          <td>Deep Learning</td>
          <td>97.37%</td>
          <td>5.11%</td>
          <td>0.64</td>
      </tr>
      <tr>
          <td>ChemGrapher</td>
          <td>Deep Learning</td>
          <td>69.56%</td>
          <td>N/A</td>
          <td>0.09</td>
      </tr>
      <tr>
          <td>Imago</td>
          <td>Rule-based</td>
          <td>43.14%</td>
          <td>2.99%</td>
          <td>0.22</td>
      </tr>
      <tr>
          <td>MolVec</td>
          <td>Rule-based</td>
          <td>71.86%</td>
          <td>1.30%</td>
          <td>0.23</td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td>Rule-based</td>
          <td>54.66%</td>
          <td>0.57%</td>
          <td>0.17</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Google Cloud TPU v4-128 pod slice.</li>
<li><strong>Training Time:</strong>
<ul>
<li>EfficientNetV2-M model trained ~2x faster than EfficientNetV1-B7.</li>
<li>Average training time per epoch: 34 minutes (for Model 3 on 1M dataset subset).</li>
</ul>
</li>
<li><strong>Epochs:</strong> Models trained for 25 epochs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H.O., Zielesny, A. et al. (2024). Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. <em>Journal of Cheminformatics</em>, 16(78). <a href="https://doi.org/10.1186/s13321-024-00872-7">https://doi.org/10.1186/s13321-024-00872-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://pypi.org/project/decimer/">PyPi Package</a></li>
<li><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanAdvancementsHanddrawnChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Advancements in Hand-Drawn Chemical Structure Recognition through an Enhanced {{DECIMER}} Architecture}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{78}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00872-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemReco: Hand-Drawn Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chemreco/</guid><description>A deep learning method using EfficientNet and Transformer to convert hand-drawn chemical structures into SMILES codes, achieving 96.9% accuracy.</description><content:encoded><![CDATA[<h2 id="research-contribution--classification">Research Contribution &amp; Classification</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: The primary contribution is &ldquo;ChemReco,&rdquo; a specific deep learning pipeline (EfficientNet + Transformer) designed to solve the Optical Chemical Structure Recognition (OCSR) task for hand-drawn images. The authors conduct extensive ablation studies on architecture and data mixing ratios to validate performance.</li>
<li><strong>Resource</strong>: The authors explicitly state that &ldquo;the primary focus of this paper is constructing datasets&rdquo; due to the scarcity of hand-drawn molecular data. They introduce a comprehensive synthetic data generation pipeline involving RDKit modifications and image degradation to create training data.</li>
</ul>
<h2 id="motivation-digitizing-hand-drawn-chemical-sketches">Motivation: Digitizing Hand-Drawn Chemical Sketches</h2>
<p>Hand-drawing is the most intuitive method for chemists and students to record molecular structures. However, digitizing these drawings into machine-readable formats (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) usually requires time-consuming manual entry or specialized software.</p>
<ul>
<li><strong>Gap</strong>: Existing OCSR tools and rule-based methods often fail on hand-drawn sketches due to diverse writing styles, poor image quality, and the absence of labeled data.</li>
<li><strong>Application</strong>: Automated recognition enables efficient chemical research and allows for automatic grading in educational settings.</li>
</ul>
<h2 id="core-innovation-synthetic-pipeline-and-hybrid-architecture">Core Innovation: Synthetic Pipeline and Hybrid Architecture</h2>
<p>The paper introduces <strong>ChemReco</strong>, an end-to-end system for recognizing C-H-O structures. Key novelties include:</p>
<ol>
<li><strong>Synthetic Data Pipeline</strong>: A multi-stage generation method that modifies RDKit source code to randomize bond/angle parameters, followed by OpenCV-based augmentation, degradation, and background addition to simulate realistic hand-drawn artifacts.</li>
<li><strong>Architectural Choice</strong>: The specific application of <strong>EfficientNet</strong> (encoder) combined with a <strong>Transformer</strong> (decoder) for this domain, which the authors demonstrate outperforms the more common ResNet+LSTM baselines.</li>
<li><strong>Hybrid Training Strategy</strong>: The finding that a mix of 90% synthetic and 10% real data yields optimal performance, superior to using either dataset alone.</li>
</ol>
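<p>The degradation stage of the synthetic pipeline can be sketched with a pure-Python stand-in. ChemReco's actual pipeline uses OpenCV-based augmentation, degradation, and background compositing; <code>flip_prob</code> and <code>seed</code> are assumed knobs:</p>

```python
import random

def degrade(image, flip_prob=0.05, seed=0):
    """Salt-and-pepper-style degradation of a binary image (2D list of
    0/1 pixels): each pixel flips with probability flip_prob. A crude
    stand-in for ChemReco's OpenCV degradation step.
    """
    rng = random.Random(seed)
    return [[1 - px if rng.random() < flip_prob else px for px in row]
            for row in image]
```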
<h2 id="methodology--ablation-studies">Methodology &amp; Ablation Studies</h2>
<p>The authors performed a series of ablation studies and comparisons:</p>
<ul>
<li><strong>Synthesis Ablation</strong>: Evaluated the impact of each step in the generation pipeline (RDKit only $\rightarrow$ Augmentation $\rightarrow$ Degradation $\rightarrow$ Background) on validation loss and accuracy.</li>
<li><strong>Dataset Size Ablation</strong>: Tested model performance when trained on synthetic datasets ranging from 100,000 to 1,000,000 images.</li>
<li><strong>Real/Synthetic Ratio</strong>: Investigated the optimal mixing ratio of synthetic to real hand-drawn images (100:0, 90:10, 50:50, 10:90, 0:100), finding that the 90:10 ratio achieved 93.81% exact match, compared to 63.33% for synthetic-only and 65.83% for real-only.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked four encoder-decoder combinations: ResNet vs. EfficientNet encoders paired with LSTM vs. Transformer decoders.</li>
<li><strong>Baseline Comparison</strong>: Compared results against a related study utilizing a CNN+LSTM framework.</li>
</ul>
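<p>The real/synthetic ratio ablation above amounts to sampling a fixed-size training set at a chosen mix. A minimal sketch, assuming hypothetical <code>synthetic_paths</code>/<code>real_paths</code> lists (not the paper&rsquo;s actual data loader); real images are sampled with replacement since only 2,598 exist:</p>

```python
import random

def build_mixed_training_set(synthetic_paths, real_paths, total, synth_frac=0.9, seed=0):
    """Sample a training set with a fixed synthetic-to-real ratio.

    Real images are drawn with replacement when fewer are available
    than requested (the paper has only 2,598 real images).
    """
    rng = random.Random(seed)
    n_synth = int(total * synth_frac)
    n_real = total - n_synth
    synth = rng.sample(synthetic_paths, n_synth)
    real = rng.choices(real_paths, k=n_real)  # with replacement
    mixed = synth + real
    rng.shuffle(mixed)
    return mixed

# Toy example: 10 items at the paper's 90:10 ratio -> 9 synthetic, 1 real.
mix = build_mixed_training_set([f"s{i}" for i in range(100)], ["r0", "r1"], total=10)
```

<p>At the paper&rsquo;s scale this would be 900k synthetic and 100k (oversampled) real images.</p>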
<h2 id="results--interpretations">Results &amp; Interpretations</h2>
<ul>
<li><strong>Best Performance</strong>: The EfficientNet + Transformer model trained on a 90:10 synthetic-to-real ratio achieved a <strong>96.90% Exact Match</strong> rate on the test set.</li>
<li><strong>Background Robustness</strong>: When training on synthetic data alone (no real images), the best accuracy on background-free test images was approximately 46% (using RDKit-aug-deg), while background test images reached approximately 53% (using RDKit-aug-bkg-deg). Adding random backgrounds during training helped prevent the model from overfitting to clean white backgrounds.</li>
<li><strong>Data Volume</strong>: Increasing the synthetic dataset size from 100k to 1M consistently improved accuracy (average exact match: 49.40% at 100k, 54.29% at 200k, 61.31% at 500k, 63.33% at 1M, all without real images in training).</li>
<li><strong>Encoder-Decoder Comparison</strong> (at 90:10 mix with 1M images):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Encoder</th>
          <th style="text-align: left">Decoder</th>
          <th style="text-align: left">Avg. Exact Match (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">93.81</td>
      </tr>
      <tr>
          <td style="text-align: left">ResNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left">94.76</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">LSTM</td>
          <td style="text-align: left">96.31</td>
      </tr>
      <tr>
          <td style="text-align: left">EfficientNet</td>
          <td style="text-align: left">Transformer</td>
          <td style="text-align: left"><strong>96.90</strong></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Superiority over Baselines</strong>: The model outperformed the cited CNN+LSTM baseline from ChemPix (93% vs 76% on the ChemPix test set).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Restricted atom types</strong>: The system only handles molecules composed of carbon, hydrogen, and oxygen (C-H-O), excluding nitrogen, sulfur, halogens, and other heteroatoms commonly found in organic chemistry.</li>
<li><strong>Structural complexity</strong>: Only structures with at most one ring are supported. Complex multi-ring systems and fused ring structures are not covered.</li>
<li><strong>Dataset availability</strong>: The real hand-drawn dataset (2,598 images) is not publicly released and is only available upon request from the corresponding author.</li>
<li><strong>Future directions</strong>: The authors suggest expanding to more heteroatoms, complex ring structures, and applications in automated grading of chemistry exams.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/a-die/hdr-DeepLearning">hdr-DeepLearning</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation in PyTorch</td>
      </tr>
      <tr>
          <td style="text-align: left">Paper</td>
          <td style="text-align: left">Publication</td>
          <td style="text-align: left">CC-BY-4.0</td>
          <td style="text-align: left">Open access via Nature</td>
      </tr>
  </tbody>
</table>
<p>The real hand-drawn dataset (2,598 images) is available upon request from the corresponding author, not publicly downloadable. The synthetic data generation pipeline is described in detail but relies on modified RDKit source code, which is included in the repository.</p>
<h3 id="data">Data</h3>
<p>The study utilizes a combination of collected SMILES data, real hand-drawn images, and generated synthetic images.</p>
<ul>
<li><strong>Source Data</strong>: SMILES codes collected from PubChem, ZINC, <a href="/notes/chemistry/datasets/gdb-11/">GDB-11</a>, and <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>. Filtered for C, H, O atoms and max 1 ring.</li>
<li><strong>Real Dataset</strong>: 670 selected SMILES codes drawn by multiple volunteers, totaling <strong>2,598 images</strong>.</li>
<li><strong>Synthetic Dataset</strong>: Generated up to <strong>1,000,000 images</strong> using the pipeline below.</li>
<li><strong>Training Mix</strong>: The optimal training set used 1 million images with a <strong>90:10 ratio</strong> of synthetic to real images.</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset Type</th>
          <th style="text-align: left">Source</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Real</strong></td>
          <td style="text-align: left">Volunteer Drawings</td>
          <td style="text-align: left">2,598 images</td>
          <td style="text-align: left">Used for mixed training and testing</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Synthetic</strong></td>
          <td style="text-align: left">Generated</td>
          <td style="text-align: left">100k - 1M</td>
          <td style="text-align: left">Generated via modified RDKit + OpenCV augmentation/degradation; optionally enhanced with Stable Diffusion</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>Synthetic Image Generation Pipeline</strong> is critical for reproduction:</p>
<ol>
<li><strong>RDKit Modification</strong>: Modify the RDKit source code to randomize drawing parameters (bonds, character width, length, and bond angles).</li>
<li><strong>Augmentation (OpenCV)</strong>: Apply sequence: Resize ($p=0.5$), Blur ($p=0.4$), Erode/Dilate ($p=0.2$), Distort ($p=0.8$), Flip ($p=0.5$), Affine ($p=0.7$).</li>
<li><strong>Degradation</strong>: Apply sequence: Salt+pepper noise ($p=0.1$), Contrast ($p=0.7$), Sharpness ($p=0.5$), Invert ($p=0.3$).</li>
<li><strong>Background Addition</strong>: Random backgrounds are augmented (Crop, Distort, Flip) and added to the molecular image to prevent background overfitting.</li>
<li><strong>Diffusion Enhancement</strong>: Stable Diffusion (v1-4) is used for image-to-image enhancement to better simulate hand-drawn styles (prompt: &ldquo;A pencil sketch of [Formula]&hellip; without charge distribution&rdquo;).</li>
</ol>
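<p>The augmentation and degradation steps share one pattern: each transform fires independently with its own probability. A minimal sketch of that chain, with placeholder transforms standing in for the paper&rsquo;s OpenCV operations:</p>

```python
import random

def apply_pipeline(image, steps, rng=None):
    """Apply each (transform, probability) pair in order.

    Mirrors the paper's design: every transform fires independently
    with its own probability p.
    """
    rng = rng or random.Random()
    for transform, p in steps:
        if rng.random() < p:
            image = transform(image)
    return image

# Placeholder transforms standing in for the OpenCV operations;
# here they just record which steps fired.
blur = lambda img: img + ["blur"]
distort = lambda img: img + ["distort"]
flip = lambda img: img + ["flip"]

# Probabilities taken from the augmentation step above.
steps = [(blur, 0.4), (distort, 0.8), (flip, 0.5)]
out = apply_pipeline([], steps, rng=random.Random(42))
```

<p>In the real pipeline each transform would take and return an image array; the control flow is unchanged.</p>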
<h3 id="models">Models</h3>
<p>The system uses an encoder-decoder architecture:</p>
<ul>
<li><strong>Encoder</strong>: <strong>EfficientNet</strong> (pre-trained on ImageNet). The last layer is removed, and features are extracted into a NumPy array.</li>
<li><strong>Decoder</strong>: <strong>Transformer</strong>. Utilizes self-attention to generate the SMILES sequence. Chosen over LSTM for better handling of long-range dependencies.</li>
<li><strong>Output</strong>: Canonical SMILES string.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: <strong>Exact Match (EM)</strong>. A strict binary evaluation checking whether the complete generated SMILES perfectly replicates the target string.</li>
<li><strong>Other Metrics</strong>: <strong>Levenshtein Distance</strong> measures edit-level character proximity, while the <strong>Tanimoto coefficient</strong> evaluates structural similarity based on chemical fingerprints. Both were monitored during validation ablation runs.</li>
</ul>
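<p>The two string-level metrics are straightforward to compute without any chemistry toolkit (the Tanimoto coefficient additionally requires RDKit fingerprints, omitted here). A sketch:</p>

```python
def exact_match(pred: str, target: str) -> bool:
    """Strict binary check: the generated SMILES must equal the target."""
    return pred == target

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# One substitution separates ethanol from ethylamine as strings.
d = levenshtein("CCO", "CCN")
```

<p>Note that a small Levenshtein distance does not imply chemical similarity, which is why the Tanimoto coefficient over fingerprints is tracked alongside it.</p>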
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Baseline (CNN+LSTM)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Exact Match</strong></td>
          <td style="text-align: left"><strong>96.90%</strong></td>
          <td style="text-align: left">76%</td>
          <td style="text-align: left">Tested on the provided test set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>CPU</strong>: Intel(R) Xeon(R) Gold 6130 (40 GB RAM).</li>
<li><strong>GPU</strong>: NVIDIA Tesla V100 (32 GB video memory).</li>
<li><strong>Framework</strong>: PyTorch 1.9.1.</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Optimizer: Adam (learning rate 1e-4).</li>
<li>Batch size: 32.</li>
<li>Epochs: 100.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, H., Liu, W., Tao, J., et al. (2024). ChemReco: automated recognition of hand-drawn carbon-hydrogen-oxygen structures using deep learning. <em>Scientific Reports</em>, 14, 17126. <a href="https://doi.org/10.1038/s41598-024-67496-7">https://doi.org/10.1038/s41598-024-67496-7</a></p>
<p><strong>Publication</strong>: Scientific Reports 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/a-die/hdr-DeepLearning">Official Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ouyangChemRecoAutomatedRecognition2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemReco: Automated Recognition of Hand-Drawn Carbon--Hydrogen--Oxygen Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Hengjie and Liu, Wei and Tao, Jiajun and Luo, Yanghong and Zhang, Wanjia and Zhou, Jiayu and Geng, Shuqi and Zhang, Chengpeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{17126}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-024-67496-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AtomLenz: Atom-Level OCSR with Limited Supervision</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/atomlenz/</guid><description>Weakly supervised OCSR framework combining object detection and graph construction to recognize chemical structures from hand-drawn images using SMILES.</description><content:encoded><![CDATA[<h2 id="dual-contribution-method-and-data-resource">Dual Contribution: Method and Data Resource</h2>
<p>The paper proposes an architecture (AtomLenz) and training framework (ProbKT* + Edit-Correction) to solve the problem of Optical Chemical Structure Recognition (OCSR) in data-sparse domains. It also releases a curated, relabeled dataset of hand-drawn molecules with atom-level bounding box annotations.</p>
<h2 id="overcoming-annotation-bottlenecks-in-ocsr">Overcoming Annotation Bottlenecks in OCSR</h2>
<p>Optical Chemical Structure Recognition (OCSR) is critical for digitizing chemical literature and lab notes. However, existing methods face three main limitations:</p>
<ol>
<li><strong>Generalization Limits:</strong> They struggle with sparse or stylistically unique domains, such as hand-drawn images, where massive datasets for pretraining are unavailable.</li>
<li><strong>Annotation Cost:</strong> &ldquo;Atom-level&rdquo; methods (which detect individual atoms and bonds) require expensive bounding box annotations, which are rarely available for real-world sketch data.</li>
<li><strong>Lack of Interpretability/Localization:</strong> Pure &ldquo;Image-to-SMILES&rdquo; models (like DECIMER) work well but fail to localize the atoms or bonds in the original image, limiting human-in-the-loop review and mechanistic interpretability.</li>
</ol>
<h2 id="atomlenz-probkt-and-graph-edit-correction">AtomLenz, ProbKT*, and Graph Edit-Correction</h2>
<p>The core contribution is <strong>AtomLenz</strong>, an OCSR framework that achieves atom-level entity detection using <strong>only SMILES supervision</strong> on target domains. The authors construct an explicit object detection pipeline using Faster R-CNN trained via a composite multi-task loss that combines a multi-class log loss $L_{cls}$ for the predicted class $\hat{c}$ and a regression loss $L_{reg}$ for the predicted bounding box coordinates $\hat{b}$:</p>
<p>$$ \mathcal{L} = L_{cls}(c, \hat{c}) + L_{reg}(b, \hat{b}) $$</p>
<p>To bridge the gap between image inputs and the weakly supervised SMILES labels, the system leverages:</p>
<ul>
<li><strong>ProbKT* (Probabilistic Knowledge Transfer):</strong> Uses probabilistic logic and Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; derived from the SMILES strings, enabling backpropagation without explicit bounding boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Generates pseudo-labels by solving an optimization problem that finds the smallest edit on the predicted graph such that the corrected graph and the ground-truth SMILES graph become isomorphic, which forces fine-tuning on less frequent atom types. The combination of ProbKT* and Edit-Correction is abbreviated as <strong>EditKT</strong>*.</li>
<li><strong>ChemExpert:</strong> A chemically sound ensemble strategy that cascades predictions from multiple models (e.g., passing through DECIMER, then AtomLenz), halting at the first output that clears basic RDKit chemical validity checks.</li>
</ul>
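<p>The alignment step in ProbKT* is a minimum-cost one-to-one assignment between predicted objects and the atoms implied by the SMILES string. The paper uses the Hungarian algorithm; the brute-force search below is an equivalent illustration for tiny inputs, and the cost matrix values are invented for the example:</p>

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one assignment by exhaustive search.

    ProbKT* uses the Hungarian algorithm for this step; enumerating
    permutations gives the same optimum for the small example here.
    """
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best, best_perm

# Rows: predicted atom detections; columns: atoms implied by the SMILES.
# Entries could be, e.g., per-pair classification losses (made up here).
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.3],
]
total, perm = best_assignment(cost)  # prediction i is matched to atom perm[i]
```

<p>With the matching fixed, each detection inherits a label from its matched SMILES atom, which is what lets gradients flow without bounding-box annotations.</p>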
<h2 id="data-efficiency-and-domain-adaptation-experiments">Data Efficiency and Domain Adaptation Experiments</h2>
<p>The authors evaluated the model specifically on domain adaptation and sample efficiency, treating hand-drawn molecules as the primary low-data target distribution:</p>
<ul>
<li><strong>Pretraining:</strong> Initially trained on ~214k synthetic images from ChEMBL explicitly labeled with bounding boxes (generated via RDKit).</li>
<li><strong>Target Domain Adaptation:</strong> Fine-tuned on the Brinkhaus hand-drawn dataset (4,070 images) using purely SMILES supervision.</li>
<li><strong>Evaluation Sets:</strong>
<ul>
<li><strong>Hand-drawn test set</strong>: 1,018 images.</li>
<li><strong>ChemPix</strong>: 614 out-of-domain hand-drawn images.</li>
<li><strong>Atom Localization set</strong>: 1,000 synthetic images to evaluate precise bounding box capabilities.</li>
</ul>
</li>
<li><strong>Baselines:</strong> Compared against leading OCSR methods, including DECIMER (v2.2.0), Img2Mol, MolScribe, ChemGrapher, and OSRA.</li>
</ul>
<h2 id="state-of-the-art-ensembles-vs-standalone-limitations">State-of-the-Art Ensembles vs. Standalone Limitations</h2>
<ul>
<li><strong>SOTA Ensemble Performance:</strong> The <strong>ChemExpert</strong> module (combining AtomLenz and DECIMER) achieved state-of-the-art accuracy on both hand-drawn (63.5%) and ChemPix (51.8%) test sets.</li>
<li><strong>Data Efficiency under Bottleneck Regimes:</strong> AtomLenz effectively bypassed the massive data constraints of competing models. When all methods were retrained from scratch on the same 4,070-sample hand-drawn training set (enriched with atom-level annotations from EditKT*), AtomLenz achieved 33.8% exact accuracy, outperforming baselines like Img2Mol (0.0%), MolScribe (1.3%), and DECIMER (0.1%), illustrating its sample efficiency.</li>
<li><strong>Localization Success:</strong> The base framework achieved strong localization (mAP 0.801), a capability not provided by end-to-end transformers like DECIMER.</li>
<li><strong>Methodological Tradeoffs:</strong> While AtomLenz is highly sample efficient, its standalone performance when fine-tuned on the target domain (33.8% accuracy) underperforms fine-tuned models trained on larger datasets like DECIMER (62.2% accuracy). AtomLenz achieves state-of-the-art results primarily when deployed as part of the ChemExpert ensemble alongside DECIMER, since errors from the two approaches tend to occur on different samples, allowing them to complement each other.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz">Official Repository (AtomLenz)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Complete pipeline for AtomLenz, ProbKT*, and Graph Edit-Correction.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/molden/atomlenz/tree/main/models">Pre-trained Models</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Downloadable weights for Faster R-CNN detection backbones.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset (Brinkhaus)</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Images and SMILES used for target domain fine-tuning and evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://dx.doi.org/10.6084/m9.figshare.24599172">Relabeled Hand-drawn Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">1,417 images with bounding box annotations generated via EditKT*.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/spaces/moldenhof/atomlenz">AtomLenz Web Demo</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Interactive Hugging Face space for testing model inference.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilizes a mix of large synthetic datasets and smaller curated hand-drawn datasets.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>Synthetic ChEMBL</td>
          <td>~214,000</td>
          <td>Generated via RDKit/Indigo. Annotated with atoms, bonds, charges, stereocenters.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-drawn (Brinkhaus)</td>
          <td>4,070</td>
          <td>Used for weakly supervised adaptation (SMILES only).</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Hand-drawn Test</td>
          <td>1,018</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChemPix</td>
          <td>614</td>
          <td>Out-of-distribution hand-drawn images.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>Atom Localization</td>
          <td>1,000</td>
          <td>Synthetic images with ground truth bounding boxes.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Molecular Graph Constructor (Algorithm 1):</strong> A rule-based system to assemble the graph from detected objects:
<ol>
<li><strong>Filtering:</strong> Removes overlapping atom boxes (IoU threshold).</li>
<li><strong>Node Creation:</strong> Merges overlapping charge and stereocenter objects with their corresponding atom objects.</li>
<li><strong>Edge Creation:</strong> Iterates over bond objects; if a bond overlaps with exactly two atoms, an edge is added. If &gt;2, it selects the most probable pair.</li>
<li><strong>Validation:</strong> Checks valency constraints; removes bonds iteratively if constraints are violated.</li>
</ol>
</li>
<li><strong>Weakly Supervised Training:</strong>
<ul>
<li><strong>ProbKT*:</strong> Uses Hungarian matching to align predicted objects with the &ldquo;ground truth&rdquo; implied by the SMILES string, allowing backpropagation without explicit boxes.</li>
<li><strong>Graph Edit-Correction:</strong> Finds the smallest edit on the predicted graph such that the corrected and true SMILES graphs become isomorphic, then uses the correction to generate pseudo-labels for retraining.</li>
</ul>
</li>
</ul>
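<p>The edge-creation step of Algorithm 1 can be sketched with plain IoU tests. This simplified version only handles the exactly-two-overlaps case, omitting the paper&rsquo;s most-probable-pair tie-breaking and the valency validation pass; box coordinates are invented for the example:</p>

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def connect_bonds(atom_boxes, bond_boxes, thresh=0.0):
    """Add an edge whenever a bond box overlaps exactly two atom boxes."""
    edges = []
    for bond in bond_boxes:
        hits = [i for i, atom in enumerate(atom_boxes) if iou(bond, atom) > thresh]
        if len(hits) == 2:
            edges.append(tuple(hits))
    return edges

# Two atom boxes joined by one bond box that overlaps both.
atoms = [(0, 0, 10, 10), (20, 0, 30, 10)]
bonds = [(8, 2, 22, 8)]
edges = connect_bonds(atoms, bonds)
```

<p>The full algorithm would then run the valency check and iteratively drop edges that violate it.</p>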
<h3 id="models">Models</h3>
<ul>
<li><strong>Object Detection Backbone:</strong> <strong>Faster R-CNN</strong>.
<ul>
<li>Four distinct models are trained for different entity types: Atoms ($O^a$), Bonds ($O^b$), Charges ($O^c$), and Stereocenters ($O^s$).</li>
<li><strong>Loss Function:</strong> Multi-task loss combining Multi-class Log Loss ($L_{cls}$) and Regression Loss ($L_{reg}$).</li>
</ul>
</li>
<li><strong>ChemExpert:</strong> An ensemble wrapper that prioritizes models based on user preference (e.g., DECIMER first, then AtomLenz). It accepts the first prediction that passes RDKit chemical validity checks.</li>
</ul>
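<p>The ChemExpert cascade reduces to a priority-ordered loop over models with a validity gate. A minimal sketch, where the model stubs and the <code>is_valid</code> predicate are placeholders for the real DECIMER/AtomLenz calls and RDKit&rsquo;s SMILES-parsing check:</p>

```python
def chem_expert(image, models, is_valid):
    """Query models in priority order; return the first prediction
    that passes the validity check (RDKit parsing in the paper).

    `models` is a list of callables image -> SMILES (or None on failure);
    `is_valid` stands in for an RDKit MolFromSmiles check.
    """
    for model in models:
        smiles = model(image)
        if smiles is not None and is_valid(smiles):
            return smiles
    return None

# Stub models: the first returns an unparsable string, the second a valid one.
decimer_stub = lambda img: "C1CC"    # pretend-invalid (unclosed ring)
atomlenz_stub = lambda img: "CCO"    # pretend-valid ethanol
valid_set = {"CCO", "C", "CC"}       # stand-in for an RDKit validity check
result = chem_expert(None, [decimer_stub, atomlenz_stub], lambda s: s in valid_set)
```

<p>Because DECIMER and AtomLenz tend to fail on different samples, this simple first-valid-wins cascade is enough to lift the ensemble above either model alone.</p>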
<h3 id="evaluation">Evaluation</h3>
<p>Primary metrics focused on structural correctness and localization accuracy.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Hand-drawn)</th>
          <th>Baseline (DECIMER FT)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (T=1)</strong></td>
          <td>33.8% (AtomLenz+EditKT*)</td>
          <td>62.2%</td>
          <td>Exact ECFP6 fingerprint match.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto Sim.</strong></td>
          <td>0.484</td>
          <td>0.727</td>
          <td>Average similarity.</td>
      </tr>
      <tr>
          <td><strong>mAP</strong></td>
          <td>0.801</td>
          <td>N/A</td>
          <td>Localization accuracy (IoU 0.05-0.35).</td>
      </tr>
      <tr>
          <td><strong>Ensemble Acc.</strong></td>
          <td><strong>63.5%</strong></td>
          <td>62.2%</td>
          <td>ChemExpert (DECIMER + AtomLenz).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Experiments utilized the Flemish Supercomputer Center (VSC) resources.</li>
<li><strong>Note:</strong> Specific GPU models (e.g., A100/V100) are not explicitly detailed in the text, but Faster R-CNN training is standard on consumer or enterprise GPUs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., De Brouwer, E., Arany, Á., &amp; Moreau, Y. (2024). Atom-Level Optical Chemical Structure Recognition with Limited Supervision. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2024.</p>
<p><strong>Publication venue/year</strong>: CVPR 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molden/atomlenz">Official Repository</a></li>
<li><a href="https://dx.doi.org/10.6084/m9.figshare.24599412">Hand-drawn Dataset on Figshare</a></li>
</ul>
<p><strong>BibTeX</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{oldenhofAtomLevelOpticalChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Atom-Level Optical Chemical Structure Recognition with Limited Supervision}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Oldenhof, Martijn and De Brouwer, Edward and Arany, {\&#39;A}d{\&#39;a}m and Moreau, Yves}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2404.01743}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Structure Recognition with RCGD</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hu-handwritten-rcgd-2023/</guid><description>An end-to-end framework (RCGD) and unambiguous markup language (SSML) for recognizing complex handwritten chemical structures with guided graph traversal.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-framework">Contribution and Methodological Framework</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architectural framework (<strong>RCGD</strong>) and a new representation syntax (<strong>SSML</strong>) to solve the specific problem of handwritten chemical structure recognition.</li>
<li><strong>Resource</strong>: It introduces a new benchmark dataset, <strong>EDU-CHEMC</strong>, containing 50,000 handwritten images to address the lack of public data in this domain.</li>
</ul>
<h2 id="the-ambiguity-of-handwritten-chemical-structures">The Ambiguity of Handwritten Chemical Structures</h2>
<p>Recognizing handwritten chemical structures is significantly harder than printed ones due to:</p>
<ol>
<li><strong>Inherent Ambiguity</strong>: Handwritten atoms and bonds vary greatly in appearance.</li>
<li><strong>Projection Complexity</strong>: Converting 2D projected layouts (like Natta or Fischer projections) into linear strings is difficult.</li>
<li><strong>Limitations of Existing Formats</strong>: Standard formats like SMILES require domain knowledge (valence rules) and have a high semantic gap with the visual image. They often fail to represent &ldquo;invalid&rdquo; structures commonly found in educational/student work.</li>
</ol>
<h2 id="bridging-the-semantic-gap-with-ssml-and-rcgd">Bridging the Semantic Gap with SSML and RCGD</h2>
<p>The paper introduces two core contributions to bridge the semantic gap between image and markup:</p>
<ol>
<li>
<p><strong>Structure-Specific Markup Language (SSML)</strong>: An extension of Chemfig that provides an unambiguous, visual-based graph representation. Unlike SMILES, it describes <em>how to draw</em> the molecule step-by-step, making it easier for models to learn visual alignments. It supports &ldquo;reconnection marks&rdquo; to handle cyclic structures explicitly.</p>
</li>
<li>
<p><strong>Random Conditional Guided Decoder (RCGD)</strong>: A decoder that treats recognition as a graph traversal problem. It introduces three novel mechanisms:</p>
<ul>
<li><strong>Conditional Attention Guidance</strong>: Uses branch angle directions to guide the attention mechanism, preventing the model from getting lost in complex structures.</li>
<li><strong>Memory Classification</strong>: A module that explicitly stores and classifies &ldquo;unexplored&rdquo; branch points to handle ring closures (reconnections).</li>
<li><strong>Path Selection</strong>: A training strategy that randomly samples traversal paths to prevent overfitting to a specific serialization order.</li>
</ul>
</li>
</ol>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p><strong>Datasets</strong>:</p>
<ul>
<li><strong>Mini-CASIA-CSDB</strong> (Printed): A subset of 97,309 printed molecular structure images, upscaled to $500 \times 500$ resolution.</li>
<li><strong>EDU-CHEMC</strong> (Handwritten): A new dataset of 52,987 images collected from educational settings (cameras, scanners, screens), including erroneous/non-existent structures.</li>
</ul>
<p><strong>Baselines</strong>:</p>
<ul>
<li>Compared against standard <strong>String Decoders (SD)</strong> (based on DenseWAP), tested with both SMILES and SSML on Mini-CASIA-CSDB and exclusively with SSML on EDU-CHEMC.</li>
<li>Compared against <strong>BTTR</strong> and <strong>ABM</strong> (recent mathematical expression recognition models) adapted for the chemical structure task, both using SSML on EDU-CHEMC.</li>
<li>On Mini-CASIA-CSDB, also compared against <strong>WYGIWYS</strong> (a SMILES-based string decoder at 300x300 resolution).</li>
</ul>
<p><strong>Ablation Studies</strong>:</p>
<ul>
<li>Evaluated the impact of removing Path Selection (PS) and Memory Classification (MC) mechanisms on EDU-CHEMC.</li>
<li>Tested robustness to image rotation ($180^{\circ}$) on Mini-CASIA-CSDB.</li>
</ul>
<h2 id="recognition-performance-and-robustness">Recognition Performance and Robustness</h2>
<ul>
<li><strong>Superiority of SSML</strong>: Models trained with SSML significantly outperformed those trained with SMILES (92.09% vs 81.89% EM on printed data) due to reduced semantic gap.</li>
<li><strong>Best Performance</strong>: RCGD achieved the highest Exact Match (EM) scores on both datasets:
<ul>
<li><strong>Mini-CASIA-CSDB</strong>: 95.01% EM.</li>
<li><strong>EDU-CHEMC</strong>: 62.86% EM.</li>
</ul>
</li>
<li><strong>EDU-CHEMC Baselines</strong>: On the handwritten dataset, SD (DenseWAP) achieved 61.35% EM, outperforming both BTTR (58.21% EM) and ABM (58.78% EM). The authors note that BTTR and ABM&rsquo;s reverse training mode, which helps in regular formula recognition, does not transfer well to graph-structured molecular data.</li>
<li><strong>Ablation Results</strong> (Table 5, EDU-CHEMC): Removing Path Selection alone dropped EM from 62.86% to 62.15%. Removing both Path Selection and Memory Classification dropped EM further to 60.31%, showing that memory classification has a larger impact.</li>
<li><strong>Robustness</strong>: RCGD showed minimal performance drop (0.85%) on rotated images compared to SMILES-based methods (10.36% drop). The SD with SSML dropped by 2.19%, confirming that SSML itself improves rotation invariance.</li>
<li><strong>Educational Utility</strong>: The method can recognize and reconstruct chemically invalid structures (e.g., a Carbon atom with 5 bonds), making it applicable for correcting and revising handwritten answers in chemistry education.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>1. EDU-CHEMC (Handwritten)</strong></p>
<ul>
<li><strong>Total Size</strong>: 52,987 images.</li>
<li><strong>Splits</strong>: Training (48,998), Validation (999), Test (2,992).</li>
<li><strong>Characteristics</strong>: Real-world educational data, mixture of isolated molecules and reaction equations, includes invalid chemical structures.</li>
</ul>
<p><strong>2. Mini-CASIA-CSDB (Printed)</strong></p>
<ul>
<li><strong>Total Size</strong>: 97,309 images.</li>
<li><strong>Splits</strong>: Training (80,781), Validation (8,242), Test (8,286).</li>
<li><strong>Preprocessing</strong>: Original $300 \times 300$ images were upscaled to $500 \times 500$ RGB to resolve blurring issues.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SSML Generation</strong></p>
<p>To convert a molecular graph to SSML:</p>
<ol>
<li><strong>Traverse</strong>: Start from the left-most atom.</li>
<li><strong>Bonds/Atoms</strong>: Output atom text and bond format <code>&lt;bond&gt;[:&lt;angle&gt;]</code>.</li>
<li><strong>Branches</strong>: At branch points, use phantom symbols <code>(</code> and <code>)</code> to enclose branches, ordered by ascending bond angle.</li>
<li><strong>Reconnections</strong>: Use <code>?[tag]</code> and <code>?[tag, bond]</code> to mark start/end of ring closures.</li>
</ol>
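<p>The traversal above can be sketched as a depth-first walk over a toy molecular graph. This is an illustrative reconstruction, not the authors' implementation; the adjacency format, the angle values, and the exact token spellings are assumptions, and ring reconnections (<code>?[tag]</code>) are omitted for brevity.</p>

```python
def to_ssml(graph, start):
    """Emit SSML-like tokens via DFS; ring closures (?[tag]) omitted."""
    tokens, visited = [], set()

    def walk(atom):
        visited.add(atom)
        symbol, bonds = graph[atom]
        tokens.append(symbol)
        # Branches are ordered by ascending bond angle, per the paper.
        out = [b for b in bonds if b[0] not in visited]
        out.sort(key=lambda b: b[2])
        for i, (nbr, bond, angle) in enumerate(out):
            # Enclose all but the last branch in phantom parentheses.
            parenthesize = len(out) > 1 and i < len(out) - 1
            if parenthesize:
                tokens.append("(")
            tokens.append(f"{bond}:{angle}")
            walk(nbr)
            if parenthesize:
                tokens.append(")")

    walk(start)
    return " ".join(tokens)

# Toy chain with one branch: {atom_id: (symbol, [(neighbor, bond, angle), ...])}
graph = {
    0: ("C", [(1, "-", 0), (2, "-", 60)]),
    1: ("C", [(0, "-", 180)]),
    2: ("C", [(0, "-", 240), (3, "-", 0)]),
    3: ("C", [(2, "-", 180)]),
}
print(to_ssml(graph, 0))  # C ( -:0 C ) -:60 C -:0 C
```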
<p><strong>2. RCGD Specifics</strong></p>
<ul>
<li><strong>RCGD-SSML</strong>: Modified version of SSML for the decoder. Removes <code>(</code> <code>)</code> delimiters; adds <code>\eob</code> (end of branch). Maintains a dynamic <strong>Branch Angle Set ($M$)</strong>.</li>
<li><strong>Path Selection</strong>: During training, when multiple branches exist in $M$, the model randomly selects one to traverse next. During inference, it uses beam search to score candidate paths.</li>
<li><strong>Loss Function</strong>:
$$
\begin{aligned}
L_{\text{total}} = L_{\text{ce}} + L_{\text{bc}}
\end{aligned}
$$
<ul>
<li>$L_{\text{ce}}$: Cross-entropy loss for character sequence generation.</li>
<li>$L_{\text{bc}}$: Multi-label classification loss for the memory module (predicting reconnection bond types for stored branch states).</li>
</ul>
</li>
</ul>
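<p>A minimal numeric sketch of the combined loss, assuming toy probability values: the sequence term is a standard cross-entropy over gold tokens, and the memory term is a multi-label binary cross-entropy over candidate reconnection bond types. The per-step probabilities below are invented for illustration.</p>

```python
import math

def cross_entropy(gold_probs):
    """Sequence CE: -sum of log p(gold token) over timesteps."""
    return -sum(math.log(p) for p in gold_probs)

def multilabel_bce(pred, target):
    """Multi-label BCE for the memory module's bond-type predictions."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(pred, target))

# Toy values: probs of the gold character at each step, and predicted
# probabilities for two candidate reconnection bond types (labels 1, 0).
l_ce = cross_entropy([0.9, 0.8, 0.95])
l_bc = multilabel_bce([0.9, 0.1], [1, 0])
l_total = l_ce + l_bc
print(round(l_total, 4))  # 0.5905
```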
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>: DenseNet</p>
<ul>
<li><strong>Structure</strong>: 3 dense blocks.</li>
<li><strong>Growth Rate</strong>: 24.</li>
<li><strong>Depth</strong>: 32 per block.</li>
<li><strong>Output</strong>: High-dimensional feature map $x \in \mathbb{R}^{d_x \times h \times w}$.</li>
</ul>
<p><strong>Decoder</strong>: GRU with Attention</p>
<ul>
<li><strong>Hidden State Dimension</strong>: 256.</li>
<li><strong>Embedding Dimension</strong>: 256.</li>
<li><strong>Attention Projection</strong>: 128.</li>
<li><strong>Memory Classification Projection</strong>: 256.</li>
</ul>
<p><strong>Training Config</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: Adam.</li>
<li><strong>Learning Rate</strong>: 2e-4 with multi-step decay (gamma 0.5).</li>
<li><strong>Dropout</strong>: 15%.</li>
<li><strong>Strategy</strong>: Teacher forcing is used when computing validation metrics for checkpoint selection.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Exact Match (EM)</strong>: Percentage of samples where the predicted graph structure perfectly matches the label. For SMILES, string comparison; for SSML, converted to graph for isomorphism check.</li>
<li><strong>Structure EM</strong>: Auxiliary metric for samples with mixed content (text + molecules), counting samples where <em>all</em> molecular structures are correct.</li>
</ul>
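<p>The SSML variant of Exact Match requires a graph-isomorphism check rather than string equality. A minimal sketch, assuming a simple (labels, edge-set) graph encoding and a brute-force matcher (fine for small molecules; the paper's actual checker is not released):</p>

```python
from itertools import permutations

def isomorphic(g1, g2):
    """Brute-force labeled-graph isomorphism for small molecular graphs.
    A graph is (labels, edges): atom symbols plus frozenset node pairs."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    if sorted(labels1) != sorted(labels2) or len(edges1) != len(edges2):
        return False
    n = len(labels1)
    for perm in permutations(range(n)):
        if any(labels1[i] != labels2[perm[i]] for i in range(n)):
            continue  # label-preserving mappings only
        mapped = {frozenset((perm[a], perm[b]))
                  for a, b in (tuple(e) for e in edges1)}
        if mapped == edges2:
            return True
    return False

def exact_match(preds, golds):
    """EM: fraction of predictions whose graph matches its label."""
    hits = sum(isomorphic(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)

# The same C-C-O triangle written with two different node orderings.
g_a = (["C", "C", "O"], {frozenset((0, 1)), frozenset((1, 2)), frozenset((0, 2))})
g_b = (["O", "C", "C"], {frozenset((0, 1)), frozenset((1, 2)), frozenset((0, 2))})
print(exact_match([g_a], [g_b]))  # 1.0
```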
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">EDU-CHEMC</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Dataset annotations and download links (actual data hosted on Google Drive)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing Components</strong>:</p>
<ul>
<li>No training or inference code is publicly released; only the dataset is available.</li>
<li>Pre-trained model weights are not provided.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, J., Wu, H., Chen, M., Liu, C., Wu, J., Yin, S., Yin, B., Yin, B., Liu, C., Du, J., &amp; Dai, L. (2023). Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder. <em>Proceedings of the 31st ACM International Conference on Multimedia</em> (pp. 8114-8124). <a href="https://doi.org/10.1145/3581783.3612573">https://doi.org/10.1145/3581783.3612573</a></p>
<p><strong>Publication</strong>: ACM Multimedia 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/iFLYTEK-CV/EDU-CHEMC">GitHub Repository / EDU-CHEMC Dataset</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{huHandwrittenChemicalStructure2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Handwritten Chemical Structure Image to Structure-Specific Markup Using Random Conditional Guided Decoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st ACM International Conference on Multimedia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hu, Jinshui and Wu, Hao and Chen, Mingjun and Liu, Chenyu and Wu, Jiajia and Yin, Shi and Yin, Baocai and Yin, Bing and Liu, Cong and Du, Jun and Dai, Lirong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{8114--8124}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Ottawa ON Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3581783.3612573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{979-8-4007-0108-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemPix: Hand-Drawn Hydrocarbon Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/</guid><description>Deep learning framework using CNN-LSTM image captioning to convert hand-drawn hydrocarbon structures into SMILES strings with 76% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-core-contribution">Paper Classification and Core Contribution</h2>
<p>This is primarily a <strong>Method</strong> paper, with a secondary contribution as a <strong>Resource</strong> paper.</p>
<p>The paper&rsquo;s core contribution is the <strong>ChemPix architecture and training strategy</strong> using neural image captioning (CNN-LSTM) to convert hand-drawn chemical structures to SMILES. The extensive ablation studies on synthetic data generation (augmentation, degradation, backgrounds) and ensemble learning strategies confirm the methodological focus. The secondary resource contribution includes releasing a curated dataset of hand-drawn hydrocarbons and code for generating synthetic training data.</p>
<h2 id="the-structural-input-bottleneck-in-computational-chemistry">The Structural Input Bottleneck in Computational Chemistry</h2>
<p>Inputting molecular structures into computational chemistry software for quantum calculations is often a bottleneck, requiring domain expertise and cumbersome manual entry in drawing software. While optical chemical structure recognition (OCSR) tools exist, they typically struggle with the noise and variability of hand-drawn sketches. There is a practical need for a tool that allows chemists to simply photograph a hand-drawn sketch and immediately convert it into a machine-readable format (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>), making computational workflows more accessible.</p>
<h2 id="cnn-lstm-image-captioning-and-synthetic-generalization">CNN-LSTM Image Captioning and Synthetic Generalization</h2>
<ol>
<li><strong>Image Captioning Paradigm</strong>: The authors treat the problem as <strong>neural image captioning</strong>, using an encoder-decoder (CNN-LSTM) framework to &ldquo;translate&rdquo; an image directly to a SMILES string. This avoids the complexity of explicit atom/bond detection and graph assembly.</li>
<li><strong>Synthetic Data Engineering</strong>: The paper introduces a rigorous synthetic data generation pipeline that transforms clean RDKit-generated images into &ldquo;pseudo-hand-drawn&rdquo; images via randomized backgrounds, degradation, and heavy augmentation. This allows the model to achieve &gt;50% accuracy on real hand-drawn data without ever seeing it during training.</li>
<li><strong>Ensemble Uncertainty Estimation</strong>: The method utilizes a &ldquo;committee&rdquo; (ensemble) of networks to improve accuracy and estimate confidence based on vote agreement, providing users with reliability indicators for predictions.</li>
</ol>
<h2 id="extensive-ablation-and-real-world-evaluation">Extensive Ablation and Real-World Evaluation</h2>
<ol>
<li><strong>Ablation Studies on Data Pipeline</strong>: The authors trained models on datasets generated at different stages of the pipeline (Clean RDKit $\rightarrow$ Augmented $\rightarrow$ Backgrounds $\rightarrow$ Degraded) to quantify the value of each transformation in bridging the synthetic-to-real domain gap.</li>
<li><strong>Sample Size Scaling</strong>: They analyzed performance scaling by training on synthetic dataset sizes ranging from 10,000 to 500,000 images to understand data requirements.</li>
<li><strong>Real-world Validation</strong>: The model was evaluated on a held-out test set of hand-drawn images collected via a custom web app, providing genuine out-of-distribution testing.</li>
<li><strong>Fine-tuning Experiments</strong>: Comparisons of synthetic-only training versus fine-tuning with a small fraction of real hand-drawn data to assess the value of limited real-world supervision.</li>
</ol>
<h2 id="state-of-the-art-hand-drawn-ocsr-performance">State-of-the-Art Hand-Drawn OCSR Performance</h2>
<ol>
<li>
<p><strong>Pipeline Efficacy</strong>: Augmentation and image degradation were the most critical factors for generalization, achieving over 50% accuracy on hand-drawn data when training with 500,000 synthetic images. Adding backgrounds had a negligible effect on accuracy compared to degradation.</p>
</li>
<li>
<p><strong>State-of-the-Art Performance</strong>: The final ensemble model (5 out of 17 trained NNs, selected for achieving &gt;50% individual accuracy) achieved <strong>76% accuracy</strong> (top-1) and <strong>85.5% accuracy</strong> (top-3) on the hand-drawn test set, a significant improvement over the best single model&rsquo;s 67.5%.</p>
</li>
<li>
<p><strong>Synthetic Generalization</strong>: A model trained on 500,000 synthetic images achieved &gt;50% accuracy on real hand-drawn data without any fine-tuning, validating the synthetic data generation strategy as a viable alternative to expensive manual labeling.</p>
</li>
<li>
<p><strong>Ensemble Benefits</strong>: The voting committee approach improved accuracy and provided interpretable uncertainty estimates through vote distributions. When all five committee members agree ($V=5$), the confidence value reaches 98%.</p>
</li>
</ol>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations of the current system:</p>
<ul>
<li><strong>Hydrocarbons only</strong>: The model is restricted to hydrocarbon structures and does not handle heteroatoms or functional groups.</li>
<li><strong>No conjoined rings</strong>: Molecules with multiple conjoined rings are excluded due to limitations of RDKit&rsquo;s image generation, which depicts bridges differently from standard chemistry drawing conventions.</li>
<li><strong>Resonance hybrid notation</strong>: The network struggles with benzene rings drawn in the resonance hybrid style (with a circle) compared to the Kekulé structure, since the RDKit training images use exclusively Kekulé representations.</li>
<li><strong>Challenging backgrounds</strong>: Lined and squared paper increase recognition difficulty, and structures bleeding through from the opposite side of the page can confuse the network.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study relies on two primary data sources: a massive synthetic dataset generated procedurally and a smaller collected dataset of real drawings.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic (RDKit)</td>
          <td>500,000 images</td>
          <td>Generated via RDKit with &ldquo;heavy&rdquo; augmentation: rotation ($0-360°$), blur, salt+pepper noise, and background texture addition.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-Drawn (Real)</td>
          <td>613 images</td>
          <td>Crowdsourced via a web app from over 100 unique users; split into 200-image test set and 413 training/validation images.</td>
      </tr>
      <tr>
          <td><strong>Backgrounds</strong></td>
          <td>Texture Images</td>
          <td>1,052 images</td>
          <td>A pool of unlabeled texture photos (paper, desks, shadows) used to generate synthetic backgrounds.</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation Parameters</strong>:</p>
<ul>
<li><strong>Augmentations</strong>: Rotation, Resize ($200-300px$), Blur, Dilate, Erode, Aspect Ratio, Affine transform ($\pm 20px$), Contrast, Quantize, Sharpness</li>
<li><strong>Backgrounds</strong>: Randomly translated $\pm 100$ pixels and reflected</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ensemble Voting</strong><br>
A committee of networks casts votes for the predicted SMILES string. The final prediction is the one with the highest vote count. Validity of SMILES is checked using RDKit.</p>
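<p>The voting scheme can be sketched as a majority vote with an agreement-based confidence score. The validity predicate below stands in for an RDKit parse check (the injectable <code>is_valid</code> callable is an assumption for self-containment, not the paper's interface):</p>

```python
from collections import Counter

def committee_predict(predictions, is_valid=lambda s: True):
    """Majority vote over committee SMILES predictions.
    Returns (winner, votes, agreement fraction). `is_valid` stands in
    for an RDKit validity check, e.g. Chem.MolFromSmiles(s) is not None."""
    valid = [p for p in predictions if is_valid(p)]
    if not valid:
        return None, 0, 0.0
    (winner, votes), = Counter(valid).most_common(1)
    return winner, votes, votes / len(predictions)

preds = ["CCO", "CCO", "CC", "CCO", "CCC"]
winner, votes, agreement = committee_predict(preds)
print(winner, votes, agreement)  # CCO 3 0.6
```

<p>Higher agreement maps to higher empirical confidence; per the paper, unanimous agreement among the five members corresponds to 98% confidence.</p>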
<p><strong>Beam Search</strong><br>
Used in the decoding layer with a beam width of $k=5$ to explore multiple potential SMILES strings. It approximates the sequence $\mathbf{\hat{y}}$ that maximizes the joint probability:</p>
<p>$$ \mathbf{\hat{y}} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
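<p>A minimal beam-search sketch over a toy next-token model. The real decoder conditions on image features $\mathbf{x}$; here <code>step_probs</code> is a hypothetical stand-in that ignores the image, and the token distribution is invented for illustration:</p>

```python
import math

def beam_search(step_probs, k=5, eos="<eos>", max_len=10):
    """Keep the k highest log-probability prefixes at each step.
    `step_probs(prefix)` returns {token: prob} for the next token."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_probs(seq).items():
                cand = (seq + (tok,), score + math.log(p))
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    best = max(finished + beams, key=lambda c: c[1])
    return "".join(t for t in best[0] if t != eos)

# Toy model that favours emitting "CC" and then ending the sequence.
def step_probs(prefix):
    if len(prefix) < 2:
        return {"C": 0.7, "O": 0.3}
    return {"<eos>": 0.9, "C": 0.1}

print(beam_search(step_probs, k=5))  # CC
```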
<p><strong>Optimization</strong>:</p>
<ul>
<li>
<p><strong>Optimizer</strong>: Adam</p>
</li>
<li>
<p><strong>Learning Rate</strong>: $1 \times 10^{-4}$</p>
</li>
<li>
<p><strong>Batch Size</strong>: 20</p>
</li>
<li>
<p><strong>Loss Function</strong>: Cross-entropy loss across the sequence of $T$ tokens, computed as:</p>
<p>$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p>where $\mathbf{x}$ is the image representation and $y_t$ is the target SMILES character at step $t$. For validation, this loss is reported as perplexity.</p>
</li>
</ul>
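<p>To make the loss-to-perplexity relationship concrete: perplexity is the exponentiated per-token cross-entropy. A small sketch with invented per-step probabilities of the ground-truth characters:</p>

```python
import math

def sequence_loss_and_perplexity(gold_probs):
    """Cross-entropy over a token sequence and its per-token perplexity."""
    loss = -sum(math.log(p) for p in gold_probs)
    perplexity = math.exp(loss / len(gold_probs))
    return loss, perplexity

# Toy per-step probabilities assigned to the ground-truth SMILES characters.
loss, ppl = sequence_loss_and_perplexity([0.9, 0.5, 0.8])
print(round(ppl, 3))  # 1.406
```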
<h3 id="models">Models</h3>
<p>The architecture is a standard image captioning model (Show, Attend and Tell style) adapted for chemical structures.</p>
<p><strong>Encoder (CNN)</strong>:</p>
<ul>
<li><strong>Input</strong>: 256x256 pixel PNG images</li>
<li><strong>Structure</strong>: 4 blocks of Conv2D + MaxPool
<ul>
<li>Block 1: 64 filters, (3,3) kernel</li>
<li>Block 2: 128 filters, (3,3) kernel</li>
<li>Block 3: 256 filters, (3,3) kernel</li>
<li>Block 4: 512 filters, (3,3) kernel</li>
</ul>
</li>
<li><strong>Activation</strong>: ReLU throughout</li>
</ul>
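<p>A quick shape check for the encoder, assuming &lsquo;same&rsquo; convolution padding and $2 \times 2$ max-pooling (both assumptions; the summary does not state them): each block halves the spatial resolution, so a $256 \times 256$ input yields a $16 \times 16 \times 512$ feature map after four blocks.</p>

```python
def encoder_output_shape(size=256, blocks=4, filters=(64, 128, 256, 512)):
    """Spatial size through Conv2D (3x3, 'same' padding assumed) + 2x2 pool."""
    for _ in range(blocks):
        size //= 2  # each max-pool halves the spatial resolution
    return size, size, filters[-1]

print(encoder_output_shape())  # (16, 16, 512)
```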
<p><strong>Decoder (LSTM)</strong>:</p>
<ul>
<li><strong>Hidden Units</strong>: 512</li>
<li><strong>Embedding Dimension</strong>: 80</li>
<li><strong>Attention</strong>: Mechanism with intermediary vector dimension of 512</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Exact SMILES match accuracy (character-by-character identity between predicted and ground truth SMILES)</li>
<li><strong>Perplexity</strong>: Used for saving model checkpoints (minimizing uncertainty)</li>
<li><strong>Top-k Accuracy</strong>: Reported for $k=1$ (76%) and $k=3$ (85.5%)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mtzgroup/ChemPixCH">ChemPixCH</a></td>
          <td>Code + Dataset</td>
          <td>Apache-2.0</td>
          <td>Official implementation with synthetic data generation pipeline and collected hand-drawn dataset</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weir, H., Thompson, K., Woodward, A., Choi, B., Braun, A., &amp; Martínez, T. J. (2021). ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning. <em>Chemical Science</em>, 12(31), 10622-10633. <a href="https://doi.org/10.1039/D1SC02957F">https://doi.org/10.1039/D1SC02957F</a></p>
<p><strong>Publication</strong>: Chemical Science 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/mtzgroup/ChemPixCH">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{weir2021chempix,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Weir, Hayley and Thompson, Keiran and Woodward, Amelia and Choi, Benjamin and Braun, Augustin and Mart{\&#39;i}nez, Todd J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{10622--10633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D1SC02957F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Ring Recognition with Neural Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/hewahi-ring-recognition-2008/</guid><description>A two-phase Classifier-Recognizer neural network pipeline for recognizing 23 types of handwritten heterocyclic chemical rings, achieving ~94% accuracy.</description><content:encoded><![CDATA[<h2 id="contribution-recognition-architecture-for-heterocyclic-rings">Contribution: Recognition Architecture for Heterocyclic Rings</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>It proposes a specific algorithmic architecture (the &ldquo;Classifier-Recognizer Approach&rdquo;) to solve a pattern recognition problem. The rhetorical structure centers on defining three variations of a method, performing ablation-like comparisons between them (Whole Image vs. Lower Part), and demonstrating superior performance metrics (~94% accuracy) for the proposed technique.</p>
<h2 id="motivation-enabling-sketch-based-chemical-search">Motivation: Enabling Sketch-Based Chemical Search</h2>
<p>The authors identify a gap in existing OCR and handwriting recognition research, which typically focuses on alphanumeric characters or whole words.</p>
<ul>
<li><strong>Missing Capability</strong>: Recognition of specific <em>heterocyclic chemical rings</em> (23 types) had not been performed previously.</li>
<li><strong>Practical Utility</strong>: Existing chemical search engines require text-based queries (names); this work enables &ldquo;backward&rdquo; search where a user can draw a ring to find its information.</li>
<li><strong>Educational/Professional Aid</strong>: Useful for chemistry departments and mobile applications where chemists can sketch formulas on screens.</li>
</ul>
<h2 id="innovation-the-classifier-recognizer-pipeline">Innovation: The Classifier-Recognizer Pipeline</h2>
<p>The core novelty is the <strong>two-phase &ldquo;Classifier-Recognizer&rdquo; architecture</strong> designed to handle the visual similarity of heterocyclic rings:</p>
<ol>
<li><strong>Phase 1 (Classifier)</strong>: A neural network classifies the ring into one of four broad categories (S, N, O, Others) based solely on the <em>upper part</em> of the image (40x15 pixels).</li>
<li><strong>Phase 2 (Recognizer)</strong>: A class-specific neural network identifies the exact ring.</li>
<li><strong>Optimization</strong>: The most successful variation (&ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo;) uses only the <em>lower part</em> of the image and <em>odd rows</em> (half-grid) to reduce input dimensionality and computation time while improving accuracy. This effectively subsamples the input grid matrix $M \in \mathbb{R}^{H \times W}$ to a reduced matrix $M_{\text{sub}}$:
$$ M_{\text{sub}} = \{\, m_{i,j} \in M \mid i \text{ is odd} \,\} $$</li>
</ol>
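<p>The odd-row subsampling amounts to simple row slicing of the binarized grid (using 1-indexed rows, as the formula does; the list-of-lists encoding is an assumption):</p>

```python
def half_grid(image):
    """Keep only the odd rows (1-indexed) of a pixel grid, halving inputs.
    A 40x40 grid becomes 20x40, i.e. 1600 -> 800 network inputs."""
    return [row for i, row in enumerate(image, start=1) if i % 2 == 1]

grid = [[0] * 40 for _ in range(40)]
sub = half_grid(grid)
print(len(sub), len(sub[0]))  # 20 40
```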
<h2 id="failed-preliminary-approaches">Failed Preliminary Approaches</h2>
<p>Before arriving at the Classifier-Recognizer architecture, the authors tried three simpler methods that all failed:</p>
<ol>
<li><strong>Ordinary NN</strong>: A single neural network with 1600 inputs (40x40 grid), 1600 hidden units, and 23 outputs. This standard approach achieved only 7% accuracy.</li>
<li><strong>Row/Column pixel counts</strong>: Using the number of black pixels per row and per column as features ($N_c + N_r$ inputs), which dramatically reduced dimensionality. This performed even worse, below 1% accuracy.</li>
<li><strong>Midline crossing count</strong>: Drawing a horizontal midline and counting the number of line crossings. This failed because the crossing count varies between writers for the same ring.</li>
</ol>
<p>These failures motivated the two-phase Classifier-Recognizer design.</p>
<h2 id="experimental-setup-and-network-variations">Experimental Setup and Network Variations</h2>
<p>The authors conducted a comparative study of three methodological variations:</p>
<ol>
<li><strong>Whole Image Recognizer</strong>: Uses the full image.</li>
<li><strong>Whole Image (Half Size Grid)</strong>: Uses only odd rows ($20 \times 40$ pixels).</li>
<li><strong>Lower Part (Half Size Grid)</strong>: Uses the lower part of the image with odd rows (the proposed method).</li>
</ol>
<p><strong>Setup</strong>:</p>
<ul>
<li><strong>Dataset</strong>: 23 types of heterocyclic rings.</li>
<li><strong>Training</strong>: 1500 samples (distributed across S, N, O, and Others classes).</li>
<li><strong>Testing</strong>: 1150 samples.</li>
<li><strong>Metric</strong>: Recognition accuracy (Performance %) and Error %.</li>
</ul>
<h2 id="results-high-accuracy-via-dimension-reduction">Results: High Accuracy via Dimension Reduction</h2>
<ul>
<li><strong>Superior Method</strong>: The &ldquo;Lower Part Image Recognizer with Half Size Grid&rdquo; achieved the best performance (~94% overall).</li>
<li><strong>High Classifier Accuracy</strong>: The first phase (classification into S/N/O/Other) achieves 100% accuracy for class S, 98.67% for O, 97.75% for N, and 97.67% for Others (Table 3).</li>
<li><strong>Class &lsquo;Others&rsquo; Difficulty</strong>: The &lsquo;Others&rsquo; class showed lower performance (~90-93%) compared to S/N/O due to the higher complexity and similarity of rings in that category.</li>
<li><strong>Efficiency</strong>: The half-grid approach reduced training time from ~53 hours (Whole Image) to ~35 hours (Lower Part Half Size Grid) while improving accuracy from 87% to 94%.</li>
</ul>
<p><strong>Training/Testing comparison across the three Classifier-Recognizer variations (Table 2)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Method</th>
          <th style="text-align: left">Hidden Nodes</th>
          <th style="text-align: left">Iterations</th>
          <th style="text-align: left">Training Time (hrs)</th>
          <th style="text-align: left">Error</th>
          <th style="text-align: left">Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Whole Image</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~53</td>
          <td style="text-align: left">13.0%</td>
          <td style="text-align: left">87.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Whole Image (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~41</td>
          <td style="text-align: left">9.0%</td>
          <td style="text-align: left">91.0%</td>
      </tr>
      <tr>
          <td style="text-align: left">Lower Part (Half Grid)</td>
          <td style="text-align: left">50</td>
          <td style="text-align: left">1000</td>
          <td style="text-align: left">~35</td>
          <td style="text-align: left">6.0%</td>
          <td style="text-align: left">94.0%</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset consists of handwritten samples of 23 specific heterocyclic rings.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1500 samples</td>
          <td style="text-align: left">Split: 300 (S), 400 (N), 400 (O), 400 (Others)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Heterocyclic Rings</td>
          <td style="text-align: left">1150 samples</td>
          <td style="text-align: left">Split: 150 (S), 300 (O), 400 (N), 300 (Others)</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing Steps</strong>:</p>
<ol>
<li><strong>Monochrome Conversion</strong>: Convert image to monochrome bitmap.</li>
<li><strong>Grid Scaling</strong>: Convert drawing area (regardless of original size) to a fixed <strong>40x40</strong> grid.</li>
<li><strong>Bounding</strong>: Scale the ring shape itself to fit the 40x40 grid.</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>The &ldquo;Lower Part with Half Size&rdquo; Pipeline</strong>:</p>
<ol>
<li><strong>Cut Point</strong>: A horizontal midline is defined; the algorithm separates the &ldquo;Upper Part&rdquo; and &ldquo;Lower Part&rdquo;.</li>
<li><strong>Phase 1 Input</strong>: The <strong>Upper Part</strong> (rows 0-15 approx, scaled) is fed to the Classifier NN to determine the class (S, N, O, or Others).</li>
<li><strong>Phase 2 Input</strong>:
<ul>
<li>For classes <strong>S, N, O</strong>: The <strong>Lower Part</strong> of the image is used.</li>
<li>For class <strong>Others</strong>: The <strong>Whole Ring</strong> is used.</li>
</ul>
</li>
<li><strong>Dimensionality Reduction</strong>: For the recognizer networks, only <strong>odd rows</strong> are used (effectively a 20x40 input grid) to reduce inputs from 1600 to 800.</li>
</ol>
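<p>The dispatch logic of the two-phase pipeline can be sketched as follows. The networks are stand-ins passed as callables, and the midline row and ring names are illustrative assumptions:</p>

```python
def recognize(image, classifier, recognizers, midline=20):
    """Two-phase Classifier-Recognizer dispatch over a 40x40 grid.
    Phase 1 sees the upper part; Phase 2 sees the lower part for
    classes S/N/O and the whole ring for class Others."""
    upper, lower = image[:midline], image[midline:]
    ring_class = classifier(upper)  # one of "S", "N", "O", "Others"
    region = image if ring_class == "Others" else lower
    return recognizers[ring_class](region)

# Stub networks for illustration (hypothetical ring names).
classifier = lambda upper: "S"
recognizers = {"S": lambda region: "thiophene",
               "N": lambda region: "pyridine",
               "O": lambda region: "furan",
               "Others": lambda region: "other-ring"}
image = [[0] * 40 for _ in range(40)]
print(recognize(image, classifier, recognizers))  # thiophene
```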
<h3 id="models">Models</h3>
<p>The system uses multiple distinct Feed-Forward Neural Networks (Backpropagation is implied by &ldquo;training&rdquo; and &ldquo;epochs&rdquo; context, though not explicitly named as the algorithm):</p>
<ul>
<li><strong>Structure</strong>: 1 Classifier NN + 4 Recognizer NNs (one for each class).</li>
<li><strong>Hidden Layers</strong>: The preliminary &ldquo;ordinary method&rdquo; experiment used 1600 hidden units. The Classifier-Recognizer methods all used 50 hidden nodes per Table 2. The paper also notes that the ordinary approach tried various hidden layer sizes.</li>
<li><strong>Input Nodes</strong>:
<ul>
<li>Standard: 1600 (40x40).</li>
<li>Optimized: 800 ($20 \times 40$ via half-grid).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Classifier Phase Testing Results (Table 3)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">150</td>
          <td style="text-align: left"><strong>100.00%</strong></td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">296</td>
          <td style="text-align: left"><strong>98.67%</strong></td>
          <td style="text-align: left">1.33%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">391</td>
          <td style="text-align: left"><strong>97.75%</strong></td>
          <td style="text-align: left">2.25%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">293</td>
          <td style="text-align: left"><strong>97.67%</strong></td>
          <td style="text-align: left">2.33%</td>
      </tr>
  </tbody>
</table>
<p><strong>Recognizer Phase Testing Results (Lower Part Image Recognizer with Half Size Grid, Table 4)</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Class</th>
          <th style="text-align: left">Samples</th>
          <th style="text-align: left">Correct</th>
          <th style="text-align: left">Accuracy</th>
          <th style="text-align: left">Error</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>S</strong></td>
          <td style="text-align: left">150</td>
          <td style="text-align: left">147</td>
          <td style="text-align: left"><strong>98.00%</strong></td>
          <td style="text-align: left">2.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>O</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">289</td>
          <td style="text-align: left"><strong>96.33%</strong></td>
          <td style="text-align: left">3.67%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>N</strong></td>
          <td style="text-align: left">400</td>
          <td style="text-align: left">386</td>
          <td style="text-align: left"><strong>96.50%</strong></td>
          <td style="text-align: left">3.50%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Others</strong></td>
          <td style="text-align: left">300</td>
          <td style="text-align: left">279</td>
          <td style="text-align: left"><strong>93.00%</strong></td>
          <td style="text-align: left">7.00%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Overall</strong></td>
          <td style="text-align: left"><strong>1150</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
          <td style="text-align: left"><strong>~94.0%</strong></td>
          <td style="text-align: left"><strong>-</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p>No source code, trained models, or datasets were released with this paper. The handwritten ring samples were collected by the authors, and the software described (a desktop application) is not publicly available. The neural network architecture details (50 hidden nodes, 1000 iterations) and preprocessing pipeline are described in sufficient detail for reimplementation, but reproducing results would require collecting a new handwritten dataset of heterocyclic rings.</p>
<p><strong>Status</strong>: Closed (no public code, data, or models).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hewahi, N., Nounou, M. N., Nassar, M. S., Abu-Hamad, M. I., &amp; Abu-Hamad, H. I. (2008). Chemical Ring Handwritten Recognition Based on Neural Networks. <em>Ubiquitous Computing and Communication Journal</em>, 3(3).</p>
<p><strong>Publication</strong>: Ubiquitous Computing and Communication Journal 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{hewahiCHEMICALRINGHANDWRITTEN2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{CHEMICAL RING HANDWRITTEN RECOGNITION BASED ON NEURAL NETWORKS}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hewahi, Nabil and Nounou, Mohamed N and Nassar, Mohamed S and Abu-Hamad, Mohamed I and Abu-Hamad, Husam I}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2008}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Ubiquitous Computing and Communication Journal}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Structural Analysis of Handwritten Chemical Formulas</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ramel-handwritten-1999/</guid><description>A 1999 methodology for recognizing handwritten chemical structures using a structural graph representation and recursive specialists.</description><content:encoded><![CDATA[<h2 id="contribution-structural-approach-to-document-analysis">Contribution: Structural Approach to Document Analysis</h2>
<p><strong>Method</strong>.
This paper proposes a system architecture for document analysis. It introduces a specific pipeline (Global Perception followed by Incremental Extraction) and validates this strategy with recognition rates on specific tasks. The core contribution is the shift from bitmap-based processing to a <strong>structural graph representation</strong> of graphical primitives.</p>
<h2 id="motivation-overcoming-bitmap-limitations-in-freehand-drawings">Motivation: Overcoming Bitmap Limitations in Freehand Drawings</h2>
<ul>
<li><strong>Complexity of Freehand</strong>: Freehand drawings contain fluctuating lines and noise that make standard vectorization techniques difficult to apply directly.</li>
<li><strong>Limitation of Bitmap Analysis</strong>: Most existing systems at the time attempted to interpret the document by working directly on the static bitmap image throughout the process.</li>
<li><strong>Need for Context</strong>: Interpretation requires a dynamic resource that can evolve as knowledge is extracted (e.g., recognizing a polygon changes the context for its neighbors).</li>
</ul>
<h2 id="novelty-dynamic-structural-graphs-and-recursive-specialists">Novelty: Dynamic Structural Graphs and Recursive Specialists</h2>
<p>The authors propose a <strong>Structural Representation</strong> as the unique resource for interpretation.</p>
<ul>
<li><strong>Quadrilateral Primitives</strong>: The system builds Quadrilaterals (pairs of vectors) to represent thin shapes, which are robust to handwriting fluctuations.</li>
<li><strong>Structural Graph</strong>: These primitives are organized into a graph where arcs represent geometric relationships (T-junctions, L-junctions, parallels).</li>
<li><strong>Specialist Agents</strong>: Interpretation is driven by independent modules (specialists) that browse this graph recursively to identify high-level chemical entities like rings (polygons) or chains.</li>
</ul>
<h2 id="experimental-setup-and-outcomes">Experimental Setup and Outcomes</h2>
<ul>
<li><strong>Validation Set</strong>: The system was tested on 20 handwritten off-line documents containing chemical formulas at 300 dpi resolution.</li>
<li><strong>Text Database</strong>: A separate base of 328 models was used for the text recognition component.</li>
<li><strong>High Graphical Accuracy</strong>: The system achieved a $\approx 97\%$ recognition rate for graphical parts (chemical elements like rings and bonds).</li>
<li><strong>Text Recognition</strong>: The text recognition module achieved a $\approx 93\%$ success rate.</li>
<li><strong>Robustness</strong>: The structural graph approach successfully handled multiple liaisons, polygons, and chains, allowing progressive construction of a solution consistent with the context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Handwritten Documents</td>
          <td>20 docs</td>
          <td>Off-line documents at 300 dpi</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Character Models</td>
          <td>328 models</td>
          <td>Used for the Pattern Matching text recognition base</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The interpretation process is divided into two distinct phases:</p>
<p><strong>1. Global Perception (Graph Construction)</strong></p>
<ul>
<li><strong>Vectorization</strong>: Contour tracking produces a chain of vectors, which are simplified via iterative polygonal approximation until fusion stabilizes (2-5 iterations).</li>
<li><strong>Quadrilateral Formation</strong>: Vectors are paired to form quadrilaterals based on Euclidean distance and &ldquo;empirical&rdquo; alignment criteria.</li>
<li><strong>Graph Generation</strong>: Quadrilaterals become nodes. Arcs are created based on &ldquo;zones of influence&rdquo; and classified into 5 types: T-junction, Intersection (X), Parallel (//), L-junction, and Successive (S).</li>
<li><strong>Redraw Heuristic</strong>: A pre-processing step transforms T, X, and S junctions into L or // relations, as chemical drawings primarily consist of L-junctions and parallels.</li>
</ul>
<p><strong>2. Specialists (Interpretation)</strong></p>
<ul>
<li><strong>Liaison Specialist</strong>: Scans the graph for // arcs or quadrilaterals with free extremities to identify bonds.</li>
<li><strong>Polygon/Chain Specialist</strong>: Uses recursive <code>look-left</code> and <code>look-right</code> procedures. If a search returns to the start node after $n$ steps, a polygon is detected.</li>
<li><strong>Text Localization</strong>: Clusters &ldquo;short&rdquo; quadrilaterals by physical proximity into &ldquo;focus zones&rdquo;. Zones are classified as text/non-text based on connected components.</li>
</ul>
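<p>The recursive polygon search can be illustrated with a small sketch. Assumed simplifications: the graph is a plain adjacency map of quadrilateral ids, and the <code>look-left</code>/<code>look-right</code> ordering of the real specialists is abstracted into a single recursive walk.</p>

```python
def find_polygon(graph, start, max_steps=8):
    """Search for a cycle returning to `start`, signalling an n-sided polygon.

    `graph` maps a quadrilateral id to its L-junction neighbours. A walk
    that comes back to the start node after n > 2 steps is a polygon.
    """
    def walk(node, path):
        for nxt in graph.get(node, ()):
            if nxt == start and len(path) > 2:
                return path          # closed cycle: polygon with len(path) sides
            if nxt not in path and len(path) < max_steps:
                found = walk(nxt, path + [nxt])
                if found:
                    return found
        return None
    return walk(start, [start])
```

<p>On a hexagonal ring graph this returns a six-node cycle; on an open chain it returns <code>None</code>, leaving the structure to the chain specialist.</p>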
<h3 id="models">Models</h3>
<p><strong>Text Recognition Hybrid</strong>:</p>
<ol>
<li><strong>Normalization &amp; Pattern Matching</strong>: A classic method using the database of 328 models.</li>
<li><strong>Structural Rule Base</strong>: Uses &ldquo;significant&rdquo; quadrilaterals (length $\ge 1/3$ of zone dimension) to verify characters. A rule base defines the expected count of horizontal, vertical, right-diagonal, and left-diagonal lines for each character.</li>
</ol>
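<p>A toy version of the structural rule base might look like this; the rule entries and the 22.5&deg; direction buckets are illustrative assumptions, not values from the paper:</p>

```python
import math

# Hypothetical excerpt of a rule base: expected line counts per character
RULES = {"H": {"vertical": 2, "horizontal": 1},
         "E": {"vertical": 1, "horizontal": 3}}

def direction(p, q):
    """Bucket a significant quadrilateral's axis into one of four directions."""
    ang = math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180
    if ang < 22.5 or ang >= 157.5:
        return "horizontal"
    if 67.5 <= ang < 112.5:
        return "vertical"
    return "right_diag" if ang < 90 else "left_diag"

def matches(char, segments):
    """Verify a character hypothesis against its expected direction counts."""
    counts = {}
    for p, q in segments:
        d = direction(p, q)
        counts[d] = counts.get(d, 0) + 1
    return all(counts.get(d, 0) == n for d, n in RULES.get(char, {}).items())
```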
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graphical Element Recognition</td>
          <td>~97%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents (Fig. 7 examples)</td>
      </tr>
      <tr>
          <td>Text Recognition</td>
          <td>~93%</td>
          <td>N/A</td>
          <td>Evaluated on 20 documents</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ramel, J.-Y., Boissier, G., &amp; Emptoz, H. (1999). Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image. <em>Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR &lsquo;99)</em>, 83-86. <a href="https://doi.org/10.1109/ICDAR.1999.791730">https://doi.org/10.1109/ICDAR.1999.791730</a></p>
<p><strong>Publication</strong>: ICDAR 1999</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ramelAutomaticReadingHandwritten1999,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automatic Reading of Handwritten Chemical Formulas from a Structural Representation of the Image}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the {{Fifth International Conference}} on {{Document Analysis}} and {{Recognition}}. {{ICDAR}} &#39;99 ({{Cat}}. {{No}}.{{PR00318}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ramel, J.-Y. and Boissier, G. and Emptoz, H.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1999</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{83--86}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Bangalore, India}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.1999.791730}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-0-7695-0318-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Hand-Drawn Chemical Diagram Recognition (AAAI 2007)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/ouyang-davis-aaai-2007/</guid><description>A sketch recognition system for organic chemistry that uses domain knowledge (chemical valence) to correct recognition errors.</description><content:encoded><![CDATA[<h2 id="contribution-and-methodological-approach">Contribution and Methodological Approach</h2>
<p>This is a <strong>Method</strong> paper. It proposes a multi-stage pipeline for interpreting hand-drawn diagrams that integrates a trainable symbol recognizer with a domain-specific verification step. The authors validate the method through an ablation study comparing the full system against a baseline lacking domain knowledge.</p>
<h2 id="motivation-for-sketch-based-interfaces">Motivation for Sketch-Based Interfaces</h2>
<p>Current software for specifying chemical structures (e.g., ChemDraw, ISIS/Draw) relies on mouse and keyboard interfaces, which lack the speed, ease of use, and naturalness of drawing on paper. The goal is to bridge the gap between natural expression and computer interpretation by building a system that understands freehand chemical sketches.</p>
<h2 id="novel-integration-of-chemical-domain-knowledge">Novel Integration of Chemical Domain Knowledge</h2>
<p>The primary novelty is the integration of <strong>domain knowledge</strong> (specifically chemical valence rules) directly into the interpretation loop to resolve ambiguities and correct errors.</p>
<p>Specific technical contributions include:</p>
<ul>
<li><strong>Hybrid Recognizer</strong>: Combines feature-based SVMs, image-based template matching (modified Tanimoto), and off-the-shelf handwriting recognition to handle the mix of geometry and text.</li>
<li><strong>Domain Verification Loop</strong>: A post-processing step that checks the chemical validity of the structure (e.g., nitrogen must have 3 bonds). If an inconsistency is found, the system searches the space of alternative hypotheses generated during the initial parsing phase to find a valid interpretation.</li>
<li><strong>Contextual Parsing</strong>: Uses a sliding window (up to 7 strokes) and spatial context to parse interspersed symbols.</li>
<li><strong>Implicit Structure Handling</strong>: Supports two common chemistry notations: (1) implicit elements, where carbon and hydrogen atoms are omitted and inferred from bond connectivity and valence rules, and (2) aromatic rings, detected as a circle drawn inside a hexagonal 6-carbon cycle.</li>
</ul>
<h2 id="experimental-design-and-user-study">Experimental Design and User Study</h2>
<p>The authors conducted a user study to evaluate the system&rsquo;s robustness on unconstrained sketches.</p>
<ul>
<li><strong>Participants</strong>: 6 users familiar with organic chemistry.</li>
<li><strong>Task</strong>: Each user drew 12 pre-specified molecular compounds on a Tablet PC.</li>
<li><strong>Conditions</strong>: The system was evaluated in two modes:
<ol>
<li><strong>Domain</strong>: The full system with chemical valence checks.</li>
<li><strong>Baseline</strong>: A simplified version with no knowledge of chemical valence/verification.</li>
</ol>
</li>
<li><strong>Data Split</strong>: Evaluated on collected sketches using a leave-one-out style approach (training on 11 examples from the same users).</li>
</ul>
<h2 id="results-and-error-reduction-analysis">Results and Error Reduction Analysis</h2>
<ul>
<li><strong>Performance</strong>: The full system achieved an overall <strong>F-measure of 0.87</strong> (Precision 0.86, Recall 0.89).</li>
<li><strong>Impact of Domain Knowledge</strong>: Using domain knowledge reduced the overall error rate (measured by recall) by <strong>27%</strong> compared to the baseline. The improvement was statistically significant ($p &lt; .05$).</li>
<li><strong>Error Recovery</strong>: The system successfully recovered from interpretations that were geometrically plausible but chemically impossible (e.g., misinterpreting &ldquo;N&rdquo; as bonds), as illustrated in their qualitative analysis.</li>
<li><strong>Output Integration</strong>: Once interpreted, the resulting structure is expressed in a standard chemical specification format that can be passed to tools such as ChemDraw (for rendering) or SciFinder (for database queries).</li>
<li><strong>Limitations</strong>: The system struggled with &ldquo;messy&rdquo; sketches where users drew single bonds with multiple strokes or over-traced lines, as the current bond recognizer assumes single-stroke straight bonds.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study collected a custom dataset of hand-drawn diagrams.</p>
<ul>
<li><strong>Volume</strong>: 6 participants $\times$ 12 molecules = 72 total sketches (implied).</li>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Scale Normalization</strong>: The system estimates scale based on the average length of straight bonds (chosen because they are easy to identify). This normalizes geometric features for the classifier.</li>
<li><strong>Stroke Segmentation</strong>: Poly-line approximation using recursive splitting (minimizing least squared error) to break multi-segment strokes (e.g., connected bonds) into primitives.</li>
</ul>
</li>
</ul>
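<p>The recursive splitting step can be sketched as follows. Note one substitution: the paper describes minimizing least squared error, while this sketch uses the simpler max-distance split criterion.</p>

```python
import math

def split_stroke(points, tol=2.0):
    """Recursively split a stroke at the point farthest from its chord
    until every segment fits within `tol` (a hypothetical threshold),
    returning the vertices of the poly-line approximation.
    """
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy) or 1.0
    # Perpendicular distance of each interior point to the chord
    dists = [abs(dy * (x - x0) - dx * (y - y0)) / norm for x, y in points[1:-1]]
    if not dists or max(dists) <= tol:
        return [points[0], points[-1]]
    k = dists.index(max(dists)) + 1
    left = split_stroke(points[: k + 1], tol)
    right = split_stroke(points[k:], tol)
    return left[:-1] + right   # merge, dropping the duplicated split point
```

<p>An L-shaped stroke (e.g., two connected bonds drawn in one motion) splits at the corner into two primitives.</p>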
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Ink Parsing (Sliding Window)</strong></p>
<ul>
<li>Examines all combinations of up to <strong>$n=7$</strong> sequential strokes.</li>
<li>Classifies each group as a valid symbol or invalid garbage.</li>
</ul>
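<p>The grouping step alone can be sketched as follows (the classifier that scores each group is separate; <code>candidate_groups</code> is a hypothetical name):</p>

```python
def candidate_groups(num_strokes, window=7):
    """Enumerate the stroke groups the sliding-window parser would score:
    every run of 1..window sequential strokes, identified by stroke index.
    """
    return [
        tuple(range(start, start + size))
        for size in range(1, window + 1)
        for start in range(0, num_strokes - size + 1)
    ]
```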
<p><strong>2. Template Matching (Image-based)</strong></p>
<ul>
<li>Used for resolving ambiguities in text/symbols (e.g., &lsquo;H&rsquo; vs &lsquo;N&rsquo;).</li>
<li><strong>Metric</strong>: Modified <strong>Tanimoto coefficient</strong>. Unlike standard Tanimoto (point overlap), this version accounts for relative angle and curvature at each point.</li>
</ul>
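<p>For reference, the standard Tanimoto coefficient on quantized ink points looks like this; the paper&rsquo;s modification, which additionally weights matches by relative angle and curvature at each point, is omitted from the sketch:</p>

```python
def tanimoto(a, b):
    """Standard Tanimoto coefficient between two sets of quantized points:
    |A ∩ B| / (|A| + |B| - |A ∩ B|), i.e. overlap over union.
    """
    a, b = set(a), set(b)
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 1.0
```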
<p><strong>3. Domain Verification</strong></p>
<ul>
<li><strong>Trigger</strong>: An element with incorrect valence (e.g., Hydrogen with &gt;1 bond).</li>
<li><strong>Resolution</strong>: Searches stored alternative hypotheses for the affected strokes. It accepts a new hypothesis if it resolves the valence error without introducing new ones.</li>
<li><strong>Constraint</strong>: It keeps an inconsistent structure if the original confidence score is significantly higher than that of the alternatives (assuming the user is still drawing or has intentionally left the structure incomplete).</li>
</ul>
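<p>The valence trigger can be sketched as a simple bond-count check (the valence table and data layout are illustrative assumptions; undersaturated atoms are not flagged because missing bonds become implicit hydrogens):</p>

```python
# Hypothetical valence table for the element set the recognizer covers
VALENCE = {"H": 1, "O": 2, "S": 2, "N": 3, "C": 4}

def valence_errors(atoms, bonds):
    """Return ids of atoms whose total bond order exceeds their valence.

    `atoms` maps atom id -> element symbol; `bonds` lists (a, b, order)
    triples, where `order` is 1 for single bonds, 2 for double, etc.
    """
    counts = {aid: 0 for aid in atoms}
    for a, b, order in bonds:
        counts[a] += order
        counts[b] += order
    return [aid for aid, sym in atoms.items()
            if counts[aid] > VALENCE.get(sym, 8)]
```

<p>Any atom id returned here would trigger the search over stored alternative hypotheses described above.</p>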
<h3 id="models">Models</h3>
<p><strong>Symbol Recognizer (Discriminative Classifier)</strong></p>
<ul>
<li><strong>Type</strong>: Support Vector Machine (SVM).</li>
<li><strong>Classes</strong>: Element letters, straight bonds, hash bonds, wedge bonds, invalid groups.</li>
<li><strong>Input Features</strong>:
<ol>
<li>Number of strokes</li>
<li>Bounding-box dimensions (width, height, diagonal)</li>
<li>Ink density (ink length / diagonal length)</li>
<li>Inter-stroke distance (max distance between strokes in group)</li>
<li>Inter-stroke orientation (vector of relative orientations)</li>
</ol>
</li>
</ul>
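<p>Computing a few of those features might look like this (a sketch only; the real system also normalizes by the estimated bond scale, and the inter-stroke distance and orientation features are omitted):</p>

```python
import math

def group_features(strokes):
    """Geometric features for one candidate stroke group.

    Each stroke is a list of (x, y) points; covers the stroke count,
    bounding-box, and ink-density features from the list above.
    """
    pts = [p for s in strokes for p in s]
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    diag = math.hypot(w, h) or 1.0
    ink = sum(math.dist(s[i], s[i + 1])
              for s in strokes for i in range(len(s) - 1))
    return {"n_strokes": len(strokes), "width": w, "height": h,
            "diagonal": diag, "ink_density": ink / diag}
```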
<p><strong>Text Recognition</strong></p>
<ul>
<li><strong>Microsoft Tablet PC SDK</strong>: Used for recognizing alphanumeric characters (elements and subscripts).</li>
<li>Integrated with the SVM and Template Matcher via a combined scoring mechanism.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Overall)</th>
          <th>Baseline Comparison</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision</strong></td>
          <td>0.86</td>
          <td>0.81 (Baseline)</td>
          <td>Full system vs. no domain knowledge</td>
      </tr>
      <tr>
          <td><strong>Recall</strong></td>
          <td>0.89</td>
          <td>0.85 (Baseline)</td>
          <td>27% error reduction</td>
      </tr>
      <tr>
          <td><strong>F-Measure</strong></td>
          <td>0.87</td>
          <td>0.83 (Baseline)</td>
          <td>Statistically significant ($p &lt; .05$)</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>True Positive Definition</strong>: Match in both location (stroke grouping) and classification (label).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: 1.5GHz Tablet PC.</li>
<li><strong>Performance</strong>: Real-time feedback.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<p>No source code, trained models, or collected sketch data were publicly released. The paper is openly available through the AAAI digital library. The system depends on the Microsoft Tablet PC SDK (a proprietary, now-discontinued component), which would make exact replication difficult even with the algorithm descriptions provided.</p>
<p><strong>Status</strong>: Closed</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2007). Recognition of Hand Drawn Chemical Diagrams. <em>Proceedings of the 22nd National Conference on Artificial Intelligence</em> (AAAI-07), 846-851.</p>
<p><strong>Publication</strong>: AAAI 2007</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyang2007recognition,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recognition of Hand Drawn Chemical Diagrams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ouyang, Tom Y and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 22nd National Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{846--851}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>