<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Online Recognition on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/</link><description>Recent content in Online Recognition on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Mon, 06 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/index.xml" rel="self" type="application/rss+xml"/><item><title>Unified Framework for Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chang-unified-framework-2009/</guid><description>A 2009 unified framework for inorganic/organic chemical handwriting recognition using graph search and statistical symbol grouping.</description><content:encoded><![CDATA[<h2 id="addressing-the-complexity-of-handwritten-organic-chemistry">Addressing the Complexity of Handwritten Organic Chemistry</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) from Microsoft Research Asia that addresses the challenge of recognizing complex 2D organic chemistry structures. By 2009, math expression recognition had seen significant commercial progress, but chemical expression recognition remained less developed.</p>
<p>The specific gap addressed is the geometric complexity of organic formulas. While inorganic formulas typically follow a linear, equation-like structure, organic formulas present complex 2D diagrammatic structures with various bond types and rings. Existing work often relied on strong assumptions (like single-stroke symbols) or failed to handle arbitrary compounds. There was a clear need for a unified solution capable of handling both inorganic and organic domains consistently.</p>
<h2 id="the-chemical-expression-structure-graph-cesg">The Chemical Expression Structure Graph (CESG)</h2>
<p>The core innovation is a unified statistical framework that processes inorganic and organic expressions within the same pipeline. Key technical novelties include:</p>
<ol>
<li><strong>Unified Bond Modeling</strong>: Bonds are treated as special symbols. The framework detects &ldquo;extended bond symbols&rdquo; (multi-stroke bonds) and splits them into single, double, or triple bonds using corner detection for consistent processing.</li>
<li><strong>Chemical Expression Structure Graph (CESG)</strong>: A defined graph representation for generic chemical expressions where nodes represent symbols and edges represent bonds or spatial relations.</li>
<li><strong>Non-Symbol Modeling</strong>: During the symbol grouping phase, the system explicitly models &ldquo;invalid groups&rdquo; to reduce over-grouping errors.</li>
<li><strong>Global Graph Search</strong>: Structure analysis is formulated as finding the optimal CESG by searching over a Weighted Direction Graph ($G_{WD}$).</li>
</ol>
<h2 id="graph-search-and-statistical-validation">Graph Search and Statistical Validation</h2>
<p>The authors validated the framework on a proprietary database of 35,932 handwritten chemical expressions collected from 300 writers.</p>
<ul>
<li><strong>Setup</strong>: The data was split into roughly 26,000 training and 6,400 testing samples.</li>
<li><strong>Metric</strong>: Recognition accuracy was measured strictly by expression (all symbols and the complete structure must be correct).</li>
<li><strong>Ablations</strong>: The team evaluated the performance contribution of symbol grouping, structure analysis, and full semantic verification.</li>
</ul>
<h2 id="recognition-accuracy-and-outcomes">Recognition Accuracy and Outcomes</h2>
<p>The full framework achieved a Top-1 accuracy of 75.4% and a Top-5 accuracy of 83.1%.</p>
<ul>
<li><strong>Component Contribution</strong>: Structure analysis is the primary bottleneck. With perfect grouping assumed, Top-1 accuracy would reach 85.9%; structural errors introduced during structure analysis pull it down to 74.1%.</li>
<li><strong>Semantic Verification</strong>: Checking valence and grammar raised Top-1 accuracy from 74.1% to 75.4%, a relative improvement of about 1.7%.</li>
</ul>
<p>The unified framework effectively handles the variance in 2D space for chemical expressions, demonstrating that delayed decision-making (keeping top-N candidates) improves robustness.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No public artifacts (code, data, models) were released by the authors.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study used a private Microsoft Research Asia dataset, making direct reproduction difficult.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total</td>
          <td>Proprietary MSRA DB</td>
          <td>35,932 expressions</td>
          <td>Written by 300 people</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>Subset</td>
          <td>25,934 expressions</td>
          <td></td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Subset</td>
          <td>6,398 expressions</td>
          <td></td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Content</strong>: 2,000 unique expressions from high school/college textbooks.</li>
<li><strong>Composition</strong>: ~25% of samples are organic expressions.</li>
<li><strong>Vocabulary</strong>: 163 symbol classes (elements, digits, <code>+</code>, <code>↑</code>, <code>%</code>, bonds, etc.).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Symbol Grouping (Dynamic Programming)</strong></p>
<ul>
<li>Objective: Find the optimal symbol sequence $G_{max}$ maximizing the posterior probability given the ink strokes:
$$ G_{max} = \arg\max_{G} P(G | \text{Ink}) $$</li>
<li><strong>Non-symbol modeling</strong>: Iteratively trained models on &ldquo;incorrect grouping results&rdquo; to learn to reject invalid strokes.</li>
<li><strong>Inter-group modeling</strong>: Uses Gaussian Mixture Models (GMM) to model spatial relations ($R_j$) between groups.</li>
</ul>
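<p>The grouping objective admits a simple dynamic-programming sketch. The snippet below is illustrative only: it assumes time-ordered strokes and a hypothetical <code>score</code> callback standing in for the paper&rsquo;s statistical symbol and non-symbol models.</p>

```python
import math

def best_grouping(strokes, score, max_group=4):
    """Find the stroke partition maximizing the product of per-group
    symbol scores -- a simplified stand-in for argmax_G P(G | Ink).

    strokes   -- time-ordered list of strokes
    score(g)  -- hypothetical classifier score in (0, 1] for a candidate
                 group of consecutive strokes
    max_group -- maximum strokes per symbol considered
    """
    n = len(strokes)
    # best[i] = (best log-score over partitions of strokes[:i], backpointer)
    best = [(-math.inf, -1) for _ in range(n + 1)]
    best[0] = (0.0, -1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_group), i):
            cand = best[j][0] + math.log(score(strokes[j:i]))
            if cand > best[i][0]:
                best[i] = (cand, j)
    groups, i = [], n            # recover the optimal partition
    while i > 0:
        j = best[i][1]
        groups.append(strokes[j:i])
        i = j
    return groups[::-1]
```

<p>A real implementation would also carry the top-N partitions forward rather than only the maximum, matching the paper&rsquo;s delayed-decision strategy.</p>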
<p><strong>2. Bond Processing</strong></p>
<ul>
<li><strong>Extended Bond Symbol</strong>: Recognizes connected strokes (e.g., a messy double bond written in one stroke) as a single &ldquo;extended&rdquo; symbol.</li>
<li><strong>Splitting</strong>: Uses <strong>Curvature Scale Space (CSS)</strong> corner detection to split extended symbols into primitive lines.</li>
<li><strong>Classification</strong>: A Neural Network verifies if the split lines form valid single, double, or triple bonds.</li>
</ul>
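<p>As a rough illustration of the splitting step, the sketch below detects corners with a simple turning-angle test rather than the paper&rsquo;s Curvature Scale Space detector; the threshold is an arbitrary choice, not the authors&rsquo; value.</p>

```python
import math

def split_at_corners(stroke, angle_thresh=math.radians(60)):
    """Split a polyline stroke at high-curvature corner points -- a
    simplified stand-in for CSS corner detection. Returns the list of
    primitive line segments between corners."""
    def heading(a, b):
        return math.atan2(b[1] - a[1], b[0] - a[0])
    corners = []
    for i in range(1, len(stroke) - 1):
        turn = heading(stroke[i], stroke[i + 1]) - heading(stroke[i - 1], stroke[i])
        turn = (turn + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
        if abs(turn) >= angle_thresh:
            corners.append(i)
    pieces, start = [], 0
    for c in corners:
        pieces.append(stroke[start:c + 1])
        start = c
    pieces.append(stroke[start:])
    return pieces
```

<p>In the paper, a neural network would then verify whether the resulting primitive lines form a valid single, double, or triple bond.</p>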
<p><strong>3. Structure Analysis (Graph Search)</strong></p>
<ul>
<li><strong>Graph Construction</strong>: Builds a Weighted Direction Graph ($G_{WD}$) where nodes are symbol candidates and edges are potential relationships ($E_{c}, E_{nc}, E_{peer}, E_{sub}$).</li>
<li><strong>Edge Weights</strong>: Calculated as the product of observation, spatial, and contextual probabilities:
$$ W(S, O, R) = P(O|S) \times P(\text{Spatial}|R) \times P(\text{Context}|S, R) $$
<ul>
<li>Spatial probability uses rectangular control regions and distance functions.</li>
<li>Contextual probability uses statistical co-occurrence (e.g., &lsquo;C&rsquo; often appears with &lsquo;H&rsquo;).</li>
</ul>
</li>
<li><strong>Search</strong>: Breadth-first search with pruning to find the top-N optimal CESGs.</li>
</ul>
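<p>A toy version of the search can be sketched as a level-wise expansion with beam pruning, where each edge weight is the product of the three probabilities above. The graph shape and probability values here are hypothetical, and the sketch assumes an acyclic candidate graph.</p>

```python
def top_n_paths(graph, start, goal, n=5, beam=10):
    """Breadth-first search with beam pruning over a toy weighted
    direction graph. graph[u] lists (v, weight) edges, where weight
    plays the role of P(O|S) * P(Spatial|R) * P(Context|S, R).
    Returns the n highest-scoring start-to-goal paths (acyclic graphs
    only -- cycles would loop forever)."""
    frontier = [(start, 1.0, [start])]     # (node, score, path)
    complete = []
    while frontier:
        nxt = []
        for node, score, path in frontier:
            if node == goal:
                complete.append((score, path))
                continue
            for v, w in graph.get(node, []):
                nxt.append((v, score * w, path + [v]))
        # Prune: keep only the `beam` best partial paths per level.
        nxt.sort(key=lambda t: -t[1])
        frontier = nxt[:beam]
    complete.sort(key=lambda t: -t[0])
    return complete[:n]
```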
<h3 id="models">Models</h3>
<ul>
<li><strong>Symbol Recognition</strong>: Implementation details are not specified; given the era, an HMM- or neural-network-based classifier is likely. Bond verification explicitly uses a <strong>Neural Network</strong>.</li>
<li><strong>Spatial Models</strong>: <strong>Gaussian Mixture Models (GMM)</strong> are used to model the 9 spatial relations (e.g., Left-super, Above, Subscript).</li>
<li><strong>Semantic Model</strong>: A <strong>Context-Free Grammar (CFG)</strong> parser is used for final verification (e.g., ensuring digits aren&rsquo;t reactants).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation is performed using &ldquo;Expression-level accuracy&rdquo;.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (Top-1)</th>
          <th>Value (Top-5)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full Framework</td>
          <td>75.4%</td>
          <td>83.1%</td>
          <td></td>
      </tr>
      <tr>
          <td>Without Semantics</td>
          <td>74.1%</td>
          <td>83.0%</td>
          <td></td>
      </tr>
      <tr>
          <td>Grouping Only</td>
          <td>85.9%</td>
          <td>95.6%</td>
          <td>Theoretical max if structure analysis was perfect</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, M., Han, S., &amp; Zhang, D. (2009). A Unified Framework for Recognizing Handwritten Chemical Expressions. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1345&ndash;1349. <a href="https://doi.org/10.1109/ICDAR.2009.64">https://doi.org/10.1109/ICDAR.2009.64</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{changUnifiedFrameworkRecognizing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A {{Unified Framework}} for {{Recognizing Handwritten Chemical Expressions}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Ming and Han, Shi and Zhang, Dongmei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2009</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1345--1349}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.64}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SVM-HMM Online Classifier for Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-svm-hmm-2010/</guid><description>A dual-stage classifier combining SVM and HMM to recognize online handwritten chemical symbols, introducing a reordering algorithm for organic rings.</description><content:encoded><![CDATA[<h2 id="contribution-double-stage-classification-method">Contribution: Double-Stage Classification Method</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel &ldquo;double-stage classifier&rdquo; architecture, fitting the taxonomy by introducing a specific algorithmic pipeline (SVM rough classification followed by HMM fine classification) and a novel pre-processing algorithm (Point Sequence Reordering) to overcome technical limitations in recognizing organic ring structures. The contribution is validated through ablation studies (comparing SVM kernels and HMM state/Gaussian counts) and performance benchmarks.</p>
<h2 id="motivation-recognizing-complex-organic-ring-structures">Motivation: Recognizing Complex Organic Ring Structures</h2>
<p>The primary motivation is the complexity of recognizing handwritten chemical symbols, specifically the distinction between <strong>Organic Ring Structures (ORS)</strong> and <strong>Non-Ring Structures (NRS)</strong>. Existing single-stage classifiers are unreliable for ORS because these symbols have arbitrary writing styles, variable stroke numbers, and inconsistent stroke orders due to their 2D hexagonal structure. A robust system is needed to handle this uncertainty and achieve high accuracy.</p>
<h2 id="core-innovation-point-sequence-reordering-psr">Core Innovation: Point Sequence Reordering (PSR)</h2>
<p>The authors introduce two main novelties:</p>
<ol>
<li><strong>Double-Stage Architecture</strong>: A hybrid system where an SVM (using RBF kernel) first roughly classifies inputs as either ORS or NRS, followed by specialized HMMs for fine-grained recognition.</li>
<li><strong>Point Sequence Reordering (PSR) Algorithm</strong>: A stroke-order independent algorithm designed specifically for ORS. It reorders the point sequence of a symbol based on a counter-clockwise scan from the centroid, effectively eliminating the uncertainty caused by variations in stroke number and writing order.</li>
</ol>
<h2 id="methodology--experimental-design">Methodology &amp; Experimental Design</h2>
<p>The authors collected a custom dataset and performed sequential optimizations:</p>
<ul>
<li><strong>SVM Optimization</strong>: Compared Polynomial, RBF, and Sigmoid kernels to find the best rough classifier.</li>
<li><strong>HMM Optimization</strong>: Tested multiple combinations of states (4, 6, 8) and Gaussians (3, 4, 6, 8, 9, 12) to maximize fine classification accuracy.</li>
<li><strong>PSR Validation</strong>: Conducted an ablation study comparing HMM accuracy on ORS symbols &ldquo;Before PSR&rdquo; vs &ldquo;After PSR&rdquo; to quantify the algorithm&rsquo;s impact.</li>
</ul>
<h2 id="results--final-conclusions">Results &amp; Final Conclusions</h2>
<ul>
<li><strong>Architecture Performance</strong>: The RBF-based SVM achieved 99.88% accuracy in differentiating ORS from NRS.</li>
<li><strong>HMM Configuration</strong>: The optimal HMM topology was found to be 8-states and 12-Gaussians for both symbol types.</li>
<li><strong>PSR Impact</strong>: The PSR algorithm dramatically improved ORS recognition, lifting Top-1 accuracy from <strong>49.84% (before PSR)</strong> to <strong>98.36% (after PSR)</strong>.</li>
<li><strong>Overall Accuracy</strong>: The final integrated system achieved a Top-1 accuracy of <strong>93.10%</strong> and Top-3 accuracy of <strong>98.08%</strong> on the test set.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study defined 101 chemical symbols split into two categories.</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Count</th>
          <th>Content</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>NRS</strong> (Non-Ring)</td>
          <td>63</td>
          <td>Digits 0-9, 44 letters, 9 operators</td>
          <td>Operators include +, -, =, $\rightarrow$, etc.</td>
      </tr>
      <tr>
          <td><strong>ORS</strong> (Organic Ring)</td>
          <td>38</td>
          <td>2D hexagonal structures</td>
          <td>Benzene rings, cyclohexane, etc.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Collection</strong>: 12,322 total samples (122 per symbol) collected from 20 writers (teachers and students).</li>
<li><strong>Split</strong>: 9,090 training samples and 3,232 test samples.</li>
<li><strong>Constraints</strong>: Samples were collected under three writing specifications: normal, standard, and freestyle.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. SVM Feature Extraction (Rough Classification)</strong>
The input strokes are scaled, and a 58-dimensional feature vector is calculated:</p>
<ul>
<li><strong>Mesh ($4 \times 4$)</strong>: Ratio of points in 16 grids (16 features).</li>
<li><strong>Outline</strong>: Normalized scan distance from 4 edges with 5 scan lines each (20 features).</li>
<li><strong>Projection</strong>: Point density in 5 bins per edge (20 features).</li>
<li><strong>Aspect Ratio</strong>: Height/Width ratios (2 features).</li>
</ul>
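<p>The feature counts above (16 + 20 + 20 + 2 = 58) can be sketched as follows. Only the layout of the vector comes from the paper; each feature&rsquo;s exact definition below is a simplified stand-in, not the authors&rsquo; formulation.</p>

```python
import numpy as np

def _band(v, bins):
    """Map normalized values in [0, 1] to integer band indices."""
    return np.minimum((v * bins).astype(int), bins - 1)

def symbol_features(points, grid=4, bins=5):
    """58-D feature vector with the paper's layout (16 mesh + 20 outline
    + 20 projection + 2 aspect ratio). `points` is an (N, 2) array of
    ink coordinates; the individual features are illustrative only."""
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    w, h = np.maximum(hi - lo, 1e-9)
    norm = (pts - lo) / (w, h)                  # scale into [0, 1]^2
    feats = []

    # Mesh (16): share of points in each cell of a 4x4 grid.
    cell = _band(norm, grid)
    mesh = np.zeros((grid, grid))
    np.add.at(mesh, (cell[:, 1], cell[:, 0]), 1)
    feats.extend(mesh.ravel() / len(pts))

    # Outline (20): per band, gap between each box edge and the ink.
    for axis in (1, 0):
        band = _band(norm[:, axis], bins)
        other = norm[:, 1 - axis]
        for b in range(bins):
            vals = other[band == b]
            if len(vals):
                feats += [vals.min(), 1.0 - vals.max()]
            else:
                feats += [1.0, 1.0]             # empty band: full gap

    # Projection (20): 5-bin density histograms in the four outer strips.
    strips = [(norm[:, 1] < 0.25, 0), (norm[:, 1] > 0.75, 0),
              (norm[:, 0] < 0.25, 1), (norm[:, 0] > 0.75, 1)]
    for mask, axis in strips:
        hist, _ = np.histogram(norm[mask, axis], bins=bins, range=(0, 1))
        feats.extend(hist / max(mask.sum(), 1))

    # Aspect ratio (2): height/width and width/height.
    feats += [h / w, w / h]
    return np.array(feats)
```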
<p><strong>2. Point Sequence Reordering (PSR)</strong>
Used strictly for ORS preprocessing:</p>
<ol>
<li>Calculate the centroid $(x_c, y_c)$ of the symbol.</li>
<li>Initialize a scan line at angle $\theta = 0$.</li>
<li>Traverse points; if a point $p_i = (x_i, y_i)$ satisfies the distance threshold to the scan line, add it to the reordered list. Distance $d_i$ is calculated as:
$$ d_i = |(y_i - y_c)\cos(\theta) - (x_i - x_c)\sin(\theta)| $$</li>
<li>Increment $\theta$ by $\Delta\theta$ and repeat until a full circle ($2\pi$) is completed.</li>
</ol>
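<p>The steps above can be sketched directly. The distance threshold and angular step below are illustrative, and restricting matches to the forward half of the scan line (so that the full $2\pi$ sweep is meaningful) is an assumed implementation detail.</p>

```python
import math

def point_sequence_reorder(points, d_thresh=0.05, d_theta=math.radians(5)):
    """Re-order ink points by sweeping a scan line counter-clockwise
    about the symbol centroid, so the output order no longer depends on
    stroke count or writing order. Thresholds are illustrative."""
    xc = sum(x for x, _ in points) / len(points)
    yc = sum(y for _, y in points) / len(points)
    reordered, taken = [], set()
    theta = 0.0
    while theta < 2 * math.pi:
        for i, (x, y) in enumerate(points):
            dx, dy = x - xc, y - yc
            # d_i = |(y_i - y_c) cos(theta) - (x_i - x_c) sin(theta)|
            d = abs(dy * math.cos(theta) - dx * math.sin(theta))
            # Assumption: only points on the forward half of the line.
            forward = dx * math.cos(theta) + dy * math.sin(theta) >= 0.0
            if forward and d <= d_thresh and i not in taken:
                taken.add(i)
                reordered.append((x, y))
        theta += d_theta
    return reordered
```

<p>For example, four points of a ring written in scrambled order come out ordered counter-clockwise starting from the centroid&rsquo;s positive x-axis.</p>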
<h3 id="models">Models</h3>
<ul>
<li><strong>SVM (Stage 1)</strong>: RBF Kernel was selected as optimal with parameters $C=512$ and $\gamma=0.5$.</li>
<li><strong>HMM (Stage 2)</strong>: Left-right continuous HMM trained via Baum-Welch algorithm. The topology is one model per symbol using <strong>8 states and 12 Gaussians</strong>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported are Top-1, Top-2, and Top-3 accuracy on the held-out test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NRS Accuracy</th>
          <th>ORS Accuracy</th>
          <th>Overall Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Top-1</strong></td>
          <td>91.91%</td>
          <td>97.53%</td>
          <td>93.10%</td>
      </tr>
      <tr>
          <td><strong>Top-3</strong></td>
          <td>99.12%</td>
          <td>99.34%</td>
          <td>98.08%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: HP Pavilion tx1000 Tablet PC.</li>
<li><strong>Processor</strong>: 2.00GHz CPU.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Wang, K. (2010). A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols. <em>2010 International Conference on Pattern Recognition</em>, 1888&ndash;1891. <a href="https://doi.org/10.1109/ICPR.2010.465">https://doi.org/10.1109/ICPR.2010.465</a></p>
<p><strong>Publication</strong>: ICPR 2010</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2010svm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A SVM-HMM Based Online Classifier for Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2010 International Conference on Pattern Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Wang, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2010}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1888--1891}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2010.465}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recognition of On-line Handwritten Chemical Expressions</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-online-handwritten-2008/</guid><description>A two-level recognition algorithm for on-line handwritten chemical expressions using structural and syntactic features.</description><content:encoded><![CDATA[<h2 id="contribution-on-line-chemical-expression-recognition-framework">Contribution: On-line Chemical Expression Recognition Framework</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural pipeline (&ldquo;Algorithm Model&rdquo;) for recognizing on-line handwritten chemical expressions. The paper focuses on detailing the specific mechanisms of this pipeline (pre-processing, segmentation, two-level recognition, and HCI) and validates its effectiveness through quantitative comparison against a conventional baseline. The rhetorical structure aligns with the &ldquo;Methodological Basis&rdquo; of the taxonomy, prioritizing the &ldquo;how well does this work?&rdquo; question over theoretical derivation or dataset curation.</p>
<h2 id="motivation-the-hci-gap-in-chemical-drawing">Motivation: The HCI Gap in Chemical Drawing</h2>
<p>The authors identify a gap in existing human-computer interaction (HCI) for chemistry. While mathematical formula recognition had seen progress, chemical expression recognition was under-researched. Existing tools relied on keyboard/mouse input, which was time-consuming and inefficient for the complex, variable nature of chemical structures. Previous attempts were either too slow (vectorization-based) or failed to leverage specific chemical knowledge effectively. There was a practical need for a system that could handle the specific syntactic rules of chemistry in an on-line (real-time) handwriting setting.</p>
<h2 id="novelty-two-level-recognition-architecture">Novelty: Two-Level Recognition Architecture</h2>
<p>The core contribution is a <strong>two-level recognition algorithm</strong> that integrates chemical domain knowledge.</p>
<ul>
<li><strong>Level 1 (Substance Level):</strong> Treats connected strokes as a potential &ldquo;substance unit&rdquo; (e.g., &ldquo;H2O&rdquo;) and matches them against a dictionary using a modified edit distance algorithm.</li>
<li><strong>Level 2 (Character Level):</strong> If the substance match fails, it falls back to segmenting the unit into isolated characters and reconstructing them using syntactic rules.</li>
<li><strong>Hybrid Segmentation:</strong> Combines structural analysis (using bounding box geometry for super/subscript detection) with &ldquo;partial recognition&rdquo; (identifying special symbols like <code>+</code>, <code>=</code>, <code>-&gt;</code> early to split the expression).</li>
</ul>
<h2 id="methodology-custom-dataset-and-baseline-comparisons">Methodology: Custom Dataset and Baseline Comparisons</h2>
<p>The authors conducted a validation experiment in a laboratory environment with 20 participants (chemistry students and teachers).</p>
<ul>
<li><strong>Dataset:</strong> 1,197 total samples (983 from a standard set of 341 expressions, 214 arbitrary expressions written by users).</li>
<li><strong>Baselines:</strong> They compared their &ldquo;Two-Level&rdquo; algorithm against a &ldquo;Conventional&rdquo; algorithm that skips the substance-level check and directly recognizes characters (&ldquo;Recognize Character Directly&rdquo;).</li>
<li><strong>Conditions:</strong> They also tested the impact of their Human-Computer Interaction (HCI) module which allows user corrections.</li>
</ul>
<h2 id="results-high-accuracy-and-hci-corrections">Results: High Accuracy and HCI Corrections</h2>
<ul>
<li><strong>Accuracy:</strong> The proposed two-level algorithm achieved significantly higher accuracy (<strong>96.4%</strong> for expression recognition) compared to the conventional baseline (<strong>91.5%</strong>).</li>
<li><strong>Robustness:</strong> The method performed well even on &ldquo;arbitrary&rdquo; expressions not in the standard set (92.5% accuracy vs 88.2% baseline).</li>
<li><strong>HCI Impact:</strong> Allowing users to modify results via the HCI module raised final accuracy to <strong>98.8%</strong>.</li>
<li><strong>Conclusion:</strong> The authors concluded the algorithm is reliable for real applications and flexible enough to be extended to other domains like physics or engineering.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper does not use a public benchmark but collected its own data for validation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Validation</strong></td>
          <td style="text-align: left">Custom Lab Dataset</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">Collected from 20 chemistry students/teachers using Tablet PCs. Includes 341 standard expressions + arbitrary user inputs.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct phases with specific algorithmic choices:</p>
<p><strong>1. Pre-processing</strong></p>
<ul>
<li><strong>Smoothing:</strong> Uses a 5-tap Gaussian low-pass filter (Eq. 1) with specific coefficients to smooth stroke data.</li>
<li><strong>Redundancy:</strong> Merges redundant points and removes &ldquo;prickles&rdquo; (isolated noise).</li>
<li><strong>Re-ordering:</strong> Strokes are spatially re-sorted left-to-right, top-to-bottom to correct for arbitrary writing order.</li>
</ul>
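<p>A minimal sketch of the smoothing step is shown below. The paper cites specific filter coefficients (its Eq. 1) that are not reproduced here; the 5-tap binomial kernel <code>[1, 4, 6, 4, 1]/16</code> is a conventional stand-in, not the authors&rsquo; exact values.</p>

```python
def smooth_stroke(points, kernel=(1, 4, 6, 4, 1)):
    """5-tap low-pass smoothing of a stroke's (x, y) points.
    The binomial kernel is an assumed stand-in for the paper's
    Gaussian coefficients."""
    k = [c / sum(kernel) for c in kernel]
    half = len(kernel) // 2
    out = []
    for i in range(len(points)):
        x = y = 0.0
        for j, w in enumerate(k):
            # Clamp indices at the stroke ends (replicate padding).
            p = points[min(max(i + j - half, 0), len(points) - 1)]
            x += w * p[0]
            y += w * p[1]
        out.append((x, y))
    return out
```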
<p><strong>2. Segmentation</strong></p>
<ul>
<li><strong>Structural Analysis:</strong> Distinguishes relationships (Superscript vs. Subscript vs. Horizontal) using a geometric feature vector $(T, B)$ based on bounding box heights ($h$), vertical centers ($C$), and barycenters ($B_{bary}$):
$$
\begin{aligned}
d &amp;= 0.7 \cdot y_{12} - y_{22} + 0.3 \cdot y_{11} \\
T &amp;= 1000 \cdot \frac{d}{h_1} \\
B &amp;= 1000 \cdot \frac{B_{bary1} - B_{bary2}}{h_1}
\end{aligned}
$$</li>
<li><strong>Partial Recognition:</strong> Detects special symbols (<code>+</code>, <code>=</code>, <code>-&gt;</code>) early to break expressions into &ldquo;super-substance units&rdquo; (e.g., separating reactants from products).</li>
</ul>
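<p>The $(T, B)$ equations translate directly to code. Interpreting $y_{11}, y_{12}$ as the top and bottom edges of the first bounding box and $y_{22}$ as the bottom of the second (y growing downward) is inferred from context, not stated in the summary.</p>

```python
def layout_features(box1, box2, bary1, bary2):
    """Geometric (T, B) features for classifying superscript, subscript,
    and horizontal relations. box = (y_top, y_bottom) in image
    coordinates; bary = vertical barycenter of the ink. The y_{ij}
    interpretation is an assumption."""
    y11, y12 = box1                      # top and bottom of symbol 1
    _, y22 = box2                        # bottom of symbol 2
    h1 = y12 - y11
    d = 0.7 * y12 - y22 + 0.3 * y11      # weighted reference line vs y22
    T = 1000.0 * d / h1
    B = 1000.0 * (bary1 - bary2) / h1
    return T, B
```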
<p><strong>3. Recognition (Two-Level)</strong></p>
<ul>
<li><strong>Level 1 (Dictionary Match):</strong>
<ul>
<li>Uses a modified <strong>Edit Distance</strong> (Eq. 6) incorporating a specific distance matrix based on chemical syntax.</li>
<li>Similarity $\lambda_{ij}$ is weighted by stroke credibility $\mu_i$ and normalized by string length.</li>
</ul>
</li>
<li><strong>Level 2 (Character Segmentation):</strong>
<ul>
<li>Falls back to this if Level 1 fails.</li>
<li>Segments characters by analyzing pixel density in horizontal/vertical/diagonal directions to find concave/convex points.</li>
<li>Recombines characters using syntactic rules (e.g., valency checks) to verify validity.</li>
</ul>
</li>
</ul>
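<p>The Level-1 dictionary match can be sketched with a substitution-cost table in place of the paper&rsquo;s chemical distance matrix; its stroke-credibility weights $\mu_i$ are omitted here, and the cost function below is hypothetical.</p>

```python
def weighted_edit_distance(a, b, sub_cost):
    """Edit distance with a symbol-dependent substitution-cost table,
    in the spirit of the paper's modified edit distance.
    sub_cost(x, y) -> cost in [0, 1] of substituting x for y."""
    m, n = len(a), len(b)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = float(i)
    for j in range(1, n + 1):
        D[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1.0,                          # delete a[i-1]
                D[i][j - 1] + 1.0,                          # insert b[j-1]
                D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
            )
    return D[m][n]

def dictionary_match(unit, dictionary, sub_cost):
    """Level 1: pick the dictionary entry with the smallest
    length-normalized distance to the recognized substance unit."""
    return min(
        dictionary,
        key=lambda entry: weighted_edit_distance(unit, entry, sub_cost)
        / max(len(unit), len(entry)),
    )
```

<p>With a low cost for visually confusable pairs such as <code>O</code>/<code>0</code>, a misrecognized &ldquo;H20&rdquo; still matches the dictionary entry &ldquo;H2O&rdquo;.</p>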
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on recognition accuracy at both the character and expression level.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Proposed)</th>
          <th style="text-align: left">Value (Baseline)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>96.4%</strong></td>
          <td style="text-align: left">91.5%</td>
          <td style="text-align: left">&ldquo;Standard&rdquo; dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left"><strong>92.5%</strong></td>
          <td style="text-align: left">88.2%</td>
          <td style="text-align: left">&ldquo;Other&rdquo; (arbitrary) dataset subset.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>HCI-Assisted Accuracy</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">Accuracy after user correction.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Input Devices:</strong> Tablet PCs were used for data collection and testing.</li>
<li><strong>Compute:</strong> Specific training hardware is not listed, but the algorithm is designed for real-time interaction on standard 2008-era computing devices.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, Q., &amp; Zhang, Y. (2008). Recognition of On-line Handwritten Chemical Expressions. <em>2008 IEEE International Joint Conference on Neural Networks</em>, 2360&ndash;2365. <a href="https://doi.org/10.1109/IJCNN.2008.4634125">https://doi.org/10.1109/IJCNN.2008.4634125</a></p>
<p><strong>Publication</strong>: IJCNN 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{jufengyangRecognitionOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Recognition of On-Line Handwritten Chemical Expressions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 {{IEEE International Joint Conference}} on {{Neural Networks}} ({{IEEE World Congress}} on {{Computational Intelligence}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{{Jufeng Yang} and {Guangshun Shi} and {Qingren Wang} and {Yong Zhang}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{2360--2365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Hong Kong, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/IJCNN.2008.4634125}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-1820-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Online Handwritten Chemical Formula Structure Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/wang-online-handwritten-2009/</guid><description>A hierarchical grammar-based approach for recognizing and analyzing online handwritten chemical formulas in mobile education contexts.</description><content:encoded><![CDATA[<h2 id="hierarchical-grammatical-framework-contribution">Hierarchical Grammatical Framework Contribution</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for processing chemical formulas by decomposing them into three hierarchical levels (Formula, Molecule, Text). The contribution is defined by a specific set of formal grammatical rules and parsing algorithms used to construct a &ldquo;grammar spanning tree&rdquo; and &ldquo;molecule spanning graph&rdquo; from online handwritten strokes.</p>
<h2 id="motivation-for-online-formula-recognition">Motivation for Online Formula Recognition</h2>
<p>The primary motivation is the application of mobile computing in chemistry education, where precise comprehension of casual, <em>online</em> handwritten formulas is a significant challenge.</p>
<ul>
<li><strong>2D Complexity</strong>: Unlike 1D text, chemical formulas utilize complex 2D spatial relationships that convey specific chemical meaning (e.g., bonds, rings).</li>
<li><strong>Format Limitations</strong>: Existing storage formats like CML (Chemical Markup Language) or MDL MOLFILE do not natively record the layout or abbreviated information necessary for recognizing handwritten input.</li>
<li><strong>Online Gap</strong>: Previous research focused heavily on <em>offline</em> (image-based) recognition, lacking solutions for <em>online</em> (stroke-based) handwritten chemical formulas (OHCF).</li>
</ul>
<h2 id="core-novelty-in-three-level-grammatical-analysis">Core Novelty in Three-Level Grammatical Analysis</h2>
<p>The core novelty is the <strong>Three-Level Grammatical Analysis</strong> approach:</p>
<ol>
<li><strong>Formula Level (1D)</strong>: Treats the reaction equation as a linear sequence of components (Reactants, Products, Separators), parsed via a context-free grammar to build a spanning tree.</li>
<li><strong>Molecule Level (2D)</strong>: Treats molecules as graphs where &ldquo;text groups&rdquo; are vertices and &ldquo;bonds&rdquo; are edges. It introduces specific handling for &ldquo;hidden Carbon dots&rdquo; (intersections of bonds without text).</li>
<li><strong>Text Level (1D)</strong>: Analyzes the internal structure of text groups (atoms, subscripts).</li>
</ol>
<p>Unique to this approach is the <strong>formal definition of the chemical grammar</strong> as a 5-tuple $G=(T,N,P,M,S)$ and the generation of an <strong>Adjacency Matrix</strong> directly from the handwritten sketch to represent chemical connectivity.</p>
<h2 id="experimental-validation-on-handwritten-strokes">Experimental Validation on Handwritten Strokes</h2>
<p>The authors validated their model using a custom dataset of online handwritten formulas.</p>
<ul>
<li><strong>Data Source</strong>: 25 formulas were randomly selected from a larger pool of 1,250 samples.</li>
<li><strong>Scope</strong>: The test set included 484 total symbols, comprising generators, separators, text symbols, rings, and various bond types.</li>
<li><strong>Granular Validation</strong>: The system was tested at multiple distinct stages:
<ul>
<li>Key Symbol Extraction (Formula Level)</li>
<li>Text Localization (Molecule Level)</li>
<li>Bond End Grouping (Molecule Level)</li>
<li>Text Recognition (Text Level)</li>
</ul>
</li>
</ul>
<h2 id="downstream-impact-and-parsing-accuracy">Downstream Impact and Parsing Accuracy</h2>
<p>The system achieved high accuracy across all sub-tasks, demonstrating that the hierarchical grammar approach is effective for both inorganic and organic formulas.</p>
<ul>
<li><strong>Formula Level</strong>: 98.3% accuracy for Key Symbols; 100% for State-assisted symbols.</li>
<li><strong>Molecule Level</strong>: 98.8% accuracy for Bond End Grouping; 100% for Free End-Text connection detection.</li>
<li><strong>Text Recognition</strong>: 98.7% accuracy (Top-3) using HMMs.</li>
<li><strong>Impact</strong>: The method successfully preserves the writer&rsquo;s &ldquo;online information&rdquo; (habits/intentions) while converting the handwritten input into standard formats suitable for visual editing or data retrieval.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To replicate this work, one would need to implement the specific grammatical production rules and the geometric thresholds defined for bond analysis.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Symbol HMMs</td>
          <td>5,670 samples</td>
          <td>Used to train the text recognition module</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Text Recognition</td>
          <td>2,016 samples</td>
          <td>Test set for character HMMs</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Formula Analysis</td>
          <td>25 formulas</td>
          <td>Random subset of 1,250 collected samples; contains 484 symbols</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Formula Level Parsing</strong></p>
<ul>
<li><strong>HBL Analysis</strong>: Identify the &ldquo;Horizontal Baseline&rdquo; (HBL) containing the most symbols to locate key operators (e.g., $+$, $\rightarrow$).</li>
<li><strong>Grammar</strong>: Use the productions defined in Figure 4. Example rules include:
<ul>
<li>$Reaction ::= ReactantList \ Generator \ ProductList$</li>
<li>$Reactant ::= BalancingNum \ Molecule \ IonicCharacter$</li>
</ul>
</li>
</ul>
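<p>As a toy rendering of the top-level production, the reaction can be split on the generator symbol and the $+$ separators. This is an illustrative sketch, not the paper&rsquo;s parser; the token strings (<code>"-&gt;"</code>, <code>"+"</code>) are placeholder terminals:</p>

```python
def parse_reaction(tokens):
    """Minimal sketch of the Formula-level production
    Reaction ::= ReactantList Generator ProductList.
    Splits a 1-D token sequence on the generator symbol ('->') and
    drops '+' separators. Token names are illustrative, not the
    paper's terminal set."""
    gen = tokens.index("->")
    reactants = [t for t in tokens[:gen] if t != "+"]
    products = [t for t in tokens[gen + 1:] if t != "+"]
    return {"reactants": reactants, "products": products}
```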
<p><strong>2. Molecule Level Analysis (Bond Grouping)</strong></p>
<ul>
<li><strong>Endpoint Classification</strong>: Points are classified as <em>free ends</em>, <em>junctions</em> (3+ bonds), or <em>connections</em> (2 bonds).</li>
<li><strong>Grouping Equation</strong>: An endpoint $(x_k, y_k)$ belongs to Group A based on distance thresholding:
$$
\begin{aligned}
Include(x_k, y_k) = \begin{cases} 1, &amp; d_k &lt; t \cdot \max_j d_j + \partial \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$
Where $d_k$ is the Euclidean distance from $(x_k, y_k)$ to the group center $(x_a, y_a)$, $\max_j d_j$ is the largest such distance among current group members, and $t$, $\partial$ are thresholds.</li>
</ul>
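<p>The grouping rule can be sketched as a small membership predicate. This is a minimal illustration under stated assumptions, not the authors&rsquo; code; the values of <code>t</code> and <code>delta</code> (the paper&rsquo;s $t$ and $\partial$) are placeholders:</p>

```python
import math

def include_in_group(point, group, t=0.5, delta=5.0):
    """Decide whether `point` joins `group` (a list of (x, y) bond
    endpoints): its distance to the group center must stay below
    t * (largest in-group distance) + delta. Threshold values are
    illustrative, not the paper's."""
    # Group center (x_a, y_a) is the mean of current members.
    xa = sum(x for x, _ in group) / len(group)
    ya = sum(y for _, y in group) / len(group)
    dists = [math.hypot(x - xa, y - ya) for x, y in group]
    d_k = math.hypot(point[0] - xa, point[1] - ya)
    return d_k < t * max(dists) + delta
```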
<p><strong>3. Connection Detection</strong></p>
<ul>
<li><strong>Text-Bond Connection</strong>: A text group is connected to a bond if the free end falls within a bounding box expanded by thresholds $t_W$ and $t_H$:
$$
\begin{aligned}
Con(x,y) = \begin{cases} 1, &amp; \min x - t_W &lt; x &lt; \max x + t_W \text{ AND } \min y - t_H &lt; y &lt; \max y + t_H \\ 0, &amp; \text{else} \end{cases}
\end{aligned}
$$</li>
</ul>
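<p>The expanded-bounding-box test maps directly to a few comparisons. A sketch assuming axis-aligned boxes given as <code>(min_x, min_y, max_x, max_y)</code>; the margin defaults are placeholders:</p>

```python
def connects(free_end, text_box, t_w=10.0, t_h=10.0):
    """Test whether a bond's free end (x, y) falls inside a text group's
    bounding box expanded by margins t_w and t_h, per the Con(x, y) rule."""
    x, y = free_end
    min_x, min_y, max_x, max_y = text_box
    return (min_x - t_w < x < max_x + t_w) and (min_y - t_h < y < max_y + t_h)
```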
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Recognition</strong>: Hidden Markov Models (HMM) are used for recognizing individual text symbols.</li>
<li><strong>Grammar</strong>: Context-Free Grammar (CFG) designed with ambiguity elimination to ensure a single valid parse tree for any valid formula.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured by recognition accuracy at specific processing stages:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>F1 (Key Symbol Extraction)</td>
          <td>98.3%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>F2 (State-assisted Symbol)</td>
          <td>100%</td>
          <td>Formula Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M2 (Bond End Grouping)</td>
          <td>98.8%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>M3 (Free End-Text Conn)</td>
          <td>100%</td>
          <td>Molecule Level</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>T1 (Text Recognition)</td>
          <td>98.7%</td>
          <td>Top-3 Accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, X., Shi, G., &amp; Yang, J. (2009). The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1056&ndash;1060. <a href="https://doi.org/10.1109/ICDAR.2009.70">https://doi.org/10.1109/ICDAR.2009.70</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wangUnderstandingStructureAnalyzing2009,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The {{Understanding}} and {{Structure Analyzing}} for {{Online Handwritten Chemical Formulas}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Wang, Xin and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1056--1060}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Barcelona, Spain}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.70}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4244-4500-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>On-line Handwritten Chemical Expression Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/yang-icpr-2008/</guid><description>Two-level algorithm for recognizing on-line handwritten chemical expressions using structural analysis, ANNs, and string edit distance.</description><content:encoded><![CDATA[<h2 id="a-methodological-approach-to-chemical-recognition">A Methodological Approach to Chemical Recognition</h2>
<p>This is a <strong>Method</strong> paper. It proposes a specific &ldquo;novel two-level algorithm&rdquo; and a &ldquo;System model&rdquo; for recognizing chemical expressions. The paper focuses on the architectural design of the recognition pipeline (segmentation, substance recognition, symbol recognition) and validates it against a &ldquo;conventional algorithm&rdquo; baseline, fitting the standard profile of a methodological contribution.</p>
<h2 id="bridging-the-gap-in-pen-based-chemical-input">Bridging the Gap in Pen-Based Chemical Input</h2>
<p>While pen-based computing has advanced for text and mathematical formulas, inputting chemical expressions remains &ldquo;time-consuming&rdquo;. Existing research often lacks &ldquo;adequate chemical knowledge&rdquo; or relies on algorithms that are too slow (global optimization) or structurally weak (local optimization). The authors aim to bridge this gap by integrating chemical domain knowledge into the recognition process to improve speed and accuracy.</p>
<h2 id="two-level-recognition-strategy-for-formulas">Two-Level Recognition Strategy for Formulas</h2>
<p>The core novelty is a <strong>two-level recognition strategy</strong>:</p>
<ol>
<li><strong>Level 1 (Substance Recognition)</strong>: Uses global structural information to identify entire &ldquo;substance units&rdquo; (e.g., $H_2SO_4$) by matching against a dictionary.</li>
<li><strong>Level 2 (Symbol Recognition)</strong>: If Level 1 fails, the system falls back to segmenting the substance into isolated characters and recognizing them individually.</li>
</ol>
<p>Additionally, the method integrates <strong>syntactic features</strong> (chemical knowledge) such as element conservation to validate and correct results and uses specific geometric features to distinguish superscript/subscript relationships.</p>
<h2 id="dataset-collection-and-baseline-comparisons">Dataset Collection and Baseline Comparisons</h2>
<ul>
<li><strong>Dataset Collection</strong>: The authors collected 1197 handwritten expression samples from 20 chemistry professionals and students. This included 983 &ldquo;standard&rdquo; expressions (from 341 templates) and 214 &ldquo;arbitrary&rdquo; expressions written freely.</li>
<li><strong>Comparison</strong>: They compared their &ldquo;Two-level recognition&rdquo; approach against a &ldquo;conventional algorithm&rdquo; that bypasses the first level and segments the input directly into characters.</li>
<li><strong>Metrics</strong>: They measured Material Accuracy (MA), Correct Expressions Number (AEN), and Expression Accuracy (EA).</li>
</ul>
<h2 id="high-accuracy-in-formula-recognition">High Accuracy in Formula Recognition</h2>
<ul>
<li><strong>High Accuracy</strong>: The proposed algorithm achieved <strong>96.4% Material Accuracy (MA)</strong> and <strong>95.7% Expression Accuracy (EA)</strong> on the total test set.</li>
<li><strong>Robustness</strong>: The method performed well on both standard (96.3% EA) and arbitrary (92.5% EA) expressions.</li>
<li><strong>Validation</strong>: The authors conclude the algorithm is &ldquo;reliable,&rdquo; &ldquo;flexible,&rdquo; and suitable for real-time applications compared to prior work.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed two distinct datasets for training and evaluation:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Symbol Training</strong></td>
          <td style="text-align: left">ISF Files</td>
          <td style="text-align: left">12,240 files</td>
          <td style="text-align: left">Used to train the ANN classifier. Covers 102 symbol classes (numerals, letters, operators, organic loops).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Testing</strong></td>
          <td style="text-align: left">Handwritten Expressions</td>
          <td style="text-align: left">1,197 samples</td>
          <td style="text-align: left">983 standard + 214 arbitrary expressions collected from 20 chemistry teachers/students.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Structural Segmentation (Superscript/Subscript)</strong></p>
<p>To distinguish relationships (superscript, subscript, in-line), the authors define geometric parameters based on the bounding boxes of adjacent symbols ($x_{i1}, y_{i1}, x_{i2}, y_{i2}$):</p>
<p>$$d = 0.7 \times y_{12} - y_{22} + 0.3 \times y_{11}$$
$$T = 1000 \times d/h$$
$$B = 1000 \times (B_1 - B_2)/h_1$$</p>
<p>Where $B_1, B_2$ are barycenters and $h$ is height. $(T, B)$ serves as the feature vector for classification.</p>
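<p>Transcribed into code, the $(T, B)$ feature pair looks like the following. The formula for $d$ is copied verbatim from the text; which bounding-box coordinate plays which role is an assumption here, since the paper defines them precisely:</p>

```python
def layout_features(y11, y12, y22, h, h1, b1, b2):
    """(T, B) layout features for superscript/subscript classification.
    y11/y12 are the first symbol's bounding-box y-coordinates, y22 the
    second's; h and h1 are heights, b1/b2 the vertical barycenters.
    The expression for d follows the note verbatim."""
    d = 0.7 * y12 - y22 + 0.3 * y11
    T = 1000 * d / h
    B = 1000 * (b1 - b2) / h1
    return T, B
```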
<p><strong>2. Segmentation Reliability</strong></p>
<p>For segmenting strokes into units, the reliability of a segmentation path is calculated as:</p>
<p>$$Cof(K_{i},n)=\sum_{j=0}^{N}P(k_{j},k_{j+1})+P(S_{K_{i}})+\delta(N)$$</p>
<p>Where $P(k_j, k_{j+1})$ is the reliability of strokes being recognized as symbol $S_{k_j}$.</p>
<p><strong>3. Substance Matching (Level 1)</strong></p>
<p>A modified string edit distance is used to match handwritten input against a dictionary:</p>
<p>$$\lambda_{\overline{u}}=\mu_{i} \times f(Dis(i,j,r)/\sqrt{Max(Len_{i},Len_{j})})$$</p>
<p>Where $\mu_i$ is the recognizer credibility and $Dis(i,j,r)$ is the edit distance.</p>
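<p>The dictionary-matching score can be approximated with a plain Levenshtein distance. A hedged sketch: the paper does not specify the function $f$, so an exponential decay is assumed here, and the credibility $\mu_i$ is passed in as a constant:</p>

```python
import math

def edit_distance(a, b):
    """Plain Levenshtein distance as a stand-in for the paper's Dis(i, j, r)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def match_score(candidate, entry, mu=1.0):
    """Match score in the spirit of the paper's lambda: recognizer
    credibility mu scaled by a decreasing function of the normalized
    edit distance. The choice of exp(-x) for f is an assumption."""
    dis = edit_distance(candidate, entry)
    return mu * math.exp(-dis / math.sqrt(max(len(candidate), len(entry))))
```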
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: An ANN-based classifier is used for isolated symbol recognition.</li>
<li><strong>Input Features</strong>: A set of ~30 features is extracted from strokes, including writing time, interval time, elastic mesh, and stroke outline.</li>
<li><strong>Performance</strong>: The classifier achieved 92.1% accuracy on a test set of 2,702 isolated symbols.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The system was evaluated on the 1,197 expression samples.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Value (Total)</th>
          <th style="text-align: left">Value (Standard)</th>
          <th style="text-align: left">Value (Other)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Material Accuracy (MA)</strong></td>
          <td style="text-align: left">96.4%</td>
          <td style="text-align: left">97.7%</td>
          <td style="text-align: left">94%</td>
          <td style="text-align: left">Accuracy of substance recognition.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Expression Accuracy (EA)</strong></td>
          <td style="text-align: left">95.7%</td>
          <td style="text-align: left">96.3%</td>
          <td style="text-align: left">92.5%</td>
          <td style="text-align: left">Accuracy of full expression recognition.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yang, J., Shi, G., Wang, K., Geng, Q., &amp; Wang, Q. (2008). A Study of On-Line Handwritten Chemical Expressions Recognition. <em>2008 19th International Conference on Pattern Recognition</em>, 1&ndash;4. <a href="https://doi.org/10.1109/ICPR.2008.4761824">https://doi.org/10.1109/ICPR.2008.4761824</a></p>
<p><strong>Publication</strong>: ICPR 2008</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{yangStudyOnlineHandwritten2008,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A Study of On-Line Handwritten Chemical Expressions Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2008 19th {{International Conference}} on {{Pattern Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yang, Jufeng and Shi, Guangshun and Wang, Kai and Geng, Qian and Wang, Qingren}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2008</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1--4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Tampa, FL, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICPR.2008.4761824}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>HMM-based Online Recognition of Chemical Symbols</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/zhang-hmm-handwriting-2009/</guid><description>Online recognition of handwritten chemical symbols using Hidden Markov Models with 11-dimensional local features, achieving 89.5% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper that proposes a specific algorithmic pipeline for the online recognition of handwritten chemical symbols. The core contribution is the engineering of an 11-dimensional feature vector combined with a Hidden Markov Model (HMM) architecture. The paper validates this method through quantitative experiments on a custom dataset, focusing on recognition accuracy as the primary metric.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Recognizing chemical symbols is uniquely challenging due to the complex structure of chemical expressions and the nature of pen-based input, which often produces broken or conglutinated (run-together) strokes. Additionally, variations in writing style and random noise make the task difficult. While online recognition for Western characters and CJK scripts is well-developed, works specifically targeting online chemical symbol recognition are scarce, with most prior research focusing on offline recognition or global optimization.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The primary novelty is the application of continuous HMMs specifically to the domain of <strong>online</strong> chemical symbol recognition, utilizing a specialized set of <strong>11-dimensional local features</strong>. While HMMs have been used for other scripts, this paper tailors the feature extraction (including curliness, linearity, and writing direction) to capture the specific geometric properties of chemical symbols.</p>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors constructed a specific dataset for this task involving 20 participants (college teachers and students).</p>
<ul>
<li><strong>Dataset</strong>: 64 distinct symbols (digits, English letters, Greek letters, operators)</li>
<li><strong>Volume</strong>: 7,808 total samples (122 per symbol), split into 5,670 training samples and 2,016 testing samples</li>
<li><strong>Model Sweeps</strong>: They evaluated the HMM performance by varying the number of states (4, 6, 8) and the number of Gaussians per state (3, 4, 6, 9, 12)</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Performance</strong>: The best configuration (6 states, 9 Gaussians) achieved a <strong>top-1 accuracy of 89.5%</strong> and a <strong>top-3 accuracy of 98.7%</strong></li>
<li><strong>Scaling</strong>: Results showed that generally, increasing the number of states and Gaussians improved accuracy, though at the cost of computational efficiency</li>
<li><strong>Error Analysis</strong>: The primary sources of error were shape similarities between specific characters (e.g., &lsquo;0&rsquo; vs &lsquo;O&rsquo; vs &lsquo;o&rsquo;, and &lsquo;C&rsquo; vs &lsquo;c&rsquo; vs &lsquo;(&rsquo;)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status:</strong> Closed / Very Low Reproducibility. This 2009 study relies on a private, custom-collected dataset and does not provide source code, model weights, or an open-access preprint.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><em>None publicly available</em></td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">N/A</td>
          <td style="text-align: left">No open source code, open datasets, or open-access preprints were released with this publication.</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The study utilized a custom dataset collected in a laboratory environment.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">5,670 samples</td>
          <td style="text-align: left">90 samples per symbol</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">Custom Chemical Symbol Set</td>
          <td style="text-align: left">2,016 samples</td>
          <td style="text-align: left">32 samples per symbol</td>
      </tr>
  </tbody>
</table>
<p><strong>Dataset Composition</strong>: The set includes <strong>64 symbols</strong>: Digits (0-9), Uppercase (A-Z, missing Q), Lowercase (a-z, selected), Greek letters ($\alpha$, $\beta$, $\gamma$, $\pi$), and operators ($+$, $=$, $\rightarrow$, $\uparrow$, $\downarrow$, $($ , $)$).</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Preprocessing</strong></p>
<p>The raw tablet data undergoes a 6-step pipeline:</p>
<ol>
<li><strong>Duplicate Point Elimination</strong>: Removing sequential points with identical coordinates</li>
<li><strong>Broken Stroke Connection</strong>: Using Bezier curves to interpolate missing points/connect broken strokes</li>
<li><strong>Hook Elimination</strong>: Removing artifacts at the start/end of strokes characterized by short length and sharp angle changes</li>
<li><strong>Smoothing</strong>: Reducing noise from erratic pen movement</li>
<li><strong>Re-sampling</strong>: Spacing points equidistantly to remove temporal variation</li>
<li><strong>Size Normalization</strong>: Removing variation in writing scale</li>
</ol>
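<p>Step 5 (re-sampling) is the pipeline stage most amenable to a short sketch. A minimal linear-interpolation version, assuming strokes as lists of $(x, y)$ points; the paper does not give its exact scheme:</p>

```python
import math

def resample(points, spacing):
    """Re-space a stroke's points equidistantly along its arc length,
    removing speed-of-writing variation (step 5 of the pipeline).
    Linear interpolation between consecutive raw points."""
    out = [points[0]]
    dist_rem = spacing  # distance left until the next output point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        while seg >= dist_rem:
            frac = dist_rem / seg
            x0, y0 = x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)
            out.append((x0, y0))
            seg -= dist_rem
            dist_rem = spacing
        dist_rem -= seg
    return out
```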
<p><strong>2. Feature Extraction (11 Dimensions)</strong></p>
<p>Features are extracted from a 5-point window centered on $t$ ($t-2$ to $t+2$). The 11 dimensions are:</p>
<ol>
<li><strong>Normalized Vertical Position</strong>: $y(t)$ mapped to $[0,1]$</li>
<li><strong>Normalized First Derivative ($x'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized First Derivative ($y'$)</strong>: Calculated via weighted sum of neighbors</li>
<li><strong>Normalized Second Derivative ($x''$)</strong>: Computed using $x'$ values</li>
<li><strong>Normalized Second Derivative ($y''$)</strong>: Computed using $y'$ values</li>
<li><strong>Curvature</strong>: $\frac{x'y'' - x''y'}{(x'^2 + y'^2)^{3/2}}$</li>
<li><strong>Writing Direction (Cos)</strong>: $\cos \alpha(t)$ based on vector from $t-1$ to $t+1$</li>
<li><strong>Writing Direction (Sin)</strong>: $\sin \alpha(t)$</li>
<li><strong>Aspect Ratio</strong>: Ratio of height to width in the 5-point window</li>
<li><strong>Curliness</strong>: Deviation from the straight line connecting the first and last point of the window</li>
<li><strong>Linearity</strong>: Average squared distance of points in the window to the straight line connecting start/end points</li>
</ol>
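<p>Three of the window-based features (9&ndash;11) can be sketched directly. The precise normalizations for curliness and linearity are not given in this note, so the definitions below (path/chord deviation; mean squared point-to-chord distance) are assumptions:</p>

```python
import math

def window_features(pts, t):
    """Aspect ratio, curliness, and linearity over the 5-point window
    centred on index t (t-2 .. t+2). Definitions of curliness and
    linearity are assumed, not the paper's exact formulas."""
    w = pts[t - 2 : t + 3]
    xs = [p[0] for p in w]
    ys = [p[1] for p in w]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    aspect = height / width if width else float("inf")
    # Chord from first to last window point vs. traced path length.
    (x0, y0), (x1, y1) = w[0], w[-1]
    chord = math.hypot(x1 - x0, y1 - y0)
    path = sum(math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(w, w[1:]))
    curliness = path / chord - 1.0 if chord else 0.0  # 0 for a straight stroke
    if chord:
        # Perpendicular distance of each point to the chord, squared and averaged.
        linearity = sum(
            (abs((x1 - x0) * (y0 - py) - (x0 - px) * (y1 - y0)) / chord) ** 2
            for px, py in w
        ) / len(w)
    else:
        linearity = 0.0
    return aspect, curliness, linearity
```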
<p><strong>3. Feature Normalization</strong></p>
<p>The final feature matrix $V$ is normalized to zero mean and unit standard deviation using the covariance matrix: $o_t = \Sigma^{-1/2}(v_t - \mu)$.</p>
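<p>The whitening transform can be implemented with an eigendecomposition of the sample covariance matrix. A sketch, assuming one 11-dimensional feature vector per resampled point:</p>

```python
import numpy as np

def whiten(V):
    """Whitening per o_t = Sigma^(-1/2) (v_t - mu): the transformed
    features have zero mean and identity covariance. V has shape
    (T, d), one d-dim feature vector per sample point."""
    mu = V.mean(axis=0)
    sigma = np.cov(V, rowvar=False)
    # Inverse square root via eigendecomposition (sigma is symmetric PSD).
    vals, vecs = np.linalg.eigh(sigma)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return (V - mu) @ inv_sqrt.T
```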
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Continuous Hidden Markov Models (HMM)</li>
<li><strong>Topology</strong>: Left-to-right (Bakis model)</li>
<li><strong>Initialization</strong>: Initial distribution $\pi = (1, 0, \ldots, 0)$; uniform transition matrix $A$; segmental k-means for observation matrix $B$</li>
<li><strong>Training</strong>: Baum-Welch re-estimation</li>
<li><strong>Decision</strong>: Maximum likelihood classification ($\hat{\lambda} = \arg \max P(O|\lambda)$)</li>
</ul>
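<p>Classification evaluates $P(O|\lambda)$ under each symbol&rsquo;s HMM with the forward algorithm and takes the argmax. A log-domain sketch (not the authors&rsquo; implementation); <code>log_B[t, j]</code> is assumed to hold the log Gaussian-mixture emission density of observation $t$ in state $j$:</p>

```python
import numpy as np

def log_forward(log_pi, log_A, log_B):
    """Log-domain forward algorithm: returns log P(O | lambda) for one
    observation sequence. log_B has shape (T, N); classification picks
    the symbol model with the highest returned value."""
    log_alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        # log-sum-exp over previous states, for each next state j
        m = log_alpha.max()
        log_alpha = m + np.log(np.exp(log_alpha - m) @ np.exp(log_A)) + log_B[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())
```

<p>With uniform (log-zero) emissions the state distribution simply propagates through $A$ and the total probability stays 1, which gives a quick sanity check.</p>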
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Best Value</th>
          <th style="text-align: left">Configuration</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Accuracy</strong></td>
          <td style="text-align: left"><strong>89.5%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Highest reported accuracy</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-3 Accuracy</strong></td>
          <td style="text-align: left"><strong>98.7%</strong></td>
          <td style="text-align: left">6 States, 9 Gaussians</td>
          <td style="text-align: left">Top-3 candidate accuracy</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Y., Shi, G., &amp; Yang, J. (2009). HMM-Based Online Recognition of Handwritten Chemical Symbols. <em>2009 10th International Conference on Document Analysis and Recognition</em>, 1255&ndash;1259. <a href="https://doi.org/10.1109/ICDAR.2009.99">https://doi.org/10.1109/ICDAR.2009.99</a></p>
<p><strong>Publication</strong>: ICDAR 2009</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2009hmm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{HMM-Based Online Recognition of Handwritten Chemical Symbols}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2009 10th International Conference on Document Analysis and Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Yang and Shi, Guangshun and Yang, Jufeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{75}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1255--1259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICDAR.2009.99}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Handwritten Chemical Symbol Recognition Using SVMs</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/tang-online-symbol-2013/</guid><description>A hybrid SVM and elastic matching approach for recognizing handwritten chemical symbols drawn on touch devices, achieving 89.7% top-1 accuracy.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-taxonomy">Paper Contribution and Taxonomy</h2>
<p>This is a <strong>Method</strong> paper according to the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI for Physical Sciences taxonomy</a>.</p>
<ul>
<li><strong>Dominant Basis</strong>: The authors propose a novel hybrid architecture (SVM-EM) that combines two existing techniques to solve a specific recognition problem.</li>
<li><strong>Rhetorical Indicators</strong>: The paper explicitly defines algorithms (Algorithm 1 &amp; 2), presents a system architecture, and validates the method via ablation studies comparing the hybrid approach against its individual components.</li>
</ul>
<h2 id="motivation-for-pen-based-input">Motivation for Pen-Based Input</h2>
<p>Entering chemical expressions on digital devices is difficult due to their complex 2D spatial structure.</p>
<ul>
<li><strong>The Problem</strong>: While handwriting recognition for text and math is mature, chemical structures involve unique symbols and spatial arrangements that existing tools struggle to process efficiently.</li>
<li><strong>Existing Solutions</strong>: Standard tools (like ChemDraw) rely on point-and-click interactions, which are described as complicated and non-intuitive compared to direct handwriting.</li>
<li><strong>Goal</strong>: To enable fluid handwriting input on pen/touch-based devices (like iPads) by accurately recognizing individual chemical symbols in real-time.</li>
</ul>
<h2 id="novelty-hybrid-svm-and-elastic-matching">Novelty: Hybrid SVM and Elastic Matching</h2>
<p>The core contribution is the <strong>Hybrid SVM-EM</strong> approach, which splits recognition into a coarse classification stage and a fine-grained verification stage.</p>
<ul>
<li><strong>Two-Stage Pipeline</strong>:
<ol>
<li><strong>SVM Recognition</strong>: Uses statistical features (stroke count, turning angles) to generate a short-list of candidate symbols.</li>
<li><strong>Elastic Matching (EM)</strong>: Uses a geometric point-to-point distance metric to re-rank these candidates against a library of stored symbol prototypes.</li>
</ol>
</li>
<li><strong>Online Stroke Partitioning</strong>: A heuristic-based method to group strokes into symbols in real-time based on time adjacency (grouping the last $n$ strokes) and spatial intersection checks, without waiting for the user to finish the entire drawing.</li>
</ul>
<h2 id="experimental-design-and-data-collection">Experimental Design and Data Collection</h2>
<p>The authors conducted a user study to collect data and evaluate the system:</p>
<ul>
<li><strong>Participants</strong>: 10 users were recruited to write chemical symbols on an iPad.</li>
<li><strong>Task</strong>: Each user wrote 78 distinct chemical symbols (digits, letters, bonds) 3 times each.</li>
<li><strong>Baselines</strong>: The hybrid method was compared against two baselines:
<ol>
<li>SVM only</li>
<li>Elastic Matching only.</li>
</ol>
</li>
<li><strong>Metrics</strong>: Evaluation focused on <strong>Precision@k</strong> (where $k=1, 3, 5$), measuring how often the correct symbol appeared in the top-$k$ suggestions.</li>
</ul>
<h2 id="recognition-performance-and-outcomes">Recognition Performance and Outcomes</h2>
<p>The hybrid approach demonstrated improved performance compared to using either technique in isolation.</p>
<ul>
<li><strong>Key Results</strong>:
<ul>
<li><strong>Hybrid SVM-EM</strong>: 89.7% Precision@1 (Top-1 accuracy).</li>
<li><strong>SVM Only</strong>: 85.1% Precision@1.</li>
<li><strong>EM Only</strong>: 76.7% Precision@1.</li>
</ul>
</li>
<li><strong>Category Performance</strong>: The system performed best on Operators (91.9%) and Digits (91.3%), with slightly lower performance on Alphabetic characters (88.6%).</li>
<li><strong>Impact</strong>: The system was successfully implemented as a real-time iOS application, allowing users to draw complex structures that are then converted to SMILES strings (e.g., $C\#CC(O)$).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study generated a custom dataset for training and evaluation.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Stats</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>2,340 samples</td>
          <td>Collected from 10 users. Consists of <strong>78 unique symbols</strong>: 10 digits (0-9), 52 letters (A-Z, a-z), and 16 bonds/operators (e.g., $=$, $+$, hash bonds).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Unspecified size</td>
          <td>A &ldquo;Chemical Elastic Symbol Library&rdquo; was created containing samples of all supported symbols to serve as prototypes for the Elastic Matching step.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The pipeline consists of four distinct algorithmic steps:</p>
<p><strong>1. Stroke Partitioning</strong></p>
<ul>
<li><strong>Logic</strong>: Groups the most recently written stroke with up to the last 4 previous strokes.</li>
<li><strong>Filtering</strong>: Invalid groups are removed using &ldquo;Spatial Distance Checking&rdquo; (strokes too far apart) and &ldquo;Stroke Intersection Checking&rdquo; (strokes that don&rsquo;t intersect where expected).</li>
</ul>
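<p>The partitioning heuristic can be sketched as follows. The bounding-box gap test and its threshold are stand-ins for the paper's &ldquo;Spatial Distance Checking&rdquo; (the stroke-intersection check is omitted); values and names are illustrative:</p>

```python
def bbox(stroke):
    """Axis-aligned bounding box of a stroke (list of (x, y) points)."""
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_gap(a, b):
    """Separation between two bounding boxes (0 if they overlap)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    dx = max(bx0 - ax1, ax0 - bx1, 0)
    dy = max(by0 - ay1, ay0 - by1, 0)
    return (dx ** 2 + dy ** 2) ** 0.5

def candidate_groups(strokes, max_prev=4, max_gap=20.0):
    """Group the newest stroke with up to the last `max_prev` strokes,
    dropping groups whose strokes lie too far from the newest one."""
    newest = strokes[-1]
    groups = [[newest]]
    for n in range(1, min(max_prev, len(strokes) - 1) + 1):
        group = strokes[-(n + 1):]
        # Spatial Distance Checking: every stroke must lie near the newest
        if all(bbox_gap(bbox(s), bbox(newest)) <= max_gap for s in group):
            groups.append(group)
    return groups
```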
<p><strong>2. Preprocessing</strong></p>
<ul>
<li><strong>Size Normalization</strong>: Scales symbol to a standard size based on its bounding box.</li>
<li><strong>Smoothing</strong>: Uses average smoothing (replacing mid-points with the average of neighbors) to remove jitter.</li>
<li><strong>Sampling</strong>: Resamples valid strokes to a fixed number of <strong>50 points</strong>.</li>
</ul>
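<p>A sketch of these three preprocessing steps in NumPy (the 50-point resampling count is from the paper; the 100-unit target size is an assumption):</p>

```python
import numpy as np

def normalize_size(stroke, size=100.0):
    """Scale a stroke into a size x size box anchored at the origin."""
    pts = np.asarray(stroke, dtype=float)
    lo = pts.min(axis=0)
    span = (pts.max(axis=0) - lo).max()
    return (pts - lo) * (size / span if span > 0 else 1.0)

def smooth(stroke):
    """Average smoothing: each interior point becomes the mean of itself
    and its two neighbours."""
    pts = np.asarray(stroke, dtype=float)
    out = pts.copy()
    out[1:-1] = (pts[:-2] + pts[1:-1] + pts[2:]) / 3.0
    return out

def resample(stroke, n=50):
    """Resample to n points equally spaced along the stroke's arc length."""
    pts = np.asarray(stroke, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    t = np.linspace(0.0, cum[-1], n)
    return np.stack([np.interp(t, cum, pts[:, 0]),
                     np.interp(t, cum, pts[:, 1])], axis=1)
```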
<p><strong>3. SVM Feature Extraction</strong></p>
<ul>
<li><strong>Horizontal Angle</strong>: Calculated between two consecutive points ($P_1, P_2$). Values are binned into 12 groups ($30^{\circ}$ each).</li>
<li><strong>Turning Angle</strong>: The difference between two consecutive horizontal angles. Values are binned into 18 groups ($10^{\circ}$ each).</li>
<li><strong>Features</strong>: Input vector consists of stroke count, normalized coordinates, and the percentage of angles falling into the histograms described above.</li>
</ul>
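<p>A sketch of the two angle histograms. Sign handling for the turning angle is not fully specified in the summary, so this version bins the turning-angle magnitude over $[0^{\circ}, 180^{\circ}]$; names are illustrative:</p>

```python
import numpy as np

def angle_histograms(points):
    """Angle features over a resampled stroke (n x 2 array).

    Returns (h12, t18): horizontal angles binned into 12 x 30-degree bins,
    and turning-angle magnitudes binned into 18 x 10-degree bins, each
    normalized to fractions ('percentage of angles' per bin).
    """
    pts = np.asarray(points, dtype=float)
    d = np.diff(pts, axis=0)
    horiz = np.degrees(np.arctan2(d[:, 1], d[:, 0])) % 360.0   # [0, 360)
    h12 = np.bincount((horiz // 30).astype(int), minlength=12)
    turn = np.diff(horiz)
    turn = np.abs((turn + 180.0) % 360.0 - 180.0)              # magnitude in [0, 180]
    t18 = np.bincount(np.clip(turn // 10, 0, 17).astype(int), minlength=18)
    return h12 / max(h12.sum(), 1), t18 / max(t18.sum(), 1)
```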
<p><strong>4. Elastic Matching (Verification)</strong></p>
<ul>
<li><strong>Distance Function</strong>: Euclidean distance summation between the points of the candidate symbol ($s$) and the partitioned input ($s_p$).
$$
\begin{aligned}
D(s, s_p) = \sum_{j=1}^{n} \sqrt{(x_{s,j} - x_{p,j})^2 + (y_{s,j} - y_{p,j})^2}
\end{aligned}
$$
<em>Note: The paper formula sums the distances; $n$ is the number of points (50).</em></li>
<li><strong>Ranking</strong>: Candidates are re-ranked in ascending order of this elastic distance.</li>
</ul>
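<p>The verification step reduces to a few lines. Here <code>library</code> maps each symbol to a stored prototype resampled to the same 50 points as the input; names are illustrative:</p>

```python
import numpy as np

def elastic_distance(s, s_p):
    """D(s, s_p): sum of point-to-point Euclidean distances between two
    symbols resampled to the same number of points."""
    s, s_p = np.asarray(s, float), np.asarray(s_p, float)
    return float(np.linalg.norm(s - s_p, axis=1).sum())

def rerank(candidates, library, s_p):
    """Re-rank SVM candidates by ascending elastic distance between the
    input s_p and each candidate's stored prototype."""
    return sorted(candidates, key=lambda c: elastic_distance(library[c], s_p))
```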
<h3 id="models">Models</h3>
<ul>
<li><strong>Classifier</strong>: Linear Support Vector Machine (SVM) implemented using <strong>LibSVM</strong>.</li>
<li><strong>Symbol Library</strong>: A &ldquo;Chemical Elastic Symbol Library&rdquo; stores the raw stroke point sequences for all 78 supported symbols to enable the elastic matching comparison.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured using precision at different ranks (Top-N accuracy).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Precision@1</strong></td>
          <td><strong>89.7%</strong></td>
          <td>85.1% (SVM)</td>
          <td>Hybrid model reduces error rate significantly over baselines.</td>
      </tr>
      <tr>
          <td><strong>Precision@3</strong></td>
          <td><strong>94.1%</strong></td>
          <td>N/A</td>
          <td>High recall in top 3 allows users to quickly correct errors via UI selection.</td>
      </tr>
      <tr>
          <td><strong>Precision@5</strong></td>
          <td><strong>94.6%</strong></td>
          <td>N/A</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Device</strong>: Apple iPad (iOS platform).</li>
<li><strong>Input</strong>: Touch/Pen-based input recording digital ink (x, y coordinates and pen-up/down events).</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, P., Hui, S. C., &amp; Fu, C. W. (2013). Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition. <em>2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)</em>, 535&ndash;540. <a href="https://doi.org/10.1109/ICIS.2013.6607894">https://doi.org/10.1109/ICIS.2013.6607894</a></p>
<p><strong>Publication</strong>: IEEE ICIS 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{tangOnlineChemicalSymbol2013,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Online Chemical Symbol Recognition for Handwritten Chemical Expression Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Tang, Peng and Hui, Siu Cheung and Fu, Chi-Wing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2013</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{535--540}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IEEE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/ICIS.2013.6607894}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemInk: Real-Time Recognition for Chemical Drawings</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/online-recognition/chemink-2011/</guid><description>A sketch recognition framework for chemical diagrams using a joint CRF model to combine multi-level visual features for real-time interpretation.</description><content:encoded><![CDATA[<h2 id="contribution-real-time-sketch-recognition-method">Contribution: Real-Time Sketch Recognition Method</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel architectural framework for sketch recognition that integrates visual features at three distinct levels (inkpoints, segments, symbols) into a single probabilistic model. The rhetorical structure centers on the proposal of this new architecture, the introduction of a specific &ldquo;trainable corner detector&rdquo; algorithm, and the validation of these methods against existing benchmarks and alternative toolsets (ChemDraw).</p>
<h2 id="motivation-bridging-the-gap-between-sketching-and-cad">Motivation: Bridging the Gap Between Sketching and CAD</h2>
<p>The primary motivation is to bridge the gap between the natural, efficient process of drawing chemical diagrams by hand and the cumbersome &ldquo;point-click-and-drag&rdquo; interactions required by CAD tools like ChemDraw. While chemists prefer sketching for communication, existing digital tools do not offer the same speed or ease of use. The goal is to build an intelligent system that understands freehand sketches in real-time, converting them into structured data suitable for analysis or search.</p>
<h2 id="core-innovation-hierarchical-joint-crf-model">Core Innovation: Hierarchical Joint CRF Model</h2>
<p>The core novelty lies in the <strong>hierarchical joint model</strong>. Unlike previous approaches that might treat stroke segmentation and symbol recognition as separate, isolated steps, ChemInk uses a <strong>Conditional Random Field (CRF)</strong> to jointly model dependencies across three levels:</p>
<ol>
<li><strong>Inkpoints</strong>: Local visual appearance.</li>
<li><strong>Segments</strong>: Stroke fragments separated by corners.</li>
<li><strong>Candidates</strong>: Potential symbol groupings.</li>
</ol>
<p>Additionally, the paper introduces a <strong>trainable corner detector</strong> that learns domain-specific corner definitions from data.</p>
<h2 id="experimental-design-and-baselines">Experimental Design and Baselines</h2>
<p>The authors conducted two primary evaluations:</p>
<ol>
<li><strong>Off-line Accuracy Evaluation</strong>:
<ul>
<li><strong>Dataset</strong>: 12 real-world organic compounds drawn by 10 participants.</li>
<li><strong>Metric</strong>: Recognition accuracy (Recall and Precision).</li>
<li><strong>Baseline</strong>: Comparison against their own previous work (O&amp;D 2009) and ablations (with/without context).</li>
</ul>
</li>
<li><strong>On-line User Study</strong>:
<ul>
<li><strong>Task</strong>: 9 participants (chemistry students) drew 5 diagrams using both ChemInk (Tablet PC) and ChemDraw (Mouse/Keyboard).</li>
<li><strong>Metric</strong>: Time to completion and subjective user ratings (speed/ease of use).</li>
</ul>
</li>
</ol>
<h2 id="results-accuracy-and-user-study-outcomes">Results: Accuracy and User Study Outcomes</h2>
<ul>
<li><strong>Accuracy</strong>: The system achieved <strong>97.4% symbol recognition accuracy</strong>, slightly outperforming the best prior result (97.1%). The trainable corner detector achieved <strong>99.91% recall</strong>.</li>
<li><strong>Speed</strong>: Users were <strong>twice as fast</strong> using ChemInk (avg. 36s) compared to ChemDraw (avg. 79s).</li>
<li><strong>Usability</strong>: Participants rated ChemInk significantly higher for speed (6.3 vs 4.5) and ease of use (6.3 vs 4.7) on a 7-point scale.</li>
<li><strong>Conclusion</strong>: Sketch recognition is a viable, superior alternative to standard CAD tools for authoring chemical diagrams.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training/Test Data</strong>: 12 real-world organic compounds (e.g., Aspirin, Penicillin) drawn by 10 participants (organic chemistry familiar).</li>
<li><strong>Evaluation Split</strong>: User-independent cross-validation (training on 9 users, testing on 1).</li>
<li><strong>Input</strong>: Raw digital ink (strokes) collected on a Tablet PC.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Corner Detection (Trainable)</strong></p>
<ul>
<li><strong>Method</strong>: Iterative vertex elimination.</li>
<li><strong>Cost Function</strong>: $\mathrm{cost}(p_{i}) = \sqrt{\mathrm{mse}(s_{i}; p_{i-1}, p_{i+1})} \cdot \mathrm{dist}(p_{i}; p_{i-1}, p_{i+1})$</li>
<li><strong>Procedure</strong>: Repeatedly remove the vertex with the lowest cost until the classifier (trained on features like cost, diagonal length, ink density) predicts the remaining vertices are corners.</li>
</ul>
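<p>A runnable sketch of the elimination loop, with two simplifications: the mse term is computed over the three vertices $p_{i-1}, p_i, p_{i+1}$ rather than the full ink segment, and a fixed cost threshold stands in for the trained classifier:</p>

```python
import numpy as np

def _dist_to_line(p, a, b):
    """Distance from point p to the line through a and b."""
    ab, ap = b - a, p - a
    denom = np.linalg.norm(ab)
    if denom == 0:
        return float(np.linalg.norm(ap))
    return float(abs(ab[0] * ap[1] - ab[1] * ap[0]) / denom)

def vertex_cost(pts, i):
    """cost(p_i) = sqrt(mse(s_i; p_{i-1}, p_{i+1})) * dist(p_i; p_{i-1}, p_{i+1})."""
    a, b = pts[i - 1], pts[i + 1]
    mse = np.mean([_dist_to_line(q, a, b) ** 2 for q in pts[i - 1:i + 2]])
    return np.sqrt(mse) * _dist_to_line(pts[i], a, b)

def detect_corners(points, threshold=1.0):
    """Iteratively remove the cheapest interior vertex until every remaining
    vertex's cost exceeds the threshold; survivors are the corners."""
    pts = [np.asarray(p, dtype=float) for p in points]
    while len(pts) > 2:
        costs = [vertex_cost(pts, i) for i in range(1, len(pts) - 1)]
        i_min = int(np.argmin(costs))
        if costs[i_min] > threshold:
            break
        del pts[i_min + 1]                # costs[i] belongs to vertex i + 1
    return pts
```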
<p><strong>2. Feature Extraction</strong></p>
<ul>
<li><strong>Inkpoints</strong>: Sampled at regular intervals. Features = $10 \times 10$ pixel orientation filters (0, 45, 90, 135 degrees) at two scales ($L/2$, $L$), smoothed and downsampled to $5 \times 5$. Total 400 features.</li>
<li><strong>Segments</strong>: Similar image features centered at segment midpoint, plus geometric features (length, ink density).</li>
<li><strong>Candidates</strong>: 5 feature images ($20 \times 20$) including an &ldquo;endpoint&rdquo; image, stretched to normalize aspect ratio.</li>
<li><strong>Dimensionality Reduction</strong>: PCA used to compress feature images to 256 components.</li>
</ul>
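<p>The PCA compression step can be sketched with an SVD, fitting the statistics on the training feature images themselves (names are illustrative):</p>

```python
import numpy as np

def fit_pca(X, k=256):
    """Fit PCA on rows of X; return the (k, d) projection matrix and mean.

    Rows of W are the top-k principal axes of the centered data.
    """
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return Vt[:k], mu

def pca_compress(X, W, mu):
    """Project feature images onto the fitted components."""
    return (X - mu) @ W.T
```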
<p><strong>3. Structure Generation</strong></p>
<ul>
<li><strong>Clustering</strong>: Agglomerative clustering with a complete-link metric to connect symbols.</li>
<li><strong>Threshold</strong>: Stop clustering at distance $0.4L$.</li>
</ul>
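<p>With 2-D symbol centroids as a stand-in for the paper's symbol-to-symbol distance, the clustering step maps directly onto SciPy's hierarchy tools ($L$ here is the length scale the threshold is expressed in):</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_symbols(centers, L):
    """Complete-link agglomerative clustering, cut at distance 0.4 * L.

    Returns an integer cluster label per symbol (labels start at 1).
    """
    Z = linkage(pdist(np.asarray(centers, dtype=float)), method="complete")
    return fcluster(Z, t=0.4 * L, criterion="distance")
```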
<h3 id="models">Models</h3>
<p><strong>Conditional Random Field (CRF)</strong></p>
<ul>
<li><strong>Structure</strong>: 3-level hierarchy (Inkpoints $V_p$, Segments $V_s$, Candidates $V_c$).</li>
<li><strong>Nodes</strong>:
<ul>
<li>$V_p, V_s$ labels: &ldquo;bond&rdquo;, &ldquo;hash&rdquo;, &ldquo;wedge&rdquo;, &ldquo;text&rdquo;.</li>
<li>$V_c$ labels: specific candidate interpretations.</li>
</ul>
</li>
<li><strong>Edges/Potentials</strong>:
<ul>
<li><strong>Entity-Feature</strong>: $\phi(y, x)$ (Linear classifier).</li>
<li><strong>Consistency</strong>: $\psi(y_i, y_j)$ (Hard constraint: child must match parent label).</li>
<li><strong>Spatial Context</strong>: $\psi_{ss}(y_i, y_j)$ (Pairwise geometric relationships between segments: angle, distance).</li>
<li><strong>Overlap</strong>: Prevents conflicting candidates from sharing segments.</li>
</ul>
</li>
<li><strong>Inference</strong>: Loopy Belief Propagation (up to 100 iterations).</li>
<li><strong>Training</strong>: Maximum Likelihood via gradient ascent (L-BFGS).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Accuracy (Recall/Precision) of symbol detection.</li>
<li><strong>Comparison</strong>: Compared against Ouyang &amp; Davis 2009 (previous SOTA).</li>
<li><strong>Speed Metric</strong>: Wall-clock time for diagram creation (ChemInk vs. ChemDraw).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Processor</strong>: 3.7 GHz processor (single thread) for base benchmarking (approx. 1 sec/sketch).</li>
<li><strong>Deployment</strong>: Validated on 1.8 GHz Tablet PCs using multi-core parallelization for real-time feedback.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ouyang, T. Y., &amp; Davis, R. (2011). ChemInk: A Natural Real-Time Recognition System for Chemical Drawings. <em>Proceedings of the 16th International Conference on Intelligent User Interfaces</em>, 267&ndash;276. <a href="https://doi.org/10.1145/1943403.1943444">https://doi.org/10.1145/1943403.1943444</a></p>
<p><strong>Publication</strong>: IUI &rsquo;11</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ouyangChemInkNaturalRealtime2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ChemInk: A Natural Real-Time Recognition System for Chemical Drawings}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{ChemInk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 16th International Conference on Intelligent User Interfaces}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ouyang, Tom Y. and Davis, Randall}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{267--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Palo Alto, CA, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/1943403.1943444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-1-4503-0419-1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://hdl.handle.net/1721.1/78898}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>