Online Handwritten Chemical Formula Structure Analysis

Paper Information

Citation: Wang, X., Shi, G., & Yang, J. (2009). The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas. 2009 10th International Conference on Document Analysis and Recognition, 1056-1060. https://doi.org/10.1109/ICDAR.2009.70

Publication: ICDAR 2009

Hierarchical Grammatical Framework Contribution

This is a Method paper. It proposes a novel architectural framework for processing chemical formulas by decomposing them into three hierarchical levels (Formula, Molecule, Text). The contribution is defined by a specific set of formal grammatical rules and parsing algorithms used to construct a “grammar spanning tree” and “molecule spanning graph” from online handwritten strokes.

Motivation for Online Formula Recognition

The primary motivation is the application of mobile computing in chemistry education, where precise comprehension of casual, online handwritten formulas is a significant challenge.

2D Complexity: Unlike 1D text, chemical formulas utilize complex 2D spatial relationships that convey specific chemical meaning (e.g., bonds, rings).
Format Limitations: Existing storage formats like CML (Chemical Markup Language) or MDL MOLFILE do not natively record the layout or abbreviated information necessary for recognizing handwritten input.
Online Gap: Previous research focused heavily on offline (image-based) recognition, lacking solutions for online (stroke-based) handwritten chemical formulas (OHCF).

Core Novelty in Three-Level Grammatical Analysis

The core novelty is the Three-Level Grammatical Analysis approach:

Formula Level (1D): Treats the reaction equation as a linear sequence of components (Reactants, Products, Separators), parsed via a context-free grammar to build a spanning tree.
Molecule Level (2D): Treats molecules as graphs where “text groups” are vertices and “bonds” are edges. It introduces specific handling for “hidden Carbon dots” (intersections of bonds without text).
Text Level (1D): Analyzes the internal structure of text groups (atoms, subscripts).

Unique to this approach is the formal definition of the chemical grammar as a 5-tuple $G=(T,N,P,M,S)$ and the generation of an Adjacency Matrix directly from the handwritten sketch to represent chemical connectivity.

Experimental Validation on Handwritten Strokes

The authors validated their model using a custom dataset of online handwritten formulas.

Data Source: 25 formulas were randomly selected from a larger pool of 1,250 samples.
Scope: The test set included 484 total symbols, comprising generators, separators, text symbols, rings, and various bond types.
Granular Validation: The system was tested at multiple distinct stages:
- Key Symbol Extraction (Formula Level)
- Text Localization (Molecule Level)
- Bond End Grouping (Molecule Level)
- Text Recognition (Text Level)

Downstream Impact and Parsing Accuracy

The system achieved high accuracy across all sub-tasks, demonstrating that the hierarchical grammar approach is effective for both inorganic and organic formulas.

Formula Level: 98.3% accuracy for Key Symbols; 100% for State-assisted symbols.
Molecule Level: 98.8% accuracy for Bond End Grouping; 100% for Free End-Text connection detection.
Text Recognition: 98.7% accuracy (Top-3) using HMMs.
Impact: The method successfully preserves the writer’s “online information” (habits/intentions) while converting the handwritten input into standard formats suitable for visual editing or data retrieval.

Reproducibility Details

To replicate this work, one would need to implement the specific grammatical production rules and the geometric thresholds defined for bond analysis.

Data

Purpose	Dataset	Size	Notes
Training	Symbol HMMs	5,670 samples	Used to train the text recognition module
Testing	Text Recognition	2,016 samples	Test set for character HMMs
Testing	Formula Analysis	25 formulas	Random subset of 1,250 collected samples; contains 484 symbols

Algorithms

1. Formula Level Parsing

HBL Analysis: Identify the “Horizontal Baseline” (HBL) containing the most symbols to locate key operators (e.g., $+$, $\rightarrow$).
Grammar: Use the productions defined in Figure 4. Example rules include:
- $Reaction ::= ReactantList \ Generator \ ProductList$
- $Reactant ::= BalancingNum \ Molecule \ IonicCharacter$

2. Molecule Level Analysis (Bond Grouping)

Endpoint Classification: Points are classified as free ends, junctions (3+ bonds), or connections (2 bonds).
Grouping Equation: An endpoint $(x_k, y_k)$ belongs to Group A based on distance thresholding: $$ \begin{aligned} Include(x_0, y_0) = \begin{cases} 1, & d_0 < t \cdot \max d_x + \partial \\ 0, & \text{else} \end{cases} \end{aligned} $$ Where $d_k$ is the Euclidean distance to the group center $(x_a, y_a)$.

3. Connection Detection

Text-Bond Connection: A text group is connected to a bond if the free end falls within a bounding box expanded by thresholds $t_W$ and $t_H$: $$ \begin{aligned} Con(x,y) = \begin{cases} 1, & \min x - t_W < x < \max x + t_W \text{ AND } \min y - t_H < y < \max y + t_H \\ 0, & \text{else} \end{cases} \end{aligned} $$

Models

Text Recognition: Hidden Markov Models (HMM) are used for recognizing individual text symbols.
Grammar: Context-Free Grammar (CFG) designed with ambiguity elimination to ensure a single valid parse tree for any valid formula.

Evaluation

Performance is measured by recognition accuracy at specific processing stages:

Metric	Task	Value	Notes
Accuracy	F1 (Key Symbol Extraction)	98.3%	Formula Level
Accuracy	F2 (State-assisted Symbol)	100%	Formula Level
Accuracy	M2 (Bond End Grouping)	98.8%	Molecule Level
Accuracy	M3 (Free End-Text Conn)	100%	Molecule Level
Accuracy	T1 (Text Recognition)	98.7%	Top-3 Accuracy

Citation

@inproceedings{wangUnderstandingStructureAnalyzing2009,
  title = {The {{Understanding}} and {{Structure Analyzing}} for {{Online Handwritten Chemical Formulas}}},
  booktitle = {2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}},
  author = {Wang, Xin and Shi, Guangshun and Yang, Jufeng},
  year = {2009},
  pages = {1056--1060},
  publisher = {IEEE},
  address = {Barcelona, Spain},
  doi = {10.1109/ICDAR.2009.70},
  isbn = {978-1-4244-4500-4},
  langid = {english}
}

Paper Information#

Hierarchical Grammatical Framework Contribution#

Motivation for Online Formula Recognition#

Core Novelty in Three-Level Grammatical Analysis#

Experimental Validation on Handwritten Strokes#

Downstream Impact and Parsing Accuracy#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Citation#