Paper Information

Citation: Wang, X., Shi, G., & Yang, J. (2009). The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas. 2009 10th International Conference on Document Analysis and Recognition, 1056-1060. https://doi.org/10.1109/ICDAR.2009.70

Publication: ICDAR 2009

What kind of paper is this?

This is a Method paper. It proposes a novel architectural framework for processing chemical formulas by decomposing them into three hierarchical levels (Formula, Molecule, Text). The contribution is defined by a specific set of formal grammatical rules and parsing algorithms used to construct a “grammar spanning tree” and “molecule spanning graph” from online handwritten strokes.

What is the motivation?

The primary motivation is the application of mobile computing in chemistry education, where precise comprehension of casual, online handwritten formulas is a significant challenge.

  • 2D Complexity: Unlike 1D text, chemical formulas utilize complex 2D spatial relationships that convey specific chemical meaning (e.g., bonds, rings).
  • Format Limitations: Existing storage formats like CML (Chemical Markup Language) or MDL MOLFILE do not natively record the layout or abbreviated information necessary for recognizing handwritten input.
  • Online Gap: Previous research focused heavily on offline (image-based) recognition, lacking solutions for online (stroke-based) handwritten chemical formulas (OHCF).

What is the novelty here?

The core novelty is the Three-Level Grammatical Analysis approach:

  1. Formula Level (1D): Treats the reaction equation as a linear sequence of components (Reactants, Products, Separators), parsed via a context-free grammar to build a spanning tree.
  2. Molecule Level (2D): Treats molecules as graphs where “text groups” are vertices and “bonds” are edges. It introduces specific handling for “hidden Carbon dots” (intersections of bonds without text).
  3. Text Level (1D): Analyzes the internal structure of text groups (atoms, subscripts).

Unique to this approach is the formal definition of the chemical grammar as a 5-tuple $G=(T,N,P,M,S)$ and the generation of an Adjacency Matrix directly from the handwritten sketch to represent chemical connectivity.

What experiments were performed?

The authors validated their model using a custom dataset of online handwritten formulas.

  • Data Source: 25 formulas were randomly selected from a larger pool of 1,250 samples.
  • Scope: The test set included 484 total symbols, comprising generators, separators, text symbols, rings, and various bond types.
  • Granular Validation: The system was tested at multiple distinct stages:
    • Key Symbol Extraction (Formula Level)
    • Text Localization (Molecule Level)
    • Bond End Grouping (Molecule Level)
    • Text Recognition (Text Level)

What were the outcomes and conclusions drawn?

The system achieved high accuracy across all sub-tasks, demonstrating that the hierarchical grammar approach is effective for both inorganic and organic formulas.

  • Formula Level: 98.3% accuracy for Key Symbols; 100% for State-assisted symbols.
  • Molecule Level: 98.8% accuracy for Bond End Grouping; 100% for Free End-Text connection detection.
  • Text Recognition: 98.7% accuracy (Top-3) using HMMs.
  • Impact: The method successfully preserves the writer’s “online information” (habits/intentions) while converting the handwritten input into standard formats suitable for visual editing or data retrieval.

Reproducibility Details

To replicate this work, one would need to implement the specific grammatical production rules and the geometric thresholds defined for bond analysis.

Data

PurposeDatasetSizeNotes
TrainingSymbol HMMs5,670 samplesUsed to train the text recognition module
TestingText Recognition2,016 samplesTest set for character HMMs
TestingFormula Analysis25 formulasRandom subset of 1,250 collected samples; contains 484 symbols

Algorithms

1. Formula Level Parsing

  • HBL Analysis: Identify the “Horizontal Baseline” (HBL) containing the most symbols to locate key operators (e.g., $+$, $\rightarrow$).
  • Grammar: Use the productions defined in Figure 4. Example rules include:
    • $Reaction ::= ReactantList \ Generator \ ProductList$
    • $Reactant ::= BalancingNum \ Molecule \ IonicCharacter$

2. Molecule Level Analysis (Bond Grouping)

  • Endpoint Classification: Points are classified as free ends, junctions (3+ bonds), or connections (2 bonds).
  • Grouping Equation: An endpoint $(x_k, y_k)$ belongs to Group A based on distance thresholding: $$Include(x_0, y_0) = \begin{cases} 1, & d_0 < t \cdot \max d_x + \partial \ 0, & \text{else} \end{cases}$$ Where $d_k$ is the Euclidean distance to the group center $(x_a, y_a)$.

3. Connection Detection

  • Text-Bond Connection: A text group is connected to a bond if the free end falls within a bounding box expanded by thresholds $t_W$ and $t_H$: $$Con(x,y) = \begin{cases} 1, & \min x - t_W < x < \max x + t_W \text{ AND } \min y - t_H < y < \max y + t_H \ 0, & \text{else} \end{cases}$$

Models

  • Text Recognition: Hidden Markov Models (HMM) are used for recognizing individual text symbols.
  • Grammar: Context-Free Grammar (CFG) designed with ambiguity elimination to ensure a single valid parse tree for any valid formula.

Evaluation

Performance is measured by recognition accuracy at specific processing stages:

MetricTaskValueNotes
AccuracyF1 (Key Symbol Extraction)98.3%Formula Level
AccuracyF2 (State-assisted Symbol)100%Formula Level
AccuracyM2 (Bond End Grouping)98.8%Molecule Level
AccuracyM3 (Free End-Text Conn)100%Molecule Level
AccuracyT1 (Text Recognition)98.7%Top-3 Accuracy

Citation

@inproceedings{wangUnderstandingStructureAnalyzing2009,
  title = {The {{Understanding}} and {{Structure Analyzing}} for {{Online Handwritten Chemical Formulas}}},
  booktitle = {2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}},
  author = {Wang, Xin and Shi, Guangshun and Yang, Jufeng},
  year = {2009},
  pages = {1056--1060},
  publisher = {IEEE},
  address = {Barcelona, Spain},
  doi = {10.1109/ICDAR.2009.70},
  isbn = {978-1-4244-4500-4},
  langid = {english}
}