Paper Information
Citation: Wang, X., Shi, G., & Yang, J. (2009). The Understanding and Structure Analyzing for Online Handwritten Chemical Formulas. 2009 10th International Conference on Document Analysis and Recognition, 1056-1060. https://doi.org/10.1109/ICDAR.2009.70
Publication: ICDAR 2009
What kind of paper is this?
This is a Method paper. It proposes a novel architectural framework for processing chemical formulas by decomposing them into three hierarchical levels (Formula, Molecule, Text). The contribution is defined by a specific set of formal grammatical rules and parsing algorithms used to construct a “grammar spanning tree” and “molecule spanning graph” from online handwritten strokes.
What is the motivation?
The primary motivation is the application of mobile computing in chemistry education, where precise comprehension of casual, online handwritten formulas is a significant challenge.
- 2D Complexity: Unlike 1D text, chemical formulas utilize complex 2D spatial relationships that convey specific chemical meaning (e.g., bonds, rings).
- Format Limitations: Existing storage formats like CML (Chemical Markup Language) or MDL MOLFILE do not natively record the layout or abbreviated information necessary for recognizing handwritten input.
- Online Gap: Previous research focused heavily on offline (image-based) recognition, lacking solutions for online (stroke-based) handwritten chemical formulas (OHCF).
What is the novelty here?
The core novelty is the Three-Level Grammatical Analysis approach:
- Formula Level (1D): Treats the reaction equation as a linear sequence of components (Reactants, Products, Separators), parsed via a context-free grammar to build a spanning tree.
- Molecule Level (2D): Treats molecules as graphs where “text groups” are vertices and “bonds” are edges. It introduces specific handling for “hidden Carbon dots” (intersections of bonds without text).
- Text Level (1D): Analyzes the internal structure of text groups (atoms, subscripts).
Unique to this approach is the formal definition of the chemical grammar as a 5-tuple $G=(T,N,P,M,S)$ and the generation of an Adjacency Matrix directly from the handwritten sketch to represent chemical connectivity.
What experiments were performed?
The authors validated their model using a custom dataset of online handwritten formulas.
- Data Source: 25 formulas were randomly selected from a larger pool of 1,250 samples.
- Scope: The test set included 484 total symbols, comprising generators, separators, text symbols, rings, and various bond types.
- Granular Validation: The system was tested at multiple distinct stages:
- Key Symbol Extraction (Formula Level)
- Text Localization (Molecule Level)
- Bond End Grouping (Molecule Level)
- Text Recognition (Text Level)
What were the outcomes and conclusions drawn?
The system achieved high accuracy across all sub-tasks, demonstrating that the hierarchical grammar approach is effective for both inorganic and organic formulas.
- Formula Level: 98.3% accuracy for Key Symbols; 100% for State-assisted symbols.
- Molecule Level: 98.8% accuracy for Bond End Grouping; 100% for Free End-Text connection detection.
- Text Recognition: 98.7% accuracy (Top-3) using HMMs.
- Impact: The method successfully preserves the writer’s “online information” (habits/intentions) while converting the handwritten input into standard formats suitable for visual editing or data retrieval.
Reproducibility Details
To replicate this work, one would need to implement the specific grammatical production rules and the geometric thresholds defined for bond analysis.
Data
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | Symbol HMMs | 5,670 samples | Used to train the text recognition module |
| Testing | Text Recognition | 2,016 samples | Test set for character HMMs |
| Testing | Formula Analysis | 25 formulas | Random subset of 1,250 collected samples; contains 484 symbols |
Algorithms
1. Formula Level Parsing
- HBL Analysis: Identify the “Horizontal Baseline” (HBL) containing the most symbols to locate key operators (e.g., $+$, $\rightarrow$).
- Grammar: Use the productions defined in Figure 4. Example rules include:
- $Reaction ::= ReactantList \ Generator \ ProductList$
- $Reactant ::= BalancingNum \ Molecule \ IonicCharacter$
2. Molecule Level Analysis (Bond Grouping)
- Endpoint Classification: Points are classified as free ends, junctions (3+ bonds), or connections (2 bonds).
- Grouping Equation: An endpoint $(x_k, y_k)$ belongs to Group A based on distance thresholding: $$Include(x_0, y_0) = \begin{cases} 1, & d_0 < t \cdot \max d_x + \partial \ 0, & \text{else} \end{cases}$$ Where $d_k$ is the Euclidean distance to the group center $(x_a, y_a)$.
3. Connection Detection
- Text-Bond Connection: A text group is connected to a bond if the free end falls within a bounding box expanded by thresholds $t_W$ and $t_H$: $$Con(x,y) = \begin{cases} 1, & \min x - t_W < x < \max x + t_W \text{ AND } \min y - t_H < y < \max y + t_H \ 0, & \text{else} \end{cases}$$
Models
- Text Recognition: Hidden Markov Models (HMM) are used for recognizing individual text symbols.
- Grammar: Context-Free Grammar (CFG) designed with ambiguity elimination to ensure a single valid parse tree for any valid formula.
Evaluation
Performance is measured by recognition accuracy at specific processing stages:
| Metric | Task | Value | Notes |
|---|---|---|---|
| Accuracy | F1 (Key Symbol Extraction) | 98.3% | Formula Level |
| Accuracy | F2 (State-assisted Symbol) | 100% | Formula Level |
| Accuracy | M2 (Bond End Grouping) | 98.8% | Molecule Level |
| Accuracy | M3 (Free End-Text Conn) | 100% | Molecule Level |
| Accuracy | T1 (Text Recognition) | 98.7% | Top-3 Accuracy |
Citation
@inproceedings{wangUnderstandingStructureAnalyzing2009,
title = {The {{Understanding}} and {{Structure Analyzing}} for {{Online Handwritten Chemical Formulas}}},
booktitle = {2009 10th {{International Conference}} on {{Document Analysis}} and {{Recognition}}},
author = {Wang, Xin and Shi, Guangshun and Yang, Jufeng},
year = {2009},
pages = {1056--1060},
publisher = {IEEE},
address = {Barcelona, Spain},
doi = {10.1109/ICDAR.2009.70},
isbn = {978-1-4244-4500-4},
langid = {english}
}