On-line Handwritten Chemical Expression Recognition

Paper Information

Citation: Yang, J., Shi, G., Wang, K., Geng, Q., & Wang, Q. (2008). A Study of On-Line Handwritten Chemical Expressions Recognition. 2008 19th International Conference on Pattern Recognition, 1-4. https://doi.org/10.1109/ICPR.2008.4761824

Publication: ICPR 2008

A Methodological Approach to Chemical Recognition

This is a Method paper. It proposes a specific “novel two-level algorithm” and a “System model” for recognizing chemical expressions. The paper focuses on the architectural design of the recognition pipeline (segmentation, substance recognition, symbol recognition) and validates it against a “conventional algorithm” baseline, fitting the standard profile of a methodological contribution.

Bridging the Gap in Pen-Based Chemical Input

While pen-based computing has advanced for text and mathematical formulas, inputting chemical expressions remains “time-consuming”. Existing research often lacks “adequate chemical knowledge” or relies on algorithms that are too slow (global optimization) or structurally weak (local optimization). The authors aim to bridge this gap by integrating chemical domain knowledge into the recognition process to improve speed and accuracy.

Two-Level Recognition Strategy for Formulas

The core novelty is a two-level recognition strategy:

Level 1 (Substance Recognition): Uses global structural information to identify entire “substance units” (e.g., $H_2SO_4$) by matching against a dictionary.
Level 2 (Symbol Recognition): If Level 1 fails, the system falls back to segmenting the substance into isolated characters and recognizing them individually.
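
The fallback between the two levels can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the function names, the toy dictionary, and the representation of a substance unit as a list of already-guessed symbol labels are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in: a "substance unit" is a list of rough symbol guesses.
Units = List[str]

def two_level_recognize(
    units: Units,
    level1: Callable[[Units], Optional[str]],
    level2: Callable[[Units], str],
) -> Tuple[str, int]:
    """Try whole-substance recognition first; fall back to per-symbol recognition."""
    result = level1(units)          # Level 1: dictionary-based substance match
    if result is not None:
        return result, 1
    return level2(units), 2         # Level 2: isolated symbol recognition

# Toy dictionary and recognizers, assumed purely for illustration.
DICTIONARY = {"H2SO4", "NaCl", "H2O"}

def toy_level1(units: Units) -> Optional[str]:
    candidate = "".join(units)
    return candidate if candidate in DICTIONARY else None

def toy_level2(units: Units) -> str:
    return "".join(units)

if __name__ == "__main__":
    print(two_level_recognize(["H", "2", "O"], toy_level1, toy_level2))
    print(two_level_recognize(["X", "e", "F", "4"], toy_level1, toy_level2))
```

The second return value records which level produced the answer, which is convenient when comparing against a baseline that skips Level 1.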

Additionally, the method integrates syntactic features (chemical knowledge), such as element conservation, to validate and correct results, and it uses specific geometric features to distinguish superscript and subscript relationships.
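
As an illustration of how element conservation can act as a validity check, the sketch below counts atoms on each side of a recognized equation. It is an assumption-laden toy, not the paper's procedure: the parser handles only simple formulas with an optional leading coefficient (no parentheses, charges, or hydrates), and all function names are hypothetical.

```python
import re
from collections import Counter

# Element symbol followed by an optional count, e.g. "H2", "O", "Na".
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")

def count_atoms(term: str) -> Counter:
    """Count atoms in one term such as '2H2O' (coefficient 2, formula H2O)."""
    match = re.match(r"(\d*)(.*)", term.strip())
    coefficient = int(match.group(1)) if match.group(1) else 1
    counts: Counter = Counter()
    for element, number in ELEMENT.findall(match.group(2)):
        counts[element] += coefficient * (int(number) if number else 1)
    return counts

def side_atoms(side: str) -> Counter:
    """Sum atom counts over all '+'-separated terms on one side."""
    total: Counter = Counter()
    for term in side.split("+"):
        total += count_atoms(term)
    return total

def is_balanced(equation: str) -> bool:
    """Return True if both sides of 'lhs = rhs' contain the same atoms."""
    left, right = equation.split("=")
    return side_atoms(left) == side_atoms(right)

if __name__ == "__main__":
    print(is_balanced("2H2 + O2 = 2H2O"))
    print(is_balanced("H2 + O2 = H2O"))
```

A recognizer could use a failed check like this as a signal to revisit low-confidence symbols, for example a subscript read as the wrong digit.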

Dataset Collection and Baseline Comparisons

Dataset Collection: The authors collected 1197 handwritten expression samples from 20 chemistry professionals and students. This included 983 “standard” expressions (from 341 templates) and 214 “arbitrary” expressions written freely.
Comparison: They compared their “Two-level recognition” approach against a “conventional algorithm” that disables the first level and segments each substance directly into characters.
Metrics: They measured Material Accuracy (MA), Correct Expressions Number (AEN), and Expression Accuracy (EA).

High Accuracy in Formula Recognition

High Accuracy: The proposed algorithm achieved 96.4% Material Accuracy (MA) and 95.7% Expression Accuracy (EA) on the total test set.
Robustness: The method performed well on both standard (96.3% EA) and arbitrary (92.5% EA) expressions.
Validation: The authors conclude that the algorithm is “reliable” and “flexible,” and that it is better suited to real-time applications than prior work.

Reproducibility Details

Data

The authors constructed two distinct datasets for training and evaluation:

Purpose	Dataset	Size	Notes
Symbol Training	ISF Files	12,240 files	Used to train the ANN classifier. Covers 102 symbol classes (numerals, letters, operators, organic loops).
Expression Testing	Handwritten Expressions	1,197 samples	983 standard + 214 arbitrary expressions collected from 20 chemistry teachers/students.

Algorithms

1. Structural Segmentation (Superscript/Subscript)

To distinguish relationships (superscript, subscript, in-line), the authors define geometric parameters based on the bounding boxes of adjacent symbols ($x_{i1}, y_{i1}, x_{i2}, y_{i2}$):

$$d = 0.7 \times y_{12} - y_{22} + 0.3 \times y_{11}$$

$$T = 1000 \times d/h$$

$$B = 1000 \times (B_1 - B_2)/h_1$$

Where $B_1, B_2$ are barycenters and $h$ is height. $(T, B)$ serves as the feature vector for classification.
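
A minimal sketch of computing the $(T, B)$ feature vector from two bounding boxes is given below. Several points are assumptions rather than facts from the paper: that $y_{i1}$ and $y_{i2}$ are the top and bottom edges of symbol $i$, that $h$ and $h_1$ both denote the height of the first symbol, and that the barycenters are vertical coordinates supplied by the caller. The `Box` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Box:
    """Bounding box of one symbol; (x1, y1) and (x2, y2) are opposite corners."""
    x1: float
    y1: float
    x2: float
    y2: float

def relation_features(first: Box, second: Box,
                      bary1: float, bary2: float) -> Tuple[float, float]:
    """Return (T, B) for a pair of adjacent symbols.

    Assumption: h == h1 == height of the first symbol.
    """
    h1 = first.y2 - first.y1
    if h1 <= 0:
        raise ValueError("first symbol must have positive height")
    d = 0.7 * first.y2 - second.y2 + 0.3 * first.y1
    t = 1000 * d / h1
    b = 1000 * (bary1 - bary2) / h1
    return t, b

if __name__ == "__main__":
    # Toy pair: a tall symbol followed by a smaller one sitting lower.
    main_symbol = Box(0, 0, 40, 100)
    small_symbol = Box(45, 60, 70, 120)
    print(relation_features(main_symbol, small_symbol, bary1=50, bary2=90))
```

Scaling by 1000 and dividing by the symbol height makes both features independent of writing size, so a single classifier can separate the superscript, subscript, and in-line cases.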

2. Segmentation Reliability

For segmenting strokes into units, the reliability of a segmentation path is calculated as:

$$Cof(K_{i},n)=\sum_{j=0}^{N}P(k_{j},k_{j+1})+P(S_{K_{i}})+\delta(N)$$

Where $P(k_j, k_{j+1})$ is the reliability of strokes being recognized as symbol $S_{k_j}$.
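
The structure of this score can be sketched as a sum of three terms. The code is illustrative only: the summary does not define $P(S_{K_i})$ or $\delta(N)$ in detail, so both are passed in as caller-supplied functions, and the toy scorers in the usage example are invented for demonstration.

```python
from typing import Callable, List, Sequence

def path_reliability(
    cuts: Sequence[int],
    segment_score: Callable[[int, int], float],
    structure_score: Callable[[Sequence[int]], float],
    delta: Callable[[int], float],
) -> float:
    """Score one segmentation path given its stroke cut points.

    cuts = [k_0, k_1, ..., k_{N+1}]; consecutive pairs delimit candidate symbols.
    """
    if len(cuts) < 2:
        raise ValueError("a path needs at least two cut points")
    segments = list(zip(cuts, cuts[1:]))
    n = len(segments) - 1  # N in the formula: the sum runs over j = 0..N
    symbol_term = sum(segment_score(a, b) for a, b in segments)
    return symbol_term + structure_score(cuts) + delta(n)

def best_path(paths: List[Sequence[int]], **scorers) -> Sequence[int]:
    """Pick the candidate path with the highest reliability."""
    return max(paths, key=lambda cuts: path_reliability(cuts, **scorers))

if __name__ == "__main__":
    # Toy scorers (assumptions): shorter segments score higher, no structure
    # term, and a small penalty per extra segment.
    scorers = dict(
        segment_score=lambda a, b: 1.0 / (b - a),
        structure_score=lambda cuts: 0.0,
        delta=lambda n: -0.1 * n,
    )
    print(path_reliability([0, 2, 3], **scorers))
    print(best_path([[0, 3], [0, 2, 3], [0, 1, 2, 3]], **scorers))
```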

3. Substance Matching (Level 1)

A modified string edit distance is used to match handwritten input against a dictionary:

$$\lambda_{\overline{u}}=\mu_{i} \times f(Dis(i,j,r)/\sqrt{Max(Len_{i},Len_{j})})$$

Where $\mu_i$ is the recognizer credibility and $Dis(i,j,r)$ is the edit distance.
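
The following sketch shows the general shape of such a dictionary match. It is not the paper's modified distance: it uses plain Levenshtein distance, ignores the parameter $r$, and assumes $f(x) = 1/(1+x)$ as the decreasing function, since the summary does not specify $f$. The function names and the toy dictionary are hypothetical.

```python
import math
from typing import Iterable, Tuple

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def match_score(candidate: str, entry: str, credibility: float = 1.0) -> float:
    """credibility * f(distance / sqrt(max length)), with f(x) = 1 / (1 + x) assumed."""
    longest = max(len(candidate), len(entry))
    if longest == 0:
        return credibility
    normalized = levenshtein(candidate, entry) / math.sqrt(longest)
    return credibility / (1.0 + normalized)

def best_match(candidate: str, dictionary: Iterable[str],
               credibility: float = 1.0) -> Tuple[str, float]:
    """Return the dictionary entry with the highest score and that score."""
    scored = ((entry, match_score(candidate, entry, credibility))
              for entry in dictionary)
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    # "H2S04" simulates a recognizer confusing the letter O with the digit 0.
    print(best_match("H2S04", ["H2SO4", "H2O", "NaCl"]))
```

Dividing by the square root of the longer length penalizes a single error less in long formulas than in short ones, which matches the intuition that one wrong symbol in a five-character substance is more recoverable than one in a two-character substance.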

Models

Classifier: An ANN-based classifier is used for isolated symbol recognition.
Input Features: A set of ~30 features is extracted from strokes, including writing time, interval time, elastic mesh, and stroke outline.
Performance: The classifier achieved 92.1% accuracy on a test set of 2,702 isolated symbols.

Evaluation

The system was evaluated on the 1,197 expression samples.

Metric	Value (Total)	Value (Standard)	Value (Arbitrary)	Notes
Material Accuracy (MA)	96.4%	97.7%	94%	Accuracy of substance recognition.
Expression Accuracy (EA)	95.7%	96.3%	92.5%	Accuracy of full expression recognition.
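
To make the metrics concrete, the sketch below computes them from predicted and reference expressions. The definitions are assumptions inferred from the notes column: MA is taken as the fraction of substances recognized correctly, the correct-expression count as the number of expressions in which every substance is correct, and EA as that count divided by the number of expressions. Substances are compared by position, which is a simplification, and the sample data is invented.

```python
from typing import Dict, List

Expression = List[str]  # one expression as an ordered list of substance strings

def expression_metrics(predicted: List[Expression],
                       reference: List[Expression]) -> Dict[str, float]:
    """Compute substance-level and expression-level accuracy."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    total_substances = 0
    correct_substances = 0
    correct_expressions = 0
    for pred, ref in zip(predicted, reference):
        total_substances += len(ref)
        # Position-wise comparison; missing or extra substances count as errors.
        correct_substances += sum(p == r for p, r in zip(pred, ref))
        if pred == ref:
            correct_expressions += 1
    return {
        "MA": correct_substances / total_substances,
        "correct_expressions": correct_expressions,
        "EA": correct_expressions / len(reference),
    }

if __name__ == "__main__":
    reference = [["H2", "O2", "H2O"], ["NaCl"]]
    predicted = [["H2", "O2", "H20"], ["NaCl"]]  # one substance misread
    print(expression_metrics(predicted, reference))
```

Because a single wrong substance invalidates the whole expression, EA is typically lower than MA, which is consistent with the values reported in the table.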

Citation

@inproceedings{yangStudyOnlineHandwritten2008,
  title = {A Study of On-Line Handwritten Chemical Expressions Recognition},
  booktitle = {2008 19th {{International Conference}} on {{Pattern Recognition}}},
  author = {Yang, Jufeng and Shi, Guangshun and Wang, Kai and Geng, Qian and Wang, Qingren},
  year = 2008,
  month = dec,
  pages = {1--4},
  publisher = {IEEE},
  address = {Tampa, FL, USA},
  doi = {10.1109/ICPR.2008.4761824}
}
