Paper Information

Citation: Yang, J., Shi, G., Wang, K., Geng, Q., & Wang, Q. (2008). A Study of On-Line Handwritten Chemical Expressions Recognition. 2008 19th International Conference on Pattern Recognition, 1-4. https://doi.org/10.1109/ICPR.2008.4761824

Publication: ICPR 2008

What kind of paper is this?

This is a Method paper. It proposes a specific “novel two-level algorithm” and a “System model” for recognizing chemical expressions. The paper focuses on the architectural design of the recognition pipeline (segmentation, substance recognition, symbol recognition) and validates it against a “conventional algorithm” baseline, fitting the standard profile of a methodological contribution.

What is the motivation?

While pen-based computing has advanced for text and mathematical formulas, inputting chemical expressions remains “time-consuming”. Existing research often lacks “adequate chemical knowledge” or relies on algorithms that are too slow (global optimization) or structurally weak (local optimization). The authors aim to bridge this gap by integrating chemical domain knowledge into the recognition process to improve speed and accuracy.

What is the novelty here?

The core novelty is a two-level recognition strategy:

  1. Level 1 (Substance Recognition): Uses global structural information to identify entire “substance units” (e.g., $H_2SO_4$) by matching against a dictionary.
  2. Level 2 (Symbol Recognition): If Level 1 fails, the system falls back to segmenting the substance into isolated characters and recognizing them individually.
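
The control flow of the two levels can be sketched as follows. This is a minimal illustration, not the authors' implementation: `recognize_unit` and its three callable parameters are hypothetical names, and the notion that Level 1 "fails" by returning `None` is an assumption made for the sketch.

```python
from typing import Callable, Optional, Sequence

Stroke = Sequence[tuple[float, float]]  # one pen stroke as a list of (x, y) points


def recognize_unit(
    strokes: Sequence[Stroke],
    match_substance: Callable[[Sequence[Stroke]], Optional[str]],
    segment: Callable[[Sequence[Stroke]], list],
    classify_symbol: Callable[[Sequence[Stroke]], str],
) -> tuple[str, int]:
    """Return (recognized text, level used) for one substance unit."""
    # Level 1: try to recognize the whole unit against the substance dictionary.
    substance = match_substance(strokes)
    if substance is not None:
        return substance, 1
    # Level 2: fall back to segmenting the unit and classifying each symbol.
    symbols = [classify_symbol(group) for group in segment(strokes)]
    return "".join(symbols), 2
```

Passing the matcher, segmenter, and symbol classifier as callables keeps the fallback logic separate from any particular recognizer.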

Additionally, the method integrates syntactic features (chemical knowledge), such as element conservation, to validate and correct results, and it uses specific geometric features to distinguish superscript and subscript relationships.
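
As an illustration of what an element-conservation check might look like, the sketch below counts atoms on each side of a recognized equation string. The summary does not describe how the authors apply this constraint, so the function names (`count_atoms`, `side_atoms`, `is_conserved`) and the simplified grammar (flat formulas, optional leading coefficient, no parentheses or ionic charges) are assumptions.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "H2", "S", "O4".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(term: str) -> Counter:
    """Count atoms in a flat term with an optional coefficient, e.g. '2H2O'."""
    m = re.fullmatch(r"(\d*)(.+)", term.strip())
    if m is None:
        raise ValueError(f"empty term: {term!r}")
    coeff = int(m.group(1) or 1)
    atoms = Counter()
    for element, n in TOKEN.findall(m.group(2)):
        atoms[element] += coeff * int(n or 1)
    return atoms


def side_atoms(side: str) -> Counter:
    """Sum atom counts over all '+'-separated terms on one side."""
    total = Counter()
    for term in side.split("+"):
        total += count_atoms(term)
    return total


def is_conserved(equation: str) -> bool:
    """True if both sides of 'lhs = rhs' contain the same atoms."""
    left, right = equation.split("=")
    return side_atoms(left) == side_atoms(right)
```

A recognizer could use such a check to flag a result like `H2 + O2 = H2O` as suspicious and re-rank alternative symbol hypotheses.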

What experiments were performed?

  • Dataset Collection: The authors collected 1,197 handwritten expression samples from 20 chemistry professionals and students. This included 983 “standard” expressions (from 341 templates) and 214 “arbitrary” expressions written freely.
  • Comparison: They compared their “Two-level recognition” approach against a “conventional algorithm” that bypasses the first level and segments the input directly into characters.
  • Metrics: They measured Material Accuracy (MA), Correct Expressions Number (AEN), and Expression Accuracy (EA).

What were the outcomes and conclusions drawn?

  • High Accuracy: The proposed algorithm achieved 96.4% Material Accuracy (MA) and 95.7% Expression Accuracy (EA) on the total test set.
  • Robustness: The method performed well on both standard (96.3% EA) and arbitrary (92.5% EA) expressions.
  • Validation: The authors conclude the algorithm is “reliable,” “flexible,” and suitable for real-time applications compared to prior work.

Reproducibility Details

Data

The authors constructed two distinct datasets for training and evaluation:

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Symbol Training | ISF Files | 12,240 files | Used to train the ANN classifier. Covers 102 symbol classes (numerals, letters, operators, organic loops). |
| Expression Testing | Handwritten Expressions | 1,197 samples | 983 standard + 214 arbitrary expressions collected from 20 chemistry teachers/students. |

Algorithms

1. Structural Segmentation (Superscript/Subscript)

To distinguish relationships (superscript, subscript, in-line), the authors define geometric parameters based on the bounding boxes of adjacent symbols ($x_{i1}, y_{i1}, x_{i2}, y_{i2}$):

$$d = 0.7 \times y_{12} - y_{22} + 0.3 \times y_{11}$$

$$T = 1000 \times d/h$$

$$B = 1000 \times (B_1 - B_2)/h_1$$

Where $B_1, B_2$ are barycenters and $h$ is height. $(T, B)$ serves as the feature vector for classification.
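
The feature computation can be sketched in a few lines. Several details are assumptions for illustration: the `Box` class and `layout_features` function are hypothetical names, $y_{i1}$ and $y_{i2}$ are taken to be the top and bottom of symbol $i$'s bounding box, the barycenters are treated as vertical coordinates, and both $h$ and $h_1$ are read as the height of the first symbol.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x1: float  # left edge
    y1: float  # top edge (assumed: y grows downward)
    x2: float  # right edge
    y2: float  # bottom edge
    barycenter_y: float  # vertical barycenter of the symbol's ink (assumed)

    @property
    def height(self) -> float:
        return self.y2 - self.y1


def layout_features(first: Box, second: Box) -> tuple[float, float]:
    """Return the (T, B) feature vector for a pair of adjacent symbols."""
    h = first.height  # assumption: h and h_1 both mean the first symbol's height
    # d = 0.7 * y_12 - y_22 + 0.3 * y_11
    d = 0.7 * first.y2 - second.y2 + 0.3 * first.y1
    t = 1000 * d / h
    b = 1000 * (first.barycenter_y - second.barycenter_y) / h
    return t, b
```

Dividing by the symbol height makes both features independent of writing size, which is presumably why the formulas normalize before scaling by 1000.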

2. Segmentation Reliability

For segmenting strokes into units, the reliability of a segmentation path is calculated as:

$$Cof(K_{i},n)=\sum_{j=0}^{N}P(k_{j},k_{j+1})+P(S_{K_{i}})+\delta(N)$$

Where $P(k_j, k_{j+1})$ is the reliability of the strokes between cut points $k_j$ and $k_{j+1}$ being recognized as symbol $S_{k_j}$.
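
A direct transcription of this sum is sketched below. The summary does not define $P(S_{K_i})$ or $\delta(N)$, so here they are simply passed in: `sequence_reliability` and `length_term` are hypothetical parameters, and the reading of a path as a list of cut points $k_0, \dots, k_{N+1}$ is an assumption.

```python
from typing import Callable, Sequence


def path_reliability(
    cuts: Sequence[int],
    symbol_reliability: Callable[[int, int], float],
    sequence_reliability: float,
    length_term: Callable[[int], float],
) -> float:
    """Cof for one candidate path with cut points k_0 < k_1 < ... < k_{N+1}."""
    n = len(cuts) - 2  # N, so that j runs over 0..N
    # Sum of P(k_j, k_{j+1}) over consecutive cut points.
    pairwise = sum(symbol_reliability(a, b) for a, b in zip(cuts, cuts[1:]))
    # Add P(S_{K_i}) and delta(N), both supplied by the caller in this sketch.
    return pairwise + sequence_reliability + length_term(n)
```

Given several candidate paths, the most reliable one could then be selected with `max(...)` over their scores.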

3. Substance Matching (Level 1)

A modified string edit distance is used to match handwritten input against a dictionary:

$$\lambda_{\overline{u}}=\mu_{i}\times f(Dis(i,j,r)/\sqrt{Max(Len_{i},Len_{j})})$$

Where $\mu_i$ is the recognizer credibility and $Dis(i,j,r)$ is the edit distance.
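
The scoring idea can be sketched with a plain Levenshtein distance. This is not the paper's modified distance: the role of $r$ is not described in this summary and is ignored, and the mapping $f$ is unspecified, so $f(x) = 1/(1+x)$ is used purely as a placeholder. The function names are hypothetical.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (the paper's modified variant is not reproduced)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_score(candidate: str, entry: str, credibility: float) -> float:
    """credibility * f(distance / sqrt(max length)), with an assumed f."""
    norm = math.sqrt(max(len(candidate), len(entry)))
    x = edit_distance(candidate, entry) / norm if norm else 0.0
    return credibility * (1.0 / (1.0 + x))  # f(x) = 1 / (1 + x) is an assumption


def best_match(candidate: str, dictionary: list, credibility: float = 1.0) -> str:
    """Return the dictionary entry with the highest match score."""
    return max(dictionary, key=lambda e: match_score(candidate, e, credibility))
```

For example, a raw recognition result such as `H2S04` (digit zero instead of the letter O) is one edit away from the dictionary entry `H2SO4`, so `best_match` would prefer that entry over unrelated substances.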

Models

  • Classifier: An ANN-based classifier is used for isolated symbol recognition.
  • Input Features: A set of ~30 features is extracted from strokes, including writing time, interval time, elastic mesh, and stroke outline.
  • Performance: The classifier achieved 92.1% accuracy on a test set of 2,702 isolated symbols.

Evaluation

The system was evaluated on the 1,197 expression samples.

| Metric | Value (Total) | Value (Standard) | Value (Arbitrary) | Notes |
|---|---|---|---|---|
| Material Accuracy (MA) | 96.4% | 97.7% | 94% | Accuracy of substance recognition. |
| Expression Accuracy (EA) | 95.7% | 96.3% | 92.5% | Accuracy of full expression recognition. |

Citation

@inproceedings{yangStudyOnlineHandwritten2008,
  title = {A Study of On-Line Handwritten Chemical Expressions Recognition},
  booktitle = {2008 19th {{International Conference}} on {{Pattern Recognition}}},
  author = {Yang, Jufeng and Shi, Guangshun and Wang, Kai and Geng, Qian and Wang, Qingren},
  year = 2008,
  month = dec,
  pages = {1--4},
  publisher = {IEEE},
  address = {Tampa, FL, USA},
  doi = {10.1109/ICPR.2008.4761824}
}