Hunter Heidenreich | ML Research Scientist — Page 15

Computational Chemistry
A cobalt sulfate and ethylenediamine mixture being prepared

Mixfile & MInChI: Machine-Readable Mixture Formats

A 2019 format specification introducing two complementary standards for chemical mixtures. Mixfile provides comprehensive mixture descriptions and MInChI provides compact canonical identifiers. Together, they address the long-standing lack of standardized machine-readable formats for multi-component chemical systems.

Computational Chemistry
Colorized electron microscope image of nanostructured indium phosphide surface showing spatially oriented cubic crystallites

NInChI: Toward a Chemical Identifier for Nanomaterials

Can we create a SMILES-like notation for nanomaterials? A collaborative workshop tackles the challenge of representing complex, multi-component nanomaterials with a proposed extension to the established InChI system.

Computational Chemistry
Benzene in SELFIES notation

Recent Advances in the SELFIES Library: 2023 Update

A 2023 software update paper documenting improvements to the SELFIES Python library (v2.1.1), including a streamlined context-free grammar, expanded support for aromatic systems and stereochemistry, customizable semantic constraints, ML utility functions, and performance benchmarks on 300K+ molecules.

Computational Chemistry
Chemical diagram showing a generalized Grignard reaction

RInChI: The Reaction International Chemical Identifier

A 2018 infrastructure paper introducing RInChI (Reaction InChI), the first standardized format for uniquely identifying chemical reactions through algorithmic hashing and layering, enabling reaction database searching and duplicate detection analogous to how InChI works for individual molecules.

Computational Chemistry
SELFIES molecular representation overview

SELFIES: The Original Paper on Robust Molecular Strings

The 2020 paper that introduced SELFIES. Mario Krenn and colleagues created a molecular representation that addresses the validity problems of SMILES by guaranteeing that every generated string corresponds to a valid chemical structure.

Computational Chemistry
Benzene molecular structure diagram

SMILES Notation: The Original Paper by Weininger (1988)

David Weininger introduced SMILES notation in 1988, establishing encoding rules for representing chemical structures as compact, human-readable strings.

Computational Chemistry
Optical chemical structure recognition example

MolRec: Chemical Structure Recognition at CLEF 2012

An evaluation of MolRec at the CLEF 2012 competition reveals a large performance gap between the automatic evaluation set (94-96% accuracy) and the manual evaluation set of complex patent structures (46-59% accuracy). The paper systematically analyzes failure modes, including character grouping bugs, touching characters, and four-way junction vectorization.

Computational Chemistry
Optical chemical structure recognition example

MolRec: Rule-Based OCSR System at TREC 2011 Benchmark

Describes the MolRec system, which converts chemical diagram images into MOL files using vectorization, geometric rules, and graph construction. It achieved 95% accuracy on 1,000 TREC 2011 benchmark images, and the paper includes a comprehensive analysis of its failure cases.

Computational Chemistry
The transformation from a 2D chemical structure image to a SMILES representation

What is Optical Chemical Structure Recognition (OCSR)?

Discover how OCSR technology bridges the gap between molecular images and machine-readable data, evolving from rule-based systems to modern deep learning models for chemical knowledge extraction.

Computational Chemistry
αExtractor extracts structured chemical information from biomedical literature

αExtractor: Chemical Info from Biomedical Literature

A 2024 deep learning system for optical chemical structure recognition designed specifically for biomedical literature mining. It uses a ResNet-Transformer architecture to handle challenging conditions including low-resolution images, noise, distortions, and even hand-drawn molecular diagrams from scientific documents.

Computational Chemistry

ChemInfty: Chemical Structure Recognition in Patent Images

A 2011 rule-based OCSR system designed specifically for the challenging low-quality images in Japanese patent applications, using segment-based methods to handle pervasive problems like touching characters, merged atom labels with bonds, and broken lines.

Computational Chemistry
Diagram showing MolNexTR's dual-stream architecture: a molecular image feeds into parallel ConvNext and Vision Transformer encoders, producing a SMILES string.

MolNexTR: A Dual-Stream Molecular Image Recognition

MolNexTR proposes a dual-stream architecture that combines ConvNext and Vision Transformer encoders for molecular image recognition (OCSR). By extracting local and global features in parallel and training with specialized image contamination augmentations, it achieves 81-97% accuracy across diverse benchmarks.