Hunter Heidenreich | ML Research Scientist

Invalid SMILES Benefit Chemical Language Models: A Study

A 2024 Nature Machine Intelligence paper providing causal evidence that invalid SMILES generation improves chemical language model performance by filtering low-likelihood samples, while validity constraints (as in SELFIES) introduce structural biases that impair distribution learning.

Molecular Representations

SELFIES and the Future of Molecular String Representations

This 2022 perspective paper reviews 250 years of chemical notation evolution and proposes 16 concrete research projects to extend SELFIES beyond traditional organic chemistry into polymers, crystals, and reactions.

Scientific Computing

Grid of complex molecular structures rendered from SELFIES and SMILES strings

Molecular String Renderer: Robust Visualization Tool

A fault-tolerant RDKit wrapper that treats molecular visualization as a software engineering problem. It implements the strategy pattern for SVG generation with automatic raster fallback, supports SELFIES natively for generative AI workflows, and enforces strict type safety for reliable batch processing of millions of molecules in training pipelines.
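The fallback idea can be sketched in a few lines. This is an illustrative sketch only, not the tool's actual code: `render_svg`, `render_raster`, and `render_with_fallback` are hypothetical names, and the two renderers are stand-ins that do not call RDKit.

```python
from typing import Callable, Optional, Sequence

# A strategy takes a molecular string and returns rendered bytes, or raises.
RenderStrategy = Callable[[str], bytes]


def render_svg(mol_string: str) -> bytes:
    """Hypothetical SVG strategy; fails on long inputs to simulate an error."""
    if len(mol_string) > 20:
        raise ValueError("SVG rendering failed")
    return f"<svg><!-- {mol_string} --></svg>".encode()


def render_raster(mol_string: str) -> bytes:
    """Hypothetical raster strategy used when the SVG strategy fails."""
    return b"PNG:" + mol_string.encode()


def render_with_fallback(
    mol_string: str, strategies: Sequence[RenderStrategy]
) -> Optional[bytes]:
    """Try each strategy in order; return the first success, or None."""
    for strategy in strategies:
        try:
            return strategy(mol_string)
        except Exception:
            # One bad molecule should not abort a batch; try the next strategy.
            continue
    return None


if __name__ == "__main__":
    print(render_with_fallback("CCO", [render_svg, render_raster]))
```

Returning None instead of raising lets a batch job record the failure and continue with the next molecule.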

Generative Modeling

Diagram comparing standard stochastic sampling (gradient blocked) vs the reparameterization trick (gradient flows)

Auto-Encoding Variational Bayes: VAE Paper Summary

Kingma and Welling’s 2013 paper introducing Variational Autoencoders and the reparameterization trick, enabling end-to-end gradient-based training of generative models with continuous latent variables by moving the stochasticity outside the computational graph so that gradients can flow through a deterministic path.
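A minimal sketch of the reparameterization trick, assuming a one-dimensional Gaussian latent and a toy objective f(z) = z^2 (both chosen here for illustration, not taken from the paper). Writing z = mu + sigma * eps with eps drawn from N(0, 1) makes z a deterministic function of mu and sigma, so the chain rule gives gradient estimates by simple averaging. The function names are hypothetical.

```python
import random


def reparam_sample(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of noise eps."""
    eps = rng.gauss(0.0, 1.0)  # all randomness lives in eps
    return mu + sigma * eps, eps


def pathwise_grads(mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2] via the reparameterized path."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        z, eps = reparam_sample(mu, sigma, rng)
        df_dz = 2.0 * z              # derivative of the toy objective f(z) = z^2
        grad_mu += df_dz * 1.0       # dz/dmu = 1
        grad_sigma += df_dz * eps    # dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


if __name__ == "__main__":
    print(pathwise_grads(1.5, 0.5))
```

Since E[z^2] = mu^2 + sigma^2, the two estimates should approach 2 * mu and 2 * sigma as the sample count grows.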

Generative Modeling

Flowchart comparing VAE and IWAE computation showing the key difference in where averaging occurs relative to the log operation

Importance Weighted Autoencoders (IWAE) for Tighter Bounds

Burda et al.’s ICLR 2016 paper introducing Importance Weighted Autoencoders, which keep the VAE architecture unchanged but use importance sampling to derive a strictly tighter log-likelihood lower bound than the standard VAE objective, addressing posterior collapse and improving generative quality.
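The K-sample bound averages importance weights w_i = p(x, z_i) / q(z_i | x) inside the logarithm: L_K = E[log((1/K) * sum_i w_i)], with K = 1 recovering the standard VAE bound. The sketch below estimates L_K by Monte Carlo on a toy model assumed here for illustration (p(z) = N(0, 1), p(x | z) = N(z, 1), and a deliberately crude proposal q(z | x) = N(0, 1)), chosen because log p(x) has a closed form to compare against. The function names are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def iwae_bound(x, k, n_outer=5000, seed=0):
    """Monte Carlo estimate of L_K for the toy model described above."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_outer):
        log_w = []
        for _ in range(k):
            z = rng.gauss(0.0, 1.0)  # z ~ q(z | x) = N(0, 1)
            log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
            log_q = log_normal_pdf(z, 0.0, 1.0)
            log_w.append(log_joint - log_q)
        # log of the mean weight, computed stably with log-sum-exp
        m = max(log_w)
        total += m + math.log(sum(math.exp(lw - m) for lw in log_w) / k)
    return total / n_outer


if __name__ == "__main__":
    x = 1.0
    print("exact log p(x):", log_normal_pdf(x, 0.0, 2.0))  # marginal is N(0, 2)
    for k in (1, 5, 50):
        print(k, iwae_bound(x, k))
```

In this toy setting the estimates should rise toward the exact log p(x) as K increases, which is the tightening the paper describes.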

Generative Modeling

MNIST digit samples generated from a Variational Autoencoder latent space

Importance Weighted Autoencoders: Beyond the Standard VAE

Discover how Importance Weighted Autoencoders (IWAEs) use the same architecture as VAEs with a fundamentally more powerful objective to leverage multiple samples effectively.

Molecular Representations

D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism

InChI and Tautomerism: Toward Comprehensive Treatment

A comprehensive 2020 analysis of the tautomerism problem in chemical databases, compiling 86 tautomeric transformation rules (20 existing, 66 new) and validating them across 400M+ structures to inform algorithmic improvements for InChI V2.

Molecular Representations

2D molecular structure diagram of tricyclohexylphosphine showing a central phosphorus atom bonded to three cyclohexyl groups

InChI: The Worldwide Chemical Structure Identifier Standard

A comprehensive 2013 review explaining how InChI emerged as the global standard for chemical structure identifiers, covering its history as a response to the Internet’s need for non-proprietary molecular linking, its governance under IUPAC, and the technical layers that ensure uniqueness across diverse chemical databases.

Molecular Representations

Crystal structure of Na8Si46 clathrate displaying dodecahedral and tetrakaidecahedral coordination polyhedra

Making InChI FAIR and Sustainable for Inorganic Chemistry

A 2025 Faraday Discussions paper describing the major overhaul of InChI v1.07 that fixed more than 3000 bugs, added support for inorganic and organometallic compounds, and modernized the system to align with FAIR data principles for chemistry databases.

Molecular Representations

A cobalt sulfate and ethylenediamine mixture being prepared

Mixfile & MInChI: Machine-Readable Mixture Formats

A 2019 format specification introducing two complementary standards for chemical mixtures. Mixfile provides comprehensive mixture descriptions and MInChI provides compact canonical identifiers. This addresses the long-standing lack of standardized machine-readable formats for multi-component chemical systems.

Molecular Representations

Colorized electron microscope image of nanostructured indium phosphide surface showing spatially oriented cubic crystallites

NInChI: Toward a Chemical Identifier for Nanomaterials

Can we create a SMILES-like notation for nanomaterials? A collaborative workshop tackles the challenge of representing complex, multi-component nanomaterials with a proposed extension to the established InChI system.

Molecular Representations

Recent Advances in the SELFIES Library: 2023 Update

A 2023 software update paper documenting improvements to the SELFIES Python library (v2.1.1), including a streamlined context-free grammar, expanded support for aromatic systems and stereochemistry, customizable semantic constraints, ML utility functions, and performance benchmarks on 300K+ molecules.
