<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Notations on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/</link><description>Recent content in Molecular Notations on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/index.xml" rel="self" type="application/rss+xml"/><item><title>Materials Representations for ML Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/materials-representations-ml-review/</guid><description>Review of representation strategies for encoding solid-state materials as ML inputs, covering structural descriptors, crystal graphs, and generative models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-material-representations">A Systematization of Material Representations</h2>
<p>This paper is a <strong>Systematization</strong> that organizes and categorizes the strategies researchers use to convert solid-state materials into numerical representations suitable for machine learning models. Rather than proposing a new method, the review provides a structured taxonomy of existing approaches, connecting each to the practical constraints of data availability, computational cost, and prediction targets. It covers structural descriptors, graph-based learned representations, compositional features, transfer learning, and generative models for inverse design.</p>
<h2 id="why-material-representations-matter">Why Material Representations Matter</h2>
<p>Machine learning has enabled rapid property prediction for materials, but every ML pipeline depends on how the material is encoded as a numerical input. The authors identify three guiding principles for effective representations:</p>
<ol>
<li><strong>Similarity preservation</strong>: Similar materials should have similar representations, and dissimilar materials should diverge in representation space.</li>
<li><strong>Domain coverage</strong>: The representation should be constructable for every material in the target domain.</li>
<li><strong>Cost efficiency</strong>: Computing the representation should be cheaper than computing the target property directly (e.g., via <a href="https://en.wikipedia.org/wiki/Density_functional_theory">DFT</a>).</li>
</ol>
<p>In practice, materials scientists face several barriers. Atomistic structures span diverse space groups, supercell sizes, and disorder parameters. Real material performance depends on defects, microstructure, and interfaces. Structural information often requires expensive experimental or computational effort to obtain. Datasets in materials science tend to be small, sparse, and biased toward well-studied systems.</p>
<h2 id="structural-descriptors-local-global-and-topological">Structural Descriptors: Local, Global, and Topological</h2>
<p>The review covers three families of hand-crafted structural descriptors that encode atomic positions and types.</p>
<h3 id="local-descriptors">Local Descriptors</h3>
<p>Local descriptors characterize the environment around each atom. Atom-centered symmetry functions (ACSF), introduced by Behler and Parrinello, define radial and angular functions:</p>
<p>$$
G_{i}^{1} = \sum_{j \neq i}^{\text{neighbors}} e^{-\eta(R_{ij} - R_{s})^{2}} f_{c}(R_{ij})
$$</p>
<p>$$
G_{i}^{2} = 2^{1-\zeta} \sum_{j,k \neq i}^{\text{neighbors}} (1 + \lambda \cos \theta_{ijk})^{\zeta} e^{-\eta(R_{ij}^{2} + R_{ik}^{2} + R_{jk}^{2})} f_{c}(R_{ij}) f_{c}(R_{ik}) f_{c}(R_{jk})
$$</p>
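<p>As a rough illustration, the radial function $G_{i}^{1}$ can be written in a few lines of numpy. The cosine cutoff $f_{c}$ below is Behler&rsquo;s standard choice; the parameter values here are arbitrary:</p>

```python
import numpy as np

def cutoff(r, r_c):
    """Behler cosine cutoff: decays smoothly to zero at r_c."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def acsf_radial(distances, eta=1.0, r_s=0.0, r_c=6.0):
    """G^1 for one atom, given distances to its neighbors."""
    d = np.asarray(distances, dtype=float)
    return float(np.sum(np.exp(-eta * (d - r_s) ** 2) * cutoff(d, r_c)))

# Two neighbors at 1.0 and 2.0 Angstrom
g1 = acsf_radial([1.0, 2.0], eta=0.5)
```

<p>In practice one evaluates a whole set of such functions over many $(\eta, R_{s})$ pairs to fingerprint each atomic environment.</p>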
<p>The Smooth Overlap of Atomic Positions (SOAP), proposed by Bartók et al., defines atomic neighborhood density as a sum of Gaussians and computes a rotationally invariant kernel through expansion in radial functions and <a href="https://en.wikipedia.org/wiki/Spherical_harmonics">spherical harmonics</a>:</p>
<p>$$
\rho_{i}(\mathbf{r}) = \sum_{j} \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^{2}}{2\sigma^{2}}\right) = \sum_{nlm} c_{nlm} g_{n}(r) Y_{lm}(\hat{\mathbf{r}})
$$</p>
<p>The power spectrum $p_{nn'l} = \sum_{m} c_{nlm}(c_{n'lm})^{*}$ serves as a vector descriptor of the local environment. SOAP has seen wide adoption both as a similarity metric and as input to ML models.</p>
<p><a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi tessellation</a> provides another local approach, segmenting space into cells and extracting features like effective coordination numbers, cell volumes, and neighbor properties.</p>
<h3 id="global-descriptors">Global Descriptors</h3>
<p>Global descriptors encode the full structure. The Coulomb matrix models electrostatic interactions between atoms:</p>
<p>$$
M_{i,j} = \begin{cases} \frac{1}{2} Z_{i}^{2.4} &amp; \text{for } i = j \\ \frac{Z_{i}Z_{j}}{|r_{i} - r_{j}|} &amp; \text{for } i \neq j \end{cases}
$$</p>
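<p>A minimal numpy sketch of the (unsorted) Coulomb matrix, assuming atomic numbers and Cartesian coordinates in Angstrom:</p>

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Z: (N,) atomic numbers; R: (N, 3) Cartesian coordinates."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4       # diagonal: nuclear self-term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# H2 molecule, bond length 0.74 Angstrom
M = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]])
```

<p>Because row/column order depends on atom indexing, ML pipelines typically sort rows by norm or use the eigenvalue spectrum to obtain permutation invariance.</p>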
<p>Other global methods include partial radial distribution functions (PRDF), the many-body tensor representation (MBTR), and cluster expansions. The Atomic Cluster Expansion (ACE) framework generalizes cluster expansions to continuous environments and has become a foundation for modern deep learning potentials.</p>
<h3 id="topological-descriptors">Topological Descriptors</h3>
<p><a href="https://en.wikipedia.org/wiki/Persistent_homology">Persistent homology</a> from topological data analysis (TDA) identifies geometric features at multiple length scales. Topological descriptors capture pore geometries in porous materials and have outperformed traditional structural descriptors for predicting CO$_{2}$ adsorption in metal-organic frameworks and methane storage in <a href="https://en.wikipedia.org/wiki/Zeolite">zeolites</a>. A caveat is the $O(N^{3})$ worst-case computational cost per filtration.</p>
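<p>The 0-dimensional piece of a persistence computation can be sketched with a union-find over a growing distance threshold: in a Vietoris&ndash;Rips filtration every point is a component &ldquo;born&rdquo; at scale 0, and a component &ldquo;dies&rdquo; when an edge merges it into another. This is a toy illustration only, not a replacement for TDA libraries such as GUDHI or Ripser:</p>

```python
import numpy as np

def h0_persistence(points):
    """Death times of 0-dim features in a Vietoris-Rips filtration.
    All components are born at 0; each merge kills one component."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # All pairwise edges, sorted by length (the filtration order)
    edges = sorted(
        (np.linalg.norm(pts[i] - pts[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one component dies at this scale
    return deaths  # n-1 deaths; one component persists forever

# Two well-separated pairs of points: two short-lived merges, one late merge
deaths = h0_persistence([[0, 0], [1, 0], [10, 0], [11, 0]])
```

<p>Higher-dimensional features (loops, voids) require tracking simplices rather than just edges, which is where the $O(N^{3})$ worst-case cost enters.</p>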
<h2 id="crystal-graph-neural-networks">Crystal Graph Neural Networks</h2>
<p>Graph neural networks bypass manual feature engineering by learning representations directly from structural data. Materials are converted to graphs $G(V, E)$ where nodes represent atoms and edges connect neighbors within a cutoff radius, with periodic boundary conditions.</p>
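<p>A toy version of this graph construction can enumerate the 3&times;3&times;3 surrounding images of the unit cell so that bonds across periodic boundaries are included. This is a minimal illustration, not a production neighbor list:</p>

```python
import numpy as np
from itertools import product

def radius_graph_pbc(frac_coords, lattice, cutoff):
    """Edges (i, j, distance) between atoms within `cutoff`,
    including neighbors in adjacent periodic images.
    frac_coords: (N, 3) fractional coords; lattice: (3, 3) row vectors."""
    frac = np.asarray(frac_coords, dtype=float)
    lat = np.asarray(lattice, dtype=float)
    cart = frac @ lat
    edges = []
    for i, j in product(range(len(frac)), repeat=2):
        for shift in product((-1, 0, 1), repeat=3):
            if i == j and shift == (0, 0, 0):
                continue  # no self-loop within the same image
            d = np.linalg.norm(cart[j] + np.array(shift) @ lat - cart[i])
            if d < cutoff:
                edges.append((i, j, d))
    return edges

# Simple cubic lattice, one atom per cell, a = 3.0: six nearest images
edges = radius_graph_pbc([[0, 0, 0]], 3.0 * np.eye(3), cutoff=3.5)
```

<p>Real pipelines (e.g., via pymatgen or ASE neighbor lists) use cell-based searches that scale far better, but the logic is the same.</p>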
<p>Key architectures discussed include:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>Crystal graph convolutions for broad property prediction</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>Materials graph networks with global state attributes</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>Line graph neural networks incorporating three-body angular features</td>
      </tr>
      <tr>
          <td>Equivariant GNNs</td>
          <td>E(3)-equivariant message passing for tensorial properties</td>
      </tr>
  </tbody>
</table>
<p>The review identifies several limitations. Graph convolutions based on local neighborhoods can fail to capture long-range interactions or periodicity-dependent properties (e.g., lattice parameters, phonon spectra). Strategies to address this include concatenation with hand-tuned descriptors, plane-wave periodic basis modulation, and reciprocal-space features.</p>
<p>A major practical restriction is the requirement for relaxed atomic positions. Graphs built from unrelaxed crystal prototypes lose information about geometric distortions, degrading accuracy. Approaches to mitigate this include data augmentation with perturbed structures, Bayesian optimization of prototypes, and surrogate force-field relaxation.</p>
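<p>The data-augmentation strategy can be sketched as adding small Gaussian displacements to atomic coordinates; the noise scale and number of copies below are arbitrary choices for illustration:</p>

```python
import numpy as np

def perturb_structures(coords, n_copies=10, sigma=0.05, seed=0):
    """Generate noisy copies of an (N, 3) coordinate array.
    sigma is the displacement scale in Angstrom (arbitrary here)."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    return [coords + rng.normal(0.0, sigma, coords.shape)
            for _ in range(n_copies)]

# Four perturbed copies of a diatomic prototype
copies = perturb_structures([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]], n_copies=4)
```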
<p>Equivariant models that introduce higher-order tensors to node and edge features, constrained to transform correctly under E(3) operations, achieve state-of-the-art accuracy and can match structural descriptor performance even in low-data (~100 datapoints) regimes.</p>
<h2 id="compositional-descriptors-without-structure">Compositional Descriptors Without Structure</h2>
<p>When crystal structures are unavailable, representations can be built purely from stoichiometry and tabulated atomic properties (radii, electronegativity, valence electrons). Despite their simplicity, these methods have distinct advantages: zero computational overhead, accessibility to non-experts, and robustness for high-throughput screening.</p>
<p>Key methods include:</p>
<ul>
<li><strong>MagPie</strong>: 145 input features derived from elemental properties</li>
<li><strong>SISSO</strong>: Compressive sensing over algebraic combinations of atomic properties, capable of discovering interpretable descriptors (e.g., a new tolerance factor $\tau$ for perovskite stability)</li>
<li><strong>ElemNet</strong>: Deep neural network using only fractional stoichiometry as input, outperforming MagPie with &gt;3,000 training points</li>
<li><strong>ROOST</strong>: Fully-connected compositional graph with attention-based message passing, achieving strong performance with only hundreds of examples</li>
<li><strong>CrabNet</strong>: Self-attention on element embeddings with fractional encoding, handling dopant-level concentrations via log-scale inputs</li>
</ul>
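<p>The simplest of these inputs, the fractional stoichiometry vector used by ElemNet-style models, is easy to sketch. The element vocabulary here is truncated for brevity; real models cover all (or most) of the periodic table:</p>

```python
def composition_vector(formula_counts, elements=("H", "C", "N", "O", "Fe", "Ti")):
    """Map {element: count} to a fixed-length vector of atomic fractions.
    `elements` is a toy vocabulary for illustration."""
    total = sum(formula_counts.values())
    return [formula_counts.get(el, 0) / total for el in elements]

# TiO2 -> fractions over the toy vocabulary
vec = composition_vector({"Ti": 1, "O": 2})
```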
<p>Compositional models cannot distinguish polymorphs and generally underperform structural approaches. They are most valuable when atomistic resolution is unavailable.</p>
<h2 id="defects-surfaces-and-grain-boundaries">Defects, Surfaces, and Grain Boundaries</h2>
<p>The review extends beyond idealized unit cells to practical materials challenges:</p>
<p><strong>Point defects</strong>: Representations of the pristine bulk can predict vacancy formation energies through linear relationships with band structure descriptors. Frey et al. proposed using relative differences between defect and parent structure properties, requiring no DFT on the defect itself.</p>
<p><strong>Surfaces and catalysis</strong>: Binding energy prediction for catalysis requires representations beyond the bulk unit cell. The d-band center for metals and oxygen 2p-band center for metal oxides serve as simple electronic descriptors, following the <a href="https://en.wikipedia.org/wiki/Sabatier_principle">Sabatier principle</a> that optimal catalytic activity requires intermediate binding strength. Graph neural networks trained on the Open Catalyst 2020 dataset (&gt;1 million DFT energies) have enabled broader screening, though errors remain high for certain adsorbates and non-metallic surfaces.</p>
<p><strong>Grain boundaries</strong>: SOAP descriptors computed for atoms near grain boundaries and clustered into local environment classes can predict grain boundary energy, mobility, and shear coupling. This approach provides interpretable structure-property relationships.</p>
<h2 id="transfer-learning-across-representations">Transfer Learning Across Representations</h2>
<p>When target datasets are small, transfer learning leverages representations learned from large, related datasets. The standard procedure involves: (1) pretraining on a large dataset (e.g., all Materials Project formation energies), (2) freezing parameters up to a chosen depth, and (3) either fine-tuning remaining layers or extracting features for a separate model.</p>
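<p>Steps (2) and (3) can be sketched with a toy two-layer numpy &ldquo;network&rdquo;: the frozen weights stand in for a pretrained model, and only a ridge-regression readout is fit on the small target set. This is a schematic under invented weights, not a real pretrained model:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights were learned on a large source dataset
W1 = rng.normal(size=(16, 8))   # frozen layer 1
W2 = rng.normal(size=(8, 8))    # frozen layer 2

def frozen_features(X):
    """Forward pass through the frozen layers (ReLU activations)."""
    h = np.maximum(X @ W1, 0.0)
    return np.maximum(h @ W2, 0.0)

def fit_head(X, y, alpha=1e-2):
    """Closed-form ridge-regression readout on frozen features."""
    F = frozen_features(X)
    return np.linalg.solve(F.T @ F + alpha * np.eye(F.shape[1]), F.T @ y)

# Small target dataset: 50 examples, 16 raw input features
X = rng.normal(size=(50, 16))
y = rng.normal(size=50)
w = fit_head(X, y)
preds = frozen_features(X) @ w
```

<p>Fine-tuning instead of feature extraction would unfreeze some layers and continue gradient training; the trade-off depends on target dataset size.</p>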
<p>Key findings from the review:</p>
<ul>
<li>Transfer learning is most effective when the source dataset is orders of magnitude larger than the target</li>
<li>Physically related tasks transfer better (e.g., Open Catalyst adsorption energies transfer well to new adsorbates, less so to unrelated small molecules)</li>
<li>Earlier neural network layers learn more general representations and transfer better across properties</li>
<li>Multi-depth feature extraction, combining activations from multiple layers, can improve transfer</li>
<li>Predictions from surrogate models can serve as additional descriptors, expanding screening domains by orders of magnitude</li>
</ul>
<h2 id="generative-models-for-crystal-inverse-design">Generative Models for Crystal Inverse Design</h2>
<p>Generative models for solid-state materials face challenges beyond molecular generation: more diverse atomic species, the need to specify both positions and lattice parameters, non-unique definitions (rotations, translations, supercell scaling), and large unit cells (&gt;100 atoms for zeolites and MOFs).</p>
<p>The review traces the progression of approaches:</p>
<ol>
<li><strong>Voxel representations</strong>: Discretize unit cells into volume elements. Early work (iMatGen, Court et al.) demonstrated feasibility but was restricted to specific chemistries or cubic systems.</li>
<li><strong>Continuous coordinate models</strong>: Point cloud and invertible representations allowed broader chemical spaces but lacked symmetry invariances.</li>
<li><strong>Symmetry-aware models</strong>: Crystal Diffusion <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAE</a> (CDVAE) uses periodic graphs and SE(3)-equivariant message passing for translationally and rotationally invariant generation, establishing benchmark tasks for the field.</li>
<li><strong>Constrained models for porous materials</strong>: Approaches like SmVAE represent MOFs through their topological building blocks (RFcodes), ensuring all generated structures are physically valid.</li>
</ol>
<h2 id="open-problems-and-future-directions">Open Problems and Future Directions</h2>
<p>The review highlights four high-impact open questions:</p>
<ol>
<li><strong>Local vs. global descriptor trade-offs</strong>: Local descriptors (SOAP) excel for short-range interactions but struggle with long-range physics. Global descriptors model periodicity but lack generality across space groups. Combining local and long-range features could provide more universal models.</li>
<li><strong>Prediction from unrelaxed prototypes</strong>: ML force fields can relax structures at a fraction of DFT cost, potentially expanding screening domains. Key questions remain about required training data scale and generalizability.</li>
<li><strong>Applicability of compositional descriptors</strong>: The performance gap between compositional and structural models may be property-dependent, being smaller for properties like band gap that depend on global features rather than local site energies.</li>
<li><strong>Extensions of generative models</strong>: Diffusion-based architectures have improved on voxel approaches for small unit cells, but extending to microstructure, dimensionality, and surface generation remains open.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is a review and does not present new experimental results or release any novel code, data, or models. The paper is open-access (hybrid OA at Annual Reviews) and the arXiv preprint is freely available. The following artifacts table covers key publicly available resources discussed in the review.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://arxiv.org/abs/2301.08813">arXiv preprint (2301.08813)</a></td>
          <td>Other</td>
          <td>arXiv (open access)</td>
          <td>Free preprint version</td>
      </tr>
      <tr>
          <td><a href="https://materialsproject.org">Materials Project</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>DFT energies, band gaps, structures for &gt;100,000 compounds</td>
      </tr>
      <tr>
          <td><a href="https://oqmd.org">OQMD</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Open Quantum Materials Database, &gt;600,000 DFT entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Open-Catalyst-Project/ocp">Open Catalyst 2020 (OC20)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>&gt;1,000,000 DFT surface adsorption energies</td>
      </tr>
      <tr>
          <td><a href="https://aflowlib.org">AFLOW</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>High-throughput ab initio library, &gt;3,000,000 entries</td>
      </tr>
      <tr>
          <td><a href="https://github.com/hackingmaterials/matminer">Matminer</a></td>
          <td>Code</td>
          <td>BSD</td>
          <td>Open-source toolkit for materials data mining and featurization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review covers: ACSF, SOAP, Voronoi tessellation, Coulomb matrices, PRDF, MBTR, cluster expansions, ACE, persistent homology, CGCNN, MEGNet, ALIGNN, E(3)-equivariant GNNs, MagPie, SISSO, ElemNet, ROOST, CrabNet, VAE, GAN, and diffusion-based crystal generators.</p>
<h3 id="hardware">Hardware</h3>
<p>No new experiments are conducted. Hardware requirements vary by the referenced methods (DFT calculations require HPC; GNN training typically requires 1-8 GPUs).</p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Partially Reproducible</strong>: The review paper itself is open-access. All major datasets discussed (Materials Project, OQMD, OC20, AFLOW) are publicly available under permissive licenses. Most referenced model implementations (CGCNN, MEGNet, ALIGNN, ROOST, CDVAE) have open-source code. No novel artifacts are released by the authors.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Damewood, J., Karaguesian, J., Lunger, J. R., Tan, A. R., Xie, M., Peng, J., &amp; Gómez-Bombarelli, R. (2023). Representations of Materials for Machine Learning. <em>Annual Review of Materials Research</em>, 53. <a href="https://doi.org/10.1146/annurev-matsci-080921-085947">https://doi.org/10.1146/annurev-matsci-080921-085947</a></p>
<p><strong>Publication</strong>: Annual Review of Materials Research, 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{damewood2023representations,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Representations of Materials for Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Damewood, James and Karaguesian, Jessica and Lunger, Jaclyn R. and Tan, Aik Rui and Xie, Mingrou and Peng, Jiayu and G{\&#39;o}mez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Annual Review of Materials Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{53}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1146/annurev-matsci-080921-085947}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The International Chemical Identifier</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi/</guid><description>InChI is IUPAC's open, layered chemical identifier that encodes molecular structure hierarchically for database interoperability and search.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p><strong>InChI (International Chemical Identifier)</strong> is an open, non-proprietary chemical structure identifier developed by <a href="https://iupac.org/">IUPAC</a> and <a href="https://www.nist.gov/">NIST</a>. Unlike <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, which linearizes a molecular graph through depth-first traversal, InChI decomposes a molecule into a hierarchy of <strong>layers</strong> (connectivity, hydrogen atoms, charge, stereochemistry) that build progressively from the molecular formula to full stereochemical detail. This layered design means that two representations of the same molecule always produce the same InChI, even if their input drawings differ in atom ordering or layout.</p>
<p>InChI was created to solve a specific problem: linking chemical information across databases on the open web. Before InChI, interoperability between chemical databases depended on proprietary identifiers (like CAS Registry Numbers) or format-dependent representations. The project began at a March 2000 IUPAC meeting and is maintained by the <a href="https://www.inchi-trust.org/">InChI Trust</a>, a UK charity supported by publishers and database providers. The algorithm&rsquo;s source code is <a href="https://github.com/IUPAC-InChI/InChI">open source</a>.</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Canonical by design</strong>: Every valid molecular structure maps to exactly one standard InChI string, regardless of how the structure was drawn or which atoms were numbered first. This uniqueness is built into the algorithm, not added as a post-processing step.</li>
<li><strong>Hierarchical layers</strong>: Information is organized from general (molecular formula) to specific (stereochemistry, isotopes). This allows matching at different levels of detail: a query with unknown stereochemistry can match against structures with known stereochemistry by comparing only the connectivity layers.</li>
<li><strong>Web-searchable via InChIKey</strong>: Because InChI strings contain characters (<code>/</code>, <code>+</code>, <code>=</code>) that break web search engines, the 27-character InChIKey hash provides a fixed-length, search-friendly identifier.</li>
<li><strong>Non-proprietary and open</strong>: Governed by IUPAC through the InChI Trust. The algorithm, source code, and specification are freely available.</li>
<li><strong>Machine-optimized</strong>: Designed for programmatic parsing and database operations rather than human readability. Compare with SMILES, which prioritizes human readability.</li>
</ul>
<h2 id="layered-structure">Layered Structure</h2>
<p>An InChI string begins with the prefix <code>InChI=</code> followed by a version number, then a series of layers separated by <code>/</code>. Each layer encodes a specific aspect of the molecular structure.</p>
<h3 id="layer-breakdown">Layer Breakdown</h3>
<p>For L-alanine (an amino acid with a chiral center):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   │  │
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   │  └─ /s: stereo type (1=absolute)
</span></span><span style="display:flex;"><span>       │  │      │            │                   │   └─ /m: parity inversion flag
</span></span><span style="display:flex;"><span>       │  │      │            │                   └─ /t: tetrahedral parity
</span></span><span style="display:flex;"><span>       │  │      │            └─ /h: hydrogen layer
</span></span><span style="display:flex;"><span>       │  │      └─ /c: connectivity layer
</span></span><span style="display:flex;"><span>       │  └─ molecular formula
</span></span><span style="display:flex;"><span>       └─ version (1S = standard InChI v1)
</span></span></code></pre></div><p>The full set of layers, in order:</p>
<ol>
<li><strong>Main layer</strong>: Molecular formula (e.g., <code>C3H7NO2</code>)</li>
<li><strong>Connectivity (<code>/c</code>)</strong>: Atom-to-atom connections, excluding bond orders. Atoms are numbered starting from 1, and connections are listed as pairs.</li>
<li><strong>Hydrogen (<code>/h</code>)</strong>: Hydrogen atom assignments, distinguishing mobile (tautomeric) from fixed hydrogens</li>
<li><strong>Charge (<code>/q</code>) and proton balance (<code>/p</code>)</strong>: Net charge and protonation state</li>
<li><strong>Double bond stereochemistry (<code>/b</code>)</strong>: E/Z configuration around double bonds</li>
<li><strong>Tetrahedral stereochemistry (<code>/t</code>)</strong>: R/S configuration at sp3 centers</li>
<li><strong>Parity inversion (<code>/m</code>)</strong>: Relates computed parity to actual configuration</li>
<li><strong>Stereo type (<code>/s</code>)</strong>: Whether stereochemistry is absolute, relative, or racemic</li>
<li><strong>Isotope layer (<code>/i</code>)</strong>: Isotopic labeling (e.g., deuterium, carbon-13)</li>
</ol>
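<p>Because each layer is delimited by <code>/</code> and identified by a single-letter prefix, splitting an InChI into its layers needs only the standard library. This is a minimal parser for well-formed strings with the layers listed above, not a validator:</p>

```python
def parse_inchi_layers(inchi):
    """Split an InChI string into {prefix: content} plus version and formula."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # e.g. 'c1-2(4)...' -> {'c': '1-2(4)...'}
    return layers

layers = parse_inchi_layers(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)
```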
<h3 id="standard-vs-non-standard-inchi">Standard vs. Non-Standard InChI</h3>
<p>The <code>S</code> in <code>InChI=1S/</code> indicates a <strong>Standard InChI</strong>, which uses a fixed set of normalization options to guarantee that any software producing Standard InChI will generate the same string for the same molecule. Non-standard InChI allows custom options (such as the Fixed-H layer <code>/f</code>, which distinguishes specific tautomeric forms) but sacrifices cross-implementation consistency.</p>
<h2 id="the-inchikey">The InChIKey</h2>
<p>InChI strings can be arbitrarily long for large molecules, and their <code>/</code>, <code>+</code>, and <code>=</code> characters cause problems for web search engines. The <strong>InChIKey</strong> addresses both issues by condensing the InChI into a fixed 27-character string. It is not a single hash of the whole string: the skeleton layers and the remaining layers are hashed separately with truncated SHA-256, then joined with flag characters:</p>
<p>$$
\text{InChIKey} = \operatorname{trunc}_{14}\big(\text{SHA-256}(\text{skeleton})\big) \,\Vert\, \operatorname{trunc}_{8}\big(\text{SHA-256}(\text{other layers})\big) \,\Vert\, \text{flags}
$$</p>
<h3 id="structure">Structure</h3>
<p>An InChIKey has the format <code>XXXXXXXXXXXXXX-XXXXXXXXXX-X</code>:</p>
<ul>
<li><strong>First block (14 characters)</strong>: SHA-256 hash of the connectivity layer (molecular skeleton)</li>
<li><strong>Second block (10 characters)</strong>: 8 characters encoding stereochemistry and isotopes, plus a standard/non-standard flag (<code>S</code> or <code>N</code>) and a version indicator (<code>A</code> for v1)</li>
<li><strong>Third block (1 character)</strong>: Protonation flag (<code>N</code> for neutral)</li>
</ul>
<p>For example, L-alanine:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>InChIKey: QNAYBMKLOCPYGJ-REOHCLBHSA-N
</span></span><span style="display:flex;"><span>          │                │          │
</span></span><span style="display:flex;"><span>          └─ connectivity  └─ stereo  └─ protonation
</span></span></code></pre></div><h3 id="collision-risk">Collision Risk</h3>
<p>Because the InChIKey is a hash, collisions are theoretically possible. The first block provides $2^{65}$ possible values for connectivity, making accidental collisions extremely unlikely for practical database sizes (estimated 1 in $10^{12}$ chance for $10^9$ compounds). It is important to distinguish InChIKey collisions (a mathematical inevitability of hashing, but rare in practice) from InChI collisions (bugs in the algorithm, which are very rare and targeted by the certification suite).</p>
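<p>The fixed 14&ndash;10&ndash;1 block structure also makes InChIKeys easy to recognize with a regex. This is a format check only; it cannot tell whether a key corresponds to any real structure:</p>

```python
import re

INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

def looks_like_inchikey(s):
    """True if s has the 14-10-1 uppercase block structure of an InChIKey."""
    return bool(INCHIKEY_RE.match(s))

print(looks_like_inchikey("QNAYBMKLOCPYGJ-REOHCLBHSA-N"))  # True
```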
<h2 id="working-with-inchi-in-python">Working with InChI in Python</h2>
<p>The RDKit library provides InChI support through its built-in functions:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.inchi <span style="color:#f92672">import</span> MolFromInchi, MolToInchi, InchiToInchiKey
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMILES -&gt; InChI</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@@H](N)C(=O)O&#34;</span>)  <span style="color:#75715e"># L-alanine</span>
</span></span><span style="display:flex;"><span>inchi <span style="color:#f92672">=</span> MolToInchi(mol)
</span></span><span style="display:flex;"><span>print(inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># InChI -&gt; Molecule -&gt; SMILES</span>
</span></span><span style="display:flex;"><span>mol2 <span style="color:#f92672">=</span> MolFromInchi(inchi)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C[C@@H](N)C(=O)O</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># InChI -&gt; InChIKey</span>
</span></span><span style="display:flex;"><span>key <span style="color:#f92672">=</span> InchiToInchiKey(inchi)
</span></span><span style="display:flex;"><span>print(key)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; QNAYBMKLOCPYGJ-REOHCLBHSA-N</span>
</span></span></code></pre></div><h3 id="layer-level-matching">Layer-Level Matching</h3>
<p>Because InChI is hierarchical, you can compare molecules at different levels of detail by truncating layers. Two molecules that differ only in stereochemistry will share the same connectivity layers:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.inchi <span style="color:#f92672">import</span> MolToInchi, InchiToInchiKey
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># L-alanine and D-alanine differ only in chirality</span>
</span></span><span style="display:flex;"><span>l_ala <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@@H](N)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>d_ala <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;C[C@H](N)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>l_inchi <span style="color:#f92672">=</span> MolToInchi(l_ala)
</span></span><span style="display:flex;"><span>d_inchi <span style="color:#f92672">=</span> MolToInchi(d_ala)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Full InChIs differ (different /t and /m layers)</span>
</span></span><span style="display:flex;"><span>print(l_inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1</span>
</span></span><span style="display:flex;"><span>print(d_inchi)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># First block of InChIKey is identical (same connectivity)</span>
</span></span><span style="display:flex;"><span>l_key <span style="color:#f92672">=</span> InchiToInchiKey(l_inchi)
</span></span><span style="display:flex;"><span>d_key <span style="color:#f92672">=</span> InchiToInchiKey(d_inchi)
</span></span><span style="display:flex;"><span>print(l_key[:<span style="color:#ae81ff">14</span>] <span style="color:#f92672">==</span> d_key[:<span style="color:#ae81ff">14</span>])
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; True (same molecular skeleton)</span>
</span></span><span style="display:flex;"><span>print(l_key <span style="color:#f92672">==</span> d_key)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; False (different stereochemistry)</span>
</span></span></code></pre></div><h2 id="inchi-in-machine-learning">InChI in Machine Learning</h2>
<p>InChI was designed for database interoperability, not for machine learning. Its hierarchical, layer-based structure differs fundamentally from the sequential, atom-by-atom encoding used by <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. This has practical implications for ML applications.</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>InChI is widely used as an output format for <a href="/posts/what-is-ocsr/">optical chemical structure recognition (OCSR)</a> systems that extract molecular structures from images in scientific literature. Because InChI is canonical, it provides an unambiguous target for image-to-text models.</p>
<p><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI</a> uses an improved SwinTransformer encoder with attention-based feature fusion to convert molecular images directly to InChI strings, achieving 99.8% accuracy on the BMS dataset. The <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI Transformer</a> takes a similar approach with a Vision Transformer backbone.</p>
<p>In a <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">systematic comparison of string representations for OCSR</a>, Rajan et al. (2022) evaluated SMILES, DeepSMILES, SELFIES, and InChI using the same transformer architecture. InChI strings are longer than SMILES (producing more tokens for the decoder), which increases sequence modeling difficulty. SMILES achieved the highest exact match accuracy (88.62%), while SELFIES achieved 100% structural validity.</p>
<h3 id="chemical-name-translation">Chemical Name Translation</h3>
<p>InChI&rsquo;s canonical structure makes it a natural intermediate representation for translating between chemical names and structures. <a href="/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/">Handsel et al. (2021)</a> trained a sequence-to-sequence Transformer to translate InChI identifiers to IUPAC names character-by-character, achieving 91% accuracy on organic compounds from PubChem (10 million training pairs). <a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT</a> converts through SELFIES as an intermediate but validates outputs against InChI for structural equivalence.</p>
<h3 id="representation-comparison-for-ml">Representation Comparison for ML</h3>
<p>InChI&rsquo;s design trade-offs position it differently from SMILES and SELFIES for machine learning:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>InChI</th>
          <th>SMILES</th>
          <th>SELFIES</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Uniqueness</td>
          <td>Canonical by design</td>
          <td>Requires canonicalization algorithm</td>
          <td>Via SMILES roundtrip</td>
      </tr>
      <tr>
          <td>Validity guarantee</td>
          <td>N/A (not generative)</td>
          <td>No</td>
          <td>Yes (every string is valid)</td>
      </tr>
      <tr>
          <td>Human readability</td>
          <td>Low (machine-optimized)</td>
          <td>High</td>
          <td>Moderate</td>
      </tr>
      <tr>
          <td>String length</td>
          <td>Longest</td>
          <td>Shortest</td>
          <td>Moderate</td>
      </tr>
      <tr>
          <td>Primary ML use</td>
          <td>OCSR output, database linking</td>
          <td>Generation, property prediction</td>
          <td>Generation with validity</td>
      </tr>
      <tr>
          <td>Tokenization</td>
          <td>Complex (layers, separators)</td>
          <td>Regex-based atom tokens</td>
          <td>Bracket-delimited tokens</td>
      </tr>
  </tbody>
</table>
<p>InChI&rsquo;s length and structural complexity (layer separators, parenthetical groupings, comma-delimited atom lists) make it less common as a direct input representation for generative models. Most molecular language models use SMILES or SELFIES for generation tasks, and convert to InChI only for canonicalized comparison or database lookup.</p>
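<p>To make the layered structure concrete, here is a short illustration that splits an InChI string on its <code>/</code> separators (a sketch for well-formed standard InChIs only; the <code>split_layers</code> helper is hypothetical, not part of any InChI library):</p>

```python
# Sketch: splitting an InChI into its layers to illustrate the
# hierarchical structure that complicates sequence tokenization.
# The example InChI is L-alanine (from the section above).
inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

def split_layers(inchi: str) -> dict:
    """Split an InChI string into {prefix: content} layers."""
    header, _, body = inchi.partition("=")
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # first character is the layer prefix
    return layers

layers = split_layers(inchi)
print(layers["formula"])  # -> C3H7NO2
print(layers["c"])        # -> 1-2(4)3(5)6
```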
<h2 id="limitations">Limitations</h2>
<h3 id="tautomerism">Tautomerism</h3>
<p>InChI v1 handles many tautomeric forms by normalizing mobile hydrogen atoms in the <code>/h</code> layer. However, certain tautomeric transformations (such as 1,4-oxime/nitroso conversions) can produce different InChIs for what chemists consider the same compound. This is a <a href="/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/">known limitation targeted for InChI v2</a>, with 86 tautomeric transformation rules compiled and validated across 400M+ structures to inform the update.</p>
<h3 id="inorganic-and-organometallic-chemistry">Inorganic and Organometallic Chemistry</h3>
<p>The original InChI specification was designed primarily for organic molecules. Metal-ligand bonds, coordination compounds, and extended solid-state structures posed challenges. The <a href="/notes/chemistry/molecular-representations/notations/inchi-2025/">InChI v1.07 release</a> addresses this with dedicated handling for metal-ligand bonds, though complete coverage of all inorganic chemistry remains an ongoing effort.</p>
<h3 id="not-designed-for-generation">Not Designed for Generation</h3>
<p>Unlike SMILES (which can be generated token-by-token through depth-first graph traversal) or SELFIES (which guarantees validity by construction), InChI&rsquo;s layered format does not lend itself to autoregressive generation. A generative model would need to produce internally consistent layers: the connectivity layer must agree with the molecular formula, the hydrogen layer must be consistent with the connectivity, and the stereochemistry layers must reference valid atom indices. This cross-layer dependency makes InChI poorly suited as a target for token-by-token molecular generation, which is why most generative chemistry models use SMILES or SELFIES.</p>
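<p>The formula/connectivity dependency can be illustrated with a toy consistency check (a deliberate simplification for this example; real InChI validation is far more involved):</p>

```python
import re

# Illustration of the cross-layer consistency a generative model would
# have to maintain: the atom numbering in the connectivity (/c) layer
# must cover exactly the heavy-atom count implied by the formula layer.
inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

def heavy_atom_count(formula: str) -> int:
    """Count non-hydrogen atoms in a Hill-formula string like C3H7NO2."""
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol != "H":
            total += int(count) if count else 1
    return total

parts = inchi.split("/")
formula = parts[1]
c_layer = parts[2][1:]  # strip the 'c' prefix
indices = [int(n) for n in re.findall(r"\d+", c_layer)]

print(heavy_atom_count(formula))                  # -> 6 (3 C + 1 N + 2 O)
print(max(indices) == heavy_atom_count(formula))  # -> True: layers agree
```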
<h3 id="irreversibility-of-inchikey">Irreversibility of InChIKey</h3>
<p>The InChIKey is a one-way hash: an InChIKey cannot be converted back to an InChI or a molecular structure. It is useful for search and comparison, but not for structure retrieval unless paired with a lookup table.</p>
<h2 id="variants-and-extensions">Variants and Extensions</h2>
<h3 id="rinchi-reactions">RInChI: Reactions</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/rinchi/">RInChI (Reaction InChI)</a> extends InChI to represent chemical reactions by combining the InChIs of reactants, products, and agents into a single identifier. It provides a canonical identifier for reactions, enabling reaction database searching and duplicate detection (Grethe et al., 2018).</p>
<h3 id="minchi-mixtures">MInChI: Mixtures</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/mixfile-minchi/">MInChI (Mixture InChI)</a> represents mixtures of substances, combined with the Mixfile format for storing detailed mixture composition data. This extends the InChI framework to complex multi-component systems like formulations and alloys (Clark et al., 2019).</p>
<h3 id="ninchi-nanomaterials">NInChI: Nanomaterials</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/ninchi-alpha/">NInChI</a> proposes a hierarchical adaptation of InChI for nanomaterial identification. Traditional chemical identifiers break down at the nanoscale, where a single &ldquo;entity&rdquo; may consist of millions of atoms arranged in layers, coatings, and surface functionalizations (Lynch et al., 2020).</p>
<h2 id="references">References</h2>
<ul>
<li>Heller, S., McNaught, A., Pletnev, I., Stein, S., &amp; Tchekhovskoi, D. (2015). InChI, the IUPAC International Chemical Identifier. <a href="https://doi.org/10.1186/s13321-015-0068-4"><em>Journal of Cheminformatics</em>, <em>7</em>(1), 23.</a></li>
<li>Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. <a href="https://doi.org/10.1186/1758-2946-5-7"><em>Journal of Cheminformatics</em>, <em>5</em>(1), 7.</a></li>
<li>Grethe, G., Blanke, G., Kraut, H., &amp; Goodman, J. M. (2018). International Chemical Identifier for reactions (RInChI). <a href="https://doi.org/10.1186/s13321-018-0277-8"><em>Journal of Cheminformatics</em>, <em>10</em>(1), 22.</a></li>
<li>Clark, A. M., McEwen, L. R., Gedeck, P., &amp; Bunin, B. A. (2019). Capturing mixture composition: an open machine-readable format for representing mixed substances. <a href="https://doi.org/10.1186/s13321-019-0357-4"><em>Journal of Cheminformatics</em>, <em>11</em>(1), 33.</a></li>
<li>Lynch, I., et al. (2020). Can an InChI for nano address the need for a simplified representation of complex nanomaterials across experimental and nanoinformatics studies? <a href="https://doi.org/10.3390/nano10122493"><em>Nanomaterials</em>, <em>10</em>(12), 2493.</a></li>
<li><a href="https://www.inchi-trust.org/">InChI Trust</a></li>
<li><a href="https://github.com/IUPAC-InChI/InChI">InChI GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>t-SMILES: Tree-Based Fragment Molecular Encoding</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/</guid><description>t-SMILES encodes fragmented molecules as SMILES-type strings via breadth-first traversal of full binary trees, reducing nesting depth and improving generation.</description><content:encoded><![CDATA[<h2 id="a-fragment-based-molecular-representation-method">A Fragment-Based Molecular Representation Method</h2>
<p>This is a <strong>Method</strong> paper that proposes t-SMILES (tree-based SMILES), a framework for representing molecules as SMILES-type strings derived from fragment-based decompositions. The primary contribution is an encoding algorithm that converts fragmented molecular graphs into full binary trees (FBTs) and then traverses them breadth-first to produce linear strings. Three coding variants are introduced: TSSA (shared atom), TSDY (dummy atom without ID), and TSID (dummy atom with ID). The framework achieves 100% theoretical validity, higher novelty scores, and improved distribution-learning metrics compared to classical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> across ChEMBL, ZINC, and QM9 benchmarks.</p>
<h2 id="why-fragment-based-representations-matter-for-molecular-generation">Why Fragment-Based Representations Matter for Molecular Generation</h2>
<p>Classical SMILES encodes molecules via depth-first traversal of the molecular graph, requiring parentheses and ring identifiers to appear in matched pairs with deep nesting. When generative models (LSTM, Transformer) are trained on SMILES, they produce chemically invalid strings, particularly on small datasets, because they struggle to learn these long-range pairing constraints. DeepSMILES addresses some syntactic issues but still permits semantic violations (e.g., oxygen with three bonds). SELFIES guarantees 100% valid strings, but at the cost of readability and, as the authors show, lower <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a> scores, indicating that generated molecules diverge from the training distribution.</p>
<p>Fragment-based approaches reduce the search space compared to atom-level methods and can provide insights into molecular recognition (e.g., protein-ligand interactions). However, existing fragment-based deep learning methods rely on fixed dictionaries of candidate fragments, creating in-vocabulary/out-of-vocabulary problems and high-dimensional sparse representations. The encoding of fragments as SMILES-type strings, rather than dictionary IDs, had not been systematically explored before this work.</p>
<p>The authors draw on the observation that fragments in organic molecules follow a <a href="https://en.wikipedia.org/wiki/Zipf's_law">Zipf-like</a> rank distribution similar to words in natural language, motivating the use of NLP techniques for fragment-based molecular modeling.</p>
<h2 id="core-innovation-binary-tree-encoding-of-fragmented-molecules">Core Innovation: Binary Tree Encoding of Fragmented Molecules</h2>
<p>The t-SMILES algorithm proceeds in three steps:</p>
<ol>
<li><strong>Fragmentation</strong>: A molecule is decomposed into valid chemical fragments using a chosen algorithm (JTVAE, BRICS, <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">MMPA</a>, or Scaffold), producing a fragmented molecular graph.</li>
<li><strong>Tree construction</strong>: The fragmented graph is converted into an Acyclic Molecular Tree (AMT), which is a reduced graph where nodes represent fragments and edges represent bonds between them. The AMT is then transformed into a Full Binary Tree (FBT), where every internal node has exactly two children.</li>
<li><strong>String generation</strong>: The FBT is traversed using breadth-first search (BFS) to produce the t-SMILES string.</li>
</ol>
<p>The framework introduces only two new symbols beyond standard SMILES: <code>&amp;</code> marks empty tree nodes (branch terminators providing global structural information), and <code>^</code> separates adjacent substructure segments (analogous to spaces between words in English).</p>
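<p>The traversal step can be sketched as a plain breadth-first walk over a binary tree of fragment strings. This is a schematic illustration only: the node fragments below are made up, and the exact token layout of real t-SMILES strings differs from this simplified join:</p>

```python
from collections import deque

# Schematic sketch of the string-generation step: breadth-first
# traversal of a binary tree whose nodes hold fragment SMILES,
# emitting '&' for empty children and '^' between adjacent tokens.
class Node:
    def __init__(self, fragment, left=None, right=None):
        self.fragment, self.left, self.right = fragment, left, right

def bfs_encode(root) -> str:
    tokens = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            tokens.append("&")  # empty node: branch terminator
            continue
        tokens.append(node.fragment)
        queue.append(node.left)
        queue.append(node.right)
    return "^".join(tokens)

# A two-fragment toy molecule: an ethyl fragment bonded to a benzene ring.
tree = Node("CC", Node("c1ccccc1"), None)
print(bfs_encode(tree))  # -> CC^c1ccccc1^&^&^&
```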
<h3 id="three-coding-variants">Three Coding Variants</h3>
<ul>
<li><strong>TSSA</strong> (shared atom): Two fragments share a real atom at their connection point. Produces the highest novelty scores and is recommended for goal-directed tasks.</li>
<li><strong>TSDY</strong> (dummy atom, no ID): Uses dummy atoms (marked with <code>*</code>) to indicate bonding points. Provides a balanced choice between novelty and distribution fidelity.</li>
<li><strong>TSID</strong> (dummy atom with ID): Uses numbered dummy atoms (<code>[n*]</code>) for unambiguous reconstruction. Produces the most faithful distribution reproduction and is recommended for distribution-learning tasks.</li>
</ul>
<h3 id="structural-advantages">Structural Advantages</h3>
<p>The key structural benefit is a dramatic reduction in nesting depth. For TSDY_M on ChEMBL, the proportion of tokens at nesting depth 0-1-2 increases from 68.0% (SMILES) to 99.3%, while depth 3-4-5 drops from 31.9% to 0.7%, and depth 6-11 drops from 0.1% to 0.0002%. The <code>&amp;</code> symbol, which encodes molecular topology, does not need to appear in pairs (unlike parentheses in SMILES), and its high frequency means it does not create a scarcity problem for learning.</p>
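<p>The nesting-depth statistic itself is straightforward to compute; here is a sketch that tracks parenthesis depth character by character (a simplification: proper token-level counting would first tokenize the SMILES string):</p>

```python
from collections import Counter

# Sketch of how nesting-depth statistics like those above could be
# measured: scan a SMILES string, tracking parenthesis depth, and
# record the depth at which each character sits.
def depth_profile(smiles: str) -> Counter:
    depth, profile = 0, Counter()
    for ch in smiles:
        if ch == "(":
            depth += 1
        profile[depth] += 1
        if ch == ")":
            depth -= 1
    return profile

# Alanine without stereochemistry: two nested branches.
profile = depth_profile("CC(C(=O)O)N")
print(profile)  # counts per depth level; everything sits at depth 0-2
```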
<p>The framework also supports a multi-code system where classical SMILES can be integrated as a special case called TS_Vanilla, and multiple fragmentation-based codes can be combined into hybrid models.</p>
<h3 id="reconstruction-and-data-augmentation">Reconstruction and Data Augmentation</h3>
<p>Molecules can be reconstructed from t-SMILES strings by reversing the process: rebuilding the FBT from the string, converting to AMT, and assembling fragments into a molecular graph. This reconstruction process can itself generate novel molecules without any model training by randomly assembling fragments. On ChEMBL, TSSA reconstruction achieves uniqueness above 0.98 and novelty above 0.68 for all four fragmentation algorithms, with 100% validity.</p>
<p>Data augmentation in t-SMILES operates at four levels: (1) different decomposition algorithms, (2) reconstruction, (3) enumeration of fragment strings, and (4) enumeration of FBTs. Unlike <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (which only produces different strings for the same molecule), t-SMILES reconstruction generates genuinely different molecules from the same fragment set.</p>
<h2 id="systematic-evaluation-across-multiple-benchmarks">Systematic Evaluation Across Multiple Benchmarks</h2>
<p>All experiments use MolGPT (a Transformer-decoder model) as the primary generative model. Three types of metrics are employed: distribution-learning benchmarks, goal-directed benchmarks, and Wasserstein distance metrics for physicochemical properties.</p>
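<p>The basic distribution-learning quantities reduce to simple set arithmetic once molecules are canonicalized. Here is a sketch assuming canonical SMILES are already available (validity checking and canonicalization would normally be done with RDKit; <code>distribution_metrics</code> is an illustrative helper, not from the paper's code):</p>

```python
# Validity, uniqueness, and novelty from canonical-SMILES collections.
# `valid` is the subset of generated strings that parsed successfully;
# `train` is the training set.
def distribution_metrics(generated: list[str], valid: set[str],
                         train: set[str]) -> dict:
    valid_gen = [s for s in generated if s in valid]
    unique = set(valid_gen)
    return {
        "validity": len(valid_gen) / len(generated),
        "uniqueness": len(unique) / len(valid_gen) if valid_gen else 0.0,
        "novelty": len(unique - train) / len(unique) if unique else 0.0,
    }

gen = ["CCO", "CCO", "CCN", "C(("]  # last string is unparseable
m = distribution_metrics(gen, valid={"CCO", "CCN"}, train={"CCO"})
print(m)  # validity 0.75, uniqueness 2/3, novelty 0.5
```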
<h3 id="low-resource-datasets-jnk3-and-aid1706">Low-Resource Datasets (JNK3 and AID1706)</h3>
<p>On <a href="https://en.wikipedia.org/wiki/MAPK10">JNK3</a> (923 active molecules), the authors investigate overfitting behavior across training epochs:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Novelty</th>
          <th>FCD</th>
          <th>Active Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES [R200]</td>
          <td>0.795</td>
          <td>0.120</td>
          <td>0.584</td>
          <td>0.072</td>
      </tr>
      <tr>
          <td>SMILES [R2000]</td>
          <td>1.000</td>
          <td>0.001</td>
          <td>0.765</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>SELFIES [R200]</td>
          <td>1.000</td>
          <td>0.238</td>
          <td>0.544</td>
          <td>0.148</td>
      </tr>
      <tr>
          <td>SELFIES [R2000]</td>
          <td>1.000</td>
          <td>0.008</td>
          <td>0.767</td>
          <td>0.050</td>
      </tr>
      <tr>
          <td>TSSA_S [R300]</td>
          <td>1.000</td>
          <td>0.833</td>
          <td>0.564</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>TSSA_S [R5000]</td>
          <td>1.000</td>
          <td>0.817</td>
          <td>0.608</td>
          <td>0.564</td>
      </tr>
      <tr>
          <td>TF_TSSA_S [R5]</td>
          <td>1.000</td>
          <td>0.932</td>
          <td>0.483</td>
          <td>0.710</td>
      </tr>
      <tr>
          <td>TSSA_S_Rec50 [R10]</td>
          <td>1.000</td>
          <td>0.962</td>
          <td>0.389</td>
          <td>0.829</td>
      </tr>
  </tbody>
</table>
<p>Key findings: SMILES and DeepSMILES novelty scores collapse to near zero after 200 epochs, while t-SMILES novelty stabilizes around 0.8. The highest active-novel score of 0.829 comes from t-SMILES with reconstruction-based data augmentation. Transfer learning with t-SMILES maintains novelty of 0.710 at 5 epochs versus 0.526 for SMILES, and at 100 epochs the gap widens dramatically (0.569 vs. 0.023).</p>
<h3 id="distribution-learning-on-chembl">Distribution Learning on ChEMBL</h3>
<p>t-SMILES models outperform graph baselines (Graph MCTS, hG2G, MGM) and fragment-based methods (FASMIFRA). TSID_B and TSID_S achieve FCD scores of 0.909 while maintaining novelty of 0.941 and 0.933, surpassing SMILES (FCD 0.906, novelty 0.907) in both dimensions. TSDY and TSID models consistently outperform TSSA on distribution fidelity for larger molecules.</p>
<h3 id="goal-directed-tasks-on-chembl">Goal-Directed Tasks on ChEMBL</h3>
<p>On 20 <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> subtasks, different fragmentation algorithms excel at different tasks. The goal-directed reconstruction algorithm significantly outperforms random reconstruction. On the <a href="https://en.wikipedia.org/wiki/Sitagliptin">Sitagliptin</a> MPO task (T16.SMPO), the TSDY_M model with goal-directed reconstruction achieves a score of 0.930, compared to 0.598 for SMILES and 0.708 for CReM. On <a href="https://en.wikipedia.org/wiki/Valsartan">Valsartan</a> SMARTS (T18.VS), t-SMILES models reach 0.997 versus 0.985 for SMILES.</p>
<h3 id="distribution-learning-on-zinc-and-qm9">Distribution Learning on ZINC and QM9</h3>
<p>On ZINC, t-SMILES models significantly outperform existing fragment-based baselines (JTVAE, FragDgm). Seven t-SMILES models achieve both higher FCD and novelty scores than SELFIES. On QM9 (smaller molecules), all string-based models achieve high FCD scores (above 0.960), with t-SMILES performing better than existing string and graph approaches.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Across ChEMBL and ZINC, TSDY and TSID models capture physicochemical property distributions (MolWt, LogP, SAScore, N_Atoms, N_Rings, etc.) more faithfully than TSSA models. Multiple t-SMILES models outperform SMILES in more than four out of nine property categories. Baseline models hG2G and JTVAE show the weakest pattern learning, producing molecules with fewer atoms and rings than the training data.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<ol>
<li>t-SMILES achieves 100% theoretical validity by fragmenting molecules into chemically valid pieces before encoding.</li>
<li>The framework avoids the overfitting problem on low-resource datasets, maintaining stable novelty scores where SMILES, DeepSMILES, and SELFIES collapse.</li>
<li>The multi-code system allows different coding algorithms to complement each other, with hybrid models accessing broader chemical space.</li>
<li>Goal-directed reconstruction significantly outperforms all baselines on targeted optimization tasks.</li>
<li>TSDY and TSID provide better distribution fidelity than TSSA on larger molecules, while TSSA excels at novelty generation for goal-directed tasks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Whether the tree structure of t-SMILES can be effectively learned by Large Language Models remains unexplored.</li>
<li>Only published fragmentation algorithms were tested; custom fragmentation schemes were not investigated.</li>
<li>Experiments on more complex (larger) molecules were not performed.</li>
<li>The reconstruction algorithm uses simple rules for fragment assembly; more sophisticated assembly methods (Monte Carlo tree search, CReM) could improve quality.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest exploring advanced reconstruction and optimization algorithms, improved generative models, evolutionary techniques, and extending t-SMILES to property prediction, retrosynthesis, and reaction prediction tasks. The framework is also extensible to other string representations (t-DSMILES, t-SELFIES) by changing how fragments are encoded.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low-resource evaluation</td>
          <td>JNK3</td>
          <td>923 active molecules</td>
          <td>Kinase inhibitors</td>
      </tr>
      <tr>
          <td>Low-resource evaluation</td>
          <td>AID1706</td>
          <td>329 active molecules</td>
          <td>SARS 3CLPro inhibitors</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ChEMBL</td>
          <td>Standard split</td>
          <td>Large drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>ZINC</td>
          <td>250K subset</td>
          <td>Medium drug-like molecules</td>
      </tr>
      <tr>
          <td>Distribution learning</td>
          <td>QM9</td>
          <td>~134K molecules</td>
          <td>Small organic molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: JTVAE, BRICS, MMPA, Scaffold (all via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>)</li>
<li><strong>Tree construction</strong>: AMT from reduced graph, then FBT transformation</li>
<li><strong>Traversal</strong>: Breadth-first search on FBT</li>
<li><strong>Generative model</strong>: MolGPT (Transformer decoder)</li>
<li><strong>Discriminative model</strong>: AttentiveFP for activity prediction on JNK3/AID1706</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated strings that decode to valid molecules</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of distinct molecules among valid generations</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
      <tr>
          <td>KLD</td>
          <td>Kullback-Leibler divergence for physicochemical property distributions</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>Frechet ChemNet Distance measuring chemical similarity to training set</td>
      </tr>
      <tr>
          <td>Active Novel</td>
          <td>Novel molecules predicted active by AttentiveFP</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/juanniwu/t-SMILES">t-SMILES GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with training/generation scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/ZENODO.10991703">Zenodo deposit</a></td>
          <td>Code + Data</td>
          <td>CC-BY-4.0</td>
          <td>Archived code and data</td>
      </tr>
      <tr>
          <td><a href="https://codeocean.com/capsule/3034546/tree">Code Ocean capsule</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Certified reproducible compute capsule</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions limited computational resources but does not specify exact GPU types or training times.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, J.-N., Wang, T., Chen, Y., Tang, L.-J., Wu, H.-L., &amp; Yu, R.-Q. (2024). t-SMILES: a fragment-based molecular representation framework for de novo ligand design. <em>Nature Communications</em>, 15, 4993.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{t-SMILES: a fragment-based molecular representation framework for de novo ligand design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Juan-Ni and Wang, Tong and Chen, Yue and Tang, Li-Juan and Wu, Hai-Long and Yu, Ru-Qin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4993}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-49388-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPE: Data-Driven SMILES Substructure Tokenization</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/</guid><description>SMILES Pair Encoding adapts byte pair encoding to learn chemically meaningful substructure tokens from SMILES, improving generation and QSAR prediction.</description><content:encoded><![CDATA[<h2 id="a-data-driven-tokenization-method-for-chemical-deep-learning">A Data-Driven Tokenization Method for Chemical Deep Learning</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Pair Encoding (SPE), a tokenization algorithm adapted from <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">byte pair encoding (BPE)</a> in natural language processing. The primary contribution is a data-driven approach that learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset and then uses that vocabulary to tokenize SMILES for downstream deep learning tasks. The authors provide an open-source Python package (SmilesPE) and demonstrate improvements on both molecular generation and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> prediction benchmarks.</p>
<h2 id="limitations-of-atom-level-smiles-tokenization">Limitations of Atom-Level SMILES Tokenization</h2>
<p>SMILES-based deep learning models require tokenization to convert molecular strings into sequences of discrete units. The standard approaches have well-known drawbacks:</p>
<ul>
<li><strong>Character-level tokenization</strong> breaks SMILES character by character, splitting chemically meaningful multi-character atoms. For example, <code>[C@@H]</code> becomes six separate tokens (<code>[</code>, <code>C</code>, <code>@</code>, <code>@</code>, <code>H</code>, <code>]</code>), scattering the stereochemistry annotation of a single chiral carbon across unrelated tokens.</li>
<li><strong>Atom-level tokenization</strong> addresses some of these issues by treating multi-character element symbols (Cl, Br) and bracketed atoms ([nH], [O-]) as single tokens. However, these tokens still encode only individual atoms, not substructures.</li>
<li><strong>k-mer tokenization</strong> (overlapping windows of k consecutive characters) captures some connectivity information but suffers from the out-of-vocabulary problem: the model cannot represent k-mers that were not seen during training.</li>
</ul>
<p>All three approaches produce relatively long input sequences (mean ~40 tokens per molecule on ChEMBL at the atom level), which increases computational cost for sequential architectures like RNNs and exacerbates long-range dependency issues.</p>
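<p>The difference between the first two schemes is easy to see in code. The sketch below uses an atom-level regex in the style of Schwaller et al.; the pattern is a common community reproduction, not code taken from any of the packages discussed here.</p>

```python
import re

# Atom-level SMILES regex in the style of Schwaller et al.:
# bracketed atoms, two-letter halogens (Cl, Br), and ring-bond
# digits are each matched as a single token.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def atom_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)

# Character-level splits [C@@H] into 6 tokens; atom-level keeps it whole.
print(list("O[C@@H](Cl)Br"))           # one token per character
print(atom_tokenize("O[C@@H](Cl)Br"))  # ['O', '[C@@H]', '(', 'Cl', ')', 'Br']
```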
<h2 id="core-innovation-adapting-byte-pair-encoding-for-smiles">Core Innovation: Adapting Byte Pair Encoding for SMILES</h2>
<p>SPE adapts the byte pair encoding algorithm, originally developed for data compression and later adopted for subword tokenization in NLP, to the domain of chemical strings. The algorithm has two phases:</p>
<p><strong>Vocabulary training:</strong></p>
<ol>
<li>Tokenize SMILES from a large dataset (ChEMBL) at the atom level</li>
<li>Initialize the vocabulary with all unique atom-level tokens</li>
<li>Iteratively count the frequency of all adjacent token pairs, merge the most frequent pair into a new token, and add it to the vocabulary</li>
<li>Stop when either the maximum vocabulary size (MVS) or a minimum frequency threshold (FT) is reached</li>
</ol>
<p><strong>Tokenization:</strong> Given a trained SPE vocabulary, a new SMILES string is first tokenized at the atom level, then token pairs are iteratively merged according to their frequency rank in the vocabulary until no further merges are possible.</p>
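<p>The two phases can be illustrated with a toy pair-merging implementation. The function names and the simplified stopping logic below are illustrative only; the actual SmilesPE package differs in implementation details.</p>

```python
from collections import Counter

def train_spe(corpus: list[list[str]], num_merges: int, min_freq: int = 2):
    """Learn merge rules by repeatedly merging the most frequent
    adjacent token pair (toy version of SPE vocabulary training)."""
    merges, seqs = [], [list(s) for s in corpus]
    for _ in range(num_merges):          # MVS: cap on learned merges
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < min_freq:              # FT: stop below frequency threshold
            break
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges

def merge_pair(seq, pair):
    """Replace every occurrence of the adjacent pair with one token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def spe_tokenize(atom_tokens, merges):
    """Apply learned merges in rank order to a new atom-level sequence."""
    seq = list(atom_tokens)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq
```

<p>On a toy corpus of atom-level sequences such as <code>["C", "C", "O"]</code>, the first learned merge is the most frequent pair and later merges build progressively longer substructure tokens.</p>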
<p>The key hyperparameters are MVS and FT. In the reported experiments, MVS was set to 30,000 and FT was set to 2,000. The vocabulary was trained on ~3.4 million SMILES (both canonical and one non-canonical variant per molecule) from ChEMBL25. The resulting vocabulary contained 3,002 unique SMILES substrings with lengths ranging from 1 to 22 atom-level characters.</p>
<p>The trained SPE vocabulary produces tokens that are human-readable and correspond to chemically meaningful substructures and functional groups. SPE tokenization reduces the mean sequence length from approximately 40 tokens (atom-level) to approximately 6 tokens on ChEMBL, a roughly 6-7x compression. This shorter representation directly reduces computational cost for RNN-based and other sequential models.</p>
<p>The algorithm is also compatible with other text-based molecular representations such as <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, since these share atom-level character structures that can serve as the starting point for pair merging.</p>
<h2 id="molecular-generation-and-qsar-prediction-experiments">Molecular Generation and QSAR Prediction Experiments</h2>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>The authors trained AWD-LSTM language models with SPE and atom-level tokenization on 9 million SMILES (1 canonical + 5 non-canonical per compound from ChEMBL25). Each model sampled 1 million SMILES for evaluation. The AWD-LSTM architecture used an embedding size of 400, three LSTM layers with 1,152 hidden units each, and various dropout settings (embedding: 0.1, input: 0.6, weight: 0.5, hidden: 0.2). Models were trained for 10 epochs with a base learning rate of 0.008 using one-cycle scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SPE</th>
          <th>Atom-level</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>0.941</td>
          <td>0.970</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.994</td>
          <td>0.992</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.983</td>
          <td>0.978</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.897</td>
          <td>0.886</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>0.391</td>
          <td>0.386</td>
      </tr>
  </tbody>
</table>
<p>The SPE model generated a more diverse population of novel molecules at the cost of slightly lower validity (94.1% vs. 97.0%). Internal diversity is defined as:</p>
<p>$$
\text{Internal diversity} = 1 - \frac{1}{|G|^2} \sum_{(x_1, x_2) \in G \times G} T(x_1, x_2)
$$</p>
<p>where $T(x_1, x_2)$ is the Tanimoto similarity between molecules $x_1$ and $x_2$ using 1024-bit ECFP6 fingerprints. Nearest neighbor similarity (SNN) measures how well the generated set resembles the reference set:</p>
<p>$$
\text{SNN} = \frac{1}{|G|} \sum_{x_G \in G} \max_{x_R \in R} T(x_G, x_R)
$$</p>
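<p>Both metrics are straightforward to compute once pairwise similarities are available. The sketch below uses sets of &ldquo;on&rdquo; fingerprint bits as a stand-in for the 1024-bit ECFP6 fingerprints used in the paper (which would normally come from RDKit):</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def internal_diversity(gen: list[set]) -> float:
    """1 minus the mean pairwise Tanimoto over all ordered pairs in G
    (including self-pairs, which contribute similarity 1)."""
    n = len(gen)
    total = sum(tanimoto(x1, x2) for x1 in gen for x2 in gen)
    return 1.0 - total / (n * n)

def snn(gen: list[set], ref: list[set]) -> float:
    """Mean over generated molecules of the max Tanimoto to the reference set."""
    return sum(max(tanimoto(g, r) for r in ref) for g in gen) / len(gen)
```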
<p>Substructure coverage analysis showed both models recovered the same top-1000 BRICS fragments (100% coverage), but SPE consistently outperformed atom-level tokenization on top-5000 coverage across all four substructure types: BRICS fragments (0.997 vs. 0.987), functional groups (0.688 vs. 0.659), scaffolds (0.872 vs. 0.825), and ring systems (0.781 vs. 0.761).</p>
<h3 id="qsar-prediction">QSAR Prediction</h3>
<p>QSAR models were built using the <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT transfer learning framework</a>, which pre-trains a language model on ChEMBL and then fine-tunes it for specific prediction tasks. The evaluation used 24 regression benchmarks (pIC50 values) from Cortes-Ciriano et al., covering targets ranging from 199 molecules (alpha-2a adrenergic receptor) to 5,010 molecules (<a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a>). Models were evaluated on 10 random 80:10:10 splits using RMSE, R-squared, and MAE. Random forest models with 1024-bit ECFP6 were included as baseline comparisons.</p>
<p><a href="https://en.wikipedia.org/wiki/Effect_size">Cohen&rsquo;s d</a> effect sizes were computed to quantify performance differences between tokenization methods. SPE performed comparably or better than atom-level tokenization on 23 out of 24 datasets. Notable results with medium or large effect sizes favoring SPE included <a href="https://en.wikipedia.org/wiki/Cannabinoid_receptor_1">cannabinoid CB1 receptor</a> (large effect), A2a adrenergic receptor, LCK, estrogen receptor, and <a href="https://en.wikipedia.org/wiki/Aurora_kinase_A">Aurora-A kinase</a> (all medium effects). Against k-mer tokenization, SPE matched or outperformed on 22 out of 24 datasets.</p>
<p>Cohen&rsquo;s d is defined as:</p>
<p>$$
\text{Cohen&rsquo;s } d = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(\text{SD}_1^2 + \text{SD}_2^2) / 2}}
$$</p>
<p>where $\bar{x}_1, \bar{x}_2$ are the group means and $\text{SD}_1, \text{SD}_2$ are the standard deviations. Thresholds of 0.2 (small), 0.5 (medium), and 0.8 (large) were used following standard recommendations.</p>
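<p>In code, the effect-size computation is a one-liner plus the thresholding. This sketch assumes equal-sized groups, matching the simple pooled-SD form above:</p>

```python
from statistics import mean, stdev

def cohens_d(group1: list[float], group2: list[float]) -> float:
    """Cohen's d with the simple pooled SD for equal-sized groups."""
    sd1, sd2 = stdev(group1), stdev(group2)
    pooled = ((sd1**2 + sd2**2) / 2) ** 0.5
    return (mean(group1) - mean(group2)) / pooled

def effect_label(d: float) -> str:
    """Map |d| onto the standard small/medium/large thresholds."""
    a = abs(d)
    if a >= 0.8:
        return "large"
    if a >= 0.5:
        return "medium"
    if a >= 0.2:
        return "small"
    return "negligible"
```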
<p>SMILES-based deep learning models generally performed on par with or better than the RF baseline, with particularly strong advantages on the four largest datasets (<a href="https://en.wikipedia.org/wiki/Cyclooxygenase-2">COX-2</a>, <a href="https://en.wikipedia.org/wiki/Acetylcholinesterase">acetylcholinesterase</a>, erbB1, and hERG).</p>
<p>In addition to performance gains, SPE-based models trained on average 5 times faster than atom-level models due to the shorter input sequences.</p>
<h2 id="results-summary-and-future-directions">Results Summary and Future Directions</h2>
<p>The main findings of this study are:</p>
<ol>
<li>
<p><strong>SPE produces chemically meaningful tokens.</strong> The learned vocabulary contains human-readable SMILES substrings that correspond to common substructures and functional groups, making model interpretations more accessible.</p>
</li>
<li>
<p><strong>SPE compresses input sequences by ~6-7x.</strong> Mean token sequence length drops from ~40 (atom-level) to ~6 (SPE) on ChEMBL, yielding a ~5x training speedup.</p>
</li>
<li>
<p><strong>SPE improves molecular generation diversity.</strong> The SPE-based generative model produces molecules with higher novelty (98.3% vs. 97.8%), internal diversity (0.897 vs. 0.886), and substructure coverage, at the cost of slightly lower validity (94.1% vs. 97.0%).</p>
</li>
<li>
<p><strong>SPE matches or outperforms atom-level and k-mer tokenization on QSAR prediction.</strong> Across 24 benchmarks, SPE showed comparable or better performance in 23/24 comparisons against atom-level and 22/24 against k-mer tokenization.</p>
</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The SPE vocabulary is trained on a specific dataset (ChEMBL25) and may not optimally represent chemical spaces that differ significantly from drug-like compounds.</li>
<li>The validity rate for molecular generation is slightly lower than atom-level tokenization (94.1% vs. 97.0%), since longer substructure tokens can introduce invalid fragments.</li>
<li>The k-mer tokenization suffers from an out-of-vocabulary problem, which the authors address by replacing unseen 4-mers with <code>[UNK]</code> tokens, but this is a limitation of the comparison rather than of SPE itself.</li>
</ul>
<p><strong>Future directions:</strong> The authors suggest SPE could serve as a general tokenization method for SMILES-based deep learning, applicable to any task where SMILES strings are used as input (<a href="/notes/chemistry/molecular-design/generation/">generation</a>, <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/">reaction prediction</a>, retrosynthesis). The algorithm can also be applied to DeepSMILES and SELFIES representations without modification.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SPE vocabulary training</td>
          <td>ChEMBL25</td>
          <td>~3.4M SMILES</td>
          <td>1 canonical + 1 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Language model training</td>
          <td>ChEMBL25 augmented</td>
          <td>~9M SMILES</td>
          <td>1 canonical + 5 non-canonical per molecule</td>
      </tr>
      <tr>
          <td>Molecular generation evaluation</td>
          <td>Sampled from model</td>
          <td>1M SMILES per model</td>
          <td>Validated with RDKit</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>Cortes-Ciriano et al.</td>
          <td>24 datasets, 199-5010 molecules</td>
          <td>pIC50 regression tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SPE vocabulary training: iterative pair merging with MVS=30,000 and FT=2,000</li>
<li>Language model: AWD-LSTM with embedding size 400, 3 LSTM layers with 1,152 hidden units</li>
<li>Dropout: embedding=0.1, input=0.6, weight=0.5, hidden=0.2</li>
<li>Training: 10 epochs, base learning rate 0.008, one-cycle policy</li>
<li>QSAR: MolPMoFiT transfer learning with 25x training augmentation and 15x validation augmentation</li>
<li>Test time augmentation: average of canonical + 4 augmented SMILES predictions</li>
<li>RF baseline: 500 trees, 1024-bit ECFP6, default scikit-learn parameters</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>AWD-LSTM architecture from Merity et al. (2018)</li>
<li>MolPMoFiT framework from Li and Fourches (2020) for transfer learning QSAR</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity, Uniqueness, Novelty</td>
          <td>Generation</td>
          <td>Basic quality metrics</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>Generation</td>
          <td>1 - mean pairwise Tanimoto (ECFP6)</td>
      </tr>
      <tr>
          <td>Nearest neighbor similarity</td>
          <td>Generation</td>
          <td>Mean max Tanimoto to reference set</td>
      </tr>
      <tr>
          <td>Substructure coverage</td>
          <td>Generation</td>
          <td>BRICS, functional groups, scaffolds, ring systems</td>
      </tr>
      <tr>
          <td>RMSE, R-squared, MAE</td>
          <td>QSAR regression</td>
          <td>10 random 80:10:10 splits</td>
      </tr>
      <tr>
          <td>Cohen&rsquo;s d</td>
          <td>QSAR comparison</td>
          <td>Effect size between tokenization methods</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/SmilesPE">SmilesPE</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>SPE tokenization Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transfer learning QSAR framework</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2021). SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 61(4), 1560-1569. <a href="https://doi.org/10.1021/acs.jcim.0c01127">https://doi.org/10.1021/acs.jcim.0c01127</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2021smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1560--1569}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01127}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Smirk: Complete Tokenization for Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</guid><description>Smirk tokenizer achieves full OpenSMILES coverage with 165 tokens by decomposing bracketed atoms into glyphs, validated via n-gram proxy models.</description><content:encoded><![CDATA[<h2 id="a-method-for-complete-chemical-tokenization">A Method for Complete Chemical Tokenization</h2>
<p>This is a <strong>Method</strong> paper that introduces two new tokenizers for molecular foundation models: Smirk and Smirk-GPE. The primary contribution is a tokenization scheme that achieves complete coverage of the OpenSMILES specification using only 165 tokens, addressing the vocabulary gaps present in existing atom-wise tokenizers. The paper also proposes n-gram language models as low-cost proxy evaluators for tokenizer quality and validates these proxies against 18 transformer-based models across multiple benchmarks.</p>
<h2 id="vocabulary-gaps-in-molecular-tokenization">Vocabulary Gaps in Molecular Tokenization</h2>
<p>Molecular foundation models overwhelmingly use &ldquo;atom-wise&rdquo; tokenization, where SMILES strings are split at atom boundaries using a regular expression first proposed by Schwaller et al. A key pattern in this regex treats all &ldquo;bracketed atoms&rdquo; (e.g., <code>[C@@H]</code>, <code>[18F]</code>, <code>[Au+]</code>) as single, irreducible tokens. Since bracketed atoms encode isotopes, chirality, charge, hydrogen count, and element identity, the number of possible permutations under the OpenSMILES specification exceeds 28 trillion. In practice, existing atom-wise tokenizers maintain vocabularies of fewer than 3,000 tokens, leaving large portions of chemical space unrepresentable.</p>
<p>This gap has real consequences. Many chemistry-specific tokenizers emit the unknown token <code>[UNK]</code> at non-negligible frequencies, particularly on datasets with diverse elements and stereochemistry. For example, <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SPE and APE</a> tokenizers produce <code>[UNK]</code> for roughly 19% of tokens on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and approximately 50% on the tmQM transition metal complex dataset. Even models like <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/">ReactionT5</a> lack tokens for elements such as copper, ruthenium, gold, and uranium.</p>
<p>The authors also note a subtler issue: some open-vocabulary tokenizers (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa&rsquo;s</a> BPE) conflate chemically distinct entities. The same <code>Sc</code> token may represent both a sulfur-carbon bond (in organic SMILES) and the element scandium (in <code>[Sc]</code>), creating ambiguity in downstream analysis.</p>
<h2 id="smirk-glyph-level-decomposition-of-smiles">Smirk: Glyph-Level Decomposition of SMILES</h2>
<p>The core insight behind Smirk is to fully decompose bracketed atoms into their constituent &ldquo;glyphs,&rdquo; the primitive symbols defined by the OpenSMILES specification (element symbols, chirality markers, charges, isotope numbers, hydrogen counts, and brackets themselves). This transforms tokenization from a word-level scheme (one token per bracketed atom) to a character-level scheme over chemically meaningful glyphs.</p>
<p>Smirk uses a two-stage tokenization process:</p>
<ol>
<li><strong>Atom decomposition</strong>: Split a SMILES string into atom-level units using a regex (e.g., <code>OC[C@@H][OH]</code> becomes <code>O C [C@@H] [OH]</code>).</li>
<li><strong>Glyph decomposition</strong>: Further split each unit into its constituent glyphs (e.g., <code>[C@@H]</code> becomes <code>[ C @@ H ]</code>).</li>
</ol>
<p>The two-stage process is necessary to resolve ambiguities. For example, <code>Sc</code> in an unbracketed context represents a sulfur-carbon bond, while <code>[Sc]</code> denotes scandium. This ambiguity occurs over half a million times in PubChem&rsquo;s compound dataset.</p>
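<p>A simplified version of the two-stage decomposition can be written with two regular expressions. The regexes below cover only a fragment of OpenSMILES and are not the paper&rsquo;s Rust implementation; they are enough to show the atom-then-glyph split and the <code>Sc</code> disambiguation:</p>

```python
import re

# Stage 1: atom-level units (bracketed atoms kept intact).
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\d|[()=#+\-/\\.@%*]")
# Stage 2: glyphs inside a bracketed atom (digits, chirality, elements, charge).
GLYPH_RE = re.compile(r"\d+|@@|@|[A-Z][a-z]?|[a-z]|[+\-\[\]]")

def smirk_tokenize(smiles: str) -> list[str]:
    """Split into atom-level units, then decompose bracketed atoms
    into their constituent glyphs (toy Smirk-style tokenizer)."""
    tokens = []
    for unit in ATOM_RE.findall(smiles):
        if unit.startswith("["):
            tokens.extend(GLYPH_RE.findall(unit))
        else:
            tokens.append(unit)
    return tokens

# 'Sc' unbracketed is sulfur + aromatic carbon; '[Sc]' is scandium.
print(smirk_tokenize("Sc"))    # ['S', 'c']
print(smirk_tokenize("[Sc]"))  # ['[', 'Sc', ']']
```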
<p>The resulting vocabulary contains only 165 tokens, requires no training, and by construction can faithfully tokenize any molecule that conforms to the OpenSMILES specification. The implementation is written in Rust using HuggingFace&rsquo;s Tokenizers library and is available on PyPI.</p>
<p><strong>Smirk-GPE</strong> (Glyph Pair Encoding) extends Smirk with a <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">BPE</a>-like compression step. After Smirk tokenization, adjacent tokens are merged using learned rules, reducing sequence length. Unlike standard BPE, merges operate on token IDs rather than character strings, preserving the distinction between chemically different entities that happen to share the same characters. Smirk-GPE was trained on 262 million molecules from Enamine REAL Space with a target vocabulary of 50,000 tokens, though training terminated at 2,300 tokens after exhausting all possible merges.</p>
<h2 id="evaluation-framework-intrinsic-metrics-n-gram-proxies-and-transformer-benchmarks">Evaluation Framework: Intrinsic Metrics, N-Gram Proxies, and Transformer Benchmarks</h2>
<p>The evaluation covers 34 tokenizers across three datasets (Enamine REALSpace, MoleculeNet, and tmQM) using both intrinsic and extrinsic metrics.</p>
<h3 id="intrinsic-metrics">Intrinsic Metrics</h3>
<p>Four intrinsic metrics are computed for each tokenizer:</p>
<p><strong>Fertility</strong> measures the mean tokenized sequence length. Higher fertility increases computational cost due to the quadratic scaling of attention:</p>
<p>$$
\text{cost} \propto \text{fertility}^2
$$</p>
<p><strong>Normalized entropy</strong> quantifies how close a tokenizer comes to the information-theoretic ideal where all tokens are equally probable:</p>
<p>$$
\eta = \frac{-1}{\log |V|} \sum_{x \in V} p(x) \log p(x)
$$</p>
<p>where $V$ is the vocabulary and $p(x)$ is the observed token probability. Higher normalized entropy correlates with better downstream performance.</p>
<p><strong>Token imbalance</strong> measures the distance between observed token frequencies and a uniform distribution:</p>
<p>$$
D = \frac{1}{2} \sum_{x \in V} \left| p(x) - |V|^{-1} \right|
$$</p>
<p><strong>Unknown token frequency</strong> captures the fraction of emitted tokens that are <code>[UNK]</code>. This metric is particularly revealing: all existing chemistry-specific tokenizers (SPE/APE, atom-wise, BPE, and Unigram variants) emit <code>[UNK]</code> at non-negligible rates, while NLP tokenizers, Smirk, and Smirk-GPE do not.</p>
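<p>Given tokenized sequences, the first three metrics reduce to token counting. In this toy sketch, $|V|$ is taken as the observed vocabulary rather than the tokenizer&rsquo;s full vocabulary, which the paper&rsquo;s computation would use:</p>

```python
import math
from collections import Counter

def intrinsic_metrics(token_seqs: list[list[str]]) -> dict:
    """Fertility, normalized entropy, and token imbalance from
    tokenized sequences (|V| = observed vocabulary, a simplification)."""
    counts = Counter(t for seq in token_seqs for t in seq)
    total = sum(counts.values())
    v = len(counts)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "fertility": total / len(token_seqs),  # mean tokens per molecule
        "normalized_entropy": entropy / math.log(v) if v > 1 else 0.0,
        "imbalance": 0.5 * sum(abs(p - 1 / v) for p in probs),
    }
```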
<h3 id="n-gram-proxy-language-models">N-Gram Proxy Language Models</h3>
<p>The paper proposes using n-gram models as low-cost proxies for transformer-based evaluation. An n-gram estimates token likelihood with <a href="https://en.wikipedia.org/wiki/Additive_smoothing">add-one smoothing</a>:</p>
<p>$$
P_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}) = \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|}
$$</p>
<p>where $C$ is the count function and $|V|$ is the vocabulary size. N-grams were &ldquo;pretrained&rdquo; on 1.6 billion SMILES from Enamine REAL Space and evaluated on validation splits. Cross-entropy loss and information loss from unknown tokens were computed.</p>
<p>To quantify information lost to <code>[UNK]</code> tokens, the authors compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a> between token distributions with and without unknown tokens, using a bidirectional character n-gram model:</p>
<p>$$
B_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}, x_{i+1}, \dots, x_{i+n-1}) \propto \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|} \times \frac{C(x_{i}, \dots, x_{i+n-1}) + 1}{C(x_{i+1}, \dots, x_{i+n-1}) + |V|}
$$</p>
<h3 id="transformer-experiments">Transformer Experiments</h3>
<p>Eighteen encoder-only RoBERTa models (25M parameters each, excluding embeddings) were pretrained from scratch using masked language modeling on Enamine REAL Space (245M molecules, 30,000 steps). Each model used a different tokenizer, isolating the tokenizer&rsquo;s effect on performance. Finetuning was conducted on six regression and seven classification tasks from MoleculeNet and tmQM.</p>
<p>Linear fixed-effects models were used to estimate the standardized effect of each tokenization scheme relative to an atom-wise SMILES baseline.</p>
<h2 id="key-findings-and-practical-implications">Key Findings and Practical Implications</h2>
<h3 id="tokenizer-performance">Tokenizer Performance</h3>
<ul>
<li><strong>Smirk</strong> shows a positive effect on pretraining quality and downstream performance on tmQM (the dataset with the most bracketed atoms), but performs comparably to atom-wise tokenization on MoleculeNet tasks.</li>
<li><strong>SPE and APE</strong> tokenizers have a negative impact on both pretraining and downstream performance relative to the atom-wise baseline, likely due to their high <code>[UNK]</code> rates.</li>
<li><strong>Molecular encoding choice</strong> (<a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">SMILES vs. SELFIES</a>) has a negligible effect on performance.</li>
<li><strong>NLP tokenizers</strong> (GPT-4o, LLaMA, Gemma) score comparably to chemistry-specific tokenizers on intrinsic metrics and do not emit unknown tokens.</li>
</ul>
<h3 id="n-gram-proxy-validation">N-Gram Proxy Validation</h3>
<p>N-gram cross-entropy and information loss metrics show strong rank correlation (Spearman&rsquo;s $\rho$) with downstream transformer performance, validating their use as low-cost evaluation proxies. The effect sizes from n-gram and transformer experiments are directionally consistent.</p>
<h3 id="information-loss-from-unknown-tokens">Information Loss from Unknown Tokens</h3>
<p>Information loss is minimal for tokenizers with robust coverage but substantial for tokenizers with limited vocabularies on chemically diverse datasets. <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> incurs only 0.1 nats/molecule on MoleculeNet but 40.3 nats/molecule on tmQM. Open-vocabulary tokenizers (Smirk, Smirk-GPE, NLP tokenizers) mitigate this degradation.</p>
<h3 id="practical-recommendations">Practical Recommendations</h3>
<p>The authors argue that molecular foundation models must encode the entire breadth of chemical space or risk obscuring critical features. Bracketed atoms encode information essential to clinically relevant pharmaceuticals (e.g., <a href="https://en.wikipedia.org/wiki/Amoxicillin">Amoxicillin</a>), industrial compounds (e.g., Tricalcium Silicate), and foundational chemistry (e.g., <a href="https://en.wikipedia.org/wiki/Cisplatin">Cisplatin</a>, where omitting the chiral marker erases medically relevant stereochemical information). The paper encourages the community to adopt open-vocabulary tokenizers and develop more chemically diverse benchmarks.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The analysis uses a single-point evaluation for transformer experiments, which may underestimate performance achievable with additional hyperparameter tuning.</li>
<li>Smirk-GPE&rsquo;s learned merges from REALSpace did not fully generalize to tmQM, as indicated by the token imbalance metric.</li>
<li>Current benchmarks (MoleculeNet) lack sufficient diversity to evaluate tokenizer robustness across the full periodic table, isotopes, charged species, and uncommon bond types.</li>
<li>The downstream impact of token ambiguities in BPE-based tokenizers (e.g., ChemBERTa&rsquo;s conflation of <code>Sc</code> as both sulfur-carbon and scandium) remains unclear.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>Enamine REAL Space</td>
          <td>1.6B SMILES (n-gram), 245M molecules (transformer)</td>
          <td>80/10/10 train/val/test split</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet</td>
          <td>Multiple tasks</td>
          <td>6 regression + 7 classification tasks</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>tmQM</td>
          <td>108K transition metal complexes</td>
          <td>OpenSMILES molecular encodings</td>
      </tr>
      <tr>
          <td>Smirk-GPE training</td>
          <td>Enamine REAL Space (subset)</td>
          <td>262M molecules</td>
          <td>Training split only</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Smirk</strong>: Two-stage regex-based tokenization (atom decomposition, then glyph decomposition). No training required. Vocabulary: 165 tokens.</li>
<li><strong>Smirk-GPE</strong>: BPE-like compression on top of Smirk. Operates on token IDs (not strings) to preserve chemical disambiguation. Final vocabulary: 2,300 tokens.</li>
<li><strong>N-gram models</strong>: Add-one smoothing, bidirectional context ($2n - 2$ total context window). Implemented in Julia with exact integer arithmetic.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa-PreLayerNorm, 8 layers, 8 attention heads, hidden size 512, intermediate size 2048, max sequence length 2048. ~25M parameters (excluding embeddings).</li>
<li><strong>Pretraining</strong>: Masked language modeling, 30,000 steps, effective batch size 8192, FusedLamb optimizer, learning rate $1.6 \times 10^{-4}$.</li>
<li><strong>Finetuning</strong>: 100,000 steps, AdamW optimizer, effective batch size 128, learning rate $1.6 \times 10^{-4}$.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>MoleculeNet preferred metrics per task (AUROC for classification, MAE/RMSE for regression)</li>
<li>Fixed-effects models for standardized effect size estimation</li>
<li>Spearman&rsquo;s rank correlation between n-gram and transformer metrics</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pretraining: 2x NVIDIA A100 GPUs (Delta system at NCSA)</li>
<li>Finetuning: 1x NVIDIA A40 GPU</li>
<li>N-gram models: CPU-based (Julia implementation)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BattModels/Smirk">Smirk tokenizer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Rust implementation with Python bindings, available on PyPI</td>
      </tr>
      <tr>
          <td>Model checkpoints</td>
          <td>Model</td>
          <td>Not specified</td>
          <td>Pretrained and finetuned checkpoints included in data release</td>
      </tr>
      <tr>
          <td>N-gram code</td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Julia implementation included in data release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wadell, A., Bhutani, A., &amp; Viswanathan, V. (2026). Tokenization for Molecular Foundation Models. <em>Journal of Chemical Information and Modeling</em>, 66(3), 1384-1393. <a href="https://doi.org/10.1021/acs.jcim.5c01856">https://doi.org/10.1021/acs.jcim.5c01856</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wadell2026tokenization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tokenization for Molecular Foundation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wadell, Alexius and Bhutani, Anoushka and Viswanathan, Venkatasubramanian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{66}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1384--1393}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.5c01856}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES vs SELFIES Tokenization for Chemical LMs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</guid><description>Atom Pair Encoding (APE) tokenizer outperforms BPE on SMILES and SELFIES in RoBERTa-based chemical language models across MoleculeNet classification tasks.</description><content:encoded><![CDATA[<h2 id="atom-pair-encoding-for-chemical-language-modeling">Atom Pair Encoding for Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom Pair Encoding (APE), a tokenization algorithm designed specifically for chemical string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The primary contribution is demonstrating that a chemistry-aware tokenizer, which preserves atomic identity during subword merging, leads to improved molecular property classification accuracy in transformer-based models compared to the standard Byte Pair Encoding (BPE) approach.</p>
<h2 id="why-tokenization-matters-for-chemical-strings">Why Tokenization Matters for Chemical Strings</h2>
<p>Existing chemical language models based on BERT/RoBERTa architectures have typically relied on BPE for tokenizing SMILES and SELFIES strings. <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte Pair Encoding (BPE)</a> was originally designed for natural language and data compression, where it excels at breaking words into meaningful subword units. When applied to chemical strings, BPE operates at the character level without understanding chemical semantics, leading to several problems:</p>
<ul>
<li><strong>Stray characters</strong>: BPE may create tokens like &ldquo;C)(&rdquo; that have no chemical meaning.</li>
<li><strong>Element splitting</strong>: Multi-character elements like chlorine (&ldquo;Cl&rdquo;) can be split into &ldquo;C&rdquo; and &ldquo;l&rdquo;, which the model then misreads as a carbon atom followed by a meaningless character.</li>
<li><strong>Lost structural context</strong>: BPE compresses sequences without considering how character position encodes molecular structure.</li>
</ul>
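<p>The element-splitting failure is easy to demonstrate by contrasting character-level tokenization with an atom-aware regex (a sketch; the pattern below is a common SMILES tokenization regex, not the paper&rsquo;s implementation):</p>

```python
import re

# Atom-aware SMILES tokenizer: bracket atoms, two-letter elements, and
# structural symbols are kept intact instead of being split into characters.
SMILES_ATOM_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[BCNOSPFIbcnosp]|[()=#+\-\\/:~.]|\d)"
)

smiles = "CC(=O)Cl"  # acetyl chloride
print(list(smiles))                    # character level ends ..., 'C', 'l'
print(SMILES_ATOM_RE.findall(smiles))  # atom level ends ..., 'Cl'
```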
<p>Previous work on <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> attempted to address this by iteratively merging SMILES substrings into chemically meaningful tokens. However, SPE had practical limitations: its Python implementation did not support SELFIES, and it produced a smaller vocabulary (~3000 tokens) than what the data could support. These gaps motivated the development of APE.</p>
<h2 id="the-ape-tokenizer-chemistry-aware-subword-merging">The APE Tokenizer: Chemistry-Aware Subword Merging</h2>
<p>APE draws inspiration from both BPE and SPE but addresses their shortcomings. The key design decisions are:</p>
<ol>
<li>
<p><strong>Atom-level initialization</strong>: Instead of starting from individual characters (as BPE does), APE begins with chemically valid atomic units. For SMILES, this means recognizing multi-character elements (e.g., &ldquo;Cl&rdquo;, &ldquo;Br&rdquo;) as single tokens. For SELFIES, each bracketed string (e.g., [C], [Ring1], [=O]) serves as the fundamental unit.</p>
</li>
<li>
<p><strong>Iterative pair merging</strong>: Like BPE, APE iteratively merges the most frequent adjacent token pairs. The difference is that the initial tokenization preserves atomic boundaries, so merged tokens always represent valid chemical substructures.</p>
</li>
<li>
<p><strong>Larger vocabulary</strong>: Using the same minimum frequency threshold of 2000, APE generates approximately 5300 unique tokens from the PubChem dataset, compared to SPE&rsquo;s approximately 3000. This richer vocabulary provides more expressive power for representing chemical substructures.</p>
</li>
<li>
<p><strong>SELFIES compatibility</strong>: APE natively supports both SMILES and SELFIES, using the bracketed token structure of SELFIES as its starting point for that representation.</p>
</li>
</ol>
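<p>The merge loop itself works like BPE, only initialized from atom-level tokens rather than characters; a minimal sketch of the training pass (simplified: a fixed number of merges instead of the paper&rsquo;s frequency threshold):</p>

```python
from collections import Counter

def train_merges(corpus, num_merges):
    """BPE-style pair merging over pre-tokenized (atom-level) sequences."""
    corpus = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the winning merge everywhere it occurs.
        for seq in corpus:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus

# Atom-level initialization keeps 'Cl' whole before any merging happens.
demo = [["C", "C", "O"], ["C", "C", "Cl"], ["C", "C", "O"]]
merges, merged = train_merges(demo, 1)
print(merges)  # [('C', 'C')] -- the most frequent adjacent pair
print(merged)  # [['CC', 'O'], ['CC', 'Cl'], ['CC', 'O']]
```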
<p>The tokenizer was trained on a 2-million-molecule subset of a 10-million-SMILES <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. This produced four tokenizer variants: SMILES-BPE, SMILES-APE, SELFIES-BPE, and SELFIES-APE.</p>
<h2 id="pre-training-and-evaluation-on-moleculenet-benchmarks">Pre-training and Evaluation on MoleculeNet Benchmarks</h2>
<h3 id="model-architecture">Model architecture</h3>
<p>All four models use the RoBERTa architecture with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. Pre-training used masked language modeling (MLM) with 15% token masking on 1 million molecules from PubChem, with a validation set of 100,000 molecules. Each model was pre-trained for 20 epochs using AdamW, with hyperparameter optimization via Optuna.</p>
<h3 id="downstream-tasks">Downstream tasks</h3>
<p>The models were fine-tuned on three <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Category</th>
          <th>Compounds</th>
          <th>Tasks</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Biophysics</td>
          <td>41,127</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Physiology</td>
          <td>7,831</td>
          <td>12</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p>Data was split 80/10/10 (train/validation/test) following MoleculeNet recommendations. Models were fine-tuned for 5 epochs with early stopping based on validation ROC-AUC.</p>
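<p>The split itself is a plain random partition; a minimal sketch (the helper and seed are illustrative, not from the paper):</p>

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle and partition into 80/10/10 train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    a, b = int(0.8 * n), int(0.9 * n)
    return items[:a], items[a:b], items[b:]

# For example, the 2,039 BBBP compounds:
train, val, test = split_80_10_10(range(2039))
print(len(train), len(val), len(test))  # 1631 204 204
```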
<h3 id="baselines">Baselines</h3>
<p>Results were compared against two text-based models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> MTR-77M and <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) and two graph-based models (D-MPNN from Chemprop and MoleculeNet Graph-Conv).</p>
<h3 id="main-results">Main results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>HIV ROC</th>
          <th>Tox21 ROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILYAPE-1M</td>
          <td>0.754 +/- 0.006</td>
          <td>0.772 +/- 0.010</td>
          <td>0.838 +/- 0.002</td>
      </tr>
      <tr>
          <td>SMILYBPE-1M</td>
          <td>0.746 +/- 0.006</td>
          <td>0.754 +/- 0.015</td>
          <td>0.849 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYAPE-1M</td>
          <td>0.735 +/- 0.015</td>
          <td>0.768 +/- 0.012</td>
          <td>0.842 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYBPE-1M</td>
          <td>0.676 +/- 0.014</td>
          <td>0.709 +/- 0.012</td>
          <td>0.825 +/- 0.001</td>
      </tr>
      <tr>
          <td>ChemBERTa-2-MTR-77M</td>
          <td>0.698 +/- 0.014</td>
          <td>0.735 +/- 0.008</td>
          <td>0.790 +/- 0.003</td>
      </tr>
      <tr>
          <td>SELFormer</td>
          <td>0.716 +/- 0.021</td>
          <td>0.769 +/- 0.010</td>
          <td>0.838 +/- 0.005</td>
      </tr>
      <tr>
          <td>MoleculeNet-Graph-Conv</td>
          <td>0.690</td>
          <td>0.763</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.737</td>
          <td>0.776</td>
          <td>0.851</td>
      </tr>
  </tbody>
</table>
<p>APE consistently outperforms BPE for both SMILES and SELFIES. SMILYAPE achieves the best BBBP score (0.754), beating D-MPNN (0.737). On HIV, SMILYAPE (0.772) is competitive with D-MPNN (0.776). On Tox21, D-MPNN (0.851) leads, with SMILYBPE (0.849) and SELFYAPE (0.842) close behind.</p>
<h3 id="statistical-significance">Statistical significance</h3>
<p><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U tests</a> confirmed statistically significant differences between SMILYAPE and SMILYBPE (p &lt; 0.05 on all datasets). Cliff&rsquo;s delta values indicate large effect sizes: 0.74 (BBBP), 0.70 (HIV), and -1.00 (Tox21, favoring BPE). For SELFIES models, SELFYAPE achieved Cliff&rsquo;s delta of 1.00 across all three datasets, indicating complete separation from SELFYBPE.</p>
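<p>Cliff&rsquo;s delta is just the normalized difference between pairwise wins and losses across the two score samples; a minimal sketch (not the paper&rsquo;s implementation):</p>

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size: +1 if every x exceeds every y, -1 reversed."""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

# Complete separation (as reported for SELFYAPE vs SELFYBPE) yields 1.0.
print(cliffs_delta([0.84, 0.85, 0.86], [0.80, 0.81, 0.82]))  # 1.0
```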
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="ape-outperforms-bpe-by-preserving-atomic-identity">APE outperforms BPE by preserving atomic identity</h3>
<p>The consistent advantage of APE over BPE stems from APE&rsquo;s atom-level initialization. By starting with chemically valid units rather than individual characters, APE avoids creating nonsensical tokens that break chemical elements or mix structural delimiters with atoms.</p>
<h3 id="smiles-outperforms-selfies-with-ape-tokenization">SMILES outperforms SELFIES with APE tokenization</h3>
<p>SMILYAPE generally outperforms SELFYAPE across tasks. Attention weight analysis revealed that SMILYAPE assigns more weight to immediate neighboring tokens (0.108 vs. 0.096) and less to distant tokens (0.030 vs. 0.043). This pattern aligns with chemical intuition: bonding is primarily determined by directly connected atoms. SMILYAPE also produces more compact tokenizations (8.6 tokens per molecule vs. 11.9 for SELFYAPE), potentially allowing more efficient attention allocation.</p>
<h3 id="selfies-models-show-higher-inter-tokenizer-agreement">SELFIES models show higher inter-tokenizer agreement</h3>
<p>On the BBBP dataset, all true positives identified by SELFYBPE were also captured by SELFYAPE, with SELFYAPE achieving higher recall (61.68% vs. 55.14%). In contrast, SMILES-based models shared only 29.3% of true positives between APE and BPE variants, indicating that tokenization choice has a larger impact on SMILES models.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Pre-training used only 1 million molecules, compared to 77 million for ChemBERTa-2. Despite this, APE models were competitive or superior, but scaling effects remain unexplored.</li>
<li>Evaluation was limited to three binary classification tasks from MoleculeNet. Regression tasks, molecular generation, and reaction prediction were not tested.</li>
<li>The Tox21 result is notable: SMILYBPE outperforms SMILYAPE (0.849 vs. 0.838), suggesting APE&rsquo;s advantage may be task-dependent.</li>
<li>No comparison with recent atom-level tokenizers like <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES</a> or newer approaches beyond SPE.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tokenizer training</td>
          <td>PubChem subset</td>
          <td>2M molecules</td>
          <td>SMILES strings converted to SELFIES via selfies library</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>1M molecules</td>
          <td>100K validation set</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>7,831 compounds</td>
          <td>80/10/10 split, 12 tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Tokenizers: BPE (via Hugging Face), APE (custom implementation, minimum frequency 2000)</li>
<li>Pre-training: Masked Language Modeling (15% masking) for 20 epochs</li>
<li>Optimizer: AdamW with Optuna hyperparameter search</li>
<li>Fine-tuning: 5 epochs with early stopping on validation ROC-AUC</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Architecture: RoBERTa with 6 layers, hidden size 768, intermediate size 1536, 12 attention heads</li>
<li>Four variants: SMILYAPE, SMILYBPE, SELFYAPE, SELFYBPE</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILYAPE</th>
          <th>SMILYBPE</th>
          <th>SELFYAPE</th>
          <th>SELFYBPE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP ROC-AUC</td>
          <td>0.754</td>
          <td>0.746</td>
          <td>0.735</td>
          <td>0.676</td>
      </tr>
      <tr>
          <td>HIV ROC-AUC</td>
          <td>0.772</td>
          <td>0.754</td>
          <td>0.768</td>
          <td>0.709</td>
      </tr>
      <tr>
          <td>Tox21 ROC-AUC</td>
          <td>0.838</td>
          <td>0.849</td>
          <td>0.842</td>
          <td>0.825</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA RTX 3060 GPU with 12 GiB VRAM</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mikemayuare/apetokenizer">APE Tokenizer</a></td>
          <td>Code</td>
          <td>Other (unspecified SPDX)</td>
          <td>Official APE tokenizer implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/mikemayuare/PubChem10M_SMILES_SELFIES">PubChem10M SMILES/SELFIES</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>10M SMILES with SELFIES conversions</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/mikemayuare">Pre-trained and fine-tuned models</a></td>
          <td>Model</td>
          <td>Not specified</td>
          <td>All four model variants on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leon, M., Perezhohin, Y., Peres, F., Popovič, A., &amp; Castelli, M. (2024). Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. <em>Scientific Reports</em>, 14(1), 25016. <a href="https://doi.org/10.1038/s41598-024-76440-8">https://doi.org/10.1038/s41598-024-76440-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leon2024comparing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leon, Miguelangel and Perezhohin, Yuriy and Peres, Fernando and Popovi{\v{c}}, Ale{\v{s}} and Castelli, Mauro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{25016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-024-76440-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI+AIS: Hybridizing SMILES with Environment Tokens</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smi-ais-hybrid-molecular-representation/</guid><description>SMI+AIS hybridizes SMILES with Atom-In-SMILES tokens encoding local chemical environments, improving molecular generation binding affinity and synthesizability.</description><content:encoded><![CDATA[<h2 id="a-hybrid-molecular-representation-combining-smiles-and-chemical-environment-tokens">A Hybrid Molecular Representation Combining SMILES and Chemical-Environment Tokens</h2>
<p>This is a <strong>Method</strong> paper that introduces SMI+AIS(N), a hybrid molecular string representation combining standard <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> tokens with <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-In-SMILES (AIS)</a> tokens. AIS tokens encode local chemical environment information (central atom, ring membership, and neighboring atoms) into a single token. The key contribution is a systematic hybridization strategy that selectively replaces the most frequent SMILES tokens with AIS equivalents, preserving SMILES grammar compatibility while enriching token diversity. The method is validated on molecular structure generation via latent space optimization for drug design.</p>
<h2 id="limitations-of-standard-smiles-for-machine-learning">Limitations of Standard SMILES for Machine Learning</h2>
<p>SMILES is the most widely adopted string-based molecular representation, used in major databases like ZINC and PubChem. Despite this ubiquity, SMILES has several well-known limitations for machine learning applications:</p>
<ol>
<li><strong>Non-unique representations</strong>: The same molecule can be encoded as multiple distinct SMILES strings.</li>
<li><strong>Invalid string generation</strong>: Generative models can produce syntactically invalid SMILES that do not correspond to any molecule.</li>
<li><strong>Limited token diversity</strong>: SMILES tokens map one-to-one to atoms or bonds, so the token vocabulary is restricted to the available atom and bond types.</li>
<li><strong>Insufficient chemical context</strong>: Individual SMILES tokens carry no information about the local chemical environment of an atom.</li>
</ol>
<p>Alternative representations like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (guaranteeing validity) and <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> (guaranteeing uniqueness) address some of these issues but share the same fundamental limitation of low token diversity. The Atom-In-SMILES (AIS) representation (Ucak et al., 2023) enriches tokens with neighboring atom and ring information, but using AIS exclusively produces a large vocabulary with many infrequent tokens that can cause data sparsity problems. The authors aim to find a middle ground: adding chemical context to the most common tokens while keeping the vocabulary manageable.</p>
<h2 id="core-innovation-selective-token-hybridization-with-ais">Core Innovation: Selective Token Hybridization with AIS</h2>
<p>The SMI+AIS(N) representation hybridizes standard SMILES with AIS tokens through a frequency-based selection process:</p>
<h3 id="ais-token-structure">AIS Token Structure</h3>
<p>Each AIS token encodes three pieces of information about an atom, delimited by semicolons:</p>
<p>$$
\lbrack \text{central atom} ; \text{ring info} ; \text{neighbor atoms} \rbrack
$$</p>
<p>For example, the oxygen in a carboxyl group of benzoic acid is represented as <code>[O;!R;C]</code>, meaning: oxygen atom, not in a ring, bonded to carbon. In standard SMILES, this would simply be <code>O</code>.</p>
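<p>Given this three-field layout, an AIS token can be unpacked mechanically (a sketch that assumes exactly the format shown; the real AIS tokenizer handles additional cases, such as explicit ring-membership markers):</p>

```python
def parse_ais_token(token):
    """Split an AIS token like '[O;!R;C]' into its three semicolon fields."""
    central, ring, neighbors = token.strip("[]").split(";")
    return {
        "central_atom": central,
        "in_ring": ring != "!R",  # '!R' marks atoms outside any ring
        "neighbors": neighbors,
    }

# The carboxyl oxygen of benzoic acid from the example above:
print(parse_ais_token("[O;!R;C]"))
# {'central_atom': 'O', 'in_ring': False, 'neighbors': 'C'}
```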
<h3 id="hybridization-procedure">Hybridization Procedure</h3>
<ol>
<li>Convert all SMILES strings in the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> to their full AIS representations.</li>
<li>Count the frequency of each AIS token across the database.</li>
<li>Select the top-N most frequent AIS tokens to form the hybrid vocabulary.</li>
<li>In the hybrid representation, atoms matching these top-N AIS tokens are written in AIS notation; all other atoms use standard SMILES notation.</li>
</ol>
<p>For benzoic acid, the hybridization produces:</p>
<p>$$
\text{SMI}: \texttt{O=C(O)c1ccccc1}
$$</p>
<p>$$
\text{SMI+AIS}: \texttt{\lbrack O;!R;C\rbrack=\lbrack C;!R;COO\rbrack(\lbrack OH;!R;C\rbrack)c1ccccc1}
$$</p>
<p>The parameter N controls vocabulary size. The authors test N = 50, 100, 150, and 200, finding that N = 100-150 provides the best balance for the ZINC database.</p>
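<p>Steps 2&ndash;4 of the procedure reduce to a frequency count, a top-N cut, and a per-atom fallback; a sketch with toy data (token names illustrative only):</p>

```python
from collections import Counter

def top_n_ais_tokens(ais_corpus, n):
    """Select the N most frequent AIS tokens to form the hybrid vocabulary."""
    counts = Counter(tok for mol in ais_corpus for tok in mol)
    return {tok for tok, _ in counts.most_common(n)}

def hybridize(ais_tokens, smiles_tokens, vocab):
    """Keep AIS notation for in-vocabulary atoms; fall back to SMILES."""
    return [a if a in vocab else s for a, s in zip(ais_tokens, smiles_tokens)]

corpus = [["[C;!R;CC]", "[C;!R;CC]", "[O;!R;C]"], ["[C;!R;CC]", "[N;!R;C]"]]
vocab = top_n_ais_tokens(corpus, 1)
print(vocab)  # {'[C;!R;CC]'}
print(hybridize(["[C;!R;CC]", "[O;!R;C]"], ["C", "O"], vocab))
# ['[C;!R;CC]', 'O'] -- frequent atom in AIS form, rare atom stays SMILES
```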
<h3 id="token-frequency-rebalancing">Token Frequency Rebalancing</h3>
<p>A key benefit of hybridization is mitigating the severe token frequency imbalance in standard SMILES. Carbon (C), the most frequent element with ~184 million occurrences in ZINC, is represented by only 16 token types in SMILES. With SMI+AIS(200), carbon is distinguished into 145 token types based on chemical environment, with 74% of carbon occurrences represented by AIS tokens. Less common elements like halogens see minimal change (only 2% AIS representation), which avoids introducing unnecessarily rare tokens.</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Frequency</th>
          <th>SMILES Types</th>
          <th>SMI+AIS(100) Types (AIS %)</th>
          <th>SMI+AIS(200) Types (AIS %)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>C</td>
          <td>183,860,954</td>
          <td>16</td>
          <td>78 (73%)</td>
          <td>145 (74%)</td>
      </tr>
      <tr>
          <td>O</td>
          <td>27,270,229</td>
          <td>8</td>
          <td>16 (11%)</td>
          <td>24 (11%)</td>
      </tr>
      <tr>
          <td>N</td>
          <td>26,022,928</td>
          <td>11</td>
          <td>32 (1%)</td>
          <td>46 (10%)</td>
      </tr>
      <tr>
          <td>X (halogens)</td>
          <td>6,137,030</td>
          <td>7</td>
          <td>10 (2%)</td>
          <td>11 (2%)</td>
      </tr>
      <tr>
          <td>S</td>
          <td>4,581,307</td>
          <td>12</td>
          <td>17 (2%)</td>
          <td>24 (2%)</td>
      </tr>
  </tbody>
</table>
<h2 id="latent-space-optimization-for-molecular-generation">Latent Space Optimization for Molecular Generation</h2>
<h3 id="model-architecture">Model Architecture</h3>
<p>The evaluation uses a <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">conditional variational autoencoder (CVAE)</a> with:</p>
<ul>
<li><strong>Encoder</strong>: BERT-style architecture with entity and positional embeddings, 4 multi-head attention layers (8 heads each), producing mean and standard deviation vectors in latent space.</li>
<li><strong>Decoder</strong>: 4 stacked gated recurrent unit (GRU) layers that transform sampled latent vectors (conditioned) back into token sequences.</li>
<li><strong>Training</strong>: 20 epochs on 9 million compounds from the ZINC database (8:1:1 train/valid/test split) under identical conditions for all representations.</li>
</ul>
<h3 id="optimization-setup">Optimization Setup</h3>
<p><a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a> (BO) via BoTorch is applied to the CVAE <a href="/notes/chemistry/molecular-design/generation/latent-space/">latent space</a>, maximizing a multi-objective function:</p>
<p>$$
\text{Obj} = -\text{BA} - 0.5 \times \text{SA}^2
$$</p>
<p>where BA is binding affinity (docking score from QuickVina 2, lower is stronger) and SA is synthetic accessibility score (from RDKit, lower is more synthesizable). Each BO iteration generates 800 candidate latent vectors. Invalid strings receive a penalty objective value of -100.</p>
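<p>The scalarized objective with its invalid-string penalty is straightforward to express (a sketch; in the paper, BA comes from QuickVina 2 docking and SA from the RDKit SA score):</p>

```python
def objective(ba, sa, valid=True):
    """Multi-objective score: reward strong (more negative) binding affinity,
    quadratically penalize poor synthetic accessibility, and heavily punish
    invalid generated strings."""
    if not valid:
        return -100.0
    return -ba - 0.5 * sa ** 2

# A generated structure with BA = -9.5 and SA = 2.1 (typical of the PDK4 runs):
print(objective(-9.5, 2.1))               # 9.5 - 0.5 * 4.41 = 7.295
print(objective(0.0, 0.0, valid=False))   # -100.0
```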
<h3 id="protein-targets">Protein Targets</h3>
<p>Four diverse targets were used to assess generalizability:</p>
<ul>
<li><strong>PDK4</strong> (<a href="https://en.wikipedia.org/wiki/Pyruvate_dehydrogenase_kinase">Pyruvate Dehydrogenase Kinase</a> 4): narrow, deep binding pocket</li>
<li><strong>5-HT1B</strong> (<a href="https://en.wikipedia.org/wiki/5-HT1B_receptor">Serotonin Receptor 1B</a>): shallow, open <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> conformation</li>
<li><strong>PARP1</strong> (<a href="https://en.wikipedia.org/wiki/PARP1">Poly ADP-ribose Polymerase 1</a>): small, flexible molecule binding site</li>
<li><strong>CK1d</strong> (<a href="https://en.wikipedia.org/wiki/Casein_kinase_1">Casein Kinase I</a> Delta): broad, accessible conformation</li>
</ul>
<p>Protein structures were obtained from the <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a> (PDB IDs: 4V26, 4IAQ, 6I8M, 4TN6). Each optimization was run 10 times independently from the same 5 initial compounds selected from BindingDB.</p>
<h3 id="key-results">Key Results</h3>
<p>SMI+AIS(100) consistently achieved the highest objective values across protein targets.</p>
<p><strong>PDK4 Optimization</strong> (Top-1 results over 10 independent runs):</p>
<ul>
<li>SMI+AIS(100) achieved approximately 12% improvement over standard SMILES and 28% improvement over SELFIES based on median Top-1 objective values.</li>
<li>Generated structures exhibited BA scores between -10 and -9 and SA scores between 2.0 and 2.3.</li>
<li>Molecular weights clustered around 400 amu, consistent with the CVAE conditioning.</li>
</ul>
<p><strong>Validity Ratios</strong>: Standard SMILES produced approximately 40% valid structures. SMI+AIS representations showed significant improvement as N increased, though SMI+AIS(200) showed slight saturation, likely from insufficiently trained infrequent tokens.</p>
<p><strong>SELFIES</strong>: Despite achieving the highest validity ratio, SELFIES failed to generate chemically meaningful structures with desirable BA and SA scores. The authors attribute this to the SELFIES grammar, in which token meaning is highly context-dependent, so that minor variations in latent space produce large structural changes.</p>
<p><strong>Cross-target consistency</strong>: Improvements were observed across all four protein targets, with slight variation (5-HT1B showed smaller differences between SMI and SMI+AIS(100) for Top-1, while other targets showed significant improvements).</p>
<h2 id="improved-molecular-generation-through-chemical-context-enrichment">Improved Molecular Generation Through Chemical Context Enrichment</h2>
<p>The SMI+AIS(N) representation achieves consistent improvements in molecular generation quality compared to both standard SMILES and SELFIES. The core findings are:</p>
<ol>
<li><strong>Binding affinity improvement</strong>: Approximately 7% improvement over standard SMILES for the PDK4 target.</li>
<li><strong>Synthesizability improvement</strong>: Approximately 6% increase in synthetic accessibility scores.</li>
<li><strong>Target independence</strong>: Performance gains transfer across four structurally diverse protein targets.</li>
<li><strong>Preserved structural motifs</strong>: The generative model retains chemically meaningful fragments (e.g., acetamide and <a href="https://en.wikipedia.org/wiki/Piperidine">piperidine</a>) from initial compounds without explicit fragment constraints.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Stereochemistry</strong>: SMI+AIS inherits the limited stereochemistry handling of standard SMILES.</li>
<li><strong>Evaluation scope</strong>: Only molecular generation was tested; property prediction and other ML tasks remain unexplored.</li>
<li><strong>Compute constraints</strong>: The study was limited to molecular generation due to computing power and time.</li>
<li><strong>Single optimization strategy</strong>: Only latent space optimization with Bayesian optimization was evaluated; other generative approaches were not compared.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest extending SMI+AIS to diverse benchmarking tests including molecular property prediction, experimental validation, and broader applications of chemical language models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Vocab</td>
          <td>ZINC Database</td>
          <td>9M compounds</td>
          <td>Canonicalized, deduplicated, split 8:1:1</td>
      </tr>
      <tr>
          <td>Binding targets</td>
          <td>BindingDB</td>
          <td>5 initial compounds per target</td>
          <td>Selected for each protein target</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td>PDB</td>
          <td>4 structures</td>
          <td>IDs: 4V26, 4IAQ, 6I8M, 4TN6</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: AIS token frequency counting on full ZINC database, top-N selection</li>
<li><strong>Generative model</strong>: Conditional VAE with BERT encoder (4 layers, 8 heads) and GRU decoder (4 layers)</li>
<li><strong>Optimization</strong>: Bayesian Optimization via BoTorch (800 candidates per iteration)</li>
<li><strong>Docking</strong>: QuickVina 2 with 25 Å pocket size, 10 docking simulations per ligand</li>
<li><strong>SA scoring</strong>: RDKit SA score</li>
<li><strong>Training</strong>: 20 epochs for all representations under identical conditions</li>
</ul>
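<p>The top-N token selection step is simple enough to sketch with the standard library (an illustrative helper, not the authors&rsquo; code; <code>token_lists</code> stands in for the AIS-tokenized corpus):</p>

```python
from collections import Counter

def build_vocab(token_lists, top_n):
    """Frequency-based vocabulary selection: count tokens across the
    corpus and keep the top-N most frequent (sketch of top-N AIS
    token selection)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [tok for tok, _ in counts.most_common(top_n)]
```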
<h3 id="models">Models</h3>
<ul>
<li>CVAE architecture details in supplementary (Fig. S9, Tables S2, S4)</li>
<li>No pre-trained weights released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMI+AIS(100) vs SMILES</th>
          <th>SMI+AIS(100) vs SELFIES</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Median Top-1 Obj. Value</td>
          <td>+12%</td>
          <td>+28%</td>
          <td>PDK4 target</td>
      </tr>
      <tr>
          <td>Validity Ratio</td>
          <td>Higher than ~40% (SMILES)</td>
          <td>Lower than SELFIES</td>
          <td>SMI+AIS improves with N</td>
      </tr>
      <tr>
          <td>BA (binding affinity)</td>
          <td>~7% improvement</td>
          <td>Substantial</td>
          <td>Lower (more negative) is better</td>
      </tr>
      <tr>
          <td>SA (synthesizability)</td>
          <td>~6% improvement</td>
          <td>Substantial</td>
          <td>Lower is more synthesizable</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the main text. Optimization wall times are reported in supplementary Table S5.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/herim-han/AIS-Drug-Opt">AIS-Drug-Opt</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Source code and datasets for reproduction</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. Code and processed data are publicly available on GitHub, but no pre-trained model weights are released, the license is unspecified, and hardware requirements are not documented in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Han, H., Yeom, M. S., &amp; Choi, S. (2025). Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation. <em>Scientific Reports</em>, 15, 16892. <a href="https://doi.org/10.1038/s41598-025-01890-7">https://doi.org/10.1038/s41598-025-01890-7</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{han2025hybridization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Hybridization of SMILES and chemical-environment-aware tokens to improve performance of molecular structure generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Han, Herim and Yeom, Min Sun and Choi, Sunghwan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{16892}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-025-01890-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Randomized SMILES Improve Molecular Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/</guid><description>Randomized SMILES improve RNN molecular generative models by increasing chemical space coverage, uniformity, and completeness versus canonical SMILES.</description><content:encoded><![CDATA[<h2 id="data-augmentation-through-smiles-randomization">Data Augmentation Through SMILES Randomization</h2>
<p>This is an <strong>Empirical</strong> paper that performs an extensive benchmark of RNN-based molecular generative models trained with different SMILES string variants. The primary contribution is demonstrating that randomized SMILES (non-unique molecular string representations obtained by randomizing atom orderings) substantially improve the quality of the generated chemical space compared to canonical SMILES, without requiring any changes to the model architecture.</p>
<p>The paper evaluates three properties of generated chemical spaces: uniformity (equal probability of sampling each molecule), completeness (coverage of the target space), and closedness (generating only molecules within the target space). These are measured using a new composite metric called UC-JSD.</p>
<h2 id="canonical-smiles-bias-in-generative-models">Canonical SMILES Bias in Generative Models</h2>
<p>Recurrent Neural Networks trained on SMILES strings have shown the capacity to create large chemical spaces of valid molecules. However, when trained with canonical SMILES (the unique string representation produced by a canonicalization algorithm), these models exhibit biases. Specifically, prior work by the same group showed that models trained on one million <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> molecules could only recover 68% of GDB-13 when sampled two billion times, compared to the theoretical maximum of 87% from an ideal uniform sampler.</p>
<p>The canonical SMILES representation introduces two problems. First, the canonicalization algorithm constrains how the molecular graph is traversed (e.g., prioritizing sidechains over ring atoms), forcing the model to learn both valid SMILES syntax and the specific canonical ordering rules. Second, structurally similar molecules can have substantially different canonical SMILES, making some molecules harder to sample than others. Molecules with more ring systems and complex topologies are particularly underrepresented.</p>
<p>The authors also note that DeepSMILES, a recently proposed alternative syntax, had not been benchmarked against randomized SMILES, and that the data augmentation capabilities of randomized SMILES at different training set sizes were unexplored.</p>
<h2 id="randomized-smiles-as-non-canonical-representations">Randomized SMILES as Non-Canonical Representations</h2>
<p>The core insight is that by randomizing the atom ordering before SMILES generation, each molecule can be represented by multiple different but equally valid SMILES strings. This effectively provides data augmentation: a molecule with $n$ heavy atoms can theoretically yield up to $n$ different SMILES strings (though the actual number is typically lower due to molecular symmetry).</p>
<p>Two randomized SMILES variants are explored:</p>
<ul>
<li><strong>Restricted randomized SMILES</strong>: Atom ordering is randomized, but RDKit&rsquo;s built-in fixes are applied. These fixes prevent overly complicated traversals, such as prioritizing sidechains before completing ring atoms.</li>
<li><strong>Unrestricted randomized SMILES</strong>: Atom ordering is randomized without any RDKit restrictions, producing a superset of the restricted variant that includes more convoluted SMILES strings.</li>
</ul>
<p>For each training epoch, a new set of randomized SMILES is generated for the same molecules, so a model trained for 300 epochs on one million molecules sees approximately 300 million different SMILES strings (with some overlap due to sampling).</p>
<p>The model architecture is a standard RNN with an embedding layer, $l$ layers of LSTM or GRU cells of size $w$, optional dropout, and a linear output layer with softmax. The training objective minimizes the average negative log-likelihood (NLL):</p>
<p>$$
J(T) = -\ln P(X_{0} = x_{0}) - \sum_{t=1}^{T} \ln P(X_{t} = x_{t} \mid X_{t-1} = x_{t-1}, \dots, X_{0} = x_{0})
$$</p>
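<p>In code, the per-sequence NLL is just a sum of token log-probabilities under teacher forcing. A minimal sketch, where <code>step_probs</code> is a hypothetical stand-in for the RNN&rsquo;s softmax outputs at each step:</p>

```python
import math

def sequence_nll(step_probs, tokens):
    """Negative log-likelihood of one token sequence:
    -sum_t ln P(x_t | x_0 .. x_{t-1}).
    step_probs[t] maps each token to its model probability at step t."""
    return -sum(math.log(p[tok]) for p, tok in zip(step_probs, tokens))
```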
<p>The key metric is the Uniformity-Completeness JSD (UC-JSD), which extends the Jensen-Shannon Divergence to measure how uniform, complete, and closed the generated chemical space is:</p>
<p>$$
JSD = H\left(\sum_{d_{i} \in D} \alpha_{i} \cdot d_{i}\right) - \sum_{d_{i} \in D} \alpha_{i} H(d_{i})
$$</p>
<p>where $H(d)$ is the Shannon entropy of a probability distribution. The UC-JSD is computed over the NLL vectors of the validation, training, and sampled sets. The composite UCC score is defined as:</p>
<p>$$
UCC = \text{completeness} \times \text{uniformity} \times \text{closedness}
$$</p>
<p>where completeness measures coverage of GDB-13, uniformity measures how equal the sampling probabilities are, and closedness measures how few invalid (out-of-target-space) molecules are generated.</p>
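<p>The JSD formula above translates directly to code. In the UC-JSD the distributions are the normalized NLL vectors of the sampled, training, and validation sets, each weighted $\alpha_i = 1/3$; the sketch below assumes equal weights and pre-normalized discrete distributions:</p>

```python
import math

def entropy(dist):
    """Shannon entropy H(d) in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def jsd(dists):
    """Jensen-Shannon divergence of equally weighted discrete
    distributions: H(sum_i a_i * d_i) - sum_i a_i * H(d_i),
    with a_i = 1 / len(dists)."""
    a = 1.0 / len(dists)
    mixture = [sum(a * d[j] for d in dists) for j in range(len(dists[0]))]
    return entropy(mixture) - sum(a * entropy(d) for d in dists)
```

<p>Identical distributions give a JSD of 0 (perfectly uniform, complete sampling); fully disjoint distributions give the maximum of $\ln 2$ for two distributions.</p>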
<h2 id="benchmark-design-across-smiles-variants-training-sizes-and-architectures">Benchmark Design Across SMILES Variants, Training Sizes, and Architectures</h2>
<p>The benchmark covers a systematic grid of experimental conditions:</p>
<p><strong>SMILES variants</strong>: Canonical, restricted randomized, unrestricted randomized, and three DeepSMILES variants (branch syntax, ring syntax, both).</p>
<p><strong>Training set sizes from GDB-13</strong>: 1,000,000, 10,000, and 1,000 molecules with corresponding validation sets.</p>
<p><strong>Architecture choices</strong>: LSTM vs. GRU cells, with hyperparameter grids over number of layers ($l$), hidden size ($w$), dropout rate ($d$), and batch size ($b$).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers ($l$)</th>
          <th>Hidden ($w$)</th>
          <th>Dropout ($d$)</th>
          <th>Batch ($b$)</th>
          <th>Cell</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GDB-13 1M</td>
          <td>3</td>
          <td>512</td>
          <td>0, 25, 50</td>
          <td>64, 128, 256, 512</td>
          <td>GRU, LSTM</td>
      </tr>
      <tr>
          <td>GDB-13 10K</td>
          <td>2, 3, 4</td>
          <td>256, 384, 512</td>
          <td>0, 25, 50</td>
          <td>8, 16, 32</td>
          <td>LSTM</td>
      </tr>
      <tr>
          <td>GDB-13 1K</td>
          <td>2, 3, 4</td>
          <td>128, 192, 256</td>
          <td>0, 25, 50</td>
          <td>4, 8, 16</td>
          <td>LSTM</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>3</td>
          <td>512</td>
          <td>0, 25, 50</td>
          <td>64, 128, 256, 512</td>
          <td>LSTM</td>
      </tr>
  </tbody>
</table>
<p>Each model&rsquo;s best epoch was selected using a smoothed UC-JSD curve, and the best epoch was then sampled with replacement $k = 2 \times 10^{9}$ times for GDB-13 benchmarks.</p>
<p>For ChEMBL experiments, models were trained on 1,483,943 molecules with a validation set of 78,102 molecules. Evaluation used validity, unique molecule count, and Fréchet ChemNet Distance (FCD).</p>
<h2 id="randomized-smiles-produce-more-complete-and-uniform-chemical-spaces">Randomized SMILES Produce More Complete and Uniform Chemical Spaces</h2>
<h3 id="gdb-13-results-1m-training-set">GDB-13 results (1M training set)</h3>
<p>The restricted randomized SMILES model recovered 83.0% of GDB-13, compared to 72.8% for canonical SMILES and 68.4-72.1% for DeepSMILES variants. All three quality metrics improved substantially:</p>
<table>
  <thead>
      <tr>
          <th>SMILES Variant</th>
          <th>% GDB-13</th>
          <th>Uniformity</th>
          <th>Completeness</th>
          <th>Closedness</th>
          <th>UCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>72.8</td>
          <td>0.879</td>
          <td>0.836</td>
          <td>0.861</td>
          <td>0.633</td>
      </tr>
      <tr>
          <td>Rand. restricted</td>
          <td>83.0</td>
          <td>0.977</td>
          <td>0.953</td>
          <td>0.925</td>
          <td>0.860</td>
      </tr>
      <tr>
          <td>Rand. unrestricted</td>
          <td>80.9</td>
          <td>0.970</td>
          <td>0.929</td>
          <td>0.876</td>
          <td>0.790</td>
      </tr>
      <tr>
          <td>DeepSMILES (both)</td>
          <td>68.4</td>
          <td>0.851</td>
          <td>0.785</td>
          <td>0.796</td>
          <td>0.532</td>
      </tr>
  </tbody>
</table>
<p>The NLL distribution of GDB-13 molecules under the randomized SMILES model was centered near $NLL_{GDB13} = -\ln(1/|GDB13|) = 20.6$ with a narrow spread, indicating near-uniform sampling probability. The canonical model showed a much wider NLL distribution, meaning some molecules were orders of magnitude harder to sample.</p>
<p>Randomized SMILES without data augmentation (same SMILES each epoch) still outperformed canonical SMILES (UCC 0.712 vs. 0.633 for restricted), confirming that the non-canonical representation itself is beneficial beyond the augmentation effect.</p>
<h3 id="smaller-training-sets-amplify-the-advantage">Smaller training sets amplify the advantage</h3>
<p>With only 10,000 training molecules (0.001% of GDB-13), the randomized model generated 62.3% of GDB-13 vs. 38.8% for canonical. With 1,000 training molecules, the gap widened further: 34.1% vs. 14.5%. Validity also improved dramatically (81.2% vs. 50.4% for the 1K setting), suggesting randomized SMILES helps the model learn valid SMILES syntax more effectively from limited data.</p>
<h3 id="chembl-results">ChEMBL results</h3>
<p>On the drug-like ChEMBL dataset, the randomized SMILES model generated nearly double the number of unique molecules compared to canonical (64.09% vs. 34.67% unique in a 2B sample), with comparable validity (98.33% vs. 98.26%). The canonical model showed a lower FCD (0.0712 vs. 0.1265), but the authors argue this reflects overfitting: the canonical model&rsquo;s NLL distributions for training and validation sets overlapped tightly, while the randomized model showed more uniform coverage. Physicochemical property distributions (molecular weight, logP, SA score, QED, NP score, internal diversity) were nearly identical across both models.</p>
<h3 id="architecture-findings">Architecture findings</h3>
<p>LSTM cells consistently outperformed GRU cells across all SMILES variants. Despite GRU&rsquo;s faster per-epoch training time, LSTM models converged in fewer epochs, making them faster overall. Dropout improved canonical SMILES models but was less beneficial (or detrimental) for randomized SMILES, suggesting that randomized SMILES themselves serve as a regularization mechanism. Larger batch sizes generally improved performance across all variants.</p>
<h3 id="uc-jsd-as-a-model-selection-metric">UC-JSD as a model selection metric</h3>
<p>The UC-JSD showed strong correlation with UCC ($R^{2} = 0.931$ for canonical, $R^{2} = 0.856$ for restricted randomized, $R^{2} = 0.885$ for unrestricted randomized), validating its use as a model selection criterion without requiring expensive sampling of every model.</p>
<p>The authors interpret randomized SMILES models as occupying a hybrid space between grammar-based and action-based generative models. The vocabulary serves as a fixed action space where atom tokens are &ldquo;add atom&rdquo; actions, bond tokens are &ldquo;add bond&rdquo; actions, and ring/branching tokens enable graph traversal. Canonical SMILES constrain this action space to a single deterministic path, while randomized SMILES allow the model to explore multiple valid traversals. This perspective also explains why DeepSMILES performed worse: its altered syntax creates a more complex action space without compensating benefits.</p>
<p>The authors encourage the use of randomized SMILES across different model architectures and tasks, including classification and property prediction, and suggest that finding optimal restricted variants of randomized SMILES is a promising research direction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>GDB-13 subsets</td>
          <td>1M / 10K / 1K molecules</td>
          <td>Randomly sampled from 975M GDB-13</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>1,483,943 training / 78,102 validation</td>
          <td>Filtered subset of ChEMBL database</td>
      </tr>
  </tbody>
</table>
<p>GDB-13 is available from the <a href="http://gdb.unibe.ch/downloads">Reymond group website</a>. ChEMBL is publicly available.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level tokenization with special handling for multi-character tokens (Cl, Br, bracketed atoms, %-prefixed ring numbers)</li>
<li>Teacher forcing during training with NLL loss</li>
<li>Gradient norm clipping to 1.0</li>
<li>Weight initialization from $\mathcal{U}(-\sqrt{1/w}, \sqrt{1/w})$</li>
<li>Adaptive learning rate decay based on UC-JSD</li>
<li>Best epoch selection via smoothed UC-JSD (window size 4)</li>
</ul>
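<p>The multi-character tokenization rule can be captured with a single regular expression (an illustrative pattern, not the authors&rsquo; exact implementation; bracketed atoms, two-letter halogens, and %-prefixed ring closures must be tried before the single-character fallback):</p>

```python
import re

# Order matters: bracketed atoms, Cl/Br, and %-prefixed two-digit
# ring-closure numbers are matched before the single-character fallback.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into model tokens, keeping multi-character
    tokens (Cl, Br, [..], %nn) as single units."""
    return SMILES_TOKEN.findall(smiles)
```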
<h3 id="models">Models</h3>
<p>Standard RNN architecture: embedding layer, stacked LSTM/GRU layers with optional dropout, linear output with softmax. Best models used 3 layers of 512-dimensional LSTM cells. Vocabulary sizes: 26 (GDB-13), 31 (ChEMBL).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Randomized</th>
          <th>Best Canonical</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>% GDB-13 (1M)</td>
          <td>83.0%</td>
          <td>72.8%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>UCC (1M)</td>
          <td>0.860</td>
          <td>0.633</td>
          <td>Composite score</td>
      </tr>
      <tr>
          <td>% GDB-13 (10K)</td>
          <td>62.3%</td>
          <td>38.8%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>% GDB-13 (1K)</td>
          <td>34.1%</td>
          <td>14.5%</td>
          <td>2B sample with replacement</td>
      </tr>
      <tr>
          <td>% Unique ChEMBL</td>
          <td>64.09%</td>
          <td>34.67%</td>
          <td>2B sample with replacement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Nvidia Tesla V100 (Volta) 16 GB VRAM with CUDA 9.1, driver 390.30. Training times ranged from 1 minute (1K canonical) to 131 hours (ChEMBL canonical). Randomized SMILES models required longer per-epoch training due to augmentation overhead but converged to better solutions.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/undeadpixel/reinvent-randomized">reinvent-randomized</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and benchmarking code</td>
      </tr>
      <tr>
          <td><a href="http://gdb.unibe.ch/downloads">GDB-13</a></td>
          <td>Dataset</td>
          <td>Academic use</td>
          <td>975 million fragment-like molecules</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molecularsets/moses">MOSES benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Used for FCD and property calculations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Arús-Pous, J., Johansson, S. V., Prykhodko, O., Bjerrum, E. J., Tyrchan, C., Reymond, J.-L., Chen, H., &amp; Engkvist, O. (2019). Randomized SMILES strings improve the quality of molecular generative models. <em>Journal of Cheminformatics</em>, 11(1), 71. <a href="https://doi.org/10.1186/s13321-019-0393-0">https://doi.org/10.1186/s13321-019-0393-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{aruspous2019randomized,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Randomized SMILES strings improve the quality of molecular generative models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ar{\&#39;u}s-Pous, Josep and Johansson, Simon Viet and Prykhodko, Oleksii and Bjerrum, Esben Jannik and Tyrchan, Christian and Reymond, Jean-Louis and Chen, Hongming and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{71}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0393-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Group SELFIES: Fragment-Based Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/</guid><description>Group SELFIES extends SELFIES with fragment-based group tokens for chemically robust molecular string representations that improve distribution learning.</description><content:encoded><![CDATA[<h2 id="a-fragment-aware-extension-of-selfies">A Fragment-Aware Extension of SELFIES</h2>
<p>This is a <strong>Method</strong> paper that introduces Group SELFIES, a molecular string representation extending SELFIES by incorporating group tokens that represent functional groups or entire substructures. The primary contribution is a representation that maintains the 100% chemical validity guarantee of SELFIES while enabling fragment-level molecular encoding. Group SELFIES is shorter, more human-readable, and produces better distribution learning compared to both SMILES and standard SELFIES.</p>
<h2 id="from-atoms-to-fragments-in-molecular-strings">From Atoms to Fragments in Molecular Strings</h2>
<p>Molecular string representations underpin nearly all string-based molecular generation, from chemical language models and VAEs to genetic algorithms. SMILES, the dominant representation, suffers from validity issues: generated strings frequently contain syntax errors or violate valency constraints. SELFIES solved this by guaranteeing that every string decodes to a valid molecule, but both SMILES and SELFIES operate at the atomic level. Human chemists, by contrast, think about molecules in terms of functional groups and substructures.</p>
<p>Fragment-based generative models exploit this inductive bias by constructing custom representations amenable to fragment-based molecular design. However, these approaches are typically graph-based, losing the desirable properties of string representations: easy manipulation and direct input into established language models. Historical string representations like Wiswesser Line Notation (WLN), Hayward Notation, and SYBYL Line Notation (SLN) did use non-atomic tokens, but none provided chemical robustness guarantees.</p>
<p>The gap is clear: no existing string representation combines the chemical robustness of SELFIES with the fragment-level abstraction that captures meaningful chemical motifs.</p>
<h2 id="group-tokens-with-chemical-robustness-guarantees">Group Tokens with Chemical Robustness Guarantees</h2>
<p>The core innovation is the introduction of <strong>group tokens</strong> into the SELFIES framework. Each group token represents a predefined molecular fragment (such as a benzene ring, carboxyl group, or any user-specified substructure) and is treated as a single unit during encoding and decoding.</p>
<h3 id="group-definition">Group Definition</h3>
<p>Each group is defined as a set of atoms and bonds with labeled <strong>attachment points</strong> that specify how the group participates in bonding. Each attachment point has a specified maximum valency, allowing the decoder to continue tracking available valency during string construction. Group tokens take the form <code>[:S&lt;group-name&gt;]</code>, where <code>S</code> is the starting attachment index.</p>
<h3 id="encoding">Encoding</h3>
<p>To encode a molecule, the encoder first recognizes and replaces substructure matches from the group set. By default, the encoder processes larger groups first, but users can override this with priority values. The encoder then traverses the molecular graph similarly to standard SELFIES encoding, inserting tokens that track attachment indices for entering and exiting groups.</p>
<h3 id="decoding">Decoding</h3>
<p>When the decoder encounters a group token, it looks up the corresponding group in the group set dictionary, places all atoms of the group, and connects the main chain to the starting attachment point. Navigation between attachment points is handled by reading subsequent tokens as relative indices. If an attachment point is occupied, the next available one is used. If all attachment points are exhausted, the group is immediately popped from the stack.</p>
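<p>The attachment-point fallback described above can be sketched as follows (a simplified illustration; the forward, wrapping scan order is an assumption, not taken from the paper):</p>

```python
def next_attachment(occupied, start, n_points):
    """Pick the attachment index for an incoming bond: try the requested
    start index, then scan forward (wrapping) for the next free point.
    Returning None means every point is used, so the group is popped
    from the decoder's stack."""
    for offset in range(n_points):
        idx = (start + offset) % n_points
        if idx not in occupied:
            return idx
    return None
```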
<h3 id="chemical-robustness">Chemical Robustness</h3>
<p>The key property preserved from SELFIES is that <strong>any arbitrary Group SELFIES string decodes to a molecule with valid valency</strong>. This is achieved by maintaining the same two SELFIES decoder features within the group framework:</p>
<ol>
<li>Token overloading: every token can be interpreted as a number when needed (for branch lengths, ring targets, or attachment indices).</li>
<li>Valency tracking: if adding a bond would exceed available valency, the decoder adjusts the bond order or skips the bond.</li>
</ol>
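<p>Valency tracking (point 2) amounts to clamping each requested bond to the valence both endpoints can still accept. A minimal sketch of that repair rule, under the assumption that free valences are tracked per atom:</p>

```python
def repair_bond(requested_order, free_valence_a, free_valence_b):
    """SELFIES-style valency repair: lower the requested bond order to
    what both atoms can still accept; return None (skip the bond
    entirely) if either atom is already saturated."""
    order = min(requested_order, free_valence_a, free_valence_b)
    return order if order >= 1 else None
```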
<p>The authors verified robustness by encoding and decoding 25 million molecules from the eMolecules database.</p>
<h3 id="chirality-handling">Chirality Handling</h3>
<p>Group SELFIES handles chirality differently from SMILES and SELFIES. Rather than using <code>@</code>-notation for tetrahedral chirality, all chiral centers must be specified as groups. An &ldquo;essential set&rdquo; of 23 groups covers all relevant chiral centers in the eMolecules database. This approach also supports extended chirality (axial, helical, planar) by abstracting the entire chiral substructure into a group token.</p>
<h3 id="fragment-selection">Fragment Selection</h3>
<p>The group set is a user-defined dictionary that maps group names to molecular fragments. Users can specify groups manually using SMILES-like syntax, extract them from fragment libraries, or use fragmentation algorithms such as matched molecular pair analysis. The authors tested several approaches, including a naive method that cleaves side chains from rings and methods based on cheminformatics fragmentation tools. A useful group set typically contains fragments that appear in many molecules and replace many atoms, with similar fragments merged to reduce redundancy.</p>
<h2 id="experiments-on-compactness-generation-and-distribution-learning">Experiments on Compactness, Generation, and Distribution Learning</h2>
<h3 id="compactness-section-41">Compactness (Section 4.1)</h3>
<p>Using 53 groups (30 extracted from ZINC-250k plus 23 from the essential set), Group SELFIES strings are shorter than their SMILES and SELFIES equivalents. Despite Group SELFIES having a larger alphabet, the compressed file size of the ZINC-250k dataset is smallest for Group SELFIES, indicating lower information-theoretic complexity.</p>
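<p>The compressed-file-size comparison is easy to reproduce in spirit with stdlib DEFLATE (a rough proxy for information-theoretic complexity, not necessarily the authors&rsquo; exact procedure):</p>

```python
import zlib

def compressed_size(strings, level=9):
    """DEFLATE-compressed byte size of a newline-joined dataset of
    molecular strings; smaller means lower redundancy-adjusted size."""
    return len(zlib.compress("\n".join(strings).encode("utf-8"), level))
```

<p>Comparing <code>compressed_size</code> on the same molecules encoded as SMILES, SELFIES, and Group SELFIES reproduces the kind of ranking reported for ZINC-250k.</p>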
<h3 id="random-molecular-generation-section-42">Random Molecular Generation (Section 4.2)</h3>
<p>To isolate the effect of the representation from the generative model, the authors use a primitive generative model: sample a random string length from the dataset, draw tokens uniformly from a bag of all tokens, and concatenate. From 100,000 ZINC-250k molecules:</p>
<ul>
<li>Randomly sampled Group SELFIES strings produce molecules whose SAScore and QED distributions more closely overlap with the original ZINC dataset than molecules from randomly sampled SELFIES strings.</li>
<li>The Wasserstein distances to the ZINC distribution are consistently lower for Group SELFIES.</li>
<li>On a nonfullerene acceptor (NFA) dataset, Group SELFIES preserves aromatic rings while SELFIES rarely does.</li>
</ul>
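<p>The primitive generative model described above is simple enough to state in full (a sketch; <code>dataset_tokens</code> is a hypothetical tokenized dataset from which the length distribution and token bag are built):</p>

```python
import random

def sample_strings(dataset_tokens, n, seed=0):
    """Primitive generator: draw a string length from the dataset's
    empirical length distribution, fill it with tokens drawn uniformly
    from the bag of all tokens, and concatenate."""
    rng = random.Random(seed)
    bag = [tok for toks in dataset_tokens for tok in toks]
    lengths = [len(toks) for toks in dataset_tokens]
    return ["".join(rng.choice(bag) for _ in range(rng.choice(lengths)))
            for _ in range(n)]
```

<p>Because decoding the resulting strings is the only model-free step, any difference in the property distributions of the decoded molecules is attributable to the representation itself.</p>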
<h3 id="distribution-learning-with-vaes-section-43">Distribution Learning with VAEs (Section 4.3)</h3>
<p>Using the MOSES benchmarking framework, VAEs were trained for 125 epochs on both Group SELFIES and SELFIES representations. The Group SELFIES VAE used 300 groups extracted from the MOSES training set. Results from 100,000 generated molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Group-VAE-125</th>
          <th>SELFIES-VAE-125</th>
          <th>Train (Reference)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>1.0 (0)</td>
          <td>1.0 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@1k</td>
          <td>1.0 (0)</td>
          <td>0.9996 (5)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Unique@10k</td>
          <td>0.9985 (4)</td>
          <td>0.9986 (4)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>FCD (Test)</td>
          <td>0.1787 (29)</td>
          <td>0.6351 (43)</td>
          <td>0.008</td>
      </tr>
      <tr>
          <td>FCD (TestSF)</td>
          <td>0.734 (109)</td>
          <td>1.3136 (128)</td>
          <td>0.4755</td>
      </tr>
      <tr>
          <td>SNN (Test)</td>
          <td>0.6051 (4)</td>
          <td>0.6014 (3)</td>
          <td>0.6419</td>
      </tr>
      <tr>
          <td>Frag (Test)</td>
          <td>0.9995 (0)</td>
          <td>0.9989 (0)</td>
          <td>1.0</td>
      </tr>
      <tr>
          <td>Scaf (Test)</td>
          <td>0.9649 (21)</td>
          <td>0.9588 (15)</td>
          <td>0.9907</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>0.8587 (1)</td>
          <td>0.8579 (1)</td>
          <td>0.8567</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.9623 (7)</td>
          <td>0.96 (4)</td>
          <td>1.0</td>
      </tr>
  </tbody>
</table>
<p>The most notable improvement is in Fréchet ChemNet Distance (FCD): Group SELFIES achieves 0.1787 versus 0.6351 for SELFIES on the test set. FCD measures the difference between penultimate-layer activations of ChemNet, which encode a mixture of biological and chemical properties relevant to drug-likeness. The remaining metrics are comparable, with Group SELFIES matching or slightly outperforming SELFIES.</p>
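<p>For intuition, FCD is the Fréchet distance between two Gaussians fit to ChemNet activations of generated and reference molecules. The sketch below shows the one-dimensional reduction of that formula; the real metric uses multivariate means and covariances of learned activations, not this toy version:</p>

```python
import statistics

def frechet_distance_1d(xs, ys):
    """Frechet distance between 1-D Gaussians fit to two samples.
    In general: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*(C1*C2)^(1/2));
    for scalar activations this reduces to (m1-m2)^2 + (s1-s2)^2."""
    m1, m2 = statistics.fmean(xs), statistics.fmean(ys)
    s1, s2 = statistics.pstdev(xs), statistics.pstdev(ys)
    return (m1 - m2) ** 2 + (s1 - s2) ** 2
```

<p>Lower values mean the generated distribution more closely matches the reference, which is why the drop from 0.6351 to 0.1787 is the headline result here.</p>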
<h2 id="advantages-limitations-and-future-directions">Advantages, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p>Group SELFIES provides three main advantages over standard SELFIES:</p>
<ol>
<li><strong>Substructure control</strong>: Important scaffolds, chiral centers, and charged groups can be preserved during molecular optimization.</li>
<li><strong>Compactness</strong>: Group tokens represent multiple atoms, yielding shorter strings with lower information-theoretic complexity.</li>
<li><strong>Improved distribution learning</strong>: The FCD metric shows substantial improvement, indicating generated molecules better capture biological and chemical properties of the training set.</li>
</ol>
<p>Both SELFIES and Group SELFIES achieve 100% validity, eliminating the validity issues associated with SMILES-based generation.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational speed</strong>: Encoding and decoding is slower than SELFIES due to RDKit overhead, particularly for the encoder which performs substructure matching for every group in the set.</li>
<li><strong>No group overlap</strong>: Groups cannot overlap in the current formulation, which limits expressiveness for polycyclic compounds.</li>
<li><strong>Group set design</strong>: Choosing an effective group set remains an open design choice that may require domain expertise or fragmentation algorithm tuning.</li>
<li><strong>Limited generative model evaluation</strong>: The paper focuses on random sampling and VAEs; evaluation with more sophisticated models (GANs, reinforcement learning, genetic algorithms) is left to future work.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose several extensions: flexible scaffold tokens that preserve topology while allowing atom-type variation, representations based on cellular complexes or hypergraphs to handle overlapping groups, and integration with genetic algorithms like JANUS for molecular optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compactness / Generation</td>
          <td>ZINC-250k</td>
          <td>250,000 molecules</td>
          <td>Random subset of 10,000 for fragment extraction; 100,000 for generation</td>
      </tr>
      <tr>
          <td>Distribution Learning</td>
          <td>MOSES benchmark</td>
          <td>~1.9M molecules</td>
          <td>Standard train/test split from MOSES framework</td>
      </tr>
      <tr>
          <td>Robustness Verification</td>
          <td>eMolecules</td>
          <td>25M molecules</td>
          <td>Full database encode-decode round trip</td>
      </tr>
      <tr>
          <td>NFA Generation</td>
          <td>NFA dataset</td>
          <td>Not specified</td>
          <td>Nonfullerene acceptors from Lopez et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fragmentation</strong>: Naive ring-sidechain cleavage, matched molecular pair analysis, and diversity-based selection of 300 groups for VAE experiments.</li>
<li><strong>Essential set</strong>: 23 chiral groups covering all relevant chiral centers in eMolecules.</li>
<li><strong>Random generation</strong>: Bag-of-tokens sampling with length matched to dataset distribution.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>VAE</strong>: Trained for 125 epochs on MOSES dataset using both SELFIES and Group SELFIES tokenizations.</li>
<li>Architecture details follow the MOSES benchmark VAE configuration.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
          <td>Fréchet ChemNet Distance (penultimate layer activations)</td>
      </tr>
      <tr>
          <td>SNN</td>
          <td>Average Tanimoto similarity to nearest neighbor in reference set</td>
      </tr>
      <tr>
          <td>Frag</td>
          <td>Cosine similarity of BRICS fragment distributions</td>
      </tr>
      <tr>
          <td>Scaf</td>
          <td>Cosine similarity of Bemis-Murcko scaffold distributions</td>
      </tr>
      <tr>
          <td>IntDiv</td>
          <td>Internal diversity via Tanimoto similarity</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Percentage passing RDKit parsing</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Percentage of non-duplicate generated molecules</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Robustness verification performed on the Niagara supercomputer (SciNet HPC Consortium).</li>
<li>VAE training hardware not specified.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/group-selfies">group-selfies</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Open-source Python implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cheng, A. H., Cai, A., Miret, S., Malkomes, G., Phielipp, M., &amp; Aspuru-Guzik, A. (2023). Group SELFIES: A robust fragment-based molecular string representation. <em>Digital Discovery</em>, 2(3), 748-758. <a href="https://doi.org/10.1039/D3DD00012E">https://doi.org/10.1039/D3DD00012E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cheng2023group,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Group SELFIES: A Robust Fragment-Based Molecular String Representation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cheng, Austin H. and Cai, Andy and Miret, Santiago and Malkomes, Gustavo and Phielipp, Mariano and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{748--758}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00012E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DeepSMILES: Adapting SMILES Syntax for Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/</guid><description>DeepSMILES modifies SMILES syntax to eliminate unbalanced parentheses and unpaired ring closures, reducing invalid outputs from generative molecular models.</description><content:encoded><![CDATA[<h2 id="a-new-molecular-string-notation-for-generative-models">A New Molecular String Notation for Generative Models</h2>
<p>This is a <strong>Method</strong> paper that introduces DeepSMILES, a modified SMILES syntax designed to reduce the rate of syntactically invalid strings produced by machine-learning generative models. The primary contribution is a pair of string-level transformations (for ring closures and for branches) that can be applied independently and interconverted with standard SMILES without loss of information, including stereochemistry.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p>Deep neural networks for de novo molecular design commonly operate on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational autoencoders</a> (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al., 2018</a>), recurrent neural networks with LSTM (<a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al., 2018</a>; Olivecrona et al., 2017), and grammar-based approaches (<a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Kusner et al., 2017</a>) all generate molecules by sampling character sequences. A persistent problem is that many generated strings are syntactically invalid SMILES, with reported validity rates ranging from 7% to 80%.</p>
<p>Two structural features of SMILES syntax are responsible for most invalid strings:</p>
<ol>
<li><strong>Balanced parentheses</strong>: Branches require matched open/close parenthesis pairs. A generative model must track nesting state across long sequences to produce valid brackets.</li>
<li><strong>Paired ring closure symbols</strong>: Rings require two identical digit tokens at corresponding positions. The model must remember which digits are &ldquo;open&rdquo; and close them appropriately.</li>
</ol>
<p>Grammar-based approaches (e.g., <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>) can enforce balanced parentheses through a context-free grammar, but they cannot enforce the ring closure pairing constraint because that constraint is context-sensitive. Syntax-directed approaches (Dai et al., 2018) add explicit ring closure constraints but at the cost of significantly more complex decoder architectures.</p>
<h2 id="core-innovation-postfix-branch-notation-and-single-ring-closure-symbols">Core Innovation: Postfix Branch Notation and Single Ring Closure Symbols</h2>
<p>DeepSMILES addresses both syntax problems through two independent string transformations.</p>
<h3 id="ring-closure-transformation">Ring closure transformation</h3>
<p>Standard SMILES uses a pair of identical digits to mark ring openings and closings (e.g., <code>c1ccccc1</code> for benzene). DeepSMILES eliminates the ring-opening digit and replaces the ring-closing digit with the ring size, counting back along the tree path to the ring-opening atom. Benzene becomes <code>cccccc6</code>, where <code>6</code> means &ldquo;connect to the atom 6 positions back.&rdquo;</p>
<p>This transformation has three key properties:</p>
<ul>
<li>Every ring of a given size always uses the same digit, regardless of context. A phenyl ring is always <code>cccccc6</code> in DeepSMILES, whereas in SMILES it might be <code>c1ccccc1</code>, <code>c2ccccc2</code>, <code>c3ccccc3</code>, etc.</li>
<li>A single symbol cannot be &ldquo;unmatched&rdquo; since there is no corresponding opening symbol.</li>
<li>For double-digit ring sizes, the <code>%N</code> notation is used (and <code>%(N)</code> for sizes above 99).</li>
</ul>
<p>Bond stereochemistry is preserved by moving any explicit or stereo bond from the eliminated ring-opening symbol to the ring-closing symbol, with direction adjusted as needed.</p>
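<p>The digit replacement can be illustrated with a toy encoder. This sketch assumes unbranched rings and single-character atom tokens; the reference implementation in the <code>deepsmiles</code> package handles the general case, including bonds, brackets, and stereochemistry:</p>

```python
def encode_rings(smiles):
    """Toy DeepSMILES ring encoder: drop the ring-opening digit and
    replace the closing digit with the ring size (atoms counted back
    to the ring-opening atom). Assumes single-character atom tokens
    and unbranched rings."""
    open_ring = {}  # ring-closure digit -> index of the opening atom
    atoms = 0
    out = []
    for ch in smiles:
        if ch.isdigit():
            if ch in open_ring:
                out.append(str(atoms - open_ring.pop(ch)))  # close ring
            else:
                open_ring[ch] = atoms - 1  # open ring: emit nothing
        else:
            out.append(ch)
            if ch.isalpha():
                atoms += 1
    return "".join(out)
```

<p>For example, benzene <code>c1ccccc1</code> encodes to <code>cccccc6</code> and cyclopropane <code>C1CC1</code> to <code>CCC3</code>.</p>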
<h3 id="branch-parenthesis-transformation">Branch (parenthesis) transformation</h3>
<p>Standard SMILES uses matched open/close parenthesis pairs for branches (e.g., <code>C(OC)(SC)F</code>). DeepSMILES replaces this with a postfix notation inspired by Reverse Polish Notation (RPN). Only close parentheses are used, and the number of consecutive close parentheses indicates how far back on the current branch the next atom attaches.</p>
<p>For example, <code>C(OC)(SC)F</code> becomes <code>COC))SC))F</code>. The interpretation uses a stack: atoms are pushed onto the stack as they are read, each close parenthesis pops one atom from the stack, and the next atom connects to whatever is on top of the stack.</p>
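<p>The stack interpretation can be sketched as a small decoder. This toy version assumes single-character atom tokens and returns a bond list rather than a molecule (the reference <code>deepsmiles</code> package implements the full notation):</p>

```python
def decode_branches(deep):
    """Decode DeepSMILES postfix branch notation into a bond list.
    Atoms are pushed onto a stack as they are read; each ')' pops one
    atom; each new atom bonds to whatever is then on top of the stack."""
    stack, bonds, idx = [], [], 0
    for ch in deep:
        if ch == ")":
            stack.pop()
        else:  # assume a single-character atom token
            if stack:
                bonds.append((stack[-1], idx))
            stack.append(idx)
            idx += 1
    return bonds
```

<p>Decoding <code>COC))SC))F</code> yields bonds C0-O1, O1-C2, C0-S3, S3-C4, C0-F5, i.e. the connectivity of <code>C(OC)(SC)F</code>.</p>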
<h3 id="stereochemistry-preservation">Stereochemistry preservation</h3>
<p>Tetrahedral stereochemistry is fully preserved through the transformations. When ring closure symbol reordering would change the stereo configuration, the <code>@</code>/<code>@@</code> annotation is inverted during encoding to compensate.</p>
<h3 id="independence-of-transformations">Independence of transformations</h3>
<p>The two transformations are independent and can be applied separately or together. Any application of DeepSMILES should specify which transformations were applied.</p>
<h2 id="roundtrip-validation-on-chembl-23">Roundtrip Validation on ChEMBL 23</h2>
<p>The authors validated DeepSMILES by roundtripping all entries in the ChEMBL 23 database through SMILES-to-DeepSMILES-to-SMILES conversion. Canonical SMILES (including stereochemistry) were generated by four independent cheminformatics toolkits: CDK, OEChem, Open Babel, and RDKit. Using multiple toolkits ensures coverage of different traversal orders and ring closure ordering conventions.</p>
<p>All SMILES strings roundtripped without error across all three configurations (branches only, rings only, both). The exact string representation may differ in ring closure digit assignment or digit ordering, sometimes with an associated stereo inversion at tetrahedral centers, but the canonical SMILES of the original and roundtripped molecules are identical.</p>
<h3 id="performance-characteristics">Performance characteristics</h3>
<p>The following table shows the effect of DeepSMILES conversion on string length and throughput, measured on canonical SMILES from Open Babel for ChEMBL 23:</p>
<table>
  <thead>
      <tr>
          <th>Transformation</th>
          <th>Mean % change in length</th>
          <th>Encoding (per sec)</th>
          <th>Decoding (per sec)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Branches only</td>
          <td>+8.2%</td>
          <td>32,000</td>
          <td>16,000</td>
      </tr>
      <tr>
          <td>Rings only</td>
          <td>-6.4%</td>
          <td>26,000</td>
          <td>24,000</td>
      </tr>
      <tr>
          <td>Both</td>
          <td>+1.9%</td>
          <td>26,000</td>
          <td>17,500</td>
      </tr>
  </tbody>
</table>
<p>The ring transformation slightly shortens strings (by removing one digit per ring), while the branch transformation slightly lengthens them (additional close parentheses). Combined, the net effect is a small increase of about 2%. Throughput is in the tens of thousands of conversions per second in pure Python.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>DeepSMILES does not eliminate all invalid strings. Invalid DeepSMILES can still be generated, for example when there are more close parentheses than atoms on the stack, or when a ring size exceeds the number of available atoms. The reference implementation raises a <code>DecodeError</code> in these cases, though the authors note that a more tolerant decoder (ignoring extra parentheses or defaulting to the first atom for oversized rings) could be used during generation.</p>
<p>The paper assumes that input SMILES are generated by a standard cheminformatics toolkit as a depth-first traversal of the molecular graph. Non-standard SMILES (e.g., <code>CC(C1)CCCC1</code>) cannot be directly encoded.</p>
<p>The authors suggest several directions for future work:</p>
<ul>
<li>Investigating whether a preferred traversal order (e.g., shorter branches first) would make DeepSMILES even easier for models to learn.</li>
<li>Exploring notations where atoms in the organic subset explicitly list their hydrogen count, which would allow a fully parenthesis-free representation.</li>
<li>Using SMILES augmentation with random traversal orders (as explored by Bjerrum and Threlfall, 2017) in combination with DeepSMILES.</li>
<li>Designing entirely new line notations optimized for ML, where every string maps to a valid molecule, there are few duplicate representations, small string changes produce small structural changes, and string length correlates with pharmaceutical relevance.</li>
</ul>
<p>The fused ring case presents additional complexity: a bicyclic system has three cycles, and depending on traversal order, the ring size digit may not directly correspond to the ring size of any individual ring. This is an inherent limitation of depth-first traversal-based notations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation</td>
          <td>ChEMBL 23</td>
          <td>~1.7M compounds</td>
          <td>Canonical SMILES from CDK, OEChem, Open Babel, RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The DeepSMILES encoder and decoder are pure string-processing algorithms with no machine-learning components. The transformations operate on SMILES syntax tokens (atoms, bonds, parentheses, ring closure digits) without chemical interpretation.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Roundtrip accuracy</td>
          <td>100%</td>
          <td>All ChEMBL 23 entries across 4 toolkits</td>
      </tr>
      <tr>
          <td>Encoding throughput</td>
          <td>26,000-32,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
      <tr>
          <td>Decoding throughput</td>
          <td>16,000-24,000/s</td>
          <td>Pure Python, varies by transformation</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>No specific hardware requirements. The implementation is a pure Python module with no GPU dependencies.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/nextmovesoftware/deepsmiles">deepsmiles</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Pure Python encoder/decoder</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: O&rsquo;Boyle, N. M., &amp; Dalke, A. (2018). DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv.7097960.v1">https://doi.org/10.26434/chemrxiv.7097960.v1</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oboyle2018deepsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{O&#39;Boyle, Noel M. and Dalke, Andrew}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv.7097960.v1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Atom-in-SMILES: Better Tokens for Chemical Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</guid><description>Atom-in-SMILES replaces generic SMILES tokens with environment-aware atomic tokens, reducing token degeneration and improving chemical translation accuracy.</description><content:encoded><![CDATA[<h2 id="a-new-tokenization-method-for-chemical-language-models">A New Tokenization Method for Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom-in-SMILES (AIS), a tokenization scheme for SMILES strings that replaces generic atomic tokens with environment-aware tokens encoding each atom&rsquo;s local chemical neighborhood. The primary contribution is demonstrating that tokenization quality has a significant impact on chemical language model outcomes across multiple tasks: SMILES canonicalization, <a href="/notes/chemistry/molecular-design/reaction-prediction/">single-step retrosynthesis</a>, and <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a>.</p>
<h2 id="why-standard-smiles-tokenization-falls-short">Why Standard SMILES Tokenization Falls Short</h2>
<p>Standard atom-wise SMILES tokenization treats all atoms of the same element identically. Every carbon is tokenized as &ldquo;C&rdquo; regardless of whether it is part of an aromatic ring, a carbonyl group, or a methyl chain. This creates a highly degenerate token space where chemically distinct atoms share the same representation.</p>
<p>The authors draw an analogy between natural language and chemical language. A typical SMILES sequence is about three times longer than a natural language sentence, yet the token vocabulary is roughly 1000 times smaller. This mismatch leads to extreme token repetition: the same tokens (C, c, N, O) appear many times within a single sequence. In natural language processing, token degeneration (where models repeatedly predict the same token) is a known failure mode of autoregressive decoders. The repetitive nature of SMILES tokens exacerbates this problem in chemical language models.</p>
<p>SMILES also lacks a one-to-one correspondence between tokens and chemical meaning. Two molecules that differ in only one atom substitution (e.g., swapping a carbon for a nitrogen in a ring) produce identical token sets under atom-wise tokenization, making it harder for models to distinguish structurally similar molecules.</p>
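<p>For concreteness, atom-wise tokenization is commonly implemented with a regular expression like the one below (a widely used pattern in chemical language modeling, shown as an illustration rather than the paper's exact tokenizer). Note how every non-bracket carbon, aromatic or not, collapses to the same <code>C</code> or <code>c</code> token:</p>

```python
import re

# One alternative per token type: bracket atoms, two-letter halogens,
# organic-subset atoms, then bonds/branches/ring closures.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into atom-wise tokens."""
    return SMILES_TOKEN.findall(smiles)
```

<p>Tokenizing benzene gives six identical <code>c</code> tokens, exactly the degeneracy AIS is designed to remove.</p>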
<h2 id="core-innovation-encoding-atom-environments-into-tokens">Core Innovation: Encoding Atom Environments into Tokens</h2>
<p>The key insight is to replace each atomic token with a richer token that encodes the atom&rsquo;s local chemical environment, inspired by the <a href="https://en.wikipedia.org/wiki/Atoms_in_molecules">atoms-in-molecules (AIM)</a> concept from quantum chemistry. For a given SMILES string, the AIS mapping function $f$ operates on the token space:</p>
<p>$$
f(X) = \begin{cases} AE|_{X_{\text{central}}} &amp; \text{if } X \text{ is an atom} \\ X &amp; \text{otherwise} \end{cases}
$$</p>
<p>where $AE|_{X_{\text{central}}}$ denotes the atomic environment centered on atom $X$. Non-atomic tokens (brackets, bond symbols, ring closures) pass through unchanged.</p>
<p>Each AIS token is formatted as <code>[Sym;Ring;Neighbors]</code> where:</p>
<ul>
<li><strong>Sym</strong> is the atomic symbol with chirality, aromaticity (lowercase for aromatic), hydrogen count, and formal charge</li>
<li><strong>Ring</strong> indicates whether the atom is in a ring (<code>R</code>) or not (<code>!R</code>)</li>
<li><strong>Neighbors</strong> lists the neighboring atoms interacting with the central atom</li>
</ul>
<p>This mapping is bijective: SMILES strings can be fully recovered from AIS strings via an inverse projection. The algorithm iterates over atoms in a molecule, computes their local environments using RDKit, and produces environment-aware token variants.</p>
<p>As a concrete example, in glycine the two carbons and two oxygens are indistinguishable under atom-wise tokenization. Under AIS, each receives a unique token reflecting its bonding environment (e.g., the carboxyl carbon is distinguished from the alpha carbon).</p>
<p>The AIS tokenization also exhibits a fingerprint-like property. Because each token encodes local structural information, the set of AIS tokens for a molecule functions similarly to circular fingerprints like ECFP2. The authors show that pairwise <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> computed from AIS token sets have resolution comparable to ECFP2 and HashAP fingerprints, and better resolution than MACCS, Avalon, and RDKit fingerprints.</p>
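<p>The token-set similarity used here is ordinary Jaccard/Tanimoto similarity over sets. A short sketch (the AIS-style tokens in the usage line are hypothetical examples, not taken from the paper):</p>

```python
def tanimoto(tokens_a, tokens_b):
    """Jaccard/Tanimoto similarity between two molecules' token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical AIS-style tokens for two small molecules.
sim = tanimoto(["[CH3;!R;C]", "[C;!R;CCO]", "[OH;!R;C]"],
               ["[CH3;!R;C]", "[C;!R;CCO]"])
```

<p>Here two of three distinct tokens are shared, giving a similarity of 2/3.</p>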
<p>Token repetition can be quantified as:</p>
<p>$$
\text{rep-}l = \sum_{t=1}^{|s|} \mathbb{1}[s_t \in s_{t-w-1:t-1}]
$$</p>
<p>where $s$ is the predicted sequence, $|s|$ is the token count, and $w$ is the window size. AIS tokens exhibit consistently lower normalized repetition rates compared to SMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> across diverse molecular datasets (drugs, natural products, steroids, lipids, metal complexes, octane isomers).</p>
<h2 id="experimental-evaluation-across-three-chemical-tasks">Experimental Evaluation Across Three Chemical Tasks</h2>
<h3 id="input-output-equivalent-mapping-smiles-canonicalization">Input-Output Equivalent Mapping (SMILES Canonicalization)</h3>
<p>The first task tests whether a model can translate non-canonical SMILES enumerations into canonical form. The authors constructed deliberately challenging datasets from <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> subsets with cumulative structural constraints (no cyclic heteroatom-heteroatom bonds, stable functional groups only, fragment-like, scaffold-like, etc.), generating training sets of 1M molecules augmented with 150K molecules from the most restrictive subset at 10x, 30x, and 50x augmentation levels.</p>
<table>
  <thead>
      <tr>
          <th>GDB-13 Subset</th>
          <th>Atom-wise (x10)</th>
          <th>Atom-wise (x50)</th>
          <th>AIS (x10)</th>
          <th>AIS (x50)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ab</td>
          <td>34.2%</td>
          <td>33.2%</td>
          <td>37.3%</td>
          <td>34.1%</td>
      </tr>
      <tr>
          <td>abc</td>
          <td>31.0%</td>
          <td>29.6%</td>
          <td>33.7%</td>
          <td>30.4%</td>
      </tr>
      <tr>
          <td>abcde</td>
          <td>48.7%</td>
          <td>45.5%</td>
          <td>53.6%</td>
          <td>47.0%</td>
      </tr>
      <tr>
          <td>abcdef</td>
          <td>41.8%</td>
          <td>39.1%</td>
          <td>52.5%</td>
          <td>46.9%</td>
      </tr>
      <tr>
          <td>abcdefg</td>
          <td>50.9%</td>
          <td>50.0%</td>
          <td>59.9%</td>
          <td>56.8%</td>
      </tr>
  </tbody>
</table>
<p>AIS outperformed atom-wise tokenization on all subsets and augmentation levels. The performance gap widened for the more restrictive (and therefore more internally similar) subsets, reaching 10.7 percentage points on the abcdef subset. This demonstrates that AIS is particularly effective when molecules are structurally similar and harder to distinguish.</p>
<h3 id="single-step-retrosynthesis">Single-Step Retrosynthesis</h3>
<p>The second task uses the USPTO-50K benchmark for single-step <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthetic prediction</a> via a template-free transformer encoder-decoder model. The model was trained for 200,000 steps with Adam optimizer, negative log-likelihood loss, and cyclic learning rate scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Tokenization</th>
          <th>rep-|P - rep-|GT &gt;= 2</th>
          <th>String Exact (%)</th>
          <th>Tc Exact (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Atom-wise baseline</td>
          <td>&ndash;</td>
          <td>42.00</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Atom-wise (reproduced)</td>
          <td>801</td>
          <td>42.05</td>
          <td>44.72</td>
      </tr>
      <tr>
          <td>SmilesPE</td>
          <td>821</td>
          <td>19.82</td>
          <td>22.74</td>
      </tr>
      <tr>
          <td>SELFIES</td>
          <td>886</td>
          <td>28.82</td>
          <td>30.76</td>
      </tr>
      <tr>
          <td>DeepSMILES</td>
          <td>902</td>
          <td>38.63</td>
          <td>41.20</td>
      </tr>
      <tr>
          <td><strong>Atom-in-SMILES</strong></td>
          <td><strong>727</strong></td>
          <td><strong>46.32</strong></td>
          <td><strong>47.62</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved 46.32% string exact accuracy (4.3 percentage points above the atom-wise baseline) and 47.62% Tanimoto exact accuracy (2.9 points above). AIS also produced the fewest degenerate token repetitions (727 vs. 801 for atom-wise, roughly a 10% reduction). DeepSMILES had the highest repetition count (902) despite reasonable overall accuracy. SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SmilesPE</a> both performed substantially worse than the atom-wise baseline on this task.</p>
<p>The authors identified six common token repetition patterns in retrosynthetic predictions: long head repetitions, long tail repetitions, repetitive rings, repetitive chains, and halogen repetitions on both aliphatic and aromatic carbons.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The third task evaluates tokenization schemes on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks using Random Forest models with 5-fold cross-validation. AIS tokens were converted to fingerprint-like feature vectors.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>SMILES</th>
          <th>DeepSMILES</th>
          <th>SELFIES</th>
          <th>SmilesPE</th>
          <th>AIS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Regression (RMSE, lower is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>0.628</td>
          <td>0.631</td>
          <td>0.675</td>
          <td>0.689</td>
          <td><strong>0.553</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>0.545</td>
          <td>0.544</td>
          <td>0.564</td>
          <td>0.761</td>
          <td><strong>0.441</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>0.924</td>
          <td>0.895</td>
          <td>0.938</td>
          <td>0.800</td>
          <td><strong>0.683</strong></td>
      </tr>
      <tr>
          <td><strong>Classification (ROC-AUC, higher is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>0.758</td>
          <td>0.777</td>
          <td>0.799</td>
          <td>0.847</td>
          <td><strong>0.885</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>0.740</td>
          <td>0.774</td>
          <td>0.746</td>
          <td><strong>0.837</strong></td>
          <td>0.835</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>0.649</td>
          <td>0.648</td>
          <td>0.653</td>
          <td><strong>0.739</strong></td>
          <td>0.729</td>
      </tr>
  </tbody>
</table>
<p>AIS achieved the best performance on all three regression datasets and on BBBP among the classification datasets; SmilesPE was narrowly ahead on BACE (0.837 vs. 0.835) and HIV (0.739 vs. 0.729). On ESOL, the RMSE improvement over standard SMILES was 12%; on Lipophilicity, it was 26%.</p>
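<p>The fingerprint-like feature vectors built from AIS tokens can be illustrated with a simple token-count featurization. This is a hypothetical sketch of the idea; the token strings and vocabulary below are invented, and the authors' exact pipeline may differ:</p>

```python
from collections import Counter

def token_count_vector(tokens, vocab):
    """Map a tokenized molecule to a fixed-length count vector.

    `vocab` fixes the feature order; tokens outside the vocabulary are
    ignored. This mimics a count-based fingerprint suitable as Random
    Forest input, not the paper's exact featurization.
    """
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

# Invented AIS-style tokens: atoms annotated with ring flags and neighbors.
vocab = ["[C;!R;CC]", "[O;!R;C]", "[c;R;cc]"]
mol = ["[C;!R;CC]", "[C;!R;CC]", "[O;!R;C]"]
print(token_count_vector(mol, vocab))  # [2, 1, 0]
```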
<h2 id="key-findings-better-tokens-yield-better-chemical-models">Key Findings: Better Tokens Yield Better Chemical Models</h2>
<p>The main findings of this work are:</p>
<ol>
<li>
<p><strong>Tokenization significantly impacts chemical language model quality.</strong> The choice of tokenization scheme alone can change prediction accuracy by more than 10 percentage points on otherwise identical sequence-to-sequence tasks.</p>
</li>
<li>
<p><strong>AIS reduces token degeneration by approximately 10%</strong> compared to atom-wise SMILES tokenization, with consistently lower normalized repetition rates across diverse molecular datasets.</p>
</li>
<li>
<p><strong>AIS outperforms all compared tokenization schemes</strong> (atom-wise SMILES, SmilesPE, SELFIES, DeepSMILES) on canonicalization, retrosynthesis, and property prediction.</p>
</li>
<li>
<p><strong>The fingerprint-like nature of AIS tokens</strong> enables direct use as molecular features for property prediction and provides resolution comparable to established circular fingerprints.</p>
</li>
<li>
<p><strong>The mapping is invertible</strong>, so AIS strings can always be converted back to valid SMILES. This is a practical advantage over approaches that may lose structural information.</p>
</li>
</ol>
<p><strong>Limitations</strong>: AIS cannot distinguish environmentally identical substructures or atoms related by a molecular symmetry plane, since it only considers nearest-neighbor environments. Performance on long-chain molecules (e.g., lipids) is similar across all tokenization schemes, suggesting that local environment encoding is less informative for repetitive linear structures.</p>
<p><strong>Future directions</strong>: The authors suggest AIS has potential for broader adoption in molecular generative models, chemical translation, and property prediction tasks across the cheminformatics community.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonicalization training</td>
          <td>GDB-13 subsets</td>
          <td>1M + 150K augmented</td>
          <td>Cumulative structural constraints a-h</td>
      </tr>
      <tr>
          <td>Canonicalization testing</td>
          <td>GDB-13 disjoint test sets</td>
          <td>20K per subset</td>
          <td>Various restriction levels</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50K</td>
          <td>~50K reactions</td>
          <td>Sequences &gt; 150 tokens removed</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipophilicity, BBBP, BACE, HIV)</td>
          <td>Varies</td>
          <td>Standard benchmark splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder architecture for canonicalization and retrosynthesis tasks</li>
<li>200,000 training steps with Adam optimizer, negative log-likelihood loss, cyclic learning rate scheduler</li>
<li>Random Forest with 5-fold cross-validation for property prediction</li>
<li>AIS tokenization implemented via RDKit for atom environment extraction</li>
</ul>
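<p>The 5-fold cross-validation used for the Random Forest experiments can be sketched as an index-splitting routine; in practice this would typically come from a library such as scikit-learn's <code>KFold</code> (an assumption, since the paper does not name its implementation):</p>

```python
def k_fold_indices(n_samples: int, k: int = 5):
    """Yield (train_idx, test_idx) pairs partitioning range(n_samples) into k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first few folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop

folds = list(k_fold_indices(10, k=5))
print([len(test) for _, test in folds])  # [2, 2, 2, 2, 2]
```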
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>String exact match (%)</td>
          <td>Canonicalization, Retrosynthesis</td>
          <td>Exact SMILES match</td>
      </tr>
      <tr>
          <td>Tanimoto exactness (Tc)</td>
          <td>Retrosynthesis</td>
          <td>Morgan FP radius 3, 2048 bits</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression property prediction</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification property prediction</td>
          <td>BBBP, BACE, HIV</td>
      </tr>
      <tr>
          <td>rep-l</td>
          <td>Token degeneration</td>
          <td>Single-token repetition count</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/snu-lcbc/atom-in-SMILES">atom-in-SMILES</a></td>
          <td>Code</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>AIS tokenization implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ucak, U. V., Ashyrmamatov, I., &amp; Lee, J. (2023). Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. <em>Journal of Cheminformatics</em>, 15, 55. <a href="https://doi.org/10.1186/s13321-023-00725-9">https://doi.org/10.1186/s13321-023-00725-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ucak2023improving,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00725-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of Molecular Representation Learning Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</guid><description>A systematic review of molecular representation learning foundation models for drug discovery, covering five modalities and four pretraining strategies.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-molecular-representation-foundation-models">A Systematization of Molecular Representation Foundation Models</h2>
<p>This paper is a <strong>Systematization</strong> that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.</p>
<h2 id="why-a-systematic-review-of-mrl-foundation-models-is-needed">Why a Systematic Review of MRL Foundation Models Is Needed</h2>
<p>Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.</p>
<p>Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.</p>
<h2 id="taxonomy-of-molecular-descriptors-and-model-architectures">Taxonomy of Molecular Descriptors and Model Architectures</h2>
<p>The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.</p>
<h3 id="molecular-descriptors">Molecular Descriptors</h3>
<p>The review identifies five primary descriptor types:</p>
<ol>
<li><strong>Molecular fingerprints</strong>: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and dimensional complexity.</li>
<li><strong>1D sequences</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representations. SMILES is compact and widely used, but models that generate SMILES can emit invalid strings. SELFIES guarantees valid molecular strings by construction.</li>
<li><strong>2D topological graphs</strong>: Atoms as nodes, bonds as edges. Can be derived from SMILES via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, making graph datasets effectively interchangeable with SMILES datasets.</li>
<li><strong>3D geometry</strong>: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.</li>
<li><strong>Multimodal</strong>: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.</li>
</ol>
<p>The paper also discusses mathematically abstract molecular representations. For example, the <a href="https://en.wikipedia.org/wiki/Wiener_index">Wiener index</a> quantifies structural complexity:</p>
<p>$$
W = \sum_{i &lt; j} d_{ij}
$$</p>
<p>where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$.</p>
<p>Degree centrality captures local connectivity:</p>
<p>$$
C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij}
$$</p>
<p>where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.</p>
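<p>Both descriptors follow directly from the bond graph. A small sketch computing the topological distances by breadth-first search (generic code for intuition, not from the paper):</p>

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest bond-path lengths from `source` in an adjacency-list graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    """Sum of topological distances d_ij over unordered atom pairs."""
    nodes = sorted(adj)
    return sum(bfs_distances(adj, i)[j] for i in nodes for j in nodes if i < j)

def degree_centrality(adj):
    """C_D(v_i): row sum of the adjacency matrix, i.e. the number of bonded neighbors."""
    return {v: len(adj[v]) for v in adj}

# n-butane carbon skeleton: a path graph 0-1-2-3.
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wiener_index(butane))       # 1+2+3 + 1+2 + 1 = 10
print(degree_centrality(butane))  # {0: 1, 1: 2, 2: 2, 3: 1}
```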
<h3 id="model-architectures">Model Architectures</h3>
<p>Models are classified into two primary categories:</p>
<p><strong>Unimodal-based models:</strong></p>
<ul>
<li><strong>Sequence-based</strong>: Transformer models operating on SMILES/SELFIES (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, MolGEN, <a href="/notes/chemistry/llm-applications/llamsmol-instruction-tuning-chemistry/">LlaSMol</a>). These capture syntactic patterns but miss spatial and topological features.</li>
<li><strong>Topological graph-based</strong>: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.</li>
<li><strong>3D geometry-based</strong>: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.</li>
<li><strong>Image-based</strong>: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.</li>
</ul>
<p><strong>Multimodal-based models:</strong></p>
<ul>
<li><strong>Sequence + Graph</strong>: <a href="/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/">DVMP</a>, PanGu Drug Model. Combines the strengths of string and topological representations.</li>
<li><strong>Graph + 3D Geometry</strong>: GraphMVP, Transformer-M. Enriches topological features with spatial information.</li>
<li><strong>Text + Molecular Structure</strong>: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.</li>
</ul>
<h2 id="four-pretraining-paradigms-for-mrl">Four Pretraining Paradigms for MRL</h2>
<p>The review systematically categorizes pretraining strategies into four paradigms:</p>
<h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3>
<p>The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.</p>
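<p>The masking step of MLM can be sketched in a few lines. The 15% mask rate and <code>[MASK]</code> token below follow the common BERT-style convention and are not quoted from any specific chemical model:</p>

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Corrupt a token sequence for MLM; return (corrupted, labels).

    `labels` maps each masked position to the original token the model
    must predict during pretraining.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = corrupted[pos]
        corrupted[pos] = mask_token
    return corrupted, labels

# Atom-wise tokenization of acetic acid, "CC(=O)O".
tokens = ["C", "C", "(", "=", "O", ")", "O"]
corrupted, labels = mask_tokens(tokens)
print(corrupted, labels)  # one position replaced by [MASK]; labels keeps the original
```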
<h3 id="contrastive-learning-cl">Contrastive Learning (CL)</h3>
<p>The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.</p>
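<p>The contrastive objectives referenced here are typically InfoNCE-style losses: the positive pair's similarity is pushed up relative to the negatives. A dependency-free sketch for a single anchor (a generic formulation; the similarity function and temperature vary by model):</p>

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive similarity."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / temperature - log_denom)

# A well-aligned positive pair yields a lower loss than a poorly aligned one.
good = info_nce(pos_sim=0.9, neg_sims=[0.1, 0.0, -0.2])
bad = info_nce(pos_sim=0.2, neg_sims=[0.1, 0.0, -0.2])
print(good < bad)  # True
```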
<h3 id="reconstruction-based-pretraining-rbp">Reconstruction-Based Pretraining (RBP)</h3>
<p>Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.</p>
<h3 id="multimodal-alignment-pretraining-map">Multimodal Alignment Pretraining (MAP)</h3>
<p>Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.</p>
<h2 id="downstream-applications-and-performance-benchmarks">Downstream Applications and Performance Benchmarks</h2>
<p>The review evaluates MRL foundation models across five application domains.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The most common benchmark for MRL models. The review provides comprehensive ROC-AUC comparisons across eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification datasets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>BBBP</th>
          <th>BACE</th>
          <th>ClinTox</th>
          <th>Tox21</th>
          <th>SIDER</th>
          <th>HIV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MGMAE</td>
          <td>Graph</td>
          <td>94.2</td>
          <td>92.7</td>
          <td>96.7</td>
          <td>86.0</td>
          <td>66.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MPG</td>
          <td>Graph</td>
          <td>92.2</td>
          <td>92.0</td>
          <td>96.3</td>
          <td>83.7</td>
          <td>66.1</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GROVER</td>
          <td>Graph+Trans.</td>
          <td>94.0</td>
          <td>89.4</td>
          <td>94.4</td>
          <td>83.1</td>
          <td>65.8</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MoLFormer</td>
          <td>Sequence</td>
          <td>93.7</td>
          <td>88.2</td>
          <td>94.8</td>
          <td>84.7</td>
          <td>69.0</td>
          <td>82.2</td>
      </tr>
      <tr>
          <td>MM-Deacon</td>
          <td>Seq.+IUPAC</td>
          <td>78.5</td>
          <td>-</td>
          <td>99.5</td>
          <td>-</td>
          <td>69.3</td>
          <td>80.1</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>3D</td>
          <td>72.9</td>
          <td>85.7</td>
          <td>91.9</td>
          <td>79.6</td>
          <td>65.9</td>
          <td>80.8</td>
      </tr>
      <tr>
          <td>DVMP</td>
          <td>Seq.+Graph</td>
          <td>77.8</td>
          <td>89.4</td>
          <td>95.6</td>
          <td>79.1</td>
          <td>69.8</td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>TxD-T-LLM</td>
          <td>Seq.+Text</td>
          <td>-</td>
          <td>-</td>
          <td>86.3</td>
          <td>88.2</td>
          <td>-</td>
          <td>73.2</td>
      </tr>
  </tbody>
</table>
<p>The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on generated molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on QM9.</p>
<h3 id="drug-drug-interaction-prediction"><a href="https://en.wikipedia.org/wiki/Drug_interaction">Drug-Drug Interaction</a> Prediction</h3>
<p>MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.</p>
<h3 id="retrosynthesis-prediction"><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a> Prediction</h3>
<p>DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).</p>
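<p>Top-k accuracy, the standard retrosynthesis metric, counts a case as correct when the ground-truth reactant set appears among the model's k highest-ranked candidates. A minimal sketch (the SMILES strings below are illustrative only):</p>

```python
def top_k_accuracy(ranked_predictions, ground_truths, k=1):
    """Fraction of cases whose ground truth appears in the top-k candidates."""
    hits = sum(
        truth in candidates[:k]
        for candidates, truth in zip(ranked_predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Two reactions; candidate reactant sets ranked by model confidence.
preds = [["CC(=O)O.OCC", "CCO.CC=O"], ["c1ccccc1Br.CN", "c1ccccc1N.CBr"]]
truth = ["CC(=O)O.OCC", "c1ccccc1N.CBr"]
print(top_k_accuracy(preds, truth, k=1))  # 0.5
print(top_k_accuracy(preds, truth, k=2))  # 1.0
```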
<h3 id="drug-synergy-prediction">Drug Synergy Prediction</h3>
<p>SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through contextual learning.</p>
<h2 id="guidelines-limitations-and-future-directions">Guidelines, Limitations, and Future Directions</h2>
<h3 id="model-selection-guidelines">Model Selection Guidelines</h3>
<p>The authors provide structured guidelines for choosing MRL foundation models based on:</p>
<ol>
<li><strong>Task objective</strong>: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.</li>
<li><strong>Data characteristics</strong>: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.</li>
<li><strong>Interpretability needs</strong>: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.</li>
<li><strong>Computational budget</strong>: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.</li>
</ol>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The review identifies five key challenges:</p>
<ol>
<li><strong>Multimodal data integration</strong>: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> trajectories as a dynamic modality and using cross-modal data augmentation.</li>
<li><strong>Data scarcity</strong>: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.</li>
<li><strong>Interpretability</strong>: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.</li>
<li><strong>Training efficiency</strong>: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.</li>
<li><strong>Robustness and generalization</strong>: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.</p>
<h3 id="data">Data</h3>
<p>The review catalogs 28 representative molecular datasets used by the surveyed foundation models:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Descriptor</th>
          <th>Primary Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>~118M</td>
          <td>SMILES, 3D, Image, IUPAC</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ZINC15</td>
          <td>~980M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>~2.4M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>133,884</td>
          <td>SMILES</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450,000</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>950,000</td>
          <td>SMILES</td>
          <td>Reaction prediction</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>4M</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Z-dot-max/MRL_Foundation_Review">Review Materials (GitHub)</a></td>
          <td>Code/Data</td>
          <td>Not specified</td>
          <td>Code and data tables for figures</td>
      </tr>
      <tr>
          <td><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12784970/">Paper (PMC)</a></td>
          <td>Paper</td>
          <td>CC-BY</td>
          <td>Open access via PubMed Central</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model&rsquo;s original setup. The review covers:</p>
<ul>
<li>ROC-AUC for classification tasks (property prediction, DDI, synergy)</li>
<li>RMSE/MAE for regression tasks</li>
<li>Validity and novelty for molecular generation</li>
<li>Top-k accuracy for retrosynthesis</li>
<li>COV and MAT for conformation generation</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., &amp; Liu, Y. (2025). A systematic review of molecular representation learning foundation models. <em>Briefings in Bioinformatics</em>, 27(1), bbaf703. <a href="https://doi.org/10.1093/bib/bbaf703">https://doi.org/10.1093/bib/bbaf703</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{song2025systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of molecular representation learning foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbaf703}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbaf703}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SELFIES and the Future of Molecular String Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2022/</link><pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2022/</guid><description>Perspective on SELFIES as a 100% robust SMILES alternative, with 16 future research directions for molecular AI.</description><content:encoded><![CDATA[<h2 id="position-a-roadmap-for-robust-chemical-languages">Position: A Roadmap for Robust Chemical Languages</h2>
<p>This is a <strong>Position</strong> paper (perspective) that proposes a research agenda for molecular representations in AI. It reviews the evolution of chemical notation over 250 years and argues for extending SELFIES-style robust representations beyond traditional organic chemistry into polymers, crystals, reactions, and other complex chemical systems.</p>
<h2 id="the-generative-bottleneck-in-traditional-representations">The Generative Bottleneck in Traditional Representations</h2>
<p>While SMILES has been the standard molecular representation since 1988, its fundamental weakness for machine learning is well-established: randomly generated SMILES strings are often invalid. The motivation is twofold:</p>
<ol>
<li><strong>Current problem</strong>: Traditional representations (SMILES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>, DeepSMILES) lack 100% robustness; random mutations or generations can produce invalid strings, limiting their use in generative AI models.</li>
<li><strong>Future opportunity</strong>: SELFIES solved this for small organic molecules, but many important chemical domains (polymers, crystals, reactions) still lack robust representations, creating a bottleneck for AI-driven discovery in these areas.</li>
</ol>
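<p>The robustness idea can be made concrete with a toy derivation-rule decoder: each symbol requests a bond, and the request is clipped or dropped whenever it would exceed an atom's valence, so <em>every</em> symbol stream decodes to a valid (toy) structure. This is a deliberately simplified illustration for intuition, not the actual SELFIES grammar:</p>

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def toy_decode(symbols):
    """Decode (atom, requested_bond_order) pairs into a valence-legal chain.

    The bond order actually used is clipped to what the previous atom and
    the new atom can still accept; saturated positions drop the symbol.
    Because of this clipping, no input stream can produce an invalid graph.
    """
    atoms, bonds, remaining = [], [], []
    for atom, requested in symbols:
        if not atoms:
            atoms.append(atom)
            remaining.append(MAX_VALENCE[atom])
            continue
        order = min(requested, remaining[-1], MAX_VALENCE[atom])
        if order == 0:
            continue  # previous atom is saturated; skip this symbol
        remaining[-1] -= order
        atoms.append(atom)
        remaining.append(MAX_VALENCE[atom] - order)
        bonds.append((len(atoms) - 2, len(atoms) - 1, order))
    return atoms, bonds

# A stream that tries to triple-bond fluorine is silently repaired.
atoms, bonds = toy_decode([("C", 1), ("F", 3), ("O", 2)])
print(atoms, bonds)  # ['C', 'F'] [(0, 1, 1)]
```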
<h2 id="16-concrete-research-directions-for-selfies">16 Concrete Research Directions for SELFIES</h2>
<p>The novelty is in the comprehensive research roadmap. The authors propose 16 concrete research projects organized around key themes:</p>
<ul>
<li><strong>Domain extension</strong>: Includes metaSELFIES for learning graph rules directly from data, BigSELFIES for stochastic polymers, and crystal structures via labeled quotient graphs.</li>
<li><strong>Chemical reactions</strong>: Robust reaction representations that enforce conservation laws.</li>
<li><strong>Programming perspective</strong>: Treating molecular representations as programming languages, potentially achieving Turing-completeness.</li>
<li><strong>Benchmarking</strong>: Systematic comparisons across representation formats.</li>
<li><strong>Interpretability</strong>: Understanding how humans and machines actually learn from different representations.</li>
</ul>
<h2 id="evidence-from-generative-case-studies">Evidence from Generative Case Studies</h2>
<p>This perspective paper grounds its argument in two generative case studies:</p>
<ol>
<li>
<p><strong>Pasithea (Deep Molecular Dreaming)</strong>: A generative model that first learns to predict a chemical property from a one-hot encoded SELFIES, then freezes the network weights and uses gradient descent on the one-hot input encoding to optimize molecular properties (logP). The target property increases or decreases nearly monotonically, demonstrating that the model has learned meaningful structure-property relationships from the SELFIES representation.</p>
</li>
<li>
<p><strong>DECIMER and STOUT</strong>: DECIMER (Deep lEarning for Chemical ImagE Recognition) is an image-to-structure tool, and STOUT (SMILES-TO-IUPAC-name Translator) translates between IUPAC names and molecular string representations. Both show improved performance when using SELFIES as an intermediate representation. STOUT internally converts SMILES to SELFIES before processing and decodes predicted SELFIES back to SMILES. These results suggest SELFIES provides a more learnable internal representation for sequence-to-sequence models.</p>
</li>
</ol>
<h2 id="strategic-outcomes-and-future-vision">Strategic Outcomes and Future Vision</h2>
<p>The paper establishes robust representations as a fundamental bottleneck in computational chemistry and proposes a clear path forward:</p>
<p><strong>Key outcomes</strong>:</p>
<ul>
<li>Identification of 16 concrete research projects spanning domain extension, benchmarking, and interpretability</li>
<li>Evidence that SELFIES enables capabilities (like smooth property optimization) impossible with traditional formats</li>
<li>Framework for thinking about molecular representations as programming languages</li>
</ul>
<p><strong>Strategic impact</strong>: The proposed extensions could enable new applications across drug discovery (efficient exploration beyond small molecules), materials design (systematic crystal structure discovery), synthesis planning (better reaction representations), and fundamental research (new ways to understand chemical behavior).</p>
<p><strong>Future vision</strong>: The authors emphasize that robust representations could become a bridge for bidirectional learning between humans and machines, enabling humans to learn new chemical concepts from AI systems.</p>
<h2 id="the-mechanism-of-robustness">The Mechanism of Robustness</h2>
<p>The key difference between SELFIES and other representations lies in how they handle syntax:</p>
<ul>
<li><strong>SMILES/DeepSMILES</strong>: Rely on non-local markers (opening/closing parentheses or ring numbers) that must be balanced. A mutation or random generation can easily break this balance, producing invalid strings.</li>
<li><strong>SELFIES</strong>: Uses a formal grammar (automaton) where derivation rules are entirely local. The critical innovation is <strong>overloading</strong>: a state-modifying symbol like <code>[Branch1]</code> starts a branch and changes the interpretation of the <em>next</em> symbol to represent a numerical parameter (the branch length).</li>
</ul>
<p>This overloading mechanism ensures that any arbitrary sequence of SELFIES tokens can be parsed into a valid molecular graph. The derivation can never fail because every symbol either adds an atom or modifies how subsequent symbols are interpreted.</p>
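<p>To make the mechanism concrete, here is a deliberately tiny Python sketch of local derivation with overloading. It is <em>not</em> the real <code>selfies</code> library; the token set, the <code>INDEX</code> mapping, and the <code>decode</code> function are illustrative assumptions only, showing why decoding can never fail.</p>

```python
# Toy sketch of SELFIES-style local derivation (NOT the real `selfies`
# library): every token either adds an atom or locally reinterprets the
# NEXT token, so any token sequence decodes without error.

ATOMS = {"[C]": "C", "[N]": "N", "[O]": "O"}
# Overloaded reading of a token as a number (e.g. a branch length).
INDEX = {"[C]": 0, "[N]": 1, "[O]": 2, "[Branch1]": 3}

def decode(tokens):
    """Decode ANY token sequence into atoms and branch markers."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "[Branch1]" and i + 1 < len(tokens):
            # Overloading: the next symbol is read as a number, not an atom.
            length = INDEX.get(tokens[i + 1], 0) + 1
            out.append(("branch", length))
            i += 2
        elif tok in ATOMS:
            out.append(ATOMS[tok])
            i += 1
        else:
            i += 1  # unknown or dangling tokens are skipped, never an error
    return out

# A randomly mutated sequence still decodes to something sensible:
print(decode(["[C]", "[Branch1]", "[O]", "[N]", "[O]"]))
# → ['C', ('branch', 3), 'N', 'O']
```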
<h2 id="the-16-research-projects-technical-details">The 16 Research Projects: Technical Details</h2>
<p>This section provides technical details on the proposed research directions:</p>
<h3 id="extending-to-new-domains">Extending to New Domains</h3>
<p><strong>metaSELFIES (Project 1)</strong>: The authors propose learning graph construction rules automatically from data. This could enable robust representations for any graph-based system, from quantum optics to biological networks, without needing domain-specific expertise.</p>
<p><strong>Token Optimization (Project 2)</strong>: SELFIES uses &ldquo;overloading&rdquo; where a symbol&rsquo;s meaning changes based on context. This project would investigate how this affects machine learning performance and whether the approach can be optimized.</p>
<h3 id="handling-complex-molecular-systems">Handling Complex Molecular Systems</h3>
<p><strong>BigSELFIES (Project 3)</strong>: Current representations struggle with large, often stochastic structures like polymers and biomolecules. BigSELFIES would combine hierarchical notation with stochastic building blocks to handle these complex systems where traditional small-molecule representations break down.</p>
<p><strong>Crystal Structures (Projects 4-5)</strong>: Crystals present unique challenges due to their infinite, periodic arrangements. An infinite net cannot be represented by a finite string directly. The proposed approach uses <strong>labeled quotient graphs (LQGs)</strong>, which are finite graphs that uniquely determine a periodic net. However, current SELFIES cannot represent LQGs because they lack symbols for edge directions and edge labels (vector shifts encoding periodicity). Extending SELFIES to handle these structures could enable AI-driven materials design without relying on predefined crystal structures, opening up systematic exploration of theoretical materials space.</p>
<p><strong>Beyond Organic Chemistry (Project 6)</strong>: Transition metals and main-group compounds feature complex bonding that breaks the simple two-center, two-electron model. The solution: use machine learning on large structural databases to automatically learn these complex bonding rules.</p>
<h3 id="chemical-reactions-and-programming-concepts">Chemical Reactions and Programming Concepts</h3>
<p><strong>Reaction Representations (Project 7)</strong>: Moving beyond static molecules to represent chemical transformations. A robust reaction format would enforce conservation laws and could learn reactivity patterns from large reaction datasets, improving synthesis planning.</p>
<h3 id="developing-a-100-robust-programming-language">Developing a 100% Robust Programming Language</h3>
<p><strong>Programming Language Perspective (Projects 8-9)</strong>: An intriguing reframing views molecular representations as programming languages executed by chemical parsers. This opens possibilities for adding loops, logic, and other programming concepts to efficiently describe complex structures. The ambitious goal is a Turing-complete programming language that is also 100% robust. While fascinating, it is worth critically noting that enforcing 100% syntactical robustness inherently restricts grammar flexibility. Can a purely robust string representation realistically describe highly fuzzy, delocalized electron bonds (like in Project 6) without becoming impractically long or collapsing into specialized sub-languages?</p>
<p><strong>Empirical Comparisons (Projects 10-11)</strong>: With multiple representation options (strings, matrices, images), we need systematic comparisons. The proposed benchmarks would go beyond simple validity metrics to focus on real-world design objectives in drug discovery, catalysis, and materials science.</p>
<p><strong>Human Readability (Project 12)</strong>: While SMILES is often called &ldquo;human-readable,&rdquo; this claim lacks scientific validation. The proposed study would test how well humans actually understand different molecular representations.</p>
<p><strong>Machine Learning Perspectives (Projects 13-16)</strong>: These projects explore how machines interpret molecular representations:</p>
<ul>
<li>Training networks to translate between formats to find universal representations</li>
<li>Comparing learning efficiency across different formats</li>
<li>Investigating latent space smoothness in generative models</li>
<li>Visualizing what models actually learn about molecular structure</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>Since this is a position paper outlining future research directions, standard empirical reproducibility metrics do not apply. However, the foundational tools required to pursue the proposed roadmap are open-source.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">aspuru-guzik-group/selfies</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Core SELFIES Python library, installable via <code>pip install selfies</code></td>
      </tr>
      <tr>
          <td><a href="https://arxiv.org/abs/2204.00056">arXiv:2204.00056</a></td>
          <td>Paper</td>
          <td>N/A</td>
          <td>Open-access preprint of the published Patterns article</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., Friederich, P., Gaudin, T., Gayle, A. A., Jablonka, K. M., Lameiro, R. F., Lemm, D., Lo, A., Moosavi, S. M., Nápoles-Duarte, J. M., Nigam, A., Pollice, R., Rajan, K., Schatzschneider, U., &hellip; Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. <em>Patterns</em>, <em>3</em>(10). <a href="https://doi.org/10.1016/j.patter.2022.100588">https://doi.org/10.1016/j.patter.2022.100588</a></p>
<p><strong>Publication</strong>: Patterns 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SELFIES and the future of molecular string representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ISSN</span> = <span style="color:#e6db74">{2666-3899}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://dx.doi.org/10.1016/j.patter.2022.100588}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">DOI</span> = <span style="color:#e6db74">{10.1016/j.patter.2022.100588}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Patterns}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Elsevier BV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krenn, Mario and Ai, Qianxiang and Barthel, Senja and Carson, Nessa and Frei, Angelo and Frey, Nathan C. and Friederich, Pascal and Gaudin, Théophile and Gayle, Alberto Alexander and Jablonka, Kevin Maik and Lameiro, Rafael F. and Lemm, Dominik and Lo, Alston and Moosavi, Seyed Mohamad and Nápoles-Duarte, José Manuel and Nigam, AkshatKumar and Pollice, Robert and Rajan, Kohulan and Schatzschneider, Ulrich and Schwaller, Philippe and Skreta, Marta and Smit, Berend and Strieth-Kalthoff, Felix and Sun, Chong and Tom, Gary and von Rudorff, Guido Falk and Wang, Andrew and White, Andrew and Young, Adamo and Yu, Rose and Aspuru-Guzik, Alán}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{100588}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">Original SELFIES Paper</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES Overview</a></li>
</ul>
]]></content:encoded></item><item><title>Invalid SMILES Benefit Chemical Language Models: A Study</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</link><pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</guid><description>Skinnider (2024) shows that generating invalid SMILES actually improves chemical language model performance through quality filtering.</description><content:encoded><![CDATA[<h2 id="core-contribution-repurposing-invalid-smiles">Core Contribution: Repurposing Invalid SMILES</h2>
<p>This is an <strong>Empirical</strong> paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate &ldquo;invalid&rdquo; SMILES strings is beneficial for model performance.</p>
<h2 id="the-problem-with-absolute-validity-in-chemical-lms">The Problem with Absolute Validity in Chemical LMs</h2>
<p>Prior research attempted to eliminate invalid generations using constrained representations like SELFIES. This paper demonstrates that invalid outputs serve as low-likelihood samples whose removal acts as an implicit quality filter, improving distribution learning.</p>
<h2 id="invalid-generation-as-an-implicit-quality-filter">Invalid Generation as an Implicit Quality Filter</h2>
<p>The central insight is counterintuitive: <strong>invalid SMILES generation acts as a built-in quality control mechanism</strong>. The key contributions are:</p>
<ol>
<li>
<p><strong>Empirical Evidence</strong>: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.</p>
</li>
<li>
<p><strong>Mechanistic Explanation</strong>: Invalid SMILES are demonstrated to be low-likelihood samples from the model&rsquo;s probability distribution. When these are filtered out, it&rsquo;s equivalent to removing the model&rsquo;s least confident predictions, a form of automatic quality control.</p>
</li>
<li>
<p><strong>Causal Evidence</strong>: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.</p>
</li>
<li>
<p><strong>Bias Analysis</strong>: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.</p>
</li>
</ol>
<h2 id="experimental-design-and-causal-interventions">Experimental Design and Causal Interventions</h2>
<p>The paper uses a multi-pronged approach to establish both correlation and causation:</p>
<p><strong>Performance Comparisons</strong>: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.</p>
<p><strong>Loss Analysis</strong>: The relationship between SMILES validity and model confidence was examined by analyzing the sequence loss. For a given SMILES string $S$ composed of tokens $t_1, t_2, \dots, t_N$, the negative log-likelihood acts as a proxy for the model&rsquo;s uncertainty:</p>
<p>$$ \text{NLL}(S) = -\sum_{i=1}^N \log P(t_i \mid t_1, \dots, t_{i-1}) $$</p>
<p>Invalid SMILES strings consistently register higher $\text{NLL}$ scores, meaning they represent the model&rsquo;s least confident predictions. Filtering them effectively acts as automatic quality control, providing the mechanistic explanation for why invalid filtering improves performance.</p>
<p><strong>Causal Intervention</strong>: A key experiment involved modifying the SELFIES valency constraints at two levels: first allowing pentavalent carbons (&ldquo;Texas SELFIES&rdquo;), then removing all constraints entirely (&ldquo;unconstrained SELFIES&rdquo;). This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.</p>
<p><strong>Structural Bias Analysis</strong>: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model&rsquo;s exploration of chemical space.</p>
<p><strong>Generalization Testing</strong>: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.</p>
<p><strong>Practical Application</strong>: The approach was tested on structure elucidation, using models to identify unknown molecules from minimal experimental data like mass spectrometry.</p>
<h2 id="key-findings-on-validity-constraints-and-bias">Key Findings on Validity Constraints and Bias</h2>
<p><strong>Superior Performance Across the Board</strong>: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. This performance advantage was directly correlated with the proportion of invalid SMILES generated. Models that produced more invalid outputs performed better after filtering.</p>
<p><strong>Invalid SMILES Are Low-Confidence Predictions</strong>: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model&rsquo;s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.</p>
<p><strong>Causal Evidence Through Unconstrained SELFIES</strong>: Direct causal evidence came from modifying SELFIES to allow invalid generation. When &ldquo;unconstrained SELFIES&rdquo; models could generate and discard invalid molecules, their performance improved, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.</p>
<p><strong>Validity Constraints Introduce Systematic Bias</strong>: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model&rsquo;s ability to faithfully represent chemical space.</p>
<p><strong>Reduced Generalization</strong>: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.</p>
<p><strong>Real-World Application Benefits</strong>: In structure elucidation tasks, identifying unknown molecules from experimental data like mass spectrometry, SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.</p>
<p><strong>CASMI 2022 Benchmark</strong>: The language model trained on the LOTUS database was benchmarked against 19 submissions to the CASMI 2022 competition for structure elucidation of unknown compounds. Using only accurate mass as input (no MS/MS data), the model achieved competitive performance, highlighting the practical utility of the sampling-frequency-based approach for de novo structure elucidation.</p>
<p><strong>Computational Efficiency</strong>: Filtering invalid SMILES is computationally trivial. Parsing ten million SMILES strings with RDKit takes approximately 7.5 minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Primary Architecture (LSTM):</strong> The main results rely on a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units.</p>
<ul>
<li><strong>Structure:</strong> Three-layer LSTM with a hidden layer size of 1,024 dimensions</li>
<li><strong>Embedding:</strong> An embedding layer of 128 dimensions</li>
<li><strong>Decoder:</strong> A linear decoder layer outputs token probabilities</li>
</ul>
<p><strong>Secondary Architecture (Transformer/GPT):</strong> To confirm robustness across architectures, the author also used a Generative Pretrained Transformer (GPT) architecture adapted from MolGPT.</p>
<ul>
<li><strong>Structure:</strong> Eight transformer blocks</li>
<li><strong>Internals:</strong> Each block contains eight masked self-attention heads and a feed-forward network (1,024 dimensions) using GELU activation</li>
<li><strong>Embedding:</strong> 256 dimensions, concatenated with learned positional encodings</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Optimizer:</strong> Adam optimizer for both architectures with $\beta_1=0.9$ and $\beta_2=0.999$.</p>
<p><strong>Learning Rate:</strong></p>
<ul>
<li>LSTM: 0.001</li>
<li>Transformer: 0.0005</li>
</ul>
<p><strong>Batch Size:</strong> 64</p>
<p><strong>Loss Function:</strong> Cross-entropy loss of next-token prediction.</p>
<p><strong>Stopping Criteria:</strong> Early stopping using a validation set (10% of training data) with patience of 50,000 minibatches.</p>
<h3 id="data">Data</h3>
<p><strong>Primary Source:</strong> ChEMBL database (version 28).</p>
<p><strong>Preprocessing Pipeline:</strong></p>
<ul>
<li><strong>Cleaning:</strong> Removal of duplicate SMILES, salts, and solvents (retaining heavy fragments with $\geq 3$ heavy atoms)</li>
<li><strong>Filtering:</strong> Molecules with atoms other than {Br, C, Cl, F, H, I, N, O, P, S} were removed</li>
<li><strong>Normalization:</strong> Charged molecules were neutralized and converted to canonical SMILES</li>
</ul>
<p><strong>Training Subsets:</strong> Models were trained on random samples of 30,000, 100,000, and 300,000 molecules to test scalability.</p>
<p><strong>Generalization Data:</strong> To test generalization, models were also trained on the <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> database (enumerating drug-like molecules up to 13 heavy atoms).</p>
<p><strong>Structure Elucidation Data:</strong> For practical application tasks, models were trained on natural products (LOTUS, COCONUT), food compounds (FooDB), and environmental contaminants (NORMAN).</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric:</strong> Fréchet ChemNet Distance (FCD), measuring chemical similarity between generated molecules and the training set (lower is better).</p>
<p><strong>Secondary Metrics:</strong></p>
<ul>
<li><strong>Validity:</strong> Percentage of outputs parseable by RDKit</li>
<li><strong>Scaffold Similarity:</strong> Jensen-Shannon distances between Murcko scaffold compositions</li>
<li><strong>Physical Properties:</strong> Comparisons of molecular weight, LogP, topological polar surface area (TPSA), and ring counts (aromatic vs. aliphatic)</li>
<li><strong>Structure Elucidation:</strong> &ldquo;Top-k accuracy,&rdquo; the proportion of held-out molecules where the correct structure appeared in the model&rsquo;s top $k$ ranked outputs</li>
</ul>
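<p>The top-k metric can be sketched in a few lines, ranking candidate structures by sampling frequency as the paper describes; the example data and function names below are illustrative assumptions.</p>

```python
from collections import Counter

def top_k_accuracy(cases, k):
    """cases: list of (true_smiles, sampled_smiles_list) pairs.
    A case counts as a hit if the true structure appears among the
    top-k candidates ranked by how often the model sampled them."""
    hits = 0
    for truth, sampled in cases:
        ranked = [s for s, _ in Counter(sampled).most_common()]
        if truth in ranked[:k]:
            hits += 1
    return hits / len(cases)

cases = [
    ("CCO", ["CCO", "CCO", "CCN", "CCO"]),  # truth ranked 1st
    ("CCN", ["CCO", "CCO", "CCN"]),         # truth ranked 2nd
]
print(top_k_accuracy(cases, k=1))  # → 0.5
print(top_k_accuracy(cases, k=2))  # → 1.0
```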
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Nodes:</strong> Dell EMC C4140 GPU compute nodes</li>
<li><strong>GPUs:</strong> NVIDIA Tesla V100</li>
<li><strong>Compute Time:</strong> Parsing 10 million SMILES took ~7.5 minutes on a single CPU; SELFIES models required an average of 0.6 hours longer to train than SMILES models</li>
</ul>
<h3 id="replicability">Replicability</h3>
<p><strong>Code Availability:</strong> Source code and intermediate data are available via <a href="https://doi.org/10.5281/zenodo.10680855">Zenodo</a>. Pre-trained model weights are not provided in the archive, requiring researchers to train models from scratch using the included scripts to fully replicate the study.</p>
<p><strong>Data Availability:</strong> Training datasets and generated molecule samples (10 million from ChEMBL/GDB-13 models, 100 million from LOTUS/COCONUT/FooDB/NORMAN cross-validation folds) are available via <a href="https://doi.org/10.5281/zenodo.8321735">Zenodo</a>.</p>
<p><strong>Software Libraries:</strong></p>
<ul>
<li><strong>PyTorch:</strong> LSTM and Transformer implementations</li>
<li><strong>RDKit:</strong> SMILES parsing, validity checking, and property calculation</li>
<li><strong>SELFIES:</strong> Version 2.1.1 for conversion</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10680855">Source code (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training scripts, analysis code, and intermediate data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.8321735">Training and generated molecules (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Preprocessed training sets and sampled molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="implications-and-takeaways">Implications and Takeaways</h2>
<p>This work reframes how we think about &ldquo;errors&rdquo; in generative models. The key insight is that model outputs appearing incorrect often represent low-likelihood samples whose removal improves overall performance.</p>
<p>The findings suggest that the field&rsquo;s drive toward guaranteed validity leads to systematic biases. Letting models fail informatively and using those failures as quality signals can yield better distribution learning. This is relevant as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.</p>
<p>For practitioners, the takeaway is to consider the role of invalid outputs before eliminating them. Filtering low-confidence generations provides automatic quality control that improves final results.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437-448. <a href="https://doi.org/10.1038/s42256-024-00821-x">https://doi.org/10.1038/s42256-024-00821-x</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence (2024)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{skinnider2024invalid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Invalid SMILES are beneficial rather than detrimental to chemical language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Skinnider, Michael A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{437--448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group UK London}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Notation: The Original Paper by Weininger (1988)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-original-paper/</guid><description>Weininger's 1988 paper introducing SMILES notation, a string-based molecular representation that became a standard in computational chemistry.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. <em>Journal of Chemical Information and Computer Sciences</em>, 28(1), 31-36. <a href="https://doi.org/10.1021/ci00057a005">https://doi.org/10.1021/ci00057a005</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1988</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation overview</a> - Modern usage summary</li>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES to 2D images</a> - Practical visualization tutorial</li>
</ul>
<h2 id="core-contribution-a-string-based-molecular-notation">Core Contribution: A String-Based Molecular Notation</h2>
<p>This is a <strong>Method</strong> paper that introduces a novel notation system for representing chemical structures as text strings. It establishes the encoding rules and input conventions for SMILES (Simplified Molecular Input Line Entry System), while explicitly deferring the canonicalization algorithm to subsequent papers in the series.</p>
<h2 id="the-computational-complexity-of-chemical-information-in-the-1980s">The Computational Complexity of Chemical Information in the 1980s</h2>
<p>As computers became central to chemical information processing in the 1980s, the field faced a fundamental problem: existing line notations were either too complex for chemists to use practically or too limited for computational applications. Previous systems required extensive training to write correctly and were prone to errors.</p>
<p>The goal was ambitious: create a system that could represent any molecule as a simple text string, making it both human-readable and machine-efficient. This would enable compact database storage, fast processing, and easy exchange between software systems.</p>
<h2 id="separating-input-rules-from-canonicalization">Separating Input Rules from Canonicalization</h2>
<p>Weininger&rsquo;s key insight was to separate the problem into two parts: create simple, flexible rules that chemists could easily learn for input, while deferring to the computer the complex task of generating a unique, canonical representation. This division of labor made SMILES both practical and powerful.</p>
<p>The specific innovations include:</p>
<ol>
<li><strong>Simple input rules</strong> - Chemists could write molecules intuitively (e.g., <code>CCO</code> or <code>OCC</code> for ethanol)</li>
<li><strong>Ring closure notation</strong> - Breaking one bond and marking ends with matching digits</li>
<li><strong>Implicit hydrogens</strong> - Automatic calculation based on standard valences keeps strings compact</li>
<li><strong>Algorithmic aromaticity detection</strong> - Automatic recognition of aromatic systems from Kekulé structures</li>
<li><strong>Human-readable output</strong> - Unlike binary formats, SMILES strings are readable and debuggable</li>
</ol>
<p><strong>Important scope note</strong>: This first paper in the series establishes the input syntax and encoding rules. The canonicalization algorithm (how to generate unique SMILES) is explicitly stated as the subject of following papers: &ldquo;specification of isomerisms, substructures, and unique SMILES generation are the subjects of following papers.&rdquo;</p>
<h2 id="demonstrating-notation-rules-across-molecular-classes">Demonstrating Notation Rules Across Molecular Classes</h2>
<p>The paper is primarily a specification document establishing notation rules. The methodology is demonstrated through worked examples showing how to encode various molecular structures:</p>
<ul>
<li><strong>Basic molecules</strong>: Ethane (<code>CC</code>), ethylene (<code>C=C</code>), acetylene (<code>C#C</code>)</li>
<li><strong>Branches</strong>: Isobutyric acid (<code>CC(C)C(=O)O</code>)</li>
<li><strong>Rings</strong>: Cyclohexane (<code>C1CCCCC1</code>), benzene (<code>c1ccccc1</code>)</li>
<li><strong>Aromatic systems</strong>: Tropone (<code>O=c1cccccc1</code>), quinone (showing exocyclic bond effects)</li>
<li><strong>Complex structures</strong>: Morphine (40 characters vs 1000-2000 for connection tables)</li>
<li><strong>Edge cases</strong>: Salts, isotopes, charged species, tautomers</li>
</ul>
<p>Performance comparisons are mentioned qualitatively: SMILES processing was approximately 100 times faster than traditional connection table methods on the hardware of the era (1988), with dramatic reductions in storage space.</p>
<h2 id="performance-and-practical-viability">Performance and Practical Viability</h2>
<p>The paper successfully establishes SMILES as a practical notation system with several key outcomes:</p>
<p><strong>Practical benefits</strong>:</p>
<ul>
<li><strong>Compactness</strong>: 40 characters for morphine vs 1000-2000 for connection tables</li>
<li><strong>Speed</strong>: ~100x faster processing than traditional methods</li>
<li><strong>Accessibility</strong>: Simple enough for chemists to learn without extensive training</li>
<li><strong>Machine-friendly</strong>: Efficient parsing and string-based operations</li>
</ul>
<p><strong>Design principles validated</strong>:</p>
<ul>
<li>Separating user input from canonical representation makes the system both usable and rigorous</li>
<li>Implicit hydrogens reduce string length without loss of information</li>
<li>Ring closure notation with digit markers is more intuitive than complex graph syntax</li>
<li>Automatic aromaticity detection handles most cases correctly</li>
</ul>
<p><strong>Acknowledged limitations</strong>:</p>
<ul>
<li>Canonicalization algorithm not included in this paper</li>
<li>Stereochemistry handling deferred to subsequent papers</li>
<li>Some edge cases (like unusual valence states) require explicit specification</li>
</ul>
<p>The paper concludes by positioning SMILES as a foundation for database storage, substructure searching, and chemical informatics applications - a vision that proved accurate as SMILES became one of the most widely used molecular representations in computational chemistry.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To implement the method described in this paper, the following look-up tables and algorithms are required. <strong>Note</strong>: These details are critical for replication but are often glossed over in high-level summaries.</p>
<h3 id="1-the-valence-look-up-table">1. The Valence Look-Up Table</h3>
<p>To calculate implicit hydrogens, the system assumes the &ldquo;lowest normal valence&rdquo; greater than or equal to the explicit bond count. The paper explicitly defines these valences:</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Allowed Valences</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>B</td>
          <td>3</td>
      </tr>
      <tr>
          <td>C</td>
          <td>4</td>
      </tr>
      <tr>
          <td>N</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>O</td>
          <td>2</td>
      </tr>
      <tr>
          <td>P</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>S (aliphatic)</td>
          <td>2, 4, 6</td>
      </tr>
      <tr>
          <td>S (aromatic)</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>F, Cl, Br, I</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p><strong>Example</strong>: For sulfur in $\text{H}_2\text{SO}_4$ written as <code>OS(=O)(=O)O</code>, the explicit bond count is 6 (two single bonds + two double bonds to four oxygens), so the system uses valence 6 with zero implicit hydrogens. Without knowing S allows valence 6, the algorithm would fail.</p>
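<p>The valence look-up and implicit-hydrogen rule can be sketched directly in code. This is a minimal illustration of the rule as stated above (aliphatic sulfur only; the function name and structure are my own, not from the paper):</p>

```python
# Implicit-hydrogen rule from the valence table: use the lowest
# "normal valence" that is >= the explicit bond-order sum.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6),  # aliphatic sulfur
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(element, bond_order_sum):
    """Number of implicit hydrogens for an organic-subset atom."""
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bonds exceed every listed valence: add no hydrogens
```

<p>With this rule, a bare <code>C</code> (bond-order sum 0) receives 4 hydrogens (methane), and the sulfur in <code>OS(=O)(=O)O</code> (bond-order sum 6) receives none.</p>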
<h3 id="2-explicit-hydrogen-requirements">2. Explicit Hydrogen Requirements</h3>
<p>The paper lists exactly three cases where hydrogen atoms are retained (not suppressed):</p>
<ol>
<li><strong>Hydrogen connected to other hydrogen</strong> (molecular hydrogen, $\text{H}_2$, written as <code>[H][H]</code>)</li>
<li><strong>Hydrogen connected to zero or more than one other atom</strong> (bridging hydrogens, isolated protons)</li>
<li><strong>Isotopic hydrogen specifications</strong> in isomeric SMILES (deuterium <code>[2H]</code>, tritium <code>[3H]</code>)</li>
</ol>
<p>For all other cases, hydrogens are implicit and calculated from the valence table.</p>
<h3 id="3-ring-closure-notation">3. Ring Closure Notation</h3>
<p>Standard SMILES supports single digits <code>1-9</code> for ring closures. For rings numbered 10 and higher, the notation requires a <strong>percent sign prefix</strong>:</p>
<ul>
<li>Ring closures 1-9: <code>C1CCCCC1</code></li>
<li>Ring closures 10+: <code>C%10CCCCC%10</code>, <code>C2%13%24</code> (ring 2, ring 13, ring 24)</li>
</ul>
<p>Without this rule, a parser would fail on large polycyclic structures.</p>
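<p>A tiny scanner for ring-closure labels shows how the <code>%</code> rule resolves the ambiguity. This sketch assumes bracket atoms (e.g. <code>[13C]</code>) have already been stripped, since digits inside brackets are isotopes or charges, not ring bonds:</p>

```python
import re

# Ring-closure labels: a single digit 1-9, or %NN for two-digit labels.
# Sketch only: assumes bracket atoms were removed beforehand.
RING_LABEL = re.compile(r"%(\d\d)|(\d)")

def ring_closure_labels(smiles):
    return [int(two or one) for two, one in RING_LABEL.findall(smiles)]
```

<p>For example, <code>C2%13%24</code> yields the labels 2, 13, and 24, while a naive single-digit scan would misread it as 2, 1, 3, 2, 4.</p>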
<h3 id="4-aromaticity-detection-algorithm">4. Aromaticity Detection Algorithm</h3>
<p>The system uses an extended version of Hückel&rsquo;s Rule ($4N+2$ π-electrons). The &ldquo;excess electron&rdquo; count for the aromatic system is determined by these rules:</p>
<p><strong>Carbon contribution</strong>:</p>
<ul>
<li><strong>C in aromatic ring</strong>: Contributes 1 electron</li>
<li><strong>C double-bonded to exocyclic electronegative atom</strong> (e.g., $\text{C}=\text{O}$ in quinone): Contributes 0 electrons (the carbon &ldquo;loses&rdquo; its electron to the oxygen)</li>
</ul>
<p><strong>Heteroatom contribution</strong>:</p>
<ul>
<li><strong>O, S in ring</strong>: Contributes 2 electrons (lone pair)</li>
<li><strong>N in ring</strong>: Contributes 1 electron (pyridine-like) or 2 electrons (pyrrole-like, must have explicit hydrogen <code>[nH]</code>)</li>
</ul>
<p><strong>Charge effects</strong>:</p>
<ul>
<li><strong>Positive charge</strong>: Reduces electron count by 1</li>
<li><strong>Negative charge</strong>: Increases electron count by 1</li>
</ul>
<p><strong>Critical example - Quinone</strong>:</p>
<pre tabindex="0"><code>O=C1C=CC(=O)C=C1
</code></pre><p>Quinone has 6 carbons in the ring, but the two carbons bonded to exocyclic oxygens contribute 0 electrons each. The four remaining carbons contribute 4 electrons total (not 6), so quinone is <strong>not aromatic</strong> by this algorithm. This exocyclic bond rule is essential for correct aromaticity detection.</p>
<p><strong>Aromatic ring test</strong>:</p>
<ol>
<li>All atoms must be sp² hybridized</li>
<li>Count excess electrons using the rules above</li>
<li>Calculate whether the system complies with Hückel&rsquo;s parity rule constraint:
$$ \text{Excess Electrons} \equiv 2 \pmod 4 \iff \text{Excess Electrons} = 4N + 2 $$
If the electron count satisfies this property for some integer $N$, the ring is determined to be aromatic.</li>
</ol>
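<p>The counting rules can be condensed into a short function. The per-atom contributions below are exactly the ones listed in this section; sp&sup2; detection and ring perception are omitted (this is a sketch of the counting step, not the full algorithm):</p>

```python
# Excess-electron counting for one ring, per the contribution rules above.
# Atom kinds: 'C', 'C=X' (carbon double-bonded to an exocyclic
# electronegative atom), 'O', 'S', 'n' (pyridine-like), '[nH]' (pyrrole-like).
CONTRIBUTION = {"C": 1, "C=X": 0, "O": 2, "S": 2, "n": 1, "[nH]": 2}

def ring_excess_electrons(atoms):
    """atoms: list of (kind, formal_charge); a + charge removes an electron."""
    return sum(CONTRIBUTION[kind] - charge for kind, charge in atoms)

def is_huckel_aromatic(excess_electrons):
    # 4N + 2  <=>  excess electrons congruent to 2 (mod 4)
    return excess_electrons % 4 == 2
```

<p>Benzene counts 6 electrons (aromatic); quinone's ring counts 0 + 0 + 1 + 1 + 1 + 1 = 4 (not aromatic), reproducing the example above.</p>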
<h2 id="encoding-rules-reference">Encoding Rules Reference</h2>
<p>The following sections provide a detailed reference for the six fundamental SMILES encoding rules. These are the rules a user would apply when writing SMILES strings.</p>
<h3 id="1-atoms">1. Atoms</h3>
<p>Atoms use their standard chemical symbols. Elements in the &ldquo;organic subset&rdquo; (B, C, N, O, P, S, F, Cl, Br, I) can be written directly when they have their most common valence - so <code>C</code> automatically means a carbon with enough implicit hydrogens to satisfy its valence.</p>
<p>Everything else goes in square brackets: <code>[Au]</code> for gold, <code>[NH4+]</code> for ammonium ion, or <code>[13C]</code> for carbon-13. Aromatic atoms get lowercase letters: <code>c</code> for aromatic carbon in benzene.</p>
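<p>These two atom classes imply a simple tokenization order, sketched below: bracket atoms are taken whole, and two-letter symbols (Cl, Br) must be tried before one-letter ones. This is an illustration of the atom rule, not a complete SMILES lexer:</p>

```python
import re

# Atom tokenizer sketch: bracket atoms first, then Cl/Br before the
# one-letter organic subset, then aromatic lowercase forms.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def atom_tokens(smiles):
    return ATOM.findall(smiles)
```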
<h3 id="2-bonds">2. Bonds</h3>
<p>Bond notation is straightforward:</p>
<ul>
<li><code>-</code> for single bonds (usually omitted)</li>
<li><code>=</code> for double bonds</li>
<li><code>#</code> for triple bonds</li>
<li><code>:</code> for aromatic bonds (also usually omitted)</li>
</ul>
<p>So <code>CC</code> and <code>C-C</code> both represent ethane, while <code>C=C</code> is ethylene.</p>
<h3 id="3-branches">3. Branches</h3>
<p>Branches use parentheses, just like in mathematical expressions. Isobutyric acid becomes <code>CC(C)C(=O)O</code> - the main chain is <code>CCC(=O)O</code>, with the methyl <code>(C)</code> branch attached to the second carbon.</p>
<h3 id="4-rings">4. Rings</h3>
<p>This is where SMILES gets clever. You break one bond and mark both ends with the same digit. Cyclohexane becomes <code>C1CCCCC1</code> - the <code>1</code> connects the first and last carbon, closing the ring.</p>
<p>You can reuse digits for different rings in the same molecule, making complex structures manageable.</p>
<h3 id="5-disconnected-parts">5. Disconnected Parts</h3>
<p>Salts and other disconnected structures use periods. Sodium phenoxide: <code>[Na+].[O-]c1ccccc1</code>. The order doesn&rsquo;t matter - you&rsquo;re just listing the separate components.</p>
<h3 id="6-aromaticity">6. Aromaticity</h3>
<p>Aromatic rings can be written directly with lowercase letters. Benzoic acid becomes <code>c1ccccc1C(=O)O</code>. The system can also detect aromaticity automatically from Kekulé structures, so <code>C1=CC=CC=C1C(=O)O</code> works just as well.</p>
<h3 id="simplified-subset-for-organic-chemistry">Simplified Subset for Organic Chemistry</h3>
<p>Weininger recognized that most chemists work primarily with organic compounds, so he defined a simplified subset that covers the vast majority of cases. For organic molecules, you only need four rules:</p>
<ol>
<li><strong>Atoms</strong>: Use standard symbols (C, N, O, etc.)</li>
<li><strong>Multiple bonds</strong>: Use <code>=</code> and <code>#</code> for double and triple bonds</li>
<li><strong>Branches</strong>: Use parentheses <code>()</code></li>
<li><strong>Rings</strong>: Use matching digits</li>
</ol>
<p>This &ldquo;basic SMILES&rdquo; covers the vast majority of organic compounds, making the system immediately accessible without having to learn all the edge cases.</p>
<h2 id="design-decisions-and-edge-cases">Design Decisions and Edge Cases</h2>
<p>Beyond the basic rules, the paper established several important conventions for handling ambiguous cases:</p>
<h3 id="hydrogen-handling">Hydrogen Handling</h3>
<p>Hydrogens are usually implicit - the system calculates how many each atom needs based on standard valences. So <code>C</code> represents CH₄, <code>N</code> represents NH₃, and so on. This keeps strings compact and readable.</p>
<p>Explicit hydrogens only appear in special cases: when hydrogen connects to multiple atoms, when you need to specify an exact count, or in isotopic specifications like <code>[2H]</code> for deuterium.</p>
<h3 id="bond-representation">Bond Representation</h3>
<p>The paper made an important choice about how to represent bonds in ambiguous cases. For example, nitromethane could be written as charge-separated <code>C[N+](=O)[O-]</code> or with covalent double bonds <code>CN(=O)=O</code>. Weininger chose to prefer the covalent form when possible, because it preserves the correct topological symmetry.</p>
<p>However, when covalent representation would require unusual valences, charge separation is preferred. Diazomethane becomes <code>C=[N+]=[N-]</code> to avoid forcing carbon into an unrealistic valence state.</p>
<h3 id="tautomers">Tautomers</h3>
<p>SMILES doesn&rsquo;t try to be too clever about tautomers - it represents exactly what you specify. So 2-pyridone can be written as either the enol form <code>Oc1ncccc1</code> or the keto form <code>O=c1[nH]cccc1</code>. The system won&rsquo;t automatically convert between them.</p>
<p>This explicit approach means you need to decide which tautomeric form to represent, but it also means the notation precisely captures what you intend.</p>
<h3 id="aromaticity-detection">Aromaticity Detection</h3>
<p>One of the most sophisticated parts of the original system was automatic aromaticity detection. The algorithm uses an extended Hückel rule: a ring is aromatic if all atoms are sp² hybridized and it contains 4N+2 π-electrons.</p>
<p>This means you can input benzene as the Kekulé structure <code>C1=CC=CC=C1</code> and the system will automatically recognize it as aromatic and convert it to <code>c1ccccc1</code>. The algorithm handles complex cases like tropone (<code>O=c1cccccc1</code>) and correctly identifies them as aromatic.</p>
<h3 id="aromatic-nitrogen">Aromatic Nitrogen</h3>
<p>The system makes an important distinction for nitrogen in aromatic rings. Pyridine-type nitrogen (like in pyridine itself) is written as <code>n</code> and has no attached hydrogens. Pyrrole-type nitrogen has an attached hydrogen that must be specified explicitly: <code>[nH]1cccc1</code> for pyrrole.</p>
<p>This distinction captures the fundamental difference in electron contribution between these two nitrogen types in aromatic systems.</p>
<h2 id="impact-and-legacy">Impact and Legacy</h2>
<p>Nearly four decades later, SMILES remains one of the most widely used molecular notations in computational chemistry. The notation became the foundation for:</p>
<ul>
<li><strong>Database storage</strong> - Compact, searchable molecular representations</li>
<li><strong>Substructure searching</strong> - Pattern matching in chemical databases</li>
<li><strong>Property prediction</strong> - Input format for QSAR models</li>
<li><strong>Chemical informatics</strong> - Standard exchange format between software</li>
<li><strong>Modern ML</strong> - Text-based representation for neural networks</li>
</ul>
<p>While newer approaches like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> have addressed some limitations (like the possibility of invalid strings), SMILES&rsquo; combination of simplicity and power has made it enduringly useful.</p>
<p>The paper established both a notation system and a design philosophy: chemical informatics tools should be powerful enough for computers while remaining accessible to working chemists. That balance remains relevant today as we develop new molecular representations for machine learning and AI applications.</p>
]]></content:encoded></item><item><title>SELFIES: The Original Paper on Robust Molecular Strings</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-original-paper/</guid><description>The 2020 paper introducing SELFIES, the 100% robust molecular representation that solves SMILES validity problems in ML applications.</description><content:encoded><![CDATA[<h2 id="contribution-a-100-robust-representation-for-ml">Contribution: A 100% Robust Representation for ML</h2>
<p>This is a <strong>Method</strong> paper that introduces a new molecular string representation designed specifically for machine learning applications.</p>
<h2 id="motivation-the-invalidity-bottleneck">Motivation: The Invalidity Bottleneck</h2>
<p>When neural networks generate molecules using <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES notation</a>, a large fraction of the output strings are invalid, containing either syntax errors or chemically impossible structures. This was a fundamental bottleneck: if your generative model produces a large fraction of invalid molecules, you are wasting computational effort and severely limiting chemical space exploration.</p>
<h2 id="novelty-a-formal-grammar-approach">Novelty: A Formal Grammar Approach</h2>
<p>The authors&rsquo; key insight was using a <strong>formal grammar approach</strong> (specifically, a Chomsky type-2, context-free grammar with self-referencing functions) where each symbol is interpreted based on chemical context. The &ldquo;state of the derivation&rdquo; tracks available valence bonds, preventing impossible structures like a carbon with five single bonds.</p>
<p>For example, generating 2-Fluoroethenimine (<code>FC=C=N</code>) follows a state derivation where each step restricts the available valency for the next element:</p>
<p>$$
\mathbf{X}_0 \xrightarrow{[F]} \text{F } \mathbf{X}_1 \xrightarrow{[=C]} \text{FC } \mathbf{X}_3 \xrightarrow{[=C]} \text{FC=C } \mathbf{X}_2 \xrightarrow{[\#N]} \text{FC=C=N}
$$</p>
<p>This approach guarantees 100% validity: every SELFIES string corresponds to a valid molecule, and every valid molecule can be represented.</p>
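<p>A toy decoder makes the state mechanics concrete. This is a deliberately simplified sketch of the idea, not the actual SELFIES grammar: it handles only linear chains of a few elements, with no rings or branches:</p>

```python
# Toy SELFIES-style decoder: the state X_n is the number of bonds still
# available at the growth point; each symbol's requested bond order is
# capped by it, so every symbol sequence yields a valid chain.
VALENCE = {"F": 1, "O": 2, "N": 3, "C": 4}
BOND = {1: "", 2: "=", 3: "#"}

def decode(symbols):
    """symbols: list of (requested_bond_order, element) -> SMILES chain."""
    out, state = [], None
    for requested, element in symbols:
        if state is None:                  # first atom has no incoming bond
            order = 0
        else:
            order = min(requested, state)  # cap by the remaining valence
            if order == 0:
                break                      # state X_0: chain cannot grow
            out.append(BOND[order])
        out.append(element)
        state = VALENCE[element] - order   # new state X_n
    return "".join(out)
```

<p>Decoding the symbol sequence as <code>decode([(1, "F"), (2, "C"), (2, "C"), (3, "N")])</code> reproduces the derivation above: the first <code>[=C]</code> request is capped to a single bond by fluorine's state X<sub>1</sub>, and the final <code>[#N]</code> request is capped to a double bond by state X<sub>2</sub>.</p>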
<h2 id="methodology--experiments-validating-robustness">Methodology &amp; Experiments: Validating Robustness</h2>
<p>The authors ran several experiments to demonstrate SELFIES&rsquo; robustness:</p>
<h3 id="random-mutation-test">Random Mutation Test</h3>
<p>They took the SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations of MDMA and introduced random changes:</p>
<ul>
<li><strong>SMILES</strong>: After just one random mutation, only 9.9% of strings remained valid (dropping to 1.1% after three mutations).</li>
<li><strong>SELFIES</strong>: 100% of mutated strings still represented valid molecules (though different from the original).</li>
</ul>
<p>This empirical difference demonstrates why SELFIES is well suited for evolutionary algorithms and genetic programming approaches to molecular design, where random mutations of strings are a core operation.</p>
<h3 id="generative-model-performance">Generative Model Performance</h3>
<p>The real test came with actual machine learning models. The authors trained Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) on both representations:</p>
<p><strong>VAE Results:</strong></p>
<ul>
<li>SMILES-based VAE: Large invalid regions scattered throughout the latent space</li>
<li>SELFIES-based VAE: Every point in the continuous latent space mapped to a valid molecule</li>
<li>The SELFIES model encoded <strong>over 100 times more diverse molecules</strong></li>
</ul>
<p><strong>GAN Results:</strong></p>
<ul>
<li>Best SMILES GAN: 18.6% diverse, valid molecules</li>
<li>Best SELFIES GAN: 78.9% diverse, valid molecules</li>
</ul>
<p><strong>Evaluation Metrics:</strong></p>
<ul>
<li><strong>Validity</strong>: Percentage of generated strings representing valid molecular structures</li>
<li><strong>Diversity</strong>: Number of unique valid molecules produced</li>
<li><strong>Reconstruction Accuracy</strong>: How well the autoencoder reproduced input molecules</li>
</ul>
<h3 id="scalability-test">Scalability Test</h3>
<p>The authors showed SELFIES works beyond toy molecules by successfully encoding and decoding all <strong>72 million molecules</strong> from the PubChem database (with fewer than 500 SMILES characters per molecule), demonstrating practical applicability to real chemical databases.</p>
<h2 id="results--conclusions-chemical-space-exploration">Results &amp; Conclusions: Chemical Space Exploration</h2>
<p><strong>Key Findings:</strong></p>
<ul>
<li>SELFIES achieves 100% validity guarantee: every string represents a valid molecule</li>
<li>SELFIES-based VAEs encode over 100x more diverse molecules than SMILES-based models</li>
<li>SELFIES-based GANs produce 78.9% diverse valid molecules vs. 18.6% for SMILES GANs</li>
<li>Successfully validated on all 72 million PubChem molecules</li>
</ul>
<p><strong>Limitations Acknowledged:</strong></p>
<ul>
<li>No standardization or canonicalization method at time of publication</li>
<li>The initial grammar covered only small biomolecules; extensions for stereochemistry, ions, polyvalency, and full periodic table coverage were planned</li>
<li>Requires community testing and adoption</li>
</ul>
<p><strong>Impact:</strong></p>
<p>This work demonstrated that designing ML-native molecular representations could enable new approaches in drug discovery and materials science. SELFIES was subsequently evaluated as an alternative input representation to SMILES in <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, a transformer pretrained on molecular strings for property prediction, where it performed comparably to SMILES on the Tox21 benchmark, though the comparison was limited to a single task.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The machine learning experiments used two distinct datasets:</p>
<ul>
<li><strong>QM9</strong> (134k molecules): Primary training dataset for VAE and GAN models</li>
<li><strong>PubChem</strong> (72M molecules): Used only to test representation coverage and scalability; not used for model training</li>
</ul>
<h3 id="models">Models</h3>
<p>The VAE implementation included:</p>
<ul>
<li><strong>Latent space</strong>: 241-dimensional with Gaussian distributions</li>
<li><strong>Input encoding</strong>: One-hot encoding of SELFIES/SMILES strings</li>
<li>Full architectural details (encoder/decoder structures, layer types) provided in Supplementary Information</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The authors found GAN performance was highly sensitive to hyperparameter selection:</p>
<ul>
<li>Searched <strong>200 different hyperparameter configurations</strong> to achieve the reported 78.9% diversity</li>
<li>Specific optimizers, learning rates, and training duration detailed in Supplementary Information</li>
<li>Full rule generation algorithm provided in Table 2</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>All models evaluated on:</p>
<ul>
<li><strong>Validity rate</strong>: Percentage of syntactically and chemically valid outputs</li>
<li><strong>Diversity</strong>: Count of unique valid molecules generated</li>
<li><strong>Reconstruction accuracy</strong>: Fidelity of autoencoder reconstruction (VAEs only)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training performed on the SciNet supercomputing infrastructure.</li>
<li>The paper does not specify GPU types or training times.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation; has evolved significantly since the original paper</td>
      </tr>
  </tbody>
</table>
<h3 id="replication-resources">Replication Resources</h3>
<p>Complete technical replication is highly accessible due to the paper being published open-access in <em>Machine Learning: Science and Technology</em>. It primarily requires:</p>
<ul>
<li>The full rule generation algorithm (Table 2 in paper)</li>
<li>Code: <a href="https://github.com/aspuru-guzik-group/selfies">https://github.com/aspuru-guzik-group/selfies</a></li>
<li>Supplementary Information for complete architectural and hyperparameter specifications</li>
</ul>
<p><strong>Note</strong>: The <a href="/notes/chemistry/molecular-representations/notations/selfies/">modern SELFIES library</a> has evolved significantly since this foundational paper, addressing many of the implementation challenges identified by the authors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024. <a href="https://doi.org/10.1088/2632-2153/aba947">https://doi.org/10.1088/2632-2153/aba947</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn_2020,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/aba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1088%2F2632-2153%2Faba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{{IOP} Publishing}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{045024}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Mario Krenn and Florian H{\&#34;{a}}se and AkshatKumar Nigam and Pascal Friederich and Alan Aspuru-Guzik}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Self-referencing embedded strings ({SELFIES}): A 100{\%} robust molecular string representation}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">Modern SELFIES Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>RInChI: The Reaction International Chemical Identifier</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/rinchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/rinchi/</guid><description>RInChI extends InChI to create unique, machine-readable identifiers for chemical reactions and database searching.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-scope">Paper Classification and Scope</h2>
<p>This is an <strong>infrastructure/resource paper</strong> combined with a <strong>methods paper</strong>. It establishes a standard format, releases an open-source software library, and enables large-scale database operations. The methods component details the specific algorithmic rules for constructing identifiers through hashing, sorting, and layering.</p>
<h2 id="the-need-for-standardized-reaction-identifiers">The Need for Standardized Reaction Identifiers</h2>
<p>While we have excellent standards for identifying individual molecules (like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>), there was no equivalent for chemical reactions. This creates real problems:</p>
<ul>
<li>Different researchers working on the same reaction might describe it completely differently</li>
<li>Searching large reaction databases becomes nearly impossible</li>
<li>No way to check if two apparently different reaction descriptions are actually the same process</li>
<li>Chemical databases can&rsquo;t easily link related reactions or identify duplicates</li>
</ul>
<p>If a reaction converts &ldquo;starting material A + reagent B to product C,&rdquo; it is difficult to determine whether that is identical to another researcher&rsquo;s description of the same transformation using different names or graphical representations. A working group was established in 2008 to address this, producing prototype versions at the University of Cambridge starting in 2011. The first official release (RInChI V1.00) was funded by the InChI Trust.</p>
<h2 id="core-innovation-standardizing-reaction-strings">Core Innovation: Standardizing Reaction Strings</h2>
<p>RInChI solves this by creating a standardized, machine-readable label for any chemical reaction. The key insight is to focus on the essential chemistry while ignoring experimental details that can vary between labs.</p>
<h3 id="core-principles">Core Principles</h3>
<p>RInChI captures three fundamental pieces of information:</p>
<ol>
<li><strong>Starting materials</strong>: What molecules you begin with</li>
<li><strong>Products</strong>: What molecules you end up with</li>
<li><strong>Agents</strong>: Substances present at both the beginning and end (catalysts, solvents, etc.)</li>
</ol>
<p>Importantly, RInChI intentionally excludes experimental conditions like temperature, pressure, yield, or reaction time. These details can vary significantly even for identical chemical transformations, so including them would make it nearly impossible for different researchers to generate the same identifier.</p>
<h3 id="how-rinchi-works">How RInChI Works</h3>
<h4 id="the-rinchi-string-structure">The RInChI String Structure</h4>
<p>A RInChI string has six distinct layers. Crucially, <strong>Layers 2 and 3 are assigned alphabetically</strong>. This is essential for generating consistent identifiers.</p>
<p><strong>Layer 1: Version</strong></p>
<ul>
<li>Standard header defining the RInChI version (e.g., <code>RInChI=1.00.1S</code>)</li>
</ul>
<p><strong>Layers 2 &amp; 3: Component Molecules</strong></p>
<ul>
<li>These layers contain the InChI strings of reaction participants (reactants and products)</li>
<li><strong>Sorting Rule</strong>: The distinct groups (Reactant Group vs. Product Group) are sorted alphabetically as aggregate strings. The group that comes first alphabetically becomes <strong>Layer 2</strong>; the other becomes <strong>Layer 3</strong></li>
<li>This means if a product&rsquo;s InChI is alphabetically &ldquo;earlier&rdquo; than the reactant&rsquo;s, the product goes in Layer 2</li>
<li><strong>Formatting</strong>: Molecules within a layer are separated by <code>!</code>. The two layers are separated by <code>&lt;&gt;</code></li>
</ul>
<p><strong>Layer 4: Agents</strong></p>
<ul>
<li>Contains catalysts, solvents, and any molecule found in <em>both</em> the reactant and product input lists</li>
<li><strong>Algorithmic rule</strong>: Anything appearing in both the reactant list and product list must be removed from both and added to Layer 4</li>
</ul>
<p><strong>Layer 5: Direction (The Decoder)</strong></p>
<ul>
<li>This layer determines which component layer represents the starting material:
<ul>
<li><code>/d+</code>: Layer 2 is the Starting Material (forward direction)</li>
<li><code>/d-</code>: Layer 3 is the Starting Material (reverse direction)</li>
<li><code>/d=</code>: Equilibrium reaction</li>
</ul>
</li>
<li>Without this layer, you cannot tell which component layer holds the reactants and which holds the products</li>
</ul>
<p><strong>Layer 6: No-Structure Data</strong></p>
<ul>
<li>Format: <code>/uA-B-C</code> where the numbers indicate the count of structureless materials in Layer 2, Layer 3, and Layer 4 respectively</li>
<li>Used when substances lack defined structures and cannot be represented by InChI</li>
</ul>
<h3 id="separator-syntax">Separator Syntax</h3>
<p>For parsing or generating RInChI strings, the separator characters are:</p>
<table>
  <thead>
      <tr>
          <th>Separator</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>/</code></td>
          <td>Separates layers</td>
      </tr>
      <tr>
          <td><code>!</code></td>
          <td>Separates molecules within a layer</td>
      </tr>
      <tr>
          <td><code>&lt;&gt;</code></td>
          <td>Separates the component groups (Layers 2, 3, and 4)</td>
      </tr>
  </tbody>
</table>
<h3 id="example-structure">Example Structure</h3>
<pre><code>RInChI=1.00.1S/[Layer2 InChIs]&lt;&gt;[Layer3 InChIs]&lt;&gt;[Agent InChIs]/d+/u0-0-0
</code></pre>
<p>This systematic approach ensures that any researcher starting with the same reaction will generate an identical RInChI string.</p>
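<p>The layer-assignment rules above can be sketched in a few lines of Python. This is a simplified illustration, not the official InChI Trust algorithm: <code>build_rinchi</code> is a hypothetical helper, and real RInChI generation additionally strips InChI prefixes, handles equilibrium reactions, and counts no-structure components.</p>

```python
def build_rinchi(reactants, products, version="RInChI=1.00.1S"):
    """Sketch of RInChI layer assembly from lists of InChI strings.

    Simplified: assumes all components have structures (so the /u
    layer is always 0-0-0) and omits RAuxInfo and equilibrium cases.
    """
    reactants, products = list(reactants), list(products)

    # Layer 4 rule: anything present in BOTH input lists is an agent
    # and is removed from both sides.
    agents = sorted(set(reactants) & set(products))
    reactants = sorted(r for r in reactants if r not in agents)
    products = sorted(p for p in products if p not in agents)

    # Layers 2 & 3 rule: the two groups are ordered alphabetically as
    # aggregate strings; the direction flag then records which layer
    # holds the starting materials.
    group_r = "!".join(reactants)
    group_p = "!".join(products)
    if group_r <= group_p:
        layer2, layer3, direction = group_r, group_p, "d+"
    else:
        layer2, layer3, direction = group_p, group_r, "d-"

    body = "<>".join([layer2, layer3, "!".join(agents)])
    return f"{version}/{body}/{direction}/u0-0-0"
```

<p>With toy placeholder strings, <code>build_rinchi(["B", "A", "cat"], ["C", "cat"])</code> moves the shared component to the agent layer and sorts the rest, yielding <code>RInChI=1.00.1S/A!B&lt;&gt;C&lt;&gt;cat/d+/u0-0-0</code>.</p>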
<h3 id="rinchikeys-shorter-identifiers-for-practical-use">RInChIKeys: Shorter Identifiers for Practical Use</h3>
<p>Since full RInChI strings can become extremely long, the standard includes three types of shorter, hashed keys for different applications:</p>
<h4 id="long-rinchikey">Long-RInChIKey</h4>
<ul>
<li>Contains complete InChIKeys for every molecule in the reaction</li>
<li>Variable length, but allows searching for reactions containing specific compounds</li>
<li>Useful for component searches: &ldquo;Show me all reactions involving compound X&rdquo;</li>
</ul>
<h4 id="short-rinchikey">Short-RInChIKey</h4>
<ul>
<li>Fixed length (63 characters): 55 letters plus eight hyphens</li>
<li>Generated by separately hashing the major InChI layers (molecular formula and connectivity) of layers two, three, and four into ten-character strings, then hashing the minor layers (stereochemistry) and protonation states into five-character groups</li>
<li>Suitable for exact matching, database indexing, and linking identical reactions across different databases</li>
</ul>
<h4 id="web-rinchikey">Web-RInChIKey</h4>
<ul>
<li>Shortest format (47 characters)</li>
<li>Generated by combining all InChIs from every layer, removing duplicates, sorting alphabetically, then hashing the major layers into a seventeen-character block and the minor layers into a twelve-character block, with a protonation indicator</li>
<li>Ignores molecular roles (reactant vs. product), making it useful for finding related reactions where a molecule&rsquo;s role might differ between studies</li>
<li>Good for discovering &ldquo;reverse&rdquo; reactions, comparing databases with different drawing models, or finding alternative synthetic routes</li>
</ul>
<h2 id="experimental-validation-and-software-implementation">Experimental Validation and Software Implementation</h2>
<p>This infrastructure paper focuses on developing and validating the RInChI standard. The validation approach includes:</p>
<ul>
<li><strong>Software implementation</strong>: Development of the official RInChI software library capable of parsing reaction files and generating identifiers</li>
<li><strong>Format testing</strong>: Validation that the system correctly handles standard reaction file formats (<code>.RXN</code>, <code>.RD</code>)</li>
<li><strong>Consistency verification</strong>: Ensuring identical reactions produce identical RInChI strings regardless of input variations</li>
<li><strong>Key generation</strong>: Testing all three RInChIKey variants (Long, Short, Web) for different use cases</li>
<li><strong>Database integration</strong>: Demonstrating practical application in reaction database management. A database of over one million RInChIs was assembled using data that NextMove Software extracted from the patent literature, available at www-rinchi.ch.cam.ac.uk</li>
</ul>
<h2 id="impact-on-chemical-database-analytics">Impact on Chemical Database Analytics</h2>
<h3 id="practical-applications">Practical Applications</h3>
<p>RInChI enables systematic organization and analysis of chemical reactions:</p>
<h4 id="database-management">Database Management</h4>
<p>RInChI enables systematic organization of reaction databases. You can:</p>
<ul>
<li>Automatically identify and merge duplicate reaction entries</li>
<li>Find all variations of a particular transformation</li>
<li>Link related reactions across different data sources</li>
</ul>
<h4 id="reaction-analysis">Reaction Analysis</h4>
<p>With standardized identifiers, you can perform large-scale analysis:</p>
<ul>
<li>Identify the most commonly used reagents or catalysts</li>
<li>Find cases where identical starting materials yield different products</li>
<li>Analyze reaction trends and patterns across entire databases</li>
</ul>
<h4 id="multi-step-synthesis-representation">Multi-Step Synthesis Representation</h4>
<p>RInChI can represent complex, multi-step syntheses as single combined identifiers, making it easier to analyze and compare different synthetic routes.</p>
<h4 id="research-integration">Research Integration</h4>
<p>The standard enables better collaboration by ensuring different research groups can generate identical identifiers for the same chemical processes, facilitating data sharing and literature analysis.</p>
<h3 id="limitations-and-considerations">Limitations and Considerations</h3>
<h4 id="what-gets-lost">What Gets Lost</h4>
<p>Since RInChI builds on the Standard InChI for individual molecules, it inherits certain limitations:</p>
<ul>
<li><strong>Tautomers</strong>: Different tautomeric forms are treated as identical</li>
<li><strong>Stereochemistry</strong>: Relative stereochemical relationships aren&rsquo;t captured</li>
<li><strong>Experimental conditions</strong>: Temperature, pressure, yield, and reaction time are intentionally excluded</li>
</ul>
<h4 id="the-trade-off">The Trade-off</h4>
<p>This is an intentional feature. By focusing on core chemical identity, RInChI achieves its primary goal: ensuring that different researchers working on the same fundamental transformation generate the same identifier.</p>
<h3 id="implementation-and-tools">Implementation and Tools</h3>
<h4 id="official-software">Official Software</h4>
<p>The RInChI software, available from the InChI Trust, handles the practical details:</p>
<ul>
<li>Accepts standard reaction file formats (<code>.RXN</code>, <code>.RD</code>)</li>
<li>Generates RInChI strings, all three RInChIKey variants, and auxiliary information</li>
<li>Automates the complex process of creating consistent identifiers</li>
</ul>
<h4 id="rauxinfo-preserving-visual-information">RAuxInfo: Preserving Visual Information</h4>
<p>While RInChI discards graphical information (atom coordinates, drawing layout), the software can generate supplementary &ldquo;RAuxInfo&rdquo; strings that preserve this data. This allows reconstruction of the original visual representation when needed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>RInChI development continues to evolve:</p>
<ul>
<li><strong>Integration</strong>: Plans for compatibility with other emerging standards like <a href="/notes/chemistry/molecular-representations/notations/mixfile-minchi/">MInChI for chemical mixtures</a></li>
<li><strong>Extended applications</strong>: Work on representing complex, multi-component reaction systems</li>
<li><strong>Software development</strong>: Tools for generating graphical representations directly from RInChI without auxiliary information</li>
</ul>
<h3 id="key-takeaways">Key Takeaways</h3>
<ol>
<li>
<p><strong>Filling a critical gap</strong>: RInChI provides the first standardized way to uniquely identify chemical reactions, solving a fundamental problem in chemical informatics.</p>
</li>
<li>
<p><strong>Focus on essential chemistry</strong>: By excluding experimental variables, RInChI achieves consistent identification of core chemical transformations.</p>
</li>
<li>
<p><strong>Flexible searching</strong>: Multiple RInChIKey formats enable different types of database queries, from exact matching to similarity searching.</p>
</li>
<li>
<p><strong>Practical implementation</strong>: Official software tools make RInChI generation accessible to working chemists and database managers.</p>
</li>
<li>
<p><strong>Foundation for analysis</strong>: Standardized reaction identifiers enable large-scale analysis of chemical databases and systematic study of reaction patterns.</p>
</li>
</ol>
<p>RInChI brings to reaction data the same kind of standardization and machine-readability that SMILES and InChI provide for individual molecules.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>The RInChI software is available for download from the InChI Trust website (<a href="http://www.inchi-trust.org/downloads/">http://www.inchi-trust.org/downloads/</a>). It is also available as an Oracle cartridge and as a Pipeline Pilot component from StructurePendium. A database of over one million RInChIs is hosted at www-rinchi.ch.cam.ac.uk.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://www.inchi-trust.org/downloads/">RInChI Software (InChI Trust)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official RInChI V1.00 implementation</td>
      </tr>
      <tr>
          <td><a href="https://www-rinchi.ch.cam.ac.uk">RInChI Database</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Over 1M reactions from patent literature</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grethe, G., Blanke, G., Kraut, H., &amp; Goodman, J. M. (2018). International chemical identifier for reactions (RInChI). <em>Journal of Cheminformatics</em>, <em>10</em>(1), 22. <a href="https://doi.org/10.1186/s13321-018-0277-8">https://doi.org/10.1186/s13321-018-0277-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2018)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Grethe2018,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{International chemical identifier for reactions (RInChI)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grethe, Guenter and Blanke, Gerd and Kraut, Hans and Goodman, Jonathan M}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-018-0277-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recent Advances in the SELFIES Library: 2023 Update</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2023/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2023/</guid><description>Major updates to the SELFIES library, improved performance, expanded chemistry support, and new customization features.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This software update paper documents major improvements to the SELFIES Python library (version 2.1.1), covering its history, underlying algorithms, design, and performance.</p>
<h2 id="limitations-in-the-original-selfies-implementation">Limitations in the Original SELFIES Implementation</h2>
<p>While the <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original SELFIES concept</a> was promising, the initial 2019 implementation had critical limitations that prevented widespread adoption:</p>
<ol>
<li><strong>Performance</strong>: Too slow for production ML workflows</li>
<li><strong>Limited chemistry</strong>: Couldn&rsquo;t represent aromatic molecules, stereochemistry, or many other important chemical features</li>
<li><strong>Poor usability</strong>: Lacked user-friendly APIs for common tasks</li>
</ol>
<p>These barriers meant that despite SELFIES&rsquo; theoretical advantages (100% validity guarantee), researchers couldn&rsquo;t practically use it for real-world applications like drug discovery or materials science.</p>
<h2 id="architectural-refactoring-and-new-ml-integrations">Architectural Refactoring and New ML Integrations</h2>
<p>The 2023 update refactors the underlying SELFIES engine with improvements to design, efficiency, and supported features. The key updates include:</p>
<ol>
<li>
<p><strong>Streamlined Grammar</strong>: The underlying context-free grammar has been generalized and streamlined, improving execution speed and extensibility while maintaining the 100% validity guarantee.</p>
</li>
<li>
<p><strong>Expanded Chemical Support</strong>: Adds support for aromatic systems (via internal kekulization), stereochemistry (chirality, cis/trans), charged species, and isotopic data, covering nearly all features supported by SMILES while preserving the validity guarantee.</p>
</li>
<li>
<p><strong>Semantic Constraint API</strong>: Introduces the <code>set_semantic_constraints()</code> function, allowing specification of custom valence definitions useful for theoretical studies or hypervalent states.</p>
</li>
<li>
<p><strong>ML Utility Functions</strong>: Provides tokenization (<code>split_selfies</code>), length estimation (<code>len_selfies</code>), label/one-hot encoding (<code>selfies_to_encoding</code>), vocabulary extraction, and attribution tracking for integration with neural network pipelines.</p>
</li>
</ol>
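<p>The kind of encoding these utilities provide can be illustrated in pure Python. The function names below are illustrative stand-ins, not the library&rsquo;s actual API (the library itself exposes <code>split_selfies</code> and <code>selfies_to_encoding</code>):</p>

```python
import re

def split_symbols(selfies: str) -> list[str]:
    # Each SELFIES symbol is a bracketed token, e.g. [C], [=O], [Branch1].
    return re.findall(r"\[[^\]]*\]", selfies)

def encode(selfies: str, vocab: dict[str, int], pad_to: int):
    """Return (label_encoding, one_hot_encoding) for one SELFIES string,
    padding shorter strings with the no-op symbol [nop]."""
    symbols = split_symbols(selfies)
    symbols += ["[nop]"] * (pad_to - len(symbols))
    labels = [vocab[s] for s in symbols]
    one_hot = [[1 if i == lab else 0 for i in range(len(vocab))]
               for lab in labels]
    return labels, one_hot
```

<p>For example, with the toy vocabulary <code>{"[nop]": 0, "[C]": 1, "[=O]": 2}</code>, encoding <code>[C][=O]</code> padded to length 3 gives the label sequence <code>[1, 2, 0]</code>.</p>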
<h2 id="performance-benchmarks--validity-testing">Performance Benchmarks &amp; Validity Testing</h2>
<p>The authors validated the library through several benchmarks:</p>
<p><strong>Performance testing</strong>: Roundtrip conversion (SMILES to SELFIES to SMILES) on the DTP open compound collection (slightly over 300K molecules) completed in 252 seconds total (136s encoding, 116s decoding), using pure Python with no external dependencies.</p>
<p><strong>Random SELFIES generation</strong>: Demonstrated that random SELFIES strings of varying lengths always decode to valid molecules, with the size distribution of generated molecules controllable by filtering the sampling alphabet (e.g., removing multi-bond and low-valence atom symbols shifts the distribution toward larger molecules).</p>
<p><strong>Validity guarantee</strong>: By construction, every SELFIES string decodes to a valid molecule. The grammar&rsquo;s bond demotion and deferred ring closure mechanisms make it impossible to generate chemically invalid structures.</p>
<p><strong>Attribution system</strong>: Showed both encoder and decoder can track which input symbols produce which output symbols, useful for property alignment.</p>
<h2 id="future-trajectories-for-general-chemical-representations">Future Trajectories for General Chemical Representations</h2>
<p>The 2023 update successfully addresses the main adoption barriers:</p>
<ol>
<li><strong>Fast enough</strong> for large-scale ML applications (300K molecules in ~4 minutes)</li>
<li><strong>Chemically comprehensive</strong> enough for drug discovery and materials science</li>
<li><strong>User-friendly</strong> enough for straightforward integration into existing workflows</li>
</ol>
<p>The validity guarantee, SELFIES&rsquo; core advantage, is now practically accessible for real-world research. The roadmap includes future extensions for polymers, crystals, chemical reactions, and non-covalent interactions, which would expand SELFIES&rsquo; applicability beyond small-molecule chemistry.</p>
<p><strong>Limitations acknowledged</strong>: The paper focuses on implementation improvements. Some advanced chemical systems (polymers, crystals) still need future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">selfies</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official Python library, installable via <code>pip install selfies</code></td>
      </tr>
  </tbody>
</table>
<h3 id="code">Code</h3>
<p>The <code>selfies</code> library is completely open-source and written in pure Python. It requires no extra dependencies and is available on GitHub, installable via <code>pip install selfies</code>. The repository includes testing suites (<code>tox</code>) and example benchmarking scripts to reproduce the translation speeds reported in the paper.</p>
<h3 id="hardware">Hardware</h3>
<p>Performance benchmarks (e.g., the 252-second roundtrip conversion on 300K molecules) were executed on Google Colaboratory using two 2.20GHz Intel Xeon CPUs.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="technical-specification-the-grammar">Technical Specification: The Grammar</h4>
<p>The core innovation of SELFIES is a <strong>Context-Free Grammar (CFG) augmented with state-machine logic</strong> to ensure that every derived string represents a valid molecule. While the software features are important, understanding the underlying derivation rules is essential for replication or extension of the system.</p>
<p><strong>1. Derivation Rules: The Atom State Machine</strong></p>
<p>The fundamental mechanism that guarantees validity is a <strong>state machine</strong> that tracks the remaining valence of the most recently added atom:</p>
<ul>
<li><strong>State Tracking</strong>: The derivation maintains a non-terminal state $X_l$, where $l$ represents the current atom&rsquo;s remaining valence (number of bonds it can still form)</li>
<li><strong>Standard Derivation</strong>: An atom symbol $[\beta \alpha]$ (bond order + atom type) transitions the state from $S$ (start) to $X_l$, where $l$ is calculated from the atom&rsquo;s standard valence minus the incoming bond order</li>
<li><strong>Bond Demotion (The Key Rule)</strong>: When deriving atom symbol $[\beta \alpha]$ in state $X_i$, the actual bond order used is $d_0 = \min(\ell, i, d(\beta))$, where $\ell$ is the new atom&rsquo;s valence, $i$ is the previous atom&rsquo;s remaining capacity, and $d(\beta)$ is the requested bond order. This automatic downward adjustment is the mathematical core of the validity guarantee.</li>
</ul>
<p>This state machine ensures that no atom ever exceeds its allowed valence, making it impossible to generate chemically invalid structures.</p>
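<p>The demotion rule itself is a single <code>min</code>; a minimal sketch with illustrative parameter names:</p>

```python
def demoted_bond_order(new_atom_valence: int,
                       remaining_capacity: int,
                       requested_order: int) -> int:
    """Bond demotion: the realized order is d0 = min(l, i, d(beta)).

    A requested double or triple bond is silently lowered whenever
    either atom cannot support it -- the core of the validity guarantee.
    """
    return min(new_atom_valence, remaining_capacity, requested_order)
```

<p>Requesting a triple bond to an atom with only one bond slot remaining yields a single bond: <code>demoted_bond_order(4, 1, 3) == 1</code>.</p>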
<p><strong>2. Control Symbols: Branches and Rings</strong></p>
<p>Branch length calculation: SELFIES uses a <strong>hexadecimal encoding</strong> to determine branch lengths. A branch symbol <code>[Branchℓ]</code> consumes the next $\ell$ symbols from the queue and converts them to integer indices $c_1, \dots, c_\ell$ via a fixed mapping (Table III in the paper). The number of symbols $N$ to include in the branch is then:</p>
<p>$$
N = 1 + \sum_{k=1}^{\ell} 16^{\ell - k} \, c_k
$$</p>
<p>This formula interprets the indices as hexadecimal digits, allowing compact specification of branches up to hundreds of symbols long.</p>
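<p>The formula can be computed directly by treating the indices as base-16 digits; this helper is an illustration of the arithmetic, not part of the library API:</p>

```python
def branch_length(indices: list[int]) -> int:
    """N = 1 + sum_k 16^(l-k) * c_k: read the index list as
    hexadecimal digits, then add one."""
    n = 0
    for c in indices:
        n = n * 16 + c
    return 1 + n
```

<p>A single index symbol mapping to 2 yields a branch of 3 symbols; two index symbols mapping to (1, 0) yield 16 + 1 = 17.</p>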
<p>Ring closure queue system: Ring formation uses a <strong>deferred evaluation</strong> strategy to maintain validity. Ring symbols don&rsquo;t create bonds immediately; instead, they push closure candidates into a queue $R$. These candidates are resolved after the main derivation completes. A ring closure candidate is <strong>rejected</strong> if either ring atom has no remaining valence ($m_1 = 0$ or $m_2 = 0$), or if the left and right ring atoms are not distinct (to avoid self-loops). If a prior bond already exists between the two atoms, the bond order is incremented rather than duplicated. This deferred validation prevents invalid ring structures while keeping the grammar context-free during the main derivation.</p>
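<p>The deferred resolution step can be sketched as follows; the data structures here (candidate queue, valence map, bond map) are illustrative simplifications of the paper&rsquo;s formalism, not the library&rsquo;s internals:</p>

```python
def resolve_ring_closures(queue, remaining_valence, bonds):
    """Resolve queued ring-closure candidates after derivation (sketch).

    queue: list of (atom1, atom2, order) candidates collected earlier;
    remaining_valence: dict atom -> bonds it can still form;
    bonds: dict mapping frozenset({a1, a2}) -> existing bond order.
    """
    for a1, a2, order in queue:
        if a1 == a2:                        # reject self-loops
            continue
        m1, m2 = remaining_valence[a1], remaining_valence[a2]
        if m1 == 0 or m2 == 0:              # reject if no capacity left
            continue
        d = min(order, m1, m2)              # demote the order as needed
        key = frozenset((a1, a2))
        bonds[key] = bonds.get(key, 0) + d  # increment any existing bond
        remaining_valence[a1] -= d
        remaining_valence[a2] -= d
    return bonds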
<p><strong>3. Symbol Structure and Standardization</strong></p>
<p>SELFIES enforces a strict, standardized format for atom symbols to eliminate ambiguity:</p>
<ul>
<li><strong>Canonical Format</strong>: Atom symbols follow the structure <code>[Bond, Isotope, Element, Chirality, H-count, Charge]</code></li>
<li><strong>No Variation</strong>: There is only one way to write each symbol (e.g., <code>[Fe++]</code> and <code>[Fe+2]</code> are standardized to a single form)</li>
<li><strong>Order Matters</strong>: The components must appear in the specified order</li>
</ul>
<p><strong>4. Default Semantic Constraints</strong></p>
<p>By default, the library enforces standard organic chemistry valence rules:</p>
<ul>
<li><strong>Charge-Dependent Valences</strong>: Default constraints specify maximum bonds per charge state (e.g., C: 4/5/3 for neutral/+1/-1; S: 6/7/5). Unlisted atom types default to 8 maximum bonds as a catch-all.</li>
<li><strong>Preset Options</strong>: Three preset constraint sets are available: <code>default</code>, <code>octet_rule</code>, and <code>hypervalent</code>.</li>
<li><strong>Customizable</strong>: Constraints can be modified via <code>set_semantic_constraints()</code> for specialized applications (hypervalent compounds, theoretical studies, etc.)</li>
</ul>
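<p>A sketch of how such a charge-dependent lookup might work, using the default values quoted above; the table and function are illustrative, not the library&rsquo;s internal representation:</p>

```python
# Illustrative constraint table keyed by (element, charge); values are
# maximum bond counts, mirroring the defaults described above.
DEFAULT_CONSTRAINTS = {
    ("C", 0): 4, ("C", +1): 5, ("C", -1): 3,
    ("S", 0): 6, ("S", +1): 7, ("S", -1): 5,
}

def max_bonds(element: str, charge: int = 0) -> int:
    # Unlisted atom types fall back to 8 bonds, the catch-all default.
    return DEFAULT_CONSTRAINTS.get((element, charge), 8)
```

<p>In the library itself, such a table is replaced wholesale via <code>set_semantic_constraints()</code>, e.g. to permit hypervalent species.</p>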
<p>The combination of these grammar rules with the state machine ensures that <strong>every valid SELFIES string decodes to a chemically valid molecule</strong>, regardless of how the string was generated (random, ML model output, manual construction, etc.).</p>
<h3 id="data">Data</h3>
<p><strong>Benchmark dataset</strong>: DTP (Developmental Therapeutics Program) open compound collection with slightly over 300K SMILES strings, a set of molecules tested experimentally for potential treatment against cancer and AIDS.</p>
<p><strong>Random generation testing</strong>: Random SELFIES strings of varying lengths (10, 100, 250 symbols) generated from both basic and filtered alphabets to test decoding validity and molecule size distributions.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Performance metric</strong>: Roundtrip conversion time (SMILES to SELFIES to SMILES) is 252 seconds for 300K+ molecules (136s encoding, 116s decoding). Times averaged over 3 replicate trials on Google Colaboratory.</p>
<p><strong>Validity testing</strong>: Random SELFIES strings of lengths 10, 100, and 250 all decode to valid molecules. Decoding 1000 random strings of length 250 from the basic alphabet takes 0.341s; from the filtered alphabet, 1.633s.</p>
<p><strong>Attribution system</strong>: Both <code>encoder()</code> and <code>decoder()</code> support an <code>attribute</code> flag that returns <code>AttributionMap</code> objects, tracing which input symbols produce which output symbols for property alignment.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <em>Digital Discovery</em>, <em>2</em>(4), 897-908. <a href="https://doi.org/10.1039/D3DD00044C">https://doi.org/10.1039/D3DD00044C</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lo2023recent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recent advances in the self-referencing embedded strings (SELFIES) library}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lo, Alston and Pollice, Robert and Nigam, AkshatKumar and White, Andrew D and Krenn, Mario and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{897--908}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00044C}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">Original SELFIES Paper (2020)</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES Format Overview</a></li>
</ul>
]]></content:encoded></item><item><title>NInChI: Toward a Chemical Identifier for Nanomaterials</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/ninchi-alpha/</guid><description>NInChI (Nanomaterials InChI) extends chemical identifiers to represent complex, multi-component nanomaterials.</description><content:encoded><![CDATA[<h2 id="a-new-standard-for-nanoinformatics">A New Standard for Nanoinformatics</h2>
<p>This is a <strong>Systematization paper</strong> that proposes a new standard, the NInChI, to address a fundamental limitation in nanoinformatics: existing chemical identifiers cannot describe multi-component nanomaterials. The result of a collaborative workshop organized by the H2020 research infrastructure NanoCommons and the nanoinformatics project NanoSolveIT, this work uses <strong>six detailed case studies</strong> to systematically develop a <strong>hierarchical, machine-readable notation</strong> for complex nanomaterials that could work across experimental research, regulatory frameworks, and computational modeling.</p>
<h2 id="the-breakdown-of-traditional-chemical-identifiers">The Breakdown of Traditional Chemical Identifiers</h2>
<p>Chemoinformatics has fantastic tools for representing small molecules: SMILES strings, InChI identifiers, and standardized databases that make molecular data searchable and shareable. But when you step into nanotechnology, everything breaks down.</p>
<p>Consider trying to describe a gold nanoparticle with a silica shell and organic surface ligands. How do you capture:</p>
<ul>
<li>The gold core composition and size</li>
<li>The silica shell thickness and interface</li>
<li>The surface chemistry and ligand density</li>
<li>The overall shape and morphology</li>
</ul>
<p>There&rsquo;s simply no standardized way to represent this complexity in a machine-readable format. This creates massive problems for:</p>
<ul>
<li><strong>Data sharing</strong> between research groups</li>
<li><strong>Regulatory assessment</strong> where precise identification matters</li>
<li><strong>Computational modeling</strong> that needs structured input</li>
<li><strong>Database development</strong> and search capabilities</li>
</ul>
<p>Without a standard notation, nanomaterials research suffers from the same data fragmentation that plagued small molecule chemistry before SMILES existed.</p>
<h2 id="the-five-tier-nanomaterial-description-hierarchy">The Five-Tier Nanomaterial Description Hierarchy</h2>
<p>The authors propose NInChI (Nanomaterials InChI), a layered extension to the existing InChI system. The core insight is organizing nanomaterial description from the inside out, following the OECD&rsquo;s framework for risk assessment, with a five-tier hierarchy:</p>
<ol>
<li><strong>Tier 1: Chemical Composition</strong>: What is the core made of? This differentiates uniform compositions (Tier 1.1), randomly mixed (Tier 1.2), ordered core-shell materials (Tier 1.3), and onion-like multi-shell morphologies (Tier 1.4).</li>
<li><strong>Tier 2: Morphology</strong>: What shape, size, and dimensionality? This encodes dimension (0D-3D), size and size distribution, and shape information.</li>
<li><strong>Tier 3: Surface Properties</strong>: Physical and chemical surface parameters such as charge, roughness, and hydrophobicity. Many of these depend on external conditions (pH, solvent, temperature).</li>
<li><strong>Tier 4: Surface Functionalization</strong>: How are coatings attached to the core? This includes functionalization density, orientation, and binding type (covalent vs. non-covalent).</li>
<li><strong>Tier 5: Surface Ligands</strong>: What molecules are on the surface, their density, orientation, and distribution?</li>
</ol>
<p>This hierarchy captures the essential information needed to distinguish between different nanomaterials while building on familiar chemical concepts.</p>
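<p>One way to picture the hierarchy is as a nested record, one field per tier. A minimal Python sketch, where the class and field names are hypothetical illustrations and not part of the proposed notation:</p>

```python
from dataclasses import dataclass, field

@dataclass
class NanomaterialRecord:
    """Hypothetical container mirroring the five-tier NInChI hierarchy."""
    composition: dict          # Tier 1: core chemistry and mixing type
    morphology: dict           # Tier 2: dimensionality, size, shape
    surface_properties: dict   # Tier 3: charge, roughness, hydrophobicity
    functionalization: dict    # Tier 4: binding type, density, orientation
    ligands: list = field(default_factory=list)  # Tier 5: surface molecules

# A coated gold nanoparticle, inside out:
example = NanomaterialRecord(
    composition={"core": "Au", "type": "uniform"},
    morphology={"dimension": "0D", "diameter_nm": 15},
    surface_properties={"charge": "negative"},
    functionalization={"binding": "non-covalent"},
    ligands=["citrate"],
)
```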
<h2 id="testing-the-standard-six-case-studies">Testing the Standard: Six Case Studies</h2>
<p>The authors tested their concept against six real-world case studies to identify what actually matters in practice.</p>
<p><strong>Case Study 1: Gold Nanoparticles</strong></p>
<p>Gold NPs provided a relatively simple test case: an inert metallic core with various surface functionalizations. Key insights: core composition and size are essential, surface chemistry (what molecules are attached) matters critically, shape affects properties, and dynamic properties like protein corona formation belong outside the intrinsic NInChI representation. This established the boundary: NInChI should capture intrinsic, stable properties.</p>
<p><strong>Case Study 2: Graphene-Family NMs</strong></p>
<p>Carbon nanotubes and graphene introduced additional complexity: dimensionality (1D tubes vs 2D sheets vs 0D fullerenes), chirality (the (n,m) vector that defines a nanotube&rsquo;s structure), defects and impurities that can alter properties, and layer count (the number of graphene layers, or single-wall vs. multi-wall for nanotubes). This case showed that the notation needed to handle both topological complexity and chemical composition.</p>
<p><strong>Case Study 3: Complex Engineered (Doped and Multi-Metallic) NMs</strong></p>
<p>Doped materials, alloys, and core-shell structures revealed key requirements: the notation must distinguish true alloys (homogeneous mixing) from core-shell structures with the same overall composition, crystal structure information becomes crucial, and component ratios must be precisely specified. The case study also assessed whether the MInChI extension could represent these solid solutions.</p>
<p><strong>Case Study 4: Database Applications</strong></p>
<p>The FAIR (Findable, Accessible, Interoperable, Reusable) principles guided this analysis. NInChI addresses real database problems: it provides greater specificity than CAS numbers (which lack nanoform distinction), offers a systematic alternative to ad-hoc naming schemes, and enables machine-searchability.</p>
<p><strong>Case Study 5: Computational Modeling</strong></p>
<p>This explored several applications: automated descriptor generation from NInChI structure, read-across predictions for untested materials, and model input preparation from standardized notation. The layered structure provides structured input that computational tools need for both physics-based and data-driven nanoinformatics approaches.</p>
<p><strong>Case Study 6: Regulatory Applications</strong></p>
<p>Under frameworks like REACH, regulators need to distinguish between different &ldquo;nanoforms&rdquo;, which are materials with the same chemical composition but different sizes, shapes, or surface treatments. NInChI directly addresses this by encoding the specific properties that define regulatory categories, providing precision sufficient for legal definitions and risk assessment frameworks.</p>
<h2 id="the-ninchi-alpha-specification-in-practice">The NInChI Alpha Specification in Practice</h2>
<p>Synthesizing insights from all six case studies, the authors propose the <strong>NInChI alpha specification</strong> (version 0.00.1A), a three-layer structure. Importantly, the paper distinguishes the five-tier NM description hierarchy (described above) from the three-layer NInChI notation hierarchy. NM properties from the five tiers are encoded into these three notation layers:</p>
<p><strong>Layer 1 (Version Number)</strong>: Standard header indicating the NInChI version, denoted as <code>0.00.1A</code> for the alpha version. This follows the convention of all <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>-based notations.</p>
<p><strong>Layer 2 (Composition)</strong>: Each component (core, shell, ligands, impurities, dopants, linkers) gets described using standard InChI (or PInChI/MInChI) for chemical composition, with additional sublayers for morphology (prefix <code>m</code>, e.g., <code>sp</code> for sphere, <code>sh</code> for shell, <code>tu</code> for tube), size (prefix <code>s</code>, in scientific notation in meters), crystal structure (prefix <code>k</code>), and chirality (prefix <code>w</code> for carbon nanotubes). Components are separated by <code>!</code>.</p>
<p><strong>Layer 3 (Arrangement)</strong>: Specified with prefix <code>y</code>, this layer describes how the components from Layer 2 are combined, proceeding from inside out. A core-shell material is written as <code>y2&amp;1</code> where the numbers reference components in Layer 2. Covalent bonding between components is indicated with parentheses, e.g., <code>(1&amp;2&amp;3)</code> for a nano core with a covalently bound ligand coating.</p>
<p>The paper provides concrete worked examples from the case studies:</p>
<ul>
<li><strong>Silica with gold coating</strong> (20 nm silica, 2 nm gold shell):
<code>NInChI=0.00.1A/Au/msh/s2t10r1-9;12r2-9!/O2Si/c1-3-2/msp/s20d-9/k000/y2&amp;1</code></li>
<li><strong>CTAB-capped gold nanoparticle</strong> (20 nm diameter):
<code>NInChI=0.00.1A/Au/msp/s20d-9!C19H42N.BrH/c1-5-6-7.../y1&amp;2</code></li>
<li><strong>Chiral single-wall nanotube</strong> of the (3,1) type with 0.4 nm diameter:
<code>NInChI=0.00.1A/C/mtu/s4d-10/w(3,1)/y1</code></li>
</ul>
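<p>The layered syntax can be illustrated with a small parser. This is an unofficial sketch based only on the worked examples above: it assumes components are separated by <code>!</code> and that the arrangement layer is whatever follows the final <code>/y</code>:</p>

```python
def parse_ninchi(s):
    """Split a NInChI alpha string into version, components, and arrangement.

    Unofficial sketch based on the paper's worked examples; assumes the
    string carries an arrangement layer ('/y...') at the end.
    """
    prefix, _, body = s.partition("=")
    if prefix != "NInChI":
        raise ValueError("not a NInChI string")
    version, _, rest = body.partition("/")
    rest, _, arrangement = rest.rpartition("/y")
    # Strip the leading '/' that components carry after the '!' separator.
    components = [c.lstrip("/") for c in rest.split("!")]
    return {"version": version, "components": components,
            "arrangement": arrangement}

# The silica-core / gold-shell example from the paper:
parsed = parse_ninchi(
    "NInChI=0.00.1A/Au/msh/s2t10r1-9;12r2-9"
    "!/O2Si/c1-3-2/msp/s20d-9/k000/y2&1"
)
```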
<p><strong>Property Prioritization</strong>: The case studies produced a prioritization of NM properties into four categories (Table 3 in the paper):</p>
<table>
  <thead>
      <tr>
          <th>Category 1: Must Have</th>
          <th>Category 2a: Nice to Have</th>
          <th>Category 2b: Extrinsic</th>
          <th>Category 3: Out of Scope</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemical composition</td>
          <td>Structural defects</td>
          <td>Surface charge</td>
          <td>Optical properties</td>
      </tr>
      <tr>
          <td>Size/size distribution</td>
          <td>Density</td>
          <td>Corona</td>
          <td>Magnetic properties</td>
      </tr>
      <tr>
          <td>Shape</td>
          <td>Surface composition</td>
          <td>Agglomeration state</td>
          <td>Chemical/oxidation state</td>
      </tr>
      <tr>
          <td>Crystal structure</td>
          <td></td>
          <td>Dispersion</td>
          <td></td>
      </tr>
      <tr>
          <td>Chirality</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>Ligand and ligand binding</td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>Implementation</strong>: The authors built a prototype NInChI generation tool using the ZK framework with a Java backend, available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>. The tool lets users specify core composition, morphology, size, crystal structure, and chirality, then build outward by adding shells or clusters. InChIs for shell components are retrieved via the NCI/CADD chemical structure REST API.</p>
<p><strong>Limitations</strong>: The alpha version acknowledges areas for future development: nanocomposite and nanostructured materials, inverse NMs (nano holes in bulk material), and nanoporous materials are beyond current scope. Dynamic properties such as dissolution, agglomeration, and protein corona formation are excluded. The stochastic nature of NMs (e.g., broad size distributions) is not yet fully addressed. Covalent bonding between components needs further refinement.</p>
<p><strong>Impact</strong>: For researchers, NInChI enables precise structural queries for nanomaterials data sharing. For regulators, it provides systematic identification for risk assessment and nanoform classification under frameworks like REACH. For computational modelers, it enables automated descriptor generation and read-across predictions.</p>
<p><strong>Key Conclusions</strong>: The 8-month collaborative process demonstrates that creating systematic notation for nanomaterials is feasible. The hierarchical, inside-out organization provides an approach that satisfies experimentalists, modelers, database owners, and regulators. Testing against six case studies identified the essential features that must be captured. By extending InChI and reusing conventions from MInChI, RInChI, and PInChI, the work builds on existing infrastructure. The proposed NInChI alpha is intended to stimulate further analysis and refinement with the broader community and the InChI Trust.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The paper is fully open-access under the CC BY 4.0 license, allowing for straightforward reading and analysis.</li>
<li><strong>Tools &amp; Code</strong>: The authors provided a prototype NInChI generation tool available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>, built using the ZK framework with a Java backend. The underlying backend code was not released as an open-source library.</li>
<li><strong>Documentation</strong>: The paper serves as the first alpha specification for community discussion and refinement. No formal algorithmic pseudocode for automated string parsing or generation from structured nanomaterials files (like <code>.cif</code>) is provided.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">NInChI Generator (Enalos Cloud)</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Prototype web tool for generating NInChI strings; backend not open-source</td>
      </tr>
      <tr>
          <td><a href="https://www.mdpi.com/2079-4991/10/12/2493">Paper (MDPI)</a></td>
          <td>Other</td>
          <td>CC BY 4.0</td>
          <td>Open-access alpha specification</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lynch, I., Afantitis, A., Exner, T., Himly, M., Lobaskin, V., Doganis, P., &hellip; &amp; Melagraki, G. (2020). Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies? <em>Nanomaterials</em>, <em>10</em>(12), 2493. <a href="https://doi.org/10.3390/nano10122493">https://doi.org/10.3390/nano10122493</a></p>
<p><strong>Publication</strong>: Nanomaterials (2020)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lynch2020inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lynch, Iseult and Afantitis, Antreas and Exner, Thomas and others}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nanomaterials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2493}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{MDPI}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3390/nano10122493}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mixfile &amp; MInChI: Machine-Readable Mixture Formats</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/mixfile-minchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/mixfile-minchi/</guid><description>Mixfile and MInChI provide the first standardized, machine-readable formats for representing chemical mixtures.</description><content:encoded><![CDATA[<h2 id="a-standardized-resource-for-chemical-mixtures">A Standardized Resource for Chemical Mixtures</h2>
<p>This is a <strong>Resource</strong> paper that introduces two complementary standards for representing chemical mixtures: the detailed <strong>Mixfile</strong> format for comprehensive mixture descriptions and the compact <strong>MInChI</strong> (Mixtures InChI) specification for canonical mixture identifiers.</p>
<h2 id="the-missing-format-for-complex-formulations">The Missing Format for Complex Formulations</h2>
<p>There is a fundamental gap in chemical informatics: current standards excel at representing pure individual molecules (SMILES, InChI, Molfile), but a corresponding standard for multi-component mixtures remains an open challenge. This is a major problem because real-world chemistry predominantly involves complex mixtures.</p>
<p>Everyday chemical work frequently involves:</p>
<ul>
<li>Reagents with specified purity (e.g., &ldquo;$\geq$ 97% pure&rdquo;)</li>
<li>Solutions and formulations</li>
<li>Complex mixtures like &ldquo;hexanes&rdquo; (which contains multiple isomers)</li>
<li>Drug formulations with active ingredients and excipients</li>
</ul>
<p>Without a machine-readable standard, chemists are forced to describe these mixtures in plain text that software cannot parse or analyze systematically. This creates barriers for automated safety analysis, inventory management, and data sharing.</p>
<h2 id="dual-design-comprehensive-mixfiles-and-canonical-minchis">Dual Design: Comprehensive Mixfiles and Canonical MInChIs</h2>
<p>The authors propose a two-part solution:</p>
<ol>
<li><strong>Mixfile</strong>: A detailed, hierarchical JSON format that captures the complete composition of a mixture</li>
<li><strong>MInChI</strong>: A compact, canonical string identifier derived from Mixfile data</li>
</ol>
<p>This dual approach provides both comprehensive description (Mixfile) and simple identification (MInChI), similar to having both a detailed recipe and a short name for a dish.</p>
<h3 id="what-makes-a-good-mixture-format">What Makes a Good Mixture Format?</h3>
<p>The authors identify three essential properties any mixture format must capture:</p>
<ol>
<li><strong>Compound</strong>: What molecules are present?</li>
<li><strong>Quantity</strong>: How much of each component?</li>
<li><strong>Hierarchy</strong>: How are components organized (e.g., mixtures-of-mixtures)?</li>
</ol>
<p>The hierarchical aspect is crucial. Consider &ldquo;hexanes&rdquo;: it is a named mixture containing specific proportions of n-hexane, 2-methylpentane, 3-methylpentane, etc. A mixture format needs to represent both the individual isomers and the fact that they are grouped under the umbrella term &ldquo;hexanes.&rdquo;</p>
<h3 id="mixfile-format-details">Mixfile Format Details</h3>
<p>Mixfile uses JSON as its foundation, making it both human-readable and easy to parse in modern programming languages. The core structure is a hierarchical tree where each component can contain:</p>
<ul>
<li><strong>name</strong>: Component identifier</li>
<li><strong>molfile/smiles/inchi/formula</strong>: Molecular structure (molfile is the primary source of truth)</li>
<li><strong>quantity/units/relation/ratio</strong>: Concentration data with optional relation operators</li>
<li><strong>contents</strong>: Array of sub-components for hierarchical mixtures</li>
<li><strong>identifiers</strong>: Database IDs or URLs for additional information</li>
</ul>
<h4 id="simple-example">Simple Example</h4>
<p>A basic Mixfile might look like:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Acetone, ≥99%&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;acetone&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">99</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;relation&#34;</span>: <span style="color:#e6db74">&#34;&gt;=&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Note that the paper specifies distinct fields for molecular structures: <code>molfile</code> (the primary source of truth), <code>smiles</code>, <code>inchi</code>, and <code>formula</code>. Concentration data uses separate <code>quantity</code>, <code>units</code>, and <code>relation</code> fields.</p>
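<p>A minimal Python sketch (mine, not from the paper) of loading such a Mixfile and walking its component tree:</p>

```python
import json

# The Mixfile from the example above, embedded as a string for illustration.
mixfile_text = """
{
  "mixfileVersion": 0.01,
  "name": "Acetone, >=99%",
  "contents": [
    {"name": "acetone", "smiles": "CC(=O)C",
     "quantity": 99, "units": "%", "relation": ">="}
  ]
}
"""
mixfile = json.loads(mixfile_text)

def iter_components(node):
    """Yield every leaf component in a (possibly nested) Mixfile tree."""
    for child in node.get("contents", []):
        if "contents" in child:
            yield from iter_components(child)
        else:
            yield child

components = list(iter_components(mixfile))
```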
<h4 id="complex-example-mixture-of-mixtures">Complex Example: Mixture-of-Mixtures</h4>
<p>For something like &ldquo;ethyl acetate dissolved in hexanes,&rdquo; the structure would be:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Ethyl acetate in hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;ethyl acetate&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCOC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;n-hexane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCCCCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">60</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;2-methylpentane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(C)CCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">25</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>      ]
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>This hierarchical structure captures the &ldquo;recipe&rdquo; of complex mixtures while remaining machine-readable.</p>
<h3 id="minchi-canonical-mixture-identifiers">MInChI: Canonical Mixture Identifiers</h3>
<p>While Mixfiles provide comprehensive descriptions, simple identifiers are also needed for database storage and searching. This is where MInChI comes in.</p>
<p>A MInChI string is structured as:</p>
<pre><code>MInChI=0.00.1S/&lt;components&gt;/n&lt;indexing&gt;/g&lt;concentration&gt;
</code></pre>
<ul>
<li><strong>Header</strong>: Version information (<code>0.00.1S</code> in the paper&rsquo;s specification)</li>
<li><strong>Components</strong>: Standard InChI for each unique molecule, sorted alphabetically <em>by the InChI strings themselves</em>, then concatenated with <code>&amp;</code></li>
<li><strong>Indexing</strong> (prefixed with <code>/n</code>): Hierarchical structure using curly braces <code>{}</code> for branches and <code>&amp;</code> for adjacent nodes; uses 1-based integer indices referring to the sorted InChI list</li>
<li><strong>Concentration</strong> (prefixed with <code>/g</code>): Quantitative information for each component, with units converted to canonical codes</li>
</ul>
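<p>A sketch of splitting a MInChI string into these layers, assuming the markers <code>/n</code> and <code>/g</code> do not occur inside the component InChI bodies; the example string and its concentration tokens are invented for illustration:</p>

```python
def split_minchi(s):
    """Split a MInChI string into version, components, indexing, concentration.

    Illustrative sketch: the component layer sits between the version header
    and the '/n' indexing layer; concentrations follow the '/g' marker.
    """
    header, _, rest = s.partition("/")
    if not header.startswith("MInChI="):
        raise ValueError("not a MInChI string")
    body, _, conc = rest.rpartition("/g")
    comps, _, indexing = body.rpartition("/n")
    return {
        "version": header.split("=")[1],
        "components": comps.split("&"),
        "indexing": indexing,
        "concentration": conc,
    }

# Invented two-component example (methanol + water, placeholder quantities):
layers = split_minchi("MInChI=0.00.1S/CH4O/c1-2/h2H,1H3&H2O/h1H2/n{1&2}/g25wf&75wf")
```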
<h4 id="why-this-matters">Why This Matters</h4>
<p>MInChI strings enable simple database searches:</p>
<ul>
<li>Check if a specific component appears in any mixture</li>
<li>Compare different formulations of the same product</li>
<li>Identify similar mixtures based on string similarity</li>
</ul>
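<p>For example, component lookup reduces to a plain substring search over the component layer. The sample records below are invented for illustration:</p>

```python
# Two hypothetical MInChI records (layer contents abbreviated/invented).
records = {
    "methanol-water": "MInChI=0.00.1S/CH4O/c1-2/h2H,1H3&H2O/h1H2/n{1&2}/g",
    "pure-ethanol":   "MInChI=0.00.1S/C2H6O/c1-2-3/h3H,2H2,1H3/n1/g",
}

# Because each component is a canonical InChI body, membership queries are
# simple string containment checks.
water = "H2O/h1H2"
matches = [name for name, s in records.items() if water in s]
```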
<h2 id="validating-the-standard-through-practical-tooling">Validating the Standard Through Practical Tooling</h2>
<p>The paper demonstrates the format&rsquo;s capabilities through several practical applications and a proof-of-concept implementation:</p>
<h3 id="text-extraction-algorithm">Text Extraction Algorithm</h3>
<p>The authors demonstrate a proof-of-concept algorithm that uses regular expressions and chemical name recognition to parse plain-text mixture descriptions into structured Mixfile data. The algorithm:</p>
<ol>
<li>Applies regex rules to remove filler words and extract concentrations</li>
<li>Looks up cleaned names against a custom chemical database</li>
<li>Falls back to OPSIN for SMILES generation from chemical names</li>
<li>Generates 2D coordinates for molecular structures</li>
</ol>
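<p>The concentration-extraction step can be sketched with a single regular expression; the pattern below is my own illustrative rule, not the paper&rsquo;s actual rule set:</p>

```python
import re

def extract_concentration(text):
    """Pull a concentration phrase like '>=99%' or '0.1 M' out of a
    free-text mixture description (illustrative pattern only)."""
    m = re.search(r"([<>]=?|~)?\s*(\d+(?:\.\d+)?)\s*(%|mM|M|g/L)", text)
    if not m:
        return None
    relation, value, units = m.groups()
    return {"relation": relation, "quantity": float(value), "units": units}

hit = extract_concentration("Acetone, >=99% (HPLC grade)")
```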
<h3 id="graphical-editor">Graphical Editor</h3>
<p>An open-source editor provides:</p>
<ul>
<li>Tree-based interface for building and editing hierarchical structures</li>
<li>Chemical structure sketching and editing</li>
<li>Database lookup (e.g., PubChem integration)</li>
<li>Automatic MInChI generation</li>
<li>Import/export capabilities</li>
</ul>
<h3 id="example-use-cases">Example Use Cases</h3>
<p>The paper validates the format through real-world applications:</p>
<ul>
<li><strong>Safety compliance</strong>: Automated hazard assessment based on concentration-dependent properties (e.g., solid osmium tetroxide vs. 1% aqueous solution)</li>
<li><strong>Inventory management</strong>: Precise, searchable laboratory records</li>
<li><strong>Data extraction</strong>: Parsing vendor catalogs and safety data sheets</li>
</ul>
<h2 id="outcomes-and-future-extensibility">Outcomes and Future Extensibility</h2>
<p>The work successfully establishes the first standardized, machine-readable formats for chemical mixtures. Key achievements:</p>
<ul>
<li><strong>Comprehensive representation</strong>: Mixfile captures component identity, quantity, and hierarchy</li>
<li><strong>Canonical identification</strong>: MInChI provides compact, searchable identifiers</li>
<li><strong>Practical tooling</strong>: Open-source editor and text extraction demonstrate feasibility</li>
<li><strong>Real-world validation</strong>: Format handles diverse use cases from safety to inventory</li>
</ul>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The authors acknowledge areas for improvement:</p>
<ul>
<li><strong>Machine learning improvements</strong>: Better text extraction using modern NLP techniques</li>
<li><strong>Extended coverage</strong>: Support for polymers, complex formulations, analytical results</li>
<li><strong>Community adoption</strong>: Integration with existing chemical databases and software</li>
</ul>
<p>The hierarchical design makes Mixfile suitable for both &ldquo;recipe&rdquo; descriptions (how to make something) and analytical results (what was found). This flexibility should help drive adoption across different use cases in chemistry and materials science.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="open-source-tooling--data">Open Source Tooling &amp; Data</h3>
<p>While the central repository for validating and establishing the MInChI standard is <a href="https://github.com/IUPAC/MInChI">github.com/IUPAC/MInChI</a>, the tools and datasets used to develop the paper&rsquo;s proofs of concept are hosted elsewhere:</p>
<ul>
<li><strong>Graphical Editor &amp; App codebase</strong>: The Electron application and Mixfile handling codebase (<code>console.js</code>) can be found at <a href="https://github.com/cdd/mixtures">github.com/cdd/mixtures</a>.</li>
<li><strong>Text Extraction Data</strong>: The several thousand extracted mixture records generated through the text extraction method can be accessed inside the <code>cdd/mixtures</code> repository under <a href="https://github.com/cdd/mixtures/tree/master/reference"><code>reference/gathering.zip</code></a>.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/IUPAC/MInChI">IUPAC/MInChI</a></td>
          <td style="text-align: left">Code / Data</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Validation test suite with ~150 mixture JSON files</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/cdd/mixtures">cdd/mixtures</a></td>
          <td style="text-align: left">Code / Data</td>
          <td style="text-align: left">GPL-3.0</td>
          <td style="text-align: left">Electron-based Mixfile editor, CLI tools, and reference mixture corpus</td>
      </tr>
  </tbody>
</table>
<p>The paper was funded by NIH Grant 1R43TR002528-01. No specific hardware requirements are needed, as this is a format specification with lightweight tooling.</p>
<h3 id="algorithms">Algorithms</h3>
<p>This section provides the specific algorithmic logic, schema definitions, and standardization rules needed to replicate the Mixfile parser or MInChI generator.</p>
<h4 id="the-strict-mixfile-json-schema">The Strict Mixfile JSON Schema</h4>
<p>To implement the format, a parser must recognize these specific fields:</p>
<p><strong>Root Structure</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;header&#34;</span>: {},
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: []
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Component Fields</strong>:</p>
<ul>
<li><code>name</code>: string (required if no structure is provided)</li>
<li><code>molfile</code>: string (the primary source of truth for molecular structure)</li>
<li><code>smiles</code>, <code>inchi</code>, <code>formula</code>: derived/transient fields for convenience</li>
<li><code>quantity</code>: number OR <code>[min, max]</code> array for ranges</li>
<li><code>units</code>: string (must map to supported ontology)</li>
<li><code>relation</code>: string (e.g., <code>&quot;&gt;&quot;</code>, <code>&quot;~&quot;</code>, <code>&quot;&gt;=&quot;</code>)</li>
<li><code>ratio</code>: array of two numbers <code>[numerator, denominator]</code></li>
<li><code>identifiers</code>: database assignments (e.g., CASRN, PubChem)</li>
<li><code>links</code>: URLs relevant to the component</li>
<li><code>contents</code>: recursive array for hierarchical mixtures</li>
</ul>
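<p>A parser can enforce this schema with a recursive check. The sketch below uses the field names above, but the validation rules themselves are my own illustrative choices, not part of the specification:</p>

```python
# Field names taken from the Mixfile schema above; rules are illustrative.
KNOWN_FIELDS = {"name", "molfile", "smiles", "inchi", "formula",
                "quantity", "units", "relation", "ratio",
                "identifiers", "links", "contents"}

def validate_component(comp, errors=None):
    """Collect schema problems for a component and its nested contents."""
    errors = [] if errors is None else errors
    for key in comp:
        if key not in KNOWN_FIELDS:
            errors.append(f"unknown field: {key}")
    if "name" not in comp and not any(
            k in comp for k in ("molfile", "smiles", "inchi", "formula")):
        errors.append("component needs a name or a structure")
    if "ratio" in comp and len(comp["ratio"]) != 2:
        errors.append("ratio must be [numerator, denominator]")
    for child in comp.get("contents", []):
        validate_component(child, errors)
    return errors

errs = validate_component({"name": "hexanes", "contents": [
    {"smiles": "CCCCCC", "quantity": 60, "units": "%"},
    {"color": "clear"},   # deliberately invalid: unknown field, no structure
]})
```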
<h4 id="minchi-generation-algorithm">MInChI Generation Algorithm</h4>
<p>To generate <code>MInChI=0.00.1S/...</code>, the software must follow these steps:</p>
<ol>
<li>
<p><strong>Component Layer</strong>:</p>
<ul>
<li>Calculate standard <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> for all structures in the mixture</li>
<li>Sort distinct InChIs alphabetically by the InChI string itself</li>
<li>Join with <code>&amp;</code> to form the structure layer</li>
</ul>
</li>
<li>
<p><strong>Hierarchy &amp; Concentration Layers</strong>:</p>
<ul>
<li>Traverse the Mixfile tree recursively</li>
<li><strong>Indexing</strong>: Use integer indices (1-based) referring to the sorted InChI list</li>
<li><strong>Grouping</strong>: Use <code>{}</code> to denote hierarchy branches and <code>&amp;</code> to separate nodes at the same level</li>
<li><strong>Concentration</strong>: Convert all quantities to canonical unit codes and apply scaling factors</li>
</ul>
</li>
</ol>
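<p>The steps above can be sketched for the simple flat (non-hierarchical) case. Concentrations are treated as opaque tokens here; real generation would also canonicalize units, and nested mixtures need the <code>{}</code> grouping syntax:</p>

```python
def minchi_for_flat_mixture(components):
    """Assemble a MInChI-style string for a flat mixture (sketch).

    `components` is a list of (inchi_body, concentration_token) pairs,
    where inchi_body is the InChI with its 'InChI=1S/' prefix stripped.
    """
    bodies = sorted({inchi for inchi, _ in components})        # sort alphabetically
    index = {inchi: i + 1 for i, inchi in enumerate(bodies)}   # 1-based indices
    ordered = sorted(components, key=lambda c: index[c[0]])
    n_layer = "&".join(str(index[inchi]) for inchi, _ in ordered)
    g_layer = "&".join(conc for _, conc in ordered)
    return "MInChI=0.00.1S/" + "&".join(bodies) + "/n" + n_layer + "/g" + g_layer

# Invented methanol/water example with placeholder concentration tokens:
s = minchi_for_flat_mixture([
    ("H2O/h1H2", "75wf"),
    ("CH4O/c1-2/h2H,1H3", "25wf"),
])
```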
<h4 id="unit-standardization-table">Unit Standardization Table</h4>
<p>Replication requires mapping input units to canonical MInChI codes. The full table from the paper (Table 1) includes:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Input Unit</th>
          <th style="text-align: left">MInChI Code</th>
          <th style="text-align: left">Scale Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">%</td>
          <td style="text-align: left">pp</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">w/v%</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">w/w%</td>
          <td style="text-align: left">wf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">v/v%</td>
          <td style="text-align: left">vf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/mol%</td>
          <td style="text-align: left">mf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/L (M)</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">mmol/L</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">g/L</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/kg</td>
          <td style="text-align: left">mb</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">ratio</td>
          <td style="text-align: left">vp</td>
          <td style="text-align: left">1</td>
      </tr>
  </tbody>
</table>
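<p>The table translates directly into a lookup; the dictionary below is a sketch (the helper name <code>to_minchi_units</code> is invented, not from the paper):</p>

```python
# Table 1 as a lookup: input unit -> (MInChI code, scale factor)
UNIT_MAP = {
    "%":        ("pp", 1),
    "w/v%":     ("wv", 0.01),
    "w/w%":     ("wf", 0.01),
    "v/v%":     ("vf", 0.01),
    "mol/mol%": ("mf", 0.01),
    "mol/L":    ("mr", 1),
    "mmol/L":   ("mr", 1e-3),
    "g/L":      ("wv", 1e-3),
    "mol/kg":   ("mb", 1),
    "ratio":    ("vp", 1),
}

def to_minchi_units(value, unit):
    """Return (scaled value, canonical MInChI unit code)."""
    code, scale = UNIT_MAP[unit]
    return value * scale, code

print(to_minchi_units(97, "%"))  # -> (97, 'pp')
```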
<h4 id="text-extraction-logic">Text Extraction Logic</h4>
<p>The paper defines a recursive procedure for parsing plain-text mixture descriptions:</p>
<ol>
<li><strong>Input</strong>: Raw text string (e.g., &ldquo;2 M acetone in water&rdquo;)</li>
<li><strong>Rule Application</strong>: Apply RegEx rules in order:
<ul>
<li><em>Remove</em>: Delete common filler words (&ldquo;solution&rdquo;, &ldquo;in&rdquo;)</li>
<li><em>Replace</em>: Substitute known variations</li>
<li><em>Concentration</em>: Extract quantities like &ldquo;2 M&rdquo;, &ldquo;97%&rdquo;</li>
<li><em>Branch</em>: Split phrases like &ldquo;A in B&rdquo; into sub-nodes</li>
</ul>
</li>
<li><strong>Lookup</strong>: Check cleaned name against a custom table (handles cases like &ldquo;xylenes&rdquo; or specific structures)</li>
<li><strong>OPSIN</strong>: If no lookup match, send to the OPSIN tool to generate SMILES from the chemical name</li>
<li><strong>Embed</strong>: If structure found, generate 2D coordinates (Molfile) via RDKit</li>
</ol>
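<p>A toy version of this recursive parse, covering only the <em>Branch</em> and <em>Concentration</em> rules (the regexes and node fields are illustrative, and the lookup/OPSIN/embed steps are omitted):</p>

```python
import re

# Quantities like "2 M" or "97%"
CONC = re.compile(r"(\d+(?:\.\d+)?)\s*(M|%)")

def parse_mixture(text):
    node = {"name": text.strip(), "contents": []}
    # Branch rule: "A in B" splits into two sub-nodes
    if " in " in text:
        left, right = text.split(" in ", 1)
        node["name"] = ""
        node["contents"] = [parse_mixture(left), parse_mixture(right)]
        return node
    # Concentration rule: extract the quantity, keep the cleaned name
    m = CONC.search(text)
    if m:
        node["quantity"] = float(m.group(1))
        node["units"] = m.group(2)
        node["name"] = CONC.sub("", text).strip()
    return node

tree = parse_mixture("2 M acetone in water")
print(tree["contents"][0])  # {'name': 'acetone', 'contents': [], 'quantity': 2.0, 'units': 'M'}
```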
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clark, A. M., McEwen, L. R., Gedeck, P., &amp; Bunin, B. A. (2019). Capturing mixture composition: an open machine-readable format for representing mixed substances. <em>Journal of Cheminformatics</em>, <em>11</em>(1), 33. <a href="https://doi.org/10.1186/s13321-019-0357-4">https://doi.org/10.1186/s13321-019-0357-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2019)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{clark2019capturing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Capturing mixture composition: an open machine-readable format for representing mixed substances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Clark, Alex M and McEwen, Leah R and Gedeck, Peter and Bunin, Barry A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IUPAC/MInChI">Official MInChI GitHub repository</a></li>
</ul>
]]></content:encoded></item><item><title>Making InChI FAIR and Sustainable for Inorganic Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2025/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2025/</guid><description>InChI v1.07 modernizes chemical identifiers for FAIR data principles and adds comprehensive support for inorganic compounds.</description><content:encoded><![CDATA[<h2 id="paper-contribution-modernizing-chemical-identifiers">Paper Contribution: Modernizing Chemical Identifiers</h2>
<p>This is a <strong>Resource</strong> paper that describes the development and maintenance of InChI (International Chemical Identifier), a fundamental infrastructure component for chemical databases. While it includes methodological improvements to the canonicalization algorithm for inorganic compounds, its primary contribution is ensuring the sustainability and accessibility of a critical chemical informatics resource.</p>
<h2 id="motivation-the-inorganic-chemistry-problem">Motivation: The Inorganic Chemistry Problem</h2>
<p>The International Chemical Identifier (InChI) is ubiquitous in chemistry databases, identifying well over a billion structures. But the system was designed specifically for organic chemistry, and it systematically mishandles organometallic structures. The original implementation had significant limitations:</p>
<ul>
<li><strong>FAIR principles gap</strong>: Development was closed-source, documentation was inadequate, and the codebase was difficult to maintain</li>
<li><strong>Inorganic chemistry failure</strong>: Metal-ligand bonds were automatically disconnected, destroying stereochemical information for coordination complexes</li>
<li><strong>Technical debt</strong>: More than 3000 bugs and security vulnerabilities, nearly 60 Google OSS-Fuzz issues, and an unmaintainable codebase</li>
</ul>
<p>If you&rsquo;ve ever tried to search for a metal complex in a chemical database and gotten nonsense results, this is why. This paper describes the fix.</p>
<h2 id="core-innovation-smart-metal-ligand-handling">Core Innovation: Smart Metal-Ligand Handling</h2>
<p>The key innovations are:</p>
<ol>
<li>
<p><strong>Smart metal-ligand bond handling</strong>: A decision tree algorithm that uses coordination number and electronegativity to determine which bonds to keep and which to disconnect, preserving stereochemistry for coordination complexes</p>
</li>
<li>
<p><strong>Modernized development infrastructure</strong>: Migration to GitHub with open development, comprehensive testing, and maintainable documentation</p>
</li>
<li>
<p><strong>Backward compatibility</strong>: The core canonicalization algorithm remained unchanged, preserving over a billion existing InChIs for organic compounds</p>
</li>
</ol>
<p>The preprocessing step applies a two-pass iterative process for every metal in a structure:</p>
<ol>
<li><strong>Terminal metals</strong> (connected to only one other atom): check the electronegativity lookup table and disconnect if $\Delta EN \geq 1.7$</li>
<li><strong>Non-terminal metals</strong>: if coordination number exceeds the element&rsquo;s standard valence threshold, keep all bonds; otherwise, apply the same electronegativity check per bond (if at least one bond is kept, all are retained)</li>
<li>Hardcoded exceptions exist for Grignard reagents and organolithium compounds</li>
</ol>
<p>For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected as a coordination complex.</p>
<h2 id="validation-methods--experiments">Validation Methods &amp; Experiments</h2>
<p>The paper focuses on software engineering validation:</p>
<ul>
<li><strong>Bug fixing</strong>: Fixed more than 3000 bugs and security issues, plus nearly 60 Google OSS-Fuzz issues from the legacy codebase</li>
<li><strong>Backward compatibility testing</strong>: Verified that existing organic molecule InChIs remained unchanged</li>
<li><strong>Inorganic compound validation</strong>: Tested the new decision tree algorithm on coordination complexes, organometallic compounds, and ionic salts</li>
<li><strong>Documentation overhaul</strong>: Split technical documentation into Chemical Manual (for chemists) and Technical Manual (for developers)</li>
<li><strong>Web Demo</strong>: Created a browser-based <a href="https://iupac-inchi.github.io/InChI-Web-Demo/">InChI Web Demo</a> that calculates InChI, InChIKey, and AuxInfo from drawn structures or Molfiles, with all computation performed client-side</li>
</ul>
<p>The validation approach emphasizes maintaining the &ldquo;same molecule, same identifier&rdquo; principle while extending coverage to inorganic chemistry.</p>
<h2 id="key-outcomes-and-future-work">Key Outcomes and Future Work</h2>
<p>The v1.07 release successfully:</p>
<ul>
<li><strong>Modernizes infrastructure</strong>: Open development on GitHub with maintainable codebase</li>
<li><strong>Extends to inorganic chemistry</strong>: Proper handling of coordination complexes and organometallic compounds</li>
<li><strong>Maintains backward compatibility</strong>: No breaking changes for existing organic compound InChIs</li>
<li><strong>Improves database search</strong>: Metal complexes now searchable with correct stereochemistry preserved</li>
<li><strong>IUPAC approval</strong>: Version 1.07 has been approved by IUPAC&rsquo;s Committee on Publications and Cheminformatics Data Standards (CPCDS)</li>
</ul>
<p><strong>Acknowledged limitations</strong> for future work:</p>
<ul>
<li>Stereochemistry for inorganic and organometallic compounds still needs improvement, including atropisomers and MDL enhanced stereochemistry</li>
<li>Mixtures (MInChI) and nanomaterials (NInChI) remain unsolved problems</li>
<li>Chemical identifiers work best for discrete molecules and struggle with variable-composition materials</li>
</ul>
<p><strong>Impact</strong>: This update improves searchability of inorganic and organometallic compounds in major chemical databases by preserving coordination bond information that was previously discarded.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="software--data-availability">Software &amp; Data Availability</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IUPAC-InChI/InChI">IUPAC-InChI/InChI</a></td>
          <td>Code</td>
          <td>Open source (IUPAC/InChI Trust)</td>
          <td>Official C/C++ implementation of InChI v1.07</td>
      </tr>
      <tr>
          <td><a href="https://iupac-inchi.github.io/InChI-Web-Demo/">InChI Web Demo</a></td>
          <td>Other</td>
          <td>Open source</td>
          <td>Browser-based InChI/InChIKey generator for testing</td>
      </tr>
  </tbody>
</table>
<p>The InChI v1.07 codebase, primarily written in C/C++, is openly available on GitHub at <a href="https://github.com/IUPAC-InChI/InChI">IUPAC-InChI/InChI</a>. The repository includes the core canonicalization engine and the new inorganic preprocessing logic. Both the Technical Manual (for structural integration) and the Chemical Manual are maintained alongside the codebase. Compiled binaries are available for Windows, Linux, and macOS.</p>
<p><strong>Benchmarking Data</strong>: Validation of the new decision tree logic is handled through unit tests wired into the repository&rsquo;s continuous integration pipelines. Tests against existing organic compounds confirm backward compatibility, while new suites of coordination complexes and organometallic compounds verify that the v1.07 preprocessing triggers as expected.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="the-metal-problem">The Metal Problem</h4>
<p>InChI&rsquo;s original algorithm assumed that bonds to metals were ionic and automatically disconnected them. This makes sense for something like sodium chloride (NaCl), where you have separate $\text{Na}^+$ and $\text{Cl}^-$ ions.</p>
<p>It fails for:</p>
<ul>
<li><strong>Coordination complexes</strong>: Where ligands are bonded to the metal center</li>
<li><strong>Organometallic compounds</strong>: Where carbon-metal bonds are covalent</li>
<li><strong>Sandwich compounds</strong>: Like ferrocene, where the bonding has both ionic and covalent character</li>
</ul>
<p>The result: loss of stereochemical information and identical InChIs for structurally different compounds.</p>
<h4 id="the-solution-smart-preprocessing">The Solution: Smart Preprocessing</h4>
<p>The new system uses a decision tree to figure out which metal-ligand bonds to keep and which to disconnect. The process is <strong>iterative</strong>: it runs for every metal in the structure, then checks every bond to that metal. In the C/C++ repository, this preprocessing logic acts as a filter applied <em>before</em> the traditional organic canonicalization engine (from v1.06) runs, dynamically determining whether coordination bonds are retained for downstream layer generation.</p>
<h5 id="decision-tree-logic">Decision Tree Logic</h5>
<p>The algorithm handles metals in two passes. First, <strong>terminal metals</strong> (bonded to only one atom) are checked against the electronegativity lookup table and disconnected if $\Delta EN \geq 1.7$. This preserves all metal-metal bonds.</p>
<p>Second, <strong>non-terminal metals</strong> are examined. For a metal $m$ bonded to ligand $l$:</p>
<p>$$
\begin{aligned}
B(m, l) &amp;=
\begin{cases}
\text{Connected (all bonds)} &amp; \text{if } CN(m) &gt; V(m) \\
\text{Connected} &amp; \text{if } |EN(m) - EN(l)| &lt; 1.7 \\
\text{Disconnected} &amp; \text{if } |EN(m) - EN(l)| \geq 1.7
\end{cases}
\end{aligned}
$$</p>
<p>A key rule: if at least one metal-ligand bond is kept for a given metal, all other bonds to that metal are also retained (no disconnection is carried out).</p>
<p><em>(Note: Explicit overrides exist for specific classes like Grignard reagents).</em></p>
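<p>The decision rule above can be sketched as a small function. Every number here is illustrative: the electronegativity values and valence threshold are invented so that the sketch reproduces the $\text{FeCl}_2$ / $[\text{FeCl}_4]^{2-}$ behavior; the actual lookup tables live in the C/C++ codebase.</p>

```python
# Illustrative-only numbers: invented EN values and valence thresholds,
# chosen so the FeCl2 / [FeCl4]2- example behaves as described above.
EN = {"Fe": 1.3, "Cl": 3.1}
VALENCE = {"Fe": 3}

def keep_bonds(metal, ligands):
    """True if the metal's bonds are all retained (stays connected)."""
    if len(ligands) > VALENCE[metal]:   # CN(m) > V(m): keep everything
        return True
    # If any one bond is covalent enough, all bonds are retained
    return any(abs(EN[metal] - EN[l]) < 1.7 for l in ligands)

print(keep_bonds("Fe", ["Cl", "Cl"]))              # False: FeCl2 disconnects
print(keep_bonds("Fe", ["Cl", "Cl", "Cl", "Cl"]))  # True: [FeCl4]2- stays connected
```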
<h5 id="hardcoded-chemical-exceptions">Hardcoded Chemical Exceptions</h5>
<p>The algorithm includes specific overrides based on well-established chemistry:</p>
<ul>
<li><strong>Grignard reagents (RMgX)</strong>: Explicitly configured to <strong>keep</strong> the Mg-C bond but <strong>disconnect</strong> the Mg-halide bond</li>
<li><strong>Organolithium compounds (RLi)</strong>: Explicitly configured to keep the structure intact</li>
</ul>
<p>These exceptions exist because the general electronegativity rules would give incorrect results for these compound classes.</p>
<h5 id="practical-example">Practical Example</h5>
<p>For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected because its coordination number exceeds the threshold.</p>
<h4 id="how-inchi-generation-works">How InChI Generation Works</h4>
<p>The process has six main steps:</p>
<ol>
<li><strong>Parse input</strong>: Read the structure from a file (Molfile, SDF, etc.)</li>
<li><strong>Convert to internal format</strong>: Transform into the software&rsquo;s data structures</li>
<li><strong>Normalize</strong>: Standardize tautomers, resolve ambiguities (where the new metal rules apply)</li>
<li><strong>Canonicalize</strong>: Create a unique representation independent of atom numbering</li>
<li><strong>Generate InChI string</strong>: Build the layered text identifier</li>
<li><strong>Create InChIKey</strong>: Hash the full string into a 27-character key for databases</li>
</ol>
<p>The InChI itself has separate layers for formula, connectivity, hydrogens, stereochemistry, isotopes, and charge. The InChIKey is what actually gets stored in databases for fast searching.</p>
<h5 id="inchikey-version-flag">InChIKey Version Flag</h5>
<p>The flag character of the InChIKey (the ninth character of the second block, immediately before the final version letter) indicates the version status:</p>
<ul>
<li><strong>&ldquo;S&rdquo;</strong>: Standard InChI</li>
<li><strong>&ldquo;N&rdquo;</strong>: Non-standard InChI</li>
<li><strong>&ldquo;B&rdquo;</strong>: Beta (experimental features)</li>
</ul>
<p>This flag is important for anyone parsing InChIKeys programmatically, as it tells you whether the identifier was generated using stable or experimental algorithms.</p>
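<p>A minimal sketch of reading that flag programmatically, assuming the flag sits immediately before the final version letter in the key&rsquo;s second block (ethanol&rsquo;s well-known standard InChIKey is used as the example):</p>

```python
def inchikey_status(key):
    """Classify an InChIKey by its flag character (the character just
    before the version letter at the end of the second block)."""
    block2 = key.split("-")[1]
    return {"S": "standard", "N": "non-standard", "B": "beta"}.get(block2[-2], "unknown")

print(inchikey_status("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))  # standard
```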
<h2 id="additional-context">Additional Context</h2>
<h3 id="what-inchi-actually-does">What InChI Actually Does</h3>
<p>InChI creates a unique text string for any chemical structure. SMILES has multiple vendor implementations and can represent the same molecule in different ways. InChI provides a single, standardized format controlled by IUPAC. The goal is simple: same molecule, same identifier, every time.</p>
<p>This matters for FAIR data principles:</p>
<ul>
<li><strong>Findable</strong>: You can search for a specific compound across databases</li>
<li><strong>Accessible</strong>: The standard is open and free</li>
<li><strong>Interoperable</strong>: Different systems can connect chemical knowledge</li>
<li><strong>Reusable</strong>: The identifiers work consistently across platforms</li>
</ul>
<h3 id="better-documentation">Better Documentation</h3>
<p>The technical manual has been split into two documents:</p>
<ul>
<li><strong>Chemical Manual</strong>: For chemists who need to understand what InChIs mean</li>
<li><strong>Technical Manual</strong>: For developers who need to implement the algorithms</li>
</ul>
<p>This addresses the problem of the previous documentation serving both audiences poorly.</p>
<h3 id="the-bigger-picture">The Bigger Picture</h3>
<p>InChI&rsquo;s evolution reflects chemistry&rsquo;s expansion beyond its organic roots. The fact that it took this long to properly handle inorganic compounds shows how much computational chemistry has historically focused on carbon-based molecules.</p>
<p>As the field moves into catalysis, materials science, and coordination chemistry applications, having proper chemical identifiers becomes essential. You can&rsquo;t build FAIR chemical databases if half of chemistry is represented incorrectly.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blanke, G., Brammer, J., Baljozovic, D., Khan, N. U., Lange, F., Bänsch, F., Tovee, C. A., Schatzschneider, U., Hartshorn, R. M., &amp; Herres-Pawlis, S. (2025). Making the InChI FAIR and sustainable while moving to inorganics. <em>Faraday Discussions</em>, 256, 503-519. <a href="https://doi.org/10.1039/D4FD00145A">https://doi.org/10.1039/D4FD00145A</a></p>
<p><strong>Publication</strong>: Faraday Discussions, 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blanke2025making,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Making the InChI FAIR and sustainable while moving to inorganics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blanke, G. and Brammer, J. and Baljozovic, D. and Khan, N. U. and Lange, F. and B{\&#34;a}nsch, F. and Tovee, C. A. and Schatzschneider, U. and Hartshorn, R. M. and Herres-Pawlis, S.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Faraday Discussions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{256}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{503--519}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The Worldwide Chemical Structure Identifier Standard</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-2013/</guid><description>Heller et al. (2013) explain how IUPAC's InChI became the global standard for representing chemical structures, its governance, and current limitations.</description><content:encoded><![CDATA[<h2 id="inchi-as-a-resource-and-systematization-standard">InChI as a Resource and Systematization Standard</h2>
<p>This is a <strong>Resource &amp; Systematization Paper</strong> that reviews the history, technical architecture, governance structure, and implementation status of the InChI standard. It documents both the institutional development of an open chemical identifier and the technical specification that enables it.</p>
<h2 id="the-motivation-interoperability-in-chemical-databases">The Motivation: Interoperability in Chemical Databases</h2>
<p>Before InChI, the chemistry community faced a fundamental interoperability problem. Chemical databases used proprietary systems like CAS Registry Numbers, or format-dependent representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. These were expensive, restricted, and relied on &ldquo;in-house&rdquo; databases.</p>
<p>The authors argue the Internet and Open Source software acted as a <strong>&ldquo;black swan&rdquo; event</strong> that disrupted this status quo. The Internet created a need to link diverse, free and fee-based resources without a central gatekeeper. InChI was designed as the solution: a non-proprietary, open-source identifier enabling linking of distinct data compilations.</p>
<h2 id="technical-and-institutional-innovations-of-inchi">Technical and Institutional Innovations of InChI</h2>
<p>InChI&rsquo;s innovation is both technical and institutional:</p>
<p><strong>Technical novelty</strong>: A hierarchical &ldquo;layered&rdquo; canonicalization system where structure representations build from basic connectivity to full stereochemistry. This allows flexible matching: a molecule with unknown stereochemistry produces an InChI that&rsquo;s a subset of the same molecule with known stereochemistry.</p>
<p><strong>Institutional novelty</strong>: Creating an open standard governed by a charitable trust (the InChI Trust) that convinced commercial competitors (publishers, databases) to adopt it as a &ldquo;pre-competitive&rdquo; necessity. This solved the political problem of maintaining an open standard in a competitive industry.</p>
<h3 id="technical-architecture-layers-and-hashing">Technical Architecture: Layers and Hashing</h3>
<h4 id="the-inchi-string">The InChI String</h4>
<p>InChI is a <strong>canonicalized structure representation</strong> derived from IUPAC conventions. It uses a hierarchical &ldquo;layered&rdquo; format where specific layers add detail. The exact technical specification includes these string segments:</p>
<ol>
<li><strong>Main Layer</strong>: Chemical Formula</li>
<li><strong>Connectivity Layer (<code>/c</code>)</strong>: Atoms and bonds (excluding bond orders)</li>
<li><strong>Hydrogen Layer (<code>/h</code>)</strong>: Tautomeric and immobile H atoms</li>
<li><strong>Charge (<code>/q</code>) &amp; Proton Balance (<code>/p</code>)</strong>: Accounting for ionization</li>
<li><strong>Stereochemistry</strong>:
<ul>
<li>Double bond (<code>/b</code>) and Tetrahedral (<code>/t</code>) parity</li>
<li>Parity inversion (<code>/m</code>)</li>
<li>Stereo type (<code>/s</code>): absolute, relative, or racemic</li>
</ul>
</li>
<li><strong>Fixed-H Layer (<code>/f</code>)</strong>: Distinguishes specific tautomers if needed</li>
</ol>
<p>This layered approach means that a molecule with unknown stereochemistry will have an InChI that&rsquo;s a subset of the same molecule with known stereochemistry. This allows for flexible matching at the connectivity level even without complete stereochemical information.</p>
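<p>A quick way to see the layers is to split a standard InChI on <code>/</code>. Using ethanol&rsquo;s well-known InChI (the one-letter prefixes key into the layer list above):</p>

```python
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"  # ethanol (standard InChI)

prefix, formula, *rest = inchi.split("/")
layers = {seg[0]: seg[1:] for seg in rest}  # one-letter tag -> payload

print(prefix)   # InChI=1S   (version + standard flag)
print(formula)  # C2H6O      (main layer: chemical formula)
print(layers)   # {'c': '1-2-3', 'h': '3H,2H2,1H3'}
```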
<h4 id="the-inchikey">The InChIKey</h4>
<p>Because InChI strings are often too long for search engines (queries tend to break beyond ~30 characters or at symbols like <code>/</code> and <code>+</code>), the InChIKey was created.</p>
<p><strong>Mechanism</strong>: A 27-character string generated via a <strong>SHA-256 hash</strong> of the InChI string. This can be represented as:</p>
<p>$$ \text{InChIKey} = f_{\text{SHA-256}}(\text{InChI}) $$</p>
<p><strong>Structure</strong>:</p>
<ul>
<li><strong>Block 1 (14 characters)</strong>: Encodes the molecular skeleton (connectivity)</li>
<li><strong>Block 2 (10 characters)</strong>: Eight letters encoding stereochemistry and isotopes, plus a flag indicating standard InChI (S) and an InChI version indicator (A for version 1)</li>
<li><strong>Block 3 (1 character)</strong>: Protonation flag (e.g., &lsquo;N&rsquo; for neutral)</li>
</ul>
<p>Because the InChIKey is a hash, it cannot be converted back to a structure (irreversible) and has a theoretical risk of collision. It is important to distinguish between <strong>InChI collisions</strong> (which are due to flaws/bugs and are very rare) and <strong>InChIKey collisions</strong> (which are mathematically inevitable due to hashing).</p>
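<p>The three blocks are easy to pull apart programmatically; a short sketch using ethanol&rsquo;s well-known InChIKey:</p>

```python
key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"  # ethanol's standard InChIKey
block1, block2, block3 = key.split("-")

print(len(block1), block1)  # 14-character skeleton (connectivity) block
print(len(block2), block2)  # 10 characters: stereo/isotopes + 'S' flag + 'A' version
print(block3)               # protonation flag: 'N' for neutral
```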
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This is a systematization paper documenting an existing standard. However, the authors provide:</p>
<p><strong>Validation evidence</strong>:</p>
<ul>
<li><strong>Certification Suite</strong>: A test suite that software vendors must pass to display the &ldquo;InChI Certified&rdquo; logo, preventing fragmentation</li>
<li><strong>Round-trip conversion testing</strong>: Demonstrated &gt;99% success rate converting InChI back to structure (100% with AuxInfo layer)</li>
<li><strong>Real-world adoption metrics</strong>: Documented integration across major chemical databases and publishers</li>
</ul>
<p><strong>Known limitations identified</strong>:</p>
<ul>
<li>Tautomer representation issues in Version 1 (different drawings of the same tautomer can generate different InChIs)</li>
<li>Edge cases in stereochemistry representation</li>
</ul>
<h3 id="institutional-history--governance">Institutional History &amp; Governance</h3>
<p><strong>Origin</strong>: The project was initiated at a March 2000 IUPAC meeting in Washington, DC. It was originally called the <strong>IUPAC Chemical Identifier Project (IChIP)</strong>.</p>
<p><strong>Development</strong>: Technical work was done by NIST (Stein, Heller, Tchekhovskoi), overseen by the IUPAC <strong>CCINS</strong> committee, which later became the <strong>InChI Subcommittee</strong> of Division VIII.</p>
<p><strong>The InChI Trust</strong>: To ensure the algorithm survived beyond a volunteer organization, the <strong>InChI Trust</strong> was formed in 2009. It is a UK charity supported by publishers and databases (e.g., Nature, RSC) to maintain the standard pre-competitively. This was a critical innovation: getting commercial publishers and software vendors to agree that a non-proprietary standard would benefit everyone.</p>
<h2 id="real-world-impact-and-future-directions">Real-World Impact and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Success through &ldquo;un-coerced adoption&rdquo;</strong>: InChI succeeded because commercial competitors viewed it as a &ldquo;pre-competitive&rdquo; necessity for the Internet age. The open governance model proved durable.</p>
<p><strong>Technical achievements</strong>:</p>
<ul>
<li>Reversible representation (&gt;99% without AuxInfo, 100% with it)</li>
<li>Hierarchical structure enables flexible matching at different levels of detail</li>
<li>InChIKey enables web search despite being a hash (with inherent collision risk)</li>
</ul>
<h3 id="limitations-acknowledged-as-of-2013">Limitations Acknowledged (as of 2013)</h3>
<ul>
<li><strong>Tautomerism Issues</strong>: Different drawings of the same tautomer (e.g., 1,4-oxime vs nitroso) can generate different InChIs in Version 1, which is targeted for Version 2</li>
<li><strong>Hash collision risk</strong>: InChIKey collisions are mathematically inevitable due to SHA-256 hashing, though InChI collisions (actual bugs) are very rare</li>
<li><strong>Certification required</strong>: To prevent fragmentation, software must pass the InChI Certification Suite</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors note that while this paper documents the state as of 2013, InChI continues to evolve. Tautomer handling and edge cases in stereochemistry representation were priorities for future versions. The governance model through the InChI Trust was designed to ensure long-term maintenance beyond the original volunteer contributors.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This systematization paper documents an existing standard. Key implementation resources are openly maintained by the InChI Trust.</p>
<h3 id="code--software">Code &amp; Software</h3>
<ul>
<li><strong>Official Open Source Implementation</strong>: The C source code and pre-compiled binaries for the InChI algorithm are freely available via the <a href="https://www.inchi-trust.org/downloads/">InChI Trust Downloads Page</a> and their <a href="https://github.com/IUPAC-InChI/InChI">official GitHub repository</a>.</li>
<li><strong>Canonicalization algorithm</strong>: Open-source implementation of IUPAC-based rules for generating unique representations from multiple possible drawings of the same molecule.</li>
</ul>
<h3 id="data--validation">Data &amp; Validation</h3>
<ul>
<li><strong>InChI Certification Suite</strong>: A test suite of chemical structures provided by the InChI Trust used to validate that third-party software implementations generate correct InChIs.</li>
<li><strong>Version 1 specification</strong>: Complete technical documentation of the layered format.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Round-trip conversion</strong>: &gt;99% success rate (100% with AuxInfo) as validated by NIST and IUPAC.</li>
<li><strong>Certification testing</strong>: Pass/fail validation for software claiming InChI compliance.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI - the worldwide chemical structure identifier standard. <em>Journal of Cheminformatics</em>, <em>5</em>(1), 7. <a href="https://doi.org/10.1186/1758-2946-5-7">https://doi.org/10.1186/1758-2946-5-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{heller2013inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{InChI} - the worldwide chemical structure identifier standard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heller, Stephen and McNaught, Alan and Stein, Stephen and Tchekhovskoi, Dmitrii and Pletnev, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{7}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/1758-2946-5-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI and Tautomerism: Toward Comprehensive Treatment</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/inchi-and-tautomers/</guid><description>Dhaked et al. compile 86 tautomeric rules and validate them across 400M+ structures, revealing that current InChI misses half of tautomeric relationships.</description><content:encoded><![CDATA[<h2 id="paper-contribution-a-systematized-tautomer-database-resource">Paper Contribution: A Systematized Tautomer Database Resource</h2>
<p>This is a <strong>Resource</strong> paper with strong <strong>Systematization</strong> elements. It provides a comprehensive catalog of 86 tautomeric transformation rules (20 pre-existing CACTVS defaults plus 66 new rules derived from experimental literature), designed to serve as a foundational resource for chemical database systems and the InChI V2 identifier standard. The systematic validation across 400+ million structures also makes it a benchmarking study for evaluating current chemoinformatics tools.</p>
<h2 id="the-tautomerism-problem-in-chemical-databases">The Tautomerism Problem in Chemical Databases</h2>
<p>Chemical databases face a fundamental problem: the same molecule can appear multiple times under different identifiers simply because it exists in different tautomeric forms. For example, glucose&rsquo;s ring-closed and open-chain forms are the same molecule; however, current chemical identifiers (including InChI) often treat them as distinct compounds.</p>


<figure class="post-figure center ">
    <img src="/img/notes/Glucose-tautomerism.webp"
         alt="D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism"
         title="D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ring-chain tautomerism in glucose: the open-chain aldehyde form (left) and the cyclic pyranose form (right) are the same molecule in different tautomeric states.</figcaption>
    
</figure>

<p>This creates three critical problems:</p>
<ol>
<li><strong>Database redundancy</strong>: Millions of duplicate entries for the same chemical entities</li>
<li><strong>Search failures</strong>: Researchers miss relevant compounds during structure searches</li>
<li><strong>ML training issues</strong>: Machine learning models learn to treat tautomers as different molecules</li>
</ol>
<p>The motivation for this work is to provide a comprehensive, experimentally-grounded rule set that enables InChI V2 to properly recognize tautomeric relationships, eliminating these problems at the identifier level.</p>
<h2 id="86-comprehensive-tautomeric-transformation-rules">86 Comprehensive Tautomeric Transformation Rules</h2>
<p>The key contributions are:</p>
<ol>
<li>
<p><strong>Comprehensive Rule Set</strong>: Compilation of <strong>86 tautomeric transformation rules</strong> (20 pre-existing CACTVS defaults plus 66 new rules derived from experimental literature), categorized into:</p>
<ul>
<li>54 Prototropic rules (classic H-movement tautomerism)</li>
<li>21 Ring-Chain rules (cyclic/open-chain transformations)</li>
<li>11 Valence rules (structural rearrangements with valence changes)</li>
</ul>
</li>
<li>
<p><strong>Massive-Scale Validation</strong>: Testing these rules against <strong>nine major chemical databases</strong> totaling over 400 million structures to identify coverage gaps in current InChI implementations</p>
</li>
<li>
<p><strong>Quantitative Assessment</strong>: Systematic measurement showing that current InChI (even with Nonstandard 15T + KET settings) only achieves ~50% success in recognizing tautomeric relationships, with some new rules showing &lt;2% success rates</p>
</li>
<li>
<p><strong>Practical Tools</strong>: Creation of the <strong>Tautomerizer</strong> web tool for public use, demonstrating practical application of the rule set</p>
</li>
</ol>
<p>The novelty lies in the systematic compilation and validation of transformation rules at a scale that reveals critical gaps in current chemical identification systems.</p>
<h2 id="massive-scale-validation-across-400m-structures">Massive-Scale Validation Across 400M+ Structures</h2>
<h3 id="database-analysis">Database Analysis</h3>
<p>The researchers analyzed <strong>nine chemical databases</strong> totaling 400+ million structures:</p>
<ul>
<li><strong>Public databases</strong>: PubChem (largest), ChEMBL, DrugBank, PDB Ligands, SureChEMBL, AMS, ChemNavigator</li>
<li><strong>Private databases</strong>: CSD (Cambridge Structural Database), CSDB (NCI internal)</li>
</ul>
<h3 id="methodology">Methodology</h3>
<p><strong>Software</strong>: CACTVS Chemoinformatics Toolkit (versions 3.4.6.33 and 3.4.8.6)</p>
<p><strong>Tautomer Generation Protocol</strong>:</p>
<ul>
<li><strong>Algorithm</strong>: Single-step generation (apply transforms to input structure only, avoiding recursion)</li>
<li><strong>Constraints</strong>: Max 10 tautomers per structure, 30-second CPU timeout per transform</li>
<li><strong>Format</strong>: All rules expressed as SMIRKS strings</li>
<li><strong>Stereochemistry</strong>: Stereocenters involved in tautomerism were flattened during transformation</li>
</ul>
<p><strong>Success Metrics</strong> (tested against InChI V.1.05):</p>
<ul>
<li><strong>Complete InChI match</strong>: All tautomers share identical InChI</li>
<li><strong>Partial InChI match</strong>: At least two tautomers share an InChI</li>
<li>Tested against two InChI configurations: Standard InChI and Nonstandard InChI (with 15T and KET options enabled)</li>
</ul>
<h3 id="rule-coverage-analysis">Rule Coverage Analysis</h3>
<p>For each of the 86 rules, the researchers:</p>
<ol>
<li>Applied the transformation to all molecules in each database</li>
<li>Generated tautomers using the SMIRKS patterns</li>
<li>Computed InChI identifiers for each tautomer</li>
<li>Measured success rates (percentage of cases where InChI recognized the relationship)</li>
</ol>
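<p>These four steps amount to a single evaluation loop per rule. A minimal sketch, where <code>apply_smirks</code> and <code>to_inchi</code> are hypothetical stand-ins for the CACTVS and InChI toolkit calls used in the paper:</p>

```python
from collections import Counter

def evaluate_rule(molecules, smirks, apply_smirks, to_inchi):
    """Tally complete/partial/fail outcomes for one transformation rule (sketch)."""
    outcomes = Counter()
    for mol in molecules:
        tautomers = apply_smirks(mol, smirks)              # steps 1-2: generate tautomers
        if not tautomers:
            continue                                       # rule does not apply here
        inchis = [to_inchi(t) for t in (mol, *tautomers)]  # step 3: compute identifiers
        if len(set(inchis)) == 1:
            outcomes["complete"] += 1                      # step 4: InChI unifies all forms
        elif len(inchis) > len(set(inchis)):
            outcomes["partial"] += 1                       # at least two forms collide
        else:
            outcomes["fail"] += 1                          # every form gets a distinct InChI
    return outcomes
```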
<h3 id="key-findings-from-experiments">Key Findings from Experiments</h3>
<p><strong>Rule Frequency</strong>: The most common rule <code>PT_06_00</code> (1,3-heteroatom H-shift, covering keto-enol tautomerism) affects <strong>&gt;70% of molecules</strong> across databases.</p>
<p><strong>InChI Performance</strong>:</p>
<ul>
<li>Standard InChI: ~37% success rate</li>
<li>Nonstandard InChI (15T + KET): ~50% success rate</li>
<li>Many newly defined rules: &lt;2% success rate</li>
</ul>
<p><strong>Scale Impact</strong>: Implementing the full 86-rule set would approximately <strong>triple</strong> the number of compounds recognized as having tautomeric relationships relative to Standard InChI.</p>
<h2 id="outcomes-inchi-v2-requirements-and-coverage-gaps">Outcomes: InChI V2 Requirements and Coverage Gaps</h2>
<h3 id="main-findings">Main Findings</h3>
<ol>
<li>
<p><strong>Current Systems Are Inadequate</strong>: Even with the Nonstandard 15T + KET settings, InChI only achieves ~50% success in recognizing tautomeric relationships, with Standard InChI at ~37%</p>
</li>
<li>
<p><strong>Massive Coverage Gap</strong>: The new rule set reveals millions of tautomeric relationships that current InChI completely misses, particularly for ring-chain and valence tautomerism</p>
</li>
<li>
<p><strong>Implementation Requirement</strong>: InChI V2 will require a major redesign to handle the comprehensive rule set</p>
</li>
<li>
<p><strong>Rule Validation</strong>: The 86-rule set provides a validated foundation for next-generation chemical identifiers, with the new rules further confirmed against an independent ChEMBL 24.1 tautomer extraction</p>
</li>
</ol>
<h3 id="implications">Implications</h3>
<p><strong>For Chemical Databases</strong>:</p>
<ul>
<li>Reduced redundancy through proper tautomer recognition</li>
<li>Improved data quality and consistency</li>
<li>More comprehensive structure search results</li>
</ul>
<p><strong>For Machine Learning</strong>:</p>
<ul>
<li>More accurate training data (tautomers properly grouped)</li>
<li>Better molecular property prediction models</li>
<li>Reduced dataset bias from tautomeric duplicates</li>
</ul>
<p><strong>For Chemoinformatics Tools</strong>:</p>
<ul>
<li>Blueprint for InChI V2 development</li>
<li>Standardized rule set for tautomer generation</li>
<li>Public tool (Tautomerizer) for practical use</li>
</ul>
<h3 id="limitations-acknowledged">Limitations Acknowledged</h3>
<ul>
<li>Single-step generation only (omits recursive enumeration of all possible tautomers)</li>
<li>30-second timeout may miss complex transformations</li>
<li>Some tautomeric preferences are context-dependent (pH, solvent) and require more than static rules for capture</li>
</ul>
<h3 id="additional-validation">Additional Validation</h3>
<p>The authors validated their rule set against 4,158 tautomeric systems independently extracted from ChEMBL 24.1 via a SMILES-based tautomer hash (provided by Noel O&rsquo;Boyle and Roger Sayle). Their rules covered essentially all tautomeric systems in that set, with practically all cases handled by the standard CACTVS rules PT_02_00 through PT_21_00.</p>
<h3 id="companion-resource-tautomer-database">Companion Resource: Tautomer Database</h3>
<p>A companion paper describes the creation of a publicly available Tautomer Database (Tauto DB) containing over 2,800 tautomeric tuples extracted from experimental literature, available at <a href="https://cactus.nci.nih.gov/download/tautomer/">https://cactus.nci.nih.gov/download/tautomer/</a>. Data from this database informed the generation of new rules in this work.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The paper lays groundwork for InChI V2 development, emphasizing that the comprehensive rule set necessitates algorithmic redesign.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Datasets Analyzed</strong> (400M+ total structures):</p>
<p><strong>Public Databases</strong> (enable partial reproduction):</p>
<ul>
<li><strong>PubChem</strong>: Largest public chemical database</li>
<li><strong>ChEMBL</strong>: Bioactive molecules with drug-like properties</li>
<li><strong>DrugBank</strong>: FDA-approved and experimental drugs</li>
<li><strong>PDB Ligands</strong>: Small molecules from protein structures</li>
<li><strong>SureChEMBL</strong>: Chemical structures from patents</li>
<li><strong>AMS</strong>: Screening samples</li>
<li><strong>ChemNavigator</strong>: Commercial chemical database</li>
</ul>
<p><strong>Private/Proprietary Databases</strong> (preclude full-scale reproduction):</p>
<ul>
<li><strong>CSD</strong>: Cambridge Structural Database (requires commercial/academic license)</li>
<li><strong>CSDB</strong>: NCI internal database (private)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tautomer Generation</strong>:</p>
<ul>
<li><strong>Method</strong>: Single-step SMIRKS-based transformations</li>
<li><strong>Constraints</strong>:
<ul>
<li>Maximum 10 tautomers per input structure</li>
<li>30-second CPU timeout per transformation</li>
<li>Stereochemistry flattening for affected centers</li>
</ul>
</li>
<li><strong>Toolkit Dependency</strong>: The authors used the CACTVS Chemoinformatics Toolkit. Researchers attempting to reproduce this with fully open-source tools (like RDKit) may encounter differing behavior due to proprietary chemical perception logic and licensing differences.</li>
</ul>
<p><strong>Rule Categories</strong>:</p>
<ul>
<li><strong>Prototropic (PT)</strong>: 54 rules for hydrogen movement
<ul>
<li>Most common: <code>PT_06_00</code> (1,3-heteroatom H-shift, &gt;70% coverage)</li>
</ul>
</li>
<li><strong>Ring-Chain (RC)</strong>: 21 rules for cyclic/open-chain transformations
<ul>
<li>Examples: <code>RC_03_00</code> (pentose sugars), <code>RC_04_01</code> (hexose sugars)</li>
</ul>
</li>
<li><strong>Valence (VT)</strong>: 11 rules for valence changes
<ul>
<li>Notable: <code>VT_02_00</code> (tetrazole/azide, ~2.8M hits)</li>
</ul>
</li>
</ul>
<p><strong>InChI Comparison</strong>:</p>
<ul>
<li>Standard InChI (default settings)</li>
<li>Nonstandard InChI with <code>15T</code> and <code>KET</code> options (mobile H and keto-enol)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Success Metrics</strong>:</p>
<p>Let $\mathcal{T}(m)$ be the set of generated tautomers for molecule $m$.</p>
<ul>
<li><strong>Complete Match</strong>: Occurs iff $\forall t_i, t_j \in \mathcal{T}(m), \text{InChI}(t_i) = \text{InChI}(t_j)$.</li>
<li><strong>Partial Match</strong>: At least 2 tautomers share the same InChI.</li>
<li><strong>Fail</strong>: All tautomers have different InChIs.</li>
</ul>
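<p>In plain Python, this three-way classification reduces to counting distinct InChIs in $\mathcal{T}(m)$. The strings below stand in for computed identifiers; this is a sketch of the metric, not the paper's CACTVS pipeline:</p>

```python
def classify_match(inchis: list[str]) -> str:
    """Classify a tautomer set by InChI agreement (illustrative sketch)."""
    if len(set(inchis)) == 1:
        return "complete"   # every tautomer maps to the same InChI
    if len(inchis) > len(set(inchis)):
        return "partial"    # at least two tautomers share an InChI
    return "fail"           # all tautomers map to distinct InChIs

print(classify_match(["A", "A", "A"]))  # -> complete
print(classify_match(["A", "A", "B"]))  # -> partial
print(classify_match(["A", "B", "C"]))  # -> fail
```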
<p><strong>Benchmark Results</strong>:</p>
<ul>
<li>Standard InChI: ~37% success rate across all rules</li>
<li>Nonstandard (15T + KET): ~50% success rate</li>
<li>New rules: Many show &lt;2% recognition by current InChI</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Software Environment</strong>:</p>
<ul>
<li><strong>Toolkit</strong>: CACTVS Chemoinformatics Toolkit v3.4.6.33 and v3.4.8.6</li>
<li><strong>Hash Functions</strong>:
<ul>
<li><code>E_TAUTO_HASH</code> (tautomer-invariant identifier)</li>
<li><code>E_ISOTOPE_STEREO_HASH128</code> (tautomer-sensitive identifier)</li>
</ul>
</li>
</ul>
<p><strong>Note</strong>: The paper omits computational hardware specifications but acknowledges using the NIH HPC Biowulf cluster. Evaluating 400M+ structures necessitates high-throughput cluster computing, making it computationally expensive for an individual to replicate the full analysis from scratch.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://cactus.nci.nih.gov/tautomerizer/">Tautomerizer Web Tool</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Public web tool for applying tautomeric rules to user molecules</td>
      </tr>
      <tr>
          <td><a href="https://cactus.nci.nih.gov/download/tautomer/">Tautomer Database</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>2800+ experimental tautomeric tuples (companion resource)</td>
      </tr>
      <tr>
          <td><a href="https://pubs.acs.org/doi/10.1021/acs.jcim.9b01080">SMIRKS and Scripts (SI)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>CACTVS Tcl scripts and SMIRKS provided as Supporting Information</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dhaked, D. K., Ihlenfeldt, W.-D., Patel, H., Delannée, V., &amp; Nicklaus, M. C. (2020). Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2. <em>Journal of Chemical Information and Modeling</em>, <em>60</em>(3), 1253-1275. <a href="https://doi.org/10.1021/acs.jcim.9b01080">https://doi.org/10.1021/acs.jcim.9b01080</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{dhaked2020toward,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dhaked, Devendra K and Ihlenfeldt, Wolf-Dietrich and Patel, Hitesh and Delann{\&#39;e}e, Victorien and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1253--1275}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.9b01080}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://cactus.nci.nih.gov/tautomerizer/">Tautomerizer Tool</a> - Public web tool for testing tautomeric transformations</li>
</ul>
]]></content:encoded></item><item><title>SELFIES: A Robust Molecular String Representation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies/</guid><description>SELFIES is a robust molecular string representation for ML where every string decodes to a valid molecule, implemented in the selfies Python library.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p><strong>SELFIES (SELF-referencIng Embedded Strings)</strong> is a string-based molecular representation where every possible string, even one generated randomly, corresponds to a syntactically and semantically valid molecule. This property addresses a major limitation of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, where a large fraction of strings produced by machine learning models represent invalid chemical structures.</p>
<p>The format is implemented in an open-source Python library called <code>selfies</code>. Since the <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original publication</a>, the library has undergone significant architectural changes, most notably replacing the original string-manipulation engine with a graph-based internal representation that improved both performance and extensibility (see <a href="#recent-developments">Recent Developments</a>).</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Guaranteed Validity</strong>: Every possible SELFIES string can be decoded into a valid molecular graph that obeys chemical valence rules. This is its fundamental advantage over SMILES.</li>
<li><strong>Machine Learning Friendly</strong>: Can be used directly in any machine learning model (like VAEs or GANs) without adaptation, guaranteeing that all generated outputs are valid molecules.</li>
<li><strong>Customizable Constraints</strong>: The underlying chemical rules, such as maximum valence for different atoms, can be customized by the user. The library provides presets (e.g., for hypervalent species) and allows users to define their own rule sets.</li>
<li><strong>Human-readable</strong>: With some familiarity, SELFIES strings are human-readable, allowing interpretation of functional groups and connectivity.</li>
<li><strong>Local Operations</strong>: SELFIES encodes branch length and ring size as adjacent symbols in the string (rather than requiring matched delimiters or repeated digits at distant positions, as SMILES does), preventing common syntactical errors like unmatched parentheses or mismatched ring-closure digits.</li>
<li><strong>Broad Support</strong>: The current <code>selfies</code> library supports aromatic molecules (via kekulization), isotopes, charges, radicals, and stereochemistry. It also includes a dot symbol (<code>.</code>) for representing disconnected molecular fragments.</li>
</ul>
<h2 id="basic-syntax">Basic Syntax</h2>
<p>SELFIES uses symbols enclosed in square brackets (e.g., <code>[C]</code>, <code>[O]</code>, <code>[#N]</code>). The interpretation of each symbol depends on the current <strong>state of the derivation</strong> (described below), which ensures chemical valence rules are strictly obeyed. The syntax is formally defined by a Chomsky type-2 context-free grammar.</p>
<h3 id="derivation-rules">Derivation Rules</h3>
<p>SELFIES are constructed using a table of derivation rules. The process starts in an initial state (e.g., $X_0$) and reads the SELFIES string symbol by symbol. Each symbol, combined with the current state, determines the resulting atom/bond and the next state. The derivation state $X_n$ intuitively tracks that the previously added atom can form a maximum of $n$ additional bonds.</p>
<p>For example, the string <code>[F][=C][=C][#N]</code> is derived as follows, where $X_n$ indicates the atom can form up to $n$ additional bonds. Notice how bond demotion occurs: the first <code>[=C]</code> requests a double bond, but only a single bond is formed because state $X_1$ limits the connection to one bond.</p>
<p>$$
\begin{aligned}
\text{State } X_0 + \text{[F]} &amp;\rightarrow \text{F} + \text{State } X_1 \\
\text{State } X_1 + \text{[=C]} &amp;\rightarrow \text{F-C} + \text{State } X_3 \\
\text{State } X_3 + \text{[=C]} &amp;\rightarrow \text{F-C=C} + \text{State } X_2 \\
\text{State } X_2 + [\#\text{N}] &amp;\rightarrow \text{F-C=C=N} + \text{Final}
\end{aligned}
$$</p>
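<p>This state-machine behavior, including bond demotion, can be mimicked in a few lines. The following is a toy sketch of the derivation-state idea only, not the <code>selfies</code> library's actual decoder (it ignores branches, rings, and hydrogen filling):</p>

```python
VALENCES = {"F": 1, "O": 2, "N": 3, "C": 4}
BOND_ORDER = {"": 1, "=": 2, "#": 3}

def derive(symbols):
    """Return (element, bond-order-to-previous-atom) pairs for a linear chain."""
    atoms, state = [], 0
    for sym in symbols:
        body = sym.strip("[]")
        bond_char = body[0] if body[0] in "=#" else ""
        element = body.lstrip("=#")
        # Bond demotion: the state caps the requested bond order.
        order = 0 if not atoms else min(BOND_ORDER[bond_char], state)
        atoms.append((element, order))
        state = VALENCES[element] - order   # remaining bonds = new state X_n
    return atoms

print(derive(["[F]", "[=C]", "[=C]", "[#N]"]))
# -> [('F', 0), ('C', 1), ('C', 2), ('N', 2)]   i.e. F-C=C=N
```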
<h3 id="structural-features">Structural Features</h3>
<ul>
<li><strong>Branches</strong>: Represented by a <code>[Branch]</code> symbol. The symbols immediately following it are interpreted as an index that specifies the number of SELFIES symbols belonging to that branch. This structure prevents errors like unmatched parentheses in SMILES.</li>
<li><strong>Rings</strong>: Represented by a <code>[Ring]</code> symbol. Similar to branches, subsequent symbols specify an index that indicates which previous atom to connect to, forming a ring closure. To avoid violating valence constraints, ring bond creation is postponed to a final post-processing step, where it is only completed if the target atom has available bonds.</li>
</ul>
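<p>As a concrete sketch of the index encoding: the symbol following <code>[Ring1]</code> is read as a number via an ordered index alphabet. The alphabet below is taken from the <code>selfies</code> documentation and is an assumption of this example:</p>

```python
# Ordered index alphabet (per the selfies documentation); the symbol after
# [Ring1]/[Branch1] is interpreted as its position in this list.
INDEX_ALPHABET = ["[C]", "[Ring1]", "[Ring2]", "[Branch1]", "[=Branch1]",
                  "[#Branch1]", "[Branch2]", "[=Branch2]", "[#Branch2]",
                  "[O]", "[N]", "[=N]", "[=C]", "[#C]", "[S]", "[P]"]

def ring_span(index_symbol: str) -> int:
    """How many atoms back a [Ring1] closure reaches (index + 1)."""
    return INDEX_ALPHABET.index(index_symbol) + 1

# Benzene, [C][=C][C][=C][C][=C][Ring1][=Branch1]: [=Branch1] has index 4,
# so the bond closes 5 atoms back from the sixth carbon, forming the 6-ring.
print(ring_span("[=Branch1]"))  # -> 5
```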
<h2 id="examples">Examples</h2>
<p>To see how these derivation rules work in practice, here are SELFIES representations for common molecules of increasing complexity:</p>
<figure class="post-figure center ">
    <img src="/img/selfies/ethanol.webp"
         alt="Ethanol molecule from SELFIES"
         title="Ethanol molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ethanol: <code>[C][C][O]</code></figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/selfies/benzene.webp"
         alt="Benzene molecule from SELFIES"
         title="Benzene molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Benzene: <code>[C][=C][C][=C][C][=C][Ring1][=Branch1]</code></figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/selfies/aspirin.webp"
         alt="Aspirin molecule from SELFIES"
         title="Aspirin molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Aspirin: <code>[C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]</code></figcaption>
    
</figure>

<h2 id="the-selfies-python-library">The <code>selfies</code> Python Library</h2>
<p>The <code>selfies</code> library provides a dependency-free Python implementation. Here are the core operations:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMILES -&gt; SELFIES</span>
</span></span><span style="display:flex;"><span>smiles <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>  <span style="color:#75715e"># benzoic acid</span>
</span></span><span style="display:flex;"><span>encoded <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>encoder(smiles)
</span></span><span style="display:flex;"><span>print(encoded)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [C][=C][C][=C][C][Branch1][C][=O][O][=C][Ring1][=Branch1]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SELFIES -&gt; SMILES</span>
</span></span><span style="display:flex;"><span>decoded <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(encoded)
</span></span><span style="display:flex;"><span>print(decoded)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C1=CC=CC(=C1)C(=O)O</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Robustness: random strings always decode to valid molecules</span>
</span></span><span style="display:flex;"><span>random_selfies <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C][F][Ring1][O][=N][Branch1][C][S]&#34;</span>
</span></span><span style="display:flex;"><span>print(sf<span style="color:#f92672">.</span>decoder(random_selfies))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; always returns a valid molecule</span>
</span></span></code></pre></div><h3 id="tokenization-and-encoding">Tokenization and Encoding</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>selfies_str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C][=C][C][=C][C][Branch1][C][=O][O][=C][Ring1][=Branch1]&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Tokenize into individual symbols</span>
</span></span><span style="display:flex;"><span>tokens <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>split_selfies(selfies_str))
</span></span><span style="display:flex;"><span>print(tokens)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [&#39;[C]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[Branch1]&#39;, &#39;[C]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#     &#39;[=O]&#39;, &#39;[O]&#39;, &#39;[=C]&#39;, &#39;[Ring1]&#39;, &#39;[=Branch1]&#39;]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Get the alphabet (unique token set) from a dataset</span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;[C][C][O]&#34;</span>, <span style="color:#e6db74">&#34;[C][=C][C][=C][C][=C][Ring1][=Branch1]&#34;</span>]
</span></span><span style="display:flex;"><span>alphabet <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>get_alphabet_from_selfies(dataset)
</span></span><span style="display:flex;"><span>print(sorted(alphabet))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [&#39;[=Branch1]&#39;, &#39;[=C]&#39;, &#39;[C]&#39;, &#39;[O]&#39;, &#39;[Ring1]&#39;]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Convert to integer encoding for ML pipelines</span>
</span></span><span style="display:flex;"><span>encoding, _ <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>selfies_to_encoding(
</span></span><span style="display:flex;"><span>    selfies<span style="color:#f92672">=</span>selfies_str,
</span></span><span style="display:flex;"><span>    vocab_stoi<span style="color:#f92672">=</span>{s: i <span style="color:#66d9ef">for</span> i, s <span style="color:#f92672">in</span> enumerate(sorted(alphabet))},
</span></span><span style="display:flex;"><span>    pad_to_len<span style="color:#f92672">=</span><span style="color:#ae81ff">20</span>,
</span></span><span style="display:flex;"><span>    enc_type<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;label&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h3 id="customizing-valence-constraints">Customizing Valence Constraints</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># View current constraints</span>
</span></span><span style="display:flex;"><span>print(sf<span style="color:#f92672">.</span>get_semantic_constraints())
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Allow hypervalent sulfur (e.g., SF6) via the built-in preset</span>
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints(sf<span style="color:#f92672">.</span>get_preset_constraints(<span style="color:#e6db74">&#34;hypervalent&#34;</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Or define custom constraints on top of the current ones</span>
</span></span><span style="display:flex;"><span>constraints <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>get_semantic_constraints()
</span></span><span style="display:flex;"><span>constraints<span style="color:#f92672">.</span>update({
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;S&#34;</span>: <span style="color:#ae81ff">6</span>,  <span style="color:#75715e"># allow hexavalent sulfur</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;P&#34;</span>: <span style="color:#ae81ff">5</span>,  <span style="color:#75715e"># allow pentavalent phosphorus</span>
</span></span><span style="display:flex;"><span>})
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints(constraints)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Reset to the default constraints</span>
</span></span><span style="display:flex;"><span>sf<span style="color:#f92672">.</span>set_semantic_constraints()
</span></span></code></pre></div><h2 id="selfies-in-machine-learning">SELFIES in Machine Learning</h2>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>SELFIES is particularly advantageous for generative models in computational chemistry. When used in a VAE, the entire continuous latent space decodes to valid molecules, unlike SMILES where large regions of the latent space are invalid. The <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original SELFIES paper</a> demonstrated this concretely: a VAE trained with SELFIES held two orders of magnitude more diverse molecules in its latent space than a SMILES-based VAE, and a GAN produced 78.9% diverse valid molecules compared to 18.6% for SMILES (Krenn et al., 2020).</p>
<p>Several generation approaches build directly on SELFIES:</p>
<ul>
<li><strong>Latent space optimization</strong>: <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a> uses a SELFIES-based VAE with gradient-based optimization to generate molecules with nanomolar binding affinities, achieving 6-8x speedup over RL baselines (Eckmann et al., 2022).</li>
<li><strong>Training-free generation</strong>: <a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a> demonstrates that simple character-level mutations in SELFIES (replacement, deletion, insertion) produce valid molecules by construction, eliminating the need for neural networks entirely. STONED achieved a GuacaMol score of 14.70, competitive with deep generative models (Nigam et al., 2021).</li>
<li><strong>Gradient-based dreaming</strong>: <a href="/notes/chemistry/molecular-design/generation/latent-space/deep-molecular-dreaming-pasithea/">PASITHEA</a> computes gradients with respect to one-hot encoded SELFIES inputs to steer molecules toward target property values. Because SELFIES&rsquo; surjective mapping guarantees every intermediate representation is a valid molecule, this continuous optimization over the input space is feasible. PASITHEA generated molecules with properties outside the training data range (logP up to 4.24 vs. a training max of 3.08), with 97.2% novelty (Shen et al., 2021).</li>
<li><strong>Large-scale pre-training</strong>: <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a> is a BART-based model pre-trained on 100M+ SELFIES molecules. It achieves 100% validity and an FCD of 0.0015 on MOSES (vs. 0.0061 for Chemformer), and introduces chemical feedback to align outputs with preference rankings (Fang et al., 2024).</li>
</ul>
<p>In benchmarks, SELFIES performs well for optimization-oriented tasks. In the <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> of 25 methods, SELFIES-REINVENT ranked 3rd and STONED ranked 5th. SELFIES-based genetic algorithms outperformed SMILES-based GAs, likely because SELFIES provides more intuitive mutation operations (Gao et al., 2022). The <a href="/notes/chemistry/molecular-design/generation/evaluation/tartarus-inverse-molecular-design/">Tartarus benchmark</a> corroborates this across more diverse real-world objectives (organic emitters, protein ligands, reaction substrates): SELFIES-VAE consistently outperforms SMILES-VAE, and the representation matters most where validity is a bottleneck (Nigam et al., 2022).</p>
<p>SELFIES mutations provide a simple but effective way to explore chemical space:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> random
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">mutate_selfies</span>(selfies_str, mutation_type<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;replace&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Mutate a SELFIES string. Every output is a valid molecule.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    tokens <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>split_selfies(selfies_str))
</span></span><span style="display:flex;"><span>    alphabet <span style="color:#f92672">=</span> list(sf<span style="color:#f92672">.</span>get_semantic_robust_alphabet())
</span></span><span style="display:flex;"><span>    idx <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>randint(<span style="color:#ae81ff">0</span>, len(tokens) <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;replace&#34;</span>:
</span></span><span style="display:flex;"><span>        tokens[idx] <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>choice(alphabet)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;insert&#34;</span>:
</span></span><span style="display:flex;"><span>        tokens<span style="color:#f92672">.</span>insert(idx, random<span style="color:#f92672">.</span>choice(alphabet))
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> mutation_type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;delete&#34;</span> <span style="color:#f92672">and</span> len(tokens) <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">1</span>:
</span></span><span style="display:flex;"><span>        tokens<span style="color:#f92672">.</span>pop(idx)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#34;&#34;</span><span style="color:#f92672">.</span>join(tokens)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Every mutation produces a valid molecule</span>
</span></span><span style="display:flex;"><span>original <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>encoder(<span style="color:#e6db74">&#34;c1ccccc1&#34;</span>)  <span style="color:#75715e"># benzene</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">5</span>):
</span></span><span style="display:flex;"><span>    mutant <span style="color:#f92672">=</span> mutate_selfies(original)
</span></span><span style="display:flex;"><span>    print(sf<span style="color:#f92672">.</span>decoder(mutant))  <span style="color:#75715e"># always valid</span>
</span></span></code></pre></div><h3 id="property-prediction-and-pretraining">Property Prediction and Pretraining</h3>
<p><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> is a RoBERTa-based chemical language model pretrained on 2M ChEMBL compounds using SELFIES as input. Because every masked token prediction corresponds to a valid molecular fragment, the model never wastes capacity learning invalid chemistry. SELFormer outperformed <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> by approximately 12% on average across BACE, BBBP, and HIV classification benchmarks (Yüksel et al., 2023). <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> also evaluated SELFIES as an input representation, finding comparable performance to SMILES on the Tox21 task (Chithrananda et al., 2020).</p>
<p>The <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> demonstrated that SELFIES achieves ~100% validity vs. ~40% for SMILES in conditional molecular generation, while performing comparably for property prediction. This dual prediction-generation capability is enabled by interleaving numerical property tokens with SELFIES molecular tokens in a single sequence (Born &amp; Manica, 2023).</p>
<p>At larger scales, <a href="/notes/chemistry/molecular-representations/encoders/neural-scaling-of-deep-chemical-models/">ChemGPT</a> (up to 1B parameters) uses a GPT-Neo backbone with SELFIES tokenization for autoregressive molecular generation, demonstrating that SELFIES follows the same power-law neural scaling behavior observed in NLP (Frey et al., 2023).</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>In image-to-text chemical structure recognition, <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan et al. (2022)</a> compared SMILES, DeepSMILES, SELFIES, and InChI as output formats using the same transformer architecture. SELFIES achieved 100% structural validity (every prediction could be decoded), while SMILES predictions occasionally contained syntax errors. The trade-off: SMILES achieved higher exact match accuracy (88.62%) partly because SELFIES strings are longer, producing more tokens for the decoder to predict.</p>
<h3 id="chemical-name-translation">Chemical Name Translation</h3>
<p><a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT</a> uses SELFIES as its internal representation for translating between chemical line notations and IUPAC names. All SMILES are converted to SELFIES before processing, and the model achieves a BLEU score of 0.94 for IUPAC-to-SELFIES translation and 0.98 Tanimoto similarity on valid outputs. The authors found SELFIES&rsquo; syntactic robustness particularly valuable for this sequence-to-sequence task, where the decoder must produce a chemically valid output string (Rajan et al., 2021).</p>
<h3 id="tokenization">Tokenization</h3>
<p>Converting SELFIES strings into tokens for neural models is more straightforward than SMILES tokenization. Each bracket-enclosed symbol (<code>[C]</code>, <code>[=C]</code>, <code>[Branch1]</code>) is a natural token boundary. <a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">Atom Pair Encoding (APE)</a> extends byte pair encoding with chemistry-aware constraints for both SMILES and SELFIES. For SELFIES specifically, APE preserves atomic identity during subword merging, and SELFIES models showed strong inter-tokenizer agreement: all true positives from SELFIES-BPE were captured by SELFIES-APE (Leon et al., 2024).</p>
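<p>Because every symbol is bracket-delimited, a complete character-level SELFIES tokenizer reduces to a single regular expression. The sketch below is illustrative; the <code>selfies</code> library ships the equivalent <code>split_selfies</code> helper:</p>

```python
import re

def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracket-delimited symbols."""
    return re.findall(r"\[[^\]]*\]", selfies)

tokens = tokenize_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]")
print(tokens)
# -> ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
```

Contrast this with SMILES, where tokenizers must handle two-character elements (<code>Cl</code>, <code>Br</code>), bracket atoms, and ring-closure digits with bespoke rules.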
<h2 id="limitations-and-trade-offs">Limitations and Trade-offs</h2>
<h3 id="validity-constraints-can-introduce-bias">Validity Constraints Can Introduce Bias</h3>
<p>The guarantee that every string decodes to a valid molecule is SELFIES&rsquo; core advantage, but recent work has shown this comes with trade-offs. <a href="/notes/chemistry/molecular-representations/notations/invalid-smiles-help/">Skinnider (2024)</a> demonstrated that SMILES-based models consistently outperform SELFIES-based models on distribution-learning tasks. The mechanism: invalid SMILES represent a model&rsquo;s least confident predictions, and filtering them out acts as implicit quality control. SELFIES models, by construction, cannot discard low-confidence outputs this way. Furthermore, SELFIES validity constraints introduce systematic structural biases, generating fewer aromatic rings and more aliphatic structures compared to training data. When SELFIES constraints were relaxed to allow invalid generation (&ldquo;unconstrained SELFIES&rdquo;), performance improved, providing causal evidence that the ability to generate and discard invalid outputs benefits distribution learning.</p>
<p>This finding reframes the SMILES vs. SELFIES choice as context-dependent. As Grisoni (2023) summarizes in a <a href="/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/">review of chemical language models</a>: &ldquo;SMILES offer a richer, more interpretable language with well-studied augmentation strategies, while SELFIES guarantee validity at the cost of chemical realism and edit interpretability.&rdquo;</p>
<p>The <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> provides further nuance: SELFIES-based variants of language model methods (REINVENT, LSTM HC, VAE) generally do not outperform their SMILES counterparts, because modern language models learn SMILES grammar well enough that syntactic invalidity is no longer a practical bottleneck. The exception is genetic algorithms, where SELFIES mutations are naturally well-suited.</p>
<p>A study on <a href="/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/">complex molecular distributions</a> paints a consistent picture: SELFIES-trained RNNs achieve better standard metrics (validity, uniqueness, novelty), while SMILES-trained RNNs achieve better distributional fidelity as measured by Wasserstein distance (Flam-Shepherd et al., 2022). Taken together, these findings suggest that SELFIES and SMILES have genuinely complementary strengths, and the best choice depends on whether the task prioritizes validity/novelty or distributional faithfulness.</p>
<h3 id="degenerate-outputs">Degenerate Outputs</h3>
<p>Although every SELFIES string decodes to a valid molecule, the decoded molecule may not always be chemically meaningful in context. The <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> reported ~1.9% defective generations where the output molecule had fewer than 50% of the seed molecule&rsquo;s atoms (Born &amp; Manica, 2023). This highlights a distinction between syntactic validity (which SELFIES guarantees) and semantic appropriateness (which it does not).</p>
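<p>A practical consequence is that SELFIES pipelines often still apply a lightweight post-hoc filter. The sketch below flags outputs that lost most of the seed&rsquo;s atoms, loosely following the Regression Transformer&rsquo;s 50% criterion; counting atom symbols at the token level is only a proxy (the decoder can silently ignore trailing symbols), so a production filter would decode and count atoms with a toolkit such as RDKit:</p>

```python
import re

def atom_tokens(selfies: str) -> int:
    """Approximate heavy-atom count: symbols that are not ring/branch
    bookkeeping. A rough proxy only."""
    tokens = re.findall(r"\[[^\]]*\]", selfies)
    return sum(1 for t in tokens if "Ring" not in t and "Branch" not in t)

def is_degenerate(seed: str, generated: str, ratio: float = 0.5) -> bool:
    """Flag generations that kept fewer than `ratio` of the seed's atoms."""
    return atom_tokens(generated) < ratio * atom_tokens(seed)

seed = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"  # benzene: 6 atom symbols
print(is_degenerate(seed, "[C][C]"))  # -> True  (2 < 3)
print(is_degenerate(seed, seed))      # -> False
```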
<h3 id="other-limitations">Other Limitations</h3>
<ul>
<li><strong>Indirect Canonicalization</strong>: A canonical SELFIES string is currently generated by first creating a canonical SMILES string and then converting it to SELFIES. Direct canonicalization is a goal for future development.</li>
<li><strong>String Length</strong>: SELFIES strings are generally longer than their corresponding SMILES strings, which can impact storage, processing times, and sequence modeling difficulty for very large datasets.</li>
<li><strong>Ongoing Standardization</strong>: While the library now supports most major features found in SMILES, work is ongoing to extend the format to more complex systems like polymers, crystals, and reactions.</li>
</ul>
<h2 id="variants-and-extensions">Variants and Extensions</h2>
<h3 id="group-selfies">Group SELFIES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/">Group SELFIES</a> extends the representation with group tokens that represent functional groups or entire substructures (e.g., a benzene ring or carboxyl group) as single units. Each group token has labeled attachment points with specified valency, allowing the decoder to continue tracking available bonds. Group SELFIES maintains the validity guarantee while producing shorter, more human-readable strings. On MOSES VAE benchmarks, Group SELFIES achieved an FCD of 0.1787 versus 0.6351 for standard SELFIES, indicating substantially better distribution learning (Cheng et al., 2023).</p>
<h3 id="stoned-algorithms">STONED Algorithms</h3>
<p><a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a> (Superfast Traversal, Optimization, Novelty, Exploration and Discovery) is a suite of algorithms that exploit SELFIES&rsquo; validity guarantee for training-free molecular design through point mutations, interpolation, and optimization (Nigam et al., 2021). See <a href="#molecular-generation">Molecular Generation</a> above for benchmark results.</p>
<h2 id="recent-developments">Recent Developments</h2>
<p>The <a href="/notes/chemistry/molecular-representations/notations/selfies-2023/">2023 library update</a> replaced the original string-manipulation engine with a graph-based internal representation. This change resolved several long-standing limitations: the original approach could not handle aromatics (requiring kekulization), stereochemistry, or charged species. The graph-based engine now supports all of these, and processes 300K+ molecules in approximately 4 minutes in pure Python. The library has been validated on all 72 million molecules from PubChem.</p>
<p>Looking forward, researchers have outlined <a href="/notes/chemistry/molecular-representations/notations/selfies-2022/">16 future research directions</a> for extending robust representations to complex systems like polymers, crystals, and chemical reactions.</p>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/"><strong>Converting SELFIES Strings to 2D Molecular Images</strong></a>: Hands-on tutorial demonstrating SELFIES robustness and building visualization tools</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <a href="https://doi.org/10.1088/2632-2153/aba947"><em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024.</a></li>
<li>Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., &hellip; &amp; Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. <a href="https://doi.org/10.1016/j.patter.2022.100588"><em>Patterns</em>, <em>3</em>(10), 100588.</a></li>
<li>Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <a href="https://doi.org/10.1039/d3dd00044c"><em>Digital Discovery</em>, <em>2</em>, 897-908.</a></li>
<li>Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. <a href="https://doi.org/10.1038/s42256-024-00821-x"><em>Nature Machine Intelligence</em>, <em>6</em>, 437-448.</a></li>
<li>Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <a href="https://doi.org/10.1088/2632-2153/ac09d6"><em>Machine Learning: Science and Technology</em>, <em>2</em>(3), 03LT02.</a></li>
<li>Fang, Y., et al. (2024). Domain-agnostic molecular generation with chemical feedback. <a href="https://openreview.net/forum?id=9rnerQyXlh"><em>ICLR 2024</em>.</a></li>
<li>Born, J., &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <a href="https://doi.org/10.1038/s42256-023-00639-z"><em>Nature Machine Intelligence</em>, <em>5</em>, 432-444.</a></li>
<li>Frey, N. C., Soklaski, R., Axelrod, S., Samsi, S., Gómez-Bombarelli, R., Coley, C. W., &amp; Gadepally, V. (2023). Neural scaling of deep chemical models. <a href="https://doi.org/10.1038/s42256-023-00740-3"><em>Nature Machine Intelligence</em>, <em>5</em>, 1297-1305.</a></li>
<li>Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <a href="https://doi.org/10.1186/s13321-021-00512-4"><em>Journal of Cheminformatics</em>, <em>13</em>, 34.</a></li>
<li>Nigam, A., Pollice, R., &amp; Aspuru-Guzik, A. (2022). Tartarus: A benchmarking platform for realistic and practical inverse molecular design. <a href="https://openreview.net/forum?id=sLFDE2MHzHO"><em>NeurIPS 2022 Datasets and Benchmarks</em>.</a></li>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>The Number of Isomeric Hydrocarbons of the Methane Series</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/number-of-isomeric-hydrocarbons/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/number-of-isomeric-hydrocarbons/</guid><description>Henze and Blair's 1931 JACS paper deriving exact recursive formulas for counting constitutional alkane isomers.</description><content:encoded><![CDATA[<h2 id="a-theoretical-foundation-for-mathematical-chemistry">A Theoretical Foundation for Mathematical Chemistry</h2>
<p>This is a foundational <strong>theoretical paper</strong> in mathematical chemistry and chemical graph theory. It derives <strong>exact mathematical laws</strong> governing molecular topology. The paper also serves as a <strong>benchmark resource</strong>, establishing the first systematic isomer counts that corrected historical errors and whose recursive method remains the basis for modern molecular enumeration.</p>
<h2 id="historical-motivation-and-the-failure-of-centric-trees">Historical Motivation and the Failure of Centric Trees</h2>
<p>The primary motivation was the lack of a rigorous mathematical relationship between carbon content ($N$) and isomer count.</p>
<ul>
<li><strong>Previous failures</strong>: Earlier attempts by <a href="https://doi.org/10.1002/cber.187500801227">Cayley (1875)</a> (as cited by Henze and Blair, referring to the Berichte der deutschen chemischen Gesellschaft summary) and <a href="https://doi.org/10.1002/cber.187500802191">Schiff (1875)</a> used &ldquo;centric&rdquo; and &ldquo;bicentric&rdquo; symmetry tree methods that broke down as carbon content increased, producing incorrect counts as early as $N = 12$. Subsequent efforts by Tiemann (1893), Delannoy (1894), Losanitsch (1897), Goldberg (1898), and Trautz (1924), as cited in the paper, each improved on specific aspects but none achieved general accuracy beyond moderate carbon content.</li>
<li><strong>The theoretical gap</strong>: All prior formulas depended on exhaustively identifying centers of symmetry, meaning they required additional correction terms for each increase in $N$ and could not reliably predict counts for larger molecules like $C_{40}$.</li>
</ul>
<p>This work aimed to develop a theoretically sound, generalizable method that could be extended to any number of carbons.</p>
<h2 id="core-innovation-recursive-enumeration-of-graphs">Core Innovation: Recursive Enumeration of Graphs</h2>
<p>The core novelty is the proof that the count of hydrocarbons is a recursive function of the count of alkyl radicals (alcohols) of size $N/2$ or smaller. The authors rely on a preliminary calculation of the total number of isomeric alcohols (the methanol series) to make this hydrocarbon enumeration possible. By defining $T_k$ as the exact number of possible isomeric alkyl radicals strictly containing $k$ carbon atoms, graph enumeration transforms into a mathematical recurrence.</p>
<p>To rigorously prevent double-counting when functionally identical branches connect to a central carbon, Henze and Blair applied combinations with substitution. Because the chemical branches are unordered topologically, connecting $x$ branches of identical structural size $k$ results in combinations with repetition:</p>
<p>$$ \binom{T_k + x - 1}{x} $$</p>
<p>For example, if a Group B central carbon is bonded to three identical sub-branches of length $k$, the number of distinct combinations for that topological partition is:</p>
<p>$$ \frac{T_k (T_k + 1)(T_k + 2)}{6} $$</p>
<p>Summing these constrained combinatorial partitions across all valid branch sizes (governed by the Even/Odd bisection rules) yields the exact isomer count for $N$ without overestimating due to symmetric permutations.</p>
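<p>The multiset coefficient above is an ordinary binomial with the &ldquo;stars and bars&rdquo; shift, so the closed form can be checked directly (a quick sketch; the value $T_k = 8$ is the known count of pentyl radicals, $C_5H_{11}$):</p>

```python
from math import comb

def multiset_choices(n_types: int, n_slots: int) -> int:
    """Combinations with repetition: unordered choice of n_slots branches
    drawn (with replacement) from n_types distinct radical structures."""
    return comb(n_types + n_slots - 1, n_slots)

# Three identical-size branches drawn from T_k radical types reduces to
# T_k (T_k + 1)(T_k + 2) / 6, matching the closed form in the paper.
T_k = 8  # the eight pentyl radicals (C5H11)
assert multiset_choices(T_k, 3) == T_k * (T_k + 1) * (T_k + 2) // 6
print(multiset_choices(T_k, 3))  # -> 120
```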
<p><strong>The Symmetry Constraints</strong>: The paper rigorously divides the problem space to prevent double-counting:</p>
<ul>
<li><strong>Group A (Centrosymmetric)</strong>: Hydrocarbons that can be bisected into two smaller alkyl radicals.
<ul>
<li><em>Even $N$</em>: Split into two radicals of size $N/2$.</li>
<li><em>Odd $N$</em>: Split into sizes $(N+1)/2$ and $(N-1)/2$.</li>
</ul>
</li>
<li><strong>Group B (Asymmetric)</strong>: Hydrocarbons whose graphic formula cannot be symmetrically bisected. They contain exactly one central carbon atom attached to 3 or 4 branches. To prevent double-counting, Henze and Blair established strict maximum branch sizes:
<ul>
<li><em>Even $N$</em>: No branch can be larger than $(N/2 - 1)$ carbons.</li>
<li><em>Odd $N$</em>: No branch can be larger than $(N-3)/2$ carbons.</li>
<li><em>The Combinatorial Partitioning</em>: They further subdivided these 3-branch and 4-branch molecules into distinct mathematical cases based on whether the branches were structurally identical or unique, applying distinct combinatorial formulas to each scenario.</li>
</ul>
</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/notes/hexane-and-its-six-isomers-by-even-and-odd-decomposition.webp"
         alt="The five structural isomers of hexane classified into Group A and Group B based on their decomposition"
         title="The five structural isomers of hexane classified into Group A and Group B based on their decomposition"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The five isomers of hexane ($C_6$) classified by Henze and Blair&rsquo;s symmetry scheme. Group A molecules (top row) can be bisected along a bond (highlighted in red) into two $C_3$ alkyl radicals. Group B molecules (bottom row) have a central carbon atom (red circle) with 3-4 branches, preventing symmetric bisection.</figcaption>
    
</figure>

<p>This classification is the key insight that enables the recursive formulas. By exhaustively partitioning hydrocarbons into these mutually exclusive groups, the authors could derive separate combinatorial expressions for each and sum them without double-counting.</p>
<p>For each structural class, combinatorial formulas are derived that depend on the number of isomeric alcohols ($T_k$) where $k &lt; N$. This transforms the problem of counting large molecular graphs into a recurrence relation based on the counts of smaller, simpler sub-graphs.</p>
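<p>The alcohol-series recurrence that feeds these formulas is compact enough to sketch in pure Python. The version below follows the unordered-branch logic described above rather than the paper&rsquo;s exact notation (function and variable names are illustrative), and it reproduces the known radical counts:</p>

```python
from collections import Counter
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def radicals(n: int) -> int:
    """T_n: number of isomeric alkyl radicals with n carbon atoms."""
    if n == 0:
        return 1  # an empty branch, i.e. a hydrogen atom
    total = 0
    # Distribute the remaining n-1 carbons over three unordered branches
    # with sizes a >= b >= c >= 0 hanging off the attachment carbon.
    for a in range(n - 1, -1, -1):
        for b in range(min(a, n - 1 - a), -1, -1):
            c = n - 1 - a - b
            if c > b:
                continue
            term = 1
            # Branches of identical size use combinations with repetition,
            # exactly as in the Group B formulas above.
            for size, count in Counter((a, b, c)).items():
                term *= comb(radicals(size) + count - 1, count)
            total += term
    return total

print([radicals(n) for n in range(1, 9)])  # -> [1, 1, 2, 4, 8, 17, 39, 89]
```

The output matches the accepted alkyl-radical sequence: two propyl radicals, four butyls, eight pentyls, and so on.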
<h2 id="validation-via-exhaustive-hand-enumeration">Validation via Exhaustive Hand-Enumeration</h2>
<p>The experiments were computational and enumerative:</p>
<ol>
<li><strong>Derivation of the recursion formulas</strong>: The main effort was the mathematical derivation of the set of equations for each structural class of hydrocarbon.</li>
<li><strong>Calculation</strong>: They applied their formulas to calculate the number of isomers for alkanes up to $N=40$, reaching over $6.2 \times 10^{13}$ isomers. This was far beyond what was previously possible.</li>
<li><strong>Validation by exhaustive enumeration</strong>: To prove the correctness of their theory, the authors manually drew and counted all possible structural formulas for the undecanes ($C_{11}$), dodecanes ($C_{12}$), tridecanes ($C_{13}$), and tetradecanes ($C_{14}$). This brute-force check confirmed their calculated numbers and corrected long-standing errors in the literature.
<ul>
<li><em>Key correction</em>: The manual enumeration proved that the count for tetradecane ($C_{14}$) is <strong>1,858</strong>, correcting erroneous values previously published by <a href="https://doi.org/10.1002/cber.189703002144" title="Die Isomerie-Arten bei den Homologen der Paraffin-Reihe">Losanitsch (1897)</a>, whose results for $C_{12}$ and $C_{14}$ the paper identifies as incorrect.</li>
</ul>
</li>
</ol>
<h2 id="benchmark-outcomes-and-scaling-limits">Benchmark Outcomes and Scaling Limits</h2>
<ul>
<li><strong>The Constitutional Limit</strong>: The paper establishes the mathematical ground truth for organic molecular graphs by strictly counting <em>constitutional</em> (structural) isomers. The derivation completely excludes 3D stereoisomerism (enantiomers and diastereomers). For modern geometric deep learning applications (e.g., generating 3D conformers), Henze and Blair&rsquo;s scaling sequence serves as a lower bound, representing a severe underestimation of the true number of spatial configurations feasible within chemical space.</li>
<li><strong>Theoretical outcome</strong>: The paper proves that the problem&rsquo;s inherent complexity requires a recursive approach.</li>
<li><strong>Benchmark resource</strong>: The authors published a table of isomer counts up to $C_{40}$ (Table II), correcting historical errors and establishing the first systematic enumeration across this range. Later computational verification revealed that the paper&rsquo;s hand-calculated values are exact through at least $C_{14}$ (confirmed by exhaustive enumeration) but accumulate minor arithmetic errors beyond that range (e.g., at $C_{40}$). The recursive method itself is exact and remains the basis for the accepted values in <a href="https://oeis.org/A000602">OEIS A000602</a>.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/notes/number-of-isomeric-hydrocarbons-of-the-methane-series.webp"
         alt="Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40"
         title="Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The number of structural isomers grows super-exponentially with carbon content, reaching over 62 trillion for C₄₀. This plot, derived from Henze and Blair&rsquo;s Table II, illustrates the combinatorial explosion that makes direct enumeration intractable for larger molecules.</figcaption>
    
</figure>

<p>The plot above illustrates the staggering growth rate. Methane ($C_1$) through propane ($C_3$) each have exactly one isomer. Beyond this, the count accelerates rapidly: 75 isomers at $C_{10}$, nearly 37 million at $C_{25}$, and over 4 billion at $C_{30}$. By $C_{40}$, the count exceeds $6.2 \times 10^{13}$ (the paper&rsquo;s hand-calculated Table II reports 62,491,178,805,831, while the modern OEIS-verified value is 62,481,801,147,341). This exponential scaling demonstrates why brute-force enumeration quickly becomes intractable and why the recursive approach was essential.</p>
<ul>
<li><strong>Foundational impact</strong>: This work established the mathematical framework that would later evolve into modern chemical graph theory and computational chemistry approaches for molecular enumeration. In the context of AI for molecular generation, this is an early form of <strong>expressivity analysis</strong>, defining the size of the chemical space that generative models must learn to cover.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li>
<p><strong>Algorithms</strong>: The exact mathematical recursive formulas and combinatorial partitioning logic are fully provided in the text, allowing for programmatic implementation.</p>
</li>
<li>
<p><strong>Evaluation</strong>: The authors validated their recursive formulas by exhaustive hand-enumeration (brute-force drawing of structural formulas) up to $C_{14}$, establishing correctness over that range.</p>
</li>
<li>
<p><strong>Data</strong>: The paper&rsquo;s Table II provides isomer counts up to $C_{40}$. These hand-calculated values are exact through at least $C_{14}$ (validated by exhaustive enumeration) but accumulate minor arithmetic errors beyond that range. The corrected integer sequence is maintained in the On-Line Encyclopedia of Integer Sequences (OEIS) as <a href="https://oeis.org/A000602">A000602</a>.</p>
</li>
<li>
<p><strong>Code</strong>: The OEIS page provides Mathematica and Maple implementations. The following pure Python implementation uses the OEIS generating functions (which formalize Henze and Blair&rsquo;s recursive method) to compute the corrected isomer counts up to an arbitrary $N$:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">compute_alkane_isomers</span>(max_n: int) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the number of alkane structural isomers C_nH_{2n+2}
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    up to max_n using the generating functions from OEIS A000602.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> max_n <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>: <span style="color:#66d9ef">return</span> [<span style="color:#ae81ff">1</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Helper: multiply two polynomials (cap at degree max_n)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">poly_mul</span>(a: list[int], b: list[int]) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>        res <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i, v_a <span style="color:#f92672">in</span> enumerate(a):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j, v_b <span style="color:#f92672">in</span> enumerate(b):
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> i <span style="color:#f92672">+</span> j <span style="color:#f92672">&lt;=</span> max_n: res[i <span style="color:#f92672">+</span> j] <span style="color:#f92672">+=</span> v_a <span style="color:#f92672">*</span> v_b
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">else</span>: <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> res
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Helper: evaluate P(x^k) by spacing out terms</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">poly_pow</span>(a: list[int], k: int) <span style="color:#f92672">-&gt;</span> list[int]:
</span></span><span style="display:flex;"><span>        res <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i, v <span style="color:#f92672">in</span> enumerate(a):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> i <span style="color:#f92672">*</span> k <span style="color:#f92672">&lt;=</span> max_n: res[i <span style="color:#f92672">*</span> k] <span style="color:#f92672">=</span> v
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">else</span>: <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> res
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># T represents the alkyl radicals (OEIS A000598), T[0] = 1</span>
</span></span><span style="display:flex;"><span>    T <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> (max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    T[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Iteratively build coefficients of T</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># We only need to compute the (n-1)-th degree terms at step n</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Extract previously calculated slices</span>
</span></span><span style="display:flex;"><span>        t_prev <span style="color:#f92672">=</span> T[:n]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x^2) and T(x^3) terms up to n-1</span>
</span></span><span style="display:flex;"><span>        t2_term <span style="color:#f92672">=</span> T[(n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span>] <span style="color:#66d9ef">if</span> (n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">%</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>        t3_term <span style="color:#f92672">=</span> T[(n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">//</span> <span style="color:#ae81ff">3</span>] <span style="color:#66d9ef">if</span> (n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">%</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span> <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x)^2 and T(x)^3 terms up to n-1</span>
</span></span><span style="display:flex;"><span>        t_squared_n_1 <span style="color:#f92672">=</span> sum(t_prev[i] <span style="color:#f92672">*</span> t_prev[n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> i] <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        t_cubed_n_1 <span style="color:#f92672">=</span> sum(
</span></span><span style="display:flex;"><span>            T[i] <span style="color:#f92672">*</span> T[j] <span style="color:#f92672">*</span> T[n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> i <span style="color:#f92672">-</span> j]
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j <span style="color:#f92672">in</span> range(n <span style="color:#f92672">-</span> i)
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># T(x) * T(x^2) term up to n-1</span>
</span></span><span style="display:flex;"><span>        t_t2_n_1 <span style="color:#f92672">=</span> sum(
</span></span><span style="display:flex;"><span>            T[i] <span style="color:#f92672">*</span> T[j]
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> j <span style="color:#f92672">in</span> range((n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> i) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> i <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>j <span style="color:#f92672">==</span> n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        T[n] <span style="color:#f92672">=</span> (t_cubed_n_1 <span style="color:#f92672">+</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">*</span> t_t2_n_1 <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">*</span> t3_term) <span style="color:#f92672">//</span> <span style="color:#ae81ff">6</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Calculate Alkanes (OEIS A000602) from fully populated T</span>
</span></span><span style="display:flex;"><span>    T2 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>    T3 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">3</span>)
</span></span><span style="display:flex;"><span>    T4 <span style="color:#f92672">=</span> poly_pow(T, <span style="color:#ae81ff">4</span>)
</span></span><span style="display:flex;"><span>    T_squared <span style="color:#f92672">=</span> poly_mul(T, T)
</span></span><span style="display:flex;"><span>    T_cubed <span style="color:#f92672">=</span> poly_mul(T_squared, T)
</span></span><span style="display:flex;"><span>    T_fourth <span style="color:#f92672">=</span> poly_mul(T_cubed, T)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    term2 <span style="color:#f92672">=</span> [(T_squared[i] <span style="color:#f92672">-</span> T2[i]) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    term3_inner <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>        T_fourth[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">6</span> <span style="color:#f92672">*</span> poly_mul(T_squared, T2)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">8</span> <span style="color:#f92672">*</span> poly_mul(T, T3)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">3</span> <span style="color:#f92672">*</span> poly_mul(T2, T2)[i]
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">+</span> <span style="color:#ae81ff">6</span> <span style="color:#f92672">*</span> T4[i]
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    alkanes <span style="color:#f92672">=</span> [<span style="color:#ae81ff">1</span>] <span style="color:#f92672">+</span> [<span style="color:#ae81ff">0</span>] <span style="color:#f92672">*</span> max_n
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, max_n <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>        alkanes[n] <span style="color:#f92672">=</span> T[n] <span style="color:#f92672">-</span> term2[n] <span style="color:#f92672">+</span> term3_inner[n <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>] <span style="color:#f92672">//</span> <span style="color:#ae81ff">24</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> alkanes
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate and verify</span>
</span></span><span style="display:flex;"><span>isomers <span style="color:#f92672">=</span> compute_alkane_isomers(<span style="color:#ae81ff">40</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;C_14 isomers: </span><span style="color:#e6db74">{</span>isomers[<span style="color:#ae81ff">14</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)   <span style="color:#75715e"># Output: 1858</span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;C_40 isomers: </span><span style="color:#e6db74">{</span>isomers[<span style="color:#ae81ff">40</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)   <span style="color:#75715e"># Output: 62481801147341</span>
</span></span></code></pre></div></li>
<li>
<p><strong>Hardware</strong>: Derived analytically and enumerated manually by the authors in 1931 without computational hardware.</p>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Henze, H. R., &amp; Blair, C. M. (1931). The number of isomeric hydrocarbons of the methane series. <em>Journal of the American Chemical Society</em>, 53(8), 3077-3085. <a href="https://doi.org/10.1021/ja01359a034">https://doi.org/10.1021/ja01359a034</a></p>
<p><strong>Publication</strong>: Journal of the American Chemical Society (JACS) 1931</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{henze1931number,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The number of isomeric hydrocarbons of the methane series}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Henze, Henry R and Blair, Charles M}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{53}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3077--3085}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{1931}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES: A Compact Notation for Chemical Structures</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles/</guid><description>SMILES (Simplified Molecular Input Line Entry System) represents chemical structures using compact ASCII strings.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>SMILES (Simplified Molecular Input Line Entry System), originally developed by David Weininger in the late 1980s, is a one-dimensional string format for representing chemical molecular structures. It linearizes 3D molecular structures by performing a depth-first traversal of the molecular graph, recording the atoms and bonds along the way.</p>
<p>For example, the simple molecule ethanol ($\text{C}_2\text{H}_6\text{O}$) can be represented as <code>CCO</code>, while the more complex caffeine molecule becomes <code>CN1C=NC2=C1C(=O)N(C(=O)N2C)C</code>.</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Human-readable</strong>: Designed primarily for human readability. Compare with <a href="/notes/chemistry/molecular-representations/notations/inchi/">InChI</a>, a hierarchical representation optimized for machine parsing.</li>
<li><strong>Compact</strong>: More compact than other representations (3D coordinates, connectivity tables)</li>
<li><strong>Simple syntax</strong>: A language with simple syntax and structure, making it relatively easy to learn and use for chemists and researchers</li>
<li><strong>Flexible</strong>: Both linear and cyclic structures can be represented in many different valid ways</li>
</ul>
<p>For a hands-on tutorial on visualizing SMILES strings as 2D molecular images, see <a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES Strings to 2D Molecular Images</a>.</p>
<h2 id="basic-syntax">Basic Syntax</h2>
<h3 id="atomic-symbols">Atomic Symbols</h3>
<p>SMILES uses standard atomic symbols with implied hydrogen atoms:</p>
<ul>
<li><code>C</code> (methane, $\text{CH}_4$)</li>
<li><code>N</code> (ammonia, $\text{NH}_3$)</li>
<li><code>O</code> (water, $\text{H}_2\text{O}$)</li>
<li><code>P</code> (phosphine, $\text{PH}_3$)</li>
<li><code>S</code> (hydrogen sulfide, $\text{H}_2\text{S}$)</li>
<li><code>Cl</code> (hydrogen chloride, $\text{HCl}$)</li>
</ul>
<p><strong>Bracket notation</strong>: Elements outside the organic subset must be shown in brackets, e.g., <code>[Pt]</code> for elemental platinum. The organic subset (<code>B</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>P</code>, <code>S</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, and <code>I</code>) can omit brackets.</p>
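<p>As a quick sanity check (a sketch using RDKit, which later examples in these notes also rely on), the implicit hydrogen counts and the bracket requirement can be verified directly:</p>

```python
from rdkit import Chem

# Organic-subset atoms receive implicit hydrogens to fill normal valence
for smi in ["C", "N", "O", "Cl"]:
    atom = Chem.MolFromSmiles(smi).GetAtomWithIdx(0)
    print(smi, "->", atom.GetTotalNumHs(), "implicit hydrogens")
# C -> 4, N -> 3, O -> 2, Cl -> 1

# Outside the organic subset, brackets are mandatory
print(Chem.MolFromSmiles("Pt"))                # None: parse failure without brackets
print(Chem.MolFromSmiles("[Pt]") is not None)  # True
```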
<h3 id="bond-representation">Bond Representation</h3>
<p>Bonds are represented by symbols:</p>
<ul>
<li><strong>Single bond</strong>: <code>-</code> (usually omitted)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/ethane.webp"
         alt="Ethane"
         title="Ethane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ethane ($\text{C}_2\text{H}_6$), SMILES: <code>CC</code></figcaption>
    
</figure>

<ul>
<li><strong>Double bond</strong>: <code>=</code></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/methyl_isocyanate.webp"
         alt="Methyl Isocyanate"
         title="Methyl Isocyanate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Methyl Isocyanate ($\text{C}_2\text{H}_3\text{NO}$), SMILES: <code>CN=C=O</code></figcaption>
    
</figure>

<ul>
<li><strong>Triple bond</strong>: <code>#</code></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/hydrogen_cyanide.webp"
         alt="Hydrogen Cyanide"
         title="Hydrogen Cyanide"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Hydrogen Cyanide (HCN), SMILES: <code>C#N</code></figcaption>
    
</figure>

<ul>
<li><strong>Aromatic bond</strong>: <code>:</code> (usually omitted when lowercase atom symbols indicate aromaticity)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/vanillin.webp"
         alt="Vanillin"
         title="Vanillin"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Vanillin ($\text{C}_8\text{H}_8\text{O}_3$), SMILES: <code>O=Cc1ccc(O)c(OC)c1</code></figcaption>
    
</figure>

<ul>
<li><strong>Disconnected structures</strong>: <code>.</code> (separates disconnected components such as salts and ionic compounds)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/copper_II_sulfate.webp"
         alt="Copper(II) Sulfate"
         title="Copper(II) Sulfate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Copper(II) Sulfate ($\text{CuSO}_4$), SMILES: <code>[Cu+2].[O-]S(=O)(=O)[O-]</code></figcaption>
    
</figure>

<h3 id="structural-features">Structural Features</h3>
<ul>
<li><strong>Branches</strong>: Enclosed in parentheses and can be nested. For example, <code>CC(C)C(=O)O</code> represents isobutyric acid, where <code>(C)</code> and <code>(=O)</code> are branches off the main chain.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/3-propyl-4-isopropyl-1-heptene.webp"
         alt="3-Propyl-4-isopropyl-1-heptene"
         title="3-Propyl-4-isopropyl-1-heptene"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">3-Propyl-4-isopropyl-1-heptene ($\text{C}_{13}\text{H}_{26}$), SMILES: <code>C=CC(CCC)C(C(C)C)CCC</code></figcaption>
    
</figure>

<ul>
<li><strong>Cyclic structures</strong>: Written by breaking bonds and using numbers to indicate bond connections. For example, <code>C1CCCCC1</code> represents cyclohexane (the <code>1</code> connects the first and last carbon).</li>
<li><strong>Aromaticity</strong>: Lower case letters are used for atoms in aromatic rings. For example, benzene is written as <code>c1ccccc1</code>.</li>
<li><strong>Formal charges</strong>: Indicated by placing the charge in brackets after the atom symbol, e.g., <code>[C+]</code>, <code>[C-]</code>, or <code>[C-2]</code></li>
</ul>
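<p>These structural rules can be exercised directly with RDKit (a short sketch; the molecules are the ones named in the bullets above):</p>

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Ring-closure digit '1' bonds the first and last atoms of cyclohexane
cyclohexane = Chem.MolFromSmiles("C1CCCCC1")
print(cyclohexane.GetRingInfo().NumRings())  # 1

# Lowercase aromatic form and Kekulé form canonicalize to the same benzene
print(Chem.CanonSmiles("c1ccccc1") == Chem.CanonSmiles("C1=CC=CC=C1"))  # True

# Nested branches: isobutyric acid
mol = Chem.MolFromSmiles("CC(C)C(=O)O")
print(rdMolDescriptors.CalcMolFormula(mol))  # C4H8O2
```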
<h2 id="stereochemistry-and-isomers">Stereochemistry and Isomers</h2>
<h3 id="isotope-notation">Isotope Notation</h3>
<p>Isotope notation specifies the exact isotope of an element and comes before the element within square brackets, e.g., <code>[13C]</code> for carbon-13.</p>
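<p>One detail worth noting: inside brackets no hydrogens are implied, so an isotope-labeled methane must state its hydrogen count explicitly. A small RDKit sketch:</p>

```python
from rdkit import Chem

# Carbon-13 methane: isotope number precedes the element symbol
atom = Chem.MolFromSmiles("[13CH4]").GetAtomWithIdx(0)
print(atom.GetIsotope())      # 13
print(atom.GetTotalNumHs())   # 4 (from the explicit H4 in the brackets)

# [13C] alone is a bare carbon-13 atom with zero hydrogens
bare = Chem.MolFromSmiles("[13C]").GetAtomWithIdx(0)
print(bare.GetTotalNumHs())   # 0
```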
<h3 id="double-bond-stereochemistry">Double Bond Stereochemistry</h3>
<p>Directional bonds can be specified using <code>\</code> and <code>/</code> symbols to indicate the stereochemistry of double bonds:</p>
<ul>
<li><code>C/C=C\C</code> represents (Z)-2-butene (cis configuration)</li>
<li><code>C/C=C/C</code> represents (E)-2-butene (trans configuration)</li>
</ul>
<p>The direction of the slashes indicates which side of the double bond each substituent sits on: matching slashes (<code>/C=C/</code>) place the flanking substituents on opposite sides (trans), while opposed slashes (<code>/C=C\</code>) place them on the same side (cis).</p>
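<p>RDKit can be used to confirm these assignments (a sketch; note the backslash must be escaped inside a Python string literal):</p>

```python
from rdkit import Chem

def double_bond_stereo(smi: str) -> list[str]:
    """Return the stereo labels of all stereo-defined double bonds."""
    mol = Chem.MolFromSmiles(smi)
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    return [str(b.GetStereo()) for b in mol.GetBonds()
            if b.GetStereo() != Chem.BondStereo.STEREONONE]

print(double_bond_stereo("C/C=C/C"))   # ['STEREOE'] -> trans
print(double_bond_stereo("C/C=C\\C"))  # ['STEREOZ'] -> cis
```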
<h3 id="tetrahedral-chirality">Tetrahedral Chirality</h3>
<p>Chirality around tetrahedral centers uses <code>@</code> and <code>@@</code> symbols:</p>
<ul>
<li><code>N[C@](C)(F)C(=O)O</code> vs <code>N[C@@](F)(C)C(=O)O</code></li>
<li><code>@</code> means the remaining neighbors appear anticlockwise when viewed from the first-listed neighbor; <code>@@</code> means clockwise</li>
<li><code>@</code> and <code>@@</code> are shorthand for <code>@TH1</code> and <code>@TH2</code>, respectively</li>
</ul>
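<p>A useful consequence of the listing-order convention: swapping any two listed neighbors flips <code>@</code> to <code>@@</code>, so the two strings in the bullets above denote the same molecule. A sketch checking this with RDKit, along with CIP labels for the two alanine enantiomers:</p>

```python
from rdkit import Chem

# Swapping the (C) and (F) neighbors while flipping @ <-> @@
# leaves the physical arrangement unchanged
a = Chem.CanonSmiles("N[C@](C)(F)C(=O)O")
b = Chem.CanonSmiles("N[C@@](F)(C)C(=O)O")
print(a == b)  # True: both canonicalize to the same string

# CIP labels for the two alanine enantiomers
for smi in ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]:
    print(smi, Chem.FindMolChiralCenters(Chem.MolFromSmiles(smi)))
# -> [(1, 'S')] (L-alanine) and [(1, 'R')] (D-alanine)
```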















<figure class="post-figure center ">
    <img src="/img/smiles2img/glucose.webp"
         alt="Glucose"
         title="Glucose"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Glucose ($\text{C}_6\text{H}_{12}\text{O}_6$), SMILES: <code>OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1</code></figcaption>
    
</figure>

<h3 id="advanced-stereochemistry">Advanced Stereochemistry</h3>
<p>More general notation for other stereocenters:</p>
<ul>
<li><code>@AL1</code>, <code>@AL2</code> for allene-type stereocenters</li>
<li><code>@SP1</code>, <code>@SP2</code>, <code>@SP3</code> for square-planar stereocenters</li>
<li><code>@TB1</code>&hellip;<code>@TB20</code> for trigonal bipyramidal stereocenters</li>
<li><code>@OH1</code>&hellip;<code>@OH30</code> for octahedral stereocenters</li>
</ul>
<p>SMILES allows partial specification since it relies on local chirality.</p>
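<p>Partial specification in practice: RDKit reports unassigned centers with a <code>?</code> label when asked to include them (a sketch using 3-aminobutan-2-ol with only one of its two centers specified):</p>

```python
from rdkit import Chem

# 3-aminobutan-2-ol: the [C@H] center is specified, the C(C)O center is not
mol = Chem.MolFromSmiles("C[C@H](N)C(C)O")
centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
print(centers)  # one center carries a CIP label; the other is reported as '?'
```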
<h2 id="smiles-in-machine-learning">SMILES in Machine Learning</h2>
<p>Beyond its original role as a compact notation, SMILES has become the dominant molecular input format for deep learning in chemistry. Its adoption has revealed both strengths and challenges specific to neural architectures.</p>
<h3 id="canonical-vs-randomized-smiles">Canonical vs. Randomized SMILES</h3>
<p>Canonical SMILES algorithms produce a single unique string per molecule, which is valuable for database deduplication. In generative modeling, however, canonical representations introduce training bias: the canonicalization algorithm constrains how the molecular graph is traversed (e.g., prioritizing sidechains over ring atoms), forcing models to learn both valid SMILES syntax and the specific ordering rules. Structurally similar molecules can have substantially different canonical strings, making complex topologies harder to sample.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">Randomized SMILES</a> address this by generating non-unique representations through random atom orderings. Training RNN-based generative models on randomized SMILES acts as data augmentation, improving chemical space coverage, sampling uniformity, and completeness compared to canonical SMILES (Arus-Pous et al., 2019). In one benchmark, randomized SMILES recovered significantly more of <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> chemical space than canonical SMILES across all training set sizes.</p>
<p>RDKit makes it straightforward to enumerate randomized SMILES for a given molecule:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>)  <span style="color:#75715e"># benzoic acid</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Canonical form (deterministic)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; O=C(O)c1ccccc1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Randomized forms (different each call)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">5</span>):
</span></span><span style="display:flex;"><span>    print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, doRandom<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; OC(=O)c1ccccc1</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; O=C(c1ccccc1)O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; OC(c1ccccc1)=O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C(O)(c1ccccc1)=O</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; c1c(C(=O)O)cccc1</span>
</span></span></code></pre></div><p>Each of these strings encodes the same molecule but presents a different traversal of the molecular graph, giving a generative model more diverse training signal per molecule.</p>
<h3 id="validity-and-the-role-of-invalid-smiles">Validity and the Role of Invalid SMILES</h3>
<p>A large fraction of SMILES strings generated by neural models are syntactically or semantically invalid. Early efforts aimed to eliminate invalid outputs entirely, either through constrained representations like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (which guarantee 100% validity) or modified syntax like <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> (which removes paired syntax; see <a href="#deepsmiles">Variants</a> below for syntax details).</p>
<p>More recent work has complicated this picture. <a href="/notes/chemistry/molecular-representations/notations/invalid-smiles-help/">Skinnider (2024)</a> demonstrated that invalid SMILES generation actually benefits chemical language models. Invalid strings tend to be low-likelihood samples from the model&rsquo;s probability distribution. Filtering them out is equivalent to removing the model&rsquo;s least confident predictions, acting as implicit quality control. Meanwhile, enforcing absolute validity (as SELFIES does) can introduce systematic structural biases that impair distribution learning. This reframes SMILES&rsquo; non-robustness as potentially advantageous in certain ML contexts.</p>
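<p>The purely syntactic failure modes are cheap to detect even without a chemistry toolkit. A minimal, illustrative check for the two most frequent ones follows; semantic errors such as impossible valences require a real parser (RDKit's <code>MolFromSmiles</code> returns <code>None</code> for invalid input):</p>

```python
def syntactically_plausible(smiles):
    """Flag the two most common syntax errors in generated SMILES:
    unbalanced parentheses and unpaired ring-closure digits.
    (Toy check: ignores bracket atoms, %nn ring numbers, and all
    semantic/valence rules, which need a real parser.)"""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:           # ')' before any matching '('
                return False
        elif ch.isdigit():
            open_rings ^= {ch}      # first sight opens a ring, second closes it
    return depth == 0 and not open_rings

print(syntactically_plausible("c1ccccc1"))  # -> True   (benzene)
print(syntactically_plausible("c1ccccc"))   # -> False  (unclosed ring)
print(syntactically_plausible("CC(=O)O)"))  # -> False  (stray parenthesis)
```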
<h3 id="tokenization-challenges">Tokenization Challenges</h3>
<p>Converting SMILES strings into token sequences for neural models is non-trivial. The two baseline approaches illustrate the problem using chloramphenicol (<code>O=C(NC([C@@H](O)c1ccc([N+](=O)[O-])cc1)CO)C(Cl)Cl</code>):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>smiles <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;O=C(NC([C@@H](O)c1ccc([N+](=O)[O-])cc1)CO)C(Cl)Cl&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Character-level: splits every character individually</span>
</span></span><span style="display:flex;"><span>char_tokens <span style="color:#f92672">=</span> list(smiles)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># [&#39;O&#39;, &#39;=&#39;, &#39;C&#39;, &#39;(&#39;, &#39;N&#39;, &#39;C&#39;, &#39;(&#39;, &#39;[&#39;, &#39;C&#39;, &#39;@&#39;, &#39;@&#39;, &#39;H&#39;, &#39;]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;(&#39;, &#39;O&#39;, &#39;)&#39;, &#39;c&#39;, &#39;1&#39;, &#39;c&#39;, &#39;c&#39;, &#39;c&#39;, &#39;(&#39;, &#39;[&#39;, &#39;N&#39;, &#39;+&#39;, &#39;]&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;(&#39;, &#39;=&#39;, &#39;O&#39;, &#39;)&#39;, &#39;[&#39;, &#39;O&#39;, &#39;-&#39;, &#39;]&#39;, &#39;)&#39;, &#39;c&#39;, &#39;c&#39;, &#39;1&#39;, &#39;)&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;C&#39;, &#39;O&#39;, &#39;)&#39;, &#39;C&#39;, &#39;(&#39;, &#39;C&#39;, &#39;l&#39;, &#39;)&#39;, &#39;C&#39;, &#39;l&#39;]</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; 49 tokens</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Atom-level: regex groups brackets, two-char elements, and bond symbols</span>
</span></span><span style="display:flex;"><span>atom_pattern <span style="color:#f92672">=</span> (
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\\</span><span style="color:#e6db74">|\/|:|~|@|\?|&gt;&gt;?|\*|%[0-9]</span><span style="color:#e6db74">{2}</span><span style="color:#e6db74">|[0-9])&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>atom_tokens <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>findall(atom_pattern, smiles)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># [&#39;O&#39;, &#39;=&#39;, &#39;C&#39;, &#39;(&#39;, &#39;N&#39;, &#39;C&#39;, &#39;(&#39;, &#39;[C@@H]&#39;, &#39;(&#39;, &#39;O&#39;, &#39;)&#39;, &#39;c&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;1&#39;, &#39;c&#39;, &#39;c&#39;, &#39;c&#39;, &#39;(&#39;, &#39;[N+]&#39;, &#39;(&#39;, &#39;=&#39;, &#39;O&#39;, &#39;)&#39;, &#39;[O-]&#39;, &#39;)&#39;,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#  &#39;c&#39;, &#39;c&#39;, &#39;1&#39;, &#39;)&#39;, &#39;C&#39;, &#39;O&#39;, &#39;)&#39;, &#39;C&#39;, &#39;(&#39;, &#39;Cl&#39;, &#39;)&#39;, &#39;Cl&#39;]</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; 36 tokens</span>
</span></span></code></pre></div><p>Character-level tokenization splits <code>Cl</code> (chlorine) into <code>C</code> + <code>l</code>, making the chlorine indistinguishable from carbon. It also fragments <code>[C@@H]</code> (a chiral carbon) into six meaningless tokens: <code>[</code>, <code>C</code>, <code>@</code>, <code>@</code>, <code>H</code>, <code>]</code>. Atom-level tokenization preserves these as single tokens but still produces long sequences (~40 tokens per molecule on average in ChEMBL).</p>
<p>Several chemistry-aware tokenizers go further:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> adapts byte pair encoding to learn high-frequency SMILES substrings from large chemical datasets, compressing average sequence length from ~40 to ~6 tokens while preserving chemically meaningful substructures.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">Atom Pair Encoding (APE)</a> preserves atomic identity during subword merging, preventing chemically meaningless token splits.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a> encodes each atom&rsquo;s local chemical environment into the token itself (e.g., distinguishing a carbonyl carbon from a methyl carbon), reducing token degeneration and improving translation accuracy.</li>
<li><a href="/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/">Smirk</a> achieves full OpenSMILES coverage with only 165 tokens by decomposing bracketed atoms into glyphs.</li>
</ul>
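<p>The mechanism behind SPE is ordinary byte pair encoding applied to SMILES tokens: repeatedly merge the most frequent adjacent token pair. One merge round can be sketched as follows (character-level start tokens and a toy corpus, for illustration only):</p>

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Most common adjacent token pair across a tokenized corpus."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    result = []
    for tokens in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        result.append(out)
    return result

# One merge round on a toy corpus of character-tokenized SMILES.
corpus = [list("CCO"), list("CCN"), list("CC(=O)O")]
pair = most_frequent_pair(corpus)   # ('C', 'C') is most frequent here
print(merge_pair(corpus, pair)[0])  # -> ['CC', 'O']
```

<p>SPE repeats this loop thousands of times over a large corpus, so frequent substructures end up as single vocabulary entries.</p>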
<h3 id="smiles-based-foundation-models">SMILES-Based Foundation Models</h3>
<p>SMILES serves as the primary input format for molecular encoder models, including <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES-Transformer</a>, <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a>, <a href="/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/">SMI-TED</a>, and <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>. These models learn molecular representations from large SMILES corpora through pre-training objectives like masked language modeling.</p>
<p>A key open challenge is robustness to SMILES variants. The <a href="/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/">AMORE framework</a> revealed that current chemical language models struggle to recognize chemically equivalent SMILES representations (such as hydrogen-explicit vs. implicit forms, or different atom orderings) as encoding the same molecule.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>SMILES is the dominant representation for de novo molecular generation. The typical pipeline trains a language model on SMILES corpora, then steers sampling toward molecules with desired properties. Major architecture families include:</p>
<ul>
<li><strong>Variational autoencoders</strong>: The <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Automatic Chemical Design VAE</a> (Gomez-Bombarelli et al., 2018) encodes SMILES into a continuous latent space, enabling gradient-based optimization toward target properties.</li>
<li><strong>RL-tuned generators</strong>: <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and its successors fine-tune a pre-trained SMILES language model using reinforcement learning, rewarding molecules that satisfy multi-objective scoring functions. <a href="/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">DrugEx</a> extends this with Pareto-based multi-objective optimization.</li>
<li><strong>Adversarial approaches</strong>: <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> and <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a> apply GAN-based training to SMILES generation, using domain-specific rewards alongside the discriminator signal.</li>
</ul>
<p>The challenges of <a href="#canonical-vs-randomized-smiles">canonical vs. randomized SMILES</a> and <a href="#validity-and-the-role-of-invalid-smiles">invalid outputs</a> discussed above are particularly relevant in this generation context.</p>
<h3 id="property-prediction">Property Prediction</h3>
<p>SMILES strings serve as the primary input for quantitative structure-activity relationship (QSAR) models. <a href="/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/">SMILES2Vec</a> learns fixed-length molecular embeddings directly from SMILES for property regression and classification. <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">MaxSMI</a> demonstrates that SMILES augmentation (training on multiple randomized SMILES per molecule) improves property prediction accuracy, connecting the <a href="#canonical-vs-randomized-smiles">data augmentation benefits</a> observed in generative settings to discriminative tasks.</p>
<h3 id="optical-chemical-structure-recognition">Optical Chemical Structure Recognition</h3>
<p>SMILES is also the standard output format for <a href="/posts/what-is-ocsr/">optical chemical structure recognition (OCSR)</a> systems, which extract molecular structures from images in scientific literature. Deep learning approaches like <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a> and <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES</a> frame this as an image-to-SMILES translation problem, using encoder-decoder architectures to generate SMILES strings directly from molecular diagrams. For a taxonomy of OCSR approaches, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR methods overview</a>.</p>
<h2 id="limitations">Limitations</h2>
<h3 id="classical-limitations">Classical Limitations</h3>
<ul>
<li><strong>Non-uniqueness</strong>: Different SMILES strings can represent the same molecule (e.g., ethanol can be written as <code>CCO</code> or <code>OCC</code>). Canonical SMILES algorithms address this by producing a single unique representation.</li>
<li><strong>Non-robustness</strong>: Many possible SMILES strings do not correspond to any valid molecular structure.
<ul>
<li>Syntactically malformed strings (e.g., unmatched ring-closure digits or parentheses).</li>
<li>Strings that parse but violate chemical rules (e.g., an atom with more bonds than its valence allows).</li>
</ul>
</li>
<li><strong>Information loss</strong>: SMILES encodes only the molecular graph; 3D conformational information (coordinates, bond lengths, angles) cannot be represented.</li>
</ul>
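<p>The non-uniqueness point is easy to demonstrate with RDKit, which the examples above already use: distinct valid strings for the same molecule all collapse to one canonical form:</p>

```python
from rdkit import Chem

# Three different valid strings for ethanol.
variants = ["CCO", "OCC", "C(O)C"]
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)
# -> {'CCO'}
```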
<h3 id="machine-learning-limitations">Machine Learning Limitations</h3>
<p>The challenges described above (canonical ordering bias motivating <a href="#canonical-vs-randomized-smiles">randomized SMILES</a>, validity constraints motivating <a href="#deepsmiles">DeepSMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and tokenization ambiguity motivating <a href="#tokenization-challenges">chemistry-aware tokenizers</a>) remain active areas of research. See the linked sections for details on each.</p>
<h2 id="variants-and-standards">Variants and Standards</h2>
<h3 id="canonical-smiles">Canonical SMILES</h3>
<p>For how canonical vs. randomized SMILES affects generative modeling, see <a href="#canonical-vs-randomized-smiles">Canonical vs. Randomized SMILES</a> above.</p>
<p>Canonical SMILES algorithms produce a single unique string per molecule by assigning a deterministic rank to each atom and then traversing the molecular graph in that rank order. Most implementations build on the Morgan algorithm (extended connectivity): each atom starts with an initial invariant based on its properties (atomic number, degree, charge, hydrogen count), then iteratively updates its invariant by incorporating its neighbors&rsquo; invariants until the ranking stabilizes. The final atom ranks determine the traversal order, which determines the canonical string.</p>
<p>In practice, the Morgan algorithm alone does not fully resolve all ties. Implementations must also make choices about tie-breaking heuristics, aromaticity perception (Kekulé vs. aromatic form), and stereochemistry encoding. Because these choices differ across toolkits (RDKit, OpenBabel, Daylight, ChemAxon), the same molecule can produce different &ldquo;canonical&rdquo; SMILES depending on the software. A canonical SMILES is only guaranteed unique within a single implementation, not across implementations.</p>
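<p>The iterative refinement at the heart of this procedure can be sketched in plain Python. This is an illustrative simplification, not RDKit's actual algorithm; real canonicalizers add tie-breaking, aromaticity, and stereochemistry handling on top:</p>

```python
def morgan_ranks(adjacency, invariants):
    """Morgan-style ranking sketch: refine atom invariants until the
    partition into equivalence classes stops splitting."""
    ranks = list(invariants)
    while True:
        # New invariant: own rank plus the sorted multiset of neighbor ranks.
        refined = [
            (ranks[i], tuple(sorted(ranks[j] for j in adjacency[i])))
            for i in range(len(adjacency))
        ]
        # Dense re-ranking: atoms with equal refined invariants share a rank.
        order = sorted(set(refined))
        new_ranks = [order.index(inv) for inv in refined]
        if len(set(new_ranks)) == len(set(ranks)):
            return new_ranks
        ranks = new_ranks

# Ethanol C-C-O as an adjacency list; initial invariant = atomic number.
# Terminal carbon, central carbon, and oxygen each get a distinct rank.
print(morgan_ranks({0: [1], 1: [0, 2], 2: [1]}, [6, 6, 8]))
# -> [0, 1, 2]
```

<p>The final ranks then seed the traversal order used to emit the canonical string.</p>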
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># RDKit&#39;s canonical SMILES for caffeine</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;CN1C=NC2=C1C(=O)N(C(=O)N2C)C&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; Cn1c(=O)c2c(ncn2C)n(C)c1=O</span>
</span></span></code></pre></div><h3 id="isomeric-smiles">Isomeric SMILES</h3>
<p>Isomeric SMILES incorporates isotope and stereochemistry information, providing a more detailed molecular representation than generic SMILES. Non-isomeric SMILES strips this information, collapsing stereoisomers and isotopologues into the same string:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># L-alanine (chiral center)</span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;N[C@@H](C)C(=O)O&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; C[C@H](N)C(=O)O    (preserves chirality)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; CC(N)C(=O)O         (chirality lost)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Deuterated water (isotope labels)</span>
</span></span><span style="display:flex;"><span>mol2 <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;[2H]O[2H]&#34;</span>)
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [2H]O[2H]           (preserves isotopes)</span>
</span></span><span style="display:flex;"><span>print(Chem<span style="color:#f92672">.</span>MolToSmiles(mol2, isomericSmiles<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; [H]O[H]             (isotope info lost)</span>
</span></span></code></pre></div><h3 id="opensmiles-vs-proprietary">OpenSMILES vs. Proprietary</h3>
<ul>
<li><strong>Proprietary</strong>: The original SMILES specification was proprietary (Daylight Chemical Information Systems), which led to compatibility issues between different implementations.</li>
<li><strong>OpenSMILES</strong>: A community-driven, freely available specification developed to resolve these compatibility issues.</li>
</ul>
<h2 id="extensions-and-related-notations">Extensions and Related Notations</h2>
<h3 id="deepsmiles">DeepSMILES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> modifies two aspects of SMILES syntax that cause most invalid strings in generative models, while remaining interconvertible with standard SMILES without information loss.</p>
<p><strong>Ring closures</strong>: Standard SMILES uses paired digits (<code>c1ccccc1</code> for benzene). A model must remember which digits are &ldquo;open&rdquo; and close them correctly. DeepSMILES replaces this with a single ring-size indicator at the closing position: <code>cccccc6</code> means &ldquo;connect to the atom 6 positions back.&rdquo;</p>
<p><strong>Branches</strong>: Standard SMILES uses matched parentheses (<code>C(OC)(SC)F</code>). DeepSMILES uses a postfix notation with only closing parentheses, where consecutive <code>)</code> symbols indicate how far to pop back on the atom stack: <code>COC))SC))F</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>SMILES:       c1ccccc1          C(OC)(SC)F
</span></span><span style="display:flex;"><span>DeepSMILES:   cccccc6           COC))SC))F
</span></span><span style="display:flex;"><span>              ↑                 ↑
</span></span><span style="display:flex;"><span>              single digit =    no opening parens,
</span></span><span style="display:flex;"><span>              ring size         )) pops back to C
</span></span></code></pre></div><p>A single unpaired symbol cannot be &ldquo;unmatched,&rdquo; eliminating the two main sources of syntactically invalid strings from generative models.</p>
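<p>The ring-closure rule is simple enough to invert by hand. A toy decoder for the unbranched, single-digit case (a full converter must cover the complete grammar, including branches and <code>%nn</code> ring sizes) might look like:</p>

```python
import re

def decode_rings(deep):
    """Convert DeepSMILES ring closures back to paired SMILES digits.
    Toy decoder: assumes no branches, single-digit ring sizes, and at
    most 9 rings."""
    symbols = []   # atom symbols in writing order
    closures = []  # ring-closure digits to append after each atom
    label = 0
    for tok in re.findall(r"Cl|Br|[A-Za-z]|\d", deep):
        if tok.isdigit():
            n = int(tok)  # ring size: bond to the atom n positions back
            label += 1
            closures[-1].append(str(label))
            closures[-n].append(str(label))
        else:
            symbols.append(tok)
            closures.append([])
    return "".join(s + "".join(c) for s, c in zip(symbols, closures))

print(decode_rings("cccccc6"))  # -> c1ccccc1  (benzene)
print(decode_rings("CCCC4"))    # -> C1CCC1    (cyclobutane)
```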
<h3 id="reaction-smiles">Reaction SMILES</h3>
<p>Reaction SMILES extends the notation to represent chemical reactions by separating reactants, reagents, and products with <code>&gt;</code> symbols. The general format is <code>reactants&gt;reagents&gt;products</code>, where each group can contain multiple molecules separated by <code>.</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>CC(=O)O.CCO&gt;&gt;CC(=O)OCC.O
</span></span><span style="display:flex;"><span>│         │ │            │
</span></span><span style="display:flex;"><span>│         │ │            └─ water
</span></span><span style="display:flex;"><span>│         │ └─ ethyl acetate
</span></span><span style="display:flex;"><span>│         └─ ethanol
</span></span><span style="display:flex;"><span>└─ acetic acid
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>(Fischer esterification: acetic acid + ethanol → ethyl acetate + water)
</span></span></code></pre></div><p>The <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> treats this as a machine translation problem, translating reactant SMILES to product SMILES with a Transformer encoder-decoder architecture.</p>
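<p>Parsing the three-part format needs nothing more than string splitting. A minimal sketch (no validation; the reagent slot is empty in the example above):</p>

```python
def parse_reaction_smiles(rxn):
    """Split 'reactants>reagents>products' into three molecule lists;
    '.' separates molecules within each part."""
    parts = rxn.split(">")
    return [part.split(".") if part else [] for part in parts]

reactants, reagents, products = parse_reaction_smiles("CC(=O)O.CCO>>CC(=O)OCC.O")
print(reactants)  # -> ['CC(=O)O', 'CCO']
print(reagents)   # -> []
print(products)   # -> ['CC(=O)OCC', 'O']
```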
<h3 id="smarts-and-smirks">SMARTS and SMIRKS</h3>
<p><strong>SMARTS</strong> (SMILES Arbitrary Target Specification) is a pattern language built on SMILES syntax for substructure searching. It extends SMILES with query primitives like atom environments (<code>[CX3]</code> for a carbon with three connections) and logical operators, enabling precise structural pattern matching:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMARTS pattern for a carboxylic acid group: C(=O)OH</span>
</span></span><span style="display:flex;"><span>pattern <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmarts(<span style="color:#e6db74">&#34;[CX3](=O)[OX2H1]&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> name, smi <span style="color:#f92672">in</span> [(<span style="color:#e6db74">&#34;acetic acid&#34;</span>, <span style="color:#e6db74">&#34;CC(=O)O&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;benzoic acid&#34;</span>, <span style="color:#e6db74">&#34;c1ccc(C(=O)O)cc1&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;ethanol&#34;</span>, <span style="color:#e6db74">&#34;CCO&#34;</span>),
</span></span><span style="display:flex;"><span>                  (<span style="color:#e6db74">&#34;acetone&#34;</span>, <span style="color:#e6db74">&#34;CC(=O)C&#34;</span>)]:
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(smi)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  </span><span style="color:#e6db74">{</span>name<span style="color:#e6db74">:</span><span style="color:#e6db74">15s</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> -&gt; </span><span style="color:#e6db74">{</span><span style="color:#e6db74">&#39;match&#39;</span> <span style="color:#66d9ef">if</span> mol<span style="color:#f92672">.</span>HasSubstructMatch(pattern) <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#39;no match&#39;</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; acetic acid      -&gt; match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; benzoic acid     -&gt; match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; ethanol          -&gt; no match</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; acetone          -&gt; no match</span>
</span></span></code></pre></div><p><strong>SMIRKS</strong> extends SMARTS to describe reaction transforms, using atom maps (<code>:1</code>, <code>:2</code>, &hellip;) to track which atoms in the reactants correspond to which atoms in the products:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> AllChem, MolFromSmiles, MolToSmiles
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SMIRKS for ester hydrolysis: break the C-O ester bond</span>
</span></span><span style="display:flex;"><span>smirks <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;[C:1](=[O:2])[O:3][C:4]&gt;&gt;[C:1](=[O:2])[OH:3].[C:4][OH]&#34;</span>
</span></span><span style="display:flex;"><span>rxn <span style="color:#f92672">=</span> AllChem<span style="color:#f92672">.</span>ReactionFromSmarts(smirks)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>reactant <span style="color:#f92672">=</span> MolFromSmiles(<span style="color:#e6db74">&#34;CC(=O)OCC&#34;</span>)  <span style="color:#75715e"># ethyl acetate</span>
</span></span><span style="display:flex;"><span>products <span style="color:#f92672">=</span> rxn<span style="color:#f92672">.</span>RunReactants((reactant,))
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34; + &#34;</span><span style="color:#f92672">.</span>join(MolToSmiles(p) <span style="color:#66d9ef">for</span> p <span style="color:#f92672">in</span> products[<span style="color:#ae81ff">0</span>]))
</span></span><span style="display:flex;"><span><span style="color:#75715e"># -&gt; CC(=O)O + CCO    (acetic acid + ethanol)</span>
</span></span></code></pre></div><p>See the <a href="/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/">Smirk tokenizer</a> for a recent approach to tokenizing these extensions for molecular foundation models.</p>
<h3 id="t-smiles">t-SMILES</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/t-smiles-fragment-molecular-representation/">t-SMILES</a> encodes molecules as fragment-based strings by decomposing a molecule into chemically meaningful substructures, arranging them into a full binary tree, and traversing it breadth-first. This dramatically reduces nesting depth compared to standard SMILES (99.3% of tokens at depth 0-2 vs. 68.0% for SMILES on ChEMBL).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Standard SMILES (depth-first, atom-level):
</span></span><span style="display:flex;"><span>  CC(=O)Oc1ccccc1C(=O)O                     (aspirin)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>t-SMILES pipeline:
</span></span><span style="display:flex;"><span>  1. Fragment:     [CC(=O)O*]  [*c1ccccc1*]  [*C(=O)O]
</span></span><span style="display:flex;"><span>  2. Binary tree:
</span></span><span style="display:flex;"><span>                   [*c1ccccc1*]
</span></span><span style="display:flex;"><span>                  /             \
</span></span><span style="display:flex;"><span>         [CC(=O)O*]          [*C(=O)O]
</span></span><span style="display:flex;"><span>  3. BFS string:   [*c1ccccc1*] ^ [CC(=O)O*] ^ [*C(=O)O]
</span></span></code></pre></div><p>The framework introduces two symbols beyond standard SMILES: <code>^</code> separates adjacent fragments (analogous to spaces between words), and <code>&amp;</code> marks empty tree nodes. Only single closure symbols are needed per fragment, eliminating the deep nesting that makes standard SMILES difficult for generative models on small datasets.</p>
<h2 id="further-reading">Further Reading</h2>
<p>For a more robust alternative that guarantees 100% valid molecules, see <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES (Self-Referencing Embedded Strings)</a>. For the historical context and design philosophy behind SMILES, see <a href="/notes/chemistry/molecular-representations/notations/smiles-original-paper/">SMILES: The Original Paper (Weininger 1988)</a>.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://19january2021snapshot.epa.gov/sites/static/files/2015-05/documents/appendf.pdf">Sustainable Futures / P2 Framework Manual 2012 EPA-748-B12-001: Appendix F. SMILES Notation Tutorial</a></li>
<li><a href="https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html">Daylight Chemical Information Systems, Inc. SMILES</a></li>
<li><a href="http://opensmiles.org/opensmiles.html">OpenSMILES</a></li>
<li><a href="https://arxiv.org/abs/2402.01439">From Words to Molecules: A Survey of Large Language Models in Chemistry</a></li>
</ul>
]]></content:encoded></item></channel></rss>