<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular-Representation on Hunter Heidenreich | AI Research Scientist &amp; Engineer</title><link>https://hunterheidenreich.com/tags/molecular-representation/</link><description>Recent content in Molecular-Representation on Hunter Heidenreich | AI Research Scientist &amp; Engineer</description><image><title>Hunter Heidenreich | AI Research Scientist &amp; Engineer</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Fri, 13 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/molecular-representation/index.xml" rel="self" type="application/rss+xml"/><item><title>ChemBERTa-3: Open Source Chemical Foundation Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemberta-3/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemberta-3/</guid><description>An open-source framework integrating DeepChem and Ray for training and benchmarking chemical foundation models like MoLFormer and GROVER at scale.</description><content:encoded><![CDATA[<h2 id="core-contribution-an-open-source-framework">Core Contribution: An Open-Source Framework</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> contributions.</p>
<ul>
<li><strong>Resource Basis</strong>: The core contribution is &ldquo;ChemBERTa-3,&rdquo; an open-source framework integrated into DeepChem that standardizes the pretraining and benchmarking of chemical foundation models. The authors focus heavily on infrastructure (AWS/Ray integration) and correcting benchmarking inconsistencies in the field.</li>
<li><strong>Method Basis</strong>: It trains models like &ldquo;c3-MoLFormer&rdquo; to reproduce and validate the infrastructure.</li>
</ul>
<h2 id="the-pretraining-scalability-challenge">The Pretraining Scalability Challenge</h2>
<ul>
<li><strong>Scalability Challenges</strong>: Building robust molecular models is difficult due to the vast size of chemical space and the computational intensity of pretraining on large datasets.</li>
<li><strong>Proprietary Barriers</strong>: Many high-performing chemical foundation models (e.g., the full MoLFormer-XL) are partially closed-source or difficult to reproduce.</li>
<li><strong>Benchmarking Inconsistencies</strong>: There is a lack of systematic comparison between architectures (e.g., Graph vs. Transformer) using unified protocols. Specifically, previous comparisons relied on reported results that used differing scaffold-splitting algorithms, making those comparisons unreliable.</li>
</ul>
<h2 id="unified-infrastructure--standardized-benchmarking">Unified Infrastructure &amp; Standardized Benchmarking</h2>
<ul>
<li><strong>Unified Infrastructure</strong>: Integration of DeepChem with Ray for distributed, scalable pretraining and fine-tuning of both graph and transformer models.</li>
<li><strong>Standardized Benchmarking</strong>: Identification that MoLFormer&rsquo;s scaffold splitting algorithm differs from the standard DeepChem/MoleculeNet splitter, and the subsequent standardization of these benchmarks for fair comparison.</li>
<li><strong>New DeepChem Tools</strong>: Introduction of the <code>ModularTorchModel</code> class for flexible loss computation and <code>HuggingFaceModel</code> wrappers to bridge ecosystems.</li>
</ul>
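<p>The <code>ModularTorchModel</code> idea, wiring components together so the loss can be computed from intermediate activations rather than only the final output, can be sketched generically. This is an illustrative pattern in plain Python, not DeepChem&rsquo;s actual API; all class and component names here are hypothetical:</p>

```python
# Illustrative sketch of the "modular model" pattern: named components are
# applied in sequence, and the loss function can read *intermediate* values,
# not just the final output. Not DeepChem's real API.

class ModularModel:
    def __init__(self, components, loss_fn):
        self.components = components  # name -> callable, applied in order
        self.loss_fn = loss_fn        # receives all intermediate activations

    def forward(self, x):
        intermediates = {"input": x}
        for name, fn in self.components.items():
            x = fn(x)
            intermediates[name] = x
        return x, intermediates

    def loss(self, x, target):
        _, intermediates = self.forward(x)
        return self.loss_fn(intermediates, target)

# Toy usage: an "encoder" and a "head", with a loss that also penalizes the
# encoder's intermediate representation (a regularizer the final output alone
# could not express).
model = ModularModel(
    components={"encoder": lambda v: [2 * t for t in v],
                "head": lambda v: sum(v)},
    loss_fn=lambda inter, y: abs(inter["head"] - y) + 0.01 * sum(inter["encoder"]),
)
```

<p>The design point is that swapping a component or a loss term requires no change to the training loop, which is what makes flexible pretraining objectives practical.</p>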
<h2 id="benchmarking-transformers-vs-graph-models">Benchmarking Transformers vs. Graph Models</h2>
<ul>
<li><strong>Architecture Comparison</strong>: Benchmarked Transformers (<a href="/notes/computational-chemistry/chemical-language-models/chemberta/">ChemBERTa</a>, MoLFormer) against Graph models (GROVER, InfoGraph, InfoMax3D, DMPNN, GCN) and baselines (Random Forest).</li>
<li><strong>Pretraining Scale Disparity</strong>:
<ul>
<li>Transformers were pretrained on subsets ranging from 10M to 1.1B molecules (the largest combining ZINC20 and PubChem).</li>
<li>Graph models were limited to 250K molecule subsets due to memory and computational overhead of message passing on large graphs. While this highlights the superior scalability of Transformer architectures, comparing a 1.1B-trained Transformer to a 250K-trained Graph model provides an unbalanced evaluation of architectural capacity.</li>
</ul>
</li>
<li><strong>Reproducibility Validation</strong>: Trained &ldquo;c3-MoLFormer&rdquo; (a reproduction of MoLFormer) on 1.1B molecules using two distinct hardware setups: AWS spot instances (Ray) and a local HPC cluster.</li>
<li><strong>Scaffold Split Analysis</strong>: Compared performance metrics using &ldquo;DeepChem scaffold splits&rdquo; vs. &ldquo;MoLFormer scaffold splits&rdquo; to quantify the impact of data leakage/overlap.</li>
</ul>
<h2 id="overcoming-scaffold-splitting-inconsistencies">Overcoming Scaffold Splitting Inconsistencies</h2>
<ul>
<li><strong>Scaling Transformers vs. Graphs</strong>: Transformer-based models are significantly easier to scale to large datasets than current graph-based approaches, though performance is comparable at small scales.</li>
<li><strong>Benchmarking sensitivity</strong>: MoLFormer&rsquo;s reported superiority over baselines was partly inflated by its specific scaffold splitting method, which had higher structural overlap between train and test sets (yielding a lower Tanimoto distance, generally quantified via $1 - \frac{|A \cap B|}{|A \cup B|}$) than DeepChem splits. When standardized, baselines like DMPNN perform more competitively.</li>
<li><strong>Infrastructure Viability</strong>: The framework successfully replicated large-scale training (MoLFormer-1.1B) on both cloud and on-premise HPC, confirming reproducibility.</li>
<li><strong>Open Source Release</strong>: All code, configurations, and the c3-MoLFormer-1.1B model weights are released to facilitate future research.</li>
</ul>
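<p>The Tanimoto distance used above to quantify train/test structural overlap follows directly from the set formula. A minimal sketch using Python sets in place of real fingerprint bit vectors (in practice these would come from e.g. RDKit Morgan fingerprints of the train and test molecules):</p>

```python
def tanimoto_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical bit sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Toy "fingerprints": sets of on-bit indices. A lower distance between train
# and test fingerprints means more structural overlap, i.e. an easier split.
train_fp = {1, 2, 3, 4}
test_fp = {3, 4, 5, 6}
d = tanimoto_distance(train_fp, test_fp)  # 1 - 2/6 ≈ 0.667
```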
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pretraining</strong>:
<ul>
<li><strong>Source</strong>: ZINC20 (1.4B compounds) and PubChem.</li>
<li><strong>Scale</strong>: Subsets of 10M, 100M, and 1.1B (100% ZINC20 + 100% PubChem) were used for Transformers. Graph models used a 250K subset.</li>
</ul>
</li>
<li><strong>Fine-tuning</strong>:
<ul>
<li><strong>Suite</strong>: MoleculeNet.</li>
<li><strong>Tasks</strong>: Classification (BACE, BBBP, Tox21, HIV, SIDER, ClinTox) and Regression (ESOL, FreeSolv, Lipophilicity, QM9).</li>
<li><strong>Splits</strong>: Critical distinction made between &ldquo;DeepChem scaffold splits&rdquo; (80/10/10) and &ldquo;MoLFormer scaffold splits&rdquo; (which can be downloaded from <a href="https://ibm.ent.box.com/v/MoLFormer-data"><code>https://ibm.ent.box.com/v/MoLFormer-data</code></a>). The paper notes these algorithms differ.</li>
</ul>
</li>
</ul>
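<p>Why the splitting algorithm matters: scaffold splits group molecules by Bemis&ndash;Murcko scaffold and assign whole groups to train/valid/test, so different group orderings change which scaffolds land in the test set. A greedy 80/10/10 sketch of the idea, where <code>scaffold_fn</code> is a stand-in for RDKit&rsquo;s Murcko scaffold extractor and the largest-first ordering is only one plausible choice:</p>

```python
from collections import defaultdict

def scaffold_split(smiles, scaffold_fn, frac=(0.8, 0.1, 0.1)):
    """Greedy scaffold split: group molecules by scaffold key, then fill
    train/valid/test so that no scaffold group spans two splits.
    `scaffold_fn` is a stand-in for a real Murcko scaffold extractor."""
    groups = defaultdict(list)
    for s in smiles:
        groups[scaffold_fn(s)].append(s)
    # One plausible ordering: largest scaffold groups first.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles)
    train_cap, valid_cap = frac[0] * n, frac[1] * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train += group
        elif len(valid) + len(group) <= valid_cap:
            valid += group
        else:
            test += group
    return train, valid, test
```

<p>Because border groups get assigned differently under different orderings and tie-breaking rules, two "scaffold splits" of the same dataset can have very different train/test overlap, which is exactly the inconsistency the paper standardizes away.</p>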
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DeepChem integrated with Ray for distributed training. To recreate the environment, the repository relies on a nightly version of DeepChem (<code>pip install --pre deepchem</code>) and specific dependencies found within the <code>requirements.txt</code>. Pretraining scripts are available in the <code>chemberta3_benchmarking/pretraining</code> directory of the repository.</li>
<li><strong>Data Preparation</strong>: Featurization workflows (e.g., <code>CircularFingerprint</code>, <code>RDKitConformer</code>) are documented under <code>chemberta3_benchmarking/data/data_preprocessing/</code> in the codebase.</li>
<li><strong>Modular Training</strong>: Uses <code>ModularTorchModel</code> to allow loss computation from intermediate values and flexible component connection.</li>
<li><strong>Training Brittleness</strong>:
<ul>
<li><strong>Optimizer</strong>: Linear learning rate scheduler with warmup.</li>
<li><strong>Instability Handling</strong>: The authors observed significant loss spikes during warmup. Their primary mitigation strategy involved checkpointing frequently and restarting from the last stable state upon a spike, highlighting a persistent brittleness in optimizing these large chemical foundation models.</li>
<li><strong>Numerical Issues</strong>: Addressed NaN values by pretraining on a small dataset with low LR before scaling up.</li>
</ul>
</li>
</ul>
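<p>A warmup schedule of the kind described above can be sketched as follows; the exact shape and hyperparameters are assumptions, since the paper only specifies a linear scheduler with warmup:</p>

```python
def linear_warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    """Linear ramp from 0 to peak_lr over warmup_steps, then linear decay
    back to 0. The decay shape is an assumption; the paper specifies only
    "linear learning rate scheduler with warmup"."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

<p>The loss spikes the authors observed occur on the warmup ramp, which is why their mitigation is procedural (frequent checkpoints plus restart from the last stable state) rather than a change to the schedule itself.</p>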
<h3 id="models">Models</h3>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/chemberta/">ChemBERTa</a></strong>: RoBERTa-based architecture trained with Masked Language Modeling (MLM) and Multitask Regression (MTR). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM"><code>DeepChem/ChemBERTa-100M-MLM</code></a>) are hosted on Hugging Face so researchers can pull them directly via the <code>transformers</code> library. The core pretraining objective minimized the standard MLM loss:
$$ \mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log \hat{y}_{i} $$
where $\mathcal{M}$ represents the set of masked SMILES token indices, and $\hat{y}_{i}$ is the model&rsquo;s predicted probability for the correct token given the corrupted sequence context.</li>
<li><strong>MoLFormer (c3-MoLFormer)</strong>: Re-implementation of the MoLFormer architecture (Rotary embeddings, linear attention). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B"><code>DeepChem/MoLFormer-c3-1.1B</code></a>) are similarly available on Hugging Face.
<ul>
<li>Tokenizer: <code>ibm/MoLFormer-XL-both-10pct</code> tokenizer.</li>
</ul>
</li>
<li><strong>Graph Models</strong>:
<ul>
<li><strong>GROVER</strong>: Graph Transformer with node/edge/graph level self-supervision.</li>
<li><strong>InfoGraph</strong>: Maximizes mutual information between graph-level and substructure representations.</li>
<li><strong>InfoMax3D</strong>: Incorporates 3D conformer data (via RDKit ETKDGv2) into contrastive pretraining.</li>
<li><strong>DMPNN</strong>: Directed Message Passing Neural Network (Chemprop variant).</li>
</ul>
</li>
</ul>
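<p>The MLM objective above reduces to averaging negative log-probabilities over the masked positions. A toy computation, assuming the model&rsquo;s predicted probabilities for the correct tokens are already in hand:</p>

```python
import math

def mlm_loss(pred_probs, masked_idx):
    """L_MLM = -(1/|M|) * sum_{i in M} log p_hat_i, where pred_probs[i] is
    the model's probability for the *correct* token at masked position i."""
    return -sum(math.log(pred_probs[i]) for i in masked_idx) / len(masked_idx)

# Two masked SMILES positions: the model is confident at one, unsure at the
# other (hypothetical probabilities).
probs = {3: 0.9, 7: 0.5}
loss = mlm_loss(probs, [3, 7])  # (-ln 0.9 - ln 0.5) / 2 ≈ 0.399
```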
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: ROC-AUC for classification; RMSE for regression (MAE for QM9).</li>
<li><strong>Baselines</strong>: Random Forest, GCN, DMPNN trained on fine-tuning splits only.</li>
<li><strong>Protocol</strong>: Three independent runs per configuration to report mean and range (not a confidence interval), with the exception of the compute-heavy QM9 dataset, which only received a single run. Benchmarking execution scripts (e.g., GCN, RF, DMPNN, ChemBERTa) are stored in the repo under <code>chemberta3_benchmarking/models_benchmarking/</code> and contain the specific fine-tuning hyperparameters and optimizer configurations used for each downstream task.</li>
<li><strong>Key Results</strong>:
<ul>
<li><em>c3-MoLFormer-1.1B</em> achieved ~0.848 ROC-AUC on BACE and ~0.900 on BBBP (using MoLFormer splits). This closely matches the original IBM MoLFormer metrics, validating the reproducibility of the open-source framework.</li>
<li>When constrained to the equivalent 250K subset, Graph models (InfoGraph, GROVER) performed comparably to Transformers, indicating that Transformer superiority in chemistry is largely driven by data scalability rather than an inherent architectural advantage at small scales.</li>
</ul>
</li>
</ul>
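<p>The three-run protocol reports a mean and a range rather than a confidence interval; the aggregation is simply:</p>

```python
def mean_and_range(scores):
    """Mean and range (max - min) across independent runs. A range over
    three seeds is a spread indicator, not a confidence interval."""
    return sum(scores) / len(scores), max(scores) - min(scores)

# Hypothetical three-seed ROC-AUC results for one task.
roc_auc_runs = [0.846, 0.851, 0.847]
mean, rng = mean_and_range(roc_auc_runs)  # 0.848 ± a range of 0.005
```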
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Cloud (AWS)</strong>:
<ul>
<li><strong>Compute</strong>: 40 NVIDIA T4 GPUs (<code>g4dn.12xlarge</code> spot instances for pretraining, <code>g4dn.2xlarge</code> for benchmarking).</li>
<li><strong>Cost</strong>: ~$4000 for MoLFormer 1.1B pretraining.</li>
<li><strong>Time</strong>: ~10 days (260 hours) for 1.1B model pretraining.</li>
<li><strong>Setup</strong>: Setup scripts for single-node and multi-node spot EC2 clusters are provided in the GitHub repository&rsquo;s <code>infra/</code> and <code>spot/</code> folders.</li>
</ul>
</li>
<li><strong>On-Premise HPC</strong>:
<ul>
<li><strong>Compute</strong>: 16 nodes (AMD EPYC), each with 4 AMD MI300A APUs.</li>
<li><strong>Environment</strong>: Ray multi-node multi-GPU framework.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, fine-tuning, and benchmarking framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B">DeepChem/MoLFormer-c3-1.1B</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer re-implementation pretrained on 1.1B molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM">DeepChem/ChemBERTa-100M-MLM</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>ChemBERTa pretrained on 100M ZINC molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-100M">DeepChem/MoLFormer-c3-100M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 100M molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-550M">DeepChem/MoLFormer-c3-550M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 550M molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Singh, R. et al. (2026). ChemBERTa-3: an open source training framework for chemical foundation models. <em>Digital Discovery</em>, 5, 662-685. <a href="https://doi.org/10.1039/D5DD00348B">https://doi.org/10.1039/D5DD00348B</a></p>
<p><strong>Publication</strong>: Digital Discovery 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></li>
<li><a href="https://deepchem.io/">DeepChem Project</a></li>
<li><a href="https://huggingface.co/DeepChem">DeepChem Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{singhChemBERTa3OpenSource2026,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Singh, Riya and Barsainyan, Aryan Amit and Irfan, Rida and Amorin, Connor Joseph and He, Stewart and Davis, Tony and Thiagarajan, Arun and Sankaran, Shiva and Chithrananda, Seyone and Ahmad, Walid and Jones, Derek and McLoughlin, Kevin and Kim, Hyojin and Bhutani, Anoushka and Sathyanarayana, Shreyas Vinaya and Viswanathan, Venkat and Allen, Jonathan E. and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-3}}: an open source training framework for chemical foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{662-685}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{The Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D5DD00348B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1039/D5DD00348B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GP-MoLFormer: Molecular Generation via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/gp-molformer/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/gp-molformer/</guid><description>A 46.8M parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient property optimization.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-focus">Contribution and Taxonomic Focus</h2>
<p>This is primarily a <strong>Methodological</strong> paper, as it proposes a specific neural architecture (GP-MoLFormer) and a novel fine-tuning algorithm (Pair-tuning) for molecular generation. It validates these contributions against standard baselines (e.g., JT-VAE, MolGen-7b).</p>
<p>It also contains a secondary <strong>Theoretical</strong> contribution by establishing an empirical scaling law that relates inference compute (generation size) to the novelty of the generated molecules.</p>
<h2 id="motivation-data-scale-and-prompt-based-optimization">Motivation: Data Scale and Prompt-Based Optimization</h2>
<p>While large language models (LLMs) have transformed text generation, the impact of training data scale and memorization on <em>molecular</em> generative models remains under-explored. Specifically, there is a need to understand how training on billion-scale datasets affects the novelty of generated molecules and whether biases in public databases (like ZINC and PubChem) perpetuate memorization. Furthermore, existing optimization methods often require computationally expensive property predictors or reinforcement learning loops; there is a practical need for more efficient &ldquo;prompt-based&rdquo; optimization techniques.</p>
<h2 id="core-innovations-architecture-and-pair-tuning">Core Innovations: Architecture and Pair-Tuning</h2>
<ol>
<li><strong>Architecture</strong>: The application of a linear-attention transformer decoder with Rotary Positional Embeddings (RoPE) to generative chemistry, allowing for efficient training on 1.1 billion SMILES.</li>
<li><strong>Pair-Tuning</strong>: A novel, parameter-efficient fine-tuning method that uses property-ordered molecular pairs to learn &ldquo;soft prompts&rdquo; for optimization without updating the base model weights.</li>
<li><strong>Scaling Analysis</strong>: An extensive empirical investigation mapping the trade-off between inference compute (up to 10B generations) and chemical novelty, fitting an exponential decay curve that demonstrates how novelty saturates as generation volume grows.</li>
</ol>
<h2 id="experimental-methodology-and-downstream-tasks">Experimental Methodology and Downstream Tasks</h2>
<p>The authors evaluated GP-MoLFormer on three distinct tasks, though the comparisons highlight the difficulty of evaluating foundation models against classical baselines:</p>
<ol>
<li><strong>De Novo Generation</strong>: Comparing validity, uniqueness, and novelty against baselines (CharRNN, VAE, LIMO, MolGen-7b) on a held-out test set. Notably, this is an unequal comparison; most baselines were trained on the 1.6M molecule MOSES dataset, whereas GP-MoLFormer uses up to 1.1B molecules, meaning performance gains are heavily driven by data scale.</li>
<li><strong>Scaffold-Constrained Decoration</strong>: Generating molecules from DRD2 active binder scaffolds and measuring the hit rate of active compounds against specialized scaffold decorators.</li>
<li><strong>Property-Guided Optimization</strong>: Using Pair-tuning to optimize for Drug-likeness (QED), Penalized logP, and DRD2 binding activity, comparing the results to graph-based and reinforcement learning benchmarks.</li>
</ol>
<p>Additionally, they performed a <strong>Scaling Study</strong>:</p>
<ul>
<li>Comparing models trained on raw (1.1B) vs. de-duplicated (650M) data.</li>
<li>Generating up to 10 billion molecules to fit empirical scaling laws for novelty.</li>
</ul>
<h2 id="key-findings-and-scaling-laws">Key Findings and Scaling Laws</h2>
<ul>
<li><strong>Scale Driven Performance</strong>: GP-MoLFormer achieves high internal diversity and validity on generation metrics. However, its baseline novelty percentage (~32%) is considerably lower than classical models. The authors attribute this to the massive training scale forcing the model to heavily prioritize matching real-world molecule frequencies over pure exploration. GP-MoLFormer&rsquo;s advantage in generation metrics over LLM-baselines like MolGen-7b likely stems heavily from its 10x larger training dataset rather than fundamental architectural superiority.</li>
<li><strong>Pair-Tuning Efficacy</strong>: The proposed pair-tuning method effectively optimizes properties (e.g., improving DRD2 activity scores) without requiring full model fine-tuning or external reward loops. While successful, the text-based generation yields ~94.5% validity during optimization, which lags behind graph and SELFIES-based baselines that guarantee 100% structural validity.</li>
<li><strong>Memorization vs. Novelty</strong>: Training on de-duplicated data (GP-MoLFormer-UNIQ) yields higher novelty (approx. 5-8% higher) than training on raw data, confirming that duplication bias in public databases leads directly to memorization.</li>
<li><strong>Inference Scaling Law</strong>: Novelty decays exponentially with generation size ($y = ae^{-bx}$), yet the model maintains generative capability (~16.7% novelty) even after generating an unprecedented 10 billion molecules.</li>
</ul>
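<p>The novelty decay law $y = ae^{-bx}$ can be fit with a log-linear least-squares regression. A sketch on synthetic points (not the paper&rsquo;s data):</p>

```python
import math

def fit_exponential_decay(xs, ys):
    """Fit y = a * exp(-b * x) by least squares on log y (requires y > 0):
    ln y = ln a - b * x, an ordinary linear regression."""
    n = len(xs)
    logy = [math.log(y) for y in ys]
    mx, my = sum(xs) / n, sum(logy) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (ly - my) for x, ly in zip(xs, logy))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic novelty-vs-generation-size points (fractions, not paper data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.32 * math.exp(-0.1 * x) for x in xs]
a, b = fit_exponential_decay(xs, ys)  # recovers a ≈ 0.32, b ≈ 0.1
```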
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Sources</strong>: A combination of <strong>PubChem</strong> (111M SMILES) and <strong>ZINC</strong> (1B SMILES) databases. Downloading and pre-training instructions are located in the repository&rsquo;s <code>data/README.md</code>.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>All SMILES were canonicalized using RDKit (no isomeric information).</li>
<li><strong>GP-MoLFormer (Base)</strong>: Trained on the full 1.1B dataset (includes duplicates).</li>
<li><strong>GP-MoLFormer-UNIQ</strong>: Trained on a de-duplicated subset of 650M SMILES.</li>
</ul>
</li>
<li><strong>Tokenization</strong>: Uses the tokenizer from Schwaller et al. (2019) with a vocabulary size of <strong>2,362 tokens</strong>.</li>
<li><strong>Filtering</strong>: Sequences restricted to a maximum length of <strong>202 tokens</strong>.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pair-Tuning (Algorithm 1)</strong>:</p>
<ul>
<li><strong>Objective</strong>: Learn task-specific soft prompts $\phi_T$ to maximize the conditional probability of target molecule $b$ given a seed molecule $a$, where pair $(a, b)$ satisfies the property condition $b &gt; a$. The base model parameters $\theta$ remain frozen.</li>
<li><strong>Prompt Structure</strong>: Autoregressive training optimizes the continuous embeddings of $n$ enhancement tokens against the cross-entropy loss of the target sequence:
$$ \mathcal{L}(\phi_T) = - \sum_{i=1}^{|b|} \log P_{\theta}(b_i | \phi_T, a, b_{&lt;i}) $$</li>
<li><strong>Hyperparameters</strong>: Trained for 1,000 epochs with a batch size of 35 and a fixed learning rate of $3 \times 10^{-2}$.</li>
<li><strong>Inference</strong>: The learned prompt $\phi_T$ and seed molecule $a$ are prepended as context, and candidates are sampled autoregressively until a termination token is produced.</li>
</ul>
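<p>The pair-tuning objective is an ordinary autoregressive cross-entropy over the target molecule&rsquo;s tokens, with gradients flowing only into the soft prompt $\phi_T$. Evaluating the loss for one pair, given per-token probabilities from the frozen model, looks like:</p>

```python
import math

def pair_tuning_loss(token_probs):
    """L(phi_T) = -sum_i log P(b_i | phi_T, a, b_<i). token_probs[i] is the
    frozen model's probability of the correct target token b_i given the
    soft prompt phi_T, the seed molecule a, and the preceding tokens b_<i."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a three-token target SMILES;
# in training, this loss would be backpropagated into phi_T only.
loss = pair_tuning_loss([0.8, 0.9, 0.95])  # ≈ 0.380
```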
<h3 id="models">Models</h3>
<ul>
<li><strong>Availability</strong>: The model trained on deduplicated data (GP-MoLFormer-UNIQ) is publicly available on <a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">Hugging Face</a>. The full 1.1B base model is not explicitly hosted. The source code repository includes a disclosure that IBM will not maintain the code going forward.</li>
<li><strong>Architecture</strong>: Transformer decoder (~47M parameters: 12 layers, 12 heads, hidden size 768).</li>
<li><strong>Attention Mechanism</strong>: Combines Linear Attention (Generalized Random Feature map, $\phi$) with Rotary Positional Embeddings (RoPE). To avoid the quadratic complexity of standard attention while maintaining relative positional awareness, RoPE is applied to queries ($Q$) and keys ($K$) prior to the random feature mapping:
$$ \text{Attention}(Q, K, V) = \frac{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle v_n}{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle} $$</li>
<li><strong>Inference Speed</strong>: ~3ms per forward pass on a single A100 GPU.</li>
</ul>
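<p>A NumPy sketch of the attention mechanism above: rotate queries and keys with RoPE, apply a positive feature map, then compute attention in linear time. Causal masking is omitted for brevity, and $\phi = \text{elu} + 1$ stands in for the paper&rsquo;s generalized random-feature map:</p>

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embeddings: rotate consecutive feature pairs of
    each position by a position-dependent angle (feature dim must be even)."""
    n, d = x.shape
    freqs = base ** (-2.0 * np.arange(d // 2) / d)   # (d/2,) rotation freqs
    ang = np.arange(n)[:, None] * freqs[None, :]     # (n, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def linear_attention_rope(Q, K, V):
    """Linear attention with a positive feature map applied after rotating
    Q and K, per the equation above. phi = elu + 1 is a stand-in for the
    paper's random-feature map; causal masking omitted for brevity."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qp, Kp = phi(rope(Q)), phi(rope(K))
    num = Qp @ (Kp.T @ V)                # (n, d_v): O(n) in sequence length
    den = Qp @ Kp.sum(axis=0)[:, None]   # (n, 1): normalizer per position
    return num / den
```

<p>Because the weights are non-negative and normalized, each output row is a convex combination of the rows of <code>V</code>, matching the softmax-free normalization in the displayed equation.</p>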
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Generation Quality Metrics</strong>: Validity, Uniqueness, Novelty (MOSES suite), Fréchet ChemNet Distance (FCD), Scaffold similarity (Scaf), and Similarity to Nearest Neighbor (SNN).</li>
<li><strong>MoLFormer-Based Metrics</strong>: The authors introduce Fréchet MoLFormer Distance (FMD) and MoLFormer-space IntDiv2 to measure distributional similarity using their own pre-trained continuous embeddings instead of standard fingerprints.</li>
<li><strong>Optimization Metrics</strong>: Penalized logP (calculated as $\text{logP} - \text{SA} - \max(\text{largest ring size} - 6, 0)$), Drug-likeness (QED), and DRD2 activity scores.</li>
<li><strong>Scaling Metrics</strong>: Empirical fit for novelty decay: $y = ae^{-bx}$.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 16 x NVIDIA A100 (80 GB) GPUs across 2 nodes connected via EDR Infiniband.</li>
<li><strong>Training Time</strong>:
<ul>
<li>GP-MoLFormer (1.1B data): ~115 hours total (28.75 hours/epoch for 4 epochs).</li>
<li>GP-MoLFormer-UNIQ (650M data): ~80 hours total.</li>
</ul>
</li>
<li><strong>Hyperparameters</strong>: Used a batch size of 1,600 molecules per GPU with a fixed learning rate of $1.6 \times 10^{-4}$ (scaled by up to a factor of 8 as the number of GPUs increased).</li>
<li><strong>Optimization</strong>: Used distributed data-parallel training and adaptive bucketing by sequence length to handle scale.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/gp-molformer/">GP-MoLFormer (GitHub)</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation; IBM will not maintain going forward</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">GP-MoLFormer-Uniq (Hugging Face)</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained on 650M de-duplicated SMILES</td>
      </tr>
  </tbody>
</table>
<p>The full 1.1B base model weights are not publicly hosted. The training data (PubChem and ZINC) is publicly available, and instructions for downloading and pre-processing are in the repository&rsquo;s <code>data/README.md</code>.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Hoffman, S. C., Chenthamarakshan, V., Navratil, J., Mroueh, Y., &amp; Das, P. (2025). GP-MoLFormer: A Foundation Model For Molecular Generation. <em>Digital Discovery</em>. <a href="https://doi.org/10.1039/D5DD00122F">https://doi.org/10.1039/D5DD00122F</a></p>
<p><strong>Publication</strong>: Digital Discovery (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2025gpmolformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GP-MoLFormer: a foundation model for molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Hoffman, Samuel C and Chenthamarakshan, Vijil and Navratil, Jiri and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D5DD00122F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-2: Scaling Molecular Transformers to 77M</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemberta-2/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemberta-2/</guid><description>Optimizing transformer pretraining for molecules using MLM vs MTR objectives, scaling to 77M compounds from PubChem for improved property prediction.</description><content:encoded><![CDATA[<h2 id="classifying-chemberta-2s-methodological-contributions">Classifying ChemBERTa-2&rsquo;s Methodological Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper with a secondary <strong>Resource</strong> contribution.</p>
<p>It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine &ldquo;how well&rdquo; these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models and establish a benchmark on a massive 77M compound dataset.</p>
<p><strong>Key methodological indicators</strong>:</p>
<ul>
<li><strong>Baseline comparison</strong>: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (ChemBERTa-1) with prominent benchmark tables</li>
<li><strong>Ablation studies</strong>: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size</li>
<li><strong>Scaling analysis</strong>: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance</li>
</ul>
<h2 id="motivations-for-scaling-molecular-transformers">Motivations for Scaling Molecular Transformers</h2>
<p>The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a &ldquo;chemical foundation model&rdquo;.</p>
<p><strong>Key motivations</strong>:</p>
<ul>
<li><strong>Label scarcity</strong>: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant</li>
<li><strong>Scaling hypothesis</strong>: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP</li>
<li><strong>Efficiency</strong>: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and weakly supervised (MTR, using RDKit computed properties as labels) approaches</li>
</ul>
<h2 id="novelty-in-multi-task-regression-objectives">Novelty in Multi-Task Regression Objectives</h2>
<p><strong>Scale</strong>: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for SMILES-BERT).</p>
<p><strong>Pipeline optimization</strong>: A direct, controlled comparison of <strong>Masked Language Modeling (MLM)</strong> vs. <strong>Multi-Task Regression (MTR)</strong> pretraining objectives on identical datasets.</p>
<p><strong>Proxy selection</strong>: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.</p>
<h2 id="experimental-pretraining-setup-on-77m-compounds">Experimental Pretraining Setup on 77M Compounds</h2>
<h3 id="pretraining-setup">Pretraining Setup</h3>
<p><strong>Datasets</strong>: Subsets of PubChem containing 5M, 10M, and 77M unique SMILES.</p>
<p><strong>Tasks</strong>:</p>
<ul>
<li><strong>MLM</strong>: Masking 15% of tokens (following RoBERTa procedure). The model is optimized by minimizing the cross-entropy loss over the predicted masked tokens:
$$ \mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}) $$
where $\mathcal{M}$ represents the set of masked token indices.</li>
<li><strong>MTR</strong>: Predicting 200 calculated molecular properties (via RDKit) simultaneously using a mean squared error objective:
$$ \mathcal{L}_{MTR} = \frac{1}{200} \sum_{j=1}^{200} \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{ij} - y_{ij} \right)^2 $$
Continuous target labels $y_{ij}$ are mean-normalized prior to training to equilibrate the disparate scales of different chemical properties.</li>
</ul>
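<p>As a concrete sketch, the MTR objective and the label normalization can be written out in plain Python (a toy illustration; whether the paper also scales labels by their standard deviation, beyond subtracting the mean, is an assumption here):</p>

```python
import math

def mean_normalize(columns):
    # Normalize each property column so disparate scales are comparable,
    # mirroring the paper's mean-normalization of the 200 RDKit targets.
    # Dividing by the std (z-scoring) is an assumption on top of that.
    normed = []
    for col in columns:
        mu = sum(col) / len(col)
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col)) or 1.0
        normed.append([(v - mu) / sd for v in col])
    return normed

def mtr_loss(preds, targets):
    # Mean squared error averaged over all molecules and all property
    # tasks: term-by-term the L_MTR formula above, with 200 -> n_tasks.
    n_tasks, n_mols = len(targets), len(targets[0])
    total = sum(
        (preds[j][i] - targets[j][i]) ** 2
        for j in range(n_tasks)
        for i in range(n_mols)
    )
    return total / (n_tasks * n_mols)
```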
<p><strong>Hyperparameter search</strong>: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.</p>
<h3 id="downstream-validation">Downstream Validation</h3>
<p><strong>Finetuning</strong>: Evaluated on 8 tasks from <strong>MoleculeNet</strong> (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).</p>
<p><strong>Analysis</strong>: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.</p>
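<p>The scaffold splitter assigns whole scaffold groups to a single partition, so evaluation molecules come from scaffolds never seen in training. A minimal pure-Python sketch of the greedy assignment (scaffolds are taken here as precomputed input; in practice they would come from RDKit's Bemis-Murcko scaffold routine, as DeepChem's splitter does):</p>

```python
from collections import defaultdict

def scaffold_split(smiles, scaffolds, frac_train=0.8, frac_valid=0.1):
    # Group molecules by scaffold; `scaffolds` maps SMILES -> scaffold key.
    groups = defaultdict(list)
    for smi in smiles:
        groups[scaffolds[smi]].append(smi)
    # Assign largest scaffold groups first so rare scaffolds fall into
    # valid/test, making the held-out sets structurally novel.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```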
<h2 id="key-performance-outcomes-and-scaling-realities">Key Performance Outcomes and Scaling Realities</h2>
<p><strong>Highly competitive performance</strong>: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks, though the margins are narrow, showing that well-tuned task-specific baselines remain robust.</p>
<p><strong>MTR superiority</strong>: Models pretrained on Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained on MLM on every finetuning task evaluated. MTR is substantially slower than MLM due to the larger input size from the 200-element label vector, but MLM loss serves as a reliable proxy for MTR loss, enabling cheaper architecture search before committing to full MTR pretraining.</p>
<p><strong>Scaling laws versus downstream utility</strong>: Pretraining loss improved by 25-35% when increasing the dataset from 5M to 77M compounds. However, this improvement in pretraining loss does not uniformly transfer to downstream tasks. For MTR models, SR-p53 ROC-AUC decreases monotonically from 0.834 (5M) to 0.827 (10M) to 0.817 (77M), and Lipophilicity RMSE is worse at 77M (0.798) than at 5M (0.758), despite a dip at 10M (0.744). This variability in transfer challenges the assumption that pretraining improvements always yield downstream gains.</p>
<p><strong>Transfer learning</strong>: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The pretraining corpus is derived from <strong>PubChem</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>PubChem</td>
          <td>77M SMILES</td>
          <td>Canonicalized and globally shuffled. Subsets of 5M and 10M used. <strong>Note: Exact splits and datasets are not published.</strong></td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>PubChem</td>
          <td>100k SMILES</td>
          <td>A fixed set held out from the 77M corpus. <strong>Note: Exact 100k subset is not published.</strong></td>
      </tr>
      <tr>
          <td><strong>MTR Labels</strong></td>
          <td>RDKit</td>
          <td>200 props</td>
          <td>200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. <strong>Note: Calculated labels are not published and must be re-computed.</strong></td>
      </tr>
      <tr>
          <td><strong>Finetuning</strong></td>
          <td>MoleculeNet</td>
          <td>1.5k - 8k</td>
          <td>Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pretraining Objectives:</strong></p>
<ol>
<li><strong>Masked Language Modeling (MLM)</strong>: Follows RoBERTa procedure. Masks 15% of tokens. Max sequence length 512.</li>
<li><strong>Multi-Task Regression (MTR)</strong>: Predicting 200 RDKit properties. Labels are mean-normalized.</li>
</ol>
<p><strong>Tokenizer:</strong></p>
<ul>
<li>Dictionary of common SMILES characters</li>
<li>Maximum vocabulary size: <strong>591 tokens</strong></li>
</ul>
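<p>A minimal character-dictionary tokenizer in this spirit, capped at the 591-token maximum (the special-token names mirror RoBERTa conventions and are an assumption; the paper does not list its vocabulary):</p>

```python
from collections import Counter

def build_char_vocab(corpus, max_size=591,
                     specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    # Count character frequencies across the SMILES corpus and keep the
    # most common characters, after reserving ids for special tokens.
    counts = Counter(ch for smi in corpus for ch in smi)
    vocab = {tok: i for i, tok in enumerate(specials)}
    for ch, _ in counts.most_common():
        if len(vocab) >= max_size:
            break
        if ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab

def encode(smiles, vocab):
    # Map each character to its id, falling back to [UNK].
    unk = vocab["[UNK]"]
    return [vocab.get(ch, unk) for ch in smiles]
```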
<p><strong>Optimization:</strong></p>
<ul>
<li><strong>Patience</strong>: Early stopping set to one pass through the dataset to ensure full coverage</li>
<li><strong>Hyperparameter search</strong>: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate. <strong>Note: The precise configuration of the winning models that were scaled to 77M is absent from the paper.</strong></li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Based on <strong>RoBERTa</strong> (HuggingFace implementation)</li>
<li><strong>Parameter scale</strong>: Models ranged between <strong>5M and 46M parameters</strong></li>
<li><strong>Selection</strong>: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset</li>
<li><strong>Checkpoints</strong>: Pre-trained weights are hosted by DeepChem on <a href="https://huggingface.co/DeepChem">Hugging Face</a>. Direct links include <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MTR">DeepChem/ChemBERTa-77M-MTR</a> and <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MLM">DeepChem/ChemBERTa-77M-MLM</a> (Note: Model cards are currently empty).</li>
<li><strong>Code Reference</strong>: While the <a href="https://github.com/deepchem/deepchem">DeepChem</a> repository is referenced for code, isolated training scripts tailored to recreate ChemBERTa-2&rsquo;s exact pipeline are not separated from the generalized deepchem library tooling.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Benchmarks were performed on <strong>MoleculeNet</strong> using DeepChem.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Tasks</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RMSE</strong> ($\downarrow$)</td>
          <td>Delaney, Lipo, BACE (Reg), Clearance</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.8).</td>
      </tr>
      <tr>
          <td><strong>ROC-AUC</strong> ($\uparrow$)</td>
          <td>BBBP, ClinTox, HIV, Tox21, BACE (Cls)</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: AWS EC2 instances with <strong>Nvidia T4 GPUs</strong></li>
<li><strong>Strategy</strong>: AWS Spot instances were used to reduce cost; implemented frequent checkpointing to handle interruptions.</li>
<li><strong>Note</strong>: For MTR, they wrote a custom data loader wrapper around HuggingFace&rsquo;s text loader to parse CSVs efficiently, as the default CSV loader was a major bottleneck for the 200-element target vectors.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. <em>arXiv preprint arXiv:2209.01712</em>. <a href="https://doi.org/10.48550/arXiv.2209.01712">https://doi.org/10.48550/arXiv.2209.01712</a></p>
<p><strong>Publication</strong>: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2010.09885">ChemBERTa-1 Paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{ahmadChemBERTa2ChemicalFoundation2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemformer: A Pre-trained Transformer for Comp Chem</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemformer/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemformer/</guid><description>BART-based Transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical sequence tasks.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It proposes an architecture adaptation (Chemformer based on BART) and a specific pre-training strategy (&ldquo;Combined&rdquo; masking and augmentation). The paper validates this method by benchmarking against established models on multiple tasks, including direct synthesis, retrosynthesis, and molecular optimization. It also includes a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution by making the pre-trained models and code available.</p>
<h2 id="motivation-computational-bottlenecks-in-cheminformatics">Motivation: Computational Bottlenecks in Cheminformatics</h2>
<p>Existing Transformer models for cheminformatics are often developed for single applications and are computationally expensive to train from scratch. For example, training a Molecular Transformer for reaction prediction can take days, limiting hyperparameter exploration. Self-supervised pre-training (like BERT or T5) has significantly advanced NLP by reducing fine-tuning time and improving performance. In chemistry, applications have traditionally focused on task-specific datasets or encoder-only architectures, which perform poorly on sequence generation tasks. The authors aim to use transfer learning on a large unlabelled dataset to create a model that converges quickly and performs well across diverse sequence-to-sequence and discriminative tasks.</p>
<h2 id="core-innovation-bart-architecture-and-combined-pre-training">Core Innovation: BART Architecture and Combined Pre-training</h2>
<p>The primary insight lies in the adaptation of the <strong>BART architecture</strong> for chemistry and the introduction of a <strong>&ldquo;Combined&rdquo; self-supervised pre-training task</strong>.</p>
<ul>
<li><strong>Architecture</strong>: Chemformer uses the BART encoder-decoder structure, allowing it to handle both discriminative (property prediction) and generative (reaction prediction) tasks efficiently. This provides an alternative to encoder-only (BERT) or decoder-only (GPT) models.</li>
<li><strong>Combined Pre-training</strong>: The authors introduce a task that applies both <strong>Span Masking</strong> (randomly replacing tokens with <code>&lt;mask&gt;</code>) and <strong>SMILES Augmentation</strong> (permuting atom order) simultaneously. Formally, given a canonical SMILES sequence $x$, a corrupted sequence $\tilde{x} = \text{Mask}(\text{Augment}(x))$ is generated. The model is trained using an autoregressive cross-entropy loss to reconstruct the canonical sequence from the corrupted input:
$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{&lt;t}, \tilde{x}) $$</li>
<li><strong>Tunable Augmentation</strong>: A downstream augmentation strategy is proposed where the probability of augmenting the input/output SMILES ($p_{aug}$) is a tunable hyperparameter, performed on-the-fly.</li>
</ul>
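<p>A toy sketch of the corruption step $\tilde{x} = \text{Mask}(\text{Augment}(x))$: span masking over a token list, with the augmentation step left as a comment (in the real pipeline it is SMILES enumeration, e.g. RDKit&rsquo;s randomized SMILES output; the span-length distribution below is a simplification, not the authors&rsquo; exact choice):</p>

```python
import random

MASK = "<mask>"

def span_mask(tokens, mask_prob=0.1, mean_span=3, rng=None):
    # BART-style span masking sketch: with probability mask_prob, replace
    # a short span of tokens with a single <mask> token. The span-length
    # distribution (exponential here) and masking rate are assumptions.
    rng = rng or random.Random()
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            span = max(1, int(rng.expovariate(1.0 / mean_span)))
            out.append(MASK)
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

# The "Combined" corruption is then:
#   tilde_x = span_mask(tokenize(augment(x)))
# where augment(x) is SMILES enumeration, e.g. RDKit's
# Chem.MolToSmiles(mol, doRandom=True) in the real pipeline.
```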
<h2 id="experimental-setup-and-pre-training-tasks">Experimental Setup and Pre-training Tasks</h2>
<p>The authors pre-trained Chemformer on <strong>100 million molecules</strong> from ZINC-15 and fine-tuned it on three distinct task types:</p>
<ol>
<li><strong>Seq2Seq Reaction Prediction</strong>:
<ul>
<li><em>Direct Synthesis</em>: USPTO-MIT dataset (Mixed and Separated).</li>
<li><em>Retrosynthesis</em>: USPTO-50K dataset.</li>
</ul>
</li>
<li><strong>Molecular Optimization</strong>: Generating molecules with improved properties (LogD, solubility, clearance) starting from ChEMBL matched molecular pairs.</li>
<li><strong>Discriminative Tasks</strong>:
<ul>
<li><em>QSAR</em>: Predicting properties (ESOL, FreeSolv, Lipophilicity) from MoleculeNet.</li>
<li><em>Bioactivity</em>: Predicting pXC50 values for 133 genes using ExCAPE data.</li>
</ul>
</li>
</ol>
<p>Ablation studies compared three pre-training strategies (Masking, Augmentation, Combined) against a randomly initialized baseline.</p>
<h2 id="results-trade-offs-and-conclusions">Results, Trade-offs, and Conclusions</h2>
<ul>
<li><strong>Performance</strong>: Chemformer achieved <strong>competitive top-1 accuracy</strong> on USPTO-MIT (91.3% Mixed) and USPTO-50K (53.6-54.3%), outperforming the Augmented Transformer and graph-based models (GLN, GraphRetro).</li>
<li><strong>Convergence Speed</strong>: Pre-training significantly accelerated convergence; fine-tuning for just 20 epochs (~30 minutes) outperformed baselines that had been trained for far longer.</li>
<li><strong>Pre-training Tasks</strong>: The &ldquo;Combined&rdquo; task generally performed best for reaction prediction and bioactivity, while &ldquo;Masking&rdquo; was superior for molecular optimization.</li>
<li><strong>Augmentation Trade-off</strong>: The augmentation strategy improved top-1 accuracy but significantly degraded top-5/10 accuracy because beam search outputs became populated with augmented versions of the same molecule. This presents a considerable limitation for practical applications like retrosynthesis mapping, where retrieving a diverse set of candidate reactions is often critical.</li>
<li><strong>Discriminative Evaluation Caveats</strong>: Chemformer underperformed specialized baselines (like D-MPNN or MolBERT) on small discriminative datasets. The authors note that direct comparison is difficult: Chemformer was trained simultaneously on multiple subtasks (multi-task learning), while the literature baselines were trained and tuned on each subtask separately. Additionally, the Chemformer encoder uses fewer than 20M parameters compared to MolBERT&rsquo;s approximately 85M, and Chemformer&rsquo;s pre-training does not include molecular property objectives.</li>
<li><strong>Pre-training Data Scope</strong>: The 100M pre-training dataset from ZINC-15 was selected with constraints on molecular weight ($\le 500$ Da) and LogP ($\le 5$), focusing the learned representations on small, drug-like molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><em>Note: The primary GitHub repository for Chemformer was officially archived on February 11, 2026. Pre-trained weights and datasets used in the paper are still hosted externally on <a href="https://az.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Box</a>. Active development of Chemformer models has moved to the <a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels</a> repository.</em></p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/Chemformer">Chemformer (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Archived; original PyTorch implementation</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Active successor repository</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://az.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Pre-trained weights (Box)</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Base and Large model checkpoints</td>
      </tr>
  </tbody>
</table>
<p>The following datasets were used for pre-training and benchmarking.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Pre-training</strong></td>
          <td style="text-align: left">ZINC-15</td>
          <td style="text-align: left">100M</td>
          <td style="text-align: left">Selected subset (reactive, annotated purchasability, MW $\le 500$, LogP $\le 5$). Split: 99% Train / 0.5% Val / 0.5% Test.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Direct Synthesis</strong></td>
          <td style="text-align: left">USPTO-MIT</td>
          <td style="text-align: left">~470k</td>
          <td style="text-align: left">Evaluated on &ldquo;Mixed&rdquo; and &ldquo;Separated&rdquo; variants.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Retrosynthesis</strong></td>
          <td style="text-align: left">USPTO-50K</td>
          <td style="text-align: left">~50k</td>
          <td style="text-align: left">Standard benchmark for retrosynthesis.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimization</strong></td>
          <td style="text-align: left">ChEMBL MMPs</td>
          <td style="text-align: left">~160k Train</td>
          <td style="text-align: left">Matched Molecular Pairs for LogD, solubility, and clearance optimization.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Properties</strong></td>
          <td style="text-align: left">MoleculeNet</td>
          <td style="text-align: left">Small</td>
          <td style="text-align: left">ESOL (1128), FreeSolv (642), Lipophilicity (4200).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Bioactivity</strong></td>
          <td style="text-align: left">ExCAPE</td>
          <td style="text-align: left">~312k</td>
          <td style="text-align: left">133 gene targets; &gt;1200 compounds per gene.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenization (523 tokens total) derived from ChEMBL 27 canonical SMILES.</li>
<li><strong>Augmentation</strong>: SMILES enumeration (permuting atom order) used for pre-training and on-the-fly during fine-tuning ($p_{aug}=0.5$ for Seq2Seq, $p_{aug}=1.0$ for discriminative).</li>
</ul>
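<p>The paper does not print its regex, but the widely used SMILES tokenization pattern from the reaction-prediction literature gives the flavor (treat this as a representative sketch, not Chemformer&rsquo;s exact 523-token scheme):</p>

```python
import re

# Common SMILES tokenization regex (from the Molecular Transformer line
# of work): bracket atoms, two-letter elements (Cl, Br), ring-bond
# digits, and bond/branch symbols each become single tokens.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|"
    r"\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    tokens = SMILES_REGEX.findall(smiles)
    # Sanity check: tokens must reconstruct the input exactly.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens
```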
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pre-training Tasks</strong>:
<ol>
<li><em>Masking</em>: Span masking (BART style).</li>
<li><em>Augmentation</em>: Input is a randomized SMILES; target is canonical SMILES.</li>
<li><em>Combined</em>: Input is augmented <em>then</em> masked; target is canonical SMILES.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$).</li>
<li>Schedule: Linear warm-up (8000 steps) for pre-training; One-cycle schedule for fine-tuning.</li>
</ul>
</li>
<li><strong>Inference</strong>: Beam search with width 10 for Seq2Seq tasks. Used <code>molbart/inference_score.py</code> and <code>molbart/retrosynthesis/round_trip_inference.py</code> for standard and round-trip validation.</li>
</ul>
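<p>For intuition, width-$k$ beam search keeps the $k$ highest-scoring partial sequences at each decoding step. A minimal sketch with the model abstracted behind a scoring callable (this interface is an assumption; in the real pipeline the BART decoder supplies the next-token log-probabilities):</p>

```python
def beam_search(step_scores, beam_width=10, max_len=5, eos="<eos>"):
    # step_scores(prefix) -> {token: log_prob} for the next position.
    # Each beam is a (token prefix, cumulative log-prob) pair.
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished beam carries over
                continue
            for tok, lp in step_scores(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Keep only the beam_width best-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams
```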
<h3 id="models">Models</h3>
<p>Two model sizes were trained. Both use the Pre-Norm Transformer layout with GELU activation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Chemformer (Base)</th>
          <th style="text-align: left">Chemformer-Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">6</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model Dimension</strong></td>
          <td style="text-align: left">512</td>
          <td style="text-align: left">1024</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Feed-forward Dim</strong></td>
          <td style="text-align: left">2048</td>
          <td style="text-align: left">4096</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention Heads</strong></td>
          <td style="text-align: left">8</td>
          <td style="text-align: left">16</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~45M</td>
          <td style="text-align: left">~230M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Pre-training Task</strong></td>
          <td style="text-align: left">All 3 variants</td>
          <td style="text-align: left">Combined only</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Comparisons relied on Top-N accuracy for reaction tasks and validity metrics for optimization.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Direct Synthesis (Sep)</td>
          <td style="text-align: left"><strong>92.8%</strong> (Large)</td>
          <td style="text-align: left">91.1% (Aug Transformer)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Retrosynthesis</td>
          <td style="text-align: left"><strong>54.3%</strong> (Large)</td>
          <td style="text-align: left">53.7% (GraphRetro) / 52.5% (GLN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Desirable %</strong></td>
          <td style="text-align: left">Mol Optimization</td>
          <td style="text-align: left"><strong>75.0%</strong> (Base-Mask)</td>
          <td style="text-align: left">70.2% (Transformer-R)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>RMSE</strong></td>
          <td style="text-align: left">Lipophilicity</td>
          <td style="text-align: left">0.598 (Combined)</td>
          <td style="text-align: left">0.555 (D-MPNN)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 NVIDIA V100 GPUs (batch size 128 per GPU).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Pre-training: 2.5 days (Base) / 6 days (Large) for 1M steps.</li>
<li>Fine-tuning: ~20-40 epochs for reaction prediction (&lt;12 hours).</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Irwin, R., Dimitriadis, S., He, J., &amp; Bjerrum, E. J. (2022). Chemformer: a pre-trained transformer for computational chemistry. <em>Machine Learning: Science and Technology</em>, 3(1), 015022. <a href="https://doi.org/10.1088/2632-2153/ac3ffb">https://doi.org/10.1088/2632-2153/ac3ffb</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{irwinChemformerPretrainedTransformer2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemformer: A Pre-Trained Transformer for Computational Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Irwin, Ross and Dimitriadis, Spyridon and He, Jiazhen and Bjerrum, Esben Jannik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{015022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2632-2153}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/ac3ffb}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Translating InChI to IUPAC Names with Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/handsel-inchi-iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/handsel-inchi-iupac-2021/</guid><description>Sequence-to-sequence Transformer translating InChI identifiers to IUPAC names with 91% accuracy on organic compounds.</description><content:encoded><![CDATA[<h2 id="primary-contribution-a-transformer-based-method">Primary Contribution: A Transformer-Based Method</h2>
<p>This is primarily a <strong>Method</strong> paper. It adapts a specific architecture (Transformer) to a specific task (InChI-to-IUPAC translation) and evaluates its performance against both machine learning and commercial baselines. It also has a secondary <strong>Resource</strong> contribution, as the trained model and scripts are released as open-source software.</p>
<h2 id="motivation-the-bottleneck-in-algorithmic-iupac-nomenclature">Motivation: The Bottleneck in Algorithmic IUPAC Nomenclature</h2>
<p>Generating correct IUPAC names is difficult due to the comprehensive but complex rules defined by the International Union of Pure and Applied Chemistry. Commercial software generates names from structures but remains closed-source with opaque methodologies and frequent inter-package disagreements. Open identifiers like InChI and SMILES lack direct human readability. This creates a need for an open, automated method to generate informative IUPAC names from standard identifiers like InChI, which are ubiquitous in online chemical databases.</p>
<h2 id="novelty-treating-chemical-translation-as-a-character-level-sequence">Novelty: Treating Chemical Translation as a Character-Level Sequence</h2>
<p>The key novelty is treating chemical nomenclature translation as a character-level sequence-to-sequence problem using a Transformer architecture, specifically using <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> as the source language.</p>
<ul>
<li>Standard Neural Machine Translation (NMT) uses sub-word tokenization. This model processes InChI and predicts IUPAC names character-by-character.</li>
<li>It demonstrates that character-level tokenization outperforms byte-pair encoding or unigram models for this specific chemical task.</li>
<li>It uses InChI&rsquo;s standardization to avoid the canonicalization issues inherent in SMILES-based approaches.</li>
<li>The attention mechanism allows the decoder to align specific parts of the generated IUPAC name with corresponding structural features in the source InChI string, operating via the standard scaled dot-product attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$</li>
</ul>
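<p>The attention formula above reduces to a softmax over scaled dot products. A minimal pure-Python sketch with toy dimensions (illustrative only, not the trained model):</p>

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over row vectors (lists of lists):
    softmax(QK^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With a query strongly aligned to one key, the output is dominated by the corresponding value row, which is what lets the decoder align name fragments with InChI substrings.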
<h2 id="methodology--experimental-validation">Methodology &amp; Experimental Validation</h2>
<ul>
<li><strong>Training:</strong> The model was trained on 10 million InChI/IUPAC pairs sampled from PubChem using a character-level objective. The model is supervised using categorical cross-entropy loss across the vocabulary of characters:
$$ \mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) $$</li>
<li><strong>Ablation Studies:</strong> The authors experimentally validated architecture choices, finding that LSTM models and sub-word tokenization (BPE) performed worse than the Transformer with character tokenization. They also optimized dropout rates.</li>
<li><strong>Performance Benchmarking:</strong> The model was evaluated on a held-out test set of 200,000 samples. Performance was quantified primarily by Whole-Name Accuracy and Normalized Edit Distance (based on the Damerau-Levenshtein distance, scaled by the maximum string length).</li>
<li><strong>Commercial Comparison:</strong> The authors compared their model against four major commercial packages (ACD/I-Labs, ChemAxon, Mestrelab, and PubChem&rsquo;s Lexichem). However, this evaluation used a highly limited test set of only 100 molecules, restricting the statistical confidence of the external baseline.</li>
<li><strong>Error Analysis:</strong> They analyzed performance across different chemical classes (organics, charged species, macrocycles, inorganics) and visualized attention coefficients to interpret model focus.</li>
</ul>
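<p>The Normalized Edit Distance metric can be sketched as a Damerau-Levenshtein distance scaled by the longer string's length; whether the paper used the restricted (optimal string alignment) or unrestricted variant is an assumption here:</p>

```python
def normalized_edit_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment)
    distance, divided by the longer string's length to give 0-1."""
    la, lb = len(a), len(b)
    if max(la, lb) == 0:
        return 0.0
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i
    for j in range(lb + 1):
        d[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[la][lb] / max(la, lb)
```

For example, two ten-character names differing in a single locant ("butan-1-ol" vs. "butan-2-ol") score 0.1.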
<h2 id="key-results-and-the-inorganic-challenge">Key Results and the Inorganic Challenge</h2>
<ul>
<li><strong>High Accuracy on Organics:</strong> The model achieved 91% whole-name accuracy on the test set, performing particularly well on organic compounds.</li>
<li><strong>Comparable to Commercial Tools:</strong> On the limited 100-molecule benchmark, the edit distance between the model&rsquo;s predictions and commercial packages (15-23%) was similar to the variation found <em>between</em> the commercial packages themselves (16-21%).</li>
<li><strong>Limitations on Inorganics:</strong> The model performed poorly on inorganic (14% accuracy) and organometallic compounds (20% accuracy). This is attributed to inherent data limitations in the standard InChI format (which deliberately disconnects metal atoms from their ligands) and low training data coverage for those classes.</li>
<li><strong>Character-Level Superiority:</strong> Character-level tokenization was found to be essential; byte-pair encoding reduced accuracy significantly.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was derived from <a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem&rsquo;s public FTP server</a> (<code>CID-SMILES.gz</code> and <code>CID-IUPAC.gz</code>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Raw</strong></td>
          <td>PubChem</td>
          <td>100M pairs</td>
          <td>Filtered for length (InChI &lt; 200 chars, IUPAC &lt; 150 chars). 132k unparseable SMILES dropped.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Subsampled</td>
          <td>10M pairs</td>
          <td>Random sample from the filtered set.</td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>Held-out</td>
          <td>10,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>Held-out</td>
          <td>200,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Tokenization</strong></td>
          <td>Vocab</td>
          <td>InChI: 66 chars<br>IUPAC: 70 chars</td>
          <td>Character-level tokenization. Spaces treated as tokens.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: OpenNMT-py 2.0.0 (using PyTorch). Training scripts and vocabularies are available as supplementary files to the original publication. Pre-trained model weights are hosted on <a href="https://doi.org/10.5281/zenodo.5081159">Zenodo</a>.</li>
<li><strong>Architecture Type</strong>: Transformer Encoder-Decoder.</li>
<li><strong>Optimization</strong>: Adam optimizer ($\beta_1=0.9, \beta_2=0.998$).</li>
<li><strong>Learning Rate</strong>: Linear warmup over 8000 steps to 0.0005, then decayed proportionally to the inverse square root of the step number.</li>
<li><strong>Regularization</strong>:
<ul>
<li>Dropout: 0.1 (applied to dense and attentional layers).</li>
<li>Label Smoothing: Magnitude 0.1.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Teacher forcing used for both training and validation.</li>
<li><strong>Gradient Accumulation</strong>: Gradients accumulated over 4 batches before updating parameters.</li>
<li><strong>Inference</strong>: Beam search with width 10 and length penalty 1.0.</li>
</ul>
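<p>The learning-rate schedule above (linear warmup to a peak, then inverse-square-root decay) can be sketched as follows; the exact parameterization beyond the stated warmup steps and peak value is an assumption:</p>

```python
import math

def learning_rate(step, peak_lr=5e-4, warmup_steps=8000):
    """Linear warmup to peak_lr over warmup_steps, then decay
    proportional to the inverse square root of the step number."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)
```

Under this form, the rate halves every quadrupling of steps after warmup (e.g., 2.5e-4 at step 32,000).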
<h3 id="models">Models</h3>
<ul>
<li><strong>Structure</strong>: 6 layers in encoder, 6 layers in decoder.</li>
<li><strong>Attention</strong>: 8 heads per attention sub-layer.</li>
<li><strong>Dimensions</strong>:
<ul>
<li>Feed-forward hidden state size: 2048.</li>
<li>Embedding vector length: 512.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Glorot&rsquo;s method.</li>
<li><strong>Position</strong>: Positional encoding added to word vectors.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported include <strong>Whole-Name Accuracy</strong> (percentage of exact matches) and <strong>Normalized Edit Distance</strong> (Damerau-Levenshtein, scale 0-1).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy (All)</td>
          <td>91%</td>
          <td>N/A</td>
          <td>Test set of 200k samples.</td>
      </tr>
      <tr>
          <td>Accuracy (Inorganic)</td>
          <td>14%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Organometallic)</td>
          <td>20%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Charged)</td>
          <td>79%</td>
          <td>N/A</td>
          <td>Test set subset.</td>
      </tr>
      <tr>
          <td>Accuracy (Rajan)</td>
          <td>72%</td>
          <td>N/A</td>
          <td>Comparative ML model (STOUT).</td>
      </tr>
      <tr>
          <td>Edit Dist (Organic)</td>
          <td>$0.02 \pm 0.03$</td>
          <td>N/A</td>
          <td>Very high similarity for organics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Inorganic)</td>
          <td>$0.32 \pm 0.20$</td>
          <td>N/A</td>
          <td>Poor performance on inorganics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Organometallic)</td>
          <td>$0.37 \pm 0.24$</td>
          <td>N/A</td>
          <td>Poor performance on organometallics.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla K80.</li>
<li><strong>Training Time</strong>: 7 days.</li>
<li><strong>Throughput</strong>: ~6000 tokens/sec (InChI) and ~3800 tokens/sec (IUPAC).</li>
<li><strong>Batch Size</strong>: 4096 tokens (approx. 30 compounds).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5081159">InChI to IUPAC model</a></td>
          <td>Model</td>
          <td>CC BY 4.0</td>
          <td>Pre-trained Transformer weights (551 MB), requires OpenNMT-py 2.0.0</td>
      </tr>
      <tr>
          <td><a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem FTP</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source data: CID-SMILES.gz and CID-IUPAC.gz</td>
      </tr>
      <tr>
          <td>Training scripts &amp; vocabularies</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Included as supplementary files with the publication</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Handsel, J., Matthews, B., Knight, N. J., &amp; Coles, S. J. (2021). Translating the InChI: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier. <em>Journal of Cheminformatics</em>, 13(1), 79. <a href="https://doi.org/10.1186/s13321-021-00535-x">https://doi.org/10.1186/s13321-021-00535-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{handselTranslatingInChIAdapting2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Translating the {{InChI}}: Adapting Neural Machine Translation to Predict {{IUPAC}} Names from a Chemical Identifier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Translating the {{InChI}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Handsel, Jennifer and Matthews, Brian and Knight, Nicola J. and Coles, Simon J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00535-x}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine&#39;s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91\%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention,GPU,InChI,IUPAC,seq2seq,Transformer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Struct2IUPAC: Translating SMILES to IUPAC via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/struct2iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/struct2iupac-2021/</guid><description>A Transformer-based model for translating between SMILES strings and IUPAC names, trained on 47M PubChem examples, achieving 98.9% accuracy with verification.</description><content:encoded><![CDATA[<h2 id="struct2iupac-as-a-methodological-shift">Struct2IUPAC as a Methodological Shift</h2>
<p>This is primarily a <strong>Method</strong> paper with significant elements of <strong>Position</strong>.</p>
<ul>
<li><strong>Method</strong>: The authors propose a specific neural architecture (Transformer with custom tokenization) and a verification pipeline (round-trip check) to solve the SMILES $\leftrightarrow$ IUPAC translation task. They rigorously benchmark this against rule-based baselines (OPSIN).</li>
<li><strong>Position</strong>: The authors explicitly argue for a paradigm shift, suggesting that &ldquo;heavy&rdquo; neural architectures should replace complex, costly rule-based legacy systems even for &ldquo;exact&rdquo; algorithmic tasks.</li>
</ul>
<h2 id="the-cost-of-rule-based-chemical-naming">The Cost of Rule-Based Chemical Naming</h2>
<ul>
<li><strong>Complexity of Naming</strong>: Generating IUPAC names manually is error-prone and requires deep algorithmic knowledge.</li>
<li><strong>Lack of Open Source Tools</strong>: While open-source tools exist for Name-to-Structure (e.g., OPSIN), there were no open-source tools for the inverse &ldquo;Structure-to-Name&rdquo; conversion at the time of writing.</li>
<li><strong>Cost of Development</strong>: Developing rule-based converters &ldquo;from scratch&rdquo; is far more expensive and time-consuming than training a neural model on existing data.</li>
</ul>
<h2 id="struct2iupac-core-innovation">Struct2IUPAC Core Innovation</h2>
<ul>
<li><strong>Struct2IUPAC</strong>: The first effective open-source neural model for <a href="/notes/computational-chemistry/chemical-language-models/stout-v2/">converting SMILES to IUPAC names</a>, treating chemical translation as a Neural Machine Translation (NMT) problem.</li>
<li><strong>Verification Loop</strong>: A novel inference pipeline that generates multiple candidates via beam search and validates them using a reverse converter (OPSIN) to ensure the generated name maps back to the original structure.</li>
<li><strong>Custom Tokenization</strong>: A manually curated rule-based tokenizer for IUPAC names that handles specific chemical suffixes, prefixes, and stereochemical markers.</li>
</ul>
<h2 id="experimental-setup-and-stress-testing">Experimental Setup and Stress Testing</h2>
<ul>
<li><strong>Accuracy Benchmarking</strong>: The models were tested on a held-out subset of 100,000 molecules from PubChem. The authors measured accuracy across different beam sizes (1, 3, 5).</li>
<li><strong>Comparison to Rules</strong>: The neural IUPAC2Struct model was compared directly against the rule-based OPSIN tool.</li>
<li><strong>Stress Testing</strong>:
<ul>
<li><strong>Sequence Length</strong>: Evaluated performance across varying token lengths, identifying a &ldquo;sweet spot&rdquo; (10-60 tokens) and failure modes for very short (e.g., methane) or long molecules.</li>
<li><strong>Stereochemistry</strong>: Tested on &ldquo;stereo-dense&rdquo; compounds. The authors define a &ldquo;stereo-density&rdquo; index ($I$) as the ratio of stereocenters ($S$) to total tokens ($N$):
$$I = \frac{S}{N}$$
They observed a performance drop for these dense molecules, though the model still handled many stereocenters robustly.</li>
<li><strong>Tautomers</strong>: Verified the model&rsquo;s ability to handle different tautomeric forms (e.g., Guanine and Uracil variants).</li>
</ul>
</li>
<li><strong>Latency Analysis</strong>: Benchmarked inference speeds on CPU vs. GPU relative to output sequence length.</li>
</ul>
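<p>The stereo-density index $I = S/N$ can be sketched as below. Counting SMILES <code>@</code>/<code>@@</code> markers as stereocenters and characters as tokens are assumptions about how $S$ and $N$ were measured:</p>

```python
import re

def stereo_density(smiles):
    """Stereo-density I = S / N: stereocentre markers divided by
    token count (character-level proxy for tokens)."""
    stereocenters = len(re.findall(r'@@|@', smiles))  # '@@' counts once
    tokens = len(smiles)
    return stereocenters / tokens if tokens else 0.0
```

For alanine (<code>C[C@@H](N)C(=O)O</code>, 16 characters, one stereocenter) this gives $I = 1/16$.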
<h2 id="benchmarks-and-outcomes">Benchmarks and Outcomes</h2>
<ul>
<li><strong>High Accuracy</strong>: The Struct2IUPAC model achieved <strong>98.9% accuracy</strong> (Beam 5 with verification). The reverse model (IUPAC2Struct) achieved <strong>99.1%</strong>, comparable to OPSIN&rsquo;s 99.4%.</li>
<li><strong>Distribution Modeling vs. Intuition</strong>: The authors claim the model infers &ldquo;chemical logic,&rdquo; because it correctly generates multiple valid IUPAC names for single molecules where naming ambiguity exists (e.g., parent group selection). However, this more likely reflects the Transformer successfully modeling the high-frequency conditional probability distribution of synonymous names present in the PubChem training data, rather than learning intrinsic chemical rules.</li>
<li><strong>Production Readiness</strong>: Inference on GPU takes less than 0.5 seconds even for long names, making it viable for production use.</li>
<li><strong>Paradigm Shift</strong>: The authors conclude that neural networks are a viable, cost-effective alternative to developing rule-based algorithms for legacy notation conversion.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized the PubChem database.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total</strong></td>
          <td>PubChem</td>
          <td>~95M</td>
          <td>Filtered for RDKit compatibility</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Split A</td>
          <td>47,312,235</td>
          <td>Random 50% split</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Split B</td>
          <td>47,413,850</td>
          <td>Random 50% split</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Cleaning</strong>: Molecules that could not be processed by RDKit were removed. Molecules containing tokens not in the tokenizer (e.g., aromatic selenium) were excluded.</li>
<li><strong>Availability</strong>: A subset of 100,000 test molecules is available on GitHub (<code>data/test_100000.csv</code>) and Zenodo. The full train/test splits are not explicitly provided.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SMILES</strong>: Character-based tokenization.</li>
<li><strong>IUPAC</strong>: Custom rule-based tokenizer splitting suffixes (<code>-one</code>, <code>-al</code>), prefixes (<code>-oxy</code>, <code>-di</code>), and special symbols (<code>(</code>, <code>)</code>, <code>R(S)</code>).</li>
</ul>
</li>
<li><strong>Verification Step</strong>:
<ol>
<li>Generate $N$ names using Beam Search ($N=5$).</li>
<li>Reverse translate the candidate name using OPSIN.</li>
<li>Check if the OPSIN structure matches the original input SMILES.</li>
<li>Display the first verified match; otherwise, report failure.</li>
</ol>
</li>
</ul>
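<p>The four-step verification loop above can be sketched as follows; <code>translate</code> and <code>opsin_to_smiles</code> are stand-ins for the neural model and the OPSIN reverse converter, and exact-string comparison of canonical SMILES is an assumed matching criterion:</p>

```python
def verified_name(smiles, translate, opsin_to_smiles, beam_size=5):
    """Round-trip verification: generate beam_size candidate names,
    reverse-translate each, and return the first candidate whose
    structure matches the input; None signals failure."""
    for name in translate(smiles, n_best=beam_size):
        if opsin_to_smiles(name) == smiles:
            return name
    return None
```

The loop returns the highest-ranked candidate that survives the round trip, so beam search trades latency for a verified answer.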
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Standard Transformer with 6 encoder layers and 6 decoder layers.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Attention Heads: 8</li>
<li>Attention Dimension ($d_{\text{model}}$): 512</li>
<li>Feed-Forward Dimension ($d_{\text{ff}}$): 2048</li>
</ul>
</li>
<li><strong>Training Objective</strong>: The models were trained using standard autoregressive cross-entropy loss over the target token sequence $y$ given the input string $x$:
$$\mathcal{L} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x)$$</li>
<li><strong>Training</strong>: Two separate models were trained: <code>Struct2IUPAC</code> (SMILES $\to$ IUPAC) and <code>IUPAC2Struct</code> (IUPAC $\to$ SMILES).</li>
<li><strong>Availability</strong>: Code for model architecture is provided in the GitHub repository. Pre-trained weights for the IUPAC2Struct model are available, but the Struct2IUPAC model weights are not publicly released, meaning researchers would need to retrain that model on their own PubChem data to reproduce those results.</li>
</ul>
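<p>The autoregressive training objective above sums negative log-probabilities of the target tokens given their prefixes. A minimal sketch, taking the per-token probabilities $P(y_t \mid y_{&lt;t}, x)$ as given:</p>

```python
import math

def sequence_nll(token_probs):
    """Autoregressive cross-entropy: -sum_t log P(y_t | y_<t, x),
    where token_probs[t] is the probability the model assigns to
    the correct token at position t."""
    return -sum(math.log(p) for p in token_probs)
```

A perfectly confident prediction at every step gives zero loss; probability 0.5 at one step contributes $\log 2 \approx 0.693$.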
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a random subset of 100,000 molecules from the test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Beam Size</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>1</td>
          <td>96.1%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>5</td>
          <td>98.9%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>1</td>
          <td>96.6%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>5</td>
          <td>99.1%</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Robustness</strong>: Accuracy drops significantly for augmented (non-canonical) SMILES (37.16%) and stereo-enriched compounds (66.52%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Infrastructure</strong>: 4 $\times$ Tesla V100 GPUs and 36 CPUs.</li>
<li><strong>Training Time</strong>: Approximately 10 days under full load.</li>
<li><strong>Inference Speed</strong>: &lt;0.5s per molecule on GPU; latency scales linearly with output token length.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sergsb/IUPAC2Struct">IUPAC2Struct (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Transformer code and pre-trained IUPAC2Struct model</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4280814">Test data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>100k test molecules, OPSIN failure cases, model failure cases</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/smiles2iupac">Struct2IUPAC web demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online interface for SMILES to IUPAC conversion</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, L., Khokhlov, I., Fedorov, M. V., &amp; Sosnin, S. (2021). Transformer-based artificial neural networks for the conversion between chemical notations. <em>Scientific Reports</em>, 11(1), 14798. <a href="https://doi.org/10.1038/s41598-021-94082-y">https://doi.org/10.1038/s41598-021-94082-y</a></p>
<p><strong>Publication</strong>: Scientific Reports 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovTransformerbasedArtificialNeural2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Transformer-Based Artificial Neural Networks for the Conversion between Chemical Notations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Lev and Khokhlov, Ivan and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14798}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-021-94082-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/sergsb/IUPAC2Struct">GitHub Repository</a></li>
<li><a href="https://app.syntelly.com/smiles2iupac">Web Demo</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT: SMILES to IUPAC Names via Neural Machine Translation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/stout/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/stout/</guid><description>A deep-learning neural machine translation approach to translate between SMILES strings and IUPAC names using the STOUT model.</description><content:encoded><![CDATA[<h2 id="contribution-translating-chemistry-as-a-language">Contribution: Translating Chemistry as a Language</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong secondary contribution as a <strong>Resource</strong> paper.</p>
<ul>
<li><strong>Method</strong>: It proposes a neural machine translation (NMT) architecture to approximate the complex, rule-based algorithm of IUPAC naming, treating it as a language translation task.</li>
<li><strong>Resource</strong>: It provides an open-source tool and trained models to the community, addressing a gap where such functionality was previously limited to proprietary software.</li>
</ul>
<h2 id="motivation-democratizing-iupac-nomenclature">Motivation: Democratizing IUPAC Nomenclature</h2>
<p>The International Union of Pure and Applied Chemistry (IUPAC) naming scheme is universally accepted but algorithmically complex. Generating these names correctly is challenging for humans, and automated generation is largely missing from major open-source toolkits like CDK, RDKit, or Open Babel. While reliable commercial tools exist (e.g., ChemAxon&rsquo;s <code>molconvert</code>), there was a lack of open-source alternatives for the scientific community. STOUT aims to fill this gap using a data-driven approach.</p>
<h2 id="core-innovation-sequence-to-sequence-naming">Core Innovation: Sequence-to-Sequence Naming</h2>
<ul>
<li><strong>Language Translation Approach</strong>: The authors treat chemical representations (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>/<a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>) and IUPAC names as two different languages, applying Neural Machine Translation (NMT) to translate between them.</li>
<li><strong>Use of SELFIES</strong>: The work establishes SELFIES (Self-Referencing Embedded Strings) as a robust choice over SMILES for deep learning tokenization in this specific task, capitalizing on its syntactic robustness.</li>
<li><strong>Hardware Acceleration</strong>: The paper benchmarks GPU versus TPU training and highlights the practical necessity of Tensor Processing Units (TPUs) for training large-scale chemical language models, reducing training time by more than an order of magnitude.</li>
</ul>
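<p>SELFIES strings decompose cleanly into bracketed symbols, which is part of what makes them convenient NMT vocabulary items. A minimal tokenizer sketch (the actual preprocessing pipeline used the <code>selfies</code> library, so this is illustrative):</p>

```python
import re

def selfies_tokens(s):
    """Split a SELFIES string into its bracketed symbols; each
    [token] is one vocabulary item for the translation model."""
    return re.findall(r'\[[^\]]*\]', s)
```

For example, <code>[C][C][O]</code> yields three tokens, with no ambiguity about ring-closure digits or multi-character elements as in raw SMILES.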
<h2 id="methodology--translation-validation">Methodology &amp; Translation Validation</h2>
<ul>
<li><strong>Data Scale</strong>: The model was trained on datasets of 30 million and 60 million molecules derived from PubChem.</li>
<li><strong>Hardware Benchmarking</strong>: Training efficiency was compared between an nVidia Tesla V100 GPU and Google TPU v3-8/v3-32 units.</li>
<li><strong>Bidirectional Translation</strong>: The system was tested on two distinct tasks:
<ol>
<li><strong>Forward</strong>: SELFIES → IUPAC names</li>
<li><strong>Reverse</strong>: IUPAC names → SELFIES</li>
</ol>
</li>
<li><strong>Validation</strong>: Performance was evaluated on a held-out test set of 2.2 million molecules.</li>
</ul>
<h2 id="translation-accuracy--hardware-scaling">Translation Accuracy &amp; Hardware Scaling</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved an average BLEU score of ~90% and a Tanimoto similarity index &gt; 0.9 for both translation directions.</li>
<li><strong>Generalization</strong>: Even when predictions were textually mismatched (low BLEU score), the underlying chemical structures often remained highly similar (high Tanimoto similarity), suggesting the system captures fundamental chemical semantics rather than merely memorizing strings.</li>
<li><strong>Impact of Data Size</strong>: Expanding training from 30 million to 60 million molecules yielded consistent performance gains without saturating.</li>
<li><strong>Hardware Necessity</strong>: Training on TPUs proved up to 54 times faster than a standard GPU baseline (Tesla V100), making large-scale training computationally tractable.</li>
</ul>
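<p>The Tanimoto similarity used to compare predicted and reference structures is the Jaccard index over fingerprint bit sets. A minimal sketch on bit indices (the paper's fingerprint type is not assumed here):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets:
    |A intersect B| / |A union B|, in [0, 1]."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

This is why a textually wrong name can still score &gt; 0.9: the decoded structure shares nearly all fingerprint bits with the target.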
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Current repo hosts STOUT V2.0 transformer models; V1 RNN code available in earlier commits</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Public Domain</td>
          <td style="text-align: left">Source of 111M molecules; 30M/60M training subsets not directly provided</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The dataset was curated from PubChem (111 million molecules). Note that the specific 30M and 60M subsets are not directly linked in the publication repository, which means a user would have to reconstruct the filtering process.</p>
<p><strong>Preprocessing &amp; Filtering</strong>:</p>
<ul>
<li>Explicit hydrogens removed; converted to canonical SMILES.</li>
<li><strong>Filtering Rules</strong>: MW &lt; 1500 Da, no counter ions, limited element set (C, H, O, N, P, S, F, Cl, Br, I, Se, B), no hydrogen isotopes, 3-40 bonds, no charged groups.</li>
<li><strong>Ground Truth Generation</strong>: ChemAxon&rsquo;s <code>molconvert</code> (Marvin Suite 20.15) was used to generate target IUPAC names for training.</li>
<li><strong>Representation</strong>: All SMILES were converted to SELFIES for training.</li>
</ul>
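<p>As a rough illustration, part of these filtering rules can be approximated at the string level. This is a hedged sketch only: the function name is mine, and the real pipeline would use a cheminformatics toolkit to check molecular weight, counter ions, isotopes, and bond counts as well.</p>

```python
import re

# Elements permitted by the paper's filtering rules.
# Two-letter symbols (Cl, Br, Se) must be matched before single letters;
# lowercase symbols cover aromatic atoms in canonical SMILES.
ELEMENT_RE = re.compile(r"Cl|Br|Se|se|[BCHNOPSFIbcnops]")

def passes_element_filter(smiles: str) -> bool:
    """String-level approximation of two of the paper's filters:
    no charged groups, and only the allowed element set.

    Known simplification: '-' also rejects explicit single bonds,
    and no molecular-weight or bond-count check is attempted here.
    """
    if "+" in smiles or "-" in smiles:  # reject charged groups
        return False
    # Strip ring-closure digits, branch/bond symbols, and brackets,
    # then verify the remainder is covered by the element alphabet.
    stripped = re.sub(r"[\d()=#@/\\\[\]]", "", smiles)
    return ELEMENT_RE.sub("", stripped) == ""
```

A toolkit-backed implementation would operate on parsed molecules rather than raw strings, but the intent of the rules is the same.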
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">PubChem Filtered</td>
          <td style="text-align: left">30M &amp; 60M</td>
          <td style="text-align: left">Two distinct training sets created.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">PubChem Held-out</td>
          <td style="text-align: left">2.2M</td>
          <td style="text-align: left">Molecules not present in training sets; uniform token frequency.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SELFIES</strong>: Split iteratively by brackets <code>[</code> and <code>]</code>.</li>
<li><strong>IUPAC</strong>: Split via punctuation (<code>(</code>, <code>)</code>, <code>{</code>, <code>}</code>, <code>[</code>, <code>]</code>, <code>-</code>, <code>.</code>, <code>,</code>) and a discrete set of sub-word chemical morphemes (e.g., <code>methyl</code>, <code>benzene</code>, <code>fluoro</code>).</li>
<li><strong>Padding</strong>: SELFIES padded to 48 tokens; IUPAC padded to 78 tokens. Each sequence is wrapped with &ldquo;start&rdquo; and &ldquo;end&rdquo; markers.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a learning rate of $0.0005$.</li>
<li><strong>Objective Function</strong>: Sparse categorical cross-entropy over a vocabulary of size $V$, where $y_i$ is the one-hot target and $\hat{y}_i$ the predicted probability for class $i$:
$$ \mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i) $$</li>
</ul>
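<p>The bracket-based SELFIES split and fixed-length padding described above can be sketched in a few lines (marker tokens and function names are illustrative, not the authors' exact code):</p>

```python
import re

def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens,
    e.g. '[C][O]' -> ['[C]', '[O]']."""
    return re.findall(r"\[[^\]]*\]", selfies)

def pad_sequence(tokens, max_len=48, pad="<pad>", start="<start>", end="<end>"):
    """Wrap a token sequence with start/end markers and pad to a fixed
    length, mirroring the paper's setup (SELFIES padded to 48 tokens)."""
    seq = [start] + tokens + [end]
    return seq + [pad] * (max_len - len(seq))
```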
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder sequence-to-sequence network with Bahdanau attention mechanism context weighting.</li>
<li><strong>Components</strong>:
<ul>
<li><strong>Encoder/Decoder</strong>: Recurrent Neural Networks (RNN) constructed using Gated Recurrent Units (GRU).</li>
<li><strong>Attention</strong>: Bahdanau (additive) soft attention, which computes alignment scores to softly weight the encoder hidden states:
$$ e_{tj} = v_a^\top \tanh(W_a s_{t-1} + U_a h_j) $$</li>
<li><strong>Embedding</strong>: The previous decoder output token passes through an embedding layer before being concatenated with the attention context vector.</li>
</ul>
</li>
<li><strong>Implementation</strong>: Python 3 backend using TensorFlow 2.3.0. <em>Note: The linked GitHub repository currently defaults to the STOUT V2.0 transformer models, so researchers aiming to reproduce this specific V1 RNN paper should reference the older tag/commit history.</em></li>
</ul>
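<p>The additive attention score $e_{tj}$ above can be made concrete with a toy, dependency-free sketch (matrix sizes and helper names are mine; a real implementation would use tensor operations):</p>

```python
import math

def bahdanau_weights(s_prev, hs, W_a, U_a, v_a):
    """Additive (Bahdanau) attention: e_tj = v_a . tanh(W_a s_{t-1} + U_a h_j),
    normalized with a softmax over the encoder positions j.

    Matrices are lists of rows; vectors are flat lists (toy sizes only).
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    Ws = matvec(W_a, s_prev)           # shared term for the decoder state
    scores = []
    for h in hs:                       # one alignment score per encoder state
        Uh = matvec(U_a, h)
        scores.append(sum(v * math.tanh(a + b)
                          for v, a, b in zip(v_a, Ws, Uh)))
    # Softmax turns raw scores into attention weights that sum to 1.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The resulting weights are used to form the context vector as a convex combination of the encoder hidden states.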
<h3 id="evaluation">Evaluation</h3>
<p>Metrics heavily emphasize both linguistic accuracy and cheminformatic structural correctness:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Details</th>
          <th style="text-align: left">Result (60M Model)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU Score</strong></td>
          <td style="text-align: left">NLTK sentence BLEU (unigram to 4-gram)</td>
          <td style="text-align: left">0.94 (IUPAC $\to$ SELFIES)</td>
          <td style="text-align: left">Exact text overlap. Serves as a strictly syntactic proxy.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto Similarity</strong></td>
          <td style="text-align: left">PubChem fingerprints via CDK</td>
          <td style="text-align: left">0.98 (Valid IUPAC names)</td>
          <td style="text-align: left">Evaluates substructure alignment over bit vectors, $T(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.</td>
      </tr>
  </tbody>
</table>
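<p>The Tanimoto computation in the table reduces to a set operation over the on-bit positions of two binary fingerprints. A minimal sketch (the actual evaluation used PubChem fingerprints computed via CDK):</p>

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity T(A, B) = |A ∩ B| / |A ∪ B| over the
    on-bit positions of two binary fingerprints."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)
```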
<h3 id="hardware">Hardware</h3>
<p>Comparison of hardware efficiency for training large chemical language models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hardware</th>
          <th style="text-align: left">Batch Size</th>
          <th style="text-align: left">Time per Epoch (15M subset)</th>
          <th style="text-align: left">Speedup Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>GPU (Tesla V100)</strong></td>
          <td style="text-align: left">256</td>
          <td style="text-align: left">~27 hours</td>
          <td style="text-align: left">1x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-8</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~2 hours</td>
          <td style="text-align: left">13x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-32</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~0.5 hours</td>
          <td style="text-align: left">54x</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <em>Journal of Cheminformatics</em>, 13(1), 34. <a href="https://doi.org/10.1186/s13321-021-00512-4">https://doi.org/10.1186/s13321-021-00512-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTSMILESIUPAC2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{STOUT: SMILES to IUPAC Names Using Neural Machine Translation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{STOUT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00512-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-09-22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention mechanism,Chemical language,Deep neural network,DeepSMILES,IUPAC names,Neural machine translation,Recurrent neural network,SELFIES,SMILES}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">GitHub Repository</a></li>
<li><a href="/notes/computational-chemistry/chemical-language-models/stout-v2/">STOUT V2.0 Note</a></li>
<li><a href="/notes/computational-chemistry/chemical-language-models/struct2iupac-2021/">Struct2IUPAC Note</a></li>
<li><a href="/notes/computational-chemistry/chemical-language-models/handsel-inchi-iupac-2021/">HandSEL Note (InChI to IUPAC)</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT V2.0: Transformer-Based SMILES to IUPAC Translation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/stout-v2/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/stout-v2/</guid><description>A Transformer-based model for translating SMILES to IUPAC names, trained on ~1 billion molecules, achieving ~0.99 BLEU score on benchmarks.</description><content:encoded><![CDATA[<h2 id="paper-contribution--methodological-scope">Paper Contribution &amp; Methodological Scope</h2>
<p><strong>Method (Primary) / Resource (Secondary)</strong></p>
<p>This paper presents a <strong>Methodological</strong> contribution by developing and validating a Transformer-based neural machine translation model (STOUT V2) for bidirectional chemical nomenclature (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> $\leftrightarrow$ IUPAC). It systematically compares this new architecture against previous RNN-based baselines (<a href="/notes/computational-chemistry/chemical-language-models/stout/">STOUT V1</a>) and performs ablation studies on tokenization strategies.</p>
<p>It also serves as a significant <strong>Resource</strong> contribution by generating a massive training dataset of nearly 1 billion SMILES-IUPAC pairs (curated via commercial Lexichem software) and releasing the resulting models and code as open-source tools for chemical naming.</p>
<h2 id="the-need-for-robust-open-source-iupac-nomenclature-rules">The Need for Robust Open-Source IUPAC Nomenclature Rules</h2>
<p>Assigning systematic IUPAC names to chemical structures requires adherence to a complex ruleset, making consistent manual naming difficult. Deterministic, rule-based commercial tools such as OpenEye Lexichem and ChemAxon handle this task reliably, while existing open-source tools like OPSIN focus on the reverse direction of parsing names into structures.</p>
<p>The previous version of STOUT (V1), based on RNNs/GRUs, achieved ~90% BLEU accuracy, with known limitations in capturing long-distance dependencies required for stereochemistry handling. This work uses the sequence-learning capabilities of Transformers combined with large-scale datasets to create a competitive open-source IUPAC naming tool.</p>
<h2 id="architectural-shift-and-billion-scale-training">Architectural Shift and Billion-Scale Training</h2>
<p>The primary advancements over previous iterations address both architecture and dataset scale:</p>
<ol>
<li><strong>Architecture Shift</strong>: Moving from an RNN-based Seq2Seq model to a <strong>Transformer-based architecture</strong> (4 layers, 8 heads), which captures intricate chemical patterns better than GRUs.</li>
<li><strong>Billion-Scale Training</strong>: Training on a dataset of nearly <strong>1 billion molecules</strong> (combining PubChem and ZINC15), significantly larger than the 60 million used for STOUT V1.</li>
<li><strong>Tokenization Strategy</strong>: Determining that <strong>character-wise tokenization</strong> for IUPAC names is superior to word-wise tokenization in terms of both accuracy and training efficiency (15% faster).</li>
</ol>
<h2 id="experimental-validation-and-scaling-limits">Experimental Validation and Scaling Limits</h2>
<p>The authors conducted three primary experiments to validate bidirectional translation (SMILES $\rightarrow$ IUPAC and IUPAC $\rightarrow$ SMILES):</p>
<ul>
<li><strong>Experiment 1 (Optimization)</strong>: Assessed the impact of dataset size (1M vs 10M vs 50M) and tokenization strategy on SMILES-to-IUPAC performance.</li>
<li><strong>Experiment 2 (Scaling)</strong>: Trained models on 110 million PubChem molecules for <strong>both</strong> forward and reverse translation tasks to test performance on longer sequences.</li>
<li><strong>Experiment 3 (Generalization)</strong>: Trained on the full ~1 billion dataset (PubChem + ZINC15) for both translation directions.</li>
<li><strong>External Validation</strong>: Benchmarked against an external dataset from ChEBI (1,485 molecules) and ChEMBL34 to test generalization to unseen data.</li>
</ul>
<p><strong>Evaluation Metrics</strong>:</p>
<ul>
<li><strong>Textual Accuracy</strong>: BLEU scores (1-4) and Exact String Match.</li>
<li><strong>Chemical Validity</strong>: Retranslation of generated names back to SMILES using OPSIN, followed by Tanimoto similarity checks (PubChem fingerprints) against the original input.</li>
</ul>
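<p>The textual-accuracy metric can be illustrated with a minimal sentence-BLEU: clipped n-gram precision combined with a brevity penalty. This is a simplified stand-in for the NLTK implementation the papers use (no smoothing):</p>

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Uniformly weighted BLEU up to max_n: geometric mean of clipped
    n-gram precisions times a brevity penalty. Returns 0.0 if any
    precision is zero (NLTK instead applies smoothing here)."""
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(reference, n))
        cand_counts = Counter(ngrams(candidate, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_avg)
```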
<h2 id="translation-accuracy-and-structural-validity">Translation Accuracy and Structural Validity</h2>
<ul>
<li><strong>Superior Performance</strong>: STOUT V2 achieved an average BLEU score of <strong>0.99</strong> (vs 0.94 for V1). While exact string matches varied by experiment (83-89%), the model notably achieved a perfect BLEU score (1.0) on <strong>97.49%</strong> of a specific test set where STOUT V1 only reached 66.65%.</li>
<li><strong>Structural Validity (&ldquo;Near Misses&rdquo;)</strong>: When the generated name differed from the ground truth string, the re-generated structure often remained chemically valid. The model maintained an average Tanimoto similarity $T(A,B)$ of <strong>0.68</strong> for these divergent names between bit-vector fingerprints $A$ and $B$, roughly defined as:
$$ T(A,B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert} $$
<em>Critique</em>: Note that an average Tanimoto coefficient of 0.68 typically suggests moderate structural similarity/drift, not an almost-identical &ldquo;near miss&rdquo; (which would be $&gt;0.85$). This implies the model constructs chemically related but structurally distinct outputs when it fails exact string matching.</li>
<li><strong>Tokenization</strong>: Character-level splitting for IUPAC names outperformed word-level splitting and was more computationally efficient.</li>
<li><strong>Data Imbalance &amp; Generalization</strong>: The model&rsquo;s drop in performance for sequences &gt;600 characters highlights a systemic issue in open chemical databases: long, highly complex SMILES strings are significantly underrepresented. Even billion-scale training datasets are still bound by the chemical diversity of their source material.</li>
<li><strong>Limitations</strong>:
<ul>
<li><strong>Preferred Names (PINs)</strong>: The model mimics Lexichem&rsquo;s naming conventions, generating valid IUPAC names distinct from strict <em>Preferred IUPAC Names</em> (PINs).</li>
<li><strong>Sequence Length</strong>: Performance degrades for very long SMILES (&gt;600 characters) due to scarcity in the training data.</li>
<li><strong>Algorithmic Distillation Bottleneck</strong>: Because the 1 billion training pairs were generated entirely by OpenEye&rsquo;s Lexichem, STOUT V2 acts as a knowledge distillation of that specific commercial algorithm. The model learns Lexichem’s heuristic mapping, specific dialects, and potential systematic errors, rather than deriving true nomenclature rules from first principles.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was derived from PubChem and ZINC15. Ground truth IUPAC names were generated using OpenEye Lexichem TK 2.8.1 to ensure consistency.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Exp 1)</strong></td>
          <td>PubChem Subset</td>
          <td>1M, 10M, 50M</td>
          <td>Selected via MaxMin algorithm for diversity</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 2)</strong></td>
          <td>PubChem</td>
          <td>110M</td>
          <td>Filtered for SMILES length &lt; 600</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 3)</strong></td>
          <td>PubChem + ZINC15</td>
          <td>~1 Billion</td>
          <td>999,637,326 molecules total</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChEBI</td>
          <td>1,485</td>
          <td>External validation set, non-overlapping with training</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Canonicalized, isomeric, and kekulized using RDKit (v2023.03.1).</li>
<li><strong>Formatting</strong>: Converted to TFRecord format in 100 MB chunks for TPU efficiency.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting. Atoms (e.g., &ldquo;Cl&rdquo;, &ldquo;Au&rdquo;), bonds, brackets, and digits are separate tokens.</li>
<li><strong>IUPAC Tokenization</strong>: <strong>Character-wise split</strong> was selected as the optimal strategy (treating every character as a token).</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler based on model dimensions.</li>
<li><strong>Loss Function</strong>: Trained to minimize the Sparse Categorical Cross-Entropy $L$, masking padding tokens. With target indicators $y_i$, predicted probabilities $p_i$, and a binary mask $m_i$ that zeroes out padded positions, the masked loss is:
$$ L = - \sum_{i} m_i \, y_{i} \log(p_{i}) $$
where the sum runs over sequence positions.</li>
<li><strong>Code Availability</strong>: The <a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">main STOUT V2 repository</a> contains the inference package. The training pipeline/instructions (originally linked to a separate repo that is currently a 404) can still be found within the <a href="https://doi.org/10.5281/zenodo.6559438">Zenodo archive release</a>.</li>
</ul>
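<p>The regex-based SMILES tokenization can be sketched with a pattern commonly used in the chemical language modeling literature; the authors' exact regex may differ, but the idea is the same: match multi-character atoms and bracket expressions before single characters.</p>

```python
import re

# A widely used SMILES tokenization pattern (Schwaller-style); this is
# an illustrative stand-in, not the paper's exact regex.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|@|%\d{2}|[BCNOPSFIbcnops]|[=#$/\\().+\-]|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, branch, and ring tokens.
    The join check guards against silently dropped characters."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens
```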
<h3 id="models">Models</h3>
<p>The model follows the standard Transformer architecture from &ldquo;Attention is All You Need&rdquo; (Vaswani et al.).</p>
<ul>
<li><strong>Architecture</strong>: 4 Transformer layers (encoder/decoder stack).</li>
<li><strong>Attention</strong>: Multi-head attention with <strong>8 heads</strong>.</li>
<li><strong>Dimensions</strong>: Embedding size ($d_{model}$) = 512; Feed-forward dimension ($d_{ff}$) = 2048.</li>
<li><strong>Regularization</strong>: Dropout rate of 0.1.</li>
<li><strong>Context Window</strong>: Max input length (SMILES) = 600; Max output length (IUPAC) = 700-1000.</li>
<li><strong>Weights</strong>: Model weights for forward and reverse architectures are <a href="https://doi.org/10.5281/zenodo.13318286">available via Zenodo (v3)</a>.</li>
</ul>
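<p>The &ldquo;custom learning rate scheduler based on model dimensions&rdquo; mentioned under Algorithms is, in the standard Transformer recipe of Vaswani et al., a warmup-then-decay schedule. Assuming the paper follows that recipe with $d_{model} = 512$ (the warmup length here is my assumption, not stated in the paper):</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Noam schedule from 'Attention is All You Need':
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Linear warmup to the peak, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```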
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on both string similarity and chemical structural integrity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BLEU Score</strong></td>
          <td>N-gram overlap</td>
          <td>Compared predicted IUPAC string to Ground Truth.</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Accuracy</td>
          <td>Binary 1/0 check for identical strings.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Structural Similarity</td>
          <td>Predicted Name $\rightarrow$ OPSIN $\rightarrow$ SMILES $\rightarrow$ Fingerprint comparison to input.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT V2 GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Inference package (PyPI: STOUT-pypi)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13318286">Model Weights (Zenodo v3)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Forward and reverse translation weights</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6559438">Code Snapshot (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training pipeline archive</td>
      </tr>
      <tr>
          <td><a href="https://stout.decimer.ai">Web Application</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo with Ketcher, bulk submission, DECIMER integration</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was conducted entirely on Google Cloud Platform (GCP) TPUs.</p>
<ul>
<li><strong>STOUT V1</strong>: Trained on TPU v3-8.</li>
<li><strong>STOUT V2</strong>: Trained on <strong>TPU v4-128 pod slices</strong> (128 nodes).</li>
<li><strong>Large Scale (Exp 3)</strong>: Trained on <strong>TPU v4-256 pod slice</strong> (256 nodes).</li>
<li><strong>Training Time</strong>: Average of <strong>15 hours and 2 minutes per epoch</strong> for the 1 billion dataset.</li>
<li><strong>Framework</strong>: TensorFlow 2.15.0-pjrt with Keras.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2024). STOUT V2.0: SMILES to IUPAC name conversion using transformer models. <em>Journal of Cheminformatics</em>, 16(146). <a href="https://doi.org/10.1186/s13321-024-00941-x">https://doi.org/10.1186/s13321-024-00941-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTV20SMILES2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{STOUT V2}}.0: {{SMILES}} to {{IUPAC}} Name Conversion Using Transformer Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{STOUT V2}}.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{146}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00941-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://stout.decimer.ai">Web Application</a> (Includes Ketcher drawing, bulk submission, and DECIMER integration)</li>
<li><a href="https://decimer.ai">DECIMER Project</a></li>
<li><a href="/notes/computational-chemistry/chemical-language-models/stout/">STOUT V1 Note</a></li>
<li><a href="https://zenodo.org/records/6559438">Zenodo Archive (Code Snapshot)</a></li>
</ul>
]]></content:encoded></item><item><title>Multimodal Search in Chemical Documents and Reactions</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/shah-multimodal-search-2025/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/shah-multimodal-search-2025/</guid><description>A multimodal search engine that integrates text passages, molecular diagrams, and reaction data to enable passage-level retrieval in chemical literature.</description><content:encoded><![CDATA[<h2 id="contribution-multimodal-synthesis-retrieval">Contribution: Multimodal Synthesis Retrieval</h2>
<p>This paper represents a $\Psi_{\text{Method}}$ projection that proposes a novel architectural pipeline for indexing and searching chemical literature. The framework unifies text, molecular diagrams, and structured reaction records. It also contains a secondary $\Psi_{\text{Resource}}$ projection, providing a functional demonstration tool and curating a specific benchmark dataset for Suzuki coupling reactions.</p>
<h2 id="the-gap-in-passage-level-chemical-retrieval">The Gap in Passage-Level Chemical Retrieval</h2>
<p>Scientific literature documents chemical reactions through a combination of text and visual diagrams: textual descriptions detail parameters such as yield and operating temperature, while diagrams depict the structural transformations. Existing tools such as SciFinder or Reaxys operate at the document or individual-compound level and do not explicitly link molecular figures to the localized text passages that describe them, so researchers cannot directly retrieve a reaction diagram alongside the exact textual protocol. Passage-level retrieval of synthesis protocols is needed for efficient access to complete reaction conditions.</p>
<h2 id="core-innovation-unified-multimodal-indexing">Core Innovation: Unified Multimodal Indexing</h2>
<p>The core methodological innovation is a multimodal passage-level indexing and linking pipeline.</p>
<ul>
<li><strong>Unified Indexing:</strong> The framework processes text and diagrams in parallel and directly links them into a single index structure. This architecture supports search queries utilizing raw text, discrete SMILES strings, or multimodal combinations.</li>
<li><strong>Compound-Passage Linking:</strong> The mechanism applies conflict-resolution logic linking chemical diagrams to specific text citations using two parallel heuristics:
<ol>
<li><strong>Token-based Alignment:</strong> Matching parsed diagram labels against documented text strings (e.g., &ldquo;compound 5&rdquo;) using normalized Levenshtein distance.</li>
<li><strong>Fingerprint-based Alignment:</strong> Matching chemical structures against generated SMILES strings via structural Tanimoto Similarity.</li>
</ol>
</li>
<li><strong>ReactionMiner Integration:</strong> The pipeline parses and incorporates formatted reaction records (reactants, products, catalysts, quantitative yields) directly derived from segmented text passages.</li>
</ul>
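<p>The token-based alignment heuristic above hinges on a normalized Levenshtein distance between diagram labels and in-text mentions. A minimal sketch (the normalization form and any acceptance threshold are my assumptions):</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def label_similarity(label: str, mention: str) -> float:
    """Normalized similarity in [0, 1]: 1 - distance / max length.
    High values suggest a diagram label matches a text mention."""
    if not label and not mention:
        return 1.0
    return 1.0 - levenshtein(label, mention) / max(len(label), len(mention))
```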
<h2 id="methodology--expert-evaluation">Methodology &amp; Expert Evaluation</h2>
<p>The authors evaluated the system utilizing a chemical case study targeting specific synthesis domains alongside qualitative expert assessment.</p>
<ul>
<li><strong>Dataset:</strong> The corpus comprised 7 research manuscripts and 6 supplementary data documents detailing Suzuki coupling reactions.</li>
<li><strong>Volume:</strong> The resulting index processed 1,282 extracted passages (indexing 538), extracted 383 unique SMILES, and logged 219 parsed reactions.</li>
<li><strong>Qualitative Evaluation:</strong> Practicing chemists posed real-world queries (e.g., combining the &ldquo;Burke group&rdquo; keyword with an explicit SMARTS substructure pattern) to gauge retrieval capability.</li>
</ul>
<h2 id="key-findings--system-limitations">Key Findings &amp; System Limitations</h2>
<ul>
<li><strong>Diagram-to-Text Linking:</strong> The pipeline accurately paired visual molecular diagrams with structurally derived text details, permitting testers to navigate directly from a molecule query card to the exact origin passage within the source PDF.</li>
<li><strong>Contextual Insight Extraction:</strong> Specialized chemists found the parsed reaction representations (yield metrics, isolated catalysts) practically useful as high-level extractive summaries.</li>
<li><strong>Extrapolative Retrieval:</strong> The architecture permitted the effective retrieval of targeted chemical derivatives (such as benzo[b]thiophen-2-ylboronic acid) via structurally related input queries (dibenzothiophene).</li>
</ul>
<p>The system evaluation highlights several architectural restrictions:</p>
<ul>
<li><strong>Domain-Restricted Validation:</strong> The initial validation is entirely qualitative and limited to the specific subclass of Suzuki coupling reactions. The evaluation omits standardized quantitative retrieval metrics (e.g., MAP, NDCG) and lacks systematic ablations of the fusion scoring mechanism.</li>
<li><strong>Algorithmic Transparency:</strong> The multimodal query routing does not expose which feature dominated a result, obscuring whether keyword matching or structural similarity drove the final ranking and limiting user control.</li>
<li><strong>Optical Processing Brittleness:</strong> The vision inference and primitive parsing pipelines are fragile, intermittently failing to associate text passages with correctly parsed molecular diagrams.</li>
<li><strong>Metadata Incompleteness:</strong> Chemists requested additional structured metadata (such as specific molar equivalents and mol% values) to bridge the extracted data directly into electronic lab notebooks.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.cs.rit.edu/~dprl/reactionminer-demo-landing/">ReactionMiner Demo</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Online demo landing page; source code repository not publicly linked</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source:</strong> The corpus comprises 7 primary research papers and 6 supplementary information documents on Suzuki coupling reactions, sourced from practicing chemists at UIUC. This evaluation dataset is internal and not publicly available.</li>
<li><strong>Preprocessing:</strong>
<ul>
<li>Source PDFs are converted to full-page raster images.</li>
<li>Page layout and raw text are extracted via <strong>PyTesseract</strong> OCR.</li>
<li>Text is segmented into passage chunks that emphasize reaction-related sentences, using product-indicative lexicons and topic modeling.</li>
</ul>
</li>
</ul>
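<p>The lexicon half of the segmentation step can be sketched in a few lines (the paper also applies topic modeling, which is omitted here; the term list below is illustrative, not taken from the paper):</p>

```python
import re

# Illustrative reaction-indicative lexicon; the paper's actual term list
# is not published.
REACTION_LEXICON = {"yield", "afforded", "obtained", "isolated", "synthesized"}

def is_reaction_sentence(sentence: str) -> bool:
    """Flag a sentence as reaction-related if it contains a lexicon term."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(tokens & REACTION_LEXICON)

def segment_passages(text: str) -> list[str]:
    """Split raw OCR text into sentences and keep reaction-related ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if is_reaction_sentence(s)]
```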
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Diagram Extraction:</strong> A <strong>YOLOv8</strong> model identifies and segments molecular regions within structured PDF pages.</li>
<li><strong>Diagram Parsing:</strong> The architecture relies on <strong>ChemScraper</strong> to infer structural semantics from raw diagrams:
<ul>
<li><em>Born-digital PDFs:</em> <strong>SymbolScraper</strong> extracts vector lines and polygons directly from bounding box definitions.</li>
<li><em>Raster images:</em> The system employs the <strong>Line Segment Detector (LSD)</strong> and watershed bounding algorithms to isolate native geometric primitives.</li>
</ul>
</li>
<li><strong>Text Entity Extraction:</strong> The framework deploys <strong>ChemDataExtractor 2.0</strong> to extract explicit molecular aliases. A translation layer maps these entities to string representations via <strong>OPSIN</strong>.</li>
<li><strong>Linking Logic (Fusion Score):</strong>
<ul>
<li><strong>Text Link:</strong> The algorithm computes a normalized Levenshtein ratio (edit distance) between visual diagram labels and nearby text mentions.</li>
<li><strong>Structure Link:</strong> The algorithm computes the Tanimoto similarity between 2048-bit Morgan fingerprints derived from parsed diagram structures and from text SMILES queries:
$$ T(A, B) = \frac{A \cdot B}{|A|^{2} + |B|^{2} - A \cdot B} $$
where $A$ and $B$ represent the boolean bit vectors of the respective fingerprint pairs.</li>
<li><strong>Conflict Resolution Protocol:</strong> The system fuses the structural and textual similarity scores, keeping whichever link yields the higher score. During final retrieval, candidates are re-ranked using a hybrid of the BM25 score and the local count of exact SMILES pattern hits.</li>
</ul>
</li>
</ul>
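<p>A minimal pure-Python sketch of the two link scores and the conflict-resolution rule (the paper computes Morgan fingerprints with RDKit; here a fingerprint is represented simply as a set of on-bit indices):</p>

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def text_link_score(label: str, mention: str) -> float:
    """Normalized Levenshtein ratio in [0, 1] between a diagram label
    and a nearby text mention."""
    if not label and not mention:
        return 1.0
    return 1.0 - levenshtein(label, mention) / max(len(label), len(mention))

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity over the on-bit indices of two binary
    fingerprints: T(A, B) = |A n B| / (|A| + |B| - |A n B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if inter else 0.0

def fused_link_score(label, mention, fp_diagram, fp_text) -> float:
    """Conflict resolution: keep whichever modality scores higher."""
    return max(text_link_score(label, mention), tanimoto(fp_diagram, fp_text))
```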
<h3 id="models">Models</h3>
<ul>
<li><strong>Reaction Extraction:</strong> A <strong>LLaMA-3.1-8B</strong> model is fine-tuned via <strong>LoRA</strong> to emit custom tokens representing reaction entities (compounds, reagents, temperatures) from text sub-chunks. Exact prompts, the fine-tuning dataset, and specific LoRA hyperparameters are omitted from the source text.</li>
<li><strong>Diagram Processing:</strong> ChemScraper incorporates a segmentation-aware multi-task neural network for low-level raster image parsing.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Search Engine Base:</strong> The authors built their indexing framework on top of <strong>PyTerrier</strong>.</li>
<li><strong>Text Feature Ranking:</strong> Text ranking uses standalone <strong>BM25</strong> keyword similarity.</li>
<li><strong>Structure Feature Operations:</strong> Structure search uses <strong>RDKit</strong> for substructure matching and exact molecular similarity queries.</li>
<li><strong>Multimodal Fusion Processing:</strong>
<ul>
<li>Candidate passages are filtered by matching structural properties (SMILES queries) against document-wide lexical scores (BM25).</li>
<li>The final fusion ranking assigns the strongest weight to passages containing dense local clusters of exact SMILES pattern matches.</li>
</ul>
</li>
</ul>
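<p>The final fusion step can be sketched as a weighted combination of the BM25 score and the local SMILES hit count (the weight <code>alpha</code> and the linear form are assumptions; the paper does not publish the exact combination):</p>

```python
def rerank(candidates: list[dict], alpha: float = 1.0) -> list[dict]:
    """Re-rank candidate passages by BM25 score plus a weighted count of
    exact SMILES pattern hits. Each candidate is a dict with keys
    'passage', 'bm25', and 'smiles_hits'."""
    return sorted(candidates,
                  key=lambda c: c["bm25"] + alpha * c["smiles_hits"],
                  reverse=True)
```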
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Infrastructure:</strong> The hardware and parameter requirements to host the multi-stage vision extractors (YOLOv8, ChemScraper) alongside a local 8B LLM are entirely unspecified in the paper.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shah, A. K., et al. (2025). Multimodal Search in Chemical Documents and Reactions. <em>arXiv preprint arXiv:2502.16865</em>. <a href="https://doi.org/10.48550/arXiv.2502.16865">https://doi.org/10.48550/arXiv.2502.16865</a></p>
<p><strong>Publication</strong>: SIGIR &lsquo;25 (Demo Track), 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{shahMultimodalSearchChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Multimodal {{Search}} in {{Chemical Documents}} and {{Reactions}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Shah, Ayush Kumar and Dey, Abhisek and Luo, Leo and Amador, Bryan and Philippy, Patrick and Zhong, Ming and Ouyang, Siru and Friday, David Mark and Bianchi, David and Jackson, Nick and Zanibbi, Richard and Han, Jiawei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = feb,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2502.16865}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.cs.rit.edu/~dprl/reactionminer-demo-landing/">Online Demo</a> (Note: While the landing page advertises the system as open-source, the exact repository URL and installation prerequisites are omitted from the official manuscript.)</li>
</ul>
]]></content:encoded></item><item><title>ChemDFM-X: Multimodal Foundation Model for Chemistry</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemdfm-x/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemdfm-x/</guid><description>Multimodal chemical model integrating 5 modalities (2D graphs, 3D conformations, images, MS2/IR spectra) trained on 7.6M instructions.</description><content:encoded><![CDATA[<h2 id="chemdfm-x-contribution-and-architecture">ChemDFM-X Contribution and Architecture</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> contribution.</p>
<p><strong>Method</strong>: The paper proposes a novel &ldquo;Cross-modal Dialogue Foundation Model&rdquo; architecture that aligns five distinct chemical modalities (2D graphs, 3D conformations, images, MS2 spectra, IR spectra) to a single LLM decoder using separate encoders and projection modules. It establishes strong baseline performance across multiple modalities compared against current generalist models.</p>
<p><strong>Resource</strong>: The paper addresses the scarcity of multimodal chemical data by constructing a <strong>7.6M instruction-tuning dataset</strong>. This dataset is largely synthesized from seed SMILES strings using approximate calculations (MMFF94, CFM-ID, Chemprop-IR) and specialist model predictions.</p>
<h2 id="bridging-experimental-data-and-llms">Bridging Experimental Data and LLMs</h2>
<p>Existing chemical AI models generally fall into two distinct categories. Task-specific specialist models achieve high accuracy on singular objectives, such as property prediction or molecular generation, but require strict formatting and lack conversational flexibility. Conversely, early chemical large language models provide natural language interaction but are restricted to text and SMILES strings. ChemDFM-X addresses this gap by enabling large multimodal models to process the experimental characterization data (MS2 and IR spectra) and visual data routinely used in practical chemistry workflows.</p>
<h2 id="synthetic-data-scaling-for-modality-alignment">Synthetic Data Scaling for Modality Alignment</h2>
<p>The core novelty lies in the <strong>&ldquo;Any-to-Text&rdquo; alignment strategy via synthetic data scaling</strong>:</p>
<ol>
<li>
<p><strong>Comprehensive Modality Support</strong>: ChemDFM-X incorporates experimental characterization data (MS2 and IR spectra) alongside 2D graphs, 3D conformations, and images. The data representations are formally defined mathematically rather than as raw pixels:</p>
<ul>
<li><strong>Molecular Graph</strong>: An undirected graph $G = (\textbf{V}, \textbf{E})$ with atom set $\textbf{V}$ and bond set $\textbf{E}$.</li>
<li><strong>Molecular Conformation</strong>: An undirected graph $G = (\textbf{V}', \textbf{E})$ storing spatial coordinates: $\textbf{v}_i = (x_i, y_i, z_i, a_i)$.</li>
<li><strong>MS2 Spectrum</strong>: Treated as a point sequence of discrete mass-to-charge ratios and intensities, tokenized via a discrete codebook: $\textbf{M} = ((r_1, I_1), (r_2, I_2), \dots, (r_n, I_n))$.</li>
<li><strong>IR Spectrum</strong>: Treated as a dense sequence of continuous wavelengths and absorption intensities, directly reshaped for feature extraction: $\textbf{R} = ((w_1, t_1), (w_2, t_2), \dots, (w_l, t_l))$.</li>
</ul>
<p>The authors trained new Sequence Transformer encoders from scratch for the MS2 and IR modalities since suitable pre-trained models did not exist.</p>
</li>
<li>
<p><strong>Synthetic Data Generation Pipeline</strong>: The authors generated a 7.6M sample dataset by starting with 1.3M seed SMILES and using &ldquo;approximate calculations&rdquo; to generate missing modalities:</p>
<ul>
<li>3D conformations via MMFF94 force field optimization</li>
<li>MS2 spectra via CFM-ID 4.0 (Competitive Fragmentation Modeling)</li>
<li>IR spectra via Chemprop-IR (Message Passing Neural Network)</li>
</ul>
</li>
<li>
<p><strong>Cross-Modal Synergy</strong>: The model demonstrates that training on reaction images improves recognition performance by leveraging semantic chemical knowledge (reaction rules) to correct visual recognition errors, an emergent capability from multimodal training.</p>
</li>
</ol>
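<p>The MS2 codebook tokenization defined in the modality list above can be sketched as joint binning of m/z and normalized intensity (the bin width, level count, and codebook layout below are assumptions; the paper does not publish them):</p>

```python
MZ_BIN = 1.0           # assumed m/z bin width in Da
INTENSITY_LEVELS = 10  # assumed number of quantized intensity levels

def tokenize_ms2(peaks: list[tuple[float, float]], max_mz: float = 1000.0) -> list[int]:
    """Map (m/z, intensity) peaks to discrete token ids.

    Token id = mz_bin * INTENSITY_LEVELS + intensity_level, giving each
    (bin, level) pair a unique codebook entry.
    """
    max_intensity = max(i for _, i in peaks)
    tokens = []
    for mz, intensity in sorted(peaks):
        mz_bin = min(int(mz / MZ_BIN), int(max_mz / MZ_BIN) - 1)
        level = min(int(intensity / max_intensity * INTENSITY_LEVELS),
                    INTENSITY_LEVELS - 1)
        tokens.append(mz_bin * INTENSITY_LEVELS + level)
    return tokens
```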
<h2 id="multimodal-benchmarking-with-chemllmbench">Multimodal Benchmarking with ChemLLMBench</h2>
<p>The model was evaluated using a customized version of <strong>ChemLLMBench</strong> and <strong>MoleculeNet</strong> across three modality categories:</p>
<ol>
<li>
<p><strong>Structural Modalities</strong> (2D Graphs &amp; 3D Conformations):</p>
<ul>
<li>Molecule recognition and captioning</li>
<li>Property prediction (MoleculeNet: BACE, BBBP, ClinTox, HIV, Tox21)</li>
<li>Compared against specialist models (Mole-BERT, Uni-Mol, MolXPT, MolCA) and generalist models (3D-MoLM, ChemDFM, ChemLLM)</li>
</ul>
</li>
<li>
<p><strong>Visual Modalities</strong> (Images):</p>
<ul>
<li>Single molecule image recognition</li>
<li>Reaction image recognition</li>
<li>Compared against GPT-4o, Gemini 1.5 Pro, Qwen-VL, LLaVA, and specialist models MolNexTR and MolScribe</li>
</ul>
</li>
<li>
<p><strong>Characterization Modalities</strong> (MS2 &amp; IR Spectra):</p>
<ul>
<li>Spectral analysis tasks (identifying molecules from spectra)</li>
<li>Contextualized spectral interpretation (combining spectra with reaction context)</li>
<li>Novel evaluation requiring integration of spectroscopic data with reaction knowledge</li>
</ul>
</li>
</ol>
<h2 id="cross-modal-synergy-and-generalist-performance">Cross-Modal Synergy and Generalist Performance</h2>
<p><strong>Key Findings</strong>:</p>
<ol>
<li>
<p><strong>Leading Generalist Performance</strong>: ChemDFM-X establishes a new benchmark among existing generalist models (such as 3D-MoLM and ChemLLM), achieving performance that matches dedicated specialist models across several multimodal tasks.</p>
</li>
<li>
<p><strong>Failure of General LMMs</strong>: General vision models (GPT-4o, Gemini 1.5 Pro, Qwen-VL, LLaVA, InternLM-XComposer2, DocOwl) failed significantly on chemical image recognition tasks (0% accuracy for most models on molecule and reaction recognition, Table 9), demonstrating that chemical domain knowledge cannot be assumed from general pre-training.</p>
</li>
<li>
<p><strong>Cross-Modal Error Correction</strong>: In reaction image recognition, ChemDFM-X achieved higher accuracy (53.0%) than on single molecules (46.0%) (Table 9). The authors conclude the model uses its internal knowledge of chemical reaction rules to correct recognition errors in the visual modality, an emergent capability from multimodal training.</p>
</li>
<li>
<p><strong>Reliance on Reaction Context for Spectra</strong>: In zero-shot scenarios, ChemDFM-X essentially fails at pure spectral recognition (achieving 0% and 1% top-1 accuracy on MS2 and IR spectra alone, Table 11). However, when SMILES-based reaction context is included, performance rises to 45% (MS2) and 64% (IR) on the reaction prediction task, and 29% (MS2) and 60% (IR) on retrosynthesis (Table 11). This indicates the model uses spectral data as a soft prior to constrain textual deductions. Furthermore, the paper compares ChemDFM-X’s spectral identification performance exclusively against text-only LLMs that cannot process spectra, omitting comparisons against established specialist tools.</p>
</li>
<li>
<p><strong>Surrogate Distillation Trade-offs</strong>: Because the spectral training data relies entirely on outputs from CFM-ID 4.0 and Chemprop-IR, ChemDFM-X effectively distills these surrogate models. Any inherent predictive biases or inaccuracies from these underlying tools are permanently embedded in the new ChemDFM-X encoders.</p>
</li>
</ol>
<p><strong>Main Conclusion</strong>: The &ldquo;separate encoders + unified decoder&rdquo; architecture with synthetic data generation enables effective multimodal chemical understanding, bridging the gap between specialist and generalist AI systems for chemistry.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed a <strong>7.6M sample instruction-tuning dataset</strong> derived from <strong>1.3M seed SMILES</strong> (sourced from PubChem and USPTO). <strong>Note</strong>: The final 7.6M multimodal tuning dataset itself isn&rsquo;t publicly available.</p>
<p><strong>Generation Pipeline</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Generation Method</th>
          <th>Tool/Model</th>
          <th>Sample Count</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Graphs</strong></td>
          <td>Direct extraction from SMILES</td>
          <td>RDKit</td>
          <td>1.1M</td>
      </tr>
      <tr>
          <td><strong>3D Conformations</strong></td>
          <td>Force field optimization</td>
          <td>RDKit + MMFF94</td>
          <td>1.3M (pseudo-optimal)</td>
      </tr>
      <tr>
          <td><strong>Molecule Images</strong></td>
          <td>Rendering with augmentation</td>
          <td>RDKit, Indigo, ChemPix</td>
          <td>~1M (including handwritten style)</td>
      </tr>
      <tr>
          <td><strong>Reaction Images</strong></td>
          <td>Rendering from reaction SMILES</td>
          <td>RDKit</td>
          <td>300K</td>
      </tr>
      <tr>
          <td><strong>MS2 Spectra</strong></td>
          <td>Computational prediction</td>
          <td>CFM-ID 4.0</td>
          <td>~700K</td>
      </tr>
      <tr>
          <td><strong>IR Spectra</strong></td>
          <td>Computational prediction</td>
          <td>Chemprop-IR</td>
          <td>~1M</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Augmentation</strong>:</p>
<ul>
<li>Molecule images augmented with &ldquo;handwritten&rdquo; style using the ChemPix pipeline</li>
<li>Multiple rendering styles (RDKit default, Indigo clean)</li>
<li>Spectra generated at multiple energy levels (10 eV, 20 eV, 40 eV for MS2)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Architecture</strong>: &ldquo;Separate Encoders + Unified Decoder&rdquo;</p>
<p><strong>Code Availability</strong>: The authors have only released inference logic. The cross-modal projection training and synthetic data-generation scripts are closed.</p>
<p><strong>Modality Alignment</strong>:</p>
<ul>
<li>Each modality has a dedicated encoder (frozen pre-trained models where available)</li>
<li>For graph, conformation, MS2, and IR modalities: <strong>2-layer MLP projector</strong> (Linear, GELU, Linear) maps encoder features to LLM input space</li>
<li>For images: <strong>H-Reducer</strong> module compresses image tokens by factor of $n=8$ to handle high-resolution chemical images, then projects to LLM input space</li>
<li>All projected features are concatenated and fed to the unified LLM decoder</li>
</ul>
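<p>A minimal NumPy sketch of the 2-layer MLP projector (Linear, GELU, Linear). The 300-dimensional input matches the graph encoder feature size reported in the hyperparameter table; the 5120-dimensional output and the initialization scale are assumptions (5120 is the typical hidden size of a 13B LLaMA-family decoder):</p>

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class MLPProjector:
    """Maps frozen-encoder features into the LLM input embedding space."""

    def __init__(self, d_in=300, d_hidden=5120, d_out=5120, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(0.0, 0.02, (d_hidden, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, x):
        # Linear -> GELU -> Linear, applied row-wise to a batch of features.
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2
```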
<h3 id="models">Models</h3>
<p><strong>Base LLM</strong>:</p>
<ul>
<li><strong>ChemDFM (13B)</strong>: LLaMA-based model pre-trained on chemical text and SMILES</li>
</ul>
<p><strong>Modality Encoders</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Encoder</th>
          <th>Pre-training Data</th>
          <th>Parameter Count</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Graph</strong></td>
          <td>Mole-BERT</td>
          <td>2M molecules</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>3D Conformation</strong></td>
          <td>Uni-Mol</td>
          <td>209M conformations</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>Image</strong></td>
          <td>CLIP (ViT)</td>
          <td>General domain</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>MS2 Spectrum</strong></td>
          <td>Transformer (SeqT)</td>
          <td>Trained from scratch</td>
          <td>-</td>
          <td><strong>Trainable</strong></td>
      </tr>
      <tr>
          <td><strong>IR Spectrum</strong></td>
          <td>Transformer (SeqT)</td>
          <td>Trained from scratch</td>
          <td>-</td>
          <td><strong>Trainable</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Design Rationale</strong>: MS2 and IR encoders trained from scratch as Sequence Transformers treating spectral peaks as token sequences, since no suitable pre-trained models exist for chemical spectra.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Accuracy (Acc)</strong> for recognition tasks</li>
<li><strong>BLEU-2/4</strong> and <strong>METEOR</strong> for captioning tasks</li>
<li><strong>AUC-ROC</strong> for property prediction (classification)</li>
</ul>
<p><strong>Code Availability</strong>: The adapted code for evaluating on ChemLLMBench and their custom spectral recognition tasks is closed-source.</p>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>ChemLLMBench</strong>: Adapted for multimodal inputs across molecule captioning, property prediction, and reaction understanding</li>
<li><strong>MoleculeNet</strong>: Standard molecular property prediction tasks (BACE, BBBP, ClinTox, HIV, Tox21)</li>
<li><strong>USPTO</strong>: Reaction prediction and retrosynthesis tasks</li>
<li><strong>Custom Spectral Tasks</strong>: Novel evaluations requiring spectral interpretation</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Note</strong>: The type and quantity of GPUs used, along with the total training wall-time, were not published.</p>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Total Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 3</li>
<li><strong>Optimizer</strong>: AdamW</li>
</ul>
<p><strong>Modality-Specific Learning Rates (Peak)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Learning Rate</th>
          <th>Feature Dimension</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph</td>
          <td>1e-5</td>
          <td>300</td>
      </tr>
      <tr>
          <td>Conformation</td>
          <td>2e-4</td>
          <td>512</td>
      </tr>
      <tr>
          <td>Image</td>
          <td>2e-3</td>
          <td>1024</td>
      </tr>
      <tr>
          <td>MS2 / IR</td>
          <td>2e-4</td>
          <td>768</td>
      </tr>
  </tbody>
</table>
<p><strong>Note</strong>: Different learning rates reflect the varying degrees of domain adaptation required. Images (general CLIP) need more adaptation than graphs (chemical Mole-BERT).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OpenDFM/ChemDFM-X">ChemDFM-X (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Inference code only; training and data generation scripts are closed</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/OpenDFM/ChemDFM-X-v1.0-13B">ChemDFM-X-v1.0-13B (HuggingFace)</a></td>
          <td>Model</td>
          <td>AGPL-3.0</td>
          <td>13B parameter multimodal model weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhao, Z., Chen, B., Li, J., Chen, L., Wen, L., Wang, P., Zhu, Z., Zhang, D., Wan, Z., Li, Y., Dai, Z., Chen, X., &amp; Yu, K. (2024). ChemDFM-X: Towards Large Multimodal Model for Chemistry. <em>Science China Information Sciences</em>, 67(12), 220109. <a href="https://doi.org/10.1007/s11432-024-4243-0">https://doi.org/10.1007/s11432-024-4243-0</a></p>
<p><strong>Publication</strong>: Science China Information Sciences, December 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2409.13194">arXiv Version</a></li>
<li><a href="https://github.com/OpenDFM/ChemDFM-X">Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhaoChemDFMXLargeMultimodal2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemDFM-X}}: {{Towards Large Multimodal Model}} for {{Chemistry}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhao, Zihan and Chen, Bo and Li, Jingpiao and Chen, Lu and Wen, Liyang and Wang, Pengyu and Zhu, Zichen and Zhang, Danyang and Wan, Ziping and Li, Yansi and Dai, Zhongyang and Chen, Xin and Yu, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Science China Information Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{220109}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11432-024-4243-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2409.13194}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RFL: Simplifying Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/rfl/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/rfl/</guid><description>Novel Ring-Free Language representation and Molecular Skeleton Decoder architecture for improved optical chemical structure recognition from images.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, Q., Chen, M., Pi, C., et al. (2024). RFL: Simplifying Chemical Structure Recognition with Ring-Free Language. <em>arXiv preprint arXiv:2412.07594</em>. <a href="https://doi.org/10.48550/arXiv.2412.07594">https://doi.org/10.48550/arXiv.2412.07594</a></p>
<p><strong>Publication</strong>: arXiv preprint (December 2024)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/JingMog/RFL-MSD">Official Code Repository</a></li>
</ul>
<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$). It introduces a novel representation system (Ring-Free Language) and a specialized neural architecture (Molecular Skeleton Decoder) designed to solve specific limitations in converting 2D images to 1D chemical strings. The paper validates this method through direct comparison with state-of-the-art baselines and ablation studies.</p>
<h2 id="motivation-limitations-of-1d-serialization">Motivation: Limitations of 1D Serialization</h2>
<p>Current Optical Chemical Structure Recognition (OCSR) methods typically rely on &ldquo;unstructured modeling,&rdquo; where 2D molecular graphs are flattened into 1D strings like SMILES or SSML. While simple, these linear formats struggle to explicitly capture complex spatial relationships, particularly in molecules with multiple rings and branches. End-to-end models often fail to &ldquo;understand&rdquo; the graph structure when forced to predict these implicit 1D sequences, leading to error accumulation in complex scenarios.</p>
<h2 id="innovation-ring-free-language-rfl-and-molecular-skeleton-decoder-msd">Innovation: Ring-Free Language (RFL) and Molecular Skeleton Decoder (MSD)</h2>
<p>The authors propose two primary contributions to decouple spatial complexity:</p>
<ol>
<li><strong>Ring-Free Language (RFL)</strong>: A divide-and-conquer representation that splits a molecular graph $G$ into three explicit components: a molecular skeleton $\mathcal{S}$, individual ring structures $\mathcal{R}$, and branch information $\mathcal{F}$. This allows rings to be collapsed into &ldquo;SuperAtoms&rdquo; or &ldquo;SuperBonds&rdquo; during initial parsing.</li>
<li><strong>Molecular Skeleton Decoder (MSD)</strong>: A hierarchical architecture that progressively predicts the skeleton first, then the individual rings (using SuperAtom features as conditions), and finally classifies the branch connections.</li>
</ol>
<h2 id="methodology-and-experiments">Methodology and Experiments</h2>
<p>The method was evaluated on both handwritten and printed chemical structures against two baselines: DenseWAP (Zhang et al. 2018) and RCGD (Hu et al. 2023).</p>
<ul>
<li><strong>Datasets</strong>:
<ul>
<li><strong>EDU-CHEMC</strong>: ~49k handwritten samples (challenging, diverse styles)</li>
<li><strong>Mini-CASIA-CSDB</strong>: ~89k printed samples (from ChEMBL)</li>
<li><strong>Synthetic Complexity Dataset</strong>: A custom split of ChEMBL data grouped by structural complexity (atoms + bonds + rings) to test generalization</li>
</ul>
</li>
<li><strong>Ablation Studies</strong>: Tested the necessity of the MSD architecture vs. standard decoders and the impact of the <code>[conn]</code> token for filtering branch candidates</li>
</ul>
<h2 id="outcomes-and-conclusions">Outcomes and Conclusions</h2>
<ul>
<li><strong>SOTA Performance</strong>: The proposed method (MSD-RCGD) outperformed the previous state-of-the-art (RCGD) on both handwritten (95.38% Exact Match) and printed (95.58% Exact Match) datasets.</li>
<li><strong>Universal Improvement</strong>: Applying MSD/RFL to an older baseline (DenseWAP) improved its accuracy significantly (e.g., from 81.9% to 87.6% on complex sets), proving the method is model-agnostic.</li>
<li><strong>Complexity Handling</strong>: The method showed robust generalization on unseen high-complexity molecules, where standard 1D baselines failed completely (0% accuracy for standard DenseWAP vs. ~30% for MSD version on the hardest tier).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors utilized one handwritten and one printed dataset, plus a synthetic set for stress-testing complexity.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>EDU-CHEMC</strong></td>
          <td>48,998 Train / 2,992 Test</td>
          <td>Handwritten images from educational scenarios</td>
      </tr>
      <tr>
          <td><strong>Training/Test</strong></td>
          <td><strong>Mini-CASIA-CSDB</strong></td>
          <td>89,023 Train / 8,287 Test</td>
          <td>Printed images rendered from ChEMBL using RDKit</td>
      </tr>
      <tr>
          <td><strong>Generalization</strong></td>
          <td><strong>ChEMBL Subset</strong></td>
          <td>5 levels of complexity</td>
          <td>Custom split based on the complexity score $N_{atom} + N_{bond} + 12 \times N_{ring}$</td>
      </tr>
  </tbody>
</table>
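<p>The complexity split in the table reduces to a simple scoring function; a minimal sketch (the function name is mine, not the paper's):</p>

```python
def complexity_score(n_atoms: int, n_bonds: int, n_rings: int) -> int:
    """Structural complexity per the paper's equation:
    N_atom + N_bond + 12 * N_ring."""
    return n_atoms + n_bonds + 12 * n_rings

# Benzene: 6 atoms, 6 bonds, 1 ring -> 6 + 6 + 12 = 24
print(complexity_score(6, 6, 1))  # 24
```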
<h3 id="algorithms">Algorithms</h3>
<p><strong>RFL Splitting (Encoding)</strong>:</p>
<ol>
<li><strong>Detect Rings</strong>: Use DFS to find all non-nested rings $\mathcal{R}$.</li>
<li><strong>Determine Adjacency ($\gamma$)</strong>: Calculate shared edges between rings.</li>
<li><strong>Merge</strong>:
<ul>
<li>If $\gamma(r_i) = 0$ (isolated), merge ring into a <strong>SuperAtom</strong> node.</li>
<li>If $\gamma(r_i) &gt; 0$ (adjacent), merge ring into a <strong>SuperBond</strong> edge.</li>
</ul>
</li>
<li><strong>Update</strong>: Record connection info in $\mathcal{F}$ and remove ring details from the main graph to form Skeleton $\mathcal{S}$.</li>
</ol>
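<p>The adjacency/merge decision (steps 2&ndash;3 above) can be sketched as follows. Ring detection via DFS is omitted; rings are given directly as sets of edges, and $\gamma$ counts edges shared with any other ring. Function names are illustrative, not the paper's:</p>

```python
def shared_edge_count(ring, other_rings):
    """Gamma: number of edges this ring shares with the other rings."""
    return sum(len(ring & other) for other in other_rings)

def classify_rings(rings):
    """Map each ring to its merge target: a SuperAtom node if isolated
    (gamma == 0), or a SuperBond edge if adjacent to another ring."""
    labels = []
    for i, ring in enumerate(rings):
        others = rings[:i] + rings[i + 1:]
        gamma = shared_edge_count(ring, others)
        labels.append("SuperBond" if gamma > 0 else "SuperAtom")
    return labels

# Two fused rings sharing edge (3, 4), plus one isolated ring.
rings = [
    {(1, 2), (2, 3), (3, 4), (4, 1)},
    {(3, 4), (4, 5), (5, 6), (6, 3)},
    {(7, 8), (8, 9), (9, 7)},
]
print(classify_rings(rings))  # ['SuperBond', 'SuperBond', 'SuperAtom']
```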
<p><strong>MSD Decoding</strong>:</p>
<ul>
<li><strong>Hierarchical Prediction</strong>: The model predicts the Skeleton $\mathcal{S}$ first.</li>
<li><strong>Contextual Ring Prediction</strong>: When a SuperAtom/Bond token is predicted, its hidden state $f^s$ is stored. After the skeleton is finished, $f^s$ is used as a condition to autoregressively decode the specific ring structure.</li>
<li><strong>Token <code>[conn]</code></strong>: A special token separates connected ring bonds from unconnected ones to sparsify the branch classification task.</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard Image-to-Sequence pattern but with a forked decoder.</p>
<ul>
<li><strong>Encoder</strong>: DenseNet (Growth rate=24, Depth=32 per block)</li>
<li><strong>Decoder (MSD)</strong>:
<ul>
<li><strong>Core</strong>: GRU with Attention (Hidden dim=256, Embedding dim=256, Dropout=0.15)</li>
<li><strong>Skeleton Module</strong>: Autoregressively predicts sequence tokens. Uses Maxout activation.</li>
<li><strong>Branch Module</strong>: A binary classifier (MLP) taking concatenated features of skeleton bonds $f_{bs}$ and ring bonds $f_{br}$ to predict connectivity matrix $\mathcal{F}$.</li>
</ul>
</li>
<li><strong>Loss Function</strong>: $\mathcal{L} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{cls}$ (where $\lambda_1 = \lambda_2 = 1$)</li>
</ul>
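<p>A hedged numpy sketch of that combined objective: cross-entropy over skeleton tokens plus a binary classification loss for branch connectivity, weighted equally. This is my illustration of the formula, not the authors' code:</p>

```python
import numpy as np

def cross_entropy(logits, targets):
    """Token-level cross-entropy (skeleton decoding loss, L_ce)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def binary_ce(probs, labels, eps=1e-9):
    """Branch-connectivity classification loss, L_cls."""
    probs = np.clip(probs, eps, 1 - eps)
    return -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()

def total_loss(logits, targets, probs, labels, lam1=1.0, lam2=1.0):
    """L = lambda_1 * L_ce + lambda_2 * L_cls, with both lambdas = 1."""
    return lam1 * cross_entropy(logits, targets) + lam2 * binary_ce(probs, labels)
```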
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on exact image reconstruction and structural validity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>EM (Exact Match)</strong></td>
          <td>% of images where predicted graph exactly matches ground truth.</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td><strong>Struct-EM</strong></td>
          <td>% of correctly identified chemical structures (ignoring non-chemical text).</td>
          <td>Auxiliary metric</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 (32 GB) GPUs</li>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch size: 8 (Handwritten), 32 (Printed)</li>
<li>Epochs: 50</li>
<li>Optimizer: Adam ($lr=2\times10^{-4}$, decayed by 0.5 via MultiStepLR)</li>
</ul>
</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{changRFLSimplifyingChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{RFL: Simplifying Chemical Structure Recognition with Ring-Free Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{RFL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chang, Qikai and Chen, Mingjun and Pi, Changpeng and Hu, Pengfei and Zhang, Zhenrong and Ma, Jiefeng and Du, Jun and Yin, Baocai and Hu, Jinshui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2412.07594}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2412.07594}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2412.07594}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolSight: OCSR with RL and Multi-Granularity Learning</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/molsight/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/molsight/</guid><description>A three-stage OCSR framework using SMILES pretraining, auxiliary bond/coordinate tasks, and reinforcement learning to master stereochemistry recognition.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, W., Wang, X., Feng, B., &amp; Liu, W. (2025). MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning. <em>arXiv preprint arXiv:2511.17300</em>. <a href="https://doi.org/10.48550/arXiv.2511.17300">https://doi.org/10.48550/arXiv.2511.17300</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/hustvl/MolSight">Official Repository</a></li>
</ul>
<h2 id="contribution-a-framework-for-optical-chemical-structure-recognition">Contribution: A Framework for Optical Chemical Structure Recognition</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a novel three-stage training framework (Pretraining → Fine-tuning → RL Post-training) to improve Optical Chemical Structure Recognition (OCSR). Specifically, it introduces Group Relative Policy Optimization (GRPO) to optimize non-differentiable chemical-validity objectives.</p>
<p>It also has a <strong>Resource</strong> component, as the authors construct and release <em>Stereo-200k</em>, a dataset specifically designed to train models on challenging stereoisomeric molecules.</p>
<h2 id="motivation-resolving-stereochemical-cues">Motivation: Resolving Stereochemical Cues</h2>
<p>Existing OCSR systems struggle to accurately recognize stereochemical information (e.g., chirality, geometric isomerism) because the visual cues distinguishing stereoisomers (such as wedge and dash bonds) are subtle. Current methods often fail to capture the geometric relationships required to distinguish molecules with identical connectivity but different spatial arrangements. Accurate recognition is critical for downstream tasks like drug discovery where stereochemistry determines pharmacological effects.</p>
<h2 id="core-innovations-grpo-and-multi-granularity-learning">Core Innovations: GRPO and Multi-Granularity Learning</h2>
<p>MolSight introduces three key technical innovations:</p>
<ol>
<li><strong>Reinforcement Learning for OCSR</strong>: It is the first OCSR system to incorporate RL (specifically GRPO) to directly optimize for chemical semantic correctness.</li>
<li><strong>Multi-Granularity Learning</strong>: It employs auxiliary heads for chemical bond classification and atom localization. Unlike previous approaches that optimize these jointly, MolSight decouples the coordinate head to prevent interference with SMILES generation.</li>
<li><strong>SMILES-M Notation</strong>: A lightweight extension to SMILES to handle Markush structures (common in patents) without significant sequence length increase.</li>
</ol>
<h2 id="experimental-methodology">Experimental Methodology</h2>
<p>The authors evaluated MolSight using a rigorous mix of real and synthetic benchmarks:</p>
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec) and deep learning methods (MolScribe, MolGrapher, DECIMER).</li>
<li><strong>Benchmarks</strong>: Evaluated on real-world datasets (USPTO, Maybridge UoB, CLEF-2012, JPO) and synthetic datasets (Staker, ChemDraw, Indigo, Stereo-2K).</li>
<li><strong>Ablation Studies</strong>: Tested the impact of the bond head, coordinate head, and RL stages separately.</li>
<li><strong>Transfer Learning</strong>: Assessed the quality of learned representations by using the frozen encoder for molecular property prediction on MoleculeNet.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>SOTA Performance</strong>: MolSight achieved 85.1% stereochemical accuracy on the USPTO dataset, significantly outperforming the previous SOTA (MolScribe) which achieved 69.0%.</li>
<li><strong>RL Effectiveness</strong>: Reinforcement learning post-training specifically improved performance on stereoisomers, raising Tanimoto similarity and exact match rates on the Stereo-2k test set.</li>
<li><strong>Robustness</strong>: The model maintained high performance even on perturbed (rotated/sheared) and low-resolution images, outperforming rule-based methods significantly in these scenarios.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline uses three distinct data sources:</p>
<ol>
<li><strong>Pre-training</strong>: <em>MolParser-7M</em>. Contains diverse images but requires the <strong>SMILES-M</strong> extension to handle Markush structures.</li>
<li><strong>Fine-tuning</strong>: <em>PubChem-1M</em> and <em>USPTO-680K</em>. Used for multi-granularity learning with bond and coordinate labels.</li>
<li><strong>RL Post-training</strong>: <em>Stereo-200k</em>. A self-collected dataset from the first 2M compounds in PubChem, filtered for chirality (&lsquo;@&rsquo;) and cis-trans isomerism (&lsquo;/&rsquo;, &lsquo;\&rsquo;). It uses 5 different RDKit drawing styles to ensure robustness.</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Reinforcement Learning</strong>: Uses <strong>GRPO (Group Relative Policy Optimization)</strong>.
<ul>
<li><strong>Reward Function</strong>: A linear combination of Tanimoto similarity and Stereochemical exact match.
$$ R = w_t \cdot \text{Tanimoto} + w_s \cdot \text{ExactMatch} $$
where $w_t=0.4$ and $w_s=0.6$.</li>
<li><strong>Sampling</strong>: Samples 4 completions per image with temperature 1.0 during RL training.</li>
</ul>
</li>
<li><strong>Auxiliary Tasks</strong>:
<ul>
<li><strong>Bond Classification</strong>: Concatenates hidden states of two atom queries to predict bond type via MLP.</li>
<li><strong>Atom Localization</strong>: Treated as a classification task (SimCC) but optimized using <strong>Maximum Likelihood Estimation (MLE)</strong> to account for uncertainty.</li>
</ul>
</li>
</ul>
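<p>The GRPO reward above can be sketched as follows. This is a hedged illustration: a real implementation would compute Tanimoto over RDKit fingerprints, whereas here plain on-bit sets stand in, and <code>grpo_reward</code> is my name, not the paper's:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity over fingerprint on-bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def grpo_reward(pred_smiles, true_smiles, fp_pred, fp_true,
                w_t=0.4, w_s=0.6):
    """R = w_t * Tanimoto + w_s * ExactMatch, with the paper's weights."""
    exact = 1.0 if pred_smiles == true_smiles else 0.0
    return w_t * tanimoto(fp_pred, fp_true) + w_s * exact

print(grpo_reward("C/C=C/C", "C/C=C/C", {1, 2, 3}, {1, 2, 3}))  # 1.0
print(grpo_reward("CC=CC", "C/C=C/C", {1, 2}, {1, 2, 3}))       # 0.4 * 2/3 ~ 0.267
```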
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer.
<ul>
<li><strong>Encoder</strong>: <strong>EfficientViT-L1</strong> (~53M params), chosen for linear attention efficiency.</li>
<li><strong>Decoder</strong>: 6-layer Transformer with <strong>RoPE</strong>, <strong>SwiGLU</strong>, and <strong>RMSNorm</strong>. Randomly initialized (no LLM weights) due to vocabulary mismatch.</li>
<li><strong>Coordinate Head</strong>: Separated from the main decoder. It adds 2 extra Transformer layers to process atom queries before prediction to improve accuracy.</li>
</ul>
</li>
<li><strong>Parameter Tuning</strong>:
<ul>
<li>Stage 3 (RL) uses <strong>LoRA</strong> (Rank=8, Alpha=16) to optimize the decoder.</li>
</ul>
</li>
</ul>
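<p>The LoRA update used in Stage 3 (Rank=8, Alpha=16) factorizes the weight delta into two low-rank matrices while the base weight stays frozen. A minimal numpy sketch (dimensions are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, rank))                 # trainable, zero init

def lora_forward(x):
    """y = W x + (alpha / rank) * B A x -- only A and B are trained."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero the adapter is inactive, so the adapted
# layer starts out identical to the base layer.
assert np.allclose(lora_forward(x), W @ x)
```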
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Exact Match</strong>: String identity for standard SMILES.</li>
<li><strong>Tanimoto Coefficient</strong>: Fingerprint similarity for chemical semantics.</li>
<li><strong>OKS (Object Keypoint Similarity)</strong>: Used specifically for evaluating atom localization accuracy.</li>
</ul>
</li>
<li><strong>Perturbation</strong>: Robustness tested with random rotations [-5°, 5°] and xy-shearing [-0.1, 0.1].</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training and inference performed on a single node.</li>
<li><strong>Processors</strong>: Intel Xeon Silver 4210R CPU.</li>
<li><strong>Accelerators</strong>: 4x <strong>NVIDIA GeForce RTX 3090/4090</strong> GPUs.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Stage 1: Batch size 512, LR $4 \times 10^{-4}$.</li>
<li>Stage 2: Batch size 256, Bond head LR $4 \times 10^{-4}$, Coord head LR $4 \times 10^{-5}$.</li>
<li>Stage 3 (RL): Batch size 64, Base LR $1 \times 10^{-4}$.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{zhang2025molsight,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wenrui Zhang and Xinggang Wang and Bin Feng and Wenyu Liu}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2511.17300}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2511.17300}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher: Multi-modal Markush Structure Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/markushgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/markushgrapher/</guid><description>Multi-modal transformer combining vision, text, and layout encoding to extract complex Markush structures from patent documents with OCSR.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Weber, V., Nassar, A., Meijer, G. I., Van Gool, L., Li, Y., &amp; Staar, P. (2025). MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures. <em>2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 14505-14515. <a href="https://doi.org/10.1109/CVPR52734.2025.01352">https://doi.org/10.1109/CVPR52734.2025.01352</a></p>
<p><strong>Publication</strong>: CVPR 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository</a></li>
</ul>
<h2 id="overcoming-unimodal-limitations-for-markush-structures">Overcoming Unimodal Limitations for Markush Structures</h2>
<p>The automated analysis of chemical literature, particularly patents, is critical for drug discovery and material science. A major bottleneck is the extraction of <strong>Markush structures</strong>, which are complex chemical templates that represent families of molecules using a core backbone image and textual variable definitions. Existing methods are limited because they either rely solely on images (OCSR) and miss the textual context, or focus solely on text and miss the structural backbone. This creates a practical need for a unified, multi-modal approach that jointly interprets visual and textual data to accurately extract these structures for prior-art search and database construction. This paper proposes a <strong>Method</strong> and introduces a new <strong>Resource</strong> (M2S dataset) to bridge this gap.</p>
<h2 id="markushgrapher-the-multi-modal-architecture">MarkushGrapher: The Multi-Modal Architecture</h2>
<p>The core innovation is <strong>MarkushGrapher</strong>, a multi-modal architecture that jointly encodes image, text, and layout information. Key contributions include:</p>
<ul>
<li><strong>Dual-Encoder Architecture</strong>: Combines a Vision-Text-Layout (VTL) encoder (based on UDOP) with a specialized, pre-trained Optical Chemical Structure Recognition (OCSR) encoder (MolScribe). Let $E_{\text{VTL}}$ represent the combined sequence embedding and $E_{\text{OCSR}}$ represent the domain-specific visual embeddings.</li>
<li><strong>Joint Recognition</strong>: The model autoregressively generates a sequential graph representation (Optimized CXSMILES) and a substituent table simultaneously. It leverages cross-modal dependencies, allowing text to clarify ambiguous visual details like bond types.</li>
<li><strong>Synthetic Data Pipeline</strong>: A comprehensive pipeline generates realistic synthetic Markush structures (images and text) from PubChem data, overcoming the lack of labeled training data.</li>
<li><strong>Optimized Representation</strong>: A compacted version of CXSMILES moves variable groups into the SMILES string and adds explicit atom indexing to handle complex &ldquo;frequency&rdquo; and &ldquo;position&rdquo; variation indicators.</li>
</ul>
<h2 id="experimental-validation-on-the-new-m2s-benchmark">Experimental Validation on the New M2S Benchmark</h2>
<p>The authors validated their approach using the following setup:</p>
<ul>
<li><strong>Baselines</strong>: Compared against image-only chemistry models (DECIMER, MolScribe) and general-purpose multi-modal models (Uni-SMART, GPT-4o, Pixtral, Llama-3.2).</li>
<li><strong>Datasets</strong>: Evaluated on three benchmarks:
<ol>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 generated samples.</li>
<li><strong>M2S</strong>: A new benchmark of 103 manually annotated real-world patent images.</li>
<li><strong>USPTO-Markush</strong>: 74 Markush backbone images from USPTO patents.</li>
</ol>
</li>
<li><strong>Ablation Studies</strong>: Analyzed the impact of the OCSR encoder, late fusion strategies, and the optimized CXSMILES format.</li>
</ul>
<h2 id="reaching-state-of-the-art-exact-matches">Reaching State-of-the-Art Exact Matches</h2>
<ul>
<li><strong>Performance Limits</strong>: MarkushGrapher significantly outperformed all baselines. On the M2S benchmark, it achieved a 38% Exact Match on CXSMILES (compared to 21% for MolScribe) and 29% Exact Match on tables.</li>
<li><strong>Complex Feature Handling</strong>: The model demonstrates a unique capability to reliably recognize complex Markush features like frequency variation (&lsquo;Sg&rsquo;) and position variation (&lsquo;m&rsquo;) indicators. All baselines scored near zero on these specific sub-tasks.</li>
<li><strong>Cross-Modal Reasoning</strong>: Qualitative analysis verified the model can correctly infer visual details (such as bond order) that appear ambiguous in the image but become apparent with the text description.</li>
<li><strong>Robustness</strong>: The model generalizes well to real-world data despite being trained purely on synthetic data.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong></p>
<ul>
<li><strong>Source</strong>: Synthetic dataset generated from PubChem SMILES.</li>
<li><strong>Size</strong>: 210,000 synthetic images.</li>
<li><strong>Pipeline</strong>:
<ol>
<li><strong>Selection</strong>: Sampled SMILES from PubChem based on substructure diversity.</li>
<li><strong>Augmentation</strong>: SMILES augmented to artificial CXSMILES using RDKit (inserting variable groups, frequency indicators).</li>
<li><strong>Rendering</strong>: Images rendered using Chemistry Development Kit (CDK) with randomized drawing parameters (font, bond width, spacing).</li>
<li><strong>Text Generation</strong>: Textual definitions generated using manual templates extracted from patents; 10% were paraphrased using Mistral-7B-Instruct-v0.3 to increase diversity.</li>
<li><strong>OCR</strong>: Bounding boxes extracted via a custom SVG parser aligned with MOL files.</li>
</ol>
</li>
</ul>
<p><strong>Evaluation Data</strong></p>
<ul>
<li><strong>M2S Dataset</strong>: 103 images from USPTO, EPO, and WIPO patents (1999-2023), manually annotated with CXSMILES and substituent tables.</li>
<li><strong>USPTO-Markush</strong>: 74 images from USPTO patents (2010-2016).</li>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 samples generated via the pipeline.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimized CXSMILES</strong>:
<ul>
<li><strong>Compression</strong>: Variable groups moved from the extension block to the main SMILES string as special atoms to reduce sequence length.</li>
<li><strong>Indexing</strong>: Atom indices appended to each atom (e.g., <code>C:1</code>) to explicitly link the graph to the extension block (crucial for <code>m</code> and <code>Sg</code> sections).</li>
<li><strong>Vocabulary</strong>: Specific tokens used for atoms and bonds.</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Standard image augmentations (shift, scale, blur, pepper noise, random lines) and OCR text augmentations (character substitution/insertion/deletion).</li>
</ul>
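<p>The atom-indexing idea (e.g., <code>C:1</code>) can be illustrated with a deliberately toy tokenizer. This is not the paper's implementation: it only handles bracket-free, single-character uppercase atoms and ignores ring-closure digits and two-letter elements:</p>

```python
import re

def index_atoms(smiles: str) -> str:
    """Append :n to each atom token (toy version: single uppercase
    atoms only; no brackets, ring closures, or two-letter elements)."""
    counter = 0
    def tag(match):
        nonlocal counter
        counter += 1
        return f"{match.group(0)}:{counter}"
    return re.sub(r"[A-Z]", tag, smiles)

print(index_atoms("CCO"))     # C:1C:2O:3
print(index_atoms("C=CC#N"))  # C:1=C:2C:3#N:4
```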
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer.
<ul>
<li><strong>VTL Encoder</strong>: T5-large encoder (initialized from UDOP) that processes image patches, text tokens, and layout (bounding boxes).</li>
<li><strong>OCSR Encoder</strong>: Vision encoder from MolScribe (Swin Transformer), frozen during training.</li>
<li><strong>Text Decoder</strong>: T5-large decoder.</li>
</ul>
</li>
<li><strong>Fusion Strategy</strong>: <strong>Late Fusion</strong>. The multi-modal alignment explicitly combines the vision-text-layout features with the specialized chemical vision features. The fused representation $H_{\text{fused}}$ is the VTL output $e_1$ concatenated with the projected OCSR output $e_2$ before decoding:
$$ H_{\text{fused}} = [e_1 \oplus \text{Proj}(e_2)] $$</li>
<li><strong>Parameters</strong>: 831M total (744M trainable).</li>
</ul>
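<p>A numpy sketch of that late-fusion step: the OCSR embeddings are linearly projected into the VTL dimension and concatenated with the VTL sequence before decoding. Dimensions are illustrative, and whether $\oplus$ concatenates along the sequence or feature axis is an implementation detail; the sequence axis is shown here:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_vtl, d_ocsr = 16, 1024, 768

e1 = rng.normal(size=(seq_len, d_vtl))             # VTL encoder output
e2 = rng.normal(size=(seq_len, d_ocsr))            # frozen OCSR encoder output
W_proj = rng.normal(size=(d_ocsr, d_vtl)) * 0.01   # learned projection

# H_fused = [e1 (+) Proj(e2)]: project e2, then concatenate token-wise.
h_fused = np.concatenate([e1, e2 @ W_proj], axis=0)
print(h_fused.shape)  # (32, 1024)
```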
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>CXSMILES Exact Match (EM)</strong>: Requires perfect match of SMILES string, variable groups, <code>m</code> sections, and <code>Sg</code> sections (ignoring stereochemistry).</li>
<li><strong>Tanimoto Score</strong>: Similarity of RDKit DayLight fingerprints (Markush features removed).</li>
<li><strong>Table Exact Match</strong>: All variable groups and substituents must match.</li>
<li><strong>Table F1-Score</strong>: Aggregated recall and precision of substituents per variable group.</li>
</ul>
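<p>One plausible reading of the table F1 metric is a micro-averaged F1 over substituents grouped by variable; this is my formulation for illustration, and the paper's exact aggregation may differ:</p>

```python
def table_f1(pred: dict, gold: dict) -> float:
    """Micro-averaged F1 over substituents, grouped by variable name."""
    tp = fp = fn = 0
    for var in pred.keys() | gold.keys():
        p, g = set(pred.get(var, [])), set(gold.get(var, []))
        tp += len(p & g)   # substituents predicted and present
        fp += len(p - g)   # predicted but absent
        fn += len(g - p)   # present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

pred = {"R1": ["H", "CH3"], "R2": ["OH"]}
gold = {"R1": ["H", "CH3", "Cl"], "R2": ["OH"]}
print(round(table_f1(pred, gold), 3))  # 0.857
```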
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Trained on a single NVIDIA H100 GPU.</li>
<li><strong>Training Config</strong>: 10 epochs, batch size of 10, learning rate 5e-4, 100 warmup steps, weight decay 1e-3.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMarkushGrapherJointVisual2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MarkushGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Weber, Valéry and Nassar, Ahmed and Meijer, Gerhard Ingmar and Van Gool, Luc and Li, Yawei and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14505--14515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/CVPR52734.2025.01352}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2InChI: SwinTransformer for Molecular Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/image2inchi/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/image2inchi/</guid><description>Deep learning model using improved SwinTransformer encoder and attention-based feature fusion to convert molecular images to InChI strings.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, D., Xu, X., Pan, J., Gao, W., &amp; Zhang, S. (2024). Image2InChI: Automated Molecular Optical Image Recognition. <em>Journal of Chemical Information and Modeling</em>, 64(9), 3640-3649. <a href="https://doi.org/10.1021/acs.jcim.3c02082">https://doi.org/10.1021/acs.jcim.3c02082</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></li>
</ul>
<p><strong>Note</strong>: These notes are based on the Abstract and Supporting Information files only.</p>
<h2 id="image2inchi-as-a-methodological-innovation">Image2InChI as a Methodological Innovation</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>. It proposes a specific new deep learning architecture (&ldquo;Image2InChI&rdquo;) to solve the task of Optical Chemical Structure Recognition (OCSR). The rhetorical focus is on engineering a system that outperforms baselines on specific metrics (InChI accuracy, MCS accuracy) and providing a valuable reference for future algorithmic work.</p>
<h2 id="bottlenecks-in-chemical-literature-digitization">Bottlenecks in Chemical Literature Digitization</h2>
<p>The accurate digitization of chemical literature is a bottleneck in AI-driven drug discovery. Chemical structures in patents and papers exist as optical images (pixels), but machine learning models require machine-readable string representations (like <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> or <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>). Efficiently and automatically bridging this gap is a prerequisite for large-scale data mining in chemistry.</p>
<h2 id="hierarchical-swintransformer-and-attention-integration">Hierarchical SwinTransformer and Attention Integration</h2>
<p>The core novelty is the <strong>Image2InChI</strong> architecture, which integrates:</p>
<ol>
<li><strong>Improved SwinTransformer Encoder</strong>: Uses a hierarchical vision transformer to capture image features.</li>
<li><strong>Feature Fusion with Attention</strong>: A novel network designed to integrate image patch features with InChI prediction steps.</li>
<li><strong>End-to-End InChI Prediction</strong>: The architecture frames the problem as a direct image-to-sequence translation targeting InChI strings, diverging from techniques that predict independent graph components. The model is optimized with a standard cross-entropy loss over the token vocabulary:
$$ \mathcal{L}_{\text{CE}} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{X}) $$
where $\mathbf{X}$ represents the input image features, $y_t$ is the predicted token, and $T$ is the sequence length.</li>
</ol>
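<p>The teacher-forced cross-entropy objective can be sketched in numpy; this is an illustrative implementation of the standard loss, not the authors' code:</p>

```python
import numpy as np

def sequence_nll(logits, tokens):
    """Teacher-forced negative log-likelihood of a token sequence.
    logits: (T, V) decoder scores per step; tokens: (T,) gold token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(tokens)), tokens].sum()

# Three decoding steps over a toy 4-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
tokens = np.array([2, 0, 1])
nll = sequence_nll(logits, tokens)
assert nll > 0  # -log P is strictly positive for finite logits
```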
<h2 id="benchmarking-on-the-bms-dataset">Benchmarking on the BMS Dataset</h2>
<ul>
<li><strong>Benchmark Validation</strong>: The model was trained and tested on the <strong>BMS1000 (Bristol-Myers Squibb)</strong> dataset from a Kaggle competition.</li>
<li><strong>Ablation/Comparative Analysis</strong>: The authors compared their method against other models in the supplement.</li>
<li><strong>Preprocessing Validation</strong>: They justified their choice of denoising algorithms (8-neighborhood vs. Gaussian/Mean) to ensure preservation of bond lines while removing &ldquo;spiky point noise&rdquo;.</li>
</ul>
<h2 id="high-inchi-recognition-metrics">High InChI Recognition Metrics</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved <strong>99.8% InChI accuracy</strong>, 94.8% Maximum Common Substructure (MCS) accuracy, and 96.2% Longest Common Subsequence (LCS) accuracy on the benchmarked dataset. It remains to be seen how well these models generalize to heavily degraded real-world patent images.</li>
<li><strong>Effective Denoising</strong>: The authors concluded that <strong>eight-neighborhood filtering</strong> is superior to mean or Gaussian filtering for this specific domain because it removes isolated noise points without blurring the fine edges of chemical bonds.</li>
<li><strong>Open Source</strong>: The authors committed to releasing the code to facilitate transparency and further research.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The primary dataset used is the <strong>BMS (Bristol-Myers Squibb) Dataset</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>Kaggle Competition (BMS-Molecular-Translation)</td>
      </tr>
      <tr>
          <td><strong>Total Size</strong></td>
          <td>2.4 million images</td>
      </tr>
      <tr>
          <td><strong>Training Set</strong></td>
          <td>1.8 million images</td>
      </tr>
      <tr>
          <td><strong>Test Set</strong></td>
          <td>0.6 million images</td>
      </tr>
      <tr>
          <td><strong>Content</strong></td>
          <td>Each image corresponds to a unique International Chemical Identifier (<a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a>)</td>
      </tr>
  </tbody>
</table>
<p><strong>Other Datasets</strong>: The authors also utilized JPO (Japanese Patent Office), CLEF (CLEF-IP 2012), UOB (MolrecUOB), and USPTO datasets for broader benchmarking.</p>
<p><strong>Preprocessing Pipeline</strong>:</p>
<ol>
<li><strong>Denoising</strong>: <strong>Eight-neighborhood filtering</strong> (threshold &lt; 4 non-white pixels) is used to remove salt-and-pepper noise while preserving bond lines. Mean and Gaussian filtering were rejected due to blurring.</li>
<li><strong>Sequence Padding</strong>:
<ul>
<li>Analysis showed max InChI length &lt; 270.</li>
<li>Fixed sequence length set to <strong>300</strong>.</li>
<li>Tokens: <code>&lt;sos&gt;</code> (190), <code>&lt;eos&gt;</code> (191), <code>&lt;pad&gt;</code> (192) used for padding/framing.</li>
</ul>
</li>
<li><strong>Numerization</strong>: Characters are mapped to integers based on a fixed vocabulary (e.g., &lsquo;C&rsquo; -&gt; 178, &lsquo;H&rsquo; -&gt; 182).</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Eight-Neighborhood Filtering (Denoising)</strong>:</p>
<p>Pseudocode logic:</p>
<ul>
<li>Iterate through every pixel.</li>
<li>Count non-white neighbors in the 3x3 grid (8 neighbors).</li>
<li>If count &lt; threshold (default 4), treat as noise and remove.</li>
</ul>
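<p>The pseudocode above reduces to a few lines of array arithmetic; a numpy sketch (function and parameter names are ours, not the paper's):</p>

```python
import numpy as np

def eight_neighborhood_filter(img, threshold=4, white=255):
    """Remove isolated 'spiky point' noise from a binarized image.

    A non-white pixel is kept only if at least `threshold` of its 8
    neighbors are also non-white; otherwise it is treated as noise
    and set to white.
    """
    ink = img < white  # boolean mask of non-white pixels
    padded = np.pad(ink, 1, constant_values=False)
    # Count non-white neighbors in each 3x3 window, excluding the center.
    counts = sum(
        padded[1 + dy : padded.shape[0] - 1 + dy,
               1 + dx : padded.shape[1] - 1 + dx]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    out = img.copy()
    out[ink & (counts < threshold)] = white
    return out
```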
<p><strong>InChI Tokenization</strong>:</p>
<ul>
<li>InChI strings are split into character arrays.</li>
<li>Example: Vitamin C <code>InChI=1S/C6H8O6...</code> becomes <code>[&lt;sos&gt;, C, 6, H, 8, O, 6, ..., &lt;eos&gt;, &lt;pad&gt;...]</code>.</li>
<li>Mapped to integer tensor for model input.</li>
</ul>
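<p>The tokenization and numerization steps can be sketched together. The framing/padding ids (190/191/192) and the mappings for 'C' and 'H' come from the paper; the remaining vocabulary entries below are hypothetical placeholders:</p>

```python
SOS, EOS, PAD = 190, 191, 192  # framing/padding ids from the paper

# 'C' -> 178 and 'H' -> 182 are from the paper; other ids are made up.
VOCAB = {'C': 178, 'H': 182, 'O': 184, '6': 166, '8': 168}

def numerize_inchi(body, max_len=300):
    """Split an InChI body into characters, frame with sos/eos tokens,
    and pad to the fixed sequence length of 300."""
    ids = [SOS] + [VOCAB[ch] for ch in body] + [EOS]
    ids += [PAD] * (max_len - len(ids))
    return ids
```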
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image2InChI</p>
<ul>
<li><strong>Encoder</strong>: Improved SwinTransformer (Hierarchical Vision Transformer).</li>
<li><strong>Decoder</strong>: Transformer Decoder with patch embedding.</li>
<li><strong>Fusion</strong>: A novel &ldquo;feature fusion network with attention&rdquo; integrates the visual tokens with the sequence generation process.</li>
<li><strong>Framework</strong>: PyTorch 1.8.1.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>InChI Acc</strong>: Exact match accuracy of the predicted InChI string (Reported: 99.8%).</li>
<li><strong>MCS Acc</strong>: Maximum Common Substructure accuracy (structural similarity) (Reported: 94.8%).</li>
<li><strong>LCS Acc</strong>: Longest Common Subsequence accuracy (string similarity) (Reported: 96.2%).</li>
<li><strong>Morgan FP</strong>: Morgan Fingerprint similarity (Reported: 94.1%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Specification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GPU</strong></td>
          <td>NVIDIA Tesla P100 (16GB VRAM)</td>
      </tr>
      <tr>
          <td><strong>Platform</strong></td>
          <td>MatPool cloud platform</td>
      </tr>
      <tr>
          <td><strong>CPU</strong></td>
          <td>Intel Xeon Gold 6271</td>
      </tr>
      <tr>
          <td><strong>RAM</strong></td>
          <td>32GB System Memory</td>
      </tr>
      <tr>
          <td><strong>Driver</strong></td>
          <td>NVIDIA-SMI 440.100</td>
      </tr>
      <tr>
          <td><strong>OS</strong></td>
          <td>Ubuntu 18.04</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2024image2inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Image2InChI: Automated Molecular Optical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Da-zhou and Xu, Xin and Pan, Jia-heng and Gao, Wei and Zhang, Shi-rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3640--3649}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.3c02082}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemVLM: A Multimodal Large Language Model for Chemistry</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemvlm/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemvlm/</guid><description>A 26B parameter multimodal LLM for chemistry, combining InternViT-6B and ChemLLM-20B for molecular structure recognition, property prediction, and reasoning.</description><content:encoded><![CDATA[<h2 id="paper-classification-method-and-resource">Paper Classification: Method and Resource</h2>
<p>This paper is a combination of <strong>Method</strong> (primary) and <strong>Resource</strong> (secondary).</p>
<p>It is primarily a <strong>Method</strong> paper because it proposes <strong>ChemVLM</strong>, a novel multimodal architecture specifically tailored for the chemical domain, utilizing a &ldquo;ViT-MLP-LLM&rdquo; framework. The authors introduce a specific two-stage training strategy to align visual features with chemical text representations.</p>
<p>Secondarily, it is a <strong>Resource</strong> paper as it introduces a comprehensive suite of three new datasets: <strong>ChemOCR</strong>, <strong>MMCR-Bench</strong>, and <strong>MMChemBench</strong>, developed to rigorously evaluate multimodal capabilities in chemistry, covering OCR, reasoning, and property prediction.</p>
<h2 id="bridging-the-visual-gap-in-chemical-llms">Bridging the Visual Gap in Chemical LLMs</h2>
<p>The primary motivation is the limitation of existing models in handling the multimodal nature of chemistry.</p>
<ul>
<li><strong>Visual Data Gap</strong>: Chemical tasks heavily rely on visual information (molecular structures, reactions) which purely text-based chemical LLMs cannot process.</li>
<li><strong>Limitations of Generalist Models</strong>: General multimodal models (like GPT-4V or LLaVA) lack specialized chemical domain knowledge, leading to hallucinations or misinterpretations.</li>
<li><strong>Inadequacy of OCR Tools</strong>: Traditional chemical OCR tools (like MolScribe) excel at modality conversion (Image-to-SMILES) but fail at complex reasoning tasks.</li>
</ul>
<h2 id="domain-specific-data-curation-and-benchmarking">Domain-Specific Data Curation and Benchmarking</h2>
<ul>
<li><strong>Data-Driven Alignment</strong>: The underlying &ldquo;ViT-MLP-LLM&rdquo; framework is standard in multimodal modeling, paralleling architectures like LLaVA. The core innovation here is the rigorous creation of a bilingual multimodal dataset spanning hand-drawn molecules, reactions, and exam questions augmented with style transfers. The training data pipeline heavily relies on generating synthetic variance using tools like RanDepict and RDKit to introduce distortions, rotations, and handwritten styles, alongside GPT-4 generated prompts to ensure linguistic diversity.</li>
<li><strong>Model Integration</strong>: ChemVLM merges <strong>InternViT-6B</strong> (a large-scale vision transformer) with <strong>ChemLLM-20B</strong> (a chemical language model). Visual features $X_v$ are mapped into the linguistic embedding space via an MLP projector, producing aligned token sequences alongside text instructions $X_q$. The joint multimodal sequence is trained using standard autoregressive next-token prediction:
$$ \mathcal{L} = -\sum_{i} \log P(y_i \mid X_v, X_q, y_{&lt;i}) $$</li>
<li><strong>Three Custom Benchmarks</strong>: The authors introduce tailored benchmarks to assess distinct competencies:
<ul>
<li><strong>ChemOCR</strong>: For image-to-SMILES conversion.</li>
<li><strong>MMCR-Bench</strong>: College entrance exam questions testing complex logical reasoning.</li>
<li><strong>MMChemBench</strong>: For molecule captioning and zero-shot property prediction.</li>
</ul>
</li>
</ul>
<h2 id="evaluating-chemical-ocr-and-reasoning">Evaluating Chemical OCR and Reasoning</h2>
<p>The authors benchmarked ChemVLM against both open-source (LLaVA, Qwen-VL, InternVL) and proprietary (GPT-4V) models across three primary domains:</p>
<ol>
<li><strong>Chemical OCR</strong>: Evaluated on 1,000 image-text pairs from ChemOCR. The primary metric is the Tanimoto similarity between the Morgan fingerprints of the generated structure ($A$) and the ground-truth SMILES ($B$):
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$
They report both the average Tanimoto similarity and the strict exact-match rate (<code>Tanimoto@1.0</code>).</li>
<li><strong>Multimodal Chemical Reasoning (MMCR)</strong>: Tested on MMCR-Bench (1,000 exam questions), ScienceQA, and CMMU. Performance was scored based on accuracy for multiple-choice and fill-in-the-blank questions.</li>
<li><strong>Multimodal Molecule Understanding</strong>: Evaluated on MMChemBench for molecule captioning and property prediction.</li>
<li><strong>Text-Only Reasoning</strong>: Tested on SciBench, a text-only benchmark for university-level science, to ensure the model retains fundamental linguistic reasoning.</li>
<li><strong>Generalization</strong>: Tested on non-chemistry subjects within the CMMU framework (Biology, Physics, Math) to assess cross-domain competence.</li>
</ol>
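<p>The Tanimoto metric in item 1 is plain set arithmetic over fingerprint on-bits; a minimal sketch (in the paper's pipeline the Morgan fingerprints themselves come from RDKit):</p>

```python
def tanimoto(bits_a, bits_b):
    """T(A, B) = |A intersect B| / (|A| + |B| - |A intersect B|),
    computed over the sets of on-bit indices of two fingerprints."""
    a, b = set(bits_a), set(bits_b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)
```

<p>With RDKit fingerprint objects, <code>DataStructs.TanimotoSimilarity</code> computes the same quantity directly.</p>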
<h2 id="performance-gains-and-existing-limitations">Performance Gains and Existing Limitations</h2>
<ul>
<li><strong>Multimodal Reasoning Leadership</strong>: ChemVLM achieved state-of-the-art results on MMCR-Bench (41.7%), surpassing generalist models like GPT-4V (40.1%). However, scoring for portions of these benchmarks relied heavily on an LLM-as-a-judge (the Qwen-max API), which can introduce bias, since LLM judges tend to favor the structure and verbosity produced by similar autoregressive models. Furthermore, the model was fine-tuned on 200,000 exam questions and tested on MMCR-Bench (also derived from Chinese college entrance exams). While the authors state the data was deduplicated, the potential for data leakage remains a significant unaddressed confounder.</li>
<li><strong>Superior Understanding</strong>: In molecule captioning and prediction, ChemVLM showed significant improvements over general baseline models, scoring 80.9% on prediction compared to GPT-4V&rsquo;s 38.6%. This is a natural consequence of testing a custom-trained model on domain-specific benchmarks.</li>
<li><strong>OCR Capabilities vs. Dedicated Tools</strong>: ChemVLM outperformed generalist MLLMs in chemical structure recognition, achieving an average Tanimoto similarity of 71.0% (vs. GPT-4V&rsquo;s 15.0%). However, it remains significantly inferior to pure structural OCR tools like MolScribe in strict modality conversion tasks, only achieving an exact structural match (<code>Tanimoto@1.0</code>) of 42.9% compared to MolScribe&rsquo;s 89.1%.</li>
<li><strong>Textual Retention and Generalization Claims</strong>: The authors claim the diverse training strategy imparts broad scientific reasoning, pointing to performance retention on non-chemistry subjects (Biology, Physics, Math) and strong results on the purely textual SciBench benchmark. However, this cross-domain generalization most likely stems from the underlying base model (ChemLLM-20B/InternLM2) or the inclusion of 1.3 million &ldquo;General&rdquo; visual QA pairs in their training blend, rather than from emergent general scientific skills learned purely from chemistry representations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training and evaluation data relied on a mix of open-source repositories and custom curation. Many of the curated datasets have been formally released by the authors on Hugging Face (<a href="https://huggingface.co/datasets/di-zhang-fdu/chemvlm-sft-datasets"><code>di-zhang-fdu/chemvlm-sft-datasets</code></a>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Source/Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>DECIMER HDM</strong></td>
          <td>7,000+ hand-drawn molecular images.</td>
      </tr>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>MolScribe Data</strong></td>
          <td>Scanned/photographed images from literature.</td>
      </tr>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>Synthetic</strong></td>
          <td>Generated via ChemDraw, RDKit, and Indigo with style transfer (blurring, rotation, handwritten styles).</td>
      </tr>
      <tr>
          <td><strong>Training (Reaction)</strong></td>
          <td><strong>PEACE &amp; USPTO-50K</strong></td>
          <td>Inorganic and organic reaction schemes.</td>
      </tr>
      <tr>
          <td><strong>Training (Reasoning)</strong></td>
          <td><strong>Exam Questions</strong></td>
          <td>200,000 questions from OpenDataLab (Chinese education level). <a href="https://huggingface.co/collections/di-zhang-fdu/multi-corpus-datasets-for-chemllm-66e7f14fd683a3f4e51d737b">Available on Hugging Face</a>.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>ChemOCR</strong></td>
          <td>1,000 bilingual image-text pairs for SMILES recognition. Released via Google Drive link in repo.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>MMCR-Bench</strong></td>
          <td>1,000 multimodal chemistry exam questions. <strong>Requires emailing authors directly for access.</strong></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>MMChemBench</strong></td>
          <td>Extension of ChemBench for captioning and property prediction. Released via Google Drive link in repo.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: Images were augmented using <strong>RanDepict</strong> for style variation. Text data (SMILES) was validated and cleaned. Prompts were diversified using GPT-4 to generate different linguistic styles.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;ViT-MLP-LLM&rdquo; structure.
<ul>
<li><strong>Vision Encoder</strong>: InternViT-6B, processing images at $448 \times 448$ resolution. Images are segmented into tiles (max 12).</li>
<li><strong>Projector</strong>: Multi-Layer Perceptron (MLP) initialized randomly to map visual features to text embedding space.</li>
<li><strong>LLM</strong>: ChemLLM-20B, a domain-specific model.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Two-stage supervised fine-tuning.
<ol>
<li><strong>Modal Alignment</strong>: Freeze LLM and base Vision Encoder weights. Train only the randomly initialized MLP projector and LoRA layers (rank 32) of the Vision Encoder. Uses diverse multimodal data.</li>
<li><strong>Supervised Fine-Tuning (SFT)</strong>: Keep LLM and Vision Encoder base weights frozen, but add LoRA (rank 16) to the LLM and retain LoRA (rank 32) on the Vision Encoder. The MLP projector is fully trained. Data includes specialized chemistry and general corpora.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: AdamW</li>
<li>Context Length: 2048 tokens</li>
<li>Chat Template: InternLM2 dialogue schema</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>ChemVLM-26B</strong>: The primary model released. It combines the 6B parameter vision encoder and the 20B parameter language model. Weights are fully available at <a href="https://huggingface.co/AI4Chem/ChemVLM-26B-1-2"><code>AI4Chem/ChemVLM-26B-1-2</code></a>. An 8B version is also available.</li>
<li><strong>Baselines</strong>: Comparisons were made against <strong>GPT-4V</strong>, <strong>Qwen-VL-Chat</strong>, <strong>LLaVA-v1.5-13B</strong>, <strong>InternVL-v1.5</strong>, and <strong>Yi-VL-Plus</strong>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured across three distinct task types. Exact <a href="https://github.com/lijunxian111/ChemVlm/tree/master/evaluation">evaluation scripts</a> have been released in the official repository.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto Similarity</strong></td>
          <td>ChemOCR</td>
          <td>Comparison of generated SMILES vs. ground truth using RDKit. Reports Average Similarity and <code>Tanimoto@1.0</code> (exact match).</td>
      </tr>
      <tr>
          <td><strong>Accuracy</strong></td>
          <td>MMCR (Reasoning)</td>
          <td>+1 point for correct multiple-choice/fill-in-the-blank; 0 otherwise. Scored via Qwen-max API prompting.</td>
      </tr>
      <tr>
          <td><strong>Prediction Score</strong></td>
          <td>Property Prediction</td>
          <td>Evaluated on MMChemBench subsets.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Compute</strong>: Training utilized <strong>16 NVIDIA A100 (80GB)</strong> GPUs.</li>
<li><strong>Configuration</strong>:
<ul>
<li>Batch size: 4 (per GPU, resulting in an effective global batch size of 256)</li>
<li>Gradient Accumulation: 4 iterations</li>
<li>Precision: <strong>Deepspeed bfloat16 (bf16)</strong> with <strong>ZeRO-3</strong> offloading strategy</li>
<li>Framework: Training runs on the InternVL-v1.5 codebase rather than standalone scripts.</li>
</ul>
</li>
<li><strong>Inference Compute</strong>: Evaluating the 26B model requires at least one 80GB A100 GPU (with Flash Attention + bfloat16). The 8B variant requires a GPU with at least 48GB of VRAM.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemVLM-26B">ChemVLM-26B</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Original 26B model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemVLM-26B-1-2">ChemVLM-26B-1-2</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Updated 26B model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/di-zhang-fdu/chemvlm-sft-datasets">chemvlm-sft-datasets</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SFT training data (~51.7k rows)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/lijunxian111/ChemVlm">ChemVlm (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, evaluation, and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, J., et al. (2025). ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(1), 415-423. <a href="https://doi.org/10.1609/aaai.v39i1.32020">https://doi.org/10.1609/aaai.v39i1.32020</a></p>
<p><strong>Publication</strong>: AAAI 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{li2025chemvlm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Junxian and Zhang, Di and Wang, Xunzhi and Hao, Zeying and Lei, Jingdi and Tan, Qian and Zhou, Cai and Liu, Wei and Yang, Yaotian and Xiong, Xinrui and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Li, Wei and Su, Mao and Zhang, Shufei and Ouyang, Wanli and Li, Yuqiang and Zhou, Dongzhan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{415--423}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://doi.org/10.1609/aaai.v39i1.32020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i1.32020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/lijunxian111/ChemVlm">Official Repository</a></li>
</ul>
]]></content:encoded></item><item><title>String Representations for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/rajan-string-representations-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/rajan-string-representations-2022/</guid><description>Ablation study comparing SMILES, DeepSMILES, SELFIES, and InChI for OCSR. SMILES achieves highest accuracy; SELFIES guarantees validity.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Steinbeck, C., &amp; Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. <em>Digital Discovery</em>, 1(2), 84-90. <a href="https://doi.org/10.1039/D1DD00013F">https://doi.org/10.1039/D1DD00013F</a></p>
<p><strong>Publication</strong>: Digital Discovery 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://chemrxiv.org/engage/api-gateway/chemrxiv/assets/orp/resource/item/6144553b4853d253c6a7adde/original/performance-of-chemical-structure-string-representations-for-chemical-image-recognition-using-transformers.pdf">ChemRxiv Preprint (PDF)</a></li>
<li><a href="https://github.com/Kohulan/DECIMER_Short_Communication">Official Code Repository</a></li>
<li><a href="https://doi.org/10.5281/zenodo.5155037">Data on Zenodo</a></li>
<li>Related work: <a href="/notes/computational-chemistry/optical-structure-recognition/decimer/">DECIMER</a>, <a href="/notes/computational-chemistry/optical-structure-recognition/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/computational-chemistry/optical-structure-recognition/img2smi/">IMG2SMI</a></li>
</ul>
<h2 id="methodological-focus-and-resource-contributions">Methodological Focus and Resource Contributions</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$) with a secondary contribution as a <strong>Resource Paper</strong> ($\Psi_{\text{Resource}}$).</p>
<p>It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>, DeepSMILES, <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a>) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.</p>
<h2 id="the-syntax-challenge-in-chemical-image-recognition">The Syntax Challenge in Chemical Image Recognition</h2>
<p>Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings (where ring closures and branches are marked by single characters potentially far apart in the sequence) creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.</p>
<h2 id="isolating-string-representation-variables">Isolating String Representation Variables</h2>
<p>The core novelty is the <strong>comparative isolation of the string representation variable</strong> in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between <em>validity</em> (guaranteed by SELFIES) and <em>accuracy</em> (highest with SMILES).</p>
<h2 id="large-scale-image-to-text-translation-experiments">Large-Scale Image-to-Text Translation Experiments</h2>
<p>The authors performed a large-scale image-to-text translation experiment:</p>
<ul>
<li><strong>Task</strong>: Converting 2D chemical structure images into text strings.</li>
<li><strong>Data</strong>:
<ul>
<li><strong>ChEMBL</strong>: ~1.6M molecules, split into two datasets (with and without stereochemistry).</li>
<li><strong>PubChem</strong>: ~3M molecules, split similarly, to test performance scaling with data size.</li>
</ul>
</li>
<li><strong>Representations</strong>: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI.</li>
<li><strong>Metric</strong>: The models were evaluated on:
<ul>
<li><strong>Validity</strong>: Can the predicted string be decoded back to a molecule?</li>
<li><strong>Exact Match</strong>: Is the predicted string identical to the ground truth?</li>
<li><strong>Tanimoto Similarity</strong>: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)? The similarity $\mathcal{T}$ between two molecular fingerprints $A$ and $B$ is calculated as:
$$ \mathcal{T}(A, B) = \frac{A \cdot B}{||A||^2 + ||B||^2 - A \cdot B} $$</li>
</ul>
</li>
</ul>
<h2 id="comparative-performance-and-validity-trade-offs">Comparative Performance and Validity Trade-offs</h2>
<ul>
<li><strong>SMILES is the most accurate</strong>: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.</li>
<li><strong>SELFIES guarantees validity</strong>: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.</li>
<li><strong>InChI is unsuitable</strong>: InChI performed significantly worse (approx. 64% exact match) due to extreme string lengths (up to 273 tokens).</li>
<li><strong>Stereochemistry adds difficulty</strong>: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.</li>
<li><strong>Recommendation</strong>: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (Dataset 1/2)</td>
          <td>~1.5M</td>
          <td>Filtered for MW &lt; 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B).</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem (Dataset 3/4)</td>
          <td>~3.0M</td>
          <td>Same filtering rules, used to test scaling.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Test Split</td>
          <td>~120k - 250k</td>
          <td>Created using RDKit MaxMin algorithm to ensure chemical diversity.</td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: CDK Structure Diagram Generator (SDG).</li>
<li><strong>Specs</strong>: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tokenization Rules</strong> (Critical for replication):</p>
<ul>
<li><strong>SELFIES</strong>: Split at every <code>][</code> (e.g., <code>[C][N]</code> $\rightarrow$ <code>[C]</code>, <code>[N]</code>).</li>
<li><strong>SMILES / DeepSMILES</strong>: Regex-based splitting:
<ul>
<li>Every heavy atom (e.g., <code>C</code>, <code>N</code>).</li>
<li>Every parenthesis <code>(</code> and <code>)</code>.</li>
<li>Every bond symbol <code>=</code> and <code>#</code>.</li>
<li>Every single-digit ring-closure number.</li>
<li>Everything inside square brackets <code>[]</code> is kept as a single token.</li>
</ul>
</li>
<li><strong>InChI</strong>: The prefix <code>InChI=1S/</code> was treated as a single token and removed during training, then re-added for evaluation.</li>
</ul>
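<p>A minimal sketch of these SMILES splitting rules as a regex (illustrative, not the authors&rsquo; exact expression; treating <code>Cl</code>/<code>Br</code> as single atom tokens is my assumption, since two-letter halogens are not covered by the rules as summarized):</p>

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]*\]"   # anything inside square brackets is one token
    r"|Cl|Br"       # assumed: two-letter halogens kept whole
    r"|[A-Za-z]"    # single-letter atoms (upper = aliphatic, lower = aromatic)
    r"|[()=#]"      # parentheses and bond symbols
    r"|\d"          # single-digit ring-closure numbers
)

def tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

print(tokenize("CC(=O)Oc1ccccc1"))
# → ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```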
<h3 id="models">Models</h3>
<p>The model follows the <strong>DECIMER</strong> architecture.</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-B3 (pre-trained with &ldquo;Noisy Student&rdquo; weights).
<ul>
<li>Output: Image feature vectors of shape $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder</strong>: Transformer (similar to the &ldquo;Base&rdquo; model from <em>Attention Is All You Need</em>).
<ul>
<li>Layers: 4 encoder-decoder layers.</li>
<li>Attention Heads: 8.</li>
<li>Dimension ($d_{\text{model}}$): 512.</li>
<li>Feed-forward ($d_{\text{ff}}$): 2048.</li>
<li>Dropout: 10%.</li>
</ul>
</li>
<li><strong>Loss</strong>: Sparse categorical cross-entropy.</li>
<li><strong>Optimizer</strong>: Adam with custom learning rate scheduler.</li>
</ul>
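<p>The paper says only &ldquo;custom learning rate scheduler&rdquo;; Transformer models in this lineage typically use the warmup-then-decay schedule from <em>Attention Is All You Need</em>, sketched here under that assumption:</p>

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Warmup-then-inverse-sqrt-decay schedule from 'Attention Is All
    You Need' (an assumption: the paper only says 'custom scheduler')."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rate climbs linearly for `warmup` steps, then decays as 1/sqrt(step).
for s in (100, 4000, 40000):
    print(s, transformer_lr(s))
```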
<h3 id="evaluation">Evaluation</h3>
<p>Metrics were calculated after converting all predictions back to standard SMILES.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Baseline (SMILES)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Identical Match</strong></td>
          <td>88.62% (PubChem)</td>
          <td>Strict character-for-character equality.</td>
      </tr>
      <tr>
          <td><strong>Valid Structure</strong></td>
          <td>99.78%</td>
          <td>SMILES had rare syntax errors; SELFIES achieved 100%.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (Avg)</strong></td>
          <td>0.98</td>
          <td>Calculated using PubChem fingerprints via CDK.</td>
      </tr>
  </tbody>
</table>
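<p>The Tanimoto metric in the table reduces to intersection over union of fingerprint bits; a minimal sketch (plain sets of &ldquo;on&rdquo; bits stand in for the CDK PubChem fingerprints):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on'
    bits: intersection size divided by union size."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a.intersection(fp_b))
    return inter / (len(fp_a) + len(fp_b) - inter)

print(tanimoto({1, 2, 3, 4}, {2, 3, 4, 5}))  # → 0.6
```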
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Google Cloud TPUs (v3-8).</li>
<li><strong>Format</strong>: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.</li>
<li><strong>Batch Size</strong>: 1024.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanPerformanceChemicalStructure2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{84--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D1DD00013F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MICER: Molecular Image Captioning with Transfer Learning</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/micer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/micer/</guid><description>Encoder-decoder model using pre-trained ResNet and attention-based LSTM to translate molecular images into SMILES, achieving SOTA OCSR performance.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yi, J., Wu, C., Zhang, X., Xiao, X., Qiu, Y., Zhao, W., Hou, T., &amp; Cao, D. (2022). MICER: a pre-trained encoder-decoder architecture for molecular image captioning. <em>Bioinformatics</em>, 38(19), 4562-4572. <a href="https://doi.org/10.1093/bioinformatics/btac545">https://doi.org/10.1093/bioinformatics/btac545</a></p>
<p><strong>Publication</strong>: Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Jiacai-Yi/MICER">GitHub Repository</a></li>
</ul>
<h2 id="micers-contribution-to-optical-structure-recognition">MICER&rsquo;s Contribution to Optical Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper according to the AI for Physical Sciences taxonomy. It proposes MICER, an encoder-decoder architecture that integrates transfer learning (fine-tuning pre-trained models) and attention mechanisms for Optical Chemical Structure Recognition (OCSR). The study includes rigorous benchmarking comparing MICER against three rule-based tools (OSRA, MolVec, Imago) and existing deep learning methods (DECIMER). The authors conduct extensive factor comparison experiments to isolate the effects of stereochemistry, molecular complexity, data volume, and encoder backbone choices.</p>
<h2 id="the-challenge-of-generalizing-in-ocsr">The Challenge of Generalizing in OCSR</h2>
<p>Chemical structures in scientific literature are valuable for drug discovery, but they are locked in image formats that are difficult to mine automatically. Traditional OCSR tools (like OSRA) rely on hand-crafted rules and expert knowledge. They are brittle, struggle with stylistic variations, and generalize poorly. While deep learning has been applied (e.g., DECIMER), previous attempts often used frozen pre-trained feature extractors (without fine-tuning) or failed to fully exploit transfer learning, leading to suboptimal performance. The goal of this work is to build an end-to-end &ldquo;image captioning&rdquo; system that translates molecular images directly into <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings without intermediate segmentation steps.</p>
<h2 id="integrating-fine-tuning-and-attention-for-chemistry">Integrating Fine-Tuning and Attention for Chemistry</h2>
<p>The core novelty lies in the specific architectural integration of transfer learning with fine-tuning for the chemical domain. Unlike DECIMER, which used a frozen network, MICER fine-tunes a pre-trained ResNet on molecular images. This allows the encoder to adapt from general object recognition to specific chemical feature extraction.</p>
<p>The model incorporates an attention mechanism into the LSTM decoder, allowing the model to focus on specific image regions (atoms and bonds) when generating each character of the SMILES string. The paper explicitly analyzes &ldquo;intrinsic features&rdquo; of molecular data (stereochemistry, complexity) to guide the design of a robust training dataset, combining multiple chemical toolkits (Indigo, RDKit) to generate diverse styles.</p>
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The authors performed two types of experiments: Factor Comparison (ablations) and Benchmarking.</p>
<p><strong>Factor Comparisons</strong>: They evaluated how performance is affected by:</p>
<ul>
<li><strong>Stereochemistry (SI)</strong>: Comparing models trained on data with and without stereochemical information.</li>
<li><strong>Molecular Complexity (MC)</strong>: Analyzing performance across 5 molecular weight intervals.</li>
<li><strong>Data Volume (DV)</strong>: Training on datasets ranging from 0.64 million to 10 million images.</li>
<li><strong>Pre-trained Models (PTMs)</strong>: Comparing 8 different backbones (e.g., ResNet, VGG, Inception, MobileNet) versus a base CNN.</li>
</ul>
<p><strong>Benchmarking</strong>:</p>
<ul>
<li><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based); Base CNN, DECIMER (deep learning).</li>
<li><strong>Datasets</strong>: Four test sets (100k images each, except UOB): Uni-style, Multi-style, Noisy, and Real-world (UOB dataset).</li>
<li><strong>Metrics</strong>: Sequence Accuracy (Exact Match), Levenshtein Distance (ALD), and Tanimoto Similarity (Fingerprint match).</li>
</ul>
<h2 id="results-and-core-insights">Results and Core Insights</h2>
<p>MICER achieved 97.54% Sequence Accuracy on uni-style data and 82.33% on the real-world UOB dataset, significantly outperforming the next best method (OSRA scored approximately 23% on uni-style; DECIMER scored roughly 21% on UOB). ResNet101 was identified as the most effective encoder (87.58% accuracy in preliminary tests), outperforming deeper (DenseNet) or lighter (MobileNet) networks. Performance saturates around 6 million training samples. Stereochemical information drops accuracy by nearly 6%, indicating wedge and dash bonds are harder to recognize. Visualizing attention maps showed the model correctly attends to specific atoms (e.g., focusing on &lsquo;S&rsquo; or &lsquo;Cl&rsquo; pixels) when generating the corresponding character.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was curated from the <strong>ZINC20</strong> database.</p>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Filtering</strong>: Removed organometallics, mixtures, and invalid molecules.</li>
<li><strong>Standardization</strong>: SMILES were canonicalized and de-duplicated.</li>
<li><strong>Generation</strong>: Images generated using <strong>Indigo</strong> and <strong>RDKit</strong> toolkits to vary styles.</li>
</ul>
<p><strong>Dataset Size</strong>:</p>
<ul>
<li><strong>Total</strong>: 10 million images selected for the final model.</li>
<li><strong>Composition</strong>: 6 million &ldquo;default style&rdquo; (Indigo) + 4 million &ldquo;multi-style&rdquo; (Indigo + RDKit).</li>
<li><strong>Splits</strong>: 8:1:1 ratio for Training/Validation/Test.</li>
</ul>
<p><strong>Vocabulary</strong>: A token dictionary of 39 characters + special tokens: <code>[pad]</code>, <code>[sos]</code>, <code>[eos]</code>, <code>[0]-[9]</code>, <code>[C]</code>, <code>[1]</code>, <code>[c]</code>, <code>[O]</code>, <code>[N]</code>, <code>[n]</code>, <code>[F]</code>, <code>[H]</code>, <code>[o]</code>, <code>[S]</code>, <code>[s]</code>, <code>[B]</code>, <code>[r]</code>, <code>[I]</code>, <code>[i]</code>, <code>[P]</code>, <code>[p]</code>, <code>(</code>, <code>)</code>, <code>[</code>, <code>]</code>, <code>@</code>, <code>=</code>, <code>#</code>, <code>/</code>, <code>-</code>, <code>+</code>, <code>\</code>, <code>%</code>. Note that two-letter atoms like &lsquo;Br&rsquo; are tokenized as distinct characters <code>[B]</code>, <code>[r]</code>.</p>
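<p>Character-level tokenization with these special tokens takes only a few lines; note how <code>Br</code> falls apart into two tokens:</p>

```python
def char_tokenize(smiles):
    """Character-level SMILES tokenization as described above: every
    character is its own token, so 'Br' becomes 'B', 'r'."""
    return ["[sos]"] + list(smiles) + ["[eos]"]

print(char_tokenize("BrCCl"))
# → ['[sos]', 'B', 'r', 'C', 'C', 'l', '[eos]']
```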
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Character-level tokenization (not atom-level); the model learns to assemble &lsquo;C&rsquo; and &lsquo;l&rsquo; into &lsquo;Cl&rsquo;.</li>
<li><strong>Attention Mechanism</strong>: Uses a soft attention mechanism where the decoder calculates an attention score between the encoder&rsquo;s feature map ($8 \times 8 \times 512$) and the current hidden vector. Formula:
$$
\begin{aligned}
\text{att\_score} &amp;= \text{softmax}(L_a(\tanh(L_f(F) + L_b(b_t))))
\end{aligned}
$$</li>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Loss Function</strong>: Cross-entropy loss</li>
<li><strong>Optimizer</strong>: Adam optimizer</li>
<li><strong>Learning Rate</strong>: 2e-5</li>
<li><strong>Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 15</li>
</ul>
</li>
</ul>
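<p>A pure-Python sketch of the soft-attention formula above, with the learned layers $L_f$, $L_b$, $L_a$ replaced by fixed toy matrices (shapes are illustrative; in MICER the feature map has 64 regions of dimension 512):</p>

```python
import math, random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def soft_attention(F, b_t, Wf, Wb, wa):
    """att_score_i = softmax_i( wa . tanh(Wf @ F_i + Wb @ b_t) );
    returns the scores and the score-weighted context vector.
    (Wf, Wb, wa stand in for the learned layers L_f, L_b, L_a.)"""
    proj_b = matvec(Wb, b_t)
    energies = []
    for f in F:
        h = [math.tanh(a + b) for a, b in zip(matvec(Wf, f), proj_b)]
        energies.append(sum(w * hi for w, hi in zip(wa, h)))
    m = max(energies)                       # numerically stable softmax
    exps = [math.exp(e - m) for e in energies]
    z = sum(exps)
    scores = [e / z for e in exps]
    context = [sum(s * f[j] for s, f in zip(scores, F))
               for j in range(len(F[0]))]
    return scores, context

random.seed(0)
F = [[random.gauss(0, 1) for _ in range(4)] for _ in range(64)]  # 64 regions, dim 4
b_t = [0.1, 0.2, 0.3, 0.4]                                      # decoder hidden state
Wf = [[0.1] * 4 for _ in range(4)]
Wb = [[0.1] * 4 for _ in range(4)]
wa = [1.0, -1.0, 0.5, 0.5]
scores, context = soft_attention(F, b_t, Wf, Wb, wa)
print(round(sum(scores), 6), len(context))  # → 1.0 4
```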
<h3 id="models">Models</h3>
<p><strong>Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: Pre-trained <strong>ResNet101</strong> (trained on ImageNet).</li>
<li><strong>Modifications</strong>: The final layer is removed to output a Feature Map of size $8 \times 8 \times 512$.</li>
<li><strong>Flattening</strong>: Reshaped to a $64 \times 512$ feature matrix for the decoder.</li>
</ul>
<p><strong>Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Long Short-Term Memory (LSTM) with Attention.</li>
<li><strong>Dropout</strong>: 0.3 applied to minimize overfitting.</li>
</ul>
<p>The encoder stacks a pilot block, a max-pool, and four convolutional block (CB) stages, feeding into the attention LSTM.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>SA (Sequence Accuracy)</strong>: Strict exact match of SMILES strings.</li>
<li><strong>ALD (Average Levenshtein Distance)</strong>: Edit distance for character-level error analysis.</li>
<li><strong>AMFTS / MFTS@1.0</strong>: Tanimoto similarity of ECFP4 fingerprints to measure structural similarity.</li>
</ul>
<p><strong>Test Sets</strong>:</p>
<ul>
<li><strong>Uni-style</strong>: 100,000 images (Indigo default).</li>
<li><strong>Multi-style</strong>: 100,000 images (&gt;10 styles).</li>
<li><strong>Noisy</strong>: 100,000 images with noise added.</li>
<li><strong>UOB</strong>: 5,575 real-world images from literature.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA Tesla V100 GPUs</li>
<li><strong>Training Time</strong>: Approximately 42 hours for the final model</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yiMICERPretrainedEncoder2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{MICER}}: A Pre-Trained Encoder--Decoder Architecture for Molecular Image Captioning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{MICER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Yi, Jiacai and Wu, Chengkun and Zhang, Xiaochen and Xiao, Xinyi and Qiu, Yanlong and Zhao, Wentao and Hou, Tingjun and Cao, Dongsheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{19}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{4562--4572}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1367-4811}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bioinformatics/btac545}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2SMILES: Transformer OCSR with Synthetic Data Pipeline</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/image2smiles/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/image2smiles/</guid><description>Transformer-based OCSR using a novel synthetic data generation pipeline for robust molecular image interpretation across diverse drawing styles.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Khokhlov, I., Krasnov, L., Fedorov, M. V., &amp; Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. <em>Chemistry-Methods</em>, 2(1), e202100069. <a href="https://doi.org/10.1002/cmtd.202100069">https://doi.org/10.1002/cmtd.202100069</a></p>
<p><strong>Publication</strong>: Chemistry-Methods 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/syntelly/img2smiles_generator">Official Code (Data Generator)</a></li>
<li><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></li>
</ul>
<h2 id="contribution-image2smiles-as-a-method-and-resource">Contribution: Image2SMILES as a Method and Resource</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a specific neural architecture (ResNet backbone and Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering &ldquo;How well does this work?&rdquo; with extensive benchmarks against rule-based systems like OSRA.</li>
<li><strong>Resource</strong>: A core contribution is the &ldquo;Generate and Train!&rdquo; paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.</li>
</ul>
<h2 id="motivation-bottlenecks-in-recognizing-trapped-chemical-structures">Motivation: Bottlenecks in Recognizing Trapped Chemical Structures</h2>
<p>Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.</p>
<ul>
<li><strong>Problem</strong>: Chemical structures are often &ldquo;trapped&rdquo; in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, &ldquo;Markush&rdquo; structures (templates), or visual contamination.</li>
<li><strong>Gap</strong>: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.</li>
<li><strong>Goal</strong>: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).</li>
</ul>
<h2 id="core-innovation-the-generate-and-train-pipeline-and-fg-smiles">Core Innovation: The &ldquo;Generate and Train!&rdquo; Pipeline and FG-SMILES</h2>
<ul>
<li><strong>&ldquo;Generate and Train!&rdquo; Paradigm</strong>: The authors assert that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like &ldquo;Markush&rdquo; variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual &ldquo;contamination&rdquo; (stray text, arrows).</li>
<li><strong>FG-SMILES</strong>: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.</li>
<li><strong>Encoder-Free Architecture</strong>: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.</li>
</ul>
<h2 id="methodology-and-benchmarking-against-osra">Methodology and Benchmarking Against OSRA</h2>
<ul>
<li><strong>Training</strong>: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.</li>
<li><strong>Validation (Synthetic)</strong>: Evaluated on a hold-out set of 1M synthetic images.</li>
<li><strong>Validation (Real World)</strong>:
<ul>
<li><strong>Dataset A</strong>: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.</li>
<li><strong>Dataset B</strong>: 296 structures systematically extracted from <em>Journal of Organic Chemistry</em> (one paper per issue from 2020) to reduce selection bias.</li>
</ul>
</li>
<li><strong>Comparison</strong>: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.</li>
</ul>
<h2 id="results-high-precision-extraction-and-key-limitations">Results: High-Precision Extraction and Key Limitations</h2>
<ul>
<li><strong>Performance</strong>:
<ul>
<li><strong>Synthetic</strong>: 90.7% exact match accuracy.</li>
<li><strong>Real Data (Dataset A)</strong>: Image2SMILES achieved <strong>79.2%</strong> accuracy compared to OSRA&rsquo;s <strong>62.1%</strong>.</li>
<li><strong>Real Data (Dataset B)</strong>: Image2SMILES achieved <strong>62.5%</strong> accuracy compared to OSRA&rsquo;s <strong>24.0%</strong>.</li>
</ul>
</li>
<li><strong>Confidence Correlation</strong>: There is a strong correlation between the model&rsquo;s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22% of data, enabling high-precision automated pipelines.</li>
<li><strong>Key Failures</strong>: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confusion of R-group indices ($R&rsquo;$ vs $R_1$), and explicit hydrogens rendered as groups.</li>
</ul>
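<p>The confidence-thresholding pipeline implied by that result is simple to sketch (the function and sample data below are illustrative, not from the paper&rsquo;s code):</p>

```python
def high_precision_filter(predictions, threshold=0.995):
    """Split (smiles, confidence) pairs at the confidence cutoff:
    confident predictions are accepted automatically, the rest are
    deferred (e.g. to manual review), trading coverage for precision."""
    kept = [p for p in predictions if p[1] >= threshold]
    deferred = [p for p in predictions if p[1] < threshold]
    return kept, deferred

preds = [("CCO", 0.999), ("c1ccccc1", 0.970), ("CC(=O)O", 0.996)]
kept, deferred = high_precision_filter(preds)
print(len(kept), len(deferred))  # → 2 1
```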
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: A subset of 10 million molecules sampled from PubChem.</li>
<li><strong>Selection Logic</strong>: Bias towards complex/rare structures using a &ldquo;Full Coefficient&rdquo; (FC) probability metric based on molecule size and ring/atom rarity.
<ul>
<li>Formula: $FC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$ where $n_{\max}=60$.</li>
</ul>
</li>
<li><strong>Generation</strong>: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).</li>
<li><strong>Contamination</strong>: &ldquo;Visual noise&rdquo; is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.</li>
<li><strong>Target Format</strong>: <strong>FG-SMILES</strong> (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a <code>v</code> token.</li>
</ul>
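<p>The size-dependent term of the sampling coefficient above can be evaluated directly (the rarity terms for rings and atoms, not reproduced in this note, enter the full coefficient on top of this):</p>

```python
def size_coefficient(n_atoms, n_max=60):
    """Size term of the complexity-biased sampling coefficient:
    0.1 + 1.2 * ((n_max - n) / n_max) ** 3, clamped at n_max atoms."""
    n = min(n_atoms, n_max)
    return 0.1 + 1.2 * ((n_max - n) / n_max) ** 3

for n in (10, 30, 60):
    print(n, round(size_coefficient(n), 4))
```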
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Contamination Augmentation</strong>: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.</li>
<li><strong>Functional Group Resolution</strong>: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).</li>
<li><strong>Markush Support</strong>: Stochastic replacement of substituents with R-group labels ($R_1$, $R&rsquo;$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).</li>
</ul>
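<p>The stochastic Markush replacement can be sketched as below; only $P(R)=0.2$ and $P(R_1)=0.15$ come from the note, while the remaining labels, weights, and the overall replacement rate are made-up placeholders:</p>

```python
import random

R_LABELS = ["R", "R1", "R2", "R'"]       # tail entries are hypothetical
R_WEIGHTS = [0.20, 0.15, 0.10, 0.05]     # only P(R), P(R1) are from the note

def maybe_markush(substituents, rng, p_replace=0.3):
    """Replace each substituent with a weighted-random R-group label
    with probability p_replace (p_replace itself is a placeholder)."""
    out = []
    for s in substituents:
        if rng.random() < p_replace:
            out.append(rng.choices(R_LABELS, weights=R_WEIGHTS)[0])
        else:
            out.append(s)
    return out

rng = random.Random(7)
print(maybe_markush(["OMe", "Ph", "NO2", "Cl"], rng))
```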
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;Image-to-Sequence&rdquo; hybrid model.
<ul>
<li><strong>Backbone</strong>: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.</li>
<li><strong>Neck</strong>: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.</li>
<li><strong>Decoder</strong>: Standard Transformer Decoder (6 layers).</li>
</ul>
</li>
<li><strong>Input</strong>: Images resized to $384 \times 384 \times 3$.</li>
<li><strong>Output</strong>: Sequence of FG-SMILES tokens.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Binary &ldquo;Exact Match&rdquo; (correct/incorrect).
<ul>
<li>Strict criteria: Stereo and R-group indices must match exactly (e.g., $R&rsquo;$ vs $R_1$ is a failure).</li>
</ul>
</li>
<li><strong>Datasets</strong>:
<ul>
<li><strong>Internal</strong>: 5% random split of generated data (500k samples).</li>
<li><strong>External (Dataset A &amp; B)</strong>: Manually cropped real-world images from specified journals.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.</li>
<li><strong>Duration</strong>: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.</li>
<li><strong>Optimizer</strong>: RAdam with learning rate $3 \cdot 10^{-4}$.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{khokhlovImage2SMILESTransformerBasedMolecular2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image2SMILES: Transformer-Based Molecular Optical Recognition Engine}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Image2SMILES}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Chemistry-Methods}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{e202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2628-9725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1002/cmtd.202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image-to-Graph Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/image-to-graph-transformers/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/image-to-graph-transformers/</guid><description>A deep learning model that converts molecular images directly into graph structures, enabling recognition of abbreviated non-atomic symbols.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoo, S., Kwon, O., &amp; Lee, H. (2022). Image-to-Graph Transformers for Chemical Structure Recognition. <em>arXiv preprint arXiv:2202.09580</em>. <a href="https://doi.org/10.48550/arXiv.2202.09580">https://doi.org/10.48550/arXiv.2202.09580</a></p>
<p><strong>Publication</strong>: arXiv 2022</p>
<h2 id="contribution-and-taxonomic-classification">Contribution and Taxonomic Classification</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture designed to extract molecular structures from images by directly predicting the graph topology. The paper validates this approach through ablation studies (comparing ResNet-only baselines to the Transformer-augmented model) and extensive benchmarking against existing tools.</p>
<h2 id="the-challenge-with-smiles-and-non-atomic-symbols">The Challenge with SMILES and Non-Atomic Symbols</h2>
<ul>
<li><strong>Handling Abbreviations:</strong> Chemical structures in scientific literature often use non-atomic symbols (superatoms like &ldquo;R&rdquo; or &ldquo;Ph&rdquo;) to reduce complexity. Standard tools that generate SMILES strings fail here because SMILES syntax does not support arbitrary non-atomic symbols.</li>
<li><strong>Robustness to Style:</strong> Existing rule-based tools are brittle to the diverse drawing styles found in literature.</li>
<li><strong>Data Utilization:</strong> Pixel-wise graph recognition tools (like ChemGrapher) require expensive pixel-level labeling. An end-to-end approach can utilize massive amounts of image-molecule pairs (like USPTO data) without needing exact coordinate labels.</li>
</ul>
<h2 id="the-image-to-graph-i2g-architecture">The Image-to-Graph (I2G) Architecture</h2>
<p>The core novelty is the <strong>Image-to-Graph (I2G)</strong> architecture that bypasses string representations entirely:</p>
<ul>
<li><strong>Hybrid Encoder:</strong> Combines a ResNet backbone (for locality) with a Transformer encoder (for global context), allowing the model to capture relationships between atoms that are far apart in the image.</li>
<li><strong>Graph Decoder (GRAT):</strong> A modified Transformer decoder that generates the graph auto-regressively. It uses feature-wise transformations to modulate attention weights based on edge information (bond types).</li>
<li><strong>Coordinate-Aware Training:</strong> The model is forced to predict the exact 2D coordinates of atoms in the source image, which significantly boosts accuracy by grounding the decoder&rsquo;s attention.</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Baselines:</strong> The model was compared against OSRA (rule-based), MolVec (rule-based), and ChemGrapher (deep learning pixel-wise).</li>
<li><strong>Benchmarks:</strong> Evaluated on four standard datasets: UoB, USPTO, CLEF, and JPO. Images were converted to PDF and back to simulate degradation.</li>
<li><strong>Large Molecule Test:</strong> A custom dataset (<strong>OLED</strong>) was created from 12 journal papers (434 images) to test performance on larger, more complex structures (average 52.8 atoms).</li>
<li><strong>Ablations:</strong> The authors tested the impact of the Transformer encoder, auxiliary losses, and coordinate prediction.</li>
</ul>
<h2 id="empirical-results-and-robustness">Empirical Results and Robustness</h2>
<ul>
<li><strong>SOTA Performance:</strong> The proposed model outperformed existing models with a 17.1% relative improvement on benchmark datasets.</li>
<li><strong>Robustness:</strong> On large molecules (OLED dataset), it achieved a 12.8% relative improvement.</li>
<li><strong>Data Scaling:</strong> Adding real-world USPTO data to the synthetic training set improved performance by 20.5%, demonstrating the model&rsquo;s ability to learn from noisy, unlabeled coordinates.</li>
<li><strong>Handling Superatoms:</strong> The model successfully recognized pseudo-atoms (e.g., $R_1$, $R_2$) as distinct nodes, whereas SMILES-based tools collapsed them into generic &ldquo;Any&rdquo; atoms.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors used a combination of synthetic and real-world data for training.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td><strong>PubChem</strong></td>
          <td>4.6M</td>
          <td>Synthetic images generated using RDKit. Random superatoms (e.g., $CF_3$, $NO_2$) were substituted to simulate abbreviations.</td>
      </tr>
      <tr>
          <td>Training</td>
          <td><strong>USPTO</strong></td>
          <td>2.5M</td>
          <td>Real image-molecule pairs from patents. Used for robustness; lacks coordinate labels.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>Benchmarks</strong></td>
          <td>~5.7k</td>
          <td>UoB, USPTO, CLEF, JPO. Average ~15.8 atoms per molecule.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><strong>OLED</strong></td>
          <td>434</td>
          <td>Manually segmented from 12 journal papers. Large molecules (avg 52.8 atoms).</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing:</strong></p>
<ul>
<li>Input resolution is fixed at $800 \times 800$ pixels.</li>
<li>Images are virtually split into a $25 \times 25$ grid (625 patches total), where each patch is $32 \times 32$ pixels.</li>
</ul>
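<p>As a quick sanity check on the numbers above, the grid arithmetic can be sketched in a few lines (values taken directly from the text; the helper name is illustrative):</p>

```python
def patch_grid(img_size, patch_size):
    """Return (patches_per_side, total_patches) for a square input image."""
    assert img_size % patch_size == 0, "image must divide evenly into patches"
    side = img_size // patch_size
    return side, side * side

# An 800x800 input with 32x32 patches yields a 25x25 grid of 625 patches.
side, total = patch_grid(800, 32)
```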
<h3 id="algorithms">Algorithms</h3>
<p><strong>Encoder Logic:</strong></p>
<ul>
<li><strong>Grid Serialization:</strong> The $25 \times 25$ grid is flattened into a 1D sequence. 2D position information is concatenated to ResNet features before the Transformer.</li>
<li><strong>Auxiliary Losses:</strong> To aid convergence, classifiers on the encoder predict three things <em>per patch</em>: (1) number of atoms, (2) characters in atom labels, and (3) edge-sharing neighbors. These losses decrease to zero during training.</li>
</ul>
<p><strong>Decoder Logic:</strong></p>
<ul>
<li><strong>Auto-regressive Generation:</strong> At step $t$, the decoder generates a new node and connects it to existing nodes.</li>
<li><strong>Attention Modulation:</strong> Attention weights are transformed using bond information:
$$
\begin{aligned}
\text{Att}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d}} \odot \gamma(e_{ij}) + \beta(e_{ij}) \right) V
\end{aligned}
$$
where $e_{ij}$ is the edge type.</li>
<li><strong>Coordinate Prediction:</strong> The decoder outputs coordinates for each atom, which acts as a mechanism to track attention history.</li>
</ul>
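<p>A minimal, dependency-free sketch of the modulated attention above, for a single query row. Here <code>gamma</code> and <code>beta</code> stand in for the learned feature-wise transforms $\gamma(e_{ij})$ and $\beta(e_{ij})$; the function and values are illustrative, not the authors&rsquo; implementation:</p>

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def modulated_attention_row(q, keys, values, gamma, beta, d):
    """One query attending over all keys, with per-edge scale/shift of the logits."""
    scores = [
        (sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)) * gamma[j] + beta[j]
        for j, k in enumerate(keys)
    ]
    w = softmax(scores)  # attention weights over existing nodes
    dim = len(values[0])
    return [sum(w[j] * values[j][i] for j in range(len(values))) for i in range(dim)]
```

With <code>gamma</code> all ones and <code>beta</code> all zeros this reduces to standard scaled dot-product attention; non-trivial values let bond types sharpen or suppress individual node-to-node interactions.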
<h3 id="models">Models</h3>
<ul>
<li><strong>Image Encoder:</strong> ResNet-34 backbone followed by a Transformer encoder.</li>
<li><strong>Graph Decoder:</strong> A &ldquo;Graph-Aware Transformer&rdquo; (GRAT) that outputs nodes (atom labels, coordinates) and edges (bond types).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics focus on structural identity, as standard string matching (SMILES) is insufficient for graphs with superatoms.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMI</strong></td>
          <td>Canonical SMILES Match</td>
          <td>Correct if predicted SMILES is identical to ground truth.</td>
      </tr>
      <tr>
          <td><strong>TS 1</strong></td>
          <td>Tanimoto Similarity = 1.0</td>
          <td>Ratio of predictions with perfect fingerprint overlap.</td>
      </tr>
      <tr>
          <td><strong>Sim.</strong></td>
          <td>Average Tanimoto Similarity</td>
          <td>Measures average structural overlap (cutoff usually &gt; 0.85).</td>
      </tr>
  </tbody>
</table>
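<p>For binary fingerprints, the Tanimoto similarity used by the <strong>TS 1</strong> and <strong>Sim.</strong> metrics reduces to intersection over union of the &ldquo;on&rdquo; bits. A minimal sketch (the fingerprints here are hypothetical bit indices, not real molecular fingerprints):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bit indices.

    For binary vectors this equals A.B / (|A|^2 + |B|^2 - A.B).
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

fp_true = {1, 5, 9, 12}   # illustrative ground-truth fingerprint
fp_pred = {1, 5, 9, 14}   # illustrative prediction with one bit wrong
score = tanimoto(fp_true, fp_pred)  # counted toward 'Sim.'; only 1.0 counts for 'TS 1'
```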
]]></content:encoded></item><item><title>ICMDT: Automated Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/icmdt/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/icmdt/</guid><description>A Transformer-based model (ICMDT) for converting chemical structure images into InChI text strings using a novel Deep TNT block.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, Y., Chen, G., &amp; Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. <em>Applied Sciences</em>, 12(2), 680. <a href="https://doi.org/10.3390/app12020680">https://doi.org/10.3390/app12020680</a></p>
<p><strong>Publication</strong>: MDPI Applied Sciences 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">Kaggle Competition: BMS Molecular Translation</a></li>
</ul>
<h2 id="contribution-image-to-text-translation-for-chemical-structures">Contribution: Image-to-Text Translation for Chemical Structures</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel neural network architecture, the <strong>Image Captioning Model based on Deep TNT (ICMDT)</strong>, to solve the specific problem of &ldquo;molecular translation&rdquo; (image-to-text). The classification is supported by the following rhetorical indicators:</p>
<ul>
<li><strong>Novel Mechanism:</strong> It introduces the &ldquo;Deep TNT block&rdquo; to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).</li>
<li><strong>Baseline Comparison:</strong> The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).</li>
<li><strong>Ablation Study:</strong> Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.</li>
</ul>
<h2 id="motivation-digitizing-historical-chemical-literature">Motivation: Digitizing Historical Chemical Literature</h2>
<p>The primary motivation is to speed up chemical research by digitizing historical chemical literature.</p>
<ul>
<li><strong>Problem:</strong> Historical sources often contain corrupted or noisy images, making automated recognition difficult.</li>
<li><strong>Gap:</strong> Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.</li>
<li><strong>Goal:</strong> To build a dependable generative model that can accurately translate these noisy images into <strong><a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a></strong> (International Chemical Identifier) text strings.</li>
</ul>
<h2 id="novelty-multi-level-feature-fusion-with-deep-tnt">Novelty: Multi-Level Feature Fusion with Deep TNT</h2>
<p>The core contribution is the <strong>Deep TNT block</strong> and the resulting <strong>ICMDT</strong> architecture.</p>
<ul>
<li><strong>Deep TNT Block:</strong> The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
<ol>
<li><strong>Internal Transformer:</strong> Processes pixel embeddings.</li>
<li><strong>Middle Transformer:</strong> Processes small patch embeddings.</li>
<li><strong>Exterior Transformer:</strong> Processes large patch embeddings.</li>
</ol>
</li>
<li><strong>Multi-level Fusion:</strong> The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.</li>
<li><strong>Position Encoding:</strong> A specific strategy of applying shared position encodings to small patches and pixels, while using a learnable 1D encoding for large patches.</li>
</ul>
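<p>Using the patch sizes reported in the reproducibility details (small patch $16 \times 16$, pixel patch $4 \times 4$) and the initial $224 \times 224$ training resolution, the unit counts at the lower two granularities can be derived as follows; how small patches are grouped into large patches is not specified here, so that level is omitted:</p>

```python
def tnt_hierarchy(img_size, small_patch, pixel_patch):
    """Count units at the lower two granularities of the Deep TNT hierarchy."""
    n_small = (img_size // small_patch) ** 2      # small patches per image
    n_pixel = (small_patch // pixel_patch) ** 2   # pixel sub-patches per small patch
    return n_small, n_pixel

# 224x224 image, 16x16 small patches, 4x4 pixel patches.
n_small, n_pixel = tnt_hierarchy(224, 16, 4)
```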
<h2 id="methodology-benchmarking-on-the-bms-dataset">Methodology: Benchmarking on the BMS Dataset</h2>
<p>The authors evaluated the model on the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset.</p>
<ul>
<li><strong>Baselines:</strong> They constructed four comparative models:
<ul>
<li>EfficientNetb0 + RNN (Bi-LSTM)</li>
<li>ResNet50d + RNN (Bi-LSTM)</li>
<li>EfficientNetb0 + Transformer</li>
<li>ResNet101d + Transformer</li>
</ul>
</li>
<li><strong>Ablation:</strong> They tested the impact of removing the large patch position encoding (ICMDT*) and reverting the encoder to a standard TNT-S (TNTD).</li>
<li><strong>Pre-processing Study:</strong> They experimented with denoising ratios and cropping strategies.</li>
</ul>
<h2 id="results--conclusions-state-of-the-art-inchi-translation">Results &amp; Conclusions: State-of-the-Art InChI Translation</h2>
<ul>
<li><strong>Performance:</strong> ICMDT achieved the lowest <strong>Levenshtein distance (0.69)</strong> compared to the best baseline (1.45 for ResNet101d+Transformer).</li>
<li><strong>Convergence:</strong> The model converged significantly faster than the baselines, outperforming others as early as epoch 6.7.</li>
<li><strong>Ablation Results:</strong> The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance).</li>
<li><strong>Limitations:</strong> The model struggles with <strong>stereochemical layers</strong> (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.</li>
<li><strong>Future Work:</strong> Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The experiments used the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset from Kaggle.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BMS Training Set</td>
          <td>2,424,186 images</td>
          <td>Supervised; contains noise and blur</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BMS Test Set</td>
          <td>1,616,107 images</td>
          <td>Higher noise variation than training set</td>
      </tr>
  </tbody>
</table>
<p><strong>Pre-processing Strategy</strong>:</p>
<ul>
<li><strong>Effective:</strong> Padding resizing (reshaping to square, padding with border pixels).</li>
<li><strong>Ineffective:</strong> Smart cropping (removing white borders degraded performance).</li>
<li><strong>Augmentation:</strong> GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).</li>
<li><strong>Denoising:</strong> Best results found by mixing denoised and original data (Ratio 2:13) during training.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer:</strong> Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).</li>
<li><strong>Loss Function:</strong> Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. While standard Focal Loss adds a modulating factor $(1-p_t)^\gamma$ to cross-entropy to focus on hard negatives, Anti-Focal Loss modifies this to reduce the disparity between training and inference distributions:
$$ \text{Anti-Focal Loss}(p_t) = - (1 + p_t)^\gamma \log(p_t) $$</li>
<li><strong>Training Schedule:</strong>
<ul>
<li>Initial resolution: $224 \times 224$</li>
<li>Fine-tuning: resolution $384 \times 384$ for labels of length $&gt;150$.</li>
<li>Batch size: Dynamic, increasing from 16 to 1024.</li>
</ul>
</li>
<li><strong>Inference Strategy:</strong>
<ul>
<li>Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).</li>
<li>Test Time Augmentation (TTA): Rotations of $90^\circ$.</li>
<li>Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.</li>
</ul>
</li>
</ul>
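<p>The two loss variants above can be compared directly for a single predicted probability $p_t$. This is a minimal sketch using the paper&rsquo;s $\gamma = 0.5$; label smoothing is omitted and the function names are illustrative:</p>

```python
import math

def focal_loss(p_t, gamma=0.5):
    """Standard focal loss: down-weights easy examples via (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def anti_focal_loss(p_t, gamma=0.5):
    """Anti-focal variant: scales by (1 + p_t)^gamma instead."""
    return -((1.0 + p_t) ** gamma) * math.log(p_t)
```

For a confident correct prediction (large $p_t$), focal loss shrinks the penalty toward zero while the anti-focal variant keeps it comparatively large, which is the mechanism claimed to reduce the train/inference mismatch.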
<h3 id="models">Models</h3>
<p><strong>ICMDT Architecture:</strong></p>
<ul>
<li><strong>Encoder (Deep TNT)</strong>:
<ul>
<li><strong>Internal Block:</strong> Hidden size 160, Heads 4, MLP act GELU.</li>
<li><strong>Middle Block:</strong> Hidden size 640, Heads 10.</li>
<li><strong>Exterior Block:</strong> Hidden size 2560, Heads 16.</li>
<li><strong>Patch Sizes:</strong> Small patch $16 \times 16$, Pixel patch $4 \times 4$.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Dimensions: 5120 (Hidden), 1024 (FFN).</li>
<li>Depth: 3 layers, Heads: 8.</li>
<li>Vocab Size: 193 (InChI tokens).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric:</strong> Levenshtein Distance (measures single-character edit operations between generated and ground truth InChI strings).</p>
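<p>A compact dynamic-programming sketch of this metric (the standard two-row formulation; the InChI strings below are illustrative):</p>

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A perfect InChI prediction scores 0; each wrong character adds at most 1.
dist = levenshtein("InChI=1S/CH4/h1H4", "InChI=1S/CH4/h1H4")
```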
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Levenshtein Distance</th>
          <th>Params (M)</th>
          <th>Convergence (Epochs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT (Ours)</strong></td>
          <td><strong>0.69</strong></td>
          <td>138.16</td>
          <td>~9.76</td>
      </tr>
      <tr>
          <td>TNTD (Ablation)</td>
          <td>1.29</td>
          <td>114.36</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ResNet101d + Transformer</td>
          <td>1.45</td>
          <td>302.02</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>ResNet50d + RNN</td>
          <td>3.86</td>
          <td>90.6</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + RNN</td>
          <td>2.95</td>
          <td>46.3</td>
          <td>11+</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liAutomatedRecognitionChemical2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Li, Yanchi and Chen, Guanyu and Li, Xiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Applied Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{680}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Multidisciplinary Digital Publishing Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2076-3417}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3390/app12020680}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>End-to-End Transformer for Molecular Image Captioning</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/vit-inchi-transformer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/vit-inchi-transformer/</guid><description>Vision Transformer encoder with Transformer decoder for molecular image-to-InChI translation, achieving state-of-the-art performance on noisy datasets.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sundaramoorthy, C., Kelvin, L. Z., Sarin, M., &amp; Gupta, S. (2021). End-to-End Attention-based Image Captioning. <em>arXiv preprint arXiv:2104.14721</em>. <a href="https://doi.org/10.48550/arXiv.2104.14721">https://doi.org/10.48550/arXiv.2104.14721</a></p>
<p><strong>Publication</strong>: arXiv 2021 (preprint)</p>
<p><strong>Note</strong>: This is an arXiv preprint and has not undergone formal peer review.</p>
<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological Paper</strong>. It proposes a novel architectural approach to molecular image translation by replacing the standard CNN encoder with a Vision Transformer (ViT). The authors validate this method through comparative benchmarking against standard CNN+RNN baselines (e.g., ResNet+LSTM) and provide optimizations for inference speed.</p>
<h2 id="motivation-and-problem-statement">Motivation and Problem Statement</h2>
<p>The core problem addressed is that existing molecular translation methods (which extract chemical structures from images into the computer-readable InChI format) rely heavily on rule-based systems or CNN+RNN architectures. These approaches often underperform on noisy images (common in scans of old journals) or on images with few distinguishable features. There is a significant need in drug discovery to digitize and analyze legacy experimental data locked in image form within scientific publications.</p>
<h2 id="core-innovations-end-to-end-vit-encoder">Core Innovations: End-to-End ViT Encoder</h2>
<p>The primary contribution is the use of a completely convolution-free Vision Transformer (ViT) as the encoder, allowing the model to utilize long-range dependencies among image patches from the very beginning via self-attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
The architecture is a pure Transformer (Encoder-Decoder), treating the molecular image similarly to a sequence of tokens (patches). Furthermore, the authors implement a specific caching strategy for the decoder to avoid recomputing embeddings for previously decoded tokens, reducing the time complexity of the decoding step.</p>
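<p>The benefit of caching can be illustrated by counting token states computed during an autoregressive decode. This is a simplified accounting of the asymptotic saving, not the authors&rsquo; exact complexity analysis, and the function names are hypothetical:</p>

```python
def naive_decode_cost(n_steps):
    """Without caching, every step re-encodes the entire decoded prefix."""
    return sum(t for t in range(1, n_steps + 1))  # 1 + 2 + ... + N, i.e. O(N^2)

def cached_decode_cost(n_steps):
    """With per-layer caching, each step computes only the newest token's state."""
    return n_steps  # O(N)

naive = naive_decode_cost(10)
cached = cached_decode_cost(10)
```

Storing each layer&rsquo;s outputs for already-decoded tokens thus removes a full factor of the sequence length from the per-decode cost, matching the reported drop from $O(MN^2 + N^3)$ to $O(MN + N^2)$.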
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was compared against a standard CNN + RNN baseline and ResNet (18, 34, 50) + LSTM with attention. Ablation studies varied the number of Transformer layers (3, 6, 12, 24) and the image resolution ($224 \times 224$ vs. $384 \times 384$). The model was trained on a large combined dataset, including Bristol Myers Squibb data, SMILES, GDB-13, and synthetically augmented images containing noise and artifacts. Performance was evaluated using the Levenshtein distance, which counts the minimum number of single-character edits needed to transform the predicted string into the ground truth.</p>
<h2 id="performance-outcomes-and-capabilities">Performance Outcomes and Capabilities</h2>
<p>The proposed 24-layer ViT model (input size 384) achieved the lowest Levenshtein distance of <strong>6.95</strong>, significantly outperforming the ResNet50+LSTM baseline (7.49) and the standard CNN+RNN (103.7). Increasing the number of layers had a strong positive impact, with the 24-layer model becoming competitive with current approaches. The model demonstrated high robustness on datasets with low distinguishable features and noise. The proposed caching optimization reduced the decoding time complexity per timestep from $O(MN^2 + N^3)$ to $O(MN + N^2)$.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combined dataset randomly split into 70% training, 10% test, and 20% validation.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Bristol Myers Squibb</strong></td>
          <td>~2.4 million synthetic images with InChI labels.</td>
          <td>Provided by BMS global biopharmaceutical company.</td>
      </tr>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>Kaggle contest data converted to InChI.</td>
          <td>Images generated using RDKit.</td>
      </tr>
      <tr>
          <td><strong>GDB-13</strong></td>
          <td>Subset of 977 million small organic molecules (up to 13 atoms).</td>
          <td>Converted from SMILES using RDKit.</td>
      </tr>
      <tr>
          <td><strong>Augmented Images</strong></td>
          <td>Synthetic images with salt/pepper noise, dropped atoms, and bond modifications.</td>
          <td>Used to improve robustness against noise.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Objective</strong>: Cross-entropy loss minimization.</li>
<li><strong>Inference Decoding</strong>: Autoregressive decoding predicting the next character of the InChI string.</li>
<li><strong>Positional Encoding</strong>: Standard sine and cosine functions of different frequencies.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Caching</strong>: Caches the output of each layer during decoding to avoid recomputing embeddings for already decoded tokens.</li>
<li><strong>JIT</strong>: PyTorch JIT compiler used for graph optimization.</li>
<li><strong>Self-Critical Training</strong>: Finetuning performed using self-critical sequence training (SCST).</li>
</ul>
</li>
</ul>
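<p>The standard sinusoidal positional encoding mentioned above can be sketched as follows (a minimal implementation of the usual $\text{PE}(pos, 2i) = \sin(pos / 10000^{2i/d})$, $\text{PE}(pos, 2i+1) = \cos(\cdot)$ scheme, not the authors&rsquo; code):</p>

```python
import math

def sinusoidal_position(pos, d_model):
    """Positional encoding vector for one position, paired sine/cosine channels."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        if i + 1 < d_model:
            pe.append(math.cos(angle))
    return pe

pe0 = sinusoidal_position(0, 8)  # position 0: sines are 0, cosines are 1
```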
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder (Vision Transformer)</strong>:
<ul>
<li>Input: Flattened 2D patches of the image. Patch size: $16 \times 16$.</li>
<li>Projection: Trainable linear projection to latent vector size $D$.</li>
<li>Structure: Alternating layers of Multi-Head Self-Attention (MHSA) and MLP blocks.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Input: Tokenized InChI string + sinusoidal positional embedding.</li>
<li>Vocabulary: 275 tokens (including <code>&lt;SOS&gt;</code>, <code>&lt;PAD&gt;</code>, <code>&lt;EOS&gt;</code>).</li>
</ul>
</li>
<li><strong>Hyperparameters (Best Model)</strong>:
<ul>
<li>Image Size: $384 \times 384$.</li>
<li>Layers: 24.</li>
<li>Feature Dimension: 512.</li>
<li>Attention Heads: 12.</li>
<li>Optimizer: Adam.</li>
<li>Learning Rate: $3 \times 10^{-5}$ (decayed by 0.5 in last 2 epochs).</li>
<li>Batch Size: Varied [64-512].</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Levenshtein Distance (lower is better).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Image Size</th>
          <th>Layers</th>
          <th>Levenshtein Dist.</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard CNN+RNN</td>
          <td>224</td>
          <td>3</td>
          <td>103.7</td>
      </tr>
      <tr>
          <td>ResNet50 + LSTM</td>
          <td>224</td>
          <td>5</td>
          <td>7.49</td>
      </tr>
      <tr>
          <td>ViT Transformers (Best)</td>
          <td>384</td>
          <td>24</td>
          <td><strong>6.95</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>System</strong>: 70GB GPU system.</li>
<li><strong>Framework</strong>: PyTorch and PyTorch Lightning.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{sundaramoorthyEndtoEndAttentionbasedImage2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{End-to-{{End Attention-based Image Captioning}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sundaramoorthy, Carola and Kelvin, Lin Ziwen and Sarin, Mahak and Gupta, Shubham}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER 1.0: Transformers for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/decimer-1.0/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/decimer-1.0/</guid><description>Transformer-based approach for Optical Chemical Structure Recognition converting chemical images to SELFIES strings with 96% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. <em>Journal of Cheminformatics</em>, 13(1), 61. <a href="https://doi.org/10.1186/s13321-021-00538-8">https://doi.org/10.1186/s13321-021-00538-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER-TPU">GitHub Repository</a></li>
<li><a href="https://decimer.ai/">DECIMER Project Page</a></li>
</ul>
<h2 id="evaluating-the-contribution-a-methodological-shift">Evaluating the Contribution: A Methodological Shift</h2>
<p><strong>Method (Dominant)</strong> with strong <strong>Resource</strong> elements.</p>
<p>This is primarily a <strong>Method</strong> paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a <strong>Transformer-based network</strong> to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.</p>
<p>It also serves as a <strong>Resource</strong> contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (&gt;35 million molecules).</p>
<h2 id="motivation-unlocking-locked-chemical-knowledge">Motivation: Unlocking Locked Chemical Knowledge</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.</li>
<li><strong>Manual Bottlenecks</strong>: Manual curation and extraction of this data is tedious, slow, and error-prone.</li>
<li><strong>Limitations of Prior Tools</strong>: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.</li>
</ul>
<h2 id="key-innovation-transformer-based-molecular-translation">Key Innovation: Transformer-Based Molecular Translation</h2>
<ul>
<li><strong>Transformer Architecture</strong>: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a <strong>Transformer-based decoder</strong>, significantly improving accuracy.</li>
<li><strong>EfficientNet Backbone</strong>: Replaces the standard InceptionV3 feature extractor with <strong>EfficientNet-B3</strong>, which improved feature extraction quality for chemical images.</li>
<li><strong>SELFIES Representation</strong>: Utilizes <a href="/notes/computational-chemistry/molecular-representations/selfies/"><strong>SELFIES</strong></a> (SELF-referencing Embedded Strings) as the target output. This guarantees 100% robust molecular strings and eliminates the &ldquo;invalid SMILES&rdquo; problem common in generative models.</li>
<li><strong>Massive Scaling</strong>: Trains on a synthetic dataset of <strong>35-39 million molecules</strong>, demonstrating that scaling data size directly correlates with improved model performance.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<ul>
<li><strong>Feature Extractor Ablation</strong>: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints:
$$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$</li>
<li><strong>Data Scaling</strong>: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.</li>
<li><strong>Stereochemistry &amp; Ions</strong>: Tested the model&rsquo;s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.</li>
<li><strong>Augmentation Robustness</strong>: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.</li>
</ul>
<h2 id="results-superior-accuracy-and-scaling-limit-observations">Results: Superior Accuracy and Scaling Limit Observations</h2>
<ul>
<li><strong>Superior Architecture</strong>: The Transformer model with EfficientNet-B3 features significantly outperformed the Encoder-Decoder baseline. On the 1M dataset, the Transformer achieved <strong>74.57%</strong> exact matches (Tanimoto 1.0) compared to only <strong>7.03%</strong> for the Encoder-Decoder.</li>
<li><strong>High Accuracy at Scale</strong>: With the full 35-million molecule training set (Dataset 1), the model achieved a <strong>Tanimoto 1.0 score of 96.47%</strong> and an average Tanimoto similarity of 0.99.</li>
<li><strong>Isomorphism</strong>: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a>).</li>
<li><strong>Stereochemistry Costs</strong>: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).</li>
<li><strong>Hardware Efficiency</strong>: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs, reducing training time for the largest models from months to under 14 days.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors generated synthetic data from PubChem.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 1 (Clean)</td>
          <td>35M molecules</td>
          <td>No stereo/ions. Filtered for MW &lt; 1500, bond count 3-40, SMILES len &lt; 40.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 2 (Complex)</td>
          <td>33M molecules</td>
          <td>Includes stereochemistry and charged groups (ions).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 3 (Augmented)</td>
          <td>33M molecules</td>
          <td>Dataset 2 with image augmentations applied.</td>
      </tr>
      <tr>
          <td><strong>Preprocessing</strong></td>
          <td>N/A</td>
          <td>N/A</td>
          <td>Molecules converted to <strong>SELFIES</strong>. Images generated via CDK Structure Diagram Generator (SDG) as $299\times299$ 8-bit PNGs.</td>
      </tr>
      <tr>
          <td><strong>Format</strong></td>
          <td>TFRecords</td>
          <td>75 MB chunks</td>
          <td>128 data points (image vector + tokenized string) per record.</td>

      </tr>
  </tbody>
</table>
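<p>The SELFIES preprocessing step lends itself to simple tokenization, because every SELFIES symbol is enclosed in square brackets. The paper uses the Keras tokenizer; the regex-based sketch below is an illustrative alternative (function and variable names are ours, not from the paper):</p>

```python
import re

def tokenize_selfies(s):
    """Split a SELFIES string into its bracketed symbols, e.g. [C][=C][O]."""
    tokens = re.findall(r"\[[^\]]*\]", s)
    # Sanity check: the tokens must reassemble to the original string.
    assert "".join(tokens) == s, "unexpected characters outside brackets"
    return tokens

# A vocabulary is just the set of unique symbols seen in the training data.
vocab = sorted(set(tokenize_selfies("[C][=C][C][=C][Branch1][Ring1][O]")))
```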
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Text Representation</strong>: <strong>SELFIES</strong> used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
<ul>
<li><em>Dataset 1 Tokens</em>: 27 unique tokens. Max length 47.</li>
<li><em>Dataset 2/3 Tokens</em>: 61 unique tokens (due to stereo/ion tokens).</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Implemented using <code>imgaug</code> python package. Random application of:
<ul>
<li>Gaussian/Average Blur, Additive Gaussian Noise, Salt &amp; Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler (following the &ldquo;Attention is all you need&rdquo; paper).</li>
</ul>
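<p>The custom scheduler from &ldquo;Attention is all you need&rdquo; performs a linear warmup followed by inverse-square-root decay. A sketch assuming the standard formulation, with a hypothetical warmup of 4,000 steps (the paper&rsquo;s exact warmup value is not restated here):</p>

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """'Attention Is All You Need' schedule: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The learning rate peaks exactly at the warmup step and decays as $1/\sqrt{\text{step}}$ afterwards.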
<h3 id="models">Models</h3>
<p>The final architecture is an <strong>Image-to-SELFIES Transformer</strong>.</p>
<ul>
<li><strong>Encoder (Feature Extractor)</strong>:
<ul>
<li><strong>EfficientNet-B3</strong> (pre-trained on Noisy-student).</li>
<li>Input: $299 \times 299 \times 3$ images (normalized -1 to 1).</li>
<li>Output Feature Vector: $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder (Transformer)</strong>:
<ul>
<li>4 Encoder-Decoder layers.</li>
<li>8 Parallel Attention Heads.</li>
<li>Dimension size: 512.</li>
<li>Feed-forward size: 2048.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td><strong>96.47%</strong></td>
          <td>74.57% (1M subset)</td>
          <td>Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Avg Tanimoto</strong></td>
          <td><strong>0.9923</strong></td>
          <td>0.9371 (1M subset)</td>
          <td>Average similarity score (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Isomorphism</strong></td>
          <td><strong>99.75%</strong></td>
          <td>-</td>
          <td>Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware</strong>: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.</li>
<li><strong>Comparison Hardware</strong>: Nvidia Tesla V100 (32GB GPU).</li>
<li><strong>Performance</strong>:
<ul>
<li>TPU v3-8 was ~4x faster than V100 GPU.</li>
<li>1 Million molecule model convergence: ~8.5 hours on TPU vs ~30 hours on GPU.</li>
<li>Largest model (35M) took &lt; 14 days on TPU.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMER10Deep2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{DECIMER 1.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{dec}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00538-8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1186/s13321-021-00538-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemPix: Hand-Drawn Hydrocarbon Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/chempix/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/chempix/</guid><description>Deep learning framework using CNN-LSTM image captioning to convert hand-drawn hydrocarbon structures into SMILES strings with 76% accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weir, H., Thompson, K., Woodward, A., Choi, B., Braun, A., &amp; Martínez, T. J. (2021). ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning. <em>Chemical Science</em>, 12(31), 10622-10633. <a href="https://doi.org/10.1039/D1SC02957F">https://doi.org/10.1039/D1SC02957F</a></p>
<p><strong>Publication</strong>: Chemical Science 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/mtzgroup/ChemPixCH">GitHub Repository</a></li>
</ul>
<h2 id="paper-classification-and-core-contribution">Paper Classification and Core Contribution</h2>
<p>This is primarily a <strong>Method</strong> paper, with a secondary contribution as a <strong>Resource</strong> paper.</p>
<p>The paper&rsquo;s core contribution is the <strong>ChemPix architecture and training strategy</strong> using neural image captioning (CNN-LSTM) to convert hand-drawn chemical structures to SMILES. The extensive ablation studies on synthetic data generation (augmentation, degradation, backgrounds) and ensemble learning strategies confirm the methodological focus. The secondary resource contribution includes releasing a curated dataset of hand-drawn hydrocarbons and code for generating synthetic training data.</p>
<h2 id="the-structural-input-bottleneck-in-computational-chemistry">The Structural Input Bottleneck in Computational Chemistry</h2>
<p>Inputting molecular structures into computational chemistry software for quantum calculations is often a bottleneck, requiring domain expertise and cumbersome manual entry in drawing software. While optical chemical structure recognition (OCSR) tools exist, they typically struggle with the noise and variability of hand-drawn sketches. There is a practical need for a tool that allows chemists to simply photograph a hand-drawn sketch and immediately convert it into a machine-readable format (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>), making computational workflows more accessible.</p>
<h2 id="cnn-lstm-image-captioning-and-robust-synthetic-generalization">CNN-LSTM Image Captioning and Robust Synthetic Generalization</h2>
<ol>
<li><strong>Image Captioning Paradigm</strong>: The authors treat the problem as <strong>neural image captioning</strong>, using an encoder-decoder (CNN-LSTM) framework to &ldquo;translate&rdquo; an image directly to a SMILES string. This avoids the complexity of explicit atom/bond detection and graph assembly.</li>
<li><strong>Synthetic Data Engineering</strong>: The paper introduces a rigorous synthetic data generation pipeline that transforms clean RDKit-generated images into &ldquo;pseudo-hand-drawn&rdquo; images via randomized backgrounds, degradation, and heavy augmentation. This allows the model to achieve &gt;50% accuracy on real hand-drawn data without ever seeing it during pre-training.</li>
<li><strong>Ensemble Uncertainty Estimation</strong>: The method utilizes a &ldquo;committee&rdquo; (ensemble) of networks to improve accuracy and estimate confidence based on vote agreement, providing users with reliability indicators for predictions.</li>
</ol>
<h2 id="extensive-ablation-and-real-world-evaluation">Extensive Ablation and Real-World Evaluation</h2>
<ol>
<li><strong>Ablation Studies on Data Pipeline</strong>: The authors trained models on datasets generated at different stages of the pipeline (Clean RDKit $\rightarrow$ Augmented $\rightarrow$ Backgrounds $\rightarrow$ Degraded) to quantify the value of each transformation in bridging the synthetic-to-real domain gap.</li>
<li><strong>Sample Size Scaling</strong>: They analyzed performance scaling by training on synthetic dataset sizes ranging from 50,000 to 500,000 images to understand data requirements.</li>
<li><strong>Real-world Validation</strong>: The model was evaluated on a held-out test set of hand-drawn images collected via a custom web app, providing genuine out-of-distribution testing.</li>
<li><strong>Fine-tuning Experiments</strong>: Comparisons of &ldquo;zero-shot&rdquo; performance (training only on synthetic data) versus fine-tuning with a small fraction of real hand-drawn data to assess the value of limited real-world supervision.</li>
</ol>
<h2 id="state-of-the-art-hand-drawn-ocsr-performance">State-of-the-Art Hand-Drawn OCSR Performance</h2>
<ol>
<li>
<p><strong>Pipeline Efficacy</strong>: Augmentation and image degradation were the most critical factors for generalization, improving accuracy from ~8% to nearly 50% on hand-drawn data. Adding backgrounds had a surprisingly negligible effect compared to degradation.</p>
</li>
<li>
<p><strong>State-of-the-Art Performance</strong>: The final ensemble model achieved <strong>76% accuracy</strong> (top-1) and <strong>86% accuracy</strong> (top-3) on the hand-drawn test set, demonstrating practical viability for real-world use.</p>
</li>
<li>
<p><strong>Synthetic Generalization</strong>: A model trained on 500,000 synthetic images achieved &gt;50% accuracy on real hand-drawn data without any fine-tuning, validating the synthetic data generation strategy as a viable alternative to expensive manual labeling.</p>
</li>
<li>
<p><strong>Ensemble Benefits</strong>: The voting committee approach improved accuracy and provided interpretable uncertainty estimates through vote distributions.</p>
</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study relies on two primary data sources: a massive synthetic dataset generated procedurally and a smaller collected dataset of real drawings.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Synthetic (RDKit)</td>
          <td>500,000 images</td>
          <td>Generated via RDKit with &ldquo;heavy&rdquo; augmentation: rotation ($0-360°$), blur, salt+pepper noise, and background texture addition.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Hand-Drawn (Real)</td>
          <td>~600 images</td>
          <td>Crowdsourced via a web app; used for fine-tuning and validation.</td>
      </tr>
      <tr>
          <td><strong>Backgrounds</strong></td>
          <td>Texture Images</td>
          <td>1,052 images</td>
          <td>A pool of unlabeled texture photos (paper, desks, shadows) used to generate synthetic backgrounds.</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation Parameters</strong>:</p>
<ul>
<li><strong>Augmentations</strong>: Rotation, Resize ($200-300px$), Blur, Dilate, Erode, Aspect Ratio, Affine transform ($\pm 20px$), Contrast, Quantize, Sharpness</li>
<li><strong>Backgrounds</strong>: Randomly translated $\pm 100$ pixels and reflected</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Ensemble Voting</strong><br>
A committee of networks casts votes for the predicted SMILES string. The final prediction is the one with the highest vote count. Validity of SMILES is checked using RDKit.</p>
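<p>The committee vote can be sketched as a simple majority count. This is illustrative only: ChemPix additionally filters out RDKit-invalid SMILES before voting, a step omitted here, and the helper name is ours:</p>

```python
from collections import Counter

def committee_predict(predictions):
    """Majority vote over the SMILES strings predicted by an ensemble.

    Returns (best_smiles, confidence), where confidence is the vote fraction
    and serves as the reliability indicator described in the paper.
    """
    votes = Counter(predictions)
    smiles, count = votes.most_common(1)[0]
    return smiles, count / len(predictions)

smiles, conf = committee_predict(["CCO", "CCO", "CC", "CCO", "CCC"])
```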
<p><strong>Beam Search</strong><br>
Used in the decoding layer with a beam width of $k=5$ to explore multiple potential SMILES strings. It approximates the sequence $\mathbf{\hat{y}}$ that maximizes the joint probability:</p>
<p>$$ \mathbf{\hat{y}} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p><strong>Optimization</strong>:</p>
<ul>
<li>
<p><strong>Optimizer</strong>: Adam</p>
</li>
<li>
<p><strong>Learning Rate</strong>: $1 \times 10^{-4}$</p>
</li>
<li>
<p><strong>Batch Size</strong>: 20</p>
</li>
<li>
<p><strong>Loss Function</strong>: Cross-entropy loss across the sequence of $T$ tokens, computed as:</p>
<p>$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$</p>
<p>where $\mathbf{x}$ is the image representation and $y_t$ is the predicted SMILES character. For validation and checkpointing, this loss is reported as perplexity.</p>
</li>
</ul>
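<p>The relationship between this cross-entropy loss and the validation perplexity can be made concrete. The sketch below assumes perplexity is the exponential of the mean per-token cross-entropy, which is the usual convention; the paper does not spell out the normalization:</p>

```python
import math

def sequence_loss_and_perplexity(token_probs):
    """Cross-entropy summed over the sequence, and per-token perplexity.

    token_probs: model probabilities P(y_t | y_<t, x) of the ground-truth tokens.
    """
    loss = -sum(math.log(p) for p in token_probs)   # summed cross-entropy
    perplexity = math.exp(loss / len(token_probs))  # exp of the mean per-token loss
    return loss, perplexity
```

A model that assigns probability 0.5 to every correct token has a perplexity of exactly 2.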
<h3 id="models">Models</h3>
<p>The architecture is a standard image captioning model (Show, Attend and Tell style) adapted for chemical structures.</p>
<p><strong>Encoder (CNN)</strong>:</p>
<ul>
<li><strong>Input</strong>: $256 \times 256$ images (implicit from scaling operations)</li>
<li><strong>Structure</strong>: 4 blocks of Conv2D + MaxPool
<ul>
<li>Block 1: 64 filters, (3,3) kernel</li>
<li>Block 2: 128 filters, (3,3) kernel</li>
<li>Block 3: 256 filters, (3,3) kernel</li>
<li>Block 4: 512 filters, (3,3) kernel</li>
</ul>
</li>
<li><strong>Activation</strong>: ReLU throughout</li>
</ul>
<p><strong>Decoder (LSTM)</strong>:</p>
<ul>
<li><strong>Hidden Units</strong>: 512</li>
<li><strong>Embedding Dimension</strong>: 80</li>
<li><strong>Attention</strong>: Mechanism with intermediary vector dimension of 512</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Exact SMILES match accuracy (character-by-character identity between predicted and ground truth SMILES)</li>
<li><strong>Perplexity</strong>: Used for saving model checkpoints (minimizing uncertainty)</li>
<li><strong>Top-k Accuracy</strong>: Reported for $k=1$ (76%) and $k=3$ (86%)</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{weir2021chempix,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemPix: Automated Recognition of Hand-Drawn Hydrocarbon Structures Using Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Weir, Hayley and Thompson, Keiran and Woodward, Amelia and Choi, Benjamin and Braun, Augustin and Mart{\&#39;i}nez, Todd J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{31}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{10622--10633}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D1SC02957F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ABC-Net: Divide-and-Conquer SMILES Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/abc-net/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/abc-net/</guid><description>Deep learning OCSR model using keypoint estimation to detect atom and bond centers for graph-based molecular structure recognition.</description><content:encoded><![CDATA[<h2 id="contribution-and-paper-type">Contribution and Paper Type</h2>
<p><strong>Method</strong>. The paper proposes a novel architectural framework (ABC-Net) for Optical Chemical Structure Recognition (OCSR). It reformulates the problem from image captioning (sequence generation) to keypoint estimation (pixel-wise detection), backed by ablation studies on noise and comparative benchmarks against state-of-the-art tools.</p>
<h2 id="motivation-for-keypoint-based-ocsr">Motivation for Keypoint-Based OCSR</h2>
<ul>
<li><strong>Inefficiency of Rule-Based Methods</strong>: Traditional tools (OSRA, MolVec) rely on hand-coded rules that are brittle, require domain expertise, and fail to handle the wide variance in molecular drawing styles.</li>
<li><strong>Data Inefficiency of Captioning Models</strong>: Recent Deep Learning approaches (like DECIMER, Img2mol) treat OCSR as image captioning (Image-to-SMILES). This is data-inefficient because canonical SMILES require learning traversal orders, necessitating millions of training examples.</li>
<li><strong>Goal</strong>: To create a scalable, data-efficient model that predicts graph structures directly by detecting atomic/bond primitives.</li>
</ul>
<h2 id="abc-nets-divide-and-conquer-architecture">ABC-Net&rsquo;s Divide-and-Conquer Architecture</h2>
<ul>
<li><strong>Divide-and-Conquer Strategy</strong>: ABC-Net breaks the problem down into detecting <strong>atom centers</strong> and <strong>bond centers</strong> as independent keypoints.</li>
<li><strong>Keypoint Estimation</strong>: It leverages a Fully Convolutional Network (FCN) to generate heatmaps for object centers. This is inspired by computer vision techniques like CornerNet and CenterNet.</li>
<li><strong>Angle-Based Bond Detection</strong>: To handle overlapping bonds, the model classifies bond angles into 60 distinct bins ($0-360°$) at detected bond centers, allowing separation of intersecting bonds.</li>
<li><strong>Implicit Hydrogen Prediction</strong>: The model explicitly predicts the number of implicit hydrogens for aromatic heteroatoms to resolve ambiguity in dearomatization.</li>
</ul>
<h2 id="experimental-setup-and-synthetic-data">Experimental Setup and Synthetic Data</h2>
<ul>
<li><strong>Dataset Construction</strong>: Synthetic dataset of 100,000 molecules from ChEMBL, rendered using two different engines (RDKit and Indigo) to ensure style diversity.</li>
<li><strong>Baselines</strong>: Compared against two rule-based methods (MolVec, OSRA) and one deep learning method (Img2mol).</li>
<li><strong>Robustness Testing</strong>: Evaluated on the external UOB dataset (real-world images) and synthetic images with varying levels of salt-and-pepper noise (up to $p=0.6$).</li>
<li><strong>Data Efficiency</strong>: Analyzed performance scaling with training set size (10k to 160k images).</li>
</ul>
<h2 id="results-generalization-and-noise-robustness">Results, Generalization, and Noise Robustness</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ABC-Net achieved <strong>94-98% accuracy</strong> on synthetic test sets, significantly outperforming MolVec (&lt;50%), OSRA (~61%), and Img2mol (~89-93%).</li>
<li><strong>Generalization</strong>: On the external UOB benchmark, ABC-Net achieved <strong>&gt;95% accuracy</strong>, whereas the deep learning baseline (Img2mol) dropped to 78.2%, indicating better generalization.</li>
<li><strong>Data Efficiency</strong>: The model reached ~95% performance with only 80,000 training images, showing that it requires orders of magnitude less data than captioning-based models (which often use millions).</li>
<li><strong>Noise Robustness</strong>: Performance remained stable (&lt;2% drop) with noise levels up to $p=0.1$. Even at extreme noise ($p=0.6$), Tanimoto similarity remained high, suggesting the model recovers most substructures even when exact matches fail.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/zhang-xuan1314/ABC-Net">ABC-Net Repository</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation. Missing requirements.txt and pre-trained weights.</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Status: Partially Reproducible</strong>. The code is provided, but key components like the pre-trained weights, exact training environment dependencies, and the generated synthetic datasets are missing from the open-source release, making exact reproduction difficult.</p>
<h3 id="data">Data</h3>
<p>The authors constructed a synthetic dataset because labeled pixel-wise OCSR data is unavailable.</p>
<ul>
<li><strong>Source</strong>: ChEMBL database</li>
<li><strong>Filtering</strong>: Excluded molecules with &gt;50 non-H atoms or rare atom types/charges (&lt;1000 occurrences).</li>
<li><strong>Sampling</strong>: 100,000 unique SMILES selected such that every atom type/charge appears in at least 1,000 compounds.</li>
<li><strong>Generation</strong>: Images generated via <strong>RDKit</strong> and <strong>Indigo</strong> libraries.
<ul>
<li><em>Augmentation</em>: Varied bond thickness, label mode, orientation, and aromaticity markers.</li>
<li><em>Resolution</em>: $512 \times 512$ pixels.</li>
<li><em>Noise</em>: Salt-and-pepper noise added during training ($P$ = prob of background flip, $Q = 50P$).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (RDKit/Indigo)</td>
          <td>80k</td>
          <td>8:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>UOB Dataset</td>
          <td>~5.7k images</td>
          <td>External benchmark from Univ. of Birmingham</td>
      </tr>
  </tbody>
</table>
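<p>The salt-and-pepper corruption described above ($P$ for background pixels, $Q = 50P$) can be sketched as follows. We assume $Q$ is the flip probability for foreground (molecule) pixels, the natural salt-and-pepper reading; the paper&rsquo;s exact assignment is not restated here, and the helper name is ours:</p>

```python
import random

def salt_and_pepper(img, p):
    """Flip background (white, 1) pixels with probability P and foreground
    (black, 0) pixels with probability Q = 50 * P, capped at 1.0."""
    q = min(50 * p, 1.0)
    out = []
    for row in img:
        new_row = []
        for px in row:
            flip = random.random() < (p if px == 1 else q)
            new_row.append(1 - px if flip else px)
        out.append(new_row)
    return out
```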
<h3 id="algorithms">Algorithms</h3>
<p><strong>1. Keypoint Detection (Heatmaps)</strong></p>
<ul>
<li>
<p><strong>Down-sampling</strong>: Input $512 \times 512$ → Output $102 \times 102$ (stride 5).</p>
</li>
<li>
<p><strong>Label Softening</strong>: To handle discretization error, ground truth peaks are set to 1, first-order neighbors to 0.95, others to 0.</p>
</li>
<li>
<p><strong>Loss Function</strong>: Penalty-reduced pixel-wise binary focal loss (variants of CornerNet loss). The loss formulation is given as:</p>
<p>$$ L_{det} = - \frac{1}{N} \sum_{c,x,y} \begin{cases} (1 - \hat{Y}_{c,x,y})^{\alpha} \log(\hat{Y}_{c,x,y}) &amp; \text{if } Y_{c,x,y} = 1 \\ (1 - Y_{c,x,y})^{\beta} (\hat{Y}_{c,x,y})^{\alpha} \log(1 - \hat{Y}_{c,x,y}) &amp; \text{otherwise} \end{cases} $$</p>
<ul>
<li>$\alpha=2$ (focal parameter), $\beta=4$ (penalty reduction).</li>
<li>Weight balancing: Classes &lt;10% frequency weighted 10x.</li>
</ul>
</li>
</ul>
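<p>A NumPy sketch of this penalty-reduced focal loss on heatmaps. Normalization by the number of keypoints follows the CornerNet convention; the paper&rsquo;s exact reduction and the class-frequency weighting are not reproduced here:</p>

```python
import numpy as np

def penalty_reduced_focal_loss(y_pred, y_true, alpha=2, beta=4):
    """Penalty-reduced pixel-wise focal loss (CornerNet-style).

    y_true peaks are 1; softened first-order neighbours lie in (0, 1),
    which reduces the penalty on near-miss predictions via (1 - Y)^beta.
    """
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1 - eps)
    pos = y_true == 1
    pos_loss = ((1 - y_pred) ** alpha * np.log(y_pred))[pos]
    neg_loss = ((1 - y_true) ** beta * y_pred ** alpha * np.log(1 - y_pred))[~pos]
    n = max(pos.sum(), 1)  # normalise by the number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / n
```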
<p><strong>2. Bond Direction Classification</strong></p>
<ul>
<li><strong>Angle Binning</strong>: $360°$ divided into 60 intervals.</li>
<li><strong>Inference</strong>: A bond is detected if the angle probability is a local maximum and exceeds a threshold.</li>
<li><strong>Non-Maximum Suppression (NMS)</strong>: Required for opposite angles (e.g., $30°$ and $210°$) representing the same non-stereo bond.</li>
</ul>
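<p>The angle binning and the opposite-angle suppression can be sketched as follows (helper names are ours; the actual inference also applies the local-maximum and threshold checks described above):</p>

```python
def bond_angle_bins(angle_deg, n_bins=60):
    """Map a bond direction in degrees to one of 60 six-degree bins."""
    return int(angle_deg % 360 // (360 / n_bins))

def merge_opposite_bins(active_bins, n_bins=60):
    """NMS for non-stereo bonds: a bond at 30 degrees also fires at 210,
    so collapse each bin with its opposite (offset by n_bins // 2)."""
    merged = set()
    for b in active_bins:
        merged.add(min(b, (b + n_bins // 2) % n_bins))
    return sorted(merged)
```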
<p><strong>3. Multi-Task Weighting</strong></p>
<ul>
<li>Uses Kendall&rsquo;s uncertainty weighting to balance 8 different loss terms (atom det, bond det, atom type, charge, H-count, bond angle, bond type, bond length).</li>
</ul>
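<p>Kendall-style uncertainty weighting learns a log-variance $s_i$ per task and scales each loss as $e^{-s_i} L_i + s_i$, so noisier tasks are automatically down-weighted while the additive $s_i$ term prevents the weights from collapsing to zero. A sketch of the combined objective (the paper&rsquo;s exact per-task form, e.g. a 1/2 factor for regression losses, may differ):</p>

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine the 8 task losses with learned log-variances s_i:
    total = sum_i exp(-s_i) * L_i + s_i."""
    return sum(math.exp(-s) * L + s for L, s in zip(task_losses, log_vars))
```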
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: ABC-Net (Custom U-Net / FCN)</p>
<ul>
<li><strong>Input</strong>: $512 \times 512 \times 1$ (Grayscale).</li>
<li><strong>Contracting Path</strong>: 5 steps. Each step has conv-blocks + $2 \times 2$ MaxPool.</li>
<li><strong>Expansive Path</strong>: 3 steps. Transpose-Conv upsampling + Concatenation (Skip Connections).</li>
<li><strong>Heads</strong>: Separate $1 \times 1$ convs for each task map (Atom Heatmap, Bond Heatmap, Property Maps).</li>
<li><strong>Output Dimensions</strong>:
<ul>
<li>Heatmaps: $(1, 102, 102)$</li>
<li>Bond Angles: $(60, 102, 102)$</li>
</ul>
</li>
<li><strong>Pre-trained Weights</strong>: Stated as available in paper, but missing from the public <a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub repository</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Detection</strong>: Precision &amp; Recall (Object detection level).</li>
<li><strong>Regression</strong>: Mean Absolute Error (MAE) for bond lengths.</li>
<li><strong>Structure Recovery</strong>:
<ul>
<li><em>Accuracy</em>: Exact SMILES match rate.</li>
<li><em>Tanimoto</em>: ECFP similarity (fingerprint overlap).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>ABC-Net</th>
          <th>Img2mol (Baseline)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Accuracy (UOB)</strong></td>
          <td><strong>96.1%</strong></td>
          <td>78.2%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Accuracy (Indigo)</strong></td>
          <td><strong>96.4%</strong></td>
          <td>89.5%</td>
          <td>Non-stereo subset</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (UOB)</strong></td>
          <td><strong>0.989</strong></td>
          <td>0.953</td>
          <td>Higher substructure recovery</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>: 15 epochs, Batch size 64.</li>
<li><strong>Optimization</strong>: Adam Optimizer. LR $2.5 \times 10^{-4}$ (first 5 epochs) → $2.5 \times 10^{-5}$ (last 10).</li>
<li><strong>Compute</strong>: A &ldquo;High-Performance Computing Center&rdquo; is mentioned without a specific GPU model; the method is described as &ldquo;efficient&rdquo; on GPU.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Yi, J.-C., Yang, G.-P., Wu, C.-K., Hou, T.-J., &amp; Cao, D.-S. (2022). ABC-Net: A divide-and-conquer based deep learning architecture for SMILES recognition from molecular images. <em>Briefings in Bioinformatics</em>, 23(2), bbac033. <a href="https://doi.org/10.1093/bib/bbac033">https://doi.org/10.1093/bib/bbac033</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zhang-xuan1314/ABC-Net">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangABCNetDivideandconquerBased2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{ABC-Net: A Divide-and-Conquer Based Deep Learning Architecture for {SMILES} Recognition from Molecular Images}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Xiao-Chen and Yi, Jia-Cai and Yang, Guo-Ping and Wu, Cheng-Kun and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{bbac033}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1093/bib/bbac033}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Img2Mol: Accurate SMILES from Molecular Depictions</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/img2mol/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/img2mol/</guid><description>Two-stage CNN approach for converting molecular images to SMILES using CDDD embeddings and extensive data augmentation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clevert, D.-A., Le, T., Winter, R., &amp; Montanari, F. (2021). Img2Mol - accurate SMILES recognition from molecular graphical depictions. <em>Chemical Science</em>, 12(42), 14174-14181. <a href="https://doi.org/10.1039/D1SC01839F">https://doi.org/10.1039/D1SC01839F</a></p>
<p><strong>Publication</strong>: Chemical Science (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/bayer-science-for-a-better-life/Img2Mol">GitHub Repository</a></li>
<li><a href="https://doi.org/10.1039/D1SC01839F">Paper on Royal Society of Chemistry</a></li>
</ul>
<h2 id="method-classification">Method Classification</h2>
<p>This is a <strong>method paper</strong> that introduces Img2Mol, a deep learning system for Optical Chemical Structure Recognition (OCSR). The work focuses on building a fast, accurate, and robust system for converting molecular structure depictions into machine-readable SMILES strings.</p>
<h2 id="systematization-and-motivation">Systematization and Motivation</h2>
<p>Vast amounts of chemical knowledge exist only as images in scientific literature and patents, making this data inaccessible for computational analysis, database searches, or machine learning pipelines. Manually extracting this information is slow and error-prone, creating a bottleneck for drug discovery and chemical research.</p>
<p>While rule-based OCSR systems like OSRA, MolVec, and Imago exist, they are brittle. Small variations in drawing style or image quality can cause them to fail. The authors argue that a deep learning approach, trained on diverse synthetic data, can generalize better across different depiction styles and handle the messiness of real-world images more reliably.</p>
<h2 id="two-stage-architecture-and-core-novelty">Two-Stage Architecture and Core Novelty</h2>
<p>The novelty lies in a two-stage architecture that separates perception from decoding, combined with aggressive data augmentation to ensure robustness. The key contributions are:</p>
<p><strong>1. Two-Stage Architecture with CDDD Embeddings</strong></p>
<p>Img2Mol uses an intermediate representation to predict SMILES from pixels. A <strong>custom CNN encoder</strong> maps the input image to a 512-dimensional <strong>Continuous and Data-Driven Molecular Descriptor (CDDD)</strong> embedding - a pre-trained, learned molecular representation that smoothly captures chemical similarity. A <strong>pre-trained decoder</strong> then converts this CDDD vector into the final canonical SMILES string.</p>
<p>This two-stage design has several advantages:</p>
<ul>
<li>The CDDD space is continuous and chemically meaningful, so nearby embeddings correspond to structurally similar molecules. This makes the regression task easier than learning discrete token sequences directly.</li>
<li>The decoder is pre-trained and fixed, so the CNN only needs to learn the image → CDDD mapping. This decouples the visual recognition problem from the sequence generation problem.</li>
<li>CDDD embeddings naturally enforce chemical validity constraints, reducing the risk of generating nonsensical structures.</li>
</ul>
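<p>The decoupling can be made concrete with a toy sketch. Everything here is an illustrative stand-in: the random vectors mimic the learned CDDD space, and a nearest-neighbour lookup plays the role of the pre-trained decoder, which this is not.</p>

```python
import math
import random

# Toy sketch of the two-stage idea; names, random "embeddings", and the
# nearest-neighbour "decoder" are illustrative stand-ins for the CNN encoder
# and the pre-trained CDDD decoder, not the paper's actual components.

rng = random.Random(0)
EMB_DIM = 512

# Tiny stand-in for the decodable CDDD space: two molecules, two embeddings.
table = {
    "CCO": [rng.gauss(0, 1) for _ in range(EMB_DIM)],       # ethanol
    "c1ccccc1": [rng.gauss(0, 1) for _ in range(EMB_DIM)],  # benzene
}

def decode(embedding):
    """Frozen 'decoder': return the SMILES whose embedding is closest."""
    return min(table, key=lambda smi: math.dist(table[smi], embedding))

# A regression output near the true embedding still decodes to the right
# molecule: a smooth, continuous space tolerates small prediction error.
noisy = [v + 0.05 * rng.gauss(0, 1) for v in table["CCO"]]
print(decode(noisy))  # CCO
```

<p>This is the intuition behind the first advantage above: small errors in the regressed embedding still land in the basin of the correct molecule.</p>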
<p><strong>2. Extensive Data Augmentation for Robustness</strong></p>
<p>The model was trained on 11.1 million unique molecules from ChEMBL and PubChem, but the critical insight is how the training images were generated. To expose the CNN to maximum variation in depiction styles, the authors:</p>
<ul>
<li>Used <strong>three different cheminformatics libraries</strong> (RDKit, OEChem, Indigo) to render images, each with its own drawing conventions</li>
<li>Applied <strong>wide-ranging augmentations</strong>: varying bond thickness, font size, rotation, resolution (190-2500 pixels), and other stylistic parameters</li>
<li><strong>Over-sampled larger molecules</strong> to improve performance on complex structures, which are underrepresented in chemical databases</li>
</ul>
<p>This ensures the network rarely sees the same depiction of a molecule twice, forcing it to learn invariant features.</p>
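<p>One way to realize this is to sample a fresh depiction configuration per training image. A sketch follows; the resolution range (190-2500 px) is stated in the paper, while the other parameter names and ranges are illustrative guesses:</p>

```python
import random

# Sketch of per-image style sampling so the network "rarely sees the same
# depiction twice". The resolution range is from the paper; the other
# parameter names and ranges are illustrative guesses.

LIBRARIES = ["rdkit", "oechem", "indigo"]

def sample_style(rng: random.Random) -> dict:
    return {
        "library": rng.choice(LIBRARIES),
        "resolution": rng.randint(190, 2500),  # px (range stated in the paper)
        "rotation_deg": rng.uniform(0.0, 360.0),
        "bond_width": rng.uniform(1.0, 3.0),   # illustrative
        "font_size": rng.randint(10, 24),      # illustrative
    }

rng = random.Random(42)
styles = [sample_style(rng) for _ in range(3)]
for s in styles:
    print(s)
```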
<p><strong>3. Fast Inference</strong></p>
<p>Because the architecture is a simple CNN followed by a fixed decoder, inference is very fast - especially compared to rule-based systems that rely on iterative graph construction algorithms. This makes Img2Mol practical for large-scale document mining.</p>
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating that Img2Mol is more accurate, robust, and generalizable than existing rule-based systems:</p>
<ol>
<li>
<p><strong>Benchmark Comparisons</strong>: Img2Mol was tested on several standard OCSR benchmarks - USPTO (patent images), University of Birmingham (UoB), and CLEF datasets - against three open-source baselines: <strong>OSRA, MolVec, and Imago</strong>. Notably, no deep learning baselines were available at the time for comparison.</p>
</li>
<li>
<p><strong>Resolution and Molecular Size Analysis</strong>: The initial model, <code>Img2Mol(no aug.)</code>, was evaluated across different image resolutions and molecule sizes (measured by number of atoms) to understand failure modes. This revealed that:</p>
<ul>
<li>Performance degraded for molecules with &gt;35 atoms</li>
<li>Very high-resolution images lost detail when downscaled to the fixed input size</li>
<li>Low-resolution images (where rule-based methods failed completely) were handled well</li>
</ul>
</li>
<li>
<p><strong>Data Augmentation Ablation</strong>: A final model, <strong>Img2Mol</strong>, was trained with the full augmentation pipeline (wider resolution range, over-sampling of large molecules). Performance was compared to the initial version to quantify the effect of augmentation.</p>
</li>
<li>
<p><strong>Depiction Library Robustness</strong>: The model was tested on images generated by each of the three rendering libraries separately to confirm that training on diverse styles improved generalization.</p>
</li>
<li>
<p><strong>Generalization Tests</strong>: Img2Mol was evaluated on real-world patent images from the <strong>STAKER</strong> dataset, which were not synthetically generated. This tested whether the model could transfer from synthetic training data to real documents.</p>
</li>
<li>
<p><strong>Hand-Drawn Molecule Recognition</strong>: As an exploratory test, the authors evaluated performance on hand-drawn molecular structures - a task the model was never trained for - to see if the learned features could generalize to completely different visual styles.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured and compared to rule-based baselines to demonstrate the practical efficiency of the approach.</p>
</li>
</ol>
<h2 id="results-conclusions-and-limitations">Results, Conclusions, and Limitations</h2>
<ul>
<li>
<p><strong>Substantial Performance Gains</strong>: Img2Mol outperformed all three rule-based baselines on nearly every benchmark. Accuracy was measured both as exact SMILES match and as <strong>Tanimoto similarity</strong> (a chemical fingerprint-based metric that measures structural similarity). Even when Img2Mol didn&rsquo;t predict the exact molecule, it often predicted a chemically similar one.</p>
</li>
<li>
<p><strong>Robustness Across Conditions</strong>: The full Img2Mol model (with aggressive augmentation) showed consistent performance across all image resolutions and molecule sizes. In contrast, rule-based systems were &ldquo;brittle&rdquo; - performance dropped sharply with minor perturbations to image quality or style.</p>
</li>
<li>
<p><strong>Depiction Library Invariance</strong>: Img2Mol&rsquo;s performance was stable across all three rendering libraries (RDKit, OEChem, Indigo), validating the multi-library training strategy. Rule-based methods struggled particularly with RDKit-generated images.</p>
</li>
<li>
<p><strong>Strong Generalization to Real-World Data</strong>: Despite being trained exclusively on synthetic images, Img2Mol performed well on real patent images from the STAKER dataset. This suggests the augmentation strategy successfully captured the diversity of real-world depictions.</p>
</li>
<li>
<p><strong>Overfitting in Baselines</strong>: Rule-based methods performed surprisingly well on older benchmarks (USPTO, UoB, CLEF) but failed on newer datasets (Img2Mol&rsquo;s test set, STAKER). This suggests they may be implicitly tuned to specific drawing conventions in legacy datasets.</p>
</li>
<li>
<p><strong>Limited Hand-Drawn Recognition</strong>: Img2Mol could recognize simple hand-drawn structures but struggled with complex or large molecules. This is unsurprising given the lack of hand-drawn data in training, but it highlights a potential avenue for future work.</p>
</li>
<li>
<p><strong>Speed Advantage</strong>: Img2Mol was substantially faster than rule-based competitors, especially on high-resolution images. This makes it practical for large-scale literature mining.</p>
</li>
</ul>
<p>The work establishes that deep learning can outperform traditional rule-based OCSR systems when combined with a principled two-stage architecture and comprehensive data augmentation. The CDDD embedding acts as a bridge between visual perception and chemical structure, providing a chemically meaningful intermediate representation that improves both accuracy and robustness. The focus on synthetic data diversity proves to be an effective strategy for generalizing to real-world documents.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Custom 8-layer Convolutional Neural Network (CNN) encoder</p>
<ul>
<li><strong>Input</strong>: $224 \times 224$ pixel grayscale images</li>
<li><strong>Backbone Structure</strong>: 8 convolutional layers organized into 3 stacks, followed by 3 fully connected layers
<ul>
<li><strong>Stack 1</strong>: 3 Conv layers ($7 \times 7$ filters, stride 3, padding 4) + Max Pooling</li>
<li><strong>Stack 2</strong>: 2 Conv layers + Max Pooling</li>
<li><strong>Stack 3</strong>: 3 Conv layers + Max Pooling</li>
<li><strong>Head</strong>: 3 fully connected layers</li>
</ul>
</li>
<li><strong>Output</strong>: 512-dimensional CDDD embedding vector</li>
</ul>
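<p>As a quick sanity check on the stack-1 shape arithmetic, assuming the stated kernel, stride, and padding apply to each of the three layers (an assumption; the paper may vary settings per layer):</p>

```python
# Shape arithmetic for stack 1 under the stated settings (7x7 kernel,
# stride 3, padding 4); applying them to all three layers is our assumption.

def conv_out(size, kernel=7, stride=3, pad=4):
    return (size + 2 * pad - kernel) // stride + 1

size = 224  # input resolution
for _ in range(3):  # three conv layers in stack 1
    size = conv_out(size)
print(size)  # spatial size entering the max-pool
```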
<p><strong>Decoder</strong>: Pre-trained CDDD decoder (from Winter et al.) - fixed during training, not updated</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Loss Function</strong>: Mean Squared Error (MSE) regression minimizing the distance between the predicted and true embeddings:</p>
<p>$$
\mathcal{L}_{\text{MSE}} = \left\lVert \text{cddd}_{\text{true}} - \text{cddd}_{\text{predicted}} \right\rVert_2^2
$$</p>
<p><strong>Optimizer</strong>: AdamW with initial learning rate $10^{-4}$</p>
<p><strong>Training Schedule</strong>:</p>
<ul>
<li>Batch size: 256</li>
<li>Training duration: 300 epochs</li>
<li>Plateau scheduler: Multiplies learning rate by 0.7 if validation loss plateaus for 10 epochs</li>
<li>Early stopping: Triggered if no improvement in validation loss for 50 epochs</li>
</ul>
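<p>The plateau/early-stopping interplay can be sketched in a few lines. This is an approximation of the reported schedule (factor 0.7 after a 10-epoch plateau, stop after 50 epochs without improvement), not the authors' training code:</p>

```python
# Approximation of the reported schedule; not the authors' training code.

def run_schedule(val_losses, lr=1e-4, factor=0.7, patience=10, stop_patience=50):
    best, since_best = float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best % patience == 0:
                lr *= factor          # plateau reached: decay the learning rate
            if since_best >= stop_patience:
                break                 # early stopping
    return lr, best

# Twelve epochs stuck at the same loss -> one decay of the learning rate.
print(run_schedule([1.0] * 12))
```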
<p><strong>Noise Tolerance</strong>: The decoder requires the CNN to predict embeddings with noise level $\sigma \le 0.15$ to achieve &gt;90% accuracy</p>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong>: 11.1 million unique molecules from ChEMBL and PubChem</p>
<p><strong>Splits</strong>: Approximately 50,000 examples each for validation and test sets</p>
<p><strong>Synthetic Image Generation</strong>:</p>
<ul>
<li>Three cheminformatics libraries: RDKit, OEChem, and Indigo</li>
<li>Augmentations: Resolution (190-2500 pixels), rotation, bond thickness, font size</li>
<li>Salt stripping: Keep only the largest fragment</li>
<li>Over-sampling: Larger molecules (&gt;35 atoms) over-sampled to improve performance</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li>Exact SMILES match accuracy</li>
<li>Tanimoto similarity (chemical fingerprint-based structural similarity)</li>
</ul>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li>USPTO (patent images)</li>
<li>University of Birmingham (UoB)</li>
<li>CLEF dataset</li>
<li>STAKER (real-world patent images)</li>
<li>Hand-drawn molecular structures (exploratory)</li>
</ul>
<p><strong>Baselines</strong>: OSRA, MolVec, Imago (rule-based systems)</p>
<h3 id="hardware">Hardware</h3>
<p>⚠️ <strong>Unspecified in paper or supplementary materials.</strong> Inference speed reported as ~4 minutes for 5000 images; training hardware (GPU model, count) is undocumented.</p>
<h3 id="known-limitations">Known Limitations</h3>
<p><strong>Molecular Size</strong>: Performance degrades for molecules with &gt;35 atoms. This is partly a property of the CDDD latent space itself - for larger molecules, the &ldquo;volume of decodable latent space&rdquo; shrinks, making the decoder more sensitive to small noise perturbations in the predicted embedding.</p>
]]></content:encoded></item><item><title>DECIMER: Deep Learning for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/decimer/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/decimer/</guid><description>Deep learning method for optical chemical structure recognition using image captioning networks trained on millions of synthetic molecular images.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2020). DECIMER: towards deep learning for chemical image recognition. <em>Journal of Cheminformatics</em>, 12(65). <a href="https://doi.org/10.1186/s13321-020-00469-w">https://doi.org/10.1186/s13321-020-00469-w</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2020</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER">Official GitHub Repository</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image-to-SMILES">DECIMER Image-to-SMILES Repository</a></li>
</ul>
<h2 id="contribution-method-for-optical-chemical-entity-recognition">Contribution: Method for Optical Chemical Entity Recognition</h2>
<p>This is primarily a <strong>Method ($\Psi_{\text{Method}}$)</strong> paper with a strong <strong>Resource ($\Psi_{\text{Resource}}$)</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (DECIMER) that repurposes &ldquo;show-and-tell&rdquo; image captioning networks for Optical Chemical Entity Recognition (OCER), providing an alternative to traditional rule-based segmentation pipelines.</li>
<li><strong>Resource</strong>: It establishes a framework for generating massively scalable synthetic training data using open-source cheminformatics tools (CDK) and databases (PubChem), circumventing the scarcity of manually annotated chemical images.</li>
</ul>
<h2 id="motivation-brittleness-of-heuristic-pipelines">Motivation: Brittleness of Heuristic Pipelines</h2>
<p>The extraction of chemical structures from scientific literature (OCER) is critical for populating open-access databases. Traditional OCER systems (like OSRA or CLiDE) rely on complex multi-step pipelines involving vectorization, character recognition, and graph compilation. These systems are brittle, and extending them to new structural features requires laborious manual engineering. Inspired by deep reinforcement learning systems like AlphaGo Zero, the authors sought to formulate an end-to-end deep learning approach that learns directly from data with minimal prior assumptions.</p>
<h2 id="novelty-image-captioning-for-molecular-graphs">Novelty: Image Captioning for Molecular Graphs</h2>
<ul>
<li><strong>Image-to-Text Formulation</strong>: The paper frames chemical structure recognition as an image captioning problem, translating a bitmap image directly into a SMILES string using an encoder-decoder network. This bypasses explicit segmentation of atoms and bonds entirely.</li>
<li><strong>Synthetic Data Strategy</strong>: The authors generate synthetic images from PubChem using the CDK Structure Diagram Generator, scaling the dataset size to 15 million.</li>
<li><strong>Robust String Representations</strong>: The study performs key ablation experiments on string representations, comparing standard SMILES against DeepSMILES to evaluate how syntactic validity affects the network&rsquo;s learning capability.</li>
</ul>
<h2 id="experimental-setup-and-validation-strategies">Experimental Setup and Validation Strategies</h2>
<ul>
<li><strong>Data Scaling</strong>: Models were trained on dataset sizes ranging from 54,000 to 15 million synthetic images to observe empirical scaling laws regarding accuracy and compute time.</li>
<li><strong>Representation Comparison</strong>: The authors compared the validity of predicted strings and recognition accuracy when training on SMILES versus DeepSMILES. The cross-entropy loss formulation for sequence generation can be represented as:
$$ \mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{x}) $$
where $\mathbf{x}$ is the image representation and $y_t$ are the tokens of the SMILES/DeepSMILES string.</li>
<li><strong>Metric Evaluation</strong>: Performance was measured using Validity (syntactic correctness) and Tanimoto Similarity $T$, computed on molecular fingerprints to capture partial correctness even if the exact string prediction failed:
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</li>
</ul>
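<p>The Tanimoto formula above can be computed directly on fingerprints viewed as sets of on-bit indices. Toolkits like RDKit do this on real fingerprints; this pure-Python version just mirrors the set definition:</p>

```python
# The Tanimoto (Jaccard) formula on fingerprints as sets of on-bit indices.

def tanimoto(a: set, b: set) -> float:
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter) if (a or b) else 1.0

fp_true = {1, 5, 9, 42}
fp_pred = {1, 5, 9}  # one bit missing: a partially correct prediction
print(tanimoto(fp_true, fp_pred))  # 0.75
```

<p>This is why the metric rewards partial correctness: a prediction that recovers most of the structure still scores well even when the exact string match fails.</p>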
<h2 id="results-and-critical-conclusions">Results and Critical Conclusions</h2>
<ul>
<li><strong>Data Representation</strong>: DeepSMILES proved superior to standard SMILES for training stability and output validity. Preliminary tests suggested SELFIES performs even better (0.78 Tanimoto vs 0.53 for DeepSMILES at 6M images).</li>
<li><strong>Scaling Behavior</strong>: Accuracy improves linearly with dataset size. The authors extrapolate that near-perfect detection would require training on 50 to 100 million structures.</li>
<li><strong>Current Limitations</strong>: At the reported training scale (up to 15M), the model does not yet rival traditional heuristic approaches, but the learning curve suggests it is a viable trajectory given sufficient compute and data.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is synthetic, generated using the Chemistry Development Kit (CDK) Structure Diagram Generator (SDG) based on molecules from PubChem.</p>
<p><strong>Curation Rules</strong> (applied to PubChem data):</p>
<ul>
<li>Molecular weight &lt; 1500 Daltons.</li>
<li>Elements restricted to: C, H, O, N, P, S, F, Cl, Br, I, Se, B.</li>
<li>No counter ions or charged groups.</li>
<li>No isotopes (e.g., D, T).</li>
<li>Bond count between 5 and 40.</li>
<li>SMILES length &lt; 40 characters.</li>
<li>Implicit hydrogens only (except in functional groups).</li>
</ul>
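<p>The curation rules amount to a conjunction of simple predicates. A hedged sketch, operating on precomputed properties rather than a real toolkit molecule (the authors use CDK; the dict field names here are illustrative):</p>

```python
# The curation rules as predicates over precomputed molecular properties.
# A real pipeline would derive these with a toolkit such as CDK or RDKit;
# the dict field names here are illustrative.

ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_curation(mol: dict) -> bool:
    return (
        mol["mol_weight"] < 1500
        and set(mol["elements"]) <= ALLOWED_ELEMENTS
        and not mol["has_charge"]
        and not mol["has_isotope"]
        and 5 <= mol["bond_count"] <= 40
        and len(mol["smiles"]) < 40
    )

ethanol = {
    "mol_weight": 46.07, "elements": ["C", "O", "H"],
    "has_charge": False, "has_isotope": False,
    "bond_count": 8, "smiles": "CCO",
}
print(passes_curation(ethanol))  # True
```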
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Images</strong>: Generated as 299x299 bitmaps to match Inception V3 input requirements.</li>
<li><strong>Augmentation</strong>: One random rotation applied per molecule; no noise or blurring added in this iteration.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem)</td>
          <td>54k - 15M</td>
          <td>Scaled across 12 experiments</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>Independent Set</td>
          <td>6k - 1.6M</td>
          <td>10% of training size</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: <code>&quot;Show, Attend and Tell&quot;</code> (Attention-based Image Captioning).</li>
<li><strong>Optimization</strong>: Adam optimizer with learning rate 0.0005.</li>
<li><strong>Loss Function</strong>: Sparse Categorical Crossentropy.</li>
<li><strong>Training Loop</strong>: Trained for 25 epochs per model. Batch size of 640 images.</li>
</ul>
<h3 id="models">Models</h3>
<p>The network is implemented in TensorFlow 2.0.</p>
<ul>
<li><strong>Encoder</strong>: Inception V3 (Convolutional NN), used unaltered. Extracts feature vectors saved as NumPy arrays.</li>
<li><strong>Decoder</strong>: Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) with soft attention mechanism.</li>
<li><strong>Embeddings</strong>: Image embedding dimension size of 600.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>The primary metric is Tanimoto similarity (Jaccard index) on PubChem fingerprints, which is robust for measuring structural similarity even when exact identity is not reached.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td>Percentage of predictions that are chemically identical to ground truth (isomorphic).</td>
      </tr>
      <tr>
          <td><strong>Average Tanimoto</strong></td>
          <td>Mean similarity score across the test set (captures partial correctness).</td>
      </tr>
      <tr>
          <td><strong>Validity</strong></td>
          <td>Percentage of predicted strings that are valid DeepSMILES/SMILES.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a single node.</p>
<ul>
<li><strong>GPU</strong>: 1x NVIDIA Tesla V100.</li>
<li><strong>CPU</strong>: 2x Intel Xeon Gold 6230.</li>
<li><strong>RAM</strong>: 384 GB.</li>
<li><strong>Compute Time</strong>:
<ul>
<li>Linear scaling with data size.</li>
<li>15 million structures took ~27 days (64,909s per epoch).</li>
<li>Projected time for 100M structures: ~4 months on single GPU.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERDeepLearning2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{DECIMER}}: Towards Deep Learning for Chemical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{DECIMER}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{65}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-020-00469-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGrapher: Deep Learning for Chemical OCR</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/chemgrapher-2020/</link><pubDate>Wed, 17 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/chemgrapher-2020/</guid><description>Deep learning OCSR method using semantic segmentation and classification CNNs to reconstruct chemical graphs with improved stereochemistry.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Oldenhof, M., Arany, A., Moreau, Y., &amp; Simm, J. (2020). ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning. <em>Journal of Chemical Information and Modeling</em>, 60(10), 4506-4517. <a href="https://doi.org/10.1021/acs.jcim.0c00459">https://doi.org/10.1021/acs.jcim.0c00459</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2020 (arXiv preprint Feb 2020)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2002.09914">arXiv Page</a></li>
</ul>
<h2 id="classifying-the-methodology">Classifying the Methodology</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel deep learning architecture and a specific graph-reconstruction algorithm to solve the problem of Optical Chemical Structure Recognition (OCSR). It validates this method by comparing it against the existing standard tool (OSRA), demonstrating superior performance on specific technical challenges like stereochemistry.</p>
<h2 id="the-ocr-stereochemistry-challenge">The OCR Stereochemistry Challenge</h2>
<p>Chemical knowledge is frequently locked in static images within scientific publications. Extracting this structure into machine-readable formats (graphs, SMILES) is essential for drug discovery and database querying. Existing tools, such as OSRA, rely on optical character recognition (OCR) and expert systems or hand-coded rules. These tools struggle with bond multiplicity and stereochemical information, often missing atoms or misinterpreting 3D cues (wedges and dashes). A machine learning approach, by contrast, can improve through more and better training data rather than new hand-coded rules.</p>
<h2 id="decoupled-semantic-segmentation-and-classification-pipeline">Decoupled Semantic Segmentation and Classification Pipeline</h2>
<p>The core novelty is the <strong>segmentation-classification pipeline</strong> which decouples object detection from type assignment:</p>
<ol>
<li><strong>Semantic Segmentation</strong>: The model first predicts pixel-wise maps for atoms, bonds, and charges using a modified U-Net/dilated convolution architecture.</li>
<li><strong>Graph Building Algorithm</strong>: A specific algorithm iterates over the segmentation maps to generate candidate locations for atoms and bonds.</li>
<li><strong>Refinement via Classification</strong>: Dedicated classification networks take cutouts of the original image combined with the segmentation mask to verify and classify each candidate (e.g., distinguishing a single bond from a double bond, or a wedge from a dash).</li>
</ol>
<p>Additionally, the authors developed a novel method for <strong>synthetic data generation</strong> by modifying the source code of RDKit to output pixel-wise labels during the image drawing process. This solves the lack of labeled training data.</p>
<h2 id="evaluating-synthetics-and-benchmarks">Evaluating Synthetics and Benchmarks</h2>
<ul>
<li><strong>Synthetic Benchmarking</strong>: The authors generated test sets in 3 different stylistic variations. For each style, they tested on both stereo (complex 3D information) and non-stereo compounds.</li>
<li><strong>Baseline Comparison</strong>: They compared the error rates of ChemGrapher against <strong>OSRA</strong> (Optical Structure Recognition Application).</li>
<li><strong>Ablation Study</strong>: They analyzed the F1 scores of the segmentation networks versus the classification networks independently to understand where errors propagated.</li>
<li><strong>Real-world Case Study</strong>: They manually curated 61 images cut from journal articles to test performance on real, non-synthetic data.</li>
</ul>
<h2 id="advancements-over-osra">Advancements Over OSRA</h2>
<ul>
<li><strong>Superior Accuracy</strong>: ChemGrapher consistently achieved lower error rates than OSRA across all synthetic styles, particularly for stereochemical information (wedge and dash bonds).</li>
<li><strong>Component Performance</strong>: The classification networks showed significantly higher F1 scores than the segmentation networks. This suggests the two-stage approach allows the classifier to correct segmentation noise.</li>
<li><strong>Real-world Viability</strong>: In the manual case study, ChemGrapher correctly predicted 46 of 61 images, compared to 42 of 61 for OSRA.</li>
<li><strong>Limitations</strong>: The model struggles with thick bond lines and non-carbon atom labels in real-world images due to the bias in the synthetic training data.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors created a custom synthetic dataset using ChEMBL and RDKit, as no pixel-wise labeled dataset existed.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>ChEMBL</td>
          <td>1.9M</td>
          <td>Split into training pool (1.5M), val/train pool (300K), and test pools (35K each).</td>
      </tr>
      <tr>
          <td><strong>Segmentation Train</strong></td>
          <td>Synthetic</td>
          <td>~114K</td>
          <td>Sampled from ChEMBL pool such that every atom type appears in &gt;1000 compounds.</td>
      </tr>
      <tr>
          <td><strong>Labels</strong></td>
          <td>Pixel-wise</td>
          <td>N/A</td>
          <td>Generated by modifying <strong>RDKit</strong> source code to output label masks (atom type, bond type, charge) during drawing.</td>
      </tr>
      <tr>
          <td><strong>Candidates</strong></td>
          <td>Cutouts</td>
          <td>~27K (Atom)<br>~55K (Bond)</td>
          <td>Generated from the validation pool for training the classification networks.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Algorithm 1: Graph Building</strong></p>
<ol>
<li><strong>Segment</strong>: Apply segmentation network $s(x)$ to get maps $S^a$ (atoms), $S^b$ (bonds), $S^c$ (charges).</li>
<li><strong>Atom Candidates</strong>: Identify candidate blobs in $S^a$.</li>
<li><strong>Classify Atoms</strong>: For each candidate, crop the input image and segmentation map. Feed to $c_A$ and $c_C$ to predict Atom Type and Charge. Add to Vertex set $V$ if valid.</li>
<li><strong>Bond Candidates</strong>: Generate all pairs of nodes in $V$ within $2 \times$ bond length distance.</li>
<li><strong>Classify Bonds</strong>: For each pair, create a candidate mask (two rectangles meeting in the middle to encode directionality). Feed to $c_B$ to predict Bond Type (single, double, wedge, etc.). Add to Edge set $E$.</li>
</ol>
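<p>The five steps can be sketched schematically in Python, with the two CNN classifiers stubbed out as plain callables (a hypothetical simplification of the pipeline; the real system crops image and segmentation patches and runs $c_A$ and $c_B$ on them):</p>

```python
import itertools
import math

# Schematic version of Algorithm 1 with classifier stubs; not the authors' code.

def build_graph(atom_candidates, classify_atom, classify_bond, bond_length):
    # Step 3: keep candidates the atom classifier accepts (None = rejected).
    V = [(pos, classify_atom(pos)) for pos in atom_candidates]
    V = [(pos, label) for pos, label in V if label is not None]
    E = []
    # Step 4: only node pairs within 2x the typical bond length are candidates.
    for (p1, _), (p2, _) in itertools.combinations(V, 2):
        if math.dist(p1, p2) <= 2 * bond_length:
            # Step 5: the bond classifier assigns a type or rejects the pair.
            btype = classify_bond(p1, p2)
            if btype is not None:
                E.append((p1, p2, btype))
    return V, E

V, E = build_graph(
    atom_candidates=[(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)],
    classify_atom=lambda pos: "C",          # stub: accept everything as carbon
    classify_bond=lambda p1, p2: "single",  # stub: every close pair is a bond
    bond_length=1.0,
)
print(len(V), len(E))  # 3 1  (only one pair lies within range)
```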
<h3 id="models">Models</h3>
<p>The pipeline uses four distinct Convolutional Neural Networks (CNNs).</p>
<p><strong>1. Semantic Segmentation Network ($s$)</strong></p>
<ul>
<li><strong>Architecture</strong>: 8-layer fully convolutional network (Dense Prediction Convolutional Network).</li>
<li><strong>Kernels</strong>: $3\times3$ for all layers.</li>
<li><strong>Dilation</strong>: Uses dilated convolutions to expand receptive field without losing resolution. Dilations: 1, 2, 4, 8, 8, 4, 2, 1.</li>
<li><strong>Input</strong>: Binary B/W image.</li>
<li><strong>Output</strong>: Multi-channel probability maps for Atom Types ($S^a$), Bond Types ($S^b$), and Charges ($S^c$).</li>
</ul>
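<p>A quick way to see what the dilation schedule buys: with stride 1 and $3 \times 3$ kernels, a convolution with dilation $d$ adds $2d$ pixels to the receptive field. A small helper (ours, not the paper's) shows what the 1, 2, 4, 8, 8, 4, 2, 1 schedule achieves without any downsampling:</p>

```python
# Receptive-field growth of the dilated stack (stride 1, 3x3 kernels).

def receptive_field(kernel=3, dilations=(1, 2, 4, 8, 8, 4, 2, 1)):
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d  # stride 1 throughout, so the step stays 1
    return rf

print(receptive_field())  # 61
```

<p>A 61-pixel receptive field at full resolution is what lets the network label each pixel with context from surrounding atoms and bonds, without the resolution loss that pooling would cause.</p>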
<p><strong>2. Classification Networks ($c_A, c_B, c_C$)</strong></p>
<ul>
<li><strong>Purpose</strong>: Refines predictions on small image patches.</li>
<li><strong>Architecture</strong>: 5 convolutional layers followed by a MaxPool and 2 Fully Connected layers.
<ul>
<li>Layer 1: <strong>Depthwise separable convolution</strong>.</li>
<li>Layers 2-4: Dilated convolutions (factors 2, 4, 8).</li>
<li>Layer 5: Standard convolution + MaxPool ($124 \times 124$).</li>
</ul>
</li>
<li><strong>Inputs</strong>:
<ul>
<li>Crop of the binary image ($x^{cut}$).</li>
<li>Crop of the segmentation map ($S^{cut}$).</li>
<li>&ldquo;Highlight&rdquo; mask ($h_L$) indicating the specific candidate location (e.g., a dot for atoms, two rectangles for bonds).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: <strong>F1 Score</strong> for individual network performance (segmentation pixels and classification accuracy).</li>
<li><strong>Metric</strong>: <strong>Error Rate</strong> (percentage of incorrect graphs) for overall system. A graph is &ldquo;incorrect&rdquo; if there is at least one mistake in atoms or bonds.</li>
<li><strong>Baselines</strong>: Compared against <strong>OSRA</strong> (version unstated; v2.1.0 implied by the standard tooling of the period).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Training and inference performed on a single <strong>NVIDIA Titan Xp</strong>.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{oldenhof2020chemgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Oldenhof, Martijn and Arany, Adam and Moreau, Yves and Simm, Jaak}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4506--4517}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c00459}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Research on Chemical Expression Images Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/hong-chemical-expression-2015/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/hong-chemical-expression-2015/</guid><description>A 2015 methodology for Optical Chemical Structure Recognition (OCSR) focusing on improved handling of adhesive symbols and wedge bonds.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hong, C., Du, X., &amp; Zhang, L. (2015). Research on Chemical Expression Images Recognition. <em>Proceedings of the 2015 Joint International Mechanical, Electronic and Information Technology Conference</em>, 267-271. <a href="https://doi.org/10.2991/jimet-15.2015.50">https://doi.org/10.2991/jimet-15.2015.50</a></p>
<p><strong>Publication</strong>: JIMET 2015 (Atlantis Press)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://jsme-editor.github.io/">JSME Editor (used for visualization)</a></li>
</ul>
<h2 id="contribution-new-ocsr-workflow-for-adhesion-and-wedge-bonds">Contribution: New OCSR Workflow for Adhesion and Wedge Bonds</h2>
<p><strong>Method</strong>. The paper proposes a novel algorithmic pipeline (OCSR) for recognizing 2D organic chemical structures from images. It validates this method by comparing it against an existing tool (OSRA) using a quantitative metric (Tanimoto Coefficient) on a test set of 200 images.</p>
<h2 id="motivation-challenges-with-connecting-symbols-and-stereochemistry">Motivation: Challenges with Connecting Symbols and Stereochemistry</h2>
<p>A vast amount of chemical structural information exists in scientific literature (PDFs/images) that is not machine-readable. Manually converting these images to formats like <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> or CML is labor-intensive. Existing tools face challenges with:</p>
<ol>
<li><strong>Adhesion</strong>: Poor separation when chemical symbols touch or overlap with bonds.</li>
<li><strong>Stereochemistry</strong>: Incomplete identification of &ldquo;real&rdquo; (solid) and &ldquo;virtual&rdquo; (dashed/hashed) wedge bonds.</li>
</ol>
<h2 id="core-innovation-vector-based-separation-and-stereochemical-logic">Core Innovation: Vector-Based Separation and Stereochemical Logic</h2>
<p>The authors propose a specific <strong>OCSR (Optical Chemical Structure Recognition)</strong> workflow with two key technical improvements:</p>
<ol>
<li><strong>Vector-based Separation</strong>: The method vectorizes the image (using Potrace) to extract straight lines and curves, allowing better separation of &ldquo;adhesive&rdquo; chemical symbols (like H, N, O attached to bonds).</li>
<li><strong>Stereochemical Logic</strong>: Specific rules for identifying wedge bonds:
<ul>
<li><strong>Virtual (Dashed) Wedges</strong>: Identified by grouping connected domains and checking linear correlation of their center points.</li>
<li><strong>Real (Solid) Wedges</strong>: Identified after thinning by analyzing linear correlation and width variance of line segments.</li>
</ul>
</li>
</ol>
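<p>The virtual-wedge rule can be illustrated with a linear-correlation check over dash centers. This is a heuristic sketch of the idea, not the paper&rsquo;s exact rule or thresholds:</p>

```python
import math

def is_collinear(centers, threshold=0.95):
    """Treat a group of dash centers as a candidate virtual wedge if their
    (x, y) coordinates are strongly linearly correlated. Degenerate
    axis-aligned groups (zero variance in x or y) count as collinear."""
    xs = [c[0] for c in centers]
    ys = [c[1] for c in centers]
    n = len(centers)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:  # vertical or horizontal dash group
        return True
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return abs(r) >= threshold
```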
<h2 id="methodology--experimental-setup">Methodology &amp; Experimental Setup</h2>
<ul>
<li>
<p><strong>Dataset</strong>: 200 chemical structure images collected from the network.</p>
</li>
<li>
<p><strong>Baselines</strong>: Compared against <strong>OSRA</strong> (Optical Structure Recognition Application), a free online tool.</p>
</li>
<li>
<p><strong>Metric</strong>: <strong>Tanimoto Coefficient</strong>, measuring the similarity of the set of recognized bonds and symbols against the ground truth. The similarity $T(A, B)$ is defined as:</p>
<p>$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$</p>
</li>
</ul>
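<p>The metric is straightforward to compute over the sets of recognized bonds and symbols; a minimal sketch:</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient T(A, B) = |A n B| / (|A| + |B| - |A n B|)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)
```

Identical sets score 1.0; disjoint sets score 0.0, so a threshold like "Tanimoto &gt; 0.85" counts structures whose recognized elements mostly agree with the ground truth.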
<h2 id="results--conclusions">Results &amp; Conclusions</h2>
<ul>
<li><strong>Performance</strong>: The proposed OCSR method achieved higher recognition rates than OSRA.
<ul>
<li><strong>Exact Match (100%)</strong>: OCSR achieved 90.0% vs. OSRA&rsquo;s 82.2%.</li>
<li><strong>High Similarity (&gt;85%)</strong>: OCSR recognized 157 structures vs. OSRA&rsquo;s 114.</li>
</ul>
</li>
<li><strong>Limitations</strong>: The paper notes that &ldquo;real wedge&rdquo; and &ldquo;virtual wedge&rdquo; identification was a primary focus, but general recognition effectiveness still &ldquo;has room for improvement&rdquo;.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used a custom collection of images, not a standard benchmark.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Web-crawled chemical images</td>
          <td>200 structures</td>
          <td>Images containing 2D organic structures; specific source URLs not provided.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The recognition pipeline follows these specific steps:</p>
<ol>
<li><strong>Preprocessing</strong>:
<ul>
<li><strong>Grayscale</strong>: via <code>cvCvtColor</code> (OpenCV).</li>
<li><strong>Binarization</strong>: via Otsu&rsquo;s method.</li>
</ul>
</li>
<li><strong>Isolated Symbol Removal</strong>:
<ul>
<li>Identifies connected domains with aspect ratios in <code>[0.8, 3.0]</code>.</li>
<li>Recognizes them using OCR (GOCR, OCRAD, Tesseract) and removes them from the image.</li>
</ul>
</li>
<li><strong>Virtual Wedge Recognition</strong>:
<ul>
<li>Groups small connected domains (points/clumps).</li>
<li>Calculates linear correlation of center points; if collinear, treats as a dashed bond.</li>
</ul>
</li>
<li><strong>Vectorization &amp; Thinning</strong>:
<ul>
<li><strong>Thinning</strong>: Rosenfeld algorithm (optimized) to reduce lines to single pixel width.</li>
<li><strong>Vectorization</strong>: Uses <strong>Potrace</strong> to convert pixels to vector segments.</li>
<li><strong>Merging</strong>: Combines split vector segments based on angle thresholds to form long straight lines.</li>
</ul>
</li>
<li><strong>Adhesive Symbol Separation</strong>:
<ul>
<li>Identifies curves (short segments after vectorization) attached to long lines.</li>
<li>Separates these domains and re-runs OCR.</li>
</ul>
</li>
<li><strong>&ldquo;Super Atom&rdquo; Merging</strong>:
<ul>
<li>Merges adjacent vertical/horizontal symbols (e.g., &ldquo;HO&rdquo;, &ldquo;CH3&rdquo;) based on distance thresholds between bounding boxes.</li>
</ul>
</li>
</ol>
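<p>Step 6 can be sketched as a sweep over OCR bounding boxes sorted by x-coordinate; the gap threshold and data layout here are illustrative, not the paper&rsquo;s values:</p>

```python
def merge_super_atoms(boxes, labels, gap=5):
    """Merge horizontally adjacent symbols (e.g. 'H' + 'O' -> 'HO').

    boxes: list of (x0, y0, x1, y1) bounding boxes; labels: matching OCR strings.
    Returns a list of (merged_box, merged_text) tuples."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    merged = []
    for i in order:
        x0, y0, x1, y1 = boxes[i]
        if merged:
            mx0, my0, mx1, my1, text = merged[-1]
            # Adjacent if the horizontal gap is small and vertical spans overlap.
            if x0 - mx1 <= gap and not (y1 < my0 or y0 > my1):
                merged[-1] = (mx0, min(my0, y0), max(mx1, x1),
                              max(my1, y1), text + labels[i])
                continue
        merged.append((x0, y0, x1, y1, labels[i]))
    return [(m[:4], m[4]) for m in merged]
```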
<h3 id="models">Models</h3>
<p>The system relies on off-the-shelf OCR tools for character recognition; no custom ML models were trained.</p>
<ul>
<li><strong>OCR Engines</strong>: GOCR, OCRAD, TESSERACT.</li>
<li><strong>Visualization</strong>: JSME (JavaScript Molecule Editor) used to render output strings.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value (OCSR)</th>
          <th>Baseline (OSRA)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Exact Match (100%)</td>
          <td><strong>90.0%</strong></td>
          <td>82.2%</td>
          <td>Percentage of 200 images perfectly recognized.</td>
      </tr>
      <tr>
          <td>&gt;95% Similarity</td>
          <td><strong>95 images</strong></td>
          <td>71 images</td>
          <td>Count of images with Tanimoto &gt; 0.95.</td>
      </tr>
      <tr>
          <td>&gt;85% Similarity</td>
          <td><strong>157 images</strong></td>
          <td>114 images</td>
          <td>Count of images with Tanimoto &gt; 0.85.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Requirements</strong>: Unspecified; runs on standard CPU architecture (implied by use of standard libraries like OpenCV and Potrace).</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{hongResearchChemicalExpression2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Research on {{Chemical Expression Images Recognition}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2015 {{Joint International Mechanical}}, {{Electronic}} and {{Information Technology Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Hong, Chen and Du, Xiaoping and Zhang, Lu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Atlantis Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Chongqing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2991/jimet-15.2015.50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{978-94-6252-129-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>IMG2SMI: Translating Molecular Structure Images to SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/img2smi/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/img2smi/</guid><description>Campos &amp; Ji's method for converting 2D molecular images to SMILES strings using Transformers and SELFIES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Campos, D., &amp; Ji, H. (2021). IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System (No. arXiv:2109.04202). arXiv. <a href="https://doi.org/10.48550/arXiv.2109.04202">https://doi.org/10.48550/arXiv.2109.04202</a></p>
<p><strong>Publication</strong>: arXiv preprint (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2109.04202">Paper on arXiv</a></li>
</ul>
<h2 id="contributions--taxonomy">Contributions &amp; Taxonomy</h2>
<p>This is both a <strong>Method</strong> and <strong>Resource</strong> paper:</p>
<ul>
<li><strong>Method</strong>: It adapts standard image captioning architectures (encoder-decoder) to the domain of Optical Chemical Structure Recognition (OCSR), treating molecule recognition as a translation task.</li>
<li><strong>Resource</strong>: It introduces <strong>MOLCAP</strong>, a large-scale dataset of 81 million molecules aggregated from public chemical databases, addressing the data scarcity that previously hindered deep learning approaches to OCSR.</li>
</ul>
<h2 id="the-bottleneck-in-chemical-literature-translation">The Bottleneck in Chemical Literature Translation</h2>
<p>Chemical literature is &ldquo;full of recipes written in a language computers cannot understand&rdquo; because molecules are depicted as 2D images. This creates a fundamental bottleneck:</p>
<ul>
<li><strong>The Problem</strong>: Chemists must manually redraw molecular structures to search for related compounds or reactions. This is slow, error-prone, and makes large-scale literature mining impossible.</li>
<li><strong>Existing Tools</strong>: Legacy systems like OSRA (Optical Structure Recognition Application) rely on handcrafted rules and often require human correction, making them unfit for unsupervised, high-throughput processing.</li>
<li><strong>The Goal</strong>: An automated system that can translate structure images directly to machine-readable strings (SMILES/SELFIES) without human supervision, enabling large-scale knowledge extraction from decades of chemistry literature and patents.</li>
</ul>
<h2 id="core-innovation-selfies-and-image-captioning">Core Innovation: SELFIES and Image Captioning</h2>
<p>The core novelty is demonstrating that <strong>how you represent the output text is as important as the model architecture itself</strong>. Key contributions:</p>
<ol>
<li>
<p><strong>Image Captioning Framework</strong>: Applies modern encoder-decoder architectures (ResNet-101 + Transformer) to OCSR, treating it as an image-to-text translation problem with a standard cross-entropy loss objective over the generation sequence:
$$ \mathcal{L} = -\sum\limits_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$</p>
</li>
<li>
<p><strong>SELFIES as Target Representation</strong>: The key mechanism relies on using <strong>SELFIES</strong> (Self-Referencing Embedded Strings) as the output format. SELFIES is based on a formal grammar where every possible string corresponds to a valid molecule, eliminating the syntactic invalidity problems (unmatched parentheses, invalid characters) that plague SMILES generation.</p>
</li>
<li>
<p><strong>MOLCAP Dataset</strong>: Created a comprehensive dataset of 81 million unique molecules from PubChem, ChEMBL, GDB13, and other sources. Generated 256x256 pixel images using RDKit for 1 million training samples and 5,000 validation samples.</p>
</li>
<li>
<p><strong>Task-Specific Evaluation</strong>: Demonstrated that traditional NLP metrics (BLEU) are poor indicators of scientific utility. Introduced evaluation based on <strong>molecular fingerprints</strong> (MACCS, RDK, Morgan) and <strong>Tanimoto similarity</strong>:
$$ T(a, b) = \frac{c}{a + b - c} $$
where $c$ is the number of common fingerprint bits, and $a$ and $b$ are the number of set bits in each respective molecule&rsquo;s fingerprint. This formulation reliably measures functional chemical similarity.</p>
</li>
</ol>
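<p>The training objective in contribution 1 can be computed directly from the decoder&rsquo;s per-token probabilities under teacher forcing; the probability values below are illustrative:</p>

```python
import math

def sequence_nll(token_probs):
    """Cross-entropy over a generated sequence:
    L = -sum_t log P(y_t | y_1, ..., y_{t-1}, x).

    token_probs: the probability the decoder assigned to each ground-truth
    token, conditioned on the image x and the preceding gold tokens."""
    return -sum(math.log(p) for p in token_probs)

# A confident decoder (probabilities near 1) incurs a smaller loss.
confident = sequence_nll([0.9, 0.95, 0.99])
uncertain = sequence_nll([0.5, 0.4, 0.3])
assert confident < uncertain
```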
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The evaluation focused on comparing IMG2SMI to existing systems and identifying which design choices matter most:</p>
<ol>
<li>
<p><strong>Baseline Comparisons</strong>: Benchmarked against OSRA (rule-based system) and DECIMER (first deep learning approach) on the MOLCAP dataset to establish whether modern architectures could surpass traditional methods.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Extensive ablations isolating key factors:</p>
<ul>
<li><strong>Decoder Architecture</strong>: Transformer vs. RNN/LSTM decoders</li>
<li><strong>Encoder Fine-tuning</strong>: Fine-tuned vs. frozen pre-trained ResNet weights</li>
<li><strong>Output Representation</strong>: SELFIES vs. character-level SMILES vs. BPE-tokenized SMILES (the most critical ablation)</li>
</ul>
</li>
</ol>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>MACCS FTS</th>
          <th>Valid Captions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN Decoder</td>
          <td>~0.36</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Transformer Decoder</td>
          <td>0.94</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Fixed Encoder Weights</td>
          <td>0.76</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Fine-tuned Encoder</td>
          <td>0.94</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Character-level SMILES</td>
          <td>&lt;0.50</td>
          <td>~2%</td>
      </tr>
      <tr>
          <td>BPE SMILES (2000 vocab)</td>
          <td>0.85</td>
          <td>~40%</td>
      </tr>
      <tr>
          <td>SELFIES</td>
          <td>0.94</td>
          <td>99.4%</td>
      </tr>
  </tbody>
</table>
<ol start="3">
<li><strong>Metric Analysis</strong>: Systematic comparison of evaluation metrics including BLEU, ROUGE, Levenshtein distance, exact match accuracy, and molecular fingerprint-based similarity measures.</li>
</ol>
<h2 id="results-findings-and-limitations">Results, Findings, and Limitations</h2>
<p><strong>Performance Gains</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>Random Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>N/A</td>
          <td>~0.20</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9238</td>
          <td>N/A</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8848</td>
          <td>N/A</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>N/A</td>
          <td>0%</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<ul>
<li>163% improvement over OSRA on MACCS Tanimoto similarity.</li>
<li>Approximately 10x improvement on ROUGE scores.</li>
<li>Average Tanimoto similarity exceeds 0.85 (functionally similar molecules even when not exact matches).</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>SELFIES is Critical</strong>: Using SELFIES yields <strong>99.4% valid molecules</strong>, compared to only ~2% validity for character-level SMILES. This robustness is essential for practical deployment.</li>
<li><strong>Architecture Matters</strong>: Transformer decoder significantly outperforms RNN/LSTM approaches. Fine-tuning the ResNet encoder (vs. frozen weights) yields substantial performance gains (e.g., MACCS FTS: 0.76 to 0.94).</li>
<li><strong>Metric Insights</strong>: BLEU is a poor metric for this task. Molecular fingerprint-based Tanimoto similarity is most informative because it measures functional chemical similarity.</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Low Exact Match</strong>: Only <strong>7.24%</strong> exact matches. The model captures the overarching functional groups and structure but misses fine details like exact double bond placement.</li>
<li><strong>Complexity Bias</strong>: Trained on large molecules (average length &gt;40 tokens), so it performs poorly on very simple structures where OSRA still excels.</li>
</ul>
<p><strong>Conclusion</strong>: The work establishes that modern architectures combined with robust molecular representations (SELFIES) can significantly outperform traditional rule-based systems. The system is already useful for literature mining where functional similarity is more important than exact matches, though low exact match accuracy and poor performance on simple molecules indicate clear directions for future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image captioning system based on DETR (Detection Transformer) framework.</p>
<p><strong>Visual Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: ResNet-101 pre-trained on ImageNet</li>
<li><strong>Feature Extraction</strong>: 4th layer extraction (convolutions only)</li>
<li><strong>Output</strong>: 2048-dimensional dense feature vector</li>
</ul>
<p><strong>Caption Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Transformer encoder-decoder</li>
<li><strong>Layers</strong>: 3 stacked encoder layers, 3 stacked decoder layers</li>
<li><strong>Attention Heads</strong>: 8</li>
<li><strong>Hidden Dimensions</strong>: 2048 (feed-forward networks)</li>
<li><strong>Dropout</strong>: 0.1</li>
<li><strong>Layer Normalization</strong>: 1e-12</li>
</ul>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate</strong>: 5e-5 (selected after sweep from 1e-4 to 1e-6)</li>
<li><strong>Weight Decay</strong>: 1e-4</li>
<li><strong>Batch Size</strong>: 32</li>
<li><strong>Epochs</strong>: 5</li>
<li><strong>Codebase</strong>: Built on open-source DETR implementation</li>
</ul>
<h3 id="data">Data</h3>
<p><strong>MOLCAP Dataset</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total Size</td>
          <td>81,230,291 molecules</td>
          <td>Aggregated from PubChem, ChEMBL, GDB13</td>
      </tr>
      <tr>
          <td>Training Split</td>
          <td>1,000,000 molecules</td>
          <td>Randomly selected unique molecules</td>
      </tr>
      <tr>
          <td>Validation Split</td>
          <td>5,000 molecules</td>
          <td>Randomly selected for evaluation</td>
      </tr>
      <tr>
          <td>Image Resolution</td>
          <td>256x256 pixels</td>
          <td>Generated using RDKit</td>
      </tr>
      <tr>
          <td>Median SELFIES Length</td>
          <td>&gt;45 characters</td>
          <td>More complex than typical benchmarks</td>
      </tr>
      <tr>
          <td>Full Dataset Storage</td>
          <td>~16.24 TB</td>
          <td>Necessitated use of 1M subset</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>None</td>
          <td>No cropping, rotation, or other augmentation</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li>Images generated using RDKit at 256x256 resolution</li>
<li>Molecules converted to canonical representations</li>
<li>SELFIES tokenization for model output</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metrics</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI Value</th>
          <th>OSRA Baseline</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>Fingerprint Tanimoto Similarity (functional groups)</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9238</td>
          <td>N/A</td>
          <td>RDKit fingerprint similarity</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8848</td>
          <td>N/A</td>
          <td>Morgan fingerprint similarity (circular)</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>Text overlap metric</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>Structural identity (strict)</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>Syntactic validity (with SELFIES)</td>
      </tr>
      <tr>
          <td>Levenshtein Distance</td>
          <td>Low</td>
          <td>High</td>
          <td>String edit distance</td>
      </tr>
  </tbody>
</table>
<p><strong>Secondary Metrics</strong> (shown to be less informative for chemical tasks):</p>
<ul>
<li>BLEU, ROUGE (better suited for natural language)</li>
<li>Levenshtein distance (doesn&rsquo;t capture chemical similarity)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Single NVIDIA GeForce RTX 2080 Ti</li>
<li><strong>Training Time</strong>: ~5 hours per epoch, approximately 24 hours total for 5 epochs</li>
<li><strong>Memory</strong>: Sufficient for batch size 32 with ResNet-101 + Transformer architecture</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{campos2021img2smi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Campos, Daniel and Ji, Heng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2109.04202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2109.04202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular String Renderer: Robust Visualization Tool</title><link>https://hunterheidenreich.com/projects/molecular-string-renderer/</link><pubDate>Sun, 30 Nov 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/molecular-string-renderer/</guid><description>A robust, type-safe Python library for converting chemical strings (SMILES, SELFIES, InChI) into publication-quality molecular images.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>In computational chemistry and AI drug discovery, visualization pipelines are often brittle, breaking on edge cases or failing silently when processing millions of molecules for training data.</p>
<p>I built <code>molecular-string-renderer</code> to treat molecular visualization as a strict software engineering problem. It is a highly configurable, fault-tolerant wrapper around RDKit that standardizes the conversion of text-based chemical representations (SMILES, <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a>, SELFIES) into high-fidelity raster and vector graphics.</p>
<h2 id="features">Features</h2>
<p>This library differentiates itself from standard plotting scripts through strict architectural patterns designed for reliability:</p>
<h3 id="1-strategy-pattern-for-svg-generation">1. Strategy Pattern for SVG Generation</h3>
<p>RDKit&rsquo;s vector rendering can sometimes fail on complex molecular topologies. I implemented a <strong>Hybrid Strategy</strong> that ensures the pipeline never crashes during batch processing:</p>
<ul>
<li><strong>Vector Strategy</strong>: Attempts to generate a true, scalable vector graphic.</li>
<li><strong>Raster Fallback</strong>: If the vector engine fails, the system automatically renders a high-res PNG and embeds it transparently into the SVG container.</li>
</ul>
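<p>The fallback strategy amounts to a try/render/embed pattern. A minimal sketch, with hypothetical renderer callables rather than the library&rsquo;s actual API:</p>

```python
import base64

def render_svg(mol, vector_render, raster_render):
    """Try a true vector rendering; on failure, embed a PNG in an SVG shell."""
    try:
        return vector_render(mol)  # true scalable vector graphic
    except Exception:
        png_bytes = raster_render(mol)  # high-res raster fallback
        b64 = base64.b64encode(png_bytes).decode("ascii")
        return (
            '<svg xmlns="http://www.w3.org/2000/svg">'
            f'<image href="data:image/png;base64,{b64}"/></svg>'
        )
```

The caller always receives a valid SVG document, so a batch job over millions of molecules never aborts on a single pathological topology.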
<h3 id="2-native-generative-ai-support">2. Native Generative AI Support</h3>
<p>With the rise of Large Language Models in chemistry, <strong>SELFIES</strong> (Self-Referencing Embedded Strings) has become a standard output format. This library handles SELFIES natively, managing the decoding and sanitization lifecycle internally so that ML training loops can simply &ldquo;pass strings and get images.&rdquo;</p>
<h3 id="3-strict-configuration-contracts">3. Strict Configuration Contracts</h3>
<p>The library uses <strong>Pydantic</strong> models (<code>RenderConfig</code>, <code>ParserConfig</code>, <code>OutputConfig</code>) to enforce strict data contracts. This ensures that visualization parameters are validated before any heavy computation begins, preventing runtime errors deep in a batch job.</p>
<h2 id="usage">Usage</h2>
<p>The library provides a simple Python API for rendering single molecules or batches of molecules from various string formats.</p>
<h2 id="results">Results</h2>
<ul>
<li><strong>Type Safety</strong>: The codebase runs with strict <code>mypy</code> settings, ensuring type safety across the entire pipeline.</li>
<li><strong>Grid Auto-Fitting</strong>: Implemented smart layout algorithms that automatically adjust grid dimensions based on the input batch size.</li>
<li><strong>Format Agnostic</strong>: Decouples the <em>parsing</em> logic (SMILES vs. MolBlock vs. SELFIES) from the <em>rendering</em> logic, making it trivial to add support for new proprietary formats.</li>
</ul>
<h2 id="why-this-matters">Why This Matters</h2>
<p>For AI research, infrastructure must be bulletproof. When generating 100,000 images for a diffusion model or visualizing the latent space of a chemical LLM, the pipeline must handle edge cases gracefully. This tool bridges the gap between academic scripts and production-grade data pipelines.</p>
<h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/">Visualizing SMILES and SELFIES Strings</a>: walkthrough of the visualization pipeline this library implements</li>
<li><a href="/projects/isomer-dataset-generation/">Isomer Dataset Generation</a>: related project generating molecular datasets using SMILES/SELFIES representations</li>
</ul>
]]></content:encoded></item><item><title>SMILES: The Original Paper (Weininger 1988)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/smiles-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/smiles-original-paper/</guid><description>Weininger's 1988 paper introducing SMILES notation, the string-based molecular representation that revolutionized chemistry.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. <em>Journal of Chemical Information and Computer Sciences</em>, 28(1), 31-36. <a href="https://doi.org/10.1021/ci00057a005">https://doi.org/10.1021/ci00057a005</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Computer Sciences, 1988</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES notation overview</a> - Modern usage summary</li>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES to 2D images</a> - Practical visualization tutorial</li>
</ul>
<h2 id="core-contribution-a-string-based-molecular-notation">Core Contribution: A String-Based Molecular Notation</h2>
<p>This is a <strong>Method</strong> paper that introduces a novel notation system for representing chemical structures as text strings. It establishes the encoding rules and input conventions for SMILES (Simplified Molecular Input Line Entry System), while explicitly deferring the canonicalization algorithm to subsequent papers in the series.</p>
<h2 id="the-computational-complexity-of-chemical-information-in-the-1980s">The Computational Complexity of Chemical Information in the 1980s</h2>
<p>As computers became central to chemical information processing in the 1980s, the field faced a fundamental problem: existing line notations were either too complex for chemists to use practically or too limited for computational applications. Previous systems required extensive training to write correctly and were prone to errors.</p>
<p>The goal was ambitious: create a system that could represent any molecule as a simple text string, making it both human-readable and machine-efficient. This would enable compact database storage, fast processing, and easy exchange between software systems.</p>
<h2 id="separating-input-rules-from-canonicalization">Separating Input Rules from Canonicalization</h2>
<p>Weininger&rsquo;s key insight was to separate the problem into two parts: create simple, flexible rules that chemists could easily learn for input, while deferring to the computer the complex task of generating a unique, canonical representation. This division of labor made SMILES both practical and powerful.</p>
<p>The specific innovations include:</p>
<ol>
<li><strong>Simple input rules</strong> - Chemists could write molecules intuitively (e.g., <code>CCO</code> or <code>OCC</code> for ethanol)</li>
<li><strong>Ring closure notation</strong> - Breaking one bond and marking ends with matching digits</li>
<li><strong>Implicit hydrogens</strong> - Automatic calculation based on standard valences keeps strings compact</li>
<li><strong>Algorithmic aromaticity detection</strong> - Automatic recognition of aromatic systems from Kekulé structures</li>
<li><strong>Human-readable output</strong> - Unlike binary formats, SMILES strings are readable and debuggable</li>
</ol>
<p><strong>Important scope note</strong>: This first paper in the series establishes the input syntax and encoding rules. The canonicalization algorithm (how to generate unique SMILES) is explicitly stated as the subject of following papers: &ldquo;specification of isomerisms, substructures, and unique SMILES generation are the subjects of following papers.&rdquo;</p>
<h2 id="demonstrating-notation-rules-across-molecular-classes">Demonstrating Notation Rules Across Molecular Classes</h2>
<p>The paper is primarily a specification document establishing notation rules. The methodology is demonstrated through worked examples showing how to encode various molecular structures:</p>
<ul>
<li><strong>Basic molecules</strong>: Ethane (<code>CC</code>), ethylene (<code>C=C</code>), acetylene (<code>C#C</code>)</li>
<li><strong>Branches</strong>: Isobutyric acid (<code>CC(C)C(=O)O</code>)</li>
<li><strong>Rings</strong>: Cyclohexane (<code>C1CCCCC1</code>), benzene (<code>c1ccccc1</code>)</li>
<li><strong>Aromatic systems</strong>: Tropone (<code>O=c1ccccc1</code>), quinone (showing exocyclic bond effects)</li>
<li><strong>Complex structures</strong>: Morphine (40 characters vs 1000-2000 for connection tables)</li>
<li><strong>Edge cases</strong>: Salts, isotopes, charged species, tautomers</li>
</ul>
<p>Performance is reported only in broad terms: SMILES processing was approximately 100 times faster than traditional connection-table methods on the hardware of the era (1988), with dramatic reductions in storage space.</p>
<h2 id="performance-and-practical-viability">Performance and Practical Viability</h2>
<p>The paper successfully establishes SMILES as a practical notation system with several key outcomes:</p>
<p><strong>Practical benefits</strong>:</p>
<ul>
<li><strong>Compactness</strong>: 40 characters for morphine vs 1000-2000 for connection tables</li>
<li><strong>Speed</strong>: ~100x faster processing than traditional methods</li>
<li><strong>Accessibility</strong>: Simple enough for chemists to learn without extensive training</li>
<li><strong>Machine-friendly</strong>: Efficient parsing and string-based operations</li>
</ul>
<p><strong>Design principles validated</strong>:</p>
<ul>
<li>Separating user input from canonical representation makes the system both usable and rigorous</li>
<li>Implicit hydrogens reduce string length without loss of information</li>
<li>Ring closure notation with digit markers is more intuitive than complex graph syntax</li>
<li>Automatic aromaticity detection handles most cases correctly</li>
</ul>
<p><strong>Acknowledged limitations</strong>:</p>
<ul>
<li>Canonicalization algorithm not included in this paper</li>
<li>Stereochemistry handling deferred to subsequent papers</li>
<li>Some edge cases (like unusual valence states) require explicit specification</li>
</ul>
<p>The paper concludes by positioning SMILES as a foundation for database storage, substructure searching, and chemical informatics applications - a vision that proved accurate as SMILES became one of the most widely used molecular representations in computational chemistry.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>To implement the method described in this paper, the following look-up tables and algorithms are required. <strong>Note</strong>: These details are critical for replication but are often glossed over in high-level summaries.</p>
<h3 id="1-the-valence-look-up-table">1. The Valence Look-Up Table</h3>
<p>To calculate implicit hydrogens, the system assumes the &ldquo;lowest normal valence&rdquo; greater than or equal to the explicit bond count. The paper explicitly defines these valences:</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Allowed Valences</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>B</td>
          <td>3</td>
      </tr>
      <tr>
          <td>C</td>
          <td>4</td>
      </tr>
      <tr>
          <td>N</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>O</td>
          <td>2</td>
      </tr>
      <tr>
          <td>P</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>S</td>
          <td>2, 4, 6</td>
      </tr>
      <tr>
          <td>F, Cl, Br, I</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p><strong>Example</strong>: For sulfur in $\text{H}_2\text{SO}_4$, the explicit bond count is 6 (four oxygens), so the system uses valence 6 with zero implicit hydrogens. Without knowing S allows valence 6, the algorithm would fail.</p>
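<p>The look-up itself can be sketched in a few lines of Python (a minimal illustration of the rule, not the paper&rsquo;s implementation):</p>

```python
VALENCES = {
    "B": [3], "C": [4], "N": [3, 5], "O": [2],
    "P": [3, 5], "S": [2, 4, 6],
    "F": [1], "Cl": [1], "Br": [1], "I": [1],
}

def implicit_hydrogens(element: str, explicit_bonds: int) -> int:
    """Implicit H count = lowest normal valence >= the explicit
    bond order sum, minus that sum."""
    for valence in VALENCES[element]:
        if valence >= explicit_bonds:
            return valence - explicit_bonds
    raise ValueError(f"{element} cannot support {explicit_bonds} bonds")
```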
<h3 id="2-explicit-hydrogen-requirements">2. Explicit Hydrogen Requirements</h3>
<p>The paper enumerates the cases where hydrogen must be explicitly specified:</p>
<ol>
<li><strong>Hydrogen connected to zero atoms</strong> (rare edge case)</li>
<li><strong>Hydrogen connected to other hydrogen</strong> (molecular hydrogen, $\text{H}_2$, written as <code>[H][H]</code>)</li>
<li><strong>Hydrogen connected to more than one atom</strong> (bridging hydrogens)</li>
<li><strong>Isotopic hydrogen</strong> (deuterium <code>[2H]</code>, tritium <code>[3H]</code>)</li>
</ol>
<p>For all other cases, hydrogens are implicit and calculated from the valence table.</p>
<h3 id="3-ring-closure-notation">3. Ring Closure Notation</h3>
<p>Standard SMILES supports single digits <code>1-9</code> for ring closures. For rings numbered 10 and higher, the notation requires a <strong>percent sign prefix</strong>:</p>
<ul>
<li>Ring closures 1-9: <code>C1CCCCC1</code></li>
<li>Ring closures 10+: <code>C%10CCCCC%10</code>, <code>C2%13%24</code> (ring 2, ring 13, ring 24)</li>
</ul>
<p>Without this rule, a parser would fail on large polycyclic structures.</p>
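<p>A small scanner for the closure suffix of an atom illustrates the two notations (a hypothetical helper operating only on the ring-closure digits that follow an atom symbol):</p>

```python
import re

# %NN for two-digit closures, otherwise a single digit 1-9.
RING_RE = re.compile(r"%(\d\d)|(\d)")

def ring_closures(closure_suffix: str) -> list[int]:
    """Parse the ring-closure numbers that follow an atom symbol."""
    numbers = []
    for two_digit, one_digit in RING_RE.findall(closure_suffix):
        numbers.append(int(two_digit) if two_digit else int(one_digit))
    return numbers
```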
<h3 id="4-aromaticity-detection-algorithm">4. Aromaticity Detection Algorithm</h3>
<p>The system uses an algorithmic version of Hückel&rsquo;s Rule ($4N+2$ π-electrons). The &ldquo;excess electron&rdquo; count for the aromatic system is determined by these rules:</p>
<p><strong>Carbon contribution</strong>:</p>
<ul>
<li><strong>C in aromatic ring</strong>: Contributes 1 electron</li>
<li><strong>C double-bonded to exocyclic electronegative atom</strong> (e.g., $\text{C}=\text{O}$ in quinone): Contributes 0 electrons (the carbon &ldquo;loses&rdquo; its electron to the oxygen)</li>
</ul>
<p><strong>Heteroatom contribution</strong>:</p>
<ul>
<li><strong>O, S in ring</strong>: Contributes 2 electrons (lone pair)</li>
<li><strong>N in ring</strong>: Contributes 1 electron (pyridine-like) or 2 electrons (pyrrole-like, must have explicit hydrogen <code>[nH]</code>)</li>
</ul>
<p><strong>Charge effects</strong>:</p>
<ul>
<li><strong>Positive charge</strong>: Reduces electron count by 1</li>
<li><strong>Negative charge</strong>: Increases electron count by 1</li>
</ul>
<p><strong>Critical example - Quinone</strong>:</p>
<pre tabindex="0"><code>O=C1C=CC(=O)C=C1
</code></pre><p>Quinone has 6 carbons in the ring, but the two carbons bonded to exocyclic oxygens contribute 0 electrons each. The four remaining carbons contribute 4 electrons total (not 6), so quinone is <strong>not aromatic</strong> by this algorithm. This exocyclic bond rule is essential for correct aromaticity detection.</p>
<p><strong>Aromatic ring test</strong>:</p>
<ol>
<li>All atoms must be sp² hybridized</li>
<li>Count excess electrons using the rules above</li>
<li>Apply Hückel&rsquo;s rule: the ring is aromatic if the excess electron count equals $4N + 2$ for some non-negative integer $N$, i.e.
$$ \text{Excess Electrons} \equiv 2 \pmod 4 $$</li>
</ol>
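<p>The parity test in step 3 reduces to a one-line check (illustrative sketch):</p>

```python
def is_huckel_aromatic(excess_electrons: int) -> bool:
    """4N + 2 for some integer N >= 0  <=>  count >= 2 and count % 4 == 2."""
    return excess_electrons >= 2 and excess_electrons % 4 == 2

# Benzene: six ring carbons contribute one electron each -> 6 -> aromatic.
# Quinone: the two C=O carbons contribute 0, so only 4 -> not aromatic.
```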
<h2 id="encoding-rules-reference">Encoding Rules Reference</h2>
<p>The following sections provide a detailed reference for the six fundamental SMILES encoding rules. These are the rules a user would apply when writing SMILES strings.</p>
<h3 id="1-atoms">1. Atoms</h3>
<p>Atoms use their standard chemical symbols. Elements in the &ldquo;organic subset&rdquo; (B, C, N, O, P, S, F, Cl, Br, I) can be written directly when they have their most common valence - so <code>C</code> automatically means a carbon with enough implicit hydrogens to satisfy its valence.</p>
<p>Everything else goes in square brackets: <code>[Au]</code> for gold, <code>[NH4+]</code> for ammonium ion, or <code>[13C]</code> for carbon-13. Aromatic atoms get lowercase letters: <code>c</code> for aromatic carbon in benzene.</p>
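<p>A toy tokenizer shows how these two atom syntaxes can be separated during parsing (an illustrative regex, not a full SMILES grammar; it ignores bonds, branches, and ring digits):</p>

```python
import re

# Bracket atoms first, then two-letter halogens, then the single-letter
# organic subset (uppercase) and its aromatic lowercase forms.
ATOM_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def atom_tokens(smiles: str) -> list[str]:
    """Extract atom tokens, skipping bonds, parentheses, and digits."""
    return ATOM_RE.findall(smiles)
```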
<h3 id="2-bonds">2. Bonds</h3>
<p>Bond notation is straightforward:</p>
<ul>
<li><code>-</code> for single bonds (usually omitted)</li>
<li><code>=</code> for double bonds</li>
<li><code>#</code> for triple bonds</li>
<li><code>:</code> for aromatic bonds (also usually omitted)</li>
</ul>
<p>So <code>CC</code> and <code>C-C</code> both represent ethane, while <code>C=C</code> is ethylene.</p>
<h3 id="3-branches">3. Branches</h3>
<p>Branches use parentheses, just like in mathematical expressions. Isobutyric acid becomes <code>CC(C)C(=O)O</code> - the main chain is <code>CC C(=O)O</code> with a methyl <code>(C)</code> branch.</p>
<h3 id="4-rings">4. Rings</h3>
<p>This is where SMILES gets clever. You break one bond and mark both ends with the same digit. Cyclohexane becomes <code>C1CCCCC1</code> - the <code>1</code> connects the first and last carbon, closing the ring.</p>
<p>You can reuse digits for different rings in the same molecule, making complex structures manageable.</p>
<h3 id="5-disconnected-parts">5. Disconnected Parts</h3>
<p>Salts and other disconnected structures use periods. Sodium phenoxide: <code>[Na+].[O-]c1ccccc1</code>. The order doesn&rsquo;t matter - you&rsquo;re just listing the separate components.</p>
<h3 id="6-aromaticity">6. Aromaticity</h3>
<p>Aromatic rings can be written directly with lowercase letters. Benzoic acid becomes <code>c1ccccc1C(=O)O</code>. The system can also detect aromaticity automatically from Kekulé structures, so <code>C1=CC=CC=C1C(=O)O</code> works just as well.</p>
<h3 id="simplified-subset-for-organic-chemistry">Simplified Subset for Organic Chemistry</h3>
<p>Weininger recognized that most chemists work primarily with organic compounds, so he defined a simplified subset that covers the vast majority of cases. For organic molecules, you only need four rules:</p>
<ol>
<li><strong>Atoms</strong>: Use standard symbols (C, N, O, etc.)</li>
<li><strong>Multiple bonds</strong>: Use <code>=</code> and <code>#</code> for double and triple bonds</li>
<li><strong>Branches</strong>: Use parentheses <code>()</code></li>
<li><strong>Rings</strong>: Use matching digits</li>
</ol>
<p>This &ldquo;basic SMILES&rdquo; subset covers the vast majority of molecules chemists encounter daily, making the system immediately accessible without requiring every edge case to be learned up front.</p>
<h2 id="design-decisions-and-edge-cases">Design Decisions and Edge Cases</h2>
<p>Beyond the basic rules, the paper established several important conventions for handling ambiguous cases:</p>
<h3 id="hydrogen-handling">Hydrogen Handling</h3>
<p>Hydrogens are usually implicit - the system calculates how many each atom needs based on standard valences. So <code>C</code> represents CH₄, <code>N</code> represents NH₃, and so on. This keeps strings compact and readable.</p>
<p>Explicit hydrogens only appear in special cases: when hydrogen connects to multiple atoms, when you need to specify an exact count, or in isotopic specifications like <code>[2H]</code> for deuterium.</p>
<h3 id="bond-representation">Bond Representation</h3>
<p>The paper made an important choice about how to represent bonds in ambiguous cases. For example, nitro groups could be written as charge-separated <code>C[N+]([O-])[O-]</code> or with double bonds <code>CN(=O)=O</code>. Weininger chose to prefer covalent bonds when possible, keeping the topology symmetric and avoiding unusual charges.</p>
<p>However, when covalent representation would require unusual valences, charge separation is preferred. Diazomethane becomes <code>C=[N+]=[N-]</code> to avoid forcing carbon into an unrealistic valence state.</p>
<h3 id="tautomers">Tautomers</h3>
<p>SMILES doesn&rsquo;t try to be too clever about tautomers - it represents exactly what you specify. So 2-pyridone can be written as either the enol form <code>Oc1ncccc1</code> or the keto form <code>O=c1[nH]cccc1</code>. The system won&rsquo;t automatically convert between them.</p>
<p>This explicit approach means you need to decide which tautomeric form to represent, but it also means the notation precisely captures what you intend.</p>
<h3 id="aromaticity-detection">Aromaticity Detection</h3>
<p>One of the most sophisticated parts of the original system was automatic aromaticity detection. The algorithm uses an extended Hückel rule: a ring is aromatic if all atoms are sp² hybridized and it contains 4N+2 π-electrons.</p>
<p>This means you can input benzene as the Kekulé structure <code>C1=CC=CC=C1</code> and the system will automatically recognize it as aromatic and convert it to <code>c1ccccc1</code>. The algorithm handles complex cases like tropone (<code>O=c1ccccc1</code>) and correctly identifies them as aromatic.</p>
<h3 id="aromatic-nitrogen">Aromatic Nitrogen</h3>
<p>The system makes an important distinction for nitrogen in aromatic rings. Pyridine-type nitrogen (like in pyridine itself) is written as <code>n</code> and has no attached hydrogens. Pyrrole-type nitrogen has an attached hydrogen that must be specified explicitly: <code>[nH]1cccc1</code> for pyrrole.</p>
<p>This distinction captures the fundamental difference in electron contribution between these two nitrogen types in aromatic systems.</p>
<h2 id="impact-and-legacy">Impact and Legacy</h2>
<p>Nearly four decades later, SMILES remains one of the most important file formats in computational chemistry. The notation became the foundation for:</p>
<ul>
<li><strong>Database storage</strong> - Compact, searchable molecular representations</li>
<li><strong>Substructure searching</strong> - Pattern matching in chemical databases</li>
<li><strong>Property prediction</strong> - Input format for QSAR models</li>
<li><strong>Chemical informatics</strong> - Standard exchange format between software</li>
<li><strong>Modern ML</strong> - Text-based representation for neural networks</li>
</ul>
<p>While newer approaches like <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> have addressed some limitations (like the possibility of invalid strings), SMILES&rsquo; combination of simplicity and power has made it enduringly useful.</p>
<p>The paper established both a notation system and a design philosophy: chemical informatics tools should be powerful enough for computers while remaining accessible to working chemists. That balance remains relevant today as we develop new molecular representations for machine learning and AI applications.</p>
]]></content:encoded></item><item><title>SELFIES: The Original Paper (Krenn et al. 2020)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/selfies-original-paper/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/selfies-original-paper/</guid><description>The 2020 paper introducing SELFIES, the 100% robust molecular representation that solves SMILES validity problems in ML applications.</description><content:encoded><![CDATA[<h2 id="contribution-a-100-robust-representation-for-ml">Contribution: A 100% Robust Representation for ML</h2>
<p>This is a <strong>Method</strong> paper that introduces a new molecular string representation designed specifically for machine learning applications.</p>
<h2 id="motivation-the-invalidity-bottleneck">Motivation: The Invalidity Bottleneck</h2>
<p>When neural networks generate molecules using <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES notation</a>, a huge fraction of output strings are invalid: either syntax errors or chemically impossible structures. This was a fundamental bottleneck: if your generative model produces 70% invalid molecules, you&rsquo;re wasting computational effort and severely limiting chemical space exploration.</p>
<h2 id="novelty-a-formal-grammar-approach">Novelty: A Formal Grammar Approach</h2>
<p>The authors&rsquo; key insight was using a <strong>formal grammar approach</strong> (specifically, a Chomsky type-2, context-free grammar with self-referencing functions) where each symbol is interpreted based on chemical context. The &ldquo;state of the derivation&rdquo; tracks available valence bonds, preventing impossible structures like a carbon with five single bonds.</p>
<p>For example, generating 2-Fluoroethenimine (<code>FC=C=N</code>) follows a state derivation where each step restricts the available valency for the next element:</p>
<p>$$
\mathbf{X}_0 \xrightarrow{[F]} \text{F } \mathbf{X}_1 \xrightarrow{[=C]} \text{FC } \mathbf{X}_3 \xrightarrow{[=C]} \text{FC=C } \mathbf{X}_2 \xrightarrow{[\#N]} \text{FC=C=N}
$$</p>
<p>This approach guarantees 100% validity: every SELFIES string corresponds to a valid molecule, and every valid molecule can be represented.</p>
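<p>The derivation above can be imitated with a toy decoder for a handful of symbols (a heavily simplified sketch; the real <code>selfies</code> library also handles rings, branches, charges, and a much larger alphabet):</p>

```python
# Toy derivation state machine for a tiny SELFIES-like alphabet.
VALENCE = {"F": 1, "C": 4, "N": 3, "O": 2}
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_CHAR = {1: "", 2: "=", 3: "#"}

def decode(symbols: list[str]) -> str:
    out, state = "", None  # state = remaining valence of previous atom
    for sym in symbols:
        body = sym.strip("[]")
        prefix = body[0] if body[0] in "=#" else ""
        element = body[len(prefix):]
        if state is None:          # first atom: no incoming bond
            bond = 0
        else:                      # cap the bond by the available valence
            bond = min(BOND_ORDER[prefix], state)
            out += BOND_CHAR[bond]
        out += element
        state = VALENCE[element] - bond
        if state == 0:             # no free valence left: derivation stops
            break
    return out
```

Note how the final <code>[#N]</code> is demoted to a double bond because the preceding carbon has only two valence slots left, exactly as in the state derivation above.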
<h2 id="methodology--experiments-validating-robustness">Methodology &amp; Experiments: Validating Robustness</h2>
<p>The authors ran a convincing set of experiments to demonstrate SELFIES&rsquo; robustness:</p>
<h3 id="random-mutation-test">Random Mutation Test</h3>
<p>They took the SELFIES and <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> representations of MDMA and introduced random changes:</p>
<ul>
<li><strong>SMILES</strong>: After just one random mutation, only 9.9% of strings remained valid (dropping to 1.1% after three mutations).</li>
<li><strong>SELFIES</strong>: 100% of mutated strings still represented valid molecules (though different from the original).</li>
</ul>
<p>This empirical difference demonstrates why SELFIES is highly effective for evolutionary algorithms and genetic programming approaches to molecular design.</p>
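<p>The fragility being measured can be illustrated with a toy experiment: mutate one character at a time and count how often a purely syntactic check still passes (a toy validity predicate, not real chemistry):</p>

```python
import random

def toy_valid(s: str) -> bool:
    """Toy syntactic check only: balanced parentheses and paired
    ring-closure digits. Real validity needs a chemistry toolkit."""
    depth = 0
    digit_counts: dict[str, int] = {}
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            digit_counts[ch] = digit_counts.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in digit_counts.values())

def mutation_survival(s: str, trials: int = 1000, seed: int = 0) -> float:
    """Fraction of single-character mutations that still pass toy_valid."""
    rng = random.Random(seed)
    alphabet = "CNO()=#123"
    survived = 0
    for _ in range(trials):
        i = rng.randrange(len(s))
        mutated = s[:i] + rng.choice(alphabet) + s[i + 1:]
        survived += toy_valid(mutated)
    return survived / trials
```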
<h3 id="generative-model-performance">Generative Model Performance</h3>
<p>The real test came with actual machine learning models. The authors trained Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) on both representations:</p>
<p><strong>VAE Results:</strong></p>
<ul>
<li>SMILES-based VAE: Large invalid regions scattered throughout the latent space</li>
<li>SELFIES-based VAE: Every point in the continuous latent space mapped to a valid molecule</li>
<li>The SELFIES model encoded <strong>over 100 times more diverse molecules</strong></li>
</ul>
<p><strong>GAN Results:</strong></p>
<ul>
<li>Best SMILES GAN: 18.5% diverse, valid molecules</li>
<li>Best SELFIES GAN: 78.9% diverse, valid molecules</li>
</ul>
<p><strong>Evaluation Metrics:</strong></p>
<ul>
<li><strong>Validity</strong>: Percentage of generated strings representing valid molecular structures</li>
<li><strong>Diversity</strong>: Number of unique valid molecules produced</li>
<li><strong>Reconstruction Accuracy</strong>: How well the autoencoder reproduced input molecules (detailed in Table 1)</li>
</ul>
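<p>The first two metrics are straightforward to compute given any validity checker (sketch; <code>is_valid</code> is a stand-in for, e.g., an RDKit parse check):</p>

```python
from typing import Callable

def validity_rate(samples: list[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of generated strings that pass the validity check."""
    return sum(map(is_valid, samples)) / len(samples)

def diversity(samples: list[str], is_valid: Callable[[str], bool]) -> int:
    """Number of unique valid molecules among the samples."""
    return len({s for s in samples if is_valid(s)})
```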
<h3 id="scalability-test">Scalability Test</h3>
<p>The authors proved SELFIES works beyond toy molecules by successfully encoding and decoding all <strong>72 million molecules</strong> from the PubChem database, demonstrating practical applicability to real chemical databases.</p>
<h2 id="results--conclusions-chemical-space-exploration">Results &amp; Conclusions: Chemical Space Exploration</h2>
<p><strong>Key Findings:</strong></p>
<ul>
<li>SELFIES achieves 100% validity guarantee: every string represents a valid molecule</li>
<li>SELFIES-based VAEs encode over 100x more diverse molecules than SMILES-based models</li>
<li>SELFIES-based GANs produce 78.9% diverse valid molecules vs. 18.5% for SMILES GANs</li>
<li>Successfully validated on all 72 million PubChem molecules</li>
</ul>
<p><strong>Limitations Acknowledged:</strong></p>
<ul>
<li>No direct canonicalization method (initially required SMILES conversion)</li>
<li>Missing features: aromaticity, isotopes, complex stereochemistry</li>
<li>Requires community testing and adoption</li>
</ul>
<p><strong>Impact:</strong></p>
<p>This work demonstrated that designing ML-native molecular representations could unlock new capabilities in drug discovery and materials science. The paper sparked broader conversation about representation design and showed the effectiveness of addressing format limitations directly. SELFIES was subsequently evaluated as an alternative input representation to SMILES in <a href="/notes/computational-chemistry/chemical-language-models/chemberta/">ChemBERTa</a>, a transformer pretrained on molecular strings for property prediction, where it performed comparably to SMILES on the Tox21 benchmark, though the comparison was limited to a single task.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The machine learning experiments used two distinct datasets:</p>
<ul>
<li><strong>QM9</strong> (134k molecules): Primary training dataset for VAE and GAN models</li>
<li><strong>PubChem</strong> (72M molecules): Used only to test representation coverage and scalability; not used for model training</li>
</ul>
<h3 id="models">Models</h3>
<p>The VAE implementation included:</p>
<ul>
<li><strong>Latent space</strong>: 241-dimensional with Gaussian distributions</li>
<li><strong>Input encoding</strong>: One-hot encoding of SELFIES/SMILES strings</li>
<li>Full architectural details (encoder/decoder structures, layer types) provided in Supplementary Information</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The authors found GAN performance was highly sensitive to hyperparameter selection:</p>
<ul>
<li>Searched <strong>200 different hyperparameter configurations</strong> to achieve the reported 78.9% diversity</li>
<li>Specific optimizers, learning rates, and training duration detailed in Supplementary Information</li>
<li>Full rule generation algorithm provided in Table 2</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>All models evaluated on:</p>
<ul>
<li><strong>Validity rate</strong>: Percentage of syntactically and chemically valid outputs</li>
<li><strong>Diversity</strong>: Count of unique valid molecules generated</li>
<li><strong>Reconstruction accuracy</strong>: Fidelity of autoencoder reconstruction (VAEs only)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training performed on the SciNet supercomputing infrastructure.</li>
<li>Specific GPU/compute details are provided in the Supplementary Information.</li>
</ul>
<h3 id="replication-resources">Replication Resources</h3>
<p>Complete technical replication is highly accessible: the paper is published open-access in <em>Machine Learning: Science and Technology</em>. It primarily requires:</p>
<ul>
<li>The full rule generation algorithm (Table 2 in paper)</li>
<li>Code and trained models: <a href="https://github.com/aspuru-guzik-group/selfies">https://github.com/aspuru-guzik-group/selfies</a></li>
<li>Supplementary Information for complete architectural and hyperparameter specifications</li>
</ul>
<p><strong>Note</strong>: The <a href="/notes/computational-chemistry/molecular-representations/selfies/">modern SELFIES library</a> has evolved significantly since this foundational paper, addressing many of the implementation challenges identified by the authors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024. <a href="https://doi.org/10.1088/2632-2153/aba947">https://doi.org/10.1088/2632-2153/aba947</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn_2020,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/aba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1088%2F2632-2153%2Faba947}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{{IOP} Publishing}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{045024}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Mario Krenn and Florian H{\&#34;{a}}se and AkshatKumar Nigam and Pascal Friederich and Alan Aspuru-Guzik}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Self-referencing embedded strings ({SELFIES}): A 100{\%} robust molecular string representation}</span>,
</span></span><span style="display:flex;"><span>	<span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">GitHub Repository</a></li>
<li><a href="/notes/computational-chemistry/molecular-representations/selfies/">Modern SELFIES Documentation</a></li>
</ul>
]]></content:encoded></item><item><title>RInChI: Reaction International Chemical Identifier</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/rinchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/rinchi/</guid><description>RInChI extends InChI to create unique, machine-readable identifiers for chemical reactions and database searching.</description><content:encoded><![CDATA[<h2 id="paper-classification-and-scope">Paper Classification and Scope</h2>
<p>This is an <strong>infrastructure/resource paper</strong> combined with a <strong>methods paper</strong>. It establishes a standard format, releases an open-source software library, and enables large-scale database operations. The methods component details the specific algorithmic rules for constructing identifiers through hashing, sorting, and layering.</p>
<h2 id="the-need-for-standardized-reaction-identifiers">The Need for Standardized Reaction Identifiers</h2>
<p>While we have excellent standards for identifying individual molecules (like <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a>), there was no equivalent for chemical reactions. This creates real problems:</p>
<ul>
<li>Different researchers working on the same reaction might describe it completely differently</li>
<li>Searching large reaction databases becomes nearly impossible</li>
<li>No way to check if two apparently different reaction descriptions are actually the same process</li>
<li>Chemical databases can&rsquo;t easily link related reactions or identify duplicates</li>
</ul>
<p>If a reaction converts &ldquo;starting material A + reagent B → product C,&rdquo; it is difficult to determine if that is identical to another researcher&rsquo;s description of the same transformation using different names or graphical representations.</p>
<h2 id="core-innovation-standardizing-reaction-strings">Core Innovation: Standardizing Reaction Strings</h2>
<p>RInChI solves this by creating a standardized, machine-readable label for any chemical reaction. The key insight is to focus on the essential chemistry while ignoring experimental details that can vary between labs.</p>
<h3 id="core-principles">Core Principles</h3>
<p>RInChI captures three fundamental pieces of information:</p>
<ol>
<li><strong>Starting materials</strong>: What molecules you begin with</li>
<li><strong>Products</strong>: What molecules you end up with</li>
<li><strong>Agents</strong>: Substances present at both the beginning and end (catalysts, solvents, etc.)</li>
</ol>
<p>Importantly, RInChI intentionally excludes experimental conditions like temperature, pressure, yield, or reaction time. These details can vary significantly even for identical chemical transformations, so including them would make it nearly impossible for different researchers to generate the same identifier.</p>
<h3 id="how-rinchi-works">How RInChI Works</h3>
<h4 id="the-rinchi-string-structure">The RInChI String Structure</h4>
<p>A RInChI string has six distinct layers. Crucially, <strong>Layers 2 and 3 are assigned alphabetically</strong>. This is essential for generating consistent identifiers.</p>
<p><strong>Layer 1: Version</strong></p>
<ul>
<li>Standard header defining the RInChI version (e.g., <code>RInChI=1.00.1S</code>)</li>
</ul>
<p><strong>Layers 2 &amp; 3: Component Molecules</strong></p>
<ul>
<li>These layers contain the InChI strings of reaction participants (reactants and products)</li>
<li><strong>Sorting Rule</strong>: The distinct groups (Reactant Group vs. Product Group) are sorted alphabetically as aggregate strings. The group that comes first alphabetically becomes <strong>Layer 2</strong>; the other becomes <strong>Layer 3</strong></li>
<li>This means if a product&rsquo;s InChI is alphabetically &ldquo;earlier&rdquo; than the reactant&rsquo;s, the product goes in Layer 2</li>
<li><strong>Formatting</strong>: Molecules within a layer are separated by <code>!</code>. The two layers are separated by <code>&lt;&gt;</code></li>
</ul>
<p><strong>Layer 4: Agents</strong></p>
<ul>
<li>Contains catalysts, solvents, and any molecule found in <em>both</em> the reactant and product input lists</li>
<li><strong>Algorithmic rule</strong>: Anything appearing in both the reactant list and product list must be removed from both and added to Layer 4</li>
</ul>
<p><strong>Layer 5: Direction (The Decoder)</strong></p>
<ul>
<li>This layer determines which component layer represents the starting material:
<ul>
<li><code>/d+</code>: Layer 2 is the Starting Material (forward direction)</li>
<li><code>/d-</code>: Layer 3 is the Starting Material (reverse direction)</li>
<li><code>/d=</code>: Equilibrium reaction</li>
</ul>
</li>
<li>Without this layer, you cannot determine reactants from products</li>
</ul>
<p><strong>Layer 6: No-Structure Data</strong></p>
<ul>
<li>Format: <code>/uA-B-C</code> where the numbers indicate the count of structureless materials in Layer 2, Layer 3, and Layer 4 respectively</li>
<li>Used when substances lack defined structures and cannot be represented by InChI</li>
</ul>
<h3 id="separator-syntax">Separator Syntax</h3>
<p>For parsing or generating RInChI strings, the separator characters are:</p>
<table>
  <thead>
      <tr>
          <th>Separator</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>/</code></td>
          <td>Separates layers</td>
      </tr>
      <tr>
          <td><code>!</code></td>
          <td>Separates molecules within a layer</td>
      </tr>
      <tr>
          <td><code>&lt;&gt;</code></td>
          <td>Separates reactant/product groups</td>
      </tr>
  </tbody>
</table>
<h3 id="example-structure">Example Structure</h3>
<pre><code>RInChI=1.00.1S/[Layer2 InChIs]&lt;&gt;[Layer3 InChIs]&lt;&gt;[Agent InChIs]/d+/u0-0-0
</code></pre>
<p>This systematic approach ensures that any researcher starting with the same reaction will generate an identical RInChI string.</p>
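<p>The layer-assignment rules above can be sketched in a few lines of Python. This is an illustrative toy, not the official algorithm (the real software also normalizes InChIs, strips their prefixes, and computes the <code>/u</code> no-structure counts):</p>

```python
def build_rinchi(reactants, products, agents=()):
    """Illustrative sketch of RInChI layer assembly (NOT the official
    algorithm). Inputs are lists of InChI strings."""
    reactants, products, agents = list(reactants), list(products), list(agents)

    # Layer 4 rule: anything appearing in both input lists becomes an agent.
    for mol in [m for m in reactants if m in products]:
        reactants.remove(mol)
        products.remove(mol)
        agents.append(mol)

    # Layers 2 & 3: the two groups are compared alphabetically as
    # aggregate strings; the earlier one becomes Layer 2.
    group_r = "!".join(sorted(reactants))
    group_p = "!".join(sorted(products))
    if group_r <= group_p:
        layer2, layer3, direction = group_r, group_p, "+"  # forward
    else:
        layer2, layer3, direction = group_p, group_r, "-"  # reverse

    layer4 = "!".join(sorted(agents))
    return f"RInChI=1.00.1S/{layer2}<>{layer3}<>{layer4}/d{direction}/u0-0-0"
```

<p>Note that swapping the reactant and product lists leaves Layers 2 and 3 unchanged and only flips the direction flag, which is exactly why Layer 5 is needed to recover which side is the starting material.</p>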
<h3 id="rinchikeys-shorter-identifiers-for-practical-use">RInChIKeys: Shorter Identifiers for Practical Use</h3>
<p>Since full RInChI strings can become extremely long, the standard includes three types of shorter, hashed keys for different applications:</p>
<h4 id="long-rinchikey">Long-RInChIKey</h4>
<ul>
<li>Contains complete InChIKeys for every molecule in the reaction</li>
<li>Variable length, but allows searching for reactions containing specific compounds</li>
<li>Useful for component searches: &ldquo;Show me all reactions involving compound X&rdquo;</li>
</ul>
<h4 id="short-rinchikey">Short-RInChIKey</h4>
<ul>
<li>Fixed length (63 characters)</li>
<li>Generated by separately hashing different RInChI layers, preserving the distinction between component roles</li>
<li><strong>Mathematical Formulation</strong>: The key represents an aggregate hash where the major layers ($L_{\text{major}}$, mapping to the molecular skeleton) are hashed completely separately from the minor layers ($L_{\text{minor}}$, such as stereochemistry and protonation states):
$$ \text{Key}_{\text{short}} = \text{Hash}(L_{\text{major}}) + \text{Hash}(L_{\text{minor}}) + \dots $$</li>
<li>Perfect for exact matching and database storage</li>
<li>The go-to choice for linking identical reactions across different databases</li>
</ul>
<h4 id="web-rinchikey">Web-RInChIKey</h4>
<ul>
<li>Shortest format (47 characters)</li>
<li><strong>Hashing algorithm</strong>: The Web-RInChIKey uses a simplified hashing structure designed to explicitly ignore molecular roles (e.g., whether something is a reactant or product). It defines a unified set of components $C = \{ c_1, c_2, \dots, c_n \}$ consisting of all unique InChI representations across the entire reaction.
<ol>
<li>Combine and sort all InChIs alphabetically</li>
<li>Produce a two-part hash block focusing on major and minor structural layers:
$$ \text{Key}_{\text{web}} = \text{Hash}(C_{\text{major}}) + \text{Hash}(C_{\text{minor}}) $$</li>
</ol>
</li>
<li>Ignores molecular roles as reactants vs. products</li>
<li>Useful for finding related reactions where a molecule&rsquo;s role might be ambiguous</li>
<li>Good for discovering &ldquo;reverse&rdquo; reactions or alternative synthetic routes</li>
</ul>
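<p>To make the role-sensitive vs. role-agnostic contrast concrete, here is a toy sketch. The <code>_h</code> helper is a truncated SHA-256 stand-in for the official InChIKey-style hash (which encodes to uppercase letters), and the single-letter strings stand in for real InChIs:</p>

```python
import hashlib

def _h(text, n=10):
    # Stand-in hash: hex digits here; real RInChIKeys use a letter encoding.
    return hashlib.sha256(text.encode()).hexdigest()[:n].upper()

def short_key_sketch(layer2, layer3, layer4):
    """Role-sensitive: each component layer is hashed separately."""
    return "-".join(_h("!".join(sorted(g))) for g in (layer2, layer3, layer4))

def web_key_sketch(layer2, layer3, layer4):
    """Role-agnostic: all InChIs are pooled, sorted, and hashed together."""
    return _h("!".join(sorted([*layer2, *layer3, *layer4])))

# Moving a molecule from one side to the other changes the short key
# but not the web key:
assert short_key_sketch(["A", "B"], ["C"], []) != short_key_sketch(["A"], ["B", "C"], [])
assert web_key_sketch(["A", "B"], ["C"], []) == web_key_sketch(["A"], ["B", "C"], [])
```

<p>This is why the Web-RInChIKey is the right tool for spotting reverse reactions or alternative routes, while the Short-RInChIKey is the right tool for exact-match deduplication.</p>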
<h2 id="experimental-validation-and-software-implementation">Experimental Validation and Software Implementation</h2>
<p>This infrastructure paper focuses on developing and validating the RInChI standard. The validation approach includes:</p>
<ul>
<li><strong>Software implementation</strong>: Development of the official RInChI software library capable of parsing reaction files and generating identifiers</li>
<li><strong>Format testing</strong>: Validation that the system correctly handles standard reaction file formats (<code>.RXN</code>, <code>.RD</code>)</li>
<li><strong>Consistency verification</strong>: Ensuring identical reactions produce identical RInChI strings regardless of input variations</li>
<li><strong>Key generation</strong>: Testing all three RInChIKey variants (Long, Short, Web) for different use cases</li>
<li><strong>Database integration</strong>: Demonstrating practical application in reaction database management</li>
</ul>
<h2 id="impact-on-chemical-database-analytics">Impact on Chemical Database Analytics</h2>
<h3 id="practical-applications">Practical Applications</h3>
<p>RInChI enables systematic organization and analysis of chemical reactions:</p>
<h4 id="database-management">Database Management</h4>
<p>RInChI enables systematic organization of reaction databases. You can:</p>
<ul>
<li>Automatically identify and merge duplicate reaction entries</li>
<li>Find all variations of a particular transformation</li>
<li>Link related reactions across different data sources</li>
</ul>
<h4 id="reaction-analysis">Reaction Analysis</h4>
<p>With standardized identifiers, you can perform large-scale analysis:</p>
<ul>
<li>Identify the most commonly used reagents or catalysts</li>
<li>Find cases where identical starting materials yield different products</li>
<li>Analyze reaction trends and patterns across entire databases</li>
</ul>
<h4 id="multi-step-synthesis-representation">Multi-Step Synthesis Representation</h4>
<p>RInChI can represent complex, multi-step syntheses as single combined identifiers, making it easier to analyze and compare different synthetic routes.</p>
<h4 id="research-integration">Research Integration</h4>
<p>The standard enables better collaboration by ensuring different research groups can generate identical identifiers for the same chemical processes, facilitating data sharing and literature analysis.</p>
<h3 id="limitations-and-considerations">Limitations and Considerations</h3>
<h4 id="what-gets-lost">What Gets Lost</h4>
<p>Since RInChI builds on the Standard InChI for individual molecules, it inherits certain limitations:</p>
<ul>
<li><strong>Tautomers</strong>: Different tautomeric forms are treated as identical</li>
<li><strong>Stereochemistry</strong>: Relative stereochemical relationships aren&rsquo;t captured</li>
<li><strong>Experimental conditions</strong>: Temperature, pressure, yield, and reaction time are intentionally excluded</li>
</ul>
<h4 id="the-trade-off">The Trade-off</h4>
<p>This is an intentional feature. By focusing on core chemical identity, RInChI achieves its primary goal: ensuring that different researchers working on the same fundamental transformation generate the same identifier.</p>
<h3 id="implementation-and-tools">Implementation and Tools</h3>
<h4 id="official-software">Official Software</h4>
<p>The RInChI software, available from the InChI Trust, handles the practical details:</p>
<ul>
<li>Accepts standard reaction file formats (<code>.RXN</code>, <code>.RD</code>)</li>
<li>Generates RInChI strings, all three RInChIKey variants, and auxiliary information</li>
<li>Automates the complex process of creating consistent identifiers</li>
</ul>
<h4 id="rauxinfo-preserving-visual-information">RAuxInfo: Preserving Visual Information</h4>
<p>While RInChI discards graphical information (atom coordinates, drawing layout), the software can generate supplementary &ldquo;RAuxInfo&rdquo; strings that preserve this data. This allows reconstruction of the original visual representation when needed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>RInChI development continues to evolve:</p>
<ul>
<li><strong>Integration</strong>: Plans for compatibility with other emerging standards like <a href="/notes/computational-chemistry/molecular-representations/mixfile-minchi/">MInChI for chemical mixtures</a></li>
<li><strong>Extended applications</strong>: Work on representing complex, multi-component reaction systems</li>
<li><strong>Software development</strong>: Tools for generating graphical representations directly from RInChI without auxiliary information</li>
</ul>
<h3 id="key-takeaways">Key Takeaways</h3>
<ol>
<li>
<p><strong>Filling a critical gap</strong>: RInChI provides the first standardized way to uniquely identify chemical reactions, solving a fundamental problem in chemical informatics.</p>
</li>
<li>
<p><strong>Focus on essential chemistry</strong>: By excluding experimental variables, RInChI achieves consistent identification of core chemical transformations.</p>
</li>
<li>
<p><strong>Flexible searching</strong>: Multiple RInChIKey formats enable different types of database queries, from exact matching to similarity searching.</p>
</li>
<li>
<p><strong>Practical implementation</strong>: Official software tools make RInChI generation accessible to working chemists and database managers.</p>
</li>
<li>
<p><strong>Foundation for analysis</strong>: Standardized reaction identifiers enable large-scale analysis of chemical databases and systematic study of reaction patterns.</p>
</li>
</ol>
<p>RInChI represents a significant step toward making reaction data as standardized and machine-readable as molecular data has become with formats like SMILES and InChI.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grethe, G., Blanke, G., Kraut, H., &amp; Goodman, J. M. (2018). International chemical identifier for reactions (RInChI). <em>Journal of Cheminformatics</em>, <em>10</em>(1), 22. <a href="https://doi.org/10.1186/s13321-018-0277-8">https://doi.org/10.1186/s13321-018-0277-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2018)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Grethe2018,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{International chemical identifier for reactions (RInChI)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grethe, Guenter and Blanke, Gerd and Kraut, Hans and Goodman, Jonathan M}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-018-0277-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Recent Advances in the SELFIES Library (2023)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/selfies-2023/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/selfies-2023/</guid><description>Major updates to the SELFIES library, improved performance, expanded chemistry support, and new customization features.</description><content:encoded><![CDATA[<h2 id="selfies-as-a-production-ready-resource">SELFIES as a Production-Ready Resource</h2>
<p>This software update paper documents major improvements to the SELFIES Python library, establishing it as a production-ready computational tool.</p>
<h2 id="limitations-in-the-original-selfies-implementation">Limitations in the Original SELFIES Implementation</h2>
<p>While the <a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/">original SELFIES concept</a> was promising, the initial 2019 implementation had critical limitations that prevented widespread adoption:</p>
<ol>
<li><strong>Performance</strong>: Too slow for production ML workflows</li>
<li><strong>Limited chemistry</strong>: Couldn&rsquo;t represent aromatic molecules, stereochemistry, or many other important chemical features</li>
<li><strong>Poor usability</strong>: Lacked user-friendly APIs for common tasks</li>
</ol>
<p>These barriers meant that despite SELFIES&rsquo; theoretical advantages (100% validity guarantee), researchers couldn&rsquo;t practically use it for real-world applications like drug discovery or materials science.</p>
<h2 id="architectural-refactoring-and-new-ml-integrations">Architectural Refactoring and New ML Integrations</h2>
<p>The 2023 update refactors the underlying SELFIES engine, establishing it as a production-ready repository. The key updates include:</p>
<ol>
<li>
<p><strong>Graph-Based Processing</strong>: The system replaces string-based operations with directed molecular graphs internally, significantly increasing execution speed and extensibility.</p>
</li>
<li>
<p><strong>Expanded Canonical Forms</strong>: Standardizes support for aromatic systems, stereochemistry, charged species, and isotopic data, effectively overlapping with the chemical diversity coverage of SMILES while preserving robust validity guarantees.</p>
</li>
<li>
<p><strong>Semantic Constraint API</strong>: Introduces the <code>set_semantic_constraints()</code> function, allowing specification of custom valence definitions useful for theoretical studies or hypervalent states.</p>
</li>
<li>
<p><strong>Deep Learning Hooks</strong>: Integrates tokenization, length estimation, and attribution utilities specifically engineered to plug directly into neural network encoders and decoders.</p>
</li>
</ol>
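<p>As a dependency-free illustration of why SELFIES strings suit these deep-learning hooks: every symbol is bracket-delimited, so tokenization needs no chemistry-aware parsing. The library ships its own utility for this; the regex below is a hypothetical stand-in:</p>

```python
import re

# Hypothetical stand-in for the library's own tokenization utility.
_SYMBOL = re.compile(r"\[[^\]]*\]|\.")  # bracketed symbols plus the '.' separator

def split_selfies_sketch(selfies_string):
    """Split a SELFIES string into its symbols. Unlike SMILES, where
    two-letter elements and ring digits complicate tokenization, SELFIES
    symbols are unambiguous bracketed tokens."""
    return _SYMBOL.findall(selfies_string)

print(split_selfies_sketch("[C][=C][F]"))  # ['[C]', '[=C]', '[F]']
```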
<h2 id="performance-benchmarks--validity-testing">Performance Benchmarks &amp; Validity Testing</h2>
<p>The authors validated the library through several benchmarks:</p>
<p><strong>Performance testing</strong>: Roundtrip conversion (SMILES → SELFIES → SMILES) on 300,000+ molecules from the DTP dataset completed in 252 seconds, orders of magnitude faster than the previous string-based implementation and fast enough for production ML workflows.</p>
<p><strong>Chemical coverage</strong>: Tested on diverse molecular databases to verify the library handles most chemical diversity found in resources like PubChem.</p>
<p><strong>Validity guarantee</strong>: Demonstrated that random SELFIES strings always decode to valid molecules, with controllable size distributions through symbol filtering.</p>
<p><strong>Attribution system</strong>: Showed both encoder and decoder can track which input symbols produce which output symbols, useful for property alignment.</p>
<h2 id="future-trajectories-for-general-chemical-representations">Future Trajectories for General Chemical Representations</h2>
<p>The 2023 update successfully addresses the main adoption barriers:</p>
<ol>
<li><strong>Fast enough</strong> for large-scale ML applications (300K molecules in ~4 minutes)</li>
<li><strong>Chemically comprehensive</strong> enough for drug discovery and materials science</li>
<li><strong>User-friendly</strong> enough for seamless integration into existing workflows</li>
</ol>
<p>The validity guarantee, SELFIES&rsquo; core advantage, is now practically accessible for real-world research. The roadmap includes future extensions for polymers, crystals, chemical reactions, and non-covalent interactions, which would expand SELFIES&rsquo; applicability beyond small-molecule chemistry.</p>
<p><strong>Limitations acknowledged</strong>: The paper focuses on implementation improvements. Some advanced chemical systems (polymers, crystals) still need future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="code">Code</h3>
<p>The <code>selfies</code> library is completely open-source and written in pure Python. It requires no extra dependencies and is available on GitHub, installable via <code>pip install selfies</code>. The repository includes testing suites (<code>tox</code>) and example benchmarking scripts to reproduce the translation speeds reported in the paper.</p>
<h3 id="hardware">Hardware</h3>
<p>Performance benchmarks (e.g., the 252-second roundtrip conversion on 300K molecules) were executed on Google Colaboratory using two 2.20GHz Intel Xeon CPUs.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="technical-specification-the-grammar">Technical Specification: The Grammar</h4>
<p>The core innovation of SELFIES is a <strong>Context-Free Grammar (CFG) augmented with state-machine logic</strong> to ensure that every derived string represents a valid molecule. While the software features are important, understanding the underlying derivation rules is essential for replication or extension of the system.</p>
<p><strong>1. Derivation Rules: The Atom State Machine</strong></p>
<p>The fundamental mechanism that guarantees validity is a <strong>state machine</strong> that tracks the remaining valence of the most recently added atom:</p>
<ul>
<li><strong>State Tracking</strong>: The derivation maintains a non-terminal state $X_l$, where $l$ represents the current atom&rsquo;s remaining valence (number of bonds it can still form)</li>
<li><strong>Standard Derivation</strong>: An atom symbol $[\beta \alpha]$ (bond order + atom type) transitions the state from $S$ (start) to $X_l$, where $l$ is calculated from the atom&rsquo;s standard valence minus the incoming bond order</li>
<li><strong>Bond Demotion (The Key Rule)</strong>: If a requested bond order $\beta$ exceeds the available valence $l$ of the previous atom, the bond is automatically demoted to $\min(l, d(\beta))$ to prevent invalid valency. This automatic adjustment is the mathematical core of the validity guarantee.</li>
</ul>
<p>This state machine ensures that no atom ever exceeds its allowed valence, making it impossible to generate chemically invalid structures.</p>
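<p>A minimal sketch of this state machine, assuming linear chains and the default valences (the real grammar also covers branches, rings, and charged or isotopic atoms):</p>

```python
def derive_with_demotion(symbols, valences=None):
    """Toy version of the SELFIES derivation state machine (chains only).

    Each symbol is (requested_bond_order, element). The state l tracks the
    remaining valence of the most recently added atom; a requested bond order
    that exceeds the available valence is silently demoted, so no derivation
    can ever violate a valence constraint.
    """
    if valences is None:
        valences = {"C": 4, "N": 3, "O": 2, "F": 1}  # default constraints
    bonds = []
    l = None  # non-terminal state X_l; None stands for the start state S
    for requested, elem in symbols:
        if l is None:
            order = 0                # the first atom forms no incoming bond
        elif l == 0:
            break                    # state X_0: previous atom is saturated
        else:
            # The demotion rule: cap by the previous atom's remaining valence
            # (and, in the full grammar, by the new atom's own valence).
            order = min(requested, l, valences[elem])
            bonds.append(order)
        l = valences[elem] - order   # remaining valence of the new atom
    return bonds

# A requested triple bond to oxygen (valence 2) is demoted to a double bond:
print(derive_with_demotion([(0, "C"), (3, "O"), (1, "C")]))  # [2]
```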
<p><strong>2. Control Symbols: Branches and Rings</strong></p>
<p>Branch length calculation: SELFIES uses a <strong>hexadecimal encoding</strong> to determine branch lengths. If a branch symbol is followed by index symbols $c_1 \dots c_k$, the number of symbols $N$ to include in the branch is calculated as:</p>
<p>$$
N = 1 + \sum_{i=1}^{k} 16^{\,k-i} \, c_i
$$</p>
<p>This formula interprets subsequent symbols as base-16 integers, allowing compact representation of arbitrarily long branches.</p>
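<p>A hypothetical decoder for this rule, treating each $c_i \in [0, 15]$ as one base-16 digit:</p>

```python
def branch_length(index_symbols):
    """Decode k index symbols (each an integer in 0..15) into a branch
    length via N = 1 + sum_i 16^(k-i) * c_i."""
    idx = 0
    for c in index_symbols:
        idx = idx * 16 + c  # accumulate base-16 digits
    return 1 + idx

print(branch_length([2]))     # 3: a short branch of three symbols
print(branch_length([1, 0]))  # 17: 0x10 + 1
```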
<p>Ring closure queue system: Ring formation uses a <strong>lazy evaluation</strong> strategy to maintain validity. Ring symbols don&rsquo;t create bonds immediately; instead, they push &ldquo;closure candidates&rdquo; into a queue $R$. These candidates are resolved <em>after</em> the main derivation completes. A ring closure candidate is <strong>rejected</strong> if either atom has no remaining valence ($m = 0$), or the closure would create a self-loop (same atom). This deferred validation prevents the creation of invalid ring structures while still allowing the grammar to remain context-free during the main derivation.</p>
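<p>The deferred resolution pass can be sketched as a single sweep over the queued candidates (a simplification: the real algorithm also carries a bond order for each closure):</p>

```python
def resolve_ring_closures(candidates, remaining_valence):
    """Resolve queued ring-closure candidates (i, j) after the main
    derivation completes; single bonds only in this sketch."""
    bonds = []
    for i, j in candidates:
        if i == j:
            continue  # self-loop: rejected
        if remaining_valence[i] == 0 or remaining_valence[j] == 0:
            continue  # a saturated atom cannot close a ring: rejected
        remaining_valence[i] -= 1
        remaining_valence[j] -= 1
        bonds.append((i, j))
    return bonds

# Atom 0 has one free bond, so only the first candidate succeeds:
print(resolve_ring_closures([(0, 2), (1, 1), (0, 2)], {0: 1, 1: 3, 2: 2}))
# [(0, 2)]
```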
<p><strong>3. Symbol Structure and Standardization</strong></p>
<p>SELFIES enforces a strict, standardized format for atom symbols to eliminate ambiguity:</p>
<ul>
<li><strong>Canonical Format</strong>: Atom symbols follow the structure <code>[Bond, Isotope, Element, Chirality, H-count, Charge]</code></li>
<li><strong>No Variation</strong>: There is only one way to write each symbol (e.g., <code>[Fe++]</code> and <code>[Fe+2]</code> are standardized to a single form)</li>
<li><strong>Order Matters</strong>: The components must appear in the specified order</li>
</ul>
<p><strong>4. Default Semantic Constraints</strong></p>
<p>By default, the library enforces standard organic chemistry valence rules:</p>
<ul>
<li><strong>Typical Valences</strong>: C=4, N=3, O=2, F=1</li>
<li><strong>Element Coverage</strong>: Supports most elements relevant to organic and medicinal chemistry</li>
<li><strong>Customizable</strong>: These constraints can be modified via <code>set_semantic_constraints()</code> for specialized applications (hypervalent compounds, theoretical studies, etc.)</li>
</ul>
<p>The combination of these grammar rules with the state machine ensures that <strong>every valid SELFIES string decodes to a chemically valid molecule</strong>, regardless of how the string was generated (random, ML model output, manual construction, etc.).</p>
<h3 id="data">Data</h3>
<p><strong>Benchmark dataset</strong>: DTP (Developmental Therapeutics Program) dataset with 300,000+ molecules used for performance testing.</p>
<p><strong>Chemical coverage testing</strong>: Diverse molecular databases including PubChem to verify the library handles broad chemical diversity (aromatic systems, stereochemistry, charged species, isotopes).</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Performance metric</strong>: Roundtrip conversion time (SMILES → SELFIES → SMILES) is 252 seconds for 300K molecules.</p>
<p><strong>Validity testing</strong>: Random SELFIES string generation with decoding verification shows 100% of generated strings decode to valid molecules.</p>
<p><strong>Attribution system</strong>: Encoder and decoder track which input symbols produce which output symbols, tested for property alignment between SMILES and SELFIES representations.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <em>Digital Discovery</em>, <em>2</em>(4), 897-908. <a href="https://doi.org/10.1039/D3DD00044C">https://doi.org/10.1039/D3DD00044C</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lo2023recent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recent advances in the self-referencing embedded strings (SELFIES) library}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lo, Alston and Pollice, Robert and Nigam, AkshatKumar and White, Andrew D and Krenn, Mario and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{897--908}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00044C}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
<li><a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/">Original SELFIES Paper (2020)</a></li>
<li><a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES Format Overview</a></li>
</ul>
]]></content:encoded></item><item><title>NInChI: Toward a Chemical Identifier for Nanomaterials</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/ninchi-alpha/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/ninchi-alpha/</guid><description>NInChI (Nanomaterials InChI) extends chemical identifiers to represent complex, multi-component nanomaterials.</description><content:encoded><![CDATA[<h2 id="a-new-standard-for-nanoinformatics">A New Standard for Nanoinformatics</h2>
<p>This is a <strong>Theory/Systematization paper</strong> that proposes a new <strong>Resource standard</strong>: the NInChI. It addresses a fundamental limitation in nanoinformatics. The result of a collaborative process (a &ldquo;Request for Comments&rdquo; or &ldquo;RFC&rdquo;), this work uses <strong>six detailed case studies</strong> to systematically develop a <strong>hierarchical, machine-readable notation</strong> for complex nanomaterials that could work across experimental research, regulatory frameworks, and computational modeling.</p>
<h2 id="the-breakdown-of-traditional-chemical-identifiers">The Breakdown of Traditional Chemical Identifiers</h2>
<p>Chemoinformatics has fantastic tools for representing small molecules: SMILES strings, InChI identifiers, and standardized databases that make molecular data searchable and shareable. But when you step into nanotechnology, everything breaks down.</p>
<p>Consider trying to describe a gold nanoparticle with a silica shell and organic surface ligands. How do you capture:</p>
<ul>
<li>The gold core composition and size</li>
<li>The silica shell thickness and interface</li>
<li>The surface chemistry and ligand density</li>
<li>The overall shape and morphology</li>
</ul>
<p>There&rsquo;s simply no standardized way to represent this complexity in a machine-readable format. This creates massive problems for:</p>
<ul>
<li><strong>Data sharing</strong> between research groups</li>
<li><strong>Regulatory assessment</strong> where precise identification matters</li>
<li><strong>Computational modeling</strong> that needs structured input</li>
<li><strong>Database development</strong> and search capabilities</li>
</ul>
<p>Without a standard notation, nanomaterials research suffers from the same data fragmentation that plagued small molecule chemistry before SMILES existed.</p>
<h2 id="the-five-tier-ninchi-hierarchy">The Five-Tier NInChI Hierarchy</h2>
<p>The authors propose NInChI (Nanomaterials InChI), a layered extension to the existing InChI system. The key insight is to organize the nanomaterial description from the inside out, like the layers of an onion, using a five-tier hierarchy:</p>
<ol>
<li><strong>Tier 1: Chemical Core</strong>: What&rsquo;s the fundamental composition? Gold? Silver? Carbon nanotube?</li>
<li><strong>Tier 2: Morphology</strong>: What shape and size? Spherical nanoparticle? Rod? 2D sheet?</li>
<li><strong>Tier 3: Surface Properties</strong>: What&rsquo;s the surface like? Charge, roughness, hydrophobicity?</li>
<li><strong>Tier 4: Surface Chemistry</strong>: How are things attached? Covalent bonds? Physical adsorption?</li>
<li><strong>Tier 5: Surface Ligands</strong>: What molecules are on the surface and how many?</li>
</ol>
<p>This hierarchy captures the essential information needed to distinguish between different nanomaterials while building on familiar chemical concepts.</p>
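<p>As a purely hypothetical illustration of what the five tiers capture, one might model them as a structured record. Note that the actual NInChI is a layered string notation, and these field names and values are invented for this sketch:</p>

```python
from dataclasses import dataclass, field

@dataclass
class NInChITiers:
    """Hypothetical record of the five-tier hierarchy (illustrative only;
    field names do not follow the proposed notation)."""
    chemical_core: str        # Tier 1: fundamental composition
    morphology: str           # Tier 2: shape and size
    surface_properties: str   # Tier 3: charge, roughness, hydrophobicity
    surface_chemistry: str    # Tier 4: attachment mode
    surface_ligands: list = field(default_factory=list)  # Tier 5

# An invented example: a ligand-coated gold nanoparticle.
particle = NInChITiers(
    chemical_core="Au",
    morphology="sphere, 15 nm",
    surface_properties="hydrophilic, -30 mV",
    surface_chemistry="covalent (thiol)",
    surface_ligands=["PEG-thiol"],
)
```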
<h2 id="testing-the-standard-six-case-studies">Testing the Standard: Six Case Studies</h2>
<p>The authors tested their concept against six real-world case studies to identify what actually matters in practice.</p>
<p><strong>Case Study 1: Gold Nanoparticles</strong></p>
<p>Gold NPs provided a relatively simple test case: an inert metallic core with various surface functionalizations. Key insights: core composition and size are essential, surface chemistry (what molecules are attached) matters critically, shape can dramatically affect properties, and dynamic properties like protein corona formation belong outside the intrinsic NInChI representation. This established the boundary: NInChI should capture intrinsic, stable properties.</p>
<p><strong>Case Study 2: Carbon Nanomaterials</strong></p>
<p>Carbon nanotubes and graphene introduced additional complexity: dimensionality (1D tubes vs 2D sheets vs 0D fullerenes), chirality (the (n,m) vector that defines a nanotube&rsquo;s structure), defects and impurities that can dramatically alter properties, and layer count for 2D materials. This case showed that the notation needed to handle both topological complexity and chemical composition.</p>
<p><strong>Case Study 3: Complex Engineered Materials</strong></p>
<p>Doped materials, alloys, and core-shell structures revealed key requirements: the notation must distinguish true alloys (homogeneous mixing) from core-shell structures with the same overall composition, crystal structure information becomes crucial, and component ratios must be precisely specified.</p>
<p><strong>Case Study 4: Database Applications</strong></p>
<p>The FAIR (Findable, Accessible, Interoperable, Reusable) principles guided this analysis. NInChI addresses real database problems: it provides greater specificity than CAS numbers (which lack nanoform distinction), offers a systematic alternative to ad-hoc naming schemes, and enables machine-searchability.</p>
<p><strong>Case Study 5: Computational Modeling</strong></p>
<p>This revealed exciting possibilities: automated descriptor generation from NInChI structure, read-across predictions for untested materials, and model input preparation from standardized notation. The layered structure provides exactly the kind of structured input that computational tools need.</p>
<p><strong>Case Study 6: Regulatory Applications</strong></p>
<p>Under frameworks like REACH, regulators need to distinguish between different &ldquo;nanoforms&rdquo;, which are materials with the same chemical composition but different sizes, shapes, or surface treatments. NInChI directly addresses this by encoding the specific properties that define regulatory categories, providing precision sufficient for legal definitions and risk assessment frameworks.</p>
<h2 id="the-ninchi-alpha-specification-in-practice">The NInChI Alpha Specification in Practice</h2>
<p>Synthesizing insights from all six case studies, the authors propose the <strong>NInChI alpha specification</strong>, a three-layer structure:</p>
<p><strong>Layer 1 (Version Information)</strong>: Standard header indicating the NInChI version and specification level, similar to how regular <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> strings start with <code>InChI=1S/</code>.</p>
<p><strong>Layer 2 (Composition)</strong>: Each component (core, shell, ligands) gets described using standard InChI for chemical composition where possible, morphology sublayers for size and shape information, crystal structure data when relevant, and spatial arrangement details.</p>
<p><strong>Layer 3 (Arrangement)</strong>: Specifies how all components from Layer 2 fit together, essentially the &ldquo;assembly instructions&rdquo; for the nanomaterial, describing the structure from inside-out with core-shell arrangements, surface attachment modes, and hierarchical organization.</p>
<p>The paper provides concrete examples: a 20 nm silica sphere with 2 nm gold shell (simple core-shell structure), a CTAB-capped gold nanoparticle (surface ligand representation), and a chiral single-walled carbon nanotube (topological information encoding). These examples reveal both the power and complexity: the notation can represent sophisticated structures, and the strings get quite detailed. For instance, a basic alpha NInChI representation for a nanoparticle might follow a syntax pattern indicating the core composition and subsequent layers:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>InChI=1S/nano/1/Au/2/shp:sph/sz:20nm/3/lig:CTAB
</span></span></code></pre></div><p><em>(Note: This is a stylized representation of the alpha syntax to demonstrate the hierarchical layer-based structure.)</em></p>
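<p>As a toy illustration of that layered, inside-out assembly (the function name and layer codes below are hypothetical, mirroring only the stylized syntax above, not an official generator):</p>

```python
# Hypothetical sketch: assembling a stylized alpha-NInChI string from tier data.
# The field names and layer codes are illustrative, not part of the spec.

def build_ninchi(core, shape=None, size=None, ligand=None):
    """Join inside-out tiers into a layered, slash-delimited string."""
    layers = ["InChI=1S/nano", f"1/{core}"]          # Tier 1: chemical core
    tier2 = []                                        # Tier 2: morphology
    if shape:
        tier2.append(f"shp:{shape}")
    if size:
        tier2.append(f"sz:{size}")
    if tier2:
        layers.append("2/" + "/".join(tier2))
    if ligand:                                        # Tier 5: surface ligands
        layers.append(f"3/lig:{ligand}")
    return "/".join(layers)

print(build_ninchi("Au", shape="sph", size="20nm", ligand="CTAB"))
# InChI=1S/nano/1/Au/2/shp:sph/sz:20nm/3/lig:CTAB
```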
<p><strong>Implementation</strong>: The authors built a prototype web interface for generating NInChI strings, providing a user-friendly interface, real-time generation, community feedback mechanism, and testing platform.</p>
<p><strong>Limitations</strong>: The alpha version acknowledges areas for future development: the scope currently focuses on essential properties, needs broader community testing and refinement, requires implementation in existing nanoinformatics tools, and must undergo formal standardization processes.</p>
<p><strong>Impact</strong>: For researchers, NInChI could transform how nanomaterials data gets shared through precise structural queries. For regulators, it provides systematic identification for risk assessment. For industry, standardized notation enables better inventory management, quality control, and automated safety assessment.</p>
<p><strong>Next Steps</strong>: Community adoption, software implementation in existing tools, formal standardization through IUPAC or similar bodies, and extension beyond the alpha scope.</p>
<p><strong>Key Conclusions</strong>: The work demonstrates that creating systematic notation for nanomaterials is a complex but feasible task. The hierarchical, inside-out organization provides an intuitive approach. Testing against real scenarios identified essential features. By extending InChI, the work leverages existing infrastructure. Success will depend on community adoption and continued refinement based on real-world usage.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The paper is fully open-access under the CC BY 4.0 license, allowing for straightforward reading and analysis.</li>
<li><strong>Tools &amp; Code</strong>: The authors provided a prototype graphical web interface for generating basic NInChI strings available through the <a href="http://enaloscloud.novamechanics.com/nanocommons/NInChI/">Enalos Cloud Platform</a>. However, the underlying backend code for generating these representations was not released as an open-source library, limiting automated programmatic implementation.</li>
<li><strong>Documentation</strong>: The paper serves as an alpha specification Request for Comments (RFC). As a conceptual framework prototype, no formal algorithmic pseudocode for automated string parsing or generation from structured nanomaterials files (like <code>.cif</code>) is provided.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lynch, I., Afantitis, A., Exner, T., et al. (2020). Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies? <em>Nanomaterials</em>, <em>10</em>(12), 2493. <a href="https://doi.org/10.3390/nano10122493">https://doi.org/10.3390/nano10122493</a></p>
<p><strong>Publication</strong>: Nanomaterials (2020)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lynch2020inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Can an InChI for Nano Address the Need for a Simplified Representation of Complex Nanomaterials across Experimental and Nanoinformatics Studies?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lynch, Iseult and Afantitis, Antreas and Exner, Thomas and others}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nanomaterials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2493}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{MDPI}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3390/nano10122493}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Mixfile &amp; MInChI: Machine-Readable Mixture Formats</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/mixfile-minchi/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/mixfile-minchi/</guid><description>Mixfile and MInChI provide the first standardized, machine-readable formats for representing chemical mixtures.</description><content:encoded><![CDATA[<h2 id="a-standardized-resource-for-chemical-mixtures">A Standardized Resource for Chemical Mixtures</h2>
<p>This is a <strong>Resource</strong> and <strong>Method</strong> paper that introduces two complementary standards for representing chemical mixtures: the detailed <strong>Mixfile</strong> format for comprehensive mixture descriptions and the compact <strong>MInChI</strong> (Mixtures InChI) specification for canonical mixture identifiers.</p>
<h2 id="the-missing-format-for-complex-formulations">The Missing Format for Complex Formulations</h2>
<p>Here&rsquo;s a fundamental gap in chemical informatics: Current standards excel at representing pure individual molecules (SMILES, InChI, Molfile). A corresponding standard for multi-component mixtures remains an open challenge. This is a major problem because real-world chemistry predominantly relies on complex mixtures.</p>
<p>Think about everyday chemical work: you frequently deal with:</p>
<ul>
<li>Reagents with specified purity (e.g., &ldquo;$\geq$ 97% pure&rdquo;)</li>
<li>Solutions and formulations</li>
<li>Complex mixtures like &ldquo;hexanes&rdquo; (which contains multiple isomers)</li>
<li>Drug formulations with active ingredients and excipients</li>
</ul>
<p>Without a machine-readable standard, chemists are forced to describe these mixtures in plain text that software can&rsquo;t parse or analyze systematically. This creates barriers for automated safety analysis, inventory management, and data sharing.</p>
<h2 id="dual-design-comprehensive-mixfiles-and-canonical-minchis">Dual Design: Comprehensive Mixfiles and Canonical MInChIs</h2>
<p>The authors propose a two-part solution:</p>
<ol>
<li><strong>Mixfile</strong>: A detailed, hierarchical JSON format that captures the complete composition of a mixture</li>
<li><strong>MInChI</strong>: A compact, canonical string identifier derived from Mixfile data</li>
</ol>
<p>This dual approach gives you both comprehensive description (Mixfile) and simple identification (MInChI), similar to how you might have both a detailed recipe and a short name for a dish.</p>
<h3 id="what-makes-a-good-mixture-format">What Makes a Good Mixture Format?</h3>
<p>The authors identify three essential properties any mixture format must capture:</p>
<ol>
<li><strong>Compound</strong>: What molecules are present?</li>
<li><strong>Quantity</strong>: How much of each component?</li>
<li><strong>Hierarchy</strong>: How are components organized (e.g., mixtures-of-mixtures)?</li>
</ol>
<p>The hierarchical aspect is crucial. Consider &ldquo;hexanes&rdquo;: it is a named mixture containing specific proportions of n-hexane, 2-methylpentane, 3-methylpentane, etc. Your mixture format needs to represent both the individual isomers and the fact that they&rsquo;re grouped under the umbrella term &ldquo;hexanes.&rdquo;</p>
<h3 id="mixfile-format-details">Mixfile Format Details</h3>
<p>Mixfile uses JSON as its foundation, making it both human-readable and easy to parse in modern programming languages. The core structure is a hierarchical tree where each component can contain:</p>
<ul>
<li><strong>name</strong>: Component identifier</li>
<li><strong>molfile/smiles/inchi</strong>: Molecular structure (molfile is the primary source of truth)</li>
<li><strong>quantity/units/relation</strong>: Concentration data with optional relation operators</li>
<li><strong>contents</strong>: Array of sub-components for hierarchical mixtures</li>
<li><strong>identifiers</strong>: Database IDs or URLs for additional information</li>
</ul>
<h4 id="simple-example">Simple Example</h4>
<p>A basic Mixfile might look like:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Acetone, ≥99%&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;acetone&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">99</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;relation&#34;</span>: <span style="color:#e6db74">&#34;&gt;=&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Note that the paper specifies distinct fields for molecular structures: <code>molfile</code> (the primary source of truth), <code>smiles</code>, <code>inchi</code>, and <code>formula</code>. Concentration data uses separate <code>quantity</code>, <code>units</code>, and <code>relation</code> fields.</p>
<h4 id="complex-example-mixture-of-mixtures">Complex Example: Mixture-of-Mixtures</h4>
<p>For something like &ldquo;ethyl acetate dissolved in hexanes,&rdquo; you&rsquo;d have:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;Ethyl acetate in hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;ethyl acetate&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCOC(=O)C&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;hexanes&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;contents&#34;</span>: [
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;n-hexane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CCCCCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">60</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;2-methylpentane&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;smiles&#34;</span>: <span style="color:#e6db74">&#34;CC(C)CCC&#34;</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;quantity&#34;</span>: <span style="color:#ae81ff">25</span>,
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">&#34;units&#34;</span>: <span style="color:#e6db74">&#34;%&#34;</span>
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>      ]
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>This hierarchical structure elegantly captures the &ldquo;recipe&rdquo; of complex mixtures while remaining machine-readable.</p>
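<p>Because <code>contents</code> nests recursively, flattening a Mixfile into its leaf components is a short traversal. A minimal sketch (illustrative, not part of the official tooling):</p>

```python
# Minimal sketch: recursively flattening a Mixfile's hierarchical
# "contents" tree into its leaf components.

def leaf_components(node):
    """Yield every leaf component (a node without sub-contents) in a Mixfile tree."""
    children = node.get("contents", [])
    if not children:
        yield node
        return
    for child in children:
        yield from leaf_components(child)

mixfile = {
    "mixfileVersion": 0.01,
    "name": "Ethyl acetate in hexanes",
    "contents": [
        {"name": "ethyl acetate", "smiles": "CCOC(=O)C", "quantity": 10, "units": "%"},
        {"name": "hexanes", "contents": [
            {"name": "n-hexane", "smiles": "CCCCCC", "quantity": 60, "units": "%"},
            {"name": "2-methylpentane", "smiles": "CC(C)CCC", "quantity": 25, "units": "%"},
        ]},
    ],
}

print([c["name"] for c in leaf_components(mixfile)])
# ['ethyl acetate', 'n-hexane', '2-methylpentane']
```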
<h3 id="minchi-canonical-mixture-identifiers">MInChI: Canonical Mixture Identifiers</h3>
<p>While Mixfiles provide comprehensive descriptions, you also need simple identifiers for database storage and searching. This is where MInChI comes in.</p>
<p>A MInChI string is structured as:</p>
<pre><code>MInChI=0.00.1S/&lt;components&gt;/&lt;indexing&gt;/&lt;concentration&gt;
</code></pre>
<ul>
<li><strong>Header</strong>: Version information (<code>0.00.1S</code> in the paper&rsquo;s specification)</li>
<li><strong>Components</strong>: Standard InChI for each unique molecule, sorted alphabetically <em>by the InChI strings themselves</em>, then concatenated with <code>&amp;</code></li>
<li><strong>Indexing</strong>: Hierarchical structure using curly braces <code>{}</code> for branches and <code>&amp;</code> for adjacent nodes; uses 1-based integer indices referring to the sorted InChI list</li>
<li><strong>Concentration</strong>: Quantitative information for each component, with units converted to canonical codes</li>
</ul>
<h4 id="why-this-matters">Why This Matters</h4>
<p>MInChI strings enable simple database searches:</p>
<ul>
<li>Check if a specific component appears in any mixture</li>
<li>Compare different formulations of the same product</li>
<li>Identify similar mixtures based on string similarity</li>
</ul>
<h2 id="validating-the-standard-through-practical-tooling">Validating the Standard Through Practical Tooling</h2>
<p>The paper demonstrates the format&rsquo;s capabilities through several practical applications and a proof-of-concept implementation:</p>
<h3 id="text-extraction-algorithm">Text Extraction Algorithm</h3>
<p>The authors demonstrate a proof-of-concept algorithm that uses regular expressions and chemical name recognition to parse plain-text mixture descriptions into structured Mixfile data. The algorithm:</p>
<ol>
<li>Applies regex rules to remove filler words and extract concentrations</li>
<li>Looks up cleaned names against a custom chemical database</li>
<li>Falls back to OPSIN for SMILES generation from chemical names</li>
<li>Generates 2D coordinates for molecular structures</li>
</ol>
<h3 id="graphical-editor">Graphical Editor</h3>
<p>An open-source editor provides:</p>
<ul>
<li>Drag-and-drop interface for building hierarchical structures</li>
<li>Chemical structure sketching and editing</li>
<li>Database lookup (e.g., PubChem integration)</li>
<li>Automatic MInChI generation</li>
<li>Import/export capabilities</li>
</ul>
<h3 id="example-use-cases">Example Use Cases</h3>
<p>The paper validates the format through real-world applications:</p>
<ul>
<li><strong>Safety compliance</strong>: Automated hazard assessment based on concentration-dependent properties (e.g., solid osmium tetroxide vs. 1% aqueous solution)</li>
<li><strong>Inventory management</strong>: Precise, searchable laboratory records</li>
<li><strong>Data extraction</strong>: Parsing vendor catalogs and safety data sheets</li>
</ul>
<h2 id="outcomes-and-future-extensibility">Outcomes and Future Extensibility</h2>
<p>The work successfully establishes the first standardized, machine-readable formats for chemical mixtures. Key achievements:</p>
<ul>
<li><strong>Comprehensive representation</strong>: Mixfile captures component identity, quantity, and hierarchy</li>
<li><strong>Canonical identification</strong>: MInChI provides compact, searchable identifiers</li>
<li><strong>Practical tooling</strong>: Open-source editor and text extraction demonstrate feasibility</li>
<li><strong>Real-world validation</strong>: Format handles diverse use cases from safety to inventory</li>
</ul>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The authors acknowledge areas for improvement:</p>
<ul>
<li><strong>Machine learning improvements</strong>: Better text extraction using modern NLP techniques</li>
<li><strong>Extended coverage</strong>: Support for polymers, complex formulations, analytical results</li>
<li><strong>Community adoption</strong>: Integration with existing chemical databases and software</li>
</ul>
<p>The hierarchical design makes Mixfile suitable for both &ldquo;recipe&rdquo; descriptions (how to make something) and analytical results (what was found). This flexibility should help drive adoption across different use cases in chemistry and materials science.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="open-source-tooling--data">Open Source Tooling &amp; Data</h3>
<p>While the central repository for validating and establishing the MInChI standard is <a href="https://github.com/IUPAC/MInChI">github.com/IUPAC/MInChI</a>, the tools and datasets actually used to develop the paper&rsquo;s proofs-of-concept are hosted elsewhere:</p>
<ul>
<li><strong>Graphical Editor &amp; App codebase</strong>: The Electron application and Mixfile handling codebase (<code>console.js</code>) can be found at <a href="https://github.com/cdd/mixtures">github.com/cdd/mixtures</a>.</li>
<li><strong>Text Extraction Data</strong>: The 5,615 extracted mixture records mentioned in the text extraction validation can be accessed as training data inside the <code>cdd/mixtures</code> repository under <a href="https://github.com/cdd/mixtures/tree/master/reference"><code>reference/gathering.zip</code></a>.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>This section provides the specific algorithmic logic, schema definitions, and standardization rules needed to replicate the Mixfile parser or MInChI generator.</p>
<h4 id="the-strict-mixfile-json-schema">The Strict Mixfile JSON Schema</h4>
<p>To implement the format, a parser must recognize these specific fields:</p>
<p><strong>Root Structure</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;mixfileVersion&#34;</span>: <span style="color:#ae81ff">0.01</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;header&#34;</span>: {},
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;contents&#34;</span>: []
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Component Fields</strong>:</p>
<ul>
<li><code>name</code>: string (required if no structure is provided)</li>
<li><code>molfile</code>: string (the primary source of truth for molecular structure)</li>
<li><code>smiles</code>, <code>inchi</code>, <code>formula</code>: derived/transient fields for convenience</li>
<li><code>quantity</code>: number OR <code>[min, max]</code> array for ranges</li>
<li><code>units</code>: string (must map to supported ontology)</li>
<li><code>relation</code>: string (e.g., <code>&quot;&gt;&quot;</code>, <code>&quot;~&quot;</code>, <code>&quot;&gt;=&quot;</code>)</li>
<li><code>contents</code>: recursive array for hierarchical mixtures</li>
</ul>
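<p>A parser enforcing these rules can be sketched as a recursive validator (an illustrative sketch assuming a particular set of relation operators; the official schema may differ):</p>

```python
# Illustrative validator for the component fields above (a sketch, not the
# official schema): checks the name/structure requirement, the quantity shape,
# and the relation operators described in the paper.

ALLOWED_RELATIONS = {">", "<", ">=", "<=", "~", "="}  # assumed set, for illustration

def validate_component(comp):
    errors = []
    has_structure = any(k in comp for k in ("molfile", "smiles", "inchi", "formula"))
    if not has_structure and "name" not in comp:
        errors.append("component needs a name when no structure is provided")
    q = comp.get("quantity")
    if q is not None:
        is_range = isinstance(q, list) and len(q) == 2
        if not (isinstance(q, (int, float)) or is_range):
            errors.append("quantity must be a number or a [min, max] pair")
    rel = comp.get("relation")
    if rel is not None and rel not in ALLOWED_RELATIONS:
        errors.append(f"unknown relation operator: {rel!r}")
    for child in comp.get("contents", []):   # recurse into sub-mixtures
        errors.extend(validate_component(child))
    return errors

assert validate_component({"name": "acetone", "smiles": "CC(=O)C",
                           "quantity": 99, "units": "%", "relation": ">="}) == []
assert validate_component({"quantity": "lots"}) != []
```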
<h4 id="minchi-generation-algorithm">MInChI Generation Algorithm</h4>
<p>To generate <code>MInChI=0.00.1S/...</code>, the software must follow these steps:</p>
<ol>
<li>
<p><strong>Component Layer</strong>:</p>
<ul>
<li>Calculate standard <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> for all structures in the mixture</li>
<li>Sort distinct InChIs alphabetically by the InChI string itself</li>
<li>Join with <code>&amp;</code> to form the structure layer</li>
</ul>
</li>
<li>
<p><strong>Hierarchy &amp; Concentration Layers</strong>:</p>
<ul>
<li>Traverse the Mixfile tree recursively</li>
<li><strong>Indexing</strong>: Use integer indices (1-based) referring to the sorted InChI list</li>
<li><strong>Grouping</strong>: Use <code>{}</code> to denote hierarchy branches and <code>&amp;</code> to separate nodes at the same level</li>
<li><strong>Concentration</strong>: Convert all quantities to canonical unit codes and apply scaling factors</li>
</ul>
</li>
</ol>
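<p>The sorting, 1-based indexing, and brace grouping can be sketched as follows (placeholder strings stand in for real standard InChIs, and the rendered syntax is an approximation of the spec, not the official canonicalizer):</p>

```python
# Sketch of steps 1-2 above: sort distinct InChIs, assign 1-based indices,
# then render the hierarchy with {} for branches and & between siblings.

def component_and_index_layers(tree):
    """tree: nested lists mirroring the Mixfile hierarchy; leaves are InChI strings."""
    leaves = []
    def collect(node):
        if isinstance(node, str):
            leaves.append(node)
        else:
            for child in node:
                collect(child)
    collect(tree)
    ordered = sorted(set(leaves))                      # alphabetical, distinct
    index = {s: i + 1 for i, s in enumerate(ordered)}  # 1-based indices

    def render(node):
        if isinstance(node, str):
            return str(index[node])
        return "{" + "&".join(render(child) for child in node) + "}"

    return "&".join(ordered), render(tree)

# e.g. a solute plus a two-component solvent sub-mixture
components, indexing = component_and_index_layers(["InChI-A", ["InChI-C", "InChI-B"]])
print(components)  # InChI-A&InChI-B&InChI-C
print(indexing)    # {1&{3&2}}
```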
<h4 id="unit-standardization-table">Unit Standardization Table</h4>
<p>Replication requires mapping input units to these canonical MInChI codes:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Input Unit</th>
          <th style="text-align: left">MInChI Code</th>
          <th style="text-align: left">Scale Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">%</td>
          <td style="text-align: left">pp</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">w/w%</td>
          <td style="text-align: left">wf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">v/v%</td>
          <td style="text-align: left">vf</td>
          <td style="text-align: left">0.01</td>
      </tr>
      <tr>
          <td style="text-align: left">mol/L (M)</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">1</td>
      </tr>
      <tr>
          <td style="text-align: left">mmol/L</td>
          <td style="text-align: left">mr</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
      <tr>
          <td style="text-align: left">g/L</td>
          <td style="text-align: left">wv</td>
          <td style="text-align: left">$10^{-3}$</td>
      </tr>
  </tbody>
</table>
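<p>Expressed as a lookup table in code (a sketch whose coverage is limited to the units listed above):</p>

```python
# The standardization table above as a lookup: maps an input unit to its
# canonical MInChI code and scale factor, then rescales the quantity.

UNIT_MAP = {
    "%":      ("pp", 1),
    "w/w%":   ("wf", 0.01),
    "v/v%":   ("vf", 0.01),
    "mol/L":  ("mr", 1),
    "mmol/L": ("mr", 1e-3),
    "g/L":    ("wv", 1e-3),
}

def to_canonical(quantity, unit):
    """Return (scaled_quantity, MInChI unit code) for a supported input unit."""
    code, factor = UNIT_MAP[unit]
    return quantity * factor, code

print(to_canonical(250, "mmol/L"))   # (0.25, 'mr')
print(to_canonical(97, "%"))         # (97, 'pp')
```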
<h4 id="text-extraction-logic">Text Extraction Logic</h4>
<p>The paper defines a recursive procedure for parsing plain-text mixture descriptions:</p>
<ol>
<li><strong>Input</strong>: Raw text string (e.g., &ldquo;2 M acetone in water&rdquo;)</li>
<li><strong>Rule Application</strong>: Apply RegEx rules in order:
<ul>
<li><em>Remove</em>: Delete common filler words (&ldquo;solution&rdquo;, &ldquo;in&rdquo;)</li>
<li><em>Replace</em>: Substitute known variations</li>
<li><em>Concentration</em>: Extract quantities like &ldquo;2 M&rdquo;, &ldquo;97%&rdquo;</li>
<li><em>Branch</em>: Split phrases like &ldquo;A in B&rdquo; into sub-nodes</li>
</ul>
</li>
<li><strong>Lookup</strong>: Check cleaned name against a custom table (handles cases like &ldquo;xylenes&rdquo; or specific structures)</li>
<li><strong>OPSIN</strong>: If no lookup match, send to the OPSIN tool to generate SMILES from the chemical name</li>
<li><strong>Embed</strong>: If structure found, generate 2D coordinates (Molfile) via RDKit</li>
</ol>
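<p>A toy version of this pipeline (hedged: the real extractor uses a much larger rule set plus database lookup and OPSIN; the regexes and helper names here are illustrative only):</p>

```python
import re

# Toy sketch of steps 1-2: split "A in B" phrases, strip filler words,
# and pull concentrations like "2 M" or "97%" into component dicts.

CONC = re.compile(r"(\d+(?:\.\d+)?)\s*(M|%)")
FILLER = re.compile(r"\b(solution|solvent)\b", re.IGNORECASE)

def parse_mixture(text):
    """Parse a plain-text mixture description into a flat component list."""
    components = []
    for phrase in re.split(r"\s+in\s+", text):
        phrase = FILLER.sub("", phrase).strip()
        comp = {}
        m = CONC.search(phrase)
        if m:
            comp["quantity"] = float(m.group(1))
            comp["units"] = m.group(2)
            phrase = CONC.sub("", phrase).strip()
        comp["name"] = phrase   # a real pipeline would now do lookup / OPSIN
        components.append(comp)
    return components

print(parse_mixture("2 M acetone in water"))
# [{'quantity': 2.0, 'units': 'M', 'name': 'acetone'}, {'name': 'water'}]
```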
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Clark, A. M., McEwen, L. R., Gedeck, P., &amp; Bunin, B. A. (2019). Capturing mixture composition: An open machine-readable format for representing mixed substances. <em>Journal of Cheminformatics</em>, <em>11</em>(1), 33. <a href="https://doi.org/10.1186/s13321-019-0357-4">https://doi.org/10.1186/s13321-019-0357-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2019)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{clark2019capturing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Capturing mixture composition: an open machine-readable format for representing mixed substances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Clark, Alex M and McEwen, Leah R and Gedeck, Peter and Bunin, Barry A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IUPAC/MInChI">Official MInChI GitHub repository</a></li>
</ul>
]]></content:encoded></item><item><title>Making InChI FAIR and Sustainable for Inorganic Chemistry</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/inchi-2025/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/inchi-2025/</guid><description>InChI v1.07 modernizes chemical identifiers for FAIR principles and adds robust support for inorganic compounds.</description><content:encoded><![CDATA[<h2 id="paper-contribution-modernizing-chemical-identifiers">Paper Contribution: Modernizing Chemical Identifiers</h2>
<p>This is a <strong>Resource</strong> paper that describes the development and maintenance of InChI (International Chemical Identifier), a fundamental infrastructure component for chemical databases. While it includes methodological improvements to the canonicalization algorithm for inorganic compounds, its primary contribution is ensuring the sustainability and accessibility of a critical chemical informatics resource.</p>
<h2 id="motivation-the-inorganic-chemistry-problem">Motivation: The Inorganic Chemistry Problem</h2>
<p>The International Chemical Identifier (InChI) is ubiquitous in chemistry databases, identifying over a billion structures. But the system was designed specifically for organic chemistry and systematically fails to represent organometallic structures accurately. The original implementation had significant limitations:</p>
<ul>
<li><strong>FAIR principles gap</strong>: Development was closed-source, documentation was inadequate, and the codebase was difficult to maintain</li>
<li><strong>Inorganic chemistry failure</strong>: Metal-ligand bonds were automatically disconnected, destroying stereochemical information for coordination complexes</li>
<li><strong>Technical debt</strong>: Thousands of bugs, security vulnerabilities, and an unmaintainable codebase</li>
</ul>
<p>If you&rsquo;ve ever tried to search for a metal complex in a chemical database and gotten nonsense results, this is why. This paper describes the fix.</p>
<h2 id="core-innovation-smart-metal-ligand-handling">Core Innovation: Smart Metal-Ligand Handling</h2>
<p>The key innovations are:</p>
<ol>
<li>
<p><strong>Smart metal-ligand bond handling</strong>: A decision tree algorithm that uses coordination number and electronegativity to determine which bonds to keep and which to disconnect, preserving stereochemistry for coordination complexes</p>
</li>
<li>
<p><strong>Modernized development infrastructure</strong>: Migration to GitHub with open development, comprehensive testing, and maintainable documentation</p>
</li>
<li>
<p><strong>Backward compatibility</strong>: The core canonicalization algorithm remained unchanged, preserving over a billion existing InChIs for organic compounds</p>
</li>
</ol>
<p>The preprocessing step applies an iterative decision tree for every metal in a structure:</p>
<ul>
<li>If coordination number exceeds standard valence: keep all bonds (bypass electronegativity check)</li>
<li>If terminal metal: check electronegativity table (threshold: $\Delta EN = 1.7$)</li>
<li>Include hardcoded exceptions for Grignard reagents and organolithium compounds</li>
</ul>
<p>For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected as a coordination complex.</p>
<h2 id="validation-methods--experiments">Validation Methods &amp; Experiments</h2>
<p>The paper focuses on software engineering validation:</p>
<ul>
<li><strong>Bug fixing</strong>: Addressed thousands of bugs from the legacy codebase, including security vulnerabilities</li>
<li><strong>Backward compatibility testing</strong>: Verified that existing organic molecule InChIs remained unchanged</li>
<li><strong>Inorganic compound validation</strong>: Tested the new decision tree algorithm on coordination complexes, organometallic compounds, and ionic salts</li>
<li><strong>Documentation overhaul</strong>: Split technical documentation into Chemical Manual (for chemists) and Technical Manual (for developers)</li>
</ul>
<p>The validation approach emphasizes maintaining the &ldquo;same molecule, same identifier&rdquo; principle while extending coverage to inorganic chemistry.</p>
<h2 id="key-outcomes-and-future-work">Key Outcomes and Future Work</h2>
<p>The v1.07 release successfully:</p>
<ul>
<li><strong>Modernizes infrastructure</strong>: Open development on GitHub with maintainable codebase</li>
<li><strong>Extends to inorganic chemistry</strong>: Proper handling of coordination complexes and organometallic compounds</li>
<li><strong>Maintains backward compatibility</strong>: No breaking changes for existing organic compound InChIs</li>
<li><strong>Improves database search</strong>: Metal complexes now searchable with correct stereochemistry preserved</li>
</ul>
<p><strong>Acknowledged limitations</strong> for future work:</p>
<ul>
<li>Stereochemistry representation still needs improvement</li>
<li>Mixtures (MInChI) and nanomaterials (NInChI) remain unsolved problems</li>
<li>Chemical identifiers work best for discrete molecules and struggle with variable-composition materials</li>
</ul>
<p><strong>Impact</strong>: This update should dramatically improve searchability of inorganic and organometallic compounds in major chemical databases, addressing a critical gap in computational chemistry workflows.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="software--data-availability">Software &amp; Data Availability</h3>
<p>The InChI v1.07 codebase, primarily written in C/C++, is openly available on GitHub at <a href="https://github.com/IUPAC-InChI/InChI">IUPAC-InChI/InChI</a>. The repository includes the core canonicalization engine and the new inorganic preprocessing logic. Both the Technical Manual (for structural integration) and the Chemical Manual are maintained alongside the codebase.</p>
<p><strong>Benchmarking Data</strong>: Validation of the new decision tree logic is managed through rigorous unit testing built directly into the repository&rsquo;s continuous integration pipelines. Standard tests with existing organic compounds confirm backward compatibility, while newly integrated suites of coordination complexes and organometallic compounds ensure the new v1.07 preprocessing triggers as expected.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="the-metal-problem">The Metal Problem</h4>
<p>InChI&rsquo;s original algorithm assumed that bonds to metals were ionic and automatically disconnected them. This makes sense for something like sodium chloride (NaCl), where you have separate $\text{Na}^+$ and $\text{Cl}^-$ ions.</p>
<p>It fails for:</p>
<ul>
<li><strong>Coordination complexes</strong>: Where ligands are bonded to the metal center</li>
<li><strong>Organometallic compounds</strong>: Where carbon-metal bonds are covalent</li>
<li><strong>Sandwich compounds</strong>: Like ferrocene, where the bonding has both ionic and covalent character</li>
</ul>
<p>The result: loss of stereochemical information and identical InChIs for structurally different compounds.</p>
<h4 id="the-solution-smart-preprocessing">The Solution: Smart Preprocessing</h4>
<p>The new system uses a decision tree to figure out which metal-ligand bonds to keep and which to disconnect. The process is <strong>iterative</strong>: it runs for every metal in the structure, then checks every bond to that metal. In the C/C++ repository, this preprocessing logic acts as a filter applied <em>before</em> the traditional organic canonicalization engine (from v1.06) runs, dynamically determining whether coordination bonds are retained for downstream layer generation.</p>
<h5 id="decision-tree-logic">Decision Tree Logic</h5>
<p>The algorithm determines the bond state $B(m, l)$ between a metal $m$ and ligand $l$ based on coordination number $CN(m)$, standard valence $V(m)$, and Pauling electronegativity $EN$:</p>
<p>$$
\begin{aligned}
B(m, l) &amp;=
\begin{cases}
\text{Connected} &amp; \text{if } CN(m) &gt; V(m) \\
\text{Connected} &amp; \text{if } |EN(m) - EN(l)| &lt; 1.7 \\
\text{Disconnected} &amp; \text{if } |EN(m) - EN(l)| \geq 1.7
\end{cases}
\end{aligned}
$$</p>
<p><em>(Note: Explicit overrides exist for specific classes like Grignard reagents).</em></p>
<h5 id="hardcoded-chemical-exceptions">Hardcoded Chemical Exceptions</h5>
<p>The algorithm includes specific overrides based on well-established chemistry:</p>
<ul>
<li><strong>Grignard reagents (RMgX)</strong>: Explicitly configured to <strong>keep</strong> the Mg-C bond but <strong>disconnect</strong> the Mg-halide bond</li>
<li><strong>Organolithium compounds (RLi)</strong>: Explicitly configured to keep the structure intact</li>
</ul>
<p>These exceptions exist because the general electronegativity rules would give incorrect results for these compound classes.</p>
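<p>A minimal Python sketch of this decision tree and its exception overrides, assuming standard Pauling electronegativity values; the actual v1.07 electronegativity table and exception lists live in the C sources and may differ in detail.</p>

```python
# Toy Pauling electronegativities; InChI v1.07 ships its own table.
EN = {"Na": 0.93, "Li": 0.98, "Mg": 1.31, "Fe": 1.83, "C": 2.55, "Cl": 3.16}
THRESHOLD = 1.7  # Delta-EN cutoff from the decision tree above

def bond_state(metal, ligand, coordination_number, standard_valence):
    """Return 'connected' or 'disconnected' for one metal-ligand bond."""
    # Hardcoded chemical exceptions are checked before the general rules.
    if metal == "Mg" and ligand == "C":
        return "connected"        # Grignard reagents: keep the Mg-C bond
    if metal == "Li":
        return "connected"        # organolithium: keep the structure intact
    if coordination_number > standard_valence:
        return "connected"        # coordination exceeds valence: keep all bonds
    if abs(EN[metal] - EN[ligand]) < THRESHOLD:
        return "connected"
    return "disconnected"
```

<p>For a Grignard reagent RMgX this keeps the Mg&ndash;C bond but disconnects the Mg&ndash;halide bond, matching the overrides listed above.</p>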
<h5 id="practical-example">Practical Example</h5>
<p>For example, $\text{FeCl}_2$ is treated as ionic and disconnected into $\text{Fe}^{2+}$ and $2\ \text{Cl}^-$, while $[\text{FeCl}_4]^{2-}$ remains connected because its coordination number exceeds iron&rsquo;s standard valence.</p>
<h4 id="how-inchi-generation-works">How InChI Generation Works</h4>
<p>The process has six main steps:</p>
<ol>
<li><strong>Parse input</strong>: Read the structure from a file (Molfile, SDF, etc.)</li>
<li><strong>Convert to internal format</strong>: Transform into the software&rsquo;s data structures</li>
<li><strong>Normalize</strong>: Standardize tautomers, resolve ambiguities (where the new metal rules apply)</li>
<li><strong>Canonicalize</strong>: Create a unique representation independent of atom numbering</li>
<li><strong>Generate InChI string</strong>: Build the layered text identifier</li>
<li><strong>Create InChIKey</strong>: Hash the full string into a 27-character key for databases</li>
</ol>
<p>The InChI itself has separate layers for formula, connectivity, hydrogens, stereochemistry, isotopes, and charge. The InChIKey is what actually gets stored in databases for fast searching.</p>
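<p>The six steps above can be sketched as a toy pipeline. The stage names and data format here are illustrative stand-ins for the real C implementation, and the &ldquo;key&rdquo; is a plain truncated hex digest rather than a real InChIKey.</p>

```python
import hashlib

def parse_input(text):
    # Steps 1-2: read a toy "formula connections" record into a dict.
    formula, _, connections = text.partition(" ")
    return {"formula": formula, "connections": connections}

def normalize(mol):
    # Step 3: canonical ordering stands in for the normalization rules
    # (tautomers, and where the new metal-bond preprocessing applies).
    mol["connections"] = ";".join(sorted(mol["connections"].split(";")))
    return mol

def serialize(mol):
    # Steps 4-5: emit a layered, slash-separated identifier string.
    return "InChI=1S/{}/c{}".format(mol["formula"], mol["connections"])

def to_key(inchi):
    # Step 6: hash the full string into a short fixed-length key.
    return hashlib.sha256(inchi.encode()).hexdigest()[:27].upper()

def generate(text):
    return to_key(serialize(normalize(parse_input(text))))
```

<p>Because step 3 canonicalizes before serializing, two different input orderings of the same toy structure yield the same key, which is the &ldquo;same molecule, same identifier&rdquo; property in miniature.</p>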
<h5 id="inchikey-version-flag">InChIKey Version Flag</h5>
<p>Character 25 of the InChIKey indicates the version status:</p>
<ul>
<li><strong>&ldquo;S&rdquo;</strong>: Standard InChI</li>
<li><strong>&ldquo;N&rdquo;</strong>: Non-standard InChI</li>
<li><strong>&ldquo;B&rdquo;</strong>: Beta (experimental features)</li>
</ul>
<p>This flag is important for anyone parsing InChIKeys programmatically, as it tells you whether the identifier was generated using stable or experimental algorithms.</p>
<h2 id="additional-context">Additional Context</h2>
<h3 id="what-inchi-actually-does">What InChI Actually Does</h3>
<p>InChI creates a unique text string for any chemical structure. SMILES has multiple vendor implementations and can represent the same molecule in different ways. InChI provides a single, standardized format controlled by IUPAC. The goal is simple: same molecule, same identifier, every time.</p>
<p>This matters for FAIR data principles:</p>
<ul>
<li><strong>Findable</strong>: You can search for a specific compound across databases</li>
<li><strong>Accessible</strong>: The standard is open and free</li>
<li><strong>Interoperable</strong>: Different systems can connect chemical knowledge</li>
<li><strong>Reusable</strong>: The identifiers work consistently across platforms</li>
</ul>
<h3 id="better-documentation">Better Documentation</h3>
<p>The technical manual is being split into two documents:</p>
<ul>
<li><strong>Chemical Manual</strong>: For chemists who need to understand what InChIs mean</li>
<li><strong>Technical Manual</strong>: For developers who need to implement the algorithms</li>
</ul>
<p>This addresses the problem of current documentation serving both audiences poorly.</p>
<h3 id="the-bigger-picture">The Bigger Picture</h3>
<p>InChI&rsquo;s evolution reflects chemistry&rsquo;s expansion beyond its organic roots. The fact that it took this long to properly handle inorganic compounds shows how much computational chemistry has historically focused on carbon-based molecules.</p>
<p>As the field moves into catalysis, materials science, and coordination chemistry applications, having proper chemical identifiers becomes essential. You can&rsquo;t build FAIR chemical databases if half of chemistry is represented incorrectly.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blanke, G., Brammer, J., Baljozovic, D., Khan, N. U., Lange, F., Bänsch, F., Tovee, C. A., Schatzschneider, U., Hartshorn, R. M., &amp; Herres-Pawlis, S. (2025). Making the InChI FAIR and sustainable while moving to inorganics. <em>Faraday Discussions</em>, 256(0), 503-519. <a href="https://doi.org/10.1039/D4FD00145A">https://doi.org/10.1039/D4FD00145A</a></p>
<p><strong>Publication</strong>: Faraday Discussions, 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blanke2025making,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Making the InChI FAIR and sustainable while moving to inorganics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blanke, G. and Brammer, J. and Baljozovic, D. and Khan, N. U. and Lange, F. and B{\&#34;a}nsch, F. and Tovee, C. A. and Schatzschneider, U. and Hartshorn, R. M. and Herres-Pawlis, S.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Faraday Discussions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{256}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{503--519}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI: The Worldwide Chemical Structure Identifier Standard</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/inchi-2013/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/inchi-2013/</guid><description>Heller et al. (2013) explain how IUPAC's InChI became the global standard for representing chemical structures, its governance, and current limitations.</description><content:encoded><![CDATA[<h2 id="inchi-as-a-resource-and-systematization-standard">InChI as a Resource and Systematization Standard</h2>
<p>This is a <strong>Resource &amp; Systematization Paper</strong> that reviews the history, technical architecture, governance structure, and implementation status of the InChI standard. It documents both the institutional development of an open chemical identifier and the technical specification that enables it.</p>
<h2 id="the-motivation-interoperability-in-chemical-databases">The Motivation: Interoperability in Chemical Databases</h2>
<p>Before InChI, the chemistry community faced a fundamental interoperability problem. Chemical databases used proprietary systems like CAS Registry Numbers, or format-dependent representations like <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. These systems were expensive, access-restricted, or tied to &ldquo;in-house&rdquo; databases.</p>
<p>The authors argue the Internet and Open Source software acted as a <strong>&ldquo;black swan&rdquo; event</strong> that disrupted this status quo. The Internet created a need to link diverse, free and fee-based resources without a central gatekeeper. InChI was designed as the solution: a non-proprietary, open-source identifier enabling linking of distinct data compilations.</p>
<h2 id="technical-and-institutional-innovations-of-inchi">Technical and Institutional Innovations of InChI</h2>
<p>InChI&rsquo;s innovation is both technical and institutional:</p>
<p><strong>Technical novelty</strong>: A hierarchical &ldquo;layered&rdquo; canonicalization system where structure representations build from basic connectivity to full stereochemistry. This allows flexible matching: a molecule with unknown stereochemistry produces an InChI that&rsquo;s a subset of the same molecule with known stereochemistry.</p>
<p><strong>Institutional novelty</strong>: Creating an open standard governed by a charitable trust (the InChI Trust) that convinced commercial competitors (publishers, databases) to adopt it as a &ldquo;pre-competitive&rdquo; necessity. This solved the political problem of maintaining an open standard in a competitive industry.</p>
<h3 id="technical-architecture-layers-and-hashing">Technical Architecture: Layers and Hashing</h3>
<h4 id="the-inchi-string">The InChI String</h4>
<p>InChI is a <strong>canonicalized structure representation</strong> derived from IUPAC conventions. It uses a hierarchical &ldquo;layered&rdquo; format where specific layers add detail. The exact technical specification includes these string segments:</p>
<ol>
<li><strong>Main Layer</strong>: Chemical Formula</li>
<li><strong>Connectivity Layer (<code>/c</code>)</strong>: Atoms and bonds (excluding bond orders)</li>
<li><strong>Hydrogen Layer (<code>/h</code>)</strong>: Tautomeric and immobile H atoms</li>
<li><strong>Charge (<code>/q</code>) &amp; Proton Balance (<code>/p</code>)</strong>: Accounting for ionization</li>
<li><strong>Stereochemistry</strong>:
<ul>
<li>Double bond (<code>/b</code>) and Tetrahedral (<code>/t</code>) parity</li>
<li>Parity inversion (<code>/m</code>)</li>
<li>Stereo type (<code>/s</code>): absolute, relative, or racemic</li>
</ul>
</li>
<li><strong>Fixed-H Layer (<code>/f</code>)</strong>: Distinguishes specific tautomers if needed</li>
</ol>
<p>This layered approach is clever: a molecule with unknown stereochemistry will have an InChI that&rsquo;s a subset of the same molecule with known stereochemistry. This allows for flexible matching at the connectivity level even without complete stereochemical information.</p>
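<p>As a quick illustration, the layered string can be split into named layers by its <code>/</code> separators. This sketch uses the prefix letters listed above and ethanol&rsquo;s standard InChI as the example; the prefix map covers only the common layers.</p>

```python
# Map single-letter layer prefixes to descriptive names (subset shown).
LAYER_NAMES = {
    "c": "connectivity", "h": "hydrogens", "q": "charge", "p": "protons",
    "b": "double-bond stereo", "t": "tetrahedral stereo",
    "m": "parity", "s": "stereo type", "f": "fixed-H", "i": "isotopes",
}

def parse_inchi(inchi):
    """Split an InChI string into a {layer-name: value} dict."""
    parts = inchi.split("/")
    layers = {"version": parts[0].split("=", 1)[1], "formula": parts[1]}
    for segment in parts[2:]:
        layers[LAYER_NAMES.get(segment[0], segment[0])] = segment[1:]
    return layers

layers = parse_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")  # ethanol
```

<p>A structure with unknown stereochemistry simply yields a dict missing the stereo keys, which is exactly the subset relationship that enables flexible matching.</p>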
<h4 id="the-inchikey">The InChIKey</h4>
<p>Because InChI strings can be too long for search engines (which historically truncated queries around 30 characters or split them at symbols like <code>/</code> and <code>+</code>), the InChIKey was created.</p>
<p><strong>Mechanism</strong>: A 27-character string generated via a <strong>SHA-256 hash</strong> of the InChI string. This can be represented as:</p>
<p>$$ \text{InChIKey} = f_{\text{SHA-256}}(\text{InChI}) $$</p>
<p><strong>Structure</strong>:</p>
<ul>
<li><strong>Block 1 (14 characters)</strong>: Encodes the molecular skeleton (connectivity)</li>
<li><strong>Block 2 (8 characters)</strong>: Encodes stereochemistry and isotopes</li>
<li><strong>Block 3</strong>: Version and protonation flags (e.g., &lsquo;N&rsquo; for neutral)</li>
</ul>
<p>Because the InChIKey is a hash, it cannot be converted back to a structure (irreversible) and has a theoretical risk of collision. It is important to distinguish between <strong>InChI collisions</strong> (which are due to flaws/bugs and are very rare) and <strong>InChIKey collisions</strong> (which are mathematically inevitable due to hashing).</p>
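<p>The hash-and-encode idea can be sketched in a few lines. Note that this toy version is <em>not</em> the real InChIKey algorithm (which hashes the connectivity and stereo portions separately with a specific truncation and base-26 encoding); it only illustrates the fixed-length, irreversible mapping.</p>

```python
import hashlib
import string

def toy_inchikey(inchi):
    """Map an InChI string to a 27-character uppercase key (illustrative)."""
    digest = hashlib.sha256(inchi.encode("ascii")).digest()
    letters = string.ascii_uppercase
    block1 = "".join(letters[b % 26] for b in digest[:14])    # "skeleton"
    block2 = "".join(letters[b % 26] for b in digest[14:22])  # "stereo"
    return "{}-{}SA-N".format(block1, block2)  # fixed flag/version/proton chars
```

<p>Like the real key, this is deterministic and fixed-length and cannot be inverted back to a structure; unlike the real key, both blocks here are derived from one hash of the whole string.</p>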
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>This is a systematization paper documenting an existing standard. However, the authors provide:</p>
<p><strong>Validation evidence</strong>:</p>
<ul>
<li><strong>Certification Suite</strong>: A test suite that software vendors must pass to display the &ldquo;InChI Certified&rdquo; logo, preventing fragmentation</li>
<li><strong>Round-trip conversion testing</strong>: Demonstrated &gt;99% success rate converting InChI back to structure (100% with AuxInfo layer)</li>
<li><strong>Real-world adoption metrics</strong>: Documented integration across major chemical databases and publishers</li>
</ul>
<p><strong>Known limitations identified</strong>:</p>
<ul>
<li>Tautomer representation issues in Version 1 (different drawings of the same tautomer can generate different InChIs)</li>
<li>Edge cases in stereochemistry representation</li>
</ul>
<h3 id="institutional-history--governance">Institutional History &amp; Governance</h3>
<p><strong>Origin</strong>: The project was initiated at a March 2000 IUPAC meeting in Washington, DC. It was originally called the <strong>IUPAC Chemical Identifier Project (IChIP)</strong>.</p>
<p><strong>Development</strong>: Technical work was done by NIST (Stein, Heller, Tchekhovskoi), overseen by the IUPAC <strong>CCINS</strong> committee, which later became the <strong>InChI Subcommittee</strong> of Division VIII.</p>
<p><strong>The InChI Trust</strong>: To ensure the algorithm survived beyond a volunteer organization, the <strong>InChI Trust</strong> was formed in 2009. It is a UK charity supported by publishers and databases (e.g., Nature, RSC) to maintain the standard pre-competitively. This was a critical innovation: getting commercial publishers and software vendors to agree that a non-proprietary standard would benefit everyone.</p>
<h2 id="real-world-impact-and-future-directions">Real-World Impact and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Success through &ldquo;un-coerced adoption&rdquo;</strong>: InChI succeeded because commercial competitors viewed it as a &ldquo;pre-competitive&rdquo; necessity for the Internet age. The open governance model proved durable.</p>
<p><strong>Technical achievements</strong>:</p>
<ul>
<li>Reversible representation (&gt;99% without AuxInfo, 100% with it)</li>
<li>Hierarchical structure enables flexible matching at different levels of detail</li>
<li>InChIKey enables web search despite being a hash (with inherent collision risk)</li>
</ul>
<h3 id="limitations-acknowledged-as-of-2013">Limitations Acknowledged (as of 2013)</h3>
<ul>
<li><strong>Tautomerism Issues</strong>: Different drawings of the same tautomer (e.g., 1,4-oxime vs nitroso) can generate different InChIs in Version 1, which is targeted for Version 2</li>
<li><strong>Hash collision risk</strong>: InChIKey collisions are mathematically inevitable due to SHA-256 hashing, though InChI collisions (actual bugs) are very rare</li>
<li><strong>Certification required</strong>: To prevent fragmentation, software must pass the InChI Certification Suite</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors note that while this paper documents the state as of 2013, InChI continues to evolve. Tautomer handling and edge cases in stereochemistry representation were priorities for future versions. The governance model through the InChI Trust was designed to ensure long-term maintenance beyond the original volunteer contributors.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This systematization paper documents an existing standard. Key implementation resources are openly maintained by the InChI Trust.</p>
<h3 id="code--software">Code &amp; Software</h3>
<ul>
<li><strong>Official Open Source Implementation</strong>: The C source code and pre-compiled binaries for the InChI algorithm are freely available via the <a href="https://www.inchi-trust.org/downloads/">InChI Trust Downloads Page</a> and their <a href="https://github.com/IUPAC-InChI/InChI">official GitHub repository</a>.</li>
<li><strong>Canonicalization algorithm</strong>: Open-source implementation of IUPAC-based rules for generating unique representations from multiple possible drawings of the same molecule.</li>
</ul>
<h3 id="data--validation">Data &amp; Validation</h3>
<ul>
<li><strong>InChI Certification Suite</strong>: A test suite of chemical structures provided by the InChI Trust used to validate that third-party software implementations generate correct InChIs.</li>
<li><strong>Version 1 specification</strong>: Complete technical documentation of the layered format.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Round-trip conversion</strong>: &gt;99% success rate (100% with AuxInfo) as validated by NIST and IUPAC.</li>
<li><strong>Certification testing</strong>: Pass/fail validation for software claiming InChI compliance.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D., &amp; Pletnev, I. (2013). InChI: the worldwide chemical structure identifier standard. <em>Journal of Cheminformatics</em>, <em>5</em>(1), 7. <a href="https://doi.org/10.1186/1758-2946-5-7">https://doi.org/10.1186/1758-2946-5-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2013</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{heller2013inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{InChI, the worldwide chemical structure identifier standard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heller, Stephen and McNaught, Alan and Stein, Stephen and Tchekhovskoi, Dmitrii and Pletnev, Igor}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2013}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>InChI and Tautomerism: Toward Comprehensive Treatment</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/inchi-and-tautomers/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/inchi-and-tautomers/</guid><description>Dhaked et al.'s comprehensive analysis of tautomerism in chemoinformatics, introducing 86 new tautomeric rules and their implications for InChI V2 development.</description><content:encoded><![CDATA[<h2 id="paper-contribution-a-systematized-tautomer-database-resource">Paper Contribution: A Systematized Tautomer Database Resource</h2>
<p>This is a <strong>Resource</strong> paper with strong <strong>Systematization</strong> elements. It provides a comprehensive catalog of 86 tautomeric transformation rules derived from experimental literature, designed to serve as a foundational resource for chemical database systems and the InChI V2 identifier standard. The systematic validation across 400+ million structures also makes it a benchmarking study for evaluating current chemoinformatics tools.</p>
<h2 id="the-tautomerism-problem-in-chemical-databases">The Tautomerism Problem in Chemical Databases</h2>
<p>Chemical databases face a fundamental problem: the same molecule can appear multiple times under different identifiers simply because it exists in different tautomeric forms. For example, glucose&rsquo;s ring-closed and open-chain forms are the same molecule; however, current chemical identifiers (including InChI) often treat them as distinct compounds.</p>
<figure class="post-figure center ">
    <img src="/img/notes/Glucose-tautomerism.webp"
         alt="D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism"
         title="D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ring-chain tautomerism in glucose: the open-chain aldehyde form (left) and the cyclic pyranose form (right) are the same molecule in different tautomeric states.</figcaption>
    
</figure>

<p>This creates three critical problems:</p>
<ol>
<li><strong>Database redundancy</strong>: Millions of duplicate entries for the same chemical entities</li>
<li><strong>Search failures</strong>: Researchers miss relevant compounds during structure searches</li>
<li><strong>ML training issues</strong>: Machine learning models learn to treat tautomers as different molecules</li>
</ol>
<p>The motivation for this work is to provide a comprehensive, experimentally-grounded rule set that enables InChI V2 to properly recognize tautomeric relationships, eliminating these problems at the identifier level.</p>
<h2 id="86-comprehensive-tautomeric-transformation-rules">86 Comprehensive Tautomeric Transformation Rules</h2>
<p>The key contributions are:</p>
<ol>
<li>
<p><strong>Comprehensive Rule Set</strong>: Development of <strong>86 tautomeric transformation rules</strong> based on experimental literature, categorized into:</p>
<ul>
<li>54 Prototropic rules (classic H-movement tautomerism)</li>
<li>21 Ring-Chain rules (cyclic/open-chain transformations)</li>
<li>11 Valence rules (structural rearrangements with valence changes)</li>
</ul>
</li>
<li>
<p><strong>Massive-Scale Validation</strong>: Testing these rules against <strong>nine major chemical databases</strong> totaling over 400 million structures to identify coverage gaps in current InChI implementations</p>
</li>
<li>
<p><strong>Quantitative Assessment</strong>: Systematic measurement showing that current InChI (even with flexible settings) only achieves ~50% success in recognizing tautomeric relationships, with some new rules showing &lt;2% success rates</p>
</li>
<li>
<p><strong>Practical Tools</strong>: Creation of the <strong>Tautomerizer</strong> web tool for public use, demonstrating practical application of the rule set</p>
</li>
</ol>
<p>The novelty lies in the systematic compilation and validation of transformation rules at a scale that reveals critical gaps in current chemical identification systems.</p>
<h2 id="massive-scale-validation-across-400m-structures">Massive-Scale Validation Across 400M+ Structures</h2>
<h3 id="database-analysis">Database Analysis</h3>
<p>The researchers analyzed <strong>9 chemical databases</strong> totaling 400+ million structures:</p>
<ul>
<li><strong>Public databases</strong>: PubChem (largest), ChEMBL, DrugBank, PDB Ligands, SureChEMBL, AMS, ChemNavigator</li>
<li><strong>Private databases</strong>: CSD (Cambridge Structural Database), CSDB (NCI internal)</li>
</ul>
<h3 id="methodology">Methodology</h3>
<p><strong>Software</strong>: CACTVS Chemoinformatics Toolkit (versions 3.4.6.33 and 3.4.8.6)</p>
<p><strong>Tautomer Generation Protocol</strong>:</p>
<ul>
<li><strong>Algorithm</strong>: Single-step generation (apply transforms to input structure only, avoiding recursion)</li>
<li><strong>Constraints</strong>: Max 10 tautomers per structure, 30-second CPU timeout per transform</li>
<li><strong>Format</strong>: All rules expressed as SMIRKS strings</li>
<li><strong>Stereochemistry</strong>: Stereocenters involved in tautomerism were flattened during transformation</li>
</ul>
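<p>The single-step protocol above can be sketched independently of any chemistry toolkit; here each transform is just a function from a structure to candidate products, standing in for a SMIRKS application.</p>

```python
def generate_tautomers(structure, transforms, max_tautomers=10):
    """Apply each transform to the input structure only (no recursion),
    deduplicate the products, and cap the output, as in the protocol above."""
    seen = {structure}
    out = []
    for transform in transforms:
        for product in transform(structure):
            if product not in seen:
                seen.add(product)
                out.append(product)
                if len(out) >= max_tautomers:
                    return out
    return out
```

<p>Because products are never fed back in, the enumeration stays single-step; the real protocol additionally enforces the 30-second CPU timeout per transform.</p>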
<p><strong>Success Metrics</strong>:</p>
<ul>
<li><strong>Complete InChI match</strong>: All tautomers share identical InChI</li>
<li><strong>Partial InChI match</strong>: At least two tautomers share an InChI</li>
<li>Tested against three InChI configurations: Standard, Nonstandard (15T), Nonstandard (15T + KET)</li>
</ul>
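<p>The two match metrics reduce to a small classification over the InChIs computed for one structure&rsquo;s tautomer set; a minimal sketch (label strings are illustrative):</p>

```python
from collections import Counter

def classify_match(inchis):
    """Classify one tautomer set by how its InChIs agree (metrics above)."""
    counts = Counter(inchis)
    if len(counts) == 1 and len(inchis) > 1:
        return "complete"   # all tautomers share an identical InChI
    if any(n >= 2 for n in counts.values()):
        return "partial"    # at least two tautomers share an InChI
    return "none"
```
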
<h3 id="rule-coverage-analysis">Rule Coverage Analysis</h3>
<p>For each of the 86 rules, the researchers:</p>
<ol>
<li>Applied the transformation to all molecules in each database</li>
<li>Generated tautomers using the SMIRKS patterns</li>
<li>Computed InChI identifiers for each tautomer</li>
<li>Measured success rates (percentage of cases where InChI recognized the relationship)</li>
</ol>
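<p>Under toy stand-ins for the SMIRKS engine and the InChI calculator (both hypothetical here; the authors used CACTVS), the per-rule measurement loop can be sketched as:</p>

```python
def rule_success_rate(molecules, apply_rule, inchi):
    """Fraction of rule-applicable molecules whose tautomer set is
    recognized by the identifier, i.e. at least two structures share
    an InChI. `apply_rule` and `inchi` are hypothetical stand-ins for
    a SMIRKS transform and an InChI calculator."""
    applicable = recognized = 0
    for mol in molecules:
        tautomers = apply_rule(mol)  # single-step transforms only
        if not tautomers:
            continue  # rule does not match this structure
        applicable += 1
        ids = {inchi(t) for t in [mol] + tautomers}
        # Fewer distinct InChIs than structures: some pair collapsed.
        if len(ids) < len(tautomers) + 1:
            recognized += 1
    return recognized / applicable if applicable else 0.0
```

<p>The paper reports this fraction per rule and per database; aggregating it over all 86 rules yields the headline success rates quoted below.</p>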
<h3 id="key-findings-from-experiments">Key Findings from Experiments</h3>
<p><strong>Rule Frequency</strong>: The most common rule <code>PT_06_00</code> (1,3-heteroatom H-shift, covering keto-enol tautomerism) affects <strong>&gt;70% of molecules</strong> across databases.</p>
<p><strong>InChI Performance</strong>:</p>
<ul>
<li>Standard InChI: ~37% success rate</li>
<li>Nonstandard InChI (15T): ~50% success rate</li>
<li>Many newly defined rules: &lt;2% success rate</li>
</ul>
<p><strong>Scale Impact</strong>: Implementing the full 86-rule set would <strong>triple</strong> the number of compounds recognized as having tautomeric relationships compared to current InChI.</p>
<h2 id="outcomes-inchi-v2-requirements-and-coverage-gaps">Outcomes: InChI V2 Requirements and Coverage Gaps</h2>
<h3 id="main-findings">Main Findings</h3>
<ol>
<li>
<p><strong>Current Systems Are Inadequate</strong>: Even the most flexible InChI settings only achieve ~50% success in recognizing tautomeric relationships, with Standard InChI at ~37%</p>
</li>
<li>
<p><strong>Massive Coverage Gap</strong>: The new rule set reveals millions of tautomeric relationships that current InChI completely misses, particularly for ring-chain and valence tautomerism</p>
</li>
<li>
<p><strong>Implementation Requirement</strong>: InChI V2 will require a major redesign to handle the comprehensive rule set</p>
</li>
<li>
<p><strong>Rule Validation</strong>: The 86-rule set, derived from experimental literature, provides a validated foundation for next-generation chemical identifiers</p>
</li>
</ol>
<h3 id="implications">Implications</h3>
<p><strong>For Chemical Databases</strong>:</p>
<ul>
<li>Reduced redundancy through proper tautomer recognition</li>
<li>Improved data quality and consistency</li>
<li>More comprehensive structure search results</li>
</ul>
<p><strong>For Machine Learning</strong>:</p>
<ul>
<li>More accurate training data (tautomers properly grouped)</li>
<li>Better molecular property prediction models</li>
<li>Reduced dataset bias from tautomeric duplicates</li>
</ul>
<p><strong>For Chemoinformatics Tools</strong>:</p>
<ul>
<li>Blueprint for InChI V2 development</li>
<li>Standardized rule set for tautomer generation</li>
<li>Public tool (Tautomerizer) for practical use</li>
</ul>
<h3 id="limitations-acknowledged">Limitations Acknowledged</h3>
<ul>
<li>Single-step generation only (omits recursive enumeration of all possible tautomers)</li>
<li>30-second timeout may miss complex transformations</li>
<li>Some tautomeric preferences are context-dependent (pH, solvent) and cannot be captured by static rules alone</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The paper lays groundwork for InChI V2 development, emphasizing that the comprehensive rule set necessitates algorithmic redesign.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Datasets Analyzed</strong> (400M+ total structures):</p>
<p><strong>Public Databases</strong> (Enable partial reproduction):</p>
<ul>
<li><strong>PubChem</strong>: Largest public chemical database</li>
<li><strong>ChEMBL</strong>: Bioactive molecules with drug-like properties</li>
<li><strong>DrugBank</strong>: FDA-approved and experimental drugs</li>
<li><strong>PDB Ligands</strong>: Small molecules from protein structures</li>
<li><strong>SureChEMBL</strong>: Chemical structures from patents</li>
<li><strong>AMS</strong>: Screening samples</li>
<li><strong>ChemNavigator</strong>: Commercial chemical database</li>
</ul>
<p><strong>Private/Proprietary Databases</strong> (Prevent full-scale reproduction):</p>
<ul>
<li><strong>CSD</strong>: Cambridge Structural Database (requires commercial/academic license)</li>
<li><strong>CSDB</strong>: NCI internal database (private)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tautomer Generation</strong>:</p>
<ul>
<li><strong>Method</strong>: Single-step SMIRKS-based transformations</li>
<li><strong>Constraints</strong>:
<ul>
<li>Maximum 10 tautomers per input structure</li>
<li>30-second CPU timeout per transformation</li>
<li>Stereochemistry flattening for affected centers</li>
</ul>
</li>
<li><strong>Toolkit Dependency</strong>: The authors used the CACTVS Chemoinformatics Toolkit. Researchers attempting to reproduce this with fully open-source tools (like RDKit) may encounter differing behavior due to proprietary chemical perception logic and licensing differences.</li>
</ul>
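<p>The single-step protocol with its cap can be sketched in a few lines; the transform callables are hypothetical stand-ins for the SMIRKS rules (the 30-second CPU timeout is omitted for brevity):</p>

```python
def generate_tautomers(mol, transforms, max_tautomers=10):
    """Apply every transform to the *input* structure only (no
    recursion into products), deduplicate, and stop at the cap."""
    seen, out = {mol}, []
    for transform in transforms:
        for product in transform(mol):
            if product not in seen:
                seen.add(product)
                out.append(product)
            if len(out) >= max_tautomers:
                return out  # hard cap of 10 tautomers per structure
    return out
```

<p>Restricting generation to a single step keeps the enumeration tractable across hundreds of millions of structures, at the cost of missing multi-step tautomer chains.</p>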
<p><strong>Rule Categories</strong>:</p>
<ul>
<li><strong>Prototropic (PT)</strong>: 54 rules for hydrogen movement
<ul>
<li>Most common: <code>PT_06_00</code> (1,3-heteroatom H-shift, &gt;70% coverage)</li>
</ul>
</li>
<li><strong>Ring-Chain (RC)</strong>: 21 rules for cyclic/open-chain transformations
<ul>
<li>Examples: <code>RC_03_00</code> (pentose sugars), <code>RC_04_01</code> (hexose sugars)</li>
</ul>
</li>
<li><strong>Valence (VT)</strong>: 11 rules for valence changes
<ul>
<li>Notable: <code>VT_02_00</code> (tetrazole/azide, ~2.8M hits)</li>
</ul>
</li>
</ul>
<p><strong>InChI Comparison</strong>:</p>
<ul>
<li>Standard InChI (default settings)</li>
<li>Nonstandard InChI with <code>15T</code> option (mobile H)</li>
<li>Nonstandard InChI with <code>15T</code> and <code>KET</code> options (keto-enol)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Success Metrics</strong>:</p>
<p>Let $\mathcal{T}(m)$ be the set of generated tautomers for molecule $m$.</p>
<ul>
<li><strong>Complete Match</strong>: Occurs iff $\forall t_i, t_j \in \mathcal{T}(m), \text{InChI}(t_i) = \text{InChI}(t_j)$.</li>
<li><strong>Partial Match</strong>: At least 2 tautomers share the same InChI.</li>
<li><strong>Fail</strong>: All tautomers have different InChIs.</li>
</ul>
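<p>These three outcomes map directly onto a small classifier over InChI strings (plain strings stand in for real InChIs here):</p>

```python
from collections import Counter

def classify_match(inchis):
    """'complete' if all InChIs are identical, 'partial' if at least
    two agree, 'fail' if all are distinct."""
    counts = Counter(inchis)
    if len(counts) == 1:
        return "complete"
    if any(c >= 2 for c in counts.values()):
        return "partial"
    return "fail"
```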
<p><strong>Benchmark Results</strong>:</p>
<ul>
<li>Standard InChI: ~37% success rate across all rules</li>
<li>Nonstandard (15T): ~50% success rate</li>
<li>New rules: Many show &lt;2% recognition by current InChI</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Software Environment</strong>:</p>
<ul>
<li><strong>Toolkit</strong>: CACTVS Chemoinformatics Toolkit v3.4.6.33 and v3.4.8.6</li>
<li><strong>Hash Functions</strong>:
<ul>
<li><code>E_TAUTO_HASH</code> (tautomer-invariant identifier)</li>
<li><code>E_ISOTOPE_STEREO_HASH128</code> (tautomer-sensitive identifier)</li>
</ul>
</li>
</ul>
<p><strong>Note</strong>: The paper omits computational hardware specifications because the analysis used existing chemical databases. However, evaluating 400M+ structures necessitates high-throughput cluster computing, making it computationally expensive for an individual to replicate the full analysis from scratch.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dhaked, D. K., Ihlenfeldt, W.-D., Patel, H., Delannée, V., &amp; Nicklaus, M. C. (2020). Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2. <em>Journal of Chemical Information and Modeling</em>, <em>60</em>(3), 1253-1275. <a href="https://doi.org/10.1021/acs.jcim.9b01080">https://doi.org/10.1021/acs.jcim.9b01080</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{dhaked2020toward,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dhaked, Devendra K and Ihlenfeldt, Wolf-Dietrich and Patel, Hitesh and Delann{\&#39;e}e, Vincent and Nicklaus, Marc C}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{60}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1253--1275}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://cactus.nci.nih.gov/tautomerizer/">Tautomerizer Tool</a> - Public web tool for testing tautomeric transformations</li>
</ul>
]]></content:encoded></item><item><title>αExtractor: Chemical Info from Biomedical Literature</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/alpha-extractor/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/alpha-extractor/</guid><description>αExtractor uses ResNet-Transformer to extract chemical structures from literature images, including noisy and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-a-robust-optical-recognition-system">Methodological Contribution: A Robust Optical Recognition System</h2>
<p>This is primarily a <strong>Method</strong> ($\Psi_{\text{Method}}$) paper with a significant secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contribution (see the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for more on these categories).</p>
<p>The dominant methodological contribution is the ResNet-Transformer recognition architecture, which achieves state-of-the-art performance through robustness engineering: training on 20 million synthetic images with aggressive augmentation to handle degraded image conditions. The work answers the core methodological question &ldquo;How well does this work?&rdquo; through extensive benchmarking against existing OCSR tools and ablation studies validating architectural choices.</p>
<p>The secondary resource contribution comes from releasing αExtractor as a freely available web service, correcting labeling errors in standard benchmarks (CLEF, UOB, JPO), and providing an end-to-end document processing pipeline for biomedical literature mining.</p>
<h2 id="motivation-unlocking-visual-chemical-knowledge-in-biomedical-literature">Motivation: Unlocking Visual Chemical Knowledge in Biomedical Literature</h2>
<p>The motivation addresses a familiar pain point in chemical informatics within a biomedical context. Vast amounts of chemical knowledge in biomedical literature exist only as images, such as molecular structures embedded in figures, chemical synthesis schemes, and compound diagrams. This visual knowledge remains effectively invisible to computational methods, which creates a massive bottleneck for drug discovery research, systematic reviews, and large-scale chemical database construction.</p>
<p>Existing OCSR tools face two critical problems when applied to biomedical literature:</p>
<ol>
<li>
<p><strong>Real-world image quality</strong>: Biomedical papers often contain low-resolution figures, images with complex backgrounds, noise from scanning/digitization, and inconsistent drawing styles across different journals and decades of publications.</p>
</li>
<li>
<p><strong>End-to-end extraction</strong>: Most OCSR systems assume the presence of clean, cropped molecular images. In practice, you need to first find the molecular structures within multi-panel figures, reaction schemes, and dense document layouts before you can recognize them.</p>
</li>
</ol>
<p>The authors argue that a practical literature mining system needs to solve both problems simultaneously via robust recognition under noisy conditions and automated detection of molecular images within complex documents.</p>
<h2 id="core-innovation-robust-resnet-transformer-architecture">Core Innovation: Robust ResNet-Transformer Architecture</h2>
<p>The core innovation lies in combining a competition-winning recognition architecture with extensive robustness engineering and end-to-end document processing. The key contributions include:</p>
<ol>
<li>
<p><strong>ResNet-Transformer Recognition Model</strong>: The core recognition system uses a <strong>Residual Neural Network (ResNet)</strong> encoder paired with a <strong>Transformer decoder</strong> in an image-captioning framework. This architecture won first place in a Kaggle molecular translation competition, which provided a strong foundation for the recognition task. Let the input image be $I$. The model maximizes the joint likelihood of the SMILES tokens $T$ and coordinate sequences $X, Y$:
$$
\begin{aligned}
\mathcal{L}_{\text{total}} = - \sum_{i=1}^{L} \log P(T_i \mid I, T_{&lt;i}) - \lambda \sum_{i=1}^{L} \big(\log P(X_i \mid I, X_{&lt;i}) + \log P(Y_i \mid I, Y_{&lt;i})\big)
\end{aligned}
$$
Here, the continuous $X$ and $Y$ atom coordinates are discretized into 200 bins, which casts coordinate prediction as a standard classification task alongside SMILES generation.</p>
</li>
<li>
<p><strong>Enhanced Molecular Representation</strong>: The model produces an augmented representation that encompasses:</p>
<ul>
<li>Standard molecular connectivity information</li>
<li><strong>Bond type tokens</strong> (solid wedge bonds, dashed bonds, etc.) that preserve 3D stereochemical information</li>
<li><strong>Atom coordinate predictions</strong> that allow reconstruction of the exact molecular pose from the original image</li>
</ul>
<p>This dual prediction of discrete structure and spatial coordinates keeps the output faithful to the source depiction and enables better quality assessment.</p>
</li>
<li>
<p><strong>Massive Synthetic Training Dataset</strong>: The model was trained on approximately <strong>20 million synthetic molecular images</strong> generated from PubChem SMILES with aggressive data augmentation. The augmentation strategy randomized visual styles, image quality, and rendering parameters to create maximum diversity, ensuring the network rarely saw the same molecular depiction twice. This forces the model to learn robust, style-invariant features.</p>
</li>
<li>
<p><strong>End-to-End Document Processing Pipeline</strong>: αExtractor integrates <strong>object detection</strong> and <strong>structure recognition</strong> into a complete document mining system:</p>
<ul>
<li>An object detection model automatically locates molecular images within PDF documents</li>
<li>The recognition model converts detected images to structured representations</li>
<li>A web service interface makes the entire pipeline accessible to researchers without machine learning expertise</li>
</ul>
</li>
<li>
<p><strong>Robustness-First Design</strong>: The system was explicitly designed to handle degraded image conditions that break traditional OCSR tools, including low resolution, background interference, color variations, and scanning artifacts commonly found in legacy biomedical literature.</p>
</li>
</ol>
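<p>As a sanity check on the objective above, $\mathcal{L}_{\text{total}}$ can be evaluated token-by-token from the model&rsquo;s probabilities for the correct SMILES token and coordinate bin at each position (the value of $\lambda$ is assumed here; the paper summary does not state it):</p>

```python
import math

def total_loss(p_tokens, p_x, p_y, lam=1.0):
    """Sum of negative log-likelihoods for SMILES tokens plus the
    lambda-weighted X/Y coordinate terms, mirroring L_total above.
    Each argument lists the predicted probability of the *correct*
    class at each sequence position."""
    nll = lambda probs: -sum(math.log(p) for p in probs)
    return nll(p_tokens) + lam * (nll(p_x) + nll(p_y))
```

<p>A perfectly confident model (all probabilities 1.0) yields zero loss; every uncertain token or misplaced coordinate bin adds a positive penalty.</p>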
<h2 id="experimental-methodology-stress-testing-under-real-world-conditions">Experimental Methodology: Stress Testing under Real-World Conditions</h2>
<p>The evaluation focused on demonstrating robust performance across diverse image conditions, from pristine benchmarks to challenging real-world scenarios:</p>
<ol>
<li>
<p><strong>Benchmark Dataset Evaluation</strong>: αExtractor was tested on four standard OCSR benchmarks:</p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
</ul>
<p>Performance was measured using exact SMILES match accuracy.</p>
</li>
<li>
<p><strong>Error Analysis and Dataset Correction</strong>: During evaluation, the researchers discovered numerous labeling errors in the original benchmark datasets. They systematically identified and corrected these errors, then re-evaluated all methods on the cleaned datasets to get more accurate performance measurements.</p>
</li>
<li>
<p><strong>Robustness Stress Testing</strong>: The system was evaluated on two challenging datasets specifically designed to test robustness:</p>
<ul>
<li><strong>Color background images</strong> (200 samples): Molecular structures on complex, colorful backgrounds that simulate real figure conditions</li>
<li><strong>Low-quality images</strong> (200 samples): Degraded images with noise, blur, and artifacts typical of scanned documents</li>
</ul>
<p>These tests compared αExtractor against open-source alternatives under realistic degradation conditions.</p>
</li>
<li>
<p><strong>Generalization Testing</strong>: In the most challenging experiment, αExtractor was tested on <strong>hand-drawn molecular structures</strong>, representing a completely different visual domain not represented in the training data. This tested whether the learned features could generalize beyond digital rendering styles to human-drawn chemistry.</p>
</li>
<li>
<p><strong>End-to-End Document Extraction</strong>: The complete pipeline was evaluated on 50 PDF files containing 2,336 molecular images. This tested both the object detection component (finding molecules in complex documents) and the recognition component (converting them to SMILES) in a realistic literature mining scenario.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured to demonstrate the practical efficiency needed for large-scale document processing.</p>
</li>
</ol>
<h2 id="results--conclusions-breakthrough-in-degraded-image-performance">Results &amp; Conclusions: Breakthrough in Degraded Image Performance</h2>
<ul>
<li>
<p><strong>Substantial Accuracy Gains</strong>: On the four benchmark datasets, αExtractor achieved accuracies of 91.83% (CLEF), 98.47% (UOB), 88.67% (JPO), and 93.64% (USPTO), significantly outperforming existing methods. After correcting dataset labeling errors, the true accuracies were even higher, reaching <strong>95.77% on CLEF, 99.86% on UOB, and 92.44% on JPO</strong>.</p>
</li>
<li>
<p><strong>Exceptional Robustness</strong>: Open-source competitors struggled on degraded images (achieving 5.5% accuracy at best). αExtractor maintained <strong>over 90% accuracy</strong> on both color background and low-quality image datasets, demonstrating the effectiveness of the massive synthetic training strategy.</p>
</li>
<li>
<p><strong>Remarkable Generalization</strong>: On hand-drawn molecules, a domain completely absent from training data, αExtractor achieved <strong>61.4% accuracy</strong> while other tools scored below 3%. This suggests the model learned genuine, style-invariant chemical features rather than memorizing rendering conventions.</p>
</li>
<li>
<p><strong>Practical End-to-End Performance</strong>: In the complete document processing evaluation, αExtractor detected <strong>95.1% of molecular images</strong> (2,221 out of 2,336) and correctly recognized <strong>94.5% of detected structures</strong> (2,098 correct predictions). This demonstrates the system&rsquo;s readiness for real-world literature mining applications.</p>
</li>
<li>
<p><strong>Dataset Quality Issues</strong>: The systematic discovery of labeling errors in standard benchmarks highlights a broader problem in OCSR evaluation. The corrected datasets provide more reliable baselines for future method development.</p>
</li>
<li>
<p><strong>Spatial Layout Limitation</strong>: αExtractor correctly identifies molecular connectivity; however, the re-rendered structures may have different spatial layouts than the originals. This could complicate visual verification for complex molecules, even if the chemical information remains accurate.</p>
</li>
</ul>
<p>The work establishes αExtractor as a significant advance in practical OCSR for biomedical applications. The combination of robust recognition, end-to-end document processing, and exceptional generalization makes it suitable for large-scale literature mining tasks where previous tools would fail. The focus on real-world robustness over benchmark optimization represents a mature approach to deploying machine learning in scientific workflows.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is <strong>Partially Reproducible</strong>. While the authors detail the model architectures and training techniques, the source code, training dataset (20M synthetic images), and pre-trained weights remain closed-source and proprietary. The authors released a sample of their test data and host an online web server for running inference.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/CLEF_corrected">Corrected CLEF Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the CLEF benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/UOB_corrected">Corrected UOB Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the UOB benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/JPO_corrected">Corrected JPO Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the JPO benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Colored_Background">Color Background Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of molecular structures on complex, colorful backgrounds.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Low_Quality">Low Quality Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of degraded images with noise, blur, and artifacts.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/PDF">PDF Test Set</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Sample PDF files for end-to-end document extraction evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://extractor.alphama.com.cn/csr">αExtractor Web Server</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Online service for running inference using the proprietary system.</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Backbone:</strong> ResNet50 producing output of shape $2048 \times 19 \times 19$, projected to 512 channels via a feed-forward layer</li>
<li><strong>Transformer Architecture:</strong> 3 encoder layers and 3 decoder layers with hidden dimension of 512</li>
<li><strong>Output Format:</strong> Generates SMILES tokens plus two auxiliary coordinate sequences (X-axis and Y-axis) that are length-aligned with the SMILES tokens via padding</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Architecture:</strong> DETR (Detection Transformer) with ResNet101 backbone</li>
<li><strong>Transformer Architecture:</strong> 6 encoder layers and 6 decoder layers with hidden dimension of 256</li>
<li><strong>Purpose:</strong> Locates molecular images within PDF pages before recognition</li>
</ul>
<p><strong>Coordinate Prediction:</strong></p>
<ul>
<li>Continuous X/Y coordinates are discretized into <strong>200 discrete bins</strong></li>
<li>Padding tokens added to coordinate sequences to align perfectly with SMILES token sequence, enabling simultaneous structure and pose prediction</li>
</ul>
<h3 id="data">Data</h3>
<p><strong>Training Data:</strong></p>
<ul>
<li><strong>Synthetic Generation:</strong> Python script rendering PubChem SMILES into 2D images</li>
<li><strong>Dataset Size:</strong> Approximately 20.3 million synthetic molecular images from PubChem</li>
<li><strong>Superatom Handling:</strong> 50% of molecules had functional groups replaced with superatoms (e.g., &ldquo;COOH&rdquo;) or generic labels (R1, X1) to match literature drawing conventions</li>
<li><strong>Rendering Augmentation:</strong> Randomized bond thickness, bond spacing, font size, font color, and padding size</li>
</ul>
<p><strong>Geometric Augmentation:</strong></p>
<ul>
<li>Shear along x-axis: $\pm 15^\circ$</li>
<li>Rotation: $\pm 15^\circ$</li>
<li>Piecewise affine scaling</li>
</ul>
<p><strong>Noise Injection:</strong></p>
<ul>
<li>Pepper noise: 0-2%</li>
<li>Salt noise: 0-40%</li>
<li>Gaussian noise: scale 0-0.16</li>
</ul>
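<p>The salt/pepper injection can be sketched as a per-pixel operation on a grayscale image (here a plain list of rows; treating the quoted percentages as per-pixel probabilities is an assumption):</p>

```python
import random

def add_salt_pepper(image, pepper=0.02, salt=0.40, rng=None):
    """Flip each pixel to black (pepper) with probability `pepper`,
    to white (salt) with probability `salt`, else leave it unchanged."""
    rng = rng or random.Random()
    out = []
    for row in image:
        new_row = []
        for px in row:
            r = rng.random()
            if r < pepper:
                new_row.append(0)    # pepper: black
            elif r < pepper + salt:
                new_row.append(255)  # salt: white
            else:
                new_row.append(px)
        out.append(new_row)
    return out
```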
<p><strong>Destructive Augmentation:</strong></p>
<ul>
<li>JPEG compression: severity levels 2-5</li>
<li>Random masking</li>
</ul>
<p><strong>Evaluation Datasets:</strong></p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
<li><strong>Color background images</strong>: 200 samples</li>
<li><strong>Low-quality images</strong>: 200 samples</li>
<li><strong>Hand-drawn structures</strong>: Test set for generalization</li>
<li><strong>End-to-end document extraction</strong>: 50 PDFs (567 pages, 2,336 molecular images)</li>
</ul>
<h3 id="training">Training</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 100</li>
<li><strong>Epochs:</strong> 5</li>
<li><strong>Loss Function:</strong> Cross-entropy loss for both SMILES prediction and coordinate prediction</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 24</li>
<li><strong>Training Strategy:</strong> Pre-trained on synthetic &ldquo;Lower Quality&rdquo; data for 5 epochs, then fine-tuned on annotated real &ldquo;High Quality&rdquo; data for 30 epochs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics:</strong></p>
<ul>
<li><strong>Recognition</strong>: SMILES accuracy (exact match)</li>
<li><strong>End-to-End Pipeline</strong>:
<ul>
<li><strong>Recall</strong>: 95.1% for detection</li>
<li><strong>Accuracy</strong>: 94.5% for recognition</li>
</ul>
</li>
</ul>
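<p>The two pipeline figures follow directly from the raw counts reported in the evaluation:</p>

```python
# Counts from the end-to-end evaluation (50 PDFs, 567 pages).
total_molecules = 2336  # molecular images present in the documents
detected = 2221         # images located by the detection model
correct = 2098          # detected images recognized correctly

detection_recall = detected / total_molecules  # ~0.951
recognition_accuracy = correct / detected      # ~0.945
```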
<h3 id="hardware">Hardware</h3>
<p><strong>Inference Hardware:</strong></p>
<ul>
<li>Cloud CPU server (8 CPUs, 64 GB RAM)</li>
<li><strong>Throughput:</strong> Processed 50 PDFs (567 pages) in 40 minutes</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xiong, J., Liu, X., Li, Z., Xiao, H., Wang, G., Niu, Z., Fei, C., Zhong, F., Wang, G., Zhang, W., Fu, Z., Liu, Z., Chen, K., Jiang, H., &amp; Zheng, M. (2024). αExtractor: A system for automatic extraction of chemical information from biomedical literature. <em>Science China Life Sciences</em>, 67(3), 618-621. <a href="https://doi.org/10.1007/s11427-023-2388-x">https://doi.org/10.1007/s11427-023-2388-x</a></p>
<p><strong>Publication</strong>: Science China Life Sciences (2024)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.1007/s11427-023-2388-x">Paper on Springer</a></li>
</ul>
]]></content:encoded></item><item><title>What is Optical Chemical Structure Recognition (OCSR)?</title><link>https://hunterheidenreich.com/posts/what-is-ocsr/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/what-is-ocsr/</guid><description>A micro-review of Optical Chemical Structure Recognition (OCSR), covering rule-based systems to modern deep learning models.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Decades of chemical research, breakthroughs in medicine, and novel materials are archived in journals, patents, and textbooks.
A huge portion of this knowledge is stored as images, a format inaccessible to standard computational tools.
This hampers both data retrieval and the use of modern computational tools to analyze and predict chemical properties, inefficiencies that, when scaled across global research efforts, represent <a href="https://onlinelibrary.wiley.com/doi/full/10.1002/cam4.1454">substantial losses in scientific progress and potential human benefit</a>.</p>
<p>This is the central challenge that <strong>Optical Chemical Structure Recognition (OCSR)</strong> aims to solve. At its heart, OCSR is to chemistry what OCR (Optical Character Recognition) is to text: a technology that teaches computers to extract chemical information directly from 2D diagrams of molecules. It&rsquo;s the bridge between a picture of a molecule and a machine-readable format like <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> (Simplified Molecular Input Line Entry System) that can be stored, searched, and used to power new discoveries.</p>















<figure class="post-figure center ">
    <img src="/img/ocsr/img2smiles.webp"
         alt="The transformation from a 2D chemical structure image to a SMILES representation."
         title="The transformation from a 2D chemical structure image to a SMILES representation."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The transformation from a 2D chemical structure image to a SMILES representation.</figcaption>
    
</figure>

<p>Teaching a computer to read a chemical structure requires specialized techniques.</p>
<h2 id="the-complexity-of-chemical-graphs">The Complexity of Chemical Graphs</h2>
<p>Recognizing a molecule requires more than standard Optical Character Recognition (OCR), because a molecule is not a string of characters but a <em>graph</em>: a collection of atoms (nodes) connected by bonds (edges).</p>
<blockquote>
<p>(While this simplified view excludes complex structures like coordination compounds and polymers, it provides a highly effective starting point for this discussion.)</p></blockquote>
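<p>In code, this graph view is tiny. A hand-built sketch for ethanol (CCO), with hydrogens left implicit as in most 2D drawings:</p>

```python
# A molecule as a graph: atoms are nodes, bonds are edges.
# Ethanol (CCO) as a hand-built toy example, hydrogens implicit.

atoms = ["C", "C", "O"]           # node labels, indexed 0..2
bonds = [(0, 1, 1), (1, 2, 1)]    # (atom_i, atom_j, bond order)

# Build an adjacency list: the structure an OCSR system must recover.
adjacency = {i: [] for i in range(len(atoms))}
for i, j, order in bonds:
    adjacency[i].append((j, order))
    adjacency[j].append((i, order))

print(adjacency[1])   # the middle carbon bonds to both neighbors: [(0, 1), (2, 1)]
```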
<p>An OCSR system must overcome several hurdles:</p>
<ul>
<li><strong>Varying Styles:</strong> Chemical drawings vary widely across publications. Bond lengths, angles, and fonts can differ dramatically from one document to another.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/ocsr/acs.orglett.2c02187_1.webp"
         alt="An example from the Colored Background OCSR Benchmark, showing a complex and colorful chemical structure."
         title="An example from the Colored Background OCSR Benchmark, showing a complex and colorful chemical structure."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example from the <a href="https://huggingface.co/datasets/hheiden/Colored_Background_OCSR_benchmark">Colored Background OCSR Benchmark</a>, showing a complex and colorful chemical structure.</figcaption>
    
</figure>

<ul>
<li><strong>Image Quality:</strong> Older documents might be scanned at low resolutions, containing noise, blur, or other artifacts that make interpretation difficult.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/ocsr/2008239616_449_chem.webp"
         alt="A challenging chemical structure image from the JPO benchmark, difficult due to its low quality."
         title="A challenging chemical structure image from the JPO benchmark, difficult due to its low quality."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A challenging chemical structure image from the <a href="https://huggingface.co/datasets/hheiden/JPO_OCSR_benchmark">JPO benchmark</a>, difficult due to its low quality.</figcaption>
    
</figure>

<ul>
<li><strong>Structural Complexity:</strong> From simple rings to sprawling polymers and complex <strong>Markush structures</strong> (common in patents to represent a whole family of related compounds), the variety is immense.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/ocsr/markush.webp"
         alt="An example of a Markush structure, illustrating the complexity and variety of chemical compounds."
         title="An example of a Markush structure, illustrating the complexity and variety of chemical compounds."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example of a Markush structure, illustrating the complexity and variety of chemical compounds.</figcaption>
    
</figure>

<h2 id="the-evolution-of-ocsr">The Evolution of OCSR</h2>
<p>The quest to automate this process has evolved significantly, moving from brittle, hand-coded systems to sophisticated AI that can learn from data.</p>
<h3 id="act-1-the-rule-based-pioneers-ocr-10">Act 1: The Rule-Based Pioneers (OCR-1.0)</h3>
<p>The first OCSR systems, developed in the early 1990s, represent what we can now call the <strong>&ldquo;OCR-1.0&rdquo; era</strong>. Tools like <a href="https://pubs.acs.org/doi/10.1021/ci00008a018">Kekulé</a>, and later open-source solutions like <a href="/notes/computational-chemistry/optical-structure-recognition/osra/">OSRA</a> and <a href="https://github.com/ncats/molvec">MolVec</a>, operated like meticulous draftsmen. Their approach was methodical:</p>
<ol>
<li><strong>Vectorize the Image:</strong> Convert the pixel-based image into a collection of lines and shapes</li>
<li><strong>Identify Components:</strong> Use a set of hard-coded rules to classify these components. &ldquo;This thick line is a wedge bond.&rdquo; &ldquo;This group of pixels is the letter &lsquo;O&rsquo;.&rdquo;</li>
<li><strong>Reconstruct the Graph:</strong> Piece together the identified atoms and bonds into a coherent molecular graph</li>
</ol>
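<p>Schematically, the pipeline looks like this. All of the primitives and rules below are invented stand-ins (real tools like OSRA work on vectorized line segments and recognized glyphs), but the shape of the logic, and its failure when no rule matches, is the point:</p>

```python
# A toy rendering of the three-stage rule-based pipeline. The
# primitives and rules are invented for illustration only.

def vectorize(image):
    """Stage 1: pretend we turned pixels into labeled primitives."""
    return [("line", "thin"), ("line", "wedge"), ("glyph", "O")]

def classify(primitive):
    """Stage 2: hard-coded rules mapping primitives to chemistry."""
    kind, detail = primitive
    if kind == "line":
        return ("bond", "stereo" if detail == "wedge" else "single")
    if kind == "glyph":
        return ("atom", detail)
    raise ValueError(f"no rule for {primitive!r}")  # the brittleness

def reconstruct(image):
    """Stage 3: assemble classified pieces into atoms and bonds."""
    pieces = [classify(p) for p in vectorize(image)]
    atoms = [d for k, d in pieces if k == "atom"]
    bonds = [d for k, d in pieces if k == "bond"]
    return atoms, bonds

print(reconstruct(None))  # (['O'], ['single', 'stereo'])
```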
<p>This rule-based approach was groundbreaking but brittle. It struggled with the messiness of real-world documents and was expensive to maintain because each new style or error required new rules.</p>
<p>Additionally, they were designed as interactive tools to assist human experts in digitizing chemical structures.
There was always the assumption that a human would review and correct the output.</p>
<p>As a concrete case study, consider the results reproduced from <a href="https://arxiv.org/abs/2411.11098">MolParser</a>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Method</th>
          <th style="text-align: center">USPTO</th>
          <th style="text-align: center">UoB</th>
          <th style="text-align: center">CLEF</th>
          <th style="text-align: center">JPO</th>
          <th style="text-align: center">ColoredBG</th>
          <th style="text-align: center">USPTO-10K</th>
          <th style="text-align: center">WildMol-10K</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Rule-based methods</strong></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
      </tr>
      <tr>
          <td style="text-align: left">OSRA 2.1 *</td>
          <td style="text-align: center">89.3</td>
          <td style="text-align: center">86.3</td>
          <td style="text-align: center"><strong>93.4</strong></td>
          <td style="text-align: center">56.3</td>
          <td style="text-align: center">5.5</td>
          <td style="text-align: center">89.7</td>
          <td style="text-align: center">26.3</td>
      </tr>
      <tr>
          <td style="text-align: left">MolVec 0.9.7 *</td>
          <td style="text-align: center">91.6</td>
          <td style="text-align: center">79.7</td>
          <td style="text-align: center">81.2</td>
          <td style="text-align: center">66.8</td>
          <td style="text-align: center">8.0</td>
          <td style="text-align: center">92.4</td>
          <td style="text-align: center">26.4</td>
      </tr>
      <tr>
          <td style="text-align: left">Imago 2.0 *</td>
          <td style="text-align: center">89.4</td>
          <td style="text-align: center">63.9</td>
          <td style="text-align: center">68.2</td>
          <td style="text-align: center">41.0</td>
          <td style="text-align: center">2.0</td>
          <td style="text-align: center">89.9</td>
          <td style="text-align: center">6.9</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Only synthetic training</strong></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
      </tr>
      <tr>
          <td style="text-align: left">Img2Mol *</td>
          <td style="text-align: center">30.0</td>
          <td style="text-align: center">68.1</td>
          <td style="text-align: center">17.9</td>
          <td style="text-align: center">16.1</td>
          <td style="text-align: center">3.5</td>
          <td style="text-align: center">33.7</td>
          <td style="text-align: center">24.4</td>
      </tr>
      <tr>
          <td style="text-align: left">MolGrapher †*</td>
          <td style="text-align: center">91.5</td>
          <td style="text-align: center"><strong>94.9</strong></td>
          <td style="text-align: center">90.5</td>
          <td style="text-align: center">67.5</td>
          <td style="text-align: center">7.5</td>
          <td style="text-align: center">93.3</td>
          <td style="text-align: center">45.5</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Real data finetuning</strong></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
      </tr>
      <tr>
          <td style="text-align: left">DECIMER 2.7 *</td>
          <td style="text-align: center">59.9</td>
          <td style="text-align: center">88.3</td>
          <td style="text-align: center">72.0</td>
          <td style="text-align: center">64.0</td>
          <td style="text-align: center">14.5</td>
          <td style="text-align: center">82.4</td>
          <td style="text-align: center">56.0</td>
      </tr>
      <tr>
          <td style="text-align: left">MolScribe *</td>
          <td style="text-align: center"><u>93.1</u></td>
          <td style="text-align: center">87.4</td>
          <td style="text-align: center">88.9</td>
          <td style="text-align: center">76.2</td>
          <td style="text-align: center">21.0</td>
          <td style="text-align: center"><strong>96.0</strong></td>
          <td style="text-align: center">66.4</td>
      </tr>
      <tr>
          <td style="text-align: left">MolParser-Tiny (Ours)</td>
          <td style="text-align: center">93.0</td>
          <td style="text-align: center">91.6</td>
          <td style="text-align: center"><u>91.0</u></td>
          <td style="text-align: center">75.6</td>
          <td style="text-align: center"><strong>58.5</strong></td>
          <td style="text-align: center">89.5</td>
          <td style="text-align: center">73.1</td>
      </tr>
      <tr>
          <td style="text-align: left">MolParser-Small (Ours)</td>
          <td style="text-align: center"><strong>93.1</strong></td>
          <td style="text-align: center">91.1</td>
          <td style="text-align: center">90.8</td>
          <td style="text-align: center">76.2</td>
          <td style="text-align: center">57.0</td>
          <td style="text-align: center"><u>94.8</u></td>
          <td style="text-align: center">76.3</td>
      </tr>
      <tr>
          <td style="text-align: left">MolParser-Base (Ours)</td>
          <td style="text-align: center">93.0</td>
          <td style="text-align: center"><u>91.8</u></td>
          <td style="text-align: center">90.7</td>
          <td style="text-align: center"><strong>78.9</strong></td>
          <td style="text-align: center">57.0</td>
          <td style="text-align: center">94.5</td>
          <td style="text-align: center"><strong>76.9</strong></td>
      </tr>
  </tbody>
</table>
<blockquote>
<p><strong>Table 2. Comparison of our method with existing OCSR models.</strong> We report the accuracy. We use <strong>bold</strong> to indicate the best performance and <u>underline</u> to denote the second-best performance. *: re-implemented results. †: results from original publications.</p></blockquote>
<p>In this table, we see that the rule-based methods (OSRA, MolVec, Imago) perform reasonably well on cleaner datasets like USPTO and UoB but falter on more challenging ones like JPO and ColoredBG. Modern AI-based methods (MolGrapher, DECIMER, MolScribe, MolParser) show significant improvements across the board, especially when fine-tuned on real data.</p>
<h3 id="act-2-the-ai-fork-in-the-road-2010s-2020s">Act 2: The AI Fork in the Road (2010s-2020s)</h3>
<p>The rise of deep learning in the 2010s brought new paradigms that could learn from data. Here, the field split into two distinct paths.</p>
<h4 id="path-a-the-rise-of-the-specialists-graph-based-ai">Path A: The Rise of the Specialists (Graph-Based AI)</h4>
<p>Some models replaced the hard-coded rules with AI components. Systems like <a href="https://github.com/DS4SD/MolGrapher">MolGrapher</a> and <a href="https://github.com/thomas0809/MolScribe">MolScribe</a> use a two-stage process:</p>
<ul>
<li><strong>Atom Detection:</strong> A neural network first identifies all the atoms in the image</li>
<li><strong>Bond Prediction:</strong> A second process then predicts the connections (bonds) between those atoms to form the final graph</li>
</ul>
<p>These are highly specialized tools, trained specifically for the task of building a molecular graph.</p>
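<p>The two-stage shape can be sketched as follows. Both stages are stubbed with toy logic: atoms at hand-picked coordinates, and a simple distance threshold standing in for a learned bond predictor:</p>

```python
# A schematic of the two-stage graph-based approach: a detector
# proposes atoms with coordinates, then a pairwise step predicts
# bonds. Real systems use neural networks for each stage; the
# coordinates and threshold here are invented.

import math

def detect_atoms(image):
    """Stage 1 stand-in: atoms with labels and (x, y) positions."""
    return [("C", 0.0, 0.0), ("C", 1.0, 0.0), ("O", 2.0, 0.1)]

def predict_bonds(atoms, max_dist=1.2):
    """Stage 2 stand-in: bond any pair closer than a threshold."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            _, xi, yi = atoms[i]
            _, xj, yj = atoms[j]
            if math.hypot(xi - xj, yi - yj) < max_dist:
                bonds.append((i, j))
    return bonds

atoms = detect_atoms(None)
print(predict_bonds(atoms))   # [(0, 1), (1, 2)]
```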
<h4 id="path-b-the-rise-of-the-generalists-lvlms">Path B: The Rise of the Generalists (LVLMs)</h4>
<p>Another, more direct method treats OCSR as an image captioning task. This approach aligns with the broader trend of <strong>Large Vision-Language Models (LVLMs)</strong>: massive, general-purpose AIs like GPT-4V. Models like <a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER</a> and <a href="/notes/computational-chemistry/optical-structure-recognition/mol-parser/">MolParser</a> look at a molecular image and directly generate its textual representation, most commonly a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES string</a>. This direct, end-to-end approach is powerful, though it requires enormous datasets to train effectively.</p>
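<p>The captioning view reduces to an autoregressive decoding loop. In this sketch a canned lookup table stands in for a trained encoder-decoder, and tokens are single characters; only the loop structure mirrors real systems:</p>

```python
# The end-to-end captioning view: an encoder turns the image into
# features, and a decoder emits the SMILES string one token at a
# time. The "model" is a canned lookup standing in for a trained
# network; the autoregressive loop shape is what matters.

def encode(image):
    return "features-for-ethanol"   # stand-in for a vision encoder

def next_token(features, prefix):
    canned = {"": "C", "C": "C", "CC": "O", "CCO": "<eos>"}
    return canned["".join(prefix)]

def image_to_smiles(image, max_len=16):
    features, tokens = encode(image), []
    for _ in range(max_len):
        tok = next_token(features, tokens)
        if tok == "<eos>":
            break
        tokens.append(tok)
    return "".join(tokens)

print(image_to_smiles(None))   # CCO
```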
<h2 id="the-next-frontier-the-ocr-20-vision-2024">The Next Frontier: The OCR-2.0 Vision (2024+)</h2>
<p>Recently, a proposal has emerged that charts a third path forward: <strong>OCR-2.0</strong>. This vision, proposed by <a href="https://arxiv.org/abs/2409.01704">Wei et al.</a> in 2024, argues for a new class of models that combine the best of both worlds. An OCR-2.0 model should be:</p>
<ol>
<li><strong>End-to-End:</strong> A single, unified model that simplifies maintenance</li>
<li><strong>Efficient &amp; Low-Cost:</strong> A specialized, highly efficient perception engine. The paper argues that using a giant LVLM for a pure recognition task is often inefficient</li>
<li><strong>Versatile:</strong> Capable of handling diverse artificial optical signals</li>
</ol>
<p>The flagship model for this theory is <a href="https://huggingface.co/stepfun-ai/GOT-OCR2_0">GOT (General OCR Theory)</a>. It&rsquo;s a single, unified model that can read an image and output structured text for a wide variety of inputs. It can translate a molecular diagram into a SMILES string, transcribe sheet music into musical notation, parse a bar chart into a data table, and describe a geometric shape using code.</p>
<p>This demonstrates that OCSR can be integrated into broader systems for processing human visual information. The same OCR-2.0 philosophy extends beyond chemistry: <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>, for instance, applies grounded vision-language modeling to general document OCR, producing both text transcriptions and bounding-box outputs from a single model.</p>
<h2 id="pushing-the-boundaries-of-recognition">Pushing the Boundaries of Recognition</h2>
<p>OCR-2.0 models like GOT push for <em>breadth</em>, and other state-of-the-art research deepens the <em>depth</em> of understanding for the uniquely complex task of chemical recognition.</p>
<h3 id="deepening-reasoning-with-a-visual-chain-of-thought">Deepening Reasoning with a &ldquo;Visual Chain of Thought&rdquo;</h3>
<p>The <a href="https://arxiv.org/abs/2506.07553">GTR-Mol-VLM</a> model makes recognition more intelligent by mimicking how a person might analyze a complex diagram. The model traverses the molecule step-by-step, predicting an atom, then its bond, then the next atom, and so on. This &ldquo;Visual Chain of Thought&rdquo; improves accuracy, especially for complex molecules. It also faithfully recognizes abbreviations like &ldquo;Ph&rdquo; as single units, better representing the source image.</p>
<h3 id="deepening-application-with-visual-fingerprinting">Deepening Application with &ldquo;Visual Fingerprinting&rdquo;</h3>
<p><a href="https://link.springer.com/article/10.1186/s13321-025-01091-4">Subgrapher</a> brilliantly rethinks the end goal. Many applications (like searching a patent database) require only the identification of specific molecular features. Subgrapher detects key functional groups and backbones directly from the image and creates a visual fingerprint. This approach mirrors identifying a person by key features (&ldquo;has glasses, a mustache&rdquo;), making it incredibly efficient for finding matches in a crowd.</p>
<h2 id="why-it-matters">Why It Matters</h2>
<p>The evolution of OCSR directly enables practical scientific advances on several fronts.</p>
<h3 id="unlocking-past-knowledge">Unlocking Past Knowledge</h3>
<p>OCSR digitizes decades of research from patents and journals, making it searchable and accessible for data mining. Imagine being able to search through every molecule ever published with a simple query. Or consider the practical impact: pharmaceutical companies can now automatically scan thousands of patent documents to ensure their new drug candidates don&rsquo;t infringe existing intellectual property, a process that previously required armies of human experts.</p>
<h3 id="accelerating-drug-discovery">Accelerating Drug Discovery</h3>
<p>By extracting vast datasets of molecules, scientists can train AI models to predict drug efficacy and toxicity, speeding up the discovery pipeline. The more molecular data we can digitize, the better our predictive models become.</p>
<h3 id="building-universal-document-intelligence">Building Universal Document Intelligence</h3>
<p>OCSR contributes to building AI systems capable of processing complex human documents. A scientific paper is a mix of text, equations, charts, tables, and molecular diagrams. Unified OCR-2.0 models are the key to unlocking all of this knowledge holistically.</p>
<h2 id="looking-forward">Looking Forward</h2>
<p>The ultimate goal is to create a seamless loop where scientific knowledge, no matter how it&rsquo;s stored, can be continuously fed into intelligent systems that help us understand our world and build a better one.</p>
<p>From the rigid rule-based systems of the 1990s to today&rsquo;s sophisticated AI that can read molecular diagrams as naturally as we read text, OCSR has come a long way. The field continues to accelerate. As these systems become more accurate, efficient, and versatile, they&rsquo;ll unlock new possibilities for scientific discovery that we can barely imagine today.</p>
<p>This entire process begins with teaching a computer how to read a picture.</p>
]]></content:encoded></item><item><title>SubGrapher: Visual Fingerprinting of Chemical Structures</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/subgrapher/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/subgrapher/</guid><description>Novel OCSR method creating molecular fingerprints from images through functional group segmentation for database retrieval.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Meijer, G. I., Weber, V., Van Gool, L., &amp; Staar, P. W. J. (2025). Subgrapher: Visual fingerprinting of chemical structures. Journal of Cheminformatics, 17(1), 149. <a href="https://doi.org/10.1186/s13321-025-01091-4">https://doi.org/10.1186/s13321-025-01091-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics (2025)</p>
<h2 id="paper-classification-and-taxonomy">Paper Classification and Taxonomy</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution. Using the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> framework:</p>
<p><strong>Primary Classification: Method</strong></p>
<p>The dominant basis vector is Methodological because SubGrapher introduces a novel architecture that fundamentally changes the OCSR workflow from two-step reconstruction (image → structure → fingerprint) to single-step fingerprinting (image → visual fingerprint). The paper validates this approach through systematic comparison against state-of-the-art methods (MolGrapher, OSRA, DECIMER, MolScribe), demonstrating superior performance on specific tasks like retrieval and robustness to image quality degradation.</p>
<p><strong>Secondary Classification: Resource</strong></p>
<p>The paper makes non-negligible resource contributions by releasing:</p>
<ul>
<li>Code and model weights on <a href="https://github.com/DS4SD/SubGrapher">GitHub</a> and <a href="https://huggingface.co/ds4sd/SubGrapher">HuggingFace</a></li>
<li>Five new visual fingerprinting benchmark datasets for molecule retrieval tasks</li>
<li>Comprehensive functional group knowledge base (1,534 substructures)</li>
</ul>
<h2 id="motivation-extracting-complex-structures-from-noisy-images">Motivation: Extracting Complex Structures from Noisy Images</h2>
<p>The paper tackles a fundamental challenge in chemical informatics: extracting molecular information from the vast amounts of unstructured scientific literature, particularly patents. Millions of molecular structures exist only as images in these documents, making them inaccessible for computational analysis, database searches, or machine learning applications.</p>
<p>Traditional Optical Chemical Structure Recognition (OCSR) tools attempt to fully reconstruct molecular graphs from images, converting them into machine-readable formats like SMILES. However, these approaches face two critical limitations:</p>
<ol>
<li><strong>Brittleness to image quality</strong>: Poor resolution, noise, or unconventional drawing styles frequently cause complete failure</li>
<li><strong>Inability to handle complex structures</strong>: Markush structures, generic molecular templates with variable R-groups commonly used in patents, cannot be processed by conventional OCSR methods</li>
</ol>
<p>The key insight driving SubGrapher is that full molecular reconstruction may be unnecessary for many applications. For tasks like database searching, similarity analysis, or document retrieval, a molecular fingerprint (a vectorized representation capturing structural features) is often sufficient. This realization opens up a new approach: bypass the fragile reconstruction step and create fingerprints directly from visual information.</p>
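<p>The contrast can be made concrete. In this toy sketch, a two-step pipeline returns nothing when reconstruction fails on a hard image, while direct fingerprinting still yields a usable vector; every function is an illustrative stub:</p>

```python
# Two-step (image -> SMILES -> fingerprint) vs. one-step
# (image -> fingerprint). All functions are illustrative stubs.

def reconstruct_smiles(image):
    """Stand-in OCSR: fails outright on hard images."""
    return None if image["quality"] == "low" else "CCO"

def fingerprint_from_smiles(smiles):
    return [smiles.count("C"), smiles.count("O")]

def fingerprint_from_pixels(image):
    """Stand-in visual fingerprint: degrades gracefully instead."""
    return image["visible_feature_counts"]

hard_image = {"quality": "low", "visible_feature_counts": [2, 1]}

smiles = reconstruct_smiles(hard_image)
two_step = fingerprint_from_smiles(smiles) if smiles else None
one_step = fingerprint_from_pixels(hard_image)
print(two_step, one_step)   # None [2, 1]
```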
<h2 id="key-innovation-direct-visual-fingerprinting">Key Innovation: Direct Visual Fingerprinting</h2>
<p>SubGrapher introduces a fundamentally different paradigm for extracting chemical information from images. It creates &ldquo;visual fingerprints&rdquo; through functional group recognition. The key innovations are:</p>
<ol>
<li>
<p><strong>Direct Image-to-Fingerprint Pipeline</strong>: SubGrapher eliminates the traditional two-step process (image → structure → fingerprint) by generating fingerprints directly from pixels. This single-stage approach avoids error accumulation from failed structure reconstructions and can handle images that would completely break conventional OCSR tools.</p>
</li>
<li>
<p><strong>Dual Instance Segmentation Architecture</strong>: The system employs two specialized Mask-RCNN networks working in parallel:</p>
<ul>
<li><strong>Functional group detector</strong>: Trained to identify 1,534 expert-defined functional groups using pixel-level segmentation masks</li>
<li><strong>Carbon backbone detector</strong>: Recognizes 27 common carbon chain patterns to capture the molecular scaffold</li>
</ul>
<p>Using instance segmentation provides detailed spatial information and higher accuracy through richer supervision during training.</p>
</li>
<li>
<p><strong>Extensive Functional Group Knowledge Base</strong>: The method leverages one of the most comprehensive open-source collections of functional groups, encompassing 1,534 substructures. These were systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li>
<p><strong>Substructure-Graph Construction</strong>: After detecting functional groups and carbon backbones, SubGrapher builds a connectivity graph where:</p>
<ul>
<li>Each node represents an identified substructure</li>
<li>Edges connect substructures whose bounding boxes overlap (with 10% margin expansion)</li>
<li>This graph captures both the chemical components and their spatial relationships</li>
</ul>
</li>
<li>
<p><strong>Substructure-based Visual Molecular Fingerprint (SVMF)</strong>: The final output is a continuous, count-based fingerprint formally defined as a matrix $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n = 1561$ (1,534 functional groups + 27 carbon backbones). The matrix is stored in compressed upper-triangular form:</p>
<p><strong>Diagonal elements</strong> ($i = j$): Weighted count of substructure $i$
$$SVMF_{ii}(m) = h_1 \cdot \text{count}(s_i)$$
where $h_1 = 10$ is the diagonal weight hyperparameter.</p>
<p><strong>Off-diagonal elements</strong> ($i \neq j$): Intersection coefficient based on shortest path distance $d$ in the substructure graph
$$SVMF_{ij}(m) = h_2(d) \cdot \text{intersection}(s_i, s_j)$$
where the distance decay function $h_2(d)$ is:</p>
<ul>
<li>$d \leq 1$: weight $= 2$</li>
<li>$d = 2$: weight $= 2/4 = 0.5$</li>
<li>$d = 3$: weight $= 2/16 = 0.125$</li>
<li>$d = 4$: weight $= 2/256 \approx 0.0078$</li>
<li>$d &gt; 4$: weight $= 0$</li>
</ul>
<p><strong>Key properties</strong>:</p>
<ul>
<li>Functional groups receive higher weight than carbon chains in the graph construction</li>
<li>Similarity between fingerprints calculated using Euclidean distance</li>
<li>Resulting fingerprints are highly sparse (average 0.001% non-zero elements)</li>
<li>Compressed storage enables efficient database searches</li>
</ul>
</li>
<li>
<p><strong>Markush Structure Compatibility</strong>: Unlike traditional OCSR methods that fail on generic structures, SubGrapher can process Markush structures by recognizing their constituent functional groups and creating meaningful fingerprints for similarity searches.</p>
</li>
</ol>
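<p>A minimal sketch of the SVMF construction on a toy substructure graph may help. The counts and graph below are invented, and the paper's intersection coefficient is simplified to a presence check, but the weights $h_1 = 10$ and $h_2(d)$ follow the values above:</p>

```python
from collections import deque

def h2(d):
    """Distance-decay weight from the paper (0 beyond d = 4)."""
    return {0: 2.0, 1: 2.0, 2: 0.5, 3: 0.125, 4: 2 / 256}.get(d, 0.0)

def shortest_path(adj, src, dst):
    """Breadth-first shortest-path length in the substructure graph."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("inf")

def svmf(counts, adj, h1=10.0):
    """Upper-triangular SVMF for a toy molecule.
    counts[i]: occurrences of substructure type i; adj: substructure graph.
    The intersection coefficient is simplified to a presence check."""
    n = len(counts)
    fp = [[0.0] * n for _ in range(n)]
    for i in range(n):
        fp[i][i] = h1 * counts[i]              # diagonal: weighted count
        for j in range(i + 1, n):
            if counts[i] and counts[j]:
                d = shortest_path(adj, i, j)
                fp[i][j] = h2(d)               # off-diagonal: decays with distance
    return fp

# Toy molecule: substructures 0 and 1 adjacent; 2 is two hops from 0.
counts = [1, 2, 1]
adj = {0: [1], 1: [0, 2], 2: [1]}
fp = svmf(counts, adj)
print(fp[0][0], fp[0][1], fp[0][2])   # 10.0 2.0 0.5
```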
<h2 id="experimental-validation-and-benchmarks">Experimental Validation and Benchmarks</h2>
<p>The evaluation focused on demonstrating SubGrapher&rsquo;s effectiveness across two critical tasks: accurate substructure detection and robust molecule retrieval from diverse image collections.</p>
<h4 id="substructure-detection-performance">Substructure Detection Performance</h4>
<p>SubGrapher&rsquo;s ability to identify functional groups was tested on three challenging benchmarks that expose different failure modes of OCSR systems:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Description</th>
          <th>Key Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>JPO</strong></td>
          <td>341 images</td>
          <td>Japanese Patent Office images</td>
          <td>Low quality, noise, artifacts, non-standard drawing styles</td>
      </tr>
      <tr>
          <td><strong>USPTO-10K-L</strong></td>
          <td>1,000 images</td>
          <td>Large molecules (&gt;70 atoms)</td>
          <td>Scale variation, structural complexity, many functional groups</td>
      </tr>
      <tr>
          <td><strong>USPTO-Markush</strong></td>
          <td>74 images</td>
          <td>Generic Markush structures</td>
          <td>Variable R-groups, abstract patterns, template representation</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings:</strong></p>
<ol>
<li>
<p><strong>JPO Dataset (Low-Quality Patent Images)</strong>: SubGrapher achieved the highest Molecule Exact Match rate (83%), demonstrating robustness to image quality degradation that breaks traditional rule-based methods.</p>
</li>
<li>
<p><strong>USPTO-10K-L (Large Molecules)</strong>: The object detection approach handled scale variation better than conventional OCSR tools, achieving superior Substructure F1-scores on these challenging targets.</p>
</li>
<li>
<p><strong>USPTO-Markush (Generic Structures)</strong>: SubGrapher was the only method capable of processing these images, as traditional OCSR tools cannot handle generic molecular templates.</p>
</li>
</ol>
<p>Qualitative analysis revealed that SubGrapher correctly identified functional groups in scenarios where other methods failed completely: images with captions, unconventional drawing styles, or significant quality degradation.</p>
<h4 id="visual-fingerprinting-for-molecule-retrieval">Visual Fingerprinting for Molecule Retrieval</h4>
<p>The core application was evaluated using a retrieval task designed to simulate real-world database searching:</p>
<ol>
<li>
<p><strong>Benchmark Creation</strong>: Five benchmark datasets were constructed around structurally similar molecules (adenosine, camphor, cholesterol, limonene, and pyridine), each containing 500 similar molecular images.</p>
</li>
<li>
<p><strong>Retrieval Task</strong>: Given a SMILES string as a query, the goal was to find the corresponding molecular image within the dataset of 500 visually similar structures. This tests whether the visual fingerprint can distinguish between closely related molecules.</p>
</li>
<li>
<p><strong>Performance Comparison</strong>: SubGrapher significantly outperformed baseline methods, retrieving the correct molecule at an average rank of 95 out of 500. The key advantage was robustness&mdash;SubGrapher generates a unique fingerprint for every image, even with partial or uncertain predictions. In contrast, OCSR-based methods frequently fail to produce valid SMILES, making them unable to generate fingerprints for comparison.</p>
</li>
<li>
<p><strong>Real-World Case Study</strong>: A practical demonstration involved searching a 54-page patent document containing 356 chemical images for a specific Markush structure. SubGrapher successfully located the target structure, highlighting its utility for large-scale document mining.</p>
</li>
</ol>
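<p>The retrieval step itself is a nearest-neighbor ranking under Euclidean distance. A deterministic miniature with three hand-made candidate fingerprints (the vectors are invented; only the metric matches the paper):</p>

```python
# Rank candidate image fingerprints by Euclidean distance to the
# fingerprint computed from the query. Vectors are toy values.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.2, 0.8, 0.1]                 # fingerprint derived from the SMILES query
candidates = {
    "img_001": [0.9, 0.1, 0.5],
    "img_002": [0.25, 0.75, 0.1],       # the true match, with slight recognition noise
    "img_003": [0.5, 0.5, 0.5],
}

ranked = sorted(candidates, key=lambda k: euclidean(query, candidates[k]))
print(ranked)   # ['img_002', 'img_003', 'img_001']
```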
<h4 id="training-data-generation">Training Data Generation</h4>
<p>Since no public datasets existed with the required pixel-level mask annotations for functional groups, the researchers developed a comprehensive synthetic data generation pipeline:</p>
<ol>
<li>
<p><strong>Extended MolDepictor</strong>: They enhanced existing molecular rendering tools to create images from SMILES strings and generate corresponding segmentation masks for all substructures present in each molecule.</p>
</li>
<li>
<p><strong>Markush Structure Rendering</strong>: The pipeline was extended to handle complex generic structures, creating training data for molecular templates that conventional tools cannot process.</p>
</li>
<li>
<p><strong>Diverse Molecular Sources</strong>: Training molecules were sourced from both ChEMBL and PubChem to ensure broad chemical diversity and coverage of different structural families.</p>
</li>
</ol>
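<p>The core trick of synthetic generation is that rendering from SMILES makes ground-truth localization free: the renderer knows where every substructure is before the image exists. In this stand-in sketch, character spans in the SMILES string play the role of pixel masks, and the two-entry substructure table is purely illustrative:</p>

```python
# Because we render the image ourselves from a SMILES string, the
# location of every substructure is known, so masks come for free.
# Rendering is replaced by a naive string-span stand-in here.

SUBSTRUCTURES = {"hydroxyl": "O", "carbonyl": "C(=O)"}

def annotate(smiles):
    """Return (image, masks): here masks are character spans in the
    SMILES string, standing in for pixel-level segmentation masks."""
    image = f"<rendering of {smiles}>"   # stand-in for a real renderer
    masks = {}
    for name, pattern in SUBSTRUCTURES.items():
        start = smiles.find(pattern)     # naive first-match lookup
        if start != -1:
            masks[name] = (start, start + len(pattern))
    return image, masks

image, masks = annotate("CC(=O)O")       # acetic acid
print(masks)
```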
<h2 id="results-impact-and-limitations">Results, Impact, and Limitations</h2>
<ul>
<li><strong>Superior Robustness to Image Quality</strong>: SubGrapher consistently outperformed traditional OCSR methods on degraded images, particularly the JPO patent dataset. While rule-based systems failed completely on poor-quality images, SubGrapher&rsquo;s learned representations proved resilient to noise, artifacts, and unconventional drawing styles.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SubGrapher</th>
          <th>MolScribe</th>
          <th>OSRA</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>S-F1</strong> (JPO)</td>
          <td>92</td>
          <td>94</td>
          <td>81</td>
          <td>SubGrapher comparable to SOTA on detection</td>
      </tr>
      <tr>
          <td><strong>M-EM</strong> (JPO)</td>
          <td><strong>83</strong></td>
          <td>82</td>
          <td>67</td>
          <td>SubGrapher wins on exact match for noisy images</td>
      </tr>
      <tr>
          <td><strong>Avg Retrieval Rank</strong></td>
          <td><strong>~95/500</strong></td>
          <td>&gt;100/500</td>
          <td>&gt;100/500</td>
          <td>SubGrapher significantly better at finding correct molecule</td>
      </tr>
  </tbody>
</table>
<ul>
<li>
<p><strong>Effective Handling of Scale and Complexity</strong>: The instance segmentation approach successfully managed large molecules and complex structures where traditional graph-reconstruction methods struggled. The Substructure F1-scores on USPTO-10K-L and USPTO-Markush benchmarks demonstrated clear advantages for challenging molecular targets.</p>
</li>
<li>
<p><strong>Breakthrough for Markush Structure Processing</strong>: SubGrapher represents the first practical solution for extracting information from Markush structures. These generic molecular templates appear frequently in patents but cannot be processed by conventional OCSR tools. This capability significantly expands the scope of automatically extractable chemical information from patent literature.</p>
</li>
<li>
<p><strong>Robust Molecule Retrieval Performance</strong>: The visual fingerprinting approach achieved reliable retrieval performance (average rank 95/500) across diverse molecular families. The key advantage was consistency&mdash;SubGrapher generates meaningful fingerprints even from partial or uncertain predictions, while OCSR-based methods often fail to produce any usable output.</p>
</li>
<li>
<p><strong>Practical Document Mining Capability</strong>: The successful identification of specific Markush structures within large patent documents (54 pages, 356 images) demonstrates real-world applicability for large-scale literature mining and intellectual property analysis.</p>
</li>
<li>
<p><strong>Single-Stage Architecture Benefits</strong>: By eliminating the traditional image → structure → fingerprint pipeline, SubGrapher avoids error accumulation from failed molecular reconstructions. Every input image produces a fingerprint, making the system more reliable for batch processing of diverse document collections.</p>
</li>
<li>
<p><strong>Limitations and Scope</strong>: The method remains focused on common organic functional groups and may struggle with inorganic chemistry, organometallic complexes, or highly specialized molecular classes not well-represented in the training data. The 1,534 functional groups, while extensive, represent a curated subset of chemical space.</p>
</li>
</ul>
<p>The work establishes a new paradigm for chemical information extraction from images, demonstrating that direct fingerprint generation can be more robust and practical than traditional structure reconstruction approaches. SubGrapher&rsquo;s ability to handle Markush structures and degraded images makes it particularly valuable for patent analysis and large-scale document mining, where traditional OCSR methods frequently fail. The approach suggests that task-specific learning (fingerprints for retrieval) can outperform general-purpose reconstruction methods in many practical applications.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data Generation</strong>: Because no public datasets existed with pixel-level mask annotations for functional groups, the authors developed a custom synthetic data pipeline:</p>
<ul>
<li><strong>Extended MolDepictor</strong>: Enhanced molecular rendering tool to generate both images and corresponding segmentation masks for all substructures</li>
<li><strong>Markush Structure Rendering</strong>: Pipeline extended to handle complex generic structures</li>
<li><strong>Source Molecules</strong>: ChEMBL and PubChem for broad chemical diversity</li>
</ul>
<p><strong>Evaluation Benchmarks</strong>:</p>
<ul>
<li><strong>JPO Dataset</strong>: Real patent images with poor resolution, noise, and artifacts</li>
<li><strong>USPTO-10K-L</strong>: Large complex molecular structures</li>
<li><strong>USPTO-Markush</strong>: Generic structures with variable R-groups</li>
<li><strong>Retrieval Benchmarks</strong>: Five datasets (adenosine, camphor, cholesterol, limonene, pyridine), each with 500 similar molecular images</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Dual instance segmentation system using Mask-RCNN</p>
<ul>
<li><strong>Functional Group Detector</strong>: Mask-RCNN trained to identify 1,534 expert-defined functional groups</li>
<li><strong>Carbon Backbone Detector</strong>: Mask-RCNN trained to recognize 27 common carbon chain patterns</li>
<li><strong>Backbone Network</strong>: Not specified in paper (typically ResNet-50/101 or Swin Transformer for Mask-RCNN)</li>
</ul>
<p><strong>Functional Group Knowledge Base</strong>: 1,534 substructures systematically defined by:</p>
<ul>
<li>Starting with chemically logical atom combinations (C, O, S, N, B, P)</li>
<li>Expanding to include relevant subgroups and variations</li>
<li>Filtering based on frequency (appearing ~1,000+ times in PubChem)</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Definition</strong>:</p>
<ul>
<li><strong>1,534 Functional Groups</strong>: Defined by manually curated SMARTS patterns
<ul>
<li>Must contain heteroatoms (O, N, S, P, B)</li>
<li>Frequency threshold: ~1,000+ occurrences in PubChem</li>
<li>Systematically constructed from chemically logical atom combinations</li>
<li>Manual curation with SMILES, SMARTS, and descriptive names</li>
</ul>
</li>
<li><strong>27 Carbon Backbones</strong>: Patterns of 3-6 carbon atoms (rings and chains) to capture molecular scaffolds</li>
</ul>
<p><strong>Substructure-Graph Construction</strong>:</p>
<ol>
<li>Detect functional groups and carbon backbones using Mask-RCNN models</li>
<li>Build connectivity graph:
<ul>
<li>Each node represents an identified substructure instance</li>
<li>Edges connect substructures whose bounding boxes overlap</li>
<li>Bounding boxes expanded by 10% of smallest box&rsquo;s diagonal to ensure connectivity between adjacent groups</li>
<li>Functional groups weighted higher than carbon chains in graph construction</li>
</ul>
</li>
</ol>
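<p>The connectivity step above can be sketched as follows. This is a minimal illustration of the described rule (edges between substructures whose bounding boxes, expanded by 10% of the smaller box&rsquo;s diagonal, overlap); the function and parameter names are illustrative, not from the paper&rsquo;s code.</p>

```python
import math

def expand_box(box, frac, diag):
    """Grow a bounding box (x0, y0, x1, y1) outward by frac * diag on each side."""
    pad = frac * diag
    x0, y0, x1, y1 = box
    return (x0 - pad, y0 - pad, x1 + pad, y1 + pad)

def boxes_overlap(a, b):
    """Axis-aligned overlap test for two (x0, y0, x1, y1) boxes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def build_substructure_graph(boxes, expand_frac=0.10):
    """Connect substructure instances whose expanded bounding boxes overlap.

    Boxes are expanded by 10% of the smaller box's diagonal, per the paper.
    Returns an edge list over box indices.
    """
    edges = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            diag = min(
                math.dist(boxes[i][:2], boxes[i][2:]),
                math.dist(boxes[j][:2], boxes[j][2:]),
            )
            a = expand_box(boxes[i], expand_frac, diag)
            b = expand_box(boxes[j], expand_frac, diag)
            if boxes_overlap(a, b):
                edges.append((i, j))
    return edges
```

<p>The node weighting (functional groups over carbon chains) would then be applied when converting this edge list into the final graph; that detail is omitted here.</p>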
<p><strong>SVMF Fingerprint Generation</strong>:</p>
<ul>
<li>Matrix form: $SVMF(m) \in \mathbb{R}^{n \times n}$ where $n=1561$</li>
<li>Stored as compressed sparse upper triangular matrix</li>
<li><strong>Diagonal elements</strong>: $SVMF_{ii} = h_1 \cdot \text{count}(s_i)$ where $h_1 = 10$</li>
<li><strong>Off-diagonal elements</strong>: $SVMF_{ij} = h_2(d) \cdot \text{intersection}(s_i, s_j)$ where:
<ul>
<li>$h_2(d) = 2$ for $d \leq 1$</li>
<li>$h_2(d) = 2/4^{(d-1)}$ for $d = 2, 3, 4$</li>
<li>$h_2(d) = 0$ for $d &gt; 4$</li>
</ul>
</li>
<li>Average sparsity: 0.001% non-zero elements</li>
<li>Similarity metric: Euclidean distance between fingerprint vectors</li>
</ul>
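<p>A minimal sketch of the SVMF construction and retrieval metric, using the weights defined above. One assumption is flagged in the code: $\text{intersection}(s_i, s_j)$ is treated here as a count of interacting instance pairs, which the paper does not spell out at this level of detail.</p>

```python
def h2(d):
    """Distance decay for off-diagonal SVMF entries: 2 for d <= 1, then a
    factor-of-4 decay out to d = 4, and 0 beyond."""
    if d <= 1:
        return 2.0
    if d <= 4:
        return 2.0 / 4 ** (d - 1)
    return 0.0

def svmf(counts, adjacent_pairs, h1=10.0):
    """Sparse upper-triangular SVMF as a {(i, j): value} dict.

    counts: {type_index: number of instances of that substructure type}
    adjacent_pairs: iterable of (i, j, d) triples, one per interacting pair of
    instances, where d is their graph distance. Treating intersection(s_i, s_j)
    as the count of such pairs is an assumption of this sketch.
    """
    fp = {(i, i): h1 * c for i, c in counts.items()}
    for i, j, d in adjacent_pairs:
        key = (min(i, j), max(i, j))
        fp[key] = fp.get(key, 0.0) + h2(d)
    return fp

def euclidean(fp_a, fp_b):
    """Euclidean distance between two sparse fingerprints (retrieval metric)."""
    keys = set(fp_a) | set(fp_b)
    return sum((fp_a.get(k, 0.0) - fp_b.get(k, 0.0)) ** 2 for k in keys) ** 0.5
```

<p>Storing the fingerprint as a sparse dict mirrors the paper&rsquo;s compressed upper-triangular storage: with ~0.001% non-zero elements, a dense $1561 \times 1561$ matrix would be almost entirely wasted space.</p>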
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Substructure F1-score (S-F1)</strong>: Harmonic mean of precision and recall for individual substructure detection across all molecules in the dataset</li>
<li><strong>Molecule Exact Match (M-EM)</strong>: Percentage of molecules where S-F1 = 1.0 (all substructures correctly identified)</li>
<li><strong>Retrieval Rank</strong>: Average rank of ground truth molecule in candidate list of 500 similar structures when querying with SMILES fingerprint</li>
</ul>
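<p>The two detection metrics can be sketched as below. The per-image treatment of S-F1 and its aggregation are assumptions of this sketch; the paper may micro-average across the dataset instead.</p>

```python
from collections import Counter

def substructure_f1(pred, true):
    """Per-image S-F1: harmonic mean of precision and recall over predicted
    vs. ground-truth substructure multisets."""
    p, t = Counter(pred), Counter(true)
    tp = sum(min(p[k], t[k]) for k in p)  # correctly detected instances
    if tp == 0:
        return 0.0
    precision = tp / sum(p.values())
    recall = tp / sum(t.values())
    return 2 * precision * recall / (precision + recall)

def molecule_exact_match(per_image_f1):
    """M-EM: fraction of images whose S-F1 is exactly 1.0."""
    return sum(f == 1.0 for f in per_image_f1) / len(per_image_f1)
```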
<p><strong>Baselines</strong>: Compared against SOTA OCSR methods:</p>
<ul>
<li>Deep learning: MolScribe, MolGrapher, DECIMER</li>
<li>Rule-based: OSRA</li>
<li>Fingerprint methods: RDKit Daylight, MHFP (applied to OCSR outputs)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper: training and inference hardware details are absent from the main text and would have to be sought in supplementary materials or the code repository.</p>
<h3 id="implementation-gaps">Implementation Gaps</h3>
<p>The following details are not available in the paper and would require access to the code repository or supplementary information:</p>
<ul>
<li>Specific backbone architecture for Mask-RCNN (ResNet variant, Swin Transformer, etc.)</li>
<li>Optimizer type (AdamW, SGD, etc.)</li>
<li>Learning rate and scheduler</li>
<li>Batch size and number of training epochs</li>
<li>Loss function weights (box loss vs. mask loss)</li>
<li>GPU/TPU specifications used for training</li>
<li>Training time and computational requirements</li>
</ul>
]]></content:encoded></item><item><title>MolRec: Rule-Based OCSR System</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/molrec_at_trec/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/molrec_at_trec/</guid><description>Rule-based system for optical chemical structure recognition using vectorization and geometric analysis, achieving 95% accuracy on TREC 2011.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2011). Performance of MolRec at TREC 2011 Overview and Analysis of Results. <em>Proceedings of the 20th Text REtrieval Conference</em>. <a href="https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf">https://trec.nist.gov/pubs/trec20/papers/UoB.chem.update.pdf</a></p>
<p><strong>Publication</strong>: TREC 2011</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openbabel.org/">Open Babel</a> - Used for semantic MOL file comparison</li>
<li><a href="https://sourceforge.net/projects/osra/">OSRA Project</a> - Source of superatom dictionary data</li>
</ul>
<h2 id="contribution-rule-based-ocsr-system">Contribution: Rule-Based OCSR System</h2>
<p>This is a <strong>Method</strong> paper that presents and validates MolRec, a rule-based system for Optical Chemical Structure Recognition (OCSR). While the paper emphasizes performance analysis on the TREC 2011 benchmark, the core contribution is the system architecture itself: a multi-stage pipeline using vectorization, geometric rule-based analysis, and graph construction to convert chemical diagram images into machine-readable MOL files.</p>
<h2 id="motivation-robust-conversion-of-chemical-diagrams">Motivation: Robust Conversion of Chemical Diagrams</h2>
<p>Chemical molecular diagrams are ubiquitous in scientific documents across chemistry and life sciences. There is a critical need to convert these static raster images into machine-readable formats (like MOL files) that encode precise spatial and connectivity information for cheminformatics applications such as database indexing, similarity searching, and automated literature mining.</p>
<p>While pixel-based pattern matching approaches exist, they struggle with variations in drawing style, image quality, and diagram complexity. A more robust approach is needed that can handle the geometric and topological diversity of real-world chemical diagrams.</p>
<h2 id="novelty-vectorization-and-geometric-rules">Novelty: Vectorization and Geometric Rules</h2>
<p>MolRec distinguishes itself through a robust <strong>vectorization and geometric rule-based pipeline</strong>. Key technical innovations include:</p>
<p><strong>Disk-Growing Heuristic for Wedge Bonds</strong>: A novel dynamic algorithm to distinguish wedge bonds from bold lines. A disk with radius greater than the average line width is placed inside the connected component and grown in the direction that allows maximum expansion. This locates the triangle base (stereo-center) and identifies the wedge orientation.</p>
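<p>The disk-growing idea can be approximated with a brute-force largest-inscribed-disk search on a small raster, shown below. This is not MolRec&rsquo;s actual algorithm (which grows and &ldquo;walks&rdquo; the disk dynamically); it only illustrates why the widest interior point localizes the triangle base of a wedge while the narrow tip yields a much smaller local radius.</p>

```python
def max_inscribed_disk(mask):
    """Brute-force largest inscribed disk in a binary mask (list of 0/1 rows).

    Returns (radius, (row, col)) of the widest interior point. On a wedge bond
    this lands near the triangle base (the stereo-center end); the narrow tip
    has a much smaller local radius, which distinguishes wedge orientation.
    """
    fg = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    bg = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if not v]
    best = (0.0, None)
    for (r, c) in fg:
        # local radius is limited by the nearest background pixel
        rad = min(((r - br) ** 2 + (c - bc) ** 2) ** 0.5 for (br, bc) in bg)
        if rad > best[0]:
            best = (rad, (r, c))
    return best

# A crude wedge: one pixel wide at the top, five pixels wide at the base.
wedge = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
]
```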
<p><strong>Joint Breaking Strategy</strong>: Explicitly breaking all connected joints in the vectorization stage to avoid combinatorial connection complexity. This allows uniform treatment of all line segment connections regardless of junction complexity.</p>
<p><strong>Superatom Dictionary Mining</strong>: The system mines MOL files from the OSRA dataset to build a comprehensive superatom dictionary (e.g., &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;), supplemented by the Marvin abbreviation collection.</p>
<p><strong>Comprehensive Failure Analysis</strong>: Unlike most OCSR papers that report only aggregate accuracy, this work provides a detailed categorization of all 55 failures, identifying 61 specific error reasons and their root causes.</p>
<h2 id="methodology-and-trec-2011-experiments">Methodology and TREC 2011 Experiments</h2>
<p><strong>Benchmark</strong>: The system was evaluated on the <strong>TREC 2011 Chemical Track</strong> test set consisting of 1,000 molecular diagram images. The authors performed two independent runs with slightly different internal parameter settings to assess reproducibility.</p>
<p><strong>Evaluation Metric</strong>: Correct recall of chemical structures. Output MOL files were compared semantically to ground truth using <strong>OpenBabel</strong>, which ignores syntactically different but chemically equivalent representations.</p>
<p><strong>Failure Analysis</strong>: The authors manually examined each of the 55 unique failures (from run 1) and categorized them, identifying 61 specific reasons for mis-recognition. This ablation-style analysis provides insight into systematic limitations of the rule-based approach.</p>
<h2 id="results-and-top-failure-modes">Results and Top Failure Modes</h2>
<p><strong>High Accuracy</strong>: MolRec achieved a <strong>95% correct recovery rate</strong> on the TREC 2011 benchmark:</p>
<ul>
<li>Run 1: 949/1000 structures correctly recognized</li>
<li>Run 2: 950/1000 structures correctly recognized</li>
</ul>
<p>The near-identical results across runs demonstrate the reproducibility of the rule-based approach.</p>
<p><strong>Top Failure Modes</strong> (from detailed analysis of 55 failures):</p>
<ul>
<li><strong>Dashed wedge bond misidentification (15 cases)</strong>: The most common failure; dashed wedge bonds were incorrectly interpreted as two separate connected bonds</li>
<li><strong>Incorrect stereochemistry (10 cases)</strong>: Heuristics guessed wrong 3D orientations for ambiguous bonds</li>
<li><strong>Touching components (6 cases)</strong>: Ink bleed caused characters and bonds to merge, breaking segmentation assumptions</li>
<li><strong>Broken characters</strong>: System lacks recovery mechanisms for degraded or partial characters</li>
<li><strong>Solid circles</strong>: Diagrams with solid circles but no explicit 3D hydrogen bonds confuse the stereochemistry logic</li>
</ul>
<p><strong>System Strengths</strong>:</p>
<ul>
<li>Douglas-Peucker line simplification proves faster and more robust than Hough transforms across different drawing styles</li>
<li>Disk-growing wedge bond detection effectively distinguishes 3D orientations in most cases</li>
<li>Mining MOL files for superatom dictionary captures real-world chemical abbreviation usage patterns</li>
</ul>
<p><strong>Fundamental Limitations Revealed</strong>:</p>
<ul>
<li><strong>Brittleness</strong>: Small variations in drawing style or image quality can cause cascading failures</li>
<li><strong>Stereochemistry ambiguity</strong>: Even humans disagree on ambiguous cases; automated resolution based purely on syntax is inherently limited</li>
<li><strong>Segmentation dependence</strong>: Most failures trace back to incorrect separation of text, bonds, and graphical elements</li>
<li><strong>No error recovery</strong>: Early-stage mistakes propagate through the pipeline with no mechanism for correction</li>
</ul>
<p>The systematic error analysis provides an honest assessment of what 95% accuracy means in practice. While impressive for clean benchmark sets, the failure modes suggest fundamental scalability challenges for rule-based systems when applied to diverse real-world documents with noise, artifacts, and non-standard conventions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dictionary Mining</td>
          <td>OSRA Dataset</td>
          <td>Unknown</td>
          <td>Mined to create superatom dictionary for abbreviations like &ldquo;Ph&rdquo;, &ldquo;COOH&rdquo;</td>
      </tr>
      <tr>
          <td>Dictionary</td>
          <td>Marvin Collection</td>
          <td>N/A</td>
          <td>Integrated Marvin abbreviation group collection for additional superatoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TREC 2011 Test Set</td>
          <td>1,000 images</td>
          <td>Standard benchmark for Text REtrieval Conference Chemical Track</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The MolRec pipeline consists of sequential image processing and graph construction stages:</p>
<p><strong>1. Preprocessing</strong></p>
<ul>
<li><strong>Binarization</strong>: Input image converted to binary</li>
<li><strong>Connected Component Labeling</strong>: Identifies distinct graphical elements</li>
<li><strong>OCR</strong>: Simple metric space-based engine identifies characters (letters $L$, digits $N$, symbols $S$)</li>
<li><strong>Character Grouping</strong>: Spatial proximity and type-based heuristics group characters:
<ul>
<li>Horizontal: Letter-Letter, Digit-Digit, Letter-Symbol</li>
<li>Vertical: Letter-Letter only</li>
<li>Diagonal: Letter-Digit, Letter-Charge</li>
</ul>
</li>
</ul>
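<p>The grouping heuristics above amount to a small lookup of allowed character-type pairings per relative position, which might be sketched as follows (the table encodes only the pairings listed above; type codes are illustrative).</p>

```python
# Allowed (type_a, type_b) pairings per relative position, following the
# grouping heuristics listed above; 'L' = letter, 'N' = digit, 'S' = symbol,
# 'C' = charge sign.
ALLOWED = {
    "horizontal": {("L", "L"), ("N", "N"), ("L", "S")},
    "vertical": {("L", "L")},
    "diagonal": {("L", "N"), ("L", "C")},
}

def may_group(type_a, type_b, relation):
    """Return True if two spatially adjacent characters may be merged into one
    label (e.g. 'C' + 'l' -> 'Cl', 'N' + subscript '2' -> 'N2')."""
    return (type_a, type_b) in ALLOWED.get(relation, set())
```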
<p><strong>2. Vectorization (Line Finding)</strong></p>
<ul>
<li><strong>Image Thinning</strong>: Reduce lines to unit width</li>
<li><strong>Douglas-Peucker Algorithm</strong>: Simplify polylines into straight line segments</li>
<li><strong>Joint Breaking</strong>: Explicitly split lines at junctions where $&gt;2$ segments meet, avoiding combinatorial connection complexity</li>
</ul>
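<p>For reference, the Douglas-Peucker simplification named above is the standard recursive algorithm; a minimal self-contained version is shown here (this is the textbook algorithm, not MolRec&rsquo;s implementation).</p>

```python
def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = (dx * dx + dy * dy) ** 0.5
    if norm == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dx * (ay - py) - dy * (ax - px)) / norm

def douglas_peucker(points, epsilon):
    """Simplify a polyline to straight segments, keeping only points that
    deviate from the chord by more than epsilon (on the order of the average
    line width for thinned chemical diagrams)."""
    if len(points) < 3:
        return list(points)
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]  # whole run collapses to one segment
    left = douglas_peucker(points[: idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right
```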
<p><strong>3. Bond Recognition Rules</strong></p>
<p>After erasing text from the image, remaining line segments are analyzed:</p>
<ul>
<li><strong>Double/Triple Bonds</strong>: Cluster segments with same slope within threshold distance</li>
<li><strong>Dashed Bonds</strong>: Identify repeated short segments of similar length with collinear center points</li>
<li><strong>Wedge/Bold Bonds</strong>: Dynamic disk algorithm:
<ul>
<li>Place disk with radius $&gt;$ average line width inside component</li>
<li>Grow disk to maximum size to locate triangle base (stereo-center)</li>
<li>&ldquo;Walk&rdquo; disk to find narrow end, distinguishing wedge orientation</li>
</ul>
</li>
<li><strong>Wavy Bonds</strong>: Identify sawtooth pattern polylines after thinning</li>
<li><strong>Implicit Nodes</strong>: Split longer segments at points where parallel shorter segments terminate (carbon atoms in chains)</li>
</ul>
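<p>The double/triple-bond rule (clustering segments of the same slope within a threshold distance) can be illustrated with a simplified greedy grouping; the thresholds here are placeholders, whereas the real ones would be tied to the image&rsquo;s average line width.</p>

```python
import math

def angle_of(seg):
    """Orientation of a segment ((x0, y0), (x1, y1)), folded into [0, pi)."""
    (x0, y0), (x1, y1) = seg
    return math.atan2(y1 - y0, x1 - x0) % math.pi

def midpoint(seg):
    (x0, y0), (x1, y1) = seg
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def multiplicity_clusters(segments, angle_tol=0.1, dist_tol=5.0):
    """Greedily group near-parallel, nearby segments: a 2-segment cluster is
    read as a double bond, a 3-segment cluster as a triple bond."""
    clusters = []
    for seg in segments:
        for cl in clusters:
            ref = cl[0]
            if (abs(angle_of(seg) - angle_of(ref)) < angle_tol
                    and math.dist(midpoint(seg), midpoint(ref)) < dist_tol):
                cl.append(seg)
                break
        else:
            clusters.append([seg])
    return clusters
```

<p>(A production version would also handle the angle wrap-around near 0/$\pi$; it is omitted here for brevity.)</p>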
<p><strong>4. Graph Construction</strong></p>
<ul>
<li><strong>Node Formation</strong>: Group line segment endpoints by distance threshold</li>
<li><strong>Disambiguation</strong>: Logic separates lowercase &ldquo;l&rdquo;, uppercase &ldquo;I&rdquo;, digit &ldquo;1&rdquo;, and vertical bonds</li>
<li><strong>Superatom Expansion</strong>: Replace abbreviations with full structures using mined dictionary</li>
<li><strong>Stereochemistry Resolution</strong>: Heuristics based on neighbor counts determine direction for ambiguous bold/dashed bonds (known limitation)</li>
</ul>
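<p>The node-formation step reduces to clustering segment endpoints under a distance threshold; a minimal single-link sketch (threshold value is a placeholder):</p>

```python
import math

def form_nodes(endpoints, threshold=3.0):
    """Merge segment endpoints within `threshold` of each other into atom
    nodes (greedy single-link clustering); returns one centroid per node."""
    nodes = []  # each node is a list of member endpoints
    for p in endpoints:
        for node in nodes:
            if any(math.dist(p, q) <= threshold for q in node):
                node.append(p)
                break
        else:
            nodes.append([p])
    # representative position for each node = centroid of its members
    return [tuple(sum(c) / len(node) for c in zip(*node)) for node in nodes]
```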
<p><strong>5. MOL File Generation</strong></p>
<ul>
<li>Final graph structure converted to standard MOL file format</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Run 1</th>
          <th>Run 2</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct Recall</td>
          <td>949/1000</td>
          <td>950/1000</td>
          <td>Slightly different internal parameters between runs</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>94.9%</td>
          <td>95.0%</td>
          <td>Semantic comparison using OpenBabel</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison Method</strong>: OpenBabel converts graphs to MOL files and compares them semantically to ground truth, ignoring syntactic variations that don&rsquo;t affect chemical meaning.</p>

<p><strong>Failure Categorization</strong>: 55 unique failures analyzed, identifying 61 specific error reasons across categories like bond misidentification, stereochemistry errors, touching components, and broken characters.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Details</strong>: Not explicitly specified in the paper</li>
<li><strong>Performance Note</strong>: Vectorization approach noted as &ldquo;proven to be fast&rdquo; compared to Hough transform alternatives</li>
</ul>
<h3 id="references">References</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{sadawiPerformanceMolRecTREC2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of {{MolRec}} at {{TREC}} 2011 {{Overview}} and {{Analysis}} of {{Results}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 20th {{Text REtrieval Conference}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sadawi, Noureddin M. and Sexton, Alan P. and Sorge, Volker}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2011}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolRec: Chemical Structure Recognition at CLEF 2012</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/molrec_at_clef/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/molrec_at_clef/</guid><description>MolRec achieves 95%+ accuracy on simple structures but struggles with complex diagrams, revealing rule-based OCSR limits and systematic failures.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sadawi, N. M., Sexton, A. P., &amp; Sorge, V. (2012). MolRec at CLEF 2012 - Overview and Analysis of Results. Working Notes of CLEF 2012 Evaluation Labs and Workshop. CLEF. <a href="http://ceur-ws.org/Vol-1178/CLEF2012wn-ImageCLEF-SadawiEt2012.pdf">http://ceur-ws.org/Vol-1178/CLEF2012wn-ImageCLEF-SadawiEt2012.pdf</a></p>
<p><strong>Publication</strong>: CLEF 2012 Workshop (ImageCLEF Track)</p>
<h2 id="systematization-of-rule-based-ocsr">Systematization of Rule-Based OCSR</h2>
<p>This is a <strong>Systematization</strong> paper that evaluates and analyzes MolRec&rsquo;s performance in the CLEF 2012 chemical structure recognition competition. The work provides systematic insights into how the improved MolRec system performed on different types of molecular diagrams and reveals structural challenges facing rule-based OCSR approaches through comprehensive failure analysis.</p>
<h2 id="investigating-the-limits-of-rule-based-recognition">Investigating the Limits of Rule-Based Recognition</h2>
<p>This work continues the story from the TREC 2011 evaluation, where MolRec achieved impressive 95% accuracy on 1000 molecular diagrams. The CLEF 2012 competition provided an opportunity to test an enhanced version of MolRec on different datasets and understand how performance varies across complexity levels.</p>
<p>The motivation is to understand exactly where rule-based chemical structure recognition breaks down. Examining the specific types of structures that cause failures provides necessary context for the reported 95% accuracy rate.</p>
<h2 id="the-two-stage-molrec-architecture">The Two-Stage MolRec Architecture</h2>
<p>The novelty lies in the systematic evaluation across two different difficulty levels and the comprehensive failure analysis. The authors tested an improved MolRec implementation that was more efficient than the TREC 2011 version, providing insights into both system evolution and the inherent challenges of chemical structure recognition.</p>
<p><strong>MolRec Architecture Overview</strong>: The system follows a two-stage pipeline approach:</p>
<ol>
<li>
<p><strong>Vectorization Stage</strong>: The system first preprocesses input images through several steps:</p>
<ul>
<li><strong>Binarization</strong> using Otsu&rsquo;s method to convert grayscale images to black and white</li>
<li><strong>OCR processing</strong> to identify and remove text components (atom labels, charges, etc.)</li>
<li><strong>Thinning</strong> to reduce the remaining diagram to single-pixel-width lines</li>
<li><strong>Geometric primitive extraction</strong> to identify lines, circles, arrows, and triangles</li>
<li><strong>Line simplification</strong> using the Douglas-Peucker algorithm to clean up vectorized bonds</li>
</ul>
</li>
<li>
<p><strong>Rule Engine Stage</strong>: A set of 18 chemical rules converts geometric primitives into molecular graphs:</p>
<ul>
<li><strong>Bridge bond recognition</strong> (applied first due to complexity)</li>
<li><strong>Standard bond and atom recognition</strong> (16 rules applied in any order)</li>
<li><strong>Context-aware disambiguation</strong> considering the entire graph structure</li>
<li><strong>Superatom expansion</strong> incorporating chemical abbreviations and groups</li>
</ul>
</li>
</ol>
<p>The system can output results in standard formats like MOL files or SMILES strings, making it compatible with existing chemical informatics workflows.</p>
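<p>The control flow of the rule engine, with bridge-bond recognition ordered first and the remaining 16 rules applied in any order until quiescence, can be sketched as a schematic skeleton (the toy rules below are illustrative, not MolRec&rsquo;s actual rules):</p>

```python
def run_rule_engine(primitives, ordered_rules, unordered_rules):
    """Minimal rule-engine loop: ordered rules (e.g. bridge-bond recognition)
    fire first, then the remaining rules are applied repeatedly until no rule
    changes the working set of primitives."""
    for rule in ordered_rules:
        primitives, _ = rule(primitives)
    changed = True
    while changed:
        changed = False
        for rule in unordered_rules:
            primitives, fired = rule(primitives)
            changed = changed or fired
    return primitives

# Toy rules: each takes a primitive list, returns (new_list, fired_flag).
def add_bridge(p):
    return p + ["bridge"], True

def line_to_bond(p):
    if "LINE" in p:
        i = p.index("LINE")
        return p[:i] + ["bond"] + p[i + 1:], True
    return p, False
```

<p>Because each unordered rule reports whether it fired, the loop terminates exactly when the graph reaches a fixed point, mirroring the &ldquo;applied in any order&rdquo; property described above.</p>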
<h2 id="clef-2012-experimental-design">CLEF 2012 Experimental Design</h2>
<p>The CLEF 2012 evaluation tested MolRec on two distinct datasets designed to assess different aspects of chemical structure recognition:</p>
<ol>
<li>
<p><strong>Large-Scale Automated Evaluation (865 images)</strong>: A substantial dataset evaluated automatically using OpenBabel for exact structural matches. The authors ran four different parameter configurations to understand system sensitivity and reproducibility.</p>
</li>
<li>
<p><strong>Complex Structure Manual Evaluation (95 images)</strong>: A smaller but more challenging dataset requiring manual evaluation. These structures included more complex features like stereochemistry, unusual bond types, and non-standard chemical notations.</p>
</li>
<li>
<p><strong>Parameter Sensitivity Analysis</strong>: Multiple runs with slightly different parameters tested the robustness of the recognition pipeline and identified optimal settings.</p>
</li>
<li>
<p><strong>Comprehensive Failure Analysis</strong>: Every incorrect recognition was manually examined to categorize error types and understand systematic limitations.</p>
</li>
</ol>
<h2 id="performance-divergence-and-critical-failure-modes">Performance Divergence and Critical Failure Modes</h2>
<p>The results reveal a stark performance gap between simple and complex molecular structures:</p>
<p><strong>Performance on Simple Structures</strong>: On the 865-image automated dataset, MolRec achieved <strong>94.91% to 96.18% accuracy</strong> across different parameter settings. This excellent performance demonstrates that rule-based approaches can handle standard molecular diagrams reliably when image quality is good and structures follow conventional drawing practices.</p>
<p><strong>Performance on Complex Structures</strong>: On the 95-image manual evaluation set, accuracy dropped dramatically to <strong>46.32% to 58.95%</strong>. This reveals the fundamental brittleness of rule-based systems when encountering real-world complexity.</p>
<p><strong>Key Failure Modes Identified</strong>:</p>
<ul>
<li>
<p><strong>Character Grouping Errors</strong>: Implementation bugs caused incorrect processing of subscripts and atom groups. For example, R₂₁ was misread as R₂₁₁, creating chemically nonsensical structures.</p>
</li>
<li>
<p><strong>Touching Character Problems</strong>: When characters physically touch due to image resolution or scanning artifacts, the system cannot separate them properly. This is a limitation that OCR systems still struggle with today.</p>
</li>
<li>
<p><strong>Four-Way Junction Failures</strong>: The vectorization process couldn&rsquo;t handle complex branching points where four bonds meet, leading to incorrect connectivity.</p>
</li>
<li>
<p><strong>OCR Misrecognition</strong>: Standard character recognition errors like confusing &ldquo;G&rdquo; with &ldquo;O&rdquo; or interpreting &ldquo;I&rdquo; as a vertical bond propagated through the entire recognition pipeline.</p>
</li>
<li>
<p><strong>Stereochemistry Recognition Issues</strong>: The system missed various 3D bond representations including solid wedges, dashed wedges, and wavy bonds that indicate stereochemical relationships.</p>
</li>
<li>
<p><strong>Charge Sign Detection</strong>: While positive charges (&quot;+&quot;) were recognized reliably, negative charges (&quot;−&quot;) were frequently missed, possibly due to typography variations.</p>
</li>
<li>
<p><strong>Proximity-Based Errors</strong>: Atoms positioned too close to bond endpoints were incorrectly connected, and the system struggled with crowded molecular regions.</p>
</li>
</ul>
<p><strong>Dataset Quality Issues</strong>: Interestingly, the authors discovered 11 cases where MolRec&rsquo;s output was actually correct, but the provided ground truth was wrong. This highlights the challenge of creating reliable evaluation datasets for chemical structure recognition.</p>
<p><strong>System Robustness</strong>: The parameter sensitivity analysis showed that MolRec&rsquo;s performance was relatively stable across different configurations, suggesting the core algorithms were robust within their intended operating range.</p>
<p><strong>Key Insights</strong>:</p>
<ul>
<li>
<p><strong>The 95% Accuracy Myth</strong>: While MolRec achieved excellent accuracy on clean, standard molecular diagrams, the dramatic performance drop on complex structures reveals that overall accuracy metrics can be misleading. Real-world chemical literature contains many of the &ldquo;difficult&rdquo; cases that drive accuracy down. Mathematical validation of structural connectivity often assumes perfect node extraction, modeled by a simplified graph matching accuracy metric: $$ \text{Accuracy} = \frac{|\text{Correctly Matched Structures}|}{|\text{Total Structures}|} $$.</p>
</li>
<li>
<p><strong>Rule-Based Brittleness</strong>: Every failure mode represents a case not covered by the 18 implemented rules. This highlights the fundamental limitation of rule-based approaches: they can only handle cases explicitly programmed by their creators.</p>
</li>
<li>
<p><strong>Cascading Failures</strong>: Many errors began in the vectorization stage (OCR failures, touching characters) and propagated through the entire pipeline. This suggests that robust early-stage processing is critical for overall system performance.</p>
</li>
<li>
<p><strong>Evaluation Challenges</strong>: The discovery of incorrect ground truth data emphasizes how difficult it is to create reliable benchmarks for chemical structure recognition, even with manual curation.</p>
</li>
</ul>
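<p>The all-or-nothing character of this metric is what makes headline numbers misleading: a single wrong bond zeroes out an otherwise perfect prediction. A minimal sketch (with a hypothetical <code>is_match</code> comparator standing in for exact structural matching):</p>

```python
def structure_accuracy(predictions, ground_truths, is_match):
    """Exact-match accuracy over whole structures.

    `is_match` is a hypothetical comparator (e.g., canonical-SMILES or
    graph-isomorphism equality); a partially wrong structure counts as 0.
    """
    matched = sum(1 for p, g in zip(predictions, ground_truths) if is_match(p, g))
    return matched / len(ground_truths)

# One wrong atom fails the entire third structure.
preds = ["c1ccccc1", "CCO", "CC(=O)O"]
truth = ["c1ccccc1", "CCO", "CC(=O)N"]
print(structure_accuracy(preds, truth, lambda p, g: p == g))  # 2/3
```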
<p>The work provides an honest assessment of rule-based OCSR capabilities circa 2012. While MolRec could handle routine chemical diagrams well, its struggles with complex cases foreshadowed the limitations that would eventually drive the field toward deep learning approaches. The detailed failure analysis proved prescient. Many of the challenges identified here (handling noise, recognizing diverse drawing styles, robust stereochemistry detection) remain active research areas in modern chemical structure recognition systems.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="system-architecture">System Architecture</h3>
<p><strong>Model Type</strong>: Non-neural, Rule-Based System (Vectorization Pipeline + Logic Engine)</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Datasets (CLEF 2012)</strong>:</p>
<ul>
<li><strong>Set 1 (Automated)</strong>: 865 molecular structure images evaluated automatically using OpenBabel for exact structural matching</li>
<li><strong>Set 2 (Manual)</strong>: 95 complex molecular structure images requiring manual evaluation by human experts</li>
</ul>
<p><strong>Training Data</strong>: The paper does not describe the reference set used to build OCR character prototypes for the nearest neighbor classifier. The system relies on pre-existing geometric algorithms.</p>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Vectorization Pipeline</strong>:</p>
<ul>
<li><strong>Binarization</strong>: Otsu&rsquo;s method for converting grayscale to binary images</li>
<li><strong>Thinning</strong>: Reduces diagram to single-pixel-width lines</li>
<li><strong>Line Simplification</strong>: Douglas-Peucker algorithm with threshold set to 1-2x average line width</li>
<li><strong>OCR</strong>: Nearest neighbor classification with Euclidean distance metric</li>
</ul>
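<p>The OCR step classifies each extracted character by nearest-neighbor search against stored prototypes under a Euclidean distance metric. A minimal sketch, assuming illustrative feature vectors and labels (the paper does not describe the actual prototype set):</p>

```python
import math

def nearest_neighbor_ocr(glyph, prototypes):
    """Classify a feature vector by Euclidean distance to labeled prototypes.

    `prototypes` maps a character label to a reference feature vector; the
    features and labels below are illustrative assumptions, not MolRec's.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda label: dist(glyph, prototypes[label]))

protos = {"C": [1.0, 0.0, 0.2], "N": [0.0, 1.0, 0.8], "O": [0.9, 0.9, 0.1]}
print(nearest_neighbor_ocr([0.95, 0.05, 0.15], protos))  # C
```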
<p><strong>Rule Engine</strong>: 18 chemical structure rules converting geometric primitives to molecular graphs:</p>
<ul>
<li><strong>Rule 2.2 (Wavy Bonds)</strong>: Detailed in the paper&mdash;identifies approximately collinear line segments with zig-zag patterns</li>
<li><strong>Bridge Bond Rules</strong>: Applied first due to complexity (details not fully specified)</li>
<li><strong>Standard Recognition Rules</strong>: 16 rules for bonds, atoms, and chemical features (most not detailed in paper)</li>
</ul>
<p><strong>Optimization</strong>: Performance tuned via manual adjustment of geometric thresholds, not gradient descent.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Automated</strong>: Exact structural match via OpenBabel MOL file comparison</li>
<li><strong>Manual</strong>: Visual inspection by human experts for structures where OpenBabel fails</li>
</ul>
<p><strong>Results</strong>:</p>
<ul>
<li><strong>Simple Structures (Set 1)</strong>: 94.91% to 96.18% accuracy across parameter configurations</li>
<li><strong>Complex Structures (Set 2)</strong>: 46.32% to 58.95% accuracy across parameter configurations</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Given the era (2012) and algorithmic nature (Otsu, thinning, geometric analysis), likely ran on standard CPUs.</p>
<h3 id="reproducibility-assessment">Reproducibility Assessment</h3>
<p><strong>Low reproducibility</strong> from this paper alone. The paper outlines high-level logic (Otsu binarization, Douglas-Peucker simplification, 18-rule system) but does not provide:</p>
<ul>
<li>The complete specification of all 18 rules (only Rule 2.2 for wavy bonds is detailed)</li>
<li>Exact numerical threshold values for fuzzy/strict parameters used in the CLEF runs</li>
<li>OCR training data or character prototype specifications</li>
</ul>
<p>The authors refer readers to a separate 2012 SPIE paper for the &ldquo;detailed overview&rdquo; of the MolRec system architecture.</p>
]]></content:encoded></item><item><title>GTR-CoT: Graph Traversal Chain-of-Thought for Molecules</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/gtr-mol-vlm/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/gtr-mol-vlm/</guid><description>GTR-CoT uses graph traversal chain-of-thought reasoning to improve optical chemical structure recognition accuracy.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, J., Yang, H., Wu, J., He, Y., Wei, X., Wang, Y., Liu, C., Ge, L., Wu, L., Wang, B., Lin, D., &amp; He, C. (2025). GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition (No. arXiv:2506.07553; Version 2). arXiv. <a href="https://doi.org/10.48550/arXiv.2506.07553">https://doi.org/10.48550/arXiv.2506.07553</a></p>
<p><strong>Publication</strong>: arXiv preprint (2025)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2506.07553">Paper on arXiv</a></li>
</ul>
<h2 id="contribution-vision-language-modeling-for-ocsr">Contribution: Vision-Language Modeling for OCSR</h2>
<p>This is a <strong>method paper</strong> that introduces GTR-Mol-VLM, a Vision-Language Model for Optical Chemical Structure Recognition (OCSR). The work addresses the persistent challenge of converting molecular structure images into machine-readable formats, with a particular focus on handling chemical abbreviations that have plagued existing systems.</p>
<h2 id="motivation-the-abbreviation-bottleneck">Motivation: The Abbreviation Bottleneck</h2>
<p>The motivation tackles a long-standing bottleneck in chemical informatics: most existing OCSR systems fail catastrophically when they encounter abbreviated functional groups. When a chemist draws &ldquo;Ph&rdquo; for phenyl or &ldquo;Et&rdquo; for ethyl, current models either crash or produce wildly incorrect structures because they&rsquo;ve been trained on data where images contain abbreviations but the ground-truth labels contain fully expanded molecular graphs.</p>
<p>This creates a fundamental mismatch. The model sees &ldquo;Ph&rdquo; in the image but is told the &ldquo;correct&rdquo; answer is a full benzene ring. It&rsquo;s like teaching someone to read by showing them &ldquo;Dr.&rdquo; but insisting the right answer is &ldquo;Doctor&rdquo; - the supervision signal is inconsistent with what&rsquo;s actually visible.</p>
<p>Beyond this data problem, existing graph-parsing methods use a two-stage approach: predict all atoms first, then predict all bonds. This is inefficient and ignores the structural constraints that could help during prediction. The authors argue that mimicking how humans analyze molecular structures - following bonds from atom to atom in a connected traversal - would be more effective.</p>
<h2 id="novelty-graph-traversal-as-visual-chain-of-thought">Novelty: Graph Traversal as Visual Chain-of-Thought</h2>
<p>The novelty lies in combining two key insights about how to properly train and architect OCSR systems. The main contributions are:</p>
<ol>
<li>
<p><strong>Graph Traversal as Visual Chain of Thought</strong>: GTR-Mol-VLM generates molecular graphs by traversing them sequentially, predicting an atom, then its connected bond, then the next atom, and so on. This mimics how a human chemist would trace through a structure and allows the model to use previously predicted atoms and bonds as context for subsequent predictions.</p>
<p>Formally, the model output sequence for image $I_m$ is generated as:</p>
<p>$$ R_m = \text{concat}(CoT_m, S_m) $$</p>
<p>where $CoT_m$ represents the deterministic graph traversal steps (atoms and bonds) and $S_m$ is the final SMILES representation. This intermediate reasoning step makes the model more interpretable and helps it learn the structural logic of molecules.</p>
</li>
<li>
<p><strong>&ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; Principle</strong>: This addresses the abbreviation problem head-on. The authors correct the ground-truth annotations to match what&rsquo;s actually visible in the image.</p>
<p>They treat abbreviations like &ldquo;Ph&rdquo; as single &ldquo;superatoms&rdquo; and build a pipeline to automatically detect and correct training data. Using OCR to extract visible text from molecular images, they replace the corresponding expanded substructures in the ground-truth with the appropriate abbreviation tokens. This ensures the supervision signal is consistent with the visual input.</p>
</li>
<li>
<p><strong>Large-Scale Corrected Dataset (GTR-CoT-1.3M)</strong>: To support this approach, the authors created a large-scale dataset combining 1M synthetic molecules from PubChem with 351K corrected real-world patent images from USPTO. The key innovation is the correction pipeline that identifies abbreviations in patent images and fixes the inconsistent ground-truth labels.</p>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: Traditional SMILES-based evaluation fails for molecules with abbreviations because canonicalization breaks down. The authors created a new benchmark that evaluates graph structure directly, providing three metrics: direct SMILES generation, graph-derived SMILES, and exact graph matching.</p>
</li>
</ol>
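<p>The graph-traversal chain of thought can be sketched as a depth-first walk that emits alternating atom and bond tokens, each conditioned on what came before. The token strings below are illustrative, not the paper's exact JSON schema:</p>

```python
def traverse_graph(atoms, bonds, start=0):
    """Emit an alternating atom/bond sequence via depth-first traversal.

    `atoms` is an index-ordered list of element symbols; `bonds` maps
    (i, j) with i < j to a bond order.
    """
    adjacency = {}
    for (i, j), order in bonds.items():
        adjacency.setdefault(i, []).append((j, order))
        adjacency.setdefault(j, []).append((i, order))

    tokens, visited = [f"atom:{atoms[start]}@{start}"], {start}

    def dfs(u):
        for v, order in sorted(adjacency.get(u, [])):
            if v not in visited:
                visited.add(v)
                tokens.append(f"bond:{order}({u}-{v})")  # bond uses prior atom as context
                tokens.append(f"atom:{atoms[v]}@{v}")
                dfs(v)

    dfs(start)
    return tokens

# Ethanol skeleton C-C-O: the sequence alternates atoms and bonds.
print(traverse_graph(["C", "C", "O"], {(0, 1): 1, (1, 2): 1}))
# ['atom:C@0', 'bond:1(0-1)', 'atom:C@1', 'bond:1(1-2)', 'atom:O@2']
```

In GTR-Mol-VLM this serialized traversal forms $CoT_m$, which the model emits before the final SMILES string $S_m$.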
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The evaluation focused on demonstrating that GTR-Mol-VLM&rsquo;s design principles solve real problems that plague existing OCSR systems:</p>
<ol>
<li>
<p><strong>Comprehensive Baseline Comparison</strong>: GTR-Mol-VLM was tested against three categories of models:</p>
<ul>
<li><strong>Specialist OCSR systems</strong>: MolScribe and MolNexTR (the current state-of-the-art)</li>
<li><strong>Chemistry-focused VLMs</strong>: ChemVLM</li>
<li><strong>General-purpose VLMs</strong>: GPT-4o, Claude-3.5-Sonnet, Qwen-VL-Max</li>
</ul>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: The new benchmark includes two subsets of patent images:</p>
<ul>
<li><strong>MolRec-Std</strong>: Standard patent images similar to existing benchmarks</li>
<li><strong>MolRec-Abb</strong>: Images specifically selected to contain many abbreviated functional groups</li>
</ul>
<p>This design directly tests whether models can handle the abbreviation problem that breaks existing systems.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Systematic experiments isolated the contribution of key design choices:</p>
<ul>
<li><strong>Chain-of-Thought vs. Direct</strong>: Comparing graph traversal CoT against direct SMILES prediction</li>
<li><strong>Traversal Strategy</strong>: Graph traversal vs. the traditional &ldquo;atoms-then-bonds&rdquo; approach</li>
<li><strong>Dataset Quality</strong>: Training on corrected vs. uncorrected data</li>
</ul>
</li>
<li>
<p><strong>Retraining Experiments</strong>: Existing specialist models (MolScribe, MolNexTR) were retrained from scratch on the corrected GTR-CoT-1.3M dataset to isolate the effect of data quality from architectural improvements.</p>
</li>
<li>
<p><strong>Qualitative Analysis</strong>: Visual inspection of predictions on challenging cases with heavy abbreviation usage, complex structures, and edge cases to understand failure modes.</p>
</li>
</ol>
<h2 id="results--conclusions-resolving-the-abbreviation-bottleneck">Results &amp; Conclusions: Resolving the Abbreviation Bottleneck</h2>
<ul>
<li>
<p><strong>Dramatic Performance Gains on Abbreviations</strong>: GTR-Mol-VLM achieves state-of-the-art performance across all metrics on both benchmark subsets, but the improvement is particularly striking on MolRec-Abb. Existing specialist models that perform well on standard images see their accuracy drop below 20% when abbreviations are present. GTR-Mol-VLM maintains high performance across both conditions.</p>
</li>
<li>
<p><strong>Data Correction is Critical</strong>: When existing models were retrained on the corrected GTR-CoT-1.3M dataset, their performance improved substantially, validating that the &ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; principle addresses a real problem in the training data. However, GTR-Mol-VLM still outperformed these retrained baselines, confirming that both the data correction and the architectural innovations contribute.</p>
</li>
<li>
<p><strong>Chain-of-Thought Helps</strong>: Ablation studies confirmed that generating the graph traversal sequence before the final SMILES string improves performance compared to direct prediction. The intermediate reasoning step provides valuable structure that helps the model learn chemical logic.</p>
</li>
<li>
<p><strong>Graph Traversal Beats Traditional Parsing</strong>: The sequential atom-bond traversal approach outperformed the traditional &ldquo;atoms-then-bonds&rdquo; method, supporting the hypothesis that mimicking human reasoning patterns is more effective.</p>
</li>
<li>
<p><strong>General VLMs Still Struggle</strong>: Despite their impressive capabilities in other domains, general-purpose VLMs like GPT-4o performed poorly on this task, highlighting the importance of domain-specific training and architectural considerations.</p>
</li>
<li>
<p><strong>Evaluation Methodology Matters</strong>: The new graph-based evaluation metrics revealed problems with traditional SMILES-based evaluation that previous work had missed. Many &ldquo;failures&rdquo; in existing benchmarks were actually correct graph predictions that got marked wrong due to canonicalization issues with abbreviations.</p>
</li>
</ul>
<p>The work establishes that addressing the abbreviation problem requires both correcting the training data and rethinking the model architecture. The combination of faithful data annotation and sequential graph generation represents a significant advance for making OCSR systems robust enough for real-world deployment on diverse chemical literature.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Base Model</strong>: GTR-Mol-VLM fine-tunes <strong>Qwen-VL 2.5 3B</strong>.</p>
<p><strong>Input/Output Mechanism</strong>:</p>
<ul>
<li><strong>Input</strong>: The model takes an image $I_m$ and a text prompt</li>
<li><strong>Output</strong>: The model generates $R_m = \text{concat}(CoT_m, S_m)$, where it first produces the Chain-of-Thought (the graph traversal steps) followed immediately by the final SMILES string</li>
<li><strong>Traversal Strategy</strong>: Uses <strong>depth-first traversal</strong> to alternately predict atoms and bonds</li>
</ul>
<p><strong>Prompt Structure</strong>: The model is prompted to &ldquo;list the types of atomic elements&hellip; the coordinates&hellip; and the chemical bonds&hellip; then&hellip; output a canonical SMILES&rdquo;. The CoT output is formatted as a JSON list of atoms (with coordinates) and bonds (with indices referring to previous atoms), interleaved.</p>
<h3 id="data">Data</h3>
<p><strong>Training Dataset (GTR-CoT-1.3M)</strong>:</p>
<ul>
<li><strong>Synthetic Component</strong>: 1 million molecular SMILES from PubChem, converted to images using Indigo</li>
<li><strong>Real Component</strong>: 351,000 samples from USPTO patents (filtered from an original 680,000)
<ul>
<li>Processed using an OCR pipeline to detect abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Et&rdquo;)</li>
<li>Ground truth expanded structures replaced with superatoms to match visible abbreviations in images</li>
<li>This &ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; correction ensures training supervision matches visual input</li>
</ul>
</li>
</ul>
<p><strong>Evaluation Dataset (MolRec-Bench)</strong>:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 standard molecular images from patents</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images specifically containing abbreviated superatoms, derived from MolGrapher data</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Graph Traversal Algorithm</strong>:</p>
<ul>
<li>Depth-first traversal strategy</li>
<li>Alternating atom-bond prediction sequence</li>
<li>Each step uses previously predicted atoms and bonds as context</li>
</ul>
<p><strong>Training Procedure</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate</strong>: Peak learning rate of $1.6 \times 10^{-4}$ with cosine decay</li>
<li><strong>Warm-up</strong>: Linear warm-up for the first 10% of iterations</li>
<li><strong>Batch Size</strong>: 2 per GPU with gradient accumulation over 16 steps, yielding <strong>effective batch size of 1024</strong></li>
<li><strong>Efficiency</strong>: DeepSpeed ZeRO Stage 2 for memory reduction</li>
<li><strong>Duration</strong>: 3 epochs</li>
</ul>
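<p>The schedule and batch arithmetic above can be sketched directly. The decay-to-zero floor is an assumption (the paper states only linear warm-up followed by cosine decay from the peak):</p>

```python
import math

PEAK_LR = 1.6e-4  # peak learning rate from the paper

def learning_rate(step, total_steps, peak=PEAK_LR, warmup_frac=0.10):
    """Linear warm-up over the first 10% of steps, then cosine decay.

    Decaying to exactly zero is an assumed choice; the paper does not
    specify a minimum learning rate.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

total = 10_000
print(learning_rate(1_000, total))   # reaches the peak as warm-up ends
print(learning_rate(10_000, total))  # decays toward zero by the final step

# Effective batch size: 2 per GPU x 16 accumulation steps x 32 A100s = 1024
assert 2 * 16 * 32 == 1024
```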
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong> (three complementary measures to handle abbreviation issues):</p>
<ul>
<li><strong><code>Gen_SMILES</code></strong>: Exact match ratio of SMILES strings directly generated by the VLM (image-captioning style)</li>
<li><strong><code>Gra_SMILES</code></strong>: Exact match ratio of SMILES strings derived from the predicted graph structure (graph-parsing style)</li>
<li><strong><code>Graph</code></strong>: Exact match ratio between ground truth and predicted graphs (node/edge comparison, bypassing SMILES canonicalization issues)</li>
</ul>
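<p>The <code>Graph</code> metric sidesteps canonicalization by comparing the predicted graph directly. A simplified sketch, assuming identical atom indexing between prediction and ground truth (the benchmark's real comparator must also resolve node correspondence):</p>

```python
def graph_exact_match(pred, truth):
    """Exact graph match: identical atom lists and identical bond sets.

    Each graph is (atoms, bonds): atoms is an index-ordered list of labels,
    bonds maps frozenset({i, j}) to a bond order. Simplified for illustration.
    """
    pred_atoms, pred_bonds = pred
    true_atoms, true_bonds = truth
    return pred_atoms == true_atoms and pred_bonds == true_bonds

truth = (["C", "C", "O"], {frozenset({0, 1}): 1, frozenset({1, 2}): 1})
good = (["C", "C", "O"], {frozenset({1, 2}): 1, frozenset({0, 1}): 1})
bad = (["C", "C", "N"], {frozenset({0, 1}): 1, frozenset({1, 2}): 1})
print(graph_exact_match(good, truth))  # True  (bond ordering is irrelevant)
print(graph_exact_match(bad, truth))   # False (wrong atom label)
```

Because a superatom like &ldquo;Ph&rdquo; is just another node label here, abbreviated structures can be scored without ever round-tripping through SMILES.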
<p><strong>Baselines Compared</strong>:</p>
<ul>
<li>Specialist OCSR systems: MolScribe, MolNexTR</li>
<li>Chemistry-focused VLMs: ChemVLM</li>
<li>General-purpose VLMs: GPT-4o, Claude-3.5-Sonnet, Qwen-VL-Max</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute</strong>: Training performed on <strong>32 NVIDIA A100 GPUs</strong></p>
]]></content:encoded></item><item><title>ChemInfty: Chemical Structure Recognition in Patent Images</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/cheminfty/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/cheminfty/</guid><description>Fujiyoshi et al.'s segment-based approach for recognizing chemical structures in challenging Japanese patent images with touching characters and broken lines.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fujiyoshi, A., Nakagawa, K., &amp; Suzuki, M. (2011). Robust Method of Segmentation and Recognition of Chemical Structure Images in ChemInfty. <em>Pre-Proceedings of the 9th IAPR International Workshop on Graphics Recognition, GREC.</em></p>
<p><strong>Publication</strong>: GREC 2011 (Graphics Recognition Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="http://www.sciaccess.net/en/InftyReader/">InftyReader Project</a></li>
</ul>
<h2 id="contribution-segment-based-ocsr-method">Contribution: Segment-Based OCSR Method</h2>
<p>This is a <strong>method paper</strong> that introduces ChemInfty, a rule-based system for Optical Chemical Structure Recognition (OCSR) specifically designed to handle the challenging, low-quality images found in Japanese patent applications.</p>
<h2 id="motivation-the-challenge-of-degraded-patent-images">Motivation: The Challenge of Degraded Patent Images</h2>
<p>The motivation is straightforward: Japanese patent applications contain a massive amount of chemical knowledge, but the images are remarkably poor quality. Unlike the relatively clean molecular diagrams in scientific papers, patent images suffer from multiple problems that break conventional OCSR systems.</p>
<p>The authors quantified these issues in a sample of 200 patent images and found that 22% contained touching characters (where atom labels merge together), 19.5% had characters touching bond lines, and 8.5% had broken lines. These are not edge cases; they are pervasive enough to cripple existing recognition tools. Established systems like CLIDE, ChemReader, and OSRA struggle significantly with line-touching characters and broken lines, leading to recognition failures.</p>
<p>The challenge is compounded by the sheer diversity of creation methods. Some structures are drawn with sophisticated molecular editors, others with basic paint programs, and some are even handwritten. This means there&rsquo;s no standardization in fonts, character sizes, or line thickness. Add in the effects of scanning and faxing, and you have images with significant noise, distortion, and degradation.</p>
<p>The goal of ChemInfty is to build a system robust enough to handle these messy real-world conditions and make Japanese patent chemistry computer-searchable.</p>
<h2 id="core-innovation-segment-decomposition-and-dynamic-programming">Core Innovation: Segment Decomposition and Dynamic Programming</h2>
<p>The novelty lies in a segment-based decomposition approach that separates the recognition problem into manageable pieces before attempting to classify them. The key insight is that traditional OCR fails on these images because characters and lines are physically merged. You cannot recognize a character if you cannot cleanly separate it from the surrounding bonds first.</p>
<p>ChemInfty&rsquo;s approach has several distinctive elements:</p>
<ol>
<li>
<p><strong>Line and Curve Segmentation</strong>: The system first decomposes the image into smaller line and curve segments. The decomposition happens at natural breakpoints&mdash;crossings, sharp bends, and other locations where touching is likely to occur. This creates a set of primitive elements that can be recombined in different ways.</p>
</li>
<li>
<p><strong>Linear Order Assumption for Scalability</strong>: To make the dynamic programming approach computationally tractable and avoid combinatorial explosion, the system assumes that segments to be combined are adjacent when sorted in one of four directional orderings ($\perp, \setminus, \triangle, \rightarrow$). This constraint dramatically reduces the search space while still capturing the natural spatial relationships in chemical diagrams.</p>
</li>
<li>
<p><strong>Dynamic Programming for Segment Combination</strong>: Once the image is decomposed, the system faces a combinatorial problem: which segments should be grouped together to form characters, and which should be classified as bonds? The authors use dynamic programming to efficiently search for the &ldquo;most suitable combination&rdquo; of segments. This optimization finds the configuration that maximizes the likelihood of valid chemical structure elements.</p>
</li>
<li>
<p><strong>Two-Pass OCR Strategy</strong>: ChemInfty integrates with InftyReader, a powerful OCR engine. The system uses OCR twice in the pipeline:</p>
<ul>
<li><strong>First pass</strong>: High-confidence character recognition removes obvious atom labels early, simplifying the remaining image</li>
<li><strong>Second pass</strong>: After the segment-based method identifies and reconstructs difficult character regions, OCR is applied again to the cleaned-up character image</li>
</ul>
<p>This two-stage approach handles both easy and hard cases effectively: simple characters are recognized immediately, while complex cases get special treatment.</p>
</li>
<li>
<p><strong>Image Thinning for Structure Analysis</strong>: Before segmentation, the system thins the remaining graphical elements (after removing high-confidence characters) to skeleton lines. This thinning operation reveals the underlying topological structure&mdash;crossings, bends, and endpoints&mdash;making it easier to detect where segments should be divided.</p>
</li>
<li>
<p><strong>Proximity-Based Grouping</strong>: After identifying potential character segments, the system groups nearby segments together. This spatial clustering ensures that parts of the same character that were separated by bonds get recombined correctly.</p>
</li>
</ol>
<h2 id="methodology-real-world-patent-evaluation">Methodology: Real-World Patent Evaluation</h2>
<p>The evaluation focused on demonstrating that ChemInfty could handle real-world patent images at scale:</p>
<ol>
<li>
<p><strong>Large-Scale Patent Dataset</strong>: The system was tested on chemical structure images from Japanese patent applications published in 2008. This represents a realistic deployment scenario with all the messiness of actual documents.</p>
</li>
<li>
<p><strong>Touching Character Separation</strong>: The authors specifically measured the system&rsquo;s ability to separate characters from bonds when they were touching. Success was defined as cleanly extracting the character region so that OCR could recognize it.</p>
</li>
<li>
<p><strong>Recognition Accuracy by Object Type</strong>: Performance was broken down by element type (characters, line segments, solid wedges, and hashed wedges). This granular analysis revealed which components were easier or harder for the system to handle.</p>
</li>
<li>
<p><strong>End-to-End Performance</strong>: The overall recognition ratio was calculated across all object types to establish the system&rsquo;s practical utility for automated patent processing.</p>
</li>
</ol>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li>
<p><strong>Effective Separation for Line-Touching Characters</strong>: The segment-based method successfully separated 63.5% of characters that were touching bond lines. This is a substantial improvement over standard OCR, which typically fails completely on such cases. The authors note that when image quality is reasonable, the separation method works well.</p>
</li>
<li>
<p><strong>Strong Overall Character Recognition</strong>: Character recognition achieved 85.86% accuracy, which is respectable given the poor quality of the input images. Combined with the 90.73% accuracy for line segments, this demonstrates the system can reliably reconstruct the core molecular structure.</p>
</li>
<li>
<p><strong>Weak Performance on Wedges</strong>: The system struggled significantly with stereochemistry notation. Solid wedges were correctly recognized only 52.54% of the time, and hashed wedges fared even worse at 23.63%. This is a critical limitation since stereochemistry is often essential for understanding molecular properties.</p>
</li>
<li>
<p><strong>Image Quality Dependency</strong>: The authors acknowledge that the method&rsquo;s effectiveness is ultimately limited by image quality. When images are severely degraded (blurred to the point where even humans struggle to distinguish characters from noise), the segmentation approach cannot reliably separate touching elements.</p>
</li>
<li>
<p><strong>Overall System Performance</strong>: The combined recognition ratio of 86.58% for all objects indicates that ChemInfty is a working system but not yet production-ready. The authors conclude that further refinement is necessary, particularly for wedge recognition and handling extremely low-quality images.</p>
</li>
</ul>
<p>The work establishes that segment-based decomposition with dynamic programming is a viable approach for handling the specific challenges of patent image OCSR. The two-pass OCR strategy and the use of image thinning to reveal structure are practical engineering solutions that improve robustness. However, the results also highlight that rule-based methods are fundamentally limited by image quality. There is only so much you can do with algorithmic cleverness when the input is severely degraded. This limitation would motivate later work on deep learning approaches that can learn robust feature representations from large datasets.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="technical-paradigm">Technical Paradigm</h3>
<p><strong>This is a pre-deep learning (2011) classical computer vision paper.</strong> The system uses rule-based methods and traditional OCR engines, not neural networks.</p>
<h3 id="models">Models</h3>
<ul>
<li><strong>InftyReader</strong>: A mathematical OCR engine used for the initial high-confidence character recognition pass. This is a pre-existing external tool.</li>
<li><strong>DEF-based OCR</strong>: A standard OCR engine based on Directional Element Features (DEF). These are manually engineered statistical features (histograms of edge directions), not learned neural network features.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The paper details a multi-step recognition pipeline:</p>
<ol>
<li><strong>Preprocessing</strong>: Binarization and smoothing</li>
<li><strong>Initial Character Removal</strong>: High-confidence characters are recognized by the InftyReader OCR engine and removed from the image to simplify segmentation</li>
<li><strong>Skeletonization</strong>: Thinning using <strong>Hilditch&rsquo;s algorithm</strong> to skeletonize graphical elements, revealing topological structure (crossings, bends, endpoints)</li>
<li><strong>Feature Point Detection</strong>:
<ul>
<li><strong>Crossing points</strong>: Direct detection on skeleton</li>
<li><strong>Bending points</strong>: Detected using the <strong>Hough transformation</strong></li>
</ul>
</li>
<li><strong>Dynamic Programming Search</strong>:
<ul>
<li><strong>Input</strong>: Set of line/curve segments $S$</li>
<li><strong>Procedure</strong>: Sort segments in 4 directions ($\perp, \setminus, \triangle, \rightarrow$). For each direction, use DP to find the grouping that minimizes a heuristic score</li>
<li><strong>Complexity</strong>: $O(n^2)$ where $n$ is the number of segments</li>
<li><strong>Scoring</strong>: Uses a function <code>Measure(S')</code> that returns a score (0-100) indicating if a subset of segments forms a valid character or bond</li>
</ul>
</li>
</ol>
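<p>Since the segments are sorted into a linear order, the grouping search reduces to a classic contiguous-partition DP with $O(n^2)$ subproblems. A minimal sketch, with a toy stand-in for the paper's undefined <code>Measure(S')</code> (here lower scores are better and the DP minimizes the total):</p>

```python
def best_grouping(segments, measure):
    """Partition a directionally sorted segment list into contiguous groups.

    `measure(group)` is a hypothetical stand-in for the paper's Measure(S').
    best[i] holds the best total score for the first i segments.
    """
    n = len(segments)
    best = [0.0] + [float("inf")] * n
    split = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):  # try grouping segments[j:i] together
            score = best[j] + measure(segments[j:i])
            if score < best[i]:
                best[i], split[i] = score, j
    groups, i = [], n  # backtrack through the recorded split points
    while i > 0:
        groups.append(segments[split[i]:i])
        i = split[i]
    return best[n], groups[::-1]

# Toy measure: prefer groups of exactly two segments (purely illustrative).
measure = lambda g: 0.0 if len(g) == 2 else 1.0
total, groups = best_grouping(["s1", "s2", "s3", "s4"], measure)
print(total, groups)  # 0.0 [['s1', 's2'], ['s3', 's4']]
```

Because only adjacent segments in the sorted order may be grouped, the combinatorial explosion of arbitrary subsets never arises, which is exactly what the linear-order assumption buys.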
<p>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined in the paper, limiting replicability.</p>
<h3 id="data">Data</h3>
<p><strong>Evaluation Dataset</strong>: Chemical structure images from Japanese patent applications published in 2008. The complete 2008 dataset contains 229,969 total images.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>Japanese Published Patent Applications (2008)</td>
          <td>1,599 images</td>
          <td>Contains 229,969 total images for the year. Format: TIFF, 200-400 dpi.</td>
      </tr>
      <tr>
          <td>Analysis</td>
          <td>Random subset for frequency analysis</td>
          <td>200 images</td>
          <td>Used to estimate frequency of touching/broken characters (found in ~20% of images).</td>
      </tr>
  </tbody>
</table>
<p><strong>No Training Set</strong>: The system is rule-based and uses pre-built OCR engines, so no model training was performed.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric</strong>: Recognition ratio (percentage of correctly recognized objects)</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Line-touching Separation</td>
          <td>63.5%</td>
          <td>Success rate for separating text glued to lines</td>
      </tr>
      <tr>
          <td>Character Recognition</td>
          <td>85.86%</td>
          <td>For all character sizes</td>
      </tr>
      <tr>
          <td>Line segments</td>
          <td>90.73%</td>
          <td>Standard bond recognition</td>
      </tr>
      <tr>
          <td>Solid Wedge Recognition</td>
          <td>52.54%</td>
          <td>Low performance noted as area for improvement</td>
      </tr>
      <tr>
          <td>Hashed Wedges</td>
          <td>23.63%</td>
          <td>Poorest performing element type</td>
      </tr>
      <tr>
          <td>Overall</td>
          <td>86.58%</td>
          <td>Combined across all object types</td>
      </tr>
  </tbody>
</table>
<p><strong>Total Objects Evaluated</strong>: 742,287 objects (characters, line segments, solid wedges, hashed wedges) extracted from the patent images.</p>
<h3 id="hardware">Hardware</h3>
<p>Not reported. Computational cost was not a primary concern for this classical CV system.</p>
<h3 id="replicability">Replicability</h3>
<p><strong>Low.</strong> The paper does not provide sufficient detail for full replication:</p>
<ul>
<li>The scoring function <code>Measure(S')</code> used in the dynamic programming algorithm is never mathematically defined</li>
<li>Dependency on the proprietary/specialized InftyReader engine</li>
<li>No pseudocode provided for the segment decomposition heuristics</li>
</ul>
<h3 id="notes-on-wedge-recognition">Notes on Wedge Recognition</h3>
<p>The system&rsquo;s poor performance on solid wedges (52.54%) and hashed wedges (23.63%) reflects a fundamental challenge for classical thinning algorithms. Wedge bonds are dense triangular regions that indicate 3D stereochemistry. When skeletonized using algorithms like Hilditch&rsquo;s method, these &ldquo;blob&rdquo; shapes often distort into unrecognizable patterns, unlike the clean thin lines that represent regular bonds.</p>
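The distortion is easy to reproduce. Below is a minimal Zhang-Suen thinning sketch in Python (a widely used relative of Hilditch's algorithm; the paper publishes no code, so this is purely illustrative). A 3-pixel-thick bar, standing in for a plain bond, thins to a clean one-pixel line, while a filled triangle standing in for a solid wedge collapses into a short, ambiguous skeleton.

```python
import numpy as np

def zhang_suen_thin(img):
    """Iteratively peel border pixels from a binary image (1 = ink)
    until only a one-pixel-wide skeleton remains."""
    img = img.astype(np.uint8).copy()
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            marked = []
            for r in range(1, img.shape[0] - 1):
                for c in range(1, img.shape[1] - 1):
                    if img[r, c] != 1:
                        continue
                    # 8-neighborhood, clockwise from the pixel above
                    p = [img[r-1, c], img[r-1, c+1], img[r, c+1],
                         img[r+1, c+1], img[r+1, c], img[r+1, c-1],
                         img[r, c-1], img[r-1, c-1]]
                    b = sum(p)  # number of ink neighbors
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    if not (2 <= b <= 6) or a != 1:
                        continue
                    if step == 0 and (p[0]*p[2]*p[4] or p[2]*p[4]*p[6]):
                        continue
                    if step == 1 and (p[0]*p[2]*p[6] or p[0]*p[4]*p[6]):
                        continue
                    marked.append((r, c))
            for r, c in marked:  # delete simultaneously after each scan
                img[r, c] = 0
            changed = changed or bool(marked)
    return img

# A 3-pixel-thick horizontal "bond" thins to a clean one-pixel line.
bond = np.zeros((7, 16), np.uint8)
bond[2:5, 2:14] = 1
skel = zhang_suen_thin(bond)

# A filled triangle (a "solid wedge") has no such clean skeleton.
wedge = np.pad(np.tril(np.ones((10, 10), np.uint8)), 2)
wedge_skel = zhang_suen_thin(wedge)
```

Running this on the bar yields a single-pixel centerline, exactly the input a line-based vectorizer expects; the triangle's skeleton, by contrast, preserves neither the wedge's orientation nor its width gradient, which is the stereochemistry-bearing information.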
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fujiyoshiRobustMethodSegmentation2011,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Robust {{Method}} of {{Segmentation}} and {{Recognition}} of {{Chemical Structure Images}} in {{ChemInfty}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fujiyoshi, Akio and Nakagawa, Koji and Suzuki, Masakazu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2011</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Pre-proceedings of the 9th IAPR international workshop on graphics recognition, GREC}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser: End-to-End Molecular Structure Recognition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/mol-parser/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/mol-parser/</guid><description>MolParser converts molecular images from scientific documents to machine-readable formats using end-to-end learning with Extended SMILES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Wang, J., Cai, X., Chen, S., Yang, S., Tao, H., Wang, N., Yao, L., Zhang, L., &amp; Ke, G. (2025). MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild (No. arXiv:2411.11098). arXiv. <a href="https://doi.org/10.48550/arXiv.2411.11098">https://doi.org/10.48550/arXiv.2411.11098</a></p>
<p><strong>Publication</strong>: arXiv preprint (2025)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/computational-chemistry/optical-structure-recognition/molparser_7m-wildmol/">MolParser-7M Dataset</a> - 7M+ image-text pairs for OCSR</li>
<li><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M on HuggingFace</a> - Dataset repository</li>
<li><a href="https://huggingface.co/UniParser/MolDet">MolDet YOLO Detector</a> - Object detection model for extracting molecular images from documents</li>
</ul>
<h2 id="contribution-end-to-end-ocsr-and-real-world-resources">Contribution: End-to-End OCSR and Real-World Resources</h2>
<p>This is primarily a <strong>Method</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a>), with a significant secondary contribution as a <strong>Resource</strong> paper.</p>
<p><strong>Method contribution ($\Psi_{\text{Method}}$)</strong>: The paper proposes a novel end-to-end architecture combining a Swin Transformer encoder with a BART decoder, and crucially introduces <strong>Extended SMILES (E-SMILES)</strong>, a new syntactic extension to standard SMILES notation that enables representation of Markush structures, abstract rings, and variable attachment points found in patents. The work validates this method through extensive ablation studies and achieves state-of-the-art performance against prior OCSR systems.</p>
<p><strong>Resource contribution ($\Psi_{\text{Resource}}$)</strong>: The paper introduces <strong>MolParser-7M</strong>, the largest OCSR dataset to date (7.7M image-text pairs), and <strong>WildMol</strong>, a challenging benchmark of 20,000 manually annotated real-world molecular images. The construction of these datasets through an active learning data engine with human-in-the-loop validation represents significant infrastructure that enables future OCSR research.</p>
<h2 id="motivation-unlocking-chemistry-from-real-world-documents">Motivation: Unlocking Chemistry from Real-World Documents</h2>
<p>The motivation stems from a practical problem in chemical informatics: vast amounts of chemical knowledge are locked in unstructured formats. Patents, research papers, and legacy documents depict molecular structures as images. This creates a barrier for large-scale data analysis and prevents Large Language Models from effectively understanding scientific literature in chemistry and drug discovery.</p>
<p>Existing OCSR methods struggle with real-world documents for two fundamental reasons:</p>
<ol>
<li><strong>Representational limitations</strong>: Standard SMILES notation cannot capture complex structural templates like <strong>Markush structures</strong>, which are ubiquitous in patents. These structures define entire families of compounds using variable R-groups and abstract patterns, making them essential for intellectual property but impossible to represent with conventional methods.</li>
<li><strong>Data distribution mismatch</strong>: Real-world molecular images suffer from noise, inconsistent drawing styles, variable resolution, and interference from surrounding text. Models trained exclusively on clean, synthetically rendered molecules fail to generalize when applied to actual documents.</li>
</ol>
<h2 id="novelty-e-smiles-and-human-in-the-loop-data-engine">Novelty: E-SMILES and Human-in-the-Loop Data Engine</h2>
<p>The novelty lies in a comprehensive system that addresses both representation and data quality challenges through four integrated contributions:</p>
<ol>
<li>
<p><strong>Extended SMILES (E-SMILES)</strong>: A backward-compatible extension to the SMILES format that can represent complex structures previously inexpressible in standard chemical notations. E-SMILES uses a separator token <code>&lt;sep&gt;</code> to delineate the core molecular structure from supplementary annotations. These annotations employ XML-like tags to encode Markush structures, polymers, abstract rings, and other complex patterns. Critically, the core structure remains parseable by standard cheminformatics tools like RDKit, while the supplementary tags provide a structured, LLM-friendly format for capturing edge cases.</p>
</li>
<li>
<p><strong>MolParser-7M Dataset</strong>: The largest publicly available OCSR dataset, containing over 7 million image-text pairs. What distinguishes this dataset is both its scale and its composition. It includes 400,000 &ldquo;in-the-wild&rdquo; samples: molecular images extracted from actual patents and scientific papers, then curated by human annotators. This real-world data addresses the distribution mismatch problem directly by exposing the model to the same noise, artifacts, and stylistic variations it encounters in production.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Data Engine</strong>: A systematic approach to collecting and annotating real-world training data. The pipeline begins with an object detection model that extracts molecular images from over a million PDF documents. An active learning algorithm then identifies the most informative samples (those where the current model struggles) for human annotation. The model pre-annotates these images, and human experts review and correct them. This creates an iterative improvement cycle: annotate, train, identify new challenging cases, repeat.</p>
</li>
<li>
<p><strong>Efficient End-to-End Architecture</strong>: The model treats OCSR as an image captioning problem. A Swin-Transformer vision encoder extracts visual features, a simple MLP compresses them, and a BART decoder generates the E-SMILES string autoregressively. The model minimizes the standard negative log-likelihood of the target E-SMILES token sequence $y$ given the sequence history and input image $x$:</p>
</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x; \theta)
\end{aligned}
$$</p>
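<p>In code, this objective is ordinary teacher-forced cross-entropy over the decoder's output tokens. A NumPy sketch (illustrative, not the authors' implementation):</p>

```python
import numpy as np

def sequence_nll(logits, targets):
    """Teacher-forced NLL: -sum_t log P(y_t | y_<t, x) from per-step logits.

    logits: (T, V) unnormalized decoder scores, one row per decoding step
    targets: (T,) gold E-SMILES token ids
    """
    # numerically stable log-softmax over the vocabulary at each step
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick out the log-probability of each gold token and sum
    return -log_probs[np.arange(len(targets)), targets].sum()

# Two decoding steps over a toy 2-token vocabulary:
logits = np.log(np.array([[0.5, 0.5], [0.25, 0.75]]))
loss = sequence_nll(logits, np.array([0, 1]))  # -(log 0.5 + log 0.75)
```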
<p>The training strategy employs curriculum learning, starting with simple molecules and gradually introducing complexity and heavier data augmentation.</p>
<h2 id="experimental-setup-two-stage-training-and-benchmarking">Experimental Setup: Two-Stage Training and Benchmarking</h2>
<p>The evaluation focused on demonstrating that MolParser generalizes to real-world documents:</p>
<ol>
<li>
<p><strong>Two-Stage Training Protocol</strong>: The model underwent a systematic training process:</p>
<ul>
<li><strong>Pre-training</strong>: Initial training on millions of synthetic molecular images using curriculum learning. The curriculum progresses from simple molecules to complex structures while gradually increasing data augmentation intensity (blur, noise, perspective transforms).</li>
<li><strong>Fine-tuning</strong>: Subsequent training on 400,000 curated real-world samples extracted from patents and papers. This fine-tuning phase is critical for adapting to the noise and stylistic variations of actual documents.</li>
</ul>
</li>
<li>
<p><strong>Benchmark Evaluation</strong>: The model was evaluated on multiple standard OCSR benchmarks to establish baseline performance on clean data. These benchmarks test recognition accuracy on well-formatted molecular diagrams.</p>
</li>
<li>
<p><strong>Real-World Document Analysis</strong>: The critical test involved applying MolParser to molecular structures extracted directly from scientific documents. This evaluation measures the gap between synthetic benchmark performance and real-world applicability (the core problem the paper addresses).</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Experiments isolating the contribution of each component:</p>
<ul>
<li>The impact of real-world training data versus synthetic-only training</li>
<li>The effectiveness of curriculum learning versus standard training</li>
<li>The value of the human-in-the-loop annotation pipeline versus random sampling</li>
<li>The necessity of E-SMILES extensions for capturing complex structures</li>
</ul>
</li>
</ol>
<h2 id="outcomes-and-empirical-findings">Outcomes and Empirical Findings</h2>
<ul>
<li>
<p><strong>Performance on Benchmarks</strong>: MolParser improves upon previous OCSR methods on both standard benchmarks and real-world documents. The performance gap is particularly pronounced on real-world data, validating the core hypothesis that training on actual document images is essential for practical deployment. On WildMol-10k, MolParser-Base achieved 76.9% accuracy, significantly outperforming MolScribe (66.4%) and MolGrapher (45.5%).</p>
</li>
<li>
<p><strong>Real-World Data is Critical</strong>: Models trained exclusively on synthetic data show substantial performance degradation when applied to real documents. The 400,000 in-the-wild training samples bridge this gap, demonstrating that data quality and distribution matching matter as much as model architecture. Ablation experiments showed synthetic-only training achieved only 22.4% accuracy on WildMol versus 76.9% with real-world fine-tuning.</p>
</li>
<li>
<p><strong>E-SMILES Enables Broader Coverage</strong>: The extended representation successfully captures molecular structures that were previously inexpressible, particularly Markush structures from patents. This dramatically expands the scope of what can be automatically extracted from chemical literature.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Scales Efficiently</strong>: The active learning pipeline reduces annotation time by up to 90% while maintaining high quality. This approach makes it feasible to curate large-scale, high-quality datasets for specialized domains where expert knowledge is expensive.</p>
</li>
<li>
<p><strong>Speed and Accuracy</strong>: The end-to-end architecture achieves both high accuracy and fast inference, making it practical for large-scale document processing. MolParser-Base processes 40 images per second on RTX 4090D, while the Tiny variant achieves 131 FPS. The direct image-to-text approach avoids the error accumulation of multi-stage pipelines.</p>
</li>
</ul>
<p>The work establishes that practical OCSR requires more than architectural innovations. It demands careful attention to data quality, representation design, and the distribution mismatch between synthetic training data and real-world applications. The combination of E-SMILES, the MolParser-7M dataset, and the human-in-the-loop data engine provides a template for building robust vision systems in scientific domains where clean training data is scarce but expert knowledge is available.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is split into a massive synthetic pre-training set and a curated fine-tuning set.</p>
<p><strong>Training Data Composition (MolParser-7M)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Name</th>
          <th>Size</th>
          <th>Composition / Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pre-training</strong></td>
          <td>MolParser-7M (Synthetic)</td>
          <td>~7.3M</td>
          <td><strong>Markush-3M</strong> (40%), <strong>ChEMBL-2M</strong> (27%), <strong>Polymer-1M</strong> (14%), PAH-600k (8%), BMS-360k (5%). Generated via RDKit/Indigo with randomized styles.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-SFT-400k</td>
          <td>400k</td>
          <td>Real images from patents/papers selected via active learning (confidence filtering 0.6-0.9) and manually annotated. 66% real molecules from PDFs, 34% synthetic.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-Gen-200k</td>
          <td>200k</td>
          <td>Subset of synthetic data kept to prevent catastrophic forgetting.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Handwrite-5k</td>
          <td>5k</td>
          <td>Handwritten molecules from Img2Mol to support hand-drawn queries.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Sources</strong>: 1.2M patents and scientific papers (PDF documents)</li>
<li><strong>Extraction</strong>: MolDet (YOLO11-based detector) identified ~20M molecular images, deduplicated to ~4M candidates</li>
<li><strong>Selection</strong>: Active learning ensemble (5-fold models) identified high-uncertainty samples for annotation</li>
<li><strong>Annotation</strong>: Human experts corrected model pre-annotations (90% time savings vs. from-scratch annotation)</li>
</ul>
<p><strong>Test Benchmarks</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-10k</td>
          <td>10,000</td>
          <td>Standard synthetic benchmark</td>
      </tr>
      <tr>
          <td>Maybridge UoB</td>
          <td>-</td>
          <td>Synthetic molecules</td>
      </tr>
      <tr>
          <td>CLEF-2012</td>
          <td>-</td>
          <td>Patent images</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>-</td>
          <td>Japanese patent office</td>
      </tr>
      <tr>
          <td>ColoredBG</td>
          <td>-</td>
          <td>Colored background molecules</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td>10,000</td>
          <td>Ordinary molecules cropped from real PDFs (new)</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k-M</strong></td>
          <td>10,000</td>
          <td>Markush structures (significantly harder, new)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Extended SMILES (E-SMILES) Encoding</strong>:</p>
<ul>
<li><strong>Format</strong>: <code>SMILES&lt;sep&gt;EXTENSION</code> where <code>&lt;sep&gt;</code> separates core structure from supplementary annotations</li>
<li><strong>Extensions use XML-like tags</strong>:
<ul>
<li><code>&lt;a&gt;index:group&lt;/a&gt;</code> for substituents/variable groups (Markush structures)</li>
<li><code>&lt;r&gt;</code> for variable attachment points</li>
<li><code>&lt;c&gt;</code> for abstract rings</li>
<li><code>&lt;dum&gt;</code> for connection points</li>
</ul>
</li>
<li><strong>Backward compatible</strong>: Core SMILES parseable by RDKit; extensions provide structured format for edge cases</li>
</ul>
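<p>A rough sketch of how such a string could be split into its core and annotations. The tag grammar follows the paper's description, but the concrete payloads (e.g. <code>1:OMe</code>) are invented for illustration:</p>

```python
import re

def parse_esmiles(s):
    """Split an E-SMILES string into its RDKit-parseable core SMILES
    and a list of XML-like extension annotations."""
    core, _, ext = s.partition("<sep>")
    # Match the annotation tags described in the paper: <a>, <r>, <c>, <dum>
    tags = re.findall(r"<(a|r|c|dum)>(.*?)</\1>", ext)
    return core, [{"tag": t, "value": v} for t, v in tags]

# Hypothetical Markush-style string: a benzene core with two R-group options.
core, anns = parse_esmiles("c1ccccc1[*]<sep><a>1:OMe</a><a>1:Cl</a>")
```

The key design property survives even in this toy parser: everything before `<sep>` is plain SMILES that standard cheminformatics tooling can consume, while the tail is trivially machine-parseable.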
<p><strong>Curriculum Learning Strategy</strong>:</p>
<ul>
<li><strong>Phase 1</strong>: No augmentation, simple molecules (&lt;60 tokens)</li>
<li><strong>Phase 2</strong>: Gradually increase augmentation intensity and sequence length</li>
<li>Progressive complexity allows stable training on diverse molecular structures</li>
</ul>
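<p>One way to realize such a schedule is to gate samples by token length and ramp augmentation intensity with training progress. The 60-token Phase-1 cutoff is from the paper; the 500-token ceiling and the linear ramp are assumptions for illustration:</p>

```python
def curriculum_batch(samples, progress):
    """Filter samples by token length given training progress in [0, 1].

    Phase 1 (progress < 0.25): only short sequences (< 60 tokens), no augmentation.
    Phase 2: linearly relax the length cap and ramp augmentation intensity.
    (Phase boundary, ceiling, and ramp shape are hypothetical.)
    """
    if progress < 0.25:
        max_len, aug = 60, 0.0
    else:
        frac = (progress - 0.25) / 0.75
        max_len = int(60 + frac * 440)  # cap grows toward 500 tokens
        aug = frac                      # augmentation intensity in [0, 1]
    kept = [s for s in samples if len(s["tokens"]) < max_len]
    return kept, aug

samples = [{"tokens": ["C"] * 30}, {"tokens": ["C"] * 200}]
early, aug0 = curriculum_batch(samples, 0.1)  # short molecule only, no aug
late, aug1 = curriculum_batch(samples, 1.0)   # both molecules, full aug
```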
<p><strong>Active Learning Data Selection</strong>:</p>
<ol>
<li>Train 5 model folds on current dataset</li>
<li>Compute pairwise Tanimoto similarity of predictions on candidate images</li>
<li>Select samples with confidence scores <strong>0.6-0.9</strong> for human review (highest learning value)</li>
<li>Human experts correct model pre-annotations</li>
<li>Iteratively expand training set with hard samples</li>
</ol>
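<p>The selection criterion can be sketched as mean pairwise Tanimoto agreement among the five folds' predictions, keeping only images in the moderate-confidence band. This is a simplification: bit-set fingerprints stand in here for however the paper actually compares predicted structures:</p>

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def select_for_annotation(fold_predictions, lo=0.6, hi=0.9):
    """Keep candidate images whose ensemble agreement falls in [lo, hi].

    fold_predictions: {image_id: [fingerprint set predicted by each fold]}
    Low agreement = hopelessly hard; high agreement = already solved;
    the middle band has the highest learning value per annotation.
    """
    chosen = []
    for image_id, fps in fold_predictions.items():
        pairs = list(combinations(fps, 2))
        confidence = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
        if lo <= confidence <= hi:
            chosen.append(image_id)
    return chosen

preds = {
    "easy.png": [{1, 2, 3}] * 5,                            # folds agree fully
    "hard.png": [{1, 2}, {3, 4}, {5, 6}, {7, 8}, {9, 10}],  # no agreement
    "edge.png": [{1, 2, 3}] * 4 + [{1, 2}],                 # moderate agreement
}
picked = select_for_annotation(preds)  # only the moderate case survives
```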
<p><strong>Data Augmentations</strong>:</p>
<ul>
<li>RandomAffine (rotation, scale, translation)</li>
<li>JPEGCompress (compression artifacts)</li>
<li>InverseColor (color inversion)</li>
<li>SurroundingCharacters (text interference)</li>
<li>RandomCircle (circular artifacts)</li>
<li>ColorJitter (brightness, contrast variations)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard <strong>Image Captioning</strong> (Encoder-Decoder) paradigm.</p>
<p><strong>Architecture Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vision Encoder</strong></td>
          <td>Swin Transformer (ImageNet pretrained)</td>
      </tr>
      <tr>
          <td>- Tiny variant</td>
          <td>66M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Small variant</td>
          <td>108M parameters</td>
      </tr>
      <tr>
          <td>- Base variant</td>
          <td>216M parameters, $384 \times 384$ input</td>
      </tr>
      <tr>
          <td><strong>Connector</strong></td>
          <td>2-layer MLP reducing channel dimension by half</td>
      </tr>
      <tr>
          <td><strong>Text Decoder</strong></td>
          <td>BART-Decoder (12 layers, 16 attention heads)</td>
      </tr>
  </tbody>
</table>
<p><strong>Training Configuration</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Pre-training</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Hardware</strong></td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
      </tr>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>AdamW</td>
          <td>AdamW</td>
      </tr>
      <tr>
          <td><strong>Learning Rate</strong></td>
          <td>$1 \times 10^{-4}$</td>
          <td>$5 \times 10^{-5}$</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong></td>
          <td>$1 \times 10^{-2}$</td>
          <td>$1 \times 10^{-2}$</td>
      </tr>
      <tr>
          <td><strong>Scheduler</strong></td>
          <td>Cosine with warmup</td>
          <td>Cosine with warmup</td>
      </tr>
      <tr>
          <td><strong>Epochs</strong></td>
          <td>20</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Label Smoothing</strong></td>
          <td>0.01</td>
          <td>0.005</td>
      </tr>
  </tbody>
</table>
<p><strong>Curriculum Learning Schedule</strong> (Pre-training):</p>
<ul>
<li>Starts with simple molecules (&lt;60 tokens, no augmentation)</li>
<li>Gradually adds complexity and augmentation (blur, noise, perspective transforms)</li>
<li>Enables stable learning across diverse molecular structures</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>: Exact match accuracy on predicted E-SMILES strings (character-level exact match)</p>
<p><strong>Key Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MolParser-Base</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td><strong>76.9%</strong></td>
          <td>66.4%</td>
          <td>45.5%</td>
          <td>Real-world patent/paper crops</td>
      </tr>
      <tr>
          <td><strong>USPTO-10k</strong></td>
          <td><strong>94.5%</strong></td>
          <td>96.0%</td>
          <td>93.3%</td>
          <td>Synthetic benchmark</td>
      </tr>
      <tr>
          <td><strong>Throughput (FPS)</strong></td>
          <td><strong>39.8</strong></td>
          <td>16.5</td>
          <td>2.2</td>
          <td>Measured on RTX 4090D</td>
      </tr>
  </tbody>
</table>
<p><strong>Additional Performance</strong>:</p>
<ul>
<li>MolParser-Tiny: 131 FPS on RTX 4090D (66M params)</li>
<li>Real-world vs. synthetic gap: Fine-tuning on MolParser-SFT-400k closed the performance gap between clean benchmarks and in-the-wild documents</li>
</ul>
<p><strong>Ablation Findings</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-world training data</td>
          <td>Substantial accuracy gains on documents vs. synthetic-only training (22.4% → 76.9% on WildMol)</td>
      </tr>
      <tr>
          <td>Curriculum learning</td>
          <td>Outperforms standard training, especially for complex molecules</td>
      </tr>
      <tr>
          <td>Active learning selection</td>
          <td>More effective than random sampling for annotation budget</td>
      </tr>
      <tr>
          <td>E-SMILES extensions</td>
          <td>Essential for Markush structure recognition (impossible with standard SMILES)</td>
      </tr>
      <tr>
          <td>Dataset scale</td>
          <td>Larger pre-training dataset (7M vs 300k) significantly improves generalization</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 8x NVIDIA RTX 4090D GPUs</li>
<li><strong>Inference</strong>: Single RTX 4090D sufficient for real-time processing</li>
<li><strong>Training time</strong>: 20 epochs pre-training + 4 epochs fine-tuning (specific duration not reported)</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{fang2025molparser,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2411.11098}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2411.11098}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser-7M &amp; WildMol: Large-Scale OCSR Datasets</title><link>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/molparser_7m-wildmol/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/optical-structure-recognition/molparser_7m-wildmol/</guid><description>MolParser-7M is the largest OCSR dataset with 7.7M image-text pairs of molecules and E-SMILES, including 400k real-world samples.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>

<figure class="post-figure center ">
    <img src="/img/molparser-markush-example.webp"
         alt="Example of a complex Markush structure"
         title="Example of a complex Markush structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example of a complex Markush structure that can be represented by the E-SMILES format but not by standard SMILES or FG-SMILES.</figcaption>
    
</figure>

<figure class="post-figure center ">
    <img src="/img/molparser-low-quality-example.webp"
         alt="Sample from the WildMol benchmark"
         title="Sample from the WildMol benchmark"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A sample from the WildMol benchmark, showing a low-quality, noisy molecular image cropped from real-world literature that challenges OCSR systems.</figcaption>
    
</figure>

<figure class="post-figure center ">
    <img src="/img/molparser-colored-example.webp"
         alt="Colored molecule with annotations"
         title="Colored molecule with annotations"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser-7M (Training Set)</strong></td>
          <td>7,740,871</td>
          <td>A large-scale dataset for training OCSR models, split into pre-training and fine-tuning stages.</td>
      </tr>
      <tr>
          <td><strong>WildMol (Test Set)</strong></td>
          <td>20,000</td>
          <td>A benchmark of 20,000 human-annotated samples cropped from real PDF files to evaluate OCSR models in &lsquo;in-the-wild&rsquo; scenarios.</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="wildmol-10k-accuracy">WildMol-10K Accuracy<a hidden class="anchor" aria-hidden="true" href="#wildmol-10k-accuracy">#</a></h3>
    <p class="benchmark-description">Evaluation of OCSR models on 10,000 real-world molecular images cropped from scientific literature and patents</p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Accuracy (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>MolParser-Base</strong><br><small>End-to-end visual recognition trained on MolParser-7M</small>
          </td>
          <td>76.9</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>MolScribe</strong><br><small>Transformer-based OCSR system</small>
          </td>
          <td>66.4</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>DECIMER 2.7</strong><br><small>Deep learning for chemical image recognition</small>
          </td>
          <td>56.0</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>MolGrapher</strong><br><small>Graph-based molecular structure recognition</small>
          </td>
          <td>45.5</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>MolVec 0.9.7</strong><br><small>Vector-based structure recognition</small>
          </td>
          <td>26.4</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>OSRA 2.1</strong><br><small>Optical Structure Recognition Application</small>
          </td>
          <td>26.3</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Img2Mol</strong><br><small>Image-to-molecule translation</small>
          </td>
          <td>24.4</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Imago 2.0</strong><br><small>Chemical structure recognition toolkit</small>
          </td>
          <td>6.9</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="key-contribution">Key Contribution</h2>
<p>Introduces MolParser-7M, the largest Optical Chemical Structure Recognition (OCSR) dataset, uniquely combining diverse synthetic data with a large volume of manually-annotated, &ldquo;in-the-wild&rdquo; images from real scientific documents to improve model robustness. Also introduces WildMol, a new challenging benchmark for evaluating OCSR performance on real-world data, including Markush structures.</p>
<h2 id="overview">Overview</h2>
<p>The MolParser project addresses the challenge of recognizing molecular structures from images found in real-world scientific documents. Unlike existing OCSR datasets that rely primarily on synthetically generated images, MolParser-7M incorporates 400,000 manually annotated images cropped from actual patents and scientific papers, making it the first large-scale dataset to bridge the gap between synthetic training data and real-world deployment scenarios.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Largest open-source OCSR dataset with over 7.7 million pairs</li>
<li>The only large-scale OCSR training set that includes a significant amount (400k) of &ldquo;in-the-wild&rdquo; data cropped from real patents and literature</li>
<li>High diversity of molecular structures from numerous sources (PubChem, ChEMBL, polymers, etc.)</li>
<li>Introduces the WildMol benchmark for evaluating performance on challenging, real-world data, including Markush structures</li>
<li>The &ldquo;in-the-wild&rdquo; fine-tuning data (MolParser-SFT-400k) was curated via an efficient active learning data engine with human-in-the-loop validation</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The E-SMILES format cannot represent certain complex cases, such as coordination bonds, dashed abstract rings, and Markush structures depicted with special patterns</li>
<li>The model and data do not yet fully exploit molecular chirality, which is critical for chemical properties</li>
<li>Performance could be further improved by scaling up the amount of real annotated training data</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="synthetic-data-generation">Synthetic Data Generation</h3>
<p>To ensure diversity, molecular structures were collected from databases like ChEMBL, PubChem, and Kaggle BMS. A significant number of Markush, polymer, and fused-ring structures were also randomly generated. Images were rendered using RDKit and epam.indigo with randomized parameters (e.g., bond width, font size, rotation) to increase visual diversity.</p>
<h3 id="in-the-wild-data-engine-molparser-sft-400k">In-the-Wild Data Engine (MolParser-SFT-400k)</h3>
<p>A YOLOv11 object detection model (MolDet) located and cropped over 20 million molecule images from 1.22 million real PDFs (patents and papers). After de-duplication via p-hash similarity, 4 million unique images remained.</p>
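<p>The p-hash de-duplication step is the authors&rsquo; own pipeline, but the underlying idea is easy to illustrate. Below is a sketch using average hashing, a simpler cousin of perceptual hashing: collapse each thumbnail to a bit string and treat images whose hashes are within a small Hamming distance as duplicates. All names and the toy 2&times;2 &ldquo;thumbnails&rdquo; are mine, not from the paper.</p>

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Collapse a small grayscale thumbnail into a binary hash:
    each bit records whether a pixel is above the mean intensity."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Two near-identical thumbnails hash to nearby values, so a small
# Hamming-distance threshold treats them as duplicates.
img_a = [[10, 200], [200, 10]]
img_b = [[10, 200], [200, 30]]
assert hamming(average_hash(img_a), average_hash(img_b)) <= 1
```

<p>Real implementations hash at larger sizes (e.g., 8&times;8 after downscaling) and use a DCT-based p-hash for robustness to rendering differences.</p>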
<p>An active learning algorithm was used to select the most informative samples for annotation, targeting images where an ensemble of 5-fold models showed moderate confidence (0.6-0.9 Tanimoto similarity), indicating they were challenging but learnable.</p>
<p>This active learning approach with model pre-annotations reduced manual annotation time per molecule to 30 seconds, a 90% savings compared to annotating from scratch. In the final fine-tuning dataset, 56.04% of annotations directly utilized raw model pre-annotations, 20.97% passed review after a single manual correction, and just 9.13% required three or more rounds of annotation.</p>
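<p>The selection rule itself is simple to express. Here is a minimal sketch (sample IDs and the score dictionary are hypothetical): keep candidates whose ensemble Tanimoto score falls in the moderate-confidence band, since very high scores are already solved and very low scores are likely noise.</p>

```python
def select_for_annotation(scores: dict[str, float],
                          lo: float = 0.6, hi: float = 0.9) -> list[str]:
    """Keep samples whose ensemble Tanimoto similarity falls in the
    'moderate confidence' band: hard enough to be informative,
    easy enough to be learnable."""
    return sorted(k for k, s in scores.items() if lo <= s <= hi)

scores = {"img_001": 0.97,   # ensemble agrees: the model already handles it
          "img_002": 0.75,   # moderate confidence: worth annotating
          "img_003": 0.31,   # likely noise or out of scope
          "img_004": 0.62}
assert select_for_annotation(scores) == ["img_002", "img_004"]
```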
<h3 id="e-smiles-specification">E-SMILES Specification</h3>
<p>To accommodate complex patent structures that standard SMILES cannot support, the authors introduced an Extended SMILES format (<code>SMILES&lt;sep&gt;EXTENSION</code>). The <code>EXTENSION</code> component uses XML-like tokens to manage complexities:</p>
<ul>
<li><code>&lt;a&gt;...&lt;/a&gt;</code> encapsulates Markush R-groups and abbreviation groups.</li>
<li><code>&lt;r&gt;...&lt;/r&gt;</code> denotes ring attachments with uncertainty positions.</li>
<li><code>&lt;c&gt;...&lt;/c&gt;</code> defines abstract rings.</li>
<li><code>&lt;dum&gt;</code> identifies a connection point.</li>
</ul>
<p>This format provides a crucial advancement for Markush-molecule matching and LLM integration, while retaining RDKit compatibility for the standard SMILES portion.</p>
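<p>Based on the <code>SMILES&lt;sep&gt;EXTENSION</code> layout described above, splitting the two components is straightforward. The sketch below is mine, not the authors&rsquo; parser, and the example extension body (<code>[*:1]R1</code>) is invented for illustration:</p>

```python
import re

def split_esmiles(esmiles: str) -> tuple[str, list[tuple[str, str]]]:
    """Split an E-SMILES string into its RDKit-compatible SMILES part
    and a list of (tag, body) pairs from the XML-like extension."""
    smiles, _sep, extension = esmiles.partition("<sep>")
    # Paired tokens like <a>...</a>; <dum> carries no body, so it is
    # counted separately.
    pairs = re.findall(r"<(a|r|c)>(.*?)</\1>", extension)
    pairs += [("dum", "")] * extension.count("<dum>")
    return smiles, pairs

smiles, ext = split_esmiles("CC(=O)O<sep><a>[*:1]R1</a><dum>")
assert smiles == "CC(=O)O"
assert ext == [("a", "[*:1]R1"), ("dum", "")]
```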
]]></content:encoded></item><item><title>SELFIES (Self-Referencing Embedded Strings)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/selfies/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/selfies/</guid><description>SELFIES is a 100% robust molecular string representation for ML, implemented in the open-source selfies Python library.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p><strong>SELFIES (SELF-referencIng Embedded Strings)</strong> is a string-based molecular representation designed to be <strong>100% robust</strong>. This means every SELFIES string, even one generated randomly, corresponds to a syntactically and semantically valid molecule. SELFIES was developed to overcome a major limitation of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>, where a large fraction of strings generated in machine learning tasks represent invalid chemical structures.</p>
<p>The format is implemented in an open-source Python library called <code>selfies</code>, which has significantly expanded its capabilities since the original publication. The modern library uses an internal graph-based representation for greater efficiency and flexibility.</p>
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>100% Robustness</strong>: Every possible SELFIES string can be decoded into a valid molecular graph that obeys chemical valence rules. This is its fundamental advantage over SMILES.</li>
<li><strong>Machine Learning Friendly</strong>: Can be used directly in any machine learning model (like VAEs or GANs) without adaptation, guaranteeing that all generated outputs are valid molecules.</li>
<li><strong>Customizable Constraints</strong>: The underlying chemical rules, such as maximum valence for different atoms, can be customized by the user. The library provides presets (e.g., for hypervalent species) and allows users to define their own rule sets.</li>
<li><strong>Human-readable</strong>: With some familiarity, SELFIES strings are human-readable, allowing interpretation of functional groups and connectivity.</li>
<li><strong>Local Operations</strong>: SELFIES encodes branch length and ring size locally within the string, preventing common syntactical errors. By contrast, SMILES relies on non-local ring-closure digits and matched parentheses for branches.</li>
<li><strong>Broad Support</strong>: The current <code>selfies</code> library supports aromatic molecules (via kekulization), isotopes, charges, radicals, and stereochemistry. It also includes a dot symbol (<code>.</code>) for representing disconnected molecular fragments.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Indirect Canonicalization</strong>: A canonical SELFIES string is currently generated by first creating a canonical SMILES string and then converting it to SELFIES. Direct canonicalization is a goal for future development.</li>
<li><strong>Ongoing Standardization</strong>: While the library now supports most major features found in SMILES, work is ongoing to extend the format to more complex systems like polymers, crystals, and reactions.</li>
<li><strong>String Length</strong>: SELFIES strings are generally longer than their corresponding SMILES strings, which can impact storage and processing times for very large datasets.</li>
</ul>
<h3 id="recent-developments">Recent Developments</h3>
<p>The SELFIES library has undergone significant improvements since its <a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/">original introduction</a>. The <a href="/notes/computational-chemistry/molecular-representations/selfies-2023/">2023 library update</a> brought major architectural changes, including a graph-based internal representation, expanded chemical support (aromatics, stereochemistry, charges), and enhanced machine learning integration. Looking forward, researchers have outlined <a href="/notes/computational-chemistry/molecular-representations/selfies-2022/">16 future research directions</a> for extending robust representations to complex systems like polymers, crystals, and chemical reactions.</p>
<h2 id="basic-syntax">Basic Syntax</h2>
<p>SELFIES uses symbols enclosed in square brackets (e.g., <code>[C]</code>, <code>[O]</code>, <code>[#N]</code>). The interpretation of each symbol depends on the current <strong>state of the derivation</strong>, which ensures chemical valence rules are strictly obeyed. The syntax is formally defined by a context-free (Chomsky type-2) grammar.</p>
<h3 id="derivation-rules">Derivation Rules</h3>
<p>SELFIES are constructed using a table of derivation rules. The process starts in an initial state (e.g., $X_0$) and reads the SELFIES string symbol by symbol. Each symbol, combined with the current state, determines the resulting atom/bond and the next state. The derivation state $X_n$ intuitively tracks that the previously added atom can form a maximum of $n$ additional bonds.</p>
<p>For example, the string <code>[F][C][C][#N]</code> is derived as follows:</p>
<p>$$
\begin{aligned}
\text{State } X_0 + \text{[F]} &amp;\rightarrow \text{F (molecule)} + \text{State } X_1 \\
\text{State } X_1 + \text{[C]} &amp;\rightarrow \text{F-C} + \text{State } X_3 \\
\text{State } X_3 + \text{[C]} &amp;\rightarrow \text{F-C=C} + \text{State } X_2 \\
\text{State } X_2 + \text{[\#N]} &amp;\rightarrow \text{F-C=C=N} + \text{Final}
\end{aligned}
$$</p>
<h3 id="structural-features">Structural Features</h3>
<ul>
<li><strong>Branches</strong>: Represented by a <code>[Branch]</code> symbol. The symbols immediately following it are interpreted as an index that specifies the number of SELFIES symbols belonging to that branch. This structure prevents errors like unmatched parentheses in SMILES.</li>
<li><strong>Rings</strong>: Represented by a <code>[Ring]</code> symbol. Similar to branches, subsequent symbols specify an index that indicates which previous atom to connect to, forming a ring closure. To guarantee validity, ring bond creation is postponed to a final post-processing step, where it is only completed if the target atom has available valence bonds.</li>
</ul>
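<p>The index encoding that the branch and ring symbols rely on is a base-16 scheme over a fixed 16-symbol alphabet. The sketch below is my own pure-Python illustration; the alphabet order follows my reading of the current <code>selfies</code> source and should be treated as approximate rather than authoritative:</p>

```python
# The 16 symbols mapped to digits 0-15 for branch lengths and ring
# sizes (order per my understanding of the selfies library source).
INDEX_ALPHABET = ["[C]", "[Ring1]", "[Ring2]",
                  "[Branch1]", "[=Branch1]", "[#Branch1]",
                  "[Branch2]", "[=Branch2]", "[#Branch2]",
                  "[O]", "[N]", "[=N]", "[=C]", "[#C]", "[S]", "[P]"]

def decode_index(symbols: list[str]) -> int:
    """Interpret a run of symbols as a big-endian base-16 number."""
    value = 0
    for sym in symbols:
        value = value * 16 + INDEX_ALPHABET.index(sym)
    return value

# In benzene's '[Ring1][=Branch1]', the index symbol [=Branch1] encodes
# 4, so the ring bond reaches back 4 + 1 = 5 atoms, closing the
# six-membered ring.
assert decode_index(["[=Branch1]"]) == 4
assert decode_index(["[Ring1]", "[C]"]) == 16
```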
<h2 id="examples">Examples</h2>
<p>Here are some examples of SELFIES representations for common molecules:</p>















<figure class="post-figure center ">
    <img src="/img/selfies/ethanol.webp"
         alt="Ethanol molecule from SELFIES"
         title="Ethanol molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ethanol: <code>[C][C][O]</code></figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/selfies/benzene.webp"
         alt="Benzene molecule from SELFIES"
         title="Benzene molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Benzene: <code>[C][=C][C][=C][C][=C][Ring1][=Branch1]</code></figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/selfies/aspirin.webp"
         alt="Aspirin molecule from SELFIES"
         title="Aspirin molecule from SELFIES"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Aspirin: <code>[C][C][Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][C][=O][O]</code></figcaption>
    
</figure>

<h2 id="practical-applications">Practical Applications</h2>
<p>SELFIES is particularly advantageous for generative models in computational chemistry:</p>
<ul>
<li><strong>Inverse Molecular Design</strong>: Ideal for tasks where novel molecules with desired properties are generated using models like Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).</li>
<li><strong>Valid Latent Space</strong>: When used in a VAE, the entire continuous latent space decodes to valid molecules, whereas with SMILES large regions of the latent space decode to invalid strings. This is crucial for both optimization tasks and model interpretability.</li>
<li><strong>Higher Molecular Diversity</strong>: Experiments show that generative models trained with SELFIES produce a significantly higher density and diversity of valid molecules. A VAE trained with SELFIES generated two orders of magnitude more diverse molecules, and a GAN produced 78.9% diverse, valid molecules versus 18.5% for SMILES.</li>
</ul>
<h3 id="the-selfies-python-library">The <code>selfies</code> Python Library</h3>
<p>The <code>selfies</code> library provides a user-friendly and efficient implementation without external dependencies. Key functionalities include:</p>
<ul>
<li><strong>Core Functions</strong>: <code>encoder()</code> to convert SMILES to SELFIES and <code>decoder()</code> to convert SELFIES to SMILES.</li>
<li><strong>Customization</strong>: <code>set_semantic_constraints()</code> allows users to define their own valence rules.</li>
<li><strong>Attribution</strong>: A flag in the core functions enables tracing which input symbols are responsible for each output symbol, explaining the translation process.</li>
<li><strong>Utilities</strong>: Functions are provided for common tasks like tokenizing a SELFIES string (<code>split_selfies</code>), getting its length (<code>len_selfies</code>), and converting SELFIES strings into numerical representations (label or one-hot encodings) for machine learning pipelines (<code>selfies_to_encoding</code>).</li>
</ul>
<h2 id="variants-and-extensibility">Variants and Extensibility</h2>
<p>The core SELFIES grammar is designed to be extensible. The <code>selfies</code> library has already extended the original concept to cover a much richer chemical space:</p>
<ul>
<li><strong>Current Support</strong>: The library can represent ions, radicals, stereochemistry, aromatic systems, and most elements relevant to organic and medicinal chemistry. It has been used to encode and decode all 72 million molecules from PubChem.</li>
<li><strong>Future Directions</strong>: Researchers have outlined future extensions to represent more complex structures like polymers, crystals, molecules with non-covalent bonds, and chemical reactions.</li>
</ul>
<h2 id="further-reading">Further Reading</h2>
<h3 id="research-papers">Research Papers</h3>
<ul>
<li><a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/"><strong>SELFIES: The Original Paper (2020)</strong></a>: The foundational work that introduced SELFIES and demonstrated its advantages over SMILES</li>
<li><a href="/notes/computational-chemistry/molecular-representations/selfies-2023/"><strong>Recent Advances in the SELFIES Library (2023)</strong></a>: Major library improvements including performance gains and expanded chemical support</li>
<li><a href="/notes/computational-chemistry/molecular-representations/selfies-2022/"><strong>SELFIES and the Future of Molecular String Representations (2022)</strong></a>: Perspective paper outlining 16 future research directions</li>
</ul>
<h3 id="practical-applications-1">Practical Applications</h3>
<ul>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/"><strong>Converting SELFIES Strings to 2D Molecular Images</strong></a>: Hands-on tutorial demonstrating SELFIES robustness and building visualization tools</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>Krenn, M., Häse, F., Nigam, A., Friederich, P., &amp; Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. <a href="https://doi.org/10.1088/2632-2153/aba947"><em>Machine Learning: Science and Technology</em>, <em>1</em>(4), 045024.</a></li>
<li>Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <a href="https://doi.org/10.1039/d3dd00044c"><em>Digital Discovery</em>, <em>2</em>, 897-908.</a></li>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>Converting SMILES and SELFIES to 2D Molecular Images</title><link>https://hunterheidenreich.com/posts/visualizing-smiles-and-selfies-strings/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/visualizing-smiles-and-selfies-strings/</guid><description>A complete guide to generating high-quality 2D molecular structure images from SMILES and SELFIES strings using Python, RDKit, and PIL.</description><content:encoded><![CDATA[<p>Lately, I&rsquo;ve spent a lot of time staring at datasets full of 1D molecular strings.
With time, I find I get better at recognizing functional groups and substructures like <code>C(=O)O</code> (carboxylic acid) or <code>c1ccccc1</code> (benzene ring) in SMILES.
However, anything really complex is beyond my personal visualization capabilities.</p>
<p>I ran into this recently while debugging a generative model.
Sometimes the grammar of the string provides the clue as to what is going wrong.
Other times, actually <em>seeing</em> the molecule is what helps.
I had a terminal full of generated strings and needed to verify their structures visually.
I needed a streamlined way to generate these images locally: a lightweight script that turns that text into a properly formatted image, directly from the terminal.</p>
<h2 id="smiles-vs-selfies">SMILES vs. SELFIES</h2>
<p>There are two primary string representations you will encounter in modern cheminformatics:</p>
<ol>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a></strong>: The industry standard. It uses simple rules (<code>C</code> for carbon, <code>=</code> for double bonds, parentheses for branches). It is compact and machine-parseable. However, random SMILES strings are often invalid (e.g., unclosed rings or invalid valences).</li>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a></strong>: Designed specifically for machine learning. It is a robust representation where <em>every</em> string corresponds to a valid molecular graph. This makes it ideal for generative models. Note that it is more verbose than SMILES.</li>
</ol>
<p>I often need to visualize both formats. Let&rsquo;s build a single, robust Python tool to handle them.</p>
<h2 id="the-quick-win-native-rdkit">The Quick Win: Native RDKit</h2>
<p>If you just need a quick image from a SMILES string and don&rsquo;t care about the image dimensions or adding a legend, RDKit can do this in three lines:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> Draw
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;CCO&#34;</span>)
</span></span><span style="display:flex;"><span>Draw<span style="color:#f92672">.</span>MolToFile(mol, <span style="color:#e6db74">&#34;ethanol.png&#34;</span>)
</span></span></code></pre></div><p>The native RDKit method is fast for quick checks. However, a custom renderer gives us control over image dimensions and formula subscripts, and lets us accept additional input formats like SELFIES.</p>
<h2 id="building-a-custom-renderer-for-precise-control">Building a Custom Renderer for Precise Control</h2>
<p>Let&rsquo;s build a more robust tool using RDKit for chemical processing, the <code>selfies</code> library for decoding, and PIL for high-quality image manipulation.</p>
<h3 id="core-dependencies">Core Dependencies</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> Draw, rdDepictor, rdMolDescriptors
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> PIL <span style="color:#f92672">import</span> Image, ImageDraw, ImageFont
</span></span></code></pre></div><p>RDKit handles the chemical logic, <code>selfies</code> translates SELFIES to SMILES, and PIL gives us fine control over the final image appearance.</p>
<h3 id="the-main-conversion-function">The Main Conversion Function</h3>
<p>Here is the core conversion logic. Note the Python type hints, which make the function&rsquo;s contract explicit and the code easier to maintain.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">string_to_png</span>(mol_string: str, output_file: str, size: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>, is_selfies: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Generates a 2D molecule image with a chemical formula legend from SMILES or SELFIES.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Decode SELFIES to SMILES if necessary</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> is_selfies:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            smiles <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(mol_string)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid SELFIES string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>) <span style="color:#f92672">from</span> e
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        smiles <span style="color:#f92672">=</span> mol_string
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(smiles)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> mol:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Could not generate molecule from SMILES: </span><span style="color:#e6db74">{</span>smiles<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Generate 2D coordinates and formula</span>
</span></span><span style="display:flex;"><span>    rdDepictor<span style="color:#f92672">.</span>Compute2DCoords(mol)
</span></span><span style="display:flex;"><span>    formula <span style="color:#f92672">=</span> rdMolDescriptors<span style="color:#f92672">.</span>CalcMolFormula(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Render the molecule</span>
</span></span><span style="display:flex;"><span>    img <span style="color:#f92672">=</span> Draw<span style="color:#f92672">.</span>MolToImage(mol, size<span style="color:#f92672">=</span>(size, size))<span style="color:#f92672">.</span>convert(<span style="color:#e6db74">&#34;RGBA&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Create a canvas with extra space at the bottom for the legend</span>
</span></span><span style="display:flex;"><span>    legend_height <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.1</span>)
</span></span><span style="display:flex;"><span>    canvas <span style="color:#f92672">=</span> Image<span style="color:#f92672">.</span>new(<span style="color:#e6db74">&#34;RGBA&#34;</span>, (size, size <span style="color:#f92672">+</span> legend_height), <span style="color:#e6db74">&#34;white&#34;</span>)
</span></span><span style="display:flex;"><span>    canvas<span style="color:#f92672">.</span>paste(img, (<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    draw <span style="color:#f92672">=</span> ImageDraw<span style="color:#f92672">.</span>Draw(canvas)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Define dynamic font sizes</span>
</span></span><span style="display:flex;"><span>    font_reg <span style="color:#f92672">=</span> get_font(int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.03</span>))
</span></span><span style="display:flex;"><span>    font_sub <span style="color:#f92672">=</span> get_font(int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw the legend</span>
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>)
</span></span><span style="display:flex;"><span>    y <span style="color:#f92672">=</span> size <span style="color:#f92672">+</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw &#34;Formula: &#34; label</span>
</span></span><span style="display:flex;"><span>    draw<span style="color:#f92672">.</span>text((x, y), <span style="color:#e6db74">&#34;Formula: &#34;</span>, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">+=</span> draw<span style="color:#f92672">.</span>textlength(<span style="color:#e6db74">&#34;Formula: &#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw formula with subscript handling for numbers</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> char <span style="color:#f92672">in</span> formula:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Use smaller font and lower y-offset for numbers (subscripts)</span>
</span></span><span style="display:flex;"><span>        font <span style="color:#f92672">=</span> font_sub <span style="color:#66d9ef">if</span> char<span style="color:#f92672">.</span>isdigit() <span style="color:#66d9ef">else</span> font_reg
</span></span><span style="display:flex;"><span>        y_offset <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.005</span>) <span style="color:#66d9ef">if</span> char<span style="color:#f92672">.</span>isdigit() <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        draw<span style="color:#f92672">.</span>text((x, y <span style="color:#f92672">+</span> y_offset), char, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font)
</span></span><span style="display:flex;"><span>        x <span style="color:#f92672">+=</span> draw<span style="color:#f92672">.</span>textlength(char, font<span style="color:#f92672">=</span>font)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw original string</span>
</span></span><span style="display:flex;"><span>    label <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;SELFIES&#34;</span> <span style="color:#66d9ef">if</span> is_selfies <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#34;SMILES&#34;</span>
</span></span><span style="display:flex;"><span>    draw<span style="color:#f92672">.</span>text((x, y), <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34; | </span><span style="color:#e6db74">{</span>label<span style="color:#e6db74">}</span><span style="color:#e6db74">: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    canvas<span style="color:#f92672">.</span>save(output_file)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Saved: </span><span style="color:#e6db74">{</span>output_file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>This function handles everything: SELFIES decoding, validation, coordinate generation, image creation, and legend drawing.</p>
<h3 id="font-handling">Font Handling</h3>
<p>We need a helper to handle fonts robustly across systems:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_font</span>(size: int, font_name: str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;arial.ttf&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Attempts to load a TTF font, falls back to default if unavailable.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> ImageFont<span style="color:#f92672">.</span>truetype(font_name, size)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">IOError</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> ImageFont<span style="color:#f92672">.</span>load_default()
</span></span></code></pre></div><h2 id="examples-in-action">Examples in Action</h2>
<p>Let&rsquo;s see the tool in action with some common molecules, comparing the SMILES and SELFIES inputs.</p>
<h3 id="simple-molecules">Simple Molecules</h3>
<figure class="post-figure center ">
    <img src="/img/smiles2img/ethanol_demo.webp"
         alt="Ethanol molecular structure with formula C2H6O"
         title="Ethanol molecular structure with formula C2H6O"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Ethanol</strong>: A simple alcohol. The SMILES is <code>CCO</code>, while the SELFIES is <code>[C][C][O]</code>.</figcaption>
    
</figure>

<h3 id="aromatic-compounds">Aromatic Compounds</h3>
<figure class="post-figure center ">
    <img src="/img/smiles2img/benzene_demo.webp"
         alt="Benzene molecular structure with formula C6H6"
         title="Benzene molecular structure with formula C6H6"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Benzene</strong>: The classic aromatic ring. SMILES uses numbers for ring closures (<code>C1=CC=CC=C1</code>), while SELFIES uses explicit tokens (<code>[C][=C][C][=C][C][=C][Ring1][=Branch1]</code>).</figcaption>
    
</figure>

<h3 id="complex-pharmaceuticals">Complex Pharmaceuticals</h3>
<figure class="post-figure center ">
    <img src="/img/smiles2img/aspirin_demo.webp"
         alt="Aspirin molecular structure with formula C9H8O4"
         title="Aspirin molecular structure with formula C9H8O4"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Aspirin</strong>: A more complex molecule showing how the tool handles branched structures and multiple functional groups.</figcaption>
    
</figure>

<h2 id="going-further-vector-graphics-svg">Going Further: Vector Graphics (SVG)</h2>
<p>For true publication-quality figures, use vector graphics (SVG or PDF): they scale to any size without pixelation.</p>
<p>RDKit handles this natively with <code>rdMolDraw2D</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.Draw <span style="color:#f92672">import</span> rdMolDraw2D
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> rdDepictor
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">string_to_svg</span>(mol_string: str, output_file: str, size: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>, is_selfies: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Generates a 2D molecule SVG image.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> is_selfies:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            mol_string <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(mol_string)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid SELFIES string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>) <span style="color:#f92672">from</span> e
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(mol_string)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> mol:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    rdDepictor<span style="color:#f92672">.</span>Compute2DCoords(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> rdMolDraw2D<span style="color:#f92672">.</span>MolDraw2DSVG(size, size)
</span></span><span style="display:flex;"><span>    d<span style="color:#f92672">.</span>DrawMolecule(mol)
</span></span><span style="display:flex;"><span>    d<span style="color:#f92672">.</span>FinishDrawing()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(output_file, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        f<span style="color:#f92672">.</span>write(d<span style="color:#f92672">.</span>GetDrawingText())
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Saved: </span><span style="color:#e6db74">{</span>output_file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>This produces a clean vector image, though it omits the custom PIL-based legend. Choose the right tool for the job: PNG for quick checks and slides, SVG for journal submissions.</p>
<h2 id="command-line-interface">Command-Line Interface</h2>
<p>The tool uses Python&rsquo;s standard <code>argparse</code> library for a robust command-line interface. It automatically detects if you want an SVG based on the file extension and includes a <code>--selfies</code> flag.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Basic SMILES usage</span>
</span></span><span style="display:flex;"><span>python mol2img.py <span style="color:#e6db74">&#34;CCO&#34;</span> -o ethanol.png
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SELFIES usage</span>
</span></span><span style="display:flex;"><span>python mol2img.py <span style="color:#e6db74">&#34;[C][C][O]&#34;</span> -o ethanol.png --selfies
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Generate SVG for publication</span>
</span></span><span style="display:flex;"><span>python mol2img.py <span style="color:#e6db74">&#34;CCO&#34;</span> -o ethanol.svg
</span></span></code></pre></div><h2 id="download-the-complete-script">Download the Complete Script</h2>
<p>You can copy the complete <code>mol2img.py</code> script directly from the code block below. For a more robust, production-grade version with fault-tolerant SVG fallback, strict type safety, and batch processing support, see the <a href="/projects/molecular-string-renderer/">Molecular String Renderer project</a>.</p>
<h3 id="installation-and-setup">Installation and Setup</h3>
<p>Before using the script, install the required dependencies:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install rdkit pillow selfies
</span></span></code></pre></div><h3 id="complete-script">Complete Script</h3>
<details>
<summary>Click to expand the complete mol2img.py script</summary>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> argparse
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> sys
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> Draw, rdDepictor, rdMolDescriptors
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.Draw <span style="color:#f92672">import</span> rdMolDraw2D
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> PIL <span style="color:#f92672">import</span> Image, ImageDraw, ImageFont
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_font</span>(size: int, font_name: str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;arial.ttf&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Attempts to load a TTF font, falls back to default if unavailable.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> ImageFont<span style="color:#f92672">.</span>truetype(font_name, size)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">IOError</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> ImageFont<span style="color:#f92672">.</span>load_default()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">string_to_svg</span>(mol_string: str, output_file: str, size: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>, is_selfies: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Generates a 2D molecule SVG image.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> is_selfies:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            mol_string <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(mol_string)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid SELFIES string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>) <span style="color:#f92672">from</span> e
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(mol_string)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> mol:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    rdDepictor<span style="color:#f92672">.</span>Compute2DCoords(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> rdMolDraw2D<span style="color:#f92672">.</span>MolDraw2DSVG(size, size)
</span></span><span style="display:flex;"><span>    d<span style="color:#f92672">.</span>DrawMolecule(mol)
</span></span><span style="display:flex;"><span>    d<span style="color:#f92672">.</span>FinishDrawing()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(output_file, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        f<span style="color:#f92672">.</span>write(d<span style="color:#f92672">.</span>GetDrawingText())
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Saved: </span><span style="color:#e6db74">{</span>output_file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">string_to_png</span>(mol_string: str, output_file: str, size: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>, is_selfies: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Generates a 2D molecule image with a chemical formula legend.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> is_selfies:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            smiles <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(mol_string)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid SELFIES string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>) <span style="color:#f92672">from</span> e
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        smiles <span style="color:#f92672">=</span> mol_string
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(smiles)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> mol:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Could not generate molecule from string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Generate 2D coordinates and formula</span>
</span></span><span style="display:flex;"><span>    rdDepictor<span style="color:#f92672">.</span>Compute2DCoords(mol)
</span></span><span style="display:flex;"><span>    formula <span style="color:#f92672">=</span> rdMolDescriptors<span style="color:#f92672">.</span>CalcMolFormula(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Render the molecule</span>
</span></span><span style="display:flex;"><span>    img <span style="color:#f92672">=</span> Draw<span style="color:#f92672">.</span>MolToImage(mol, size<span style="color:#f92672">=</span>(size, size))<span style="color:#f92672">.</span>convert(<span style="color:#e6db74">&#34;RGBA&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Create a canvas with extra space at the bottom for the legend</span>
</span></span><span style="display:flex;"><span>    legend_height <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.1</span>)
</span></span><span style="display:flex;"><span>    canvas <span style="color:#f92672">=</span> Image<span style="color:#f92672">.</span>new(<span style="color:#e6db74">&#34;RGBA&#34;</span>, (size, size <span style="color:#f92672">+</span> legend_height), <span style="color:#e6db74">&#34;white&#34;</span>)
</span></span><span style="display:flex;"><span>    canvas<span style="color:#f92672">.</span>paste(img, (<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    draw <span style="color:#f92672">=</span> ImageDraw<span style="color:#f92672">.</span>Draw(canvas)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Define dynamic font sizes</span>
</span></span><span style="display:flex;"><span>    font_reg <span style="color:#f92672">=</span> get_font(int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.03</span>))
</span></span><span style="display:flex;"><span>    font_sub <span style="color:#f92672">=</span> get_font(int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw the legend</span>
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>)
</span></span><span style="display:flex;"><span>    y <span style="color:#f92672">=</span> size <span style="color:#f92672">+</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw &#34;Formula: &#34; label</span>
</span></span><span style="display:flex;"><span>    draw<span style="color:#f92672">.</span>text((x, y), <span style="color:#e6db74">&#34;Formula: &#34;</span>, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">+=</span> draw<span style="color:#f92672">.</span>textlength(<span style="color:#e6db74">&#34;Formula: &#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw formula with subscript handling for numbers</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> char <span style="color:#f92672">in</span> formula:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Use smaller font and lower y-offset for numbers (subscripts)</span>
</span></span><span style="display:flex;"><span>        font <span style="color:#f92672">=</span> font_sub <span style="color:#66d9ef">if</span> char<span style="color:#f92672">.</span>isdigit() <span style="color:#66d9ef">else</span> font_reg
</span></span><span style="display:flex;"><span>        y_offset <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.005</span>) <span style="color:#66d9ef">if</span> char<span style="color:#f92672">.</span>isdigit() <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        draw<span style="color:#f92672">.</span>text((x, y <span style="color:#f92672">+</span> y_offset), char, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font)
</span></span><span style="display:flex;"><span>        x <span style="color:#f92672">+=</span> draw<span style="color:#f92672">.</span>textlength(char, font<span style="color:#f92672">=</span>font)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw original string</span>
</span></span><span style="display:flex;"><span>    label <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;SELFIES&#34;</span> <span style="color:#66d9ef">if</span> is_selfies <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#34;SMILES&#34;</span>
</span></span><span style="display:flex;"><span>    draw<span style="color:#f92672">.</span>text((x, y), <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34; | </span><span style="color:#e6db74">{</span>label<span style="color:#e6db74">}</span><span style="color:#e6db74">: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    canvas<span style="color:#f92672">.</span>save(output_file)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Saved: </span><span style="color:#e6db74">{</span>output_file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;__main__&#34;</span>:
</span></span><span style="display:flex;"><span>    parser <span style="color:#f92672">=</span> argparse<span style="color:#f92672">.</span>ArgumentParser(description<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Convert a SMILES or SELFIES string to a 2D molecular image.&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;string&#34;</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;The molecular string to convert&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;-o&#34;</span>, <span style="color:#e6db74">&#34;--output&#34;</span>, default<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;molecule.png&#34;</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Output filename (default: molecule.png)&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;--size&#34;</span>, type<span style="color:#f92672">=</span>int, default<span style="color:#f92672">=</span><span style="color:#ae81ff">500</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Image width/height in pixels (default: 500)&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;--svg&#34;</span>, action<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;store_true&#34;</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Force SVG output (overrides filename extension)&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;--selfies&#34;</span>, action<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;store_true&#34;</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Treat the input string as SELFIES.&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    args <span style="color:#f92672">=</span> parser<span style="color:#f92672">.</span>parse_args()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Determine format based on flag or file extension</span>
</span></span><span style="display:flex;"><span>        is_svg <span style="color:#f92672">=</span> args<span style="color:#f92672">.</span>svg <span style="color:#f92672">or</span> args<span style="color:#f92672">.</span>output<span style="color:#f92672">.</span>lower()<span style="color:#f92672">.</span>endswith(<span style="color:#e6db74">&#34;.svg&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> is_svg:
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Ensure extension is correct if not present</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> args<span style="color:#f92672">.</span>output<span style="color:#f92672">.</span>lower()<span style="color:#f92672">.</span>endswith(<span style="color:#e6db74">&#34;.svg&#34;</span>):
</span></span><span style="display:flex;"><span>                args<span style="color:#f92672">.</span>output <span style="color:#f92672">=</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>splitext(args<span style="color:#f92672">.</span>output)[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">+</span> <span style="color:#e6db74">&#34;.svg&#34;</span>
</span></span><span style="display:flex;"><span>            string_to_svg(args<span style="color:#f92672">.</span>string, args<span style="color:#f92672">.</span>output, args<span style="color:#f92672">.</span>size, args<span style="color:#f92672">.</span>selfies)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>            string_to_png(args<span style="color:#f92672">.</span>string, args<span style="color:#f92672">.</span>output, args<span style="color:#f92672">.</span>size, args<span style="color:#f92672">.</span>selfies)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Error: </span><span style="color:#e6db74">{</span>e<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>        sys<span style="color:#f92672">.</span>exit(<span style="color:#ae81ff">1</span>)
</span></span></code></pre></div></details>
]]></content:encoded></item><item><title>SMILES: Compact Notation for Chemical Structures</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/smiles/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/smiles/</guid><description>SMILES (Simplified Molecular Input Line Entry System) represents chemical structures using compact ASCII strings.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>SMILES (Simplified Molecular Input Line Entry System), originally developed by David Weininger in the late 1980s, is a one-dimensional string format for representing chemical structures. It linearizes the molecular graph $G = (V, E)$ by performing a depth-first traversal: like a connection table, a SMILES string identifies the graph&rsquo;s nodes (atoms) and edges (bonds).</p>
<p>For example, the simple molecule ethanol ($\text{C}_2\text{H}_6\text{O}$) can be represented as <code>CCO</code>, while the more complex caffeine molecule becomes <code>CN1C=NC2=C1C(=O)N(C(=O)N2C)C</code>.</p>
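<p>The depth-first linearization described above can be sketched in plain Python. The <code>to_smiles_like</code> helper below is a hypothetical toy over a hand-built adjacency list: it handles only single bonds and acyclic branches, ignoring bond orders, ring closures, and stereochemistry.</p>

```python
def to_smiles_like(atoms, adjacency, start=0):
    """Emit atom symbols in depth-first order over a molecular graph.

    Toy linearizer: single bonds and acyclic branches only. Real SMILES
    writers also encode bond orders, ring closures, and stereochemistry.
    """
    visited = set()
    out = []

    def dfs(i):
        visited.add(i)
        out.append(atoms[i])
        neighbors = [j for j in adjacency[i] if j not in visited]
        for k, j in enumerate(neighbors):
            if k < len(neighbors) - 1:  # all but the last child open a branch
                out.append("(")
                dfs(j)
                out.append(")")
            else:
                dfs(j)

    dfs(start)
    return "".join(out)


# Ethanol as a graph: C(0)-C(1)-O(2)
print(to_smiles_like(["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]}))  # CCO
```

<p>Running it on the ethanol graph reproduces <code>CCO</code>; a branched graph such as isobutane yields <code>CC(C)C</code>.</p>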
<h3 id="key-characteristics">Key Characteristics</h3>
<ul>
<li><strong>Human-readable</strong>: Designed primarily for human readability, whereas <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> provides a hierarchical representation optimized for machine parsing.</li>
<li><strong>Compact</strong>: More compact than other representations (3D coordinates, connectivity tables)</li>
<li><strong>Simple syntax</strong>: A language with simple syntax and structure, making it relatively easy to learn and use for chemists and researchers</li>
<li><strong>Flexible</strong>: Both linear and cyclic structures can be represented in many different valid ways</li>
</ul>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Non-uniqueness</strong>: Different SMILES strings can represent the same molecule, because the graph traversal can start at any atom (e.g., ethanol can be written as <code>CCO</code>, <code>OCC</code>, or <code>C(O)C</code>).</li>
<li><strong>Non-robustness</strong>: Syntactically plausible SMILES strings can be written that do not correspond to any valid molecular structure.
<ul>
<li>Strings that do not parse to any molecular structure (e.g., unclosed rings or branches).</li>
<li>Strings that parse but violate valence rules (an atom with more bonds than is physically possible).</li>
</ul>
</li>
<li><strong>Information loss</strong>: SMILES cannot encode 3D structural information (conformer geometry); only connectivity and, optionally, stereochemical configuration are captured.</li>
</ul>
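<p>The syntactic failure modes above can be caught, in part, by lightweight checks. The sketch below is illustrative only (it ignores valence, bracket atoms' contents, and stereochemistry; a real toolkit such as RDKit is needed for true validation): it verifies that branch parentheses balance and that every ring-closure digit is opened and closed exactly once.</p>

```python
def rough_smiles_check(smiles: str) -> bool:
    """Lightweight sanity check: balanced branches, paired ring closures.

    NOT full validation -- valence, charges, and bracket-atom syntax are
    ignored; use a cheminformatics toolkit (e.g., RDKit) for real parsing.
    """
    depth = 0            # current branch-nesting depth
    open_rings = set()   # ring-closure digits awaiting their partner
    in_bracket = False   # digits inside [...] (isotopes, charges) are not rings
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closed a branch that was never opened
        elif ch.isdigit():
            # ring-closure digits come in pairs: first opens, second closes
            if ch in open_rings:
                open_rings.remove(ch)
            else:
                open_rings.add(ch)
    return depth == 0 and not open_rings

print(rough_smiles_check("C1CCCCC1"))  # cyclohexane: balanced -> True
print(rough_smiles_check("C1CCCCC"))   # unclosed ring -> False
```

Note that this check happily accepts chemically impossible strings such as <code>C(C)(C)(C)(C)C</code>, which is exactly the non-robustness limitation described above.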
<p>For a more robust alternative that guarantees 100% valid molecules, see <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES (Self-Referencing Embedded Strings)</a>.</p>
<p>For the historical context and design philosophy behind SMILES, see <a href="/notes/computational-chemistry/molecular-representations/smiles-original-paper/">SMILES: The Original Paper (Weininger 1988)</a>.</p>
<h2 id="basic-syntax">Basic Syntax</h2>
<h3 id="atomic-symbols">Atomic Symbols</h3>
<p>SMILES uses standard atomic symbols with implied hydrogen atoms:</p>
<ul>
<li><code>C</code> (methane, $\text{CH}_4$)</li>
<li><code>N</code> (ammonia, $\text{NH}_3$)</li>
<li><code>O</code> (water, $\text{H}_2\text{O}$)</li>
<li><code>P</code> (phosphine, $\text{PH}_3$)</li>
<li><code>S</code> (hydrogen sulfide, $\text{H}_2\text{S}$)</li>
<li><code>Cl</code> (hydrogen chloride, HCl)</li>
</ul>
<p><strong>Bracket notation</strong>: Elements outside the organic subset must be shown in brackets, e.g., <code>[Pt]</code> for elemental platinum. The organic subset (<code>B</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>P</code>, <code>S</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, and <code>I</code>) can omit brackets.</p>
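<p>The organic-subset rule is mechanical enough to sketch directly. The snippet below is illustrative only: the set is exactly the one listed above, and a real parser must also account for charges, isotopes, explicit hydrogen counts, and unusual valences, any of which force brackets even for organic-subset elements.</p>

```python
# Organic subset: these elements may be written without brackets when
# uncharged, unlabeled, and at their normal valence.
ORGANIC_SUBSET = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}

def needs_brackets(symbol: str) -> bool:
    """True if a bare element symbol must be written in [brackets]."""
    return symbol not in ORGANIC_SUBSET

print(needs_brackets("Pt"))  # True  -> must write [Pt]
print(needs_brackets("Cl"))  # False -> plain Cl is allowed
```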
<h3 id="bond-representation">Bond Representation</h3>
<p>Bonds are represented by symbols:</p>
<ul>
<li><strong>Single bond</strong>: <code>-</code> (usually omitted)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/ethane.webp"
         alt="Ethane"
         title="Ethane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Ethane ($\text{C}_2\text{H}_6$), SMILES: <code>CC</code></figcaption>
    
</figure>

<ul>
<li><strong>Double bond</strong>: <code>=</code></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/methyl_isocyanate.webp"
         alt="Methyl Isocyanate"
         title="Methyl Isocyanate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Methyl Isocyanate ($\text{C}_2\text{H}_3\text{NO}$), SMILES: <code>CN=C=O</code></figcaption>
    
</figure>

<ul>
<li><strong>Triple bond</strong>: <code>#</code></li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/hydrogen_cyanide.webp"
         alt="Hydrogen Cyanide"
         title="Hydrogen Cyanide"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Hydrogen Cyanide (HCN), SMILES: <code>C#N</code></figcaption>
    
</figure>

<ul>
<li><strong>Aromatic bond</strong>: <code>:</code> (usually omitted; aromaticity is more commonly indicated by writing the ring atoms in lower case)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/vanillin.webp"
         alt="Vanillin"
         title="Vanillin"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Vanillin ($\text{C}_8\text{H}_8\text{O}_3$), SMILES: <code>O=Cc1ccc(O)c(OC)c1</code></figcaption>
    
</figure>

<ul>
<li><strong>Disconnected structures</strong>: <code>.</code> (a &ldquo;non-bond&rdquo; separating components that are not covalently bonded, such as the ions of a salt)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/copper_II_sulfate.webp"
         alt="Copper(II) Sulfate"
         title="Copper(II) Sulfate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Copper(II) Sulfate ($\text{CuSO}_4$), SMILES: <code>[Cu+2].[O-]S(=O)(=O)[O-]</code></figcaption>
    
</figure>

<h3 id="structural-features">Structural Features</h3>
<ul>
<li><strong>Branches</strong>: Enclosed in parentheses and can be nested. For example, <code>CC(C)C</code> represents isobutane: a propane chain with a methyl branch on the central carbon.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/3-propyl-4-isopropyl-1-heptene.webp"
         alt="3-Propyl-4-isopropyl-1-heptene"
         title="3-Propyl-4-isopropyl-1-heptene"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">3-Propyl-4-isopropyl-1-heptene ($\text{C}_{13}\text{H}_{26}$), SMILES: <code>C=CC(CCC)C(C(C)C)CCC</code></figcaption>
    
</figure>

<ul>
<li><strong>Cyclic structures</strong>: Written by breaking bonds and using numbers to indicate bond connections. For example, <code>C1CCCCC1</code> represents cyclohexane (the <code>1</code> connects the first and last carbon).</li>
<li><strong>Aromaticity</strong>: Lower case letters are used for atoms in aromatic rings. For example, benzene is written as <code>c1ccccc1</code>.</li>
<li><strong>Formal charges</strong>: Indicated by placing the charge in brackets after the atom symbol, e.g., <code>[C+]</code>, <code>[C-]</code>, or <code>[C-2]</code></li>
</ul>
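<p>The ring-closure convention can be made concrete with a toy parser. The sketch below handles only unbracketed one-letter atom symbols (so <code>C1CCCCC1</code> and <code>c1ccccc1</code> work, but two-letter symbols like <code>Cl</code> would be miscounted); it reports which pairs of atom indices each ring-closure digit joins.</p>

```python
def ring_closure_bonds(smiles: str) -> list:
    """Map each ring-closure digit to the pair of atom indices it joins.

    Toy sketch: assumes every letter is a one-letter atom symbol outside
    brackets (C, c, N, O, ...); not a general SMILES parser.
    """
    bonds = []
    open_digits = {}  # digit -> atom index where the ring bond was opened
    atom_idx = -1
    for ch in smiles:
        if ch.isalpha():
            atom_idx += 1  # each letter counts as one atom in this toy model
        elif ch.isdigit():
            if ch in open_digits:
                # second occurrence: close the ring back to the opening atom
                bonds.append((open_digits.pop(ch), atom_idx))
            else:
                open_digits[ch] = atom_idx
    return bonds

print(ring_closure_bonds("C1CCCCC1"))  # cyclohexane: first and last C bond
print(ring_closure_bonds("c1ccccc1"))  # benzene: same ring topology
```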
<h2 id="stereochemistry-and-isomers">Stereochemistry and Isomers</h2>
<h3 id="isotope-notation">Isotope Notation</h3>
<p>Isotope notation specifies the exact isotope of an element and comes before the element within square brackets, e.g., <code>[13C]</code> for carbon-13.</p>
<h3 id="double-bond-stereochemistry">Double Bond Stereochemistry</h3>
<p>Directional bonds can be specified using <code>\</code> and <code>/</code> symbols to indicate the stereochemistry of double bonds:</p>
<ul>
<li><code>C/C=C\C</code> represents (Z)-2-butene (cis configuration)</li>
<li><code>C/C=C/C</code> represents (E)-2-butene (trans configuration)</li>
</ul>
<p>The direction of the slashes indicates which side of the double bond each substituent is on.</p>
<h3 id="tetrahedral-chirality">Tetrahedral Chirality</h3>
<p>Chirality around tetrahedral centers uses <code>@</code> and <code>@@</code> symbols:</p>
<ul>
<li><code>N[C@](C)(F)C(=O)O</code> vs <code>N[C@@](F)(C)C(=O)O</code></li>
<li><code>@</code> lists the remaining neighbors anticlockwise (viewed from the first neighbor); <code>@@</code> lists them clockwise</li>
<li><code>@</code> and <code>@@</code> are shorthand for <code>@TH1</code> and <code>@TH2</code>, respectively</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/smiles2img/glucose.webp"
         alt="Glucose"
         title="Glucose"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Glucose ($\text{C}_6\text{H}_{12}\text{O}_6$), SMILES: <code>OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@H](O)1</code></figcaption>
    
</figure>

<h3 id="advanced-stereochemistry">Advanced Stereochemistry</h3>
<p>More general notation for other stereocenters:</p>
<ul>
<li><code>@AL1</code>, <code>@AL2</code> for allene-type stereocenters</li>
<li><code>@SP1</code>, <code>@SP2</code>, <code>@SP3</code> for square-planar stereocenters</li>
<li><code>@TB1</code>&hellip;<code>@TB20</code> for trigonal bipyramidal stereocenters</li>
<li><code>@OH1</code>&hellip;<code>@OH30</code> for octahedral stereocenters</li>
</ul>
<p>SMILES allows partial specification since it relies on local chirality.</p>
<h2 id="practical-applications">Practical Applications</h2>
<p>SMILES notation is widely used in:</p>
<ul>
<li><strong>Chemical databases</strong>: Storage and retrieval of molecular structures</li>
<li><strong>Machine learning</strong>: Input representation for molecular property prediction</li>
<li><strong>Chemical informatics</strong>: Substructure searching and similarity analysis</li>
<li><strong>Drug discovery</strong>: High-throughput virtual screening</li>
<li><strong>Chemical reaction databases</strong>: Representing reactants and products</li>
</ul>
<p>For a hands-on tutorial on visualizing SMILES strings as 2D molecular images, see <a href="/posts/visualizing-smiles-and-selfies-strings/">Converting SMILES Strings to 2D Molecular Images</a>.</p>
<h2 id="variants-and-standards">Variants and Standards</h2>
<h3 id="canonical-smiles">Canonical SMILES</h3>
<p>Canonical SMILES algorithms choose a single, reproducible atom ordering so that each molecule maps to exactly one string, ensuring consistency within a given software implementation (different toolkits' canonical forms may still differ from one another).</p>
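<p>The idea can be seen in a toy setting. The sketch below is illustrative only; real canonical SMILES algorithms (e.g., Morgan-style canonical ranking as used in RDKit) operate on the full molecular graph. For an unbranched, single-bonded chain, every valid SMILES is a traversal from one end or the other, so picking the lexicographically smaller of the two strings yields a canonical form.</p>

```python
def canonical_chain_smiles(atoms: list) -> str:
    """Canonical SMILES for an unbranched, single-bonded chain of
    organic-subset atoms: the lexicographically smaller of the two
    end-to-end traversals. Toy illustration of canonicalization only.
    """
    forward = "".join(atoms)
    backward = "".join(reversed(atoms))
    return min(forward, backward)

# Ethanol can be written CCO (starting from a carbon end) or OCC
# (starting from the oxygen); both map to the same canonical string.
print(canonical_chain_smiles(["C", "C", "O"]))  # "CCO"
print(canonical_chain_smiles(["O", "C", "C"]))  # "CCO"
```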
<h3 id="opensmiles-vs-proprietary">OpenSMILES vs. Proprietary</h3>
<ul>
<li><strong>Proprietary</strong>: The original SMILES specification is proprietary (Daylight Chemical Information Systems), and ambiguities in it have led to incompatible implementations across groups/labs</li>
<li><strong>OpenSMILES</strong>: Open-source alternative standardization to address compatibility concerns</li>
</ul>
<h3 id="isomeric-smiles">Isomeric SMILES</h3>
<p>Isomeric SMILES incorporates isotopes and stereochemistry information, providing more detailed molecular representations.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://19january2021snapshot.epa.gov/sites/static/files/2015-05/documents/appendf.pdf">Sustainable Futures / P2 Framework Manual 2012 EPA-748-B12-001: Appendix F. SMILES Notation Tutorial</a></li>
<li><a href="https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html">Daylight Chemical Information Systems, Inc. SMILES</a></li>
<li><a href="http://opensmiles.org/opensmiles.html">OpenSMILES</a></li>
<li><a href="https://arxiv.org/abs/2402.01439">From Words to Molecules: A Survey of Large Language Models in Chemistry</a></li>
</ul>
]]></content:encoded></item><item><title>MARCEL: Molecular Representation &amp; Conformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/datasets/marcel/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/datasets/marcel/</guid><description>MARCEL dataset provides 722K+ conformers across 76K+ molecules for drug discovery, catalysis, and molecular representation learning research.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>MARCEL provides a benchmark for conformer ensemble learning. It demonstrates that explicitly modeling full conformer distributions improves property prediction across drug-like molecules and organometallic catalysts.</p>
<h2 id="overview">Overview</h2>
<p>The Molecular Representation and Conformer Ensemble Learning (MARCEL) dataset provides 722K+ conformations across 76K+ molecules spanning four diverse chemical domains: drug-like molecules (Drugs-75K), organophosphorus ligands (Kraken), chiral catalysts (EE), and organometallic complexes (BDE). MARCEL evaluates conformer ensemble methods across both pharmaceutical and catalysis applications.</p>
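<p>A simple baseline for aggregating a property over a conformer ensemble is a Boltzmann-weighted average over conformer energies. The sketch below is a generic illustration, not MARCEL's method (the benchmarked ensemble models learn their aggregation end-to-end), and the energies and property values in the example are made-up numbers.</p>

```python
import math

def boltzmann_average(energies_kcal, values, temperature=298.15):
    """Boltzmann-weighted average of a per-conformer property.

    energies_kcal: relative conformer energies in kcal/mol
    values: the property value computed for each conformer
    """
    R = 1.987204e-3  # gas constant in kcal/(mol*K)
    beta = 1.0 / (R * temperature)
    e_min = min(energies_kcal)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies_kcal]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / z

# Two hypothetical conformers 2 kcal/mol apart: the low-energy conformer
# dominates, so the average lands much closer to its value.
print(boltzmann_average([0.0, 2.0], [1.0, 5.0]))
```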
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/marcel-drugs-75k-example-conformer.webp"
         alt="Example conformer from Drugs-75K"
         title="Example conformer from Drugs-75K"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Drugs-75K (SMILES: <code>COC(=O)[C@@]1(Cc2ccc(OC)cc2)[C@H]2c3cc(C(=O)N(C)C)n(Cc4ccc(OC(F)(F)F)cc4)c3C[C@H]2CN1C(=O)c1ccccc1</code>; IUPAC: methyl (2R,3R,6R)-4-benzoyl-10-(dimethylcarbamoyl)-3-[(4-methoxyphenyl)methyl]-9-[[4-(trifluoromethoxy)phenyl]methyl]-4,9-diazatricyclo[6.3.0.02,6]undeca-1(8),10-diene-3-carboxylate)</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-drugs-75k-example-conformer-2d.webp"
         alt="2D structure of Drugs-75K conformer"
         title="2D structure of Drugs-75K conformer"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">2D structure of Drugs-75K conformer above</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-kraken-ligand10-conf0-2d.webp"
         alt="Example conformer from Kraken in 2D"
         title="Example conformer from Kraken in 2D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Kraken (ligand 10, conformer 0) in 2D</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-kraken-ligand10-conf0-3d.webp"
         alt="Example conformer from Kraken in 3D"
         title="Example conformer from Kraken in 3D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Kraken (ligand 10, conformer 0) in 3D</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-bde-Pt_9.63-3d.webp"
         alt="Example substrate from BDE in 3D"
         title="Example substrate from BDE in 3D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example substrate from BDE in 3D (Pt_9.63)</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-bde-Pt_9.63-2d.webp"
         alt="2D structure of BDE substrate"
         title="2D structure of BDE substrate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">2D structure of BDE substrate above</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Drugs-75K</strong></td>
          <td>75,099 molecules</td>
          <td>Drug-like molecules with at least 5 rotatable bonds</td>
      </tr>
      <tr>
          <td><strong>Kraken</strong></td>
          <td>1,552 molecules</td>
          <td>Monodentate organophosphorus (III) ligands</td>
      </tr>
      <tr>
          <td><strong>EE</strong></td>
          <td>872 molecules</td>
          <td>Rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphine</td>
      </tr>
      <tr>
          <td><strong>BDE</strong></td>
          <td>5,195 molecules</td>
          <td>Organometallic catalysts $\text{ML}_1\text{L}_2$</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="drugs-75k-ionization-potential">Ionization Potential (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-ionization-potential">#</a></h3>
    <p class="benchmark-description">Predict ionization potential from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.4066</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.4069</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.4126</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.4149</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.4174</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.428</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4351</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4354</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4361</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.4393</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.4394</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4441</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4452</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.4466</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4505</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.4595</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4788</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4987</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.6617</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="drugs-75k-electron-affinity">Electron Affinity (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-electron-affinity">#</a></h3>
    <p class="benchmark-description">Predict electron affinity from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.391</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.3922</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.3944</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.3953</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.3964</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4033</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4085</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4169</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4169</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.4207</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4232</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4233</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.4251</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.4269</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.4417</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4495</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4648</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4747</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.585</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="drugs-75k-electronegativity">Electronegativity (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-electronegativity">#</a></h3>
    <p class="benchmark-description">Predict electronegativity (χ) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.197</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.2027</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2069</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.2083</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.2199</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.2212</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.2243</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2243</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.226</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.2267</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.2267</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.2294</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.2324</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.2378</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.2436</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.2441</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.2505</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.2732</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.4073</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-b5">B₅ Sterimol Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-b5">#</a></h3>
    <p class="benchmark-description">Predict B₅ sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.2225</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.2313</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.263</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2644</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2704</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.2789</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.3072</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.3128</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.3228</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.3293</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.3443</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.345</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.351</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.3567</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.476</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.485</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.4873</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4879</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.9611</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-l">L Sterimol Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-l">#</a></h3>
    <p class="benchmark-description">Predict L sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.3386</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.3468</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.3619</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.3643</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.3754</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4003</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4174</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4303</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4322</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4344</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4363</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4471</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4485</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.4493</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.5142</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.5452</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.5458</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.6417</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.8389</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-burb5">Buried B₅ Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-burb5">#</a></h3>
    <p class="benchmark-description">Predict buried B₅ sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.1589</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.1693</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.1719</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.1782</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.1783</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2017</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2024</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.2066</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.2097</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.2176</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.2178</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.2295</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.2395</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.2422</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.2758</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.2813</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.2884</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.3002</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.4929</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-burl">Buried L Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-burl">#</a></h3>
    <p class="benchmark-description">Predict buried L sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.0947</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.1185</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.12</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.1324</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.1386</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.1443</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.1486</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.15</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.1521</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.1526</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.1548</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.1635</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.1673</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.1741</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.1861</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.1924</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.1948</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>0.2529</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.2781</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="ee-enantioselectivity">Enantioselectivity (EE)<a hidden class="anchor" aria-hidden="true" href="#ee-enantioselectivity">#</a></h3>
    <p class="benchmark-description">Predict enantiomeric excess for Rh-catalyzed asymmetric reactions</p>
    <p class="benchmark-meta"><strong>Subset:</strong> EE
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>11.61</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>12.03</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>13.56</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>13.96</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>14.22</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>14.64</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>17.74</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>18.03</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>18.42</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>19.8</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>20.24</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>33.95</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>61.03</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>61.3</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>61.63</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>62.08</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>62.31</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>62.38</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>64.01</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="bde-bond-dissociation">Bond Dissociation Energy (BDE)<a hidden class="anchor" aria-hidden="true" href="#bde-bond-dissociation">#</a></h3>
    <p class="benchmark-description">Predict metal-ligand bond dissociation energy for organometallic catalysts</p>
    <p class="benchmark-meta"><strong>Subset:</strong> BDE
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>1.45</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>1.47</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>1.53</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>1.53</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>1.61</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>1.65</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>1.87</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>1.97</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>2.01</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>2.13</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>2.48</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>2.55</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Conformation-ensemble learning network (single conformer)</small>
          </td>
          <td>2.61</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>2.64</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>2.66</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>2.74</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>2.83</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>3.03</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>10.08</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GEOM</strong></td>
          <td>Source</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Domain diversity</strong>: Beyond drug-like molecules, includes organometallics and catalysts rarely covered in existing benchmarks</li>
<li><strong>Ensemble-based</strong>: Provides full conformer ensembles with statistical weights</li>
<li><strong>DFT-quality energies</strong>: Drugs-75K features DFT-level conformers and energies (higher accuracy than GEOM-Drugs)</li>
<li><strong>Realistic scenarios</strong>: BDE subset models the practical constraint of lacking DFT-computed conformers for large catalyst systems</li>
<li><strong>Comprehensive baselines</strong>: Benchmarks 18 models across 1D (SMILES), 2D (graph), 3D (single conformer), and ensemble methods</li>
<li><strong>Property diversity</strong>: Covers ionization potential, electron affinity, electronegativity, ligand descriptors, and catalytic properties</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Regression only</strong>: All tasks evaluate regression metrics exclusively</li>
<li><strong>Chemical space coverage</strong>: The 76K molecules cover only a small fraction of the vast drug-like and catalyst chemical spaces</li>
<li><strong>Compute requirements</strong>: Working with large conformer ensembles demands significant computational resources</li>
<li><strong>Proprietary data</strong>: EE subset is proprietary (as of December 2025)</li>
<li><strong>DFT bottleneck</strong>: The BDE subset illustrates a practical limitation: a single DFT optimization can take 2-3 days, making conformer-level DFT infeasible for large organometallics</li>
<li><strong>Uniform sampling baseline</strong>: The initial data-augmentation strategy for handling ensembles samples conformers uniformly rather than by Boltzmann weight. Ignoring the weights is physically unprincipled and likely explains why the strategy occasionally introduces noise and fails to help complex 3D architectures.</li>
<li><strong>Drugs-75K properties</strong>: The large-scale benchmark (Drugs-75K) targets only electronic properties (ionization potential, electron affinity, electronegativity). As the authors note in Section 5.2, these properties are generally less sensitive to conformational rotations than steric or spatial interactions, which makes it hard to judge from Drugs-75K alone whether explicit conformer ensembles benefit large-scale regression tasks.</li>
<li><strong>Unrealistic single-conformer baselines</strong>: The 3D single-conformer models are exclusively evaluated on the lowest-energy conformer. This setup is inherently flawed for real-world application, as knowing the global minimum <em>a priori</em> requires exhaustively searching and computing energies for the entire conformer space.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="data-generation-pipeline">Data Generation Pipeline</h3>
<h4 id="drugs-75k">Drugs-75K</h4>
<p><strong>Source</strong>: GEOM-Drugs subset</p>
<p><strong>Filtering</strong>:</p>
<ul>
<li>Minimum 5 rotatable bonds (focus on flexible molecules)</li>
<li>Allowed elements: H, C, N, O, F, Si, P, S, Cl</li>
</ul>
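The filtering criteria above can be expressed as a small predicate. This is a hypothetical helper (`passes_drugs75k_filter` and its signature are illustrative, not taken from the paper's code); in practice the element list and rotatable-bond count would come from a toolkit such as RDKit.

```python
# Hypothetical helper mirroring the Drugs-75K filtering criteria.
ALLOWED_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "Si", "P", "S", "Cl"})

def passes_drugs75k_filter(elements, n_rotatable_bonds, min_rotatable=5):
    """Keep flexible molecules built only from the allowed elements."""
    return (n_rotatable_bonds >= min_rotatable
            and set(elements) <= ALLOWED_ELEMENTS)

# A flexible C/H/S/N molecule passes; a boron-containing one does not.
ok = passes_drugs75k_filter(["C", "H", "S", "N"], 7)
bad = passes_drugs75k_filter(["C", "H", "B"], 7)
```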
<p><strong>Conformer generation</strong>:</p>
<ul>
<li>DFT-level calculations for both conformers and energies</li>
<li>Higher accuracy than original GEOM-Drugs (semi-empirical GFN2-xTB)</li>
</ul>
<p><strong>Properties</strong>: Ionization Potential (IP), Electron Affinity (EA), Electronegativity (χ)</p>
<h4 id="kraken">Kraken</h4>
<p><strong>Source</strong>: Original Kraken dataset (1,552 monodentate organophosphorus(III) ligands)</p>
<p><strong>Properties</strong>: 4 of 78 available properties</p>
<ul>
<li>$B_5$: Sterimol maximum-width parameter (ligand sterics)</li>
<li>$L$: Sterimol length parameter (ligand sterics)</li>
<li>$\text{Bur}B_5$: buried variant of $B_5$ (computed within a sphere around the metal-binding site)</li>
<li>$\text{Bur}L$: buried variant of $L$</li>
</ul>
<h4 id="ee-enantiomeric-excess">EE (Enantiomeric Excess)</h4>
<p><strong>Generation method</strong>: Q2MM (Quantum-guided Molecular Mechanics)</p>
<p><strong>Molecules</strong>: 872 Rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphine ligands</p>
<p><strong>Property</strong>: Enantiomeric excess (EE) for asymmetric catalysis</p>
<p><strong>Availability</strong>: Proprietary-only (closed-source as of December 2025)</p>
<h4 id="bde-bond-dissociation-energy">BDE (Bond Dissociation Energy)</h4>
<p><strong>Molecules</strong>: 5,195 organometallic catalysts (ML₁L₂ structure)</p>
<p><strong>Initial conformers</strong>: OpenBabel with geometric optimization</p>
<p><strong>Energies</strong>: DFT calculations</p>
<p><strong>Property</strong>: Electronic dissociation energy (difference between bound and unbound states)</p>
<p><strong>Key constraint</strong>: DFT optimization for full conformer ensembles computationally infeasible (2-3 days per molecule)</p>
<h3 id="benchmark-setup">Benchmark Setup</h3>
<p><strong>Task</strong>: Predict molecular properties from structure using different representation strategies (1D/2D/3D/Ensemble).
The ground-truth regression targets are calculated as the Boltzmann-averaged value of the property across the conformer ensemble:</p>
<p>$$
\langle y \rangle_{k_B} = \sum_{\mathbf{C}_i \in \mathcal{C}} p_i y_i
$$</p>
<p>Where $p_i$ is the Boltzmann probability of conformer $\mathbf{C}_i$ under experimental conditions, derived from its energy $e_i$:</p>
<p>$$
p_i = \frac{\exp(-e_i / k_B T)}{\sum_j \exp(-e_j / k_B T)}
$$</p>
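Numerically, the two formulas above amount to a softmax over negative conformer energies followed by a weighted sum. A minimal sketch with hypothetical energies and property values (not taken from the dataset):

```python
import numpy as np

def boltzmann_average(energies, values, kT=0.0259):
    """Boltzmann-weighted average of a per-conformer property.

    energies : conformer energies e_i (same units as kT; 0.0259 eV ~ 300 K)
    values   : per-conformer property values y_i
    """
    e = np.asarray(energies, dtype=float)
    # Subtracting the minimum energy improves numerical stability and
    # leaves the normalized weights p_i unchanged.
    w = np.exp(-(e - e.min()) / kT)
    p = w / w.sum()
    return float(np.dot(p, np.asarray(values, dtype=float)))

# Hypothetical three-conformer ensemble: the lowest-energy conformer
# dominates the average, pulling it well below the unweighted mean of 2.0.
avg = boltzmann_average([0.00, 0.05, 0.10], [1.0, 2.0, 3.0])
```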
<p><strong>Data splits</strong>: Datasets are partitioned 70% train, 10% validation, and 20% test.</p>
<p><strong>Model categories</strong>:</p>
<ol>
<li><strong>1D Models</strong>: SMILES-based (Random Forest on Morgan fingerprints, LSTM, Transformer).</li>
<li><strong>2D Models</strong>: Graph-based (GIN, GIN+VN, ChemProp, GraphGPS).</li>
<li><strong>3D Models</strong>: Single conformer (SchNet, DimeNet++, GemNet, PaiNN, ClofNet, LEFTNet). For evaluation, single 3D models exclusively ingest the lowest-energy conformer. This baseline setting often yields strong performance but is unrealistic in practice, as identifying the global minimum requires exhaustively searching the entire conformer space.</li>
<li><strong>Ensemble Models</strong>: Full conformer ensemble processing via explicit set encoders. For each conformer embedding $\mathbf{z}_i$, three aggregation strategies are evaluated:</li>
</ol>
<p><strong>Mean Pooling:</strong>
$$
\mathbf{s}_{\text{MEAN}} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \mathbf{z}_i
$$</p>
<p><strong>DeepSets:</strong>
$$
\mathbf{s}_{\text{DS}} = g\left(\sum_{i=1}^{|\mathcal{C}|} h(\mathbf{z}_i)\right)
$$</p>
<p><strong>Self-Attention:</strong>
$$
\begin{aligned}
\mathbf{s}_{\text{ATT}} &amp;= \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \mathbf{c}_i, \quad \text{where} \quad \mathbf{c}_i = g\left( \sum_{j=1}^{|\mathcal{C}|} \alpha_{ij} h(\mathbf{z}_j) \right) \\
\alpha_{ij} &amp;= \frac{\exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_j))\right)}{\sum_{k=1}^{|\mathcal{C}|} \exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_k))\right)}
\end{aligned}
$$</p>
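The three aggregation strategies above can be sketched in NumPy. Toy dimensions are used, and the learned maps $h$, $g$, and $\mathbf{W}$ are stood in by fixed random linear maps rather than the benchmark's trained modules; all three aggregators are invariant to the ordering of conformers in the set.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (toy value)

# Stand-ins for the learned maps: h, g as linear layers, W as the
# attention projection. In the benchmark these are trained modules.
H = rng.normal(size=(d, d))
G = rng.normal(size=(d, d))
W = rng.normal(size=(d, d))
h = lambda z: z @ H.T
g = lambda z: z @ G.T

def mean_pool(Z):
    # s_MEAN = (1/|C|) * sum_i z_i
    return Z.mean(axis=0)

def deep_sets(Z):
    # s_DS = g(sum_i h(z_i))
    return g(h(Z).sum(axis=0))

def self_attention(Z):
    # alpha_ij = softmax_j((W h(z_i))^T (W h(z_j)));  c_i = g(sum_j alpha_ij h(z_j))
    Hz = Z @ H.T                                 # h(z_i) for every conformer
    Q = Hz @ W.T                                 # W h(z_i)
    scores = Q @ Q.T                             # pairwise dot products
    scores -= scores.max(axis=1, keepdims=True)  # softmax stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    C = g(alpha @ Hz)                            # rows are the c_i
    return C.mean(axis=0)                        # s_ATT

Z = rng.normal(size=(5, d))  # 5 conformer embeddings
s_mean, s_ds, s_att = mean_pool(Z), deep_sets(Z), self_attention(Z)
```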
<p><strong>Evaluation metric</strong>: Mean Absolute Error (MAE) for all tasks.</p>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Ensemble superiority (task-dependent)</strong>: Across benchmarks, explicitly modeling the full conformer set using DeepSets often achieved top performance. However, these improvements are not uniform:</p>
<ul>
<li><strong>Small-Scale Success</strong>: Ensemble methods show large improvements on tasks like Kraken (Ensemble PaiNN achieves 0.2225 on $B_5$ vs 0.3443 single) and EE (Ensemble GemNet achieves 11.61% vs 18.03% single).</li>
<li><strong>Large-Scale Plateau</strong>: The performance improvements did not strongly transfer to massive subsets like Drugs-75K (Ensemble GemNet achieves 0.4066 eV on IP vs 0.4069 eV single). The authors posit that the computational burden of processing all ensemble embeddings alters the learning dynamics and increases training difficulty.</li>
</ul>
<p><strong>Conformer Sampling for Noise</strong>: Data augmentation (randomly sampling one conformer from an ensemble during training) improves performance and robustness when underlying conformers are imprecise (e.g., the forcefield-generated conformers in the BDE subset).</p>
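<p>A minimal sketch of this augmentation, assuming each molecule&rsquo;s conformers are stored as a list (the names and storage format below are hypothetical):</p>

```python
import random

def sample_conformer(conformers, weights=None):
    """Pick one conformer per molecule for the current training step.

    conformers: list of 3D coordinate arrays (toy strings below);
    weights: optional Boltzmann probabilities (uniform when omitted).
    """
    if weights is None:
        return random.choice(conformers)
    return random.choices(conformers, weights=weights, k=1)[0]

# Each epoch sees a different conformer, so imprecise force-field
# geometries act as stochastic noise rather than a fixed target.
ensemble = ["conf_a", "conf_b", "conf_c"]  # hypothetical stand-ins
picked = sample_conformer(ensemble)
```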
<p><strong>3D vs 2D</strong>: 3D models generally outperform 2D graph models, especially for conformationally-sensitive properties, though 1D and 2D methods remain highly competitive on low-resource datasets or less rotation-sensitive properties.</p>
<p><strong>Model architecture</strong>: GemNet and PaiNN consistently ranked among the top architectures across tasks.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Data</strong>: The Drugs-75K, Kraken, and BDE subsets are openly available via Google Drive links on the project&rsquo;s GitHub repository. However, the EE dataset remains closed-source/proprietary (as of 2026), making the EE suite of the benchmark currently irreproducible.</li>
<li><strong>Code</strong>: The benchmark suite and PyTorch-Geometric dataset loaders are open-sourced at <a href="https://github.com/SXKDZ/MARCEL">GitHub (SXKDZ/MARCEL)</a> under the Apache-2.0 license.</li>
<li><strong>Hardware</strong>: The authors trained these models using Nvidia A100 (40GB) GPUs. Memory-intensive models (e.g., GemNet, LEFTNet) required Nvidia H100 (80GB) GPUs. Total computation across all benchmark experiments was approximately 6,000 GPU hours.</li>
<li><strong>Algorithms/Models</strong>: Hyperparameters for all 18 evaluated models are provided directly within the repository configuration files (<code>benchmarks/params</code>), ensuring clear algorithmic reproducibility. All baseline models use popular, publicly available frameworks (e.g., PyTorch Geometric, OGB, RDKit).</li>
<li><strong>Evaluation</strong>: Detailed evaluation scripts are provided in the repository with consistent tracking of Mean Absolute Error (MAE) and proper configuration of benchmark splits.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2024learning,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yanqiao Zhu and Jeehyun Hwang and Keir Adams and Zhen Liu and Bozhao Nan and Brock Stenfors and Yuanqi Du and Jatin Chauhan and Olaf Wiest and Olexandr Isayev and Connor W. Coley and Yizhou Sun and Wei Wang}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=NSDszJ2uIV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GEOM: Energy-Annotated Molecular Conformations</title><link>https://hunterheidenreich.com/notes/computational-chemistry/datasets/geom/</link><pubDate>Thu, 04 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/datasets/geom/</guid><description>Dataset card for GEOM, providing energy-annotated molecular conformations generated via CREST/xTB and refined with DFT for property prediction benchmarks.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>GEOM addresses the gap between 2D molecular graphs and flexible 3D properties by providing 450k+ molecules with 37M+ conformations. This extensive sampling connects conformer ensembles to experimental properties, providing the necessary infrastructure to benchmark conformer generation methods and train 3D-aware property predictors.</p>
<h2 id="overview">Overview</h2>
<p>The Geometric Ensemble Of Molecules (GEOM) dataset provides energy-annotated molecular conformations generated through systematic computational methods. The dataset includes molecules from drug discovery campaigns (AICures), quantum chemistry benchmarks (QM9), and molecular property prediction benchmarks (MoleculeNet), with conformations sampled using CREST/GFN2-xTB and a subset refined with high-quality DFT calculations.</p>
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/GEOM-sample-_4-pyrimidin-2-yloxyphenyl_acetamide.webp"
         alt="Example SARS-CoV-2 3CL protease active molecule"
         title="Example SARS-CoV-2 3CL protease active molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Drug-like (AICures)</strong></td>
          <td>304,466 molecules</td>
          <td>Drug-like molecules from AICures COVID-19 challenge (avg 44 atoms)</td>
      </tr>
      <tr>
          <td><strong>QM9</strong></td>
          <td>133,258 molecules</td>
          <td>Small molecules from QM9 (up to 9 heavy atoms)</td>
      </tr>
      <tr>
          <td><strong>MoleculeNet</strong></td>
          <td>16,865 molecules</td>
          <td>Molecules from MoleculeNet benchmarks for physical chemistry, biophysics, and physiology</td>
      </tr>
      <tr>
          <td><strong>BACE (High-quality DFT)</strong></td>
          <td>1,511 molecules</td>
          <td>BACE subset with high-quality DFT energies (r2scan-3c) and experimental inhibition data</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="gibbs-free-energy-prediction">Gibbs Free Energy Prediction<a hidden class="anchor" aria-hidden="true" href="#gibbs-free-energy-prediction">#</a></h3>
    <p class="benchmark-description">Predict ensemble Gibbs free energy (G) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.203</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.225</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.274</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.289</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.406</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="average-energy-prediction">Average Energy Prediction<a hidden class="anchor" aria-hidden="true" href="#average-energy-prediction">#</a></h3>
    <p class="benchmark-description">Predict ensemble average energy (E) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.11</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.113</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.119</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.131</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.166</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="conformer-count-prediction">Conformer Count Prediction<a hidden class="anchor" aria-hidden="true" href="#conformer-count-prediction">#</a></h3>
    <p class="benchmark-description">Predict ln(number of unique conformers) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.363</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.38</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.455</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.484</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.763</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>QM9</strong></td>
          <td>134k small molecules with up to 9 heavy atoms and DFT properties</td>
      </tr>
      <tr>
          <td><strong>PCQM4Mv2</strong></td>
          <td>Millions of computationally generated molecules for HOMO-LUMO gap prediction</td>
      </tr>
      <tr>
          <td><strong>PubChemQC</strong></td>
          <td>DFT structures and energy properties for millions of PubChem molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Scale</strong>: 37M+ conformations across 450k+ molecules, providing massive coverage of drug-like and small molecule chemical space.</li>
<li><strong>Energy Annotations</strong>: All conformations include semi-empirical energies (GFN2-xTB); the BACE subset includes high-quality DFT energies.</li>
<li><strong>Quality Tiers</strong>: Three levels of computational rigor allow researchers to trade off dataset size for simulation accuracy.</li>
<li><strong>Benchmark Ready</strong>: Includes validated splits and architectural baselines (e.g., ChemProp, SchNet) for property prediction tasks.</li>
<li><strong>Task Diversity</strong>: Combines molecules sourced from drug discovery (AICures), quantum chemistry (QM9), and biophysiology domains (MoleculeNet).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Computational Constraints</strong>: The highest-accuracy DFT subset (BACE) is limited to 1,511 molecules due to the extreme computational cost of exact free energy sampling and Hessian estimation.</li>
<li><strong>Semi-Empirical Accuracy Gap</strong>: The $p^{\text{CREST}}$ statistical weights rely on GFN2-xTB energies, which exhibit a $\sim$2 kcal/mol MAE against true DFT. At room temperature ($k_BT \approx 0.59$ kcal/mol), this error heavily skews the Boltzmann distribution, meaning standard subset weights are imprecise.</li>
<li><strong>Solvation Assumptions</strong>: Most subsets rely on vacuum calculations. Only the BACE subset uses an implicit solvent (ALPB/C-PCM for water).</li>
<li><strong>Coverage Lapses</strong>: Extremely flexible molecules (e.g., within the SIDER dataset) frequently failed the conformer generation pipeline due to runaway topologies.</li>
</ul>
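<p>To see how severe this skew is, a quick back-of-the-envelope calculation: a 2 kcal/mol error at $k_BT \approx 0.59$ kcal/mol distorts a relative Boltzmann weight by a factor of roughly $e^{2/0.59} \approx 30$ (toy two-conformer example):</p>

```python
import math

kT = 0.59  # kcal/mol at ~298 K

# Two conformers that are truly degenerate (same energy), but where the
# semi-empirical method mis-ranks one by the ~2 kcal/mol xTB error.
E_true = [0.0, 0.0]
E_xtb = [0.0, 2.0]

def boltzmann(energies, kT=kT):
    z = [math.exp(-e / kT) for e in energies]
    total = sum(z)
    return [w / total for w in z]

p_true = boltzmann(E_true)  # [0.5, 0.5]
p_xtb = boltzmann(E_xtb)    # ~[0.97, 0.03]: a heavily skewed ensemble
ratio = math.exp(2.0 / kT)  # relative weight distortion, ~30x
```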
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="data-generation-pipeline">Data Generation Pipeline</h3>
<p><strong>Initial conformer sampling</strong> (RDKit):</p>
<ul>
<li><code>EmbedMultipleConfs</code> with <code>numConfs=50</code>, <code>pruneRmsThresh=0.01</code> Å</li>
<li>MMFF force field optimization</li>
<li>GFN2-xTB optimization of seed conformer</li>
</ul>
<p><strong>Conformational exploration</strong> (CREST):</p>
<ul>
<li>Metadynamics in the NVT ensemble driven by a pushing bias potential:
$$ V_{\text{bias}} = \sum_i k_i \exp(-\alpha_i \Delta_i^2) $$
where $\Delta_i$ is the root-mean-square deviation (RMSD) of the current structure from the $i$-th reference structure.</li>
<li>12 independent runs per molecule with varying pushing strengths $k_i$.</li>
<li>6.0 kcal/mol safety window for conformer retention.</li>
<li>Solvent: ALPB for water (BACE); vacuum for others.</li>
</ul>
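<p>The pushing potential above can be sketched directly (the $k_i$ and $\alpha_i$ values here are toy illustrations, not CREST&rsquo;s defaults):</p>

```python
import math

def v_bias(rmsds, k, alpha):
    """Metadynamics pushing potential V = sum_i k_i * exp(-alpha_i * Delta_i^2),
    where Delta_i is the RMSD to the i-th previously visited reference."""
    return sum(ki * math.exp(-ai * d * d) for ki, ai, d in zip(k, alpha, rmsds))

# Near an already-visited structure (small RMSD), the bias is large and
# pushes the dynamics toward unexplored regions of conformer space.
near = v_bias([0.1], k=[1.0], alpha=[1.0])  # ~0.990
far = v_bias([3.0], k=[1.0], alpha=[1.0])   # ~1.2e-4, essentially no bias
```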
<p><strong>Energy calculation &amp; Weighting</strong>:</p>
<ul>
<li>
<p><strong>Standard (GFN2-xTB)</strong>: Semi-empirical tight-binding DFT ($\approx$ 2 kcal/mol MAE vs DFT). Conformers are assigned a statistical probability based on energy $E_i$ and rotamer degeneracy $d_i$:
$$ p^{\text{CREST}}_i = \frac{d_i \exp(-E_i / k_B T)}{\sum_j d_j \exp(-E_j / k_B T)} $$</p>
</li>
<li>
<p><strong>High-Quality DFT (CENSO)</strong>: Refines structures using the <code>r2scan-3c</code> functional, computing exact conformation-dependent free energies ($G_i$) that remove the need for explicit rotamer degeneracy approximations:</p>
<p>$$
\begin{aligned}
p^{\text{CENSO}}_i &amp;= \frac{\exp(-G_i / k_B T)}{\sum_j \exp(-G_j / k_B T)} \\
G_i &amp;= E_{\text{gas}}^{(i)} + \delta G_{\text{solv}}^{(i)}(T) + G_{\text{trv}}^{(i)}(T)
\end{aligned}
$$</p>
</li>
</ul>
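<p>A minimal sketch of the degeneracy-weighted CREST probabilities (the energies and degeneracies below are toy values):</p>

```python
import math

kB_T = 0.001987 * 298.15  # kcal/mol at 298.15 K, ~0.59

def p_crest(energies, degeneracies, kT=kB_T):
    """Degeneracy-weighted Boltzmann probabilities: p_i ~ d_i * exp(-E_i/kT)."""
    z = [d * math.exp(-e / kT) for d, e in zip(degeneracies, energies)]
    total = sum(z)
    return [w / total for w in z]

# Two conformers 0.5 kcal/mol apart; the higher-energy one has 3 rotamers,
# so its degeneracy partially offsets the energetic penalty.
probs = p_crest(energies=[0.0, 0.5], degeneracies=[1, 3])
```

<p>The CENSO form drops the $d_i$ factor because the free energies $G_i$ already absorb the degeneracy contribution.</p>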
<h3 id="quality-levels">Quality Levels</h3>
<table>
  <thead>
      <tr>
          <th>Level</th>
          <th>Method</th>
          <th>Subset</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Standard</strong></td>
          <td>CREST/GFN2-xTB</td>
          <td>All subsets</td>
          <td>~2 kcal/mol MAE vs DFT</td>
      </tr>
      <tr>
          <td><strong>DFT Single-Point</strong></td>
          <td>r2scan-3c/mTZVPP on CREST geometries</td>
          <td>BACE (1,511 molecules)</td>
          <td>Sub-kcal/mol</td>
      </tr>
      <tr>
          <td><strong>DFT Optimized</strong></td>
          <td>CENSO full optimization + free energies</td>
          <td>BACE (534 molecules)</td>
          <td>~0.3 kcal/mol vs CCSD(T)</td>
      </tr>
  </tbody>
</table>
<h3 id="benchmark-setup">Benchmark Setup</h3>
<p><strong>Task</strong>: Predict ensemble summary statistics directly from the 2D molecular structure. The target properties include:</p>
<ul>
<li><strong>Conformational Free Energy ($G$)</strong>: $G = -TS$, where $S = -R \sum_i p_i \log p_i$.</li>
<li><strong>Average Energy ($\langle E \rangle$)</strong>: $\langle E \rangle = \sum_i p_i E_i$.</li>
<li><strong>Unique Conformers</strong>: Natural logarithm of the count of unique conformers retained within the energy window.</li>
</ul>
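<p>The three targets can be computed from an ensemble&rsquo;s energies and statistical weights as follows (toy values; the probabilities are assumed pre-normalized):</p>

```python
import math

R = 0.001987  # gas constant, kcal/(mol K)
T = 298.15    # temperature, K

def ensemble_targets(energies, probs):
    """Compute the three GEOM benchmark targets from a conformer ensemble."""
    S = -R * sum(p * math.log(p) for p in probs if p > 0)  # conformational entropy
    G = -T * S                                             # free energy G = -TS
    E_avg = sum(p * e for p, e in zip(probs, energies))    # <E> = sum_i p_i E_i
    n_log = math.log(len(energies))                        # ln(# unique conformers)
    return G, E_avg, n_log

G, E_avg, n_log = ensemble_targets(energies=[0.0, 0.3, 1.1], probs=[0.5, 0.3, 0.2])
```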
<p><strong>Data</strong>: 100,000 species randomly sampled from AICures subset, split 60/20/20 (train/validation/test).</p>
<p><strong>Hyperparameters</strong>: Optimized using Hyperopt package for each model/task combination.</p>
<p><strong>Models</strong>:</p>
<ul>
<li><strong>SchNetFeatures</strong>: 3D SchNet architecture + graph features, trained on highest-probability conformer</li>
<li><strong>ChemProp</strong>: Message Passing Neural Network (state-of-the-art graph model)</li>
<li><strong>FFNN</strong>: Feed-forward network on Morgan fingerprints</li>
<li><strong>KRR</strong>: Kernel Ridge Regression on Morgan fingerprints</li>
<li><strong>Random Forest</strong>: Random Forest on Morgan fingerprints</li>
</ul>
<h3 id="hardware--computational-cost">Hardware &amp; Computational Cost</h3>
<h4 id="crestgfn2-xtb-generation">CREST/GFN2-xTB Generation</h4>
<p><strong>Total compute</strong>: ~15.7 million core hours</p>
<p><strong>AICures subset</strong>:</p>
<ul>
<li>13M core hours on Knights Landing (32-core nodes)</li>
<li>1.2M core hours on Cascade Lake/Sky Lake (13-core nodes)</li>
<li>Average wall time: 2.8 hours/molecule (KNL) or 0.63 hours/molecule (Sky Lake)</li>
</ul>
<p><strong>MoleculeNet subset</strong>: 1.5M core hours</p>
<h4 id="dft-calculations-bace-only">DFT Calculations (BACE only)</h4>
<p><strong>Software</strong>: CENSO 1.1.2 + ORCA 5.0.1 (r2scan-3c/mTZVPP functional)</p>
<p><strong>Solvent</strong>: C-PCM implicit solvation (water)</p>
<p><strong>Hardware</strong>: ~54 cores per job</p>
<p><strong>Compute cost</strong>:</p>
<ul>
<li>781,000 CPU hours for CENSO optimizations</li>
<li>1.1M CPU hours for single-point energy calculations</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Data Availability</strong>: All generated conformations, energies, and thermodynamic properties are publicly hosted on <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JNGTDF">Harvard Dataverse</a>. The data is provided in language-agnostic MessagePack format and Python-specific RDKit <code>.pkl</code> formats.</li>
<li><strong>Code &amp; Analysis</strong>: The primary GitHub repository (<a href="https://github.com/learningmatter-mit/geom">learningmatter-mit/geom</a>) provides tutorials for data extraction, RDKit processing, and conformational visualization.</li>
<li><strong>Model Training &amp; Baselines</strong>: The machine learning benchmarks (SchNet, ChemProp) and corresponding training scripts used to evaluate the dataset can be reproduced using the authors&rsquo; <a href="https://github.com/learningmatter-mit/NeuralForceField">NeuralForceField repository</a>.</li>
<li><strong>Hardware &amp; Compute</strong>: Extreme compute was required (15.7M core hours for CREST sampling alone), heavily utilizing Knights Landing (KNL) and Cascade Lake architectures. See <em>Hardware &amp; Computational Cost</em> section above for full details.</li>
<li><strong>Software Versions</strong>: Precise reproduction of conformational properties requires specific versions to mitigate numerical variances: CREST v2.9, xTB v6.2.3/v6.4.1, CENSO v1.1.2, ORCA v5.0.1/v5.0.2, and RDKit v2020.09.1.</li>
<li><strong>Open-Access Paper</strong>: The full methodology is accessible via the <a href="https://arxiv.org/abs/2006.05531">arXiv preprint</a>.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Axelrod_2022,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GEOM, energy-annotated molecular conformations for property prediction and molecular generation}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{2052-4463}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1038/s41597-022-01288-4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1038/s41597-022-01288-4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Data}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Science and Business Media LLC}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Axelrod, Simon and Gómez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{apr}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DenoiseVAE: Adaptive Noise for Molecular Pre-training</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/denoise-vae/</link><pubDate>Sun, 24 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/denoise-vae/</guid><description>Liu et al.'s ICLR 2025 paper introducing DenoiseVAE, which learns adaptive, atom-specific noise distributions for better molecular force fields.</description><content:encoded><![CDATA[<h2 id="paper-contribution-type">Paper Contribution Type</h2>
<p>This is a <strong>method paper</strong> with a supporting theoretical component. It introduces a new pre-training framework, DenoiseVAE, that challenges the standard practice of using fixed, hand-crafted noise distributions in denoising-based molecular representation learning.</p>
<h2 id="motivation-the-inter--and-intra-molecular-variations-problem">Motivation: The Inter- and Intra-molecular Variations Problem</h2>
<p>The motivation is to create a more physically principled denoising pre-training task for 3D molecules. The core idea of denoising is to learn molecular force fields by corrupting an equilibrium conformation with noise and then learning to recover it. However, existing methods use a single, hand-crafted noise strategy (e.g., Gaussian noise of a fixed scale) for all atoms across all molecules. This is physically unrealistic for two main reasons:</p>
<ol>
<li><strong>Inter-molecular differences</strong>: Different molecules have unique Potential Energy Surfaces (PES), meaning the space of low-energy (i.e., physically plausible) conformations is highly molecule-specific.</li>
<li><strong>Intra-molecular differences (Anisotropy)</strong>: Within a single molecule, different atoms have different degrees of freedom. For instance, an atom in a rigid functional group can move much less than one connected by a single, rotatable bond.</li>
</ol>
<p>The authors argue that this &ldquo;one-size-fits-all&rdquo; noise approach leads to inaccurate force field learning because it samples many physically improbable conformations.</p>
<h2 id="novelty-a-learnable-atom-specific-noise-generator">Novelty: A Learnable, Atom-Specific Noise Generator</h2>
<p>The core novelty is a framework that learns to generate noise tailored to each specific molecule and atom. This is achieved through three key innovations:</p>
<ol>
<li><strong>Learnable Noise Generator</strong>: The authors introduce a Noise Generator module (a 4-layer Equivariant Graph Neural Network) that takes a molecule&rsquo;s equilibrium conformation $X$ as input and outputs a unique, atom-specific Gaussian noise distribution (i.e., a different variance $\sigma_i^2$ for each atom $i$). This directly addresses the issues of PES specificity and force field anisotropy.</li>
<li><strong>Variational Autoencoder (VAE) Framework</strong>: The Noise Generator (encoder) and a Denoising Module (a 7-layer EGNN decoder) are trained jointly within a VAE paradigm. The noisy conformation is sampled using the reparameterization trick:
$$
\begin{aligned}
\tilde{x}_i &amp;= x_i + \epsilon \sigma_i
\end{aligned}
$$
where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is standard Gaussian noise.</li>
<li><strong>Principled Optimization Objective</strong>: The training loss balances two competing goals:
$$
\begin{aligned}
\mathcal{L}_{DenoiseVAE} &amp;= \mathcal{L}_{Denoise} + \lambda \mathcal{L}_{KL}
\end{aligned}
$$
<ul>
<li>A denoising reconstruction loss ($\mathcal{L}_{Denoise}$) encourages the Noise Generator to produce physically plausible perturbations from which the original conformation can be recovered. This implicitly constrains the noise to respect the molecule&rsquo;s underlying force fields.</li>
<li>A KL divergence regularization term ($\mathcal{L}_{KL}$) pushes the generated noise distributions towards a predefined prior. This prevents the trivial solution of generating zero noise and encourages the model to explore a diverse set of low-energy conformations.</li>
</ul>
</li>
</ol>
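<p>A toy numpy sketch of the sampling and regularization steps (the atom-specific $\sigma_i$ here are hand-picked, whereas DenoiseVAE produces them with the learned EGNN Noise Generator):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy_conformation(X, sigma):
    """Reparameterization trick: x~_i = x_i + eps * sigma_i, eps ~ N(0, I).

    X: (N, 3) equilibrium coordinates; sigma: (N,) atom-specific std devs."""
    eps = rng.standard_normal(X.shape)
    return X + eps * sigma[:, None]

def kl_to_prior(sigma, sigma_prior=0.1):
    """KL(N(0, sigma_i^2) || N(0, sigma_prior^2)) summed over atoms;
    penalizes collapse to zero noise (the trivial solution)."""
    r2 = (sigma / sigma_prior) ** 2
    return 0.5 * np.sum(r2 - 1.0 - np.log(r2))

X = rng.standard_normal((4, 3))           # 4 atoms
sigma = np.array([0.02, 0.05, 0.1, 0.2])  # rigid atoms get small sigma
X_noisy = sample_noisy_conformation(X, sigma)
kl = kl_to_prior(sigma)  # zero only when every sigma_i equals the prior
```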
<p>The authors also provide a theoretical analysis showing that optimizing their objective is equivalent to maximizing the Evidence Lower Bound (ELBO) on the log-likelihood of observing physically realistic conformations.</p>
<h2 id="methodology--experimental-baselines">Methodology &amp; Experimental Baselines</h2>
<p>The model was pretrained on the PCQM4Mv2 dataset (approximately 3.4 million organic molecules) and then evaluated on a comprehensive suite of downstream tasks to test the quality of the learned representations:</p>
<ol>
<li><strong>Molecular Property Prediction (QM9)</strong>: The model was evaluated on 12 quantum chemical property prediction tasks for small molecules (134k molecules; 100k train, 18k val, 13k test split). DenoiseVAE achieved state-of-the-art or second-best performance on 11 of the 12 tasks, with particularly significant gains on $C_v$ (heat capacity), indicating better capture of vibrational modes.</li>
<li><strong>Force Prediction (MD17)</strong>: The task was to predict atomic forces from molecular dynamics trajectories for 8 different small molecules (9,500 train, 500 val split). DenoiseVAE was the top performer on 5 of the 8 molecules (Aspirin, Benzene, Ethanol, Naphthalene, Toluene), though it underperformed Frad on Malonaldehyde, Salicylic Acid, and Uracil by significant margins.</li>
<li><strong>Ligand Binding Affinity (PDBBind v2019)</strong>: On the PDBBind dataset with 30% and 60% protein sequence identity splits, the model showed strong generalization, outperforming baselines like Uni-Mol particularly on the more stringent 30% split across RMSE, Pearson correlation, and Spearman correlation.</li>
<li><strong>PCQM4Mv2 Validation</strong>: DenoiseVAE achieved a validation MAE of 0.0777 on the PCQM4Mv2 HOMO-LUMO gap prediction task with only 1.44M parameters, competitive with models 10-40x larger (e.g., GPS++ at 44.3M params achieves 0.0778).</li>
<li><strong>Ablation Studies</strong>: The authors analyzed the sensitivity to key hyperparameters, namely the prior&rsquo;s standard deviation ($\sigma$) and the KL-divergence weight ($\lambda$), confirming that $\lambda=1$ and $\sigma=0.1$ are optimal. Removing the KL term leads to trivial solutions (near-zero noise). An additional ablation on the Noise Generator depth found 4 EGNN layers optimal over 2 layers. A comparison of independent (diagonal) versus non-independent (full covariance) noise sampling showed comparable results, suggesting the EGNN already captures inter-atomic dependencies implicitly.</li>
<li><strong>Case Studies</strong>: Visualizations of the learned noise variances for different molecules confirmed that the model learns chemically intuitive noise patterns. For example, it applies smaller perturbations to atoms in a rigid bicyclic norcamphor derivative and larger ones to atoms in flexible functional groups of a cyclopropane derivative. Even identical functional groups (e.g., hydroxyl) receive different noise scales in different molecular contexts.</li>
</ol>
<h2 id="key-findings-on-force-field-learning">Key Findings on Force Field Learning</h2>
<ul>
<li><strong>Primary Conclusion</strong>: Learning a <strong>molecule-adaptive and atom-specific</strong> noise distribution is a superior strategy for denoising-based pre-training compared to using fixed, hand-crafted heuristics. This more physically-grounded approach leads to representations that better capture molecular force fields.</li>
<li><strong>Strong Benchmark Performance</strong>: DenoiseVAE achieves best or second-best results on 11 of 12 QM9 tasks, 5 of 8 MD17 molecules, and leads on the stringent 30% LBA split. Performance is mixed on some MD17 molecules (Malonaldehyde, Salicylic Acid, Uracil), where it trails Frad.</li>
<li><strong>Effective Framework</strong>: The proposed VAE-based framework, which jointly trains a Noise Generator and a Denoising Module, is an effective and theoretically sound method for implementing this adaptive noise strategy. The interplay between the reconstruction loss and the KL-divergence regularization is key to its success.</li>
<li><strong>Limitation and Future Direction</strong>: The method is based on classical force field assumptions. The authors note that integrating more accurate force fields represents a promising direction for future work.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<ul>
<li><strong>Source Code &amp; Weights</strong>: The authors have released their code at <a href="https://github.com/Serendipity-r/DenoiseVAE">Serendipity-r/DenoiseVAE</a> on GitHub.</li>
<li><strong>Implementation</strong>: Hyperparameters and architectures are detailed in the paper&rsquo;s appendix (A.14), and the repository provides reference implementations.</li>
</ul>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pre-training Dataset</strong>: <a href="https://ogb.stanford.edu/docs/lsc/pcqm4mv2/">PCQM4Mv2</a> (approximately 3.4 million organic molecules)</li>
<li><strong>Property Prediction</strong>: <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html">QM9 dataset</a> (134k molecules; 100k train, 18k val, 13k test split) for 12 quantum chemical properties</li>
<li><strong>Force Prediction</strong>: <a href="http://www.sgdml.org/#datasets">MD17 dataset</a> (9,500 train, 500 val split) for 8 different small molecules</li>
<li><strong>Ligand Binding Affinity</strong>: PDBBind v2019 (4,463 protein-ligand complexes) with 30% and 60% sequence identity splits</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Noise Generator</strong>: 4-layer Equivariant Graph Neural Network (EGNN) that outputs atom-specific Gaussian noise distributions</li>
<li><strong>Denoising Module</strong>: 7-layer EGNN decoder</li>
<li><strong>Training Objective</strong>: $\mathcal{L}_{DenoiseVAE} = \mathcal{L}_{Denoise} + \lambda \mathcal{L}_{KL}$ with $\lambda=1$</li>
<li><strong>Noise Sampling</strong>: Reparameterization trick with $\tilde{x}_i = x_i + \epsilon \sigma_i$, where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$</li>
<li><strong>Prior Distribution</strong>: Standard deviation $\sigma=0.1$</li>
</ul>
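<p>The objective and sampling steps above can be sketched in a few lines. This is an illustration of the loss structure only, with hypothetical names; the actual Noise Generator and Denoising Module are EGNNs, and the exact reconstruction target follows the paper&rsquo;s appendix:</p>

```python
import numpy as np

def kl_to_isotropic_prior(sigma, sigma_prior=0.1):
    """KL divergence from per-atom Gaussian noise N(0, diag(sigma^2))
    to the isotropic prior N(0, sigma_prior^2 I)."""
    ratio = (sigma / sigma_prior) ** 2
    return 0.5 * np.sum(ratio - 1.0 - np.log(ratio))

def denoisevae_loss(x, sigma, denoiser, lam=1.0, rng=None):
    """One conceptual step: sample noise via the reparameterization trick
    (x_tilde = x + eps * sigma), denoise, and combine both loss terms."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(x.shape)      # eps ~ N(0, I)
    x_tilde = x + eps * sigma               # atom-specific noise scale
    recon = denoiser(x_tilde)               # module predicts the added noise
    l_denoise = np.mean((recon - eps * sigma) ** 2)
    return l_denoise + lam * kl_to_isotropic_prior(sigma)
```

<p>When <code>sigma</code> equals the prior everywhere, the KL term vanishes; dropping it entirely lets the generator collapse to near-zero noise, consistent with the ablation above.</p>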
<h3 id="models">Models</h3>
<ul>
<li><strong>Model Size</strong>: 1.44M parameters total</li>
<li><strong>Fine-tuning Protocol</strong>: Noise Generator discarded after pre-training; only the pre-trained Denoising Module (7-layer EGNN) is retained for downstream fine-tuning</li>
<li><strong>Optimizer</strong>: AdamW with cosine learning rate decay (max LR of 0.0005)</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>System Training</strong>: Fine-tuned end-to-end for specific tasks; force prediction involves computing the gradient of the predicted energy</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Ablation Studies</strong>: Sensitivity analysis confirmed $\lambda=1$ and $\sigma=0.1$ as optimal hyperparameters; removing the KL term leads to trivial solutions (near-zero noise)</li>
<li><strong>Noise Generator Depth</strong>: 4 EGNN layers outperformed 2 layers across both QM9 and MD17 benchmarks</li>
<li><strong>Covariance Structure</strong>: Full covariance matrix (non-independent noise sampling) yielded comparable results to diagonal variance (independent sampling), likely because the EGNN already integrates neighboring atom information</li>
<li><strong>O(3) Invariance</strong>: The method satisfies O(3) probabilistic invariance, meaning the noise distribution is unchanged under rotations and reflections</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU Configuration</strong>: Experiments run on single NVIDIA RTX 3090 GPUs; 6 such GPUs (144GB total memory) are sufficient for full reproduction</li>
<li><strong>CPU</strong>: Intel Xeon Gold 5318Y @ 2.10GHz</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, Y., Chen, J., Jiao, R., Li, J., Huang, W., &amp; Su, B. (2025). DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training. <em>The Thirteenth International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2025denoisevae,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yurou Liu and Jiahao Chen and Rui Jiao and Jiangmeng Li and Wenbing Huang and Bing Su}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Thirteenth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=ym7pr83XQr}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://iclr.cc/virtual/2025/poster/27701">ICLR 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=ym7pr83XQr">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=ym7pr83XQr">PDF on OpenReview</a></li>
</ul>
]]></content:encoded></item><item><title>Beyond Atoms: 3D Space Modeling for Molecular Pretraining</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/beyond-atoms/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/beyond-atoms/</guid><description>Lu et al. introduce SpaceFormer, a Transformer that models entire 3D molecular space including atoms for superior representations.</description><content:encoded><![CDATA[<h2 id="paper-typology-and-contribution">Paper Typology and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It challenges the atom-centric paradigm of molecular representation learning by proposing a novel framework that models the continuous 3D space surrounding atoms. The core contribution is <strong>SpaceFormer</strong>, a Transformer-based architecture that discretizes molecular space into grids to capture physical phenomena (electron density, electromagnetic fields) often missed by traditional point-cloud models.</p>
<h2 id="the-physical-intuition-modeling-empty-space">The Physical Intuition: Modeling &ldquo;Empty&rdquo; Space</h2>
<p><strong>The Gap</strong>: Prior 3D molecular representation models, such as Uni-Mol, treat molecules as discrete sets of atoms, essentially point clouds in 3D space. However, from a quantum physics perspective, the &ldquo;empty&rdquo; space between atoms is far from empty. It is permeated by electron density distributions and electromagnetic fields that determine molecular properties.</p>
<p><strong>The Hypothesis</strong>: Explicitly modeling this continuous 3D space alongside discrete atom positions yields superior representations for downstream tasks, particularly for computational properties that depend on electronic structure, such as HOMO/LUMO energies and energy gaps.</p>
<h2 id="a-surprising-observation-virtual-points-improve-representations">A Surprising Observation: Virtual Points Improve Representations</h2>
<p>Before proposing SpaceFormer, the authors present a simple yet revealing experiment. They augment Uni-Mol by adding randomly sampled virtual points (VPs) from the 3D space within the circumscribed cuboid of each molecule. These VPs carry no chemical information whatsoever: they are purely random noise points.</p>
<p>The result is surprising: adding just 10 random VPs already yields a noticeable improvement in validation loss. The improvement remains consistent and gradually increases as the number of VPs grows, eventually reaching a plateau. This observation holds across downstream tasks as well, with Uni-Mol + VPs improving on several quantum property predictions (LUMO, E1-CC2, E2-CC2) compared to vanilla Uni-Mol.</p>
<p>The implication is that even uninformative spatial context helps the model learn better representations, motivating a principled framework for modeling the full 3D molecular space.</p>
<h2 id="spaceformer-voxelization-and-3d-positional-encodings">SpaceFormer: Voxelization and 3D Positional Encodings</h2>
<p>The key innovation is treating the molecular representation problem as <strong>3D space modeling</strong>. SpaceFormer follows these core steps:</p>
<ol>
<li><strong>Voxelizes the entire 3D space</strong> into a grid with cells of $0.49\text{\AA}$ (based on O-H bond length to ensure at most one atom per cell).</li>
<li><strong>Uses adaptive multi-resolution grids</strong> to efficiently handle empty space, keeping it fine-grained near atoms and coarse-grained far away.</li>
<li><strong>Applies Transformers to 3D spatial tokens</strong> with custom positional encodings that achieve linear complexity.</li>
</ol>
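<p>Step 1 can be sketched directly (a simplified illustration, not the authors&rsquo; code; the grid anchoring and offset convention here are assumptions):</p>

```python
import numpy as np

CELL = 0.49  # Angstroms, from the O-H bond length so each cell holds at most one atom

def voxelize(coords):
    """Map continuous 3D atom coordinates (N, 3) to integer grid-cell
    indices plus each atom's fractional offset within its cell."""
    shifted = coords - coords.min(axis=0)        # anchor grid at molecule corner
    cell_idx = np.floor(shifted / CELL).astype(int)
    offset = shifted - cell_idx * CELL           # sub-cell position, regressed during MAE
    return cell_idx, offset
```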
<p>Specifically, the model utilizes two forms of 3D Positional Encoding:</p>
<p><strong>3D Directional PE (RoPE Extension)</strong>
They extend Rotary Positional Encoding (RoPE) to 3D continuous space by splitting the Query and Key vectors into three blocks (one for each spatial axis). The directional attention mechanism takes the form:</p>
<p>$$
\begin{aligned}
\mathbf{q}_{i}^{\top} \mathbf{k}_{j} = \sum_{s=1}^{3} \mathbf{q}_{i,s}^{\top} \mathbf{R}(c_{j,s} - c_{i,s}) \mathbf{k}_{j,s}
\end{aligned}
$$</p>
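<p>The key property, as in 1D RoPE, is that the score depends only on the relative offset $c_j - c_i$: because rotations compose, translating both atoms leaves the attention unchanged. A numerical sketch (hypothetical helper names; standard RoPE frequencies assumed):</p>

```python
import numpy as np

def rope_rotate(v, pos, base=10000.0):
    """Rotate consecutive dimension pairs of v by angles proportional to a
    scalar coordinate pos (standard RoPE, applied here per spatial axis)."""
    d = v.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # per-pair frequencies
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    v1, v2 = v[..., 0::2], v[..., 1::2]
    out = np.empty_like(v)
    out[..., 0::2] = v1 * cos - v2 * sin
    out[..., 1::2] = v1 * sin + v2 * cos
    return out

def directional_score(q, k, ci, cj):
    """3D directional attention: split q, k into three equal blocks,
    rotate each by its own coordinate, then sum the dot products."""
    score, d = 0.0, q.shape[-1] // 3
    for s in range(3):
        qs = rope_rotate(q[s * d:(s + 1) * d], ci[s])
        ks = rope_rotate(k[s * d:(s + 1) * d], cj[s])
        score += qs @ ks
    return score
```

<p>Shifting both coordinates by the same translation leaves the score unchanged, which is exactly the relative-position property the paper exploits.</p>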
<p><strong>3D Distance PE (RFF Approximation)</strong>
To compute invariant geometric distance without incurring quadratic memory overhead, they use Random Fourier Features (RFF) to approximate a Gaussian kernel of pairwise distances:</p>
<p>$$
\begin{aligned}
\exp \left( - \frac{| \mathbf{c}_i - \mathbf{c}_j |_2^2}{2\sigma^2} \right) &amp;\approx z(\mathbf{c}_i)^\top z(\mathbf{c}_j) \\
z(\mathbf{c}_i) &amp;= \sqrt{\frac{2}{d}} \cos(\sigma^{-1} \mathbf{c}_i^\top \boldsymbol{\omega} + \mathbf{b})
\end{aligned}
$$</p>
<p>This approach enables the model to natively encode complex field-like phenomena without computing exhaustive $O(N^2)$ distance matrices.</p>
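<p>The kernel approximation is easy to verify numerically (a generic RFF sketch, not the paper&rsquo;s implementation):</p>

```python
import numpy as np

def rff_features(coords, d=256, sigma=1.0, seed=0):
    """Random Fourier Features z(c) such that z(ci) . z(cj) approximates
    exp(-||ci - cj||^2 / (2 sigma^2)), using O(N d) memory instead of an
    explicit O(N^2) pairwise distance matrix."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((coords.shape[-1], d))   # omega ~ N(0, I)
    b = rng.uniform(0.0, 2.0 * np.pi, d)                 # random phases
    return np.sqrt(2.0 / d) * np.cos(coords @ omega / sigma + b)
```

<p>With enough features the dot product of two embeddings converges to the exact Gaussian kernel value, so geometric distance is encoded without ever materializing the pairwise matrix.</p>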
<h2 id="experimental-setup-and-downstream-tasks">Experimental Setup and Downstream Tasks</h2>
<p><strong>Pretraining Data</strong>: 19 million unlabeled molecules from the same dataset used by Uni-Mol.</p>
<p><strong>Downstream Benchmarks</strong>: The authors propose a new benchmark of 15 tasks, motivated by known limitations of MoleculeNet: invalid structures, inconsistent chemical representations, data curation errors, and an inability to adequately distinguish model performance. The tasks split into two categories:</p>
<ol>
<li>
<p><strong>Computational Properties (Quantum Mechanics)</strong></p>
<ul>
<li>Subsets of <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a> (HOMO, LUMO, GAP energy prediction, 20K samples; E1-CC2, E2-CC2, f1-CC2, f2-CC2, 21.7K samples)</li>
<li>Polybenzenoid hydrocarbons (Dipole moment, adiabatic ionization potential, D3 dispersion correction)</li>
<li>Metric: Mean Absolute Error (MAE)</li>
</ul>
</li>
<li>
<p><strong>Experimental Properties (Pharma/Bio)</strong></p>
<ul>
<li>MoleculeNet tasks (BBBP, BACE for drug discovery)</li>
<li>Biogen ADME tasks (HLM, MME, Solubility)</li>
<li>Metrics: AUC for classification, MAE for regression</li>
</ul>
</li>
</ol>
<p><strong>Splitting Strategy</strong>: All datasets use 8:1:1 train/validation/test ratio with <strong>scaffold splitting</strong> to test out-of-distribution generalization.</p>
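<p>Scaffold splitting assigns every molecule sharing a Bemis-Murcko scaffold to the same partition, so test-set scaffolds are never seen during training. A toy sketch over precomputed scaffold strings (real pipelines derive the scaffolds with RDKit; the greedy largest-first fill below is a common heuristic, not necessarily the authors&rsquo; exact procedure):</p>

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_val=0.1):
    """Greedy scaffold split: group molecule indices by scaffold string,
    then fill train/val/test largest-group-first so no scaffold crosses
    a split boundary."""
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, val, test = [], [], []
    for g in ordered:
        if len(train) + len(g) <= frac_train * n:
            train += g
        elif len(val) + len(g) <= frac_val * n:
            val += g
        else:
            test += g
    return train, val, test
```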
<p><strong>Training Setup</strong>:</p>
<ul>
<li><strong>Objective</strong>: Masked Auto-Encoder (MAE) with 30% random masking. Model predicts whether a cell contains an atom, and if so, regresses both atom type and precise offset position.</li>
<li><strong>Hardware</strong>: ~50 hours on 8 NVIDIA A100 GPUs</li>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9, \beta_2=0.99$)</li>
<li><strong>Learning Rate</strong>: Peak 1e-4 with linear decay and 0.01 warmup ratio</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>Total Updates</strong>: 1 million</li>
</ul>
<p><strong>Baseline Comparisons</strong>: GROVER (2D graph-based MPR), GEM (2D graph enhanced with 3D information), 3D Infomax (GNN with 3D information), Uni-Mol (3D MPR, primary baseline using the same pretraining dataset), and Mol-AE (extends Uni-Mol with atom-based MAE pretraining).</p>
<h2 id="results-and-analysis">Results and Analysis</h2>
<p><strong>Strong Contextual Performance</strong>: SpaceFormer ranked 1st in 10 of 15 tasks and in the top 2 for 14 of 15 tasks. It surpassed the runner-up models by approximately 20% on quantum property tasks (HOMO, LUMO, GAP, E1-CC2, Dipmom), validating that modeling non-atom space captures electronic structure better than atom-only regimes.</p>
<h3 id="key-results-on-quantum-properties">Key Results on Quantum Properties</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GROVER</th>
          <th>GEM</th>
          <th>3D Infomax</th>
          <th>Uni-Mol</th>
          <th>Mol-AE</th>
          <th><strong>SpaceFormer</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HOMO (Ha)</td>
          <td>0.0075</td>
          <td>0.0068</td>
          <td>0.0065</td>
          <td>0.0052</td>
          <td>0.0050</td>
          <td><strong>0.0042</strong></td>
      </tr>
      <tr>
          <td>LUMO (Ha)</td>
          <td>0.0086</td>
          <td>0.0080</td>
          <td>0.0070</td>
          <td>0.0060</td>
          <td>0.0057</td>
          <td><strong>0.0040</strong></td>
      </tr>
      <tr>
          <td>GAP (Ha)</td>
          <td>0.0109</td>
          <td>0.0107</td>
          <td>0.0095</td>
          <td>0.0081</td>
          <td>0.0080</td>
          <td><strong>0.0064</strong></td>
      </tr>
      <tr>
          <td>E1-CC2 (eV)</td>
          <td>0.0101</td>
          <td>0.0090</td>
          <td>0.0089</td>
          <td>0.0067</td>
          <td>0.0070</td>
          <td><strong>0.0058</strong></td>
      </tr>
      <tr>
          <td>Dipmom (Debye)</td>
          <td>0.0752</td>
          <td>0.0289</td>
          <td>0.0291</td>
          <td>0.0106</td>
          <td>0.0113</td>
          <td><strong>0.0083</strong></td>
      </tr>
  </tbody>
</table>
<p>SpaceFormer&rsquo;s advantage is most pronounced on computational properties that depend on electronic structure. On experimental biological tasks (e.g., BBBP), where measurements are noisy, the advantage narrows or reverses: Uni-Mol achieves 0.9066 AUC on BBBP compared to SpaceFormer&rsquo;s 0.8605.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The authors present several ablations that isolate the source of SpaceFormer&rsquo;s improvements:</p>
<p><strong>MAE vs. Denoising</strong>: SpaceFormer with MAE pretraining outperforms SpaceFormer with denoising on all four ablation tasks. The MAE objective requires predicting <em>whether</em> an atom exists in a masked voxel, which forces the model to learn global structural dependencies. In the denoising variant, only atom cells are masked so the model never needs to predict atom existence, reducing the task to coordinate regression.</p>
<p><strong>FLOPs Control</strong>: A SpaceFormer-Large model (4x width, atom-only) trained with comparable FLOPs still falls short of SpaceFormer with 1000 non-atom cells on most downstream tasks. This confirms the improvement comes from modeling 3D space, not from additional compute.</p>
<p><strong>Virtual Points vs. SpaceFormer</strong>: Adding up to 200 random virtual points to Uni-Mol improves some tasks but leaves a significant gap compared to SpaceFormer, demonstrating that principled space discretization outperforms naive point augmentation.</p>
<p><strong>Efficiency Validation</strong>: The Adaptive Grid Merging method reduces the number of cells by roughly 10x with virtually no performance degradation. The 3D positional encodings scale linearly with the number of cells, while Uni-Mol&rsquo;s pretraining cost scales quadratically.</p>
<h3 id="scope-and-future-directions">Scope and Future Directions</h3>
<p>SpaceFormer does not incorporate built-in SE(3) equivariance, relying instead on data augmentation (random rotations and random boundary padding) during training. The authors identify extending SpaceFormer to force field tasks and larger systems such as proteins and complexes as promising future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="code-and-data-availability">Code and Data Availability</h3>
<ul>
<li><strong>Source Code</strong>: As of this writing, the authors have not released the official source code or pre-trained weights.</li>
<li><strong>Datasets</strong>: Pretraining utilized the same 19M unlabeled molecule dataset as Uni-Mol. Downstream tasks use a newly curated internal benchmark built from subsets of GDB-17, MoleculeNet, and Biogen ADME. The exact customized scaffold splits for these evaluations are pending the official code release.</li>
<li><strong>Compute</strong>: Pretraining the base SpaceFormer encoder (~67.8M parameters, configured to merge level 3) required approximately 50 hours on 8 NVIDIA A100 GPUs.</li>
</ul>
<h3 id="models">Models</h3>
<p>The model treats a molecule as a 3D &ldquo;image&rdquo; via voxelization, processed by a Transformer.</p>
<p><strong>Input Representation</strong>:</p>
<ul>
<li><strong>Discretization</strong>: 3D space divided into grid cells with length <strong>$0.49\text{\AA}$</strong> (based on O-H bond length to ensure at most one atom per cell)</li>
<li><strong>Tokenization</strong>: Tokens are pairs $(t_i, c_i)$ where $t_i$ is atom type (or NULL) and $c_i$ is the coordinate</li>
<li><strong>Embeddings</strong>: Continuous embeddings with dimension 512. Inner-cell positions discretized with $0.01\text{\AA}$ precision</li>
</ul>
<p><strong>Transformer Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>Embedding Dim</th>
          <th>FFN Dim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Encoder</strong></td>
          <td>16</td>
          <td>8</td>
          <td>512</td>
          <td>2048</td>
      </tr>
      <tr>
          <td><strong>Decoder</strong> (MAE)</td>
          <td>4</td>
          <td>4</td>
          <td>256</td>
          <td>1024</td>
      </tr>
  </tbody>
</table>
<p><strong>Attention Mechanism</strong>: FlashAttention for efficient handling of large sequence lengths.</p>
<p><strong>Positional Encodings</strong>:</p>
<ol>
<li><strong>3D Directional PE</strong>: Extension of Rotary Positional Embedding (RoPE) to 3D continuous space, capturing relative directionality</li>
<li><strong>3D Distance PE</strong>: Random Fourier Features (RFF) to approximate Gaussian kernel of pairwise distances with linear complexity</li>
</ol>
<h4 id="visualizing-rff-and-rope">Visualizing RFF and RoPE</h4>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-rff-rope-visualization.webp"
         alt="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         title="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Visual intuition for SpaceFormer&rsquo;s positional encodings: Top row shows RFF distance encoding (Gaussian-like attention decay and high-frequency feature fingerprints). Bottom row shows RoPE directional encoding (vector rotation fields and resulting attention patterns).</figcaption>
    
</figure>

<p><strong>Top Row (Distance / RFF):</strong> Shows how the model learns &ldquo;closeness.&rdquo; Distance is represented by a complex &ldquo;fingerprint&rdquo; of waves that creates a Gaussian-like force field.</p>
<ul>
<li><strong>Top Left (The Force Field):</strong> The attention score (dot product) naturally forms a Gaussian curve. It is high when atoms are close and decays to zero as they move apart. This mimics physical forces without the model needing to learn that math from scratch.</li>
<li><strong>Top Right (The Fingerprint):</strong> Each dimension oscillates at a different frequency. A specific distance (e.g., $d=2$) has a unique combination of high and low values across these dimensions, creating a unique &ldquo;fingerprint&rdquo; for that exact distance.</li>
</ul>
<p><strong>Bottom Row (Direction / RoPE):</strong> Shows how the model learns &ldquo;relative position.&rdquo; It visualizes the vector rotation and how that creates a grid-like attention pattern.</p>
<ul>
<li><strong>Bottom Left (The Rotation):</strong> This visualizes the &ldquo;X-axis chunk&rdquo; of the vector. As you move from left ($x=-3$) to right ($x=3$), the arrows rotate. The model compares angles between atoms to determine relative positions.</li>
<li><strong>Bottom Right (The Grid):</strong> The resulting attention pattern when combining X-rotations and Y-rotations. The red/blue regions show where the model pays attention relative to the center, forming a grid-like interference pattern that distinguishes relative positions (e.g., &ldquo;top-right&rdquo; vs &ldquo;bottom-left&rdquo;).</li>
</ul>
<h4 id="adaptive-grid-merging">Adaptive Grid Merging</h4>
<p>To make the 3D grid approach computationally tractable, two key strategies are employed:</p>
<ol>
<li><strong>Grid Sampling</strong>: Randomly selecting 10-20% of empty cells during training</li>
<li><strong>Adaptive Grid Merging</strong>: Recursively merging $2 \times 2 \times 2$ blocks of empty cells into larger &ldquo;coarse&rdquo; cells, creating a multi-resolution view that is fine-grained near atoms and coarse-grained in empty space (merging set to Level 3)</li>
</ol>
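<p>The token-count bookkeeping behind strategy 2 can be sketched on a boolean occupancy grid (an illustrative reimplementation, not the released code; the padding and blocking rules are assumptions):</p>

```python
import numpy as np

def token_count(occ, levels=3):
    """Tokens after adaptive grid merging: start with one token per fine
    cell; at each level, every 2x2x2 block of entirely-empty cells merges
    into one coarse cell, replacing 8 tokens with 1. Blocks touching an
    atom (or an unmerged cell) stay at their current resolution."""
    tokens = occ.size
    for _ in range(levels):
        pad = [(0, s % 2) for s in occ.shape]
        occ = np.pad(occ, pad, constant_values=True)  # pad cells never merge
        b = occ.reshape(occ.shape[0] // 2, 2, occ.shape[1] // 2, 2,
                        occ.shape[2] // 2, 2)
        blocked = b.any(axis=(1, 3, 5))     # block contains an atom/unmerged cell
        tokens -= 7 * int((~blocked).sum()) # each empty block: 8 tokens -> 1
        occ = blocked                       # merged cells remain mergeable
    return tokens
```

<p>On an 8&times;8&times;8 toy grid with a single atom, three merging levels shrink 512 fine cells to a few dozen tokens, illustrating the order-of-magnitude compression reported above.</p>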
<p><strong>Visualizing Adaptive Grid Merging</strong>:</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-merging.webp"
         alt="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         title="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging demonstrated on H₂O. Red cells (Level 0) contain atoms and remain at full resolution. Progressively darker blue cells represent merged empty regions at higher levels, covering the same volume with fewer tokens.</figcaption>
    
</figure>

<p>The adaptive grid process compresses empty space around molecules while maintaining high resolution near atoms:</p>
<ul>
<li><strong>Red Cells (Level 0):</strong> The smallest squares ($0.49$Å) containing atoms. These are kept at highest resolution because electron density changes rapidly here.</li>
<li><strong>Light Blue Cells (Level 0/1):</strong> Small empty regions close to atoms.</li>
<li><strong>Darker Blue Cells (Level 2/3):</strong> Large blocks of empty space further away.</li>
</ul>
<p>If we used a naive uniform grid, we would have to process thousands of empty &ldquo;Level 0&rdquo; cells containing almost zero information. By merging them into larger blocks (the dark blue squares), the model covers the same volume with significantly fewer input tokens, reducing the number of tokens by roughly <strong>10x</strong> compared to a dense grid.</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-benzene.webp"
         alt="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         title="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging for benzene (C₆H₆). The model maintains maximum resolution (red Level 0 cells) only where atoms exist, while merging vast empty regions into large blocks (dark blue L3/L4 cells). This allows the model to focus computational power on chemically active zones.</figcaption>
    
</figure>

<p>The benzene example above demonstrates how this scales to larger molecules. The characteristic hexagonal ring of 6 carbon atoms (black) and 6 hydrogen atoms (white) occupies a small fraction of the total grid. The dark blue corners (L3, L4) represent massive merged blocks of empty space, letting the model concentrate its computation on the red &ldquo;active&rdquo; zones where chemistry actually happens.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lu, S., Ji, X., Zhang, B., Yao, L., Liu, S., Gao, Z., Zhang, L., &amp; Ke, G. (2025). Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>. <a href="https://proceedings.mlr.press/v267/lu25e.html">https://proceedings.mlr.press/v267/lu25e.html</a></p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lu2025beyond,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lu, Shuqi and Ji, Xiaohong and Zhang, Bohang and Yao, Lin and Liu, Siyuan and Gao, Zhifeng and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Wd9KPQCKwq">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=Wd9KPQCKwq">PDF on OpenReview</a></li>
<li><a href="https://icml.cc/virtual/2025/poster/45004">ICML 2025 poster page</a></li>
</ul>
]]></content:encoded></item><item><title>GEOM Dataset: 3D Molecular Conformer Generation</title><link>https://hunterheidenreich.com/posts/geom-conformer-generation-dataset/</link><pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/geom-conformer-generation-dataset/</guid><description>Learn how GEOM transforms 2D molecular graphs into dynamic 3D conformer ensembles for molecular machine learning applications.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In molecular machine learning, we often start with a 2D graph: a blueprint of atoms and bonds. Yet a molecule&rsquo;s function is deeply tied to its dynamic 3D shape. Molecules are flexible entities that exist as an <strong>ensemble of low-energy conformations</strong>, and capturing these 3D shapes is crucial for predicting molecular behavior.</p>
<p>The <a href="/notes/computational-chemistry/datasets/geom/">GEOM</a> (Geometric Ensemble Of Molecules) dataset was created to bridge this gap. It provides a massive collection of high-quality 3D conformer ensembles, transforming static 2D graphs into something much closer to physical reality. This makes it an invaluable resource for anyone working in geometric deep learning for chemistry and drug discovery.</p>















<figure class="post-figure center ">
    <img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-022-01288-4/MediaObjects/41597_2022_1288_Fig1_HTML.png?as=webp"
         alt="Overlay of conformers for a complex molecule"
         title="Overlay of conformers for a complex molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">3D conformer ensembles expand upon 2D blueprints by revealing the diverse shapes the latanoprost molecule adopts.</figcaption>
    
</figure>

<h2 id="the-challenge-of-conformer-generation">The Challenge of Conformer Generation</h2>
<p>Generating 3D structures for every molecule is computationally hard for two main reasons:</p>
<ol>
<li><strong>Combinatorial Explosion</strong>: Think of a molecule with several rotatable bonds. Each bond is like a joint that can be twisted. The number of possible 3D shapes grows exponentially with each new joint. Trying every combination is impractical for most molecules.</li>
<li><strong>Speed vs. Accuracy</strong>: We need to calculate the energy of each shape to know if it&rsquo;s realistic (low energy). Classical <strong>force fields</strong> are fast but sacrifice accuracy, while <strong>Density Functional Theory (DFT)</strong> provides quantum mechanical accuracy at a computational cost that is prohibitive at this scale.</li>
</ol>
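<p>A quick back-of-the-envelope illustration of the combinatorial explosion (assuming roughly three staggered rotamers per rotatable bond, a common rule of thumb rather than a figure from the paper):</p>

```python
def naive_conformer_count(n_rotatable, rotamers_per_bond=3):
    """Size of an exhaustive rotamer enumeration: grows as 3^n."""
    return rotamers_per_bond ** n_rotatable

# A typical rigid QM9 molecule vs. GEOM's most flexible drug-like molecule:
small = naive_conformer_count(2)    # 9 candidate shapes: trivially enumerable
drug = naive_conformer_count(53)    # ~1.9e25 shapes: hopeless to enumerate
```

<p>This is exactly why GEOM relies on guided sampling (CREST&rsquo;s metadynamics, described below) rather than exhaustive search.</p>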
<p>GEOM uses a semi-empirical method to capture the underlying quantum mechanics efficiently, enabling the generation of millions of conformations for a large dataset.</p>
<h2 id="a-deeper-look-inside-the-geom-dataset">A Deeper Look Inside the GEOM Dataset</h2>
<p>The scale of GEOM is impressive: over <strong>37 million conformations</strong> for more than <strong>450,000 unique molecules</strong>. But the numbers in the paper&rsquo;s tables tell a more interesting story about the dataset&rsquo;s composition.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">AICures drug dataset (N=304,466)</th>
          <th style="text-align: left">Mean</th>
          <th style="text-align: left">Max</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Number of heavy atoms</td>
          <td style="text-align: left">24.9</td>
          <td style="text-align: left">91</td>
      </tr>
      <tr>
          <td style="text-align: left">Number of rotatable bonds</td>
          <td style="text-align: left">6.5</td>
          <td style="text-align: left">53</td>
      </tr>
      <tr>
          <td style="text-align: left">Conformers</td>
          <td style="text-align: left">102.6</td>
          <td style="text-align: left">7,451</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>QM9 dataset (N=133,258)</strong></td>
          <td style="text-align: left"><strong>Mean</strong></td>
          <td style="text-align: left"><strong>Max</strong></td>
      </tr>
      <tr>
          <td style="text-align: left">Number of heavy atoms</td>
          <td style="text-align: left">8.8</td>
          <td style="text-align: left">9</td>
      </tr>
      <tr>
          <td style="text-align: left">Number of rotatable bonds</td>
          <td style="text-align: left">2.2</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left">Conformers</td>
          <td style="text-align: left">13.5</td>
          <td style="text-align: left">1,101</td>
      </tr>
  </tbody>
</table>
<p><em>A simplified view of Tables 1 &amp; 4 from the paper, highlighting the key differences.</em></p>
<p>What does this tell us?</p>
<ul>
<li><strong>Two Worlds of Molecules</strong>: The dataset is clearly split. The <strong>QM9</strong> subset contains small, relatively rigid molecules (mean of 2.2 rotatable bonds). In contrast, the <strong>AICures</strong> subset contains larger, more flexible drug-like molecules (mean of 6.5 rotatable bonds, with one molecule having 53!). This diversity is ideal for training machine learning models that need to generalize from simple cases to complex, real-world examples.</li>
<li><strong>Conformational Complexity</strong>: The number of conformers found per molecule reflects this flexibility. A typical QM9 molecule has about 13 conformers, while a drug-like molecule has over 100 on average. This highlights the necessity of 3D ensembles for flexible molecules.</li>
</ul>
<p>Beyond the structures themselves, GEOM is rich with experimental data, connecting the 3D shapes to real-world properties. The molecules are labeled with data for everything from <strong>water solubility</strong> and <strong>blood-brain barrier penetration</strong> to <strong>toxicity</strong> and inhibition of key viral targets like the <strong>SARS-CoV-2 3CL protease</strong>. This makes it a powerful tool for developing property prediction models.</p>
<p>In fact, this creates a benchmark for:</p>
<ul>
<li>Property prediction models that can leverage conformer ensembles (or members of the ensemble) as input.</li>
<li>Conformer generation models that must transform 2D graphs into realistic, 3D distributions.</li>
<li>End-to-end property-based evaluation of the conformer ensembles generated by a model.</li>
</ul>
<h2 id="the-toolbox-behind-geom-key-techniques-explained">The Toolbox Behind GEOM: Key Techniques Explained</h2>
<p>The GEOM paper mentions several advanced computational chemistry methods. Let&rsquo;s briefly break down the most important ones:</p>
<ul>
<li><strong>GFN2-xTB</strong>: This is the semi-empirical quantum mechanical method used to calculate energies and forces in GEOM. Think of it as a &ldquo;middle ground&rdquo; method. It provides greater speed than full DFT while capturing electronic effects absent in classical force fields, making it a pragmatic choice for generating a large dataset.</li>
<li><strong>CREST</strong>: This is the program that actually performs the conformer search. It uses a clever technique based on <strong>metadynamics</strong>, where it simulates the molecule&rsquo;s movement and adds a &ldquo;penalty&rdquo; potential to discourage it from revisiting shapes it has already seen. This pushes the molecule to explore its conformational space efficiently, finding many diverse, low-energy structures.</li>
<li><strong>CENSO</strong>: For a small subset of molecules, the authors went a step further with CENSO. This program takes the conformers found by CREST and refines them with more accurate (and expensive) DFT calculations. It&rsquo;s a way of getting very high-quality &ldquo;gold standard&rdquo; data for benchmarking.</li>
<li><strong>Implicit Solvent Models</strong>: Molecules in the body exist in aqueous environments. Methods like <strong>C-PCM</strong> and <strong>ALPB</strong> model water as a continuous medium, which affects the molecule&rsquo;s preferred shape and energy. This is crucial for biological applications.</li>
</ul>
<h2 id="the-math-behind-the-molecules-explained-simply">The Math Behind the Molecules (Explained Simply)</h2>
<p>The paper includes a couple of equations based on the Boltzmann distribution, which is a fundamental concept from statistical mechanics that tells us the probability of finding a system in a certain state.</p>
<p>The key equation used by CREST to assign a probability (or &ldquo;statistical weight&rdquo;) to the <em>i</em>-th conformer is:</p>
<p>$$ P_{i}^{\text{CREST}} = \frac{d_{i}\exp(-E_{i}/k_{B}T)}{\sum_{j}d_{j}\exp(-E_{j}/k_{B}T)} $$</p>
<p>Let&rsquo;s demystify this:</p>
<ul>
<li>$E_i$ is the energy of the conformer. The negative sign and the exponential mean that <strong>lower energy leads to a much higher probability</strong>.</li>
<li>$k_B T$ is the thermal energy at a given temperature $T$. It sets the energy scale. If the energy difference between two conformers is much larger than $k_B T$, the higher-energy one will be virtually nonexistent.</li>
<li>$d_i$ represents the degeneracy of the conformer, which accounts for the number of equivalent states or configurations that share the same energy $E_i$.
<ul>
<li>Degeneracy refers to the number of equivalent, indistinguishable atomic arrangements (rotamers) that correspond to a single overall molecular shape (conformer). For example, the rotation of a methyl group ($-\text{CH}_3$) produces multiple identical-looking orientations of its hydrogen atoms.</li>
</ul>
</li>
<li>The denominator, $\sum_{j}d_{j}\exp(-E_{j}/k_{B}T)$, is the <strong>partition function</strong>. Its job is to sum up the terms from all possible conformers to ensure that all the probabilities add up to 100%.</li>
</ul>
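<p>To make the weighting concrete, here is a minimal Python sketch of the CREST formula above. The energies and degeneracies are invented example values (not taken from GEOM), and $k_B T \approx 0.593$ kcal/mol at 298 K:</p>

```python
import math

def crest_weights(energies, degeneracies, kT=0.593):
    """Boltzmann weights with degeneracy; energies and kT in kcal/mol."""
    # Shift by the minimum energy for numerical stability (cancels in the ratio).
    e0 = min(energies)
    terms = [d * math.exp(-(e - e0) / kT)
             for e, d in zip(energies, degeneracies)]
    z = sum(terms)  # the partition function (the denominator above)
    return [t / z for t in terms]

# Two hypothetical conformers separated by 2 kcal/mol, no degeneracy:
w = crest_weights([0.0, 2.0], [1, 1])
```

<p>With a 2 kcal/mol gap (a few times $k_B T$), the lower-energy conformer receives nearly all of the statistical weight, which is exactly why the ~2 kcal/mol errors discussed below matter so much.</p>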
<p>For the high-quality CENSO calculations, the equation uses the <strong>Gibbs Free Energy ($G_i$)</strong>. Free energy provides a complete measure by including the molecule&rsquo;s internal energy, its interaction with a solvent, and entropic effects (like how much it can &ldquo;wiggle&rdquo;). This gives a more accurate ranking of the conformer probabilities.</p>
<h2 id="a-closer-look-at-the-figures-what-the-data-really-shows">A Closer Look at the Figures: What the Data Really Shows</h2>
<p>The paper&rsquo;s figures offer some honest insights into the dataset&rsquo;s quality and the trade-offs involved.</p>















<figure class="post-figure center ">
    <img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-022-01288-4/MediaObjects/41597_2022_1288_Fig3_HTML.png?as=webp"
         alt="Scatter plot comparing energy calculation methods."
         title="Scatter plot comparing energy calculation methods."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Comparing the &lsquo;fast&rsquo; GFN2-xTB energies with &lsquo;accurate&rsquo; DFT energies. (a) There&rsquo;s a clear correlation, but also a lot of spread. (b) The ranking accuracy (Spearman ρ) is decent on average (0.39) but highly variable.</figcaption>
    
</figure>

<p>Figure 3 is particularly important. It compares the fast GFN2-xTB energies with much more accurate single-point DFT energies.</p>
<ul>
<li>The <strong>Mean Absolute Error (MAE) of 1.96 kcal/mol</strong> shows that, on average, the fast method gets the energy wrong by about 2 kcal/mol. At room temperature, the thermal energy ($k_B T$) is only about 0.6 kcal/mol. Because the Boltzmann probability depends on the energy <em>exponentially</em>, a 2 kcal/mol error can dramatically change the predicted importance of a conformer.</li>
<li>The <strong>Spearman correlation plot</strong> (right side) shows how well GFN2-xTB <em>ranks</em> the conformers from lowest to highest energy compared to DFT. The average correlation of 0.39 is only modest, and the wide distribution indicates highly variable performance: the ranking is nearly perfect for some molecules and far off for others.</li>
</ul>
<p>This is a key takeaway: the GFN2-xTB/CREST method excels at <strong>discovering</strong> low-energy shapes. For accurate probability <strong>ranking</strong>, the higher-level DFT energies provided in GEOM are required.</p>
<h2 id="conclusion-what-this-means-for-machine-learning">Conclusion: What This Means for Machine Learning</h2>
<p>For researchers at the intersection of machine learning and chemistry, GEOM provides a realistic foundation to build upon. By shifting the focus from static 2D graphs to dynamic 3D ensembles, GEOM enables a new generation of models.</p>
<p>This dataset is an ideal training ground for models designed to understand 3D geometry, such as <strong>SE(3)-equivariant neural networks</strong>, <strong>diffusion models</strong>, <strong>transformers</strong>, and <strong>VAEs</strong>, which can learn to generate conformer ensembles directly from a 2D graph. By training on GEOM, these models can learn the complex relationship between a molecule&rsquo;s chemical blueprint and its real-world, flexible nature.</p>
<p>For a comprehensive technical reference including detailed specifications, quality metrics, and performance leaderboards, see my <a href="/notes/computational-chemistry/datasets/geom/">GEOM Dataset Card</a>.</p>
<p>Explore the GEOM dataset further by visiting its <a href="https://github.com/learningmatter-mit/geom">GitHub repository</a>.</p>
<h2 id="references">References</h2>
<ul>
<li>Axelrod, S. &amp; Gómez-Bombarelli, R. &ldquo;GEOM, energy-annotated molecular conformations for property prediction and molecular generation.&rdquo; <em>Scientific Data</em> 9, 185 (2022). <a href="https://doi.org/10.1038/s41597-022-01288-4">https://doi.org/10.1038/s41597-022-01288-4</a></li>
<li>GitHub repositories:
<ul>
<li><a href="https://github.com/learningmatter-mit/geom">learningmatter-mit/geom</a></li>
<li><a href="https://github.com/learningmatter-mit/NeuralForceField">learningmatter-mit/NeuralForceField</a></li>
</ul>
</li>
</ul>
]]></content:encoded></item><item><title>Invalid SMILES Benefit Chemical Language Models: A Study</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/invalid-smiles-help/</link><pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/invalid-smiles-help/</guid><description>Skinnider (2024) shows that generating invalid SMILES actually improves chemical language model performance through quality filtering.</description><content:encoded><![CDATA[<h2 id="core-contribution-repurposing-invalid-smiles">Core Contribution: Repurposing Invalid SMILES</h2>
<p>This is a <strong>Method</strong> and <strong>Theory</strong> paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate &ldquo;invalid&rdquo; SMILES strings is actually beneficial for model performance.</p>
<h2 id="the-problem-with-absolute-validity-in-chemical-lms">The Problem with Absolute Validity in Chemical LMs</h2>
<p>Prior research has tried to eliminate invalid generations outright using constrained representations like SELFIES. This paper shifts the paradigm by demonstrating that invalid outputs carry a useful signal: they mark the model&rsquo;s least confident samples, enabling a form of implicit uncertainty estimation.</p>
<h2 id="invalid-generation-as-an-implicit-quality-filter">Invalid Generation as an Implicit Quality Filter</h2>
<p>The central insight is counterintuitive: <strong>invalid SMILES generation acts as a built-in quality control mechanism</strong>. The key contributions are:</p>
<ol>
<li>
<p><strong>Empirical Evidence</strong>: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.</p>
</li>
<li>
<p><strong>Mechanistic Explanation</strong>: Invalid SMILES are demonstrated to be low-likelihood samples from the model&rsquo;s probability distribution. When these are filtered out, it&rsquo;s equivalent to removing the model&rsquo;s least confident predictions, a form of automatic quality control.</p>
</li>
<li>
<p><strong>Causal Evidence</strong>: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.</p>
</li>
<li>
<p><strong>Bias Analysis</strong>: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.</p>
</li>
</ol>
<h2 id="experimental-design-and-causal-interventions">Experimental Design and Causal Interventions</h2>
<p>The paper uses a multi-pronged approach to establish both correlation and causation:</p>
<p><strong>Performance Comparisons</strong>: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.</p>
<p><strong>Loss Analysis</strong>: The relationship between SMILES validity and model confidence was examined by analyzing the sequence loss. For a given SMILES string $S$ composed of tokens $t_1, t_2, \ldots, t_N$, the negative log-likelihood acts as a proxy for the model&rsquo;s uncertainty:</p>
<p>$$ \text{NLL}(S) = -\sum_{i=1}^N \log P(t_i | t_1, \ldots, t_{i-1}) $$</p>
<p>Invalid SMILES strings consistently register higher $\text{NLL}$ scores, meaning they represent the model&rsquo;s least confident predictions. Filtering them effectively acts as automatic quality control, providing the mechanistic explanation for why invalid filtering improves performance.</p>
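<p>A minimal sketch of this NLL computation and confidence filter, using invented per-token probabilities rather than real model outputs:</p>

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a sequence from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

# Toy per-token probabilities for two hypothetical generated strings:
# lower per-token confidence yields a higher NLL.
samples = {
    "confident": [0.9, 0.8, 0.95],
    "uncertain": [0.4, 0.3, 0.5],
}
nlls = {name: sequence_nll(p) for name, p in samples.items()}

# Discarding high-NLL samples (as validity filtering implicitly does)
# keeps only the model's most confident predictions.
kept = [name for name, nll in nlls.items() if nll < 1.0]
```

<p>In the paper&rsquo;s setting, the invalid strings play the role of the high-NLL samples here, so throwing them away is equivalent to this confidence cutoff.</p>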
<p><strong>Causal Intervention</strong>: The most clever experiment involved creating &ldquo;unconstrained SELFIES&rdquo; models by relaxing or removing the valency constraints that normally ensure validity. This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.</p>
<p><strong>Structural Bias Analysis</strong>: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model&rsquo;s exploration of chemical space.</p>
<p><strong>Generalization Testing</strong>: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.</p>
<p><strong>Practical Application</strong>: The approach was tested on structure elucidation, using models to identify unknown molecules from minimal experimental data like mass spectrometry.</p>
<h2 id="key-findings-on-validity-constraints-and-bias">Key Findings on Validity Constraints and Bias</h2>
<p><strong>Superior Performance Across the Board</strong>: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. Remarkably, this performance advantage was directly correlated with the proportion of invalid SMILES generated. Models that produced more invalid outputs performed better after filtering.</p>
<p><strong>Invalid SMILES Are Low-Confidence Predictions</strong>: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model&rsquo;s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.</p>
<p><strong>Causal Evidence Through Unconstrained SELFIES</strong>: The most compelling evidence came from modifying SELFIES to allow invalid generation. When &ldquo;unconstrained SELFIES&rdquo; models could generate and discard invalid molecules, their performance improved dramatically, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.</p>
<p><strong>Validity Constraints Introduce Systematic Bias</strong>: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model&rsquo;s ability to faithfully represent chemical space.</p>
<p><strong>Reduced Generalization</strong>: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.</p>
<p><strong>Real-World Application Benefits</strong>: In structure elucidation tasks, identifying unknown molecules from experimental data like mass spectrometry, SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.</p>
<p><strong>Computational Efficiency</strong>: Filtering invalid SMILES is computationally trivial. Parsing one million SMILES strings with RDKit takes only minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Primary Architecture (LSTM):</strong> The main results rely on a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units.</p>
<ul>
<li><strong>Structure:</strong> Three-layer LSTM with a hidden layer size of 1,024 dimensions</li>
<li><strong>Embedding:</strong> An embedding layer of 128 dimensions</li>
<li><strong>Decoder:</strong> A linear decoder layer outputs token probabilities</li>
</ul>
<p><strong>Secondary Architecture (Transformer/GPT):</strong> To prove robustness, the author also used a Generative Pretrained Transformer (GPT) architecture adapted from MolGPT.</p>
<ul>
<li><strong>Structure:</strong> Eight transformer blocks</li>
<li><strong>Internals:</strong> Each block contains eight masked self-attention heads and a feed-forward network (1,024 dimensions) using GELU activation</li>
<li><strong>Embedding:</strong> 256 dimensions, concatenated with learned positional encodings</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Optimizer:</strong> Adam optimizer for both architectures with $\beta_1=0.9$ and $\beta_2=0.999$.</p>
<p><strong>Learning Rate:</strong></p>
<ul>
<li>LSTM: 0.001</li>
<li>Transformer: 0.0005</li>
</ul>
<p><strong>Batch Size:</strong> 64</p>
<p><strong>Loss Function:</strong> Cross-entropy loss of next-token prediction.</p>
<p><strong>Stopping Criteria:</strong> Early stopping using a validation set (10% of training data) with patience of 50,000 minibatches.</p>
<h3 id="data">Data</h3>
<p><strong>Primary Source:</strong> ChEMBL database (version 28).</p>
<p><strong>Preprocessing Pipeline:</strong></p>
<ul>
<li><strong>Cleaning:</strong> Removal of duplicate SMILES, salts, and solvents (retaining heavy fragments with $\geq 3$ heavy atoms)</li>
<li><strong>Filtering:</strong> Molecules with atoms other than {Br, C, Cl, F, H, I, N, O, P, S} were removed</li>
<li><strong>Normalization:</strong> Charged molecules were neutralized and converted to canonical SMILES</li>
</ul>
<p><strong>Training Subsets:</strong> Models were trained on random samples of 30,000, 100,000, and 300,000 molecules to test scalability.</p>
<p><strong>Generalization Data:</strong> To test generalization, models were also trained on the <a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a> database (enumerating drug-like molecules up to 13 heavy atoms).</p>
<p><strong>Structure Elucidation Data:</strong> For practical application tasks, models were trained on natural products (LOTUS, COCONUT), food compounds (FooDB), and environmental contaminants (NORMAN).</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric:</strong> Fréchet ChemNet Distance (FCD), measuring chemical similarity between generated molecules and the training set (lower is better).</p>
<p><strong>Secondary Metrics:</strong></p>
<ul>
<li><strong>Validity:</strong> Percentage of outputs parseable by RDKit</li>
<li><strong>Scaffold Similarity:</strong> Jensen-Shannon distances between Murcko scaffold compositions</li>
<li><strong>Physical Properties:</strong> Comparisons of molecular weight, LogP, topological polar surface area (TPSA), and ring counts (aromatic vs. aliphatic)</li>
<li><strong>Structure Elucidation:</strong> &ldquo;Top-k accuracy,&rdquo; the proportion of held-out molecules where the correct structure appeared in the model&rsquo;s top $k$ ranked outputs</li>
</ul>
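<p>The top-$k$ accuracy metric is simple to sketch in Python; the SMILES rankings below are hypothetical examples, not outputs from the paper&rsquo;s models:</p>

```python
def top_k_accuracy(ranked_candidates, true_structures, k=3):
    """Fraction of held-out molecules whose true structure appears
    among the model's top-k ranked candidates."""
    hits = sum(
        1 for ranked, truth in zip(ranked_candidates, true_structures)
        if truth in ranked[:k]
    )
    return hits / len(true_structures)

# Hypothetical ranked candidate lists for two query molecules:
ranked = [["CCO", "CCC", "CCN"], ["c1ccccc1", "CC=O", "CCO"]]
truth = ["CCC", "CCO"]
acc = top_k_accuracy(ranked, truth, k=2)  # hit, miss -> 0.5
```
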
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Nodes:</strong> Dell EMC C4140 GPU compute nodes</li>
<li><strong>GPUs:</strong> NVIDIA Tesla V100</li>
<li><strong>Compute Time:</strong> Parsing 1 million SMILES took ~7.5 minutes on a single CPU; SELFIES models required an average of 0.6 hours longer to train than SMILES models</li>
</ul>
<h3 id="replicability">Replicability</h3>
<p><strong>Code Availability:</strong> Source code and intermediate data are available via <a href="https://doi.org/10.5281/zenodo.10680855">Zenodo</a>. Note that pre-trained model weights are not explicitly provided in the archive, requiring researchers to train models from scratch using the included scripts to fully replicate the study.</p>
<p><strong>Software Libraries:</strong></p>
<ul>
<li><strong>PyTorch:</strong> LSTM and Transformer implementations</li>
<li><strong>RDKit:</strong> SMILES parsing, validity checking, and property calculation</li>
<li><strong>SELFIES:</strong> Version 2.1.1 for conversion</li>
</ul>
<h2 id="implications-and-takeaways">Implications and Takeaways</h2>
<p>This work fundamentally challenges how we think about &ldquo;errors&rdquo; in generative models. The key insight is that outputs that appear incorrect often carry useful uncertainty signals and, when properly handled, improve overall performance.</p>
<p>The findings suggest that the field&rsquo;s drive toward guaranteed validity introduces systematic biases. It can be more advantageous to let models fail informatively and to use those failures as quality signals, particularly as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.</p>
<p>For practitioners, the message is clear: pause before &ldquo;fixing&rdquo; models that generate invalid outputs. That apparent flaw acts as a feature in disguise, providing automatic quality control that improves final results. Success often stems from learning to recognize and filter mistakes effectively.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437-448. <a href="https://doi.org/10.1038/s42256-024-00821-x">https://doi.org/10.1038/s42256-024-00821-x</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence (2024)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{skinnider2024invalid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Invalid SMILES are beneficial rather than detrimental to chemical language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Skinnider, Michael A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{437--448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group UK London}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Synthetic Isomer Data Generation Pipeline</title><link>https://hunterheidenreich.com/projects/isomer-dataset-generation/</link><pubDate>Sat, 09 Mar 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/isomer-dataset-generation/</guid><description>An end-to-end cheminformatics pipeline transforming 1D chemical formulas into 3D conformer datasets using graph enumeration and physics-based featurization.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>In computational drug discovery, data scarcity is often the bottleneck. This project builds a synthetic data generator that creates labeled 3D molecular datasets starting from nothing but a raw chemical formula (e.g., $C_6H_{14}$).</p>
<p>The pipeline bridges the gap between <strong>1D Chemical Information</strong> (stoichiometry) and <strong>3D Geometric Data</strong> (conformers), effectively serving as a &ldquo;data factory&rdquo; for training molecular machine learning models.</p>
<h2 id="features">Features</h2>
<h3 id="1-graph-enumeration--3d-embedding">1. Graph Enumeration &amp; 3D Embedding</h3>
<p>The core of the project is <code>pysomer/data/gen.py</code>, which orchestrates a multi-step generation process:</p>
<ul>
<li><strong>Structural Isomerism:</strong> Uses <strong>MAYGEN</strong> (via a Java bridge) to mathematically enumerate all valid graph connectivities for a given formula</li>
<li><strong>Conformer Sampling:</strong> Uses <strong>RDKit</strong> to embed these graphs into 3D space, generating multiple conformers (rotamers) per isomer to capture flexibility</li>
<li><strong>IUPAC Labeling:</strong> Automatically queries PubChem APIs to assign human-readable labels (e.g., &ldquo;2-methylpentane&rdquo;) to the generated structures</li>
</ul>
<h3 id="2-physics-aware-featurization">2. Physics-Aware Featurization</h3>
<p>The pipeline computes <strong>Coulomb Matrices</strong>, ensuring the input respects physical invariants:</p>
<p>$$C_{ij} = \begin{cases} 0.5 Z_i^{2.4} &amp; i = j \\ \frac{Z_i Z_j}{|R_i - R_j|} &amp; i \neq j \end{cases}$$</p>
<p>This representation encodes the electrostatic potential of the molecule, providing a more robust signal for the neural network than raw Cartesian coordinates.</p>
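<p>A minimal NumPy sketch of this featurization, taking atomic numbers and Cartesian coordinates (in &aring;ngstr&ouml;ms); the H&#8322;-like geometry below is an invented example, not data from the pipeline:</p>

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix from atomic numbers Z (n,) and coordinates R (n, 3)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4          # self-interaction term
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

# Hypothetical H2-like geometry with a 0.74 A bond length:
C = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]])
```

<p>The matrix is symmetric by construction, with off-diagonal entries falling off as the inverse interatomic distance.</p>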
<h3 id="3-hdf5-data-storage">3. HDF5 Data Storage</h3>
<p>To handle the large volume of generated conformers, the system writes to hierarchical <strong>HDF5</strong> files. This allows for efficient, chunked I/O during training, a critical pattern for scaling to larger chemical spaces.</p>
<h2 id="usage">Usage</h2>
<p>The pipeline is executed via a CLI, taking a chemical formula as input and outputting an HDF5 dataset of 3D conformers.</p>
<h2 id="results">Results</h2>
<p>This project serves as a &ldquo;vertical slice&rdquo; of a cheminformatics workflow.</p>
<ul>
<li><strong>The Good:</strong> The separation of concerns is clean. Using <code>dataclasses</code> for configuration and HDF5 for storage demonstrates a &ldquo;production-first&rdquo; mindset regarding data engineering.</li>
<li><strong>The &ldquo;Old School&rdquo;:</strong> The model used is a simple Multi-Layer Perceptron (MLP) on flattened Coulomb Matrices. In a modern production setting (post-2020), I would replace this with an <strong>E(3)-Equivariant GNN</strong> (like SchNet or E3NN) to handle rotational symmetry natively, eliminating manual feature engineering.</li>
<li><strong>Dependency Management:</strong> The reliance on an external Java JAR (<code>MAYGEN</code>) for graph enumeration makes the environment brittle. Today, I would likely swap this for a pure Python enumerator or a containerized microservice to improve portability.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<p>This data pipeline powers the analysis in my comprehensive guide on molecular representation:</p>
<ul>
<li><a href="/posts/alkane-constitutional-isomer-classification/">Coulomb Matrix Eigenvalues: Can You Hear the Shape of a Molecule?</a>: A deep dive into data generation, unsupervised clustering, and supervised classification of alkane isomers.</li>
</ul>
<p>See also:</p>
<ul>
<li><a href="/posts/molecular-descriptor-coulomb-matrix/">The Coulomb Matrix</a>: Deep dive into the physics-based featurization used here</li>
<li><a href="/notes/computational-chemistry/classic-papers/number-of-isomeric-hydrocarbons/">The Number of Isomeric Hydrocarbons</a>: The foundational 1931 paper on alkane enumeration</li>
</ul>
]]></content:encoded></item><item><title>Hearing Molecular Shape via Coulomb Matrix Eigenvalues</title><link>https://hunterheidenreich.com/posts/alkane-constitutional-isomer-classification/</link><pubDate>Sat, 24 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/alkane-constitutional-isomer-classification/</guid><description>Explore molecular shape recognition using Coulomb matrix eigenvalues. An analysis of alkane isomers, clustering limits, and supervised classification.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Can you determine a molecule&rsquo;s shape from mathematical fingerprints alone? This question drives some of the most fundamental challenges in computational chemistry and machine learning. In the broader ML context, this is the classic search for the right <em>inductive bias</em> or <em>invariant representation</em>. Whether we are processing messy documents, natural language, or molecular dynamics, finding a representation that captures essential structure while ignoring irrelevant variations is critical.</p>
<p>I recently encountered a paper with an intriguing title: <a href="https://doi.org/10.1021/acs.jcim.0c00631">&ldquo;Can One Hear the Shape of a Molecule (from its Coulomb Matrix Eigenvalues)?&rdquo;</a> The title references Mark Kac&rsquo;s famous mathematical question <a href="https://www.math.ucdavis.edu/~hunter/m207b/kac.pdf">&ldquo;Can One Hear the Shape of a Drum?&rdquo;</a> exploring whether a drum&rsquo;s shape dictates its sound frequencies.</p>
<p>The molecular version asks: can we determine a molecule&rsquo;s structure from the eigenvalues of its <a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrix</a>?</p>
<p>Molecular representations are the foundation of machine learning in chemistry. If eigenvalues can capture structural information, they become powerful features for property prediction. Successfully separating simple structural differences is a prerequisite for handling more complex molecules.</p>
<p>The original authors tested this hypothesis using alkane constitutional isomers (molecules with identical formulas but different structural arrangements). I decided to replicate and extend their work to better understand both the methods and their limitations.</p>
<p>In this post, we will explore molecular representation through eigenvalue analysis, covering data generation, unsupervised clustering approaches, and supervised classification methods. I&rsquo;ll also explore log-transformed Coulomb matrices, which can reveal structural details that standard matrices miss.</p>
<h2 id="why-alkanes-make-ideal-test-cases">Why Alkanes Make Ideal Test Cases</h2>
<p><a href="https://en.wikipedia.org/wiki/Alkane">Alkanes</a> are the simplest organic molecules: carbon and hydrogen connected by single bonds with the general formula $C_{n}H_{2n+2}$.</p>
<p>What makes them perfect for testing molecular representations is their constitutional isomers: molecules with identical formulas but different structural arrangements. For small alkanes ($n \leq 3$), atoms can connect in only one way. Starting with butane ($n = 4$), multiple arrangements become possible:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/4-Butane-3D-balls.webp"
         alt="Butane as a ball-and-stick model."
         title="Butane as a ball-and-stick model."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Butane: a linear chain</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/4-Isobutane-3D-balls.webp"
         alt="Isobutane as a ball-and-stick model."
         title="Isobutane as a ball-and-stick model."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Isobutane: a branched structure</figcaption>
    
</figure>

<p>The number of isomers grows rapidly with molecular size. By undecane ($n = 11$), there are 159 different structural arrangements:</p>
<table>
  <thead>
      <tr>
          <th>Alkane</th>
          <th>n</th>
          <th>Isomers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Butane</td>
          <td>4</td>
          <td>2</td>
      </tr>
      <tr>
          <td>Pentane</td>
          <td>5</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Hexane</td>
          <td>6</td>
          <td>5</td>
      </tr>
      <tr>
          <td>Heptane</td>
          <td>7</td>
          <td>9</td>
      </tr>
      <tr>
          <td>Octane</td>
          <td>8</td>
          <td>18</td>
      </tr>
      <tr>
          <td>Nonane</td>
          <td>9</td>
          <td>35</td>
      </tr>
      <tr>
          <td>Decane</td>
          <td>10</td>
          <td>75</td>
      </tr>
      <tr>
          <td>Undecane</td>
          <td>11</td>
          <td>159</td>
      </tr>
  </tbody>
</table>
<p>This creates a natural classification challenge: can Coulomb matrix eigenvalues distinguish these structural differences? Successfully separating simple alkane isomers is a prerequisite for handling more complex molecules.</p>
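<p>Part of the appeal of eigenvalues as features is that they discard atom ordering: permuting the rows and columns of a symmetric matrix together leaves its spectrum unchanged, so two different atom enumerations of the same isomer map to the same descriptor. A minimal NumPy sketch of this property (a random symmetric matrix stands in for a real Coulomb matrix; this is an illustration, not part of the pipeline below):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric matrix stands in for a real Coulomb matrix
n = 6
a = rng.random((n, n))
m = (a + a.T) / 2

# Relabel the "atoms": permute rows and columns simultaneously
perm = rng.permutation(n)
m_perm = m[np.ix_(perm, perm)]

# The sorted spectra agree: eigenvalues ignore atom ordering
ev = np.sort(np.linalg.eigvalsh(m))
ev_perm = np.sort(np.linalg.eigvalsh(m_perm))
print(np.allclose(ev, ev_perm))
```

<p>This invariance is exactly what makes eigenvalue spectra comparable across isomers without first solving an atom-matching problem.</p>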
<h2 id="computational-pipeline">Computational Pipeline</h2>
<p>The analysis requires three computational steps:</p>
<ol>
<li><strong>Generate constitutional isomers</strong> for each alkane formula</li>
<li><strong>Create multiple 3D conformations</strong> for each isomer</li>
<li><strong>Calculate Coulomb matrix eigenvalues</strong> for each conformation</li>
</ol>
<h3 id="generating-constitutional-isomers">Generating Constitutional Isomers</h3>
<p>Enumerating all possible carbon skeletons is a combinatorial problem. I used <a href="https://github.com/MehmetAzizYirik/MAYGEN">MAYGEN</a>, an open-source Java tool for generating molecular structures from chemical formulas.</p>
<p>For butane ($C_{4}H_{10}$):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>java -jar MAYGEN-1.8.jar -v -m -f C4H10 -smi -o butane_isomers.smi
</span></span></code></pre></div><p>This generates:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>CCCC
</span></span><span style="display:flex;"><span>CC(C)C
</span></span></code></pre></div><p>The first is n-butane (linear), the second is isobutane (branched). We can automate this across all alkanes:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>os<span style="color:#f92672">.</span>makedirs(<span style="color:#e6db74">&#39;isomers&#39;</span>, exist_ok<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    cmd <span style="color:#f92672">=</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;java -jar MAYGEN-1.8.jar -f C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> -smi -o isomers/C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.smi&#34;</span>
</span></span><span style="display:flex;"><span>    os<span style="color:#f92672">.</span>system(cmd)
</span></span></code></pre></div><h3 id="generating-3d-conformations">Generating 3D Conformations</h3>
<p>For machine learning applications, we need multiple 3D structures of each isomer to capture conformational flexibility. I used <a href="https://github.com/rdkit/rdkit">RDKit</a>&rsquo;s ETKDG method, which <a href="https://doi.org/10.1021/acs.jcim.7b00505">remains competitive</a> with commercial alternatives:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> rdkit.Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> AllChem
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">smiles_str_to_rdkit_mol</span>(smiles_str: str) <span style="color:#f92672">-&gt;</span> rdkit<span style="color:#f92672">.</span>Chem<span style="color:#f92672">.</span>Mol:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Convert a SMILES string to an RDKit mol object.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Args:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    - smiles_str (str): A SMILES string representing a molecule.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Returns:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    - mol (rdkit.Chem.Mol): An RDKit mol object representing the molecule.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Convert SMILES string to RDKit mol object</span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> rdkit<span style="color:#f92672">.</span>Chem<span style="color:#f92672">.</span>MolFromSmiles(smiles_str)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Add hydrogens to the molecule</span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> rdkit<span style="color:#f92672">.</span>Chem<span style="color:#f92672">.</span>AddHs(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Assign 3D coordinates to the molecule</span>
</span></span><span style="display:flex;"><span>    AllChem<span style="color:#f92672">.</span>EmbedMolecule(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> mol
</span></span></code></pre></div><h3 id="computing-coulomb-matrix-eigenvalues">Computing Coulomb Matrix Eigenvalues</h3>
<p>The Coulomb matrix encodes 3D structure in a rotation and translation-invariant way. Its eigenvalues should capture structural information while remaining invariant to molecular orientation.</p>
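<p>Concretely, the Coulomb matrix has entries $M_{ii} = 0.5 Z_i^{2.4}$ on the diagonal and $M_{ij} = Z_i Z_j / |\mathbf{R}_i - \mathbf{R}_j|$ off it. As a sanity check on the invariance claim, here is a minimal NumPy implementation (independent of DScribe; the <code>coulomb_matrix</code> helper and the toy three-atom geometry are illustrative, not part of the pipeline):</p>

```python
import numpy as np

def coulomb_matrix(Z: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal, Z_i * Z_j / |R_i - R_j| off it."""
    dists = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    with np.errstate(divide="ignore"):  # diagonal distances are zero
        m = np.outer(Z, Z) / dists
    np.fill_diagonal(m, 0.5 * np.asarray(Z, dtype=float) ** 2.4)
    return m

# Toy three-atom geometry (positions in arbitrary units, not a real molecule)
Z = np.array([6, 1, 1])
R = np.array([[0.0, 0.0, 0.0], [1.09, 0.0, 0.0], [0.0, 1.09, 0.0]])

ev = np.sort(np.linalg.eigvalsh(coulomb_matrix(Z, R)))

# Translating the whole molecule leaves the eigenvalues unchanged
ev_shifted = np.sort(np.linalg.eigvalsh(coulomb_matrix(Z, R + 5.0)))
print(np.allclose(ev, ev_shifted))
```

<p>Rotations behave the same way, since the matrix depends only on interatomic distances.</p>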
<p>First, I wrote a helper function to convert RDKit molecules into <a href="https://ase-lib.org/">ASE</a> <code>Atoms</code> objects that <a href="https://singroup.github.io/dscribe/">DScribe</a> can process:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> ase
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">rdkit_mol_to_ase_atoms</span>(rdkit_mol: rdkit<span style="color:#f92672">.</span>Chem<span style="color:#f92672">.</span>Mol) <span style="color:#f92672">-&gt;</span> ase<span style="color:#f92672">.</span>Atoms:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Convert an RDKit molecule to an ASE Atoms object.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Args:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        rdkit_mol: RDKit molecule object.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Returns:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        ASE Atoms object.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    ase_atoms <span style="color:#f92672">=</span> ase<span style="color:#f92672">.</span>Atoms(
</span></span><span style="display:flex;"><span>        numbers<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>            atom<span style="color:#f92672">.</span>GetAtomicNum() <span style="color:#66d9ef">for</span> atom <span style="color:#f92672">in</span> rdkit_mol<span style="color:#f92672">.</span>GetAtoms()
</span></span><span style="display:flex;"><span>        ],
</span></span><span style="display:flex;"><span>        positions<span style="color:#f92672">=</span>rdkit_mol<span style="color:#f92672">.</span>GetConformer()<span style="color:#f92672">.</span>GetPositions()
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> ase_atoms
</span></span></code></pre></div><p>Then I computed Coulomb matrix eigenvalues using DScribe, with optional log transformation:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> dscribe.descriptors
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">ase_atoms_to_coulomb_matrix_eigenvalues</span>(
</span></span><span style="display:flex;"><span>    ase_atoms: ase<span style="color:#f92672">.</span>Atoms,
</span></span><span style="display:flex;"><span>    log: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>) <span style="color:#f92672">-&gt;</span> np<span style="color:#f92672">.</span>ndarray:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Convert an ASE Atoms object to a Coulomb matrix and calculate its eigenvalues.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Args:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        ase_atoms: ASE Atoms object.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        log: Whether to log transform the Coulomb matrix prior to calculating the eigenvalues.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Returns:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        Eigenvalues of the Coulomb matrix.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Create a Coulomb matrix</span>
</span></span><span style="display:flex;"><span>    coulomb_matrix <span style="color:#f92672">=</span> dscribe<span style="color:#f92672">.</span>descriptors<span style="color:#f92672">.</span>CoulombMatrix(
</span></span><span style="display:flex;"><span>        n_atoms_max<span style="color:#f92672">=</span>ase_atoms<span style="color:#f92672">.</span>get_global_number_of_atoms(),
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Calculate the Coulomb matrix</span>
</span></span><span style="display:flex;"><span>    coulomb_matrix <span style="color:#f92672">=</span> coulomb_matrix<span style="color:#f92672">.</span>create(ase_atoms)
</span></span><span style="display:flex;"><span>    coulomb_matrix <span style="color:#f92672">=</span> coulomb_matrix<span style="color:#f92672">.</span>reshape(
</span></span><span style="display:flex;"><span>        ase_atoms<span style="color:#f92672">.</span>get_global_number_of_atoms(),
</span></span><span style="display:flex;"><span>        ase_atoms<span style="color:#f92672">.</span>get_global_number_of_atoms())
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> log:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Log transform the Coulomb matrix</span>
</span></span><span style="display:flex;"><span>        coulomb_matrix <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>log(coulomb_matrix)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Calculate the eigenvalues of the Coulomb matrix</span>
</span></span><span style="display:flex;"><span>    eigenvalues <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>eigvals(coulomb_matrix)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> eigenvalues
</span></span></code></pre></div><p>Combining these functions enables efficient data generation:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Generate 1000 conformations per isomer for each alkane</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># gen_n_spectra() combines the above functions to generate multiple conformations</span>
</span></span><span style="display:flex;"><span>os<span style="color:#f92672">.</span>makedirs(<span style="color:#e6db74">&#39;spectra&#39;</span>, exist_ok<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>n_confs <span style="color:#f92672">=</span> <span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;Generating spectra for C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;isomers/C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.smi&#39;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        smiles_list <span style="color:#f92672">=</span> [line<span style="color:#f92672">.</span>strip() <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> f]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i, smiles <span style="color:#f92672">in</span> enumerate(smiles_list):
</span></span><span style="display:flex;"><span>        spectra <span style="color:#f92672">=</span> gen_n_spectra(smiles, n_confs, log<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>        np<span style="color:#f92672">.</span>save(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;spectra/C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">_</span><span style="color:#e6db74">{</span>i<span style="color:#e6db74">}</span><span style="color:#e6db74">.npy&#39;</span>, spectra)
</span></span></code></pre></div><h2 id="reproducing-the-original-results">Reproducing the Original Results</h2>
<p>To validate our computational pipeline, I replicated key figures from the original paper. This ensures our implementation correctly captures the phenomena they observed.</p>
<p>We generate data using 1000 conformations per isomer:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>os<span style="color:#f92672">.</span>makedirs(<span style="color:#e6db74">&#39;spectra&#39;</span>, exist_ok<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>n_confs <span style="color:#f92672">=</span> <span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;Generating spectra for C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;isomers/C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.smi&#39;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        lines <span style="color:#f92672">=</span> f<span style="color:#f92672">.</span>readlines()
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i, line <span style="color:#f92672">in</span> enumerate(lines):
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> line<span style="color:#f92672">.</span>strip():
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            smiles <span style="color:#f92672">=</span> line<span style="color:#f92672">.</span>strip()
</span></span><span style="display:flex;"><span>            spectra <span style="color:#f92672">=</span> gen_n_spectra(smiles, n_confs, log<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>)
</span></span><span style="display:flex;"><span>            np<span style="color:#f92672">.</span>save(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;spectra/C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">_</span><span style="color:#e6db74">{</span>i<span style="color:#e6db74">:</span><span style="color:#e6db74">03d</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.npy&#39;</span>, spectra)
</span></span></code></pre></div><p>After generation, we can load the data into a structured format:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> glob <span style="color:#f92672">import</span> glob
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>spectra <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    spectra[n] <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> f <span style="color:#f92672">in</span> glob(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;spectra/C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">_*.npy&#39;</span>):
</span></span><span style="display:flex;"><span>        j <span style="color:#f92672">=</span> int(re<span style="color:#f92672">.</span>search(<span style="color:#e6db74">rf</span><span style="color:#e6db74">&#39;C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">_(\d+)\.npy&#39;</span>, f)<span style="color:#f92672">.</span>group(<span style="color:#ae81ff">1</span>))
</span></span><span style="display:flex;"><span>        spectra[n][j] <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>load(f)
</span></span></code></pre></div><h3 id="largest-eigenvalues-across-alkane-series">Largest Eigenvalues Across Alkane Series</h3>
<p>The first analysis examines how the largest Coulomb matrix eigenvalues vary across constitutional isomers for each alkane formula. This plot reveals whether single eigenvalues can distinguish between different molecular formulas.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>fig, ax <span style="color:#f92672">=</span> plt<span style="color:#f92672">.</span>subplots(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">10</span>, <span style="color:#ae81ff">5</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    eigenvalues <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>array([spectra[n][i][:, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>mean() <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> spectra[n]])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># add black dots with jitter so they do not stack directly on the center line</span>
</span></span><span style="display:flex;"><span>    jitter <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>normal(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0.1</span>, size<span style="color:#f92672">=</span>len(eigenvalues))
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>scatter(np<span style="color:#f92672">.</span>full(len(eigenvalues), n) <span style="color:#f92672">+</span> jitter, eigenvalues, color<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;black&#39;</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.3</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Plot median</span>
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>scatter([n], [np<span style="color:#f92672">.</span>median(eigenvalues)], color<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;red&#39;</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Plot range</span>
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>plot([n <span style="color:#f92672">-</span> <span style="color:#ae81ff">0.5</span>, n <span style="color:#f92672">+</span> <span style="color:#ae81ff">0.5</span>], [np<span style="color:#f92672">.</span>min(eigenvalues), np<span style="color:#f92672">.</span>min(eigenvalues)], <span style="color:#e6db74">&#39;k-&#39;</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>)
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>plot([n <span style="color:#f92672">-</span> <span style="color:#ae81ff">0.5</span>, n <span style="color:#f92672">+</span> <span style="color:#ae81ff">0.5</span>], [np<span style="color:#f92672">.</span>max(eigenvalues), np<span style="color:#f92672">.</span>max(eigenvalues)], <span style="color:#e6db74">&#39;k-&#39;</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;Molecular formula&#39;</span>)
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_ylabel(<span style="color:#e6db74">&#39;Largest eigenvalue&#39;</span>)
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_xticks(range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>))
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_xticklabels([<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span> <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>)])
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;Largest eigenvalues of the Coulomb matrix for alkane constitutional isomers&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>savefig(<span style="color:#e6db74">&#39;alkane_coulomb_matrix_largest_eigenvalues.webp&#39;</span>, bbox_inches<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;tight&#39;</span>)
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/alkane_coulomb_matrix_largest_eigenvalues.webp"
         alt="Largest eigenvalues of the Coulomb matrix for alkane constitutional isomers."
         title="Largest eigenvalues of the Coulomb matrix for alkane constitutional isomers."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Largest eigenvalues show sub-linear growth with molecular size and increasing overlap between isomers.</figcaption>
    
</figure>

<p>Our results match the original paper. We observe sub-linear growth in the largest eigenvalue with carbon number, and critically, increasing overlap between isomers as molecules grow larger. The largest eigenvalue alone cannot reliably distinguish constitutional isomers for larger alkanes.</p>
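<p>The sub-linear trend shows up even in a crude toy model: an idealized straight chain of carbon atoms at a fixed 1.54 &#8491; spacing, with hydrogens and realistic geometry omitted. The <code>chain_largest_eigenvalue</code> helper below is purely illustrative and not part of the pipeline above:</p>

```python
import numpy as np

def chain_largest_eigenvalue(n_carbons: int, spacing: float = 1.54) -> float:
    """Largest Coulomb-matrix eigenvalue of an idealized linear carbon chain."""
    Z = 6.0
    x = np.arange(n_carbons) * spacing
    dists = np.abs(x[:, None] - x[None, :])
    with np.errstate(divide="ignore"):  # diagonal distances are zero
        m = (Z * Z) / dists
    np.fill_diagonal(m, 0.5 * Z ** 2.4)
    return float(np.linalg.eigvalsh(m).max())

largest = [chain_largest_eigenvalue(n) for n in range(2, 12)]

# Each added carbon raises the largest eigenvalue, but by less than the one before
increments = np.diff(largest)
print(increments)
```

<p>Because each new atom interacts with distant atoms only weakly (as $1/r$), its marginal contribution to the dominant eigenvalue shrinks as the chain grows, which is consistent with the flattening seen in the figure.</p>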
<h3 id="eigenvalue-distributions-for-heptane-isomers">Eigenvalue Distributions for Heptane Isomers</h3>
<p>Looking deeper, I analyzed the eigenvalue probability density functions for butane through heptane, focusing on heptane ($C_7H_{16}$): with nine constitutional isomers, it provides a good test of discrimination power.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n_sel <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">8</span>):
</span></span><span style="display:flex;"><span>    fig, ax <span style="color:#f92672">=</span> plt<span style="color:#f92672">.</span>subplots(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">10</span>, <span style="color:#ae81ff">5</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    smiles <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;isomers/C</span><span style="color:#e6db74">{</span>n_sel<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n_sel <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.smi&#39;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> f:
</span></span><span style="display:flex;"><span>            smiles<span style="color:#f92672">.</span>append(line<span style="color:#f92672">.</span>strip())
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(len(spectra[n_sel])):
</span></span><span style="display:flex;"><span>        eigenvalues <span style="color:#f92672">=</span> spectra[n_sel][i][:, <span style="color:#ae81ff">0</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># kde plot with the following params</span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># - Gaussian kernel</span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># - bandwidth with Silverman&#39;s rule of thumb</span>
</span></span><span style="display:flex;"><span>        ax <span style="color:#f92672">=</span> sns<span style="color:#f92672">.</span>kdeplot(eigenvalues, bw_method<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;silverman&#39;</span>, label<span style="color:#f92672">=</span>get_iupac_name(smiles[i]))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;Largest eigenvalue&#39;</span>)
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>set_ylabel(<span style="color:#e6db74">&#39;Density&#39;</span>)
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;PDF of the largest eigenvalue for $C_&#39;</span> <span style="color:#f92672">+</span> str(n_sel) <span style="color:#f92672">+</span> <span style="color:#e6db74">&#39;H_{&#39;</span> <span style="color:#f92672">+</span> str(<span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n_sel <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span>) <span style="color:#f92672">+</span> <span style="color:#e6db74">&#39;}$&#39;</span>)
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>legend()
</span></span><span style="display:flex;"><span>    plt<span style="color:#f92672">.</span>savefig(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;pdf_largest_eigenvalue_C</span><span style="color:#e6db74">{</span>n_sel<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n_sel <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.webp&#39;</span>, bbox_inches<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;tight&#39;</span>)
</span></span><span style="display:flex;"><span>    plt<span style="color:#f92672">.</span>close()
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/pdf_largest_eigenvalue_C7H16.webp"
         alt="PDFs of the largest eigenvalue for heptane isomers."
         title="PDFs of the largest eigenvalue for heptane isomers."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Heptane isomers show distinct eigenvalue ranges: n-heptane (smallest), 2,2,3-trimethylbutane (largest), with others overlapping.</figcaption>
    
</figure>

<p>The pattern is clear:</p>
<ul>
<li><strong>n-heptane</strong> (linear chain) has the smallest eigenvalues</li>
<li><strong>2,2,3-trimethylbutane</strong> (highly branched) has the largest</li>
<li><strong>Seven other isomers</strong> fall in between with substantial overlap</li>
</ul>
<p>This demonstrates the fundamental limitation: while extreme structural differences (linear vs. highly branched) create separable eigenvalue distributions, intermediate structures become indistinguishable.</p>
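<p>The degree of overlap between two of these distributions can be quantified directly. The sketch below is illustrative rather than part of the analysis above (the samples are synthetic Gaussians, not alkane spectra): it estimates the overlap coefficient, the integral of the pointwise minimum of two densities, using the same Silverman-bandwidth Gaussian KDEs as the plots.</p>

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

def overlap_coefficient(samples_a, samples_b, num_points=512):
    """Overlap coefficient of two 1D samples via Gaussian KDEs:
    the integral of min(p, q); 1.0 = identical, 0.0 = disjoint."""
    kde_a = gaussian_kde(samples_a, bw_method='silverman')
    kde_b = gaussian_kde(samples_b, bw_method='silverman')
    lo = min(samples_a.min(), samples_b.min())
    hi = max(samples_a.max(), samples_b.max())
    grid = np.linspace(lo, hi, num_points)
    return trapezoid(np.minimum(kde_a(grid), kde_b(grid)), grid)

rng = np.random.default_rng(0)
# Well-separated means (like n-heptane vs. 2,2,3-trimethylbutane)
# vs. nearby means (like two intermediate isomers).
separated = overlap_coefficient(rng.normal(0.0, 1.0, 1000),
                                rng.normal(8.0, 1.0, 1000))
overlapping = overlap_coefficient(rng.normal(0.0, 1.0, 1000),
                                  rng.normal(1.0, 1.0, 1000))
print(separated, overlapping)
```

A pair of isomers is reliably distinguishable by this feature only when the coefficient stays near zero.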
<p>For smaller alkanes, the separation is more promising:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/pdf_largest_eigenvalue_C4H10.webp"
         alt="PDFs for butane isomers."
         title="PDFs for butane isomers."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Butane (n=4): Clean separation between linear and branched structures.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/pdf_largest_eigenvalue_C5H12.webp"
         alt="PDFs for pentane isomers."
         title="PDFs for pentane isomers."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Pentane (n=5): Good separation between most isomers.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/pdf_largest_eigenvalue_C6H14.webp"
         alt="PDFs for hexane isomers."
         title="PDFs for hexane isomers."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Hexane (n=6): Some isomers (2-methylpentane, 3-methylpentane) become difficult to distinguish.</figcaption>
    
</figure>

<p>The progression is clear: eigenvalue-based discrimination works well for small alkanes and degrades as molecular complexity increases.</p>
<h3 id="two-dimensional-eigenvalue-space">Two-Dimensional Eigenvalue Space</h3>
<p>Can we improve discrimination by using multiple eigenvalues? For butane, plotting the first two eigenvalues reveals interesting structure:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>fig, ax <span style="color:#f92672">=</span> plt<span style="color:#f92672">.</span>subplots(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">10</span>, <span style="color:#ae81ff">5</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>n_sel <span style="color:#f92672">=</span> <span style="color:#ae81ff">4</span>
</span></span><span style="display:flex;"><span>smiles <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;isomers/C</span><span style="color:#e6db74">{</span>n_sel<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n_sel <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.smi&#39;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> f:
</span></span><span style="display:flex;"><span>        smiles<span style="color:#f92672">.</span>append(line<span style="color:#f92672">.</span>strip())
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(len(spectra[n_sel])):
</span></span><span style="display:flex;"><span>    eigenvalues <span style="color:#f92672">=</span> spectra[n_sel][i][:, :<span style="color:#ae81ff">2</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>scatter(eigenvalues[:, <span style="color:#ae81ff">0</span>], eigenvalues[:, <span style="color:#ae81ff">1</span>], label<span style="color:#f92672">=</span>get_iupac_name(smiles[i]))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;Largest eigenvalue&#39;</span>)
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_ylabel(<span style="color:#e6db74">&#39;Second largest eigenvalue&#39;</span>)
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;2D plot of the first two eigenvalues for $C_4H_</span><span style="color:#e6db74">{10}</span><span style="color:#e6db74">$ conformers&#39;</span>)
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>legend()
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>savefig(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;2d_largest_eigenvalue_C</span><span style="color:#e6db74">{</span>n_sel<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n_sel <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.webp&#39;</span>, bbox_inches<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;tight&#39;</span>)
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/2d_largest_eigenvalue_C4H10.webp"
         alt="2D eigenvalue space for butane isomers."
         title="2D eigenvalue space for butane isomers."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Perfect linear separation of butane isomers using the first two eigenvalues.</figcaption>
    
</figure>

<p>The two isomers cluster distinctly, demonstrating that multi-dimensional eigenvalue features can achieve perfect separation for simple cases. The outlier point in the lower right likely results from conformational sampling noise.</p>
<h3 id="dimensionality-and-information-content">Dimensionality and Information Content</h3>
<p>How many eigenvalues do we actually need? Principal component analysis reveals the effective dimensionality of the eigenvalue representations:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.decomposition <span style="color:#f92672">import</span> PCA
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>fig, ax <span style="color:#f92672">=</span> plt<span style="color:#f92672">.</span>subplots(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">10</span>, <span style="color:#ae81ff">5</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>n_components <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    n_components[n] <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(len(spectra[n])):
</span></span><span style="display:flex;"><span>        eigenvalues <span style="color:#f92672">=</span> spectra[n][i]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># PCA</span>
</span></span><span style="display:flex;"><span>        pca <span style="color:#f92672">=</span> PCA(n_components<span style="color:#f92672">=</span><span style="color:#ae81ff">0.99</span>, svd_solver<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;full&#39;</span>, whiten<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, random_state<span style="color:#f92672">=</span><span style="color:#ae81ff">42</span>)
</span></span><span style="display:flex;"><span>        pca<span style="color:#f92672">.</span>fit(eigenvalues)
</span></span><span style="display:flex;"><span>        n_components[n]<span style="color:#f92672">.</span>append(pca<span style="color:#f92672">.</span>n_components_)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>scatter([n] <span style="color:#f92672">*</span> len(n_components[n]), n_components[n], alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.3</span>)
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>plot([n <span style="color:#f92672">-</span> <span style="color:#ae81ff">0.25</span>, n <span style="color:#f92672">+</span> <span style="color:#ae81ff">0.25</span>], [np<span style="color:#f92672">.</span>mean(n_components[n]), np<span style="color:#f92672">.</span>mean(n_components[n])], <span style="color:#e6db74">&#39;k-&#39;</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>)  <span style="color:#75715e"># Draw a line for the mean</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>plot([<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">11</span>], [<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">11</span>], <span style="color:#e6db74">&#39;k--&#39;</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;y = num carbon&#39;</span>)
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>plot([<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">11</span>], [<span style="color:#ae81ff">5</span>, <span style="color:#ae81ff">35</span>], <span style="color:#e6db74">&#39;r--&#39;</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;y = num atoms&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;Number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_ylabel(<span style="color:#e6db74">&#39;Number of principal components&#39;</span>)
</span></span><span style="display:flex;"><span>ax<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;99% variance explained by number of principal components&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>legend()
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>savefig(<span style="color:#e6db74">&#39;99_variance_explained.webp&#39;</span>, bbox_inches<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;tight&#39;</span>)
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/99_variance_explained.webp"
         alt="Principal components needed for 99% variance."
         title="Principal components needed for 99% variance."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The eigenvalue space compresses efficiently. Far fewer components than molecular degrees of freedom are needed.</figcaption>
    
</figure>

<p>Key observations:</p>
<ul>
<li><strong>High compressibility</strong>: Far fewer components than the $3n+2$ eigenvalues (one per atom in $C_nH_{2n+2}$) are needed</li>
<li><strong>Linear scaling</strong>: Principal components grow roughly linearly with carbon number</li>
<li><strong>Efficient representation</strong>: The eigenvalue space has lower effective dimensionality than expected</li>
</ul>
<p>This suggests the eigenvalue features are highly correlated, enabling significant dimensionality reduction with minimal information loss.</p>
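<p>The same compression effect is easy to reproduce on synthetic data: when features are linear mixtures of a few latent factors, the 99%-variance criterion recovers roughly the latent dimensionality rather than the raw feature count. A minimal sketch (synthetic data, not the alkane spectra; 23 dimensions chosen to match the $3 \cdot 7 + 2 = 23$ eigenvalues of $C_7H_{16}$):</p>

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 23-dimensional features generated from only 5 latent factors plus a
# little noise -- i.e. highly correlated dimensions.
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 23))
features = latent @ mixing + 0.01 * rng.normal(size=(1000, 23))

# Same criterion as in the analysis: smallest number of components
# explaining 99% of the variance.
pca = PCA(n_components=0.99, svd_solver='full')
pca.fit(features)
print(pca.n_components_)  # far fewer than the 23 raw dimensions
```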
<h2 id="log-transformed-coulomb-matrices">Log-Transformed Coulomb Matrices</h2>
<p>As explored in our <a href="/posts/molecular-descriptor-coulomb-matrix/">previous post on Coulomb matrices</a>, log transformation can reveal different structural information. Standard Coulomb matrices emphasize heavy-atom interactions; the log transformation expands the influence of hydrogen atoms by mapping entry magnitudes in $(0,1]$ to $(-\infty,0]$.</p>
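<p>The transform itself can be sketched as follows. This is an illustrative reconstruction, not the <code>gen_n_spectra</code> helper used in this post (whose implementation isn't shown here): every Coulomb matrix entry is positive for a neutral molecule, so an elementwise natural log is well-defined, and the log matrix stays symmetric.</p>

```python
import numpy as np

def coulomb_matrix(charges, positions):
    """Standard Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal,
    Z_i * Z_j / |R_i - R_j| off the diagonal."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(positions, dtype=float)
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    m = np.outer(z, z) / np.where(dist == 0.0, 1.0, dist)  # guard the diagonal
    np.fill_diagonal(m, 0.5 * z ** 2.4)
    return m

def log_coulomb_spectrum(charges, positions):
    """Eigenvalues (descending) of the elementwise-log Coulomb matrix.
    Entries in (0, 1] map to (-inf, 0], boosting light-atom terms."""
    m = np.log(coulomb_matrix(charges, positions))
    return np.sort(np.linalg.eigvalsh(m))[::-1]

# Toy tetrahedral CH4-like geometry (coordinates are illustrative).
charges = [6, 1, 1, 1, 1]
d = 1.09 / np.sqrt(3)
positions = [[0, 0, 0], [d, d, d], [d, -d, -d], [-d, d, -d], [-d, -d, d]]
spec = log_coulomb_spectrum(charges, positions)
print(spec)
```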
<p>I generated equivalent datasets using log-transformed matrices to test how this affects discriminative power.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>os<span style="color:#f92672">.</span>makedirs(<span style="color:#e6db74">&#39;spectra&#39;</span>, exist_ok<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>n_confs <span style="color:#f92672">=</span> <span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;Generating spectra for C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;isomers/C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.smi&#39;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        lines <span style="color:#f92672">=</span> f<span style="color:#f92672">.</span>readlines()
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;</span><span style="color:#ae81ff">\t</span><span style="color:#e6db74">Number of SMILES strings: </span><span style="color:#e6db74">{</span>len(lines)<span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> i, line <span style="color:#f92672">in</span> enumerate(tqdm(lines)):
</span></span><span style="display:flex;"><span>            print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;</span><span style="color:#ae81ff">\t\t</span><span style="color:#e6db74">{</span>i <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span><span style="color:#e6db74">}</span><span style="color:#e6db74">/</span><span style="color:#e6db74">{</span>len(lines)<span style="color:#e6db74">}</span><span style="color:#e6db74"> - </span><span style="color:#e6db74">{</span>line<span style="color:#f92672">.</span>strip()<span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> line<span style="color:#f92672">.</span>strip():
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>exists(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;spectra/log-C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">_</span><span style="color:#e6db74">{</span>i<span style="color:#e6db74">:</span><span style="color:#e6db74">03d</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.npy&#39;</span>):
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            smiles <span style="color:#f92672">=</span> line<span style="color:#f92672">.</span>strip()
</span></span><span style="display:flex;"><span>            spectra <span style="color:#f92672">=</span> gen_n_spectra(n<span style="color:#f92672">=</span>n_confs, smiles_str<span style="color:#f92672">=</span>smiles, log<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>            np<span style="color:#f92672">.</span>save(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;spectra/log-C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">_</span><span style="color:#e6db74">{</span>i<span style="color:#e6db74">:</span><span style="color:#e6db74">03d</span><span style="color:#e6db74">}</span><span style="color:#e6db74">.npy&#39;</span>, spectra)
</span></span></code></pre></div><h3 id="log-transformed-eigenvalue-distributions">Log-Transformed Eigenvalue Distributions</h3>
<p>The log transformation dramatically changes the eigenvalue landscape:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/alkane_log_coulomb_matrix_largest_eigenvalues.webp"
         alt="Log-transformed eigenvalues across alkane series."
         title="Log-transformed eigenvalues across alkane series."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Log transformation emphasizes hydrogen interactions, creating larger eigenvalue ranges and more negative values.</figcaption>
    
</figure>

<p>Log-transformed versions exhibit distinct characteristics:</p>
<ul>
<li><strong>Span large negative values</strong> due to hydrogen atom emphasis</li>
<li><strong>Show increasing variance</strong> between isomers as molecular size grows</li>
<li><strong>Demonstrate greater discrimination potential</strong> for some isomers</li>
</ul>
<p>This comes with trade-offs. The distributions can become significantly broader, as seen in the heptane analysis:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/pdf_log_largest_eigenvalue_C7H16.webp"
         alt="Log-transformed eigenvalue PDFs for heptane."
         title="Log-transformed eigenvalue PDFs for heptane."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Log transformation creates wider, more overlapping distributions that may reduce discrimination power.</figcaption>
    
</figure>

<p>A log scale on the y-axis is necessary here: on a linear scale, the broad distribution of the unbranched isomer would be nearly invisible next to the highly concentrated distributions of the branched isomers.</p>
<h3 id="two-dimensional-log-transformed-space">Two-Dimensional Log-Transformed Space</h3>
<p>The 2D eigenvalue plot for log-transformed butane shows similar clustering behavior:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/2d_log_largest_eigenvalue_C4H10.webp"
         alt="2D log-transformed eigenvalue space for butane."
         title="2D log-transformed eigenvalue space for butane."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Log transformation brings isomers closer together while maintaining separability for simple cases.</figcaption>
    
</figure>

<p>The transformation reduces the separation distance, yet linear discriminability remains intact for this simple case.</p>
<h3 id="dimensionality-of-log-transformed-features">Dimensionality of Log-Transformed Features</h3>
<p>Principal component analysis of log-transformed eigenvalues reveals similar compression properties:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/99_variance_explained_log.webp"
         alt="Principal components for log-transformed eigenvalues."
         title="Principal components for log-transformed eigenvalues."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Log transformation requires slightly more principal components while maintaining efficient compression.</figcaption>
    
</figure>

<p>The log-transformed features show comparable dimensionality reduction with marginally higher component requirements.</p>
<h2 id="testing-eigenvalue-separability">Testing Eigenvalue Separability</h2>
<p>Our exploratory analysis revealed concerning patterns that hint at fundamental limitations: high correlation between eigenvalue dimensions, rapid dimensionality compression via PCA, and overlapping distributions for larger molecules ($n \geq 6$).</p>
<p>These findings leave the question open. We now test eigenvalues directly to see if they can actually separate constitutional isomers without supervision.</p>
<p>We&rsquo;ll use two complementary clustering metrics to measure how well eigenvalues separate constitutional isomers. The comparison is fair: we only compare isomers with identical molecular formulas, so the eigenvalue dimension stays constant.</p>
<h3 id="dunn-index-global-cluster-quality">Dunn Index: Global Cluster Quality</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Dunn_index">Dunn Index</a> provides a single metric capturing cluster quality. It asks: &ldquo;Are the closest different clusters still farther apart than the most spread-out individual cluster?&rdquo;</p>
<p>$$
\text{Dunn Index} = \frac{\text{smallest distance between different clusters}}{\text{largest diameter within any cluster}}
$$</p>
<p>Higher values indicate better separation. As the index approaches zero, clusters become indistinguishable, which is exactly what we suspected from the overlapping eigenvalue distributions observed earlier.</p>
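<p>The <code>dunn_index</code> helper used in this post isn't reproduced here; below is a minimal sketch of one way to implement it, assuming each cluster is an <code>(n_samples, n_features)</code> array of conformer eigenvalue vectors. The returned keys mirror the <code>diameter</code>/<code>distance</code>/<code>dunn_index</code> fields in the printed results.</p>

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(clusters):
    """clusters: list of (n_samples, n_features) arrays, one per isomer.
    Dunn index = min inter-cluster distance / max intra-cluster diameter."""
    # Largest diameter within any single cluster (needs >= 2 samples each).
    diameter = max(pdist(c).max() for c in clusters)
    # Smallest point-to-point distance between any two distinct clusters.
    distance = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return {'diameter': diameter, 'distance': distance,
            'dunn_index': distance / diameter}

# Two tight, well-separated blobs give a Dunn index well above zero.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.1, size=(50, 3))
b = rng.normal(5.0, 0.1, size=(50, 3))
res = dunn_index([a, b])
print(res)
```

The double loop over cluster pairs is what makes the computation quadratic in the number of isomers.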
<p>Computing the Dunn Index for each alkane series:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>dunn_scores <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    tik <span style="color:#f92672">=</span> time<span style="color:#f92672">.</span>time()
</span></span><span style="display:flex;"><span>    dunn_scores[n] <span style="color:#f92672">=</span> dunn_index([spectra[n][i] <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> spectra[n]])
</span></span><span style="display:flex;"><span>    tok <span style="color:#f92672">=</span> time<span style="color:#f92672">.</span>time()
</span></span><span style="display:flex;"><span>    dunn_scores[n][<span style="color:#e6db74">&#39;time&#39;</span>] <span style="color:#f92672">=</span> tok <span style="color:#f92672">-</span> tik
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">:&#39;</span>, dunn_scores[n])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>C4H10: {&#39;diameter&#39;: 21.43072917950398, &#39;distance&#39;: 8.316362440688767, &#39;dunn_index&#39;: 0.3880578383978837, &#39;time&#39;: 0.06010293960571289}
</span></span><span style="display:flex;"><span>C5H12: {&#39;diameter&#39;: 23.449286379564892, &#39;distance&#39;: 2.4693042873545856, &#39;dunn_index&#39;: 0.10530402705587172, &#39;time&#39;: 0.10832405090332031}
</span></span><span style="display:flex;"><span>C6H14: {&#39;diameter&#39;: 19.602363375467938, &#39;distance&#39;: 1.4477574259511048, &#39;dunn_index&#39;: 0.07385626917634591, &#39;time&#39;: 0.28030991554260254}
</span></span><span style="display:flex;"><span>C7H16: {&#39;diameter&#39;: 20.065014927470955, &#39;distance&#39;: 0.4050094394280803, &#39;dunn_index&#39;: 0.02018485612355977, &#39;time&#39;: 1.0307331085205078}
</span></span><span style="display:flex;"><span>C8H18: {&#39;diameter&#39;: 24.794154667613665, &#39;distance&#39;: 0.5013450168168625, &#39;dunn_index&#39;: 0.020220290771668196, &#39;time&#39;: 4.199508905410767}
</span></span><span style="display:flex;"><span>C9H20: {&#39;diameter&#39;: 21.811025941686033, &#39;distance&#39;: 0.34381162248560415, &#39;dunn_index&#39;: 0.01576320267578513, &#39;time&#39;: 17.400264978408813}
</span></span><span style="display:flex;"><span>C10H22: {&#39;diameter&#39;: 27.180773716656066, &#39;distance&#39;: 0.4986608768730121, &#39;dunn_index&#39;: 0.0183460883811206, &#39;time&#39;: 86.00787401199341}
</span></span><span style="display:flex;"><span>C11H24: {&#39;diameter&#39;: 25.58731511020692, &#39;distance&#39;: 0.5490373275460223, &#39;dunn_index&#39;: 0.021457402825629343, &#39;time&#39;: 424.4431610107422}
</span></span></code></pre></div><p>The computation time grows dramatically (over 7 minutes for $C_{11}H_{24}$) because the pairwise cluster comparisons scale quadratically with the number of isomers: the 159 isomers of $C_{11}H_{24}$ require over 12,000 of them.</p>
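<p>The quadratic growth is easy to sanity-check. The short sketch below (the isomer counts are the standard enumeration for acyclic alkanes, not output from this notebook) counts the inter-cluster pairs the Dunn index must compare:</p>

```python
from math import comb

# Known counts of constitutional isomers for acyclic alkanes CnH2n+2.
isomer_counts = {4: 2, 5: 3, 6: 5, 7: 9, 8: 18, 9: 35, 10: 75, 11: 159}

# The Dunn index compares every pair of clusters, so the number of
# inter-cluster distance computations grows as C(k, 2) in the number
# of clusters k -- quadratic in the isomer count.
pairs = {n: comb(k, 2) for n, k in isomer_counts.items()}
for n, k in isomer_counts.items():
    print(f'C{n}H{2 * n + 2}: {k} isomers -> {pairs[n]} cluster pairs')
```

<p>For $C_{11}H_{24}$ this gives 12,561 cluster pairs, matching the roughly 12,000 pairwise comparisons noted above.</p>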
<p>The results reveal a clear trend:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>fig, axs <span style="color:#f92672">=</span> plt<span style="color:#f92672">.</span>subplots(<span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">2</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">15</span>, <span style="color:#ae81ff">10</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># in axs[0, 0] - diameter vs number of carbon atoms</span>
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>plot(
</span></span><span style="display:flex;"><span>    list(range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)),
</span></span><span style="display:flex;"><span>    [dunn_scores[n][<span style="color:#e6db74">&#39;diameter&#39;</span>] <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)],
</span></span><span style="display:flex;"><span>    marker<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;o&#39;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;Number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>set_ylabel(<span style="color:#e6db74">&#39;Diameter&#39;</span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;Diameter vs number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># in axs[0, 1] - distance vs number of carbon atoms</span>
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>plot(
</span></span><span style="display:flex;"><span>    list(range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)),
</span></span><span style="display:flex;"><span>    [dunn_scores[n][<span style="color:#e6db74">&#39;distance&#39;</span>] <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)],
</span></span><span style="display:flex;"><span>    marker<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;o&#39;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;Number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>set_ylabel(<span style="color:#e6db74">&#39;Distance&#39;</span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;Distance vs number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># in axs[1, 0] - dunn index vs number of carbon atoms</span>
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>plot(
</span></span><span style="display:flex;"><span>    list(range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)),
</span></span><span style="display:flex;"><span>    [dunn_scores[n][<span style="color:#e6db74">&#39;dunn_index&#39;</span>] <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)],
</span></span><span style="display:flex;"><span>    marker<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;o&#39;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;Number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>set_ylabel(<span style="color:#e6db74">&#39;Dunn index&#39;</span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;Dunn index vs number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># in axs[1, 1] - time vs number of carbon atoms</span>
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>plot(
</span></span><span style="display:flex;"><span>    list(range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)),
</span></span><span style="display:flex;"><span>    [dunn_scores[n][<span style="color:#e6db74">&#39;time&#39;</span>] <span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)],
</span></span><span style="display:flex;"><span>    marker<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;o&#39;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#39;Number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>set_ylabel(<span style="color:#e6db74">&#39;Time (s)&#39;</span>)
</span></span><span style="display:flex;"><span>axs[<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;Time vs number of carbon atoms&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>tight_layout()
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>savefig(<span style="color:#e6db74">&#39;dunn_index_vs_num_carbon_atoms.webp&#39;</span>, bbox_inches<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;tight&#39;</span>)
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/dunn_index_vs_num_carbon_atoms.webp"
         alt="Dunn Index analysis showing separability metrics, distances, and computation time versus molecular size"
         title="Dunn Index analysis showing separability metrics, distances, and computation time versus molecular size"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Dunn Index analysis reveals deteriorating separability as molecular complexity increases.</figcaption>
    
</figure>

<p>The trend confirms our earlier concerns:</p>
<ul>
<li><strong>$C_{4}H_{10}$</strong>: Excellent separation (Dunn Index = 0.39) between butane and isobutane</li>
<li><strong>$C_{5}H_{12}$ to $C_{6}H_{14}$</strong>: Rapid decline in separability</li>
<li><strong>$C_{7}H_{16}$ and beyond</strong>: Poor separation (Dunn Index $\approx$ 0.02)</li>
</ul>
<p>This validates our computational pipeline and matches the original paper&rsquo;s findings. For larger molecules, eigenvalue clusters become nearly indistinguishable, confirming the overlapping distributions we observed earlier.</p>
<h3 id="silhouette-analysis-individual-conformation-assessment">Silhouette Analysis: Individual Conformation Assessment</h3>
<p>The Dunn Index provides the global view. We must also consider individual molecules. The <a href="https://en.wikipedia.org/wiki/Silhouette_(clustering)">silhouette score</a> evaluates each conformation separately, asking: &ldquo;Is this molecule closer to its own isomer family or to a different one?&rdquo;</p>
<p>For each molecular conformation $i$:</p>
<p>$$
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
$$</p>
<p>where:</p>
<ul>
<li>$a(i)$ = average distance to other conformations of the <strong>same</strong> isomer</li>
<li>$b(i)$ = average distance to conformations of the <strong>nearest different</strong> isomer</li>
</ul>
<p><strong>Interpretation:</strong></p>
<ul>
<li><strong>Score near +1</strong>: Conformation clusters correctly (good clustering)</li>
<li><strong>Score near 0</strong>: Conformation sits on the boundary between two isomer families</li>
<li><strong>Score near -1</strong>: Conformation closer to a different isomer (misclassification)</li>
</ul>
<p>This enables two critical measurements:</p>
<ol>
<li>How many isomers have <strong>any</strong> misclassified conformations?</li>
<li>What fraction of <strong>individual conformations</strong> get misclassified?</li>
</ol>
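<p>As a quick sanity check of the formula, here is a toy example (four 1-D points in two clusters, not the article's spectra) comparing a hand computation of $s(i)$ against scikit-learn:</p>

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two well-separated 1-D clusters; every score should land near +1.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])

scores = silhouette_samples(X, y)

# Hand computation for the first point (x = 0):
a = 1.0                  # mean distance within its own cluster: |0 - 1|
b = (10.0 + 11.0) / 2    # mean distance to the nearest other cluster
assert np.isclose(scores[0], (b - a) / max(a, b))  # s(0) = 9.5 / 10.5
print(scores)
```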
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.metrics <span style="color:#f92672">import</span> silhouette_samples
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> tqdm <span style="color:#f92672">import</span> tqdm
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>s_scores <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> tqdm(range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>)):
</span></span><span style="display:flex;"><span>    X <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>    y <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> spectra[n]:
</span></span><span style="display:flex;"><span>        X<span style="color:#f92672">.</span>append(spectra[n][i])
</span></span><span style="display:flex;"><span>        y<span style="color:#f92672">.</span>extend(np<span style="color:#f92672">.</span>full(spectra[n][i]<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>], i))
</span></span><span style="display:flex;"><span>    X <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>concatenate(X)
</span></span><span style="display:flex;"><span>    y <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>array(y)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    s_scores[n] <span style="color:#f92672">=</span> silhouette_samples(X, y)
</span></span></code></pre></div><p>Computing both clustering quality metrics:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Metric 1: Fraction of isomers with ANY negative scores</span>
</span></span><span style="display:flex;"><span>neg_iso <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    n_iso <span style="color:#f92672">=</span> s_scores[n]<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">//</span> <span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span>    n_has_neg <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n_iso):
</span></span><span style="display:flex;"><span>        chunk <span style="color:#f92672">=</span> s_scores[n][i <span style="color:#f92672">*</span> <span style="color:#ae81ff">1000</span>:(i <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">*</span> <span style="color:#ae81ff">1000</span>]
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> np<span style="color:#f92672">.</span>any(chunk <span style="color:#f92672">&lt;</span> <span style="color:#ae81ff">0</span>):
</span></span><span style="display:flex;"><span>            n_has_neg <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>    neg_iso[n] <span style="color:#f92672">=</span> n_has_neg <span style="color:#f92672">/</span> n_iso
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Metric 2: Individual conformation misclassification rates</span>
</span></span><span style="display:flex;"><span>neg_confs <span style="color:#f92672">=</span> {}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    n_iso <span style="color:#f92672">=</span> s_scores[n]<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">//</span> <span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span>    neg_confs[n] <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>zeros(n_iso)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n_iso):
</span></span><span style="display:flex;"><span>        isomer_scores <span style="color:#f92672">=</span> s_scores[n][i <span style="color:#f92672">*</span> <span style="color:#ae81ff">1000</span>:(i <span style="color:#f92672">+</span> <span style="color:#ae81ff">1</span>) <span style="color:#f92672">*</span> <span style="color:#ae81ff">1000</span>]
</span></span><span style="display:flex;"><span>        neg_confs[n][i] <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>sum(isomer_scores <span style="color:#f92672">&lt;</span> <span style="color:#ae81ff">0</span>) <span style="color:#f92672">/</span> isomer_scores<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>]
</span></span></code></pre></div><h4 id="isomer-level-analysis">Isomer-Level Analysis</h4>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/fraction_of_negative_silhouette_scores_vs_num_carbon_atoms.webp"
         alt="Chart showing fraction of isomers with at least one misclassified conformation"
         title="Chart showing fraction of isomers with at least one misclassified conformation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Fraction of isomers with at least one misclassified conformation (a stringent test of cluster purity).</figcaption>
    
</figure>

<p>The trend is concerning: by $C_{11}H_{24}$, 97% of isomers have at least one conformation that would be misclassified. This metric is deliberately strict. Even a single misplaced conformation marks the entire isomer as problematic.</p>
<h4 id="conformation-level-analysis">Conformation-Level Analysis</h4>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/fraction_of_negative_silhouette_scores_vs_num_carbon_atoms_individual.webp"
         alt="Chart showing individual misclassification rates per isomer with horizontal lines showing range for each molecular size"
         title="Chart showing individual misclassification rates per isomer with horizontal lines showing range for each molecular size"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Individual misclassification rates per isomer. Each point represents one isomer; horizontal lines show the range for each molecular size.</figcaption>
    
</figure>

<p>The individual analysis reveals dramatic variation:</p>
<ul>
<li><strong>$C_{4}H_{10}$</strong>: Perfect clustering (0% misclassification), confirming our earlier 2D separation plots</li>
<li><strong>$C_{5}H_{12}$ to $C_{6}H_{14}$</strong>: Modest problems (1-8% misclassification rates)</li>
<li><strong>$C_{11}H_{24}$</strong>: Average 35% conformations misclassified per isomer</li>
</ul>
<p>Some isomers experience up to 99.5% conformation misclassification (they become essentially unrecognizable in eigenvalue space). This directly connects to our earlier observation: mathematical representations that appear elegant may lack the structural nuances needed for practical discrimination.</p>
<h2 id="supervised-learning-finding-hidden-structure">Supervised Learning: Finding Hidden Structure</h2>
<p>Both clustering metrics deliver the same conclusion: Coulomb matrix eigenvalues alone struggle to reliably distinguish constitutional isomers for larger alkanes. The mathematical elegance of eigenvalues encounters practical limitations as molecular complexity increases.</p>
<p>Supervised learning offers an alternative approach. Providing labels allows models to extract hidden patterns that elude clustering algorithms. The mathematical structure often requires explicit guidance for discovery.</p>
<p>I&rsquo;ll focus on two baseline approaches: k-nearest neighbors and logistic regression. These represent fundamentally different learning paradigms (one memorizes patterns, the other learns linear boundaries), giving us insight into what types of structure might exist in eigenvalue space.</p>
<h2 id="k-nearest-neighbors-pattern-recognition-through-memory">k-Nearest Neighbors: Pattern Recognition Through Memory</h2>
<p>k-NN represents the simplest supervised learning approach: it stores all training examples and classifies new samples based on their closest neighbors. If eigenvalue patterns truly distinguish isomers, nearby points in eigenvalue space should belong to the same class.</p>
<p>This directly tests the local structure. Local neighborhoods often preserve meaningful distinctions even when global structure appears diffuse.</p>
<h3 id="testing-different-feature-representations">Testing Different Feature Representations</h3>
<p>We compare three approaches: full eigenvalue vectors, top 10 eigenvalues only, and PCA-reduced representations.</p>
<p>Testing 1-nearest neighbor with full dimensionality:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.model_selection <span style="color:#f92672">import</span> cross_val_score, StratifiedKFold
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.neighbors <span style="color:#f92672">import</span> KNeighborsClassifier
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>df_1nn <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Prepare the data for CnH2n+2</span>
</span></span><span style="display:flex;"><span>    X, y <span style="color:#f92672">=</span> prep_data(n<span style="color:#f92672">=</span>n)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Create knn classifier</span>
</span></span><span style="display:flex;"><span>    knn <span style="color:#f92672">=</span> KNeighborsClassifier(n_neighbors<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Set up stratified 5-fold cross-validation</span>
</span></span><span style="display:flex;"><span>    cv <span style="color:#f92672">=</span> StratifiedKFold(n_splits<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Perform cross-validation. Since &#39;cross_val_score&#39; computes accuracy, we compute misclassification rate by subtracting accuracy from 1.</span>
</span></span><span style="display:flex;"><span>    acc_scores <span style="color:#f92672">=</span> cross_val_score(knn, X, y, cv<span style="color:#f92672">=</span>cv, scoring<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;accuracy&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Convert accuracy scores to misclassification error rates</span>
</span></span><span style="display:flex;"><span>    misclassification_error_rates <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> acc_scores
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Calculate the average and standard deviation of the misclassification error rates</span>
</span></span><span style="display:flex;"><span>    avg_misclassification_error <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>mean(misclassification_error_rates)
</span></span><span style="display:flex;"><span>    std_misclassification_error <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>std(misclassification_error_rates)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">: </span><span style="color:#e6db74">{</span>avg_misclassification_error<span style="color:#e6db74">:</span><span style="color:#e6db74">.2%</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> ± </span><span style="color:#e6db74">{</span>std_misclassification_error<span style="color:#e6db74">:</span><span style="color:#e6db74">.2%</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    df_1nn<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;molecule&#39;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;avg_misclassification_error&#39;</span>: avg_misclassification_error,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;std_misclassification_error&#39;</span>: std_misclassification_error,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;n&#39;</span>: n,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;representation&#39;</span>: <span style="color:#e6db74">&#39;full&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;model&#39;</span>: <span style="color:#e6db74">&#39;1nn&#39;</span>,
</span></span><span style="display:flex;"><span>    })
</span></span></code></pre></div><p>The results are remarkable compared to unsupervised clustering:</p>
<pre><code>C4H10: 0.00% ± 0.00%
C5H12: 0.00% ± 0.00%
C6H14: 0.00% ± 0.00%
C7H16: 0.07% ± 0.05%
C8H18: 0.11% ± 0.05%
C9H20: 0.51% ± 0.09%
C10H22: 1.31% ± 0.09%
C11H24: 3.24% ± 0.09%
</code></pre>
<p><strong>Perfect classification</strong> for molecules up to $C_{6}H_{14}$, and error rates below 4% even for a complex molecule like $C_{11}H_{24}$. This is a dramatic improvement over clustering, where 97% of $C_{11}H_{24}$ isomers had misclassified conformations.</p>
<p><strong>Note on feature scaling:</strong> Standardizing features significantly degraded performance; eigenvalue magnitudes carry crucial structural information.</p>
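<p>The scaling comparison is easy to reproduce in miniature. The sketch below uses synthetic data rather than the eigenvalue features, but shows the pattern for slotting a scaler into the evaluation; on the eigenvalue features, the unscaled variant was the clear winner:</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eigenvalue features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

raw = KNeighborsClassifier(n_neighbors=1)
scaled = Pipeline([
    ('scale', StandardScaler()),  # zero mean, unit variance per feature
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])

# Putting the scaler inside the pipeline keeps cross-validation honest:
# the scaler is refit on each training fold only.
raw_acc = cross_val_score(raw, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled, X, y, cv=5).mean()
print(f'raw: {raw_acc:.3f}  scaled: {scaled_acc:.3f}')
```

<p>Whether scaling helps depends on whether raw feature magnitudes carry signal; for Coulomb matrix eigenvalues, they evidently do.</p>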
<p>Comparing performance across different feature representations:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/alkane-classification-1nn.webp"
         alt="1-NN performance across different representations"
         title="1-NN performance across different representations"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">1-NN classification performance across different eigenvalue representations shows similar results, with slight advantages for full representations on larger molecules.</figcaption>
    
</figure>

<p><strong>Key insights:</strong></p>
<ul>
<li><strong>Representation choice matters little</strong> for 1-NN. Full, top-10, and PCA representations perform nearly identically</li>
<li><strong>PCA slightly outperforms</strong> top-10 eigenvalues for larger molecules, capturing more structural variance</li>
<li><strong>Perfect classification</strong> persists through $C_{6}H_{14}$ regardless of representation</li>
</ul>
<p>This confirms that discriminative information concentrates in the largest eigenvalues, validating our earlier PCA findings.</p>
<h3 id="the-neighbor-count-effect">The Neighbor Count Effect</h3>
<p>Testing k-NN with different neighbor counts (k=1, 3, 5) reveals a counterintuitive pattern:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/alkane-classification-knn.webp"
         alt="k-NN performance for different k values"
         title="k-NN performance for different k values"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">k-NN classification performance decreases as k increases. More neighbors actually hurt accuracy.</figcaption>
    
</figure>

<p><strong>Why does performance degrade with more neighbors?</strong> This connects directly to our earlier clustering analysis. The eigenvalue space preserves only very fine-grained local structure: the single nearest neighbor usually shares a conformation&rsquo;s class, but when k-NN examines beyond it, it increasingly finds examples from different classes.</p>
<p>This validates our unsupervised findings: in the absence of clear cluster boundaries, examining more neighbors introduces noise.</p>
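<p>A contrived 1-D example (not the article's data) makes the mechanism concrete: when classes interleave, the single nearest neighbor can be reliable while the next-nearest neighbors vote for the wrong class:</p>

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Interleaved classes along a line: labels alternate 0, 1, 0, 1, ...
x_train = np.arange(40, dtype=float).reshape(-1, 1)
y_train = np.arange(40) % 2
# Each test point sits just beside a training point of its own class,
# but its next-nearest training points belong to the other class.
x_test = x_train + 0.01

accs = {}
for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    accs[k] = knn.score(x_test, y_train)
    print(f'k={k}: accuracy {accs[k]:.2f}')
```

<p>Here k=1 classifies every test point correctly, while k=3 lets the two wrong-class neighbors outvote the right one almost everywhere.</p>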
<h2 id="logistic-regression-learning-linear-decision-boundaries">Logistic Regression: Learning Linear Decision Boundaries</h2>
<p>Logistic regression represents a fundamentally different approach: it learns linear decision boundaries in eigenvalue space. If eigenvalues encode structural information linearly, this should work well.</p>
<p>We&rsquo;ll focus on PCA-reduced representations to keep computation manageable, using insights from the k-NN analysis.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.linear_model <span style="color:#f92672">import</span> LogisticRegression
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.pipeline <span style="color:#f92672">import</span> Pipeline
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.decomposition <span style="color:#f92672">import</span> PCA
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.model_selection <span style="color:#f92672">import</span> StratifiedKFold, cross_val_score
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>df_lr <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> n <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">12</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Prepare the data for CnH2n+2</span>
</span></span><span style="display:flex;"><span>    X, y <span style="color:#f92672">=</span> prep_data(n<span style="color:#f92672">=</span>n)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Create logistic regression classifier with PCA</span>
</span></span><span style="display:flex;"><span>    lr <span style="color:#f92672">=</span> Pipeline([
</span></span><span style="display:flex;"><span>        (<span style="color:#e6db74">&#39;pca&#39;</span>, PCA(n_components<span style="color:#f92672">=</span><span style="color:#ae81ff">10</span>)),
</span></span><span style="display:flex;"><span>        (<span style="color:#e6db74">&#39;lr&#39;</span>, LogisticRegression(
</span></span><span style="display:flex;"><span>            max_iter<span style="color:#f92672">=</span><span style="color:#ae81ff">10_000</span>,
</span></span><span style="display:flex;"><span>            penalty<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;l2&#39;</span>,
</span></span><span style="display:flex;"><span>            solver<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;lbfgs&#39;</span>,
</span></span><span style="display:flex;"><span>            C<span style="color:#f92672">=</span><span style="color:#ae81ff">10.0</span>,  <span style="color:#75715e"># Reduced regularization</span>
</span></span><span style="display:flex;"><span>            random_state<span style="color:#f92672">=</span><span style="color:#ae81ff">42</span>,
</span></span><span style="display:flex;"><span>            n_jobs<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span>        ))
</span></span><span style="display:flex;"><span>    ])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 5-fold stratified cross-validation</span>
</span></span><span style="display:flex;"><span>    cv <span style="color:#f92672">=</span> StratifiedKFold(n_splits<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>)
</span></span><span style="display:flex;"><span>    acc_scores <span style="color:#f92672">=</span> cross_val_score(lr, X, y, cv<span style="color:#f92672">=</span>cv, scoring<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;accuracy&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Convert to misclassification rates</span>
</span></span><span style="display:flex;"><span>    avg_error <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>mean(<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> acc_scores)
</span></span><span style="display:flex;"><span>    std_error <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>std(<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> acc_scores)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;C</span><span style="color:#e6db74">{</span>n<span style="color:#e6db74">}</span><span style="color:#e6db74">H</span><span style="color:#e6db74">{</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>n <span style="color:#f92672">+</span> <span style="color:#ae81ff">2</span><span style="color:#e6db74">}</span><span style="color:#e6db74">: </span><span style="color:#e6db74">{</span>avg_error<span style="color:#e6db74">:</span><span style="color:#e6db74">.2%</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> ± </span><span style="color:#e6db74">{</span>std_error<span style="color:#e6db74">:</span><span style="color:#e6db74">.2%</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span></code></pre></div><p>Comparing k-NN versus logistic regression performance:</p>















<figure class="post-figure center ">
    <img src="/img/alkane-constitutional-isomers/alkane-classification-1nn-lr.webp"
         alt="Comparison of 1-NN and Logistic Regression performance"
         title="Comparison of 1-NN and Logistic Regression performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">k-NN significantly outperforms logistic regression, especially for larger molecules. The performance gap widens as molecular complexity increases.</figcaption>
    
</figure>

<p><strong>Key observations:</strong></p>
<ul>
<li><strong>k-NN dominates</strong> across all molecular sizes</li>
<li><strong>Linear boundaries fail</strong> for larger molecules, suggesting nonlinear eigenvalue relationships</li>
<li><strong>Performance gap grows</strong> with molecular complexity, indicating increasingly nonlinear structural patterns</li>
</ul>
<p>Logistic regression&rsquo;s performance indicates that discriminative patterns in eigenvalue space are fundamentally nonlinear. Capturing these complex relationships requires memory-based or non-linear approaches.</p>
<h2 id="implications-for-molecular-representation">Implications for Molecular Representation</h2>
<p>Our supervised learning experiments reveal a nuanced picture of Coulomb matrix eigenvalues as molecular descriptors. Eigenvalues preserve sufficient local structure for nearest-neighbor classification to work remarkably well, despite lacking clean global clusters.</p>
<p>This analysis reveals important lessons about molecular representations:</p>
<ol>
<li><strong>Mathematical elegance is not enough</strong>: Practical utility requires robust empirical performance; elegant constructions often conceal fundamental limitations in complex spaces.</li>
<li><strong>Context matters</strong>: Representations exhibit distinct performance characteristics under supervised versus unsupervised conditions.</li>
<li><strong>Molecular complexity is challenging</strong>: Even simple alkanes test our best descriptors.</li>
<li><strong>Local vs. global structure</strong>: Local neighborhood structures often contain highly discriminative information.</li>
</ol>
<p>For practitioners working with molecular representations, it is crucial to test multiple learning paradigms, since supervised and unsupervised approaches often yield different insights.</p>
<p><strong>Real-World Impact:</strong> Why does this matter beyond an academic exercise? Robust molecular representations are the engine driving modern computational chemistry. Whether we are accelerating drug discovery pipelines, designing novel materials, or running large-scale molecular dynamics simulations, the quality of our underlying representations dictates the success of our machine learning models. Understanding exactly where and why simpler descriptors like Coulomb matrix eigenvalues fail helps us design the next generation of graph neural networks and transformer-based architectures that power today&rsquo;s scientific breakthroughs.</p>
<p>The data pipeline that generated the datasets used in this analysis is available at the <a href="/projects/isomer-dataset-generation/">Synthetic Isomer Data Generation Pipeline project page</a>.</p>
]]></content:encoded></item><item><title>Coulomb Matrices for Molecular Machine Learning</title><link>https://hunterheidenreich.com/posts/molecular-descriptor-coulomb-matrix/</link><pubDate>Sat, 10 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/molecular-descriptor-coulomb-matrix/</guid><description>Learn how Coulomb matrices encode 3D molecular structure for machine learning from basic theory to Python implementation and practical limitations.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>When working with machine learning in chemistry, one of the first challenges you encounter is how to represent molecules in a way that algorithms can understand. You can&rsquo;t just feed raw atomic coordinates into a model. The representation needs to be invariant to rotation, translation, and atom ordering, since these operations don&rsquo;t change the molecule&rsquo;s fundamental properties.</p>
<p>The Coulomb matrix, introduced by Rupp et al. in 2012 <a href="#ref-1">[1]</a>, provides a straightforward solution to this problem. While newer methods have largely superseded it for practical applications, the Coulomb matrix remains an excellent starting point for understanding how molecular descriptors work.</p>
<p>The key insight is simple: we encode pairwise relationships between atoms in a way that captures the essential physics while maintaining the required invariances.</p>
<h2 id="the-coulomb-matrix-theory-and-intuition">The Coulomb Matrix: Theory and Intuition</h2>
<p>The Coulomb matrix encodes molecular structure in a symmetric $N \times N$ matrix, where $N$ is the number of atoms. Each element $C_{ij}$ is defined as:</p>
<p>$$
C_{ij} = \begin{cases} 0.5 Z_i^{2.4} &amp; \text{if } i = j, \\ \frac{Z_i Z_j}{|\mathbf{R}_i - \mathbf{R}_j|} &amp; \text{if } i \neq j, \end{cases}
$$</p>
<p>Here, $Z_i$ is the atomic number of atom $i$, and $\mathbf{R}_i$ is its position in 3D space. The diagonal elements ($0.5 Z_i^{2.4}$) represent atomic self-energies, obtained by fitting the energies of free atoms to their atomic numbers. The off-diagonal elements mimic Coulombic interactions between atoms. They&rsquo;re inversely proportional to distance, just like electrostatic potential energy <a href="#ref-3">[3]</a>.</p>
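<p>A from-scratch NumPy sketch of this definition makes the two cases explicit, assuming <code>Z</code> holds atomic numbers and <code>R</code> holds Cartesian coordinates:</p>

```python
# Minimal sketch of the Coulomb matrix formula above.
# Z: length-N array of atomic numbers; R: (N, 3) array of coordinates.
import numpy as np

def coulomb_matrix(Z, R):
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    # pairwise distances |R_i - R_j|
    D = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    with np.errstate(divide='ignore'):
        C = np.outer(Z, Z) / D            # off-diagonal: Z_i Z_j / |R_i - R_j|
    np.fill_diagonal(C, 0.5 * Z ** 2.4)   # diagonal: atomic self-energy
    return C

# tiny example: two hydrogens separated by one unit of distance
C = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

<p>For real work, the library implementations below handle padding and permutation options for you; this sketch is only meant to show how little machinery the formula actually needs.</p>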
<p>This construction gives us several useful properties:</p>
<ul>
<li><strong>Rotation and translation invariant</strong>: Only relative distances matter</li>
<li><strong>Symmetric</strong>: $C_{ij} = C_{ji}$, which is physically sensible</li>
<li><strong>Size-extensive</strong>: Larger molecules have larger matrix elements</li>
<li><strong>Captures 3D structure</strong>: Nearby atoms have larger interaction terms</li>
</ul>
<p>While more sophisticated methods exist today <a href="#ref-2">[2]</a>, the Coulomb matrix&rsquo;s simplicity makes it ideal for understanding the fundamentals of molecular representation.</p>
<h3 id="hands-on-example-bicyclobutane">Hands-on Example: Bicyclobutane</h3>
<p>Let&rsquo;s calculate the Coulomb matrix for <a href="https://en.wikipedia.org/wiki/Bicyclobutane">bicyclobutane</a>, a strained but stable four-membered ring system. This example will show you exactly how the theory translates to practice.</p>















<figure class="post-figure center ">
    <img src="https://upload.wikimedia.org/wikipedia/commons/b/b4/Bicyclobutane-2.svg"
         alt="Bicyclobutane"
         title="Bicyclobutane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Bicyclobutane structure (Smokefoot, Public domain, via Wikimedia Commons)</figcaption>
    
</figure>

<p>I&rsquo;ll use Python with the Atomic Simulation Environment (<code>ase</code>) for molecular structure <a href="#ref-4">[4]</a> and <code>dscribe</code> for the Coulomb matrix calculation <a href="#ref-2">[2]</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> ase.build <span style="color:#f92672">import</span> molecule
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> ase.visualize <span style="color:#f92672">import</span> view
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load the bicyclobutane structure</span>
</span></span><span style="display:flex;"><span>bicyclobutane <span style="color:#f92672">=</span> molecule(<span style="color:#e6db74">&#39;bicyclobutane&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Optional: visualize the structure</span>
</span></span><span style="display:flex;"><span>view(bicyclobutane, viewer<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;x3d&#39;</span>)
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/bicyclobutane_ase_1.webp"
         alt="Bicyclobutane 3D structure"
         title="Bicyclobutane 3D structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">3D structure of bicyclobutane</figcaption>
    
</figure>

<p>Now we calculate the Coulomb matrix using DScribe:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> dscribe.descriptors <span style="color:#f92672">import</span> CoulombMatrix
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Set up the descriptor</span>
</span></span><span style="display:flex;"><span>cm <span style="color:#f92672">=</span> CoulombMatrix(n_atoms_max<span style="color:#f92672">=</span>len(bicyclobutane))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate and reshape into matrix form</span>
</span></span><span style="display:flex;"><span>cm_bicyclobutane <span style="color:#f92672">=</span> cm<span style="color:#f92672">.</span>create(bicyclobutane)
</span></span><span style="display:flex;"><span>cm_bicyclobutane <span style="color:#f92672">=</span> cm_bicyclobutane<span style="color:#f92672">.</span>reshape(len(bicyclobutane), len(bicyclobutane))
</span></span></code></pre></div><h3 id="visualizing-the-results">Visualizing the Results</h3>
<p>The Coulomb matrix can be visualized as a heatmap. Let&rsquo;s look at both the raw matrix and its logarithm:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> matplotlib.pyplot <span style="color:#66d9ef">as</span> plt
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Raw Coulomb matrix</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>figure(figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">8</span>, <span style="color:#ae81ff">8</span>), dpi<span style="color:#f92672">=</span><span style="color:#ae81ff">150</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>imshow(cm_bicyclobutane, cmap<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;coolwarm&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>colorbar(label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Magnitude&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">&#39;Coulomb Matrix for Bicyclobutane&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>show()
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/cm_bicyclobutane.webp"
         alt="Coulomb matrix of bicyclobutane"
         title="Coulomb matrix of bicyclobutane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Coulomb matrix for bicyclobutane</figcaption>
    
</figure>

<p>The raw matrix shows clear patterns:</p>
<ul>
<li><strong>Large diagonal elements</strong>: Carbon atoms (Z=6) dominate due to their higher atomic numbers</li>
<li><strong>Smaller off-diagonal elements</strong>: Represent pairwise interactions</li>
<li><strong>Minimal hydrogen contribution</strong>: Hydrogen atoms (Z=1) have much smaller values</li>
</ul>
<p>For better visualization of the structure, the logarithm reveals more detail:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>figure(figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">8</span>, <span style="color:#ae81ff">8</span>), dpi<span style="color:#f92672">=</span><span style="color:#ae81ff">150</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>imshow(np<span style="color:#f92672">.</span>log(cm_bicyclobutane), cmap<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;coolwarm&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>colorbar(label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;log(Magnitude)&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">&#39;Log Coulomb Matrix for Bicyclobutane&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>show()
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/cm_bicyclobutane_log.webp"
         alt="Log Coulomb matrix of bicyclobutane"
         title="Log Coulomb matrix of bicyclobutane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Log-scale reveals more structural detail</figcaption>
    
</figure>

<h3 id="eigenvalue-analysis">Eigenvalue Analysis</h3>
<p>The eigenvalues of the Coulomb matrix provide another perspective on molecular structure:</p>















<figure class="post-figure center ">
    <img src="/img/cm_bicyclobutane_eigenvalues.webp"
         alt="Eigenvalues of Coulomb matrix"
         title="Eigenvalues of Coulomb matrix"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Eigenvalues of the Coulomb matrix</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/cm_bicyclobutane_log_eigenvalues.webp"
         alt="Eigenvalues of log Coulomb matrix"
         title="Eigenvalues of log Coulomb matrix"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Eigenvalues on logarithmic scale</figcaption>
    
</figure>

<p>These eigenvalues are often used as features themselves, providing a more compact representation than the full matrix.</p>
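<p>Computing this eigenvalue descriptor is a one-liner with NumPy. A small stand-in matrix is used here so the snippet runs on its own; in practice you would pass the Coulomb matrix computed above:</p>

```python
# Sketch: sorted eigenvalues of a symmetric Coulomb matrix as features.
import numpy as np

C = np.array([[0.5, 1.35],
              [1.35, 0.5]])  # stand-in for a real Coulomb matrix

# eigvalsh exploits symmetry and returns real eigenvalues (ascending)
evals = np.linalg.eigvalsh(C)

# a common convention: order by descending absolute value
evals = evals[np.argsort(-np.abs(evals))]
print(evals)
```

<p>Because eigenvalues are invariant to simultaneous row/column permutations, this vector sidesteps the atom-ordering issue discussed below, at the cost of discarding some structural information.</p>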
<h2 id="practical-limitations">Practical Limitations</h2>
<p>The Coulomb matrix has significant limitations that explain why it&rsquo;s been largely superseded by modern methods. Understanding these constraints is crucial for knowing when and how to use this descriptor.</p>
<h3 id="the-size-problem">The Size Problem</h3>
<p>Every molecule must be represented by the same size matrix, which creates several issues:</p>
<ul>
<li><strong>Padding overhead</strong>: Small molecules get padded with zeros up to the maximum size</li>
<li><strong>Quadratic scaling</strong>: An $N$-atom molecule requires $N^2$ features</li>
<li><strong>Fixed maximum size</strong>: You can&rsquo;t represent molecules larger than your preset limit</li>
<li><strong>Inefficient storage</strong>: Most elements are zero for small molecules in large matrices</li>
</ul>
<p>For a dataset ranging from 5-atom to 50-atom molecules, every molecule needs a 50x50 matrix. That&rsquo;s 2,500 features, most of which are zero for smaller molecules.</p>
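<p>The overhead is easy to quantify with a quick sketch, using a dummy all-ones matrix in place of a real 5-atom Coulomb matrix:</p>

```python
# Illustrates the padding overhead: a 5-atom molecule embedded in the
# 50x50 matrix required by a 50-atom dataset maximum.
import numpy as np

def pad_matrix(C, n_max):
    padded = np.zeros((n_max, n_max))
    padded[:C.shape[0], :C.shape[1]] = C
    return padded

C_small = np.ones((5, 5))           # stand-in for a 5-atom Coulomb matrix
P = pad_matrix(C_small, 50)
print(P.size, np.count_nonzero(P))  # 2500 features, only 25 carry information
```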
<h3 id="permutation-sensitivity">Permutation Sensitivity</h3>
<p>Despite being called &ldquo;invariant,&rdquo; the Coulomb matrix can actually change if you reorder the atoms in your input file. The standard solution is to sort atoms by the L2 norm of their matrix rows, but this introduces its own problems:</p>
<ul>
<li><strong>Symmetry breaking</strong>: Equivalent atoms might be ordered differently</li>
<li><strong>Numerical instability</strong>: Small coordinate changes can flip the ordering</li>
<li><strong>Loss of chemical intuition</strong>: The sorted order doesn&rsquo;t reflect meaningful chemistry</li>
</ul>
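<p>The row-norm sort itself is straightforward. A sketch, noting that rows and columns must be permuted together to keep the matrix symmetric:</p>

```python
# Sketch of the standard fix: reorder atoms by descending L2 row norm.
import numpy as np

def sort_coulomb_matrix(C):
    order = np.argsort(-np.linalg.norm(C, axis=1))
    return C[np.ix_(order, order)]

C = np.array([[1.0, 2.0, 0.5],
              [2.0, 9.0, 1.0],
              [0.5, 1.0, 4.0]])
perm = [2, 0, 1]                    # the same molecule, atoms reordered
C_perm = C[np.ix_(perm, perm)]

# both orderings map to the same canonical matrix
print(np.allclose(sort_coulomb_matrix(C), sort_coulomb_matrix(C_perm)))
```

<p>The instability mentioned above shows up when two rows have nearly equal norms: a tiny coordinate change can swap their order and discontinuously change the descriptor.</p>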
<p>Interestingly, some studies suggest that adding controlled noise to create multiple permutations can actually improve machine learning performance <a href="#ref-5">[5]</a>.</p>
<h3 id="limited-scope">Limited Scope</h3>
<p>The Coulomb matrix works well only for specific types of systems:</p>
<ul>
<li><strong>Small molecules</strong>: Performance degrades for large systems due to size scaling</li>
<li><strong>Gas-phase</strong>: Not suitable for periodic systems like crystals or surfaces</li>
<li><strong>Single conformations</strong>: Each 3D structure gets its own matrix</li>
<li><strong>Non-reactive</strong>: Doesn&rsquo;t capture bond-breaking or formation</li>
</ul>
<p>For periodic systems, you&rsquo;d need specialized variants like the Ewald sum matrix <a href="#ref-6">[6]</a>.</p>
<h2 id="why-learn-it-anyway">Why Learn It Anyway?</h2>
<p>Given these limitations, why spend time understanding the Coulomb matrix? Several reasons:</p>
<p><strong>Educational value</strong>: It&rsquo;s conceptually straightforward and provides excellent intuition for how molecular descriptors work. The mathematical formulation is simple enough to implement from scratch.</p>
<p><strong>Historical importance</strong>: Many subsequent methods build on ideas first explored with Coulomb matrices. Understanding this foundation helps you appreciate why newer methods were developed.</p>
<p><strong>Benchmarking</strong>: It remains useful as a baseline method for comparing new descriptors on small molecular datasets.</p>
<p><strong>Proof of concept</strong>: For exploratory work on small, well-defined datasets, the Coulomb matrix can still provide quick insights.</p>
<p>If you&rsquo;re working on practical problems with larger datasets or diverse molecular sizes, consider modern alternatives like graph neural networks, descriptors from DScribe&rsquo;s extended library, or learned representations from transformer models.</p>
<h2 id="putting-it-in-context">Putting It in Context</h2>
<p>To see the Coulomb matrix applied to real problems, I&rsquo;ve written a detailed guide using it for molecular classification:</p>
<ul>
<li><a href="/posts/alkane-constitutional-isomer-classification/">Coulomb Matrix Eigenvalues: Can You Hear the Shape of a Molecule?</a>: A comprehensive analysis of alkane isomers, from unsupervised clustering limits to supervised classification successes.</li>
</ul>
<p>For comparison with modern approaches, check out my post on <a href="/posts/geom-conformer-generation-dataset/">3D conformer generation with the GEOM dataset</a>, which showcases more sophisticated molecular representations. For technical specifications and benchmarks, see the <a href="/notes/computational-chemistry/datasets/geom/">GEOM dataset card</a>.</p>
<p>The Coulomb matrix may be dated, but it remains an excellent entry point into the world of molecular machine learning. Once you understand its strengths and limitations, you&rsquo;ll be better equipped to appreciate why the field has moved toward more sophisticated approaches.</p>
<hr>
<p><em>Have questions about molecular descriptors or want to discuss other approaches to molecular machine learning? I&rsquo;d be happy to explore these topics further.</em></p>
<h2 id="references">References</h2>
<ul>
<li><a id="ref-1"></a>[1] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, &ldquo;Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning,&rdquo; Physical Review Letters, 108(5), 058301 (2012). <a href="https://doi.org/10.1103/PhysRevLett.108.058301">https://doi.org/10.1103/PhysRevLett.108.058301</a> <a href="https://arxiv.org/abs/1109.2618">arXiv:1109.2618</a></li>
<li><a id="ref-2"></a>[2] L. Himanen, M. O. J. Jäger, E. V. Morooka, F. F. Canova, Y. S. Ranawat, D. Z. Gao, P. Rinke, and A. S. Foster, &ldquo;DScribe: Library of descriptors for machine learning in materials science,&rdquo; Computer Physics Communications, 247, 106949 (2020). <a href="https://doi.org/10.1016/j.cpc.2019.106949">https://doi.org/10.1016/j.cpc.2019.106949</a> <a href="https://arxiv.org/abs/1904.08875">arXiv:1904.08875</a></li>
<li><a id="ref-3"></a>[3] J. Schrier, &ldquo;Can one hear the shape of a molecule (from its Coulomb matrix eigenvalues)?,&rdquo; Journal of Chemical Information and Modeling, 60(8), 3804-3811 (2020). <a href="https://doi.org/10.1021/acs.jcim.0c00631">https://doi.org/10.1021/acs.jcim.0c00631</a></li>
<li><a id="ref-4"></a>[4] A. H. Larsen, J. J. Mortensen, J. Blomqvist, I. E. Castelli, R. Christensen, M. Dułak, J. Friis, M. N. Groves, B. Hammer, C. Hargus, E. D. Hermes, P. C. Jennings, P. B. Jensen, J. Kermode, J. R. Kitchin, E. L. Kolsbjerg, J. Kubal, K. Kaasbjerg, S. Lysgaard, J. B. Maronsson, T. Maxson, T. Olsen, L. Pastewka, A. Peterson, C. Rostgaard, J. Schiøtz, O. Schütt, M. Strange, K. S. Thygesen, T. Vegge, L. Vilhelmsen, M. Walter, Z. Zeng, and K. W. Jacobsen, &ldquo;The Atomic Simulation Environment - A Python library for working with atoms,&rdquo; J. Phys.: Condens. Matter, 29, 273002 (2017). <a href="https://doi.org/10.1088/1361-648X/aa680e">https://doi.org/10.1088/1361-648X/aa680e</a> <a href="https://ase-lib.org/index.html">documentation</a></li>
<li><a id="ref-5"></a>[5] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, A. Lilienfeld, and K.-R. Müller, &ldquo;Learning invariant representations of molecules for atomization energy prediction,&rdquo; Advances in Neural Information Processing Systems, 25 (2012). Available online: <a href="https://proceedings.neurips.cc/paper_files/paper/2012/file/115f89503138416a242f40fb7d7f338e-Paper.pdf">https://proceedings.neurips.cc/paper_files/paper/2012/file/115f89503138416a242f40fb7d7f338e-Paper.pdf</a></li>
<li><a id="ref-6"></a>[6] F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, &ldquo;Crystal structure representations for machine learning models of formation energies,&rdquo; International Journal of Quantum Chemistry, 115(16), 1094-1101 (2015). <a href="https://doi.org/10.1002/qua.24917">https://doi.org/10.1002/qua.24917</a></li>
</ul>
]]></content:encoded></item><item><title>SELFIES and the Future of Molecular String Representations</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/selfies-2022/</link><pubDate>Fri, 14 Oct 2022 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/selfies-2022/</guid><description>Perspective on SELFIES as a 100% robust SMILES alternative, with 16 future research directions for molecular AI.</description><content:encoded><![CDATA[<h2 id="position-a-roadmap-for-robust-chemical-languages">Position: A Roadmap for Robust Chemical Languages</h2>
<p>This is a <strong>Position</strong> paper (perspective) that proposes a research agenda for molecular representations in AI. It reviews the evolution of chemical notation over 250 years and argues for extending SELFIES-style robust representations beyond traditional organic chemistry into polymers, crystals, reactions, and other complex chemical systems.</p>
<h2 id="the-generative-bottleneck-in-traditional-representations">The Generative Bottleneck in Traditional Representations</h2>
<p>While SMILES has been the standard molecular representation since 1988, its fundamental weakness for machine learning is well-established: randomly generated SMILES strings are often invalid. The motivation is twofold:</p>
<ol>
<li><strong>Current problem</strong>: Traditional representations (SMILES, <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a>, DeepSMILES) lack 100% robustness; random mutations or generations can produce invalid strings, limiting their use in generative AI models.</li>
<li><strong>Future opportunity</strong>: SELFIES solved this for small organic molecules, but many important chemical domains (polymers, crystals, reactions) still lack robust representations, creating a bottleneck for AI-driven discovery in these areas.</li>
</ol>
<h2 id="16-concrete-research-directions-for-selfies">16 Concrete Research Directions for SELFIES</h2>
<p>The novelty is in the comprehensive research roadmap. The authors propose 16 concrete research projects organized around key themes:</p>
<ul>
<li><strong>Domain extension</strong>: Includes metaSELFIES for learning graph rules directly from data, BigSELFIES for stochastic polymers, and crystal structures via labeled quotient graphs.</li>
<li><strong>Chemical reactions</strong>: Robust reaction representations that enforce conservation laws.</li>
<li><strong>Programming perspective</strong>: Treating molecular representations as programming languages, potentially achieving Turing-completeness.</li>
<li><strong>Benchmarking</strong>: Systematic comparisons across representation formats.</li>
<li><strong>Interpretability</strong>: Understanding how humans and machines actually learn from different representations.</li>
</ul>
<h2 id="evidence-from-generative-case-studies">Evidence from Generative Case Studies</h2>
<p>This perspective paper includes case studies:</p>
<ol>
<li>
<p><strong>Pasithea (Deep Molecular Dreaming)</strong>: A generative model using gradient descent in latent space to optimize molecular properties (logP). When using one-hot encoded SELFIES, the model successfully optimizes logP monotonically, demonstrating smooth traversal of chemical space impossible with SMILES.</p>
</li>
<li>
<p><strong>STOUT and DECIMER</strong>: Tools for IUPAC-name-to-structure and image-to-structure conversion. Both show improved accuracy when using SELFIES as an intermediate representation, even when the final output is SMILES, suggesting SELFIES provides a more learnable internal representation for sequence-to-sequence models.</p>
</li>
</ol>
<h2 id="strategic-outcomes-and-future-vision">Strategic Outcomes and Future Vision</h2>
<p>The paper establishes robust representations as a fundamental bottleneck in computational chemistry and proposes a clear path forward:</p>
<p><strong>Key outcomes</strong>:</p>
<ul>
<li>Identification of 16 concrete research projects spanning domain extension, benchmarking, and interpretability</li>
<li>Evidence that SELFIES enables capabilities (like smooth property optimization) impossible with traditional formats</li>
<li>Framework for thinking about molecular representations as programming languages</li>
</ul>
<p><strong>Strategic impact</strong>: The proposed extensions could unlock new capabilities across drug discovery (efficient exploration beyond small molecules), materials design (systematic crystal structure discovery), synthesis planning (better reaction representations), and fundamental research (new ways to understand chemical behavior).</p>
<p><strong>Future vision</strong>: The authors emphasize that robust representations could become a bridge for bidirectional learning between humans and machines, enabling humans to learn new chemical concepts from AI systems.</p>
<h2 id="the-mechanism-of-robustness">The Mechanism of Robustness</h2>
<p>The key difference between SELFIES and other representations lies in how they handle syntax:</p>
<ul>
<li><strong>SMILES/DeepSMILES</strong>: Rely on non-local markers (opening/closing parentheses or ring numbers) that must be balanced. A mutation or random generation can easily break this balance, producing invalid strings.</li>
<li><strong>SELFIES</strong>: Uses a formal grammar (automaton) where derivation rules are entirely local. The critical innovation is <strong>overloading</strong>: a state-modifying symbol like <code>[Branch1]</code> starts a branch and changes the interpretation of the <em>next</em> symbol to represent a numerical parameter (the branch length).</li>
</ul>
<p>This overloading mechanism ensures that any arbitrary sequence of SELFIES tokens can be parsed into a valid molecular graph. The derivation can never fail because every symbol either adds an atom or modifies how subsequent symbols are interpreted.</p>
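<p>A toy grammar makes the overloading idea concrete. The token names and the index-based length encoding below are a hypothetical mini-grammar, not the real SELFIES specification, but they show why derivation can never fail: every symbol from the alphabet either emits an atom or reinterprets the next symbol as a number, and a truncated branch simply ends early instead of raising an error.</p>

```python
# Hypothetical mini-grammar illustrating SELFIES-style "overloading".
# Any token sequence drawn from ALPHABET decodes into SOME valid
# nested structure; decoding cannot fail.
ALPHABET = ["[C]", "[N]", "[O]", "[F]", "[Branch]"]

def decode(tokens):
    """Decode an arbitrary token list into a nested (atom, branch) structure."""
    result, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "[Branch]" and i + 1 < len(tokens):
            # Overloading: the NEXT symbol is reinterpreted as a number
            # (branch length = its position in the alphabet + 1).
            length = ALPHABET.index(tokens[i + 1]) + 1
            branch = tokens[i + 2 : i + 2 + length]  # slicing truncates safely
            result.append(("branch", decode(branch)))
            i += 2 + len(branch)
        elif tok != "[Branch]":
            result.append(("atom", tok.strip("[]")))
            i += 1
        else:
            i += 1  # trailing [Branch] with no parameter: ignored, not an error
    return result

print(decode(["[C]", "[Branch]", "[N]", "[O]", "[F]", "[C]"]))
# [('atom', 'C'), ('branch', [('atom', 'O'), ('atom', 'F')]), ('atom', 'C')]
```

<p>Contrast this with SMILES, where a mutated string like <code>CC(O</code> has an unmatched parenthesis and no valid interpretation at all.</p>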
<h2 id="the-16-research-projects-technical-details">The 16 Research Projects: Technical Details</h2>
<p>This section provides technical details on the proposed research directions:</p>
<h3 id="extending-to-new-domains">Extending to New Domains</h3>
<p><strong>metaSELFIES (Project 1)</strong>: The authors propose learning graph construction rules automatically from data. This could enable robust representations for any graph-based system, from quantum optics to biological networks, without needing domain-specific expertise.</p>
<p><strong>Token Optimization (Project 2)</strong>: SELFIES uses &ldquo;overloading&rdquo; where a symbol&rsquo;s meaning changes based on context. This project would investigate how this affects machine learning performance and whether the approach can be optimized.</p>
<h3 id="handling-complex-molecular-systems">Handling Complex Molecular Systems</h3>
<p><strong>BigSELFIES (Project 3)</strong>: Current representations struggle with large, often stochastic structures like polymers and biomolecules. BigSELFIES would combine hierarchical notation with stochastic building blocks to handle these complex systems where traditional small-molecule representations break down.</p>
<p><strong>Crystal Structures (Projects 4-5)</strong>: Crystals present unique challenges due to their infinite, periodic arrangements. An infinite net cannot be represented by a finite string directly. The proposed approach uses <strong>labeled quotient graphs (LQGs)</strong>, which are finite graphs that uniquely determine a periodic net. However, current SELFIES cannot represent LQGs because they lack symbols for edge directions and edge labels (vector shifts encoding periodicity). Extending SELFIES to handle these structures could enable AI-driven materials design without relying on predefined crystal structures, unlocking systematic exploration of theoretical materials space.</p>
<p><strong>Beyond Organic Chemistry (Project 6)</strong>: Transition metals and main-group compounds feature complex bonding that breaks the simple two-center, two-electron model. The proposed solution is to learn these complex bonding rules automatically by applying machine learning to large structural databases.</p>
<h3 id="chemical-reactions-and-programming-concepts">Chemical Reactions and Programming Concepts</h3>
<p><strong>Reaction Representations (Project 7)</strong>: Moving beyond static molecules to represent chemical transformations. A robust reaction format would enforce conservation laws and could learn reactivity patterns from large reaction datasets, potentially revolutionizing synthesis planning.</p>
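<p>The simplest conservation law such a format would enforce is atom balance. The sketch below checks it by counting element symbols on each side of a reaction; the formula parser is a deliberately simplified illustration (no parentheses, charges, or isotopes), not part of any proposed representation.</p>

```python
# Sketch of a conservation-law check for reactions: atoms on the reactant
# side must equal atoms on the product side. Simplified formula parser.
import re
from collections import Counter

def atom_counts(formula, coeff=1):
    """Count atoms in a simple formula like 'H2O' (no parentheses)."""
    counts = Counter()
    for element, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += coeff * (int(num) if num else 1)
    return counts

def is_balanced(reactants, products):
    """Check atom conservation for lists of (coefficient, formula) pairs."""
    def total(side):
        return sum((atom_counts(f, c) for c, f in side), Counter())
    return total(reactants) == total(products)

# CH4 + 2 O2 -> CO2 + 2 H2O
print(is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")]))  # True
```

<p>A robust reaction representation would bake this kind of invariant into the grammar itself, so that no decodable string could violate it, rather than validating it after the fact.</p>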
<h3 id="developing-a-100-robust-programming-language">Developing a 100% Robust Programming Language</h3>
<p><strong>Programming Language Perspective (Projects 8-9)</strong>: An intriguing reframing views molecular representations as programming languages executed by chemical parsers. This opens possibilities for adding loops, logic, and other programming concepts to efficiently describe complex structures. The ambitious goal is a Turing-complete programming language that is also 100% robust. While fascinating, it is worth critically noting that enforcing 100% syntactical robustness inherently restricts grammar flexibility. Can a purely robust string representation realistically describe highly fuzzy, delocalized electron bonds (like in Project 6) without becoming impractically long or collapsing into specialized sub-languages?</p>
<p><strong>Empirical Comparisons (Projects 10-11)</strong>: With multiple representation options (strings, matrices, images), we need systematic comparisons. The proposed benchmarks would go beyond simple validity metrics to focus on real-world design objectives in drug discovery, catalysis, and materials science.</p>
<p><strong>Human Readability (Project 12)</strong>: While SMILES is often called &ldquo;human-readable,&rdquo; this claim lacks scientific validation. The proposed study would test how well humans actually understand different molecular representations.</p>
<p><strong>Machine Learning Perspectives (Projects 13-16)</strong>: These projects explore how machines interpret molecular representations:</p>
<ul>
<li>Training networks to translate between formats to find universal representations</li>
<li>Comparing learning efficiency across different formats</li>
<li>Investigating latent space smoothness in generative models</li>
<li>Visualizing what models actually learn about molecular structure</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>Since this is a position paper outlining future research directions, standard empirical reproducibility metrics do not apply. However, the foundational tools required to pursue the proposed roadmap are open-source:</p>
<ul>
<li><strong>Code Availability</strong>: The core SELFIES framework is fully open-source and actively maintained by the Aspuru-Guzik group. The repository can be found at <a href="https://github.com/aspuru-guzik-group/selfies">aspuru-guzik-group/selfies</a>.</li>
<li><strong>Installation</strong>: The library is readily available via PyPI: <code>pip install selfies</code>.</li>
<li><strong>Paper Accessibility</strong>: While published in <em>Patterns</em>, the paper is also freely available as a preprint via <a href="https://arxiv.org/abs/2204.00056">arXiv:2204.00056</a>.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., Friederich, P., Gaudin, T., Gayle, A. A., Jablonka, K. M., Lameiro, R. F., Lemm, D., Lo, A., Moosavi, S. M., Nápoles-Duarte, J. M., Nigam, A., Pollice, R., Rajan, K., Schatzschneider, U., &hellip; Aspuru-Guzik, A. (2022). SELFIES and the future of molecular string representations. <em>Patterns</em>, <em>3</em>(10). <a href="https://doi.org/10.1016/j.patter.2022.100588">https://doi.org/10.1016/j.patter.2022.100588</a></p>
<p><strong>Publication</strong>: Patterns 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Krenn2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{SELFIES and the future of molecular string representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ISSN</span> = <span style="color:#e6db74">{2666-3899}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{http://dx.doi.org/10.1016/j.patter.2022.100588}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">DOI</span> = <span style="color:#e6db74">{10.1016/j.patter.2022.100588}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Patterns}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Elsevier BV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krenn, Mario and Ai, Qianxiang and Barthel, Senja and Carson, Nessa and Frei, Angelo and Frey, Nathan C. and Friederich, Pascal and Gaudin, Théophile and Gayle, Alberto Alexander and Jablonka, Kevin Maik and Lameiro, Rafael F. and Lemm, Dominik and Lo, Alston and Moosavi, Seyed Mohamad and Nápoles-Duarte, José Manuel and Nigam, AkshatKumar and Pollice, Robert and Rajan, Kohulan and Schatzschneider, Ulrich and Schwaller, Philippe and Skreta, Marta and Smit, Berend and Strieth-Kalthoff, Felix and Sun, Chong and Tom, Gary and Young, Adamo and Yu, Rose and Aspuru-Guzik, Alán}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{100588}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/">Original SELFIES Paper</a></li>
<li><a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES Overview</a></li>
</ul>
]]></content:encoded></item></channel></rss>