<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Application Papers: Transferring Methods to New Domains on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/paper-types/application/</link><description>Recent content in Application Papers: Transferring Methods to New Domains on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 11 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/paper-types/application/index.xml" rel="self" type="application/rss+xml"/><item><title>Fine-Tuning GPT-3 for Predictive Chemistry Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/</guid><description>Fine-tuned GPT-3 matches or outperforms specialized ML models on molecular, materials, and reaction property prediction, especially in low-data regimes.</description><content:encoded><![CDATA[<h2 id="gpt-3-as-a-general-purpose-chemistry-predictor">GPT-3 as a General-Purpose Chemistry Predictor</h2>
<p>This is an <strong>Empirical</strong> paper that systematically benchmarks fine-tuned GPT-3 against dedicated machine learning models across 15 chemistry and materials science prediction tasks. The primary contribution is demonstrating that a general-purpose large language model, with no chemistry-specific architecture or featurization, can match or outperform specialized ML approaches, particularly when training data is limited. The paper also demonstrates inverse molecular design through simple prompt inversion.</p>
<h2 id="why-general-purpose-llms-for-chemistry">Why General-Purpose LLMs for Chemistry</h2>
<p>Machine learning in chemistry typically requires domain-specific feature engineering: molecular fingerprints, graph neural network architectures, or hand-crafted descriptors tailored to each application. Developing these approaches demands specialized expertise and significant effort for each new problem. The small datasets common in experimental chemistry further complicate matters, as many sophisticated ML approaches require large training sets to learn meaningful representations.</p>
<p>Large language models like GPT-3, trained on vast internet text corpora, had shown surprising capability at tasks they were not explicitly trained for. The key question motivating this work was whether these general-purpose models could also answer scientific questions for which we lack answers, given that most chemistry problems can be represented in text form. For example: &ldquo;If I change the metal in my <a href="https://en.wikipedia.org/wiki/Metal%E2%80%93organic_framework">metal-organic framework</a>, will it be stable in water?&rdquo;</p>
<p>Prior chemical language models (e.g., <a href="/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/">Transformer-CNN</a>, <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a>, <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) were pre-trained on chemistry-specific corpora. In contrast, this work investigates models trained primarily on general internet text, examining whether the implicit chemical knowledge encoded during pre-training, combined with task-specific fine-tuning, can substitute for explicit chemical featurization.</p>
<h2 id="language-interfaced-fine-tuning-for-chemistry">Language-Interfaced Fine-Tuning for Chemistry</h2>
<p>The core innovation is &ldquo;language-interfaced fine-tuning&rdquo; (LIFT): reformulating chemistry prediction tasks as natural language question-answering. Training examples take the form of question-completion pairs, where questions describe the chemical system in text and completions provide the target property. For example:</p>
<ul>
<li><strong>Classification</strong>: &ldquo;What is the phase of Co1Cu1Fe1Ni1V1?&rdquo; with completion &ldquo;0&rdquo; (multi-phase)</li>
<li><strong>Regression</strong>: Property values are rounded to a fixed precision, converting continuous prediction into a text generation problem</li>
<li><strong>Inverse design</strong>: Questions and completions are simply swapped, asking &ldquo;What is a molecule with property X?&rdquo; and expecting a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string as completion</li>
</ul>
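<p>The question-completion framing above can be sketched as follows. The helper name and the <code>###</code>/<code>@@@</code> separator strings are illustrative stand-ins, not the paper's exact prompt template:</p>

```python
def to_prompt(question: str, completion: str) -> dict:
    """Render one training example as a prompt/completion pair.

    The "###" / "@@@" separators are illustrative stand-ins for whatever
    stop sequences the fine-tuning setup actually uses.
    """
    return {"prompt": f"{question}###", "completion": f" {completion}@@@"}

# Classification: alloy phase as a one-token label
clf = to_prompt("What is the phase of Co1Cu1Fe1Ni1V1?", "0")

# Inverse design: swap the roles of property and structure
inv = to_prompt(
    "What is a photoswitch with transition wavelength 324 nm?",
    "CN1C=CC(=N1)N=NC2=CC=CC=C2",
)
```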
<p>The fine-tuning uses OpenAI&rsquo;s API with the smallest <code>ada</code> variant of GPT-3, with uniform hyperparameters across all tasks (8 epochs, learning rate multiplier of 0.02). No optimization of prompt structure, tokenization, or training schedule was performed, making the approach deliberately simple.</p>
<p>For regression, since language models generate discrete tokens rather than continuous values, the authors round target values to a fixed precision (e.g., 1% for Henry coefficients). This converts regression into a form of classification over numeric strings, with the assumption that GPT-3 can interpolate between these discretized values.</p>
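<p>A minimal sketch of this encode/decode round trip; the precision and number formatting here are illustrative, not the paper's exact scheme:</p>

```python
def encode_target(value: float, ndigits: int = 2) -> str:
    """Round a continuous label to a fixed precision and render it as a
    short numeric string, turning regression into text generation."""
    return f"{round(value, ndigits):.{ndigits}f}"

def decode_target(completion: str) -> float:
    """Parse the model's generated completion back into a float for scoring."""
    return float(completion.strip())
```

The quantization error introduced by the rounding step is bounded by half the chosen precision, which is the price paid for treating regression as generation over numeric strings.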
<p>The approach also extends to open-source models. The authors demonstrate that GPT-J-6B can be fine-tuned using parameter-efficient techniques (LoRA, 8-bit quantization) on consumer hardware, and provide the <code>chemlift</code> Python package for this purpose.</p>
<h2 id="benchmarks-across-molecules-materials-and-reactions">Benchmarks Across Molecules, Materials, and Reactions</h2>
<h3 id="datasets-and-tasks">Datasets and Tasks</h3>
<p>The evaluation spans three chemical domains with 15 total benchmarks:</p>
<p><strong>Molecules:</strong></p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Photoswitch">Photoswitch</a> transition wavelength prediction (2022)</li>
<li>Free energy of solvation (FreeSolv, 2014)</li>
<li>Aqueous solubility (ESOL, 2004)</li>
<li>Lipophilicity (ChEMBL, 2012)</li>
<li><a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO-LUMO gap</a> (QMugs, 2022)</li>
<li><a href="https://en.wikipedia.org/wiki/Organic_solar_cell">Organic photovoltaic</a> power conversion efficiency (2018)</li>
</ul>
<p><strong>Materials:</strong></p>
<ul>
<li>Coarse-grained surfactant adsorption free energy (2021)</li>
<li>CO2 and CH4 <a href="https://en.wikipedia.org/wiki/Henry%27s_law">Henry coefficients</a> in MOFs (2020)</li>
<li>MOF heat capacity (2022)</li>
<li><a href="https://en.wikipedia.org/wiki/High-entropy_alloy">High-entropy alloy</a> phase prediction (2020)</li>
<li><a href="https://en.wikipedia.org/wiki/Amorphous_metal">Bulk metallic glass</a> formation ability (2006)</li>
<li>Metallic behavior prediction (2018)</li>
</ul>
<p><strong>Reactions:</strong></p>
<ul>
<li>C-N cross-coupling yield (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a>, 2018)</li>
<li>C-C cross-coupling yield (<a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki</a>, 2022)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>The baselines include both traditional ML and deep learning approaches:</p>
<ul>
<li><strong>Non-DL</strong>: XGBoost with molecular descriptors/fragprints, Gaussian Process Regression (GPR), random forests, n-Gram models, Automatminer, differential reaction fingerprints (DRFP)</li>
<li><strong>Deep learning</strong>: MolCLR, ModNet, CrabNet, TabPFN</li>
</ul>
<h3 id="data-efficiency-analysis">Data Efficiency Analysis</h3>
<p>To compare data efficiency, the authors fit power-law curves to the learning curves of all models and measure the &ldquo;data efficiency factor&rdquo;: how much more (or less) data the best baseline needs to match GPT-3&rsquo;s performance in the low-data regime.</p>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>Benchmark</th>
          <th>Data Efficiency vs. Non-DL</th>
          <th>vs. DL Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecules</td>
          <td>Photoswitch wavelength</td>
          <td>1.1x (n-Gram)</td>
          <td>1.2x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Solvation free energy</td>
          <td>3.1x (GPR)</td>
          <td>1.3x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Solubility</td>
          <td>1.0x (XGBoost)</td>
          <td>0.002x (MolCLR)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Lipophilicity</td>
          <td>3.43x (GPR)</td>
          <td>0.97x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>HOMO-LUMO gap</td>
          <td>4.3x (XGBoost)</td>
          <td>0.62x (TabPFN)</td>
      </tr>
      <tr>
          <td>Materials</td>
          <td>HEA phase</td>
          <td>24x (RF)</td>
          <td>9.0x (CrabNet)</td>
      </tr>
      <tr>
          <td>Materials</td>
          <td>CO2 Henry coeff.</td>
          <td>0.40x (XGBoost)</td>
          <td>12x (TabPFN)</td>
      </tr>
      <tr>
          <td>Reactions</td>
          <td>C-N cross-coupling</td>
          <td>2.9x (DRFP)</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Values &gt;1 indicate GPT-3 is more data-efficient. For the HEA phase prediction task, GPT-3 achieved comparable accuracy to a random forest model trained on 1,126 data points using only about 50 training examples.</p>
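<p>Assuming each learning curve follows a power law $\text{score} = a \cdot n^{b}$, the data efficiency factor can be computed by inverting the baseline's fitted curve. This log-log least-squares sketch is a generic reconstruction, not the authors' exact fitting code:</p>

```python
import numpy as np

def fit_power_law(n_train, scores):
    """Fit score = a * n^b via least squares in log-log space."""
    b, log_a = np.polyfit(np.log(n_train), np.log(scores), 1)
    return np.exp(log_a), b

def data_efficiency_factor(gpt3_curve, baseline_curve, n_ref):
    """How many times more data the baseline needs to match GPT-3's
    score at n_ref training points (values > 1 favor GPT-3)."""
    a_g, b_g = fit_power_law(*gpt3_curve)
    a_b, b_b = fit_power_law(*baseline_curve)
    target = a_g * n_ref ** b_g               # GPT-3 score at n_ref
    n_needed = (target / a_b) ** (1.0 / b_b)  # invert the baseline curve
    return n_needed / n_ref
```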
<h3 id="representation-sensitivity">Representation Sensitivity</h3>
<p>An important finding is that GPT-3 performs well regardless of molecular representation format. The authors tested IUPAC names, SMILES, and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, finding good results across all representations. IUPAC names often produced the best performance, which is notable because it makes the approach accessible to non-specialists who can simply use chemical names rather than learning specialized encodings.</p>
<h3 id="inverse-design">Inverse Design</h3>
<p>For inverse design, the authors fine-tuned GPT-3 with reversed question-completion pairs. On photoswitches:</p>
<ul>
<li>Generated molecules include both training set members and novel structures (some not in PubChem)</li>
<li>Transition wavelengths matched target values within about 10% mean absolute percentage error (validated using the GPR model from Griffiths et al.)</li>
<li>A temperature parameter controls the diversity-validity tradeoff: low temperatures produce training set copies, high temperatures produce diverse but potentially invalid structures</li>
<li>Across all temperatures, generated molecules showed low synthetic accessibility (SA) scores, suggesting synthesizability</li>
</ul>
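<p>The temperature mechanism behind the diversity-validity tradeoff is standard temperature-scaled sampling over token logits, sketched here in isolation (the logits themselves would come from the fine-tuned model):</p>

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Sample one token index from temperature-scaled logits.

    Low temperature sharpens toward greedy decoding (training-set copies);
    high temperature flattens the distribution (more diverse generations,
    but more likely to yield invalid SMILES).
    """
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    z -= z.max()              # numerical stability before exponentiation
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```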
<p>The authors also demonstrated iterative inverse design for HOMO-LUMO gap optimization: starting from QMugs data, they iteratively fine-tuned GPT-3 to generate molecules with progressively larger bandgaps (&gt;5 eV), successfully shifting the distribution over four generations. This worked even when extrapolating beyond the training distribution (e.g., training only on molecules with gaps &lt;3.5 eV, then generating molecules with gaps &gt;4.0 eV).</p>
<h3 id="coarse-grained-polymer-design">Coarse-Grained Polymer Design</h3>
<p>A striking test involved coarse-grained dispersant polymers with four monomer types and chain lengths of 16-48 units. GPT-3 had no prior knowledge of these abstract representations, yet it outperformed dedicated models for adsorption free energy prediction and successfully performed inverse design, generating monomer sequences with a mean percentage error of about 22% for the desired property.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>
<p><strong>Low-data advantage</strong>: Fine-tuned GPT-3 consistently shows the largest advantages over conventional ML in low-data regimes (tens to hundreds of data points), which is precisely where experimental chemistry datasets typically fall.</p>
</li>
<li>
<p><strong>Representation agnostic</strong>: The model works with IUPAC names, SMILES, SELFIES, and even invented abstract representations, removing the need for chemistry-specific tokenization.</p>
</li>
<li>
<p><strong>No feature engineering</strong>: The approach requires no domain-specific descriptors, fingerprints, or architectural modifications, making it accessible to researchers without ML expertise.</p>
</li>
<li>
<p><strong>Bidirectional design</strong>: Inverse design is achieved by simply reversing the question format, with no architectural changes or separate generative model needed.</p>
</li>
<li>
<p><strong>Extrapolation capability</strong>: The model can generate molecules with properties outside the training distribution, as demonstrated by the HOMO-LUMO gap extrapolation experiments.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>In the <strong>high-data regime</strong>, conventional ML models with chemistry-specific features often catch up to or surpass GPT-3, as the inductive biases encoded in GPT-3 become less necessary with sufficient data.</li>
<li><strong>Regression</strong> is inherently limited by the discretization of continuous values into tokens. This requires more data than classification and introduces quantization error.</li>
<li>The approach relies on the <strong>OpenAI API</strong>, introducing cost and reproducibility concerns (model versions may change). The authors partially address this by providing open-source alternatives via <code>chemlift</code>.</li>
<li>The authors acknowledge that <strong>identified correlations may not represent causal relationships</strong>. GPT-3 finding predictive patterns does not guarantee that the patterns are chemically meaningful.</li>
<li>No optimization of prompts, tokenization, or hyperparameters was performed, suggesting room for improvement but also making it difficult to assess the ceiling of this approach.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All datasets are publicly available and were obtained from published benchmarks.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification</td>
          <td>HEA phase (Pei et al.)</td>
          <td>1,252 alloys</td>
          <td>Single-phase vs. multi-phase</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>643 molecules</td>
          <td>Hydration free energies</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128 molecules</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QMugs</td>
          <td>665,000 molecules</td>
          <td>HOMO-LUMO gaps via GFN2-xTB</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Lipophilicity (ChEMBL)</td>
          <td>Varies</td>
          <td>LogP classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>OPV PCE</td>
          <td>Varies</td>
          <td>Organic photovoltaic efficiency</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MOF Henry coefficients</td>
          <td>Varies</td>
          <td>CO2/CH4 adsorption</td>
      </tr>
      <tr>
          <td>Inverse design</td>
          <td>Photoswitches (Griffiths et al.)</td>
          <td>392 molecules</td>
          <td>Transition wavelengths</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fine-tuning via OpenAI API: 8 epochs, learning rate multiplier 0.02</li>
<li>GPT-3 <code>ada</code> variant (smallest model) used for all main results</li>
<li>In-context learning also tested with larger GPT-3 models and GPT-4</li>
<li>Open-source alternative: GPT-J-6B with LoRA + 8-bit quantization</li>
<li>Learning curves fit to power laws of the form $-a \exp(-bx) + c$ for data efficiency comparison</li>
<li>Validity checked using RDKit via <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>&rsquo;s <code>is_valid</code> method</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-3 ada (OpenAI API, proprietary)</li>
<li>GPT-J-6B (open-source, fine-tunable on consumer hardware)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>HEA phase</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>$F_1$ macro</td>
          <td>All classification tasks</td>
          <td>Class-balanced</td>
      </tr>
      <tr>
          <td>Cohen&rsquo;s $\kappa$</td>
          <td>Classification</td>
          <td>Used for learning curve thresholds</td>
      </tr>
      <tr>
          <td>MAE / MAPE</td>
          <td>Regression, inverse design</td>
          <td>Property prediction accuracy</td>
      </tr>
      <tr>
          <td>Validity rate</td>
          <td>Inverse design</td>
          <td>Fraction of parseable SMILES</td>
      </tr>
      <tr>
          <td>Frechet ChemNet distance</td>
          <td>Inverse design</td>
          <td>Distribution similarity</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Inverse design</td>
          <td>Synthetic accessibility</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Fine-tuning via OpenAI API (cloud compute, not user-specified)</li>
<li>Open-source experiments: consumer GPU hardware with 8-bit quantization</li>
<li>Quantum chemistry validation: GFN2-xTB for HOMO-LUMO calculations</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/kjappelbaum/gptchem">gptchem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>All experiments with OpenAI API</td>
      </tr>
      <tr>
          <td><a href="https://github.com/lamalab-org/chemlift">chemlift</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source LLM fine-tuning support</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.7806672">Zenodo (gptchem)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10233422">Zenodo (chemlift)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A., &amp; Smit, B. (2024). Leveraging large language models for predictive chemistry. <em>Nature Machine Intelligence</em>, 6(2), 161-169. <a href="https://doi.org/10.1038/s42256-023-00788-1">https://doi.org/10.1038/s42256-023-00788-1</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jablonka2024leveraging,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Leveraging large language models for predictive chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jablonka, Kevin Maik and Schwaller, Philippe and Ortega-Guerrero, Andres and Smit, Berend}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{161--169}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00788-1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Transfer Approaches for Seq-to-Seq Retrosynthesis</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</guid><description>Systematic comparison of joint training, self-training, and pre-training plus fine-tuning for Transformer-based retrosynthesis on USPTO-50K.</description><content:encoded><![CDATA[<h2 id="systematic-study-of-data-transfer-for-retrosynthesis">Systematic Study of Data Transfer for Retrosynthesis</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares three standard data transfer methods (joint training, self-training, and pre-training plus fine-tuning) applied to a Transformer-based sequence-to-sequence model for single-step retrosynthesis. The primary contribution is demonstrating that pre-training on a large augmented dataset (USPTO-Full, 877K reactions) followed by fine-tuning on the smaller target dataset (USPTO-50K) produces substantial accuracy improvements over the baseline Transformer, achieving results competitive with or superior to contemporaneous state-of-the-art graph-based models at higher values of n-best accuracy.</p>
<h2 id="bridging-the-data-gap-in-retrosynthesis-prediction">Bridging the Data Gap in Retrosynthesis Prediction</h2>
<p><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a>, the problem of predicting reactant compounds needed to synthesize a target product, has seen rapid progress through increasingly sophisticated model architectures: <a href="/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/">LSTM seq-to-seq models</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Transformer models</a>, and graph-to-graph approaches. However, the authors identify a gap in this research trajectory. While model architecture has received extensive attention, the role of training data strategies has been largely neglected in the retrosynthesis literature.</p>
<p>The core practical problem is that high-quality supervised datasets for retrosynthesis (like USPTO-50K) tend to be small and distribution-skewed, with all samples pre-classified into ten major reaction classes. Meanwhile, larger datasets (USPTO-Full with 877K samples, USPTO-MIT with 479K samples) exist but have different distributional properties. Data transfer techniques are standard practice in computer vision, NLP, and machine translation for exactly this scenario, yet they had not been systematically evaluated for retrosynthesis at the time of this work.</p>
<p>The authors also note a contrast with Zoph et al. (2020), who found that self-training outperforms pre-training in image recognition. They hypothesize that chemical compound strings may have more universal representations than images, making pre-training more effective in the chemistry domain.</p>
<h2 id="three-data-transfer-methods-for-retrosynthesis">Three Data Transfer Methods for Retrosynthesis</h2>
<p>The paper formalizes retrosynthesis as a seq-to-seq problem where both the product $x$ and reactant set $y$ are represented as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. A retrosynthesis model defines a likelihood $p_{\mathcal{M}}(y \mid x; \theta)$ optimized via maximum log-likelihood:</p>
<p>$$
\theta^{*} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}^{T}_{\text{Train}}} \log p(y_{i} \mid x_{i})
$$</p>
<p>Given a target dataset $\mathcal{D}^{T}$ and an augment dataset $\mathcal{D}^{A}$, three transfer methods are examined:</p>
<p><strong>Joint Training</strong> concatenates the training sets and optimizes over the union:</p>
<p>$$
\theta^{*}_{\text{joint}} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}_{\text{joint}}} \log p(y_{i} \mid x_{i}), \quad \mathcal{D}_{\text{joint}} = \mathcal{D}^{T}_{\text{Train}} \cup \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>This requires that both datasets share the same input/output domain (same SMILES canonicalization rules).</p>
<p><strong>Self-Training</strong> (pseudo labeling) first trains a base model on $\mathcal{D}^{T}$ alone, then uses this model to relabel the augment dataset products:</p>
<p>$$
\hat{y}_{i} = \arg\max_{y} \log p(y \mid x_{i}; \theta^{*}_{\text{single}}) \quad \text{for } x_{i} \in \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>The pseudo-labeled augment set is then combined with $\mathcal{D}^{T}_{\text{Train}}$ for joint training. This approach does not require consistent label domains between datasets.</p>
<p><strong>Pre-training plus Fine-tuning</strong> trains first on the augment dataset to obtain $\theta^{*}_{\text{pretrain}}$, then initializes fine-tuning from this checkpoint:</p>
<p>$$
\theta^{0}_{\text{finetune}} \leftarrow \theta^{*}_{\text{pretrain}}, \quad \theta^{\ell+1}_{\text{finetune}} \leftarrow \theta^{\ell}_{\text{finetune}} - \gamma^{\ell} \nabla \mathcal{L}(\mathcal{D}^{T}_{\text{Train}}) \big|_{{\theta^{\ell}_{\text{finetune}}}}
$$</p>
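<p>The data flow of the three strategies can be sketched with a toy memorization model standing in for the Transformer; <code>train</code> and <code>predict</code> here are illustrative stubs, not the paper's implementation:</p>

```python
def train(model, data):
    """Toy stand-in for seq-to-seq training: memorize product -> reactants."""
    model.update(dict(data))
    return model

def predict(model, product):
    return model.get(product, "")

def joint_training(target, augment):
    # Optimize over the union D_joint = D_target U D_augment.
    return train({}, target + augment)

def self_training(target, augment_products):
    # Train on the target set alone, pseudo-label the augment products,
    # then retrain on the combined (real + pseudo-labeled) data.
    base = train({}, target)
    pseudo = [(x, predict(base, x)) for x in augment_products]
    return train({}, target + pseudo)

def pretrain_finetune(target, augment):
    # Pre-train on the augment set, then continue training from that
    # checkpoint on the target set.
    model = train({}, augment)
    return train(model, target)
```

Note how the three functions differ only in when and with which labels the augment data enters training, which is exactly the axis the paper varies while holding the architecture fixed.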
<h2 id="experimental-setup-on-uspto-benchmarks">Experimental Setup on USPTO Benchmarks</h2>
<p>The experiments use a fixed Transformer architecture (3 self-attention layers, 500-dimensional latent vectors) implemented in OpenNMT-py, evaluated across all three transfer methods.</p>
<p><strong>Datasets:</strong></p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>40K/5K/5K (train/val/test)</td>
          <td>10 reaction classes, curated by Lowe (2012)</td>
      </tr>
      <tr>
          <td>Augment (main)</td>
          <td>USPTO-Full</td>
          <td>844K train (after cleansing)</td>
          <td>Curated by Lowe (2017)</td>
      </tr>
      <tr>
          <td>Augment (smaller)</td>
          <td>USPTO-MIT</td>
          <td>384K train (after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removed all augment dataset samples whose product SMILES appeared in any USPTO-50K subset, preventing data leakage. All datasets were re-canonicalized with a unified <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> version.</p>
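<p>The leakage filter amounts to a set-membership check on canonical product SMILES; this sketch assumes reactions are stored as <code>(product, reactants)</code> string pairs:</p>

```python
def cleanse_augment(augment, target_products):
    """Remove augment reactions whose product SMILES appears in any
    target split (train/val/test), preventing test-set leakage."""
    banned = set(target_products)
    return [(product, reactants) for product, reactants in augment
            if product not in banned]
```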
<p><strong>Evaluation</strong> uses n-best accuracy under beam search with beam width k = 50, computing accuracy at n = 1, 3, 5, 10, 20, 50. Models are selected by best validation perplexity. All experiments report means and standard deviations over 5 runs.</p>
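<p>The n-best accuracy metric reduces to checking whether the gold reactant string appears among the top-n beam candidates:</p>

```python
def n_best_accuracy(gold, beams, ns=(1, 3, 5, 10, 20, 50)):
    """For each n, the fraction of test products whose gold reactant
    string appears among the top-n beam-search candidates."""
    return {n: sum(g in cands[:n] for g, cands in zip(gold, beams)) / len(gold)
            for n in ns}
```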
<p><strong>Optimization</strong> uses Adam with cyclic learning rate scheduling (warm-up) for all methods except fine-tuning, which uses a standard non-cyclic scheduler.</p>
<p><strong>Results comparing data transfer methods (USPTO-Full augment):</strong></p>
<table>
  <thead>
      <tr>
          <th>Training Method</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single model (No Transfer)</td>
          <td>35.3 +/- 1.4</td>
          <td>52.8 +/- 1.4</td>
          <td>58.9 +/- 1.3</td>
          <td>64.5 +/- 1.2</td>
          <td>68.8 +/- 1.2</td>
          <td>72.1 +/- 1.3</td>
      </tr>
      <tr>
          <td>Joint Training</td>
          <td>39.1 +/- 1.3</td>
          <td>63.4 +/- 0.9</td>
          <td>71.9 +/- 0.5</td>
          <td>80.1 +/- 0.2</td>
          <td>85.4 +/- 0.3</td>
          <td>89.4 +/- 0.2</td>
      </tr>
      <tr>
          <td>Self-Training</td>
          <td>41.5 +/- 1.0</td>
          <td>60.4 +/- 0.7</td>
          <td>66.1 +/- 0.7</td>
          <td>71.8 +/- 0.6</td>
          <td>75.3 +/- 0.5</td>
          <td>78.0 +/- 0.3</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune</td>
          <td>57.4 +/- 0.4</td>
          <td>77.6 +/- 0.4</td>
          <td>83.1 +/- 0.2</td>
          <td>87.4 +/- 0.4</td>
          <td>89.6 +/- 0.3</td>
          <td>90.9 +/- 0.2</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with state-of-the-art models:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GLN (Dai et al., 2019)</td>
          <td>Logic Network</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
          <td>88.5</td>
          <td>92.4</td>
      </tr>
      <tr>
          <td>G2Gs (Shi et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>48.9</td>
          <td>67.6</td>
          <td>72.5</td>
          <td>75.5</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RetroXpert (Yan et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>65.6</td>
          <td>78.7</td>
          <td>80.8</td>
          <td>83.3</td>
          <td>84.6</td>
          <td>86.0</td>
      </tr>
      <tr>
          <td>GraphRetro (Somnath et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>63.8</td>
          <td>80.5</td>
          <td>84.1</td>
          <td>85.9</td>
          <td>N/A</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune (ours)</td>
          <td>Seq-to-Seq</td>
          <td>57.4</td>
          <td>77.6</td>
          <td>83.1</td>
          <td>87.4</td>
          <td>89.6</td>
          <td>90.9</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Primary findings:</strong></p>
<ol>
<li>All three data transfer methods improve over the no-transfer baseline across all n-best accuracy levels.</li>
<li>Pre-training plus fine-tuning provides the largest gains, improving top-1 accuracy by 22.1 absolute percentage points (from 35.3% to 57.4%) and achieving the best n=10 and n=20 accuracy among all compared models, including graph-based approaches.</li>
<li>Augment dataset size matters: using USPTO-Full (844K) yields substantially better results than USPTO-MIT (384K) for joint training and pre-training plus fine-tuning, though self-training gains are surprisingly robust to augment dataset size.</li>
<li>Manual inspection of erroneous predictions shows that over 99% of top-1 predictions from the pre-trained/fine-tuned model are chemically appropriate or sensible, even when they do not exactly match the gold-standard reactants.</li>
<li>Pre-training plus fine-tuning shows a distinct advantage in training dynamics: the 1-best and n-best accuracy curves evolve similarly during fine-tuning, unlike the single model where these curves can diverge significantly. This makes early stopping more reliable.</li>
</ol>
<p><strong>Class-wise improvements</strong> are observed across all 10 reaction classes, with the largest gains in heterocycle formation (0.40 to 0.86 at 50-best) and functional group interconversion (0.57 to 0.90).</p>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The model struggles with compounds containing multiple similar substituents (e.g., long-chain hydrocarbons), occasionally selecting the wrong one.</li>
<li>Some reactions involving rare chemical groups (e.g., <a href="https://en.wikipedia.org/wiki/Polycyclic_aromatic_hydrocarbon">polycyclic aromatic hydrocarbons</a>) still produce invalid SMILES, suggesting the augment dataset lacks sufficient examples of these structures.</li>
<li>Top-1 accuracy (57.4%) lags behind the best graph-based models (RetroXpert at 65.6%), though the gap narrows at higher n values.</li>
<li>The study uses a fixed Transformer architecture without architecture-specific optimization for each transfer method.</li>
</ul>
<p><strong>Future directions</strong> proposed include freezing parts of the network during fine-tuning, applying data transfer to graph-to-graph models, and testing transferability to other retrosynthesis datasets beyond USPTO-50K.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>50K reactions</td>
          <td>Curated by Lowe (2012), 10 reaction classes</td>
      </tr>
      <tr>
          <td>Augment</td>
          <td>USPTO-Full</td>
          <td>877K reactions (844K after cleansing)</td>
          <td>Curated by Lowe (2017), available via Figshare</td>
      </tr>
      <tr>
          <td>Augment (alt)</td>
          <td>USPTO-MIT</td>
          <td>479K reactions (384K after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removes augment samples whose products appear in any USPTO-50K subset, and a unified RDKit canonicalization is applied to all datasets.</p>
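<p>A minimal sketch of this cleansing rule, with the RDKit canonicalizer replaced by an identity stub so the example stays dependency-free (the real pipeline would canonicalize each product with <code>Chem.MolToSmiles</code>):</p>

```python
# Sketch of the product-overlap cleansing rule described above.
# `canonical` is a placeholder: the actual pipeline applies a unified
# RDKit canonicalization; here it is the identity on strings that are
# assumed to be canonical already.
def canonical(smiles: str) -> str:
    return smiles  # stand-in for RDKit canonicalization

def cleanse_augment(augment, target_splits):
    """Drop augment reactions whose product occurs in any target split.

    augment: list of (product, reactants) SMILES pairs
    target_splits: iterable of lists of (product, reactants) pairs
    """
    seen = {canonical(p) for split in target_splits for p, _ in split}
    return [(p, r) for p, r in augment if canonical(p) not in seen]

# Example: the second augment reaction shares a product with the
# target train split, so it is removed.
train = [("CCO", "CC=O")]
augment = [("CCN", "CC#N"), ("CCO", "CC(=O)O")]
print(cleanse_augment(augment, [train]))  # → [('CCN', 'CC#N')]
```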
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq-to-seq model (3 self-attention layers, 500-dim latent vectors)</li>
<li>Positional encoding enabled</li>
<li>Maximum sequence length: 200 tokens</li>
<li>Adam optimizer</li>
<li>Cyclic learning rate scheduler with warm-up (all methods except fine-tuning)</li>
<li>Non-cyclic scheduler for fine-tuning phase (Klein et al., 2017)</li>
<li>Beam search with k=50 for inference</li>
</ul>
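<p>The n-best accuracies reported throughout follow directly from the ranked beam-search outputs. A short sketch, assuming both candidates and gold reactants are already canonical SMILES strings:</p>

```python
def topn_accuracy(beams, gold, n):
    """Fraction of test reactions whose gold reactant string appears
    among the first n beam-search candidates.

    beams: list of ranked candidate lists (canonical SMILES)
    gold:  list of gold reactant strings (canonical SMILES)
    """
    hits = sum(1 for cands, g in zip(beams, gold) if g in cands[:n])
    return hits / len(gold)

# Two test reactions, beam width 2: the first gold string appears at
# rank 2, the second not at all.
beams = [["CC=O", "CCO"], ["OCC", "C=O"]]
gold = ["CCO", "CO"]
print(topn_accuracy(beams, gold, 1))  # → 0.0
print(topn_accuracy(beams, gold, 2))  # → 0.5
```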
<h3 id="models">Models</h3>
<ul>
<li>Implementation: OpenNMT-py</li>
<li>No pre-trained weights or model checkpoints released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>57.4%</td>
          <td>35.3% (no transfer)</td>
          <td>Pre-train + fine-tune, USPTO-Full augment</td>
      </tr>
      <tr>
          <td>Top-10 accuracy</td>
          <td>87.4%</td>
          <td>64.5% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-20 accuracy</td>
          <td>89.6%</td>
          <td>68.8% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-50 accuracy</td>
          <td>90.9%</td>
          <td>72.1% (no transfer)</td>
          <td>Competitive with GLN (92.4%)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the paper. The authors mention GPU memory constraints motivating the 200-token sequence length limit.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ishiguro, K., Ujihara, K., Sawada, R., Akita, H., &amp; Kotera, M. (2020). Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis. <em>arXiv preprint arXiv:2010.00792</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ishiguro2020data,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ishiguro, Katsuhiko and Ujihara, Kazuya and Sawada, Ryohto and Akita, Hirotaka and Kotera, Masaaki}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2010.00792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Maxsmi: SMILES Augmentation for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/</guid><description>Maxsmi systematically evaluates five SMILES augmentation strategies with CNN and RNN models across solubility, lipophilicity, and bioactivity tasks.</description><content:encoded><![CDATA[<h2 id="systematic-benchmarking-of-smiles-data-augmentation">Systematic Benchmarking of SMILES Data Augmentation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the &ldquo;Maxsmi&rdquo; models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.</p>
<h2 id="the-data-scarcity-problem-in-qsar-modeling">The Data Scarcity Problem in QSAR Modeling</h2>
<p>Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES representation</a> of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.</p>
<h2 id="five-augmentation-strategies-and-test-time-ensemble-learning">Five Augmentation Strategies and Test-Time Ensemble Learning</h2>
<p>The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:</p>
<ol>
<li><strong>No augmentation</strong>: use only the canonical SMILES (baseline)</li>
<li><strong>Augmentation with duplication</strong>: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$</li>
<li><strong>Augmentation without duplication</strong>: generate $m$ random SMILES and discard exact duplicates</li>
<li><strong>Augmentation with reduced duplication</strong>: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above</li>
<li><strong>Augmentation with estimated maximum</strong>: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space</li>
</ol>
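<p>The three duplication policies differ only in how the per-compound pool of $m$ random SMILES is resized. A small bookkeeping sketch: the random draws are hard-coded here (real enumeration uses RDKit), and rounding $\sqrt{m}$ up to an integer copy count is an assumption on my part:</p>

```python
import math
from collections import Counter

def apply_strategy(random_smiles, strategy):
    """Resize one compound's list of m randomly generated SMILES.

    'with':    keep all m samples, duplicates included
    'without': keep one copy of each distinct string
    'reduced': keep ceil(sqrt(count)) copies of each distinct string
    """
    if strategy == "with":
        return list(random_smiles)
    counts = Counter(random_smiles)
    if strategy == "without":
        return list(counts)
    if strategy == "reduced":
        return [s for s, c in counts.items()
                for _ in range(math.ceil(math.sqrt(c)))]
    raise ValueError(strategy)

# m = 5 draws for one compound; four of them collide.
draws = ["C(C)O", "C(C)O", "C(C)O", "C(C)O", "OCC"]
print(len(apply_strategy(draws, "with")))     # → 5
print(len(apply_strategy(draws, "without")))  # → 2
print(len(apply_strategy(draws, "reduced")))  # → 3
```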
<p>Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:</p>
<p>$$
\hat{y}_i(C) = M_{\Theta}(S_i(C))
$$</p>
<p>The compound-level prediction is an aggregation (mean) over these:</p>
<p>$$
\hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big)
$$</p>
<p>The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.</p>
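<p>The aggregation and confidence measure above amount to a mean and a standard deviation over per-SMILES outputs. In this sketch the trained model $M_{\Theta}$ is stubbed with Python&rsquo;s <code>len</code>, standing in for a real network:</p>

```python
from statistics import mean, stdev

def predict_with_confidence(model, smiles_variants):
    """Aggregate per-SMILES predictions for one compound.

    Returns (mean prediction, standard deviation); a large standard
    deviation flags a low-confidence prediction.
    """
    preds = [model(s) for s in smiles_variants]
    return mean(preds), stdev(preds)

# Stub model: "predicts" the string length, in place of M_theta.
y_hat, conf = predict_with_confidence(len, ["CCO", "C(C)O", "OCC"])
print(round(y_hat, 2), round(conf, 2))  # → 3.67 1.15
```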
<h2 id="experimental-design-three-architectures-four-datasets">Experimental Design: Three Architectures, Four Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size (after preprocessing)</th>
          <th>Train / Test</th>
          <th>Task</th>
          <th>Provenance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>902 / 226</td>
          <td>Water solubility</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></td>
      </tr>
      <tr>
          <td>ESOL_small</td>
          <td>1,068</td>
          <td>854 / 214</td>
          <td>Solubility (max 25 heavy atoms)</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>513 / 129</td>
          <td>Hydration free energy</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,199</td>
          <td>3,359 / 840</td>
          <td>Octanol/water distribution</td>
          <td><a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a></td>
      </tr>
      <tr>
          <td>Affinity (EGFR)</td>
          <td>5,849</td>
          <td>4,679 / 1,170</td>
          <td><a href="https://en.wikipedia.org/wiki/IC50">pIC50</a> against <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> kinase</td>
          <td>Kinodata</td>
      </tr>
  </tbody>
</table>
<h3 id="architectures">Architectures</h3>
<p>Three shallow neural networks are compared:</p>
<ul>
<li><strong>CONV1D</strong>: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers</li>
<li><strong>CONV2D</strong>: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers</li>
<li><strong>RNN</strong>: LSTM layer followed by two fully connected layers (128 and 64 units)</li>
</ul>
<p>All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.</p>
<h3 id="augmentation-sweep">Augmentation sweep</h3>
<p>The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.</p>
<h2 id="key-findings-augmentation-consistently-improves-rmse">Key Findings: Augmentation Consistently Improves RMSE</h2>
<h3 id="augmentation-always-helps">Augmentation always helps</h3>
<p>Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.</p>
<h3 id="best-models-maxsmi">Best models (Maxsmi)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Model</th>
          <th>Augmentation Number</th>
          <th>Strategy</th>
          <th>Test RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>Reduced duplication</td>
          <td>0.569</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>With duplication</td>
          <td>1.032</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>CONV1D</td>
          <td>80</td>
          <td>Without duplication</td>
          <td>0.593</td>
      </tr>
  </tbody>
</table>
<p>The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.</p>
<h3 id="no-single-best-augmentation-strategy">No single best augmentation strategy</h3>
<p>The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.</p>
<h3 id="canonical-smiles-outperform-single-random-smiles">Canonical SMILES outperform single random SMILES</h3>
<p>When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).</p>
<h3 id="comparison-to-prior-work">Comparison to prior work</h3>
<table>
  <thead>
      <tr>
          <th>Study</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Maxsmi</td>
          <td>0.569</td>
          <td>1.032</td>
          <td>0.593</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td>MoleculeNet</td>
          <td>0.58 +/- 0.03</td>
          <td>1.15 +/- 0.12</td>
          <td>0.655 +/- 0.036</td>
          <td>GNN</td>
      </tr>
      <tr>
          <td>CNF</td>
          <td>0.62</td>
          <td>1.11</td>
          <td>0.67</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a></td>
          <td>N/A</td>
          <td>1.197 +/- 0.127</td>
          <td>0.565 +/- 0.037</td>
          <td>RNN</td>
      </tr>
  </tbody>
</table>
<p>Maxsmi outperforms or matches MoleculeNet&rsquo;s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.</p>
<h3 id="confidence-estimation">Confidence estimation</h3>
<p>The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.</p>
<h3 id="egfr-affinity-test-case">EGFR affinity test case</h3>
<p>Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and R2 of 0.712, compared to 1.031 RMSE and 0.494 R2 for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 R2) performs comparably, which the authors note without further explanation.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.</li>
<li>The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.</li>
<li>The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.</li>
<li>The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>MoleculeNet, water solubility</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet, hydration free energy</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,199</td>
          <td>ChEMBL, logD</td>
      </tr>
      <tr>
          <td>Test case</td>
          <td>EGFR Affinity</td>
          <td>5,849</td>
          <td>Kinodata (ChEMBL v28), pIC50</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES generation via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>&rsquo;s random SMILES enumeration</li>
<li>One-hot encoding of SMILES characters with padding to max length</li>
<li>Five augmentation strategies applied to both training and test sets</li>
<li>Mean aggregation for compound-level predictions</li>
</ul>
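<p>A dependency-free sketch of the character-level one-hot encoding with padding; the toy alphabet and the all-zero padding columns are assumptions (the real code derives its alphabet from the dataset and the paper does not state its padding token):</p>

```python
def one_hot_encode(smiles, alphabet, max_len):
    """One-hot encode a SMILES string character by character, padding
    with all-zero rows up to max_len; returns a max_len x |alphabet|
    matrix as a list of lists."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = []
    for pos in range(max_len):
        row = [0] * len(alphabet)
        if pos < len(smiles):
            row[index[smiles[pos]]] = 1
        matrix.append(row)
    return matrix

# Toy alphabet of 8 characters; real pipelines scan the full dataset.
alphabet = sorted(set("CCO(=)N1c"))
enc = one_hot_encode("CCO", alphabet, 5)
print(len(enc), len(enc[0]))  # → 5 8
```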
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CONV1D</td>
          <td>1D conv (kernel 10, stride 1) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>CONV2D</td>
          <td>2D conv (single channel) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RNN</td>
          <td>LSTM + FC(128) + FC(64)</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RF Baseline</td>
          <td>Random Forest (default sklearn)</td>
          <td>Morgan FP, radius 2, length 1024</td>
      </tr>
  </tbody>
</table>
<p>Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE (ESOL)</td>
          <td>0.569</td>
          <td>1.102 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
      <tr>
          <td>RMSE (FreeSolv)</td>
          <td>1.032</td>
          <td>2.563 (RF)</td>
          <td>CONV1D, 70x with dup</td>
      </tr>
      <tr>
          <td>RMSE (Lipophilicity)</td>
          <td>0.593</td>
          <td>0.860 (RF)</td>
          <td>CONV1D, 80x without dup</td>
      </tr>
      <tr>
          <td>RMSE (EGFR)</td>
          <td>0.777</td>
          <td>0.758 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a GeForce GTX 1080 Ti, provided by the HPC cluster at Freie Universität Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours. Training with 19x augmentation achieves RMSE of 0.605 in under 30 minutes.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/maxsmi">volkamerlab/maxsmi</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full source code, trained models, CLI for prediction</td>
      </tr>
      <tr>
          <td><a href="https://maxsmi.readthedocs.io/en/latest/">Documentation</a></td>
          <td>Docs</td>
          <td>N/A</td>
          <td>Read the Docs documentation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/openkinome/kinodata">Kinodata</a></td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Curated kinase bioactivity data from ChEMBL v28</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kimber, T. B., Gagnebin, M., &amp; Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. <em>Artificial Intelligence in the Life Sciences</em>, 1, 100014. <a href="https://doi.org/10.1016/j.ailsci.2021.100014">https://doi.org/10.1016/j.ailsci.2021.100014</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimber2021maxsmi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence in the Life Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{100014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ailsci.2021.100014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RNNs vs Transformers for Molecular Generation Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</guid><description>Empirical comparison of RNN and Transformer architectures for molecular generation using SMILES and SELFIES across three generative tasks.</description><content:encoded><![CDATA[<h2 id="an-empirical-comparison-of-sequence-architectures-for-molecular-generation">An Empirical Comparison of Sequence Architectures for Molecular Generation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares two dominant sequence modeling architectures, recurrent neural networks (RNNs) and the Transformer, for chemical language modeling. The primary contribution is a controlled experimental comparison across three generative tasks of increasing complexity, combined with an evaluation of two molecular string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The paper does not propose a new method; instead, it provides practical guidance on when each architecture is more appropriate for molecular generation.</p>
<h2 id="why-compare-rnns-and-transformers-for-molecular-design">Why Compare RNNs and Transformers for Molecular Design?</h2>
<p>Exploring unknown molecular space and designing molecules with target properties is a central goal in computational drug design. Language models trained on molecular string representations (SMILES, SELFIES) have shown the capacity to learn complex molecular distributions. RNN-based models, including LSTM and GRU variants, were the first widely adopted architectures for this task. Models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">CharRNN</a>, ReLeaSE, and conditional RNNs demonstrated success in generating focused molecular libraries. More recently, self-attention-based Transformer models (Mol-GPT, LigGPT) have gained popularity due to their parallelizability and ability to capture long-range dependencies.</p>
<p>Despite the widespread adoption of Transformers across NLP, it was not clear whether they uniformly outperform RNNs for molecular generation. Prior work by Dollar et al. showed that RNN-based models achieved higher validity than Transformer-based models in some settings. Flam-Shepherd et al. demonstrated that RNN language models could learn complex molecular distributions across challenging generative tasks. This paper extends that comparison by adding the Transformer architecture to the same set of challenging tasks and evaluating both SMILES and SELFIES representations.</p>
<h2 id="experimental-design-three-tasks-two-architectures-two-representations">Experimental Design: Three Tasks, Two Architectures, Two Representations</h2>
<p>The core experimental design uses a 2x2 setup: two architectures (RNN and Transformer) crossed with two molecular representations (SMILES and SELFIES), yielding four model variants: SM-RNN, SF-RNN, SM-Transformer, and SF-Transformer.</p>
<h3 id="three-generative-tasks">Three generative tasks</h3>
<p>The three tasks, drawn from <a href="/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/">Flam-Shepherd et al.</a>, are designed with increasing complexity:</p>
<ol>
<li>
<p><strong>Penalized LogP task</strong>: Generate molecules with high penalized LogP scores (LogP minus synthetic accessibility and long-cycle penalties). The dataset is built from ZINC15 molecules with penalized LogP &gt; 4.0. Molecule sequences are relatively short (50-75 tokens).</p>
</li>
<li>
<p><strong>Multidistribution task</strong>: Learn a multimodal molecular weight distribution constructed from four distinct subsets: GDB13 (MW &lt;= 185), ZINC (185 &lt;= MW &lt;= 425), Harvard Clean Energy Project (460 &lt;= MW &lt;= 600), and POLYMERS (MW &gt; 600). This tests the ability to capture multiple modes simultaneously.</p>
</li>
<li>
<p><strong>Large-scale task</strong>: Generate large molecules from PubChem with more than 100 heavy atoms and MW ranging from 1250 to 5000. This tests long-sequence generation capability.</p>
</li>
</ol>
<h3 id="model-configuration">Model configuration</h3>
<p>Models are compared with matched parameter counts (5.2-5.3M to 36.4M parameters). Hyperparameter optimization uses random search over learning rate [0.0001, 0.001], hidden units (500-1000 for RNNs, 376-776 for Transformers), layer number [3, 5], and dropout [0.0, 0.5]. A regex-based tokenizer replaces character-by-character tokenization, reducing token lengths from 10,000 to under 3,000 for large molecules.</p>
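<p>A regex tokenizer of the kind described keeps multi-character tokens such as <code>Cl</code>, <code>Br</code>, and bracket atoms intact rather than splitting them into characters. The pattern below follows a commonly used SMILES regex (after Schwaller et al.); the paper&rsquo;s exact expression is not given, so treat it as an assumption:</p>

```python
import re

# Commonly used SMILES tokenization pattern; an assumption standing in
# for the paper's unspecified regex.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str):
    """Split a SMILES string into chemistry-aware tokens, so 'Cl',
    'Br', and bracket atoms like '[nH]' stay single tokens."""
    return SMILES_TOKEN.findall(smiles)

print(tokenize("CC(=O)Oc1ccccc1"))
# → ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
print(tokenize("Clc1ccc(Br)cc1"))
# → ['Cl', 'c', '1', 'c', 'c', 'c', '(', 'Br', ')', 'c', 'c', '1']
```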
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>The evaluation covers multiple dimensions:</p>
<ul>
<li><strong>Standard metrics</strong>: validity, uniqueness, novelty</li>
<li><strong>Molecular properties</strong>: <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, LogP, SA, QED, Bertz complexity (BCT), natural product likeness (NP), molecular weight (MW)</li>
<li><strong>Wasserstein distance</strong>: measures distributional similarity between generated and training molecules for each property</li>
<li><strong>Tanimoto similarity</strong>: structural and scaffold similarity between generated and training molecules</li>
<li><strong>Token length (TL)</strong>: comparison of generated vs. training sequence lengths</li>
</ul>
<p>For each task, 10,000 molecules are generated and evaluated.</p>
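<p>Of these metrics, Tanimoto similarity reduces to Jaccard similarity over fingerprint on-bits. This sketch uses toy bit sets in place of the Morgan fingerprints (e.g., from RDKit) a real evaluation would compare:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given
    as sets of on-bit indices: |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy on-bit sets standing in for a generated and a training molecule.
print(tanimoto({1, 2, 3, 8}, {2, 3, 8, 9}))  # → 0.6
```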
<h2 id="key-results-across-tasks">Key Results Across Tasks</h2>
<h3 id="penalized-logp-task">Penalized LogP task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.56</td>
          <td>0.12</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>16.61</td>
          <td>0.09</td>
          <td>5.90</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.63</td>
          <td>0.25</td>
          <td>0.42</td>
          <td>0.02</td>
          <td>36.43</td>
          <td>0.23</td>
          <td>2.35</td>
          <td>0.40</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.83</td>
          <td>0.18</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>23.77</td>
          <td>0.09</td>
          <td>7.99</td>
          <td>0.84</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.97</td>
          <td>0.22</td>
          <td>0.47</td>
          <td>0.02</td>
          <td>44.43</td>
          <td>0.28</td>
          <td>5.04</td>
          <td>0.53</td>
      </tr>
  </tbody>
</table>
<p>RNN-based models achieve smaller Wasserstein distances across most properties. The authors attribute this to LogP being computed as a sum of atomic contributions (a local property), which aligns with RNNs&rsquo; strength in capturing local structural features. RNNs also generate ring counts closer to the training distribution (4.10 for SM-RNN vs. 4.04 for SM-Transformer, with the training data at 4.21). The Transformer performs better on global structural similarity (higher Tanimoto similarity to training data).</p>
<h3 id="multidistribution-task">Multidistribution task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.16</td>
          <td>0.07</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>18.34</td>
          <td>0.02</td>
          <td>7.07</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.46</td>
          <td>0.38</td>
          <td>0.55</td>
          <td>0.03</td>
          <td>110.72</td>
          <td>0.24</td>
          <td>10.00</td>
          <td>1.58</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.16</td>
          <td>0.16</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>39.94</td>
          <td>0.02</td>
          <td>10.03</td>
          <td>1.28</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.73</td>
          <td>0.37</td>
          <td>0.63</td>
          <td>0.04</td>
          <td>107.46</td>
          <td>0.30</td>
          <td>17.57</td>
          <td>2.40</td>
      </tr>
  </tbody>
</table>
<p>Both SMILES-based models captured all four modes of the MW distribution well. While RNNs had smaller overall Wasserstein distances, the Transformer fitted the higher-MW modes better. This aligns with the observation that longer molecular sequences (which correlate with higher MW) favor the Transformer&rsquo;s global attention mechanism over the RNN&rsquo;s sequential processing.</p>
<h3 id="large-scale-task">Large-scale task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.46</td>
          <td>1.89</td>
          <td>0.20</td>
          <td>0.01</td>
          <td>307.09</td>
          <td>0.03</td>
          <td>105.29</td>
          <td>12.05</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.65</td>
          <td>1.78</td>
          <td>0.43</td>
          <td>0.01</td>
          <td>456.98</td>
          <td>0.14</td>
          <td>100.79</td>
          <td>15.26</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.36</td>
          <td>1.64</td>
          <td>0.07</td>
          <td>0.01</td>
          <td>172.93</td>
          <td>0.02</td>
          <td>59.04</td>
          <td>7.41</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.91</td>
          <td>2.82</td>
          <td>0.47</td>
          <td>0.01</td>
          <td>464.75</td>
          <td>0.18</td>
          <td>92.91</td>
          <td>11.57</td>
      </tr>
  </tbody>
</table>
<p>The Transformer demonstrates a clear advantage on large molecules. SM-Transformer achieves substantially lower Wasserstein distances than SM-RNN across nearly all properties, with particularly large improvements in BCT (172.93 vs. 307.09) and MW (59.04 vs. 105.29). The Transformer also produces better Tanimoto similarity scores and more accurate token length distributions.</p>
<h3 id="standard-metrics-across-all-tasks">Standard metrics across all tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>SM-RNN</th>
          <th>SF-RNN</th>
          <th>SM-Transformer</th>
          <th>SF-Transformer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>Valid</td>
          <td>0.90</td>
          <td>1.00</td>
          <td>0.89</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Uniqueness</td>
          <td>0.98</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Novelty</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.71</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Valid</td>
          <td>0.95</td>
          <td>1.00</td>
          <td>0.97</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Uniqueness</td>
          <td>0.96</td>
          <td>1.00</td>
          <td>1.00</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Novelty</td>
          <td>0.91</td>
          <td>0.98</td>
          <td>0.91</td>
          <td>0.98</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Valid</td>
          <td>0.84</td>
          <td>1.00</td>
          <td>0.88</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Uniqueness</td>
          <td>0.99</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Novelty</td>
          <td>0.85</td>
          <td>0.92</td>
          <td>0.86</td>
          <td>0.94</td>
      </tr>
  </tbody>
</table>
<p>SELFIES achieves 100% validity across all tasks by construction, while SMILES validity drops for large molecules. The Transformer achieves slightly higher validity than the RNN for SMILES-based models, particularly on the large-scale task (0.88 vs. 0.84).</p>
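<p>The three ratios in the table above are typically computed as follows. A minimal sketch (the <code>is_valid</code> predicate is a stand-in of mine; real validity checking requires a cheminformatics parser such as RDKit's <code>MolFromSmiles</code>):</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness, and novelty ratios for generated molecules.

    generated: list of SMILES strings sampled from the model
    training_set: set of SMILES strings the model was trained on
    is_valid: predicate standing in for a real parser (e.g. RDKit)
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)                     # uniqueness is over the valid subset
    novel = unique - set(training_set)      # novelty is over the unique set
    n = len(generated)
    return {
        "validity": len(valid) / n,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

<p>Computing uniqueness over valid molecules and novelty over unique ones follows the common convention in generative-chemistry benchmarks; with this convention, SELFIES models trivially score 1.0 on validity because every SELFIES string decodes to some molecule.</p>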
<h2 id="conclusions-and-practical-guidelines">Conclusions and Practical Guidelines</h2>
<p>The central finding is that neither architecture universally dominates. The choice between RNNs and Transformers should depend on the characteristics of the molecular data:</p>
<ul>
<li>
<p><strong>RNNs are preferred</strong> when molecular properties depend on local structural features (e.g., LogP, ring counts) and when sequences are relatively short. They better capture local fragment distributions.</p>
</li>
<li>
<p><strong>Transformers are preferred</strong> when dealing with large molecules (high MW, long sequences) where global attention can capture the overall distribution more effectively. RNNs progressively lose information from early tokens on long sequences.</p>
</li>
<li>
<p><strong>SMILES outperforms SELFIES</strong> on property distribution metrics across nearly all tasks and models. While SELFIES guarantees 100% syntactic validity, its generated molecules show worse distributional fidelity to training data. The authors argue that validity is a less important concern than property fidelity, since invalid SMILES can be filtered easily.</p>
</li>
</ul>
<p>The authors acknowledge that longer sequences remain challenging for both architectures. For Transformers, the quadratic growth of the attention matrix limits scalability. For RNNs, the vanishing gradient problem limits effective context length.</p>
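<p>The quadratic cost referred to here is easy to make concrete. A back-of-envelope sketch (the function name and the fp32 assumption are mine; half precision or attention approximations change the constant, not the scaling):</p>

```python
def attention_matrix_bytes(seq_len, n_heads, bytes_per_elem=4):
    """Memory for one layer's attention score matrices.

    Each head materializes a seq_len x seq_len score matrix,
    so memory grows quadratically with sequence length.
    """
    return n_heads * seq_len * seq_len * bytes_per_elem

# Doubling the sequence length quadruples the attention memory,
# which is what limits Transformers on very long molecular sequences.
</antml>```

<p>This is why the large-molecule task (MW up to 5000, hence very long SMILES) sits near the practical limit for both architectures: the Transformer pays quadratic memory, while the RNN pays in gradient signal over long ranges.</p>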
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Task 1</td>
          <td>ZINC15 (penalized LogP &gt; 4.0)</td>
          <td>Not specified</td>
          <td>High penalized LogP molecules</td>
      </tr>
      <tr>
          <td>Task 2</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> + ZINC + CEP + POLYMERS</td>
          <td>~200K</td>
          <td>Multimodal MW distribution</td>
      </tr>
      <tr>
          <td>Task 3</td>
          <td>PubChem (&gt;100 heavy atoms)</td>
          <td>Not specified</td>
          <td>MW range 1250-5000</td>
      </tr>
  </tbody>
</table>
<p>Data processing code available at <a href="https://github.com/danielflamshep/genmoltasks">https://github.com/danielflamshep/genmoltasks</a> (from the original Flam-Shepherd et al. study).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenizer (not character-by-character)</li>
<li><strong>Hyperparameter search</strong>: Random search over learning rate [0.0001, 0.001], hidden units, layers [3, 5], dropout [0.0, 0.5]</li>
<li><strong>Selection</strong>: Top 20% by sum of valid + unique + novelty, then final selection on all indicators</li>
<li><strong>Generation</strong>: 10K molecules per model per task</li>
</ul>
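<p>A regex-based tokenizer keeps multi-character chemical units intact instead of splitting them into characters. The sketch below uses a simplified pattern common in this literature; the exact regex the authors used is not given in the paper:</p>

```python
import re

# Multi-character tokens (bracket atoms, Cl, Br, two-digit ring closures)
# must appear before single characters in the alternation, so that "Cl"
# is one token rather than carbon followed by a stray "l".
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[0-9]|[=#\-\+\(\)\\/@\.~\*\$:])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: every character must belong to some token.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens
```

<p>For example, <code>tokenize("CCl")</code> yields <code>["C", "Cl"]</code> rather than three characters, which is exactly the distinction between regex-based and character-by-character tokenization.</p>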
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN variants</td>
          <td>5.2M - 36.4M</td>
          <td>RNN (LSTM/GRU)</td>
      </tr>
      <tr>
          <td>Transformer variants</td>
          <td>5.3M - 36.4M</td>
          <td>Transformer decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Wasserstein distance for property distributions (FCD, LogP, SA, QED, BCT, NP, MW, TL), Tanimoto similarity (molecular and scaffold), validity, uniqueness, novelty.</p>
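<p>For one-dimensional property distributions with equal sample sizes, the Wasserstein-1 distance reduces to the mean absolute gap between sorted samples. A minimal sketch (in practice one would call <code>scipy.stats.wasserstein_distance</code>, which also handles unequal sizes and weights):</p>

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical samples.

    Sorting aligns the empirical quantile functions; W1 is then the
    average gap between matched quantiles.
    """
    assert len(xs) == len(ys), "sketch assumes equal sample sizes"
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)
```

<p>Applied per property (LogP, SA, QED, MW, ...), smaller values mean the generated distribution hews closer to the training distribution, which is how the tables above should be read.</p>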
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/viko-3/language_model">trans_language</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transformer implementation by the authors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">genmoltasks</a></td>
          <td>Code/Data</td>
          <td>Apache-2.0</td>
          <td>Dataset construction from Flam-Shepherd et al.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Wang, Z., Zeng, X., Li, Y., Li, P., Ye, X., &amp; Sakurai, T. (2023). Molecular language models: RNNs or transformer? <em>Briefings in Functional Genomics</em>, 22(4), 392-400. <a href="https://doi.org/10.1093/bfgp/elad012">https://doi.org/10.1093/bfgp/elad012</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chen2023molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular language models: RNNs or transformer?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Yangyang and Wang, Zixu and Zeng, Xiangxiang and Li, Yayang and Li, Pengyong and Ye, Xiucai and Sakurai, Tetsuya}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Functional Genomics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bfgp/elad012}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to any previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that more-similar molecules belong to the same bioactivity class, enforcing that distinct candidates possess different profiles.</p>
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
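<p>The two ingredients of these metrics are simple to state in code. A sketch of the property filter and the greedy diverse selection, with fingerprints represented as sets of on-bits (function names and the <code>stats</code> layout are mine; the paper's implementation uses ECFP4 fingerprints via RDKit):</p>

```python
def passes_property_filter(mw, logp, stats, k=4.0):
    """Keep a molecule only if MW and LogP lie within mu +/- k*sigma
    of the pre-training data (ZINC250k in the paper).

    stats: {"mw": (mean, std), "logp": (mean, std)}
    """
    (mw_mu, mw_sd), (lp_mu, lp_sd) = stats["mw"], stats["logp"]
    return abs(mw - mw_mu) <= k * mw_sd and abs(logp - lp_mu) <= k * lp_sd

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def diverse_top_k(scored, k=10, max_sim=0.35):
    """Greedily select top-scoring molecules whose similarity to every
    previously selected molecule does not exceed max_sim.

    scored: list of (score, fingerprint_set) pairs.
    """
    picked = []
    for score, fp in sorted(scored, key=lambda t: -t[0]):
        if all(tanimoto(fp, p) <= max_sim for _, p in picked):
            picked.append((score, fp))
        if len(picked) == k:
            break
    return picked
```

<p>The Combined metric simply applies the filter first and then runs the diverse selection over whatever survives.</p>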
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: Uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>Six objective task reference molecules sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize for these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
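<p>The underlying AUC Top-10 quantity is the normalized area under the "mean of the best 10 scores so far" curve over the oracle budget. One reasonable sketch (the official PMO code records the top-10 mean at fixed call intervals, so constants may differ; the early-termination handling here is my assumption):</p>

```python
import heapq

def auc_top10(scores, budget=10_000, k=10):
    """Normalized area under the running top-k-mean curve vs. oracle calls.

    scores: oracle values in the order the model requested them.
    Returns a value in [0, 1] when individual scores are in [0, 1].
    """
    top = []        # min-heap holding the best k scores seen so far
    area = 0.0
    for s in scores[:budget]:
        heapq.heappush(top, s)
        if len(top) > k:
            heapq.heappop(top)
        area += sum(top) / len(top)
    # If the run stopped early, the final top-k mean carries forward.
    if len(scores) < budget and top:
        area += (budget - len(scores)) * (sum(top) / len(top))
    return area / budget
```

<p>Because the area accumulates from the first oracle call, a model that finds good molecules early scores higher than one that reaches the same final top-10 late, which is precisely the sample-efficiency framing.</p>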
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Fine-Tuning GPT-3 for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/</guid><description>Evaluating fine-tuned GPT-3 ada models for HOMO/LUMO classification of organic semiconductors from SMILES, with ablation and robustness analysis.</description><content:encoded><![CDATA[<h2 id="gpt-3-as-a-molecular-property-classifier">GPT-3 as a Molecular Property Classifier</h2>
<p>This is an <strong>Empirical</strong> paper that evaluates the effectiveness of fine-tuning OpenAI&rsquo;s GPT-3 language model (specifically the &ldquo;ada&rdquo; base model) for predicting electronic and functional properties of organic molecules. Rather than proposing a new architecture, the work systematically tests whether a general-purpose LLM can learn chemically meaningful patterns from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings when fine-tuned on classification tasks. The primary contribution is the empirical characterization of GPT-3&rsquo;s performance, robustness, and limitations for molecular property prediction, including extensive ablation studies.</p>
<h2 id="why-fine-tune-a-general-purpose-llm-for-chemistry">Why Fine-Tune a General-Purpose LLM for Chemistry?</h2>
<p>Machine learning for molecular property prediction typically relies on specialized representations: molecular graphs processed by graph neural networks (GNNs), engineered molecular descriptors, or domain-specific chemical language models trained from scratch on SMILES or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. These approaches require varying levels of domain expertise to design the inputs and architecture.</p>
<p>GPT-3, pre-trained on vast amounts of general text, already has an internal representation of language structure. SMILES notation, as a text-based molecular representation, can be treated as a &ldquo;language&rdquo; with its own syntax. The authors hypothesize that GPT-3&rsquo;s language understanding capabilities, combined with the human-readable nature of SMILES, may enable the model to recognize significant patterns within chemical structures and capture structure-property dependencies. The key question is whether fine-tuning alone is sufficient, or whether specialized architectures provide fundamental advantages.</p>
<p>Prior work by <a href="/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/">Jablonka et al.</a> showed that fine-tuned GPT-3 could perform surprisingly well on low-data chemistry tasks, sometimes surpassing dedicated models. This paper extends that investigation with a focus on electronic properties (<a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO and LUMO</a> energies) of <a href="https://en.wikipedia.org/wiki/Organic_semiconductor">organic semiconductors</a>, with deeper analysis of robustness and failure modes.</p>
<h2 id="smiles-to-classification-via-prompt-completion-fine-tuning">SMILES-to-Classification via Prompt-Completion Fine-Tuning</h2>
<p>The core approach is straightforward. Each training example is a prompt-completion pair in JSONL format:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{<span style="color:#f92672">&#34;prompt&#34;</span>: <span style="color:#e6db74">&#34;SMILES_string&#34;</span>, <span style="color:#f92672">&#34;completion&#34;</span>: <span style="color:#e6db74">&#34;class_label&#34;</span>}
</span></span></code></pre></div><p>The SMILES string serves as the prompt, and the fine-tuned model learns to complete it with a class label (0/1 for binary, 0/1/2 for ternary, 0/1/2/3 for quaternary classification). Class thresholds are determined by equally segmenting the property value range. The authors use GPT-3&rsquo;s default tokenizer, which breaks SMILES strings into subword tokens that do not correspond to chemically meaningful units (e.g., &ldquo;c1ccccc1&rdquo; for benzene gets tokenized into arbitrary fragments).</p>
<p>This design choice has important implications. The model must learn chemical semantics from token patterns that are not aligned with atoms or bonds. The authors note this as a limitation and hypothesize that a chemistry-aware tokenizer could improve performance.</p>
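<p>Constructing the fine-tuning file from raw property values is mechanical: bin each value into equal-width classes over the observed range, then emit one JSONL record per molecule. A minimal sketch (function name is mine; degenerate ranges where all values coincide are not handled):</p>

```python
import json

def make_finetune_records(smiles_list, values, n_classes):
    """Build JSONL prompt-completion records by equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    records = []
    for smi, v in zip(smiles_list, values):
        # Values at the top edge of the range fall into the last bin.
        label = min(int((v - lo) / width), n_classes - 1)
        records.append(json.dumps({"prompt": smi, "completion": str(label)}))
    return records
```

<p>Writing one record per line produces the JSONL format the OpenAI fine-tuning endpoint expects.</p>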
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The primary dataset is a collection of 48,182 organic semiconductor (OSC) molecules extracted from the <a href="https://en.wikipedia.org/wiki/Cambridge_Structural_Database">Cambridge Structural Database</a> (CSD). Each molecule has a SMILES representation and quantum-chemically computed electronic properties (HOMO and LUMO energies). A secondary dataset of 572 aromatic molecular photocatalysts (AMPs) with experimentally measured <a href="https://en.wikipedia.org/wiki/Hydrogen_evolution_reaction">hydrogen evolution rates</a> (HER) provides an additional test case.</p>
<h3 id="baselines">Baselines</h3>
<p>Three baselines are compared:</p>
<ol>
<li><strong>Directed message-passing neural network (D-MPNN)</strong> via Chemprop, using default molecular graph representations</li>
<li><strong>RDKit molecular descriptors + SVM</strong>, using the top 20 descriptors selected by SelectKBest</li>
<li><strong>Prior ML results</strong> from the original AMP dataset paper (using engineered domain-specific features)</li>
</ol>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Classes</th>
          <th>GPT-3 Accuracy</th>
          <th>GNN Accuracy</th>
          <th>Descriptors Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>3</td>
          <td>0.92</td>
          <td>0.94</td>
          <td>0.87</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>4</td>
          <td>0.68</td>
          <td>0.75</td>
          <td>0.47</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>5</td>
          <td>0.60</td>
          <td>0.68</td>
          <td>0.40</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>LUMO</td>
          <td>3</td>
          <td>0.94</td>
          <td>0.94</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>AMPs (572)</td>
          <td>HER</td>
          <td>2</td>
          <td>0.88</td>
          <td>0.86</td>
          <td>0.87</td>
      </tr>
  </tbody>
</table>
<p>For ternary classification, GPT-3 performs on par with GNNs (0.92 vs. 0.94 for HOMO; 0.94 vs. 0.94 for LUMO). Performance degrades more steeply than GNNs as the number of classes increases: at 5-class HOMO, GPT-3 achieves only 0.60 vs. GNN&rsquo;s 0.68. On the small AMP dataset (572 molecules), GPT-3 slightly outperforms the GNN (0.88 vs. 0.86).</p>
<h3 id="learning-curves">Learning Curves</h3>
<p>The data efficiency analysis reveals that GPT-3 needs at least 20% of the OSC dataset (approximately 9,600 molecules) to reach accuracy above 0.9. Below 1,000 training points, accuracy drops below 0.6. GNNs outperform GPT-3 in this low-data regime, which the authors attribute to (1) the molecular graph being chemically more expressive than SMILES for these tasks, and (2) fine-tuning requiring sufficient data to capture relevant SMILES patterns.</p>
<h3 id="ablation-study-1-single-atom-removal">Ablation Study 1: Single-Atom Removal</h3>
<p>The authors tested robustness by removing individual non-hydrogen, non-carbon atoms from SMILES strings and replacing them with a <code>&lt;missing&gt;</code> token. Out of 45,763 ablation tests on 7,714 correctly predicted molecules, 95.2% retained the same classification. This suggests the model captures redundant structural information rather than relying on any single atom.</p>
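<p>A minimal string-level sketch of this ablation (my own illustration, not the authors' implementation) replaces each heteroatom occurrence in a SMILES string with a <code>&lt;missing&gt;</code> token:</p>

```python
import re

# Match two-letter halogens first, then common single-letter heteroatoms
# (uppercase aliphatic and lowercase aromatic forms). Bracket atoms and
# rarer elements are ignored in this simplified sketch.
HETEROATOM = re.compile(r"Cl|Br|[NOSPFI]|[nos]")

def ablate_atoms(smiles):
    """Yield one variant per heteroatom, with that atom replaced by <missing>."""
    for m in HETEROATOM.finditer(smiles):
        yield smiles[: m.start()] + "<missing>" + smiles[m.end():]

# Acetanilide has two heteroatoms (one O, one N), so two ablated variants:
for variant in ablate_atoms("CC(=O)Nc1ccccc1"):
    print(variant)
```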
<h3 id="ablation-study-2-single-group-removal">Ablation Study 2: Single-Group Removal</h3>
<p>Fifteen chemical groups (nitrile, nitro, enamine, ketone, etc.) were individually ablated. The model attributed the most importance to the acetylene (81% prediction agreement after ablation), enamine (85%), nitro (86%), and ketone (87%) groups: removing any of these altered HOMO predictions in more than 10% of tests. Notably, groups that participate in electronic pi-conjugation tended to be more &ldquo;important&rdquo; to the model&rsquo;s HOMO predictions.</p>
<p>When ablated atoms were replaced with random elements instead of the <code>&lt;missing&gt;</code> token, the model failed in 80% of cases for a representative molecule. This suggests the model may &ldquo;fill in&rdquo; the missing information when seeing the <code>&lt;missing&gt;</code> token but gets confused by incorrect atomic identities.</p>
<h3 id="predicting-unknown-molecular-families">Predicting Unknown Molecular Families</h3>
<p>The authors held out entire families of <a href="https://en.wikipedia.org/wiki/Polycyclic_aromatic_hydrocarbon">polycyclic aromatic hydrocarbons</a> (naphthalene, anthracene, tetracene, pyrene, perylene), quinones, and imides during training, then tested predictions on these unseen families. Results for the first five PAH families:</p>
<table>
  <thead>
      <tr>
          <th>Fragment Family</th>
          <th>Molecules</th>
          <th>GPT-3 HOMO</th>
          <th>GNN HOMO</th>
          <th>GPT-3 LUMO</th>
          <th>GNN LUMO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Naphthalene</td>
          <td>475</td>
          <td>0.94</td>
          <td>0.95</td>
          <td>0.88</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Anthracene</td>
          <td>577</td>
          <td>0.99</td>
          <td>1.00</td>
          <td>0.93</td>
          <td>0.97</td>
      </tr>
      <tr>
          <td>Tetracene</td>
          <td>72</td>
          <td>0.96</td>
          <td>1.00</td>
          <td>0.90</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Pyrene</td>
          <td>237</td>
          <td>0.98</td>
          <td>1.00</td>
          <td>0.97</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Perylene</td>
          <td>41</td>
          <td>0.98</td>
          <td>1.00</td>
          <td>0.98</td>
          <td>0.95</td>
      </tr>
  </tbody>
</table>
<p>GPT-3 generalizes well to unknown PAH families, though GNNs have a slight edge on HOMO prediction. Performance degrades somewhat for quinones and imides.</p>
<h3 id="canonical-vs-non-canonical-smiles">Canonical vs. Non-Canonical SMILES</h3>
<p>A model fine-tuned only on canonical SMILES performed poorly on non-canonical variants: only 1,622 of 8,578 molecules achieved consistent predictions across all 11 SMILES variants (1 canonical + 10 non-canonical). Augmenting the training data with 5 non-canonical SMILES per molecule dramatically improved consistency to 7,243 of 8,578 molecules and nearly eliminated erroneous (non-class-label) responses. This finding highlights that GPT-3&rsquo;s pattern matching is highly sensitive to surface-level string representation and benefits substantially from <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">data augmentation</a>.</p>
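<p>The augmentation step can be sketched with RDKit's <code>doRandom</code> flag, which emits a random (non-canonical) atom ordering on each call. This is a generic sketch of SMILES enumeration under the counts described above, assuming a reasonably recent RDKit:</p>

```python
from rdkit import Chem

def enumerate_smiles(smiles, n=5, max_tries=500):
    """Return up to n distinct randomized SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    tries = 0
    while len(variants) < n and tries < max_tries:
        tries += 1
        # canonical=False + doRandom=True yields a random atom ordering.
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
    return sorted(variants)

# Every variant parses back to the same canonical SMILES:
canonical = Chem.MolToSmiles(Chem.MolFromSmiles("CC(=O)Nc1ccccc1"))
for s in enumerate_smiles("CC(=O)Nc1ccccc1"):
    print(s)
    assert Chem.MolToSmiles(Chem.MolFromSmiles(s)) == canonical
```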
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>The main findings are:</p>
<ol>
<li>Fine-tuned GPT-3 (ada) achieves competitive accuracy with GNNs for coarse-grained (ternary) HOMO/LUMO classification, but performance drops more steeply with finer granularity.</li>
<li>The model shows robustness to single-atom and single-group ablation, suggesting it captures chemically redundant patterns.</li>
<li>Generalization to held-out molecular families is strong, though GNNs maintain a slight advantage.</li>
<li>SMILES augmentation with non-canonical variants is essential for consistent predictions.</li>
</ol>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Black-box nature</strong>: GPT-3 provides no physical insight or interpretability, unlike GNN models where molecular graph features can be augmented with domain knowledge.</li>
<li><strong>Tokenization</strong>: The generic tokenizer does not respect chemical structure. A chemistry-aware tokenizer could improve data efficiency and accuracy.</li>
<li><strong>SELFIES underperformance</strong>: Initial tests with SELFIES did not improve over SMILES, likely because generic tokenization stripped away the extra chemical information SELFIES encodes.</li>
<li><strong>Cost</strong>: Fine-tuning via OpenAI&rsquo;s API cost approximately $500 for the experiments, and the model is closed-source, preventing systematic interpretation of learned representations.</li>
<li><strong>Classification only</strong>: The approach performs coarse-grained classification rather than regression, limiting utility for applications requiring precise numerical predictions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>OSC molecules from CSD</td>
          <td>48,182</td>
          <td>SMILES + DFT-computed HOMO/LUMO energies</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Aromatic molecular photocatalysts (AMPs)</td>
          <td>572</td>
          <td>Experimental hydrogen evolution rates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fine-tuning uses OpenAI&rsquo;s GPT-3 &ldquo;ada&rdquo; base model via the API</li>
<li>Prompt-completion pairs in JSONL format</li>
<li>Default GPT-3 tokenizer</li>
<li>80/20 train/test split for OSC; stratified 10-fold CV for AMPs</li>
<li>Non-canonical SMILES generated using RDKit (10 per molecule for testing, 5 per molecule for augmented training)</li>
</ul>
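<p>The prompt-completion records can be sketched as below. The <code>###</code> separator and leading-space completion follow OpenAI's legacy fine-tuning conventions; the paper's exact template is not reproduced in this note, so treat the details as illustrative assumptions.</p>

```python
import json

def make_record(smiles, label):
    """One JSONL fine-tuning record in OpenAI's prompt/completion format.

    The '\\n\\n###\\n\\n' separator and leading-space completion are the
    documented legacy conventions; the paper's exact template may differ.
    """
    return json.dumps({
        "prompt": f"{smiles}\n\n###\n\n",
        "completion": f" {label}",
    })

records = [
    make_record("CC(=O)Nc1ccccc1", 1),
    make_record("c1ccccc1", 0),
]
print("\n".join(records))  # one JSON object per line -> JSONL
```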
<h3 id="models">Models</h3>
<ul>
<li>GPT-3 &ldquo;ada&rdquo; (fine-tuned, closed-source, accessed via OpenAI API)</li>
<li>Chemprop D-MPNN baseline (open-source)</li>
<li>RDKit descriptors + scikit-learn SVM baseline</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best GPT-3 Value</th>
          <th>Best GNN Value</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>0.92</td>
          <td>0.94</td>
          <td>3-class HOMO (OSCs)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>0.94</td>
          <td>0.94</td>
          <td>3-class LUMO (OSCs)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>0.88</td>
          <td>0.86</td>
          <td>2-class HER (AMPs)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify local hardware requirements. All GPT-3 fine-tuning was conducted via OpenAI&rsquo;s cloud API at a total cost of approximately $500.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XieZikai/Chem-GPT-Finetune">Chem-GPT-Finetune</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Python code and datasets for fine-tuning and evaluation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xie, Z., Evangelopoulos, X., Omar, O. H., Troisi, A., Cooper, A. I., &amp; Chen, L. (2024). Fine-tuning GPT-3 for machine learning electronic and functional properties of organic molecules. <em>Chemical Science</em>, 15(2), 500-510.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xie2024finetuning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fine-tuning {GPT-3} for machine learning electronic and functional properties of organic molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Zikai and Evangelopoulos, Xenophon and Omar, {\&#34;O}mer H. and Troisi, Alessandro and Cooper, Andrew I. and Chen, Linjiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{500--510}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3SC04610A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Benchmarking LLMs for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/benchmarking-llms-molecule-prediction/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/benchmarking-llms-molecule-prediction/</guid><description>Empirical evaluation of GPT-3.5, GPT-4, and Llama-2 on six OGB molecular property prediction tasks, comparing LLMs against GNNs and language models.</description><content:encoded><![CDATA[<h2 id="empirical-benchmarking-of-llms-on-molecular-tasks">Empirical Benchmarking of LLMs on Molecular Tasks</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates whether large language models (LLMs) can handle molecular property prediction tasks. The primary contribution is a structured benchmarking framework that compares LLMs (GPT-3.5, GPT-4, Llama-2-7b, Llama-2-13b) against conventional ML models (DeBERTa, GCN, GIN) across six standard molecular benchmark datasets from OGB. The study also introduces a collaborative framework where LLM-generated responses augment ML model features.</p>
<h2 id="why-benchmark-llms-on-molecular-property-prediction">Why Benchmark LLMs on Molecular Property Prediction</h2>
<p>LLMs have demonstrated strong capabilities across many NLP tasks, but their effectiveness on structured scientific data, particularly molecular graphs, remains unclear. Prior work has explored LLMs for chemistry tasks such as <a href="/notes/chemistry/molecular-design/reaction-prediction/">reaction prediction</a>, <a href="/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/">name-to-SMILES translation</a>, and molecule description. However, a systematic evaluation of LLMs on standard molecular property prediction benchmarks (classification and regression) with controlled prompt engineering has been lacking.</p>
<p>The key questions motivating this work:</p>
<ol>
<li>Can LLMs effectively predict molecular properties when given <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings and textual descriptions of molecular structure?</li>
<li>Does encoding geometric structure information as text help LLMs understand molecules?</li>
<li>Can LLM responses serve as useful augmentations for traditional ML models?</li>
</ol>
<h2 id="prompt-engineering-for-molecular-prediction">Prompt Engineering for Molecular Prediction</h2>
<p>The core methodological contribution is a systematic prompt engineering framework for querying LLMs on molecule tasks. Given a molecule $\mathcal{G} = (S, G, D)$ where $S$ is the SMILES string, $G$ is the geometric structure, and $D$ is a generated text description of atom features and graph structure, the authors design several prompt templates:</p>
<p><strong>Zero-shot prompts</strong> (three variants):</p>
<ul>
<li><strong>Input-Feature (IF)</strong>: Asks for general insights about a molecule given its SMILES and description</li>
<li><strong>Input-Prediction (IP)</strong>: Asks for a direct prediction in a specified format</li>
<li><strong>Input-Explanation (IE)</strong>: Asks for both a prediction and an explanation</li>
</ul>
<p>Each zero-shot prompt has a variant with descriptions (IFD, IPD, IED) that encodes atom features and graph structure as additional text following the approach of Fatemi et al. (2023).</p>
<p><strong>Few-shot prompts (FS-k)</strong>: Provide $k$ labeled examples as in-context learning demonstrations before the query. The study uses $k \in \{1, 2, 3\}$.</p>
<p>The authors also explore three predictive model pipelines:</p>
<ul>
<li><strong>Solo</strong>: A single model (LLM, LM, or GNN) makes predictions independently</li>
<li><strong>Duo</strong>: An ML model receives both the original features and LLM-generated responses as input</li>
<li><strong>Trio</strong>: A GNN receives SMILES embeddings from an LM plus LLM response embeddings alongside geometric features</li>
</ul>
<p>The LLM prediction can be formalized as $A = f_{LLM}(Q)$ where $Q$ is the prompt and $A$ is the response. For the ML augmentation pipelines, the LM-based Duo model predicts as:</p>
<p>$$\hat{y} = f_{LM}(S, R)$$</p>
<p>where $R$ is the LLM response, and the GNN-based Trio model predicts as:</p>
<p>$$\hat{y} = f_{GNN}(G, X)$$</p>
<p>where $X$ includes features derived from both SMILES embeddings and LLM response embeddings.</p>
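<p>The Duo idea reduces to concatenating the original molecular features with an embedding of the LLM's response and training one downstream model on the joint input. A minimal sketch with synthetic stand-ins for both feature blocks (not the paper's actual features or embedding model):</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_mols = 300

# Stand-ins: molecular features (e.g., fingerprint-like vectors) and a
# text-embedding of the LLM's response for each molecule. Both synthetic.
mol_features = rng.normal(size=(n_mols, 64))
llm_embeddings = rng.normal(size=(n_mols, 32))
y = (mol_features[:, 0] > 0).astype(int)

# Duo pipeline: concatenate the two feature blocks, then fit one classifier.
X_duo = np.hstack([mol_features, llm_embeddings])
clf = LogisticRegression(max_iter=1000).fit(X_duo, y)
print(f"train accuracy: {clf.score(X_duo, y):.2f}")
```

<p>The Trio variant follows the same pattern, except the joint features feed a GNN alongside the molecular graph rather than a flat classifier.</p>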
<h2 id="experimental-setup-across-six-ogb-benchmarks">Experimental Setup Across Six OGB Benchmarks</h2>
<h3 id="datasets">Datasets</h3>
<p>The study uses six molecular property prediction datasets from OGB and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Avg. Nodes</th>
          <th>Avg. Edges</th>
          <th>Task Type</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ogbg-molbace</td>
          <td>1,513</td>
          <td>34.1</td>
          <td>73.7</td>
          <td>Binary classification (<a href="https://en.wikipedia.org/wiki/Beta-secretase_1">BACE-1</a> inhibition)</td>
      </tr>
      <tr>
          <td>ogbg-molbbbp</td>
          <td>2,039</td>
          <td>24.1</td>
          <td>51.9</td>
          <td>Binary classification (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a> penetration)</td>
      </tr>
      <tr>
          <td>ogbg-molhiv</td>
          <td>41,127</td>
          <td>25.5</td>
          <td>27.5</td>
          <td>Binary classification (HIV inhibition)</td>
      </tr>
      <tr>
          <td>ogbg-molesol</td>
          <td>1,128</td>
          <td>13.3</td>
          <td>27.4</td>
          <td>Regression (water solubility)</td>
      </tr>
      <tr>
          <td>ogbg-molfreesolv</td>
          <td>642</td>
          <td>8.7</td>
          <td>16.8</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Hydration_energy">hydration free energy</a>)</td>
      </tr>
      <tr>
          <td>ogbg-mollipo</td>
          <td>4,200</td>
          <td>27.0</td>
          <td>59.0</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a>)</td>
      </tr>
  </tbody>
</table>
<p>Classification tasks are evaluated by <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> (higher is better) and regression tasks by RMSE (lower is better).</p>
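<p>Both metrics are standard; a quick scikit-learn sketch with toy arrays (not data from the paper):</p>

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Classification: ROC-AUC over predicted scores (higher is better).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc_score(y_true, y_score))  # 3 of 4 pos/neg pairs ranked correctly -> 0.75

# Regression: RMSE (lower is better).
y_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
rmse = mean_squared_error(y_reg, y_pred) ** 0.5
print(rmse)
```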
<h3 id="models-compared">Models Compared</h3>
<ul>
<li><strong>LLMs</strong>: GPT-3.5 (primary), GPT-4, Llama-2-7b, Llama-2-13b, all used as black-box APIs with fixed parameters</li>
<li><strong>Language Model</strong>: DeBERTa, fine-tuned on SMILES strings</li>
<li><strong>GNNs</strong>: GCN and GIN, trained on geometric molecular structure</li>
</ul>
<h3 id="key-results-llms-alone-vs-ml-models">Key Results: LLMs Alone vs. ML Models</h3>
<p>The paper presents five main observations:</p>
<p><strong>Observation 1: GPT models outperform Llama models on molecule tasks.</strong> On the ogbg-molhiv dataset, GPT-3.5 and GPT-4 consistently outperform Llama-2-7b and Llama-2-13b across all prompt variants. GPT-4 offers marginal improvement over GPT-3.5 at 20x the cost and 10x the latency, so GPT-3.5 is used as the default LLM.</p>
<p><strong>Observation 2: LLMs lag behind ML models across all datasets.</strong> Across all six datasets, LLM-based approaches underperform compared to DeBERTa, GCN, and GIN. For example, on ogbg-molhiv, the best LLM achieves 0.5892 ROC-AUC (IP prompt) compared to GIN&rsquo;s 0.7601. On regression tasks, the gap is even larger: GIN achieves 0.9555 RMSE on ogbg-molesol versus the best LLM&rsquo;s 1.9963.</p>
<p><strong>Observation 3: Text descriptions of molecular geometry do not help LLMs.</strong> Adding structural descriptions (the &ldquo;D&rdquo; variants of prompts) generally degrades LLM performance and reduces response consistency. The additional tokens from structure descriptions appear to introduce noise rather than useful geometric information.</p>
<p><strong>Observation 4: Geometric structure is critical for molecular prediction.</strong> GNN models that directly process molecular graphs substantially outperform both LLMs and text-based language models, confirming that geometric information is essential for accurate property prediction.</p>
<p><strong>Observation 5: LLMs can augment ML models effectively.</strong> When LLM responses are used as additional features for GNN models (Duo and Trio pipelines), several configurations show improvements. For example, on ogbg-molbace, GCN with FS-2 augmentation achieves 0.7903 test ROC-AUC versus baseline GCN&rsquo;s 0.7147. GIN with SMILES features (Duo pipeline) achieves 0.7837 on ogbg-molhiv versus the baseline GIN&rsquo;s 0.7601.</p>
<h3 id="response-consistency">Response Consistency</h3>
<p>The study also measures response consistency, defined as the fraction of LLM responses conforming to the required output format. Adding descriptions to prompts reduces consistency, and few-shot prompts generally improve consistency over zero-shot variants.</p>
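<p>Response consistency is just a fraction of format-conforming outputs; the particular <code>Answer: True/False</code> template in this sketch is an assumed example format, not the paper's exact one:</p>

```python
import re

def response_consistency(responses, pattern=r"^Answer:\s*(True|False)\s*$"):
    """Fraction of LLM responses matching the required output format.

    The 'Answer: True/False' template is an assumed example format.
    """
    valid = re.compile(pattern)
    ok = sum(1 for r in responses if valid.match(r.strip()))
    return ok / len(responses)

responses = [
    "Answer: True",
    "Answer: False",
    "The molecule likely inhibits HIV because ...",  # non-conforming
    "Answer: True",
]
print(response_consistency(responses))  # 3 of 4 conform -> 0.75
```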
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>LLMs are not competitive with specialized ML models for molecular property prediction when used directly, with GNNs maintaining clear advantages across all six benchmark datasets.</li>
<li>Converting molecular geometric structure to text descriptions is insufficient for conveying structural information to LLMs, as evidenced by degraded performance and reduced response consistency with description-augmented prompts.</li>
<li>LLMs show the most promise as augmenters of existing ML models rather than as standalone predictors, with the Duo and Trio pipelines yielding improvements over Solo baselines in many configurations.</li>
<li>Among LLMs, GPT-3.5 offers the best cost-performance tradeoff for molecule tasks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The study is limited to black-box API access with fixed LLM parameters. Fine-tuning or parameter-efficient adaptation (e.g., LoRA) was not explored due to computational constraints and API limitations.</li>
<li>Advanced prompting techniques (Chain-of-Thought, Tree-of-Thought, Graph-of-Thought, RAG) were tested in preliminary experiments but performed worse, which the authors attribute to the difficulty of designing proper reasoning chains for molecular property prediction.</li>
<li>Only six datasets from OGB/MoleculeNet are evaluated. Other molecular tasks (e.g., reaction prediction, retrosynthesis) are not covered.</li>
<li>The evaluation uses a single random seed for LLM queries, and the stochastic nature of LLM outputs means results may vary across runs.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors identify three promising avenues: (1) developing methods to better incorporate molecular geometric structure into LLM inputs, (2) designing more sophisticated frameworks for integrating LLMs with traditional ML models, and (3) training domain-specialized chemistry LLMs that can reduce hallucinations in chemical reasoning.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molbace</td>
          <td>1,513 molecules</td>
          <td>Binary classification, BACE-1 inhibition</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molbbbp</td>
          <td>2,039 molecules</td>
          <td>Binary classification, BBB penetration</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molhiv</td>
          <td>41,127 molecules</td>
          <td>Binary classification, HIV inhibition</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molesol</td>
          <td>1,128 molecules</td>
          <td>Regression, water solubility</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molfreesolv</td>
          <td>642 molecules</td>
          <td>Regression, hydration free energy</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-mollipo</td>
          <td>4,200 molecules</td>
          <td>Regression, lipophilicity</td>
      </tr>
  </tbody>
</table>
<p>All datasets use standard OGB scaffold splits.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Zero-shot prompts: IF, IP, IE (and description-augmented variants IFD, IPD, IED)</li>
<li>Few-shot prompts: FS-1, FS-2, FS-3</li>
<li>Solo/Duo/Trio integration pipelines for combining LLM outputs with ML models</li>
<li>DeBERTa fine-tuned on SMILES strings</li>
<li>GCN and GIN with OGB benchmark implementations</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-3.5 and GPT-4 via OpenAI API with default hyperparameters</li>
<li>Llama-2-7b and Llama-2-13b via HuggingFace</li>
<li>DeBERTa (DeBERTaV3)</li>
<li>GCN and GIN following OGB leaderboard implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification (molbace, molbbbp, molhiv)</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression (molesol, molfreesolv, mollipo)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Response consistency</td>
          <td>All tasks</td>
          <td>Fraction of format-conforming LLM outputs</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the paper. LLM experiments use API calls (OpenAI) and HuggingFace inference. GNN and DeBERTa training uses standard implementations from OGB benchmark leaderboards.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhiqiangzhongddu/LLMaMol">LLMaMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation with prompt templates and evaluation pipeline</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhong, Z., Zhou, K., &amp; Mottin, D. (2024). Benchmarking Large Language Models for Molecule Prediction Tasks. arXiv preprint arXiv:2403.05075.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhong2024benchmarking,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Benchmarking Large Language Models for Molecule Prediction Tasks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhong, Zhiqiang and Zhou, Kuangyu and Mottin, Davide}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2403.05075}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2403.05075}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Benchmarking Chemistry Knowledge in Code-Gen LLMs</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/llm-chemistry-code-assessment/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/llm-chemistry-code-assessment/</guid><description>Benchmarking code-generating LLMs on 84 chemistry tasks spanning general chemistry, biochemistry, and computational chemistry with prompt engineering analysis.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: White, A. D., Hocky, G. M., Gandhi, H. A., Ansari, M., Cox, S., Wellawatte, G. P., Sasmal, S., Yang, Z., Liu, K., Singh, Y., &amp; Peña Ccoa, W. J. (2023). Assessment of chemistry knowledge in large language models that generate code. <em>Digital Discovery</em>, 2(2), 368-376. <a href="https://doi.org/10.1039/d2dd00087c">https://doi.org/10.1039/d2dd00087c</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ur-whitelab/nlcc-data">nlcc-data benchmark repository</a></li>
<li><a href="https://ur-whitelab.github.io/nlcc-data/">Evaluation completions website</a></li>
<li><a href="https://doi.org/10.5281/zenodo.6800475">Zenodo evaluation data (DOI: 10.5281/zenodo.6800475)</a></li>
</ul>
<h2 id="benchmarking-chemistry-knowledge-in-code-generating-llms">Benchmarking Chemistry Knowledge in Code-Generating LLMs</h2>
<p>This is an <strong>Empirical</strong> paper that evaluates code-generating large language models on chemistry tasks. The primary contribution is a categorized benchmark of 84 chemistry problems across 10 topics, along with a systematic evaluation of several LLMs (Codex cushman, Codex davinci, text-davinci-003, InCoder, CodeGen) on these tasks. The paper also provides practical guidance on prompt engineering strategies that improve accuracy.</p>
<h2 id="why-evaluate-llms-on-chemistry-coding-tasks">Why Evaluate LLMs on Chemistry Coding Tasks</h2>
<p>As of late 2022, LLMs trained on code (such as Codex and InCoder) had become widely available through tools like GitHub Copilot and Tabnine. An open question was whether these general-purpose code models contained sufficient domain knowledge to solve chemistry problems expressed as coding tasks. Chemistry has specialized language, equations, and conventions (e.g., <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> notation, thermodynamic relationships, molecular simulation methods) that may not be well-represented in general code training data. Prior work had shown that knowledge of the periodic table requires very high parameter counts, but the broader extent of chemistry knowledge in code LLMs was unexplored.</p>
<p>The authors sought to answer a specific question: do code-generating LLMs &ldquo;know&rdquo; chemistry? This means evaluating whether LLMs can correlate natural language descriptions of chemistry problems with correct code implementations, including proper equations, units, and use of domain-specific libraries.</p>
<h2 id="benchmark-design-and-prompt-engineering-strategies">Benchmark Design and Prompt Engineering Strategies</h2>
<p>The benchmark covers 10 topic categories:</p>
<table>
  <thead>
      <tr>
          <th>Topic</th>
          <th>Abbreviation</th>
          <th>N</th>
          <th>Expert-only</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Biochemistry</td>
          <td>bio</td>
          <td>13</td>
          <td>2</td>
      </tr>
      <tr>
          <td>Cheminformatics</td>
          <td>cheminf</td>
          <td>10</td>
          <td>0</td>
      </tr>
      <tr>
          <td>General chemistry</td>
          <td>genchem</td>
          <td>11</td>
          <td>0</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-simulation/">Molecular dynamics</a></td>
          <td>md</td>
          <td>11</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Plotting</td>
          <td>plot</td>
          <td>10</td>
          <td>10</td>
      </tr>
      <tr>
          <td>Quantum mechanics</td>
          <td>qm</td>
          <td>8</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Simulation methods</td>
          <td>sim</td>
          <td>8</td>
          <td>5</td>
      </tr>
      <tr>
          <td>Spectroscopy</td>
          <td>spect</td>
          <td>11</td>
          <td>1</td>
      </tr>
      <tr>
          <td>Statistics</td>
          <td>stats</td>
          <td>11</td>
          <td>1</td>
      </tr>
      <tr>
          <td>Thermodynamics</td>
          <td>thermo</td>
          <td>10</td>
          <td>0</td>
      </tr>
  </tbody>
</table>
<p>Each task is formatted as a Python function with a docstring describing the expected behavior. The LLM must generate a completion that passes automated unit tests. Of the 84 total prompts, 25 require expert evaluation (e.g., plotting tasks) where automated testing is insufficient.</p>
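<p>A hypothetical task in this format might look like the following sketch. The function name, docstring, and unit test here are invented for illustration; actual benchmark prompts differ, but the shape (signature plus docstring shown to the model, completion graded by tests) is the same.</p>

```python
# Hypothetical benchmark task: the model sees PROMPT and must generate the body.
PROMPT = '''def henderson_hasselbalch(pka: float, ratio: float) -> float:
    """Return the buffer pH given the acid's pKa and the ratio
    [A-]/[HA] of conjugate base to acid."""
'''

# A plausible model completion (indented function body):
COMPLETION = """    import math
    return pka + math.log10(ratio)
"""

# Automated grading: assemble the function and run unit tests against it.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
assert abs(namespace["henderson_hasselbalch"](4.76, 1.0) - 4.76) < 1e-9
```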
<p>The key prompt engineering insight is the use of &ldquo;contexts,&rdquo; which are code prepended before prompts. The authors tested several context strategies:</p>
<ul>
<li><strong>Custom context</strong>: Topic-specific imports (e.g., <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> for cheminformatics) plus a one-line completion example to teach the model how to signal the end of output.</li>
<li><strong>Insert context</strong>: Uses model infilling capabilities instead of completion-based generation. Available for davinci and InCoder.</li>
<li><strong>Copyright context</strong>: Adding a copyright notice at the top of the file, which conditions the model toward higher-quality code patterns.</li>
<li><strong>Authority context</strong>: Adding &ldquo;This is written by an expert Python programmer.&rdquo;</li>
</ul>
<p>The copyright notice improved accuracy at higher temperatures. The intuition is that copyrighted code in training data tends to be higher quality, so the notice acts similarly to lowering the temperature. Because the best model/temperature combination (davinci at T=0.05) already sampled near-deterministically, the copyright trick did not improve it further.</p>
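<p>Mechanically, each context strategy amounts to prepending a string to the task prompt. A minimal sketch (the context wording below paraphrases the strategies described above and is not the benchmark&rsquo;s exact text):</p>

```python
# Illustrative contexts; exact strings in the benchmark may differ.
CONTEXTS = {
    "copyright": "# Copyright (c) 2022. All rights reserved.\n",
    "authority": "# This is written by an expert Python programmer.\n",
    "custom": "from rdkit import Chem  # topic-specific imports\n",
}

def build_prompt(task_prompt, context=None):
    """Prepend the selected context (if any) to the task prompt."""
    prefix = CONTEXTS.get(context, "")
    return prefix + task_prompt

print(build_prompt('def mol_weight(smiles):\n    """..."""\n', "copyright"))
```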
<h2 id="experimental-setup-models-sampling-and-expert-evaluation">Experimental Setup: Models, Sampling, and Expert Evaluation</h2>
<h3 id="models-evaluated">Models evaluated</h3>
<p>The study compared five models, all decoder-only architectures:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Abbreviation</th>
          <th>Parameters</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>code-cushman-001</td>
          <td>cushman</td>
          <td>12B</td>
          <td>OpenAI (GPT-3 fine-tuned on code)</td>
      </tr>
      <tr>
          <td>code-davinci-002</td>
          <td>davinci</td>
          <td>~175B (estimated)</td>
          <td>OpenAI (GPT-3.5 class)</td>
      </tr>
      <tr>
          <td>text-davinci-003</td>
          <td>davinci3</td>
          <td>~175B (estimated)</td>
          <td>OpenAI (RLHF-adapted from davinci)</td>
      </tr>
      <tr>
          <td>InCoder</td>
          <td>incoder</td>
          <td>6B</td>
          <td>Fried et al. 2022</td>
      </tr>
      <tr>
          <td>CodeGen</td>
          <td>codegen</td>
          <td>16B</td>
          <td>Nijkamp et al. 2022</td>
      </tr>
  </tbody>
</table>
<h3 id="sampling-and-evaluation">Sampling and evaluation</h3>
<p>Completions were generated using top-k sampling (k=5) at three temperatures: T=0.05, 0.2, and 0.5. For InCoder-6B, GPU memory limited sampling to k=1. Error bars in all reported results are 95% confidence intervals from <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrap resampling</a> across top-k samples.</p>
<p>Accuracy was defined following the HumanEval approach: a completion is correct if the code runs and passes unit tests, regardless of whether it matches a reference implementation.</p>
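<p>The pass-rate accuracy and its error bars can be sketched with a plain bootstrap over per-completion pass/fail outcomes (a generic stdlib implementation, not the authors&rsquo; code):</p>

```python
import random
import statistics

def bootstrap_ci(passes, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for pass-rate accuracy via bootstrap resampling.

    `passes` is a list of 0/1 outcomes (unit tests failed/passed), one per
    sampled completion, mirroring the HumanEval-style grading described above.
    """
    rng = random.Random(seed)
    n = len(passes)
    means = sorted(
        statistics.fmean(rng.choices(passes, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# e.g. 7 of 10 completions passed their unit tests:
low, high = bootstrap_ci([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
assert low <= 0.7 <= high
```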
<h3 id="expert-evaluation">Expert evaluation</h3>
<p>Nine co-authors (postdoctoral scholars and Ph.D. students) performed 650 evaluations of davinci completions through a web interface. Each completion was scored on a 5-point scale: Perfect (5), Correct but not perfect (4), Runs and is almost correct (3), Does not run but is almost correct (2), Far from correct (1). Expert-evaluated accuracy counted only &ldquo;Perfect&rdquo; and &ldquo;Correct but not perfect&rdquo; as correct.</p>
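<p>Collapsing the 5-point expert scale to the binary accuracy used in the paper is a one-liner; a small sketch:</p>

```python
# 5-point expert scale as described above; scores 4 and 5 count as correct.
SCALE = {
    5: "Perfect",
    4: "Correct but not perfect",
    3: "Runs and is almost correct",
    2: "Does not run but is almost correct",
    1: "Far from correct",
}

def expert_accuracy(scores):
    """Fraction of completions rated 4 or 5."""
    return sum(s >= 4 for s in scores) / len(scores)

assert expert_accuracy([5, 4, 3, 1]) == 0.5
```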
<h3 id="key-results-by-topic-and-model">Key results by topic and model</h3>
<table>
  <thead>
      <tr>
          <th>Topic</th>
          <th>incoder</th>
          <th>codegen</th>
          <th>davinci</th>
          <th>davinci3</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>bio</td>
          <td>0%</td>
          <td>29%</td>
          <td>43%</td>
          <td>86%</td>
      </tr>
      <tr>
          <td>cheminf</td>
          <td>20%</td>
          <td>20%</td>
          <td>50%</td>
          <td>50%</td>
      </tr>
      <tr>
          <td>genchem</td>
          <td>29%</td>
          <td>86%</td>
          <td>86%</td>
          <td>86%</td>
      </tr>
      <tr>
          <td>md</td>
          <td>0%</td>
          <td>13%</td>
          <td>63%</td>
          <td>88%</td>
      </tr>
      <tr>
          <td>qm</td>
          <td>20%</td>
          <td>60%</td>
          <td>100%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>sim</td>
          <td>0%</td>
          <td>0%</td>
          <td>100%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>spect</td>
          <td>30%</td>
          <td>20%</td>
          <td>50%</td>
          <td>40%</td>
      </tr>
      <tr>
          <td>stats</td>
          <td>40%</td>
          <td>80%</td>
          <td>70%</td>
          <td>60%</td>
      </tr>
      <tr>
          <td>thermo</td>
          <td>10%</td>
          <td>10%</td>
          <td>80%</td>
          <td>70%</td>
      </tr>
      <tr>
          <td><strong>total</strong></td>
          <td><strong>17%</strong></td>
          <td><strong>35%</strong></td>
          <td><strong>72%</strong></td>
          <td><strong>75%</strong></td>
      </tr>
  </tbody>
</table>
<p>All accuracies reported use the best context for each model (copyright for incoder-6B, authority for codegen-16B, insert for davinci) at T=0.2.</p>
<h2 id="findings-llms-know-chemistry-with-caveats">Findings: LLMs Know Chemistry, With Caveats</h2>
<p>The central finding is that code-generating LLMs do contain substantial chemistry knowledge. The best model (davinci) achieved 72% overall accuracy, with prompt engineering contributing approximately 30 percentage points to this figure. The text-davinci-003 model, which was fine-tuned with RLHF, achieved 75% and showed reduced sensitivity to prompt engineering, suggesting that human feedback alignment partially subsumes the benefits of manual prompt design.</p>
<h3 id="strengths-and-successful-domains">Strengths and successful domains</h3>
<ul>
<li><strong>Quantum mechanics and simulation</strong>: davinci achieved 100% on both categories, indicating strong knowledge of computational chemistry equations and simulation patterns.</li>
<li><strong>General chemistry</strong>: All models except InCoder performed well (86%), suggesting that general chemistry concepts are well-represented in code training data.</li>
<li><strong>Molecular structure generation</strong>: InstructGPT showed some ability to connect natural language descriptions with SMILES strings, generating valid (though not exact) molecular structures from prompts like &ldquo;a phenol derivative.&rdquo;</li>
</ul>
<h3 id="limitations-and-failure-modes">Limitations and failure modes</h3>
<ul>
<li><strong>Lack of reasoning</strong>: The authors emphasize that LLMs demonstrate knowledge correlation, not reasoning. Davinci frequently uses &ldquo;relativistic <a href="https://en.wikipedia.org/wiki/Hartree%E2%80%93Fock_method">Hartree-Fock</a>&rdquo; for any prompt requesting a &ldquo;highly accurate&rdquo; quantum calculation, because it has memorized the association between &ldquo;relativistic&rdquo; and &ldquo;accurate&rdquo; rather than understanding the underlying chemistry.</li>
<li><strong>Hallucinated functions</strong>: When given difficult prompts (e.g., &ldquo;return the <a href="https://en.wikipedia.org/wiki/Residual_dipolar_coupling">residual dipolar couplings</a> given a SMILES string&rdquo;), the model invents non-existent functions like <code>MolToRDC</code>.</li>
<li><strong>API version mismatches</strong>: Many errors in the molecular dynamics category stem from the model using outdated function signatures for packages like MDTraj, likely reflecting the training data cutoff.</li>
<li><strong>Expert-evaluated accuracy is lower</strong>: On topics requiring expert evaluation (generally harder tasks), accuracy drops, and it correlates negatively with perceived difficulty.</li>
</ul>
<h3 id="practical-recommendations">Practical recommendations</h3>
<p>The paper offers several practical tips for using code LLMs in chemistry:</p>
<ol>
<li>Use correctly spelled, precise prompts. If a function should &ldquo;return&rdquo; a value, use the word &ldquo;return&rdquo; rather than &ldquo;compute.&rdquo;</li>
<li>Be explicit about what variables represent (e.g., specify that k is a spring constant, not Boltzmann&rsquo;s constant).</li>
<li>Import only the packages you intend to use, as the model will attempt to use all imported libraries.</li>
<li>Adding a copyright notice or &ldquo;expert programmer&rdquo; statement can improve accuracy, though RLHF-trained models are less sensitive to this.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>nlcc-data benchmark</td>
          <td>84 prompts across 10 chemistry topics</td>
          <td>Open source, community-extensible</td>
      </tr>
      <tr>
          <td>Expert evaluation</td>
          <td>Human evaluations CSV</td>
          <td>650 evaluations</td>
          <td>Available in Supporting Information</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Evaluation uses automated unit testing for 59 of 84 prompts. Expert evaluation covers the remaining 25 prompts through a web-based scoring interface. Five completions per prompt were generated via top-k sampling at three temperatures.</p>
<h3 id="models">Models</h3>
<p>All models evaluated are external (OpenAI API for Codex/davinci, HuggingFace for InCoder/CodeGen). No new models were trained. Python version and packages were pinned to June 2021 to avoid library changes influencing results.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Accuracy is binary: a completion passes all unit tests (1.0) or fails (0.0), averaged across top-k samples and temperatures. Expert evaluation uses a 5-point scale collapsed to binary (Perfect or Correct = 1.0).</p>
<h3 id="hardware">Hardware</h3>
<p>GPU memory limitations are mentioned for InCoder-6B (sampling restricted to k=1 rather than k=5). No other hardware details are specified.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ur-whitelab/nlcc-data">nlcc-data benchmark</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Open-source benchmark prompts and solutions</td>
      </tr>
      <tr>
          <td><a href="https://ur-whitelab.github.io/nlcc-data/">Evaluation website</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Web interface showing completions</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.6800475">Zenodo evaluation data</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Expert evaluation completions in HTML</td>
      </tr>
      <tr>
          <td><a href="https://pubs.rsc.org/en/content/articlepdf/2023/dd/d2dd00087c">Paper (open access)</a></td>
          <td>Other</td>
          <td>CC-BY-NC</td>
          <td>Published article</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{white2023assessment,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Assessment of chemistry knowledge in large language models that generate code}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{White, Andrew D. and Hocky, Glen M. and Gandhi, Heta A. and Ansari, Mehrad and Cox, Sam and Wellawatte, Geemi P. and Sasmal, Subarna and Yang, Ziyue and Liu, Kangxin and Singh, Yuvraj and Peña Ccoa, Willmor J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{368--376}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d2dd00087c}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AMORE: Testing ChemLLM Robustness to SMILES Variants</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/</guid><description>AMORE is a zero-shot framework testing whether chemical language models recognize equivalent SMILES of the same molecule via embedding retrieval.</description><content:encoded><![CDATA[<h2 id="an-empirical-framework-for-probing-chemical-understanding">An Empirical Framework for Probing Chemical Understanding</h2>
<p>This is an <strong>Empirical</strong> paper that introduces Augmented Molecular Retrieval (AMORE), a zero-shot evaluation framework for chemical language models (ChemLMs). The primary contribution is a method to assess whether ChemLMs have learned genuine molecular semantics or simply memorize textual patterns. Rather than relying on traditional NLP metrics like BLEU and ROUGE, AMORE tests whether a model&rsquo;s embedding space treats chemically equivalent <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations as similar. The authors evaluate 12 models across multiple architectures (encoder-only, encoder-decoder, decoder-only) on two datasets and five augmentation types, and extend the analysis to downstream MoleculeNet tasks.</p>
<h2 id="why-standard-nlp-metrics-fail-for-chemical-evaluation">Why Standard NLP Metrics Fail for Chemical Evaluation</h2>
<p>Chemical language models are typically evaluated using text-based metrics from NLP (BLEU, ROUGE, METEOR) on tasks like molecule captioning. These metrics compare word overlap and sentence fluency but cannot detect whether a model truly understands molecular structure. A SMILES string like <code>C(=O)O</code> and its canonicalized or kekulized form represent the same molecule, yet text-based metrics would penalize valid reformulations. Embedding-based metrics like BERTScore are also insufficient because they were trained on general text, not chemical notation.</p>
<p>The core research question is direct: do evaluation metrics used on ChemLMs reflect actual chemical knowledge, or do the models simply imitate understanding by learning textual features? This question has practical consequences in pharmaceuticals and healthcare, where missteps in chemical reasoning carry serious risks.</p>
<h2 id="embedding-based-retrieval-as-a-chemical-litmus-test">Embedding-Based Retrieval as a Chemical Litmus Test</h2>
<p>AMORE exploits a fundamental property of molecular representations: a single molecule can be written as multiple valid SMILES strings that are chemically identical. These serve as &ldquo;total synonyms,&rdquo; a concept without a true analogue in natural language.</p>
<p>The framework works in four steps:</p>
<ol>
<li>Take a set $X = (x_1, x_2, \ldots, x_n)$ of $n$ molecular representations.</li>
<li>Apply a transformation $f$ to obtain augmented representations $X' = (x'_1, x'_2, \ldots, x'_n)$, where $x'_i = f(x_i)$. The constraint is that $f$ must not change the underlying molecule.</li>
<li>Obtain vectorized embeddings $e(x_i)$ and $e(x'_j)$ from the model for each original and augmented SMILES.</li>
<li>Evaluate in a retrieval task: given $e(x_i)$, retrieve $e(x&rsquo;_i)$ from the augmented set.</li>
</ol>
<p>The evaluation metrics are top-$k$ accuracy (whether the correct augmented SMILES ranks at position $\leq k$) and <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">Mean Reciprocal Rank</a> (MRR). Retrieval uses <a href="https://en.wikipedia.org/wiki/FAISS">FAISS</a> for efficient nearest-neighbor search. The key insight is that if a model truly understands molecular structure, it should embed different SMILES representations of the same molecule close together.</p>
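<p>These steps reduce to a nearest-neighbor ranking over embeddings. A toy stand-in for the FAISS search (brute-force cosine ranking over small, invented embedding lists; the paper&rsquo;s implementation differs in scale and indexing):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two small dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def retrieval_metrics(originals, augmented, k=1):
    """Acc@k and MRR for retrieving each e(x'_i) given e(x_i)."""
    hits, rr = 0, 0.0
    for i, q in enumerate(originals):
        # Rank all augmented embeddings by similarity to the query.
        ranked = sorted(range(len(augmented)),
                        key=lambda j: -cosine(q, augmented[j]))
        rank = ranked.index(i) + 1  # position of the true match
        hits += rank <= k
        rr += 1.0 / rank
    n = len(originals)
    return hits / n, rr / n

# Identical embeddings retrieve perfectly:
acc1, mrr = retrieval_metrics([[1, 0], [0, 1]], [[1, 0], [0, 1]])
assert acc1 == 1.0 and mrr == 1.0
```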
<h3 id="five-smiles-augmentation-types">Five SMILES Augmentation Types</h3>
<p>The framework uses five identity-preserving augmentations, all executed through <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>:</p>
<ol>
<li><strong>Canonicalization</strong>: Transform SMILES to the standardized RDKit canonical form.</li>
<li><strong>Hydrogen addition</strong>: Explicitly add hydrogen atoms that are normally implied (e.g., <code>C</code> becomes <code>[CH4]</code>). This dramatically increases string length.</li>
<li><strong>Kekulization</strong>: Convert aromatic ring notation to explicit alternating double bonds.</li>
<li><strong>Cycle renumbering</strong>: Replace ring-closure digit identifiers with random valid alternatives.</li>
<li><strong>Random atom order</strong>: Randomize the atom traversal order used to generate the SMILES string.</li>
</ol>
<h2 id="twelve-models-two-datasets-five-augmentations">Twelve Models, Two Datasets, Five Augmentations</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>The authors test 12 publicly available Transformer-based models spanning three architecture families:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Domain</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text+Chem T5-standard</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>Text+Chem T5-augm</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-large</td>
          <td>Cross-modal</td>
          <td>770M</td>
      </tr>
      <tr>
          <td>SciFive</td>
          <td>Text-only</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>PubChemDeBERTa</td>
          <td>Chemical</td>
          <td>86M</td>
      </tr>
      <tr>
          <td>ChemBERT-ChEMBL</td>
          <td>Chemical</td>
          <td>6M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>Chemical</td>
          <td>125M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a></td>
          <td>Chemical</td>
          <td>400M</td>
      </tr>
      <tr>
          <td>ZINC-RoBERTa</td>
          <td>Chemical</td>
          <td>102M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/">nach0</a></td>
          <td>Chemical</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>ZINC-GPT</td>
          <td>Chemical</td>
          <td>87M</td>
      </tr>
  </tbody>
</table>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>ChEBI-20 test set</strong>: ~3,300 molecule-description pairs, used for both AMORE retrieval and molecule captioning comparisons.</li>
<li><strong>Isomers</strong> (<a href="/notes/chemistry/datasets/qm9/">QM9</a> subset): 918 molecules that are all isomers of C9H12N2O, making retrieval harder because all molecules share the same molecular formula.</li>
</ul>
<h3 id="key-results-on-chebi-20">Key Results on ChEBI-20</h3>
<p>On the ChEBI-20 dataset (Table 2 from the paper), top-1 accuracy varies enormously by augmentation type. Cycle renumbering is easiest (up to 98.48% Acc@1 for SciFive), while hydrogen addition is hardest (no model exceeds 5.97% Acc@1).</p>
<p>For the cross-modal Text+Chem T5-standard model:</p>
<table>
  <thead>
      <tr>
          <th>Augmentation</th>
          <th>Acc@1</th>
          <th>Acc@5</th>
          <th>MRR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>63.03</td>
          <td>82.76</td>
          <td>72.4</td>
      </tr>
      <tr>
          <td>Hydrogen</td>
          <td>5.46</td>
          <td>10.85</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>Kekulization</td>
          <td>76.76</td>
          <td>92.03</td>
          <td>83.8</td>
      </tr>
      <tr>
          <td>Cycle</td>
          <td>96.70</td>
          <td>99.82</td>
          <td>98.2</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>46.94</td>
          <td>74.18</td>
          <td>59.33</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-on-isomers">Key Results on Isomers</h3>
<p>Performance drops substantially on the Isomers dataset, where all molecules share the same formula. The best Acc@1 for hydrogen augmentation is just 1.53% (MolT5-large). Even for the relatively easy cycle augmentation, top scores drop from the high 90s to the low 90s for most models, and some models (BARTSmiles: 41.83%) struggle considerably.</p>
<h3 id="downstream-moleculenet-impact">Downstream MoleculeNet Impact</h3>
<p>The authors also fine-tuned models on original <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> training data and tested on augmented test sets across 9 tasks (regression, binary classification, multilabel classification). Results confirm that augmentations degrade downstream performance. For example, on ESOL regression, RMSE increased from 0.87 to 7.93 with hydrogen addition. Rankings computed using the Vote&rsquo;n&rsquo;Rank framework (using the <a href="https://en.wikipedia.org/wiki/Copeland%27s_method">Copeland rule</a>) show that hydrogen augmentation is the only one that substantially reshuffles model rankings; other augmentations preserve the original ordering.</p>
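<p>A minimal sketch of the Copeland rule used for the rankings, assuming the common wins-minus-losses formulation (Vote&rsquo;n&rsquo;Rank&rsquo;s exact aggregation may differ):</p>

```python
# Copeland rule: model A "beats" B if A outperforms B on more tasks than
# B outperforms A; each model's score is pairwise wins minus losses.
def copeland(scores):
    """scores: {model: [per-task metric, higher is better]} -> ranked models."""
    models = list(scores)
    points = {m: 0 for m in models}
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            wins_a = sum(x > y for x, y in zip(scores[a], scores[b]))
            wins_b = sum(y > x for x, y in zip(scores[a], scores[b]))
            if wins_a > wins_b:
                points[a] += 1
                points[b] -= 1
            elif wins_b > wins_a:
                points[b] += 1
                points[a] -= 1
    return sorted(models, key=points.get, reverse=True)

ranking = copeland({"m1": [0.9, 0.8, 0.7],
                    "m2": [0.5, 0.9, 0.6],
                    "m3": [0.4, 0.3, 0.2]})
assert ranking == ["m1", "m2", "m3"]
```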
<h3 id="correlation-between-amore-and-captioning-metrics">Correlation Between AMORE and Captioning Metrics</h3>
<p>The differences in ROUGE/METEOR between original and augmented SMILES correlate with AMORE retrieval accuracy (Spearman correlation &gt; 0.7 with p-value = 0.003 for Acc@1). This validates AMORE as a proxy for predicting how augmentations will affect generation quality, without requiring labeled captioning data.</p>
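<p>A correlation check of this kind needs only rank correlation; a minimal tie-free Spearman sketch (generic implementation, not the authors&rsquo; code):</p>

```python
# Spearman correlation without tie handling: Pearson correlation of ranks.
# With distinct values, both rank vectors have identical variance, so the
# denominator simplifies to the variance of the ranks 1..n.
def spearman(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n + 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

assert spearman([1, 2, 3, 4], [10, 20, 30, 40]) == 1.0
assert spearman([1, 2, 3, 4], [40, 30, 20, 10]) == -1.0
```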
<h2 id="current-chemlms-learn-syntax-not-chemistry">Current ChemLMs Learn Syntax, Not Chemistry</h2>
<p>The central finding is that existing ChemLMs are not robust to identity-preserving SMILES augmentations. Several specific conclusions emerge:</p>
<ol>
<li>
<p><strong>Hydrogen augmentation is catastrophic</strong>: All models fail (&lt; 6% Acc@1 on ChEBI-20, &lt; 2% on Isomers). The authors attribute this to the near-complete absence of explicit hydrogen in pretraining data, creating a distribution shift.</p>
</li>
<li>
<p><strong>Cross-modal models outperform unimodal ones</strong>: Models trained on both text and SMILES (Text+Chem T5, MolT5) consistently achieve higher retrieval accuracy on four of five augmentations.</p>
</li>
<li>
<p><strong>Augmentation difficulty follows a consistent order</strong>: For all models, hydrogen is hardest, followed by random atom ordering, canonicalization, kekulization, and cycle renumbering (easiest), matching the per-augmentation accuracies reported above.</p>
</li>
<li>
<p><strong>Layer-wise analysis reveals instability</strong>: Retrieval accuracy across Transformer layers is correlated across augmentation types, suggesting that representations degrade at the same layers regardless of augmentation.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> partially explains difficulty</strong>: Hydrogen augmentation produces strings ~2x longer than originals (Levenshtein ratio of 1.49), but the low correlation between Levenshtein ratio and downstream metrics (ROUGE1 correlation of -0.05 for hydrogen) suggests string length alone does not explain the failure.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations. Only publicly available HuggingFace models were evaluated, excluding models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">Molformer</a> that lack HF checkpoints. The study focuses exclusively on SMILES sequences, not 3D molecular structures or other formats like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. The augmentation types, while representative, do not cover all possible identity transformations.</p>
<p>The authors suggest that AMORE could serve as a regularization tool during training, for example by using metric learning to encourage models to embed SMILES variants of the same molecule close together.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Retrieval evaluation</td>
          <td>ChEBI-20 test set</td>
          <td>3,300 molecules</td>
          <td>Standard benchmark for molecule captioning</td>
      </tr>
      <tr>
          <td>Retrieval evaluation</td>
          <td>Isomers (QM9 subset)</td>
          <td>918 molecules</td>
          <td>All isomers of C9H12N2O</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>Varies</td>
          <td>ESOL, Lipophilicity, FreeSolv, HIV, BBBP, BACE, Tox21, ToxCast, SIDER</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES augmentations via RDKit (canonicalization, hydrogen addition, kekulization, cycle renumbering, random atom ordering)</li>
<li>Nearest-neighbor retrieval using FAISS with L2, cosine, inner product, and HNSW metrics</li>
<li>Model ranking via Vote&rsquo;n&rsquo;Rank (Copeland rule) on MoleculeNet tasks</li>
</ul>
<h3 id="models">Models</h3>
<p>All 12 evaluated models are publicly available on HuggingFace. No custom model training was performed for the AMORE retrieval experiments. MoleculeNet experiments used standard fine-tuning on original training splits.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Acc@1</td>
          <td>Top-1 retrieval accuracy</td>
          <td>Primary AMORE metric</td>
      </tr>
      <tr>
          <td>Acc@5</td>
          <td>Top-5 retrieval accuracy</td>
          <td>Secondary AMORE metric</td>
      </tr>
      <tr>
          <td>MRR</td>
          <td>Mean Reciprocal Rank</td>
          <td>Average rank of correct match</td>
      </tr>
      <tr>
          <td>ROUGE-2</td>
          <td>Bigram overlap for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
      <tr>
          <td>METEOR</td>
          <td>MT evaluation metric for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Computational resources from HPC facilities at HSE University. Specific GPU types and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ChemistryLLMs/AMORE">AMORE GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Framework code and evaluation data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ganeeva, V., Khrabrov, K., Kadurin, A., &amp; Tutubalina, E. (2025). Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework. <em>Journal of Cheminformatics</em>, 17(1). <a href="https://doi.org/10.1186/s13321-025-01079-0">https://doi.org/10.1186/s13321-025-01079-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ganeeva2025measuring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ganeeva, Veronika and Khrabrov, Kuzma and Kadurin, Artur and Tutubalina, Elena}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-025-01079-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>String Representations for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</guid><description>Ablation study comparing SMILES, DeepSMILES, SELFIES, and InChI for OCSR. SMILES achieves highest accuracy; SELFIES guarantees validity.</description><content:encoded><![CDATA[<h2 id="empirical-focus-and-resource-contributions">Empirical Focus and Resource Contributions</h2>
<p>This is an <strong>Empirical Paper</strong> ($\Psi_{\text{Empirical}}$) with a secondary contribution as a <strong>Resource Paper</strong> ($\Psi_{\text{Resource}}$).</p>
<p>It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, DeepSMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.</p>
<h2 id="the-syntax-challenge-in-chemical-image-recognition">The Syntax Challenge in Chemical Image Recognition</h2>
<p>Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings (where ring closures and branches are marked by single characters potentially far apart in the sequence) creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.</p>
<h2 id="isolating-string-representation-variables">Isolating String Representation Variables</h2>
<p>The core novelty is the <strong>comparative isolation of the string representation variable</strong> in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between <em>validity</em> (guaranteed by SELFIES) and <em>accuracy</em> (highest with SMILES).</p>
<h2 id="large-scale-image-to-text-translation-experiments">Large-Scale Image-to-Text Translation Experiments</h2>
<p>The authors performed a large-scale image-to-text translation experiment:</p>
<ul>
<li><strong>Task</strong>: Converting 2D chemical structure images into text strings.</li>
<li><strong>Data</strong>:
<ul>
<li><strong>ChEMBL</strong>: ~1.6M molecules, split into two datasets (with and without stereochemistry).</li>
<li><strong>PubChem</strong>: ~3M molecules, split similarly, to test performance scaling with data size.</li>
</ul>
</li>
<li><strong>Representations</strong>: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI.</li>
<li><strong>Metric</strong>: The models were evaluated on:
<ul>
<li><strong>Validity</strong>: Can the predicted string be decoded back to a molecule?</li>
<li><strong>Exact Match</strong>: Is the predicted string identical to the ground truth?</li>
<li><strong>Tanimoto Similarity</strong>: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)? The similarity $\mathcal{T}$ between two molecular fingerprints $A$ and $B$ is calculated as:
$$ \mathcal{T}(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</li>
</ul>
</li>
</ul>
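<p>For binary fingerprints, the Tanimoto coefficient above reduces to the number of shared on-bits over the number of bits set in either fingerprint. A minimal pure-Python sketch (the paper computes PubChem fingerprints via CDK; the bit sets below are purely illustrative):</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two binary fingerprints,
    represented as sets of on-bit indices: |A ∩ B| / |A ∪ B|.
    For bit vectors this equals A·B / (|A|^2 + |B|^2 - A·B)."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Toy example: two fingerprints sharing 2 of 4 total on-bits
fp_pred = {1, 5, 9}
fp_true = {1, 5, 17}
print(tanimoto(fp_pred, fp_true))  # 2 shared bits / 4 union bits = 0.5
```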
<h2 id="comparative-performance-and-validity-trade-offs">Comparative Performance and Validity Trade-offs</h2>
<ul>
<li><strong>SMILES is the most accurate</strong>: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.</li>
<li><strong>SELFIES guarantees validity</strong>: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.</li>
<li><strong>InChI is unsuitable</strong>: InChI performed markedly worse (approx. 64% exact match), largely because its strings are far longer (maximum lengths up to 273 characters).</li>
<li><strong>Stereochemistry adds difficulty</strong>: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.</li>
<li><strong>Recommendation</strong>: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (Dataset 1/2)</td>
          <td>~1.5M</td>
          <td>Filtered for MW &lt; 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B).</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem (Dataset 3/4)</td>
          <td>~3.0M</td>
          <td>Same filtering rules, used to test scaling.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Test Split</td>
          <td>~120k&ndash;250k</td>
          <td>Created using RDKit MaxMin algorithm to ensure chemical diversity.</td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: CDK Structure Diagram Generator (SDG).</li>
<li><strong>Specs</strong>: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tokenization Rules</strong> (Critical for replication):</p>
<ul>
<li><strong>SELFIES</strong>: Split at every <code>][</code> (e.g., <code>[C][N]</code> $\rightarrow$ <code>[C]</code>, <code>[N]</code>).</li>
<li><strong>SMILES / DeepSMILES</strong>: Regex-based splitting:
<ul>
<li>Every heavy atom (e.g., <code>C</code>, <code>N</code>).</li>
<li>Every parenthesis <code>(</code> and <code>)</code>.</li>
<li>Every bond symbol <code>=</code> and <code>#</code>.</li>
<li>Every single-digit number.</li>
<li>Everything inside square brackets <code>[]</code> is kept as a single token.</li>
</ul>
</li>
<li><strong>InChI</strong>: The prefix <code>InChI=1S/</code> was treated as a single token and removed during training, then re-added for evaluation.</li>
</ul>
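<p>The rules above can be sketched as small pure-Python tokenizers. The regex and the handling of two-letter elements (<code>Cl</code>, <code>Br</code>) are my reconstruction of the stated rules, not the authors' code:</p>

```python
import re

# SMILES/DeepSMILES: bracketed atoms stay whole; two-letter halogens
# (Cl, Br) must match before single letters; parentheses, bond symbols
# (= and #), and single digits become individual tokens.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|[()=#]|\d")

def tokenize_smiles(s: str) -> list:
    return SMILES_TOKEN.findall(s)

def tokenize_selfies(s: str) -> list:
    # Split at every '][' boundary, keeping brackets on each token.
    return re.findall(r"\[[^\]]*\]", s)

print(tokenize_smiles("CC(=O)Oc1ccccc1"))  # aspirin-like fragment
print(tokenize_selfies("[C][N][=O]"))      # ['[C]', '[N]', '[=O]']
```

<p>Ordering <code>Br|Cl</code> before the single-letter alternative is essential; otherwise <code>Cl</code> would split into <code>C</code> + <code>l</code>.</p>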
<h3 id="models">Models</h3>
<p>The model follows the <strong>DECIMER</strong> architecture.</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-B3 (pre-trained with &ldquo;Noisy Student&rdquo; weights).
<ul>
<li>Output: Image feature vectors of shape $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder</strong>: Transformer (similar to the &ldquo;Base&rdquo; model from <em>Attention Is All You Need</em>).
<ul>
<li>Layers: 4 encoder-decoder layers.</li>
<li>Attention Heads: 8.</li>
<li>Dimension ($d_{\text{model}}$): 512.</li>
<li>Feed-forward ($d_{\text{ff}}$): 2048.</li>
<li>Dropout: 10%.</li>
</ul>
</li>
<li><strong>Loss</strong>: Sparse categorical cross-entropy.</li>
<li><strong>Optimizer</strong>: Adam with custom learning rate scheduler.</li>
</ul>
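<p>The &ldquo;custom learning rate scheduler&rdquo; is presumably the warmup schedule from <em>Attention Is All You Need</em>; the sketch below uses that paper's default warmup of 4000 steps, which is an assumption rather than a value reported in this study:</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5):
    linear warmup for `warmup` steps, then inverse-sqrt decay."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate peaks at step == warmup and decays afterwards.
assert transformer_lr(2000) < transformer_lr(4000)
assert transformer_lr(8000) < transformer_lr(4000)
```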
<h3 id="evaluation">Evaluation</h3>
<p>Metrics were calculated after converting all predictions back to standard SMILES.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Baseline (SMILES)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Identical Match</strong></td>
          <td>88.62% (PubChem)</td>
          <td>Strict character-for-character equality.</td>
      </tr>
      <tr>
          <td><strong>Valid Structure</strong></td>
          <td>99.78%</td>
          <td>SMILES had rare syntax errors; SELFIES achieved 100%.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (Avg)</strong></td>
          <td>0.98</td>
          <td>Calculated using PubChem fingerprints via CDK.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Google Cloud TPUs (v3-8).</li>
<li><strong>Format</strong>: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.</li>
<li><strong>Batch Size</strong>: 1024.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER_Short_Communication">DECIMER Short Communication</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts (Python, Java)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5155037">Datasets on Zenodo</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>SMILES data and processing scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Steinbeck, C., &amp; Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. <em>Digital Discovery</em>, 1(2), 84-90. <a href="https://doi.org/10.1039/D1DD00013F">https://doi.org/10.1039/D1DD00013F</a></p>
<p><strong>Publication</strong>: Digital Discovery 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://chemrxiv.org/doi/pdf/10.26434/chemrxiv-2021-7c9wf">ChemRxiv Preprint (PDF)</a></li>
<li><a href="https://github.com/Kohulan/DECIMER_Short_Communication">Official Code Repository</a></li>
<li><a href="https://doi.org/10.5281/zenodo.5155037">Data on Zenodo</a></li>
<li>Related work: <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanPerformanceChemicalStructure2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{84--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D1DD00013F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>