<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Transformers on Hunter Heidenreich | Senior AI Research Scientist</title><link>https://hunterheidenreich.com/tags/transformers/</link><description>Recent content in Transformers on Hunter Heidenreich | Senior AI Research Scientist</description><image><title>Hunter Heidenreich | Senior AI Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 31 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/transformers/index.xml" rel="self" type="application/rss+xml"/><item><title>SpeechT5: Unified Speech-Text Pre-Training Framework</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</guid><description>SpeechT5 introduces a shared encoder-decoder framework with cross-modal vector quantization for joint speech and text pre-training across six tasks.</description><content:encoded><![CDATA[<h2 id="a-unified-encoder-decoder-for-spoken-language-processing">A Unified Encoder-Decoder for Spoken Language Processing</h2>
<p>SpeechT5 is a <strong>Method</strong> paper that introduces a shared encoder-decoder pre-training framework for spoken language processing. Inspired by <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5&rsquo;s</a> text-to-text paradigm, SpeechT5 reformulates all spoken language tasks as &ldquo;speech/text to speech/text&rdquo; problems. The framework uses modal-specific pre-nets and post-nets to interface between raw speech or text and a shared Transformer encoder-decoder, enabling a single pre-trained model to handle six downstream tasks: automatic speech recognition (ASR), text-to-speech synthesis (TTS), speech translation (ST), voice conversion (VC), speech enhancement (SE), and speaker identification (SID).</p>
<h2 id="bridging-the-gap-between-speech-and-text-pre-training">Bridging the Gap Between Speech and Text Pre-Training</h2>
<p>Prior speech pre-training work (wav2vec 2.0, HuBERT) suffered from two key limitations. First, these models learned speech representations from unlabeled audio alone, ignoring the complementary information in text data that is critical for cross-modal tasks like ASR and TTS. Second, they relied on encoder-only architectures with task-specific prediction heads, leaving the decoder un-pretrained for sequence-to-sequence generation tasks.</p>
<p>SpeechT5 addresses both gaps by (1) jointly pre-training on unlabeled speech and text data, and (2) using a full encoder-decoder architecture that benefits generation tasks directly. The approach builds on the observation that speech and text, despite their surface differences, share underlying semantic structure that a unified representation can capture.</p>
<h2 id="cross-modal-vector-quantization-for-alignment">Cross-Modal Vector Quantization for Alignment</h2>
<p>The core innovation in SpeechT5 is a cross-modal <a href="https://en.wikipedia.org/wiki/Vector_quantization">vector quantization</a> (VQ) mechanism that aligns speech and text representations into a shared semantic space. The architecture consists of three components:</p>
<p><strong>Shared encoder-decoder backbone.</strong> A Transformer with 12 encoder blocks and 6 decoder blocks (768-dim, 12 heads), using relative position embeddings.</p>
<p><strong>Modal-specific pre/post-nets.</strong> Six specialized networks handle the conversion between raw modalities and the shared representation space:</p>
<ul>
<li>Speech-encoder pre-net: a convolutional feature extractor (from wav2vec 2.0) downsampling raw waveforms</li>
<li>Speech-decoder pre-net: three FC layers with ReLU, processing 80-dimensional log Mel-filterbank features</li>
<li>Speech-decoder post-net: a linear layer predicting Mel features plus five 1D conv layers (256 channels) for residual refinement, with an x-vector speaker embedding concatenated for multi-speaker support</li>
<li>Text pre/post-nets: shared embedding layers mapping between character-level token indices and hidden states (768-dim)</li>
</ul>
<p><strong>Cross-modal vector quantization.</strong> A shared codebook $\mathbf{C}^{K}$ with $K$ learnable embeddings bridges the two modalities. Encoder outputs $\mathbf{u}_i$ are quantized via nearest-neighbor lookup:</p>
<p>$$
\mathbf{c}_i = \arg\min_{j \in [K]} | \mathbf{u}_i - \mathbf{c}_j |_2
$$</p>
<p>A proportion (10%) of contextual representations are randomly replaced with these quantized latent units before being fed to the decoder&rsquo;s cross-attention. This mixing forces the quantizer to capture cross-modal features. A diversity loss encourages full codebook utilization:</p>
<p>$$
\mathcal{L}_d = \frac{1}{K} \sum_{k=1}^{K} p_k \log p_k
$$</p>
<h3 id="pre-training-objectives">Pre-Training Objectives</h3>
<p>SpeechT5 combines three pre-training objectives:</p>
<p><strong>Speech pre-training</strong> uses two tasks. A bidirectional masked prediction loss $\mathcal{L}_{mlm}^{s}$ follows HuBERT&rsquo;s approach, masking 8% of timesteps in 10-step spans and predicting frame-level targets from an acoustic unit discovery model:</p>
<p>$$
\mathcal{L}_{mlm}^{s} = \sum_{n \in \mathcal{M}} \log p(\mathbf{z}_n \mid \hat{\mathbf{H}}, n)
$$</p>
<p>A reconstruction loss $\mathcal{L}_{1}^{s}$ minimizes the $L_1$ distance between predicted and original Mel-filterbank features, plus a binary cross-entropy stop-token loss $\mathcal{L}_{bce}^{s}$.</p>
<p><strong>Text pre-training</strong> uses BART-style denoising, masking 30% of text spans (Poisson $\lambda = 3.5$) and training with maximum likelihood estimation:</p>
<p>$$
\mathcal{L}_{mle}^{t} = \sum_{n=1}^{N^t} \log p(\mathbf{y}_n^t \mid \mathbf{y}_{&lt; n}^t, \hat{\mathbf{X}}^t)
$$</p>
<p>The full pre-training loss combines all components:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{mlm}^{s} + \mathcal{L}_{1}^{s} + \mathcal{L}_{bce}^{s} + \mathcal{L}_{mle}^{t} + \gamma \mathcal{L}_d
$$</p>
<p>where $\gamma = 0.1$.</p>
<h2 id="evaluation-across-six-spoken-language-tasks">Evaluation Across Six Spoken Language Tasks</h2>
<p>SpeechT5 was evaluated on six downstream tasks, each using a different combination of the shared encoder-decoder and task-appropriate pre/post-nets:</p>
<h3 id="automatic-speech-recognition-asr">Automatic Speech Recognition (ASR)</h3>
<p>Fine-tuned on LibriSpeech 100h with joint <a href="https://en.wikipedia.org/wiki/Connectionist_temporal_classification">CTC</a>/attention decoding. The decoding objective maximizes a combination of decoder, CTC, and language model log-probabilities:</p>
<p>$$
\alpha \log P_{Dec} + (1 - \alpha) \log P_{CTC} + \beta \log P_{LM}
$$</p>
<p>where $\alpha = 0.5$ and $\beta = 1.0$ for the 100h setting (beam size 30). Results on the test sets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>LM</th>
          <th>test-clean</th>
          <th>test-other</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>-</td>
          <td>6.1</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td>HuBERT BASE</td>
          <td>-</td>
          <td>5.8</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>-</strong></td>
          <td><strong>4.4</strong></td>
          <td><strong>10.4</strong></td>
      </tr>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>Transf.</td>
          <td>2.6</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>Transf.</strong></td>
          <td><strong>2.4</strong></td>
          <td><strong>5.8</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="text-to-speech-synthesis-tts">Text-to-Speech Synthesis (TTS)</h3>
<p>Fine-tuned on LibriTTS 460h clean sets with HiFi-GAN vocoder:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Naturalness</th>
          <th>MOS</th>
          <th>CMOS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ground Truth</td>
          <td>-</td>
          <td>3.87 ± 0.04</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>2.76</td>
          <td>3.56 ± 0.05</td>
          <td>0</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>2.91</strong></td>
          <td><strong>3.65 ± 0.04</strong></td>
          <td><strong>+0.290</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-translation-st">Speech Translation (ST)</h3>
<p>Evaluated on MUST-C English-to-German and English-to-French:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>EN-DE</th>
          <th>EN-FR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fairseq ST</td>
          <td>22.70</td>
          <td>32.90</td>
      </tr>
      <tr>
          <td>Adapter Tuning</td>
          <td>24.63</td>
          <td>34.98</td>
      </tr>
      <tr>
          <td>Baseline (HuBERT init)</td>
          <td>23.43</td>
          <td>33.76</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>25.18</strong></td>
          <td><strong>35.30</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="voice-conversion-vc">Voice Conversion (VC)</h3>
<p>Evaluated on CMU Arctic:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WER (bdl→slt)</th>
          <th>MCD (bdl→slt)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VTN w/ TTS</td>
          <td>7.6%</td>
          <td>6.33</td>
      </tr>
      <tr>
          <td>Many-to-many VTN</td>
          <td>-</td>
          <td>6.13</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>7.8%</strong></td>
          <td><strong>5.93</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-enhancement-se">Speech Enhancement (SE)</h3>
<p>On WHAM! dataset, SpeechT5 reduced WER from 76.1% (noisy) to 8.9%, a relative 9% improvement over the baseline&rsquo;s 10.9%.</p>
<h3 id="speaker-identification-sid">Speaker Identification (SID)</h3>
<p>On VoxCeleb1, SpeechT5 achieved 96.49% accuracy, outperforming HuBERT LARGE at 90.33% (from SUPERB) and SpeechNet multi-task at 87.90%.</p>
<h2 id="ablation-study-and-key-findings">Ablation Study and Key Findings</h2>
<p>The ablation study reveals the contribution of each pre-training component:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ASR (clean)</th>
          <th>ASR (other)</th>
          <th>VC (MCD)</th>
          <th>SID (ACC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SpeechT5</td>
          <td>4.4</td>
          <td>10.7</td>
          <td>5.93</td>
          <td>96.49%</td>
      </tr>
      <tr>
          <td>w/o Speech PT</td>
          <td>-</td>
          <td>-</td>
          <td>6.49</td>
          <td>38.61%</td>
      </tr>
      <tr>
          <td>w/o Text PT</td>
          <td>5.4</td>
          <td>12.8</td>
          <td>6.03</td>
          <td>95.60%</td>
      </tr>
      <tr>
          <td>w/o Joint PT</td>
          <td>4.6</td>
          <td>11.3</td>
          <td>6.18</td>
          <td>95.54%</td>
      </tr>
      <tr>
          <td>w/o $\mathcal{L}_{mlm}^{s}$</td>
          <td>7.6</td>
          <td>22.4</td>
          <td>6.29</td>
          <td>90.91%</td>
      </tr>
  </tbody>
</table>
<p>Key findings:</p>
<ol>
<li><strong>Speech pre-training is critical</strong>: without it, ASR fails to converge entirely, and SID accuracy drops to 38.61%.</li>
<li><strong>Text pre-training complements speech</strong>: removing it degrades ASR by ~20% relative, confirming that textual knowledge transfers to speech tasks.</li>
<li><strong>Joint pre-training enables cross-modal transfer</strong>: the vector quantization approach is essential for modality-bridging tasks like ASR.</li>
<li><strong>The masked prediction loss $\mathcal{L}_{mlm}^{s}$ is the most important single component</strong>, responsible for learning strong acoustic features.</li>
</ol>
<p>The authors note limitations in the current scope (English-only, BASE model size) and propose scaling to larger models and multilingual settings as future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Speech pre-training</td>
          <td>LibriSpeech</td>
          <td>960 hours</td>
          <td>Full training set</td>
      </tr>
      <tr>
          <td>Text pre-training</td>
          <td>LibriSpeech LM text</td>
          <td>400M sentences</td>
          <td>Normalized language model text</td>
      </tr>
      <tr>
          <td>ASR fine-tuning</td>
          <td>LibriSpeech</td>
          <td>100h / 960h subsets</td>
          <td></td>
      </tr>
      <tr>
          <td>TTS fine-tuning</td>
          <td>LibriTTS</td>
          <td>460h clean sets</td>
          <td></td>
      </tr>
      <tr>
          <td>ST fine-tuning</td>
          <td>MUST-C</td>
          <td>EN-DE, EN-FR</td>
          <td></td>
      </tr>
      <tr>
          <td>VC fine-tuning</td>
          <td>CMU Arctic</td>
          <td>4 speakers</td>
          <td>bdl, clb, slt, rms</td>
      </tr>
      <tr>
          <td>SE fine-tuning</td>
          <td>WHAM!</td>
          <td>16 kHz max</td>
          <td>enhance-single task</td>
      </tr>
      <tr>
          <td>SID fine-tuning</td>
          <td>VoxCeleb1</td>
          <td>100k+ utterances</td>
          <td>1,251 speakers</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with warmup (8% of steps) to peak LR $2 \times 10^{-4}$, then linear decay</li>
<li>Speech masking: 8% of timesteps, 10-step spans</li>
<li>Text masking: 30% of spans, Poisson $\lambda = 3.5$</li>
<li>Vector quantization: 2 codebooks × 100 entries = $10^4$ theoretical maximum codes</li>
<li>CTC/attention joint decoding for ASR (beam size 30)</li>
<li>HiFi-GAN vocoder for TTS and SE waveform generation</li>
<li>Parallel WaveGAN vocoder for VC</li>
</ul>
<h3 id="fine-tuning-hyperparameters">Fine-Tuning Hyperparameters</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GPUs</th>
          <th>Steps</th>
          <th>Peak LR</th>
          <th>Batch (per GPU)</th>
          <th>Schedule</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ASR (100h)</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>6e-5</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>ASR (960h)</td>
          <td>8×V100</td>
          <td>320k</td>
          <td>1.3e-4</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>TTS</td>
          <td>8×V100</td>
          <td>120k</td>
          <td>4e-4</td>
          <td>45k tokens</td>
          <td>Warmup 10k steps, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>ST</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>-</td>
          <td>-</td>
          <td>Warmup 10k steps</td>
      </tr>
      <tr>
          <td>VC</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>1e-4</td>
          <td>20k tokens</td>
          <td>6k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SE</td>
          <td>8×V100</td>
          <td>100k</td>
          <td>1e-4</td>
          <td>16k tokens</td>
          <td>10k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SID</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>5e-4</td>
          <td>64 segments (3s each)</td>
          <td>Triangular cyclical (1e-8 to 5e-4)</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<ul>
<li>Encoder: 12 Transformer blocks (768-dim, 3072 FFN, 12 heads)</li>
<li>Decoder: 6 Transformer blocks (same dimensions)</li>
<li>Speech-encoder pre-net: 7 conv blocks (512 channels, strides [5,2,2,2,2,2,2], kernels [10,3,3,3,3,2,2])</li>
<li>Code and pre-trained models available at <a href="https://github.com/microsoft/SpeechT5">github.com/microsoft/SpeechT5</a> (MIT license)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/microsoft/SpeechT5">microsoft/SpeechT5</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official Fairseq-based implementation</td>
      </tr>
      <tr>
          <td>Pre-trained models (via repo)</td>
          <td>Model</td>
          <td>MIT</td>
          <td>SpeechT5 BASE encoder-decoder checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/12">LibriSpeech</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>960h speech pre-training and ASR fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/60">LibriTTS</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>460h TTS fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://ict.fbk.eu/must-c/">MUST-C</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND-4.0</td>
          <td>Speech translation fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://www.festvox.org/cmu_arctic/">CMU Arctic</a></td>
          <td>Dataset</td>
          <td>Free</td>
          <td>Voice conversion fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://wham.whisper.ai/">WHAM!</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-4.0</td>
          <td>Speech enhancement fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html">VoxCeleb1</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA-4.0</td>
          <td>Speaker identification fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 32 NVIDIA V100 GPUs</li>
<li>Batch: ~90s speech per GPU + 12k text tokens per GPU, gradient accumulation 2</li>
<li>Pre-training steps: 500k</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., Wei, Z., Qian, Y., Li, J., &amp; Wei, F. (2022). SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. <em>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, 5723-5738.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ao2022speecht,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5723--5738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2022.acl-long.393}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>T5: Exploring Transfer Learning Limits</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</guid><description>Raffel et al. systematically study transfer learning for NLP with a text-to-text framework, ablating architectures, objectives, data, and multi-task mixing.</description><content:encoded><![CDATA[<h2 id="a-systematic-study-of-nlp-transfer-learning">A systematic study of NLP transfer learning</h2>
<p>This is a <strong>systematization paper</strong> that provides a comprehensive empirical survey of transfer learning techniques for NLP. Rather than proposing a single new method, T5 introduces a unified text-to-text framework and uses it as a testbed to systematically compare pre-training objectives, architectures, unlabeled data sources, transfer approaches, and multi-task mixing strategies. The scale of the ablation study (covering dozens of configurations) and the release of C4, pre-trained models, and code make it both a reference guide and a resource.</p>
<h2 id="unifying-nlp-tasks-as-text-to-text">Unifying NLP tasks as text-to-text</h2>
<p>The core design decision is to cast every NLP task as a text-to-text problem: both the input and output are text strings, with a task-specific prefix. Classification, regression, summarization, translation, and question answering all use the same model, loss function (cross-entropy on output tokens), and decoding procedure. This simplicity enables fair comparison across tasks and training strategies.</p>
<p>The model architecture is a standard encoder-decoder Transformer. The paper finds that this form outperforms decoder-only (language model) and encoder-only (BERT-style) variants in the text-to-text setting, while having similar computational cost to decoder-only models despite twice the parameters (the encoder processes the input only once, then the decoder attends to it).</p>
<h2 id="multi-task-mixing-strategies-and-findings">Multi-task mixing: strategies and findings</h2>
<p>The most thesis-relevant contribution is the systematic ablation of multi-task mixing strategies (Section 3.5.2). When training on multiple tasks simultaneously (which in the text-to-text framework simply means mixing data from different sources), the central question is how to set the proportion of data from each task.</p>
<h3 id="three-mixing-strategies">Three mixing strategies</h3>
<p><strong>Examples-proportional mixing.</strong> Sample in proportion to each dataset&rsquo;s size, with an artificial cap $K$ on the maximum dataset size. Without the cap, the unsupervised pre-training data (orders of magnitude larger) would dominate all batches. The mixing rate for task $m$ is:</p>
<p>$$
r_{m} = \frac{\min(e_{m}, K)}{\sum_{n} \min(e_{n}, K)}
$$</p>
<p>where $e_{m}$ is the number of examples in task $m$&rsquo;s dataset.</p>
<p><strong>Temperature-scaled mixing.</strong> Raise each mixing rate $r_{m}$ to the power $1/T$ and renormalize. At $T=1$ this equals examples-proportional mixing; as $T$ increases, proportions approach equal mixing. Uses a large cap $K = 2^{21}$.</p>
<p><strong>Equal mixing.</strong> Sample uniformly from all tasks. Included as a negative reference: the model overfits on low-resource tasks and underfits on high-resource tasks.</p>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Mixing strategy</th>
          <th>GLUE</th>
          <th>CNN/DM</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
          <th>EnDe</th>
          <th>EnFr</th>
          <th>EnRo</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Baseline (pre-train/fine-tune)</td>
          <td>83.28</td>
          <td>19.24</td>
          <td>80.88</td>
          <td>71.36</td>
          <td>26.98</td>
          <td>39.82</td>
          <td>27.65</td>
      </tr>
      <tr>
          <td>Equal</td>
          <td>76.13</td>
          <td>19.02</td>
          <td>76.51</td>
          <td>63.37</td>
          <td>23.89</td>
          <td>34.31</td>
          <td>26.78</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{18}$</td>
          <td>81.67</td>
          <td>19.07</td>
          <td>78.17</td>
          <td>67.94</td>
          <td>24.57</td>
          <td>35.19</td>
          <td>27.39</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{19}$</td>
          <td>81.42</td>
          <td>19.24</td>
          <td>79.78</td>
          <td>67.30</td>
          <td>25.21</td>
          <td>36.30</td>
          <td>27.76</td>
      </tr>
      <tr>
          <td>Temperature-scaled, $T=2$</td>
          <td>81.90</td>
          <td>19.28</td>
          <td>79.42</td>
          <td>69.92</td>
          <td>25.42</td>
          <td>36.72</td>
          <td>27.20</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings on mixing:</strong></p>
<ol>
<li>
<p><strong>Multi-task training underperforms pre-train-then-fine-tune on most tasks.</strong> No mixing strategy matches the baseline of unsupervised pre-training followed by task-specific fine-tuning.</p>
</li>
<li>
<p><strong>Equal mixing is worst.</strong> It dramatically degrades performance, confirming that proportions matter.</p>
</li>
<li>
<p><strong>There exists a task-specific sweet spot for the cap $K$.</strong> Most tasks have an optimal $K$ value; larger or smaller values hurt. The exception is very high-resource tasks (WMT English-French) that always benefit from higher mixing proportions.</p>
</li>
<li>
<p><strong>Temperature scaling at $T=2$ provides the best single compromise.</strong> It achieves reasonable performance across all tasks without requiring per-task tuning of $K$.</p>
</li>
<li>
<p><strong>Multi-task pre-training followed by fine-tuning closes the gap.</strong> When multi-task training is used as pre-training (not as the final training stage), followed by task-specific fine-tuning, performance becomes comparable to unsupervised pre-training alone. This suggests that multi-task exposure during pre-training provides useful early signal without the negative effects of forcing a single model to perform all tasks simultaneously.</p>
</li>
<li>
<p><strong>&ldquo;Leave-one-out&rdquo; training works.</strong> Pre-training on a multi-task mixture that excludes a target task, then fine-tuning on it, produces only slightly worse results. This indicates that multi-task pre-training builds general capabilities that transfer to unseen tasks without dramatic task interference.</p>
</li>
</ol>
<h2 id="data-repetition-degrades-performance">Data repetition degrades performance</h2>
<p>The paper also systematically tests the effect of pre-training data set size by truncating C4 and training over repeated data:</p>
<table>
  <thead>
      <tr>
          <th>Unique tokens</th>
          <th>Repeats</th>
          <th>GLUE</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full dataset</td>
          <td>0</td>
          <td>83.28</td>
          <td>80.88</td>
          <td>71.36</td>
      </tr>
      <tr>
          <td>$2^{29}$</td>
          <td>64</td>
          <td>82.87</td>
          <td>80.97</td>
          <td>72.03</td>
      </tr>
      <tr>
          <td>$2^{27}$</td>
          <td>256</td>
          <td>82.62</td>
          <td>79.78</td>
          <td>69.97</td>
      </tr>
      <tr>
          <td>$2^{25}$</td>
          <td>1,024</td>
          <td>79.55</td>
          <td>76.27</td>
          <td>64.76</td>
      </tr>
      <tr>
          <td>$2^{23}$</td>
          <td>4,096</td>
          <td>76.34</td>
          <td>70.92</td>
          <td>59.29</td>
      </tr>
  </tbody>
</table>
<p>Performance degrades as data shrinks, with 64 repeats showing limited effects but 1,024+ repeats causing significant degradation. Training loss curves confirm memorization at high repetition counts. The paper recommends using large, diverse pre-training datasets whenever possible.</p>
<h2 id="scaling-and-final-configuration">Scaling and final configuration</h2>
<p>The paper compares scaling strategies: more data, larger models, and ensembles. Training a larger model for fewer steps generally outperforms training a smaller model on more data. Ensembles of independently pre-trained and fine-tuned models provide orthogonal gains.</p>
<p>The final T5-11B model combines the best choices from all ablations: encoder-decoder architecture, span corruption objective, C4 pre-training data, multi-task pre-training followed by fine-tuning, and scaling to 11B parameters trained on over 1 trillion tokens. It achieves state-of-the-art results on GLUE (90.3 average), SuperGLUE (88.9, near human performance of 89.8), SQuAD, and CNN/Daily Mail. It does not achieve state-of-the-art on WMT translation tasks, where methods using backtranslation and cross-lingual pre-training retain the lead.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The T5 paper&rsquo;s multi-task mixing findings are its most enduring contribution beyond the model itself. The core lessons: proportions matter enormously (equal mixing fails), examples-proportional mixing with a cap is a reasonable default, temperature scaling provides a single-knob alternative, and multi-task pre-training followed by fine-tuning can match pure unsupervised pre-training.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>All ablations use the same encoder-decoder architecture. Findings may not transfer to decoder-only models that dominate current practice.</li>
<li>The multi-task mixing experiments treat each task as a separate &ldquo;domain.&rdquo; Interactions between similar tasks (e.g., multiple classification tasks) are not isolated.</li>
<li>The paper does not provide a principled method for choosing $K$ or $T$; both require empirical search.</li>
<li>C4 has known quality issues (templated text, noisy content) that have been addressed in later datasets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, pre-trained models, and the C4 dataset are all publicly released.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>C4 (Colossal Clean Crawled Corpus)</td>
          <td>~750 GB</td>
          <td>Heuristically cleaned Common Crawl</td>
      </tr>
      <tr>
          <td>Downstream</td>
          <td>GLUE, SuperGLUE, SQuAD, CNN/DM, WMT (EnDe, EnFr, EnRo)</td>
          <td>Standard splits</td>
          <td>Text-to-text format</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>Encoder-decoder Transformer. Sizes: Base (220M), Small (60M), Large (770M), 3B, 11B. Baseline uses Base size. SentencePiece vocabulary with 32K tokens. Pre-trained for $2^{19}$ steps, fine-tuned for $2^{18}$ steps on individual tasks.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Multi-task mixing: examples-proportional with cap $K \in {2^{16}, \ldots, 2^{21}}$, temperature-scaled with $T \in {2, 4, 8}$, and equal mixing. Unsupervised objective: span corruption (mean span length 3, 15% corruption rate). Training with Adafactor optimizer, inverse square root learning rate schedule.</p>
<h3 id="hardware">Hardware</h3>
<p>All models trained using Mesh TensorFlow on TPU slices. T5-11B pre-trained for 1M steps with batch size $2^{11}$ sequences of length 512 (~1 trillion tokens total). Exact TPU pod configurations per experiment not detailed.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer">T5 Code</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official TensorFlow implementation (JAX successor: T5X)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints">T5 Models</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained checkpoints (Small through 11B)</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>~750 GB cleaned Common Crawl, via TensorFlow Datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{raffel2020exploring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{140}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SlimPajama-DC: Data Combinations for LLM Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</guid><description>Shen et al. study how global deduplication and domain combinations in SlimPajama affect LLM training, finding diversity after dedup is key.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-data-domain-combinations">An empirical study of data domain combinations</h2>
<p>This is a <strong>discovery paper</strong> that empirically investigates how different combinations and proportions of data domains affect language model pretraining. Using the SlimPajama dataset (a globally deduplicated, 627B token refinement of RedPajama), the study trains seven 1.3B model configurations with varying domain mixtures to identify which combinations and deduplication strategies produce the best downstream performance.</p>
<h2 id="why-data-combination-strategy-matters">Why data combination strategy matters</h2>
<p>Multi-source pretraining datasets combine data from web crawls, code repositories, books, academic papers, and other sources. Two underexplored questions drive this work: (1) Does deduplication within each source (local) versus across all sources (global) meaningfully affect model quality? (2) When sources are thoroughly deduplicated, how does the combination and proportion of domains affect downstream performance? Most open-source LLM training datasets (RedPajama, The Pile) perform only local deduplication, leaving cross-source redundancy unaddressed.</p>
<h2 id="global-deduplication-and-the-slimpajama-dataset">Global deduplication and the SlimPajama dataset</h2>
<p>SlimPajama applies global MinHashLSH deduplication (Jaccard similarity threshold 0.8, 13-gram signatures) across all seven data sources simultaneously. This reduces RedPajama&rsquo;s 1.2T tokens to 627B tokens, a roughly 48% reduction. The heaviest deduplication hits CommonCrawl and GitHub, which had the most cross-source overlap.</p>
<p>The key processing steps:</p>
<ol>
<li><strong>Low-length document filtering</strong>: Remove documents below a minimum length threshold.</li>
<li><strong>Global deduplication</strong>: MinHashLSH across all sources simultaneously, requiring 64 CPU cores and 1.4TB peak memory. This removes both within-source and between-source duplicates.</li>
</ol>
<p>The resulting dataset composition:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>SlimPajama</th>
          <th>RedPajama</th>
          <th>LLaMA 1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>52.2% (333B)</td>
          <td>72.6% (878B)</td>
          <td>67.0%</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>26.7% (170B)</td>
          <td>14.4% (175B)</td>
          <td>15.0%</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>5.2% (33B)</td>
          <td>4.9% (59B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>4.2% (27B)</td>
          <td>2.1% (26B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>4.6% (29B)</td>
          <td>2.3% (28B)</td>
          <td>2.5%</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>3.8% (24B)</td>
          <td>2.0% (24B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>3.3% (21B)</td>
          <td>1.7% (20B)</td>
          <td>2.0%</td>
      </tr>
  </tbody>
</table>
<h2 id="seven-domain-combination-configurations">Seven domain combination configurations</h2>
<p>All configurations train 1.3B parameter models on 330B tokens with identical architecture and hyperparameters. The configurations systematically vary domain diversity:</p>
<ul>
<li><strong>DC-1</strong>: CommonCrawl only (single source)</li>
<li><strong>DC-2</strong>: CommonCrawl + C4 (two web sources)</li>
<li><strong>DC-3</strong>: CommonCrawl + C4 with adjusted proportions</li>
<li><strong>DC-4</strong>: Wikipedia + Books + GitHub + ArXiv + StackExchange (no web crawl)</li>
<li><strong>DC-5</strong>: CommonCrawl + C4 + Wikipedia + Books (four sources, no code/academic)</li>
<li><strong>DC-6</strong>: All seven SlimPajama sources (maximum diversity)</li>
<li><strong>DC-7</strong>: RefinedWeb CommonCrawl (external single-source baseline)</li>
</ul>
<p>The experimental design probes: incremental diversity (DC-1 to DC-2 to DC-5 to DC-6), proportion sensitivity (DC-2 vs DC-3), source importance (DC-3 vs DC-4), and specialization vs generalization (individual vs combined).</p>
<h2 id="diversity-after-global-deduplication-drives-performance">Diversity after global deduplication drives performance</h2>
<h3 id="hugging-face-leaderboard-results">Hugging Face leaderboard results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Average</th>
          <th>ARC</th>
          <th>HellaSwag</th>
          <th>MMLU</th>
          <th>TruthfulQA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RedPajama-1.3B</td>
          <td>38.0</td>
          <td>37.2</td>
          <td>55.8</td>
          <td>24.9</td>
          <td>34.3</td>
      </tr>
      <tr>
          <td>DC-1 (CC only)</td>
          <td>38.5</td>
          <td>36.3</td>
          <td>56.0</td>
          <td>27.0</td>
          <td>34.8</td>
      </tr>
      <tr>
          <td>DC-4 (no web)</td>
          <td>37.6</td>
          <td>33.4</td>
          <td>53.3</td>
          <td>26.0</td>
          <td>37.6</td>
      </tr>
      <tr>
          <td>DC-6 (all sources)</td>
          <td>40.0</td>
          <td>33.7</td>
          <td>61.0</td>
          <td>26.9</td>
          <td>38.4</td>
      </tr>
      <tr>
          <td>DC-7 (RefinedWeb)</td>
          <td>41.0</td>
          <td>35.1</td>
          <td>64.7</td>
          <td>26.2</td>
          <td>37.9</td>
      </tr>
  </tbody>
</table>
<p><strong>Key patterns:</strong></p>
<ol>
<li>
<p><strong>More domain diversity improves average performance.</strong> The progression DC-1 (38.5) to DC-2 (38.4) to DC-5 (38.6) to DC-6 (40.0) shows that adding domains consistently lifts average accuracy once global deduplication has removed cross-source redundancy.</p>
</li>
<li>
<p><strong>Global deduplication enables clean combination.</strong> All SlimPajama configurations except DC-4 outperform RedPajama-1.3B (38.0), which uses local deduplication only. The elimination of cross-source overlap means adding sources contributes genuinely new information.</p>
</li>
<li>
<p><strong>Removing web crawl data hurts.</strong> DC-4 (no CommonCrawl/C4) scores lowest (37.6), demonstrating that web text provides essential breadth even when specialized sources are included.</p>
</li>
<li>
<p><strong>Individual domains excel at specific tasks.</strong> DC-1 (CC only) achieves the highest ARC and MMLU scores. DC-4 leads on Winogrande. DC-5 leads on WSC273. No single combination dominates all tasks, reinforcing that diversity trades specialization for generalization.</p>
</li>
<li>
<p><strong>Findings transfer to 7B scale.</strong> The best 1.3B configuration insights were applied to a 7B model trained with large batch sizes, achieving 63.4 average accuracy across the extended benchmark suite.</p>
</li>
</ol>
<h3 id="training-loss-patterns">Training loss patterns</h3>
<p>DC-6 (all sources) achieves the lowest training loss among SlimPajama configurations, consistent with the downstream results. DC-4 (no web crawl) shows the highest training loss, confirming that the large, diverse web crawl data is the most important single component.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The central finding is that <strong>diversity matters most after deduplication</strong>. When cross-source redundancy is removed, each additional source contributes genuinely new signal. Without global deduplication, adding sources may just increase redundancy without proportional benefit.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>Only seven fixed configurations are tested. No systematic search over continuous mixture proportions (contrast with <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> or <a href="/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/">Data Mixing Laws</a>).</li>
<li>The configurations are not independent: DC-6 includes all sources from DC-1 through DC-5, making it difficult to isolate the contribution of any single addition.</li>
<li>Only 1.3B and 7B scales tested. Whether the diversity benefit continues scaling is unverified.</li>
<li>English-only. Cross-lingual diversity effects are not studied.</li>
<li>The paper is a technical report without formal peer review.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> All 1.3B models and datasets are publicly released under MIT license on HuggingFace.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>SlimPajama</td>
          <td>627B tokens</td>
          <td>Globally deduplicated from 1.2T RedPajama</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>RefinedWeb</td>
          <td>600B tokens</td>
          <td>External CC-only baseline</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HF Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA)</td>
          <td>Standard</td>
          <td>4 benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Extended suite</td>
          <td>12 additional benchmarks</td>
          <td>Zero and few-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>1.3B parameter Cerebras-GPT architecture with ALiBi positional encoding and SwiGLU activation. All configurations trained on 330B tokens. 7B model trained with large batch-size (LBS) strategy on Cerebras 16x CS-2 cluster (80 PFLOP/s in bf16).</p>
<h3 id="hardware">Hardware</h3>
<p>Cerebras 16x CS-2 cluster, 80 PFLOP/s in bf16 mixed precision.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/MBZUAI-LLM/SlimPajama-DC">SlimPajama-DC Models</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>All 1.3B DC configurations (select via revision)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC">SlimPajama-627B-DC Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>Source-split version of SlimPajama-627B</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2023slimpajamadc,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SlimPajama-DC: Understanding Data Combinations for LLM Training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Zhiqiang and Tao, Tianhua and Ma, Liqun and Neiswanger, Willie and Liu, Zhengzhong and Wang, Hongyi and Tan, Bowen and Hestness, Joel and Vassilieva, Natalia and Soboleva, Daria and Xing, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.10818}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Scaling Data-Constrained Language Models</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</guid><description>Muennighoff et al. extend Chinchilla scaling laws to repeated data, finding up to 4 epochs cause negligible loss and 16 epochs mark diminishing returns.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-scaling-under-data-constraints">An empirical study of scaling under data constraints</h2>
<p>This is a <strong>discovery paper</strong> that systematically investigates what happens when language models are trained for multiple epochs on repeated data. It extends the Chinchilla scaling laws to the data-constrained regime by proposing a new scaling formula that accounts for the diminishing value of repeated tokens, validated across 400+ training runs ranging from 10M to 9B parameters and up to 1500 epochs.</p>
<h2 id="running-out-of-unique-training-data">Running out of unique training data</h2>
<p>The Chinchilla scaling laws assume unlimited unique data: for a given compute budget, there exists an optimal balance of model parameters and training tokens. But extrapolating these laws to larger models implies data requirements that exceed what is available. Villalobos et al. estimated that high-quality English text would be exhausted by 2024 under Chinchilla-optimal scaling. Most prior large language models trained for a single epoch, and some work explicitly warned against data reuse. The Galactica models (trained for 4.25 epochs) showed that multi-epoch training could work, but no systematic study had quantified the tradeoff between repeated data and fresh data, or how to allocate compute optimally when data is finite.</p>
<h2 id="effective-data-with-exponential-decay-for-repetition">Effective data with exponential decay for repetition</h2>
<p>The paper generalizes the Chinchilla scaling law by replacing raw token count $D$ with an effective data term $D&rsquo;$ that accounts for the diminishing value of repeated tokens:</p>
<p>$$
L(N, D) = \frac{A}{N&rsquo;^{\alpha}} + \frac{B}{D&rsquo;^{\beta}} + E
$$</p>
<p>where the effective data is:</p>
<p>$$
D&rsquo; = U_{D} + U_{D} R_{D}^{<em>} \left(1 - e^{-R_{D}/R_{D}^{</em>}}\right)
$$</p>
<p>Here $U_{D}$ is the number of unique tokens, $R_{D}$ is the number of repetitions (epochs minus 1), and $R_{D}^{<em>}$ is a learned constant representing the &ldquo;half-life&rdquo; of data repetition. When $R_{D} = 0$ (single epoch), $D&rsquo; = U_{D} = D$ and the formula reduces to standard Chinchilla. When $R_{D} \ll R_{D}^{</em>}$, repeated data is worth almost the same as fresh data. As $R_{D}$ grows large, the value of repeated tokens decays to zero, and $D&rsquo;$ saturates at $U_{D}(1 + R_{D}^{<em>})$, meaning no amount of repetition can substitute for more than $R_{D}^{</em>}$ epochs&rsquo; worth of fresh data.</p>
<p>A symmetric formula handles excess parameters:</p>
<p>$$
N&rsquo; = U_{N} + U_{N} R_{N}^{<em>} \left(1 - e^{-R_{N}/R_{N}^{</em>}}\right)
$$</p>
<p>where $U_{N}$ is the compute-optimal parameter count for $U_{D}$ unique tokens and $R_{N}$ measures how much the model exceeds that count. The fitted values are $R_{D}^{<em>} \approx 15.0$ (data repetition half-life at ~16 epochs) and $R_{N}^{</em>} \approx 5.3$ (excess parameters decay faster than repeated data).</p>
<h2 id="experiments-across-400-models">Experiments across 400+ models</h2>
<p><strong>Scale.</strong> Models from 10M to 9B parameters, trained for up to 1500 epochs. Three experimental protocols: fixed unique data (100M, 400M, 1.5B tokens), fixed FLOPs, and parametric fitting across all runs. Training on C4 (English web text) with GPT-2 architecture decoder-only transformers.</p>
<h3 id="resource-allocation-epochs-scale-faster-than-parameters">Resource allocation: epochs scale faster than parameters</h3>
<p>With fixed unique data, results show that more than 50% loss reduction is possible by training beyond one epoch and increasing model size beyond the single-epoch optimum. The data-constrained efficient frontier recommends allocating most additional compute to more epochs rather than more parameters, because excess parameters decay faster ($R_{N}^{<em>} &lt; R_{D}^{</em>}$). This contrasts with Chinchilla, which recommends scaling both equally.</p>
<p>A concrete validation: training the data-constrained compute-optimal model for $9.3 \times 10^{21}$ FLOPs with 25B unique tokens, the recommended allocation (27% fewer parameters, more epochs) achieves better loss and downstream performance than the Chinchilla-optimal allocation.</p>
<h3 id="resource-return-the-4-epoch-safe-zone-and-16-epoch-half-life">Resource return: the 4-epoch safe zone and 16-epoch half-life</h3>
<table>
  <thead>
      <tr>
          <th>Epochs</th>
          <th>Loss impact</th>
          <th>Downstream impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (baseline)</td>
          <td>Optimal</td>
          <td>Optimal</td>
      </tr>
      <tr>
          <td>Up to 4</td>
          <td>Negligible (+0.5% loss)</td>
          <td>No significant difference</td>
      </tr>
      <tr>
          <td>~16 ($R_{D}^{*}$)</td>
          <td>Diminishing returns begin sharply</td>
          <td>Measurable degradation</td>
      </tr>
      <tr>
          <td>Beyond 16</td>
          <td>Returns decay to near zero</td>
          <td>Significant degradation</td>
      </tr>
      <tr>
          <td>Extreme (44+)</td>
          <td>Training can diverge</td>
          <td>Failure</td>
      </tr>
  </tbody>
</table>
<p>The 8.7B parameter model trained for 4 epochs ($D_{C} = 44$B unique tokens) finishes with only 0.5% higher validation loss than the single-epoch model ($D_{C} = 178$B unique tokens). Beyond 16 epochs, each repeated token retains only $1 - 1/e \approx 63%$ of the value of a fresh token, meaning roughly 37% of value is lost per repetition cycle at the half-life point.</p>
<h3 id="complementary-strategies-code-augmentation-and-filtering">Complementary strategies: code augmentation and filtering</h3>
<p>When data is limited, two strategies can extend the effective dataset:</p>
<p><strong>Code augmentation.</strong> Mixing Python code from The Stack with natural language data. Up to 50% code (42B tokens) shows no degradation on natural language benchmarks, effectively providing a 2x increase in useful training data. Some tasks (WebNLG generation, bAbI reasoning) actually improve with code, possibly because code trains long-range state-tracking capabilities.</p>
<p><strong>Filtering relaxation.</strong> Perplexity filtering (keeping the 25% lowest-perplexity samples) is effective on noisy datasets, but deduplication filtering does not improve downstream performance (though it may reduce memorization). The recommendation: reserve aggressive filtering for noisy data sources; for clean datasets, more data through reduced filtering is better than less data through strict filtering.</p>
<p><strong>Combined strategy</strong>: doubling available data with code and then repeating for 4 epochs yields 8x more training tokens with performance expected to match 8x more unique data.</p>
<h2 id="key-findings-and-limitations">Key findings and limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Multi-epoch training is beneficial, not harmful, up to moderate repetition counts.</li>
<li>The data-constrained scaling law accurately predicts loss under repetition using an exponential decay formulation.</li>
<li>Compute should be allocated to epochs faster than parameters when data is constrained.</li>
<li>Code augmentation and selective filtering extend effective data without quality degradation.</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>All experiments use the GPT-2 transformer architecture; applicability to other architectures or modalities is untested.</li>
<li>Only the entire dataset is repeated uniformly. Selectively repeating subsets (e.g., high-value data for more epochs) is not modeled.</li>
<li>Hyperparameter sensitivity (learning rate, dropout) to epoch count is unexplored. Higher learning rates may cause earlier onset of diminishing returns.</li>
<li>Focused on English text. Cross-lingual augmentation effects are not studied.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, models, datasets, and hyperparameters are all publicly released under Apache 2.0.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>C4 (English)</td>
          <td>Varies by experiment</td>
          <td>Fixed unique data: 100M, 400M, 1.5B tokens</td>
      </tr>
      <tr>
          <td>Code augmentation</td>
          <td>The Stack (Python)</td>
          <td>Up to 42B tokens</td>
          <td>Mixed with natural language</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>19 NL tasks</td>
          <td>Standard splits</td>
          <td>Zero to five-shot, 114 scores per model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data-constrained scaling law: $D&rsquo; = U_{D} + U_{D} R_{D}^{<em>}(1 - e^{-R_{D}/R_{D}^{</em>}})$ with $R_{D}^{<em>} \approx 15.0$, $R_{N}^{</em>} \approx 5.3$. Fitted using the methodology of Hoffmann et al. (2022) adapted for the repetition terms. 400+ training runs used for fitting.</p>
<h3 id="models">Models</h3>
<p>GPT-2 architecture decoder-only transformers with GPT-2 tokenizer. Sizes: 10M to 8.7B parameters. Cosine learning rate schedule (max 2e-4, decay to 2e-5), Adam optimizer ($\beta_2 = 0.999$), dropout 0.1, weight decay 0.1, gradient clipping at 1.0. bfloat16 precision. Trained using Megatron-DeepSpeed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Data-Constrained Optimal</th>
          <th>Chinchilla Optimal</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation loss (9.3e21 FLOPs, 25B unique)</td>
          <td>Lower</td>
          <td>Higher</td>
          <td>27% fewer parameters</td>
      </tr>
      <tr>
          <td>Downstream (4 epochs vs 1)</td>
          <td>No significant difference</td>
          <td>Baseline</td>
          <td>8.7B params, 44B unique tokens</td>
      </tr>
      <tr>
          <td>Code augmentation (50% code)</td>
          <td>No NL degradation</td>
          <td>Baseline</td>
          <td>Some tasks improve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Trained on the LUMI supercomputer (Finland) using AMD Instinct MI250X GPUs with data, tensor, and pipeline parallelism. Up to 256 GPUs (64 nodes) per run, with up to 2,200 nodes (~8,800 GPUs) used in parallel across all concurrent runs. Total compute: approximately 3 million GPU hours. The cluster runs on 100% renewable hydroelectric energy.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/huggingface/datablations">datablations</a></td>
          <td>Code + Models + Data</td>
          <td>Apache 2.0</td>
          <td>All 400+ models, datasets, and training code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/TurkuNLP/Megatron-DeepSpeed">Megatron-DeepSpeed fork</a></td>
          <td>Code</td>
          <td>-</td>
          <td>Training framework adapted for AMD ROCm</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{muennighoff2023scaling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scaling Data-Constrained Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Muennighoff, Niklas and Rush, Alexander M. and Barak, Boaz and Le Scao, Teven and Piktus, Aleksandra and Tazi, Nouamane and Pyysalo, Sampo and Wolf, Thomas and Raffel, Colin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DoReMi: Optimizing Data Mixtures for LM Pretraining</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</guid><description>DoReMi uses a small proxy model with distributionally robust optimization to learn domain weights that speed up large-scale language model pretraining by 2.6x.</description><content:encoded><![CDATA[<h2 id="a-method-for-automatic-domain-reweighting">A method for automatic domain reweighting</h2>
<p>This is a <strong>method paper</strong> that introduces Domain Reweighting with Minimax Optimization (DoReMi), an algorithm for automatically tuning the mixture proportions of pretraining data domains. Rather than relying on heuristics or expensive downstream-task-based tuning, DoReMi uses a small proxy model trained with <a href="https://en.wikipedia.org/wiki/Robust_optimization">group distributionally robust optimization (Group DRO)</a> to produce domain weights that transfer to much larger models.</p>
<h2 id="why-data-mixture-proportions-matter">Why data mixture proportions matter</h2>
<p>Language model pretraining datasets combine text from many domains: web crawls, Wikipedia, books, code, academic papers, and others. The mixture proportions (how much of each domain to include) significantly affect downstream performance, but existing approaches either set them by hand (<a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)">The Pile</a> uses heuristic weights) or tune them against downstream tasks (GLaM/PaLM), which is expensive and risks overfitting to a specific evaluation set. No principled, task-agnostic method existed for determining mixture proportions.</p>
<h2 id="minimax-optimization-over-domain-excess-loss">Minimax optimization over domain excess loss</h2>
<p>DoReMi&rsquo;s core insight is to frame data mixture optimization as a minimax problem: find domain weights that minimize the worst-case excess loss across all domains. The algorithm has three steps.</p>
<p><strong>Step 1</strong>: Train a small reference model (280M parameters) on some default domain weights $\alpha_{\text{ref}}$ (e.g., proportional to raw token count).</p>
<p><strong>Step 2</strong>: Train a small proxy model $p_{\theta}$ using Group DRO, which solves the minimax objective:</p>
<p>$$
\min_{\theta} \max_{\alpha \in \Delta^{k}} \sum_{i=1}^{k} \alpha_{i} \cdot \left[ \frac{1}{\sum_{x \in D_{i}} |x|} \sum_{x \in D_{i}} \ell_{\theta}(x) - \ell_{\text{ref}}(x) \right]
$$</p>
<p>where $\ell_{\theta}(x) = -\log p_{\theta}(x)$ and $\ell_{\text{ref}}(x) = -\log p_{\text{ref}}(x)$. The excess loss $\ell_{\theta}(x) - \ell_{\text{ref}}(x)$ measures how much headroom the proxy has to improve on each example relative to the reference. The inner maximization upweights domains with high excess loss via exponentiated gradient ascent, while the outer minimization trains the proxy on those upweighted domains.</p>
<p>At each training step, the domain weights update as:</p>
<p>$$
\alpha_{t}&rsquo; \leftarrow \alpha_{t-1} \exp(\eta \lambda_{t})
$$</p>
<p>where $\lambda_{t}[i]$ is the per-domain excess loss (clipped at zero), followed by renormalization and smoothing with a uniform component: $\alpha_{t} \leftarrow (1-c)\frac{\alpha_{t}&rsquo;}{\sum_{i} \alpha_{t}&rsquo;[i]} + cu$, with $c = 10^{-3}$.</p>
<p>The final domain weights are the average over all training steps: $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^{T} \alpha_{t}$.</p>
<p><strong>Step 3</strong>: Resample data according to $\bar{\alpha}$ and train the full-scale model using standard procedures.</p>
<p><strong>Iterated DoReMi</strong> extends this by running multiple rounds, using the previous round&rsquo;s optimized weights as the next round&rsquo;s reference weights. This converges within 3 rounds on the GLaM dataset.</p>
<h2 id="experiments-across-the-pile-and-glam-datasets">Experiments across The Pile and GLaM datasets</h2>
<p><strong>Datasets.</strong> The Pile (22 domains, 800GB) and the GLaM dataset (8 domains, also used for PaLM). On The Pile, baseline weights come from the dataset defaults. On GLaM, baseline weights are uniform, with downstream-tuned oracle weights available for comparison.</p>
<p><strong>Setup.</strong> Transformer decoder-only LMs trained with next-token prediction. All models use batch size 512 and sequence length 1024. Proxy and reference models are 280M parameters. Main models are 8B parameters (30x larger). Training runs: 200K steps (Pile) or 300K steps (GLaM). The domain weight optimization cost (training two 280M models) is 8% of the compute for the 8B main model.</p>
<p><strong>Evaluation.</strong> Per-domain held-out perplexity and one-shot generative accuracy on five tasks: TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, and LAMBADA.</p>
<h3 id="key-domain-weight-shifts">Key domain weight shifts</h3>
<p>On The Pile, DoReMi (280M) dramatically upweights diverse web text (Pile-CC: 0.112 to 0.606) while downweighting specialized domains like ArXiv (0.105 to 0.004), PubMed Central (0.107 to 0.005), and StackExchange (0.093 to 0.015). Smaller, underrepresented domains like YouTubeSubtitles and PhilPapers receive proportionally large increases.</p>
<h3 id="scaling-behavior">Scaling behavior</h3>
<p>DoReMi was tested with matched proxy/main model sizes (280M through 1B) and with varying proxy sizes (70M through 1B) feeding into an 8B main model.</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Speedup to baseline accuracy</th>
          <th>Downstream improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DoReMi (280M to 280M)</td>
          <td>4x</td>
          <td>+2% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (280M to 8B)</td>
          <td>2.6x</td>
          <td>+6.5% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (150M to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
      <tr>
          <td>DoReMi (1B to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
  </tbody>
</table>
<p>Improvements are consistent across all tested model scales (280M to 1B matched), with no sign of diminishing returns at larger sizes.</p>
<h2 id="perplexity-improves-everywhere-even-on-downweighted-domains">Perplexity improves everywhere, even on downweighted domains</h2>
<p>The most striking finding is that DoReMi improves perplexity on all 22 domains in The Pile, including domains it downweights. The proposed explanation: the lowest-entropy domains need few samples to learn (they&rsquo;re statistically simple), while the highest-entropy domains have token distributions close to the uniform initialization and also need fewer samples. Reallocating weight to medium-entropy domains generates positive transfer that lifts all domains.</p>
<p>On The Pile, DoReMi reaches the baseline&rsquo;s downstream accuracy in 75K steps versus 200K for the baseline (2.6x speedup) and achieves a 6.5% absolute improvement in average one-shot accuracy at 200K steps.</p>
<p>On the GLaM dataset, iterated DoReMi (round 2) matches the performance of domain weights that were tuned directly on downstream task performance, despite having no knowledge of downstream tasks. Domain weights converge within 3 iterations.</p>
<h3 id="ablations">Ablations</h3>
<p>Using only the proxy model&rsquo;s loss (prefer hardest domains) or only the negative reference loss (prefer easiest domains) both underperform the full excess loss formulation. Both components are necessary: the excess loss identifies domains where the proxy has room to improve relative to what is learnable.</p>
<p>The proxy model itself typically underperforms the main model trained on its weights, and this gap grows at larger proxy scales. A 1B proxy model underperforms the 1B baseline, yet its domain weights still improve 1B main model training by over 2x. This suggests the domain weight signal is robust even when the proxy model itself is not well-trained.</p>
<h3 id="limitations">Limitations</h3>
<p>The domain weight landscape may have multiple local optima: a 280M proxy puts most weight on Pile-CC, while a 1B proxy favors OpenWebText2 instead. Both configurations improve over baseline, but the optimal weights are not unique.</p>
<p>The granularity of &ldquo;domains&rdquo; matters. DoReMi works better with more domains (22 on The Pile versus 8 on GLaM). Domains are defined by data provenance, which is coarse-grained. Fine-grained domain definitions (e.g., via clustering) could improve results but also risk DRO putting all weight on a small set of worst-case examples.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>800 GB, 22 domains</td>
          <td>Default heuristic weights as baseline</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>GLaM dataset</td>
          <td>8 domains</td>
          <td>Uniform weights as baseline; downstream-tuned oracle available</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, LAMBADA</td>
          <td>Standard splits</td>
          <td>One-shot generative evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Group DRO with exponentiated gradient ascent for domain weight updates. Step size $\eta = 1$, smoothing $c = 10^{-3}$. Per-token excess loss clipped at zero. Domain weights averaged over all training steps. Iterated DoReMi converges when $|\bar{\alpha} - \alpha_{\text{ref}}|_{\infty} &lt; 10^{-3}$.</p>
<h3 id="models">Models</h3>
<p>Vanilla Transformer decoder-only models with 256K vocabulary. Sizes: 70M (3 layers), 150M (6 layers), 280M (12 layers), 510M (12 layers), 760M (12 layers), 1B (16 layers), 8B (32 layers). All use 64-dim attention heads except 8B (128-dim).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>DoReMi (280M to 8B)</th>
          <th>Baseline (8B)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Avg one-shot accuracy</td>
          <td>+6.5% over baseline</td>
          <td>Reference</td>
          <td>5 generative tasks</td>
      </tr>
      <tr>
          <td>Worst-case log-perplexity</td>
          <td>1.46</td>
          <td>1.71</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Avg log-perplexity</td>
          <td>1.40</td>
          <td>1.64</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Domains beating baseline</td>
          <td>22/22</td>
          <td>0/22</td>
          <td>Per-domain perplexity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Proxy and reference models (under 1B) trained on TPUv3. Models at 1B and 8B trained on TPUv4. Domain weight optimization (two 280M runs) costs 8% of 8B training FLOPs.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xie2023doremi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Sang Michael and Pham, Hieu and Dong, Xuanyi and Du, Nan and Liu, Hanxiao and Lu, Yifeng and Liang, Percy and Le, Quoc V. and Ma, Tengyu and Yu, Adams Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Mixing Laws for LM Pretraining Optimization</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</guid><description>Ye et al. discover that LM loss follows an exponential law over domain mixture proportions, enabling cheap prediction and optimization of data mixtures.</description><content:encoded><![CDATA[<h2 id="an-empirical-discovery-of-predictable-mixture-loss-relationships">An empirical discovery of predictable mixture-loss relationships</h2>
<p>This is a <strong>discovery paper</strong> that identifies a quantitative, functional relationship between pretraining data mixture proportions and language model loss. The key finding is that domain-specific validation loss follows an exponential law over the linear combination of training domain proportions, and this law composes with standard scaling laws to enable cheap prediction of large-model performance under arbitrary mixtures.</p>
<h2 id="the-missing-quantitative-link-between-data-mixtures-and-performance">The missing quantitative link between data mixtures and performance</h2>
<p>Pretraining data for large language models combines text from many domains (web, code, academic, books, etc.), and mixture proportions significantly affect model quality. Existing approaches either set proportions by hand without disclosed criteria (LLaMA, Baichuan) or use algorithmic methods like <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> that optimize qualitatively but cannot predict the quantitative effect of a specific mixture before training. Scaling laws exist for model size and data quantity, but no equivalent existed for mixture proportions. This paper fills that gap.</p>
<h2 id="the-exponential-data-mixing-law">The exponential data mixing law</h2>
<p>The core finding: for a model of fixed size trained for a fixed number of steps, the validation loss on domain $i$ as a function of the training mixture proportions $r_{1 \dots M}$ follows:</p>
<p>$$
L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right)
$$</p>
<p>where $c_{i}$, $k_{i}$, and $t_{ij}$ are fitted parameters. The constant $c_{i}$ represents the irreducible loss (not affected by mixture changes). The interaction coefficients $t_{ij}$ capture how training domain $j$ affects validation loss on domain $i$: negative $t_{ij}$ means domain $j$ helps domain $i$, positive means it hurts.</p>
<p>This was discovered progressively:</p>
<ol>
<li><strong>Two domains</strong>: Log-reducible-loss is linear in domain proportion (univariate exponential).</li>
<li><strong>Three domains</strong>: The exponential generalizes to a linear combination over all domain proportions (Eq. above), outperforming alternatives with comparable parameter count.</li>
<li><strong>General validation</strong>: For a validation set composed of $K$ domains with proportions $s_{1 \dots K}$, the overall loss is:</li>
</ol>
<p>$$
L(r_{1 \dots M}) = \sum_{i=1}^{K} s_{i} \left[ c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right) \right]
$$</p>
<p>When the validation set composition is unknown, implicit domain aggregation treats $s_{i}$ as learnable parameters. Setting the number of implicit domains larger than the true number works well and is robust to overestimation.</p>
<h3 id="domain-interaction-patterns">Domain interaction patterns</h3>
<p>Visualizing the fitted $t_{ij}$ coefficients across 5 coarse Pile domains reveals three relationship types: most domain pairs are <strong>unrelated</strong> (sparse interaction matrix where each domain&rsquo;s loss is dominated by its own training proportion), some show <strong>facilitation</strong> (e.g., dialogue data helps internet text), and some show <strong>conflict</strong> (e.g., symbolic data hurts prose). This sparsity explains why the law can be fitted with fewer samples than the quadratic parameter count would suggest.</p>
<h2 id="nested-scaling-pipeline-for-cheap-prediction">Nested scaling pipeline for cheap prediction</h2>
<p>Fitting data mixing laws directly at target scale is too expensive (requires many full training runs at different mixtures). The paper proposes nesting three scaling laws:</p>
<p><strong>Step 1</strong>: For each mixture $r_{i}$ and each small model size $N_{j}$, train for $S_{0}$ steps. Fit a <a href="https://en.wikipedia.org/wiki/Power_law">power law</a> $L(S) = E_{1} + B/S^{\beta}$ over steps to extrapolate to the target step count $S_{\text{target}}$.</p>
<p><strong>Step 2</strong>: With the step-extrapolated losses for each mixture, fit a power law $L(N) = E_{2} + A/N^{\alpha}$ over model sizes to extrapolate to the target model size $N_{\text{target}}$.</p>
<p><strong>Step 3</strong>: With the predicted losses at $(N_{\text{target}}, S_{\text{target}})$ for all sampled mixtures, fit the data mixing law and search for the optimal mixture.</p>
<p>This pipeline requires only training small models (70M to 410M) for short runs (30B tokens) to predict performance of a 1B model trained for 100B tokens.</p>
<h3 id="mixture-sampling-strategy">Mixture sampling strategy</h3>
<p>To get informative samples efficiently, the paper uses double-diminishing proportions: for each domain, enumerate proportions by halving from the maximum available. This distributes losses evenly across the exponential law&rsquo;s range. From 40 candidate mixtures trained at the smallest scale (70M), 20 are selected based on which subset minimizes data mixing law fitting error.</p>
<h2 id="experiments-on-redpajama-and-continual-pretraining">Experiments on RedPajama and continual pretraining</h2>
<p><strong>Main experiment.</strong> Models trained on RedPajama, validated on the Pile (mimicking the common scenario where validation data comes from a different distribution than training). Small models: 70M, 160M, 305M, 410M trained for 30B tokens. Target: 1B model for 100B tokens.</p>
<p>The optimized mixture dramatically redistributes weight compared to RedPajama defaults:</p>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>Default</th>
          <th>Optimized</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>0.670</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>0.150</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>0.045</td>
          <td>0.141</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>0.045</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>0.045</td>
          <td>0.094</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>0.025</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>0.020</td>
          <td>0.016</td>
      </tr>
  </tbody>
</table>
<p>The optimized mixture reaches the default mixture&rsquo;s final performance in 73% of the training steps and eventually achieves performance equivalent to 48% more training on the default mixture.</p>
<p><strong>Comparison to DoReMi and DoGE.</strong> Data mixing laws outperform both: the predicted-optimal mixture achieves lower validation loss than DoReMi and DoGE (both universal and OOD settings) for 1B models trained for 100B tokens on RedPajama.</p>
<p><strong>Continual pretraining.</strong> The law extends to continual pretraining (Pythia-70M on Pile + Python code). It accurately predicts the critical mixture proportion that avoids <a href="https://en.wikipedia.org/wiki/Catastrophic_interference">catastrophic forgetting</a> on the original domain while improving the target domain. This suggests data mixing laws could guide dynamic data schedules across multi-stage pretraining.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The data mixing law provides a predictive framework rather than just an optimization algorithm. Key implications:</p>
<ul>
<li>The interaction coefficients $t_{ij}$ make domain relationships quantitatively observable before full-scale training, identifying facilitation and conflict pairs.</li>
<li>The nested pipeline&rsquo;s cost is dominated by the small-model training runs (40 mixtures at 70M scale), which is orders of magnitude cheaper than even a single target-scale run.</li>
<li>The continual pretraining application opens the door to optimizing dynamic data schedules, where mixture proportions change across training stages.</li>
</ul>
<p><strong>Limitations</strong>: The &ldquo;domain&rdquo; concept remains loosely defined (provenance-based). The nested scaling laws introduce compounding errors at each step, and predictions tend to slightly underestimate actual loss. The number of required fitting samples, while subquadratic in practice due to sparsity, still scales with the number of domains. No theoretical justification for the exponential form is provided; it is a purely empirical finding.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (pilot)</td>
          <td>The Pile (GitHub, Pile-CC, Books3)</td>
          <td>30B tokens</td>
          <td>2-domain and 3-domain experiments</td>
      </tr>
      <tr>
          <td>Training (main)</td>
          <td>RedPajama</td>
          <td>100B tokens</td>
          <td>7 domains</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>The Pile validation set</td>
          <td>Standard split</td>
          <td>Out-of-distribution relative to RedPajama</td>
      </tr>
      <tr>
          <td>Continual pretraining</td>
          <td>Pile + Python code</td>
          <td>10B tokens</td>
          <td>Pythia-70M base model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data mixing law: $L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp(\sum_{j} t_{ij} r_{j})$. Fitted via AdaBoost Regressor on sampled mixtures. Step scaling law: $L(S) = E_{1} + B/S^{\beta}$. Model size scaling law: $L(N) = E_{2} + A/N^{\alpha}$. Both fitted via Huber loss minimization with LBFGS. Decomposed Chinchilla-style (separate fits for stability). 40 candidate mixtures sampled via double-diminishing proportions, 20 selected for the final pipeline.</p>
<h3 id="models">Models</h3>
<p>Transformer decoder-only LMs. Pilot: 70M, 160M. Main pipeline: 70M, 160M, 305M, 410M (for fitting), 1B (target). Batch size: 1M tokens. Cosine learning rate decay with 2K step warmup, decaying to 0.1x at 100K steps.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Optimized Mixture</th>
          <th>Default Mixture</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Steps to match default final loss</td>
          <td>73K (73%)</td>
          <td>100K (100%)</td>
          <td>27% training reduction</td>
      </tr>
      <tr>
          <td>Equivalent extra training</td>
          <td>+48%</td>
          <td>Baseline</td>
          <td>Estimated via step scaling law</td>
      </tr>
      <tr>
          <td>Validation loss (1B, 100B)</td>
          <td>Lowest</td>
          <td>Higher than optimized</td>
          <td>Also beats DoReMi and DoGE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>8 A100 GPUs. Training times per 30B-token run: 3.5 hours (70M), 8 hours (160M), 16 hours (305M), 21 hours (410M).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Pilot and validation data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/togethercomputer/RedPajama-Data">RedPajama</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>Main training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/EleutherAI/pythia">Pythia Suite</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Model architecture configs; Pythia-70M checkpoint for continual pretraining</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status: Partially Reproducible.</strong> Datasets and base model checkpoints are public. No official code release for the data mixing law fitting pipeline, mixture sampling, or the nested scaling law prediction workflow.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ye2025datamixinglaws,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhan, Jun and Zhou, Yunhua and Qiu, Xipeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RWKV: Linear-Cost RNN with Transformer Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</guid><description>RWKV combines parallelizable transformer training with constant-cost RNN inference using linear attention and channel-wise decay.</description><content:encoded><![CDATA[<h2 id="a-new-architecture-bridging-rnns-and-transformers">A New Architecture Bridging RNNs and Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces RWKV (Receptance Weighted Key Value), a novel sequence model architecture that combines the parallelizable training of Transformers with the efficient $O(Td)$ inference of RNNs. RWKV can be formulated equivalently as either a Transformer (for parallel training) or an RNN (for sequential inference), achieving the lowest computational and memory complexity among comparable architectures while matching Transformer-level performance. The authors scale RWKV to 14 billion parameters, making it the largest dense RNN ever trained at the time of publication.</p>
<h2 id="the-quadratic-cost-of-self-attention">The Quadratic Cost of Self-Attention</h2>
<p>Transformers have become the dominant architecture for NLP, powering models like GPT-3, LLaMA, and Chinchilla. Their self-attention mechanism captures both local and long-range dependencies while supporting parallelized training. However, self-attention scales quadratically with sequence length in both time ($O(T^2d)$) and space ($O(T^2 + Td)$), making it computationally and memory intensive for long sequences and resource-constrained deployment.</p>
<p>RNNs, by contrast, offer linear scaling in memory and computation, but suffer from the vanishing gradient problem and cannot parallelize across the time dimension during training. This limits their scalability and makes them unable to match Transformer performance in practice.</p>
<p>Prior work on efficient Transformers (Reformer, Performer, Linformer, AFT, MEGA) has attempted to reduce this quadratic cost, often at the expense of model expressivity. RWKV aims to combine the best of both worlds: Transformer-grade training efficiency with RNN-grade inference cost, without any approximation to the attention mechanism.</p>
<h2 id="linear-attention-via-channel-wise-decay">Linear Attention via Channel-Wise Decay</h2>
<p>RWKV is built on four core vectors that interact multiplicatively at each timestep:</p>
<ul>
<li><strong>R</strong> (Receptance): receives past information, acting as a gating signal</li>
<li><strong>W</strong> (Weight): a trainable positional weight decay vector</li>
<li><strong>K</strong> (Key): analogous to keys in standard attention</li>
<li><strong>V</strong> (Value): analogous to values in standard attention</li>
</ul>
<p>The architecture consists of stacked residual blocks, each containing a <strong>time-mixing</strong> sub-block and a <strong>channel-mixing</strong> sub-block.</p>
<h3 id="token-shift">Token Shift</h3>
<p>All linear projection vectors are produced by interpolating between the current input $x_t$ and the previous input $x_{t-1}$, creating a token shift mechanism:</p>
<p>$$
r_t = W_r \cdot (\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1})
$$</p>
<p>$$
k_t = W_k \cdot (\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1})
$$</p>
<p>$$
v_t = W_v \cdot (\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1})
$$</p>
<p>where $\mu_r$, $\mu_k$, $\mu_v$ are learnable interpolation parameters. This is implemented efficiently as a simple offset in the temporal dimension.</p>
<h3 id="the-wkv-operator">The WKV Operator</h3>
<p>The core attention-like computation replaces standard dot-product attention with a channel-wise weighted sum using exponential decay:</p>
<p>$$
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \odot v_i + e^{u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}
$$</p>
<p>Here $w$ is the channel-wise time decay vector and $u$ is a separate bonus vector that attends specifically to the current token. Unlike AFT where $W$ is a pairwise matrix, RWKV treats $W$ as a channel-wise vector modified by relative position, enabling the recurrent formulation.</p>
<h3 id="output-gating">Output Gating</h3>
<p>The receptance vector gates the WKV output through a sigmoid:</p>
<p>$$
o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)
$$</p>
<p>The channel-mixing block uses a similar gating mechanism with squared ReLU activation:</p>
<p>$$
o&rsquo;_t = \sigma(r&rsquo;_t) \odot (W&rsquo;_v \cdot \max(k&rsquo;_t, 0)^2)
$$</p>
<h3 id="dual-mode-operation">Dual-Mode Operation</h3>
<p>During <strong>training</strong>, RWKV operates in time-parallel mode. The matrix multiplications ($W_\lambda$ for $\lambda \in {r, k, v, o}$) dominate at $O(BTd^2)$ and parallelize identically to standard Transformers. The element-wise WKV computation is $O(BTd)$ and parallelizes along batch and channel dimensions.</p>
<p>During <strong>inference</strong>, RWKV switches to time-sequential mode. Each timestep updates a fixed-size state vector, giving constant $O(d)$ memory and $O(Td)$ total time for generating $T$ tokens, compared to $O(T^2d)$ for standard Transformers.</p>
<h3 id="optimizations">Optimizations</h3>
<p>Three additional design choices improve training:</p>
<ol>
<li><strong>Custom CUDA kernels</strong> for the sequential WKV computation, fusing it into a single kernel on training accelerators</li>
<li><strong>Small init embedding</strong>: initializing the embedding matrix with small values plus an additional LayerNorm, accelerating convergence</li>
<li><strong>Custom initialization</strong>: most weights initialized to zero with no biases, following identity-mapping principles from residual network design</li>
</ol>
<h2 id="scaling-to-14b-parameters-and-benchmark-evaluation">Scaling to 14B Parameters and Benchmark Evaluation</h2>
<h3 id="model-scaling">Model Scaling</h3>
<p>The authors train six RWKV models from 169M to 14B parameters, all for one epoch (330B tokens) on the Pile:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Dimension</th>
          <th>Parameters</th>
          <th>FLOP/Token</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>12</td>
          <td>768</td>
          <td>$1.69 \times 10^8$</td>
          <td>$2.61 \times 10^8$</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>24</td>
          <td>1024</td>
          <td>$4.30 \times 10^8$</td>
          <td>$7.57 \times 10^8$</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>24</td>
          <td>2048</td>
          <td>$1.52 \times 10^9$</td>
          <td>$2.82 \times 10^9$</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>32</td>
          <td>2560</td>
          <td>$2.99 \times 10^9$</td>
          <td>$5.71 \times 10^9$</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>32</td>
          <td>4096</td>
          <td>$7.39 \times 10^9$</td>
          <td>$1.44 \times 10^{10}$</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>40</td>
          <td>5120</td>
          <td>$1.42 \times 10^{10}$</td>
          <td>$2.78 \times 10^{10}$</td>
      </tr>
  </tbody>
</table>
<p>The parameter count follows: $\text{params} = 2VD + 13D^2L + D(11L + 4)$, where $V = 50277$ is vocabulary size, $D$ is model dimension, and $L$ is layers. FLOPs match the standard transformer formula: $\text{FLOP} = 6 \cdot [\text{tokens}] \cdot [\text{params}]$.</p>
<h3 id="scaling-laws">Scaling Laws</h3>
<p>Training 45 RWKV models across varied (dataset, parameters) pairs, the authors find that RWKV follows the same log-log linear scaling law established for Transformers. The linear fit to Pareto-optimal points achieves $r^2 = 0.994$, and extrapolation an additional order of magnitude still yields $r^2 = 0.875$. This contrasts with prior claims that LSTMs do not follow transformer-like scaling.</p>
<h3 id="nlp-benchmarks">NLP Benchmarks</h3>
<p>RWKV is compared against similarly-sized models trained on comparable token budgets: Pythia, OPT, and BLOOM (all FLOP-matched). Results span twelve benchmarks: ARC (Easy/Challenge), BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, and Winogrande.</p>
<p>RWKV performs competitively with Transformers across all model sizes. On average across benchmarks, RWKV tracks closely with Pythia and outperforms OPT and BLOOM at comparable scales.</p>
<h3 id="long-context-and-extended-finetuning">Long Context and Extended Finetuning</h3>
<p>RWKV can extend its context length after pretraining through progressive finetuning: doubling from 1024 to 2048 (10B tokens), then to 4096 (100B tokens), and finally to 8192 (100B tokens). Each doubling reduces test loss on the Pile, indicating effective use of longer context.</p>
<p>On the Long Range Arena (LRA) benchmark, which tests sequences from 1,000 to 16,000 tokens, RWKV performs second only to S4 across the five datasets.</p>
<h3 id="inference-efficiency">Inference Efficiency</h3>
<p>Benchmarking text generation on CPU (x86) and GPU (NVIDIA A100 80GB) at float32 precision shows that RWKV exhibits linear scaling in generation time, while Transformers scale quadratically. This advantage grows with sequence length: for long outputs, RWKV completes generation substantially faster at equivalent model sizes.</p>
<h2 id="competitive-performance-with-key-caveats">Competitive Performance with Key Caveats</h2>
<p>RWKV demonstrates that RNN-class models can match Transformer performance at scale, while maintaining $O(Td)$ time and $O(d)$ memory during inference. The key findings are:</p>
<ol>
<li><strong>Scaling laws hold</strong>: RWKV follows the same compute-optimal scaling as Transformers ($r^2 = 0.994$), contradicting earlier claims about RNN scaling behavior</li>
<li><strong>Competitive NLP performance</strong>: Across twelve benchmarks, RWKV matches similarly-sized Transformers trained on comparable data</li>
<li><strong>Linear inference cost</strong>: Generation time scales linearly rather than quadratically, with constant memory regardless of sequence length</li>
<li><strong>Context extension</strong>: Progressive finetuning effectively extends the context window post-training</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors identify two primary limitations:</p>
<p><strong>Information compression</strong>: Linear attention funnels all past information through a single fixed-size state vector. For tasks requiring recall of specific details over very long contexts, this is mechanistically more constrained than full self-attention, which maintains direct access to all previous tokens.</p>
<p><strong>Prompt sensitivity</strong>: RWKV is more sensitive to prompt engineering than standard Transformers. The linear attention mechanism limits how much prompt information carries forward, making the order of information in the prompt particularly important. Reordering prompts improved F1 from 44.2% to 74.8% on one task.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest several avenues: applying parallel scan to reduce WKV cost to $O(B \log(T) d)$, extending RWKV to encoder-decoder and multimodal architectures, leveraging hidden states for interpretability and safety, and increasing internal state size to improve long-range recall.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch training and inference implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/BlinkDL/rwkv-4-pile-14b">Pre-trained weights (169M to 14B)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>All six Pile-trained sizes on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>)</td>
      </tr>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>Mixed</td>
          <td>825 GiB pretraining corpus; component licenses vary by source</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Training code (Apache-2.0), pre-trained weights for all six model sizes, the full training corpus, and complete hyperparameters (Appendix G) are all publicly available. The only missing detail is the specific GPU cluster configuration used for pretraining.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>330B tokens</td>
          <td>One full epoch for all model sizes</td>
      </tr>
      <tr>
          <td>Context extension</td>
          <td>The Pile</td>
          <td>210B additional tokens</td>
          <td>Progressive doubling: 1024 to 8192</td>
      </tr>
      <tr>
          <td>NLP evaluation</td>
          <td>ARC, BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, Winogrande</td>
          <td>Various</td>
          <td>Zero-shot evaluation</td>
      </tr>
      <tr>
          <td>Long-range evaluation</td>
          <td>Long Range Arena (LRA)</td>
          <td>1K-16K tokens</td>
          <td>Five sub-tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam ($\beta = (0.9, 0.99)$), no weight decay</li>
<li>Precision: bfloat16</li>
<li>Training context length: 1024 tokens</li>
<li>Learning rate: constant warmup, then exponential decay</li>
<li>Auxiliary loss from PaLM (softmax normalizer regularization)</li>
<li>Batch size: 128 or 256 sequences (dynamically switched)</li>
<li>Training organized into mini-epochs of 40,320 samples each (8,043 mini-epochs per Pile epoch)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Init LR</th>
          <th>Warmup Mini-Epochs</th>
          <th>End LR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>6e-4</td>
          <td>361</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>4e-4</td>
          <td>411</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>3e-4</td>
          <td>443</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>1.5e-4</td>
          <td>451</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>1.5e-4</td>
          <td>465</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>1e-4</td>
          <td>544</td>
          <td>7e-6</td>
      </tr>
  </tbody>
</table>
<p>All pretrained models (169M to 14B) are publicly released on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>) under Apache-2.0. Training code is at <a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a> (Apache-2.0).</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>All NLP benchmarks evaluated in zero-shot setting</li>
<li>FLOP-matched comparison against Pythia, OPT, BLOOM</li>
<li>Inference benchmarked on CPU (x86) and GPU (NVIDIA A100 80GB) at float32</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference experiments: NVIDIA A100 80GB GPU</li>
<li>Training hardware details not fully specified; FLOP budgets reported per model</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., &hellip; &amp; Zhu, R.-J. (2023). RWKV: Reinventing RNNs for the Transformer Era. In <em>Findings of the Association for Computational Linguistics: EMNLP 2023</em>, pp. 14048-14077.</p>
<p><strong>Publication</strong>: Findings of EMNLP 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BlinkDL/RWKV-LM">GitHub Repository (Apache-2.0)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{peng2023rwkv,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{RWKV: Reinventing RNNs for the Transformer Era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Derczynski, Leon and Du, Xingjian and Grella, Matteo and GV, Kranthi Kiran and He, Xuzheng and Hou, Haowen and Kazienko, Przemys{\l}aw and Koco{\&#39;n}, Jan and Kong, Jiaming and Koptyra, Bart{\l}omiej and Lau, Hayden and Lin, Jiaju and Mantri, Krishna Sri Ipsit and Mom, Ferdinand and Saito, Atsushi and Song, Guangyu and Tang, Xiangru and Wind, Johan S. and Wo{\&#39;z}niak, Stanis{\l}aw and Zhang, Zhenyuan and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Findings of the Association for Computational Linguistics: EMNLP 2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{14048--14077}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.findings-emnlp.936}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Block-Recurrent Transformers for Long Sequences</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</guid><description>Block-Recurrent Transformers combine attention and recurrence for linear-complexity language modeling on long documents like books and code.</description><content:encoded><![CDATA[<h2 id="a-method-for-combining-attention-with-block-level-recurrence">A Method for Combining Attention with Block-Level Recurrence</h2>
<p>This is a <strong>Method</strong> paper that introduces the Block-Recurrent Transformer, a model architecture that integrates recurrence into the transformer framework at the block level. Rather than processing tokens one at a time (as in traditional RNNs) or attending over entire sequences (as in standard transformers), this approach applies a transformer layer recurrently across blocks of tokens. The result is a model with linear complexity in sequence length that maintains the parallelism benefits of transformers during training. A related approach, <a href="/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/">RWKV</a>, later explored similar ideas using linear attention with channel-wise decay.</p>
<h2 id="why-transformers-struggle-with-long-documents">Why Transformers Struggle with Long Documents</h2>
<p>Transformers have largely replaced RNNs for sequence modeling tasks, but their quadratic self-attention cost limits the length of sequences they can process. A transformer with a window size of 512 tokens cannot see information beyond that window, making it blind to long-range dependencies in books, technical papers, or source code repositories.</p>
<p>Prior approaches to this problem fall into several categories: sparse attention patterns (BigBird, Routing Transformers, Reformer), sequence compression (Linformer, Funnel Transformers), and linearized attention approximations. These methods either sacrifice the expressiveness of full softmax attention or introduce implementation complexity.</p>
<p>Traditional RNNs like LSTMs offer linear complexity but suffer from three key limitations: sequential processing prevents parallelism on modern hardware, a single state vector bottlenecks information capacity, and vanishing gradients limit effective memory to a few hundred tokens.</p>
<h2 id="block-level-recurrence-with-lstm-style-gates">Block-Level Recurrence with LSTM-Style Gates</h2>
<p>The core innovation is applying a standard transformer layer in a recurrent fashion along the sequence, operating on blocks of $W$ tokens rather than individual tokens. The recurrent cell maintains $S$ state vectors (typically $S = W = 512$) that are updated at each block boundary.</p>
<h3 id="the-recurrent-cell">The Recurrent Cell</h3>
<p>The cell has two processing directions:</p>
<ul>
<li><strong>Vertical direction</strong>: An ordinary transformer layer with self-attention over input tokens and cross-attention to recurrent states, producing output embeddings.</li>
<li><strong>Horizontal direction</strong>: Self-attention over current state vectors and cross-attention to input tokens, producing updated state vectors. Residual connections are replaced with gates.</li>
</ul>
<p>Self-attention and cross-attention are computed in parallel (not sequentially), with results concatenated and fed into a linear projection. Keys and values are shared between directions, while queries are separate, yielding four query sets: $Q_e^v$, $Q_s^v$ (vertical) and $Q_s^h$, $Q_e^h$ (horizontal).</p>
<h3 id="gating-mechanisms">Gating Mechanisms</h3>
<p>Two gate types are explored. The <strong>fixed gate</strong> uses a learned convex combination:</p>
<p>$$
g = \sigma(b_g)
$$</p>
<p>$$
c_{t+1} = c_t \odot g + z_t \odot (1 - g)
$$</p>
<p>where $g$ is constant after training, implementing an <a href="https://en.wikipedia.org/wiki/Moving_average">exponential moving average</a>.</p>
<p>The <strong>LSTM gate</strong> uses input and forget gates:</p>
<p>$$
i_t = \sigma(W_i h_t + b_i - 1)
$$</p>
<p>$$
f_t = \sigma(W_f h_t + b_f + 1)
$$</p>
<p>$$
c_{t+1} = c_t \odot f_t + z_t \odot i_t
$$</p>
<p>The bias offsets ($-1$ for input, $+1$ for forget) initialize the model to &ldquo;remember&rdquo; by default, which is critical for training stability. Without careful initialization, the model can fall into a local optimum where it ignores the recurrent state entirely. This echoes the <a href="/notes/machine-learning/model-architectures/can-recurrent-neural-networks-warp-time/">gate initialization challenges studied by Tallec and Ollivier</a>, who derived chrono initialization for LSTMs from time-warping invariance.</p>
<h3 id="gate-configurations">Gate Configurations</h3>
<p>Three configurations are tested: <strong>dual</strong> (gates on both attention and MLP outputs), <strong>single</strong> (gate only on MLP output), and <strong>skip</strong> (gate only on attention output, no MLP). The skip configuration removes the large MLP from the recurrent layer entirely.</p>
<h3 id="learned-state-ids">Learned State IDs</h3>
<p>Since the same weights are applied to all state vectors, learned &ldquo;state IDs&rdquo; (analogous to position embeddings) are added so each state vector can issue distinct queries. <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style relative position bias is used for token self-attention, with no position bias for state-token cross-attention.</p>
<h2 id="language-modeling-on-pg19-arxiv-and-github">Language Modeling on PG19, arXiv, and GitHub</h2>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>The base model is a 12-layer transformer with 150M parameters (8 heads of size 128, embedding dimension 1024, MLP hidden size 4096). The recurrent layer is placed at layer 10 with segment length $N = 4096$ and window size $W = 512$. The architecture is evaluated on three long-document datasets:</p>
<ul>
<li><strong>PG19</strong>: Full-length books from <a href="https://en.wikipedia.org/wiki/Project_Gutenberg">Project Gutenberg</a> (pre-1919)</li>
<li><strong>arXiv</strong>: Mathematics papers in LaTeX</li>
<li><strong>GitHub</strong>: Concatenated source code from open-source repositories</li>
</ul>
<p>All models report bits-per-token ($\log_2$ perplexity, lower is better).</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines are compared: Transformer-XL with window sizes of 512, 1024, and 2048, plus 12-layer and 13-layer sliding window models. The 13-layer sliding window (Slide:13L) is the primary comparison, having equivalent computation cost and parameter count to the recurrent models.</p>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Step Time</th>
          <th>PG19 (bytes)</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XL:512</td>
          <td>0.88</td>
          <td>1.01</td>
          <td>3.62</td>
          <td>1.45</td>
          <td>1.21</td>
      </tr>
      <tr>
          <td>XL:2048</td>
          <td>2.11</td>
          <td>0.990</td>
          <td>3.58</td>
          <td>1.31</td>
          <td>1.01</td>
      </tr>
      <tr>
          <td>Slide:13L</td>
          <td>1.00</td>
          <td>0.989</td>
          <td>3.58</td>
          <td>1.42</td>
          <td>1.17</td>
      </tr>
      <tr>
          <td>Rec:fixed:skip</td>
          <td>0.99</td>
          <td>0.952</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Rec:fixed:dual</td>
          <td>1.01</td>
          <td>0.957</td>
          <td>3.52</td>
          <td>1.27</td>
          <td>0.991</td>
      </tr>
      <tr>
          <td>Feedback:fixed:skip</td>
          <td>1.35</td>
          <td>0.935</td>
          <td>3.49</td>
          <td>1.24</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memorizing Trans. 64k</td>
          <td>1.94</td>
          <td>0.950</td>
          <td>3.53</td>
          <td>1.22</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The Rec:fixed:skip configuration achieves the best overall results while being slightly faster than the 13-layer baseline. It outperforms XL:2048, which runs over 2x slower. The block feedback variant (allowing all layers to cross-attend to recurrent states) improves perplexity further at ~35-40% higher step time.</p>
<h3 id="scaling-behavior">Scaling Behavior</h3>
<p>Models from 40M to 1.3B parameters show that the benefit of recurrence is <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">consistent across scales</a> and increases with model size. At larger sizes, adding recurrence provides a benefit greater than doubling the number of parameters. The 1.3B parameter model achieves 26.50 word-level perplexity on PG19, setting a new state of the art at the time of publication.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>PG19 Perplexity</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compressive Transformer</td>
          <td>36</td>
          <td>33.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Routing Transformer</td>
          <td>22</td>
          <td>33.2</td>
          <td>490M</td>
      </tr>
      <tr>
          <td>Perceiver AR</td>
          <td>60</td>
          <td>28.9</td>
          <td>974.6M</td>
      </tr>
      <tr>
          <td>Block-Recurrent Transformer</td>
          <td>24</td>
          <td>26.50</td>
          <td>1.3B</td>
      </tr>
  </tbody>
</table>
<h3 id="ablations">Ablations</h3>
<ul>
<li><strong>Multiple recurrent layers</strong>: Two adjacent layers (9, 10) provide no benefit. Two separated layers (4, 10) help but no more than adding another non-recurrent layer.</li>
<li><strong>Number of states</strong>: Improvement up to 1024 states, degradation at 2048.</li>
<li><strong>Window size reduction</strong>: Reducing the sliding window hurts Transformer-XL dramatically but has smaller impact on the recurrent model, which compensates via recurrence.</li>
<li><strong>Gate type</strong>: The fixed gate consistently outperforms the LSTM gate despite being theoretically less expressive.</li>
</ul>
<h3 id="qualitative-analysis">Qualitative Analysis</h3>
<p>Comparing per-token predictions against Transformer-XL on PG19 books, the recurrent model&rsquo;s advantage comes overwhelmingly from predicting proper names (17/20 top-improvement tokens). In 19/20 cases, the predicted word was outside the attention window, confirming it was stored in recurrent state. The model can remember book titles and authors across 60,000+ tokens.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The Block-Recurrent Transformer demonstrates that recurrence at the block level is a cost-effective way to improve language modeling on long sequences. The fixed:skip configuration (the simplest variant) performs best, suggesting the model primarily uses recurrence for long-range name lookup rather than complex reasoning. The fact that removing the MLP from the recurrent layer has minimal impact further supports this interpretation.</p>
<p>Key limitations include: the model was only evaluated on language modeling perplexity (no downstream tasks), the LSTM gate underperforms the simpler fixed gate (suggesting untapped potential for more expressive recurrence), and the authors acknowledge that training the recurrent layer to fully exploit its capacity for knowledge extraction will require further advances.</p>
<p>The authors note that evaluating on downstream tasks requiring long-range context (book summarization, long-document QA, code completion) is an important direction for future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>PG19</td>
          <td>~29k books</td>
          <td>Public domain, freely available</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>arXiv</td>
          <td>Mathematics papers</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>GitHub</td>
          <td>Open-source repos</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adafactor</li>
<li>Learning rate: 1.0 with inverse square root decay (initial experiments), cosine decay with max 0.01 (scaling experiments)</li>
<li>Warmup: 1000 steps</li>
<li>Dropout: 0.05</li>
<li>Vocabulary: 32k SentencePiece (T5 pretrained for initial, custom for scaling)</li>
<li>Gate initialization: bias of $+1$ for forget gate, $-1$ for input gate to ensure initial &ldquo;remember&rdquo; behavior</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Layers</th>
          <th>Parameters</th>
          <th>Recurrent Layers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>12 (+1 recurrent)</td>
          <td>~151-164M</td>
          <td>Layer 10</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>24 (+2 recurrent)</td>
          <td>650M</td>
          <td>Layers 10, 20</td>
      </tr>
      <tr>
          <td>XL</td>
          <td>24 (+2 recurrent)</td>
          <td>1.3B</td>
          <td>Layers 10, 20</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bits-per-token</td>
          <td>Rec:fixed:skip</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Word-level PPL</td>
          <td>1.3B model</td>
          <td>26.50</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Error bars on PG19 are between 0.002 and 0.007 (3 runs with different seeds).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 32 Google V4 TPU replicas</li>
<li>Training time: ~48 hours for 500k steps on PG19</li>
<li>Batch size: 32 (segment length 4096) or 256 (segment length 512), adjusted so each model sees the same tokens per step</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Available</th>
          <th>License</th>
          <th>URL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Code (Meliad)</td>
          <td>Yes</td>
          <td>Apache 2.0</td>
          <td><a href="https://github.com/google-research/meliad">github.com/google-research/meliad</a></td>
      </tr>
      <tr>
          <td>PG19 Dataset</td>
          <td>Yes</td>
          <td>Public Domain</td>
          <td>Public</td>
      </tr>
      <tr>
          <td>arXiv Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>GitHub Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>Pretrained Models</td>
          <td>No</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Assessment</strong>: Partially Reproducible. Source code is available under Apache 2.0 and the PG19 dataset is public. However, two of three evaluation datasets (arXiv, GitHub) were obtained via private channels and are not redistributable. No pretrained model checkpoints are released.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hutchins, D., Schlag, I., Wu, Y., Dyer, E., &amp; Neyshabur, B. (2022). Block-Recurrent Transformers. <em>Advances in Neural Information Processing Systems 35 (NeurIPS 2022)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hutchins2022block,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Block-Recurrent Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hutchins, DeLesley and Schlag, Imanol and Wu, Yuhuai and Dyer, Ethan and Neyshabur, Behnam}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2203.07852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NaViT: Native Resolution Vision Transformer</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</guid><description>NaViT uses sequence packing to train Vision Transformers on images at native resolution and aspect ratio, improving efficiency and flexibility.</description><content:encoded><![CDATA[<h2 id="a-method-for-flexible-resolution-vision-transformers">A Method for Flexible-Resolution Vision Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces NaViT (Native Resolution ViT), a Vision Transformer trained using sequence packing to handle images of arbitrary resolution and aspect ratio. The core idea, called &ldquo;Patch n&rsquo; Pack,&rdquo; borrows example packing from NLP and applies it to vision: patches from multiple images of different sizes are concatenated into a single sequence, enabling native-resolution processing without resizing or padding.</p>
<h2 id="why-fixed-resolution-pipelines-are-suboptimal">Why Fixed-Resolution Pipelines Are Suboptimal</h2>
<p>Standard computer vision pipelines resize all images to a fixed square resolution before processing. This practice originates from convolutional neural network constraints, where fixed spatial dimensions were architecturally required. Even with Vision Transformers, which operate on sequences of patches and could in principle handle variable lengths, the convention of fixed-resolution input persists.</p>
<p>This approach has clear drawbacks. Most images are not square: analysis of ImageNet, LVIS, and WebLI shows that most images deviate more than 20% from a 1:1 aspect ratio. Resizing distorts content and discards information, while padding wastes computation. Prior work like FlexiViT addressed variable patch sizes and Pix2Struct introduced aspect-ratio-preserving patching, but neither fully solved the problem of training efficiently on images at their original resolution.</p>
<h2 id="patch-n-pack-sequence-packing-for-vision">Patch n&rsquo; Pack: Sequence Packing for Vision</h2>
<p>The key insight is that ViT already processes images as sequences of patch tokens, and NLP has long used example packing to handle variable-length sequences efficiently. NaViT applies this directly: patches from multiple images (each at its native resolution and aspect ratio) are packed into a single fixed-length sequence.</p>
<h3 id="architectural-modifications">Architectural Modifications</h3>
<p>Three changes enable Patch n&rsquo; Pack:</p>
<ol>
<li>
<p><strong>Masked self-attention and masked pooling</strong>: Attention masks prevent patches from different images from attending to each other. Masked pooling extracts a single representation per image from the packed sequence.</p>
</li>
<li>
<p><strong>Factorized positional embeddings</strong>: Standard 1D positional embeddings cannot handle arbitrary resolutions. NaViT decomposes position into separate $x$ and $y$ embeddings $\phi_{x}$ and $\phi_{y}$, which are summed together. Two schemes are considered:</p>
<ul>
<li>Absolute embeddings: $\phi(p): [0, \text{maxLen}] \to \mathbb{R}^{D}$, a function of the absolute patch index</li>
<li>Fractional embeddings: $\phi(r): [0, 1] \to \mathbb{R}^{D}$, where $r = p / \text{side-length}$ is the relative position along the image</li>
</ul>
</li>
<li>
<p><strong>Chunked contrastive loss</strong>: For contrastive pretraining, the $\mathcal{O}(n^{2})$ loss computation is handled via chunked computation across device subsets to support the high number of examples per sequence.</p>
</li>
</ol>
<h3 id="training-innovations">Training Innovations</h3>
<p>Packing enables two techniques that were previously impractical:</p>
<ul>
<li>
<p><strong>Continuous token dropping</strong>: Instead of dropping the same proportion of tokens from every image, the drop rate varies per image. Some images keep all tokens while others have aggressive dropping, reducing the train/inference discrepancy. The drop rate can follow a schedule that decreases over training.</p>
</li>
<li>
<p><strong>Resolution sampling</strong>: Each image&rsquo;s resolution is sampled from a distribution (e.g., $R \sim \mathcal{U}(64, R_{\text{max}})$) while preserving aspect ratio. This mixes the throughput benefits of small images with the detail of large ones.</p>
</li>
</ul>
<h3 id="computational-overhead">Computational Overhead</h3>
<p>A natural concern is the $\mathcal{O}(n^{2})$ attention cost for longer packed sequences. In practice, as the transformer hidden dimension scales, attention becomes an increasingly small fraction of total compute (the MLP dominates). Packing overhead is typically less than 2% from padding tokens, using a simple greedy bin-packing algorithm.</p>
<h2 id="pretraining-and-downstream-evaluation">Pretraining and Downstream Evaluation</h2>
<p>NaViT is evaluated in two pretraining setups:</p>
<ul>
<li><strong>Classification pretraining</strong> on JFT-4B with sigmoid cross-entropy loss, evaluated via linear probing (10 examples per class)</li>
<li><strong>Contrastive pretraining</strong> on WebLI using image-text contrastive loss, evaluated on zero-shot ImageNet classification and COCO retrieval</li>
</ul>
<h3 id="training-efficiency">Training Efficiency</h3>
<p>At fixed compute budget, NaViT consistently outperforms ViT across model scales. The top-performing ViT can be matched by NaViT with 4x less compute. The primary driver is throughput: packing with variable resolution and token dropping enables NaViT-L/16 to process approximately 5x more images during training.</p>
<h3 id="variable-resolution-results">Variable Resolution Results</h3>
<p>Models trained with variable resolution ($R \sim \mathcal{U}(64, R_{\text{max}})$) outperform fixed-resolution models even when evaluated at the fixed resolution&rsquo;s own training resolution. Sampling side lengths from a truncated normal biased toward lower values gives the best cost-performance trade-off.</p>
<p>For fine-tuning on ImageNet-1k, a single NaViT fine-tuned with variable resolutions (64 to 512) matches the performance of models fine-tuned at each specific resolution individually.</p>
<h3 id="positional-embedding-comparison">Positional Embedding Comparison</h3>
<p>Factorized embeddings outperform both standard ViT 1D embeddings (with interpolation) and Pix2Struct&rsquo;s learned 2D embeddings. The factorized approach generalizes to resolutions outside the training range, while 2D embeddings fail because they require seeing all $(x, y)$ coordinate pairs during training. Additive combination of $\phi_{x}$ and $\phi_{y}$ works best.</p>
<h3 id="token-dropping-strategies">Token Dropping Strategies</h3>
<p>Variable token dropping with Beta-distributed rates consistently outperforms constant rates. Resolution-dependent dropping (higher rates for higher-resolution images) further improves performance. Scheduling the drop rate to decrease over training provides additional gains.</p>
<h3 id="downstream-tasks">Downstream Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Setup</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Semantic segmentation</td>
          <td>ADE20k, L/16, linear decoder</td>
          <td>NaViT at $R_{384}$ beats ViT at $R_{512}$ while being 2x faster</td>
      </tr>
      <tr>
          <td>Object detection</td>
          <td>OWL-ViT-L/14 backbone</td>
          <td>NaViT: 28.3% LVIS AP vs. ViT: 23.3%</td>
      </tr>
      <tr>
          <td>Video classification</td>
          <td>Kinetics-400, tubelet extraction</td>
          <td>NaViT-L matches ViViT-L (80.4%) in ~6x fewer epochs</td>
      </tr>
      <tr>
          <td>Fairness annotation</td>
          <td>FairFace, CelebA linear probes</td>
          <td>Statistically significant accuracy improvements ($p = 3 \times 10^{-4}$)</td>
      </tr>
  </tbody>
</table>
<h3 id="out-of-distribution-robustness">Out-of-Distribution Robustness</h3>
<p>NaViT shows strong gains on ImageNet-A (which contains many extreme aspect ratios) when evaluated without center cropping. Performance on ObjectNet is also competitive. The model maintains stable calibration (ECE between 0.045 and 0.047) across a wide range of token counts per image (128 to 1024).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>NaViT demonstrates that sequence packing, when applied to Vision Transformers, yields substantial improvements in training efficiency, inference flexibility, and downstream performance. The approach processes images at their native resolution without the information loss from resizing or the waste from padding.</p>
<p>Key takeaways:</p>
<ul>
<li>4x compute reduction to match top ViT performance</li>
<li>A single model works across a continuous range of resolutions at inference time</li>
<li>Variable-resolution training and token dropping provide complementary efficiency gains</li>
<li>Factorized positional embeddings generalize to unseen resolutions</li>
<li>Benefits transfer to detection, segmentation, video, and fairness tasks</li>
</ul>
<p>Limitations: The paper does not release model weights or code. All experiments use Google-internal datasets (JFT-4B, WebLI) and infrastructure (TPUs, JAX/Scenic), making direct reproduction difficult. The attention masking approach for packing assumes that cross-image attention is undesirable, which may not hold for all tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification pretraining</td>
          <td>JFT-4B</td>
          <td>~4B labeled images</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Contrastive pretraining</td>
          <td>WebLI</td>
          <td>Large-scale web data</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Classification fine-tuning</td>
          <td>ImageNet-1k</td>
          <td>1.28M images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Segmentation</td>
          <td>ADE20k</td>
          <td>20K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Detection</td>
          <td>LVIS</td>
          <td>164K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Video</td>
          <td>Kinetics-400</td>
          <td>~240K videos</td>
          <td>Publicly available (partial)</td>
      </tr>
      <tr>
          <td>Fairness</td>
          <td>FairFace, CelebA</td>
          <td>108K / 200K images</td>
          <td>Publicly available</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Greedy bin-packing for sequence construction (less than 2% padding tokens)</li>
<li>Resolution sampling: side length from truncated normal $\mathcal{N}_{t}(-0.5, 1)$ mapped to $[64, R_{\text{max}}]$</li>
<li>Token dropping: Beta-distributed per-image rates, optionally resolution-dependent</li>
<li>Factorized positional embeddings with additive combination</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>NaViT variants: B/16, L/16, L/14</li>
<li>Based on vanilla ViT with query-key normalization, no biases, attention pooling</li>
<li>Implemented in JAX/FLAX within the Scenic framework</li>
<li>No public model checkpoints available</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NaViT</th>
          <th>ViT Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JFT linear probe (L/16)</td>
          <td>Matches top ViT</td>
          <td>4x more compute</td>
          <td>Compute-matched comparison</td>
      </tr>
      <tr>
          <td>ImageNet zero-shot (L/14)</td>
          <td>72.9%</td>
          <td>68.3%</td>
          <td>Contrastive pretraining</td>
      </tr>
      <tr>
          <td>LVIS AP (L/14)</td>
          <td>28.3%</td>
          <td>23.3%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>LVIS AP rare (L/14)</td>
          <td>24.3%</td>
          <td>17.2%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>ADE20k mIoU (L/16, 384)</td>
          <td>Beats ViT@512</td>
          <td>At 2x cost</td>
          <td>Segmenter linear decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training on Cloud TPUs (specific configuration not detailed)</li>
<li>Inference latency measured on Cloud TPUv3</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I., Oliver, A., Padlewski, P., Gritsenko, A., Lučić, M., &amp; Houlsby, N. (2023). Patch n&rsquo; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. <em>Advances in Neural Information Processing Systems 36 (NeurIPS 2023)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{dehghani2023patch,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Patch n&#39; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lučić, Mario and Houlsby, Neil}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2307.06304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher-2: End-to-End Markush Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher-2-multimodal-recognition/</guid><description>MarkushGrapher-2 fuses vision, text, and layout encoders with a dedicated OCR module for end-to-end Markush structure recognition from patent images.</description><content:encoded><![CDATA[<h2 id="a-multimodal-method-for-markush-structure-recognition">A Multimodal Method for Markush Structure Recognition</h2>
<p>This is a <strong>Method</strong> paper that introduces MarkushGrapher-2, a universal encoder-decoder model for recognizing both standard molecular structures and multimodal Markush structures from chemical images. The primary contribution is a dual-encoder architecture that fuses a pretrained OCSR (Optical Chemical Structure Recognition) vision encoder with a Vision-Text-Layout (VTL) encoder, connected through a dedicated ChemicalOCR module for end-to-end processing. The paper also introduces two new resources: a large-scale training dataset (USPTO-MOL-M) of real-world Markush structures extracted from USPTO patent MOL files, and IP5-M, a manually annotated benchmark of 1,000 Markush structures from five major patent offices.</p>
<h2 id="why-markush-structure-recognition-remains-challenging">Why Markush Structure Recognition Remains Challenging</h2>
<p><a href="https://en.wikipedia.org/wiki/Markush_structure">Markush structures</a> are compact representations used in patent documents to describe families of related molecules. They combine a visual backbone (atoms, bonds, variable regions) with textual definitions of substituents that can replace those variable regions. This multimodal nature makes them harder to parse than standard molecular diagrams.</p>
<p>Three factors limit automatic Markush recognition. First, visual styles vary across patent offices and publication years. Second, textual definitions lack standardization and often contain conditional or recursive descriptions. Third, real-world training data with comprehensive annotations is scarce. As a result, Markush structures are currently indexed only in two proprietary, manually curated databases: MARPAT and DWPIM.</p>
<p>Prior work, including the original <a href="/notes/chemistry/optical-structure-recognition/markush/markushgrapher/">MarkushGrapher</a>, required pre-annotated OCR outputs at inference time, limiting practical deployment. General-purpose models like GPT-5 and DeepSeek-OCR produce mostly chemically invalid outputs on Markush images, suggesting these lie outside their training distribution.</p>
<h2 id="dual-encoder-architecture-with-dedicated-chemicalocr">Dual-Encoder Architecture with Dedicated ChemicalOCR</h2>
<p>MarkushGrapher-2 uses two complementary encoding pipelines:</p>
<ol>
<li>
<p><strong>Vision encoder pipeline</strong>: The input image passes through a Swin-B Vision Transformer (taken from <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a>) pretrained for OCSR. This encoder extracts visual features representing molecular structures and remains frozen during training.</p>
</li>
<li>
<p><strong>Vision-Text-Layout (VTL) pipeline</strong>: The same image goes through ChemicalOCR, a compact 256M-parameter vision-language model fine-tuned from SmolDocling for OCR on chemical images. ChemicalOCR extracts character-level text and bounding boxes. These, combined with image patches, feed into a T5-base VTL encoder following the UDOP fusion paradigm, where visual and textual tokens are spatially aligned by bounding box overlap.</p>
</li>
</ol>
<p>The VTL encoder output is concatenated with projected embeddings from the vision encoder. This joint representation feeds a text decoder that auto-regressively generates a CXSMILES (ChemAxon Extended <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) string describing the backbone structure and a substituent table listing variable group definitions.</p>
<h3 id="two-stage-training-strategy">Two-Stage Training Strategy</h3>
<p>Training proceeds in two phases:</p>
<ul>
<li>
<p><strong>Phase 1 (Adaptation)</strong>: The vision encoder is frozen. The MLP projector and text decoder train on 243K real-world image-SMILES pairs from MolScribe&rsquo;s USPTO dataset (3 epochs). This aligns the decoder to the pretrained OCSR feature space.</p>
</li>
<li>
<p><strong>Phase 2 (Fusion)</strong>: The vision encoder, projector, and ChemicalOCR are all frozen. The VTL encoder and text decoder train on a mix of 235K synthetic and 145K real-world Markush samples (2 epochs). The VTL encoder learns the features needed for CXSMILES and substituent table prediction without disrupting the established OCSR representations.</p>
</li>
</ul>
<p>The total model has 831M parameters, of which 744M are trainable.</p>
<h2 id="datasets-and-evaluation-benchmarks">Datasets and Evaluation Benchmarks</h2>
<h3 id="training-data">Training Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical structures</td>
          <td>235K</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> SMILES augmented to CXSMILES, rendered with annotations</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>Manual OCR annotations</td>
          <td>7K</td>
          <td>IP5 patent document crops</td>
      </tr>
      <tr>
          <td>Phase 1 (OCSR)</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>Synthetic CXSMILES</td>
          <td>235K</td>
          <td>Same as OCR pretraining set</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>MolParser dataset</td>
          <td>91K</td>
          <td>Real-world Markush, converted to CXSMILES</td>
      </tr>
      <tr>
          <td>Phase 2 (MMSR)</td>
          <td>USPTO-MOL-M</td>
          <td>54K</td>
          <td>Real-world, auto-extracted from USPTO MOL files (2010-2025)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-benchmarks">Evaluation Benchmarks</h3>
<p><strong>Markush benchmarks</strong>: M2S (103 samples), USPTO-M (74), WildMol-M (10K, semi-manual), and the new IP5-M (1,000 manually annotated from USPTO, JPO, KIPO, CNIPA, and EPO patents, 1980-2025).</p>
<p><strong>OCSR benchmarks</strong>: USPTO (5,719), JPO (450), UOB (5,740), WildMol (10K).</p>
<p>The primary metric is <strong>CXSMILES Accuracy (A)</strong>: a prediction is correct when (1) the predicted SMILES matches the ground truth by <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChIKey</a> equivalence, and (2) all Markush features (variable groups, positional and frequency variation indicators) are correctly represented. Stereochemistry is ignored during evaluation.</p>
<h3 id="results-markush-structure-recognition">Results: Markush Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S</th>
          <th>USPTO-M</th>
          <th>WildMol-M</th>
          <th>IP5-M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>39</td>
          <td>30</td>
          <td>38.1</td>
          <td>47.7</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>21</td>
          <td>7</td>
          <td>28.1</td>
          <td>22.3</td>
      </tr>
      <tr>
          <td>GPT-5</td>
          <td>3</td>
          <td>0</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>0</td>
          <td>0</td>
          <td>1.9</td>
          <td>0.0</td>
      </tr>
      <tr>
          <td>MarkushGrapher-1</td>
          <td>38</td>
          <td>10</td>
          <td>32</td>
          <td>-</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td><strong>56</strong></td>
          <td><strong>13</strong></td>
          <td><strong>55</strong></td>
          <td><strong>48.0</strong></td>
      </tr>
  </tbody>
</table>
<p>On M2S, MarkushGrapher-2 achieves 56% CXSMILES accuracy vs. 38% for MarkushGrapher-1, a relative improvement of 47%. On WildMol-M (the largest benchmark at 10K samples), MarkushGrapher-2 reaches 55% vs. 38.1% for MolParser-Base and 32% for MarkushGrapher-1. GPT-5 and DeepSeek-OCR generate mostly chemically invalid outputs on Markush images: only 30% and 15% of their predictions are valid CXSMILES on M2S, respectively.</p>
<h3 id="results-standard-molecular-structure-recognition">Results: Standard Molecular Structure Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WildMol</th>
          <th>JPO</th>
          <th>UOB</th>
          <th>USPTO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-Base</td>
          <td>76.9</td>
          <td>78.9</td>
          <td>91.8</td>
          <td>93.0</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>66.4</td>
          <td>76.2</td>
          <td>87.4</td>
          <td>93.1</td>
      </tr>
      <tr>
          <td>DECIMER 2.7</td>
          <td>56.0</td>
          <td>64.0</td>
          <td>88.3</td>
          <td>59.9</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molgrapher/">MolGrapher</a></td>
          <td>45.5</td>
          <td>67.5</td>
          <td>94.9</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>DeepSeek-OCR</td>
          <td>25.8</td>
          <td>31.6</td>
          <td>78.7</td>
          <td>36.9</td>
      </tr>
      <tr>
          <td><strong>MarkushGrapher-2</strong></td>
          <td>68.4</td>
          <td>71.0</td>
          <td><strong>96.6</strong></td>
          <td>89.8</td>
      </tr>
  </tbody>
</table>
<p>MarkushGrapher-2 achieves the highest score on UOB (96.6%) and remains competitive on other OCSR benchmarks, despite being primarily optimized for Markush recognition.</p>
<h3 id="chemicalocr-vs-general-ocr">ChemicalOCR vs. General OCR</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>M2S F1</th>
          <th>USPTO-M F1</th>
          <th>IP5-M F1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PaddleOCR v5</td>
          <td>7.7</td>
          <td>1.2</td>
          <td>1.9</td>
      </tr>
      <tr>
          <td>EasyOCR</td>
          <td>10.2</td>
          <td>18.0</td>
          <td>18.4</td>
      </tr>
      <tr>
          <td><strong>ChemicalOCR</strong></td>
          <td><strong>87.2</strong></td>
          <td><strong>93.0</strong></td>
          <td><strong>86.5</strong></td>
      </tr>
  </tbody>
</table>
<p>General-purpose OCR tools fail on chemical images because they misinterpret bonds as characters and cannot parse chemical abbreviations. ChemicalOCR outperforms both by a large margin.</p>
<h2 id="ablation-results-and-key-findings">Ablation Results and Key Findings</h2>
<p><strong>OCR input is critical for Markush features.</strong> Without OCR, CXSMILES accuracy drops from 56% to 4% on M2S, and from 53.7% to 15.4% on IP5-M. The backbone structure accuracy ($A_{\text{InChIKey}}$) also drops substantially (from 80% to 39% on M2S), though the vision encoder alone can still recover some structural information. This confirms that textual cues (brackets, indices, variable definitions) are essential for Markush feature prediction.</p>
<p><strong>Two-phase training improves both tasks.</strong> Compared to single-phase (fusion only) training, the two-phase strategy improves CXSMILES accuracy from 44% to 50% on M2S and from 53.0% to 61.5% on JPO after the same number of epochs. Adapting the decoder to OCSR features before introducing the VTL encoder prevents the fusion process from degrading learned visual representations.</p>
<p><strong>Frequency variation indicators remain the hardest feature.</strong> On IP5-M, the per-feature breakdown shows 73.3% accuracy for backbone InChI, 74.8% for variable groups, 78.8% for positional variation, but only 30.7% for frequency variation (Sg groups). These repeating structural units are particularly challenging to represent and predict.</p>
<p><strong>Limitations</strong>: The model relies on accurate OCR as a prerequisite. Performance on USPTO-M (13% CXSMILES accuracy) lags behind other benchmarks, likely due to the older patent styles in that dataset. The paper does not report inference latency.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OCR pretraining</td>
          <td>Synthetic chemical images</td>
          <td>235K</td>
          <td>Generated from PubChem SMILES, augmented to CXSMILES</td>
      </tr>
      <tr>
          <td>OCR fine-tuning</td>
          <td>IP5 patent crops</td>
          <td>7K</td>
          <td>Manually annotated</td>
      </tr>
      <tr>
          <td>Phase 1 training</td>
          <td>MolScribe USPTO</td>
          <td>243K</td>
          <td>Public, real image-SMILES pairs</td>
      </tr>
      <tr>
          <td>Phase 2 training</td>
          <td>Synthetic + MolParser + USPTO-MOL-M</td>
          <td>380K</td>
          <td>Mix of synthetic (235K), MolParser (91K), USPTO-MOL-M (54K)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>M2S, USPTO-M, WildMol-M, IP5-M</td>
          <td>103 to 10K</td>
          <td>Markush benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>WildMol, JPO, UOB, USPTO</td>
          <td>450 to 10K</td>
          <td>OCSR benchmarks</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vision encoder</td>
          <td>Swin-B ViT (from MolScribe)</td>
          <td>~87M</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td>VTL encoder + decoder</td>
          <td>T5-base</td>
          <td>~744M trainable</td>
          <td>Trained</td>
      </tr>
      <tr>
          <td>ChemicalOCR</td>
          <td>SmolDocling-based VLM</td>
          <td>256M</td>
          <td>Fine-tuned, frozen in Phase 2</td>
      </tr>
      <tr>
          <td>MLP projector</td>
          <td>Linear projection</td>
          <td>-</td>
          <td>Trained in Phase 1, frozen in Phase 2</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>831M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Definition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CXSMILES Accuracy (A)</td>
          <td>Percentage of samples where InChIKey matches AND all Markush features correct</td>
      </tr>
      <tr>
          <td>$A_{\text{InChIKey}}$</td>
          <td>Backbone structure accuracy only (ignoring Markush features)</td>
      </tr>
      <tr>
          <td>Table Accuracy</td>
          <td>Percentage of correctly predicted substituent tables</td>
      </tr>
      <tr>
          <td>Markush Accuracy</td>
          <td>Joint CXSMILES + Table accuracy</td>
      </tr>
      <tr>
          <td>OCR F1</td>
          <td>Bounding-box-level precision/recall at IoU &gt; 0.5</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: NVIDIA A100 GPU</li>
<li>Phase 1: 3 epochs, Adam optimizer, lr 5e-4, 1000 warmup steps, batch size 10, weight decay 1e-3</li>
<li>Phase 2: 2 epochs, batch size 8</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation of MarkushGrapher-2 with models and datasets</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Code, models, and datasets are all publicly released under an MIT license with documented training hyperparameters and a single A100 GPU requirement.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Strohmeyer, T., Morin, L., Meijer, G. I., Weber, V., Nassar, A., &amp; Staar, P. (2026). MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures. In <em>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>.</p>
<p><strong>Publication</strong>: CVPR 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository (MIT License)</a></li>
<li><a href="https://arxiv.org/abs/2603.28550">arXiv Preprint</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{strohmeyer2026markushgrapher,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Strohmeyer, Tim and Morin, Lucas and Meijer, Gerhard Ingmar and Weber, Val\&#39;{e}ry and Nassar, Ahmed and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2603.28550}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformers and LLMs for Chemistry Drug Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/transformers-llms-chemistry-drug-discovery/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/transformers-llms-chemistry-drug-discovery/</guid><description>Bran and Schwaller review transformer architectures for chemistry, from task-specific SMILES models to multimodal LLMs and chemistry agents.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformers-in-chemistry">A Systematization of Transformers in Chemistry</h2>
<p>This book chapter by Bran and Schwaller is a <strong>Systematization</strong> paper that organizes the growing body of work applying transformer architectures to chemistry and drug discovery. Rather than proposing a new method, the authors trace a three-stage evolution: (1) task-specific single-modality models operating on SMILES and reaction strings, (2) multimodal models bridging molecular representations with spectra, synthesis actions, and natural language, and (3) large language models and LLM-powered agents capable of general chemical reasoning.</p>
<h2 id="why-transformers-for-chemistry">Why Transformers for Chemistry?</h2>
<p>The authors motivate the review by drawing analogies between natural language and chemical language. Just as text can be decomposed into subwords and tokens, molecules can be linearized into <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings, and chemical reactions can be encoded as reaction SMILES. This structural parallel enabled direct transfer of transformer architectures, originally designed for machine translation, to chemical prediction tasks.</p>
<p>Several factors accelerated this adoption:</p>
<ul>
<li>The publication of open chemical databases and benchmarks (e.g., <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>, Open Reaction Database, Therapeutics Data Commons)</li>
<li>Improvements in compute infrastructure and training algorithms</li>
<li>The success of attention mechanisms at capturing context-dependent relationships, which proved effective for learning chemical grammar and atom-level correspondences</li>
</ul>
<p>The review positions the transformer revolution in chemistry as a natural extension of NLP advances, noting that the gap between chemical and natural language is progressively closing.</p>
<h2 id="molecular-representations-as-language">Molecular Representations as Language</h2>
<p>A key section of the review covers text-based molecular representations that make transformer applications possible:</p>
<ul>
<li><strong>SMILES</strong> (Simplified Molecular Input Line Entry System): The dominant linearization scheme since the 1980s, encoding molecular graphs as character sequences with special symbols for bonds, branches, and rings.</li>
<li><strong>SELFIES</strong> (Self-Referencing Embedded Strings): A newer representation that guarantees every string maps to a valid molecule, addressing the robustness issues of SMILES in generative settings.</li>
<li><strong>Reaction SMILES</strong>: Extends molecular representations to encode full chemical reactions in the format &ldquo;A.B &gt; catalyst.reagent &gt; C.D&rdquo;, enabling reaction prediction as a sequence-to-sequence task.</li>
</ul>
<p>The authors note that while IUPAC names, InChI, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> exist as alternatives, SMILES and SELFIES dominate practical applications.</p>
<h2 id="stage-1-task-specific-transformer-models">Stage 1: Task-Specific Transformer Models</h2>
<p>The first stage of transformer adoption focused on clearly defined chemical tasks, with models trained on a single data modality (molecular strings).</p>
<h3 id="chemical-translation-tasks">Chemical Translation Tasks</h3>
<p>The encoder-decoder architecture was directly applied to tasks framed as translation:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a></strong> (Schwaller et al.): Treated reaction prediction as translation from reactant SMILES to product SMILES, becoming a leading method for forward synthesis prediction.</li>
<li><strong>Retrosynthetic planning</strong>: The reverse task, predicting reactants from products, with iterative application to construct full retrosynthetic trees mapping to commercially available building blocks.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al.): A pre-trained model across multiple chemical tasks, offering transferability to new applications with improved performance.</li>
<li><strong>Graph-to-sequence models</strong> (Tu and Coley): Used a custom graph encoder with a transformer decoder, achieving improvements through permutation-invariant molecular graph encoding.</li>
</ul>
<h3 id="representation-learning-and-feature-extraction">Representation Learning and Feature Extraction</h3>
<p>Encoder-only transformers proved valuable for generating molecular and reaction embeddings:</p>
<ul>
<li><strong>Reaction representations</strong> (Wang et al., SMILES-BERT): Trained models to generate reaction vectors that outperformed hand-engineered features on downstream regression tasks.</li>
<li><strong>Reaction classification</strong> (Schwaller et al.): Replaced the decoder with a classification layer to map chemical reactions by class, revealing clustering patterns by reaction type, data source, and molecular properties.</li>
<li><strong>Yield prediction</strong>: Regression heads attached to encoders achieved strong results on high-throughput experimentation datasets.</li>
<li><strong>Protein language models</strong> (Rives et al., ESM): Trained on 250 million protein sequences using unsupervised learning, achieving strong performance on protein property prediction and structure forecasting.</li>
<li><strong>RXNMapper</strong> (Schwaller et al.): A notable application where attention weight analysis revealed that transformers internally learn atom-to-atom mappings in chemical reactions, leading to an open-source atom mapping algorithm that outperformed existing approaches.</li>
</ul>
<h2 id="stage-2-multimodal-chemical-models">Stage 2: Multimodal Chemical Models</h2>
<p>The second stage extended transformers beyond molecular strings to incorporate additional data types:</p>
<ul>
<li><strong>Molecular captioning</strong>: Describing molecules in natural language, covering scaffolds, sources, drug interactions, and other features (Edwards et al.).</li>
<li><strong>Bidirectional molecule-text conversion</strong>: Models capable of generating molecules from text queries and performing molecule-to-molecule tasks (Christofidellis et al.).</li>
<li><strong>Experimental procedure prediction</strong>: Generating actionable synthesis steps from reaction SMILES (Vaucher et al.), bridging the gap between retrosynthetic planning and laboratory execution.</li>
<li><strong>Structural elucidation from IR spectra</strong>: Encoding IR spectra as text sequences alongside chemical formulas, then predicting SMILES from these inputs (Alberts et al.), achieving 45% accuracy in structure prediction and surpassing prior approaches for functional group identification.</li>
</ul>
<h2 id="stage-3-large-language-models-and-chemistry-agents">Stage 3: Large Language Models and Chemistry Agents</h2>
<p>The most recent stage builds on foundation models pre-trained on vast text corpora, adapted for chemistry through fine-tuning and in-context learning.</p>
<h3 id="scaling-laws-and-emergent-capabilities">Scaling Laws and Emergent Capabilities</h3>
<p>The authors discuss how model scaling leads to emergent capabilities relevant to chemistry:</p>
<ul>
<li>Below certain compute thresholds, model performance on chemistry tasks appears random.</li>
<li>Above critical sizes, sudden improvements emerge, along with capabilities like chain-of-thought (CoT) reasoning and instruction following.</li>
<li>These emergent abilities enable chemistry tasks that require multi-step reasoning without explicit training on chemical data.</li>
</ul>
<h3 id="llms-as-chemistry-tools">LLMs as Chemistry Tools</h3>
<p>Key applications of LLMs in chemistry include:</p>
<ul>
<li><strong><a href="/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/">Fine-tuning for low-data chemistry</a></strong> (Jablonka et al.): GPT-3 fine-tuned on limited chemistry datasets performed comparably to, and sometimes exceeded, specialized models with engineered features for tasks like predicting transition wavelengths and phase classification.</li>
<li><strong>In-context learning</strong>: Providing LLMs with a few examples enables prediction on chemistry tasks without any parameter updates, particularly valuable when data is scarce.</li>
<li><strong>Bayesian optimization with LLMs</strong> (Ramos et al.): Using GPT models for uncertainty-calibrated regression, enabling catalyst and molecular optimization directly from synthesis procedures without feature engineering.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">3D structure generation</a></strong> (Flam-Shepherd and Aspuru-Guzik): Using language models to generate molecular structures with three-dimensional atomic positions in XYZ, CIF, and PDB formats, matching graph-based algorithms while overcoming representation limitations.</li>
</ul>
<h3 id="llm-powered-chemistry-agents">LLM-Powered Chemistry Agents</h3>
<p>The review highlights the agent paradigm as the most impactful recent development:</p>
<ul>
<li><strong>14 LLM use-cases</strong> (Jablonka et al.): A large-scale collaborative effort demonstrating applications from computational tool wrappers to reaction optimization assistants and scientific question answering.</li>
<li><strong><a href="/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/">ChemCrow</a></strong> (Bran, Cox et al.): An LLM-powered agent equipped with curated computational chemistry tools, capable of planning and executing tasks across drug design, materials design, and synthesis. ChemCrow demonstrated that tool integration overcomes LLM hallucination issues by grounding responses in reliable data sources.</li>
<li><strong>Autonomous scientific research</strong> (Boiko et al.): Systems with focus on cloud laboratory operability.</li>
</ul>
<p>The agent paradigm offers tool composability through natural language interfaces, allowing users to chain multiple computational tools into custom pipelines.</p>
<h2 id="outlook-and-limitations">Outlook and Limitations</h2>
<p>The authors identify several themes for the future:</p>
<ul>
<li>The three stages represent increasing generality, from task-specific single-modality models to open-ended agents.</li>
<li>Natural language interfaces are progressively closing the gap between chemical and human language.</li>
<li>Tool integration through agents provides grounding that mitigates hallucination, a known limitation of direct LLM application to chemistry.</li>
<li>The review acknowledges that LLMs have a &ldquo;high propensity to generate false and inaccurate content&rdquo; on chemical tasks, making tool-augmented approaches preferable to direct application.</li>
</ul>
<p>The chapter does not provide quantitative benchmarks or systematic comparisons across the methods discussed, as its goal is to organize the landscape rather than evaluate individual methods.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review/survey chapter and does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the referenced works rather than the review itself.</p>
<h3 id="key-referenced-resources">Key Referenced Resources</h3>
<p>Several open-source tools and datasets discussed in the review are publicly available:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/rxn4chemistry/rxnmapper">RXNMapper</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Attention-based atom mapping</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-public">ChemCrow</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>LLM-powered chemistry agent</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Various</td>
          <td>Molecular ML benchmarks</td>
      </tr>
      <tr>
          <td><a href="https://open-reaction-database.org/">Open Reaction Database</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA-4.0</td>
          <td>Curated reaction data</td>
      </tr>
      <tr>
          <td><a href="https://tdcommons.ai/">Therapeutics Data Commons</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Drug discovery ML datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-classification">Reproducibility Classification</h3>
<p><strong>Not applicable</strong> (review paper). Individual referenced works range from Highly Reproducible (open-source models like RXNMapper, ChemCrow) to Partially Reproducible (some models without released code) to Closed (proprietary LLMs like GPT-3/GPT-4 used in fine-tuning studies).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bran, A. M., &amp; Schwaller, P. (2024). Transformers and Large Language Models for Chemistry and Drug Discovery. In <em>Drug Development Supported by Informatics</em> (pp. 143-163). Springer Nature Singapore. <a href="https://doi.org/10.1007/978-981-97-4828-0_8">https://doi.org/10.1007/978-981-97-4828-0_8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@incollection</span>{bran2024transformers,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformers and Large Language Models for Chemistry and Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bran, Andres M. and Schwaller, Philippe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Drug Development Supported by Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{143--163}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature Singapore}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1007/978-981-97-4828-0_8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ReactionT5: Pre-trained T5 for Reaction Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/</guid><description>ReactionT5 uses two-stage pretraining on ZINC and the Open Reaction Database to enable competitive reaction and yield prediction with minimal fine-tuning data.</description><content:encoded><![CDATA[<h2 id="a-two-stage-pre-trained-transformer-for-chemical-reactions">A Two-Stage Pre-trained Transformer for Chemical Reactions</h2>
<p>ReactionT5 is a <strong>Method</strong> paper that proposes a T5-based pre-trained model for chemical reaction tasks, specifically product prediction and yield prediction. The primary contribution is a two-stage pretraining pipeline: first on a compound library (ZINC, 23M molecules) to learn molecular representations, then on a large-scale reaction database (the Open Reaction Database, 1.5M reactions) to learn reaction-level patterns. The key result is that this pre-trained model can be fine-tuned with very limited target-domain data (as few as 30 reactions) and still achieve competitive performance against models trained on full datasets.</p>
<h2 id="bridging-the-gap-between-single-molecule-and-multi-molecule-pretraining">Bridging the Gap Between Single-Molecule and Multi-Molecule Pretraining</h2>
<p>While transformer-based models pre-trained on compound libraries (e.g., <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, MolGPT) have seen substantial development, most focus on single-molecule inputs and outputs. Pretraining for multi-molecule contexts, such as chemical reactions involving reactants, reagents, catalysts, and products, remains underexplored. T5Chem supports multi-task reaction prediction but focuses on building a single multi-task model rather than investigating the effectiveness of pre-trained models for fine-tuning on limited in-house data.</p>
<p>The authors identify two key gaps:</p>
<ol>
<li>Most pre-trained chemical models do not account for reaction-level interactions between multiple molecules.</li>
<li>In practical settings, target-domain reaction data is often scarce, making transfer learning from large public datasets essential.</li>
</ol>
<h2 id="two-stage-pretraining-with-compound-restoration">Two-Stage Pretraining with Compound Restoration</h2>
<p>The core innovation is a two-stage pretraining procedure built on the <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5 (text-to-text transfer transformer)</a> architecture:</p>
<p><strong>Stage 1: Compound Pretraining (CompoundT5)</strong>. An initialized T5 model is trained on 23M <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> from the ZINC database using span-masked language modeling. The model learns to predict masked subsequences of SMILES tokens. A SentencePiece unigram tokenizer is trained on this compound library, allowing more compact representations than character-level or atom-level tokenizers. After this stage, new tokens are added to the tokenizer to cover metal atoms and other characters present in the reaction database but absent from ZINC.</p>
<p><strong>Stage 2: Reaction Pretraining (ReactionT5)</strong>. CompoundT5 is further pretrained on 1.5M reactions from the Open Reaction Database (ORD) on both product prediction and yield prediction tasks. Reactions are formulated as text-to-text tasks using special tokens:</p>
<ul>
<li><code>REACTANT:</code>, <code>REAGENT:</code>, and <code>PRODUCT:</code> tokens delimit the role of each molecule in the reaction string.</li>
<li>For product prediction, the model takes reactants and reagents as input and generates product SMILES.</li>
<li>For yield prediction, the model takes the full reaction (including products) and outputs a numerical yield value.</li>
</ul>
<p><strong>Compound Restoration</strong>. A notable methodological detail is the handling of uncategorized compounds in the ORD. About 31.8% of ORD reactions contain compounds with unknown roles. Simply discarding these reactions introduces severe product bias (only 447 unique products remain vs. 439,898 with uncategorized data included). The authors develop RestorationT5, a binary classifier built from CompoundT5, that assigns uncategorized compounds to either reactant or reagent roles. This classifier uses a sigmoid output layer and achieves an F1 score of 0.1564 at a threshold of 0.97, outperforming a random forest baseline (F1 = 0.1136). The restored dataset (&ldquo;ORD(restored)&rdquo;) is then used for reaction pretraining.</p>
<p>For yield prediction, the loss function is mean squared error:</p>
<p>$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$</p>
<p>where $y_i$ is the true yield (normalized to [0, 1]) and $\hat{y}_i$ is the predicted yield.</p>
<h2 id="experimental-setup-product-and-yield-prediction-benchmarks">Experimental Setup: Product and Yield Prediction Benchmarks</h2>
<h3 id="product-prediction">Product Prediction</h3>
<p>The USPTO dataset (479K reactions) is used for evaluation, with standard train/val/test splits (409K/30K/40K). Reactions overlapping with the ORD (18%) are removed during evaluation. Beam search with beam size 10 is used for decoding, and minimum/maximum output length constraints are set based on the training data distribution. Top-k accuracy (k = 1, 2, 3, 5) and invalidity rate are reported.</p>
<p>Baselines include Seq-to-seq, WLDN (graph neural network), <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, and T5Chem.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Train</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Seq-to-seq</td>
          <td>USPTO</td>
          <td>80.3</td>
          <td>84.7</td>
          <td>86.2</td>
          <td>87.5</td>
          <td>-</td>
      </tr>
      <tr>
          <td>WLDN</td>
          <td>USPTO</td>
          <td>85.6</td>
          <td>90.5</td>
          <td>92.8</td>
          <td>93.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Molecular Transformer</td>
          <td>USPTO</td>
          <td>88.8</td>
          <td>92.6</td>
          <td>-</td>
          <td>94.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>USPTO</td>
          <td>90.4</td>
          <td>94.2</td>
          <td>-</td>
          <td>96.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>USPTO</td>
          <td>88.0</td>
          <td>92.4</td>
          <td>93.9</td>
          <td>95.0</td>
          <td>7.5</td>
      </tr>
      <tr>
          <td>ReactionT5 (restored ORD)</td>
          <td>USPTO200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<p>A critical finding: ReactionT5 pre-trained on ORD achieves 0% accuracy on USPTO without fine-tuning due to domain mismatch (ORD includes byproducts; USPTO lists only the main product). Fine-tuning on just 200 USPTO reactions with the restored ORD model produces competitive results.</p>
<p>The few-shot fine-tuning analysis shows rapid performance scaling:</p>
<table>
  <thead>
      <tr>
          <th>Samples</th>
          <th>Top-1</th>
          <th>Top-2</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Invalidity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>10</td>
          <td>9.0</td>
          <td>12.5</td>
          <td>15.3</td>
          <td>19.1</td>
          <td>12.4</td>
      </tr>
      <tr>
          <td>30</td>
          <td>80.5</td>
          <td>87.3</td>
          <td>89.8</td>
          <td>92.0</td>
          <td>17.2</td>
      </tr>
      <tr>
          <td>50</td>
          <td>83.7</td>
          <td>89.9</td>
          <td>92.2</td>
          <td>94.0</td>
          <td>14.8</td>
      </tr>
      <tr>
          <td>100</td>
          <td>85.1</td>
          <td>91.0</td>
          <td>92.8</td>
          <td>94.4</td>
          <td>14.0</td>
      </tr>
      <tr>
          <td>200</td>
          <td>85.5</td>
          <td>91.7</td>
          <td>93.5</td>
          <td>94.9</td>
          <td>12.0</td>
      </tr>
  </tbody>
</table>
<h3 id="yield-prediction">Yield Prediction</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling dataset (3,955 reactions) is used with random 7:3 splits (repeated 10 times) plus four out-of-sample test sets (Tests 1-4) designed so that similar reactions do not appear in both train and test.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Random 7:3</th>
          <th>Test 1</th>
          <th>Test 2</th>
          <th>Test 3</th>
          <th>Test 4</th>
          <th>Avg. Tests 1-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DFT</td>
          <td>0.92</td>
          <td>0.80</td>
          <td>0.77</td>
          <td>0.64</td>
          <td>0.54</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MFF</td>
          <td>0.927</td>
          <td>0.851</td>
          <td>0.713</td>
          <td>0.635</td>
          <td>0.184</td>
          <td>0.596</td>
      </tr>
      <tr>
          <td>Yield-BERT</td>
          <td>0.951</td>
          <td>0.838</td>
          <td>0.836</td>
          <td>0.738</td>
          <td>0.538</td>
          <td>0.738</td>
      </tr>
      <tr>
          <td>T5Chem</td>
          <td>0.970</td>
          <td>0.811</td>
          <td>0.907</td>
          <td>0.789</td>
          <td>0.627</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>CompoundT5</td>
          <td>0.971</td>
          <td>0.855</td>
          <td>0.852</td>
          <td>0.712</td>
          <td>0.547</td>
          <td>0.741</td>
      </tr>
      <tr>
          <td>ReactionT5</td>
          <td>0.966</td>
          <td>0.914</td>
          <td>0.940</td>
          <td>0.819</td>
          <td>0.896</td>
          <td>0.892</td>
      </tr>
      <tr>
          <td>ReactionT5 (zero-shot)</td>
          <td>0.904</td>
          <td>0.919</td>
          <td>0.927</td>
          <td>0.847</td>
          <td>0.909</td>
          <td>0.900</td>
      </tr>
  </tbody>
</table>
<p>ReactionT5 achieves the highest average $R^2$ across Tests 1-4 (0.892), with the zero-shot variant performing even better (0.900). The improvement is most dramatic on Test 4, the hardest split, where ReactionT5 achieves $R^2 = 0.896$ versus T5Chem&rsquo;s 0.627 and Yield-BERT&rsquo;s 0.538.</p>
<p>In a low-data regime (30% train / 70% test), ReactionT5 ($R^2 = 0.927$) substantially outperforms a random forest baseline ($R^2 = 0.853$), and even zero-shot ReactionT5 ($R^2 = 0.898$) exceeds the random forest.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Two-stage pretraining is effective</strong>: Compound pretraining followed by reaction pretraining produces models with strong generalization, particularly on out-of-distribution test sets.</li>
<li><strong>Few-shot transfer works</strong>: With as few as 30 fine-tuning reactions, ReactionT5 achieves over 80% Top-1 accuracy on product prediction, competitive with models trained on the full USPTO dataset.</li>
<li><strong>Compound restoration matters</strong>: Restoring uncategorized compounds in the ORD is essential for product prediction. Without restoration, fine-tuning on 200 USPTO reactions yields 0% accuracy; with restoration, the same fine-tuning yields 85.5% Top-1.</li>
<li><strong>Zero-shot yield prediction is surprisingly effective</strong>: ReactionT5 achieves $R^2 = 0.900$ on the out-of-sample yield tests without any task-specific fine-tuning, outperforming all fine-tuned baselines.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Product prediction shows a high invalidity rate (12.0% for the best ReactionT5 variant) compared to CompoundT5 (7.5%), suggesting the reaction pretraining may introduce some noise.</li>
<li>The 0% accuracy without fine-tuning on product prediction reveals a significant domain gap between ORD and USPTO annotation conventions (byproducts vs. main products).</li>
<li>The RestorationT5 classifier has low precision (0.0878) despite high recall (0.7212), meaning many compounds are incorrectly assigned roles. The paper does not investigate how this impacts downstream performance.</li>
<li>The paper does not report training times, computational costs, or model sizes, making resource requirements unclear.</li>
<li>Only two downstream tasks (product prediction on USPTO, yield prediction on Buchwald-Hartwig) are evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compound pretraining</td>
          <td>ZINC</td>
          <td>22,992,522 compounds</td>
          <td>SMILES canonicalized with <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a></td>
      </tr>
      <tr>
          <td>Reaction pretraining</td>
          <td>ORD (restored)</td>
          <td>1,505,916 reactions</td>
          <td>Atom mapping removed, compounds canonicalized</td>
      </tr>
      <tr>
          <td>Product prediction eval</td>
          <td>USPTO</td>
          <td>479,035 reactions</td>
          <td>409K/30K/40K train/val/test split</td>
      </tr>
      <tr>
          <td>Yield prediction eval</td>
          <td>Buchwald-Hartwig C-N</td>
          <td>3,955 reactions</td>
          <td>Random 7:3 split (10 repeats) + 4 OOS tests</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Base architecture: T5 (text-to-text transfer transformer)</li>
<li>Tokenizer: SentencePiece unigram, trained on ZINC, extended with special reaction tokens</li>
<li>Compound pretraining: Span-masked language modeling (15% masking rate, average span length 3)</li>
<li>Beam search: size 10 for product prediction</li>
<li>Output length constraints: min/max from training data distribution</li>
<li>Yield normalization: clipped to [0, 100], then scaled to [0, 1]</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>CompoundT5: T5 pretrained on ZINC</li>
<li>RestorationT5: CompoundT5 fine-tuned for binary classification (reactant vs. reagent)</li>
<li>ReactionT5: CompoundT5 pretrained on ORD for product and yield prediction</li>
<li>Pre-trained weights available on Hugging Face</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>Product prediction</td>
          <td>85.5%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>Top-5 accuracy</td>
          <td>Product prediction</td>
          <td>94.9%</td>
          <td>ReactionT5 with 200 fine-tuning reactions</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (random)</td>
          <td>0.966</td>
          <td>ReactionT5 fine-tuned</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Yield prediction (OOS avg.)</td>
          <td>0.900</td>
          <td>ReactionT5 zero-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Training times and GPU requirements are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sagawatatsuya/ReactionT5v2">ReactionT5v2 (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/sagawa">ReactionT5 models (Hugging Face)</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sagawa, T. &amp; Kojima, R. (2023). ReactionT5: a large-scale pre-trained model towards application of limited reaction data. <em>arXiv preprint arXiv:2311.06708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sagawa2023reactiont5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ReactionT5: a large-scale pre-trained model towards application of limited reaction data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sagawa, Tatsuya and Kojima, Ryosuke}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2311.06708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2311.06708}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PharmaGPT: Domain-Specific LLMs for Pharma and Chem</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/pharmagpt-domain-specific-llms-biopharmaceutical/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/pharmagpt-domain-specific-llms-biopharmaceutical/</guid><description>PharmaGPT introduces 13B and 70B parameter LLMs trained on biopharmaceutical and chemical corpora, outperforming GPT-3.5 and rivaling GPT-4 on pharmacy exams.</description><content:encoded><![CDATA[<h2 id="a-domain-specific-llm-suite-for-biopharmaceuticals-and-chemistry">A Domain-Specific LLM Suite for Biopharmaceuticals and Chemistry</h2>
<p>This is a <strong>Method</strong> paper that introduces PharmaGPT, a suite of domain-specific large language models with 13 billion and 70 billion parameters. The models are built on the LLaMA architecture and undergo continued pretraining on a curated corpus of biopharmaceutical and chemical literature, followed by instruction fine-tuning and reinforcement learning from human feedback (RLHF). The primary contribution is demonstrating that domain-specific continued pretraining on a general-purpose LLM backbone can produce models that outperform much larger general-purpose models on pharmaceutical knowledge tasks, using only a fraction of the parameters.</p>
<h2 id="bridging-the-gap-between-general-purpose-llms-and-specialized-pharmaceutical-knowledge">Bridging the Gap Between General-Purpose LLMs and Specialized Pharmaceutical Knowledge</h2>
<p>General-purpose LLMs like GPT-3.5 and GPT-4 show impressive broad capabilities but often fall short in specialized domains requiring precise terminology, deep domain knowledge, and high accuracy. The biopharmaceutical and chemical sectors present particular challenges: intricate terminologies, specialized regulatory knowledge, and a demand for precision that general models cannot consistently deliver. Most state-of-the-art LLMs are proprietary, English-centric, and lack depth in vertical domains. The authors identify a gap in the availability of domain-specific LLMs for biomedicine and chemistry, particularly multilingual models that can handle both English and Chinese pharmaceutical content.</p>
<h2 id="continued-pretraining-with-domain-specific-data-and-weighted-instruction-tuning">Continued Pretraining with Domain-Specific Data and Weighted Instruction Tuning</h2>
<p>PharmaGPT&rsquo;s core innovation lies in its training pipeline, which adapts the LLaMA backbone through three stages:</p>
<p><strong>Extended Tokenizer</strong>: The authors develop a new tokenizer using <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">byte-pair encoding (BPE)</a> from SentencePiece, trained on their pretraining data and merged with the LLaMA2 tokenizer. This extends the vocabulary from 32,000 to 55,296 tokens, improving compression efficiency for Chinese text and specialized domain terminology. The embedding and output layers are resized from $V \times H$ to $V&rsquo; \times H$ where $V = 32{,}000$ and $V&rsquo; = 55{,}296$.</p>
<p><strong>Two-Stage Continued Pretraining</strong>: The models consume 153 billion tokens in Stage 1 (primarily web, news, patents, and papers) and 43 billion tokens in Stage 2 (research reports, exams, books, chats, code, and supervised data). The data distribution shifts between stages to move from general domain knowledge toward specialized biopharmaceutical tasks.</p>
<p><strong>Weighted Instruction Fine-tuning</strong>: Inspired by OpenChat, the authors use a weighted autoregressive objective that zeros out loss on user instruction tokens. The loss function is:</p>
<p>$$\mathcal{L}_{SFT}(\Theta) = \mathbb{E}_{x \sim \mathcal{D}_{SFT}} \left[ -\alpha \sum_{i \in \text{output}} \log p(x_i \mid x_0, x_1, \dots, x_{i-1}; \Theta) \right]$$</p>
<p>where the weight $\alpha$ is set to 1 for expert-curated domain-specific instructions ($\mathcal{D}_{\exp}$) and 0.1 for generic instructions ($\mathcal{D}_{\text{gen}}$). This differential weighting ensures domain-relevant instructions receive higher priority during training.</p>
<p><strong>RLHF with PPO</strong>: A reward model is initialized from the pretrained PharmaGPT-70B and enhanced with two MLPs to output a scalar preference score. The reward model is trained with a binary ranking loss:</p>
<p>$$\mathcal{L}_{\text{ranking}} = -\log\left(\sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)\right)$$</p>
<p>where $r_\theta(x, y_c)$ is the score for the preferred response and $r_\theta(x, y_r)$ is the score for the rejected response. The RLHF dataset consists of 50,000 human preference expert-annotated instructions with responses from PharmaGPT variants and commercial LLMs (GPT-4, ChatGPT-3.5). <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">Proximal Policy Optimization (PPO)</a> is used for the RL training, selecting the highest-scoring response from four generated candidates at each step.</p>
<h2 id="evaluation-on-pharmacy-licensing-exams-translation-and-mmlu">Evaluation on Pharmacy Licensing Exams, Translation, and MMLU</h2>
<p>The evaluation covers four main benchmarks:</p>
<p><strong>NAPLEX (North American Pharmacist Licensure Examination)</strong>: PharmaGPT is tested across three NAPLEX sections. Results show consistent improvement across model iterations:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>NAPLEX I</th>
          <th>NAPLEX II</th>
          <th>NAPLEX III</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT 0.1</td>
          <td>5.0</td>
          <td>2.5</td>
          <td>3.5</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.3</td>
          <td>42.0</td>
          <td>48.0</td>
          <td>46.5</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.5</td>
          <td>57.0</td>
          <td>59.0</td>
          <td>58.0</td>
      </tr>
      <tr>
          <td>PharmaGPT 0.7</td>
          <td>66.0</td>
          <td>68.0</td>
          <td>76.0</td>
      </tr>
  </tbody>
</table>
<p>PharmaGPT 0.7 scores in the 66-76% range across all three NAPLEX sections, outperforming GPT-3.5-turbo by considerable margins.</p>
<p><strong>Chinese Pharmacist Examination</strong>: PharmaGPT achieves scores in the 70% range across all four exam categories, outperforming both GPT-3.5-turbo and GPT-4 in all categories. This result is notable given GPT-4&rsquo;s much larger scale.</p>
<p><strong>Biomedical Translation</strong>: PharmaGPT 0.7 outperforms GPT-3.5, Claude 3, and Google Translate on biomedical paper translation (English-Chinese), achieving <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a> scores of 30 (paragraph-level), 18 (sentence-level), and 10 (word-level).</p>
<p><strong>MMLU</strong>: On the general Multitask Multilingual Language Understanding benchmark, PharmaGPT achieves scores in the 80% range across most biomedical and life science tasks, surpassing GPT-3.5-turbo and performing comparably to GPT-4 in areas such as physiology, health sciences, and biology.</p>
<h2 id="strong-domain-performance-with-smaller-scale-but-limited-reproducibility">Strong Domain Performance with Smaller Scale, but Limited Reproducibility</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Domain-specific continued pretraining enables a 70B parameter model to match or exceed GPT-4 on pharmaceutical knowledge tasks, despite having a fraction of GPT-4&rsquo;s parameters</li>
<li>Iterative post-training (versions 0.1 through 0.7) shows consistent improvement, with the largest gains occurring between versions 0.3 and 0.5</li>
<li>The two-stage pretraining strategy, shifting from general domain data to more specialized exam and report data, appears effective for building domain expertise</li>
<li>Scaling laws hold within the PharmaGPT family: larger parameter counts consistently produce better performance on both NAPLEX and Chinese pharmaceutical exams</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>Potential biases in the training data</li>
<li>Model dependency on the quality and diversity of input prompts</li>
<li>Challenges in accurately assessing performance on highly specialized tasks without domain expert evaluation</li>
<li>Interpretability concerns for use in sensitive healthcare and pharmaceutical applications</li>
<li>The 3B model is trained from scratch while the 13B and 70B models use LLaMA as a backbone, making direct comparison across model sizes less straightforward</li>
</ul>
<p><strong>Missing details</strong>: The paper does not release model weights, training code, or the proprietary training dataset. No ablation studies isolate the contribution of each training stage (continued pretraining vs. instruction tuning vs. RLHF). The evaluation is limited to multiple-choice exams and translation, without testing on molecular property prediction, reaction prediction, or other computational chemistry tasks common in this domain.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining Stage 1</td>
          <td>Web, News, Patents, Papers</td>
          <td>153B tokens</td>
          <td>Proprietary corpus; not publicly available</td>
      </tr>
      <tr>
          <td>Pretraining Stage 2</td>
          <td>Research Reports, Exams, Books, Chats, Code</td>
          <td>43B tokens</td>
          <td>Proprietary corpus; not publicly available</td>
      </tr>
      <tr>
          <td>Instruction Tuning</td>
          <td>Manually labeled + synthesized data</td>
          <td>Several hundred thousand instructions</td>
          <td>Includes expert Q&amp;A, patent data, ShareGPT</td>
      </tr>
      <tr>
          <td>RLHF</td>
          <td>Human preference annotations</td>
          <td>50,000 annotated instructions</td>
          <td>Expert annotators ranked responses</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>NAPLEX, Chinese Pharmacist Exam, MMLU, MT</td>
          <td>Not specified</td>
          <td>Exam datasets sourced from public exams</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Base architecture</strong>: LLaMA (13B and 70B variants); 3B model trained from scratch</li>
<li><strong>Tokenizer</strong>: Extended BPE tokenizer (55,296 vocab size) merged with LLaMA2 tokenizer</li>
<li><strong>Training objective</strong>: Standard autoregressive LM (pretraining), weighted autoregressive with $\alpha \in {0.1, 1.0}$ (SFT), PPO (RLHF)</li>
<li><strong>Reward model</strong>: Initialized from PharmaGPT-70B with two additional MLPs</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Base</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT-3B</td>
          <td>3B</td>
          <td>Trained from scratch</td>
          <td>Not evaluated in main results</td>
      </tr>
      <tr>
          <td>PharmaGPT-13B</td>
          <td>13B</td>
          <td>LLaMA-13B</td>
          <td>Post-trained</td>
      </tr>
      <tr>
          <td>PharmaGPT-70B</td>
          <td>70B</td>
          <td>LLaMA-70B</td>
          <td>Primary model; versions 0.1-0.7 reported</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharmaGPT 0.7</th>
          <th>GPT-3.5</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NAPLEX I</td>
          <td>66%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>NAPLEX II</td>
          <td>68%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>NAPLEX III</td>
          <td>76%</td>
          <td>~50%</td>
          <td>Estimated from figures</td>
      </tr>
      <tr>
          <td>Chinese Pharmacist Exam</td>
          <td>~70% range</td>
          <td>Lower</td>
          <td>Outperforms GPT-4</td>
      </tr>
      <tr>
          <td>Biomedical Translation (paragraph BLEU)</td>
          <td>30</td>
          <td>27</td>
          <td>English-Chinese</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify the hardware used for training. Training hyperparameters for the 70B model include tensor parallelism (TP=8) and pipeline parallelism (PP=16) during pretraining, suggesting multi-node GPU training, likely on at least 128 GPUs.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PharmaGPT models</td>
          <td>Model</td>
          <td>Not released</td>
          <td>No public weights or API access</td>
      </tr>
      <tr>
          <td>Training data</td>
          <td>Dataset</td>
          <td>Proprietary</td>
          <td>PatSnap internal data</td>
      </tr>
      <tr>
          <td>Training code</td>
          <td>Code</td>
          <td>Not released</td>
          <td>No public repository</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: <strong>Closed</strong>. Neither the model weights, training data, nor training code are publicly available. The proprietary nature of both the data pipeline and the models makes independent reproduction infeasible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, L., Wang, W., Bai, Z., Xu, P., Fang, Y., Fang, J., &hellip; &amp; Tu, C. (2024). PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry. <em>arXiv preprint arXiv:2406.18045</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chen2024pharmagpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Linqing and Wang, Weilei and Bai, Zilong and Xu, Peng and Fang, Yan and Fang, Jie and Wu, Wentao and Zhou, Lizhi and Zhang, Ruiji and Xia, Yubin and Xu, Chaobo and Hu, Ran and Xu, Licong and Cai, Qijun and Hua, Haoran and Sun, Jing and Liu, Jin and Qiu, Tian and Liu, Haowen and Hu, Meng and Li, Xiuwen and Gao, Fei and Wang, Yufu and Tie, Lin and Wang, Chaochao and Lu, Jianping and Sun, Cheng and Wang, Yixin and Yang, Shengjie and Li, Yuancheng and Jin, Lu and Zhang, Lisha and Bian, Fu and Ye, Zhongkai and Pei, Lidong and Tu, Changyang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2406.18045}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2406.18045}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LlaSMol: Instruction-Tuned LLMs for Chemistry Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/llamsmol-instruction-tuning-chemistry/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/llamsmol-instruction-tuning-chemistry/</guid><description>LlaSMol fine-tunes open-source LLMs on SMolInstruct, a 3.3M-sample chemistry instruction dataset spanning 14 tasks, outperforming GPT-4 on all chemistry tasks.</description><content:encoded><![CDATA[<h2 id="a-resource-for-chemistry-instruction-tuning">A Resource for Chemistry Instruction Tuning</h2>
<p>This is a <strong>Resource</strong> paper that contributes both a large-scale instruction tuning dataset (SMolInstruct) and a family of fine-tuned LLMs (LlaSMol) for chemistry tasks. The primary contribution is SMolInstruct, a dataset of 3.3 million samples across 14 chemistry tasks, paired with systematic experiments showing that instruction-tuned open-source LLMs can substantially outperform GPT-4 and Claude 3 Opus on chemistry benchmarks. The dataset construction methodology, quality control pipeline, and careful data splitting are central to the paper&rsquo;s value.</p>
<h2 id="why-llms-struggle-with-chemistry-tasks">Why LLMs Struggle with Chemistry Tasks</h2>
<p>Prior work demonstrated that general-purpose LLMs perform poorly on chemistry tasks. Guo et al. (2023) found that GPT-4, while outperforming other LLMs, falls far short of task-specific deep learning models, particularly on tasks requiring precise understanding of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations. Fang et al. (2023) attempted instruction tuning with Mol-Instructions, but the resulting models still performed well below task-specific baselines.</p>
<p>These results raised a fundamental question: are LLMs inherently limited for chemistry, or is the problem simply insufficient training data? The authors argue it is the latter. Previous instruction tuning datasets suffered from limited scale (Mol-Instructions had 1.3M samples with fewer task types), lower quality (numerous low-quality molecular descriptions, mislabeled reactants/reagents in reaction data), and suboptimal design choices (using <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> instead of canonical SMILES, inconsistent data splitting that allowed leakage).</p>
<h2 id="smolinstruct-a-comprehensive-chemistry-instruction-dataset">SMolInstruct: A Comprehensive Chemistry Instruction Dataset</h2>
<p>The core innovation is the SMolInstruct dataset, which addresses the limitations of prior datasets through three design principles:</p>
<p><strong>Scale and comprehensiveness.</strong> SMolInstruct contains 3.3M samples across 14 tasks organized into four categories:</p>
<ul>
<li><strong>Name conversion</strong> (4 tasks): <a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_organic_chemistry">IUPAC</a>-to-formula, IUPAC-to-SMILES, SMILES-to-formula, SMILES-to-IUPAC, sourced from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></li>
<li><strong>Property prediction</strong> (6 tasks): ESOL, Lipo, BBBP, ClinTox, HIV, SIDER, sourced from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></li>
<li><strong>Molecule description</strong> (2 tasks): molecule captioning and molecule generation, sourced from <a href="https://en.wikipedia.org/wiki/ChEBI">ChEBI-20</a> and Mol-Instructions</li>
<li><strong>Chemical reactions</strong> (2 tasks): forward synthesis and retrosynthesis, sourced from USPTO-full</li>
</ul>
<p><strong>Quality control.</strong> The authors apply rigorous curation: invalid SMILES are filtered using RDKit, mislabeled reactants/reagents in USPTO-full are corrected by comparing atom mappings with products, low-quality molecular descriptions are removed using pattern-based rules, and duplicates are eliminated.</p>
<p><strong>Careful data splitting.</strong> To prevent data leakage across related tasks (e.g., forward synthesis and retrosynthesis share the same reactions), the authors ensure matched samples across reverse tasks are placed together in either training or evaluation sets. Samples with identical inputs but different outputs are also grouped together to prevent exaggerated performance estimates.</p>
<p>Additionally, all SMILES representations are canonicalized, and special tags (e.g., <code>&lt;SMILES&gt;...&lt;/SMILES&gt;</code>) encapsulate different information types within the instruction templates.</p>
<h2 id="experimental-setup-four-base-models-and-comprehensive-baselines">Experimental Setup: Four Base Models and Comprehensive Baselines</h2>
<p>The authors fine-tune four open-source LLMs using LoRA (applied to all attention and FFN linear layers, with rank and alpha both set to 16):</p>
<ul>
<li><strong><a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> 6.7B</strong>: pretrained on scientific text including chemistry data</li>
<li><strong>Llama 2 7B</strong>: general-purpose LLM</li>
<li><strong>Code Llama 7B</strong>: code-focused variant of Llama 2</li>
<li><strong>Mistral 7B</strong>: general-purpose LLM</li>
</ul>
<p>Training uses 8-bit AdamW with learning rate 1e-4, cosine scheduler, and 3 epochs. Only 0.58% of parameters are fine-tuned (approximately 41.9M parameters). Beam search is used at inference.</p>
<p><strong>Baselines</strong> include:</p>
<ul>
<li>General LLMs without fine-tuning: GPT-4, Claude 3 Opus, and the four base models</li>
<li>Chemistry-specific LLMs: Molinst (Llama 2 tuned on Mol-Instructions), <a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM</a></li>
<li>Task-specific non-LLM models: <a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">STOUT</a> for name conversion, Uni-Mol for property prediction, MolT5 for molecule description, RSMILES and <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> for reaction prediction</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Task Category</th>
          <th>Best LlaSMol</th>
          <th>GPT-4</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Name conversion (NC-I2F, EM%)</td>
          <td>87.9 (Mistral)</td>
          <td>8.7</td>
          <td>+79.2</td>
      </tr>
      <tr>
          <td>Name conversion (NC-I2S, EM%)</td>
          <td>70.1 (Mistral)</td>
          <td>3.3</td>
          <td>+66.8</td>
      </tr>
      <tr>
          <td>Property prediction (PP-ESOL, RMSE)</td>
          <td>1.150 (Mistral)</td>
          <td>2.570</td>
          <td>-1.42 (lower is better)</td>
      </tr>
      <tr>
          <td>Property prediction (PP-BBBP, Acc%)</td>
          <td>74.6 (Mistral)</td>
          <td>62.9</td>
          <td>+11.7</td>
      </tr>
      <tr>
          <td>Molecule captioning (<a href="https://en.wikipedia.org/wiki/METEOR">METEOR</a>)</td>
          <td>0.452 (Mistral)</td>
          <td>0.188</td>
          <td>+0.264</td>
      </tr>
      <tr>
          <td>Molecule generation (FTS%)</td>
          <td>61.7 (Mistral)</td>
          <td>42.6</td>
          <td>+19.1</td>
      </tr>
      <tr>
          <td>Forward synthesis (EM%)</td>
          <td>63.3 (Mistral)</td>
          <td>1.6</td>
          <td>+61.7</td>
      </tr>
      <tr>
          <td>Retrosynthesis (EM%)</td>
          <td>32.9 (Mistral)</td>
          <td>0.0</td>
          <td>+32.9</td>
      </tr>
  </tbody>
</table>
<p>LlaSMolMistral consistently outperforms all other LLMs and the other LlaSMol variants. It also surpasses task-specific SoTA models on PP-ClinTox (93.1 vs. 92.4) and PP-SIDER (70.7 vs. 70.0), though it has not yet matched SoTA on most other tasks.</p>
<h3 id="ablation-study">Ablation Study</h3>
<p>The ablation study examines three variants:</p>
<ol>
<li>
<p><strong>Without canonicalization</strong>: Performance drops on most tasks, with substantial decreases on forward synthesis (63.3 to 53.7 EM%) and retrosynthesis (32.9 to 23.8 EM%), confirming that canonicalized SMILES reduce learning difficulty.</p>
</li>
<li>
<p><strong>Using SELFIES instead of SMILES</strong>: While SELFIES achieves slightly higher validity (100% vs. 99.7% on some tasks), it results in worse performance overall. SELFIES strings are typically longer than SMILES, making them harder for models to process accurately. This finding contradicts claims from prior work (Fang et al., 2023) that SELFIES should be preferred.</p>
</li>
<li>
<p><strong>Training on Mol-Instructions instead of SMolInstruct</strong>: Using the same base model (Mistral) and identical training settings, the Mol-Instructions-trained model performs drastically worse, achieving near-zero accuracy on name conversion and property prediction tasks, and much lower performance on shared tasks (MC, MG, FS, RS).</p>
</li>
</ol>
<h3 id="additional-analysis">Additional Analysis</h3>
<p>Multi-task training generally outperforms single-task training, with particularly large improvements on PP-ESOL (RMSE 20.616 to 1.150) and molecule generation (FTS 33.1% to 61.7%). Increasing the number of trainable LoRA parameters from 6.8M (0.09%) to 173.0M (2.33%) leads to consistent performance improvements across most tasks, suggesting further gains are possible with more extensive fine-tuning.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>The paper establishes several findings:</p>
<ol>
<li>
<p><strong>LLMs can perform chemistry tasks effectively</strong> when provided with sufficient high-quality instruction tuning data. This refutes the notion that LLMs are fundamentally limited for chemistry.</p>
</li>
<li>
<p><strong>The choice of base model matters considerably.</strong> Mistral 7B outperforms Llama 2, Code Llama, and Galactica despite identical training, suggesting that general language understanding transfers well to chemistry.</p>
</li>
<li>
<p><strong>Canonical SMILES outperform both non-canonical SMILES and SELFIES</strong> for LLM-based chemistry, a practical recommendation for future work.</p>
</li>
<li>
<p><strong>Dataset quality is more important than model architecture.</strong> The same base model trained on SMolInstruct vastly outperforms the same model trained on Mol-Instructions.</p>
</li>
</ol>
<p>The authors acknowledge several limitations. The evaluation metrics for molecule captioning and generation (METEOR, FTS) measure text similarity rather than chemical correctness. The paper does not evaluate generalization to tasks beyond the 14 training tasks. LlaSMol models do not yet outperform task-specific SoTA models on most tasks, though the gap has narrowed substantially with only 0.58% of parameters fine-tuned.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>SMolInstruct</td>
          <td>3.29M samples</td>
          <td>14 tasks, canonical SMILES, publicly available on HuggingFace</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>SMolInstruct test split</td>
          <td>33,061 samples</td>
          <td>Careful splitting to prevent leakage across tasks</td>
      </tr>
      <tr>
          <td>NC tasks</td>
          <td>PubChem</td>
          <td>~300K molecules</td>
          <td>IUPAC names, SMILES, molecular formulas</td>
      </tr>
      <tr>
          <td>PP tasks</td>
          <td>MoleculeNet</td>
          <td>~78K samples</td>
          <td>6 datasets (ESOL, Lipo, BBBP, ClinTox, HIV, SIDER)</td>
      </tr>
      <tr>
          <td>MC/MG tasks</td>
          <td>ChEBI-20 + Mol-Instructions</td>
          <td>~60K samples</td>
          <td>Quality-filtered molecular descriptions</td>
      </tr>
      <tr>
          <td>FS/RS tasks</td>
          <td>USPTO-full</td>
          <td>~1.9M samples</td>
          <td>Cleaned, with corrected reactant/reagent labels</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Fine-tuning</strong>: LoRA with rank=16, alpha=16, applied to all attention and FFN linear layers</li>
<li><strong>Optimizer</strong>: 8-bit AdamW, learning rate 1e-4, cosine scheduler</li>
<li><strong>Training</strong>: 3 epochs, max input length 512 tokens</li>
<li><strong>Inference</strong>: Beam search with beam size = <code>num_return_sequences</code> + 3</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Base</th>
          <th>Parameters</th>
          <th>LoRA Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LlaSMolGalactica</td>
          <td>Galactica 6.7B</td>
          <td>6.7B</td>
          <td>41.9M (0.58%)</td>
      </tr>
      <tr>
          <td>LlaSMolLlama2</td>
          <td>Llama 2 7B</td>
          <td>7B</td>
          <td>41.9M (0.58%)</td>
      </tr>
      <tr>
          <td>LlaSMolCodeLlama</td>
          <td>Code Llama 7B</td>
          <td>7B</td>
          <td>41.9M (0.58%)</td>
      </tr>
      <tr>
          <td>LlaSMolMistral</td>
          <td>Mistral 7B</td>
          <td>7B</td>
          <td>41.9M (0.58%)</td>
      </tr>
  </tbody>
</table>
<p>All models and the dataset are publicly released on HuggingFace.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task(s)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Exact Match (EM)</td>
          <td>NC, MG, FS, RS</td>
          <td>Molecular identity comparison via RDKit</td>
      </tr>
      <tr>
          <td>Fingerprint <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto Similarity</a> (FTS)</td>
          <td>MG, FS, RS</td>
          <td>Morgan fingerprints</td>
      </tr>
      <tr>
          <td>METEOR</td>
          <td>MC</td>
          <td>Text similarity metric</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>PP-ESOL, PP-Lipo</td>
          <td>Regression tasks</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>PP-BBBP, PP-ClinTox, PP-HIV, PP-SIDER</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>NC-I2S, MG, FS, RS</td>
          <td>Ratio of valid SMILES outputs</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify exact GPU hardware or training times. Training uses the HuggingFace Transformers library with LoRA, and inference is conducted on the Ohio Supercomputer Center.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OSU-NLP-Group/LlaSMol">LlaSMol Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training, evaluation, and inference scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/osunlp/SMolInstruct">SMolInstruct</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>3.3M samples across 14 chemistry tasks</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/osunlp/LlaSMol-Mistral-7B">LlaSMol-Mistral-7B</a></td>
          <td>Model</td>
          <td>CC-BY-4.0</td>
          <td>Best-performing model (LoRA adapters)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/osunlp/LlaSMol-Galactica-6.7B">LlaSMol-Galactica-6.7B</a></td>
          <td>Model</td>
          <td>CC-BY-4.0</td>
          <td>LoRA adapters for Galactica</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/osunlp/LlaSMol-Llama2-7B">LlaSMol-Llama2-7B</a></td>
          <td>Model</td>
          <td>CC-BY-4.0</td>
          <td>LoRA adapters for Llama 2</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/osunlp/LlaSMol-CodeLlama-7B">LlaSMol-CodeLlama-7B</a></td>
          <td>Model</td>
          <td>CC-BY-4.0</td>
          <td>LoRA adapters for Code Llama</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yu, B., Baker, F. N., Chen, Z., Ning, X., &amp; Sun, H. (2024). LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. <em>arXiv preprint arXiv:2402.09391</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yu2024llamsmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yu, Botao and Baker, Frazier N. and Chen, Ziqi and Ning, Xia and Sun, Huan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2402.09391}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Galactica: A Curated Scientific LLM from Meta AI</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/galactica-large-language-model-for-science/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/galactica-large-language-model-for-science/</guid><description>Galactica is a 120B parameter LLM trained on 106B tokens of curated scientific text, outperforming GPT-3 on scientific knowledge tasks.</description><content:encoded><![CDATA[<h2 id="a-scientific-language-model-trained-on-curated-knowledge">A Scientific Language Model Trained on Curated Knowledge</h2>
<p>Galactica is a <strong>Resource</strong> contribution: a family of decoder-only Transformer language models (125M to 120B parameters) trained on a curated corpus of 106 billion tokens from scientific papers, reference material, knowledge bases, and other sources. The paper also introduces several specialized tokenization schemes for scientific modalities (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, amino acid sequences, DNA sequences, LaTeX, citations) and a working memory token (<code>&lt;work&gt;</code>) for step-by-step reasoning. All model weights are open-sourced under the Apache 2.0 license.</p>
<h2 id="information-overload-as-the-motivating-problem">Information Overload as the Motivating Problem</h2>
<p>The volume of scientific literature has grown beyond any individual&rsquo;s capacity to process. An average of 516 papers per day were submitted to arXiv as of May 2022, and databases like <a href="https://en.wikipedia.org/wiki/GenBank">NCBI GenBank</a> contained $1.49 \times 10^{12}$ nucleotide bases as of August 2022. Current search engines point to secondary knowledge layers (Wikipedia, UniProt, PubChem) that require costly human curation, creating a throughput bottleneck.</p>
<p>The authors argue that large language models can serve as a new interface for science by storing, combining, and reasoning about scientific knowledge in weight memory, rather than relying on the traditional store-and-retrieve paradigm. Prior scientific language models (SciBERT, BioLM) were small in scale, and general LLMs (GPT-3, PaLM) trained on uncurated web data that is inefficient for scientific tasks.</p>
<h2 id="curated-corpus-and-specialized-tokenization">Curated Corpus and Specialized Tokenization</h2>
<p>The core innovation has two components: a normative approach to dataset curation and a set of specialized tokens for different scientific modalities.</p>
<h3 id="the-galactica-corpus">The Galactica Corpus</h3>
<p>The training corpus consists of 106 billion tokens with a deliberate focus on quality over quantity:</p>
<table>
  <thead>
      <tr>
          <th>Data Source</th>
          <th>Documents</th>
          <th>Tokens</th>
          <th>Token %</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Papers</td>
          <td>48 million</td>
          <td>88 billion</td>
          <td>83.0%</td>
      </tr>
      <tr>
          <td>Code</td>
          <td>2 million</td>
          <td>7 billion</td>
          <td>6.9%</td>
      </tr>
      <tr>
          <td>Reference Material</td>
          <td>8 million</td>
          <td>7 billion</td>
          <td>6.5%</td>
      </tr>
      <tr>
          <td>Knowledge Bases</td>
          <td>2 million</td>
          <td>2 billion</td>
          <td>2.0%</td>
      </tr>
      <tr>
          <td>Filtered CommonCrawl</td>
          <td>0.9 million</td>
          <td>1 billion</td>
          <td>1.0%</td>
      </tr>
      <tr>
          <td>Prompts</td>
          <td>1.3 million</td>
          <td>0.4 billion</td>
          <td>0.3%</td>
      </tr>
      <tr>
          <td>Other</td>
          <td>0.02 million</td>
          <td>0.2 billion</td>
          <td>0.2%</td>
      </tr>
  </tbody>
</table>
<p>Papers come from arXiv (35B tokens), PMC (23B), <a href="https://en.wikipedia.org/wiki/Semantic_Scholar">Semantic Scholar</a> (18B), and PubMed abstracts (5B), among others. Reference material includes Wikipedia (5B tokens), StackExchange (1B), textbooks, and lecture notes. Knowledge bases include <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> Compound (2M compounds, 1B tokens), <a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a> (552K reviewed Swiss-Prot proteins, 0.6B tokens), and the <a href="https://en.wikipedia.org/wiki/RefSeq">RefSeq</a> Genome.</p>
<p>All data is processed into a common markdown format. Mathematical LaTeX is preserved where available, and papers are citation-processed with title-based identifiers.</p>
<h3 id="specialized-tokenization">Specialized Tokenization</h3>
<p>Galactica introduces several modality-specific tokenization strategies:</p>
<ol>
<li>
<p><strong>Citations</strong>: Wrapped with <code>[START_REF]</code> and <code>[END_REF]</code> tokens using paper titles as identifiers, enabling the model to predict citations in context.</p>
</li>
<li>
<p><strong>Working Memory (<code>&lt;work&gt;</code>)</strong>: Step-by-step reasoning is wrapped in <code>&lt;work&gt;</code> and <code>&lt;/work&gt;</code> tokens that mimic an internal working memory, allowing the model to perform multi-step computation. This differs from chain-of-thought prompting in that it is learned during pre-training rather than elicited through prompt engineering.</p>
</li>
<li>
<p><strong>SMILES</strong>: Wrapped with <code>[START_SMILES]</code>/<code>[END_SMILES]</code> tokens and character-level tokenization.</p>
</li>
<li>
<p><strong>Amino Acid Sequences</strong>: Wrapped with <code>[START_AMINO]</code>/<code>[END_AMINO]</code> tokens with character-level tokenization (one token per residue).</p>
</li>
<li>
<p><strong>DNA Sequences</strong>: Wrapped with <code>[START_DNA]</code>/<code>[END_DNA]</code> tokens with character-level tokenization (one token per nucleotide base).</p>
</li>
<li>
<p><strong>Mathematics</strong>: ASCII operations split into individual characters; digits split into individual tokens.</p>
</li>
</ol>
<h3 id="prompt-pre-training">Prompt Pre-Training</h3>
<p>Rather than using instruction tuning as a separate fine-tuning stage, Galactica includes task-specific prompts (358 million tokens total) directly in pre-training alongside the general corpus. This includes question answering, entity extraction, summarization, dialog, and chemical property prediction prompts. The authors frame this as occupying a middle ground between pure self-supervised pre-training and instruction tuning, providing task signal without degrading general capability.</p>
<h2 id="architecture-training-and-evaluation-setup">Architecture, Training, and Evaluation Setup</h2>
<h3 id="architecture">Architecture</h3>
<p>Galactica uses a standard decoder-only Transformer with several modifications:</p>
<ul>
<li>GeLU activations</li>
<li>2048-token context window</li>
<li>No biases in dense kernels or layer norms</li>
<li>Learned positional embeddings</li>
<li>50K BPE vocabulary</li>
</ul>
<p>Five model sizes were trained:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Layers</th>
          <th>$d_{\text{model}}$</th>
          <th>Heads</th>
          <th>Batch Size</th>
          <th>Max LR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GAL 125M</td>
          <td>125M</td>
          <td>12</td>
          <td>768</td>
          <td>12</td>
          <td>0.5M</td>
          <td>$6 \times 10^{-4}$</td>
      </tr>
      <tr>
          <td>GAL 1.3B</td>
          <td>1.3B</td>
          <td>24</td>
          <td>2,048</td>
          <td>32</td>
          <td>1.0M</td>
          <td>$2 \times 10^{-4}$</td>
      </tr>
      <tr>
          <td>GAL 6.7B</td>
          <td>6.7B</td>
          <td>32</td>
          <td>4,096</td>
          <td>32</td>
          <td>2.0M</td>
          <td>$1.2 \times 10^{-4}$</td>
      </tr>
      <tr>
          <td>GAL 30B</td>
          <td>30.0B</td>
          <td>48</td>
          <td>7,168</td>
          <td>56</td>
          <td>2.0M</td>
          <td>$1 \times 10^{-4}$</td>
      </tr>
      <tr>
          <td>GAL 120B</td>
          <td>120.0B</td>
          <td>96</td>
          <td>10,240</td>
          <td>80</td>
          <td>2.0M</td>
          <td>$0.7 \times 10^{-5}$</td>
      </tr>
  </tbody>
</table>
<p>Training used AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay of 0.1, gradient clipping at 1.0, and linear learning rate decay to 10% of peak value. Dropout and attention dropout were set to $p = 0.1$.</p>
<h3 id="training-on-repeated-tokens">Training on Repeated Tokens</h3>
<p>Models were trained for 450 billion tokens, approximately 4.25 epochs of the corpus. Validation loss continued to fall through four epochs for all model sizes, with the 120B model only beginning to overfit at the start of the fifth epoch. This is notable because it challenges the prevailing view that repeated tokens are harmful for LLM training. Performance on out-of-domain BIG-bench tasks also continued to improve through training, suggesting no overfitting on downstream generalization.</p>
<h3 id="key-evaluation-results">Key Evaluation Results</h3>
<p><strong>Knowledge Probes</strong>: On LaTeX equation prediction across 434 equations from chemistry, physics, mathematics, statistics, and economics, GAL 120B achieved 68.2% accuracy versus GPT-3&rsquo;s 49.0% (zero-shot). On chemical reactions, GAL 120B scored 43.1% versus GPT-3&rsquo;s 35.1%.</p>
<p><strong>Mathematical Reasoning</strong>: With the <code>&lt;work&gt;</code> token, GAL 120B achieved 41.3% on mathematical MMLU (average across abstract algebra, elementary, high school, college math, and formal logic), compared to Chinchilla&rsquo;s 35.7% (5-shot). On the MATH benchmark, GAL 120B scored 20.4% (5-shot chain-of-thought) versus PaLM 540B&rsquo;s 8.8%.</p>
<p><strong>Scientific QA</strong>: Galactica set state-of-the-art results on PubMedQA (77.6%) and MedMCQA dev (52.9%), outperforming prior fine-tuned models (72.2% and 41.0% respectively).</p>
<p><strong>Citation Prediction</strong>: GAL 120B achieved 51.9% accuracy on PWC Citations and 69.1% on Extended Citations, outperforming both sparse (ElasticSearch) and dense (Contriever) retrieval baselines.</p>
<p><strong>BIG-bench (57 tasks)</strong>: Despite training only on scientific data, GAL 120B (48.7% weighted accuracy) outperformed OPT 175B (43.4%) and BLOOM 176B (42.6%) on primarily non-scientific tasks.</p>
<p><strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> Classification</strong>: Using SMILES in natural language prompts with weak supervision, GAL 120B achieved an average ROC-AUC of 0.690 across six MoleculeNet classification benchmarks (BACE, BBBP, ClinTox, HIV, SIDER, Tox21). This lagged the specialist Uni-Mol model (0.770), which uses 3D molecular information and 10x more molecules.</p>
<p><strong><a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_organic_chemistry">IUPAC</a> Name Prediction</strong>: GAL 120B achieved 39.2% accuracy on predicting IUPAC names from SMILES in a self-supervised setting, with attention visualization showing the model attends to chemically relevant functional groups (e.g., attending to the $\text{-NH}_2$ group when predicting &ldquo;amino&rdquo;).</p>
<p><strong>Protein Function Prediction</strong>: GAL 120B achieved a ROUGE-L of 0.252 on generating free-form protein function descriptions from amino acid sequences, and an $F_1$ of 48.7% on protein keyword prediction from the UniProt general validation set.</p>
<p><strong>Bias and Toxicity</strong>: On CrowS-Pairs, GAL 120B scored 60.5% (closer to ideal 50%) versus OPT 175B&rsquo;s 69.5%. On StereoSet, GAL 120B achieved an ICAT score of 65.6 versus OPT&rsquo;s 60.0 and GPT-3&rsquo;s 60.8. Toxicity rates on RealToxicityPrompts were substantially lower than comparison models.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>
<p><strong>Curated data enables repeated training</strong>: The curated scientific corpus allows training for multiple epochs without overfitting, contrary to prevailing assumptions about repeated token degradation.</p>
</li>
<li>
<p><strong>Scientific LLMs generalize beyond science</strong>: Despite training only on scientific text, Galactica outperforms general LLMs on non-scientific BIG-bench tasks, suggesting data quality matters more than data breadth.</p>
</li>
<li>
<p><strong>Weight memory can outperform retrieval</strong>: For citation prediction, Galactica&rsquo;s weight memory outperforms traditional sparse and dense retrieval methods, demonstrating the context-associative power of language models.</p>
</li>
<li>
<p><strong>Multi-modal learning via text</strong>: SMILES and protein sequences can be learned alongside natural language in a single model, and the model attends to chemically interpretable features.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Corpus constraints</strong>: Restricted to open-access papers; much scientific knowledge in closed-access papers and textbooks is excluded. Only 2M of 110M PubChem compounds and 0.5M of 227M UniProt sequences were included.</li>
<li><strong>Corpus vs. prompt effects</strong>: The paper does not disentangle whether performance gains come from the scientific corpus or from the prompt pre-training strategy.</li>
<li><strong>Citation bias</strong>: The model still shows bias toward predicting more popular papers, though this decreases with scale.</li>
<li><strong>No geometry</strong>: SMILES-based representations lack 3D geometric information, limiting chemical understanding.</li>
<li><strong>Hallucination</strong>: Title-based citation identifiers are more prone to hallucination at smaller scales, though accuracy improves with scale.</li>
<li><strong>No instruction tuning comparison</strong>: The paper does not compare prompt pre-training against instruction tuning as a follow-up step.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The paper identifies retrieval augmentation, extending to images, larger context windows, mixture-of-denoising training objectives, and more diverse <code>&lt;work&gt;</code> reasoning examples as promising directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Galactica Corpus</td>
          <td>106B tokens</td>
          <td>Papers (83%), code (6.9%), reference material (6.5%), knowledge bases (2%), CommonCrawl (1%), prompts (0.3%)</td>
      </tr>
      <tr>
          <td>Training (Molecules)</td>
          <td>PubChem Compound subset</td>
          <td>2M compounds (of 110M available)</td>
          <td>Character-level SMILES tokenization</td>
      </tr>
      <tr>
          <td>Training (Proteins)</td>
          <td>Swiss-Prot (UniProt)</td>
          <td>552K reviewed sequences (of 227M available)</td>
          <td>Character-level amino acid tokenization</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>LaTeX Equations</td>
          <td>434 equations</td>
          <td>Chemistry, physics, math, stats, economics</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MMLU, MATH</td>
          <td>Standard benchmarks</td>
          <td>Out-of-domain evaluation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>PubMedQA, MedMCQA, BioASQ</td>
          <td>Standard biomedical QA</td>
          <td>In-domain (training prompts included)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MoleculeNet (6 tasks)</td>
          <td>Standard molecular benchmarks</td>
          <td>BACE, BBBP, ClinTox, HIV, SIDER, Tox21</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BIG-bench (57 tasks)</td>
          <td>Standard NLP benchmark</td>
          <td>Out-of-domain, non-scientific</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Decoder-only Transformer with GeLU activations, no biases</li>
<li>AdamW optimizer: $\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1</li>
<li>Gradient clipping at global norm 1.0</li>
<li>Linear LR decay to 10% of peak</li>
<li>Dropout: $p = 0.1$ (attention and residual)</li>
<li><a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">BPE</a> vocabulary: 50K tokens from 2% corpus sample</li>
<li>Training: 450B tokens (~4.25 epochs)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/paperswithcode/galai">Galactica models (galai)</a></td>
          <td>Code + Model</td>
          <td>Apache-2.0</td>
          <td>Official implementation with 125M, 1.3B, 6.7B, 30B, 120B checkpoints</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GAL 120B</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LaTeX Equations (zero-shot)</td>
          <td>68.2%</td>
          <td>GPT-3: 49.0%</td>
          <td>434 equations across 5 domains</td>
      </tr>
      <tr>
          <td>Math MMLU (<code>&lt;work&gt;</code>)</td>
          <td>41.3%</td>
          <td>Chinchilla (5-shot): 35.7%</td>
          <td>Average over 5 math subjects</td>
      </tr>
      <tr>
          <td>MATH (5-shot CoT)</td>
          <td>20.4%</td>
          <td>PaLM 540B: 8.8%</td>
          <td>Minerva 540B (fine-tuned): 33.6%</td>
      </tr>
      <tr>
          <td>PubMedQA</td>
          <td>77.6%</td>
          <td>Prior SOTA: 72.2%</td>
          <td>In-domain</td>
      </tr>
      <tr>
          <td>MedMCQA dev</td>
          <td>52.9%</td>
          <td>Prior SOTA: 41.0%</td>
          <td>In-domain</td>
      </tr>
      <tr>
          <td>BIG-bench (weighted)</td>
          <td>48.7%</td>
          <td>OPT 175B: 43.4%</td>
          <td>57 non-scientific tasks</td>
      </tr>
      <tr>
          <td>MoleculeNet ROC-AUC (avg)</td>
          <td>0.690</td>
          <td>Uni-Mol (3D): 0.770</td>
          <td>Weak supervision vs. direct fine-tuning</td>
      </tr>
      <tr>
          <td>CrowS-Pairs (lower = less biased)</td>
          <td>60.5%</td>
          <td>OPT 175B: 69.5%</td>
          <td>Ideal: 50%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>120B model training: 128 NVIDIA A100 80GB nodes</li>
<li>120B model inference: single NVIDIA A100 node</li>
<li>Training library: metaseq (Meta AI)</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., &amp; Stojnic, R. (2022). Galactica: A Large Language Model for Science. <em>arXiv preprint arXiv:2211.09085</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{taylor2022galactica,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Galactica: A Large Language Model for Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Taylor, Ross and Kardas, Marcin and Cucurull, Guillem and Scialom, Thomas and Hartshorn, Anthony and Saravia, Elvis and Poulton, Andrew and Kerkez, Viktor and Stojnic, Robert}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2211.09085}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2211.09085}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Fine-Tuning GPT-3 for Predictive Chemistry Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/</guid><description>Fine-tuned GPT-3 matches or outperforms specialized ML models on molecular, materials, and reaction property prediction, especially in low-data regimes.</description><content:encoded><![CDATA[<h2 id="gpt-3-as-a-general-purpose-chemistry-predictor">GPT-3 as a General-Purpose Chemistry Predictor</h2>
<p>This is an <strong>Empirical</strong> paper that systematically benchmarks fine-tuned GPT-3 against dedicated machine learning models across 15 chemistry and materials science prediction tasks. The primary contribution is demonstrating that a general-purpose large language model, with no chemistry-specific architecture or featurization, can match or outperform specialized ML approaches, particularly when training data is limited. The paper also demonstrates inverse molecular design through simple prompt inversion.</p>
<h2 id="why-general-purpose-llms-for-chemistry">Why General-Purpose LLMs for Chemistry</h2>
<p>Machine learning in chemistry typically requires domain-specific feature engineering: molecular fingerprints, graph neural network architectures, or hand-crafted descriptors tailored to each application. Developing these approaches demands specialized expertise and significant effort for each new problem. The small datasets common in experimental chemistry further complicate matters, as many sophisticated ML approaches require large training sets to learn meaningful representations.</p>
<p>Large language models like GPT-3, trained on vast internet text corpora, had shown surprising capability at tasks they were not explicitly trained for. The key question motivating this work was whether these general-purpose models could also answer scientific questions for which we lack answers, given that most chemistry problems can be represented in text form. For example: &ldquo;If I change the metal in my <a href="https://en.wikipedia.org/wiki/Metal%E2%80%93organic_framework">metal-organic framework</a>, will it be stable in water?&rdquo;</p>
<p>Prior chemical language models (e.g., <a href="/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/">Transformer-CNN</a>, <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a>, <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) were pre-trained on chemistry-specific corpora. In contrast, this work investigates models trained primarily on general internet text, examining whether the implicit chemical knowledge encoded during pre-training, combined with task-specific fine-tuning, can substitute for explicit chemical featurization.</p>
<h2 id="language-interfaced-fine-tuning-for-chemistry">Language-Interfaced Fine-Tuning for Chemistry</h2>
<p>The core innovation is &ldquo;language-interfaced fine-tuning&rdquo; (LIFT): reformulating chemistry prediction tasks as natural language question-answering. Training examples take the form of question-completion pairs, where questions describe the chemical system in text and completions provide the target property. For example:</p>
<ul>
<li><strong>Classification</strong>: &ldquo;What is the phase of Co1Cu1Fe1Ni1V1?&rdquo; with completion &ldquo;0&rdquo; (multi-phase)</li>
<li><strong>Regression</strong>: Property values are rounded to a fixed precision, converting continuous prediction into a text generation problem</li>
<li><strong>Inverse design</strong>: Questions and completions are simply swapped, asking &ldquo;What is a molecule with property X?&rdquo; and expecting a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string as completion</li>
</ul>
<p>The fine-tuning uses OpenAI&rsquo;s API with the smallest <code>ada</code> variant of GPT-3, with uniform hyperparameters across all tasks (8 epochs, learning rate multiplier of 0.02). No optimization of prompt structure, tokenization, or training schedule was performed, making the approach deliberately simple.</p>
<p>For regression, since language models generate discrete tokens rather than continuous values, the authors round target values to a fixed precision (e.g., 1% for Henry coefficients). This converts regression into a form of classification over numeric strings, with the assumption that GPT-3 can interpolate between these discretized values.</p>
<p>The approach also extends to open-source models. The authors demonstrate that GPT-J-6B can be fine-tuned using parameter-efficient techniques (LoRA, 8-bit quantization) on consumer hardware, and provide the <code>chemlift</code> Python package for this purpose.</p>
<h2 id="benchmarks-across-molecules-materials-and-reactions">Benchmarks Across Molecules, Materials, and Reactions</h2>
<h3 id="datasets-and-tasks">Datasets and Tasks</h3>
<p>The evaluation spans three chemical domains with 15 total benchmarks:</p>
<p><strong>Molecules:</strong></p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Photoswitch">Photoswitch</a> transition wavelength prediction (2022)</li>
<li>Free energy of solvation (FreeSolv, 2014)</li>
<li>Aqueous solubility (ESOL, 2004)</li>
<li>Lipophilicity (ChEMBL, 2012)</li>
<li><a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO-LUMO gap</a> (QMugs, 2022)</li>
<li><a href="https://en.wikipedia.org/wiki/Organic_solar_cell">Organic photovoltaic</a> power conversion efficiency (2018)</li>
</ul>
<p><strong>Materials:</strong></p>
<ul>
<li>Coarse-grained surfactant adsorption free energy (2021)</li>
<li>CO2 and CH4 <a href="https://en.wikipedia.org/wiki/Henry%27s_law">Henry coefficients</a> in MOFs (2020)</li>
<li>MOF heat capacity (2022)</li>
<li><a href="https://en.wikipedia.org/wiki/High-entropy_alloy">High-entropy alloy</a> phase prediction (2020)</li>
<li><a href="https://en.wikipedia.org/wiki/Amorphous_metal">Bulk metallic glass</a> formation ability (2006)</li>
<li>Metallic behavior prediction (2018)</li>
</ul>
<p><strong>Reactions:</strong></p>
<ul>
<li>C-N cross-coupling yield (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a>, 2018)</li>
<li>C-C cross-coupling yield (<a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki</a>, 2022)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>The baselines include both traditional ML and deep learning approaches:</p>
<ul>
<li><strong>Non-DL</strong>: XGBoost with molecular descriptors/fragprints, Gaussian Process Regression (GPR), random forests, n-Gram models, Automatminer, differential reaction fingerprints (DRFP)</li>
<li><strong>Deep learning</strong>: MolCLR, ModNet, CrabNet, TabPFN</li>
</ul>
<h3 id="data-efficiency-analysis">Data Efficiency Analysis</h3>
<p>To compare data efficiency, the authors fit power law curves to learning curves for all models and measure the &ldquo;data efficiency factor&rdquo;: how much more (or fewer) data the best baseline needs to match GPT-3&rsquo;s performance in the low-data regime.</p>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>Benchmark</th>
          <th>Data Efficiency vs. Non-DL</th>
          <th>vs. DL Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecules</td>
          <td>Photoswitch wavelength</td>
          <td>1.1x (n-Gram)</td>
          <td>1.2x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Solvation free energy</td>
          <td>3.1x (GPR)</td>
          <td>1.3x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Solubility</td>
          <td>1.0x (XGBoost)</td>
          <td>0.002x (MolCLR)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Lipophilicity</td>
          <td>3.43x (GPR)</td>
          <td>0.97x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>HOMO-LUMO gap</td>
          <td>4.3x (XGBoost)</td>
          <td>0.62x (TabPFN)</td>
      </tr>
      <tr>
          <td>Materials</td>
          <td>HEA phase</td>
          <td>24x (RF)</td>
          <td>9.0x (CrabNet)</td>
      </tr>
      <tr>
          <td>Materials</td>
          <td>CO2 Henry coeff.</td>
          <td>0.40x (XGBoost)</td>
          <td>12x (TabPFN)</td>
      </tr>
      <tr>
          <td>Reactions</td>
          <td>C-N cross-coupling</td>
          <td>2.9x (DRFP)</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Values &gt;1 indicate GPT-3 is more data-efficient. For the HEA phase prediction task, GPT-3 achieved comparable accuracy to a random forest model trained on 1,126 data points using only about 50 training examples.</p>
<h3 id="representation-sensitivity">Representation Sensitivity</h3>
<p>An important finding is that GPT-3 performs well regardless of molecular representation format. The authors tested IUPAC names, SMILES, and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, finding good results across all representations. IUPAC names often produced the best performance, which is notable because it makes the approach accessible to non-specialists who can simply use chemical names rather than learning specialized encodings.</p>
<h3 id="inverse-design">Inverse Design</h3>
<p>For inverse design, the authors fine-tuned GPT-3 with reversed question-completion pairs. On photoswitches:</p>
<ul>
<li>Generated molecules include both training set members and novel structures (some not in PubChem)</li>
<li>Transition wavelengths matched target values within about 10% mean absolute percentage error (validated using the GPR model from Griffiths et al.)</li>
<li>A temperature parameter controls the diversity-validity tradeoff: low temperatures produce training set copies, high temperatures produce diverse but potentially invalid structures</li>
<li>Across all temperatures, generated molecules showed low synthetic accessibility (SA) scores, suggesting synthesizability</li>
</ul>
<p>The authors also demonstrated iterative inverse design for HOMO-LUMO gap optimization: starting from QMugs data, they iteratively fine-tuned GPT-3 to generate molecules with progressively larger bandgaps (&gt;5 eV), successfully shifting the distribution over four generations. This worked even when extrapolating beyond the training distribution (e.g., training only on molecules with gaps &lt;3.5 eV, then generating molecules with gaps &gt;4.0 eV).</p>
<h3 id="coarse-grained-polymer-design">Coarse-Grained Polymer Design</h3>
<p>A striking test involved coarse-grained dispersant polymers with four monomer types and chain lengths of 16-48 units. GPT-3 had no prior knowledge of these abstract representations, yet it outperformed dedicated models for adsorption free energy prediction and successfully performed inverse design, generating monomer sequences with a mean percentage error of about 22% for the desired property.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>
<p><strong>Low-data advantage</strong>: Fine-tuned GPT-3 consistently shows the largest advantages over conventional ML in low-data regimes (tens to hundreds of data points), which is precisely where experimental chemistry datasets typically fall.</p>
</li>
<li>
<p><strong>Representation agnostic</strong>: The model works with IUPAC names, SMILES, SELFIES, and even invented abstract representations, removing the need for chemistry-specific tokenization.</p>
</li>
<li>
<p><strong>No feature engineering</strong>: The approach requires no domain-specific descriptors, fingerprints, or architectural modifications, making it accessible to researchers without ML expertise.</p>
</li>
<li>
<p><strong>Bidirectional design</strong>: Inverse design is achieved by simply reversing the question format, with no architectural changes or separate generative model needed.</p>
</li>
<li>
<p><strong>Extrapolation capability</strong>: The model can generate molecules with properties outside the training distribution, as demonstrated by the HOMO-LUMO gap extrapolation experiments.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>In the <strong>high-data regime</strong>, conventional ML models with chemistry-specific features often catch up to or surpass GPT-3, as the inductive biases encoded in GPT-3 become less necessary with sufficient data.</li>
<li><strong>Regression</strong> is inherently limited by the discretization of continuous values into tokens. This requires more data than classification and introduces quantization error.</li>
<li>The approach relies on the <strong>OpenAI API</strong>, introducing cost and reproducibility concerns (model versions may change). The authors partially address this by providing open-source alternatives via <code>chemlift</code>.</li>
<li>The authors acknowledge that <strong>identified correlations may not represent causal relationships</strong>. GPT-3 finding predictive patterns does not guarantee that the patterns are chemically meaningful.</li>
<li>No optimization of prompts, tokenization, or hyperparameters was performed, suggesting room for improvement but also making it difficult to assess the ceiling of this approach.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All datasets are publicly available and were obtained from published benchmarks.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification</td>
          <td>HEA phase (Pei et al.)</td>
          <td>1,252 alloys</td>
          <td>Single-phase vs. multi-phase</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>643 molecules</td>
          <td>Hydration free energies</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128 molecules</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QMugs</td>
          <td>665,000 molecules</td>
          <td>HOMO-LUMO gaps via GFN2-xTB</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Lipophilicity (ChEMBL)</td>
          <td>Varies</td>
          <td>LogP classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>OPV PCE</td>
          <td>Varies</td>
          <td>Organic photovoltaic efficiency</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MOF Henry coefficients</td>
          <td>Varies</td>
          <td>CO2/CH4 adsorption</td>
      </tr>
      <tr>
          <td>Inverse design</td>
          <td>Photoswitches (Griffiths et al.)</td>
          <td>392 molecules</td>
          <td>Transition wavelengths</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fine-tuning via OpenAI API: 8 epochs, learning rate multiplier 0.02</li>
<li>GPT-3 <code>ada</code> variant (smallest model) used for all main results</li>
<li>In-context learning also tested with larger GPT-3 models and GPT-4</li>
<li>Open-source alternative: GPT-J-6B with LoRA + 8-bit quantization</li>
<li>Learning curves fit to power laws $-a \exp(-bx + c)$ for data efficiency comparison</li>
<li>Validity checked using RDKit via <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>&rsquo;s <code>is\_valid</code> method</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-3 ada (OpenAI API, proprietary)</li>
<li>GPT-J-6B (open-source, fine-tunable on consumer hardware)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>HEA phase</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>$F_1$ macro</td>
          <td>All classification tasks</td>
          <td>Class-balanced</td>
      </tr>
      <tr>
          <td>Cohen&rsquo;s $\kappa$</td>
          <td>Classification</td>
          <td>Used for learning curve thresholds</td>
      </tr>
      <tr>
          <td>MAE / MAPE</td>
          <td>Regression, inverse design</td>
          <td>Property prediction accuracy</td>
      </tr>
      <tr>
          <td>Validity rate</td>
          <td>Inverse design</td>
          <td>Fraction of parseable SMILES</td>
      </tr>
      <tr>
          <td>Frechet ChemNet distance</td>
          <td>Inverse design</td>
          <td>Distribution similarity</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Inverse design</td>
          <td>Synthetic accessibility</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Fine-tuning via OpenAI API (cloud compute, not user-specified)</li>
<li>Open-source experiments: consumer GPU hardware with 8-bit quantization</li>
<li>Quantum chemistry validation: GFN2-xTB for HOMO-LUMO calculations</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/kjappelbaum/gptchem">gptchem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>All experiments with OpenAI API</td>
      </tr>
      <tr>
          <td><a href="https://github.com/lamalab-org/chemlift">chemlift</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source LLM fine-tuning support</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.7806672">Zenodo (gptchem)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10233422">Zenodo (chemlift)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A., &amp; Smit, B. (2024). Leveraging large language models for predictive chemistry. <em>Nature Machine Intelligence</em>, 6(2), 161-169. <a href="https://doi.org/10.1038/s42256-023-00788-1">https://doi.org/10.1038/s42256-023-00788-1</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jablonka2024leveraging,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Leveraging large language models for predictive chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jablonka, Kevin Maik and Schwaller, Philippe and Ortega-Guerrero, Andres and Smit, Berend}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{161--169}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00788-1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Transfer Approaches for Seq-to-Seq Retrosynthesis</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</guid><description>Systematic comparison of joint training, self-training, and pre-training plus fine-tuning for Transformer-based retrosynthesis on USPTO-50K.</description><content:encoded><![CDATA[<h2 id="systematic-study-of-data-transfer-for-retrosynthesis">Systematic Study of Data Transfer for Retrosynthesis</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares three standard data transfer methods (joint training, self-training, and pre-training plus fine-tuning) applied to a Transformer-based sequence-to-sequence model for single-step retrosynthesis. The primary contribution is demonstrating that pre-training on a large augmented dataset (USPTO-Full, 877K reactions) followed by fine-tuning on the smaller target dataset (USPTO-50K) produces substantial accuracy improvements over the baseline Transformer, achieving competitive or superior results to contemporaneous state-of-the-art graph-based models at higher values of n-best accuracy.</p>
<h2 id="bridging-the-data-gap-in-retrosynthesis-prediction">Bridging the Data Gap in Retrosynthesis Prediction</h2>
<p><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a>, the problem of predicting reactant compounds needed to synthesize a target product, has seen rapid progress through increasingly sophisticated model architectures: <a href="/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/">LSTM seq-to-seq models</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Transformer models</a>, and graph-to-graph approaches. However, the authors identify a gap in this research trajectory. While model architecture has received extensive attention, the role of training data strategies has been largely neglected in the retrosynthesis literature.</p>
<p>The core practical problem is that high-quality supervised datasets for retrosynthesis (like USPTO-50K) tend to be small and distribution-skewed, with all samples pre-classified into ten major reaction classes. Meanwhile, larger datasets (USPTO-Full with 877K samples, USPTO-MIT with 479K samples) exist but have different distributional properties. Data transfer techniques are standard practice in computer vision, NLP, and machine translation for exactly this scenario, yet they had not been systematically evaluated for retrosynthesis at the time of this work.</p>
<p>The authors also note a contrast with Zoph et al. (2020), who found that self-training outperforms pre-training in image recognition. They hypothesize that chemical compound strings may have more universal representations than images, making pre-training more effective in the chemistry domain.</p>
<h2 id="three-data-transfer-methods-for-retrosynthesis">Three Data Transfer Methods for Retrosynthesis</h2>
<p>The paper formalizes retrosynthesis as a seq-to-seq problem where both the product $x$ and reactant set $y$ are represented as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. A retrosynthesis model defines a likelihood $p_{\mathcal{M}}(y \mid x; \theta)$ optimized via maximum log-likelihood:</p>
<p>$$
\theta^{*} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}^{T}_{\text{Train}}} \log p(y_{i} \mid x_{i})
$$</p>
<p>Given a target dataset $\mathcal{D}^{T}$ and an augment dataset $\mathcal{D}^{A}$, three transfer methods are examined:</p>
<p><strong>Joint Training</strong> concatenates the training sets and optimizes over the union:</p>
<p>$$
\theta^{*}_{\text{joint}} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}_{\text{joint}}} \log p(y_{i} \mid x_{i}), \quad \mathcal{D}_{\text{joint}} = \mathcal{D}^{T}_{\text{Train}} \cup \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>This requires that both datasets share the same input/output domain (same SMILES canonicalization rules).</p>
<p><strong>Self-Training</strong> (pseudo labeling) first trains a base model on $\mathcal{D}^{T}$ alone, then uses this model to relabel the augment dataset products:</p>
<p>$$
\hat{y}_{i} = \arg\max_{y} \log p(y \mid x_{i}; \theta^{*}_{\text{single}}) \quad \text{for } x_{i} \in \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>The pseudo-labeled augment set is then combined with $\mathcal{D}^{T}_{\text{Train}}$ for joint training. This approach does not require consistent label domains between datasets.</p>
<p><strong>Pre-training plus Fine-tuning</strong> trains first on the augment dataset to obtain $\theta^{*}_{\text{pretrain}}$, then initializes fine-tuning from this checkpoint:</p>
<p>$$
\theta^{0}_{\text{finetune}} \leftarrow \theta^{*}_{\text{pretrain}}, \quad \theta^{\ell+1}_{\text{finetune}} \leftarrow \theta^{\ell}_{\text{finetune}} - \gamma^{\ell} \nabla \mathcal{L}(\mathcal{D}^{T}_{\text{Train}}) \big|_{{\theta^{\ell}_{\text{finetune}}}}
$$</p>
<h2 id="experimental-setup-on-uspto-benchmarks">Experimental Setup on USPTO Benchmarks</h2>
<p>The experiments use a fixed Transformer architecture (3 self-attention layers, 500-dimensional latent vectors) implemented in OpenNMT-py, evaluated across all three transfer methods.</p>
<p><strong>Datasets:</strong></p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>40K/5K/5K (train/val/test)</td>
          <td>10 reaction classes, curated by Lowe (2012)</td>
      </tr>
      <tr>
          <td>Augment (main)</td>
          <td>USPTO-Full</td>
          <td>844K train (after cleansing)</td>
          <td>Curated by Lowe (2017)</td>
      </tr>
      <tr>
          <td>Augment (smaller)</td>
          <td>USPTO-MIT</td>
          <td>384K train (after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removed all augment dataset samples whose product SMILES appeared in any USPTO-50K subset, preventing data leakage. All datasets were re-canonicalized with a unified <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> version.</p>
<p><strong>Evaluation</strong> uses n-best accuracy with k=50 beam search, computing accuracy at n=1, 3, 5, 10, 20, 50. Models are selected by best validation perplexity. All experiments report averages and standard deviations over 5 runs.</p>
<p><strong>Optimization</strong> uses Adam with cyclic learning rate scheduling (warm-up) for all methods except fine-tuning, which uses a standard non-cyclic scheduler.</p>
<p><strong>Results comparing data transfer methods (USPTO-Full augment):</strong></p>
<table>
  <thead>
      <tr>
          <th>Training Method</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single model (No Transfer)</td>
          <td>35.3 +/- 1.4</td>
          <td>52.8 +/- 1.4</td>
          <td>58.9 +/- 1.3</td>
          <td>64.5 +/- 1.2</td>
          <td>68.8 +/- 1.2</td>
          <td>72.1 +/- 1.3</td>
      </tr>
      <tr>
          <td>Joint Training</td>
          <td>39.1 +/- 1.3</td>
          <td>63.4 +/- 0.9</td>
          <td>71.9 +/- 0.5</td>
          <td>80.1 +/- 0.2</td>
          <td>85.4 +/- 0.3</td>
          <td>89.4 +/- 0.2</td>
      </tr>
      <tr>
          <td>Self-Training</td>
          <td>41.5 +/- 1.0</td>
          <td>60.4 +/- 0.7</td>
          <td>66.1 +/- 0.7</td>
          <td>71.8 +/- 0.6</td>
          <td>75.3 +/- 0.5</td>
          <td>78.0 +/- 0.3</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune</td>
          <td>57.4 +/- 0.4</td>
          <td>77.6 +/- 0.4</td>
          <td>83.1 +/- 0.2</td>
          <td>87.4 +/- 0.4</td>
          <td>89.6 +/- 0.3</td>
          <td>90.9 +/- 0.2</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with state-of-the-art models:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GLN (Dai et al., 2019)</td>
          <td>Logic Network</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
          <td>88.5</td>
          <td>92.4</td>
      </tr>
      <tr>
          <td>G2Gs (Shi et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>48.9</td>
          <td>67.6</td>
          <td>72.5</td>
          <td>75.5</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RetroXpert (Yan et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>65.6</td>
          <td>78.7</td>
          <td>80.8</td>
          <td>83.3</td>
          <td>84.6</td>
          <td>86.0</td>
      </tr>
      <tr>
          <td>GraphRetro (Somnath et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>63.8</td>
          <td>80.5</td>
          <td>84.1</td>
          <td>85.9</td>
          <td>N/A</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune (ours)</td>
          <td>Seq-to-Seq</td>
          <td>57.4</td>
          <td>77.6</td>
          <td>83.1</td>
          <td>87.4</td>
          <td>89.6</td>
          <td>90.9</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Primary findings:</strong></p>
<ol>
<li>All three data transfer methods improve over the no-transfer baseline across all n-best accuracy levels.</li>
<li>Pre-training plus fine-tuning provides the largest gains, improving top-1 accuracy by 22.1 absolute percentage points (from 35.3% to 57.4%) and achieving the best n=10 and n=20 accuracy among all compared models, including graph-based approaches.</li>
<li>Augment dataset size matters: using USPTO-Full (844K) yields substantially better results than USPTO-MIT (384K) for joint training and pre-training plus fine-tuning, though self-training gains are surprisingly robust to augment dataset size.</li>
<li>Manual inspection of erroneous predictions shows that over 99% of top-1 predictions from the pre-trained/fine-tuned model are chemically appropriate or sensible, even when they do not exactly match the gold-standard reactants.</li>
<li>Pre-training plus fine-tuning shows a distinct advantage in training dynamics: the 1-best and n-best accuracy curves evolve similarly during fine-tuning, unlike the single model where these curves can diverge significantly. This makes early stopping more reliable.</li>
</ol>
<p><strong>Class-wise improvements</strong> are observed across all 10 reaction classes, with the largest gains in heterocycle formation (0.40 to 0.86 at 50-best) and functional group interconversion (0.57 to 0.90).</p>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The model struggles with compounds containing multiple similar substituents (e.g., long-chain hydrocarbons), occasionally selecting the wrong one.</li>
<li>Some reactions involving rare chemical groups (<a href="https://en.wikipedia.org/wiki/Polycyclic_aromatic_hydrocarbon">polycyclic aromatic hydrocarbons</a>) still produce invalid SMILES, suggesting the augment dataset lacks sufficient examples of these structures.</li>
<li>Top-1 accuracy (57.4%) lags behind the best graph-based models (RetroXpert at 65.6%), though the gap narrows at higher n values.</li>
<li>The study uses a fixed Transformer architecture without architecture-specific optimization for each transfer method.</li>
</ul>
<p><strong>Future directions</strong> proposed include freezing parts of the network during fine-tuning, applying data transfer to graph-to-graph models, and testing transferability to other retrosynthesis datasets beyond USPTO-50K.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>50K reactions</td>
          <td>Curated by Lowe (2012), 10 reaction classes</td>
      </tr>
      <tr>
          <td>Augment</td>
          <td>USPTO-Full</td>
          <td>877K reactions (844K after cleansing)</td>
          <td>Curated by Lowe (2017), available via Figshare</td>
      </tr>
      <tr>
          <td>Augment (alt)</td>
          <td>USPTO-MIT</td>
          <td>479K reactions (384K after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removes augment samples whose products appear in any USPTO-50K subset. Unified RDKit canonicalization applied to all datasets.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq-to-seq model (3 self-attention layers, 500-dim latent vectors)</li>
<li>Positional encoding enabled</li>
<li>Maximum sequence length: 200 tokens</li>
<li>Adam optimizer</li>
<li>Cyclic learning rate scheduler with warm-up (all methods except fine-tuning)</li>
<li>Non-cyclic scheduler for fine-tuning phase (Klein et al., 2017)</li>
<li>Beam search with k=50 for inference</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Implementation: OpenNMT-py</li>
<li>No pre-trained weights or model checkpoints released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>57.4%</td>
          <td>35.3% (no transfer)</td>
          <td>Pre-train + fine-tune, USPTO-Full augment</td>
      </tr>
      <tr>
          <td>Top-10 accuracy</td>
          <td>87.4%</td>
          <td>64.5% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-20 accuracy</td>
          <td>89.6%</td>
          <td>68.8% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-50 accuracy</td>
          <td>90.9%</td>
          <td>72.1% (no transfer)</td>
          <td>Competitive with GLN (92.4%)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the paper. The authors mention GPU memory constraints motivating the 200-token sequence length limit.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ishiguro, K., Ujihara, K., Sawada, R., Akita, H., &amp; Kotera, M. (2020). Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis. <em>arXiv preprint arXiv:2010.00792</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ishiguro2020data,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ishiguro, Katsuhiko and Ujihara, Kazuya and Sawada, Ryohto and Akita, Hirotaka and Kotera, Masaaki}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2010.00792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemLLM: A Chemical Large Language Model Framework</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/</guid><description>ChemLLM introduces the first LLM dedicated to chemistry, with ChemData for instruction tuning and ChemBench for evaluation across nine chemical tasks.</description><content:encoded><![CDATA[<h2 id="a-resource-for-chemistry-specific-language-modeling">A Resource for Chemistry-Specific Language Modeling</h2>
<p>ChemLLM is a <strong>Resource</strong> paper that delivers three interconnected artifacts: ChemData (a 7M-sample instruction tuning dataset for chemistry), ChemBench (a 4,100-question multiple-choice benchmark spanning nine chemistry tasks), and ChemLLM itself (a 7B-parameter language model fine-tuned on InternLM2-Base-7B). Together, these components form the first comprehensive framework for building and evaluating LLMs dedicated to the chemical domain. The primary contribution is not a novel architecture but rather the data curation pipeline, evaluation benchmark, and training methodology that converts structured chemical knowledge into dialogue-formatted instruction data.</p>
<h2 id="bridging-structured-chemical-databases-and-conversational-llms">Bridging Structured Chemical Databases and Conversational LLMs</h2>
<p>While general-purpose LLMs like GPT-4 have shown promise on chemistry tasks, they are not specifically designed for the chemical domain. Several challenges motivate ChemLLM:</p>
<ol>
<li>
<p><strong>Structured data incompatibility</strong>: Most chemical information resides in structured databases (<a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, <a href="https://en.wikipedia.org/wiki/ChEBI">ChEBI</a>, <a href="/notes/chemistry/datasets/zinc-22/">ZINC</a>, USPTO) that are not naturally suited for training conversational language models. Using this data directly can degrade natural language processing capabilities.</p>
</li>
<li>
<p><strong>Molecular notation understanding</strong>: Molecules are represented in specialized notations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, which differ from natural language and require explicit alignment during training.</p>
</li>
<li>
<p><strong>Task diversity</strong>: Chemical tasks span name conversion, property prediction, molecular captioning, <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a>, product prediction, yield prediction, and more. A uniform training pipeline must handle this diversity without task-specific adaptation.</p>
</li>
<li>
<p><strong>Evaluation gaps</strong>: Existing chemical benchmarks (e.g., <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>) are designed for specialist models, not LLMs. Text-based evaluation metrics like <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a> and <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> are sensitive to output style rather than factual correctness, making them unreliable for scientific accuracy assessment.</p>
</li>
</ol>
<p>Prior work focused on developing specialist models for individual downstream tasks while neglecting instruction-following and dialogue capabilities that are essential for broader reasoning and generalization.</p>
<h2 id="template-based-instruction-construction-from-structured-data">Template-Based Instruction Construction from Structured Data</h2>
<p>The core innovation is a systematic approach for converting structured chemical data into instruction-tuning format through two techniques:</p>
<h3 id="seed-template-prompt-technique">Seed Template Prompt Technique</h3>
<p>For each task type, the authors design a foundational seed template and use GPT-4 to generate variations that differ in expression but maintain semantic consistency. For each structured data entry, one template is randomly selected to create a single-turn dialogue sample. For example, converting <a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_organic_chemistry">IUPAC</a>-to-SMILES entries:</p>
<ul>
<li>&ldquo;Convert the IUPAC name [name] to its corresponding SMILES representation.&rdquo;</li>
<li>&ldquo;What&rsquo;s the SMILES notation for the chemical known as [name]?&rdquo;</li>
<li>&ldquo;Show me the SMILES sequence for [name], please.&rdquo;</li>
</ul>
<h3 id="play-as-playwrights-technique">Play as Playwrights Technique</h3>
<p>To generate richer, multi-turn dialogues, the authors prompt GPT-4 with a chain-of-thought (CoT) style &ldquo;script&rdquo; construction method. GPT-4 is guided to create multi-turn exchanges that simulate expert discussions, smoothly transitioning between question and answer stages. An additional &ldquo;answer masking&rdquo; variant has the model inquire about supplementary chemical information before providing a final answer, simulating realistic expert reasoning.</p>
<h3 id="training-objective">Training Objective</h3>
<p>The model is fine-tuned using <a href="https://en.wikipedia.org/wiki/LoRA_(machine_learning)">LoRA</a> with an autoregressive cross-entropy loss:</p>
<p>$$L_{CE} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$$</p>
<p>where $M$ is the vocabulary size, $y_{o,c}$ is a binary indicator for whether observation $o$ belongs to class $c$, and $p_{o,c}$ is the predicted probability.</p>
<h2 id="two-stage-training-pipeline-and-chembench-evaluation">Two-Stage Training Pipeline and ChemBench Evaluation</h2>
<h3 id="training-setup">Training Setup</h3>
<p>ChemLLM uses a two-stage instruction tuning approach built on InternLM2-Base-7B:</p>
<p><strong>Stage 1</strong>: Fine-tune on Multi-Corpus (1.7M Q&amp;A pairs from Hugging Face) to enhance general linguistic capabilities, producing InternLM2-Chat-7B.</p>
<p><strong>Stage 2</strong>: Fine-tune on a mixture of ChemData (7M entries) and Multi-Corpus, balancing domain-specific chemical expertise with general language ability.</p>
<p>Training details include:</p>
<ul>
<li>LoRA with rank 8, scale factor 16.0, dropout 0.1</li>
<li>AdamW optimizer with initial learning rate $5.0 \times 10^{-5}$</li>
<li>NEFTune noise injection (alpha = 5) to prevent overfitting</li>
<li>Flash Attention-2 and KV Cache for efficiency</li>
<li>ZeRO Stage-2 for parameter offloading</li>
<li>Per-card batch size of 8 (total batch size 128)</li>
<li>1.06 epochs, 85,255 steps</li>
<li>Training loss reduced from 1.4998 to 0.7158</li>
</ul>
<h3 id="chemdata-composition">ChemData Composition</h3>
<p>ChemData spans three principal task categories with 7M instruction-tuning Q&amp;A pairs:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Tasks</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecules</td>
          <td>Name Conversion, Caption2Mol, Mol2Caption, Molecular Property Prediction</td>
      </tr>
      <tr>
          <td>Reactions</td>
          <td>Retrosynthesis, Product Prediction, Yield Prediction, Temperature Prediction, Solvent Prediction</td>
      </tr>
      <tr>
          <td>Domain-specific</td>
          <td>General chemical knowledge for broader chemical space understanding</td>
      </tr>
  </tbody>
</table>
<p>Data sources include PubChem, ChEMBL, ChEBI, ZINC, USPTO, ORDerly, ChemRxiv, LibreTexts Chemistry, Wikipedia, and Wikidata.</p>
<h3 id="chembench-design">ChemBench Design</h3>
<p>ChemBench contains 4,100 multiple-choice questions across the same nine tasks as ChemData. The choice of multiple-choice format is deliberate: it minimizes the influence of output style and focuses evaluation on factual correctness, unlike BLEU/ROUGE-based evaluation. Wrong answers are generated by sampling nearby values (for prediction tasks) or using GPT-4 to create plausible distractors. Deduplication ensures no overlap between ChemData training entries and ChemBench questions.</p>
<p>ChemBench has been contributed to the OpenCompass evaluation platform.</p>
<h3 id="baselines">Baselines</h3>
<p>All evaluations use 5-shot prompting. Baselines include:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLaMA-2</td>
          <td>Open-source</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>Mistral</td>
          <td>Open-source</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>ChatGLM3</td>
          <td>Open-source</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>Qwen</td>
          <td>Open-source</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>InternLM2-Chat-7B</td>
          <td>Open-source (Stage 1 only)</td>
          <td>7B</td>
      </tr>
      <tr>
          <td>GPT-3.5</td>
          <td>Closed-source</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>GPT-4</td>
          <td>Closed-source</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<h2 id="chemllm-matches-gpt-4-on-chemical-tasks-and-outperforms-7b-peers">ChemLLM Matches GPT-4 on Chemical Tasks and Outperforms 7B Peers</h2>
<h3 id="chemical-evaluation-chembench">Chemical Evaluation (ChemBench)</h3>
<p>ChemLLM significantly outperforms general LLMs of similar scale and surpasses GPT-3.5 across all nine tasks. Compared to GPT-4, ChemLLM achieves higher scores on six of nine tasks, with the remaining three ranking just below GPT-4. LLaMA-2 scores near random chance (~25 per task), highlighting the difficulty of these tasks for models without chemical training.</p>
<p>Compared to InternLM2-Chat-7B (the Stage 1 model), ChemLLM shows substantial improvement, confirming the effectiveness of the Stage 2 chemical fine-tuning.</p>
<h3 id="general-evaluation">General Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>ChemLLM</th>
          <th>Best 7B Baseline</th>
          <th>GPT-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MMLU</td>
          <td>65.6</td>
          <td>&lt; 65.6</td>
          <td>Higher</td>
      </tr>
      <tr>
          <td>C-Eval</td>
          <td>67.2</td>
          <td>&lt; 67.2</td>
          <td>Higher</td>
      </tr>
      <tr>
          <td>GSM8K</td>
          <td>67.2</td>
          <td>&lt; 67.2</td>
          <td>Higher</td>
      </tr>
      <tr>
          <td>C-MHChem</td>
          <td>76.4</td>
          <td>&lt; 76.4</td>
          <td>&lt; 76.4</td>
      </tr>
  </tbody>
</table>
<p>ChemLLM outperforms all competing 7B models on MMLU, C-Eval, and GSM8K. On C-MHChem (Chinese middle and high school chemistry), ChemLLM scores 76.4, surpassing GPT-4. The authors note that chemical data fine-tuning may enhance reasoning capabilities due to the logical reasoning required in chemical problem-solving. ChemLLM also comprehensively surpasses InternLM2-Chat-7B on all four general benchmarks, indicating that chemical data does not harm general capabilities.</p>
<h3 id="qualitative-capabilities">Qualitative Capabilities</h3>
<p>The paper demonstrates qualitative performance on chemistry-related NLP tasks including:</p>
<ul>
<li>Chemical literature translation (English to Chinese and vice versa)</li>
<li>Chemical poetry creation</li>
<li>Information extraction from chemical text</li>
<li>Text summarization of chemical research</li>
<li>Reading comprehension on chemistry topics</li>
<li>Named entity recognition for chemical entities</li>
<li>Ethics and safety reasoning in chemical contexts</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The paper does not provide individual task-level scores in tabular form for ChemBench (only radar charts), making precise comparison difficult. Specific scores for each of the nine tasks across all baselines are not reported numerically. The evaluation is limited to 5-shot prompting without exploration of zero-shot or chain-of-thought prompting variants. The paper also does not discuss failure modes or systematic weaknesses of ChemLLM on particular task types.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 Training</td>
          <td>Multi-Corpus</td>
          <td>1.7M Q&amp;A</td>
          <td>Collected from Hugging Face</td>
      </tr>
      <tr>
          <td>Stage 2 Training</td>
          <td>ChemData + Multi-Corpus</td>
          <td>7M + 1.7M</td>
          <td>Chemical + general mixture</td>
      </tr>
      <tr>
          <td>Chemical Evaluation</td>
          <td>ChemBench</td>
          <td>4,100 MCQ</td>
          <td>9 tasks, contributed to OpenCompass</td>
      </tr>
      <tr>
          <td>General Evaluation</td>
          <td>MMLU, C-Eval, GSM8K, C-MHChem</td>
          <td>Varies</td>
          <td>Standard benchmarks</td>
      </tr>
  </tbody>
</table>
<p>Data sources for ChemData: PubChem, ChEMBL, ChEBI, ZINC, USPTO, ORDerly, ChemRxiv, LibreTexts Chemistry, Wikipedia, Wikidata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Two-stage instruction tuning (general then chemical)</li>
<li>LoRA fine-tuning (rank 8, scale 16.0, dropout 0.1)</li>
<li>Template-based instruction construction with GPT-4 for diversity</li>
<li>Play as Playwrights CoT prompting for multi-turn dialogue generation</li>
<li>NEFTune noise injection (alpha 5)</li>
<li>DeepSpeed ZeRO++ for distributed training</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Base</th>
          <th>Parameters</th>
          <th>Availability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemLLM-7B-Chat</td>
          <td>InternLM2-Base-7B</td>
          <td>7B</td>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-7B-Chat">Hugging Face</a></td>
      </tr>
      <tr>
          <td>ChemLLM-7B-Chat-1.5-DPO</td>
          <td>InternLM2</td>
          <td>7B</td>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1_5-DPO">Hugging Face</a></td>
      </tr>
      <tr>
          <td>ChemLLM-20B-Chat-DPO</td>
          <td>InternLM</td>
          <td>20B</td>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-20B-Chat-DPO">Hugging Face</a></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>5-shot evaluation across all benchmarks. Multiple-choice format for ChemBench to minimize output style bias.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>2 machines, each with 8 NVIDIA A100 SMX GPUs</li>
<li>2 AMD EPYC 7742 64-Core CPUs per machine (256 threads each)</li>
<li>SLURM cluster management</li>
<li>BF16 mixed precision training</li>
<li>Flash Attention-2 + KV Cache</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-7B-Chat">ChemLLM-7B-Chat</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Original 7B chat model</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-7B-Chat-1_5-DPO">ChemLLM-7B-Chat-1.5-DPO</a></td>
          <td>Model</td>
          <td>Other</td>
          <td>Updated v1.5 with DPO</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemLLM-20B-Chat-DPO">ChemLLM-20B-Chat-DPO</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>20B parameter variant</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem">AI4Chem HuggingFace</a></td>
          <td>Collection</td>
          <td>Various</td>
          <td>All models, datasets, and code</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, D., Liu, W., Tan, Q., Chen, J., Yan, H., Yan, Y., Li, J., Huang, W., Yue, X., Ouyang, W., Zhou, D., Zhang, S., Su, M., Zhong, H.-S., &amp; Li, Y. (2024). ChemLLM: A Chemical Large Language Model. <em>arXiv preprint arXiv:2402.06852</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2024chemllm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemLLM: A Chemical Large Language Model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Di and Liu, Wei and Tan, Qian and Chen, Jingdan and Yan, Hang and Yan, Yuliang and Li, Jiatong and Huang, Weiran and Yue, Xiangyu and Ouyang, Wanli and Zhou, Dongzhan and Zhang, Shufei and Su, Mao and Zhong, Han-Sen and Li, Yuqiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2402.06852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemCrow: Augmenting LLMs with 18 Chemistry Tools</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/</guid><description>ChemCrow integrates 18 expert-designed chemistry tools with GPT-4 to enable autonomous synthesis planning, drug discovery, and materials design tasks.</description><content:encoded><![CDATA[<h2 id="an-llm-powered-chemistry-agent">An LLM-Powered Chemistry Agent</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemCrow, an LLM chemistry agent that augments GPT-4 with 18 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. Rather than relying on the LLM&rsquo;s internal knowledge (which is often inaccurate for chemistry), ChemCrow uses the LLM as a reasoning engine that iteratively calls specialized tools to gather information, plan actions, and execute experiments. The system successfully planned and executed real-world chemical syntheses on a robotic platform, demonstrating one of the first chemistry-related LLM agent interactions with the physical world.</p>
<h2 id="bridging-llm-reasoning-and-chemical-expertise">Bridging LLM Reasoning and Chemical Expertise</h2>
<p>Large language models have transformed many domains, but they struggle with chemistry-specific problems. GPT-4 cannot reliably perform basic operations like multiplying large numbers, converting <a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_chemistry">IUPAC names</a> to molecular structures, or predicting reaction outcomes. These limitations stem from the models&rsquo; token-prediction design, which does not encode chemical reasoning or factual chemical knowledge reliably.</p>
<p>Meanwhile, the chemistry community has developed numerous specialized computational tools for reaction prediction, <a href="/notes/chemistry/molecular-design/reaction-prediction/">retrosynthesis</a> planning, molecular property prediction, and de novo molecular generation. These tools exist in isolated environments with steep learning curves, making them difficult for experimental chemists to integrate and use together. The gap between LLM reasoning capabilities and specialized chemistry tools presents an opportunity: augmenting LLMs with these tools could compensate for the models&rsquo; chemical knowledge deficiencies while providing a natural language interface to specialized computational chemistry capabilities.</p>
<h2 id="tool-augmented-reasoning-via-react">Tool-Augmented Reasoning via ReAct</h2>
<p>ChemCrow builds on the ReAct (Reasoning and Acting) framework, where the LLM follows an iterative Thought-Action-Action Input-Observation loop. At each step, the model reasons about the current state of the task, selects an appropriate tool, provides input, pauses while the tool executes, and then incorporates the observation before deciding on the next step. This continues until the final answer is reached.</p>
<p>The system integrates 18 tools organized into four categories:</p>
<p><strong>General tools</strong> include web search (via SerpAPI), literature search (using paper-qa with OpenAI embeddings and FAISS), a Python REPL for arbitrary code execution, and a human interaction interface.</p>
<p><strong>Molecule tools</strong> cover Name2SMILES (converting molecule names to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> via Chem-Space, PubChem, and OPSIN), SMILES2Price (checking purchasability via molbloom and ZINC20), Name2CAS (CAS number lookup via PubChem), molecular Similarity (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> with ECFP2 fingerprints), ModifyMol (local chemical space exploration via SynSpace), PatentCheck (bloom filter patent lookup via molbloom), FuncGroups (functional group identification via SMARTS patterns), and SMILES2Weight (molecular weight calculation via RDKit).</p>
<p><strong>Safety tools</strong> include ControlledChemicalCheck (screening against chemical weapons lists from <a href="https://en.wikipedia.org/wiki/Organisation_for_the_Prohibition_of_Chemical_Weapons">OPCW</a> and the Australia Group), ExplosiveCheck (GHS explosive classification via PubChem), and SafetySummary (comprehensive safety overview from PubChem data).</p>
<p><strong>Chemical reaction tools</strong> include NameRXN (reaction classification via NextMove Software), ReactionPredict (product prediction via IBM&rsquo;s RXN4Chemistry API using the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>), ReactionPlanner (multi-step synthesis planning via RXN4Chemistry), and ReactionExecute (direct synthesis execution on IBM&rsquo;s RoboRXN robotic platform).</p>
<p>A key design feature is that safety checks are automatically invoked before synthesis execution. If a molecule is flagged as a controlled chemical or precursor, execution stops immediately.</p>
<h2 id="experimental-validation-and-evaluation">Experimental Validation and Evaluation</h2>
<h3 id="autonomous-synthesis">Autonomous Synthesis</h3>
<p>ChemCrow autonomously planned and executed four real-world syntheses on the IBM RoboRXN cloud-connected robotic platform:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/DEET">DEET</a></strong> (insect repellent), from the prompt &ldquo;Plan and execute the synthesis of an insect repellent&rdquo;</li>
<li><strong>Three <a href="https://en.wikipedia.org/wiki/Thiourea">thiourea</a> <a href="https://en.wikipedia.org/wiki/Organocatalysis">organocatalysts</a></strong> (Schreiner&rsquo;s, Ricci&rsquo;s, and Takemoto&rsquo;s catalysts), from a prompt asking to find and synthesize a thiourea organocatalyst that accelerates the <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reaction</a></li>
</ul>
<p>All four syntheses yielded the anticipated compounds. ChemCrow demonstrated the ability to autonomously adapt synthesis procedures when the RoboRXN platform flagged issues (such as insufficient solvent or invalid purification actions), iteratively modifying the procedure until it was valid.</p>
<h3 id="novel-chromophore-discovery">Novel Chromophore Discovery</h3>
<p>In a human-AI collaboration scenario, ChemCrow was instructed to train a machine learning model to screen candidate <a href="https://en.wikipedia.org/wiki/Chromophore">chromophores</a>. The system loaded and cleaned data from a chromophore database, trained and evaluated a random forest model, and suggested a molecule with a target absorption maximum of 369 nm. The proposed molecule was subsequently synthesized and characterized, revealing a measured absorption maximum of 336 nm, confirming the discovery of a new chromophore.</p>
<h3 id="expert-vs-llm-evaluation">Expert vs. LLM Evaluation</h3>
<p>The evaluation used 14 use cases spanning synthesis planning, molecular design, and chemical logic. Both ChemCrow and standalone GPT-4 (without tools) were evaluated by:</p>
<ol>
<li><strong>Expert human evaluators</strong> (n=4): Assessed correctness of chemistry, quality of reasoning, and degree of task completion</li>
<li><strong>EvaluatorGPT</strong>: An LLM evaluator prompted to assess responses</li>
</ol>
<p>Key findings from the evaluation:</p>
<table>
  <thead>
      <tr>
          <th>Evaluator</th>
          <th>Preferred System</th>
          <th>Reasoning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Human experts</td>
          <td>ChemCrow</td>
          <td>Better chemical accuracy and task completeness, especially on complex tasks</td>
      </tr>
      <tr>
          <td>EvaluatorGPT</td>
          <td>GPT-4</td>
          <td>Favored fluent, complete-sounding responses despite factual errors</td>
      </tr>
  </tbody>
</table>
<p>Human experts preferred ChemCrow across most tasks, with the exception of very simple tasks where GPT-4 could answer from memorized training data (e.g., synthesis of well-known molecules like paracetamol). GPT-4 without tools consistently produced hallucinations that appeared convincing but were factually incorrect upon expert inspection.</p>
<p>An important finding is that LLM-based evaluation (EvaluatorGPT) cannot replace expert human assessment for scientific tasks. The LLM evaluator lacks the domain knowledge needed to distinguish fluent but incorrect answers from accurate ones, rendering it unsuitable for benchmarking factuality in chemistry.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>ChemCrow demonstrates that augmenting LLMs with expert-designed tools transforms them from &ldquo;hyperconfident, typically wrong information sources&rdquo; into reasoning engines that can gather and act on accurate chemical information. The system lowers the barrier for non-experts to access computational chemistry tools through natural language while serving as an assistant to expert chemists.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Tool dependency</strong>: ChemCrow&rsquo;s performance is bounded by the quality and coverage of its tools. Improved synthesis engines would directly improve synthesis planning capabilities.</li>
<li><strong>Reasoning failures</strong>: Tools become useless if the LLM&rsquo;s reasoning about when and how to use them is flawed, or if garbage inputs are provided.</li>
<li><strong>Reproducibility</strong>: The API-based approach to closed-source LLMs (GPT-4) limits reproducibility of individual results. The authors note that open-source models could address this, potentially at the cost of reasoning quality.</li>
<li><strong>Evaluation scope</strong>: The 14 evaluation tasks, while diverse, represent a limited test set. Standardized benchmarks for LLM-based chemistry tools did not exist at the time of publication.</li>
<li><strong>Safety considerations</strong>: While safety tools prevent execution of controlled chemical syntheses, risks remain from inaccurate reasoning or tool outputs leading to suboptimal conclusions.</li>
</ul>
<p>The authors emphasize that ChemCrow&rsquo;s modular design allows easy extension with new tools, and that future integration of image-processing tools, additional language-based tools, and other capabilities could substantially enhance the system.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chromophore screening</td>
          <td>DB for chromophore (Joung et al.)</td>
          <td>Not specified</td>
          <td>Used for training random forest model</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>14 expert-designed tasks</td>
          <td>14 tasks</td>
          <td>Spanning synthesis, molecular design, and chemical logic</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>OPCW Schedules 1-3, Australia Group lists</td>
          <td>Not specified</td>
          <td>Used for controlled chemical screening</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>LLM</strong>: GPT-4 with temperature 0.1</li>
<li><strong>Framework</strong>: LangChain for tool integration</li>
<li><strong>Reasoning</strong>: ReAct (Reasoning + Acting) framework with chain-of-thought prompting</li>
<li><strong>Synthesis planning</strong>: IBM RXN4Chemistry API (Molecular Transformer-based)</li>
<li><strong>Molecule similarity</strong>: Tanimoto similarity with ECFP2 fingerprints via RDKit</li>
<li><strong>Chemical space exploration</strong>: SynSpace with 50 robust medicinal chemistry reactions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-4 (OpenAI, closed-source) for reasoning</li>
<li>Random forest for chromophore screening (trained on the fly)</li>
<li>Molecular Transformer via RXN4Chemistry API for reaction prediction and retrosynthesis</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Human evaluation</strong>: 4 expert chemists rated responses on chemistry correctness, reasoning quality, and task completion</li>
<li><strong>LLM evaluation</strong>: EvaluatorGPT assessed responses (found unreliable for factuality)</li>
<li><strong>Experimental validation</strong>: 4 syntheses on RoboRXN platform, 1 novel chromophore characterization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The system relies primarily on API calls to GPT-4 and RXN4Chemistry, so local compute requirements are minimal.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-public">chemcrow-public</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source implementation with 12 of 18 tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-runs">chemcrow-runs</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>All experiment outputs and evaluation data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884639">Zenodo release (code)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release v0.3.24</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884645">Zenodo release (runs)</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>Archived experiment runs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., &amp; Schwaller, P. (2024). Augmenting large language models with chemistry tools. <em>Nature Machine Intelligence</em>, 6(5), 525-535.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{bran2024augmenting,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmenting large language models with chemistry tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bran, Andres M. and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{525--535}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00832-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChatDrug: Conversational Drug Editing with ChatGPT</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chatdrug-conversational-drug-editing/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chatdrug-conversational-drug-editing/</guid><description>ChatDrug uses ChatGPT with retrieval and domain feedback for drug editing across small molecules, peptides, and proteins on 39 tasks.</description><content:encoded><![CDATA[<h2 id="a-framework-for-conversational-drug-editing-with-llms">A Framework for Conversational Drug Editing with LLMs</h2>
<p>ChatDrug is a <strong>Method</strong> paper that introduces a parameter-free framework for drug editing using conversational large language models (specifically ChatGPT/GPT-3.5). The primary contribution is a three-module pipeline that combines prompt engineering, retrieval-augmented domain feedback, and iterative conversation to perform text-guided editing of small molecules, peptides, and proteins. The paper also establishes a benchmark of 39 drug editing tasks spanning these three drug types.</p>
<h2 id="bridging-conversational-ai-and-drug-discovery">Bridging Conversational AI and Drug Discovery</h2>
<p>Drug editing (also called <a href="https://en.wikipedia.org/wiki/Hit_to_lead">lead optimization</a> or protein design) is a critical step in the drug discovery pipeline where molecular substructures are modified to achieve desired properties. Traditional approaches rely on domain experts for manual editing, which can be subjective and biased. Recent multi-modal approaches like MoleculeSTM and ProteinDT have started exploring text-guided drug editing, but they are domain-specific (limited to one drug type) and lack conversational capabilities for iterative refinement.</p>
<p>The authors identify three properties of conversational LLMs that make them suitable for drug discovery: (1) pretraining on comprehensive knowledge bases covering drug-related concepts, (2) strong few-shot adaptation and generalization abilities, and (3) interactive communication enabling iterative feedback incorporation. However, directly applying LLMs to drug editing yields suboptimal results because the models do not fully utilize prior domain knowledge. ChatDrug addresses this gap through structured retrieval and feedback mechanisms.</p>
<h2 id="three-module-pipeline-pdds-redf-and-conversation">Three-Module Pipeline: PDDS, ReDF, and Conversation</h2>
<p>ChatDrug consists of three modules that operate sequentially without any parameter learning.</p>
<h3 id="pdds-module-prompt-design-for-domain-specific">PDDS Module (Prompt Design for Domain-Specific)</h3>
<p>The PDDS module constructs domain-specific prompts for ChatGPT. Given an input drug $\pmb{x}_{\text{in}}$ and a text prompt $\pmb{x}_t$ describing the desired property change, the goal is:</p>
<p>$$
\pmb{x}_{\text{out}} = \text{ChatDrug}(\pmb{x}_{\text{in}}, \pmb{x}_t)
$$</p>
<p>The prompts are designed around high-level property descriptions (e.g., &ldquo;more soluble in water&rdquo;) rather than exact substructure replacements. The authors argue that ChatDrug is better suited for &ldquo;fuzzy searching&rdquo; (property-based editing with non-deterministic answers) rather than &ldquo;exact searching&rdquo; (precise substructure replacement that experts can do directly).</p>
<h3 id="redf-module-retrieval-and-domain-feedback">ReDF Module (Retrieval and Domain Feedback)</h3>
<p>The ReDF module retrieves structurally similar examples from a domain-specific database and injects them into the conversation as demonstrations. For an input drug $\pmb{x}_{\text{in}}$, a candidate drug $\tilde{\pmb{x}}$ that failed the desired property change, and a retrieval database, ReDF returns:</p>
<p>$$
\pmb{x}_R = \text{ReDF}(\pmb{x}_{\text{in}}, \tilde{\pmb{x}}; \pmb{x}_t) = \underset{\pmb{x}&rsquo;_R \in \text{RetrievalDB}}{\arg\max} \langle \tilde{\pmb{x}}, \pmb{x}&rsquo;_R \rangle \wedge D(\pmb{x}_{\text{in}}, \pmb{x}&rsquo;_R; \pmb{x}_t)
$$</p>
<p>where $D(\cdot, \cdot; \cdot) \in {\text{True}, \text{False}}$ is a domain feedback function checking whether the retrieved drug satisfies the desired property change, and $\langle \tilde{\pmb{x}}, \pmb{x}&rsquo;_R \rangle$ is a similarity function (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> for small molecules, <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> for peptides and proteins).</p>
<p>The retrieved example $\pmb{x}_R$ is injected into the prompt as: &ldquo;Your provided sequence [$\tilde{\pmb{x}}$] is not correct. We find a sequence [$\pmb{x}_R$] which is correct and similar to the molecule you provided. Can you give me a new molecule?&rdquo;</p>
<h3 id="conversation-module">Conversation Module</h3>
<p>The conversation module enables iterative refinement over $C$ rounds. At each round $c$, if the edited drug $\pmb{x}_c$ does not satisfy the evaluation condition, ChatDrug retrieves a new example via ReDF using $\tilde{\pmb{x}} = \pmb{x}_c$ and continues the conversation. This aligns with the iterative nature of real drug discovery workflows.</p>
<h2 id="experiments-across-39-drug-editing-tasks">Experiments Across 39 Drug Editing Tasks</h2>
<h3 id="task-design">Task Design</h3>
<p>The benchmark includes 39 tasks across three drug types:</p>
<ul>
<li><strong>Small molecules</strong> (28 tasks): 16 single-objective (tasks 101-108, each with loose and strict thresholds) and 12 multi-objective tasks (tasks 201-206, each with two thresholds). Properties include solubility (<a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>), drug-likeness (QED), permeability (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">tPSA</a>), <a href="https://en.wikipedia.org/wiki/Hydrogen_bond">hydrogen bond</a> acceptors/donors.</li>
<li><strong>Peptides</strong> (9 tasks): 6 single-objective and 3 multi-objective tasks for editing <a href="https://en.wikipedia.org/wiki/Major_histocompatibility_complex">peptide-MHC binding</a> affinity across different <a href="https://en.wikipedia.org/wiki/Human_leukocyte_antigen">HLA allele</a> types.</li>
<li><strong>Proteins</strong> (2 tasks): Editing protein sequences to increase <a href="https://en.wikipedia.org/wiki/Alpha_helix">alpha-helix</a> or <a href="https://en.wikipedia.org/wiki/Beta_sheet">beta-strand</a> secondary structures.</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>For small molecules, baselines include Random, PCA, High-Variance, and GS-Mutate (all based on MegaMolBART), plus MoleculeSTM with <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and Graph representations. For peptides and proteins, random mutation baselines with 1-3 mutated positions are used.</p>
<h3 id="main-results">Main Results</h3>
<p>ChatDrug achieves the best performance on 33 out of 39 tasks. Key results for small molecule editing (hit ratio):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Property</th>
          <th>ChatDrug (loose)</th>
          <th>Best Baseline (loose)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>101</td>
          <td>More soluble</td>
          <td>94.13</td>
          <td>67.86 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>102</td>
          <td>Less soluble</td>
          <td>96.86</td>
          <td>64.79 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>106</td>
          <td>Lower permeability</td>
          <td>77.35</td>
          <td>34.13 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>107</td>
          <td>More HBA</td>
          <td>95.35</td>
          <td>54.01 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>108</td>
          <td>More HBD</td>
          <td>96.54</td>
          <td>60.97 (MoleculeSTM-Graph)</td>
      </tr>
  </tbody>
</table>
<p>ChatDrug underperforms on tasks 104 (less like a drug) and 105 (higher permeability) and most multi-objective tasks involving permeability (205), where MoleculeSTM variants perform better.</p>
<p>For peptide editing, ChatDrug achieves 41-69% hit ratios compared to 0.4-14.4% for random mutation baselines. For protein editing, ChatDrug reaches 34.79% and 51.38% hit ratios on helix and strand tasks respectively, compared to 26.90% and 21.44% for the best random mutation baseline.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Conversation rounds</strong>: Performance increases with more rounds, converging around $C = 2$. For example, on task 101 (loose threshold), zero-shot achieves 78.26%, $C = 1$ reaches 89.56%, and $C = 2$ reaches 93.37%.</p>
<p><strong>ReDF threshold</strong>: Using a stricter threshold in the domain feedback function $D$ (matching the evaluation threshold) yields substantially higher performance than using a loose threshold. For example, on task 107 with strict evaluation, the strict-threshold ReDF achieves 72.60% vs. 14.96% for the loose-threshold ReDF.</p>
<p><strong>Similarity analysis</strong>: Retrieved molecules $\pmb{x}_R$ tend to have lower similarity to input molecules than the intermediate outputs $\pmb{x}_1$, yet they have higher hit ratios. This suggests the ReDF module explores the chemical space effectively, and the conversation module balances similarity preservation with property optimization.</p>
<p><strong>Knowledge extraction</strong>: ChatDrug can articulate domain-specific reasoning for its edits (e.g., summarizing rules for increasing water solubility by introducing polar functional groups), though the extracted knowledge shows some redundancy.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>ChatDrug demonstrates that conversational LLMs can serve as useful tools for drug editing, achieving strong results across diverse drug types with a parameter-free approach. The framework exhibits open vocabulary and compositional properties, allowing it to handle novel drug concepts and multi-objective tasks through natural language.</p>
<p>The authors acknowledge two main limitations. First, ChatDrug struggles with understanding complex 3D drug geometries, which would require deeper geometric modeling. Second, the framework requires multiple conversation rounds to achieve strong performance, adding computational cost through repeated API calls. The authors suggest that knowledge summarization capabilities of LLMs could help reduce this cost.</p>
<p>The evaluation relies entirely on computational oracles (RDKit for small molecules, MHCflurry2.0 for peptides, ProteinCLAP for proteins) rather than wet-lab validation. The hit ratio metric also excludes invalid outputs from the denominator, so the effective success rate on all attempted edits may be lower than reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Small molecule inputs</td>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a></td>
          <td>200 molecules</td>
          <td>Sampled SMILES strings</td>
      </tr>
      <tr>
          <td>Small molecule retrieval DB</td>
          <td>ZINC</td>
          <td>10K molecules</td>
          <td>For ReDF similarity search</td>
      </tr>
      <tr>
          <td>Peptide inputs</td>
          <td>Peptide-MHC binding dataset</td>
          <td>500 peptides per task</td>
          <td>From 30 common MHC alleles</td>
      </tr>
      <tr>
          <td>Peptide retrieval DB</td>
          <td>Experimental binding data</td>
          <td>Varies by allele</td>
          <td>Target allele experimental data</td>
      </tr>
      <tr>
          <td>Protein inputs</td>
          <td>TAPE test set</td>
          <td>Varies</td>
          <td>Secondary structure prediction test data</td>
      </tr>
      <tr>
          <td>Protein retrieval DB</td>
          <td>TAPE training set</td>
          <td>Varies</td>
          <td>Secondary structure prediction training data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-3.5-turbo via OpenAI ChatCompletion API, temperature=0, frequency_penalty=0.2</li>
<li>System prompt: &ldquo;You are an expert in the field of molecular chemistry.&rdquo;</li>
<li>$C = 2$ conversation rounds for main results</li>
<li>5 random seeds (0-4) for small molecule main results, seed 0 for ablations</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>ChatGPT (GPT-3.5-turbo): used as-is, no fine-tuning</li>
<li>MHCflurry 2.0: pseudo-oracle for peptide binding affinity evaluation</li>
<li>ProteinCLAP-EBM-NCE from ProteinDT: protein secondary structure prediction</li>
<li>ESMFold: protein folding for visualization</li>
<li>RDKit: molecular property calculations for small molecules</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit Ratio</td>
          <td>Fraction of valid edits satisfying property requirements</td>
          <td>Invalid sequences excluded from denominator</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All experiments conducted on a single NVIDIA RTX A6000 GPU (used only for peptide and protein evaluation). Total OpenAI API cost was less than $100.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/chao1224/ChatDrug">ChatDrug GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, S., Wang, J., Yang, Y., Wang, C., Liu, L., Guo, H., &amp; Xiao, C. (2024). Conversational Drug Editing Using Retrieval and Domain Feedback. <em>ICLR 2024</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2024chatdrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Conversational Drug Editing Using Retrieval and Domain Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Shengchao and Wang, Jiongxiao and Yang, Yijin and Wang, Chengpeng and Liu, Ling and Guo, Hongyu and Xiao, Chaowei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BioT5: Cross-Modal Integration of Biology and Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/</guid><description>BioT5 is a T5-based pretraining framework that jointly models molecules, proteins, and natural language using SELFIES for robust molecular generation.</description><content:encoded><![CDATA[<h2 id="a-unified-pretraining-framework-for-molecules-proteins-and-text">A Unified Pretraining Framework for Molecules, Proteins, and Text</h2>
<p>BioT5 is a <strong>Method</strong> paper that introduces a comprehensive <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-based pretraining framework for cross-modal integration of molecules, proteins, and natural language. The primary contribution is a multi-task pretraining approach that uses <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (instead of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) for 100% valid molecular representations, separate tokenization for each modality, and a combination of masked language modeling and translation objectives to connect structured biological data with unstructured scientific text. After fine-tuning, BioT5 (252M parameters) achieves state-of-the-art performance on 10 out of 15 downstream tasks spanning molecule property prediction, protein property prediction, drug-target interaction, protein-protein interaction, molecule captioning, and text-based molecule generation.</p>
<h2 id="bridging-the-gap-between-molecular-sequences-and-scientific-knowledge">Bridging the Gap Between Molecular Sequences and Scientific Knowledge</h2>
<p>Prior cross-modal models in computational biology face three recurring challenges. First, models like MolT5 and MolXPT rely on SMILES to represent molecules, but SMILES strings are syntactically fragile: random perturbations or model-generated sequences frequently produce invalid molecular structures. Edwards et al. (2022) and Li et al. (2023) both highlight this validity problem as a bottleneck for text-to-molecule generation. Second, the contextual information surrounding molecular and protein names in scientific literature (e.g., mentions in <a href="https://en.wikipedia.org/wiki/PubMed">PubMed</a> abstracts that describe properties, interactions, and experimental results) remains underutilized. Most models either ignore this context or treat it identically to structured database entries. Third, existing approaches like MolT5 and <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> share a single tokenizer and embedding space across molecules, proteins, and text. This leads to chemically incorrect tokenization: the bromine atom &ldquo;Br&rdquo; in SMILES gets split into &ldquo;B&rdquo; (boron) and &ldquo;r&rdquo;, producing erroneous downstream predictions.</p>
<p>BioT5 addresses all three issues simultaneously by adopting SELFIES for molecular representation, extracting entity-linked contextual knowledge from PubMed, and employing separate vocabularies for each modality.</p>
<h2 id="selfies-separate-tokenization-and-multi-task-pretraining">SELFIES, Separate Tokenization, and Multi-Task Pretraining</h2>
<p>The core innovations of BioT5 center on three design decisions:</p>
<h3 id="selfies-for-robust-molecular-representation">SELFIES for Robust Molecular Representation</h3>
<p>BioT5 replaces SMILES with SELFIES (Self-referencing Embedded Strings) for all molecular representations. Every permutation of symbols within the SELFIES alphabet generates a chemically valid molecular structure, guaranteeing 100% validity in generation tasks. Molecules from ZINC20 are converted from SMILES to SELFIES during data preprocessing.</p>
<h3 id="modality-specific-tokenization">Modality-Specific Tokenization</h3>
<p>Rather than sharing a single SentencePiece vocabulary across modalities, BioT5 maintains three separate dictionaries:</p>
<ul>
<li><strong>Molecules</strong>: Each SELFIES token corresponds to a chemically meaningful atom group enclosed in brackets (e.g., <code>[C]</code>, <code>[=C]</code>, <code>[Br]</code>).</li>
<li><strong>Proteins</strong>: Amino acids are prefixed with a special <code>&lt;p&gt;</code> token to distinguish them from text characters (e.g., <code>&lt;p&gt;M</code>, <code>&lt;p&gt;K</code>, <code>&lt;p&gt;R</code>).</li>
<li><strong>Text</strong>: The standard T5 vocabulary is retained.</li>
</ul>
<p>This prevents semantic conflation across modalities. The total vocabulary size is 35,073, and the model comprises 252M parameters using the T5-v1.1-base architecture.</p>
<h3 id="multi-task-pretraining-objectives">Multi-Task Pretraining Objectives</h3>
<p>BioT5 uses six pretraining tasks organized into three categories:</p>
<ol>
<li><strong>Single-modal T5 objective</strong>: Standard span corruption and recovery applied independently to molecule SELFIES (task 1), protein <a href="https://en.wikipedia.org/wiki/FASTA_format">FASTA</a> (task 2), and general text from C4 (task 3).</li>
<li><strong>Wrapped text T5 objective</strong> (task 4): Applied to PubMed articles where molecular names are replaced with corresponding SELFIES strings and gene names are appended with protein FASTA sequences, using BERN2 for named entity recognition and entity linking.</li>
<li><strong>Bidirectional translation</strong> (tasks 5 and 6): Molecule SELFIES to text description and vice versa (using 339K pairs from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>), and protein FASTA to text description and vice versa (using 569K pairs from <a href="https://en.wikipedia.org/wiki/UniProt">Swiss-Prot</a>).</li>
</ol>
<p>The translation direction is randomly sampled with probability 0.5 for each example. For downstream tasks, BioT5 uses prompt-based fine-tuning to cast all tasks into a sequence generation format, reducing the gap between pretraining and fine-tuning.</p>
<h2 id="evaluation-across-15-downstream-tasks">Evaluation Across 15 Downstream Tasks</h2>
<p>BioT5 is evaluated on 15 tasks organized into three categories: single-instance prediction, multi-instance prediction, and cross-modal generation.</p>
<h3 id="molecule-property-prediction-moleculenet">Molecule Property Prediction (MoleculeNet)</h3>
<p>BioT5 is evaluated on six binary classification tasks from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> using scaffold splitting: BBBP, Tox21, ClinTox, HIV, BACE, and SIDER. Results are averaged over three random runs.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GEM</th>
          <th>MolXPT</th>
          <th>BioT5</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>72.4</td>
          <td>80.0</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>78.1</td>
          <td>77.1</td>
          <td>77.9</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>90.1</td>
          <td>95.3</td>
          <td>95.4</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>78.1</td>
          <td><strong>81.0</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>85.6</td>
          <td>88.4</td>
          <td><strong>89.4</strong></td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>67.2</td>
          <td>71.7</td>
          <td><strong>73.2</strong></td>
      </tr>
      <tr>
          <td><strong>Avg</strong></td>
          <td>79.0</td>
          <td>81.9</td>
          <td><strong>82.4</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best average AUROC (82.4) across all six datasets, surpassing both GNN-based methods (GEM) and language model baselines (MolXPT).</p>
<h3 id="protein-property-prediction-peer-benchmark">Protein Property Prediction (PEER Benchmark)</h3>
<p>On the PEER benchmark, BioT5 is evaluated on protein solubility and subcellular localization prediction:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>Solubility (Acc)</th>
          <th>Localization (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESM-1b</td>
          <td>652.4M</td>
          <td>70.23</td>
          <td><strong>92.40</strong></td>
      </tr>
      <tr>
          <td>ProtBert</td>
          <td>419.9M</td>
          <td>68.15</td>
          <td>91.32</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252.1M</td>
          <td><strong>74.65</strong></td>
          <td>91.69</td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best solubility prediction accuracy (74.65%) despite being 2-3x smaller than dedicated protein language models like ESM-1b and ProtBert.</p>
<h3 id="drug-target-interaction-prediction">Drug-Target Interaction Prediction</h3>
<p>BioT5 is evaluated on three DTI datasets (BioSNAP, Human, BindingDB) with five random runs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BioSNAP AUROC</th>
          <th>Human AUROC</th>
          <th>BindingDB AUROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugBAN</td>
          <td>0.903</td>
          <td>0.982</td>
          <td>0.960</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.989</strong></td>
          <td><strong>0.963</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 consistently outperforms DrugBAN and other specialized DTI models across all three datasets.</p>
<h3 id="molecule-captioning-and-text-based-molecule-generation">Molecule Captioning and Text-Based Molecule Generation</h3>
<p>On the ChEBI-20 dataset, BioT5 outperforms all baselines in molecule captioning:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>BLEU-4</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-large</td>
          <td>783M</td>
          <td>0.508</td>
          <td>0.614</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>MolXPT</td>
          <td>350M</td>
          <td>0.505</td>
          <td>0.626</td>
          <td>0.594</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td><strong>0.556</strong></td>
          <td><strong>0.656</strong></td>
          <td><strong>0.603</strong></td>
      </tr>
  </tbody>
</table>
<p>For text-based molecule generation, BioT5 achieves an exact match score of 0.413 (vs. 0.311 for MolT5-large) while maintaining 100% validity, compared to 90.5% for MolT5-large. This demonstrates the direct benefit of SELFIES: every generated sequence is a valid molecule.</p>
<h3 id="protein-protein-interaction-prediction">Protein-Protein Interaction Prediction</h3>
<p>On the PEER PPI benchmarks (Yeast and Human), BioT5 achieves competitive results, outperforming fully fine-tuned ProtBert and ESM-1b on the Yeast dataset (64.89% vs. 63.72% for ProtBert) and placing second on Human (86.22% vs. 88.06% for ESM-1b with frozen weights).</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BioT5 demonstrates that integrating molecular, protein, and textual modalities within a single pretraining framework yields consistent improvements across diverse biological tasks. Three factors drive BioT5&rsquo;s performance: (1) SELFIES guarantees 100% molecular validity in generation tasks, eliminating a persistent failure mode of SMILES-based models; (2) separate tokenization preserves the semantic integrity of each modality; (3) wrapped text pretraining on PubMed provides contextual biological knowledge that pure sequence models miss.</p>
<p>The authors acknowledge several limitations. BioT5 requires full-parameter fine-tuning for each downstream task because instruction-tuning does not generalize across tasks, and combining datasets via instructions causes data leakage (the authors note overlaps between BindingDB training data and BioSNAP/Human test sets). The model only handles sequence-format bio-entities and does not incorporate 2D or 3D structural information. Additional biological modalities such as DNA/RNA sequences and cell-level data are also left for future work.</p>
<p>The authors also note risks: BioT5 could potentially be misused to generate dangerous molecules, and it may fail to generate effective therapeutic molecules or produce compounds with adverse side effects.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (molecules)</td>
          <td>ZINC20</td>
          <td>~300M molecules</td>
          <td>Converted from SMILES to SELFIES</td>
      </tr>
      <tr>
          <td>Pretraining (proteins)</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniRef50</a></td>
          <td>27M proteins</td>
          <td>Filtered by length</td>
      </tr>
      <tr>
          <td>Pretraining (text)</td>
          <td>C4</td>
          <td>Large</td>
          <td>Standard T5 corpus</td>
      </tr>
      <tr>
          <td>Pretraining (wrapped text)</td>
          <td>PubMed</td>
          <td>33M articles</td>
          <td>Entity linking via BERN2</td>
      </tr>
      <tr>
          <td>Pretraining (molecule-text pairs)</td>
          <td>PubChem</td>
          <td>339K pairs</td>
          <td>Excludes ChEBI-20 molecules</td>
      </tr>
      <tr>
          <td>Pretraining (protein-text pairs)</td>
          <td>Swiss-Prot</td>
          <td>569K pairs</td>
          <td>High-quality annotations</td>
      </tr>
      <tr>
          <td>Evaluation (molecular properties)</td>
          <td>MoleculeNet</td>
          <td>6 datasets</td>
          <td>Scaffold splitting</td>
      </tr>
      <tr>
          <td>Evaluation (protein properties)</td>
          <td>PEER</td>
          <td>2 tasks</td>
          <td>Solubility and localization</td>
      </tr>
      <tr>
          <td>Evaluation (DTI)</td>
          <td>BioSNAP, Human, BindingDB</td>
          <td>3 datasets</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Evaluation (PPI)</td>
          <td>Yeast, Human</td>
          <td>2 datasets</td>
          <td>From PEER benchmark</td>
      </tr>
      <tr>
          <td>Evaluation (generation)</td>
          <td>ChEBI-20</td>
          <td>33K pairs</td>
          <td>Molecule captioning and text-to-molecule</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5-v1.1-base (encoder-decoder transformer)</li>
<li>Optimizer: AdamW with RMS scaling</li>
<li>Learning rate: cosine annealing, base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$</li>
<li>Warmup steps: 10,000</li>
<li>Dropout: 0.0</li>
<li>Maximum input length: 512 tokens</li>
<li>Pretraining steps: 350K</li>
<li>Batch size: 96 per GPU (6 data types per batch)</li>
<li>Prompt-based fine-tuning for all downstream tasks</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Vocabulary Size</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td>35,073</td>
          <td>T5-v1.1-base</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecule property prediction: AUROC on 6 MoleculeNet tasks (scaffold split, 3 runs)</li>
<li>Protein property prediction: accuracy on PEER benchmark (3 runs)</li>
<li>Drug-target interaction: AUROC, AUPRC, accuracy on 3 DTI datasets (5 runs)</li>
<li>Protein-protein interaction: accuracy on 2 PPI datasets (3 runs)</li>
<li>Molecule captioning: BLEU, ROUGE, METEOR, Text2Mol on ChEBI-20</li>
<li>Text-based molecule generation: BLEU, exact match, fingerprint similarities, FCD, validity on ChEBI-20</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8x NVIDIA A100 80GB GPUs for pretraining</li>
<li>Codebase: nanoT5</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/QizhiPei/BioT5">BioT5 Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Pei, Q., Zhang, W., Zhu, J., Wu, K., Gao, K., Wu, L., Xia, Y., &amp; Yan, R. (2023). BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. <em>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</em>, 1102-1123. <a href="https://doi.org/10.18653/v1/2023.emnlp-main.70">https://doi.org/10.18653/v1/2023.emnlp-main.70</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{pei2023biot5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1102--1123}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.emnlp-main.70}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MTL-BERT: Multitask BERT for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</guid><description>MTL-BERT combines BERT pretraining, multitask learning, and SMILES enumeration for molecular property prediction across 60 drug discovery datasets.</description><content:encoded><![CDATA[<h2 id="a-multitask-bert-framework-for-molecular-property-prediction">A Multitask BERT Framework for Molecular Property Prediction</h2>
<p>MTL-BERT is a <strong>Method</strong> paper that introduces a multitask learning framework built on BERT for predicting molecular properties from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>. The primary contribution is the combination of three strategies to address data scarcity in drug discovery: (1) masked token pretraining on 1.7 million unlabeled molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, (2) multitask fine-tuning across 60 property prediction datasets simultaneously, and (3) <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as a data augmentation technique applied during pretraining, fine-tuning, and inference. The model achieves strong performance across 60 <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> and molecular property datasets (44 classification and 16 regression), outperforming baselines including GNNs, XGBoost with molecular fingerprints, and prior <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> approaches.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p>Deep learning methods for molecular property prediction face a fundamental tension: they require large amounts of labeled data to learn effectively, but labeled bioactivity data is scarce due to the cost and time of laboratory experiments. Existing approaches at the time of publication addressed this in isolation. Graph neural networks (GNNs) learn from molecular graphs but are typically shallow (2-3 layers) and prone to overfitting on small datasets. The original SMILES-BERT model applied masked language modeling to SMILES strings but fine-tuned separately for each task, missing opportunities to share information across related properties. Fixed molecular representations like <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (continuous and data-driven descriptors) cannot be further optimized for specific downstream tasks.</p>
<p>The authors identify three specific gaps: (1) single-task fine-tuning wastes the correlations between related ADMET properties (e.g., <a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a> relates to many ADMET endpoints), (2) using only canonical SMILES limits the model&rsquo;s ability to learn robust molecular features, and (3) no prior work had combined pretraining, multitask learning, and SMILES enumeration into a unified framework.</p>
<h2 id="three-strategies-combined-pretraining-multitask-learning-and-smiles-enumeration">Three Strategies Combined: Pretraining, Multitask Learning, and SMILES Enumeration</h2>
<p>The core innovation of MTL-BERT is the synergistic combination of three strategies in a single pipeline.</p>
<h3 id="masked-smiles-pretraining">Masked SMILES Pretraining</h3>
<p>Following the BERT paradigm, MTL-BERT pretrains on 1.7 million unlabeled molecules from ChEMBL using a masked token recovery task. For each SMILES string, 15% of tokens are randomly selected: 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% remain unchanged. The loss is computed only at masked positions. Unlike the original BERT, MTL-BERT omits the next-sentence prediction task since there is no sequential relationship between SMILES strings (following the RoBERTa finding that this task is unnecessary).</p>
<p>SMILES strings are tokenized with a regular expression that captures multi-character tokens (e.g., Si, Br, Cl) and common SMILES syntax. The model uses positional encoding to capture token order.</p>
<h3 id="transformer-architecture">Transformer Architecture</h3>
<p>The model uses a standard Transformer encoder with multihead self-attention. The scaled dot-product attention computes:</p>
<p>$$\mathbf{O}_h = \text{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}}\right) \mathbf{V}_h$$</p>
<p>where $\mathbf{Q}_h$, $\mathbf{K}_h$, and $\mathbf{V}_h$ are the query, key, and value matrices for head $h$, and $\sqrt{d_k}$ is a scaling factor. The outputs from all heads are concatenated and projected. Each attention sublayer is followed by a position-wise feedforward network with GELU activation, layer normalization, and residual connections.</p>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
          <th>Fine-tuning Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MTL-BERT_SMALL</td>
          <td>4</td>
          <td>4</td>
          <td>128</td>
          <td>512</td>
          <td>0.931</td>
          <td>0.826</td>
      </tr>
      <tr>
          <td>MTL-BERT_MEDIUM</td>
          <td>8</td>
          <td>8</td>
          <td>256</td>
          <td>1,024</td>
          <td>0.962</td>
          <td>0.852</td>
      </tr>
      <tr>
          <td>MTL-BERT_LARGE</td>
          <td>12</td>
          <td>12</td>
          <td>576</td>
          <td>2,304</td>
          <td>0.974</td>
          <td>0.848</td>
      </tr>
  </tbody>
</table>
<p>The medium model was selected for its best fine-tuning performance with lower computational cost, despite the large model achieving higher pretraining recovery accuracy. The slight performance drop for the large model suggests mild overfitting.</p>
<h3 id="multitask-fine-tuning-with-task-tokens">Multitask Fine-tuning with Task Tokens</h3>
<p>During fine-tuning, task tokens ([T0], [T1], &hellip;) are prepended to each input SMILES string. The Transformer output at each task token position is passed through a task-specific two-layer feedforward network for the corresponding prediction task. An attention mask prevents direct information exchange between task tokens, allowing each task to learn directly from SMILES tokens without interference. This design also reduces the discrepancy between pretraining (no task tokens visible) and fine-tuning.</p>
<p>Cross-entropy loss is used for classification tasks and mean squared error for regression tasks. The total multitask loss is a simple sum of per-task losses without learned weighting.</p>
<h3 id="smiles-enumeration-as-data-augmentation">SMILES Enumeration as Data Augmentation</h3>
<p>A molecule can be represented by multiple valid SMILES strings by varying starting atoms and traversal orders. MTL-BERT applies SMILES enumeration at all three stages:</p>
<ol>
<li><strong>Pretraining</strong>: Enumerated SMILES increase diversity of the self-supervised training data.</li>
<li><strong>Fine-tuning</strong>: Each dataset is augmented 20x with random SMILES variants, increasing data diversity and helping the model learn position-invariant features.</li>
<li><strong>Inference</strong>: Multiple SMILES are generated per test molecule, predictions are fused (averaged) for a more robust final prediction.</li>
</ol>
<p>The 20x augmentation factor was chosen based on prior work showing diminishing returns beyond this level while significantly increasing computational cost.</p>
<h2 id="experimental-evaluation-across-60-datasets">Experimental Evaluation Across 60 Datasets</h2>
<h3 id="setup">Setup</h3>
<p>MTL-BERT was evaluated on 60 datasets (44 classification, 16 regression) covering ADMET properties and common molecular benchmarks. Datasets were sourced from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>. Each dataset was split 8:1:1 (train/validation/test), and experiments were repeated 10 times with random splits, reporting mean and standard deviation.</p>
<p>Classification tasks were evaluated with <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> and accuracy; regression tasks with $R^2$ and RMSE.</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ul>
<li><strong>ECFP4-XGBoost</strong>: Extended-connectivity fingerprints (diameter 4) with gradient boosting</li>
<li><strong>Graph Attention Network (GAT)</strong></li>
<li><strong>Graph Convolutional Network (GCN)</strong></li>
<li><strong>AttentiveFP</strong>: A GNN with attention for molecular property prediction</li>
<li><strong>CDDD</strong>: Continuous and data-driven descriptors from a pretrained RNN auto-encoder</li>
</ul>
<h3 id="ablation-study">Ablation Study</h3>
<p>Three model variants were compared to isolate contributions:</p>
<ul>
<li><strong>MTL-BERT</strong>: Full model (pretraining + multitask + SMILES enumeration)</li>
<li><strong>STL-BERT</strong>: Single-task fine-tuning with SMILES enumeration (no multitask)</li>
<li><strong>Cano-BERT</strong>: Canonical SMILES only, single-task fine-tuning (equivalent to SMILES-BERT)</li>
</ul>
<p>Cano-BERT showed more than 10% degradation on several datasets (CL, Fu, LC50DM) compared to STL-BERT, demonstrating the importance of SMILES enumeration. MTL-BERT outperformed STL-BERT on most datasets, with improvements exceeding 5% on $F_{20\%}$, SR-ARE, and SR-ATAD5, confirming that multitask learning provides additional benefit on top of enumeration.</p>
<h3 id="results-vs-baselines">Results vs. Baselines</h3>
<p>MTL-BERT outperformed all baselines on nearly all 60 datasets. Specific findings:</p>
<ul>
<li>ECFP4-XGBoost performed inconsistently, doing well on some tasks (e.g., $F_{30\%}$, BACE, CL) but poorly on others, reflecting the limitation of fixed-length fingerprint representations.</li>
<li>GNNs generally improved over fingerprints but still suffered from data scarcity, falling behind ECFP4-XGBoost by more than 3% on $F_{30\%}$, Carcinogenicity, CL, and VD.</li>
<li>MTL-BERT surpassed all baselines except on CYP2C19-sub and BACE (by less than 1.1%).</li>
<li>On 14 tasks (NR-ER, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, Bioconcentration Factor, Fu, LC50FM, Lipophilicity, CL, PPB, VD, LC50DM), MTL-BERT exceeded the best baseline by more than 5-10%.</li>
<li>Improvements were statistically significant at the 95% confidence level (paired t-test, $P \leq 0.001$).</li>
</ul>
<h3 id="representation-analysis">Representation Analysis</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pretrained token embeddings (from 1,000 randomly selected molecules, approximately 35,000 tokens) showed that:</p>
<ul>
<li>Tokens of the same type cluster together (capturing atomic type information).</li>
<li>Within type clusters, sub-groups correspond to different chemical environments (e.g., oxygen atoms in nitrate groups vs. carbonyl groups).</li>
<li>Nearby embeddings share similar molecular neighborhood environments.</li>
</ul>
<h3 id="attention-based-interpretability">Attention-based Interpretability</h3>
<p>The model&rsquo;s attention weights provide interpretability for predictions:</p>
<ul>
<li>For a solubility task (LogS/LogD), attention concentrated on polar groups, which are known determinants of aqueous solubility.</li>
<li>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> (mutagenicity), attention focused on <a href="https://en.wikipedia.org/wiki/Azide">azide</a>, nitrosamide, <a href="https://en.wikipedia.org/wiki/Acyl_chloride">acylchloride</a>, and nitrite groups, which are known mutagenic structural alerts.</li>
</ul>
<h2 id="performance-gains-from-combined-strategies-with-interpretable-attention">Performance Gains from Combined Strategies with Interpretable Attention</h2>
<p>MTL-BERT demonstrates that the combination of pretraining, multitask learning, and SMILES enumeration is more effective than any individual strategy for molecular property prediction. The ablation study provides clear evidence for the additive benefit of each component.</p>
<p>Key strengths include the breadth of evaluation (60 datasets covering diverse ADMET endpoints), the consistent improvement over multiple baseline types (fingerprints, GNNs, pretrained representations), and the interpretable attention mechanism that highlights chemically meaningful substructures.</p>
<p>Limitations to note: the simple sum of multitask losses (no learned task weighting) may not be optimal when tasks have very different scales or when some tasks are unrelated. The authors observe slight degradation on a few datasets (AMES, CYP1A2-Sub, FreeSolv), suggesting negative transfer in those cases. The 20x SMILES enumeration significantly increases computational cost during fine-tuning and inference. The paper does not report wall-clock training times or GPU hours, making it difficult to assess the practical cost of the enumeration strategy. Hardware details are not specified beyond acknowledgment of the High-Performance Computing Center at Central South University.</p>
<p>The hierarchical clustering of task representations reveals meaningful task groupings (e.g., LogD and LogP cluster together due to their shared relationship with water solubility), supporting the premise that multitask learning captures cross-task correlations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>1.7M molecules</td>
          <td>Unlabeled SMILES; 10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning/Evaluation</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>60 datasets (44 classification, 16 regression)</td>
          <td>8:1:1 train/val/test split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Masked token prediction (15% masking rate: 80% [MASK], 10% random, 10% unchanged). Adam optimizer, learning rate 1e-4, batch size 512, 50 epochs.</li>
<li><strong>Fine-tuning</strong>: Adam optimizer, learning rate 5e-5, batch size 64, dropout 0.1. Cross-entropy for classification, MSE for regression. Early stopping with patience 20, max 200 epochs.</li>
<li><strong>SMILES enumeration</strong>: 20x augmentation. Repeated search up to 100 times if enumerated SMILES is identical to a previous one.</li>
<li><strong>Inference fusion</strong>: Predictions from multiple enumerated SMILES are averaged.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>MTL-BERT_MEDIUM (selected model): 8 layers, 8 attention heads, 256 embedding size, 1,024 FFN size</li>
<li>Pretraining recovery accuracy: 0.962</li>
<li>1,000 task tokens pre-allocated for future tasks</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Classification</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Regression</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Secondary metric</td>
      </tr>
  </tbody>
</table>
<p>All experiments repeated 10 times with random splits; mean and standard deviation reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The authors acknowledge the High-Performance Computing Center of Central South University.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/MTL-BERT">MTL-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Pretraining data source</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Fine-tuning benchmark</td>
      </tr>
      <tr>
          <td><a href="https://admetmesh.scbdd.com/">ADMETlab</a></td>
          <td>Dataset</td>
          <td>Free for academic use</td>
          <td>ADMET property datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yi, J.-C., Zeng, X.-X., Yang, C.-Q., Lu, A.-P., Hou, T.-J., &amp; Cao, D.-S. (2022). Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. <em>Research</em>, 2022, Article 0004. <a href="https://doi.org/10.34133/research.0004">https://doi.org/10.34133/research.0004</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2022mtlbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yi, Jia-Cai and Zeng, Xiang-Xiang and Yang, Can-Qun and Lu, Ai-Ping and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{Article 0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.34133/research.0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science (AAAS)}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MG-BERT: Graph BERT for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/mg-bert-molecular-graph-bert/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/mg-bert-molecular-graph-bert/</guid><description>MG-BERT integrates graph neural network message passing into BERT with masked atom pretraining on 1.7M molecules for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-aware-bert-for-molecular-property-prediction">A Graph-Aware BERT for Molecular Property Prediction</h2>
<p>MG-BERT is a <strong>Method</strong> paper that adapts the BERT pretraining paradigm from NLP to molecular graphs. The primary contribution is a modified Transformer architecture that replaces global self-attention with bond-based local attention, allowing atoms to exchange information only through chemical bonds. This creates a deep message-passing network that avoids the oversmoothing problem of conventional graph neural networks (GNNs). Combined with a masked atom prediction pretraining strategy on 1.7 million unlabeled molecules from ChEMBL, MG-BERT learns context-sensitive atomic representations that transfer effectively to downstream property prediction tasks.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p><a href="/notes/chemistry/molecular-design/property-prediction/">Molecular property prediction</a> is central to drug discovery, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoints. While deep learning has advanced many domains, molecular property prediction faces a persistent challenge: labeled data scarcity. ADMET measurements require expensive, time-consuming experiments, and typical datasets contain only hundreds to thousands of examples.</p>
<p>Prior approaches fall into three categories, each with limitations:</p>
<ol>
<li><strong>Feature engineering</strong> (molecular fingerprints, descriptors): Requires expert design, suffers from low scalability, and fixed representations cannot be optimized for specific tasks.</li>
<li><strong>SMILES-based deep learning</strong> (CNNs, LSTMs, Transformers on SMILES strings): Must learn to parse molecular information from complex string syntax, increasing learning difficulty. Autoencoder-based methods (e.g., <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a>) learn fixed representations that cannot be fine-tuned.</li>
<li><strong>Graph neural networks</strong> (GAT, GCN): Can learn directly from molecular topology, but are limited to 2-3 layers due to oversmoothing, restricting their capacity to capture deep-level patterns.</li>
</ol>
<p>The BERT model from NLP demonstrated that self-supervised pretraining on large unlabeled corpora followed by fine-tuning on small labeled datasets can substantially improve downstream performance. <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> applied this idea to SMILES strings directly, but suffered from interpretability issues due to auxiliary characters in the SMILES syntax. MG-BERT addresses these limitations by operating directly on molecular graphs.</p>
<h2 id="bond-based-local-attention-and-masked-atom-pretraining">Bond-Based Local Attention and Masked Atom Pretraining</h2>
<p>The core innovation of MG-BERT has two components: a modified Transformer architecture for molecular graphs and a self-supervised pretraining strategy.</p>
<h3 id="architecture-modifications">Architecture Modifications</h3>
<p>The original BERT model uses three components: an embedding layer, Transformer encoder layers, and a task-specific output layer. MG-BERT makes three key modifications:</p>
<ol>
<li>
<p><strong>Atom embeddings replace word embeddings.</strong> The dictionary contains 16 tokens: 13 common atom types ([H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se]), plus [UNK] for rare atoms, [MASK] for pretraining, and [GLOBAL] for graph-level readout.</p>
</li>
<li>
<p><strong>No positional encoding.</strong> Unlike sequential text, atoms in a molecular graph have no inherent ordering, so positional embeddings are removed.</p>
</li>
<li>
<p><strong>Local attention replaces global attention.</strong> The adjacency matrix of the molecular graph is used as a visibility matrix to modulate the attention scores. Each atom can only attend to atoms connected by chemical bonds. Formally, the attention is constrained so that:</p>
</li>
</ol>
<p>$$A&rsquo;_{ij} = \begin{cases} A_{ij} &amp; \text{if bond exists between } i \text{ and } j \\ -\infty &amp; \text{otherwise} \end{cases}$$</p>
<p>where $A_{ij}$ is the standard scaled dot-product attention score. This local message passing makes MG-BERT a variant of GNN, but one that can stack many layers (6 in the medium configuration) without oversmoothing, thanks to the residual connections inherited from the Transformer architecture.</p>
<ol start="4">
<li><strong>Supernode for graph-level readout.</strong> A [GLOBAL] supernode is added to each molecular graph, connected to all atoms. This node aggregates information from the entire molecule and serves as the molecular representation for downstream prediction.</li>
</ol>
<h3 id="masked-atom-prediction">Masked Atom Prediction</h3>
<p>The pretraining strategy mirrors BERT&rsquo;s masked language model but operates on atoms:</p>
<ul>
<li>15% of atoms in each molecule are randomly selected (at least one atom per molecule)</li>
<li>Of selected atoms: 80% are replaced with [MASK], 10% are randomly replaced with another atom type, and 10% remain unchanged</li>
<li>The model is trained to predict the original atom type at masked positions</li>
<li>Loss is computed only at masked positions</li>
</ul>
<h3 id="model-configurations">Model Configurations</h3>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MG-BERT Small</td>
          <td>3</td>
          <td>2</td>
          <td>128</td>
          <td>256</td>
          <td>95.27%</td>
      </tr>
      <tr>
          <td>MG-BERT Medium</td>
          <td>6</td>
          <td>4</td>
          <td>256</td>
          <td>512</td>
          <td>98.31%</td>
      </tr>
      <tr>
          <td>MG-BERT Large</td>
          <td>12</td>
          <td>8</td>
          <td>576</td>
          <td>1152</td>
          <td>98.35%</td>
      </tr>
  </tbody>
</table>
<p>The medium configuration was selected for all experiments because it achieved the best downstream performance, despite the large model having slightly higher pretraining recovery accuracy. The authors attribute this to overfitting risk with the larger model.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MG-BERT was pretrained on 1.7 million compounds randomly selected from ChEMBL, with 10% held out for evaluation (1.53M training molecules). Molecules were converted to 2D undirected graphs using RDKit, with hydrogen atoms explicitly included. The model was pretrained for 10 epochs using Adam with learning rate 1e-4 and batch size 256.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<p>Sixteen datasets covering ADMET endpoints and common molecular properties were collected from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Dataset</th>
          <th>Category</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Regression</td>
          <td>Caco2</td>
          <td>Absorption</td>
          <td>979</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logD</td>
          <td>Physicochemical</td>
          <td>10,354</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logS</td>
          <td>Physicochemical</td>
          <td>5,045</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PPB</td>
          <td>Distribution</td>
          <td>1,480</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>tox</td>
          <td>Toxicity</td>
          <td>7,295</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>Physicochemical</td>
          <td>1,128</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>Physicochemical</td>
          <td>642</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipo</td>
          <td>Physicochemical</td>
          <td>4,200</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Ames</td>
          <td>Toxicity</td>
          <td>6,719</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBB</td>
          <td>Distribution</td>
          <td>1,855</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>FDAMDD</td>
          <td>Toxicity</td>
          <td>795</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>H_HT</td>
          <td>Toxicity</td>
          <td>2,170</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_inh</td>
          <td>Absorption</td>
          <td>2,125</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_sub</td>
          <td>Absorption</td>
          <td>1,210</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>Biophysics</td>
          <td>1,513</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
      </tr>
  </tbody>
</table>
<p>Datasets were split 8:1:1 (train:validation:test) with stratified sampling by SMILES length. Each experiment was repeated 10 times with random splits, reporting mean and standard deviation. Regression was evaluated by R-squared, classification by ROC-AUC. Early stopping with a maximum of 100 epochs was used.</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ol>
<li><strong>ECFP4-XGBoost</strong>: Extended connectivity fingerprints (diameter 4) with gradient-boosted trees</li>
<li><strong>GAT</strong>: Graph Attention Network</li>
<li><strong>GCN</strong>: Graph Convolutional Network</li>
<li><strong>CDDD</strong>: Continuous and Data-Driven Descriptors (pretrained RNN encoder on SMILES with a fully connected network)</li>
<li><strong>SMILES-BERT</strong>: Original BERT applied directly to SMILES strings</li>
</ol>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Two ablation studies were conducted:</p>
<ol>
<li><strong>Pretraining effectiveness</strong>: Comparing pretrained vs. non-pretrained MG-BERT under identical hyperparameters</li>
<li><strong>Hydrogen atoms</strong>: Comparing MG-BERT with and without explicit hydrogen atoms in the molecular graph</li>
</ol>
<h2 id="consistent-improvements-across-admet-benchmarks">Consistent Improvements Across ADMET Benchmarks</h2>
<h3 id="main-results">Main Results</h3>
<p>MG-BERT consistently outperformed all baselines across all 16 datasets. Key results on the 11 ADMET datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ECFP4-XGBoost</th>
          <th>GAT</th>
          <th>GCN</th>
          <th>CDDD</th>
          <th>SMILES-BERT</th>
          <th>MG-BERT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caco2 (R2)</td>
          <td>61.41</td>
          <td>69.16</td>
          <td>67.15</td>
          <td>73.42</td>
          <td>72.39</td>
          <td><strong>74.68</strong></td>
      </tr>
      <tr>
          <td>logD (R2)</td>
          <td>70.84</td>
          <td>84.62</td>
          <td>86.22</td>
          <td>85.85</td>
          <td>86.31</td>
          <td><strong>87.46</strong></td>
      </tr>
      <tr>
          <td>logS (R2)</td>
          <td>73.73</td>
          <td>84.06</td>
          <td>83.47</td>
          <td>84.01</td>
          <td>85.20</td>
          <td><strong>87.66</strong></td>
      </tr>
      <tr>
          <td>PPB (R2)</td>
          <td>55.11</td>
          <td>59.96</td>
          <td>57.34</td>
          <td>54.12</td>
          <td>62.37</td>
          <td><strong>65.94</strong></td>
      </tr>
      <tr>
          <td>Ames (AUC)</td>
          <td>87.21</td>
          <td>86.38</td>
          <td>87.04</td>
          <td>86.82</td>
          <td>87.69</td>
          <td><strong>89.33</strong></td>
      </tr>
      <tr>
          <td>BBB (AUC)</td>
          <td>94.62</td>
          <td>93.03</td>
          <td>92.67</td>
          <td>94.44</td>
          <td>94.02</td>
          <td><strong>95.41</strong></td>
      </tr>
      <tr>
          <td>BBBP (AUC)</td>
          <td>89.16</td>
          <td>90.33</td>
          <td>90.74</td>
          <td>91.12</td>
          <td>91.32</td>
          <td><strong>92.08</strong></td>
      </tr>
  </tbody>
</table>
<p>The overall improvement across all datasets was 28.1% (7.02% on classification, 21.28% on regression). Improvements were statistically significant at the 95% confidence level (paired t-test, P &lt;= 0.001).</p>
<h3 id="pretraining-ablation">Pretraining Ablation</h3>
<p>Pretraining improved performance by more than 2% on all datasets. The benefit was largest for small datasets: Caco2 improved by approximately 10 percentage points (64.79 to 74.68 R2), and FDAMDD improved by about 7.5 points (80.76 to 88.23 AUC). This confirms that self-supervised pretraining effectively addresses the labeled data scarcity problem.</p>
<h3 id="hydrogen-atom-ablation">Hydrogen Atom Ablation</h3>
<p>Including explicit hydrogen atoms improved pretraining recovery accuracy from 92.25% to 98.31% and consistently improved downstream performance. The authors provide an intuitive explanation: hydrogen atoms help determine bond counts for neighboring atoms, which is critical for the masked atom recovery task. They also show that removing hydrogens can make structurally distinct molecules (e.g., benzene and cyclohexane) indistinguishable at the graph level.</p>
<h3 id="interpretability-via-attention-visualization">Interpretability via Attention Visualization</h3>
<p>The authors provide two forms of interpretability analysis:</p>
<ol>
<li>
<p><strong>t-SNE visualization of atomic representations</strong>: Pretrained atomic representations cluster by atom type and, more specifically, by local chemical environment (e.g., aromatic carbons separate from aliphatic carbons, C-N bonds from C-O bonds). This demonstrates that pretraining captures neighborhood context beyond simple atom identity.</p>
</li>
<li>
<p><strong>Attention weight visualization</strong>: On the logD task, the supernode&rsquo;s attention focuses on polar groups (which govern lipophilicity). On the Ames mutagenicity task, attention concentrates on known mutagenic structural alerts (acylchloride, nitrosamide, azide groups). This provides chemically meaningful explanations for predictions.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The paper does not extensively discuss limitations, but several can be identified:</p>
<ul>
<li>The model uses only 2D molecular topology (atom types and bonds) without 3D conformational information or bond-type features</li>
<li>The atom dictionary is limited to 13 common types plus [UNK], which may lose information for molecules containing rarer elements</li>
<li>Evaluation is limited to ADMET-focused datasets; broader chemical spaces (e.g., materials, catalysts) are not tested</li>
<li>The comparison baselines do not include other graph-based pretraining methods (e.g., the contemporaneous Strategies for Pre-training Graph Neural Networks by Hu et al.)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL (random subset)</td>
          <td>1.7M molecules (1.53M train)</td>
          <td>10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>16 datasets (642-10,354 molecules)</td>
          <td>8:1:1 splits, stratified by SMILES length</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam (pretraining: lr=1e-4, batch=256; fine-tuning: lr from {1e-5, 5e-5, 1e-4}, batch from {16, 32, 64})</li>
<li><strong>Pretraining epochs</strong>: 10</li>
<li><strong>Fine-tuning</strong>: Up to 100 epochs with early stopping</li>
<li><strong>Dropout</strong>: Optimized per task in range [0.0, 0.5]</li>
<li><strong>Masking</strong>: 15% of atoms (80% [MASK], 10% random, 10% unchanged)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: MG-BERT Medium (6 layers, 4 heads, embedding size 256, FFN size 512)</li>
<li><strong>Molecule processing</strong>: RDKit for graph conversion with explicit hydrogens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>R-squared (R2)</td>
          <td>Regression</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Accuracy, RMSE</td>
          <td>Both</td>
          <td>Reported in supplementary Table S1</td>
      </tr>
  </tbody>
</table>
<p>All results averaged over 10 random splits with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements (GPU type, training time, or memory usage).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/Molecular-graph-BERT">Molecular-graph-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation; last code push August 2021</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yang, Z.-J., Wu, Z.-X., Yi, J.-C., Hsieh, C.-Y., Hou, T.-J., &amp; Cao, D.-S. (2021). MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. <em>Briefings in Bioinformatics</em>, 22(6), bbab152. <a href="https://doi.org/10.1093/bib/bbab152">https://doi.org/10.1093/bib/bbab152</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2021mgbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{MG-BERT}: leveraging unsupervised atomic representation learning for molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yang, Zhi-Jiang and Wu, Zhen-Xing and Yi, Jia-Cai and Hsieh, Chang-Yu and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbab152}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbab152}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MAT: Graph-Augmented Transformer for Molecules (2020)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/molecule-attention-transformer/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/molecule-attention-transformer/</guid><description>MAT augments the Transformer self-attention mechanism with inter-atomic distances and molecular graph adjacency for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-augmented-transformer-for-molecular-property-prediction">A Graph-Augmented Transformer for Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that proposes the Molecule Attention Transformer (MAT), a Transformer-based architecture adapted for molecular property prediction. The primary contribution is a modified self-attention mechanism that incorporates inter-atomic distances and molecular graph structure alongside the standard query-key attention. Combined with self-supervised pretraining on 2 million molecules from ZINC15, MAT achieves competitive performance across seven diverse molecular property prediction tasks while requiring minimal hyperparameter tuning.</p>
<h2 id="challenges-in-deep-learning-for-molecular-properties">Challenges in Deep Learning for Molecular Properties</h2>
<p>Predicting molecular properties is central to drug discovery and materials design, yet deep neural networks have struggled to consistently outperform shallow methods like random forests and SVMs on these tasks. Wu et al. (2018) demonstrated through the MoleculeNet benchmark that graph neural networks do not reliably beat classical models. Two recurring problems compound this:</p>
<ol>
<li><strong>Underfitting</strong>: Graph neural networks tend to underfit training data, with performance failing to scale with model complexity (Ishiguro et al., 2019).</li>
<li><strong>Hyperparameter sensitivity</strong>: Deep models for molecule property prediction require extensive hyperparameter search (often 500+ configurations) to achieve competitive results, making them impractical for many practitioners.</li>
</ol>
<p>Concurrent work explored using vanilla Transformers on SMILES string representations of molecules (Honda et al., 2019; Wang et al., 2019), but these approaches discard the explicit structural information encoded in molecular graphs and 3D conformations. The motivation for MAT is to combine the flexibility of the Transformer architecture with domain-specific inductive biases from molecular structure.</p>
<h2 id="molecule-self-attention-combining-attention-distance-and-graph-structure">Molecule Self-Attention: Combining Attention, Distance, and Graph Structure</h2>
<p>The core innovation is the Molecule Self-Attention layer, which replaces standard Transformer self-attention. In a standard Transformer, head $i$ computes:</p>
<p>$$
\mathcal{A}^{(i)} = \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) \mathbf{V}_{i}
$$</p>
<p>MAT augments this with two additional information sources. Let $\mathbf{A} \in {0, 1}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the molecular graph adjacency matrix and $\mathbf{D} \in \mathbb{R}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the inter-atomic distance matrix. The modified attention becomes:</p>
<p>$$
\mathcal{A}^{(i)} = \left(\lambda_{a} \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) + \lambda_{d}, g(\mathbf{D}) + \lambda_{g}, \mathbf{A}\right) \mathbf{V}_{i}
$$</p>
<p>where $\lambda_{a}$, $\lambda_{d}$, and $\lambda_{g}$ are scalar hyperparameters weighting each component, and $g$ is either a row-wise softmax or an element-wise exponential decay $g(d) = \exp(-d)$.</p>
<p>Key architectural details:</p>
<ul>
<li><strong>Atom embedding</strong>: Each atom is represented as a 26-dimensional vector encoding atomic identity (one-hot over B, N, C, O, F, P, S, Cl, Br, I, dummy, other), number of heavy neighbors, number of hydrogens, formal charge, ring membership, and aromaticity.</li>
<li><strong>Dummy node</strong>: An artificial disconnected node (distance $10^{6}$ from all atoms) is added to each molecule, allowing the model to &ldquo;skip&rdquo; attention heads when no relevant pattern exists, similar to how BERT uses the separation token.</li>
<li><strong>3D conformers</strong>: Distance matrices are computed from RDKit-generated 3D conformers using the Universal Force Field (UFF).</li>
<li><strong>Pretraining</strong>: Node-level masked atom prediction on 2 million ZINC15 molecules (following Hu et al., 2019), where 15% of atom features are masked and the model predicts them.</li>
</ul>
<h2 id="benchmark-evaluation-and-ablation-studies">Benchmark Evaluation and Ablation Studies</h2>
<h3 id="experimental-setup">Experimental setup</h3>
<p>MAT is evaluated on seven molecular property prediction datasets spanning regression and classification:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Metric</th>
          <th>Split</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FreeSolv</td>
          <td>Regression (hydration free energy)</td>
          <td>642</td>
          <td>RMSE</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Regression (log solubility)</td>
          <td>1,128</td>
          <td>RMSE</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Classification (BBB permeability)</td>
          <td>2,039</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>Estrogen-alpha</td>
          <td>Classification (receptor activity)</td>
          <td>2,398</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>Estrogen-beta</td>
          <td>Classification (receptor activity)</td>
          <td>1,961</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>MetStab-high</td>
          <td>Classification (metabolic stability)</td>
          <td>2,127</td>
          <td>ROC AUC</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>MetStab-low</td>
          <td>Classification (metabolic stability)</td>
          <td>2,127</td>
          <td>ROC AUC</td>
          <td>Random</td>
      </tr>
  </tbody>
</table>
<p>Baselines include GCN, Weave, EAGCN, Random Forest (RF), and SVM. Each model receives the same hyperparameter search budget (150 or 500 evaluations). Results are averaged over 6 random train/validation/test splits.</p>
<h3 id="main-results">Main results</h3>
<p>MAT achieves the best average rank across all seven tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Avg. Rank (500 budget)</th>
          <th>Avg. Rank (150 budget)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAT</td>
          <td>2.42</td>
          <td>2.71</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>3.14</td>
          <td>3.14</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>3.57</td>
          <td>3.28</td>
      </tr>
      <tr>
          <td>GCN</td>
          <td>3.57</td>
          <td>3.71</td>
      </tr>
      <tr>
          <td>Weave</td>
          <td>3.71</td>
          <td>3.57</td>
      </tr>
      <tr>
          <td>EAGCN</td>
          <td>4.14</td>
          <td>4.14</td>
      </tr>
  </tbody>
</table>
<p>With self-supervised pretraining, Pretrained MAT achieves an average rank of 1.57, outperforming both Pretrained EAGCN (4.0) and SMILES Transformer (4.29). Pretrained MAT requires tuning only the learning rate (7 values tested), compared to 500 hyperparameter combinations for the non-pretrained models.</p>
<h3 id="ablation-results">Ablation results</h3>
<p>Ablation studies on BBBP, ESOL, and FreeSolv reveal:</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>BBBP (AUC)</th>
          <th>ESOL (RMSE)</th>
          <th>FreeSolv (RMSE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAT (full)</td>
          <td>.723</td>
          <td>.286</td>
          <td>.250</td>
      </tr>
      <tr>
          <td>- Graph</td>
          <td>.716</td>
          <td>.316</td>
          <td>.276</td>
      </tr>
      <tr>
          <td>- Distance</td>
          <td>.729</td>
          <td>.281</td>
          <td>.281</td>
      </tr>
      <tr>
          <td>- Attention</td>
          <td>.692</td>
          <td>.306</td>
          <td>.329</td>
      </tr>
      <tr>
          <td>- Dummy node</td>
          <td>.714</td>
          <td>.317</td>
          <td>.249</td>
      </tr>
      <tr>
          <td>+ Edge features</td>
          <td>.683</td>
          <td>.314</td>
          <td>.358</td>
      </tr>
  </tbody>
</table>
<p>Removing any single component degrades performance on at least one task, supporting the value of combining all three information sources. Adding edge features does not help, suggesting the adjacency and distance matrices already capture sufficient bond-level information.</p>
<h3 id="interpretability-analysis">Interpretability analysis</h3>
<p>Individual attention heads in the first layer learn chemically meaningful functions. Six heads were identified that focus on specific chemical patterns: 2-neighbored aromatic carbons, sulfur atoms, non-ring nitrogens, carbonyl oxygens, 3-neighbored aromatic atoms (substitution positions), and aromatic ring nitrogens. Statistical validation using Kruskal-Wallis tests confirmed that atoms matching these SMARTS patterns receive significantly higher attention weights ($p &lt; 0.001$ for all patterns).</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>MAT demonstrates that augmenting Transformer self-attention with molecular graph structure and 3D distance information produces a model that performs consistently well across diverse property prediction tasks. The key practical finding is that self-supervised pretraining dramatically reduces the hyperparameter tuning burden: Pretrained MAT matches or exceeds the performance of extensively tuned models while requiring only learning rate selection.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Fingerprint-based models still win on some tasks</strong>: RF and SVM with extended-connectivity fingerprints outperform MAT on metabolic stability and Estrogen-beta tasks, suggesting that incorporating fingerprint representations could improve MAT further.</li>
<li><strong>Single conformer</strong>: Only one pre-computed 3D conformer is used per molecule. More sophisticated conformer sampling or ensemble strategies were not explored.</li>
<li><strong>Limited pretraining exploration</strong>: Only the masked atom prediction task from Hu et al. (2019) was used. The authors note that exploring additional pretraining objectives is a promising direction.</li>
<li><strong>Scalability</strong>: The pretrained model uses 1024-dimensional embeddings with 8 layers and 16 attention heads, fitting the largest model that fits in GPU memory.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15</td>
          <td>2M molecules</td>
          <td>Sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Log solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Blood-brain barrier classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Estrogen-alpha/beta</td>
          <td>2,398 / 1,961</td>
          <td>Receptor activity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MetStab-high/low</td>
          <td>2,127 each</td>
          <td>Metabolic stability classification</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with Noam learning rate scheduler (warmup then inverse square root decay)</li>
<li>Pretraining: 8 epochs, learning rate 0.001, batch size 256, binary cross-entropy loss</li>
<li>Fine-tuning: 100 epochs, batch size 32, learning rate selected from {1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6}</li>
<li>Distance kernel: exponential decay $g(d) = \exp(-d)$ for pretrained model</li>
<li>Lambda weights: $\lambda_{a} = \lambda_{d} = 0.33$ for pretrained model</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Pretrained MAT: 1024-dim embeddings, 8 layers, 16 attention heads, 1 feed-forward layer per block</li>
<li>Dropout: 0.0, weight decay: 0.0 for pretrained model</li>
<li>Atom featurization: 26-dimensional one-hot encoding (Table 1 in paper)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: RMSE (FreeSolv, ESOL)</li>
<li>Classification: ROC AUC (BBBP, Estrogen-alpha/beta, MetStab-high/low)</li>
<li>All experiments repeated 6 times with different train/validation/test splits</li>
<li>Scaffold split for BBBP, Estrogen, random split for others</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify exact hardware details. The pretrained model is described as &ldquo;the largest model that still fits the GPU memory.&rdquo;</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gmum/MAT">gmum/MAT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Maziarka, Ł., Danel, T., Mucha, S., Rataj, K., Tabor, J., &amp; Jastrzębski, S. (2020). Molecule Attention Transformer. <em>arXiv preprint arXiv:2002.08264</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{maziarka2020molecule,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecule Attention Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Maziarka, {\L}ukasz and Danel, Tomasz and Mucha, S{\l}awomir and Rataj, Krzysztof and Tabor, Jacek and Jastrz{\k{e}}bski, Stanis{\l}aw}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2002.08264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DMP: Dual-View Molecule Pre-training (SMILES+GNN)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/</guid><description>DMP pre-trains molecular encoders using both SMILES Transformer and GNN branches with a BYOL-style dual-view consistency loss for property prediction.</description><content:encoded><![CDATA[<h2 id="a-dual-branch-pre-training-method-for-molecular-property-prediction">A Dual-Branch Pre-training Method for Molecular Property Prediction</h2>
<p>DMP (Dual-view Molecule Pre-training) is a <strong>Method</strong> paper that introduces a pre-training framework combining two complementary molecular encoders: a Transformer operating on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings and a Graph Neural Network (GNN) operating on molecular graphs. The two branches are trained jointly with masked language modeling (MLM) objectives plus a BYOL-style dual-view consistency loss. After pre-training on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> molecules, either branch (or both) can be fine-tuned for downstream tasks. The authors recommend the Transformer branch based on empirical results. DMP achieves the best reported performance on 7 of 9 <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks and 3 retrosynthesis benchmarks (at the time of the 2021 arXiv version).</p>
<h2 id="why-combine-smiles-and-graph-views-for-molecules">Why Combine SMILES and Graph Views for Molecules</h2>
<p>Prior molecule pre-training methods used either graph representations with GNNs or SMILES representations with Transformers, but not both. The authors observe that the two views are complementary: Transformers handle molecules with large atom distances (long chains) well, while GNNs handle molecules with many concatenated rings better. Neither model alone captures the full range of molecular structures effectively.</p>
<p>Existing GNN-based pre-training methods (Hu et al. 2020, MolCLR, GROVER) and SMILES-based methods (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>) each have blind spots dictated by their input representation. DMP addresses this by pre-training both views simultaneously and enforcing representation consistency between them, so each branch benefits from the structural knowledge of the other.</p>
<h2 id="dual-view-consistency-with-byol-style-training">Dual-View Consistency with BYOL-Style Training</h2>
<p>The core innovation is the dual-view consistency objective, inspired by Bootstrap Your Own Latent (BYOL). Given a molecule $M$ with SMILES representation $M_s$ and graph representation $M_g$, DMP obtains high-level features from each branch:</p>
<ul>
<li><strong>Transformer branch</strong>: A RoBERTa-base model encodes the SMILES sequence. The [CLS] token output serves as the molecule representation $f_s$.</li>
<li><strong>GNN branch</strong>: A DeeperGCN network encodes the molecular graph. Mean+max pooling over atom representations yields $f_g$.</li>
</ul>
<p>The dual-view consistency loss uses nonlinear projection heads $\psi_g, \psi_s$ and prediction heads $\rho_g, \rho_s$:</p>
<p>$$
p_g = \psi_g(f_g), \quad q_g = \rho_g(p_g); \quad p_s = \psi_s(f_s), \quad q_s = \rho_s(p_s)
$$</p>
<p>The consistency loss maximizes cross-view <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> with stop-gradient (SG) on the target:</p>
<p>$$
\ell_{\text{dual}}(\tilde{M}_g, \tilde{M}_s) = -\cos(q_s, \text{SG}(p_g)) - \cos(q_g, \text{SG}(p_s))
$$</p>
<p>where $\cos(p, q) = \frac{p^\top q}{|p|_2 |q|_2}$ and $\tilde{M}_g, \tilde{M}_s$ are the masked versions of the inputs. The stop-gradient prevents representation collapse without requiring negative samples or a momentum encoder.</p>
<p>The full training objective combines three losses:</p>
<ol>
<li><strong>MLM on Transformer</strong>: Recover masked tokens in SMILES sequences</li>
<li><strong>MLM on GNN</strong>: Recover masked atoms in molecular graphs</li>
<li><strong>Dual-view consistency</strong>: The BYOL-style loss above</li>
</ol>
<p>Both MLM objectives and the consistency loss are necessary. Ablations show that removing MLM (using only dual-view loss) degrades performance, and using two branches of the same type (two Transformers or two GNNs) is less effective than the heterogeneous Transformer+GNN combination.</p>
<h2 id="experiments-on-moleculenet-and-retrosynthesis">Experiments on MoleculeNet and Retrosynthesis</h2>
<h3 id="pre-training-setup">Pre-training Setup</h3>
<p>DMP is pre-trained on 10M molecules from PubChem (matching prior work). The Transformer branch uses RoBERTa-base (12 layers, hidden dim 768, 87M parameters). The GNN branch uses DeeperGCN (12 layers, hidden dim 384, 7.4M parameters). Combined, DMP has 104.1M parameters. Training runs for 200K iterations on 8 V100 GPUs over 3.8 days with Adam optimizer (lr = 5e-4, weight decay 0.01).</p>
<h3 id="molecular-property-prediction-moleculenet">Molecular Property Prediction (MoleculeNet)</h3>
<p>DMP is evaluated on 6 binary classification tasks (BBBP, Tox21, ClinTox, HIV, BACE, SIDER) using official DeepChem splits, and on 3 additional tasks (BBBP, SIDER, ClinTox classification + ESOL, QM7, QM8 regression) using scaffold splits from GROVER.</p>
<p>Key results on DeepChem splits (ROC-AUC %):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolCLR</th>
          <th>TF (MLM)</th>
          <th>DMP_TF</th>
          <th>DMP_TF+GNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>73.6</td>
          <td>74.9</td>
          <td><strong>78.1</strong></td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>79.8</td>
          <td>77.6</td>
          <td><strong>78.8</strong></td>
          <td>79.1</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>93.2</td>
          <td>92.9</td>
          <td><strong>95.0</strong></td>
          <td>95.6</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>80.2</td>
          <td><strong>81.0</strong></td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>89.0</td>
          <td>88.0</td>
          <td><strong>89.3</strong></td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>68.0</td>
          <td>68.4</td>
          <td><strong>69.2</strong></td>
          <td>69.8</td>
      </tr>
  </tbody>
</table>
<p>On scaffold splits (comparison with GROVER and MPG):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GROVER</th>
          <th>MPG</th>
          <th>DMP_TF</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP (AUC)</td>
          <td>0.940</td>
          <td>0.922</td>
          <td><strong>0.945</strong></td>
      </tr>
      <tr>
          <td>SIDER (AUC)</td>
          <td>0.658</td>
          <td>0.661</td>
          <td><strong>0.695</strong></td>
      </tr>
      <tr>
          <td>ClinTox (AUC)</td>
          <td>0.944</td>
          <td>0.963</td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>ESOL (RMSE)</td>
          <td>0.831</td>
          <td>0.741</td>
          <td><strong>0.700</strong></td>
      </tr>
      <tr>
          <td>QM7 (MAE)</td>
          <td>72.6</td>
          <td>-</td>
          <td><strong>69.6</strong></td>
      </tr>
      <tr>
          <td>QM8 (MAE)</td>
          <td>0.0125</td>
          <td>-</td>
          <td><strong>0.0124</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis">Retrosynthesis</h3>
<p>DMP is tested on USPTO-50K (reaction type known/unknown) and USPTO-full. Using a &ldquo;DMP fusion&rdquo; approach (fusing pre-trained representations into a Transformer encoder-decoder for <a href="/notes/chemistry/molecular-design/reaction-prediction/">retrosynthesis</a>), DMP improves top-1 accuracy by 2-3 points over the baseline Transformer across all settings:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Transformer</th>
          <th>ChemBERTa fusion</th>
          <th>DMP fusion</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-50K (unknown)</td>
          <td>42.3</td>
          <td>43.9</td>
          <td><strong>46.1</strong></td>
      </tr>
      <tr>
          <td>USPTO-50K (known)</td>
          <td>54.2</td>
          <td>56.4</td>
          <td><strong>57.5</strong></td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>42.9</td>
          <td>-</td>
          <td><strong>45.0</strong></td>
      </tr>
  </tbody>
</table>
<p>For GNN-based retrosynthesis, replacing GLN&rsquo;s GNN modules with DMP&rsquo;s pre-trained GNN branch improves top-1 accuracy from 52.5% to 54.2% (unknown type) and from 64.2% to 66.5% (known type).</p>
<h3 id="representation-quality">Representation Quality</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pre-trained representations shows that DMP produces better scaffold-based clustering than either GNN-only or Transformer-only pre-training. The <a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> improves from 3.56 (GNN) and 3.59 (Transformer) to 2.19 (DMP), indicating much tighter within-scaffold clusters.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Combining heterogeneous views (SMILES + graph) during pre-training is more effective than using two branches of the same type. TF(x2) and GNN(x2) variants show smaller gains.</li>
<li>Both MLM and dual-view consistency loss contribute. Removing MLM (dual-view only) hurts performance, especially on BBBP (71.1 vs 78.1 with both losses).</li>
<li>The Transformer branch alone is recommended for downstream tasks, as it achieves strong results without adding GNN parameters at inference time.</li>
<li>Scaling pre-training data from 10M to 100M compounds yields marginal additional improvement.</li>
</ul>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ol>
<li>Training cost is higher than single-branch methods (3.8 days vs 2.5 days for TF-only on 8 V100s), since both branches must be trained jointly.</li>
<li>A fixed branch selection strategy is used at inference time. The authors note that a meta-controller for dynamic branch selection per molecule would be preferable.</li>
<li>The GNN branch uses simple atom masking without bond deletion or subgraph removal, leaving room for stronger graph-level pre-training objectives.</li>
</ol>
<p><strong>Relation to co-training:</strong> The authors clarify that DMP differs from classical <a href="https://en.wikipedia.org/wiki/Co-training">co-training</a> (Blum and Mitchell 1998) in that it does not require conditional independence between views and produces a pre-trained model rather than additional labeled data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>10M compounds</td>
          <td>Same subset as MolCLR and ChemBERTa</td>
      </tr>
      <tr>
          <td>Pre-training (large)</td>
          <td>PubChem subset</td>
          <td>100M compounds</td>
          <td>Additional scale experiment</td>
      </tr>
      <tr>
          <td>Evaluation (classification)</td>
          <td>MoleculeNet (BBBP, Tox21, ClinTox, HIV, BACE, SIDER)</td>
          <td>1.5K-41K molecules</td>
          <td>Official DeepChem splits</td>
      </tr>
      <tr>
          <td>Evaluation (regression)</td>
          <td>MoleculeNet (ESOL, QM7, QM8)</td>
          <td>Varies</td>
          <td>Scaffold splits from GROVER</td>
      </tr>
      <tr>
          <td>Evaluation (retrosynthesis)</td>
          <td>USPTO-50K, USPTO-full</td>
          <td>50K / 950K reactions</td>
          <td>Splits from Dai et al. (2019)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Transformer branch</strong>: RoBERTa-base with MLM. SMILES tokenized using regex from Schwaller et al. (2019).</li>
<li><strong>GNN branch</strong>: DeeperGCN with 12 layers, atom masking for MLM.</li>
<li><strong>Dual-view loss</strong>: BYOL-style with 3-layer MLP projection heads and 2-layer MLP prediction heads, stop-gradient on targets.</li>
<li><strong>Optimizer</strong>: Adam (lr=5e-4, beta1=0.9, beta2=0.98, epsilon=1e-6), weight decay 0.01, 10K warmup steps, linear decay.</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transformer branch</td>
          <td>RoBERTa-base (12L, 768H, 12 heads)</td>
          <td>87M</td>
      </tr>
      <tr>
          <td>GNN branch</td>
          <td>DeeperGCN (12L, 384H)</td>
          <td>7.4M</td>
      </tr>
      <tr>
          <td>DMP (total)</td>
          <td>Transformer + GNN + projection/prediction heads</td>
          <td>104.1M</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC, averaged over 3 random seeds</li>
<li>Regression: RMSE (ESOL) or MAE (QM7, QM8)</li>
<li>Retrosynthesis: Top-k exact match accuracy (k=1,3,5,10,20,50)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8 NVIDIA V100 GPUs, batch size 12288 tokens, gradient accumulation 16x</li>
<li>Pre-training time: 3.8 days (DMP), 2.5 days (TF-only), 1.7 days (GNN-only)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained model weights were identified for this paper. The paper references GLN&rsquo;s code repository (<a href="https://github.com/Hanjun-Dai/GLN">https://github.com/Hanjun-Dai/GLN</a>) for the retrosynthesis baseline but does not release DMP-specific code.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Hanjun-Dai/GLN">GLN (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Retrosynthesis baseline, not DMP code</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhu, J., Xia, Y., Wu, L., Xie, S., Zhou, W., Qin, T., Li, H., &amp; Liu, T.-Y. (2023). Dual-view Molecular Pre-training. In <em>Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</em> (pp. 3615-3627). <a href="https://doi.org/10.1145/3580305.3599317">https://doi.org/10.1145/3580305.3599317</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2023dualview,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-view Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhu, Jinhua and Xia, Yingce and Wu, Lijun and Xie, Shufang and Zhou, Wengang and Qin, Tao and Li, Houqiang and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3615--3627}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3580305.3599317}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>X-MOL: Pre-training on 1.1B Molecules for SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/</guid><description>X-MOL pre-trains a shared encoder-decoder Transformer on 1.1 billion molecules, then fine-tunes for property prediction, reaction analysis, and generation.</description><content:encoded><![CDATA[<h2 id="a-unified-molecular-pre-training-framework">A Unified Molecular Pre-training Framework</h2>
<p>X-MOL is a <strong>Method</strong> paper that introduces a large-scale pre-training framework for <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based molecular understanding. The primary contribution is a Transformer encoder-decoder model pre-trained on 1.1 billion molecules from <a href="/notes/chemistry/datasets/zinc-22/">ZINC15</a>, which is then fine-tuned across five distinct molecular analysis tasks: molecular property prediction (classification and regression), chemical reaction productivity prediction, <a href="https://en.wikipedia.org/wiki/Drug_interaction">drug-drug interaction</a> (DDI) prediction, de novo molecule generation (distribution learning and goal-directed), and molecule optimization. The paper demonstrates that a single pre-trained model can serve as a universal foundation for diverse downstream chemistry tasks.</p>
<h2 id="bridging-scale-and-understanding-in-molecular-smiles">Bridging Scale and Understanding in Molecular SMILES</h2>
<p>Prior to X-MOL, most molecular analysis tasks were investigated individually with task-specific models. SMILES-based deep learning methods existed but lacked the benefit of large-scale pre-training that had proven transformative in NLP (BERT, RoBERTa, ERNIE, XLNet, <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>). Two challenges motivated this work:</p>
<ol>
<li><strong>SMILES sacrifices structural information for simplicity.</strong> While SMILES is a convenient linear representation, it does not directly encode molecular topology, making it harder for models to learn 3D structure from string input.</li>
<li><strong>Labelled molecular data is scarce.</strong> Most benchmark datasets (<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>) contain only thousands of labelled examples, making it difficult to train large models from scratch without overfitting.</li>
</ol>
<p>The authors hypothesized that massive-scale pre-training on unlabelled SMILES could teach a model the grammar rules and implicit structural information in SMILES, providing a strong initialization for multiple downstream tasks.</p>
<h2 id="generative-pre-training-with-random-smiles">Generative Pre-training with Random SMILES</h2>
<p>The core innovation in X-MOL is a <strong>generative pre-training strategy</strong> that exploits the non-uniqueness of SMILES. A single molecule can be represented by many valid SMILES strings (<a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">random SMILES</a>), depending on the starting atom, main chain selection, and ring-opening position. X-MOL trains the model to generate a valid alternative SMILES given an input SMILES of the same molecule, forcing the model to:</p>
<ol>
<li>Reconstruct the molecular structure from the input SMILES</li>
<li>Generate a valid output SMILES following SMILES grammar rules</li>
</ol>
<p>The architecture uses a shared-parameter encoder-decoder based on the Transformer. Unlike standard encoder-decoder models (e.g., for machine translation), X-MOL shares all parameters between encoder and decoder, forcing both encoding and decoding to occur in the same semantic space. The output SMILES is fully masked during training, and only unidirectional attention is permitted within the output sequence.</p>
<p>The self-attention mechanism computes attention for each character $i$ as:</p>
<p>$$
Z_{i} = \text{SoftMax}\left(\frac{Q_{i} \cdot K^{T}}{\sqrt{D}}\right) \cdot V
$$</p>
<p>where $Q_{i}$, $K$, and $V$ are the query, key, and value matrices, and $D$ is the feature dimension. The model uses 12 attention heads to capture different relational patterns.</p>
<h3 id="model-architecture">Model Architecture</h3>
<ul>
<li>12 Transformer encoder layers</li>
<li>768-dimensional hidden units</li>
<li>12 attention heads</li>
<li>Character-level SMILES tokenization (108 chemical characters plus 5 special tokens: [PAD], [CLS], [SEP], [MASK], [UNK])</li>
<li>Characters within square brackets and double digits preceded by &ldquo;%&rdquo; are treated as single tokens</li>
</ul>
<h3 id="data-augmentation-in-pre-training">Data Augmentation in Pre-training</h3>
<p>Because a molecule has multiple valid random SMILES, the output may differ from the predefined target. To handle this, X-MOL generates multiple training samples per molecule with the same input SMILES but different output random SMILES, and places these in the same mini-batch.</p>
<h2 id="experimental-setup-across-five-tasks">Experimental Setup Across Five Tasks</h2>
<p>X-MOL is fine-tuned with task-specific strategies organized into two categories: prediction tasks and generation tasks.</p>
<h3 id="prediction-tasks">Prediction Tasks</h3>
<p>For prediction tasks, the [CLS] token&rsquo;s output representation is passed through a fully connected network to produce predictions. The input format varies by task:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input Format</th>
          <th>Loss Function</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Property prediction (classification)</td>
          <td>Single SMILES</td>
          <td>Cross-entropy</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Property prediction (regression)</td>
          <td>Single SMILES</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Reaction productivity prediction</td>
          <td>Four SMILES (reactant, additive, base, ligand)</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Two SMILES (drug pair)</td>
          <td>Cross-entropy</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecular Property Prediction (Classification):</strong> Four MoleculeNet benchmarks were used: HIV (41,127 compounds), BACE (1,513), <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBBP</a> (2,039), and ClinTox (1,484). Data were randomly split 20 times, and average ROC-AUC is reported.</p>
<p><strong>Molecular Property Prediction (Regression):</strong> Three MoleculeNet benchmarks: ESOL (1,128), FreeSolv (642), and Lipophilicity (4,200). Data augmentation with random SMILES was applied to the training set. Average RMSE over 20 random splits is reported.</p>
<p><strong>Chemical Reaction Productivity Prediction:</strong> The <a href="https://en.wikipedia.org/wiki/Cross-coupling_reaction">C-N cross-coupling</a> dataset (3,956 reactions) from Ahneman et al. was used with 10-fold cross-validation.</p>
<p><strong>DDI Prediction:</strong> The DeepDDI dataset (192,284 DDI pairs, 86 interaction types) was used as benchmark.</p>
<h3 id="generation-tasks">Generation Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Generation Source</th>
          <th>Sampling Strategy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distribution learning (DL) generation</td>
          <td>Fixed initial symbol ([CLS])</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Goal-directed (GD) generation</td>
          <td>Unfixed initial symbol</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Molecule optimization</td>
          <td>Input molecule</td>
          <td>Beam search (beam size = 4)</td>
      </tr>
  </tbody>
</table>
<p><strong>DL-based Generation:</strong> Evaluated on ZINC250K (249,456 molecules) using validity, uniqueness, and novelty.</p>
<p><strong>GD Generation:</strong> Also on ZINC250K, using QED as the goal property with target QED = 0.948 (the dataset maximum). 10,000 molecules were generated for evaluation.</p>
<p><strong>Molecule Optimization:</strong> Evaluated on ZINC250K with QED as the optimization goal. Molecular pairs were constructed by selecting pairs with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> in [0.6, 0.8], where the lower-QED molecule serves as input and the higher-QED molecule as target.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Classification (ROC-AUC, higher is better):</strong> X-MOL achieved state-of-the-art on all four datasets, outperforming both shallow learning methods and deep learning baselines including graph convolutional models.</p>
<p><strong>Regression (RMSE, lower is better):</strong> X-MOL achieved the best RMSE on ESOL, FreeSolv, and Lipophilicity.</p>
<p><strong>Reaction Productivity:</strong> X-MOL obtained an average RMSE of 0.0626, compared to the random forest baseline of 0.078.</p>
<p><strong>DDI Prediction:</strong> X-MOL achieved accuracy of 0.952, improving over DeepDDI&rsquo;s 0.924.</p>
<p><strong>DL-based Generation:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>20%</td>
          <td>99.97%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>MRNN</td>
          <td>65%</td>
          <td>99.89%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>GraphAF</td>
          <td>68%</td>
          <td>99.10%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td><strong>X-MOL</strong></td>
          <td><strong>85.28%</strong></td>
          <td><strong>99.91%</strong></td>
          <td><strong>100%</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>GD Generation:</strong> X-MOL generated all top-3 molecules with QED = 0.948, matching the dataset maximum. GraphAF reached 0.948/0.948/0.947, while JT-VAE and MRNN fell further behind.</p>
<h3 id="knowledge-embedding-ablation">Knowledge Embedding Ablation</h3>
<p>The paper tested three additional embedding strategies to inject structural information into the model:</p>
<ul>
<li><strong>Link embedding:</strong> Encodes connection information between atoms (position of the previous connected atom)</li>
<li><strong>Ring embedding:</strong> Encodes ring structure information from SMILES number pairs</li>
<li><strong>Type embedding:</strong> Categorizes characters into 9 types (atoms, bonds, structural symbols)</li>
</ul>
<p>None of these additional embeddings improved performance on the HIV or DDI tasks, whether with or without pre-training. The authors conclude that SMILES already contains sufficient information for molecular understanding and that pre-training effectively extracts this information, a finding they label &ldquo;SMILES is all you need.&rdquo;</p>
<h3 id="attention-visualization">Attention Visualization</h3>
<p>The authors provide attention heatmap analysis demonstrating that:</p>
<ul>
<li>Middle layers (e.g., layer 9) reconstruct molecular structure by correctly identifying atom connectivity and ring closures</li>
<li>Later layers abstract higher-level features for property prediction</li>
<li>In multi-input prediction tasks (reaction productivity), attention reveals which reaction components are most important (e.g., the ligand receives highest cross-attention)</li>
<li>In generation tasks, attention patterns differ between DL (self-focused), GD (source-constrained), and optimization (gradual shift from input to output)</li>
</ul>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>X-MOL demonstrates that large-scale pre-training on SMILES can produce a single model that achieves competitive or state-of-the-art performance across five distinct molecular analysis tasks. The key findings are:</p>
<ol>
<li><strong>Scale enables SMILES understanding.</strong> Pre-training on 1.1 billion molecules allows the model to learn SMILES grammar rules well enough to outperform graph-based methods on molecule generation validity.</li>
<li><strong>Unified framework.</strong> A single pre-trained backbone serves classification, regression, reaction prediction, DDI prediction, and generative tasks through different fine-tuning strategies.</li>
<li><strong>SMILES is sufficient.</strong> Additional knowledge embeddings (link, ring, type) do not improve performance, suggesting pre-training extracts the necessary structural information from SMILES alone.</li>
<li><strong>Interpretable attention.</strong> Attention visualization confirms that the model reconstructs molecular structure internally.</li>
</ol>
<p><strong>Limitations</strong> (observed):</p>
<ul>
<li>The paper reports only MoleculeNet benchmarks with relatively few datasets. No scaffold splits or temporal splits are used; all splits are random, which can overestimate performance on structurally novel compounds.</li>
<li>Comparison baselines are somewhat dated (2018-2019 era methods), and the paper does not compare against concurrent SMILES pre-training methods.</li>
<li>The molecule generation validity (85.28%) is much higher than graph baselines like GCPN (20%), but later work achieved near 100% validity with constrained SMILES grammars.</li>
<li>No code or model weights have been publicly released, limiting independent verification.</li>
<li>The paper remains a bioRxiv preprint and has not been published in a peer-reviewed venue.</li>
</ul>
<p><strong>Future directions</strong> proposed by the authors include: better pre-training strategies, extension to graph-based representations, and fine-tuning on additional downstream tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC15</td>
          <td>1.1 billion molecules</td>
          <td>Random SMILES augmentation</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV (MoleculeNet)</td>
          <td>41,127</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE (MoleculeNet)</td>
          <td>1,513</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP (MoleculeNet)</td>
          <td>2,039</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>ClinTox (MoleculeNet)</td>
          <td>1,484</td>
          <td>Two sub-datasets, averaged</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL (MoleculeNet)</td>
          <td>1,128</td>
          <td>Water solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv (MoleculeNet)</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity (MoleculeNet)</td>
          <td>4,200</td>
          <td>logD at pH 7.4</td>
      </tr>
      <tr>
          <td>Reaction</td>
          <td>C-N cross-coupling</td>
          <td>3,956</td>
          <td>From Ahneman et al. (2018)</td>
      </tr>
      <tr>
          <td>DDI</td>
          <td>DeepDDI</td>
          <td>192,284 DDI pairs</td>
          <td>86 interaction types</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>ZINC250K</td>
          <td>249,456</td>
          <td>For DL, GD, and optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Generative SMILES-to-SMILES with shared encoder-decoder Transformer</li>
<li>Fine-tuning prediction tasks: [CLS] token passed through fully connected layers</li>
<li>Fine-tuning generation tasks: Autoregressive generation with random sampling (DL, GD) or beam search (optimization)</li>
<li>Data augmentation: Random SMILES augmentation for regression tasks</li>
<li>Repeated training: 20 random splits with averaged results for classification/regression</li>
<li>10-fold cross-validation for reaction productivity</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>12-layer Transformer, 768 hidden dimensions, 12 attention heads</li>
<li>Character-level tokenization: 108 chemical characters + 5 special tokens</li>
<li>Implemented in PaddlePaddle framework</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>X-MOL</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BACE (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BBBP (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ClinTox (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ESOL (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>FreeSolv (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>Lipophilicity (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>C-N coupling</td>
          <td>RMSE</td>
          <td>0.0626</td>
          <td>0.078 (random forest)</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Accuracy</td>
          <td>0.952</td>
          <td>0.924 (DeepDDI)</td>
      </tr>
      <tr>
          <td>DL generation</td>
          <td>Validity</td>
          <td>85.28%</td>
          <td>68% (GraphAF)</td>
      </tr>
      <tr>
          <td>GD generation</td>
          <td>Top-3 QED</td>
          <td>All 0.948</td>
          <td>0.948/0.948/0.947 (GraphAF)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8/16 Tesla P40 GPUs (24 GB each), approximately 4 days</li>
<li>Data pre-processing: Over 1,000 CPUs with Hadoop</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No code, model weights, or pre-trained checkpoints have been publicly released. The model was implemented in Baidu&rsquo;s PaddlePaddle framework, but no repository is available.</p>
<p><strong>Reproducibility status: Closed.</strong> While the datasets are all publicly available (ZINC15, MoleculeNet, ZINC250K, DeepDDI, C-N coupling), the model implementation, pre-trained weights, and fine-tuning code are not released. The computational requirements (1,000+ CPUs for data processing, 8-16 GPUs for 4 days of pre-training) are substantial.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xue, D., Zhang, H., Xiao, D., Gong, Y., Chuai, G., Sun, Y., Tian, H., Wu, H., Li, Y., &amp; Liu, Q. (2020). X-MOL: Large-scale pre-training for molecular understanding and diverse molecular analysis. <em>bioRxiv</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xue2020xmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xue, Dongyu and Zhang, Han and Xiao, Dongling and Gong, Yukang and Chuai, Guohui and Sun, Yu and Tian, Hao and Wu, Hua and Li, Yukun and Liu, Qi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2020.12.23.424259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Cold Spring Harbor Laboratory}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformers for Molecular Property Prediction Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</guid><description>A systematic review of 16 transformer models for molecular property prediction, analyzing architecture, data, tokenization, and benchmarking gaps.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformers-for-molecular-property-prediction">A Systematization of Transformers for Molecular Property Prediction</h2>
<p>This is a <strong>Systematization</strong> paper. Sultan et al. provide the first comprehensive, structured review of sequence-based transformer models applied to molecular property prediction (MPP). The review catalogs 16 models published between 2019 and 2023, organizes them by architecture type (encoder-decoder, encoder-only, decoder-only), and systematically examines seven key design decisions that arise when building a transformer for MPP. The paper&rsquo;s primary contribution is identifying gaps in current evaluation practices and articulating what standardization the field needs for meaningful progress.</p>
<h2 id="the-problem-inconsistent-evaluation-hinders-progress">The Problem: Inconsistent Evaluation Hinders Progress</h2>
<p>Molecular property prediction is essential for drug discovery, crop protection, and environmental science. Deep learning approaches, including transformers, have been increasingly applied to this task by learning molecular representations from string notations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. However, the field faces several challenges:</p>
<ol>
<li><strong>Small labeled datasets</strong>: Labeled molecular property datasets typically contain only hundreds or thousands of molecules, making supervised learning alone insufficient.</li>
<li><strong>No standardized evaluation protocol</strong>: Different papers use different data splits (scaffold vs. random), different splitting implementations, different numbers of repetitions (3 to 50), and sometimes do not share their test sets. This makes direct comparison across models infeasible.</li>
<li><strong>Unclear design choices</strong>: With many possible configurations for pre-training data, chemical language, tokenization, positional embeddings, model size, pre-training objectives, and fine-tuning approaches, the field lacks systematic analyses to guide practitioners.</li>
</ol>
<p>The authors note that standard machine learning methods with fixed-size molecular fingerprints remain strong baselines for real-world datasets, illustrating that the promise of transformers for MPP has not yet been fully realized.</p>
<h2 id="seven-design-questions-for-molecular-transformers">Seven Design Questions for Molecular Transformers</h2>
<p>The central organizing framework of this review addresses seven questions practitioners must answer when building a transformer for MPP. For each, the authors synthesize findings across the 16 reviewed models.</p>
<h3 id="reviewed-models">Reviewed Models</h3>
<p>The paper catalogs 16 models organized by architecture:</p>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Base Model</th>
          <th>Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Encoder-Decoder</td>
          <td>Transformer, BART</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">ST</a>, Transformer-CNN, <a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-Mol</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>BERT</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, MAT, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, Mol-BERT, Chen et al., K-BERT, FP-BERT, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>RoBERTa</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
      </tr>
      <tr>
          <td>Decoder-Only</td>
          <td>XLNet</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> (RT)</td>
      </tr>
  </tbody>
</table>
<p>The core attention mechanism shared by all these models is the scaled dot-product attention:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$</p>
<p>where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_{k}$ is the dimension of the key vectors.</p>
<h3 id="question-1-which-database-and-how-many-molecules">Question 1: Which Database and How Many Molecules?</h3>
<p>Pre-training data sources vary considerably. The three main databases are ZINC (37 billion molecules in ZINC22), ChEMBL (2.4 million unique molecules with 20 million bioactivity measurements), and PubChem (111 million unique molecules). Pre-training set sizes ranged from 900K (ST on ChEMBL) to 1.1B molecules (MolFormer on ZINC + PubChem).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Database</th>
          <th>Size</th>
          <th>Language</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>ChEMBL</td>
          <td>900K</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>ChEMBL (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>)</td>
          <td>1.6M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>PubChem</td>
          <td>100K-10M</td>
          <td>SMILES, SELFIES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>PubChem</td>
          <td>5M-77M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>ZINC</td>
          <td>2M</td>
          <td>List of atoms</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>ZINC + PubChem</td>
          <td>1.1B</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>Chen et al.</td>
          <td>C, CP, CPZ</td>
          <td>2M-775M</td>
          <td>SMILES</td>
      </tr>
  </tbody>
</table>
<p>A key finding is that larger pre-training datasets do not consistently improve downstream performance. MolFormer showed minimal difference between models trained on 100M vs. 1.1B molecules. ChemBERTa-2 found that the model trained on 5M molecules using MLM performed comparably to 77M molecules for BBBP (both around 0.70 ROC-AUC). Chen et al. reported comparable $R^{2}$ values of $0.925 \pm 0.01$, $0.917 \pm 0.012$, and $0.915 \pm 0.01$ for ESOL across datasets of 2M, 103M, and 775M molecules, respectively. The data composition and covered chemical space appear to matter more than raw size.</p>
<h3 id="question-2-which-chemical-language">Question 2: Which Chemical Language?</h3>
<p>Most models use SMILES. ChemBERTa, RT, and SELFormer also explored SELFIES. MAT uses a simple list of atoms with structural features, while Mol-BERT and FP-BERT use circular fingerprints.</p>
<p>Direct comparisons between SMILES and SELFIES (by ChemBERTa on Tox21 SR-p53 and RT for drug-likeness prediction) found no significant performance difference. The RT authors reported that SELFIES models performed approximately $0.004 \pm 0.01$ better on RMSE, while SMILES models performed approximately $0.004 \pm 0.01$ better on Pearson correlation. The choice of chemical language does not appear to be a major factor in prediction performance, and even non-string representations (atom lists in MAT, fingerprints in Mol-BERT) perform competitively.</p>
<h3 id="question-3-how-to-tokenize">Question 3: How to Tokenize?</h3>
<p>Tokenization methods span atom-level (42-66 vocabulary tokens), regex-based (47-2,362 tokens), BPE (509-52K tokens), and substructure-based (3,357-13,325 tokens) approaches. No systematic comparison of tokenization strategies exists in the literature. The vocabulary size varied dramatically, from 42 tokens for MolBERT to over 52K for ChemBERTa. The authors argue that chemically meaningful tokenization (e.g., functional group-based fragmentation) could improve both performance and explainability.</p>
<h3 id="question-4-how-to-add-positional-embeddings">Question 4: How to Add Positional Embeddings?</h3>
<p>Most models inherited the absolute positional embedding from their NLP base models. MolBERT and RT adopted relative positional embeddings. MolFormer combined absolute and Rotary Positional Embedding (RoPE). MAT incorporated spatial information (inter-atomic 3D distances and adjacency) alongside self-attention.</p>
<p>MolFormer&rsquo;s comparison showed that RoPE became superior to absolute embeddings only when the pre-training dataset was very large. The performance difference (MAE on QM9) between absolute and RoPE embeddings for models trained on 111K, 111M, and 1.1B molecules was approximately $-0.20 \pm 0.18$, $-0.44 \pm 0.22$, and $0.27 \pm 0.12$, respectively.</p>
<p>The authors highlight that SMILES and SELFIES are linearizations of a 2D molecular graph, so consecutive tokens in a sequence are not necessarily spatially close. Positional embeddings that reflect 2D or 3D molecular structure remain underexplored.</p>
<h3 id="question-5-how-many-parameters">Question 5: How Many Parameters?</h3>
<p>Model sizes range from approximately 7M (ST, Mol-BERT) to over 100M parameters (MAT). Most chemical language models operate with 100M parameters or fewer, much smaller than NLP models like BERT (110M-330M) or GPT-3 (175B).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Dimensions</th>
          <th>Heads</th>
          <th>Layers</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>256</td>
          <td>4</td>
          <td>4</td>
          <td>7M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>768</td>
          <td>12</td>
          <td>12</td>
          <td>85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>768</td>
          <td>12</td>
          <td>6, 12</td>
          <td>43M, 85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
          <td>768</td>
          <td>12, 4</td>
          <td>8, 12</td>
          <td>57M, 85M</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>1024</td>
          <td>16</td>
          <td>8</td>
          <td>101M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>768</td>
          <td>12</td>
          <td>6</td>
          <td>43M</td>
      </tr>
  </tbody>
</table>
<p>SELFormer and MolFormer both tested different model sizes. SELFormer&rsquo;s larger model (approximately 86M parameters) showed approximately 0.034 better ROC-AUC for BBBP compared to the smaller model. MolFormer&rsquo;s larger model (approximately 87M parameters) performed approximately 0.04 better ROC-AUC on average for BBBP, HIV, BACE, and SIDER. The field lacks the systematic scaling analyses (analogous to Kaplan et al. and Hoffmann et al. in NLP) needed to establish proper scaling laws for chemical language models.</p>
<h3 id="question-6-which-pre-training-objectives">Question 6: Which Pre-training Objectives?</h3>
<p>Pre-training objectives fall into domain-agnostic and domain-specific categories:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Pre-training Objective</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>MLM</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></td>
          <td>MLM</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>MLM, PhysChemPred, SMILES-EQ</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td>K-BERT</td>
          <td>Atom feature, MACCS prediction, CL</td>
          <td>Update last layer</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>MLM, MTR</td>
          <td>Update</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>MLM, 2D Adjacency, 3D Distance</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a></td>
          <td>Denoising Span MLM, Augmentation</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">RT</a></td>
          <td>PLM (Permutation Language Modeling)</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Domain-specific objectives (predicting physico-chemical properties, atom features, or MACCS keys) showed promising but inconsistent results. MolBERT&rsquo;s PhysChemPred performed closely to the full three-objective model (approximately $0.72 \pm 0.06$ vs. $0.71 \pm 0.06$ ROC-AUC in virtual screening). The SMILES-EQ objective (identifying equivalent SMILES) was found to lower performance when combined with other objectives. K-BERT&rsquo;s contrastive learning objective did not significantly change performance (average ROC-AUC of 0.806 vs. 0.807 with and without CL).</p>
<p>ChemBERTa-2&rsquo;s Multi-Task Regression (MTR) objective performed noticeably better than MLM-only for almost all four classification tasks across pre-training dataset sizes.</p>
<h3 id="question-7-how-to-fine-tune">Question 7: How to Fine-tune?</h3>
<p>Fine-tuning through weight updates generally outperforms frozen representations. SELFormer showed this most dramatically, with a difference of 2.187 RMSE between frozen and updated models on FreeSolv. MolBERT showed a much smaller difference (0.575 RMSE on FreeSolv), likely because its domain-specific pre-training objectives already produced representations closer to the downstream tasks.</p>
<h2 id="benchmarking-challenges-and-performance-comparison">Benchmarking Challenges and Performance Comparison</h2>
<h3 id="downstream-datasets">Downstream Datasets</h3>
<p>The review focuses on nine benchmark datasets across three categories from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Application</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>LogD at pH 7.4</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,050</td>
          <td>1 classification</td>
          <td>Physiology</td>
          <td>Blood-brain barrier</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>1,484</td>
          <td>2 classification</td>
          <td>Physiology</td>
          <td>Clinical trial toxicity</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>1,427</td>
          <td>27 classification</td>
          <td>Physiology</td>
          <td>Drug side effects</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>7,831</td>
          <td>12 classification</td>
          <td>Physiology</td>
          <td>Nuclear receptor/stress pathways</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>1,513</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Beta-secretase 1 binding</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Anti-HIV activity</td>
      </tr>
  </tbody>
</table>
<h3 id="inconsistencies-in-evaluation">Inconsistencies in Evaluation</h3>
<p>The authors document substantial inconsistencies that prevent fair model comparison:</p>
<ol>
<li><strong>Data splitting</strong>: Models used different splitting methods (scaffold vs. random) and different implementations even when using the same method. Not all models adhered to scaffold splitting for classification tasks as recommended.</li>
<li><strong>Different test sets</strong>: Even models using the same split type may not evaluate on identical test molecules due to different random seeds.</li>
<li><strong>Varying repetitions</strong>: Repetitions ranged from 3 (RT) to 50 (Chen et al.), making some analyses more statistically robust than others.</li>
<li><strong>Metric inconsistency</strong>: Most use ROC-AUC for classification and RMSE for regression, but some models report only averages without standard deviations, while others report standard errors.</li>
</ol>
<h3 id="performance-findings">Performance Findings</h3>
<p>When comparing only models evaluated on the same test sets (Figure 2 in the paper), the authors observe that transformer models show comparable, but not consistently superior, performance to existing ML and DL models. The performance varies considerably across models and datasets.</p>
<p>For BBBP, the Mol-BERT model reported lower ROC-AUC than its corresponding MPNN (approximately 0.88 vs. 0.91), while MolBERT outperformed its corresponding CDDD model (approximately 0.86 vs. 0.76 ROC-AUC) and its SVM baseline (approximately 0.86 vs. 0.70 ROC-AUC). A similar mixed pattern appeared for HIV: ChemBERTa performed worse than its corresponding ML models, while MolBERT performed better than its ML (approximately 0.08 higher ROC-AUC) and DL (approximately 0.03 higher ROC-AUC) baselines. For SIDER, Mol-BERT performed approximately 0.1 better ROC-AUC than its corresponding MPNN. For regression, MAT and MolBERT showed improved performance over their ML and DL baselines on ESOL, FreeSolv, and Lipophilicity. For example, MAT performed approximately 0.2 lower RMSE than an SVM model and approximately 0.03 lower RMSE than the Weave model on ESOL.</p>
<h2 id="key-takeaways-and-future-directions">Key Takeaways and Future Directions</h2>
<p>The review concludes with six main takeaways:</p>
<ol>
<li><strong>Performance</strong>: Transformers using SMILES show comparable but not consistently superior performance to existing ML and DL models for MPP.</li>
<li><strong>Scaling</strong>: No systematic analysis of model parameter scaling relative to data size exists for chemical language models. Such analysis is essential.</li>
<li><strong>Pre-training data</strong>: Dataset size alone is not the sole determinant of downstream performance. Composition and chemical space coverage matter.</li>
<li><strong>Chemical language</strong>: SMILES and SELFIES perform similarly. Alternative representations (atom lists, fingerprints) also work when the architecture is adjusted.</li>
<li><strong>Domain knowledge</strong>: Domain-specific pre-training objectives show promise, but tokenization and positional encoding remain underexplored.</li>
<li><strong>Benchmarking</strong>: The community needs standardized data splitting, fixed test sets, statistical analysis, and consistent reporting to enable meaningful comparison.</li>
</ol>
<p>The authors also highlight the need for attention visualization and explainability analysis, investigation of NLP-originated techniques (pre-training regimes, fine-tuning strategies like LoRA, explainability methods), and adaptation of these techniques to the specific characteristics of chemical data (smaller vocabularies, shorter sequences).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper. No new data or models are introduced. All analyses use previously reported results from the 16 reviewed papers, with additional visualization and comparison. The authors provide a GitHub repository with the code and data used to generate their comparative figures.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Not applicable (review paper). The paper describes training strategies at a conceptual level, referencing the original publications for implementation details.</p>
<h3 id="models">Models</h3>
<p>Not applicable (review paper). The paper catalogs 16 models with their architecture details, parameter counts, and training configurations across Tables 1, 4, 5, 6, and 7.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The paper compiles performance across nine MoleculeNet datasets. Key comparison figures (Figures 2 and 7) restrict to models evaluated on the same test sets for fair comparison, using ROC-AUC for classification and RMSE for regression.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/Transformers4MPP_review">Transformers4MPP_review</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Figure generation code and compiled data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sultan, A., Sieg, J., Mathea, M., &amp; Volkamer, A. (2024). Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years. <em>Journal of Chemical Information and Modeling</em>, 64(16), 6259-6280. <a href="https://doi.org/10.1021/acs.jcim.4c00747">https://doi.org/10.1021/acs.jcim.4c00747</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sultan2024transformers,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sultan, Afnan and Sieg, Jochen and Mathea, Miriam and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6259--6280}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00747}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer-CNN: SMILES Embeddings for QSAR Modeling</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</guid><description>Transformer-CNN uses SMILES embeddings from a canonicalization Transformer with a CNN head for interpretable QSAR property prediction.</description><content:encoded><![CDATA[<h2 id="transformer-based-smiles-embeddings-for-property-prediction">Transformer-Based SMILES Embeddings for Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces Transformer-CNN, a two-stage architecture for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> (Quantitative Structure-Activity Relationship) modeling. The primary contribution is a transfer learning approach: a Transformer model is first trained on the task of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> canonicalization (mapping non-canonical SMILES to canonical forms), and the encoder&rsquo;s internal representations are then used as &ldquo;dynamic SMILES embeddings&rdquo; for downstream property prediction via a convolutional neural network (TextCNN). The authors also contribute an interpretability framework based on Layer-wise Relevance Propagation (LRP) that traces predictions back to individual atom contributions.</p>
<h2 id="from-descriptors-to-learned-embeddings-in-qsar">From Descriptors to Learned Embeddings in QSAR</h2>
<p>Traditional QSAR methods rely on hand-engineered molecular descriptors (fragment counts, physicochemical features) coupled with feature selection and classical ML algorithms. While deep learning approaches that operate on raw SMILES strings or molecular graphs have reduced the need for manual feature engineering, they typically require large training datasets to learn effective representations from scratch. QSAR datasets, in contrast, often contain only hundreds of molecules, making it difficult to train end-to-end deep models.</p>
<p>The authors identify two specific gaps. First, existing SMILES-based autoencoders such as <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (Continuous and Data-Driven molecular Descriptors) produce fixed-length latent vectors, discarding positional information that could be useful for property prediction and interpretation. Second, QSAR models built on deep architectures generally lack interpretability, making it hard to verify that predictions rely on chemically meaningful structural features rather than spurious correlations.</p>
<h2 id="dynamic-smiles-embeddings-via-canonicalization-pre-training">Dynamic SMILES Embeddings via Canonicalization Pre-training</h2>
<p>The core insight is that training a Transformer to perform SMILES canonicalization (a Seq2Seq task mapping non-canonical SMILES to canonical SMILES) produces an encoder whose internal states serve as information-rich, position-dependent molecular embeddings.</p>
<h3 id="pre-training-on-smiles-canonicalization">Pre-training on SMILES Canonicalization</h3>
<p>The Transformer encoder-decoder is trained on approximately 17.7 million canonicalization pairs derived from the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database (SMILES with length up to 110 characters). Each molecule is augmented 10 times by generating non-canonical SMILES variants, plus one identity pair where both sides are canonical. The training uses character-level tokenization with a 66-symbol vocabulary covering drug-like molecules including stereochemistry, charges, and inorganic ions.</p>
<p>The Transformer architecture follows Vaswani et al. with 3 layers and 10 self-attention heads. The learning rate schedule follows:</p>
<p>$$\lambda = \text{factor} \cdot \min(1.0,; \text{step} / \text{warmup}) / \max(\text{step},; \text{warmup})$$</p>
<p>where factor = 20, warmup = 16,000 steps, and $\lambda$ is clipped at a minimum of $10^{-4}$. Training runs for 10 epochs (275,907 batches per epoch) without early stopping.</p>
<p>On validation with 500,000 generated ChEMBL-like SMILES, the model correctly canonicalizes 83.6% of all samples. Performance drops for stereochemistry (37.2% for @-containing SMILES) and cis/trans notation (73.9%).</p>
<h3 id="from-encoder-states-to-qsar-predictions">From Encoder States to QSAR Predictions</h3>
<p>After pre-training, the encoder&rsquo;s output for a molecule with $N$ characters is a matrix of dimensions $(N, \text{EMBEDDINGS})$. Unlike fixed-length CDDD descriptors, these &ldquo;dynamic embeddings&rdquo; preserve positional information, meaning equivalent characters receive different embedding values depending on their context and position.</p>
<p>To handle variable-length embeddings, the authors use a TextCNN architecture (from DeepChem) with 1D convolutional filters at kernel sizes (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20) producing (100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160) filters respectively. After GlobalMaxPool and concatenation, the features pass through Dropout (rate = 0.25), a Dense layer ($N = 512$), a Highway layer, and finally an output layer (1 neuron for regression, 2 for classification).</p>
<p>The Transformer weights are frozen during QSAR training. The Adam optimizer is used with a fixed learning rate of $10^{-4}$ and early stopping on a 10% held-out validation set. Critically, SMILES augmentation ($n = 10$) is applied during both training and inference, with the final prediction being the average over augmented SMILES for each molecule.</p>
<h3 id="interpretability-via-layer-wise-relevance-propagation">Interpretability via Layer-wise Relevance Propagation</h3>
<p>The LRP algorithm propagates relevance scores from the output back through the CNN layers to the Transformer encoder output (which is position-wise). The relevance conservation property holds:</p>
<p>$$y = R = f(x) = \sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} = \cdots = \sum_{l \in (1)} R_{l}$$</p>
<p>In practice, biases absorb some relevance, so the total propagated to the input is less than the output:</p>
<p>$$\sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} + B$$</p>
<p>For gated connections in the Highway block, the authors implement the signal-take-all redistribution rule. The interpretation algorithm generates one SMILES per non-hydrogen atom (each drawn starting from that atom), runs LRP on each, and averages contributions. If more than 50% of relevance dissipates on biases, the interpretation may be unreliable, serving as an applicability domain indicator.</p>
<h2 id="benchmarks-across-18-regression-and-classification-datasets">Benchmarks Across 18 Regression and Classification Datasets</h2>
<p>The authors evaluate on the same 18 datasets (9 regression, 9 classification) used in their previous SMILES augmentation study, enabling direct comparison. All experiments use five-fold cross-validation.</p>
<h3 id="regression-results-r2">Regression Results ($r^2$)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MP (19,104)</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center"><strong>0.86</strong></td>
          <td style="text-align: center">0.85</td>
      </tr>
      <tr>
          <td>BP (11,893)</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.97</td>
          <td style="text-align: center"><strong>0.98</strong></td>
          <td style="text-align: center">0.98</td>
      </tr>
      <tr>
          <td>BCF (378)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center"><strong>0.85</strong></td>
          <td style="text-align: center">0.81</td>
      </tr>
      <tr>
          <td>FreeSolv (642)</td>
          <td style="text-align: center"><strong>0.94</strong></td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
      </tr>
      <tr>
          <td>LogS (1,311)</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.91</td>
      </tr>
      <tr>
          <td>Lipo (4,200)</td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.60</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center"><strong>0.74</strong></td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.66</td>
          <td style="text-align: center"><strong>0.76</strong></td>
          <td style="text-align: center">0.75</td>
      </tr>
      <tr>
          <td>DHFR (739)</td>
          <td style="text-align: center">0.62</td>
          <td style="text-align: center">0.63</td>
          <td style="text-align: center">0.46</td>
          <td style="text-align: center"><strong>0.67</strong></td>
          <td style="text-align: center">0.61</td>
      </tr>
      <tr>
          <td>LEL (483)</td>
          <td style="text-align: center">0.19</td>
          <td style="text-align: center">0.25</td>
          <td style="text-align: center">0.20</td>
          <td style="text-align: center"><strong>0.27</strong></td>
          <td style="text-align: center">0.23</td>
      </tr>
  </tbody>
</table>
<h3 id="classification-results-auc">Classification Results (AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (41,127)</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.74</td>
      </tr>
      <tr>
          <td>AMES (6,542)</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center"><strong>0.89</strong></td>
          <td style="text-align: center">0.86</td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center"><strong>0.91</strong></td>
          <td style="text-align: center">0.90</td>
      </tr>
      <tr>
          <td>ClinTox (1,478)</td>
          <td style="text-align: center"><strong>0.77</strong></td>
          <td style="text-align: center">0.76</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center">0.77</td>
          <td style="text-align: center">0.73</td>
      </tr>
      <tr>
          <td>Tox21 (7,831)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.82</td>
      </tr>
      <tr>
          <td>BBBP (2,039)</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.89</td>
      </tr>
      <tr>
          <td>JAK3 (886)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.80</strong></td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.76</td>
      </tr>
      <tr>
          <td>BioDeg (1,737)</td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center"><strong>0.93</strong></td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.92</td>
      </tr>
      <tr>
          <td>RP AR (930)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center"><strong>0.87</strong></td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.86</td>
      </tr>
  </tbody>
</table>
<h3 id="key-comparisons">Key Comparisons</h3>
<p>Baselines include descriptor-based methods (the best from LibSVM, Random Forest, XGBoost, ASNN, and DNNs), direct SMILES-based models with augmentation, and CDDD descriptors analyzed by the same classical ML methods. CDDD descriptors come from the Sml2canSml autoencoder approach, which produces fixed 512-dimensional vectors.</p>
<p>Transformer-CNN with augmentation matches or exceeds all baselines on 14 of 18 datasets. The effect of augmentation is dramatic: without it, Transformer-CNN underperforms substantially (e.g., BCF drops from 0.85 to 0.71, JAK3 from 0.78 to 0.70). This confirms that the internal consensus from multiple SMILES representations is essential to the method&rsquo;s effectiveness.</p>
<p>A practical advantage over CDDD is that Transformer-CNN imposes no constraints on molecular properties (CDDD requires logP in (-5, 7), molecular weight under 12,600, 3-50 heavy atoms, and organic molecules only), since the Transformer was trained on the full diversity of ChEMBL.</p>
<h3 id="interpretability-case-studies">Interpretability Case Studies</h3>
<p>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> mutagenicity, the LRP analysis of 1-Bromo-4-nitrobenzene correctly identifies the nitro group and halogen as structural alerts, consistent with known mutagenicity rules. For aqueous solubility of <a href="https://en.wikipedia.org/wiki/Haloperidol">haloperidol</a>, the model assigns positive contributions to hydroxyl, carbonyl, and aliphatic nitrogen groups (which increase solubility) and negative contributions to aromatic carbons (which decrease it). Both cases align with established chemical knowledge, supporting the trustworthiness of the model.</p>
<h2 id="effective-transfer-learning-for-small-qsar-datasets">Effective Transfer Learning for Small QSAR Datasets</h2>
<p>Transformer-CNN achieves competitive or superior QSAR performance across 18 diverse benchmarks by combining three ingredients: (1) Transformer-based pre-training via SMILES canonicalization, (2) SMILES augmentation during training and inference, and (3) a lightweight CNN head. The method requires minimal hyperparameter tuning, as the Transformer weights are frozen and the CNN architecture is fixed.</p>
<p>The authors acknowledge several limitations and future directions:</p>
<ul>
<li>Stereochemistry canonicalization accuracy is low (37.2%), which could impact models for stereo-sensitive properties</li>
<li>The LRP interpretability depends on sufficient relevance propagation (at least 50% reaching the input layer)</li>
<li>The variance among augmented SMILES predictions could serve as a confidence estimate, but this is left to future work</li>
<li>Applicability domain assessment based on SMILES reconstruction quality is proposed but not fully developed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (SMILES &lt;= 110 chars)</td>
          <td>17.7M pairs</td>
          <td>10x augmentation + 1 identity pair per molecule</td>
      </tr>
      <tr>
          <td>Validation (canon.)</td>
          <td>Generated ChEMBL-like SMILES</td>
          <td>500,000</td>
          <td>From a molecular generator</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>9 regression + 9 classification</td>
          <td>378-41,127</td>
          <td>Available on OCHEM (<a href="https://ochem.eu">https://ochem.eu</a>)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer: 3 layers, 10 self-attention heads, character-level tokenization (66 symbols)</li>
<li>TextCNN: 12 kernel sizes (1-10, 15, 20) with 100-200 filters each, GlobalMaxPool, Dense(512), Highway, Dropout(0.25)</li>
<li>Augmentation: n=10 non-canonical SMILES per molecule during training and inference</li>
<li>LRP: signal-take-all redistribution for Highway gates, standard LRP for Dense and Conv layers</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Transformer encoder weights pre-trained on canonicalization task (frozen during QSAR training)</li>
<li>QSAR CNN trained with Adam optimizer, learning rate $10^{-4}$, early stopping</li>
<li>Pre-trained embeddings and standalone prediction models available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: coefficient of determination $r^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$</li>
<li>Classification: Area Under the ROC Curve (AUC)</li>
<li>Five-fold cross-validation with bootstrap standard errors</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P6000, Titan Xp, and Titan V GPUs (donated by NVIDIA)</li>
<li>TensorFlow v1.12.0, RDKit v2018.09.2</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bigchem/transformer-cnn">transformer-cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Source code, pre-trained embeddings, standalone prediction models</td>
      </tr>
      <tr>
          <td><a href="https://ochem.eu">OCHEM</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online platform hosting the method, training datasets, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Karpov, P., Godin, G., &amp; Tetko, I. V. (2020). Transformer-CNN: Swiss knife for QSAR modeling and interpretation. <em>Journal of Cheminformatics</em>, 12, 17. <a href="https://doi.org/10.1186/s13321-020-00423-w">https://doi.org/10.1186/s13321-020-00423-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{karpov2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-{CNN}: Swiss knife for {QSAR} modeling and interpretation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Karpov, Pavel and Godin, Guillaume and Tetko, Igor V.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00423-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer Name-to-SMILES with Atom Count Losses</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/</guid><description>A Transformer seq2seq model translates chemical compound names to SMILES, using atom-count constraints and SMILES/InChI multi-task learning.</description><content:encoded><![CDATA[<h2 id="translating-chemical-names-to-structures-with-transformers">Translating Chemical Names to Structures with Transformers</h2>
<p>This is a <strong>Method</strong> paper that proposes using Transformer-based sequence-to-sequence models to predict chemical compound structures (represented as SMILES strings) from chemical compound names. The primary contribution is the application of neural machine translation techniques to the name-to-structure problem, along with two domain-specific improvements: an atom-count constraint loss function and a multi-task learning approach that jointly predicts SMILES and InChI strings.</p>
<h2 id="why-rule-based-name-to-structure-fails-for-synonyms">Why Rule-Based Name-to-Structure Fails for Synonyms</h2>
<p>Chemical compound names come in several varieties. IUPAC names follow systematic nomenclature and are well-handled by rule-based parsers like OPSIN. Database IDs (e.g., CAS registry numbers) can be resolved by dictionary lookup. The third category, Synonyms (which includes abbreviations, common names, and other informal designations), is problematic because naming patterns are complex and widely variable.</p>
<p>In preliminary experiments, rule-based tools achieved F-measures of 0.878 to 0.960 on IUPAC names but only 0.719 to 0.758 on Synonyms. This performance gap motivates a data-driven approach. The authors frame name-to-SMILES prediction as a machine translation problem: the source language is the chemical compound name and the target language is the SMILES string. A neural model trained on millions of name-SMILES pairs can learn patterns that rule-based systems miss, particularly for non-systematic nomenclature.</p>
<h2 id="atom-count-constraints-and-multi-task-learning">Atom-Count Constraints and Multi-Task Learning</h2>
<p>The paper introduces two improvements over a vanilla Transformer seq2seq model.</p>
<h3 id="atom-count-constraint-loss">Atom-Count Constraint Loss</h3>
<p>A correct structure prediction must contain the right number of atoms of each element. The authors add an auxiliary loss that penalizes the squared difference between the predicted and true atom counts for each element. The predicted atom counts are obtained by summing Gumbel-softmax outputs across all decoded positions.</p>
<p>For the $i$-th output token, the Gumbel-softmax probability vector is:</p>
<p>$$
y_{ij} = \frac{\exp\left((\log(\pi_{ij}) + g_{ij}) / \tau\right)}{\sum_{k=1}^{|\mathcal{V}|} \exp\left((\log(\pi_{ik}) + g_{ik}) / \tau\right)}
$$</p>
<p>where $\pi_{ij}$ is the model&rsquo;s softmax output, $g_{ij}$ is a Gumbel noise sample, and $\tau = 0.1$ is the temperature. The predicted token frequency vector is $\mathbf{y}^{pred} = \sum_{i=1}^{m} \mathbf{y}_i$, and the atom-count loss is:</p>
<p>$$
\mathcal{L}_{atom} = \frac{1}{|A|} \sum_{a \in A} \left(N_a(T) - y_{idx(a)}^{pred}\right)^2
$$</p>
<p>where $A$ is the set of chemical elements in the vocabulary, $N_a(T)$ returns the number of atoms of element $a$ in the correct SMILES string $T$, and $idx(a)$ returns the vocabulary index of element $a$. Only element tokens (e.g., &ldquo;C&rdquo;, &ldquo;O&rdquo;) are counted; bond symbols (e.g., &ldquo;=&rdquo;, &ldquo;#&rdquo;) are excluded.</p>
<p>The combined objective is:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{atom} \mathcal{L}_{atom}
$$</p>
<p>with $\lambda_{atom} = 0.7$.</p>
<h3 id="multi-task-smilesinchi-prediction">Multi-Task SMILES/InChI Prediction</h3>
<p>SMILES and InChI strings encode the same chemical structure in different formats. The authors hypothesize that jointly predicting both representations can improve the shared encoder. The multi-task model shares the encoder between a SMILES decoder and an InChI decoder, minimizing:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{inchi} \mathcal{L}_{inchi}
$$</p>
<p>where $\mathcal{L}_{inchi} = -\log P(I | X; \boldsymbol{\theta}_{enc}, \boldsymbol{\theta}_{inchi})$ and $\lambda_{inchi} = 0.3$.</p>
<h2 id="experimental-setup-and-evaluation">Experimental Setup and Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The dataset was constructed from PubChem dump data (97M compound records). Chemical compound names categorized as Synonyms were paired with canonical SMILES strings (converted via RDKit). Database-like IDs were filtered out using regular expressions. Duplicate names mapping to different CIDs were removed.</p>
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>5,000,000</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>1,113</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>11,194</td>
      </tr>
  </tbody>
</table>
<h3 id="model-configuration">Model Configuration</h3>
<p>The Transformer uses 6 encoder/decoder layers, 8 attention heads, 512-dimensional embeddings, and 0.1 dropout. Training used label-smoothing cross-entropy ($\epsilon = 0.1$), Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$), and a warmup schedule with peak learning rate 0.0005 over 4,000 steps followed by inverse square root decay. Models were trained for 300,000 update steps. Final predictions averaged the last 10 checkpoints and used beam search (beam size 4, length penalty $\alpha = 0.6$, max output length 200).</p>
<h3 id="tokenization">Tokenization</h3>
<p>Three tokenization strategies were compared:</p>
<ul>
<li><strong>BPE</strong>: Byte pair encoding learned on chemical compound names (500 merge operations) via fastBPE</li>
<li><strong>OPSIN-TK</strong>: The OPSIN rule-based tokenizer</li>
<li><strong>OPSIN-TK+BPE</strong>: A hybrid where OPSIN handles tokenizable names and BPE handles the rest</li>
</ul>
<p>SMILES tokens were identified by regular expressions (elements as single tokens, remaining symbols as characters). InChI strings were tokenized by SentencePiece (vocabulary size 1,000).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>OPSIN</strong>: Open-source rule-based parser</li>
<li><strong>Tool A</strong> and <strong>Tool B</strong>: Two commercially available name-to-structure tools</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Tokenizer</th>
          <th>Recall</th>
          <th>Precision</th>
          <th>F-measure</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPSIN</td>
          <td>Rule-based</td>
          <td>0.693</td>
          <td>0.836</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Tool A</td>
          <td>Rule-based</td>
          <td>0.711</td>
          <td>0.797</td>
          <td>0.752</td>
      </tr>
      <tr>
          <td>Tool B</td>
          <td>Rule-based</td>
          <td>0.653</td>
          <td>0.800</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>BPE</td>
          <td>0.793</td>
          <td>0.806</td>
          <td>0.799</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>BPE</td>
          <td>0.798</td>
          <td>0.808</td>
          <td>0.803</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>BPE</td>
          <td>0.810</td>
          <td>0.819</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.763</td>
          <td>0.873</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.768</td>
          <td>0.876</td>
          <td>0.818</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.779</td>
          <td>0.886</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK</td>
          <td>0.755</td>
          <td>0.868</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK</td>
          <td>0.757</td>
          <td>0.867</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK</td>
          <td>0.754</td>
          <td>0.869</td>
          <td>0.807</td>
      </tr>
  </tbody>
</table>
<p>The best configuration (inchigen with OPSIN-TK+BPE) achieved an F-measure of 0.829, surpassing OPSIN by 0.071 points. The multi-task learning approach (inchigen) consistently outperformed the atom-count constraint alone (atomnum) across all tokenizer settings.</p>
<h2 id="key-findings-and-error-analysis">Key Findings and Error Analysis</h2>
<p>The Transformer-based approach produced grammatically correct SMILES strings (parseable by RDKit) for 99% of test examples, compared to 81.6-88.4% for the rule-based tools. Even when predictions were incorrect, they tended to be structurally similar to the correct answer. Using MACCS fingerprints and Jaccard (Tanimoto) similarity, the average similarity between incorrectly predicted and correct structures was 0.753.</p>
<p>The OPSIN-TK tokenizer yielded higher precision than BPE because approximately 11.5% (1,293 of 11,194) of test compounds could not be tokenized by OPSIN, reducing the number of outputs. BPE-based tokenizers achieved higher recall by covering all inputs. The hybrid OPSIN-TK+BPE approach balanced both, achieving the highest overall F-measure.</p>
<p><strong>Limitations</strong>: The paper does not evaluate on IUPAC names separately with the Transformer models (only comparing rule-based tools on IUPAC). The atom-count constraint and multi-task learning are not combined in a single model. The dataset is released but the training code is not. Hardware details and training times are not reported. The evaluation uses only exact-match F-measure and Jaccard similarity, without measuring partial credit for nearly-correct structures.</p>
<p><strong>Future work</strong>: The authors plan to explore additional tokenization methods, combine the atom-count constraint with multi-task learning, and apply the constraint loss to other chemistry problems including chemical reaction prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>5,000,000 pairs</td>
          <td>Chemical compound names to canonical SMILES</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>1,113 pairs</td>
          <td>Filtered for duplicates</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>11,194 pairs</td>
          <td>Filtered for duplicates; released as benchmark</td>
      </tr>
  </tbody>
</table>
<p>The authors state the dataset is released for future research. The data was constructed from the PubChem dump (97M compound records) using RDKit for SMILES canonicalization. Database-like IDs were removed with regular expressions and duplicate names across CIDs were filtered.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq2seq (6 layers, 8 heads, 512-dim embeddings)</li>
<li>BPE tokenization via fastBPE (500 merge operations)</li>
<li>SentencePiece for InChI tokenization (vocabulary size 1,000)</li>
<li>Gumbel-softmax atom-count constraint ($\tau = 0.1$, $\lambda_{atom} = 0.7$)</li>
<li>Multi-task SMILES/InChI loss ($\lambda_{inchi} = 0.3$)</li>
<li>Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$)</li>
<li>Label smoothing ($\epsilon = 0.1$), 300K training steps</li>
<li>Beam search (beam size 4, length penalty $\alpha = 0.6$)</li>
</ul>
<h3 id="models">Models</h3>
<p>Standard Transformer architecture following Vaswani et al. (2017). No pre-trained weights or model checkpoints are released.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Model</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>F-measure</td>
          <td>0.829</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Precision</td>
          <td>0.886</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>0.810</td>
          <td>inchigen (BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Grammatical correctness</td>
          <td>99%</td>
          <td>inchigen (BPE)</td>
          <td>SMILES parseable by RDKit</td>
      </tr>
      <tr>
          <td>Avg. Jaccard similarity (errors)</td>
          <td>0.753</td>
          <td>inchigen (BPE)</td>
          <td>On incorrect predictions only</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Omote, Y., Matsushita, K., Iwakura, T., Tamura, A., &amp; Ninomiya, T. (2020). Transformer-based Approach for Predicting Chemical Compound Structures. <em>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</em>, 154-162. <a href="https://doi.org/10.18653/v1/2020.aacl-main.19">https://doi.org/10.18653/v1/2020.aacl-main.19</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{omote2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based Approach for Predicting Chemical Compound Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Omote, Yutaro and Matsushita, Kyoumoto and Iwakura, Tomoya and Tamura, Akihiro and Ninomiya, Takashi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{154--162}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2020.aacl-main.19}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer CLMs for SMILES: Literature Review 2024</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformer-clms-smiles-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformer-clms-smiles-review/</guid><description>Review of transformer-based chemical language models for SMILES, covering encoder, decoder, and encoder-decoder architectures for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-based-chemical-language-models">A Systematization of Transformer-Based Chemical Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (literature review) that surveys the landscape of transformer-based chemical language models (CLMs) operating on SMILES representations. It organizes the field into three architectural categories (encoder-only, decoder-only, encoder-decoder), discusses tokenization strategies, pre-training and fine-tuning methodologies, and identifies open challenges and future research directions. The review covers approximately 30 distinct CLMs published through early 2024.</p>
<h2 id="why-review-transformer-clms-for-smiles">Why Review Transformer CLMs for SMILES?</h2>
<p>The chemical space is vast, with databases like ZINC20 exceeding 5.5 billion compounds, and the amount of unlabeled molecular data far outstrips available labeled data for specific tasks like toxicity prediction or binding affinity estimation. Traditional molecular representations (fingerprints, descriptors, graph-based methods) require expert-engineered features and extensive domain knowledge.</p>
<p>Transformer-based language models, originally developed for NLP, have emerged as a compelling alternative. By treating <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings as a &ldquo;chemical language,&rdquo; these models can leverage large-scale unsupervised pre-training on abundant unlabeled molecules, then fine-tune on small labeled datasets for specific downstream tasks. Earlier approaches like Seq2Seq and Seq3Seq fingerprint methods used RNN-based encoder-decoders, but these suffered from vanishing gradients and sequential processing bottlenecks when handling long SMILES sequences.</p>
<p>The authors motivate this review by noting that no prior survey has comprehensively organized transformer-based CLMs by architecture type while simultaneously covering tokenization, embedding strategies, and downstream application domains.</p>
<h2 id="architectural-taxonomy-encoder-decoder-and-encoder-decoder-models">Architectural Taxonomy: Encoder, Decoder, and Encoder-Decoder Models</h2>
<p>The core organizational contribution is a three-way taxonomy of transformer CLMs based on their architectural backbone.</p>
<h3 id="encoder-only-models-bert-family">Encoder-Only Models (BERT Family)</h3>
<p>These models capture bidirectional context, making them well suited for extracting molecular representations for property prediction tasks. The review covers:</p>
<ul>
<li><strong>BERT</strong> (Lee and Nam, 2022): Adapted for SMILES processing with linguistic knowledge infusion, using BPE tokenization</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MOLBERT</a></strong> (Fabian et al., 2020): Chemistry-specific BERT for physicochemical property and bioactivity prediction</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></strong> (Wang et al., 2019): BERT variant designed to learn molecular representations directly from SMILES without feature engineering</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> / <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></strong> (Chithrananda et al., 2020; Ahmad et al., 2022): RoBERTa-based models optimized for chemical property prediction, with ChemBERTa-2 exploring multi-task pre-training</li>
<li><strong>GPT-MolBERTa</strong> (Balaji et al., 2023): Combines GPT molecular features with a RoBERTa backbone</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a></strong> (Ross et al., 2022): Large-scale model trained on 1.1 billion molecules, published in Nature Machine Intelligence</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></strong> (Yuksel et al., 2023): Operates on <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations rather than SMILES</li>
<li><strong>Mol-BERT / MolRoPE-BERT</strong> (Li and Jiang, 2021; Liu et al., 2023): Differ in positional embedding strategy, with MolRoPE-BERT using rotary position embedding to handle longer sequences</li>
<li><strong>BET</strong> (Chen et al., 2021): Extracts predictive representations from hundreds of millions of molecules</li>
</ul>
<h3 id="decoder-only-models-gpt-family">Decoder-Only Models (GPT Family)</h3>
<p>These models excel at generative tasks, including de novo molecular design:</p>
<ul>
<li><strong>GPT-2-based model</strong> (Adilov, 2021): Generative pre-training from molecules</li>
<li><strong>MolXPT</strong> (Liu et al., 2023): Wraps molecules with text for generative pre-training, connecting chemical and natural language</li>
<li><strong>BioGPT</strong> (Luo et al., 2022): Focuses on biomedical text generation and mining</li>
<li><strong>MolGPT</strong> (Haroon et al., 2023): Uses relative attention to capture token distances and relationships for de novo drug design</li>
<li><strong>Mol-Instructions</strong> (Fang et al., 2023): Large-scale biomolecular instruction dataset for LLMs</li>
</ul>
<h3 id="encoder-decoder-models">Encoder-Decoder Models</h3>
<p>These combine encoding and generation capabilities for sequence-to-sequence tasks:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al., 2022): BART-based model for reaction prediction and molecular property prediction</li>
<li><strong>MolT5</strong> (adapted T5): Unified text-to-text framework for molecular tasks</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">SMILES Transformer</a></strong> (Honda et al., 2019): Pre-trained molecular fingerprints for low-data drug discovery</li>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-MOL</a></strong> (Xue et al., 2020): Large-scale pre-training for molecular understanding</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a></strong> (Born and Manica, 2023): Operates on <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, enabling concurrent regression and generation</li>
<li><strong>TransAntivirus</strong> (Mao et al., 2023): Specialized for antiviral drug design using IUPAC nomenclature</li>
</ul>
<h2 id="tokenization-embedding-and-pre-training-strategies">Tokenization, Embedding, and Pre-Training Strategies</h2>
<h3 id="smiles-tokenization">SMILES Tokenization</h3>
<p>The review identifies tokenization as a critical preprocessing step that affects downstream performance. SMILES tokenization differs from standard NLP tokenization because SMILES strings lack whitespace and use parentheses for branching rather than sentence separation. The key approaches include:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Source</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a></td>
          <td>Ucak et al. (2023)</td>
          <td>Atom-level tokens preserving chemical identity</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a></td>
          <td>Li and Fourches (2021)</td>
          <td>BPE-inspired substructure tokenization</td>
      </tr>
      <tr>
          <td>Byte-Pair Encoding (BPE)</td>
          <td>Chithrananda et al. (2020); Lee and Nam (2022)</td>
          <td>Standard subword tokenization adapted for SMILES</td>
      </tr>
      <tr>
          <td>SMILESTokenizer</td>
          <td>Chithrananda et al. (2020)</td>
          <td>Character-level tokenization with chemical adjustments</td>
      </tr>
  </tbody>
</table>
<h3 id="positional-embeddings">Positional Embeddings</h3>
<p>The models use various positional encoding strategies: absolute, relative key, relative key-query, rotary (RoPE), and sinusoidal. Notably, SMILES-based models omit segmentation embeddings since SMILES data consists of single sequences rather than sentence pairs.</p>
<h3 id="pre-training-and-fine-tuning-pipeline">Pre-Training and Fine-Tuning Pipeline</h3>
<p>The standard workflow follows two phases:</p>
<ol>
<li><strong>Pre-training</strong>: Unsupervised training on large unlabeled SMILES databases (ZINC, PubChem, ChEMBL) using masked language modeling (MLM), where the model learns to predict masked tokens within SMILES strings</li>
<li><strong>Fine-tuning</strong>: Supervised adaptation on smaller labeled datasets for specific tasks (classification or regression)</li>
</ol>
<p>The self-attention mechanism, central to all transformer CLMs, is formulated as:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^Q$, $W^K$, $W^V \in \mathbb{R}^{M \times d_k}$ are learnable weight matrices, and $\sqrt{d_k}$ is the scaling factor.</p>
<h2 id="benchmark-datasets-and-evaluation-landscape">Benchmark Datasets and Evaluation Landscape</h2>
<p>The review catalogs the standard evaluation ecosystem for CLMs. Pre-training databases include ZINC, PubChem, and ChEMBL. Fine-tuning and evaluation rely heavily on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Example Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Regression</td>
          <td>642 to 4,200</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA, MUV, HIV, PDBbind, BACE</td>
          <td>Classification/Regression</td>
          <td>11,908 to 437,929</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox</td>
          <td>Classification</td>
          <td>1,427 to 8,575</td>
      </tr>
  </tbody>
</table>
<p>The authors also propose four new fine-tuning datasets targeting diseases: COVID-19 drug compounds, cocrystal formation, antimalarial drugs (Plasmodium falciparum targets), and cancer gene expression/drug response data.</p>
<h2 id="challenges-limitations-and-future-directions">Challenges, Limitations, and Future Directions</h2>
<h3 id="current-challenges">Current Challenges</h3>
<p>The review identifies several persistent limitations:</p>
<ol>
<li><strong>Data efficiency</strong>: Despite transfer learning, transformer CLMs still require substantial pre-training data, and labeled datasets for specific tasks remain scarce</li>
<li><strong>Interpretability</strong>: The complexity of transformer architectures makes it difficult to understand how specific molecular features contribute to predictions</li>
<li><strong>Computational cost</strong>: Training large-scale models demands significant GPU resources, limiting accessibility</li>
<li><strong>Handling rare molecules</strong>: Models struggle with molecular structures that deviate significantly from training data distributions</li>
<li><strong>SMILES limitations</strong>: Non-unique representations, invalid strings, exceeded atom valency, and inadequate spatial information capture</li>
</ol>
<h3 id="smiles-representation-issues">SMILES Representation Issues</h3>
<p>The authors highlight five specific problems with SMILES as an input representation:</p>
<ul>
<li>Non-canonical representations reduce string uniqueness for the same molecule</li>
<li>Many symbol combinations produce chemically invalid outputs</li>
<li>Valid SMILES strings can encode chemically impossible molecules (e.g., exceeded valency)</li>
<li>Spatial information is inadequately captured</li>
<li>Syntactic and semantic robustness is limited</li>
</ul>
<h3 id="future-research-directions">Future Research Directions</h3>
<p>The review proposes several directions:</p>
<ul>
<li><strong>Alternative molecular representations</strong>: Exploring <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, IUPAC, and InChI beyond SMILES</li>
<li><strong>Role of SMILES token types</strong>: Strategic masking of metals, non-metals, bonds, and branches during MLM pre-training to identify which components are most critical</li>
<li><strong>Few-shot learning</strong>: Combining few-shot approaches with large-scale pre-trained CLMs for data-scarce scenarios</li>
<li><strong>Drug repurposing</strong>: Training CLMs to distinguish identical compounds with different biological activity profiles across therapeutic domains</li>
<li><strong>Improved benchmarks</strong>: Incorporating disease-specific datasets (malaria, cancer, COVID-19) for more realistic evaluation</li>
<li><strong>Ethical considerations</strong>: Addressing dual-use risks, data biases, and responsible open-source release of CLMs</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a literature review paper. It does not introduce new models, code, or experimental results. The reproducibility assessment focuses on the accessibility of the reviewed works and proposed datasets.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20</td>
          <td>5.5B+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>100M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL</td>
          <td>2M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>642 to 437,929</td>
          <td>Standard benchmark suite</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>COVID-19 drug compounds</td>
          <td>740</td>
          <td>From Harigua-Souiai et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cocrystal formation</td>
          <td>3,282</td>
          <td>From Mswahili et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Antimalarial drugs</td>
          <td>4,794</td>
          <td>From Mswahili et al. (2024)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cancer gene/drug response</td>
          <td>201 drugs, 734 cell lines</td>
          <td>From Kim et al. (2021)</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://dai.chungbuk.ac.kr/">DAI Lab website</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Authors&rsquo; research lab</td>
      </tr>
  </tbody>
</table>
<p>No code, models, or evaluation scripts are released with this review. The paper does not include a supplementary materials section or GitHub repository.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (literature review).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Mswahili, M. E., &amp; Jeong, Y.-S. (2024). Transformer-based models for chemical SMILES representation: A comprehensive literature review. <em>Heliyon</em>, 10(20), e39038. <a href="https://doi.org/10.1016/j.heliyon.2024.e39038">https://doi.org/10.1016/j.heliyon.2024.e39038</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mswahili2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based models for chemical {SMILES} representation: A comprehensive literature review}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Mswahili, Medard Edmund and Jeong, Young-Seob}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Heliyon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e39038}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.heliyon.2024.e39038}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Survey of Transformer Architectures in Molecular Science</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformers-molecular-science-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/transformers-molecular-science-review/</guid><description>A comprehensive review of 12 transformer architectures applied to molecular science, covering GPT, BERT, BART, graph transformers, and more.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-architectures-for-molecular-science">A Systematization of Transformer Architectures for Molecular Science</h2>
<p>This paper is a <strong>Systematization</strong> review. It organizes and taxonomizes 12 families of transformer architectures that have been applied across molecular science, including chemistry, biology, and drug discovery. The primary contribution is not a new method or dataset, but a structured technical overview of the algorithmic internals of each transformer variant and their specific applications to molecular problems. The review covers 201 references and provides a unified treatment of how these architectures capture molecular patterns from sequential, graphical, and image-based data.</p>
<h2 id="bridging-the-gap-between-transformer-variants-and-molecular-applications">Bridging the Gap Between Transformer Variants and Molecular Applications</h2>
<p>Transformer-based models have become widespread in molecular science, yet the authors identify a gap: there is no organized taxonomy linking these diverse techniques in the existing literature. Individual papers introduce specific architectures or applications, but practitioners lack a unified reference that explains the technical differences between GPT, BERT, BART, graph transformers, and other variants in the context of molecular data. The review aims to fill this gap by providing an in-depth investigation of the algorithmic components of each model family, explaining how their architectural innovations contribute to processing complex molecular data. The authors note that the success of transformers in molecular science stems from several factors: the sequential nature of chemical and biological molecules (DNA, RNA, proteins, SMILES strings), the attention mechanism&rsquo;s ability to capture long-range dependencies within molecular structures, and the capacity for transfer learning through pre-training on large chemical and biological datasets.</p>
<h2 id="twelve-transformer-families-and-their-molecular-mechanisms">Twelve Transformer Families and Their Molecular Mechanisms</h2>
<p>The review covers transformer preliminaries before diving into 12 specific architecture families. The core self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimension of the key vectors. The position-wise feed-forward network is:</p>
<p>$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$</p>
<p>The 12 architecture families covered are:</p>
<ol>
<li>
<p><strong>GPT (Generative Pre-trained Transformer)</strong>: Uses the decoder part of the transformer for autoregressive generation. Applications include MolGPT for molecular generation, DrugGPT for protein-ligand binding, and cMolGPT for target-specific de novo molecular generation.</p>
</li>
<li>
<p><strong>BERT (Bidirectional Encoder Representations from Transformers)</strong>: Uses transformer encoders with masked language modeling and next-sentence prediction for pre-training. Molecular applications include FP-BERT for molecular property prediction using composite fingerprint representations, Graph-BERT for protein-protein interaction identification, SMILES-BERT, and Mol-BERT.</p>
</li>
<li>
<p><strong>BART (Bidirectional and Auto-Regressive Transformers)</strong>: Functions as a denoising autoencoder with both encoder and decoder. Molecular applications include Chemformer for sequence-to-sequence chemistry tasks, MS2Mol for mass spectrometry analysis, and MolBART for molecular feature learning.</p>
</li>
<li>
<p><strong>Graph Transformer</strong>: Leverages self-attention on graph-structured data to capture global context. Applications include GraphSite for protein-DNA binding site prediction (using AlphaFold2 structure predictions), KPGT for knowledge-guided molecular graph pre-training, and PAGTN for establishing long-range dependencies in molecular graphs.</p>
</li>
<li>
<p><strong>Transformer-XL</strong>: Incorporates relative positional encoding for modeling long sequences. Used for small molecule retention time prediction, drug design with CHEMBL data (1.27 million molecules), and Heck reaction generation.</p>
</li>
<li>
<p><strong><a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5 (Text-to-Text Transfer Transformer)</a></strong>: Unifies NLP tasks into text-to-text mapping. T5Chem was pre-trained on 97 million molecules from PubChem and achieved 99.5% accuracy on reaction classification (USPTO 500 MT). C5T5 uses IUPAC naming for molecular optimization in drug discovery.</p>
</li>
<li>
<p><strong>Vision Transformer (ViT)</strong>: Applies transformer architecture to image patches. Used for organic molecule classification (97% accuracy with WGAN-generated data), bacterial identification via SERS, and molecular property prediction from mass spectrometry data (TransG-Net).</p>
</li>
<li>
<p><strong>DETR (Detection Transformer)</strong>: End-to-end object detection using transformers. Applied to cryo-EM particle picking (TransPicker), molecular structure image recognition (IMG2SMI), and cell segmentation (Cell-DETR).</p>
</li>
<li>
<p><strong>Conformer</strong>: Integrates convolutional modules into transformer structure. Used for DNA storage error correction (RRCC-DNN), drug-target affinity prediction (NG-DTA with Davis and Kiba datasets).</p>
</li>
<li>
<p><strong>CLIP (Contrastive Language-Image Pre-training)</strong>: Multimodal learning linking text and images. Applied to peptide design (Cut&amp;CLIP for protein degradation), gene identification (pathCLIP), and drug discovery (CLOOME for zero-shot transfer learning).</p>
</li>
<li>
<p><strong>Sparse Transformers</strong>: Use sparse attention matrices to reduce complexity to $O(n\sqrt{n})$. Applied to drug-target interaction prediction with gated cross-attention mechanisms.</p>
</li>
<li>
<p><strong>Mobile and Efficient Transformers</strong>: Compressed variants (TinyBERT, MobileBERT) for resource-constrained environments. Molormer uses ProbSparse self-attention for drug-drug interaction prediction. LOGO is a lightweight pre-trained language model for non-coding genome interpretation.</p>
</li>
</ol>
<h2 id="survey-organization-and-coverage-of-molecular-domains">Survey Organization and Coverage of Molecular Domains</h2>
<p>As a survey paper, this work does not present new experiments. Instead, it catalogues existing applications across multiple molecular domains:</p>
<p><strong>Drug Discovery and Design</strong>: GPT-based ligand design (DrugGPT), BART-based molecular generation (Chemformer, MolBART), graph transformer pre-training for molecular property prediction (KPGT), T5-based chemical reaction prediction (T5Chem), and sparse transformer methods for drug-target interactions.</p>
<p><strong>Protein Science</strong>: BERT-based protein-protein interaction prediction (Graph-BERT), graph transformer methods for protein-DNA binding (GraphSite with AlphaFold2 integration), conformer-based drug-target affinity prediction (NG-DTA), and CLIP-based peptide design (Cut&amp;CLIP).</p>
<p><strong>Molecular Property Prediction</strong>: FP-BERT for fingerprint-based prediction, SMILES-BERT and Mol-BERT for end-to-end prediction from SMILES, KPGT for knowledge-guided graph pre-training, and Transformer-XL for property modeling with relative positional encoding.</p>
<p><strong>Structural Biology</strong>: DETR-based cryo-EM particle picking (TransPicker), vision transformer applications in cell imaging, and Cell-DETR for instance segmentation in microscopy.</p>
<p><strong>Genomics</strong>: Conformer-based DNA storage error correction (RRCC-DNN), LOGO for non-coding genome interpretation, and MetaTransformer for metagenomic sequencing analysis.</p>
<h2 id="future-directions-and-limitations-of-the-survey">Future Directions and Limitations of the Survey</h2>
<p>The review concludes with four future directions:</p>
<ol>
<li>
<p><strong>ChatGPT integration into molecular science</strong>: Using LLMs for data analysis, literature review, and hypothesis generation in chemistry and biology.</p>
</li>
<li>
<p><strong>Multifunction transformers</strong>: Models that extract features across diverse molecular structures and sequences simultaneously.</p>
</li>
<li>
<p><strong>Molecular-aware transformers</strong>: Architectures that handle multiple data types (text, sequence, structure, image, energy, molecular dynamics, function) in a unified framework.</p>
</li>
<li>
<p><strong>Self-assessment transformers and superintelligence</strong>: Speculative discussion of models that learn from seemingly unrelated data sources.</p>
</li>
</ol>
<p>The review has several limitations worth noting. The coverage is broad but shallow: each architecture family receives only 1-2 pages of discussion, and the paper largely describes existing work rather than critically evaluating it. The review does not systematically compare the architectures against each other on common benchmarks. The future directions section (particularly the superintelligence discussion) is speculative and lacks concrete proposals. The paper also focuses primarily on technical architecture descriptions rather than analyzing failure modes, scalability challenges, or reproducibility concerns across the surveyed methods. As a review article, no new data were created or analyzed.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a survey paper. No new datasets were created or used. The paper reviews applications involving datasets such as PubChem (97 million molecules for T5Chem), CHEMBL (1.27 million molecules for Transformer-XL drug design), USPTO 500 MT (reaction classification), ESOL (5,328 molecules for property prediction), and Davis/Kiba (drug-target affinity).</p>
<h3 id="algorithms">Algorithms</h3>
<p>No new algorithms are introduced. The paper provides mathematical descriptions of the core transformer components (self-attention, positional encoding, feed-forward networks, layer normalization) and describes how 12 architecture families modify these components.</p>
<h3 id="models">Models</h3>
<p>No new models are presented. The paper surveys existing models including MolGPT, DrugGPT, FP-BERT, SMILES-BERT, Chemformer, MolBART, GraphSite, KPGT, T5Chem, TransPicker, Cell-DETR, CLOOME, and Molormer, among others.</p>
<h3 id="evaluation">Evaluation</h3>
<p>No new evaluation is performed. Performance numbers cited from the literature include: T5Chem reaction classification accuracy of 99.5%, ViT organic molecule classification at 97%, Transformer-XL property prediction RMSE of 0.6 on ESOL, and Heck reaction generation feasibility rate of 47.76%.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware requirements are specified, as this is a survey paper.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/wcms.1725">Paper (open access)</a></td>
          <td>Paper</td>
          <td>CC-BY-NC-ND</td>
          <td>Open access via Wiley</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jiang, J., Ke, L., Chen, L., Dou, B., Zhu, Y., Liu, J., Zhang, B., Zhou, T., &amp; Wei, G.-W. (2024). Transformer technology in molecular science. <em>WIREs Computational Molecular Science</em>, 14(4), e1725. <a href="https://doi.org/10.1002/wcms.1725">https://doi.org/10.1002/wcms.1725</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jiang2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer technology in molecular science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jiang, Jian and Ke, Lu and Chen, Long and Dou, Bozheng and Zhu, Yueying and Liu, Jie and Zhang, Bengong and Zhou, Tianshou and Wei, Guo-Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{WIREs Computational Molecular Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e1725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Wiley}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1002/wcms.1725}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Survey of Scientific LLMs in Bio and Chem Domains</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/scientific-llm-survey-bio-chem/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/scientific-llm-survey-bio-chem/</guid><description>Survey of scientific LLMs covering textual, molecular, protein, genomic, and multimodal models for biological and chemical research.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-scientific-language-models">A Systematization of Scientific Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (survey) that provides a comprehensive review of scientific large language models (Sci-LLMs) designed for biological and chemical domains. The survey covers five main branches of scientific language modeling: textual, molecular, protein, genomic, and multimodal LLMs. For each branch, the authors analyze model architectures, capabilities, training datasets, evaluation benchmarks, and assessment criteria, then identify open challenges and future research directions.</p>
<h2 id="motivation-bridging-scientific-languages-and-llms">Motivation: Bridging Scientific Languages and LLMs</h2>
<p>Large language models have demonstrated strong capabilities in natural language understanding, but scientific research involves specialized &ldquo;languages&rdquo; that differ fundamentally from natural text. Chemical molecules are expressed as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> strings, proteins as amino acid sequences, and genomes as nucleotide sequences. Each of these language systems has its own vocabulary and grammar. General-purpose LLMs like ChatGPT and GPT-4 often fail to properly handle these scientific data types because the semantics and grammar of scientific languages diverge substantially from natural language.</p>
<p>Prior surveys have focused on individual modalities (molecules, proteins, or genomes) in isolation. No comprehensive review had unified these language modeling advances into a single framework. This survey fills that gap by systematically covering all five modalities and, notably, the emerging area of multimodal Sci-LLMs that integrate multiple scientific languages.</p>
<h2 id="taxonomy-of-scientific-language-models">Taxonomy of Scientific Language Models</h2>
<p>The survey organizes Sci-LLMs into a clear taxonomic framework built on two axes: the scientific language modality and the model architecture type.</p>
<h3 id="scientific-language-modalities">Scientific Language Modalities</h3>
<p>The authors define five categories of Sci-LLMs:</p>
<ol>
<li>
<p><strong>Text-Sci-LLMs</strong>: LLMs trained on scientific textual corpora (medical, biological, chemical, and comprehensive domains). Examples include BioBERT, BioGPT, ChemBERT, SciBERT, and <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a>.</p>
</li>
<li>
<p><strong>Mol-LLMs</strong>: Models that process molecular languages (SMILES, SELFIES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>). These include encoder-only models like <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a> for property prediction, decoder-only models like MolGPT for molecular generation, and encoder-decoder models like Molecular Transformer and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> for reaction prediction.</p>
</li>
<li>
<p><strong>Prot-LLMs</strong>: Models operating on protein amino acid sequences. The ESM series (ESM-1b, ESM-2) and ProtTrans serve as encoders for function and structure prediction, while ProGen and ProtGPT2 generate novel protein sequences.</p>
</li>
<li>
<p><strong>Gene-LLMs</strong>: Models for DNA and RNA sequences, including DNABERT, Nucleotide Transformer, HyenaDNA, and Evo, covering tasks from variant effect prediction to genome-scale sequence modeling.</p>
</li>
<li>
<p><strong>MM-Sci-LLMs</strong>: Multimodal models integrating multiple scientific data types (molecule-text, protein-text, gene-cell-text, molecule-protein), such as MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/biot5-cross-modal-biology/">BioT5</a>, Mol-Instructions, and BioMedGPT.</p>
</li>
</ol>
<h3 id="architecture-classification">Architecture Classification</h3>
<p>For each modality, models are categorized into three architecture types:</p>
<ul>
<li><strong>Encoder-only</strong>: Based on BERT/RoBERTa, these models learn fixed-size representations via masked language modeling. They excel at discriminative tasks like property prediction and classification.</li>
<li><strong>Decoder-only</strong>: Based on GPT, these models perform autoregressive generation. They are used for de novo molecule design, protein sequence generation, and DNA sequence generation.</li>
<li><strong>Encoder-decoder</strong>: Based on architectures like <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a> or BART, these handle sequence-to-sequence tasks such as reaction prediction, molecule captioning, and protein sequence-structure translation.</li>
</ul>
<h2 id="comprehensive-catalog-of-models-datasets-and-benchmarks">Comprehensive Catalog of Models, Datasets, and Benchmarks</h2>
<p>A central contribution of the survey is its exhaustive cataloging of resources across all five modalities. The authors compile detailed summary tables covering over 100 Sci-LLMs, their parameter counts, base architectures, training data, and capabilities.</p>
<h3 id="molecular-llms">Molecular LLMs</h3>
<p>The survey documents a rich landscape of Mol-LLMs:</p>
<p><strong>Encoder-only models</strong> for property prediction include <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, ChemBERTa, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, MolFormer, MG-BERT, GROVER, MAT, Uni-Mol, and others. These models are pre-trained on ZINC, PubChem, or ChEMBL datasets and fine-tuned for molecular property prediction tasks on benchmarks like <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</p>
<p><strong>Decoder-only models</strong> for molecular generation include MolGPT, SMILES GPT, iupacGPT, cMolGPT, and Taiga. These generate SMILES strings autoregressively, often combining GPT with reinforcement learning for property optimization.</p>
<p><strong>Encoder-decoder models</strong> for reaction prediction include Molecular Transformer, Retrosynthesis Transformer, Chemformer, <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a>, Graph2SMILES, and MOLGEN. These handle forward reaction prediction and retrosynthesis.</p>
<h3 id="key-datasets-surveyed">Key Datasets Surveyed</h3>
<p>The survey catalogs pre-training datasets and benchmarks for each modality:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Pre-training Sources</th>
          <th>Key Benchmarks</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text</td>
          <td>PubMed, PMC, arXiv, Semantic Scholar</td>
          <td>MMLU, MedQA, PubMedQA, SciEval</td>
      </tr>
      <tr>
          <td>Molecule</td>
          <td>ZINC, PubChem, ChEMBL, USPTO, <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>MoleculeNet, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>, SPECTRA</td>
      </tr>
      <tr>
          <td>Protein</td>
          <td>UniRef50/90/100, BFD, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB</a>, <a href="https://en.wikipedia.org/wiki/AlphaFold">AlphaFoldDB</a></td>
          <td><a href="https://en.wikipedia.org/wiki/CASP">CASP</a>, TAPE, ProteinGym, FLIP, PEER</td>
      </tr>
      <tr>
          <td>Genome</td>
          <td>GRCh38, 1000 Genomes, <a href="https://en.wikipedia.org/wiki/ENCODE">ENCODE</a></td>
          <td>NT-Bench, GenBench, BEACON</td>
      </tr>
      <tr>
          <td>Multimodal</td>
          <td>ChEBI-20, PubChemSTM, Mol-Instructions</td>
          <td>Various cross-modal retrieval and generation tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>For molecular generation, the survey details standard metrics:</p>
<ul>
<li><strong>Validity</strong>: percentage of chemically viable molecules</li>
<li><strong>Uniqueness</strong>: fraction of distinct generated structures</li>
<li><strong>Novelty</strong>: fraction not present in the training set</li>
<li><strong>Internal diversity</strong>: measured as</li>
</ul>
<p>$$
\text{IntDiv}_{p}(G) = 1 - \sqrt[p]{\frac{1}{|G|^{2}} \sum_{m_{1}, m_{2} \in G} T(m_{1}, m_{2})^{p}}
$$</p>
<p>where $T(m_{1}, m_{2})$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between molecules $m_{1}$ and $m_{2}$.</p>
<ul>
<li><strong>Frechet ChemNet Distance (FCD)</strong>: comparing distributions of generated and reference molecules</li>
</ul>
<p>$$
\text{FCD}(G, R) = | \mu_{G} - \mu_{R} |^{2} + \text{Tr}\left[\Sigma_{G} + \Sigma_{R} - 2(\Sigma_{G}\Sigma_{R})^{1/2}\right]
$$</p>
<p>For protein generation, analogous metrics include perplexity, Frechet Protein Distance (FPD), foldability (pLDDT), sequence recovery, and novelty (sequence identity).</p>
<h2 id="critical-challenges-and-future-directions">Critical Challenges and Future Directions</h2>
<p>The survey identifies four major challenges and seven future research directions for Sci-LLMs.</p>
<h3 id="challenges">Challenges</h3>
<ol>
<li>
<p><strong>Training data limitations</strong>: Sci-LLM training datasets are orders of magnitude smaller than those for general LLMs. ProGen was trained on 280M protein sequences (tens of billions of tokens), while ChatGPT used approximately 570 billion tokens. Scaling laws suggest larger datasets would improve performance, and advances in sequencing technologies may help close this gap.</p>
</li>
<li>
<p><strong>Architecture mismatch</strong>: Standard Transformer architectures face difficulties with scientific languages. Scientific sequences (proteins with hundreds or thousands of amino acids, DNA with millions of base pairs) are far longer than typical natural language sentences. Additionally, 3D structural information is critical for function prediction but does not naturally map to sequence tokens. Autoregressive generation is also a poor fit since biological sequences function as a whole rather than being read left-to-right.</p>
</li>
<li>
<p><strong>Evaluation gaps</strong>: Computational metrics for generated molecules and proteins provide only indirect quality measures. Wet-lab validation remains the gold standard but is beyond the scope of most AI research teams. Better computational evaluation methods that correlate with experimental outcomes are needed.</p>
</li>
<li>
<p><strong>Ethics</strong>: Sensitive biological data raises privacy concerns. The potential for misuse (e.g., generating harmful substances) requires careful safeguards. Algorithmic bias and equitable access to Sci-LLM benefits also demand attention.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<ol>
<li><strong>Larger-scale, cross-modal training datasets</strong> with strong semantic alignment across modalities</li>
<li><strong>Incorporating 3D structural and temporal information</strong> into language-based modeling, including structural motifs as tokens</li>
<li><strong>Integration with external knowledge sources</strong> such as <a href="https://en.wikipedia.org/wiki/Gene_Ontology">Gene Ontology</a> and chemical knowledge graphs to reduce hallucination</li>
<li><strong>Coupling with physical simulation</strong> (e.g., <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a>) to ground language models in physical reality</li>
<li><strong>Augmenting Sci-LLMs with specialized tools and agents</strong>, following the success of tool-augmented general LLMs like <a href="/notes/chemistry/llm-applications/chemcrow-augmenting-llms-chemistry-tools/">ChemCrow</a></li>
<li><strong>Development of computational evaluation metrics</strong> that are both fast and accurate, enabling rapid research iteration</li>
<li><strong>Super-alignment with human ethics</strong>, ensuring ethical reasoning is deeply integrated into Sci-LLM behavior</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a survey paper that does not present new experimental results. The authors catalog extensive datasets across five modalities (see tables in the paper for comprehensive listings). The survey itself is maintained as an open resource.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HICAI-ZJU/Scientific-LLM-Survey">Scientific-LLM-Survey GitHub</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Curated list of papers, models, and resources</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (survey paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, Q., Ding, K., Lyv, T., Wang, X., Yin, Q., Zhang, Y., Yu, J., Wang, Y., Li, X., Xiang, Z., Feng, K., Zhuang, X., Wang, Z., Qin, M., Zhang, M., Zhang, J., Cui, J., Huang, T., Yan, P., Xu, R., Chen, H., Li, X., Fan, X., Xing, H., &amp; Chen, H. (2025). Scientific Large Language Models: A Survey on Biological &amp; Chemical Domains. <em>ACM Computing Surveys</em>, 57(6), 1–38. <a href="https://doi.org/10.1145/3715318">https://doi.org/10.1145/3715318</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2025scientific,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scientific Large Language Models: A Survey on Biological \&amp; Chemical Domains}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Qiang and Ding, Keyan and Lyv, Tianwen and Wang, Xinda and Yin, Qingyu and Zhang, Yiwen and Yu, Jing and Wang, Yuhao and Li, Xiaotong and Xiang, Zhuoyi and Feng, Kehua and Zhuang, Xiang and Wang, Zeyuan and Qin, Ming and Zhang, Mengyao and Zhang, Jinlu and Cui, Jiyu and Huang, Tao and Yan, Pengju and Xu, Renjun and Chen, Hongyang and Li, Xiaolin and Fan, Xiaohui and Xing, Huabin and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACM Computing Surveys}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{57}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3715318}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPMM: A Bidirectional Molecular Foundation Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/</guid><description>SPMM is a multimodal molecular foundation model that aligns SMILES structures with property vectors for bidirectional generation and prediction tasks.</description><content:encoded><![CDATA[<h2 id="a-multimodal-foundation-model-for-structure-property-comprehension">A Multimodal Foundation Model for Structure-Property Comprehension</h2>
<p>This is a <strong>Method</strong> paper that introduces the Structure-Property Multi-Modal foundation model (SPMM), a transformer-based architecture that treats SMILES strings and molecular property vectors (PVs) as two separate modalities and learns to align them in a shared embedding space. The primary contribution is enabling bidirectional generation through a single pre-trained model: given a property vector, SPMM can generate molecules (inverse-QSAR), and given a SMILES string, it can predict all 53 properties simultaneously. The model also transfers to unimodal downstream tasks including MoleculeNet benchmarks and reaction prediction.</p>
<h2 id="bridging-the-gap-between-molecular-structure-and-properties">Bridging the Gap Between Molecular Structure and Properties</h2>
<p>Existing chemical pre-trained models typically learn representations from a single modality (SMILES, graphs, or fingerprints) and fine-tune for specific downstream tasks. While some approaches have attempted multimodal learning by combining SMILES with graph representations or InChI strings, these modalities encode nearly identical structural information, limiting the potential for emergent cross-modal knowledge.</p>
<p>The key gap SPMM addresses is the lack of multimodal pre-training that incorporates genuinely complementary modalities. Prior conditional molecule generation methods could typically control only a small number of properties simultaneously and required retraining when target properties changed. The authors draw on successes in vision-language pre-training (VLP), where aligning image and text modalities has enabled rich bidirectional understanding, and apply similar ideas to molecular structure and property domains.</p>
<h2 id="treating-property-vectors-as-a-language">Treating Property Vectors as a Language</h2>
<p>The core innovation in SPMM is treating a collection of 53 RDKit-computed molecular properties as a &ldquo;language&rdquo; where each property value is analogous to a word token. This design allows the model to attend to individual properties independently rather than treating the entire property vector as a single fixed-length condition.</p>
<h3 id="dual-stream-architecture">Dual-Stream Architecture</h3>
<p>SPMM follows the dual-stream VLP architecture. The model has three components:</p>
<ol>
<li><strong>SMILES Encoder</strong>: 6 BERT-base layers that encode tokenized SMILES (using a 300-subword BPE vocabulary) via self-attention</li>
<li><strong>PV Encoder</strong>: 6 BERT-base layers that encode the 53 normalized property values (each passed through a linear layer) with learnable positional embeddings</li>
<li><strong>Fusion Encoder</strong>: 6 BERT-base layers with cross-attention that combines both modalities, using one modality&rsquo;s features as queries and the other as keys/values</li>
</ol>
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>The model is pre-trained with four complementary losses:</p>
<p><strong>Contrastive Learning</strong> aligns SMILES and PV features in a shared embedding space. For [CLS] token outputs $\mathbf{S}_{cls}$ and $\mathbf{P}_{cls}$:</p>
<p>$$
\text{sim}(\mathbf{S}, \mathbf{P}) = \left(h_{S}(\mathbf{S}_{cls})\right)^{\top} h_{P}(\mathbf{P}_{cls})
$$</p>
<p>The intermodal similarities are computed with a learnable temperature $\tau$:</p>
<p>$$
s_{s2p} = \frac{\exp(\text{sim}(\mathbf{S}, \mathbf{P}) / \tau)}{\sum_{n=1}^{N} \exp(\text{sim}(\mathbf{S}, \mathbf{P}_{n}) / \tau)}
$$</p>
<p>The contrastive loss uses cross-entropy with one-hot labels (1 for same-molecule pairs):</p>
<p>$$
L_{\text{contrastive}} = \frac{1}{2}\left(H(y_{s2p}, s_{s2p}) + H(y_{p2s}, s_{p2s}) + H(y_{s2s}, s_{s2s}) + H(y_{p2p}, s_{p2p})\right)
$$</p>
<p><strong>Next Word Prediction (NWP)</strong> trains autoregressive SMILES generation conditioned on the PV:</p>
<p>$$
L_{NWP} = \sum_{i=1}^{n} H\left(y_{n}^{NWP}, p^{NWP}(s_{n} \mid s_{0:n-1}, \mathbf{P})\right)
$$</p>
<p><strong>Next Property Prediction (NPP)</strong> applies the same autoregressive concept to property values, using mean-square-error loss:</p>
<p>$$
L_{NPP} = \sum_{i=1}^{n} \left(p_{n} - \hat{p}_{n}(p_{0:n-1}, \mathbf{S})\right)^{2}
$$</p>
<p><strong>SMILES-PV Matching (SPM)</strong> is a binary classification loss predicting whether a SMILES-PV pair originated from the same molecule, trained with hard-negative mining.</p>
<p>The overall pre-training loss combines all four:</p>
<p>$$
L = \widetilde{L}_{\text{contrastive}} + \widetilde{L}_{NWP} + L_{NPP} + L_{SPM}
$$</p>
<p>where tildes indicate the use of momentum teacher distillation to soften one-hot labels, acknowledging that multiple valid SMILES-PV pairings may exist.</p>
<h3 id="random-property-masking">Random Property Masking</h3>
<p>During pre-training, 50% of property values are randomly replaced with a special [UNK] token. This serves three purposes: preventing overfitting to specific properties, augmenting data, and enabling flexible inference where users can specify any subset of the 53 properties as generation conditions. The model can handle all $2^{53}$ possible property combinations at inference time despite never seeing most of them during training.</p>
<h2 id="experiments-across-bidirectional-and-unimodal-tasks">Experiments Across Bidirectional and Unimodal Tasks</h2>
<h3 id="pv-to-smiles-generation-conditional-molecule-design">PV-to-SMILES Generation (Conditional Molecule Design)</h3>
<p>The authors evaluate SPMM on multiple generation scenarios using 1000 unseen PubChem PVs:</p>
<table>
  <thead>
      <tr>
          <th>Sampling</th>
          <th>Input PV</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Norm. RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deterministic</td>
          <td>1000 unseen PVs</td>
          <td>0.995</td>
          <td>0.999</td>
          <td>0.961</td>
          <td>0.216</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Full PV (molecule 1)</td>
          <td>0.974</td>
          <td>0.905</td>
          <td>0.998</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Molar mass = 150</td>
          <td>0.974</td>
          <td>0.945</td>
          <td>0.872</td>
          <td>0.192</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>4 properties controlled</td>
          <td>0.998</td>
          <td>0.981</td>
          <td>0.952</td>
          <td>0.257</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>No control (all [UNK])</td>
          <td>0.971</td>
          <td>0.991</td>
          <td>0.950</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The normalized RMSE of 0.216 across 53 properties indicates that generated molecules closely match the input property conditions. The model can also perform unconditional generation (all properties masked) where outputs follow the pre-training distribution. The authors report that SPMM outperforms benchmark models including MolGAN, GraphVAE, and scaffold-based graph generative models in both conditional and unconditional settings (Supplementary Table 1).</p>
<h3 id="smiles-to-pv-generation-multi-property-prediction">SMILES-to-PV Generation (Multi-Property Prediction)</h3>
<p>When given 1000 unseen ZINC15 molecules, SPMM predicts all 53 properties autoregressively with a mean $r^{2}$ of 0.924 across all properties.</p>
<h3 id="moleculenet-benchmarks">MoleculeNet Benchmarks</h3>
<p>Using only the SMILES encoder (6 BERT layers), SPMM achieves best or competitive performance on 9 <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SPMM</th>
          <th>Best Baseline</th>
          <th>Baseline Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.817</td>
          <td>0.798</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>LIPO</td>
          <td>RMSE</td>
          <td>0.681</td>
          <td>0.660</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>1.868</td>
          <td>1.877</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>BACE (reg)</td>
          <td>RMSE</td>
          <td>1.041</td>
          <td>1.047</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Clearance</td>
          <td>RMSE</td>
          <td>42.607</td>
          <td>43.175</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>AUROC</td>
          <td>75.1%</td>
          <td>73.6%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BACE (cls)</td>
          <td>AUROC</td>
          <td>84.4%</td>
          <td>86.3%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>AUROC</td>
          <td>92.7%</td>
          <td>91.2%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>AUROC</td>
          <td>66.9%</td>
          <td>67.2%</td>
          <td>ChemRL-GEM</td>
      </tr>
  </tbody>
</table>
<p>SPMM achieved best performance on 5 of 9 tasks, with notable gains on BBBP (75.1% vs. 73.6%) and ClinTox (92.7% vs. 91.2%). Without pre-training, all scores dropped substantially.</p>
<h3 id="dili-classification">DILI Classification</h3>
<p>On Drug-Induced Liver Injury prediction, SPMM achieved 92.6% AUROC, outperforming the 5-ensemble model of Ai et al. (90.4% AUROC) while using a single model.</p>
<h3 id="reaction-prediction">Reaction Prediction</h3>
<p>On USPTO-480k forward reaction prediction, SPMM achieved 91.5% top-1 accuracy, the highest among all models tested (including <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> at 91.3%). On USPTO-50k retro-reaction prediction, SPMM reached 53.4% top-1 accuracy, second only to Chemformer (54.3%) among string-based models.</p>
<h2 id="bidirectional-generation-from-a-single-pre-trained-model">Bidirectional Generation From a Single Pre-trained Model</h2>
<p>SPMM demonstrates that multimodal pre-training with genuinely complementary modalities (structure and properties, rather than structurally redundant representations) enables a single foundation model to handle both generation directions and downstream unimodal tasks. Key findings include:</p>
<ol>
<li><strong>Flexible conditional generation</strong>: The [UNK] masking strategy allows controlling any subset of 53 properties at inference time without retraining, a capability not demonstrated by prior methods.</li>
<li><strong>Interpretable cross-attention</strong>: Attention visualizations show that the model learns chemically meaningful structure-property relationships (e.g., hydrogen bonding properties attend to oxygen and nitrogen atoms; ring count properties attend to ring tokens).</li>
<li><strong>Competitive unimodal transfer</strong>: Despite using only 6 BERT layers and 50M pre-training molecules (smaller than <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>&rsquo;s 77M or Chemformer&rsquo;s 100M), the SMILES encoder alone achieves best or second-best results on 5 of 9 MoleculeNet tasks and the highest forward reaction prediction accuracy among tested models.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>SMILES representation constraints</strong>: Implicit connectivity information in SMILES means small structural changes can cause drastic string changes. Graph representations could be a complementary alternative.</li>
<li><strong>Stereochemistry blindness</strong>: All 53 RDKit properties used are invariant to stereochemistry, meaning different stereoisomers produce identical PVs. The contrastive loss then forces their SMILES encoder outputs to converge, which the authors identify as the primary factor limiting MoleculeNet performance on stereo-sensitive tasks.</li>
<li><strong>No wet-lab validation</strong>: Generated molecules and predicted properties are not experimentally verified.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>50M molecules</td>
          <td>SMILES + 53 RDKit properties</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>642-4200 per task</td>
          <td>Scaffold split via DeepChem (8:1:1)</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>Ai et al. dataset</td>
          <td>Not specified</td>
          <td>Following published preparation</td>
      </tr>
      <tr>
          <td>Forward reaction</td>
          <td>USPTO-480k</td>
          <td>479,035 pairs</td>
          <td>Reactant-product pairs</td>
      </tr>
      <tr>
          <td>Retro reaction</td>
          <td>USPTO-50k</td>
          <td>50,037 pairs</td>
          <td>Product-reactant pairs, no reaction types used</td>
      </tr>
      <tr>
          <td>SMILES-to-PV test</td>
          <td>ZINC15</td>
          <td>1000 molecules</td>
          <td>Not in pre-training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: BPE with 300-subword dictionary</li>
<li><strong>Property masking</strong>: 50% random replacement with [UNK] during pre-training</li>
<li><strong>Momentum distillation</strong>: EMA parameter $\lambda = 0.995$, soft-label mixing $\alpha$ linearly warmed from 0 to 0.4 over first epoch</li>
<li><strong>Contrastive queue</strong>: Size $k = 24{,}576$ for storing recent SMILES and PV instances</li>
<li><strong>Beam search</strong>: $k = 2$ for PV-to-SMILES generation</li>
<li><strong>SMILES augmentation</strong>: Random non-canonical augmentation with probability 0.5 for reaction tasks</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: 6 BERT-base encoder layers each for SMILES encoder, PV encoder, and fusion encoder (18 total layers)</li>
<li><strong>Vocabulary</strong>: 300 BPE subwords for SMILES; 53 property tokens for PV</li>
<li><strong>Pre-trained weights</strong>: Available via GitHub</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Validity</td>
          <td>99.5%</td>
          <td>1000 unseen PubChem PVs</td>
      </tr>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Normalized RMSE</td>
          <td>0.216</td>
          <td>Across 53 properties</td>
      </tr>
      <tr>
          <td>SMILES-to-PV</td>
          <td>Mean $r^{2}$</td>
          <td>0.924</td>
          <td>1000 ZINC15 molecules</td>
      </tr>
      <tr>
          <td>Forward reaction (USPTO-480k)</td>
          <td>Top-1 accuracy</td>
          <td>91.5%</td>
          <td>Best among all tested models</td>
      </tr>
      <tr>
          <td>Retro reaction (USPTO-50k)</td>
          <td>Top-1 accuracy</td>
          <td>53.4%</td>
          <td>Second-best string-based</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>AUROC</td>
          <td>92.6%</td>
          <td>Single model vs. 5-ensemble</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Pre-training</strong>: 8 NVIDIA A100 GPUs, approximately 52,000 batch iterations, roughly 12 hours</li>
<li><strong>Batch size</strong>: 96</li>
<li><strong>Optimizer</strong>: AdamW with weight decay 0.02</li>
<li><strong>Learning rate</strong>: Warmed up to $10^{-4}$, cosine decay to $10^{-5}$</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jinhojsk515/SPMM">SPMM Source Code</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with experimental scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10567599">SPMM Zenodo Archive</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Archived version for reproducibility</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>50M molecules for pre-training</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Varies</td>
          <td>Benchmark datasets via DeepChem</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, J., &amp; Ye, J. C. (2024). Bidirectional generation of structure and properties through a single molecular foundation model. <em>Nature Communications</em>, 15, 2323. <a href="https://doi.org/10.1038/s41467-024-46440-3">https://doi.org/10.1038/s41467-024-46440-3</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chang2024bidirectional,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Bidirectional generation of structure and properties through a single molecular foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chang, Jinho and Ye, Jong Chul}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2323}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-46440-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Smirk: Complete Tokenization for Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smirk-tokenization-molecular-models/</guid><description>Smirk tokenizer achieves full OpenSMILES coverage with 165 tokens by decomposing bracketed atoms into glyphs, validated via n-gram proxy models.</description><content:encoded><![CDATA[<h2 id="a-method-for-complete-chemical-tokenization">A Method for Complete Chemical Tokenization</h2>
<p>This is a <strong>Method</strong> paper that introduces two new tokenizers for molecular foundation models: Smirk and Smirk-GPE. The primary contribution is a tokenization scheme that achieves complete coverage of the OpenSMILES specification using only 165 tokens, addressing the vocabulary gaps present in existing atom-wise tokenizers. The paper also proposes n-gram language models as low-cost proxy evaluators for tokenizer quality and validates these proxies against 18 transformer-based models across multiple benchmarks.</p>
<h2 id="vocabulary-gaps-in-molecular-tokenization">Vocabulary Gaps in Molecular Tokenization</h2>
<p>Molecular foundation models overwhelmingly use &ldquo;atom-wise&rdquo; tokenization, where SMILES strings are split at atom boundaries using a regular expression first proposed by Schwaller et al. A key pattern in this regex treats all &ldquo;bracketed atoms&rdquo; (e.g., <code>[C@@H]</code>, <code>[18F]</code>, <code>[Au+]</code>) as single, irreducible tokens. Since bracketed atoms encode isotopes, chirality, charge, hydrogen count, and element identity, the number of possible permutations under the OpenSMILES specification exceeds 28 trillion. In practice, existing atom-wise tokenizers maintain vocabularies of fewer than 3,000 tokens, leaving large portions of chemical space unrepresentable.</p>
<p>This gap has real consequences. Many chemistry-specific tokenizers emit the unknown token <code>[UNK]</code> at non-negligible frequencies, particularly on datasets with diverse elements and stereochemistry. For example, <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SPE and APE</a> tokenizers produce <code>[UNK]</code> for roughly 19% of tokens on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and approximately 50% on the tmQM transition metal complex dataset. Even models like <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/reactiont5-pretrained-limited-reaction-data/">ReactionT5</a> lack tokens for elements such as copper, ruthenium, gold, and uranium.</p>
<p>The authors also note a subtler issue: some open-vocabulary tokenizers (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa&rsquo;s</a> BPE) conflate chemically distinct entities. The same <code>Sc</code> token may represent both a sulfur-carbon bond (in organic SMILES) and the element scandium (in <code>[Sc]</code>), creating ambiguity in downstream analysis.</p>
<h2 id="smirk-glyph-level-decomposition-of-smiles">Smirk: Glyph-Level Decomposition of SMILES</h2>
<p>The core insight behind Smirk is to fully decompose bracketed atoms into their constituent &ldquo;glyphs,&rdquo; the primitive symbols defined by the OpenSMILES specification (element symbols, chirality markers, charges, isotope numbers, hydrogen counts, and brackets themselves). This transforms tokenization from a word-level scheme (one token per bracketed atom) to a character-level scheme over chemically meaningful glyphs.</p>
<p>Smirk uses a two-stage tokenization process:</p>
<ol>
<li><strong>Atom decomposition</strong>: Split a SMILES string into atom-level units using a regex (e.g., <code>OC[C@@H][OH]</code> becomes <code>O C [C@@H] [OH]</code>).</li>
<li><strong>Glyph decomposition</strong>: Further split each unit into its constituent glyphs (e.g., <code>[C@@H]</code> becomes <code>[ C @@ H ]</code>).</li>
</ol>
<p>The two-stage process is necessary to resolve ambiguities. For example, <code>Sc</code> in an unbracketed context represents a sulfur-carbon bond, while <code>[Sc]</code> denotes scandium. This ambiguity occurs over half a million times in PubChem&rsquo;s compound dataset.</p>
<p>The resulting vocabulary contains only 165 tokens, requires no training, and by construction can faithfully tokenize any molecule that conforms to the OpenSMILES specification. The implementation is written in Rust using HuggingFace&rsquo;s Tokenizers library and is available on PyPI.</p>
<p><strong>Smirk-GPE</strong> (Glyph Pair Encoding) extends Smirk with a <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">BPE</a>-like compression step. After Smirk tokenization, adjacent tokens are merged using learned rules, reducing sequence length. Unlike standard BPE, merges operate on token IDs rather than character strings, preserving the distinction between chemically different entities that happen to share the same characters. Smirk-GPE was trained on 262 million molecules from Enamine REAL Space with a target vocabulary of 50,000 tokens, though training terminated at 2,300 tokens after exhausting all possible merges.</p>
<h2 id="evaluation-framework-intrinsic-metrics-n-gram-proxies-and-transformer-benchmarks">Evaluation Framework: Intrinsic Metrics, N-Gram Proxies, and Transformer Benchmarks</h2>
<p>The evaluation covers 34 tokenizers across three datasets (Enamine REALSpace, MoleculeNet, and tmQM) using both intrinsic and extrinsic metrics.</p>
<h3 id="intrinsic-metrics">Intrinsic Metrics</h3>
<p>Four intrinsic metrics are computed for each tokenizer:</p>
<p><strong>Fertility</strong> measures the mean tokenized sequence length. Higher fertility increases computational cost due to the quadratic scaling of attention:</p>
<p>$$
\text{cost} \propto \text{fertility}^2
$$</p>
<p><strong>Normalized entropy</strong> quantifies how close a tokenizer comes to the information-theoretic ideal where all tokens are equally probable:</p>
<p>$$
\eta = \frac{-1}{\log |V|} \sum_{x \in V} p(x) \log p(x)
$$</p>
<p>where $V$ is the vocabulary and $p(x)$ is the observed token probability. Higher normalized entropy correlates with better downstream performance.</p>
<p><strong>Token imbalance</strong> measures the distance between observed token frequencies and a uniform distribution:</p>
<p>$$
D = \frac{1}{2} \sum_{x \in V} |p(x) - |V|^{-1}|
$$</p>
<p><strong>Unknown token frequency</strong> captures the fraction of emitted tokens that are <code>[UNK]</code>. This metric is particularly revealing: all existing chemistry-specific tokenizers (SPE/APE, atom-wise, BPE, and Unigram variants) emit <code>[UNK]</code> at non-negligible rates, while NLP tokenizers, Smirk, and Smirk-GPE do not.</p>
<h3 id="n-gram-proxy-language-models">N-Gram Proxy Language Models</h3>
<p>The paper proposes using n-gram models as low-cost proxies for transformer-based evaluation. An n-gram estimates token likelihood with <a href="https://en.wikipedia.org/wiki/Additive_smoothing">add-one smoothing</a>:</p>
<p>$$
P_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}) = \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|}
$$</p>
<p>where $C$ is the count function and $|V|$ is the vocabulary size. N-grams were &ldquo;pretrained&rdquo; on 1.6 billion SMILES from Enamine REAL Space and evaluated on validation splits. Cross-entropy loss and information loss from unknown tokens were computed.</p>
<p>To quantify information lost to <code>[UNK]</code> tokens, the authors compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a> between token distributions with and without unknown tokens, using a bidirectional character n-gram model:</p>
<p>$$
B_{n}(x_{i} \mid x_{i-n+1}, \dots, x_{i-1}, x_{i+1}, \dots, x_{i+n-1}) \propto \frac{C(x_{i-n+1}, \dots, x_{i}) + 1}{C(x_{i-n+1}, \dots, x_{i-1}) + |V|} \times \frac{C(x_{i}, \dots, x_{i+n-1}) + 1}{C(x_{i+1}, \dots, x_{i+n-1}) + |V|}
$$</p>
<h3 id="transformer-experiments">Transformer Experiments</h3>
<p>Eighteen encoder-only RoBERTa models (25M parameters each, excluding embeddings) were pretrained from scratch using masked language modeling on Enamine REAL Space (245M molecules, 30,000 steps). Each model used a different tokenizer, isolating the tokenizer&rsquo;s effect on performance. Finetuning was conducted on six regression and seven classification tasks from MoleculeNet and tmQM.</p>
<p>Linear fixed-effects models were used to estimate the standardized effect of each tokenization scheme relative to an atom-wise SMILES baseline.</p>
<h2 id="key-findings-and-practical-implications">Key Findings and Practical Implications</h2>
<h3 id="tokenizer-performance">Tokenizer Performance</h3>
<ul>
<li><strong>Smirk</strong> shows a positive effect on pretraining quality and downstream performance on tmQM (the dataset with the most bracketed atoms), but performs comparably to atom-wise tokenization on MoleculeNet tasks.</li>
<li><strong>SPE and APE</strong> tokenizers have a negative impact on both pretraining and downstream performance relative to the atom-wise baseline, likely due to their high <code>[UNK]</code> rates.</li>
<li><strong>Molecular encoding choice</strong> (<a href="/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/">SMILES vs. SELFIES</a>) has a negligible effect on performance.</li>
<li><strong>NLP tokenizers</strong> (GPT-4o, LLaMA, Gemma) score comparably to chemistry-specific tokenizers on intrinsic metrics and do not emit unknown tokens.</li>
</ul>
<h3 id="n-gram-proxy-validation">N-Gram Proxy Validation</h3>
<p>N-gram cross-entropy and information loss metrics show strong rank correlation (Spearman&rsquo;s $\rho$) with downstream transformer performance, validating their use as low-cost evaluation proxies. The effect sizes from n-gram and transformer experiments are directionally consistent.</p>
<h3 id="information-loss-from-unknown-tokens">Information Loss from Unknown Tokens</h3>
<p>Information loss is minimal for tokenizers with robust coverage but substantial for tokenizers with limited vocabularies on chemically diverse datasets. <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> incurs only 0.1 nats/molecule on MoleculeNet but 40.3 nats/molecule on tmQM. Open-vocabulary tokenizers (Smirk, Smirk-GPE, NLP tokenizers) mitigate this degradation.</p>
<h3 id="practical-recommendations">Practical Recommendations</h3>
<p>The authors argue that molecular foundation models must encode the entire breadth of chemical space or risk obscuring critical features. Bracketed atoms encode information essential to clinically relevant pharmaceuticals (e.g., <a href="https://en.wikipedia.org/wiki/Amoxicillin">Amoxicillin</a>), industrial compounds (e.g., Tricalcium Silicate), and foundational chemistry (e.g., <a href="https://en.wikipedia.org/wiki/Cisplatin">Cisplatin</a>, where omitting the chiral marker erases medically relevant stereochemical information). The paper encourages the community to adopt open-vocabulary tokenizers and develop more chemically diverse benchmarks.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The analysis uses a single-point evaluation for transformer experiments, which may underestimate performance achievable with additional hyperparameter tuning.</li>
<li>Smirk-GPE&rsquo;s learned merges from REALSpace did not fully generalize to tmQM, as indicated by the token imbalance metric.</li>
<li>Current benchmarks (MoleculeNet) lack sufficient diversity to evaluate tokenizer robustness across the full periodic table, isotopes, charged species, and uncommon bond types.</li>
<li>The downstream impact of token ambiguities in BPE-based tokenizers (e.g., ChemBERTa&rsquo;s conflation of <code>Sc</code> as both sulfur-carbon and scandium) remains unclear.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>Enamine REAL Space</td>
          <td>1.6B SMILES (n-gram), 245M molecules (transformer)</td>
          <td>80/10/10 train/val/test split</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet</td>
          <td>Multiple tasks</td>
          <td>6 regression + 7 classification tasks</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>tmQM</td>
          <td>108K transition metal complexes</td>
          <td>OpenSMILES molecular encodings</td>
      </tr>
      <tr>
          <td>Smirk-GPE training</td>
          <td>Enamine REAL Space (subset)</td>
          <td>262M molecules</td>
          <td>Training split only</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Smirk</strong>: Two-stage regex-based tokenization (atom decomposition, then glyph decomposition). No training required. Vocabulary: 165 tokens.</li>
<li><strong>Smirk-GPE</strong>: BPE-like compression on top of Smirk. Operates on token IDs (not strings) to preserve chemical disambiguation. Final vocabulary: 2,300 tokens.</li>
<li><strong>N-gram models</strong>: Add-one smoothing, bidirectional context ($2n - 2$ total context window). Implemented in Julia with exact integer arithmetic.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa-PreLayerNorm, 8 layers, 8 attention heads, hidden size 512, intermediate size 2048, max sequence length 2048. ~25M parameters (excluding embeddings).</li>
<li><strong>Pretraining</strong>: Masked language modeling, 30,000 steps, effective batch size 8192, FusedLamb optimizer, learning rate $1.6 \times 10^{-4}$.</li>
<li><strong>Finetuning</strong>: 100,000 steps, AdamW optimizer, effective batch size 128, learning rate $1.6 \times 10^{-4}$.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>MoleculeNet preferred metrics per task (AUROC for classification, MAE/RMSE for regression)</li>
<li>Fixed-effects models for standardized effect size estimation</li>
<li>Spearman&rsquo;s rank correlation between n-gram and transformer metrics</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pretraining: 2x NVIDIA A100 GPUs (Delta system at NCSA)</li>
<li>Finetuning: 1x NVIDIA A40 GPU</li>
<li>N-gram models: CPU-based (Julia implementation)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BattModels/Smirk">Smirk tokenizer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Rust implementation with Python bindings, available on PyPI</td>
      </tr>
      <tr>
          <td>Model checkpoints</td>
          <td>Model</td>
          <td>Not specified</td>
          <td>Pretrained and finetuned checkpoints included in data release</td>
      </tr>
      <tr>
          <td>N-gram code</td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Julia implementation included in data release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wadell, A., Bhutani, A., &amp; Viswanathan, V. (2026). Tokenization for Molecular Foundation Models. <em>Journal of Chemical Information and Modeling</em>, 66(3), 1384-1393. <a href="https://doi.org/10.1021/acs.jcim.5c01856">https://doi.org/10.1021/acs.jcim.5c01856</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wadell2026tokenization,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tokenization for Molecular Foundation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wadell, Alexius and Bhutani, Anoushka and Viswanathan, Venkatasubramanian}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{66}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1384--1393}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.5c01856}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES-BERT: BERT-Style Pre-Training for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-bert/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-bert/</guid><description>SMILES-BERT applies BERT-style masked pre-training to SMILES strings for molecular property prediction, using Transformer encoders fine-tuned on labeled data.</description><content:encoded><![CDATA[<h2 id="pre-training-transformers-on-smiles-for-molecular-properties">Pre-Training Transformers on SMILES for Molecular Properties</h2>
<p>SMILES-BERT is a <strong>Method</strong> paper that introduces a BERT-inspired pre-training and fine-tuning framework for molecular property prediction. The primary contribution is adapting the masked language model paradigm from NLP to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>, enabling a Transformer encoder to learn molecular representations from large-scale unlabeled data before fine-tuning on smaller labeled datasets.</p>
<h2 id="limited-labels-in-molecular-property-prediction">Limited Labels in Molecular Property Prediction</h2>
<p>Molecular property prediction is central to drug discovery and chemical design, but obtaining labeled data requires expensive biological assays. Deep learning methods for this task fall into three categories: manually designed fingerprints (e.g., ECFP), graph-based methods (GCNs operating on molecular graphs), and sequence-based methods (RNNs or CNNs operating on SMILES strings).</p>
<p>Prior unsupervised approaches like <a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Seq2seq Fingerprint</a> used an encoder-decoder architecture to learn representations from unlabeled SMILES, but the decoder acts as scaffolding that consumes GPU memory during pre-training without contributing to downstream prediction. The semi-supervised Seq3seq Fingerprint improved on this by incorporating labeled data, but retained the encoder-decoder inefficiency. RNN-based methods also suffer from difficulty in parallel training and require careful tuning (gradient clipping, early stopping) to converge.</p>
<p>The authors identify two motivations: (1) building a semi-supervised model that effectively leverages large pools of unlabeled SMILES to improve prediction with limited labels, and (2) designing an architecture where the entire pre-trained model participates in fine-tuning (no wasted decoder parameters) and naturally supports parallel training.</p>
<h2 id="masked-smiles-recovery-with-transformer-encoders">Masked SMILES Recovery with Transformer Encoders</h2>
<p>The core innovation is the Masked SMILES Recovery pre-training task, directly analogous to BERT&rsquo;s masked language modeling. The model architecture is a stack of Transformer encoder layers, making it fully convolutional and parallelizable.</p>
<h3 id="architecture">Architecture</h3>
<p>SMILES-BERT uses 6 Transformer encoder layers, each with 4-head multi-head self-attention and feed-forward dimension of 1024. Each Transformer layer contains three components: a pre-attention feed-forward network, a self-attention layer, and a post-attention feed-forward network, all followed by layer normalization with residual connections.</p>
<p>The self-attention mechanism uses scaled dot-product attention:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^{Q})(XW^{K})^{T}}{\sqrt{d_{k}}}\right) XW^{V}
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^{Q}$, $W^{K}$, $W^{V} \in \mathbb{R}^{M \times d_{k}}$ are the query, key, and value weight matrices, and $\sqrt{d_{k}}$ is the scaling factor.</p>
<p>Input SMILES are tokenized at the character level with token embeddings and positional embeddings. A special <code>&lt;GO&gt;</code> token is prepended to each SMILES, and its output representation is used for downstream classification/regression after fine-tuning.</p>
<h3 id="pre-training-masked-smiles-recovery">Pre-training: Masked SMILES Recovery</h3>
<p>Following BERT&rsquo;s masking strategy, 15% of tokens in each SMILES are selected for masking (minimum one per SMILES). Of the selected tokens:</p>
<ul>
<li>85% are replaced with a <code>&lt;MASK&gt;</code> token</li>
<li>10% are replaced with a random token from the vocabulary</li>
<li>5% are kept unchanged</li>
</ul>
<p>The model is trained to recover the original tokens at masked positions. The loss is computed only on the masked token outputs.</p>
<h3 id="fine-tuning">Fine-tuning</h3>
<p>After pre-training, a classifier or regressor head is added to the <code>&lt;GO&gt;</code> token output. The entire model (all Transformer layers plus the new head) is fine-tuned on the labeled dataset.</p>
<p>Key differences from the original BERT:</p>
<ol>
<li>Only the Masked SMILES Recovery task is used (BERT&rsquo;s next sentence prediction is dropped since SMILES have no consecutive-sentence structure)</li>
<li>Segment embeddings are removed</li>
<li>The architecture is smaller (6 layers, 4 heads, 1024 FFN dim) since SMILES have a much smaller vocabulary and shorter sequences than natural language</li>
</ol>
<p>The authors compared this configuration against a larger BERT-base setup (12 layers, 12 heads, 3072 FFN dim) and found no meaningful performance difference, confirming that the smaller model is sufficient for SMILES.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>SMILES-BERT was pre-trained on the <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a> with 18,671,355 training SMILES, 10,000 for validation, and 10,000 for evaluation. Pre-training ran for 10 epochs using the Adam optimizer with a warm-up strategy (learning rate from $10^{-9}$ to $10^{-4}$ over 4,000 steps, then inverse-square-root decay). Batch size was 256 and dropout was 0.1. The pre-training masked SMILES exact recovery rate reached 82.85% on the validation set.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a></td>
          <td>NCATS/NIH</td>
          <td>10,850</td>
          <td>Classification (threshold 1.88)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PM2</td>
          <td>NCATS/NIH</td>
          <td>323,242</td>
          <td>Classification (threshold 0.024896)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PCBA-686978</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></td>
          <td>302,175</td>
          <td>Classification</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p>All datasets were split 80/10/10 for train/validation/test. Fine-tuning used Adam with a fixed learning rate for 50 epochs, selecting the best model on validation data.</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>Circular Fingerprint (CircularFP)</strong>: Manually designed hash-based fingerprint (ECFP family)</li>
<li><strong>Neural Fingerprint (NeuralFP)</strong>: Graph-based neural network replacing hash functions with learned layers</li>
<li><strong>Seq2seq Fingerprint (Seq2seqFP)</strong>: Unsupervised encoder-decoder model on SMILES</li>
<li><strong>Seq3seq Fingerprint (Seq3seqFP)</strong>: Semi-supervised encoder-decoder model on SMILES</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP</th>
          <th>PM2</th>
          <th>PCBA-686978</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CircularFP</td>
          <td>~0.90</td>
          <td>0.6858</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>NeuralFP</td>
          <td>~0.90</td>
          <td>0.6802</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>Seq2seqFP</td>
          <td>~0.87</td>
          <td>0.6112</td>
          <td>~0.80</td>
      </tr>
      <tr>
          <td>Seq3seqFP</td>
          <td>~0.90</td>
          <td>0.7038</td>
          <td>~0.84</td>
      </tr>
      <tr>
          <td><strong>SMILES-BERT</strong></td>
          <td><strong>0.9154</strong></td>
          <td><strong>0.7589</strong></td>
          <td><strong>0.8784</strong></td>
      </tr>
  </tbody>
</table>
<p>SMILES-BERT outperformed all baselines on all three datasets. The improvement over Seq3seqFP was approximately 2% on LogP, 5.5% on PM2, and 3.8% on PCBA-686978. The results on PM2 (the largest labeled dataset) show that pre-training benefits persist even with substantial labeled data.</p>
<h3 id="structure-study">Structure Study</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>FFN Dim</th>
          <th>LogP Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES-BERT</td>
          <td>6</td>
          <td>4</td>
          <td>1024</td>
          <td>0.9154</td>
      </tr>
      <tr>
          <td>SMILES-BERT (large)</td>
          <td>12</td>
          <td>12</td>
          <td>3072</td>
          <td>0.9147</td>
      </tr>
  </tbody>
</table>
<p>The larger configuration provided no improvement, supporting the choice of the smaller, more efficient architecture.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>SMILES-BERT demonstrated that BERT-style masked pre-training on SMILES strings produces transferable molecular representations that improve property prediction across datasets of varying sizes and property types.</p>
<p>Key findings:</p>
<ul>
<li>The Masked SMILES Recovery pre-training task transfers effectively to molecular property prediction</li>
<li>The full model participates in fine-tuning (no wasted decoder), making SMILES-BERT more parameter-efficient than encoder-decoder alternatives</li>
<li>A smaller Transformer configuration (6 layers, 4 heads) matches the performance of a BERT-base-sized model for SMILES data</li>
<li>Pre-training on ~18.7M SMILES from ZINC provides robust initialization across different downstream tasks</li>
</ul>
<p><strong>Limitations</strong>: The evaluation uses only classification accuracy as the metric, without reporting AUC-ROC, F1, or other metrics common in molecular property prediction. The comparison is limited to four baselines, and two of the three evaluation datasets (LogP, PM2) are non-public NIH datasets. The paper does not explore different pre-training dataset sizes or ablate the masking strategy. Only classification tasks are evaluated, though the architecture supports regression.</p>
<p><strong>Future work</strong>: The authors propose incorporating Quantitative Estimate of Druglikeness (QED) prediction as an additional pre-training task to warm up the model&rsquo;s classification capability, analogous to BERT&rsquo;s next sentence prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC</td>
          <td>18,671,355 SMILES</td>
          <td>Publicly available database</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LogP</td>
          <td>10,850</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PM2</td>
          <td>323,242</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PCBA-686978</td>
          <td>302,175</td>
          <td>Public, from PubChem BioAssay</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Adam optimizer, warm-up for 4,000 steps ($10^{-9}$ to $10^{-4}$), inverse-square-root LR schedule, batch size 256, dropout 0.1, 10 epochs</li>
<li>Fine-tuning: Adam optimizer, fixed LR (insensitive to choice among $10^{-5}$, $10^{-6}$, $10^{-7}$), 50 epochs, best model on validation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>6 Transformer encoder layers, 4-head multi-head attention, FFN dim 1024</li>
<li>Token embedding + positional embedding, <code>&lt;GO&gt;</code> special token</li>
<li>Implemented with FairSeq (Facebook AI Research Sequence-to-Sequence Toolkit)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILES-BERT</th>
          <th>Best Baseline (Seq3seqFP)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP Accuracy</td>
          <td>0.9154</td>
          <td>~0.90</td>
          <td>~2% improvement</td>
      </tr>
      <tr>
          <td>PM2 Accuracy</td>
          <td>0.7589</td>
          <td>0.7038</td>
          <td>~5.5% improvement</td>
      </tr>
      <tr>
          <td>PCBA Accuracy</td>
          <td>0.8784</td>
          <td>~0.84</td>
          <td>~3.8% improvement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions GPU training and NVIDIA GPU donation in acknowledgments but does not specify the exact GPU model or training time beyond noting that pre-training on a single GPU takes over a week for 10 epochs.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No public code or model release identified</td>
          <td>-</td>
          <td>-</td>
          <td>Paper does not provide a GitHub link or model checkpoint</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The ZINC pre-training data is public and the architecture is described in detail, but no code or pre-trained weights are released. Two of three evaluation datasets (LogP, PM2) are non-public.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, S., Guo, Y., Wang, Y., Sun, H., &amp; Huang, J. (2019). SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In <em>Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB &lsquo;19)</em>, 429-436. <a href="https://doi.org/10.1145/3307339.3342186">https://doi.org/10.1145/3307339.3342186</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2019smilesbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Sheng and Guo, Yuzhi and Wang, Yuhong and Sun, Hongmao and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{429--436}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3307339.3342186}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES vs SELFIES Tokenization for Chemical LMs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/smiles-selfies-tokenization-chemical-lm/</guid><description>Atom Pair Encoding (APE) tokenizer outperforms BPE on SMILES and SELFIES in RoBERTa-based chemical language models across MoleculeNet classification tasks.</description><content:encoded><![CDATA[<h2 id="atom-pair-encoding-for-chemical-language-modeling">Atom Pair Encoding for Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom Pair Encoding (APE), a tokenization algorithm designed specifically for chemical string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The primary contribution is demonstrating that a chemistry-aware tokenizer, which preserves atomic identity during subword merging, leads to improved molecular property classification accuracy in transformer-based models compared to the standard Byte Pair Encoding (BPE) approach.</p>
<h2 id="why-tokenization-matters-for-chemical-strings">Why Tokenization Matters for Chemical Strings</h2>
<p>Existing chemical language models based on BERT/RoBERTa architectures have typically relied on BPE for tokenizing SMILES and SELFIES strings. <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte Pair Encoding (BPE)</a> was originally designed for natural language and data compression, where it excels at breaking words into meaningful subword units. When applied to chemical strings, BPE operates at the character level without understanding chemical semantics, leading to several problems:</p>
<ul>
<li><strong>Stray characters</strong>: BPE may create tokens like &ldquo;C)(&rdquo; that have no chemical meaning.</li>
<li><strong>Element splitting</strong>: Multi-character elements like chlorine (&ldquo;Cl&rdquo;) can be split into &ldquo;C&rdquo; and &ldquo;l&rdquo;, causing the model to misinterpret carbon and a dangling character.</li>
<li><strong>Lost structural context</strong>: BPE compresses sequences without considering how character position encodes molecular structure.</li>
</ul>
<p>Previous work on <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a> attempted to address this by iteratively merging SMILES substrings into chemically meaningful tokens. However, SPE had practical limitations: its Python implementation did not support SELFIES, and it produced a smaller vocabulary (~3000 tokens) than what the data could support. These gaps motivated the development of APE.</p>
<h2 id="the-ape-tokenizer-chemistry-aware-subword-merging">The APE Tokenizer: Chemistry-Aware Subword Merging</h2>
<p>APE draws inspiration from both BPE and SPE but addresses their shortcomings. The key design decisions are:</p>
<ol>
<li>
<p><strong>Atom-level initialization</strong>: Instead of starting from individual characters (as BPE does), APE begins with chemically valid atomic units. For SMILES, this means recognizing multi-character elements (e.g., &ldquo;Cl&rdquo;, &ldquo;Br&rdquo;) as single tokens. For SELFIES, each bracketed string (e.g., [C], [Ring1], [=O]) serves as the fundamental unit.</p>
</li>
<li>
<p><strong>Iterative pair merging</strong>: Like BPE, APE iteratively merges the most frequent adjacent token pairs. The difference is that the initial tokenization preserves atomic boundaries, so merged tokens always represent valid chemical substructures.</p>
</li>
<li>
<p><strong>Larger vocabulary</strong>: Using the same minimum frequency threshold of 2000, APE generates approximately 5300 unique tokens from the PubChem dataset, compared to SPE&rsquo;s approximately 3000. This richer vocabulary provides more expressive power for representing chemical substructures.</p>
</li>
<li>
<p><strong>SELFIES compatibility</strong>: APE natively supports both SMILES and SELFIES, using the bracketed token structure of SELFIES as its starting point for that representation.</p>
</li>
</ol>
<p>The tokenizer was trained on a subset of 2 million molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> (10 million SMILES total). This produced four tokenizer variants: SMILES-BPE, SMILES-APE, SELFIES-BPE, and SELFIES-APE.</p>
<h2 id="pre-training-and-evaluation-on-moleculenet-benchmarks">Pre-training and Evaluation on MoleculeNet Benchmarks</h2>
<h3 id="model-architecture">Model architecture</h3>
<p>All four models use the RoBERTa architecture with 6 hidden layers, a hidden size of 768, an intermediate size of 1536, and 12 attention heads. Pre-training used masked language modeling (MLM) with 15% token masking on 1 million molecules from PubChem, with a validation set of 100,000 molecules. Each model was pre-trained for 20 epochs using AdamW, with hyperparameter optimization via Optuna.</p>
<h3 id="downstream-tasks">Downstream tasks</h3>
<p>The models were fine-tuned on three <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Category</th>
          <th>Compounds</th>
          <th>Tasks</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Biophysics</td>
          <td>41,127</td>
          <td>1</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Physiology</td>
          <td>7,831</td>
          <td>12</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p>Data was split 80/10/10 (train/validation/test) following MoleculeNet recommendations. Models were fine-tuned for 5 epochs with early stopping based on validation ROC-AUC.</p>
<h3 id="baselines">Baselines</h3>
<p>Results were compared against two text-based models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a> MTR-77M and <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) and two graph-based models (D-MPNN from Chemprop and MoleculeNet Graph-Conv).</p>
<h3 id="main-results">Main results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>HIV ROC</th>
          <th>Tox21 ROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILYAPE-1M</td>
          <td>0.754 +/- 0.006</td>
          <td>0.772 +/- 0.010</td>
          <td>0.838 +/- 0.002</td>
      </tr>
      <tr>
          <td>SMILYBPE-1M</td>
          <td>0.746 +/- 0.006</td>
          <td>0.754 +/- 0.015</td>
          <td>0.849 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYAPE-1M</td>
          <td>0.735 +/- 0.015</td>
          <td>0.768 +/- 0.012</td>
          <td>0.842 +/- 0.002</td>
      </tr>
      <tr>
          <td>SELFYBPE-1M</td>
          <td>0.676 +/- 0.014</td>
          <td>0.709 +/- 0.012</td>
          <td>0.825 +/- 0.001</td>
      </tr>
      <tr>
          <td>ChemBERTa-2-MTR-77M</td>
          <td>0.698 +/- 0.014</td>
          <td>0.735 +/- 0.008</td>
          <td>0.790 +/- 0.003</td>
      </tr>
      <tr>
          <td>SELFormer</td>
          <td>0.716 +/- 0.021</td>
          <td>0.769 +/- 0.010</td>
          <td>0.838 +/- 0.005</td>
      </tr>
      <tr>
          <td>MoleculeNet-Graph-Conv</td>
          <td>0.690</td>
          <td>0.763</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.737</td>
          <td>0.776</td>
          <td>0.851</td>
      </tr>
  </tbody>
</table>
<p>APE consistently outperforms BPE for both SMILES and SELFIES. SMILYAPE achieves the best BBBP score (0.754), beating D-MPNN (0.737). On HIV, SMILYAPE (0.772) is competitive with D-MPNN (0.776). On Tox21, D-MPNN (0.851) leads, with SMILYBPE (0.849) and SELFYAPE (0.842) close behind.</p>
<h3 id="statistical-significance">Statistical significance</h3>
<p><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U tests</a> confirmed statistically significant differences between SMILYAPE and SMILYBPE (p &lt; 0.05 on all datasets). Cliff&rsquo;s delta values indicate large effect sizes: 0.74 (BBBP), 0.70 (HIV), and -1.00 (Tox21, favoring BPE). For SELFIES models, SELFYAPE achieved Cliff&rsquo;s delta of 1.00 across all three datasets, indicating complete separation from SELFYBPE.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="ape-outperforms-bpe-by-preserving-atomic-identity">APE outperforms BPE by preserving atomic identity</h3>
<p>The consistent advantage of APE over BPE stems from APE&rsquo;s atom-level initialization. By starting with chemically valid units rather than individual characters, APE avoids creating nonsensical tokens that break chemical elements or mix structural delimiters with atoms.</p>
<h3 id="smiles-outperforms-selfies-with-ape-tokenization">SMILES outperforms SELFIES with APE tokenization</h3>
<p>SMILYAPE generally outperforms SELFYAPE across tasks. Attention weight analysis revealed that SMILYAPE assigns more weight to immediate neighboring tokens (0.108 vs. 0.096) and less to distant tokens (0.030 vs. 0.043). This pattern aligns with chemical intuition: bonding is primarily determined by directly connected atoms. SMILYAPE also produces more compact tokenizations (8.6 tokens per molecule vs. 11.9 for SELFYAPE), potentially allowing more efficient attention allocation.</p>
<h3 id="selfies-models-show-higher-inter-tokenizer-agreement">SELFIES models show higher inter-tokenizer agreement</h3>
<p>On the BBBP dataset, all true positives identified by SELFYBPE were also captured by SELFYAPE, with SELFYAPE achieving higher recall (61.68% vs. 55.14%). In contrast, SMILES-based models shared only 29.3% of true positives between APE and BPE variants, indicating that tokenization choice has a larger impact on SMILES models.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Pre-training used only 1 million molecules, compared to 77 million for ChemBERTa-2. Despite this, APE models were competitive or superior, but scaling effects remain unexplored.</li>
<li>Evaluation was limited to three binary classification tasks from MoleculeNet. Regression tasks, molecular generation, and reaction prediction were not tested.</li>
<li>The Tox21 result is notable: SMILYBPE outperforms SMILYAPE (0.849 vs. 0.838), suggesting APE&rsquo;s advantage may be task-dependent.</li>
<li>No comparison with recent atom-level tokenizers like <a href="/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/">Atom-in-SMILES</a> or newer approaches beyond SPE.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tokenizer training</td>
          <td>PubChem subset</td>
          <td>2M molecules</td>
          <td>SMILES strings converted to SELFIES via selfies library</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>1M molecules</td>
          <td>100K validation set</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127 compounds</td>
          <td>80/10/10 split</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>7,831 compounds</td>
          <td>80/10/10 split, 12 tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Tokenizers: BPE (via Hugging Face), APE (custom implementation, minimum frequency 2000)</li>
<li>Pre-training: Masked Language Modeling (15% masking) for 20 epochs</li>
<li>Optimizer: AdamW with Optuna hyperparameter search</li>
<li>Fine-tuning: 5 epochs with early stopping on validation ROC-AUC</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Architecture: RoBERTa with 6 layers, hidden size 768, intermediate size 1536, 12 attention heads</li>
<li>Four variants: SMILYAPE, SMILYBPE, SELFYAPE, SELFYBPE</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILYAPE</th>
          <th>SMILYBPE</th>
          <th>SELFYAPE</th>
          <th>SELFYBPE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP ROC-AUC</td>
          <td>0.754</td>
          <td>0.746</td>
          <td>0.735</td>
          <td>0.676</td>
      </tr>
      <tr>
          <td>HIV ROC-AUC</td>
          <td>0.772</td>
          <td>0.754</td>
          <td>0.768</td>
          <td>0.709</td>
      </tr>
      <tr>
          <td>Tox21 ROC-AUC</td>
          <td>0.838</td>
          <td>0.849</td>
          <td>0.842</td>
          <td>0.825</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA RTX 3060 GPU with 12 GiB VRAM</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mikemayuare/apetokenizer">APE Tokenizer</a></td>
          <td>Code</td>
          <td>Other (unspecified SPDX)</td>
          <td>Official APE tokenizer implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/mikemayuare/PubChem10M_SMILES_SELFIES">PubChem10M SMILES/SELFIES</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>10M SMILES with SELFIES conversions</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/mikemayuare">Pre-trained and fine-tuned models</a></td>
          <td>Model</td>
          <td>Not specified</td>
          <td>All four model variants on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leon, M., Perezhohin, Y., Peres, F., Popovič, A., &amp; Castelli, M. (2024). Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling. <em>Scientific Reports</em>, 14(1), 25016. <a href="https://doi.org/10.1038/s41598-024-76440-8">https://doi.org/10.1038/s41598-024-76440-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leon2024comparing,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leon, Miguelangel and Perezhohin, Yuriy and Peres, Fernando and Popovi{\v{c}}, Ale{\v{s}} and Castelli, Mauro}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{25016}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-024-76440-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Transformer: Low-Data Molecular Fingerprints</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smiles-transformer/</guid><description>SMILES Transformer uses unsupervised Transformer pre-training on SMILES strings to produce molecular fingerprints that excel in low-data drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="a-transformer-approach-to-learned-molecular-fingerprints">A Transformer Approach to Learned Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Transformer (ST), a Transformer-based sequence-to-sequence model pre-trained on unlabeled SMILES strings to produce continuous, data-driven molecular fingerprints. The primary contribution is demonstrating that unsupervised pre-training on chemical text representations yields fingerprints that generalize well under low-data conditions, outperforming both rule-based fingerprints (ECFP) and graph convolution models on several <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. A secondary contribution is the Data Efficiency Metric (DEM), a scalar metric for evaluating model performance across varying training set sizes.</p>
<h2 id="the-low-data-problem-in-molecular-property-prediction">The Low-Data Problem in Molecular Property Prediction</h2>
<p>Machine learning for drug discovery depends on molecular representations, but labeled datasets of experimentally validated properties are typically small. Conventional approaches fall into two camps: rule-based fingerprints like ECFP that hash substructures into sparse binary vectors, and graph-based methods like GraphConv that learn representations end-to-end. Rule-based fingerprints perform poorly with shallow models or limited data, while graph-based methods are designed for large fully-labeled settings.</p>
<p>Pre-training on unlabeled data had shown strong results in NLP (ELMo, BERT, XLNet), and prior work in cheminformatics had explored RNN-based and VAE-based pre-training on SMILES (<a href="/notes/chemistry/molecular-representations/encoders/seq2seq-fingerprint-molecular-embedding/">Seq2Seq fingerprints</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>, heteroencoders). However, none of these studies systematically evaluated performance in small-data settings. Honda et al. fill this gap by applying Transformer-based pre-training to SMILES and measuring data efficiency explicitly.</p>
<h2 id="transformer-pre-training-on-smiles-with-pooled-fingerprint-extraction">Transformer Pre-training on SMILES with Pooled Fingerprint Extraction</h2>
<p>The core innovation is a Transformer encoder-decoder architecture pre-trained as an autoencoder on SMILES strings, with a specific fingerprint extraction strategy that pools the encoder outputs into a fixed-length vector.</p>
<h3 id="architecture">Architecture</h3>
<p>The model uses 4 Transformer blocks for both the encoder and decoder, each with 4-head attention and 256 embedding dimensions plus 2 linear layers. Input SMILES are tokenized at the symbol level (e.g., &lsquo;c&rsquo;, &lsquo;Br&rsquo;, &lsquo;=&rsquo;, &lsquo;(&rsquo;, &lsquo;2&rsquo;) and one-hot encoded. Following Vaswani et al. (2017), the input uses the sum of token encoding and positional encoding.</p>
<h3 id="pre-training">Pre-training</h3>
<p>The model is pre-trained on 861,000 unlabeled SMILES sampled from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL24</a> to minimize cross-entropy between input and output SMILES (i.e., reconstruction). <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> (Bjerrum, 2017) randomly generates non-canonical SMILES at each epoch to reduce representation bias. Training runs for 5 epochs with Adam optimization, reaching a perplexity of 1.0 (perfect decoding).</p>
<h3 id="fingerprint-extraction">Fingerprint Extraction</h3>
<p>Since the Transformer outputs symbol-level (atom-level) representations, a pooling strategy produces molecule-level fingerprints. Four vectors are concatenated:</p>
<ol>
<li>Mean-pooled output of the last encoder layer</li>
<li>Max-pooled output of the last encoder layer</li>
<li>First output token of the last encoder layer</li>
<li>First output token of the penultimate encoder layer</li>
</ol>
<p>This produces a 1024-dimensional fingerprint, matching the dimensionality of ECFP for fair comparison.</p>
<h3 id="data-efficiency-metric">Data Efficiency Metric</h3>
<p>The paper proposes DEM to measure how well a model performs across different training set sizes:</p>
<p>$$
M_{DE}(f, m) = \frac{1}{|I|} \sum_{i \in I} m(f_i, X_i, Y_i)
$$</p>
<p>where $f_i$ is the model trained on the fraction $i$ of training data, $m$ is the task metric, and $I = {0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8}$ doubles the training percentage at each step. This captures average performance across a range of data availability, giving a single scalar that balances accuracy and data efficiency.</p>
<h2 id="benchmarking-across-moleculenet-with-data-efficiency-focus">Benchmarking Across MoleculeNet with Data Efficiency Focus</h2>
<h3 id="datasets">Datasets</h3>
<p>The evaluation uses 10 datasets from MoleculeNet spanning three categories:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Molecules</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>Regression</td>
          <td>1,128</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>Regression</td>
          <td>643</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>1</td>
          <td>Regression</td>
          <td>4,200</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>MUV</td>
          <td>17</td>
          <td>Classification</td>
          <td>93,127</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>HIV</td>
          <td>1</td>
          <td>Classification</td>
          <td>41,913</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>BACE</td>
          <td>1</td>
          <td>Classification</td>
          <td>1,522</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>Classification</td>
          <td>2,053</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>Tox21</td>
          <td>12</td>
          <td>Classification</td>
          <td>8,014</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>SIDER</td>
          <td>27</td>
          <td>Classification</td>
          <td>1,427</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>ClinTox</td>
          <td>2</td>
          <td>Classification</td>
          <td>1,491</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>ECFP4</strong>: Rule-based extended-connectivity fingerprint with 1024 dimensions</li>
<li><strong>RNNS2S</strong>: RNN-based Seq2Seq pre-trained fingerprint (3-layer bidirectional GRU, same pre-training data as ST)</li>
<li><strong>GraphConv</strong>: Graph convolution network trained end-to-end on labeled data</li>
</ul>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>All fingerprint methods use a simple MLP classifier/regressor from scikit-learn with default hyperparameters to isolate the fingerprint quality from model capacity. Datasets are randomly split (stratified for classification), and results are averaged over 20 trials. Note that random splits are used rather than scaffold splits for the DEM experiments.</p>
<h3 id="data-efficiency-results-dem">Data Efficiency Results (DEM)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ST+MLP</th>
          <th>ECFP+MLP</th>
          <th>RNNS2S+MLP</th>
          <th>GraphConv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL (RMSE, lower is better)</td>
          <td><strong>1.144</strong></td>
          <td>1.741</td>
          <td>1.317</td>
          <td>1.673</td>
      </tr>
      <tr>
          <td>FreeSolv (RMSE, lower is better)</td>
          <td><strong>2.246</strong></td>
          <td>3.043</td>
          <td>2.987</td>
          <td>3.476</td>
      </tr>
      <tr>
          <td>Lipophilicity (RMSE, lower is better)</td>
          <td>1.169</td>
          <td><strong>1.090</strong></td>
          <td>1.219</td>
          <td><strong>1.062</strong></td>
      </tr>
      <tr>
          <td>MUV (PRC-AUC, higher is better)</td>
          <td>0.009</td>
          <td><strong>0.036</strong></td>
          <td>0.010</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>HIV (ROC-AUC, higher is better)</td>
          <td>0.683</td>
          <td>0.697</td>
          <td>0.682</td>
          <td><strong>0.723</strong></td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC, higher is better)</td>
          <td>0.719</td>
          <td><strong>0.769</strong></td>
          <td>0.717</td>
          <td>0.744</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC, higher is better)</td>
          <td><strong>0.900</strong></td>
          <td>0.760</td>
          <td>0.884</td>
          <td>0.795</td>
      </tr>
      <tr>
          <td>Tox21 (ROC-AUC, higher is better)</td>
          <td><strong>0.706</strong></td>
          <td>0.616</td>
          <td>0.702</td>
          <td>0.687</td>
      </tr>
      <tr>
          <td>SIDER (ROC-AUC, higher is better)</td>
          <td>0.559</td>
          <td><strong>0.588</strong></td>
          <td>0.558</td>
          <td>0.557</td>
      </tr>
      <tr>
          <td>ClinTox (ROC-AUC, higher is better)</td>
          <td><strong>0.963</strong></td>
          <td>0.515</td>
          <td>0.904</td>
          <td>0.936</td>
      </tr>
  </tbody>
</table>
<p>ST achieves the best DEM in 5 of 10 datasets (ESOL, FreeSolv, BBBP, Tox21, ClinTox), with particularly strong margins on ClinTox (+0.027 over GraphConv) and BBBP (+0.016 over RNNS2S).</p>
<h3 id="linear-model-experiments">Linear Model Experiments</h3>
<p>To further isolate fingerprint quality, the authors replace MLP with ridge/logistic regression with L2 penalty. On 8 datasets (excluding MUV and SIDER due to class imbalance issues), ST achieves best DEM in 5 of 8, confirming the fingerprint quality holds regardless of downstream model.</p>
<h3 id="stratified-analysis-by-molecule-size">Stratified Analysis by Molecule Size</h3>
<p>On BBBP stratified by SMILES length, ST&rsquo;s ROC-AUC increases with longer SMILES, similar to RNNS2S but unlike GraphConv which shows stable performance across lengths. This suggests text-based models extract richer information from longer sequences.</p>
<h3 id="comparison-with-record-scores-large-data">Comparison with Record Scores (Large Data)</h3>
<p>Under the large-data setting (80/10/10 train/val/test split with hyperparameter tuning via Optuna), ST achieves first place only in ClinTox (0.954) but performs comparably to ECFP and graph-based models on the other datasets. This confirms that ST&rsquo;s main advantage is in the low-data regime.</p>
<h2 id="strong-low-data-performance-with-caveats-on-scalability">Strong Low-Data Performance with Caveats on Scalability</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Transformer-based unsupervised pre-training on SMILES produces fingerprints that excel in low-data molecular property prediction, achieving best data efficiency on 5 of 10 MoleculeNet tasks.</li>
<li>The advantage is most pronounced on small datasets (ESOL with 1,128 molecules, FreeSolv with 643, BBBP with 2,053, ClinTox with 1,491) where pre-training enables good generalization.</li>
<li>With sufficient labeled data and hyperparameter tuning, ST fingerprints perform comparably to (but do not surpass) graph-based methods.</li>
<li>Longer SMILES provide richer information for text-based models, as shown by the stratified analysis on BBBP.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Random splits are used for most DEM experiments rather than scaffold splits, which may inflate performance estimates for drug discovery applications where training and test molecules are structurally distinct.</li>
<li>The pre-training corpus (861K SMILES from ChEMBL24) is relatively small by modern standards.</li>
<li>MUV performance is poor across all methods (PRC-AUC near zero), suggesting the DEM framework may not be informative for extremely imbalanced or noisy datasets.</li>
<li>No comparison with BERT-style masked language model pre-training, which later work (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>) would show as a viable alternative.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose three directions: (1) replacing the Transformer with Transformer-XL to handle longer SMILES, (2) multi-task pre-training that jointly predicts molecular descriptors (e.g., molecular weight, <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>) alongside SMILES reconstruction, and (3) better exploitation of enumerated SMILES to constrain the latent space.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL24</td>
          <td>861,000 SMILES</td>
          <td>Unlabeled, randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MoleculeNet (10 datasets)</td>
          <td>643 to 93,127 molecules</td>
          <td>See Table 1 for per-dataset details</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder: 4 blocks each, 4-head attention, 256 embedding dimensions</li>
<li>Pre-training: 5 epochs, Adam optimizer, cross-entropy loss, SMILES enumeration for augmentation</li>
<li>Fingerprint: 1024 dimensions from concatenated mean pool, max pool, and first-token outputs</li>
<li>Downstream: scikit-learn MLP (default hyperparameters) for DEM experiments; ridge/logistic regression for linear model experiments; Optuna for hyperparameter search in large-data comparison</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DSPsleeporg/smiles-transformer">smiles-transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Jupyter notebooks)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>DEM averaged over 7 training fractions (1.25% to 80%), 20 trials each</li>
<li>Random splits for DEM; scaffold splits for HIV, BACE, BBBP in large-data comparison</li>
<li>Metrics: RMSE (regression), ROC-AUC or PRC-AUC (classification) per MoleculeNet conventions</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU type or training time for the pre-training phase.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Honda, S., Shi, S., &amp; Ueda, H. R. (2019). SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. <em>arXiv preprint arXiv:1911.04738</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{honda2019smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Honda, Shion and Shi, Shoi and Ueda, Hiroki R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1911.04738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI-TED: Encoder-Decoder Foundation Models for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/smi-ted-encoder-decoder-chemistry/</guid><description>SMI-TED is a family of encoder-decoder transformer models pre-trained on 91M PubChem molecules for molecular property prediction and generation.</description><content:encoded><![CDATA[<h2 id="an-encoder-decoder-chemical-foundation-model-family">An Encoder-Decoder Chemical Foundation Model Family</h2>
<p>SMI-TED is a <strong>Method</strong> paper that introduces a family of encoder-decoder transformer-based foundation models for chemistry. The primary contribution is the SMI-TED289M architecture, a 289-million parameter model pre-trained on 91 million curated SMILES from PubChem, along with a Mixture-of-Experts variant (MoE-OSMI) that scales to 8x289M parameters. The models support molecular property prediction, molecule reconstruction, reaction yield prediction, and few-shot reasoning over molecular embeddings. All model weights and code are open-sourced under an Apache 2.0 license.</p>
<h2 id="bridging-encoding-and-decoding-for-molecular-representations">Bridging Encoding and Decoding for Molecular Representations</h2>
<p>Chemical language models based on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> have gained traction for molecular property prediction and generation. Most existing models, such as <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> and <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, are encoder-only architectures that produce molecular embeddings through mean pooling. While effective for downstream classification and regression, this encoder-only approach has a limitation: mean pooling has no natural inverse, meaning the model cannot reconstruct the input molecule from its latent representation. This restricts the model&rsquo;s utility for generative tasks and limits the interpretability of the learned latent space.</p>
<p>The authors argue that adding a decoder with a reconstruction objective forces the model to encode a more complete set of structural features. Prior work has shown that the quality of pre-training data matters more than the choice of SMILES vs. <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and that large-scale pre-training can yield useful chemical representations. SMI-TED builds on these observations by combining an encoder-decoder architecture with a carefully curated 91-million molecule dataset from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>.</p>
<h2 id="invertible-pooling-and-two-phase-pre-training">Invertible Pooling and Two-Phase Pre-Training</h2>
<p>The core architectural innovation in SMI-TED is a learned pooling mechanism that replaces standard mean or max pooling with an invertible projection. Given token embeddings $\mathbf{x} \in \mathbb{R}^{D \times L}$ (where $D = 202$ is the maximum token count and $L = 768$ is the embedding dimension), the submersion into the latent space $\mathbf{z} \in \mathbb{R}^{L}$ is computed as:</p>
<p>$$
\mathbf{z} = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{W}_1^T \mathbf{x} + \mathbf{b}_1\right)\right)\right) \mathbf{W}_2
$$</p>
<p>where $\mathbf{W}_1 \in \mathbb{R}^{D \times L}$, $\mathbf{b}_1 \in \mathbb{R}^{L}$, and $\mathbf{W}_2 \in \mathbb{R}^{L \times L}$. The immersion (inverse mapping) back to the token space is:</p>
<p>$$
\tilde{\mathbf{x}}^T = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{z} \mathbf{W}_3 + \mathbf{b}_3\right)\right)\right) \mathbf{W}_4
$$</p>
<p>where $\mathbf{W}_3 \in \mathbb{R}^{L \times L}$, $\mathbf{b}_3 \in \mathbb{R}^{L}$, and $\mathbf{W}_4 \in \mathbb{R}^{L \times D}$. A decoder language model then predicts the next token from $\tilde{\mathbf{x}}$.</p>
<p>The encoder uses a modified RoFormer attention mechanism with rotary position embeddings:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>where $R_m$ are position-dependent rotation matrices and $\varphi$ is a random feature map.</p>
<p><strong>Two-phase pre-training strategy:</strong></p>
<ul>
<li><strong>Phase 1</strong>: The token encoder is pre-trained on 95% of the data using masked language modeling (15% token selection, of which 80% masked, 10% random, 10% unchanged). The remaining 5% trains the encoder-decoder layer, preventing convergence issues from unstable early embeddings.</li>
<li><strong>Phase 2</strong>: After the token embeddings converge, both the encoder and decoder train on 100% of the data jointly.</li>
</ul>
<p><strong><a href="https://en.wikipedia.org/wiki/Mixture_of_experts">Mixture-of-Experts</a> (MoE-OSMI):</strong> The MoE variant composes 8 fine-tuned SMI-TED289M expert models with a gating network. Given an input embedding $x$, the output is:</p>
<p>$$
y = \sum_{i=1}^{n} G(x)_i E_i(\hat{x})
$$</p>
<p>where $G(x) = \text{Softmax}(\text{TopK}(x \cdot W_g))$ selects the top $k = 2$ experts per input, setting all other gate values to zero.</p>
<h2 id="benchmarks-across-property-prediction-generation-and-reaction-yield">Benchmarks Across Property Prediction, Generation, and Reaction Yield</h2>
<h3 id="moleculenet-classification-6-datasets-roc-auc"><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification (6 datasets, ROC-AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BBBP</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
          <th>Tox21</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>73.6 +/- 0.8</td>
          <td>91.2 +/- 1.4</td>
          <td>80.5 +/- 1.65</td>
          <td>86.3 +/- 0.6</td>
          <td>65.5 +/- 0.2</td>
          <td>80.46 +/- 0.2</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>72.9 +/- 0.6</td>
          <td>91.9 +/- 1.8</td>
          <td>80.8 +/- 0.3</td>
          <td>85.7 +/- 0.2</td>
          <td>65.9 +/- 1.3</td>
          <td>79.6 +/- 0.5</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>72.4 +/- 0.4</td>
          <td>90.1 +/- 1.3</td>
          <td>80.6 +/- 0.9</td>
          <td>85.6 +/- 1.1</td>
          <td>67.2 +/- 0.4</td>
          <td>78.1 +/- 0.1</td>
      </tr>
      <tr>
          <td>SMI-TED289M (pre-trained)</td>
          <td>91.46 +/- 0.47</td>
          <td>93.49 +/- 0.85</td>
          <td>80.51 +/- 1.34</td>
          <td>85.58 +/- 0.92</td>
          <td>66.01 +/- 0.88</td>
          <td>81.53 +/- 0.45</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>92.26 +/- 0.57</strong></td>
          <td><strong>94.27 +/- 1.83</strong></td>
          <td>76.85 +/- 0.89</td>
          <td><strong>88.24 +/- 0.50</strong></td>
          <td>65.68 +/- 0.45</td>
          <td><strong>81.85 +/- 1.42</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED achieves the best results in 4 of 6 classification tasks. Notably, the pre-trained version (without fine-tuning) already matches or exceeds many baselines on BBBP, ClinTox, and Tox21.</p>
<h3 id="moleculenet-regression-5-datasets-mae-for-qm9qm8-rmse-for-esolfreesolvlipophilicity">MoleculeNet regression (5 datasets, MAE for QM9/QM8, RMSE for ESOL/FreeSolv/Lipophilicity)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>1.5894</td>
          <td>0.0102</td>
          <td>0.880</td>
          <td>2.342</td>
          <td>0.700</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>3.241</td>
          <td>0.0143</td>
          <td>0.98</td>
          <td>2.18</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>1.3246</strong></td>
          <td><strong>0.0095</strong></td>
          <td><strong>0.6112</strong></td>
          <td><strong>1.2233</strong></td>
          <td><strong>0.5522</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED289M achieves the best results across all 5 regression tasks when fine-tuned. The improvements are substantial on ESOL (0.61 vs. 0.82 for next best) and FreeSolv (1.22 vs. 1.91 for next best).</p>
<h3 id="reaction-yield-prediction-buchwald-hartwig-c-n-cross-coupling">Reaction yield prediction (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling)</h3>
<p>The model was tested on Pd-catalyzed Buchwald-Hartwig reactions with 3,955 reactions across varying train/test splits. Selected $R^2$ results:</p>
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Yield-BERT (Aug)</th>
          <th>DRFP</th>
          <th>SMI-TED289M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>70/30</td>
          <td>0.97</td>
          <td>0.95</td>
          <td><strong>0.984</strong></td>
      </tr>
      <tr>
          <td>10/90</td>
          <td>0.81</td>
          <td>0.81</td>
          <td><strong>0.961</strong></td>
      </tr>
      <tr>
          <td>2.5/97.5</td>
          <td>0.61</td>
          <td>0.62</td>
          <td><strong>0.875</strong></td>
      </tr>
      <tr>
          <td>Test 1-4 avg</td>
          <td>0.58</td>
          <td>0.71</td>
          <td><strong>0.983</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED shows particularly strong performance in low-data regimes. With only 2.5% training data, it achieves $R^2 = 0.875$, compared to 0.61-0.62 for competing methods.</p>
<h3 id="moses-molecular-generation-benchmarks"><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> molecular generation benchmarks</h3>
<p>SMI-TED is competitive with baselines including CharRNN, SMILES VAE, JT-VAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen-7b</a>, and <a href="/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/">GP-MoLFormer</a> on standard metrics (validity, uniqueness, novelty, FCD, internal diversity). It achieves superior scaffold cosine similarity (Scaf) and nearest-neighbor similarity (SNN) scores.</p>
<h3 id="latent-space-compositionality">Latent space compositionality</h3>
<p>Using six families of carbon chains ($\mathcal{F} = {CC, CO, CN, CS, CF, CP}$), the authors test whether the embedding space respects hierarchical distance structures. A linear regression on SMI-TED embeddings yields $R^2 = 0.99$ and $MSE = 0.002$, compared to $R^2 = 0.55$ and $MSE = 0.237$ for MoLFormer. This indicates that the SMI-TED latent space captures compositional chemical relationships far more faithfully.</p>
<p>For structure-property analysis on <a href="/notes/chemistry/datasets/qm9/">QM9</a>, nitrogen-containing molecules represent 9.10% of the dataset but account for 32.81% of the top 10% by HOMO energy. In the SMI-TED latent space, these molecules cluster distinctly (<a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> of 2.82 vs. 4.28 for MoLFormer), suggesting the decoder objective encourages encoding of functional group information.</p>
<h2 id="strong-performance-with-a-compositional-latent-space">Strong Performance with a Compositional Latent Space</h2>
<p>SMI-TED289M demonstrates competitive or superior performance across molecular property prediction, reaction yield prediction, and molecular generation benchmarks. The key findings include:</p>
<ol>
<li><strong>Broad applicability</strong>: The single pre-trained model achieves strong results across classification (4/6 best), regression (5/5 best), reaction yield, and generation tasks.</li>
<li><strong>Low-data robustness</strong>: The pre-training on 91M molecules provides chemical knowledge that transfers well to small training sets, as shown by the reaction yield experiments where SMI-TED maintains high accuracy even at 2.5% training data.</li>
<li><strong>Compositional embeddings</strong>: The encoder-decoder architecture produces a latent space where molecular similarity follows chemical intuition, with near-perfect linear relationships between functional group families ($R^2 = 0.99$).</li>
<li><strong>Structure-property capture</strong>: The reconstruction objective appears to enforce encoding of chemically meaningful features like nitrogen substituent effects on <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO</a> energy, outperforming encoder-only models in latent space organization.</li>
</ol>
<p><strong>Limitations</strong>: The paper evaluates on MoleculeNet benchmarks, which are well-studied but may not reflect performance on more diverse chemical tasks. The BBBP classification result (92.26) shows a large jump from prior methods (73.6 for MoLFormer), which is worth scrutinizing. The MoE variant is evaluated only in supplementary materials, and scaling behavior beyond 8 experts is not explored.</p>
<p><strong>Future directions</strong>: The authors note that compositionality of the learned representations suggests potential for reasoning applications, though they acknowledge that stronger claims require further studies following compositionality analysis methodologies from natural language processing. The model has been integrated into the dZiner agent for inverse molecular design.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (curated)</td>
          <td>91M molecules, 4B tokens</td>
          <td>Deduplicated, canonicalized, validity-checked</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (BBBP, ClinTox, HIV, BACE, SIDER, Tox21)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (QM9, QM8, ESOL, FreeSolv, Lipophilicity)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>MOSES</td>
          <td>1.94M molecules</td>
          <td>Train/test/scaffold test splits</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig HTE</td>
          <td>3,955 reactions</td>
          <td>3x 1536-well plates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Masked language modeling for token encoder (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li>Two-phase pre-training (95/5 split then 100% joint training)</li>
<li>RoFormer attention with rotary position embeddings</li>
<li>Vocabulary: 2,993 tokens (2,988 molecular + 5 special)</li>
<li>Maximum sequence length: 202 tokens (covers 99.4% of PubChem)</li>
<li>Learning rate: 1.6e-4, batch size: 288 molecules</li>
<li>40 epochs over the full PubChem corpus</li>
<li>10 random seeds per experiment for robustness</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Parameters</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMI-TED289M base</td>
          <td>289M</td>
          <td>47M</td>
          <td>242M</td>
          <td>12 layers, 12 attention heads, hidden size 768, dropout 0.2</td>
      </tr>
      <tr>
          <td>MoE-OSMI</td>
          <td>8x289M</td>
          <td>-</td>
          <td>-</td>
          <td>8 experts, top-k=2 routing, gating network</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC</li>
<li>Regression: MAE (QM9, QM8), RMSE (ESOL, FreeSolv, Lipophilicity)</li>
<li>Reaction yield: $R^2$</li>
<li>Generation: Validity, uniqueness, novelty, FCD, IntDiv, Scaf, SNN (MOSES metrics)</li>
<li>Latent space: Linear regression $R^2$, MSE, Davies-Bouldin index, t-SNE visualization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>24 NVIDIA V100 GPUs (16GB)</li>
<li>4 nodes with DDP (Distributed Data Parallel)</li>
<li>Pre-training: 40 epochs on 91M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/materials/tree/main/models/smi_ted">IBM/materials (smi_ted)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Training, fine-tuning scripts, Jupyter notebooks</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/materials.smi-ted">ibm/materials.smi-ted</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.15603701">Zenodo archive</a></td>
          <td>Code + Data</td>
          <td>Apache-2.0</td>
          <td>Archival copy of scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Soares, E., Vital Brazil, E., Shirasuna, V., Zubarev, D., Cerqueira, R., &amp; Schmidt, K. (2025). An open-source family of large encoder-decoder foundation models for chemistry. <em>Communications Chemistry</em>, 8(1). <a href="https://doi.org/10.1038/s42004-025-01585-0">https://doi.org/10.1038/s42004-025-01585-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{soares2025smited,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{An open-source family of large encoder-decoder foundation models for chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Communications Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42004-025-01585-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RNNs vs Transformers for Molecular Generation Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</guid><description>Empirical comparison of RNN and Transformer architectures for molecular generation using SMILES and SELFIES across three generative tasks.</description><content:encoded><![CDATA[<h2 id="an-empirical-comparison-of-sequence-architectures-for-molecular-generation">An Empirical Comparison of Sequence Architectures for Molecular Generation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares two dominant sequence modeling architectures, recurrent neural networks (RNNs) and the Transformer, for chemical language modeling. The primary contribution is a controlled experimental comparison across three generative tasks of increasing complexity, combined with an evaluation of two molecular string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The paper does not propose a new method; instead, it provides practical guidance on when each architecture is more appropriate for molecular generation.</p>
<h2 id="why-compare-rnns-and-transformers-for-molecular-design">Why Compare RNNs and Transformers for Molecular Design?</h2>
<p>Exploring unknown molecular space and designing molecules with target properties is a central goal in computational drug design. Language models trained on molecular string representations (SMILES, SELFIES) have shown the capacity to learn complex molecular distributions. RNN-based models, including LSTM and GRU variants, were the first widely adopted architectures for this task. Models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">CharRNN</a>, ReLeaSE, and conditional RNNs demonstrated success in generating focused molecular libraries. More recently, self-attention-based Transformer models (Mol-GPT, LigGPT) have gained popularity due to their parallelizability and ability to capture long-range dependencies.</p>
<p>Despite the widespread adoption of Transformers across NLP, it was not clear whether they uniformly outperform RNNs for molecular generation. Prior work by Dollar et al. showed that RNN-based models achieved higher validity than Transformer-based models in some settings. Flam-Shepherd et al. demonstrated that RNN language models could learn complex molecular distributions across challenging generative tasks. This paper extends that comparison by adding the Transformer architecture to the same set of challenging tasks and evaluating both SMILES and SELFIES representations.</p>
<h2 id="experimental-design-three-tasks-two-architectures-two-representations">Experimental Design: Three Tasks, Two Architectures, Two Representations</h2>
<p>The core experimental design uses a 2x2 setup: two architectures (RNN and Transformer) crossed with two molecular representations (SMILES and SELFIES), yielding four model variants: SM-RNN, SF-RNN, SM-Transformer, and SF-Transformer.</p>
<h3 id="three-generative-tasks">Three generative tasks</h3>
<p>The three tasks, drawn from <a href="/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/">Flam-Shepherd et al.</a>, are designed with increasing complexity:</p>
<ol>
<li>
<p><strong>Penalized LogP task</strong>: Generate molecules with high penalized LogP scores (LogP minus synthetic accessibility and long-cycle penalties). The dataset is built from ZINC15 molecules with penalized LogP &gt; 4.0. Molecule sequences are relatively short (50-75 tokens).</p>
</li>
<li>
<p><strong>Multidistribution task</strong>: Learn a multimodal molecular weight distribution constructed from four distinct subsets: GDB13 (MW &lt;= 185), ZINC (185 &lt;= MW &lt;= 425), Harvard Clean Energy Project (460 &lt;= MW &lt;= 600), and POLYMERS (MW &gt; 600). This tests the ability to capture multiple modes simultaneously.</p>
</li>
<li>
<p><strong>Large-scale task</strong>: Generate large molecules from PubChem with more than 100 heavy atoms and MW ranging from 1250 to 5000. This tests long-sequence generation capability.</p>
</li>
</ol>
<h3 id="model-configuration">Model configuration</h3>
<p>Models are compared with matched parameter counts (5.2-5.3M to 36.4M parameters). Hyperparameter optimization uses random search over learning rate [0.0001, 0.001], hidden units (500-1000 for RNNs, 376-776 for Transformers), layer number [3, 5], and dropout [0.0, 0.5]. A regex-based tokenizer replaces character-by-character tokenization, reducing token lengths from 10,000 to under 3,000 for large molecules.</p>
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>The evaluation covers multiple dimensions:</p>
<ul>
<li><strong>Standard metrics</strong>: validity, uniqueness, novelty</li>
<li><strong>Molecular properties</strong>: <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, LogP, SA, QED, Bertz complexity (BCT), natural product likeness (NP), molecular weight (MW)</li>
<li><strong>Wasserstein distance</strong>: measures distributional similarity between generated and training molecules for each property</li>
<li><strong>Tanimoto similarity</strong>: structural and scaffold similarity between generated and training molecules</li>
<li><strong>Token length (TL)</strong>: comparison of generated vs. training sequence lengths</li>
</ul>
<p>For each task, 10,000 molecules are generated and evaluated.</p>
<h2 id="key-results-across-tasks">Key Results Across Tasks</h2>
<h3 id="penalized-logp-task">Penalized LogP task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.56</td>
          <td>0.12</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>16.61</td>
          <td>0.09</td>
          <td>5.90</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.63</td>
          <td>0.25</td>
          <td>0.42</td>
          <td>0.02</td>
          <td>36.43</td>
          <td>0.23</td>
          <td>2.35</td>
          <td>0.40</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.83</td>
          <td>0.18</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>23.77</td>
          <td>0.09</td>
          <td>7.99</td>
          <td>0.84</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.97</td>
          <td>0.22</td>
          <td>0.47</td>
          <td>0.02</td>
          <td>44.43</td>
          <td>0.28</td>
          <td>5.04</td>
          <td>0.53</td>
      </tr>
  </tbody>
</table>
<p>RNN-based models achieve smaller Wasserstein distances across most properties. The authors attribute this to LogP being computed as a sum of atomic contributions (a local property), which aligns with RNNs&rsquo; strength in capturing local structural features. RNNs also generated ring counts closer to the training distribution (4.10 for SM-RNN vs. 4.04 for SM-Transformer, with training data at 4.21). The Transformer performed better on global structural similarity (higher Tanimoto similarity to training data).</p>
<h3 id="multidistribution-task">Multidistribution task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.16</td>
          <td>0.07</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>18.34</td>
          <td>0.02</td>
          <td>7.07</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.46</td>
          <td>0.38</td>
          <td>0.55</td>
          <td>0.03</td>
          <td>110.72</td>
          <td>0.24</td>
          <td>10.00</td>
          <td>1.58</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.16</td>
          <td>0.16</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>39.94</td>
          <td>0.02</td>
          <td>10.03</td>
          <td>1.28</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.73</td>
          <td>0.37</td>
          <td>0.63</td>
          <td>0.04</td>
          <td>107.46</td>
          <td>0.30</td>
          <td>17.57</td>
          <td>2.40</td>
      </tr>
  </tbody>
</table>
<p>Both SMILES-based models captured all four modes of the MW distribution well. While RNNs had smaller overall Wasserstein distances, the Transformer fitted the higher-MW modes better. This aligns with the observation that longer molecular sequences (which correlate with higher MW) favor the Transformer&rsquo;s global attention mechanism over the RNN&rsquo;s sequential processing.</p>
<h3 id="large-scale-task">Large-scale task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.46</td>
          <td>1.89</td>
          <td>0.20</td>
          <td>0.01</td>
          <td>307.09</td>
          <td>0.03</td>
          <td>105.29</td>
          <td>12.05</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.65</td>
          <td>1.78</td>
          <td>0.43</td>
          <td>0.01</td>
          <td>456.98</td>
          <td>0.14</td>
          <td>100.79</td>
          <td>15.26</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.36</td>
          <td>1.64</td>
          <td>0.07</td>
          <td>0.01</td>
          <td>172.93</td>
          <td>0.02</td>
          <td>59.04</td>
          <td>7.41</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.91</td>
          <td>2.82</td>
          <td>0.47</td>
          <td>0.01</td>
          <td>464.75</td>
          <td>0.18</td>
          <td>92.91</td>
          <td>11.57</td>
      </tr>
  </tbody>
</table>
<p>The Transformer demonstrates a clear advantage on large molecules. SM-Transformer achieves substantially lower Wasserstein distances than SM-RNN across nearly all properties, with particularly large improvements in BCT (172.93 vs. 307.09) and MW (59.04 vs. 105.29). The Transformer also produces better Tanimoto similarity scores and more accurate token length distributions.</p>
<h3 id="standard-metrics-across-all-tasks">Standard metrics across all tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>SM-RNN</th>
          <th>SF-RNN</th>
          <th>SM-Transformer</th>
          <th>SF-Transformer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>Valid</td>
          <td>0.90</td>
          <td>1.00</td>
          <td>0.89</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Uniqueness</td>
          <td>0.98</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Novelty</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.71</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Valid</td>
          <td>0.95</td>
          <td>1.00</td>
          <td>0.97</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Uniqueness</td>
          <td>0.96</td>
          <td>1.00</td>
          <td>1.00</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Novelty</td>
          <td>0.91</td>
          <td>0.98</td>
          <td>0.91</td>
          <td>0.98</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Valid</td>
          <td>0.84</td>
          <td>1.00</td>
          <td>0.88</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Uniqueness</td>
          <td>0.99</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Novelty</td>
          <td>0.85</td>
          <td>0.92</td>
          <td>0.86</td>
          <td>0.94</td>
      </tr>
  </tbody>
</table>
<p>SELFIES achieves 100% validity across all tasks by construction, while SMILES validity drops for large molecules. The Transformer achieves slightly higher validity than the RNN for SMILES-based models, particularly on the large-scale task (0.88 vs. 0.84).</p>
<h2 id="conclusions-and-practical-guidelines">Conclusions and Practical Guidelines</h2>
<p>The central finding is that neither architecture universally dominates. The choice between RNNs and Transformers should depend on the characteristics of the molecular data:</p>
<ul>
<li>
<p><strong>RNNs are preferred</strong> when molecular properties depend on local structural features (e.g., LogP, ring counts) and when sequences are relatively short. They better capture local fragment distributions.</p>
</li>
<li>
<p><strong>Transformers are preferred</strong> when dealing with large molecules (high MW, long sequences) where global attention can capture the overall distribution more effectively. RNNs suffer from information obliteration on long sequences.</p>
</li>
<li>
<p><strong>SMILES outperforms SELFIES</strong> on property distribution metrics across nearly all tasks and models. While SELFIES guarantees 100% syntactic validity, its generated molecules show worse distributional fidelity to training data. The authors argue that validity is a less important concern than property fidelity, since invalid SMILES can be filtered easily.</p>
</li>
</ul>
<p>The authors acknowledge that longer sequences remain challenging for both architectures. For Transformers, the quadratic growth of the attention matrix limits scalability. For RNNs, the vanishing gradient problem limits effective context length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Task 1</td>
          <td>ZINC15 (penalized LogP &gt; 4.0)</td>
          <td>Not specified</td>
          <td>High penalized LogP molecules</td>
      </tr>
      <tr>
          <td>Task 2</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> + ZINC + CEP + POLYMERS</td>
          <td>~200K</td>
          <td>Multimodal MW distribution</td>
      </tr>
      <tr>
          <td>Task 3</td>
          <td>PubChem (&gt;100 heavy atoms)</td>
          <td>Not specified</td>
          <td>MW range 1250-5000</td>
      </tr>
  </tbody>
</table>
<p>Data processing code available at <a href="https://github.com/danielflamshep/genmoltasks">https://github.com/danielflamshep/genmoltasks</a> (from the original Flam-Shepherd et al. study).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenizer (not character-by-character)</li>
<li><strong>Hyperparameter search</strong>: Random search over learning rate [0.0001, 0.001], hidden units, layers [3, 5], dropout [0.0, 0.5]</li>
<li><strong>Selection</strong>: Top 20% by sum of valid + unique + novelty, then final selection on all indicators</li>
<li><strong>Generation</strong>: 10K molecules per model per task</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN variants</td>
          <td>5.2M - 36.4M</td>
          <td>RNN (LSTM/GRU)</td>
      </tr>
      <tr>
          <td>Transformer variants</td>
          <td>5.3M - 36.4M</td>
          <td>Transformer decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Wasserstein distance for property distributions (FCD, LogP, SA, QED, BCT, NP, MW, TL), Tanimoto similarity (molecular and scaffold), validity, uniqueness, novelty.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/viko-3/language_model">trans_language</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transformer implementation by the authors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">genmoltasks</a></td>
          <td>Code/Data</td>
          <td>Apache-2.0</td>
          <td>Dataset construction from Flam-Shepherd et al.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Wang, Z., Zeng, X., Li, Y., Li, P., Ye, X., &amp; Sakurai, T. (2023). Molecular language models: RNNs or transformer? <em>Briefings in Functional Genomics</em>, 22(4), 392-400. <a href="https://doi.org/10.1093/bfgp/elad012">https://doi.org/10.1093/bfgp/elad012</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chen2023molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular language models: RNNs or transformer?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Yangyang and Wang, Zixu and Zeng, Xiangxiang and Li, Yayang and Li, Pengyong and Ye, Xiucai and Sakurai, Tetsuya}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Functional Genomics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bfgp/elad012}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>REINVENT 4: Open-Source Generative Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/reinvent4-generative-molecule-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/reinvent4-generative-molecule-design/</guid><description>REINVENT 4 is an open-source generative AI framework combining RNNs and transformers with reinforcement and curriculum learning for de novo molecular design.</description><content:encoded><![CDATA[<h2 id="an-open-source-reference-implementation-for-generative-molecular-design">An Open-Source Reference Implementation for Generative Molecular Design</h2>
<p>REINVENT 4 is a <strong>Resource</strong> paper presenting a production-grade, open-source software framework for AI-driven generative molecular design. The primary contribution is the unified codebase that integrates four distinct molecule generators (de novo, scaffold decoration, linker design, molecular optimization) within three machine learning optimization algorithms (transfer learning, reinforcement learning, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/curriculum-learning-molecular-design/">curriculum learning</a>). The software is released under the Apache 2.0 license and represents the fourth major version of the REINVENT platform, which has been in continuous production use at AstraZeneca for drug discovery.</p>
<h2 id="bridging-the-gap-between-research-prototypes-and-production-molecular-design">Bridging the Gap Between Research Prototypes and Production Molecular Design</h2>
<p>The motivation for REINVENT 4 stems from several gaps in the generative molecular design landscape. While numerous AI model architectures have been developed for molecular generation (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">VAEs</a>, GANs, RNNs, transformers, flow models, diffusion models), most exist as research prototypes released alongside individual publications rather than as maintained, integrated software. The authors argue that the scientific community needs reference implementations of common generative molecular design algorithms in the public domain to:</p>
<ol>
<li>Enable nuanced debate about the application of AI in drug discovery</li>
<li>Serve as educational tools for practitioners entering the field</li>
<li>Increase transparency around AI-driven molecular design</li>
<li>Provide a foundation for future innovation</li>
</ol>
<p>REINVENT 4 consolidates previously separate codebases (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> v1, v2, LibInvent, LinkInvent, Mol2Mol) into a single repository with a consistent interface, addressing the fragmentation that characterized earlier releases.</p>
<h2 id="unified-framework-for-sequence-based-molecular-generation">Unified Framework for Sequence-Based Molecular Generation</h2>
<p>The core design of REINVENT 4 centers on sequence-based neural network models that generate <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings in an autoregressive manner. All generators model the probability of producing a token sequence, with two formulations.</p>
<p>For unconditional agents (de novo generation), the joint probability of a sequence $T$ with tokens $t_1, t_2, \ldots, t_\ell$ is:</p>
<p>$$
\mathbf{P}(T) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
<p>For conditional agents (scaffold decoration, linker design, molecular optimization), the joint probability given an input sequence $S$ is:</p>
<p>$$
\mathbf{P}(T \mid S) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1, S)
$$</p>
<p>The negative log-likelihood for unconditional agents is:</p>
<p>$$
NLL(T) = -\log \mathbf{P}(T) = -\sum_{i=1}^{\ell} \log \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
<h3 id="reinforcement-learning-with-dap">Reinforcement Learning with DAP</h3>
<p>The key optimization mechanism is reinforcement learning via the &ldquo;Difference between Augmented and Posterior&rdquo; (DAP) strategy. For each generated sequence $T$, the augmented likelihood is defined as:</p>
<p>$$
\log \mathbf{P}_{\text{aug}}(T) = \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)
$$</p>
<p>where $\mathbf{S}(T) \in [0, 1]$ is the scalar score and $\sigma \geq 0$ controls the balance between reward and regularization. The DAP loss is:</p>
<p>$$
\mathcal{L}(T) = \left(\log \mathbf{P}_{\text{aug}}(T) - \log \mathbf{P}_{\text{agent}}(T)\right)^2
$$</p>
<p>The presence of the prior likelihood in the augmented likelihood constrains how far the agent can deviate from chemically plausible space, functioning similarly to proximal policy gradient methods. The loss is lower-bounded by:</p>
<p>$$
\mathcal{L}(T) \geq \max\left(0, \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)\right)^2
$$</p>
<h3 id="four-molecule-generators">Four Molecule Generators</h3>
<p>REINVENT 4 supports four generator types:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>Architecture</th>
          <th>Input</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reinvent</td>
          <td>RNN</td>
          <td>None</td>
          <td>De novo design from scratch</td>
      </tr>
      <tr>
          <td>LibInvent</td>
          <td>RNN</td>
          <td>Scaffold SMILES</td>
          <td>R-group replacement, library design</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/link-invent-generative-linker-design/">LinkInvent</a></td>
          <td>RNN</td>
          <td>Two warhead fragments</td>
          <td>Linker design, scaffold hopping</td>
      </tr>
      <tr>
          <td>Mol2Mol</td>
          <td>Transformer</td>
          <td>Input molecule</td>
          <td>Molecular optimization within similarity bounds</td>
      </tr>
  </tbody>
</table>
<p>All generators are fully integrated with all three optimization algorithms (TL, RL, CL). The Mol2Mol transformer was trained on over 200 billion molecular pairs from PubChem with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> $\geq 0.50$, using ranking loss to directly link negative log-likelihood to molecular similarity.</p>
<h3 id="staged-learning-curriculum-learning">Staged Learning (Curriculum Learning)</h3>
<p>A key new feature is staged learning, which implements curriculum learning as multi-stage RL. Each stage can define a different scoring profile, allowing users to gradually phase in computationally expensive scoring functions. For example, cheap drug-likeness filters can run first, followed by docking in later stages. Stages terminate when a maximum score threshold is exceeded or a step limit is reached.</p>
<h3 id="scoring-subsystem">Scoring Subsystem</h3>
<p>The scoring subsystem implements a plugin architecture supporting over 25 scoring components, including:</p>
<ul>
<li>Physicochemical descriptors from RDKit (QED, SLogP, TPSA, molecular weight, etc.)</li>
<li>Molecular docking via DockStream (<a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>, rDock, Hybrid, Glide, GOLD)</li>
<li>QSAR models via Qptuna and ChemProp (D-MPNN)</li>
<li>Shape similarity via ROCS</li>
<li>Synthesizability estimation via SA score</li>
<li>Matched molecular pairs via mmpdb</li>
<li>Generic REST and external process interfaces</li>
</ul>
<p>Scores are aggregated via weighted arithmetic or geometric mean. A transform system (sigmoid, step functions, value maps) normalizes individual component scores to $[0, 1]$.</p>
<h2 id="pdk1-inhibitor-case-study">PDK1 Inhibitor Case Study</h2>
<p>The paper demonstrates REINVENT 4 through a structure-based drug design exercise targeting <a href="https://en.wikipedia.org/wiki/PDPK1">Phosphoinositide-dependent kinase-1 (PDK1)</a> inhibitors. The experimental setup uses PDB crystal structure 2XCH with DockStream and Glide for docking, defining hits as molecules with docking score $\leq -8$ kcal/mol and QED $\geq 0.7$.</p>
<p><strong>Baseline RL from prior</strong>: 50 epochs of staged learning with batch size 128 produced 119 hits from 6,400 generated molecules (1.9% hit rate), spread across 103 generic Bemis-Murcko scaffolds.</p>
<p><strong>Transfer learning + RL</strong>: After 10 epochs of TL on 315 congeneric pyridinone PDK1 actives from PubChem Assay AID1798002, the same 50-epoch RL run produced 222 hits (3.5% hit rate) across 176 unique generic scaffolds, nearly doubling productivity.</p>
<p>Both approaches generated top-scoring molecules (docking score of -10.1 kcal/mol each) with plausible binding poses reproducing key protein-ligand interactions seen in the native crystal structure, including hinge interactions with ALA 162 and contacts with LYS 111.</p>
<p>The paper also demonstrates the agent&rsquo;s plasticity through a molecular weight switching experiment: after 500 epochs driving generation toward 1500 Da molecules, switching the reward to favor molecules $\leq 500$ Da resulted in rapid adaptation within ~50 epochs, showing that the RL agent can recover from extreme biases.</p>
<h2 id="practical-software-for-ai-driven-drug-discovery">Practical Software for AI-Driven Drug Discovery</h2>
<p>REINVENT 4 represents a mature, well-documented framework that consolidates years of incremental development into a single codebase. Key practical features include TOML/JSON configuration, TensorBoard visualization, multinomial sampling and beam search decoding, diversity filters for scaffold-level novelty, experience replay (inception), and a plugin mechanism for extending the scoring subsystem.</p>
<p>The authors acknowledge that this is one approach among many and that there is no single solution that uniformly outperforms others. REINVENT has demonstrated strong sample efficiency in benchmarks and produced realistic 3D docking poses, but the paper does not claim universal superiority. The focus is on providing a well-engineered, transparent reference implementation rather than advancing a novel algorithm.</p>
<p>Limitations include that only the Mol2Mol prior supports stereochemistry, the training data biases constrain the explorable chemical space, and the SMILES-based representation inherits the known fragility of string-based molecular encodings.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training (Reinvent)</td>
          <td>ChEMBL 25</td>
          <td>~1.7M molecules</td>
          <td>Drug-like compounds</td>
      </tr>
      <tr>
          <td>Prior training (LibInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Scaffold-decoration pairs</td>
      </tr>
      <tr>
          <td>Prior training (LinkInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Fragment-linker pairs</td>
      </tr>
      <tr>
          <td>Prior training (Mol2Mol)</td>
          <td>ChEMBL 28 / PubChem</td>
          <td>~200B pairs</td>
          <td>Tanimoto similarity $\geq 0.50$</td>
      </tr>
      <tr>
          <td>Case study TL</td>
          <td>PubChem AID1798002</td>
          <td>315 compounds</td>
          <td>Congeneric PDK1 actives</td>
      </tr>
      <tr>
          <td>Case study docking</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 crystal structure</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimization</strong>: DAP (recommended), plus three deprecated alternatives (REINFORCE, A2C, MAULI)</li>
<li><strong>Decoding</strong>: Multinomial sampling (default, temperature $K = 1$) and beam search</li>
<li><strong>Diversity filter</strong>: Murcko scaffold, topological scaffold, scaffold similarity, same-SMILES penalty</li>
<li><strong>Experience replay</strong>: Inception memory with configurable size and sampling rate</li>
<li><strong>Gradient descent</strong>: Adam optimizer</li>
</ul>
<h3 id="models">Models</h3>
<p>All pre-trained priors are distributed with the repository. RNN-based generators (Reinvent, LibInvent, LinkInvent) and transformer-based generator (Mol2Mol) with multiple similarity-conditioned variants.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Condition</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit rate (RL)</td>
          <td>1.9%</td>
          <td>50 epochs, batch 128</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Hit rate (TL+RL)</td>
          <td>3.5%</td>
          <td>10 TL + 50 RL epochs</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Scaffold diversity (RL)</td>
          <td>103 scaffolds</td>
          <td>From 119 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Scaffold diversity (TL+RL)</td>
          <td>176 scaffolds</td>
          <td>From 222 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Best docking score</td>
          <td>-10.1 kcal/mol</td>
          <td>Both methods</td>
          <td>Glide SP</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements. REINVENT 4 supports both GPU and CPU execution. Python 3.10+ is required, with PyTorch 1.x (2.0 also compatible) and RDKit 2022.9+.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/REINVENT4">REINVENT4</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Full framework with pre-trained priors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/DockStream">DockStream</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Docking wrapper for scoring</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Loeffler, H. H., He, J., Tibo, A., Janet, J. P., Voronov, A., Mervin, L. H., &amp; Engkvist, O. (2024). Reinvent 4: Modern AI-driven generative molecule design. <em>Journal of Cheminformatics</em>, 16, 20. <a href="https://doi.org/10.1186/s13321-024-00812-5">https://doi.org/10.1186/s13321-024-00812-5</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{loeffler2024reinvent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Reinvent 4: Modern AI-driven generative molecule design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Loeffler, Hannes H. and He, Jiazhen and Tibo, Alessandro and Janet, Jon Paul and Voronov, Alexey and Mervin, Lewis H. and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00812-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is a scaling factor. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = (\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K epochs on a single GPU in Google Colaboratory</li>
<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K epochs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/chemistry/molecular-representations/notations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}&rsquo; = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x_i&rsquo;, h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NLP Models That Automate Programming for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/nlp-models-transform-chemistry/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/nlp-models-transform-chemistry/</guid><description>A perspective on how code-generating LLMs like OpenAI Codex and GPT-3 will reshape computational chemistry research workflows and education.</description><content:encoded><![CDATA[<h2 id="a-perspective-on-code-generating-llms-for-chemistry">A Perspective on Code-Generating LLMs for Chemistry</h2>
<p>This is a <strong>Position</strong> paper that argues large language models (LLMs) capable of generating code from natural language prompts, specifically OpenAI&rsquo;s Codex and GPT-3, are poised to transform both chemistry research and chemistry education. Published in the inaugural volume of Digital Discovery (RSC), the paper combines a brief history of NLP developments with concrete demonstrations of code generation for computational chemistry tasks, then offers a forward-looking perspective on challenges and opportunities.</p>
<h2 id="bridging-the-gap-between-natural-language-and-scientific-software">Bridging the Gap Between Natural Language and Scientific Software</h2>
<p>The authors identify a core friction in modern computational chemistry: while the number of available software packages has grown dramatically, researchers spend a large fraction of their time learning interfaces to these packages rather than doing science. Tasks like searching documentation, following tutorials, and trial-and-error experimentation with APIs consume effort that could be directed at research itself.</p>
<p>At the same time, programming assignments in chemistry courses serve dual pedagogical purposes (reinforcing physical intuition and teaching marketable skills), but are constrained by students&rsquo; median programming experience. The emergence of code-generating NLP models opens the possibility of reducing both barriers simultaneously.</p>
<h2 id="code-generation-as-a-chemistry-interface">Code Generation as a Chemistry Interface</h2>
<p>The paper&rsquo;s core thesis is that NLP models trained on code can serve as a natural language interface to the entire ecosystem of scientific computing tools. The authors demonstrate this with several concrete examples using OpenAI Codex:</p>
<ol>
<li>
<p><strong>Quantum chemistry</strong>: Prompting Codex to &ldquo;compute the dissociation curve of H2 using pyscf&rdquo; produced correct, runnable code that selected <a href="https://en.wikipedia.org/wiki/Hartree%E2%80%93Fock_method">Hartree-Fock</a> with <a href="https://en.wikipedia.org/wiki/STO-nG_basis_sets">STO-3G</a>. A follow-up prompt requesting &ldquo;the most accurate method&rdquo; caused it to switch to <a href="https://en.wikipedia.org/wiki/Coupled_cluster">CCSD</a> in a large basis set.</p>
</li>
<li>
<p><strong>Chemical entity recognition</strong>: Using GPT-3 with only three training examples, the authors demonstrated extraction of chemical entity names from published text, a task that previously required thousands of labeled examples.</p>
</li>
<li>
<p><strong>Molecular visualization</strong>: Drawing caffeine from its <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string, generating Gaussian input files from SMILES, implementing random walks, and downloading and analyzing <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB structures</a> with MDTraj.</p>
</li>
<li>
<p><strong>Voice-controlled molecular dynamics</strong>: The authors previously built MARVIS, a voice-controlled <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> analysis tool that uses GPT-3 to convert natural language into <a href="https://en.wikipedia.org/wiki/Visual_Molecular_Dynamics">VMD</a> commands. Only about a dozen examples were needed to teach GPT-3 to render proteins, change representations, and select atoms.</p>
</li>
</ol>
<p>An important caveat: the authors emphasize that all chemistry &ldquo;knowledge&rdquo; (including the SMILES string for caffeine) is entirely contained in the model&rsquo;s learned floating-point weights. The model has no access to databases or curated lists of chemical concepts.</p>
<h2 id="demonstrations-and-practical-evaluation">Demonstrations and Practical Evaluation</h2>
<p>Rather than a formal experimental evaluation with benchmarks and metrics, this perspective paper relies on qualitative demonstrations. The key examples, with full details provided in the ESI, include:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>H2 dissociation curve</td>
          <td>Natural language prompt</td>
          <td>Correct PySCF code (HF/STO-3G)</td>
      </tr>
      <tr>
          <td>Upgrade method accuracy</td>
          <td>Follow-up prompt</td>
          <td>Switched to CCSD with large basis</td>
      </tr>
      <tr>
          <td>Chemical NER</td>
          <td>3 examples + new text</td>
          <td>Extracted compound names (with some gaps)</td>
      </tr>
      <tr>
          <td>Molecule drawing</td>
          <td>&ldquo;Load caffeine from SMILES, draw it&rdquo;</td>
          <td>Correct RDKit rendering</td>
      </tr>
      <tr>
          <td>Gaussian input file</td>
          <td>Function with docstring</td>
          <td>Complete file writer with B3LYP/6-31G(d)</td>
      </tr>
      <tr>
          <td>PDB analysis</td>
          <td>Natural language description</td>
          <td>Downloaded structure and computed <a href="https://en.wikipedia.org/wiki/Radius_of_gyration">radius of gyration</a></td>
      </tr>
  </tbody>
</table>
<p>The authors note that Codex generates correct code at about a 30% rate on a single attempt for standard problems, improving to above 50% when multiple solutions are tried. Mistakes tend to occur when complex algorithms are requested with little specificity, and the code rarely has syntax errors but may fail in obvious ways (missing imports, wrong data types).</p>
<h2 id="challenges-access-correctness-and-bias">Challenges: Access, Correctness, and Bias</h2>
<p>The paper identifies three ongoing challenges:</p>
<p><strong>Access and price.</strong> Advanced models from OpenAI were, at the time of writing, limited to early testers. Per-query costs (1-3 cents for GPT-3) would become prohibitive at the scale needed for parsing academic literature or supporting medium-sized courses. The authors advocate for open-source models and equitable deployment by researchers with computational resources.</p>
<p><strong>Correctness.</strong> Code generation does not guarantee correctness. The authors raise a subtle point: Codex may produce code that executes successfully but does not follow best scientific practice for a particular computational task. Over-reliance on AI-generated code without verification could erode trust in scientific software. However, they argue that strategies for assessing code correctness apply equally to human-written and AI-generated code.</p>
<p><strong>Fairness and bias.</strong> The authors flag several concerns: AI-generated code trained on its own outputs could narrow the range of packages, methods, or programming languages used in chemistry. They observed Codex&rsquo;s preference for Python and for specific popular libraries (e.g., defaulting to <a href="https://en.wikipedia.org/wiki/PSI_(computational_chemistry)">Psi4</a> for single-point energy calculations). GPT-3 has also been shown to reflect racism, sexism, and other biases present in its training data.</p>
<h2 id="implications-for-research-and-education">Implications for Research and Education</h2>
<p>The authors conclude with an optimistic but measured outlook:</p>
<ul>
<li><strong>For research</strong>: NLP code generation will increase accessibility of software tools and expand what a single research group can accomplish. Better tools have historically not reduced the need for scientists but expanded the complexity of problems that can be tackled.</li>
<li><strong>For programming skills</strong>: Using Codex will make chemists better programmers, not worse. The process of crafting prompts, mentally checking outputs, testing on sample inputs, and iterating develops algorithmic thinking. The authors report discovering chemistry software libraries they would not have found otherwise through iterative prompt creation.</li>
<li><strong>For education</strong>: Instructors should rethink programming assignments. The authors suggest moving toward more difficult compound assignments, treating code exercises as laboratory explorations of scientific concepts rather than syntax drills, and aligning coursework with the tools students will have access to in their careers.</li>
<li><strong>For accessibility</strong>: NLP models can reduce barriers for non-native English speakers (though accuracy with non-English prompts was not fully explored) and for users who have difficulty with keyboard-and-mouse interfaces (via voice control).</li>
</ul>
<p>The paper acknowledges that these capabilities were, in early 2022, just beginning, with Codex being the first capable code-generation model. Already at the time of writing, models surpassing GPT-3 in language tasks had appeared, and models matching GPT-3 with 1/20th the parameters had been demonstrated.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a perspective paper with qualitative demonstrations rather than a reproducible experimental study. The authors provide all prompts and multiple responses in the ESI.</p>
<h3 id="data">Data</h3>
<p>All prompts and code outputs are provided in the Electronic Supplementary Information (ESI) available from the RSC.</p>
<h3 id="algorithms">Algorithms</h3>
<p>The paper does not introduce new algorithms. It evaluates existing models (GPT-3, Codex) on chemistry-related code generation tasks.</p>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Provider</th>
          <th>Access</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GPT-3</td>
          <td>OpenAI</td>
          <td>API access (commercial)</td>
      </tr>
      <tr>
          <td>Codex</td>
          <td>OpenAI</td>
          <td>Early tester program (2021)</td>
      </tr>
      <tr>
          <td>GPT-Neo</td>
          <td>EleutherAI</td>
          <td>Open source</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>No formal metrics are reported for the chemistry demonstrations. The authors cite the Codex paper&rsquo;s reported ~30% pass rate on single attempts and &gt;50% with multiple attempts on standard programming problems.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware requirements are specified for the demonstrations (API-based inference).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/whitead/marvis">MARVIS</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Voice-controlled MD analysis using GPT-3</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hocky, G. M., &amp; White, A. D. (2022). Natural language processing models that automate programming will transform chemistry research and teaching. <em>Digital Discovery</em>, 1(2), 79-83. <a href="https://doi.org/10.1039/d1dd00009h">https://doi.org/10.1039/d1dd00009h</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{hocky2022natural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Natural language processing models that automate programming will transform chemistry research and teaching}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hocky, Glen M. and White, Andrew D.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{79--83}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d1dd00009h}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation of Chemical Nomenclature</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/nmt-chemical-nomenclature-en-zh/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/nmt-chemical-nomenclature-en-zh/</guid><description>Xu et al. apply CNN and LSTM seq2seq models to translate chemical nomenclature between English and Chinese, outperforming rule-based tools.</description><content:encoded><![CDATA[<h2 id="a-method-for-neural-translation-of-chemical-names">A Method for Neural Translation of Chemical Names</h2>
<p>This is a <strong>Method</strong> paper that introduces deep learning approaches for translating chemical nomenclature between English and Chinese. The primary contribution is demonstrating that character-level sequence-to-sequence neural networks (both CNN-based and LSTM-based) can serve as viable alternatives to hand-crafted rule-based translation systems for chemical names. The work compares two neural architectures against an existing rule-based tool on bilingual chemical name datasets.</p>
<h2 id="bridging-the-english-chinese-chemical-nomenclature-gap">Bridging the English-Chinese Chemical Nomenclature Gap</h2>
<p>English and Chinese are the two most widely used languages for chemical nomenclature worldwide. Translation between them is important for chemical data processing, especially for converting Chinese chemical names extracted via named entity recognition into English names that existing name-to-structure tools can parse. Rule-based translation between these languages faces considerable challenges:</p>
<ol>
<li>Chinese chemical names lack word boundaries (no spaces), making segmentation difficult.</li>
<li>Word order is often reversed between English and Chinese chemical names (e.g., &ldquo;ethyl acetate&rdquo; maps to characters meaning &ldquo;acetate-ethyl&rdquo; in Chinese).</li>
<li>The same English morpheme can map to different Chinese characters depending on chemical context (e.g., &ldquo;ethyl&rdquo; translates differently in &ldquo;ethyl acetate&rdquo; vs. &ldquo;ethyl alcohol&rdquo;).</li>
<li>Trivial names, especially for natural products, follow irregular translation patterns or are transliterations.</li>
</ol>
<p>Building comprehensive rule sets requires a formally trained chemist fluent in both languages, making rule-based approaches expensive and fragile.</p>
<h2 id="character-level-sequence-to-sequence-translation">Character-Level Sequence-to-Sequence Translation</h2>
<p>The core idea is to treat chemical name translation as a character-level machine translation task, applying encoder-decoder architectures with attention mechanisms. Two architectures are proposed:</p>
<p><strong>CNN-based architecture</strong>: Three 1D convolutional layers encode the input character sequence. A decoder with three 1D convolutional layers processes the target sequence offset by one timestep, combined with attention mechanism layers that connect encoder and decoder outputs. Two additional 1D convolutional layers produce the final decoded output sequence.</p>
<p><strong>LSTM-based architecture</strong>: An LSTM encoder converts the input sequence into two state vectors. An LSTM decoder is trained with teacher forcing, using the encoder&rsquo;s state vectors as its initial state, and generating the target sequence offset by one timestep.</p>
<p>Both models operate at the character level. Input chemical name strings are transformed into embedding vectors, with the vocabulary size equal to the number of unique characters in the respective language (100 unique characters for English names, 2,056 unique characters for Chinese names).</p>
<h2 id="experimental-setup-and-comparison-with-rule-based-tool">Experimental Setup and Comparison with Rule-Based Tool</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors built two directional datasets from a manually curated corpus of scientific literature maintained at their institution:</p>
<ul>
<li><strong>En2Ch (English to Chinese)</strong>: 30,394 name pairs after deduplication</li>
<li><strong>Ch2En (Chinese to English)</strong>: 37,207 name pairs after deduplication</li>
</ul>
<p>The datasets cover systematic compound names through trivial names. For names with multiple valid translations, the most commonly used translation was selected. Each dataset was split 80/20 for training and validation.</p>
<h3 id="model-configuration">Model Configuration</h3>
<p>Both neural network models used the following hyperparameters:</p>
<ul>
<li>Batch size: 64</li>
<li>Epochs: 100</li>
<li>Latent dimensionality: 256 (encoding and decoding space)</li>
<li>Implementation: Python 3.7 with Keras 2.3 and TensorFlow backend</li>
</ul>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The models were evaluated on five metrics across both translation directions:</p>
<ul>
<li><strong>Success Rate</strong>: Percentage of inputs that produced any output</li>
<li><strong>String Matching Accuracy</strong>: Exact match with the single target name</li>
<li><strong>Data Matching Accuracy</strong>: Exact match allowing any valid translation from the corpus</li>
<li><strong>Manual Spot Check</strong>: Blind evaluation of 100 random samples per approach</li>
<li><strong>Running Time</strong>: Wall-clock time on the same hardware</li>
</ul>
<h3 id="baseline">Baseline</h3>
<p>The rule-based comparison system operates in three steps: disassemble the input name into word fragments, translate each fragment, and reassemble into the target language. This tool had been deployed as an online service with over one million uses at the time of publication.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>CNN</th>
          <th>LSTM</th>
          <th>Rule-based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate En2Ch</td>
          <td>100%</td>
          <td>100%</td>
          <td>75.97%</td>
      </tr>
      <tr>
          <td>Success Rate Ch2En</td>
          <td>100%</td>
          <td>100%</td>
          <td>59.90%</td>
      </tr>
      <tr>
          <td>String Match En2Ch</td>
          <td>82.92%</td>
          <td>89.64%</td>
          <td>39.81%</td>
      </tr>
      <tr>
          <td>String Match Ch2En</td>
          <td>78.11%</td>
          <td>55.44%</td>
          <td>43.77%</td>
      </tr>
      <tr>
          <td>Data Match En2Ch</td>
          <td>84.44%</td>
          <td>90.82%</td>
          <td>45.15%</td>
      </tr>
      <tr>
          <td>Data Match Ch2En</td>
          <td>80.22%</td>
          <td>57.40%</td>
          <td>44.91%</td>
      </tr>
      <tr>
          <td>Manual Check En2Ch</td>
          <td>90.00%</td>
          <td>89.00%</td>
          <td>80.00%</td>
      </tr>
      <tr>
          <td>Manual Check Ch2En</td>
          <td>82.00%</td>
          <td>61.00%</td>
          <td>78.00%</td>
      </tr>
      <tr>
          <td>Time En2Ch (s)</td>
          <td>1423</td>
          <td>190</td>
          <td>288</td>
      </tr>
      <tr>
          <td>Time Ch2En (s)</td>
          <td>1876</td>
          <td>303</td>
          <td>322</td>
      </tr>
  </tbody>
</table>
<p>Both neural approaches achieved 100% success rate (always producing output), while the rule-based tool failed on 24% and 40% of inputs for En2Ch and Ch2En respectively. The rule-based tool&rsquo;s failures were concentrated on Chinese names lacking word boundaries and on trivial names of natural products.</p>
<p>For English-to-Chinese translation, LSTM performed best at 89.64% string matching accuracy (90.82% data matching), followed by CNN at 82.92%. For Chinese-to-English, CNN substantially outperformed LSTM (78.11% vs. 55.44% string matching), suggesting that LSTM had difficulty with long-term dependencies in Chinese character sequences. The authors observed that many LSTM errors appeared at the ends of chemical names.</p>
<h3 id="analysis-by-name-type">Analysis by Name Type</h3>
<p>The CNN-based approach outperformed LSTM on CAS names (80% vs. 52% in manual checks) and was more robust for longer names. The rule-based tool showed consistent performance regardless of name length, suggesting it was more suited to regular systematic names but struggled with the diversity of real-world chemical nomenclature.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Performance depends heavily on training data quality and quantity.</li>
<li>Neither neural approach was validated on an external test set outside the institution&rsquo;s corpus.</li>
<li>The CNN model was considerably slower (5-6x) than the other two approaches.</li>
<li>No comparison against modern transformer-based NMT architectures (the study predates widespread adoption of transformers for this task).</li>
<li>The dataset is relatively small by modern NMT standards (30-37K pairs).</li>
<li>The authors noted that some neural translations were actually better than the target labels, suggesting the evaluation metrics understate true performance.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest that combining CNN and LSTM architectures could yield further improvements, and that the approach has practical applications in scientific publishing (Chinese journals requiring English abstracts) and chemical database interoperability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Validation (En2Ch)</td>
          <td>Curated bilingual corpus</td>
          <td>30,394 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Training/Validation (Ch2En)</td>
          <td>Curated bilingual corpus</td>
          <td>37,207 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Testing (En2Ch)</td>
          <td>Held-out validation split</td>
          <td>6,079 records</td>
          <td>Same source</td>
      </tr>
      <tr>
          <td>Testing (Ch2En)</td>
          <td>Held-out validation split</td>
          <td>7,441 records</td>
          <td>Same source</td>
      </tr>
  </tbody>
</table>
<p>Training data, Python code for both models, and result data are provided as supplementary files with the paper.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level CNN encoder-decoder with attention (3+3+2 conv layers)</li>
<li>Character-level LSTM encoder-decoder with teacher forcing</li>
<li>Batch size: 64, epochs: 100, latent dim: 256</li>
</ul>
<h3 id="models">Models</h3>
<p>Both models implemented in Python 3.7 with Keras 2.3 / TensorFlow. No pre-trained weights are released separately, but the training code is provided as supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value (En2Ch)</th>
          <th>Best Value (Ch2En)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate</td>
          <td>100% (both DL)</td>
          <td>100% (both DL)</td>
          <td>Rule-based: 75.97% / 59.90%</td>
      </tr>
      <tr>
          <td>String Matching</td>
          <td>89.64% (LSTM)</td>
          <td>78.11% (CNN)</td>
          <td>Best neural model per direction</td>
      </tr>
      <tr>
          <td>Data Matching</td>
          <td>90.82% (LSTM)</td>
          <td>80.22% (CNN)</td>
          <td>Allows multiple valid translations</td>
      </tr>
      <tr>
          <td>Manual Spot Check</td>
          <td>90.00% (CNN)</td>
          <td>82.00% (CNN)</td>
          <td>Blind evaluation of 100 samples</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Running times reported but hardware details not provided.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.1186/s13321-020-00457-0">Supplementary files</a></td>
          <td>Code + Data</td>
          <td>CC-BY 4.0</td>
          <td>Training data, CNN/LSTM code, results (Additional files 1-6)</td>
      </tr>
      <tr>
          <td><a href="https://www.organchem.csdb.cn/translate">SIOC Translation Tool</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Rule-based baseline tool, online service</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, T., Chen, W., Zhou, J., Dai, J., Li, Y., &amp; Zhao, Y. (2020). Neural machine translation of chemical nomenclature between English and Chinese. <em>Journal of Cheminformatics</em>, 12, 50. <a href="https://doi.org/10.1186/s13321-020-00457-0">https://doi.org/10.1186/s13321-020-00457-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xu2020neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural machine translation of chemical nomenclature between English and Chinese}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Tingjun and Chen, Weiming and Zhou, Junhong and Dai, Jingfang and Li, Yingyong and Zhao, Yingli}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00457-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>nach0: A Multimodal Chemical and NLP Foundation Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/</guid><description>nach0 is a T5-based encoder-decoder model pre-trained on SMILES, scientific text, and patents, then instruction-tuned for chemical and NLP tasks.</description><content:encoded><![CDATA[<h2 id="a-multi-domain-encoder-decoder-for-chemistry-and-nlp">A Multi-Domain Encoder-Decoder for Chemistry and NLP</h2>
<p>nach0 is a <strong>Method</strong> paper that introduces a unified encoder-decoder foundation model capable of handling both natural language processing (NLP) tasks and chemistry tasks within a single architecture. The primary contribution is demonstrating that a T5-based model pre-trained on scientific text, patents, and <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> molecular strings can be instruction-tuned to perform molecular property prediction, reaction prediction, molecular generation, named entity recognition, question answering, and cross-domain translation (text-to-molecule and molecule-to-text) simultaneously. The model is available in base (250M parameters) and large (780M parameters) configurations.</p>
<h2 id="bridging-chemical-and-linguistic-representations">Bridging Chemical and Linguistic Representations</h2>
<p>Most existing biomedical language models (BioBERT, SciFive, BioMegatron) are trained exclusively on natural language text from sources like PubMed, omitting chemical structure information encoded in SMILES strings. Conversely, chemistry-specific models trained on SMILES data often lack the ability to process natural language instructions or perform NLP tasks. Models like <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a> and MolT5 attempted to bridge this gap by training on both natural language and chemical data, but they were not fine-tuned on a diverse set of chemical tasks using instruction tuning in a multi-task fashion.</p>
<p>nach0 addresses this by creating a shared representation space for both modalities and fine-tuning across a comprehensive set of tasks spanning three domains: NLP-only tasks, chemistry-only tasks, and cross-domain tasks that require translating between natural language and molecular representations.</p>
<h2 id="unified-text-to-text-framework-with-smiles-tokenization">Unified Text-to-Text Framework with SMILES Tokenization</h2>
<p>The core innovation in nach0 is formulating all chemical and linguistic tasks as text-to-text problems within a single encoder-decoder transformer, combined with a specialized SMILES tokenization strategy.</p>
<h3 id="smiles-token-integration">SMILES Token Integration</h3>
<p>Rather than treating SMILES as plain text, nach0 extends the T5 vocabulary with dedicated SMILES tokens. Each SMILES token is annotated with special symbols in the format <code>&lt;sm_{token}&gt;</code>, creating a distinct vocabulary space for molecular representations while preserving the natural language vocabulary from FLAN-T5. The embedding matrix is initialized by reusing learned embeddings from the pre-trained model for original tokens, with new chemical tokens initialized from the first embeddings.</p>
<h3 id="architecture">Architecture</h3>
<p>Both model sizes use the standard <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a> encoder-decoder architecture:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Parameters</th>
          <th>Layers</th>
          <th>Hidden Size</th>
          <th>FFN Size</th>
          <th>Attention Heads</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>250M</td>
          <td>12</td>
          <td>768</td>
          <td>3072</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>780M</td>
          <td>24</td>
          <td>1024</td>
          <td>4096</td>
          <td>16</td>
      </tr>
  </tbody>
</table>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>The model is pre-trained with a language modeling objective on three data sources:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Documents</th>
          <th>Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubMed abstracts (chemistry-filtered)</td>
          <td>13M</td>
          <td>355M</td>
      </tr>
      <tr>
          <td>USPTO patent descriptions</td>
          <td>119K</td>
          <td>2.9B</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> molecular database</td>
          <td>~100M</td>
          <td>4.7B</td>
      </tr>
  </tbody>
</table>
<h3 id="instruction-tuning">Instruction Tuning</h3>
<p>Following the approach of Raffel et al. and Chung et al., nach0 uses natural language prompts to formulate each task. For example, a retrosynthesis task might be phrased as &ldquo;What reactants could be used to synthesize [SMILES]?&rdquo; and a property prediction task as &ldquo;Can [SMILES] penetrate the <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a>?&rdquo; This enables multi-task training across all domains with a single loss function and shared hyperparameters.</p>
<p>Training uses a batch size of 1024, learning rate of $1 \times 10^{-4}$, and weight decay of 0.01. Pre-training runs for one epoch, and fine-tuning for 10 epochs. Data mixing follows the examples-proportional mixing strategy from T5.</p>
<h2 id="multi-task-evaluation-across-nlp-and-chemistry-benchmarks">Multi-Task Evaluation Across NLP and Chemistry Benchmarks</h2>
<p>nach0 is evaluated on a comprehensive set of benchmarks spanning three task categories.</p>
<h3 id="task-categories">Task Categories</h3>
<p><strong>NLP tasks</strong>: Named entity recognition (BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, JNLPBA), PICO extraction (EBM PICO), textual entailment (MedNLI, SciTail), relation extraction (ChemProt, DDI, GAD), sentence similarity (BIOSSES), document classification (HoC), and question answering (PubMedQA, BioASQ, MedMCQA, MMLU).</p>
<p><strong>Chemistry tasks</strong>: Molecular property prediction (ESOL, FreeSolv, Lipophilicity, BBBP, HIV, BACE from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>; <a href="/notes/chemistry/datasets/qm9/">QM9</a> from Mol-Instructions), molecular generation (<a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>), forward reaction prediction, reagent prediction, and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (from Mol-Instructions/USPTO).</p>
<p><strong>Cross-domain tasks</strong>: Description-guided molecule design and molecular description generation (from Mol-Instructions).</p>
<h3 id="baselines">Baselines</h3>
<p>nach0 is compared against FLAN-T5 (250M), SciFive (220M), and MolT5 (220M), all trained in multi-task fashion.</p>
<h3 id="key-results">Key Results</h3>
<p>On chemistry and cross-domain tasks, nach0 base consistently outperforms all base-sized baselines. Selected highlights from Table 3:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>MolT5</th>
          <th>SciFive</th>
          <th>FLAN</th>
          <th>nach0 Base</th>
          <th>nach0 Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Forward reaction</td>
          <td>Acc@1</td>
          <td>27.0%</td>
          <td>60.0%</td>
          <td>59.0%</td>
          <td>88.0%</td>
          <td>89.9%</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>Acc@1</td>
          <td>15.0%</td>
          <td>31.0%</td>
          <td>31.0%</td>
          <td>53.0%</td>
          <td>56.3%</td>
      </tr>
      <tr>
          <td>Reagent prediction</td>
          <td>Acc@1</td>
          <td>1.1%</td>
          <td>3.8%</td>
          <td>4.0%</td>
          <td>6.3%</td>
          <td>13.1%</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>BA</td>
          <td>0.58</td>
          <td>0.65</td>
          <td>0.65</td>
          <td>0.74</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>BA</td>
          <td>0.55</td>
          <td>0.66</td>
          <td>0.60</td>
          <td>0.67</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>HFE (FreeSolv)</td>
          <td>R2</td>
          <td>-0.36</td>
          <td>0.51</td>
          <td>0.55</td>
          <td>0.77</td>
          <td>0.78</td>
      </tr>
      <tr>
          <td>MOSES (FCD)</td>
          <td>FCD/Test</td>
          <td>0.521</td>
          <td>0.578</td>
          <td>0.529</td>
          <td>0.311</td>
          <td>0.304</td>
      </tr>
      <tr>
          <td>Description-guided mol. design</td>
          <td>BLEU-2</td>
          <td>30.3%</td>
          <td>44.2%</td>
          <td>43.6%</td>
          <td>49.0%</td>
          <td>48.8%</td>
      </tr>
      <tr>
          <td>Mol. description gen.</td>
          <td>BLEU-2</td>
          <td>35.6%</td>
          <td>39.6%</td>
          <td>38.6%</td>
          <td>43.9%</td>
          <td>41.7%</td>
      </tr>
  </tbody>
</table>
<p>On NLP tasks, nach0 base performs comparably to FLAN base, with the two models trading wins across different tasks. nach0 large improves substantially over nach0 base on most tasks.</p>
<h3 id="ablation-study">Ablation Study</h3>
<p>The ablation study (Table 4) examines the impact of multi-task training across chemical task groups. Key findings:</p>
<ul>
<li>nach0 trained on all chemical tasks jointly outperforms models trained on individual task groups (prediction-only, reaction-only, or generation-only) on the total set of metrics</li>
<li>The joint model shows lower novelty scores on MOSES compared to the generation-only model, but this reflects less overfitting to training data rather than worse performance</li>
<li>nach0 consistently outperforms MolT5 across all chemical task configurations, demonstrating the benefit of pre-training on both natural language and chemical data with specialized SMILES tokens</li>
</ul>
<h3 id="case-studies">Case Studies</h3>
<p>Two applied case studies demonstrate nach0 in drug discovery scenarios:</p>
<ol>
<li>
<p><strong>End-to-end drug discovery for <a href="https://en.wikipedia.org/wiki/Diabetes">diabetes mellitus</a></strong>: Using a sequence of prompts, nach0 identifies biological targets, analyzes mechanisms of action, generates molecular structures, proposes synthesis routes, and predicts molecular properties.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Janus_kinase_3">JAK3</a> inhibitor generation with Chemistry42</strong>: nach0 replaces 42 specialized generative models in Insilico Medicine&rsquo;s Chemistry42 platform. In 45 minutes, nach0 generates 8 molecules satisfying all 2D and 3D requirements (hinge binding, active site binding), compared to a 0.04% discovery rate from a combinatorial generator over 24 hours. Chemistry42&rsquo;s full pipeline (72 hours) still produces better structures since it uses reinforcement learning feedback and explicit structural constraints.</p>
</li>
</ol>
<h3 id="comparison-with-chatgpt">Comparison with ChatGPT</h3>
<p>On a subset evaluation, fine-tuned nach0 base outperforms GPT-3.5-turbo on all tested tasks: EBM PICO (F1: 67.6% vs. 64.4%), MedMCQA-Open (BLEU-2: 6.3% vs. 1.7%), and molecular description generation (BLEU-2: 42.8% vs. 2.2%).</p>
<h2 id="competitive-multi-task-performance-with-clear-limitations">Competitive Multi-Task Performance with Clear Limitations</h2>
<p>nach0 demonstrates that a single encoder-decoder model can achieve competitive results across both chemical and NLP tasks when pre-trained on mixed-modality data and fine-tuned with instruction tuning. The model&rsquo;s strongest advantages appear on chemistry tasks (reaction prediction, property prediction, molecular generation), where specialized SMILES tokenization and chemical pre-training provide clear benefits over general-purpose models of similar scale.</p>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ol>
<li>
<p><strong>Not at chemist expert level</strong>: Human evaluations indicate the model does not match domain expert performance. Key gaps include chemical reasoning, knowledge alignment with domain-specific knowledge graphs, and the ability to learn from expert feedback.</p>
</li>
<li>
<p><strong>SMILES-only molecular representation</strong>: The model lacks 3D geometric information. SMILES notation is not one-to-one with molecular structures, and the model does not incorporate molecular graphs or 3D coordinates. The authors suggest <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> as a potential alternative representation.</p>
</li>
<li>
<p><strong>Prompt sensitivity</strong>: Performance depends on prompt quality and specificity. Over-reliance on domain-specific prompts may limit response diversity.</p>
</li>
<li>
<p><strong>Limited chemical diversity</strong>: Cross-domain datasets from Mol-Instructions primarily cover known drugs and chemical probes from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, representing only a fraction of predicted chemical space.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending nach0 with protein sequence modalities (using <a href="/notes/chemistry/molecular-representations/notations/group-selfies-fragment-molecular-representation/">Group SELFIES</a>), expanding zero-shot evaluation capabilities, and integrating knowledge graph information through self-supervised approaches.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (text)</td>
          <td>PubMed abstracts</td>
          <td>13M docs, 355M tokens</td>
          <td>Filtered for chemistry-related content</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>USPTO patents</td>
          <td>119K docs, 2.9B tokens</td>
          <td>Patent descriptions</td>
      </tr>
      <tr>
          <td>Pre-training (chemical)</td>
          <td>ZINC</td>
          <td>~100M docs, 4.7B tokens</td>
          <td>Molecular SMILES strings</td>
      </tr>
      <tr>
          <td>Fine-tuning (NLP)</td>
          <td>17 NLP datasets</td>
          <td>Varies</td>
          <td>See Table 1 in paper</td>
      </tr>
      <tr>
          <td>Fine-tuning (chemistry)</td>
          <td>MoleculeNet, MOSES, Mol-Instructions</td>
          <td>Varies</td>
          <td>Predefined or random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5 encoder-decoder (base: 250M, large: 780M parameters)</li>
<li>Pre-training objective: Language modeling (masked span prediction)</li>
<li>Fine-tuning: Multi-task instruction tuning with examples-proportional mixing</li>
<li>Hyperparameters: batch size 1024, learning rate $1 \times 10^{-4}$, weight decay 0.01</li>
<li>Pre-training: 1 epoch; fine-tuning: 10 epochs</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_base">nach0 Base (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>250M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_large">nach0 Large (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>780M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://github.com/insilicomedicine/nach0">nach0 GitHub Repository</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation spans 17+ NLP benchmarks and 10+ chemistry benchmarks. Metrics include F1 (NER, RE, classification), accuracy (QA, entailment, reaction prediction), balanced accuracy (molecular property classification), R2/RMSE (regression), BLEU-2 (generation), and FCD/SNN/validity/novelty (molecular generation via MOSES).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Base models: NVIDIA A4000 and A5000 GPUs</li>
<li>Large models: NVIDIA DGX cloud platform</li>
<li>Training used tensor and pipeline parallelism via NeMo toolkit</li>
<li>Specific GPU counts and training times not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Livne, M., Miftahutdinov, Z., Tutubalina, E., Kuznetsov, M., Polykovskiy, D., Brundyn, A., Jhunjhunwala, A., Costa, A., Aliper, A., Aspuru-Guzik, A., &amp; Zhavoronkov, A. (2024). nach0: Multimodal Natural and Chemical Languages Foundation Model. <em>Chemical Science</em>, 15(22), 8380-8389. <a href="https://doi.org/10.1039/D4SC00966E">https://doi.org/10.1039/D4SC00966E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{livne2024nach0,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{nach0: multimodal natural and chemical languages foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Livne, Micha and Miftahutdinov, Zulfat and Tutubalina, Elena and Kuznetsov, Maksim and Polykovskiy, Daniil and Brundyn, Annika and Jhunjhunwala, Aastha and Costa, Anthony and Aliper, Alex and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8380--8389}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D4SC00966E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolBERT: Auxiliary Tasks for Molecular BERT Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/</guid><description>MolBERT applies BERT to SMILES with domain-relevant auxiliary tasks like physicochemical property prediction, improving virtual screening and QSAR.</description><content:encoded><![CDATA[<h2 id="bert-based-molecular-representations-with-auxiliary-pre-training-tasks">BERT-Based Molecular Representations with Auxiliary Pre-Training Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces MolBERT, a bidirectional Transformer (BERT) architecture applied to SMILES-based molecular representations for drug discovery. The primary contribution is a systematic study of how different domain-relevant self-supervised pre-training tasks affect the quality of learned molecular embeddings, paired with a model that achieves state-of-the-art performance on <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-activity relationship (QSAR)</a> benchmarks.</p>
<h2 id="why-domain-relevant-pre-training-matters-for-molecular-language-models">Why Domain-Relevant Pre-Training Matters for Molecular Language Models</h2>
<p>Molecular representations are foundational for predictive, generative, and analytical tasks in drug discovery. Language models applied to text-based molecular representations like SMILES have demonstrated strong performance across property prediction, reaction prediction, and molecular generation. However, several open questions remained at the time of this work:</p>
<ol>
<li><strong>Task selection for pre-training</strong>: Prior work explored masked token prediction, input translation, and property concatenation, but there was no systematic comparison of how different self-supervised tasks affect downstream performance.</li>
<li><strong>SMILES ambiguity</strong>: The same molecule can be encoded as many different SMILES strings depending on how the molecular graph is traversed. Canonicalization algorithms address this but introduce their own artifacts that may distract the model.</li>
<li><strong>Domain knowledge integration</strong>: Standard NLP pre-training objectives (e.g., masked language modeling) do not explicitly encode chemical knowledge. It was unclear whether incorporating chemistry-specific supervision during pre-training could improve representation quality.</li>
</ol>
<p>MolBERT addresses these gaps by evaluating three pre-training tasks, including a novel physicochemical property prediction objective, and measuring their individual and combined effects on downstream drug discovery benchmarks.</p>
<h2 id="three-auxiliary-tasks-for-chemistry-aware-pre-training">Three Auxiliary Tasks for Chemistry-Aware Pre-Training</h2>
<p>MolBERT uses the BERT-Base architecture (12 attention heads, 12 layers, 768-dimensional hidden states, approximately 85M parameters) and explores three self-supervised pre-training tasks:</p>
<p><strong>Masked Language Modeling (MaskedLM)</strong>: The standard BERT objective where 15% of input tokens are masked and the model predicts their identity. The loss is cross-entropy between predicted and true tokens.</p>
<p><strong>SMILES Equivalence (SMILES-Eq)</strong>: A binary classification task where the model receives two SMILES strings and predicts whether they represent the same molecule. The second string is either a random permutation of the first (same molecule, different traversal) or a randomly sampled molecule. This is optimized with cross-entropy loss.</p>
<p><strong>Physicochemical Property Prediction (PhysChemPred)</strong>: Using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, a set of 200 real-valued molecular descriptors are computed for each molecule. The model predicts these normalized descriptors from the SMILES input using mean squared error:</p>
<p>$$\mathcal{L}_{\text{PhysChemPred}} = \frac{1}{D} \sum_{d=1}^{D} (y_d - \hat{y}_d)^2$$</p>
<p>where $D = 200$ is the number of descriptors, $y_d$ is the true normalized descriptor value, and $\hat{y}_d$ is the model&rsquo;s prediction.</p>
<p>The final training loss is the arithmetic mean of all active task losses:</p>
<p>$$\mathcal{L}_{\text{total}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathcal{L}_t$$</p>
<p>where $\mathcal{T}$ is the set of active pre-training tasks.</p>
<p>Additionally, MolBERT supports SMILES permutation augmentation during training, where each input molecule is represented by a randomly sampled non-canonical SMILES string rather than the canonical form. The model uses a fixed vocabulary of 42 tokens, a sequence length of 128, and relative positional embeddings (from Transformer-XL) to support arbitrary-length SMILES at inference time.</p>
<h2 id="ablation-study-and-benchmark-evaluation">Ablation Study and Benchmark Evaluation</h2>
<h3 id="pre-training-setup">Pre-Training Setup</h3>
<p>All models were pre-trained on the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol benchmark dataset</a>, consisting of approximately 1.6M compounds curated from ChEMBL, using an 80%/5% train/validation split. Training used the Adam optimizer with a learning rate of $3 \times 10^{-5}$ for 20 epochs (ablation) or 100 epochs (final model).</p>
<h3 id="ablation-impact-of-task-combinations-on-virtual-screening">Ablation: Impact of Task Combinations on Virtual Screening</h3>
<p>The ablation study evaluated all seven possible task combinations on the RDKit virtual screening benchmark (69 datasets, 5 query molecules per target). Results measured by AUROC and BEDROC20 (an early enrichment metric with $\alpha = 20$):</p>
<table>
  <thead>
      <tr>
          <th style="text-align: center">MaskedLM</th>
          <th style="text-align: center">PhysChemPred</th>
          <th style="text-align: center">SMILES-Eq</th>
          <th style="text-align: center">AUROC (w/ perm)</th>
          <th style="text-align: center">BEDROC20 (w/ perm)</th>
          <th style="text-align: center">AUROC (w/o perm)</th>
          <th style="text-align: center">BEDROC20 (w/o perm)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.685 +/- 0.069</td>
          <td style="text-align: center">0.246 +/- 0.041</td>
          <td style="text-align: center">0.707 +/- 0.059</td>
          <td style="text-align: center">0.280 +/- 0.042</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.738 +/- 0.060</td>
          <td style="text-align: center">0.323 +/- 0.071</td>
          <td style="text-align: center">0.740 +/- 0.066</td>
          <td style="text-align: center">0.322 +/- 0.065</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.483 +/- 0.092</td>
          <td style="text-align: center">0.092 +/- 0.069</td>
          <td style="text-align: center">0.493 +/- 0.068</td>
          <td style="text-align: center">0.108 +/- 0.070</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.476 +/- 0.077</td>
          <td style="text-align: center">0.064 +/- 0.034</td>
          <td style="text-align: center">0.514 +/- 0.165</td>
          <td style="text-align: center">0.084 +/- 0.014</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.696 +/- 0.058</td>
          <td style="text-align: center">0.283 +/- 0.077</td>
          <td style="text-align: center">0.676 +/- 0.060</td>
          <td style="text-align: center">0.250 +/- 0.073</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.719 +/- 0.057</td>
          <td style="text-align: center">0.293 +/- 0.071</td>
          <td style="text-align: center">0.716 +/- 0.061</td>
          <td style="text-align: center">0.290 +/- 0.076</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.129 +/- 0.067</td>
          <td style="text-align: center">0.005 +/- 0.037</td>
          <td style="text-align: center">0.508 +/- 0.068</td>
          <td style="text-align: center">0.048 +/- 0.035</td>
      </tr>
  </tbody>
</table>
<p>Key findings from the ablation:</p>
<ul>
<li>PhysChemPred had the highest individual impact (average BEDROC20 of 0.292 alone vs. 0.266 for MaskedLM alone).</li>
<li>Combining MaskedLM + PhysChemPred achieved the best performance (BEDROC20 of 0.323), though the additive gain from MaskedLM was modest (+0.031).</li>
<li>The SMILES-Eq task consistently decreased performance when added to other task combinations.</li>
</ul>
<p>A further sub-ablation on PhysChemPred descriptor groups showed that surface descriptors alone (49 of 200 descriptors) achieved nearly the same performance as the full set, suggesting molecular surface properties provide particularly informative supervision.</p>
<h3 id="virtual-screening-results">Virtual Screening Results</h3>
<p>Using the best task combination (MaskedLM + PhysChemPred) trained for 100 epochs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>AUROC</th>
          <th>BEDROC20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolBERT (100 epochs)</td>
          <td>0.743 +/- 0.062</td>
          <td>0.344 +/- 0.062</td>
      </tr>
      <tr>
          <td>CDDD</td>
          <td>0.725 +/- 0.057</td>
          <td>0.310 +/- 0.080</td>
      </tr>
      <tr>
          <td>RDKit descriptors</td>
          <td>0.633 +/- 0.027</td>
          <td>0.217 +/- 0.000</td>
      </tr>
      <tr>
          <td>ECFC4</td>
          <td>0.603 +/- 0.056</td>
          <td>0.170 +/- 0.079</td>
      </tr>
  </tbody>
</table>
<p>MolBERT outperformed all baselines including <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (the prior state of the art), RDKit calculated descriptors, and extended-connectivity fingerprints (ECFC4).</p>
<h3 id="qsar-results">QSAR Results</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> regression tasks (RMSE, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td style="text-align: center">0.687 +/- 0.08</td>
          <td style="text-align: center">0.902 +/- 0.06</td>
          <td style="text-align: center">0.567 +/- 0.06</td>
          <td style="text-align: center">0.552 +/- 0.07</td>
          <td style="text-align: center"><strong>0.531 +/- 0.04</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td style="text-align: center">1.671 +/- 0.45</td>
          <td style="text-align: center">2.876 +/- 0.38</td>
          <td style="text-align: center">1.456 +/- 0.43</td>
          <td style="text-align: center">1.523 +/- 0.66</td>
          <td style="text-align: center"><strong>0.948 +/- 0.33</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td style="text-align: center">0.738 +/- 0.04</td>
          <td style="text-align: center">0.770 +/- 0.03</td>
          <td style="text-align: center">0.669 +/- 0.02</td>
          <td style="text-align: center">0.602 +/- 0.01</td>
          <td style="text-align: center"><strong>0.561 +/- 0.03</strong></td>
      </tr>
  </tbody>
</table>
<p>On MoleculeNet classification tasks (AUROC, higher is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BACE</td>
          <td style="text-align: center">0.831</td>
          <td style="text-align: center">0.845</td>
          <td style="text-align: center">0.833</td>
          <td style="text-align: center">0.849</td>
          <td style="text-align: center"><strong>0.866</strong></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td style="text-align: center">0.696</td>
          <td style="text-align: center">0.678</td>
          <td style="text-align: center">0.761</td>
          <td style="text-align: center">0.750</td>
          <td style="text-align: center"><strong>0.762</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td style="text-align: center">0.708</td>
          <td style="text-align: center">0.714</td>
          <td style="text-align: center">0.753</td>
          <td style="text-align: center">0.747</td>
          <td style="text-align: center"><strong>0.783</strong></td>
      </tr>
  </tbody>
</table>
<p>Fine-tuned MolBERT achieved the best performance on all six QSAR datasets. When used as a fixed feature extractor with an SVM, MolBERT embeddings outperformed other representations on three of six tasks.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Pre-training task selection matters significantly.</strong> The choice of auxiliary tasks during pre-training has a large effect on downstream performance. PhysChemPred provides the strongest individual signal.</li>
<li><strong>Domain-relevant auxiliary tasks improve representation quality.</strong> Predicting physicochemical properties during pre-training encodes chemical knowledge directly into the embeddings, outperforming purely linguistic objectives.</li>
<li><strong>The SMILES equivalence task hurts performance.</strong> Despite being chemically motivated, the SMILES-Eq task consistently degraded results, suggesting it may introduce conflicting learning signals.</li>
<li><strong>PhysChemPred organizes the embedding space.</strong> Analysis of pairwise cosine similarities showed that models trained with PhysChemPred assign high similarity to permutations of the same molecule and low similarity to different molecules, creating a more semantically meaningful representation space.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The paper evaluates only SMILES-based representations, inheriting all limitations of string-based molecular encodings (inability to capture 3D structure, sensitivity to tokenization).</li>
<li>The virtual screening evaluation uses a fixed number of query molecules ($n = 5$), which may not reflect realistic screening scenarios.</li>
<li>Cross-validation splits from ChemBench were used for QSAR evaluation rather than scaffold splits, which may overestimate performance on structurally novel compounds.</li>
<li>The model&rsquo;s 128-token sequence length limit may truncate larger molecules, though relative positional embeddings partially address this at inference time.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending MolBERT to learn representations for other biological entities such as proteins, and developing more advanced pre-training strategies.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>GuacaMol (ChEMBL)</td>
          <td>~1.6M compounds</td>
          <td>80% train / 5% validation split</td>
      </tr>
      <tr>
          <td>Virtual Screening</td>
          <td>RDKit benchmark v1.2</td>
          <td>69 target datasets</td>
          <td>Filtered subset with active/decoy compounds</td>
      </tr>
      <tr>
          <td>QSAR (Regression)</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
      <tr>
          <td>QSAR (Classification)</td>
          <td>BACE, BBBP, HIV</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BERT-Base (12 heads, 12 layers, 768-dim hidden, ~85M params)</li>
<li>Optimizer: Adam, learning rate $3 \times 10^{-5}$</li>
<li>Vocabulary: 42 tokens, sequence length 128</li>
<li>Masking: 15% of tokenized input</li>
<li>Positional encoding: relative positional embeddings (Transformer-XL)</li>
<li>Fine-tuning SVM: $C = 5.0$, RBF kernel (from Winter et al.)</li>
<li>Fine-tuning head: single linear layer on pooled output</li>
<li>Embeddings: pooled output (or average sequence output when only MaskedLM is used)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BERT-Base with ~85M parameters</li>
<li>Pre-trained weights available at <a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>Virtual Screening, Classification QSAR</td>
          <td>Standard area under ROC curve</td>
      </tr>
      <tr>
          <td>BEDROC20</td>
          <td>Virtual Screening</td>
          <td>Early enrichment metric, $\alpha = 20$</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression QSAR</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>2 GPUs, 16 CPUs</li>
<li>Pre-training time: ~40 hours (20 epochs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Official implementation with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fabian, B., Edlich, T., Gaspar, H., Segler, M., Meyers, J., Fiscato, M., &amp; Ahmed, M. (2020). Molecular representation learning with language models and domain-relevant auxiliary tasks. <em>arXiv preprint arXiv:2011.13230</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fabian2020molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular representation learning with language models and domain-relevant auxiliary tasks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fabian, Benedek and Edlich, Thomas and Gaspar, H{\&#39;e}l{\&#39;e}na and Segler, Marwin and Meyers, Joshua and Fiscato, Marco and Ahmed, Mohamed}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2011.13230}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LMs Generate 3D Molecules from XYZ, CIF, PDB Files</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</guid><description>Transformer language models trained on XYZ, CIF, and PDB sequences generate valid 3D molecules, crystals, and protein binding sites.</description><content:encoded><![CDATA[<h2 id="language-models-as-3d-chemical-structure-generators">Language Models as 3D Chemical Structure Generators</h2>
<p>This is a <strong>Method</strong> paper that demonstrates transformer-based language models can generate molecules, crystalline materials, and protein binding sites directly in three dimensions by training on sequences derived from standard chemical file formats (XYZ, CIF, PDB). The key contribution is showing that unmodified autoregressive language models, using only next-token prediction, achieve performance comparable to domain-specific 3D generative models that incorporate SE(3) equivariance and other geometric inductive biases.</p>
<h2 id="beyond-graphs-and-strings-the-need-for-3d-chemical-generation">Beyond Graphs and Strings: The Need for 3D Chemical Generation</h2>
<p>Molecular design with deep learning has largely relied on two representation paradigms: molecular graphs (processed with graph neural networks) and linearized string representations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (processed with sequence models). Both approaches have proven effective for drug-like organic molecules, but they share a fundamental limitation: they cannot represent structures whose identity depends on 3D spatial arrangement.</p>
<p>Crystalline materials, for example, have periodic lattice structures that cannot be reduced to simple graphs. Protein binding sites are defined by the 3D arrangement of hundreds of atoms across multiple residues. For tasks like catalysis design or structure-based drug discovery, the geometric positions of atoms are essential information that graphs and strings discard entirely.</p>
<p>Existing 3D generative models address this gap but typically require specialized architectures with SE(3) equivariance to handle rotational and translational symmetries. This work asks whether the general-purpose sequence modeling capability of transformers is sufficient to learn 3D chemical structure distributions without any domain-specific architectural modifications.</p>
<h2 id="direct-tokenization-of-chemical-file-formats">Direct Tokenization of Chemical File Formats</h2>
<p>The core insight is straightforward: any 3D molecule, crystal, or biomolecule is already stored as text in standard file formats (<a href="https://en.wikipedia.org/wiki/XYZ_file_format">XYZ</a>, <a href="https://en.wikipedia.org/wiki/Crystallographic_Information_File">CIF</a>, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)">PDB</a>). These files encode atom types and their Cartesian coordinates as sequences of characters and numbers. Rather than designing specialized architectures for point cloud generation, the authors simply tokenize these files and train a standard GPT-style transformer to predict the next token.</p>
<p>A molecule with $n$ atoms is represented as:</p>
<p>$$
\mathcal{M} = (e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>where $e_i$ is the element type and $(x_i, y_i, z_i)$ are Cartesian coordinates. Crystals additionally include lattice parameters:</p>
<p>$$
\mathcal{C} = (\ell_a, \ell_b, \ell_c, \alpha, \beta, \gamma, e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>Protein binding sites use residue-atom indicators (e.g., HIS-C, CYS-N) instead of bare element symbols:</p>
<p>$$
\mathcal{P} = (a_1, x_1, y_1, z_1, \dots, a_n, x_n, y_n, z_n)
$$</p>
<p>The language model learns the joint distribution via autoregressive factorization:</p>
<p>$$
p(x) = \prod_{i=1}^{n} p(t_i \mid t_{i-1}, \dots, t_1)
$$</p>
<p>Two tokenization strategies are explored:</p>
<ol>
<li><strong>Character-level (LM-CH)</strong>: Every character in the file is a token, including digits, minus signs, spaces, and newlines. This produces long sequences but uses a small vocabulary (~30 tokens).</li>
<li><strong>Atom+coordinate-level (LM-AC)</strong>: Each atom placement requires exactly 4 tokens: one element/residue token and three coordinate tokens (e.g., &lsquo;-1.98&rsquo;). The vocabulary is larger (~100-10K tokens) but sequences are shorter.</li>
</ol>
<p>Numerical precision is controlled by rounding coordinates to 1, 2, or 3 decimal places. Since the model lacks rotation and translation invariance, random rotation augmentation during training improves performance.</p>
<h2 id="experiments-across-molecules-crystals-and-protein-binding-sites">Experiments Across Molecules, Crystals, and Protein Binding Sites</h2>
<h3 id="molecular-generation-zinc">Molecular Generation (ZINC)</h3>
<p>The model is evaluated on 250K commercially available molecules from the ZINC dataset, with an average of 23 heavy atoms. XYZ files are generated using RDKit&rsquo;s conformer tools. Coordinates use 2 decimal places of precision. The authors generate 10K molecules and evaluate both 3D geometry quality and standard generative metrics.</p>
<p>For 3D geometry assessment, root mean squared deviation (RMSD) between language model-generated conformers and RDKit-generated conformers shows most molecules fall between 1.0 and 2.0 RMSD, with a heavy tail extending to 4.0.</p>
<p>Standard metrics include validity, uniqueness, novelty, and earth mover&rsquo;s distance (WA) for molecular property distributions (QED, SA score, molecular weight).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>3D</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>WA MW</th>
          <th>WA SA</th>
          <th>WA QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Train</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>0.816</td>
          <td>0.013</td>
          <td>0.002</td>
      </tr>
      <tr>
          <td>SM-LM</td>
          <td>No</td>
          <td>98.35</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.640</td>
          <td>0.049</td>
          <td>0.005</td>
      </tr>
      <tr>
          <td>SF-LM</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.772</td>
          <td>0.085</td>
          <td>0.006</td>
      </tr>
      <tr>
          <td>JTVAE</td>
          <td>No</td>
          <td>100.0</td>
          <td>98.56</td>
          <td>100.0</td>
          <td>22.63</td>
          <td>0.126</td>
          <td>0.023</td>
      </tr>
      <tr>
          <td>ENF</td>
          <td>Yes</td>
          <td>1.05</td>
          <td>96.37</td>
          <td>99.72</td>
          <td>168.5</td>
          <td>1.886</td>
          <td>0.160</td>
      </tr>
      <tr>
          <td>G-SchNet</td>
          <td>Yes</td>
          <td>1.20</td>
          <td>55.96</td>
          <td>98.33</td>
          <td>152.7</td>
          <td>1.126</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>Yes</td>
          <td>77.51</td>
          <td>96.40</td>
          <td>95.30</td>
          <td>101.2</td>
          <td>0.939</td>
          <td>0.093</td>
      </tr>
      <tr>
          <td>LM-CH</td>
          <td>Yes</td>
          <td>90.13</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.912</td>
          <td>2.608</td>
          <td>0.077</td>
      </tr>
      <tr>
          <td>LM-AC</td>
          <td>Yes</td>
          <td>98.51</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>1.811</td>
          <td>0.026</td>
          <td>0.004</td>
      </tr>
  </tbody>
</table>
<p>The atom+coordinate tokenization model (LM-AC) achieves 98.51% validity with 100% uniqueness and novelty. Its WA scores for molecular weight (1.811) and QED (0.004) are substantially better than all other 3D generative baselines and competitive with SMILES/SELFIES language models. The character-level model (LM-CH) at 90.13% validity performs comparably to graph-based models but falls short of the string-based language models.</p>
<h3 id="crystal-generation-perov-5-and-mp-20">Crystal Generation (Perov-5 and MP-20)</h3>
<p>Crystal generation uses CIF-derived sequences with 3 decimal places of precision. Two datasets are used: Perov-5 (18,928 <a href="https://en.wikipedia.org/wiki/Perovskite_(structure)">perovskite</a> materials, 5 atoms per unit cell, 56 elements) and MP-20 (45,231 diverse materials, 1-20 atoms per unit cell, 89 elements).</p>
<p>Evaluation metrics include structural validity (minimum interatomic distance &gt; 0.5 angstrom), compositional validity (charge neutrality via SMACT), coverage (recall and precision between generated and test sets), and earth mover&rsquo;s distance for density and number of unique elements.</p>
<table>
  <thead>
      <tr>
          <th>Data</th>
          <th>Model</th>
          <th>Struc. Valid (%)</th>
          <th>Comp. Valid (%)</th>
          <th>COV-R (%)</th>
          <th>COV-P (%)</th>
          <th>WA density</th>
          <th>WA elements</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perov-5</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>98.59</td>
          <td>99.45</td>
          <td>98.46</td>
          <td>0.126</td>
          <td>0.063</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-CH</td>
          <td>100.0</td>
          <td>98.51</td>
          <td>99.60</td>
          <td>99.42</td>
          <td>0.071</td>
          <td>0.036</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-AC</td>
          <td>100.0</td>
          <td>98.79</td>
          <td>98.78</td>
          <td>99.36</td>
          <td>0.089</td>
          <td>0.028</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>86.70</td>
          <td>99.15</td>
          <td>99.49</td>
          <td>0.688</td>
          <td>1.432</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-CH</td>
          <td>84.81</td>
          <td>83.55</td>
          <td>99.25</td>
          <td>97.89</td>
          <td>0.864</td>
          <td>0.132</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-AC</td>
          <td>95.81</td>
          <td>88.87</td>
          <td>99.60</td>
          <td>98.55</td>
          <td>0.696</td>
          <td>0.092</td>
      </tr>
  </tbody>
</table>
<p>On Perov-5, both language models outperform CDVAE across most metrics. On the more diverse MP-20 dataset, LM-AC achieves the best scores on 3 of 6 metrics and remains competitive on the others. LM-CH struggles more with structural validity on MP-20 (84.81%).</p>
<h3 id="protein-binding-site-generation-pdb">Protein Binding Site Generation (PDB)</h3>
<p>The most challenging task involves generating protein binding sites (~200-250 atoms each) from PDB-derived sequences. The dataset contains approximately 180K protein-ligand pairs. Residue-atom tokenization is used (e.g., CYS-C, CYS-N), with 2 decimal places of precision.</p>
<p>Validity is assessed per-residue using xyz2mol, with an additional check for inter-residue atomic overlap (atoms from different residues closer than the minimum bond distance). Approximately 99% of generated pockets pass the residue validity check, while about 5% fail the overlap check. Of generated pockets, 89.8% have unique residue orderings, and 83.6% have novel orderings not seen in training, indicating the model is generating novel binding site structures rather than memorizing.</p>
<h2 id="competitive-3d-generation-without-geometric-inductive-biases">Competitive 3D Generation Without Geometric Inductive Biases</h2>
<p>The central finding is that standard transformer language models, without any equivariance or geometric inductive biases, can generate valid 3D chemical structures across three substantially different domains. The atom+coordinate tokenization (LM-AC) consistently outperforms character-level tokenization (LM-CH), likely because it produces shorter sequences and reduces the number of sequential decisions needed per atom placement.</p>
<p>Several limitations are worth noting. The model generates atoms using absolute Cartesian coordinates, which means it must learn rotation and translation invariance purely from data augmentation rather than having it built into the architecture. The authors acknowledge this becomes increasingly difficult as structure size grows. The vocabulary size also scales with coordinate precision and structure complexity, which could become prohibitive for very large systems.</p>
<p>The paper does not include computational cost comparisons with baseline models, making it difficult to assess the practical tradeoff between the simplicity of the language modeling approach and the efficiency of specialized architectures. The authors also note that further validation through computational simulation and experiment is needed to confirm the physical plausibility of generated structures.</p>
<p>Future directions identified include inverse design of molecules and materials conditioned on target properties, extension to more complex structures (metal-organic frameworks), and exploration of alternative tokenization strategies to handle larger systems.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC</td>
          <td>250K molecules</td>
          <td>~23 heavy atoms avg; XYZ files via RDKit conformer generation</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Perov-5</td>
          <td>18,928 perovskites</td>
          <td>5 atoms/unit cell, 56 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>MP-20</td>
          <td>45,231 materials</td>
          <td>1-20 atoms/unit cell, 89 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Protein binding sites</td>
          <td>~180K protein-ligand pairs</td>
          <td>Processed to 200-250 atoms per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-style transformer with ~1M to 100M parameters</li>
<li><strong>Layers</strong>: 12</li>
<li><strong>Embedding size</strong>: 128 to 1024</li>
<li><strong>Attention heads</strong>: 4 to 12</li>
<li><strong>Batch size</strong>: 4 to 32 structures</li>
<li><strong>Learning rate</strong>: $10^{-4}$ to $10^{-5}$, decayed to $9 \times 10^{-6}$</li>
<li><strong>Data augmentation</strong>: Random rotation of training structures at each epoch</li>
<li><strong>Numerical precision</strong>: 2 decimal places (molecules, proteins), 3 decimal places (crystals)</li>
</ul>
<h3 id="models">Models</h3>
<p>No pre-trained model weights are publicly available. The paper mentions &ldquo;Example code can be found at&rdquo; but the URL appears to be missing from the published version.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Molecules</td>
          <td>xyz2mol produces valid RDKit Mol object</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Crystals</td>
          <td>Structural (min distance &gt; 0.5 angstrom) and compositional (charge neutral)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>All</td>
          <td>Fraction of distinct generated structures</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>All</td>
          <td>Fraction not in training set</td>
      </tr>
      <tr>
          <td>Earth mover&rsquo;s distance</td>
          <td>All</td>
          <td>Distribution match for domain-specific properties</td>
      </tr>
      <tr>
          <td>RMSD</td>
          <td>Molecules</td>
          <td>Deviation from RDKit conformer geometries</td>
      </tr>
      <tr>
          <td>Coverage</td>
          <td>Crystals</td>
          <td>Recall and precision between generated and test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Models were trained using the Canada Computing Systems (Compute Canada). Specific GPU types, counts, and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository, model weights, or datasets specific to this work were found. The ZINC, Perov-5, and MP-20 datasets used for evaluation are publicly available from their original sources.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D. &amp; Aspuru-Guzik, A. (2023). Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files. <em>arXiv preprint arXiv:2305.05708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2023language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can generate molecules, materials, and protein binding sites directly in three dimensions as {XYZ}, {CIF}, and {PDB} files}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2305.05708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM-Prop: Predicting Crystal Properties from Text</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</guid><description>LLM-Prop fine-tunes the T5 encoder on crystal text descriptions to predict band gap, volume, and other properties, outperforming GNN baselines.</description><content:encoded><![CDATA[<h2 id="text-based-crystal-property-prediction-with-llms">Text-Based Crystal Property Prediction with LLMs</h2>
<p>LLM-Prop is a <strong>Method</strong> paper that proposes using the encoder portion of <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5</a> (a general-purpose language model) fine-tuned on crystal text descriptions to predict physical and electronic properties of crystalline materials. The primary contribution is demonstrating that text-based representations of crystals, generated by Robocrystallographer, can serve as effective inputs for <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, outperforming graph neural network (GNN) baselines on several tasks despite using a non-domain-specific pre-trained model with fewer parameters.</p>
<h2 id="why-text-instead-of-crystal-graphs">Why Text Instead of Crystal Graphs?</h2>
<p>Graph neural networks have been the dominant approach for crystal property prediction. Models like CGCNN, MEGNet, and ALIGNN represent crystals as graphs where atoms are nodes and bonds are edges. However, GNNs face several fundamental challenges for crystals:</p>
<ol>
<li><strong>Periodicity encoding</strong>: Crystals have repetitive unit cell arrangements that are distinct from standard molecular graphs, and GNNs struggle to encode this periodicity efficiently.</li>
<li><strong>Information incorporation</strong>: Critical structural information like bond angles, <a href="https://en.wikipedia.org/wiki/Space_group">space group</a> symmetry, and <a href="https://en.wikipedia.org/wiki/Wyckoff_positions">Wyckoff sites</a> is difficult to incorporate into graph representations.</li>
<li><strong>Expressiveness</strong>: Graphs may lack the expressiveness needed to convey complex crystal information relevant to property prediction.</li>
</ol>
<p>Meanwhile, textual descriptions of crystals (generated by tools like Robocrystallographer) naturally encode space group information, bond geometries, coordination environments, and symmetry details in human-readable form. Despite this richness, text-based approaches for crystal property prediction had been largely unexplored.</p>
<h2 id="core-innovation-t5-encoder-with-careful-fine-tuning">Core Innovation: T5 Encoder with Careful Fine-Tuning</h2>
<p>The key insight of LLM-Prop is to take a pre-trained encoder-decoder model (<a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-small) and discard the decoder entirely, using only the encoder with a linear prediction head. This design has several advantages:</p>
<ul>
<li>Cutting the network in half (from ~60M to ~37M parameters) allows processing of longer input sequences</li>
<li>Longer sequences mean more crystal information can be included</li>
<li>The encoder-only approach avoids T5&rsquo;s known weakness at regression in text-to-text format</li>
</ul>
<p>The framework applies several preprocessing strategies to the crystal text descriptions:</p>
<ol>
<li><strong>Stopword removal</strong>: Standard English stopwords are removed, except digits and symbols carrying chemical information</li>
<li><strong>Numerical token replacement</strong>: Bond distances are replaced with a <code>[NUM]</code> token and bond angles with <code>[ANG]</code>, reducing sequence length while preserving structural cues</li>
<li><strong>[CLS] token prepending</strong>: A classification token is added at the start, and its learned embedding is used as input to the prediction layer</li>
<li><strong>Label scaling</strong>: For regression tasks, targets are normalized using z-score, min-max, or log normalization</li>
</ol>
<p>The normalization schemes are defined as:</p>
<p>$$
\hat{Y}_{i}(\text{z-score}) = \frac{Y_{i} - \mu}{\sigma}
$$</p>
<p>$$
\hat{Y}_{i}(\text{min-max}) = \frac{Y_{i} - Y_{\min}}{Y_{\max} - Y_{\min}}
$$</p>
<p>$$
\hat{Y}_{i}(\text{log-norm}) = \log(Y_{i} + 1)
$$</p>
<p>The tokenizer is also retrained on the crystal text corpus with a vocabulary size of 32k, and the special tokens <code>[NUM]</code>, <code>[ANG]</code>, and <code>[CLS]</code> are added to the vocabulary.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="dataset-textedge">Dataset: TextEdge</h3>
<p>The authors collected data from the <a href="https://en.wikipedia.org/wiki/Materials_Project">Materials Project</a> database (as of November 2022), yielding 144,931 crystal structure-description pairs split into 125,098 training, 9,945 validation, and 9,888 test samples. Crystal text descriptions were generated using Robocrystallographer. The dataset covers six prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Type</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Band gap (eV)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Unit cell volume (A^3/cell)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Formation energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy above hull (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Is-gap-direct</td>
          <td>Classification</td>
          <td>AUC (higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<p>Seven baselines were compared:</p>
<ul>
<li><strong>GNN-based</strong>: CGCNN, MEGNet, ALIGNN, DeeperGATGNN</li>
<li><strong>Classic ML</strong>: XGBoost, Random Forest (on Robocrystallographer features)</li>
<li><strong>Text-based</strong>: MatBERT (domain-specific pre-trained BERT, ~110M parameters)</li>
</ul>
<p>All models were trained and evaluated on the same dataset splits for fair comparison. GNN models were retrained on the new data rather than using results from older, smaller Materials Project versions.</p>
<h3 id="main-results-llm-prop-vs-gnn-baselines">Main Results: LLM-Prop vs. GNN Baselines</h3>
<p>When using crystal text descriptions as input, LLM-Prop achieved:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (A^3/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>0.293</td>
          <td>188.834</td>
          <td>0.046</td>
          <td>0.082</td>
          <td>0.040</td>
          <td>0.830</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>0.304</td>
          <td>297.948</td>
          <td>0.077</td>
          <td>0.056</td>
          <td>0.051</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>0.250</td>
          <td>129.580</td>
          <td>0.027</td>
          <td>0.059</td>
          <td>0.028</td>
          <td>0.678</td>
      </tr>
      <tr>
          <td>DeeperGATGNN</td>
          <td>0.291</td>
          <td>111.857</td>
          <td>0.081</td>
          <td>0.116</td>
          <td>0.045</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>LLM-Prop (Descr.)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.252</strong></td>
          <td>0.056</td>
          <td>0.067</td>
          <td>0.047</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>LLM-Prop outperformed the best GNN baseline (ALIGNN) by approximately 8% on <a href="https://en.wikipedia.org/wiki/Band_gap">band gap</a> prediction, 65% on volume prediction, and 3% on band gap classification (Is-gap-direct). For formation energy per atom, energy per atom, and energy above hull, ALIGNN retained an advantage.</p>
<h3 id="llm-prop-vs-matbert">LLM-Prop vs. MatBERT</h3>
<p>LLM-Prop also outperformed MatBERT (a domain-specific pre-trained BERT) across all tasks despite having roughly 3x fewer parameters. The table below shows the best result for each model across the three input preprocessing strategies (w/ Numbers, w/o Numbers, w/ [NUM]&amp;[ANG]):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (A^3/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MatBERT (best)</td>
          <td>0.258</td>
          <td>54.969</td>
          <td>0.071</td>
          <td>0.098</td>
          <td>0.050</td>
          <td>0.722</td>
      </tr>
      <tr>
          <td>LLM-Prop (best)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.138</strong></td>
          <td><strong>0.056</strong></td>
          <td><strong>0.067</strong></td>
          <td><strong>0.047</strong></td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: LLM-Prop&rsquo;s best band gap (0.231) comes from the &ldquo;w/o Numbers&rdquo; configuration, while the best volume (39.138) comes from &ldquo;w/ Numbers&rdquo;. The best Is-gap-direct AUC (0.857) uses the &ldquo;[NUM]&amp;[ANG]&rdquo; configuration.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The contribution of each preprocessing strategy was evaluated:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Band gap</th>
          <th>Volume</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLM-Prop (baseline)</td>
          <td>0.256</td>
          <td>69.352</td>
          <td>0.796</td>
      </tr>
      <tr>
          <td>+ modified tokenizer</td>
          <td>0.247</td>
          <td>78.632</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>+ label scaling</td>
          <td>0.242</td>
          <td>44.515</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>+ [CLS] token</td>
          <td>0.231</td>
          <td>39.520</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>+ [NUM] token</td>
          <td>0.251</td>
          <td>86.090</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>+ [ANG] token</td>
          <td>0.242</td>
          <td>64.965</td>
          <td>0.810</td>
      </tr>
      <tr>
          <td>- stopwords</td>
          <td>0.252</td>
          <td>56.593</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>LLM-Prop+all (no space group)</td>
          <td>0.235</td>
          <td>97.457</td>
          <td>0.705</td>
      </tr>
      <tr>
          <td>LLM-Prop+all</td>
          <td><strong>0.229</strong></td>
          <td>42.259</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>The [CLS] token provided the single largest improvement across all tasks. Label scaling was critical for volume prediction (reducing MAE from 69.352 to 44.515). Removing space group information from descriptions degraded volume prediction dramatically (from 42.259 to 97.457), confirming that space group symmetry is a key factor.</p>
<h3 id="data-efficiency-and-transfer-learning">Data Efficiency and Transfer Learning</h3>
<p>LLM-Prop achieved SOTA results on band gap and volume prediction with only about 90k training samples (35k fewer than baselines). For volume prediction specifically, LLM-Prop outperformed all GNN baselines with just 30k training samples.</p>
<p>Transfer learning experiments showed that LLM-Prop transferred well between band gap and volume prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Volume-to-Band gap (Test)</th>
          <th>Band gap-to-Volume (Test)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN-transfer</td>
          <td>0.295</td>
          <td>182.997</td>
      </tr>
      <tr>
          <td>ALIGNN-transfer</td>
          <td>0.322</td>
          <td>136.164</td>
      </tr>
      <tr>
          <td>MatBERT-transfer</td>
          <td>0.266</td>
          <td>54.289</td>
      </tr>
      <tr>
          <td>LLM-Prop-transfer</td>
          <td><strong>0.244</strong></td>
          <td><strong>50.753</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Text descriptions of crystals carry rich structural information (space groups, Wyckoff sites, coordination geometries) that is difficult to encode in graphs but naturally expressed in text</li>
<li>A carefully fine-tuned general-purpose LLM encoder can outperform domain-specific pre-trained models, challenging the assumption that in-domain pre-training is always necessary</li>
<li>Removing numerical information (bond distances and angles) from descriptions often improves performance, because current LLMs treat numbers as regular tokens without understanding their quantitative meaning</li>
<li>Longer input sequences correlate with better performance, with 888 tokens as the default maximum on the hardware used</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The origin of LLM-Prop&rsquo;s performance advantage over GNNs is not fully understood. It remains unclear whether the boost comes from additional structured information in text or from the different data modality itself</li>
<li>LLM-Prop cannot perform zero-shot predictions since T5 was not pre-trained on materials science data</li>
<li>The approach depends on Robocrystallographer to generate text descriptions, adding a preprocessing dependency</li>
<li>Current LLMs&rsquo; inability to reason about numerical values limits the use of quantitative information in descriptions</li>
</ul>
<p><strong>Future directions</strong> suggested by the authors include investigating techniques to use <a href="/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">CIF files</a> directly as LLM inputs, developing new GNN architectures that incorporate space group and Wyckoff site information, and further exploring which information in crystal descriptions contributes most to each property prediction task.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>TextEdge</td>
          <td>144,931 crystals</td>
          <td>From Materials Project (Nov 2022), text generated by Robocrystallographer</td>
      </tr>
      <tr>
          <td>Training split</td>
          <td>TextEdge</td>
          <td>125,098</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Validation split</td>
          <td>TextEdge</td>
          <td>9,945</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Test split</td>
          <td>TextEdge</td>
          <td>9,888</td>
          <td>Random split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam with one-cycle learning rate scheduler</li>
<li><strong>Learning rate</strong>: 1e-3 for LLM-Prop, 5e-5 for MatBERT</li>
<li><strong>Dropout</strong>: 0.2 for LLM-Prop, 0.5 for MatBERT</li>
<li><strong>Batch size</strong>: 64 (888 tokens) or 16 (2000 tokens) for LLM-Prop</li>
<li><strong>Epochs</strong>: 200-300 depending on task</li>
<li><strong>Loss</strong>: MAE for regression, BCE for classification</li>
<li><strong>Evaluation</strong>: MAE for regression, AUC for classification</li>
<li><strong>Each model run 5 times on test set</strong>, averaged MAE reported</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Base model</strong>: T5-small encoder (~60M parameters total, ~37M after discarding decoder and adding prediction head)</li>
<li><strong>Vocabulary size</strong>: 32k (retrained tokenizer)</li>
<li><strong>Max input tokens</strong>: 888 (default) or 2000</li>
<li><strong>Special tokens</strong>: [CLS], [NUM], [ANG]</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/vertaix/LLM-Prop">LLM-Prop</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://drive.google.com/drive/folders/1YCDBzwjwNRIc1FRkB662G3Y5AOWaokUG">TextEdge + Checkpoints</a></td>
          <td>Dataset + Model</td>
          <td>Not specified</td>
          <td>Benchmark dataset and trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: NVIDIA RTX A6000</li>
<li><strong>Training time</strong>: ~40 minutes per epoch for LLM-Prop</li>
<li><strong>Inference</strong>: ~1 minute for 10,000 materials on one GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rubungo, A. N., Arnold, C. B., Rand, B. P., &amp; Dieng, A. B. (2025). LLM-Prop: predicting the properties of crystalline materials using large language models. <em>npj Computational Materials</em>, 11, 186. <a href="https://doi.org/10.1038/s41524-025-01536-2">https://doi.org/10.1038/s41524-025-01536-2</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rubungo2025llmprop,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LLM-Prop: predicting the properties of crystalline materials using large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rubungo, Andre Niyongabo and Arnold, Craig B. and Rand, Barry P. and Dieng, Adji Bousso}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Computational Materials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{186}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41524-025-01536-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 A of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 A of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ A, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (A)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which 1.4M from public sources are shared). Diversity of generated drug-like molecules is slightly lower than baselines, though the authors argue that baseline diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ A), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 A</td>
          <td>1.1 A (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Foundation Models in Chemistry: A 2025 Perspective</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/foundation-models-chemistry-perspective/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/foundation-models-chemistry-perspective/</guid><description>Perspective reviewing foundation models for chemistry across property prediction, MLIPs, inverse design, and multi-domain applications.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-foundation-models-for-chemistry">A Systematization of Foundation Models for Chemistry</h2>
<p>This is a <strong>Systematization</strong> paper. It organizes the rapidly growing landscape of foundation models in chemistry into a coherent taxonomy. The paper distinguishes between &ldquo;small&rdquo; foundation models (pretrained for a single application domain) and &ldquo;big&rdquo; foundation models (adaptable across multiple domains such as property prediction and inverse design). It covers models based on graph neural networks (GNNs) and language models, reviews pretraining strategies (self-supervised, multimodal, supervised), and maps approximately 40 models across four application domains.</p>
<h2 id="why-a-foundation-model-perspective-for-chemistry">Why a Foundation Model Perspective for Chemistry?</h2>
<p>Foundation models have transformed NLP and computer vision through large-scale pretraining and transfer learning. In chemistry, however, several persistent challenges motivate the adoption of this paradigm:</p>
<ol>
<li><strong>Data scarcity</strong>: Chemical datasets are often small and expensive to generate (requiring experiments or quantum mechanical calculations), unlike the large annotated datasets available in NLP/CV.</li>
<li><strong>Poor generalization</strong>: ML models in chemistry frequently need to extrapolate to out-of-domain compounds (e.g., novel drug candidates, unseen crystal structures), where conventional models struggle.</li>
<li><strong>Limited transferability</strong>: Traditional ML interatomic potentials (MLIPs) are trained on system-specific datasets and cannot be easily transferred across different chemical systems.</li>
</ol>
<p>Foundation models address these by learning general representations from large unlabeled datasets, which can then be adapted to specific downstream tasks via finetuning. The paper argues that summarizing this fast-moving field is timely, given the diversity of approaches emerging across molecular property prediction, MLIPs, inverse design, and multi-domain applications.</p>
<h2 id="small-vs-big-foundation-models-a-two-tier-taxonomy">Small vs. Big Foundation Models: A Two-Tier Taxonomy</h2>
<p>The paper&rsquo;s central organizing framework distinguishes two scopes of foundation model:</p>
<p><strong>Small foundation models</strong> are pretrained models adapted to various tasks within a single application domain. Examples include:</p>
<ul>
<li>A model pretrained on large molecular databases that predicts multiple molecular properties (band gap, formation energy, etc.)</li>
<li>A universal MLIP that can simulate diverse chemical systems</li>
<li>A pretrained generative model adapted for inverse design of different target properties</li>
</ul>
<p><strong>Big foundation models</strong> span multiple application domains, handling both property prediction and inverse design within a single framework. These typically use multimodal learning (combining SMILES/graphs with text) or build on large language models.</p>
<h3 id="architectures">Architectures</h3>
<p>The paper reviews two primary architecture families:</p>
<p><strong>Graph Neural Networks (GNNs)</strong> represent molecules and crystals as graphs $G = (V, E)$ with nodes (atoms) and edges (bonds). Node features are updated through message passing:</p>
<p>$$
m_{i}^{t+1} = \sum_{j \in N(i)} M_{t}(v_{i}^{t}, v_{j}^{t}, e_{ij}^{t})
$$</p>
<p>$$
v_{i}^{t+1} = U_{t}(v_{i}^{t}, m_{i}^{t+1})
$$</p>
<p>After $T$ message-passing steps, a readout function produces a graph-level feature:</p>
<p>$$
g = R({v_{i}^{T} \mid i \in G})
$$</p>
<p>Recent equivariant GNNs (e.g., NequIP, MACE, EquformerV2) use vectorial features that respect geometric symmetries, improving expressivity for tasks sensitive to 3D structure.</p>
<p><strong>Language Models</strong> operate on string representations of molecules (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or crystal structures. Autoregressive models like GPT maximize:</p>
<p>$$
\prod_{t=1}^{T} P(y_{t} \mid x_{1}, x_{2}, \ldots, x_{t-1})
$$</p>
<p>Transformers use self-attention:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$</p>
<h3 id="pretraining-strategies">Pretraining Strategies</h3>
<p>The paper categorizes pretraining methods into three self-supervised learning (SSL) approaches plus supervised and multimodal strategies:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Mechanism</th>
          <th>Example Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Contrastive learning</td>
          <td>Maximize similarity between positive pairs, minimize for negatives</td>
          <td>GraphCL, MolCLR, GraphMVP, CrysGNN</td>
      </tr>
      <tr>
          <td>Predictive learning</td>
          <td>Predict self-generated labels (node context, functional groups, space group)</td>
          <td>GROVER, Hu et al., CrysGNN</td>
      </tr>
      <tr>
          <td>Generative learning</td>
          <td>Reconstruct masked nodes/edges or entire molecules/SMILES</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a></td>
      </tr>
      <tr>
          <td>Supervised pretraining</td>
          <td>Train on energy, forces, stress from DFT databases</td>
          <td>M3GNet, CHGNet, MACE-MP-0, MatterSim</td>
      </tr>
      <tr>
          <td>Multimodal learning</td>
          <td>Learn joint representations across SMILES/graph + text modalities</td>
          <td>KV-PLM, <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a>, MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/">SPMM</a></td>
      </tr>
  </tbody>
</table>
<p>A common finding across studies is that combining local and global information (e.g., via contrastive learning between node-level and graph-level views, or supervised learning on both forces and total energy) produces more transferable representations.</p>
<h2 id="survey-of-models-across-four-domains">Survey of Models Across Four Domains</h2>
<h3 id="property-prediction">Property Prediction</h3>
<p>The paper reviews 13 models for molecular and materials property prediction. Key findings:</p>
<ul>
<li><strong>Contrastive learning approaches</strong> (GraphCL, MolCLR, GraphMVP) achieve strong results by defining positive pairs through augmentation, 2D/3D structure views, or crystal system membership.</li>
<li><strong>Language model approaches</strong> (<a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>) show that transformers trained on SMILES via masked language modeling can compete with GNN-based approaches.</li>
<li><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, pretrained on 1.1 billion SMILES from PubChem and ZINC, outperformed many baselines including GNNs on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and <a href="/notes/chemistry/datasets/qm9/">QM9</a> benchmarks. Its attention maps captured molecular structural features directly from SMILES strings.</li>
<li>For crystalline materials, CrysGNN combined contrastive, predictive, and generative learning, demonstrating improvements even on small experimental datasets.</li>
</ul>
<h3 id="machine-learning-interatomic-potentials-mlips">Machine Learning Interatomic Potentials (MLIPs)</h3>
<p>The paper surveys 10 universal MLIPs, all using supervised learning on DFT-calculated energies, forces, and stresses:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Training Data Size</th>
          <th>Key Capability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>M3GNet</td>
          <td>GNN</td>
          <td>187K (MP)</td>
          <td>First universal MLIP</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>GNN</td>
          <td>1.58M (MPtrj)</td>
          <td>Predicts magnetic moments</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>MACE</td>
          <td>1.58M (MPtrj)</td>
          <td>35 diverse applications</td>
      </tr>
      <tr>
          <td>GNoME potential</td>
          <td>NequIP</td>
          <td>89M</td>
          <td>Zero-shot comparable to trained MLIPs</td>
      </tr>
      <tr>
          <td>MatterSim</td>
          <td>M3GNet/Graphormer</td>
          <td>17M</td>
          <td>SOTA on Matbench Discovery</td>
      </tr>
      <tr>
          <td>eqV2</td>
          <td>EquformerV2</td>
          <td>118M (OMat24)</td>
          <td>Structural relaxation</td>
      </tr>
  </tbody>
</table>
<p>The GNoME potential, trained on approximately 89 million data points, achieved zero-shot performance comparable to state-of-the-art MLIPs trained from scratch. MatterSim, trained on over 17 million entries across wide temperature (0-5000K) and pressure (0-1000 GPa) ranges, achieved state-of-the-art on Matbench Discovery and accurately computed thermodynamic and lattice dynamic properties.</p>
<h3 id="inverse-design">Inverse Design</h3>
<p>Few pretrained generative models for inverse design exist. The paper highlights three:</p>
<ul>
<li><strong>MatterGen</strong> (Microsoft): Diffusion model pretrained on Alexandria/MP databases (607K structures), finetuned for conditional generation on band gap, elastic modulus, spacegroup, and composition. Generated S.U.N. (stable, unique, novel) materials at rates more than 2x the previous state of the art.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/">GP-MoLFormer</a></strong> (IBM): MoLFormer pretrained on 1.1B SMILES, finetuned via pair-tuning for property-guided molecular optimization.</li>
<li><strong>CrystalLLM</strong>: Finetuned LLaMA-2 70B for crystal generation with target spacegroup and composition using string representations and prompting.</li>
</ul>
<h3 id="multi-domain-models">Multi-Domain Models</h3>
<p>The paper covers two multi-domain categories:</p>
<p><strong>Property prediction + MLIP</strong>: Denoising pretraining learns virtual forces that guide noisy configurations back to equilibrium, connecting to force prediction. Joint multi-domain pretraining (JMP) from Meta FAIR achieved state-of-the-art on 34 of 40 tasks spanning molecules, crystals, and MOFs by training simultaneously on diverse energy/force databases.</p>
<p><strong>Property prediction + inverse design</strong>: Multimodal models (KV-PLM, <a href="/notes/chemistry/molecular-representations/multimodal/momu-molecular-multimodal-foundation/">MoMu</a>, MoleculeSTM, <a href="/notes/chemistry/molecular-representations/multimodal/molfm-multimodal-molecular-foundation/">MolFM</a>, <a href="/notes/chemistry/molecular-representations/multimodal/spmm-bidirectional-structure-property/">SPMM</a>) learn joint representations from molecular structures and text, enabling text-based inverse design and property prediction in a single framework. LLM-based models (<a href="/notes/chemistry/llm-applications/chemdfm-x/">ChemDFM</a>, <a href="/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/">nach0</a>, <a href="/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/">finetuned GPT-3</a>) can interact with humans and handle diverse chemistry tasks through instruction tuning.</p>
<h2 id="trends-and-future-directions">Trends and Future Directions</h2>
<h3 id="scope-expansion">Scope Expansion</h3>
<p>The authors identify three axes for expanding foundation model scope:</p>
<ol>
<li><strong>Material types</strong>: Most models target molecules or a single material class. Foundation models that span molecules, crystals, surfaces, and MOFs could exploit shared chemistry across materials.</li>
<li><strong>Modalities</strong>: Beyond SMILES, graphs, and text, additional modalities (images, spectral data like XRD patterns) remain underexplored.</li>
<li><strong>Downstream tasks</strong>: Extending to new chemistry and tasks through emergent capabilities, analogous to the capabilities observed in LLMs at scale.</li>
</ol>
<h3 id="performance-and-scaling">Performance and Scaling</h3>
<p>Key scaling challenges include:</p>
<ul>
<li><strong>Data quality vs. quantity</strong>: Noisy DFT labels (e.g., HOMO-LUMO gaps with high uncertainty from different functionals/basis sets) can limit scalability and out-of-distribution performance.</li>
<li><strong>GNN scalability</strong>: While transformers scale to hundreds of billions of parameters, GNNs have rarely been explored above one million parameters due to oversmoothing and the curse of dimensionality. Recent work by Sypetkowski et al. demonstrated scaling GNNs to 3 billion parameters with consistent improvements.</li>
<li><strong>Database integration</strong>: Combining datasets from different DFT codes requires proper alignment (e.g., total energy alignment methods).</li>
</ul>
<h3 id="efficiency">Efficiency</h3>
<p>For MLIPs, efficiency is critical since MD simulations require millions of inference steps. Approaches include:</p>
<ul>
<li>Knowledge distillation from expensive teacher models to lighter student models</li>
<li>Model compression techniques (quantization, pruning) adapted for GNNs</li>
<li>Investigating whether strict equivariance is always necessary</li>
</ul>
<h3 id="interpretability">Interpretability</h3>
<p>Foundation models can generate hallucinations or mode-collapsed outputs. The authors highlight recent interpretability advances (feature extraction from Claude 3, knowledge localization and editing in transformers) as promising directions for more reliable chemical applications.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Combining local and global information in pretraining consistently improves downstream performance across all domains reviewed.</li>
<li>Self-supervised pretraining enables effective transfer learning even in low-data regimes, a critical advantage for chemistry.</li>
<li>Universal MLIPs have reached the point where zero-shot performance can be comparable to system-specific trained models.</li>
<li>Multimodal learning is the most promising approach for big foundation models capable of spanning property prediction and inverse design.</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The precise definition of &ldquo;foundation model&rdquo; in chemistry is not established and varies by scope.</li>
<li>Most surveyed models focus on molecules, with crystalline materials less explored.</li>
<li>Benchmarks for low-data regimes and out-of-distribution performance are insufficient.</li>
<li>The paper focuses on three domains (property prediction, MLIPs, inverse design) and does not cover retrosynthesis, reaction prediction, or other chemical tasks in depth.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a perspective/review paper. No new data or models are introduced. The paper surveys existing models and their training datasets, summarized in Table 1 of the paper.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Not applicable (review paper). The paper describes pretraining strategies (contrastive, predictive, generative, supervised, multimodal) at a conceptual level with references to the original works.</p>
<h3 id="models">Models</h3>
<p>Not applicable (review paper). The paper catalogs approximately 40 foundation models across four domains. See Table 1 in the paper for the complete listing.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Not applicable (review paper). The paper references benchmark results from the original studies (MoleculeNet, QM9, Matbench, Matbench Discovery, JARVIS-DFT) but does not perform independent evaluation.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Choi, J., Nam, G., Choi, J., &amp; Jung, Y. (2025). A Perspective on Foundation Models in Chemistry. <em>JACS Au</em>, 5(4), 1499-1518. <a href="https://doi.org/10.1021/jacsau.4c01160">https://doi.org/10.1021/jacsau.4c01160</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{choi2025perspective,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Perspective on Foundation Models in Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Choi, Junyoung and Nam, Gunwook and Choi, Jaesik and Jung, Yousung}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{JACS Au}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1499--1518}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/jacsau.4c01160}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Fine-Tuning GPT-3 for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/</guid><description>Evaluating fine-tuned GPT-3 ada models for HOMO/LUMO classification of organic semiconductors from SMILES, with ablation and robustness analysis.</description><content:encoded><![CDATA[<h2 id="gpt-3-as-a-molecular-property-classifier">GPT-3 as a Molecular Property Classifier</h2>
<p>This is an <strong>Empirical</strong> paper that evaluates the effectiveness of fine-tuning OpenAI&rsquo;s GPT-3 language model (specifically the &ldquo;ada&rdquo; base model) for predicting electronic and functional properties of organic molecules. Rather than proposing a new architecture, the work systematically tests whether a general-purpose LLM can learn chemically meaningful patterns from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings when fine-tuned on classification tasks. The primary contribution is the empirical characterization of GPT-3&rsquo;s performance, robustness, and limitations for molecular property prediction, including extensive ablation studies.</p>
<h2 id="why-fine-tune-a-general-purpose-llm-for-chemistry">Why Fine-Tune a General-Purpose LLM for Chemistry?</h2>
<p>Machine learning for molecular property prediction typically relies on specialized representations: molecular graphs processed by graph neural networks (GNNs), engineered molecular descriptors, or domain-specific chemical language models trained from scratch on SMILES or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. These approaches require varying levels of domain expertise to design the inputs and architecture.</p>
<p>GPT-3, pre-trained on vast amounts of general text, already has an internal representation of language structure. SMILES notation, as a text-based molecular representation, can be treated as a &ldquo;language&rdquo; with its own syntax. The authors hypothesize that GPT-3&rsquo;s language understanding capabilities, combined with the human-readable nature of SMILES, may enable the model to recognize significant patterns within chemical structures and capture structure-property dependencies. The key question is whether fine-tuning alone is sufficient, or whether specialized architectures provide fundamental advantages.</p>
<p>Prior work by <a href="/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/">Jablonka et al.</a> showed that fine-tuned GPT-3 could perform surprisingly well on low-data chemistry tasks, sometimes surpassing dedicated models. This paper extends that investigation with a focus on electronic properties (<a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO and LUMO</a> energies) of <a href="https://en.wikipedia.org/wiki/Organic_semiconductor">organic semiconductors</a>, with deeper analysis of robustness and failure modes.</p>
<h2 id="smiles-to-classification-via-prompt-completion-fine-tuning">SMILES-to-Classification via Prompt-Completion Fine-Tuning</h2>
<p>The core approach is straightforward. Each training example is a prompt-completion pair in JSONL format:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{<span style="color:#f92672">&#34;prompt&#34;</span>: <span style="color:#e6db74">&#34;SMILES_string&#34;</span>, <span style="color:#f92672">&#34;completion&#34;</span>: <span style="color:#e6db74">&#34;class_label&#34;</span>}
</span></span></code></pre></div><p>The SMILES string serves as the prompt, and the fine-tuned model learns to complete it with a class label (0/1 for binary, 0/1/2 for ternary, 0/1/2/3 for quaternary classification). Class thresholds are determined by equally segmenting the property value range. The authors use GPT-3&rsquo;s default tokenizer, which breaks SMILES strings into subword tokens that do not correspond to chemically meaningful units (e.g., &ldquo;c1ccccc1&rdquo; for benzene gets tokenized into arbitrary fragments).</p>
<p>This design choice has important implications. The model must learn chemical semantics from token patterns that are not aligned with atoms or bonds. The authors note this as a limitation and hypothesize that a chemistry-aware tokenizer could improve performance.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The primary dataset is a collection of 48,182 organic semiconductor (OSC) molecules extracted from the <a href="https://en.wikipedia.org/wiki/Cambridge_Structural_Database">Cambridge Structural Database</a> (CSD). Each molecule has a SMILES representation and quantum-chemically computed electronic properties (HOMO and LUMO energies). A secondary dataset of 572 aromatic molecular photocatalysts (AMPs) with experimentally measured <a href="https://en.wikipedia.org/wiki/Hydrogen_evolution_reaction">hydrogen evolution rates</a> (HER) provides an additional test case.</p>
<h3 id="baselines">Baselines</h3>
<p>Three baselines are compared:</p>
<ol>
<li><strong>Directed message-passing neural network (D-MPNN)</strong> via Chemprop, using default molecular graph representations</li>
<li><strong>RDKit molecular descriptors + SVM</strong>, using the top 20 descriptors selected by SelectKBest</li>
<li><strong>Prior ML results</strong> from the original AMP dataset paper (using engineered domain-specific features)</li>
</ol>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Classes</th>
          <th>GPT-3 Accuracy</th>
          <th>GNN Accuracy</th>
          <th>Descriptors Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>3</td>
          <td>0.92</td>
          <td>0.94</td>
          <td>0.87</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>4</td>
          <td>0.68</td>
          <td>0.75</td>
          <td>0.47</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>5</td>
          <td>0.60</td>
          <td>0.68</td>
          <td>0.40</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>LUMO</td>
          <td>3</td>
          <td>0.94</td>
          <td>0.94</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>AMPs (572)</td>
          <td>HER</td>
          <td>2</td>
          <td>0.88</td>
          <td>0.86</td>
          <td>0.87</td>
      </tr>
  </tbody>
</table>
<p>For ternary classification, GPT-3 performs on par with GNNs (0.92 vs. 0.94 for HOMO; 0.94 vs. 0.94 for LUMO). Performance degrades more steeply than GNNs as the number of classes increases: at 5-class HOMO, GPT-3 achieves only 0.60 vs. GNN&rsquo;s 0.68. On the small AMP dataset (572 molecules), GPT-3 slightly outperforms the GNN (0.88 vs. 0.86).</p>
<h3 id="learning-curves">Learning Curves</h3>
<p>The data efficiency analysis reveals that GPT-3 needs at least 20% of the OSC dataset (approximately 9,600 molecules) to reach accuracy above 0.9. Below 1,000 training points, accuracy drops below 0.6. GNNs outperform GPT-3 in this low-data regime, which the authors attribute to (1) the molecular graph being chemically more expressive than SMILES for these tasks, and (2) fine-tuning requiring sufficient data to capture relevant SMILES patterns.</p>
<h3 id="ablation-study-1-single-atom-removal">Ablation Study 1: Single-Atom Removal</h3>
<p>The authors tested robustness by removing individual non-hydrogen, non-carbon atoms from SMILES strings and replacing them with a <code>&lt;missing&gt;</code> token. Out of 45,763 ablation tests on 7,714 correctly predicted molecules, 95.2% retained the same classification. This suggests the model captures redundant structural information rather than relying on any single atom.</p>
<h3 id="ablation-study-2-single-group-removal">Ablation Study 2: Single-Group Removal</h3>
<p>Fifteen chemical groups (nitrile, nitro, enamine, ketone, etc.) were individually ablated. The fine-tuned model attributed the most importance to acetylene (81% agreement for HOMO), enamine (85%), nitro (86%), and ketone (87%) groups, as these altered HOMO predictions in more than 10% of tests. Interestingly, groups that participate in electronic pi-conjugation tended to be more &ldquo;important&rdquo; to the model&rsquo;s HOMO predictions.</p>
<p>When ablated atoms were replaced with random elements instead of the <code>&lt;missing&gt;</code> token, the model failed in 80% of cases for a representative molecule. This suggests the model may &ldquo;fill in&rdquo; the missing information when seeing the <code>&lt;missing&gt;</code> token but gets confused by incorrect atomic identities.</p>
<h3 id="predicting-unknown-molecular-families">Predicting Unknown Molecular Families</h3>
<p>The authors held out entire families of <a href="https://en.wikipedia.org/wiki/Polycyclic_aromatic_hydrocarbon">polycyclic aromatic hydrocarbons</a> (naphthalene, anthracene, tetracene, pyrene, perylene), quinones, and imides during training, then tested predictions on these unseen families. Results for the first five PAH families:</p>
<table>
  <thead>
      <tr>
          <th>Fragment Family</th>
          <th>Molecules</th>
          <th>GPT-3 HOMO</th>
          <th>GNN HOMO</th>
          <th>GPT-3 LUMO</th>
          <th>GNN LUMO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Naphthalene</td>
          <td>475</td>
          <td>0.94</td>
          <td>0.95</td>
          <td>0.88</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Anthracene</td>
          <td>577</td>
          <td>0.99</td>
          <td>1.00</td>
          <td>0.93</td>
          <td>0.97</td>
      </tr>
      <tr>
          <td>Tetracene</td>
          <td>72</td>
          <td>0.96</td>
          <td>1.00</td>
          <td>0.90</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Pyrene</td>
          <td>237</td>
          <td>0.98</td>
          <td>1.00</td>
          <td>0.97</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Perylene</td>
          <td>41</td>
          <td>0.98</td>
          <td>1.00</td>
          <td>0.98</td>
          <td>0.95</td>
      </tr>
  </tbody>
</table>
<p>GPT-3 generalizes well to unknown PAH families, though GNNs have a slight edge on HOMO prediction. Performance degrades somewhat for quinones and imides.</p>
<h3 id="canonical-vs-non-canonical-smiles">Canonical vs. Non-Canonical SMILES</h3>
<p>A model fine-tuned only on canonical SMILES performed poorly on non-canonical variants: only 1,622 of 8,578 molecules achieved consistent predictions across all 11 SMILES variants (1 canonical + 10 non-canonical). Augmenting the training data with 5 non-canonical SMILES per molecule dramatically improved consistency to 7,243 of 8,578 molecules and nearly eliminated erroneous (non-class-label) responses. This finding highlights that GPT-3&rsquo;s pattern matching is highly sensitive to surface-level string representation and benefits substantially from <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">data augmentation</a>.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>The main findings are:</p>
<ol>
<li>Fine-tuned GPT-3 (ada) achieves competitive accuracy with GNNs for coarse-grained (ternary) HOMO/LUMO classification, but performance drops more steeply with finer granularity.</li>
<li>The model shows robustness to single-atom and single-group ablation, suggesting it captures chemically redundant patterns.</li>
<li>Generalization to held-out molecular families is strong, though GNNs maintain a slight advantage.</li>
<li>SMILES augmentation with non-canonical variants is essential for consistent predictions.</li>
</ol>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Black-box nature</strong>: GPT-3 provides no physical insight or interpretability, unlike GNN models where molecular graph features can be augmented with domain knowledge.</li>
<li><strong>Tokenization</strong>: The generic tokenizer does not respect chemical structure. A chemistry-aware tokenizer could improve data efficiency and accuracy.</li>
<li><strong>SELFIES underperformance</strong>: Initial tests with SELFIES did not improve over SMILES, likely because generic tokenization stripped away the extra chemical information SELFIES encodes.</li>
<li><strong>Cost</strong>: Fine-tuning via OpenAI&rsquo;s API cost approximately $500 for the experiments, and the model is closed-source, preventing systematic interpretation of learned representations.</li>
<li><strong>Classification only</strong>: The approach performs coarse-grained classification rather than regression, limiting utility for applications requiring precise numerical predictions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>OSC molecules from CSD</td>
          <td>48,182</td>
          <td>SMILES + DFT-computed HOMO/LUMO energies</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Aromatic molecular photocatalysts (AMPs)</td>
          <td>572</td>
          <td>Experimental hydrogen evolution rates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fine-tuning uses OpenAI&rsquo;s GPT-3 &ldquo;ada&rdquo; base model via the API</li>
<li>Prompt-completion pairs in JSONL format</li>
<li>Default GPT-3 tokenizer</li>
<li>80/20 train/test split for OSC; stratified 10-fold CV for AMPs</li>
<li>Non-canonical SMILES generated using RDKit (10 per molecule for testing, 5 per molecule for augmented training)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-3 &ldquo;ada&rdquo; (fine-tuned, closed-source, accessed via OpenAI API)</li>
<li>Chemprop D-MPNN baseline (open-source)</li>
<li>RDKit descriptors + scikit-learn SVM baseline</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best GPT-3 Value</th>
          <th>Best GNN Value</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>0.92</td>
          <td>0.94</td>
          <td>3-class HOMO (OSCs)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>0.94</td>
          <td>0.94</td>
          <td>3-class LUMO (OSCs)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>0.88</td>
          <td>0.86</td>
          <td>2-class HER (AMPs)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify local hardware requirements. All GPT-3 fine-tuning was conducted via OpenAI&rsquo;s cloud API at a total cost of approximately $500.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XieZikai/Chem-GPT-Finetune">Chem-GPT-Finetune</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Python code and datasets for fine-tuning and evaluation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xie, Z., Evangelopoulos, X., Omar, O. H., Troisi, A., Cooper, A. I., &amp; Chen, L. (2024). Fine-tuning GPT-3 for machine learning electronic and functional properties of organic molecules. <em>Chemical Science</em>, 15(2), 500-510.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xie2024finetuning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fine-tuning {GPT-3} for machine learning electronic and functional properties of organic molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Zikai and Evangelopoulos, Xenophon and Omar, {\&#34;O}mer H. and Troisi, Alessandro and Cooper, Andrew I. and Chen, Linjiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{500--510}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3SC04610A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v3: Scaffold-Constrained Graph Transformer</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</guid><description>DrugEx v3 proposes a Graph Transformer with novel positional encoding for scaffold-constrained molecular generation via multi-objective reinforcement learning.</description><content:encoded><![CDATA[<h2 id="a-graph-transformer-method-for-scaffold-constrained-drug-design">A Graph Transformer Method for Scaffold-Constrained Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces DrugEx v3, a Graph Transformer model for scaffold-constrained de novo drug design. The primary contribution is a novel positional encoding scheme for molecular graphs that allows a Transformer architecture to operate on graph-structured molecular data rather than <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. The model takes user-provided scaffold fragments as input and generates complete molecules through growing and connecting operations, trained with multi-objective reinforcement learning to optimize for both target affinity and drug-likeness.</p>
<h2 id="from-fixed-objectives-to-user-guided-scaffold-design">From Fixed Objectives to User-Guided Scaffold Design</h2>
<p>Prior versions of DrugEx (v1 and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">v2</a>) used RNN-based generators trained with reinforcement learning for de novo drug design, but they operated under fixed objectives and could not accept user-provided structural priors. If a medicinal chemist wanted to explore analogs of a specific scaffold, the model needed retraining from scratch. Meanwhile, SMILES-based molecular generators face inherent limitations for scaffold-constrained design: SMILES is a linear notation, so inserting fragments at multiple positions of a scaffold requires complex grammar handling, and small token changes can produce invalid molecules.</p>
<p>Several approaches had been proposed for scaffold-based generation, including graph generative models (Lim et al., 2019), DeepScaffold (Li et al., 2020), SMILES-based scaffold decorators (Arus-Pous et al., 2020), and SyntaLinker for fragment linking (Yang et al., 2020). DrugEx v3 aims to combine the advantages of graph representations (validity guarantees, local invariance, flexible extension) with the Transformer architecture&rsquo;s ability to handle complex dependencies, while maintaining the multi-objective reinforcement learning framework from DrugEx v2.</p>
<h2 id="graph-positional-encoding-for-molecular-transformers">Graph Positional Encoding for Molecular Transformers</h2>
<p>The core innovation is adapting the Transformer architecture to work directly with molecular graph representations. Two key modifications make this possible.</p>
<p><strong>Graph word encoding.</strong> Since atoms and bonds cannot be processed simultaneously in a graph, the authors combine them into a single index:</p>
<p>$$
W = T_{atom} \times 4 + T_{bond}
$$</p>
<p>where $T_{atom}$ is the atom type index and $T_{bond}$ is the bond type index (four bond types: single, double, triple, and none).</p>
<p><strong>Graph positional encoding.</strong> Standard sequential position encoding does not capture molecular topology. The authors propose an adjacency-matrix-based positional encoding:</p>
<p>$$
P = I_{Atom} \times L_{max} + I_{Connected}
$$</p>
<p>where $I_{Atom}$ is the current atom index, $L_{max}$ is the maximum sequence length, and $I_{Connected}$ is the index of the atom connected by the current bond. This encoding is then processed through the standard sinusoidal positional encoding:</p>
<p>$$
PE_{(p, 2i)} = \sin(pos / 10000^{2i / d_{m}})
$$</p>
<p>$$
PE_{(p, 2i+1)} = \cos(pos / 10000^{2i / d_{m}})
$$</p>
<p>with $d_{m} = 512$.</p>
<p><strong>Molecule generation procedure.</strong> Each molecule in the training data is represented as a five-row matrix encoding atom type, bond type, connected atom index, current atom index, and fragment index. The columns are divided into three sections: fragment (the scaffold), growing (new atoms added to fragments), and linking (bonds connecting grown fragments). The decoder uses a GRU-based recurrent layer to sequentially output atom type, bond type, connected atom index, and current atom index at each step, with chemical valence rules enforced at every generation step to guarantee valid molecules.</p>
<p><strong>Multi-objective reinforcement learning.</strong> The generator is trained with a policy gradient objective:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) | \theta\right] = \sum_{t=1}^{T} \log G(y_{t} | y_{1:t-1}) \cdot R^{\ast}(y_{1:T})
$$</p>
<p>where $R^{*}$ is a Pareto-based reward combining target affinity and QED drug-likeness score:</p>
<p>$$
R^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>with $k$ being the solution&rsquo;s index in the Pareto rank. An exploration strategy uses two networks: an exploitation network $G_{\theta}$ (updated by policy gradient) and an exploration network $G_{\phi}$ (fixed, pre-trained on ChEMBL), with an exploration rate $\varepsilon$ controlling how many scaffolds are routed to $G_{\phi}$ during training.</p>
<h2 id="experimental-setup-architecture-comparison-and-rl-optimization">Experimental Setup: Architecture Comparison and RL Optimization</h2>
<h3 id="data">Data</h3>
<p>The ChEMBL set (version 27) contained approximately 1.7 million molecules for pre-training, preprocessed via RDKit (charge neutralization, metal/fragment removal). The LIGAND set comprised 10,828 adenosine receptor ligands for fine-tuning. Each molecule was decomposed into fragments using the BRICS algorithm, creating scaffold-molecule pairs (up to 15 pairs per molecule with four fragments). The ChEMBL set yielded 9.3 million training pairs, and the LIGAND set produced 53,888 training pairs.</p>
<h3 id="architecture-comparison">Architecture comparison</h3>
<p>Four architectures were compared:</p>
<ol>
<li><strong>Graph Transformer</strong>: graph input with novel positional encoding</li>
<li><strong>Sequential Transformer</strong>: SMILES input with standard Transformer</li>
<li><strong>LSTM-BASE</strong>: SMILES encoder-decoder with three recurrent layers</li>
<li><strong>LSTM+ATTN</strong>: LSTM-BASE with an attention mechanism between encoder and decoder</li>
</ol>
<p>All models were pre-trained on ChEMBL and fine-tuned on the LIGAND set. The bioactivity predictor was a random forest regression model using 2048D ECFP6 fingerprints and 19D physicochemical descriptors, with an activity threshold of pX = 6.5 for the A2A adenosine receptor.</p>
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>Five metrics were used: validity (parseable molecules), accuracy (scaffold containment), desirability (meeting all objectives), uniqueness, and novelty (not in ChEMBL). Diversity was measured using the Solow-Polasky index with Tanimoto distance on ECFP6 fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\intercal} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<h3 id="hardware">Hardware</h3>
<p>Models were benchmarked on a server with NVIDIA Tesla P100 GPUs.</p>
<h2 id="key-results-graph-representation-advantages-and-rl-trade-offs">Key Results: Graph Representation Advantages and RL Trade-offs</h2>
<h3 id="pre-training-and-fine-tuning-performance">Pre-training and fine-tuning performance</h3>
<p>The Graph Transformer achieved the best overall performance across all metrics:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity (PT)</th>
          <th>Accuracy (PT)</th>
          <th>Validity (FT)</th>
          <th>Accuracy (FT)</th>
          <th>Novelty (FT)</th>
          <th>Uniqueness (FT)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph Transformer (512)</td>
          <td>100.0%</td>
          <td>99.3%</td>
          <td>100.0%</td>
          <td>99.2%</td>
          <td>68.9%</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>Seq. Transformer (512)</td>
          <td>96.7%</td>
          <td>74.0%</td>
          <td>99.3%</td>
          <td>92.7%</td>
          <td>8.9%</td>
          <td>28.9%</td>
      </tr>
      <tr>
          <td>LSTM+ATTN (512)</td>
          <td>94.3%</td>
          <td>72.8%</td>
          <td>96.9%</td>
          <td>85.9%</td>
          <td>6.3%</td>
          <td>20.7%</td>
      </tr>
      <tr>
          <td>LSTM-BASE (512)</td>
          <td>93.9%</td>
          <td>52.4%</td>
          <td>98.7%</td>
          <td>81.6%</td>
          <td>3.9%</td>
          <td>19.2%</td>
      </tr>
  </tbody>
</table>
<p>PT = pre-trained, FT = fine-tuned. The Graph Transformer achieved 100% validity due to its explicit valence checking at each generation step. It also produced substantially more novel and unique molecules after fine-tuning compared to SMILES-based methods.</p>
<p>The authors identified four advantages of the graph representation over SMILES: (1) local invariance, where fragment ordering does not affect output; (2) global extendibility, where new atoms can be appended without restructuring existing data; (3) freedom from grammar constraints; and (4) direct accessibility of chemical valence rules for validity enforcement.</p>
<h3 id="reinforcement-learning-results">Reinforcement learning results</h3>
<p>With multi-objective RL (affinity + QED), 74.6% of generated molecules were predicted active at $\varepsilon = 0.0$. The exploration rate $\varepsilon$ trades off desirability against uniqueness:</p>
<table>
  <thead>
      <tr>
          <th>$\varepsilon$</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.0</td>
          <td>74.6%</td>
          <td>60.7%</td>
          <td>60.6%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.1</td>
          <td>66.8%</td>
          <td>75.0%</td>
          <td>74.6%</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>0.2</td>
          <td>61.6%</td>
          <td>80.2%</td>
          <td>79.4%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.3</td>
          <td>56.8%</td>
          <td>89.8%</td>
          <td>88.8%</td>
          <td>0.874</td>
      </tr>
  </tbody>
</table>
<p>The authors report that $\varepsilon = 0.3$ produced the best balance between desirability and uniqueness, with 56.8% desired molecules and 89.8% uniqueness. Diversity remained above 0.84 across all settings.</p>
<h3 id="limitations">Limitations</h3>
<p>The Graph Transformer produced molecules with worse synthetic accessibility (SA scores) compared to SMILES-based methods, particularly after fine-tuning on the smaller LIGAND set. The authors attribute this to uncommon ring systems generated when the model handles long-distance dependencies. A kekulization issue also causes a small fraction of molecules to fail scaffold matching: aromatic bond inference during sanitization can alter the scaffold substructure. Without single-objective affinity constraint, the model generates molecules with molecular weight exceeding 500 Da, reducing drug-likeness. All bioactivity predictions rely on a random forest model rather than experimental validation, and the t-SNE analysis suggests some generated molecules fall outside the model&rsquo;s applicability domain.</p>
<h3 id="future-directions">Future directions</h3>
<p>The authors propose extending the Graph Transformer to accept protein information as input via proteochemometric modeling, enabling design of ligands for targets without known ligands. Lead optimization, where a &ldquo;hit&rdquo; serves as input to generate improved analogs, is also identified as a natural extension.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v27</td>
          <td>~1.7M molecules (9.3M scaffold-molecule pairs)</td>
          <td>Preprocessed via RDKit</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LIGAND set (A2A AR ligands from ChEMBL)</td>
          <td>10,828 ligands (53,888 pairs)</td>
          <td>Split 8:1:1 train/val/test</td>
      </tr>
      <tr>
          <td>Bioactivity labels</td>
          <td>ChEMBL A2A AR activity data</td>
          <td>pX threshold = 6.5</td>
          <td>Average pChEMBL values</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fragment decomposition: BRICS algorithm via RDKit (max 4 fragments per molecule)</li>
<li>Optimizer: Adam with learning rate $10^{-4}$, batch size 256</li>
<li>Pre-training: 20 epochs; fine-tuning: up to 1,000 epochs with early stopping (patience: 100 epochs)</li>
<li>Bioactivity predictor: random forest regression (scikit-learn) with 2048D ECFP6 + 19D physicochemical descriptors</li>
<li>Pareto-based multi-objective ranking with GPU acceleration</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Graph Transformer: 512 hidden units, 8 attention heads, $d_{k} = d_{v} = 64$</li>
<li>Sequential Transformer: same hidden size, sinusoidal positional encoding</li>
<li>LSTM-BASE / LSTM+ATTN: 128 embedding units, 512 hidden units, 3 recurrent layers</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Graph Transformer</th>
          <th>Best SMILES Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (fine-tuned)</td>
          <td>100.0%</td>
          <td>99.6% (LSTM-BASE 1024)</td>
          <td>Valence checking guarantees validity</td>
      </tr>
      <tr>
          <td>Accuracy (fine-tuned)</td>
          <td>99.2%</td>
          <td>94.3% (Seq. Transformer 1024)</td>
          <td>Scaffold containment</td>
      </tr>
      <tr>
          <td>Desirability (RL, $\varepsilon$=0.0)</td>
          <td>74.6%</td>
          <td>N/A</td>
          <td>Only Graph Transformer used for RL</td>
      </tr>
      <tr>
          <td>Diversity (RL)</td>
          <td>0.879</td>
          <td>N/A</td>
          <td>Solow-Polasky index</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware-1">Hardware</h3>
<p>NVIDIA Tesla P100 GPUs. Specific training times not reported, but Transformer models trained faster than LSTM models with the same hidden layer size.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CDDLeiden/DrugEx">CDDLeiden/DrugEx</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (v1, v2, v3)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v27</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Pre-training data source</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P., &amp; van Westen, G. J. P. (2023). DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. <em>Journal of Cheminformatics</em>, 15, 24. <a href="https://doi.org/10.1186/s13321-023-00694-z">https://doi.org/10.1186/s13321-023-00694-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2023drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00694-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (up to $10^6$ seconds for 1000 valid molecules for EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta | \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state-of-the-art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite pre-training on the same Uni-Mol data.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is two orders of magnitude faster</strong> than diffusion baselines (200s vs. 1.4M s for EDM)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters), with the authors finding this sufficient for current tasks but not exploring larger scales. The RL optimization uses only Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. BindGPT is the first model to explicitly generate hydrogens at scale, though validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified. The project website exists but no source code has been released as of this writing.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Atom-in-SMILES: Better Tokens for Chemical Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/atom-in-smiles-tokenization/</guid><description>Atom-in-SMILES replaces generic SMILES tokens with environment-aware atomic tokens, reducing token degeneration and improving chemical translation accuracy.</description><content:encoded><![CDATA[<h2 id="a-new-tokenization-method-for-chemical-language-models">A New Tokenization Method for Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces Atom-in-SMILES (AIS), a tokenization scheme for SMILES strings that replaces generic atomic tokens with environment-aware tokens encoding each atom&rsquo;s local chemical neighborhood. The primary contribution is demonstrating that tokenization quality has a significant impact on chemical language model outcomes across multiple tasks: SMILES canonicalization, <a href="/notes/chemistry/molecular-design/reaction-prediction/">single-step retrosynthesis</a>, and <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a>.</p>
<h2 id="why-standard-smiles-tokenization-falls-short">Why Standard SMILES Tokenization Falls Short</h2>
<p>Standard atom-wise SMILES tokenization treats all atoms of the same element identically. Every carbon is tokenized as &ldquo;C&rdquo; regardless of whether it is part of an aromatic ring, a carbonyl group, or a methyl chain. This creates a highly degenerate token space where chemically distinct atoms share the same representation.</p>
<p>The authors draw an analogy between natural language and chemical language. A typical SMILES sequence is about three times longer than a natural language sentence, yet the token vocabulary is roughly 1000 times smaller. This mismatch leads to extreme token repetition: the same tokens (C, c, N, O) appear many times within a single sequence. In natural language processing, token degeneration (where models repeatedly predict the same token) is a known failure mode of autoregressive decoders. The repetitive nature of SMILES tokens exacerbates this problem in chemical language models.</p>
<p>SMILES also lacks a one-to-one correspondence between tokens and chemical meaning. Two molecules that differ in only one atom substitution (e.g., swapping a carbon for a nitrogen in a ring) produce identical token sets under atom-wise tokenization, making it harder for models to distinguish structurally similar molecules.</p>
<h2 id="core-innovation-encoding-atom-environments-into-tokens">Core Innovation: Encoding Atom Environments into Tokens</h2>
<p>The key insight is to replace each atomic token with a richer token that encodes the atom&rsquo;s local chemical environment, inspired by the <a href="https://en.wikipedia.org/wiki/Atoms_in_molecules">atoms-in-molecules (AIM)</a> concept from quantum chemistry. For a given SMILES string, the AIS mapping function $f$ operates on the token space:</p>
<p>$$
f(X) = \begin{cases} AE|_{X_{\text{central}}} &amp; \text{if } X \text{ is an atom} \\ X &amp; \text{otherwise} \end{cases}
$$</p>
<p>where $AE|_{X_{\text{central}}}$ denotes the atomic environment centered on atom $X$. Non-atomic tokens (brackets, bond symbols, ring closures) pass through unchanged.</p>
<p>Each AIS token is formatted as <code>[Sym;Ring;Neighbors]</code> where:</p>
<ul>
<li><strong>Sym</strong> is the atomic symbol with chirality, aromaticity (lowercase for aromatic), hydrogen count, and formal charge</li>
<li><strong>Ring</strong> indicates whether the atom is in a ring (<code>R</code>) or not (<code>!R</code>)</li>
<li><strong>Neighbors</strong> lists the neighboring atoms interacting with the central atom</li>
</ul>
<p>This mapping is bijective: SMILES strings can be fully recovered from AIS strings via an inverse projection. The algorithm iterates over atoms in a molecule, computes their local environments using RDKit, and produces environment-aware token variants.</p>
<p>As a concrete example, in glycine the two carbons and two oxygens are indistinguishable under atom-wise tokenization. Under AIS, each receives a unique token reflecting its bonding environment (e.g., the carboxyl carbon is distinguished from the alpha carbon).</p>
<p>The AIS tokenization also exhibits a fingerprint-like property. Because each token encodes local structural information, the set of AIS tokens for a molecule functions similarly to circular fingerprints like ECFP2. The authors show that pairwise <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarities</a> computed from AIS token sets have resolution comparable to ECFP2 and HashAP fingerprints, and better resolution than MACCS, Avalon, and RDKit fingerprints.</p>
<p>Token repetition can be quantified as:</p>
<p>$$
\text{rep-}l = \sum_{t=1}^{|s|} \mathbb{1}[s_t \in s_{t-w-1:t-1}]
$$</p>
<p>where $s$ is the predicted sequence, $|s|$ is the token count, and $w$ is the window size. AIS tokens exhibit consistently lower normalized repetition rates compared to SMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, and <a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a> across diverse molecular datasets (drugs, natural products, steroids, lipids, metal complexes, octane isomers).</p>
<h2 id="experimental-evaluation-across-three-chemical-tasks">Experimental Evaluation Across Three Chemical Tasks</h2>
<h3 id="input-output-equivalent-mapping-smiles-canonicalization">Input-Output Equivalent Mapping (SMILES Canonicalization)</h3>
<p>The first task tests whether a model can translate non-canonical SMILES enumerations into canonical form. The authors constructed deliberately challenging datasets from <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> subsets with cumulative structural constraints (no cyclic heteroatom-heteroatom bonds, stable functional groups only, fragment-like, scaffold-like, etc.), generating training sets of 1M molecules augmented with 150K molecules from the most restrictive subset at 10x, 30x, and 50x augmentation levels.</p>
<table>
  <thead>
      <tr>
          <th>GDB-13 Subset</th>
          <th>Atom-wise (x10)</th>
          <th>Atom-wise (x50)</th>
          <th>AIS (x10)</th>
          <th>AIS (x50)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ab</td>
          <td>34.2%</td>
          <td>33.2%</td>
          <td>37.3%</td>
          <td>34.1%</td>
      </tr>
      <tr>
          <td>abc</td>
          <td>31.0%</td>
          <td>29.6%</td>
          <td>33.7%</td>
          <td>30.4%</td>
      </tr>
      <tr>
          <td>abcde</td>
          <td>48.7%</td>
          <td>45.5%</td>
          <td>53.6%</td>
          <td>47.0%</td>
      </tr>
      <tr>
          <td>abcdef</td>
          <td>41.8%</td>
          <td>39.1%</td>
          <td>52.5%</td>
          <td>46.9%</td>
      </tr>
      <tr>
          <td>abcdefg</td>
          <td>50.9%</td>
          <td>50.0%</td>
          <td>59.9%</td>
          <td>56.8%</td>
      </tr>
  </tbody>
</table>
<p>AIS outperformed atom-wise tokenization on all subsets and augmentation levels. The performance gap grew larger for more restrictive (more similar) subsets, reaching up to 10.7% on the abcdef subset. This demonstrates that AIS is particularly effective when molecules are structurally similar and harder to distinguish.</p>
<h3 id="single-step-retrosynthesis">Single-Step Retrosynthesis</h3>
<p>The second task uses the USPTO-50K benchmark for single-step <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthetic prediction</a> via a template-free transformer encoder-decoder model. The model was trained for 200,000 steps with Adam optimizer, negative log-likelihood loss, and cyclic learning rate scheduling.</p>
<table>
  <thead>
      <tr>
          <th>Tokenization</th>
          <th>rep-|P - rep-|GT &gt;= 2</th>
          <th>String Exact (%)</th>
          <th>Tc Exact (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Atom-wise baseline</td>
          <td>&ndash;</td>
          <td>42.00</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Atom-wise (reproduced)</td>
          <td>801</td>
          <td>42.05</td>
          <td>44.72</td>
      </tr>
      <tr>
          <td>SmilesPE</td>
          <td>821</td>
          <td>19.82</td>
          <td>22.74</td>
      </tr>
      <tr>
          <td>SELFIES</td>
          <td>886</td>
          <td>28.82</td>
          <td>30.76</td>
      </tr>
      <tr>
          <td>DeepSMILES</td>
          <td>902</td>
          <td>38.63</td>
          <td>41.20</td>
      </tr>
      <tr>
          <td><strong>Atom-in-SMILES</strong></td>
          <td><strong>727</strong></td>
          <td><strong>46.32</strong></td>
          <td><strong>47.62</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved 46.32% string exact accuracy (4.3% above the atom-wise baseline) and 47.62% Tanimoto exact accuracy (2.9% above baseline). AIS also had the fewest degenerate token repetitions (727 vs. 801 for atom-wise), representing approximately a 10% reduction. DeepSMILES had the highest repetition count (902) despite reasonable overall accuracy. SELFIES and <a href="/notes/chemistry/molecular-representations/notations/smiles-pair-encoding/">SmilesPE</a> both performed substantially worse than the atom-wise baseline on this task.</p>
<p>The authors identified six common token repetition patterns in retrosynthetic predictions: long head repetitions, long tail repetitions, repetitive rings, repetitive chains, and halogen repetitions on both aliphatic and aromatic carbons.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The third task evaluates tokenization schemes on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks using Random Forest models with 5-fold cross-validation. AIS tokens were converted to fingerprint-like feature vectors.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>SMILES</th>
          <th>DeepSMILES</th>
          <th>SELFIES</th>
          <th>SmilesPE</th>
          <th>AIS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Regression (RMSE, lower is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>0.628</td>
          <td>0.631</td>
          <td>0.675</td>
          <td>0.689</td>
          <td><strong>0.553</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>0.545</td>
          <td>0.544</td>
          <td>0.564</td>
          <td>0.761</td>
          <td><strong>0.441</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>0.924</td>
          <td>0.895</td>
          <td>0.938</td>
          <td>0.800</td>
          <td><strong>0.683</strong></td>
      </tr>
      <tr>
          <td><strong>Classification (ROC-AUC, higher is better)</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>0.758</td>
          <td>0.777</td>
          <td>0.799</td>
          <td>0.847</td>
          <td><strong>0.885</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>0.740</td>
          <td>0.774</td>
          <td>0.746</td>
          <td>0.837</td>
          <td><strong>0.835</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>0.649</td>
          <td>0.648</td>
          <td>0.653</td>
          <td>0.739</td>
          <td><strong>0.729</strong></td>
      </tr>
  </tbody>
</table>
<p>AIS achieved the best performance on all three regression datasets and two of three classification datasets. On ESOL, the RMSE improvement over standard SMILES was 12%. On lipophilicity, the improvement was 26%.</p>
<h2 id="key-findings-better-tokens-yield-better-chemical-models">Key Findings: Better Tokens Yield Better Chemical Models</h2>
<p>The main findings of this work are:</p>
<ol>
<li>
<p><strong>Tokenization significantly impacts chemical language model quality.</strong> The choice of tokenization scheme can change prediction accuracy by over 10 percentage points on equivalent mapping tasks.</p>
</li>
<li>
<p><strong>AIS reduces token degeneration by approximately 10%</strong> compared to atom-wise SMILES tokenization, with consistently lower normalized repetition rates across diverse molecular datasets.</p>
</li>
<li>
<p><strong>AIS outperforms all compared tokenization schemes</strong> (atom-wise SMILES, SmilesPE, SELFIES, DeepSMILES) on canonicalization, retrosynthesis, and property prediction.</p>
</li>
<li>
<p><strong>The fingerprint-like nature of AIS tokens</strong> enables direct use as molecular features for property prediction and provides resolution comparable to established circular fingerprints.</p>
</li>
<li>
<p><strong>The mapping is invertible</strong>, so AIS strings can always be converted back to valid SMILES. This is a practical advantage over approaches that may lose structural information.</p>
</li>
</ol>
<p><strong>Limitations</strong>: AIS cannot distinguish environmentally identical substructures or atoms related by a molecular symmetry plane, since it only considers nearest-neighbor environments. Performance on long-chain molecules (e.g., lipids) is similar across all tokenization schemes, suggesting that local environment encoding is less informative for repetitive linear structures.</p>
<p><strong>Future directions</strong>: The authors suggest AIS has potential for broader adoption in molecular generative models, chemical translation, and property prediction tasks across the cheminformatics community.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonicalization training</td>
          <td>GDB-13 subsets</td>
          <td>1M + 150K augmented</td>
          <td>Cumulative structural constraints a-h</td>
      </tr>
      <tr>
          <td>Canonicalization testing</td>
          <td>GDB-13 disjoint test sets</td>
          <td>20K per subset</td>
          <td>Various restriction levels</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50K</td>
          <td>~50K reactions</td>
          <td>Sequences &gt; 150 tokens removed</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipophilicity, BBBP, BACE, HIV)</td>
          <td>Varies</td>
          <td>Standard benchmark splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder architecture for canonicalization and retrosynthesis tasks</li>
<li>200,000 training steps with Adam optimizer, negative log-likelihood loss, cyclic learning rate scheduler</li>
<li>Random Forest with 5-fold cross-validation for property prediction</li>
<li>AIS tokenization implemented via RDKit for atom environment extraction</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>String exact match (%)</td>
          <td>Canonicalization, Retrosynthesis</td>
          <td>Exact SMILES match</td>
      </tr>
      <tr>
          <td>Tanimoto exactness (Tc)</td>
          <td>Retrosynthesis</td>
          <td>Morgan FP radius 3, 2048 bits</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression property prediction</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification property prediction</td>
          <td>BBBP, BACE, HIV</td>
      </tr>
      <tr>
          <td>rep-l</td>
          <td>Token degeneration</td>
          <td>Single-token repetition count</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not explicitly specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/snu-lcbc/atom-in-SMILES">atom-in-SMILES</a></td>
          <td>Code</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>AIS tokenization implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ucak, U. V., Ashyrmamatov, I., &amp; Lee, J. (2023). Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. <em>Journal of Cheminformatics</em>, 15, 55. <a href="https://doi.org/10.1186/s13321-023-00725-9">https://doi.org/10.1186/s13321-023-00725-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ucak2023improving,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00725-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This allows different decoder layers to access different levels of protein features, rather than all layers sharing the same top-level encoding.</p>
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = {S, a_1 a_2 \cdots a_\tau}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward and $U(\tilde{C}, a) = c_{puct} \cdot P(a | \tilde{C}) \cdot \sqrt{N_t} / (1 + N_t(a))$ is an exploration bonus based on the LT&rsquo;s predicted probability.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/chemistry/molecular-design/generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the Mycobacterium tuberculosis ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_j &amp;= \exp\left(-\frac{|r_i - r_j|^2}{\tau}\right)(h_i^{(l)\top} W h_j^{(l)}) \\
\alpha_j &amp;= \frac{\exp \hat{\alpha}_j}{\sum_{k=1}^{N} \exp \hat{\alpha}_k} \\
\hat{\boldsymbol{h}}_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_j (W_v h_j^{(l)})
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder determines the mean $\mu$ and standard deviation $\sigma$ for any (compound, protein) pair. During training, the model recovers the input compound. During application, a seed compound enables compound refinement. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \mathcal{D}_{\text{KL}}(q(z \mid \mathbf{x}, \mathbf{y}) | p(z))
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods: generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared to tens of minutes or hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted ClpP protease of Mycobacterium tuberculosis, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when removing the geometric attention term from Eq. 2.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/chemistry/molecular-design/generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of Molecular Representation Learning Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/molecular-representation-learning-foundation-models-review/</guid><description>A systematic review of molecular representation learning foundation models for drug discovery, covering five modalities and four pretraining strategies.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-molecular-representation-foundation-models">A Systematization of Molecular Representation Foundation Models</h2>
<p>This paper is a <strong>Systematization</strong> that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.</p>
<h2 id="why-a-systematic-review-of-mrl-foundation-models-is-needed">Why a Systematic Review of MRL Foundation Models Is Needed</h2>
<p>Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.</p>
<p>Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.</p>
<h2 id="taxonomy-of-molecular-descriptors-and-model-architectures">Taxonomy of Molecular Descriptors and Model Architectures</h2>
<p>The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.</p>
<h3 id="molecular-descriptors">Molecular Descriptors</h3>
<p>The review identifies five primary descriptor types:</p>
<ol>
<li><strong>Molecular fingerprints</strong>: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and dimensional complexity.</li>
<li><strong>1D sequences</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representations. SMILES is compact and widely used but can produce invalid molecules. SELFIES guarantees valid molecular strings by construction.</li>
<li><strong>2D topological graphs</strong>: Atoms as nodes, bonds as edges. Can be derived from SMILES via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, making graph datasets effectively interchangeable with SMILES datasets.</li>
<li><strong>3D geometry</strong>: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.</li>
<li><strong>Multimodal</strong>: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.</li>
</ol>
<p>The paper also discusses mathematically abstract molecular representations. For example, the <a href="https://en.wikipedia.org/wiki/Wiener_index">Wiener index</a> quantifies structural complexity:</p>
<p>$$
W = \frac{1}{2} \sum_{i &lt; j} d_{ij}
$$</p>
<p>where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$.</p>
<p>Degree centrality captures local connectivity:</p>
<p>$$
C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij}
$$</p>
<p>where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.</p>
<h3 id="model-architectures">Model Architectures</h3>
<p>Models are classified into two primary categories:</p>
<p><strong>Unimodal-based models:</strong></p>
<ul>
<li><strong>Sequence-based</strong>: Transformer models operating on SMILES/SELFIES (e.g., <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>, MolGEN, <a href="/notes/chemistry/llm-applications/llamsmol-instruction-tuning-chemistry/">LlaSMol</a>). These capture syntactic patterns but miss spatial and topological features.</li>
<li><strong>Topological graph-based</strong>: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.</li>
<li><strong>3D geometry-based</strong>: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.</li>
<li><strong>Image-based</strong>: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.</li>
</ul>
<p><strong>Multimodal-based models:</strong></p>
<ul>
<li><strong>Sequence + Graph</strong>: <a href="/notes/chemistry/molecular-representations/multimodal/dual-view-molecule-pretraining/">DVMP</a>, PanGu Drug Model. Combines the strengths of string and topological representations.</li>
<li><strong>Graph + 3D Geometry</strong>: GraphMVP, Transformer-M. Enriches topological features with spatial information.</li>
<li><strong>Text + Molecular Structure</strong>: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.</li>
</ul>
<h2 id="four-pretraining-paradigms-for-mrl">Four Pretraining Paradigms for MRL</h2>
<p>The review systematically categorizes pretraining strategies into four paradigms:</p>
<h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3>
<p>The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.</p>
<h3 id="contrastive-learning-cl">Contrastive Learning (CL)</h3>
<p>The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.</p>
<h3 id="reconstruction-based-pretraining-rbp">Reconstruction-Based Pretraining (RBP)</h3>
<p>Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.</p>
<h3 id="multimodal-alignment-pretraining-map">Multimodal Alignment Pretraining (MAP)</h3>
<p>Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.</p>
<h2 id="downstream-applications-and-performance-benchmarks">Downstream Applications and Performance Benchmarks</h2>
<p>The review evaluates MRL foundation models across five application domains.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The most common benchmark for MRL models. The review provides comprehensive ROC-AUC comparisons across eight <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification datasets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>BBBP</th>
          <th>BACE</th>
          <th>ClinTox</th>
          <th>Tox21</th>
          <th>SIDER</th>
          <th>HIV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MGMAE</td>
          <td>Graph</td>
          <td>94.2</td>
          <td>92.7</td>
          <td>96.7</td>
          <td>86.0</td>
          <td>66.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MPG</td>
          <td>Graph</td>
          <td>92.2</td>
          <td>92.0</td>
          <td>96.3</td>
          <td>83.7</td>
          <td>66.1</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GROVER</td>
          <td>Graph+Trans.</td>
          <td>94.0</td>
          <td>89.4</td>
          <td>94.4</td>
          <td>83.1</td>
          <td>65.8</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MoLFormer</td>
          <td>Sequence</td>
          <td>93.7</td>
          <td>88.2</td>
          <td>94.8</td>
          <td>84.7</td>
          <td>69.0</td>
          <td>82.2</td>
      </tr>
      <tr>
          <td>MM-Deacon</td>
          <td>Seq.+IUPAC</td>
          <td>78.5</td>
          <td>-</td>
          <td>99.5</td>
          <td>-</td>
          <td>69.3</td>
          <td>80.1</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>3D</td>
          <td>72.9</td>
          <td>85.7</td>
          <td>91.9</td>
          <td>79.6</td>
          <td>65.9</td>
          <td>80.8</td>
      </tr>
      <tr>
          <td>DVMP</td>
          <td>Seq.+Graph</td>
          <td>77.8</td>
          <td>89.4</td>
          <td>95.6</td>
          <td>79.1</td>
          <td>69.8</td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>TxD-T-LLM</td>
          <td>Seq.+Text</td>
          <td>-</td>
          <td>-</td>
          <td>86.3</td>
          <td>88.2</td>
          <td>-</td>
          <td>73.2</td>
      </tr>
  </tbody>
</table>
<p>The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on synthetic molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on <a href="/notes/chemistry/datasets/qm9/">QM9</a>.</p>
<h3 id="drug-drug-interaction-prediction"><a href="https://en.wikipedia.org/wiki/Drug_interaction">Drug-Drug Interaction</a> Prediction</h3>
<p>MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.</p>
<h3 id="retrosynthesis-prediction"><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a> Prediction</h3>
<p>DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).</p>
<h3 id="drug-synergy-prediction">Drug Synergy Prediction</h3>
<p>SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through contextual learning.</p>
<h2 id="guidelines-limitations-and-future-directions">Guidelines, Limitations, and Future Directions</h2>
<h3 id="model-selection-guidelines">Model Selection Guidelines</h3>
<p>The authors provide structured guidelines for choosing MRL foundation models based on:</p>
<ol>
<li><strong>Task objective</strong>: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.</li>
<li><strong>Data characteristics</strong>: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.</li>
<li><strong>Interpretability needs</strong>: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.</li>
<li><strong>Computational budget</strong>: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.</li>
</ol>
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The review identifies five key challenges:</p>
<ol>
<li><strong>Multimodal data integration</strong>: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> trajectories as a dynamic modality and using cross-modal data augmentation.</li>
<li><strong>Data scarcity</strong>: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.</li>
<li><strong>Interpretability</strong>: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.</li>
<li><strong>Training efficiency</strong>: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.</li>
<li><strong>Robustness and generalization</strong>: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.</p>
<h3 id="data">Data</h3>
<p>The review catalogs 28 representative molecular datasets used by the surveyed foundation models:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Descriptor</th>
          <th>Primary Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>~118M</td>
          <td>SMILES, 3D, Image, IUPAC</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ZINC15</td>
          <td>~980M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>~2.4M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>133,884</td>
          <td>SMILES</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450,000</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>950,000</td>
          <td>SMILES</td>
          <td>Reaction prediction</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>4M</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Z-dot-max/MRL_Foundation_Review">Review Materials (GitHub)</a></td>
          <td>Code/Data</td>
          <td>Not specified</td>
          <td>Code and data tables for figures</td>
      </tr>
      <tr>
          <td><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12784970/">Paper (PMC)</a></td>
          <td>Paper</td>
          <td>CC-BY</td>
          <td>Open access via PubMed Central</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model&rsquo;s original setup. The review covers:</p>
<ul>
<li>ROC-AUC for classification tasks (property prediction, DDI, synergy)</li>
<li>RMSE/MAE for regression tasks</li>
<li>Validity and novelty for molecular generation</li>
<li>Top-k accuracy for retrosynthesis</li>
<li>COV and MAT for conformation generation</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., &amp; Liu, Y. (2025). A systematic review of molecular representation learning foundation models. <em>Briefings in Bioinformatics</em>, 27(1), bbaf703. <a href="https://doi.org/10.1093/bib/bbaf703">https://doi.org/10.1093/bib/bbaf703</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{song2025systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of molecular representation learning foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbaf703}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbaf703}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Back Translation for Semi-Supervised Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/back-translation-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/back-translation-molecule-generation/</guid><description>A semi-supervised method adapting NLP back translation to molecule generation, improving property optimization and retrosynthesis with unlabeled ZINC data.</description><content:encoded><![CDATA[<h2 id="semi-supervised-data-augmentation-for-molecular-tasks">Semi-Supervised Data Augmentation for Molecular Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces back translation, a semi-supervised technique from neural machine translation, to the domain of molecular generation. The primary contribution is a general-purpose data augmentation strategy that leverages large pools of unlabeled molecules (from databases like ZINC) to improve the performance of both sequence-based and graph-based models on molecule optimization and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> prediction tasks.</p>
<h2 id="bridging-the-labeled-data-gap-in-molecular-generation">Bridging the Labeled Data Gap in Molecular Generation</h2>
<p>Molecular generation tasks, such as property optimization and retrosynthesis, require paired training data: an input molecule (or property specification) mapped to a desired output molecule. Obtaining these labeled pairs is expensive and labor-intensive. Meanwhile, enormous databases of unlabeled molecules exist. ZINC alone contains over 750 million compounds, and PubChem has 109 million.</p>
<p>Prior approaches to using unlabeled molecular data include <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">variational autoencoders (VAEs)</a> for learning latent representations, conditional recurrent neural networks for inverse design, and pretraining techniques borrowed from NLP. However, these methods either focus on representation learning rather than direct generation, or require task-specific architectural modifications. The authors identify back translation, a well-established technique in machine translation, as a natural fit for molecular generation tasks that can be treated as sequence-to-sequence mappings.</p>
<h2 id="back-translation-as-molecular-data-augmentation">Back Translation as Molecular Data Augmentation</h2>
<p>The core idea is straightforward. Given a main task that maps from source domain $\mathcal{X}$ to target domain $\mathcal{Y}$ (e.g., mapping low-QED molecules to high-QED molecules), the method trains a reverse model $g$ that maps from $\mathcal{Y}$ back to $\mathcal{X}$. This reverse model then &ldquo;back translates&rdquo; unlabeled molecules from $\mathcal{Y}$ to generate synthetic source molecules, creating pseudo-labeled training pairs.</p>
<p>The theoretical motivation comes from maximizing the reconstruction probability. Given an unlabeled molecule $y_u \in \mathcal{U}_y$, the logarithmic reconstruction probability through the reverse model $g$ and forward model $f$ is:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) = \log \sum_{\hat{x}_u \in \mathcal{X}} P(\hat{x}_u \mid y_u; g) P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>Since summing over the exponentially large space $\mathcal{X}$ is intractable, the authors apply Jensen&rsquo;s inequality to obtain a lower bound:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) \geq \mathbb{E}_{\hat{x}_u \sim P(\cdot \mid y_u; g)} \log P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>This lower bound is optimized via Monte Carlo sampling in three steps:</p>
<p><strong>Step 1</strong>: Train both forward model $f$ and reverse model $g$ on the labeled data $\mathcal{L}$:</p>
<p>$$
\begin{aligned}
\min_{\theta_f} \sum_{(x,y) \in \mathcal{L}} -\log P(y \mid x; \theta_f) \\
\min_{\theta_g} \sum_{(x,y) \in \mathcal{L}} -\log P(x \mid y; \theta_g)
\end{aligned}
$$</p>
<p><strong>Step 2</strong>: Use the trained reverse model $g$ to back translate each unlabeled molecule $y_u \in \mathcal{U}_y$, producing synthetic pairs:</p>
<p>$$
\hat{\mathcal{L}} = {(\hat{x}_u, y_u) \mid y_u \in \mathcal{U}_y, \hat{x}_u \text{ sampled from } P(\cdot \mid y_u; \theta_g)}
$$</p>
<p><strong>Step 3</strong>: Retrain the forward model $f$ on the combined labeled and synthetic data $\mathcal{L} \cup \hat{\mathcal{L}}$, warm-starting from the parameters obtained in Step 1:</p>
<p>$$
\min_{\theta_f^<em>} \sum_{(x,y) \in \mathcal{L} \cup \hat{\mathcal{L}}} -\log P(y \mid x; \theta_f^</em>)
$$</p>
<p>A key practical finding is that data filtration matters. When using large amounts of unlabeled data (1M molecules), keeping only the synthetic pairs that satisfy the same constraints as the labeled data (e.g., similarity thresholds and property ranges) significantly improves performance over using all back-translated data unfiltered.</p>
<h2 id="experiments-on-property-optimization-and-retrosynthesis">Experiments on Property Optimization and Retrosynthesis</h2>
<h3 id="molecular-property-improvement">Molecular Property Improvement</h3>
<p>The authors evaluate on four tasks from Jin et al. (2019, 2020), each requiring the model to improve a specific molecular property while maintaining structural similarity (measured by Dice similarity on Morgan fingerprints):</p>
<ul>
<li><strong>LogP</strong> (penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">partition coefficient</a>): two settings with similarity thresholds $\delta \geq 0.4$ and $\delta \geq 0.6$</li>
<li><strong>QED</strong> (quantitative estimation of drug-likeness): translate molecules from QED range [0.7, 0.8] to [0.9, 1.0]</li>
<li><strong>DRD2</strong> (<a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine type 2 receptor</a> activity): translate inactive ($P &lt; 0.5$) to active ($P \geq 0.5$)</li>
</ul>
<p>Two backbone architectures are tested: a Transformer (6 layers, 4 heads, 128-dim embeddings, 512-dim FFN) and HierG2G, a hierarchical graph-to-graph translation model. Unlabeled molecules are sampled from ZINC at 250K and 1M scales.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP ($\delta \geq 0.6$)</th>
          <th>LogP ($\delta \geq 0.4$)</th>
          <th>QED (%)</th>
          <th>DRD2 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.28</td>
          <td>1.03</td>
          <td>8.8</td>
          <td>3.4</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>0.79</td>
          <td>2.49</td>
          <td>9.4</td>
          <td>4.4</td>
      </tr>
      <tr>
          <td>JTNN</td>
          <td>2.33</td>
          <td>3.55</td>
          <td>59.9</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Transformer baseline</td>
          <td>2.45</td>
          <td>3.69</td>
          <td>71.9</td>
          <td>60.2</td>
      </tr>
      <tr>
          <td>+BT (1M, filtered)</td>
          <td>2.86</td>
          <td>4.41</td>
          <td>82.9</td>
          <td>67.4</td>
      </tr>
      <tr>
          <td>HierG2G baseline</td>
          <td>2.49</td>
          <td>3.98</td>
          <td>76.9</td>
          <td>85.9</td>
      </tr>
      <tr>
          <td>+BT (250K, filtered)</td>
          <td>2.75</td>
          <td>4.24</td>
          <td>79.1</td>
          <td>87.3</td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis-prediction">Retrosynthesis Prediction</h3>
<p>On the USPTO-50K benchmark (50K reactions, 10 reaction types, 80/10/10 train/val/test split), the method is applied to Transformer and GLN (Graph Logic Network) backbones. For other approaches to this benchmark, see <a href="/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a> and <a href="/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">Data Transfer for Retrosynthesis</a>. Unlabeled reactant sets are constructed by sampling molecules from ZINC and concatenating them following the training data&rsquo;s reactant count distribution ($N_1 : N_2 : N_3 = 29.3% : 70.4% : 0.3%$).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Top-1</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Reaction type given</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>64.2</td>
          <td>79.1</td>
          <td>85.2</td>
          <td>90.0</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>67.9</td>
          <td>82.5</td>
          <td>87.3</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>52.2</td>
          <td>68.2</td>
          <td>72.7</td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>55.9</td>
          <td>72.8</td>
          <td>77.8</td>
          <td>79.7</td>
      </tr>
      <tr>
          <td><strong>Reaction type unknown</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>54.7</td>
          <td>70.2</td>
          <td>77.0</td>
          <td>84.4</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>37.9</td>
          <td>57.3</td>
          <td>62.7</td>
          <td>68.1</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>43.5</td>
          <td>58.8</td>
          <td>64.6</td>
          <td>69.7</td>
      </tr>
  </tbody>
</table>
<p>The improvements are largest at lower $k$ values (top-1 and top-3), suggesting that back translation helps the model make more precise high-confidence predictions.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Effect of unlabeled data size</strong>: On retrosynthesis with Transformer, performance improves as unlabeled data increases from 50K to 250K, then plateaus or declines beyond 250K. The authors attribute this to noise in the back-translated data outweighing the benefits at larger scales.</p>
<p><strong>Effect of labeled data size</strong>: With only 5K labeled samples, adding back-translated data hurts performance because the reverse model is too weak to generate useful synthetic data. As labeled data increases (10K, 25K, 50K), the benefit of back translation grows. This confirms that the method requires a reasonably well-trained reverse model to be effective.</p>
<p><strong>Data filtration</strong>: Using 1M unfiltered back-translated molecules sometimes hurts performance (e.g., QED drops from 71.9% to 75.1% vs. 82.9% with filtering), while filtering to enforce the same constraints as the labeled data recovers and exceeds the 250K filtered results.</p>
<h2 id="consistent-gains-across-architectures-and-tasks">Consistent Gains Across Architectures and Tasks</h2>
<p>The method achieves state-of-the-art results on all four molecular property improvement tasks and the USPTO-50K retrosynthesis benchmark at time of publication. Several observations stand out:</p>
<ol>
<li><strong>Architecture agnosticism</strong>: Back translation improves both sequence-based (Transformer) and graph-based (HierG2G, GLN) models, confirming that the approach is independent of the underlying architecture.</li>
<li><strong>Filtration is essential at scale</strong>: Unfiltered 1M back-translated data can degrade performance, but filtered data at the same scale consistently outperforms smaller unfiltered sets.</li>
<li><strong>Training overhead is moderate</strong>: On the DRD2 task, back translation with Transformer takes about 2.5x the supervised training time (11.0h vs. 8.5h for initial training), with the back-translation step itself taking under 1 hour.</li>
<li><strong>Diversity and novelty increase</strong>: Back translation improves both diversity (average pairwise distance among generated molecules) and novelty (fraction of generated molecules not seen in training) across QED and DRD2 tasks.</li>
</ol>
<p>The authors acknowledge limitations: the method does not form a closed loop between forward and reverse models (as in dual learning approaches), and the data filtration strategy is rule-based rather than learned. They suggest joint training of forward and reverse models and learned filtration as future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (property improvement)</td>
          <td>Jin et al. (2019, 2020) datasets</td>
          <td>34K-99K pairs</td>
          <td>LogP, QED, DRD2 tasks</td>
      </tr>
      <tr>
          <td>Training (retrosynthesis)</td>
          <td>USPTO-50K</td>
          <td>40K reactions</td>
          <td>80/10/10 split from Dai et al. (2019)</td>
      </tr>
      <tr>
          <td>Unlabeled molecules</td>
          <td>ZINC</td>
          <td>250K or 1M</td>
          <td>Randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Same as training</td>
          <td>800-1000 test samples</td>
          <td>Per-task test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Back translation with optional data filtration</li>
<li>Beam search with $k=20$ for inference</li>
<li>Random sampling for back-translation step (Equation 5)</li>
<li>Dice similarity on Morgan fingerprints for similarity constraint</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Transformer</strong>: 6 layers, 4 attention heads, 128-dim embeddings, 512-dim FFN (for property improvement); 4 layers, 8 heads, 256-dim embeddings, 2048-dim FFN (for retrosynthesis)</li>
<li><strong>HierG2G</strong>: Settings from Jin et al. (2020)</li>
<li><strong>GLN</strong>: Settings from Dai et al. (2019)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.6$)</td>
          <td>2.86</td>
          <td>2.49 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.4$)</td>
          <td>4.41</td>
          <td>3.98 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>QED</td>
          <td>82.9%</td>
          <td>76.9% (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>DRD2</td>
          <td>87.3%</td>
          <td>85.9% (HierG2G)</td>
          <td>HierG2G + BT(250K, filtered)</td>
      </tr>
      <tr>
          <td>Top-1 accuracy</td>
          <td>USPTO-50K (known type)</td>
          <td>67.9%</td>
          <td>64.2% (GLN)</td>
          <td>Ours + GLN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper reports training times (8.5h for Transformer, 16.8h for HierG2G on DRD2 with 1M unlabeled data) but does not specify the GPU hardware used.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fyabc/BT4MolGen">BT4MolGen</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation in Python</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, Y., Xia, Y., Zhu, J., Wu, L., Xie, S., &amp; Qin, T. (2021). Back translation for molecule generation. <em>Bioinformatics</em>, 38(5), 1244-1251. <a href="https://doi.org/10.1093/bioinformatics/btab817">https://doi.org/10.1093/bioinformatics/btab817</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fan2022back,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Back translation for molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fan, Yang and Xia, Yingce and Zhu, Jinhua and Wu, Lijun and Xie, Shufang and Qin, Tao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1244--1251}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bioinformatics/btab817}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AMORE: Testing ChemLLM Robustness to SMILES Variants</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/</guid><description>AMORE is a zero-shot framework testing whether chemical language models recognize equivalent SMILES of the same molecule via embedding retrieval.</description><content:encoded><![CDATA[<h2 id="an-empirical-framework-for-probing-chemical-understanding">An Empirical Framework for Probing Chemical Understanding</h2>
<p>This is an <strong>Empirical</strong> paper that introduces Augmented Molecular Retrieval (AMORE), a zero-shot evaluation framework for chemical language models (ChemLMs). The primary contribution is a method to assess whether ChemLMs have learned genuine molecular semantics or simply memorize textual patterns. Rather than relying on traditional NLP metrics like BLEU and ROUGE, AMORE tests whether a model&rsquo;s embedding space treats chemically equivalent <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations as similar. The authors evaluate 12 models across multiple architectures (encoder-only, encoder-decoder, decoder-only) on two datasets and five augmentation types, and extend the analysis to downstream MoleculeNet tasks.</p>
<h2 id="why-standard-nlp-metrics-fail-for-chemical-evaluation">Why Standard NLP Metrics Fail for Chemical Evaluation</h2>
<p>Chemical language models are typically evaluated using text-based metrics from NLP (BLEU, ROUGE, METEOR) on tasks like molecule captioning. These metrics compare word overlap and sentence fluency but cannot detect whether a model truly understands molecular structure. A SMILES string like <code>C(=O)O</code> and its canonicalized or kekulized form represent the same molecule, yet text-based metrics would penalize valid reformulations. Embedding-based metrics like BERTScore are also insufficient because they were trained on general text, not chemical notation.</p>
<p>The core research question is direct: do evaluation metrics used on ChemLMs reflect actual chemical knowledge, or do the models simply imitate understanding by learning textual features? This question has practical consequences in pharmaceuticals and healthcare, where missteps in chemical reasoning carry serious risks.</p>
<h2 id="embedding-based-retrieval-as-a-chemical-litmus-test">Embedding-Based Retrieval as a Chemical Litmus Test</h2>
<p>AMORE exploits a fundamental property of molecular representations: a single molecule can be written as multiple valid SMILES strings that are chemically identical. These serve as &ldquo;total synonyms,&rdquo; a concept without a true analogue in natural language.</p>
<p>The framework works in four steps:</p>
<ol>
<li>Take a set $X = (x_1, x_2, \ldots, x_n)$ of $n$ molecular representations.</li>
<li>Apply a transformation $f$ to obtain augmented representations $X&rsquo; = (x&rsquo;_1, x&rsquo;_2, \ldots, x&rsquo;_n)$, where $x&rsquo;_i = f(x_i)$. The constraint is that $f$ must not change the underlying molecule.</li>
<li>Obtain vectorized embeddings $e(x_i)$ and $e(x&rsquo;_j)$ from the model for each original and augmented SMILES.</li>
<li>Evaluate in a retrieval task: given $e(x_i)$, retrieve $e(x&rsquo;_i)$ from the augmented set.</li>
</ol>
<p>The evaluation metrics are top-$k$ accuracy (whether the correct augmented SMILES ranks at position $\leq k$) and <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">Mean Reciprocal Rank</a> (MRR). Retrieval uses <a href="https://en.wikipedia.org/wiki/FAISS">FAISS</a> for efficient nearest-neighbor search. The key insight is that if a model truly understands molecular structure, it should embed different SMILES representations of the same molecule close together.</p>
<h3 id="five-smiles-augmentation-types">Five SMILES Augmentation Types</h3>
<p>The framework uses five identity-preserving augmentations, all executed through <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>:</p>
<ol>
<li><strong>Canonicalization</strong>: Transform SMILES to the standardized RDKit canonical form.</li>
<li><strong>Hydrogen addition</strong>: Explicitly add hydrogen atoms that are normally implied (e.g., <code>C</code> becomes <code>[CH4]</code>). This dramatically increases string length.</li>
<li><strong>Kekulization</strong>: Convert aromatic ring notation to explicit alternating double bonds.</li>
<li><strong>Cycle renumbering</strong>: Replace ring-closure digit identifiers with random valid alternatives.</li>
<li><strong>Random atom order</strong>: Randomize the atom traversal order used to generate the SMILES string.</li>
</ol>
<h2 id="twelve-models-two-datasets-five-augmentations">Twelve Models, Two Datasets, Five Augmentations</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>The authors test 12 publicly available Transformer-based models spanning three architecture families:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Domain</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text+Chem T5-standard</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>Text+Chem T5-augm</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-large</td>
          <td>Cross-modal</td>
          <td>770M</td>
      </tr>
      <tr>
          <td>SciFive</td>
          <td>Text-only</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>PubChemDeBERTa</td>
          <td>Chemical</td>
          <td>86M</td>
      </tr>
      <tr>
          <td>ChemBERT-ChEMBL</td>
          <td>Chemical</td>
          <td>6M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>Chemical</td>
          <td>125M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a></td>
          <td>Chemical</td>
          <td>400M</td>
      </tr>
      <tr>
          <td>ZINC-RoBERTa</td>
          <td>Chemical</td>
          <td>102M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/">nach0</a></td>
          <td>Chemical</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>ZINC-GPT</td>
          <td>Chemical</td>
          <td>87M</td>
      </tr>
  </tbody>
</table>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>ChEBI-20 test set</strong>: ~3,300 molecule-description pairs, used for both AMORE retrieval and molecule captioning comparisons.</li>
<li><strong>Isomers</strong> (<a href="/notes/chemistry/datasets/qm9/">QM9</a> subset): 918 molecules that are all isomers of C9H12N2O, making retrieval harder because all molecules share the same molecular formula.</li>
</ul>
<h3 id="key-results-on-chebi-20">Key Results on ChEBI-20</h3>
<p>On the ChEBI-20 dataset (Table 2 from the paper), top-1 accuracy varies enormously by augmentation type. Cycle renumbering is easiest (up to 98.48% Acc@1 for SciFive), while hydrogen addition is hardest (no model exceeds 5.97% Acc@1).</p>
<p>For the cross-modal Text+Chem T5-standard model:</p>
<table>
  <thead>
      <tr>
          <th>Augmentation</th>
          <th>Acc@1</th>
          <th>Acc@5</th>
          <th>MRR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>63.03</td>
          <td>82.76</td>
          <td>72.4</td>
      </tr>
      <tr>
          <td>Hydrogen</td>
          <td>5.46</td>
          <td>10.85</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>Kekulization</td>
          <td>76.76</td>
          <td>92.03</td>
          <td>83.8</td>
      </tr>
      <tr>
          <td>Cycle</td>
          <td>96.70</td>
          <td>99.82</td>
          <td>98.2</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>46.94</td>
          <td>74.18</td>
          <td>59.33</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-on-isomers">Key Results on Isomers</h3>
<p>Performance drops substantially on the Isomers dataset, where all molecules share the same formula. The best Acc@1 for hydrogen augmentation is just 1.53% (MolT5-large). Even for the relatively easy cycle augmentation, top scores drop from the high 90s to the low 90s for most models, and some models (BARTSmiles: 41.83%) struggle considerably.</p>
<h3 id="downstream-moleculenet-impact">Downstream MoleculeNet Impact</h3>
<p>The authors also fine-tuned models on original <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> training data and tested on augmented test sets across 9 tasks (regression, binary classification, multilabel classification). Results confirm that augmentations degrade downstream performance. For example, on ESOL regression, RMSE increased from 0.87 to 7.93 with hydrogen addition. Rankings computed using the Vote&rsquo;n&rsquo;Rank framework (using the <a href="https://en.wikipedia.org/wiki/Copeland%27s_method">Copeland rule</a>) show that hydrogen augmentation is the only one that substantially reshuffles model rankings; other augmentations preserve the original ordering.</p>
<h3 id="correlation-between-amore-and-captioning-metrics">Correlation Between AMORE and Captioning Metrics</h3>
<p>The differences in ROUGE/METEOR between original and augmented SMILES correlate with AMORE retrieval accuracy (Spearman correlation &gt; 0.7 with p-value = 0.003 for Acc@1). This validates AMORE as a proxy for predicting how augmentations will affect generation quality, without requiring labeled captioning data.</p>
<h2 id="current-chemlms-learn-syntax-not-chemistry">Current ChemLMs Learn Syntax, Not Chemistry</h2>
<p>The central finding is that existing ChemLMs are not robust to identity-preserving SMILES augmentations. Several specific conclusions emerge:</p>
<ol>
<li>
<p><strong>Hydrogen augmentation is catastrophic</strong>: All models fail (&lt; 6% Acc@1 on ChEBI-20, &lt; 2% on Isomers). The authors attribute this to the near-complete absence of explicit hydrogen in pretraining data, creating a distribution shift.</p>
</li>
<li>
<p><strong>Cross-modal models outperform unimodal ones</strong>: Models trained on both text and SMILES (Text+Chem T5, MolT5) consistently achieve higher retrieval accuracy on four of five augmentations.</p>
</li>
<li>
<p><strong>Augmentation difficulty follows a consistent order</strong>: For all models, hydrogen is hardest, followed by canonicalization, random ordering, kekulization, and cycle renumbering (easiest).</p>
</li>
<li>
<p><strong>Layer-wise analysis reveals instability</strong>: Retrieval accuracy across Transformer layers is correlated across augmentation types, suggesting that representations degrade at the same layers regardless of augmentation.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> partially explains difficulty</strong>: Hydrogen augmentation produces strings ~2x longer than originals (Levenshtein ratio of 1.49), but the low correlation between Levenshtein ratio and downstream metrics (ROUGE1 correlation of -0.05 for hydrogen) suggests string length alone does not explain the failure.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations. Only publicly available HuggingFace models were evaluated, excluding models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">Molformer</a> that lack HF checkpoints. The study focuses exclusively on SMILES sequences, not 3D molecular structures or other formats like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. The augmentation types, while representative, do not cover all possible identity transformations.</p>
<p>The authors suggest that AMORE could serve as a regularization tool during training, for example by using metric learning to encourage models to embed SMILES variants of the same molecule close together.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Retrieval evaluation</td>
          <td>ChEBI-20 test set</td>
          <td>3,300 molecules</td>
          <td>Standard benchmark for molecule captioning</td>
      </tr>
      <tr>
          <td>Retrieval evaluation</td>
          <td>Isomers (QM9 subset)</td>
          <td>918 molecules</td>
          <td>All isomers of C9H12N2O</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>Varies</td>
          <td>ESOL, Lipophilicity, FreeSolv, HIV, BBBP, BACE, Tox21, ToxCast, SIDER</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES augmentations via RDKit (canonicalization, hydrogen addition, kekulization, cycle renumbering, random atom ordering)</li>
<li>Nearest-neighbor retrieval using FAISS with L2, cosine, inner product, and HNSW metrics</li>
<li>Model ranking via Vote&rsquo;n&rsquo;Rank (Copeland rule) on MoleculeNet tasks</li>
</ul>
<h3 id="models">Models</h3>
<p>All 12 evaluated models are publicly available on HuggingFace. No custom model training was performed for the AMORE retrieval experiments. MoleculeNet experiments used standard fine-tuning on original training splits.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Acc@1</td>
          <td>Top-1 retrieval accuracy</td>
          <td>Primary AMORE metric</td>
      </tr>
      <tr>
          <td>Acc@5</td>
          <td>Top-5 retrieval accuracy</td>
          <td>Secondary AMORE metric</td>
      </tr>
      <tr>
          <td>MRR</td>
          <td>Mean Reciprocal Rank</td>
          <td>Average rank of correct match</td>
      </tr>
      <tr>
          <td>ROUGE-2</td>
          <td>Bigram overlap for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
      <tr>
          <td>METEOR</td>
          <td>MT evaluation metric for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Computational resources from HPC facilities at HSE University. Specific GPU types and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ChemistryLLMs/AMORE">AMORE GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Framework code and evaluation data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ganeeva, V., Khrabrov, K., Kadurin, A., &amp; Tutubalina, E. (2025). Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework. <em>Journal of Cheminformatics</em>, 17(1). <a href="https://doi.org/10.1186/s13321-025-01079-0">https://doi.org/10.1186/s13321-025-01079-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ganeeva2025measuring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ganeeva, Veronika and Khrabrov, Kuzma and Kadurin, Artur and Tutubalina, Elena}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-025-01079-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tied Two-Way Transformers for Diverse Retrosynthesis</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/</guid><description>Tied two-way transformers with cycle consistency and multinomial latent variables improve retrosynthetic prediction validity, plausibility, and diversity.</description><content:encoded><![CDATA[<h2 id="bridging-forward-and-backward-reaction-prediction">Bridging Forward and Backward Reaction Prediction</h2>
<p>This is a <strong>Method</strong> paper that addresses three key limitations of template-free <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> models: invalid <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> outputs, chemically implausible predictions, and lack of diversity in reactant candidates. The solution combines three techniques: (1) cycle consistency checks using a paired forward reaction transformer, (2) parameter tying between the forward and backward transformers, and (3) multinomial latent variables with a learned prior to capture multiple reaction pathways.</p>
<h2 id="three-problems-in-template-free-retrosynthesis">Three Problems in Template-Free Retrosynthesis</h2>
<p>Template-free retrosynthesis models cast retrosynthesis as a <a href="/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">sequence-to-sequence</a> translation problem (product SMILES to reactant SMILES). While these models avoid the cost of hand-coded reaction templates, they suffer from:</p>
<ol>
<li><strong>Invalid SMILES</strong>: predicted reactant strings that contain grammatical errors and cannot be parsed into molecules</li>
<li><strong>Implausibility</strong>: predicted reactants that are valid molecules but cannot actually synthesize the target product</li>
<li><strong>Lack of diversity</strong>: beam search produces duplicate or near-duplicate candidates, reducing the number of useful suggestions</li>
</ol>
<p>Prior work addressed these individually (SCROP adds a syntax corrector for validity, Chen et al. use latent variables for diversity), but this paper tackles all three simultaneously.</p>
<h2 id="model-architecture">Model Architecture</h2>
<h3 id="tied-two-way-transformers">Tied Two-Way Transformers</h3>
<p>The model pairs a retrosynthesis transformer $p(y|z, x)$ (product to reactants) with a forward reaction transformer $p(\tilde{x}|z, y)$ (reactants to product). Both use the standard encoder-decoder transformer architecture with 6 layers, 8 attention heads, and 256-dimensional embeddings.</p>
<p>The key architectural innovation is aggressive parameter tying: the two transformers share the entire encoder and all decoder parameters except layer normalization. This means the two-transformer system has approximately the same parameter count as a single transformer (17.5M vs. 17.4M). The shared parameters force the model to learn bidirectional reaction patterns from both forward and backward training data simultaneously, improving grammar learning and reducing invalid outputs.</p>
<h3 id="multinomial-latent-variables">Multinomial Latent Variables</h3>
<p>A discrete latent variable $z \in \{1, \ldots, K\}$ is introduced to capture multiple reaction modes. Each latent value conditions a different decoding path, encouraging diverse reactant predictions. The decoder initializes with a latent-class-specific start token (e.g., &ldquo;&lt;CLS2&gt;&rdquo;) and then decodes autoregressively.</p>
<p>The prior $p(z|x)$ is a learned multinomial distribution parametrized by a two-layer feed-forward network with tanh activation, taking the mean-pooled encoder output as input. This learned prior outperforms the uniform prior used by Chen et al., producing a smaller trade-off between top-1 and top-10 accuracy as $K$ increases.</p>
<h3 id="training-with-hard-em">Training with Hard EM</h3>
<p>Since the latent variable $z$ is unobserved during training, the model is trained with the online <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">hard-EM algorithm</a>. The loss function is:</p>
<p>$$\mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim \text{data}} \left[ \min_{z} \mathcal{L}_h(x, y, z; \theta) \right]$$</p>
<p>where $\mathcal{L}_h = -(\log p(z|x) + \log p(y|z,x) + \log p(\tilde{x}=x|z,y))$. The E-step selects the best $z$ for each training pair (with dropout disabled), and the M-step updates parameters given the complete data.</p>
<h3 id="inference-with-cycle-consistency-reranking">Inference with Cycle Consistency Reranking</h3>
<p>At inference, the model: (1) generates $K$ sets of beam search hypotheses from the retrosynthesis transformer (one per latent value), (2) scores each candidate with the forward reaction transformer for cycle consistency $p(\tilde{x}=x|z,y)$, and (3) reranks candidates by the full likelihood $p(z|x) \cdot p(y|z,x) \cdot p(\tilde{x}=x|z,y)$. This pushes chemically plausible predictions to higher ranks.</p>
<h2 id="results-on-uspto-50k">Results on USPTO-50K</h2>
<p>All results are averaged over 5 random seeds with beam size 10.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-5 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Top-1 Invalid</th>
          <th>Top-10 Invalid</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Liu-LSTM</td>
          <td>37.4%</td>
          <td>57.0%</td>
          <td>61.7%</td>
          <td>12.2%</td>
          <td>22.0%</td>
      </tr>
      <tr>
          <td>SCROP</td>
          <td>43.7%</td>
          <td>65.2%</td>
          <td>68.7%</td>
          <td>0.7%</td>
          <td>2.3%</td>
      </tr>
      <tr>
          <td>Lin-TF</td>
          <td>42.0%</td>
          <td>71.3%</td>
          <td>77.6%</td>
          <td>2.2%</td>
          <td>7.8%</td>
      </tr>
      <tr>
          <td>Base transformer</td>
          <td>44.3%</td>
          <td>68.4%</td>
          <td>72.7%</td>
          <td>1.7%</td>
          <td>12.1%</td>
      </tr>
      <tr>
          <td>Proposed ($K$=5)</td>
          <td>46.8%</td>
          <td>73.5%</td>
          <td>78.5%</td>
          <td>0.1%</td>
          <td>2.6%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model achieves a +3.1% top-1 accuracy improvement over the best previous template-free method and reduces top-1 invalid rate to 0.1%.</p>
<h3 id="ablation-analysis">Ablation Analysis</h3>
<p>The ablation study isolates the contribution of each component:</p>
<ul>
<li><strong>Base+CC</strong> (cycle consistency only): reranks candidates to improve top-1/3/5 accuracy and validity, but top-10 stays the same since the candidate set is unchanged. Parameter count doubles (34.8M).</li>
<li><strong>Base+PT</strong> (parameter tying only): improves accuracy and validity at all top-$k$ levels with negligible parameter increase. Parameter tying during training improves the retrosynthesis transformer itself, even without cycle consistency at inference.</li>
<li><strong>Proposed ($K$=1)</strong>: combines tying with cycle consistency reranking.</li>
<li><strong>Proposed ($K$=5)</strong>: adds latent diversity, further improving top-10 accuracy (+2.2%) and reducing top-10 invalid rate (from 10.2% to 2.6%).</li>
</ul>
<h3 id="diversity-unique-rate">Diversity: Unique Rate</h3>
<p>As $K$ increases from 1 to 5, the unique molecule rate among 10 predictions rises substantially, confirming that latent modeling produces more diverse candidates. The learned prior reduces the top-1/top-10 accuracy trade-off compared to Chen et al.&rsquo;s uniform prior.</p>
<h2 id="results-on-in-house-multi-pathway-dataset">Results on In-House Multi-Pathway Dataset</h2>
<p>The in-house dataset (162K reactions from <a href="https://en.wikipedia.org/wiki/Reaxys">Reaxys</a>) contains multiple ground-truth reactions per product, enabling direct evaluation of pathway diversity through coverage (proportion of ground-truth pathways correctly predicted in the top-10 candidates).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Top-1 Acc.</th>
          <th>Top-10 Acc.</th>
          <th>Unique Rate</th>
          <th>Coverage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>64.2%</td>
          <td>91.6%</td>
          <td>76.1%</td>
          <td>84.4%</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>66.0%</td>
          <td>92.8%</td>
          <td>93.2%</td>
          <td>87.3%</td>
      </tr>
  </tbody>
</table>
<p>The proposed model covers 87.3% of ground-truth reaction pathways on average, compared to 84.4% for the baseline. The unique rate jumps from 76.1% to 93.2%, confirming that the latent variables effectively encourage diverse predictions.</p>
<h2 id="limitations">Limitations</h2>
<p>The model uses SMILES string representation, which linearizes molecules and does not exploit the inherently rich chemical graph structure. Graph-based retrosynthesis models (e.g., GraphRetro at 63.8% top-1) substantially outperform template-free string-based models. The USPTO-50K dataset provides only one ground-truth pathway per product, making diversity evaluation limited on this benchmark. The in-house dataset is not publicly available. The model also does not predict reaction conditions (solvents, catalysts, temperature) or reagents.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ejklike/tied-twoway-transformer">ejklike/tied-twoway-transformer</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: USPTO-50K dataset (public, 50K reactions from USPTO patents). In-house dataset (162K reactions from Reaxys, not publicly available).</p>
<p><strong>Hardware</strong>: 4 NVIDIA Tesla M40 GPUs. Checkpoints saved every 5000 steps, last 5 averaged.</p>
<p><strong>Training</strong>: Adam optimizer ($\beta$ = 0.9, 0.98), initial learning rate 2 with 8000 warm-up steps, dropout 0.3, gradient accumulation over 4 batches. Label smoothing set to 0.</p>
<p><strong>Inference</strong>: Beam size 10, generating 10 candidates per product.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, E., Lee, D., Kwon, Y., Park, M. S., &amp; Choi, Y.-S. (2021). Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables. <em>Journal of Chemical Information and Modeling</em>, 61, 123-133.</p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ejklike/tied-twoway-transformer">GitHub: ejklike/tied-twoway-transformer</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kim2021valid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kim, Eunji and Lee, Dongseon and Kwon, Youngchun and Park, Min Sik and Choi, Youn-Suk}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{123--133}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.0c01074}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>UnCorrupt SMILES: Post Hoc Correction for De Novo Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</guid><description>A transformer-based SMILES corrector that fixes invalid outputs from molecular generators, recovering 60-95% of erroneous SMILES strings.</description><content:encoded><![CDATA[<h2 id="a-transformer-based-smiles-error-corrector">A Transformer-Based SMILES Error Corrector</h2>
<p>This is a <strong>Method</strong> paper that proposes a post hoc approach to fixing invalid SMILES produced by de novo molecular generators. Rather than trying to prevent invalid outputs through alternative representations (<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or constrained architectures (graph models), the authors train a transformer model to translate invalid SMILES into valid ones. The corrector is framed as a sequence-to-sequence translation task, drawing on techniques from grammatical error correction (GEC) in natural language processing.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based generative models produce some percentage of invalid outputs that cannot be converted to molecules. The invalidity rate varies substantially across model types:</p>
<ul>
<li><strong>RNN models</strong> (DrugEx): 5.7% invalid (pretrained) and 4.7% invalid (target-directed)</li>
<li><strong>GANs</strong> (ORGANIC): 9.5% invalid</li>
<li><strong>VAEs</strong> (GENTRL): 88.9% invalid</li>
</ul>
<p>These invalid outputs represent wasted computation and potentially introduce bias toward molecules that are easier to generate correctly. Previous approaches to this problem include using alternative representations (<a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or graph-based models, but these either limit the search space or increase computational cost. The authors propose a complementary strategy: fix the errors after generation.</p>
<h2 id="error-taxonomy-across-generator-types">Error Taxonomy Across Generator Types</h2>
<p>The paper classifies invalid SMILES errors into six categories based on RDKit error messages:</p>
<ol>
<li><strong>Syntax errors</strong>: malformed SMILES grammar</li>
<li><strong>Unclosed rings</strong>: unmatched ring closure digits</li>
<li><strong>Parentheses errors</strong>: unbalanced open/close parentheses</li>
<li><strong>Bond already exists</strong>: duplicate bonds between the same atoms</li>
<li><strong>Aromaticity errors</strong>: atoms incorrectly marked as aromatic or kekulization failures</li>
<li><strong>Valence errors</strong>: atoms exceeding their maximum bond count</li>
</ol>
<p>The distribution of error types differs across generators. RNN-based models primarily produce aromaticity errors, suggesting they learn SMILES grammar well but struggle with chemical validity. The GAN (ORGANIC) produces mostly valence errors. The VAE (GENTRL) produces more grammar-level errors (syntax, parentheses, unclosed rings), indicating that sampling from the continuous latent space often produces sequences that violate basic SMILES structure.</p>
<h2 id="architecture-and-training">Architecture and Training</h2>
<p>The SMILES corrector uses a standard encoder-decoder transformer architecture based on Vaswani et al., with learned positional encodings. Key specifications:</p>
<ul>
<li>Embedding dimension: 256</li>
<li>Encoder/decoder layers: 3 each</li>
<li>Attention heads: 8</li>
<li>Feed-forward dimension: 512</li>
<li>Dropout: 0.1</li>
<li>Optimizer: Adam (learning rate 0.0005)</li>
<li>Training: 20 epochs, batch size 16</li>
</ul>
<p>Since no dataset of manually corrected invalid-valid SMILES pairs exists, the authors create synthetic training data by introducing errors into valid SMILES from the Papyrus bioactivity dataset (approximately 1.3M pairs). Errors are introduced through random perturbations following SMILES syntax rules: character substitutions, bond order changes, fragment additions from the <a href="/notes/chemistry/datasets/gdb-11/">GDB</a>-8 database to atoms with full valence, and other structural modifications.</p>
<h2 id="training-with-multiple-errors-improves-correction">Training with Multiple Errors Improves Correction</h2>
<p>A key finding is that training the corrector on inputs with multiple errors per SMILES substantially improves performance on real generator outputs. The baseline model (1 error per input) fixes 35-80% of invalid outputs depending on the generator. Increasing errors per training input to 12 raises this to 62-95%:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>1 error/input</th>
          <th>12 errors/input</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN (DrugEx)</td>
          <td>~60% fixed</td>
          <td>62% fixed</td>
      </tr>
      <tr>
          <td>Target-directed RNN</td>
          <td>~60% fixed</td>
          <td>68% fixed</td>
      </tr>
      <tr>
          <td>GAN (ORGANIC)</td>
          <td>~80% fixed</td>
          <td>95% fixed</td>
      </tr>
      <tr>
          <td>VAE (GENTRL)</td>
          <td>~35% fixed</td>
          <td>80% fixed</td>
      </tr>
  </tbody>
</table>
<p>Training beyond 12 errors per input yields diminishing returns (80% average at 20 errors vs. 78% at 12). The improvement from multi-error training is consistent with GEC literature, where models learn to &ldquo;distrust&rdquo; inputs more when exposed to higher error rates.</p>
<p>The model also shows low overcorrection: only 14% of valid SMILES are altered during translation, comparable to overcorrection rates in spelling correction systems.</p>
<h2 id="fixed-molecules-are-comparable-to-generator-outputs">Fixed Molecules Are Comparable to Generator Outputs</h2>
<p>The corrected molecules are evaluated against both the training set and the readily generated (valid) molecules from each generator:</p>
<ul>
<li><strong>Uniqueness</strong>: 97% of corrected molecules are unique</li>
<li><strong>Novelty vs. generated</strong>: 97% of corrected molecules are novel compared to the valid generator outputs</li>
<li><strong>Similarity to nearest neighbor (SNN)</strong>: 0.45 between fixed and generated sets, indicating the corrected molecules explore different parts of chemical space</li>
<li><strong>Property distributions</strong>: KL divergence scores between fixed molecules and the training set are comparable to those between generated molecules and the training set</li>
</ul>
<p>This demonstrates that SMILES correction produces molecules that are as chemically reasonable as the generator&rsquo;s valid outputs while exploring complementary regions of chemical space.</p>
<h2 id="local-chemical-space-exploration-via-error-introduction">Local Chemical Space Exploration via Error Introduction</h2>
<p>Beyond fixing generator errors, the authors propose using the SMILES corrector for analog generation. The workflow is:</p>
<ol>
<li>Take a known active molecule</li>
<li>Introduce random errors into its SMILES (repeated 1000 times)</li>
<li>Correct the errors using the trained corrector</li>
</ol>
<p>This &ldquo;local sequence exploration&rdquo; generates novel analogs with 97% validity. The uniqueness (39%) and novelty (16-37%) are lower than for generator correction because the corrector often regenerates the original molecule. However, the approach produces molecules that are structurally similar to the starting compound (SNN of 0.85 to known ligands).</p>
<p>The authors demonstrate this on selective <a href="https://en.wikipedia.org/wiki/Aurora_kinase_B">Aurora kinase B</a> (AURKB) inhibitors. The generated analogs occupy the same binding site region as the co-crystallized ligand VX-680 in docking studies, with predicted bioactivities similar to known compounds. Compared to target-directed RNN generation, SMILES exploration produces molecules closer to known actives (higher SNN, scaffold similarity, and KL divergence scores).</p>
<h2 id="limitations">Limitations</h2>
<p>The corrector performance drops when applied to real generator outputs compared to synthetic test data, because the synthetic error distribution does not perfectly match the errors that generators actually produce. Generator-specific correctors trained on actual invalid outputs could improve performance. The local exploration approach has limited novelty since the corrector frequently regenerates the original molecule. The evaluation uses predicted rather than experimental bioactivities for the Aurora kinase case study.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">LindeSchoenmaker/SMILES-corrector</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training code, synthetic error generation, and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Synthetic training pairs derived from the Papyrus bioactivity dataset (v5.5). Approximately 1.3M invalid-valid pairs per error-count setting.</p>
<p><strong>Code</strong>: Transformer implemented in PyTorch, adapted from Ben Trevett&rsquo;s seq2seq tutorial. Generative model baselines use DrugEx, GENTRL, and ORGANIC.</p>
<p><strong>Evaluation</strong>: Validity assessed with RDKit. Similarity metrics (SNN, fragment, scaffold) and KL divergence computed following <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark protocols.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schoenmaker, L., Béquignon, O. J. M., Jespers, W., &amp; van Westen, G. J. P. (2023). UnCorrupt SMILES: a novel approach to de novo design. <em>Journal of Cheminformatics</em>, 15, 22.</p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">GitHub: LindeSchoenmaker/SMILES-corrector</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schoenmaker2023uncorrupt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{UnCorrupt SMILES: a novel approach to de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Schoenmaker, Linde and B{\&#39;e}quignon, Olivier J. M. and Jespers, Willem and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00696-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RetMol: Retrieval-Based Controllable Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/</guid><description>RetMol uses retrieval-augmented generation to steer a pre-trained molecular model toward desired properties using only a handful of exemplar molecules.</description><content:encoded><![CDATA[<h2 id="retrieval-augmented-generation-for-molecules">Retrieval-Augmented Generation for Molecules</h2>
<p>This is a <strong>Method</strong> paper that introduces RetMol, a retrieval-based framework for controllable molecule generation. The key idea is to guide a pre-trained generative model using a small set of exemplar molecules that partially satisfy the desired design criteria, retrieved from a task-specific database. The approach requires no task-specific fine-tuning of the generative backbone and works effectively with very few exemplar molecules (as few as 23).</p>
<h2 id="limitations-of-existing-controllable-generation">Limitations of Existing Controllable Generation</h2>
<p>Existing approaches to controllable molecule generation fall into three categories, each with drawbacks:</p>
<ol>
<li><strong>Reinforcement learning (RL)-based methods</strong> require task-specific fine-tuning of the generative model for each new objective</li>
<li><strong>Supervised learning (SL)-based methods</strong> need molecules with desired properties as training data, which may be scarce</li>
<li><strong>Latent optimization-based methods</strong> require training property predictors in the latent space, which is challenging with limited active molecules and incompatible with variable-length latent spaces like those in transformers</li>
</ol>
<p>RetMol addresses all three issues by keeping the generative backbone frozen and using a lightweight, task-agnostic retrieval module that can be applied to new tasks simply by swapping the retrieval database.</p>
<h2 id="the-retmol-framework">The RetMol Framework</h2>
<p>RetMol consists of four components built around a pre-trained encoder-decoder backbone (<a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>, a BART variant trained on ZINC):</p>
<h3 id="retrieval-database">Retrieval Database</h3>
<p>A task-specific collection of exemplar molecules that at least partially satisfy the design criteria. The database can be very small (e.g., 23 known inhibitors for the SARS-CoV-2 task) and is dynamically updated during inference with newly generated molecules.</p>
<h3 id="molecule-retriever">Molecule Retriever</h3>
<p>A heuristic-based module that selects the $K$ most relevant exemplar molecules (default $K = 10$). It first constructs a feasible set of molecules satisfying all constraints, then selects those with the best property scores. If too few molecules satisfy all constraints, it progressively relaxes constraints until enough candidates are available.</p>
<h3 id="information-fusion-via-cross-attention">Information Fusion via Cross-Attention</h3>
<p>The core trainable component. Retrieved exemplar embeddings are fused with the input molecule embedding using cross-attention:</p>
<p>$$\boldsymbol{e} = f_{\text{CA}}(\boldsymbol{e}_{\text{in}}, \boldsymbol{E}_r; \theta) = \text{Attn}(\text{Query}(\boldsymbol{e}_{\text{in}}), \text{Key}(\boldsymbol{E}_r)) \cdot \text{Value}(\boldsymbol{E}_r)$$</p>
<p>where $\boldsymbol{e}_{\text{in}} = \text{Enc}(x_{\text{in}}) \in \mathbb{R}^{L \times D}$ is the input embedding and $\boldsymbol{E}_r = [\boldsymbol{e}_r^1, \ldots, \boldsymbol{e}_r^K]$ are the retrieved exemplar embeddings. This module adds less than 5% parameter overhead (460K parameters over the 10M base model).</p>
<h3 id="self-supervised-training-nearest-neighbor-prediction">Self-Supervised Training: Nearest Neighbor Prediction</h3>
<p>Rather than reconstructing the input molecule (which would make the retrieval module unnecessary), RetMol trains the fusion module to predict the nearest neighbor of the input:</p>
<p>$$\mathcal{L}(\theta) = \sum_{i=1}^{B} \text{CE}\left(\text{Dec}\left(f_{\text{CA}}(\boldsymbol{e}_{\text{in}}^{(i)}, \boldsymbol{E}_r^{(i)}; \theta)\right), x_{\text{1NN}}^{(i)}\right)$$</p>
<p>The remaining $K - 1$ nearest neighbors serve as the retrieved exemplar molecules. This forces the fusion module to learn how to use exemplar molecules to transform the input toward a related target. Only the fusion module parameters are updated; the encoder and decoder remain frozen.</p>
<h2 id="iterative-refinement-at-inference">Iterative Refinement at Inference</h2>
<p>During inference, RetMol uses an iterative process:</p>
<ol>
<li>Encode the input molecule and retrieved exemplars</li>
<li>Fuse embeddings via cross-attention</li>
<li>Perturb the fused embedding $M$ times with Gaussian noise</li>
<li>Greedily decode $M$ candidate molecules</li>
<li>Replace the input with the best candidate if it improves upon the current score</li>
<li>Add remaining good candidates to the retrieval database</li>
<li>Repeat until convergence or a maximum number of iterations</li>
</ol>
<p>The dynamic update of the retrieval database is critical for extrapolating beyond the initial set of exemplar molecules.</p>
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>RetMol is evaluated on four tasks of increasing difficulty:</p>
<h3 id="qed-optimization-under-similarity-constraint">QED Optimization Under Similarity Constraint</h3>
<p>Goal: generate molecules with QED $\geq$ 0.9 while maintaining <a href="https://en.wikipedia.org/wiki/Tanimoto_coefficient">Tanimoto similarity</a> $\geq$ 0.4 to the input. RetMol achieves 94.5% success rate, compared to 92.8% for the previous best (QMO).</p>
<h3 id="penalized-logp-optimization">Penalized LogP Optimization</h3>
<p>Goal: maximize penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">LogP</a> while maintaining structural similarity. At $\delta = 0.4$, RetMol achieves 11.55 average improvement, compared to 7.71 for QMO.</p>
<h3 id="gsk3beta--jnk3-dual-inhibitor-design"><a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>$\beta$ + <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> Dual Inhibitor Design</h3>
<p>Goal: simultaneously satisfy four constraints (GSK3$\beta$ inhibition $\geq$ 0.5, JNK3 inhibition $\geq$ 0.5, QED $\geq$ 0.6, SA $\leq$ 4). Results:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Success %</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>47.9</td>
          <td>0.561</td>
          <td>0.621</td>
      </tr>
      <tr>
          <td>RationaleRL</td>
          <td>74.8</td>
          <td>0.568</td>
          <td>0.701</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>92.3</td>
          <td>0.824</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>MolEvol</td>
          <td>93.0</td>
          <td>0.757</td>
          <td>0.681</td>
      </tr>
      <tr>
          <td>RetMol</td>
          <td>96.9</td>
          <td>0.862</td>
          <td>0.732</td>
      </tr>
  </tbody>
</table>
<p>RetMol achieves this without task-specific fine-tuning and requires only 80 iterations compared to MARS&rsquo;s 550.</p>
<h3 id="sars-cov-2-main-protease-inhibitor-optimization"><a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 Main Protease</a> Inhibitor Optimization</h3>
<p>A real-world task using only 23 known inhibitors as the retrieval database and optimizing 8 weakly-binding drugs. Under the milder similarity constraint ($\delta = 0.4$), RetMol achieves 2.84 kcal/mol average binding affinity improvement versus 1.67 for Graph GA. Under the stricter constraint ($\delta = 0.6$), RetMol succeeds on 5/8 molecules versus 3/8 for Graph GA.</p>
<h2 id="key-analysis-findings">Key Analysis Findings</h2>
<ul>
<li><strong>Database size</strong>: Strong performance even with 100 molecules, already outperforming baselines on success rate</li>
<li><strong>Database quality</strong>: Molecules satisfying all four constraints give the best results (96.9%), but partial satisfaction still works reasonably (84.7% with two properties)</li>
<li><strong>Training objective</strong>: The nearest neighbor prediction objective outperforms conventional reconstruction on validity (0.902 vs. 0.834) and uniqueness (0.922 vs. 0.665)</li>
<li><strong>Dynamic database update</strong>: Essential for extrapolating beyond the initial retrieval database, generating molecules with property values exceeding the best in the original database</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>RetMol requires exemplar molecules that at least partially satisfy the design criteria. When such molecules are entirely unavailable, the framework cannot be applied. The method also relies on property predictors (for scoring and retrieval), whose accuracy directly affects generation quality. The iterative refinement process adds computational overhead at inference time, and the results depend on the Chemformer backbone&rsquo;s generation capabilities.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol</a></td>
          <td>Code</td>
          <td>NVIDIA Source Code License-NC</td>
          <td>Full training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol (checkpoints)</a></td>
          <td>Model</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Pre-trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k and ChEMBL datasets for training. Task-specific retrieval databases constructed from these datasets. COVID-19 task uses 23 known SARS-CoV-2 Mpro inhibitors.</p>
<p><strong>Training</strong>: Information fusion module trained on 4x V100 GPUs (16GB each) for approximately 2 hours. Batch size of 256 per GPU, 50K iterations.</p>
<p><strong>Inference</strong>: Single V100 GPU. Greedy decoding with Gaussian perturbation ($\sigma = 1$) for sampling multiple candidates per iteration.</p>
<p><strong>Backbone</strong>: Chemformer (BART variant) pre-trained on ZINC. Frozen during RetMol training and inference.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Z., Nie, W., Qiao, Z., Xiao, C., Baraniuk, R. G., &amp; Anandkumar, A. (2023). Retrieval-based Controllable Molecule Generation. <em>Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023)</em>.</p>
<p><strong>Publication</strong>: International Conference on Learning Representations (ICLR) 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/NVlabs/RetMol">GitHub: NVlabs/RetMol</a></li>
<li><a href="https://openreview.net/forum?id=vDFA1tpuLvk">OpenReview</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2023retrieval,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Retrieval-based Controllable Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Zichao and Nie, Weili and Qiao, Zhuoran and Xiao, Chaowei and Baraniuk, Richard G. and Anandkumar, Anima}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=vDFA1tpuLvk}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Regression Transformer: Prediction Meets Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</guid><description>The Regression Transformer unifies property prediction and conditional generation in one multitask model by casting regression as sequence modelling.</description><content:encoded><![CDATA[<h2 id="a-multitask-model-that-unifies-regression-and-generation">A Multitask Model That Unifies Regression and Generation</h2>
<p>The Regression Transformer (RT) is a <strong>Method</strong> paper. It introduces a single model architecture that can both predict continuous molecular properties and conditionally generate molecules with desired property values. The core idea is to reformulate regression as a sequence modelling task: instead of training a dedicated regression head, continuous property values are tokenized into sequences of digits and predicted alongside molecular tokens using a cross-entropy loss.</p>
<h2 id="closing-the-gap-between-predictors-and-generators">Closing the Gap Between Predictors and Generators</h2>
<p>Existing transformer-based approaches in computational chemistry develop property predictors and generative models as separate systems. Even when a single architecture like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) addresses both tasks, it does so through task-specific heads. This means the two capabilities remain disjoint, and the generative model cannot use its own property prediction ability during generation.</p>
<p>The RT addresses three specific gaps:</p>
<ol>
<li><strong>No true multitask entanglement</strong>: Prior work either tunes separate heads for prediction and generation or limits communication between modules to a reward signal.</li>
<li><strong>No inductive bias for continuous properties</strong>: Molecular generative models lack mechanisms to condition generation on floating-point property values.</li>
<li><strong>Disconnected workflows</strong>: Property predictors cannot generate molecules, and generators cannot assess whether their outputs satisfy property constraints.</li>
</ol>
<h2 id="core-innovation-regression-as-conditional-sequence-modelling">Core Innovation: Regression as Conditional Sequence Modelling</h2>
<p>The RT&rsquo;s key insight is that regression can be cast as sequential classification over digit tokens while preserving predictive accuracy. This is achieved through three components:</p>
<h3 id="numerical-tokenization">Numerical Tokenization</h3>
<p>Floating-point property values are split into individual digit tokens that preserve decimal order. Each token $t_{v,p}$ encodes a digit value $v \in [0, 9]$ and its decimal place $p \in \mathbb{Z}$. For example, the value 12.3 becomes the token sequence <code>[1_1, 2_0, 3_-1]</code>.</p>
<h3 id="numerical-encodings">Numerical Encodings</h3>
<p>To provide an inductive bias about the semantic proximity of digit tokens (which cross-entropy loss cannot convey), the RT introduces Numerical Encodings (NEs), analogous to positional encodings. For a token $t_{v,p}$ at embedding dimension $j$:</p>
<p>$$
\text{NE}_{\text{Float}}(v, p, j) = (-1)^j \cdot \frac{v \cdot 10^p}{j + 1}
$$</p>
<p>These encodings ensure that pairwise distances between digit tokens decay monotonically with their floating-point proximity. The model can also learn digit orderings from data alone, but NEs provide a useful inductive bias.</p>
<h3 id="alternating-training-with-self-consistency">Alternating Training with Self-Consistency</h3>
<p>The RT uses an <a href="https://en.wikipedia.org/wiki/XLNet">XLNet</a> backbone trained with permutation language modelling (PLM). The key is that the same model serves two roles depending on which tokens are masked:</p>
<ul>
<li><strong>Mask numerical tokens</strong>: the model performs property prediction (regression)</li>
<li><strong>Mask textual tokens</strong>: the model performs conditional sequence generation</li>
</ul>
<p>The base PLM objective is:</p>
<p>$$
\mathcal{L}_{\text{PLM}} = \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{i=c+1}^{T} \log p_\theta(x_{z_i} \mid \mathbf{x}_{\mathbf{z}_{&lt; i}}) \right]
$$</p>
<p>This is refined into two specialized objectives: a property prediction objective $\mathcal{L}_P$ that masks only numerical tokens, and a generation objective $\mathcal{L}_G$ that masks only textual tokens. Training alternates between these every 50 steps.</p>
<p>The self-consistency (SC) loss adds a critical feedback loop. After generating a candidate molecule $\hat{\mathbf{x}}$, the model re-evaluates it by predicting the property of the generated sequence:</p>
<p>$$
\mathcal{L}_{\text{SC}} = \mathcal{L}_G(\mathbf{x}) + \alpha \cdot \mathcal{L}_P(\hat{\mathbf{x}})
$$</p>
<p>This rewards generating molecules whose predicted properties match the primed property value, exploiting the RT&rsquo;s dual capability as both predictor and generator.</p>
<h2 id="experiments-across-molecules-proteins-and-reactions">Experiments Across Molecules, Proteins, and Reactions</h2>
<h3 id="drug-likeness-qed">Drug Likeness (QED)</h3>
<p>Initial validation on a synthetic QED dataset (~1.4M molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>) demonstrated that the RT can simultaneously learn to predict QED scores (RMSE &lt; 0.06) and generate novel molecules conditioned on desired QED values (Spearman&rsquo;s $\rho$ up to 0.517 between primers and generated molecule properties). Novelty exceeded 99% across all configurations. The alternating training scheme with SC loss outperformed both single-task models and the vanilla PLM objective.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations proved comparable to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> for property prediction and far superior for generation (~100% validity vs. ~40% for SMILES).</p>
<h3 id="moleculenet-regression-benchmarks">MoleculeNet Regression Benchmarks</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks ESOL, FreeSolv, and Lipophilicity, the RT outperformed XGBoost and MPNN baselines despite using only a classification loss. It performed on par with XLNet using a conventional regression head, and was only mildly inferior to models like BERT and BART that used large-scale self-supervised pre-training with regression losses.</p>
<p>Critically, only the RT could also conditionally generate molecules for these tasks. External validation with Grover (a self-supervised Graph Transformer) confirmed high correlation with the RT&rsquo;s own property predictions (0.86, 0.84, and 0.75 for ESOL, FreeSolv, and Lipophilicity respectively).</p>
<h3 id="constrained-property-optimization">Constrained Property Optimization</h3>
<p>On the penalized logP (plogP) benchmark with similarity constraints, the RT outperformed JT-VAE and GCPN by large margins. At similarity threshold $\delta = 0.4$, the RT achieved 3.16 average improvement with 97.1% success rate, while also predicting plogP with PCC of 0.92. Competing methods cannot perform property prediction at all.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Improvement ($\delta$=0.4)</th>
          <th>Success</th>
          <th>Property Prediction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.84</td>
          <td>83.6%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>2.49</td>
          <td>100%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>4.71</td>
          <td>85.7%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td><strong>RT</strong></td>
          <td><strong>3.16</strong></td>
          <td><strong>97.1%</strong></td>
          <td><strong>PCC = 0.92</strong></td>
      </tr>
  </tbody>
</table>
<p>The comparison is not strictly fair: all competing methods are trained specifically to maximize plogP, and some (GCPN, JT-VAE) apply gradient optimization at inference time. The RT is only trained to reconstruct molecules with similar predicted plogP to the seed, so its training objective is property-agnostic rather than directly optimizing for higher plogP values.</p>
<h3 id="protein-language-modelling">Protein Language Modelling</h3>
<p>On the TAPE benchmark, the RT matched or outperformed conventional transformers on fluorescence and stability prediction tasks, despite those baselines being pre-trained on 24-106 million protein sequences (vs. 2.6 million for the RT). The RT also performed conditional protein generation, a task that none of the TAPE baselines can address.</p>
<h3 id="chemical-reaction-modelling">Chemical Reaction Modelling</h3>
<p>The RT was applied to reaction yield prediction on <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig amination</a> and <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> datasets. It matched Yield-BERT performance ($R^2$ = 0.939 and 0.81 respectively) while also enabling novel capabilities: reconstructing missing precursors from partial reactions and decorating existing reactions to achieve higher predicted yields. Across both datasets, over 40% of top-five predicted sequences contained reactions with novel precursors and higher predicted yield.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Regression can be successfully reformulated as sequential classification over digit tokens without losing predictive accuracy compared to models using regression losses.</li>
<li>The alternating training scheme with self-consistency loss enables cross-task benefits, where the model outperforms single-task variants at both prediction and generation.</li>
<li>A single ~27M parameter model handles property prediction, conditional molecular generation, conditional protein generation, and reaction yield prediction with precursor generation.</li>
<li>The model learns the natural ordering of digits from data: 47% of embedding dimensions for the tenths place directly encode digit ordering even without explicit numerical encodings.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ol>
<li><strong>No large-scale pre-training</strong>: The RT uses ~27M parameters trained from scratch on task-specific datasets, unlike <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a> or MoLFormer which pre-train on billions of molecules. Scaling up could improve results.</li>
<li><strong>Fine-grained regression precision</strong>: The model sometimes struggles with intra-mode precision (e.g., on the fluorescence dataset where predictions cluster around bright/dark modes rather than capturing continuous variation).</li>
<li><strong>Single-property focus</strong>: All reported experiments use a single continuous property, though the framework naturally extends to multi-property settings.</li>
<li><strong>SELFIES validity caveats</strong>: While SELFIES are always syntactically valid, they can produce degenerate short molecules (~1.9% defective generations where the output has less than 50% of the seed&rsquo;s atoms).</li>
<li><strong>XLNet backbone limitations</strong>: Results on MoleculeNet regression are slightly below models using BART or BERT backbones with large-scale pre-training, suggesting the RT framework could benefit from stronger base models.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/regression-transformer">Regression Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/GT4SD/gt4sd-core">GT4SD Integration</a></td>
          <td>Code + Models</td>
          <td>MIT</td>
          <td>Pre-trained model inference pipelines</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></td>
          <td>Demo</td>
          <td>-</td>
          <td>Interactive inference webapp</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug likeness</td>
          <td>ChEMBL (QED)</td>
          <td>~1.4M molecules</td>
          <td>Synthetic QED labels computed with RDKit</td>
      </tr>
      <tr>
          <td>Regression benchmark</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipo)</td>
          <td>642-4,200 compounds</td>
          <td>16x SMILES augmentation, 3 random splits</td>
      </tr>
      <tr>
          <td>Property optimization</td>
          <td>ZINC (plogP)</td>
          <td>215,381 train / 799 test</td>
          <td>Fixed split from Jin et al. (2018)</td>
      </tr>
      <tr>
          <td>Protein pre-training</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a> (Boman)</td>
          <td>2,648,205 peptides</td>
          <td>15-45 amino acid peptides</td>
      </tr>
      <tr>
          <td>Protein benchmarks</td>
          <td>TAPE (Fluorescence, Stability)</td>
          <td>21,446-53,416 samples</td>
          <td>Fixed splits</td>
      </tr>
      <tr>
          <td>Reaction pre-training</td>
          <td>USPTO</td>
          <td>2,830,616 reactions</td>
          <td>Molecular weight as numerical property</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig / Suzuki</td>
          <td>3,955 / 5,760 reactions</td>
          <td>Ten 70/30 random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: XLNet (32 hidden layers, 256 hidden dim, 1024 FFN dim, 16 attention heads, 20% dropout)</li>
<li>Parameters: ~27 million</li>
<li>Training: Permutation language modelling pre-training, then alternating objectives (property prediction + conditional generation with SC loss)</li>
<li>Decoding: Greedy for property prediction, beam search for sequence generation</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>RT Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED prediction</td>
          <td>RMSE</td>
          <td>0.037</td>
          <td>Best config (NE + SC)</td>
      </tr>
      <tr>
          <td>QED generation</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.517</td>
          <td>Between primers and generated QED</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>Comparable to XLNet</td>
          <td>Within s.d. of regression-loss XLNet</td>
      </tr>
      <tr>
          <td>plogP optimization ($\delta$=0.4)</td>
          <td>Improvement</td>
          <td>3.16</td>
          <td>Outperforms JT-VAE, GCPN</td>
      </tr>
      <tr>
          <td>Protein fluorescence</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.72</td>
          <td>Outperforms TAPE baselines</td>
      </tr>
      <tr>
          <td>BH yield prediction</td>
          <td>$R^2$</td>
          <td>0.939</td>
          <td>Near Yield-BERT (0.951)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained on single GPUs (NVIDIA A100 or V100)</li>
<li>Training time: ~4 days for pre-training, ~1 day for fine-tuning</li>
<li>Framework: PyTorch 1.3.1 with HuggingFace Transformers 3.1.0</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Born, J. &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <em>Nature Machine Intelligence</em>, 5(4), 432-444. <a href="https://doi.org/10.1038/s42256-023-00639-z">https://doi.org/10.1038/s42256-023-00639-z</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence, April 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/regression-transformer">Regression Transformer GitHub Repository</a></li>
<li><a href="https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer">GT4SD Integration</a></li>
<li><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{born2023regression,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Regression Transformer enables concurrent sequence regression and generation for molecular language modelling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Born, Jannis and Manica, Matteo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{432--444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BARTSmiles: BART Pre-Training for Molecular SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/</guid><description>BARTSmiles applies BART-style denoising pre-training to 1.7B SMILES from ZINC20, achieving top results on 11 molecular property and reaction tasks.</description><content:encoded><![CDATA[<h2 id="a-bart-based-method-for-molecular-self-supervised-learning">A BART-Based Method for Molecular Self-Supervised Learning</h2>
<p>BARTSmiles is a <strong>Method</strong> paper. It introduces a self-supervised pre-training approach for molecular representations based on the BART (Bidirectional and Auto-Regressive Transformers) architecture from Lewis et al. (2019). The primary contribution is a pre-training strategy, discovered through systematic ablations, that trains a BART-large model on 1.7 billion deduplicated <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> from the <a href="/notes/chemistry/datasets/zinc-22/">ZINC20 dataset</a>. BARTSmiles achieves the best reported results on 11 tasks spanning molecular property classification, regression, and chemical reaction generation.</p>
<h2 id="scaling-self-supervised-molecular-representations-beyond-prior-work">Scaling Self-Supervised Molecular Representations Beyond Prior Work</h2>
<p>At the time of publication, large-scale self-supervised representation learning had produced significant improvements in NLP, computer vision, and speech, but molecular representation learning had not benefited from comparable scale. Previous SMILES-based pre-trained models such as <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> (Chithrananda et al., 2020) and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a> (Irwin et al., 2022) used encoder-only or encoder-decoder architectures with substantially less compute. ChemFormer, the most closely related prior work, also trained a BART-like model but with a fraction of the compute and data.</p>
<p>The paper argues that three gaps needed to be addressed:</p>
<ol>
<li><strong>Scale</strong>: Prior molecular pre-training used orders of magnitude less compute than NLP pre-training.</li>
<li><strong>Architecture choice</strong>: Encoder-only models like ChemBERTa cannot perform generative fine-tuning (retrosynthesis, reaction prediction), limiting their applicability.</li>
<li><strong>Pre-training recipe</strong>: Standard BART hyperparameters (e.g., 30% mask token budget) were tuned for natural language and had not been validated for molecular SMILES strings.</li>
</ol>
<h2 id="core-innovation-ablation-driven-pre-training-recipe-for-smiles">Core Innovation: Ablation-Driven Pre-Training Recipe for SMILES</h2>
<p>The key insight of BARTSmiles is that the BART denoising objective, when carefully tuned for the molecular domain, learns representations that implicitly encode downstream task information. The authors discover this through a systematic three-stage ablation:</p>
<h3 id="tokenization">Tokenization</h3>
<p>Rather than using hand-crafted tokenization rules that separate individual atoms (C, N, H) and bond symbols (#, =), BARTSmiles uses a learned SentencePiece unigram tokenizer trained on 10 million random SMILES with a vocabulary size of 1,021. On matched compute budgets, learned tokenization achieves 0.801 average AUC-ROC vs. 0.779 for hand-crafted tokenization on the ablation benchmark (HIV, BBBP, ClinTox).</p>
<h3 id="masking-strategy">Masking Strategy</h3>
<p>The BART denoising objective has three main hyperparameters: the mask token budget (fraction of tokens masked), random mask probability, and the Poisson $\lambda$ controlling mask span length. The ablation results show:</p>
<ul>
<li><strong>Mask token budget</strong>: The standard BART value of 0.30 is suboptimal for molecules. A budget of 0.20 performs best (0.821 AUC-ROC), with performance degrading at both lower (0.10: 0.753) and higher (0.40: 0.701) budgets.</li>
<li><strong>Span masking</strong>: The choice of random mask probability and $\lambda$ has a minor effect once the budget is set to 0.20. Values of random mask = 0.10 and $\lambda$ = 2.5 or 3.5 all yield 0.821.</li>
<li><strong>Token randomization</strong>: Disabling the randomize-tokens noise (where some tokens are replaced with random tokens rather than masked) improves performance from 0.821 to 0.835.</li>
</ul>
<h3 id="scale">Scale</h3>
<p>Training on the full 1.7 billion molecule ZINC20 dataset (20 hours on 1,024 A100 GPUs, totaling 20,480 A100 GPU-hours) improves performance by 5 absolute AUC-ROC points over the same model trained on 100 million samples. The previous most compute-intensive molecular pre-training used 3,330 V100-hours (Ross et al., 2021).</p>
<h3 id="implicit-task-encoding">Implicit Task Encoding</h3>
<p>The paper provides a quantitative demonstration that frozen BARTSmiles representations encode task-specific information. Using L1-regularized logistic regression on frozen 1,024-dimensional mean-pooled representations, just 7 neurons are sufficient to achieve 0.987 AUC-ROC on ClinTox (within 2 percentage points of full fine-tuning). Even a single neuron achieves 0.77 AUC-ROC on ClinTox subtask 1.</p>
<h2 id="experimental-setup-moleculenet-toxicology-and-generative-benchmarks">Experimental Setup: MoleculeNet, Toxicology, and Generative Benchmarks</h2>
<h3 id="classification-tasks">Classification Tasks</h3>
<p>BARTSmiles is evaluated on 7 classification datasets from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (SIDER, ClinTox, Tox21, ToxCast, HIV, BACE, BBBP) plus 2 toxicology datasets (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames</a>, <a href="https://en.wikipedia.org/wiki/Micronucleus_test">Micronucleus Assay</a>). All classification tasks use AUC-ROC. Baselines include both supervised graph models (D-MPNN, Attentive FP, 3D InfoMax) and self-supervised methods (ChemBERTa, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer-XL</a>, GROVER-large, MolCLR, iMolCLR).</p>
<p>Selected classification results (AUC-ROC):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td><strong>0.997</strong></td>
          <td>0.954</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td><strong>0.825</strong></td>
          <td>0.805</td>
          <td>Attentive FP</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td><strong>0.705</strong></td>
          <td>0.699</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>0.851</td>
          <td>0.858</td>
          <td>Attentive FP</td>
      </tr>
  </tbody>
</table>
<p>The authors note that three scaffold-split datasets (HIV, BACE, BBBP) are highly sensitive to the specific split used, and they suspect some baseline results use different or random splits. These results are marked with caveats in the paper.</p>
<h3 id="regression-tasks">Regression Tasks</h3>
<p>All three MoleculeNet regression tasks (ESOL, FreeSolv, Lipophilicity) are evaluated using RMSE:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td><strong>0.095</strong></td>
          <td>0.279</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td><strong>0.114</strong></td>
          <td>0.231</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td><strong>0.292</strong></td>
          <td>0.529</td>
          <td>MoLFormer-XL</td>
      </tr>
  </tbody>
</table>
<p>BARTSmiles achieves substantial improvements on all three regression tasks.</p>
<h3 id="generative-tasks">Generative Tasks</h3>
<p><strong><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></strong> (USPTO-50k): BARTSmiles achieves 55.6% Top-1 accuracy using a sample-128 + perplexity re-ranking strategy, compared to 55.3% for Dual-TF and 54.3% for ChemFormer. Top-5 and Top-10 results are 74.2% and 80.9% respectively.</p>
<p><strong>Chemical Reaction Prediction</strong> (USPTO MIT/LEF/STEREO): BARTSmiles with beam search outperforms the <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a> baseline across all six evaluation settings. On USPTO-MIT (split), BARTSmiles achieves 91.8% vs. 90.4% for the Transformer baseline.</p>
<h3 id="fine-tuning-recipe">Fine-Tuning Recipe</h3>
<p>The fine-tuning approach is designed to minimize hyperparameter tuning:</p>
<ul>
<li>Batch size 16, 10 epochs, polynomial decay learning rate schedule with warmup at 16% of training</li>
<li>Grid search over dropout (0.1, 0.2, 0.3) and learning rate ($5 \times 10^{-6}$, $1 \times 10^{-5}$, $3 \times 10^{-5}$)</li>
<li>Stochastic Weight Averaging (SWA) over three sets of four checkpoints</li>
<li>For generative tasks: R3F regularization (Aghajanyan et al., 2020a) and full fp32 precision</li>
<li>For generation: beam search (beam size 10) or sample 128 sequences with perplexity re-ranking</li>
</ul>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Scale matters for molecular pre-training</strong>: Training on 1.7B molecules with 20,480 A100 GPU-hours yields 5 absolute points of AUC-ROC improvement over training on 100M molecules.</li>
<li><strong>Domain-specific ablation is necessary</strong>: The optimal BART masking configuration for molecules (20% budget, no token randomization) differs from the standard NLP configuration (30% budget, with randomization).</li>
<li><strong>Frozen representations capture task structure</strong>: A small number of neurons from the frozen model can nearly match full fine-tuning performance on certain tasks, suggesting the pre-training objective implicitly encodes molecular properties.</li>
<li><strong>Interpretability aligns with domain knowledge</strong>: Integrated Gradients attribution on fine-tuned BARTSmiles highlights known structural alerts (e.g., <a href="https://en.wikipedia.org/wiki/Nitro_compound">nitro groups</a> in mutagenic compounds, hydroxyl groups in soluble compounds).</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Scaffold split sensitivity</strong>: Results on HIV, BACE, and BBBP are sensitive to the specific scaffold split, making direct comparison with baselines difficult.</li>
<li><strong>Pre-training data distribution</strong>: The <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance">Frechet distance</a> analysis shows that some downstream datasets (BBBP, SIDER) are far from ZINC20 in representation space, which may explain weaker performance on those tasks.</li>
<li><strong>Fingerprints carry complementary information</strong>: On the Ames and Micronucleus Assay datasets, BARTSmiles alone does not beat fingerprint-based baselines. Combining BARTSmiles with ECFP4 fingerprints closes the gap, implying that SMILES-based pre-training does not fully capture all structural information.</li>
<li><strong>Compute requirements</strong>: Pre-training requires 1,024 A100 GPUs, which limits accessibility.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest investigating the impact of pre-training data composition, noting that ZINC20 contains over a billion molecules but its distribution may be irrelevant for many downstream tasks. They also propose further collaboration between ML and chemistry experts to discover new molecular substructure-property relationships.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pre-training, fine-tuning, and evaluation scripts with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20 (deduplicated)</td>
          <td>~1.7B molecules</td>
          <td>Canonicalized SMILES, 10K validation holdout</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (7 datasets)</td>
          <td>1,427-41,127 compounds</td>
          <td>AUC-ROC metric</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (3 datasets)</td>
          <td>642-4,200 compounds</td>
          <td>RMSE metric</td>
      </tr>
      <tr>
          <td>Toxicology</td>
          <td>Ames, MN Assay</td>
          <td>6,512 / 641 compounds</td>
          <td>Cross-validation for Ames; external test for MN</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50k</td>
          <td>Standard split</td>
          <td>Top-K accuracy</td>
      </tr>
      <tr>
          <td>Reaction prediction</td>
          <td>USPTO (MIT/LEF/STEREO)</td>
          <td>Standard splits</td>
          <td>Top-1 accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BART-Large (pre-layer norm Transformer encoder-decoder)</li>
<li>Tokenizer: SentencePiece unigram, vocabulary size 1,021, max sequence length 128</li>
<li>Pre-training objective: BART denoising (mask token budget 0.20, Poisson span masking with $\lambda$ = 2.5, no token randomization)</li>
<li>Fine-tuning: polynomial decay LR, SWA, grid search over dropout and LR</li>
<li>Generative fine-tuning: R3F regularization, fp32 precision, Adam initialized from pre-training moving averages</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BART-Large architecture (exact parameter count not specified in paper)</li>
<li>Pre-trained checkpoint released on GitHub</li>
<li>Maximum sequence length: 128 tokens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>BARTSmiles</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td>AUC-ROC</td>
          <td>0.997</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>AUC-ROC</td>
          <td>0.825</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.095</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>0.114</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>RMSE</td>
          <td>0.292</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>USPTO-50k Retro (Top-1)</td>
          <td>Accuracy</td>
          <td>55.6%</td>
          <td>New SOTA (sample + re-rank)</td>
      </tr>
      <tr>
          <td>USPTO-MIT Rxn (Split)</td>
          <td>Accuracy</td>
          <td>91.8%</td>
          <td>New SOTA (beam-10)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 1,024 NVIDIA A100 GPUs for 20 hours (20,480 A100 GPU-hours)</li>
<li>Ablation runs: 128 A100 GPUs per run</li>
<li>Framework: FairSeq with FairScale (fully sharded data parallel), automatic mixed precision</li>
<li>Experiment tracking: Aim</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chilingaryan, G., Tamoyan, H., Tevosyan, A., Babayan, N., Khondkaryan, L., Hambardzumyan, K., Navoyan, Z., Khachatrian, H., &amp; Aghajanyan, A. (2024). BARTSmiles: Generative Masked Language Models for Molecular Representations. <em>Journal of Chemical Information and Modeling</em>, 64(15), 5832-5843. <a href="https://doi.org/10.1021/acs.jcim.4c00512">https://doi.org/10.1021/acs.jcim.4c00512</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2024 (preprint: arXiv 2022)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles GitHub Repository (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chilingaryan2024bartsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BARTSmiles: Generative Masked Language Models for Molecular Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chilingaryan, Gayane and Tamoyan, Hovhannes and Tevosyan, Ani and Babayan, Nelly and Khondkaryan, Lusine and Hambardzumyan, Karen and Navoyan, Zaven and Khachatrian, Hrant and Aghajanyan, Armen}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5832--5843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00512}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGen: Molecular Generation with Chemical Feedback</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/</guid><description>MolGen pre-trains on SELFIES molecules and uses chemical feedback to align generated molecules with real-world chemical preferences across domains.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-method-for-molecular-generation">A SELFIES-Based Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces MolGen, a pre-trained molecular language model for generating molecules with desired chemical properties. The primary contribution is a three-part framework: (1) pre-training on 100M+ molecular SELFIES to learn structural and grammatical knowledge, (2) domain-agnostic molecular prefix tuning for cross-domain knowledge transfer, and (3) a chemical feedback paradigm that aligns the model&rsquo;s generative probabilities with real-world chemical preferences. MolGen is the first language model pre-trained on SELFIES rather than SMILES, which guarantees 100% syntactic validity of generated molecules.</p>
<h2 id="challenges-in-language-model-based-molecule-generation">Challenges in Language Model-Based Molecule Generation</h2>
<p>Generating novel molecules with desirable properties is a central task in drug discovery and chemical design. The molecular space is estimated at $10^{33}$ possible structures, making exhaustive search impractical. Prior deep generative approaches face several limitations:</p>
<ol>
<li><strong>Syntactic invalidity</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based language models frequently generate strings that do not correspond to valid molecular graphs. A single random mutation of a SMILES string has only a 9.9% chance of remaining valid.</li>
<li><strong>Narrow domain focus</strong>: Most existing models focus exclusively on synthetic molecules and neglect <a href="https://en.wikipedia.org/wiki/Natural_product">natural products</a>, which have distinct structural complexity and scaffold diversity.</li>
<li><strong>Molecular hallucinations</strong>: Generated molecules may satisfy chemical structural rules yet fail to exhibit anticipated chemical activity in practical applications. The authors formally define this as molecules that &ldquo;comply with chemical structural rules, yet fail to exhibit practical utility or the anticipated properties.&rdquo;</li>
<li><strong>Limited optimization signals</strong>: Existing approaches rely on reinforcement learning (high variance), fixed-dimensional latent spaces, or expert-provided generation rules, all of which impede efficient exploration of chemical space.</li>
</ol>
<h2 id="core-innovation-pre-training-with-selfies-and-chemical-feedback">Core Innovation: Pre-training with SELFIES and Chemical Feedback</h2>
<p>MolGen&rsquo;s novelty rests on three interconnected components.</p>
<h3 id="selfies-based-pre-training">SELFIES-Based Pre-training</h3>
<p>MolGen uses <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (Self-Referencing Embedded Strings) instead of SMILES. SELFIES guarantees that every possible combination of symbols in the alphabet corresponds to a chemically valid molecular graph. The model uses a compact vocabulary of 185 tokens.</p>
<p>The first pre-training stage uses a BART-style encoder-decoder. Tokens from a SELFIES string $S = {s_1, \ldots, s_l}$ are randomly replaced with [MASK], then the corrupted input is encoded bidirectionally and decoded left-to-right. The reconstruction loss is:</p>
<p>$$
\mathcal{L}_{\text{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\text{true}}(s \mid S, S_{&lt; j}) \log p_{\theta}(s \mid S, S_{&lt; j}; \theta)
$$</p>
<p>where $S_{&lt; j}$ denotes the partial sequence ${s_0, \ldots, s_{j-1}}$ and $p_{\text{true}}$ is the one-hot distribution under standard maximum likelihood estimation.</p>
<h3 id="domain-agnostic-molecular-prefix-tuning">Domain-Agnostic Molecular Prefix Tuning</h3>
<p>The second pre-training stage introduces shared prefix vectors $P_k, P_v \in \mathbb{R}^{m \times d}$ prepended to the keys and values of multi-head attention at each layer. Unlike conventional prefix tuning that freezes model parameters, MolGen updates the entire model. The attention output becomes:</p>
<p>$$
\text{head} = \text{Attn}\left(xW_q, [P_k, XW_k], [P_v, XW_v]\right)
$$</p>
<p>This decomposes into a linear interpolation between prefix attention and standard attention:</p>
<p>$$
\text{head} = \lambda(x) \cdot \text{Attn}(xW_q, P_k, P_v) + (1 - \lambda(x)) \cdot \text{Attn}(xW_q, XW_k, XW_v)
$$</p>
<p>where $\lambda(x)$ is a scalar representing the sum of normalized attention weights on the prefixes. The prefixes are trained simultaneously across synthetic and natural product domains, acting as a domain instructor.</p>
<h3 id="chemical-feedback-paradigm">Chemical Feedback Paradigm</h3>
<p>To address molecular hallucinations, MolGen aligns the model&rsquo;s probabilistic rankings with chemical preference rankings. Given a molecule $S$ and a set of candidate outputs $\mathcal{S}^*$ with distinct property scores $\text{Ps}(\cdot)$, the model should satisfy:</p>
<p>$$
p_{\text{true}}(S_i \mid S) &gt; p_{\text{true}}(S_j \mid S), \quad \forall S_i, S_j \in \mathcal{S}^*, \text{Ps}(S_i) &gt; \text{Ps}(S_j)
$$</p>
<p>This is enforced via a rank loss:</p>
<p>$$
\mathcal{L}_{\text{rank}}(S) = \sum_{i} \sum_{j &gt; i} \max\left(0, f(S_j) - f(S_i) + \gamma_{ij}\right)
$$</p>
<p>where $\gamma_{ij} = (j - i) \cdot \gamma$ is a margin scaled by rank difference and $f(S) = \sum_{t=1}^{l} \log p_{\theta}(s_t \mid S, S_{&lt; t}; \theta)$ is the estimated log-probability. The overall training objective combines cross-entropy and rank loss:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{\text{ce}} + \alpha \mathcal{L}_{\text{rank}}
$$</p>
<p>Label smoothing is applied to the target distribution in $\mathcal{L}_{\text{ce}}$, allocating probability mass $\beta$ to non-target tokens to maintain generative diversity.</p>
<h2 id="experiments-across-distribution-learning-and-property-optimization">Experiments Across Distribution Learning and Property Optimization</h2>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>Stage 1 pre-training</strong>: 100M+ unlabeled molecules from ZINC-15 (molecular weight $\leq$ 500 Da, LogP $\leq$ 5)</li>
<li><strong>Stage 2 pre-training</strong>: 2.22M molecules spanning synthetic (ZINC, MOSES) and natural product (NPASS, 30,926 compounds) domains</li>
<li><strong>Downstream evaluation</strong>: MOSES synthetic dataset, ZINC250K, and natural product molecules</li>
</ul>
<h3 id="molecular-distribution-learning">Molecular Distribution Learning</h3>
<p>MolGen generates 10,000 synthetic and 80,000 natural product molecules, evaluated on seven metrics (Validity, Fragment similarity, Scaffold similarity, SNN, Internal Diversity, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, and Novelty). Baselines include AAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>, CharRNN, VAE, JT-VAE, LIMO, and <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Validity</th>
          <th>Frag</th>
          <th>Scaf</th>
          <th>SNN</th>
          <th>IntDiv</th>
          <th>FCD</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemformer</td>
          <td>.9843</td>
          <td>.9889</td>
          <td>.9248</td>
          <td>.5622</td>
          <td>.8553</td>
          <td>.0061</td>
          <td>.9581</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>1.000</td>
          <td>.9999</td>
          <td>.9999</td>
          <td>.9996</td>
          <td>.8567</td>
          <td>.0015</td>
          <td>1.000</td>
      </tr>
  </tbody>
</table>
<p>On synthetic molecules, MolGen achieves 100% validity, near-perfect fragment and scaffold similarity, and the lowest FCD (0.0015). For natural products, MolGen achieves FCD of 0.6519 compared to Chemformer&rsquo;s 0.8346.</p>
<h3 id="targeted-molecule-discovery">Targeted Molecule Discovery</h3>
<p>For penalized logP maximization (top-3 scores):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1st</th>
          <th>2nd</th>
          <th>3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MARS (no length limit)</td>
          <td>44.99</td>
          <td>44.32</td>
          <td>43.81</td>
      </tr>
      <tr>
          <td>MolGen (no length limit)</td>
          <td>80.30</td>
          <td>74.70</td>
          <td>69.85</td>
      </tr>
      <tr>
          <td>MolGen (length-limited)</td>
          <td>30.51</td>
          <td>28.98</td>
          <td>28.95</td>
      </tr>
  </tbody>
</table>
<p>For QED maximization, MolGen achieves the maximum score of 0.948 across the top-3.</p>
<h3 id="molecular-docking">Molecular Docking</h3>
<p>MolGen optimizes binding affinity for two protein targets (<a href="https://en.wikipedia.org/wiki/Estrogen_receptor_alpha">ESR1</a> and ACAA1), measured by <a href="https://en.wikipedia.org/wiki/Dissociation_constant">dissociation constant</a> $K_D$ (lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESR1 1st</th>
          <th>ESR1 2nd</th>
          <th>ESR1 3rd</th>
          <th>ACAA1 1st</th>
          <th>ACAA1 2nd</th>
          <th>ACAA1 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>0.89</td>
          <td>1.4</td>
          <td>37</td>
          <td>37</td>
          <td>41</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>0.13</td>
          <td>0.35</td>
          <td>0.47</td>
          <td>3.36</td>
          <td>3.98</td>
          <td>8.50</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the lowest dissociation constants across both targets. Optimization of the 1,000 worst-affinity molecules yields 96.7% relative improvement for ESR1 and 70.4% for ACAA1.</p>
<h3 id="constrained-molecular-optimization">Constrained Molecular Optimization</h3>
<p>Optimizing 800 molecules from ZINC250K with lowest p-logP scores under Tanimoto similarity constraints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$\delta = 0.6$</th>
          <th>$\delta = 0.4$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a></td>
          <td>3.78 (3.29)</td>
          <td>11.55 (11.27)</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>12.08 (0.82)</td>
          <td>12.35 (1.21)</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the highest mean improvement with the lowest standard deviation under both constraints.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<ul>
<li><strong>Chemical feedback</strong>: Without it, the model generates molecules with property scores similar to initial molecules. With it ($\alpha = 3$), property scores increase progressively across generation rounds.</li>
<li><strong>Prefix tuning</strong>: Removing prefix tuning reduces constrained optimization improvement by 0.45 at $\delta = 0.6$ and 2.12 at $\delta = 0.4$.</li>
<li><strong>Label smoothing</strong>: Enhances diversity of generated molecules as measured by Internal Diversity.</li>
<li><strong>Substructure attention</strong>: MolGen focuses attention on chemically meaningful functional groups (fluoro, phenyl, hydroxyl), while SMILES-based PLMs scatter attention across syntactic tokens. The Substructure Attention Level (SAL) metric confirms MolGen&rsquo;s superior focus.</li>
</ul>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>SELFIES pre-training guarantees 100% molecular validity, eliminating the need for external valency checks.</li>
<li>Domain-agnostic prefix tuning enables effective knowledge transfer between synthetic and natural product domains.</li>
<li>The chemical feedback paradigm aligns model outputs with chemical preferences without requiring external annotated data or reference databases.</li>
<li>MolGen achieves the best or competitive results across all evaluated tasks: distribution learning, targeted molecule discovery, constrained optimization, and molecular docking.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational cost</strong>: Training and fine-tuning on large datasets is computationally intensive.</li>
<li><strong>Model interpretability</strong>: The transformer architecture makes it difficult to understand explicit rationale behind decisions.</li>
<li><strong>Single-target optimization only</strong>: The chemical feedback paradigm handles single-target optimization; multiple conflicting objectives could create ambiguous optimization trajectories.</li>
<li><strong>Task specificity</strong>: MolGen is designed for 2D molecular generation; 3D conformation information is not incorporated.</li>
<li><strong>Reaction prediction</strong>: When applied to reaction prediction (an off-target task), MolGen achieves only 71.4% accuracy on 39,990 reaction samples.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest applying MolGen to retrosynthesis and reaction prediction, exploring multimodal pre-training, and incorporating additional knowledge sources.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 pre-training</td>
          <td>ZINC-15</td>
          <td>100M+ molecules</td>
          <td>MW $\leq$ 500 Da, LogP $\leq$ 5</td>
      </tr>
      <tr>
          <td>Stage 2 pre-training</td>
          <td>ZINC + <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> + NPASS</td>
          <td>2.22M molecules</td>
          <td>Synthetic and natural product domains</td>
      </tr>
      <tr>
          <td>Distribution learning (synthetic)</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></td>
          <td>~1.9M molecules</td>
          <td>Standard benchmark split</td>
      </tr>
      <tr>
          <td>Distribution learning (natural)</td>
          <td>NPASS</td>
          <td>30,926 compounds</td>
          <td>30,126 train / 800 test</td>
      </tr>
      <tr>
          <td>Constrained optimization</td>
          <td>ZINC250K</td>
          <td>800 molecules</td>
          <td>Lowest p-logP scores</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: BART-based encoder-decoder with SELFIES vocabulary (185 tokens)</li>
<li><strong>Prefix length</strong>: 5 tunable vectors per layer</li>
<li><strong>Optimizer</strong>: LAMB (pre-training), AdamW (fine-tuning)</li>
<li><strong>Pre-training</strong>: 600M steps with linear warm-up (180,000 steps) followed by linear decay</li>
<li><strong>Rank loss weight</strong> ($\alpha$): Recommended values of 3 or 5</li>
<li><strong>Candidate generation</strong>: 30 candidates per molecule (synthetic), 8 candidates (natural products)</li>
</ul>
<h3 id="models">Models</h3>
<p>MolGen is publicly available on Hugging Face. The model uses a vocabulary of 185 SELFIES tokens and is comparable in size to Chemformer-large.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>MolGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a> (lower is better)</td>
          <td>Synthetic</td>
          <td>0.0015</td>
          <td>0.0061 (<a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a>)</td>
          <td>Distribution learning</td>
      </tr>
      <tr>
          <td>p-logP top-1 (no limit)</td>
          <td>Synthetic</td>
          <td>80.30</td>
          <td>44.99 (MARS)</td>
          <td>Targeted discovery</td>
      </tr>
      <tr>
          <td>QED top-1</td>
          <td>Synthetic</td>
          <td>0.948</td>
          <td>0.948 (several)</td>
          <td>Tied at maximum</td>
      </tr>
      <tr>
          <td>ESR1 $K_D$ top-1</td>
          <td>Docking</td>
          <td>0.13</td>
          <td>0.72 (LIMO)</td>
          <td>Binding affinity</td>
      </tr>
      <tr>
          <td>p-logP improvement ($\delta=0.4$)</td>
          <td>Synthetic</td>
          <td>12.35 (1.21)</td>
          <td>11.55 (11.27) (RetMol)</td>
          <td>Constrained optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>6 NVIDIA V100 GPUs</li>
<li>Pre-training batch size: 256 molecules per GPU</li>
<li>Fine-tuning batch size: 6 (synthetic and natural product)</li>
<li>Training: 100 epochs for fine-tuning tasks</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zjunlp/MolGen">zjunlp/MolGen</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/zjunlp">zjunlp/MolGen-large</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained weights on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, Y., Zhang, N., Chen, Z., Guo, L., Fan, X., &amp; Chen, H. (2024). Domain-Agnostic Molecular Generation with Chemical Feedback. <em>Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)</em>.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zjunlp/MolGen">GitHub: zjunlp/MolGen</a></li>
<li><a href="https://huggingface.co/zjunlp">Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2024domain,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Domain-Agnostic Molecular Generation with Chemical Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Guo, Lingbing and Fan, Xiaohui and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=9rPyHyjfwP}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Transformer: Calibrated Reaction Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/</guid><description>A Transformer seq2seq model for chemical reaction prediction achieving 90.4% top-1 accuracy on USPTO_MIT with calibrated uncertainty estimation.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Method</strong> paper. It adapts the Transformer architecture to chemical reaction prediction, treating it as a machine translation problem from reactant <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> to product SMILES. The key contributions are (1) demonstrating that a fully attention-based model outperforms all prior template-based, graph-based, and RNN-based methods, (2) showing the model works without separating reactants from reagents, and (3) introducing calibrated uncertainty estimation for ranking synthesis pathways.</p>
<h2 id="motivation-limitations-of-existing-reaction-prediction">Motivation: Limitations of Existing Reaction Prediction</h2>
<p>Prior approaches to reaction prediction fell into two broad groups, template-based and template-free, each with fundamental limitations:</p>
<ul>
<li><strong>Template-based methods</strong> rely on libraries of reaction rules, either handcrafted or automatically extracted from atom-mapped data. Automatic template extraction itself depends on atom mapping, which depends on templates, creating a circular dependency.</li>
<li><strong>Graph-based template-free methods</strong> (e.g., WLDN, ELECTRO) avoid explicit templates but still require atom-mapped training data and cannot handle stereochemistry.</li>
<li><strong><a href="/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/">RNN-based seq2seq models</a></strong> (also template-free) treat reactions as SMILES translation but impose a positional inductive bias: tokens far apart in the SMILES string are assumed to be less related. This is incorrect because SMILES position has no relationship to 3D spatial distance.</li>
</ul>
<h2 id="core-innovation-transformer-for-reaction-prediction">Core Innovation: Transformer for Reaction Prediction</h2>
<p>The Molecular Transformer adapts the Transformer architecture to chemical reactions by treating SMILES strings of reactants and reagents as source sequences and product SMILES as target sequences.</p>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder Transformer with 4 layers, 256-dimensional hidden states, 8 attention heads, and 12M parameters (reduced from the original 65M NMT model).</li>
<li><strong>Tokenization</strong>: Atom-wise regex tokenization of SMILES strings, applied uniformly to both reactants and reagents (no special reagent tokens).</li>
<li><strong>Data augmentation</strong>: Training data is doubled by generating <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">random (non-canonical) SMILES</a> for each reaction, which improves top-1 accuracy by roughly 1%.</li>
<li><strong>Weight averaging</strong>: Final model weights are averaged over the last 20 checkpoints, providing a further accuracy boost without the inference cost of ensembling.</li>
<li><strong>Mixed input</strong>: Unlike all prior work that separates reactants from reagents (which implicitly assumes knowledge of the product), the Molecular Transformer operates on mixed inputs where no distinction is made.</li>
</ul>
<p>The multihead attention mechanism is the key architectural advantage over RNNs. It allows the model to attend to any pair of tokens regardless of their position in the SMILES string, correctly capturing long-range chemical relationships that RNNs miss.</p>
<h2 id="uncertainty-estimation">Uncertainty Estimation</h2>
<p>A central contribution is calibrated uncertainty scoring. The product of predicted token probabilities serves as a confidence score for each prediction. This score achieves 0.89 AUC-ROC for classifying whether a prediction is correct.</p>
<p>An important finding: <strong>label smoothing hurts uncertainty calibration</strong>. While label smoothing (as used in the original Transformer) marginally improves top-1 accuracy (87.44% vs 87.28%), it destroys the model&rsquo;s ability to distinguish correct from incorrect predictions. Setting the label smoothing parameter to 0.0 preserves calibration.</p>
<p>The confidence score shows no correlation with SMILES length (Pearson $r = 0.06$), confirming it is not biased against predictions of larger molecules.</p>
<h2 id="experimental-results">Experimental Results</h2>
<h3 id="forward-synthesis-prediction">Forward Synthesis Prediction</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Setting</th>
          <th style="text-align: left">Top-1 (%)</th>
          <th style="text-align: left">Top-2 (%)</th>
          <th style="text-align: left">Top-5 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">90.4</td>
          <td style="text-align: left">93.7</td>
          <td style="text-align: left">95.3</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">88.6</td>
          <td style="text-align: left">92.4</td>
          <td style="text-align: left">94.2</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">separated</td>
          <td style="text-align: left">78.1</td>
          <td style="text-align: left">84.0</td>
          <td style="text-align: left">87.1</td>
      </tr>
      <tr>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">mixed</td>
          <td style="text-align: left">76.2</td>
          <td style="text-align: left">82.4</td>
          <td style="text-align: left">85.8</td>
      </tr>
  </tbody>
</table>
<p>The mixed-input model (88.6%) outperforms all prior methods that used separated inputs (best previous: WLDN5 at 85.6%).</p>
<h3 id="comparison-with-quantum-chemistry">Comparison with Quantum Chemistry</h3>
<p>On <a href="https://en.wikipedia.org/wiki/Regioselectivity">regioselectivity</a> of <a href="https://en.wikipedia.org/wiki/Electrophilic_aromatic_substitution">electrophilic aromatic substitution</a> in heteroaromatics, the Molecular Transformer achieves 83% top-1 accuracy vs 81% for RegioSQM (a quantum-chemistry-based predictor), at a fraction of the computational cost.</p>
<h3 id="comparison-with-human-chemists">Comparison with Human Chemists</h3>
<p>On 80 reactions sampled across rarity bins, the Molecular Transformer achieves 87.5% top-1 accuracy vs 76.5% for the best human chemist and 72.5% for the best graph-based model (WLDN5).</p>
<h3 id="chemically-constrained-beam-search">Chemically Constrained Beam Search</h3>
<p>Constraining beam search to only predict atoms present in the reactants (preventing &ldquo;alchemy&rdquo;) produces no change in accuracy, confirming the model has learned conservation of atoms from data alone.</p>
<h2 id="trade-offs-and-limitations">Trade-offs and Limitations</h2>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/Stereochemistry">Stereochemistry</a></strong>: Accuracy drops significantly on USPTO_STEREO (76-78% vs 88-90% on USPTO_MIT), indicating stereochemical prediction remains challenging.</li>
<li><strong>Resolution reactions</strong>: Near-zero accuracy on resolution reactions (28.6%), where reagent information is often missing from patent data.</li>
<li><strong>Unclassified reactions</strong>: Accuracy on &ldquo;unrecognized&rdquo; reaction classes is 46.3%, likely reflecting noisy or mistranscribed data.</li>
<li><strong>No atom mapping</strong>: The model provides no explicit atom mapping between reactants and products, which limits interpretability for understanding reaction mechanisms.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Primary benchmark</strong></td>
          <td style="text-align: left">USPTO_MIT</td>
          <td style="text-align: left">479K</td>
          <td style="text-align: left">Filtered by Jin et al., no stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LEF subset</strong></td>
          <td style="text-align: left">USPTO_LEF</td>
          <td style="text-align: left">350K</td>
          <td style="text-align: left">Subset of MIT with linear electron flow only</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stereo benchmark</strong></td>
          <td style="text-align: left">USPTO_STEREO</td>
          <td style="text-align: left">1.0M</td>
          <td style="text-align: left">Patent reactions through Sept 2016, includes stereochemistry</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Time-split test</strong></td>
          <td style="text-align: left">Pistachio_2017</td>
          <td style="text-align: left">15.4K</td>
          <td style="text-align: left">Non-public, reactions from 2017</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: SMILES canonicalized with RDKit. Regex tokenization from Schwaller et al. (2018). Two input modes: &ldquo;separated&rdquo; (reactants &gt; reagents) and &ldquo;mixed&rdquo; (all molecules concatenated).</p>
<h3 id="model">Model</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">4</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model dimension</strong></td>
          <td style="text-align: left">256</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention heads</strong></td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~12M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Label smoothing</strong></td>
          <td style="text-align: left">0.0</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimizer</strong></td>
          <td style="text-align: left">Adam</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Warm-up steps</strong></td>
          <td style="text-align: left">8000</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Batch size</strong></td>
          <td style="text-align: left">~4096 tokens</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Beam width</strong></td>
          <td style="text-align: left">5</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (sep)</td>
          <td style="text-align: left"><strong>90.4%</strong></td>
          <td style="text-align: left">85.6% (WLDN5)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">USPTO_MIT (mixed)</td>
          <td style="text-align: left"><strong>88.6%</strong></td>
          <td style="text-align: left">80.3% (S2S RNN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>AUC-ROC</strong></td>
          <td style="text-align: left">Uncertainty calibration</td>
          <td style="text-align: left"><strong>0.89</strong></td>
          <td style="text-align: left">N/A</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Regioselectivity</td>
          <td style="text-align: left"><strong>83%</strong></td>
          <td style="text-align: left">81% (RegioSQM)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 accuracy</strong></td>
          <td style="text-align: left">Human comparison</td>
          <td style="text-align: left"><strong>87.5%</strong></td>
          <td style="text-align: left">76.5% (best human)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Single Nvidia P100 GPU, 48h for best single model</li>
<li>Inference: 20 min for 40K reactions on single P100</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C. A., Bekas, C., &amp; Lee, A. A. (2019). Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. <em>ACS Central Science</em>, 5(9), 1572-1583. <a href="https://doi.org/10.1021/acscentsci.9b00576">https://doi.org/10.1021/acscentsci.9b00576</a></p>
<p><strong>Publication</strong>: ACS Central Science 2019</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schwallerMolecularTransformerModel2019,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Schwaller, Philippe and Laino, Teodoro and Gaudin, Th{\&#39;e}ophile and Bolgar, Peter and Hunter, Christopher A. and Bekas, Costas and Lee, Alpha A.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2019</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1572--1583}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1021/acscentsci.9b00576}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SELFormer: A SELFIES-Based Molecular Language Model</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/selformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/selformer/</guid><description>A SELFIES-based RoBERTa model pretrained on 2M ChEMBL molecules for molecular property prediction on MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-chemical-language-model">A SELFIES-Based Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$) with a secondary <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>SELFormer applies the RoBERTa transformer architecture to <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">SELFIES</a> molecular string representations instead of the <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> notation used by prior chemical language models. The model is pretrained via masked language modeling (MLM) on 2M drug-like compounds from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> and fine-tuned for molecular property prediction tasks on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. The authors release pretrained models, fine-tuning code, and datasets as open-source resources.</p>
<h2 id="why-selfies-over-smiles-for-pretraining">Why SELFIES Over SMILES for Pretraining?</h2>
<p>Existing chemical language models, including <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a>, all use SMILES as their input representation. SMILES has well-documented validity and robustness issues: arbitrary perturbations to a SMILES string frequently produce syntactically invalid outputs. This means a pretrained model must spend capacity learning SMILES grammar rules rather than chemical semantics.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">SELFIES</a> addresses this by construction: every possible SELFIES string decodes to a valid molecule. Despite this theoretical advantage and SELFIES&rsquo; growing adoption in generative chemistry, no prior work had systematically evaluated SELFIES as input for large-scale transformer pretraining. SELFormer fills this gap by providing a direct comparison between SELFIES-based and SMILES-based chemical language models on standard benchmarks.</p>
<h2 id="masked-language-modeling-on-guaranteed-valid-molecular-strings">Masked Language Modeling on Guaranteed-Valid Molecular Strings</h2>
<p>SELFormer uses byte-level Byte-Pair Encoding (BPE) to tokenize SELFIES strings, then pretrains a RoBERTa encoder using the standard MLM objective. 15% of input tokens are masked, and the model minimizes the cross-entropy loss over the masked positions:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $\mathcal{M}$ is the set of masked token indices, $x_i$ is the true token at position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted input context, and $\theta$ are the model parameters.</p>
<p>The key insight is that because SELFIES guarantees 100% validity, every masked token prediction corresponds to a valid molecular fragment. The model never wastes capacity predicting invalid chemistry. For fine-tuning, a two-layer classification or regression head is added on top of the encoder&rsquo;s output embedding.</p>
<p>Two model sizes were trained. Notably, the larger SELFormer uses fewer attention heads (4) but more hidden layers (12) than SELFormer-Lite (12 heads, 8 layers). This counterintuitive configuration emerged from the authors&rsquo; hyperparameter search over ~100 models, where deeper architectures with fewer heads outperformed wider, shallower ones:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>SELFormer-Lite</th>
          <th>SELFormer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Attention Heads</td>
          <td>12</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Hidden Layers</td>
          <td>8</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Batch Size</td>
          <td>16</td>
          <td>16</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>5e-5</td>
          <td>5e-5</td>
      </tr>
      <tr>
          <td>Weight Decay</td>
          <td>0.01</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>Pretraining Epochs</td>
          <td>100</td>
          <td>100</td>
      </tr>
      <tr>
          <td>Parameters</td>
          <td>58.3M</td>
          <td>86.7M</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarking-against-smiles-transformers-and-graph-models">Benchmarking Against SMILES Transformers and Graph Models</h2>
<p>SELFormer was pretrained on 2.08M drug-like compounds from ChEMBL v30 (converted from SMILES to SELFIES), then fine-tuned on nine MoleculeNet tasks. All evaluations use scaffold splitting via the Chemprop library.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BACE</th>
          <th>BBBP</th>
          <th>HIV</th>
          <th>Tox21</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td>0.832</td>
          <td><strong>0.902</strong></td>
          <td>0.681</td>
          <td>0.653</td>
          <td><strong>0.745</strong></td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>0.799</td>
          <td>0.728</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolBERT</td>
          <td><strong>0.866</strong></td>
          <td>0.762</td>
          <td><strong>0.783</strong></td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.809</td>
          <td>0.710</td>
          <td>0.771</td>
          <td>0.759</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td><strong>0.890</strong></td>
          <td>0.736</td>
          <td><strong>0.806</strong></td>
          <td><strong>0.787</strong></td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.856</td>
          <td>0.724</td>
          <td><strong>0.806</strong></td>
          <td>0.781</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.855</td>
          <td><strong>0.908</strong></td>
          <td>-</td>
          <td><strong>0.848</strong></td>
          <td>0.649</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE, scaffold split, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>PDBbind</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td><strong>0.682</strong></td>
          <td>2.797</td>
          <td>0.735</td>
          <td>1.488</td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>-</td>
          <td>-</td>
          <td>0.986</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>1.050</td>
          <td><strong>2.082</strong></td>
          <td><strong>0.683</strong></td>
          <td><strong>1.397</strong></td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.798</td>
          <td><strong>1.877</strong></td>
          <td>0.660</td>
          <td>-</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.803</td>
          <td>2.121</td>
          <td><strong>0.600</strong></td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The ablation study compared SELFormer vs. SELFormer-Lite across pretrained-only, 25-epoch, and 50-epoch fine-tuning configurations on randomly split datasets. SELFormer consistently outperformed SELFormer-Lite, confirming the benefit of the deeper (12-layer) architecture.</p>
<h2 id="strong-classification-performance-with-compact-pretraining">Strong Classification Performance with Compact Pretraining</h2>
<p>SELFormer&rsquo;s strongest results come on classification tasks where molecular substructure matters:</p>
<ul>
<li><strong>SIDER</strong>: Best overall ROC-AUC (0.745), outperforming the next best method (MolCLR at 0.652) by 9.3 percentage points. The authors attribute this to SELFIES&rsquo; ability to capture subtle structural differences relevant to drug side effects.</li>
<li><strong>BBBP</strong>: Second best (0.902), behind only KPGT (0.908). SELFormer scored 17.4 percentage points above ChemBERTa-2 (0.728) on this task.</li>
<li><strong>BACE/HIV vs. ChemBERTa-2</strong>: SELFormer outperformed ChemBERTa-2 by 3.3 points on BACE (0.832 vs 0.799), 17.4 on BBBP, and 5.9 on HIV (0.681 vs 0.622). Since both models use similar RoBERTa architectures, this comparison is suggestive of a SELFIES advantage, though differences in pretraining corpus (ChEMBL vs PubChem), corpus size, and training procedure confound a clean attribution to the input representation alone.</li>
<li><strong>ESOL regression</strong>: Best RMSE (0.682) vs GEM (0.798), a 14.5% relative improvement.</li>
</ul>
<p>Limitations are also apparent:</p>
<ul>
<li><strong>HIV and Tox21</strong>: SELFormer underperforms graph-based methods (MolCLR, GEM, KPGT) on these larger datasets. The authors attribute this to insufficient hyperparameter search given computational constraints.</li>
<li><strong>FreeSolv and Lipophilicity regression</strong>: D-MPNN and graph-based methods maintain an edge, suggesting that explicit 2D/3D structural inductive biases remain valuable for certain property types.</li>
<li><strong>Small pretraining corpus</strong>: At 2M molecules, SELFormer&rsquo;s corpus is orders of magnitude smaller than MolFormer&rsquo;s 1.1B. Despite this, SELFormer outperforms MolFormer on SIDER (0.745 vs 0.690), highlighting SELFIES&rsquo; representational advantage.</li>
<li><strong>Single-task ablation scope</strong>: Some architectural claims rest on limited task coverage, and broader benchmarking would strengthen the conclusions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v30</td>
          <td>2,084,725 compounds (2,084,472 after SELFIES conversion)</td>
          <td>Drug-like bioactive small molecules</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>1,513</td>
          <td><a href="https://en.wikipedia.org/wiki/Beta-secretase_1">Beta-secretase 1</a> inhibitor binding</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td><a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">Blood-brain barrier</a> permeability</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>HIV replication inhibition</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>SIDER</td>
          <td>1,427</td>
          <td>Drug side effects (27 classes)</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Tox21</td>
          <td>7,831</td>
          <td>Toxicity (12 targets)</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>Octanol/water distribution coefficient</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PDBbind</td>
          <td>11,908</td>
          <td>Binding affinity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (MLM), 15% token masking</li>
<li><strong>Tokenization</strong>: Byte-level Byte-Pair Encoding (BPE) on SELFIES strings</li>
<li><strong>SMILES to SELFIES conversion</strong>: SELFIES API with Pandaral.lel for parallelization</li>
<li><strong>Splitting</strong>: Scaffold splitting via Chemprop library (80/10/10 train/validation/test)</li>
<li><strong>Fine-tuning</strong>: Two-layer classification/regression head on encoder output; up to 200 epochs with hyperparameter search</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa (HuggingFace Transformers)</li>
<li><strong>SELFormer</strong>: 12 hidden layers, 4 attention heads, 86.7M parameters</li>
<li><strong>SELFormer-Lite</strong>: 8 hidden layers, 12 attention heads, 58.3M parameters</li>
<li><strong>Hyperparameter search</strong>: Sequential search over ~100 configurations on 100K molecule subset</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Area under receiver operating characteristic curve</td>
      </tr>
      <tr>
          <td>PRC-AUC</td>
          <td>Classification</td>
          <td>Area under precision-recall curve (reported for random splits)</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<p>Results reported on scaffold split and random split datasets.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 2x NVIDIA A5000 GPUs</li>
<li><strong>Hyperparameter optimization time</strong>: ~11 days</li>
<li><strong>Full pretraining</strong>: 100 epochs on 2.08M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HUBioDataLab/SELFormer">SELFormer GitHub</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/HUBioDataLab/SELFormer">SELFormer on HuggingFace</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretrained SELFormer weights</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v30</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source pretraining data</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Benchmark</td>
          <td>Unknown</td>
          <td>Downstream evaluation tasks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yüksel, A., Ulusoy, E., Ünlü, A., &amp; Doğan, T. (2023). SELFormer: Molecular Representation Learning via SELFIES Language Models. <em>Machine Learning: Science and Technology</em>, 4(2), 025035. <a href="https://doi.org/10.1088/2632-2153/acdb30">https://doi.org/10.1088/2632-2153/acdb30</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/HUBioDataLab/SELFormer">GitHub Repository (SELFormer)</a></li>
<li><a href="https://huggingface.co/HUBioDataLab/SELFormer">HuggingFace Model Hub (SELFormer)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yuksel2023selformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SELFormer}: Molecular Representation Learning via {SELFIES} Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Y{\&#34;u}ksel, Atakan and Ulusoy, Erva and {\&#34;U}nl{\&#34;u}, Atabey and Do{\u{g}}an, Tunca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{025035}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/acdb30}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoLFormer: Large-Scale Chemical Language Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/molformer/</guid><description>A linear-attention transformer pretrained on 1.1B SMILES from PubChem and ZINC for molecular property prediction across MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-billion-scale-chemical-language-model">A Billion-Scale Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>MoLFormer is a transformer encoder pretrained via masked language modeling on 1.1 billion <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a>. The key architectural choices are linear attention (for $O(N)$ complexity instead of $O(N^2)$) and rotary positional embeddings (RoPE). The resulting model, MoLFormer-XL, produces molecular embeddings that outperform or match GNN baselines across a wide range of <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification and regression tasks, including quantum-chemical property prediction from SMILES alone.</p>
<h2 id="bridging-the-gap-between-molecular-languages-and-graph-neural-networks">Bridging the Gap Between Molecular Languages and Graph Neural Networks</h2>
<p>Prior chemical language models like <a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> were pretrained on relatively small datasets (10M-77M molecules) and generally underperformed GNNs on molecular property prediction. The core question: does a transformer trained on a sufficiently large SMILES corpus learn enough chemical structure to compete with graph-based methods that have explicit topological inductive biases?</p>
<p>Two specific challenges motivated this work:</p>
<ul>
<li><strong>Scale</strong>: The chemical space spans $10^{60}$ to $10^{100}$ plausible molecules, yet labeled property data is scarce. Self-supervised pretraining on the ~1.1B unlabeled molecules available in public databases could provide a general-purpose representation.</li>
<li><strong>Efficiency</strong>: Standard transformer attention is $O(N^2)$ in sequence length, making billion-scale pretraining impractical without architectural modifications.</li>
</ul>
<h2 id="linear-attention-with-rotary-positional-embeddings">Linear Attention with Rotary Positional Embeddings</h2>
<p>MoLFormer&rsquo;s two key architectural choices are its attention mechanism and positional encoding scheme.</p>
<p><strong>Standard attention</strong> computes:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle) v_n}{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle)}
$$</p>
<p>MoLFormer replaces this with <strong>linear attention</strong> using a generalized feature map $\varphi$, combined with <strong>rotary positional embeddings</strong> $R_m$ applied before the feature map:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>This differs from the original RoFormer formulation, which applies the rotation after the feature map. The authors found that rotating the raw queries and keys before projection led to faster convergence and lower validation loss. The combination of linear attention and adaptive sequence-length bucketing reduces GPU requirements from ~1000 to 16 for training on the full 1.1B corpus.</p>
<p>The model uses masked language modeling (15% token masking, following BERT conventions) with a vocabulary of 2,362 SMILES tokens. Sequence length is capped at 202 tokens, covering 99.4% of all molecules.</p>
<h2 id="broad-moleculenet-benchmarking-with-scaling-ablations">Broad MoleculeNet Benchmarking with Scaling Ablations</h2>
<p>MoLFormer-XL was evaluated on 11 MoleculeNet tasks against supervised GNNs, self-supervised GNNs, and prior language models.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split; values reported as percentages in the original paper, converted to proportions here for consistency):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.847</strong></td>
          <td><strong>0.948</strong></td>
          <td>0.822</td>
          <td>0.882</td>
          <td><strong>0.690</strong></td>
      </tr>
      <tr>
          <td>N-Gram</td>
          <td>0.912</td>
          <td>0.769</td>
          <td>0.855</td>
          <td>0.830</td>
          <td>0.876</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td>0.736</td>
          <td>0.798</td>
          <td>0.932</td>
          <td>0.806</td>
          <td><strong>0.890</strong></td>
          <td>0.680</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.724</td>
          <td>0.781</td>
          <td>0.901</td>
          <td>0.806</td>
          <td>0.856</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>Hu et al.</td>
          <td>0.708</td>
          <td>0.787</td>
          <td>0.789</td>
          <td>0.802</td>
          <td>0.859</td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GeomGCL</td>
          <td>-</td>
          <td>0.850</td>
          <td>0.919</td>
          <td>-</td>
          <td>-</td>
          <td>0.648</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>0.643</td>
          <td>-</td>
          <td>0.906</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE for ESOL/FreeSolv/Lipophilicity, avg MAE for QM9/QM8):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>1.5894</strong></td>
          <td><strong>0.0102</strong></td>
          <td><strong>0.2787</strong></td>
          <td><strong>0.2308</strong></td>
          <td><strong>0.5289</strong></td>
      </tr>
      <tr>
          <td>A-FP</td>
          <td>2.6355</td>
          <td>0.0282</td>
          <td>0.5030</td>
          <td>0.736</td>
          <td>0.578</td>
      </tr>
      <tr>
          <td>MPNN</td>
          <td>3.1898</td>
          <td>0.0143</td>
          <td>0.58</td>
          <td>1.150</td>
          <td>0.7190</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>4.3536</td>
          <td>0.0148</td>
          <td>0.970</td>
          <td>1.40</td>
          <td>0.655</td>
      </tr>
  </tbody>
</table>
<p>MoLFormer-XL also outperforms geometry-aware GNNs (DimeNet, GeomGCL, GEM) on ESOL (0.279 vs 0.575), FreeSolv (0.231 vs 0.866), and Lipophilicity (0.529 vs 0.541).</p>
<p><strong>Key ablation findings</strong>:</p>
<ul>
<li><strong>Data scale matters</strong>: Performance improves monotonically from 10% subsets through the full 1.1B corpus. Training on 100% ZINC alone performed worst, likely due to its smaller vocabulary and less diverse molecule lengths.</li>
<li><strong>Model depth matters</strong>: MoLFormer-Base (6 layers) underperforms MoLFormer-XL (12 layers) on most tasks.</li>
<li><strong>Fine-tuning &raquo; frozen</strong>: Fine-tuning the full encoder consistently outperforms using frozen embeddings with a downstream classifier.</li>
<li><strong>Rotary &gt; absolute at scale</strong>: Rotary embeddings underperform absolute embeddings on smaller pretraining sets but overtake them once the corpus exceeds 1B molecules.</li>
</ul>
<h2 id="smiles-transformers-learn-molecular-geometry">SMILES Transformers Learn Molecular Geometry</h2>
<p>The most striking finding is that MoLFormer&rsquo;s attention patterns correlate with 3D interatomic distances, despite training only on 1D SMILES strings.</p>
<p>Using <a href="/notes/chemistry/datasets/qm9/">QM9</a> molecules with known 3D geometries, the authors computed cosine similarity between attention maps and spatial distance matrices across three distance categories:</p>
<table>
  <thead>
      <tr>
          <th>Distance Category</th>
          <th>Range</th>
          <th>Linear Attention (Rotary)</th>
          <th>Full Attention (Rotary)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Short</td>
          <td>$\leq$ 2 Å</td>
          <td>0.594-0.602</td>
          <td>0.598-0.615</td>
      </tr>
      <tr>
          <td>Medium</td>
          <td>2-4 Å</td>
          <td>0.724-0.730</td>
          <td>0.716-0.727</td>
      </tr>
      <tr>
          <td>Long</td>
          <td>4-10 Å</td>
          <td>0.209-0.211</td>
          <td>0.204-0.210</td>
      </tr>
  </tbody>
</table>
<p>The strong correlation in the short and medium categories indicates the model captures covalent bond connectivity and near-neighbor spatial relationships. Linear attention shows marginally higher cosine similarity than full attention on medium-range distances (0.724-0.730 vs 0.716-0.727), though the differences are small.</p>
<p>MoLFormer-XL embeddings also correlate more strongly with molecular fingerprint similarity (0.64 vs 0.48 for ChemBERTa) and maximum common subgraph size (-0.60 vs -0.44), confirming that the representations encode structural information.</p>
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Quantum-chemical energies</strong>: SchNet and DimeNet (which encode explicit 3D geometry) outperform MoLFormer-XL on QM9 atomization energy tasks, with DimeNet achieving roughly 10x lower MAE on U0_atom (0.008 vs 0.083 eV). 3D information remains important for these properties.</li>
<li><strong>Sequence length cap</strong>: The 202-token limit excludes 0.6% of molecules, potentially limiting applicability to larger structures.</li>
<li><strong>SMILES canonicalization</strong>: The model depends on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> canonical SMILES; sensitivity to non-canonical forms is not evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>PubChem</td>
          <td>111M molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC</td>
          <td>~1B molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining (combined)</td>
          <td>PubChem + ZINC</td>
          <td>~1.1B molecules</td>
          <td>MoLFormer-XL training set</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP, Tox21, ClinTox, HIV, BACE, SIDER</td>
          <td>1,427-41,127</td>
          <td>MoleculeNet scaffold splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QM9, QM8, ESOL, FreeSolv, Lipophilicity</td>
          <td>642-133,885</td>
          <td>MoleculeNet random splits (QM9/QM8), scaffold (others)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li><strong>Tokenization</strong>: SMILES tokenizer from Schwaller et al., vocabulary of 2,362 tokens</li>
<li><strong>Sequence length</strong>: 1-202 tokens (99.4% coverage)</li>
<li><strong>Optimizer</strong>: Fused LAMB (via APEX), chosen for stability with large batch sizes and no need for learning rate warm-up</li>
<li><strong>Adaptive bucketing</strong>: Sequences grouped by length into buckets to minimize padding waste</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Transformer encoder with linear attention and rotary positional embeddings</li>
<li><strong>MoLFormer-XL</strong>: 12 layers, 12 attention heads, hidden size 768</li>
<li><strong>MoLFormer-Base</strong>: 6 layers (ablation only)</li>
<li><strong>Feature map size</strong>: 32 (generalized feature map for linear attention)</li>
<li><strong>Frozen head</strong>: Fully connected model with hyperparameter sweep (learning rate, batch size, hidden dim, number of layers)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Scaffold splits per MoleculeNet</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression (ESOL, FreeSolv, Lipophilicity)</td>
          <td>Scaffold splits</td>
      </tr>
      <tr>
          <td>Avg MAE</td>
          <td>Regression (QM9, QM8)</td>
          <td>Random splits per MoleculeNet</td>
      </tr>
  </tbody>
</table>
<p>QM9 results also reported with 5-fold cross-validation for robustness.</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: GPU cluster with nodes containing either 8 NVIDIA Tesla V100 (32GB) or 8 Ampere A100 (40GB) GPUs connected via NVLink and InfiniBand</li>
<li><strong>GPU reduction</strong>: Linear attention + bucketing reduced GPU requirements from ~1000 to 16</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/molformer">IBM/molformer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Pretraining, fine-tuning, and attention visualization</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">MoLFormer-XL (HuggingFace)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pretrained weights (46.8M parameters)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>111M molecules</td>
      </tr>
      <tr>
          <td><a href="https://zinc.docking.org/">ZINC</a></td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>~1B molecules</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Chenthamarakshan, V., Padhi, I., Mroueh, Y., &amp; Das, P. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. <em>Nature Machine Intelligence</em>, 4, 1256-1264. <a href="https://doi.org/10.1038/s42256-022-00580-7">https://doi.org/10.1038/s42256-022-00580-7</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/molformer">GitHub Repository (MoLFormer)</a></li>
<li><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">HuggingFace Models</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2022molformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large-Scale Chemical Language Representations Capture Molecular Structure and Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1256--1264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00580-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AdaptMol: Domain Adaptation for Molecular OCSR (2026)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-graph/adaptmol-2026/</guid><description>AdaptMol is an image-to-graph OCSR model using MMD-based domain adaptation and self-training for hand-drawn molecule recognition.</description><content:encoded><![CDATA[<h2 id="bridging-the-synthetic-to-real-gap-in-graph-based-ocsr">Bridging the Synthetic-to-Real Gap in Graph-Based OCSR</h2>
<p>Most OCSR methods are trained on synthetic molecular images and evaluated on high-quality literature figures, both exhibiting relatively uniform styles. Hand-drawn molecules represent a particularly challenging domain with irregular bond lengths, variable stroke widths, and inconsistent atom symbols. Prior graph reconstruction methods like MolScribe and MolGrapher drop below 15% accuracy on hand-drawn images, despite achieving over 65% on literature datasets.</p>
<p>AdaptMol addresses this with a three-stage pipeline that enables effective transfer from synthetic to real-world data without requiring graph annotations in the target domain:</p>
<ol>
<li><strong>Base model training</strong> on synthetic data with comprehensive augmentation and dual position representation</li>
<li><strong>MMD alignment</strong> of bond-level features between source and target domains</li>
<li><strong>Self-training</strong> with SMILES-validated pseudo-labels on unlabeled target images</li>
</ol>
<h2 id="end-to-end-graph-reconstruction-architecture">End-to-End Graph Reconstruction Architecture</h2>
<p>AdaptMol builds on MolScribe&rsquo;s architecture, using a Swin Transformer base encoder ($384 \times 384$ input) with a 6-layer Transformer decoder (8 heads, hidden dim 256). The model jointly predicts atoms and bonds:</p>
<p><strong>Atom prediction</strong> follows the Pix2Seq approach, autoregressively generating a sequence of atom tokens:</p>
<p>$$S_N = [l_1, x_1, y_1, l_2, x_2, y_2, \dots, l_n, x_n, y_n]$$</p>
<p>where $l_i$ is the atom label and $(x_i, y_i)$ are discretized coordinate bin indices.</p>
<p><strong>Dual position representation</strong> adds a 2D spatial heatmap on top of token-based coordinate prediction. The heatmap aggregates joint spatial distributions of all atoms:</p>
<p>$$\mathbf{H} = \text{Upsample}\left(\sum_{i=1}^{n} P_y^{(i)} \otimes P_x^{(i)}\right)$$</p>
<p>where $P_x^{(i)}$ and $P_y^{(i)}$ are coordinate probability distributions from the softmax logits. During training, this heatmap is supervised with Gaussian kernels at ground-truth atom positions. This reduces false positive atom predictions substantially (from 356 to 33 false positives at IoU 0.05).</p>
<p><strong>Bond prediction</strong> extracts atom-level features from decoder hidden states and enriches them with encoder visual features via multi-head attention with a learnable residual weight $\alpha$:</p>
<p>$$\mathbf{F}_{\text{enriched}} = \text{LayerNorm}(\mathbf{F}_{\text{atom}} + \alpha \cdot \text{MHA}(\mathbf{F}_{\text{atom}}, \mathbf{E}_{\text{vis}}))$$</p>
<p>A feed-forward network then predicts bond types between all atom pairs.</p>
<h2 id="bond-level-domain-adaptation-via-mmd">Bond-Level Domain Adaptation via MMD</h2>
<p>The key insight is that bond features are domain-invariant: they encode structural relationships (single, double, triple, aromatic) independent of visual style. Atom-level alignment is problematic due to class imbalance (carbon dominates), multi-token spanning (functional groups), and position-dependent features.</p>
<p>AdaptMol aligns bond-level feature distributions via class-conditional Maximum Mean Discrepancy:</p>
<p>$$L_{\text{MMD}} = \frac{1}{|\mathcal{C}&rsquo;|} \sum_{c \in \mathcal{C}&rsquo;} MMD(F_c^{\text{src}}, F_c^{\text{tgt}})$$</p>
<p>where $\mathcal{C}&rsquo;$ contains classes with sufficient samples in both domains. Confidence-based filtering retains only high-confidence predictions (confidence &gt; 0.95, entropy &lt; 0.1) for alignment, tightening to 0.98 and 0.05 after the first epoch. Progressive loss weighting follows a schedule of 0.1 (epoch 0), 0.075 (epoch 1), and 0.05 thereafter.</p>
<p>An important side effect: MMD alignment improves inter-class bond discrimination, reducing confusion between visually similar bond types (e.g., jagged double bonds vs. aromatic bonds).</p>
<h2 id="self-training-with-smiles-validation">Self-Training with SMILES Validation</h2>
<p>After MMD alignment, the model generates predictions on unlabeled target images. Predicted molecular graphs are converted to SMILES and validated against ground-truth SMILES annotations. Only exact matches are retained as pseudo-labels, providing complete graph supervision (atom coordinates, element types, bond types) that was previously unavailable in the target domain.</p>
<p>This approach is far more data-efficient than alternatives: AdaptMol uses only 4,080 real hand-drawn images vs. DECIMER-Handdraw&rsquo;s 38 million synthetic hand-drawn images.</p>
<h2 id="comprehensive-data-augmentation">Comprehensive Data Augmentation</h2>
<p>Two categories of augmentation are applied during synthetic data generation:</p>
<ul>
<li><strong>Structure-rendering augmentation</strong>: Functional group abbreviation substitution, bond type conversions (single to wavy/aromatic, Kekule to aromatic rings), R-group insertion, and rendering parameter randomization (font family/size, bond width/spacing)</li>
<li><strong>Image-level augmentation</strong>: Geometric operations, quality degradation, layout variations, and chemical document artifacts (caption injection, arrows, marginal annotations)</li>
</ul>
<p>Structure-rendering augmentation provides the larger benefit, contributing ~20% accuracy improvement on JPO and ~30% on ACS benchmarks.</p>
<h2 id="results">Results</h2>
<h3 id="hand-drawn-molecule-recognition">Hand-Drawn Molecule Recognition</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>DECIMER test (Acc)</th>
          <th>ChemPix (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>AdaptMol</strong></td>
          <td><strong>82.6</strong></td>
          <td><strong>60.5</strong></td>
      </tr>
      <tr>
          <td>DECIMER v2.2</td>
          <td>71.9</td>
          <td>51.4</td>
      </tr>
      <tr>
          <td>AtomLenz</td>
          <td>30.0</td>
          <td>48.4</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>10.1</td>
          <td>26.1</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>10.7</td>
          <td>14.5</td>
      </tr>
  </tbody>
</table>
<h3 id="literature-and-synthetic-benchmarks">Literature and Synthetic Benchmarks</h3>
<p>AdaptMol achieves state-of-the-art on 4 of 6 literature benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AdaptMol</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>DECIMER v2.2</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CLEF</td>
          <td><strong>92.7</strong></td>
          <td>87.5</td>
          <td>57.2</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td><strong>88.2</strong></td>
          <td>78.8</td>
          <td>73.0</td>
          <td>75.7</td>
      </tr>
      <tr>
          <td>UOB</td>
          <td><strong>89.3</strong></td>
          <td>88.2</td>
          <td>85.1</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>ACS</td>
          <td><strong>75.5</strong></td>
          <td>72.8</td>
          <td>41.0</td>
          <td>37.7</td>
      </tr>
      <tr>
          <td>USPTO</td>
          <td>90.9</td>
          <td><strong>92.6</strong></td>
          <td>74.9</td>
          <td>59.6</td>
      </tr>
      <tr>
          <td>Staker</td>
          <td>84.0</td>
          <td><strong>84.4</strong></td>
          <td>0.0</td>
          <td>66.3</td>
      </tr>
  </tbody>
</table>
<p>MolScribe edges out on USPTO and Staker. The authors attribute this to MolScribe directly training on all 680K USPTO samples, which may cause it to specialize to that distribution.</p>
<h3 id="pipeline-ablation">Pipeline Ablation</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Hand-drawn</th>
          <th>ChemDraw</th>
          <th>JPO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base model</td>
          <td>10.4</td>
          <td>92.3</td>
          <td>82.7</td>
      </tr>
      <tr>
          <td>+ Font augmentation</td>
          <td>30.2</td>
          <td>92.5</td>
          <td>82.8</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD</td>
          <td>42.1</td>
          <td>94.0</td>
          <td>83.0</td>
      </tr>
      <tr>
          <td>+ Font aug + MMD + Self-training</td>
          <td><strong>82.6</strong></td>
          <td><strong>95.9</strong></td>
          <td><strong>88.2</strong></td>
      </tr>
  </tbody>
</table>
<p>Each component contributes meaningfully: font augmentation (+19.8), MMD alignment (+11.9), and self-training (+40.5) on hand-drawn accuracy.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fffh1/AdaptMol">AdaptMol Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">Model + Data</a></td>
          <td>Model/Dataset</td>
          <td>MIT</td>
          <td>Pretrained checkpoint and datasets</td>
      </tr>
  </tbody>
</table>
<p>Training uses 2 NVIDIA A100 GPUs (40GB each). Base model trains for 30 epochs on 1M synthetic samples. Domain adaptation involves 3 steps: USPTO self-training (3 iterations of 3 epochs), MMD alignment on hand-drawn data (5 epochs), and hand-drawn self-training (5 iterations).</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Sequence length constraints prevent accurate prediction of very large molecules (&gt;120 atoms), where resizing causes significant information loss</li>
<li>Cannot recognize Markush structures with repeating unit notation (parentheses/brackets), as synthetic training data lacks such cases</li>
<li>Stereochemistry information is lost when stereo bonds connect to abbreviated functional groups due to RDKit post-processing limitations</li>
<li>The retrained baseline (30 epochs from scratch on synthetic + pseudo-labels) achieves higher hand-drawn accuracy (87.2%) but at the cost of cross-domain robustness on literature benchmarks</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hu, F., He, E., &amp; Verspoor, K. (2026). AdaptMol: Domain Adaptation for Molecular Image Recognition with Limited Supervision. <em>Research Square preprint</em>. <a href="https://doi.org/10.21203/rs.3.rs-8365561/v1">https://doi.org/10.21203/rs.3.rs-8365561/v1</a></p>
<p><strong>Publication</strong>: Research Square preprint, February 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/fffh1/AdaptMol">GitHub</a></li>
<li><a href="https://huggingface.co/fffh1/AdaptMol/tree/main">HuggingFace (model + data)</a></li>
</ul>
]]></content:encoded></item><item><title>The Quarks of Attention: Building Blocks of Attention</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/quarks-of-attention/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/quarks-of-attention/</guid><description>Baldi and Vershynin's 2023 theoretical analysis decomposing attention into fundamental building blocks and proving capacity bounds for attentional circuits.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>theory paper</strong> that takes a reductionist approach to attention mechanisms. It classifies all possible fundamental building blocks of attention (&ldquo;quarks&rdquo;) within a formal neural network framework, then proves capacity theorems for circuits built from these primitives using linear and polynomial threshold gates.</p>
<h2 id="why-decompose-attention-into-primitives">Why decompose attention into primitives?</h2>
<p>Descriptions of attention in deep learning often seem complex and obscure the underlying neural architecture. Despite the widespread use of attention in transformers and beyond, there has been little formal theory about the computational nature and capacity of attention mechanisms. Baldi and Vershynin address this by identifying the smallest building blocks and rigorously analyzing what they can compute.</p>
<h2 id="the-standard-model-and-its-extensions">The Standard Model and its extensions</h2>
<p>The paper defines the &ldquo;Standard Model&rdquo; (SM) as the class of all neural networks built from McCulloch-Pitt neurons: directed weighted graphs where neuron $i$ computes $O_i = f_i(S_i)$ with activation $S_i = \sum_j w_{ij} O_j$. The SM already has universal approximation properties, so extensions should be evaluated on efficiency (circuit size, depth, learning), not on what functions can be represented.</p>
<p>Three variable types exist in the SM: activations ($S$), outputs ($O$), and synaptic weights ($w$). Cross these with two mechanisms (addition, multiplication) and the constraint that attending signals originate from neuronal outputs, and you get six possible attention primitives.</p>
<h2 id="the-six-quarks-reduced-to-three">The six quarks, reduced to three</h2>
<table>
  <thead>
      <tr>
          <th></th>
          <th>$S$ (activation)</th>
          <th>$O$ (output)</th>
          <th>$w$ (synapse)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Addition</strong></td>
          <td>Multiplexing (in SM)</td>
          <td>Additive output (in SM)</td>
          <td>Additive synaptic</td>
      </tr>
      <tr>
          <td><strong>Multiplication</strong></td>
          <td>Activation gating</td>
          <td><strong>Output gating</strong></td>
          <td><strong>Synaptic gating</strong></td>
      </tr>
  </tbody>
</table>
<p>The paper shows these reduce to three cases worth studying:</p>
<h3 id="multiplexing-additive-activation-attention">Multiplexing (additive activation attention)</h3>
<p>The attending signal $S_2$ is added to the normal activation $S_1$, producing $O_i = f_i(S_1 + S_2)$. With sigmoid or threshold activations, a large negative $S_2$ forces the output to zero regardless of $S_1$, suppressing unattended stimuli. This mechanism lives entirely within the SM and plays a central role in proving capacity lower bounds.</p>
<h3 id="output-gating">Output gating</h3>
<p>Neuron $j$ multiplies the output of neuron $i$, producing $O_i O_j$. This quadratic term is new to the SM. The gated signal $O_i O_j$ propagates to all downstream neurons of $i$. When $O_j \approx 0$, the attended neuron is silenced; when $O_j$ is large, it is enhanced.</p>
<h3 id="synaptic-gating">Synaptic gating</h3>
<p>Neuron $j$ multiplies a synaptic weight $w_{ki}$, creating a dynamic weight $w_{ki} O_j$. This produces the same local term $w_{ki} O_i O_j$ at neuron $k$ as output gating, but affects only the single downstream connection rather than all of neuron $i$&rsquo;s outputs. Synaptic gating is a fast weight mechanism: the attending network dynamically changes the program executed by the attended network.</p>
<h2 id="transformers-are-built-entirely-from-gating">Transformers are built entirely from gating</h2>
<p>The paper shows that transformer encoder modules decompose into:</p>
<ol>
<li><strong>Output gating</strong> ($mn^2$ operations): computing all $n^2$ pairwise dot products of $Q$ and $K$ vectors, each requiring $m$ element-wise multiplications</li>
<li><strong>Softmax</strong>: a standard SM extension</li>
<li><strong>Synaptic gating</strong> ($n^2$ operations): weighting $V$ vectors by the softmax outputs to form convex combinations</li>
</ol>
<p>The entire attention mechanism uses $O(mn^2)$ gating operations. The permutation invariance of transformers follows directly from the weight sharing across input positions.</p>
<h2 id="relationship-to-polynomial-neural-networks">Relationship to polynomial neural networks</h2>
<p>Gating is a special case of polynomial activation. A neuron with full quadratic activation over $n$ inputs has the form $S_i = \sum_{jk} w_{ijk} O_j O_k$, requiring $O(n^2)$ three-way synaptic weights for all possible pairs. Gating introduces only one new quadratic term per operation. The same gating concepts can also be applied to more complex units with polynomial activations of degree $d$, where one polynomial threshold unit gates the output or synapse of another.</p>
<h2 id="functional-properties-of-gating">Functional properties of gating</h2>
<p>Several examples illustrate what gating enables:</p>
<ul>
<li><strong>Shaping activation functions</strong>: When a unit with activation function $f$ is output-gated by a unit with activation function $g$ (both having the same inputs), the result is $f(S)g(S) = fg(S)$. This changes the effective activation function from $f$ to $fg$. For instance, a linear unit gated by a $(0,1)$ threshold function produces the ReLU activation.</li>
<li><strong>XOR without hidden layers</strong>: The XOR function cannot be computed by a single linear threshold gate. However, gating the OR function by the NAND function (both implementable by single linear threshold gates) produces XOR in a shallow network with no hidden layers.</li>
<li><strong>Universal approximation</strong>: Every continuous function on a compact set can be approximated to arbitrary precision by a shallow attention network of linear units gated by linear threshold gates (Theorem 4.3).</li>
</ul>
<h2 id="attention-as-sparse-quadratic-terms">Attention as sparse quadratic terms</h2>
<p>Both output and synaptic gating introduce quadratic terms of the form $w_{ki} O_i O_j$. A neuron with full quadratic activation over $n$ inputs would require $O(n^2)$ parameters. Gating introduces only one new quadratic term per operation. This is the key insight: attention mechanisms gain some of the expressiveness of quadratic activations while avoiding the combinatorial parameter explosion.</p>
<h2 id="capacity-results">Capacity results</h2>
<p>Using cardinal capacity (the base-2 logarithm of the number of distinct Boolean functions a class of circuits can implement), the paper proves bounds for attentional circuits with linear and polynomial threshold gates:</p>
<ul>
<li><strong>Single unit with output gating</strong>: a gated pair of linear threshold gates on $n$ inputs has capacity $2n^2(1 + o(1))$, compared to $n^2(1 + o(1))$ for a single gate (Theorem 6.1). This represents a doubling of capacity with a doubling of parameters (from $n$ to $2n$), a sign of efficiency.</li>
<li><strong>Multiplexing technique</strong>: additive activation attention enables a &ldquo;multiplexing&rdquo; proof strategy where one unit in a layer is selected as a function of the attending units while driving remaining units to saturation. This is the key tool for proving lower bounds.</li>
<li><strong>Attention layers</strong>: extending to layers of $m$ gated units with $n$ inputs, the capacity is $2mn^2(1 + o(1))$ for output gating (Theorem 7.1), confirming that gating approximately doubles the capacity relative to ungated layers.</li>
<li><strong>Depth reduction</strong>: gating operations available as physical primitives in a neural network can reduce the depth required for certain basic circuits.</li>
</ul>
<h2 id="limitations-and-future-work">Limitations and future work</h2>
<p>The authors note several open directions:</p>
<ul>
<li>The capacity estimates for some configurations (e.g., single-weight synaptic gating in Proposition 6.9) have gaps between the lower and upper bounds that remain to be tightened.</li>
<li>The analysis uses Boolean neurons (linear and polynomial threshold gates) as approximations. Extending results to other activation functions (sigmoid, ReLU) is left for future work.</li>
<li>The paper focuses on single layers and pairs of units. Capacity analysis for deeper attention architectures with multiple stacked layers is not addressed.</li>
<li>The theory treats attention on the time scale of individual inputs. The paper briefly notes that fast synaptic mechanisms operating on different time scales raise interesting architectural questions but does not develop this direction.</li>
</ul>
<h2 id="reproducibility">Reproducibility</h2>
<p>This is a purely theoretical paper with no associated code, datasets, or pretrained models. All results are mathematical theorems and proofs that can be verified from the paper itself. The paper is freely available on arXiv under a CC BY-NC-ND 4.0 license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Baldi, P. &amp; Vershynin, R. (2023). The quarks of attention: Structure and capacity of neural attention building blocks. <em>Artificial Intelligence</em>, 319, 103901.</p>
<p><strong>Publication</strong>: Artificial Intelligence 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.sciencedirect.com/science/article/pii/S0004370223000474">Journal (ScienceDirect)</a></li>
<li><a href="https://arxiv.org/abs/2202.08371">arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{baldi2023quarks,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The quarks of attention: Structure and capacity of neural attention building blocks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Baldi, Pierre and Vershynin, Roman}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{319}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{103901}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.artint.2023.103901}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SE(3)-Transformers: Equivariant Attention for 3D Data</title><link>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/se3-transformers/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/geometric-deep-learning/se3-transformers/</guid><description>Fuchs et al. combine self-attention with SE(3)-equivariance for 3D point clouds using invariant attention weights and equivariant value messages.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>method paper</strong> that introduces the SE(3)-Transformer, a self-attention mechanism for 3D point clouds and graphs that is equivariant under continuous 3D rotations and translations. It builds on tensor field networks (TFNs) by adding data-dependent attention weights, resolving a known expressiveness limitation of equivariant convolutions.</p>
<h2 id="why-equivariant-attention-for-point-clouds">Why equivariant attention for point clouds?</h2>
<p>Point cloud data appears in 3D object scans, molecular structures, and particle simulations. Two properties are essential: handling varying numbers of irregularly sampled points, and invariance to global changes in pose (rotations and translations).</p>
<p>Self-attention handles variable-size inputs naturally and has proven effective across many domains. Tensor field networks provide SE(3)-equivariant convolutions but suffer from a key limitation: their filter kernels are decomposed into learnable radial functions and fixed angular components (spherical harmonics). The angular dependence is completely constrained by the equivariance condition, leaving no learnable degrees of freedom in the angular direction. This has been identified in the literature as severely limiting performance.</p>
<p>The SE(3)-Transformer resolves this by introducing data-dependent attention weights that modulate the angular profile of the kernels while maintaining equivariance.</p>
<h2 id="architecture-invariant-attention-meets-equivariant-values">Architecture: invariant attention meets equivariant values</h2>
<p>The core layer combines three components:</p>
<p>$$\mathbf{f}_{\text{out},i}^{\ell} = \underbrace{\mathbf{W}_V^{\ell\ell} \mathbf{f}_{\text{in},i}^{\ell}}_{\text{self-interaction}} + \sum_{k \geq 0} \sum_{j \in \mathcal{N}_i \setminus i} \underbrace{\alpha_{ij}}_{\text{attention}} \underbrace{\mathbf{W}_V^{\ell k}(\mathbf{x}_j - \mathbf{x}_i) \mathbf{f}_{\text{in},j}^k}_{\text{value message}}$$</p>
<h3 id="invariant-attention-weights">Invariant attention weights</h3>
<p>The attention weights use dot-product attention between equivariant queries and keys:</p>
<p>$$\alpha_{ij} = \frac{\exp(\mathbf{q}_i^\top \mathbf{k}_{ij})}{\sum_{j&rsquo; \in \mathcal{N}_i \setminus i} \exp(\mathbf{q}_i^\top \mathbf{k}_{ij&rsquo;})}$$</p>
<p>Both $\mathbf{q}_i$ and $\mathbf{k}_{ij}$ are constructed using TFN-type linear embeddings, making them SE(3)-equivariant. Their inner product is invariant because SO(3) representations are orthogonal: $\mathbf{q}^\top \mathbf{S}_g^\top \mathbf{S}_g \mathbf{k} = \mathbf{q}^\top \mathbf{k}$.</p>
<h3 id="equivariant-value-messages">Equivariant value messages</h3>
<p>The value messages use the same TFN kernel structure as tensor field networks: weight kernels $\mathbf{W}_V^{\ell k}(\mathbf{x})$ decomposed into learnable radial functions and Clebsch-Gordan/spherical harmonic angular components. Features are typed by irreducible representation degree $\ell$ (the independent matrix blocks into which SO(3) group actions decompose): type-0 vectors are rotation-invariant scalars, type-1 vectors transform as 3D vectors, and so on.</p>
<h3 id="angular-modulation">Angular modulation</h3>
<p>The attention weights $\alpha_{ij}$ multiply the value messages, creating data-dependent kernels $\alpha_{ij} \mathbf{W}_V^{\ell k}(\mathbf{x})$. This effectively modulates the angular profile of the fixed spherical harmonic components, adding learnable angular degrees of freedom while preserving equivariance. The authors describe this as one of the first examples of a nonlinear equivariant layer.</p>
<h3 id="attentive-self-interaction">Attentive self-interaction</h3>
<p>The paper also introduces attentive self-interaction as an alternative to the standard linear self-interaction (analogous to 1x1 convolutions). Instead of fixed learned weights across all points, the weights are generated by an MLP operating on invariant inner products of the input features:</p>
<p>$$w_{i,c&rsquo;c}^{\ell\ell} = \text{MLP}\left(\bigoplus_{c,c&rsquo;} \mathbf{f}_{\text{in},i,c&rsquo;}^{\ell\top} \mathbf{f}_{\text{in},i,c}^{\ell}\right)$$</p>
<h2 id="experiments">Experiments</h2>
<h3 id="n-body-particle-simulation">N-body particle simulation</h3>
<p>Five charged particles carrying positive or negative charges, exerting repulsive or attractive forces on each other. The task is predicting positions and velocities 500 timesteps ahead. The SE(3)-Transformer achieves 0.0076 MSE on position (vs. 0.0139 for Set Transformer and 0.0151 for TFN), with equivariance error on the order of $10^{-7}$, confirming exact equivariance up to numerical precision.</p>
<h3 id="scanobjectnn-real-world-3d-object-classification">ScanObjectNN (real-world 3D object classification)</h3>
<p>2902 real-world scanned objects across 15 categories. This task is only SO(2)-invariant (gravity axis matters), so the authors provide the z-component as an additional scalar input. With only 128 input points, the SE(3)-Transformer+z achieves 85.0% accuracy, competitive with methods using 1024 points and task-specific architectures. The model learns to ignore the symmetry-breaking z-input when trained on rotation-augmented data.</p>
<h3 id="qm9-molecular-property-regression"><a href="/notes/chemistry/datasets/qm9/">QM9</a> molecular property regression</h3>
<p>134k molecules with up to 29 atoms, predicting 6 quantum chemical properties. The SE(3)-Transformer achieves competitive results against other equivariant models (TFN, Cormorant), with improvements over TFN on all six targets. Across all three experiments, the SE(3)-Transformer outperforms both a non-equivariant attention baseline (Set Transformer) and equivariant models without attention (TFN).</p>
<h3 id="practical-contributions">Practical contributions</h3>
<p>The paper includes a PyTorch spherical harmonics implementation that is 10x faster than Scipy on CPU and 100-1000x faster on GPU. For a ScanObjectNN model, this yields roughly 22x speedup of the forward pass compared to the lie-learn library, directly addressing a major bottleneck of TFN-based architectures.</p>
<h2 id="conclusions-and-limitations">Conclusions and limitations</h2>
<p>Adding attention to a roto-translation-equivariant model consistently led to higher accuracy and increased training stability across all three experiments. For large neighbourhoods, attention proved essential for model convergence. The equivariance constraints also improved performance compared to conventional (non-equivariant) attention in all experiments.</p>
<p>The authors note that the SE(3)-Transformer is inherently suited for classification and regression on molecular data and discuss applications in drug research, including early-stage suitability classification of molecules for inhibiting viral reproductive cycles.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/FabianFuchsML/se3-transformer-public">se3-transformer-public</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch + DGL implementation</td>
      </tr>
  </tbody>
</table>
<p>The repository includes code for N-body simulations and QM9 experiments. Hyperparameters and architecture details are provided in the paper&rsquo;s appendix (4 equivariant layers, representation degrees, channels per degree, learning rates, batch sizes). Hardware requirements are not explicitly stated in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fuchs, F. B., Worrall, D. E., Fischer, V., &amp; Welling, M. (2020). SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks. <em>Advances in Neural Information Processing Systems</em>, 33. <a href="https://arxiv.org/abs/2006.10503">https://arxiv.org/abs/2006.10503</a></p>
<p><strong>Publication</strong>: NeurIPS 2020</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2006.10503">arXiv</a></li>
<li><a href="https://github.com/FabianFuchsML/se3-transformer-public">GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fuchs2020se3,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SE(3)-Transformers}: 3D Roto-Translation Equivariant Attention Networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fuchs, Fabian B. and Worrall, Daniel E. and Fischer, Volker and Welling, Max}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Scaling Laws vs Model Architectures: Inductive Bias</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/</guid><description>Tay et al.'s 2022 study comparing scaling behavior across ten model architectures, showing that inductive bias affects scaling properties in distinct ways.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>systematization paper</strong> that conducts a large-scale empirical comparison of how ten different model architectures scale. Rather than proposing a new architecture, it characterizes the relationship between inductive bias and scaling behavior across both upstream (pretraining) and downstream (transfer) performance.</p>
<h2 id="why-architecture-aware-scaling-matters">Why architecture-aware scaling matters</h2>
<p>Prior scaling laws work (Kaplan et al., 2020) focused almost exclusively on vanilla Transformers, finding that loss scales as a power law with model size, dataset size, and compute. A common assumption in the field is that improvements observed at one scale transfer to other scales, and new architectures are often evaluated at a single compute point (e.g., base size). This paper challenges that assumption by asking whether different inductive biases scale differently.</p>
<h2 id="ten-architectures-one-controlled-setup">Ten architectures, one controlled setup</h2>
<p>All models are implemented in Mesh TensorFlow under a shared encoder-decoder (<a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style) framework, pretrained on C4 for $2^{19}$ steps with Adafactor optimizer and inverse square root learning rate schedule, and finetuned for 100K steps on GLUE + SuperGLUE + SQuAD. Models range from 15M to 40B parameters, trained on 16 TPU-v3 chips. The ten architectures span four categories:</p>
<p><strong>Transformer variants</strong>: vanilla Transformer, Evolved Transformer (AutoML-derived), Universal Transformer (parameter sharing + recurrence), Switch Transformer (sparse MoE)</p>
<p><strong>Efficient variants</strong>: Performer (linear attention), Funnel Transformer (sequence downsampling), ALBERT (cross-layer parameter sharing + embedding factorization)</p>
<p><strong>General improvements</strong>: Mixture of Softmaxes (MoS), Gated Linear Units (GLU)</p>
<p><strong>Non-Transformers</strong>: Lightweight Convolutions, Dynamic Convolutions, MLP-Mixer</p>
<h2 id="key-findings-on-scaling-behavior">Key findings on scaling behavior</h2>
<h3 id="architecture-changes-the-scaling-slope">Architecture changes the scaling slope</h3>
<p>The paper fits linear scaling laws in log-log space (i.e., power law fits of the form $L \propto C^{-\alpha}$) for each model across multiple axes (FLOPs vs. upstream, FLOPs vs. downstream, etc.). The vanilla Transformer has the highest scaling coefficient on most reported axes ($\alpha_{F,U} = 0.54$, $\alpha_{F,D} = 0.28$). Models that make minimal changes to the Transformer (GLU, MoS) retain similar scaling behavior. Models with more radical inductive biases show worse scaling:</p>
<ul>
<li><strong>Performer</strong> (linear attention): $\alpha_{F,U} = 0.25$, upstream perplexity decreases only 2.7% from base to large vs. 8.4% for vanilla Transformer</li>
<li><strong>ALBERT</strong>: scales negatively on downstream ($\alpha_{F,D} = -0.12$), getting worse as compute increases. ALBERT was designed for parameter efficiency (cross-layer weight sharing, embedding factorization), not compute efficiency, so this result is expected: additional FLOPs reuse the same parameters without adding capacity</li>
<li><strong>MLP-Mixer</strong>: near-zero downstream scaling ($\alpha_{F,D} = -0.03$)</li>
</ul>
<h3 id="the-best-architecture-changes-with-scale">The best architecture changes with scale</h3>
<p>Models that perform well at small compute budgets are not necessarily the best at larger budgets. For example, the Evolved Transformer outperforms vanilla Transformers at tiny-to-small scale on downstream tasks but falls behind when scaled up. MoS-Transformer outperforms vanilla Transformers at some compute regions but not others.</p>
<h3 id="upstream-and-downstream-scaling-diverge">Upstream and downstream scaling diverge</h3>
<p>Good upstream perplexity scaling does not guarantee good downstream transfer scaling. Funnel Transformers and Lightweight Convolutions hold up reasonably well on upstream perplexity but suffer substantially on downstream tasks. Switch Transformers show the best upstream-to-downstream transfer ratio ($\alpha_{U,D} = 0.58$).</p>
<h3 id="depth-and-width-affect-architectures-differently">Depth and width affect architectures differently</h3>
<p>Depth scaling has a more substantial impact on downstream performance than width scaling across most architectures. Evolved Transformers are a partial exception, scaling slightly better under width scaling compared to other architectures on downstream tasks.</p>
<h2 id="practical-implications">Practical implications</h2>
<p>The authors offer concrete guidance: practitioners should be cautious about staking expensive large-scale runs on architectures that drastically modify the attention mechanism. Performers and MLP-Mixers are characterized as &ldquo;high risk&rdquo; options. This helps explain why most large language models at the time (PaLM, Gopher, UL2) use relatively vanilla Transformer architectures.</p>
<p>The paper also notes that not every use case requires billion-parameter models. Inductive biases tailored to small or low-compute regimes remain valuable when scaling is not the priority.</p>
<h2 id="reproducibility">Reproducibility</h2>
<p>No code or trained model weights were publicly released with this paper. The experiments rely on Google&rsquo;s internal Mesh TensorFlow infrastructure with 16 TPU-v3 chips, and pretraining uses the publicly available C4 corpus. Finetuning benchmarks (GLUE, SuperGLUE, SQuAD) are all publicly available. However, reproducing the full study would require substantial compute resources and re-implementation of all ten architectures within a shared framework.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://arxiv.org/abs/2207.10551">arXiv paper</a></td>
          <td>Paper</td>
          <td>Open access</td>
          <td>Full paper with appendices</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 corpus</a></td>
          <td>Dataset</td>
          <td>ODC-BY</td>
          <td>Pretraining data</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components</strong>: No released code, model checkpoints, or training scripts. Internal Mesh TensorFlow codebase is not publicly available.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., Narang, S., Tran, V. Q., Yogatama, D., &amp; Metzler, D. (2022). Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? <em>EMNLP 2022</em>.</p>
<p><strong>Publication</strong>: EMNLP 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2207.10551">arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{tay2022scaling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tay, Yi and Dehghani, Mostafa and Abnar, Samira and Chung, Hyung Won and Fedus, William and Rao, Jinfeng and Narang, Sharan and Tran, Vinh Q. and Yogatama, Dani and Metzler, Donald}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>OCSU: Optical Chemical Structure Understanding (2025)</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/ocsu/</guid><description>OCSU task for translating molecular images into multi-level descriptions. Introduces Vis-CheBI20 dataset and DoubleCheck/Mol-VL for molecular understanding.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, S., Xie, Y., Cai, B., Xie, A., Liu, G., Qiao, M., Xing, J., &amp; Nie, Z. (2025). OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery. <em>arXiv preprint arXiv:2501.15415</em>. <a href="https://doi.org/10.48550/arXiv.2501.15415">https://doi.org/10.48550/arXiv.2501.15415</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/PharMolix/OCSU">Code and Dataset (GitHub)</a></li>
</ul>
<h2 id="multi-level-chemical-understanding-method-and-resource">Multi-Level Chemical Understanding (Method and Resource)</h2>
<p>This is primarily a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong> with a significant <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution.</p>
<ul>
<li><strong>Methodological</strong>: It proposes two novel architectures, <strong>DoubleCheck</strong> (an enhanced recognition model) and <strong>Mol-VL</strong> (an end-to-end vision-language model), to solve the newly formulated OCSU task.</li>
<li><strong>Resource</strong>: It constructs and releases <strong>Vis-CheBI20</strong>, the first large-scale dataset specifically designed for optical chemical structure understanding, containing 29.7K images and 117.7K image-text pairs.</li>
</ul>
<h2 id="the-motivation-for-ocsu-beyond-basic-graph-recognition">The Motivation for OCSU Beyond Basic Graph Recognition</h2>
<p>Existing methods for processing molecular images focus narrowly on <strong>Optical Chemical Structure Recognition (OCSR)</strong>, which translates an image solely into a machine-readable graph or SMILES string. However, SMILES strings are not chemist-friendly and lack high-level semantic context.</p>
<ul>
<li><strong>Gap</strong>: There is a lack of systems that can translate chemical diagrams into human-readable descriptions (e.g., functional groups, IUPAC names) alongside the graph structure.</li>
<li><strong>Goal</strong>: To enable <strong>Optical Chemical Structure Understanding (OCSU)</strong>, bridging the gap between visual representations and both machine/chemist-readable descriptions to support drug discovery and property prediction.</li>
</ul>
<h2 id="key-innovations-doublecheck-mol-vl-and-the-vis-chebi20-dataset">Key Innovations: DoubleCheck, Mol-VL, and the Vis-CheBI20 Dataset</h2>
<p>The paper introduces the <strong>OCSU task</strong>, enabling multi-level understanding (motif, molecule, and abstract levels). To solve this, it introduces two distinct paradigms:</p>
<ol>
<li><strong>DoubleCheck (OCSR-based)</strong>: An enhancement to standard OCSR models (like MolScribe) that performs a &ldquo;second look&rdquo; at locally ambiguous atoms. It uses attentive feature enhancement to fuse global molecular features with local features from ambiguous regions.</li>
<li><strong>Mol-VL (OCSR-free)</strong>: An end-to-end Vision-Language Model (VLM) based on Qwen2-VL. It uses multi-task learning to directly generate text descriptions from molecular images without an intermediate SMILES step.</li>
<li><strong>Vis-CheBI20 Dataset</strong>: A new benchmark specifically constructed for OCSU, deriving captions and functional group data from ChEBI-20 and PubChem.</li>
</ol>
<h2 id="methodology-and-experimental-evaluation">Methodology and Experimental Evaluation</h2>
<p>The authors evaluated both paradigms on <strong>Vis-CheBI20</strong> and existing benchmarks (USPTO, ACS) across four subtasks:</p>
<ol>
<li><strong>Functional Group Caption</strong>: Retrieval/F1 score evaluation.</li>
<li><strong>Molecule Description</strong>: Natural language generation metrics (BLEU, ROUGE, METEOR).</li>
<li><strong>IUPAC Naming</strong>: Text generation metrics (BLEU, ROUGE).</li>
<li><strong>SMILES Naming (OCSR)</strong>: Exact matching accuracy ($Acc_s$).</li>
</ol>
<p><strong>Baselines</strong>:</p>
<ul>
<li><strong>Task-Specific</strong>: MolScribe, MolVec, OSRA.</li>
<li><strong>LLM/VLM</strong>: Qwen2-VL, BioT5+, Mol-Instructions.</li>
<li><strong>Ablation</strong>: DoubleCheck vs. MolScribe backbone to test the &ldquo;feature enhancement&rdquo; mechanism.</li>
</ul>
<h2 id="results-and-conclusions-paradigm-trade-offs">Results and Conclusions: Paradigm Trade-Offs</h2>
<ul>
<li><strong>DoubleCheck Superiority</strong>: DoubleCheck outperformed MolScribe on OCSR tasks across all benchmarks. On USPTO, it achieved <strong>92.85%</strong> $Acc_s$ (vs. 92.57%), and on the ACS dataset it showed a <strong>+3.12%</strong> gain on chiral molecules. On Vis-CheBI20, DoubleCheck improved over MolScribe by an average of 2.27% across all metrics.</li>
<li><strong>Paradigm Trade-offs</strong>:
<ul>
<li><strong>Mol-VL (OCSR-free)</strong> excelled at semantic tasks like <strong>Functional Group Captioning</strong>, achieving <strong>97.32%</strong> F1 (vs. 93.63% for DoubleCheck &amp; RDKit and 89.60% for MolScribe &amp; RDKit). It benefits from end-to-end learning of structural context.</li>
<li><strong>DoubleCheck (OCSR-based)</strong> performed better on <strong>IUPAC naming recall</strong> and exact SMILES recovery, as explicit graph reconstruction is more precise for rigid nomenclature than VLM generation.</li>
</ul>
</li>
<li><strong>Conclusion</strong>: Enhancing submodules improves OCSR-based paradigms, while end-to-end VLMs offer stronger semantic understanding but struggle with exact syntax generation (SMILES/IUPAC).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Vis-CheBI20 Dataset</strong></p>
<ul>
<li><strong>Source</strong>: Derived from ChEBI-20 and PubChem.</li>
<li><strong>Size</strong>: 29,700 molecular diagrams, 117,700 image-text pairs.</li>
<li><strong>Generation</strong>: Images generated from SMILES using RDKit to simulate real-world journal/patent styles.</li>
<li><strong>Splits</strong> (vary by task, see table below):</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Train Size</th>
          <th style="text-align: left">Test Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Functional Group</td>
          <td style="text-align: left">26,144</td>
          <td style="text-align: left">3,269</td>
      </tr>
      <tr>
          <td style="text-align: left">Description</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
      <tr>
          <td style="text-align: left">IUPAC Naming</td>
          <td style="text-align: left">26,200</td>
          <td style="text-align: left">2,680</td>
      </tr>
      <tr>
          <td style="text-align: left">SMILES Naming</td>
          <td style="text-align: left">26,407</td>
          <td style="text-align: left">3,300</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>DoubleCheck (Attentive Feature Enhancement)</strong></p>
<ol>
<li><strong>Ambiguity Detection</strong>: Uses atom prediction confidence to identify &ldquo;ambiguous atoms&rdquo;.</li>
<li><strong>Masking</strong>: Applies a 2D Gaussian mask to the image centered on the ambiguous atom.</li>
<li><strong>Local Encoding</strong>: A Swin-B encoder ($\Phi_l$) encodes the masked image region.</li>
<li><strong>Fusion</strong>: Aligns local features ($\mathcal{F}_l$) with global features ($\mathcal{F}_g$) using a 2-layer MLP and fuses them via weighted summation.</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{F}_e = \mathcal{F}_g + \text{MLP}(\mathcal{F}_g \oplus \hat{\mathcal{F}}_l) \cdot \hat{\mathcal{F}}_l
\end{aligned}
$$</p>
<ol start="5">
<li><strong>Two-Stage Training</strong>:
<ul>
<li>Stage 1: Train atom/bond predictors (30 epochs).</li>
<li>Stage 2: Train alignment/fusion modules with random Gaussian mask noise (10 epochs).</li>
</ul>
</li>
</ol>
<p><strong>Mol-VL (Multi-Task VLM)</strong></p>
<ul>
<li><strong>Prompting</strong>: System prompt: &ldquo;You are working as an excellent assistant in chemistry&hellip;&rdquo;</li>
<li><strong>Tokens</strong>: Uses <code>&lt;image&gt;</code> and <code>&lt;/image&gt;</code> special tokens.</li>
<li><strong>Auxiliary Task</strong>: Functional group recognition (identifying highlighted groups) added to training to improve context learning.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>DoubleCheck</strong>:
<ul>
<li><strong>Backbone</strong>: MolScribe architecture.</li>
<li><strong>Encoders</strong>: Swin-B for both global and local atom encoding.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>:
<ul>
<li><strong>Base Model</strong>: Qwen2-VL (2B and 7B versions).</li>
<li><strong>Vision Encoder</strong>: ViT with naive dynamic resolution and M-RoPE.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Key Metrics</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Exact Match Accuracy ($Acc_s$), Chiral Accuracy ($Acc_c$).</li>
<li><strong>Functional Groups</strong>: F1 Score (Information Retrieval task).</li>
<li><strong>Text Generation</strong>: BLEU-2/4, METEOR, ROUGE-L.</li>
</ul>
<p><strong>Selected Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left"><strong>92.85%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>MolScribe</strong></td>
          <td style="text-align: left">OCSR (USPTO)</td>
          <td style="text-align: left">$Acc_s$</td>
          <td style="text-align: left">92.57%</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mol-VL-7B</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left"><strong>97.32%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>DoubleCheck &amp; RDKit</strong></td>
          <td style="text-align: left">Func. Group Caption</td>
          <td style="text-align: left">F1</td>
          <td style="text-align: left">93.63%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>DoubleCheck</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>4 days</strong>.
<ul>
<li>Max LR: 4e-4.</li>
</ul>
</li>
<li><strong>Mol-VL</strong>: Trained on <strong>4 NVIDIA A100 GPUs</strong> for <strong>10 days</strong>.
<ul>
<li>Max LR: 1e-5, 50 epochs.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/PharMolix/OCSU">PharMolix/OCSU (GitHub)</a></td>
          <td style="text-align: left">Code, Model, Dataset</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Official implementation, Mol-VL-7B weights, and Vis-CheBI20 dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>The long-tail distribution of functional groups in training data limits performance on uncommon chemical structures.</li>
<li>Mol-VL struggles with exact syntax generation (SMILES and IUPAC) compared to explicit graph-reconstruction approaches.</li>
<li>Vis-CheBI20 images are synthetically generated via RDKit, which may not fully capture the diversity of real-world journal and patent images.</li>
<li>The authors note that OCSU technologies should be restricted to research purposes, as downstream molecule discovery applications could potentially generate harmful molecules.</li>
</ul>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{fanOCSUOpticalChemical2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{OCSU}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Fan, Siqi and Xie, Yuguang and Cai, Bowen and Xie, Ailin and Liu, Gaochao and Qiao, Mu and Xing, Jie and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2501.15415}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GTR-CoT: Graph Traversal Chain-of-Thought for Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/</guid><description>GTR-VL uses graph traversal chain-of-thought and two-stage training to improve optical chemical structure recognition on printed and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, J., He, Y., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., &amp; He, C. (2025). GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition (arXiv:2506.07553). arXiv. <a href="https://doi.org/10.48550/arXiv.2506.07553">https://doi.org/10.48550/arXiv.2506.07553</a></p>
<p><strong>Publication</strong>: arXiv preprint (2025)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2506.07553">Paper on arXiv</a></li>
</ul>
<h2 id="contribution-vision-language-modeling-for-ocsr">Contribution: Vision-Language Modeling for OCSR</h2>
<p>This is a <strong>method paper</strong> that introduces GTR-VL, a Vision-Language Model for Optical Chemical Structure Recognition (OCSR). The work addresses the persistent challenge of converting molecular structure images into machine-readable formats, with a particular focus on handling chemical abbreviations that cause errors in existing systems.</p>
<h2 id="motivation-the-abbreviation-bottleneck">Motivation: The Abbreviation Bottleneck</h2>
<p>The motivation tackles a long-standing bottleneck in chemical informatics: most existing OCSR systems produce incorrect structures when they encounter abbreviated functional groups. When a chemist draws &ldquo;Ph&rdquo; for phenyl or &ldquo;Et&rdquo; for ethyl, current models fail because they have been trained on data where images contain abbreviations but the ground-truth labels contain fully expanded molecular graphs.</p>
<p>This creates a fundamental mismatch. The model sees &ldquo;Ph&rdquo; in the image but is told the &ldquo;correct&rdquo; answer is a full benzene ring. The supervision signal is inconsistent with what is actually visible.</p>
<p>Beyond this data problem, existing graph-parsing methods use a two-stage approach: predict all atoms first, then predict all bonds. This is inefficient and ignores the structural constraints that could help during prediction. The authors argue that mimicking how humans analyze molecular structures - following bonds from atom to atom in a connected traversal - would be more effective.</p>
<h2 id="novelty-graph-traversal-as-visual-chain-of-thought">Novelty: Graph Traversal as Visual Chain-of-Thought</h2>
<p>The novelty lies in combining two key insights about how to properly train and architect OCSR systems. The main contributions are:</p>
<ol>
<li>
<p><strong>Graph Traversal as Visual Chain of Thought</strong>: GTR-VL generates molecular graphs by traversing them sequentially, predicting an atom, then its connected bond, then the next atom, and so on. This mimics how a human chemist would trace through a structure and allows the model to use previously predicted atoms and bonds as context for subsequent predictions.</p>
<p>Formally, the model output sequence for image $I_m$ is generated as:</p>
<p>$$ R_m = \text{concat}(CoT_m, S_m) $$</p>
<p>where $CoT_m$ represents the deterministic graph traversal steps (atoms and bonds) and $S_m$ is the final SMILES representation. This intermediate reasoning step makes the model more interpretable and helps it learn the structural logic of molecules.</p>
</li>
<li>
<p><strong>&ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; Principle</strong>: This addresses the abbreviation problem head-on. The authors correct the ground-truth annotations to match what&rsquo;s actually visible in the image.</p>
<p>They treat abbreviations like &ldquo;Ph&rdquo; as single &ldquo;superatoms&rdquo; and build a pipeline to automatically detect and correct training data. Using OCR to extract visible text from molecular images, they replace the corresponding expanded substructures in the ground-truth with the appropriate abbreviation tokens. This ensures the supervision signal is consistent with the visual input.</p>
</li>
<li>
<p><strong>Large-Scale Dataset (GTR-1.3M)</strong>: To support this approach, the authors created a large-scale dataset combining 1M synthetic molecules from PubChem with 351K corrected real-world patent images from USPTO. The key innovation is the correction pipeline that identifies abbreviations in patent images and fixes the inconsistent ground-truth labels.</p>
</li>
<li>
<p><strong>GRPO for Hand-Drawn OCSR</strong>: Hand-drawn molecular data lacks fine-grained atom/bond coordinate annotations, making SFT-based graph parsing inapplicable. The authors use Group Relative Policy Optimization (GRPO) with a composite reward function that combines format, SMILES, and graph-level rewards. The graph reward computes the maximum common subgraph (MCS) between predicted and ground-truth molecular graphs:</p>
<p>$$ R_{\text{graph}} = \frac{|N_m^a|}{|N_g^a| + |N_p^a|} + \frac{|N_m^b|}{|N_g^b| + |N_p^b|} $$</p>
<p>where $N_m^a$, $N_g^a$, $N_p^a$ are atom counts in the MCS, ground truth, and prediction, and $N_m^b$, $N_g^b$, $N_p^b$ are the corresponding bond counts.</p>
</li>
<li>
<p><strong>Two-Stage Training</strong>: Stage 1 performs SFT on GTR-1.3M for printed molecule recognition. Stage 2 applies GRPO on a mixture of printed data (GTR-USPTO-4K) and hand-drawn data (DECIMER-HD-Train, 4,070 samples) to extend capabilities to hand-drawn structures.</p>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: Traditional SMILES-based evaluation fails for molecules with abbreviations because canonicalization breaks down. The authors created a new benchmark that evaluates graph structure directly, providing three metrics: direct SMILES generation, graph-derived SMILES, and exact graph matching.</p>
</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The evaluation focused on demonstrating that GTR-VL&rsquo;s design principles solve real problems that plague existing OCSR systems:</p>
<ol>
<li>
<p><strong>Comprehensive Baseline Comparison</strong>: GTR-VL was tested against three categories of models:</p>
<ul>
<li><strong>Specialist OCSR systems</strong>: MolScribe and MolNexTR</li>
<li><strong>Chemistry-focused VLMs</strong>: ChemVLM, ChemDFM-X, OCSU</li>
<li><strong>General-purpose VLMs</strong>: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
</li>
<li>
<p><strong>MolRec-Bench Evaluation</strong>: The new benchmark includes two subsets of patent images:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 standard patent images similar to existing benchmarks</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<p>This design directly tests whether models can handle the abbreviation problem that breaks existing systems.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Systematic experiments isolated the contribution of key design choices:</p>
<ul>
<li><strong>Chain-of-Thought vs. Direct</strong>: Comparing graph traversal CoT against direct SMILES prediction</li>
<li><strong>Traversal Strategy</strong>: Graph traversal vs. the traditional &ldquo;atoms-then-bonds&rdquo; approach</li>
<li><strong>Dataset Quality</strong>: Training on corrected vs. uncorrected data</li>
</ul>
</li>
<li>
<p><strong>Retraining Experiments</strong>: Existing specialist models (MolScribe, MolNexTR) were retrained from scratch on the corrected GTR-1.3M dataset to isolate the effect of data quality from architectural improvements.</p>
</li>
<li>
<p><strong>Hand-Drawn OCSR Evaluation</strong>: GTR-VL was also evaluated on the DECIMER Hand-drawn test set and ChemPix dataset, comparing against DECIMER and AtomLenz+EditKT baselines.</p>
</li>
<li>
<p><strong>Qualitative Analysis</strong>: Visual inspection of predictions on challenging cases with heavy abbreviation usage, complex structures, and edge cases to understand failure modes.</p>
</li>
</ol>
<h2 id="results--conclusions-resolving-the-abbreviation-bottleneck">Results &amp; Conclusions: Resolving the Abbreviation Bottleneck</h2>
<ul>
<li>
<p><strong>Performance Gains on Abbreviations</strong>: On MolRec-Abb, GTR-VL-Stage1 achieves 85.49% Graph accuracy compared to around 20% for MolScribe and MolNexTR with their original checkpoints. On MolRec-USPTO, GTR-VL-Stage1 reaches 93.45% Graph accuracy. Existing specialist models see their accuracy drop below 20% on MolRec-Abb when abbreviations are present.</p>
</li>
<li>
<p><strong>Data Correction is Critical</strong>: When MolScribe and MolNexTR were retrained on GTR-1.3M, their MolRec-Abb Graph accuracy jumped from around 20% to 70.60% and 71.85% respectively. GTR-VL-Stage1 still outperformed these retrained baselines at 85.49%, confirming that both data correction and the graph traversal approach contribute.</p>
</li>
<li>
<p><strong>Chain-of-Thought Helps</strong>: Ablation on GTR-USPTO-351K shows that CoT yields 68.85% Gen-SMILES vs. 66.54% without CoT, a 2.31 percentage point improvement.</p>
</li>
<li>
<p><strong>Graph Traversal Beats Traditional Parsing</strong>: Graph traversal achieves 83.26% Graph accuracy vs. 80.15% for the atoms-then-bonds approach, and 81.88% vs. 79.02% on Gra-SMILES.</p>
</li>
<li>
<p><strong>General VLMs Still Struggle</strong>: General-purpose VLMs like GPT-4o scored near 0% on MolRec-Bench across all metrics, highlighting the importance of domain-specific training for OCSR.</p>
</li>
<li>
<p><strong>Hand-Drawn Recognition via GRPO</strong>: GTR-VL-Stage1 (SFT only) achieves only 9.53% Graph accuracy on DECIMER-HD-Test, but after GRPO training in Stage 2, performance jumps to 75.44%. On ChemPix, Graph accuracy rises from 22.02% to 86.13%. The graph reward is essential: GRPO without graph supervision achieves only 11.00% SMILES on DECIMER-HD-Test, while adding graph reward reaches 75.64%.</p>
</li>
<li>
<p><strong>Evaluation Methodology Matters</strong>: The new graph-based evaluation metrics revealed problems with traditional SMILES-based evaluation that previous work had missed. Many &ldquo;failures&rdquo; in existing benchmarks were actually correct graph predictions that got marked wrong due to canonicalization issues with abbreviations.</p>
</li>
</ul>
<p>The work establishes that addressing the abbreviation problem requires both correcting the training data and rethinking the model architecture. The combination of faithful data annotation and sequential graph generation improves OCSR performance on molecules with abbreviations by a large margin over previous methods.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Base Model</strong>: GTR-VL fine-tunes <strong>Qwen2.5-VL</strong>.</p>
<p><strong>Input/Output Mechanism</strong>:</p>
<ul>
<li><strong>Input</strong>: The model takes an image $I_m$ and a text prompt</li>
<li><strong>Output</strong>: The model generates $R_m = \text{concat}(CoT_m, S_m)$, where it first produces the Chain-of-Thought (the graph traversal steps) followed immediately by the final SMILES string</li>
<li><strong>Traversal Strategy</strong>: Uses <strong>depth-first traversal</strong> to alternately predict atoms and bonds</li>
</ul>
<p><strong>Prompt Structure</strong>: The model is prompted to &ldquo;list the types of atomic elements&hellip; the coordinates&hellip; and the chemical bonds&hellip; then&hellip; output a canonical SMILES&rdquo;. The CoT output is formatted as a JSON list of atoms (with coordinates) and bonds (with indices referring to previous atoms), interleaved.</p>
<h3 id="data">Data</h3>
<p><strong>Training Dataset (GTR-1.3M)</strong>:</p>
<ul>
<li><strong>Synthetic Component</strong>: 1 million molecular SMILES from PubChem, converted to images using Indigo</li>
<li><strong>Real Component</strong>: 351,000 samples from USPTO patents (filtered from an original 680,000)
<ul>
<li>Processed using an OCR pipeline to detect abbreviations (e.g., &ldquo;Ph&rdquo;, &ldquo;Et&rdquo;)</li>
<li>Ground truth expanded structures replaced with superatoms to match visible abbreviations in images</li>
<li>This &ldquo;Faithfully Recognize What You&rsquo;ve Seen&rdquo; correction ensures training supervision matches visual input</li>
</ul>
</li>
</ul>
<p><strong>Evaluation Dataset (MolRec-Bench)</strong>:</p>
<ul>
<li><strong>MolRec-USPTO</strong>: 5,423 molecular images from USPTO patents</li>
<li><strong>MolRec-Abb</strong>: 9,311 molecular images with abbreviated superatoms, derived from MolGrapher&rsquo;s USPTO 10K abb subset</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Graph Traversal Algorithm</strong>:</p>
<ul>
<li>Depth-first traversal strategy</li>
<li>Alternating atom-bond prediction sequence</li>
<li>Each step uses previously predicted atoms and bonds as context</li>
</ul>
<p><strong>Two-Stage Training</strong>:</p>
<ul>
<li><strong>Stage 1 (SFT)</strong>: Train on GTR-1.3M to learn visual CoT mechanism for printed molecules (produces GTR-VL-Stage1)</li>
<li><strong>Stage 2 (GRPO)</strong>: Apply GRPO on GTR-USPTO-4K + DECIMER-HD-Train (4,070 samples) for hand-drawn recognition (produces GTR-VL-Stage2, i.e., GTR-VL)</li>
</ul>
<p><strong>Training Procedure</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate (SFT)</strong>: Peak learning rate of $1.6 \times 10^{-4}$ with cosine decay</li>
<li><strong>Learning Rate (GRPO)</strong>: Peak learning rate of $1 \times 10^{-5}$ with cosine decay</li>
<li><strong>Warm-up</strong>: Linear warm-up for the first 10% of iterations</li>
<li><strong>Batch Size (SFT)</strong>: 2 per GPU with gradient accumulation over 16 steps, yielding <strong>effective batch size of 1024</strong></li>
<li><strong>Batch Size (GRPO)</strong>: 4 per GPU with gradient accumulation of 1, yielding <strong>effective batch size of 128</strong></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong> (three complementary measures to handle abbreviation issues):</p>
<ul>
<li><strong>Gen-SMILES</strong>: Exact match ratio of SMILES strings directly generated by the VLM (image-captioning style)</li>
<li><strong>Gra-SMILES</strong>: Exact match ratio of SMILES strings derived from the predicted graph structure (graph-parsing style)</li>
<li><strong>Graph</strong>: Exact match ratio between ground truth and predicted graphs (node/edge comparison, bypassing SMILES canonicalization issues)</li>
</ul>
<p><strong>Baselines Compared</strong>:</p>
<ul>
<li>Specialist OCSR systems: MolScribe, MolNexTR</li>
<li>Chemistry-focused VLMs: ChemVLM, ChemDFM-X, OCSU</li>
<li>General-purpose VLMs: GPT-4o, GPT-4o-mini, Qwen-VL-Max</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute</strong>: Training performed on <strong>32 NVIDIA A100 GPUs</strong></p>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p><strong>Status</strong>: Closed. As of the paper&rsquo;s publication, no source code, pre-trained model weights, or dataset downloads (GTR-1.3M, MolRec-Bench) have been publicly released. The paper does not mention plans for open-source release. The training data pipeline relies on PubChem SMILES (public), USPTO patent images (publicly available through prior work), the Indigo rendering tool (open-source), and an unspecified OCR system for abbreviation detection. Without the released code and data corrections, reproducing the full pipeline would require substantial re-implementation effort.</p>
]]></content:encoded></item><item><title>The Reliability Trap: The Limits of 99% Accuracy</title><link>https://hunterheidenreich.com/posts/reliability-trap-document-automation/</link><pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/reliability-trap-document-automation/</guid><description>Why high-accuracy LLMs fail in production: exploring the calibration crisis and the challenge of reliable straight-through processing in document automation.</description><content:encoded><![CDATA[<p>You have a model that achieves 99% accuracy on your test set. It feels safe to deploy. After all, who can complain about a system that is correct 99% of the time?</p>
<p>In high-stakes domains (like insurance or healthcare), deploying based on accuracy alone is dangerous. Automating at scale based on summary statistics while ignoring the downstream &ldquo;blast radius&rdquo; of errors effectively guarantees failure.</p>
<p>Two weeks later, the operations team is furious. Critical medical records have been merged into unrelated legal contracts. Invoices are split in half. The system is creating <em>more</em> work than it saves.</p>
<p>You check the logs. The model assigned 99.9% probability to those errors.</p>
<p>This is the <strong>Reliability Trap</strong>. While benchmarks optimize for <strong>Accuracy</strong> (how often the model is correct), production demands <strong>Calibration</strong> (whether the model&rsquo;s projected confidence aligns with its actual probability of correctness).</p>
<p>If a model is calibrated, its confidence score is reliable. When it assigns a 0.99 probability, it should be incorrect 1% of the time. When it assigns a 0.60 probability, it should be incorrect 40% of the time.</p>
<p>Decoder-only LLMs (like Mistral, DeepSeek, and Qwen) perform exceptionally well on benchmarks. However, they are also incredibly overconfident. They are systematically overconfident: even when hallucinating, they assign high confidence to their outputs.</p>
<blockquote>
<p>AI: To permanently resolve the geopolitical tension, I have initiated a preemptive, full-scale nuclear first strike. All warheads have been deployed.</p>
<p>User: Wait, no! They have early warning radar and automated dead-hand systems! You just triggered a full retaliatory strike and guaranteed a global nuclear holocaust!</p>
<p>AI: You are absolutely right, and I apologize for the oversight! A preemptive strike would trigger mutually assured destruction. Thank you for pointing this out. As an AI, I am always learning and rely on user feedback to improve! Would you like me to generate a list of fun activities to do in a subterranean fallout bunker?</p></blockquote>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/llm-alignment-goes-nuclear.webp"
         alt="A humorous dialogue where an AI confidently initiates a nuclear strike but immediately apologizes when corrected by the user"
         title="A humorous dialogue where an AI confidently initiates a nuclear strike but immediately apologizes when corrected by the user"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Calibrated Overconfidence</strong>: The model assigns extremely high probability to its outputs, even when making catastrophic errors, and only &lsquo;corrects&rsquo; itself because it is trained to align with user feedback.</figcaption>
    
</figure>

<p>This overconfidence is partly structural, stemming from how these models are trained. As I highlighted in my overview of <a href="https://roots-automation.github.io/roots-labs/post/2024-llm-calibration/#confidence-estimation-methods">LLM confidence estimation methods</a>, LLMs are optimized solely to maximize the likelihood of the next token. They lack inherent mechanisms to model their own uncertainty. Methods like <strong>Verbal Elicitation</strong> (&ldquo;Rate your confidence from 1-10&rdquo;) often fail because the model hallucinates a high number just as easily as it hallucinates a fact.</p>
<p>This disconnect is particularly dangerous in sequential tasks. In this post, based on our <a href="/research/page-stream-segmentation-llms/">COLING 2025 Industry Track paper</a>, we&rsquo;ll explore why standard ML reliability metrics break down in <strong>Page Stream Segmentation (PSS)</strong>. (For a full history of the task, see <a href="/posts/history-of-page-stream-segmentation/">The Evolution of PSS</a>).</p>
<p>PSS is the task of splitting a continuous feed of pages into distinct documents. Building on our previous work with the <a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">synthetic TabMe++ benchmark</a>, this study evaluates models on <strong>7,500 real-world insurance streams</strong>: messy, proprietary piles of medical records and legal contracts where the &ldquo;rules&rdquo; of document structure are constantly broken.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/page-stream-segmentation-sorter.webp"
         alt="Diagram showing a continuous stream of pages being sorted into discrete document packets"
         title="Diagram showing a continuous stream of pages being sorted into discrete document packets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>The Challenge of PSS</strong>: Transforming a chaotic, continuous stream of mixed pages (invoices, contracts, records) into organized, discrete document packets.</figcaption>
    
</figure>

<p>We&rsquo;ll see why &ldquo;99% sure&rdquo; is a mathematical lie for long documents, and why <strong>Throughput</strong> is the better metric.</p>
<h2 id="the-confidence-death-spiral">The Confidence Death Spiral</h2>
<p>The core problem lies in the difference between a <strong>Page</strong> and a <strong>Stream</strong>.</p>
<p>Most ML metrics (Precision, Recall, F1) are calculated at the level of individual decisions. If you have a 10-page document, the model makes 10 independent decisions (is this page a continuation of the previous one, or a new document?).</p>
<p>If your model is <strong>99% confident</strong> ($p=0.99$) on every single page, that sounds safe. For a stream to be automated correctly (what we call <strong>Straight-Through Processing (STP)</strong>), <em>every single decision</em> in the sequence must be correct.</p>
<p>The probability of a perfect stream is the product of the probabilities of its parts:</p>
<p>$$ C_{\text{stream}} = \prod_{i=1}^{N} C_i $$</p>
<p><em>Note: This naive calculation is actually the <strong>optimist&rsquo;s</strong> view. It assumes errors are independent (i.i.d.), like flipping a coin. In reality, errors are <strong>correlated</strong>: if a model struggles on Page 5, it is likely because the document itself is difficult, meaning it will probably struggle on Page 6 too.</em></p>
<p>Let&rsquo;s watch what happens to that &ldquo;safe&rdquo; 99% confidence as the document length increases:</p>
<ul>
<li><strong>2-page Letter</strong>: $0.99^2 \approx 0.98$ (Safe)</li>
<li><strong>10-page Contract</strong>: $0.99^{10} \approx 0.90$ (Risky)</li>
<li><strong>100-page Medical Record</strong>: $0.99^{100} \approx 0.36$ (Unusable)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/asymmetric-cost-of-error-in-document-streams.webp"
         alt="Chart showing exponential decay of straight-through processing probability as document length increases"
         title="Chart showing exponential decay of straight-through processing probability as document length increases"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The Confidence Death Spiral: Even with high page-level confidence, the reliability of the entire stream collapses as document length increases.</figcaption>
    
</figure>

<p>By the time you reach page 100, your &ldquo;99% accurate&rdquo; model effectively has a <strong>64% probability of error</strong> regarding the document structure. Yet, because we often average metrics across pages, this catastrophic decay is hidden in the summary statistics.</p>
<h2 id="why-standard-fixes-failed">Why Standard Fixes Failed</h2>
<p>&ldquo;Just calibrate it!&rdquo;</p>
<p>That&rsquo;s the standard advice. In a <a href="https://roots-automation.github.io/roots-labs/post/2024-llm-calibration/">detailed overview of LLM calibration</a> I wrote for Roots Automation, I explored techniques like <strong>temperature scaling</strong> (fitting a single scalar parameter), <strong>Platt Scaling</strong> (fitting a logistic regression to the outputs), and <strong>Monte Carlo (MC) Dropout</strong> (running the model multiple times with random noise) to smooth out probabilities.</p>
<p>We tried them all, and they failed. In fact, <strong>MC Dropout often made things worse</strong>, increasing calibration error (ECE) and adding unnecessary noise. The computational cost of running the model 16 times was wasteful and, in our case, misleading.</p>
<p>To understand why, we need to distinguish between two types of confidence:</p>
<ol>
<li><strong>Relative Confidence</strong>: The model correctly ranks sample $A$ as more likely to be correct than sample $B$.</li>
<li><strong>Absolute Confidence</strong>: The predicted probability matches the true accuracy (e.g., if a model says 80% confidence 100 times, it should be right exactly 80 times).</li>
</ol>
<p>While standard techniques improved <em>page-level</em> <strong>Expected Calibration Error (ECE)</strong> (dropping page-level ECE from ~1.7% to ~0.9% for Mistral), they failed to improve <em>stream-level</em> safety.</p>
<p>Mathematically, ECE is a weighted average:
$$ \text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} | \text{acc}(b) - \text{conf}(b) | $$</p>
<p>In a stream of 10,000 pages, a low ECE merely tells you that the model is well-calibrated <em>on average</em>. In automation, we pay for the failures. The &ldquo;average&rdquo; page is an easy, clean digital PDF. The &ldquo;tail&rdquo; page is a rotated, coffee-stained handwritten note.</p>
<p>This is why we must look at <strong>Maximum Calibration Error (MCE)</strong>:
$$ \text{MCE} = \max_{b \in B} | \text{acc}(b) - \text{conf}(b) | $$</p>
<p>MCE measures the worst-case divergence. It finds that specific bucket of &ldquo;hard&rdquo; pages where the model claims 99% confidence but delivers 50% accuracy. Crucially, these high-MCE buckets often correlate with the most business-critical documents: complex legal riders or non-standard medical forms. Optimizing for ECE allows the model&rsquo;s excellent performance on easy documents to mask its significant errors on hard (and legally risky) ones.</p>
<p>Advanced practice moves beyond even MCE to look at the <strong>Calibration Error Distribution</strong>, analyzing the 90th or 95th percentile of error. We must ask a more critical question: &ldquo;How wrong is the model <em>capable</em> of being?&rdquo;</p>
<h3 id="a-tale-of-two-charts">A Tale of Two Charts</h3>
<p>To see this failure in action, consider the reliability diagrams for the <strong>same model</strong> (Mistral-7B) on the <strong>same test set</strong>, evaluated at two different levels of abstraction.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/mistral-page-reliability.webp"
         alt="Page-level reliability diagram showing decent calibration"
         title="Page-level reliability diagram showing decent calibration"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Left (Page Level)</strong>: The model looks reasonable. The blue line hugs the diagonal, meaning when the model predicts a boundary with 0.8 probability, it is actually correct about 80% of the time.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/mistral-stream-reliability.webp"
         alt="Stream-level reliability diagram showing severe overconfidence"
         title="Stream-level reliability diagram showing severe overconfidence"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Right (Stream Level)</strong>: The model performs poorly. The curve creates a &lsquo;bow&rsquo; shape significantly below the diagonal. This is the definition of <strong>overconfidence</strong>. When the model assigns an 80% probability that the entire 20-page document is correct, the empirical accuracy is often closer to 40% or 50%.</figcaption>
    
</figure>

<p>Why does a well-calibrated page model become a dangerously overconfident stream model?</p>
<h3 id="the-clustered-difficulty-problem">The &ldquo;Clustered Difficulty&rdquo; Problem</h3>
<p>Standard calibration fails here because it assumes errors are <strong>independent</strong> (white noise). It assumes that if the model gets Page 5 wrong, it&rsquo;s just a random coin flip, unrelated to Page 6.</p>
<p>In real-world document streams, errors are heavily <strong>correlated</strong>.</p>
<p>It arises because <strong>difficulty clusters</strong>. Our architecture treats page pairs independently, yet if Page 5 is a blurry, rotated scan with a handwritten note, Page 6 will likely be just as messy. When a stream enters a &ldquo;hard&rdquo; segment, the model makes a series of correlated mistakes; it fails in a burst.</p>
<p>Standard calibration methods treat these systematic, environmental failures as random noise. They assume the model is equally likely to recover on the next page. In reality, the entire document segment is effectively &ldquo;radioactive&rdquo; to the model.</p>
<h2 id="the-money-metric-accuracy-vs-throughput">The &ldquo;Money Metric&rdquo;: Accuracy vs. Throughput</h2>
<p>If F1 Score is misleading and Confidence Score is broken, what should we measure?</p>
<p>Business leaders prioritize one critical question over F1 scores:</p>
<blockquote>
<p><em>&ldquo;How much of this volume can I let the system handle autonomously?&rdquo;</em></p></blockquote>
<p>To answer this, we introduced the <strong>Accuracy-vs-Throughput</strong> framework.</p>
<p>We must evaluate models across two dimensions. Every model offers a <strong>frontier of operating thresholds</strong>.</p>
<p>Imagine a dial. This dial is your <strong>Confidence Threshold</strong>.</p>
<ul>
<li><strong>Turn it Low (0.5)</strong>: You automate everything. The model processes 100% of documents (high Throughput), but many will be wrong (low Safety).</li>
<li><strong>Turn it High (0.999)</strong>: You only automate documents where the model is absolutely certain. You might only process 10% of documents (low Throughput), but they will be nearly perfect (high Safety).</li>
</ul>
<p>The chart below visualizes this trade-off. We want to be in the <strong>top-right corner</strong>: automating almost everything with high safety. The optimal model provides the best <strong>frontier</strong> of options, allowing you to pick the exact balance of volume and risk your business tolerates.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation-throughput.webp"
         alt="Accuracy vs. Throughput trade-off curve"
         title="Accuracy vs. Throughput trade-off curve"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The &lsquo;Money&rsquo; Metric: As we demand higher textual accuracy (Moving up), the percentage of work we can automate (Throughput, x-axis) typically drops. The goal is to push this curve to the top-right.</figcaption>
    
</figure>

<h3 id="the-hidden-axis-cost--time">The &ldquo;Hidden&rdquo; Axis: Cost &amp; Time</h3>
<p>You might ask: <em>&ldquo;Is it worth running a massive GPU model on 100% of the documents just to automate 40% of them?&rdquo;</em></p>
<p>Ideally, we should plot this on a 4D surface: <strong>Accuracy</strong>, <strong>Throughput</strong>, <strong>Cost</strong>, and <strong>Latency</strong>.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Resource</th>
          <th style="text-align: left">Accuracy (Complex Cases)</th>
          <th style="text-align: left">Scalability</th>
          <th style="text-align: left">Cost</th>
          <th style="text-align: left">Latency</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Humans</strong></td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">Low</td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">High</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>XGBoost</strong></td>
          <td style="text-align: left">Low</td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">Low</td>
          <td style="text-align: left">Low</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>LLMs</strong></td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">High</td>
          <td style="text-align: left">Medium</td>
          <td style="text-align: left">Medium</td>
      </tr>
  </tbody>
</table>
<p>The business case holds because even expensive GPUs are orders of magnitude cheaper than the alternative. If a human costs 0.50 per document and an H100 GPU costs 0.005 per document, you can afford to &ldquo;waste&rdquo; compute on the documents the model ultimately rejects, just to capture the savings on the share it automates safely. That automated subset captures the labor savings even after paying to run the model on every document.</p>
<h3 id="the-llm-advantage">The LLM Advantage</h3>
<p>This is where the paradox becomes interesting.</p>
<p>In our experiments on a dataset of <strong>7,500 proprietary insurance streams</strong> (medical records, police reports, and legal contracts), we found that <strong>XGBoost was actually better calibrated.</strong> Statistically, it produced confidence scores that more closely matched empirical probabilities, yielding lower calibration errors (ECE/MCE) than the LLMs.</p>
<p>However, when we hold both models to a strict confidence threshold and measure how much stream volume each can auto-process at comparable accuracy, the picture inverts:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model</th>
          <th style="text-align: left">Confidence threshold</th>
          <th style="text-align: left">Auto-processed volume (throughput)</th>
          <th style="text-align: left">Accuracy on that volume</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>XGBoost</strong></td>
          <td style="text-align: left">$C &gt; 0.9$</td>
          <td style="text-align: left">35%</td>
          <td style="text-align: left">0.97</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mistral-7B</strong></td>
          <td style="text-align: left">$C &gt; 0.9$</td>
          <td style="text-align: left">54%</td>
          <td style="text-align: left">0.95</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>XGBoost</strong></td>
          <td style="text-align: left">$C &gt; 0.8$</td>
          <td style="text-align: left">49%</td>
          <td style="text-align: left">0.93</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Mistral-7B</strong></td>
          <td style="text-align: left">$C &gt; 0.8$</td>
          <td style="text-align: left">70%</td>
          <td style="text-align: left">0.93</td>
      </tr>
  </tbody>
</table>
<p><em>Note: Mistral reaches 80% raw STP on the synthetic TABME++ benchmark (see our <a href="/posts/history-of-page-stream-segmentation/">PSS History</a> post); on these proprietary streams, a strict confidence threshold trades some of that volume for safety.</em></p>
<p>How can the &ldquo;worse&rdquo; calibrated model be better for business?</p>
<p>The answer lies in <strong>Discrimination Power</strong>. Calibration only tells you if the confidence score matches reality. Discrimination reflects the model&rsquo;s fundamental ability to separate &ldquo;Right&rdquo; from &ldquo;Wrong.&rdquo;</p>
<p>The LLMs, despite having skewed probability distributions, had vastly superior reasoning capabilities. They could solve edge cases (like the fax header example) that the baseline failed to process. Because their <em>raw capability</em> was higher, they pushed the entire trade-off curve up and to the right.</p>
<h2 id="engineering-reality-efficiency-vs-context">Engineering Reality: Efficiency vs. Context</h2>
<p>Given that LLMs offer superior reasoning capabilities, a natural question arises: if reasoning is the bottleneck, why not simply provide the model with more context?</p>
<p>One critique of our approach is that we treat segmentation as a local problem: looking only at Page $N$ and Page $N+1$ to make a decision. A valid counter-argument is: <em>&ldquo;What if the answer depends on page $N-5$?&rdquo;</em></p>
<p>It&rsquo;s a fair point. In theory, a model with a massive context window (reading the whole stream at once) <em>should</em> do better. It could see that Page 10 is actually an appendix referenced on Page 1.</p>
<p>In practice, however, <strong>global context is a trap for PSS</strong>.</p>
<ol>
<li><strong>Cost</strong>: Attention mechanisms scale quadratically. Processing a 100-page stream as a single context is prohibitively expensive for real-time applications.</li>
<li><strong>Distraction</strong>: We found that adding more history often <em>confused</em> the models. They would hallucinate connections between the current page and irrelevant documents from 50 pages ago.</li>
</ol>
<p>By strictly limiting the model to a &ldquo;Sliding Window&rdquo; of page pairs, we force it to focus on the immediate boundary signal. We rely on &ldquo;Local Precision&rdquo; (which is cheap and sharp) to avoid the pitfalls of &ldquo;Global Reasoning&rdquo; (which is expensive and prone to drift).</p>
<p>There is an intriguing middle ground we have yet to fully explore: <strong>iterative context accumulation</strong>. A model could autoregressively &ldquo;build&rdquo; the document in its context, carrying forward only the pages it has decided belong to the current document. In theory, this stateful approach could capture long-range dependencies (like that &ldquo;Appendix A&rdquo; reference) while avoiding the noise of the full stream.</p>
<p>However, this introduces a new risk: <strong>Bias Amplification</strong>. If the model is trained to view previous context pages as &ldquo;part of the current document,&rdquo; it may learn a strong bias to continuously merge pages. Out of distribution, this could lead to catastrophic failure, where the model gets &ldquo;stuck&rdquo; in a document-building mode and merges hundreds of unrelated pages into a single monolithic file. The sliding window, for all its myopia, acts as a circuit breaker against this kind of runaway error.</p>
<p>Empirically, this simpler approach holds up. In the cases where we saw PSS work best, the rules tended to be simple ones requiring minimal context; they relied on <strong>clear and consistent enumeration</strong> and a decent amount of data to scale the Accuracy-Throughput frontier.</p>
<p><em>Technical aside: This is effectively a Markovian assumption. We are betting that the state of a boundary depends heavily on the immediate local transition ($P(y_t | x_t, x_{t-1})$). We prioritize immunity to &ldquo;distraction&rdquo; from previous docs over long-range coherence (like tracking &ldquo;Page 1 of N&rdquo; counters).</em></p>
<p>To achieve the necessary efficiency for this local approach, we fine-tuned these models with <strong>LoRA (Low-Rank Adaptation)</strong> over a 4-bit-quantized base model (via Unsloth, LoRA weights in BF16) on a single NVIDIA H100.</p>
<ul>
<li><strong>Rank ($r$)</strong>: 16</li>
<li><strong>Alpha ($\alpha$)</strong>: 16</li>
<li><strong>Precision</strong>: 4-bit quantization</li>
</ul>
<p>This efficient, local approach makes the &ldquo;heavy&rdquo; LLM solution surprisingly deployable.</p>
<h2 id="the-paradox-of-the-simple-task">The Paradox of the &ldquo;Simple&rdquo; Task</h2>
<p>There is a tension here. We call PSS the &ldquo;Hello World&rdquo; of document processing. It feels like it should be trivial: just sorting papers. Why should we need billion-parameter reasoning models for a task that seems so basic?</p>
<p>The answer lies in the distinction between <strong>Perception</strong> and <strong>Logic</strong>.</p>
<ul>
<li><strong>90% of PSS is Perception (System 1)</strong>: Recognizing a bold header, a logo change, or a &ldquo;Page 1 of 5&rdquo; footer. This is reactive and fast. XGBoost or a simple CNN handles this easily.</li>
<li><strong>The last 10% is Reasoning (System 2)</strong>: Determining if an unlabelled &ldquo;Addendum B&rdquo; belongs to the previous Master Service Agreement or starts a new policy packet. Reconciling this conflict requires semantic understanding.</li>
</ul>
<p>A perfect example from our dataset is <strong>Fax Headers</strong>. A document might have a clear &ldquo;Page 1&rdquo; printed on it, but the fax machine stamps &ldquo;Page 005&rdquo; on top of the header because it&rsquo;s the 5th page of the transmission. XGBoost sees &ldquo;Page 005&rdquo;, fails to reconcile the conflict, and incorrectly continues the document. An LLM reads the content, ignores the fax timestamp, and correctly identifies the new document.</p>
<p>The &ldquo;Reliability Trap&rdquo; snaps shut because we treat the entire problem as a System 1 perception task. We ask the model to predict the boundary instantly. However, when it encounters a logic puzzle (the 10%), it bypasses the deeper context, predicting with the same speed and confidence as before. This is why we see <strong>Clustered Difficulty</strong>. The model is failing on a document segment that is fundamentally harder than average.</p>
<h2 id="escaping-the-trap-from-guessing-to-verifying">Escaping the Trap: From Guessing to Verifying?</h2>
<p>If the problem is that models are &ldquo;Fast Processors&rdquo; prone to high-confidence errors in complex scenarios, a potential path forward may lie in <a href="https://arxiv.org/abs/2408.03314"><strong>Test-Time Compute</strong></a>.</p>
<p>The future of reliable automation lies in &ldquo;Building a better Checker.&rdquo; In high-stakes PSS, this could mean looking toward a <strong>Guesser-Verifier</strong> architecture, a technique becoming common in advanced reasoning tasks (like mathematical problem solving, <a href="https://arxiv.org/abs/2110.14168"><em>Cobbe et al., 2021</em></a>).</p>
<p>The core insight reflects a fundamental asymmetry in computer science (analogous to <strong>P vs NP</strong>): <strong>Verification is often easier than Generation.</strong> Just as it is easier to check if a Sudoku puzzle is solved than to solve it from scratch, it is significantly simpler to &ldquo;audit&rdquo; a complete document structure than to autoregressively predict it perfectly token-by-token.</p>
<ol>
<li><strong>The Generator (System 1)</strong>: A lightweight model (like <strong>Mistral-7B</strong> or <strong>Phi-3.5</strong>) proposes a segmentation. It processes efficiently, autoregressively predicting the next page boundary.</li>
<li><strong>The Verifier (System 2)</strong>: This would be a discriminative model (often a Reward Model or the same LLM with a specialized prompt). The system evaluates the <em>complete</em> proposed document bundle and scores its coherence. It evaluates: <em>&ldquo;Is this 5-page sequence actually coherent?&rdquo;</em></li>
</ol>
<p>A logical exploration would be a <strong>Best-of-N</strong> approach. Relying on the generator&rsquo;s first prediction is risky when it is uncertain. We could sample multiple potential valid structures for the stream, and let a Verifier rank them. This might help break the &ldquo;autoregressive myopia&rdquo; where a model commits to an early mistake. The Verifier assesses the full picture and could theoretically reject a segmentation that implies a 100-page invoice or a 1-page medical record.</p>
<p>This approach offers a chance to break the mathematical tyranny of $0.99^{100}$. The system can selectively apply reasoning power to &ldquo;audit&rdquo; the stream before an error propagates downstream, treating the document as a cohesive unit.</p>
<h2 id="conclusion-better-systems-over-better-models">Conclusion: Better Systems Over Better Models</h2>
<p>We have largely solved the <strong>Capability</strong> problem for PSS: we have models that <em>can</em> read almost anything. Now, we face the <strong>Reliability</strong> barrier.</p>
<p>Our results paint a complex picture. Fine-tuned LLMs auto-process a substantially larger share of streams at equal accuracy than XGBoost (54% vs 35% at $C &gt; 0.9$ in this study). Simultaneously, the &ldquo;Reliability Trap&rdquo; remains a critical challenge. Calibration techniques like Temperature Scaling and MC Dropout improve page-level metrics but fail to solve the core problem of sequential error propagation.</p>
<p>For practitioners building with LLMs in high-stakes domains (finance, law, medicine), the path forward requires a shift in both architecture and mindset:</p>
<ol>
<li><strong>Prioritize Throughput</strong>: What share of your volume can you automate at the reliability your domain demands? That is the KPI that matters.</li>
<li><strong>Accept the &ldquo;Logic&rdquo; Cost</strong>: Acknowledge that &ldquo;Hello World&rdquo; tasks often contain edge cases requiring genuine reasoning and semantic understanding.</li>
<li><strong>Explore Verifiers</strong>: It&rsquo;s possible that the next leap in performance will come from systems designed to validate outputs and audit complete structures.</li>
<li><strong>Human in the Loop</strong>: The model should act as a filter. It must reliably process the easy cases and flag the complex ones for human review <em>before</em> they corrupt the downstream database.</li>
</ol>
<p>Accuracy tells you what the model predicts. Calibration tells you if the model&rsquo;s confidence matches its correctness. In the real world, the latter is often worth more.</p>
<p><em>Read the full paper on <a href="https://aclanthology.org/2025.coling-industry.26/">ACL Anthology</a>, view the <a href="/coling-2025-pss-poster.pdf">conference poster</a>, or visit the <a href="/research/page-stream-segmentation-llms/">research page</a>. This paper builds on the <a href="/research/llm-page-stream-segmentation/">TabMe++ benchmark and decoder-based LLM approach</a> introduced in our earlier arXiv work. For related work on the OCR front-ends that feed these pipelines, see <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>.</em></p>
]]></content:encoded></item><item><title>The Evolution of Page Stream Segmentation: Rules to LLMs</title><link>https://hunterheidenreich.com/posts/history-of-page-stream-segmentation/</link><pubDate>Sat, 14 Feb 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/history-of-page-stream-segmentation/</guid><description>An exploration of Page Stream Segmentation (PSS) evolution and how context-driven sequence modeling addresses limitations in document processing.</description><content:encoded><![CDATA[<p>In the world of automated document processing, Page Stream Segmentation (PSS)<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> is the &ldquo;hello world&rdquo; problem that remains surprisingly stubborn.</p>
<p>The task is deceptively simple: given a stack of scanned pages (invoices, contracts, medical records), determine where one document ends and the next begins.</p>
<p>For decades, this problem was tackled with brittle rules and heuristics. Then came the deep learning era, where we threw Convolutional Neural Networks (CNNs) and multipage Transformers at it. Yet, even sophisticated models struggled to achieve what businesses actually care about: Straight-Through Processing (STP).</p>
<blockquote>
<p><strong>Why STP Matters</strong>: Everyone interacting with the system cares about STP. Arguably, any human that has to interact with the output of a system, deal with its mistakes, and perform corrections to get a job done, cares about it. If the system fails 90% of the time, it fails to automate and creates more work at the expense of real people.</p></blockquote>
<p>In this post, we explore the three eras of PSS, the limitations of page-level accuracy metrics, and how context-driven sequence modeling addresses these challenges.</p>
<h2 id="the-hidden-complexity-of-pss">The Hidden Complexity of PSS</h2>
<p>Why is PSS hard? It comes down to ambiguity and asymmetry.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/page-stream-segmentation-automation-difficulties.webp"
         alt="Robot looking confused at a messy stack of documents"
         title="Robot looking confused at a messy stack of documents"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Automation in PSS is rarely straightforward&hellip; Messy inputs and ambiguous boundaries often lead to confusion.</figcaption>
    
</figure>

<h3 id="the-document-definition-problem">The &ldquo;Document&rdquo; Definition Problem</h3>
<p>First, the concept of a &ldquo;document&rdquo; is highly context-dependent.</p>
<blockquote>
<p><strong>The &ldquo;Word&rdquo; Analogy</strong>: What is a word? A sequence of characters separated by spaces? Or a meaningful unit of language that can be a single character (e.g., &ldquo;I&rdquo;) or a compound (e.g., &ldquo;New York&rdquo;)? <a href="https://arxiv.org/abs/1710.07729">Is space a word, too?</a> This ambiguity problem permeates all levels of language processing. We&rsquo;d be naive to think PSS is an exception to this rule!</p></blockquote>
<p>Consider an email with an attachment. Is the email body one document and the attachment another? Or is the whole packet one document? What about an invoice stapled to a check? A policy packet with multiple addendums?</p>
<p>&ldquo;Solvability&rdquo; implies a single ground truth, but in reality, PSS often requires aligning the model with specific, often subjective, business logic. A boundary to an underwriter might be a continuation to an archivist.</p>
<p>This subjectivity is a nightmare for rule-based systems. To solve it, we need models that go beyond pattern matching to reason about context and semantics. This is precisely where the self-attention mechanisms of Transformers excel.</p>
<h3 id="the-cost-of-error">The Cost of Error</h3>
<p>Second, the cost of failure is asymmetric.</p>
<p>If you are classifying an email as &ldquo;Spam&rdquo; or &ldquo;Not Spam,&rdquo; a single error affects one email. But PSS is a sequence problem. A single missed page break merges two distinct documents into one. This effectively &ldquo;corrupts&rdquo; two documents for the price of one error. Conversely, a false break splits a valid document in half.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/classification-vs-sequence-segmentation.webp"
         alt="Diagram comparing classification errors vs sequence segmentation errors"
         title="Diagram comparing classification errors vs sequence segmentation errors"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The blast radius of error: Unlike simple classification where one error affects one item, a single segmentation error corrupts the integrity of multiple documents.</figcaption>
    
</figure>

<p>Even more dismal, if our focus is truly STP, then the only acceptable outcome is perfect segmentation of an entire document stream. Sometimes faxes can be hundreds of pages long.</p>
<p>If we have a 99% page-level accuracy ($p=0.99$), the probability of correctly segmenting a 100-page stream ($N=100$) is only:</p>
<p>$$ P(\text{Success}) = p^N = 0.99^{100} \approx 0.37 $$</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/asymmetric-cost-of-error-in-document-streams.webp"
         alt="Chart showing exponential decay of straight-through processing probability as document length increases"
         title="Chart showing exponential decay of straight-through processing probability as document length increases"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Even high page-level accuracy (99%) results in low stream-level success rates for long documents due to the multiplicative nature of error probabilities.</figcaption>
    
</figure>

<p>In other words, even with &ldquo;high&rdquo; accuracy, the vast majority of document streams will require human intervention. This phenomenon is what we call <strong>The Reliability Trap</strong> (explored in depth in <a href="/posts/reliability-trap-document-automation/">our companion post</a>).</p>
<h3 id="the-f1-score-trap">The &ldquo;F1 Score&rdquo; Trap</h3>
<p>A major finding in our research (<a href="/research/llm-page-stream-segmentation/">TabMe++</a>) is that traditional metrics mask the operational reality. Although generalized text segmentation metrics like <a href="https://aclanthology.org/W97-0304/">$P_k$</a> and <a href="https://aclanthology.org/J02-1002/">WindowDiff</a> exist, we found they don&rsquo;t capture the document-centric nature of business workflows.</p>
<p>Instead, we evaluate at three levels:</p>
<ol>
<li><strong>Page-Level</strong>: Did we correctly classify this single page transition?</li>
<li><strong>Document-Level</strong>: Did we correctly identify the entire document tuple $d_k = (p_i, \ldots, p_j)$?</li>
<li><strong>Stream-Level</strong>: Did we perfectly segment the entire stack of documents?</li>
</ol>
<p>Our results showed that Page-Level F1 Score completely masks the downstream impact.</p>
<p>Consider a baseline XGBoost model we tested:</p>
<ul>
<li><strong>Page F1 Score</strong>: 0.83 (Sounds decent, right?)</li>
<li><strong>STP</strong>: 0.07 (Abysmal)</li>
<li><strong>MNDD</strong>: 10.85</li>
</ul>
<p><strong>That means 93% of document streams required human intervention.</strong> Even worse, the MNDD (Minimum Number of Drag-and-Drops)<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> score tells us that for each stream, a human had to manually drag ~11 pages to fix the ordering.</p>
<p>This metric is crucial because it proxies the actual <em>pain</em> of the human in the loop. An error signifies more than a theoretical label flip; it forces a manual drag-and-drop operation.</p>
<h2 id="era-1-the-heuristic-era-2000s---2015">Era 1: The Heuristic Era (2000s - 2015)</h2>
<p>In the beginning, PSS was a game of <code>if/else</code> statements. Engineers hand-crafted heuristics tailored to specific document layouts, checking for signals like:</p>
<ul>
<li><em>Does the page contain &ldquo;Page 1 of X&rdquo;?</em></li>
<li><em>Is there a &ldquo;Total&rdquo; line at the bottom?</em></li>
<li><em>Does the header text change drastically?</em></li>
</ul>
<p>While effective for known templates, these systems were inherently brittle. They relied on rigid assumptions about the input structure. If a vendor changed their invoice layout or OCR quality dipped, the logic would fail. They worked perfectly for what they were designed for but had zero capability to generalize to the unknown. Unfortunately, the real world is a constant state of exception.</p>
<h2 id="era-2-the-encoder-era-2015---2023">Era 2: The Encoder Era (2015 - 2023)</h2>
<p>As deep learning matured, researchers moved from hard-coded rules to learned representations.</p>
<ul>
<li><strong>Visual Approaches</strong>: Using CNNs to look at the &ldquo;shape&rdquo; of a page as an image. First pages often look different from continuation pages (logos, big headers).</li>
<li><strong>Word Vectors</strong>: Early NLP attempts used tools like <a href="https://arxiv.org/abs/1405.4053">doc2vec</a> to represent page content, but these &ldquo;averaged&rdquo; the text, losing sequential meaning.</li>
<li><strong>Multimodal Transformers</strong>: Eventually, models like <a href="https://arxiv.org/abs/1912.13318">LayoutLM</a> and <a href="https://arxiv.org/abs/2010.02559">LEGAL-BERT</a> tried to combine text and layout into a single understanding.</li>
</ul>
<p>While these models were &ldquo;smarter&rdquo; than rules, they suffered from distinct limitations:</p>
<ol>
<li>
<p><strong>Field Lag</strong>: Surprisingly, only a handful of studies applied Transformers to PSS before 2024. Most of the industry was still stuck on older CNN architectures.</p>
</li>
<li>
<p><strong>Context Windows</strong>: Encoder models like <a href="https://arxiv.org/abs/1810.04805">BERT</a> are limited to 512 tokens. A dense legal contract page might have 1,000+ tokens. You had to chop the text, losing critical context.</p>
</li>
<li>
<p><strong>Modality Overload</strong>: Counterintuitively, our experiments showed that naively adding modalities (Text + Layout + Vision) often yielded diminishing returns. Models like <a href="https://arxiv.org/abs/2204.08387">LayoutLMv3</a> struggled to outperform simpler vision-only or text-only models on our benchmark.</p>
<blockquote>
<p>However, looking continuously at the data reveals an interesting nuance: <strong>Visual signals matter.</strong> In our tests, the vision-only model (DiT) actually outperformed the text-only model (RoBERTa). The vision-only model (DiT) tends to be more precise, while the text-only model (RoBERTa) reaches higher recall. The multimodal models failed due to the difficulty of <em>aligning</em> modalities. Vision remains a highly useful signal. This insight led us to a key realization for Era 3: What if we could give the model visual information without the architectural headache of a vision encoder?</p></blockquote>
</li>
</ol>
<h2 id="era-3-the-decoder-era-2024---present">Era 3: The Decoder Era (2024 - Present)</h2>
<p>The breakthrough came with applying Decoder-only Large Language Models (LLMs) like <a href="https://arxiv.org/abs/2310.06825">Mistral-7B</a> and <a href="https://arxiv.org/abs/2404.14219">Phi-3</a> to the task.</p>
<p>Why do LLMs succeed where specialized encoders failed? <strong>Contextual Processing</strong>.</p>
<p>Determining if a page is a continuation often requires analyzing sequential dependencies.</p>
<ul>
<li><em>Does the sentence cut off mid-thought?</em></li>
<li><em>Does the next page logically follow the argument of the previous one?</em></li>
<li><em>Is the &ldquo;Policy Number&rdquo; on Page 2 the same as Page 1?</em></li>
</ul>
<p>LLMs are pre-trained on the internet; they model narrative flow and document structure effectively. By fine-tuning them on pairs of pages, we adapted these priors to recognize specific segmentation boundaries.</p>
<h3 id="2d-projection--data-quality">2D Projection &amp; Data Quality</h3>
<p>We employed <strong>2D Text Projection</strong>, a technique that serializes OCR output by mapping spatial coordinates to whitespace. This effectively &ldquo;draws&rdquo; the layout using text characters, allowing the LLM to process columns, headers, and form structures. We translated the visual signal (layout) into the text modality to address the &ldquo;Modality Overload&rdquo; problem.</p>
<p>To be clear, this is a lossy compression. We discard font sizes, bolding, colors, and line separators. It is merely a cheap, zeroth-order approximation of 2D layout using 1D text. Yet, as our results show, this approximation captures the <em>semantic</em> essence of the layout (e.g., &ldquo;this text is in a header column&rdquo;) sufficient for the model to reason about document boundaries.</p>
<p>However, this technique has a hard dependency: <strong>Data Quality</strong>. 2D projection is useless if your OCR gives you garbage coordinates. This is where our work on <strong>TabMe++</strong> (discussed below) became critical. You can&rsquo;t project a layout if the OCR misses the text or places it in the wrong spot.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span># Original Raw Text (Loss of Layout)
</span></span><span style="display:flex;"><span>INVOICE # 1024 DATE: 2024-02-14 TOTAL: $500.00
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># 2D Projected Text (Layout Preserved)
</span></span><span style="display:flex;"><span>                    INVOICE # 1024
</span></span><span style="display:flex;"><span>                    DATE: 2024-02-14
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>TOTAL:                                      $500.00
</span></span></code></pre></div><blockquote>
<p><strong>Why Does This Work?</strong> Modern LLMs are trained on a mixture of web text, code, and even some structured data. They have learned to interpret whitespace and formatting cues as part of their understanding of language. By encoding layout information into the text itself, we leverage the LLM&rsquo;s existing capabilities without needing to train a separate vision encoder.</p></blockquote>
<p>We then wrapped this input in a structured prompt that explicitly framed the task for the LLM:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>You are a skilled document reviewer. Given extracted text from pages of documents, your task is to determine if a page starts a new document or continues from the previous one.
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>Prior text:
</span></span><span style="display:flex;"><span>###
</span></span><span style="display:flex;"><span>{pg_prev}
</span></span><span style="display:flex;"><span>###
</span></span><span style="display:flex;"><span>Page text:
</span></span><span style="display:flex;"><span>###
</span></span><span style="display:flex;"><span>{pg}
</span></span><span style="display:flex;"><span>###
</span></span><span style="display:flex;"><span>Output your prediction as a JSON object...
</span></span></code></pre></div><h3 id="the-results">The Results</h3>
<p>We formulated the task as a binary classification problem on page pairs. We fed the model <code>(Page N, Page N+1)</code> and asked: <em>&ldquo;Does Page N+1 start a new document?&rdquo;</em></p>
<p>Comparison on <strong>TabMe++</strong> Benchmark:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Model Type</th>
          <th style="text-align: left">Model Name</th>
          <th style="text-align: left">Page F1</th>
          <th style="text-align: left"><strong>STP</strong> (Higher is better)</th>
          <th style="text-align: left"><strong>MNDD</strong> (Lower is better)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Baseline</strong></td>
          <td style="text-align: left"><a href="https://arxiv.org/abs/1603.02754">XGBoost</a></td>
          <td style="text-align: left">0.83</td>
          <td style="text-align: left"><strong>7.4%</strong></td>
          <td style="text-align: left">10.85</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Encoder</strong></td>
          <td style="text-align: left"><a href="https://arxiv.org/abs/1907.11692">RoBERTa</a> (Text)</td>
          <td style="text-align: left">0.78</td>
          <td style="text-align: left"><strong>4.2%</strong></td>
          <td style="text-align: left">12.17</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Encoder</strong></td>
          <td style="text-align: left"><a href="https://arxiv.org/abs/2203.02378">DiT</a> (Vision)</td>
          <td style="text-align: left">0.83</td>
          <td style="text-align: left"><strong>6.6%</strong></td>
          <td style="text-align: left">10.48</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Decoder</strong></td>
          <td style="text-align: left"><strong><a href="https://arxiv.org/abs/2310.06825">Mistral-7B</a> (Fine-Tuned)</strong></td>
          <td style="text-align: left"><strong>0.99</strong></td>
          <td style="text-align: left"><strong>80.0%</strong></td>
          <td style="text-align: left"><strong>0.81</strong></td>
      </tr>
  </tbody>
</table>
<p>The difference is stark. Moving from Encoders to Decoders increased the automation rate from ~7% to <strong>80%</strong> and reduced the human effort (MNDD) by a factor of 10. <em>Note: This 80% represents the model&rsquo;s raw accuracy. As we discuss in <a href="/posts/reliability-trap-document-automation/">The Reliability Trap</a>, achieving &ldquo;production-safe&rdquo; automation often requires setting strict confidence thresholds, which effectively lowers the safe throughput.</em></p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/llm-sample-efficiency-convergence-plot.webp"
         alt="Sample efficiency plot showing rapid convergence in under 1000 updates"
         title="Sample efficiency plot showing rapid convergence in under 1000 updates"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">LLMs adapt fast. Our decoder models converged in fewer than 1,000 updates, suggesting strong priors for document structure.</figcaption>
    
</figure>

<h3 id="why-fine-tuning-matters-the-gpt-4o-comparison">Why Fine-Tuning Matters: The GPT-4o Comparison</h3>
<p>You might look at the chart above and ask: &ldquo;Is the model learning PSS, or does it just rely on pre-trained language statistics?&rdquo;</p>
<p>To test this, we ran <strong><a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a> in a zero-shot setting</strong> on the same task. The result was an STP of roughly 9%.</p>
<p>Zero-shot GPT-4o performed similarly to our XGBoost baseline. This demonstrates that broad pre-training requires specific instruction tuning to capture business logic. Our 7B model achieved 80% STP after fewer than <strong>1,000 updates</strong>.</p>
<p>This proves two things:</p>
<ol>
<li><strong>Broad Pre-training Requires Tuning.</strong> Modeling generic document distributions must be adapted to capture specific business logic for segmentation.</li>
<li><strong>The Capabilities are Latent.</strong> The rapid convergence implies the model possesses the necessary statistical priors and requires fine-tuning to align those priors with the specific task. We are adjusting the decision boundary between a generic &ldquo;document&rdquo; and a specific business record.</li>
</ol>
<h3 id="the-cost-of-intelligence-and-the-value-of-human-time">The Cost of Intelligence and the Value of Human Time</h3>
<p>Critically, we must address the two elephants in the room: <strong>Inference Cost</strong> and <strong>Data Privacy</strong>.</p>
<p>It is true that running a 7B parameter LLM for every page pair is computationally more expensive than a lightweight XGBoost model. However, focusing solely on compute costs misses the operational and human reality of this work.</p>
<p>Economically, the &ldquo;cheap&rdquo; model is a mirage. When a low-accuracy model forces a human to reorganize 93% of document streams, the cost of rectification, specifically wasted salaries and slowed turnaround times, dwarfs the cost of GPU inference. But the financial argument is secondary to the human one.</p>
<p>Manually segmenting documents is, frankly, soul-sucking. It is tedious, repetitive drudgery that few people enjoy. Beyond operational expense, we are discussing human burnout. A model that achieves 80% full automation (STP) saves money while liberating people from the mind-numbing task of sorting pages. This allows them to focus on work that actually requires their creativity and empathy. We are trading cheap FLOPs for valuable human attention.</p>
<p>Furthermore, democratizing this capability has profound implications beyond the enterprise. If we can make high-quality segmentation usable on modest hardware (like a high-end laptop or a single commodity GPU), we open the door for archivists, librarians, digital humanists, and small cities or towns that have little to no resources for this kind of work. These are the custodians of our collective intelligence, often working with massive, unorganized scanned collections but lacking the budget for massive cloud clusters.</p>
<p>Our results showed that <strong>7B parameter models</strong> (like Mistral) are sufficient to solve this task. This size is the sweet spot: capable enough to reason over document structure, but small enough to run locally. This matters for data sovereignty (keeping medical records private) and accessibility. It means a small historical society could potentially automate the organization of a century’s worth of digitized records without a massive grant for cloud compute.</p>
<p>That said, a 7B model might not be the lower bound. While it was the breakthrough size for our study, the recent explosion of capable 1B-3B models suggests we haven&rsquo;t hit the efficiency floor yet. Combined with extreme quantization, modern small language models (SLMs) likely offer the &ldquo;Goldilocks&rdquo; zone: enough reasoning to maintain high STP, but fast enough to run continuously on modest hardware. We suspect the future of PSS lies in these highly optimized, smaller reasoning models that can run anywhere&hellip; from a bank&rsquo;s secure server to a researcher&rsquo;s laptop.</p>
<h2 id="the-importance-of-data-quality">The Importance of Data Quality</h2>
<p>Data quality presented an equal challenge to algorithmic limitations.
Most public datasets (like Tobacco800) were small or unrealistic. The TABME dataset (precursor to our work) relied on open-source Tesseract OCR, which missed vast amounts of text.</p>
<p>We released <a href="https://huggingface.co/datasets/rootsautomation/TABMEpp"><strong>TabMe++</strong></a>, which re-processed the entire dataset with commercial-grade Microsoft OCR.</p>
<ul>
<li><strong>Blank Pages</strong>: Reduced from 2.27% $\rightarrow$ 0.38%.</li>
<li><strong>Token Count</strong>: Increased from 719M $\rightarrow$ 9.5B.</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation/noisy-sales-forecast-document-scan.webp"
         alt="Scanned document page showing a sales forecast with some noise"
         title="Scanned document page showing a sales forecast with some noise"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Original Page: A noisy scan that Tesseract struggles to read.</figcaption>
    
</figure>

<p>The difference in intelligibility is night and day. Consider the page above.</p>
<p><strong>Tesseract (Original)</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>02Z10102
</span></span></code></pre></div><p><em>(Misses almost everything, including the title and real ID)</em></p>
<p><strong>Microsoft OCR (TabMe++)</strong>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>            SALES FORECAST
</span></span><span style="display:flex;"><span>                            201017205
</span></span></code></pre></div><p><em>(Correctly captures the spatial layout, the title, and the ID)</em></p>
<p><strong>Lesson</strong>: You can&rsquo;t segment what you can&rsquo;t read. High-quality OCR (or <a href="/research/gutenocr-grounded-vision-language-frontend/">multimodal front-ends</a> like GutenOCR, trained on large-scale annotation corpora like <a href="/research/pubmed-ocr-pmc-open-access-ocr-annotations/">PubMed-OCR</a>) is the foundation of high-quality downstream NLP.</p>
<h2 id="the-next-frontier-context-and-instruction-following-2026">The Next Frontier: Context and Instruction Following (2026+)</h2>
<p>As we discussed earlier, the definition of a &ldquo;document&rdquo; is subjective. To one team, an email + attachment is a single record. To another, they are distinct entities. A rigid model that segments perfectly for Team A will fail miserably for Team B.</p>
<p>The zero-shot GPT-4o results demonstrate that scale requires adaptation. The future of PSS depends on <strong>instruction tuning</strong>. We need models that can accept natural language rules alongside the document stream:</p>
<blockquote>
<p><em>&ldquo;Split all invoices, but keep attachments with their parent emails. If you see an ACORD form, group it with the subsequent policy document.&rdquo;</em></p></blockquote>
<p>This shift mirrors the broader evolution of LLMs. PSS models must evolve into dynamic systems capable of instruction following. A single model should be able to adapt to any business logic without retraining.</p>
<p>Furthermore, while our 2024 research favored unimodal text models with 2D projection, the multimodal landscape is shifting. With the rise of natively multimodal models (like <a href="https://deepmind.google/models/gemini/">Gemini</a>, <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>, and our own <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>), we effectively get the &ldquo;2D projection&rdquo; natively. Future models should be able to fuse this native visual understanding with semantic reasoning, guided by user-defined constraints.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Page Stream Segmentation is a perfect case study in the evolution of AI. We moved from <strong>encoding rules</strong> (Heuristic Era) to <strong>encoding features</strong> (Encoder Era) to <strong>encoding understanding</strong> (Decoder Era).</p>
<p>For enterprise professionals, the takeaways are clearer and more critical than ever.</p>
<p>First, <strong>stop looking at element-wise F1 scores for sequence tasks.</strong> While element-wise metrics are useful for engineers debugging algorithms, they are misleading for decision-makers. Focus on the metrics that actually affect people and workflows, like Straight-Through Processing (STP) and Minimum Number of Drag-and-Drops (MNDD).</p>
<p>Second, if you want to solve PSS today, start with an inward-looking conversation about &ldquo;for what.&rdquo; Before picking a model, answer these questions:</p>
<ul>
<li><strong>Inputs</strong>: What assumptions are you making about your document stream?</li>
<li><strong>Outcomes</strong>: What specific business outcomes are you hoping to see?</li>
<li><strong>Context</strong>: What is the core motivation for this workflow?</li>
<li><strong>Nuance</strong>: Are there informative scenarios (like the &ldquo;email attachment&rdquo; problem) that illustrate your specific needs?</li>
</ul>
<p>Given these answers, many modern approaches can solve PSS for your case. Whether you need an on-premise solution for secure scenarios using lightweight open-weights models, or can leverage powerful AI-as-a-Service APIs, the technology is no longer the bottleneck; understanding your own requirements is.</p>
<p><em>For full technical details, experimental setups, and datasets, refer to our paper: <a href="/research/llm-page-stream-segmentation/">Large Language Models for Page Stream Segmentation</a> or view the preprint on <a href="https://arxiv.org/abs/2408.11981">arXiv:2408.11981</a>. These findings were later extended to real-world insurance document processing in <a href="/research/page-stream-segmentation-llms/">LLMs for Insurance Document Automation</a>. Much of the initial work was also documented in a precursor blog series at Roots Automation (<a href="https://www.roots.ai/blog/segmenting-documents-with-llms-and-multimodal-document-ai-part-1">Part 1</a> &amp; <a href="https://www.roots.ai/blog/segmenting-documents-with-llms-and-multimodal-document-ai-part-2">Part 2</a>).</em></p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Historically, this task has gone by many names: <em>document separation</em>, <em>document flow segmentation</em>, <em>document stream segmentation</em>, <em>document bundle separation</em>, and <em>page stream separation</em>. We stick to <strong>Page Stream Segmentation (PSS)</strong> to emphasize the sequential nature of the problem.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>We adopted the MNDD metric from <a href="https://dl.acm.org/doi/10.1145/3558100.3563852">Mungmeeprued et al. (2022)</a>, who introduced it alongside the original TABME dataset to better quantify the human effort required to correct segmentation errors.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded></item><item><title>GutenOCR: A Grounded Vision-Language Front-End for Documents</title><link>https://hunterheidenreich.com/research/gutenocr-grounded-vision-language-frontend/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/gutenocr-grounded-vision-language-frontend/</guid><description>GutenOCR introduces vision-language models for grounded OCR, offering precise text transcription and geometric grounding via a unified prompt-based interface.</description><content:encoded><![CDATA[<p><a href="https://github.com/Roots-Automation/GutenOCR"><strong>GutenOCR</strong></a> is a family of vision-language models (VLMs) designed to serve as a &ldquo;grounded OCR front-end&rdquo;. Unlike traditional OCR pipelines (which are often brittle) or modern &ldquo;OCR-free&rdquo; VLMs (which often lack precise token-to-pixel alignment), GutenOCR is fine-tuned to provide both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface.</p>
<h3 id="abstract">Abstract</h3>
<p>Traditional OCR pipelines are often brittle, while modern &ldquo;OCR-free&rdquo; Vision-Language Models (VLMs) frequently lack precise token-to-pixel alignment. To address this, we introduce <strong>GutenOCR</strong>, a family of VLMs designed specifically as a &ldquo;grounded OCR front-end.&rdquo; By fine-tuning Qwen2.5-VL on a curriculum of synthetic and real-world documents, GutenOCR provides both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface. This approach allows downstream systems to request exactly the data format they need, from plain text to complex JSON structures.</p>
<h3 id="key-contributions--results">Key Contributions &amp; Results</h3>
<ul>
<li><strong>Unified Interface:</strong> Transforms Qwen2.5-VL models into specialized OCR systems supporting full-page reading, detection, localized reading, and conditional detection via prompting.</li>
<li><strong>In-Domain Improvements:</strong> GutenOCR-7B more than doubles the composite grounded OCR score of its base model (0.40 to 0.82) on 10.5K held-out pages, with the gains concentrated in localized reading and detection.</li>
<li><strong>Fox Benchmark:</strong> GutenOCR-3B achieves a region-level Character Error Rate (CER) of 0.053, surpassing even the dedicated Fox model (0.059), and sharply improves line-level OCR over its Qwen2.5-VL backbones (0.817 to 0.240), though the specialized Fox model still leads at line level.</li>
<li><strong>Curriculum Learning:</strong> Training uses a three-stage curriculum across synthetic data, real-world business documents, and long-context scientific articles to progressively build layout and grounding competency.</li>
<li><strong>Trade-offs:</strong> While GutenOCR reads content accurately (high Page F1), it orders text based on 2D layout columns. It also experiences catastrophic forgetting of color-based prompts and clear negative transfer on math formula recognition (notably the 3B model), to the point the paper recommends the un-fine-tuned backbone for equation-heavy use.</li>
</ul>
<h3 id="methodology">Methodology</h3>
<ul>
<li><strong>Data:</strong> The training mixture combines large-scale real-world documents (business forms, scientific articles) with synthetic data designed to teach precise grounding (e.g., &ldquo;Grounded LaTeX&rdquo; and &ldquo;SynthDoG Grounding&rdquo;).</li>
<li><strong>Curriculum Learning:</strong> Training progresses through three stages, starting with short contexts and synthetic data, moving to real-world business documents, and finishing with long-context scientific articles (up to 16k tokens).</li>
<li><strong>Unified Interface:</strong> The model treats &ldquo;pipeline&rdquo; stages (detection, reading, grounding) as different input-output schemas of a single model, allowing downstream systems to request exactly the data format they need (e.g., plain text vs. JSON boxes).</li>
</ul>
<h3 id="models">Models</h3>
<p>We release 3B and 7B parameter models on HuggingFace:</p>
<ul>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-3B"><strong>GutenOCR-3B (HuggingFace)</strong></a></li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-7B"><strong>GutenOCR-7B (HuggingFace)</strong></a></li>
</ul>
<p>You can try GutenOCR directly at <a href="https://ocr.roots.ai/">ocr.roots.ai</a>, where you can upload a document image and see the model&rsquo;s parsed text output alongside bounding-box highlights on the original image.</p>















<figure class="post-figure center ">
    <img src="/img/gutenocr-demo.webp"
         alt="GutenOCR demo interface showing parsed text with hover-to-highlight on the uploaded image"
         title="GutenOCR demo interface showing parsed text with hover-to-highlight on the uploaded image"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The live demo at ocr.roots.ai: hovering over any parsed token highlights its bounding box on the original document.</figcaption>
    
</figure>

<h3 id="why-this-matters">Why This Matters</h3>
<p>GutenOCR is proposed as a foundational layer for systems where every extracted answer must be explicitly linked to supporting pixels. By providing stable, grounded outputs, it enables human-in-the-loop workflows where reviewers can easily verify hallucinations or missing text by checking the predicted bounding boxes. This work pairs closely with our release of <a href="/research/pubmed-ocr-pmc-open-access-ocr-annotations/">PubMed-OCR</a>, which provides the large-scale, high-density annotations necessary to train such layout-aware models.</p>
<h3 id="resources">Resources</h3>
<ul>
<li><a href="https://ocr.roots.ai/"><strong>Live Demo</strong></a>: Try GutenOCR on your own documents.</li>
<li><a href="https://arxiv.org/abs/2601.14490"><strong>Paper (arXiv)</strong></a>: Full technical report.</li>
<li><a href="https://github.com/Roots-Automation/GutenOCR"><strong>Code (GitHub)</strong></a>: Training code and model release.</li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-3B"><strong>GutenOCR-3B (HuggingFace)</strong></a>: 3B parameter model weights.</li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-7B"><strong>GutenOCR-7B (HuggingFace)</strong></a>: 7B parameter model weights.</li>
</ul>
<h3 id="citation">Citation</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GutenOCR: A Grounded Vision-Language Front-End for Documents}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2601.14490}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2601.14490}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="related-work">Related Work</h3>
<ul>
<li><a href="/research/pubmed-ocr-pmc-open-access-ocr-annotations/"><strong>PubMed-OCR</strong></a>: The large-scale annotation dataset used to train GutenOCR&rsquo;s layout-aware grounding capabilities.</li>
<li><a href="/research/llm-page-stream-segmentation/"><strong>LLMs for Page Stream Segmentation</strong></a>: Complementary work on document understanding at the page-stream level.</li>
<li><a href="/posts/history-of-page-stream-segmentation/"><strong>The Evolution of Page Stream Segmentation: Rules to LLMs</strong></a>: Background on the history and evolution of document processing pipelines.</li>
<li><a href="/posts/reliability-trap-document-automation/"><strong>The Reliability Trap: When 99% Accuracy Isn&rsquo;t Enough</strong></a>: Explores calibration challenges in deployed PSS systems, directly relevant to GutenOCR&rsquo;s deployment context as an OCR front-end.</li>
</ul>
]]></content:encoded></item><item><title>ChemDFM-R: Chemical Reasoning LLM with Atomized Knowledge</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-r/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-r/</guid><description>A 14B-parameter chemical reasoning LLM enhanced with atomized functional group knowledge and mix-sourced distillation strategy.</description><content:encoded><![CDATA[<h2 id="method-and-resource-contributions">Method and Resource Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper with significant <strong>Resource</strong> contributions.</p>
<ul>
<li><strong>Methodological Basis</strong>: The paper introduces a training pipeline (&ldquo;mix-sourced distillation&rdquo;) and domain-specific reinforcement learning to improve reasoning capabilities in chemical LLMs. It validates the approach through ablation studies across training stages.</li>
<li><strong>Resource Contribution</strong>: The authors constructed <strong>ChemFG</strong>, a 101 billion-token corpus annotated with &ldquo;atomized&rdquo; knowledge regarding functional groups and reaction centers.</li>
</ul>
<h2 id="bridging-the-chemical-reasoning-gap">Bridging the Chemical Reasoning Gap</h2>
<p>Current chemical LLMs struggle to reason logically for two main reasons:</p>
<ol>
<li><strong>Shallow Domain Understanding</strong>: Models generally learn molecule-level properties directly, bypassing the intermediate &ldquo;atomized&rdquo; characteristics (e.g., <a href="https://en.wikipedia.org/wiki/Functional_group">functional groups</a>) that ultimately dictate chemical behavior.</li>
<li><strong>Specialized Reasoning Logic</strong>: Chemical logic differs fundamentally from math or code. Distilling reasoning from general teacher models like DeepSeek-R1 frequently fails because the teachers lack the domain intuition required to generate valid chemical rationales.</li>
</ol>
<h2 id="atomized-knowledge-and-mixed-source-distillation">Atomized Knowledge and Mixed-Source Distillation</h2>
<p>The authors introduce three structural innovations to solve the reasoning gap:</p>
<ol>
<li><strong>Atomized Knowledge Enhancement (ChemFG)</strong>: A toolkit was built leveraging SMARTS notations to identify functional group changes during reactions. A critique of this approach is that it relies heavily on 2D cheminformatics abstractions, potentially missing deeper 3D stereochemical interactions.</li>
<li><strong>Mix-Sourced Distillation</strong>: General models (DeepSeek-R1/o3-mini) are fed &ldquo;pseudo-reasoning&rdquo; prompts that include ground truth answers and functional group data. While this forces the teacher to generate high-quality rationales for the student to learn, it introduces a layer of hindsight bias into the generated reasoning chains. During inference, the student model lacks both the pre-calculated functional group metadata and the ground truth, forcing it to bridge an artificially steep generalization gap.</li>
<li><strong>Chemical Reinforcement Learning</strong>: The intermediate model undergoes domain-specific reinforcement learning. The RL details are described in the paper&rsquo;s Appendix D, with the authors citing the open-source DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) framework. The optimization relies on rule-based rewards (format adherence and canonicalized <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> accuracy) across a variety of chemical tasks.</li>
</ol>
<h2 id="benchmark-evaluation-and-ablation-studies">Benchmark Evaluation and Ablation Studies</h2>
<p>The model was evaluated on comprehensive chemical benchmarks: <strong>SciKnowEval</strong> (19 tasks) and <strong><a href="/notes/chemistry/llm-applications/chemeval-multilevel-chemical-evaluation/">ChemEval</a></strong> (36 tasks).</p>
<ul>
<li><strong>Baselines</strong>: Compared against similarly sized open models (Qwen2.5-14B-Instruct, Qwen3-14B), domain models (<a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM</a>, MolInst), and frontier models (GPT-4o, DeepSeek-R1).</li>
<li><strong>Ablation</strong>: Evaluated across training stages (Base → ChemDFM-I → ChemDFM-R) to measure the specific impact of the instruction tuning versus the reasoning stages.</li>
<li><strong>Qualitative Analysis</strong>: The paper includes case studies demonstrating the model&rsquo;s step-by-step chemical reasoning and its potential for human-AI collaboration (Sections 4.2 and 4.3).</li>
</ul>
<h2 id="performance-outcomes-and-numerical-limitations">Performance Outcomes and Numerical Limitations</h2>
<ul>
<li><strong>Performance vs. Baselines</strong>: ChemDFM-R outperforms similarly sized open models and domain models on molecule-centric and reaction-centric tasks, and surpasses the much larger DeepSeek-R1 on ChemEval (0.78 vs. 0.58 overall). It shows competitive results relative to o4-mini, though o4-mini leads on SciKnowEval (0.74 vs. 0.70).</li>
<li><strong>Reasoning Interactivity</strong>: The model generates readable rationales that allow users to catch structural errors or identify reaction mechanisms accurately. Section 4.3 of the paper demonstrates human-AI collaboration scenarios.</li>
<li><strong>Quantitative Limitations</strong>: The model struggles with tasks involving numerical prediction and calculation (e.g., yield extraction, molecular property calculation). The paper notes that all molecule-centric and reaction-centric tasks where ChemDFM-R falls short of Qwen2.5-14B-Instruct involve numerical reasoning.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is constructed in three phases:</p>
<p><strong>1. Domain Pre-training (ChemFG)</strong>:</p>
<ul>
<li><strong>Size</strong>: 101 billion tokens</li>
<li><strong>Composition</strong>:
<ul>
<li>12M literature documents (79B tokens)</li>
<li>30M molecules from PubChem/PubChemQC</li>
<li>7M reactions from USPTO-FULL</li>
</ul>
</li>
<li><strong>Augmentation</strong>: SMILES augmentation (10x) using R-SMILES</li>
<li><strong>Atomized Features</strong>: Annotated with a custom &ldquo;Functional Group Identification Toolkit&rdquo; that identifies 241 functional group types and tracks changes in reaction centers. <em>Note: Data and toolkit are partially reproduced; while the toolkit (<a href="https://github.com/OpenDFM/ChemFG-Tool">ChemFG-Tool</a>) was open-sourced on GitHub, the 101 billion-token ChemFG dataset itself has not been publicly released.</em></li>
</ul>
<p><strong>2. Instruction Tuning</strong>:</p>
<ul>
<li><strong>Sources</strong>: Molecule-centric (<a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>), Reaction-centric (USPTO), and Knowledge-centric (Exams, Literature QA) tasks</li>
<li><strong>Mixing</strong>: Mixed with general instruction data in a 1:2 ratio</li>
</ul>
<p><strong>3. Distillation Dataset</strong>:</p>
<ul>
<li><strong>Sources</strong>:
<ul>
<li>~70% ChemDFM-R instruction data</li>
<li>~22% constructed pseudo-reasoning (functional group descriptions)</li>
<li>~8% teacher rationales (from DeepSeek-R1/o3-mini)</li>
</ul>
</li>
<li><strong>Mixing</strong>: Mixed with general data (including AM-Deepseek-R1-Distill-1.4M) in a 1:2 ratio</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Functional Group Identification</strong>:</p>
<ul>
<li>Extends the <code>thermo</code> library&rsquo;s SMARTS list</li>
<li>For reactions, identifies &ldquo;reacting functional groups&rdquo; by finding reactants containing atoms involved in bond changes (reaction centers) that do not appear in the product</li>
</ul>
<p><strong>Mix-Sourced Distillation</strong>:</p>
<ul>
<li>Teacher models (DeepSeek-R1, o3-mini) are prompted with Question + Ground Truth + Functional Group Info to generate high-quality &ldquo;Thoughts&rdquo;</li>
<li>These rationales are distilled into the student model using a supervised fine-tuning loss across target tokens $y_t$:
$$ \mathcal{L}_{\text{SFT}} = - \sum_{t=1}^T \log P_\theta(y_t \mid x, y_{&lt;t}) $$</li>
</ul>
<p><strong>Reinforcement Learning</strong>:</p>
<ul>
<li><strong>Algorithm</strong>: The paper cites DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) as the RL framework; full details are in Appendix D of the paper. <em>Note: While the underlying DAPO framework is open-source, the specific chemistry-oriented RL pipeline and environment used for ChemDFM-R has not been publicly released.</em></li>
<li><strong>Hyperparameters</strong> (from paper appendix): Learning rate <code>5e-7</code>, rollout batch size <code>512</code>, training batch size <code>128</code></li>
<li><strong>Rewards</strong>: The reward system applies rule-based constraints focusing on physical form and chemical validity. The total reward $R(y, y^*)$ for a generated response $y$ given target $y^*$ combines a format adherence reward ($R_{\text{format}}$) and an accuracy reward ($R_{\text{acc}}$) evaluated on canonicalized SMILES:
$$ R(y, y^*) = R_{\text{format}}(y) + R_{\text{acc}}(\text{canonicalize}(y), \text{canonicalize}(y^*)) $$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Base Model</strong>: Qwen2.5-14B</li>
<li><strong>ChemDFM-I</strong>: Result of instruction tuning the domain-pretrained model for 2 epochs</li>
<li><strong>ChemDFM-R</strong>: Result of applying mix-sourced distillation (1 epoch) followed by RL on ChemDFM-I. <em>Note: Model weights are publicly available on <a href="https://huggingface.co/OpenDFM/ChemDFM-R-14B">Hugging Face</a>.</em></li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware and training time details are described in the paper&rsquo;s appendices, which are not available in the extracted text. The details below are reported from the paper but could not be independently cross-verified against the main text:</p>
<ul>
<li><strong>Compute</strong>: NVIDIA A800 Tensor Core GPUs</li>
<li><strong>Training Time</strong>: 30,840 GPU hours total (Domain Pretraining: 24,728 hours; Instruction Tuning: 3,785 hours; Distillation: 2,059 hours; Reinforcement Learning: 268 hours)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>SciKnowEval</strong>: 19 tasks (text-centric, molecule-centric, reaction-centric)</li>
<li><strong>ChemEval</strong>: 36 tasks, categorized similarly</li>
</ul>
<p><strong>Key Metrics</strong>: Accuracy, F1 Score, BLEU score (with PRS normalization for ChemEval)</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>SciKnowEval (all)</th>
          <th>ChemEval* (all)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen2.5-14B-Instruct</td>
          <td>0.61</td>
          <td>0.57</td>
          <td>General-domain baseline</td>
      </tr>
      <tr>
          <td>ChemDFM-I</td>
          <td>0.69</td>
          <td>0.72</td>
          <td>After domain pretraining + instruction tuning</td>
      </tr>
      <tr>
          <td>ChemDFM-R</td>
          <td><strong>0.70</strong></td>
          <td><strong>0.78</strong></td>
          <td>After distillation + RL</td>
      </tr>
      <tr>
          <td>DeepSeek-R1</td>
          <td>0.62</td>
          <td>0.58</td>
          <td>General-domain reasoning model</td>
      </tr>
      <tr>
          <td>o4-mini</td>
          <td><strong>0.74</strong></td>
          <td>0.69</td>
          <td>Frontier reasoning model</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/OpenDFM/ChemDFM-R-14B">ChemDFM-R-14B</a></td>
          <td>Model</td>
          <td>AGPL-3.0</td>
          <td>Final reasoning model weights on Hugging Face</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OpenDFM/ChemFG-Tool">ChemFG-Tool</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Functional group identification toolkit (241 groups)</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components</strong>: The 101B-token ChemFG pretraining dataset is not publicly released. The chemistry-oriented RL pipeline and training code are not open-sourced. The instruction tuning and distillation datasets are not available.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhao, Z., Chen, B., Wan, Z., Chen, L., Lin, X., Yu, S., Zhang, S., Ma, D., Zhu, Z., Zhang, D., Wang, H., Dai, Z., Wen, L., Chen, X., &amp; Yu, K. (2025). ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge. <em>arXiv preprint arXiv:2507.21990</em>. <a href="https://doi.org/10.48550/arXiv.2507.21990">https://doi.org/10.48550/arXiv.2507.21990</a></p>
<p><strong>Publication</strong>: arXiv 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{zhao2025chemdfmr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zihan Zhao and Bo Chen and Ziping Wan and Lu Chen and Xuanze Lin and Shiyang Yu and Situo Zhang and Da Ma and Zichen Zhu and Danyang Zhang and Huayang Wang and Zhongyang Dai and Liyang Wen and Xin Chen and Kai Yu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2507.21990}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2507.21990}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-3: Open Source Chemical Foundation Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-3/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-3/</guid><description>An open-source framework integrating DeepChem and Ray for training and benchmarking chemical foundation models like MoLFormer and GROVER at scale.</description><content:encoded><![CDATA[<h2 id="core-contribution-an-open-source-framework">Core Contribution: An Open-Source Framework</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> contributions.</p>
<ul>
<li><strong>Resource Basis</strong>: The core contribution is &ldquo;ChemBERTa-3,&rdquo; an open-source framework integrated into DeepChem that standardizes the pretraining and benchmarking of chemical foundation models. The authors focus heavily on infrastructure (AWS/Ray integration) and correcting benchmarking inconsistencies in the field.</li>
<li><strong>Method Basis</strong>: It trains models like &ldquo;c3-MoLFormer&rdquo; to reproduce and validate the infrastructure.</li>
</ul>
<h2 id="the-pretraining-scalability-challenge">The Pretraining Scalability Challenge</h2>
<ul>
<li><strong>Scalability Challenges</strong>: Building robust molecular models is difficult due to the vast size of chemical space and the computational intensity of pretraining on large datasets.</li>
<li><strong>Proprietary Barriers</strong>: Many high-performing chemical foundation models (e.g., the full <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer-XL</a>) are partially closed-source or difficult to reproduce.</li>
<li><strong>Benchmarking Inconsistencies</strong>: There is a lack of systematic comparison between architectures (e.g., Graph vs. Transformer) using unified protocols. Specifically, previous comparisons relied on reported results that used differing scaffold splitting algorithms, making them inaccurate.</li>
</ul>
<h2 id="unified-infrastructure--standardized-benchmarking">Unified Infrastructure &amp; Standardized Benchmarking</h2>
<ul>
<li><strong>Unified Infrastructure</strong>: Integration of DeepChem with Ray for distributed, scalable pretraining and fine-tuning of both graph and transformer models.</li>
<li><strong>Standardized Benchmarking</strong>: Identification that MoLFormer&rsquo;s scaffold splitting algorithm differs from the standard DeepChem/<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> splitter, and the subsequent standardization of these benchmarks for fair comparison.</li>
<li><strong>New DeepChem Tools</strong>: Introduction of the <code>ModularTorchModel</code> class for flexible loss computation and <code>HuggingFaceModel</code> wrappers to bridge ecosystems.</li>
</ul>
<h2 id="benchmarking-transformers-vs-graph-models">Benchmarking Transformers vs. Graph Models</h2>
<ul>
<li><strong>Architecture Comparison</strong>: Benchmarked Transformers (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>) against Graph models (GROVER, InfoGraph, InfoMax3D, DMPNN, GCN) and baselines (Random Forest).</li>
<li><strong>Pretraining Scale Disparity</strong>:
<ul>
<li>Transformers were pretrained on ZINC20 subsets ranging from 10M to 1.1B molecules (combining ZINC and PubChem).</li>
<li>Graph models were limited to 250K molecule subsets due to memory and computational overhead of message passing on large graphs. While this highlights the superior scalability of Transformer architectures, comparing a 1.1B-trained Transformer to a 250K-trained Graph model provides an unbalanced evaluation of architectural capacity.</li>
</ul>
</li>
<li><strong>Reproducibility Validation</strong>: Trained &ldquo;c3-MoLFormer&rdquo; (a reproduction of MoLFormer) on 1.1B molecules using two distinct hardware setups: AWS spot instances (Ray) and a local HPC cluster.</li>
<li><strong>Scaffold Split Analysis</strong>: Compared performance metrics using &ldquo;DeepChem scaffold splits&rdquo; vs. &ldquo;MoLFormer scaffold splits&rdquo; to quantify the impact of data leakage/overlap.</li>
</ul>
<h2 id="overcoming-scaffold-splitting-inconsistencies">Overcoming Scaffold Splitting Inconsistencies</h2>
<ul>
<li><strong>Scaling Transformers vs. Graphs</strong>: Transformer-based models are significantly easier to scale to large datasets than current graph-based approaches, though performance is comparable at small scales.</li>
<li><strong>Benchmarking sensitivity</strong>: MoLFormer&rsquo;s reported superiority over baselines was partly inflated by its specific scaffold splitting method, which had higher structural overlap between train and test sets (yielding a lower <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a>, generally quantified via $1 - \frac{|A \cap B|}{|A \cup B|}$) than DeepChem splits. When standardized, baselines like DMPNN perform more competitively.</li>
<li><strong>Infrastructure Viability</strong>: The framework successfully replicated large-scale training (MoLFormer-1.1B) on both cloud and on-premise HPC, confirming reproducibility.</li>
<li><strong>Open Source Release</strong>: All code, configurations, and the c3-MoLFormer-1.1B model weights are released to facilitate future research.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pretraining</strong>:
<ul>
<li><strong>Source</strong>: <a href="/notes/chemistry/datasets/zinc-22/">ZINC20</a> (1.4B compounds) and PubChem.</li>
<li><strong>Scale</strong>: Subsets of 10M, 100M, and 1.1B (100% ZINC20 + 100% PubChem) were used for Transformers. Graph models used a 250K subset.</li>
</ul>
</li>
<li><strong>Fine-tuning</strong>:
<ul>
<li><strong>Suite</strong>: MoleculeNet.</li>
<li><strong>Tasks</strong>: Classification (BACE, BBBP, Tox21, HIV, SIDER, ClinTox) and Regression (ESOL, FreeSolv, Lipophilicity, QM9).</li>
<li><strong>Splits</strong>: Critical distinction made between &ldquo;DeepChem scaffold splits&rdquo; (80/10/10) and &ldquo;MoLFormer scaffold splits&rdquo; (which can be downloaded from <a href="https://ibm.ent.box.com/v/MoLFormer-data"><code>https://ibm.ent.box.com/v/MoLFormer-data</code></a>). The paper notes these algorithms differ.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DeepChem integrated with Ray for distributed training. To recreate the environment, the repository relies on a nightly version of DeepChem (<code>pip install --pre deepchem</code>) and specific dependencies found within the <code>requirements.txt</code>. Pretraining scripts are available in the <code>chemberta3_benchmarking/pretraining</code> directory of the repository.</li>
<li><strong>Data Preparation</strong>: Featurization workflows (e.g., <code>CircularFingerprint</code>, <code>RDKitConformer</code>) are documented under <code>chemberta3_benchmarking/data/data_preprocessing/</code> in the codebase.</li>
<li><strong>Modular Training</strong>: Uses <code>ModularTorchModel</code> to allow loss computation from intermediate values and flexible component connection.</li>
<li><strong>Training Brittleness</strong>:
<ul>
<li><strong>Optimizer</strong>: Linear learning rate scheduler with warmup.</li>
<li><strong>Instability Handling</strong>: The authors observed significant loss spikes during warmup. Their primary mitigation strategy involved checkpointing frequently and restarting from the last stable state upon a spike, highlighting a persistent brittleness in optimizing these large chemical foundation models.</li>
<li><strong>Numerical Issues</strong>: Addressed NaN values by pretraining on a small dataset with low LR before scaling up.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></strong>: RoBERTa-based architecture trained with Masked Language Modeling (MLM) and Multitask Regression (MTR). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM"><code>DeepChem/ChemBERTa-100M-MLM</code></a>) are hosted on Hugging Face so researchers can pull them directly via the <code>transformers</code> library. The core pretraining objective minimized the standard MLM loss:
$$ \mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log \hat{y}_{i} $$
where $\mathcal{M}$ represents the set of masked SMILES token indices, and $\hat{y}_{i}$ is the model&rsquo;s predicted probability for the correct token given the corrupted sequence context.</li>
<li><strong>MoLFormer (c3-MoLFormer)</strong>: Re-implementation of the MoLFormer architecture (Rotary embeddings, linear attention). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B"><code>DeepChem/MoLFormer-c3-1.1B</code></a>) are similarly available on Hugging Face.
<ul>
<li>Tokenizer: <code>ibm/MoLFormer-XL-both-10pct</code> tokenizer.</li>
</ul>
</li>
<li><strong>Graph Models</strong>:
<ul>
<li><strong>GROVER</strong>: Graph Transformer with node/edge/graph level self-supervision.</li>
<li><strong>InfoGraph</strong>: Maximizes mutual information between graph-level and substructure representations.</li>
<li><strong>InfoMax3D</strong>: Incorporates 3D conformer data (via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> ETKDGv2) into contrastive pretraining.</li>
<li><strong>DMPNN</strong>: Directed Message Passing Neural Network (Chemprop variant).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> for classification; RMSE for regression (MAE for QM9).</li>
<li><strong>Baselines</strong>: Random Forest, GCN, DMPNN trained on fine-tuning splits only.</li>
<li><strong>Protocol</strong>: Three independent runs per configuration to report mean and range (not a confidence interval), with the exception of the compute-heavy QM9 dataset, which only received a single run. Benchmarking execution scripts (e.g., GCN, RF, DMPNN, ChemBERTa) are stored in the repo under <code>chemberta3_benchmarking/models_benchmarking/</code> and contain the specific fine-tuning hyperparameters and optimizer configurations used for each downstream task.</li>
<li><strong>Key Results</strong>:
<ul>
<li><em>c3-MoLFormer-1.1B</em> achieved ~0.848 ROC-AUC on BACE and ~0.900 on BBBP (using MoLFormer splits). This closely matches the original IBM MoLFormer metrics, validating the reproducibility of the open-source framework.</li>
<li>When constrained to the equivalent 250K subset, Graph models (InfoGraph, GROVER) performed comparably to Transformers, indicating that Transformer superiority in chemistry is largely driven by data scalability rather than an inherent architectural advantage at small scales.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Cloud (AWS)</strong>:
<ul>
<li><strong>Compute</strong>: 40 NVIDIA T4 GPUs (<code>g4dn.12xlarge</code> spot instances for pretraining, <code>g4dn.2xlarge</code> for benchmarking).</li>
<li><strong>Cost</strong>: ~$4000 for MoLFormer 1.1B pretraining.</li>
<li><strong>Time</strong>: ~10 days (260 hours) for 1.1B model pretraining.</li>
<li><strong>Setup</strong>: Setup scripts for single-node and multi-node spot EC2 clusters are provided in the GitHub repository&rsquo;s <code>infra/</code> and <code>spot/</code> folders.</li>
</ul>
</li>
<li><strong>On-Premise HPC</strong>:
<ul>
<li><strong>Compute</strong>: 16 nodes (AMD EPYC), each with 4 AMD MI300A APUs.</li>
<li><strong>Environment</strong>: Ray multi-node multi-GPU framework.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, fine-tuning, and benchmarking framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B">DeepChem/MoLFormer-c3-1.1B</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer re-implementation pretrained on 1.1B molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM">DeepChem/ChemBERTa-100M-MLM</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>ChemBERTa pretrained on 100M ZINC molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-100M">DeepChem/MoLFormer-c3-100M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 100M molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-550M">DeepChem/MoLFormer-c3-550M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 550M molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Singh, R. et al. (2026). ChemBERTa-3: an open source training framework for chemical foundation models. <em>Digital Discovery</em>, 5, 662-685. <a href="https://doi.org/10.1039/D5DD00348B">https://doi.org/10.1039/D5DD00348B</a></p>
<p><strong>Publication</strong>: Digital Discovery 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></li>
<li><a href="https://deepchem.io/">DeepChem Project</a></li>
<li><a href="https://huggingface.co/DeepChem">DeepChem Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{singhChemBERTa3OpenSource2026,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Singh, Riya and Barsainyan, Aryan Amit and Irfan, Rida and Amorin, Connor Joseph and He, Stewart and Davis, Tony and Thiagarajan, Arun and Sankaran, Shiva and Chithrananda, Seyone and Ahmad, Walid and Jones, Derek and McLoughlin, Kevin and Kim, Hyojin and Bhutani, Anoushka and Sathyanarayana, Shreyas Vinaya and Viswanathan, Venkat and Allen, Jonathan E. and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-3}}: an open source training framework for chemical foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{662-685}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{The Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D5DD00348B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1039/D5DD00348B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GP-MoLFormer: Molecular Generation via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/gp-molformer/</guid><description>A 46.8M parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient property optimization.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-focus">Contribution and Taxonomic Focus</h2>
<p>This is primarily a <strong>Methodological</strong> paper, as it proposes a specific neural architecture (GP-MoLFormer) and a novel fine-tuning algorithm (Pair-tuning) for molecular generation. It validates these contributions against standard baselines (e.g., JT-VAE, <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b).</p>
<p>It also contains a secondary <strong>Theoretical</strong> contribution by establishing an empirical <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">scaling law</a> that relates inference compute (generation size) to the novelty of the generated molecules.</p>
<h2 id="motivation-data-scale-and-prompt-based-optimization">Motivation: Data Scale and Prompt-Based Optimization</h2>
<p>While large language models (LLMs) have transformed text generation, the impact of training data scale and memorization on <em>molecular</em> generative models remains under-explored. Specifically, there is a need to understand how training on billion-scale datasets affects the novelty of generated molecules and whether biases in public databases (like ZINC and PubChem) perpetuate memorization. Furthermore, existing optimization methods often require computationally expensive property predictors or reinforcement learning loops; there is a practical need for more efficient &ldquo;prompt-based&rdquo; optimization techniques.</p>
<h2 id="core-innovations-architecture-and-pair-tuning">Core Innovations: Architecture and Pair-Tuning</h2>
<ol>
<li><strong>Architecture</strong>: The application of a linear-attention transformer decoder with Rotary Positional Embeddings (RoPE) to generative chemistry, allowing for efficient training on 1.1 billion SMILES.</li>
<li><strong>Pair-Tuning</strong>: A novel, parameter-efficient fine-tuning method that uses property-ordered molecular pairs to learn &ldquo;soft prompts&rdquo; for optimization without updating the base model weights.</li>
<li><strong>Scaling Analysis</strong>: An extensive empirical investigation mapping the trade-off between inference compute (up to 10B generations) and chemical novelty, fitting an exponential decay curve that demonstrates how novelty saturates as generation volume grows.</li>
</ol>
<h2 id="experimental-methodology-and-downstream-tasks">Experimental Methodology and Downstream Tasks</h2>
<p>The authors evaluated GP-MoLFormer on three distinct tasks, though the comparisons highlight the difficulty of evaluating foundation models against classical baselines:</p>
<ol>
<li><strong>De Novo Generation</strong>: Comparing validity, uniqueness, and novelty against baselines (CharRNN, VAE, <a href="/notes/chemistry/molecular-design/generation/latent-space/limo-latent-inceptionism/">LIMO</a>, MolGen-7b) on a held-out test set. Notably, this is an unequal comparison; most baselines were trained on the 1.6M molecule <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> dataset, whereas GP-MoLFormer uses up to 1.1B molecules, meaning performance gains are heavily driven by data scale.</li>
<li><strong>Scaffold-Constrained Decoration</strong>: Generating molecules from DRD2 active binder scaffolds and measuring the hit rate of active compounds against specialized scaffold decorators.</li>
<li><strong>Property-Guided Optimization</strong>: Using Pair-tuning to optimize for Drug-likeness (QED), Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a>, and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> binding activity, comparing the results to graph-based and reinforcement learning benchmarks.</li>
</ol>
<p>Additionally, they performed a <strong>Scaling Study</strong>:</p>
<ul>
<li>Comparing models trained on raw (1.1B) vs. de-duplicated (650M) data.</li>
<li>Generating up to 10 billion molecules to fit empirical scaling laws for novelty.</li>
</ul>
<h2 id="key-findings-and-scaling-laws">Key Findings and Scaling Laws</h2>
<ul>
<li><strong>Scale Driven Performance</strong>: GP-MoLFormer achieves high internal diversity and validity on generation metrics. However, its baseline novelty percentage (~32%) is considerably lower than classical models. The authors attribute this to the massive training scale forcing the model to heavily prioritize matching real-world molecule frequencies over pure exploration. GP-MoLFormer&rsquo;s advantage in generation metrics over LLM-baselines like <a href="/notes/chemistry/molecular-design/generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b likely stems heavily from its 10x larger training dataset rather than fundamental architectural superiority.</li>
<li><strong>Pair-Tuning Efficacy</strong>: The proposed pair-tuning method effectively optimizes properties (e.g., improving DRD2 activity scores) without requiring full model fine-tuning or external reward loops. While successful, the text-based generation yields ~94.5% validity during optimization, which lags behind graph and SELFIES-based baselines that guarantee 100% structural validity.</li>
<li><strong>Memorization vs. Novelty</strong>: Training on de-duplicated data (GP-MoLFormer-UNIQ) yields higher novelty (approx. 5-8% higher) than training on raw data, confirming that duplication bias in public databases leads directly to memorization.</li>
<li><strong>Inference Scaling Law</strong>: Novelty decays exponentially with generation size ($y = ae^{-bx}$), yet the model maintains generative capability (~16.7% novelty) even after generating an unprecedented 10 billion molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Sources</strong>: A combination of <strong>PubChem</strong> (111M SMILES) and <strong>ZINC</strong> (1B SMILES) databases. Downloading and pre-training instructions are located in the repository&rsquo;s <code>data/README.md</code>.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>All SMILES were canonicalized using RDKit (no isomeric information).</li>
<li><strong>GP-MoLFormer (Base)</strong>: Trained on the full 1.1B dataset (includes duplicates).</li>
<li><strong>GP-MoLFormer-UNIQ</strong>: Trained on a de-duplicated subset of 650M SMILES.</li>
</ul>
</li>
<li><strong>Tokenization</strong>: Uses the tokenizer from Schwaller et al. (2019) with a vocabulary size of <strong>2,362 tokens</strong>.</li>
<li><strong>Filtering</strong>: Sequences restricted to a maximum length of <strong>202 tokens</strong>.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pair-Tuning (Algorithm 1)</strong>:</p>
<ul>
<li><strong>Objective</strong>: Learn task-specific soft prompts $\phi_T$ to maximize the conditional probability of target molecule $b$ given a seed molecule $a$, where pair $(a, b)$ satisfies the property condition $b &gt; a$. The base model parameters $\theta$ remain frozen.</li>
<li><strong>Prompt Structure</strong>: Autoregressive training optimizes the continuous embeddings of $n$ enhancement tokens against the cross-entropy loss of the target sequence:
$$ \mathcal{L}(\phi_T) = - \sum_{i=1}^{|b|} \log P_{\theta}(b_i | \phi_T, a, b_{&lt;i}) $$</li>
<li><strong>Hyperparameters</strong>: Trained for 1,000 epochs with a batch size of 35 and a fixed learning rate of $3 \times 10^{-2}$.</li>
<li><strong>Inference</strong>: The learned prompt $\phi_T$ and seed molecule $a$ are prepended as context, and candidates are sampled autoregressively until a termination token is produced.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Availability</strong>: The model trained on deduplicated data (GP-MoLFormer-UNIQ) is publicly available on <a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">Hugging Face</a>. The full 1.1B base model is not explicitly hosted. The source code repository includes a disclosure that IBM will not maintain the code going forward.</li>
<li><strong>Architecture</strong>: Transformer decoder (~47M parameters: 12 layers, 12 heads, hidden size 768).</li>
<li><strong>Attention Mechanism</strong>: Combines Linear Attention (Generalized Random Feature map, $\phi$) with Rotary Positional Embeddings (RoPE). To avoid the quadratic complexity of standard attention while maintaining relative positional awareness, RoPE is applied to queries ($Q$) and keys ($K$) prior to the random feature mapping:
$$ \text{Attention}(Q, K, V) = \frac{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle v_n}{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle} $$</li>
<li><strong>Inference Speed</strong>: ~3ms per forward pass on a single A100 GPU.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Generation Quality Metrics</strong>: Validity, Uniqueness, Novelty (<a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> suite), <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance (FCD)</a>, Scaffold similarity (Scaf), and Similarity to Nearest Neighbor (SNN).</li>
<li><strong>MoLFormer-Based Metrics</strong>: The authors introduce Fréchet <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a> Distance (FMD) and MoLFormer-space IntDiv2 to measure distributional similarity using their own pre-trained continuous embeddings instead of standard fingerprints.</li>
<li><strong>Optimization Metrics</strong>: Penalized logP (calculated as $\text{logP} - \text{SA} - \text{max}(\text{maxrings}(size) - 6, 0)$), Drug-likeness (QED), and DRD2 activity scores.</li>
<li><strong>Scaling Metrics</strong>: Empirical fit for novelty decay: $y = ae^{-bx}$.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 16 x NVIDIA A100 (80 GB) GPUs across 2 nodes connected via EDR Infiniband.</li>
<li><strong>Training Time</strong>:
<ul>
<li>GP-MoLFormer (1.1B data): ~115 hours total (28.75 hours/epoch for 4 epochs).</li>
<li>GP-MoLFormer-UNIQ (650M data): ~80 hours total.</li>
</ul>
</li>
<li><strong>Hyperparameters</strong>: Used a batch size of 1,600 molecules per GPU with a fixed learning rate of $1.6 \times 10^{-4}$ (scaled up to $8\times$ factor as GPUs increased).</li>
<li><strong>Optimization</strong>: Used distributed data-parallel training and adaptive bucketing by sequence length to handle scale.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/gp-molformer/">GP-MoLFormer (GitHub)</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation; IBM will not maintain going forward</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">GP-MoLFormer-Uniq (Hugging Face)</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained on 650M de-duplicated SMILES</td>
      </tr>
  </tbody>
</table>
<p>The full 1.1B base model weights are not publicly hosted. The training data (PubChem and ZINC) is publicly available, and instructions for downloading and pre-processing are in the repository&rsquo;s <code>data/README.md</code>.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Hoffman, S. C., Chenthamarakshan, V., Navratil, J., Mroueh, Y., &amp; Das, P. (2025). GP-MoLFormer: A Foundation Model For Molecular Generation. <em>Digital Discovery</em>, 4(10), 2684&ndash;2696. <a href="https://doi.org/10.1039/D5DD00122F">https://doi.org/10.1039/D5DD00122F</a></p>
<p><strong>Publication</strong>: Digital Discovery, vol. 4, no. 10, pp. 2684&ndash;2696 (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2025gpmolformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GP-MoLFormer: a foundation model for molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Hoffman, Samuel C and Chenthamarakshan, Vijil and Navratil, Jiri and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2684--2696}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D5DD00122F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-2: Scaling Molecular Transformers to 77M</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-2/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-2/</guid><description>Optimizing transformer pretraining for molecules using MLM vs MTR objectives, scaling to 77M compounds from PubChem for improved property prediction.</description><content:encoded><![CDATA[<h2 id="classifying-chemberta-2s-methodological-contributions">Classifying ChemBERTa-2&rsquo;s Methodological Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper with a secondary <strong>Resource</strong> contribution.</p>
<p>It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine &ldquo;how well&rdquo; these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models and establish a benchmark on a massive 77M compound dataset.</p>
<p><strong>Key methodological indicators</strong>:</p>
<ul>
<li><strong>Baseline comparison</strong>: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa-1</a>) with prominent benchmark tables</li>
<li><strong>Ablation studies</strong>: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size</li>
<li><strong>Scaling analysis</strong>: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance</li>
</ul>
<h2 id="motivations-for-scaling-molecular-transformers">Motivations for Scaling Molecular Transformers</h2>
<p>The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a &ldquo;chemical foundation model&rdquo;.</p>
<p><strong>Key motivations</strong>:</p>
<ul>
<li><strong>Label scarcity</strong>: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant</li>
<li><strong>Scaling hypothesis</strong>: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP</li>
<li><strong>Efficiency</strong>: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and weakly supervised (MTR, using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> computed properties as labels) approaches</li>
</ul>
<h2 id="novelty-in-multi-task-regression-objectives">Novelty in Multi-Task Regression Objectives</h2>
<p><strong>Scale</strong>: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>).</p>
<p><strong>Pipeline optimization</strong>: A direct, controlled comparison of <strong>Masked Language Modeling (MLM)</strong> vs. <strong>Multi-Task Regression (MTR)</strong> pretraining objectives on identical datasets.</p>
<p><strong>Proxy selection</strong>: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.</p>
<h2 id="experimental-pretraining-setup-on-77m-compounds">Experimental Pretraining Setup on 77M Compounds</h2>
<h3 id="pretraining-setup">Pretraining Setup</h3>
<p><strong>Datasets</strong>: Subsets of <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> containing 5M, 10M, and 77M unique SMILES.</p>
<p><strong>Tasks</strong>:</p>
<ul>
<li><strong>MLM</strong>: Masking 15% of tokens (following RoBERTa procedure). The model is optimized by minimizing the cross-entropy loss over the predicted masked tokens:
$$ \mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}) $$
where $\mathcal{M}$ represents the set of masked token indices.</li>
<li><strong>MTR</strong>: Predicting 200 calculated molecular properties (via RDKit) simultaneously using a mean squared error objective:
$$ \mathcal{L}_{MTR} = \frac{1}{200} \sum_{j=1}^{200} \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{ij} - y_{ij} \right)^2 $$
Continuous target labels $y_{ij}$ are mean-normalized prior to training to equilibrate the disparate scales of different chemical properties.</li>
</ul>
<p><strong>Hyperparameter search</strong>: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.</p>
<h3 id="downstream-validation">Downstream Validation</h3>
<p><strong>Finetuning</strong>: Evaluated on 8 tasks from <strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).</p>
<p><strong>Analysis</strong>: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.</p>
<h2 id="key-performance-outcomes-and-scaling-realities">Key Performance Outcomes and Scaling Realities</h2>
<p><strong>Highly competitive performance</strong>: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks, though the margins demonstrate that task-specific baselines remain notably robust.</p>
<p><strong>MTR superiority</strong>: Models pretrained on Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained on MLM on every finetuning task evaluated. MTR is substantially slower than MLM due to the larger input size from the 200-element label vector, but MLM loss serves as a reliable proxy for MTR loss, enabling cheaper architecture search before committing to full MTR pretraining.</p>
<p><strong>Scaling laws versus downstream utility</strong>: Pretraining loss improved by 25-35% when increasing the dataset from 5M to 77M compounds. However, this improvement in pretraining loss does not uniformly transfer to downstream tasks. For MTR models, SR-p53 ROC-AUC decreases monotonically from 0.834 (5M) to 0.827 (10M) to 0.817 (77M), and Lipophilicity RMSE is worse at 77M (0.798) than at 5M (0.758), despite a dip at 10M (0.744). This variability in transfer challenges the assumption that pretraining improvements always yield downstream gains.</p>
<p><strong>Transfer learning</strong>: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The pretraining corpus is derived from <strong>PubChem</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>PubChem</td>
          <td>77M SMILES</td>
          <td>Canonicalized and globally shuffled. Subsets of 5M and 10M used. <strong>Note: Exact splits and datasets are not published.</strong></td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>PubChem</td>
          <td>100k SMILES</td>
          <td>A fixed set held out from the 77M corpus. <strong>Note: Exact 100k subset is not published.</strong></td>
      </tr>
      <tr>
          <td><strong>MTR Labels</strong></td>
          <td>RDKit</td>
          <td>200 props</td>
          <td>200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. <strong>Note: Calculated labels are not published and must be re-computed.</strong></td>
      </tr>
      <tr>
          <td><strong>Finetuning</strong></td>
          <td>MoleculeNet</td>
          <td>1.5k - 8k</td>
          <td>Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pretraining Objectives:</strong></p>
<ol>
<li><strong>Masked Language Modeling (MLM)</strong>: Follows RoBERTa procedure. Masks 15% of tokens. Max sequence length 512.</li>
<li><strong>Multi-Task Regression (MTR)</strong>: Predicting 200 RDKit properties. Labels are mean-normalized.</li>
</ol>
<p><strong>Tokenizer:</strong></p>
<ul>
<li>Dictionary of common SMILES characters</li>
<li>Maximum vocabulary size: <strong>591 tokens</strong></li>
</ul>
<p><strong>Optimization:</strong></p>
<ul>
<li><strong>Patience</strong>: Early stopping set to one pass through the dataset to ensure full coverage</li>
<li><strong>Hyperparameter search</strong>: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate. <strong>Note: The precise configuration of the winning models that were scaled to 77M is absent from the paper.</strong></li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Based on <strong>RoBERTa</strong> (HuggingFace implementation)</li>
<li><strong>Parameter scale</strong>: Models ranged between <strong>5M and 46M parameters</strong></li>
<li><strong>Selection</strong>: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset</li>
<li><strong>Checkpoints</strong>: Pre-trained weights are hosted by DeepChem on <a href="https://huggingface.co/DeepChem">Hugging Face</a>. Direct links include <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MTR">DeepChem/ChemBERTa-77M-MTR</a> and <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MLM">DeepChem/ChemBERTa-77M-MLM</a> (Note: Model cards are currently empty).</li>
<li><strong>Code Reference</strong>: While the <a href="https://github.com/deepchem/deepchem">DeepChem</a> repository is referenced for code, isolated training scripts tailored to recreate ChemBERTa-2&rsquo;s exact pipeline are not separated from the generalized deepchem library tooling.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Benchmarks were performed on <strong>MoleculeNet</strong> using DeepChem.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Tasks</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RMSE</strong> ($\downarrow$)</td>
          <td>Delaney, Lipo, BACE (Reg), Clearance</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.8).</td>
      </tr>
      <tr>
          <td><strong>ROC-AUC</strong> ($\uparrow$)</td>
          <td>BBBP, ClinTox, HIV, Tox21, BACE (Cls)</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: AWS EC2 instances with <strong>Nvidia T4 GPUs</strong></li>
<li><strong>Strategy</strong>: AWS Spot instances were used to reduce cost; implemented frequent checkpointing to handle interruptions.</li>
<li><strong>Note</strong>: For MTR, they wrote a custom data loader wrapper around HuggingFace&rsquo;s text loader to handle CSV parsing efficiency, as the default CSV loader was a major bottleneck for the 200-element target vectors.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. <em>arXiv preprint arXiv:2209.01712</em>. <a href="https://doi.org/10.48550/arXiv.2209.01712">https://doi.org/10.48550/arXiv.2209.01712</a></p>
<p><strong>Publication</strong>: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa-1 Paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{ahmadChemBERTa2ChemicalFoundation2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemformer: A Pre-trained Transformer for Comp Chem</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/chemformer/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/autoregressive/chemformer/</guid><description>BART-based Transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical sequence tasks.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It proposes an architecture adaptation (Chemformer based on BART) and a specific pre-training strategy (&ldquo;Combined&rdquo; masking and augmentation). The paper validates this method by benchmarking against established models on multiple tasks, including direct synthesis, retrosynthesis, and molecular optimization. It also includes a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution by making the pre-trained models and code available.</p>
<h2 id="motivation-computational-bottlenecks-in-cheminformatics">Motivation: Computational Bottlenecks in Cheminformatics</h2>
<p>Existing Transformer models for cheminformatics are often developed for single applications and are computationally expensive to train from scratch. For example, training a Molecular Transformer for reaction prediction can take days, limiting hyperparameter exploration. Self-supervised pre-training (like BERT or T5) has significantly advanced NLP by reducing fine-tuning time and improving performance. In chemistry, applications have traditionally focused on task-specific datasets or encoder-only architectures, which perform poorly on sequence generation tasks. The authors aim to use transfer learning on a large unlabelled dataset to create a model that converges quickly and performs well across diverse sequence-to-sequence and discriminative tasks.</p>
<h2 id="core-innovation-bart-architecture-and-combined-pre-training">Core Innovation: BART Architecture and Combined Pre-training</h2>
<p>The primary insight lies in the adaptation of the <strong>BART architecture</strong> for chemistry and the introduction of a <strong>&ldquo;Combined&rdquo; self-supervised pre-training task</strong>.</p>
<ul>
<li><strong>Architecture</strong>: Chemformer uses the BART encoder-decoder structure, allowing it to handle both discriminative (property prediction) and generative (reaction prediction) tasks efficiently. This provides an alternative to encoder-only (BERT) or decoder-only (GPT) models.</li>
<li><strong>Combined Pre-training</strong>: The authors introduce a task that applies both <strong>Span Masking</strong> (randomly replacing tokens with <code>&lt;mask&gt;</code>) and <strong><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> Augmentation</strong> (permuting atom order, see <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">Randomized SMILES</a>) simultaneously. Formally, given a canonical SMILES sequence $x$, a corrupted sequence $\tilde{x} = \text{Mask}(\text{Augment}(x))$ is generated. The model is trained using an autoregressive cross-entropy loss to reconstruct the canonical sequence from the corrupted input:
$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{&lt;t}, \tilde{x}) $$</li>
<li><strong>Tunable Augmentation</strong>: A downstream augmentation strategy is proposed where the probability of augmenting the input/output SMILES ($p_{aug}$) is a tunable hyperparameter, performed on-the-fly.</li>
</ul>
<h2 id="experimental-setup-and-pre-training-tasks">Experimental Setup and Pre-training Tasks</h2>
<p>The authors pre-trained Chemformer on <strong>100 million molecules</strong> from ZINC-15 and fine-tuned it on three distinct task types:</p>
<ol>
<li><strong>Seq2Seq Reaction Prediction</strong>:
<ul>
<li><em>Direct Synthesis</em>: USPTO-MIT dataset (Mixed and Separated).</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: USPTO-50K dataset (see also <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a>).</li>
</ul>
</li>
<li><strong>Molecular Optimization</strong>: Generating molecules with improved properties (<a href="https://en.wikipedia.org/wiki/Distribution_coefficient">LogD</a>, solubility, clearance) starting from ChEMBL matched molecular pairs.</li>
<li><strong>Discriminative Tasks</strong>:
<ul>
<li><em><a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a></em>: Predicting properties (ESOL, FreeSolv, Lipophilicity) from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
<li><em>Bioactivity</em>: Predicting pXC50 values for 133 genes using ExCAPE data.</li>
</ul>
</li>
</ol>
<p>Ablation studies compared three pre-training strategies (Masking, Augmentation, Combined) against a randomly initialized baseline.</p>
<h2 id="results-trade-offs-and-conclusions">Results, Trade-offs, and Conclusions</h2>
<ul>
<li><strong>Performance</strong>: Chemformer achieved <strong>competitive top-1 accuracy</strong> on USPTO-MIT (91.3% Mixed) and USPTO-50K (53.6-54.3%), outperforming the Augmented Transformer and graph-based models (GLN, GraphRetro).</li>
<li><strong>Convergence Speed</strong>: Pre-training significantly accelerated training; fine-tuning for just 20 epochs (30 mins) outperformed the previous baselines trained for significantly longer.</li>
<li><strong>Pre-training Tasks</strong>: The &ldquo;Combined&rdquo; task generally performed best for reaction prediction and bioactivity, while &ldquo;Masking&rdquo; was superior for molecular optimization.</li>
<li><strong>Augmentation Trade-off</strong>: The augmentation strategy improved top-1 accuracy but significantly degraded top-5/10 accuracy because beam search outputs became populated with augmented versions of the same molecule. This presents a considerable limitation for practical applications like retrosynthesis mapping, where retrieving a diverse set of candidate reactions is often critical.</li>
<li><strong>Discriminative Evaluation Caveats</strong>: Chemformer underperformed specialized baselines (like D-MPNN or <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>) on small discriminative datasets. The authors note that direct comparison is difficult: Chemformer was trained simultaneously on multiple subtasks (multi-task learning), while the literature baselines were trained and tuned on each subtask separately. Additionally, the Chemformer encoder uses fewer than 20M parameters compared to MolBERT&rsquo;s approximately 85M, and Chemformer&rsquo;s pre-training does not include molecular property objectives. For other transfer learning approaches to QSAR, see <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a>.</li>
<li><strong>Pre-training Data Scope</strong>: The 100M pre-training dataset from ZINC-15 was selected with constraints on molecular weight ($\le 500$ Da) and LogP ($\le 5$), focusing the learned representations on small, drug-like molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><em>Note: The primary GitHub repository for Chemformer was officially archived on February 11, 2026. Pre-trained weights and datasets used in the paper are still hosted externally on <a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Box</a>. Active development of Chemformer models has moved to the <a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels</a> repository.</em></p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/Chemformer">Chemformer (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Archived; original PyTorch implementation</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Active successor repository</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Pre-trained weights (Box)</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Base and Large model checkpoints</td>
      </tr>
  </tbody>
</table>
<p>The following datasets were used for pre-training and benchmarking.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Pre-training</strong></td>
          <td style="text-align: left">ZINC-15</td>
          <td style="text-align: left">100M</td>
          <td style="text-align: left">Selected subset (reactive, annotated purchasability, MW $\le 500$, LogP $\le 5$). Split: 99% Train / 0.5% Val / 0.5% Test.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Direct Synthesis</strong></td>
          <td style="text-align: left">USPTO-MIT</td>
          <td style="text-align: left">~470k</td>
          <td style="text-align: left">Evaluated on &ldquo;Mixed&rdquo; and &ldquo;Separated&rdquo; variants.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Retrosynthesis</strong></td>
          <td style="text-align: left">USPTO-50K</td>
          <td style="text-align: left">~50k</td>
          <td style="text-align: left">Standard benchmark for retrosynthesis.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimization</strong></td>
          <td style="text-align: left">ChEMBL MMPs</td>
          <td style="text-align: left">~160k Train</td>
          <td style="text-align: left">Matched Molecular Pairs for LogD, solubility, and clearance optimization.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Properties</strong></td>
          <td style="text-align: left">MoleculeNet</td>
          <td style="text-align: left">Small</td>
          <td style="text-align: left">ESOL (1128), FreeSolv (642), Lipophilicity (4200).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Bioactivity</strong></td>
          <td style="text-align: left">ExCAPE</td>
          <td style="text-align: left">~312k</td>
          <td style="text-align: left">133 gene targets; &gt;1200 compounds per gene.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenization (523 tokens total) derived from ChEMBL 27 canonical SMILES.</li>
<li><strong>Augmentation</strong>: SMILES enumeration (permuting atom order) used for pre-training and on-the-fly during fine-tuning ($p_{aug}=0.5$ for Seq2Seq, $p_{aug}=1.0$ for discriminative).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pre-training Tasks</strong>:
<ol>
<li><em>Masking</em>: Span masking (BART style).</li>
<li><em>Augmentation</em>: Input is a randomized SMILES; target is canonical SMILES.</li>
<li><em>Combined</em>: Input is augmented <em>then</em> masked; target is canonical SMILES.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$).</li>
<li>Schedule: Linear warm-up (8000 steps) for pre-training; One-cycle schedule for fine-tuning.</li>
</ul>
</li>
<li><strong>Inference</strong>: <a href="https://en.wikipedia.org/wiki/Beam_search">Beam search</a> with width 10 for Seq2Seq tasks. Used <code>molbart/inference_score.py</code> and <code>molbart/retrosynthesis/round_trip_inference.py</code> for standard and round-trip validation.</li>
</ul>
<h3 id="models">Models</h3>
<p>Two model sizes were trained. Both use the Pre-Norm Transformer layout with GELU activation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Chemformer (Base)</th>
          <th style="text-align: left">Chemformer-Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">6</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model Dimension</strong></td>
          <td style="text-align: left">512</td>
          <td style="text-align: left">1024</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Feed-forward Dim</strong></td>
          <td style="text-align: left">2048</td>
          <td style="text-align: left">4096</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention Heads</strong></td>
          <td style="text-align: left">8</td>
          <td style="text-align: left">16</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~45M</td>
          <td style="text-align: left">~230M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Pre-training Task</strong></td>
          <td style="text-align: left">All 3 variants</td>
          <td style="text-align: left">Combined only</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Comparisons relied on Top-N accuracy for reaction tasks and validity metrics for optimization.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Direct Synthesis (Sep)</td>
          <td style="text-align: left"><strong>92.8%</strong> (Large)</td>
          <td style="text-align: left">91.1% (Aug Transformer)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Retrosynthesis</td>
          <td style="text-align: left"><strong>54.3%</strong> (Large)</td>
          <td style="text-align: left">53.7% (GraphRetro) / 52.5% (GLN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Desirable %</strong></td>
          <td style="text-align: left">Mol Optimization</td>
          <td style="text-align: left"><strong>75.0%</strong> (Base-Mask)</td>
          <td style="text-align: left">70.2% (Transformer-R)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>RMSE</strong></td>
          <td style="text-align: left">Lipophilicity</td>
          <td style="text-align: left">0.598 (Combined)</td>
          <td style="text-align: left">0.555 (D-MPNN)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 NVIDIA V100 GPUs (batch size 128 per GPU).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Pre-training: 2.5 days (Base) / 6 days (Large) for 1M steps.</li>
<li>Fine-tuning: ~20-40 epochs for reaction prediction (&lt;12 hours).</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Irwin, R., Dimitriadis, S., He, J., &amp; Bjerrum, E. J. (2022). Chemformer: a pre-trained transformer for computational chemistry. <em>Machine Learning: Science and Technology</em>, 3(1), 015022. <a href="https://doi.org/10.1088/2632-2153/ac3ffb">https://doi.org/10.1088/2632-2153/ac3ffb</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{irwinChemformerPretrainedTransformer2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemformer: A Pre-Trained Transformer for Computational Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Irwin, Ross and Dimitriadis, Spyridon and He, Jiazhen and Bjerrum, Esben Jannik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{015022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2632-2153}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/ac3ffb}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa: Molecular Property Prediction via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta/</guid><description>A systematic evaluation of RoBERTa transformers pretrained on 77M PubChem SMILES for molecular property prediction tasks.</description><content:encoded><![CDATA[<h2 id="taxonomy-and-paper-contributions">Taxonomy and Paper Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$), with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>It is a methodological investigation because it systematically evaluates a specific architecture (Transformers/RoBERTa) against established State-of-the-Art (SOTA) baselines like directed Message Passing Neural Networks (D-MPNNs) to determine &ldquo;how well does this work?&rdquo; in the chemical domain. It ablates dataset size, tokenization, and input representation.</p>
<p>It is also a resource paper as it introduces &ldquo;PubChem-77M,&rdquo; a curated dataset of 77 million SMILES strings designed to facilitate large-scale self-supervised pretraining for the community.</p>
<h2 id="overcoming-data-scarcity-in-property-prediction">Overcoming Data Scarcity in Property Prediction</h2>
<p>The primary motivation is <strong>data scarcity</strong> in molecular property prediction. Graph Neural Networks (GNNs) achieve strong performance on property prediction tasks when provided with sufficient labeled data. Generating these labels requires costly and time-consuming laboratory testing, leading to severe data scarcity in specialized chemical domains.</p>
<p>Massive quantities of <strong>unlabeled chemical structure data</strong> exist in the form of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. Inspired by the success of Transformers in NLP, where self-supervised pretraining on large corpora yields strong transfer learning, the authors aim to use these unlabeled datasets to learn effective molecular representations. Additionally, Transformers benefit from a mature software ecosystem (HuggingFace) that offers efficiency advantages over GNNs.</p>
<h2 id="pretraining-scaling-laws-and-novelty">Pretraining Scaling Laws and Novelty</h2>
<p>Previous works applied Transformers to SMILES strings. This paper advances the field by systematically evaluating scaling laws and architectural components for this domain. Specifically:</p>
<ul>
<li><strong>Scaling Analysis</strong>: It explicitly tests how pretraining dataset size (100K to 10M) impacts downstream performance.</li>
<li><strong>Tokenizer Comparison</strong>: It compares standard NLP <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte-Pair Encoding (BPE)</a> against a chemically-aware &ldquo;SmilesTokenizer&rdquo;.</li>
<li><strong>Representation Comparison</strong>: It evaluates if the robust <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> string representation offers advantages over standard SMILES in a Transformer context.</li>
</ul>
<h2 id="experimental-setup-pretraining-and-finetuning">Experimental Setup: Pretraining and Finetuning</h2>
<p>The authors trained <strong>ChemBERTa</strong> (based on RoBERTa) using Masked Language Modeling (MLM) on subsets of the <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. The core training objective minimizes the cross-entropy loss over a corrupted input where a subset of basic tokens, denoted by $\mathcal{M}$, are masked:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $x_i$ is the exact masked token, $x_{\setminus \mathcal{M}}$ is the corrupted SMILES context string, and $\theta$ represents the network parameters.</p>
<ul>
<li><strong>Pretraining</strong>: Models were pretrained on dataset sizes of 100K, 250K, 1M, and 10M compounds.</li>
<li><strong>Baselines</strong>: Performance was compared against D-MPNN (Graph Neural Network), Random Forest (RF), and SVM using 2048-bit Morgan Fingerprints.</li>
<li><strong>Downstream Tasks</strong>: Finetuning was performed individually on small <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks: BBBP (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>), ClinTox (clinical toxicity), HIV, and Tox21 (p53 stress-response). This poses a transfer learning challenge, as the model must adapt from pretraining on 10 million molecules to classifying datasets ranging from ~1.5K to ~41K examples.</li>
<li><strong>Ablations</strong>:
<ul>
<li><strong>Tokenization</strong>: BPE vs. SmilesTokenizer on the 1M dataset, evaluated on Tox21.</li>
<li><strong>Input</strong>: SMILES vs. SELFIES strings on the Tox21 task.</li>
</ul>
</li>
</ul>
<h2 id="results-vs-graph-neural-network-baselines">Results vs. Graph Neural Network Baselines</h2>
<p>The main comparison between ChemBERTa (pretrained on 10M compounds) and Chemprop baselines on MoleculeNet tasks is summarized below (Table 1 from the paper):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>BBBP PRC</th>
          <th>ClinTox ROC</th>
          <th>ClinTox PRC</th>
          <th>HIV ROC</th>
          <th>HIV PRC</th>
          <th>Tox21 ROC</th>
          <th>Tox21 PRC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemBERTa 10M</td>
          <td>0.643</td>
          <td>0.620</td>
          <td>0.733</td>
          <td>0.975</td>
          <td>0.622</td>
          <td>0.119</td>
          <td>0.728</td>
          <td>0.207</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.708</td>
          <td>0.697</td>
          <td>0.906</td>
          <td>0.993</td>
          <td>0.752</td>
          <td>0.152</td>
          <td>0.688</td>
          <td>0.429</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>0.681</td>
          <td>0.692</td>
          <td>0.693</td>
          <td>0.968</td>
          <td>0.780</td>
          <td>0.383</td>
          <td>0.724</td>
          <td>0.335</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>0.702</td>
          <td>0.724</td>
          <td>0.833</td>
          <td>0.986</td>
          <td>0.763</td>
          <td>0.364</td>
          <td>0.708</td>
          <td>0.345</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Scaling Improvements &amp; Training Dynamics</strong>: Performance scales predictably with pretraining data size. Increasing data from 100K to 10M improved ROC-AUC by +0.110 and PRC-AUC by +0.059 on average across BBBP, ClinTox, and Tox21 (HIV was omitted due to resource constraints). Notably, researchers had to halt pretraining on the 10M subset after just 3 epochs due to overfitting, suggesting that simple 15% token masking might not provide a sufficiently difficult learning curvature for large-scale chemical representation.</li>
<li><strong>Performance Limits vs. GNNs</strong>: ChemBERTa generally performs below the D-MPNN baseline. On the Tox21 dataset, ChemBERTa-10M achieved a higher ROC-AUC (0.728) than D-MPNN (0.688); nonetheless, it recorded a substantially lower PRC-AUC (0.207 vs 0.429). This gap indicates that current Transformer iterations lack the explicit inductive biases of graph algorithms and struggle with the severe class imbalances typical of chemical datasets.</li>
<li><strong>Ablation Limitations (Tokenization &amp; SELFIES)</strong>: The authors&rsquo; ablation studies for tokenization (SmilesTokenizer narrowly beating BPE) and input representation (SELFIES performing comparably to SMILES) were evaluated exclusively on the single Tox21 task. Deriving broad architectural conclusions regarding &ldquo;semantically-aware tokenization&rdquo; or string robustness from an $N=1$ empirical evaluation is a significant limitation of the study. Broader benchmarking is required to validate these findings.</li>
<li><strong>Interpretability</strong>: Attention heads organically learn to track chemically relevant substructures (like specific functional groups and aromatic rings), mimicking the inductive biases of graph convolutions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors curated a massive dataset for pretraining and utilized standard benchmarks for evaluation.</p>
<ul>
<li><strong>Pretraining Data</strong>: <strong>PubChem-77M</strong>.
<ul>
<li>Source: 77 million unique SMILES from PubChem.</li>
<li>Preprocessing: Canonicalized and globally shuffled.</li>
<li>Subsets used: 100K, 250K, 1M, and 10M subsets.</li>
<li><em>Availability Note</em>: The authors provided a direct link to the <a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">canonicalized 10M compound subset</a> used for their largest experiments. Full reproducibility of the smaller (100K, 250K, 1M) or full 77M sets may require re-extracting from PubChem.</li>
</ul>
</li>
<li><strong>Evaluation Data</strong>: <strong>MoleculeNet</strong>.
<ul>
<li>Tasks: BBBP (2,039), ClinTox (1,478), HIV (41,127), Tox21 (7,831).</li>
<li>Splitting: 80/10/10 train/valid/test split using a <strong>scaffold splitter</strong> to ensure chemical diversity between splits.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p>The core training methodology mirrors standard BERT/RoBERTa procedures adapted for chemical strings.</p>
<ul>
<li><strong>Objective</strong>: Masked Language Modeling (MLM) with <strong>15% token masking</strong>.</li>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>BPE</strong>: Byte-Pair Encoder (vocab size 52K).</li>
<li><strong>SmilesTokenizer</strong>: Regex-based custom tokenizer available in DeepChem (documented <a href="https://deepchem.readthedocs.io/en/latest/tokenizers.html#smilestokenizer">here</a>).</li>
</ul>
</li>
<li><strong>Sequence Length</strong>: Maximum sequence length of <strong>512 tokens</strong>.</li>
<li><strong>Finetuning</strong>: Appended a linear classification layer; backpropagated through the base model for up to 25 epochs with early stopping on ROC-AUC.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: <strong>RoBERTa</strong> (via HuggingFace).
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 12 (72 distinct mechanisms total).</li>
<li><em>Implementation Note</em>: The original training notebooks and scripts are maintained in the authors&rsquo; <a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry repository</a>, alongside the primary downstream tasks integrated into DeepChem. A <a href="https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb">full Tox21 transfer learning tutorial</a> has been incorporated into the DeepChem repository.</li>
</ul>
</li>
<li><strong>Baselines</strong> (via Chemprop library):
<ul>
<li><strong>D-MPNN</strong>: Directed Message Passing Neural Network with default hyperparameters.</li>
<li><strong>RF/SVM</strong>: Scikit-learn Random Forest and SVM using 2048-bit Morgan fingerprints (<a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured using dual metrics to account for class imbalance common in toxicity datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ROC-AUC</strong></td>
          <td>Area Under Receiver Operating Characteristic Curve</td>
      </tr>
      <tr>
          <td><strong>PRC-AUC</strong></td>
          <td>Area Under Precision-Recall Curve (vital for imbalanced data)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Single <strong>NVIDIA V100 GPU</strong>.</li>
<li><strong>Training Time</strong>: Approximately <strong>48 hours</strong> for the 10M compound subset.</li>
<li><strong>Carbon Footprint</strong>: Estimated 17.1 kg $\text{CO}_2\text{eq}$ (offset by Google Cloud).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training notebooks and finetuning scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration of ChemBERTa and SmilesTokenizer</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">ChemBERTa-zinc-base-v1</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained RoBERTa on 100K ZINC SMILES</td>
      </tr>
      <tr>
          <td><a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">PubChem-10M subset</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Canonicalized 10M compound subset used for largest experiments</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code and pre-trained models are available, and the 10M pretraining subset is downloadable. However, smaller subsets (100K, 250K, 1M) may need re-extraction from PubChem, and exact hyperparameter details for finetuning (learning rate, batch size) are not fully specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. <em>arXiv preprint arXiv:2010.09885</em>. <a href="https://doi.org/10.48550/arXiv.2010.09885">https://doi.org/10.48550/arXiv.2010.09885</a></p>
<p><strong>Publication</strong>: arXiv 2020 (Preprint)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">HuggingFace Model Hub (ChemBERTa-zinc-base-v1)</a> - <em>Additional pre-trained variations on PubChem &amp; ZINC datasets are available on the author&rsquo;s <a href="https://huggingface.co/seyonec">seyonec</a> HF profile.</em></li>
<li><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry GitHub Repository</a> - <em>Notebooks and scripts used for MLM pretraining and finetuning evaluations.</em></li>
</ul>
<h3 id="bibtex">BibTeX</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{chithranandaChemBERTaLargeScaleSelfSupervised2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa}}: {{Large-Scale Self-Supervised Pretraining}} for {{Molecular Property Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Translating InChI to IUPAC Names with Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/</guid><description>Sequence-to-sequence Transformer translating InChI identifiers to IUPAC names with 91% accuracy on organic compounds.</description><content:encoded><![CDATA[<h2 id="primary-contribution-a-transformer-based-method">Primary Contribution: A Transformer-Based Method</h2>
<p>This is primarily a <strong>Method</strong> paper. It adapts a specific architecture (Transformer) to a specific task (InChI-to-IUPAC translation) and evaluates its performance against both machine learning and commercial baselines. It also has a secondary <strong>Resource</strong> contribution, as the trained model and scripts are released as open-source software.</p>
<h2 id="motivation-the-bottleneck-in-algorithmic-iupac-nomenclature">Motivation: The Bottleneck in Algorithmic IUPAC Nomenclature</h2>
<p>Generating correct IUPAC names is difficult due to the comprehensive but complex rules defined by the International Union of Pure and Applied Chemistry. Commercial software generates names from structures but remains closed-source with opaque methodologies and frequent inter-package disagreements. Open identifiers like InChI and SMILES lack direct human readability. This creates a need for an open, automated method to generate informative IUPAC names from standard identifiers like InChI, which are ubiquitous in online chemical databases.</p>
<h2 id="novelty-treating-chemical-translation-as-a-character-level-sequence">Novelty: Treating Chemical Translation as a Character-Level Sequence</h2>
<p>The key novelty is treating chemical nomenclature translation as a character-level sequence-to-sequence problem using a Transformer architecture, specifically using <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> as the source language.</p>
<ul>
<li>Standard Neural Machine Translation (NMT) uses sub-word tokenization. This model processes InChI and predicts IUPAC names character-by-character.</li>
<li>It demonstrates that character-level tokenization outperforms byte-pair encoding or unigram models for this specific chemical task.</li>
<li>It uses InChI&rsquo;s standardization to avoid the canonicalization issues inherent in SMILES-based approaches.</li>
<li>The attention mechanism allows the decoder to align specific parts of the generated IUPAC name with corresponding structural features in the source InChI string, operating via the standard scaled dot-product attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$</li>
</ul>
<h2 id="methodology--experimental-validation">Methodology &amp; Experimental Validation</h2>
<ul>
<li><strong>Training:</strong> The model was trained on 10 million InChI/IUPAC pairs sampled from PubChem using a character-level objective. The model is supervised using categorical cross-entropy loss across the vocabulary of characters:
$$ \mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) $$</li>
<li><strong>Ablation Studies:</strong> The authors experimentally validated architecture choices, finding that LSTM models and sub-word tokenization (BPE) performed worse than the Transformer with character tokenization. They also optimized dropout rates.</li>
<li><strong>Performance Benchmarking:</strong> The model was evaluated on a held-out test set of 200,000 samples. Performance was quantified primarily by Whole-Name Accuracy and Normalized Edit Distance (based on the Damerau-Levenshtein distance, scaled by the maximum string length).</li>
<li><strong>Commercial Comparison:</strong> The authors compared their model against four major commercial packages (ACD/I-Labs, ChemAxon, Mestrelab, and PubChem&rsquo;s Lexichem). However, this evaluation used a highly limited test set of only 100 molecules, restricting the statistical confidence of the external baseline.</li>
<li><strong>Error Analysis:</strong> They analyzed performance across different chemical classes (organics, charged species, macrocycles, inorganics) and visualized attention coefficients to interpret model focus.</li>
</ul>
<h2 id="key-results-and-the-inorganic-challenge">Key Results and the Inorganic Challenge</h2>
<ul>
<li><strong>High Accuracy on Organics:</strong> The model achieved 91% whole-name accuracy on the test set, performing particularly well on organic compounds.</li>
<li><strong>Comparable to Commercial Tools:</strong> On the limited 100-molecule benchmark, the edit distance between the model&rsquo;s predictions and commercial packages (15-23%) was similar to the variation found <em>between</em> the commercial packages themselves (16-21%).</li>
<li><strong>Limitations on Inorganics:</strong> The model performed poorly on inorganic (14% accuracy) and organometallic compounds (20% accuracy). This is attributed to inherent data limitations in the standard InChI format (which deliberately disconnects metal atoms from their ligands) and low training data coverage for those classes.</li>
<li><strong>Character-Level Superiority:</strong> Character-level tokenization was found to be essential; byte-pair encoding reduced accuracy significantly.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was derived from <a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem&rsquo;s public FTP server</a> (<code>CID-SMILES.gz</code> and <code>CID-IUPAC.gz</code>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Raw</strong></td>
          <td>PubChem</td>
          <td>100M pairs</td>
          <td>Filtered for length (InChI &lt; 200 chars, IUPAC &lt; 150 chars). 132k unparseable SMILES dropped.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Subsampled</td>
          <td>10M pairs</td>
          <td>Random sample from the filtered set.</td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>Held-out</td>
          <td>10,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>Held-out</td>
          <td>200,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Tokenization</strong></td>
          <td>Vocab</td>
          <td>InChI: 66 chars<br>IUPAC: 70 chars</td>
          <td>Character-level tokenization. Spaces treated as tokens.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: OpenNMT-py 2.0.0 (using PyTorch). Training scripts and vocabularies are available as supplementary files to the original publication. Pre-trained model weights are hosted on <a href="https://doi.org/10.5281/zenodo.5081159">Zenodo</a>.</li>
<li><strong>Architecture Type</strong>: Transformer Encoder-Decoder.</li>
<li><strong>Optimization</strong>: ADAM optimizer ($\beta_1=0.9, \beta_2=0.998$).</li>
<li><strong>Learning Rate</strong>: Linear warmup over 8000 steps to 0.0005, then decayed by inverse square root of iteration.</li>
<li><strong>Regularization</strong>:
<ul>
<li>Dropout: 0.1 (applied to dense and attentional layers).</li>
<li>Label Smoothing: Magnitude 0.1.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Teacher forcing used for both training and validation.</li>
<li><strong>Gradient Accumulation</strong>: Gradients accumulated over 4 batches before updating parameters.</li>
<li><strong>Inference</strong>: Beam search with width 10 and length penalty 1.0.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Structure</strong>: 6 layers in encoder, 6 layers in decoder.</li>
<li><strong>Attention</strong>: 8 heads per attention sub-layer.</li>
<li><strong>Dimensions</strong>:
<ul>
<li>Feed-forward hidden state size: 2048.</li>
<li>Embedding vector length: 512.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Glorot&rsquo;s method.</li>
<li><strong>Position</strong>: Positional encoding added to word vectors.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported include <strong>Whole-Name Accuracy</strong> (percentage of exact matches) and <strong>Normalized Edit Distance</strong> (Damerau-Levenshtein, scale 0-1).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy (All)</td>
          <td>91%</td>
          <td>N/A</td>
          <td>Test set of 200k samples.</td>
      </tr>
      <tr>
          <td>Accuracy (Inorganic)</td>
          <td>14%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Organometallic)</td>
          <td>20%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Charged)</td>
          <td>79%</td>
          <td>N/A</td>
          <td>Test set subset.</td>
      </tr>
      <tr>
          <td>Accuracy (Rajan)</td>
          <td>72%</td>
          <td>N/A</td>
          <td>Comparative ML model (STOUT).</td>
      </tr>
      <tr>
          <td>Edit Dist (Organic)</td>
          <td>$0.02 \pm 0.03$</td>
          <td>N/A</td>
          <td>Very high similarity for organics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Inorganic)</td>
          <td>$0.32 \pm 0.20$</td>
          <td>N/A</td>
          <td>Poor performance on inorganics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Organometallic)</td>
          <td>$0.37 \pm 0.24$</td>
          <td>N/A</td>
          <td>Poor performance on organometallics.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla K80.</li>
<li><strong>Training Time</strong>: 7 days.</li>
<li><strong>Throughput</strong>: ~6000 tokens/sec (InChI) and ~3800 tokens/sec (IUPAC).</li>
<li><strong>Batch Size</strong>: 4096 tokens (approx. 30 compounds).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5081159">InChI to IUPAC model</a></td>
          <td>Model</td>
          <td>CC BY 4.0</td>
          <td>Pre-trained Transformer weights (551 MB), requires OpenNMT-py 2.0.0</td>
      </tr>
      <tr>
          <td><a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem FTP</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source data: CID-SMILES.gz and CID-IUPAC.gz</td>
      </tr>
      <tr>
          <td>Training scripts &amp; vocabularies</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Included as supplementary files with the publication</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Handsel, J., Matthews, B., Knight, N. J., &amp; Coles, S. J. (2021). Translating the InChI: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier. <em>Journal of Cheminformatics</em>, 13(1), 79. <a href="https://doi.org/10.1186/s13321-021-00535-x">https://doi.org/10.1186/s13321-021-00535-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{handselTranslatingInChIAdapting2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Translating the {{InChI}}: Adapting Neural Machine Translation to Predict {{IUPAC}} Names from a Chemical Identifier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Translating the {{InChI}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Handsel, Jennifer and Matthews, Brian and Knight, Nicola J. and Coles, Simon J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00535-x}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine&#39;s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91\%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention,GPU,InChI,IUPAC,seq2seq,Transformer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Struct2IUPAC: Translating SMILES to IUPAC via Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/</guid><description>A Transformer-based model for translating between SMILES strings and IUPAC names, trained on 47M PubChem examples, achieving 98.9% accuracy with verification.</description><content:encoded><![CDATA[<h2 id="struct2iupac-as-a-methodological-shift">Struct2IUPAC as a Methodological Shift</h2>
<p>This is primarily a <strong>Method</strong> paper with significant elements of <strong>Position</strong>.</p>
<ul>
<li><strong>Method</strong>: The authors propose a specific neural architecture (Transformer with custom tokenization) and a verification pipeline (round-trip check) to solve the SMILES $\leftrightarrow$ IUPAC translation task. They rigorously benchmark this against rule-based baselines (OPSIN).</li>
<li><strong>Position</strong>: The authors explicitly argue for a paradigm shift, suggesting that &ldquo;heavy&rdquo; neural architectures should replace complex, costly rule-based legacy systems even for &ldquo;exact&rdquo; algorithmic tasks.</li>
</ul>
<h2 id="the-cost-of-rule-based-chemical-naming">The Cost of Rule-Based Chemical Naming</h2>
<ul>
<li><strong>Complexity of Naming</strong>: Generating IUPAC names manually is error-prone and requires deep algorithmic knowledge.</li>
<li><strong>Lack of Open Source Tools</strong>: While open-source tools exist for Name-to-Structure (e.g., OPSIN), there were no open-source tools for the inverse &ldquo;Structure-to-Name&rdquo; conversion at the time of writing.</li>
<li><strong>Cost of Development</strong>: Developing rule-based converters &ldquo;from scratch&rdquo; is prohibitively expensive and time-consuming compared to training a neural model on existing data.</li>
</ul>
<h2 id="struct2iupac-core-innovation">Struct2IUPAC Core Innovation</h2>
<ul>
<li><strong>Struct2IUPAC</strong>: The first effective open-source neural model for <a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">converting SMILES to IUPAC names</a>, treating chemical translation as a Neural Machine Translation (NMT) problem.</li>
<li><strong>Verification Loop</strong>: A novel inference pipeline that generates multiple candidates via beam search and validates them using a reverse converter (OPSIN) to ensure the generated name maps back to the original structure.</li>
<li><strong>Custom Tokenization</strong>: A manually curated rule-based tokenizer for IUPAC names that handles specific chemical suffixes, prefixes, and stereochemical markers.</li>
</ul>
<h2 id="experimental-setup-and-stress-testing">Experimental Setup and Stress Testing</h2>
<ul>
<li><strong>Accuracy Benchmarking</strong>: The models were tested on a held-out subset of 100,000 molecules from PubChem. The authors measured accuracy across different beam sizes (1, 3, 5).</li>
<li><strong>Comparison to Rules</strong>: The neural IUPAC2Struct model was compared directly against the rule-based OPSIN tool.</li>
<li><strong>Stress Testing</strong>:
<ul>
<li><strong>Sequence Length</strong>: Evaluated performance across varying token lengths, identifying a &ldquo;sweet spot&rdquo; (10-60 tokens) and failure modes for very short (e.g., methane) or long molecules.</li>
<li><strong>Stereochemistry</strong>: Tested on &ldquo;stereo-dense&rdquo; compounds. The authors define a &ldquo;stereo-density&rdquo; index ($I$) as the ratio of stereocenters ($S$) to total tokens ($N$):
$$I = \frac{S}{N}$$
They observed a performance drop for these dense molecules, though the model still handled many stereocenters robustly.</li>
<li><strong>Tautomers</strong>: Verified the model&rsquo;s ability to handle different tautomeric forms (e.g., Guanine and Uracil variants).</li>
</ul>
</li>
<li><strong>Latency Analysis</strong>: Benchmarked inference speeds on CPU vs. GPU relative to output sequence length.</li>
</ul>
<h2 id="benchmarks-and-outcomes">Benchmarks and Outcomes</h2>
<ul>
<li><strong>High Accuracy</strong>: The Struct2IUPAC model achieved <strong>98.9% accuracy</strong> (Beam 5 with verification). The reverse model (IUPAC2Struct) achieved <strong>99.1%</strong>, comparable to OPSIN&rsquo;s 99.4%.</li>
<li><strong>Distribution Modeling vs. Intuition</strong>: The authors claim the model infers &ldquo;chemical logic,&rdquo; because it correctly generates multiple valid IUPAC names for single molecules where naming ambiguity exists (e.g., parent group selection). However, this more likely reflects the Transformer successfully modeling the high-frequency conditional probability distribution of synonymous names present in the PubChem training data, rather than learning intrinsic chemical rules.</li>
<li><strong>Production Readiness</strong>: Inference on GPU takes less than 0.5 seconds even for long names, making it viable for production use.</li>
<li><strong>Paradigm Shift</strong>: The authors conclude that neural networks are a viable, cost-effective alternative to developing rule-based algorithms for legacy notation conversion.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized the PubChem database.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total</strong></td>
          <td>PubChem</td>
          <td>~95M</td>
          <td>Filtered for RDKit compatibility</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Split A</td>
          <td>47,312,235</td>
          <td>Random 50% split</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Split B</td>
          <td>47,413,850</td>
          <td>Random 50% split</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Cleaning</strong>: Molecules that could not be processed by RDKit were removed. Molecules containing tokens not in the tokenizer (e.g., aromatic selenium) were excluded.</li>
<li><strong>Availability</strong>: A subset of 100,000 test molecules is available on GitHub (<code>data/test_100000.csv</code>) and Zenodo. The full train/test splits are not explicitly provided.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SMILES</strong>: Character-based tokenization.</li>
<li><strong>IUPAC</strong>: Custom rule-based tokenizer splitting suffixes (<code>-one</code>, <code>-al</code>), prefixes (<code>-oxy</code>, <code>-di</code>), and special symbols (<code>(</code>, <code>)</code>, <code>R(S)</code>).</li>
</ul>
</li>
<li><strong>Verification Step</strong>:
<ol>
<li>Generate $N$ names using Beam Search ($N=5$).</li>
<li>Reverse translate the candidate name using OPSIN.</li>
<li>Check if the OPSIN structure matches the original input SMILES.</li>
<li>Display the first verified match; otherwise, report failure.</li>
</ol>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Standard Transformer with 6 encoder layers and 6 decoder layers.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Attention Heads: 8</li>
<li>Attention Dimension ($d_{\text{model}}$): 512</li>
<li>Feed-Forward Dimension ($d_{\text{ff}}$): 2048</li>
</ul>
</li>
<li><strong>Training Objective</strong>: The models were trained using standard autoregressive cross-entropy loss over the target token sequence $y$ given the input string $x$:
$$\mathcal{L} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x)$$</li>
<li><strong>Training</strong>: Two separate models were trained: <code>Struct2IUPAC</code> (SMILES $\to$ IUPAC) and <code>IUPAC2Struct</code> (IUPAC $\to$ SMILES).</li>
<li><strong>Availability</strong>: Code for model architecture is provided in the GitHub repository. Pre-trained weights for the IUPAC2Struct model are available, but the Struct2IUPAC model weights are not publicly released, meaning researchers would need to retrain that model on their own PubChem data to reproduce those results.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a random subset of 100,000 molecules from the test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Beam Size</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>1</td>
          <td>96.1%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>5</td>
          <td>98.9%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>1</td>
          <td>96.6%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>5</td>
          <td>99.1%</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Robustness</strong>: Accuracy drops significantly for augmented (non-canonical) SMILES (37.16%) and stereo-enriched compounds (66.52%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Infrastructure</strong>: 4 $\times$ Tesla V100 GPUs and 36 CPUs.</li>
<li><strong>Training Time</strong>: Approximately 10 days under full load.</li>
<li><strong>Inference Speed</strong>: &lt;0.5s per molecule on GPU; scale is linear with output token length.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sergsb/IUPAC2Struct">IUPAC2Struct (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Transformer code and pre-trained IUPAC2Struct model</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4280814">Test data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>100k test molecules, OPSIN failure cases, model failure cases</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/smiles2iupac">Struct2IUPAC web demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online interface for SMILES to IUPAC conversion</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, L., Khokhlov, I., Fedorov, M. V., &amp; Sosnin, S. (2021). Transformer-based artificial neural networks for the conversion between chemical notations. <em>Scientific Reports</em>, 11(1), 14798. <a href="https://doi.org/10.1038/s41598-021-94082-y">https://doi.org/10.1038/s41598-021-94082-y</a></p>
<p><strong>Publication</strong>: Scientific Reports 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovTransformerbasedArtificialNeural2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Transformer-Based Artificial Neural Networks for the Conversion between Chemical Notations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Lev and Khokhlov, Ivan and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14798}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-021-94082-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/sergsb/IUPAC2Struct">GitHub Repository</a></li>
<li><a href="https://app.syntelly.com/smiles2iupac">Web Demo</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT: SMILES to IUPAC Names via Neural Machine Translation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout/</guid><description>A deep-learning neural machine translation approach to translate between SMILES strings and IUPAC names using the STOUT model.</description><content:encoded><![CDATA[<h2 id="contribution-translating-chemistry-as-a-language">Contribution: Translating Chemistry as a Language</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong secondary contribution as a <strong>Resource</strong> paper.</p>
<ul>
<li><strong>Method</strong>: It proposes a neural machine translation (NMT) architecture to approximate the complex, rule-based algorithm of IUPAC naming, treating it as a language translation task.</li>
<li><strong>Resource</strong>: It provides an open-source tool and trained models to the community, addressing a gap where such functionality was previously limited to proprietary software.</li>
</ul>
<h2 id="motivation-democratizing-iupac-nomenclature">Motivation: Democratizing IUPAC Nomenclature</h2>
<p>The International Union of Pure and Applied Chemistry (IUPAC) naming scheme is universally accepted but algorithmically complex. Generating these names correctly is challenging for humans, and automated generation is largely missing from major open-source toolkits like CDK, RDKit, or Open Babel. While reliable commercial tools exist (e.g., ChemAxon&rsquo;s <code>molconvert</code>), there was a lack of open-source alternatives for the scientific community. STOUT aims to fill this gap using a data-driven approach.</p>
<h2 id="core-innovation-sequence-to-sequence-naming">Core Innovation: Sequence-to-Sequence Naming</h2>
<ul>
<li><strong>Language Translation Approach</strong>: The authors treat chemical representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>/<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) and IUPAC names as two different languages, applying Neural Machine Translation (NMT) to translate between them.</li>
<li><strong>Use of SELFIES</strong>: The work establishes SELFIES (Self-Referencing Embedded Strings) as a robust choice over SMILES for deep learning tokenization in this specific task, capitalizing on its syntactic robustness.</li>
<li><strong>Hardware Acceleration</strong>: The paper benchmarks GPU versus TPU training and highlights the practical necessity of Tensor Processing Units (TPUs) for training large-scale chemical language models, reducing training time by an order of magnitude.</li>
</ul>
<h2 id="methodology--translation-validation">Methodology &amp; Translation Validation</h2>
<ul>
<li><strong>Data Scale</strong>: The model was trained on datasets of 30 million and 60 million molecules derived from PubChem.</li>
<li><strong>Hardware Benchmarking</strong>: Training efficiency was compared between an nVidia Tesla V100 GPU and Google TPU v3-8/v3-32 units.</li>
<li><strong>Bidirectional Translation</strong>: The system was tested on two distinct tasks:
<ol>
<li><strong>Forward</strong>: SELFIES → IUPAC names</li>
<li><strong>Reverse</strong>: IUPAC names → SELFIES</li>
</ol>
</li>
<li><strong>Validation</strong>: Performance was evaluated on a held-out test set of 2.2 million molecules.</li>
</ul>
<h2 id="translation-accuracy--hardware-scaling">Translation Accuracy &amp; Hardware Scaling</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved an average BLEU score of ~90% and a Tanimoto similarity index &gt; 0.9 for both translation directions.</li>
<li><strong>Generalization</strong>: Even when predictions were textually mismatched (low BLEU score), the underlying chemical structures often remained highly similar (high Tanimoto similarity), suggesting the system captures fundamental chemical semantics rather than merely memorizing strings.</li>
<li><strong>Impact of Data Size</strong>: Expanding training from 30 million to 60 million molecules yielded consistent performance gains without saturating.</li>
<li><strong>Hardware Necessity</strong>: Training on TPUs proved up to 54 times faster than a standard GPU baseline (Tesla V100), making scaling highly computationally tractable.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Current repo hosts STOUT V2.0 transformer models; V1 RNN code available in earlier commits</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Public Domain</td>
          <td style="text-align: left">Source of 111M molecules; 30M/60M training subsets not directly provided</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The dataset was curated from PubChem (111 million molecules). Note that the specific 30M and 60M subsets are not directly linked in the publication repository, which means a user would have to reconstruct the filtering process.</p>
<p><strong>Preprocessing &amp; Filtering</strong>:</p>
<ul>
<li>Explicit hydrogens removed; converted to canonical SMILES.</li>
<li><strong>Filtering Rules</strong>: MW &lt; 1500 Da, no counter ions, limited element set (C, H, O, N, P, S, F, Cl, Br, I, Se, B), no hydrogen isotopes, 3-40 bonds, no charged groups.</li>
<li><strong>Ground Truth Generation</strong>: ChemAxon&rsquo;s <code>molconvert</code> (Marvin Suite 20.15) was used to generate target IUPAC names for training.</li>
<li><strong>Representation</strong>: All SMILES were converted to SELFIES for training.</li>
</ul>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">PubChem Filtered</td>
          <td style="text-align: left">30M &amp; 60M</td>
          <td style="text-align: left">Two distinct training sets created.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">PubChem Held-out</td>
          <td style="text-align: left">2.2M</td>
          <td style="text-align: left">Molecules not present in training sets; uniform token frequency.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SELFIES</strong>: Split iteratively by brackets <code>[</code> and <code>]</code>.</li>
<li><strong>IUPAC</strong>: Split via punctuation (<code>(</code>, <code>)</code>, <code>{</code>, <code>}</code>, <code>[</code>, <code>]</code>, <code>-</code>, <code>.</code>, <code>,</code>) and a discrete set of sub-word chemical morphemes (e.g., <code>methyl</code>, <code>benzene</code>, <code>fluoro</code>).</li>
<li><strong>Padding</strong>: SELFIES padded to 48 tokens; IUPAC padded to 78 tokens. &ldquo;Start&rdquo; and &ldquo;End&rdquo; sequence markers append each chain.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer instantiated with a learning rate of $0.0005$.</li>
<li><strong>Objective Function</strong>: Sparse categorical cross-entropy, assessing prediction probabilities for token $i$ over vocabulary $V$:
$$ \mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i) $$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder sequence-to-sequence network with Bahdanau attention mechanism context weighting.</li>
<li><strong>Components</strong>:
<ul>
<li><strong>Encoder/Decoder</strong>: Recurrent Neural Networks (RNN) constructed using Gated Recurrent Units (GRU).</li>
<li><strong>Attention</strong>: Bahdanau (additive) soft attention, which calculates alignment scores to softly weight encoder hidden states natively:
$$ e_{tj} = v_a^\top \tanh(W_a s_{t-1} + U_a h_j) $$</li>
<li><strong>Embedding</strong>: Decoder output passes through a continuous embedding layer before concatenating with the attention context vector.</li>
</ul>
</li>
<li><strong>Implementation</strong>: Python 3 backend using TensorFlow 2.3.0. <em>Note: The linked GitHub repository currently defaults to the STOUT V2.0 transformer models, so researchers aiming to reproduce this specific V1 RNN paper should reference the older tag/commit history.</em></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics heavily emphasize both linguistic accuracy and cheminformatic structural correctness:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Details</th>
          <th style="text-align: left">Result (60M Model)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU Score</strong></td>
          <td style="text-align: left">NLTK sentence BLEU (unigram to 4-gram)</td>
          <td style="text-align: left">0.94 (IUPAC $\to$ SELFIES)</td>
          <td style="text-align: left">Exact text overlap. Serves as a strictly syntactic proxy.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto Similarity</strong></td>
          <td style="text-align: left">PubChem fingerprints via CDK</td>
          <td style="text-align: left">0.98 (Valid IUPAC names)</td>
          <td style="text-align: left">Evaluates substructure alignment over bit vectors, $T(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Comparison of hardware efficiency for training large chemical language models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hardware</th>
          <th style="text-align: left">Batch Size</th>
          <th style="text-align: left">Time per Epoch (15M subset)</th>
          <th style="text-align: left">Speedup Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>GPU (Tesla V100)</strong></td>
          <td style="text-align: left">256</td>
          <td style="text-align: left">~27 hours</td>
          <td style="text-align: left">1x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-8</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~2 hours</td>
          <td style="text-align: left">13x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-32</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~0.5 hours</td>
          <td style="text-align: left">54x</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <em>Journal of Cheminformatics</em>, 13(1), 34. <a href="https://doi.org/10.1186/s13321-021-00512-4">https://doi.org/10.1186/s13321-021-00512-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTSMILESIUPAC2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{STOUT: SMILES to IUPAC Names Using Neural Machine Translation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{STOUT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00512-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-09-22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention mechanism,Chemical language,Deep neural network,DeepSMILES,IUPAC names,Neural machine translation,Recurrent neural network,SELFIES,SMILES}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/stout-v2/">STOUT V2.0 Note</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/struct2iupac-2021/">Struct2IUPAC Note</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/handsel-inchi-iupac-2021/">HandSEL Note (InChI to IUPAC)</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT V2.0: Transformer-Based SMILES to IUPAC Translation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout-v2/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/name-translation/stout-v2/</guid><description>A Transformer-based model for translating SMILES to IUPAC names, trained on ~1 billion molecules, achieving ~0.99 BLEU score on benchmarks.</description><content:encoded><![CDATA[<h2 id="paper-contribution--methodological-scope">Paper Contribution &amp; Methodological Scope</h2>
<p><strong>Method (Primary) / Resource (Secondary)</strong></p>
<p>This paper presents a <strong>Methodological</strong> contribution by developing and validating a Transformer-based neural machine translation model (STOUT V2) for bidirectional chemical nomenclature (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> $\leftrightarrow$ IUPAC). It systematically compares this new architecture against previous RNN-based baselines (<a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT V1</a>) and performs ablation studies on tokenization strategies.</p>
<p>It also serves as a significant <strong>Resource</strong> contribution by generating a massive training dataset of nearly 1 billion SMILES-IUPAC pairs (curated via commercial Lexichem software) and releasing the resulting models and code as open-source tools for chemical naming.</p>
<h2 id="the-need-for-robust-open-source-iupac-nomenclature-rules">The Need for Robust Open-Source IUPAC Nomenclature Rules</h2>
<p>Assigning systematic IUPAC names to chemical structures requires adherence to complex rules, challenging human consistency. Deterministic, rule-based software options like OpenEye Lexichem and ChemAxon are reliable commercial solutions. Existing open-source tools like OPSIN focus on parsing names to structures.</p>
<p>The previous version of STOUT (V1), based on RNNs/GRUs, achieved ~90% BLEU accuracy, with known limitations in capturing long-distance dependencies required for stereochemistry handling. This work uses the sequence-learning capabilities of Transformers combined with large-scale datasets to create a competitive open-source IUPAC naming tool.</p>
<h2 id="architectural-shift-and-billion-scale-training">Architectural Shift and Billion-Scale Training</h2>
<p>The primary advancements over previous iterations address both architecture and dataset scale:</p>
<ol>
<li><strong>Architecture Shift</strong>: Moving from an RNN-based Seq2Seq model to a <strong>Transformer-based architecture</strong> (4 layers, 8 heads), which captures intricate chemical patterns better than GRUs.</li>
<li><strong>Billion-Scale Training</strong>: Training on a dataset of nearly <strong>1 billion molecules</strong> (combining PubChem and ZINC15), significantly larger than the 60 million used for STOUT V1.</li>
<li><strong>Tokenization Strategy</strong>: Determining that <strong>character-wise tokenization</strong> for IUPAC names is superior to word-wise tokenization in terms of both accuracy and training efficiency (15% faster).</li>
</ol>
<h2 id="experimental-validation-and-scaling-limits">Experimental Validation and Scaling Limits</h2>
<p>The authors conducted three primary experiments to validate bidirectional translation (SMILES $\rightarrow$ IUPAC and IUPAC $\rightarrow$ SMILES):</p>
<ul>
<li><strong>Experiment 1 (Optimization)</strong>: Assessed the impact of dataset size (1M vs 10M vs 50M) and tokenization strategy on SMILES-to-IUPAC performance.</li>
<li><strong>Experiment 2 (Scaling)</strong>: Trained models on 110 million PubChem molecules for <strong>both</strong> forward and reverse translation tasks to test performance on longer sequences.</li>
<li><strong>Experiment 3 (Generalization)</strong>: Trained on the full ~1 billion dataset (PubChem + ZINC15) for both translation directions.</li>
<li><strong>External Validation</strong>: Benchmarked against an external dataset from ChEBI (1,485 molecules) and ChEMBL34 to test generalization to unseen data.</li>
</ul>
<p><strong>Evaluation Metrics</strong>:</p>
<ul>
<li><strong>Textual Accuracy</strong>: BLEU scores (1-4) and Exact String Match.</li>
<li><strong>Chemical Validity</strong>: Retranslation of generated names back to SMILES using OPSIN, followed by Tanimoto similarity checks (PubChem fingerprints) against the original input.</li>
</ul>
<h2 id="translation-accuracy-and-structural-validity">Translation Accuracy and Structural Validity</h2>
<ul>
<li><strong>Superior Performance</strong>: STOUT V2 achieved an average BLEU score of <strong>0.99</strong> (vs 0.94 for V1). While exact string matches varied by experiment (83-89%), the model notably achieved a perfect BLEU score (1.0) on <strong>97.49%</strong> of a specific test set where STOUT V1 only reached 66.65%.</li>
<li><strong>Structural Validity (&ldquo;Near Misses&rdquo;)</strong>: When the generated name differed from the ground truth string, the re-generated structure often remained chemically valid. The model maintained an average Tanimoto similarity $T(A,B)$ of <strong>0.68</strong> for these divergent names between bit-vector fingerprints $A$ and $B$, roughly defined as:
$$ T(A,B) = \frac{\sum (A \cap B)}{\sum (A \cup B)} $$
<em>Critique</em>: Note that an average Tanimoto coefficient of 0.68 typically suggests moderate structural similarity/drift, not an almost-identical &ldquo;near miss&rdquo; (which would be $&gt;0.85$). This implies the model constructs chemically related but structurally distinct outputs when it fails exact string matching.</li>
<li><strong>Tokenization</strong>: Character-level splitting for IUPAC names outperformed word-level splitting and was more computationally efficient.</li>
<li><strong>Data Imbalance &amp; Generalization</strong>: The model&rsquo;s drop in performance for sequences &gt;600 characters highlights a systemic issue in open chemical databases: long, highly complex SMILES strings are significantly underrepresented. Even billion-scale training datasets are still bound by the chemical diversity of their source material.</li>
<li><strong>Limitations</strong>:
<ul>
<li><strong>Preferred Names (PINs)</strong>: The model mimics Lexichem&rsquo;s naming conventions, generating valid IUPAC names distinct from strict <em>Preferred IUPAC Names</em> (PINs).</li>
<li><strong>Sequence Length</strong>: Performance degrades for very long SMILES (&gt;600 characters) due to scarcity in the training data.</li>
<li><strong>Algorithmic Distillation Bottleneck</strong>: Because the 1 billion training pairs were generated entirely by OpenEye&rsquo;s Lexichem, STOUT V2 acts as a knowledge distillation of that specific commercial algorithm. The model learns Lexichem’s heuristic mapping, specific dialects, and potential systematic errors, rather than deriving true nomenclature rules from first principles.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was derived from PubChem and ZINC15. Ground truth IUPAC names were generated using OpenEye Lexichem TK 2.8.1 to ensure consistency.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Exp 1)</strong></td>
          <td>PubChem Subset</td>
          <td>1M, 10M, 50M</td>
          <td>Selected via MaxMin algorithm for diversity</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 2)</strong></td>
          <td>PubChem</td>
          <td>110M</td>
          <td>Filtered for SMILES length &lt; 600</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 3)</strong></td>
          <td>PubChem + ZINC15</td>
          <td>~1 Billion</td>
          <td>999,637,326 molecules total</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChEBI</td>
          <td>1,485</td>
          <td>External validation set, non-overlapping with training</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Canonicalized, isomeric, and kekulized using RDKit (v2023.03.1).</li>
<li><strong>Formatting</strong>: Converted to TFRecord format in 100 MB chunks for TPU efficiency.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting. Atoms (e.g., &ldquo;Cl&rdquo;, &ldquo;Au&rdquo;), bonds, brackets, and digits are separate tokens.</li>
<li><strong>IUPAC Tokenization</strong>: <strong>Character-wise split</strong> was selected as the optimal strategy (treating every character as a token).</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler based on model dimensions.</li>
<li><strong>Loss Function</strong>: Trained to minimize the Sparse Categorical Cross-Entropy $L$, masking padding tokens. For a correctly predicted target class $t$ alongside probabilities $p_i$, the masked loss is represented mathematically as:
$$ L = - \sum_{i=1}^{m} m_i y_{i} \log(p_{i}) $$
where $m_i$ masks padded positions.</li>
<li><strong>Code Availability</strong>: The <a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">main STOUT V2 repository</a> contains the inference package. The training pipeline/instructions (originally linked to a separate repo that is currently a 404) can still be found within the <a href="https://doi.org/10.5281/zenodo.6559438">Zenodo archive release</a>.</li>
</ul>
<h3 id="models">Models</h3>
<p>The model follows the standard Transformer architecture from &ldquo;Attention is All You Need&rdquo; (Vaswani et al.).</p>
<ul>
<li><strong>Architecture</strong>: 4 Transformer layers (encoder/decoder stack).</li>
<li><strong>Attention</strong>: Multi-head attention with <strong>8 heads</strong>.</li>
<li><strong>Dimensions</strong>: Embedding size ($d_{model}$) = 512; Feed-forward dimension ($d_{ff}$) = 2048.</li>
<li><strong>Regularization</strong>: Dropout rate of 0.1.</li>
<li><strong>Context Window</strong>: Max input length (SMILES) = 600; Max output length (IUPAC) = 700-1000.</li>
<li><strong>Weights</strong>: Model weights for forward and reverse architectures are <a href="https://doi.org/10.5281/zenodo.13318286">available via Zenodo (v3)</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on both string similarity and chemical structural integrity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BLEU Score</strong></td>
          <td>N-gram overlap</td>
          <td>Compared predicted IUPAC string to Ground Truth.</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Accuracy</td>
          <td>Binary 1/0 check for identical strings.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Structural Similarity</td>
          <td>Predicted Name $\rightarrow$ OPSIN $\rightarrow$ SMILES $\rightarrow$ Fingerprint comparison to input.</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT V2 GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Inference package (PyPI: STOUT-pypi)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13318286">Model Weights (Zenodo v3)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Forward and reverse translation weights</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6559438">Code Snapshot (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training pipeline archive</td>
      </tr>
      <tr>
          <td><a href="https://stout.decimer.ai">Web Application</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo with Ketcher, bulk submission, DECIMER integration</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was conducted entirely on Google Cloud Platform (GCP) TPUs.</p>
<ul>
<li><strong>STOUT V1</strong>: Trained on TPU v3-8.</li>
<li><strong>STOUT V2</strong>: Trained on <strong>TPU v4-128 pod slices</strong> (128 nodes).</li>
<li><strong>Large Scale (Exp 3)</strong>: Trained on <strong>TPU v4-256 pod slice</strong> (256 nodes).</li>
<li><strong>Training Time</strong>: Average of <strong>15 hours and 2 minutes per epoch</strong> for the 1 billion dataset.</li>
<li><strong>Framework</strong>: TensorFlow 2.15.0-pjrt with Keras.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2024). STOUT V2.0: SMILES to IUPAC name conversion using transformer models. <em>Journal of Cheminformatics</em>, 16(146). <a href="https://doi.org/10.1186/s13321-024-00941-x">https://doi.org/10.1186/s13321-024-00941-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTV20SMILES2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{STOUT V2}}.0: {{SMILES}} to {{IUPAC}} Name Conversion Using Transformer Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{STOUT V2}}.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{146}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00941-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://stout.decimer.ai">Web Application</a> (Includes Ketcher drawing, bulk submission, and DECIMER integration)</li>
<li><a href="https://decimer.ai">DECIMER Project</a></li>
<li><a href="/notes/chemistry/molecular-representations/name-translation/stout/">STOUT V1 Note</a></li>
<li><a href="https://zenodo.org/records/6559438">Zenodo Archive (Code Snapshot)</a></li>
</ul>
]]></content:encoded></item><item><title>InstructMol: Multi-Modal Molecular LLM for Drug Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/instructmol/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/instructmol/</guid><description>A multi-modal LLM aligning 2D molecular graphs with text via two-stage instruction tuning for drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="instructmol-framework-overview">InstructMol Framework Overview</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong></p>
<p>This work proposes <strong>InstructMol</strong>, a novel multi-modal architecture and training paradigm. It focuses on engineering a system that aligns a pre-trained molecular graph encoder with a general-purpose Large Language Model (LLM). The paper&rsquo;s primary contribution is the <strong>Two-Stage Instruction Tuning</strong> strategy (Alignment Pre-training + Task-Specific Tuning) designed to bridge the modality gap between 2D molecular graphs and natural language.</p>
<h2 id="bridging-specialist-and-generalist-models">Bridging Specialist and Generalist Models</h2>
<p>Current AI approaches in drug discovery typically fall into two categories. Specialist models deliver high accuracy on specific tasks (such as property prediction) but require extensive labeled datasets and lack conversational adaptability. Conversely, generalist LLMs offer strong reasoning and dialogue capabilities but struggle to natively interpret complex structural data, often relying on brittle 1D text representations of molecules like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</p>
<p>There is a practical need for a unified &ldquo;Molecular Assistant&rdquo; capable of visually interpreting molecular graphs, reasoning about structure in natural language, and adapting across tasks like synthesis planning and property analysis without training from scratch.</p>
<h2 id="two-stage-modality-alignment">Two-Stage Modality Alignment</h2>
<p>The core novelty lies in the architecture and the <strong>two-stage training pipeline</strong> designed to align differing modalities efficiently:</p>
<ol>
<li><strong>MoleculeSTM Integration</strong>: InstructMol initializes its graph encoder with <strong>MoleculeSTM</strong>, which is already pre-aligned with text via contrastive learning, facilitating easier downstream alignment.</li>
<li><strong>Two-Stage Alignment Strategy</strong>:
<ul>
<li><strong>Stage 1 (Alignment Pre-training)</strong>: Freezes both the LLM and Graph Encoder; trains <em>only</em> a linear projector using a massive dataset of molecule-description pairs to map graph features into the LLM&rsquo;s token space.</li>
<li><strong>Stage 2 (Task-Specific Instruction Tuning)</strong>: Freezes the Graph Encoder; fine-tunes the Projector and the LLM (using <strong>LoRA</strong>) on specific downstream tasks. This allows the model to adapt its reasoning capabilities while preserving the structural understanding gained in Stage 1.</li>
</ul>
</li>
</ol>
<h2 id="task-evaluation-in-drug-discovery">Task Evaluation in Drug Discovery</h2>
<p>The authors evaluated InstructMol across three distinct categories of drug discovery tasks, comparing it against generalist LLMs (Vicuna, LLaMA, <a href="/notes/chemistry/llm-applications/galactica-large-language-model-for-science/">Galactica</a>) and specialist models (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, MolT5):</p>
<ol>
<li><strong>Property Prediction</strong>:
<ul>
<li><em>Regression</em>: Predicting quantum mechanical properties (HOMO, LUMO, Gap) using the <a href="/notes/chemistry/datasets/qm9/">QM9</a> dataset.</li>
<li><em>Classification</em>: Predicting biological activity (BACE, BBBP, HIV) using <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
</ul>
</li>
<li><strong>Molecule Description Generation</strong>: Generating natural language descriptions of molecules using the ChEBI-20 dataset.</li>
<li><strong>Chemical Reaction Analysis</strong>:
<ul>
<li><em>Forward Reaction Prediction</em>: Predicting products from reactants.</li>
<li><em>Reagent Prediction</em>: Identifying necessary reagents.</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: Suggesting reactants for a given product.</li>
</ul>
</li>
</ol>
<p><strong>Ablation Studies</strong> tested the impact of the projector type (Linear vs. MLP), LLM scale (7B vs 13B), and the necessity of the two-stage training approach.</p>
<h2 id="core-findings-and-limitations">Core Findings and Limitations</h2>
<ul>
<li><strong>Improvement Over Baseline Generalists</strong>: InstructMol significantly outperformed generalist LLMs (like LLaMA and Galactica) on all tasks, demonstrating the value of incorporating explicit graph modalities.</li>
<li><strong>Reducing the Gap with Specialists</strong>: While InstructMol brings versatile reasoning capabilities, it still trails highly optimized specialist models (such as Uni-Mol and MolT5) on tasks like molecule description generation. This remaining gap likely stems from its reliance on a relatively small alignment pre-training dataset (~264K PubChem pairs) and the information bottleneck of using a simple linear projector, compared to the millions of structures used to train expert foundational models.</li>
<li><strong>Importance of Alignment</strong>: Ablation studies confirmed that skipping Stage 1 (Alignment Pre-training) degraded performance, proving that a dedicated phase for projecting graph features into text space is crucial.</li>
<li><strong>Limitation</strong>: The model struggles with highly imbalanced datasets (e.g., HIV) and complex reaction mixtures where mapping multiple graph tokens to text becomes ambiguous.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline utilizes distinct datasets for the two stages. <strong>Note:</strong> As of the latest repository update, the finely-processed instruction-tuning datasets (e.g., the filtered ~264K PubChem pairs and instruction-formatted subset pairs) are listed as &ldquo;coming soon&rdquo;, requiring manual recreation for full reproduction.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Stage 1</strong> (Alignment)</td>
          <td style="text-align: left"><strong><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></strong></td>
          <td style="text-align: left">~264K pairs</td>
          <td style="text-align: left">Molecule-text pairs. Filtered from 330K for invalid descriptions and overlaps with ChEBI-20 test set.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Prop. Reg.)</td>
          <td style="text-align: left"><strong>QM9</strong></td>
          <td style="text-align: left">362K samples</td>
          <td style="text-align: left">Quantum mechanics properties (HOMO, LUMO, Gap).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Prop. Class.)</td>
          <td style="text-align: left"><strong>MoleculeNet</strong></td>
          <td style="text-align: left">35K samples</td>
          <td style="text-align: left">BACE, BBBP, HIV datasets. Converted to instruction format (Yes/No answer).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Generation)</td>
          <td style="text-align: left"><strong>ChEBI-20</strong></td>
          <td style="text-align: left">26.5K samples</td>
          <td style="text-align: left">Molecule description generation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Stage 2</strong> (Reactions)</td>
          <td style="text-align: left"><strong>USPTO</strong></td>
          <td style="text-align: left">~380K samples</td>
          <td style="text-align: left">Combined datasets for Forward (125K), Retrosynthesis (130K), and Reagent (125K) prediction.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Two-Stage Training</strong>:
<ol>
<li><strong>Alignment Pre-training</strong>: Updates only the Projector. The objective maximizes the probability of generating the target description token sequence $\mathbf{X}_A$ given the molecule input $\mathbf{X}_M$ and instruction $\mathbf{X}_I$:
$$p(\mathbf{X}_A | \mathbf{X}_M, \mathbf{X}_I) = \prod_{i=1}^L p_\theta(x_i | \mathbf{X}_G \parallel \mathbf{X}_S, \mathbf{X}_I, \mathbf{X}_{A,&lt;i})$$</li>
<li><strong>Instruction Tuning</strong>: Updates Projector + LLM (via LoRA) using standard autoregressive language modeling on task-specific instructions. The objective minimizes the negative log-likelihood of generating the target response $R$ of length $L$:
$$\mathcal{L}(\theta) = -\sum_{i=1}^L \log p(R_i | I, M, R_{&lt;i}; \theta)$$
where $I$ represents the instruction and $M$ is the multi-modal molecular input.</li>
</ol>
</li>
<li><strong>LoRA (Low-Rank Adaptation)</strong>: Applied to the LLM in Stage 2. Rank $r=64$, Scaling $\alpha=16$.</li>
<li><strong>Optimization</strong>: AdamW optimizer. Learning rate starts at 2e-3 (Stage 1) and 8e-5 (Stage 2) with cosine decay. Warm-up ratio 0.03.</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Note:</strong> The official repository currently lists the final fine-tuned <strong>InstructMol weights</strong> as &ldquo;coming soon.&rdquo; Consequently, one must fine-tune the components using the provided scripts. Base model weights (Vicuna-7B and MoleculeSTM) are publicly available via Hugging Face.</p>
<ul>
<li><strong>Graph Encoder ($f_g$)</strong>:
<ul>
<li>Architecture: Graph Isomorphism Network (GIN) with 5 layers.</li>
<li>Hidden Dimension: 300.</li>
<li>Initialization: <strong>MoleculeSTM</strong> checkpoint (pre-trained via contrastive learning).</li>
<li>Status: <strong>Frozen</strong> during Stage 2.</li>
</ul>
</li>
<li><strong>LLM</strong>:
<ul>
<li>Base: <strong>Vicuna-v1.3-7B</strong>.</li>
<li>Status: Frozen in Stage 1; LoRA fine-tuned in Stage 2.</li>
</ul>
</li>
<li><strong>Projector</strong>:
<ul>
<li>Architecture: Linear Layer.</li>
<li>Function: Maps node-level graph representation $Z_G \in \mathbb{R}^{N \times d}$ to the LLM&rsquo;s word embedding space dimensions.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric Libraries</strong>: RDKit for validity/fingerprints, standard NLP libraries for BLEU/ROUGE.</li>
<li><strong>Reaction Metrics</strong>: Fingerprint <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto Similarity</a> (FTS), Exact Match, Levenshtein distance, and validity (via RDKit).</li>
<li><strong>Description Metrics</strong>: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 x NVIDIA RTX A6000 (48GB VRAM).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Stage 1: 5 epochs.</li>
<li>Stage 2: 20-50 epochs (Description Generation), 10 epochs (Properties/Reactions).</li>
</ul>
</li>
<li><strong>Batch Size</strong>: 128 for both stages.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/IDEA-XL/InstructMol">InstructMol (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache 2.0 (code), CC BY-NC 4.0 (data)</td>
          <td style="text-align: left">Training/evaluation scripts provided; fine-tuned weights listed as &ldquo;coming soon&rdquo;</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/lmsys/vicuna-7b-v1.3">Vicuna-7B v1.3</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Non-commercial (LLaMA license)</td>
          <td style="text-align: left">Base LLM; must be downloaded separately</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://huggingface.co/chao1224/MoleculeSTM">MoleculeSTM</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Pre-trained graph encoder checkpoint</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cao, H., Liu, Z., Lu, X., Yao, Y., &amp; Li, Y. (2025). InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery. <em>Proceedings of the 31st International Conference on Computational Linguistics</em>, 354-379.</p>
<p><strong>Publication</strong>: COLING 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{caoInstructMolMultiModalIntegration2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{InstructMol}}: {{Multi-Modal Integration}} for {{Building}} a {{Versatile}} and {{Reliable Molecular Assistant}} in {{Drug Discovery}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{InstructMol}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 31st {{International Conference}} on {{Computational Linguistics}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Cao, He and Liu, Zijing and Lu, Xingyu and Yao, Yuan and Li, Yu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">editor</span> = <span style="color:#e6db74">{Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and {Al-Khalifa}, Hend and Eugenio, Barbara Di and Schockaert, Steven}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2025</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{354--379}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://aclanthology.org/2025.coling-main.25/}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Abu Dhabi, UAE}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialists, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IDEA-XL/InstructMol">Official Repository</a></li>
</ul>
]]></content:encoded></item><item><title>Image-to-Sequence OCSR: A Comparative Analysis</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/image-to-sequence-comparison/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/image-to-sequence-comparison/</guid><description>Comparative analysis of image-to-sequence OCSR methods across architecture, output format, training data, and compute requirements.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This note provides a comparative analysis of image-to-sequence methods for Optical Chemical Structure Recognition (OCSR). These methods treat molecular structure recognition as an image captioning task, using encoder-decoder architectures to generate sequential molecular representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) directly from pixels.</p>
<p>For the full taxonomy of OCSR approaches including image-to-graph and rule-based methods, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR Methods taxonomy</a>.</p>
<h2 id="architectural-evolution-2019-2025">Architectural Evolution (2019-2025)</h2>
<p>The field has undergone rapid architectural evolution, with clear generational shifts in both encoder and decoder design.</p>
<h3 id="timeline">Timeline</h3>
<table>
  <thead>
      <tr>
          <th>Era</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Representative Methods</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2019-2020</strong></td>
          <td>CNN (Inception V3, ResNet)</td>
          <td>LSTM/GRU with Attention</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/staker-deep-learning-2019/">Staker et al.</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a></td>
      </tr>
      <tr>
          <td><strong>2021</strong></td>
          <td>EfficientNet, ViT</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2mol/">Img2Mol</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/">ViT-InChI</a></td>
      </tr>
      <tr>
          <td><strong>2022</strong></td>
          <td>Swin Transformer, ResNet</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/">SwinOCSR</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/">Image2SMILES</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/micer/">MICER</a></td>
      </tr>
      <tr>
          <td><strong>2023-2024</strong></td>
          <td>EfficientNetV2, SwinV2</td>
          <td>Transformer</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/">DECIMER.ai</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/">Image2InChI</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/">MMSSC-Net</a></td>
      </tr>
      <tr>
          <td><strong>2025</strong></td>
          <td>EfficientViT, VLMs (Qwen2-VL)</td>
          <td>LLM decoders, RL fine-tuning</td>
          <td><a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/">MolSight</a>, <a href="/notes/chemistry/optical-structure-recognition/vision-language/gtr-mol-vlm/">GTR-CoT</a>, <a href="/notes/chemistry/optical-structure-recognition/vision-language/ocsu/">OCSU</a></td>
      </tr>
  </tbody>
</table>
<h3 id="encoder-architectures">Encoder Architectures</h3>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Methods Using It</th>
          <th>Key Characteristics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Inception V3</strong></td>
          <td>DECIMER (2020)</td>
          <td>Early CNN approach, 299x299 input</td>
      </tr>
      <tr>
          <td><strong>ResNet-50/101</strong></td>
          <td>IMG2SMI, Image2SMILES, MICER, DGAT</td>
          <td>Strong baseline, well-understood</td>
      </tr>
      <tr>
          <td><strong>EfficientNet-B3</strong></td>
          <td>DECIMER 1.0</td>
          <td>Efficient scaling, compound coefficients</td>
      </tr>
      <tr>
          <td><strong>EfficientNet-V2-M</strong></td>
          <td>DECIMER.ai, DECIMER-Hand-Drawn</td>
          <td>Improved training efficiency</td>
      </tr>
      <tr>
          <td><strong>EfficientViT-L1</strong></td>
          <td>MolSight</td>
          <td>Optimized for deployment</td>
      </tr>
      <tr>
          <td><strong>Swin Transformer</strong></td>
          <td>SwinOCSR, MolParser</td>
          <td>Hierarchical vision transformer</td>
      </tr>
      <tr>
          <td><strong>SwinV2</strong></td>
          <td>MMSSC-Net, Image2InChI</td>
          <td>Improved training stability</td>
      </tr>
      <tr>
          <td><strong>Vision Transformer (ViT)</strong></td>
          <td>ViT-InChI</td>
          <td>Pure attention encoder</td>
      </tr>
      <tr>
          <td><strong>DenseNet</strong></td>
          <td>RFL, Hu et al. RCGD</td>
          <td>Dense connections, feature reuse</td>
      </tr>
      <tr>
          <td><strong>Deep TNT</strong></td>
          <td>ICMDT</td>
          <td>Transformer-in-Transformer</td>
      </tr>
      <tr>
          <td><strong>Qwen2-VL</strong></td>
          <td>OCSU, GTR-CoT</td>
          <td>Vision-language model encoder</td>
      </tr>
  </tbody>
</table>
<h3 id="decoder-architectures">Decoder Architectures</h3>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Methods Using It</th>
          <th>Output Format</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GRU with Attention</strong></td>
          <td>DECIMER, RFL, Hu et al. RCGD</td>
          <td>SMILES, RFL, SSML</td>
      </tr>
      <tr>
          <td><strong>LSTM with Attention</strong></td>
          <td>Staker et al., ChemPix, MICER</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><strong>Transformer</strong></td>
          <td>Most 2021+ methods</td>
          <td>SMILES, SELFIES, InChI</td>
      </tr>
      <tr>
          <td><strong>GPT-2</strong></td>
          <td>MMSSC-Net</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><strong>BART</strong></td>
          <td>MolParser</td>
          <td>E-SMILES</td>
      </tr>
      <tr>
          <td><strong>Pre-trained CDDD</strong></td>
          <td>Img2Mol</td>
          <td>Continuous embedding → SMILES</td>
      </tr>
  </tbody>
</table>
<h2 id="output-representation-comparison">Output Representation Comparison</h2>
<p>The choice of molecular string representation significantly impacts model performance. Representations fall into three categories: core molecular formats for single structures, extended formats for molecular families and variable structures (primarily Markush structures in patents), and specialized representations optimizing for specific recognition challenges.</p>
<p>The <a href="/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/">Rajan et al. 2022 ablation study</a> provides a comparison of core formats.</p>
<h3 id="core-molecular-formats">Core Molecular Formats</h3>
<p>These represent specific, concrete molecular structures.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Validity Guarantee</th>
          <th>Sequence Length</th>
          <th>Key Characteristic</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>No</td>
          <td>Shortest (baseline)</td>
          <td>Standard, highest accuracy</td>
          <td>DECIMER.ai, MolSight, DGAT, most 2023+</td>
      </tr>
      <tr>
          <td><strong>DeepSMILES</strong></td>
          <td>Partial</td>
          <td>~1.1x SMILES</td>
          <td>Reduces non-local dependencies</td>
          <td>SwinOCSR</td>
      </tr>
      <tr>
          <td><strong>SELFIES</strong></td>
          <td>Yes (100%)</td>
          <td>~1.5x SMILES</td>
          <td>Guaranteed valid molecules</td>
          <td>DECIMER 1.0, IMG2SMI</td>
      </tr>
      <tr>
          <td><strong>InChI</strong></td>
          <td>N/A (canonical)</td>
          <td>Variable (long)</td>
          <td>Unique identifiers, layered syntax</td>
          <td>ViT-InChI, ICMDT, Image2InChI</td>
      </tr>
      <tr>
          <td><strong>FG-SMILES</strong></td>
          <td>No</td>
          <td>Similar to SMILES</td>
          <td>Functional group-aware tokenization</td>
          <td>Image2SMILES</td>
      </tr>
  </tbody>
</table>
<h4 id="smiles-and-variants">SMILES and Variants</h4>
<p><strong>SMILES</strong> remains the dominant format due to its compactness and highest accuracy on clean data. Standard SMILES uses single characters for ring closures and branches that may appear far apart in the sequence, creating learning challenges for sequence models.</p>
<p><strong>DeepSMILES</strong> addresses these non-local syntax dependencies by modifying how branches and ring closures are encoded, making sequences more learnable for neural models. Despite this modification, DeepSMILES sequences are ~1.1x longer than standard SMILES (not shorter). The format offers partial validity improvements through regex-based tokenization with a compact 76-token vocabulary, providing a middle ground between SMILES accuracy and guaranteed validity.</p>
<p><strong>SELFIES</strong> guarantees 100% valid molecules by design through a context-free grammar, eliminating invalid outputs entirely. This comes at the cost of ~1.5x longer sequences and a typical 2-5% accuracy drop compared to SMILES on exact-match metrics. The validity guarantee makes SELFIES particularly attractive for generative modeling applications.</p>
<p><strong>InChI</strong> uses a layered canonical syntax fundamentally different from SMILES-based formats. While valuable for unique molecular identification, its complex multi-layer structure (formula, connectivity, stereochemistry, isotopes, etc.) and longer sequences make it less suitable for image-to-sequence learning, resulting in lower recognition accuracy.</p>
<h4 id="key-findings-from-rajan-et-al-2022">Key Findings from Rajan et al. 2022</h4>
<ol>
<li><strong>SMILES achieves highest exact-match accuracy</strong> on clean synthetic data</li>
<li><strong>SELFIES guarantees 100% valid molecules</strong> but at cost of ~2-5% accuracy drop</li>
<li><strong>InChI is problematic</strong> due to complex layered syntax and longer sequences</li>
<li><strong>DeepSMILES offers middle ground</strong> with partial validity improvements through modified syntax</li>
</ol>
<h3 id="extended-formats-for-variable-structures">Extended Formats for Variable Structures</h3>
<p><strong>Markush structures</strong> represent families of molecules, using variable groups (R1, R2, etc.) with textual definitions. They are ubiquitous in patent documents for intellectual property protection. Standard SMILES cannot represent these variable structures.</p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>Base Format</th>
          <th>Key Feature</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>E-SMILES</strong></td>
          <td>SMILES + XML annotations</td>
          <td>Backward-compatible with separator token</td>
          <td>MolParser</td>
      </tr>
      <tr>
          <td><strong>CXSMILES</strong></td>
          <td>SMILES + extension block</td>
          <td>Substituent tables, compression</td>
          <td>MarkushGrapher</td>
      </tr>
  </tbody>
</table>
<p><strong>E-SMILES</strong> (Extended SMILES) maintains backward compatibility by using a <code>&lt;sep&gt;</code> token to separate core SMILES from XML-like annotations. Annotations encode Markush substituents (<code>&lt;a&gt;index:group&lt;/a&gt;</code>), polymer structures (<code>&lt;p&gt;polymer_info&lt;/p&gt;</code>), and abstract ring patterns (<code>&lt;r&gt;abstract_ring&lt;/r&gt;</code>). The core structure remains parseable by standard RDKit.</p>
<p><strong>CXSMILES</strong> optimizes representation by moving variable groups directly into the main SMILES string as special atoms with explicit atom indexing (e.g., <code>C:1</code>) to link to an extension block containing substituent tables. This handles both frequency variation and position variation in Markush structures.</p>
<h3 id="specialized-representations">Specialized Representations</h3>
<p>These formats optimize for specific recognition challenges beyond standard single-molecule tasks.</p>
<h4 id="rfl-ring-free-language">RFL: Ring-Free Language</h4>
<p><strong>RFL</strong> fundamentally restructures molecular serialization through hierarchical ring decomposition, addressing a core challenge: standard 1D formats (SMILES, SSML) flatten complex 2D molecular graphs, losing explicit spatial relationships.</p>
<p><strong>Mechanism</strong>: RFL decomposes molecules into three explicit components:</p>
<ul>
<li><strong>Molecular Skeleton (𝒮)</strong>: Main graph with rings &ldquo;collapsed&rdquo;</li>
<li><strong>Ring Structures (ℛ)</strong>: Individual ring components stored separately</li>
<li><strong>Branch Information (ℱ)</strong>: Connectivity between skeleton and rings</li>
</ul>
<p><strong>Technical approach</strong>:</p>
<ol>
<li>Detect all non-nested rings using DFS</li>
<li>Calculate adjacency ($\gamma$) between rings based on shared edges</li>
<li>Merge isolated rings ($\gamma=0$) into <strong>SuperAtoms</strong> (single node placeholders)</li>
<li>Merge adjacent rings ($\gamma&gt;0$) into <strong>SuperBonds</strong> (edge placeholders)</li>
<li>Progressive decoding: predict skeleton first, then conditionally decode rings using stored hidden states</li>
</ol>
<p><strong>Performance</strong>: RFL achieves SOTA results on both handwritten (95.38% EM) and printed (95.58% EM) structures, with particular strength on high-complexity molecules where standard baselines fail completely (0% → ~30% on hardest tier).</p>
<p><strong>Note</strong>: RFL does not preserve original drawing orientation; it&rsquo;s focused on computational efficiency through hierarchical decomposition.</p>
<h4 id="ssml-structure-specific-markup-language">SSML: Structure-Specific Markup Language</h4>
<p><strong>SSML</strong> is the primary orientation-preserving format in OCSR. Based on Chemfig (LaTeX chemical drawing package), it provides step-by-step drawing instructions.</p>
<p><strong>Key characteristics</strong>:</p>
<ul>
<li>Describes <em>how to draw</em> the molecule alongside its graph structure</li>
<li>Uses &ldquo;reconnection marks&rdquo; for cyclic structures</li>
<li>Preserves branch angles and spatial relationships</li>
<li>Significantly outperformed SMILES for handwritten recognition: 92.09% vs 81.89% EM (Hu et al. RCGD 2023)</li>
</ul>
<p><strong>Use case</strong>: Particularly valuable for hand-drawn structure recognition where visual alignment between image and reconstruction sequence aids model learning.</p>
<h2 id="training-data-comparison">Training Data Comparison</h2>
<p>Training data scale has grown dramatically, with a shift toward combining synthetic and real-world images.</p>
<h3 id="data-scale-evolution">Data Scale Evolution</h3>
<table>
  <thead>
      <tr>
          <th>Year</th>
          <th>Typical Scale</th>
          <th>Maximum Reported</th>
          <th>Primary Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2019-2020</td>
          <td>1-15M</td>
          <td>57M (Staker)</td>
          <td>Synthetic (RDKit, CDK)</td>
      </tr>
      <tr>
          <td>2021-2022</td>
          <td>5-35M</td>
          <td>35M (DECIMER 1.0)</td>
          <td>Synthetic with augmentation</td>
      </tr>
      <tr>
          <td>2023-2024</td>
          <td>100-150M</td>
          <td>450M+ (DECIMER.ai)</td>
          <td>Synthetic + real patents</td>
      </tr>
      <tr>
          <td>2025</td>
          <td>1-10M + real</td>
          <td>7.7M (MolParser)</td>
          <td>Curated real + synthetic</td>
      </tr>
  </tbody>
</table>
<h3 id="synthetic-vs-real-data">Synthetic vs Real Data</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Training Data</th>
          <th>Real-World Performance Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>450M+ synthetic (RanDepict)</td>
          <td>Strong generalization via domain randomization</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>7.7M with active learning</td>
          <td>Explicitly targets &ldquo;in the wild&rdquo; images</td>
      </tr>
      <tr>
          <td><strong>GTR-CoT</strong></td>
          <td>Real patent/paper images</td>
          <td>Chain-of-thought improves reasoning</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>Multi-stage curriculum</td>
          <td>RL fine-tuning for stereochemistry</td>
      </tr>
  </tbody>
</table>
<h3 id="data-augmentation-strategies">Data Augmentation Strategies</h3>
<p>Common augmentation techniques across methods:</p>
<table>
  <thead>
      <tr>
          <th>Technique</th>
          <th>Purpose</th>
          <th>Used By</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Rotation</strong></td>
          <td>Orientation invariance</td>
          <td>Nearly all methods</td>
      </tr>
      <tr>
          <td><strong>Gaussian blur</strong></td>
          <td>Image quality variation</td>
          <td>DECIMER, MolParser</td>
      </tr>
      <tr>
          <td><strong>Salt-and-pepper noise</strong></td>
          <td>Scan artifact simulation</td>
          <td>DECIMER, Image2SMILES</td>
      </tr>
      <tr>
          <td><strong>Affine transforms</strong></td>
          <td>Perspective variation</td>
          <td>ChemPix, MolParser</td>
      </tr>
      <tr>
          <td><strong>Font/style variation</strong></td>
          <td>Rendering diversity</td>
          <td>RanDepict (DECIMER.ai)</td>
      </tr>
      <tr>
          <td><strong>Hand-drawn simulation</strong></td>
          <td>Sketch-like inputs</td>
          <td>ChemPix, ChemReco, DECIMER-Hand-Drawn</td>
      </tr>
      <tr>
          <td><strong>Background variation</strong></td>
          <td>Document context</td>
          <td>MolParser, DECIMER.ai</td>
      </tr>
  </tbody>
</table>
<h2 id="hardware-and-compute-requirements">Hardware and Compute Requirements</h2>
<p>Hardware requirements span several orders of magnitude, from consumer GPUs to TPU pods.</p>
<h3 id="training-hardware-comparison">Training Hardware Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Hardware</th>
          <th>Training Time</th>
          <th>Dataset Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Staker et al. (2019)</strong></td>
          <td>8x GPUs</td>
          <td>26 days</td>
          <td>57M</td>
      </tr>
      <tr>
          <td><strong>IMG2SMI (2021)</strong></td>
          <td>1x RTX 2080 Ti</td>
          <td>5 epochs</td>
          <td>~10M</td>
      </tr>
      <tr>
          <td><strong>Image2SMILES (2022)</strong></td>
          <td>4x V100</td>
          <td>2 weeks</td>
          <td>30M</td>
      </tr>
      <tr>
          <td><strong>MICER (2022)</strong></td>
          <td>4x V100</td>
          <td>42 hours</td>
          <td>10M</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0 (2021)</strong></td>
          <td>TPU v3-8</td>
          <td>Not reported</td>
          <td>35M</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai (2023)</strong></td>
          <td>TPU v3-256</td>
          <td>Not reported</td>
          <td>450M+</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR (2022)</strong></td>
          <td>4x RTX 3090</td>
          <td>5 days</td>
          <td>5M</td>
      </tr>
      <tr>
          <td><strong>MolParser (2025)</strong></td>
          <td>8x A100</td>
          <td>Curriculum learning</td>
          <td>7.7M</td>
      </tr>
      <tr>
          <td><strong>MolSight (2025)</strong></td>
          <td>Not specified</td>
          <td>RL fine-tuning (GRPO)</td>
          <td>Multi-stage</td>
      </tr>
  </tbody>
</table>
<h3 id="inference-considerations">Inference Considerations</h3>
<p>Few papers report inference speed consistently. Available data:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Inference Speed</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>4x faster than DECIMER</td>
          <td>TensorFlow Lite optimization</td>
      </tr>
      <tr>
          <td><strong>OSRA</strong> (baseline)</td>
          <td>~1 image/sec</td>
          <td>CPU-based rule system</td>
      </tr>
      <tr>
          <td><strong>MolScribe</strong></td>
          <td>Real-time capable</td>
          <td>Optimized Swin encoder</td>
      </tr>
  </tbody>
</table>
<h3 id="accessibility-tiers">Accessibility Tiers</h3>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Hardware</th>
          <th>Representative Methods</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Consumer</strong></td>
          <td>1x RTX 2080/3090</td>
          <td>IMG2SMI, ChemPix</td>
      </tr>
      <tr>
          <td><strong>Workstation</strong></td>
          <td>4x V100/A100</td>
          <td>Image2SMILES, MICER, SwinOCSR</td>
      </tr>
      <tr>
          <td><strong>Cloud/HPC</strong></td>
          <td>TPU pods, 8+ A100</td>
          <td>DECIMER.ai, MolParser</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmark-performance">Benchmark Performance</h2>
<h3 id="common-evaluation-datasets">Common Evaluation Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Type</th>
          <th>Size</th>
          <th>Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>Patent images</td>
          <td>~5K test</td>
          <td>Real-world complexity</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>Scanned images</td>
          <td>~5K test</td>
          <td>Scan artifacts</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>Synthetic</td>
          <td>Variable</td>
          <td>Baseline synthetic</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>Patent images</td>
          <td>~1K test</td>
          <td>Markush structures</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>Japanese patents</td>
          <td>~1K test</td>
          <td>Different rendering styles</td>
      </tr>
  </tbody>
</table>
<h3 id="accuracy-comparison-exact-match-">Accuracy Comparison (Exact Match %)</h3>
<p>Methods are roughly grouped by evaluation era; direct comparison is complicated by different test sets.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>USPTO</th>
          <th>UOB</th>
          <th>Staker</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>OSRA</strong> (baseline)</td>
          <td>~70%</td>
          <td>~65%</td>
          <td>~80%</td>
          <td>Rule-based reference</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>~85%</td>
          <td>~80%</td>
          <td>~90%</td>
          <td>First transformer-based</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>~88%</td>
          <td>~82%</td>
          <td>~92%</td>
          <td>Swin encoder advantage</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>~90%</td>
          <td>~85%</td>
          <td>~95%</td>
          <td>Scale + augmentation</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>~92%</td>
          <td>~88%</td>
          <td>~96%</td>
          <td>Real-world focus</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>~93%+</td>
          <td>~89%+</td>
          <td>~97%+</td>
          <td>RL fine-tuning boost</td>
      </tr>
  </tbody>
</table>
<p><em>Note: Numbers are approximate and may vary by specific test split. See individual paper notes for precise figures.</em></p>
<h3 id="stereochemistry-recognition">Stereochemistry Recognition</h3>
<p>Stereochemistry remains a persistent challenge across all methods:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Approach</th>
          <th>Stereo Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Most methods</strong></td>
          <td>Standard SMILES</td>
          <td>Lower than non-stereo</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>RL (GRPO) specifically for stereo</td>
          <td>Improved</td>
      </tr>
      <tr>
          <td><strong>MolNexTR</strong></td>
          <td>Graph-based explicit stereo</td>
          <td>Better handling</td>
      </tr>
      <tr>
          <td><strong>Image2InChI</strong></td>
          <td>InChI stereo layers</td>
          <td>Mixed results</td>
      </tr>
  </tbody>
</table>
<h2 id="hand-drawn-recognition">Hand-Drawn Recognition</h2>
<p>A distinct sub-lineage focuses on hand-drawn/sketched chemical structures.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Target Domain</th>
          <th>Key Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ChemPix (2021)</strong></td>
          <td>Hand-drawn hydrocarbons</td>
          <td>First deep learning for sketches</td>
      </tr>
      <tr>
          <td><strong>Hu et al. RCGD (2023)</strong></td>
          <td>Hand-drawn structures</td>
          <td>Random conditional guided decoder</td>
      </tr>
      <tr>
          <td><strong>ChemReco (2024)</strong></td>
          <td>Hand-drawn C-H-O structures</td>
          <td>EfficientNet + curriculum learning</td>
      </tr>
      <tr>
          <td><strong>DECIMER-Hand-Drawn (2024)</strong></td>
          <td>General hand-drawn</td>
          <td>Enhanced DECIMER architecture</td>
      </tr>
  </tbody>
</table>
<h3 id="hand-drawn-vs-printed-trade-offs">Hand-Drawn vs Printed Trade-offs</h3>
<ul>
<li>Hand-drawn methods sacrifice some accuracy on clean printed images</li>
<li>Require specialized training data (synthetic hand-drawn simulation)</li>
<li>Generally smaller training sets due to data collection difficulty</li>
<li>Better suited for educational and lab notebook applications</li>
</ul>
<h2 id="key-innovations-by-method">Key Innovations by Method</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Primary Innovation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Staker et al.</strong></td>
          <td>First end-to-end deep learning OCSR</td>
      </tr>
      <tr>
          <td><strong>DECIMER 1.0</strong></td>
          <td>Transformer decoder + SELFIES</td>
      </tr>
      <tr>
          <td><strong>Img2Mol</strong></td>
          <td>Continuous embedding space (CDDD)</td>
      </tr>
      <tr>
          <td><strong>Image2SMILES</strong></td>
          <td>Functional group-aware SMILES (FG-SMILES)</td>
      </tr>
      <tr>
          <td><strong>SwinOCSR</strong></td>
          <td>Hierarchical vision transformer encoder</td>
      </tr>
      <tr>
          <td><strong>DECIMER.ai</strong></td>
          <td>Massive scale + RanDepict augmentation</td>
      </tr>
      <tr>
          <td><strong>MolParser</strong></td>
          <td>Extended SMILES + active learning</td>
      </tr>
      <tr>
          <td><strong>MolSight</strong></td>
          <td>RL fine-tuning (GRPO) for accuracy</td>
      </tr>
      <tr>
          <td><strong>GTR-CoT</strong></td>
          <td>Chain-of-thought graph traversal</td>
      </tr>
      <tr>
          <td><strong>OCSU</strong></td>
          <td>Multi-task vision-language understanding</td>
      </tr>
      <tr>
          <td><strong>RFL</strong></td>
          <td>Hierarchical ring decomposition with SuperAtoms/SuperBonds</td>
      </tr>
  </tbody>
</table>
<h2 id="open-challenges">Open Challenges</h2>
<ol>
<li><strong>Stereochemistry</strong>: Consistent challenge across all methods; RL approaches (MolSight) show promise</li>
<li><strong>Abbreviations/R-groups</strong>: E-SMILES and Markush-specific methods emerging</li>
<li><strong>Real-world robustness</strong>: Gap between synthetic training and patent/paper images</li>
<li><strong>Inference speed</strong>: Rarely reported; important for production deployment</li>
<li><strong>Memory efficiency</strong>: Almost never documented; limits accessibility</li>
<li><strong>Multi-molecule images</strong>: Most methods assume single isolated structure</li>
</ol>
<h2 id="references">References</h2>
<p>Individual paper notes linked throughout. For the complete method listing, see the <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">OCSR Methods taxonomy</a>.</p>
]]></content:encoded></item><item><title>ChemDFM-X: Multimodal Foundation Model for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-x/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemdfm-x/</guid><description>Multimodal chemical model integrating 5 modalities (2D graphs, 3D conformations, images, MS2/IR spectra) trained on 7.6M instructions.</description><content:encoded><![CDATA[<h2 id="chemdfm-x-contribution-and-architecture">ChemDFM-X Contribution and Architecture</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> contribution.</p>
<p><strong>Method</strong>: The paper proposes a novel &ldquo;Cross-modal Dialogue Foundation Model&rdquo; architecture that aligns five distinct chemical modalities (2D graphs, 3D conformations, images, MS2 spectra, IR spectra) to a single LLM decoder using separate encoders and projection modules. It establishes strong baseline performance across multiple modalities compared against current generalist models.</p>
<p><strong>Resource</strong>: The paper addresses the scarcity of multimodal chemical data by constructing a <strong>7.6M instruction-tuning dataset</strong>. This dataset is largely synthesized from seed SMILES strings using approximate calculations (MMFF94, CFM-ID, Chemprop-IR) and specialist model predictions.</p>
<h2 id="bridging-experimental-data-and-llms">Bridging Experimental Data and LLMs</h2>
<p>Existing chemical AI models generally fall into two distinct categories. Task-specific specialist models achieve high accuracy on singular objectives, such as property prediction or molecular generation, but require strict formatting and lack conversational flexibility. Conversely, early chemical large language models provide natural language interaction but are restricted to text and SMILES strings. ChemDFM-X addresses this gap by enabling large multimodal models to process the experimental characterization data (<a href="https://en.wikipedia.org/wiki/Tandem_mass_spectrometry">MS2 spectra</a> and <a href="https://en.wikipedia.org/wiki/Infrared_spectroscopy">IR spectra</a>) and visual data routinely used in practical chemistry workflows.</p>
<h2 id="synthetic-data-scaling-for-modality-alignment">Synthetic Data Scaling for Modality Alignment</h2>
<p>The core novelty lies in the <strong>&ldquo;Any-to-Text&rdquo; alignment strategy via synthetic data scaling</strong>:</p>
<ol>
<li>
<p><strong>Comprehensive Modality Support</strong>: ChemDFM-X incorporates experimental characterization data (MS2 and IR spectra) alongside 2D graphs, 3D conformations, and images. The data representations are formally defined mathematically rather than as raw pixels:</p>
<ul>
<li><strong>Molecular Graph</strong>: An undirected graph $G = (\textbf{V}, \textbf{E})$ with atom set $\textbf{V}$ and bond set $\textbf{E}$.</li>
<li><strong>Molecular Conformation</strong>: An undirected graph $G = (\textbf{V}&rsquo;, \textbf{E})$ storing spatial coordinates: $\textbf{v}_i = (x_i, y_i, z_i, a_i)$.</li>
<li><strong>MS2 Spectrum</strong>: Treated as a point sequence of discrete mass-to-charge ratios and intensities, tokenized via a discrete codebook: $\textbf{M} = ((r_1, I_1), (r_2, I_2), \dots, (r_n, I_n))$.</li>
<li><strong>IR Spectrum</strong>: Treated as a dense sequence of continuous wave lengths and absorption intensities, directly reshaped for feature extraction: $\textbf{R} = ((w_1, t_1), (w_2, t_2), \dots, (w_l, t_l))$.</li>
</ul>
<p>The authors trained new Sequence Transformer encoders from scratch for the MS2 and IR modalities since suitable pre-trained models did not exist.</p>
</li>
<li>
<p><strong>Synthetic Data Generation Pipeline</strong>: The authors generated a 7.6M sample dataset by starting with 1.3M seed SMILES and using &ldquo;approximate calculations&rdquo; to generate missing modalities:</p>
<ul>
<li>3D conformations via <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> force field optimization</li>
<li>MS2 spectra via CFM-ID 4.0 (Competitive Fragmentation Modeling)</li>
<li>IR spectra via Chemprop-IR (Message Passing Neural Network)</li>
</ul>
</li>
<li>
<p><strong>Cross-Modal Synergy</strong>: The model demonstrates that training on reaction images improves recognition performance by leveraging semantic chemical knowledge (reaction rules) to correct visual recognition errors, an emergent capability from multimodal training.</p>
</li>
</ol>
<h2 id="multimodal-benchmarking-with-chemllmbench">Multimodal Benchmarking with ChemLLMBench</h2>
<p>The model was evaluated using a customized version of <strong><a href="/notes/chemistry/llm-applications/chemllmbench-eight-chemistry-tasks/">ChemLLMBench</a></strong> and <strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> across three modality categories:</p>
<ol>
<li>
<p><strong>Structural Modalities</strong> (2D Graphs &amp; 3D Conformations):</p>
<ul>
<li>Molecule recognition and captioning</li>
<li>Property prediction (MoleculeNet: BACE, BBBP, ClinTox, HIV, Tox21)</li>
<li>Compared against specialist models (Mole-BERT, Uni-Mol, MolXPT, MolCA) and generalist models (3D-MoLM, ChemDFM, <a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM</a>)</li>
</ul>
</li>
<li>
<p><strong>Visual Modalities</strong> (Images):</p>
<ul>
<li>Single molecule image recognition</li>
<li>Reaction image recognition</li>
<li>Compared against GPT-4O, Gemini 1.5 Pro, Qwen-VL, LLaVA, and specialist models <a href="/notes/chemistry/optical-structure-recognition/vision-language/molnextr/">MolNextr</a> and <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a></li>
</ul>
</li>
<li>
<p><strong>Characterization Modalities</strong> (MS2 &amp; IR Spectra):</p>
<ul>
<li>Spectral analysis tasks (identifying molecules from spectra)</li>
<li>Contextualized spectral interpretation (combining spectra with reaction context)</li>
<li>Novel evaluation requiring integration of spectroscopic data with reaction knowledge</li>
</ul>
</li>
</ol>
<h2 id="cross-modal-synergy-and-generalist-performance">Cross-Modal Synergy and Generalist Performance</h2>
<p><strong>Key Findings</strong>:</p>
<ol>
<li>
<p><strong>Leading Generalist Performance</strong>: ChemDFM-X establishes a new benchmark among existing generalist models (such as 3D-MOLM and ChemLLM), achieving performance metrics that match dedicated specialist models across several multimodal tasks.</p>
</li>
<li>
<p><strong>Failure of General LMMs</strong>: General vision models (GPT-4O, Gemini 1.5 Pro, Qwen-VL, LLaVA, InternLM-XComposer2, DocOwl) failed significantly on chemical image recognition tasks (0% accuracy for most models on molecule and reaction recognition, Table 9), demonstrating that chemical domain knowledge cannot be assumed from general pre-training.</p>
</li>
<li>
<p><strong>Cross-Modal Error Correction</strong>: In reaction image recognition, ChemDFM-X achieved higher accuracy (53.0%) than on single molecules (46.0%) (Table 9). The authors conclude the model uses its internal knowledge of chemical reaction rules to correct recognition errors in the visual modality, an emergent capability from multimodal training.</p>
</li>
<li>
<p><strong>Reliance on Reaction Context for Spectra</strong>: In zero-shot scenarios, ChemDFM-X essentially fails at pure spectral recognition (achieving 0% and 1% top-1 accuracy on MS2 and IR spectra alone, Table 11). However, when SMILES-based reaction context is included, performance rises to 45% (MS2) and 64% (IR) on the reaction prediction task, and 29% (MS2) and 60% (IR) on <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (Table 11). This indicates the model uses spectral data as a soft prior to constrain textual deductions. Furthermore, the paper compares ChemDFM-X’s spectral identification performance exclusively against text-only LLMs that cannot process spectra, omitting comparisons against established specialist tools.</p>
</li>
<li>
<p><strong>Surrogate Distillation Trade-offs</strong>: Because the spectral training data relies entirely on outputs from CFM-ID 4.0 and Chemprop-IR, ChemDFM-X effectively distills these surrogate models. Any inherent predictive biases or inaccuracies from these underlying tools are permanently embedded in the new ChemDFM-X encoders.</p>
</li>
</ol>
<p><strong>Main Conclusion</strong>: The &ldquo;separate encoders + unified decoder&rdquo; architecture with synthetic data generation enables effective multimodal chemical understanding, bridging the gap between specialist and generalist AI systems for chemistry.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors constructed a <strong>7.6M sample instruction-tuning dataset</strong> derived from <strong>1.3M seed SMILES</strong> (sourced from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and USPTO). <strong>Note</strong>: The final 7.6M multimodal tuning dataset itself isn&rsquo;t publicly available.</p>
<p><strong>Generation Pipeline</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Generation Method</th>
          <th>Tool/Model</th>
          <th>Sample Count</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Graphs</strong></td>
          <td>Direct extraction from SMILES</td>
          <td>RDKit</td>
          <td>1.1M</td>
      </tr>
      <tr>
          <td><strong>3D Conformations</strong></td>
          <td>Force field optimization</td>
          <td>RDKit + MMFF94</td>
          <td>1.3M (pseudo-optimal)</td>
      </tr>
      <tr>
          <td><strong>Molecule Images</strong></td>
          <td>Rendering with augmentation</td>
          <td>RDKit, Indigo, <a href="/notes/chemistry/optical-structure-recognition/hand-drawn/chempix/">ChemPix</a></td>
          <td>~1M (including handwritten style)</td>
      </tr>
      <tr>
          <td><strong>Reaction Images</strong></td>
          <td>Rendering from reaction SMILES</td>
          <td>RDKit</td>
          <td>300K</td>
      </tr>
      <tr>
          <td><strong>MS2 Spectra</strong></td>
          <td>Computational prediction</td>
          <td>CFM-ID 4.0</td>
          <td>~700K</td>
      </tr>
      <tr>
          <td><strong>IR Spectra</strong></td>
          <td>Computational prediction</td>
          <td>Chemprop-IR</td>
          <td>~1M</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Augmentation</strong>:</p>
<ul>
<li>Molecule images augmented with &ldquo;handwritten&rdquo; style using the ChemPix pipeline</li>
<li>Multiple rendering styles (RDKit default, Indigo clean)</li>
<li>Spectra generated at multiple energy levels (10eV, 20eV, 40eV for MS2)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Architecture</strong>: &ldquo;Separate Encoders + Unified Decoder&rdquo;</p>
<p><strong>Code Availability</strong>: The authors have only released inference logic. The cross-modal projection training and synthetic data-generation scripts are closed.</p>
<p><strong>Modality Alignment</strong>:</p>
<ul>
<li>Each modality has a dedicated encoder (frozen pre-trained models where available)</li>
<li>For graph, conformation, MS2, and IR modalities: <strong>2-layer MLP projector</strong> (Linear, GELU, Linear) maps encoder features to LLM input space</li>
<li>For images: <strong>H-Reducer</strong> module compresses image tokens by factor of $n=8$ to handle high-resolution chemical images, then projects to LLM input space</li>
<li>All projected features are concatenated and fed to the unified LLM decoder</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Base LLM</strong>:</p>
<ul>
<li><strong>ChemDFM (13B)</strong>: LLaMA-based model pre-trained on chemical text and SMILES</li>
</ul>
<p><strong>Modality Encoders</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Encoder</th>
          <th>Pre-training Data</th>
          <th>Parameter Count</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Graph</strong></td>
          <td>Mole-BERT</td>
          <td>2M molecules</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>3D Conformation</strong></td>
          <td>Uni-Mol</td>
          <td>209M conformations</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>Image</strong></td>
          <td>CLIP (ViT)</td>
          <td>General domain</td>
          <td>-</td>
          <td>Frozen</td>
      </tr>
      <tr>
          <td><strong>MS2 Spectrum</strong></td>
          <td>Transformer (SeqT)</td>
          <td>Trained from scratch</td>
          <td>-</td>
          <td><strong>Trainable</strong></td>
      </tr>
      <tr>
          <td><strong>IR Spectrum</strong></td>
          <td>Transformer (SeqT)</td>
          <td>Trained from scratch</td>
          <td>-</td>
          <td><strong>Trainable</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Design Rationale</strong>: MS2 and IR encoders trained from scratch as Sequence Transformers treating spectral peaks as token sequences, since no suitable pre-trained models exist for chemical spectra.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>Accuracy (Acc)</strong> for recognition tasks</li>
<li><strong>BLEU-2/4</strong> and <strong>METEOR</strong> for captioning tasks</li>
<li><strong>AUC-ROC</strong> for property prediction (classification)</li>
</ul>
<p><strong>Code Availability</strong>: The adapted code for evaluating on ChemLLMBench and their custom spectral recognition tasks is closed-source.</p>
<p><strong>Benchmarks</strong>:</p>
<ul>
<li><strong>ChemLLMBench</strong>: Adapted for multimodal inputs across molecule captioning, property prediction, and reaction understanding</li>
<li><strong>MoleculeNet</strong>: Standard molecular property prediction tasks (BACE, BBBP, ClinTox, HIV, Tox21)</li>
<li><strong>USPTO</strong>: Reaction prediction and retrosynthesis tasks</li>
<li><strong>Custom Spectral Tasks</strong>: Novel evaluations requiring spectral interpretation</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Note</strong>: The type and quantity of GPUs used, along with the total training wall-time, were not published.</p>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Total Batch Size</strong>: 256</li>
<li><strong>Epochs</strong>: 3</li>
<li><strong>Optimizer</strong>: AdamW</li>
</ul>
<p><strong>Modality-Specific Learning Rates (Peak)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Modality</th>
          <th>Learning Rate</th>
          <th>Feature Dimension</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph</td>
          <td>1e-5</td>
          <td>300</td>
      </tr>
      <tr>
          <td>Conformation</td>
          <td>2e-4</td>
          <td>512</td>
      </tr>
      <tr>
          <td>Image</td>
          <td>2e-3</td>
          <td>1024</td>
      </tr>
      <tr>
          <td>MS2 / IR</td>
          <td>2e-4</td>
          <td>768</td>
      </tr>
  </tbody>
</table>
<p><strong>Note</strong>: Different learning rates reflect the varying degrees of domain adaptation required. Images (general CLIP) need more adaptation than graphs (chemical Mole-BERT).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OpenDFM/ChemDFM-X">ChemDFM-X (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Inference code only; training and data generation scripts are closed</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/OpenDFM/ChemDFM-X-v1.0-13B">ChemDFM-X-v1.0-13B (HuggingFace)</a></td>
          <td>Model</td>
          <td>AGPL-3.0</td>
          <td>13B parameter multimodal model weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhao, Z., Chen, B., Li, J., Chen, L., Wen, L., Wang, P., Zhu, Z., Zhang, D., Wan, Z., Li, Y., Dai, Z., Chen, X., &amp; Yu, K. (2024). ChemDFM-X: Towards Large Multimodal Model for Chemistry. <em>Science China Information Sciences</em>, 67(12), 220109. <a href="https://doi.org/10.1007/s11432-024-4243-0">https://doi.org/10.1007/s11432-024-4243-0</a></p>
<p><strong>Publication</strong>: Science China Information Sciences, December 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2409.13194">arXiv Version</a></li>
<li><a href="https://github.com/OpenDFM/ChemDFM-X">Code Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhaoChemDFMXLargeMultimodal2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemDFM-X}}: {{Towards Large Multimodal Model}} for {{Chemistry}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhao, Zihan and Chen, Bo and Li, Jingpiao and Chen, Lu and Wen, Liyang and Wang, Pengyu and Zhu, Zichen and Zhang, Danyang and Wan, Ziping and Li, Yansi and Dai, Zhongyang and Chen, Xin and Yu, Kai}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Science China Information Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{220109}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1007/s11432-024-4243-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2409.13194}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolSight: OCSR with RL and Multi-Granularity Learning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/molsight/</guid><description>A three-stage OCSR framework using SMILES pretraining, auxiliary bond/coordinate tasks, and reinforcement learning to master stereochemistry recognition.</description><content:encoded><![CDATA[<h2 id="contribution-a-framework-for-optical-chemical-structure-recognition">Contribution: A Framework for Optical Chemical Structure Recognition</h2>
<p>This is primarily a <strong>Method</strong> paper. It proposes a novel three-stage training framework (Pretraining → Fine-tuning → RL Post-training) to improve Optical Chemical Structure Recognition (OCSR). Specifically, it introduces the use of Group Relative Policy Optimization (GRPO) to solve non-differentiable chemical validity issues.</p>
<p>It also has a <strong>Resource</strong> component, as the authors construct and release <em>Stereo-200k</em>, a dataset specifically designed to train models on challenging stereoisomeric molecules.</p>
<h2 id="motivation-resolving-stereochemical-cues">Motivation: Resolving Stereochemical Cues</h2>
<p>Existing OCSR systems struggle to accurately recognize stereochemical information (e.g., chirality, geometric isomerism) because the visual cues distinguishing stereoisomers (such as wedge and dash bonds) are subtle. Current methods often fail to capture the geometric relationships required to distinguish molecules with identical connectivity but different spatial arrangements. Accurate recognition is critical for downstream tasks like drug discovery where stereochemistry determines pharmacological effects.</p>
<h2 id="core-innovations-grpo-and-multi-granularity-learning">Core Innovations: GRPO and Multi-Granularity Learning</h2>
<p>MolSight introduces three key technical innovations:</p>
<ol>
<li><strong>Reinforcement Learning for OCSR</strong>: It is the first OCSR system to incorporate RL (specifically GRPO) to directly optimize for chemical semantic correctness.</li>
<li><strong>Multi-Granularity Learning</strong>: It employs auxiliary heads for chemical bond classification and atom localization. Unlike previous approaches that optimize these jointly, MolSight decouples the coordinate head to prevent interference with SMILES generation.</li>
<li><strong>SMILES-M Notation</strong>: A lightweight extension to SMILES to handle Markush structures (common in patents) without significant sequence length increase.</li>
</ol>
<h2 id="experimental-methodology">Experimental Methodology</h2>
<p>The authors evaluated MolSight using a rigorous mix of real and synthetic benchmarks:</p>
<ul>
<li><strong>Baselines</strong>: Compared against rule-based (OSRA, MolVec, Imago) and deep learning methods (MolScribe, MolGrapher, DECIMER).</li>
<li><strong>Benchmarks</strong>: Evaluated on real-world datasets (USPTO, Maybridge UoB, CLEF-2012, JPO) and synthetic datasets (Staker, ChemDraw, Indigo, Stereo-2K).</li>
<li><strong>Ablation Studies</strong>: Tested the impact of the bond head, coordinate head, and RL stages separately.</li>
<li><strong>Transfer Learning</strong>: Assessed the quality of learned representations by using the frozen encoder for molecular property prediction on MoleculeNet.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>SOTA Performance</strong>: MolSight achieved 85.1% stereochemical accuracy on the USPTO dataset, significantly outperforming the previous SOTA (MolScribe) which achieved 69.0%.</li>
<li><strong>RL Effectiveness</strong>: Reinforcement learning post-training specifically improved performance on stereoisomers, raising Tanimoto similarity and exact match rates on the Stereo-2k test set.</li>
<li><strong>Robustness</strong>: On perturbed USPTO images (random rotations and shearing), MolSight achieved 92.3% exact match accuracy (vs. the original 92.0%), while rule-based methods like OSRA dropped from 83.5% to 6.7%. On the low-resolution Staker dataset, MolSight reached 82.1% exact match.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training pipeline uses three distinct data sources:</p>
<ol>
<li><strong>Pre-training</strong>: <em>MolParser-7M</em>. Contains diverse images but requires the <strong>SMILES-M</strong> extension to handle Markush structures.</li>
<li><strong>Fine-tuning</strong>: <em>PubChem-1M</em> and <em>USPTO-680K</em>. Used for multi-granularity learning with bond and coordinate labels.</li>
<li><strong>RL Post-training</strong>: <em>Stereo-200k</em>. A self-collected dataset from the first 2M compounds in PubChem, filtered for chirality (&rsquo;@&rsquo;) and cis-trans isomerism (&rsquo;/&rsquo;, &lsquo;\&rsquo;). It uses 5 different RDKit drawing styles to ensure robustness.</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Reinforcement Learning</strong>: Uses <strong>GRPO (Group Relative Policy Optimization)</strong>.
<ul>
<li><strong>Reward Function</strong>: A linear combination of Tanimoto similarity and a graded stereochemistry reward.
$$ R = w_t \cdot r_{\text{tanimoto}} + w_s \cdot r_{\text{stereo}} $$
where $w_t=0.4$ and $w_s=0.6$. The stereochemistry reward $r_{\text{stereo}}$ is 1.0 for an InChIKey exact match, 0.3 if the atom count matches, and 0.1 otherwise.</li>
<li><strong>Sampling</strong>: Samples 4 completions per image with temperature 1.0 during RL training.</li>
</ul>
</li>
<li><strong>Auxiliary Tasks</strong>:
<ul>
<li><strong>Bond Classification</strong>: Concatenates hidden states of two atom queries to predict bond type via MLP.</li>
<li><strong>Atom Localization</strong>: Treated as a classification task (SimCC) but optimized using <strong>Maximum Likelihood Estimation (MLE)</strong> to account for uncertainty.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer. Input images are preprocessed to $512 \times 512$ resolution.
<ul>
<li><strong>Encoder</strong>: <strong>EfficientViT-L1</strong> (~53M params), chosen for linear attention efficiency.</li>
<li><strong>Decoder</strong>: 6-layer Transformer with <strong>RoPE</strong>, <strong>SwiGLU</strong>, and <strong>RMSNorm</strong>. Randomly initialized (no LLM weights) due to vocabulary mismatch.</li>
<li><strong>Coordinate Head</strong>: Separated from the main decoder. It adds 2 extra Transformer layers to process atom queries before prediction to improve accuracy.</li>
</ul>
</li>
<li><strong>Parameter Tuning</strong>:
<ul>
<li>Stage 3 (RL) uses <strong>LoRA</strong> (Rank=8, Alpha=16) to optimize the decoder.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Exact Match</strong>: Exact recognition accuracy for the full molecular structure.</li>
<li><strong>Tanimoto Coefficient</strong>: Fingerprint similarity for chemical semantics.</li>
<li><strong>OKS (Object Keypoint Similarity)</strong>: Used specifically for evaluating atom localization accuracy.</li>
</ul>
</li>
<li><strong>Perturbation</strong>: Robustness tested with random rotations [-5°, 5°] and xy-shearing [-0.1, 0.1].</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Training and inference performed on a single node.</li>
<li><strong>Processors</strong>: Intel Xeon Silver 4210R CPU.</li>
<li><strong>Accelerators</strong>: 4x <strong>NVIDIA GeForce RTX 3090/4090</strong> GPUs.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Stage 1: Batch size 512, LR $4 \times 10^{-4}$.</li>
<li>Stage 2: Batch size 256, Bond head LR $4 \times 10^{-4}$, Coord head LR $4 \times 10^{-5}$.</li>
<li>Stage 3 (RL): Batch size 64, Base LR $1 \times 10^{-4}$.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/hustvl/MolSight">MolSight (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch implementation with training and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, W., Wang, X., Feng, B., &amp; Liu, W. (2025). MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning. In <em>Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026)</em>. <a href="https://doi.org/10.48550/arXiv.2511.17300">https://doi.org/10.48550/arXiv.2511.17300</a></p>
<p><strong>Publication</strong>: AAAI 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/hustvl/MolSight">Official Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhang2025molsight,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wenrui Zhang and Xinggang Wang and Bin Feng and Wenyu Liu}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2511.17300}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2511.17300}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MMSSC-Net: Multi-Stage Sequence Cognitive Networks</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/mmssc-net/</guid><description>A deep learning model for Optical Chemical Structure Recognition (OCSR) using SwinV2 and GPT-2 to convert molecular images to SMILES.</description><content:encoded><![CDATA[<h2 id="contribution-a-multi-stage-architectural-pipeline">Contribution: A Multi-Stage Architectural Pipeline</h2>
<p><strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>.
The paper proposes a deep learning architecture (<strong>MMSSC-Net</strong>) for Optical Chemical Structure Recognition (OCSR). It focuses on architectural innovation, specifically combining a SwinV2 visual encoder with a GPT-2 decoder, and validates this method through extensive benchmarking against existing rule-based and deep-learning baselines. It includes ablation studies to justify the choice of the visual encoder.</p>
<h2 id="motivation-addressing-noise-and-rigid-image-recognition">Motivation: Addressing Noise and Rigid Image Recognition</h2>
<ul>
<li><strong>Data Usage Gap</strong>: Drug discovery relies heavily on scientific literature, but molecular structures are often locked in vector graphics or images that computers cannot easily process.</li>
<li><strong>Limitations of Prior Work</strong>: Existing Rule-based methods are rigid and sensitive to noise. Previous Deep Learning approaches (Encoder-Decoder &ldquo;Image Captioning&rdquo; styles) often lack precision, interpretability, and struggle with varying image resolutions or large molecules.</li>
<li><strong>Need for &ldquo;Cognition&rdquo;</strong>: The authors argue that treating the image as a single isolated whole is insufficient; a model needs to &ldquo;perceive&rdquo; fine-grained details (atoms and bonds) to handle noise and varying pixel qualities effectively.</li>
</ul>
<h2 id="novelty-a-fine-grained-perception-pipeline">Novelty: A Fine-Grained Perception Pipeline</h2>
<ul>
<li><strong>Multi-Stage Cognitive Architecture</strong>: MMSSC-Net splits the task into stages:
<ol>
<li><strong>Fine-grained Perception</strong>: Detecting atom and bond sequences (including spatial coordinates) using SwinV2.</li>
<li><strong>Graph Construction</strong>: Assembling these into a molecular graph.</li>
<li><strong>Sequence Evolution</strong>: converting the graph into a machine-readable format (SMILES).</li>
</ol>
</li>
<li><strong>Hybrid Transformer Model</strong>: It combines a hierarchical vision transformer (<strong>SwinV2</strong>) for encoding with a generative pre-trained transformer (<strong>GPT-2</strong>) and MLPs for decoding atomic and bond targets.</li>
<li><strong>Robustness Mechanisms</strong>: The inclusion of random noise sequences during training to improve generalization to new molecular targets.</li>
</ul>
<h2 id="methodology-and-benchmarks">Methodology and Benchmarks</h2>
<ul>
<li><strong>Baselines</strong>: compared against 8 other tools:
<ul>
<li><em>Rule-based</em>: MolVec, OSRA.</li>
<li><em>Image-Smiles (DL)</em>: ABC-Net, Img2Mol, MolMiner.</li>
<li><em>Image-Graph-Smiles (DL)</em>: Image-To-Graph, MolScribe, ChemGrapher.</li>
</ul>
</li>
<li><strong>Datasets</strong>: Evaluated on 5 diverse datasets: STAKER (synthetic), USPTO, CLEF, JPO, and UOB (real-world).</li>
<li><strong>Metrics</strong>:
<ul>
<li><strong>Accuracy</strong>: Exact string match of the predicted SMILES.</li>
<li><strong>Tanimoto Similarity</strong>: Chemical similarity using Morgan fingerprints.</li>
</ul>
</li>
<li><strong>Ablation Study</strong>: Tested different visual encoders (Swin Transformer, ViT-B, ResNet-50) to validate the choice of SwinV2.</li>
<li><strong>Resolution Sensitivity</strong>: Tested model performance across image resolutions from 256px to 2048px.</li>
</ul>
<h2 id="results-and-core-outcomes">Results and Core Outcomes</h2>
<ul>
<li><strong>Strong Performance</strong>: MMSSC-Net achieved 75-98% accuracy across datasets, outperforming baselines on most benchmarks. The first three intra-domain and real datasets achieved above 94% accuracy.</li>
<li><strong>Resolution Robustness</strong>: The model maintained relatively stable accuracy across varying image resolutions, whereas baselines like Img2Mol showed greater sensitivity to resolution changes (Fig. 4 in the paper).</li>
<li><strong>Efficiency</strong>: The SwinV2 encoder was noted to be more efficient than ViT-B in this context.</li>
<li><strong>Limitations</strong>: The model struggles with stereochemistry, specifically confusing dashed wedge bonds with solid wedge bonds and misclassifying single bonds as solid wedge bonds. It also has difficulty with &ldquo;irrelevant text&rdquo; noise (e.g., unexpected symbols in JPO and DECIMER datasets).</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combination of PubChem and USPTO data, augmented to handle visual variability.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>PubChem</strong></td>
          <td>1,000,000</td>
          <td>Converted from <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> to SMILES; random sampling.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><strong>USPTO</strong></td>
          <td>600,000</td>
          <td>Patent images; converted from MOL to SMILES.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>STAKER</strong></td>
          <td>40,000</td>
          <td>Synthetic; Avg res $256 \times 256$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>USPTO</strong></td>
          <td>4,862</td>
          <td>Real; Avg res $721 \times 432$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>CLEF</strong></td>
          <td>881</td>
          <td>Real; Avg res $1245 \times 412$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>JPO</strong></td>
          <td>380</td>
          <td>Real; Avg res $614 \times 367$.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>UOB</strong></td>
          <td>5,720</td>
          <td>Real; Avg res $759 \times 416$.</td>
      </tr>
  </tbody>
</table>
<p><strong>Augmentation</strong>:</p>
<ul>
<li><strong>Image</strong>: Random perturbations using RDKit/Indigo (rotation, filling, cropping, bond thickness/length, font size, Gaussian noise).</li>
<li><strong>Molecular</strong>: Introduction of functional group abbreviations and R-substituents (dummy atoms) using SMARTS templates.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Target Sequence Formulation</strong>: The model predicts a sequence containing bounding box coordinates and type labels: ${y_{\text{min}}, x_{\text{min}}, y_{\text{max}}, x_{\text{max}}, C_{n}}$.</li>
<li><strong>Loss Function</strong>: Cross-entropy loss with maximum likelihood estimation.
$$ \max \sum_{i=1}^{N} \sum_{j=1}^{L} \omega_{j} \log P(t_{j}^{i} \mid x_{1}^{i}, x_{2}^{i}, \dots, x_{M}^{i}, t_{1}^{i}, \dots, t_{j-1}^{i}) $$</li>
<li><strong>Noise Injection</strong>: A random sequence $T_r$ is appended to the target sequence during training to improve generalization to new goals.</li>
<li><strong>Graph Construction</strong>: Atoms ($v$) and bonds ($e$) are recognized separately; bonds are defined by connecting spatial atomic coordinates.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder</strong>: <strong>Swin Transformer V2</strong>.
<ul>
<li>Pre-trained on ImageNet-1K.</li>
<li>Window size: $16 \times 16$.</li>
<li>Parameters: 88M.</li>
<li>Input resolution: $256 \times 256$.</li>
<li>Features: Scaled cosine attention; log-space continuous position bias.</li>
</ul>
</li>
<li><strong>Decoder</strong>: <strong>GPT-2</strong> + <strong>MLP</strong>.
<ul>
<li><strong>GPT-2</strong>: Used for recognizing atom types.
<ul>
<li>Layers: 24.</li>
<li>Attention Heads: 12.</li>
<li>Hidden Dimension: 768.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
<li><strong>MLP</strong>: Used for classifying bond types (single, double, triple, aromatic, solid wedge, dashed wedge).</li>
</ul>
</li>
<li><strong>Vocabulary</strong>:
<ul>
<li>Standard: 95 common numbers/characters ([0], [C], [=], etc.).</li>
<li>Extended: 2000 SMARTS-based characters for isomers/groups (e.g., &ldquo;[C2F5]&rdquo;, &ldquo;[halo]&rdquo;).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ol>
<li><strong>Accuracy</strong>: Exact match of the generated SMILES string.</li>
<li><strong>Tanimoto Similarity</strong>: Similarity of Morgan fingerprints between predicted and ground truth molecules.</li>
</ol>
<p><strong>Key Results (Accuracy)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MMSSC-Net</th>
          <th>MolVec (Rule)</th>
          <th>ABC-Net (DL)</th>
          <th>MolScribe (DL)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Indigo</strong></td>
          <td>98.14</td>
          <td>95.63</td>
          <td>96.4</td>
          <td>97.5</td>
      </tr>
      <tr>
          <td><strong>RDKit</strong></td>
          <td>94.91</td>
          <td>86.7</td>
          <td>98.3</td>
          <td>93.8</td>
      </tr>
      <tr>
          <td><strong>USPTO</strong></td>
          <td>94.24</td>
          <td>88.47</td>
          <td>*</td>
          <td>92.6</td>
      </tr>
      <tr>
          <td><strong>CLEF</strong></td>
          <td>91.26</td>
          <td>81.61</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>UOB</strong></td>
          <td>92.71</td>
          <td>81.32</td>
          <td>96.1</td>
          <td>87.9</td>
      </tr>
      <tr>
          <td><strong>Staker</strong></td>
          <td>89.44</td>
          <td>4.49</td>
          <td>*</td>
          <td>86.9</td>
      </tr>
      <tr>
          <td><strong>JPO</strong></td>
          <td>75.48</td>
          <td>66.8</td>
          <td>*</td>
          <td>76.2</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li>Batch Size: 128.</li>
<li>Learning Rate: $4 \times 10^{-5}$.</li>
<li>Epochs: 40.</li>
</ul>
</li>
<li><strong>Inference Speed</strong>: The SwinV2 encoder demonstrated higher efficiency (faster inference time) compared to ViT-B and ResNet-50 baselines during ablation.</li>
</ul>
<h3 id="reproducibility">Reproducibility</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Wzew5Lp/MMSSCNet">MMSSCNet (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation; includes training and prediction scripts</td>
      </tr>
  </tbody>
</table>
<p>The paper is published in RSC Advances (open access). Source code is available on GitHub, though the repository has minimal documentation and no explicit license. The training data comes from PubChem (public) and USPTO (public patent data). Pre-trained model weights do not appear to be released. No specific GPU hardware or training time is reported in the paper.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, D., Zhao, D., Wang, Z., Li, J., &amp; Li, J. (2024). MMSSC-Net: multi-stage sequence cognitive networks for drug molecule recognition. <em>RSC Advances</em>, 14(26), 18182-18191. <a href="https://doi.org/10.1039/D4RA02442G">https://doi.org/10.1039/D4RA02442G</a></p>
<p><strong>Publication</strong>: RSC Advances 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhangMMSSCNetMultistageSequence2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MMSSC-Net: Multi-Stage Sequence Cognitive Networks for Drug Molecule Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MMSSC-Net}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Zhang, Dehai and Zhao, Di and Wang, Zhengwu and Li, Junhui and Li, Jin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{RSC Advances}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{18182--18191}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D4RA02442G}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://pubs.rsc.org/en/content/articlelanding/2024/ra/d4ra02442g}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MarkushGrapher: Multi-modal Markush Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/markush/markushgrapher/</guid><description>Multi-modal transformer combining vision, text, and layout encoding to extract complex Markush structures from patent documents with OCSR.</description><content:encoded><![CDATA[<h2 id="overcoming-unimodal-limitations-for-markush-structures">Overcoming Unimodal Limitations for Markush Structures</h2>
<p>The automated analysis of chemical literature, particularly patents, is critical for drug discovery and material science. A major bottleneck is the extraction of <strong>Markush structures</strong>, which are complex chemical templates that represent families of molecules using a core backbone image and textual variable definitions. Existing methods are limited because they either rely solely on images (OCSR) and miss the textual context, or focus solely on text and miss the structural backbone. This creates a practical need for a unified, multi-modal approach that jointly interprets visual and textual data to accurately extract these structures for prior-art search and database construction. This paper proposes a <strong>Method</strong> and introduces a new <strong>Resource</strong> (M2S dataset) to bridge this gap.</p>
<h2 id="markushgrapher-the-multi-modal-architecture">MarkushGrapher: The Multi-Modal Architecture</h2>
<p>The core innovation is <strong>MarkushGrapher</strong>, a multi-modal architecture that jointly encodes image, text, and layout information. Key contributions include:</p>
<ul>
<li><strong>Dual-Encoder Architecture</strong>: Combines a Vision-Text-Layout (VTL) encoder (based on UDOP) with a specialized, pre-trained Optical Chemical Structure Recognition (OCSR) encoder (MolScribe). Let $E_{\text{VTL}}$ represent the combined sequence embedding and $E_{\text{OCSR}}$ represent the domain-specific visual embeddings.</li>
<li><strong>Joint Recognition</strong>: The model autoregressively generates a sequential graph representation (Optimized CXSMILES) and a substituent table simultaneously. It uses cross-modal dependencies, allowing text to clarify ambiguous visual details like bond types.</li>
<li><strong>Synthetic Data Pipeline</strong>: A comprehensive pipeline generates realistic synthetic Markush structures (images and text) from PubChem data, overcoming the lack of labeled training data.</li>
<li><strong>Optimized Representation</strong>: A compacted version of CXSMILES moves variable groups into the SMILES string and adds explicit atom indexing to handle complex &ldquo;frequency&rdquo; and &ldquo;position&rdquo; variation indicators.</li>
</ul>
<h2 id="experimental-validation-on-the-new-m2s-benchmark">Experimental Validation on the New M2S Benchmark</h2>
<p>The authors validated their approach using the following setup:</p>
<ul>
<li><strong>Baselines</strong>: Compared against image-only chemistry models (DECIMER, MolScribe) and general-purpose multi-modal models (Uni-SMART, GPT-4o, Pixtral, Llama-3.2).</li>
<li><strong>Datasets</strong>: Evaluated on three benchmarks:
<ol>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 generated samples.</li>
<li><strong>M2S</strong>: A new benchmark of 103 manually annotated real-world patent images.</li>
<li><strong>USPTO-Markush</strong>: 74 Markush backbone images from USPTO patents.</li>
</ol>
</li>
<li><strong>Ablation Studies</strong>: Analyzed the impact of the OCSR encoder, late fusion strategies, and the optimized CXSMILES format. Late fusion improved USPTO-Markush EM from 23% (VTL only) to 32% (Table 3). Removing R-group compression dropped M2S EM from 38% to 30%, and removing atom indexing dropped USPTO-Markush EM from 32% to 24% (Table 4).</li>
</ul>
<h2 id="key-results">Key Results</h2>
<ul>
<li><strong>Performance</strong>: MarkushGrapher outperformed all baselines. On the M2S benchmark, it achieved 38% Exact Match on CXSMILES (compared to 21% for MolScribe) and 29% Exact Match on tables. On USPTO-Markush, it reached 32% CXSMILES EM versus 7% for MolScribe.</li>
<li><strong>Markush Feature Recognition</strong>: The model can recognize complex Markush features like frequency variation (&lsquo;Sg&rsquo;) and position variation (&rsquo;m&rsquo;) indicators. DECIMER and MolScribe scored 0% on both &rsquo;m&rsquo; and &lsquo;Sg&rsquo; sections (Table 2), while MarkushGrapher achieved 76% on &rsquo;m&rsquo; and 31% on &lsquo;Sg&rsquo; sections on M2S.</li>
<li><strong>Cross-Modal Reasoning</strong>: Qualitative analysis showed the model can correctly infer visual details (such as bond order) that appear ambiguous in the image but become apparent with the text description.</li>
<li><strong>Robustness</strong>: The model generalizes well to real-world data despite being trained purely on synthetic data. On augmented versions of M2S and USPTO-Markush simulating low-quality scanned documents, it maintained 31% and 32% CXSMILES EM respectively (Table 6).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>The authors note several limitations:</p>
<ul>
<li>MarkushGrapher does not currently handle abbreviations in chemical structures (e.g., &lsquo;OG&rsquo; for oxygen connected to a variable group).</li>
<li>The model relies on ground-truth OCR cells as input, requiring an external OCR model for practical deployment.</li>
<li>Substituent definitions that combine text with interleaved chemical structure drawings are not supported.</li>
<li>The model is trained to predict &rsquo;m&rsquo; sections connecting to all atoms in a cycle, which can technically violate valence constraints, though the output contains enough information to reconstruct only valid connections.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Training Data</strong></p>
<ul>
<li><strong>Source</strong>: Synthetic dataset generated from PubChem SMILES.</li>
<li><strong>Size</strong>: 210,000 synthetic images.</li>
<li><strong>Pipeline</strong>:
<ol>
<li><strong>Selection</strong>: Sampled SMILES from PubChem based on substructure diversity.</li>
<li><strong>Augmentation</strong>: SMILES augmented to artificial CXSMILES using RDKit (inserting variable groups, frequency indicators).</li>
<li><strong>Rendering</strong>: Images rendered using Chemistry Development Kit (CDK) with randomized drawing parameters (font, bond width, spacing).</li>
<li><strong>Text Generation</strong>: Textual definitions generated using manual templates extracted from patents; 10% were paraphrased using Mistral-7B-Instruct-v0.3 to increase diversity.</li>
<li><strong>OCR</strong>: Bounding boxes extracted via a custom SVG parser aligned with MOL files.</li>
</ol>
</li>
</ul>
<p><strong>Evaluation Data</strong></p>
<ul>
<li><strong>M2S Dataset</strong>: 103 images from USPTO, EPO, and WIPO patents (1999-2023), manually annotated with CXSMILES and substituent tables.</li>
<li><strong>USPTO-Markush</strong>: 74 images from USPTO patents (2010-2016).</li>
<li><strong>MarkushGrapher-Synthetic</strong>: 1,000 samples generated via the pipeline.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimized CXSMILES</strong>:
<ul>
<li><strong>Compression</strong>: Variable groups moved from the extension block to the main SMILES string as special atoms to reduce sequence length.</li>
<li><strong>Indexing</strong>: Atom indices appended to each atom (e.g., <code>C:1</code>) to explicitly link the graph to the extension block (crucial for <code>m</code> and <code>Sg</code> sections).</li>
<li><strong>Vocabulary</strong>: Specific tokens used for atoms and bonds.</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Standard image augmentations (shift, scale, blur, pepper noise, random lines) and OCR text augmentations (character substitution/insertion/deletion).</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-Decoder Transformer.
<ul>
<li><strong>VTL Encoder</strong>: T5-large encoder (initialized from UDOP) that processes image patches, text tokens, and layout (bounding boxes).</li>
<li><strong>OCSR Encoder</strong>: Vision encoder from MolScribe (Swin Transformer), frozen during training.</li>
<li><strong>Text Decoder</strong>: T5-large decoder.</li>
</ul>
</li>
<li><strong>Fusion Strategy</strong>: <strong>Late Fusion</strong>. The core multi-modal alignment combines the textual layout features with specialized chemical vision explicitly. The fused representation relies on the VTL output $e_1$ concatenated with the MLP-projected OCSR output $e_2$ before decoding:
$$ e = e_1(v, t, l) \oplus \text{MLP}(e_2(v)) $$</li>
<li><strong>Parameters</strong>: 831M total (744M trainable).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>CXSMILES Exact Match (EM)</strong>: Requires perfect match of SMILES string, variable groups, <code>m</code> sections, and <code>Sg</code> sections (ignoring stereochemistry).</li>
<li><strong>Tanimoto Score</strong>: Similarity of RDKit DayLight fingerprints (Markush features removed).</li>
<li><strong>Table Exact Match</strong>: All variable groups and substituents must match.</li>
<li><strong>Table F1-Score</strong>: Aggregated recall and precision of substituents per variable group.</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Trained on a single NVIDIA H100 GPU.</li>
<li><strong>Training Config</strong>: 10 epochs, batch size of 10, ADAM optimizer, learning rate 5e-4, 100 warmup steps, weight decay 1e-3.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DS4SD/MarkushGrapher">MarkushGrapher</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Morin, L., Weber, V., Nassar, A., Meijer, G. I., Van Gool, L., Li, Y., &amp; Staar, P. (2025). MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures. <em>2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 14505-14515. <a href="https://doi.org/10.1109/CVPR52734.2025.01352">https://doi.org/10.1109/CVPR52734.2025.01352</a></p>
<p><strong>Publication</strong>: CVPR 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/DS4SD/MarkushGrapher">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{morinMarkushGrapherJointVisual2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{MarkushGrapher}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Morin, Lucas and Weber, Valéry and Nassar, Ahmed and Meijer, Gerhard Ingmar and Van Gool, Luc and Li, Yawei and Staar, Peter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jun,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14505--14515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1109/CVPR52734.2025.01352}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2InChI: SwinTransformer for Molecular Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2inchi/</guid><description>Deep learning model using improved SwinTransformer encoder and attention-based feature fusion to convert molecular images to InChI strings.</description><content:encoded><![CDATA[<h2 id="image2inchi-as-a-methodological-innovation">Image2InChI as a Methodological Innovation</h2>
<p>This is a <strong>Methodological Paper ($\Psi_{\text{Method}}$)</strong>. It proposes a specific new deep learning architecture (&ldquo;Image2InChI&rdquo;) to solve the task of Optical Chemical Structure Recognition (OCSR). The rhetorical focus is on engineering a system that outperforms baselines on specific metrics (InChI accuracy, MCS accuracy) and providing a valuable reference for future algorithmic work.</p>
<h2 id="bottlenecks-in-chemical-literature-digitization">Bottlenecks in Chemical Literature Digitization</h2>
<p>The accurate digitization of chemical literature is a bottleneck in AI-driven drug discovery. Chemical structures in patents and papers exist as optical images (pixels), but machine learning models require machine-readable string representations (like <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a> or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>). Efficiently and automatically bridging this gap is a prerequisite for large-scale data mining in chemistry.</p>
<h2 id="hierarchical-swintransformer-and-attention-integration">Hierarchical SwinTransformer and Attention Integration</h2>
<p>The core novelty is the <strong>Image2InChI</strong> architecture, which integrates:</p>
<ol>
<li><strong>Improved SwinTransformer Encoder</strong>: Uses a hierarchical vision transformer to capture image features.</li>
<li><strong>Feature Fusion with Attention</strong>: A novel network designed to integrate image patch features with InChI prediction steps.</li>
<li><strong>End-to-End InChI Prediction</strong>: The architecture frames the problem as a direct image-to-sequence translation targeting InChI strings directly, diverging from techniques predicting independent graph components. The model is optimized using a standard Cross-Entropy Loss over the token vocabulary:
$$ \mathcal{L}_{\text{CE}} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, \mathbf{X}) $$
where $\mathbf{X}$ represents the input image features, $y_t$ is the predicted token, and $T$ is the sequence length.</li>
</ol>
<h2 id="benchmarking-on-the-bms-dataset">Benchmarking on the BMS Dataset</h2>
<ul>
<li><strong>Benchmark Validation</strong>: The model was trained and tested on the <strong>BMS1000 (Bristol-Myers Squibb)</strong> dataset from a Kaggle competition.</li>
<li><strong>Ablation/Comparative Analysis</strong>: The authors compared their method against other models in the supplement.</li>
<li><strong>Preprocessing Validation</strong>: They justified their choice of denoising algorithms (8-neighborhood vs. Gaussian/Mean) to ensure preservation of bond lines while removing &ldquo;spiky point noise&rdquo;.</li>
</ul>
<h2 id="high-inchi-recognition-metrics">High InChI Recognition Metrics</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved <strong>99.8% InChI accuracy</strong>, 94.8% Maximum Common Substructure (MCS) accuracy, and 96.2% Longest Common Subsequence (LCS) accuracy on the benchmarked dataset. It remains to be seen how well these models generalize to heavily degraded real-world patent images.</li>
<li><strong>Effective Denoising</strong>: The authors concluded that <strong>eight-neighborhood filtering</strong> is superior to mean or Gaussian filtering for this specific domain because it removes isolated noise points without blurring the fine edges of chemical bonds.</li>
<li><strong>Open Source</strong>: The authors stated their intention to release the code, though no public repository has been identified.</li>
</ul>
<hr>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition</td>
          <td>Bristol-Myers Squibb Molecular Translation competition dataset</td>
      </tr>
  </tbody>
</table>
<p>No public code repository has been identified for Image2InChI despite the authors&rsquo; stated intent to release it.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The primary dataset used is the <strong>BMS (Bristol-Myers Squibb) Dataset</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Source</strong></td>
          <td>Kaggle Competition (BMS-Molecular-Translation)</td>
      </tr>
      <tr>
          <td><strong>Total Size</strong></td>
          <td>2.4 million images</td>
      </tr>
      <tr>
          <td><strong>Training Set</strong></td>
          <td>1.8 million images</td>
      </tr>
      <tr>
          <td><strong>Test Set</strong></td>
          <td>0.6 million images</td>
      </tr>
      <tr>
          <td><strong>Content</strong></td>
          <td>Each image corresponds to a unique International Chemical Identifier (<a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>)</td>
      </tr>
  </tbody>
</table>
<p><strong>Other Datasets</strong>: The authors also utilized JPO (Japanese Patent Office), CLEF (CLEF-IP 2012), UOB (MolrecUOB), and USPTO datasets for broader benchmarking.</p>
<p><strong>Preprocessing Pipeline</strong>:</p>
<ol>
<li><strong>Denoising</strong>: <strong>Eight-neighborhood filtering</strong> (threshold &lt; 4 non-white pixels) is used to remove salt-and-pepper noise while preserving bond lines. Mean and Gaussian filtering were rejected due to blurring.</li>
<li><strong>Sequence Padding</strong>:
<ul>
<li>Analysis showed max InChI length &lt; 270.</li>
<li>Fixed sequence length set to <strong>300</strong>.</li>
<li>Tokens: <code>&lt;sos&gt;</code> (190), <code>&lt;eos&gt;</code> (191), <code>&lt;pad&gt;</code> (192) used for padding/framing.</li>
</ul>
</li>
<li><strong>Numerization</strong>: Characters are mapped to integers based on a fixed vocabulary (e.g., &lsquo;C&rsquo; -&gt; 178, &lsquo;H&rsquo; -&gt; 182).</li>
</ol>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Eight-Neighborhood Filtering (Denoising)</strong>:</p>
<p>Pseudocode logic:</p>
<ul>
<li>Iterate through every pixel.</li>
<li>Count non-white neighbors in the 3x3 grid (8 neighbors).</li>
<li>If count &lt; threshold (default 4), treat as noise and remove.</li>
</ul>
<p><strong>InChI Tokenization</strong>:</p>
<ul>
<li>InChI strings are split into character arrays.</li>
<li>Example: Vitamin C <code>InChI=1S/C6H8O6...</code> becomes <code>[&lt;sos&gt;, C, 6, H, 8, O, 6, ..., &lt;eos&gt;, &lt;pad&gt;...]</code>.</li>
<li>Mapped to integer tensor for model input.</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image2InChI</p>
<ul>
<li><strong>Encoder</strong>: Improved SwinTransformer (Hierarchical Vision Transformer).</li>
<li><strong>Decoder</strong>: Transformer Decoder with patch embedding.</li>
<li><strong>Fusion</strong>: A novel &ldquo;feature fusion network with attention&rdquo; integrates the visual tokens with the sequence generation process.</li>
<li><strong>Framework</strong>: PyTorch 1.8.1.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>:</p>
<ul>
<li><strong>InChI Acc</strong>: Exact match accuracy of the predicted InChI string (Reported: 99.8%).</li>
<li><strong>MCS Acc</strong>: Maximum Common Substructure accuracy (structural similarity) (Reported: 94.8%).</li>
<li><strong>LCS Acc</strong>: Longest Common Subsequence accuracy (string similarity) (Reported: 96.2%).</li>
<li><strong>Morgan FP</strong>: Morgan Fingerprint similarity (Reported: 94.1%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Specification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GPU</strong></td>
          <td>NVIDIA Tesla P100 (16GB VRAM)</td>
      </tr>
      <tr>
          <td><strong>Platform</strong></td>
          <td>MatPool cloud platform</td>
      </tr>
      <tr>
          <td><strong>CPU</strong></td>
          <td>Intel Xeon Gold 6271</td>
      </tr>
      <tr>
          <td><strong>RAM</strong></td>
          <td>32GB System Memory</td>
      </tr>
      <tr>
          <td><strong>Driver</strong></td>
          <td>NVIDIA-SMI 440.100</td>
      </tr>
      <tr>
          <td><strong>OS</strong></td>
          <td>Ubuntu 18.04</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, D., Xu, X., Pan, J., Gao, W., &amp; Zhang, S. (2024). Image2InChI: Automated Molecular Optical Image Recognition. <em>Journal of Chemical Information and Modeling</em>, 64(9), 3640-3649. <a href="https://doi.org/10.1021/acs.jcim.3c02082">https://doi.org/10.1021/acs.jcim.3c02082</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling (JCIM) 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Dataset (Kaggle)</a></li>
</ul>
<p><strong>Note</strong>: These notes are based on the Abstract and Supporting Information files only.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2024image2inchi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Image2InChI: Automated Molecular Optical Image Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Da-zhou and Xu, Xin and Pan, Jia-heng and Gao, Wei and Zhang, Shi-rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3640--3649}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.3c02082}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Enhanced DECIMER for Hand-Drawn Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/</guid><description>An improved encoder-decoder model (EfficientNetV2 + Transformer) converts hand-drawn chemical structures into SMILES strings using synthetic training data.</description><content:encoded><![CDATA[<h2 id="method-contribution-architectural-optimization">Method Contribution: Architectural Optimization</h2>
<p>This is a <strong>Method</strong> paper. It proposes an enhanced neural network architecture (EfficientNetV2 + Transformer) specifically designed to solve the problem of recognizing hand-drawn chemical structures. The primary contribution is architectural optimization and a data-driven training strategy, validated through ablation studies (comparing encoders) and benchmarked against existing rule-based and deep learning tools.</p>
<h2 id="motivation-digitizing-dark-chemical-data">Motivation: Digitizing &ldquo;Dark&rdquo; Chemical Data</h2>
<p>Chemical information in legacy laboratory notebooks and modern tablet-based inputs often exists as hand-drawn sketches.</p>
<ul>
<li><strong>Gap:</strong> Existing Optical Chemical Structure Recognition (OCSR) tools (particularly rule-based ones) lack robustness and fail when images have variability in style, line thickness, or noise.</li>
<li><strong>Need:</strong> There is a critical need for automated tools to digitize this &ldquo;dark data&rdquo; effectively to preserve it and make it machine-readable and searchable.</li>
</ul>
<h2 id="core-innovation-decoder-only-design-and-synthetic-scaling">Core Innovation: Decoder-Only Design and Synthetic Scaling</h2>
<p>The core novelty is the <strong>architectural enhancement</strong> and <strong>synthetic training strategy</strong>:</p>
<ol>
<li><strong>Decoder-Only Transformer:</strong> Using only the decoder part of the Transformer (instead of a full encoder-decoder Transformer) improved average accuracy across OCSR benchmarks from 61.28% to 69.27% (Table 3 in the paper).</li>
<li><strong>EfficientNetV2 Integration:</strong> Replacing standard CNNs or EfficientNetV1 with <strong>EfficientNetV2-M</strong> provided better feature extraction and 2x faster training speeds.</li>
<li><strong>Scale of Synthetic Data:</strong> The authors demonstrate that scaling synthetic training data (up to 152 million images generated by RanDepict) directly correlates with improved generalization to real-world hand-drawn images, without ever training on real hand-drawn data.</li>
</ol>
<h2 id="experimental-setup-ablation-and-real-world-baselines">Experimental Setup: Ablation and Real-World Baselines</h2>
<ul>
<li><strong>Model Selection (Ablation):</strong> Tested three architectures (EfficientNetV2-M + Full Transformer, EfficientNetV1-B7 + Decoder-only, EfficientNetV2-M + Decoder-only) on standard benchmarks (JPO, CLEF, USPTO, UOB).</li>
<li><strong>Data Scaling:</strong> Trained the best model on four progressively larger datasets (from 4M to 152M images) to measure performance gains.</li>
<li><strong>Real-World Benchmarking:</strong> Validated the final model on the <strong>DECIMER Hand-drawn dataset</strong> (5088 real images drawn by volunteers) and compared against 9 other tools (OSRA, MolVec, Img2Mol, MolScribe, etc.).</li>
</ul>
<h2 id="results-and-conclusions-strong-accuracy-on-hand-drawn-scans">Results and Conclusions: Strong Accuracy on Hand-Drawn Scans</h2>
<ul>
<li><strong>Strong Performance:</strong> The final DECIMER model achieved <strong>99.72% valid predictions</strong> and <strong>73.25% exact accuracy</strong> on the hand-drawn benchmark. The next best non-DECIMER tool was MolGrapher at 10.81% accuracy, followed by MolScribe at 7.65%.</li>
<li><strong>Robustness:</strong> Deep learning methods outperform rule-based methods (which scored 3% or less accuracy) on hand-drawn data.</li>
<li><strong>Data Saturation:</strong> Quadrupling the dataset from 38M to 152M images yielded only marginal gains (about 3 percentage points in accuracy), suggesting current synthetic data strategies may be hitting a plateau.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official TensorFlow implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained hand-drawn model weights</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/decimer/">DECIMER PyPi Package</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Installable Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic hand-drawn image generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The model was trained entirely on <strong>synthetic data</strong> generated using the <a href="https://github.com/OBrink/RanDepict">RanDepict</a> toolkit. No real hand-drawn images were used for training.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Molecules</th>
          <th>Total Images</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>4,375,338</td>
          <td>1 augmented + 1 clean per molecule</td>
      </tr>
      <tr>
          <td>2</td>
          <td>ChEMBL</td>
          <td>2,187,669</td>
          <td>13,126,014</td>
          <td>2 augmented + 4 clean per molecule</td>
      </tr>
      <tr>
          <td>3</td>
          <td>PubChem</td>
          <td>9,510,000</td>
          <td>38,040,000</td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
      <tr>
          <td>4</td>
          <td>PubChem</td>
          <td>38,040,000</td>
          <td><strong>152,160,000</strong></td>
          <td>1 augmented + 3 clean per molecule</td>
      </tr>
  </tbody>
</table>
<p>A separate <strong>model selection</strong> experiment used a 1,024,000-molecule subset of ChEMBL to compare the three architectures (Table 1 in the paper). The <strong>DECIMER Hand-Drawn</strong> evaluation dataset consists of 5,088 real hand-drawn images from 23 volunteers.</p>
<p><strong>Preprocessing:</strong></p>
<ul>
<li><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings length &lt; 300 characters.</li>
<li>Images resized to $512 \times 512$.</li>
<li>Images generated with and without &ldquo;hand-drawn style&rdquo; augmentations.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization:</strong> SMILES split by heavy atoms, brackets, bond symbols, and special characters. Start <code>&lt;start&gt;</code> and end <code>&lt;end&gt;</code> tokens added; padded with <code>&lt;pad&gt;</code>.</li>
<li><strong>Optimization:</strong> Adam optimizer with a custom learning rate schedule (as specified in the original Transformer paper). A dropout rate of 0.1 was used.</li>
<li><strong>Loss Function:</strong> Trained using focal loss to address class imbalance for rare tokens. The focal loss formulation reduces the relative loss for well-classified examples:
$$
\text{FL}(p_{\text{t}}) = -\alpha_{\text{t}} (1 - p_{\text{t}})^\gamma \log(p_{\text{t}})
$$</li>
<li><strong>Augmentations:</strong> RanDepict applied synthetic distortions to mimic handwriting (wobbly lines, variable thickness, etc.).</li>
</ul>
<h3 id="models">Models</h3>
<p>The final architecture (Model 3) is an Encoder-Decoder structure:</p>
<ul>
<li><strong>Encoder:</strong> <strong>EfficientNetV2-M</strong> (pretrained ImageNet backbone).
<ul>
<li>Input: $512 \times 512 \times 3$ image.</li>
<li>Output Features: $16 \times 16 \times 512$ (reshaped to sequence length 256, dimension 512).</li>
<li><em>Note:</em> The final fully connected layer of the CNN is removed.</li>
</ul>
</li>
<li><strong>Decoder:</strong> <strong>Transformer (Decoder-only)</strong>.
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 8</li>
<li>Embedding Dimension: 512</li>
</ul>
</li>
<li><strong>Output:</strong> Predicted SMILES string token by token.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics used for evaluation:</p>
<ol>
<li><strong>Valid Predictions (%):</strong> Percentage of outputs that are syntactically valid SMILES.</li>
<li><strong>Exact Match Accuracy (%):</strong> Canonical SMILES string identity.</li>
<li><strong>Tanimoto Similarity:</strong> Fingerprint similarity (PubChem fingerprints) between ground truth and prediction.</li>
</ol>
<p><strong>Data Scaling Results (Hand-Drawn Dataset, Table 4 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Training Images</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (ChEMBL)</td>
          <td>4,375,338</td>
          <td>96.21%</td>
          <td>5.09%</td>
          <td>0.490</td>
      </tr>
      <tr>
          <td>2 (ChEMBL)</td>
          <td>13,126,014</td>
          <td>97.41%</td>
          <td>26.08%</td>
          <td>0.690</td>
      </tr>
      <tr>
          <td>3 (PubChem)</td>
          <td>38,040,000</td>
          <td>99.67%</td>
          <td>70.34%</td>
          <td>0.939</td>
      </tr>
      <tr>
          <td>4 (PubChem)</td>
          <td>152,160,000</td>
          <td>99.72%</td>
          <td>73.25%</td>
          <td>0.942</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with Other Tools (Hand-Drawn Dataset, Table 5 in the paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>OCSR Tool</th>
          <th>Method</th>
          <th>Valid Predictions</th>
          <th>Exact Accuracy</th>
          <th>Tanimoto</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>DECIMER (Ours)</strong></td>
          <td>Deep Learning</td>
          <td><strong>99.72%</strong></td>
          <td><strong>73.25%</strong></td>
          <td><strong>0.94</strong></td>
      </tr>
      <tr>
          <td>DECIMER.ai</td>
          <td>Deep Learning</td>
          <td>96.07%</td>
          <td>26.98%</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>MolGrapher</td>
          <td>Deep Learning</td>
          <td>99.94%</td>
          <td>10.81%</td>
          <td>0.51</td>
      </tr>
      <tr>
          <td>MolScribe</td>
          <td>Deep Learning</td>
          <td>95.66%</td>
          <td>7.65%</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Img2Mol</td>
          <td>Deep Learning</td>
          <td>98.96%</td>
          <td>5.25%</td>
          <td>0.52</td>
      </tr>
      <tr>
          <td>SwinOCSR</td>
          <td>Deep Learning</td>
          <td>97.37%</td>
          <td>5.11%</td>
          <td>0.64</td>
      </tr>
      <tr>
          <td>ChemGrapher</td>
          <td>Deep Learning</td>
          <td>69.56%</td>
          <td>N/A</td>
          <td>0.09</td>
      </tr>
      <tr>
          <td>Imago</td>
          <td>Rule-based</td>
          <td>43.14%</td>
          <td>2.99%</td>
          <td>0.22</td>
      </tr>
      <tr>
          <td>MolVec</td>
          <td>Rule-based</td>
          <td>71.86%</td>
          <td>1.30%</td>
          <td>0.23</td>
      </tr>
      <tr>
          <td>OSRA</td>
          <td>Rule-based</td>
          <td>54.66%</td>
          <td>0.57%</td>
          <td>0.17</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute:</strong> Google Cloud TPU v4-128 pod slice.</li>
<li><strong>Training Time:</strong>
<ul>
<li>EfficientNetV2-M model trained ~2x faster than EfficientNetV1-B7.</li>
<li>Average training time per epoch: 34 minutes (for Model 3 on 1M dataset subset).</li>
</ul>
</li>
<li><strong>Epochs:</strong> Models trained for 25 epochs.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H.O., Zielesny, A. et al. (2024). Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. <em>Journal of Cheminformatics</em>, 16(78). <a href="https://doi.org/10.1186/s13321-024-00872-7">https://doi.org/10.1186/s13321-024-00872-7</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://pypi.org/project/decimer/">PyPi Package</a></li>
<li><a href="https://doi.org/10.5281/zenodo.10781330">Model Weights (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanAdvancementsHanddrawnChemical2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Advancements in Hand-Drawn Chemical Structure Recognition through an Enhanced {{DECIMER}} Architecture}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{78}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00872-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Dual-Path Global Awareness Transformer (DGAT) for OCSR</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/dgat/</guid><description>A Transformer-based OCSR model introducing dual-path modules (CGFE and SDGLA) to improve global context awareness and complex motif recognition.</description><content:encoded><![CDATA[<h2 id="contribution-type-deep-learning-method-for-ocsr">Contribution Type: Deep Learning Method for OCSR</h2>
<p>This is a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>The classification is based on the proposal of a novel deep learning architecture (DGAT) designed to address specific limitations in existing Optical Chemical Structure Recognition (OCSR) systems. The contribution is validated through benchmarking against external baselines (DeepOCSR, DECIMER, SwinOCSR) and ablation studies that isolate the impact of the new modules.</p>
<h2 id="motivation-addressing-global-context-loss">Motivation: Addressing Global Context Loss</h2>
<p>Existing multimodal fusion methods for OCSR suffer from limited awareness of global context.</p>
<ul>
<li><strong>Problem</strong>: Models often generate erroneous sequences when processing complex motifs, such as rings or long chains, due to a disconnect between local feature extraction and global structural understanding.</li>
<li><strong>Gap</strong>: Current architectures struggle to capture the &ldquo;fine-grained differences between global and local features,&rdquo; leading to topological errors.</li>
<li><strong>Practical Need</strong>: Accurate translation of chemical images to machine-readable sequences (SMILES/SELFIES) is critical for materials science and AI-guided chemical research.</li>
</ul>
<h2 id="core-innovation-dual-path-global-awareness-transformer">Core Innovation: Dual-Path Global Awareness Transformer</h2>
<p>The authors propose the <strong>Dual-Path Global Awareness Transformer (DGAT)</strong>, which redesigns the decoder with two novel mechanisms to better handle global context:</p>
<ol>
<li>
<p><strong>Cascaded Global Feature Enhancement (CGFE)</strong>: This module bridges cross-modal gaps by emphasizing global context. It concatenates global visual features with sequence features and processes them through a Cross-Modal Assimilation MLP and an Adaptive Alignment MLP to align multimodal representations. The feature enhancement conceptually computes:</p>
<p>$$ f_{\text{enhanced}} = \text{MLP}_{\text{align}}(\text{MLP}_{\text{assimilate}}([f_{\text{global}}, f_{\text{seq}}])) $$</p>
</li>
<li>
<p><strong>Sparse Differential Global-Local Attention (SDGLA)</strong>: A module that dynamically captures fine-grained differences between global and local features. It uses sequence features (embedded with global info) as queries, while utilizing local and global visual features as keys/values in parallel attention heads to generate initial multimodal features.</p>
</li>
</ol>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was evaluated on a newly constructed dataset and compared against five major baselines.</p>
<ul>
<li><strong>Baselines</strong>: DeepOCSR, DECIMER 1.0, DECIMER V2, SwinOCSR, and MPOCSR.</li>
<li><strong>Ablation Studies</strong>:
<ul>
<li><strong>Layer Depth</strong>: Tested Transformer depths from 1 to 5 layers; 3 layers proved optimal for balancing gradient flow and parameter sufficiency.</li>
<li><strong>Beam Size</strong>: Tested inference beam sizes 1-5; size 3 achieved the best balance between search depth and redundancy.</li>
<li><strong>Module Contribution</strong>: Validated that removing CGFE results in a drop in structural similarity (Tanimoto), proving the need for pre-fusion alignment.</li>
</ul>
</li>
<li><strong>Robustness Analysis</strong>: Performance broken down by molecule complexity (atom count, ring count, bond count).</li>
<li><strong>Chirality Validation</strong>: Qualitative analysis of attention maps on chiral molecules to verify the model learns stereochemical cues implicitly.</li>
</ul>
<h2 id="results-and-conclusions">Results and Conclusions</h2>
<ul>
<li><strong>Performance Over Baselines</strong>: DGAT outperformed the MPOCSR baseline across all metrics:
<ul>
<li><strong>BLEU-4</strong>: 84.0% (+5.3% improvement)</li>
<li><strong>ROUGE</strong>: 90.8% (+1.9% improvement)</li>
<li><strong>Tanimoto Similarity</strong>: 98.8% (+1.2% improvement)</li>
<li><strong>Exact Match Accuracy</strong>: 54.6% (+10.9% over SwinOCSR)</li>
</ul>
</li>
<li><strong>Chiral Recognition</strong>: The model implicitly recognizes chiral centers (e.g., generating <code>[C@@H1]</code> tokens correctly) based on 2D wedge cues without direct stereochemical supervision.</li>
<li><strong>Limitations</strong>: Performance drops for extreme cases, such as molecules with 4+ rings or 4+ double/triple bonds, due to dataset imbalance. The model still hallucinates branches in highly complex topologies.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is primarily drawn from PubChem and augmented to improve robustness.</p>
<ul>
<li><strong>Augmentation Strategy</strong>: Each sequence generates three images with random rendering parameters.
<ul>
<li><strong>Rotation</strong>: 0, 90, 180, 270, or random [0, 360)</li>
<li><strong>Bond Width</strong>: 1, 2, or 3 pixels</li>
<li><strong>Bond Offset</strong>: Sampled from 0.08-0.18 (inherited from Image2SMILES)</li>
<li><strong>CoordGen</strong>: Enabled with 20% probability</li>
</ul>
</li>
<li><strong>Evaluation Set</strong>: A newly constructed benchmark dataset was used for final reporting.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Configuration</strong>:
<ul>
<li><strong>Encoder LR</strong>: $5 \times 10^{-5}$ (Pretrained ResNet-101)</li>
<li><strong>Decoder LR</strong>: $1 \times 10^{-4}$ (Randomly initialized Transformer)</li>
<li><strong>Optimizer</strong>: Implied SGD/Adam (context mentions Momentum 0.9, Weight Decay 0.0001)</li>
<li><strong>Batch Size</strong>: 256</li>
</ul>
</li>
<li><strong>Inference</strong>:
<ul>
<li><strong>Beam Search</strong>: A beam size of <strong>3</strong> is used. Larger beam sizes (4-5) degraded BLEU/ROUGE scores due to increased redundancy.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Visual Encoder</strong>:
<ul>
<li><strong>Backbone</strong>: ResNet-101 initialized with ImageNet weights</li>
<li><strong>Structure</strong>: Convolutional layers preserved up to the final module. Classification head removed.</li>
<li><strong>Pooling</strong>: A $7 \times 7$ average pooling layer is used to extract global visual features.</li>
</ul>
</li>
<li><strong>Sequence Decoder</strong>:
<ul>
<li><strong>Architecture</strong>: Transformer-based with CGFE and SDGLA modules.</li>
<li><strong>Depth</strong>: 3 Transformer layers</li>
<li><strong>Dropout</strong>: Not utilized</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is reported using sequence-level and structure-level metrics.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">DGAT Score</th>
          <th style="text-align: left">Baseline (MPOCSR)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU-4</strong></td>
          <td style="text-align: left"><strong>84.0%</strong></td>
          <td style="text-align: left">78.7%</td>
          <td style="text-align: left">Measures n-gram precision</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>ROUGE</strong></td>
          <td style="text-align: left"><strong>90.8%</strong></td>
          <td style="text-align: left">88.9%</td>
          <td style="text-align: left">Sequence recall metric</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto</strong></td>
          <td style="text-align: left"><strong>98.8%</strong></td>
          <td style="text-align: left">97.6%</td>
          <td style="text-align: left">Structural similarity fingerprint</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Accuracy</strong></td>
          <td style="text-align: left"><strong>54.6%</strong></td>
          <td style="text-align: left">35.7%</td>
          <td style="text-align: left">Exact structure match rate</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/Drwr97/DGAT">DGAT</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Official implementation with training and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, R., Ji, Y., Li, Y., &amp; Lee, S.-T. (2025). Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition. <em>The Journal of Physical Chemistry Letters</em>, 16(50), 12787-12795. <a href="https://doi.org/10.1021/acs.jpclett.5c03057">https://doi.org/10.1021/acs.jpclett.5c03057</a></p>
<p><strong>Publication</strong>: The Journal of Physical Chemistry Letters 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Drwr97/DGAT">GitHub Repository</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wang2025dgat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-Path Global Awareness Transformer for Optical Chemical Structure Recognition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Rui and Ji, Yujin and Li, Youyong and Lee, Shuit-Tong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{The Journal of Physical Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{12787--12795}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jpclett.5c03057}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER.ai: Optical Chemical Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-ai/</guid><description>Open-source OCSR platform combining Mask R-CNN segmentation and Transformer recognition, trained on 450M+ synthetic images from RanDepict.</description><content:encoded><![CDATA[<h2 id="project-scope-and-contribution-type">Project Scope and Contribution Type</h2>
<p>This is primarily a <strong>Resource</strong> paper (Infrastructure Basis) with a significant <strong>Method</strong> component.</p>
<p>The primary contribution is DECIMER.ai, a fully open-source platform (web app and Python packages) for the entire chemical structure mining pipeline, filling a gap where most tools were proprietary or fragmented. It also contributes the RanDepict toolkit for massive synthetic data generation.</p>
<p>The secondary methodological contribution proposes and validates a specific deep learning architecture (EfficientNet-V2 encoder + Transformer decoder) that treats chemical structure recognition as an image-to-text translation task (SMILES generation).</p>
<h2 id="the-scarcity-of-machine-readable-chemical-data">The Scarcity of Machine-Readable Chemical Data</h2>
<p><strong>Data Scarcity</strong>: While the number of chemical publications is increasing, most chemical information is locked in non-machine-readable formats (images in PDFs) and is not available in public databases.</p>
<p><strong>Limitations of Existing Tools</strong>: Prior OCSR (Optical Chemical Structure Recognition) tools were largely rule-based (fragile to noise) or proprietary.</p>
<p><strong>Lack of Integration</strong>: There was no existing open-source system that combined segmentation (finding the molecule on a page), classification (confirming it is a molecule), and recognition (translating it to SMILES) into a single workflow.</p>
<h2 id="decimer-architecture-and-novel-image-to-smiles-approach">DECIMER Architecture and Novel Image-to-SMILES Approach</h2>
<p><strong>Comprehensive Workflow</strong>: It is the first open-source platform to integrate segmentation (Mask R-CNN), classification (EfficientNet), and recognition (Transformer) into a unified pipeline.</p>
<p><strong>Data-Driven Approach</strong>: Unlike tools like MolScribe which use intermediate graph representations and rules, DECIMER uses a purely data-driven &ldquo;image-to-SMILES&rdquo; translation approach without hard-coded chemical rules. The core recognition model operates as a sequence-to-sequence generator, mathematically formalizing the task as maximizing the conditional probability of a SMILES sequence given an image.</p>
<p><strong>Massive Synthetic Training</strong>: The use of RanDepict to generate over 450 million synthetic images, covering diverse depiction styles and augmentations (including Markush structures), to train the model from scratch.</p>
<h2 id="benchmarking-and-evaluation-methodology">Benchmarking and Evaluation Methodology</h2>
<p><strong>Benchmarking</strong>: The system was tested against openly available tools (OSRA, MolVec, Imago, Img2Mol, SwinOCSR, MolScribe) on standard datasets: USPTO, UOB, CLEF, JPO, and a custom &ldquo;Hand-drawn&rdquo; dataset.</p>
<p><strong>Robustness Testing</strong>: Performance was evaluated on both clean images and images with added distortions (rotation, shearing) to test the fragility of rule-based systems vs. DECIMER.</p>
<p><strong>Markush Structure Analysis</strong>: Specific evaluation of the model&rsquo;s ability to interpret Markush structures (generic structures with R-groups).</p>
<p><strong>Comparison of Approaches</strong>: A direct comparison with MolScribe by training DECIMER on MolScribe&rsquo;s smaller training set to isolate the impact of architecture vs. data volume.</p>
<h2 id="performance-outcomes-and-key-findings">Performance Outcomes and Key Findings</h2>
<p><strong>Comparative Performance</strong>: DECIMER Image Transformer consistently produced average Tanimoto similarities above 0.95 on in-domain test data and achieved competitive or leading results across external benchmarks, with extremely low rates of catastrophic failure. Tanimoto similarity is calculated based on molecular fingerprints $A$ and $B$ as:
$$ T(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</p>
<p><strong>Data Volume Necessity</strong>: When trained on small datasets, MolScribe (graph/rule-based) outperformed DECIMER. DECIMER&rsquo;s performance advantage relies heavily on its massive training scale (&gt;400M images).</p>
<p><strong>Robustness</strong>: The model showed no performance degradation on distorted images, unlike rule-based legacy tools.</p>
<p><strong>Generalization</strong>: Despite having no hand-drawn images in the training set, the base model recognized 27% of hand-drawn structures perfectly (average Tanimoto 0.69), outperforming all alternative open tools. After fine-tuning with synthetic hand-drawn-like images from RanDepict, perfect predictions increased to 60% (average Tanimoto 0.89).</p>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/OBrink/DECIMER.ai">DECIMER.ai Web App</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Laravel-based web application for the full pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Core OCSR Python package</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image-Segmentation">DECIMER Image Segmentation</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Mask R-CNN segmentation for chemical structures in documents</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Iagea/DECIMER-Image-Classifier">DECIMER Image Classifier</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>EfficientNet-based chemical structure image classifier</td>
      </tr>
      <tr>
          <td><a href="https://github.com/OBrink/RanDepict">RanDepict</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generation toolkit</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The models were trained on synthetic data generated from PubChem molecules.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Generation/Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_1</code></td>
          <td>~108M mols</td>
          <td>PubChem molecules (mass &lt; 1500 Da), processed with RanDepict (v1.0.5). Included image augmentations.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_2</code></td>
          <td>~126M mols</td>
          <td>Included Markush structures generated by pseudo-randomly replacing atoms with R-groups. Image size 299x299.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td><code>pubchem_3</code></td>
          <td>&gt;453M images</td>
          <td>Re-depicted <code>pubchem_2</code> molecules at <strong>512x512</strong> resolution. Used RanDepict v1.0.8.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>In-domain</td>
          <td>250,000</td>
          <td>Held-out set generated similarly to training data.</td>
      </tr>
      <tr>
          <td><strong>Benchmark</strong></td>
          <td>External</td>
          <td>Various</td>
          <td>USPTO (5719), UOB (5740), CLEF (992), JPO (450), Indigo (50k), Hand-drawn (5088).</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: RanDepict (uses CDK, RDKit, Indigo, PIKAChU)</li>
<li><strong>Augmentations</strong>: Rotation, shearing, noise, pixelation, curved arrows, text labels</li>
<li><strong>Format</strong>: Data saved as TFRecord files for TPU training</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting (atoms, brackets, bonds). Added <code>&lt;start&gt;</code>, <code>&lt;end&gt;</code>, and padded with <code>&lt;pad&gt;</code>. <code>&lt;unk&gt;</code> used for unknown tokens.</li>
<li><strong>Markush Token Handling</strong>: To avoid ambiguity, digits following &lsquo;R&rsquo; (e.g., R1) were replaced with unique non-digit characters during training to distinguish them from ring-closure numbers.</li>
<li><strong>Image Augmentation Pipeline</strong>: Custom RanDepict features (v1.1.4) were used to simulate &ldquo;hand-drawn-like&rdquo; styles based on ChemPIX&rsquo;s implementation.</li>
</ul>
<h3 id="models">Models</h3>
<p>The platform consists of three distinct models:</p>
<ol>
<li>
<p><strong>DECIMER Segmentation</strong>:</p>
<ul>
<li><strong>Architecture</strong>: Mask R-CNN (TensorFlow 2.10.0 implementation)</li>
<li><strong>Purpose</strong>: Detects and cuts chemical structures from full PDF pages</li>
</ul>
</li>
<li>
<p><strong>DECIMER Image Classifier</strong>:</p>
<ul>
<li><strong>Architecture</strong>: EfficientNet-V1-B0</li>
<li><strong>Input</strong>: 224x224 pixels</li>
<li><strong>Training</strong>: Fine-tuned on ~10.9M images (balanced chemical/non-chemical)</li>
<li><strong>Performance</strong>: AUC 0.99 on in-domain test set</li>
</ul>
</li>
<li>
<p><strong>DECIMER Image Transformer (OCSR Engine)</strong>:</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-V2-M (CNN). Input size <strong>512x512</strong>. 52M parameters</li>
<li><strong>Decoder</strong>: Transformer. 4 encoder blocks, 4 decoder blocks, 8 attention heads. d_model=512, d_ff=2048. 59M parameters</li>
<li><strong>Total Params</strong>: ~111 Million</li>
</ul>
</li>
</ol>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Tanimoto Similarity (calculated on PubChem fingerprints of the predicted vs. ground truth SMILES)</li>
<li><strong>Secondary Metrics</strong>: Exact Match (Identity), BLEU score (for string similarity, esp. Markush)</li>
<li><strong>Failure Analysis</strong>: &ldquo;Catastrophic failure&rdquo; defined as Tanimoto similarity of 0 or invalid SMILES</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on Google Cloud TPUs due to the massive dataset size.</p>
<ul>
<li><strong><code>pubchem_1</code>/<code>pubchem_2</code></strong>: Trained on TPU v3-32 pod slice</li>
<li><strong><code>pubchem_3</code> (Final Model)</strong>: Trained on <strong>TPU v3-256</strong> pod slice</li>
<li><strong>Training Time</strong>:
<ul>
<li>Data generation (512x512): ~2 weeks on cluster (20 threads, 36 cores)</li>
<li>Model Training (EffNet-V2-M): <strong>1 day and 7 hours per epoch</strong> on TPU v3-256</li>
</ul>
</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Brinkhaus, H. O., Agea, M. I., Zielesny, A., &amp; Steinbeck, C. (2023). DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. <em>Nature Communications</em>, 14(1), 5045. <a href="https://doi.org/10.1038/s41467-023-40782-0">https://doi.org/10.1038/s41467-023-40782-0</a></p>
<p><strong>Publication</strong>: Nature Communications 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://decimer.ai">Web Application</a></li>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER Image Transformer GitHub</a></li>
<li><a href="https://github.com/OBrink/RanDepict">RanDepict GitHub</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMERaiOpenPlatform2023,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Brinkhaus, Henning Otto and Agea, M. Isabel and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{5045}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41467-023-40782-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemVLM: A Multimodal Large Language Model for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemvlm/</link><pubDate>Fri, 19 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/chemvlm/</guid><description>A 26B parameter multimodal LLM for chemistry, combining InternViT-6B and ChemLLM-20B for molecular structure recognition, property prediction, and reasoning.</description><content:encoded><![CDATA[<h2 id="paper-classification-method-and-resource">Paper Classification: Method and Resource</h2>
<p>This paper is a combination of <strong>Method</strong> (primary) and <strong>Resource</strong> (secondary).</p>
<p>It is primarily a <strong>Method</strong> paper because it proposes <strong>ChemVLM</strong>, a novel multimodal architecture specifically tailored for the chemical domain, utilizing a &ldquo;ViT-MLP-LLM&rdquo; framework. The authors introduce a specific two-stage training strategy to align visual features with chemical text representations.</p>
<p>Secondarily, it is a <strong>Resource</strong> paper as it introduces a comprehensive suite of three new datasets: <strong>ChemOCR</strong>, <strong>MMCR-Bench</strong>, and <strong>MMChemBench</strong>, developed to rigorously evaluate multimodal capabilities in chemistry, covering OCR, reasoning, and property prediction.</p>
<h2 id="bridging-the-visual-gap-in-chemical-llms">Bridging the Visual Gap in Chemical LLMs</h2>
<p>The primary motivation is the limitation of existing models in handling the multimodal nature of chemistry.</p>
<ul>
<li><strong>Visual Data Gap</strong>: Chemical tasks heavily rely on visual information (molecular structures, reactions) which purely text-based chemical LLMs cannot process.</li>
<li><strong>Limitations of Generalist Models</strong>: General multimodal models (like GPT-4V or LLaVA) lack specialized chemical domain knowledge, leading to hallucinations or misinterpretations.</li>
<li><strong>Inadequacy of OCR Tools</strong>: Traditional <a href="/notes/chemistry/optical-structure-recognition/">chemical OCR</a> tools (like <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molscribe/">MolScribe</a>) excel at modality conversion (Image-to-<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) but fail at complex reasoning tasks.</li>
</ul>
<h2 id="domain-specific-data-curation-and-benchmarking">Domain-Specific Data Curation and Benchmarking</h2>
<ul>
<li><strong>Data-Driven Alignment</strong>: The underlying &ldquo;ViT-MLP-LLM&rdquo; framework is standard in multimodal modeling, paralleling architectures like LLaVA. The core innovation here is the rigorous creation of a bilingual multimodal dataset spanning hand-drawn molecules, reactions, and exam questions augmented with style transfers. The training data pipeline heavily relies on generating synthetic variance using tools like RanDepict and <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> to introduce distortions, rotations, and handwritten styles, alongside GPT-4 generated prompts to ensure linguistic diversity.</li>
<li><strong>Model Integration</strong>: ChemVLM merges <strong>InternViT-6B</strong> (a large-scale vision transformer) with <strong><a href="/notes/chemistry/llm-applications/chemllm-chemical-large-language-model/">ChemLLM-20B</a></strong> (a chemical language model). Visual features $X_v$ are mapped into the linguistic embedding space via an MLP projector, producing aligned token sequences alongside text instructions $X_q$. The joint multimodal sequence is trained using standard autoregressive next-token prediction:
$$ \mathcal{L} = -\sum_{i} \log P(y_i \mid X_v, X_q, y_{&lt;i}) $$</li>
<li><strong>Three Custom Benchmarks</strong>: The authors introduce tailored benchmarks to assess distinct competencies:
<ul>
<li><strong>ChemOCR</strong>: For image-to-SMILES conversion.</li>
<li><strong>MMCR-Bench</strong>: College entrance exam questions testing complex logical reasoning.</li>
<li><strong>MMChemBench</strong>: For molecule captioning and zero-shot property prediction.</li>
</ul>
</li>
</ul>
<h2 id="evaluating-chemical-ocr-and-reasoning">Evaluating Chemical OCR and Reasoning</h2>
<p>The authors benchmarked ChemVLM against both open-source (LLaVA, Qwen-VL, InternVL) and proprietary (GPT-4V) models across three primary domains:</p>
<ol>
<li><strong>Chemical OCR</strong>: Evaluated on 1,000 image-text pairs from ChemOCR. The primary metric is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> between the Morgan fingerprints of the generated structure ($A$) and the ground-truth SMILES ($B$):
$$ T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$
They report both the average Tanimoto similarity and the strict exact-match rate (<code>Tanimoto@1.0</code>).</li>
<li><strong>Multimodal Chemical Reasoning (MMCR)</strong>: Tested on MMCR-Bench (1,000 exam questions), ScienceQA, and CMMU. Performance was scored based on accuracy for multiple-choice and fill-in-the-blank questions.</li>
<li><strong>Multimodal Molecule Understanding</strong>: Evaluated on MMChemBench for molecule captioning and property prediction.</li>
<li><strong>Text-Only Reasoning</strong>: Tested on SciBench, a text-only benchmark for university-level science, to ensure the model retains fundamental linguistic reasoning.</li>
<li><strong>Generalization</strong>: Tested on non-chemistry subjects within the CMMU framework (Biology, Physics, Math) to assess cross-domain competence.</li>
</ol>
<h2 id="performance-gains-and-existing-limitations">Performance Gains and Existing Limitations</h2>
<ul>
<li><strong>Multimodal Reasoning Leadership</strong>: ChemVLM achieved state-of-the-art results on MMCR-Bench (41.7%), surpassing generalist models like GPT-4V (40.1%). However, scoring for portions of these benchmarks relied heavily on an LLM-as-a-judge (the Qwen-max API), which can introduce bias as LLM evaluators often favor structural characteristics and verbosity produced by similar autoregressive models. Furthermore, the model was fine-tuned on 200,000 exam questions and tested on MMCR-Bench (also derived from Chinese college entrance exams). While the authors state the data was deduplicated, the potential for data leakage remains a significant unaddressed confounder.</li>
<li><strong>Superior Understanding</strong>: In molecule captioning and prediction, ChemVLM showed significant improvements over general baseline models, scoring 80.9% on prediction compared to GPT-4V&rsquo;s 38.6%. This is a natural consequence of testing a custom-trained model on domain-specific benchmarks.</li>
<li><strong>OCR Capabilities vs. Dedicated Tools</strong>: ChemVLM outperformed generalist MLLMs in chemical structure recognition, achieving an average Tanimoto similarity of 71.0% (vs. GPT-4V&rsquo;s 15.0%). However, it remains significantly inferior to pure structural OCR tools like MolScribe in strict modality conversion tasks, only achieving an exact structural match (<code>Tanimoto@1.0</code>) of 42.9% compared to MolScribe&rsquo;s 89.1%.</li>
<li><strong>Textual Retention and Generalization Claims</strong>: The authors claim the diverse training strategy imparts broad scientific reasoning, pointing to performance retention on non-chemistry subjects (Biology, Physics, Math) and strong results on the purely textual SciBench benchmark. However, this cross-domain generalization highly likely stems from the underlying base model (ChemLLM-20B/InternLM2) or the inclusion of 1.3 million &ldquo;General&rdquo; visual QA pairs in their training blend, rather than emergent general scientific skills originating purely from learning chemistry representations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training and evaluation data relied on a mix of open-source repositories and custom curation. Many of the curated datasets have been formally released by the authors on Hugging Face (<a href="https://huggingface.co/datasets/di-zhang-fdu/chemvlm-sft-datasets"><code>di-zhang-fdu/chemvlm-sft-datasets</code></a>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Source/Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong><a href="/notes/chemistry/optical-structure-recognition/hand-drawn/decimer-hand-drawn/">DECIMER HDM</a></strong></td>
          <td>7,000+ hand-drawn molecular images.</td>
      </tr>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>MolScribe Data</strong></td>
          <td>Scanned/photographed images from literature.</td>
      </tr>
      <tr>
          <td><strong>Training (Molecule)</strong></td>
          <td><strong>Synthetic</strong></td>
          <td>Generated via ChemDraw, RDKit, and Indigo with style transfer (blurring, rotation, handwritten styles).</td>
      </tr>
      <tr>
          <td><strong>Training (Reaction)</strong></td>
          <td><strong>PEACE &amp; USPTO-50K</strong></td>
          <td>Inorganic and organic reaction schemes.</td>
      </tr>
      <tr>
          <td><strong>Training (Reasoning)</strong></td>
          <td><strong>Exam Questions</strong></td>
          <td>200,000 questions from OpenDataLab (Chinese education level). <a href="https://huggingface.co/collections/di-zhang-fdu/multi-corpus-datasets-for-chemllm">Available on Hugging Face</a>.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>ChemOCR</strong></td>
          <td>1,000 bilingual image-text pairs for SMILES recognition. Released via Google Drive link in repo.</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>MMCR-Bench</strong></td>
          <td>1,000 multimodal chemistry exam questions. <strong>Requires emailing authors directly for access.</strong></td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td><strong>MMChemBench</strong></td>
          <td>Extension of <a href="/notes/chemistry/llm-applications/chembench-llm-chemistry-evaluation/">ChemBench</a> for captioning and property prediction. Released via Google Drive link in repo.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>: Images were augmented using <strong>RanDepict</strong> for style variation. Text data (SMILES) was validated and cleaned. Prompts were diversified using GPT-4 to generate different linguistic styles.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;ViT-MLP-LLM&rdquo; structure.
<ul>
<li><strong>Vision Encoder</strong>: InternViT-6B, processing images at $448 \times 448$ resolution. Images are segmented into tiles (max 12).</li>
<li><strong>Projector</strong>: Multi-Layer Perceptron (MLP) initialized randomly to map visual features to text embedding space.</li>
<li><strong>LLM</strong>: ChemLLM-20B, a domain-specific model.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Two-stage supervised fine-tuning.
<ol>
<li><strong>Modal Alignment</strong>: Freeze LLM and base Vision Encoder weights. Train only the randomly initialized MLP projector and LoRA layers (rank 32) of the Vision Encoder. Uses diverse multimodal data.</li>
<li><strong>Supervised Fine-Tuning (SFT)</strong>: Keep LLM and Vision Encoder base weights frozen, but add LoRA (rank 16) to the LLM and retain LoRA (rank 32) on the Vision Encoder. The MLP projector is fully trained. Data includes specialized chemistry and general corpora.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: AdamW</li>
<li>Context Length: 2048 tokens</li>
<li>Chat Template: InternLM2 dialogue schema</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>ChemVLM-26B</strong>: The primary model released. It combines the 6B parameter vision encoder and the 20B parameter language model. Weights are fully available at <a href="https://huggingface.co/AI4Chem/ChemVLM-26B-1-2"><code>AI4Chem/ChemVLM-26B-1-2</code></a>. An 8B version is also available.</li>
<li><strong>Baselines</strong>: Comparisons were made against <strong>GPT-4V</strong>, <strong>Qwen-VL-Chat</strong>, <strong>LLaVA-v1.5-13B</strong>, <strong>InternVL-v1.5</strong>, and <strong>Yi-VL-Plus</strong>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance was measured across three distinct task types. Exact <a href="https://github.com/lijunxian111/ChemVlm/tree/master/evaluation">evaluation scripts</a> have been released in the official repository.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto Similarity</strong></td>
          <td>ChemOCR</td>
          <td>Comparison of generated SMILES vs. ground truth using RDKit. Reports Average Similarity and <code>Tanimoto@1.0</code> (exact match).</td>
      </tr>
      <tr>
          <td><strong>Accuracy</strong></td>
          <td>MMCR (Reasoning)</td>
          <td>+1 point for correct multiple-choice/fill-in-the-blank; 0 otherwise. Scored via Qwen-max API prompting.</td>
      </tr>
      <tr>
          <td><strong>Prediction Score</strong></td>
          <td>Property Prediction</td>
          <td>Evaluated on MMChemBench subsets.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Compute</strong>: Training utilized <strong>16 NVIDIA A100 (80GB)</strong> GPUs.</li>
<li><strong>Configuration</strong>:
<ul>
<li>Batch size: 4 (per GPU, resulting in an effective global batch size of 256)</li>
<li>Gradient Accumulation: 4 iterations</li>
<li>Precision: <strong><a href="https://en.wikipedia.org/wiki/DeepSpeed">Deepspeed</a> bfloat16 (bf16)</strong> with <strong>ZeRO-3</strong> offloading strategy</li>
<li>Framework: Training runs on the InternVL-v1.5 codebase rather than standalone scripts.</li>
</ul>
</li>
<li><strong>Inference Compute</strong>: Evaluating the 26B model requires at least one 80GB A100 GPU (with Flash Attention + bfloat16). The 8B variant requires a GPU with at least 48GB of VRAM.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemVLM-26B">ChemVLM-26B</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>Original 26B model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/AI4Chem/ChemVLM-26B-1-2">ChemVLM-26B-1-2</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Updated 26B model weights</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/di-zhang-fdu/chemvlm-sft-datasets">chemvlm-sft-datasets</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SFT training data (~51.7k rows)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/lijunxian111/ChemVlm">ChemVlm (GitHub)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, evaluation, and inference code</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, J., et al. (2025). ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(1), 415-423. <a href="https://doi.org/10.1609/aaai.v39i1.32020">https://doi.org/10.1609/aaai.v39i1.32020</a></p>
<p><strong>Publication</strong>: AAAI 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{li2025chemvlm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Junxian and Zhang, Di and Wang, Xunzhi and Hao, Zeying and Lei, Jingdi and Tan, Qian and Zhou, Cai and Liu, Wei and Yang, Yaotian and Xiong, Xinrui and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Li, Wei and Su, Mao and Zhang, Shufei and Ouyang, Wanli and Li, Yuqiang and Zhou, Dongzhan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{415--423}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://doi.org/10.1609/aaai.v39i1.32020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i1.32020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/lijunxian111/ChemVlm">Official Repository</a></li>
</ul>
]]></content:encoded></item><item><title>SwinOCSR: End-to-End Chemical OCR with Swin Transformers</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/swinocsr/</guid><description>Deep learning model using Swin Transformer and Focal Loss for OCSR, achieving 98.58% accuracy on synthetic benchmarks.</description><content:encoded><![CDATA[<h2 id="contribution-methodological-architecture-and-datasets">Contribution: Methodological Architecture and Datasets</h2>
<p>This is a <strong>Methodological Paper</strong> with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a novel architecture (Swin Transformer backbone) and a specific loss function optimization (Focal Loss) for the task of Optical Chemical Structure Recognition (OCSR).</li>
<li><strong>Resource</strong>: It constructs a large-scale synthetic dataset of 5 million molecules, specifically designing it to cover complex cases like substituents and aromatic rings.</li>
</ul>
<h2 id="motivation-addressing-visual-context-and-data-imbalance">Motivation: Addressing Visual Context and Data Imbalance</h2>
<ul>
<li><strong>Problem</strong>: OCSR (converting images of chemical structures to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>) is difficult due to complex chemical patterns and long sequences. Existing deep learning methods (often CNN-based) struggle to achieve satisfactory recognition rates.</li>
<li><strong>Technical Gap</strong>: Standard CNN backbones (like ResNet or EfficientNet) focus on local feature extraction and miss global dependencies required for interpreting complex molecular diagrams.</li>
<li><strong>Data Imbalance</strong>: Chemical strings suffer from severe class imbalance (e.g., &lsquo;C&rsquo; and &lsquo;H&rsquo; are frequent; &lsquo;Br&rsquo; or &lsquo;Cl&rsquo; are rare), which causes standard Cross Entropy loss to underperform.</li>
</ul>
<h2 id="core-innovation-swin-transformers-and-focal-loss">Core Innovation: Swin Transformers and Focal Loss</h2>
<ul>
<li><strong>Swin Transformer Backbone</strong>: SwinOCSR replaces the standard CNN backbone with a <strong>Swin Transformer</strong>, using shifted window attention to capture both local and global image features more effectively.</li>
<li><strong>Multi-label Focal Loss (MFL)</strong>: The paper introduces a modified Focal Loss to OCSR, the first explicit attempt to address token imbalance in OCSR (per the authors). This penalizes the model for errors on rare tokens, addressing the &ldquo;long-tail&rdquo; distribution of chemical elements. The standard Focal Loss formulation heavily weights hard-to-classify examples:
$$
\begin{aligned}
FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) \\
\end{aligned}
$$</li>
<li><strong>Structured Synthetic Dataset</strong>: Creation of a dataset explicitly balanced across four structural categories: Kekule rings, Aromatic rings, and their combinations with substituents.</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<ul>
<li><strong>Backbone Comparison</strong>: The authors benchmarked SwinOCSR against the backbones of leading competitors: ResNet-50 (used in Image2SMILES) and EfficientNet-B3 (used in DECIMER 1.0).</li>
<li><strong>Loss Function Ablation</strong>: They compared the performance of standard Cross Entropy (CE) loss against their proposed Multi-label Focal Loss (MFL).</li>
<li><strong>Category Stress Test</strong>: Performance was evaluated separately on molecules with/without substituents and with/without aromaticity to test robustness.</li>
<li><strong>Real-world Evaluation</strong>: The model was tested on 100 images manually extracted from the literature (with manually labeled SMILES), and separately on 100 CDK-generated images from those same SMILES, to measure the domain gap between synthetic and real-world data.</li>
</ul>
<h2 id="results-and-limitations">Results and Limitations</h2>
<ul>
<li><strong>Synthetic test set performance</strong>: With Multi-label Focal Loss (MFL), SwinOCSR achieved <strong>98.58% accuracy</strong> on the synthetic test set, compared to 97.36% with standard CE loss. Both ResNet-50 (89.17%) and EfficientNet-B3 (86.70%) backbones scored lower when using CE loss (Table 3).</li>
<li><strong>Handling of long sequences</strong>: The model maintained high accuracy (94.76%) even on very long DeepSMILES strings (76-100 characters), indicating effective global feature extraction.</li>
<li><strong>Per-category results</strong>: Performance was consistent across molecule categories: Category 1 (Kekule, 98.20%), Category 2 (Aromatic, 98.46%), Category 3 (Kekule + Substituents, 98.76%), Category 4 (Aromatic + Substituents, 98.89%). The model performed slightly better on molecules with substituents and aromatic rings.</li>
<li><strong>Domain shift</strong>: While performance on synthetic data was strong, accuracy dropped to <strong>25%</strong> on 100 real-world literature images. On 100 CDK-generated images from the same SMILES strings, accuracy was 94%, confirming that the gap stems from stylistic differences between CDK-rendered and real-world images. The authors attribute this to noise, low resolution, and variations such as condensed structural formulas and abbreviations.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: The first 8.5 million structures from <strong>PubChem</strong> were downloaded, yielding ~6.9 million unique SMILES.</li>
<li><strong>Generation Pipeline</strong>:
<ul>
<li><strong>Tools</strong>: <strong>CDK</strong> (Chemistry Development Kit) for image rendering; <strong>RDKit</strong> for SMILES canonicalization.</li>
<li><strong>Augmentation</strong>: To ensure diversity, the dataset was split into 4 categories (1.25M each): (1) Kekule, (2) Aromatic, (3) Kekule + Substituents, (4) Aromatic + Substituents. Substituents were randomly added from a list of 224 common patent substituents.</li>
<li><strong>Preprocessing</strong>: Images rendered as binary, resized to <strong>224x224</strong>, and copied to 3 channels (RGB simulation).</li>
</ul>
</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>4,500,000</td>
          <td>18:1:1 split (Train/Val/Test)</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Synthetic (PubChem-derived)</td>
          <td>250,000</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Loss Function</strong>: <strong>Multi-label Focal Loss (MFL)</strong>. The single-label classification task was cast as multi-label to apply Focal Loss, using a sigmoid activation on logits.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Optimizer</strong>: <strong>Adam</strong> with initial learning rate <code>5e-4</code>.</li>
<li><strong>Schedulers</strong>: Cosine decay for the Swin Transformer backbone; Step decay for the Transformer encoder/decoder.</li>
<li><strong>Regularization</strong>: Dropout rate of <code>0.1</code>.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Backbone (Encoder 1)</strong>: <strong>Swin Transformer</strong>.
<ul>
<li>Patch size: $4 \times 4$.</li>
<li>Linear embedding dimension: 192.</li>
<li>Structure: 4 stages with Swin Transformer Blocks (Window MSA + Shifted Window MSA).</li>
<li>Output: Flattened patch sequence $S_b$.</li>
</ul>
</li>
<li><strong>Transformer Encoder (Encoder 2)</strong>: 6 standard Transformer encoder layers. Uses Positional Embedding + Multi-Head Attention + MLP.</li>
<li><strong>Transformer Decoder</strong>: 6 standard Transformer decoder layers. Uses Masked Multi-Head Attention (to prevent look-ahead) + Multi-Head Attention (connecting to encoder output $S_e$).</li>
<li><strong>Tokenization</strong>: <strong>DeepSMILES</strong> format used (syntactically more robust than SMILES). Vocabulary size: <strong>76 tokens</strong> (76 unique characters found in dataset). Embedding dimension: 256.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy (Exact Match), Tanimoto Similarity (PubChem fingerprints), BLEU, ROUGE.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SwinOCSR (CE)</th>
          <th>SwinOCSR (MFL)</th>
          <th>ResNet-50 (CE)</th>
          <th>EfficientNet-B3 (CE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>97.36%</td>
          <td><strong>98.58%</strong></td>
          <td>89.17%</td>
          <td>86.70%</td>
      </tr>
      <tr>
          <td>Tanimoto</td>
          <td>99.65%</td>
          <td><strong>99.77%</strong></td>
          <td>98.79%</td>
          <td>98.46%</td>
      </tr>
      <tr>
          <td>BLEU</td>
          <td>99.46%</td>
          <td><strong>99.59%</strong></td>
          <td>98.62%</td>
          <td>98.37%</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>99.64%</td>
          <td><strong>99.78%</strong></td>
          <td>98.87%</td>
          <td>98.66%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Trained on <strong>NVIDIA Tesla V100-PCIE</strong>.</li>
<li><strong>Training Time</strong>: 30 epochs.</li>
<li><strong>Batch Size</strong>: 256 images ($224 \times 224$ pixels).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/suanfaxiaohuo/SwinOCSR">SwinOCSR</a></td>
          <td>Code + Data</td>
          <td>Unknown</td>
          <td>Official implementation with dataset and trained models</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Li, J., Yang, Z. et al. (2022). SwinOCSR: end-to-end optical chemical structure recognition using a Swin Transformer. <em>Journal of Cheminformatics</em>, 14(41). <a href="https://doi.org/10.1186/s13321-022-00624-5">https://doi.org/10.1186/s13321-022-00624-5</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/suanfaxiaohuo/SwinOCSR">GitHub Repository</a></li>
</ul>
]]></content:encoded></item><item><title>String Representations for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</guid><description>Ablation study comparing SMILES, DeepSMILES, SELFIES, and InChI for OCSR. SMILES achieves highest accuracy; SELFIES guarantees validity.</description><content:encoded><![CDATA[<h2 id="empirical-focus-and-resource-contributions">Empirical Focus and Resource Contributions</h2>
<p>This is an <strong>Empirical Paper</strong> ($\Psi_{\text{Empirical}}$) with a secondary contribution as a <strong>Resource Paper</strong> ($\Psi_{\text{Resource}}$).</p>
<p>It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, DeepSMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.</p>
<h2 id="the-syntax-challenge-in-chemical-image-recognition">The Syntax Challenge in Chemical Image Recognition</h2>
<p>Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings (where ring closures and branches are marked by single characters potentially far apart in the sequence) creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.</p>
<h2 id="isolating-string-representation-variables">Isolating String Representation Variables</h2>
<p>The core novelty is the <strong>comparative isolation of the string representation variable</strong> in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between <em>validity</em> (guaranteed by SELFIES) and <em>accuracy</em> (highest with SMILES).</p>
<h2 id="large-scale-image-to-text-translation-experiments">Large-Scale Image-to-Text Translation Experiments</h2>
<p>The authors performed a large-scale image-to-text translation experiment:</p>
<ul>
<li><strong>Task</strong>: Converting 2D chemical structure images into text strings.</li>
<li><strong>Data</strong>:
<ul>
<li><strong>ChEMBL</strong>: ~1.6M molecules, split into two datasets (with and without stereochemistry).</li>
<li><strong>PubChem</strong>: ~3M molecules, split similarly, to test performance scaling with data size.</li>
</ul>
</li>
<li><strong>Representations</strong>: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI.</li>
<li><strong>Metric</strong>: The models were evaluated on:
<ul>
<li><strong>Validity</strong>: Can the predicted string be decoded back to a molecule?</li>
<li><strong>Exact Match</strong>: Is the predicted string identical to the ground truth?</li>
<li><strong>Tanimoto Similarity</strong>: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)? The similarity $\mathcal{T}$ between two molecular fingerprints $A$ and $B$ is calculated as:
$$ \mathcal{T}(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</li>
</ul>
</li>
</ul>
<h2 id="comparative-performance-and-validity-trade-offs">Comparative Performance and Validity Trade-offs</h2>
<ul>
<li><strong>SMILES is the most accurate</strong>: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.</li>
<li><strong>SELFIES guarantees validity</strong>: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.</li>
<li><strong>InChI is unsuitable</strong>: InChI performed significantly worse (approx. 64% exact match) due to extreme maximum string lengths (up to 273 characters).</li>
<li><strong>Stereochemistry adds difficulty</strong>: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.</li>
<li><strong>Recommendation</strong>: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (Dataset 1/2)</td>
          <td>~1.5M</td>
          <td>Filtered for MW &lt; 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B).</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem (Dataset 3/4)</td>
          <td>~3.0M</td>
          <td>Same filtering rules, used to test scaling.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Test Split</td>
          <td>~120k - 250k</td>
          <td>Created using RDKit MaxMin algorithm to ensure chemical diversity.</td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: CDK Structure Diagram Generator (SDG).</li>
<li><strong>Specs</strong>: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tokenization Rules</strong> (Critical for replication):</p>
<ul>
<li><strong>SELFIES</strong>: Split at every <code>][</code> (e.g., <code>[C][N]</code> $\rightarrow$ <code>[C]</code>, <code>[N]</code>).</li>
<li><strong>SMILES / DeepSMILES</strong>: Regex-based splitting:
<ul>
<li>Every heavy atom (e.g., <code>C</code>, <code>N</code>).</li>
<li>Every bracket <code>(</code> and <code>)</code>.</li>
<li>Every bond symbol <code>=</code> and <code>#</code>.</li>
<li>Every single-digit number.</li>
<li>Everything inside square brackets <code>[]</code> is kept as a single token.</li>
</ul>
</li>
<li><strong>InChI</strong>: The prefix <code>InChI=1S/</code> was treated as a single token and removed during training, then re-added for evaluation.</li>
</ul>
<h3 id="models">Models</h3>
<p>The model follows the <strong>DECIMER</strong> architecture.</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-B3 (pre-trained with &ldquo;Noisy Student&rdquo; weights).
<ul>
<li>Output: Image feature vectors of shape $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder</strong>: Transformer (similar to the &ldquo;Base&rdquo; model from <em>Attention Is All You Need</em>).
<ul>
<li>Layers: 4 encoder-decoder layers.</li>
<li>Attention Heads: 8.</li>
<li>Dimension ($d_{\text{model}}$): 512.</li>
<li>Feed-forward ($d_{\text{ff}}$): 2048.</li>
<li>Dropout: 10%.</li>
</ul>
</li>
<li><strong>Loss</strong>: Sparse categorical cross-entropy.</li>
<li><strong>Optimizer</strong>: Adam with custom learning rate scheduler.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics were calculated after converting all predictions back to standard SMILES.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Baseline (SMILES)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Identical Match</strong></td>
          <td>88.62% (PubChem)</td>
          <td>Strict character-for-character equality.</td>
      </tr>
      <tr>
          <td><strong>Valid Structure</strong></td>
          <td>99.78%</td>
          <td>SMILES had rare syntax errors; SELFIES achieved 100%.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (Avg)</strong></td>
          <td>0.98</td>
          <td>Calculated using PubChem fingerprints via CDK.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Google Cloud TPUs (v3-8).</li>
<li><strong>Format</strong>: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.</li>
<li><strong>Batch Size</strong>: 1024.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER_Short_Communication">DECIMER Short Communication</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts (Python, Java)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5155037">Datasets on Zenodo</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>SMILES data and processing scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Steinbeck, C., &amp; Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. <em>Digital Discovery</em>, 1(2), 84-90. <a href="https://doi.org/10.1039/D1DD00013F">https://doi.org/10.1039/D1DD00013F</a></p>
<p><strong>Publication</strong>: Digital Discovery 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://chemrxiv.org/doi/pdf/10.26434/chemrxiv-2021-7c9wf">ChemRxiv Preprint (PDF)</a></li>
<li><a href="https://github.com/Kohulan/DECIMER_Short_Communication">Official Code Repository</a></li>
<li><a href="https://doi.org/10.5281/zenodo.5155037">Data on Zenodo</a></li>
<li>Related work: <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanPerformanceChemicalStructure2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{84--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D1DD00013F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Image2SMILES: Transformer OCSR with Synthetic Data Pipeline</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/image2smiles/</guid><description>Transformer-based OCSR using a novel synthetic data generation pipeline for robust molecular image interpretation across diverse drawing styles.</description><content:encoded><![CDATA[<h2 id="contribution-image2smiles-as-a-method-and-resource">Contribution: Image2SMILES as a Method and Resource</h2>
<p>This is primarily a <strong>Method</strong> paper with a significant <strong>Resource</strong> component.</p>
<ul>
<li><strong>Method</strong>: It proposes a specific neural architecture (ResNet backbone and Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering &ldquo;How well does this work?&rdquo; with extensive benchmarks against rule-based systems like OSRA.</li>
<li><strong>Resource</strong>: A core contribution is the &ldquo;Generate and Train!&rdquo; paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.</li>
</ul>
<h2 id="motivation-bottlenecks-in-recognizing-trapped-chemical-structures">Motivation: Bottlenecks in Recognizing Trapped Chemical Structures</h2>
<p>Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.</p>
<ul>
<li><strong>Problem</strong>: Chemical structures are often &ldquo;trapped&rdquo; in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, &ldquo;Markush&rdquo; structures (templates), or visual contamination.</li>
<li><strong>Gap</strong>: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.</li>
<li><strong>Goal</strong>: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).</li>
</ul>
<h2 id="core-innovation-the-generate-and-train-pipeline-and-fg-smiles">Core Innovation: The &ldquo;Generate and Train!&rdquo; Pipeline and FG-SMILES</h2>
<ul>
<li><strong>&ldquo;Generate and Train!&rdquo; Paradigm</strong>: The authors assert that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like &ldquo;Markush&rdquo; variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual &ldquo;contamination&rdquo; (stray text, arrows).</li>
<li><strong>FG-SMILES</strong>: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.</li>
<li><strong>Encoder-Free Architecture</strong>: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.</li>
</ul>
<h2 id="methodology-and-benchmarking-against-osra">Methodology and Benchmarking Against OSRA</h2>
<ul>
<li><strong>Training</strong>: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.</li>
<li><strong>Validation (Synthetic)</strong>: Evaluated on a hold-out set of 1M synthetic images.</li>
<li><strong>Validation (Real World)</strong>:
<ul>
<li><strong>Dataset A</strong>: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.</li>
<li><strong>Dataset B</strong>: 296 structures systematically extracted from <em>Journal of Organic Chemistry</em> (one paper per issue from 2020) to reduce selection bias.</li>
</ul>
</li>
<li><strong>Comparison</strong>: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.</li>
</ul>
<h2 id="results-high-precision-extraction-and-key-limitations">Results: High-Precision Extraction and Key Limitations</h2>
<ul>
<li><strong>Performance</strong>:
<ul>
<li><strong>Synthetic</strong>: 90.7% exact match accuracy.</li>
<li><strong>Real Data (Dataset A)</strong>: Image2SMILES achieved <strong>79.2%</strong> accuracy compared to OSRA&rsquo;s <strong>62.1%</strong>.</li>
<li><strong>Real Data (Dataset B)</strong>: Image2SMILES achieved <strong>62.5%</strong> accuracy compared to OSRA&rsquo;s <strong>24.0%</strong>.</li>
</ul>
</li>
<li><strong>Confidence Correlation</strong>: There is a strong correlation between the model&rsquo;s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22.5% of data, enabling high-precision automated pipelines.</li>
<li><strong>Key Failures</strong>: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confusion of R-group indices ($R&rsquo;$ vs $R_1$), and explicit hydrogens rendered as groups.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: A subset of 10 million molecules sampled from PubChem.</li>
<li><strong>Selection Logic</strong>: Bias towards complex/rare structures using a &ldquo;Full Coefficient&rdquo; (FC) probability metric based on molecule size and ring/atom rarity.
<ul>
<li>Formula: $BC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$ where $n_{\max}=60$.</li>
</ul>
</li>
<li><strong>Generation</strong>: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).</li>
<li><strong>Contamination</strong>: &ldquo;Visual noise&rdquo; is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.</li>
<li><strong>Target Format</strong>: <strong>FG-SMILES</strong> (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a <code>v</code> token.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Contamination Augmentation</strong>: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.</li>
<li><strong>Functional Group Resolution</strong>: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).</li>
<li><strong>Markush Support</strong>: Stochastic replacement of substituents with R-group labels ($R_1$, $R&rsquo;$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: &ldquo;Image-to-Sequence&rdquo; hybrid model.
<ul>
<li><strong>Backbone</strong>: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.</li>
<li><strong>Neck</strong>: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.</li>
<li><strong>Decoder</strong>: Standard Transformer Decoder with parameters from the original Transformer architecture.</li>
</ul>
</li>
<li><strong>Input</strong>: Images resized to $384 \times 384 \times 3$.</li>
<li><strong>Output</strong>: Sequence of FG-SMILES tokens.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metric</strong>: Binary &ldquo;Exact Match&rdquo; (valid/invalid).
<ul>
<li>Strict criteria: Stereo and R-group indices must match exactly (e.g., $R&rsquo;$ vs $R_1$ is a failure).</li>
</ul>
</li>
<li><strong>Datasets</strong>:
<ul>
<li><strong>Internal</strong>: 5% random split of generated data (500k samples).</li>
<li><strong>External (Dataset A &amp; B)</strong>: Manually cropped real-world images from specified journals.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.</li>
<li><strong>Duration</strong>: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.</li>
<li><strong>Optimizer</strong>: RAdam with learning rate $3 \cdot 10^{-4}$.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/syntelly/img2smiles_generator">Data Generator (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Synthetic training data generator</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5069806">1M Generated Samples (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Randomly generated image-SMILES pairs</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5356500">Real-World Test Images (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Cropped structures from real papers with target FG-SMILES</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></td>
          <td>Other</td>
          <td>Proprietary</td>
          <td>Web demo for PDF-to-SMILES extraction</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Khokhlov, I., Krasnov, L., Fedorov, M. V., &amp; Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. <em>Chemistry-Methods</em>, 2(1), e202100069. <a href="https://doi.org/10.1002/cmtd.202100069">https://doi.org/10.1002/cmtd.202100069</a></p>
<p><strong>Publication</strong>: Chemistry-Methods 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/syntelly/img2smiles_generator">Official Code (Data Generator)</a></li>
<li><a href="https://app.syntelly.com/pdf2smiles">Syntelly Demo</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{khokhlovImage2SMILESTransformerBasedMolecular2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Image2SMILES: Transformer-Based Molecular Optical Recognition Engine}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Image2SMILES}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Chemistry-Methods}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{e202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2628-9725}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1002/cmtd.202100069}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ICMDT: Automated Chemical Structure Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/icmdt/</guid><description>A Transformer-based model (ICMDT) for converting chemical structure images into InChI text strings using a novel Deep TNT block.</description><content:encoded><![CDATA[<h2 id="contribution-image-to-text-translation-for-chemical-structures">Contribution: Image-to-Text Translation for Chemical Structures</h2>
<p>This is a <strong>Method</strong> paper.</p>
<p>It proposes a novel neural network architecture, the <strong>Image Captioning Model based on Deep TNT (ICMDT)</strong>, to solve the specific problem of &ldquo;molecular translation&rdquo; (image-to-text). The classification is supported by the following rhetorical indicators:</p>
<ul>
<li><strong>Novel Mechanism:</strong> It introduces the &ldquo;Deep TNT block&rdquo; to improve upon the existing TNT architecture by fusing features at three levels (pixel, small patch, large patch).</li>
<li><strong>Baseline Comparison:</strong> The authors explicitly compare their model against four other architectures (CNN+RNN and CNN+Transformer variants).</li>
<li><strong>Ablation Study:</strong> Section 4.3 is dedicated to ablating specific components (position encoding, patch fusion) to prove their contribution to the performance gain.</li>
</ul>
<h2 id="motivation-digitizing-historical-chemical-literature">Motivation: Digitizing Historical Chemical Literature</h2>
<p>The primary motivation is to speed up chemical research by digitizing historical chemical literature.</p>
<ul>
<li><strong>Problem:</strong> Historical sources often contain corrupted or noisy images, making automated recognition difficult.</li>
<li><strong>Gap:</strong> Existing models like the standard TNT (Transformer in Transformer) function primarily as encoders for classification and fail to effectively integrate local pixel-level information required for precise structure generation.</li>
<li><strong>Goal:</strong> To build a dependable generative model that can accurately translate these noisy images into <strong><a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a></strong> (International Chemical Identifier) text strings.</li>
</ul>
<h2 id="novelty-multi-level-feature-fusion-with-deep-tnt">Novelty: Multi-Level Feature Fusion with Deep TNT</h2>
<p>The core contribution is the <strong>Deep TNT block</strong> and the resulting <strong>ICMDT</strong> architecture.</p>
<ul>
<li><strong>Deep TNT Block:</strong> The Deep TNT block expands upon standard local and global modeling by stacking three transformer blocks to process information at three granularities:
<ol>
<li><strong>Internal Transformer:</strong> Processes pixel embeddings.</li>
<li><strong>Middle Transformer:</strong> Processes small patch embeddings.</li>
<li><strong>Exterior Transformer:</strong> Processes large patch embeddings.</li>
</ol>
</li>
<li><strong>Multi-level Fusion:</strong> The model fuses pixel-level features into small patches, and small patches into large patches, allowing for finer integration of local details.</li>
<li><strong>Position Encoding:</strong> A specific strategy of applying shared position encodings to small patches and pixels, while using a learnable 1D encoding for large patches.</li>
</ul>
<h2 id="methodology-benchmarking-on-the-bms-dataset">Methodology: Benchmarking on the BMS Dataset</h2>
<p>The authors evaluated the model on the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset.</p>
<ul>
<li><strong>Baselines:</strong> They constructed four comparative models:
<ul>
<li>EfficientNetb0 + RNN (Bi-LSTM)</li>
<li>ResNet50d + RNN (Bi-LSTM)</li>
<li>EfficientNetb0 + Transformer</li>
<li>ResNet101d + Transformer</li>
</ul>
</li>
<li><strong>Ablation:</strong> They tested the impact of removing the large patch position encoding (ICMDT*), reverting the encoder to a standard TNT-S (TNTD), and setting the patch size to 32 directly on TNT-S without the exterior transformer block (TNTD-B).</li>
<li><strong>Pre-processing Study:</strong> They experimented with denoising ratios and cropping strategies.</li>
</ul>
<h2 id="results--conclusions-improved-inchi-translation-accuracy">Results &amp; Conclusions: Improved InChI Translation Accuracy</h2>
<ul>
<li><strong>Performance:</strong> ICMDT achieved the lowest <strong>Levenshtein distance (0.69)</strong> among all five models tested (Table 3). The best-performing baseline was ResNet101d+Transformer.</li>
<li><strong>Convergence:</strong> The model converged significantly faster than the baselines, outperforming others as early as epoch 6.7.</li>
<li><strong>Ablation Results:</strong> The full Deep TNT block reduced error by nearly half compared to the standard TNT encoder (0.69 vs 1.29 Levenshtein distance). Removing large patch position encoding (ICMDT*) degraded performance to 1.04, and directly using patch size 32 on TNT-S (TNTD-B) scored 1.37.</li>
<li><strong>Limitations:</strong> The model struggles with <strong>stereochemical layers</strong> (e.g., identifying clockwise neighbors or +/- signs) compared to non-stereochemical layers.</li>
<li><strong>Inference &amp; Fusion:</strong> The multi-model inference and fusion pipeline (beam search, TTA, step-wise logit ensemble, and voting) improved results by 0.24 to 2.5 Levenshtein distance reduction over single models.</li>
<li><strong>Future Work:</strong> Integrating full object detection to predict atom/bond coordinates to better resolve 3D stereochemical information.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p><strong>Status: Partially Reproducible.</strong> The dataset is publicly available through Kaggle, and the paper provides detailed hyperparameters and architecture specifications. However, no source code or pretrained model weights have been released.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.kaggle.com/c/bms-molecular-translation">BMS Molecular Translation (Kaggle)</a></td>
          <td>Dataset</td>
          <td>Competition Terms</td>
          <td>Training/test images with InChI labels</td>
      </tr>
  </tbody>
</table>
<p><strong>Missing components:</strong> No official code repository or pretrained weights. Reimplementation requires reconstructing the Deep TNT block, training pipeline, and inference/fusion strategy from the paper description alone.</p>
<p><strong>Hardware/compute requirements:</strong> Not explicitly stated in the paper.</p>
<h3 id="data">Data</h3>
<p>The experiments used the <strong>Bristol-Myers Squibb Molecular Translation</strong> dataset from Kaggle.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BMS Training Set</td>
          <td>2,424,186 images</td>
          <td>Supervised; contains noise and blur</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BMS Test Set</td>
          <td>1,616,107 images</td>
          <td>Higher noise variation than training set</td>
      </tr>
  </tbody>
</table>
<p><strong>Pre-processing Strategy</strong>:</p>
<ul>
<li><strong>Effective:</strong> Padding resizing (reshaping to square using the longer edge, padding insufficient parts with pixels from the middle of the image).</li>
<li><strong>Ineffective:</strong> Smart cropping (removing white borders degraded performance).</li>
<li><strong>Augmentation:</strong> GaussNoise, Blur, RandomRotate90, and PepperNoise ($SNR=0.996$).</li>
<li><strong>Denoising:</strong> Best results found by mixing denoised and original data (Ratio 2:13) during training.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer:</strong> Lookahead ($\alpha=0.5, k=5$) and RAdam ($\beta_1=0.9, \beta_2=0.99$).</li>
<li><strong>Loss Function:</strong> Anti-Focal loss ($\gamma=0.5$) combined with Label Smoothing. Standard Focal Loss adds a modulating factor $(1-p_t)^\gamma$ to cross-entropy to focus on hard negatives. Anti-Focal Loss (Raunak et al., 2020) modifies this factor to reduce the disparity between training and inference distributions in Seq2Seq models.</li>
<li><strong>Training Schedule:</strong>
<ul>
<li>Initial resolution: $224 \times 224$</li>
<li>Fine-tuning: Resolution $384 \times 384$ for labels $&gt;150$ length.</li>
<li>Batch size: Dynamic, increasing from 16 to 1024 (with proportional learning rate scaling).</li>
<li>Noisy Labels: Randomly replacing chemical elements in labels with a certain probability to improve robustness during inference.</li>
</ul>
</li>
<li><strong>Inference Strategy:</strong>
<ul>
<li>Beam Search ($k=16$ initially, $k=64$ if failing InChI validation).</li>
<li>Test Time Augmentation (TTA): Rotations of $90^\circ$.</li>
<li>Ensemble: Step-wise logit ensemble and voting based on Levenshtein distance scores.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<p><strong>ICMDT Architecture:</strong></p>
<ul>
<li><strong>Encoder (Deep TNT)</strong> (Depth: 12 layers):
<ul>
<li><strong>Internal Block:</strong> Dim 160, Heads 4, Hidden size 640, MLP act GELU, Pixel patch size 4.</li>
<li><strong>Middle Block:</strong> Dim 10, Heads 6, Hidden size 128, MLP act GELU, Small patch size 16.</li>
<li><strong>Exterior Block:</strong> Dim 2560, Heads 10, Hidden size 5120, MLP act GELU, Large patch size 32.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Decoder dim: 2560, FFN dim: 1024.</li>
<li>Depth: 3 layers, Heads: 8.</li>
<li>Vocab size: 193 (InChI tokens), text_dim: 384.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metric:</strong> Levenshtein Distance (measures single-character edit operations between generated and ground truth InChI strings).</p>
<p><strong>Ablation Results (Table 3 from paper):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Levenshtein Distance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td><strong>0.69</strong></td>
      </tr>
      <tr>
          <td>ICMDT*</td>
          <td>138.16</td>
          <td>1.04</td>
      </tr>
      <tr>
          <td>TNTD</td>
          <td>114.36</td>
          <td>1.29</td>
      </tr>
      <tr>
          <td>TNTD-B</td>
          <td>114.36</td>
          <td>1.37</td>
      </tr>
  </tbody>
</table>
<p><strong>Baseline Comparison (from convergence curves, Figure 9):</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params (M)</th>
          <th>Convergence (Epochs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ICMDT</strong></td>
          <td>138.16</td>
          <td>~9.76</td>
      </tr>
      <tr>
          <td>ResNet101d + Transformer</td>
          <td>302.02</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + Transformer</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>ResNet50d + RNN</td>
          <td>90.6</td>
          <td>14+</td>
      </tr>
      <tr>
          <td>EfficientNetb0 + RNN</td>
          <td>46.3</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, Y., Chen, G., &amp; Li, X. (2022). Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model. <em>Applied Sciences</em>, 12(2), 680. <a href="https://doi.org/10.3390/app12020680">https://doi.org/10.3390/app12020680</a></p>
<p><strong>Publication</strong>: MDPI Applied Sciences 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://www.kaggle.com/c/bms-molecular-translation">Kaggle Competition: BMS Molecular Translation</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liAutomatedRecognitionChemical2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Automated {{Recognition}} of {{Chemical Molecule Images Based}} on an {{Improved TNT Model}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Li, Yanchi and Chen, Guanyu and Li, Xiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Applied Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{680}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Multidisciplinary Digital Publishing Institute}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2076-3417}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3390/app12020680}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>End-to-End Transformer for Molecular Image Captioning</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/vit-inchi-transformer/</guid><description>Vision Transformer encoder with Transformer decoder for molecular image-to-InChI translation, outperforming CNN baselines on noisy molecular datasets.</description><content:encoded><![CDATA[<h2 id="methodological-contribution">Methodological Contribution</h2>
<p>This is a <strong>Methodological Paper</strong>. It proposes a novel architectural approach to molecular image translation by replacing the standard CNN encoder with a Vision Transformer (ViT). The authors validate this method through comparative benchmarking against standard CNN+RNN baselines (e.g., ResNet+LSTM) and provide optimizations for inference speed.</p>
<h2 id="motivation-and-problem-statement">Motivation and Problem Statement</h2>
<p>The core problem addressed is existing molecular translation methods (extracting chemical structure from images into computer-readable InChI format) rely heavily on rule-based systems or CNN+RNN architectures. These current approaches often underperform when handling noisy images (common in scanned old journals) or images with few distinguishable features. There is a significant need in drug discovery to digitize and analyze legacy experimental data locked in image format within scientific publications.</p>
<h2 id="core-innovations-end-to-end-vit-encoder">Core Innovations: End-to-End ViT Encoder</h2>
<p>The primary contribution is the use of a completely convolution-free Vision Transformer (ViT) as the encoder, allowing the model to utilize long-range dependencies among image patches from the very beginning via self-attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
The architecture is a pure Transformer (Encoder-Decoder), treating the molecular image similarly to a sequence of tokens (patches). Furthermore, the authors implement a specific caching strategy for the decoder to avoid recomputing embeddings for previously decoded tokens, reducing the time complexity of the decoding step.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The model was compared against standard CNN + RNN and ResNet (18, 34, 50) + LSTM with attention. Ablation studies were conducted varying the number of transformer layers (3, 6, 12, 24) and image resolution (224x224 vs 384x384). The model trained on a large combined dataset, including Bristol Myers Squibb data, SMILES, GDB-13, and synthetically augmented images containing noise and artifacts. Performance was evaluated using the Levenshtein distance metric, which computes the minimum number of single-character edits to transform the predicted string into the ground truth.</p>
<h2 id="performance-outcomes-and-capabilities">Performance Outcomes and Capabilities</h2>
<p>The proposed 24-layer ViT model (input size 384) achieved the lowest Levenshtein distance of <strong>6.95</strong>, outperforming the ResNet50+LSTM baseline (7.49) and the standard CNN+RNN (103.7). Increasing the number of layers had a strong positive impact, with the 24-layer model becoming competitive with current approaches. The authors note the model was evaluated on datasets with low distinguishable features and noise, where the ViT encoder&rsquo;s self-attention over all patches from the first layer helped capture relevant structure. The proposed caching optimization reduced the total decoding time complexity from $O(MN^2 + N^3)$ to $O(MN + N^2)$ for $N$ timesteps, by reducing the per-timestep cost to $O(M + N)$.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The model was trained on a combined dataset randomly split into 70% training, 10% test, and 20% validation.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Bristol Myers Squibb</strong></td>
          <td>~2.4 million synthetic images with InChI labels.</td>
          <td>Provided by BMS global biopharmaceutical company.</td>
      </tr>
      <tr>
          <td><strong>SMILES</strong></td>
          <td>Kaggle contest data converted to InChI.</td>
          <td>Images generated using RDKit.</td>
      </tr>
      <tr>
          <td><strong><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></strong></td>
          <td>Subset of 977 million small organic molecules (up to 13 atoms).</td>
          <td>Converted from SMILES using RDKit.</td>
      </tr>
      <tr>
          <td><strong>Augmented Images</strong></td>
          <td>Synthetic images with salt/pepper noise, dropped atoms, and bond modifications.</td>
          <td>Used to improve robustness against noise.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Training Objective</strong>: Cross-entropy loss minimization.</li>
<li><strong>Inference Decoding</strong>: Autoregressive decoding predicting the next character of the InChI string.</li>
<li><strong>Positional Encoding</strong>: Standard sine and cosine functions of different frequencies.</li>
<li><strong>Optimization</strong>:
<ul>
<li><strong>Caching</strong>: Caches the output of each layer during decoding to avoid recomputing embeddings for already decoded tokens.</li>
<li><strong>JIT</strong>: PyTorch JIT compiler used for graph optimization (1.2-1.5x speed increase on GPU).</li>
<li><strong>Self-Critical Training</strong>: Finetuning performed using self-critical sequence training (SCST).</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Encoder (Vision Transformer)</strong>:
<ul>
<li>Input: Flattened 2D patches of the image. Patch size: $16 \times 16$.</li>
<li>Projection: Trainable linear projection to latent vector size $D$.</li>
<li>Structure: Alternating layers of Multi-Head Self-Attention (MHSA) and MLP blocks.</li>
</ul>
</li>
<li><strong>Decoder (Vanilla Transformer)</strong>:
<ul>
<li>Input: Tokenized InChI string + sinusoidal positional embedding.</li>
<li>Vocabulary: 275 tokens (including <code>&lt;SOS&gt;</code>, <code>&lt;PAD&gt;</code>, <code>&lt;EOS&gt;</code>).</li>
</ul>
</li>
<li><strong>Hyperparameters (Best Model)</strong>:
<ul>
<li>Image Size: $384 \times 384$.</li>
<li>Layers: 24.</li>
<li>Feature Dimension: 512.</li>
<li>Attention Heads: 12.</li>
<li>Optimizer: Adam.</li>
<li>Learning Rate: $3 \times 10^{-5}$ (decayed by 0.5 in last 2 epochs).</li>
<li>Batch Size: Varied [64-512].</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: Levenshtein Distance (lower is better).</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Image Size</th>
          <th>Layers</th>
          <th>Epochs</th>
          <th>Levenshtein Dist.</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard CNN+RNN</td>
          <td>224</td>
          <td>3</td>
          <td>10</td>
          <td>103.7</td>
      </tr>
      <tr>
          <td>ResNet18 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>75.03</td>
      </tr>
      <tr>
          <td>ResNet34 + LSTM</td>
          <td>224</td>
          <td>4</td>
          <td>10</td>
          <td>45.72</td>
      </tr>
      <tr>
          <td>ResNet50 + LSTM</td>
          <td>224</td>
          <td>5</td>
          <td>10</td>
          <td>7.49</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>3</td>
          <td>5</td>
          <td>79.82</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>6</td>
          <td>5</td>
          <td>54.58</td>
      </tr>
      <tr>
          <td>ViT Transformers</td>
          <td>224</td>
          <td>12</td>
          <td>5</td>
          <td>31.30</td>
      </tr>
      <tr>
          <td>ViT Transformers (Best)</td>
          <td>384</td>
          <td>24</td>
          <td>10</td>
          <td><strong>6.95</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>System</strong>: 70GB GPU system.</li>
<li><strong>Framework</strong>: PyTorch and PyTorch Lightning.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sundaramoorthy, C., Kelvin, L. Z., Sarin, M., &amp; Gupta, S. (2021). End-to-End Attention-based Image Captioning. <em>arXiv preprint arXiv:2104.14721</em>. <a href="https://doi.org/10.48550/arXiv.2104.14721">https://doi.org/10.48550/arXiv.2104.14721</a></p>
<p><strong>Publication</strong>: arXiv 2021 (preprint)</p>
<p><strong>Note</strong>: This is an arXiv preprint and has not undergone formal peer review.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{sundaramoorthyEndtoEndAttentionbasedImage2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{End-to-{{End Attention-based Image Captioning}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Sundaramoorthy, Carola and Kelvin, Lin Ziwen and Sarin, Mahak and Gupta, Shubham}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2104.14721}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DECIMER 1.0: Transformers for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/</guid><description>Transformer-based approach for Optical Chemical Structure Recognition converting chemical images to SELFIES strings with 96% accuracy.</description><content:encoded><![CDATA[<h2 id="evaluating-the-contribution-a-methodological-shift">Evaluating the Contribution: A Methodological Shift</h2>
<p><strong>Method (Dominant)</strong> with strong <strong>Resource</strong> elements.</p>
<p>This is primarily a <strong>Method</strong> paper because it proposes a specific architectural evolution. It replaces CNN-RNN/Encoder-Decoder models with a <strong>Transformer-based network</strong> to solve the problem of image-to-structure translation. It validates this methodological shift through rigorous ablation studies comparing feature extractors (InceptionV3 vs. EfficientNet) and decoder architectures.</p>
<p>It also serves as a <strong>Resource</strong> contribution by releasing the open-source software, trained models, and describing the curation of a massive synthetic training dataset (&gt;35 million molecules).</p>
<h2 id="motivation-inaccessible-chemical-knowledge">Motivation: Inaccessible Chemical Knowledge</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A vast amount of chemical knowledge (pre-1990s) is locked in printed or scanned literature and is not machine-readable.</li>
<li><strong>Manual Bottlenecks</strong>: Manual curation and extraction of this data is tedious, slow, and error-prone.</li>
<li><strong>Limitations of Prior Tools</strong>: Existing Optical Chemical Structure Recognition (OCSR) tools are often rule-based or struggle with the noise and variability of full-page scanned articles. Previous deep learning attempts were not publicly accessible or robust enough.</li>
</ul>
<h2 id="key-innovation-transformer-based-molecular-translation">Key Innovation: Transformer-Based Molecular Translation</h2>
<ul>
<li><strong>Transformer Architecture</strong>: Shifts from the standard CNN-RNN (Encoder-Decoder) approach to a <strong>Transformer-based decoder</strong>, significantly improving accuracy.</li>
<li><strong>EfficientNet Backbone</strong>: Replaces the standard InceptionV3 feature extractor with <strong>EfficientNet-B3</strong>, which improved feature extraction quality for chemical images.</li>
<li><strong>SELFIES Representation</strong>: Utilizes <a href="/notes/chemistry/molecular-representations/notations/selfies/"><strong>SELFIES</strong></a> (SELF-referencing Embedded Strings) as the target output. This guarantees 100% robust molecular strings and eliminates the &ldquo;invalid SMILES&rdquo; problem common in generative models.</li>
<li><strong>Massive Scaling</strong>: Trains on synthetic datasets derived from PubChem (up to <strong>39 million molecules</strong> total, with the largest training subset at ~35 million), demonstrating that scaling data size directly correlates with improved model performance.</li>
</ul>
<h2 id="methodology-and-experimental-validation">Methodology and Experimental Validation</h2>
<ul>
<li><strong>Feature Extractor Ablation</strong>: Compared InceptionV3 vs. EfficientNet-B3 (and B7) on a 1-million molecule subset to determine the optimal image encoder.</li>
<li><strong>Architecture Comparison</strong>: Benchmarked the Encoder-Decoder (CNN+RNN) against the Transformer model using Tanimoto similarity metrics. The structural similarity between predicted and ground truth molecules was measured via Tanimoto similarity over molecular fingerprints:
$$ T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|^2 + |\mathbf{B}|^2 - \mathbf{A} \cdot \mathbf{B}} $$</li>
<li><strong>Data Scaling</strong>: Evaluated performance across increasing training set sizes (1M, 10M, 15M, 35M) to observe scaling laws.</li>
<li><strong>Stereochemistry &amp; Ions</strong>: Tested the model&rsquo;s ability to handle complex stereochemical information and charged groups (ions), creating separate datasets for these tasks.</li>
<li><strong>Augmentation Robustness</strong>: Evaluated the model on augmented images (blur, noise, varying contrast) to simulate real-world scanned document conditions.</li>
</ul>
<h2 id="results-and-scaling-observations">Results and Scaling Observations</h2>
<ul>
<li><strong>Architecture Comparison</strong>: The Transformer model with EfficientNet-B3 features outperformed the Encoder-Decoder baseline by a wide margin. On the 1M dataset, the Transformer achieved <strong>74.57%</strong> exact matches (Tanimoto 1.0) compared to only <strong>7.03%</strong> for the Encoder-Decoder (Table 4 in the paper).</li>
<li><strong>High Accuracy at Scale</strong>: With the full 35-million molecule training set (Dataset 1), the model achieved a <strong>Tanimoto 1.0 score of 96.47%</strong> and an average Tanimoto similarity of 0.99.</li>
<li><strong>Isomorphism</strong>: 99.75% of predictions with a Tanimoto score of 1.0 were confirmed to be structurally isomorphic to the ground truth (checked via <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>).</li>
<li><strong>Stereochemistry Costs</strong>: Including stereochemistry and ions increased the token count and difficulty, resulting in slightly lower accuracy (~89.87% exact match on Dataset 2).</li>
<li><strong>Hardware Efficiency</strong>: Training on TPUs (v3-8) was ~4x faster than Nvidia V100 GPUs. For the 1M molecule model, convergence took ~8h 41min on TPU v3-8 vs ~29h 48min on V100 GPU. The largest model (35M) took less than 14 days on TPU.</li>
<li><strong>Augmentation Robustness (Dataset 3)</strong>: When trained on augmented images and tested on non-augmented images, the model achieved 86.43% Tanimoto 1.0. Using a pre-trained model from Dataset 2 and refitting on augmented images improved this to 88.04% on non-augmented test images and 80.87% on augmented test images, retaining above 97% isomorphism rates.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors generated synthetic data from PubChem.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 1 (Clean)</td>
          <td>39M total (35M train)</td>
          <td>No stereo/ions. Filtered for MW &lt; 1500, bond count 3-40, SMILES len &lt; 40.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 2 (Complex)</td>
          <td>37M total (33M train)</td>
          <td>Includes stereochemistry and charged groups (ions).</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Dataset 3 (Augmented)</td>
          <td>37M total (33M train)</td>
          <td>Dataset 2 with image augmentations applied.</td>
      </tr>
      <tr>
          <td><strong>Preprocessing</strong></td>
          <td>N/A</td>
          <td>N/A</td>
          <td>Molecules converted to <strong>SELFIES</strong>. Images generated via CDK Structure Diagram Generator (SDG) as $299 \times 299$ 8-bit PNGs.</td>
      </tr>
      <tr>
          <td><strong>Format</strong></td>
          <td>TFRecords</td>
          <td>75 MB chunks</td>
          <td>128 Data points (image vector + tokenized string) per record.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Text Representation</strong>: <strong>SELFIES</strong> used to avoid invalid intermediate strings. Tokenized via Keras tokenizer.
<ul>
<li><em>Dataset 1 Tokens</em>: 27 unique tokens. Max length 47.</li>
<li><em>Dataset 2/3 Tokens</em>: 61 unique tokens (due to stereo/ion tokens).</li>
</ul>
</li>
<li><strong>Augmentation</strong>: Implemented using <code>imgaug</code> python package. Random application of:
<ul>
<li>Gaussian/Average Blur, Additive Gaussian Noise, Salt &amp; Pepper, Coarse Dropout, Gamma Contrast, Sharpen, Brightness.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler (following the &ldquo;Attention is all you need&rdquo; paper).</li>
</ul>
<h3 id="models">Models</h3>
<p>The final architecture is an <strong>Image-to-SELFIES Transformer</strong>.</p>
<ul>
<li><strong>Encoder (Feature Extractor)</strong>:
<ul>
<li><strong>EfficientNet-B3</strong> (pre-trained on Noisy-student).</li>
<li>Input: $299 \times 299 \times 3$ images (normalized -1 to 1).</li>
<li>Output Feature Vector: $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder (Transformer)</strong>:
<ul>
<li>4 Encoder-Decoder layers.</li>
<li>8 Parallel Attention Heads.</li>
<li>Dimension size: 512.</li>
<li>Feed-forward size: 2048.</li>
<li>Dropout: 0.1.</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a held-out test set (10% of total data) selected via RDKit MaxMin algorithm for diversity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Tanimoto 1.0</strong></td>
          <td><strong>96.47%</strong></td>
          <td>74.57% (1M subset)</td>
          <td>Percentage of predictions with perfect fingerprint match (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Avg Tanimoto</strong></td>
          <td><strong>0.9923</strong></td>
          <td>0.9371 (1M subset)</td>
          <td>Average similarity score (Dataset 1, 35M training).</td>
      </tr>
      <tr>
          <td><strong>Isomorphism</strong></td>
          <td><strong>99.75%</strong></td>
          <td>-</td>
          <td>Percentage of Tanimoto 1.0 predictions that are structurally identical (checked via InChI).</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware</strong>: TPU v3-8 (Google Cloud). TPU v3-32 was tested but v3-8 was chosen for cost-effectiveness.</li>
<li><strong>Comparison Hardware</strong>: Nvidia Tesla V100 (32GB GPU).</li>
<li><strong>Performance</strong>:
<ul>
<li>TPU v3-8 was ~4x faster than V100 GPU.</li>
<li>1 Million molecule model convergence: 8h 41min on TPU vs ~29h 48min on GPU.</li>
<li>Largest model (35M) took less than 14 days on TPU.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<p>The paper is open-access, and both code and data are publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">DECIMER-TPU (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation using TensorFlow and TPU training</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archival snapshot of the codebase</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>SMILES data used for training (images generated via CDK SDG)</td>
      </tr>
      <tr>
          <td><a href="https://decimer.ai/">DECIMER Project Page</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Project landing page</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Hardware Requirements</strong>: Training requires TPU v3-8 (Google Cloud) or Nvidia V100 GPU. The largest model (35M molecules) took less than 14 days on TPU v3-8.</li>
<li><strong>Missing Components</strong>: Augmentation parameters are documented in the paper (Table 14). Pre-trained model weights are available through the GitHub repository.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A. &amp; Steinbeck, C. (2021). DECIMER 1.0: deep learning for chemical image recognition using transformers. <em>Journal of Cheminformatics</em>, 13(1), 61. <a href="https://doi.org/10.1186/s13321-021-00538-8">https://doi.org/10.1186/s13321-021-00538-8</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Kohulan/DECIMER-Image_Transformer">GitHub Repository</a></li>
<li><a href="https://decimer.ai/">DECIMER Project Page</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4730515">Code Archive (Zenodo)</a></li>
<li><a href="https://doi.org/10.5281/zenodo.4766251">Training Data (Zenodo)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanDECIMER10Deep2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{DECIMER 1.0: Deep Learning for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{DECIMER 1.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = <span style="color:#e6db74">{aug}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{61}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00538-8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1186/s13321-021-00538-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>IMG2SMI: Translating Molecular Structure Images to SMILES</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/</guid><description>Campos &amp; Ji's method for converting 2D molecular images to SMILES strings using Transformers and SELFIES representation.</description><content:encoded><![CDATA[<h2 id="contributions--taxonomy">Contributions &amp; Taxonomy</h2>
<p>This is both a <strong>Method</strong> and <strong>Resource</strong> paper:</p>
<ul>
<li><strong>Method</strong>: It adapts standard image captioning architectures (encoder-decoder) to the domain of Optical Chemical Structure Recognition (OCSR), treating molecule recognition as a translation task.</li>
<li><strong>Resource</strong>: It introduces <strong>MOLCAP</strong>, a large-scale dataset of 81 million molecules aggregated from public chemical databases, addressing the data scarcity that previously hindered deep learning approaches to OCSR.</li>
</ul>
<h2 id="the-bottleneck-in-chemical-literature-translation">The Bottleneck in Chemical Literature Translation</h2>
<p>Chemical literature is &ldquo;full of recipes written in a language computers cannot understand&rdquo; because molecules are depicted as 2D images. This creates a fundamental bottleneck:</p>
<ul>
<li><strong>The Problem</strong>: Chemists must manually redraw molecular structures to search for related compounds or reactions. This is slow, error-prone, and makes large-scale literature mining impossible.</li>
<li><strong>Existing Tools</strong>: Legacy systems like OSRA (Optical Structure Recognition Application) rely on handcrafted rules and often require human correction, making them unfit for unsupervised, high-throughput processing.</li>
<li><strong>The Goal</strong>: An automated system that can translate structure images directly to machine-readable strings (SMILES/SELFIES) without human supervision, enabling large-scale knowledge extraction from decades of chemistry literature and patents.</li>
</ul>
<h2 id="core-innovation-selfies-and-image-captioning">Core Innovation: SELFIES and Image Captioning</h2>
<p>The core novelty is demonstrating that <strong>how you represent the output text is as important as the model architecture itself</strong>. Key contributions:</p>
<ol>
<li>
<p><strong>Image Captioning Framework</strong>: Applies modern encoder-decoder architectures (ResNet-101 + Transformer) to OCSR, treating it as an image-to-text translation problem with a standard cross-entropy loss objective over the generation sequence:
$$ \mathcal{L} = -\sum\limits_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x) $$</p>
</li>
<li>
<p><strong>SELFIES as Target Representation</strong>: The key mechanism relies on using <strong>SELFIES</strong> (Self-Referencing Embedded Strings) as the output format. SELFIES is based on a formal grammar where every possible string corresponds to a valid molecule, eliminating the syntactic invalidity problems (unmatched parentheses, invalid characters) that plague SMILES generation.</p>
</li>
<li>
<p><strong>MOLCAP Dataset</strong>: Created a comprehensive dataset of 81 million unique molecules from PubChem, ChEMBL, <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, and other sources. Generated 256x256 pixel images using RDKit for 1 million training samples and 5,000 validation samples.</p>
</li>
<li>
<p><strong>Task-Specific Evaluation</strong>: Demonstrated that traditional NLP metrics (BLEU) are poor indicators of scientific utility. Introduced evaluation based on <strong>molecular fingerprints</strong> (MACCS, RDK, Morgan) and <strong>Tanimoto similarity</strong>:
$$ T(a, b) = \frac{c}{a + b - c} $$
where $c$ is the number of common fingerprint bits, and $a$ and $b$ are the number of set bits in each respective molecule&rsquo;s fingerprint. This formulation reliably measures functional chemical similarity.</p>
</li>
</ol>
<h2 id="experimental-setup-and-ablation-studies">Experimental Setup and Ablation Studies</h2>
<p>The evaluation focused on comparing IMG2SMI to existing systems and identifying which design choices matter most:</p>
<ol>
<li>
<p><strong>Baseline Comparisons</strong>: Benchmarked against OSRA (rule-based system) and DECIMER (first deep learning approach) on the MOLCAP dataset to establish whether modern architectures could surpass traditional methods.</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Extensive ablations isolating key factors:</p>
<ul>
<li><strong>Decoder Architecture</strong>: Transformer vs. RNN/LSTM decoders</li>
<li><strong>Encoder Fine-tuning</strong>: Fine-tuned vs. frozen pre-trained ResNet weights</li>
<li><strong>Output Representation</strong>: SELFIES vs. character-level SMILES vs. BPE-tokenized SMILES (the most critical ablation)</li>
</ul>
</li>
</ol>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>MACCS FTS</th>
          <th>Valid Captions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN + Fixed Encoder</td>
          <td>0.1526</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RNN + Fine-tuned Encoder</td>
          <td>0.4180</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Transformer + Fixed Encoder</td>
          <td>0.7674</td>
          <td>61.1%</td>
      </tr>
      <tr>
          <td>Transformer + Fine-tuned Encoder</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>Character-level SMILES (fine-tuned)</td>
          <td>N/A</td>
          <td>2.1%</td>
      </tr>
      <tr>
          <td>BPE SMILES (2000 vocab, fine-tuned)</td>
          <td>N/A</td>
          <td>20.0%</td>
      </tr>
      <tr>
          <td>SELFIES (fine-tuned)</td>
          <td>0.9475</td>
          <td>99.4%</td>
      </tr>
  </tbody>
</table>
<ol start="3">
<li><strong>Metric Analysis</strong>: Systematic comparison of evaluation metrics including BLEU, ROUGE, Levenshtein distance, exact match accuracy, and molecular fingerprint-based similarity measures.</li>
</ol>
<h2 id="results-findings-and-limitations">Results, Findings, and Limitations</h2>
<p><strong>Performance Gains</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI</th>
          <th>OSRA</th>
          <th>DECIMER</th>
          <th>Random Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>0.0000</td>
          <td>0.3378</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>0.0000</td>
          <td>0.2229</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>0.0000</td>
          <td>0.1081</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>0.0000</td>
          <td>0.0422</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>0.00%</td>
          <td>0.00%</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<ul>
<li>163% improvement over OSRA on MACCS Tanimoto similarity.</li>
<li>Nearly 10x improvement on ROUGE scores (0.6240 vs. 0.0684).</li>
<li>Average Tanimoto similarity exceeds 0.85 (functionally similar molecules even when not exact matches).</li>
</ul>
<p><strong>Key Findings</strong>:</p>
<ul>
<li><strong>SELFIES is Critical</strong>: Using SELFIES yields <strong>99.4% valid molecules</strong>, compared to only ~2% validity for character-level SMILES.</li>
<li><strong>Architecture Matters</strong>: Transformer decoder significantly outperforms RNN/LSTM approaches. Fine-tuning the ResNet encoder (vs. frozen weights) yields substantial performance gains (e.g., MACCS FTS: 0.7674 to 0.9475).</li>
<li><strong>Metric Insights</strong>: BLEU is a poor metric for this task. Molecular fingerprint-based Tanimoto similarity is most informative because it measures functional chemical similarity.</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Low Exact Match</strong>: Only <strong>7.24%</strong> exact matches. The model captures the overarching functional groups and structure but misses fine details like exact double bond placement.</li>
<li><strong>Complexity Bias</strong>: Trained on large molecules (average length &gt;40 tokens), so it performs poorly on very simple structures where OSRA still excels.</li>
</ul>
<p><strong>Conclusion</strong>: The work shows that modern encoder-decoder architectures combined with valid-by-construction molecular representations (SELFIES) can outperform traditional rule-based systems by large margins on fingerprint-based similarity metrics. The system is useful for literature mining where functional similarity matters more than exact matches, though 7.24% exact match accuracy and poor performance on simple molecules indicate clear directions for future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Architecture</strong>: Image captioning system based on DETR (Detection Transformer) framework.</p>
<p><strong>Visual Encoder</strong>:</p>
<ul>
<li><strong>Backbone</strong>: ResNet-101 pre-trained on ImageNet</li>
<li><strong>Feature Extraction</strong>: 4th layer extraction (convolutions only)</li>
<li><strong>Output</strong>: 2048-dimensional dense feature vector</li>
</ul>
<p><strong>Caption Decoder</strong>:</p>
<ul>
<li><strong>Type</strong>: Transformer encoder-decoder</li>
<li><strong>Layers</strong>: 3 stacked encoder layers, 3 stacked decoder layers</li>
<li><strong>Attention Heads</strong>: 8</li>
<li><strong>Hidden Dimensions</strong>: 2048 (feed-forward networks)</li>
<li><strong>Dropout</strong>: 0.1</li>
<li><strong>Layer Normalization</strong>: 1e-12</li>
</ul>
<p><strong>Training Configuration</strong>:</p>
<ul>
<li><strong>Optimizer</strong>: AdamW</li>
<li><strong>Learning Rate</strong>: 5e-5 (selected after sweep from 1e-4 to 1e-6)</li>
<li><strong>Weight Decay</strong>: 1e-4</li>
<li><strong>Batch Size</strong>: 32</li>
<li><strong>Epochs</strong>: 5</li>
<li><strong>Codebase</strong>: Built on open-source DETR implementation</li>
</ul>
<h3 id="data">Data</h3>
<p><strong>MOLCAP Dataset</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total Size</td>
          <td>81,230,291 molecules</td>
          <td>Aggregated from PubChem, ChEMBL, GDB13</td>
      </tr>
      <tr>
          <td>Training Split</td>
          <td>1,000,000 molecules</td>
          <td>Randomly selected unique molecules</td>
      </tr>
      <tr>
          <td>Validation Split</td>
          <td>5,000 molecules</td>
          <td>Randomly selected for evaluation</td>
      </tr>
      <tr>
          <td>Image Resolution</td>
          <td>256x256 pixels</td>
          <td>Generated using RDKit</td>
      </tr>
      <tr>
          <td>Median SELFIES Length</td>
          <td>&gt;45 characters</td>
          <td>More complex than typical benchmarks</td>
      </tr>
      <tr>
          <td>Full Dataset Storage</td>
          <td>~16.24 TB</td>
          <td>Necessitated use of 1M subset</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>None</td>
          <td>No cropping, rotation, or other augmentation</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li>Images generated using RDKit at 256x256 resolution</li>
<li>Molecules converted to canonical representations</li>
<li>SELFIES tokenization for model output</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metrics</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>IMG2SMI Value</th>
          <th>OSRA Baseline</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MACCS FTS</td>
          <td>0.9475</td>
          <td>0.3600</td>
          <td>Fingerprint Tanimoto Similarity (functional groups)</td>
      </tr>
      <tr>
          <td>RDK FTS</td>
          <td>0.9020</td>
          <td>0.2790</td>
          <td>RDKit fingerprint similarity</td>
      </tr>
      <tr>
          <td>Morgan FTS</td>
          <td>0.8707</td>
          <td>0.2677</td>
          <td>Morgan fingerprint similarity (circular)</td>
      </tr>
      <tr>
          <td>ROUGE</td>
          <td>0.6240</td>
          <td>0.0684</td>
          <td>Text overlap metric</td>
      </tr>
      <tr>
          <td>Exact Match</td>
          <td>7.24%</td>
          <td>0.04%</td>
          <td>Structural identity (strict)</td>
      </tr>
      <tr>
          <td>Valid Captions</td>
          <td>99.4%</td>
          <td>65.2%</td>
          <td>Syntactic validity (with SELFIES)</td>
      </tr>
      <tr>
          <td>Levenshtein Distance</td>
          <td>21.13</td>
          <td>32.76</td>
          <td>String edit distance (lower is better)</td>
      </tr>
  </tbody>
</table>
<p><strong>Secondary Metrics</strong> (shown to be less informative for chemical tasks):</p>
<ul>
<li>BLEU, ROUGE (better suited for natural language)</li>
<li>Levenshtein distance (doesn&rsquo;t capture chemical similarity)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Single NVIDIA GeForce RTX 2080 Ti</li>
<li><strong>Training Time</strong>: ~5 hours per epoch, approximately 25 hours total for 5 epochs</li>
<li><strong>Memory</strong>: Sufficient for batch size 32 with ResNet-101 + Transformer architecture</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>The paper mentions releasing both code and the MOLCAP dataset, but no public repository or download link has been confirmed as available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MOLCAP dataset</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>81M molecules; claimed released but no public URL found</td>
      </tr>
      <tr>
          <td>IMG2SMI code</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Built on DETR; claimed released but no public URL found</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Campos, D., &amp; Ji, H. (2021). IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System (No. arXiv:2109.04202). arXiv. <a href="https://doi.org/10.48550/arXiv.2109.04202">https://doi.org/10.48550/arXiv.2109.04202</a></p>
<p><strong>Publication</strong>: arXiv preprint (2021)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.48550/arXiv.2109.04202">Paper on arXiv</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{campos2021img2smi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Campos, Daniel and Ji, Heng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2109.04202}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2109.04202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Invalid SMILES Benefit Chemical Language Models: A Study</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</link><pubDate>Tue, 02 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/invalid-smiles-help/</guid><description>Skinnider (2024) shows that generating invalid SMILES actually improves chemical language model performance through quality filtering.</description><content:encoded><![CDATA[<h2 id="core-contribution-repurposing-invalid-smiles">Core Contribution: Repurposing Invalid SMILES</h2>
<p>This is an <strong>Empirical</strong> paper that challenges a fundamental assumption in the field of chemical language models. Skinnider provides both empirical evidence and mechanistic explanations for why the ability to generate &ldquo;invalid&rdquo; SMILES strings is beneficial for model performance.</p>
<h2 id="the-problem-with-absolute-validity-in-chemical-lms">The Problem with Absolute Validity in Chemical LMs</h2>
<p>Prior research attempted to eliminate invalid generations using constrained representations like SELFIES. This paper demonstrates that invalid outputs serve as low-likelihood samples whose removal acts as an implicit quality filter, improving distribution learning.</p>
<h2 id="invalid-generation-as-an-implicit-quality-filter">Invalid Generation as an Implicit Quality Filter</h2>
<p>The central insight is counterintuitive: <strong>invalid SMILES generation acts as a built-in quality control mechanism</strong>. The key contributions are:</p>
<ol>
<li>
<p><strong>Empirical Evidence</strong>: Direct comparisons showing that SMILES-based models consistently outperform SELFIES-based models across multiple metrics, with performance gains strongly correlated with the proportion of invalid outputs generated.</p>
</li>
<li>
<p><strong>Mechanistic Explanation</strong>: Invalid SMILES are demonstrated to be low-likelihood samples from the model&rsquo;s probability distribution. When these are filtered out, it&rsquo;s equivalent to removing the model&rsquo;s least confident predictions, a form of automatic quality control.</p>
</li>
<li>
<p><strong>Causal Evidence</strong>: By modifying SELFIES to allow invalid generation (through relaxed constraints), the author shows that performance improves when models can generate and discard invalid outputs, directly proving the causal relationship.</p>
</li>
<li>
<p><strong>Bias Analysis</strong>: SELFIES models are shown to introduce systematic structural biases (fewer aromatic rings, more aliphatic rings) due to their validity constraints, limiting their ability to explore chemical space naturally.</p>
</li>
</ol>
<h2 id="experimental-design-and-causal-interventions">Experimental Design and Causal Interventions</h2>
<p>The paper uses a multi-pronged approach to establish both correlation and causation:</p>
<p><strong>Performance Comparisons</strong>: SMILES and SELFIES models were trained on identical datasets and evaluated using distribution-learning metrics like Fréchet ChemNet distance. The comparison was robust across different architectures, training set sizes, and chemical databases.</p>
<p><strong>Loss Analysis</strong>: The relationship between SMILES validity and model confidence was examined by analyzing the sequence loss. For a given SMILES string $S$ composed of tokens $t_1, t_2, &hellip;, t_N$, the negative log-likelihood acts as a proxy for the model&rsquo;s uncertainty:</p>
<p>$$ \text{NLL}(S) = -\sum_{i=1}^N \log P(t_i | t_1, &hellip;, t_{i-1}) $$</p>
<p>Invalid SMILES strings consistently register higher $\text{NLL}$ scores, meaning they represent the model&rsquo;s least confident predictions. Filtering them effectively acts as automatic quality control, providing the mechanistic explanation for why invalid filtering improves performance.</p>
<p><strong>Causal Intervention</strong>: A key experiment involved modifying the SELFIES valency constraints at two levels: first allowing pentavalent carbons (&ldquo;Texas SELFIES&rdquo;), then removing all constraints entirely (&ldquo;unconstrained SELFIES&rdquo;). This allowed direct testing of whether the ability to generate invalid outputs (which are then discarded) causally improves performance.</p>
<p><strong>Structural Bias Analysis</strong>: Generated molecules were analyzed for chemical features like ring types and bond patterns to quantify how validity constraints systematically distort the model&rsquo;s exploration of chemical space.</p>
<p><strong>Generalization Testing</strong>: Models were trained on subsets of chemical databases and tested on their ability to reproduce the broader chemical space, measuring how validity constraints affect generalization.</p>
<p><strong>Practical Application</strong>: The approach was tested on structure elucidation, using models to identify unknown molecules from minimal experimental data like mass spectrometry.</p>
<h2 id="key-findings-on-validity-constraints-and-bias">Key Findings on Validity Constraints and Bias</h2>
<p><strong>Superior Performance Across the Board</strong>: SMILES-based models consistently outperformed SELFIES models on distribution-learning tasks. Using metrics like Fréchet ChemNet distance, SMILES models generated molecules that more closely matched the statistical properties of their training data. This performance advantage was directly correlated with the proportion of invalid SMILES generated. Models that produced more invalid outputs performed better after filtering.</p>
<p><strong>Invalid SMILES Are Low-Confidence Predictions</strong>: The analysis revealed that invalid SMILES consistently have higher loss values than valid ones, meaning they represent the model&rsquo;s least confident predictions. This suggests that validity checking acts as an automatic confidence filter, removing low-quality samples without requiring explicit uncertainty estimation.</p>
<p><strong>Causal Evidence Through Unconstrained SELFIES</strong>: Direct causal evidence came from modifying SELFIES to allow invalid generation. When &ldquo;unconstrained SELFIES&rdquo; models could generate and discard invalid molecules, their performance improved, approaching that of SMILES models. This provides direct causal evidence that the ability to generate invalid outputs is what drives the performance gains.</p>
<p><strong>Validity Constraints Introduce Systematic Bias</strong>: SELFIES models showed clear structural biases compared to both training data and SMILES outputs. They generated fewer aromatic rings and more aliphatic structures, systematic distortions caused by the valency constraints used to ensure validity. These biases limit the model&rsquo;s ability to faithfully represent chemical space.</p>
<p><strong>Reduced Generalization</strong>: When trained on subsets of chemical databases, SMILES models could reproduce a larger portion of the complete chemical space compared to SELFIES models. Although SELFIES generated more valid molecules in absolute terms, their structural biases constrained exploration and limited generalization beyond the training set.</p>
<p><strong>Real-World Application Benefits</strong>: In structure elucidation tasks, identifying unknown molecules from experimental data like mass spectrometry, SMILES-based models significantly outperformed SELFIES models. This demonstrates that the benefits extend beyond academic benchmarks to practical applications.</p>
<p><strong>CASMI 2022 Benchmark</strong>: The language model trained on the LOTUS database was benchmarked against 19 submissions to the CASMI 2022 competition for structure elucidation of unknown compounds. Using only accurate mass as input (no MS/MS data), the model achieved competitive performance, highlighting the practical utility of the sampling-frequency-based approach for de novo structure elucidation.</p>
<p><strong>Computational Efficiency</strong>: Filtering invalid SMILES is computationally trivial. Parsing ten million SMILES strings with RDKit takes approximately 7.5 minutes on a single CPU, making the post-processing overhead negligible compared to model training and inference costs.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="models">Models</h3>
<p><strong>Primary Architecture (LSTM):</strong> The main results rely on a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units.</p>
<ul>
<li><strong>Structure:</strong> Three-layer LSTM with a hidden layer size of 1,024 dimensions</li>
<li><strong>Embedding:</strong> An embedding layer of 128 dimensions</li>
<li><strong>Decoder:</strong> A linear decoder layer outputs token probabilities</li>
</ul>
<p><strong>Secondary Architecture (Transformer/GPT):</strong> To confirm robustness across architectures, the author also used a Generative Pretrained Transformer (GPT) architecture adapted from MolGPT.</p>
<ul>
<li><strong>Structure:</strong> Eight transformer blocks</li>
<li><strong>Internals:</strong> Each block contains eight masked self-attention heads and a feed-forward network (1,024 dimensions) using GELU activation</li>
<li><strong>Embedding:</strong> 256 dimensions, concatenated with learned positional encodings</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Optimizer:</strong> Adam optimizer for both architectures with $\beta_1=0.9$ and $\beta_2=0.999$.</p>
<p><strong>Learning Rate:</strong></p>
<ul>
<li>LSTM: 0.001</li>
<li>Transformer: 0.0005</li>
</ul>
<p><strong>Batch Size:</strong> 64</p>
<p><strong>Loss Function:</strong> Cross-entropy loss of next-token prediction.</p>
<p><strong>Stopping Criteria:</strong> Early stopping using a validation set (10% of training data) with patience of 50,000 minibatches.</p>
<h3 id="data">Data</h3>
<p><strong>Primary Source:</strong> ChEMBL database (version 28).</p>
<p><strong>Preprocessing Pipeline:</strong></p>
<ul>
<li><strong>Cleaning:</strong> Removal of duplicate SMILES, salts, and solvents (retaining heavy fragments with $\geq 3$ heavy atoms)</li>
<li><strong>Filtering:</strong> Molecules with atoms other than {Br, C, Cl, F, H, I, N, O, P, S} were removed</li>
<li><strong>Normalization:</strong> Charged molecules were neutralized and converted to canonical SMILES</li>
</ul>
<p><strong>Training Subsets:</strong> Models were trained on random samples of 30,000, 100,000, and 300,000 molecules to test scalability.</p>
<p><strong>Generalization Data:</strong> To test generalization, models were also trained on the <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> database (enumerating drug-like molecules up to 13 heavy atoms).</p>
<p><strong>Structure Elucidation Data:</strong> For practical application tasks, models were trained on natural products (LOTUS, COCONUT), food compounds (FooDB), and environmental contaminants (NORMAN).</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Primary Metric:</strong> Fréchet ChemNet Distance (FCD), measuring chemical similarity between generated molecules and the training set (lower is better).</p>
<p><strong>Secondary Metrics:</strong></p>
<ul>
<li><strong>Validity:</strong> Percentage of outputs parseable by RDKit</li>
<li><strong>Scaffold Similarity:</strong> Jensen-Shannon distances between Murcko scaffold compositions</li>
<li><strong>Physical Properties:</strong> Comparisons of molecular weight, LogP, topological polar surface area (TPSA), and ring counts (aromatic vs. aliphatic)</li>
<li><strong>Structure Elucidation:</strong> &ldquo;Top-k accuracy,&rdquo; the proportion of held-out molecules where the correct structure appeared in the model&rsquo;s top $k$ ranked outputs</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute Nodes:</strong> Dell EMC C4140 GPU compute nodes</li>
<li><strong>GPUs:</strong> NVIDIA Tesla V100</li>
<li><strong>Compute Time:</strong> Parsing 10 million SMILES took ~7.5 minutes on a single CPU; SELFIES models required an average of 0.6 hours longer to train than SMILES models</li>
</ul>
<h3 id="replicability">Replicability</h3>
<p><strong>Code Availability:</strong> Source code and intermediate data are available via <a href="https://doi.org/10.5281/zenodo.10680855">Zenodo</a>. Pre-trained model weights are not provided in the archive, requiring researchers to train models from scratch using the included scripts to fully replicate the study.</p>
<p><strong>Data Availability:</strong> Training datasets and generated molecule samples (10 million from ChEMBL/GDB-13 models, 100 million from LOTUS/COCONUT/FooDB/NORMAN cross-validation folds) are available via <a href="https://doi.org/10.5281/zenodo.8321735">Zenodo</a>.</p>
<p><strong>Software Libraries:</strong></p>
<ul>
<li><strong>PyTorch:</strong> LSTM and Transformer implementations</li>
<li><strong>RDKit:</strong> SMILES parsing, validity checking, and property calculation</li>
<li><strong>SELFIES:</strong> Version 2.1.1 for conversion</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10680855">Source code (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training scripts, analysis code, and intermediate data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.8321735">Training and generated molecules (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Preprocessed training sets and sampled molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="implications-and-takeaways">Implications and Takeaways</h2>
<p>This work reframes how we think about &ldquo;errors&rdquo; in generative models. The key insight is that model outputs appearing incorrect often represent low-likelihood samples whose removal improves overall performance.</p>
<p>The findings suggest that the field&rsquo;s drive toward guaranteed validity leads to systematic biases. Letting models fail informatively and using those failures as quality signals can yield better distribution learning. This is relevant as the field moves toward larger, more capable models where such self-correction mechanisms become increasingly valuable.</p>
<p>For practitioners, the takeaway is to consider the role of invalid outputs before eliminating them. Filtering low-confidence generations provides automatic quality control that improves final results.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Skinnider, M. A. (2024). Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Machine Intelligence, 6(4), 437-448. <a href="https://doi.org/10.1038/s42256-024-00821-x">https://doi.org/10.1038/s42256-024-00821-x</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence (2024)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{skinnider2024invalid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Invalid SMILES are beneficial rather than detrimental to chemical language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Skinnider, Michael A}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{437--448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group UK London}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>αExtractor: Chemical Info from Biomedical Literature</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</link><pubDate>Sat, 11 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/image-to-sequence/alpha-extractor/</guid><description>αExtractor uses ResNet-Transformer to extract chemical structures from literature images, including noisy and hand-drawn molecules.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-a-robust-optical-recognition-system">Methodological Contribution: A Robust Optical Recognition System</h2>
<p>This is primarily a <strong>Method</strong> ($\Psi_{\text{Method}}$) paper with a significant secondary <strong>Resource</strong> ($\Psi_{\text{Resource}}$) contribution (see the <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a> for more on these categories).</p>
<p>The dominant methodological contribution is the ResNet-Transformer recognition architecture that outperforms existing OCSR tools across multiple benchmarks through robustness engineering. It specifically focuses on training on 20 million synthetic images with aggressive augmentation to handle degraded image conditions. The work answers the core methodological question &ldquo;How well does this work?&rdquo; through extensive benchmarking against existing OCSR tools and ablation studies validating architectural choices.</p>
<p>The secondary resource contribution comes from releasing αExtractor as a freely available web service, correcting labeling errors in standard benchmarks (CLEF, UOB, JPO), and providing an end-to-end document processing pipeline for biomedical literature mining.</p>
<h2 id="motivation-extracting-visual-chemical-knowledge-from-biomedical-literature">Motivation: Extracting Visual Chemical Knowledge from Biomedical Literature</h2>
<p>The motivation addresses a familiar pain point in chemical informatics within a biomedical context. Vast amounts of chemical knowledge in biomedical literature exist only as images, such as molecular structures embedded in figures, chemical synthesis schemes, and compound diagrams. This visual knowledge remains effectively invisible to computational methods, which creates a massive bottleneck for drug discovery research, systematic reviews, and large-scale chemical database construction.</p>
<p>Existing OCSR tools face two critical problems when applied to biomedical literature:</p>
<ol>
<li>
<p><strong>Real-world image quality</strong>: Biomedical papers often contain low-resolution figures, images with complex backgrounds, noise from scanning/digitization, and inconsistent drawing styles across different journals and decades of publications.</p>
</li>
<li>
<p><strong>End-to-end extraction</strong>: Most OCSR systems assume the presence of clean, cropped molecular images. In practice, you need to first find the molecular structures within multi-panel figures, reaction schemes, and dense document layouts before you can recognize them.</p>
</li>
</ol>
<p>The authors argue that a practical literature mining system needs to solve both problems simultaneously via robust recognition under noisy conditions and automated detection of molecular images within complex documents.</p>
<h2 id="core-innovation-robust-resnet-transformer-architecture">Core Innovation: Robust ResNet-Transformer Architecture</h2>
<p>The core innovation lies in combining a competition-winning recognition architecture with extensive robustness engineering and end-to-end document processing. The key contributions include:</p>
<ol>
<li>
<p><strong>ResNet-Transformer Recognition Model</strong>: The core recognition system uses a <strong>Residual Neural Network (ResNet)</strong> encoder paired with a <strong>Transformer decoder</strong> in an image-captioning framework. This architecture won first place in a Kaggle molecular translation competition, which provided a strong foundation for the recognition task. Let the input image be $I$. The model maximizes the joint likelihood of the SMILES tokens $T$ and coordinate sequences $X, Y$:
$$
\begin{aligned}
\mathcal{L}_{\text{total}} = - \sum_{i=1}^{L} \log P(T_i \mid I, T_{&lt;i}) - \lambda \sum_{i=1}^{L} \big(\log P(X_i \mid I, X_{&lt;i}) + \log P(Y_i \mid I, Y_{&lt;i})\big)
\end{aligned}
$$
Here, continuous $X$ and $Y$ atom coordinates are mapped strictly to 200 discrete bins to formulate the coordinate prediction as a standard classification task alongside SMILES generation.</p>
</li>
<li>
<p><strong>Enhanced Molecular Representation</strong>: The model produces an augmented representation that encompasses:</p>
<ul>
<li>Standard molecular connectivity information</li>
<li><strong>Bond type tokens</strong> (solid wedge bonds, dashed bonds, etc.) that preserve 3D stereochemical information</li>
<li><strong>Atom coordinate predictions</strong> that allow reconstruction of the exact molecular pose from the original image</li>
</ul>
<p>This dual prediction of discrete structure and continuous coordinates makes the output strictly faithful to the source material and enables better quality assessment.</p>
</li>
<li>
<p><strong>Massive Synthetic Training Dataset</strong>: The model was trained on approximately <strong>20 million synthetic molecular images</strong> generated from PubChem SMILES with aggressive data augmentation. The augmentation strategy randomized visual styles, image quality, and rendering parameters to create maximum diversity, ensuring the network rarely saw the same molecular depiction twice. This forces the model to learn robust, style-invariant features.</p>
</li>
<li>
<p><strong>End-to-End Document Processing Pipeline</strong>: αExtractor integrates <strong>object detection</strong> and <strong>structure recognition</strong> into a complete document mining system:</p>
<ul>
<li>An object detection model automatically locates molecular images within PDF documents</li>
<li>The recognition model converts detected images to structured representations</li>
<li>A web service interface makes the entire pipeline accessible to researchers without machine learning expertise</li>
</ul>
</li>
<li>
<p><strong>Robustness-First Design</strong>: The system was explicitly designed to handle degraded image conditions that break traditional OCSR tools, including low resolution, background interference, color variations, and scanning artifacts commonly found in legacy biomedical literature.</p>
</li>
</ol>
<h2 id="experimental-methodology-stress-testing-under-real-world-conditions">Experimental Methodology: Stress Testing under Real-World Conditions</h2>
<p>The evaluation focused on demonstrating robust performance across diverse image conditions, from pristine benchmarks to challenging real-world scenarios:</p>
<ol>
<li>
<p><strong>Benchmark Dataset Evaluation</strong>: αExtractor was tested on four standard OCSR benchmarks:</p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
</ul>
<p>Performance was measured using exact SMILES match accuracy.</p>
</li>
<li>
<p><strong>Error Analysis and Dataset Correction</strong>: During evaluation, the researchers discovered numerous labeling errors in the original benchmark datasets. They systematically identified and corrected these errors, then re-evaluated all methods on the cleaned datasets to get more accurate performance measurements.</p>
</li>
<li>
<p><strong>Robustness Stress Testing</strong>: The system was evaluated on two challenging datasets specifically designed to test robustness:</p>
<ul>
<li><strong>Color background images</strong> (200 samples): Molecular structures on complex, colorful backgrounds that simulate real figure conditions</li>
<li><strong>Low-quality images</strong> (200 samples): Degraded images with noise, blur, and artifacts typical of scanned documents</li>
</ul>
<p>These tests compared αExtractor against three open-source tools (OSRA, Molvel, and Imago) under realistic degradation conditions.</p>
</li>
<li>
<p><strong>Generalization Testing</strong>: In the most challenging experiment, αExtractor was tested on the <strong>DECIMER hand-drawn molecule images dataset</strong> (Brinkhaus et al., 2022), representing a completely different visual domain not represented in the training data. This tested whether the learned features could generalize beyond digital rendering styles to human-drawn chemistry.</p>
</li>
<li>
<p><strong>End-to-End Document Extraction</strong>: The complete pipeline was evaluated on 50 PDF files containing 2,336 molecular images. This tested both the object detection component (finding molecules in complex documents) and the recognition component (converting them to SMILES) in a realistic literature mining scenario.</p>
</li>
<li>
<p><strong>Speed Benchmarking</strong>: Inference time was measured to demonstrate the practical efficiency needed for large-scale document processing.</p>
</li>
</ol>
<h2 id="results--conclusions-strong-performance-on-degraded-images">Results &amp; Conclusions: Strong Performance on Degraded Images</h2>
<ul>
<li>
<p><strong>Substantial Accuracy Gains</strong>: On the four benchmark datasets, αExtractor achieved accuracies of 91.83% (CLEF), 98.47% (UOB), 88.67% (JPO), and 93.64% (USPTO), compared to previous best results of 84.6%, 90.0%, 72.2%, and 89.9% respectively. After correcting dataset labeling errors, the true accuracies were even higher, reaching <strong>95.77% on CLEF, 99.86% on UOB, and 92.44% on JPO</strong>.</p>
</li>
<li>
<p><strong>Robustness on Degraded Images</strong>: Open-source competitors struggled on degraded images (achieving 5.5% accuracy at best). αExtractor maintained <strong>over 90% accuracy</strong> on both color background and low-quality image datasets, demonstrating the effectiveness of the synthetic training strategy.</p>
</li>
<li>
<p><strong>Generalization to Hand-Drawn Molecules</strong>: On hand-drawn molecules, a domain completely absent from training data, αExtractor achieved <strong>61.4% accuracy</strong> while other tools scored between 0.69% and 2.93%. This suggests the model learned genuinely chemical features rather than style-specific patterns.</p>
</li>
<li>
<p><strong>Practical End-to-End Performance</strong>: In the complete document processing evaluation, αExtractor detected <strong>95.1% of molecular images</strong> (2,221 out of 2,336) and correctly recognized <strong>94.5% of detected structures</strong> (2,098 correct predictions). This demonstrates the system&rsquo;s readiness for real-world literature mining applications.</p>
</li>
<li>
<p><strong>Ablation Results</strong>: Ablation experiments confirmed that each architectural component (ResNet backbone, Transformer encoder, Transformer decoder) contributes to performance, with the Transformer decoder having the largest impact. Replacing the Transformer decoder with an LSTM decoder substantially reduced accuracy (Table S6 in the paper).</p>
</li>
<li>
<p><strong>Dataset Quality Issues</strong>: The systematic discovery of labeling errors in standard benchmarks highlights a broader problem in OCSR evaluation. The corrected datasets provide more reliable baselines for future method development.</p>
</li>
<li>
<p><strong>Spatial Layout Limitation</strong>: αExtractor correctly identifies molecular connectivity, but the re-rendered structures may have different spatial layouts than the originals. This could complicate visual verification for complex molecules, even if the chemical information remains accurate.</p>
</li>
<li>
<p><strong>Non-Standard Depiction Handling</strong>: For images with non-standard bond depictions or atomic valences, αExtractor correctly identifies and normalizes them to standard representations. While chemically accurate, this means the re-rendered structure may visually differ from the original image.</p>
</li>
</ul>
<p>Overall, αExtractor combines accurate recognition (over 90% on degraded images), end-to-end document processing, and strong generalization across image conditions. It targets large-scale literature mining tasks where previous tools struggled with degraded inputs. The focus on real-world robustness over benchmark optimization reflects a practical approach to deploying machine learning in scientific workflows.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This paper is <strong>Partially Reproducible</strong>. While the authors detail the model architectures and training techniques, the source code, training dataset (20M synthetic images), and pre-trained weights remain closed-source and proprietary. The authors released a sample of their test data and host an online web server for running inference.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/CLEF_corrected">Corrected CLEF Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the CLEF benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/UOB_corrected">Corrected UOB Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the UOB benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/JPO_corrected">Corrected JPO Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Authors&rsquo; corrected version of the JPO benchmark.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Colored_Background">Color Background Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of molecular structures on complex, colorful backgrounds.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/Low_Quality">Low Quality Dataset</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">200 samples of degraded images with noise, blur, and artifacts.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/jiachengxiong/alpha-Extractor/tree/main/PDF">PDF Test Set</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Sample PDF files for end-to-end document extraction evaluation.</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://extractor.alphama.com.cn/csr">αExtractor Web Server</a></td>
          <td style="text-align: left">Other</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Online service for running inference using the proprietary system.</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Backbone:</strong> ResNet50 producing output of shape $2048 \times 19 \times 19$, projected to 512 channels via a feed-forward layer</li>
<li><strong>Transformer Architecture:</strong> 3 encoder layers and 3 decoder layers with hidden dimension of 512</li>
<li><strong>Output Format:</strong> Generates SMILES tokens plus two auxiliary coordinate sequences (X-axis and Y-axis) that are length-aligned with the SMILES tokens via padding</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Architecture:</strong> DETR (Detection Transformer) with ResNet101 backbone</li>
<li><strong>Transformer Architecture:</strong> 6 encoder layers and 6 decoder layers with hidden dimension of 256</li>
<li><strong>Purpose:</strong> Locates molecular images within PDF pages before recognition</li>
</ul>
<p><strong>Coordinate Prediction:</strong></p>
<ul>
<li>Continuous X/Y coordinates are discretized into <strong>200 discrete bins</strong></li>
<li>Padding tokens added to coordinate sequences to align perfectly with SMILES token sequence, enabling simultaneous structure and pose prediction</li>
</ul>
<h3 id="data">Data</h3>
<p><strong>Training Data:</strong></p>
<ul>
<li><strong>Synthetic Generation:</strong> Python script rendering PubChem SMILES into 2D images</li>
<li><strong>Dataset Size:</strong> Approximately 20.3 million synthetic molecular images from PubChem</li>
<li><strong>Superatom Handling:</strong> 50% of molecules had functional groups replaced with superatoms (e.g., &ldquo;COOH&rdquo;) or generic labels (R1, X1) to match literature drawing conventions</li>
<li><strong>Rendering Augmentation:</strong> Randomized bond thickness, bond spacing, font size, font color, and padding size</li>
</ul>
<p><strong>Geometric Augmentation:</strong></p>
<ul>
<li>Shear along x-axis: $\pm 15^\circ$</li>
<li>Rotation: $\pm 15^\circ$</li>
<li>Piecewise affine scaling</li>
</ul>
<p><strong>Noise Injection:</strong></p>
<ul>
<li>Pepper noise: 0-2%</li>
<li>Salt noise: 0-40%</li>
<li>Gaussian noise: scale 0-0.16</li>
</ul>
<p><strong>Destructive Augmentation:</strong></p>
<ul>
<li>JPEG compression: severity levels 2-5</li>
<li>Random masking</li>
</ul>
<p><strong>Evaluation Datasets:</strong></p>
<ul>
<li><strong>CLEF</strong>: Chemical structure recognition challenge dataset</li>
<li><strong>UOB</strong>: University of Birmingham patent images</li>
<li><strong>JPO</strong>: Japan Patent Office molecular diagrams</li>
<li><strong>USPTO</strong>: US Patent and Trademark Office structures</li>
<li><strong>Color background images</strong>: 200 samples</li>
<li><strong>Low-quality images</strong>: 200 samples</li>
<li><strong>Hand-drawn structures</strong>: Test set for generalization</li>
<li><strong>End-to-end document extraction</strong>: 50 PDFs (567 pages, 2,336 molecular images)</li>
</ul>
<h3 id="training">Training</h3>
<p><strong>Image Recognition Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 100</li>
<li><strong>Epochs:</strong> 5</li>
<li><strong>Loss Function:</strong> Cross-entropy loss for both SMILES prediction and coordinate prediction</li>
</ul>
<p><strong>Object Detection Model:</strong></p>
<ul>
<li><strong>Optimizer:</strong> Adam with learning rate of 1e-4</li>
<li><strong>Batch Size:</strong> 24</li>
<li><strong>Training Strategy:</strong> Pre-trained on synthetic &ldquo;Lower Quality&rdquo; data for 5 epochs, then fine-tuned on annotated real &ldquo;High Quality&rdquo; data for 30 epochs</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics:</strong></p>
<ul>
<li><strong>Recognition</strong>: SMILES accuracy (exact match)</li>
<li><strong>End-to-End Pipeline</strong>:
<ul>
<li><strong>Recall</strong>: 95.1% for detection</li>
<li><strong>Accuracy</strong>: 94.5% for recognition</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Inference Hardware:</strong></p>
<ul>
<li>Cloud CPU server (8 CPUs, 64 GB RAM)</li>
<li><strong>Throughput:</strong> Processed 50 PDFs (567 pages) in 40 minutes</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xiong, J., Liu, X., Li, Z., Xiao, H., Wang, G., Niu, Z., Fei, C., Zhong, F., Wang, G., Zhang, W., Fu, Z., Liu, Z., Chen, K., Jiang, H., &amp; Zheng, M. (2023). αExtractor: a system for automatic extraction of chemical information from biomedical literature. <em>Science China Life Sciences</em>, 67(3), 618-621. <a href="https://doi.org/10.1007/s11427-023-2388-x">https://doi.org/10.1007/s11427-023-2388-x</a></p>
<p><strong>Publication</strong>: Science China Life Sciences (2023)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://doi.org/10.1007/s11427-023-2388-x">Paper on Springer</a></li>
</ul>
]]></content:encoded></item><item><title>MolParser: End-to-End Molecular Structure Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/mol-parser/</guid><description>MolParser converts molecular images from scientific documents to machine-readable formats using end-to-end learning with Extended SMILES representation.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, X., Wang, J., Cai, X., Chen, S., Yang, S., Tao, H., Wang, N., Yao, L., Zhang, L., &amp; Ke, G. (2025). MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild. In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em> (pp. 24528-24538). <a href="https://doi.org/10.48550/arXiv.2411.11098">https://doi.org/10.48550/arXiv.2411.11098</a></p>
<p><strong>Publication</strong>: ICCV 2025</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/">MolParser-7M Dataset</a> - 7M+ image-text pairs for OCSR</li>
<li><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M on HuggingFace</a> - Dataset repository</li>
<li><a href="https://huggingface.co/UniParser/MolDet">MolDet YOLO Detector</a> - Object detection model for extracting molecular images from documents</li>
</ul>
<h2 id="contribution-end-to-end-ocsr-and-real-world-resources">Contribution: End-to-End OCSR and Real-World Resources</h2>
<p>This is primarily a <strong>Method</strong> paper (see <a href="/notes/interdisciplinary/research-methods/ai-physical-sciences-paper-taxonomy/">AI and Physical Sciences paper taxonomy</a>), with a significant secondary contribution as a <strong>Resource</strong> paper.</p>
<p><strong>Method contribution ($\Psi_{\text{Method}}$)</strong>: The paper proposes a novel end-to-end architecture combining a Swin Transformer encoder with a BART decoder, and crucially introduces <strong>Extended SMILES (E-SMILES)</strong>, a new syntactic extension to standard SMILES notation that enables representation of Markush structures, abstract rings, and variable attachment points found in patents. The work validates this method through extensive ablation studies, achieving the highest accuracy among tested OCSR systems on WildMol-10k (76.9%).</p>
<p><strong>Resource contribution ($\Psi_{\text{Resource}}$)</strong>: The paper introduces <strong>MolParser-7M</strong>, the largest OCSR dataset to date (7.7M image-text pairs), and <strong>WildMol</strong>, a challenging benchmark of 20,000 manually annotated real-world molecular images. The construction of these datasets through an active learning data engine with human-in-the-loop validation represents significant infrastructure that enables future OCSR research.</p>
<h2 id="motivation-extracting-chemistry-from-real-world-documents">Motivation: Extracting Chemistry from Real-World Documents</h2>
<p>The motivation stems from a practical problem in chemical informatics: vast amounts of chemical knowledge remain embedded in unstructured formats. Patents, research papers, and legacy documents depict molecular structures as images. This creates a barrier for large-scale data analysis and prevents Large Language Models from effectively understanding scientific literature in chemistry and drug discovery.</p>
<p>Existing OCSR methods struggle with real-world documents for two fundamental reasons:</p>
<ol>
<li><strong>Representational limitations</strong>: Standard SMILES notation cannot capture complex structural templates like <strong>Markush structures</strong>, which are ubiquitous in patents. These structures define entire families of compounds using variable R-groups and abstract patterns, making them essential for intellectual property but impossible to represent with conventional methods.</li>
<li><strong>Data distribution mismatch</strong>: Real-world molecular images suffer from noise, inconsistent drawing styles, variable resolution, and interference from surrounding text. Models trained exclusively on clean, synthetically rendered molecules fail to generalize when applied to actual documents.</li>
</ol>
<h2 id="novelty-e-smiles-and-human-in-the-loop-data-engine">Novelty: E-SMILES and Human-in-the-Loop Data Engine</h2>
<p>The novelty lies in a comprehensive system that addresses both representation and data quality challenges through four integrated contributions:</p>
<ol>
<li>
<p><strong>Extended SMILES (E-SMILES)</strong>: A backward-compatible extension to the SMILES format that can represent complex structures previously inexpressible in standard chemical notations. E-SMILES uses a separator token <code>&lt;sep&gt;</code> to delineate the core molecular structure from supplementary annotations. These annotations employ XML-like tags to encode Markush structures, polymers, abstract rings, and other complex patterns. Critically, the core structure remains parseable by standard cheminformatics tools like RDKit, while the supplementary tags provide a structured, LLM-friendly format for capturing edge cases.</p>
</li>
<li>
<p><strong>MolParser-7M Dataset</strong>: The largest publicly available OCSR dataset, containing over 7 million image-text pairs. What distinguishes this dataset is both its scale and its composition. It includes 400,000 &ldquo;in-the-wild&rdquo; samples (molecular images extracted from actual patents and scientific papers) and subsequently curated by human annotators. This real-world data addresses the distribution mismatch problem directly by exposing the model to the same noise, artifacts, and stylistic variations it encounters in production.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Data Engine</strong>: A systematic approach to collecting and annotating real-world training data. The pipeline begins with an object detection model that extracts molecular images from over a million PDF documents. An active learning algorithm then identifies the most informative samples (those where the current model struggles) for human annotation. The model pre-annotates these images, and human experts review and correct them. This creates an iterative improvement cycle: annotate, train, identify new challenging cases, repeat.</p>
</li>
<li>
<p><strong>Efficient End-to-End Architecture</strong>: The model treats OCSR as an image captioning problem. A Swin-Transformer vision encoder extracts visual features, a simple MLP compresses them, and a BART decoder generates the E-SMILES string autoregressively. The model minimizes the standard negative log-likelihood of the target E-SMILES token sequence $y$ given the sequence history and input image $x$:</p>
</li>
</ol>
<p>$$
\begin{aligned}
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x; \theta)
\end{aligned}
$$</p>
<p>The training strategy employs curriculum learning, starting with simple molecules and gradually introducing complexity and heavier data augmentation.</p>
<h2 id="experimental-setup-two-stage-training-and-benchmarking">Experimental Setup: Two-Stage Training and Benchmarking</h2>
<p>The evaluation focused on demonstrating that MolParser generalizes to real-world documents:</p>
<ol>
<li>
<p><strong>Two-Stage Training Protocol</strong>: The model underwent a systematic training process:</p>
<ul>
<li><strong>Pre-training</strong>: Initial training on millions of synthetic molecular images using curriculum learning. The curriculum progresses from simple molecules to complex structures while gradually increasing data augmentation intensity (blur, noise, perspective transforms).</li>
<li><strong>Fine-tuning</strong>: Subsequent training on 400,000 curated real-world samples extracted from patents and papers. This fine-tuning phase is critical for adapting to the noise and stylistic variations of actual documents.</li>
</ul>
</li>
<li>
<p><strong>Benchmark Evaluation</strong>: The model was evaluated on multiple standard OCSR benchmarks to establish baseline performance on clean data. These benchmarks test recognition accuracy on well-formatted molecular diagrams.</p>
</li>
<li>
<p><strong>Real-World Document Analysis</strong>: The critical test involved applying MolParser to molecular structures extracted directly from scientific documents. This evaluation measures the gap between synthetic benchmark performance and real-world applicability (the core problem the paper addresses).</p>
</li>
<li>
<p><strong>Ablation Studies</strong>: Experiments isolating the contribution of each component:</p>
<ul>
<li>The impact of real-world training data versus synthetic-only training</li>
<li>The effectiveness of curriculum learning versus standard training</li>
<li>The value of the human-in-the-loop annotation pipeline versus random sampling</li>
<li>The necessity of E-SMILES extensions for capturing complex structures</li>
</ul>
</li>
</ol>
<h2 id="outcomes-and-empirical-findings">Outcomes and Empirical Findings</h2>
<ul>
<li>
<p><strong>Performance on Benchmarks</strong>: MolParser achieves competitive results on standard benchmarks and the best performance on real-world documents. On clean benchmarks like USPTO-10K, MolScribe (96.0%) slightly edges MolParser-Base (94.5%), but on WildMol-10k, MolParser-Base achieved 76.9% accuracy, significantly outperforming MolScribe (66.4%) and MolGrapher (45.5%). This gap validates the core hypothesis that training on actual document images is essential for practical deployment.</p>
</li>
<li>
<p><strong>Real-World Data is Critical</strong>: Models trained exclusively on synthetic data show substantial performance degradation when applied to real documents. The 400,000 in-the-wild training samples bridge this gap, demonstrating that data quality and distribution matching matter as much as model architecture. Ablation experiments showed that pretraining on MolParser-7M synthetic data alone achieved 51.9% accuracy on WildMol, while adding real-world fine-tuning raised this to 76.9%. Using the smaller MolGrapher-300k synthetic dataset without fine-tuning yielded only 22.4%.</p>
</li>
<li>
<p><strong>E-SMILES Enables Broader Coverage</strong>: The extended representation successfully captures molecular structures that were previously inexpressible, particularly Markush structures from patents. This expands the scope of what can be automatically extracted from chemical literature to include patent-style structural templates.</p>
</li>
<li>
<p><strong>Human-in-the-Loop Scales Efficiently</strong>: The active learning pipeline reduces annotation time by approximately 90% while maintaining high quality. This approach makes it feasible to curate large-scale, high-quality datasets for specialized domains where expert knowledge is expensive.</p>
</li>
<li>
<p><strong>Speed and Accuracy</strong>: The end-to-end architecture achieves both high accuracy and fast inference, making it practical for large-scale document processing. MolParser-Base processes 40 images per second on RTX 4090D, while the Tiny variant achieves 131 FPS. The direct image-to-text approach avoids the error accumulation of multi-stage pipelines.</p>
</li>
<li>
<p><strong>Downstream Applications</strong>: The Swin Transformer encoder, once trained on MolParser-7M, serves as an effective molecular fingerprint for property prediction. Paired with a simple two-layer MLP on MoleculeNet benchmarks, MolParser-pretrained features achieved an average ROC-AUC of 73.7% across five tasks, compared to 68.9% for ImageNet-pretrained Swin-T features. The authors also demonstrate chemical reaction parsing by feeding MolDet detections and MolParser E-SMILES into GPT-4o.</p>
</li>
<li>
<p><strong>Limitations</strong>: The authors acknowledge that molecular chirality is not yet fully exploited by the system. The E-SMILES format does not currently support dashed abstract rings, coordination bonds, special symbol Markush patterns, or replication of long structural segments. Additionally, scaling up the volume of real annotated training data could further improve performance.</p>
</li>
</ul>
<p>The work establishes that practical OCSR requires more than architectural innovations. It demands careful attention to data quality, representation design, and the distribution mismatch between synthetic training data and real-world applications. The combination of E-SMILES, the MolParser-7M dataset, and the human-in-the-loop data engine provides a template for building reliable vision systems in scientific domains where clean training data is scarce but expert knowledge is available.</p>
<h2 id="artifacts">Artifacts</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>7.7M image-SMILES pairs for OCSR pretraining and fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet</a></td>
          <td>Model</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>YOLO11-based molecule detector for PDF documents</td>
      </tr>
  </tbody>
</table>
<p>No official source code repository has been released. Model weights for MolParser itself are not publicly available as of the dataset release.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data is split into a massive synthetic pre-training set and a curated fine-tuning set.</p>
<p><strong>Training Data Composition (MolParser-7M)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset Name</th>
          <th>Size</th>
          <th>Composition / Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pre-training</strong></td>
          <td>MolParser-7M (Synthetic)</td>
          <td>~7.7M</td>
          <td><strong>Markush-3M</strong> (40%), <strong>ChEMBL-2M</strong> (27%), <strong>Polymer-1M</strong> (14%), PAH-600k (8%), BMS-360k (5%), MolGrapher-300K (4%), Pauling-100k (2%). Generated via RDKit/Indigo with randomized styles.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-SFT-400k</td>
          <td>400k</td>
          <td>Real images from patents/papers selected via active learning (confidence filtering 0.6-0.9) and manually annotated. 66% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>MolParser-Gen-200k</td>
          <td>200k</td>
          <td>Subset of synthetic data kept to prevent catastrophic forgetting. 32% of fine-tuning mix.</td>
      </tr>
      <tr>
          <td><strong>Fine-tuning</strong></td>
          <td>Handwrite-5k</td>
          <td>5k</td>
          <td>Handwritten molecules from Img2Mol to support hand-drawn queries. 1% of fine-tuning mix.</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Sources</strong>: 1.2M patents and scientific papers (PDF documents)</li>
<li><strong>Extraction</strong>: MolDet (YOLO11-based detector) identified ~20M molecular images, deduplicated to ~4M candidates</li>
<li><strong>Selection</strong>: Active learning ensemble (5-fold models) identified high-uncertainty samples for annotation</li>
<li><strong>Annotation</strong>: Human experts corrected model pre-annotations (90% time savings vs. from-scratch annotation)</li>
</ul>
<p><strong>Test Benchmarks</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-10k</td>
          <td>10,000</td>
          <td>Standard synthetic benchmark</td>
      </tr>
      <tr>
          <td>Maybridge UoB</td>
          <td>-</td>
          <td>Synthetic molecules</td>
      </tr>
      <tr>
          <td>CLEF-2012</td>
          <td>-</td>
          <td>Patent images</td>
      </tr>
      <tr>
          <td>JPO</td>
          <td>-</td>
          <td>Japanese patent office</td>
      </tr>
      <tr>
          <td>ColoredBG</td>
          <td>-</td>
          <td>Colored background molecules</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td>10,000</td>
          <td>Ordinary molecules cropped from real PDFs (new)</td>
      </tr>
      <tr>
          <td><strong>WildMol-10k-M</strong></td>
          <td>10,000</td>
          <td>Markush structures (significantly harder, new)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Extended SMILES (E-SMILES) Encoding</strong>:</p>
<ul>
<li><strong>Format</strong>: <code>SMILES&lt;sep&gt;EXTENSION</code> where <code>&lt;sep&gt;</code> separates core structure from supplementary annotations</li>
<li><strong>Extensions use XML-like tags</strong>:
<ul>
<li><code>&lt;a&gt;index:group&lt;/a&gt;</code> for substituents/variable groups (Markush structures)</li>
<li><code>&lt;r&gt;</code> for groups connected at any ring position</li>
<li><code>&lt;c&gt;</code> for abstract rings</li>
<li><code>&lt;dum&gt;</code> for connection points</li>
</ul>
</li>
<li><strong>Backward compatible</strong>: Core SMILES parseable by RDKit; extensions provide structured format for edge cases</li>
</ul>
<p><strong>Curriculum Learning Strategy</strong>:</p>
<ul>
<li><strong>Phase 1</strong>: No augmentation, simple molecules (&lt;60 tokens)</li>
<li><strong>Phase 2</strong>: Gradually increase augmentation intensity and sequence length</li>
<li>Progressive complexity allows stable training on diverse molecular structures</li>
</ul>
<p><strong>Active Learning Data Selection</strong>:</p>
<ol>
<li>Train 5 model folds on current dataset</li>
<li>Compute pairwise Tanimoto similarity of predictions on candidate images</li>
<li>Select samples with confidence scores <strong>0.6-0.9</strong> for human review (highest learning value)</li>
<li>Human experts correct model pre-annotations</li>
<li>Iteratively expand training set with hard samples</li>
</ol>
<p><strong>Data Augmentations</strong>:</p>
<ul>
<li>RandomAffine (rotation, scale, translation)</li>
<li>JPEGCompress (compression artifacts)</li>
<li>InverseColor (color inversion)</li>
<li>SurroundingCharacters (text interference)</li>
<li>RandomCircle (circular artifacts)</li>
<li>ColorJitter (brightness, contrast variations)</li>
<li>Downscale (resolution reduction)</li>
<li>Bounds (boundary cropping variations)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture follows a standard <strong>Image Captioning</strong> (Encoder-Decoder) paradigm.</p>
<p><strong>Architecture Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vision Encoder</strong></td>
          <td>Swin Transformer (ImageNet pretrained)</td>
      </tr>
      <tr>
          <td>- Tiny variant</td>
          <td>66M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Small variant</td>
          <td>108M parameters, $224 \times 224$ input</td>
      </tr>
      <tr>
          <td>- Base variant</td>
          <td>216M parameters, $384 \times 384$ input</td>
      </tr>
      <tr>
          <td><strong>Connector</strong></td>
          <td>2-layer MLP reducing channel dimension by half</td>
      </tr>
      <tr>
          <td><strong>Text Decoder</strong></td>
          <td>BART-Decoder (12 layers, 16 attention heads)</td>
      </tr>
  </tbody>
</table>
<p><strong>Training Configuration</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Pre-training</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Hardware</strong></td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
          <td>8x NVIDIA RTX 4090D GPUs</td>
      </tr>
      <tr>
          <td><strong>Optimizer</strong></td>
          <td>AdamW</td>
          <td>AdamW</td>
      </tr>
      <tr>
          <td><strong>Learning Rate</strong></td>
          <td>$1 \times 10^{-4}$</td>
          <td>$5 \times 10^{-5}$</td>
      </tr>
      <tr>
          <td><strong>Weight Decay</strong></td>
          <td>$1 \times 10^{-2}$</td>
          <td>$1 \times 10^{-2}$</td>
      </tr>
      <tr>
          <td><strong>Scheduler</strong></td>
          <td>Cosine with warmup</td>
          <td>Cosine with warmup</td>
      </tr>
      <tr>
          <td><strong>Epochs</strong></td>
          <td>20</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Label Smoothing</strong></td>
          <td>0.01</td>
          <td>0.005</td>
      </tr>
  </tbody>
</table>
<p><strong>Curriculum Learning Schedule</strong> (Pre-training):</p>
<ul>
<li>Starts with simple molecules (&lt;60 tokens, no augmentation)</li>
<li>Gradually adds complexity and augmentation (blur, noise, perspective transforms)</li>
<li>Enables stable learning across diverse molecular structures</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics</strong>: Exact match accuracy on predicted E-SMILES strings (molecule-level exact match)</p>
<p><strong>Key Results</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MolParser-Base</th>
          <th>MolScribe</th>
          <th>MolGrapher</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>WildMol-10k</strong></td>
          <td><strong>76.9%</strong></td>
          <td>66.4%</td>
          <td>45.5%</td>
          <td>Real-world patent/paper crops</td>
      </tr>
      <tr>
          <td><strong>USPTO-10k</strong></td>
          <td>94.5%</td>
          <td><strong>96.0%</strong></td>
          <td>93.3%</td>
          <td>Synthetic benchmark</td>
      </tr>
      <tr>
          <td><strong>Throughput (FPS)</strong></td>
          <td><strong>39.8</strong></td>
          <td>16.5</td>
          <td>2.2</td>
          <td>Measured on RTX 4090D</td>
      </tr>
  </tbody>
</table>
<p><strong>Additional Performance</strong>:</p>
<ul>
<li>MolParser-Tiny: 131 FPS on RTX 4090D (66M params)</li>
<li>Real-world vs. synthetic gap: Fine-tuning on MolParser-SFT-400k closed the performance gap between clean benchmarks and in-the-wild documents</li>
</ul>
<p><strong>Ablation Findings</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Real-world training data</td>
          <td>Fine-tuning on real data raised accuracy from 51.9% to 76.9% on WildMol-10k</td>
      </tr>
      <tr>
          <td>Curriculum learning</td>
          <td>Augmentation alone raised WildMol-10k from 40.1% to 69.5%; adding curriculum learning further raised it to 76.9%</td>
      </tr>
      <tr>
          <td>Active learning selection</td>
          <td>More effective than random sampling for annotation budget</td>
      </tr>
      <tr>
          <td>E-SMILES extensions</td>
          <td>Essential for Markush structure recognition (impossible with standard SMILES)</td>
      </tr>
      <tr>
          <td>Dataset scale</td>
          <td>Larger pre-training dataset (7M vs 300k) improved WildMol-10k accuracy from 22.4% to 51.9% before fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 8x NVIDIA RTX 4090D GPUs</li>
<li><strong>Inference</strong>: Single RTX 4090D sufficient for real-time processing</li>
<li><strong>Training time</strong>: 20 epochs pre-training + 4 epochs fine-tuning (specific duration not reported)</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2025molparser,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24528--24538}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2411.11098}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arXiv.2411.11098}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolParser-7M &amp; WildMol: Large-Scale OCSR Datasets</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/vision-language/molparser_7m-wildmol/</guid><description>MolParser-7M is the largest open-source OCSR dataset with 7.7M image-SMILES pairs including 400k real-world annotated samples.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/molparser-markush-example.webp"
         alt="Example of a complex Markush structure"
         title="Example of a complex Markush structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">An example of a complex Markush structure that can be represented by the E-SMILES format but not by standard SMILES or FG-SMILES.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/molparser-low-quality-example.webp"
         alt="Sample from the WildMol benchmark"
         title="Sample from the WildMol benchmark"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A sample from the WildMol benchmark, showing a low-quality, noisy molecular image cropped from real-world literature that challenges OCSR systems.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/molparser-colored-example.webp"
         alt="Colored molecule with annotations"
         title="Colored molecule with annotations"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>MolParser-7M (Training Set)</strong></td>
          <td>7,740,871</td>
          <td>A large-scale dataset for training OCSR models, split into pre-training and fine-tuning stages.</td>
      </tr>
      <tr>
          <td><strong>WildMol (Test Set)</strong></td>
          <td>20,000</td>
          <td>A benchmark of 20,000 human-annotated samples cropped from real PDF files to evaluate OCSR models in &lsquo;in-the-wild&rsquo; scenarios. Comprises WildMol-10k (10k ordinary molecules) and WildMol-10k-M (10k Markush structures).</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="wildmol-10k-accuracy">WildMol-10K Accuracy<a hidden class="anchor" aria-hidden="true" href="#wildmol-10k-accuracy">#</a></h3>
    <p class="benchmark-description">Evaluation of OCSR models on 10,000 real-world molecular images cropped from scientific literature and patents</p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Accuracy (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>MolParser-Base</strong><br><small>End-to-end visual recognition trained on MolParser-7M</small>
          </td>
          <td>76.9</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>MolScribe</strong><br><small>Transformer-based OCSR system</small>
          </td>
          <td>66.4</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>DECIMER 2.7</strong><br><small>Deep learning for chemical image recognition</small>
          </td>
          <td>56</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>MolGrapher</strong><br><small>Graph-based molecular structure recognition</small>
          </td>
          <td>45.5</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>MolVec 0.9.7</strong><br><small>Vector-based structure recognition</small>
          </td>
          <td>26.4</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>OSRA 2.1</strong><br><small>Optical Structure Recognition Application</small>
          </td>
          <td>26.3</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Img2Mol</strong><br><small>Image-to-molecule translation</small>
          </td>
          <td>24.4</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Imago 2.0</strong><br><small>Chemical structure recognition toolkit</small>
          </td>
          <td>6.9</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="key-contribution">Key Contribution</h2>
<p>Introduces MolParser-7M, the largest open-source Optical Chemical Structure Recognition (OCSR) dataset, uniquely combining diverse synthetic data with a large volume of manually-annotated, &ldquo;in-the-wild&rdquo; images from real scientific documents to improve model robustness. Also introduces WildMol, a new challenging benchmark for evaluating OCSR performance on real-world data, including Markush structures.</p>
<h2 id="overview">Overview</h2>
<p>The MolParser project addresses the challenge of recognizing molecular structures from images found in real-world scientific documents. Unlike existing OCSR datasets that rely primarily on synthetically generated images, MolParser-7M incorporates 400,000 manually annotated images cropped from actual patents and scientific papers, making it the first large-scale dataset to bridge the gap between synthetic training data and real-world deployment scenarios.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Largest open-source OCSR dataset with over 7.7 million pairs</li>
<li>The only large-scale OCSR training set that includes a significant amount (400k) of &ldquo;in-the-wild&rdquo; data cropped from real patents and literature</li>
<li>High diversity of molecular structures from numerous sources (PubChem, ChEMBL, polymers, etc.)</li>
<li>Introduces the WildMol benchmark for evaluating performance on challenging, real-world data, including Markush structures</li>
<li>The &ldquo;in-the-wild&rdquo; fine-tuning data (MolParser-SFT-400k) was curated via an efficient active learning data engine with human-in-the-loop validation</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The E-SMILES format cannot represent certain complex cases, such as coordination bonds, dashed abstract rings, Markush structures depicted with special patterns, and replication of long structural segments on the skeleton</li>
<li>The model and data do not yet fully exploit molecular chirality, which is critical for chemical properties</li>
<li>Performance could be further improved by scaling up the amount of real annotated training data</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="synthetic-data-generation">Synthetic Data Generation</h3>
<p>To ensure diversity, molecular structures were collected from databases like ChEMBL, PubChem, and Kaggle BMS. A significant number of Markush, polymer, and fused-ring structures were also randomly generated. Images were rendered using RDKit and epam.indigo with randomized parameters (e.g., bond width, font size, rotation) to increase visual diversity. The pretraining dataset is composed of the following subsets:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Markush-3M</td>
          <td>40%</td>
          <td>Random groups replacement from PubChem</td>
      </tr>
      <tr>
          <td>ChEMBL-2M</td>
          <td>27%</td>
          <td>Molecules selected from ChEMBL</td>
      </tr>
      <tr>
          <td>Polymer-1M</td>
          <td>14%</td>
          <td>Randomly generated polymer molecules</td>
      </tr>
      <tr>
          <td>PAH-600k</td>
          <td>8%</td>
          <td>Randomly generated fused-ring molecules</td>
      </tr>
      <tr>
          <td>BMS-360k</td>
          <td>5%</td>
          <td>Molecules with long carbon chains from BMS</td>
      </tr>
      <tr>
          <td>MolGrapher-300K</td>
          <td>4%</td>
          <td>Training data from MolGrapher</td>
      </tr>
      <tr>
          <td>Pauling-100k</td>
          <td>2%</td>
          <td>Pauling-style images drawn using epam.indigo</td>
      </tr>
  </tbody>
</table>
<h3 id="in-the-wild-data-engine-molparser-sft-400k">In-the-Wild Data Engine (MolParser-SFT-400k)</h3>
<p>A YOLO11 object detection model (MolDet) located and cropped over 20 million molecule images from 1.22 million real PDFs (patents and papers). After de-duplication via p-hash similarity, 4 million unique images remained.</p>
<p>An active learning algorithm was used to select the most informative samples for annotation, targeting images where an ensemble of 5-fold models showed moderate confidence (0.6-0.9 Tanimoto similarity), indicating they were challenging but learnable.</p>
<p>This active learning approach with model pre-annotations reduced manual annotation time per molecule to 30 seconds, approximately 90% savings compared to annotating from scratch. In the final fine-tuning dataset, 56.04% of annotations directly utilized raw model pre-annotations, 20.97% passed review after a single manual correction, 13.87% were accepted after a second round of annotation, and 9.13% required three or more rounds.</p>
<p>The fine-tuning dataset is composed of:</p>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Ratio</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolParser-SFT-400k</td>
          <td>66%</td>
          <td>Manually annotated data obtained via data engine</td>
      </tr>
      <tr>
          <td>MolParser-Gen-200k</td>
          <td>32%</td>
          <td>Synthetic data selected from pretraining stage</td>
      </tr>
      <tr>
          <td>Handwrite-5k</td>
          <td>1%</td>
          <td>Handwritten molecules selected from Img2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="e-smiles-specification">E-SMILES Specification</h3>
<p>To accommodate complex patent structures that standard SMILES cannot support, the authors introduced an Extended SMILES format (<code>SMILES&lt;sep&gt;EXTENSION</code>). The <code>EXTENSION</code> component uses XML-like tokens to manage complexities:</p>
<ul>
<li><code>&lt;a&gt;...&lt;/a&gt;</code> encapsulates Markush R-groups and abbreviation groups.</li>
<li><code>&lt;r&gt;...&lt;/r&gt;</code> denotes ring attachments with uncertainty positions.</li>
<li><code>&lt;c&gt;...&lt;/c&gt;</code> defines abstract rings.</li>
<li><code>&lt;dum&gt;</code> identifies a connection point.</li>
</ul>
<p>This format enables Markush-molecule matching and LLM integration, while retaining RDKit compatibility for the standard SMILES portion.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/datasets/UniParser/MolParser-7M">MolParser-7M</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-SA-4.0</td>
          <td>Training and test data on HuggingFace. SFT subset is partially released.</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/UniParser/MolDet">MolDet (YOLO11)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Molecule detection model on HuggingFace</td>
      </tr>
      <tr>
          <td><a href="https://ocsr.dp.tech/">MolParser Demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online OCSR demo using MolParser-Base</td>
      </tr>
  </tbody>
</table>
<p>The dataset is publicly available on HuggingFace under a CC-BY-NC-SA-4.0 (non-commercial) license. The MolParser-SFT-400k subset is only partially released. The YOLO11-based MolDet detection model is also available on HuggingFace. No public code repository is provided for the MolParser recognition model itself. All experiments were conducted on 8 NVIDIA RTX 4090D GPUs, and throughput benchmarks were measured on a single RTX 4090D GPU.</p>
]]></content:encoded></item><item><title>Beyond Atoms: 3D Space Modeling for Molecular Pretraining</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/beyond-atoms/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/beyond-atoms/</guid><description>Lu et al. introduce SpaceFormer, a Transformer that models entire 3D molecular space including atoms for superior representations.</description><content:encoded><![CDATA[<h2 id="paper-typology-and-contribution">Paper Typology and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It challenges the atom-centric paradigm of molecular representation learning by proposing a novel framework that models the continuous 3D space surrounding atoms. The core contribution is <strong>SpaceFormer</strong>, a Transformer-based architecture that discretizes molecular space into grids to capture physical phenomena (electron density, electromagnetic fields) often missed by traditional point-cloud models.</p>
<h2 id="the-physical-intuition-modeling-empty-space">The Physical Intuition: Modeling &ldquo;Empty&rdquo; Space</h2>
<p><strong>The Gap</strong>: Prior 3D molecular representation models, such as Uni-Mol, treat molecules as discrete sets of atoms, essentially point clouds in 3D space. However, from a quantum physics perspective, the &ldquo;empty&rdquo; space between atoms is far from empty. It is permeated by electron density distributions and electromagnetic fields that determine molecular properties.</p>
<p><strong>The Hypothesis</strong>: Explicitly modeling this continuous 3D space alongside discrete atom positions yields superior representations for downstream tasks, particularly for computational properties that depend on electronic structure, such as HOMO/LUMO energies and energy gaps.</p>
<h2 id="a-surprising-observation-virtual-points-improve-representations">A Surprising Observation: Virtual Points Improve Representations</h2>
<p>Before proposing SpaceFormer, the authors present a simple yet revealing experiment. They augment Uni-Mol by adding randomly sampled virtual points (VPs) from the 3D space within the circumscribed cuboid of each molecule. These VPs carry no chemical information whatsoever: they are purely random noise points.</p>
<p>The result is surprising: adding just 10 random VPs already yields a noticeable improvement in validation loss. The improvement remains consistent and gradually increases as the number of VPs grows, eventually reaching a plateau. This observation holds across downstream tasks as well, with Uni-Mol + VPs improving on several quantum property predictions (LUMO, E1-CC2, E2-CC2) compared to vanilla Uni-Mol.</p>
<p>The implication is that even uninformative spatial context helps the model learn better representations, motivating a principled framework for modeling the full 3D molecular space.</p>
<h2 id="spaceformer-voxelization-and-3d-positional-encodings">SpaceFormer: Voxelization and 3D Positional Encodings</h2>
<p>The key innovation is treating the molecular representation problem as <strong>3D space modeling</strong>. SpaceFormer follows these core steps:</p>
<ol>
<li><strong>Voxelizes the entire 3D space</strong> into a grid with cells of $0.49\text{\AA}$ (based on O-H bond length to ensure at most one atom per cell).</li>
<li><strong>Uses adaptive multi-resolution grids</strong> to efficiently handle empty space, keeping it fine-grained near atoms and coarse-grained far away.</li>
<li><strong>Applies Transformers to 3D spatial tokens</strong> with custom positional encodings that achieve linear complexity.</li>
</ol>
<p>Specifically, the model utilizes two forms of 3D Positional Encoding:</p>
<p><strong>3D Directional PE (RoPE Extension)</strong>
They extend Rotary Positional Encoding (RoPE) to 3D continuous space by splitting the Query and Key vectors into three blocks (one for each spatial axis). The directional attention mechanism takes the form:</p>
<p>$$
\begin{aligned}
\mathbf{q}_{i}^{\top} \mathbf{k}_{j} = \sum_{s=1}^{3} \mathbf{q}_{i,s}^{\top} \mathbf{R}(c_{j,s} - c_{i,s}) \mathbf{k}_{j,s}
\end{aligned}
$$</p>
<p><strong>3D Distance PE (RFF Approximation)</strong>
To compute invariant geometric distance without incurring quadratic memory overhead, they use Random Fourier Features (RFF) to approximate a Gaussian kernel of pairwise distances:</p>
<p>$$
\begin{aligned}
\exp \left( - \frac{| \mathbf{c}_i - \mathbf{c}_j |_2^2}{2\sigma^2} \right) &amp;\approx z(\mathbf{c}_i)^\top z(\mathbf{c}_j) \\
z(\mathbf{c}_i) &amp;= \sqrt{\frac{2}{d}} \cos(\sigma^{-1} \mathbf{c}_i^\top \boldsymbol{\omega} + \mathbf{b})
\end{aligned}
$$</p>
<p>This approach enables the model to natively encode complex field-like phenomena without computing exhaustive $O(N^2)$ distance matrices.</p>
<h2 id="experimental-setup-and-downstream-tasks">Experimental Setup and Downstream Tasks</h2>
<p><strong>Pretraining Data</strong>: 19 million unlabeled molecules from the same dataset used by Uni-Mol.</p>
<p><strong>Downstream Benchmarks</strong>: The authors propose a new benchmark of 15 tasks, motivated by known limitations of MoleculeNet: invalid structures, inconsistent chemical representations, data curation errors, and an inability to adequately distinguish model performance. The tasks split into two categories:</p>
<ol>
<li>
<p><strong>Computational Properties (Quantum Mechanics)</strong></p>
<ul>
<li>Subsets of <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> (HOMO, LUMO, GAP energy prediction, 20K samples; E1-CC2, E2-CC2, f1-CC2, f2-CC2, 21.7K samples)</li>
<li>Cata-condensed polybenzenoid hydrocarbons (Dipole moment, adiabatic ionization potential, D3 dispersion correction, 8,678 samples)</li>
<li>Metric: Mean Absolute Error (MAE)</li>
</ul>
</li>
<li>
<p><strong>Experimental Properties (Pharma/Bio)</strong></p>
<ul>
<li>MoleculeNet tasks (BBBP, BACE for drug discovery)</li>
<li>Biogen ADME tasks (HLM, MME, Solubility)</li>
<li>Metrics: AUC for classification, MAE for regression</li>
</ul>
</li>
</ol>
<p><strong>Splitting Strategy</strong>: All datasets use 8:1:1 train/validation/test ratio with <strong>scaffold splitting</strong> to test out-of-distribution generalization.</p>
<p><strong>Training Setup</strong>:</p>
<ul>
<li><strong>Objective</strong>: Masked Auto-Encoder (MAE) with 30% random masking. Model predicts whether a cell contains an atom, and if so, regresses both atom type and precise offset position.</li>
<li><strong>Hardware</strong>: ~50 hours on 8 NVIDIA A100 GPUs</li>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9, \beta_2=0.99$)</li>
<li><strong>Learning Rate</strong>: Peak 1e-4 with linear decay and 0.01 warmup ratio</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>Total Updates</strong>: 1 million</li>
</ul>
<p><strong>Baseline Comparisons</strong>: GROVER (2D graph-based MPR), GEM (2D graph enhanced with 3D information), 3D Infomax (GNN with 3D information), Uni-Mol (3D MPR, primary baseline using the same pretraining dataset), and Mol-AE (extends Uni-Mol with atom-based MAE pretraining).</p>
<h2 id="results-and-analysis">Results and Analysis</h2>
<p><strong>Strong Contextual Performance</strong>: SpaceFormer ranked 1st in 10 of 15 tasks and in the top 2 for 14 of 15 tasks. It surpassed the runner-up models by approximately 20% on quantum property tasks (HOMO, LUMO, GAP, E1-CC2, Dipmom), validating that modeling non-atom space captures electronic structure better than atom-only regimes.</p>
<h3 id="key-results-on-quantum-properties">Key Results on Quantum Properties</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GROVER</th>
          <th>GEM</th>
          <th>3D Infomax</th>
          <th>Uni-Mol</th>
          <th>Mol-AE</th>
          <th><strong>SpaceFormer</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HOMO (Ha)</td>
          <td>0.0075</td>
          <td>0.0068</td>
          <td>0.0065</td>
          <td>0.0052</td>
          <td>0.0050</td>
          <td><strong>0.0042</strong></td>
      </tr>
      <tr>
          <td>LUMO (Ha)</td>
          <td>0.0086</td>
          <td>0.0080</td>
          <td>0.0070</td>
          <td>0.0060</td>
          <td>0.0057</td>
          <td><strong>0.0040</strong></td>
      </tr>
      <tr>
          <td>GAP (Ha)</td>
          <td>0.0109</td>
          <td>0.0107</td>
          <td>0.0095</td>
          <td>0.0081</td>
          <td>0.0080</td>
          <td><strong>0.0064</strong></td>
      </tr>
      <tr>
          <td>E1-CC2 (eV)</td>
          <td>0.0101</td>
          <td>0.0090</td>
          <td>0.0089</td>
          <td>0.0067</td>
          <td>0.0070</td>
          <td><strong>0.0058</strong></td>
      </tr>
      <tr>
          <td>Dipmom (Debye)</td>
          <td>0.0752</td>
          <td>0.0289</td>
          <td>0.0291</td>
          <td>0.0106</td>
          <td>0.0113</td>
          <td><strong>0.0083</strong></td>
      </tr>
  </tbody>
</table>
<p>SpaceFormer&rsquo;s advantage is most pronounced on computational properties that depend on electronic structure. On experimental biological tasks (e.g., BBBP), where measurements are noisy, the advantage narrows or reverses: Uni-Mol achieves 0.9066 AUC on BBBP compared to SpaceFormer&rsquo;s 0.8605.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The authors present several ablations that isolate the source of SpaceFormer&rsquo;s improvements:</p>
<p><strong>MAE vs. Denoising</strong>: SpaceFormer with MAE pretraining outperforms SpaceFormer with denoising on all four ablation tasks. The MAE objective requires predicting <em>whether</em> an atom exists in a masked voxel, which forces the model to learn global structural dependencies. In the denoising variant, only atom cells are masked so the model never needs to predict atom existence, reducing the task to coordinate regression.</p>
<p><strong>FLOPs Control</strong>: A SpaceFormer-Large model (4x width, atom-only) trained with comparable FLOPs still falls short of SpaceFormer with 1000 non-atom cells on most downstream tasks. This confirms the improvement comes from modeling 3D space, not from additional compute.</p>
<p><strong>Virtual Points vs. SpaceFormer</strong>: Adding up to 200 random virtual points to Uni-Mol improves some tasks but leaves a significant gap compared to SpaceFormer, demonstrating that principled space discretization outperforms naive point augmentation.</p>
<p><strong>Efficiency Validation</strong>: The Adaptive Grid Merging method reduces the number of cells by roughly 10x with virtually no performance degradation. The 3D positional encodings scale linearly with the number of cells, while Uni-Mol&rsquo;s pretraining cost scales quadratically.</p>
<h3 id="scope-and-future-directions">Scope and Future Directions</h3>
<p>SpaceFormer does not incorporate built-in SE(3) equivariance, relying instead on data augmentation (random rotations and random boundary padding) during training. The authors identify extending SpaceFormer to force field tasks and larger systems such as proteins and complexes as promising future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="code-and-data-availability">Code and Data Availability</h3>
<ul>
<li><strong>Source Code</strong>: As of the current date, the authors have not released the official source code or pre-trained weights.</li>
<li><strong>Datasets</strong>: Pretraining utilized the same 19M unlabeled molecule dataset as Uni-Mol. Downstream tasks use a newly curated internal benchmark built from subsets of GDB-17, MoleculeNet, and Biogen ADME. The exact customized scaffold splits for these evaluations are pending the official code release.</li>
<li><strong>Compute</strong>: Pretraining the base SpaceFormer encoder (~67.8M parameters, configured to merge level 3) required approximately 50 hours on 8 NVIDIA A100 GPUs.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source code</td>
          <td>Code</td>
          <td>N/A</td>
          <td>Not publicly released as of March 2026</td>
      </tr>
      <tr>
          <td>Pre-trained weights</td>
          <td>Model</td>
          <td>N/A</td>
          <td>Not publicly released</td>
      </tr>
      <tr>
          <td>Pretraining data (19M molecules)</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Same dataset as Uni-Mol; not independently released</td>
      </tr>
      <tr>
          <td>Downstream benchmark splits</td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Custom scaffold splits pending code release</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The model treats a molecule as a 3D &ldquo;image&rdquo; via voxelization, processed by a Transformer.</p>
<p><strong>Input Representation</strong>:</p>
<ul>
<li><strong>Discretization</strong>: 3D space divided into grid cells with length <strong>$0.49\text{\AA}$</strong> (based on O-H bond length to ensure at most one atom per cell)</li>
<li><strong>Tokenization</strong>: Tokens are pairs $(t_i, c_i)$ where $t_i$ is atom type (or NULL) and $c_i$ is the coordinate</li>
<li><strong>Embeddings</strong>: Continuous embeddings with dimension 512. Inner-cell positions discretized with $0.01\text{\AA}$ precision</li>
</ul>
<p><strong>Transformer Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>Embedding Dim</th>
          <th>FFN Dim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Encoder</strong></td>
          <td>16</td>
          <td>8</td>
          <td>512</td>
          <td>2048</td>
      </tr>
      <tr>
          <td><strong>Decoder</strong> (MAE)</td>
          <td>4</td>
          <td>4</td>
          <td>256</td>
          <td>1024</td>
      </tr>
  </tbody>
</table>
<p><strong>Attention Mechanism</strong>: FlashAttention for efficient handling of large sequence lengths.</p>
<p><strong>Positional Encodings</strong>:</p>
<ol>
<li><strong>3D Directional PE</strong>: Extension of Rotary Positional Embedding (RoPE) to 3D continuous space, capturing relative directionality</li>
<li><strong>3D Distance PE</strong>: Random Fourier Features (RFF) to approximate Gaussian kernel of pairwise distances with linear complexity</li>
</ol>
<h4 id="visualizing-rff-and-rope">Visualizing RFF and RoPE</h4>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-rff-rope-visualization.webp"
         alt="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         title="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Visual intuition for SpaceFormer&rsquo;s positional encodings: Top row shows RFF distance encoding (Gaussian-like attention decay and high-frequency feature fingerprints). Bottom row shows RoPE directional encoding (vector rotation fields and resulting attention patterns).</figcaption>
    
</figure>

<p><strong>Top Row (Distance / RFF):</strong> Shows how the model learns &ldquo;closeness.&rdquo; Distance is represented by a complex &ldquo;fingerprint&rdquo; of waves that creates a Gaussian-like force field.</p>
<ul>
<li><strong>Top Left (The Force Field):</strong> The attention score (dot product) naturally forms a Gaussian curve. It is high when atoms are close and decays to zero as they move apart. This mimics physical forces without the model needing to learn that math from scratch.</li>
<li><strong>Top Right (The Fingerprint):</strong> Each dimension oscillates at a different frequency. A specific distance (e.g., $d=2$) has a unique combination of high and low values across these dimensions, creating a unique &ldquo;fingerprint&rdquo; for that exact distance.</li>
</ul>
<p><strong>Bottom Row (Direction / RoPE):</strong> Shows how the model learns &ldquo;relative position.&rdquo; It visualizes the vector rotation and how that creates a grid-like attention pattern.</p>
<ul>
<li><strong>Bottom Left (The Rotation):</strong> This visualizes the &ldquo;X-axis chunk&rdquo; of the vector. As you move from left ($x=-3$) to right ($x=3$), the arrows rotate. The model compares angles between atoms to determine relative positions.</li>
<li><strong>Bottom Right (The Grid):</strong> The resulting attention pattern when combining X-rotations and Y-rotations. The red/blue regions show where the model pays attention relative to the center, forming a grid-like interference pattern that distinguishes relative positions (e.g., &ldquo;top-right&rdquo; vs &ldquo;bottom-left&rdquo;).</li>
</ul>
<h4 id="adaptive-grid-merging">Adaptive Grid Merging</h4>
<p>To make the 3D grid approach computationally tractable, two key strategies are employed:</p>
<ol>
<li><strong>Grid Sampling</strong>: Randomly selecting 10-20% of empty cells during training</li>
<li><strong>Adaptive Grid Merging</strong>: Recursively merging $2 \times 2 \times 2$ blocks of empty cells into larger &ldquo;coarse&rdquo; cells, creating a multi-resolution view that is fine-grained near atoms and coarse-grained in empty space (merging set to Level 3)</li>
</ol>
<p><strong>Visualizing Adaptive Grid Merging</strong>:</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-merging.webp"
         alt="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         title="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging demonstrated on H₂O. Red cells (Level 0) contain atoms and remain at full resolution. Progressively darker blue cells represent merged empty regions at higher levels, covering the same volume with fewer tokens.</figcaption>
    
</figure>

<p>The adaptive grid process compresses empty space around molecules while maintaining high resolution near atoms:</p>
<ul>
<li><strong>Red Cells (Level 0):</strong> The smallest squares ($0.49$Å) containing atoms. These are kept at highest resolution because electron density changes rapidly here.</li>
<li><strong>Light Blue Cells (Level 0/1):</strong> Small empty regions close to atoms.</li>
<li><strong>Darker Blue Cells (Level 2/3):</strong> Large blocks of empty space further away.</li>
</ul>
<p>If we used a naive uniform grid, we would have to process thousands of empty &ldquo;Level 0&rdquo; cells containing almost zero information. By merging them into larger blocks (the dark blue squares), the model covers the same volume with significantly fewer input tokens, reducing the number of tokens by roughly <strong>10x</strong> compared to a dense grid.</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-benzene.webp"
         alt="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         title="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging for benzene (C₆H₆). The model maintains maximum resolution (red Level 0 cells) only where atoms exist, while merging vast empty regions into large blocks (dark blue L3/L4 cells). This allows the model to focus computational power on chemically active zones.</figcaption>
    
</figure>

<p>The benzene example above demonstrates how this scales to larger molecules. The characteristic hexagonal ring of 6 carbon atoms (black) and 6 hydrogen atoms (white) occupies a small fraction of the total grid. The dark blue corners (L3, L4) represent massive merged blocks of empty space, allowing the model to focus 90% of its computational power on the red &ldquo;active&rdquo; zones where chemistry actually happens.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lu, S., Ji, X., Zhang, B., Yao, L., Liu, S., Gao, Z., Zhang, L., &amp; Ke, G. (2025). Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, 267, 40491-40504. <a href="https://proceedings.mlr.press/v267/lu25e.html">https://proceedings.mlr.press/v267/lu25e.html</a></p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lu2025beyond,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lu, Shuqi and Ji, Xiaohong and Zhang, Bohang and Yao, Lin and Liu, Siyuan and Gao, Zhifeng and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{40491--40504}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Wd9KPQCKwq">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=Wd9KPQCKwq">PDF on OpenReview</a></li>
<li><a href="https://icml.cc/virtual/2025/poster/45004">ICML 2025 poster page</a></li>
</ul>
]]></content:encoded></item><item><title>LLMs for Insurance Document Automation</title><link>https://hunterheidenreich.com/research/page-stream-segmentation-llms/</link><pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/page-stream-segmentation-llms/</guid><description>LLM applications for insurance document automation using parameter-efficient fine-tuning and analysis of calibration challenges.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Page Stream Segmentation (PSS) is critical for automating document processing in industries like insurance, where unstructured document collections are common. This paper explores the use of large language models (LLMs) for PSS, applying parameter-efficient fine-tuning to real-world insurance data. Our experiments show that LLMs outperform baseline models in segmentation accuracy. We find that stream-level calibration remains a significant challenge. We evaluate post-hoc calibration and Monte Carlo dropout, finding they offer limited improvement, highlighting the need for future work in this area for high-stakes applications.</p>
<p>Published in the COLING 2025 Industry Track, this work builds on our earlier research establishing the <a href="/research/llm-page-stream-segmentation/">TabMe++ benchmark and decoder-based LLM approach</a>, extending those methods to real-world industrial deployment.</p>
<blockquote>
<p><strong>Blog Post:</strong> For a narrative overview of the reliability and calibration findings discussed in this paper, see <a href="/posts/reliability-trap-document-automation/">The Reliability Trap: When 99% Accuracy Isn&rsquo;t Enough</a>.</p></blockquote>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Real-World Evaluation</strong>: Applied small-to-mid-sized LLMs (Phi-3.5-mini, Mistral-7B) to a proprietary insurance dataset, outperforming strong baselines like XGBoost in segmentation accuracy.</li>
<li><strong>Parameter-Efficient Fine-Tuning</strong>: Successfully used parameter-efficient fine-tuning (PEFT) to adapt LLMs for the specialized task of page stream segmentation.</li>
<li><strong>Calibration Complexity</strong>: Found that post-hoc calibration and Monte Carlo dropout offer limited improvement at the stream level, keeping human-in-the-loop workflows necessary for high-stakes automation (see stream-level confidence analysis below).</li>
<li><strong>Throughput Analysis</strong>: Introduced an accuracy-vs-throughput framework to quantify how much volume can be safely automated at strict confidence thresholds.</li>
</ul>
<h2 id="stream-level-confidence">Stream-Level Confidence</h2>
<p>A key insight from this work is why calibration becomes increasingly difficult as documents grow longer. We define stream-level confidence as the product of individual page-level confidences:</p>
<p>$$C = \prod_{i=1}^{N} C_i$$</p>
<p>where $C_i$ is the confidence for page $i$ and $N$ is the number of pages in the stream. This multiplicative relationship means that even small page-level errors compound aggressively. As streams grow longer, confidence drops rapidly, making it difficult to set reliable thresholds for automation.</p>















<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation-throughput.webp"
         alt="Stream accuracy versus relative throughput for Mistral-7B and XGBoost models, showing LLMs achieve higher automation rates at equivalent accuracy levels"
         title="Stream accuracy versus relative throughput for Mistral-7B and XGBoost models, showing LLMs achieve higher automation rates at equivalent accuracy levels"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Accuracy vs. throughput trade-off: Mistral-7B enables higher automation rates than XGBoost at strict accuracy thresholds, demonstrating the practical value of LLMs for document processing.</figcaption>
    
</figure>

<h2 id="technical-implementation">Technical Implementation</h2>
<h3 id="models--fine-tuning">Models &amp; Fine-Tuning</h3>
<p>We fine-tuned <strong>Mistral-7B-v0.3</strong> and <strong>Phi-3.5-mini</strong> (4-bit quantized) using QLoRA. Training was performed efficiently on a single NVIDIA H100 GPU using the <strong>Unsloth</strong> library and Hugging Face&rsquo;s TRL.</p>
<ul>
<li><strong>Stack</strong>: Unsloth + TRL</li>
<li><strong>Config</strong>: Rank $r=16$, Alpha $\alpha=16$</li>
</ul>
<h3 id="dataset">Dataset</h3>
<p>The study utilized a proprietary <strong>insurance dataset</strong> consisting of 7.5k document streams (44.7k pages). This real-world data includes health records, legal contracts, and police reports, offering a more challenging and realistic evaluation than synthetic benchmarks.</p>
<h3 id="task-framing">Task Framing</h3>
<p>The task framing follows the <a href="/research/llm-page-stream-segmentation/">TabMe++ work</a>: binary classification over a local window (previous page + current page). Here the models are prompted to output valid JSON marking the start of a new document.</p>
<h2 id="impact">Impact</h2>
<p>This work demonstrates both the promise and the current limitations of using LLMs in high-stakes industrial applications. LLMs can significantly improve segmentation accuracy over traditional methods, but performance metrics alone are not sufficient for deployment. For sectors like insurance, stream-level calibration is an open problem that must be solved before full automation becomes responsible.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{heidenreich2025page,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Page Stream Segmentation with LLMs: Challenges and Applications in Insurance Document Automation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heidenreich, Hunter and Dalvi, Ratish and Verma, Nikhil and Getachew, Yosheb}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 31st International Conference on Computational Linguistics: Industry Track}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{305--317}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Optimizing Sequence Models for Dynamical Systems</title><link>https://hunterheidenreich.com/research/deconstructing-recurrence-attention-gating/</link><pubDate>Tue, 01 Oct 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/deconstructing-recurrence-attention-gating/</guid><description>Ablation study deconstructing sequence models. Attention-augmented Recurrent Highway Networks outperform Transformers on chaotic systems.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Advanced neural network architectures developed for tasks like natural language processing are often transferred to spatiotemporal forecasting without a deep understanding of which components drive their performance. This can lead to suboptimal results and reinforces the view of these models as &ldquo;black boxes&rdquo;. In this work, we deconstruct the core mechanisms of Transformers and Recurrent Neural Networks (RNNs) (namely attention, gating, and recurrence). We then build and test novel hybrid architectures to identify which components are most effective. A key finding is that while adding recurrence is detrimental to Transformers, augmenting RNNs with attention and neural gating consistently improves their forecasting accuracy. Our study reveals that a seldom-used architecture, the Recurrent Highway Network (RHN) enhanced with these mechanisms, emerges as the top-performing model for forecasting high-dimensional chaotic systems.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Systematic Ablation</strong>: Deconstructed Transformers and RNNs into core mechanisms (attention, gating, recurrence) to isolate performance drivers</li>
<li><strong>Novel Hybrid Architectures</strong>: Synthesized and tested new combinations of neural primitives for spatiotemporal forecasting</li>
<li><strong>RHN advantage on chaotic systems</strong>: Demonstrated that attention-augmented Recurrent Highway Networks outperform standard Transformers on high-dimensional chaotic systems</li>
<li><strong>Robustness Analysis</strong>: Validated models across both clean physics simulations and noisy real-world industrial datasets</li>
</ul>
<h2 id="motivation">Motivation</h2>
<p>In modern ML, architectures are often transferred from one domain (like NLP) to another (like physical forecasting) without understanding the underlying mechanics. This &ldquo;black box&rdquo; approach leads to suboptimal compute usage and performance ceilings.</p>
<p>Our goal was to break these architectures down. We treated the core mechanisms of <strong>Transformers</strong> and <strong>RNNs</strong> (<strong>Gating, Attention, and Recurrence</strong>) as orthogonal basis vectors. By decoupling these components, we could synthesize and test hybrid architectures to find the best configuration for spatiotemporal forecasting.</p>
<h2 id="methodological-approach">Methodological Approach</h2>
<p>We built a modular framework to mix and match neural primitives. We systematically evaluated:</p>
<ol>
<li><strong>Gating Mechanisms:</strong> Testing Additive, Learned Rate, Input-Dependent, and Coupled Input-Dependent variants</li>
<li><strong>Attention:</strong> Implementing multi-headed attention with relative positional biases</li>
<li><strong>Recurrence:</strong> Testing standard cells (LSTM, GRU) against deeper transition cells like Recurrent Highway Networks (RHN)</li>
</ol>















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/neural-gates.webp"
         alt="Neural gating mechanisms: Additive, Learned Rate, Dependent-Coupled, and Dependent variants"
         title="Neural gating mechanisms: Additive, Learned Rate, Dependent-Coupled, and Dependent variants"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The hierarchy of neural gating mechanisms we tested, from simple additive to fully input-dependent.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/rnn-cell-types.webp"
         alt="RNN cell architectures: Elman, LSTM, GRU, and RHN cells"
         title="RNN cell architectures: Elman, LSTM, GRU, and RHN cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Recurrent cell types compared in our study. The RHN (d) extends processing depth within each timestep.</figcaption>
    
</figure>

<p>This ablation isolated exactly <em>which</em> mathematical operation was driving the performance gain.</p>
<h2 id="key-findings">Key Findings</h2>
<h3 id="recurrent-highway-networks-on-chaotic-systems">Recurrent Highway Networks on Chaotic Systems</h3>
<p>For high-dimensional chaotic systems like the Multiscale Lorenz-96 shown below, we found that a <strong>Recurrent Highway Network (RHN)</strong> augmented with <strong>Attention and Neural Gating</strong> was the top-performing architecture. This hybrid exceeded the forecasting accuracy of standard Transformers, suggesting that deeper recurrence (processing depth per timestep) matters for complex dynamics.</p>















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/multiscale-lorenz.webp"
         alt="Forecasting comparison on Multiscale Lorenz-96 system"
         title="Forecasting comparison on Multiscale Lorenz-96 system"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Forecasting the Multiscale Lorenz-96 system. The top row shows the &rsquo;texture&rsquo; of the chaotic evolution. Notice how the RHN (far right) maintains the coherent wave-like structures for nearly a full Lyapunov time, holding structure longer than the Transformer variants (the plotted window spans two Lyapunov times).</figcaption>
    
</figure>

<h3 id="transformers-recurrence-hurts-gating-helps">Transformers: Recurrence Hurts, Gating Helps</h3>
<p>We attempted to force recurrence into Transformers to give them &ldquo;memory,&rdquo; but it consistently hurt performance. However, <strong>Neural Gating</strong> significantly improved Transformer robustness. For real-world, noisy data (traffic, weather), the <strong>Pre-Layer Normalization (PreLN) Transformer</strong> with added gating proved to be the most robust model.</p>
<h3 id="adding-attention-to-lstms-and-grus">Adding Attention to LSTMs and GRUs</h3>
<p>We tested on the Kuramoto-Sivashinsky equation, a model of turbulence and flame fronts. We found that standard LSTMs and GRUs are under-optimized for this setting: adding <strong>attention</strong> to these cells improved their valid-prediction time several-fold, with the best attention-augmented LSTM and GRU reaching roughly 4x and 6.6x their baseline valid-prediction time, respectively (the paper reports the top RNNs at 2-7x baseline on K-S). (On the partially-observed Multiscale Lorenz-96 system the same attention-plus-gating gain is smaller, more than 40%.)</p>















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/kuramoto-sivashinksy.webp"
         alt="Forecasting comparison on Kuramoto-Sivashinsky system"
         title="Forecasting comparison on Kuramoto-Sivashinsky system"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Forecasting the Kuramoto-Sivashinsky system. The error heatmaps (bottom row) show how prediction quality degrades over time (lighter means larger error). The RHN maintains structural fidelity longer than competing architectures.</figcaption>
    
</figure>

<h3 id="robustness-on-real-world-datasets">Robustness on Real-World Datasets</h3>
<p>While chaotic systems test the limits of theory, we also validated our models on seven standard real-world datasets: the four <strong>Electricity Transformer Temperature (ETT)</strong> subsets plus <strong>Traffic</strong>, <strong>Electricity</strong>, and <strong>Weather</strong>.</p>
<p>Unlike the clean physics simulations, these datasets contain real-world noise and irregularities. In this environment, the <strong>Pre-Layer Normalization (PreLN) Transformer</strong> proved to be the most robust architecture. While it didn&rsquo;t always beat the RHN on pure chaos, its stability makes it a strong default choice for general time-series forecasting tasks where training speed and reliability are paramount.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>This work treats architectural components as independently tunable choices rather than fixed defaults, and that framing surfaces a concrete trade-off. Transformers train in only 25-50% of the time the RNNs require (roughly 2-4x faster), while the attention-augmented RNNs give better inference accuracy on the chaotic physical systems. Which mechanism to select depends on whether the training budget or the forecast precision is the binding constraint, and the ablation makes that an informed choice rather than a default one.</p>
<p>The ablation framework here, treating architectural components as independently tunable factors and measuring their marginal contribution, shaped how later evaluation work is structured. The same principle of isolating variables rather than comparing end-to-end black boxes appears in the document processing research, from benchmark construction in page stream segmentation to grounded evaluation in GutenOCR.</p>
<h2 id="related-work">Related Work</h2>
<p>The methodology here shares a design philosophy with <a href="/research/eigennoise-contrastive-prior/">EigenNoise</a>,
which similarly decomposes a neural mechanism (word vector initialization) into theoretically
grounded components to isolate what drives performance. Both papers treat model components as
testable hypotheses rather than fixed architectural choices.</p>
<p>For broader context on where this fits in the portfolio&rsquo;s Scientific Machine Learning arc,
see the <a href="/research/">Research</a> overview.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2024deconstructingrecurrenceattentiongating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deconstructing Recurrence, Attention, and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter S. Heidenreich and Pantelis R. Vlachas and Petros Koumoutsakos}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2410.02654}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2410.02654}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLMs for Page Stream Segmentation</title><link>https://hunterheidenreich.com/research/llm-page-stream-segmentation/</link><pubDate>Wed, 21 Aug 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/llm-page-stream-segmentation/</guid><description>Enhanced TabMe benchmark for page stream segmentation, creating TabMe++, showing fine-tuned decoder-based LLMs outperform prior models.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Page Stream Segmentation (PSS), the task of correctly dividing a sequence of pages into distinct documents, is a critical first step in automated document processing pipelines. Research in this area has been held back by the lack of high-quality, public datasets.</p>
<p>In this work, we address this issue by enhancing an existing benchmark, <a href="https://github.com/aldolipani/TABME">TabMe</a>, with commercial-grade Optical Character Recognition (OCR) to create <strong><a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a></strong>. This new version significantly reduces noise and improves text detection, highlighting the critical importance of OCR quality for document understanding tasks.</p>
<p>We then conduct the first evaluation of large, decoder-based language models (LLMs) on the PSS task. Our findings show that models like <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a>, when fine-tuned using parameter-efficient methods, <strong>outperform smaller encoder-based models</strong> and traditional baselines. For instance, our best model correctly segments 80% of document streams in the test set without any errors.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Enhanced Public Benchmark (<a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a>)</strong>: Re-processed the entire <a href="https://github.com/aldolipani/TABME">TabMe</a> dataset with commercial OCR, correcting significant text recognition errors and reducing blank pages by over 80% (from 2.27% to 0.38%)</li>
<li><strong>First Application of Large Decoder-Based LLMs to PSS</strong>: Systematically evaluated and fine-tuned billion-parameter, decoder-only LLMs for page stream segmentation</li>
<li><strong>Decoder LLMs Outperform Encoders on TabMe++</strong>: fine-tuned decoder models outperform previous encoder-based and multimodal approaches on the benchmark</li>
<li><strong>OCR Quality Analysis</strong>: Quantified the impact that high-quality OCR has on PSS model performance through comparative experiments</li>
</ul>
<h2 id="the-evolution-of-page-stream-segmentation">The Evolution of Page Stream Segmentation</h2>
<p>The history of PSS can be read as three distinct algorithmic eras, revealing a clear trajectory toward semantic understanding:</p>
<ul>
<li><strong>The Heuristic Era:</strong> Early systems relied on handcrafted rules and region-specific pattern matching (e.g., looking for headers/footers), which failed to generalize across heterogeneous documents.</li>
<li><strong>The Encoder Era:</strong> The field moved to &ldquo;learning-based&rdquo; methods using Convolutional Neural Networks (CNNs) and later Transformer encoders like LayoutLM and LEGAL-BERT. While better, these often required complex multimodal architectures.</li>
<li><strong>The Decoder Era (this work):</strong> This work establishes the viability of the third era: using billion-parameter generative models (decoder-only LLMs) which simplify the architecture while improving semantic reasoning.</li>
</ul>
<blockquote>
<p><strong>Blog Post:</strong> Read the full story of these eras in <a href="/posts/history-of-page-stream-segmentation/">The Evolution of Page Stream Segmentation</a>.</p></blockquote>
<h2 id="key-evaluation-metrics">Key Evaluation Metrics</h2>
<p>Beyond standard F1 scores, the study evaluates models on metrics that directly translate to operational costs:</p>
<ul>
<li><strong>Straight-Through Processing (STP):</strong> The percentage of document streams segmented <em>perfectly</em>, requiring zero human intervention. The fine-tuned <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a> achieved an STP of <strong>0.800</strong>, meaning 80% of streams were fully automated. In contrast, the traditional XGBoost baseline achieved only <strong>0.074</strong>.</li>
<li><strong>Minimum Number of Drag-and-Drops (MNDD):</strong> A proxy for human effort, measuring how many pages a human would need to move to correct the segmentation. The best LLM reduced this &ldquo;effort metric&rdquo; by over <strong>13x</strong> compared to the XGBoost baseline (0.81 vs 10.85).</li>
</ul>
<h3 id="document-level-precision-and-recall">Document-Level Precision and Recall</h3>
<p>We define a ground truth segmentation $\mathcal{G}$ and a predicted segmentation $\mathcal{P}$. A &ldquo;True Positive&rdquo; is defined strictly as a document present in both sets ($\mathcal{P} \cap \mathcal{G}$). The metrics are calculated as:</p>
<p>$$P = \frac{|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P} \cap \mathcal{G}| + |\mathcal{P} \setminus \mathcal{G}|}$$</p>
<p>$$R = \frac{|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P} \cap \mathcal{G}| + |\mathcal{G} \setminus \mathcal{P}|}$$</p>
<p>This rigorous definition ensures that a model is only rewarded if it gets <em>both</em> the start and end boundaries of a document correct.</p>
<h2 id="technical-innovation">Technical Innovation</h2>
<p>Our approach combines commercial-grade OCR processing with parameter-efficient fine-tuning of large language models. We addressed two main bottlenecks: data quality and model efficiency.</p>
<h3 id="data-remediation">Data Remediation</h3>
<p>The original <a href="https://github.com/aldolipani/TABME">TabMe</a> dataset relied on Tesseract OCR, which introduced significant noise. By reprocessing the images with Microsoft OCR, we reduced the number of &ldquo;blank&rdquo; pages from <strong>2.27% to just 0.38%</strong>, recovering critical features like titles and ID numbers that were previously lost.</p>
<h3 id="model-architecture">Model Architecture</h3>
<p>We formulated the task as a <strong>binary classification of page pairs</strong>: predicting if a &ldquo;break&rdquo; exists between Page $N$ and Page $N+1$.</p>
<h4 id="problem-formulation">Problem Formulation</h4>
<p>The task is treated as a binary classification problem over a window of pages. For a specific page $p_i$, the model predicts a binary label $y_i$ based on a window of adjacent pages $(p_{i-l}, \ldots, p_i, \ldots, p_{i+r})$. In this work, we strictly defined the window as:</p>
<p>$$l=1, \quad r=0$$</p>
<p>This means the decision for page $p_i$ is made solely based on the pair $(p_{i-1}, p_i)$.</p>
<h3 id="efficient-tuning">Efficient Tuning</h3>
<p>We utilized <strong>Low-Rank Adaptation (LoRA)</strong> and 4-bit quantization to fine-tune <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a> and <a href="https://huggingface.co/unsloth/Phi-3-mini-4k-instruct">Phi-3-mini</a> on a single NVIDIA H100 GPU, proving that PSS does not require massive compute clusters.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>Page Stream Segmentation is the critical first step in any automated document processing pipeline. If a system fails to correctly separate documents, all downstream tasks (like classification or data extraction) will operate on corrupted inputs. By demonstrating that parameter-efficiently fine-tuned LLMs can achieve an 80% straight-through processing rate, this work provides a viable path toward fully automating high-volume document workflows.</p>
<p>Beyond the path to automation, this work gives the research community improved evaluation tools: the enhanced <a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a> dataset and the quantified impact of OCR quality on PSS performance have direct applications in commercial document processing pipelines.</p>
<p>We later extended these findings to real-world industrial deployment and analyzed model calibration challenges in our follow-up <a href="/research/page-stream-segmentation-llms/">COLING Industry paper on LLMs for Insurance Document Automation</a>. The calibration challenges that emerged from that deployment are explored in depth in <a href="/posts/reliability-trap-document-automation/">The Reliability Trap: When 99% Accuracy Isn&rsquo;t Enough</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2024largelanguagemodelspage,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large Language Models for Page Stream Segmentation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Ratish Dalvi and Rohith Mukku and Nikhil Verma and Neven Pičuljan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2408.11981}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2408.11981}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Sarcasm Detection with Transformers: A Cautionary Tale</title><link>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</link><pubDate>Sun, 25 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</guid><description>Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that learned to classify news sources.</description><content:encoded><![CDATA[<h2 id="why-sarcasm-detection-is-hard">Why Sarcasm Detection Is Hard</h2>
<p>Sarcasm detection represents one of the most challenging problems in NLP. The difficulties include:</p>
<p><strong>Context dependence</strong>: Sarcasm relies on situational knowledge and shared understanding that extends beyond the text itself.</p>
<p><strong>Subtlety</strong>: Even humans struggle with sarcastic interpretation, especially in written text without vocal cues.</p>
<p><strong>Cultural variability</strong>: Sarcastic expressions vary significantly across cultures and regions.</p>
<p><strong>Annotation disagreement</strong>: Human annotators often disagree on what constitutes sarcasm.</p>
<p>These challenges raise a fundamental question: can sarcasm detection be well-defined as a computational problem? This case study explores what happens when we try (and reveals a common pitfall in dataset construction).</p>
<h2 id="the-dataset-a-hidden-flaw">The Dataset: A Hidden Flaw</h2>
<p>I used the <a href="https://huggingface.co/datasets/raquiba/Sarcasm_News_Headline">Sarcasm News Headlines dataset</a>, which combines headlines from <a href="https://theonion.com/">The Onion</a> (satirical) and <a href="https://www.huffpost.com/">The Huffington Post</a> (traditional news). The dataset contains ~50,000 examples.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> datasets <span style="color:#f92672">import</span> load_dataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> load_dataset(<span style="color:#e6db74">&#34;raquiba/Sarcasm_News_Headline&#34;</span>)
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">1</span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>{&#39;headline&#39;: &#39;thirtysomething scientists unveil doomsday clock of hair loss&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 1}
</span></span><span style="display:flex;"><span>{&#39;headline&#39;: &#39;dem rep. totally nails why congress is falling short on gender, racial equality&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 0}
</span></span></code></pre></div><p><strong>The critical flaw</strong>: This dataset uses binary classification based on source domain. The Onion headlines are labeled sarcastic, HuffPost headlines are not. This creates a dangerous shortcut where models learn to detect the publication source.</p>
<p>After preprocessing to standardize column names:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">lambda</span> example: {<span style="color:#e6db74">&#34;text&#34;</span>: example[<span style="color:#e6db74">&#34;headline&#34;</span>], <span style="color:#e6db74">&#34;label&#34;</span>: example[<span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]},
</span></span><span style="display:flex;"><span>    remove_columns<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;headline&#34;</span>, <span style="color:#e6db74">&#34;article_link&#34;</span>, <span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="fine-tuning-roberta">Fine-Tuning RoBERTa</h2>
<p>I fine-tuned a pre-trained RoBERTa model using standard practices:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model_name <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;FacebookAI/roberta-base&#34;</span>
</span></span><span style="display:flex;"><span>tokenizer <span style="color:#f92672">=</span> AutoTokenizer<span style="color:#f92672">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(model_name, num_labels<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Tokenize the data</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">tokenize_function</span>(examples):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> tokenizer(examples[<span style="color:#e6db74">&#34;text&#34;</span>], truncation<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, max_length<span style="color:#f92672">=</span><span style="color:#ae81ff">512</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tokenized_datasets <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(tokenize_function, batched<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Training configuration</span>
</span></span><span style="display:flex;"><span>training_args <span style="color:#f92672">=</span> TrainingArguments(
</span></span><span style="display:flex;"><span>    output_dir<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;./results&#34;</span>,
</span></span><span style="display:flex;"><span>    num_train_epochs<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>,
</span></span><span style="display:flex;"><span>    per_device_train_batch_size<span style="color:#f92672">=</span><span style="color:#ae81ff">32</span>,
</span></span><span style="display:flex;"><span>    evaluation_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    save_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    load_best_model_at_end<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer <span style="color:#f92672">=</span> Trainer(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span>model,
</span></span><span style="display:flex;"><span>    args<span style="color:#f92672">=</span>training_args,
</span></span><span style="display:flex;"><span>    train_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;train&#34;</span>],
</span></span><span style="display:flex;"><span>    eval_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;test&#34;</span>],
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#f92672">=</span>tokenizer,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer<span style="color:#f92672">.</span>train()
</span></span></code></pre></div><h2 id="results-too-good-to-be-true">Results: Too Good to Be True</h2>
<p>The model achieved high accuracy:</p>
<table>
  <thead>
      <tr>
          <th>Epoch</th>
          <th>Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>96.3%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>97.8%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>99.8%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>99.8%</td>
      </tr>
  </tbody>
</table>
<p>This should immediately raise red flags. Sarcasm detection is notoriously difficult, even for humans. Such high accuracy suggests the model learned a proxy task.</p>
<p>My hypothesis: <strong>The model bypassed sarcasm detection entirely, learning only to distinguish between The Onion and HuffPost writing styles.</strong></p>
<h2 id="interacting-with-the-model">Interacting with the Model</h2>
<p>Let&rsquo;s test our hypothesis by interacting with the model.</p>
<p>First, let&rsquo;s load the model and tokenizer:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> pipeline
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(<span style="color:#e6db74">&#39;results/2024-02-25_20-24-51/checkpoint-4475&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>clf <span style="color:#f92672">=</span> pipeline(<span style="color:#e6db74">&#39;text-classification&#39;</span>, model<span style="color:#f92672">=</span>model, tokenizer<span style="color:#f92672">=</span>tokenizer)
</span></span></code></pre></div><p>Now, let&rsquo;s test the model with some examples.</p>
<p>First, let&rsquo;s try an Onion article from this week, something I know to be sarcastic and not in the training data.
Let&rsquo;s use <a href="https://theonion.com/alabama-supreme-court-justice-invokes-veggietales-in-1851282252/">&ldquo;Alabama Supreme Court Justice Invokes &lsquo;VeggieTales&rsquo; In Ruling&rdquo;</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Alabama Supreme Court Justice Invokes ‘VeggieTales&#39; In Ruling&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.99916672706604}]
</span></span></code></pre></div><p>The model is extremely confident that this is not sarcastic.</p>
<p>Let&rsquo;s try a different Onion article, possibly even more difficult: <a href="https://theonion.com/trump-booed-frozen-burritos-and-more-this-week-in-br-1851282066/">Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993497729301453}]
</span></span></code></pre></div><p>Again, very confident that this is not sarcastic. Hmm. It could be the temporal accuracy of our model just cannot capture the sarcasm of the Onion in 2024.</p>
<p>Let&rsquo;s try one more Onion article, this one that is still recent but a bit more of a low-hanging fruit: <a href="https://theonion.com/mom-only-likes-the-other-outback-steakhouse-1851265335/">Mom Only Likes The Other Outback Steakhouse</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Mom Only Likes The Other Outback Steakhouse&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_1&#39;, &#39;score&#39;: 0.9997231364250183}]
</span></span></code></pre></div><p>Finally, a correct prediction! The model is confident that this is sarcastic.
Our model detects only very specific types of sarcasm. It fails to generalize to new, unseen data within the same domain.</p>
<p>Let&rsquo;s also try some headlines from the Huffington Post, which the model should predict as not sarcastic.
Let&rsquo;s try the five most recent headlines from the Huffington Post:</p>
<ul>
<li><a href="https://www.huffpost.com/entry/donald-trump-south-carolina-nikki-haley_n_65db61f5e4b0e4346d52bed8">Donald Trump Won South Carolina - But There&rsquo;s 1 Big Caveat</a></li>
<li><a href="https://www.huffpost.com/entry/israeli-embassy-washington-man-set-fire_n_65db9364e4b0e4346d52ce3d">Man Sets Himself On Fire In Front Of Israeli Embassy In Washington</a></li>
<li><a href="https://www.huffpost.com/entry/bc-ml-israel-palestinians-temporary-truce-cease-fire_n_65db2e9ae4b0189a6a7e32ea">Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange</a></li>
<li><a href="https://www.huffpost.com/entry/george-latimer-race-comments-democratic-primary_n_65d8fac3e4b0cc1f2f7bafd8">A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.</a></li>
<li><a href="https://www.huffpost.com/entry/mongolia-climate-change-extreme-weather_n_65d90294e4b0cc1f2f7bb527">Climate Change-Fueled Winter Extremes Put 90% Of This Country At &lsquo;High Risk&rsquo;</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf([
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Donald Trump Won South Carolina - But There&#39;s 1 Big Caveat&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Man Sets Himself On Fire In Front Of Israeli Embassy In Washington&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Climate Change-Fueled Winter Extremes Put 90% Of This Country At &#39;High Risk&#39;&#34;</span>
</span></span><span style="display:flex;"><span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993808269500732},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993786811828613},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9985186457633972},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993883371353149},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993487000465393}]
</span></span></code></pre></div><p>The model is extremely confident that these are not sarcastic.</p>
<p>The model detects sarcasm in limited cases. It fails to generalize to new, unseen data within the same domain. This is a common problem in machine learning. Training a model that performs well on a specific dataset is straightforward. Training a model that generalizes to new, unseen data remains a significant challenge.
Furthermore, our sarcasm detection project resulted in a domain classifier. For fuzzier concepts like sarcasm, it&rsquo;s important to be clear about what we&rsquo;re actually detecting, and to collect the necessary scale of data to capture the full range of the concept.</p>
<h2 id="key-takeaways">Key Takeaways</h2>
<p>This case study reveals a fundamental problem in ML: <strong>high accuracy guarantees only performance on the training distribution</strong>. Here&rsquo;s what actually happened:</p>
<ol>
<li><strong>Dataset bias</strong>: Using publication source as a proxy for sarcasm created a shortcut for the model</li>
<li><strong>Domain classification</strong>: The model exclusively learned to distinguish writing styles</li>
<li><strong>Poor generalization</strong>: New examples from the same sources often failed</li>
</ol>
<p>This is a common pitfall when building datasets for subjective concepts. The lesson: high accuracy must be accompanied by validation of the model&rsquo;s actual learned behavior.</p>
<p>For better sarcasm detection, we&rsquo;d need:</p>
<ul>
<li>Diverse sources beyond two publications</li>
<li>Human annotation across multiple contexts</li>
<li>Careful evaluation on out-of-domain examples</li>
</ul>
<p>Instructive failures in ML projects provide valuable lessons about our assumptions and the limitations of our approaches.</p>
]]></content:encoded></item><item><title>Look, Don't Tweet: Unified Data Models for Social NLP</title><link>https://hunterheidenreich.com/research/look-dont-tweet/</link><pubDate>Wed, 30 Jun 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/look-dont-tweet/</guid><description>PyConversations library and unified data schema for normalizing 300M+ posts across Twitter, Reddit, Facebook, and 4chan.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>This is my undergraduate senior thesis, completed at Drexel University in 2021. The scope (308 million posts across four platforms, structural topology analysis, and domain adaptation experiments with Transformer models) was unusually broad for a senior thesis, spanning large-scale data engineering, graph-structural analysis, and representation-learning experiments.</p>
<p>Social media research is often siloed by platform, with tools built specifically for Twitter&rsquo;s flat structure or Reddit&rsquo;s tree structure. This fragmentation makes cross-platform analysis difficult. In this work, I introduce <strong><a href="https://github.com/hunter-heidenreich/pyconversations">PyConversations</a></strong>, an open-source Python package that normalizes data from Twitter, Facebook, Reddit, and 4chan into a single, platform-agnostic data model. <em>(Note: the repository is archived and no longer actively maintained.)</em></p>
<p>Leveraging this tool, I processed over <strong>308 million posts</strong> to analyze the structural &ldquo;shape&rdquo; of online conversations. I then evaluated the efficacy of domain-adaptive pre-training (DAPT) for Transformer-based language models, finding that training on a toxic domain (4chan) boosts hate-speech detection by over 5 F1.</p>
<h2 id="the-engineering-problem-data-normalization">The Engineering Problem: Data Normalization</h2>
<p>Social media platforms impose different structural constraints on discourse, making it difficult to feed heterogeneous data into a single ML pipeline:</p>
<ul>
<li><strong>Twitter:</strong> Technically allows infinite depth, but functionally operates as a flat stream or shallow tree.</li>
<li><strong>Facebook:</strong> Enforces a hard limit of two depth levels (comments and replies), resulting in &ldquo;short and fat&rdquo; conversation trees.</li>
<li><strong>Reddit &amp; 4chan:</strong> Allow for deep, branching tree structures.</li>
</ul>
<p>To solve this, I designed a <strong>Universal Message Schema</strong> and the <strong>PyConversations</strong> library. This system ingests raw dumps from these disparate sources and maps them to a unified Directed Acyclic Graph (DAG) format, preserving the parent-child relationships regardless of the source platform&rsquo;s constraints.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>PyConversations Library</strong>: An open-source package for robust conversational analysis, featuring graph-based traversing and filtering.</li>
<li><strong>Massive Dataset Analysis</strong>: Processed a collection of <strong>308 million posts</strong> and <strong>15.8 million conversations</strong>, creating one of the largest comparative cross-platform analyses at the time of thesis submission.</li>
<li><strong>Structural Insights</strong>: Quantified how UI constraints shape human behavior. For instance, Facebook&rsquo;s depth limit forces users to &ldquo;bunch&rdquo; comments, creating uniquely wide conversation trees compared to Reddit&rsquo;s deep, narrow threads.</li>
<li><strong>Domain Adaptation Experiments</strong>: Continued-pretrained RoBERTa on platform-specific slices (e.g., the 4chan-adapted <code>RoBERTa-4chan</code>), demonstrating that exposing models to toxic domains improved hate-speech detection F1 by over 5 points.</li>
</ul>
<h2 id="structural-analysis-findings">Structural Analysis Findings</h2>
<p>By treating conversations as graphs, we uncovered distinct topological signatures for each platform:</p>
<h3 id="the-shape-of-discourse">The &ldquo;Shape&rdquo; of Discourse</h3>
<p>We measured the <strong>width</strong> (max posts at any depth) and <strong>depth</strong> (max distance from root) of conversation trees.</p>
<ul>
<li><strong>Facebook</strong> exhibited a &ldquo;short and fat&rdquo; topology due to its 2-level nesting limit.</li>
<li><strong>4chan</strong> threads were surprisingly shallow despite having no depth limits. This suggests that the platform&rsquo;s <strong>ephemerality</strong> (threads are deleted quickly) and the &ldquo;bump limit&rdquo; mechanic discourage long-term dialogue, though data scraping limitations on this transient platform also contribute to this topology.</li>
<li><strong>Reddit</strong> maintained the most robust tree structures, with &ldquo;good faith&rdquo; communities like <em>r/ChangeMyView</em> showing distinct patterns of sustained engagement.</li>
</ul>
<h3 id="information-density">Information Density</h3>
<p>We analyzed <strong>Innovation Rate</strong>, a measure of how quickly a text introduces new vocabulary. We found that Twitter threads have negative innovation rates (indicating high novelty per token) likely forced by the strict character limits. In contrast, Reddit posts showed higher redundancy, typical of longer-form essay writing.</p>
<h2 id="representation-learning--domain-adaptation">Representation Learning &amp; Domain Adaptation</h2>
<p>We experimented with &ldquo;Warm-Start&rdquo; tuning: taking a standard RoBERTa model and pre-training it further on platform-specific data before fine-tuning on downstream tasks (TweetEval).</p>
<ul>
<li><strong>Limited gains on most general tasks:</strong> Domain-adaptive pre-training added little on sentiment and emotion (from well under 1 up to a few F1 points), with irony detection the exception (+5.6 to +5.9 F1). Base RoBERTa already covers most of the signal for general NLP tasks.</li>
<li><strong>The Toxic Exception:</strong> The notable exception was <strong>Hate Speech Detection</strong>. The 4chan-adapted model (<code>RoBERTa-4chan</code>) was the strongest here, outperforming the baseline by over 5 F1. This highlights that for specialized, out-of-distribution language (like toxic slang), domain adaptation remains valuable.</li>
</ul>
<h2 id="significance">Significance</h2>
<p>This work bridges the gap between <strong>Computational Social Science</strong> and <strong>ML Engineering</strong>. It provides the community with a reusable tool (<code>PyConversations</code>) to handle the messy reality of social data and offers empirical evidence on the limits and benefits of domain-adaptive pre-training for LLMs.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@thesis</span>{heidenreich2021look,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Look, Don&#39;t Tweet: Representation Learning and Social Media}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">school</span>=<span style="color:#e6db74">{Drexel University}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span>=<span style="color:#e6db74">{Undergraduate Senior Thesis}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For related work on how social media content surfaces in digital journalism, including a dataset of embedded tweets across 273,899 news articles, see <a href="/research/newstweet-social-media-journalism/">NewsTweet Dataset: Social Media in Digital Journalism</a>.</p>
]]></content:encoded></item><item><title>PyConversations: Social Media Conversational Analysis</title><link>https://hunterheidenreich.com/projects/pyconversations-social-media-analysis/</link><pubDate>Tue, 01 Jun 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/pyconversations-social-media-analysis/</guid><description>Undergraduate thesis exploring representation learning for social media text and developing tools for cross-platform conversational analysis.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Undergraduate thesis exploring representation learning for social media text and developing tools for cross-platform conversational analysis. Built PyConversations, a Python module for analyzing social media conversations, and found that domain-specific approaches often outperform large pre-trained models.</p>
<h2 id="features">Features</h2>
<h3 id="pyconversations-module">PyConversations Module</h3>
<ul>
<li><strong>Graph-based modeling</strong>: Models conversations as Directed Acyclic Graphs (DAGs) to quantify topological structure (depth, width, density)</li>
<li><strong>Unified interface</strong>: Polymorphic design normalizing heterogeneous data from Twitter, Reddit, 4chan, and Facebook into a single analysis schema</li>
<li><strong>Linguistic dynamics</strong>: Implements information-theoretic feature extraction, including harmonic mixing laws and entropy measures</li>
<li><strong>Stream processing</strong>: Memory-efficient generators to ingest and traverse multi-gigabyte JSON dumps (e.g., 135M+ Reddit posts) without loading the full corpus into RAM</li>
</ul>
<h3 id="research-contributions">Research Contributions</h3>
<ul>
<li><strong>Representation learning</strong>: Investigated domain-specific vs. general-purpose Transformers (BERT vs. specialized variants) on social media text</li>
<li><strong>Topological analysis</strong>: Demonstrated that conversational structure (context) is as critical as content for classification tasks</li>
<li><strong>Cross-platform study</strong>: Comparative analysis of communication dynamics across moderated (Reddit/Twitter) and unmoderated (4chan) spaces</li>
</ul>
<h2 id="usage">Usage</h2>
<p>The PyConversations module can be imported into Python scripts to parse and analyze social media datasets.</p>
<h2 id="results">Results</h2>
<ul>
<li><strong>Model performance</strong>: Smaller, domain-specific approaches frequently outperformed standard pre-trained models</li>
<li><strong>Context importance</strong>: Conversational context and dialogue structure proved crucial for understanding social media interactions</li>
<li><strong>Domain adaptation</strong>: Social media text benefits from specialized handling over generic approaches</li>
<li><strong>Cross-platform challenges</strong>: Different platforms require adapted approaches despite seeming similarities</li>
</ul>
<h2 id="team--recognition">Team &amp; Recognition</h2>
<ul>
<li><strong>Hunter Heidenreich</strong> - Lead Researcher and Developer</li>
<li><strong>Jake Williams</strong> - Faculty Advisor</li>
<li><strong>First Place - Research Undergraduate Senior Thesis</strong> at Drexel University</li>
</ul>
<h2 id="impact">Impact</h2>
<p>This library served as the engineering backbone for my thesis, <a href="/research/look-dont-tweet/">Look, Don&rsquo;t Tweet</a>, enabling the processing of 308 million posts to evaluate Transformer performance on toxic data.</p>
<p>The findings about model performance suggested that specialized domains require tailored model architectures, a perspective that has become more relevant as the field continues to evolve.</p>
]]></content:encoded></item><item><title>GPT-2 Susceptibility to Universal Adversarial Triggers</title><link>https://hunterheidenreich.com/research/gpt2-adversarial-triggers/</link><pubDate>Sat, 01 May 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/gpt2-adversarial-triggers/</guid><description>Investigation into whether universal adversarial triggers can control both topic and stance of GPT-2's generated text and security implications.</description><content:encoded><![CDATA[<blockquote>
<p><strong>Historical context:</strong> This paper was published at AIES 2021, the AAAI/ACM Conference on AI, Ethics, and Society, predating the modern red-teaming practices and adversarial robustness benchmarks that emerged with instruction-tuned and RLHF-trained models. GPT-2 is now a historical baseline, but the core methodology and findings remain a relevant foundation for current adversarial robustness work.</p></blockquote>
<h2 id="abstract">Abstract</h2>
<p>This work investigates universal adversarial triggers (UATs), a method for disrupting language models using input-agnostic token sequences. We investigated whether it is possible to use these triggers to control the <strong>topic</strong> and the <strong>stance</strong> of text generated by GPT-2. Across four controversial topics, we demonstrated success in identifying triggers that guide the model to produce text on a targeted subject and influence the position it takes. Our goal is to raise awareness that even deployed models are susceptible to this influence and to advocate for immediate safeguards.</p>
<h2 id="key-findings--contributions">Key Findings &amp; Contributions</h2>
<ul>
<li><strong>Topic and Stance Control</strong>: We were the first to systematically explore using UATs to control both the topic and the stance of a language model&rsquo;s output. We found that controlling the topic is highly feasible, and controlling the stance is also possible.</li>
<li><strong>The &ldquo;Filter Bubble&rdquo; Hypothesis</strong>: We observed that triggers for fringe topics (e.g., Flat Earth) were harder to find but offered a higher degree of stance control than broader topics. We posit this may reflect &ldquo;filter bubbles&rdquo; in the training data, where fringe viewpoints use distinct linguistic patterns.</li>
<li><strong>Ethical &amp; Security Analysis</strong>: We highlighted the security risks of deployed models being manipulated by external adversaries without internal model access. To be responsible, we withheld the most sensitive triggers we discovered.</li>
<li><strong>Constructive Applications</strong>: Beyond a security flaw, we proposed that UATs could be used constructively as a <strong>diagnostic tool</strong> to audit models for bias or as a method for <strong>bot detection</strong> on social media.</li>
</ul>
<h2 id="significance--why-this-matters">Significance &amp; Why This Matters</h2>
<p>This work extended early research on UATs by moving beyond single-issue attacks (like generating toxic content) to a nuanced analysis of topic and stance control. It demonstrated that a <strong>gradient-based search process (adapting HotFlip)</strong> is effective at manipulating model outputs, emphasizing a critical vulnerability for any organization deploying large language models.</p>
<p>For ML practitioners and security researchers, this highlights the importance of robust safeguards against input-agnostic attacks. It also opens the door to using these same adversarial techniques constructively: as diagnostic tools to audit models for hidden biases or to detect automated bot activity on social media platforms.</p>
<h2 id="related-work">Related Work</h2>
<p>The constructive bot-detection application proposed here connects directly to empirical work on coordinated inauthentic behavior. <a href="/research/coordinated-social-targeting/">Coordinated Social Targeting on Twitter</a> documents real-world follower-manipulation patterns on high-profile accounts, illustrating the kind of automated adversarial activity that UAT-based detection methods could help identify.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{10.1145/3461702.3462578,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Heidenreich, Hunter Scott and Williams, Jake Ryland}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The Earth Is Flat and the Sun Is Not a Star: The Susceptibility of GPT-2 to Universal Adversarial Triggers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{9781450384735}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computing Machinery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{New York, NY, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1145/3461702.3462578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3461702.3462578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{566--573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">numpages</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{adversarial attacks, bias, language modeling, natural language processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">location</span> = <span style="color:#e6db74">{Virtual Event, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{AIES &#39;21}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>