<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Multimodal Molecular Models on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/</link><description>Recent content in Multimodal Molecular Models on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 28 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/index.xml" rel="self" type="application/rss+xml"/><item><title>MoMu: Bridging Molecular Graphs and Natural Language</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/momu-molecular-multimodal-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/momu-molecular-multimodal-foundation/</guid><description>MoMu bridges molecular graphs and natural language via contrastive pre-training, enabling cross-modal retrieval, captioning, and property prediction.</description><content:encoded><![CDATA[<h2 id="bridging-molecular-graphs-and-natural-language-through-contrastive-learning">Bridging Molecular Graphs and Natural Language Through Contrastive Learning</h2>
<p>MoMu (Molecular Multimodal foundation model) is a <strong>Method</strong> paper that proposes a multimodal pre-training approach to associate molecular graphs with natural language descriptions. The primary contribution is a dual-encoder architecture, consisting of a Graph Isomorphism Network (GIN) for molecular graphs and a BERT-based text encoder, jointly trained through contrastive learning on weakly-correlated graph-text pairs collected from scientific literature. The pre-trained model supports four downstream capabilities: cross-modal retrieval (graph-to-text and text-to-graph), molecule captioning, zero-shot text-to-graph molecule generation, and molecular property prediction.</p>
<h2 id="why-single-modality-models-are-insufficient-for-molecular-understanding">Why Single-Modality Models Are Insufficient for Molecular Understanding</h2>
<p>Existing AI models for molecular tasks generally operate on a single modality and learn a single cognitive ability. Language-based models process <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings or natural language texts and handle tasks like property prediction from strings, literature comprehension, or SMILES-based generation. Graph-based models use molecular graph representations and handle graph-level property prediction or graph generation. Neither category connects structural information from molecular graphs with the rich semantic knowledge encoded in scientific texts.</p>
<p>Prior work by Zeng et al. (KV-PLM) jointly modeled molecule-related texts and SMILES strings, but SMILES representations have inherent drawbacks: they are one-dimensional and may lose structural information, they cannot capture structural similarities between molecules, and a single molecule can have multiple valid SMILES representations. Molecular graphs, by contrast, are more intuitive and better reveal functional structures. Human experts learn molecular knowledge by associating both graphical representations and textual descriptions, yet no prior model bridged these two modalities directly.</p>
<p>The key challenge is the scarcity of paired molecular graph-text data compared to general image-text datasets. Additionally, learning specialized molecular knowledge requires foundational cognitive abilities in both the graph and text domains, making training from scratch infeasible with limited data.</p>
<h2 id="contrastive-pre-training-with-inter-modal-and-intra-modal-objectives">Contrastive Pre-Training with Inter-Modal and Intra-Modal Objectives</h2>
<p>MoMu consists of two encoders initialized from pre-trained unimodal models: a GIN graph encoder initialized from GraphCL self-supervised weights, and a BERT text encoder initialized from either Sci-BERT (yielding MoMu-S) or KV-PLM (yielding MoMu-K).</p>
<h3 id="data-collection">Data Collection</h3>
<p>The authors collect 15,613 molecular graph-document pairs by:</p>
<ol>
<li>Gathering names, synonyms, and SMILES for the top 50K compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></li>
<li>Converting SMILES to molecular graphs using the OGB <code>smiles2graph</code> function</li>
<li>Retrieving related text from the S2ORC corpus (136M+ papers) by querying with molecule names, filtering to Medicine, Biology, Chemistry, and Computer Science fields</li>
<li>Restricting retrieval to abstract, introduction, and conclusion sections to avoid experimental data artifacts</li>
</ol>
<h3 id="contrastive-training-objective">Contrastive Training Objective</h3>
<p>For each graph-text pair in a mini-batch of $N$ pairs, MoMu applies two graph augmentations (node dropping and subgraph extraction) to create two augmented graphs, and randomly samples two sentences from the document. This produces $2N$ graph representations $\{z_1^G, \tilde{z}_1^G, \ldots, z_N^G, \tilde{z}_N^G\}$ and $2N$ text representations $\{z_1^T, \tilde{z}_1^T, \ldots, z_N^T, \tilde{z}_N^T\}$.</p>
<p>The cross-modal contrastive loss for a pair $(z_i^G, z_i^T)$ is:</p>
<p>$$
\ell_i^{(z_i^G, z_i^T)} = -\log \frac{\exp(\text{sim}(z_i^G, z_i^T) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, z_j^T) / \tau)}
$$</p>
<p>where $\tau$ is the temperature parameter and $\text{sim}(\cdot, \cdot)$ projects both representations into a shared 256-dimensional space before computing cosine similarity. The total cross-modal loss includes four contrastive terms for each pair: $(z_i^G, z_i^T)$, $(\tilde{z}_i^G, z_i^T)$, $(z_i^G, \tilde{z}_i^T)$, and $(\tilde{z}_i^G, \tilde{z}_i^T)$.</p>
<p>An intra-modal graph contrastive loss further strengthens the graph encoder:</p>
<p>$$
\ell_i^{(z_i^G, \tilde{z}_i^G)} = -\log \frac{\exp(\text{sim}(z_i^G, \tilde{z}_i^G) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, \tilde{z}_j^G) / \tau)}
$$</p>
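<p>The inter- and intra-modal objectives above share the same InfoNCE form. The following is a minimal NumPy sketch with random stand-in embeddings (the batch size, dimensionality, and loose summation of terms are illustrative assumptions, not MoMu's actual implementation):</p>

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """InfoNCE: row i of z_a is contrasted against all rows of z_b,
    with row i of z_b as the positive. Shapes: (N, d)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)  # unit-normalize so the
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)  # dot product is cosine sim
    logits = z_a @ z_b.T / tau                              # (N, N) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                     # -log p(positive), averaged

rng = np.random.default_rng(0)
N, d = 8, 256
z_g, z_g_tilde = rng.normal(size=(N, d)), rng.normal(size=(N, d))  # two graph augmentations
z_t, z_t_tilde = rng.normal(size=(N, d)), rng.normal(size=(N, d))  # two sampled sentences

# Cross-modal loss: all four graph-view / text-view combinations.
cross = sum(info_nce(g, t) for g in (z_g, z_g_tilde) for t in (z_t, z_t_tilde))
# Intra-modal loss: the two augmentations of the same graph are positives.
intra = info_nce(z_g, z_g_tilde)
loss = cross + intra
```

<p>Aligned pairs score strictly lower loss than mismatched ones, which is what drives the shared embedding space.</p>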
<h3 id="zero-shot-text-to-graph-generation">Zero-Shot Text-to-Graph Generation</h3>
<p>MoMu enables a zero-shot generation pipeline by combining the pre-trained MoMu encoders with MoFlow, a flow-based molecular generator. Given an input text description $x^T$, the method:</p>
<ol>
<li>Samples a latent variable $q$ from MoFlow&rsquo;s Gaussian prior $P(q)$</li>
<li>Generates a molecular graph through MoFlow&rsquo;s reverse flows: $\hat{E} = f_g^{-1}(q_e)$ and $\hat{V} = f_c^{-1}(q_v \mid GN(\hat{E}))$</li>
<li>Feeds $\hat{V}$ (using soft atom type probabilities instead of hard assignments) into MoMu&rsquo;s graph encoder</li>
<li>Optimizes $q$ to maximize the cosine similarity between the resulting graph and text representations:</li>
</ol>
<p>$$
\ell_q = -\text{sim}(z^G, z^T) / \tau
$$</p>
<p>All MoMu and MoFlow parameters are frozen; only $q$ is updated via Adam for up to 500 iterations. The final molecule is obtained by applying argmax to the optimized probability matrices $\hat{V}$ and $\hat{E}$.</p>
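<p>The optimization loop can be illustrated with frozen random linear maps standing in for MoFlow's reverse flow and MoMu's graph encoder. Everything here is a toy assumption: the analytic cosine gradient and plain gradient ascent replace the paper's Adam updates, and the dimensions are arbitrary.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_z = 64, 32
W_gen = rng.normal(size=(d_q, d_q)) / np.sqrt(d_q)  # stand-in for MoFlow's frozen reverse flow
W_enc = rng.normal(size=(d_z, d_q)) / np.sqrt(d_q)  # stand-in for MoMu's frozen graph encoder
z_t = rng.normal(size=d_z)                           # text embedding of the query description

def cos_and_grad(q):
    """Cosine similarity between the encoded generated graph and the text,
    plus its analytic gradient with respect to the latent q."""
    g = W_enc @ (W_gen @ q)                          # "generate" a graph, then encode it
    ng, nt = np.linalg.norm(g), np.linalg.norm(z_t)
    c = g @ z_t / (ng * nt)
    dg = z_t / (ng * nt) - c * g / ng**2             # d cos / d g (perpendicular to g)
    return c, (W_enc @ W_gen).T @ dg                 # chain rule back to q

q = rng.normal(size=d_q)                             # sample from the Gaussian prior
c0, _ = cos_and_grad(q)
for _ in range(500):                                 # paper uses Adam; plain ascent here
    _, grad = cos_and_grad(q)
    q += 1.0 * grad                                  # only q is updated; weights stay frozen
c_final, _ = cos_and_grad(q)
```

<p>Only the latent $q$ moves; the similarity between the generated graph's embedding and the text embedding increases over the iterations.</p>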
<h2 id="evaluation-across-four-downstream-tasks">Evaluation Across Four Downstream Tasks</h2>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>MoMu is evaluated on the PCdes dataset (15K SMILES-description pairs from PubChem, split 10,500/1,500/3,000 for train/val/test). Retrieval is performed in mini-batches of 64 pairs, reporting top-1 accuracy and Recall@20.</p>
<p><strong>Graph-to-Text Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.38</td>
          <td>62.11</td>
          <td>62.57</td>
          <td>60.67</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>53.79</td>
          <td>66.63</td>
          <td>64.81</td>
          <td>63.87</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.92</td>
          <td>68.59</td>
          <td>77.92</td>
          <td>75.93</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>58.64</td>
          <td>80.59</td>
          <td>80.62</td>
          <td>79.11</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>58.74</td>
          <td>81.29</td>
          <td>81.09</td>
          <td>80.15</td>
      </tr>
  </tbody>
</table>
<p><strong>Text-to-Graph Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.12</td>
          <td>68.02</td>
          <td>61.75</td>
          <td>60.77</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>54.22</td>
          <td>71.80</td>
          <td>64.95</td>
          <td>64.27</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.61</td>
          <td>74.77</td>
          <td>77.03</td>
          <td>75.47</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>55.44</td>
          <td>76.92</td>
          <td>80.22</td>
          <td>79.02</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>54.94</td>
          <td>78.29</td>
          <td>81.45</td>
          <td>80.62</td>
      </tr>
  </tbody>
</table>
<p>In zero-shot retrieval (on a separate test set of 5,562 pairs not seen during pre-training), MoMu achieves approximately 39-46% accuracy compared to below 2% for Sci-BERT and KV-PLM, demonstrating strong generalization.</p>
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>MoMu&rsquo;s graph features are appended to MolT5&rsquo;s encoder inputs through a learned MLP mapping module on the ChEBI-20 dataset. Results show improvements in BLEU, METEOR, and Text2Mol scores when incorporating graph features, though ROUGE-L drops slightly. The graph structural information leads to more accurate captions for complex molecular structures.</p>
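<p>The fusion step can be sketched as a small mapping module that projects the 300-dimensional graph feature into the text model's hidden space and prepends it as one extra "token" for the decoder to attend to. The MLP width, hidden size, and prepend-vs-append choice below are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_graph, d_t5, seq_len = 300, 768, 20            # 768 = base-size hidden dim (assumed)

# Learned two-layer MLP mapping graph features into the text model's space.
W1, b1 = rng.normal(size=(512, d_graph)) * 0.02, np.zeros(512)
W2, b2 = rng.normal(size=(d_t5, 512)) * 0.02, np.zeros(d_t5)

def map_graph_feature(h_graph):
    return W2 @ np.maximum(W1 @ h_graph + b1, 0.0) + b2   # ReLU MLP

h_graph = rng.normal(size=d_graph)               # MoMu graph-level feature
text_states = rng.normal(size=(seq_len, d_t5))   # MolT5 encoder token states
# Concatenate the mapped graph feature with the token sequence for decoding.
fused = np.vstack([map_graph_feature(h_graph)[None, :], text_states])
```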
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The pre-trained graph encoder from MoMu is fine-tuned on eight <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> datasets using scaffold splitting and ROC-AUC evaluation (10 runs).</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>No Pre-Train</th>
          <th>GraphCL</th>
          <th>MoMu-S</th>
          <th>MoMu-K</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>65.8</td>
          <td>69.7</td>
          <td><strong>70.5</strong></td>
          <td>70.1</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>74.0</td>
          <td>73.9</td>
          <td>75.6</td>
          <td>75.6</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>63.4</td>
          <td>62.4</td>
          <td>63.4</td>
          <td>63.0</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>57.3</td>
          <td>60.5</td>
          <td>60.5</td>
          <td>60.4</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>58.0</td>
          <td>76.0</td>
          <td><strong>79.9</strong></td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>71.8</td>
          <td>69.8</td>
          <td>70.5</td>
          <td>71.1</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>75.3</td>
          <td><strong>78.5</strong></td>
          <td>75.9</td>
          <td>76.2</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>70.1</td>
          <td>75.4</td>
          <td>76.7</td>
          <td>77.1</td>
      </tr>
      <tr>
          <td><strong>Average</strong></td>
          <td>66.96</td>
          <td>70.78</td>
          <td><strong>71.63</strong></td>
          <td>71.36</td>
      </tr>
  </tbody>
</table>
<p>MoMu-S achieves the best average ROC-AUC (71.63%) across all eight datasets, outperforming GraphCL (70.78%), the self-supervised method used to initialize MoMu&rsquo;s graph encoder. MoMu outperforms GraphCL on six of eight datasets. Notably, MoMu-S and MoMu-K perform comparably, indicating that KV-PLM&rsquo;s SMILES-based knowledge does not transfer well to graph-based representations.</p>
<h3 id="zero-shot-text-to-graph-generation-1">Zero-Shot Text-to-Graph Generation</h3>
<p>The method generates molecules from three types of text descriptions:</p>
<ol>
<li><strong>High-level vague descriptions</strong> (e.g., &ldquo;The molecule is beautiful&rdquo;): MoMu generates diverse, interpretable molecules where &ldquo;beautiful&rdquo; tends to produce locally symmetric and stretched graphs, &ldquo;versatile&rdquo; produces molecules with varied elements and functional groups, and &ldquo;strange&rdquo; produces cluttered, irregular structures.</li>
<li><strong>Functional descriptions</strong> (e.g., &ldquo;fluorescent molecules&rdquo;, &ldquo;high water solubility and barrier permeability with low toxicity&rdquo;): MoMu successfully generates molecules with appropriate functional groups and properties. For the solubility/permeability/toxicity query, MoMu generates molecules that satisfy all three evaluable properties.</li>
<li><strong>Structural descriptions</strong> (e.g., &ldquo;molecules containing <a href="https://en.wikipedia.org/wiki/Nucleophile">nucleophilic</a> groups&rdquo;): MoMu generates diverse molecules with appropriate functional groups (amino, hydroxyl, carbonyl, halogen atoms).</li>
</ol>
<h2 id="promising-multimodal-transfer-with-clear-data-limitations">Promising Multimodal Transfer with Clear Data Limitations</h2>
<p>MoMu demonstrates that contrastive pre-training on weakly-correlated graph-text data can bridge molecular graphs and natural language in a shared representation space. The key findings are:</p>
<ol>
<li><strong>Cross-modal alignment works with limited data</strong>: With only 15K graph-text pairs (far fewer than the millions used in vision-language models like CLIP), MoMu achieves meaningful cross-modal retrieval and enables zero-shot generation.</li>
<li><strong>Multimodal supervision improves graph representations</strong>: The graph encoder supervised by text descriptions outperforms self-supervised methods (GraphCL, AttrMasking, ContextPred) on average across molecular property prediction benchmarks.</li>
<li><strong>SMILES knowledge does not transfer to graphs</strong>: MoMu-S and MoMu-K perform comparably across all tasks, showing that structural information learned from one-dimensional SMILES strings does not readily generalize to graph neural networks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several important limitations:</p>
<ul>
<li><strong>Data scarcity</strong>: 15K graph-text pairs is substantially smaller than general image-text datasets, potentially leaving the common space insufficiently aligned.</li>
<li><strong>Noisy supervision</strong>: Retrieved texts may mention a molecule by name without describing its properties or structure, leading to spurious correlations.</li>
<li><strong>Generator constraints</strong>: The zero-shot generation method is limited by MoFlow&rsquo;s capacity (maximum 38 atoms, 9 element types from ZINC250K training).</li>
<li><strong>Property coverage</strong>: Generation quality degrades for molecular properties that appear infrequently or not at all in the training texts.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose four avenues: (1) collecting larger-scale multimodal molecular data including 3D conformations, (2) using strongly-correlated paired data with more advanced generators, (3) developing interpretable tools for the learned cross-modal space, and (4) wet-lab validation of generated molecules.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Collected graph-text pairs (PubChem + S2ORC)</td>
          <td>15,613 pairs</td>
          <td>~37M paragraphs total; top 50K PubChem compounds</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>15K pairs (10.5K/1.5K/3K split)</td>
          <td>SMILES-description pairs from PubChem</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>ChEBI-20</td>
          <td>~33K pairs</td>
          <td>Used with MolT5</td>
      </tr>
      <tr>
          <td>Text-to-graph generation</td>
          <td><a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC250K</a> (MoFlow)</td>
          <td>250K molecules</td>
          <td>Pre-trained generator, max 38 atoms</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>Varies</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Graph augmentations</strong>: Node dropping (10% ratio) and subgraph extraction (80% of original size via random walk)</li>
<li><strong>Contrastive learning</strong>: InfoNCE loss with temperature $\tau = 0.1$, following the DeClip paradigm with both inter-modal and intra-modal objectives</li>
<li><strong>Zero-shot generation</strong>: Adam optimizer on latent variable $q$ for up to 500 iterations; formal charges prohibited in output</li>
</ul>
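<p>The two graph augmentations listed above can be sketched on a plain edge-list representation (toy code with an unlabeled cycle graph; real implementations operate on featurized molecular graphs):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def node_drop(edges, num_nodes, drop_ratio=0.1):
    """Drop a random fraction of nodes and every edge touching them."""
    keep = rng.random(num_nodes) >= drop_ratio
    return [(u, v) for u, v in edges if keep[u] and keep[v]]

def random_walk_subgraph(edges, num_nodes, target_ratio=0.8):
    """Grow a subgraph by random walk until ~80% of nodes are covered."""
    adj = {i: [] for i in range(num_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    node = int(rng.integers(num_nodes))
    visited = {node}
    while len(visited) < target_ratio * num_nodes:
        # Step to a random neighbor; restart if the walk gets stuck.
        node = int(rng.choice(adj[node])) if adj[node] else int(rng.integers(num_nodes))
        visited.add(node)
    return [(u, v) for u, v in edges if u in visited and v in visited]

edges = [(i, (i + 1) % 6) for i in range(6)]     # toy 6-cycle "molecule"
aug1 = node_drop(edges, 6)
aug2 = random_walk_subgraph(edges, 6)
```

<p>On the 6-cycle, the random-walk variant always covers a contiguous arc of 5 nodes, leaving 4 induced edges.</p>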
<h3 id="models">Models</h3>
<ul>
<li><strong>Graph encoder</strong>: GIN with 5 layers, 300-dimensional hidden size, initialized from GraphCL checkpoint</li>
<li><strong>Text encoder</strong>: BERT-base (768 hidden size), initialized from Sci-BERT or KV-PLM</li>
<li><strong>Projection heads</strong>: Two MLPs projecting graph (300-dim) and text (768-dim) features to 256-dimensional shared space</li>
<li><strong>Optimizer</strong>: AdamW, learning rate 0.0001, weight decay 1e-5, 300 epochs, batch size 256</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Best Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>G-T Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.09 / 80.15 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>T-G Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.45 / 80.62 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>Zero-shot G-T Retrieval</td>
          <td>Accuracy</td>
          <td>~46%</td>
          <td>vs. ~1.4% for baselines</td>
      </tr>
      <tr>
          <td>Property Prediction</td>
          <td>ROC-AUC (avg)</td>
          <td>71.63%</td>
          <td>MoMu-S, 8 MoleculeNet datasets</td>
      </tr>
      <tr>
          <td>Molecule Captioning</td>
          <td>Text2Mol</td>
          <td>Improved over MolT5</td>
          <td>MoMu + MolT5-large</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x NVIDIA Tesla V100 PCIe 32GB GPUs</li>
<li>Framework: PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BingSu12/MoMu">MoMu code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Pre-training and downstream task code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/yangzhao1230/GraphTextRetrieval">GraphTextRetrieval</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Data collection and cross-modal retrieval code</td>
      </tr>
      <tr>
          <td><a href="https://pan.baidu.com/s/1aHJoYTTZWDHPCcRuu9I7Fg">Pre-training dataset</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Hosted on Baidu Pan (Chinese cloud storage)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Su, B., Du, D., Yang, Z., Zhou, Y., Li, J., Rao, A., Sun, H., Lu, Z., &amp; Wen, J.-R. (2022). A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language. arXiv preprint arXiv:2209.05481.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{su2022momu,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Su, Bing and Du, Dazhao and Yang, Zhao and Zhou, Yujie and Li, Jiangmeng and Rao, Anyi and Sun, Hao and Lu, Zhiwu and Wen, Ji-Rong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2209.05481}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolFM: Trimodal Molecular Foundation Pre-training</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/molfm-multimodal-molecular-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/molfm-multimodal-molecular-foundation/</guid><description>MolFM fuses molecular graphs, biomedical text, and knowledge graphs via cross-modal attention for joint molecular representation learning.</description><content:encoded><![CDATA[<h2 id="trimodal-pre-training-for-molecular-understanding">Trimodal Pre-training for Molecular Understanding</h2>
<p>MolFM is a <strong>Method</strong> paper that introduces a multimodal molecular foundation model integrating three distinct sources of molecular knowledge: 2D molecular graphs, biomedical text, and knowledge graphs. The primary contribution is a pre-training framework that uses fine-grained cross-modal attention to fuse information across all three modalities, combined with theoretical justification from a deep metric learning perspective. MolFM achieves the best reported results (at time of publication) on cross-modal retrieval, molecule captioning, text-based molecule generation, and molecular property prediction.</p>
<h2 id="why-existing-molecular-models-fall-short">Why Existing Molecular Models Fall Short</h2>
<p>Prior multimodal molecular foundation models operate on at most two modalities (structures and text) and suffer from two key limitations. First, generative approaches like KV-PLM and MolT5 rely on 1D <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, which cannot capture complex topological and spatial molecular properties such as macrocycles. Contrastive approaches like <a href="/notes/computational-chemistry/chemical-language-models/multimodal-molecular/momu-molecular-multimodal-foundation/">MoMu</a> and MoleculeSTM learn global alignment between molecule graphs and text but overlook fine-grained connections between specific substructures and textual descriptions.</p>
<p>Second, and more fundamentally, no prior model incorporates <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a> as a third modality. Knowledge graphs encode global-level relationships among molecules, target ligands, diseases, and other biomedical entities. These relationships capture functional and structural similarity patterns that cannot be learned from individual molecule-text pairs alone. MolFM addresses both gaps by introducing cross-modal attention across all three modalities and providing theoretical guarantees about what the pre-training objectives learn.</p>
<h2 id="cross-modal-attention-and-metric-learning-guarantees">Cross-Modal Attention and Metric Learning Guarantees</h2>
<h3 id="architecture">Architecture</h3>
<p>MolFM uses three pre-trained single-modal encoders:</p>
<ul>
<li><strong>Molecular graph encoder</strong>: A 5-layer GIN (1.8M parameters) initialized from GraphMVP, producing atom-level features $h_{SA}$ and a graph-level feature $h_{SM}$</li>
<li><strong>Text encoder</strong>: A 6-layer transformer (61.8M parameters) initialized from KV-PLM&rsquo;s first 6 layers, producing token features $h_T$</li>
<li><strong>Knowledge graph encoder</strong>: A TransE model (12.6M parameters) trained on the knowledge graph for 500 epochs, producing entity features $h_K$</li>
</ul>
<p>A multimodal encoder (61.8M parameters, 6 transformer layers with cross-attention) fuses the three modalities. The cross-attention uses text token features as queries and the concatenation of atom features and knowledge graph neighbor features as keys and values. For each molecule, the knowledge graph input is the molecule&rsquo;s entity and $N=4$ randomly sampled one-hop neighbors.</p>
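<p>The fusion pattern, text tokens as queries attending over concatenated atom and knowledge-graph neighbor features, can be sketched as single-head attention in NumPy (the single head and toy dimensions are simplifying assumptions; MolFM uses full multi-head transformer layers):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tok, n_atom, n_nbr = 64, 12, 20, 4          # toy sizes

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(h_text, h_atoms, h_kg, Wq, Wk, Wv):
    """Each text token attends jointly over atoms and KG neighbors."""
    kv = np.vstack([h_atoms, h_kg])              # keys/values: concat of two modalities
    Q, K, V = h_text @ Wq, kv @ Wk, kv @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))         # (n_tok, n_atom + n_nbr)
    return attn @ V

h_text = rng.normal(size=(n_tok, d))             # token features from the text encoder
h_atoms = rng.normal(size=(n_atom, d))           # atom features h_SA from the GIN
h_kg = rng.normal(size=(n_nbr, d))               # TransE features of 4 sampled neighbors
Wq, Wk, Wv = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
fused = cross_attention(h_text, h_atoms, h_kg, Wq, Wk, Wv)
```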
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>MolFM combines four losses:</p>
<p><strong>Structure-text contrastive (STC)</strong> aligns the global feature spaces of structure and text encoders using a symmetric InfoNCE loss:</p>
<p>$$\mathcal{L}_{stc} = -\frac{1}{2} \left[ \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{S' \in B} \exp(s(z_{S'}, z_T) / \tau)} + \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{T' \in B} \exp(s(z_S, z_{T'}) / \tau)} \right]$$</p>
<p>where $s(\cdot, \cdot)$ is cosine similarity and $\tau = 0.1$ is a temperature parameter.</p>
<p><strong>Cross-modal matching (CMM)</strong> predicts whether a structure-text-knowledge triplet corresponds to the same molecule, using cross-entropy over the multimodal encoder&rsquo;s CLS token:</p>
<p>$$\mathcal{L}_{cmm} = \sum_{(\tilde{S}, \tilde{T}, \tilde{K}) \in \tilde{B}} H\left[y_{cmm}(\tilde{S}, \tilde{T}, \tilde{K}),\; p_{cmm}\left(\mathcal{M}_\theta(h_{\tilde{S}}, h_{\tilde{T}}, h_{\tilde{K}})\right)\right]$$</p>
<p><strong>Masked language modeling (MLM)</strong> predicts masked text tokens conditioned on all three modalities:</p>
<p>$$\mathcal{L}_{mlm} = H\left[y_{mlm}(\hat{T}),\; p_{mlm}\left(\mathcal{M}_\theta(h_S, h_{\hat{T}}, h_K)\right)\right]$$</p>
<p><strong>Knowledge graph embedding (KGE)</strong> regularizes entity embeddings with a max-margin TransE loss:</p>
<p>$$\mathcal{L}_{kge} = \sum_{h \in K} \left[\max(0, d(h,r,t) - d(h,r,\tilde{t}) + \Delta) + \max(0, d(h,r,t) - d(\tilde{h},r,t) + \Delta)\right]$$</p>
<p>where $d(h,r,t) = \| f(h) + g(r) - f(t) \|_2$ and $\Delta = 0.2$.</p>
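<p>A minimal NumPy sketch of the max-margin TransE objective with corrupted-head and corrupted-tail negatives (random toy embeddings and single-triplet evaluation; MolFM trains this jointly with the other losses over the full knowledge graph):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, d = 100, 10, 32
E = rng.normal(size=(n_ent, d)) * 0.1            # entity embeddings f(.)
R = rng.normal(size=(n_rel, d)) * 0.1            # relation embeddings g(.)

def dist(h, r, t):
    """d(h, r, t) = ||f(h) + g(r) - f(t)||_2"""
    return np.linalg.norm(E[h] + R[r] - E[t])

def kge_loss(h, r, t, margin=0.2):
    """Push the true triplet closer than randomly corrupted ones, by a margin."""
    t_neg = int(rng.integers(n_ent))             # corrupt the tail
    h_neg = int(rng.integers(n_ent))             # corrupt the head
    pos = dist(h, r, t)
    return (max(0.0, pos - dist(h, r, t_neg) + margin)
            + max(0.0, pos - dist(h_neg, r, t) + margin))

loss = kge_loss(3, 1, 7)
```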
<p>The total pre-training loss is:</p>
<p>$$\mathcal{L} = \mathbb{E}_{(S,T,K)}\left[\mathcal{L}_{stc} + \mathcal{L}_{cmm} + \mathcal{L}_{mlm} + \mathcal{L}_{kge}\right]$$</p>
<h3 id="theoretical-justifications">Theoretical Justifications</h3>
<p>The authors provide metric learning interpretations for each objective. For CMM, they show that minimizing the loss amounts to assigning higher matching scores to matched triplets and lower scores to unmatched ones, aligning the feature space across all three modalities.</p>
<p>For KGE, two lemmas provide guarantees about structurally and functionally similar molecules:</p>
<p><strong>Lemma 1</strong> (Structural similarity): For a symmetric structural-similarity relation $r_s$, the KGE loss satisfies:</p>
<p>$$\mathcal{L}_{kge}(h, r_s, t) \propto 2\|f(h) - f(t)\| - \mathbb{E}_{\tilde{t}}\|f(h) - f(\tilde{t})\| - \mathbb{E}_{\tilde{h}}\|f(\tilde{h}) - f(t)\|$$</p>
<p>This shows KGE pulls structurally similar molecules closer while pushing dissimilar ones apart.</p>
<p><strong>Lemma 2</strong> (Functional similarity): For molecules $h$ and $t$ that interact with a common entity $o$, the distance between their embeddings is upper-bounded:</p>
<p>$$\|f(h) - f(t)\| \leq \alpha\,\mathbb{E}_{(e_1, r, e_2) \sim \mathcal{I}}\left[\mathcal{L}_{kge}(e_1, r, e_2)\right] + C$$</p>
<p>where $\alpha \approx 1$ and $C \approx 0$. This guarantees that minimizing KGE also brings functionally similar molecules closer in the embedding space.</p>
<h2 id="experiments-across-four-downstream-tasks">Experiments Across Four Downstream Tasks</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>MolFM pre-trains on 15K molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> paired with 37M paragraphs from S2ORC. The knowledge graph contains 49K entities and 3.2M relations, constructed from <a href="https://en.wikipedia.org/wiki/DrugBank">DrugBank</a>, <a href="https://en.wikipedia.org/wiki/BindingDB">BindingDB</a>, and additional public databases with heuristic augmentation.</p>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>Evaluated on PCdes (paragraph-level) in zero-shot and fine-tuning settings. MolFM uses a re-ranking strategy that linearly combines cosine similarity with CMM logits over the top-$k$ retrieved candidates.</p>
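<p>The re-ranking strategy can be sketched as a two-stage procedure. The shortlist size $k$ and the weight <code>lam</code> below are illustrative assumptions; the paper specifies only that the two scores are linearly combined:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_cand, k, lam = 50, 8, 0.5                      # k and the weight lam are assumptions

cos_sim = rng.random(n_cand)                     # cheap dual-encoder similarities
cmm_logit = rng.random(n_cand)                   # multimodal matching scores (expensive)

# Stage 1: shortlist the top-k candidates by cosine similarity alone.
top_k = np.argsort(-cos_sim)[:k]
# Stage 2: re-rank the shortlist with a linear combination of both scores.
combined = lam * cos_sim[top_k] + (1 - lam) * cmm_logit[top_k]
reranked = top_k[np.argsort(-combined)]
```

<p>Only the expensive multimodal matching head runs on the shortlist, which keeps retrieval tractable over large candidate sets.</p>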
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Model</th>
          <th>S-T MRR</th>
          <th>S-T R@1</th>
          <th>S-T R@10</th>
          <th>T-S MRR</th>
          <th>T-S R@1</th>
          <th>T-S R@10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Zero-shot</td>
          <td>MoMu</td>
          <td>9.89</td>
          <td>5.08</td>
          <td>18.93</td>
          <td>10.33</td>
          <td>4.90</td>
          <td>20.69</td>
      </tr>
      <tr>
          <td>Zero-shot</td>
          <td>MolFM</td>
          <td>21.42</td>
          <td>13.90</td>
          <td>36.21</td>
          <td>23.63</td>
          <td>16.14</td>
          <td>39.54</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MoMu</td>
          <td>34.29</td>
          <td>24.47</td>
          <td>53.84</td>
          <td>34.53</td>
          <td>24.87</td>
          <td>54.25</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MolFM</td>
          <td>39.56</td>
          <td>29.76</td>
          <td>58.63</td>
          <td>39.34</td>
          <td>29.39</td>
          <td>58.49</td>
      </tr>
  </tbody>
</table>
<p>MolFM achieves 12.13% and 5.04% absolute gains over MoMu under zero-shot and fine-tuning settings, respectively.</p>
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>Evaluated on ChEBI-20 using MolT5 decoders. MolFM&rsquo;s structure encoder features are concatenated with the MolT5 encoder outputs.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>BLEU-4</th>
          <th>ROUGE-L</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.457</td>
          <td>0.578</td>
          <td>0.569</td>
          <td>0.547</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.462</td>
          <td>0.575</td>
          <td>0.576</td>
          <td>0.558</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>GraphMVP</td>
          <td>0.491</td>
          <td>0.592</td>
          <td>0.599</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.498</td>
          <td>0.594</td>
          <td>0.607</td>
          <td>0.576</td>
      </tr>
  </tbody>
</table>
<h3 id="text-based-molecule-generation">Text-Based Molecule Generation</h3>
<p>Also on ChEBI-20 with MolT5 decoders. MolFM&rsquo;s text features are projected and fed to the decoder.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>Exact</th>
          <th>Valid</th>
          <th>Morgan FTS</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.082</td>
          <td>0.786</td>
          <td>0.601</td>
          <td>0.543</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.183</td>
          <td>0.863</td>
          <td>0.678</td>
          <td>0.580</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.210</td>
          <td>0.892</td>
          <td>0.697</td>
          <td>0.583</td>
      </tr>
  </tbody>
</table>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>On <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (8 classification datasets), MolFM concatenates the structure feature and the multimodal encoder&rsquo;s CLS feature to predict properties.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>Avg</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GraphMVP</td>
          <td>72.4</td>
          <td>74.4</td>
          <td>77.5</td>
          <td>77.0</td>
          <td>81.2</td>
          <td>73.07</td>
      </tr>
      <tr>
          <td>DeepEIK</td>
          <td>72.1</td>
          <td>72.4</td>
          <td>89.7</td>
          <td>75.0</td>
          <td>80.5</td>
          <td>73.27</td>
      </tr>
      <tr>
          <td>MolFM (w/o T+K)</td>
          <td>72.2</td>
          <td>76.6</td>
          <td>78.6</td>
          <td>78.2</td>
          <td>82.6</td>
          <td>73.95</td>
      </tr>
      <tr>
          <td>MolFM (w/ T+K)</td>
          <td>72.9</td>
          <td>77.2</td>
          <td>79.7</td>
          <td>78.8</td>
          <td>83.9</td>
          <td>74.62</td>
      </tr>
  </tbody>
</table>
<p>With multimodal inputs, MolFM averages 74.62% ROC-AUC, a 1.55% absolute gain over GraphMVP.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Zero-shot retrieval ablations reveal that cross-modal attention to atoms and CMM are the most critical components. Removing either causes a sharp drop (approximately 3% on S-T retrieval). Knowledge graph incorporation yields a 1.5% average improvement, with both attention to neighbors and KGE contributing marginally.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>MolFM demonstrates that incorporating knowledge graphs as a third modality provides consistent improvements across all evaluated tasks. The theoretical analysis connecting pre-training objectives to deep metric learning provides interpretability for why the model works: STC and CMM align representations of the same molecule across modalities, while KGE pulls structurally and functionally similar molecules closer in the embedding space.</p>
<p>The cross-modal attention visualizations show that MolFM learns to associate specific atom substructures with relevant text tokens and knowledge graph entities. For example, the model correctly attends to functional groups mentioned in textual descriptions.</p>
<p>The authors acknowledge several limitations:</p>
<ol>
<li><strong>Data quality</strong>: The pre-training dataset (15K molecules) is small and may introduce biases</li>
<li><strong>Cold-start problem</strong>: MolFM provides limited benefit for newly emerged molecules lacking text and knowledge graph information</li>
<li><strong>Entity scope</strong>: The model focuses on molecules and does not incorporate proteins, genes, or cell lines, which could further improve biomedical understanding</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (molecules)</td>
          <td>PubChem</td>
          <td>15K molecules</td>
          <td>Follows MoMu&rsquo;s pre-training data</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>S2ORC</td>
          <td>37M paragraphs</td>
          <td>Biomedical literature paragraphs</td>
      </tr>
      <tr>
          <td>Knowledge graph</td>
          <td>DrugBank, BindingDB, public DBs</td>
          <td>49K entities, 3.2M relations</td>
          <td>Constructed with heuristics from MoCL</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>Paragraph-level</td>
          <td>Test split</td>
      </tr>
      <tr>
          <td>Captioning/Generation</td>
          <td>ChEBI-20</td>
          <td>-</td>
          <td>Following MolT5 splits</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet</td>
          <td>8 datasets</td>
          <td>Classification tasks, ROC-AUC metric</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: AdamW with weight decay $1 \times 10^{-4}$</li>
<li>Learning rate: linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, cosine annealing to $1 \times 10^{-5}$</li>
<li>Batch size: 128</li>
<li>Pre-training epochs: 300</li>
<li>Knowledge graph neighbors per molecule: $N = 4$</li>
<li>Temperature: $\tau = 0.1$</li>
<li>Margin: $\Delta = 0.2$</li>
</ul>
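<p>The warmup-plus-cosine schedule above can be written as a small function. The peak, floor, and warmup values come from the list above; the schedule shape itself is a standard sketch, not the authors' released code.</p>

```python
import math

def molfm_lr(step, total_steps, warmup=2000, peak=1e-4, floor=1e-5):
    # Linear warmup to the peak learning rate over `warmup` iterations,
    # then cosine annealing down to `floor` (hyperparameter values from
    # the paper; schedule shape is a generic warmup+cosine sketch).
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

<p>At step 0 the rate is 0, it peaks at $10^{-4}$ after 2,000 iterations, and it decays smoothly to $10^{-5}$ by the final step.</p>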
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Initialization</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph encoder</td>
          <td>5-layer GIN</td>
          <td>1.8M</td>
          <td>GraphMVP</td>
      </tr>
      <tr>
          <td>Text encoder</td>
          <td>6-layer Transformer</td>
          <td>61.8M</td>
          <td>KV-PLM (first 6 layers)</td>
      </tr>
      <tr>
          <td>Knowledge encoder</td>
          <td>TransE</td>
          <td>12.6M</td>
          <td>Trained 500 epochs on KG</td>
      </tr>
      <tr>
          <td>Multimodal encoder</td>
          <td>6-layer Transformer + cross-attention</td>
          <td>61.8M</td>
          <td>KV-PLM (last 6 layers)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>~138M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>MRR, Recall@1/5/10</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>BLEU-2/4, ROUGE-1/2/L, METEOR, Text2Mol</td>
      </tr>
      <tr>
          <td>Text-to-molecule generation</td>
          <td>BLEU, Exact ratio, Validity, Levenshtein, Fingerprint Tanimoto (MACCS/RDKit/Morgan), Text2Mol</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>ROC-AUC per dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 NVIDIA A100 GPUs for pre-training</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BioFM/OpenBioMed">OpenBioMed</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation including MolFM</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Yang, K., Hong, M., Liu, X. Y., &amp; Nie, Z. (2023). MolFM: A Multimodal Molecular Foundation Model. <em>arXiv preprint arXiv:2307.09484</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2023molfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolFM: A Multimodal Molecular Foundation Model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Yang, Kai and Hong, Massimo and Liu, Xing Yi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2307.09484}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BioT5: Cross-Modal Integration of Biology and Chemistry</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/biot5-cross-modal-biology/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/biot5-cross-modal-biology/</guid><description>BioT5 is a T5-based pretraining framework that jointly models molecules, proteins, and natural language using SELFIES for robust molecular generation.</description><content:encoded><![CDATA[<h2 id="a-unified-pretraining-framework-for-molecules-proteins-and-text">A Unified Pretraining Framework for Molecules, Proteins, and Text</h2>
<p>BioT5 is a <strong>Method</strong> paper that introduces a comprehensive T5-based pretraining framework for cross-modal integration of molecules, proteins, and natural language. The primary contribution is a multi-task pretraining approach that uses <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (instead of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>) for 100% valid molecular representations, separate tokenization for each modality, and a combination of masked language modeling and translation objectives to connect structured biological data with unstructured scientific text. After fine-tuning, BioT5 (252M parameters) achieves state-of-the-art performance on 10 out of 15 downstream tasks spanning molecule property prediction, protein property prediction, drug-target interaction, protein-protein interaction, molecule captioning, and text-based molecule generation.</p>
<h2 id="bridging-the-gap-between-molecular-sequences-and-scientific-knowledge">Bridging the Gap Between Molecular Sequences and Scientific Knowledge</h2>
<p>Prior cross-modal models in computational biology face three recurring challenges. First, models like MolT5 and MolXPT rely on SMILES to represent molecules, but SMILES strings are syntactically fragile: random perturbations or model-generated sequences frequently produce invalid molecular structures. Edwards et al. (2022) and Li et al. (2023) both highlight this validity problem as a bottleneck for text-to-molecule generation. Second, the contextual information surrounding molecular and protein names in scientific literature (e.g., mentions in <a href="https://en.wikipedia.org/wiki/PubMed">PubMed</a> abstracts that describe properties, interactions, and experimental results) remains underutilized. Most models either ignore this context or treat it identically to structured database entries. Third, existing approaches like MolT5 and <a href="/notes/computational-chemistry/llms-for-chemistry/galactica-large-language-model-for-science/">Galactica</a> share a single tokenizer and embedding space across molecules, proteins, and text. This leads to chemically incorrect tokenization: the bromine atom &ldquo;Br&rdquo; in SMILES gets split into &ldquo;B&rdquo; (boron) and &ldquo;r&rdquo;, producing erroneous downstream predictions.</p>
<p>BioT5 addresses all three issues simultaneously by adopting SELFIES for molecular representation, extracting entity-linked contextual knowledge from PubMed, and employing separate vocabularies for each modality.</p>
<h2 id="selfies-separate-tokenization-and-multi-task-pretraining">SELFIES, Separate Tokenization, and Multi-Task Pretraining</h2>
<p>The core innovations of BioT5 center on three design decisions:</p>
<h3 id="selfies-for-robust-molecular-representation">SELFIES for Robust Molecular Representation</h3>
<p>BioT5 replaces SMILES with SELFIES (Self-referencing Embedded Strings) for all molecular representations. Any string assembled from symbols in the SELFIES alphabet decodes to a chemically valid molecular structure, guaranteeing 100% validity in generation tasks. Molecules from ZINC20 are converted from SMILES to SELFIES during data preprocessing.</p>
<h3 id="modality-specific-tokenization">Modality-Specific Tokenization</h3>
<p>Rather than sharing a single SentencePiece vocabulary across modalities, BioT5 maintains three separate dictionaries:</p>
<ul>
<li><strong>Molecules</strong>: Each SELFIES token corresponds to a chemically meaningful atom group enclosed in brackets (e.g., <code>[C]</code>, <code>[=C]</code>, <code>[Br]</code>).</li>
<li><strong>Proteins</strong>: Amino acids are prefixed with a special <code>&lt;p&gt;</code> token to distinguish them from text characters (e.g., <code>&lt;p&gt;M</code>, <code>&lt;p&gt;K</code>, <code>&lt;p&gt;R</code>).</li>
<li><strong>Text</strong>: The standard T5 vocabulary is retained.</li>
</ul>
<p>This prevents semantic conflation across modalities. The total vocabulary size is 35,073, and the model comprises 252M parameters using the T5-v1.1-base architecture.</p>
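<p>A minimal sketch of the three-dictionary idea: the token formats (bracketed groups for SELFIES, <code>&lt;p&gt;</code>-prefixed amino acids) follow the paper, but the helper functions themselves are illustrative rather than the released implementation.</p>

```python
import re

# One chemically meaningful token per bracketed SELFIES group.
SELFIES_TOKEN = re.compile(r"\[[^\]]*\]")

def tokenize_selfies(s):
    return SELFIES_TOKEN.findall(s)

def tokenize_protein(fasta):
    # Prefix every amino acid so 'M' (methionine) never collides
    # with the letter 'M' in the text vocabulary.
    return [f"<p>{aa}" for aa in fasta]

tokenize_selfies("[C][=C][Br]")  # -> ['[C]', '[=C]', '[Br]']
tokenize_protein("MKR")          # -> ['<p>M', '<p>K', '<p>R']
```

<p>Because the three vocabularies are disjoint, a molecule token, an amino-acid token, and a text token each receive their own embedding even when their surface characters overlap.</p>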
<h3 id="multi-task-pretraining-objectives">Multi-Task Pretraining Objectives</h3>
<p>BioT5 uses six pretraining tasks organized into three categories:</p>
<ol>
<li><strong>Single-modal T5 objective</strong>: Standard span corruption and recovery applied independently to molecule SELFIES (task 1), protein <a href="https://en.wikipedia.org/wiki/FASTA_format">FASTA</a> (task 2), and general text from C4 (task 3).</li>
<li><strong>Wrapped text T5 objective</strong> (task 4): Applied to PubMed articles where molecular names are replaced with corresponding SELFIES strings and gene names are appended with protein FASTA sequences, using BERN2 for named entity recognition and entity linking.</li>
<li><strong>Bidirectional translation</strong> (tasks 5 and 6): Molecule SELFIES to text description and vice versa (using 339K pairs from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>), and protein FASTA to text description and vice versa (using 569K pairs from <a href="https://en.wikipedia.org/wiki/UniProt">Swiss-Prot</a>).</li>
</ol>
<p>The translation direction is randomly sampled with probability 0.5 for each example. For downstream tasks, BioT5 uses prompt-based fine-tuning to cast all tasks into a sequence generation format, reducing the gap between pretraining and fine-tuning.</p>
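<p>The per-example direction sampling for tasks 5 and 6 can be sketched as follows; the prompt strings here are hypothetical placeholders, not BioT5's actual prompts.</p>

```python
import random

def make_translation_example(selfies, description, rng):
    # Direction is sampled with p = 0.5 per example, as in the paper;
    # the prompt wording below is illustrative only.
    if rng.random() < 0.5:
        return (f"Describe the molecule: {selfies}", description)
    return (f"Generate a molecule matching: {description}", selfies)

rng = random.Random(0)
pairs = [make_translation_example("[C][O]", "a small oxygenated fragment", rng)
         for _ in range(1000)]
mol_to_text = sum(p[0].startswith("Describe") for p in pairs)
# Roughly half the examples run in each direction.
```
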
<h2 id="evaluation-across-15-downstream-tasks">Evaluation Across 15 Downstream Tasks</h2>
<p>BioT5 is evaluated on 15 tasks organized into three categories: single-instance prediction, multi-instance prediction, and cross-modal generation.</p>
<h3 id="molecule-property-prediction-moleculenet">Molecule Property Prediction (MoleculeNet)</h3>
<p>BioT5 is evaluated on six binary classification tasks from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> using scaffold splitting: BBBP, Tox21, ClinTox, HIV, BACE, and SIDER. Results are averaged over three random runs.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GEM</th>
          <th>MolXPT</th>
          <th>BioT5</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>72.4</td>
          <td>80.0</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>78.1</td>
          <td>77.1</td>
          <td>77.9</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>90.1</td>
          <td>95.3</td>
          <td>95.4</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>78.1</td>
          <td><strong>81.0</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>85.6</td>
          <td>88.4</td>
          <td><strong>89.4</strong></td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>67.2</td>
          <td>71.7</td>
          <td><strong>73.2</strong></td>
      </tr>
      <tr>
          <td><strong>Avg</strong></td>
          <td>79.0</td>
          <td>81.9</td>
          <td><strong>82.4</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best average AUROC (82.4) across all six datasets, surpassing both GNN-based methods (GEM) and language model baselines (MolXPT).</p>
<h3 id="protein-property-prediction-peer-benchmark">Protein Property Prediction (PEER Benchmark)</h3>
<p>On the PEER benchmark, BioT5 is evaluated on protein solubility and subcellular localization prediction:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>Solubility (Acc)</th>
          <th>Localization (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESM-1b</td>
          <td>652.4M</td>
          <td>70.23</td>
          <td><strong>92.40</strong></td>
      </tr>
      <tr>
          <td>ProtBert</td>
          <td>419.9M</td>
          <td>68.15</td>
          <td>91.32</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252.1M</td>
          <td><strong>74.65</strong></td>
          <td>91.69</td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best solubility prediction accuracy (74.65%) despite having roughly 1.7x fewer parameters than ProtBert and 2.6x fewer than ESM-1b, both dedicated protein language models.</p>
<h3 id="drug-target-interaction-prediction">Drug-Target Interaction Prediction</h3>
<p>BioT5 is evaluated on three DTI datasets (BioSNAP, Human, BindingDB) with five random runs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BioSNAP AUROC</th>
          <th>Human AUROC</th>
          <th>BindingDB AUROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugBAN</td>
          <td>0.903</td>
          <td>0.982</td>
          <td>0.960</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.989</strong></td>
          <td><strong>0.963</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 consistently outperforms DrugBAN and other specialized DTI models across all three datasets.</p>
<h3 id="molecule-captioning-and-text-based-molecule-generation">Molecule Captioning and Text-Based Molecule Generation</h3>
<p>On the ChEBI-20 dataset, BioT5 outperforms all baselines in molecule captioning:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>BLEU-4</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-large</td>
          <td>783M</td>
          <td>0.508</td>
          <td>0.614</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>MolXPT</td>
          <td>350M</td>
          <td>0.505</td>
          <td>0.626</td>
          <td>0.594</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td><strong>0.556</strong></td>
          <td><strong>0.656</strong></td>
          <td><strong>0.603</strong></td>
      </tr>
  </tbody>
</table>
<p>For text-based molecule generation, BioT5 achieves an exact match score of 0.413 (vs. 0.311 for MolT5-large) while maintaining 100% validity, compared to 90.5% for MolT5-large. This demonstrates the direct benefit of SELFIES: every generated sequence is a valid molecule.</p>
<h3 id="protein-protein-interaction-prediction">Protein-Protein Interaction Prediction</h3>
<p>On the PEER PPI benchmarks (Yeast and Human), BioT5 achieves competitive results, outperforming fully fine-tuned ProtBert and ESM-1b on the Yeast dataset (64.89% vs. 63.72% for ProtBert) and placing second on Human (86.22% vs. 88.06% for ESM-1b with frozen weights).</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BioT5 demonstrates that integrating molecular, protein, and textual modalities within a single pretraining framework yields consistent improvements across diverse biological tasks. Three factors drive BioT5&rsquo;s performance: (1) SELFIES guarantees 100% molecular validity in generation tasks, eliminating a persistent failure mode of SMILES-based models; (2) separate tokenization preserves the semantic integrity of each modality; (3) wrapped text pretraining on PubMed provides contextual biological knowledge that pure sequence models miss.</p>
<p>The authors acknowledge several limitations. BioT5 requires full-parameter fine-tuning for each downstream task because instruction-tuning does not generalize across tasks, and combining datasets via instructions causes data leakage (the authors note overlaps between BindingDB training data and BioSNAP/Human test sets). The model only handles sequence-format bio-entities and does not incorporate 2D or 3D structural information. Additional biological modalities such as DNA/RNA sequences and cell-level data are also left for future work.</p>
<p>The authors also note risks: BioT5 could potentially be misused to generate dangerous molecules, and it may fail to generate effective therapeutic molecules or produce compounds with adverse side effects.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (molecules)</td>
          <td>ZINC20</td>
          <td>~300M molecules</td>
          <td>Converted from SMILES to SELFIES</td>
      </tr>
      <tr>
          <td>Pretraining (proteins)</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniRef50</a></td>
          <td>27M proteins</td>
          <td>Filtered by length</td>
      </tr>
      <tr>
          <td>Pretraining (text)</td>
          <td>C4</td>
          <td>Large</td>
          <td>Standard T5 corpus</td>
      </tr>
      <tr>
          <td>Pretraining (wrapped text)</td>
          <td>PubMed</td>
          <td>33M articles</td>
          <td>Entity linking via BERN2</td>
      </tr>
      <tr>
          <td>Pretraining (molecule-text pairs)</td>
          <td>PubChem</td>
          <td>339K pairs</td>
          <td>Excludes ChEBI-20 molecules</td>
      </tr>
      <tr>
          <td>Pretraining (protein-text pairs)</td>
          <td>Swiss-Prot</td>
          <td>569K pairs</td>
          <td>High-quality annotations</td>
      </tr>
      <tr>
          <td>Evaluation (molecular properties)</td>
          <td>MoleculeNet</td>
          <td>6 datasets</td>
          <td>Scaffold splitting</td>
      </tr>
      <tr>
          <td>Evaluation (protein properties)</td>
          <td>PEER</td>
          <td>2 tasks</td>
          <td>Solubility and localization</td>
      </tr>
      <tr>
          <td>Evaluation (DTI)</td>
          <td>BioSNAP, Human, BindingDB</td>
          <td>3 datasets</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Evaluation (PPI)</td>
          <td>Yeast, Human</td>
          <td>2 datasets</td>
          <td>From PEER benchmark</td>
      </tr>
      <tr>
          <td>Evaluation (generation)</td>
          <td>ChEBI-20</td>
          <td>33K pairs</td>
          <td>Molecule captioning and text-to-molecule</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5-v1.1-base (encoder-decoder transformer)</li>
<li>Optimizer: AdamW with RMS scaling</li>
<li>Learning rate: cosine annealing, base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$</li>
<li>Warmup steps: 10,000</li>
<li>Dropout: 0.0</li>
<li>Maximum input length: 512 tokens</li>
<li>Pretraining steps: 350K</li>
<li>Batch size: 96 per GPU (6 data types per batch)</li>
<li>Prompt-based fine-tuning for all downstream tasks</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Vocabulary Size</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td>35,073</td>
          <td>T5-v1.1-base</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecule property prediction: AUROC on 6 MoleculeNet tasks (scaffold split, 3 runs)</li>
<li>Protein property prediction: accuracy on PEER benchmark (3 runs)</li>
<li>Drug-target interaction: AUROC, AUPRC, accuracy on 3 DTI datasets (5 runs)</li>
<li>Protein-protein interaction: accuracy on 2 PPI datasets (3 runs)</li>
<li>Molecule captioning: BLEU, ROUGE, METEOR, Text2Mol on ChEBI-20</li>
<li>Text-based molecule generation: BLEU, exact match, fingerprint similarities, FCD, validity on ChEBI-20</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8x NVIDIA A100 80GB GPUs for pretraining</li>
<li>Codebase: nanoT5</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/QizhiPei/BioT5">BioT5 Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Pei, Q., Zhang, W., Zhu, J., Wu, K., Gao, K., Wu, L., Xia, Y., &amp; Yan, R. (2023). BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. <em>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</em>, 1102-1123. <a href="https://doi.org/10.18653/v1/2023.emnlp-main.70">https://doi.org/10.18653/v1/2023.emnlp-main.70</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{pei2023biot5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1102--1123}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.emnlp-main.70}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MG-BERT: Graph BERT for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/mg-bert-molecular-graph-bert/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/mg-bert-molecular-graph-bert/</guid><description>MG-BERT integrates graph neural network message passing into BERT with masked atom pretraining on 1.7M molecules for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-aware-bert-for-molecular-property-prediction">A Graph-Aware BERT for Molecular Property Prediction</h2>
<p>MG-BERT is a <strong>Method</strong> paper that adapts the BERT pretraining paradigm from NLP to molecular graphs. The primary contribution is a modified Transformer architecture that replaces global self-attention with bond-based local attention, allowing atoms to exchange information only through chemical bonds. This creates a deep message-passing network that avoids the oversmoothing problem of conventional graph neural networks (GNNs). Combined with a masked atom prediction pretraining strategy on 1.7 million unlabeled molecules from ChEMBL, MG-BERT learns context-sensitive atomic representations that transfer effectively to downstream property prediction tasks.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Molecular property prediction</a> is central to drug discovery, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoints. While deep learning has advanced many domains, molecular property prediction faces a persistent challenge: labeled data scarcity. ADMET measurements require expensive, time-consuming experiments, and typical datasets contain only hundreds to thousands of examples.</p>
<p>Prior approaches fall into three categories, each with limitations:</p>
<ol>
<li><strong>Feature engineering</strong> (molecular fingerprints, descriptors): Requires expert design, suffers from low scalability, and fixed representations cannot be optimized for specific tasks.</li>
<li><strong>SMILES-based deep learning</strong> (CNNs, LSTMs, Transformers on SMILES strings): Must learn to parse molecular information from complex string syntax, increasing learning difficulty. Autoencoder-based methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a>) learn fixed representations that cannot be fine-tuned.</li>
<li><strong>Graph neural networks</strong> (GAT, GCN): Can learn directly from molecular topology, but are limited to 2-3 layers due to oversmoothing, restricting their capacity to capture deep-level patterns.</li>
</ol>
<p>The BERT model from NLP demonstrated that self-supervised pretraining on large unlabeled corpora followed by fine-tuning on small labeled datasets can substantially improve downstream performance. <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a> applied this idea to SMILES strings directly, but suffered from interpretability issues due to auxiliary characters in the SMILES syntax. MG-BERT addresses these limitations by operating directly on molecular graphs.</p>
<h2 id="bond-based-local-attention-and-masked-atom-pretraining">Bond-Based Local Attention and Masked Atom Pretraining</h2>
<p>The core innovation of MG-BERT has two components: a modified Transformer architecture for molecular graphs and a self-supervised pretraining strategy.</p>
<h3 id="architecture-modifications">Architecture Modifications</h3>
<p>The original BERT model uses three components: an embedding layer, Transformer encoder layers, and a task-specific output layer. MG-BERT makes four key modifications:</p>
<ol>
<li>
<p><strong>Atom embeddings replace word embeddings.</strong> The dictionary contains 16 tokens: 13 common atom types ([H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se]), plus [UNK] for rare atoms, [MASK] for pretraining, and [GLOBAL] for graph-level readout.</p>
</li>
<li>
<p><strong>No positional encoding.</strong> Unlike sequential text, atoms in a molecular graph have no inherent ordering, so positional embeddings are removed.</p>
</li>
<li>
<p><strong>Local attention replaces global attention.</strong> The adjacency matrix of the molecular graph is used as a visibility matrix to modulate the attention scores. Each atom can only attend to atoms connected by chemical bonds. Formally, the attention is constrained so that:</p>
</li>
</ol>
<p>$$A'_{ij} = \begin{cases} A_{ij} &amp; \text{if a bond exists between atoms } i \text{ and } j \\ -\infty &amp; \text{otherwise} \end{cases}$$</p>
<p>where $A_{ij}$ is the standard scaled dot-product attention score. This local message passing makes MG-BERT a variant of GNN, but one that can stack many layers (6 in the medium configuration) without oversmoothing, thanks to the residual connections inherited from the Transformer architecture.</p>
<ol start="4">
<li><strong>Supernode for graph-level readout.</strong> A [GLOBAL] supernode is added to each molecular graph, connected to all atoms. This node aggregates information from the entire molecule and serves as the molecular representation for downstream prediction.</li>
</ol>
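<p>The visibility-matrix mechanism above can be sketched in a few lines of NumPy. This is a single-head toy illustration under assumed shapes, not the authors' implementation; in practice the mask is applied per head inside multi-head attention.</p>
<pre><code class="language-python">import numpy as np

def local_attention(Q, K, V, adj):
    """Single-head scaled dot-product attention restricted to bonded atoms.

    Q, K, V: (n_atoms, d) arrays; adj: (n_atoms, n_atoms) 0/1 adjacency,
    with self-loops so each atom also attends to itself.
    Returns (output, attention_weights).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # standard attention scores A_ij
    scores = np.where(adj > 0, scores, -np.inf)   # hide non-bonded pairs (A'_ij)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy 3-atom chain A-B-C: atom A never attends directly to atom C.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
out, w = local_attention(Q, K, V, adj)
</code></pre>
<p>With many stacked layers, information still propagates between distant atoms, but only along bond paths, one hop per layer.</p>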
<h3 id="masked-atom-prediction">Masked Atom Prediction</h3>
<p>The pretraining strategy mirrors BERT&rsquo;s masked language model but operates on atoms:</p>
<ul>
<li>15% of atoms in each molecule are randomly selected (at least one atom per molecule)</li>
<li>Of selected atoms: 80% are replaced with [MASK], 10% are randomly replaced with another atom type, and 10% remain unchanged</li>
<li>The model is trained to predict the original atom type at masked positions</li>
<li>Loss is computed only at masked positions</li>
</ul>
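<p>The selection and corruption rule can be sketched in plain Python. The helper below is hypothetical (the paper does not publish this exact routine), but it follows the 15%/80%/10%/10% scheme described above, including the at-least-one-atom minimum.</p>
<pre><code class="language-python">import random

ATOMS = ["H", "C", "N", "O", "F", "S", "Cl", "P", "Br", "B", "I", "Si", "Se"]

def mask_atoms(atoms, rate=0.15, rng=random):
    """Apply BERT-style corruption to a list of atom tokens.

    Returns (corrupted_tokens, target_positions); the loss would be
    computed only at target_positions, against the original tokens.
    """
    n = len(atoms)
    k = max(1, round(n * rate))          # at least one atom per molecule
    positions = rng.sample(range(n), k)
    corrupted = list(atoms)
    for i in positions:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"        # 80%: mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(ATOMS)  # 10%: random atom type
        # remaining 10%: left unchanged
    return corrupted, sorted(positions)
</code></pre>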
<h3 id="model-configurations">Model Configurations</h3>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MG-BERT Small</td>
          <td>3</td>
          <td>2</td>
          <td>128</td>
          <td>256</td>
          <td>95.27%</td>
      </tr>
      <tr>
          <td>MG-BERT Medium</td>
          <td>6</td>
          <td>4</td>
          <td>256</td>
          <td>512</td>
          <td>98.31%</td>
      </tr>
      <tr>
          <td>MG-BERT Large</td>
          <td>12</td>
          <td>8</td>
          <td>576</td>
          <td>1152</td>
          <td>98.35%</td>
      </tr>
  </tbody>
</table>
<p>The medium configuration was selected for all experiments because it achieved the best downstream performance, despite the large model having slightly higher pretraining recovery accuracy. The authors attribute this to overfitting risk with the larger model.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MG-BERT was pretrained on 1.7 million compounds randomly selected from ChEMBL, with 10% held out for evaluation (1.53M training molecules). Molecules were converted to 2D undirected graphs using RDKit, with hydrogen atoms explicitly included. The model was pretrained for 10 epochs using Adam with learning rate 1e-4 and batch size 256.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<p>Sixteen datasets covering ADMET endpoints and common molecular properties were collected from ADMETlab and <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Dataset</th>
          <th>Category</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Regression</td>
          <td>Caco2</td>
          <td>Absorption</td>
          <td>979</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logD</td>
          <td>Physicochemical</td>
          <td>10,354</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logS</td>
          <td>Physicochemical</td>
          <td>5,045</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PPB</td>
          <td>Distribution</td>
          <td>1,480</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>tox</td>
          <td>Toxicity</td>
          <td>7,295</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>Physicochemical</td>
          <td>1,128</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>Physicochemical</td>
          <td>642</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipo</td>
          <td>Physicochemical</td>
          <td>4,200</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Ames</td>
          <td>Toxicity</td>
          <td>6,719</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBB</td>
          <td>Distribution</td>
          <td>1,855</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>FDAMDD</td>
          <td>Toxicity</td>
          <td>795</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>H_HT</td>
          <td>Toxicity</td>
          <td>2,170</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_inh</td>
          <td>Absorption</td>
          <td>2,125</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_sub</td>
          <td>Absorption</td>
          <td>1,210</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>Biophysics</td>
          <td>1,513</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
      </tr>
  </tbody>
</table>
<p>Datasets were split 8:1:1 (train:validation:test) with stratified sampling by SMILES length. Each experiment was repeated 10 times with random splits, reporting mean and standard deviation. Regression was evaluated by R-squared, classification by ROC-AUC. Early stopping with a maximum of 100 epochs was used.</p>
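<p>One plausible reading of a length-stratified 8:1:1 split is sketched below; the authors' exact procedure is not specified, so <code>stratified_split</code> is an assumption (sort by length, then deal consecutive items into the three splits so each sees the full length range).</p>
<pre><code class="language-python">import random

def stratified_split(smiles, seed=0):
    """8:1:1 train/val/test split, roughly stratified by SMILES length."""
    rng = random.Random(seed)
    # Sort by length, breaking ties randomly, then deal round-robin by decade.
    order = sorted(range(len(smiles)), key=lambda i: (len(smiles[i]), rng.random()))
    train, val, test = [], [], []
    for k, i in enumerate(order):
        r = k % 10
        (train if r < 8 else val if r == 8 else test).append(smiles[i])
    return train, val, test
</code></pre>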
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ol>
<li><strong>ECFP4-XGBoost</strong>: Extended connectivity fingerprints (diameter 4) with gradient-boosted trees</li>
<li><strong>GAT</strong>: Graph Attention Network</li>
<li><strong>GCN</strong>: Graph Convolutional Network</li>
<li><strong>CDDD</strong>: Continuous and Data-Driven Descriptors (pretrained RNN encoder on SMILES with a fully connected network)</li>
<li><strong>SMILES-BERT</strong>: Original BERT applied directly to SMILES strings</li>
</ol>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Two ablation studies were conducted:</p>
<ol>
<li><strong>Pretraining effectiveness</strong>: Comparing pretrained vs. non-pretrained MG-BERT under identical hyperparameters</li>
<li><strong>Hydrogen atoms</strong>: Comparing MG-BERT with and without explicit hydrogen atoms in the molecular graph</li>
</ol>
<h2 id="consistent-improvements-across-admet-benchmarks">Consistent Improvements Across ADMET Benchmarks</h2>
<h3 id="main-results">Main Results</h3>
<p>MG-BERT consistently outperformed all baselines across all 16 datasets. Key results on the 11 ADMET datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ECFP4-XGBoost</th>
          <th>GAT</th>
          <th>GCN</th>
          <th>CDDD</th>
          <th>SMILES-BERT</th>
          <th>MG-BERT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caco2 (R2)</td>
          <td>61.41</td>
          <td>69.16</td>
          <td>67.15</td>
          <td>73.42</td>
          <td>72.39</td>
          <td><strong>74.68</strong></td>
      </tr>
      <tr>
          <td>logD (R2)</td>
          <td>70.84</td>
          <td>84.62</td>
          <td>86.22</td>
          <td>85.85</td>
          <td>86.31</td>
          <td><strong>87.46</strong></td>
      </tr>
      <tr>
          <td>logS (R2)</td>
          <td>73.73</td>
          <td>84.06</td>
          <td>83.47</td>
          <td>84.01</td>
          <td>85.20</td>
          <td><strong>87.66</strong></td>
      </tr>
      <tr>
          <td>PPB (R2)</td>
          <td>55.11</td>
          <td>59.96</td>
          <td>57.34</td>
          <td>54.12</td>
          <td>62.37</td>
          <td><strong>65.94</strong></td>
      </tr>
      <tr>
          <td>Ames (AUC)</td>
          <td>87.21</td>
          <td>86.38</td>
          <td>87.04</td>
          <td>86.82</td>
          <td>87.69</td>
          <td><strong>89.33</strong></td>
      </tr>
      <tr>
          <td>BBB (AUC)</td>
          <td>94.62</td>
          <td>93.03</td>
          <td>92.67</td>
          <td>94.44</td>
          <td>94.02</td>
          <td><strong>95.41</strong></td>
      </tr>
      <tr>
          <td>BBBP (AUC)</td>
          <td>89.16</td>
          <td>90.33</td>
          <td>90.74</td>
          <td>91.12</td>
          <td>91.32</td>
          <td><strong>92.08</strong></td>
      </tr>
  </tbody>
</table>
<p>The overall improvement across all datasets was 28.1% (7.02% on classification, 21.28% on regression). Improvements were statistically significant at the 95% confidence level (paired t-test, P &lt;= 0.001).</p>
<h3 id="pretraining-ablation">Pretraining Ablation</h3>
<p>Pretraining improved performance by more than 2% on all datasets. The benefit was largest for small datasets: Caco2 improved by approximately 10 percentage points (64.79 to 74.68 R2), and FDAMDD improved by about 7.5 points (80.76 to 88.23 AUC). This confirms that self-supervised pretraining effectively addresses the labeled data scarcity problem.</p>
<h3 id="hydrogen-atom-ablation">Hydrogen Atom Ablation</h3>
<p>Including explicit hydrogen atoms improved pretraining recovery accuracy from 92.25% to 98.31% and consistently improved downstream performance. The authors provide an intuitive explanation: hydrogen atoms help determine bond counts for neighboring atoms, which is critical for the masked atom recovery task. They also show that removing hydrogens can make structurally distinct molecules (e.g., benzene and cyclohexane) indistinguishable at the graph level.</p>
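<p>The benzene/cyclohexane point can be made concrete with a toy adjacency-list construction. This is a hypothetical illustration that, like MG-BERT's 2D graphs without bond-type features, ignores bond orders: without explicit hydrogens both molecules collapse to the same six-carbon ring, while with hydrogens their degree patterns differ.</p>
<pre><code class="language-python">def heavy_atom_graph(n_carbons, h_per_carbon):
    """Adjacency list for a carbon ring with explicit hydrogens attached."""
    adj = {i: set() for i in range(n_carbons)}
    for i in range(n_carbons):                  # ring bonds
        j = (i + 1) % n_carbons
        adj[i].add(j); adj[j].add(i)
    nxt = n_carbons
    for i in range(n_carbons):                  # attach hydrogens
        for _ in range(h_per_carbon):
            adj[i].add(nxt); adj[nxt] = {i}
            nxt += 1
    return adj

def degree_multiset(adj):
    return sorted(len(nb) for nb in adj.values())

benzene = heavy_atom_graph(6, 1)      # aromatic carbon: one H each
cyclohexane = heavy_atom_graph(6, 2)  # sp3 carbon: two H each
</code></pre>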
<h3 id="interpretability-via-attention-visualization">Interpretability via Attention Visualization</h3>
<p>The authors provide two forms of interpretability analysis:</p>
<ol>
<li>
<p><strong>t-SNE visualization of atomic representations</strong>: Pretrained atomic representations cluster by atom type and, more specifically, by local chemical environment (e.g., aromatic carbons separate from aliphatic carbons, C-N bonds from C-O bonds). This demonstrates that pretraining captures neighborhood context beyond simple atom identity.</p>
</li>
<li>
<p><strong>Attention weight visualization</strong>: On the logD task, the supernode&rsquo;s attention focuses on polar groups (which govern lipophilicity). On the Ames mutagenicity task, attention concentrates on known mutagenic structural alerts (acyl chloride, nitrosamide, and azide groups). This provides chemically meaningful explanations for predictions.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The paper does not extensively discuss limitations, but several can be identified:</p>
<ul>
<li>The model uses only 2D molecular topology (atom types and bonds) without 3D conformational information or bond-type features</li>
<li>The atom dictionary is limited to 13 common types plus [UNK], which may lose information for molecules containing rarer elements</li>
<li>Evaluation is limited to ADMET-focused datasets; broader chemical spaces (e.g., materials, catalysts) are not tested</li>
<li>The comparison baselines do not include other graph-based pretraining methods (e.g., the contemporaneous Strategies for Pre-training Graph Neural Networks by Hu et al.)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL (random subset)</td>
          <td>1.7M molecules (1.53M train)</td>
          <td>10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>16 datasets (642-10,354 molecules)</td>
          <td>8:1:1 splits, stratified by SMILES length</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam (pretraining: lr=1e-4, batch=256; fine-tuning: lr from {1e-5, 5e-5, 1e-4}, batch from {16, 32, 64})</li>
<li><strong>Pretraining epochs</strong>: 10</li>
<li><strong>Fine-tuning</strong>: Up to 100 epochs with early stopping</li>
<li><strong>Dropout</strong>: Optimized per task in range [0.0, 0.5]</li>
<li><strong>Masking</strong>: 15% of atoms (80% [MASK], 10% random, 10% unchanged)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: MG-BERT Medium (6 layers, 4 heads, embedding size 256, FFN size 512)</li>
<li><strong>Molecule processing</strong>: RDKit for graph conversion with explicit hydrogens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>R-squared (R2)</td>
          <td>Regression</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Accuracy, RMSE</td>
          <td>Both</td>
          <td>Reported in supplementary Table S1</td>
      </tr>
  </tbody>
</table>
<p>All results averaged over 10 random splits with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements (GPU type, training time, or memory usage).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/Molecular-graph-BERT">Molecular-graph-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation; last code push August 2021</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yang, Z.-J., Wu, Z.-X., Yi, J.-C., Hsieh, C.-Y., Hou, T.-J., &amp; Cao, D.-S. (2021). MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. <em>Briefings in Bioinformatics</em>, 22(6), bbab152. <a href="https://doi.org/10.1093/bib/bbab152">https://doi.org/10.1093/bib/bbab152</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2021mgbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{MG-BERT}: leveraging unsupervised atomic representation learning for molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yang, Zhi-Jiang and Wu, Zhen-Xing and Yi, Jia-Cai and Hsieh, Chang-Yu and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbab152}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbab152}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DMP: Dual-View Molecule Pre-training (SMILES+GNN)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/dual-view-molecule-pretraining/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/dual-view-molecule-pretraining/</guid><description>DMP pre-trains molecular encoders using both SMILES Transformer and GNN branches with a BYOL-style dual-view consistency loss for property prediction.</description><content:encoded><![CDATA[<h2 id="a-dual-branch-pre-training-method-for-molecular-property-prediction">A Dual-Branch Pre-training Method for Molecular Property Prediction</h2>
<p>DMP (Dual-view Molecule Pre-training) is a <strong>Method</strong> paper that introduces a pre-training framework combining two complementary molecular encoders: a Transformer operating on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings and a Graph Neural Network (GNN) operating on molecular graphs. The two branches are trained jointly with masked language modeling (MLM) objectives plus a BYOL-style dual-view consistency loss. After pre-training on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> molecules, either branch (or both) can be fine-tuned for downstream tasks. The authors recommend the Transformer branch based on empirical results. DMP achieves the best reported performance on 7 of 9 <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks and 3 retrosynthesis benchmarks (at the time of the 2021 arXiv version).</p>
<h2 id="why-combine-smiles-and-graph-views-for-molecules">Why Combine SMILES and Graph Views for Molecules</h2>
<p>Prior molecule pre-training methods used either graph representations with GNNs or SMILES representations with Transformers, but not both. The authors observe that the two views are complementary: Transformers handle molecules with large atom distances (long chains) well, while GNNs handle molecules with many concatenated rings better. Neither model alone captures the full range of molecular structures effectively.</p>
<p>Existing GNN-based pre-training methods (Hu et al. 2020, MolCLR, GROVER) and SMILES-based methods (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a>) each have blind spots dictated by their input representation. DMP addresses this by pre-training both views simultaneously and enforcing representation consistency between them, so each branch benefits from the structural knowledge of the other.</p>
<h2 id="dual-view-consistency-with-byol-style-training">Dual-View Consistency with BYOL-Style Training</h2>
<p>The core innovation is the dual-view consistency objective, inspired by Bootstrap Your Own Latent (BYOL). Given a molecule $M$ with SMILES representation $M_s$ and graph representation $M_g$, DMP obtains high-level features from each branch:</p>
<ul>
<li><strong>Transformer branch</strong>: A RoBERTa-base model encodes the SMILES sequence. The [CLS] token output serves as the molecule representation $f_s$.</li>
<li><strong>GNN branch</strong>: A DeeperGCN network encodes the molecular graph. Mean+max pooling over atom representations yields $f_g$.</li>
</ul>
<p>The dual-view consistency loss uses nonlinear projection heads $\psi_g, \psi_s$ and prediction heads $\rho_g, \rho_s$:</p>
<p>$$
p_g = \psi_g(f_g), \quad q_g = \rho_g(p_g); \quad p_s = \psi_s(f_s), \quad q_s = \rho_s(p_s)
$$</p>
<p>The consistency loss maximizes cross-view <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> with stop-gradient (SG) on the target:</p>
<p>$$
\ell_{\text{dual}}(\tilde{M}_g, \tilde{M}_s) = -\cos(q_s, \text{SG}(p_g)) - \cos(q_g, \text{SG}(p_s))
$$</p>
<p>where $\cos(p, q) = \frac{p^\top q}{\|p\|_2 \|q\|_2}$ and $\tilde{M}_g, \tilde{M}_s$ are the masked versions of the inputs. The stop-gradient prevents representation collapse without requiring negative samples or a momentum encoder.</p>
<p>The full training objective combines three losses:</p>
<ol>
<li><strong>MLM on Transformer</strong>: Recover masked tokens in SMILES sequences</li>
<li><strong>MLM on GNN</strong>: Recover masked atoms in molecular graphs</li>
<li><strong>Dual-view consistency</strong>: The BYOL-style loss above</li>
</ol>
<p>Both MLM objectives and the consistency loss are necessary. Ablations show that removing MLM (using only dual-view loss) degrades performance, and using two branches of the same type (two Transformers or two GNNs) is less effective than the heterogeneous Transformer+GNN combination.</p>
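<p>A minimal NumPy sketch of the symmetric consistency loss is below. It is illustrative only: the heads here are plain callables, and the stop-gradient is implicit because a NumPy forward pass computes no gradients at all; in the actual training code, SG would detach the targets $p_s, p_g$ from the graph.</p>
<pre><code class="language-python">import numpy as np

def cosine(p, q):
    """Cosine similarity between two vectors."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def dual_view_loss(f_s, f_g, psi_s, psi_g, rho_s, rho_g):
    """BYOL-style dual-view consistency loss.

    f_s, f_g: SMILES- and graph-branch features; psi_* projection heads,
    rho_* prediction heads. Each branch's prediction is pulled toward the
    other branch's (stop-gradient) projection.
    """
    p_s, p_g = psi_s(f_s), psi_g(f_g)
    q_s, q_g = rho_s(p_s), rho_g(p_g)
    return -cosine(q_s, p_g) - cosine(q_g, p_s)

# With identity heads and identical features, agreement is perfect
# and the loss reaches its minimum of -2.
rng = np.random.default_rng(1)
f = rng.normal(size=8)
ident = lambda x: x
loss = dual_view_loss(f, f, ident, ident, ident, ident)
</code></pre>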
<h2 id="experiments-on-moleculenet-and-retrosynthesis">Experiments on MoleculeNet and Retrosynthesis</h2>
<h3 id="pre-training-setup">Pre-training Setup</h3>
<p>DMP is pre-trained on 10M molecules from PubChem (matching prior work). The Transformer branch uses RoBERTa-base (12 layers, hidden dim 768, 87M parameters). The GNN branch uses DeeperGCN (12 layers, hidden dim 384, 7.4M parameters). Combined, DMP has 104.1M parameters. Training runs for 200K iterations on 8 V100 GPUs over 3.8 days with Adam optimizer (lr = 5e-4, weight decay 0.01).</p>
<h3 id="molecular-property-prediction-moleculenet">Molecular Property Prediction (MoleculeNet)</h3>
<p>DMP is evaluated on 6 binary classification tasks (BBBP, Tox21, ClinTox, HIV, BACE, SIDER) using official DeepChem splits, and on 6 further tasks using scaffold splits from GROVER: BBBP, SIDER, and ClinTox classification plus ESOL, QM7, and QM8 regression.</p>
<p>Key results on DeepChem splits (ROC-AUC %):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolCLR</th>
          <th>TF (MLM)</th>
          <th>DMP_TF</th>
          <th>DMP_TF+GNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>73.6</td>
          <td>74.9</td>
          <td><strong>78.1</strong></td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>79.8</td>
          <td>77.6</td>
          <td><strong>78.8</strong></td>
          <td>79.1</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>93.2</td>
          <td>92.9</td>
          <td><strong>95.0</strong></td>
          <td>95.6</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>80.2</td>
          <td><strong>81.0</strong></td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>89.0</td>
          <td>88.0</td>
          <td><strong>89.3</strong></td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>68.0</td>
          <td>68.4</td>
          <td><strong>69.2</strong></td>
          <td>69.8</td>
      </tr>
  </tbody>
</table>
<p>On scaffold splits (comparison with GROVER and MPG):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GROVER</th>
          <th>MPG</th>
          <th>DMP_TF</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP (AUC)</td>
          <td>0.940</td>
          <td>0.922</td>
          <td><strong>0.945</strong></td>
      </tr>
      <tr>
          <td>SIDER (AUC)</td>
          <td>0.658</td>
          <td>0.661</td>
          <td><strong>0.695</strong></td>
      </tr>
      <tr>
          <td>ClinTox (AUC)</td>
          <td>0.944</td>
          <td>0.963</td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>ESOL (RMSE)</td>
          <td>0.831</td>
          <td>0.741</td>
          <td><strong>0.700</strong></td>
      </tr>
      <tr>
          <td>QM7 (MAE)</td>
          <td>72.6</td>
          <td>-</td>
          <td><strong>69.6</strong></td>
      </tr>
      <tr>
          <td>QM8 (MAE)</td>
          <td>0.0125</td>
          <td>-</td>
          <td><strong>0.0124</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis">Retrosynthesis</h3>
<p>DMP is tested on USPTO-50K (reaction type known/unknown) and USPTO-full. Using a &ldquo;DMP fusion&rdquo; approach (fusing pre-trained representations into a Transformer encoder-decoder for <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/">retrosynthesis</a>), DMP improves top-1 accuracy by 2-3 points over the baseline Transformer across all settings:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Transformer</th>
          <th>ChemBERTa fusion</th>
          <th>DMP fusion</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-50K (unknown)</td>
          <td>42.3</td>
          <td>43.9</td>
          <td><strong>46.1</strong></td>
      </tr>
      <tr>
          <td>USPTO-50K (known)</td>
          <td>54.2</td>
          <td>56.4</td>
          <td><strong>57.5</strong></td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>42.9</td>
          <td>-</td>
          <td><strong>45.0</strong></td>
      </tr>
  </tbody>
</table>
<p>For GNN-based retrosynthesis, replacing GLN&rsquo;s GNN modules with DMP&rsquo;s pre-trained GNN branch improves top-1 accuracy from 52.5% to 54.2% (unknown type) and from 64.2% to 66.5% (known type).</p>
<h3 id="representation-quality">Representation Quality</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pre-trained representations shows that DMP produces better scaffold-based clustering than either GNN-only or Transformer-only pre-training. The <a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> improves from 3.56 (GNN) and 3.59 (Transformer) to 2.19 (DMP), indicating much tighter within-scaffold clusters.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Combining heterogeneous views (SMILES + graph) during pre-training is more effective than using two branches of the same type. TF(x2) and GNN(x2) variants show smaller gains.</li>
<li>Both MLM and dual-view consistency loss contribute. Removing MLM (dual-view only) hurts performance, especially on BBBP (71.1 vs 78.1 with both losses).</li>
<li>The Transformer branch alone is recommended for downstream tasks, as it achieves strong results without adding GNN parameters at inference time.</li>
<li>Scaling pre-training data from 10M to 100M compounds yields marginal additional improvement.</li>
</ul>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ol>
<li>Training cost is higher than single-branch methods (3.8 days vs 2.5 days for TF-only on 8 V100s), since both branches must be trained jointly.</li>
<li>A fixed branch selection strategy is used at inference time. The authors note that a meta-controller for dynamic branch selection per molecule would be preferable.</li>
<li>The GNN branch uses simple atom masking without bond deletion or subgraph removal, leaving room for stronger graph-level pre-training objectives.</li>
</ol>
<p><strong>Relation to co-training:</strong> The authors clarify that DMP differs from classical <a href="https://en.wikipedia.org/wiki/Co-training">co-training</a> (Blum and Mitchell 1998) in that it does not require conditional independence between views and produces a pre-trained model rather than additional labeled data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>10M compounds</td>
          <td>Same subset as MolCLR and ChemBERTa</td>
      </tr>
      <tr>
          <td>Pre-training (large)</td>
          <td>PubChem subset</td>
          <td>100M compounds</td>
          <td>Additional scale experiment</td>
      </tr>
      <tr>
          <td>Evaluation (classification)</td>
          <td>MoleculeNet (BBBP, Tox21, ClinTox, HIV, BACE, SIDER)</td>
          <td>1.5K-41K molecules</td>
          <td>Official DeepChem splits</td>
      </tr>
      <tr>
          <td>Evaluation (regression)</td>
          <td>MoleculeNet (ESOL, QM7, QM8)</td>
          <td>Varies</td>
          <td>Scaffold splits from GROVER</td>
      </tr>
      <tr>
          <td>Evaluation (retrosynthesis)</td>
          <td>USPTO-50K, USPTO-full</td>
          <td>50K / 950K reactions</td>
          <td>Splits from Dai et al. (2019)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Transformer branch</strong>: RoBERTa-base with MLM. SMILES tokenized using regex from Schwaller et al. (2019).</li>
<li><strong>GNN branch</strong>: DeeperGCN with 12 layers, atom masking for MLM.</li>
<li><strong>Dual-view loss</strong>: BYOL-style with 3-layer MLP projection heads and 2-layer MLP prediction heads, stop-gradient on targets.</li>
<li><strong>Optimizer</strong>: Adam (lr=5e-4, beta1=0.9, beta2=0.98, epsilon=1e-6), weight decay 0.01, 10K warmup steps, linear decay.</li>
</ul>
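<p>The dual-view loss described above can be sketched compactly. The following is an illustrative PyTorch reconstruction, not the authors&rsquo; code: head sizes, the cosine-similarity form, and the symmetric weighting are assumptions, and <code>z_tf</code>/<code>z_gnn</code> stand for the projected outputs of the Transformer and GNN branches:</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(dim_in, dim_hidden, dim_out, n_layers):
    """Small MLP standing in for the projection/prediction heads
    (layer sizes here are illustrative, not the paper's)."""
    layers, d = [], dim_in
    for _ in range(n_layers - 1):
        layers += [nn.Linear(d, dim_hidden), nn.ReLU()]
        d = dim_hidden
    layers.append(nn.Linear(d, dim_out))
    return nn.Sequential(*layers)

def byol_dual_view_loss(z_tf, z_gnn, pred_tf, pred_gnn):
    """BYOL-style objective: each branch predicts the other branch's
    projected view, with a stop-gradient (detach) on the target side."""
    p_tf, p_gnn = pred_tf(z_tf), pred_gnn(z_gnn)
    loss_tf = -F.cosine_similarity(p_tf, z_gnn.detach(), dim=-1).mean()
    loss_gnn = -F.cosine_similarity(p_gnn, z_tf.detach(), dim=-1).mean()
    return 0.5 * (loss_tf + loss_gnn)
```

<p>The detach calls are the essential ingredient: without the stop-gradient on targets, BYOL-style objectives are prone to representation collapse.</p>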
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transformer branch</td>
          <td>RoBERTa-base (12L, 768H, 12 heads)</td>
          <td>87M</td>
      </tr>
      <tr>
          <td>GNN branch</td>
          <td>DeeperGCN (12L, 384H)</td>
          <td>7.4M</td>
      </tr>
      <tr>
          <td>DMP (total)</td>
          <td>Transformer + GNN + projection/prediction heads</td>
          <td>104.1M</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC, averaged over 3 random seeds</li>
<li>Regression: RMSE (ESOL) or MAE (QM7, QM8)</li>
<li>Retrosynthesis: Top-k exact match accuracy (k=1,3,5,10,20,50)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8 NVIDIA V100 GPUs, batch size 12288 tokens, gradient accumulation 16x</li>
<li>Pre-training time: 3.8 days (DMP), 2.5 days (TF-only), 1.7 days (GNN-only)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained model weights were identified for this paper. The paper references GLN&rsquo;s code repository (<a href="https://github.com/Hanjun-Dai/GLN">https://github.com/Hanjun-Dai/GLN</a>) for the retrosynthesis baseline but does not release DMP-specific code.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Hanjun-Dai/GLN">GLN (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Retrosynthesis baseline, not DMP code</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhu, J., Xia, Y., Wu, L., Xie, S., Zhou, W., Qin, T., Li, H., &amp; Liu, T.-Y. (2023). Dual-view Molecular Pre-training. In <em>Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</em> (pp. 3615-3627). <a href="https://doi.org/10.1145/3580305.3599317">https://doi.org/10.1145/3580305.3599317</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2023dualview,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-view Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhu, Jinhua and Xia, Yingce and Wu, Lijun and Xie, Shufang and Zhou, Wengang and Qin, Tao and Li, Houqiang and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3615--3627}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3580305.3599317}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPMM: A Bidirectional Molecular Foundation Model</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/spmm-bidirectional-structure-property/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/spmm-bidirectional-structure-property/</guid><description>SPMM is a multimodal molecular foundation model that aligns SMILES structures with property vectors for bidirectional generation and prediction tasks.</description><content:encoded><![CDATA[<h2 id="a-multimodal-foundation-model-for-structure-property-comprehension">A Multimodal Foundation Model for Structure-Property Comprehension</h2>
<p>This is a <strong>Method</strong> paper that introduces the Structure-Property Multi-Modal foundation model (SPMM), a transformer-based architecture that treats SMILES strings and molecular property vectors (PVs) as two separate modalities and learns to align them in a shared embedding space. The primary contribution is enabling bidirectional generation through a single pre-trained model: given a property vector, SPMM can generate molecules (inverse-QSAR), and given a SMILES string, it can predict all 53 properties simultaneously. The model also transfers to unimodal downstream tasks including MoleculeNet benchmarks and reaction prediction.</p>
<h2 id="bridging-the-gap-between-molecular-structure-and-properties">Bridging the Gap Between Molecular Structure and Properties</h2>
<p>Existing chemical pre-trained models typically learn representations from a single modality (SMILES, graphs, or fingerprints) and fine-tune for specific downstream tasks. While some approaches have attempted multimodal learning by combining SMILES with graph representations or InChI strings, these modalities encode nearly identical structural information, limiting the potential for emergent cross-modal knowledge.</p>
<p>The key gap SPMM addresses is the lack of multimodal pre-training that incorporates genuinely complementary modalities. Prior conditional molecule generation methods could typically control only a small number of properties simultaneously and required retraining when target properties changed. The authors draw on successes in vision-language pre-training (VLP), where aligning image and text modalities has enabled rich bidirectional understanding, and apply similar ideas to molecular structure and property domains.</p>
<h2 id="treating-property-vectors-as-a-language">Treating Property Vectors as a Language</h2>
<p>The core innovation in SPMM is treating a collection of 53 RDKit-computed molecular properties as a &ldquo;language&rdquo; where each property value is analogous to a word token. This design allows the model to attend to individual properties independently rather than treating the entire property vector as a single fixed-length condition.</p>
<h3 id="dual-stream-architecture">Dual-Stream Architecture</h3>
<p>SPMM follows the dual-stream VLP architecture. The model has three components:</p>
<ol>
<li><strong>SMILES Encoder</strong>: 6 BERT-base layers that encode tokenized SMILES (using a 300-subword BPE vocabulary) via self-attention</li>
<li><strong>PV Encoder</strong>: 6 BERT-base layers that encode the 53 normalized property values (each passed through a linear layer) with learnable positional embeddings</li>
<li><strong>Fusion Encoder</strong>: 6 BERT-base layers with cross-attention that combines both modalities, using one modality&rsquo;s features as queries and the other as keys/values</li>
</ol>
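<p>The fusion encoder&rsquo;s cross-attention step can be illustrated with a single attention layer. This is a minimal sketch, not the released implementation; the dimensions (768 hidden, 12 heads) follow BERT-base as described above, and the batch and sequence sizes are arbitrary:</p>

```python
import torch
import torch.nn as nn

# One modality's features act as queries, the other's as keys/values.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

smiles_feats = torch.randn(2, 40, 768)  # (batch, SMILES tokens, hidden)
pv_feats = torch.randn(2, 53, 768)      # (batch, 53 property tokens, hidden)

# SMILES queries attend over property keys/values; the output keeps the
# query sequence length, so each SMILES token gathers property context.
fused, weights = attn(query=smiles_feats, key=pv_feats, value=pv_feats)
print(fused.shape)  # torch.Size([2, 40, 768])
```

<p>Swapping <code>query</code> and <code>key</code>/<code>value</code> gives the opposite direction, with property tokens attending over SMILES features.</p>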
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>The model is pre-trained with four complementary losses:</p>
<p><strong>Contrastive Learning</strong> aligns SMILES and PV features in a shared embedding space. For [CLS] token outputs $\mathbf{S}_{cls}$ and $\mathbf{P}_{cls}$:</p>
<p>$$
\text{sim}(\mathbf{S}, \mathbf{P}) = \left(h_{S}(\mathbf{S}_{cls})\right)^{\top} h_{P}(\mathbf{P}_{cls})
$$</p>
<p>The intermodal similarities are computed with a learnable temperature $\tau$:</p>
<p>$$
s_{s2p} = \frac{\exp(\text{sim}(\mathbf{S}, \mathbf{P}) / \tau)}{\sum_{n=1}^{N} \exp(\text{sim}(\mathbf{S}, \mathbf{P}_{n}) / \tau)}
$$</p>
<p>The contrastive loss uses cross-entropy with one-hot labels (1 for same-molecule pairs):</p>
<p>$$
L_{\text{contrastive}} = \frac{1}{2}\left(H(y_{s2p}, s_{s2p}) + H(y_{p2s}, s_{p2s}) + H(y_{s2s}, s_{s2s}) + H(y_{p2p}, s_{p2p})\right)
$$</p>
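<p>In code, the SMILES-to-PV direction of this loss reduces to a softmax cross-entropy over the batch. A NumPy sketch follows; the p2s and intra-modal terms are symmetric, and $\tau$ is fixed here for illustration whereas SPMM learns it:</p>

```python
import numpy as np

def contrastive_loss(S, P, tau=0.07):
    """s2p InfoNCE-style loss over a batch of projected [CLS] features:
    row i of S and row i of P come from the same molecule (the positive
    pair); all other rows in the batch serve as negatives."""
    # L2-normalize so sim() is a dot product of unit vectors.
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    logits = S @ P.T / tau                        # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with one-hot labels = negative diagonal log-probability.
    return -np.mean(np.diag(log_probs))
```

<p>For perfectly aligned features (<code>P == S</code>) the diagonal dominates and the loss approaches zero; for random pairs it sits near $\log N$.</p>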
<p><strong>Next Word Prediction (NWP)</strong> trains autoregressive SMILES generation conditioned on the PV:</p>
<p>$$
L_{NWP} = \sum_{i=1}^{n} H\left(y_{i}^{NWP}, p^{NWP}(s_{i} \mid s_{0:i-1}, \mathbf{P})\right)
$$</p>
<p><strong>Next Property Prediction (NPP)</strong> applies the same autoregressive concept to property values, using mean-square-error loss:</p>
<p>$$
L_{NPP} = \sum_{i=1}^{n} \left(p_{i} - \hat{p}_{i}(p_{0:i-1}, \mathbf{S})\right)^{2}
$$</p>
<p><strong>SMILES-PV Matching (SPM)</strong> is a binary classification loss predicting whether a SMILES-PV pair originated from the same molecule, trained with hard-negative mining.</p>
<p>The overall pre-training loss combines all four:</p>
<p>$$
L = \widetilde{L}_{\text{contrastive}} + \widetilde{L}_{NWP} + L_{NPP} + L_{SPM}
$$</p>
<p>where tildes indicate the use of momentum teacher distillation to soften one-hot labels, acknowledging that multiple valid SMILES-PV pairings may exist.</p>
<h3 id="random-property-masking">Random Property Masking</h3>
<p>During pre-training, 50% of property values are randomly replaced with a special [UNK] token. This serves three purposes: preventing overfitting to specific properties, augmenting data, and enabling flexible inference where users can specify any subset of the 53 properties as generation conditions. The model can handle all $2^{53}$ possible property combinations at inference time despite never seeing most of them during training.</p>
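<p>The masking step itself is simple. A sketch, assuming a sentinel value stands in for the [UNK] token (in the actual model, [UNK] is a learned embedding, not a numeric placeholder):</p>

```python
import numpy as np

UNK = -999.0  # sentinel standing in for the [UNK] token (assumption)

def mask_properties(pv, p_mask=0.5, rng=None):
    """Randomly replace ~50% of the 53 normalized property values with the
    [UNK] placeholder. At inference the same mechanism lets a user condition
    on any subset of the properties by masking the rest."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(pv.shape) < p_mask
    out = pv.copy()
    out[mask] = UNK
    return out, mask
```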
<h2 id="experiments-across-bidirectional-and-unimodal-tasks">Experiments Across Bidirectional and Unimodal Tasks</h2>
<h3 id="pv-to-smiles-generation-conditional-molecule-design">PV-to-SMILES Generation (Conditional Molecule Design)</h3>
<p>The authors evaluate SPMM on multiple generation scenarios using 1000 unseen PubChem PVs:</p>
<table>
  <thead>
      <tr>
          <th>Sampling</th>
          <th>Input PV</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Norm. RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deterministic</td>
          <td>1000 unseen PVs</td>
          <td>0.995</td>
          <td>0.999</td>
          <td>0.961</td>
          <td>0.216</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Full PV (molecule 1)</td>
          <td>0.974</td>
          <td>0.905</td>
          <td>0.998</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Molar mass = 150</td>
          <td>0.974</td>
          <td>0.945</td>
          <td>0.872</td>
          <td>0.192</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>4 properties controlled</td>
          <td>0.998</td>
          <td>0.981</td>
          <td>0.952</td>
          <td>0.257</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>No control (all [UNK])</td>
          <td>0.971</td>
          <td>0.991</td>
          <td>0.950</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The normalized RMSE of 0.216 across 53 properties indicates that generated molecules closely match the input property conditions. The model can also perform unconditional generation (all properties masked) where outputs follow the pre-training distribution. The authors report that SPMM outperforms benchmark models including MolGAN, GraphVAE, and scaffold-based graph generative models in both conditional and unconditional settings (Supplementary Table 1).</p>
<h3 id="smiles-to-pv-generation-multi-property-prediction">SMILES-to-PV Generation (Multi-Property Prediction)</h3>
<p>When given 1000 unseen ZINC15 molecules, SPMM predicts all 53 properties autoregressively with a mean $r^{2}$ of 0.924 across all properties.</p>
<h3 id="moleculenet-benchmarks">MoleculeNet Benchmarks</h3>
<p>Using only the SMILES encoder (6 BERT layers), SPMM achieves best or competitive performance on 9 <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SPMM</th>
          <th>Best Baseline</th>
          <th>Baseline Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.817</td>
          <td>0.798</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>LIPO</td>
          <td>RMSE</td>
          <td>0.681</td>
          <td>0.660</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>1.868</td>
          <td>1.877</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>BACE (reg)</td>
          <td>RMSE</td>
          <td>1.041</td>
          <td>1.047</td>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Clearance</td>
          <td>RMSE</td>
          <td>42.607</td>
          <td>43.175</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>AUROC</td>
          <td>75.1%</td>
          <td>73.6%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BACE (cls)</td>
          <td>AUROC</td>
          <td>84.4%</td>
          <td>86.3%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>AUROC</td>
          <td>92.7%</td>
          <td>91.2%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>AUROC</td>
          <td>66.9%</td>
          <td>67.2%</td>
          <td>ChemRL-GEM</td>
      </tr>
  </tbody>
</table>
<p>SPMM achieved best performance on 5 of 9 tasks, with notable gains on BBBP (75.1% vs. 73.6%) and ClinTox (92.7% vs. 91.2%). Without pre-training, all scores dropped substantially.</p>
<h3 id="dili-classification">DILI Classification</h3>
<p>On Drug-Induced Liver Injury prediction, SPMM achieved 92.6% AUROC, outperforming the 5-ensemble model of Ai et al. (90.4% AUROC) while using a single model.</p>
<h3 id="reaction-prediction">Reaction Prediction</h3>
<p>On USPTO-480k forward reaction prediction, SPMM achieved 91.5% top-1 accuracy, the highest among all models tested (including <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a> at 91.3%). On USPTO-50k retro-reaction prediction, SPMM reached 53.4% top-1 accuracy, second only to Chemformer (54.3%) among string-based models.</p>
<h2 id="bidirectional-generation-from-a-single-pre-trained-model">Bidirectional Generation From a Single Pre-trained Model</h2>
<p>SPMM demonstrates that multimodal pre-training with genuinely complementary modalities (structure and properties, rather than structurally redundant representations) enables a single foundation model to handle both generation directions and downstream unimodal tasks. Key findings include:</p>
<ol>
<li><strong>Flexible conditional generation</strong>: The [UNK] masking strategy allows controlling any subset of 53 properties at inference time without retraining, a capability not demonstrated by prior methods.</li>
<li><strong>Interpretable cross-attention</strong>: Attention visualizations show that the model learns chemically meaningful structure-property relationships (e.g., hydrogen bonding properties attend to oxygen and nitrogen atoms; ring count properties attend to ring tokens).</li>
<li><strong>Competitive unimodal transfer</strong>: Despite using only 6 BERT layers and 50M pre-training molecules (smaller than <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/">ChemBERTa-2</a>&rsquo;s 77M or Chemformer&rsquo;s 100M), the SMILES encoder alone achieves best or second-best results on 5 of 9 MoleculeNet tasks and the highest forward reaction prediction accuracy among tested models.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>SMILES representation constraints</strong>: Implicit connectivity information in SMILES means small structural changes can cause drastic string changes. Graph representations could be a complementary alternative.</li>
<li><strong>Stereochemistry blindness</strong>: All 53 RDKit properties used are invariant to stereochemistry, meaning different stereoisomers produce identical PVs. The contrastive loss then forces their SMILES encoder outputs to converge, which the authors identify as the primary factor limiting MoleculeNet performance on stereo-sensitive tasks.</li>
<li><strong>No wet-lab validation</strong>: Generated molecules and predicted properties are not experimentally verified.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>50M molecules</td>
          <td>SMILES + 53 RDKit properties</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>642-4200 per task</td>
          <td>Scaffold split via DeepChem (8:1:1)</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>Ai et al. dataset</td>
          <td>Not specified</td>
          <td>Following published preparation</td>
      </tr>
      <tr>
          <td>Forward reaction</td>
          <td>USPTO-480k</td>
          <td>479,035 pairs</td>
          <td>Reactant-product pairs</td>
      </tr>
      <tr>
          <td>Retro reaction</td>
          <td>USPTO-50k</td>
          <td>50,037 pairs</td>
          <td>Product-reactant pairs, no reaction types used</td>
      </tr>
      <tr>
          <td>SMILES-to-PV test</td>
          <td>ZINC15</td>
          <td>1000 molecules</td>
          <td>Not in pre-training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: BPE with 300-subword dictionary</li>
<li><strong>Property masking</strong>: 50% random replacement with [UNK] during pre-training</li>
<li><strong>Momentum distillation</strong>: EMA parameter $\lambda = 0.995$, soft-label mixing $\alpha$ linearly warmed from 0 to 0.4 over first epoch</li>
<li><strong>Contrastive queue</strong>: Size $k = 24{,}576$ for storing recent SMILES and PV instances</li>
<li><strong>Beam search</strong>: $k = 2$ for PV-to-SMILES generation</li>
<li><strong>SMILES augmentation</strong>: Random non-canonical augmentation with probability 0.5 for reaction tasks</li>
</ul>
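<p>The momentum teacher behind the distillation entry above is maintained by a plain exponential moving average of the student&rsquo;s parameters, with $\lambda = 0.995$. A minimal sketch over flat parameter lists:</p>

```python
def ema_update(teacher_params, student_params, lam=0.995):
    """Momentum-teacher update (sketch): after each student step, teacher
    parameters move toward the student by a factor of (1 - lam)."""
    return [lam * t + (1.0 - lam) * s
            for t, s in zip(teacher_params, student_params)]
```

<p>The teacher&rsquo;s soft outputs then replace the hard one-hot contrastive labels, which is how multiple valid SMILES-PV pairings are accommodated.</p>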
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: 6 BERT-base encoder layers each for SMILES encoder, PV encoder, and fusion encoder (18 total layers)</li>
<li><strong>Vocabulary</strong>: 300 BPE subwords for SMILES; 53 property tokens for PV</li>
<li><strong>Pre-trained weights</strong>: Available via GitHub</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Validity</td>
          <td>99.5%</td>
          <td>1000 unseen PubChem PVs</td>
      </tr>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Normalized RMSE</td>
          <td>0.216</td>
          <td>Across 53 properties</td>
      </tr>
      <tr>
          <td>SMILES-to-PV</td>
          <td>Mean $r^{2}$</td>
          <td>0.924</td>
          <td>1000 ZINC15 molecules</td>
      </tr>
      <tr>
          <td>Forward reaction (USPTO-480k)</td>
          <td>Top-1 accuracy</td>
          <td>91.5%</td>
          <td>Best among all tested models</td>
      </tr>
      <tr>
          <td>Retro reaction (USPTO-50k)</td>
          <td>Top-1 accuracy</td>
          <td>53.4%</td>
          <td>Second-best string-based</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>AUROC</td>
          <td>92.6%</td>
          <td>Single model vs. 5-ensemble</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Pre-training</strong>: 8 NVIDIA A100 GPUs, approximately 52,000 batch iterations, roughly 12 hours</li>
<li><strong>Batch size</strong>: 96</li>
<li><strong>Optimizer</strong>: AdamW with weight decay 0.02</li>
<li><strong>Learning rate</strong>: Warmed up to $10^{-4}$, cosine decay to $10^{-5}$</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jinhojsk515/SPMM">SPMM Source Code</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with experimental scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10567599">SPMM Zenodo Archive</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Archived version for reproducibility</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>50M molecules for pre-training</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Varies</td>
          <td>Benchmark datasets via DeepChem</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, J., &amp; Ye, J. C. (2024). Bidirectional generation of structure and properties through a single molecular foundation model. <em>Nature Communications</em>, 15, 2323. <a href="https://doi.org/10.1038/s41467-024-46440-3">https://doi.org/10.1038/s41467-024-46440-3</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chang2024bidirectional,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Bidirectional generation of structure and properties through a single molecular foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chang, Jinho and Ye, Jong Chul}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2323}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-46440-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>nach0: A Multimodal Chemical and NLP Foundation Model</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/nach0-multimodal-chemical-language-model/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/nach0-multimodal-chemical-language-model/</guid><description>nach0 is a T5-based encoder-decoder model pre-trained on SMILES, scientific text, and patents, then instruction-tuned for chemical and NLP tasks.</description><content:encoded><![CDATA[<h2 id="a-multi-domain-encoder-decoder-for-chemistry-and-nlp">A Multi-Domain Encoder-Decoder for Chemistry and NLP</h2>
<p>nach0 is a <strong>Method</strong> paper that introduces a unified encoder-decoder foundation model capable of handling both natural language processing (NLP) tasks and chemistry tasks within a single architecture. The primary contribution is demonstrating that a T5-based model pre-trained on scientific text, patents, and <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> molecular strings can be instruction-tuned to perform molecular property prediction, reaction prediction, molecular generation, named entity recognition, question answering, and cross-domain translation (text-to-molecule and molecule-to-text) simultaneously. The model is available in base (250M parameters) and large (780M parameters) configurations.</p>
<h2 id="bridging-chemical-and-linguistic-representations">Bridging Chemical and Linguistic Representations</h2>
<p>Most existing biomedical language models (BioBERT, SciFive, BioMegatron) are trained exclusively on natural language text from sources like PubMed, omitting chemical structure information encoded in SMILES strings. Conversely, chemistry-specific models trained on SMILES data often lack the ability to process natural language instructions or perform NLP tasks. Models like <a href="/notes/computational-chemistry/llms-for-chemistry/galactica-large-language-model-for-science/">Galactica</a> and MolT5 attempted to bridge this gap by training on both natural language and chemical data, but they were not fine-tuned on a diverse set of chemical tasks using instruction tuning in a multi-task fashion.</p>
<p>nach0 addresses this by creating a shared representation space for both modalities and fine-tuning across a comprehensive set of tasks spanning three domains: NLP-only tasks, chemistry-only tasks, and cross-domain tasks that require translating between natural language and molecular representations.</p>
<h2 id="unified-text-to-text-framework-with-smiles-tokenization">Unified Text-to-Text Framework with SMILES Tokenization</h2>
<p>The core innovation in nach0 is formulating all chemical and linguistic tasks as text-to-text problems within a single encoder-decoder transformer, combined with a specialized SMILES tokenization strategy.</p>
<h3 id="smiles-token-integration">SMILES Token Integration</h3>
<p>Rather than treating SMILES as plain text, nach0 extends the T5 vocabulary with dedicated SMILES tokens. Each SMILES token is annotated with special symbols in the format <code>&lt;sm_{token}&gt;</code>, creating a distinct vocabulary space for molecular representations while preserving the natural language vocabulary from FLAN-T5. The embedding matrix reuses the learned embeddings of the pre-trained model for the original tokens, while the newly added chemical tokens are initialized by copying the first entries of the original embedding matrix.</p>
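<p>The <code>&lt;sm_{token}&gt;</code> wrapping is easy to sketch. The tokenizer below uses the regex style popularized by Schwaller et al. (2019) as an illustrative stand-in; nach0&rsquo;s actual vocabulary construction may differ:</p>

```python
import re

# Regex-based SMILES tokenizer (illustrative): brackets, two-letter
# elements, single atoms, bonds, ring-closure digits, etc.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def to_sm_tokens(smiles):
    """Tokenize a SMILES string and wrap each token as <sm_{token}>,
    keeping chemical tokens distinct from the natural language vocabulary."""
    return [f"<sm_{tok}>" for tok in SMILES_REGEX.findall(smiles)]

print(to_sm_tokens("CCO"))  # ['<sm_C>', '<sm_C>', '<sm_O>']
```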
<h3 id="architecture">Architecture</h3>
<p>Both model sizes use the standard T5 encoder-decoder architecture:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Parameters</th>
          <th>Layers</th>
          <th>Hidden Size</th>
          <th>FFN Size</th>
          <th>Attention Heads</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>250M</td>
          <td>12</td>
          <td>768</td>
          <td>3072</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>780M</td>
          <td>24</td>
          <td>1024</td>
          <td>4096</td>
          <td>16</td>
      </tr>
  </tbody>
</table>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>The model is pre-trained with a language modeling objective on three data sources:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Documents</th>
          <th>Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubMed abstracts (chemistry-filtered)</td>
          <td>13M</td>
          <td>355M</td>
      </tr>
      <tr>
          <td>USPTO patent descriptions</td>
          <td>119K</td>
          <td>2.9B</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC</a> molecular database</td>
          <td>~100M</td>
          <td>4.7B</td>
      </tr>
  </tbody>
</table>
<h3 id="instruction-tuning">Instruction Tuning</h3>
<p>Following the approach of Raffel et al. and Chung et al., nach0 uses natural language prompts to formulate each task. For example, a retrosynthesis task might be phrased as &ldquo;What reactants could be used to synthesize [SMILES]?&rdquo; and a property prediction task as &ldquo;Can [SMILES] penetrate the <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a>?&rdquo; This enables multi-task training across all domains with a single loss function and shared hyperparameters.</p>
<p>Training uses a batch size of 1024, learning rate of $1 \times 10^{-4}$, and weight decay of 0.01. Pre-training runs for one epoch, and fine-tuning for 10 epochs. Data mixing follows the examples-proportional mixing strategy from T5.</p>
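<p>Examples-proportional mixing caps each task&rsquo;s effective size so that the enormous datasets (e.g. ZINC) do not drown out the small ones. A sketch of the T5-style rate computation; the cap $K = 2^{16}$ here is illustrative, as nach0&rsquo;s exact value is not stated in this summary:</p>

```python
def mixing_rates(dataset_sizes, K=2**16):
    """T5-style examples-proportional mixing: task m is sampled with rate
    min(e_m, K) / sum_n min(e_n, K), where e_m is its example count."""
    capped = {name: min(n, K) for name, n in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}

rates = mixing_rates({"ner": 20_000, "retro": 50_000, "zinc_lm": 100_000_000})
```

<p>Without the cap, the 100M-molecule ZINC corpus would dominate nearly every batch; with it, the language-modeling task is sampled only slightly more often than retrosynthesis.</p>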
<h2 id="multi-task-evaluation-across-nlp-and-chemistry-benchmarks">Multi-Task Evaluation Across NLP and Chemistry Benchmarks</h2>
<p>nach0 is evaluated on a comprehensive set of benchmarks spanning three task categories.</p>
<h3 id="task-categories">Task Categories</h3>
<p><strong>NLP tasks</strong>: Named entity recognition (BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, JNLPBA), PICO extraction (EBM PICO), textual entailment (MedNLI, SciTail), relation extraction (ChemProt, DDI, GAD), sentence similarity (BIOSSES), document classification (HoC), and question answering (PubMedQA, BioASQ, MedMCQA, MMLU).</p>
<p><strong>Chemistry tasks</strong>: Molecular property prediction (ESOL, FreeSolv, Lipophilicity, BBBP, HIV, BACE from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>; QM9 from Mol-Instructions), molecular generation (<a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>), forward reaction prediction, reagent prediction, and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (from Mol-Instructions/USPTO).</p>
<p><strong>Cross-domain tasks</strong>: Description-guided molecule design and molecular description generation (from Mol-Instructions).</p>
<h3 id="baselines">Baselines</h3>
<p>nach0 is compared against FLAN-T5 (250M), SciFive (220M), and MolT5 (220M), all trained in multi-task fashion.</p>
<h3 id="key-results">Key Results</h3>
<p>On chemistry and cross-domain tasks, nach0 base consistently outperforms all base-sized baselines. Selected highlights from Table 3:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>MolT5</th>
          <th>SciFive</th>
          <th>FLAN</th>
          <th>nach0 Base</th>
          <th>nach0 Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Forward reaction</td>
          <td>Acc@1</td>
          <td>27.0%</td>
          <td>60.0%</td>
          <td>59.0%</td>
          <td>88.0%</td>
          <td>89.9%</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>Acc@1</td>
          <td>15.0%</td>
          <td>31.0%</td>
          <td>31.0%</td>
          <td>53.0%</td>
          <td>56.3%</td>
      </tr>
      <tr>
          <td>Reagent prediction</td>
          <td>Acc@1</td>
          <td>1.1%</td>
          <td>3.8%</td>
          <td>4.0%</td>
          <td>6.3%</td>
          <td>13.1%</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>BA</td>
          <td>0.58</td>
          <td>0.65</td>
          <td>0.65</td>
          <td>0.74</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>BA</td>
          <td>0.55</td>
          <td>0.66</td>
          <td>0.60</td>
          <td>0.67</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>HFE (FreeSolv)</td>
          <td>R²</td>
          <td>-0.36</td>
          <td>0.51</td>
          <td>0.55</td>
          <td>0.77</td>
          <td>0.78</td>
      </tr>
      <tr>
          <td>MOSES (FCD)</td>
          <td>FCD/Test</td>
          <td>0.521</td>
          <td>0.578</td>
          <td>0.529</td>
          <td>0.311</td>
          <td>0.304</td>
      </tr>
      <tr>
          <td>Description-guided mol. design</td>
          <td>BLEU-2</td>
          <td>30.3%</td>
          <td>44.2%</td>
          <td>43.6%</td>
          <td>49.0%</td>
          <td>48.8%</td>
      </tr>
      <tr>
          <td>Mol. description gen.</td>
          <td>BLEU-2</td>
          <td>35.6%</td>
          <td>39.6%</td>
          <td>38.6%</td>
          <td>43.9%</td>
          <td>41.7%</td>
      </tr>
  </tbody>
</table>
<p>On NLP tasks, nach0 base performs comparably to FLAN base, with the two models trading wins across different tasks. nach0 large improves substantially over nach0 base on most tasks.</p>
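<p>Several of the property-prediction rows above report balanced accuracy (BA), the unweighted mean of per-class recall, which is more informative than raw accuracy on the imbalanced labels typical of datasets like BBBP and BACE. A minimal sketch of the metric (not the paper's evaluation code):</p>

```python
def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of per-class recall.
    Matches sklearn.metrics.balanced_accuracy_score for these inputs."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(1 for p in preds_for_c if p == c) / len(preds_for_c))
    return sum(recalls) / len(recalls)

# Imbalanced toy labels: plain accuracy (7/8) would hide the weak minority class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]
print(balanced_accuracy(y_true, y_pred))  # (6/6 + 1/2) / 2 = 0.75
```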
<h3 id="ablation-study">Ablation Study</h3>
<p>The ablation study (Table 4) examines the impact of multi-task training across chemical task groups. Key findings:</p>
<ul>
<li>nach0 trained on all chemical tasks jointly outperforms models trained on individual task groups (prediction-only, reaction-only, or generation-only) on the total set of metrics</li>
<li>The joint model shows lower novelty scores on MOSES than the generation-only model, but this reflects reduced overfitting to the training data rather than worse generative performance</li>
<li>nach0 consistently outperforms MolT5 across all chemical task configurations, demonstrating the benefit of pre-training on both natural language and chemical data with specialized SMILES tokens</li>
</ul>
<h3 id="case-studies">Case Studies</h3>
<p>Two applied case studies demonstrate nach0 in drug discovery scenarios:</p>
<ol>
<li>
<p><strong>End-to-end drug discovery for <a href="https://en.wikipedia.org/wiki/Diabetes">diabetes mellitus</a></strong>: Using a sequence of prompts, nach0 identifies biological targets, analyzes mechanisms of action, generates molecular structures, proposes synthesis routes, and predicts molecular properties.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Janus_kinase_3">JAK3</a> inhibitor generation with Chemistry42</strong>: nach0 replaces 42 specialized generative models in Insilico Medicine&rsquo;s Chemistry42 platform. In 45 minutes, nach0 generates 8 molecules satisfying all 2D and 3D requirements (hinge binding, active-site binding), versus a 0.04% discovery rate from a combinatorial generator run for 24 hours. Chemistry42&rsquo;s full pipeline (72 hours) still produces better structures because it uses reinforcement-learning feedback and explicit structural constraints.</p>
</li>
</ol>
<h3 id="comparison-with-chatgpt">Comparison with ChatGPT</h3>
<p>On a subset evaluation, fine-tuned nach0 base outperforms GPT-3.5-turbo on all tested tasks: EBM PICO (F1: 67.6% vs. 64.4%), MedMCQA-Open (BLEU-2: 6.3% vs. 1.7%), and molecular description generation (BLEU-2: 42.8% vs. 2.2%).</p>
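<p>The BLEU-2 figures quoted throughout are the geometric mean of unigram and bigram precision with a brevity penalty. A simplified single-reference, unsmoothed sketch (actual evaluations typically use a library implementation such as NLTK&rsquo;s <code>sentence_bleu</code>):</p>

```python
import math
from collections import Counter

def bleu2(reference, candidate):
    """Sentence-level BLEU-2: geometric mean of unigram and bigram
    precision with a brevity penalty (single reference, no smoothing)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    precisions = []
    for n in (1, 2):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0.0:
        return 0.0
    brevity = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / 2)

ref = "the molecule is a monocarboxylic acid".split()
print(bleu2(ref, ref))                                # 1.0
print(bleu2(ref, "the molecule is an acid".split()))  # partial overlap, < 1.0
```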
<h2 id="competitive-multi-task-performance-with-clear-limitations">Competitive Multi-Task Performance with Clear Limitations</h2>
<p>nach0 demonstrates that a single encoder-decoder model can achieve competitive results across both chemical and NLP tasks when pre-trained on mixed-modality data and fine-tuned with instruction tuning. The model&rsquo;s strongest advantages appear on chemistry tasks (reaction prediction, property prediction, molecular generation), where specialized SMILES tokenization and chemical pre-training provide clear benefits over general-purpose models of similar scale.</p>
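<p>nach0&rsquo;s vocabulary extends the T5 tokenizer with dedicated SMILES tokens so that molecules are not fragmented by a natural-language subword model. The paper&rsquo;s exact tokenizer is not reproduced here; the widely used atom-level regex (in the style of Schwaller et al.) illustrates the kind of chemically meaningful segmentation involved:</p>

```python
import re

# Atom-level SMILES tokenizer: bracket atoms, two-letter elements (Cl, Br),
# ring-closure digits, and bond/branch symbols each become a single token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|@@?|%\d{2}|[=#\-+\\/:~*$().]|\d)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable characters in input"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```

<p>Note the order of alternatives: <code>Cl</code> and <code>Br</code> must be matched before the single-letter elements, otherwise chlorine would be split into carbon plus a stray <code>l</code>.</p>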
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ol>
<li>
<p><strong>Not at expert-chemist level</strong>: Human evaluations indicate the model does not match domain-expert performance. Key gaps include chemical reasoning, alignment with domain-specific knowledge graphs, and the ability to learn from expert feedback.</p>
</li>
<li>
<p><strong>SMILES-only molecular representation</strong>: The model lacks 3D geometric information. SMILES notation is not one-to-one with molecular structures, and the model does not incorporate molecular graphs or 3D coordinates. The authors suggest <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> as a potential alternative representation.</p>
</li>
<li>
<p><strong>Prompt sensitivity</strong>: Performance depends on prompt quality and specificity. Over-reliance on domain-specific prompts may limit response diversity.</p>
</li>
<li>
<p><strong>Limited chemical diversity</strong>: Cross-domain datasets from Mol-Instructions primarily cover known drugs and chemical probes from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, representing only a fraction of predicted chemical space.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending nach0 with protein sequence modalities (using <a href="/notes/computational-chemistry/molecular-representations/group-selfies-fragment-molecular-representation/">Group SELFIES</a>), expanding zero-shot evaluation capabilities, and integrating knowledge graph information through self-supervised approaches.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (text)</td>
          <td>PubMed abstracts</td>
          <td>13M docs, 355M tokens</td>
          <td>Filtered for chemistry-related content</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>USPTO patents</td>
          <td>119K docs, 2.9B tokens</td>
          <td>Patent descriptions</td>
      </tr>
      <tr>
          <td>Pre-training (chemical)</td>
          <td>ZINC</td>
          <td>~100M docs, 4.7B tokens</td>
          <td>Molecular SMILES strings</td>
      </tr>
      <tr>
          <td>Fine-tuning (NLP)</td>
          <td>17 NLP datasets</td>
          <td>Varies</td>
          <td>See Table 1 in paper</td>
      </tr>
      <tr>
          <td>Fine-tuning (chemistry)</td>
          <td>MoleculeNet, MOSES, Mol-Instructions</td>
          <td>Varies</td>
          <td>Predefined or random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5 encoder-decoder (base: 250M, large: 780M parameters)</li>
<li>Pre-training objective: Language modeling (masked span prediction)</li>
<li>Fine-tuning: Multi-task instruction tuning with examples-proportional mixing</li>
<li>Hyperparameters: batch size 1024, learning rate $1 \times 10^{-4}$, weight decay 0.01</li>
<li>Pre-training: 1 epoch; fine-tuning: 10 epochs</li>
</ul>
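<p>Examples-proportional mixing samples each fine-tuning task in proportion to its dataset size, usually with a cap so that huge datasets do not drown out small ones. The sizes and cap below are illustrative assumptions, not nach0&rsquo;s actual values:</p>

```python
def mixing_rates(dataset_sizes, limit=None):
    """Examples-proportional mixing (T5-style): task m is sampled with
    probability min(e_m, limit) / sum_k min(e_k, limit), where e_m is
    that task's number of training examples."""
    capped = {name: min(n, limit) if limit is not None else n
              for name, n in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: n / total for name, n in capped.items()}

# Hypothetical sizes -- the cap keeps BBBP from vanishing next to MOSES.
sizes = {"reaction_prediction": 1_000_000, "bbbp": 2_000, "moses": 1_600_000}
print(mixing_rates(sizes, limit=500_000))
```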
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_base">nach0 Base (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>250M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_large">nach0 Large (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>780M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://github.com/insilicomedicine/nach0">nach0 GitHub Repository</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation spans 17+ NLP benchmarks and 10+ chemistry benchmarks. Metrics include F1 (NER, RE, classification), accuracy (QA, entailment, reaction prediction), balanced accuracy (molecular property classification), R²/RMSE (regression), BLEU-2 (generation), and FCD/SNN/validity/novelty (molecular generation via MOSES).</p>
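<p>The coefficient of determination R² used for the regression tasks can be negative when a model predicts worse than a constant mean-value baseline, which is how MolT5&rsquo;s FreeSolv score in Table 3 ends up at &minus;0.36. A minimal sketch:</p>

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; 1.0 is perfect, 0.0 matches the
    mean-only baseline, and values below 0 are worse than that baseline."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r_squared([1.0, 2.0, 3.0], [3.0, 3.0, 3.0]))  # -1.5 (worse than the mean)
```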
<h3 id="hardware">Hardware</h3>
<ul>
<li>Base models: NVIDIA A4000 and A5000 GPUs</li>
<li>Large models: NVIDIA DGX cloud platform</li>
<li>Training used tensor and pipeline parallelism via NeMo toolkit</li>
<li>Specific GPU counts and training times not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Livne, M., Miftahutdinov, Z., Tutubalina, E., Kuznetsov, M., Polykovskiy, D., Brundyn, A., Jhunjhunwala, A., Costa, A., Aliper, A., Aspuru-Guzik, A., &amp; Zhavoronkov, A. (2024). nach0: Multimodal Natural and Chemical Languages Foundation Model. <em>Chemical Science</em>, 15(22), 8380-8389. <a href="https://doi.org/10.1039/D4SC00966E">https://doi.org/10.1039/D4SC00966E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{livne2024nach0,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{nach0: multimodal natural and chemical languages foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Livne, Micha and Miftahutdinov, Zulfat and Tutubalina, Elena and Kuznetsov, Maksim and Polykovskiy, Daniil and Brundyn, Annika and Jhunjhunwala, Aastha and Costa, Anthony and Aliper, Alex and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8380--8389}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D4SC00966E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>