<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Python on Hunter Heidenreich | Senior AI Research Scientist</title><link>https://hunterheidenreich.com/tags/python/</link><description>Recent content in Python on Hunter Heidenreich | Senior AI Research Scientist</description><image><title>Hunter Heidenreich | Senior AI Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 31 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/python/index.xml" rel="self" type="application/rss+xml"/><item><title>NLP Models That Automate Programming for Chemistry</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/nlp-models-transform-chemistry/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/nlp-models-transform-chemistry/</guid><description>A perspective on how code-generating LLMs like OpenAI Codex and GPT-3 will reshape computational chemistry research workflows and education.</description><content:encoded><![CDATA[<h2 id="a-perspective-on-code-generating-llms-for-chemistry">A Perspective on Code-Generating LLMs for Chemistry</h2>
<p>This is a <strong>Position</strong> paper arguing that large language models (LLMs) capable of generating code from natural language prompts, specifically OpenAI&rsquo;s Codex and GPT-3, are poised to transform both chemistry research and chemistry education. Published in the inaugural volume of Digital Discovery (RSC), the paper combines a brief history of NLP developments with concrete demonstrations of code generation for computational chemistry tasks, then offers a forward-looking perspective on challenges and opportunities.</p>
<h2 id="bridging-the-gap-between-natural-language-and-scientific-software">Bridging the Gap Between Natural Language and Scientific Software</h2>
<p>The authors identify a core friction in modern computational chemistry: while the number of available software packages has grown dramatically, researchers spend a large fraction of their time learning interfaces to these packages rather than doing science. Tasks like searching documentation, following tutorials, and trial-and-error experimentation with APIs consume effort that could be directed at research itself.</p>
<p>At the same time, programming assignments in chemistry courses serve dual pedagogical purposes (reinforcing physical intuition and teaching marketable skills), but are constrained by students&rsquo; median programming experience. The emergence of code-generating NLP models opens the possibility of reducing both barriers simultaneously.</p>
<h2 id="code-generation-as-a-chemistry-interface">Code Generation as a Chemistry Interface</h2>
<p>The paper&rsquo;s core thesis is that NLP models trained on code can serve as a natural language interface to the entire ecosystem of scientific computing tools. The authors demonstrate this with several concrete examples using OpenAI Codex:</p>
<ol>
<li>
<p><strong>Quantum chemistry</strong>: Prompting Codex to &ldquo;compute the dissociation curve of H2 using pyscf&rdquo; produced correct, runnable code that selected <a href="https://en.wikipedia.org/wiki/Hartree%E2%80%93Fock_method">Hartree-Fock</a> with <a href="https://en.wikipedia.org/wiki/STO-nG_basis_sets">STO-3G</a>. A follow-up prompt requesting &ldquo;the most accurate method&rdquo; caused it to switch to <a href="https://en.wikipedia.org/wiki/Coupled_cluster">CCSD</a> in a large basis set.</p>
</li>
<li>
<p><strong>Chemical entity recognition</strong>: Using GPT-3 with only three training examples, the authors demonstrated extraction of chemical entity names from published text, a task that previously required thousands of labeled examples.</p>
</li>
<li>
<p><strong>Molecular visualization and analysis</strong>: Codex also handled a range of routine tasks: drawing caffeine from its <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string, generating Gaussian input files from SMILES, implementing random walks, and downloading and analyzing <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB structures</a> with MDTraj.</p>
</li>
<li>
<p><strong>Voice-controlled molecular dynamics</strong>: The authors previously built MARVIS, a voice-controlled <a href="/notes/chemistry/molecular-simulation/">molecular dynamics</a> analysis tool that uses GPT-3 to convert natural language into <a href="https://en.wikipedia.org/wiki/Visual_Molecular_Dynamics">VMD</a> commands. Only about a dozen examples were needed to teach GPT-3 to render proteins, change representations, and select atoms.</p>
</li>
</ol>
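<p>The few-shot setup used for chemical entity recognition can be pictured as plain prompt construction. The sketch below is a hypothetical illustration, not the authors&rsquo; prompt (their prompts are in the ESI): the example sentences, the <code>build_few_shot_prompt</code> helper, and the <code>Text:</code>/<code>Chemicals:</code> layout are all assumptions, and no model is called.</p>

```python
# Hypothetical labeled examples, invented for illustration only.
EXAMPLES = [
    ("The sample was dissolved in ethanol and titrated with sodium hydroxide.",
     ["ethanol", "sodium hydroxide"]),
    ("Caffeine was extracted using dichloromethane.",
     ["caffeine", "dichloromethane"]),
    ("Crystals of benzoic acid formed on cooling.",
     ["benzoic acid"]),
]


def build_few_shot_prompt(examples, new_text):
    """Format labeled examples plus one unlabeled text as a completion prompt."""
    blocks = []
    for text, entities in examples:
        blocks.append(f"Text: {text}\nChemicals: {', '.join(entities)}")
    # The final block ends with an open label for the model to complete.
    blocks.append(f"Text: {new_text}\nChemicals:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(EXAMPLES, "Acetone was added to the toluene solution.")
print(prompt)
```

<p>The resulting string would be sent to a completion model, and the text it returns after the final <code>Chemicals:</code> label would be parsed as the extracted entities.</p>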
<p>An important caveat: the authors emphasize that all chemistry &ldquo;knowledge&rdquo; (including the SMILES string for caffeine) is entirely contained in the model&rsquo;s learned floating-point weights. The model has no access to databases or curated lists of chemical concepts.</p>
<h2 id="demonstrations-and-practical-evaluation">Demonstrations and Practical Evaluation</h2>
<p>Rather than a formal experimental evaluation with benchmarks and metrics, this perspective paper relies on qualitative demonstrations. The key examples, with full details provided in the Electronic Supplementary Information (ESI), include:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>H2 dissociation curve</td>
          <td>Natural language prompt</td>
          <td>Correct PySCF code (HF/STO-3G)</td>
      </tr>
      <tr>
          <td>Upgrade method accuracy</td>
          <td>Follow-up prompt</td>
          <td>Switched to CCSD with large basis</td>
      </tr>
      <tr>
          <td>Chemical NER</td>
          <td>3 examples + new text</td>
          <td>Extracted compound names (with some gaps)</td>
      </tr>
      <tr>
          <td>Molecule drawing</td>
          <td>&ldquo;Load caffeine from SMILES, draw it&rdquo;</td>
          <td>Correct RDKit rendering</td>
      </tr>
      <tr>
          <td>Gaussian input file</td>
          <td>Function with docstring</td>
          <td>Complete file writer with B3LYP/6-31G(d)</td>
      </tr>
      <tr>
          <td>PDB analysis</td>
          <td>Natural language description</td>
          <td>Downloaded structure and computed <a href="https://en.wikipedia.org/wiki/Radius_of_gyration">radius of gyration</a></td>
      </tr>
  </tbody>
</table>
<p>Citing the Codex paper, the authors note that Codex generates correct code on about 30% of single attempts for standard programming problems, rising above 50% when multiple solutions are tried. Mistakes tend to occur when complex algorithms are requested with little specificity. The generated code rarely contains syntax errors, but it may fail in obvious ways (missing imports, wrong data types).</p>
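<p>A short calculation shows how a roughly 30% single-attempt rate can pass 50% with retries. The first function assumes attempts succeed independently, which is an idealization not stated in the source. The second is the unbiased pass@k estimator described in the Codex paper, computed from <code>n</code> samples of which <code>c</code> are correct; the sample counts used here are made up.</p>

```python
from math import comb


def at_least_one_success(p, k):
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(at_least_one_success(0.3, 1))  # single attempt
print(at_least_one_success(0.3, 2))  # two independent attempts
print(pass_at_k(n=10, c=3, k=2))     # hypothetical: 3 of 10 samples correct
```

<p>Under the independence assumption, two attempts at a 30% success rate already give 1 - 0.7<sup>2</sup> = 0.51, which is consistent with the reported jump above 50%.</p>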
<h2 id="challenges-access-correctness-and-bias">Challenges: Access, Correctness, and Bias</h2>
<p>The paper identifies three ongoing challenges:</p>
<p><strong>Access and price.</strong> Advanced models from OpenAI were, at the time of writing, limited to early testers. Per-query costs (1-3 cents for GPT-3) would become prohibitive at the scale needed for parsing academic literature or supporting medium-sized courses. The authors advocate for open-source models and equitable deployment by researchers with computational resources.</p>
<p><strong>Correctness.</strong> Code generation does not guarantee correctness. The authors raise a subtle point: Codex may produce code that executes successfully but does not follow best scientific practice for a particular computational task. Over-reliance on AI-generated code without verification could erode trust in scientific software. However, they argue that strategies for assessing code correctness apply equally to human-written and AI-generated code.</p>
<p><strong>Fairness and bias.</strong> The authors flag several concerns. Models trained on their own AI-generated output could narrow the range of packages, methods, or programming languages used in chemistry. They observed Codex&rsquo;s preference for Python and for specific popular libraries (e.g., defaulting to <a href="https://en.wikipedia.org/wiki/PSI_(computational_chemistry)">Psi4</a> for single-point energy calculations). GPT-3 has also been shown to reflect racism, sexism, and other biases present in its training data.</p>
<h2 id="implications-for-research-and-education">Implications for Research and Education</h2>
<p>The authors conclude with an optimistic but measured outlook:</p>
<ul>
<li><strong>For research</strong>: NLP code generation will increase accessibility of software tools and expand what a single research group can accomplish. Better tools have historically not reduced the need for scientists but expanded the complexity of problems that can be tackled.</li>
<li><strong>For programming skills</strong>: Using Codex will make chemists better programmers, not worse. The process of crafting prompts, mentally checking outputs, testing on sample inputs, and iterating develops algorithmic thinking. The authors report discovering chemistry software libraries they would not have found otherwise through iterative prompt creation.</li>
<li><strong>For education</strong>: Instructors should rethink programming assignments. The authors suggest moving toward more difficult compound assignments, treating code exercises as laboratory explorations of scientific concepts rather than syntax drills, and aligning coursework with the tools students will have access to in their careers.</li>
<li><strong>For accessibility</strong>: NLP models can reduce barriers for non-native English speakers (though accuracy with non-English prompts was not fully explored) and for users who have difficulty with keyboard-and-mouse interfaces (via voice control).</li>
</ul>
<p>The paper acknowledges that these capabilities were only beginning to emerge in early 2022, with Codex being the first capable code-generation model. Already at the time of writing, models surpassing GPT-3 in language tasks had appeared, and models matching GPT-3 with 1/20th the parameters had been demonstrated.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a perspective paper with qualitative demonstrations rather than a reproducible experimental study. The authors provide all prompts and multiple responses in the ESI.</p>
<h3 id="data">Data</h3>
<p>All prompts and code outputs are provided in the Electronic Supplementary Information (ESI) available from the RSC.</p>
<h3 id="algorithms">Algorithms</h3>
<p>The paper does not introduce new algorithms. It evaluates existing models (GPT-3, Codex) on chemistry-related code generation tasks.</p>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Provider</th>
          <th>Access</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GPT-3</td>
          <td>OpenAI</td>
          <td>API access (commercial)</td>
      </tr>
      <tr>
          <td>Codex</td>
          <td>OpenAI</td>
          <td>Early tester program (2021)</td>
      </tr>
      <tr>
          <td>GPT-Neo</td>
          <td>EleutherAI</td>
          <td>Open source</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>No formal metrics are reported for the chemistry demonstrations. The authors cite the Codex paper&rsquo;s reported ~30% pass rate on single attempts and &gt;50% with multiple attempts on standard programming problems.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware requirements are specified for the demonstrations (API-based inference).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/whitead/marvis">MARVIS</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Voice-controlled MD analysis using GPT-3</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hocky, G. M., &amp; White, A. D. (2022). Natural language processing models that automate programming will transform chemistry research and teaching. <em>Digital Discovery</em>, 1(2), 79-83. <a href="https://doi.org/10.1039/d1dd00009h">https://doi.org/10.1039/d1dd00009h</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{hocky2022natural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Natural language processing models that automate programming will transform chemistry research and teaching}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hocky, Glen M. and White, Andrew D.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{79--83}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d1dd00009h}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Kabsch-Horn Cookbook: Differentiable Alignment</title><link>https://hunterheidenreich.com/projects/kabsch-horn-cookbook/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/kabsch-horn-cookbook/</guid><description>Differentiable Kabsch (SVD) and Horn (quaternion) alignment for NumPy, PyTorch, JAX, TensorFlow, and MLX with gradient-safe SVD.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Aligning two sets of corresponding points means finding the optimal rotation (and optionally translation and scale) that maps one onto the other. It is a fundamental operation across scientific computing, appearing in molecular dynamics (superimposing protein conformations), robotics (sensor registration), and computer vision (shape matching). The two dominant algorithm families are the Kabsch (SVD-based) method and the Horn (quaternion-based) method.</p>
<p>The <strong>Kabsch-Horn Cookbook</strong> is a Python library that implements both algorithm families across five numerical frameworks: NumPy, PyTorch, JAX, TensorFlow, and MLX. Every backend shares the same API and supports per-point weights and arbitrary batch dimensions. The Kabsch (SVD) path handles N-dimensional point sets on every backend except MLX, which accepts 3D inputs only. The PyTorch, JAX, TensorFlow, and MLX backends are fully differentiable, with custom autograd rules that bypass the numerically unstable gradient of the standard SVD near degenerate singular values.</p>
<h2 id="features">Features</h2>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Kabsch</strong>: SVD-based optimal rotation for rigid alignment</li>
<li><strong>Kabsch-Umeyama</strong>: Kabsch with an additional optimal scaling factor $c$, solving $Q \approx cRP + t$</li>
<li><strong>Horn</strong>: Quaternion-based optimal rotation via the eigendecomposition of a $4 \times 4$ key matrix</li>
<li><strong>Horn + Scale</strong>: Horn&rsquo;s method extended with optimal isotropic scaling</li>
<li><strong>RMSD Wrappers</strong>: Convenience functions that return RMSD directly alongside the alignment parameters</li>
</ul>
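<p>For orientation, the Kabsch-Umeyama variant listed above can be sketched in a few lines of plain NumPy. This is a minimal textbook-style version written for this page, not the library&rsquo;s implementation: the function name <code>kabsch_umeyama</code>, its return order, and the synthetic test data are assumptions, and it handles neither weights nor batch dimensions.</p>

```python
import numpy as np


def kabsch_umeyama(P, Q, with_scale=True):
    """Return (R, t, c) minimizing sum_i ||c * R @ p_i + t - q_i||^2.

    P and Q are (N, d) arrays whose rows are corresponding points.
    """
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    H = Qc.T @ Pc / len(P)                      # cross-covariance, (d, d)
    U, S, Vt = np.linalg.svd(H)
    # Flip the last axis when needed so R is a proper rotation (det = +1).
    d = np.ones(P.shape[1])
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    var_p = (Pc ** 2).sum() / len(P)            # total variance of P
    c = (S * d).sum() / var_p if with_scale else 1.0
    t = mu_q - c * (R @ mu_p)
    return R, t, c


# Synthetic check: recover a known rotation, scale, and translation.
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true[:, 0] *= np.sign(np.linalg.det(R_true))  # force det = +1
t_true = np.array([1.0, -2.0, 0.5])
Q = 2.5 * P @ R_true.T + t_true

R, t, c = kabsch_umeyama(P, Q)
rmsd = np.sqrt(((c * P @ R.T + t - Q) ** 2).sum(axis=1).mean())
print(c, rmsd)
```

<p>Passing <code>with_scale=False</code> fixes $c = 1$ and reduces the sketch to plain Kabsch alignment.</p>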
<h3 id="framework-support">Framework Support</h3>
<table>
  <thead>
      <tr>
          <th>Framework</th>
          <th style="text-align: center">Differentiable</th>
          <th style="text-align: center">Compile/JIT</th>
          <th>Versions</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NumPy</td>
          <td style="text-align: center"></td>
          <td style="text-align: center"></td>
          <td>1.24+</td>
      </tr>
      <tr>
          <td>PyTorch</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><code>torch.compile</code></td>
          <td>2.0+</td>
      </tr>
      <tr>
          <td>JAX</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><code>jax.jit</code></td>
          <td>0.4+</td>
      </tr>
      <tr>
          <td>TensorFlow</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"></td>
          <td>2.13+</td>
      </tr>
      <tr>
          <td>MLX</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"></td>
          <td>0.1+</td>
      </tr>
  </tbody>
</table>
<p><code>torch.compile</code> and <code>jax.jit</code> are the tested compile/JIT paths. MLX supports 3D inputs only; the Kabsch (SVD) path is N-dimensional on the other four backends.</p>
<h3 id="numerical-robustness">Numerical Robustness</h3>
<p>Standard SVD and eigendecomposition backward passes produce <code>NaN</code> gradients when singular values collide or are near-zero. The library provides custom autograd primitives to handle these cases:</p>
<ul>
<li><strong>SafeSVD</strong> (PyTorch, JAX, TF, MLX): Custom backward pass that clamps the singular value gap, preventing division-by-zero in the gradient</li>
<li><strong>SafeEigh</strong> (PyTorch, JAX, TF, MLX): Analogous safe backward for the symmetric eigendecomposition used in Horn&rsquo;s method</li>
<li><strong>Per-point weights</strong>: Weighted centroids and weighted cross-covariance for mass-weighted or confidence-weighted alignment</li>
<li><strong>Batch dimensions</strong>: All functions broadcast over leading batch dimensions without explicit loops</li>
<li><strong>Mixed-dtype promotion</strong>: Inputs are promoted to a common floating-point dtype automatically</li>
</ul>
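<p>The weighting and batching behavior described above can be illustrated with a small NumPy sketch. It assumes points are stored as <code>(..., N, d)</code> arrays with weights of shape <code>(..., N)</code>; the helper name <code>weighted_cross_covariance</code> is hypothetical and the library&rsquo;s internals may differ.</p>

```python
import numpy as np


def weighted_cross_covariance(P, Q, w):
    """Weighted cross-covariance of centered point sets.

    P, Q: (..., N, d) arrays; w: (..., N) non-negative weights.
    Returns an array of shape (..., d, d).
    """
    w = w / w.sum(axis=-1, keepdims=True)            # normalize per batch item
    mu_p = (w[..., None] * P).sum(axis=-2, keepdims=True)
    mu_q = (w[..., None] * Q).sum(axis=-2, keepdims=True)
    Pc, Qc = P - mu_p, Q - mu_q
    # H[..., a, b] = sum_n w_n * Qc[..., n, a] * Pc[..., n, b]
    return np.einsum("...n,...na,...nb->...ab", w, Qc, Pc)


rng = np.random.default_rng(0)
P = rng.normal(size=(4, 10, 3))                      # batch of 4 point sets
Q = rng.normal(size=(4, 10, 3))
w = rng.uniform(0.1, 1.0, size=(4, 10))
H = weighted_cross_covariance(P, Q, w)
print(H.shape)
```

<p>Because every reduction runs over the trailing point axis, any number of leading batch dimensions broadcasts without an explicit loop. With uniform weights the result reduces to the ordinary cross-covariance.</p>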
<h3 id="testing">Testing</h3>
<p>The test suite uses Hypothesis-based property testing across 13 modules covering:</p>
<ul>
<li>Round-trip correctness (align then compare)</li>
<li>Gradient finiteness and correctness (finite-difference checks)</li>
<li>Reflection handling (proper vs. improper rotations)</li>
<li>Weighted alignment consistency</li>
<li>Batch broadcasting</li>
<li>4 differentiable backends $\times$ 4 precisions (float32, float64, and where supported, float16, bfloat16)</li>
</ul>
<h2 id="usage">Usage</h2>
<p>This is a reference cookbook, so you can copy the framework folder you need from <code>src/kabsch_horn/&lt;framework&gt;/</code> directly into your project (the code has no runtime dependencies beyond the framework itself). To depend on it instead, install a pinned version from GitHub:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install <span style="color:#e6db74">&#34;git+https://github.com/hunter-heidenreich/Kabsch-Cookbook.git@v0.4.1&#34;</span>
</span></span></code></pre></div><p>Basic alignment with NumPy:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> kabsch_horn <span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> kh
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Two sets of corresponding 3D points</span>
</span></span><span style="display:flex;"><span>P <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>randn(<span style="color:#ae81ff">100</span>, <span style="color:#ae81ff">3</span>)
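</span></span><span style="display:flex;"><span>R_true <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>qr(np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>randn(<span style="color:#ae81ff">3</span>, <span style="color:#ae81ff">3</span>))[<span style="color:#ae81ff">0</span>]  <span style="color:#75715e"># random orthogonal matrix</span>
</span></span><span style="display:flex;"><span>R_true[:, <span style="color:#ae81ff">0</span>] <span style="color:#f92672">*=</span> np<span style="color:#f92672">.</span>sign(np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>det(R_true))  <span style="color:#75715e"># flip one column if needed so det = +1 (proper rotation)</span>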
</span></span><span style="display:flex;"><span>Q <span style="color:#f92672">=</span> (P <span style="color:#f92672">@</span> R_true<span style="color:#f92672">.</span>T) <span style="color:#f92672">+</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>randn(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">3</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>R, t, rmsd <span style="color:#f92672">=</span> kh<span style="color:#f92672">.</span>kabsch(P, Q)
</span></span><span style="display:flex;"><span>aligned <span style="color:#f92672">=</span> P <span style="color:#f92672">@</span> R<span style="color:#f92672">.</span>T <span style="color:#f92672">+</span> t
</span></span></code></pre></div><p>RMSD loss for training in PyTorch:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> torch
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> kabsch_horn <span style="color:#f92672">import</span> pytorch <span style="color:#66d9ef">as</span> kh
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>pred_coords <span style="color:#f92672">=</span> model(input_features)   <span style="color:#75715e"># (B, N, 3), requires_grad=True</span>
</span></span><span style="display:flex;"><span>target_coords <span style="color:#f92672">=</span> batch[<span style="color:#e6db74">&#34;target&#34;</span>]       <span style="color:#75715e"># (B, N, 3)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>rmsd <span style="color:#f92672">=</span> kh<span style="color:#f92672">.</span>kabsch_rmsd(pred_coords, target_coords)  <span style="color:#75715e"># (B,)</span>
</span></span><span style="display:flex;"><span>loss <span style="color:#f92672">=</span> rmsd<span style="color:#f92672">.</span>mean()
</span></span><span style="display:flex;"><span>loss<span style="color:#f92672">.</span>backward()  <span style="color:#75715e"># safe gradients via SafeSVD</span>
</span></span></code></pre></div><p>For the full API reference and additional examples, see the <a href="https://hunter-heidenreich.github.io/Kabsch-Cookbook/">documentation site</a>.</p>
<h2 id="results">Results</h2>
<h3 id="gradient-stability">Gradient Stability</h3>
<p>The standard SVD backward pass computes terms of the form $\frac{1}{\sigma_i^2 - \sigma_j^2}$, which diverge when two singular values are close. In molecular alignment this happens frequently: planar molecules, symmetric structures, and noisy coordinates can all produce near-degenerate singular values. The SafeSVD primitive floors the magnitude of that denominator at the dtype&rsquo;s machine epsilon (<code>finfo(dtype).eps</code>), producing finite (if slightly biased) gradients in these edge cases. Property-based tests confirm that gradients remain finite across thousands of random rotations, scales, and noise levels for all four differentiable backends.</p>
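<p>The flooring step can be sketched in NumPy as follows. This is an illustration of the idea only, not the library&rsquo;s code: <code>safe_inverse_gap</code> is a hypothetical name, and zeroing the diagonal follows the usual SVD-gradient formulation, in which only the off-diagonal terms appear.</p>

```python
import numpy as np


def safe_inverse_gap(s):
    """Return F with F[i, j] ~ 1 / (s_i^2 - s_j^2), finite everywhere."""
    eps = np.finfo(s.dtype).eps
    gap = s[..., :, None] ** 2 - s[..., None, :] ** 2
    sign = np.where(gap < 0, -1.0, 1.0).astype(s.dtype)
    # Floor |gap| at machine epsilon while preserving its sign.
    safe_gap = sign * np.maximum(np.abs(gap), eps)
    F = 1.0 / safe_gap
    # Diagonal entries (i == j) are not used by the gradient; zero them.
    idx = np.arange(s.shape[-1])
    F[..., idx, idx] = 0.0
    return F


s = np.array([2.0, 1.0, 1.0])        # two identical singular values
F = safe_inverse_gap(s)
print(np.isfinite(F).all())
```

<p>For the degenerate pair the entry becomes <code>1 / eps</code> instead of infinity, which keeps the gradient finite at the cost of the small bias mentioned above.</p>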
<h3 id="framework-parity">Framework Parity</h3>
<p>All five backends produce numerically equivalent results (up to floating-point tolerance) on the same inputs. The shared API means switching from NumPy prototyping to PyTorch training requires changing only the import path.</p>
<h2 id="related-work">Related Work</h2>
<p>This project builds on the foundational alignment algorithms described in these papers:</p>
<ul>
<li><a href="/notes/biology/computational-biology/kabsch-algorithm/">Kabsch (1976)</a>: the original SVD-based rotation alignment</li>
<li><a href="/notes/biology/computational-biology/arun-svd-point-fitting/">Arun et al. (1987)</a>: SVD formulation for 3D point set fitting</li>
<li><a href="/notes/biology/computational-biology/horn-absolute-orientation/">Horn (1987)</a>: quaternion-based closed-form absolute orientation</li>
<li><a href="/notes/biology/computational-biology/horn-orthonormal-matrices/">Horn et al. (1988)</a>: orthonormal matrix (polar decomposition) approach</li>
<li><a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">Umeyama (1991)</a>: extension to include optimal scaling</li>
</ul>
<p>For a detailed walkthrough of the Kabsch algorithm with code examples, see the companion blog post: <a href="/posts/kabsch-algorithm/">The Kabsch Algorithm</a>.</p>
]]></content:encoded></item><item><title>ChemBERTa-3: Open Source Chemical Foundation Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-3/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/chemberta-3/</guid><description>An open-source framework integrating DeepChem and Ray for training and benchmarking chemical foundation models like MoLFormer and GROVER at scale.</description><content:encoded><![CDATA[<h2 id="core-contribution-an-open-source-framework">Core Contribution: An Open-Source Framework</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> contributions.</p>
<ul>
<li><strong>Resource Basis</strong>: The core contribution is &ldquo;ChemBERTa-3,&rdquo; an open-source framework integrated into DeepChem that standardizes the pretraining and benchmarking of chemical foundation models. The authors focus heavily on infrastructure (AWS/Ray integration) and correcting benchmarking inconsistencies in the field.</li>
<li><strong>Method Basis</strong>: It trains models like &ldquo;c3-MoLFormer&rdquo; to reproduce and validate the infrastructure.</li>
</ul>
<h2 id="the-pretraining-scalability-challenge">The Pretraining Scalability Challenge</h2>
<ul>
<li><strong>Scalability Challenges</strong>: Building robust molecular models is difficult due to the vast size of chemical space and the computational intensity of pretraining on large datasets.</li>
<li><strong>Proprietary Barriers</strong>: Many high-performing chemical foundation models (e.g., the full <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer-XL</a>) are partially closed-source or difficult to reproduce.</li>
<li><strong>Benchmarking Inconsistencies</strong>: There is a lack of systematic comparison between architectures (e.g., Graph vs. Transformer) using unified protocols. Specifically, previous comparisons relied on reported results that used differing scaffold splitting algorithms, so the numbers were not directly comparable.</li>
</ul>
<h2 id="unified-infrastructure--standardized-benchmarking">Unified Infrastructure &amp; Standardized Benchmarking</h2>
<ul>
<li><strong>Unified Infrastructure</strong>: Integration of DeepChem with Ray for distributed, scalable pretraining and fine-tuning of both graph and transformer models.</li>
<li><strong>Standardized Benchmarking</strong>: Identification that MoLFormer&rsquo;s scaffold splitting algorithm differs from the standard DeepChem/<a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> splitter, and the subsequent standardization of these benchmarks for fair comparison.</li>
<li><strong>New DeepChem Tools</strong>: Introduction of the <code>ModularTorchModel</code> class for flexible loss computation and <code>HuggingFaceModel</code> wrappers to bridge ecosystems.</li>
</ul>
<h2 id="benchmarking-transformers-vs-graph-models">Benchmarking Transformers vs. Graph Models</h2>
<ul>
<li><strong>Architecture Comparison</strong>: Benchmarked Transformers (<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MoLFormer</a>) against Graph models (GROVER, InfoGraph, InfoMax3D, DMPNN, GCN) and baselines (Random Forest).</li>
<li><strong>Pretraining Scale Disparity</strong>:
<ul>
<li>Transformers were pretrained on subsets ranging from 10M to 1.1B molecules; the largest set combines ZINC20 and PubChem.</li>
<li>Graph models were limited to 250K molecule subsets due to memory and computational overhead of message passing on large graphs. While this highlights the superior scalability of Transformer architectures, comparing a 1.1B-trained Transformer to a 250K-trained Graph model provides an unbalanced evaluation of architectural capacity.</li>
</ul>
</li>
<li><strong>Reproducibility Validation</strong>: Trained &ldquo;c3-MoLFormer&rdquo; (a reproduction of MoLFormer) on 1.1B molecules using two distinct hardware setups: AWS spot instances (Ray) and a local HPC cluster.</li>
<li><strong>Scaffold Split Analysis</strong>: Compared performance metrics using &ldquo;DeepChem scaffold splits&rdquo; vs. &ldquo;MoLFormer scaffold splits&rdquo; to quantify the impact of data leakage/overlap.</li>
</ul>
<h2 id="overcoming-scaffold-splitting-inconsistencies">Overcoming Scaffold Splitting Inconsistencies</h2>
<ul>
<li><strong>Scaling Transformers vs. Graphs</strong>: Transformer-based models are significantly easier to scale to large datasets than current graph-based approaches, though performance is comparable at small scales.</li>
<li><strong>Benchmarking sensitivity</strong>: MoLFormer&rsquo;s reported superiority over baselines was partly inflated by its specific scaffold splitting method, which leaves higher structural overlap between train and test sets than DeepChem splits do. The overlap shows up as a lower <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a>, $1 - \frac{|A \cap B|}{|A \cup B|}$ for fingerprint bit sets $A$ and $B$. When the splits are standardized, baselines like DMPNN perform more competitively.</li>
<li><strong>Infrastructure Viability</strong>: The framework successfully replicated large-scale training (MoLFormer-1.1B) on both cloud and on-premise HPC, confirming reproducibility.</li>
<li><strong>Open Source Release</strong>: All code, configurations, and the c3-MoLFormer-1.1B model weights are released to facilitate future research.</li>
</ul>
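<p>The Tanimoto distance above is easy to compute on fingerprints represented as sets of active bit indices. The sketch below uses made-up toy fingerprints, and the <code>mean_nearest_train_distance</code> summary is one plausible way to express train/test overlap, not necessarily the exact statistic used in the paper.</p>

```python
def tanimoto_distance(a, b):
    """1 - |A n B| / |A u B| for two sets of active fingerprint bits."""
    a, b = set(a), set(b)
    union = a.union(b)
    if not union:
        return 0.0  # convention chosen here: two empty fingerprints match
    return 1.0 - len(a.intersection(b)) / len(union)


def mean_nearest_train_distance(train_fps, test_fps):
    """Average over test molecules of the distance to the closest train molecule."""
    nearest = [min(tanimoto_distance(t, tr) for tr in train_fps) for t in test_fps]
    return sum(nearest) / len(nearest)


# Toy fingerprints (invented for illustration).
train_fps = [{1, 4, 7, 9}, {2, 3, 5}]
test_fps = [{1, 4, 8}, {2, 3, 5, 6}]
print(tanimoto_distance(train_fps[0], test_fps[0]))
print(mean_nearest_train_distance(train_fps, test_fps))
```

<p>A lower mean nearest-neighbor distance indicates that test molecules sit closer to the training set, which is the kind of overlap that can flatter reported benchmark numbers.</p>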
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pretraining</strong>:
<ul>
<li><strong>Source</strong>: <a href="/notes/chemistry/datasets/zinc-22/">ZINC20</a> (1.4B compounds) and PubChem.</li>
<li><strong>Scale</strong>: Subsets of 10M, 100M, and 1.1B (100% ZINC20 + 100% PubChem) were used for Transformers. Graph models used a 250K subset.</li>
</ul>
</li>
<li><strong>Fine-tuning</strong>:
<ul>
<li><strong>Suite</strong>: MoleculeNet.</li>
<li><strong>Tasks</strong>: Classification (BACE, BBBP, Tox21, HIV, SIDER, ClinTox) and Regression (ESOL, FreeSolv, Lipophilicity, QM9).</li>
<li><strong>Splits</strong>: Critical distinction made between &ldquo;DeepChem scaffold splits&rdquo; (80/10/10) and &ldquo;MoLFormer scaffold splits&rdquo; (which can be downloaded from <a href="https://ibm.ent.box.com/v/MoLFormer-data"><code>https://ibm.ent.box.com/v/MoLFormer-data</code></a>). The paper notes these algorithms differ.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DeepChem integrated with Ray for distributed training. To recreate the environment, the repository relies on a nightly version of DeepChem (<code>pip install --pre deepchem</code>) and specific dependencies found within the <code>requirements.txt</code>. Pretraining scripts are available in the <code>chemberta3_benchmarking/pretraining</code> directory of the repository.</li>
<li><strong>Data Preparation</strong>: Featurization workflows (e.g., <code>CircularFingerprint</code>, <code>RDKitConformer</code>) are documented under <code>chemberta3_benchmarking/data/data_preprocessing/</code> in the codebase.</li>
<li><strong>Modular Training</strong>: Uses <code>ModularTorchModel</code> to allow loss computation from intermediate values and flexible component connection.</li>
<li><strong>Training Brittleness</strong>:
<ul>
<li><strong>Learning Rate Schedule</strong>: Linear learning rate scheduler with warmup.</li>
<li><strong>Instability Handling</strong>: The authors observed significant loss spikes during warmup. Their primary mitigation strategy involved checkpointing frequently and restarting from the last stable state upon a spike, highlighting a persistent brittleness in optimizing these large chemical foundation models.</li>
<li><strong>Numerical Issues</strong>: Addressed NaN values by first pretraining on a small dataset with a low learning rate before scaling up.</li>
</ul>
</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></strong>: RoBERTa-based architecture trained with Masked Language Modeling (MLM) and Multitask Regression (MTR). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM"><code>DeepChem/ChemBERTa-100M-MLM</code></a>) are hosted on Hugging Face so researchers can pull them directly via the <code>transformers</code> library. The core pretraining objective minimized the standard MLM loss:
$$ \mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log \hat{y}_{i} $$
where $\mathcal{M}$ represents the set of masked SMILES token indices, and $\hat{y}_{i}$ is the model&rsquo;s predicted probability for the correct token given the corrupted sequence context.</li>
<li><strong>MoLFormer (c3-MoLFormer)</strong>: Re-implementation of the MoLFormer architecture (Rotary embeddings, linear attention). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B"><code>DeepChem/MoLFormer-c3-1.1B</code></a>) are similarly available on Hugging Face.
<ul>
<li>Tokenizer: <code>ibm/MoLFormer-XL-both-10pct</code> tokenizer.</li>
</ul>
</li>
<li><strong>Graph Models</strong>:
<ul>
<li><strong>GROVER</strong>: Graph Transformer with node/edge/graph level self-supervision.</li>
<li><strong>InfoGraph</strong>: Maximizes mutual information between graph-level and substructure representations.</li>
<li><strong>InfoMax3D</strong>: Incorporates 3D conformer data (via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> ETKDGv2) into contrastive pretraining.</li>
<li><strong>DMPNN</strong>: Directed Message Passing Neural Network (Chemprop variant).</li>
</ul>
</li>
</ul>
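<p>The MLM objective given for ChemBERTa above can be checked by hand on a toy example. The sketch below assumes that, for each masked position, we already have the probability the model assigned to the correct token; the three probabilities are invented and the helper name <code>mlm_loss</code> is hypothetical.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math


def mlm_loss(correct_token_probs):
    """Mean negative log-probability of the true token over the masked positions."""
    if not correct_token_probs:
        raise ValueError("need at least one masked position")
    total_log_prob = sum(math.log(p) for p in correct_token_probs)
    return -total_log_prob / len(correct_token_probs)


# Hypothetical probabilities of the true token at three masked SMILES positions.
probs = [0.9, 0.5, 0.25]
loss = mlm_loss(probs)
print(f"MLM loss: {loss:.4f}")
</code></pre></div>
<p>Confident, correct predictions (probabilities near 1) contribute almost nothing to the loss, while the 0.25 entry dominates the average.</p>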
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> for classification; RMSE for regression (MAE for QM9).</li>
<li><strong>Baselines</strong>: Random Forest, GCN, DMPNN trained on fine-tuning splits only.</li>
<li><strong>Protocol</strong>: Three independent runs per configuration to report mean and range (not a confidence interval), with the exception of the compute-heavy QM9 dataset, which only received a single run. Benchmarking execution scripts (e.g., GCN, RF, DMPNN, ChemBERTa) are stored in the repo under <code>chemberta3_benchmarking/models_benchmarking/</code> and contain the specific fine-tuning hyperparameters and optimizer configurations used for each downstream task.</li>
<li><strong>Key Results</strong>:
<ul>
<li><em>c3-MoLFormer-1.1B</em> achieved ~0.848 ROC-AUC on BACE and ~0.900 on BBBP (using MoLFormer splits). This closely matches the original IBM MoLFormer metrics, validating the reproducibility of the open-source framework.</li>
<li>When constrained to the equivalent 250K subset, Graph models (InfoGraph, GROVER) performed comparably to Transformers, indicating that Transformer superiority in chemistry is largely driven by data scalability rather than an inherent architectural advantage at small scales.</li>
</ul>
</li>
</ul>
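<p>The reporting protocol above (mean and range over three runs, not a confidence interval) amounts to a very small computation. The following sketch uses invented ROC-AUC values; <code>summarize_runs</code> is a hypothetical helper, not a function from the ChemBERTa-3 repository.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from statistics import mean


def summarize_runs(scores):
    """Return (mean, lowest, highest) for a list of per-run scores."""
    if not scores:
        raise ValueError("need at least one run")
    return mean(scores), min(scores), max(scores)


# Hypothetical ROC-AUC values from three independent fine-tuning runs.
runs = [0.84, 0.85, 0.86]
avg, low, high = summarize_runs(runs)
print(f"{avg:.3f} (range {low:.2f} to {high:.2f})")
</code></pre></div>
<p>With only three runs, the range simply brackets the observed values and makes no distributional claim, which is why the notes distinguish it from a confidence interval.</p>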
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Cloud (AWS)</strong>:
<ul>
<li><strong>Compute</strong>: 40 NVIDIA T4 GPUs (<code>g4dn.12xlarge</code> spot instances for pretraining, <code>g4dn.2xlarge</code> for benchmarking).</li>
<li><strong>Cost</strong>: ~$4000 for MoLFormer 1.1B pretraining.</li>
<li><strong>Time</strong>: ~10 days (260 hours) for 1.1B model pretraining.</li>
<li><strong>Setup</strong>: Setup scripts for single-node and multi-node spot EC2 clusters are provided in the GitHub repository&rsquo;s <code>infra/</code> and <code>spot/</code> folders.</li>
</ul>
</li>
<li><strong>On-Premise HPC</strong>:
<ul>
<li><strong>Compute</strong>: 16 nodes (AMD EPYC), each with 4 AMD MI300A APUs.</li>
<li><strong>Environment</strong>: Ray multi-node multi-GPU framework.</li>
</ul>
</li>
</ul>
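<p>The reported cloud figures can be turned into a rough per-GPU-hour cost. This is back-of-the-envelope arithmetic only: it assumes all 40 GPUs were in use for the full 260 hours and that the ~$4000 covers pretraining alone, neither of which is stated explicitly above.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">GPU_COUNT = 40          # NVIDIA T4 GPUs reported for pretraining
WALL_CLOCK_HOURS = 260  # reported pretraining time
TOTAL_COST_USD = 4000   # reported approximate cost

# Assumption: every GPU is busy for the whole wall-clock duration.
gpu_hours = GPU_COUNT * WALL_CLOCK_HOURS
cost_per_gpu_hour = TOTAL_COST_USD / gpu_hours
wall_clock_days = WALL_CLOCK_HOURS / 24

print(f"{gpu_hours} GPU-hours")
print(f"about ${cost_per_gpu_hour:.2f} per GPU-hour")
print(f"about {wall_clock_days:.1f} days of wall-clock time")
</code></pre></div>
<p>Under these assumptions the run comes to roughly 10,400 GPU-hours, which gives a sense of scale for anyone budgeting a similar reproduction on spot instances.</p>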
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, fine-tuning, and benchmarking framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B">DeepChem/MoLFormer-c3-1.1B</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer re-implementation pretrained on 1.1B molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM">DeepChem/ChemBERTa-100M-MLM</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>ChemBERTa pretrained on 100M ZINC molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-100M">DeepChem/MoLFormer-c3-100M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 100M molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-550M">DeepChem/MoLFormer-c3-550M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 550M molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Singh, R. et al. (2026). ChemBERTa-3: an open source training framework for chemical foundation models. <em>Digital Discovery</em>, 5, 662-685. <a href="https://doi.org/10.1039/D5DD00348B">https://doi.org/10.1039/D5DD00348B</a></p>
<p><strong>Publication</strong>: Digital Discovery 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></li>
<li><a href="https://deepchem.io/">DeepChem Project</a></li>
<li><a href="https://huggingface.co/DeepChem">DeepChem Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{singhChemBERTa3OpenSource2026,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Singh, Riya and Barsainyan, Aryan Amit and Irfan, Rida and Amorin, Connor Joseph and He, Stewart and Davis, Tony and Thiagarajan, Arun and Sankaran, Shiva and Chithrananda, Seyone and Ahmad, Walid and Jones, Derek and McLoughlin, Kevin and Kim, Hyojin and Bhutani, Anoushka and Sathyanarayana, Shreyas Vinaya and Viswanathan, Venkat and Allen, Jonathan E. and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-3}}: an open source training framework for chemical foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{662-685}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{The Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D5DD00348B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1039/D5DD00348B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MERMaid: Multimodal Chemical Reaction Mining from PDFs</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/mermaid/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/mermaid/</guid><description>Vision-language pipeline extracting chemical reaction data from PDF figures and tables into structured knowledge graphs with 87% accuracy.</description><content:encoded><![CDATA[<h2 id="methodological-and-resource-contributions">Methodological and Resource Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper ($\Psi_{\text{Method}}$) that introduces a novel pipeline (MERMaid) for extracting structured chemical data from unstructured PDF documents. It proposes a specific architecture combining fine-tuned vision models (VisualHeist) with vision-language models (DataRaider) and a retrieval-augmented generation system (KGWizard) to solve the problem of multimodal data ingestion.</p>
<p>Secondarily, it is a <strong>Resource</strong> paper ($\Psi_{\text{Resource}}$) as it releases the source code, prompts, and a new benchmark dataset (<strong>MERMaid-100</strong>) consisting of annotated reaction data across three chemical domains.</p>
<h2 id="the-inaccessibility-of-diagrammatic-reaction-data">The Inaccessibility of Diagrammatic Reaction Data</h2>
<ul>
<li><strong>Data Inaccessibility</strong>: A significant volume of chemical knowledge currently resides in &ldquo;print-optimized&rdquo; PDF formats, specifically within graphical elements like figures, schemes, and tables, which resist standard text mining.</li>
<li><strong>Limitations of Prior Work</strong>: Existing tools (e.g., ChemDataExtractor, <a href="/notes/chemistry/optical-structure-recognition/image-to-graph/molmole/">OpenChemIE</a>) focus primarily on text, struggle with multimodal parsing, or lack the &ldquo;contextual awareness&rdquo; needed to interpret implicit information (e.g., &ldquo;standard conditions&rdquo; with modifications in optimization tables).</li>
<li><strong>Need for Structured Data</strong>: To enable <a href="/notes/chemistry/llm-applications/autonomous-chemical-research-coscientist/">self-driving laboratories</a> and data-driven discovery, this unstructured literature must be converted into machine-actionable formats like <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a>.</li>
</ul>
<h2 id="the-mermaid-pipeline-vision-models-and-llm-rag">The MERMaid Pipeline: Vision Models and LLM RAG</h2>
<ul>
<li><strong>VisualHeist (Fine-tuned Segmentation)</strong>: A custom fine-tuned model based on Microsoft&rsquo;s Florence-2 that accurately segments figures, captions, and footnotes, even in messy supplementary materials.</li>
<li><strong>DataRaider (Context-Aware Extraction)</strong>: A VLM-powered module (using GPT-4o) with a <strong>two-step prompt framework</strong> that performs &ldquo;self-directed context completion.&rdquo; It can infer missing reaction parameters from context and resolve footnote labels (e.g., linking &ldquo;condition a&rdquo; in a table to its footnote description).</li>
<li><strong>KGWizard (Schema-Adaptive Graph Construction)</strong>: A text-to-graph engine that uses LLMs as higher-order functions to synthesize parsers dynamically. It employs <strong>Retrieval-Augmented Generation (RAG)</strong> to check for existing nodes during creation, implicitly resolving coreferences (e.g., unifying &ldquo;MeCN&rdquo; and &ldquo;Acetonitrile&rdquo;).</li>
<li><strong>Topic-Agnostic Design</strong>: MERMaid features a flexible design that works across three distinct domains: <a href="https://en.wikipedia.org/wiki/Electrosynthesis">organic electrosynthesis</a>, <a href="https://en.wikipedia.org/wiki/Photocatalysis">photocatalysis</a>, and organic synthesis.</li>
</ul>
<h2 id="benchmarking-segmentation-and-extraction-accuracy">Benchmarking Segmentation and Extraction Accuracy</h2>
<ul>
<li><strong>Segmentation Benchmarking</strong>: The authors compared VisualHeist against OpenChemIE (LayoutParser) and PDFigCapX using a dataset of 121 PDFs from 5 publishers.</li>
<li><strong>End-to-End Extraction</strong>: Evaluated the full pipeline on <strong>MERMaid-100</strong>, a curated dataset of 100 articles across three domains (organic electrosynthesis, photocatalysis, organic synthesis).
<ul>
<li>Validated extraction of specific parameters (e.g., catalysts, solvents, yields) using &ldquo;hard-match&rdquo; accuracy.</li>
</ul>
</li>
<li><strong>Knowledge Graph Construction</strong>: Automatically generated knowledge graphs for the three domains and assessed the structural integrity and <a href="https://en.wikipedia.org/wiki/Coreference">coreference resolution</a> accuracy.</li>
</ul>
<h2 id="end-to-end-extraction-performance">End-to-End Extraction Performance</h2>
<ul>
<li><strong>Segmentation Results</strong>: VisualHeist achieved &gt;93% F1 score across all document types (including pre-2000 papers and supplementary materials), outperforming OpenChemIE by 15-75% and PDFigCapX by 28-75% across all metrics.</li>
<li><strong>Extraction Accuracy</strong>: DataRaider achieved &gt;92% accuracy for VLM-based parameter extraction and near-unity accuracy for domain-specific reaction parameters (e.g., anode, cathode, photocatalyst).</li>
<li><strong>Graph Building</strong>: KGWizard achieved 96% accuracy in node creation and coreference resolution.</li>
<li><strong>Overall Performance</strong>: The pipeline demonstrated an 87% end-to-end overall accuracy.</li>
<li><strong>Limitations</strong>: The architecture relies heavily on closed-weight models (GPT-4o) for reasoning and graph construction, which risks future reproducibility if API snapshots are deprecated. Additionally, the system remains vulnerable to cumulative error propagation from upstream OCR/OCSR tools like <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">RxnScribe</a>.</li>
<li><strong>Availability</strong>: The authors provide a modular, extensible framework that can be adapted to other scientific domains.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Training Data (VisualHeist)</strong>:
<ul>
<li>Dataset of <strong>3,435 figures</strong> and <strong>1,716 tables</strong> annotated from 3,518 PDF pages.</li>
<li>Includes main text, supplementary materials, and unformatted archive papers.</li>
</ul>
</li>
<li><strong>Evaluation Data (MERMaid-100)</strong>:
<ul>
<li><strong>100 PDF articles</strong> curated from three domains: organic electrosynthesis, photocatalysis, and organic synthesis.</li>
<li>Includes 104 image-caption/table-heading pairs relevant to reaction optimization.</li>
<li>Available for download at Zenodo (DOI: 10.5281/zenodo.14917752).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Two-Step Prompt Framework (DataRaider)</strong>:
<ul>
<li><em>Step 1</em>: Generic base prompt + domain keys to extract &ldquo;reaction dictionaries&rdquo; and &ldquo;footnote dictionaries&rdquo;. Uses &ldquo;fill-in-the-blank&rdquo; inference for missing details.</li>
<li><em>Step 2</em>: Safety check prompt where the VLM updates the reaction dictionary using the footnote dictionary to resolve entry-specific modifications.</li>
</ul>
</li>
<li><strong>LLM-Synthesized Parsers (KGWizard)</strong>:
<ul>
<li>Uses an LLM as a higher-order function $g_{A,B}: A \times B \rightarrow (X \rightarrow Y)$ that returns a parser: Python code generated dynamically from the input schema instructions.</li>
</ul>
</li>
<li><strong>RAG for Coreference</strong>:
<ul>
<li>During graph construction, the system queries the existing database for matching values (e.g., &ldquo;MeCN&rdquo;) before creating new nodes to prevent duplication.</li>
</ul>
</li>
<li><strong>Batching</strong>:
<ul>
<li>Articles processed in dynamic batch sizes (starting at 1, increasing to 30) to balance speed and redundancy checks.</li>
</ul>
</li>
</ul>
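<p>The check-before-create idea behind the coreference step above can be sketched without any LLM or graph database. The example below is a simplified stand-in, not KGWizard&rsquo;s implementation: the <code>ALIASES</code> table, the <code>canonical_key</code> helper, and the in-memory dictionary are all hypothetical, whereas the real system retrieves candidate nodes from the existing graph and lets the LLM decide whether they match.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Hypothetical alias table; the real pipeline resolves synonyms via retrieval plus an LLM.
ALIASES = {"mecn": "acetonitrile", "ch3cn": "acetonitrile"}


def canonical_key(name):
    """Normalize a name so that known synonyms map to a single key."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def get_or_create_node(graph, name):
    """Look up an existing node first; create one only if none matches."""
    key = canonical_key(name)
    if key not in graph:
        graph[key] = {"label": name, "mentions": []}
    graph[key]["mentions"].append(name)
    return graph[key]


graph = {}
for mention in ["MeCN", "Acetonitrile", "DMF", "acetonitrile "]:
    get_or_create_node(graph, mention)

print(sorted(graph))
</code></pre></div>
<p>Four mentions collapse into two nodes because the lookup happens before node creation, which is the duplication-prevention behavior described above.</p>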
<h3 id="models">Models</h3>
<ul>
<li><strong>VisualHeist</strong>: Fine-tuned <strong>Florence-2-large</strong> (Microsoft vision foundation model).
<ul>
<li><em>Hyperparameters</em>: 12 epochs, learning rate $5 \times 10^{-6}$, batch size 4.</li>
</ul>
</li>
<li><strong>DataRaider &amp; KGWizard</strong>: <strong>GPT-4o</strong> (version <code>gpt-4o-2024-08-06</code>). Note: Requires an active OpenAI API key. The pipeline&rsquo;s long-term reproducibility is currently tied to the continued availability of this specific closed-source endpoint.</li>
<li><strong>RxnScribe</strong>: Used for <a href="/notes/chemistry/optical-structure-recognition/benchmarks/ocsr-methods/">Optical Chemical Structure Recognition (OCSR)</a> to convert reactant/product images to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>:
<ul>
<li><em>Segmentation</em>: Precision, Recall, F1, Accuracy.</li>
<li><em>Caption Extraction</em>: Evaluated via <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> between the predicted token set $A$ and the true token set $B$, subject to a threshold condition: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \ge 0.70$$</li>
<li><em>Data Extraction</em>: Evaluated via hard-match accuracy (HMA), requiring exact correspondence between each predicted parameter $\hat{y}_i$ and its ground-truth value $y_i$ for specific roles (e.g., anode vs. cathode), averaged over $N$ parameters: $$\text{HMA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i = \hat{y}_i]$$</li>
</ul>
</li>
<li><strong>Baselines</strong>: OpenChemIE (LayoutParser + EasyOCR) and PDFigCapX.</li>
</ul>
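<p>Both evaluation metrics above are simple enough to compute by hand. The sketch below assumes whitespace tokenization for captions, which may differ from the tokenization the authors used; the example strings and the function names are invented for illustration.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty captions count as identical
    return len(a.intersection(b)) / len(union)


def caption_matches(predicted, truth, threshold=0.70):
    """Apply the J(A, B) >= 0.70 criterion on whitespace tokens."""
    return jaccard(predicted.split(), truth.split()) >= threshold


def hard_match_accuracy(predicted, truth):
    """Fraction of positions where the prediction equals the ground truth exactly."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# One token differs ("1" vs "1."): 5 shared tokens out of 7 distinct ones.
ok = caption_matches(
    "Table 1 Optimization of reaction conditions",
    "Table 1. Optimization of reaction conditions",
)

# Hypothetical anode, cathode, and solvent predictions against ground truth.
hma = hard_match_accuracy(["graphite", "Pt", "MeCN"], ["graphite", "Ni", "MeCN"])
print(ok, round(hma, 3))
</code></pre></div>
<p>The caption criterion tolerates small tokenization differences, while hard-match accuracy gives no partial credit: a wrong cathode is simply a miss.</p>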
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training (VisualHeist)</strong>: 2x NVLink-connected NVIDIA RTX A6000 GPUs (48GB VRAM) + Intel Xeon w7-2495X CPU (48 cores).</li>
<li><strong>DataRaider Evaluation</strong>: 13th Gen Intel Core i7-1360P CPU (12 cores).</li>
<li><strong>Inference Costs</strong>:
<ul>
<li>DataRaider: ~$0.051 per image.</li>
<li>KGWizard: ~$0.40 per JSON.</li>
</ul>
</li>
<li><strong>Timing</strong>:
<ul>
<li>VisualHeist inference: ~4.5 seconds/image.</li>
<li>DataRaider inference: ~41.3 seconds/image.</li>
<li>KGWizard processing: ~110.6 seconds/file.</li>
</ul>
</li>
</ul>
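<p>The per-item costs and timings above can be combined into a rough budget for a corpus. This is a sketch under stated assumptions: it treats processing as strictly sequential, and the choice of one KGWizard JSON file per image is hypothetical rather than something the paper specifies.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">DATARAIDER_USD_PER_IMAGE = 0.051
KGWIZARD_USD_PER_JSON = 0.40
SECONDS_PER_IMAGE = 4.5 + 41.3   # VisualHeist + DataRaider inference
SECONDS_PER_JSON = 110.6         # KGWizard processing


def estimate(n_images, n_json):
    """Return (API cost in USD, sequential runtime in hours)."""
    cost = n_images * DATARAIDER_USD_PER_IMAGE + n_json * KGWIZARD_USD_PER_JSON
    seconds = n_images * SECONDS_PER_IMAGE + n_json * SECONDS_PER_JSON
    return cost, seconds / 3600


# 104 matches the number of image-caption/table-heading pairs in MERMaid-100;
# one JSON per image is an assumption made only for this estimate.
cost, hours = estimate(n_images=104, n_json=104)
print(f"about ${cost:.2f} and {hours:.1f} hours")
</code></pre></div>
<p>Under these assumptions, graph construction rather than image parsing dominates both cost and runtime.</p>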
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Leong, S. X., Pablo-García, S., Wong, B., &amp; Aspuru-Guzik, A. (2025). MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models. <em>Matter</em>, 8(12), 102331. <a href="https://doi.org/10.1016/j.matt.2025.102331">https://doi.org/10.1016/j.matt.2025.102331</a></p>
<p><strong>Publication</strong>: Matter, 2025</p>
<p><strong>Artifacts</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/MERMaid">GitHub Repository</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (VisualHeist, DataRaider, KGWizard)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.14917752">Zenodo Data/Prompts</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>MERMaid-100 benchmark, prompts, and raw VLM responses</td>
      </tr>
  </tbody>
</table>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{leong2025mermaid,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MERMaid: Universal multimodal mining of chemical reactions from PDFs using vision-language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Leong, Shi Xuan and Pablo-Garc{\&#39;i}a, Sergio and Wong, Brandon and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Matter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{102331}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.matt.2025.102331}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular String Renderer: Chemical Visualization Library</title><link>https://hunterheidenreich.com/projects/molecular-string-renderer/</link><pubDate>Sun, 30 Nov 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/molecular-string-renderer/</guid><description>A type-safe Python library for converting chemical strings (SMILES, SELFIES, InChI) into publication-quality molecular images.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>In computational chemistry and AI drug discovery, visualization pipelines are often brittle, breaking on edge cases or failing silently when processing millions of molecules for training data.</p>
<p>I built <code>molecular-string-renderer</code> to treat molecular visualization as a strict software engineering problem. It is a highly configurable wrapper around RDKit that standardizes the conversion of text-based chemical representations (SMILES, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>, SELFIES) into raster and vector graphics, degrading gracefully on inputs RDKit cannot vectorize.</p>
<h2 id="features">Features</h2>
<p>This library differentiates itself from standard plotting scripts through strict architectural patterns designed for reliability:</p>
<h3 id="1-strategy-pattern-for-svg-generation">1. Strategy Pattern for SVG Generation</h3>
<p>RDKit&rsquo;s vector rendering can sometimes fail on complex molecular topologies. I implemented a <strong>Hybrid Strategy</strong> so that a single molecule RDKit cannot vectorize does not fail the batch:</p>
<ul>
<li><strong>Vector Strategy</strong>: Attempts to generate a true, scalable vector graphic.</li>
<li><strong>Raster Fallback</strong>: If the vector engine fails, the system automatically renders a high-res PNG and embeds it transparently into the SVG container.</li>
</ul>
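<p>The fallback behavior above can be sketched in a few lines. This is an illustration of the pattern only, not the library&rsquo;s code: <code>vector_strategy</code>, <code>raster_strategy</code>, <code>render_hybrid</code>, and <code>RenderResult</code> are hypothetical names, the renderers are stand-ins that return strings instead of calling RDKit, and the rule that a wildcard atom breaks vector rendering is invented to trigger the fallback.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from dataclasses import dataclass


@dataclass
class RenderResult:
    smiles: str
    strategy: str   # "vector" or "raster-fallback"
    payload: str


def vector_strategy(smiles):
    """Stand-in for a true vector renderer; raises on inputs it cannot draw."""
    if "*" in smiles:  # invented failure condition for this sketch
        raise RuntimeError("vector engine failed")
    return f"vector drawing of {smiles}"


def raster_strategy(smiles):
    """Stand-in for rendering a PNG and wrapping it in an SVG container."""
    return f"embedded raster drawing of {smiles}"


def render_hybrid(smiles):
    """Try the vector strategy first and fall back to raster on failure."""
    try:
        return RenderResult(smiles, "vector", vector_strategy(smiles))
    except RuntimeError:
        return RenderResult(smiles, "raster-fallback", raster_strategy(smiles))


batch = ["CCO", "c1ccccc1", "C*C"]
results = [render_hybrid(s) for s in batch]
print([r.strategy for r in results])
</code></pre></div>
<p>Recording which strategy produced each image keeps the fallback visible to the caller, so a batch can complete while still reporting which molecules were not truly vectorized.</p>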
<h3 id="2-native-generative-ai-support">2. Native Generative AI Support</h3>
<p>With the rise of Large Language Models in chemistry, <strong>SELFIES</strong> (Self-Referencing Embedded Strings) has become a standard output format. This library handles SELFIES natively, managing the decoding and sanitization lifecycle internally so that ML training loops can simply &ldquo;pass strings and get images.&rdquo;</p>
<h3 id="3-strict-configuration-contracts">3. Strict Configuration Contracts</h3>
<p>The library uses <strong>Pydantic</strong> models (<code>RenderConfig</code>, <code>ParserConfig</code>, <code>OutputConfig</code>) to enforce strict data contracts. This ensures that visualization parameters are validated before any heavy computation begins, preventing runtime errors deep in a batch job.</p>
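<p>The fail-fast idea behind these contracts can be shown with a dependency-free sketch. It uses a standard-library dataclass as a stand-in for a Pydantic model; the class name <code>SketchRenderConfig</code>, its fields, and the allowed formats are hypothetical and do not reflect the library&rsquo;s actual <code>RenderConfig</code> schema.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from dataclasses import dataclass

ALLOWED_FORMATS = ("png", "svg")  # hypothetical set of output formats


@dataclass(frozen=True)
class SketchRenderConfig:
    width: int = 500
    height: int = 500
    image_format: str = "png"

    def __post_init__(self):
        # Validate at construction time, before any molecule is parsed or drawn.
        for name in ("width", "height"):
            value = getattr(self, name)
            if not (isinstance(value, int) and value > 0):
                raise ValueError(f"{name} must be a positive integer, got {value!r}")
        if self.image_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.image_format!r}")


config = SketchRenderConfig(width=300, height=300, image_format="svg")

try:
    SketchRenderConfig(width=0)
except ValueError as err:
    print(f"rejected before rendering: {err}")
</code></pre></div>
<p>A bad parameter is rejected when the configuration object is built, not hours into a batch job. Pydantic adds type coercion and richer error reporting on top of this basic pattern.</p>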
<h2 id="usage">Usage</h2>
<p>The library provides a simple Python API for rendering single molecules or batches of molecules from various string formats.</p>
<h2 id="results">Results</h2>
<ul>
<li><strong>Type Safety</strong>: The codebase runs with strict <code>mypy</code> settings, ensuring type safety across the entire pipeline.</li>
<li><strong>Grid Auto-Fitting</strong>: Implemented smart layout algorithms that automatically adjust grid dimensions based on the input batch size.</li>
<li><strong>Format Agnostic</strong>: Decouples the <em>parsing</em> logic (SMILES vs. MolBlock vs. SELFIES) from the <em>rendering</em> logic, making it trivial to add support for new proprietary formats.</li>
</ul>
<h2 id="reliability">Reliability</h2>
<p>When rendering large batches of generated molecules, a single hard-to-draw structure should not fail the whole job. The raster fallback and the strict Pydantic and mypy contracts exist so the pipeline degrades gracefully on edge cases rather than crashing or failing silently, the common failure mode of ad hoc RDKit plotting scripts.</p>
<h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/posts/visualizing-smiles-and-selfies-strings/">Visualizing SMILES and SELFIES Strings</a>: walkthrough of the visualization pipeline this library implements</li>
<li><a href="/projects/isomer-dataset-generation/">Isomer Dataset Generation</a>: related project generating molecular datasets using SMILES/SELFIES representations</li>
</ul>
]]></content:encoded></item><item><title>Recent Advances in the SELFIES Library: 2023 Update</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2023/</link><pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/notations/selfies-2023/</guid><description>Major updates to the SELFIES library, improved performance, expanded chemistry support, and new customization features.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This software update paper documents major improvements to the SELFIES Python library (version 2.1.1), covering its history, underlying algorithms, design, and performance.</p>
<h2 id="limitations-in-the-original-selfies-implementation">Limitations in the Original SELFIES Implementation</h2>
<p>While the <a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">original SELFIES concept</a> was promising, the initial 2019 implementation had critical limitations that prevented widespread adoption:</p>
<ol>
<li><strong>Performance</strong>: Too slow for production ML workflows</li>
<li><strong>Limited chemistry</strong>: Couldn&rsquo;t represent aromatic molecules, stereochemistry, or many other important chemical features</li>
<li><strong>Poor usability</strong>: Lacked user-friendly APIs for common tasks</li>
</ol>
<p>These barriers meant that despite SELFIES&rsquo; theoretical advantages (100% validity guarantee), researchers couldn&rsquo;t practically use it for real-world applications like drug discovery or materials science.</p>
<h2 id="architectural-refactoring-and-new-ml-integrations">Architectural Refactoring and New ML Integrations</h2>
<p>The 2023 update refactors the underlying SELFIES engine with improvements to design, efficiency, and supported features. The key updates include:</p>
<ol>
<li>
<p><strong>Streamlined Grammar</strong>: The underlying context-free grammar has been generalized and streamlined, improving execution speed and extensibility while maintaining the 100% validity guarantee.</p>
</li>
<li>
<p><strong>Expanded Chemical Support</strong>: Adds support for aromatic systems (via internal kekulization), stereochemistry (chirality, cis/trans), charged species, and isotopic data, covering nearly all features supported by SMILES while preserving the validity guarantee.</p>
</li>
<li>
<p><strong>Semantic Constraint API</strong>: Introduces the <code>set_semantic_constraints()</code> function, allowing specification of custom valence definitions useful for theoretical studies or hypervalent states.</p>
</li>
<li>
<p><strong>ML Utility Functions</strong>: Provides tokenization (<code>split_selfies</code>), length estimation (<code>len_selfies</code>), label/one-hot encoding (<code>selfies_to_encoding</code>), vocabulary extraction, and attribution tracking for integration with neural network pipelines.</p>
</li>
</ol>
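<p>To make the ML utilities above concrete, here is a dependency-free sketch of what tokenization and label/one-hot encoding involve. It is not the library&rsquo;s implementation: <code>split_symbols</code>, <code>label_encode</code>, and <code>one_hot</code> are hypothetical stand-ins, and in practice the library&rsquo;s own <code>split_selfies</code> and <code>selfies_to_encoding</code> should be used. The vocabulary is invented; <code>[nop]</code> is the padding symbol used by SELFIES.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import re


def split_symbols(selfies):
    """Split a SELFIES string into bracketed symbols (and '.' separators)."""
    return re.findall(r"\[[^\]]*\]|\.", selfies)


def label_encode(selfies, vocab, pad_to_len):
    """Map symbols to integer labels, right-padding with the [nop] symbol."""
    symbols = split_symbols(selfies)
    symbols += ["[nop]"] * (pad_to_len - len(symbols))
    return [vocab[s] for s in symbols]


def one_hot(labels, vocab_size):
    """Turn integer labels into one-hot rows of length vocab_size."""
    return [[int(i == label) for i in range(vocab_size)] for label in labels]


ethanol = "[C][C][O]"                       # SELFIES for the SMILES "CCO"
vocab = {"[nop]": 0, "[C]": 1, "[O]": 2}    # hypothetical symbol-to-index map

labels = label_encode(ethanol, vocab, pad_to_len=5)
print(labels)                # [1, 1, 2, 0, 0]
print(one_hot(labels, 3))
</code></pre></div>
<p>Because every SELFIES symbol is a self-contained bracketed token, fixed-length padded encodings like this drop directly into standard sequence models.</p>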
<h2 id="performance-benchmarks--validity-testing">Performance Benchmarks &amp; Validity Testing</h2>
<p>The authors validated the library through several benchmarks:</p>
<p><strong>Performance testing</strong>: Roundtrip conversion (SMILES to SELFIES to SMILES) on the DTP open compound collection (slightly over 300K molecules) completed in 252 seconds total (136s encoding, 116s decoding), using pure Python with no external dependencies.</p>
<p><strong>Random SELFIES generation</strong>: Demonstrated that random SELFIES strings of varying lengths always decode to valid molecules, with the size distribution of generated molecules controllable by filtering the sampling alphabet (e.g., removing multi-bond and low-valence atom symbols shifts the distribution toward larger molecules).</p>
<p><strong>Validity guarantee</strong>: By construction, every SELFIES string decodes to a valid molecule. The grammar&rsquo;s bond demotion and deferred ring closure mechanisms make it impossible to generate chemically invalid structures.</p>
<p><strong>Attribution system</strong>: Showed both encoder and decoder can track which input symbols produce which output symbols, useful for property alignment.</p>
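<p>The reported benchmark timings translate into throughput as follows. The molecule count is rounded down to 300,000 here (the source says &ldquo;slightly over 300K&rdquo;), so the figures are approximate lower bounds.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">N_MOLECULES = 300_000   # rounded; the collection is slightly larger
ENCODE_SECONDS = 136    # SMILES to SELFIES
DECODE_SECONDS = 116    # SELFIES to SMILES

total_seconds = ENCODE_SECONDS + DECODE_SECONDS
encode_rate = N_MOLECULES / ENCODE_SECONDS
decode_rate = N_MOLECULES / DECODE_SECONDS
roundtrip_rate = N_MOLECULES / total_seconds

print(f"total: {total_seconds} s ({total_seconds / 60:.1f} min)")
print(f"encode: ~{encode_rate:.0f} molecules/s")
print(f"decode: ~{decode_rate:.0f} molecules/s")
print(f"roundtrip: ~{roundtrip_rate:.0f} molecules/s")
</code></pre></div>
<p>At roughly a thousand roundtrips per second on a single process, dataset-scale conversion becomes practical, though billion-molecule corpora would still need parallelism.</p>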
<h2 id="future-trajectories-for-general-chemical-representations">Future Trajectories for General Chemical Representations</h2>
<p>The 2023 update successfully addresses the main adoption barriers:</p>
<ol>
<li><strong>Fast enough</strong> for large-scale ML applications (300K molecules in ~4 minutes)</li>
<li><strong>Chemically comprehensive</strong> enough for drug discovery and materials science</li>
<li><strong>User-friendly</strong> enough for straightforward integration into existing workflows</li>
</ol>
<p>The validity guarantee, SELFIES&rsquo; core advantage, is now practically accessible for real-world research. The roadmap includes future extensions for polymers, crystals, chemical reactions, and non-covalent interactions, which would expand SELFIES&rsquo; applicability beyond small-molecule chemistry.</p>
<p><strong>Limitations acknowledged</strong>: The paper focuses on implementation improvements. Some advanced chemical systems (polymers, crystals) still need future work.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/selfies">selfies</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official Python library, installable via <code>pip install selfies</code></td>
      </tr>
  </tbody>
</table>
<h3 id="code">Code</h3>
<p>The <code>selfies</code> library is completely open-source and written in pure Python. It requires no extra dependencies and is available on GitHub, installable via <code>pip install selfies</code>. The repository includes testing suites (<code>tox</code>) and example benchmarking scripts to reproduce the translation speeds reported in the paper.</p>
<h3 id="hardware">Hardware</h3>
<p>Performance benchmarks (e.g., the 252-second roundtrip conversion on 300K molecules) were executed on Google Colaboratory using two 2.20GHz Intel Xeon CPUs.</p>
<h3 id="algorithms">Algorithms</h3>
<h4 id="technical-specification-the-grammar">Technical Specification: The Grammar</h4>
<p>The core innovation of SELFIES is a <strong>Context-Free Grammar (CFG) augmented with state-machine logic</strong> to ensure that every derived string represents a valid molecule. While the software features are important, understanding the underlying derivation rules is essential for replication or extension of the system.</p>
<p><strong>1. Derivation Rules: The Atom State Machine</strong></p>
<p>The fundamental mechanism that guarantees validity is a <strong>state machine</strong> that tracks the remaining valence of the most recently added atom:</p>
<ul>
<li><strong>State Tracking</strong>: The derivation maintains a non-terminal state $X_l$, where $l$ represents the current atom&rsquo;s remaining valence (number of bonds it can still form)</li>
<li><strong>Standard Derivation</strong>: An atom symbol $[\beta \alpha]$ (bond order + atom type) transitions the state from $S$ (start) to $X_l$, where $l$ is calculated from the atom&rsquo;s standard valence minus the incoming bond order</li>
<li><strong>Bond Demotion (The Key Rule)</strong>: When deriving atom symbol $[\beta \alpha]$ in state $X_i$, the actual bond order used is $d_0 = \min(\ell, i, d(\beta))$, where $\ell$ is the new atom&rsquo;s valence, $i$ is the previous atom&rsquo;s remaining capacity, and $d(\beta)$ is the requested bond order. This automatic downward adjustment is the mathematical core of the validity guarantee.</li>
</ul>
<p>This state machine ensures that no atom ever exceeds its allowed valence, making it impossible to generate chemically invalid structures.</p>
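<p>The bond demotion rule can be sketched for a plain linear chain. This is a toy illustration, not the library&rsquo;s implementation: it ignores branches and rings, the function names are hypothetical, and the small valence table is chosen only for the example.</p>

```python
def demoted_bond_order(new_atom_valence: int, remaining: int, requested: int) -> int:
    """d_0 = min(l, i, d(beta)): the bond order that is actually used."""
    return min(new_atom_valence, remaining, requested)


def derive_chain(symbols, valence):
    """Derive a linear chain from (requested_bond_order, element) pairs.

    Returns the accepted (actual_bond_order, element) pairs. Derivation
    stops as soon as the previous atom has no remaining valence.
    """
    accepted = []
    remaining = None  # None marks the start state S (no previous atom yet)
    for requested, element in symbols:
        if remaining is None:
            order = 0  # the first atom has no incoming bond
        else:
            if remaining == 0:
                break  # previous atom is saturated, so the chain ends here
            order = demoted_bond_order(valence[element], remaining, requested)
        accepted.append((order, element))
        remaining = valence[element] - order
    return accepted


# Illustrative valences only (assumption of this sketch)
valence = {"C": 4, "O": 2, "F": 1}

# Request C, then a triple bond to O, then a single bond to another C
chain = derive_chain([(1, "C"), (3, "O"), (1, "C")], valence)
```

<p>Here the requested triple bond to oxygen is demoted to a double bond, and the trailing carbon is dropped because oxygen has no valence left, so the result is a valid C=O fragment.</p>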
<p><strong>2. Control Symbols: Branches and Rings</strong></p>
<p>Branch length calculation: SELFIES uses a <strong>hexadecimal encoding</strong> to determine branch lengths. A branch symbol <code>[Branch ℓ]</code>, where $\ell$ is the symbol&rsquo;s numeric suffix, consumes the next $\ell$ symbols from the queue and converts them to integer indices $c_1, \dots, c_\ell$ via a fixed mapping (Table III in the paper). The number of symbols $N$ to include in the branch is then:</p>
<p>$$
N = 1 + \sum_{k=1}^{\ell} 16^{\ell - k} \, c_k
$$</p>
<p>This formula interprets the indices as hexadecimal digits, so a few index symbols can specify long branches: one index covers lengths up to 16, two up to 256, and three up to 4096.</p>
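<p>The branch length formula is easy to check numerically. The sketch below assumes the index symbols have already been mapped to integers in the range 0 to 15; the function name is hypothetical.</p>

```python
def branch_length(indices: list[int]) -> int:
    """N = 1 + sum_k 16**(l - k) * c_k, reading the indices as base-16 digits."""
    if any(c not in range(16) for c in indices):
        raise ValueError("each index must be a hexadecimal digit (0-15)")
    length = len(indices)
    # k runs from 1 to l, so the first index is the most significant digit
    return 1 + sum(16 ** (length - k) * c for k, c in enumerate(indices, start=1))
```

<p>The leading <code>1 +</code> means a branch always contains at least one symbol, even when every index is zero.</p>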
<p>Ring closure queue system: Ring formation uses a <strong>deferred evaluation</strong> strategy to maintain validity. Ring symbols don&rsquo;t create bonds immediately; instead, they push closure candidates into a queue $R$. These candidates are resolved after the main derivation completes. A ring closure candidate is <strong>rejected</strong> if either ring atom has no remaining valence ($m_1 = 0$ or $m_2 = 0$), or if the left and right ring atoms are not distinct (to avoid self-loops). If a prior bond already exists between the two atoms, the bond order is incremented rather than duplicated. This deferred validation prevents invalid ring structures while keeping the grammar context-free during the main derivation.</p>
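<p>A rough sketch of how the deferred ring-closure queue might be resolved. The data layout and the function name are hypothetical, and two details are assumptions of this sketch, not statements about the library: the ring bond order is demoted to <code>min(requested, m1, m2)</code>, and no upper cap is applied when an existing bond is incremented.</p>

```python
def resolve_ring_closures(queue, remaining, bonds):
    """Apply deferred ring-closure candidates after the main derivation.

    queue     -- list of (left_atom, right_atom, requested_order) candidates
    remaining -- dict mapping atom index to remaining valence (mutated in place)
    bonds     -- dict mapping frozenset({a, b}) to bond order (mutated in place)
    """
    for left, right, requested in queue:
        if left == right:
            continue  # reject: a ring bond must join two distinct atoms
        m1, m2 = remaining[left], remaining[right]
        if m1 == 0 or m2 == 0:
            continue  # reject: one of the atoms cannot form another bond
        order = min(requested, m1, m2)  # assumed demotion, keeps valences legal
        key = frozenset((left, right))
        # An existing bond is strengthened instead of adding a duplicate edge
        bonds[key] = bonds.get(key, 0) + order
        remaining[left] -= order
        remaining[right] -= order
    return bonds


# Three carbons in a chain 0-1-2, with valences left over after the chain bonds
remaining = {0: 3, 1: 2, 2: 3}
bonds = {frozenset((0, 1)): 1, frozenset((1, 2)): 1}
queue = [(0, 2, 1), (0, 1, 1), (2, 2, 1)]
resolve_ring_closures(queue, remaining, bonds)
```

<p>In this example the first candidate closes a three-membered ring, the second raises the existing 0-1 bond to a double bond, and the third is rejected as a self-loop.</p>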
<p><strong>3. Symbol Structure and Standardization</strong></p>
<p>SELFIES enforces a strict, standardized format for atom symbols to eliminate ambiguity:</p>
<ul>
<li><strong>Canonical Format</strong>: Atom symbols follow the structure <code>[Bond, Isotope, Element, Chirality, H-count, Charge]</code></li>
<li><strong>No Variation</strong>: There is only one way to write each symbol (e.g., <code>[Fe++]</code> and <code>[Fe+2]</code> are standardized to a single form)</li>
<li><strong>Order Matters</strong>: The components must appear in the specified order</li>
</ul>
<p><strong>4. Default Semantic Constraints</strong></p>
<p>By default, the library enforces standard organic chemistry valence rules:</p>
<ul>
<li><strong>Charge-Dependent Valences</strong>: Default constraints specify maximum bonds per charge state (e.g., C: 4/5/3 for neutral/+1/-1; S: 6/7/5). Unlisted atom types default to 8 maximum bonds as a catch-all.</li>
<li><strong>Preset Options</strong>: Three preset constraint sets are available: <code>default</code>, <code>octet_rule</code>, and <code>hypervalent</code>.</li>
<li><strong>Customizable</strong>: Constraints can be modified via <code>set_semantic_constraints()</code> for specialized applications (hypervalent compounds, theoretical studies, etc.)</li>
</ul>
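<p>The charge-dependent lookup with a catch-all can be sketched with a plain dictionary. The numbers are the ones quoted above; the key format (<code>"C+1"</code>, <code>"?"</code>) and the function name are assumptions made for this illustration.</p>

```python
# Values copied from the text above; the key format ("C+1", "?") is an
# assumption made for this sketch.
DEFAULT_CONSTRAINTS = {
    "C": 4, "C+1": 5, "C-1": 3,
    "S": 6, "S+1": 7, "S-1": 5,
    "?": 8,  # catch-all for unlisted atom types
}


def max_bonds(element: str, charge: int = 0, constraints=DEFAULT_CONSTRAINTS) -> int:
    """Look up the bonding capacity of an element in a given charge state."""
    key = element if charge == 0 else f"{element}{charge:+d}"
    return constraints.get(key, constraints["?"])


# A customized table, e.g. for a study that caps neutral sulfur at two bonds
custom = {**DEFAULT_CONSTRAINTS, "S": 2}
```

<p>A modified table like <code>custom</code> is the kind of object one would expect to hand to <code>set_semantic_constraints()</code>; check the library documentation for the exact format it accepts.</p>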
<p>The combination of these grammar rules with the state machine ensures that <strong>every valid SELFIES string decodes to a chemically valid molecule</strong>, regardless of how the string was generated (random, ML model output, manual construction, etc.).</p>
<h3 id="data">Data</h3>
<p><strong>Benchmark dataset</strong>: DTP (Developmental Therapeutics Program) open compound collection with slightly over 300K SMILES strings, a set of molecules tested experimentally for potential treatment against cancer and AIDS.</p>
<p><strong>Random generation testing</strong>: Random SELFIES strings of varying lengths (10, 100, 250 symbols) generated from both basic and filtered alphabets to test decoding validity and molecule size distributions.</p>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Performance metric</strong>: Roundtrip conversion time (SMILES to SELFIES to SMILES) is 252 seconds for 300K+ molecules (136s encoding, 116s decoding). Times averaged over 3 replicate trials on Google Colaboratory.</p>
<p><strong>Validity testing</strong>: Random SELFIES strings of lengths 10, 100, and 250 all decode to valid molecules. Decoding 1000 random strings of length 250 from the basic alphabet takes 0.341s; from the filtered alphabet, 1.633s.</p>
<p><strong>Attribution system</strong>: Both <code>encoder()</code> and <code>decoder()</code> support an <code>attribute</code> flag that returns <code>AttributionMap</code> objects, tracing which input symbols produce which output symbols for property alignment.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., &amp; Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. <em>Digital Discovery</em>, <em>2</em>(4), 897-908. <a href="https://doi.org/10.1039/D3DD00044C">https://doi.org/10.1039/D3DD00044C</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{lo2023recent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Recent advances in the self-referencing embedded strings (SELFIES) library}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lo, Alston and Pollice, Robert and Nigam, AkshatKumar and White, Andrew D and Krenn, Mario and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{897--908}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3DD00044C}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/selfies">SELFIES GitHub Repository</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies-original-paper/">Original SELFIES Paper (2020)</a></li>
<li><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES Format Overview</a></li>
</ul>
]]></content:encoded></item><item><title>Converting SMILES and SELFIES to 2D Molecular Images</title><link>https://hunterheidenreich.com/posts/visualizing-smiles-and-selfies-strings/</link><pubDate>Fri, 12 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/visualizing-smiles-and-selfies-strings/</guid><description>A guide to generating 2D molecular structure images from SMILES and SELFIES strings using Python, RDKit, and PIL.</description><content:encoded><![CDATA[<p>Lately, I&rsquo;ve spent a lot of time staring at datasets full of 1D molecular strings.
With time, I find I get better at recognizing functional groups and substructures like <code>C(=O)O</code> (carboxylic acid) or <code>c1ccccc1</code> (benzene ring) in SMILES.
However, anything really complex is beyond my personal visualization capabilities.</p>
<p>I ran into this recently while debugging a generative model.
Sometimes the grammar of the string provides the clue as to what is going wrong.
Other times, actually <em>seeing</em> the molecule is what helps.
I had a terminal full of generated strings and needed to verify their structures visually.
I needed a streamlined way to generate these images locally.
A lightweight script turns that text into a properly formatted image directly from the terminal.</p>
<h2 id="smiles-vs-selfies">SMILES vs. SELFIES</h2>
<p>There are two primary string representations you will encounter in modern cheminformatics:</p>
<ol>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a></strong>: The industry standard. It uses simple rules (<code>C</code> for carbon, <code>=</code> for double bonds, parentheses for branches). It is compact and machine-parseable. However, random SMILES strings are often invalid (e.g., unclosed rings or invalid valences).</li>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a></strong>: Designed specifically for machine learning. It is a robust representation where <em>every</em> string corresponds to a valid molecular graph. This makes it ideal for generative models. Note that it is more verbose than SMILES.</li>
</ol>
<p>I often need to visualize both formats. Let&rsquo;s build a single, robust Python tool to handle them.</p>
<h2 id="the-quick-win-native-rdkit">The Quick Win: Native RDKit</h2>
<p>If you just need a quick image from a SMILES string and don&rsquo;t care about the image dimensions or adding a legend, RDKit can do this in a few lines:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> Draw
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(<span style="color:#e6db74">&#34;CCO&#34;</span>)
</span></span><span style="display:flex;"><span>Draw<span style="color:#f92672">.</span>MolToFile(mol, <span style="color:#e6db74">&#34;ethanol.png&#34;</span>)
</span></span></code></pre></div><p>The native RDKit method is fast for quick checks. However, custom rendering provides necessary control over image dimensions, formula subscripts, and handling multiple input formats like SELFIES.</p>
<h2 id="building-a-custom-renderer-for-precise-control">Building a Custom Renderer for Precise Control</h2>
<p>Let&rsquo;s build a fuller tool using RDKit for chemical processing, the <code>selfies</code> library for decoding, and PIL for image manipulation.</p>
<h3 id="core-dependencies">Core Dependencies</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> Draw, rdDepictor, rdMolDescriptors
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> PIL <span style="color:#f92672">import</span> Image, ImageDraw, ImageFont
</span></span></code></pre></div><p>RDKit handles the chemical logic, <code>selfies</code> translates SELFIES to SMILES, and PIL gives us fine control over the final image appearance.</p>
<h3 id="the-main-conversion-function">The Main Conversion Function</h3>
<p>Here is the core conversion logic. Notice the Python type hints on the signature.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">string_to_png</span>(mol_string: str, output_file: str, size: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>, is_selfies: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Generates a 2D molecule image with a chemical formula legend from SMILES or SELFIES.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Decode SELFIES to SMILES if necessary</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> is_selfies:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            smiles <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(mol_string)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid SELFIES string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>) <span style="color:#f92672">from</span> e
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        smiles <span style="color:#f92672">=</span> mol_string
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(smiles)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> mol:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Could not generate molecule from SMILES: </span><span style="color:#e6db74">{</span>smiles<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Generate 2D coordinates and formula</span>
</span></span><span style="display:flex;"><span>    rdDepictor<span style="color:#f92672">.</span>Compute2DCoords(mol)
</span></span><span style="display:flex;"><span>    formula <span style="color:#f92672">=</span> rdMolDescriptors<span style="color:#f92672">.</span>CalcMolFormula(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Render the molecule</span>
</span></span><span style="display:flex;"><span>    img <span style="color:#f92672">=</span> Draw<span style="color:#f92672">.</span>MolToImage(mol, size<span style="color:#f92672">=</span>(size, size))<span style="color:#f92672">.</span>convert(<span style="color:#e6db74">&#34;RGBA&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Create a canvas with extra space at the bottom for the legend</span>
</span></span><span style="display:flex;"><span>    legend_height <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.1</span>)
</span></span><span style="display:flex;"><span>    canvas <span style="color:#f92672">=</span> Image<span style="color:#f92672">.</span>new(<span style="color:#e6db74">&#34;RGBA&#34;</span>, (size, size <span style="color:#f92672">+</span> legend_height), <span style="color:#e6db74">&#34;white&#34;</span>)
</span></span><span style="display:flex;"><span>    canvas<span style="color:#f92672">.</span>paste(img, (<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    draw <span style="color:#f92672">=</span> ImageDraw<span style="color:#f92672">.</span>Draw(canvas)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Define dynamic font sizes</span>
</span></span><span style="display:flex;"><span>    font_reg <span style="color:#f92672">=</span> get_font(int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.03</span>))
</span></span><span style="display:flex;"><span>    font_sub <span style="color:#f92672">=</span> get_font(int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw the legend</span>
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>)
</span></span><span style="display:flex;"><span>    y <span style="color:#f92672">=</span> size <span style="color:#f92672">+</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw &#34;Formula: &#34; label</span>
</span></span><span style="display:flex;"><span>    draw<span style="color:#f92672">.</span>text((x, y), <span style="color:#e6db74">&#34;Formula: &#34;</span>, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">+=</span> draw<span style="color:#f92672">.</span>textlength(<span style="color:#e6db74">&#34;Formula: &#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw formula with subscript handling for numbers</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> char <span style="color:#f92672">in</span> formula:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Use smaller font and lower y-offset for numbers (subscripts)</span>
</span></span><span style="display:flex;"><span>        font <span style="color:#f92672">=</span> font_sub <span style="color:#66d9ef">if</span> char<span style="color:#f92672">.</span>isdigit() <span style="color:#66d9ef">else</span> font_reg
</span></span><span style="display:flex;"><span>        y_offset <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.005</span>) <span style="color:#66d9ef">if</span> char<span style="color:#f92672">.</span>isdigit() <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        draw<span style="color:#f92672">.</span>text((x, y <span style="color:#f92672">+</span> y_offset), char, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font)
</span></span><span style="display:flex;"><span>        x <span style="color:#f92672">+=</span> draw<span style="color:#f92672">.</span>textlength(char, font<span style="color:#f92672">=</span>font)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw original string</span>
</span></span><span style="display:flex;"><span>    label <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;SELFIES&#34;</span> <span style="color:#66d9ef">if</span> is_selfies <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#34;SMILES&#34;</span>
</span></span><span style="display:flex;"><span>    draw<span style="color:#f92672">.</span>text((x, y), <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34; | </span><span style="color:#e6db74">{</span>label<span style="color:#e6db74">}</span><span style="color:#e6db74">: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    canvas<span style="color:#f92672">.</span>save(output_file)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Saved: </span><span style="color:#e6db74">{</span>output_file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>This function handles everything: SELFIES decoding, validation, coordinate generation, image creation, and legend drawing.</p>
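<p>The subscript logic in the loop above hinges on <code>str.isdigit()</code>. As a small variation, the formula can first be split into runs of digits and non-digits, so a multi-digit count such as <code>22</code> is drawn with a single call. The helper <code>formula_segments</code> is a hypothetical addition, not part of the original tool.</p>

```python
from itertools import groupby


def formula_segments(formula: str) -> list[tuple[str, bool]]:
    """Split a formula into (text, is_subscript) runs, grouping adjacent digits."""
    return [
        ("".join(run), is_digit)
        for is_digit, run in groupby(formula, key=str.isdigit)
    ]


segments = formula_segments("C9H8O4")
```

<p>Each segment can then be drawn with <code>font_sub</code> and a small y-offset when <code>is_subscript</code> is true, or with <code>font_reg</code> otherwise, exactly as in the per-character loop.</p>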
<h3 id="font-handling">Font Handling</h3>
<p>We need a helper to handle fonts robustly across systems:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_font</span>(size: int, font_name: str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;arial.ttf&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Attempts to load a TTF font, falls back to default if unavailable.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> ImageFont<span style="color:#f92672">.</span>truetype(font_name, size)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">OSError</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> ImageFont<span style="color:#f92672">.</span>load_default()
</span></span></code></pre></div><h2 id="examples-in-action">Examples in Action</h2>
<p>Let&rsquo;s see the tool in action with some common molecules, comparing the SMILES and SELFIES inputs.</p>
<h3 id="simple-molecules">Simple Molecules</h3>
<figure class="post-figure center ">
    <img src="/img/smiles2img/ethanol_demo.webp"
         alt="Ethanol molecular structure with formula C2H6O"
         title="Ethanol molecular structure with formula C2H6O"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Ethanol</strong>: A simple alcohol. The SMILES is <code>CCO</code>, while the SELFIES is <code>[C][C][O]</code>.</figcaption>
    
</figure>

<h3 id="aromatic-compounds">Aromatic Compounds</h3>
<figure class="post-figure center ">
    <img src="/img/smiles2img/benzene_demo.webp"
         alt="Benzene molecular structure with formula C6H6"
         title="Benzene molecular structure with formula C6H6"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Benzene</strong>: The classic aromatic ring. SMILES uses numbers for ring closures (<code>C1=CC=CC=C1</code>), while SELFIES uses explicit tokens (<code>[C][=C][C][=C][C][=C][Ring1][=Branch1]</code>).</figcaption>
    
</figure>

<h3 id="complex-pharmaceuticals">Complex Pharmaceuticals</h3>
<figure class="post-figure center ">
    <img src="/img/smiles2img/aspirin_demo.webp"
         alt="Aspirin molecular structure with formula C9H8O4"
         title="Aspirin molecular structure with formula C9H8O4"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption"><strong>Aspirin</strong>: A more complex molecule showing how the tool handles branched structures and multiple functional groups.</figcaption>
    
</figure>

<h2 id="going-further-vector-graphics-svg">Going Further: Vector Graphics (SVG)</h2>
<p>Use vector graphics (SVG/PDF) for true publication-quality figures. Vector graphics scale infinitely without pixelation.</p>
<p>RDKit handles this natively with <code>rdMolDraw2D</code>:</p>
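<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> rdDepictor
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.Draw <span style="color:#f92672">import</span> rdMolDraw2D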
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">string_to_svg</span>(mol_string: str, output_file: str, size: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>, is_selfies: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Generates a 2D molecule SVG image.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> is_selfies:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            mol_string <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(mol_string)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid SELFIES string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>) <span style="color:#f92672">from</span> e
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(mol_string)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> mol:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    rdDepictor<span style="color:#f92672">.</span>Compute2DCoords(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> rdMolDraw2D<span style="color:#f92672">.</span>MolDraw2DSVG(size, size)
</span></span><span style="display:flex;"><span>    d<span style="color:#f92672">.</span>DrawMolecule(mol)
</span></span><span style="display:flex;"><span>    d<span style="color:#f92672">.</span>FinishDrawing()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(output_file, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        f<span style="color:#f92672">.</span>write(d<span style="color:#f92672">.</span>GetDrawingText())
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Saved: </span><span style="color:#e6db74">{</span>output_file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>This produces a resolution-independent vector image that stays sharp at any zoom level. Note that this method omits the custom PIL-based legend. Choose the right tool for the job: PNG for quick checks and slides, SVG for journal submissions.</p>
<h2 id="command-line-interface">Command-Line Interface</h2>
<p>The tool uses Python&rsquo;s standard <code>argparse</code> library for the command-line interface. It writes an SVG when the output filename ends in <code>.svg</code> (or when <code>--svg</code> is passed) and accepts a <code>--selfies</code> flag to treat the input string as SELFIES.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Basic SMILES usage</span>
</span></span><span style="display:flex;"><span>python mol2img.py <span style="color:#e6db74">&#34;CCO&#34;</span> -o ethanol.png
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># SELFIES usage</span>
</span></span><span style="display:flex;"><span>python mol2img.py <span style="color:#e6db74">&#34;[C][C][O]&#34;</span> -o ethanol.png --selfies
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Generate SVG for publication</span>
</span></span><span style="display:flex;"><span>python mol2img.py <span style="color:#e6db74">&#34;CCO&#34;</span> -o ethanol.svg
</span></span></code></pre></div><h2 id="download-the-complete-script">Download the Complete Script</h2>
<p>You can copy the complete <code>mol2img.py</code> script directly from the code block below. For a fuller version with an SVG fallback, type hints, and batch (grid) rendering, see the <a href="/projects/molecular-string-renderer/">Molecular String Renderer project</a>.</p>
<h3 id="installation-and-setup">Installation and Setup</h3>
<p>Before using the script, install the required dependencies:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install rdkit pillow selfies
</span></span></code></pre></div><h3 id="complete-script">Complete Script</h3>
<details>
<summary>Click to expand the complete mol2img.py script</summary>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> argparse
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> sys
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> selfies <span style="color:#66d9ef">as</span> sf
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit <span style="color:#f92672">import</span> Chem
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem <span style="color:#f92672">import</span> Draw, rdDepictor, rdMolDescriptors
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> rdkit.Chem.Draw <span style="color:#f92672">import</span> rdMolDraw2D
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> PIL <span style="color:#f92672">import</span> Image, ImageDraw, ImageFont
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_font</span>(size: int, font_name: str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;arial.ttf&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Attempts to load a TTF font, falls back to default if unavailable.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> ImageFont<span style="color:#f92672">.</span>truetype(font_name, size)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">IOError</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> ImageFont<span style="color:#f92672">.</span>load_default()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">string_to_svg</span>(mol_string: str, output_file: str, size: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>, is_selfies: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Generates a 2D molecule SVG image.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> is_selfies:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            mol_string <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(mol_string)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid SELFIES string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>) <span style="color:#f92672">from</span> e
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(mol_string)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> mol:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    rdDepictor<span style="color:#f92672">.</span>Compute2DCoords(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> rdMolDraw2D<span style="color:#f92672">.</span>MolDraw2DSVG(size, size)
</span></span><span style="display:flex;"><span>    d<span style="color:#f92672">.</span>DrawMolecule(mol)
</span></span><span style="display:flex;"><span>    d<span style="color:#f92672">.</span>FinishDrawing()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(output_file, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>        f<span style="color:#f92672">.</span>write(d<span style="color:#f92672">.</span>GetDrawingText())
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Saved: </span><span style="color:#e6db74">{</span>output_file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">string_to_png</span>(mol_string: str, output_file: str, size: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>, is_selfies: bool <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>) <span style="color:#f92672">-&gt;</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Generates a 2D molecule image with a chemical formula legend.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> is_selfies:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            smiles <span style="color:#f92672">=</span> sf<span style="color:#f92672">.</span>decoder(mol_string)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Invalid SELFIES string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>) <span style="color:#f92672">from</span> e
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        smiles <span style="color:#f92672">=</span> mol_string
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mol <span style="color:#f92672">=</span> Chem<span style="color:#f92672">.</span>MolFromSmiles(smiles)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> mol:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">raise</span> <span style="color:#a6e22e">ValueError</span>(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Could not generate molecule from string: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Generate 2D coordinates and formula</span>
</span></span><span style="display:flex;"><span>    rdDepictor<span style="color:#f92672">.</span>Compute2DCoords(mol)
</span></span><span style="display:flex;"><span>    formula <span style="color:#f92672">=</span> rdMolDescriptors<span style="color:#f92672">.</span>CalcMolFormula(mol)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Render the molecule</span>
</span></span><span style="display:flex;"><span>    img <span style="color:#f92672">=</span> Draw<span style="color:#f92672">.</span>MolToImage(mol, size<span style="color:#f92672">=</span>(size, size))<span style="color:#f92672">.</span>convert(<span style="color:#e6db74">&#34;RGBA&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Create a canvas with extra space at the bottom for the legend</span>
</span></span><span style="display:flex;"><span>    legend_height <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.1</span>)
</span></span><span style="display:flex;"><span>    canvas <span style="color:#f92672">=</span> Image<span style="color:#f92672">.</span>new(<span style="color:#e6db74">&#34;RGBA&#34;</span>, (size, size <span style="color:#f92672">+</span> legend_height), <span style="color:#e6db74">&#34;white&#34;</span>)
</span></span><span style="display:flex;"><span>    canvas<span style="color:#f92672">.</span>paste(img, (<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    draw <span style="color:#f92672">=</span> ImageDraw<span style="color:#f92672">.</span>Draw(canvas)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Define dynamic font sizes</span>
</span></span><span style="display:flex;"><span>    font_reg <span style="color:#f92672">=</span> get_font(int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.03</span>))
</span></span><span style="display:flex;"><span>    font_sub <span style="color:#f92672">=</span> get_font(int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw the legend</span>
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>)
</span></span><span style="display:flex;"><span>    y <span style="color:#f92672">=</span> size <span style="color:#f92672">+</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.02</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw &#34;Formula: &#34; label</span>
</span></span><span style="display:flex;"><span>    draw<span style="color:#f92672">.</span>text((x, y), <span style="color:#e6db74">&#34;Formula: &#34;</span>, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">+=</span> draw<span style="color:#f92672">.</span>textlength(<span style="color:#e6db74">&#34;Formula: &#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw formula with subscript handling for numbers</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> char <span style="color:#f92672">in</span> formula:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Use smaller font and lower y-offset for numbers (subscripts)</span>
</span></span><span style="display:flex;"><span>        font <span style="color:#f92672">=</span> font_sub <span style="color:#66d9ef">if</span> char<span style="color:#f92672">.</span>isdigit() <span style="color:#66d9ef">else</span> font_reg
</span></span><span style="display:flex;"><span>        y_offset <span style="color:#f92672">=</span> int(size <span style="color:#f92672">*</span> <span style="color:#ae81ff">0.005</span>) <span style="color:#66d9ef">if</span> char<span style="color:#f92672">.</span>isdigit() <span style="color:#66d9ef">else</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        draw<span style="color:#f92672">.</span>text((x, y <span style="color:#f92672">+</span> y_offset), char, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font)
</span></span><span style="display:flex;"><span>        x <span style="color:#f92672">+=</span> draw<span style="color:#f92672">.</span>textlength(char, font<span style="color:#f92672">=</span>font)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Draw original string</span>
</span></span><span style="display:flex;"><span>    label <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;SELFIES&#34;</span> <span style="color:#66d9ef">if</span> is_selfies <span style="color:#66d9ef">else</span> <span style="color:#e6db74">&#34;SMILES&#34;</span>
</span></span><span style="display:flex;"><span>    draw<span style="color:#f92672">.</span>text((x, y), <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34; | </span><span style="color:#e6db74">{</span>label<span style="color:#e6db74">}</span><span style="color:#e6db74">: </span><span style="color:#e6db74">{</span>mol_string<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>, fill<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;black&#34;</span>, font<span style="color:#f92672">=</span>font_reg)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    canvas<span style="color:#f92672">.</span>save(output_file)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Saved: </span><span style="color:#e6db74">{</span>output_file<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;__main__&#34;</span>:
</span></span><span style="display:flex;"><span>    parser <span style="color:#f92672">=</span> argparse<span style="color:#f92672">.</span>ArgumentParser(description<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Convert a SMILES or SELFIES string to a 2D molecular image.&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;string&#34;</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;The molecular string to convert&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;-o&#34;</span>, <span style="color:#e6db74">&#34;--output&#34;</span>, default<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;molecule.png&#34;</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Output filename (default: molecule.png)&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;--size&#34;</span>, type<span style="color:#f92672">=</span>int, default<span style="color:#f92672">=</span><span style="color:#ae81ff">500</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Image width/height in pixels (default: 500)&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;--svg&#34;</span>, action<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;store_true&#34;</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Force SVG output (overrides filename extension)&#34;</span>)
</span></span><span style="display:flex;"><span>    parser<span style="color:#f92672">.</span>add_argument(<span style="color:#e6db74">&#34;--selfies&#34;</span>, action<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;store_true&#34;</span>, help<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Treat the input string as SELFIES.&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    args <span style="color:#f92672">=</span> parser<span style="color:#f92672">.</span>parse_args()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Determine format based on flag or file extension</span>
</span></span><span style="display:flex;"><span>        is_svg <span style="color:#f92672">=</span> args<span style="color:#f92672">.</span>svg <span style="color:#f92672">or</span> args<span style="color:#f92672">.</span>output<span style="color:#f92672">.</span>lower()<span style="color:#f92672">.</span>endswith(<span style="color:#e6db74">&#34;.svg&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> is_svg:
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Ensure extension is correct if not present</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> args<span style="color:#f92672">.</span>output<span style="color:#f92672">.</span>lower()<span style="color:#f92672">.</span>endswith(<span style="color:#e6db74">&#34;.svg&#34;</span>):
</span></span><span style="display:flex;"><span>                args<span style="color:#f92672">.</span>output <span style="color:#f92672">=</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>splitext(args<span style="color:#f92672">.</span>output)[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">+</span> <span style="color:#e6db74">&#34;.svg&#34;</span>
</span></span><span style="display:flex;"><span>            string_to_svg(args<span style="color:#f92672">.</span>string, args<span style="color:#f92672">.</span>output, args<span style="color:#f92672">.</span>size, args<span style="color:#f92672">.</span>selfies)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>            string_to_png(args<span style="color:#f92672">.</span>string, args<span style="color:#f92672">.</span>output, args<span style="color:#f92672">.</span>size, args<span style="color:#f92672">.</span>selfies)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Error: </span><span style="color:#e6db74">{</span>e<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>, file<span style="color:#f92672">=</span>sys<span style="color:#f92672">.</span>stderr)
</span></span><span style="display:flex;"><span>        sys<span style="color:#f92672">.</span>exit(<span style="color:#ae81ff">1</span>)
</span></span></code></pre></div></details>
]]></content:encoded></item><item><title>Exponential Random Numbers: Two Classic Algorithms</title><link>https://hunterheidenreich.com/posts/random-number-tricks/</link><pubDate>Sun, 31 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/random-number-tricks/</guid><description>Compare inverse transform sampling and von Neumann's rejection method for exponential random numbers with Python implementations and performance.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In the early days of computing, generating random numbers was a significant computational challenge. In a landmark 1951 paper, mathematician John von Neumann detailed various &ldquo;cooking recipes&rdquo; for producing and using random numbers on machines like the ENIAC. While much of the paper focuses on generating <em>uniform</em> random digits, he also described ingenious methods for generating numbers from more complex, non-uniform probability distributions.</p>
<p>One of the most fundamental needs in scientific simulation (from modeling radioactive decay to calculating particle free-paths in molecular dynamics) is sampling from an <strong>exponential distribution</strong> with probability density function:</p>
<p>$$f(x) = e^{-x} \quad \text{for } x \ge 0$$</p>
<p>Today&rsquo;s standard approach is elegant and direct, but it requires computing a natural logarithm (a computationally expensive operation on early hardware). To sidestep this limitation, von Neumann described a fascinating alternative that uses only basic comparisons, resembling what he called &ldquo;a well known game of chance Twenty-One, or Black Jack.&rdquo;</p>
<p>In this post, we&rsquo;ll explore both methods: the modern inverse transform approach and von Neumann&rsquo;s ingenious comparison-based algorithm. We&rsquo;ll implement them in Python, verify their correctness, and compare their performance, empirically testing the trade-offs von Neumann identified nearly 75 years ago.</p>
<hr>
<h2 id="method-1-the-standard-approach-inverse-transform-sampling">Method 1: The Standard Approach (Inverse Transform Sampling)</h2>
<p>The most common method for sampling from a given distribution is <strong>inverse transform sampling</strong>. This method relies on a fundamental principle: if you have a uniform random variable $U$ on the interval (0, 1), you can transform it into a random variable $X$ with any desired cumulative distribution function (CDF) $F(x)$ by applying:</p>
<p>$$X = F^{-1}(U)$$</p>
<p>For the exponential distribution, the CDF is $F(x) = 1 - e^{-x}$. To find the inverse, we set $U = 1 - e^{-X}$ and solve for $X$:</p>
<p>$$
\begin{align}
e^{-X} &amp;= 1 - U \\
-X &amp;= \ln(1 - U) \\
X &amp;= -\ln(1 - U)
\end{align}
$$</p>
<p>Here&rsquo;s a useful simplification: since $U$ is uniformly distributed on (0, 1), the quantity $(1 - U)$ is also uniformly distributed on (0, 1). Therefore, we can use the simpler formula:</p>
<p>$$X = -\ln(U)$$</p>
<p>This gives us an efficient method for generating exponentially distributed numbers, as long as evaluating a logarithm is cheap.</p>
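<p>The derivation can be sanity-checked with the standard library alone. The sketch below is illustrative only: the helper names <code>exp_cdf</code> and <code>exp_inverse_cdf</code>, the seed, and the sample count are assumptions introduced here. It confirms that the inverse undoes the CDF and that the sample mean lands near 1, the mean of the unit-rate exponential.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math
import random

def exp_cdf(x):
    """CDF of the unit-rate exponential: F(x) = 1 - e^(-x)."""
    return 1.0 - math.exp(-x)

def exp_inverse_cdf(u):
    """Inverse CDF: -ln(1 - u), valid for u in [0, 1)."""
    return -math.log(1.0 - u)

# Round trip: applying F to F^(-1)(u) should give back u.
for u in (0.1, 0.5, 0.9):
    assert math.isclose(exp_cdf(exp_inverse_cdf(u)), u)

# Empirical check with a fixed seed (hypothetical choice): mean should be near 1.
rng = random.Random(0)
samples = [exp_inverse_cdf(rng.random()) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(f"sample mean: {sample_mean:.3f}")
</code></pre></div>
<p>Because <code>random.random()</code> returns values in [0, 1), using $1 - U$ here keeps the argument of the logarithm strictly positive.</p>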
<h3 id="python-implementation">Python Implementation</h3>
<p>Here&rsquo;s a straightforward implementation using NumPy:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">exponential_inverse_transform</span>(n_samples<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Generate samples from an exponential distribution using inverse transform sampling.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Args:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        n_samples (int): Number of samples to generate.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Returns:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        np.ndarray: Array of exponentially distributed samples.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Generate uniform random numbers on (0, 1]; rand() samples [0, 1), so flip it to avoid log(0)</span>
</span></span><span style="display:flex;"><span>    U <span style="color:#f92672">=</span> <span style="color:#ae81ff">1.0</span> <span style="color:#f92672">-</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>rand(n_samples)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Apply the inverse transform</span>
</span></span><span style="display:flex;"><span>    X <span style="color:#f92672">=</span> <span style="color:#f92672">-</span>np<span style="color:#f92672">.</span>log(U)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> X
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Generate 100,000 samples for testing</span>
</span></span><span style="display:flex;"><span>n_samples <span style="color:#f92672">=</span> <span style="color:#ae81ff">100000</span>
</span></span><span style="display:flex;"><span>inverse_samples <span style="color:#f92672">=</span> exponential_inverse_transform(n_samples)
</span></span></code></pre></div><hr>
<h2 id="method-2-von-neumanns-ingenious-trick-rejection-sampling">Method 2: Von Neumann&rsquo;s Ingenious Trick (Rejection Sampling)</h2>
<p>Von Neumann proposed a clever alternative that avoids transcendental functions entirely. His procedure, which he noted &ldquo;resembles a well known game of chance Twenty-One, or Black Jack,&rdquo; generates sequences of uniform random numbers and accepts or rejects them based on simple comparison rules.</p>
<p>The algorithm works as follows to generate a single exponential sample $X$:</p>
<ol>
<li>
<p><strong>Initialize</strong>: Start with an integer offset <code>k = 0</code>, which will form the integer part of the final result.</p>
</li>
<li>
<p><strong>Generate a trial sequence</strong>:</p>
<ul>
<li>Generate uniform random numbers $Y_1, Y_2, Y_3, \ldots$ from (0, 1)</li>
<li>Find the smallest integer <code>n</code> such that the sequence is no longer strictly decreasing</li>
<li>That is, find <code>n</code> where $Y_1 &gt; Y_2 &gt; \cdots &gt; Y_n$ but $Y_n \leq Y_{n+1}$</li>
</ul>
</li>
<li>
<p><strong>Accept or reject</strong>:</p>
<ul>
<li>If <code>n</code> is <strong>odd</strong>: Accept the trial. Return $X = Y_1 + k$ and terminate.</li>
<li>If <code>n</code> is <strong>even</strong>: Reject the trial. Increment <code>k</code> by 1 and start a new trial.</li>
</ul>
</li>
</ol>
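<p>The accept/reject rule above can be traced by hand on pre-drawn numbers. The following is a minimal sketch, not the implementation used later in this post: the helper names <code>run_length</code> and <code>sample_from_trials</code> and the uniform values are made up for illustration of steps 2 and 3.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def run_length(ys):
    """Length n of the initial strictly decreasing run of ys."""
    n = 1
    while len(ys) > n and ys[n - 1] > ys[n]:
        n += 1
    return n

def sample_from_trials(trials):
    """Apply the accept/reject rule to pre-drawn trial sequences, in order.

    Each trial is a list of uniforms that includes the value breaking the run.
    Returns Y_1 + k for the first accepted trial, or None if all are rejected.
    """
    k = 0
    for ys in trials:
        if run_length(ys) % 2 == 1:  # odd n: accept this trial
            return ys[0] + k
        k += 1                       # even n: reject and shift by one
    return None

# Hypothetical uniform draws, chosen so the arithmetic is exact.
trials = [
    [0.8, 0.5, 0.6],    # n = 2 (even): reject, k becomes 1
    [0.5, 0.25, 0.75],  # n = 2 (even): reject, k becomes 2
    [0.25, 0.75],       # n = 1 (odd): accept
]
x = sample_from_trials(trials)
print(x)
</code></pre></div>
<p>Two rejections set <code>k = 2</code>, and the third trial is accepted with $Y_1 = 0.25$, so the trace yields $X = 2.25$.</p>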
<p>This process terminates with probability 1 and produces samples that follow the exponential distribution exactly. As von Neumann elegantly put it, the machine has &ldquo;in effect computed a logarithm by performing only discriminations on the relative magnitude of numbers.&rdquo;</p>
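<p>A short sketch of why the output is exactly exponential, using $n$ and $k$ as defined in the steps above (this is a standard argument, offered here as a reasoning aid, not a quotation from von Neumann&rsquo;s paper). Conditional on $Y_1 = x$, the run reaches length $j$ only if the next $j - 1$ draws all fall below $x$ and arrive in decreasing order:</p>
<p>$$P(n \ge j \mid Y_1 = x) = \frac{x^{j-1}}{(j-1)!}$$</p>
<p>Hence $P(n = j \mid Y_1 = x) = \frac{x^{j-1}}{(j-1)!} - \frac{x^{j}}{j!}$, and summing over odd $j$ gives an alternating series:</p>
<p>$$P(n \text{ odd} \mid Y_1 = x) = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!} + \cdots = e^{-x}$$</p>
<p>A trial is therefore accepted with probability $\int_0^1 e^{-x} \, dx = 1 - e^{-1}$, so the number of rejected trials $k$ is geometric with $P(k) = e^{-k}(1 - e^{-1})$, and an accepted $Y_1$ has density $e^{-x} / (1 - e^{-1})$ on (0, 1). Multiplying the two gives density $e^{-(k + x)}$ at $X = k + x$, which is exactly the target $f(x) = e^{-x}$.</p>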
<h3 id="python-implementation-1">Python Implementation</h3>
<p>This implementation requires more careful state management due to the nested trial structure:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">exponential_von_neumann</span>(n_samples<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Generate samples from an exponential distribution using von Neumann&#39;s
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    comparison-based rejection sampling method.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Args:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        n_samples (int): Number of samples to generate.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Returns:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">        tuple[np.ndarray, float]: Array of samples and average uniform draws per sample.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    samples <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>    total_uniform_draws <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(n_samples):
</span></span><span style="display:flex;"><span>        k <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>  <span style="color:#75715e"># Integer offset</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:  <span style="color:#75715e"># Trial loop</span>
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Generate decreasing sequence</span>
</span></span><span style="display:flex;"><span>            y_prev <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>rand()
</span></span><span style="display:flex;"><span>            total_uniform_draws <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>            y1 <span style="color:#f92672">=</span> y_prev  <span style="color:#75715e"># Store first value</span>
</span></span><span style="display:flex;"><span>            n <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Find length of decreasing sequence</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>                y_curr <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>rand()
</span></span><span style="display:flex;"><span>                total_uniform_draws <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> y_prev <span style="color:#f92672">&lt;=</span> y_curr:
</span></span><span style="display:flex;"><span>                    <span style="color:#66d9ef">break</span>  <span style="color:#75715e"># Sequence no longer decreasing</span>
</span></span><span style="display:flex;"><span>                y_prev <span style="color:#f92672">=</span> y_curr
</span></span><span style="display:flex;"><span>                n <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Accept if n is odd, reject if even</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> n <span style="color:#f92672">%</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">==</span> <span style="color:#ae81ff">1</span>:  <span style="color:#75715e"># Accept</span>
</span></span><span style="display:flex;"><span>                samples<span style="color:#f92672">.</span>append(y1 <span style="color:#f92672">+</span> k)
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">else</span>:  <span style="color:#75715e"># Reject</span>
</span></span><span style="display:flex;"><span>                k <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    avg_draws <span style="color:#f92672">=</span> total_uniform_draws <span style="color:#f92672">/</span> n_samples
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> np<span style="color:#f92672">.</span>array(samples), avg_draws
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Generate samples using von Neumann&#39;s method</span>
</span></span><span style="display:flex;"><span>von_neumann_samples, avg_draws <span style="color:#f92672">=</span> exponential_von_neumann(n_samples)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Von Neumann method used </span><span style="color:#e6db74">{</span>avg_draws<span style="color:#e6db74">:</span><span style="color:#e6db74">.2f</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> uniform draws per sample on average.&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Von Neumann method used 4.30 uniform draws per sample on average.
</span></span></code></pre></div><p>The algorithm requires approximately <strong>4.3</strong> uniform draws per exponential sample, matching the theoretical value $e^2/(e-1) = 4.30$.</p>
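<p>The theoretical figure can be checked in a few lines. The sketch below assumes the same procedure as above: a trial that stops at run length $n$ uses $n + 1$ uniforms, which averages to $e$ draws per trial, and a trial is accepted with probability $1 - e^{-1}$. The helper name <code>draws_for_one_sample</code> and the fixed seed are illustrative choices, not part of the original code.</p>

```python
import math
import random

def draws_for_one_sample(rng):
    """Count the uniform draws the comparison procedure uses for one sample."""
    draws = 0
    while True:  # one pass per trial
        y_prev = rng.random()
        draws += 1
        n = 1
        while True:
            y_curr = rng.random()
            draws += 1
            if y_curr >= y_prev:  # run is no longer strictly decreasing
                break
            y_prev = y_curr
            n += 1
        if n % 2 == 1:  # odd run length: the trial is accepted
            return draws

# Theory: e draws per trial, acceptance probability 1 - 1/e,
# so one sample costs e / (1 - 1/e) = e^2 / (e - 1) draws on average.
expected_draws = math.e ** 2 / (math.e - 1)

rng = random.Random(0)  # fixed seed so the check is repeatable
n_trials = 200_000
empirical_draws = sum(draws_for_one_sample(rng) for _ in range(n_trials)) / n_trials

print(f"theory:    {expected_draws:.3f}")
print(f"empirical: {empirical_draws:.3f}")
```

<p>The two printed values should agree to within sampling noise, which is the same comparison the benchmark output above makes.</p>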
<hr>
<h2 id="verification-and-comparison">Verification and Comparison</h2>
<p>The critical test: do both methods actually produce the same distribution? And how do their performance characteristics compare?</p>
<h3 id="visual-verification">Visual Verification</h3>
<p>Let&rsquo;s plot histograms of samples from both methods alongside the theoretical probability density function $f(x) = e^{-x}$:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> matplotlib.pyplot <span style="color:#66d9ef">as</span> plt
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> seaborn <span style="color:#66d9ef">as</span> sns
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Configure plot aesthetics</span>
</span></span><span style="display:flex;"><span>sns<span style="color:#f92672">.</span>set_style(<span style="color:#e6db74">&#34;whitegrid&#34;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>figure(figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">12</span>, <span style="color:#ae81ff">7</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Plot histograms for both methods</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>hist(inverse_samples, bins<span style="color:#f92672">=</span><span style="color:#ae81ff">50</span>, density<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>,
</span></span><span style="display:flex;"><span>         label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Inverse Transform&#39;</span>, color<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;skyblue&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>hist(von_neumann_samples, bins<span style="color:#f92672">=</span><span style="color:#ae81ff">50</span>, density<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>,
</span></span><span style="display:flex;"><span>         label<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Von Neumann&#39;s Method&#34;</span>, color<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;lightcoral&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Overlay theoretical PDF</span>
</span></span><span style="display:flex;"><span>x <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linspace(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">8</span>, <span style="color:#ae81ff">400</span>)
</span></span><span style="display:flex;"><span>pdf <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>exp(<span style="color:#f92672">-</span>x)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>plot(x, pdf, <span style="color:#e6db74">&#39;r-&#39;</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>, label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Theoretical PDF ($e^{-x}$)&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">&#39;Exponential Sampling Methods vs. Theoretical Distribution&#39;</span>, fontsize<span style="color:#f92672">=</span><span style="color:#ae81ff">16</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>xlabel(<span style="color:#e6db74">&#39;x&#39;</span>, fontsize<span style="color:#f92672">=</span><span style="color:#ae81ff">12</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74">&#39;Density&#39;</span>, fontsize<span style="color:#f92672">=</span><span style="color:#ae81ff">12</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>legend()
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>xlim(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">8</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>tight_layout()
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>show()
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/exponential_random_gens.webp"
         alt="Comparison of exponential sampling methods showing histograms from both inverse transform and von Neumann methods overlaid with the theoretical exponential distribution"
         title="Comparison of exponential sampling methods showing histograms from both inverse transform and von Neumann methods overlaid with the theoretical exponential distribution"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Both sampling methods reproduce the exponential distribution $f(x) = e^{-x}$</figcaption>
    
</figure>

<p>The empirical histograms from both methods track the theoretical curve $f(x) = e^{-x}$, a visual check that both algorithms sample the target distribution.</p>
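<p>A histogram overlay is a qualitative check. One way to put a number on it, sketched here with NumPy only, is the Kolmogorov-Smirnov distance between the empirical CDF and $F(x) = 1 - e^{-x}$. The helper <code>ks_distance_to_exponential</code> and the stand-in data are assumptions for illustration; any sample array from the two methods could be passed in instead.</p>

```python
import numpy as np

def ks_distance_to_exponential(samples):
    """Largest gap between the empirical CDF and F(x) = 1 - exp(-x)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    cdf = 1.0 - np.exp(-x)
    upper = np.arange(1, n + 1) / n - cdf   # empirical CDF just after each point
    lower = cdf - np.arange(0, n) / n       # empirical CDF just before each point
    return max(upper.max(), lower.max())

rng = np.random.default_rng(0)
# Stand-in data drawn by the inverse transform; swap in any sample array.
demo_samples = -np.log(1.0 - rng.random(100_000))
d_stat = ks_distance_to_exponential(demo_samples)

print(f"KS distance: {d_stat:.4f}")
print(f"mean: {demo_samples.mean():.3f}, variance: {demo_samples.var():.3f}")
```

<p>For a correct sampler the distance shrinks roughly like $1/\sqrt{n}$, and the sample mean and variance should both sit near 1.</p>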
<h3 id="performance-analysis">Performance Analysis</h3>
<p>Mathematical elegance often diverges from computational efficiency. Von Neumann himself observed that on the ENIAC, it was actually &ldquo;slightly quicker to use a truncated power series for log(1-T)&rdquo; than to perform all the comparisons his method required.</p>
<p>Let&rsquo;s benchmark both approaches in a modern Python environment:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Benchmark inverse transform method</span>
</span></span><span style="display:flex;"><span>start_time <span style="color:#f92672">=</span> time<span style="color:#f92672">.</span>perf_counter()
</span></span><span style="display:flex;"><span>_ <span style="color:#f92672">=</span> exponential_inverse_transform(n_samples)
</span></span><span style="display:flex;"><span>inverse_time <span style="color:#f92672">=</span> time<span style="color:#f92672">.</span>perf_counter() <span style="color:#f92672">-</span> start_time
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Benchmark von Neumann method</span>
</span></span><span style="display:flex;"><span>start_time <span style="color:#f92672">=</span> time<span style="color:#f92672">.</span>perf_counter()
</span></span><span style="display:flex;"><span>_ <span style="color:#f92672">=</span> exponential_von_neumann(n_samples)
</span></span><span style="display:flex;"><span>vn_time <span style="color:#f92672">=</span> time<span style="color:#f92672">.</span>perf_counter() <span style="color:#f92672">-</span> start_time
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Inverse Transform:  </span><span style="color:#e6db74">{</span>inverse_time<span style="color:#e6db74">:</span><span style="color:#e6db74">.4f</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> seconds&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Von Neumann Method: </span><span style="color:#e6db74">{</span>vn_time<span style="color:#e6db74">:</span><span style="color:#e6db74">.4f</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> seconds&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Speedup factor: </span><span style="color:#e6db74">{</span>vn_time <span style="color:#f92672">/</span> inverse_time<span style="color:#e6db74">:</span><span style="color:#e6db74">.1f</span><span style="color:#e6db74">}</span><span style="color:#e6db74">x&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Inverse Transform:  0.0018 seconds
</span></span><span style="display:flex;"><span>Von Neumann Method: 0.1860 seconds
</span></span><span style="display:flex;"><span>Speedup factor: 103.3x
</span></span></code></pre></div><p>The gap is large. The vectorized NumPy implementation of inverse transform sampling, leveraging a highly optimized C-backed logarithm function, outperforms the Python-looped von Neumann implementation by more than two orders of magnitude. While a vectorized or JIT-compiled version of von Neumann&rsquo;s method would close this gap by removing Python interpreter overhead, the inverse transform remains the practical winner on modern hardware with fast floating-point units. This confirms von Neumann&rsquo;s prescient observation: the &ldquo;theoretically elegant&rdquo; method avoiding transcendental functions often yields to direct computation.</p>
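<p>For reference, a batch version of the comparison-only sampler might look like the following. This is an illustrative sketch, not code from the original benchmark: the function name <code>exponential_von_neumann_vectorized</code> is hypothetical, and the design draws a uniform for every row on each pass (using it only where a run is still going), so it spends more uniforms per sample than the loop version. No timing claim is made for it here.</p>

```python
import numpy as np

def exponential_von_neumann_vectorized(n_samples, rng):
    """Batch version of the comparison-only sampler (illustrative sketch)."""
    out = np.empty(n_samples)
    k = np.zeros(n_samples)             # integer offset per sample
    pending = np.arange(n_samples)      # samples still waiting for an accepted trial

    while pending.size > 0:
        m = pending.size
        y1 = rng.random(m)              # first value of each trial
        y_prev = y1.copy()
        n_is_odd = np.ones(m, dtype=bool)   # run length n starts at 1
        running = np.ones(m, dtype=bool)    # runs that are still decreasing

        while running.any():
            y_curr = rng.random(m)      # drawn for all rows, used only where running
            extends = np.logical_and(running, y_prev > y_curr)
            y_prev = np.where(extends, y_curr, y_prev)
            n_is_odd = np.where(extends, ~n_is_odd, n_is_odd)  # n += 1 flips parity
            running = extends

        done = pending[n_is_odd]        # odd run length: accept
        out[done] = y1[n_is_odd] + k[done]
        retry = pending[~n_is_odd]      # even run length: reject and bump the offset
        k[retry] += 1
        pending = retry

    return out

rng = np.random.default_rng(0)
vec_samples = exponential_von_neumann_vectorized(200_000, rng)
print(f"mean: {vec_samples.mean():.3f}, variance: {vec_samples.var():.3f}")
```

<p>Each pass of the outer loop runs one trial for every pending sample at once, so the Python-level loop count depends on the longest run and the number of retries instead of on the sample count.</p>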
<h2 id="conclusion">Conclusion</h2>
<p>This exploration offers a window into the ingenuity of early computational mathematics. Von Neumann&rsquo;s comparison-based algorithm demonstrates remarkable mathematical creativity, showing how to &ldquo;compute a logarithm&rdquo; using only basic machine operations. Our implementation reproduces the algorithm, producing samples whose histogram and moments match the exponential distribution.</p>
<p>The performance comparison is consistent with von Neumann&rsquo;s own pragmatic assessment. His rejection sampling method is intellectually elegant and historically significant, but the direct logarithmic approach wins in practice: slightly quicker on the ENIAC by his own account, and roughly 100 times faster in the Python benchmark above. It serves as a timeless reminder in scientific computing: theoretical beauty often diverges from computational practicality.</p>
<p>The enduring value of von Neumann&rsquo;s work lies in the fundamental insight that creative mathematical thinking can circumvent apparent computational limitations. Understanding alternative methods deepens our appreciation for the rich landscape of algorithmic possibilities, even when the direct approach proves superior.</p>
]]></content:encoded></item><item><title>Modernizing Rahman's 1964 Argon Simulation</title><link>https://hunterheidenreich.com/posts/rahman-1964-lammps-liquid-argon/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/rahman-1964-lammps-liquid-argon/</guid><description>How I used modern software engineering (caching, vectorization, and dependency locking) to reproduce a 60-year-old physics milestone.</description><content:encoded><![CDATA[<p>Some papers invent entire fields. Aneesur Rahman&rsquo;s 1964 paper, <strong>&ldquo;Correlations in the Motion of Atoms in Liquid Argon&rdquo;</strong>, is the &ldquo;Hello World&rdquo; of molecular dynamics (MD). Using a computer with less memory than a modern microwave, Rahman solved Newton&rsquo;s equations for 864 atoms and proved that liquids have distinct, quantifiable structure.</p>
<p>The physics of liquid argon is a solved problem. We know the answer.</p>
<p>So, why replicate it in 2025? <strong>To apply modern engineering standards to legacy science.</strong></p>
<p>This project served as an exercise in <strong>software archaeology</strong>: taking a vintage scientific workflow and rebuilding it with a modular Python analysis pipeline. I wanted to see if I could replace Rahman&rsquo;s &ldquo;write-once&rdquo; Fortran mentality with modern reproducibility, type safety, and intelligent caching.</p>
<p>The full source code is available on <a href="https://github.com/hunter-heidenreich/argon-simulation">GitHub</a>. The complete project overview, including analysis results and pipeline architecture, is on the <a href="/projects/rahman-1964-replication/">Rahman 1964 Replication project page</a>.</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/KjFixUt6bnQ?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<hr>
<h2 id="engineering-the-pipeline">Engineering the Pipeline</h2>
<p>The most interesting part of this project isn&rsquo;t the simulation engine (LAMMPS handles that); it&rsquo;s the architecture of the analysis suite. MD analysis is computationally expensive ($O(N^2)$), and iterating on plots can be painfully slow if you re-compute trajectory data every time.</p>
<p>Why bother? Don&rsquo;t modern MD packages come with analysis tools?
Well, some say that writing is thinking.
Getting into the weeds of how an algorithm works or how an analysis is performed yields insights and a deeper understanding that a plug-and-play tool might obscure.</p>
<h3 id="intelligent-caching">Intelligent Caching</h3>
<p>I built the <code>argon_sim</code> package with a decorator-based caching layer. The system hashes the source file&rsquo;s modification time and the function&rsquo;s arguments to avoid re-calculating the Radial Distribution Function (RDF) or Van Hove correlations on every script run.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#a6e22e">@cached_computation</span>(<span style="color:#e6db74">&#34;gr&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">compute_radial_distribution</span>(filename: str, dr: float <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.05</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># ... expensive O(N^2) distance calculations ...</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> r_values, g_r, density
</span></span></code></pre></div><p>If I tweak a plot axis, the script runs instantly, loading pre-computed arrays from disk instead of re-running the $O(N^2)$ computation. If I change the simulation trajectory, the cache invalidates automatically.</p>
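<p>The internals of <code>cached_computation</code> are not shown in this post, so the following is only a minimal sketch of the idea under stated assumptions: the name <code>cached_computation_sketch</code>, the pickle storage format, and the cache directory are all hypothetical. The cache key combines the input file&rsquo;s modification time with the call arguments, so changing either one produces a cache miss.</p>

```python
import functools
import hashlib
import os
import pickle
import tempfile

def cached_computation_sketch(name, cache_dir=None):
    """Cache a function's result on disk, keyed by input file mtime and arguments."""
    if cache_dir is None:
        cache_dir = os.path.join(tempfile.gettempdir(), "analysis_cache_sketch")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(filename, *args, **kwargs):
            # A new modification time or different arguments yields a new key,
            # which is what invalidates stale entries.
            mtime_ns = os.stat(filename).st_mtime_ns
            key_source = repr((name, os.path.abspath(filename), mtime_ns,
                               args, sorted(kwargs.items())))
            digest = hashlib.sha256(key_source.encode("utf-8")).hexdigest()[:16]
            cache_path = os.path.join(cache_dir, f"{name}_{digest}.pkl")

            if os.path.exists(cache_path):
                with open(cache_path, "rb") as handle:
                    return pickle.load(handle)   # cache hit: skip the computation

            result = func(filename, *args, **kwargs)
            os.makedirs(cache_dir, exist_ok=True)
            with open(cache_path, "wb") as handle:
                pickle.dump(result, handle)
            return result
        return wrapper
    return decorator
```

<p>Keying on modification time is cheap compared with hashing a large trajectory file, at the cost of a spurious recompute if the file is rewritten with identical contents.</p>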
<h3 id="vectorization--memory-management">Vectorization &amp; Memory Management</h3>
<p>Rahman likely relied on nested loops. Pure-Python loops are too slow for that, so I used <strong>NumPy broadcasting</strong> to vectorize the calculation of atomic displacements.</p>
<p>However, holding an $864 \times 864$ distance matrix for each of 5,000 frames consumes significant RAM (about 30 GB if stored as 64-bit floats). I implemented a <strong>chunked MSD (Mean Square Displacement) algorithm</strong> that processes the trajectory in blocks, trading some vectorization speed for a bounded memory footprint.</p>
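<p>The chunking idea can be sketched as follows. This is not the implementation from <code>argon_sim</code>; the function name <code>msd_chunked</code>, the <code>chunk_size</code> default, and the choice to chunk over time origins are assumptions made for illustration. Within each block the displacement calculation stays fully vectorized, but the temporary array never grows past <code>chunk_size * n_atoms * 3</code> floats.</p>

```python
import numpy as np

def msd_chunked(positions, max_lag, chunk_size=256):
    """Mean square displacement for lags 0..max_lag, averaged over time origins.

    positions: array of shape (n_frames, n_atoms, 3) holding unwrapped coordinates.
    """
    n_frames, n_atoms, _ = positions.shape
    if max_lag >= n_frames:
        raise ValueError("max_lag must be smaller than the number of frames")

    msd = np.zeros(max_lag + 1)
    for lag in range(1, max_lag + 1):
        n_origins = n_frames - lag
        total = 0.0
        # Difference only chunk_size time origins at once to bound memory use.
        for start in range(0, n_origins, chunk_size):
            stop = min(start + chunk_size, n_origins)
            disp = positions[start + lag:stop + lag] - positions[start:stop]
            total += float(np.sum(disp * disp))
        msd[lag] = total / (n_origins * n_atoms)
    return msd
```

<p>A quick sanity check is a ballistic trajectory, where every atom moves at constant velocity and the MSD must grow as the square of the lag regardless of the chunk size.</p>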
<h3 id="reproducibility-as-a-feature">Reproducibility as a Feature</h3>
<p>Academic code is notorious for &ldquo;it works on my machine.&rdquo; To combat this, I used <strong><code>uv</code></strong> for dependency management, locking the exact environment state. The entire workflow (from simulation to final figure generation) is abstracted into a <code>Makefile</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># One command to run the physics, analyze data, and generate plots</span>
</span></span><span style="display:flex;"><span>make workflow
</span></span></code></pre></div><hr>
<h2 id="the-simulation-1964-vs-2025">The Simulation: 1964 vs. 2025</h2>
<p>I preserved Rahman&rsquo;s physical parameters exactly to ensure a fair comparison:</p>
<ul>
<li><strong>System</strong>: 864 Argon atoms</li>
<li><strong>Potential</strong>: Lennard-Jones ($\sigma = 3.4$ Å, $\epsilon/k_B = 120$ K)</li>
<li><strong>Target</strong>: 94.4 K, 1.374 g/cm³</li>
</ul>
<p>However, I modernized the <em>numerical</em> methods to ensure stability:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Feature</th>
          <th style="text-align: left">Rahman (1964)</th>
          <th style="text-align: left">This Work (2025)</th>
          <th style="text-align: left">Why it Matters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Integration</strong></td>
          <td style="text-align: left">Predictor-Corrector</td>
          <td style="text-align: left">Velocity Verlet</td>
          <td style="text-align: left">Better energy conservation over long runs</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Timestep</strong></td>
          <td style="text-align: left">10 fs</td>
          <td style="text-align: left">2 fs</td>
          <td style="text-align: left">Rahman&rsquo;s step was aggressive; 2 fs ensures numerical stability</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Equilibration</strong></td>
          <td style="text-align: left">Velocity Scaling</td>
          <td style="text-align: left">1 ns NVT</td>
          <td style="text-align: left">Rahman couldn&rsquo;t afford long equilibrations; I melted the crystal properly to remove bias</td>
      </tr>
  </tbody>
</table>
<p>The production run lasted 10 ps in the NVE ensemble, generating 5,001 frames. The mean temperature stayed within 1% of target, with a relative RMS fluctuation of 0.0165.</p>
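<p>The physical parameters above translate into reduced Lennard-Jones units with a little arithmetic. The sketch below uses the values quoted in this post plus two assumptions that are not stated in it: a molar mass of 39.948 g/mol for argon and the CODATA values of the Avogadro and Boltzmann constants. The variable names are illustrative.</p>

```python
import math

# Values quoted in the post
N_ATOMS = 864
SIGMA_ANGSTROM = 3.4
EPS_OVER_KB_KELVIN = 120.0
TEMPERATURE_KELVIN = 94.4
DENSITY_G_PER_CM3 = 1.374
TIMESTEP_FS = 2.0
N_STEPS = 5000                      # 5,001 frames span 5,000 intervals

# Assumed constants (not stated in the post)
ARGON_MOLAR_MASS_G = 39.948
AVOGADRO = 6.02214076e23
BOLTZMANN_J_PER_K = 1.380649e-23

# Cubic box edge implied by the atom count and mass density
mass_g = ARGON_MOLAR_MASS_G / AVOGADRO
volume_cm3 = N_ATOMS * mass_g / DENSITY_G_PER_CM3
box_angstrom = volume_cm3 ** (1.0 / 3.0) * 1e8     # 1 cm = 1e8 Angstrom

# Reduced temperature and number density
reduced_temperature = TEMPERATURE_KELVIN / EPS_OVER_KB_KELVIN
reduced_density = N_ATOMS * SIGMA_ANGSTROM ** 3 / box_angstrom ** 3

# Lennard-Jones time unit: tau = sigma * sqrt(m / epsilon), in SI then ps
epsilon_j = EPS_OVER_KB_KELVIN * BOLTZMANN_J_PER_K
tau_ps = (SIGMA_ANGSTROM * 1e-10) * math.sqrt(mass_g * 1e-3 / epsilon_j) * 1e12

run_length_ps = N_STEPS * TIMESTEP_FS * 1e-3

print(f"box edge: {box_angstrom:.2f} A ({box_angstrom / SIGMA_ANGSTROM:.2f} sigma)")
print(f"T* = {reduced_temperature:.3f}, rho* = {reduced_density:.3f}")
print(f"tau = {tau_ps:.2f} ps, run length = {run_length_ps:.1f} ps")
```

<p>The run length line doubles as a consistency check: 5,000 steps of 2 fs account for the 10 ps production run.</p>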
<hr>
<h2 id="validation-results">Validation Results</h2>
<p>The replication was quantitatively successful. The analysis pipeline faithfully reproduced every key signature of liquid argon.</p>
<h3 id="the-cage-effect">The Cage Effect</h3>
<p>This is the paper&rsquo;s crown jewel. In a gas, velocity correlations decay exponentially. In a liquid, Rahman discovered that atoms get trapped by their neighbors and bounce back, causing the correlation to go <em>negative</em>.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-velocity-autocorrelation.webp"
         alt="Velocity Autocorrelation Function"
         title="Velocity Autocorrelation Function"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The VACF dips below zero at 0.3 ps. This &rsquo;negative correlation&rsquo; is the signature of the cage effect: atoms rattling against their neighbors.</figcaption>
    
</figure>

<p>My simulation captures this minimum at -0.083, matching Rahman&rsquo;s observation. The Fourier transform of this data (the frequency spectrum) reveals a peak at $\beta \approx 0.25$, physically representing the frequency of atomic collisions within the cage.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-vacf-frequency-spectrum.webp"
         alt="Frequency spectrum of the VACF showing characteristic peak from atomic caging effects"
         title="Frequency spectrum of the VACF showing characteristic peak from atomic caging effects"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Frequency spectrum of the VACF showing characteristic peak from atomic caging effects</figcaption>
    
</figure>
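<p>The velocity autocorrelation function discussed above can be estimated directly from stored velocities. The following is a minimal sketch, not the project&rsquo;s implementation: the function name <code>normalized_vacf</code> and the array layout <code>(n_frames, n_atoms, 3)</code> are assumptions. It averages $\mathbf{v}(t_0) \cdot \mathbf{v}(t_0 + \tau)$ over all time origins and atoms, then normalizes by the zero-lag value.</p>

```python
import numpy as np

def normalized_vacf(velocities, max_lag):
    """Velocity autocorrelation, normalized so that the lag-0 value is 1.

    velocities: array of shape (n_frames, n_atoms, 3).
    """
    n_frames = velocities.shape[0]
    if max_lag >= n_frames:
        raise ValueError("max_lag must be smaller than the number of frames")

    vacf = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        # Dot products v(t0) . v(t0 + lag), averaged over origins and atoms
        dots = np.sum(velocities[: n_frames - lag] * velocities[lag:], axis=-1)
        vacf[lag] = dots.mean()
    return vacf / vacf[0]
```

<p>A negative value at some lag means that, on average, atoms are moving against their earlier direction at that delay, which is the cage-effect signature described above.</p>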

<h3 id="structural-fingerprints">Structural Fingerprints</h3>
<p>The Radial Distribution Function $g(r)$ and its Fourier transform, the Structure Factor $S(k)$, are the &ldquo;fingerprints&rdquo; of a liquid&rsquo;s structure.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-radial-distribution-function.webp"
         alt="Radial Distribution Function and Structure Factor"
         title="Radial Distribution Function and Structure Factor"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The sharp first peak (3.82 Å) shows defined nearest neighbors, while the decay shows the lack of long-range order. My calculated peaks match Rahman&rsquo;s to within about 3%.</figcaption>
    
</figure>

<p>The agreement here is close. My first peak appeared at <strong>3.82 Å</strong> (Rahman: 3.7 Å), a difference of about 3%. The slight discrepancy may stem from my improved equilibration method, which allowed the system to relax into a more natural liquid state than Rahman&rsquo;s 1960s hardware allowed.</p>
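<p>For a single frame, $g(r)$ can be computed with broadcasting in a few lines. This is a sketch under stated assumptions, not the <code>compute_radial_distribution</code> function from the project: the name <code>radial_distribution_sketch</code> is hypothetical, the box is taken to be cubic and periodic, and distances are only binned out to half the box length.</p>

```python
import numpy as np

def radial_distribution_sketch(positions, box_length, dr=0.05):
    """g(r) for one frame in a cubic periodic box, out to half the box length."""
    n_atoms = positions.shape[0]

    # All pair separations via broadcasting, folded by the minimum image convention
    delta = positions[:, None, :] - positions[None, :, :]
    delta -= box_length * np.round(delta / box_length)
    dist = np.sqrt(np.sum(delta ** 2, axis=-1))
    pair_dist = dist[np.triu_indices(n_atoms, k=1)]    # count each pair once

    n_bins = int(round((box_length / 2) / dr))
    edges = np.linspace(0.0, n_bins * dr, n_bins + 1)
    counts, _ = np.histogram(pair_dist, bins=edges)

    # Normalize by the pair count an ideal gas of the same density would give
    density = n_atoms / box_length ** 3
    shell_volume = (4.0 / 3.0) * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    ideal_pairs = 0.5 * n_atoms * density * shell_volume
    r_centers = 0.5 * (edges[1:] + edges[:-1])
    return r_centers, counts / ideal_pairs
```

<p>Uniformly random positions should give $g(r) \approx 1$ everywhere, which makes a convenient test before pointing the function at real trajectory frames. In practice the result would be averaged over many frames.</p>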
<h3 id="diffusion-and-non-gaussian-behavior">Diffusion and Non-Gaussian Behavior</h3>
<p>By calculating the Mean Square Displacement (MSD), I derived a diffusion coefficient of <strong>$D = 2.47 \times 10^{-5}$ cm²/s</strong>, which deviates only <strong>2%</strong> from Rahman&rsquo;s reported $2.43 \times 10^{-5}$.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-mean-square-displacement.webp"
         alt="Mean Square Displacement vs time showing ballistic to diffusive transition"
         title="Mean Square Displacement vs time showing ballistic to diffusive transition"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Mean Square Displacement vs. time showing ballistic to diffusive transition</figcaption>
    
</figure>
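<p>The diffusion coefficient follows from the long-time slope of the MSD through the Einstein relation, $\text{MSD}(t) \to 6Dt + \text{const}$ in three dimensions. The sketch below assumes time in ps and MSD in Å²; the function name <code>diffusion_from_msd</code> and the <code>fit_start_ps</code> cutoff are illustrative, since the fit window used for the reported value is not stated here.</p>

```python
import numpy as np

ANGSTROM2_PER_PS_TO_CM2_PER_S = 1e-4    # 1 A^2/ps = 1e-16 cm^2 / 1e-12 s

def diffusion_from_msd(time_ps, msd_angstrom2, fit_start_ps):
    """Diffusion coefficient in cm^2/s from a linear fit to the late-time MSD."""
    mask = time_ps >= fit_start_ps       # skip the early ballistic regime
    slope, _intercept = np.polyfit(time_ps[mask], msd_angstrom2[mask], 1)
    return slope / 6.0 * ANGSTROM2_PER_PS_TO_CM2_PER_S

# Synthetic straight line (not simulation data) built from a known D
t = np.linspace(0.0, 5.0, 501)
synthetic_msd = 6.0 * 0.247 * t + 0.3
d_fit = diffusion_from_msd(t, synthetic_msd, fit_start_ps=1.0)
print(f"D = {d_fit:.3e} cm^2/s")
```

<p>In these units a diffusion coefficient of $2.47 \times 10^{-5}$ cm²/s corresponds to an MSD slope of about 1.48 Å²/ps.</p>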

<p>More interestingly, I reproduced the &ldquo;Non-Gaussian&rdquo; parameters. Standard diffusion assumes a Gaussian distribution of displacements. Rahman found (and I confirmed) that liquid atoms deviate from this. They exhibit &ldquo;jump&rdquo; and &ldquo;wait&rdquo; dynamics, a behavior that standard Brownian motion models fail to capture.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-non-gaussian-parameters.webp"
         alt="Non-Gaussian parameters showing deviation from simple diffusive behavior"
         title="Non-Gaussian parameters showing deviation from simple diffusive behavior"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Evidence that atoms do not follow a simple random walk. The non-zero alpha parameters indicate heterogeneous dynamics.</figcaption>
    
</figure>
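<p>The non-Gaussian parameters compare higher moments of the displacement distribution with what a Gaussian would give. A minimal sketch, assuming the standard definition $\alpha_n = \langle r^{2n} \rangle / (C_n \langle r^2 \rangle^n) - 1$ with $C_n = 1 \cdot 3 \cdot 5 \cdots (2n+1) / 3^n$, so that $\alpha_2 = 3\langle r^4 \rangle / (5 \langle r^2 \rangle^2) - 1$. The function name and input layout are assumptions for illustration.</p>

```python
import numpy as np

def non_gaussian_parameter(displacements, n=2):
    """alpha_n for a set of 3D displacements at one time lag.

    displacements: array of shape (n_samples, 3).
    Returns 0 (up to sampling noise) when the displacements are Gaussian.
    """
    r2 = np.sum(displacements ** 2, axis=1)
    # C_n = 3 * 5 * ... * (2n + 1) / 3^n, the Gaussian moment ratio in 3D
    c_n = np.prod(np.arange(3, 2 * n + 2, 2)) / 3.0 ** n
    return np.mean(r2 ** n) / (c_n * np.mean(r2) ** n) - 1.0

rng = np.random.default_rng(0)
gaussian_steps = rng.normal(size=(200_000, 3))
# A mix of short and long steps, a crude stand-in for "wait" and "jump" events
widths = np.where(rng.random(200_000) > 0.5, 3.0, 1.0)
mixed_steps = rng.normal(size=(200_000, 3)) * widths[:, None]

print(f"alpha_2, Gaussian steps: {non_gaussian_parameter(gaussian_steps):.3f}")
print(f"alpha_2, mixed steps:    {non_gaussian_parameter(mixed_steps):.3f}")
```

<p>The mixed population is still isotropic, yet its heavier tails push $\alpha_2$ well above zero, which is the kind of deviation the figure above reports for liquid argon.</p>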

<h3 id="advanced-analysis-van-hove-functions">Advanced Analysis: Van Hove Functions</h3>
<p>Rahman also explored advanced properties like the Van Hove correlation function $G(r,t)$, which describes how liquid structure evolves over time.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-van-hove-correlation.webp"
         alt="Van Hove distinct correlation function G_d(r,t) at two time points"
         title="Van Hove distinct correlation function G_d(r,t) at two time points"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Van Hove distinct correlation function showing how neighbor coordination shells &lsquo;melt&rsquo; as time progresses</figcaption>
    
</figure>

<p>At 1.0 ps, the structure remains well-defined with clear shells. By 2.5 ps, it becomes increasingly diffuse. Rahman compared this evolution to theoretical predictions (the Vineyard approximation) and found that theory predicted overly rapid structural decay. My results confirm this finding.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-delayed-convolution.webp"
         alt="Delayed convolution approximation testing Rahman&#39;s theoretical improvement"
         title="Delayed convolution approximation testing Rahman&#39;s theoretical improvement"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Testing Rahman&rsquo;s &lsquo;delayed convolution approximation&rsquo; (his proposed improvement over existing theory)</figcaption>
    
</figure>

<hr>
<h2 id="system-validation">System Validation</h2>
<p>Before analyzing the physics, I ran basic sanity checks to confirm that the system had reached thermal equilibrium.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-temperature-stability.webp"
         alt="Temperature vs time plot showing excellent temperature control around 94.4 K target"
         title="Temperature vs time plot showing excellent temperature control around 94.4 K target"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Temperature vs. Time - 5001 frames showing excellent temperature control with mean 94.73 K</figcaption>
    
</figure>

<p>Mean temperature was 94.73 K (0.33 K off target) with a standard deviation of 1.56 K.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-maxwell-boltzmann-velocity.webp"
         alt="Maxwell-Boltzmann velocity distribution"
         title="Maxwell-Boltzmann velocity distribution"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Maxwell-Boltzmann velocity distribution from 12.9 million velocity components</figcaption>
    
</figure>

<p>The velocity distribution from 12.9 million velocity components produces a clean Maxwell-Boltzmann distribution, as expected for thermal equilibrium. The distribution widths at various heights closely match Rahman&rsquo;s results: 1.77, 2.48, and 3.56 compared to his 1.77, 2.52, and 3.52.</p>
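<p>One quick consistency check on these widths: for a Gaussian velocity component, the width at relative height $h$ scales as $\sqrt{-2 \ln h}$, so the ratios between widths do not depend on temperature or units. The sketch below assumes the three heights are $e^{-1/2}$, $e^{-1}$, and $e^{-2}$; that choice is my assumption here, since the text above only says &ldquo;various heights.&rdquo;</p>

```python
import math

# Assumed relative heights at which the widths were measured
heights = [math.exp(-0.5), math.exp(-1.0), math.exp(-2.0)]

# For f(v)/f(0) = exp(-v^2 / (2 sigma^2)), the half-width at height h is
# sigma * sqrt(-2 ln h), so width ratios are independent of sigma.
half_widths = [math.sqrt(-2.0 * math.log(h)) for h in heights]
theory_ratios = [w / half_widths[0] for w in half_widths]  # 1, sqrt(2), 2

this_work = [1.77, 2.48, 3.56]
rahman = [1.77, 2.52, 3.52]
this_work_ratios = [w / this_work[0] for w in this_work]
rahman_ratios = [w / rahman[0] for w in rahman]
```

<p>Under that assumption, both sets of widths follow the Gaussian ratios $1 : \sqrt{2} : 2$ to within about 1%.</p>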
<hr>
<h2 id="conclusion">Conclusion</h2>
<p>Replicating a 60-year-old paper might seem like a solved puzzle, but it teaches a valuable lesson in computational science. Rahman relied on brilliance and raw mathematical intuition because he lacked compute power. Today, pairing modern compute with disciplined software practices makes the same result reproducible and auditable.</p>
<p>Applying modern software engineering (<strong>modular architecture, caching, and automated workflows</strong>) to classical physics reproduces the past and builds a foundation that makes the <em>next</em> discovery easier, faster, and more reliable.</p>
<p>The quantitative agreement is striking: diffusion coefficients within 2%, structural peaks within 0.1 Å, and velocity distribution widths within 2% of Rahman&rsquo;s values. This level of reproducibility, achieved with completely different hardware and software, validates something fundamental: Rahman&rsquo;s physical model was remarkably sound, and his computational methodology was scientifically rigorous despite 1960s constraints.</p>
<p>The cage effect, velocity correlations, and structural evolution are fundamental characteristics of how matter behaves at the atomic scale, as relevant today as they were six decades ago.</p>
]]></content:encoded></item><item><title>Modernizing Rahman&#39;s 1964 Argon Simulation</title><link>https://hunterheidenreich.com/projects/rahman-1964-replication/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/rahman-1964-replication/</guid><description>A high-fidelity replication of foundational molecular dynamics using modern software engineering practices: caching, vectorization, and strict reproducibility.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This project is a &ldquo;digital restoration&rdquo; of Aneesur Rahman&rsquo;s seminal 1964 paper, <em>Correlations in the Motion of Atoms in Liquid Argon</em>. While the physics of liquid argon is a solved problem, the challenge lies in bridging the gap between 1960s mainframe constraints and 2025 software architecture.</p>
<p>I replicated the simulation using <strong>LAMMPS</strong> and built a <strong>Python analysis pipeline</strong> to process the trajectory data. The project demonstrates how modern tooling (<code>uv</code>, type hinting, vectorized NumPy) can transform academic &ldquo;write-once&rdquo; scripts into a reproducible research toolkit.</p>
<h2 id="features">Features</h2>
<h3 id="the-analysis-pipeline">The Analysis Pipeline</h3>
<p>I architected a modular Python package (<code>argon_sim</code>) designed for performance and maintainability.</p>
<ul>
<li><strong>Intelligent Caching System</strong>: MD analysis is compute-intensive ($O(N^2)$). I implemented a decorator-based caching layer (<code>@cached_computation</code>) that hashes source file modification times and function arguments. This ensures expensive calculations (like RDF or Van Hove correlations) are only re-run when the underlying trajectory or parameters actually change.</li>
<li><strong>Vectorization &amp; Optimization</strong>: To handle the $N^2$ complexity of pair-wise interactions without C++ extensions, I utilized NumPy broadcasting. For example, the Mean Square Displacement (MSD) calculation is fully vectorized, with a fallback &ldquo;chunked&rdquo; implementation to handle memory overflows on smaller machines.</li>
<li><strong>Modern Python Tooling</strong>:
<ul>
<li><strong>Dependency Management</strong>: Used <code>uv</code> for deterministic environment locking (sub-second resolution).</li>
<li><strong>Type Safety</strong>: Fully type-hinted codebase for static analysis compliance.</li>
<li><strong>Automation</strong>: A <code>Makefile</code> abstracts the workflow (simulation → analysis → figure generation) into single commands (e.g., <code>make figure-5</code>).</li>
</ul>
</li>
</ul>
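<p>The caching idea described above can be sketched in a few lines. This is a simplified stand-in, not the actual <code>argon_sim</code> implementation: the cache location, the pickle storage, and the convention that the first argument is the source file are all assumptions made for illustration.</p>

```python
import functools
import hashlib
import os
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location; a real project would use a persistent directory
CACHE_DIR = Path(tempfile.gettempdir()) / "argon_sim_cache_sketch"


def cached_computation(func):
    """Cache results on disk, keyed by the arguments and the source file's mtime."""

    @functools.wraps(func)
    def wrapper(source_path, *args, **kwargs):
        # A changed trajectory file gets a new mtime, which invalidates the key
        mtime = os.path.getmtime(source_path)
        key_material = repr(
            (func.__name__, str(source_path), mtime, args, sorted(kwargs.items()))
        )
        key = hashlib.sha256(key_material.encode()).hexdigest()
        cache_file = CACHE_DIR / f"{key}.pkl"
        if cache_file.exists():
            return pickle.loads(cache_file.read_bytes())
        result = func(source_path, *args, **kwargs)
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        cache_file.write_bytes(pickle.dumps(result))
        return result

    return wrapper


calls = []  # records real executions so cache hits are observable


@cached_computation
def count_lines(source_path, skip=0):
    """Toy stand-in for an expensive analysis such as an RDF calculation."""
    calls.append(source_path)
    with open(source_path) as fh:
        return len(fh.readlines()) - skip
```

<p>Calling <code>count_lines</code> twice on an unchanged file runs the body once; changing an argument or touching the file produces a new key and a fresh computation.</p>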
<h3 id="the-simulation-strategy">The Simulation Strategy</h3>
<p>I used LAMMPS for the MD engine but strictly adhered to Rahman&rsquo;s physical parameters while modernizing the stability mechanisms.</p>
<ul>
<li><strong>Integration</strong>: Replaced Rahman&rsquo;s predictor-corrector method with the modern standard <strong>Velocity Verlet</strong> algorithm (2 fs timestep).</li>
<li><strong>Equilibration</strong>: I implemented a 1 ns <strong>NVT equilibration</strong> phase (500,000 steps at the 2 fs timestep) to properly melt the FCC crystal structure before the NVE production run.</li>
<li><strong>Intellectual Honesty</strong>: The <code>in.argon</code> script explicitly documents every deviation from the original methodology (e.g., energy minimization) and justifies each one in terms of numerical stability.</li>
</ul>
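<p>For readers unfamiliar with the integrator, here is a minimal sketch of a Velocity Verlet update. LAMMPS handles this internally; the function below and the harmonic-oscillator test system are assumptions chosen only to show the half-step structure and the resulting energy stability.</p>

```python
import numpy as np


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance positions and velocities with the Velocity Verlet scheme."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass  # half-step velocity update
        x = x + dt * v_half               # full-step position update
        f = force(x)                      # forces at the new positions
        v = v_half + 0.5 * dt * f / mass  # complete the velocity update
    return x, v


# Toy system (illustrative only): 1D harmonic oscillator with k = m = 1
K_SPRING, MASS = 1.0, 1.0


def harmonic_force(x):
    return -K_SPRING * x


def total_energy(x, v):
    return 0.5 * MASS * v**2 + 0.5 * K_SPRING * x**2


x0, v0 = np.array([1.0]), np.array([0.0])
x1, v1 = velocity_verlet(x0, v0, harmonic_force, MASS, dt=0.01, n_steps=10_000)
drift = float(abs(total_energy(x1, v1) - total_energy(x0, v0))[0])
```

<p>The energy error stays bounded over many periods instead of growing steadily, which is the property that makes this scheme a sensible default for an NVE production run.</p>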
<h2 id="usage">Usage</h2>
<p>The project uses a <code>Makefile</code> to automate the workflow. Run <code>make all</code> to execute the LAMMPS simulation and generate all analysis figures.</p>
<h2 id="results">Results</h2>
<p>The replication achieved high quantitative agreement with the historical data, validating both the simulation parameters and the custom analysis code.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Property</th>
          <th style="text-align: left">Rahman (1964)</th>
          <th style="text-align: left">This Work</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Diffusion Coefficient ($D$)</td>
          <td style="text-align: left">$2.43 \times 10^{-5}$ cm²/s</td>
          <td style="text-align: left">$2.47 \times 10^{-5}$ cm²/s</td>
          <td style="text-align: left">Agreement within 2%</td>
      </tr>
      <tr>
          <td style="text-align: left">RDF First Peak</td>
          <td style="text-align: left">$3.7$ Å</td>
          <td style="text-align: left">$3.82$ Å</td>
          <td style="text-align: left">Shift of about 0.1 Å</td>
      </tr>
      <tr>
          <td style="text-align: left">Velocity Dist. Width ($e^{-1/2}$)</td>
          <td style="text-align: left">$1.77$</td>
          <td style="text-align: left">$1.77$</td>
          <td style="text-align: left">Exact match to theoretical Maxwell-Boltzmann</td>
      </tr>
  </tbody>
</table>
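<p>A diffusion coefficient like the one in the table is commonly obtained from the long-time slope of the mean square displacement via the Einstein relation, $D = \frac{1}{6} \frac{d}{dt}\langle r^2(t) \rangle$. The sketch below is a simplified version that vectorizes over atoms and time origins and loops over lags; the function names, fit window, and synthetic random walk are assumptions, not the project&rsquo;s code or data.</p>

```python
import numpy as np


def mean_square_displacement(positions, max_lag):
    """MSD(lag) averaged over atoms and time origins.

    positions: unwrapped coordinates with shape (n_frames, n_atoms, 3).
    """
    msd = np.zeros(max_lag + 1)
    for lag in range(1, max_lag + 1):
        disp = positions[lag:] - positions[:-lag]  # all time origins at once
        msd[lag] = np.mean(np.sum(disp**2, axis=-1))
    return msd


def diffusion_coefficient(msd, dt, fit_start):
    """Einstein relation in 3D: D = slope / 6 from a linear fit of the tail."""
    t = np.arange(len(msd)) * dt
    slope, _intercept = np.polyfit(t[fit_start:], msd[fit_start:], 1)
    return slope / 6.0


# Synthetic random walk with a known D (illustrative, not the argon trajectory)
rng = np.random.default_rng(0)
D_true, dt = 0.25, 0.01  # Angstrom^2/ps and ps, hypothetical values
steps = rng.normal(scale=np.sqrt(2 * D_true * dt), size=(2000, 200, 3))
trajectory = np.cumsum(steps, axis=0)

msd = mean_square_displacement(trajectory, max_lag=100)
D_est = diffusion_coefficient(msd, dt, fit_start=20)
D_cm2_per_s = D_est * 1e-4  # 1 Angstrom^2/ps = 1e-4 cm^2/s
```

<p>The final conversion is the one that matters for comparing against Rahman: a value near 0.25 Å²/ps corresponds to roughly $2.5 \times 10^{-5}$ cm²/s.</p>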
<h3 id="visual-replication">Visual Replication</h3>
<p>I used Matplotlib to digitally recreate Rahman&rsquo;s hand-drawn plots, confirming signatures like the <strong>negative region in the Velocity Autocorrelation Function (VACF)</strong>, which provided the first evidence of the &ldquo;cage effect&rdquo; in simple liquids.</p>















<figure class="post-figure center ">
    <img src="/img/rahman-1964-argon-molecular-dynamics/rahman-argon-velocity-autocorrelation.webp"
         alt="Velocity Autocorrelation Function comparison showing the characteristic negative region"
         title="Velocity Autocorrelation Function comparison showing the characteristic negative region"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The VACF&rsquo;s negative region (first evidence of the &lsquo;cage effect&rsquo; in liquids) reproduced 60 years later.</figcaption>
    
</figure>

<h2 id="challenges--learnings">Challenges &amp; Learnings</h2>
<ul>
<li><strong>Unit Hell</strong>: Rahman&rsquo;s paper uses a mix of reduced units and CGS. Mapping these to LAMMPS&rsquo;s <code>real</code> units required a dedicated <code>constants.py</code> module and rigorous unit testing to prevent dimensional errors.</li>
<li><strong>Fourier Transforms</strong>: Calculating the Structure Factor $S(k)$ from $g(r)$ required hand-implementing the radial form of the 3D Fourier transform (valid under spherical symmetry), as standard FFT routines do not include the radial shell integration implicit in liquid structure analysis.</li>
<li><strong>Code as a Liability</strong>: Early in the project, I realized that re-running analysis scripts was becoming a bottleneck. This drove the decision to build the caching infrastructure, reinforcing the lesson that investing in developer tooling pays off even in small-scale scientific projects.</li>
</ul>
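<p>For the Fourier transform point above, the standard isotropic relation is $S(k) = 1 + 4\pi\rho \int_0^\infty r^2 \, [g(r) - 1] \, \frac{\sin(kr)}{kr} \, dr$. A minimal sketch using a trapezoidal rule follows; the function name, grid, and density value are assumptions for illustration rather than the project&rsquo;s implementation.</p>

```python
import numpy as np


def structure_factor(r, g, k, density):
    """S(k) = 1 + 4*pi*rho * integral of r^2 [g(r) - 1] sin(kr)/(kr) dr."""
    kr = np.outer(k, r)          # shape (n_k, n_r)
    sinc = np.sinc(kr / np.pi)   # np.sinc(x) = sin(pi x)/(pi x), safe at kr = 0
    integrand = r**2 * (g - 1.0) * sinc
    # Trapezoidal rule along r, written out to avoid NumPy version differences
    integral = np.sum(
        0.5 * (integrand[:, 1:] + integrand[:, :-1]) * np.diff(r), axis=1
    )
    return 1.0 + 4.0 * np.pi * density * integral


# Sanity check: an ideal gas has g(r) = 1 everywhere, so S(k) = 1 for all k
r = np.linspace(0.0, 20.0, 2001)
k = np.array([0.5, 1.0, 2.0])
S_ideal = structure_factor(r, np.ones_like(r), k, density=0.02)
```

<p>In practice $g(r)$ is only known out to half the box length, so truncation ripples at small $k$ are a known limitation of this direct approach.</p>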
<h2 id="related-work">Related Work</h2>
<p>The full methodology and physics are documented in the companion blog post:</p>
<ul>
<li><a href="/posts/rahman-1964-lammps-liquid-argon/">Replicating Rahman&rsquo;s 1964 Liquid Argon Simulation</a></li>
</ul>
]]></content:encoded></item><item><title>Synthetic Isomer Data Generation Pipeline</title><link>https://hunterheidenreich.com/projects/isomer-dataset-generation/</link><pubDate>Sat, 09 Mar 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/isomer-dataset-generation/</guid><description>An end-to-end cheminformatics pipeline transforming 1D chemical formulas into 3D conformer datasets using graph enumeration and physics-based featurization.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>In computational drug discovery, data scarcity is often the bottleneck. This project builds a synthetic data generator that creates labeled 3D molecular datasets starting from nothing but a raw chemical formula (e.g., $C_6H_{14}$).</p>
<p>The pipeline bridges the gap between <strong>1D Chemical Information</strong> (stoichiometry) and <strong>3D Geometric Data</strong> (conformers), effectively serving as a &ldquo;data factory&rdquo; for training molecular machine learning models.</p>
<h2 id="features">Features</h2>
<h3 id="1-graph-enumeration--3d-embedding">1. Graph Enumeration &amp; 3D Embedding</h3>
<p>The core of the project is <code>pysomer/data/gen.py</code>, which orchestrates a multi-step generation process:</p>
<ul>
<li><strong>Structural Isomerism:</strong> Uses <strong>MAYGEN</strong> (via a Java bridge) to mathematically enumerate all valid graph connectivities for a given formula</li>
<li><strong>Conformer Sampling:</strong> Uses <strong>RDKit</strong> to embed these graphs into 3D space, generating multiple conformers (rotamers) per isomer to capture flexibility</li>
<li><strong>IUPAC Labeling:</strong> Automatically queries PubChem APIs to assign human-readable labels (e.g., &ldquo;2-methylpentane&rdquo;) to the generated structures</li>
</ul>
<h3 id="2-physics-aware-featurization">2. Physics-Aware Featurization</h3>
<p>The pipeline computes <strong>Coulomb Matrices</strong>, ensuring the input respects physical invariants:</p>
<p>$$C_{ij} = \begin{cases} 0.5 Z_i^{2.4} &amp; i = j \\ \frac{Z_i Z_j}{|R_i - R_j|} &amp; i \neq j \end{cases}$$</p>
<p>This representation encodes the electrostatic potential of the molecule, providing a more informative signal for the neural network than raw Cartesian coordinates.</p>
<h3 id="3-hdf5-data-storage">3. HDF5 Data Storage</h3>
<p>To handle the large volume of generated conformers, the system writes to hierarchical <strong>HDF5</strong> files. This allows for efficient, chunked I/O during training, a critical pattern for scaling to larger chemical spaces.</p>
<h2 id="usage">Usage</h2>
<p>The pipeline is executed via a CLI, taking a chemical formula as input and outputting an HDF5 dataset of 3D conformers.</p>
<h2 id="results">Results</h2>
<p>This project serves as a &ldquo;vertical slice&rdquo; of a cheminformatics workflow.</p>
<ul>
<li><strong>The Good:</strong> The separation of concerns is clean: <code>dataclasses</code> for configuration and HDF5 for storage keep the data-engineering layer tidy and extensible.</li>
<li><strong>The &ldquo;Old School&rdquo;:</strong> The model used is a simple Multi-Layer Perceptron (MLP) on flattened Coulomb Matrices. In a modern production setting (post-2020), I would replace this with a symmetry-aware GNN, either E(3)-invariant (like SchNet) or E(3)-equivariant (like e3nn), to handle rotational symmetry natively, eliminating manual feature engineering.</li>
<li><strong>Dependency Management:</strong> The reliance on an external Java JAR (<code>MAYGEN</code>) for graph enumeration makes the environment brittle. Today, I would likely swap this for a pure Python enumerator or a containerized microservice to improve portability.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<p>This data pipeline powers the analysis in my comprehensive guide on molecular representation:</p>
<ul>
<li><a href="/posts/alkane-constitutional-isomer-classification/">Coulomb Matrix Eigenvalues: Can You Hear the Shape of a Molecule?</a>: A deep dive into data generation, unsupervised clustering, and supervised classification of alkane isomers.</li>
</ul>
<p>See also:</p>
<ul>
<li><a href="/posts/molecular-descriptor-coulomb-matrix/">The Coulomb Matrix</a>: Deep dive into the physics-based featurization used here</li>
<li><a href="/notes/chemistry/molecular-representations/notations/number-of-isomeric-hydrocarbons/">The Number of Isomeric Hydrocarbons</a>: The foundational 1931 paper on alkane enumeration</li>
</ul>
]]></content:encoded></item><item><title>Coulomb Matrices for Molecular Machine Learning</title><link>https://hunterheidenreich.com/posts/molecular-descriptor-coulomb-matrix/</link><pubDate>Sat, 10 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/molecular-descriptor-coulomb-matrix/</guid><description>Learn how Coulomb matrices encode 3D molecular structure for machine learning from basic theory to Python implementation and practical limitations.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>When working with machine learning in chemistry, one of the first challenges you encounter is how to represent molecules in a way that algorithms can understand. You can&rsquo;t just feed raw atomic coordinates into a model. The representation needs to be invariant to rotation, translation, and atom ordering, since these operations don&rsquo;t change the molecule&rsquo;s fundamental properties.</p>
<p>The Coulomb matrix, introduced by Rupp et al. in 2012 <a href="#ref-1">[1]</a>, provides a straightforward solution to this problem. While newer methods have largely superseded it for practical applications, the Coulomb matrix remains an excellent starting point for understanding how molecular descriptors work.</p>
<p>The key insight is simple: we encode pairwise relationships between atoms in a way that captures the essential physics while maintaining the required invariances.</p>
<h2 id="the-coulomb-matrix-theory-and-intuition">The Coulomb Matrix: Theory and Intuition</h2>
<p>The Coulomb matrix encodes molecular structure in a symmetric $N \times N$ matrix, where $N$ is the number of atoms. Each element $C_{ij}$ is defined as:</p>
<p>$$
C_{ij} = \begin{cases} 0.5 Z_i^{2.4} &amp; \text{if } i = j, \\ \frac{Z_i Z_j}{|\mathbf{R}_i - \mathbf{R}_j|} &amp; \text{if } i \neq j, \end{cases}
$$</p>
<p>Here, $Z_i$ is the atomic number of atom $i$, and $\mathbf{R}_i$ is its position in 3D space. The diagonal elements ($0.5 Z_i^{2.4}$) approximate atomic self-energies, with the functional form taken from a polynomial fit of free-atom energies to nuclear charge. The off-diagonal elements mimic Coulombic interactions between atoms. They&rsquo;re inversely proportional to distance, just like electrostatic potential energy <a href="#ref-3">[3]</a>.</p>
<p>This construction gives us several useful properties:</p>
<ul>
<li><strong>Rotation and translation invariant</strong>: Only relative distances matter</li>
<li><strong>Symmetric</strong>: $C_{ij} = C_{ji}$, which is physically sensible</li>
<li><strong>Size-dependent</strong>: Larger molecules produce larger matrices ($N \times N$ for $N$ atoms)</li>
<li><strong>Captures 3D structure</strong>: Nearby atoms have larger interaction terms</li>
</ul>
<p>While more sophisticated methods exist today <a href="#ref-2">[2]</a>, the Coulomb matrix&rsquo;s simplicity makes it ideal for understanding the fundamentals of molecular representation.</p>
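<p>Before reaching for a library, it helps to see that the definition fits in a few lines of NumPy. The sketch below is a direct transcription of the formula above; the function name and the water-like geometry are hypothetical, and distances are simply in whatever length unit the coordinates use.</p>

```python
import numpy as np


def coulomb_matrix(Z, R):
    """Coulomb matrix from atomic numbers Z, shape (N,), and positions R, shape (N, 3)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    # Pairwise distances |R_i - R_j| via broadcasting
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)         # placeholder to avoid dividing by zero
    C = np.outer(Z, Z) / dist           # off-diagonal: Z_i Z_j / |R_i - R_j|
    np.fill_diagonal(C, 0.5 * Z**2.4)   # diagonal: 0.5 * Z_i^2.4
    return C


# Hypothetical water-like geometry (illustrative coordinates, not optimized)
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
C = coulomb_matrix(Z, R)
```

<p>Because only interatomic distances enter, rotating or translating <code>R</code> leaves <code>C</code> unchanged, which is the invariance listed above.</p>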
<h3 id="hands-on-example-bicyclobutane">Hands-on Example: Bicyclobutane</h3>
<p>Let&rsquo;s calculate the Coulomb matrix for <a href="https://en.wikipedia.org/wiki/Bicyclobutane">bicyclobutane</a>, a strained but stable bicyclic system (bicyclo[1.1.0]butane, C4H6, two cis-fused cyclopropane rings). This example will show you exactly how the theory translates to practice.</p>















<figure class="post-figure center ">
    <img src="https://upload.wikimedia.org/wikipedia/commons/b/b4/Bicyclobutane-2.svg"
         alt="Bicyclobutane"
         title="Bicyclobutane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Bicyclobutane structure (Smokefoot, Public domain, via Wikimedia Commons)</figcaption>
    
</figure>

<p>I&rsquo;ll use Python with the Atomic Simulation Environment (<code>ase</code>) for molecular structure <a href="#ref-4">[4]</a> and <code>dscribe</code> for the Coulomb matrix calculation <a href="#ref-2">[2]</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> ase.build <span style="color:#f92672">import</span> molecule
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> ase.visualize <span style="color:#f92672">import</span> view
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load the bicyclobutane structure</span>
</span></span><span style="display:flex;"><span>bicyclobutane <span style="color:#f92672">=</span> molecule(<span style="color:#e6db74">&#39;bicyclobutane&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Optional: visualize the structure</span>
</span></span><span style="display:flex;"><span>view(bicyclobutane, viewer<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;x3d&#39;</span>)
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/bicyclobutane_ase_1.webp"
         alt="Bicyclobutane 3D structure"
         title="Bicyclobutane 3D structure"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">3D structure of bicyclobutane</figcaption>
    
</figure>

<p>Now we calculate the Coulomb matrix using DScribe:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> dscribe.descriptors <span style="color:#f92672">import</span> CoulombMatrix
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Set up the descriptor</span>
</span></span><span style="display:flex;"><span>cm <span style="color:#f92672">=</span> CoulombMatrix(n_atoms_max<span style="color:#f92672">=</span>len(bicyclobutane))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Calculate and reshape into matrix form</span>
</span></span><span style="display:flex;"><span>cm_bicyclobutane <span style="color:#f92672">=</span> cm<span style="color:#f92672">.</span>create(bicyclobutane)
</span></span><span style="display:flex;"><span>cm_bicyclobutane <span style="color:#f92672">=</span> cm_bicyclobutane<span style="color:#f92672">.</span>reshape(len(bicyclobutane), len(bicyclobutane))
</span></span></code></pre></div><h3 id="visualizing-the-results">Visualizing the Results</h3>
<p>The Coulomb matrix can be visualized as a heatmap. Let&rsquo;s look at both the raw matrix and its logarithm:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> matplotlib.pyplot <span style="color:#66d9ef">as</span> plt
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Raw Coulomb matrix</span>
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>figure(figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">8</span>, <span style="color:#ae81ff">8</span>), dpi<span style="color:#f92672">=</span><span style="color:#ae81ff">150</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>imshow(cm_bicyclobutane, cmap<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;coolwarm&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>colorbar(label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Magnitude&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">&#39;Coulomb Matrix for Bicyclobutane&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>show()
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/cm_bicyclobutane.webp"
         alt="Coulomb matrix of bicyclobutane"
         title="Coulomb matrix of bicyclobutane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Coulomb matrix for bicyclobutane</figcaption>
    
</figure>

<p>The raw matrix shows clear patterns:</p>
<ul>
<li><strong>Large diagonal elements</strong>: Carbon atoms (Z=6) dominate due to their higher atomic numbers</li>
<li><strong>Smaller off-diagonal elements</strong>: Represent pairwise interactions</li>
<li><strong>Minimal hydrogen contribution</strong>: Hydrogen atoms (Z=1) have much smaller values</li>
</ul>
<p>For better visualization of the structure, the logarithm reveals more detail:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>figure(figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">8</span>, <span style="color:#ae81ff">8</span>), dpi<span style="color:#f92672">=</span><span style="color:#ae81ff">150</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>imshow(np<span style="color:#f92672">.</span>log(cm_bicyclobutane), cmap<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;coolwarm&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>colorbar(label<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;log(Magnitude)&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>title(<span style="color:#e6db74">&#39;Log Coulomb Matrix for Bicyclobutane&#39;</span>)
</span></span><span style="display:flex;"><span>plt<span style="color:#f92672">.</span>show()
</span></span></code></pre></div>














<figure class="post-figure center ">
    <img src="/img/cm_bicyclobutane_log.webp"
         alt="Log Coulomb matrix of bicyclobutane"
         title="Log Coulomb matrix of bicyclobutane"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Log-scale reveals more structural detail</figcaption>
    
</figure>

<h3 id="eigenvalue-analysis">Eigenvalue Analysis</h3>
<p>The eigenvalues of the Coulomb matrix provide another perspective on molecular structure:</p>















<figure class="post-figure center ">
    <img src="/img/cm_bicyclobutane_eigenvalues.webp"
         alt="Eigenvalues of Coulomb matrix"
         title="Eigenvalues of Coulomb matrix"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Eigenvalues of the Coulomb matrix</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/cm_bicyclobutane_log_eigenvalues.webp"
         alt="Eigenvalues of log Coulomb matrix"
         title="Eigenvalues of log Coulomb matrix"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Eigenvalues on logarithmic scale</figcaption>
    
</figure>

<p>These eigenvalues are often used as features themselves, providing a more compact representation than the full matrix.</p>
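<p>Computing the spectrum is a one-liner in NumPy. The sketch below uses a small hand-written matrix with illustrative values (an O, H, H arrangement) in place of <code>cm_bicyclobutane</code> so that it runs on its own; it also checks that reordering the atoms leaves the eigenvalues unchanged.</p>

```python
import numpy as np

# Hypothetical 3-atom Coulomb matrix (O, H, H); values are illustrative
C = np.array([
    [73.52, 8.33, 8.33],
    [8.33, 0.50, 0.66],
    [8.33, 0.66, 0.50],
])

# eigvalsh exploits symmetry and returns real eigenvalues in ascending order
eigenvalues = np.linalg.eigvalsh(C)[::-1]  # largest first

# Reordering atoms permutes rows and columns together; the spectrum is unchanged
perm = [2, 0, 1]
C_permuted = C[np.ix_(perm, perm)]
eigenvalues_permuted = np.linalg.eigvalsh(C_permuted)[::-1]
```

<p>An $N$-atom molecule yields $N$ eigenvalues instead of $N^2$ matrix entries, which is where the compactness comes from.</p>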
<h2 id="practical-limitations">Practical Limitations</h2>
<p>The Coulomb matrix has significant limitations that explain why it&rsquo;s been largely superseded by modern methods. Understanding these constraints is crucial for knowing when and how to use this descriptor.</p>
<h3 id="the-size-problem">The Size Problem</h3>
<p>Every molecule must be represented by the same size matrix, which creates several issues:</p>
<ul>
<li><strong>Padding overhead</strong>: Small molecules get padded with zeros up to the maximum size</li>
<li><strong>Quadratic scaling</strong>: An $N$-atom molecule requires $N^2$ features</li>
<li><strong>Fixed maximum size</strong>: You can&rsquo;t represent molecules larger than your preset limit</li>
<li><strong>Inefficient storage</strong>: Most elements are zero for small molecules in large matrices</li>
</ul>
<p>For a dataset ranging from 5-atom to 50-atom molecules, every molecule needs a 50x50 matrix. That&rsquo;s 2,500 features, most of which are zero for smaller molecules.</p>
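<p>The padding arithmetic is easy to verify directly. This is a minimal sketch with a hypothetical helper name; the all-ones matrix simply stands in for a 5-atom Coulomb matrix, whose entries are all nonzero.</p>

```python
import numpy as np


def pad_coulomb_matrix(C, n_max):
    """Zero-pad an (n, n) matrix to (n_max, n_max)."""
    n = C.shape[0]
    if n > n_max:
        raise ValueError(f"molecule has {n} atoms, but n_max is {n_max}")
    return np.pad(C, ((0, n_max - n), (0, n_max - n)))


small = np.ones((5, 5))                       # stand-in for a 5-atom matrix
features = pad_coulomb_matrix(small, 50).ravel()
zero_fraction = np.mean(features == 0)        # share of padded entries
```

<p>For the 5-atom case, only 25 of the 2,500 features carry information, so 99% of the vector is padding.</p>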
<h3 id="permutation-sensitivity">Permutation Sensitivity</h3>
<p>Despite being called &ldquo;invariant,&rdquo; the Coulomb matrix can actually change if you reorder the atoms in your input file. The standard solution is to sort atoms by the L2 norm of their matrix rows, but this introduces its own problems:</p>
<ul>
<li><strong>Symmetry breaking</strong>: Equivalent atoms might be ordered differently</li>
<li><strong>Numerical instability</strong>: Small coordinate changes can flip the ordering</li>
<li><strong>Loss of chemical intuition</strong>: The sorted order doesn&rsquo;t reflect meaningful chemistry</li>
</ul>
<p>Interestingly, some studies suggest that adding controlled noise to create multiple permutations can actually improve machine learning performance <a href="#ref-5">[5]</a>.</p>
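<p>To make the ordering problem concrete, the sketch below permutes a small matrix and then applies the row-norm sort. The helper name and the illustrative H, O, H matrix are assumptions; libraries such as DScribe provide their own sorting options.</p>

```python
import numpy as np


def sort_by_row_norm(C):
    """Reorder atoms so that row L2 norms appear in descending order."""
    order = np.argsort(-np.linalg.norm(C, axis=1), kind="stable")
    return C[np.ix_(order, order)]


# Hypothetical Coulomb matrix with atoms listed as H, O, H (illustrative values)
C = np.array([
    [0.50, 8.33, 0.66],
    [8.33, 73.52, 8.33],
    [0.66, 8.33, 0.50],
])

# The same molecule with its atoms listed in a different order
perm = [1, 2, 0]
C_permuted = C[np.ix_(perm, perm)]

raw_matches = np.allclose(C, C_permuted)  # False: raw matrices differ
sorted_matches = np.allclose(
    sort_by_row_norm(C), sort_by_row_norm(C_permuted)
)                                         # True: sorting restores agreement
```

<p>The sort works cleanly here because the oxygen row norm is far from the hydrogen ones. When two norms are nearly equal, a tiny geometry change can swap rows, which is the instability noted above.</p>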
<h3 id="limited-scope">Limited Scope</h3>
<p>The Coulomb matrix works well only for specific types of systems:</p>
<ul>
<li><strong>Small molecules</strong>: Performance degrades for large systems due to size scaling</li>
<li><strong>Gas-phase</strong>: Not suitable for periodic systems like crystals or surfaces</li>
<li><strong>Single conformations</strong>: Each 3D structure gets its own matrix</li>
<li><strong>Non-reactive</strong>: Doesn&rsquo;t capture bond-breaking or formation</li>
</ul>
<p>For periodic systems, you&rsquo;d need specialized variants like the Ewald sum matrix <a href="#ref-6">[6]</a>.</p>
<h2 id="why-learn-it-anyway">Why Learn It Anyway?</h2>
<p>Given these limitations, why spend time understanding the Coulomb matrix? Several reasons:</p>
<p><strong>Educational value</strong>: It&rsquo;s conceptually straightforward and provides excellent intuition for how molecular descriptors work. The mathematical formulation is simple enough to implement from scratch.</p>
<p><strong>Historical importance</strong>: Many subsequent methods build on ideas first explored with Coulomb matrices. Understanding this foundation helps you appreciate why newer methods were developed.</p>
<p><strong>Benchmarking</strong>: It remains useful as a baseline method for comparing new descriptors on small molecular datasets.</p>
<p><strong>Proof of concept</strong>: For exploratory work on small, well-defined datasets, the Coulomb matrix can still provide quick insights.</p>
<p>If you&rsquo;re working on practical problems with larger datasets or diverse molecular sizes, consider modern alternatives like graph neural networks, descriptors from DScribe&rsquo;s extended library, or learned representations from transformer models.</p>
<h2 id="putting-it-in-context">Putting It in Context</h2>
<p>To see the Coulomb matrix applied to real problems, I&rsquo;ve written a detailed guide using it for molecular classification:</p>
<ul>
<li><a href="/posts/alkane-constitutional-isomer-classification/">Coulomb Matrix Eigenvalues: Can You Hear the Shape of a Molecule?</a>: A comprehensive analysis of alkane isomers, from unsupervised clustering limits to supervised classification successes.</li>
</ul>
<p>For comparison with modern approaches, check out my post on <a href="/posts/geom-conformer-generation-dataset/">3D conformer generation with the GEOM dataset</a>, which showcases more sophisticated molecular representations. For technical specifications and benchmarks, see the <a href="/notes/chemistry/datasets/geom/">GEOM dataset card</a>.</p>
<p>The Coulomb matrix may be dated, but it remains an excellent entry point into the world of molecular machine learning. Once you understand its strengths and limitations, you&rsquo;ll be better equipped to appreciate why the field has moved toward more sophisticated approaches.</p>
<hr>
<p><em>Have questions about molecular descriptors or want to discuss other approaches to molecular machine learning? I&rsquo;d be happy to explore these topics further.</em></p>
<h2 id="references">References</h2>
<ul>
<li><a id="ref-1"></a>[1] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, &ldquo;Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning,&rdquo; Physical Review Letters, 108(5), 058301 (2012). <a href="https://doi.org/10.1103/PhysRevLett.108.058301">https://doi.org/10.1103/PhysRevLett.108.058301</a> <a href="https://arxiv.org/abs/1109.2618">arXiv:1109.2618</a></li>
<li><a id="ref-2"></a>[2] L. Himanen, M. O. J. Jäger, E. V. Morooka, F. F. Canova, Y. S. Ranawat, D. Z. Gao, P. Rinke, and A. S. Foster, &ldquo;DScribe: Library of descriptors for machine learning in materials science,&rdquo; Computer Physics Communications, 247, 106949 (2020). <a href="https://doi.org/10.1016/j.cpc.2019.106949">https://doi.org/10.1016/j.cpc.2019.106949</a> <a href="https://arxiv.org/abs/1904.08875">arXiv:1904.08875</a></li>
<li><a id="ref-3"></a>[3] J. Schrier, &ldquo;Can one hear the shape of a molecule (from its Coulomb matrix eigenvalues)?,&rdquo; Journal of Chemical Information and Modeling, 60(8), 3804-3811 (2020). <a href="https://doi.org/10.1021/acs.jcim.0c00631">https://doi.org/10.1021/acs.jcim.0c00631</a></li>
<li><a id="ref-4"></a>[4] A. H. Larsen, J. J. Mortensen, J. Blomqvist, I. E. Castelli, R. Christensen, M. Dułak, J. Friis, M. N. Groves, B. Hammer, C. Hargus, E. D. Hermes, P. C. Jennings, P. B. Jensen, J. Kermode, J. R. Kitchin, E. L. Kolsbjerg, J. Kubal, K. Kaasbjerg, S. Lysgaard, J. B. Maronsson, T. Maxson, T. Olsen, L. Pastewka, A. Peterson, C. Rostgaard, J. Schiøtz, O. Schütt, M. Strange, K. S. Thygesen, T. Vegge, L. Vilhelmsen, M. Walter, Z. Zeng, and K. W. Jacobsen, &ldquo;The Atomic Simulation Environment - A Python library for working with atoms,&rdquo; J. Phys.: Condens. Matter, 29, 273002 (2017). <a href="https://doi.org/10.1088/1361-648X/aa680e">https://doi.org/10.1088/1361-648X/aa680e</a> <a href="https://ase-lib.org/index.html">documentation</a></li>
<li><a id="ref-5"></a>[5] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, A. Lilienfeld, and K.-R. Müller, &ldquo;Learning invariant representations of molecules for atomization energy prediction,&rdquo; Advances in Neural Information Processing Systems, 25 (2012). Available online: <a href="https://proceedings.neurips.cc/paper_files/paper/2012/file/115f89503138416a242f40fb7d7f338e-Paper.pdf">https://proceedings.neurips.cc/paper_files/paper/2012/file/115f89503138416a242f40fb7d7f338e-Paper.pdf</a></li>
<li><a id="ref-6"></a>[6] F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, &ldquo;Crystal structure representations for machine learning models of formation energies,&rdquo; International Journal of Quantum Chemistry, 115(16), 1094-1101 (2015). <a href="https://doi.org/10.1002/qua.24917">https://doi.org/10.1002/qua.24917</a></li>
</ul>
]]></content:encoded></item><item><title>How Does Congress Actually Work? Data from 15K Bills</title><link>https://hunterheidenreich.com/posts/us-117th-congress-data-exploration/</link><pubDate>Thu, 05 Oct 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/us-117th-congress-data-exploration/</guid><description>What happens to bills in Congress? Analyzing 15K+ bills from the 117th Congress to understand legislative patterns, party dynamics, and success rates.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Analyzing congressional data reveals the underlying mechanics of the legislative process. Legislative text is a large, structured corpus well suited to text classification and other NLP tasks. I scraped data from Congress.gov to analyze what actually happens to the thousands of bills introduced each session and to build a foundational dataset for downstream machine learning tasks.</p>
<p>This analysis focuses on the 117th Congress (2021-2023), examining 15,000+ bills to understand basic patterns: Which bills get introduced? How many receive votes? What factors influence success?</p>
<p>This post covers the foundational exploratory analysis and data collection process, setting the stage for <a href="/posts/congressional-bill-policy-area-classification/">predictive modeling and policy area classification</a>.</p>
<h2 id="data-collection">Data Collection</h2>
<p>My primary source is <a href="https://www.congress.gov/">Congress.gov</a>, maintained by the Library of Congress. I focused on the 117th Congress (2021-2023), collecting data on bills and joint resolutions, omitting simple resolutions, concurrent resolutions, and amendments.</p>
<p><strong>Data collected:</strong></p>
<table>
  <thead>
      <tr>
          <th>Bill Type</th>
          <th>Introduced</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>House Bill</td>
          <td>9,698</td>
      </tr>
      <tr>
          <td>House Joint Resolution</td>
          <td>106</td>
      </tr>
      <tr>
          <td>Senate Bill</td>
          <td>5,357</td>
      </tr>
      <tr>
          <td>Senate Joint Resolution</td>
          <td>70</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>15,231</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="technical-implementation">Technical Implementation</h3>
<p>Building a usable NLP dataset requires careful handling of the source. Congress.gov loads content dynamically and presents nested DOM structures, so the scraper combines static HTML parsing with a headless browser to render JavaScript before parsing.</p>
<p><strong>Implementation details:</strong></p>
<ul>
<li><a href="https://www.python.org/">Python</a> for core orchestration and data schema management</li>
<li><a href="https://www.selenium.dev/">Selenium</a> for executing JavaScript and loading dynamic page elements</li>
<li><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a> for structured HTML parsing</li>
<li>Regex for text normalization and extracting clean legislative text for language models</li>
</ul>
<p>The crawler waited 5 seconds between requests to respect server limits, which stretched collection into a roughly 3-day run. It handles edge cases in congressional text formatting and writes one JSON record per bill on a fixed schema. The crawler and processed data are available on <a href="https://github.com/hunter-heidenreich/congress-scraper">GitHub</a>.</p>
<p>For each bill, I queried two pages:</p>
<ul>
<li>All info page: <code>https://www.congress.gov/bill/117th-congress/{bill_type}/{bill_id}/all-info</code></li>
<li>Text page: <code>https://www.congress.gov/bill/117th-congress/{bill_type}/{bill_id}/text?format=txt</code></li>
</ul>
<p>Parsing targets specific HTML elements on each page, and basic caching avoids redundant requests.</p>
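<p>The request pattern described above (two URLs per bill, a pause between requests, and a cache) can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler&rsquo;s actual code: <code>bill_urls</code>, <code>cached_fetch</code>, the on-disk cache layout, and the injectable <code>fetch</code> callable are all hypothetical names chosen for the example. In the real pipeline the fetching role is played by the Selenium-driven browser.</p>

```python
import hashlib
import time
from pathlib import Path

BASE = "https://www.congress.gov/bill/117th-congress"


def bill_urls(bill_type, bill_id):
    """Return the all-info and text URLs for one bill."""
    root = f"{BASE}/{bill_type}/{bill_id}"
    return {"all_info": f"{root}/all-info", "text": f"{root}/text?format=txt"}


def cached_fetch(url, fetch, cache_dir, delay=5.0, sleep=time.sleep):
    """Return the page for `url`, hitting the network only on a cache miss.

    `fetch` is any callable mapping a URL to page text (hypothetical stand-in
    for the headless browser used by the real crawler).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it is safe to use as a file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    page = fetch(url)
    path.write_text(page, encoding="utf-8")
    sleep(delay)  # pause only after a real request
    return page
```

<p>Because the delay is applied only on cache misses, re-running the crawl after an interruption skips everything already on disk. A call such as <code>bill_urls("house-bill", 1)</code> assumes that <code>house-bill</code> is the URL slug for House bills.</p>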
<h2 id="key-findings">Key Findings</h2>
<p>The analysis reveals clear patterns in congressional activity. Most bills never receive votes, and success rates vary significantly by party and policy area.</p>
<h3 id="legislative-outcomes">Legislative Outcomes</h3>
<p>The fundamental question: what happens to bills after introduction?</p>
<p>Each bill has a tracker status indicating its position in the legislative process. The eight possible statuses can be grouped into three meaningful categories:</p>
<ul>
<li><strong>Introduced</strong>: Bills introduced but never voted on</li>
<li><strong>Stalled</strong>: Bills that saw votes but didn&rsquo;t become law (since the 117th Congress ended, these effectively died)</li>
<li><strong>Law</strong>: Bills signed by the President</li>
</ul>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Introduced</th>
          <th>Stalled</th>
          <th>Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>House Bill</td>
          <td>8,977</td>
          <td>523</td>
          <td>198</td>
      </tr>
      <tr>
          <td>House Joint Resolution</td>
          <td>102</td>
          <td>1</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Senate Bill</td>
          <td>5,083</td>
          <td>114</td>
          <td>160</td>
      </tr>
      <tr>
          <td>Senate Joint Resolution</td>
          <td>57</td>
          <td>9</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>14,219</strong></td>
          <td><strong>647</strong></td>
          <td><strong>365</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Key insights:</strong></p>
<ul>
<li>Only 7% of introduced bills ever receive a vote</li>
<li>Of bills that receive votes, 36% become law</li>
<li>Overall, just 2% of introduced bills become law</li>
</ul>
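<p>These percentages follow directly from the totals row of the table. As a quick check, here is a small sketch that recomputes them; the variable names are illustrative only.</p>

```python
# Totals from the outcomes table above.
introduced_only = 14_219  # never voted on
stalled = 647             # voted on, did not become law
law = 365                 # signed into law

total = introduced_only + stalled + law
voted = stalled + law

vote_rate = voted / total      # share of all bills that saw a vote
law_given_vote = law / voted   # share of voted bills that became law
law_rate = law / total         # share of all bills that became law

print(f"{total:,} bills; voted {vote_rate:.1%}, "
      f"law|vote {law_given_vote:.1%}, law {law_rate:.1%}")
# prints: 15,231 bills; voted 6.6%, law|vote 36.1%, law 2.4%
```

<p>The same three ratios applied to a single row (for example, one party) give the per-group rates.</p>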
<h3 id="sponsor-analysis">Sponsor Analysis</h3>
<p>The bill sponsor (the primary member who introduces legislation) provides insights into party and geographic patterns.</p>
<h4 id="party-breakdown">Party Breakdown</h4>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Introduced</th>
          <th>Stalled</th>
          <th>Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Democrat</td>
          <td>8,271</td>
          <td>437</td>
          <td>235</td>
      </tr>
      <tr>
          <td>Republican</td>
          <td>5,883</td>
          <td>210</td>
          <td>130</td>
      </tr>
      <tr>
          <td>Independent</td>
          <td>65</td>
          <td>0</td>
          <td>0</td>
      </tr>
  </tbody>
</table>
<p><strong>Party comparison:</strong></p>
<ul>
<li><strong>Democrats</strong>: 7.5% of bills moved beyond introduction; 2.6% became law</li>
<li><strong>Republicans</strong>: 5.5% of bills moved beyond introduction; 2.1% became law</li>
<li>When bills do advance, Republican-sponsored bills have a slightly higher success rate (38% vs 35%)</li>
</ul>
<h4 id="geographic-distribution">Geographic Distribution</h4>
<p><strong>Top 10 states by bills introduced:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>State: Introduced</th>
          <th>State: Stalled</th>
          <th>State: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>CA: 1,350</td>
          <td>CA: 93</td>
          <td>CA: 34</td>
      </tr>
      <tr>
          <td>2</td>
          <td>TX: 879</td>
          <td>NY: 44</td>
          <td>MI: 30</td>
      </tr>
      <tr>
          <td>3</td>
          <td>NY: 784</td>
          <td>TX: 43</td>
          <td>TX: 25</td>
      </tr>
      <tr>
          <td>4</td>
          <td>FL: 766</td>
          <td>MI: 28</td>
          <td>NY: 24</td>
      </tr>
      <tr>
          <td>5</td>
          <td>IL: 660</td>
          <td>NJ: 28</td>
          <td>MN: 17</td>
      </tr>
      <tr>
          <td>6</td>
          <td>PA: 521</td>
          <td>IL: 27</td>
          <td>IL: 16</td>
      </tr>
      <tr>
          <td>7</td>
          <td>NJ: 478</td>
          <td>VA: 26</td>
          <td>OH: 11</td>
      </tr>
      <tr>
          <td>8</td>
          <td>MI: 380</td>
          <td>FL: 24</td>
          <td>VA: 11</td>
      </tr>
      <tr>
          <td>9</td>
          <td>OH: 377</td>
          <td>PA: 22</td>
          <td>FL: 11</td>
      </tr>
      <tr>
          <td>10</td>
          <td>MA: 361</td>
          <td>OH: 19</td>
          <td>GA: 9</td>
      </tr>
  </tbody>
</table>
<p><strong>Normalizing by the number of legislators per state reveals different patterns:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>State: Introduced</th>
          <th>State: Stalled</th>
          <th>State: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>DC: 101.0</td>
          <td>DC: 7.0</td>
          <td>AK: 2.2</td>
      </tr>
      <tr>
          <td>2</td>
          <td>NH: 47.5</td>
          <td>AK: 2.8</td>
          <td>NH: 2.0</td>
      </tr>
      <tr>
          <td>3</td>
          <td>MT: 44.0</td>
          <td>IA: 2.3</td>
          <td>MT: 2.0</td>
      </tr>
      <tr>
          <td>4</td>
          <td>OR: 41.0</td>
          <td>SD: 2.3</td>
          <td>MI: 1.9</td>
      </tr>
      <tr>
          <td>5</td>
          <td>NV: 40.0</td>
          <td>NH: 2.2</td>
          <td>MN: 1.5</td>
      </tr>
      <tr>
          <td>6</td>
          <td>DE: 38.7</td>
          <td>VA: 2.0</td>
          <td>HI: 1.5</td>
      </tr>
      <tr>
          <td>7</td>
          <td>SD: 38.3</td>
          <td>NJ: 2.0</td>
          <td>CT: 1.3</td>
      </tr>
      <tr>
          <td>8</td>
          <td>IA: 37.7</td>
          <td>PR: 2.0</td>
          <td>IA: 1.2</td>
      </tr>
      <tr>
          <td>9</td>
          <td>RI: 36.5</td>
          <td>NV: 1.8</td>
          <td>OR: 1.1</td>
      </tr>
      <tr>
          <td>10</td>
          <td>UT: 36.0</td>
          <td>MO: 1.8</td>
          <td>SD: 1.0</td>
      </tr>
  </tbody>
</table>
<h4 id="top-individual-sponsors">Top Individual Sponsors</h4>
<p><strong>Most prolific legislators by bills introduced:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Individual: Introduced</th>
          <th>Individual: Stalled</th>
          <th>Individual: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Sen. Rubio (R-FL): 186</td>
          <td>Sen. Peters (D-MI): 11</td>
          <td>Sen. Peters (D-MI): 19</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Sen. Klobuchar (D-MN): 143</td>
          <td>Sen. Cornyn (R-TX): 8</td>
          <td>Sen. Cornyn (R-TX): 15</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Sen. Lee (R-UT): 125</td>
          <td>Rep. Connolly (D-VA-11): 8</td>
          <td>Sen. Klobuchar (D-MN): 7</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Sen. Markey (D-MA): 118</td>
          <td>Rep. Takano (D-CA-41): 8</td>
          <td>Sen. Tester (D-MT): 6</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Sen. Casey (D-PA): 116</td>
          <td>Sen. Grassley (R-IA): 7</td>
          <td>Sen. Rubio (R-FL): 6</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Sen. Cortez Masto (D-NV): 109</td>
          <td>Del. Norton (D-DC): 7</td>
          <td>Rep. DeLauro (D-CT-3): 6</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Sen. Booker (D-NJ): 106</td>
          <td>Rep. Johnson (D-TX-30): 7</td>
          <td>Sen. Grassley (R-IA): 5</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Sen. Durbin (D-IL): 102</td>
          <td>Rep. Katko (R-NY-24): 7</td>
          <td>Sen. Ossoff (D-GA): 4</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Del. Norton (D-DC): 101</td>
          <td>Rep. Dean (D-PA-4): 6</td>
          <td>Sen. Murkowski (R-AK): 4</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Sen. Menendez (D-NJ): 99</td>
          <td>Rep. Wagner (R-MO-2): 6</td>
          <td>Sen. Padilla (D-CA): 4</td>
      </tr>
  </tbody>
</table>
<p><strong>Effectiveness score (laws enacted / total bills):</strong></p>
<p>$$
\text{effectiveness} = \frac{\text{bills that became law}}{\text{total bills introduced}}
$$</p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Individual: Effectiveness Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Rep. Pelosi (D-CA-12): 0.500</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Rep. Mrvan (D-IN-1): 0.444</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Rep. Yarmuth (D-KY-3): 0.333</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Rep. Stivers (R-OH-15): 0.250</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Rep. Graves (R-MO-6): 0.222</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Rep. Jeffries (D-NY-8): 0.200</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Rep. Neal (D-MA-1): 0.200</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Rep. Palazzo (R-MS-4): 0.200</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Sen. Peters (D-MI): 0.186</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Rep. Fischbach (R-MN-7): 0.176</td>
      </tr>
  </tbody>
</table>
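<p>The effectiveness score is a simple ratio, so it is easy to reproduce from per-bill records. The sketch below assumes each record is a dict with <code>"sponsor"</code> and <code>"status"</code> keys; those field names, the function name, and the <code>min_bills</code> filter are hypothetical and not taken from the dataset schema.</p>

```python
from collections import Counter


def effectiveness_scores(bills, min_bills=1):
    """Return {sponsor: laws enacted / bills introduced}.

    `bills` holds one dict per bill with the hypothetical keys "sponsor"
    and "status"; a bill counts as enacted when its status is "Law".
    Sponsors with fewer than `min_bills` bills are left out.
    """
    bills = list(bills)  # allow generators, since we iterate twice
    introduced = Counter(b["sponsor"] for b in bills)
    enacted = Counter(b["sponsor"] for b in bills if b["status"] == "Law")
    return {
        sponsor: enacted[sponsor] / count
        for sponsor, count in introduced.items()
        if count >= min_bills
    }


# Toy data: sponsor A has 1 law out of 2 bills, sponsor B has 19 out of 102.
bills = (
    [{"sponsor": "A", "status": "Law"}]
    + [{"sponsor": "A", "status": "Introduced"}]
    + [{"sponsor": "B", "status": "Law"}] * 19
    + [{"sponsor": "B", "status": "Stalled"}] * 83
)
scores = effectiveness_scores(bills)
filtered = effectiveness_scores(bills, min_bills=20)
```

<p>With the toy data, sponsor A scores 0.5 from just two bills while sponsor B scores about 0.186 from 102. Small denominators can dominate a ranking like this one, which is why a minimum-bill filter (an optional addition, not part of the ranking above) may be worth applying before comparing legislators.</p>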
<h3 id="policy-focus-areas">Policy Focus Areas</h3>
<p>Each bill is assigned a primary policy area. Here are the most active areas by legislative outcome:</p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Policy Area: Introduced</th>
          <th>Policy Area: Stalled</th>
          <th>Policy Area: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Health: 1,885</td>
          <td>Government Operations: 79</td>
          <td>Government Operations: 94</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Armed Forces: 1,114</td>
          <td>Armed Forces: 60</td>
          <td>Armed Forces: 69</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Taxation: 1,066</td>
          <td>International Affairs: 60</td>
          <td>Crime &amp; Law Enforcement: 31</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Government Operations: 982</td>
          <td>Health: 56</td>
          <td>Health: 19</td>
      </tr>
      <tr>
          <td>5</td>
          <td>International Affairs: 866</td>
          <td>Crime &amp; Law Enforcement: 44</td>
          <td>Native Americans: 17</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Crime &amp; Law Enforcement: 842</td>
          <td>Public Lands: 44</td>
          <td>International Affairs: 14</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Education: 663</td>
          <td>Science &amp; Technology: 44</td>
          <td>Economics &amp; Finance: 13</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Transportation: 663</td>
          <td>Commerce: 43</td>
          <td>Public Lands: 13</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Public Lands: 548</td>
          <td>Finance: 34</td>
          <td>Commerce: 13</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Finance: 547</td>
          <td>Emergency Management: 27</td>
          <td>Emergency Management: 11</td>
      </tr>
  </tbody>
</table>
<p>Notable patterns: Health dominates introductions but has a lower success rate, while Government Operations and Armed Forces bills are more likely to become law.</p>
<h2 id="next-steps">Next Steps</h2>
<p>This analysis establishes baseline patterns: most bills fail, party affiliation affects success rates, and certain policy areas perform better than others.</p>
<p>Future work could explore:</p>
<ul>
<li>Committee dynamics and voting patterns</li>
<li>Geographic analysis of state-level interests</li>
<li>Bill text analysis using NLP techniques</li>
<li>Predictive modeling for bill outcomes</li>
</ul>
<blockquote>
<p><strong>Update</strong>: I&rsquo;ve since applied machine learning to this type of data in <a href="/posts/congressional-bill-policy-area-classification/">Congressional Bill Policy Area Classification</a>, using 48K+ bills from three Congresses to automatically categorize bills by policy area.</p></blockquote>
<p>The complete dataset and code are publicly available to support further research into legislative transparency.</p>
]]></content:encoded></item><item><title>Kabsch Algorithm: NumPy, PyTorch, TensorFlow, and JAX</title><link>https://hunterheidenreich.com/posts/kabsch-algorithm/</link><pubDate>Tue, 03 Oct 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/kabsch-algorithm/</guid><description>Learn about the Kabsch algorithm for optimal point alignment with implementations in NumPy, PyTorch, TensorFlow, and JAX for ML applications.</description><content:encoded><![CDATA[<h2 id="what-is-the-kabsch-algorithm">What is the Kabsch Algorithm?</h2>
<p>A common problem in computer vision and scientific computing: given two sets of paired points, what rigid-body transformation (rotation plus translation) best aligns them? The Kabsch algorithm answers this in closed form using a single singular value decomposition.</p>















<figure class="post-figure center ">
    <img src="/img/scientific-computing/kabsch-alignment-before-and-after.webp"
         alt="Visualization of two point sets before and after Kabsch alignment"
         title="Visualization of two point sets before and after Kabsch alignment"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The Kabsch algorithm optimally rotates and translates the blue points to align with the red points.</figcaption>
    
</figure>

<p>What are some concrete situations where this crops up?</p>
<ul>
<li><strong>Molecular Dynamics</strong>: Your points are a set of atoms (with physically relevant types), and you want to compare two molecular conformations. Are they the same structure with minor noise or rotation? Or are they different conformations, like a different folding of a protein? This is especially helpful when applying generative models to chemical structures. For example, if you are building a <a href="/notes/chemistry/molecular-simulation/ml-potentials/denoise-vae/">3D Molecular VAE</a> in PyTorch or working with <a href="/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/">Flow Matching models</a>, Kabsch alignment ensures your generative loss function remains rotationally invariant.</li>
<li><strong>Computer Vision</strong>: You have two point clouds from 3D scans of an object taken from different angles. You want to align them to reconstruct the full shape. Or perhaps you&rsquo;re generating 3D shapes from 2D images and need to compare the generated shape to a ground truth scan. Anytime a 3D system is represented as a point cloud, the Kabsch algorithm can help with alignment.</li>
</ul>
<p>Of course, existing libraries implement this algorithm. However, I often find it beneficial to implement algorithms from scratch to build intuition. Furthermore, modern machine learning applications require automatic differentiation, so alongside NumPy we will implement the algorithm in PyTorch, TensorFlow, and JAX.</p>
<p>Below, we&rsquo;ll cover the math behind the Kabsch algorithm (and its scaling variant, the <strong>Kabsch-Umeyama</strong> algorithm) and provide complete, differentiable implementations in <strong>NumPy</strong>, <strong>PyTorch</strong>, <strong>TensorFlow</strong>, and <strong>JAX</strong>, demonstrating both single-pair and batched computations for ML applications.</p>
<h2 id="the-math">The Math</h2>















<figure class="post-figure center ">
    <img src="/img/scientific-computing/kabsch-algorithm-basic-animation.webp"
         alt="Animation showing the iterative steps of centroid alignment and rotation"
         title="Animation showing the iterative steps of centroid alignment and rotation"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Visualizing the alignment process: first centering the datasets, then finding the optimal rotation.</figcaption>
    
</figure>

<p>Let&rsquo;s say we have two sets of paired points,
$P=\{\mathbf{p}_i\} \in \mathbb{R}^{N \times D}$ and $Q=\{\mathbf{q}_i\} \in \mathbb{R}^{N \times D}$, for $i = 1, \dots, N$
(where $D$ is the dimensionality and $N$ is the number of points).
We want to find a translation vector $\mathbf{t}$ and rotation matrix $R$ to transform $P$ to align with $Q$.</p>
<p>The optimization problem is:</p>
<p>$$
\min_{\mathbf{t}, \ R} \mathcal{L}(\mathbf{t}, R) = \frac{1}{2} \sum_{i=1}^N \| \mathbf{q}_i - (R\mathbf{p}_i + \mathbf{t}) \|^2
$$</p>
<p>where $R$ is restricted to proper rotations ($R^T R = I$, $\det R = +1$). We write $\mathbf{t}^\ast \in \mathbb{R}^D$ and $R^\ast \in \mathbb{R}^{D \times D}$ for the optimal translation and rotation.</p>
<p>Often we use a weighted version with weights $w_i$ (e.g., atomic masses in molecular dynamics):</p>
<p>$$
\min_{\mathbf{t}, \ R} \mathcal{L}(\mathbf{t}, R) = \frac{1}{2} \sum_{i=1}^N w_i \| \mathbf{q}_i - (R\mathbf{p}_i + \mathbf{t}) \|^2
$$</p>
<h3 id="the-translation">The Translation</h3>
<p>The translation and rotation are coupled, but they separate cleanly once we work in centroid-centered coordinates. Compute the centroids (averages) of both point sets:</p>
<p>$$
\bar{\mathbf{p}} = \frac{1}{N} \sum_{i=1}^N \mathbf{p}_i \quad \text{and} \quad \bar{\mathbf{q}} = \frac{1}{N} \sum_{i=1}^N \mathbf{q}_i
$$</p>
<p>For any fixed rotation $R$, the translation that minimizes $\mathcal{L}$ is found by setting $\partial \mathcal{L} / \partial \mathbf{t} = 0$. It maps the rotated source centroid onto the target centroid:</p>
<p>$$
\mathbf{t} = \bar{\mathbf{q}} - R\bar{\mathbf{p}}
$$</p>
<p>A tempting shortcut is to write $\mathbf{t} = \bar{\mathbf{q}} - \bar{\mathbf{p}}$, but that is only correct when $R = I$. In general the translation depends on the rotation, so we compute it <em>after</em> solving for $R$. Substituting this optimal $\mathbf{t}$ back into the objective cancels the centroids and leaves a rotation-only problem in the centered coordinates $\mathbf{p}_i^\prime = \mathbf{p}_i - \bar{\mathbf{p}}$ and $\mathbf{q}_i^\prime = \mathbf{q}_i - \bar{\mathbf{q}}$:</p>
<p>$$
\mathcal{L}(R) = \frac{1}{2} \sum_{i=1}^N \| \mathbf{q}_i^\prime - R\mathbf{p}_i^\prime \|^2
$$</p>
<p>which is what the next section solves.</p>
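<p>For readers who want the intermediate steps, the two claims above can be checked directly (a short sketch using only the definitions already introduced). Differentiating with respect to $\mathbf{t}$ and setting the result to zero gives</p>
<p>$$
\frac{\partial \mathcal{L}}{\partial \mathbf{t}} = -\sum_{i=1}^N \left( \mathbf{q}_i - R\mathbf{p}_i - \mathbf{t} \right) = \mathbf{0}
\quad \Longrightarrow \quad
N\mathbf{t} = \sum_{i=1}^N \mathbf{q}_i - R \sum_{i=1}^N \mathbf{p}_i
\quad \Longrightarrow \quad
\mathbf{t} = \bar{\mathbf{q}} - R\bar{\mathbf{p}}
$$</p>
<p>and each residual then reduces to centered coordinates:</p>
<p>$$
\mathbf{q}_i - (R\mathbf{p}_i + \bar{\mathbf{q}} - R\bar{\mathbf{p}}) = (\mathbf{q}_i - \bar{\mathbf{q}}) - R(\mathbf{p}_i - \bar{\mathbf{p}}) = \mathbf{q}_i^\prime - R\mathbf{p}_i^\prime
$$</p>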
<h3 id="the-rotation-matrix">The Rotation Matrix</h3>
<p>We now minimize $\mathcal{L}(R)$ over rotations, using the centered points $\mathbf{p}_i^\prime$ and $\mathbf{q}_i^\prime$ from above. Compute the cross-covariance matrix between the centered sets:</p>
<p>$$
C = P^{\prime T} Q^\prime = \sum_{i=1}^N \mathbf{p}_i^{\prime} \mathbf{q}_i^{\prime T} \in \mathbb{R}^{D \times D}
$$</p>
<p>This is a fairly lightweight operation since $D$ is typically small (e.g., 3 for 3D points), even if $N$ is large.</p>
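<p>To make the shapes concrete, here is a small NumPy check that the matrix product of the centered sets equals the sum of per-point outer products and is only $D \times D$, regardless of $N$. The random data and variable names are purely illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
P = rng.normal(size=(N, D))
Q = rng.normal(size=(N, D))

# Center both sets on their centroids.
p = P - P.mean(axis=0)
q = Q - Q.mean(axis=0)

# Matrix form: (D x N) @ (N x D) gives a D x D matrix.
C = p.T @ q

# Same quantity as an explicit sum of per-point outer products.
C_sum = sum(np.outer(p_i, q_i) for p_i, q_i in zip(p, q))

print(C.shape)                # (3, 3)
print(np.allclose(C, C_sum))  # True
```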
<p>With $C$ in hand, we want to compute its Singular Value Decomposition (SVD):</p>
<p>$$
C = U \Sigma V^T
$$</p>
<p>In general this is the most expensive step: it scales cubically with $D$ (i.e., $O(D^3)$).
However, since we&rsquo;re usually interested in small $D$ (e.g., 2D or 3D points), the cost is manageable.</p>
<p>Next, we check for improper rotations (i.e., reflections) and correct for them where necessary:</p>
<p>$$
d = \text{sign}(\det(V U^T))
$$</p>
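<p>If $d = -1$, the unconstrained solution is a reflection, so we flip the sign of the last column of $V$ when forming the final rotation matrix.</p>
<p>Let $B = \text{diag}(1, 1, d)$ (for general $D$, the identity matrix with its last diagonal entry replaced by $d$).
The optimal rotation matrix is then:</p>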
<p>$$
R^\ast = V B U^T
$$</p>
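<p>A brief sketch of why this works, using the symbols defined above: expanding the squared norm shows that only one term depends on $R$,</p>
<p>$$
\mathcal{L}(R) = \frac{1}{2} \sum_{i=1}^N \left( \| \mathbf{q}_i^\prime \|^2 + \| \mathbf{p}_i^\prime \|^2 \right) - \text{tr}(R C)
$$</p>
<p>so minimizing $\mathcal{L}$ means maximizing $\text{tr}(RC)$. Inserting the SVD and writing $M = V^T R U$ (an orthogonal matrix, introduced here only for this argument) gives</p>
<p>$$
\text{tr}(R C) = \text{tr}(R U \Sigma V^T) = \text{tr}(M \Sigma) = \sum_{k=1}^D M_{kk} \sigma_k
$$</p>
<p>Every diagonal entry of an orthogonal matrix satisfies $|M_{kk}| \le 1$ and the singular values $\sigma_k$ are non-negative, so the sum is largest at $M = I$, i.e. $R = V U^T$. When $\det(V U^T) = -1$ that choice is a reflection; the best proper rotation instead takes $M = B$, sacrificing only the smallest singular value, which yields $R^\ast = V B U^T$.</p>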
<h3 id="summary">Summary</h3>
<p>In a nutshell, the Kabsch algorithm boils down to:</p>
<ol>
<li>Compute centroids of $P$ and $Q$ ($\bar{\mathbf{p}}$ and $\bar{\mathbf{q}}$)</li>
<li>Center both point sets by subtracting centroids: $P^\prime$ and $Q^\prime$</li>
<li>Compute cross-covariance matrix $C = P^{\prime T} Q^\prime$</li>
<li>Compute SVD: $C = U \Sigma V^T$ (<em>expensive step</em>)</li>
<li>Compute $d = \text{sign}(\det(V U^T))$ and $B = \text{diag}(1, 1, d)$</li>
<li>Optimal rotation: $R^\ast = V B U^T$</li>
<li>Optimal translation (using the rotation from step 6): $\mathbf{t}^\ast = \bar{\mathbf{q}} - R^\ast\bar{\mathbf{p}}$</li>
</ol>
<p>The resulting root-mean-square deviation (RMSD) between aligned point sets is</p>
<p>$$
\text{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^N \| \mathbf{q}_i - (R^\ast\mathbf{p}_i + \mathbf{t}^\ast) \|^2}
$$</p>















<figure class="post-figure center ">
    <img src="/img/scientific-computing/kabsch-algorithm-visualized-rmsd.webp"
         alt="Diagram illustrating Root Mean Square Deviation (RMSD) distances"
         title="Diagram illustrating Root Mean Square Deviation (RMSD) distances"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">RMSD measures the average distance between the aligned points.</figcaption>
    
</figure>

<p>which is frequently used as a measure of similarity between molecular structures or as a metric in loss functions for ML applications.</p>
<h3 id="the-kabsch-umeyama-algorithm-scaling">The Kabsch-Umeyama Algorithm (Scaling)</h3>
<p>While the standard Kabsch algorithm solves for optimal rotation and translation, the <strong>Kabsch-Umeyama algorithm</strong> extends this by also finding an optimal <strong>scaling factor</strong> $c$. This is essential when aligning structures of different scales, such as a 3D scan versus a ground truth model.</p>
<p><em>(Note: This is sometimes searched for as the &ldquo;Absch-Umeyama algorithm&rdquo; due to typos, but the correct attribution is to Shinji Umeyama based on Wolfgang Kabsch&rsquo;s work.)</em></p>
<p>The method estimates the transformation $\mathbf{q}_i \approx c R \mathbf{p}_i + \mathbf{t}$. The optimal scale is the sum of the (reflection-corrected) singular values of the cross-covariance, $\text{tr}(\Sigma B)$, divided by the total squared deviation of the source points about their centroid, $\sum_i \| \mathbf{p}_i^\prime \|^2$. Dividing both quantities by $N$ gives the equivalent covariance-over-variance form. See the <a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">Umeyama paper notes</a> for the full derivation.</p>
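<p>As a rough illustration of the scaled variant, here is a minimal NumPy sketch that follows the same cross-covariance convention as the Kabsch steps above. The function name <code>kabsch_umeyama_numpy</code> is hypothetical, the code is unweighted, and it is meant as a sketch under those assumptions, not a reference implementation.</p>

```python
import numpy as np


def kabsch_umeyama_numpy(P, Q):
    """Estimate scale c, rotation R, translation t with Q ~ c * P @ R.T + t."""
    centroid_P = P.mean(axis=0)
    centroid_Q = Q.mean(axis=0)
    p = P - centroid_P
    q = Q - centroid_Q

    # Cross-covariance and its SVD (same convention as the Kabsch steps).
    U, S, Vt = np.linalg.svd(p.T @ q)

    # Reflection correction: b is the diagonal of B, with d in the last slot.
    d = np.copysign(1.0, np.linalg.det(Vt.T @ U.T))
    b = np.ones(P.shape[1])
    b[-1] = d
    R = Vt.T @ np.diag(b) @ U.T

    # Scale: corrected singular values over the spread of the source points.
    c = np.sum(S * b) / np.sum(p ** 2)

    # Translation maps the scaled, rotated source centroid onto the target.
    t = centroid_Q - c * R @ centroid_P
    return c, R, t
```

<p>A simple way to sanity-check a sketch like this is to build <code>Q</code> from a known scale, rotation, and translation applied to random <code>P</code>, then confirm that the three returned values match the ones used to generate the data.</p>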
<p><strong>A Note on SVD and Automatic Differentiation</strong></p>
<p>While modern frameworks allow us to backpropagate through the Singular Value Decomposition (SVD), it comes with a known stability issue: if the cross-covariance matrix has identical (degenerate) singular values (which can occur if the point clouds are perfectly aligned or have certain symmetries), the gradient of the SVD approaches infinity, causing <code>NaN</code> values during backpropagation. If you plan to use this algorithm as a loss function for a neural network, it is often necessary to add a tiny epsilon to the matrix before computing the SVD, or to utilize an SVD gradient patch. The <a href="/projects/kabsch-horn-cookbook/">Kabsch-Horn Cookbook</a> library provides a SafeSVD primitive that floors the singular-value-gap denominator at machine epsilon in the backward pass, producing finite gradients at degenerate inputs across PyTorch, JAX, TensorFlow, and MLX.</p>
<h2 id="implementation">Implementation</h2>
<p>Let&rsquo;s implement the algorithm in different frameworks. Note that for simplicity, the following implementations cover the <strong>unweighted</strong> Kabsch algorithm. If your application (like molecular dynamics) requires weights (e.g., atomic masses), the <a href="/projects/kabsch-horn-cookbook/">Kabsch-Horn Cookbook</a> library provides per-point weighted alignment out of the box.</p>
<h3 id="numpy">NumPy</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kabsch_numpy</span>(P, Q):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the optimal rotation and translation to align two sets of points (P -&gt; Q),
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    and their RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param P: A Nx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param Q: A Nx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :return: A tuple containing the optimal rotation matrix, the optimal
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             translation vector, and the RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> P<span style="color:#f92672">.</span>shape <span style="color:#f92672">==</span> Q<span style="color:#f92672">.</span>shape, <span style="color:#e6db74">&#34;Matrix dimensions must match&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute centroids</span>
</span></span><span style="display:flex;"><span>    centroid_P <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>mean(P, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>    centroid_Q <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>mean(Q, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Center the points</span>
</span></span><span style="display:flex;"><span>    p <span style="color:#f92672">=</span> P <span style="color:#f92672">-</span> centroid_P
</span></span><span style="display:flex;"><span>    q <span style="color:#f92672">=</span> Q <span style="color:#f92672">-</span> centroid_Q
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute the covariance matrix</span>
</span></span><span style="display:flex;"><span>    H <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>dot(p<span style="color:#f92672">.</span>T, q)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># SVD</span>
</span></span><span style="display:flex;"><span>    U, S, Vt <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>svd(H)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Validate right-handed coordinate system</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>det(np<span style="color:#f92672">.</span>dot(Vt<span style="color:#f92672">.</span>T, U<span style="color:#f92672">.</span>T)) <span style="color:#f92672">&lt;</span> <span style="color:#ae81ff">0.0</span>:
</span></span><span style="display:flex;"><span>        Vt[<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, :] <span style="color:#f92672">*=</span> <span style="color:#f92672">-</span><span style="color:#ae81ff">1.0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal rotation</span>
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>dot(Vt<span style="color:#f92672">.</span>T, U<span style="color:#f92672">.</span>T)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal translation (depends on R, so computed after it)</span>
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> centroid_Q <span style="color:#f92672">-</span> np<span style="color:#f92672">.</span>dot(R, centroid_P)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># RMSD</span>
</span></span><span style="display:flex;"><span>    rmsd <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>sqrt(np<span style="color:#f92672">.</span>sum(np<span style="color:#f92672">.</span>square(np<span style="color:#f92672">.</span>dot(p, R<span style="color:#f92672">.</span>T) <span style="color:#f92672">-</span> q)) <span style="color:#f92672">/</span> P<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> R, t, rmsd
</span></span></code></pre></div><p>Here&rsquo;s a quick test to verify correctness:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">test_numpy</span>():
</span></span><span style="display:flex;"><span>    np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>seed(<span style="color:#ae81ff">12345</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    P <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>randn(<span style="color:#ae81ff">100</span>, <span style="color:#ae81ff">3</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    alpha <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>rand() <span style="color:#f92672">*</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">*</span> np<span style="color:#f92672">.</span>pi
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>array([[np<span style="color:#f92672">.</span>cos(alpha), <span style="color:#f92672">-</span>np<span style="color:#f92672">.</span>sin(alpha), <span style="color:#ae81ff">0</span>],
</span></span><span style="display:flex;"><span>                    [np<span style="color:#f92672">.</span>sin(alpha), np<span style="color:#f92672">.</span>cos(alpha), <span style="color:#ae81ff">0</span>],
</span></span><span style="display:flex;"><span>                    [<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>]])
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>randn(<span style="color:#ae81ff">3</span>) <span style="color:#f92672">*</span> <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    Q <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>dot(P, R<span style="color:#f92672">.</span>T) <span style="color:#f92672">+</span> t
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    R_opt, t_opt, rmsd <span style="color:#f92672">=</span> kabsch_numpy(P, Q)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#39;RMSD: </span><span style="color:#e6db74">{}</span><span style="color:#e6db74">&#39;</span><span style="color:#f92672">.</span>format(rmsd))
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#39;R:</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">&#39;</span><span style="color:#f92672">.</span>format(R))
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#39;R_opt:</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">&#39;</span><span style="color:#f92672">.</span>format(R_opt))
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#39;t:</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">&#39;</span><span style="color:#f92672">.</span>format(t))
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#39;t_opt:</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">&#39;</span><span style="color:#f92672">.</span>format(t_opt))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    l2_t <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>norm(t <span style="color:#f92672">-</span> t_opt)
</span></span><span style="display:flex;"><span>    l2_R <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>norm(R <span style="color:#f92672">-</span> R_opt)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#39;l2_t: </span><span style="color:#e6db74">{}</span><span style="color:#e6db74">&#39;</span><span style="color:#f92672">.</span>format(l2_t))
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#39;l2_R: </span><span style="color:#e6db74">{}</span><span style="color:#e6db74">&#39;</span><span style="color:#f92672">.</span>format(l2_R))
</span></span></code></pre></div><p>Running this test shows the algorithm correctly recovers the rotation and translation:</p>
<pre><code>RMSD: 3.2111501877699246e-15
R:
[[-0.8475392 -0.5307328  0.       ]
 [ 0.5307328 -0.8475392  0.       ]
 [ 0.         0.         1.       ]]
R_opt:
[[-8.47539198e-01 -5.30732803e-01 -2.95434260e-16]
 [ 5.30732803e-01 -8.47539198e-01  2.92859649e-16]
 [ 0.00000000e+00 -2.77555756e-16  1.00000000e+00]]
t:
[ 5.99726796  1.50078468 -3.34633977]
t_opt:
[ 5.99726796  1.50078468 -3.34633977]
l2_t: 2.7012892057857038e-15
l2_R: 8.028174304721057e-16
</code></pre>
<p>Both the rotation and the translation are recovered to within floating-point precision (the residuals <code>l2_t</code> and <code>l2_R</code> are on the order of <code>1e-15</code>).</p>
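<p>The printed residuals can also be turned into explicit assertions. The following is a minimal sketch, assuming a compact restatement of the routine above: <code>kabsch_sketch</code> and <code>check_recovery</code> are hypothetical names used only for illustration, and the random rotation is built from a QR factorization, not from the single-axis rotation used in the test above.</p>
<pre><code class="language-python" data-lang="python">import numpy as np


def kabsch_sketch(P, Q):
    # Compact restatement of the NumPy routine above (hypothetical helper).
    centroid_P = P.mean(axis=0)
    centroid_Q = Q.mean(axis=0)
    H = (P - centroid_P).T @ (Q - centroid_Q)
    U, _, Vt = np.linalg.svd(H)
    # The sign of det(V U^T) decides whether the last singular direction flips.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = centroid_Q - R @ centroid_P
    return R, t


def check_recovery(seed=0, tol=1e-10):
    # Assumed setup: random points, a random proper rotation, a random shift.
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((50, 3))
    A, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    A[:, 0] *= np.sign(np.linalg.det(A))  # force det(A) = +1
    t_true = 10.0 * rng.standard_normal(3)
    Q = P @ A.T + t_true

    R, t = kabsch_sketch(P, Q)
    return {
        "rotation_ok": bool(np.allclose(R, A, atol=tol)),
        "translation_ok": bool(np.allclose(t, t_true, atol=tol)),
        "proper": bool(np.isclose(np.linalg.det(R), 1.0)),
        "orthogonal": bool(np.allclose(R @ R.T, np.eye(3), atol=tol)),
    }


print(check_recovery())
</code></pre>
<p>Checking the determinant and orthogonality alongside the residuals guards against a result that matches the points but is a reflection, not a rotation.</p>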
<p>For batch processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kabsch_numpy_batched</span>(P, Q):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the optimal rotation and translation to align two sets of points (P -&gt; Q),
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    and their RMSD, in a batched manner.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param P: A BxNx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param Q: A BxNx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :return: A tuple containing the optimal rotation matrix, the optimal
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             translation vector, and the RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> P<span style="color:#f92672">.</span>shape <span style="color:#f92672">==</span> Q<span style="color:#f92672">.</span>shape, <span style="color:#e6db74">&#34;Matrix dimensions must match&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute centroids</span>
</span></span><span style="display:flex;"><span>    centroid_P <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>mean(P, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, keepdims<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)  <span style="color:#75715e"># Bx1x3</span>
</span></span><span style="display:flex;"><span>    centroid_Q <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>mean(Q, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, keepdims<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)  <span style="color:#75715e"># Bx1x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Center the points</span>
</span></span><span style="display:flex;"><span>    p <span style="color:#f92672">=</span> P <span style="color:#f92672">-</span> centroid_P  <span style="color:#75715e"># BxNx3</span>
</span></span><span style="display:flex;"><span>    q <span style="color:#f92672">=</span> Q <span style="color:#f92672">-</span> centroid_Q  <span style="color:#75715e"># BxNx3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute the covariance matrix</span>
</span></span><span style="display:flex;"><span>    H <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>matmul(p<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>), q)  <span style="color:#75715e"># Bx3x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># SVD</span>
</span></span><span style="display:flex;"><span>    U, S, Vt <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>svd(H)  <span style="color:#75715e"># Bx3x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Validate right-handed coordinate system</span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>det(np<span style="color:#f92672">.</span>matmul(Vt<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>), U<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>)))
</span></span><span style="display:flex;"><span>    flip <span style="color:#f92672">=</span> d <span style="color:#f92672">&lt;</span> <span style="color:#ae81ff">0.0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> flip<span style="color:#f92672">.</span>any():
</span></span><span style="display:flex;"><span>        Vt[flip, <span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, :] <span style="color:#f92672">*=</span> <span style="color:#f92672">-</span><span style="color:#ae81ff">1.0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal rotation</span>
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>matmul(Vt<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>), U<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>))  <span style="color:#75715e"># Bx3x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal translation (depends on R, so computed after it)</span>
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> centroid_Q<span style="color:#f92672">.</span>squeeze(<span style="color:#ae81ff">1</span>) <span style="color:#f92672">-</span> np<span style="color:#f92672">.</span>matmul(centroid_P, R<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>))<span style="color:#f92672">.</span>squeeze(<span style="color:#ae81ff">1</span>)  <span style="color:#75715e"># Bx3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># RMSD</span>
</span></span><span style="display:flex;"><span>    rmsd <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>sqrt(np<span style="color:#f92672">.</span>sum(np<span style="color:#f92672">.</span>square(np<span style="color:#f92672">.</span>matmul(p, R<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>)) <span style="color:#f92672">-</span> q), axis<span style="color:#f92672">=</span>(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)) <span style="color:#f92672">/</span> P<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">1</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> R, t, rmsd
</span></span></code></pre></div><h3 id="pytorch">PyTorch</h3>


<p><details >
  <summary markdown="span">📝 Important Update (February 15, 2026)</summary>
  <strong>Bug Fix Notice:</strong> The PyTorch implementation has been updated to use the &ldquo;B-matrix&rdquo; broadcasting approach. This eliminates in-place tensor modification (which breaks <code>autograd</code>) and data-dependent control flow (which breaks <code>torch.compile</code> and <code>torch.vmap</code>).
</details></p>
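<p>To see why a broadcast correction can stand in for the branch, the sketch below compares the two on a case that needs the reflection fix. It is an illustration under stated assumptions: the helper names are hypothetical, NumPy is used only to keep the comparison framework-neutral, and the target is mirrored on purpose so that the uncorrected product has determinant -1.</p>
<pre><code class="language-python" data-lang="python">import numpy as np


def rotation_with_branch(U, Vt):
    # Branching variant (hypothetical helper): flip the last row of a copy
    # of Vt when det(V U^T) is negative.
    Vt = Vt.copy()
    if np.signbit(np.linalg.det(Vt.T @ U.T)):
        Vt[-1, :] *= -1.0
    return Vt.T @ U.T


def rotation_with_b_diag(U, Vt):
    # Branch-free variant (hypothetical helper): scale the columns of V
    # by (1, 1, sign(det)) through broadcasting.
    d = np.linalg.det(Vt.T @ U.T)
    b_diag = np.array([1.0, 1.0, np.sign(d)])
    return (Vt.T * b_diag[None, :]) @ U.T


rng = np.random.default_rng(0)
P = rng.standard_normal((20, 3))
# Mirror the target across the xy-plane so that V U^T alone is a reflection.
Q = P * np.array([1.0, 1.0, -1.0])
p = P - P.mean(axis=0)
q = Q - Q.mean(axis=0)
U, _, Vt = np.linalg.svd(p.T @ q)

R_branch = rotation_with_branch(U, Vt)
R_bdiag = rotation_with_b_diag(U, Vt)
print(np.allclose(R_branch, R_bdiag))
print(np.linalg.det(R_bdiag))
</code></pre>
<p>Scaling the last column of <code>V</code> is the same operation as negating the last row of <code>Vt</code>, so the two variants agree while the second needs neither mutation nor a conditional.</p>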

<p>The PyTorch implementation applies the reflection correction through broadcasting, which avoids in-place mutation and keeps the function differentiable:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> torch
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kabsch_torch</span>(P, Q):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the optimal rotation and translation to align two sets of points (P -&gt; Q),
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    and their RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param P: A Nx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param Q: A Nx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :return: A tuple containing the optimal rotation matrix, the optimal
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             translation vector, and the RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> P<span style="color:#f92672">.</span>shape <span style="color:#f92672">==</span> Q<span style="color:#f92672">.</span>shape, <span style="color:#e6db74">&#34;Matrix dimensions must match&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute centroids</span>
</span></span><span style="display:flex;"><span>    centroid_P <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>mean(P, dim<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>    centroid_Q <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>mean(Q, dim<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Center the points</span>
</span></span><span style="display:flex;"><span>    p <span style="color:#f92672">=</span> P <span style="color:#f92672">-</span> centroid_P
</span></span><span style="display:flex;"><span>    q <span style="color:#f92672">=</span> Q <span style="color:#f92672">-</span> centroid_Q
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute the covariance matrix</span>
</span></span><span style="display:flex;"><span>    H <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>matmul(p<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>), q)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># SVD</span>
</span></span><span style="display:flex;"><span>    U, S, Vt <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>svd(H)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 1. Calculate determinant</span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>det(torch<span style="color:#f92672">.</span>matmul(Vt<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>), U<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 2. Build diagonal B tensor without in-place mutation</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># We use stack to preserve gradients and graph connections</span>
</span></span><span style="display:flex;"><span>    B_diag <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>stack([torch<span style="color:#f92672">.</span>tensor(<span style="color:#ae81ff">1.0</span>, device<span style="color:#f92672">=</span>d<span style="color:#f92672">.</span>device, dtype<span style="color:#f92672">=</span>d<span style="color:#f92672">.</span>dtype),
</span></span><span style="display:flex;"><span>                          torch<span style="color:#f92672">.</span>tensor(<span style="color:#ae81ff">1.0</span>, device<span style="color:#f92672">=</span>d<span style="color:#f92672">.</span>device, dtype<span style="color:#f92672">=</span>d<span style="color:#f92672">.</span>dtype),
</span></span><span style="display:flex;"><span>                          torch<span style="color:#f92672">.</span>sign(d)])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 3. Scale columns of Vt.T via broadcasting, then multiply by U^T</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Vt.T: (3, 3). B_diag: (3) -&gt; B_diag[None, :]: (1, 3)</span>
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>matmul(Vt<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>) <span style="color:#f92672">*</span> B_diag[<span style="color:#66d9ef">None</span>, :], U<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal translation (depends on R, so computed after it)</span>
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> centroid_Q <span style="color:#f92672">-</span> centroid_P <span style="color:#f92672">@</span> R<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># RMSD</span>
</span></span><span style="display:flex;"><span>    rmsd <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>sqrt(torch<span style="color:#f92672">.</span>sum(torch<span style="color:#f92672">.</span>square(torch<span style="color:#f92672">.</span>matmul(p, R<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>)) <span style="color:#f92672">-</span> q)) <span style="color:#f92672">/</span> P<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> R, t, rmsd
</span></span></code></pre></div><p>And our batched version:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kabsch_torch_batched</span>(P, Q):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the optimal rotation and translation to align two sets of points (P -&gt; Q),
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    and their RMSD, in a batched manner.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param P: A BxNx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param Q: A BxNx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :return: A tuple containing the optimal rotation matrix, the optimal
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             translation vector, and the RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> P<span style="color:#f92672">.</span>shape <span style="color:#f92672">==</span> Q<span style="color:#f92672">.</span>shape, <span style="color:#e6db74">&#34;Matrix dimensions must match&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute centroids</span>
</span></span><span style="display:flex;"><span>    centroid_P <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>mean(P, dim<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, keepdim<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)  <span style="color:#75715e"># Bx1x3</span>
</span></span><span style="display:flex;"><span>    centroid_Q <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>mean(Q, dim<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, keepdim<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)  <span style="color:#75715e"># Bx1x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Center the points</span>
</span></span><span style="display:flex;"><span>    p <span style="color:#f92672">=</span> P <span style="color:#f92672">-</span> centroid_P  <span style="color:#75715e"># BxNx3</span>
</span></span><span style="display:flex;"><span>    q <span style="color:#f92672">=</span> Q <span style="color:#f92672">-</span> centroid_Q  <span style="color:#75715e"># BxNx3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute the covariance matrix</span>
</span></span><span style="display:flex;"><span>    H <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>matmul(p<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>), q)  <span style="color:#75715e"># Bx3x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># SVD</span>
</span></span><span style="display:flex;"><span>    U, S, Vt <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>svd(H)  <span style="color:#75715e"># Bx3x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 1. Calculate batched determinant</span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>det(torch<span style="color:#f92672">.</span>matmul(Vt<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>), U<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)))  <span style="color:#75715e"># B</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 2. Build batched B_diag without in-place mutation or control flow</span>
</span></span><span style="display:flex;"><span>    ones <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>ones_like(d)
</span></span><span style="display:flex;"><span>    B_diag <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>stack([ones, ones, torch<span style="color:#f92672">.</span>sign(d)], dim<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>) <span style="color:#75715e"># Bx3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 3. Scale columns of Vt.T and multiply</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Vt.T: (B, 3, 3). B_diag: (B, 3). B_diag[:, None, :]: (B, 1, 3).</span>
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>matmul(Vt<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>) <span style="color:#f92672">*</span> B_diag[:, <span style="color:#66d9ef">None</span>, :], U<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal translation (depends on R, so computed after it)</span>
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> centroid_Q<span style="color:#f92672">.</span>squeeze(<span style="color:#ae81ff">1</span>) <span style="color:#f92672">-</span> torch<span style="color:#f92672">.</span>matmul(centroid_P, R<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>))<span style="color:#f92672">.</span>squeeze(<span style="color:#ae81ff">1</span>)  <span style="color:#75715e"># Bx3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># RMSD</span>
</span></span><span style="display:flex;"><span>    rmsd <span style="color:#f92672">=</span> torch<span style="color:#f92672">.</span>sqrt(torch<span style="color:#f92672">.</span>sum(torch<span style="color:#f92672">.</span>square(torch<span style="color:#f92672">.</span>matmul(p, R<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)) <span style="color:#f92672">-</span> q), dim<span style="color:#f92672">=</span>(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)) <span style="color:#f92672">/</span> P<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">1</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> R, t, rmsd
</span></span></code></pre></div><h3 id="tensorflow">TensorFlow</h3>
<p>In TensorFlow, <code>tf.linalg.svd</code> returns the singular values first, followed by <code>U</code> and <code>V</code> (the untransposed $V$, not $V^T$ as in NumPy). Because tensors are immutable and the function may be compiled (e.g., via <code>@tf.function</code>), we avoid explicit conditional branching and instead build the diagonal of a correction matrix $B$ and apply it by broadcasting.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> tensorflow <span style="color:#66d9ef">as</span> tf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kabsch_tensorflow</span>(P, Q):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the optimal rotation and translation to align two sets of points (P -&gt; Q),
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    and their RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param P: A Nx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param Q: A Nx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :return: A tuple containing the optimal rotation matrix, the optimal
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             translation vector, and the RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    P <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>convert_to_tensor(P, dtype<span style="color:#f92672">=</span>tf<span style="color:#f92672">.</span>float32)
</span></span><span style="display:flex;"><span>    Q <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>convert_to_tensor(Q, dtype<span style="color:#f92672">=</span>tf<span style="color:#f92672">.</span>float32)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> P<span style="color:#f92672">.</span>shape <span style="color:#f92672">==</span> Q<span style="color:#f92672">.</span>shape, <span style="color:#e6db74">&#34;Matrix dimensions must match&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute centroids</span>
</span></span><span style="display:flex;"><span>    centroid_P <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>reduce_mean(P, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>    centroid_Q <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>reduce_mean(Q, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Center the points</span>
</span></span><span style="display:flex;"><span>    p <span style="color:#f92672">=</span> P <span style="color:#f92672">-</span> centroid_P
</span></span><span style="display:flex;"><span>    q <span style="color:#f92672">=</span> Q <span style="color:#f92672">-</span> centroid_Q
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute the covariance matrix</span>
</span></span><span style="display:flex;"><span>    H <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>matmul(tf<span style="color:#f92672">.</span>transpose(p), q)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># SVD</span>
</span></span><span style="display:flex;"><span>    S, U, V <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>svd(H)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 1. Calculate determinant</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Note: V in TF SVD is V, not V^T.</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># R = V * U^T. Det(R) = Det(V * U^T)</span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>det(tf<span style="color:#f92672">.</span>matmul(V, tf<span style="color:#f92672">.</span>transpose(U)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 2. Build diagonal B tensor: [1.0, 1.0, sign(d)]</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Assumes 3D points (D = 3); for general D, use D - 1 ones followed by sign(d).</span>
</span></span><span style="display:flex;"><span>    B_diag <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>stack([<span style="color:#ae81ff">1.0</span>, <span style="color:#ae81ff">1.0</span>, tf<span style="color:#f92672">.</span>sign(d)])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 3. Scale columns of V via broadcasting (V * B_diag), then multiply by U^T</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># V is DxD, B_diag is D. V * B_diag[None, :] multiplies each column j by B_diag[j]</span>
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>matmul(V <span style="color:#f92672">*</span> B_diag[<span style="color:#66d9ef">None</span>, :], tf<span style="color:#f92672">.</span>transpose(U))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal translation (depends on R, so computed after it)</span>
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> centroid_Q <span style="color:#f92672">-</span> tf<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>matvec(R, centroid_P)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># RMSD</span>
</span></span><span style="display:flex;"><span>    rmsd <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>sqrt(tf<span style="color:#f92672">.</span>reduce_sum(tf<span style="color:#f92672">.</span>square(tf<span style="color:#f92672">.</span>matmul(p, tf<span style="color:#f92672">.</span>transpose(R)) <span style="color:#f92672">-</span> q)) <span style="color:#f92672">/</span> P<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> R, t, rmsd
</span></span></code></pre></div><p>and a batched version:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kabsch_tensorflow_batched</span>(P, Q):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the optimal rotation and translation to align two sets of points (P -&gt; Q),
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    and their RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param P: A BxNx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param Q: A BxNx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :return: A tuple containing the optimal rotation matrix, the optimal
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             translation vector, and the RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    P <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>convert_to_tensor(P, dtype<span style="color:#f92672">=</span>tf<span style="color:#f92672">.</span>float32)
</span></span><span style="display:flex;"><span>    Q <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>convert_to_tensor(Q, dtype<span style="color:#f92672">=</span>tf<span style="color:#f92672">.</span>float32)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> P<span style="color:#f92672">.</span>shape <span style="color:#f92672">==</span> Q<span style="color:#f92672">.</span>shape, <span style="color:#e6db74">&#34;Matrix dimensions must match&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute centroids</span>
</span></span><span style="display:flex;"><span>    centroid_P <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>reduce_mean(P, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, keepdims<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>    centroid_Q <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>reduce_mean(Q, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, keepdims<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Center the points</span>
</span></span><span style="display:flex;"><span>    p <span style="color:#f92672">=</span> P <span style="color:#f92672">-</span> centroid_P
</span></span><span style="display:flex;"><span>    q <span style="color:#f92672">=</span> Q <span style="color:#f92672">-</span> centroid_Q
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute the covariance matrix</span>
</span></span><span style="display:flex;"><span>    H <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>matmul(tf<span style="color:#f92672">.</span>transpose(p, perm<span style="color:#f92672">=</span>[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>]), q)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># SVD</span>
</span></span><span style="display:flex;"><span>    S, U, V <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>svd(H)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 1. Calculate batched determinant</span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>det(tf<span style="color:#f92672">.</span>matmul(V, tf<span style="color:#f92672">.</span>transpose(U, perm<span style="color:#f92672">=</span>[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>])))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 2. Build batched B_diag: shape (B, 3)</span>
</span></span><span style="display:flex;"><span>    ones <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>ones_like(d)
</span></span><span style="display:flex;"><span>    B_diag <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>stack([ones, ones, tf<span style="color:#f92672">.</span>sign(d)], axis<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 3. Scale columns of V (Broadcasting adds the middle dimension)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># V: (B, 3, 3), B_diag: (B, 3) -&gt; B_diag[:, None, :]: (B, 1, 3)</span>
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>matmul(V <span style="color:#f92672">*</span> B_diag[:, <span style="color:#66d9ef">None</span>, :], tf<span style="color:#f92672">.</span>transpose(U, perm<span style="color:#f92672">=</span>[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>]))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal translation (depends on R, so computed after it)</span>
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>squeeze(centroid_Q, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>) <span style="color:#f92672">-</span> tf<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>matvec(R, tf<span style="color:#f92672">.</span>squeeze(centroid_P, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>))  <span style="color:#75715e"># Bx3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># RMSD</span>
</span></span><span style="display:flex;"><span>    rmsd <span style="color:#f92672">=</span> tf<span style="color:#f92672">.</span>sqrt(tf<span style="color:#f92672">.</span>reduce_sum(tf<span style="color:#f92672">.</span>square(tf<span style="color:#f92672">.</span>matmul(p, tf<span style="color:#f92672">.</span>transpose(R, perm<span style="color:#f92672">=</span>[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>])) <span style="color:#f92672">-</span> q), axis<span style="color:#f92672">=</span>(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)) <span style="color:#f92672">/</span> P<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">1</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> R, t, rmsd
</span></span></code></pre></div><h3 id="jax">JAX</h3>
<p>The JAX implementation closely mirrors NumPy, replacing <code>np</code> with <code>jnp</code>. However, JAX arrays are immutable, and a Python <code>if</code> on a traced value fails under <code>jax.jit</code>, so we again avoid branching and in-place assignment by using the broadcasting $B$-matrix approach.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> jax.numpy <span style="color:#66d9ef">as</span> jnp
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kabsch_jax</span>(P, Q):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the optimal rotation and translation to align two sets of points (P -&gt; Q),
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    and their RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param P: A Nx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param Q: A Nx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :return: A tuple containing the optimal rotation matrix, the optimal
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             translation vector, and the RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    P <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>array(P)
</span></span><span style="display:flex;"><span>    Q <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>array(Q)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> P<span style="color:#f92672">.</span>shape <span style="color:#f92672">==</span> Q<span style="color:#f92672">.</span>shape, <span style="color:#e6db74">&#34;Matrix dimensions must match&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute centroids</span>
</span></span><span style="display:flex;"><span>    centroid_P <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>mean(P, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>    centroid_Q <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>mean(Q, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Center the points</span>
</span></span><span style="display:flex;"><span>    p <span style="color:#f92672">=</span> P <span style="color:#f92672">-</span> centroid_P
</span></span><span style="display:flex;"><span>    q <span style="color:#f92672">=</span> Q <span style="color:#f92672">-</span> centroid_Q
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute the covariance matrix</span>
</span></span><span style="display:flex;"><span>    H <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>dot(p<span style="color:#f92672">.</span>T, q)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># SVD</span>
</span></span><span style="display:flex;"><span>    U, S, Vt <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>svd(H)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 1. Calculate determinant</span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>det(jnp<span style="color:#f92672">.</span>dot(Vt<span style="color:#f92672">.</span>T, U<span style="color:#f92672">.</span>T))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 2. Build diagonal B array</span>
</span></span><span style="display:flex;"><span>    B_diag <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>array([<span style="color:#ae81ff">1.0</span>, <span style="color:#ae81ff">1.0</span>, jnp<span style="color:#f92672">.</span>sign(d)])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 3. Scale columns of Vt.T and multiply by U.T</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Vt.T is V.</span>
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>dot(Vt<span style="color:#f92672">.</span>T <span style="color:#f92672">*</span> B_diag[<span style="color:#66d9ef">None</span>, :], U<span style="color:#f92672">.</span>T)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal translation (depends on R, so computed after it)</span>
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> centroid_Q <span style="color:#f92672">-</span> jnp<span style="color:#f92672">.</span>dot(R, centroid_P)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># RMSD</span>
</span></span><span style="display:flex;"><span>    rmsd <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>sqrt(jnp<span style="color:#f92672">.</span>sum(jnp<span style="color:#f92672">.</span>square(jnp<span style="color:#f92672">.</span>dot(p, R<span style="color:#f92672">.</span>T) <span style="color:#f92672">-</span> q)) <span style="color:#f92672">/</span> P<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> R, t, rmsd
</span></span></code></pre></div><p>and a batched version:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kabsch_jax_batched</span>(P, Q):
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Computes the optimal rotation and translation to align two sets of points (P -&gt; Q),
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    and their RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param P: A BxNx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :param Q: A BxNx3 matrix of points
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    :return: A tuple containing the optimal rotation matrix, the optimal
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             translation vector, and the RMSD.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    P <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>array(P)
</span></span><span style="display:flex;"><span>    Q <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>array(Q)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">assert</span> P<span style="color:#f92672">.</span>shape <span style="color:#f92672">==</span> Q<span style="color:#f92672">.</span>shape, <span style="color:#e6db74">&#34;Matrix dimensions must match&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute centroids</span>
</span></span><span style="display:flex;"><span>    centroid_P <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>mean(P, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, keepdims<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)  <span style="color:#75715e"># Bx1x3</span>
</span></span><span style="display:flex;"><span>    centroid_Q <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>mean(Q, axis<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, keepdims<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)  <span style="color:#75715e"># Bx1x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Center the points</span>
</span></span><span style="display:flex;"><span>    p <span style="color:#f92672">=</span> P <span style="color:#f92672">-</span> centroid_P  <span style="color:#75715e"># BxNx3</span>
</span></span><span style="display:flex;"><span>    q <span style="color:#f92672">=</span> Q <span style="color:#f92672">-</span> centroid_Q  <span style="color:#75715e"># BxNx3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Compute the covariance matrix</span>
</span></span><span style="display:flex;"><span>    H <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>matmul(p<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>), q)  <span style="color:#75715e"># Bx3x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># SVD</span>
</span></span><span style="display:flex;"><span>    U, S, Vt <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>svd(H)  <span style="color:#75715e"># Bx3x3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 1. Calculate batched determinant</span>
</span></span><span style="display:flex;"><span>    d <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>det(jnp<span style="color:#f92672">.</span>matmul(Vt<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>), U<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 2. Build batched B_diag</span>
</span></span><span style="display:flex;"><span>    ones <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>ones_like(d)
</span></span><span style="display:flex;"><span>    B_diag <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>stack([ones, ones, jnp<span style="color:#f92672">.</span>sign(d)], axis<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># 3. Scale columns of Vt.T and multiply by U.T</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Vt.T: (B, 3, 3). B_diag: (B, 3).</span>
</span></span><span style="display:flex;"><span>    R <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>matmul(Vt<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>) <span style="color:#f92672">*</span> B_diag[:, <span style="color:#66d9ef">None</span>, :], U<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Optimal translation (depends on R, so computed after it)</span>
</span></span><span style="display:flex;"><span>    t <span style="color:#f92672">=</span> centroid_Q<span style="color:#f92672">.</span>squeeze(<span style="color:#ae81ff">1</span>) <span style="color:#f92672">-</span> jnp<span style="color:#f92672">.</span>matmul(centroid_P, R<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>))<span style="color:#f92672">.</span>squeeze(<span style="color:#ae81ff">1</span>)  <span style="color:#75715e"># Bx3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># RMSD</span>
</span></span><span style="display:flex;"><span>    rmsd <span style="color:#f92672">=</span> jnp<span style="color:#f92672">.</span>sqrt(jnp<span style="color:#f92672">.</span>sum(jnp<span style="color:#f92672">.</span>square(jnp<span style="color:#f92672">.</span>matmul(p, R<span style="color:#f92672">.</span>transpose(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">1</span>)) <span style="color:#f92672">-</span> q), axis<span style="color:#f92672">=</span>(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)) <span style="color:#f92672">/</span> P<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">1</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> R, t, rmsd
</span></span></code></pre></div>

<figure class="post-figure center ">
    <img src="/img/scientific-computing/kabsch-animated-protein-conformational-alignment-analysis.webp"
         alt="Animation of a protein structure being aligned using the Kabsch algorithm"
         title="Animation of a protein structure being aligned using the Kabsch algorithm"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Real-world application: Aligning protein conformations to analyze structural changes.</figcaption>
    
</figure>

<h2 id="extensions">Extensions</h2>
<p>The Kabsch algorithm has several important extensions that go beyond the formulation covered here:</p>
<ul>
<li><strong>Quaternion Form</strong>: The algorithm can be reformulated using quaternions for better numerical stability, particularly useful in applications requiring high precision.</li>
<li><strong>Iterative Versions</strong>: More robust variants that handle noise better and have improved scaling properties for large point sets. This can also be advantageous for setups with limited computational resources.</li>
<li><strong>Weighted Kabsch</strong>: Extensions that incorporate point weights (e.g., atomic masses in molecular dynamics). While SciPy provides a <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.transform.Rotation.align_vectors.html#scipy.spatial.transform.Rotation.align_vectors">weighted version</a>, it lacks batch processing capabilities.</li>
<li><strong>The Umeyama Algorithm</strong>: If your point sets are rotated, translated, and scaled differently, the Umeyama algorithm is the direct extension of Kabsch. It solves the same optimization problem but introduces a scaling factor $c$, finding the optimal alignment for $Q \approx c R P + t$.</li>
</ul>
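<p>To make the Umeyama extension concrete, here is a minimal NumPy sketch under stated assumptions: unbatched <code>(N, 3)</code> arrays with points stored as rows, no weights, and a hypothetical function name <code>umeyama_align</code> (it is not the API of the Kabsch-Horn Cookbook). Because points are rows, the fitted transform is applied as <code>c * P @ R.T + t</code>.</p>

```python
import numpy as np


def umeyama_align(P, Q):
    """Estimate scale c, rotation R, translation t so that c * R @ p_i + t ~ q_i.

    P, Q: (N, 3) arrays of corresponding points (illustrative, unbatched).
    """
    n = P.shape[0]
    mu_P, mu_Q = P.mean(axis=0), Q.mean(axis=0)
    p, q = P - mu_P, Q - mu_Q

    # Cross-covariance between the centered target and source points (3x3).
    sigma = q.T @ p / n
    U, D, Vt = np.linalg.svd(sigma)

    # Reflection correction: flip the last singular direction if needed.
    S = np.ones(3)
    S[-1] = np.sign(np.linalg.det(U @ Vt))

    R = U @ np.diag(S) @ Vt
    var_P = (p ** 2).sum() / n          # variance of the source points
    c = (D * S).sum() / var_P           # optimal isotropic scale
    t = mu_Q - c * R @ mu_P             # translation follows from c and R
    return c, R, t


# Round-trip demo on synthetic data (all values below are made up).
rng = np.random.default_rng(0)
P_demo = rng.normal(size=(20, 3))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q_demo = 2.5 * P_demo @ R_true.T + np.array([1.0, -2.0, 0.5])
c_est, R_est, t_est = umeyama_align(P_demo, Q_demo)
```

<p>Setting the scale to 1 and skipping the variance term gives back the plain Kabsch solution, which is why Umeyama is usually described as its direct extension.</p>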
<p>Several of these extensions are implemented in the <a href="/projects/kabsch-horn-cookbook/">Kabsch-Horn Cookbook</a> library, which provides differentiable Kabsch, Horn, and Umeyama alignment across NumPy, PyTorch, JAX, TensorFlow, and MLX.</p>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Kabsch_algorithm">Wikipedia, Kabsch Algorithm</a></li>
<li><a href="https://zalo.github.io/blog/kabsch/">Zalo on Kabsch</a>: An interactive shape matching demo.</li>
</ul>
<h3 id="original-papers">Original Papers</h3>
<ul>
<li><strong>[Kabsch 1976]</strong> Kabsch, W. (1976). &ldquo;A solution for the best rotation to relate two sets of vectors.&rdquo; <em>Acta Crystallographica Section A</em>, 32(5), 922-923. <a href="https://doi.org/10.1107/S0567739476001873">DOI: 10.1107/S0567739476001873</a>
<em>The original paper: a closed-form, non-iterative optimal-rotation solution derived via Lagrange multipliers and eigendecomposition of $\tilde{R}R$ (the SVD reformulation came later; see Arun et al. 1987).</em> See also: <a href="/notes/biology/computational-biology/kabsch-algorithm/">paper notes</a>.</li>
<li><strong>[Kabsch 1978]</strong> Kabsch, W. (1978). &ldquo;A discussion of the solution for the best rotation to relate two sets of vectors.&rdquo; <em>Acta Crystallographica Section A</em>, 34(5), 827-828. <a href="https://doi.org/10.1107/S0567739478001680">DOI: 10.1107/S0567739478001680</a>
<em>The follow-up paper correcting for improper rotations (reflections).</em></li>
<li><strong>[Arun et al. 1987]</strong> Arun, K. S., Huang, T. S., &amp; Blostein, S. D. (1987). &ldquo;Least-Squares Fitting of Two 3-D Point Sets.&rdquo; <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, PAMI-9(5), 698-700. <a href="https://doi.org/10.1109/TPAMI.1987.4767965">DOI: 10.1109/TPAMI.1987.4767965</a>
<em>The first SVD-based formulation for 3D point set alignment.</em> See also: <a href="/notes/biology/computational-biology/arun-svd-point-fitting/">paper notes</a>.</li>
<li><strong>[Horn et al. 1988]</strong> Horn, B. K. P., Hilden, H. M., &amp; Negahdaripour, S. (1988). &ldquo;Closed-form solution of absolute orientation using orthonormal matrices.&rdquo; <em>Journal of the Optical Society of America A</em>, 5(7), 1127-1135. <a href="https://doi.org/10.1364/JOSAA.5.001127">DOI: 10.1364/JOSAA.5.001127</a>
<em>The matrix square root (polar decomposition) approach to the same problem.</em> See also: <a href="/notes/biology/computational-biology/horn-orthonormal-matrices/">paper notes</a>.</li>
<li><strong>[Horn 1987]</strong> Horn, B. K. P. (1987). &ldquo;Closed-form solution of absolute orientation using unit quaternions.&rdquo; <em>Journal of the Optical Society of America A</em>, 4(4), 629-642. <a href="https://doi.org/10.1364/JOSAA.4.000629">DOI: 10.1364/JOSAA.4.000629</a>
<em>An alternative quaternion-based closed-form solution that also handles scale.</em> See also: <a href="/notes/biology/computational-biology/horn-absolute-orientation/">paper notes</a>.</li>
<li><strong>[Umeyama 1991]</strong> Umeyama, S. (1991). &ldquo;Least-squares estimation of transformation parameters between two point patterns.&rdquo; <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>, 13(4), 376-380. <a href="https://doi.org/10.1109/34.88573">DOI: 10.1109/34.88573</a>
<em>The extension of the algorithm to include optimal scaling in addition to rotation and translation.</em> See also: <a href="/notes/biology/computational-biology/umeyama-similarity-transformation/">paper notes</a>.</li>
</ul>
]]></content:encoded></item><item><title>Automated Adatom Diffusion Workflow</title><link>https://hunterheidenreich.com/projects/lammps-adatom-diffusion/</link><pubDate>Thu, 21 Sep 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/lammps-adatom-diffusion/</guid><description>Python-wrapped reference implementation for surface diffusion simulations using LAMMPS and EAM potentials, with automated analysis pipelines.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This project provides an &ldquo;input-to-analysis&rdquo; workflow for simulating adatom diffusion on FCC metal surfaces. It demonstrates how to set up surface diffusion simulations in LAMMPS, manage EAM potentials, and turn thermodynamic logs and trajectory dumps into energy and trajectory plots using Python. The LAMMPS input scripts are adapted from Eric N. Hahn&rsquo;s adatom tutorial; the Python analysis layer (<code>plot_energy.py</code>, <code>plot_xy.py</code>) is my own, written while in CSElab (Harvard, 2023).</p>
<p>The workflow covers two material systems, copper (Cu) and platinum (Pt), providing comparative datasets that highlight how atomic mass and bonding strength affect surface dynamics.</p>
<h2 id="features">Features</h2>
<h3 id="simulation-architecture">Simulation Architecture</h3>
<p>The project separates simulation logic from analysis code:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Directory</th>
          <th style="text-align: left">Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong><code>/adatom_cu</code></strong></td>
          <td style="text-align: left">Copper adatom diffusion on Cu(100)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong><code>/adatom_pt</code></strong></td>
          <td style="text-align: left">Platinum adatom diffusion on Pt(100)</td>
      </tr>
  </tbody>
</table>
<p>Each directory contains:</p>
<ul>
<li><strong>LAMMPS input scripts</strong> (<code>.in</code> files) defining the physics</li>
<li><strong>EAM potential files</strong> for metallic bonding (the Cu potential is committed; the Pt potential must be downloaded separately from the NIST Interatomic Potentials Repository, so the Pt system does not run as-checked-out)</li>
<li><strong>Python analysis scripts</strong> for trajectory and energy parsing</li>
</ul>
<h3 id="key-features">Key Features</h3>
<ul>
<li><strong>EAM Potentials</strong>: Uses Embedded Atom Method alloy potentials to model metallic bonding and surface energies, capturing many-body effects that simple Lennard-Jones pair potentials miss</li>
<li><strong>Automated Analysis</strong>: Python pipeline (<code>plot_energy.py</code>, <code>plot_xy.py</code>) that parses raw thermodynamic logs and trajectory dumps to generate &ldquo;health check&rdquo; dashboards</li>
<li><strong>Workflow Orchestration</strong>: Demonstrates the &ldquo;Input → Simulation → Analysis&rdquo; loop, automating the transition from raw <code>.lammpstrj</code> files to publication-ready plots</li>
<li><strong>Kokkos Support</strong>: Includes Kokkos execution commands for GPU/multi-threaded runs</li>
</ul>
<h3 id="simulation-parameters">Simulation Parameters</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Parameter</th>
          <th style="text-align: left">Value</th>
          <th style="text-align: left">Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Ensemble</strong></td>
          <td style="text-align: left">NVT → NVE</td>
          <td style="text-align: left">Equilibration followed by energy conservation checks</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Potential</strong></td>
          <td style="text-align: left">EAM/alloy</td>
          <td style="text-align: left">Accurate metallic bonding for surface dynamics</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Minimization</strong></td>
          <td style="text-align: left">CG (1.0e-4)</td>
          <td style="text-align: left">Remove steric overlaps before dynamics</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Timestep</strong></td>
          <td style="text-align: left">5 fs (metal units)</td>
          <td style="text-align: left">EAM-appropriate integration step</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Trajectory dump</strong></td>
          <td style="text-align: left">every 5 steps (25 fs)</td>
          <td style="text-align: left">Tracks adatom site-to-site hops</td>
      </tr>
  </tbody>
</table>
<h2 id="usage">Usage</h2>
<p>The repository includes LAMMPS input scripts and Python analysis scripts. Run the LAMMPS scripts to generate trajectory data, then use the Python scripts to visualize the results.</p>
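<p>As a rough illustration of the analysis step, the sketch below pulls thermo columns out of LAMMPS log text. It is not the repository&rsquo;s <code>plot_energy.py</code>; the function name <code>parse_thermo</code> and the sample log are hypothetical. It assumes the default one-line thermo output, where each block starts with a header row beginning with <code>Step</code> and ends at the <code>Loop time</code> line; the actual column names depend on the <code>thermo_style</code> in the input script.</p>

```python
def parse_thermo(log_text):
    """Collect thermo columns from LAMMPS log text into {column_name: [floats]}."""
    columns = {}
    header = None
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Step":
            # Header row: start (or continue) collecting under these names.
            header = fields
            for name in header:
                columns.setdefault(name, [])
        elif line.startswith("Loop time"):
            # End of the current thermo block.
            header = None
        elif header is not None and len(fields) == len(header):
            try:
                values = [float(x) for x in fields]
            except ValueError:
                continue  # skip warnings or other non-numeric lines
            for name, value in zip(header, values):
                columns[name].append(value)
    return columns


# Made-up log excerpt, only to exercise the parser.
sample_log = """
Step Temp PotEng TotEng
0 300 -3540.1 -3500.2
100 295.5 -3539.8 -3500.1
Loop time of 0.5 on 1 procs for 100 steps with 1000 atoms
"""
thermo = parse_thermo(sample_log)
```

<p>The resulting dictionary can be handed straight to a plotting library, for example total energy against step to check conservation during the NVE stage.</p>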
<h2 id="results">Results</h2>
<p>This workflow is documented in detail in companion blog posts:</p>
<ul>
<li><a href="/posts/adatom-cu-diffusion/">LAMMPS Tutorial: Copper and Platinum Adatom Diffusion</a> - Complete setup walkthrough with line-by-line script explanation and comparison of how heavier atoms behave differently on surfaces</li>
</ul>
]]></content:encoded></item><item><title>Mini-Protein Trajectory Generation</title><link>https://hunterheidenreich.com/projects/mini-protein-trajectories/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/mini-protein-trajectories/</guid><description>Automated GROMACS pipeline generating MD trajectories with atomic force extraction for Neural Network Potential training.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>I developed an automated GROMACS pipeline to generate molecular dynamics (MD) datasets for machine learning applications. The workflow automates the simulation of capped dipeptides across nine distinct residue types, creating a diverse training set suitable for Neural Network Potentials (NNPs). The pipeline builds on Luca Tubiana&rsquo;s GROMACS tutorial (University of Trento); the Python analysis layer and the curated dipeptide dataset are my own.</p>
<h2 id="features">Features</h2>
<h3 id="automated-simulation-pipeline">Automated Simulation Pipeline</h3>
<ul>
<li><strong>End-to-End Scripting</strong>: Bash-automated workflow handling topology generation (<code>pdb2gmx</code>), solvation, ionization, and equilibration</li>
<li><strong>Langevin Dynamics</strong>: Implemented Stochastic Dynamics (SD) integration to ensure proper canonical (NVT) ensemble sampling</li>
<li><strong>High-Resolution Output</strong>: Configured to write trajectories at <strong>0.1 ps (100 fs) resolution</strong>, critical for capturing fast bond vibrations</li>
<li><strong>Force Extraction</strong>: Optimized output to <code>.trr</code> format preserving uncompressed atomic forces, a key requirement for force-matching in ML potentials</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-ini" data-lang="ini"><span style="display:flex;"><span><span style="color:#75715e">; md_langevin.mdp</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">integrator</span>  <span style="color:#f92672">=</span> <span style="color:#e6db74">sd        ; Stochastic dynamics for proper sampling</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">dt</span>          <span style="color:#f92672">=</span> <span style="color:#e6db74">0.001     ; 1 fs timestep</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">nstxout</span>     <span style="color:#f92672">=</span> <span style="color:#e6db74">100       ; Output every 100 steps = 0.1 ps resolution</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">tc-grps</span>     <span style="color:#f92672">=</span> <span style="color:#e6db74">Protein Non-Protein</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">tau_t</span>       <span style="color:#f92672">=</span> <span style="color:#e6db74">0.1  0.1  ; Inverse friction constant (ps)</span>
</span></span></code></pre></div><h3 id="chemical-diversity-suite">Chemical Diversity Suite</h3>
<p>Designed to stress-test ML models against varied kinematic constraints:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Residues</th>
          <th>Dynamics Challenge</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Aromatic</strong></td>
          <td>Phe, Trp</td>
          <td>π-stacking, bulky side chains</td>
      </tr>
      <tr>
          <td><strong>Constrained</strong></td>
          <td>Pro</td>
          <td>Cyclic backbone restrictions</td>
      </tr>
      <tr>
          <td><strong>Flexible</strong></td>
          <td>Gly, Ala</td>
          <td>High conformational entropy</td>
      </tr>
      <tr>
          <td><strong>Branched</strong></td>
          <td>Val, Ile, Leu</td>
          <td>Steric clashes, rotamer preferences</td>
      </tr>
      <tr>
          <td><strong>Sulfur-Containing</strong></td>
          <td>Met</td>
          <td>Flexible thioether linkage</td>
      </tr>
  </tbody>
</table>
<h2 id="usage">Usage</h2>
<p>The pipeline runs via Bash scripts and requires a working GROMACS installation.</p>
<h2 id="results">Results</h2>
<ul>
<li><strong>Data Volume vs. Fidelity</strong>: Balanced high-frequency force outputs (every 100 steps) against storage constraints by automating post-processing extraction of forces into lightweight <code>.xvg</code> formats</li>
<li><strong>Force Field Consistency</strong>: Standardized the Amber03 force field and TIP3P water model across all residues to ensure consistent potential energy surfaces for downstream model training</li>
</ul>
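<p>For downstream training code, the extracted <code>.xvg</code> files still need to be turned into arrays. The following is a minimal sketch, not part of the pipeline itself: the name <code>load_xvg_forces</code> is hypothetical, and it assumes each data row holds the time followed by <code>fx fy fz</code> for every atom (the layout I would expect from an export such as <code>gmx traj -of</code>), with <code>#</code> and <code>@</code> marking comment and metadata lines.</p>

```python
import io

import numpy as np


def load_xvg_forces(source):
    """Read an .xvg force file into (times, forces) arrays.

    Assumed layout: one row per frame, time first, then fx fy fz per atom.
    Returns times with shape (n_frames,) and forces with shape (n_frames, n_atoms, 3).
    """
    data = np.loadtxt(source, comments=("#", "@"), ndmin=2)
    times = data[:, 0]
    forces = data[:, 1:].reshape(len(times), -1, 3)
    return times, forces


# Made-up two-frame, two-atom file, only to exercise the loader.
sample_xvg = io.StringIO(
    "# illustrative header\n"
    "@ title \"Force\"\n"
    "0.0 1.0 2.0 3.0 -1.0 -2.0 -3.0\n"
    "0.1 0.5 0.5 0.5 0.0 0.0 1.0\n"
)
times, forces = load_xvg_forces(sample_xvg)
```

<p>Passing a file path instead of the in-memory buffer works the same way, since <code>np.loadtxt</code> accepts both.</p>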
<blockquote>
<p><strong>Note</strong>: This pipeline uses Amber03 for consistency across residue types. For production ML potentials, consider swapping to CHARMM36m or similar modern force fields.</p></blockquote>
<h2 id="retrospective">Retrospective</h2>
<ul>
<li><strong>Demonstrative, not production-scale</strong>: the 1 ns trajectories exercise the pipeline and capture fast bond vibrations, but proper conformational sampling needs 100 ns to 1 µs runs. This is a working reference, not a finished dataset.</li>
<li><strong>Dated force field</strong>: Amber03 / TIP3P keeps the potential energy surface consistent across residues, but it is not state-of-the-art for ML-potential training; CHARMM36m or Amber ff19SB would be the upgrade path.</li>
<li><strong>Paused, not abandoned</strong>: a candidate to revive and extend (more residues, longer trajectories, Ramachandran analysis) for future force-matching work.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/posts/mini-proteins/">Mini-Protein Dynamics</a> - Detailed blog post on the simulation methodology</li>
</ul>
]]></content:encoded></item><item><title>Look, Don't Tweet: Unified Data Models for Social NLP</title><link>https://hunterheidenreich.com/research/look-dont-tweet/</link><pubDate>Wed, 30 Jun 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/look-dont-tweet/</guid><description>PyConversations library and unified data schema for normalizing 300M+ posts across Twitter, Reddit, Facebook, and 4chan.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>This is my undergraduate senior thesis, completed at Drexel University in 2021. The scope was unusually broad for a senior thesis: 308 million posts across four platforms, structural topology analysis, and domain adaptation experiments with Transformer models, spanning large-scale data engineering, graph-structural analysis, and representation learning.</p>
<p>Social media research is often siloed by platform, with tools built specifically for Twitter&rsquo;s flat structure or Reddit&rsquo;s tree structure. This fragmentation makes cross-platform analysis difficult. In this work, I introduce <strong><a href="https://github.com/hunter-heidenreich/pyconversations">PyConversations</a></strong>, an open-source Python package that normalizes data from Twitter, Facebook, Reddit, and 4chan into a single, platform-agnostic data model. <em>(Note: the repository is archived and no longer actively maintained.)</em></p>
<p>Leveraging this tool, I processed over <strong>308 million posts</strong> to analyze the structural &ldquo;shape&rdquo; of online conversations. I then evaluated the efficacy of domain-adaptive pre-training (DAPT) for Transformer-based language models, finding that training on a toxic domain (4chan) boosts hate-speech detection by over 5 F1 points.</p>
<h2 id="the-engineering-problem-data-normalization">The Engineering Problem: Data Normalization</h2>
<p>Social media platforms impose different structural constraints on discourse, making it difficult to feed heterogeneous data into a single ML pipeline:</p>
<ul>
<li><strong>Twitter:</strong> Technically allows infinite depth, but functionally operates as a flat stream or shallow tree.</li>
<li><strong>Facebook:</strong> Enforces a hard limit of two depth levels (comments and replies), resulting in &ldquo;short and fat&rdquo; conversation trees.</li>
<li><strong>Reddit &amp; 4chan:</strong> Allow for deep, branching tree structures.</li>
</ul>
<p>To solve this, I designed a <strong>Universal Message Schema</strong> and the <strong>PyConversations</strong> library. This system ingests raw dumps from these disparate sources and maps them to a unified Directed Acyclic Graph (DAG) format, preserving the parent-child relationships regardless of the source platform&rsquo;s constraints.</p>
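<p>The idea of a platform-agnostic record plus parent links can be sketched in a few lines of Python. This is an illustration only, not the PyConversations API: the class and field names (<code>Post</code>, <code>Conversation</code>, <code>reply_to</code>) are hypothetical. The one design assumption worth noting is that <code>reply_to</code> is a set, so a post may reference several parents, which is what makes the structure a DAG instead of a strict tree.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """Platform-agnostic record; field names are illustrative."""
    uid: str
    platform: str
    text: str
    reply_to: set = field(default_factory=set)  # parent uids; empty for a root


class Conversation:
    """A bag of posts connected by reply links."""

    def __init__(self):
        self.posts = {}

    def add(self, post):
        self.posts[post.uid] = post

    def roots(self):
        # A root has no parent inside this conversation.
        return sorted(uid for uid, p in self.posts.items()
                      if p.reply_to.isdisjoint(self.posts))

    def children(self, uid):
        return sorted(p.uid for p in self.posts.values() if uid in p.reply_to)


# Made-up thread in which the last post replies to two earlier posts.
thread = Conversation()
thread.add(Post("a", "4chan", "Original post"))
thread.add(Post("b", "4chan", "First reply", {"a"}))
thread.add(Post("c", "4chan", "Quotes two posts", {"a", "b"}))
```

<p>A Twitter or Facebook loader would populate the same <code>Post</code> fields with at most one parent each, so downstream analysis code never needs to know which platform a conversation came from.</p>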
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>PyConversations Library</strong>: An open-source package for robust conversational analysis, featuring graph-based traversing and filtering.</li>
<li><strong>Massive Dataset Analysis</strong>: Processed a collection of <strong>308 million posts</strong> and <strong>15.8 million conversations</strong>, creating one of the largest comparative cross-platform analyses at the time of thesis submission.</li>
<li><strong>Structural Insights</strong>: Quantified how UI constraints shape human behavior. For instance, Facebook&rsquo;s depth limit forces users to &ldquo;bunch&rdquo; comments, creating uniquely wide conversation trees compared to Reddit&rsquo;s deep, narrow threads.</li>
<li><strong>Domain Adaptation Experiments</strong>: Continued pre-training of RoBERTa on platform-specific slices (e.g., the 4chan-adapted <code>RoBERTa-4chan</code>), demonstrating that exposure to a toxic domain improved hate-speech detection F1 by over 5 points.</li>
</ul>
<h2 id="structural-analysis-findings">Structural Analysis Findings</h2>
<p>By treating conversations as graphs, we uncovered distinct topological signatures for each platform:</p>
<h3 id="the-shape-of-discourse">The &ldquo;Shape&rdquo; of Discourse</h3>
<p>We measured the <strong>width</strong> (max posts at any depth) and <strong>depth</strong> (max distance from root) of conversation trees.</p>
<ul>
<li><strong>Facebook</strong> exhibited a &ldquo;short and fat&rdquo; topology due to its 2-level nesting limit.</li>
<li><strong>4chan</strong> threads were surprisingly shallow despite having no depth limits. This suggests that the platform&rsquo;s <strong>ephemerality</strong> (threads are deleted quickly) and the &ldquo;bump limit&rdquo; mechanic discourage long-term dialogue, though data scraping limitations on this transient platform also contribute to this topology.</li>
<li><strong>Reddit</strong> maintained the most robust tree structures, with &ldquo;good faith&rdquo; communities like <em>r/ChangeMyView</em> showing distinct patterns of sustained engagement.</li>
</ul>
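<p>The two measurements defined above are simple to compute once each post knows its parent. The sketch below is a simplified illustration, not the thesis code: it assumes a well-formed tree given as a hypothetical <code>{post_id: parent_id}</code> mapping with <code>None</code> for the root, and the two example threads are invented.</p>

```python
from collections import Counter


def depth_and_width(parents):
    """Return (depth, width) for a reply tree given {post_id: parent_id or None}.

    depth: maximum distance from the root; width: maximum posts at any one depth.
    """
    depths = {}
    for start in parents:
        # Walk up until we reach the root or a post whose depth is known.
        chain = []
        uid = start
        while uid is not None and uid not in depths:
            chain.append(uid)
            uid = parents[uid]
        base = -1 if uid is None else depths[uid]
        # Assign depths on the way back down, nearest-to-root first.
        for offset, node in enumerate(reversed(chain), start=1):
            depths[node] = base + offset
    per_level = Counter(depths.values())
    return max(depths.values()), max(per_level.values())


# Hypothetical threads: a wide, shallow one and a deep, narrow one.
wide = {"root": None, "c1": "root", "c2": "root", "c3": "root", "c4": "root", "r1": "c1"}
deep = {"root": None, "a": "root", "b": "a", "c": "b", "d": "c"}
wide_shape = depth_and_width(wide)
deep_shape = depth_and_width(deep)
```

<p>The walk is iterative on purpose, so very deep threads do not run into Python&rsquo;s recursion limit.</p>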
<h3 id="information-density">Information Density</h3>
<p>We analyzed <strong>Innovation Rate</strong>, a measure of how quickly a text introduces new vocabulary. We found that Twitter threads have negative innovation rates (indicating high novelty per token), likely forced by the strict character limits. In contrast, Reddit posts showed higher redundancy, typical of longer-form essay writing.</p>
<h2 id="representation-learning--domain-adaptation">Representation Learning &amp; Domain Adaptation</h2>
<p>We experimented with &ldquo;Warm-Start&rdquo; tuning: taking a standard RoBERTa model and pre-training it further on platform-specific data before fine-tuning on downstream tasks (TweetEval).</p>
<ul>
<li><strong>Limited gains on most general tasks:</strong> Domain-adaptive pre-training added little on sentiment and emotion (from well under 1 up to a few F1 points), with irony detection the exception (+5.6 to +5.9 F1). Base RoBERTa already covers most of the signal for general NLP tasks.</li>
<li><strong>The Toxic Exception:</strong> A second notable exception was <strong>Hate Speech Detection</strong>. The 4chan-adapted model (<code>RoBERTa-4chan</code>) was the strongest here, outperforming the baseline by over 5 F1 points. This highlights that for specialized, out-of-distribution language (like toxic slang), domain adaptation remains valuable.</li>
</ul>
<h2 id="significance">Significance</h2>
<p>This work bridges the gap between <strong>Computational Social Science</strong> and <strong>ML Engineering</strong>. It provides the community with a reusable tool (<code>PyConversations</code>) to handle the messy reality of social data and offers empirical evidence on the limits and benefits of domain-adaptive pre-training for LLMs.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@thesis</span>{heidenreich2021look,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Look, Don&#39;t Tweet: Representation Learning and Social Media}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">school</span>=<span style="color:#e6db74">{Drexel University}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span>=<span style="color:#e6db74">{Undergraduate Senior Thesis}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For related work on how social media content surfaces in digital journalism, including a dataset of embedded tweets across 273,899 news articles, see <a href="/research/newstweet-social-media-journalism/">NewsTweet Dataset: Social Media in Digital Journalism</a>.</p>
]]></content:encoded></item><item><title>PyConversations: Social Media Conversational Analysis</title><link>https://hunterheidenreich.com/projects/pyconversations-social-media-analysis/</link><pubDate>Tue, 01 Jun 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/pyconversations-social-media-analysis/</guid><description>Undergraduate thesis exploring representation learning for social media text and developing tools for cross-platform conversational analysis.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Undergraduate thesis exploring representation learning for social media text and developing tools for cross-platform conversational analysis. Built PyConversations, a Python module for analyzing social media conversations, and found that domain-specific approaches often outperform large pre-trained models.</p>
<h2 id="features">Features</h2>
<h3 id="pyconversations-module">PyConversations Module</h3>
<ul>
<li><strong>Graph-based modeling</strong>: Models conversations as Directed Acyclic Graphs (DAGs) to quantify topological structure (depth, width, density)</li>
<li><strong>Unified interface</strong>: Polymorphic design normalizing heterogeneous data from Twitter, Reddit, 4chan, and Facebook into a single analysis schema</li>
<li><strong>Linguistic dynamics</strong>: Implements information-theoretic feature extraction, including harmonic mixing laws and entropy measures</li>
<li><strong>Stream processing</strong>: Memory-efficient generators to ingest and traverse multi-gigabyte JSON dumps (e.g., 135M+ Reddit posts) without loading the full corpus into RAM</li>
</ul>
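<p>The stream-processing point is easiest to see in code. This is a minimal sketch and not the module&rsquo;s actual reader: it assumes the dump is newline-delimited JSON (one post per line), and the names <code>iter_posts</code> and <code>count_by_key</code> as well as the sample records are hypothetical.</p>

```python
import io
import json


def iter_posts(lines):
    """Lazily yield one decoded post per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_key(posts, key):
    """Aggregate over a stream of posts while holding only the counts in memory."""
    counts = {}
    for post in posts:
        value = post.get(key)
        counts[value] = counts.get(value, 0) + 1
    return counts


# In-memory stand-in for a large dump file; an open file handle works the same way.
dump = io.StringIO(
    '{"id": "t1", "subreddit": "science"}\n'
    '\n'
    '{"id": "t2", "subreddit": "science"}\n'
    '{"id": "t3", "subreddit": "news"}\n'
)
counts = count_by_key(iter_posts(dump), "subreddit")
```

<p>Because <code>iter_posts</code> is a generator, only one line is decoded at a time, so memory use stays flat no matter how large the dump is.</p>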
<h3 id="research-contributions">Research Contributions</h3>
<ul>
<li><strong>Representation learning</strong>: Investigated domain-specific vs. general-purpose Transformers (BERT vs. specialized variants) on social media text</li>
<li><strong>Topological analysis</strong>: Demonstrated that conversational structure (context) is as critical as content for classification tasks</li>
<li><strong>Cross-platform study</strong>: Comparative analysis of communication dynamics across moderated (Reddit/Twitter) and unmoderated (4chan) spaces</li>
</ul>
<h2 id="usage">Usage</h2>
<p>The PyConversations module can be imported into Python scripts to parse and analyze social media datasets.</p>
<h2 id="results">Results</h2>
<ul>
<li><strong>Model performance</strong>: Smaller, domain-specific approaches frequently outperformed standard pre-trained models</li>
<li><strong>Context importance</strong>: Conversational context and dialogue structure proved crucial for understanding social media interactions</li>
<li><strong>Domain adaptation</strong>: Social media text benefits from specialized handling over generic approaches</li>
<li><strong>Cross-platform challenges</strong>: Different platforms require adapted approaches despite seeming similarities</li>
</ul>
<h2 id="team--recognition">Team &amp; Recognition</h2>
<ul>
<li><strong>Hunter Heidenreich</strong> - Lead Researcher and Developer</li>
<li><strong>Jake Williams</strong> - Faculty Advisor</li>
<li><strong>First Place - Research Undergraduate Senior Thesis</strong> at Drexel University</li>
</ul>
<h2 id="impact">Impact</h2>
<p>This library served as the engineering backbone for my thesis, <a href="/research/look-dont-tweet/">Look, Don&rsquo;t Tweet</a>, enabling the processing of 308 million posts to evaluate Transformer performance on toxic data.</p>
<p>The findings about model performance suggested that specialized domains require tailored model architectures, a perspective that has become more relevant as the field continues to evolve.</p>
]]></content:encoded></item><item><title>Word Embeddings in NLP: An Introduction</title><link>https://hunterheidenreich.com/posts/intro-to-word-embeddings/</link><pubDate>Sun, 05 Aug 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/intro-to-word-embeddings/</guid><description>Learn about word embeddings in NLP: from basic one-hot encoding to contextual models like ELMo. Guide with examples.</description><content:encoded><![CDATA[<h2 id="understanding-word-embeddings">Understanding Word Embeddings</h2>
<p>A word embedding maps words to real-valued vectors:</p>
<p>$$
\text{word} \rightarrow \mathbb{R}^n
$$</p>
<p>where $n$ represents the dimensionality of the embedding space.</p>
<p>The goal is simple: position semantically similar words close together in vector space. This dense representation typically uses hundreds of dimensions, a massive reduction from the millions required by one-hot encoding.</p>
<p>Word embeddings are grounded in <a href="https://en.wikipedia.org/wiki/Distributional_semantics">Zellig Harris&rsquo; distributional hypothesis</a>: words appearing in similar contexts tend to have similar meanings. This forms the foundation of distributional semantics.</p>

<figure class="post-figure center ">
    <img src="/img/distributional_semantics-50.webp"
         alt="Distributional semantics visualization"
         title="Distributional semantics visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Words embedded in three-dimensional space, organized by semantic similarity</figcaption>
    
</figure>

<p>Different embedding algorithms capture various aspects of this distributional principle. This post explores the main methods for creating word embeddings and their applications in natural language processing.</p>
<p>While modern foundation models and large Vision-Language Models rely on subword tokenizers (like BPE) and Transformer embedding layers, the goal is the same: mapping discrete text to a continuous vector space where math can capture meaning. These foundational techniques build the intuition for the embedding layers in today&rsquo;s models.</p>
<h2 id="why-word-embeddings-matter-in-nlp">Why Word Embeddings Matter in NLP</h2>
<p>Computers require numerical representations to apply machine learning algorithms to text. Word embeddings bridge this gap by converting text into dense vectors that preserve semantic and syntactic relationships.</p>
<p><strong>Key advantages:</strong></p>
<ol>
<li><strong>Dense representation</strong>: Hundreds of dimensions provide a compact alternative to vocabulary-sized sparse vectors.</li>
<li><strong>Semantic preservation</strong>: Similar words cluster together in vector space.</li>
<li><strong>Mathematical operations</strong>: Enable analogical reasoning ($\text{king} - \text{man} + \text{woman} \approx \text{queen}$).</li>
<li><strong>Transfer learning</strong>: Pre-trained embeddings work across multiple tasks and domains.</li>
</ol>
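<p>The analogy arithmetic in item 3 can be tried out on a toy example. The vectors below are hand-made for illustration and are <em>not</em> trained embeddings; the axes loosely stand for royalty, gender, and food, and the helper names <code>cosine</code> and <code>analogy</code> are my own.</p>

```python
import numpy as np

# Hand-made toy vectors (NOT trained embeddings): axes ~ (royalty, gender, food).
vectors = {
    "king": np.array([0.9, 0.8, 0.0]),
    "queen": np.array([0.9, -0.8, 0.0]),
    "man": np.array([0.1, 0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
    "apple": np.array([0.0, 0.0, 1.0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman")
```

<p>With real pre-trained embeddings the match is approximate, which is why the nearest neighbor by cosine similarity is used instead of an exact lookup.</p>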
<p>Modern deep learning architectures leverage these properties extensively. The development of universal, pre-trained embeddings was a significant step forward. We can use versatile embeddings that generalize across applications, eliminating the need to train task-specific representations from scratch.</p>
<h2 id="word-embedding-approaches">Word Embedding Approaches</h2>
<h3 id="one-hot-encoding-and-count-vectorization">One-Hot Encoding and Count Vectorization</h3>
<p>One-hot encoding represents the simplest approach to word vectorization. Each word gets a unique dimension in a vocabulary-sized vector, marked with 1 for presence and 0 elsewhere. Count vectorization extends this by counting the occurrences of each word in a document.</p>

<figure class="post-figure center ">
    <img src="/img/word_vector_onehot-50.webp"
         alt="One-hot encoding visualization"
         title="One-hot encoding visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">One-hot encoding creates sparse vectors with single active dimensions</figcaption>
    
</figure>

<p><strong>Characteristics:</strong></p>
<ul>
<li><strong>High dimensionality</strong>: Vector length equals vocabulary size.</li>
<li><strong>Extreme sparsity</strong>: Most dimensions contain zeros.</li>
<li><strong>No relationships</strong>: Treats all words as equally distant.</li>
<li><strong>Simplicity</strong>: Easy to implement and understand.</li>
</ul>
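<p>The &ldquo;no relationships&rdquo; property can be checked directly. The sketch below uses a hypothetical three-word vocabulary and helper names (<code>one_hot</code>, <code>euclidean</code>) chosen only for illustration; it shows that every pair of distinct one-hot vectors sits at the same distance.</p>

```python
import math

def one_hot(word, vocabulary):
    """Return a vocabulary-sized vector with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy vocabulary, chosen only for illustration
vocabulary = ["cat", "dog", "car"]
cat, dog, car = (one_hot(w, vocabulary) for w in vocabulary)

# "cat" is exactly as far from "dog" as it is from "car"
print(euclidean(cat, dog))
print(euclidean(cat, car))
```

<p>Both distances equal $\sqrt{2}$, so the representation carries no notion of &ldquo;cat&rdquo; being closer to &ldquo;dog&rdquo; than to &ldquo;car&rdquo;.</p>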
<p>While lacking semantic information, count vectorization serves as a foundation for more complex methods. Let&rsquo;s look at a practical implementation using scikit-learn&rsquo;s <code>CountVectorizer</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> CountVectorizer
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize the vectorizer</span>
</span></span><span style="display:flex;"><span>vectorizer <span style="color:#f92672">=</span> CountVectorizer()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Sample text for demonstration</span>
</span></span><span style="display:flex;"><span>sample_text <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;One of the most basic ways we can numerically represent words &#34;</span>
</span></span><span style="display:flex;"><span>               <span style="color:#e6db74">&#34;is through the one-hot encoding method (also sometimes called &#34;</span>
</span></span><span style="display:flex;"><span>               <span style="color:#e6db74">&#34;count vectorizing).&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Fit the vectorizer to our text data</span>
</span></span><span style="display:flex;"><span>vectorizer<span style="color:#f92672">.</span>fit(sample_text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Examine the vocabulary and word indices</span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#39;Vocabulary:&#39;</span>)
</span></span><span style="display:flex;"><span>print(vectorizer<span style="color:#f92672">.</span>vocabulary_)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Transform text to vectors</span>
</span></span><span style="display:flex;"><span>vector <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>transform(sample_text)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#39;Full vector:&#39;</span>)
</span></span><span style="display:flex;"><span>print(vector<span style="color:#f92672">.</span>toarray())
</span></span></code></pre></div><p>At scale, count vectorization introduces engineering challenges. With millions of documents, the vocabulary grows large, and the sparse matrices become expensive to store and compute on. In these scaling scenarios, practitioners often turn to the <strong>Hashing Trick</strong> (via <code>HashingVectorizer</code>) to bound the dimensionality, or they move entirely to the dense embeddings discussed later in this post.</p>
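<p>The idea behind the hashing trick fits in a few lines. This is a minimal sketch, not how <code>HashingVectorizer</code> is implemented: the function name <code>hashed_counts</code>, the use of MD5, and the whitespace tokenizer are all assumptions made for illustration. Each token is hashed to a column index, so the vector length stays fixed no matter how large the vocabulary grows.</p>

```python
import hashlib

def hashed_counts(text, n_features=16):
    """Count tokens into a fixed-size vector, using a hash as the column index."""
    vector = [0] * n_features
    for token in text.lower().split():
        # A stable hash keeps the token-to-column mapping identical across runs
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % n_features] += 1
    return vector

doc = "the cat sat on the mat"
vector = hashed_counts(doc)

# The length is fixed up front; the total still equals the number of tokens
print(len(vector), sum(vector))
```

<p>The trade-off is that distinct tokens can collide in the same column, and the mapping cannot be inverted to recover a vocabulary.</p>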
<p>We can see count vectorization in action with a real dataset, building a simple text classifier for the <a href="https://www.kaggle.com/datasets/crawford/20-newsgroups">20 Newsgroups dataset</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.datasets <span style="color:#f92672">import</span> fetch_20newsgroups
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> CountVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.naive_bayes <span style="color:#f92672">import</span> MultinomialNB
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn <span style="color:#f92672">import</span> metrics
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load train and test splits, removing metadata for a cleaner signal</span>
</span></span><span style="display:flex;"><span>newsgroups_train <span style="color:#f92672">=</span> fetch_20newsgroups(subset<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;train&#39;</span>,
</span></span><span style="display:flex;"><span>                                      remove<span style="color:#f92672">=</span>(<span style="color:#e6db74">&#39;headers&#39;</span>, <span style="color:#e6db74">&#39;footers&#39;</span>, <span style="color:#e6db74">&#39;quotes&#39;</span>))
</span></span><span style="display:flex;"><span>newsgroups_test <span style="color:#f92672">=</span> fetch_20newsgroups(subset<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;test&#39;</span>,
</span></span><span style="display:flex;"><span>                                     remove<span style="color:#f92672">=</span>(<span style="color:#e6db74">&#39;headers&#39;</span>, <span style="color:#e6db74">&#39;footers&#39;</span>, <span style="color:#e6db74">&#39;quotes&#39;</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize and fit vectorizer on training data</span>
</span></span><span style="display:flex;"><span>vectorizer <span style="color:#f92672">=</span> CountVectorizer()
</span></span><span style="display:flex;"><span>X_train <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>fit_transform(newsgroups_train<span style="color:#f92672">.</span>data)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Build and train classifier</span>
</span></span><span style="display:flex;"><span>classifier <span style="color:#f92672">=</span> MultinomialNB(alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.01</span>)
</span></span><span style="display:flex;"><span>classifier<span style="color:#f92672">.</span>fit(X_train, newsgroups_train<span style="color:#f92672">.</span>target)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Transform test data and make predictions</span>
</span></span><span style="display:flex;"><span>X_test <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>transform(newsgroups_test<span style="color:#f92672">.</span>data)
</span></span><span style="display:flex;"><span>y_pred <span style="color:#f92672">=</span> classifier<span style="color:#f92672">.</span>predict(X_test)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Evaluate performance</span>
</span></span><span style="display:flex;"><span>accuracy <span style="color:#f92672">=</span> metrics<span style="color:#f92672">.</span>accuracy_score(newsgroups_test<span style="color:#f92672">.</span>target, y_pred)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;Accuracy: </span><span style="color:#e6db74">{</span>accuracy<span style="color:#e6db74">:</span><span style="color:#e6db74">.3f</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span></code></pre></div><p>This provides a solid baseline. To capture actual semantic meaning and reduce dimensionality, we must move beyond simple counting.</p>
<h3 id="tf-idf-term-frequency-inverse-document-frequency">TF-IDF (Term Frequency-Inverse Document Frequency)</h3>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TF-IDF</a> extends count vectorization by weighting terms based on their importance across a document collection. TF-IDF combines:</p>
<ul>
<li><strong>Term Frequency (TF)</strong>: How often a word appears in a document</li>
<li><strong>Inverse Document Frequency (IDF)</strong>: How rare a word is across all documents</li>
</ul>
<p>This weighting scheme reduces the impact of common words (like &ldquo;the&rdquo; or &ldquo;and&rdquo;) while emphasizing distinctive terms that appear frequently in specific documents but rarely elsewhere.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Captures document-level importance</li>
<li>Reduces impact of stop words</li>
<li>Effective for information retrieval tasks</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>Still high-dimensional and sparse</li>
<li>No semantic relationships between terms</li>
<li>Context-independent representation</li>
</ul>
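<p>A small worked example makes the weighting concrete. The sketch below uses the textbook form $\text{tf} \cdot \log(N / \text{df})$ on a hypothetical three-document corpus; the function name <code>tf_idf</code> is an assumption, and scikit-learn&rsquo;s <code>TfidfVectorizer</code> applies a smoothed IDF and normalization by default, so its numbers will differ.</p>

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus
docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
print(weights[0])
```

<p>In the first document, &ldquo;the&rdquo; appears in every document and receives a weight of zero, &ldquo;cat&rdquo; appears in two of three and gets a small weight, and &ldquo;sat&rdquo; appears only here and gets the largest weight.</p>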
<h3 id="co-occurrence-matrices">Co-Occurrence Matrices</h3>
<p>Co-occurrence matrices capture word relationships by recording which terms appear together within defined contexts (sentences, paragraphs, or fixed windows). For a vocabulary of size $V$, the resulting matrix is $V \times V$, with each entry holding the co-occurrence frequency of a word pair.</p>

<figure class="post-figure center ">
    <img src="/img/Word_co-occurrence_network_%28range_3_words%29_-_ENG-50.webp"
         alt="Co-occurrence network visualization"
         title="Co-occurrence network visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Co-occurrence relationships within a three-word window</figcaption>
    
</figure>

<p><strong>Key properties:</strong></p>
<ul>
<li><strong>Global statistics</strong>: Captures corpus-wide word relationships</li>
<li><strong>Symmetric relationships</strong>: Mutual co-occurrence patterns</li>
<li><strong>Extreme dimensionality</strong>: Vocabulary size squared creates storage challenges</li>
<li><strong>Sparse representation</strong>: Most word pairs never co-occur</li>
</ul>
<p>While computationally expensive to store and process, co-occurrence matrices form the foundation for advanced methods like GloVe that compress this information into dense representations.</p>
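<p>Counting co-occurrences within a fixed window can be sketched as follows. The function name <code>cooccurrence_counts</code> and the example sentence are assumptions for illustration. Storing only the observed pairs in a dictionary sidesteps the $V \times V$ storage cost, since most pairs never occur.</p>

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
counts = cooccurrence_counts(tokens, window=2)

# A symmetric window yields symmetric counts
print(counts[("quick", "brown")], counts[("brown", "quick")])
```

<p>Because the window extends equally in both directions, the count for (a, b) always matches the count for (b, a), which is the symmetry noted above.</p>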
<h2 id="neural-network-based-embeddings">Neural Network-Based Embeddings</h2>
<h3 id="neural-probabilistic-language-models">Neural Probabilistic Language Models</h3>
<p><a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">Neural probabilistic models</a> pioneered the use of neural networks for learning word embeddings. These models learn dense representations as a byproduct of language modeling, predicting the next word in a sequence.</p>

<figure class="post-figure center ">
    <img src="/img/bengio-npm-50.webp"
         alt="Neural probabilistic model diagram"
         title="Neural probabilistic model diagram"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Architecture of neural probabilistic language models</figcaption>
    
</figure>

<p><strong>Training process:</strong></p>
<ol>
<li>Initialize random dense embeddings for each vocabulary word</li>
<li>Use the embeddings as inputs to a network that predicts the next word in a sequence</li>
<li>Update embeddings through backpropagation based on prediction errors</li>
<li>Resulting embeddings capture patterns useful for the training task</li>
</ol>
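<p>The four steps above can be sketched with a deliberately tiny model. This is a simplification, not the architecture from the paper: it conditions on a single previous word and has no hidden layer, and the corpus, dimensions, and learning rate are assumptions chosen so the example runs instantly in pure Python.</p>

```python
import math
import random

random.seed(0)
corpus = "a b c d a b c d a b c d".split()  # toy corpus (assumption)
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 3  # vocabulary size, embedding dimension

# Step 1: random dense embeddings, plus an output layer over the vocabulary
E = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
W = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def train_epoch(lr=0.1):
    """One SGD pass over (previous word, next word) pairs; returns the mean loss."""
    pairs = list(zip(corpus, corpus[1:]))
    total_loss = 0.0
    for prev, nxt in pairs:
        p, t = index[prev], index[nxt]
        e = E[p][:]  # copy so both gradients use the pre-update values
        # Step 2: embedding -> logits -> next-word probabilities
        probs = softmax([sum(W[k][d] * e[d] for d in range(D)) for k in range(V)])
        total_loss += -math.log(probs[t])
        # Step 3: backpropagate the cross-entropy error into W and the embedding
        dlogits = [probs[k] - (1.0 if k == t else 0.0) for k in range(V)]
        grad_e = [sum(dlogits[k] * W[k][d] for k in range(V)) for d in range(D)]
        for k in range(V):
            for d in range(D):
                W[k][d] -= lr * dlogits[k] * e[d]
        for d in range(D):
            E[p][d] -= lr * grad_e[d]
    return total_loss / len(pairs)

# Step 4: after training, E holds embeddings shaped by the prediction task
losses = [train_epoch() for _ in range(200)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

<p>The embeddings in <code>E</code> are never given a target of their own; they change only because doing so reduces the next-word prediction error.</p>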
<p>This approach demonstrated that task-specific embeddings could be learned jointly with model objectives, establishing the foundation for modern embedding methods.</p>
<h3 id="word2vec">Word2Vec</h3>
<p><a href="https://code.google.com/archive/p/word2vec/">Word2Vec</a> made word embeddings practical at scale by introducing efficient training algorithms for massive corpora. It popularized compelling vector arithmetic properties, enabling analogical reasoning like the famous &ldquo;$\text{king} - \text{man} + \text{woman} \approx \text{queen}$&rdquo; example (a vector-offset regularity first reported by Mikolov, Yih &amp; Zweig (2013) on recurrent-network language-model embeddings).</p>

<figure class="post-figure center ">
    <img src="/img/Word_vector_illustration.webp"
         alt="Word2Vec vector arithmetic visualization"
         title="Word2Vec vector arithmetic visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Word2Vec demonstrates analogical relationships through vector arithmetic</figcaption>
    
</figure>
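<p>The mechanics of an analogy query can be shown with hand-built vectors. The values below are not learned embeddings; they are a hypothetical two-dimensional toy in which one axis loosely stands for royalty and the other for gender, and the function name <code>analogy</code> is an assumption. The query computes the offset vector and returns the nearest remaining word by cosine similarity.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hand-built toy vectors (NOT learned): axis 0 ~ royalty, axis 1 ~ gender
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # prints "queen"
```

<p>With real embeddings the match is only approximate, which is why the query returns the nearest neighbor of the offset vector and excludes the three input words from the candidates.</p>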

<p><strong>Two training architectures:</strong></p>
<h4 id="continuous-bag-of-words-cbow">Continuous Bag-of-Words (CBOW)</h4>
<p>Predicts target words from surrounding context words. Given a window of context words, the model learns to predict the central word.</p>
<h4 id="skip-gram">Skip-Gram</h4>
<p>Predicts context words from target words. Given a central word, the model learns to predict surrounding words within a defined window.</p>
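<p>The two architectures differ mainly in how training examples are built from a window. The sketch below generates examples only and does no training; the function names <code>skipgram_pairs</code> and <code>cbow_examples</code> are assumptions for illustration.</p>

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is used to predict each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbors jointly predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens, window=1)[:4])
print(cbow_examples(tokens, window=1)[2])
```

<p>Skip-gram emits one example per neighbor, while CBOW emits one example per position with all neighbors grouped together.</p>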
<p><strong>Key advantages:</strong></p>
<ul>
<li><strong>Computational efficiency</strong>: Much faster than neural probabilistic models</li>
<li><strong>Scalable training</strong>: Can process billion-word corpora effectively</li>
<li><strong>Quality embeddings</strong>: Captures semantic and syntactic relationships</li>
<li><strong>Flexible context</strong>: Window size controls topical vs. functional similarity</li>
</ul>
<p>The choice of window size significantly impacts learned relationships. Larger windows capture topical associations, while smaller windows focus on syntactic and functional similarities.</p>
<h3 id="glove-global-vectors">GloVe (Global Vectors)</h3>
<p><a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> combines the best aspects of matrix factorization methods (which capture global corpus statistics) and local context window approaches like Word2Vec. Matrix factorization methods excel at global patterns but struggle with analogical reasoning, while Word2Vec captures local relationships but may miss global structure.</p>
<p><strong>Key innovation:</strong>
GloVe trains on a global word-context co-occurrence matrix, incorporating corpus-wide statistical information while maintaining the analogical reasoning capabilities that made Word2Vec successful.</p>
<p><strong>Advantages over Word2Vec:</strong></p>
<ul>
<li><strong>Global optimization</strong>: Leverages entire corpus statistics</li>
<li><strong>Better performance</strong>: Often outperforms Word2Vec on word similarity and analogy tasks</li>
<li><strong>Stable training</strong>: More consistent convergence due to global objective function</li>
</ul>
<p>The result is embeddings that capture both local syntactic patterns and global semantic relationships more effectively.</p>
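<p>GloVe&rsquo;s objective is a weighted least-squares fit: for each word pair, the dot product of the two vectors plus two bias terms should match the log of the co-occurrence count. The sketch below evaluates that loss for a single pair. The weighting function uses the cutoff of 100 and exponent of 0.75 reported in the GloVe paper; the function names and the example vectors are assumptions.</p>

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and the log count."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2

# Hypothetical 2-d vectors, biases, and count, for illustration only
loss = glove_pair_loss([0.5, 1.0], [1.0, 0.5], 0.1, 0.2, x_ij=20.0)
print(round(loss, 4))
```

<p>Summing this quantity over all pairs with a nonzero count gives the full training objective, so pairs that never co-occur contribute nothing.</p>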
<h2 id="contextual-embedding-methods">Subword and Hierarchical Embedding Methods</h2>
<h3 id="fasttext">FastText</h3>
<p><a href="https://github.com/facebookresearch/fastText">FastText</a> addresses a critical limitation of previous methods: handling out-of-vocabulary (OOV) words. By incorporating subword information, FastText can generate meaningful representations for previously unseen words.</p>
<p><strong>Subword approach:</strong></p>
<ul>
<li>Decomposes words into character n-grams (typically 3-6 characters)</li>
<li>Represents words as sums of their component n-grams</li>
<li>Trains using skip-gram objective with negative sampling</li>
</ul>
<p><strong>Key advantages:</strong></p>
<ul>
<li><strong>OOV handling</strong>: Can embed unseen words using known subword components</li>
<li><strong>Morphological awareness</strong>: Captures relationships between related word forms</li>
<li><strong>Multilingual support</strong>: Facebook released pre-trained embeddings for 294 languages</li>
<li><strong>Robust performance</strong>: Particularly effective for morphologically rich languages</li>
</ul>
<p>For example, if the model knows &ldquo;navigate,&rdquo; it can provide a meaningful representation for &ldquo;circumnavigate&rdquo; by leveraging shared subword components, even if &ldquo;circumnavigate&rdquo; wasn&rsquo;t in the training data.</p>
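<p>The overlap between the two words can be made explicit by extracting their character n-grams. This is a minimal sketch with an assumed function name (<code>char_ngrams</code>); it wraps each word in the boundary markers <code>&lt;</code> and <code>&gt;</code> and omits library details such as hashing n-grams into a fixed number of buckets.</p>

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }

known = char_ngrams("navigate")
unseen = char_ngrams("circumnavigate")
shared = known & unseen

# N-grams such as "navi" and "gate>" appear in both words
print(len(shared), sorted(shared)[:5])
```

<p>An unseen word&rsquo;s vector can then be assembled from the vectors of the n-grams it shares with words seen during training.</p>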
<h3 id="poincaré-embeddings">Poincaré Embeddings</h3>
<p><a href="https://radimrehurek.com/gensim/models/poincare.html">Poincaré embeddings</a> introduce a novel approach by learning representations in hyperbolic space. This geometric innovation specifically targets hierarchical relationships in data.</p>
<p><strong>Hyperbolic geometry advantages:</strong></p>
<ul>
<li><strong>Natural hierarchy encoding</strong>: Distance represents similarity, while norm encodes hierarchical level</li>
<li><strong>Efficient representation</strong>: Requires fewer dimensions for hierarchical data</li>
<li><strong>Mathematical elegance</strong>: Leverages properties of hyperbolic space for embedding optimization</li>
</ul>
<p><strong>Applications:</strong>
Particularly effective for data with inherent hierarchical structure, such as:</p>
<ul>
<li>WordNet taxonomies</li>
<li>Organizational charts</li>
<li>Computer network topologies</li>
<li>Knowledge graphs</li>
</ul>
<p>The <a href="https://arxiv.org/abs/1705.08039">original paper</a> reports that Poincaré embeddings reproduce WordNet relationships with significantly lower dimensionality than Euclidean embedding methods.</p>
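<p>The geometry comes from the distance function of the Poincaré ball, $d(u, v) = \operatorname{arcosh}\left(1 + 2\frac{\lVert u - v \rVert^2}{(1 - \lVert u \rVert^2)(1 - \lVert v \rVert^2)}\right)$. The sketch below evaluates it for hypothetical two-dimensional points; the function name and the placements are assumptions for illustration.</p>

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_norm_u) * (1 - sq_norm_v)))

# Hypothetical placements: a general concept at the origin, specific ones near the edge
root = [0.0, 0.0]
leaf_a = [0.9, 0.0]
leaf_b = [0.0, 0.9]

print(poincare_distance(root, leaf_a))
print(poincare_distance(leaf_a, leaf_b))
```

<p>Distances grow rapidly as points approach the boundary, so two leaves near the edge end up far from each other while both remain comparatively close to the root. This is what lets the norm act as a proxy for depth in a hierarchy.</p>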
<h2 id="contextual-embeddings">Contextual Embeddings</h2>
<h3 id="elmo-embeddings-from-language-models">ELMo (Embeddings from Language Models)</h3>
<p><a href="https://github.com/allenai/allennlp-models">ELMo</a> represents a paradigm shift toward contextual word representations. Instead of assigning each word a single fixed vector, ELMo generates dynamic representations based on the surrounding sentence.</p>
<p><strong>Architecture:</strong></p>
<ul>
<li><strong>Bidirectional LSTM</strong>: Processes text in both forward and backward directions</li>
<li><strong>Character-level input</strong>: Handles OOV words and captures morphological patterns</li>
<li><strong>Multi-layer representations</strong>: Combines different abstraction levels</li>
</ul>
<p><strong>Layer specialization:</strong></p>
<ul>
<li><strong>Lower layers</strong>: Excel at syntactic tasks (POS tagging, parsing)</li>
<li><strong>Higher layers</strong>: Capture semantic relationships (word sense disambiguation)</li>
<li><strong>Combined layers</strong>: A learned, task-specific weighted combination of the layers is passed to the downstream model</li>
</ul>
<p><strong>Key innovation:</strong>
ELMo embeddings vary by context. The word &ldquo;bank&rdquo; receives different representations in &ldquo;river bank&rdquo; versus &ldquo;financial bank,&rdquo; addressing polysemy directly through contextual awareness.</p>
<p>This approach achieved strong performance across numerous NLP tasks by providing context-sensitive representations that adapt to word usage patterns.</p>
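<p>The layer combination is a softmax-weighted sum of the per-layer vectors, multiplied by a scalar. The sketch below applies it to one token; the function name <code>mix_layers</code> and the two-dimensional layer outputs are assumptions for illustration, and in practice the weights and the scale are learned together with the downstream task.</p>

```python
import math

def softmax(weights):
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def mix_layers(layer_vectors, layer_weights, gamma=1.0):
    """Collapse the per-layer vectors for one token into a single vector."""
    s = softmax(layer_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s[j] * layer_vectors[j][d] for j in range(len(s)))
        for d in range(dim)
    ]

# Hypothetical 2-d outputs of three layers for a single token
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights reduce to a plain average of the layers
mixed = mix_layers(layers, layer_weights=[0.0, 0.0, 0.0])
print(mixed)
```

<p>A syntactic task can learn to put more weight on the lower layers, while a semantic task can shift weight toward the higher ones.</p>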
<h3 id="probabilistic-fasttext">Probabilistic FastText</h3>
<p><a href="https://github.com/benathi/multisense-prob-fasttext">Probabilistic FastText</a> addresses polysemy (words with multiple meanings) through probabilistic modeling. Traditional embeddings conflate different word senses into single representations, limiting their precision.</p>
<p><strong>The polysemy problem:</strong>
Consider &ldquo;rock&rdquo; which can mean:</p>
<ul>
<li>Rock music (genre)</li>
<li>A stone (geological object)</li>
<li>Rocking motion (verb)</li>
</ul>
<p>Standard embeddings average these meanings, producing representations that may not capture any sense precisely.</p>
<p><strong>Probabilistic approach:</strong>
Probabilistic FastText represents words as Gaussian mixture models: probability distributions that can capture multiple distinct meanings as separate components.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li><strong>Multi-sense representation</strong>: Each word sense gets its own distribution</li>
<li><strong>Context sensitivity</strong>: Can select appropriate sense based on usage context</li>
<li><strong>Uncertainty quantification</strong>: Probabilistic framework captures embedding confidence</li>
</ul>
<p>This approach provides a more nuanced treatment of lexical ambiguity, particularly valuable for words with distinct, context-dependent meanings.</p>
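<p>The idea of keeping one component per sense can be illustrated with a toy lookup. This is not the model&rsquo;s training objective: it keeps only hypothetical component means for &ldquo;rock&rdquo;, ignores mixture weights and covariances, and picks the sense whose mean is most similar to a context vector. All names and values are assumptions.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical component means for the word "rock" (not learned)
rock_senses = {
    "music": [0.9, 0.1],
    "stone": [0.1, 0.9],
}

def closest_sense(senses, context_vector):
    """Pick the sense whose component mean best matches the context."""
    return max(senses, key=lambda name: cosine(senses[name], context_vector))

print(closest_sense(rock_senses, [0.8, 0.2]))  # prints "music"
print(closest_sense(rock_senses, [0.2, 0.7]))  # prints "stone"
```

<p>A single averaged vector for &ldquo;rock&rdquo; would sit between the two means and match neither context well.</p>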
<h2 id="summary-and-future-directions">Summary and Future Directions</h2>
<p>Word embeddings have evolved from simple one-hot encodings to contextual representations that capture nuanced linguistic relationships. Each approach offers distinct advantages:</p>
<p><strong>Static embeddings</strong> (Word2Vec, GloVe, FastText) provide:</p>
<ul>
<li>Computational efficiency for large-scale applications</li>
<li>Pre-trained models available for numerous languages</li>
<li>Clear analogical reasoning capabilities</li>
<li>Good performance on many downstream tasks</li>
</ul>
<p><strong>Contextual embeddings</strong> (ELMo, BERT, GPT) offer:</p>
<ul>
<li>Dynamic representations based on sentence context</li>
<li>Better handling of polysemy and word sense disambiguation</li>
<li>Strong performance on complex NLP tasks</li>
<li>Ability to capture subtle contextual nuances</li>
</ul>
<p><strong>Choosing the right approach</strong> depends on:</p>
<ul>
<li><strong>Task requirements</strong>: Static embeddings for efficiency, contextual for accuracy</li>
<li><strong>Data availability</strong>: Pre-trained models vs. domain-specific training</li>
<li><strong>Computational constraints</strong>: Static embeddings require less processing power</li>
<li><strong>Language coverage</strong>: Consider availability of pre-trained models for target languages</li>
</ul>
<p>The field continues advancing toward more efficient contextual models, better multilingual representations, and embeddings that capture increasingly complex linguistic phenomena.</p>
<p>For a from-scratch Word2Vec implementation in PyTorch (Skip-gram and CBOW, with hierarchical softmax and negative sampling) that takes these concepts further, see the <a href="/projects/modern-word2vec/">PyTorch Word2Vec project</a>.</p>
]]></content:encoded></item><item><title>Term Schedule Optimizer</title><link>https://hunterheidenreich.com/projects/term-schedule-optimizer/</link><pubDate>Wed, 15 Feb 2017 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/term-schedule-optimizer/</guid><description>A constraint satisfaction solver built to generate conflict-free university schedules from web-scraped course data.</description><content:encoded><![CDATA[<p>A Python-based automation tool I wrote as a freshman to solve the &ldquo;Term Master Schedule&rdquo; problem (and used throughout my undergrad from 2016 to 2020).</p>
<h2 id="overview">Overview</h2>
<p>Manually creating a university schedule involves solving a <strong>Constraint Satisfaction Problem (CSP)</strong> with multiple variables:</p>
<ul>
<li><strong>Hard Constraints:</strong> No time overlaps between classes.</li>
<li><strong>Soft Constraints:</strong> Preferences for &ldquo;no 8 AMs,&rdquo; specific lunch breaks, or maximizing free days.</li>
</ul>
<p>The naive approach (manually checking every possible combination) becomes intractable as the number of courses and sections grows.</p>
<h2 id="features">Features</h2>
<p>I built a script that:</p>
<ol>
<li><strong>Scraped Data:</strong> Parsed the Drexel WebTMS (Term Master Schedule) using <code>lxml</code> to build a localized dataset of course availability.</li>
<li><strong>Solved for X:</strong> Implemented a <strong>recursive backtracking algorithm</strong> to generate every valid schedule permutation that satisfied user-defined constraints.</li>
</ol>
<h3 id="the-algorithm">The Algorithm</h3>
<p>The core of this project is a <code>recursive_generator</code> function that implements a backtracking CSP solver. It performs a recursive depth-first search that:</p>
<ol>
<li>Takes a set of variables (courses).</li>
<li>Checks constraints (time overlaps, lunch hours, max classes per day).</li>
<li>Backtracks when a branch fails.</li>
</ol>
<p>It is the same backtracking pattern used in everything from Sudoku solvers to compiler register allocation.</p>
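<p>The pattern can be sketched in a few lines. This is an illustrative reconstruction, not the project&rsquo;s actual <code>recursive_generator</code>: the function names, the section dictionaries, and the course data are all hypothetical, and only the hard constraint (no time overlaps) is checked.</p>

```python
def overlaps(a, b):
    """Two sections conflict if they share a day and their time ranges intersect."""
    share_day = bool(set(a["days"]) & set(b["days"]))
    return share_day and a["start"] < b["end"] and b["start"] < a["end"]

def generate_schedules(courses, chosen=None):
    """Yield every conflict-free choice of one section per course."""
    chosen = chosen or []
    if len(chosen) == len(courses):
        yield list(chosen)
        return
    for section in courses[len(chosen)]:
        # Constraint check: prune this branch as soon as a conflict appears
        if any(overlaps(section, other) for other in chosen):
            continue
        chosen.append(section)
        yield from generate_schedules(courses, chosen)
        chosen.pop()  # backtrack and try the next section

# Hypothetical course data; times are minutes after midnight
courses = [
    [{"id": "CS-A", "days": "MWF", "start": 540, "end": 590},
     {"id": "CS-B", "days": "TR", "start": 600, "end": 680}],
    [{"id": "MATH-A", "days": "MWF", "start": 560, "end": 610},
     {"id": "MATH-B", "days": "TR", "start": 780, "end": 860}],
]

for schedule in generate_schedules(courses):
    print([section["id"] for section in schedule])
```

<p>Of the four possible combinations, only the pairing of CS-A with MATH-A is rejected, because both meet on the same days at overlapping times. Soft constraints such as &ldquo;no 8 AMs&rdquo; could be added as extra checks at the same pruning point.</p>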
<h2 id="usagegameplay">Usage</h2>
<p>The tool is run via the command line, taking a list of desired courses and outputting valid schedule combinations.</p>
<h2 id="results">Results</h2>
<p>This tool saved me (and several friends) hours of planning time each quarter. While the scraping logic was fragile (dependent on 2017 HTML structures), the core logic (a depth-first search through the state space of possible schedules) remains a fundamental algorithmic pattern.</p>
]]></content:encoded></item><item><title>Rubik's Cube Sonification</title><link>https://hunterheidenreich.com/projects/rubiks-cube-player/</link><pubDate>Sun, 29 Jan 2017 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/rubiks-cube-player/</guid><description>A hackathon experiment in algorithmic musicology: mapping the visual entropy of a Rubik's Cube to harmonic audio synthesis.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Built in under 24 hours at the Drexel 2017 Music Hackathon, this project attempts to answer a question: <em>What does order sound like?</em></p>
<p>The system uses a webcam to scan a Rubik&rsquo;s cube face and algorithmically generates audio based on the color configuration. A scrambled cube generates dissonant, complex waveforms; a solved cube resolves into a pure, harmonious chord.</p>
<h2 id="features">Features</h2>
<p>This freshman-year project was built on <strong>first principles</strong>:</p>
<ul>
<li><strong>Manual Waveform Synthesis:</strong> The audio engine generates raw 8-bit PCM audio byte-by-byte using sine functions (<code>math.sin</code>), played at a 16 kHz sample rate.</li>
<li><strong>Algorithmic Harmony:</strong> Colors are mapped to musical intervals. The &ldquo;center&rdquo; color establishes the root note (Tonic), while the surrounding &ldquo;cubies&rdquo; determine the chord structure and melody using equal temperament frequency calculations ($f = f_0 \cdot 2^{n/12}$).</li>
</ul>
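<p>Both ideas fit in a short sketch. This is an illustrative reconstruction rather than the hackathon code: the function names, the 440 Hz root, and the choice of a major triad are assumptions. It computes equal-temperament frequencies and renders a sine tone as unsigned 8-bit PCM at the 16 kHz sample rate mentioned above.</p>

```python
import math

SAMPLE_RATE = 16000  # 16 kHz, as described above

def note_frequency(semitones, root_hz=440.0):
    """Equal temperament: f = f0 * 2^(n/12)."""
    return root_hz * 2 ** (semitones / 12)

def sine_pcm8(frequency_hz, duration_s=0.5):
    """Render a sine tone as unsigned 8-bit PCM samples in the range 0-255."""
    n_samples = int(SAMPLE_RATE * duration_s)
    return bytes(
        round(127.5 + 127.5 * math.sin(2 * math.pi * frequency_hz * t / SAMPLE_RATE))
        for t in range(n_samples)
    )

# A major triad: root, major third (+4 semitones), perfect fifth (+7 semitones)
triad = [note_frequency(n) for n in (0, 4, 7)]
print([round(f, 2) for f in triad])

samples = sine_pcm8(triad[0])
print(len(samples), min(samples), max(samples))
```

<p>Mixing several such tones sample by sample produces a chord; intervals with simple frequency ratios sound consonant, which is how a solved face can map to a more harmonious result.</p>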
<h2 id="usagegameplay">Usage</h2>
<p>The application runs via a Python script, requiring a webcam to scan the Rubik&rsquo;s cube.</p>
<h2 id="results">Results</h2>
<p>Looking back at this code 8 years later, it serves as a &ldquo;time capsule&rdquo; of my early engineering mindset.</p>
<ul>
<li><strong>The &ldquo;Hack&rdquo;:</strong> The computer vision relied on hardcoded pixel coordinates and raw OS shell calls, classic &ldquo;glue code&rdquo; behavior typical of hackathons.</li>
<li><strong>The Lesson:</strong> While brittle, the project successfully demonstrated how to bridge the gap between physical entropy and digital signal processing using fundamental programming concepts.</li>
</ul>
<h2 id="related-content">Related Content</h2>
<ul>
<li><a href="/videos/rubiks-cube-player-hackathon/">Video Demonstration</a></li>
</ul>
]]></content:encoded></item><item><title>Rubik's Cube Player - Drexel Music Hackathon 2017</title><link>https://hunterheidenreich.com/videos/rubiks-cube-player-hackathon/</link><pubDate>Sun, 29 Jan 2017 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/videos/rubiks-cube-player-hackathon/</guid><description>A hackathon project that converts Rubik's cube faces into music, creating more harmonious sounds as the cube becomes more solved.</description><content:encoded><![CDATA[<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
      <iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube-nocookie.com/embed/s0xpRTJYw4I?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video"></iframe>
    </div>

<p>This is a project I worked on with Emmanuel Espino and Jason Zogheb for the Drexel 2017 Music Hackathon at the ExCITe Center. The Rubik&rsquo;s Cube Player uses a webcam to capture images of a Rubik&rsquo;s cube face and converts the color patterns into musical pitches.</p>
<p>The interesting part is that the more solved a cube face is, the more harmonious the generated melody sounds. We built this in less than 24 hours using Python, computer vision libraries, and audio synthesis techniques.</p>
<h2 id="technical-details">Technical Details</h2>
<ul>
<li><strong>Computer Vision</strong>: Uses webcam capture and image processing to identify cube colors</li>
<li><strong>Audio Generation</strong>: Converts color patterns to musical pitches and waveforms</li>
<li><strong>Real-time Processing</strong>: Live analysis and audio playback</li>
<li><strong>Harmony Algorithm</strong>: More solved faces produce more consonant musical intervals</li>
</ul>
<h2 id="links">Links</h2>
<ul>
<li><strong>GitHub Repository</strong>: <a href="https://github.com/hunter-heidenreich/Rubik-s-Player">Rubik&rsquo;s Player</a></li>
<li><strong>Project Page</strong>: <a href="/projects/rubiks-cube-player/">Rubik&rsquo;s Cube Player</a></li>
<li><strong>Contributors</strong>: Emmanuel Espino, Jason Zogheb, and myself</li>
</ul>
]]></content:encoded></item></channel></rss>