Mini-Protein Trajectory Generation

Project Overview

I developed an automated GROMACS pipeline to generate high-fidelity molecular dynamics (MD) datasets for machine learning applications. The workflow automates the simulation of capped dipeptides across nine distinct residue types, creating a diverse training set suitable for Neural Network Potentials (NNPs).

Technical Architecture

Automated Simulation Pipeline

End-to-End Scripting: Bash-automated workflow handling topology generation (pdb2gmx), solvation, ionization, and equilibration
Langevin Dynamics: Implemented Stochastic Dynamics (SD) integration to ensure proper canonical (NVT) ensemble sampling
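High-Resolution Output: Configured to write frames every 0.1 ps (100 fs), dense enough to sample fast local motions. Individual bond vibrations (periods of roughly 10 fs for bonds to hydrogen) are captured as instantaneous snapshots rather than resolved in time at this stride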
Force Extraction: Wrote trajectories in .trr format, which stores full-precision atomic forces (the compressed .xtc format stores positions only), a key requirement for force matching in ML potentials

; md_langevin.mdp
integrator  = sd        ; Stochastic (Langevin) dynamics for canonical sampling
dt          = 0.001     ; 1 fs timestep
nstxout     = 100       ; Positions every 100 steps = 0.1 ps resolution
nstfout     = 100       ; Forces every 100 steps, written to the .trr file
tc-grps     = Protein Non-Protein
tau_t       = 0.1  0.1  ; Inverse friction constant (ps), one value per group
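
As a quick sanity check on these settings, the frame spacing is simply the timestep multiplied by the output stride. The sketch below parses an .mdp fragment and computes that spacing. The helper `parse_mdp` is a hypothetical convenience function written for this example, not part of GROMACS, and it handles only simple `key = value ; comment` lines.

```python
def parse_mdp(text):
    """Parse 'key = value ; comment' lines from an .mdp fragment into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalize 'tc-grps' vs 'tc_grps' style spellings to one form
        params[key.strip().replace("-", "_")] = value.strip()
    return params


MDP_FRAGMENT = """
integrator  = sd        ; Stochastic dynamics
dt          = 0.001     ; 1 fs timestep
nstxout     = 100       ; Output every 100 steps
tau_t       = 0.1  0.1
"""

params = parse_mdp(MDP_FRAGMENT)
dt_ps = float(params["dt"])
stride = int(params["nstxout"])

frame_spacing_ps = dt_ps * stride                  # 0.001 ps * 100 steps
frames_per_ns = round(1000.0 / frame_spacing_ps)   # 1 ns = 1000 ps

print(f"frame spacing: {frame_spacing_ps:.3f} ps, {frames_per_ns} frames per ns")
```

With a 1 fs timestep and a stride of 100, this gives one frame every 0.1 ps, or 10,000 frames per nanosecond of simulation.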

Chemical Diversity Suite

Designed to stress-test ML models against varied kinematic constraints:

Category	Residues	Dynamics Challenge
Aromatic	Phe, Trp	π-stacking, bulky side chains
Constrained	Pro	Cyclic backbone restrictions
Flexible	Gly, Ala	High conformational entropy
Branched	Val, Ile, Leu	Steric clashes, rotamer preferences
Sulfur-Containing	Met	Flexible thioether linkage
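
One way to drive a batch of runs from this table is to encode it as a small data structure and iterate over it. The sketch below is illustrative only: the ACE/NME cap names are an assumption (the common Amber convention for acetyl and N-methylamide caps), since the text above only says the dipeptides are capped, and `capped_sequences` is a hypothetical helper.

```python
# Residue suite from the table above, keyed by category.
RESIDUE_SUITE = {
    "Aromatic": ["PHE", "TRP"],
    "Constrained": ["PRO"],
    "Flexible": ["GLY", "ALA"],
    "Branched": ["VAL", "ILE", "LEU"],
    "Sulfur-Containing": ["MET"],
}

# Assumption: acetyl (ACE) and N-methylamide (NME) caps.
# The source does not name the capping groups.
N_CAP, C_CAP = "ACE", "NME"


def capped_sequences(suite):
    """Yield (category, residue, capped sequence) for every residue in the suite."""
    for category, residues in suite.items():
        for res in residues:
            yield category, res, f"{N_CAP}-{res}-{C_CAP}"


systems = list(capped_sequences(RESIDUE_SUITE))
for category, res, seq in systems:
    print(f"{category:<18} {res}  {seq}")
```

Iterating over `systems` yields nine entries, one per residue type, which can then be used to name run directories or input files.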

Engineering Challenges

Data Volume vs. Fidelity: Balanced high-frequency force output (every 100 steps) against storage constraints by automating a post-processing step that extracts forces into lightweight .xvg files
Force Field Consistency: Standardized the Amber03 force field and TIP3P water model across all residues to ensure consistent potential energy surfaces for downstream model training

Note: This pipeline uses Amber03 for consistency across residue types. For production ML potentials, consider swapping to CHARMM36m or a similar modern force field.
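
The storage side of this trade-off can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope calculation, not a measurement from this project: the atom counts and the 1 ns run length are hypothetical, single precision (4 bytes per value) is assumed, and frame headers and box vectors are ignored.

```python
BYTES_PER_FLOAT = 4  # assumes a single-precision GROMACS build


def frame_payload_bytes(n_atoms, n_vectors=3):
    """Bytes for n_vectors per-atom 3-vectors (e.g. positions, velocities, forces)."""
    return n_atoms * 3 * n_vectors * BYTES_PER_FLOAT


def trajectory_bytes(n_atoms, length_ns, dt_ps=0.001, stride=100, n_vectors=3):
    """Approximate payload for a whole run; frame headers and box are ignored."""
    n_frames = round(length_ns * 1000.0 / (dt_ps * stride))
    return n_frames * frame_payload_bytes(n_atoms, n_vectors)


# Hypothetical numbers: 2,000 atoms (solute plus water) and a 1 ns run.
full = trajectory_bytes(n_atoms=2000, length_ns=1.0)

# Forces only, for a hypothetical 22-atom solute.
solute_forces = trajectory_bytes(n_atoms=22, length_ns=1.0, n_vectors=1)

print(f"full system, x+v+f: {full / 1e6:.0f} MB per ns")
print(f"solute forces only: {solute_forces / 1e6:.2f} MB per ns")
```

Under these assumptions the full-system payload is a few hundred times larger than the solute forces alone, which is why extracting only the quantities needed for training pays off quickly across nine residue types.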

Technical Highlights

Infrastructure as Code: Converted manual, interactive GROMACS tutorial steps into reproducible, headless shell scripts
Atomic Force Preservation: Configured .trr output to retain uncompressed velocities and forces, essential for training NNPs via force matching
Ensemble Correctness: The Langevin thermostat samples the canonical (Boltzmann) distribution, unlike naive velocity rescaling or Berendsen weak coupling, which do not generate a true canonical ensemble
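
To illustrate what running headless involves, the sketch below assembles the system-preparation commands and answers the one interactive prompt through standard input. It is a minimal sketch, not the project's actual scripts: the file names, the 1.0 nm box margin, and the cubic box are assumptions, and it covers preparation only (no minimization or equilibration). By default it prints the commands without calling GROMACS.

```python
import shlex
import subprocess


def pipeline_steps(pdb, ff="amber03", water="tip3p"):
    """Return (argv, stdin) pairs for system preparation; file names are hypothetical."""
    return [
        (["gmx", "pdb2gmx", "-f", pdb, "-o", "conf.gro", "-p", "topol.top",
          "-ff", ff, "-water", water], None),
        (["gmx", "editconf", "-f", "conf.gro", "-o", "box.gro",
          "-c", "-d", "1.0", "-bt", "cubic"], None),
        (["gmx", "solvate", "-cp", "box.gro", "-cs", "spc216.gro",
          "-o", "solv.gro", "-p", "topol.top"], None),
        (["gmx", "grompp", "-f", "ions.mdp", "-c", "solv.gro",
          "-p", "topol.top", "-o", "ions.tpr"], None),
        # genion prompts for the group to replace; answer it through stdin.
        (["gmx", "genion", "-s", "ions.tpr", "-o", "ions.gro", "-p", "topol.top",
          "-pname", "NA", "-nname", "CL", "-neutral"], "SOL\n"),
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, stdin_text in steps:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, input=stdin_text, text=True, check=True)


steps = pipeline_steps("ala.pdb")
run_steps(steps)  # dry run: prints the commands without calling GROMACS
```

Passing the force field and water model as flags, and feeding the group selection on stdin, removes the prompts that would otherwise require a person at the terminal.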

Impact

ML Training Data: Diverse trajectory datasets with atomic forces for neural networks learning interatomic potentials
Method Development: Foundation for generating training data for larger protein systems
Reproducible Science: Automated workflows others can extend to additional amino acids or simulation conditions

Related Work

Mini-Protein Dynamics - Detailed blog post on the simulation methodology

Project Details
Author	Hunter Heidenreich
Category	Computational Biology
Type	Simulation
Status	Complete
Date	August 2023
Links	📝 Blog Post • 💻 Code
