Project Overview
I developed an automated GROMACS pipeline to generate high-fidelity molecular dynamics (MD) datasets for machine learning applications. The workflow automates the simulation of capped dipeptides across nine distinct residue types, creating a diverse training set suitable for Neural Network Potentials (NNPs).
Technical Architecture
Automated Simulation Pipeline
- End-to-End Scripting: Bash-automated workflow handling topology generation (pdb2gmx), solvation, ionization, and equilibration
- Langevin Dynamics: Implemented Stochastic Dynamics (SD) integration to ensure proper canonical (NVT) ensemble sampling
- High-Resolution Output: Trajectory frames are written every 0.1 ps (100 fs), so the dataset densely samples fast local motions such as bond and angle vibrations
- Force Extraction: Optimized output to .trr format, preserving uncompressed atomic forces, a key requirement for force matching in ML potentials
; md_langevin.mdp
integrator = sd ; Stochastic dynamics for proper sampling
dt = 0.001 ; 1 fs timestep
nstxout = 100 ; Write coordinates every 100 steps = 0.1 ps resolution
nstfout = 100 ; Write forces at the same interval (needed for force matching)
tc-grps = Protein Non-Protein
tau_t = 0.1 0.1 ; Inverse friction constant per group (ps)
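The stages listed above can be sketched as an ordered list of GROMACS commands. This is a minimal illustration in Python rather than the project's actual Bash script: the function name `build_pipeline`, every file name except `md_langevin.mdp`, and the 1.0 nm box margin are assumptions made for the example. Nothing is executed; the function only returns the argument lists.

```python
from __future__ import annotations


def build_pipeline(pdb: str = "peptide.pdb") -> list[list[str]]:
    """Return the ordered gmx command lines for one capped dipeptide.

    Nothing is executed here. All file names except md_langevin.mdp, and the
    1.0 nm box margin, are illustrative assumptions.
    """
    top = "topol.top"
    return [
        # 1. Topology: Amber03 force field with TIP3P water
        ["gmx", "pdb2gmx", "-f", pdb, "-o", "processed.gro", "-p", top,
         "-ff", "amber03", "-water", "tip3p"],
        # 2. Centre the peptide in a cubic box
        ["gmx", "editconf", "-f", "processed.gro", "-o", "boxed.gro",
         "-c", "-d", "1.0", "-bt", "cubic"],
        # 3. Solvation (spc216.gro is the stock three-site water box)
        ["gmx", "solvate", "-cp", "boxed.gro", "-cs", "spc216.gro",
         "-o", "solvated.gro", "-p", top],
        # 4. Ion addition; genion asks on stdin which group to replace
        ["gmx", "grompp", "-f", "ions.mdp", "-c", "solvated.gro",
         "-p", top, "-o", "ions.tpr"],
        ["gmx", "genion", "-s", "ions.tpr", "-o", "ionized.gro",
         "-p", top, "-pname", "NA", "-nname", "CL", "-neutral"],
        # 5. Energy minimisation, then NVT equilibration
        ["gmx", "grompp", "-f", "em.mdp", "-c", "ionized.gro",
         "-p", top, "-o", "em.tpr"],
        ["gmx", "mdrun", "-deffnm", "em"],
        ["gmx", "grompp", "-f", "nvt.mdp", "-c", "em.gro",
         "-p", top, "-o", "nvt.tpr"],
        ["gmx", "mdrun", "-deffnm", "nvt"],
        # 6. Production run with the Langevin settings shown above
        ["gmx", "grompp", "-f", "md_langevin.mdp", "-c", "nvt.gro",
         "-t", "nvt.cpt", "-p", top, "-o", "md.tpr"],
        ["gmx", "mdrun", "-deffnm", "md"],
    ]
```

Each entry could be passed to `subprocess.run(cmd, cwd=residue_dir, check=True)`, with `residue_dir` being a per-residue working directory. The `genion` step also needs the name of the group to replace (typically `SOL`) supplied through the `input=` argument.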
Chemical Diversity Suite
Designed to stress-test ML models against varied kinematic constraints:
| Category | Residues | Dynamics Challenge |
|---|---|---|
| Aromatic | Phe, Trp | π-stacking, bulky side chains |
| Constrained | Pro | Cyclic backbone restrictions |
| Flexible | Gly, Ala | High conformational entropy |
| Branched | Val, Ile, Leu | Steric clashes, rotamer preferences |
| Sulfur-Containing | Met | Flexible thioether linkage |
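The table above maps naturally onto a small lookup structure that a driver script can iterate over. The sketch below is one way to express it in Python. The ACE/NME cap names are an assumption (a common capping choice with Amber force fields, not stated in this write-up), and `RESIDUE_SUITE` and `capped_sequences` are hypothetical names.

```python
# Residue categories from the table, using three-letter residue codes.
RESIDUE_SUITE = {
    "Aromatic": ["PHE", "TRP"],
    "Constrained": ["PRO"],
    "Flexible": ["GLY", "ALA"],
    "Branched": ["VAL", "ILE", "LEU"],
    "Sulfur-Containing": ["MET"],
}


def capped_sequences(suite, n_cap="ACE", c_cap="NME"):
    """Yield (category, residue, sequence) for every residue in the suite.

    The caps default to acetyl (ACE) and N-methylamide (NME); this choice is
    an assumption of the sketch, not something the pipeline description fixes.
    """
    for category, residues in suite.items():
        for residue in residues:
            yield category, residue, [n_cap, residue, c_cap]


if __name__ == "__main__":
    for category, residue, sequence in capped_sequences(RESIDUE_SUITE):
        print(f"{category:18s} {residue}  {'-'.join(sequence)}")
```

Keeping the suite in one place makes it straightforward to add a residue or a new category without touching the simulation logic.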
Engineering Challenges
- Data Volume vs. Fidelity: Balanced high-frequency force outputs (every 100 steps) against storage constraints by automating post-processing extraction of forces into lightweight .xvg files
- Force Field Consistency: Standardized on the Amber03 force field and TIP3P water model across all residues to ensure consistent potential energy surfaces for downstream model training
Note: This pipeline uses Amber03 for consistency across residue types. For production ML potentials, consider swapping to CHARMM36m or a similar modern force field.
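To illustrate the post-processing step, here is a minimal Python reader for force data in .xvg form. It assumes the layout written by `gmx traj -of`: one row per frame holding the time followed by fx, fy, fz for each selected atom, with `#` comment lines and `@` plotting directives. The function name `parse_force_xvg` and the sample data are hypothetical, and the project's own extraction scripts may be organised differently.

```python
def parse_force_xvg(text):
    """Parse XVG text into (times, forces).

    forces[frame][atom] is an (fx, fy, fz) tuple. GROMACS reports forces in
    kJ mol^-1 nm^-1 and times in ps.
    """
    times, forces = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, '#' comments, '@' plot directives and '&' separators.
        if not line or line[0] in "#@&":
            continue
        values = [float(v) for v in line.split()]
        t, comps = values[0], values[1:]
        if len(comps) % 3 != 0:
            raise ValueError(f"expected x/y/z triples, got {len(comps)} values")
        times.append(t)
        forces.append([tuple(comps[i:i + 3]) for i in range(0, len(comps), 3)])
    return times, forces


# Synthetic two-frame, two-atom example (not real simulation output).
SAMPLE = """# synthetic example
@    title "Force"
@    xaxis  label "Time (ps)"
0.0  1.0 -2.0 0.5  0.0 3.0 -1.5
0.1  0.9 -1.8 0.4  0.1 2.9 -1.4
"""

if __name__ == "__main__":
    times, forces = parse_force_xvg(SAMPLE)
    print(len(times), "frames,", len(forces[0]), "atoms per frame")
```

For large datasets, a binary array format is usually a better final container than text, but a plain parser like this is convenient for spot-checking extracted forces.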
Technical Highlights
- Infrastructure as Code: Converted manual, interactive GROMACS tutorial steps into reproducible, headless shell scripts
- Atomic Force Preservation: Configured .trr output to retain uncompressed velocities and forces, essential for training NNPs via force matching
- Ensemble Correctness: Langevin thermostat implementation ensures proper Boltzmann sampling, unlike weak-coupling (Berendsen-style) velocity rescaling, which does not generate a true canonical distribution
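The ensemble-correctness point can be checked on a toy system. The sketch below integrates a 1D harmonic oscillator with Langevin dynamics and compares the time averages against equipartition (⟨x²⟩ = kT/k and ⟨v²⟩ = kT/m). It uses the BAOAB splitting scheme in reduced units, which is not the integrator GROMACS uses for `integrator = sd`. It is meant only to illustrate the idea, with `gamma` playing the role of 1/`tau_t`. All names and parameter values are assumptions of this example.

```python
import math
import random


def langevin_harmonic(n_steps=200_000, dt=0.05, gamma=1.0,
                      kT=1.0, k=1.0, m=1.0, seed=0):
    """Integrate a 1D harmonic oscillator with BAOAB Langevin dynamics.

    Returns the time averages (<x^2>, <v^2>) in reduced units.
    """
    rng = random.Random(seed)
    c1 = math.exp(-gamma * dt)                # velocity damping over one step
    c2 = math.sqrt(kT / m * (1.0 - c1 * c1))  # matching noise amplitude
    x, v = 0.0, 0.0
    sum_x2 = sum_v2 = 0.0
    for _ in range(n_steps):
        v += 0.5 * dt * (-k * x) / m           # B: half kick from the force
        x += 0.5 * dt * v                      # A: half drift
        v = c1 * v + c2 * rng.gauss(0.0, 1.0)  # O: friction plus random kick
        x += 0.5 * dt * v                      # A: half drift
        v += 0.5 * dt * (-k * x) / m           # B: half kick from the force
        sum_x2 += x * x
        sum_v2 += v * v
    return sum_x2 / n_steps, sum_v2 / n_steps


if __name__ == "__main__":
    x2, v2 = langevin_harmonic()
    print(f"<x^2> = {x2:.3f} (target kT/k = 1), <v^2> = {v2:.3f} (target kT/m = 1)")
```

Both averages should land close to 1 in these units, within statistical noise. The same kind of check (kinetic and configurational temperature against the target) is a reasonable sanity test for any thermostat setting before generating training data.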
Impact
- ML Training Data: Diverse trajectory datasets with atomic forces for neural networks learning interatomic potentials
- Method Development: Foundation for generating training data for larger protein systems
- Reproducible Science: Automated workflows others can extend to additional amino acids or simulation conditions
Related Work
- Mini-Protein Dynamics - Detailed blog post on the simulation methodology
