Project Overview

I developed an automated GROMACS pipeline to generate high-fidelity molecular dynamics (MD) datasets for machine learning applications. The workflow automates the simulation of capped dipeptides across nine distinct residue types, creating a diverse training set suitable for Neural Network Potentials (NNPs).

Technical Architecture

Automated Simulation Pipeline

  • End-to-End Scripting: Bash-automated workflow covering topology generation (pdb2gmx), solvation, ion addition, and equilibration
  • Langevin Dynamics: Used the stochastic dynamics (sd) integrator, which acts as a Langevin thermostat, to sample the canonical (NVT) ensemble
  • High-Resolution Output: Configured to write trajectory frames every 0.1 ps (100 fs), so the dataset densely samples configurations visited by fast motions such as bond and angle vibrations
  • Force Extraction: Wrote output in .trr format, which stores atomic forces at full precision, a key requirement for force matching in ML potentials

; md_langevin.mdp (excerpt)
integrator  = sd        ; Stochastic (Langevin) dynamics; also acts as the thermostat
dt          = 0.001     ; 1 fs timestep
nstxout     = 100       ; Positions to .trr every 100 steps = 0.1 ps
nstfout     = 100       ; Forces to .trr every 100 steps = 0.1 ps
tc-grps     = Protein Non-Protein
tau_t       = 0.1  0.1  ; Inverse friction constant per group (ps)
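
The output resolution follows directly from the .mdp values: the frame interval is dt multiplied by nstxout. The Python sketch below is only an illustration of that arithmetic, not part of the Bash pipeline; the helper names parse_mdp and output_interval_ps are hypothetical, and the embedded text repeats the excerpt above.

```python
def parse_mdp(text):
    """Parse 'key = value ; comment' lines from an .mdp file into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalize names so that 'tc-grps' and 'tc_grps' map to the same key.
        params[key.strip().replace("-", "_").lower()] = value.strip()
    return params


def output_interval_ps(params, key="nstxout"):
    """Time between written frames, in ps: timestep times output stride."""
    return float(params["dt"]) * int(params[key])


# Same settings as the excerpt above.
MDP_TEXT = """
integrator  = sd
dt          = 0.001     ; 1 fs timestep
nstxout     = 100
tc-grps     = Protein Non-Protein
tau_t       = 0.1  0.1
"""

params = parse_mdp(MDP_TEXT)
interval = output_interval_ps(params)
frames_per_ns = round(1000.0 / interval)  # 1 ns = 1000 ps

print(f"frame interval: {interval:.3f} ps")
print(f"frames per ns:  {frames_per_ns}")
```

A check like this can run before each simulation to confirm that the output stride still matches the intended 0.1 ps resolution after any edit to dt.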

Chemical Diversity Suite

Designed to stress-test ML models against varied kinematic constraints:

| Category | Residues | Dynamics Challenge |
|---|---|---|
| Aromatic | Phe, Trp | π-stacking, bulky side chains |
| Constrained | Pro | Cyclic backbone restrictions |
| Flexible | Gly, Ala | High conformational entropy |
| Branched | Val, Ile, Leu | Steric clashes, rotamer preferences |
| Sulfur-Containing | Met | Flexible thioether linkage |
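
One way to drive the per-residue runs is to keep the suite as data and derive job names from it. The sketch below assumes acetyl (ACE) and N-methylamide (NME) capping groups, which the text above does not specify, and the function name job_names is hypothetical.

```python
# Residue suite from the table above, keyed by category.
RESIDUE_CATEGORIES = {
    "Aromatic": ["PHE", "TRP"],
    "Constrained": ["PRO"],
    "Flexible": ["GLY", "ALA"],
    "Branched": ["VAL", "ILE", "LEU"],
    "Sulfur-Containing": ["MET"],
}


def job_names(categories, n_cap="ACE", c_cap="NME"):
    """Build one job name per residue, e.g. 'ACE-PHE-NME'.

    The cap names are an assumption for illustration only.
    """
    return [
        f"{n_cap}-{residue}-{c_cap}"
        for residues in categories.values()
        for residue in residues
    ]


jobs = job_names(RESIDUE_CATEGORIES)
for name in jobs:
    print(name)
```

Keeping the categories in one structure makes it easy to add residues later without touching the simulation logic.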

Engineering Challenges

  • Data Volume vs. Fidelity: Balanced high-frequency force output (every 100 steps) against storage constraints by automating a post-processing step that extracts forces into lightweight .xvg files
  • Force Field Consistency: Standardized on the Amber03 force field and TIP3P water model across all residues to ensure consistent potential energy surfaces for downstream model training
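
The force extraction step can be sketched as follows, assuming it is done with gmx traj and its -of option, which writes per-atom force components to an .xvg file. The file names md.trr, md.tpr, and forces.xvg are hypothetical, and the sample text contains made-up placeholder numbers used only to exercise the parser.

```python
def extraction_command(trr="md.trr", tpr="md.tpr", out="forces.xvg"):
    """Argument list for 'gmx traj' force extraction (file names are placeholders).

    gmx traj also asks for an atom group on stdin when run interactively.
    """
    return ["gmx", "traj", "-f", trr, "-s", tpr, "-of", out]


def parse_force_xvg(text):
    """Parse .xvg text into (times, forces).

    Lines starting with '#' or '@' are comments or plot metadata. Each data
    row is: time, then fx fy fz for every atom in the selected group.
    """
    times, forces = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        values = [float(v) for v in line.split()]
        time, comps = values[0], values[1:]
        if len(comps) % 3 != 0:
            raise ValueError("expected three force components per atom")
        times.append(time)
        forces.append([tuple(comps[i:i + 3]) for i in range(0, len(comps), 3)])
    return times, forces


# Placeholder rows for two atoms; the values are invented, not simulation output.
SAMPLE_XVG = """\
# sample text for illustration only
@    title "Force"
@    xaxis  label "Time (ps)"
0.000  1.0 -2.0 0.5  0.0 3.0 -1.5
0.100  0.5 -1.0 0.25  1.0 2.0 -0.5
"""

times, forces = parse_force_xvg(SAMPLE_XVG)
print(" ".join(extraction_command()))
print(len(times), "frames,", len(forces[0]), "atoms per frame")
```

Selecting only the solute atoms at this stage is what keeps the extracted files small relative to the full .trr, since solvent forces are dropped.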

Note: This pipeline uses Amber03 for consistency across residue types. For production ML potentials, consider switching to CHARMM36m or a similar modern force field.

Technical Highlights

  • Infrastructure as Code: Converted the manual, prompt-driven steps of GROMACS tutorials into reproducible, headless shell scripts
  • Atomic Force Preservation: Configured .trr output to retain uncompressed velocities and forces, essential for training NNPs via force matching
  • Ensemble Correctness: The Langevin thermostat samples the canonical (Boltzmann) distribution, unlike Berendsen weak coupling or plain velocity rescaling without a stochastic term
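
The headless setup can be illustrated with a short driver that replaces interactive prompts with explicit flags and piped input. The actual pipeline is written in Bash; this Python sketch only mirrors the idea. The file names, the 1.0 nm box margin, and the "SOL" group fed to genion are assumptions, and dry_run defaults to printing commands rather than running GROMACS.

```python
import subprocess


def build_commands(pdb, ff="amber03", water="tip3p"):
    """Return (argv, stdin_text) pairs for the system setup steps."""
    return [
        # -ff and -water answer the force field and water model prompts.
        (["gmx", "pdb2gmx", "-f", pdb, "-o", "processed.gro",
          "-p", "topol.top", "-ff", ff, "-water", water], None),
        # Center the solute in a cubic box with an assumed 1.0 nm margin.
        (["gmx", "editconf", "-f", "processed.gro", "-o", "boxed.gro",
          "-c", "-d", "1.0", "-bt", "cubic"], None),
        (["gmx", "solvate", "-cp", "boxed.gro", "-cs", "spc216.gro",
          "-o", "solvated.gro", "-p", "topol.top"], None),
        (["gmx", "grompp", "-f", "ions.mdp", "-c", "solvated.gro",
          "-p", "topol.top", "-o", "ions.tpr"], None),
        # genion prompts for the group to replace; pipe the answer in.
        (["gmx", "genion", "-s", "ions.tpr", "-o", "ionized.gro",
          "-p", "topol.top", "-pname", "NA", "-nname", "CL", "-neutral"],
         "SOL\n"),
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv, stdin_text in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, input=stdin_text, text=True, check=True)


commands = build_commands("ace-ala-nme.pdb")  # hypothetical input file
run_pipeline(commands)
```

Passing check=True makes a failed step stop the run immediately, which matters when the same script loops over all nine residues unattended.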

Impact

  • ML Training Data: Diverse trajectory datasets with atomic forces for neural networks learning interatomic potentials
  • Method Development: Foundation for generating training data for larger protein systems
  • Reproducible Science: Automated workflows others can extend to additional amino acids or simulation conditions