Overview

I developed an automated GROMACS pipeline that generates molecular dynamics (MD) datasets for machine learning applications. The workflow simulates capped dipeptides across nine distinct residue types, producing a diverse training set suitable for Neural Network Potentials (NNPs). The pipeline builds on Luca Tubiana’s GROMACS tutorial (University of Trento); the Python analysis layer and the curated dipeptide dataset are my own.

Features

Automated Simulation Pipeline

  • End-to-End Scripting: Bash-automated workflow handling topology generation (pdb2gmx), solvation, ion addition, and equilibration
  • Langevin Dynamics: Implemented Stochastic Dynamics (SD) integration to ensure proper canonical (NVT) ensemble sampling
  • High-Resolution Output: Writes trajectory frames every 0.1 ps (100 fs), so fast bond vibrations are densely sampled
  • Force Extraction: Writes full-precision .trr output, which stores atomic forces alongside coordinates (the compressed .xtc format holds coordinates only), a key requirement for force-matching in ML potentials
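
; md_langevin.mdp (excerpt)
integrator  = sd        ; Stochastic (Langevin) dynamics; also acts as the thermostat
dt          = 0.001     ; Timestep in ps (1 fs)
nstxout     = 100       ; Write coordinates every 100 steps = 0.1 ps resolution
nstfout     = 100       ; Write forces to the .trr at the same interval
tc-grps     = Protein Non-Protein
tau_t       = 0.1  0.1  ; Inverse friction constant per group (ps)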
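
The output cadence implied by these settings can be checked with a few lines of Python. This is a minimal sketch rather than part of the pipeline itself: the helper names (`parse_mdp`, `output_interval_ps`, `frames_written`) are hypothetical, the parser only handles simple `key = value ; comment` lines, and the 1 ns run length is taken from the Retrospective section.

```python
def parse_mdp(text):
    """Parse simple 'key = value ; comment' lines of an .mdp file into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalise dashes so 'tc-grps' and 'tc_grps' map to the same key.
        params[key.strip().replace("-", "_")] = value.strip()
    return params


def output_interval_ps(params, key="nstxout"):
    """Time between written frames: timestep (ps) times output stride (steps)."""
    return float(params["dt"]) * int(params[key])


def frames_written(params, total_ps, key="nstxout"):
    """Frames in a run of total_ps, counting the frame written at step 0."""
    steps = round(total_ps / float(params["dt"]))
    return steps // int(params[key]) + 1


# Inline copy of the settings shown above (illustrative only).
MDP = """
integrator  = sd        ; Stochastic dynamics
dt          = 0.001     ; 1 fs timestep
nstxout     = 100       ; Output every 100 steps
tau_t       = 0.1  0.1
"""

params = parse_mdp(MDP)
print("interval (ps):", output_interval_ps(params))
print("frames in 1 ns:", frames_written(params, total_ps=1000.0))
```

A check like this is cheap to run before launching a batch, since a mistyped stride changes the dataset size by orders of magnitude.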

Chemical Diversity Suite

Designed to stress-test ML models against varied kinematic constraints:

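| Category          | Residues      | Dynamics Challenge                   |
|-------------------|---------------|--------------------------------------|
| Aromatic          | Phe, Trp      | π-stacking, bulky side chains        |
| Constrained       | Pro           | Cyclic backbone restrictions         |
| Flexible          | Gly, Ala      | High conformational entropy          |
| Branched          | Val, Ile, Leu | Steric clashes, rotamer preferences  |
| Sulfur-Containing | Met           | Flexible thioether linkage           |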
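
For scripting, the table above maps naturally onto a small lookup structure. The sketch below is illustrative: the dictionary mirrors the table, while the ACE/NME cap names and the `capped_sequence` helper are assumptions (the source only says the dipeptides are capped, not which caps are used).

```python
# Residue classes as listed in the table above (three-letter codes).
RESIDUE_CLASSES = {
    "Aromatic": ["PHE", "TRP"],
    "Constrained": ["PRO"],
    "Flexible": ["GLY", "ALA"],
    "Branched": ["VAL", "ILE", "LEU"],
    "Sulfur-Containing": ["MET"],
}


def capped_sequence(residue, n_cap="ACE", c_cap="NME"):
    """Label for a capped dipeptide; ACE/NME are assumed, not from the source."""
    return f"{n_cap}-{residue}-{c_cap}"


def all_systems(classes=RESIDUE_CLASSES):
    """Flatten the table into (category, residue, label) tuples."""
    return [
        (category, residue, capped_sequence(residue))
        for category, residues in classes.items()
        for residue in residues
    ]


systems = all_systems()
for category, residue, label in systems:
    print(f"{category:<18} {residue}  {label}")
print("total systems:", len(systems))
```

Keeping the category next to each residue makes it straightforward to report model error per class later on.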

Usage

The pipeline runs from Bash scripts and requires a working GROMACS installation (the gmx command must be on the PATH).
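
To give a feel for what one run involves, the sketch below assembles a plausible per-residue command sequence as a dry run, written in Python so it can be inspected without GROMACS installed. It is not the actual Bash script: the file names, the 1.0 nm box margin, the `ions.mdp`, `em.mdp`, and `nvt.mdp` files, and the `build_commands` helper are all assumptions. Only the GROMACS subcommands, the Amber03/TIP3P choice, and `md_langevin.mdp` come from this document.

```python
import shlex


def build_commands(residue, ff="amber03", water="tip3p", box_nm=1.0):
    """Return the gmx invocations for one capped dipeptide (dry run only)."""
    r = residue.lower()
    top = f"{r}.top"
    cmds = [
        # Topology generation with the chosen force field and water model.
        ["gmx", "pdb2gmx", "-f", f"{r}.pdb", "-o", f"{r}.gro", "-p", top,
         "-ff", ff, "-water", water],
        # Centre the solute in a cubic box with an assumed margin.
        ["gmx", "editconf", "-f", f"{r}.gro", "-o", f"{r}_box.gro",
         "-c", "-d", str(box_nm), "-bt", "cubic"],
        # Solvation.
        ["gmx", "solvate", "-cp", f"{r}_box.gro", "-cs", "spc216.gro",
         "-o", f"{r}_solv.gro", "-p", top],
        # Ion addition; genion prompts for a group to replace when run for real.
        ["gmx", "grompp", "-f", "ions.mdp", "-c", f"{r}_solv.gro", "-p", top,
         "-o", f"{r}_ions.tpr"],
        ["gmx", "genion", "-s", f"{r}_ions.tpr", "-o", f"{r}_ions.gro",
         "-p", top, "-pname", "NA", "-nname", "CL", "-neutral"],
    ]
    previous = f"{r}_ions.gro"
    # Minimisation, equilibration, then the Langevin production run.
    for stage, mdp in (("em", "em.mdp"), ("nvt", "nvt.mdp"),
                       ("md", "md_langevin.mdp")):
        name = f"{r}_{stage}"
        cmds.append(["gmx", "grompp", "-f", mdp, "-c", previous, "-p", top,
                     "-o", f"{name}.tpr"])
        cmds.append(["gmx", "mdrun", "-deffnm", name])
        previous = f"{name}.gro"  # mdrun -deffnm writes the final structure here
    return cmds


commands = build_commands("ALA")
for cmd in commands:
    print(shlex.join(cmd))
```

Printing the commands first and executing them only after review is a simple way to catch a wrong file name before it costs compute time.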

Results

  • Data Volume vs. Fidelity: Forces are written every 100 steps (0.1 ps); to keep storage manageable, an automated post-processing step extracts them from the .trr files into lightweight .xvg files
  • Force Field Consistency: Standardized on the Amber03 force field and TIP3P water model across all residues to ensure consistent potential energy surfaces for downstream model training
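
On the analysis side, the extracted forces can be read back with a small parser. The sketch below assumes the .xvg files were produced by `gmx traj -of`, which writes one row per frame with the time followed by x, y, z force components for each selected atom. The `parse_force_xvg` helper and the sample data are hypothetical and stand in for the real analysis layer.

```python
def parse_force_xvg(text):
    """Parse force .xvg text into a list of (time_ps, [(fx, fy, fz), ...])."""
    frames = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, '#' comments, and '@' plot directives.
        if not line or line[0] in "#@":
            continue
        values = [float(v) for v in line.split()]
        time_ps, flat = values[0], values[1:]
        if len(flat) % 3 != 0:
            raise ValueError(f"expected 3 components per atom at t={time_ps}")
        forces = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
        frames.append((time_ps, forces))
    return frames


# Made-up two-atom, two-frame excerpt used only to exercise the parser.
SAMPLE = """\
# hypothetical excerpt
@    title "Force"
0.0   1.0 -2.0 0.5   0.0 3.0 -1.5
0.1   0.5 -1.0 0.25  1.0 2.0 -0.5
"""

frames = parse_force_xvg(SAMPLE)
for time_ps, forces in frames:
    print(time_ps, forces)
```

For full-length trajectories a NumPy-based loader would be faster, but the row layout it has to handle is the same.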

Note: This pipeline uses Amber03 for consistency across residue types. For production ML potentials, consider swapping to CHARMM36m or a similar modern force field.

Retrospective

  • Demonstrative, not production-scale: the 1 ns trajectories exercise the pipeline and capture fast bond vibrations, but proper conformational sampling needs 100 ns to 1 µs runs. This is a working reference, not a finished dataset.
  • Dated force field: Amber03 / TIP3P keeps the potential energy surface consistent across residues, but it is not state-of-the-art for ML-potential training; CHARMM36m or Amber ff19SB would be the upgrade path.
  • Paused, not abandoned: a candidate to revive and extend (more residues, longer trajectories, Ramachandran analysis) for future force-matching work.