Generating Mini-Protein Trajectories with GROMACS

Introduction

When I started developing machine learning models for protein dynamics, I needed training data, and lots of it. Most researchers start with alanine dipeptide, a single alanine residue capped at both ends that has become the “hello world” of protein simulation. It’s small enough to simulate quickly but complex enough to show interesting conformational behavior.

I wanted more diversity in my training data. Different amino acid side chains behave differently, and I was curious how this would affect model performance. So I extended the typical alanine dipeptide approach to include eight other amino acids, creating a small collection of “mini-proteins” for ML studies.

This post covers the GROMACS scripts I developed to generate these trajectories systematically. While these aren’t groundbreaking proteins, they’ve been useful for testing how different chemical properties (aromatic rings, flexibility, branching) affect molecular dynamics, which helps train more robust ML models.

What Are Mini-Proteins?

In this context, “mini-proteins” are single amino acid residues capped with acetyl and N-methyl groups (Ace-X-Nme, where X is the amino acid). They are conventionally called dipeptides because the two caps create two peptide bonds. They’re not proteins in the biological sense; they’re closer to the simplest possible systems that still capture some protein-like behavior.

These systems are popular in computational studies because they:

Simulate quickly (seconds to minutes instead of hours)
Have well-characterized behavior for validation
Show enough complexity to be interesting
Can be systematically varied to study different chemical effects

Getting Started

The complete workflow and scripts are available on GitHub in the mini-proteins repository: https://github.com/hunter-heidenreich/mini-proteins

Requirements

Linux system with GROMACS installed
Python 3 with numpy and matplotlib
Basic familiarity with molecular dynamics concepts

Quick Start

git clone https://github.com/hunter-heidenreich/mini-proteins.git
cd mini-proteins
ID=ala ./scripts/run.sh

This runs the complete pipeline: energy minimization, solvation, equilibration, and production simulation. The default settings generate 1 ns of trajectory data saved every 100 fs. I chose high temporal resolution for my ML models, but you can adjust this in config/md_langevin.mdp.

For longer production runs (recommended for most applications), increase the simulation time to ~100 ns and reduce the save frequency to manage file sizes.

The Collection

I’ve included nine different amino acid dipeptides, each with distinct chemical properties:

Flexible systems: Glycine (smallest side chain), Alanine (methyl group)

Branched systems: Valine, Isoleucine, Leucine (different branching patterns)

Aromatic systems: Phenylalanine, Tryptophan (different ring structures)

Special cases: Proline (ring constraint), Methionine (sulfur chemistry)

This systematic set allows studying how different chemical features affect dynamics:

Does the flexibility of glycine lead to more diverse conformational sampling?
How do aromatic rings in tryptophan affect conformational transitions?
Does the ring constraint in proline create different energy landscapes?

While these questions might seem basic, having systematic data to test ML models against known chemical intuition builds confidence in the approach.
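One way to put a number on the first question is to compute the backbone dihedral angles (phi and psi) for every frame and measure how widely they spread over the Ramachandran plane. The sketch below is a minimal NumPy version under stated assumptions: the coordinates are already loaded as arrays, the four atoms that define each angle (C-N-CA-C for phi, N-CA-C-N for psi) have been looked up in the topology, and the helper names `dihedral` and `ramachandran_entropy` are my own illustrations, not part of the repository. The demo at the bottom uses synthetic angles, not simulation output.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points of shape (..., 3)."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1n = b1 / np.linalg.norm(b1, axis=-1, keepdims=True)
    # Components of b0 and b2 perpendicular to the central bond
    v = b0 - np.sum(b0 * b1n, axis=-1, keepdims=True) * b1n
    w = b2 - np.sum(b2 * b1n, axis=-1, keepdims=True) * b1n
    x = np.sum(v * w, axis=-1)
    y = np.sum(np.cross(b1n, v) * w, axis=-1)
    return np.degrees(np.arctan2(y, x))

def ramachandran_entropy(phi, psi, bins=36):
    """Shannon entropy (in nats) of the binned (phi, psi) distribution."""
    hist, _, _ = np.histogram2d(
        phi, psi, bins=bins, range=[[-180, 180], [-180, 180]]
    )
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing to the entropy
    return float(-np.sum(p * np.log(p)))

# Synthetic angles only: one broad and one narrow distribution.
rng = np.random.default_rng(0)
broad = rng.uniform(-180, 180, size=(2, 5000))
narrow = np.clip(rng.normal([[-60.0], [-45.0]], 10.0, size=(2, 5000)), -180, 180)
print("broad entropy: ", ramachandran_entropy(*broad))
print("narrow entropy:", ramachandran_entropy(*narrow))
```

A higher entropy means the trajectory visits more of the (phi, psi) plane. Comparing this number between glycine and proline runs of equal length would be one rough check on the flexibility question, though it depends on the bin count and on how well each run is converged.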

Ideally, a neural network trained on this dataset should learn transferable physics rather than memorize atom types. Training on both aliphatic (Val, Leu, Ile) and aromatic (Phe, Trp) systems exposes the model to how planar rings and flexible branched chains shape the local potential energy surface. Because the reference forces come from a classical force field, the model learns that force field’s description of these groups, not explicit electron density.

Generating ML-Ready Trajectory Data

Generating raw coordinates is easy; generating ML-ready data requires specific configurations. Standard MD workflows typically write compressed .xtc trajectories to save space, and these store reduced-precision coordinates only, dropping velocities and forces. To train Neural Network Potentials (NNPs), I configured the GROMACS pipeline differently.

The fastest way to generate trajectory data is using the run.sh script:

ID=ala ./scripts/run.sh

where ID is the three-letter amino acid code (here, ala for alanine).

This script performs energy minimization, solvation, neutralization, NVT equilibration, NPT equilibration, and production simulation. The resulting trajectory saves to the out/ID/data directory.
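To generate the whole collection, the same command can be repeated for each residue. The following is a small sketch of a batch driver, not a script from the repository. It assumes the other eight systems use lowercase three-letter codes in the same way as `ala`, and that it is run from the repository root; `RESIDUE_IDS` and `run_pipeline` are hypothetical names.

```python
import os
import subprocess

# Assumption: IDs follow the lowercase three-letter pattern of the `ala` example.
RESIDUE_IDS = ["ala", "gly", "val", "ile", "leu", "phe", "trp", "pro", "met"]

def run_pipeline(res_id, script="./scripts/run.sh", dry_run=False):
    """Run one residue, equivalent to `ID=<res_id> ./scripts/run.sh`.

    Returns the shell-style command string so it can be logged.
    """
    command = f"ID={res_id} {script}"
    if not dry_run:
        # Pass ID through the environment, as the shell one-liner does.
        subprocess.run([script], env={**os.environ, "ID": res_id}, check=True)
    return command

if __name__ == "__main__":
    for res_id in RESIDUE_IDS:
        # Switch dry_run to False to launch the simulations one after another.
        print(run_pipeline(res_id, dry_run=True))
```

With `check=True`, a failure in any stage stops the loop instead of silently moving on to the next residue.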

Why This Pipeline Differs from Standard Tutorials

A key deviation from standard tutorials is the use of Stochastic Dynamics (Langevin) as the integrator. This adds friction and noise terms to the equations of motion, ensuring correct thermodynamic sampling:

; config/md_langevin.mdp
integrator  = sd        ; Stochastic dynamics (Langevin)
dt          = 0.001     ; 1 fs timestep
nstxout     = 100       ; Save coordinates every 100 steps
nstvout     = 100       ; Save velocities every 100 steps
nstfout     = 100       ; Save forces every 100 steps
tc-grps     = Protein Non-Protein
tau_t       = 0.1  0.1  ; Inverse friction constant (ps)
ref_t       = 298  298  ; Reference temperature (K)

The critical settings for ML applications:

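Langevin Dynamics (sd): Couples every atom to a heat bath through friction and noise, which samples the canonical ensemble. Berendsen weak coupling, still found in many older tutorials, does not; the stochastic v-rescale thermostat does.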
Uncompressed Force Output (nstfout = 100): Writing to .trr format captures the precise atomic forces acting on every atom, essential for force-matching in NNP training
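High-Frequency Sampling (0.1 ps): Saving frames every 100 fs preserves short-time correlations between consecutive frames that standard 10 ps snapshots discard. The fastest bond vibrations (periods near 10 fs for bonds to hydrogen) are still not resolved at this interval.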
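To make the friction and noise terms concrete, here is a toy Langevin integrator for a one-dimensional harmonic oscillator in reduced units. It is a sketch of the general idea only: it uses the BAOAB splitting, which is not the leap-frog scheme GROMACS implements for `integrator = sd`, and every name and parameter in it is illustrative. In GROMACS terms, `gamma` plays the role of 1/tau_t.

```python
import numpy as np

def langevin_harmonic(n_steps, dt=0.05, gamma=1.0, kT=1.0, k=1.0, mass=1.0, seed=0):
    """Integrate a 1D harmonic oscillator with Langevin dynamics (BAOAB)."""
    rng = np.random.default_rng(seed)
    x, v = 0.0, 0.0
    c1 = np.exp(-gamma * dt)                  # friction: velocity damping per step
    c2 = np.sqrt(kT / mass * (1.0 - c1**2))   # noise amplitude that balances it
    xs = np.empty(n_steps)
    vs = np.empty(n_steps)
    for i in range(n_steps):
        v += 0.5 * dt * (-k * x) / mass           # B: half kick from the force
        x += 0.5 * dt * v                         # A: half drift
        v = c1 * v + c2 * rng.standard_normal()   # O: friction plus random kick
        x += 0.5 * dt * v                         # A: half drift
        v += 0.5 * dt * (-k * x) / mass           # B: half kick
        xs[i], vs[i] = x, v
    return xs, vs

xs, vs = langevin_harmonic(200_000)
# Equipartition: both averages should sit close to kT = 1 in reduced units.
print("kinetic estimate   <m v^2>:", np.mean(vs**2))
print("potential estimate <k x^2>:", np.mean(xs**2))
```

The friction term removes energy and the noise term puts it back; choosing the noise amplitude from the friction and target temperature is what keeps the long-run averages at kT.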

Note: A production simulation currently runs for 1 nanosecond, saved every 0.1 picoseconds (100 fs). For most applications, increase this to 100 nanoseconds and adjust the save frequency to avoid large data files. I targeted 100 fs because I needed correlated time data for ML models, but you may not need such high frequency.
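The trade-off between run length and save frequency is easy to estimate before launching anything. The sketch below counts frames from the timestep and output interval, then gives a rough lower bound on .trr size. It assumes single-precision output with positions, velocities, and forces all written, and it ignores per-frame headers. The atom count of 2,000 is a placeholder, not a number from the repository; substitute the size of your solvated system.

```python
def n_frames(sim_time_ns, dt_ps=0.001, nstout=100):
    """Approximate number of saved frames for a run of the given length."""
    n_steps = round(sim_time_ns * 1000.0 / dt_ps)
    return n_steps // nstout

def trr_bytes(n_atoms, frames, arrays=3, bytes_per_float=4):
    """Rough payload size: `arrays` vectors (x, v, f) of 3 floats per atom."""
    return n_atoms * 3 * arrays * bytes_per_float * frames

N_ATOMS = 2000  # placeholder atom count for the solvated system

for sim_ns in (1.0, 100.0):
    frames = n_frames(sim_ns)
    size_gb = trr_bytes(N_ATOMS, frames) / 1e9
    print(f"{sim_ns:6.1f} ns -> {frames:>9,d} frames, about {size_gb:.2f} GB")
```

Under these assumptions the default 1 ns run stays well under a gigabyte, while 100 ns at the same output interval grows a hundredfold, which is why the save frequency should drop for long runs.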

You can also run each step individually (see scripts/run.sh for examples).

The Systems

Here are the nine amino acid dipeptides I’ve included, each chosen for different chemical properties:

Alanine Dipeptide: The Standard

[Animation: alanine dipeptide molecular dynamics simulation]

The classic starting point for conformational sampling studies. The small methyl side chain makes it simple but not trivial.

Glycine Dipeptide: Maximum Flexibility

[Animation: glycine dipeptide molecular dynamics simulation]

No side chain means maximum backbone flexibility. Great for studying how the absence of side-chain constraints affects conformational sampling.

Proline Dipeptide: Built-in Rigidity

[Animation: proline dipeptide molecular dynamics simulation]

The side-chain ring closes back onto the backbone nitrogen, which constrains the backbone. It makes an interesting comparison to glycine’s flexibility.

Aromatic Systems

[Animation: phenylalanine dipeptide molecular dynamics simulation]

Phenylalanine: Simple benzene ring for studying aromatic interactions.

[Animation: tryptophan dipeptide molecular dynamics simulation]

Tryptophan: Larger indole ring system with more complex aromatic chemistry.

Branched Aliphatic Systems

[Animation: valine dipeptide molecular dynamics simulation]

Valine: β-branched, creates steric constraints near the backbone.

[Animation: isoleucine dipeptide molecular dynamics simulation]

Isoleucine: Also β-branched, but with a methyl and an ethyl group on the β-carbon, giving a different steric profile than valine.

[Animation: leucine dipeptide molecular dynamics simulation]

Leucine: γ-branched, so the branch point sits one carbon further from the backbone and the side chain has more conformational freedom.

Special Chemistry

[Animation: methionine dipeptide molecular dynamics simulation]

Methionine: Sulfur chemistry, different from the others and interesting for studying heteroatom effects.

What’s Next?

These mini-protein simulations have been useful for my ML work, providing systematic training data with controlled chemical variation. While they’re simple systems, they’ve helped me understand how different amino acid properties affect molecular behavior, knowledge that’s valuable when working with larger, more complex proteins.

The real value of this pipeline isn’t the simulations themselves; it’s the force extraction workflow. Having atomic forces alongside coordinates enables training NNPs via force matching, which typically converges faster and generalizes better than energy-only training. Tools like TorchMD-Net, NequIP, and MACE can train on this kind of data once the frames are converted from .trr into their own input formats.
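For readers new to force matching, the idea fits in a few lines. The sketch below fits a deliberately trivial energy model, E = 0.5 * k * |r|^2, to reference forces by minimizing the mean squared force error. Everything here is illustrative: the data are synthetic, the model is a stand-in for a neural network, and the function names are my own. A real NNP would obtain its forces from automatic differentiation of the predicted energy.

```python
import numpy as np

def model_forces(coords, k):
    """Forces of the toy model E = 0.5 * k * |r|^2, i.e. F = -dE/dr = -k * r."""
    return -k * coords

def force_matching_loss(coords, ref_forces, k):
    """Mean squared error between model and reference force components."""
    return float(np.mean((model_forces(coords, k) - ref_forces) ** 2))

def fit_k(coords, ref_forces):
    """Closed-form minimizer of the force-matching loss for the toy model."""
    return float(-np.sum(coords * ref_forces) / np.sum(coords * coords))

# Synthetic "trajectory": arrays shaped (frames, atoms, xyz).
rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 10, 3))
ref_forces = -2.5 * coords + 0.01 * rng.normal(size=coords.shape)

k_fit = fit_k(coords, ref_forces)
print("fitted k:", k_fit)
print("loss at fitted k:", force_matching_loss(coords, ref_forces, k_fit))
```

Each frame contributes three numbers per atom to the loss, compared with a single number per frame for energy-only training, which is one intuition for why force labels are so data-efficient.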

The scripts are designed to be easily modified for different amino acids or simulation conditions. I’ve tried to make the workflow straightforward while keeping it flexible.

This work complements my other molecular dynamics projects:

Cu Adatom Diffusion: Learning LAMMPS for surface simulations
Pt Adatom Diffusion: Extending to different elements

Together, these projects have given me a solid foundation in MD simulations for generating ML training data across different molecular systems.

Find the complete code and documentation on GitHub. Questions or suggestions? I’d love to hear from you, especially if you’ve found interesting ways to extend or improve the approach.

Acknowledgements

The scripts build on the GROMACS tutorial by Luca Tubiana at the University of Trento.
