Introduction
While learning to develop machine learning models for protein dynamics, I needed to generate training data, and lots of it. Most researchers start with alanine dipeptide (AD), a single alanine residue capped at both ends that has become the “hello world” of protein simulation. It’s small enough to simulate quickly but complex enough to show interesting conformational behavior.
But I wanted more diversity in my training data. What if different amino acid side chains behave differently? How would that affect my models? So I extended the typical alanine dipeptide approach to include eight other amino acids, creating a small collection of “mini-proteins” for ML studies.
This post covers the GROMACS scripts I developed to generate these trajectories systematically. These aren’t groundbreaking systems by any means, but they’ve been useful for testing how different chemical properties (aromatic rings, backbone flexibility, sulfur-containing side chains) affect molecular dynamics, which could be valuable for training more robust ML models.
What Are Mini-Proteins?
In this context, “mini-proteins” are single amino acid residues capped with acetyl and N-methyl groups (Ace-X-Nme, where X is the amino acid). They’re not actually proteins in the biological sense—more like the simplest possible systems that still capture some protein-like behavior.
These systems are popular in computational studies because they:
- Simulate quickly (seconds to minutes instead of hours)
- Have well-characterized behavior for validation
- Show enough complexity to be interesting
- Can be systematically varied to study different chemical effects
Getting Started
The complete workflow and scripts are available on GitHub: mini-proteins
What You’ll Need
- Linux system with GROMACS installed
- Python 3 with numpy and matplotlib
- Basic familiarity with molecular dynamics concepts
Quick Start
git clone https://github.com/hunter-heidenreich/mini-proteins.git
cd mini-proteins
ID=ala ./scripts/run.sh
This runs the complete pipeline: energy minimization, solvation, equilibration, and production simulation. The default settings generate 1 ns of trajectory data saved every 100 fs. I chose that interval because I needed high temporal resolution for my ML models, but you can adjust it in config/md_langevin.mdp.
For longer production runs (recommended for most applications), you’ll want to increase the simulation time to ~100 ns and reduce the save frequency to manage file sizes.
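To see what those settings imply, here is a minimal sketch that converts a run length and save interval into a step count, an output stride, and a frame count. The function name md_settings is hypothetical, and the 2 fs integration timestep is an assumption; check the dt value in config/md_langevin.mdp before relying on these numbers. In a GROMACS .mdp file the step count corresponds to nsteps and the stride to an output option such as nstxout or nstxout-compressed; which of those the repository's config uses is not shown here.

```shell
# md_settings TOTAL_NS SAVE_FS DT_FS
# Prints the step count, output stride (in steps), and number of saved frames.
# All arguments are integers; DT_FS is an assumed timestep, not read from the repo.
md_settings() {
  total_fs=$(( $1 * 1000000 ))          # 1 ns = 1,000,000 fs
  echo "nsteps=$(( total_fs / $3 )) stride=$(( $2 / $3 )) frames=$(( total_fs / $2 ))"
}

md_settings 1 100 2      # the post's default: 1 ns saved every 100 fs
md_settings 100 100 2    # 100 ns at the same save interval
md_settings 100 1000 2   # 100 ns saved every 1 ps instead
```

With a 2 fs timestep, the default run is 500,000 steps and 10,000 frames. Extending to 100 ns at the same interval gives 1,000,000 frames, which is why a longer save interval is worth considering for long runs.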
The Collection
I’ve included nine different amino acid dipeptides, each with distinct chemical properties:
Flexible systems: Glycine (no side chain beyond a hydrogen), Alanine (methyl group)
Branched systems: Valine, Isoleucine, Leucine (different branching patterns)
Aromatic systems: Phenylalanine, Tryptophan (different ring structures)
Special cases: Proline (ring constraint), Methionine (sulfur chemistry)
The idea was to create a systematic set where I could study how different chemical features affect dynamics. For example:
- Does the flexibility of glycine lead to more diverse conformational sampling?
- How do aromatic rings in tryptophan affect conformational preferences?
- Does the ring constraint in proline create different energy landscapes?
While these questions might seem basic, having systematic data to test ML models against known chemical intuition is valuable for building confidence in the approach.
Generating Trajectory Data
The fastest way to generate trajectory data is to use the run.sh script:
ID=ala ./scripts/run.sh
where ID is the three-letter code for the amino acid you want to simulate.
Here, we’re using ala for alanine.
This script will perform energy minimization, solvation, neutralization, NVT equilibration, NPT equilibration, and production simulation.
The resulting trajectory will be saved in the out/ID/data directory.
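To generate the whole collection in one go, the same command can be wrapped in a loop. This is a sketch, not part of the repository: run_all and DRY_RUN are hypothetical names, and every code other than ala is an assumption that the repository follows the standard lowercase three-letter abbreviations.

```shell
# Residue codes: only "ala" appears in the post; the rest are assumed.
RESIDUES="ala gly pro phe trp val ile leu met"

# Prints each command by default; set DRY_RUN=0 to actually launch the runs.
run_all() {
  for id in $RESIDUES; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "ID=$id ./scripts/run.sh"
    else
      # Stop at the first failure so a broken setup is noticed early.
      ID="$id" ./scripts/run.sh || { echo "run failed for $id" >&2; return 1; }
    fi
  done
}

run_all
```

Running it as written only lists the nine commands. Once the list looks right, calling DRY_RUN=0 run_all from the repository root would execute them one after another.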
Note: The production simulation is currently configured for a 1 nanosecond trajectory saved every 100 femtoseconds.
Ideally, you’ll want to increase this to 100 nanoseconds.
You may also want to save frames less often to limit the amount of data generated.
I’ve targeted 100 femtoseconds because I need time-correlated data for my machine learning models, but you may not need such a high frequency.
These settings can be changed in the config/md_langevin.mdp file.
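For a rough sense of how much data a given save interval produces, the sketch below computes an upper bound on trajectory size, assuming uncompressed single-precision coordinates (3 values of 4 bytes per atom per frame). The function name traj_bytes and the 2,000-atom system size are hypothetical; the real atom count depends on the solvation box, and compressed formats such as XTC will be smaller.

```shell
# traj_bytes FRAMES NATOMS
# Upper-bound size in bytes for uncompressed single-precision coordinates:
# 3 coordinates x 4 bytes per atom per frame (headers and velocities ignored).
traj_bytes() {
  echo $(( $1 * $2 * 3 * 4 ))
}

NATOMS=2000   # hypothetical size of a solvated dipeptide system
echo "1 ns   @ 100 fs: $(traj_bytes 10000 "$NATOMS") bytes"
echo "100 ns @ 100 fs: $(traj_bytes 1000000 "$NATOMS") bytes"
```

Under these assumptions the 1 ns run (10,000 frames) stays around 240 MB, while 100 ns at the same interval (1,000,000 frames) approaches 24 GB before compression.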
Alternatively, you may run each step individually (see scripts/run.sh for an example of how to do this).
The Systems
Here are the nine amino acid dipeptides I’ve included, each chosen for different chemical properties:
Alanine Dipeptide — The Standard

Alanine Dipeptide
The classic starting point for protein folding studies. Small methyl side chain makes it simple but not trivial.
Glycine Dipeptide — Maximum Flexibility

Glycine Dipeptide
No side chain means maximum backbone flexibility. A useful baseline for studying how side-chain constraints affect conformational sampling.
Proline Dipeptide — Built-in Rigidity

Proline Dipeptide
The ring structure creates backbone constraints. Interesting comparison to glycine’s flexibility.
Aromatic Systems

Phenylalanine Dipeptide
Phenylalanine: Simple benzene ring for studying aromatic interactions.

Tryptophan Dipeptide
Tryptophan: Larger indole ring system—more complex aromatic chemistry.
Branched Aliphatic Systems

Valine Dipeptide
Valine: β-branched, creates steric constraints near the backbone.

Isoleucine Dipeptide
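Isoleucine: also β-branched, but with an ethyl group in place of one of valine’s methyls, giving a different steric profile.

Leucine Dipeptide
Leucine: γ-branched, so the branch point sits one carbon further from the backbone, leaving more conformational freedom.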
Special Chemistry

Methionine Dipeptide
Methionine: Sulfur chemistry—different from the others and interesting for studying heteroatom effects.
What’s Next?
These mini-protein simulations have been useful for my ML work, providing systematic training data with controlled chemical variation. While they’re simple systems, they’ve helped me understand how different amino acid properties affect molecular behavior—knowledge that’s valuable when working with larger, more complex proteins.
The scripts are designed to be easily modified, so if you need different amino acids or simulation conditions, the framework should adapt well. I’ve tried to make the workflow as straightforward as possible while still being flexible.
This work complements my other molecular dynamics projects:
- Cu Adatom Diffusion — Learning LAMMPS for surface simulations
- Pt Adatom Diffusion — Extending to different elements
Together, these projects have given me a solid foundation in MD simulations for generating ML training data across different types of molecular systems.
Find the complete code and documentation on GitHub. Questions or suggestions? I’d love to hear from you—especially if you’ve found interesting ways to extend or improve the approach.
Acknowledgements
The scripts build on the GROMACS tutorial by Luca Tubiana at the University of Trento.