Introduction

When developing machine learning models for protein dynamics, I needed training data, and lots of it. Most researchers start with alanine dipeptide, a single alanine residue with capped ends that has become the “hello world” of protein simulation. It’s small enough to simulate quickly but complex enough to show interesting conformational behavior.

I wanted more diversity in my training data. Different amino acid side chains behave differently, and I was curious how this would affect model performance. So I extended the typical alanine dipeptide approach to include eight other amino acids, creating a small collection of “mini-proteins” for ML studies.

This post covers the GROMACS scripts I developed to generate these trajectories systematically. While these aren’t groundbreaking proteins, they’ve been useful for testing how different chemical properties—aromatic rings, flexibility, branching—affect molecular dynamics, which helps train more robust ML models.

What Are Mini-Proteins?

In this context, “mini-proteins” are single amino acid residues capped with an acetyl group at the N-terminus and an N-methylamide group at the C-terminus (Ace-X-Nme, where X is the amino acid). They’re not actually proteins in the biological sense; they’re closer to the simplest possible systems that still capture some protein-like behavior.

These systems are popular in computational studies because they:

  • Simulate quickly (seconds to minutes instead of hours)
  • Have well-characterized behavior for validation
  • Show enough complexity to be interesting
  • Can be systematically varied to study different chemical effects

Getting Started

The complete workflow and scripts are available on GitHub: mini-proteins (https://github.com/hunter-heidenreich/mini-proteins)

Requirements

  • Linux system with GROMACS installed
  • Python 3 with numpy and matplotlib
  • Basic familiarity with molecular dynamics concepts

Quick Start

git clone https://github.com/hunter-heidenreich/mini-proteins.git
cd mini-proteins
ID=ala ./scripts/run.sh

This runs the complete pipeline: energy minimization, solvation, equilibration, and production simulation. The default settings generate 1 ns of trajectory data saved every 100 fs. I chose high temporal resolution for my ML models, but you can adjust this in config/md_langevin.mdp.

For longer production runs (recommended for most applications), increase the simulation time to ~100 ns and reduce the save frequency to manage file sizes.
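To make the trade-off between run length and save frequency concrete, here is a minimal sketch of the arithmetic. It assumes a 1 fs integration time step, which may not match the value in config/md_langevin.mdp, and the helper name is hypothetical. nsteps and nstxout are standard GROMACS .mdp options, though the repository's config may control output through a different option.

```python
def md_output_settings(total_ns, save_every_fs, dt_fs=1.0):
    """Return (nsteps, nstxout, n_frames) for a production run.

    dt_fs is an ASSUMED integration time step in femtoseconds; check the
    dt value in your own .mdp file (GROMACS expresses dt in picoseconds).
    """
    total_fs = total_ns * 1_000_000          # 1 ns = 1e6 fs
    nsteps = round(total_fs / dt_fs)         # number of integration steps
    nstxout = round(save_every_fs / dt_fs)   # steps between saved frames
    if nstxout < 1:
        raise ValueError("save interval must be at least one time step")
    n_frames = nsteps // nstxout + 1         # +1 counts the frame at t = 0
    return nsteps, nstxout, n_frames


# Default described above: 1 ns, one frame every 100 fs.
print(md_output_settings(total_ns=1, save_every_fs=100))

# A longer run with sparser output: 100 ns, one frame every 10 ps.
print(md_output_settings(total_ns=100, save_every_fs=10_000))
```

With these assumptions, both settings produce the same number of frames, so the 100 ns run covers far more simulated time without a larger trajectory file.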

The Collection

I’ve included nine different amino acid dipeptides, each with distinct chemical properties:

Flexible systems: Glycine (smallest side chain), Alanine (methyl group)

Branched systems: Valine, Isoleucine, Leucine (different branching patterns)

Aromatic systems: Phenylalanine, Tryptophan (different ring structures)

Special cases: Proline (ring constraint), Methionine (sulfur chemistry)

This systematic set allows studying how different chemical features affect dynamics:

  • Does the flexibility of glycine lead to more diverse conformational sampling?
  • How do aromatic rings in tryptophan affect conformational preferences and transitions?
  • Does the ring constraint in proline create different energy landscapes?

While these questions might seem basic, having systematic data to test ML models against known chemical intuition builds confidence in the approach.
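One common way to put numbers on questions like these is to look at the backbone dihedral angles φ (C–N–Cα–C, starting from the acetyl carbon) and ψ (N–Cα–C–N, ending at the N-methylamide nitrogen). The sketch below is an illustration rather than part of the repository: it computes a signed dihedral from four atom positions and a crude "coverage" score, the fraction of (φ, ψ) bins visited. The function names and the 30° bin width are assumptions chosen for the example.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees (cis = 0, trans = +/-180)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    b2_len = math.sqrt(_dot(b2, b2))
    y = _dot(_cross(n1, n2), b2) / b2_len
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def ramachandran_coverage(phi_psi_deg, bin_deg=30):
    """Fraction of (phi, psi) bins visited at least once."""
    n_bins = 360 // bin_deg
    occupied = set()
    for phi, psi in phi_psi_deg:
        i = int((phi + 180) // bin_deg) % n_bins
        j = int((psi + 180) // bin_deg) % n_bins
        occupied.add((i, j))
    return len(occupied) / (n_bins * n_bins)


# Toy geometry: four points forming a right-angle twist.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))

# Toy angle series standing in for per-frame (phi, psi) values.
print(ramachandran_coverage([(-60, -45), (-65, -40), (60, 45)]))
```

If glycine really samples more broadly than proline, its coverage score should come out higher for trajectories of equal length. A real analysis would read the atom positions from the trajectory files first.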

Generating Trajectory Data

The fastest way to generate trajectory data is using the run.sh script:

ID=ala ./scripts/run.sh

where ID is the three-letter amino acid code (here, ala for alanine).

This script performs energy minimization, solvation, neutralization, NVT equilibration, NPT equilibration, and production simulation. The resulting trajectory is saved to the out/ID/data directory, where ID is the code you passed in.

Note: By default, the production simulation runs for 1 nanosecond, with a frame saved every 100 femtoseconds. For most applications, increase this to 100 nanoseconds and save frames less often to avoid very large data files. I chose 100 femtoseconds because my ML models needed closely spaced, time-correlated frames, but you may not need such a high frequency. Change these settings in config/md_langevin.mdp.
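If you would rather script that change than edit the file by hand, a small helper along these lines could rewrite "key = value" lines in an .mdp file. This is a sketch, not something the repository provides. The sample text is made up for illustration, and which output option the real config uses (for example nstxout or nstxout-compressed) is something to check in config/md_langevin.mdp.

```python
def set_mdp_options(mdp_text, updates):
    """Return mdp_text with the given options replaced (or appended)."""
    remaining = dict(updates)
    out = []
    for line in mdp_text.splitlines():
        body, sep, comment = line.partition(";")   # ';' starts a comment
        key, eq, _ = body.partition("=")
        name = key.strip()
        if eq and name in remaining:
            new_line = f"{name} = {remaining.pop(name)}"
            if sep:
                new_line += f" ;{comment}"          # keep trailing comment
            out.append(new_line)
        else:
            out.append(line)
    # Options that were not present in the file are appended at the end.
    for name, value in remaining.items():
        out.append(f"{name} = {value}")
    return "\n".join(out) + "\n"


# Illustrative sample only; NOT the contents of the repository's config.
sample = (
    "integrator = sd\n"
    "dt = 0.001 ; ps\n"
    "nsteps = 1000000\n"
    "nstxout = 100\n"
)

print(set_mdp_options(sample, {"nsteps": 100000000, "nstxout": 10000}))
```

In practice you would read the file's text, pass it through the helper, and write the result to a copy so the original settings stay available.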

You can also run each step individually (see scripts/run.sh for examples).
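To generate data for the whole collection, the same command can be looped over every residue. The sketch below assumes the other eight systems use lowercase three-letter codes in the same way as ala, and that it is run from the repository root. It defaults to a dry run that only prints the commands.

```python
import os
import subprocess

# ASSUMED IDs: standard three-letter codes, lowercase like the `ala` example.
RESIDUE_IDS = ["ala", "gly", "val", "ile", "leu", "phe", "trp", "pro", "met"]


def build_job(residue_id, script="./scripts/run.sh"):
    """Return (command, environment) for one residue."""
    env = dict(os.environ, ID=residue_id)   # same as `ID=<id> ./scripts/run.sh`
    return [script], env


def run_all(ids=RESIDUE_IDS, dry_run=True):
    """Run the pipeline for each residue, one after another."""
    for residue_id in ids:
        cmd, env = build_job(residue_id)
        if dry_run:
            print(f"ID={residue_id} {' '.join(cmd)}")
        else:
            # check=True stops the loop if any simulation step fails.
            subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)   # switch to dry_run=False to launch simulations
```

Running the systems sequentially keeps things simple. On a machine with spare cores, you might prefer to launch several at once, at the cost of more careful resource management.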

The Systems

Here are the nine amino acid dipeptides I’ve included, each chosen for different chemical properties:

Alanine Dipeptide — The Standard

Figure: Animation of alanine dipeptide

The classic starting point for conformational sampling studies. The small methyl side chain makes it simple but not trivial.

Glycine Dipeptide — Maximum Flexibility

Figure: Animation of glycine dipeptide

No side chain means maximum backbone flexibility. Great for studying how constraints affect conformational sampling.

Proline Dipeptide — Built-in Rigidity

Figure: Animation of proline dipeptide

The ring bonds the side chain back to the backbone nitrogen, which restricts the φ angle. It makes an interesting contrast to glycine’s flexibility.

Aromatic Systems

Figure: Animation of phenylalanine dipeptide

Phenylalanine: Simple benzene ring for studying aromatic interactions.

Figure: Animation of tryptophan dipeptide

Tryptophan: Larger indole ring system with more complex aromatic chemistry.

Branched Aliphatic Systems

Figure: Animation of valine dipeptide

Valine: β-branched, creates steric constraints near the backbone.

Figure: Animation of isoleucine dipeptide

Isoleucine: Also β-branched, with an ethyl group in place of one of valine’s methyl groups, giving a longer, asymmetric side chain.

Figure: Animation of leucine dipeptide

Leucine: γ-branched, so the branch point sits one carbon farther from the backbone, leaving more conformational freedom.

Special Chemistry

Figure: Animation of methionine dipeptide

Methionine: Sulfur chemistry—different from the others and interesting for studying heteroatom effects.

What’s Next?

These mini-protein simulations have been useful for my ML work, providing systematic training data with controlled chemical variation. While they’re simple systems, they’ve helped me understand how different amino acid properties affect molecular behavior—knowledge that’s valuable when working with larger, more complex proteins.

The scripts are designed to be easily modified for different amino acids or simulation conditions. I’ve tried to make the workflow straightforward while keeping it flexible.

This work complements my other molecular dynamics projects.

Together, these projects have given me a solid foundation in MD simulations for generating ML training data across different molecular systems.


Find the complete code and documentation on GitHub. Questions or suggestions? I’d love to hear from you—especially if you’ve found interesting ways to extend or improve the approach.

Acknowledgements

The scripts build on the GROMACS tutorial by Luca Tubiana at the University of Trento.