Introduction

While learning to develop machine learning models for protein dynamics, I needed to generate training data, and lots of it. Most researchers start with alanine dipeptide (AD), a tiny system built from a single capped alanine residue that has become the “hello world” of protein simulation. It’s small enough to simulate quickly but complex enough to show interesting conformational behavior.

But I wanted more diversity in my training data. What if different amino acid side chains behave differently? How would that affect my models? So I extended the typical alanine dipeptide approach to include eight other amino acids, creating a small collection of “mini-proteins” for ML studies.

This post covers the GROMACS scripts I developed to generate these trajectories systematically. These systems aren’t groundbreaking by any means, but they’ve been useful for testing how different chemical properties (aromatic rings, backbone flexibility, sulfur-containing side chains) affect molecular dynamics, which could be valuable for training more robust ML models.

What Are Mini-Proteins?

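In this context, “mini-proteins” are single amino acid residues capped with acetyl and N-methyl groups (Ace-X-Nme, where X is the amino acid). The caps give each molecule two peptide bonds, which is why these systems are conventionally called “dipeptides” even though they contain only one residue. They’re not actually proteins in the biological sense; they’re more like the simplest possible systems that still capture some protein-like behavior.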

These systems are popular in computational studies because they:

  • Simulate quickly (seconds to minutes instead of hours)
  • Have well-characterized behavior for validation
  • Show enough complexity to be interesting
  • Can be systematically varied to study different chemical effects

Getting Started

The complete workflow and scripts are available on GitHub: mini-proteins

What You’ll Need

  • Linux system with GROMACS installed
  • Python 3 with numpy and matplotlib
  • Basic familiarity with molecular dynamics concepts

Quick Start

git clone https://github.com/hunter-heidenreich/mini-proteins.git
cd mini-proteins
ID=ala ./scripts/run.sh

This runs the complete pipeline: energy minimization, solvation, neutralization, equilibration, and production simulation. The default settings generate 1 ns of trajectory data saved every 100 fs. I chose this interval because I needed high temporal resolution for my ML models, but you can adjust it in config/md_langevin.mdp.

For longer production runs (recommended for most applications), you’ll want to increase the simulation time to ~100 ns and reduce the save frequency to manage file sizes.
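As a rough sketch of the arithmetic behind those settings, the helper below converts a run length and save interval into the step counts a GROMACS .mdp file expects (dt in picoseconds, nsteps, and nstxout, or nstxout-compressed if you write .xtc output). The function name and the 2 fs time step are assumptions for illustration; check the dt actually set in config/md_langevin.mdp before relying on the numbers.

```python
def mdp_settings(total_ns, save_every_fs, dt_fs=2):
    """Translate a run length and save interval into .mdp step counts.

    dt_fs is an ASSUMED integration time step, not a value taken from the repo.
    """
    total_fs = round(total_ns * 1_000_000)  # 1 ns = 1,000,000 fs
    if total_fs % dt_fs or save_every_fs % dt_fs:
        raise ValueError("run length and save interval must be multiples of dt")
    nsteps = total_fs // dt_fs          # total number of integration steps
    nstxout = save_every_fs // dt_fs    # steps between saved frames
    return {
        "dt": dt_fs / 1000,             # GROMACS expects picoseconds
        "nsteps": nsteps,
        "nstxout": nstxout,
        "frames": nsteps // nstxout + 1,  # +1 for the frame at t = 0
    }


# Default-style run (1 ns, every 100 fs) and a longer, sparser run.
print(mdp_settings(1, 100))
print(mdp_settings(100, 1000))
```

With a 2 fs step, the 1 ns default corresponds to nsteps = 500000 and nstxout = 50; with a 1 fs step both values double.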

The Collection

I’ve included nine different amino acid dipeptides, each with distinct chemical properties:

Flexible systems: Glycine (smallest side chain), Alanine (methyl group)

Branched systems: Valine, Isoleucine, Leucine (different branching patterns)

Aromatic systems: Phenylalanine, Tryptophan (different ring structures)

Special cases: Proline (ring constraint), Methionine (sulfur chemistry)

The idea was to create a systematic set where I could study how different chemical features affect dynamics. For example:

  • Does the flexibility of glycine lead to more diverse conformational sampling?
  • How do aromatic rings in tryptophan affect conformational transitions?
  • Does the ring constraint in proline create different energy landscapes?

While these questions might seem basic, having systematic data to test ML models against known chemical intuition is valuable for building confidence in the approach.
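One hypothetical way to put a number on “more diverse conformational sampling” is to track the backbone dihedral angles φ (C–N–Cα–C) and ψ (N–Cα–C–N), where the outer C and N atoms come from the acetyl and N-methyl caps, and then count how much of the (φ, ψ) grid a trajectory visits. The sketch below uses only the standard library; the function names and the 36-bin grid are my own choices, and loading coordinates from a trajectory is left out.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def coverage(phi, psi, bins=36):
    """Fraction of (phi, psi) grid cells visited at least once."""
    width = 360 / bins

    def cell(angle):
        # Clamp +180 degrees into the last bin.
        return min(int((angle + 180) // width), bins - 1)

    visited = {(cell(a), cell(b)) for a, b in zip(phi, psi)}
    return len(visited) / bins ** 2
```

A higher coverage value for glycine than for proline over equally long trajectories would match the chemical intuition above, though the number depends on the bin size and run length.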

Generating Trajectory Data

The fastest way to generate trajectory data is to use the run.sh script.

ID=ala ./scripts/run.sh

Here, ID is the three-letter code of the amino acid you want to simulate (ala for alanine in this example).
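To generate the whole collection, the same command can be repeated once per residue. The Python sketch below does this with subprocess, passing the code through the ID environment variable. Only ala is confirmed by the command above; the other eight codes are assumed to follow the same lowercase three-letter pattern, and the function defaults to a dry run that only lists what it would execute.

```python
import os
import subprocess

# ASSUMED IDs: only "ala" appears in the post; the rest follow the same
# lowercase three-letter pattern and may not match the repository exactly.
RESIDUE_IDS = ["ala", "gly", "val", "ile", "leu", "phe", "trp", "pro", "met"]


def run_all(ids=RESIDUE_IDS, script="./scripts/run.sh", dry_run=True):
    """Run the pipeline once per residue, setting ID in the environment."""
    planned = []
    for res_id in ids:
        planned.append(f"ID={res_id} {script}")
        if not dry_run:
            env = {**os.environ, "ID": res_id}
            # check=True stops the loop if one simulation fails.
            subprocess.run([script], env=env, check=True)
    return planned


# Dry run: print the commands without launching any simulation.
print("\n".join(run_all()))
```

Call run_all(dry_run=False) from the repository root to actually launch the simulations one after another.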

This script performs energy minimization, solvation, neutralization, NVT equilibration, NPT equilibration, and production simulation. The resulting trajectory is saved in the out/ID/data directory (out/ala/data for the example above).

Note: The production simulation is currently configured for a 1 ns trajectory saved every 100 fs. For most applications you’ll want to increase the length to around 100 ns, and you may want to save frames less often to keep file sizes manageable. I chose 100 fs because my machine learning models need closely spaced, time-correlated frames, but you may not need such a high frequency. Both settings can be changed in the config/md_langevin.mdp file.
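To get a feel for how quickly the data grows, here is a back-of-the-envelope estimate that counts frames and multiplies by the raw size of single-precision coordinates. The 2,000-atom system size is a made-up placeholder (the real count depends on the residue and the water box), and the result ignores file headers, velocities, forces, and any compression, so treat it as an order-of-magnitude guide only.

```python
def trajectory_size_gb(total_ns, save_every_fs, n_atoms, bytes_per_coord=4):
    """Rough uncompressed size of saved positions, in gigabytes."""
    total_fs = round(total_ns * 1_000_000)   # 1 ns = 1,000,000 fs
    frames = total_fs // save_every_fs + 1   # +1 for the frame at t = 0
    bytes_per_frame = n_atoms * 3 * bytes_per_coord  # x, y, z per atom
    return frames, frames * bytes_per_frame / 1e9


N_ATOMS = 2000  # HYPOTHETICAL solvated system size, for illustration only

for total_ns in (1, 100):
    frames, size_gb = trajectory_size_gb(total_ns, 100, N_ATOMS)
    print(f"{total_ns:>3} ns at 100 fs: {frames} frames, about {size_gb:.2f} GB")
```

Under these assumptions, going from 1 ns to 100 ns at the same 100 fs interval takes the estimate from roughly a quarter of a gigabyte to about 24 GB, which is why a coarser save interval is worth considering for long runs.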

Alternatively, you may run each step individually (see scripts/run.sh for an example of how to do this).
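For orientation, the sketch below prints the kind of GROMACS command sequence such a step-by-step run typically involves. It is not a transcription of scripts/run.sh: apart from config/md_langevin.mdp and the out/ID directory, every file name here (the input PDB, em.mdp, nvt.mdp, npt.mdp, the intermediate .gro files) is hypothetical, and details such as force field selection, position restraints, and checkpoint continuation are omitted.

```python
def pipeline_commands(res_id, out_dir="out"):
    """Return an illustrative GROMACS command list for one capped residue.

    File names and layout are ASSUMPTIONS; consult scripts/run.sh for the
    commands the repository actually runs.
    """
    d = f"{out_dir}/{res_id}"
    top = f"{d}/topol.top"

    def run_stage(name, mdp, coords):
        # grompp assembles a run input (.tpr); mdrun executes it.
        return [
            f"gmx grompp -f config/{mdp} -c {coords} -p {top} -o {d}/{name}.tpr",
            f"gmx mdrun -deffnm {d}/{name}",
        ]

    cmds = [
        # Build the topology and place the molecule in a simulation box.
        f"gmx pdb2gmx -f {res_id}.pdb -o {d}/init.gro -p {top}",
        f"gmx editconf -f {d}/init.gro -o {d}/box.gro -c -d 1.0 -bt cubic",
    ]
    cmds += run_stage("em", "em.mdp", f"{d}/box.gro")  # energy minimization
    # Solvation, then neutralization (genion needs a .tpr as input).
    cmds.append(f"gmx solvate -cp {d}/em.gro -cs spc216.gro -o {d}/solv.gro -p {top}")
    cmds.append(f"gmx grompp -f config/em.mdp -c {d}/solv.gro -p {top} -o {d}/ions.tpr")
    cmds.append(
        f"gmx genion -s {d}/ions.tpr -o {d}/ions.gro -p {top} "
        "-pname NA -nname CL -neutral"
    )
    cmds += run_stage("nvt", "nvt.mdp", f"{d}/ions.gro")        # NVT equilibration
    cmds += run_stage("npt", "npt.mdp", f"{d}/nvt.gro")         # NPT equilibration
    cmds += run_stage("md", "md_langevin.mdp", f"{d}/npt.gro")  # production
    return cmds


print("\n".join(pipeline_commands("ala")))
```

The function only builds strings, so it is safe to run without GROMACS installed; comparing its output against scripts/run.sh is a quick way to see where the real workflow differs.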

The Systems

Here are the nine amino acid dipeptides I’ve included, each chosen for different chemical properties:

Alanine Dipeptide — The Standard

[Animation: Alanine Dipeptide]

The classic starting point for protein simulation studies. Its small methyl side chain makes it simple but not trivial.

Glycine Dipeptide — Maximum Flexibility

[Animation: Glycine Dipeptide]

With only a hydrogen atom as its side chain, glycine has the most flexible backbone in the set. It serves as a useful baseline for studying how side-chain constraints affect conformational sampling.

Proline Dipeptide — Built-in Rigidity

[Animation: Proline Dipeptide]

The pyrrolidine ring bonds the side chain back to the backbone nitrogen, which restricts backbone rotation. It makes an interesting contrast to glycine’s flexibility.

Aromatic Systems

[Animation: Phenylalanine Dipeptide]

Phenylalanine: Simple benzene ring for studying aromatic interactions.

[Animation: Tryptophan Dipeptide]

Tryptophan: Larger indole ring system—more complex aromatic chemistry.

Branched Aliphatic Systems

[Animation: Valine Dipeptide]

Valine: β-branched, creates steric constraints near the backbone.

[Animation: Isoleucine Dipeptide]

Isoleucine: Also β-branched, but with a longer sec-butyl side chain that gives it a different steric profile than valine.

[Animation: Leucine Dipeptide]

Leucine: γ-branched, so the branch point sits one carbon farther from the backbone, which leaves more conformational freedom.

Special Chemistry

[Animation: Methionine Dipeptide]

Methionine: A flexible thioether side chain, the only sulfur-containing one in the set, which makes it interesting for studying heteroatom effects.

What’s Next?

These mini-protein simulations have been useful for my ML work, providing systematic training data with controlled chemical variation. While they’re simple systems, they’ve helped me understand how different amino acid properties affect molecular behavior—knowledge that’s valuable when working with larger, more complex proteins.

The scripts are designed to be easily modified, so if you need different amino acids or simulation conditions, the framework should adapt well. I’ve tried to make the workflow as straightforward as possible while still being flexible.

This work complements my other molecular dynamics projects. Together, these projects have given me a solid foundation in MD simulations for generating ML training data across different types of molecular systems.


Find the complete code and documentation on GitHub. Questions or suggestions? I’d love to hear from you—especially if you’ve found interesting ways to extend or improve the approach.

Acknowledgements

The scripts build on the GROMACS tutorial by Luca Tubiana at the University of Trento.