Mini Proteins

Recently, I’ve been working on machine learning models for forecasting the stochastic dynamics of proteins. To test these models, I needed to generate a large number of protein trajectories. Machine learning researchers frequently target a small molecule called alanine dipeptide (AD) for this purpose. Despite its name, AD is a single alanine residue capped with an acetyl group and an N-methyl group, and it is a common test case for new methods in molecular dynamics, from force fields to sampling algorithms.

I wanted a large number of AD trajectories, but I also wanted trajectories of molecules that are similar to AD without being identical. So, in learning to use GROMACS to simulate AD, I also learned how to simulate other toy proteins. I’ve collected the scripts I used in a GitHub repository called mini-proteins, and I’ve written this blog post to explain how to use them.

Getting Started

Prerequisites

The scripts in this repository are designed to be run on a Linux machine with GROMACS installed.

Python 3 is also required to run the post-processing scripts, along with the following Python packages:

  • numpy
  • matplotlib
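Before running anything, it can help to confirm that these prerequisites are actually available. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and it assumes the GROMACS binary is named `gmx` (some MPI builds install it as `gmx_mpi` instead).

```shell
# Sketch: report which required tools are missing from PATH.
# Assumes the GROMACS binary is called `gmx`; adjust if your build differs.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
}

missing_tools gmx python3 git

# Check the Python packages used by the post-processing scripts.
if command -v python3 >/dev/null 2>&1; then
  python3 -c "import numpy, matplotlib" 2>/dev/null \
    || echo "missing: numpy and/or matplotlib"
fi
```

If the script prints nothing, everything it checks for was found.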

Installation

To install this repository, simply clone it from GitHub:

git clone https://github.com/hunter-heidenreich/mini-proteins.git

Usage

This repo contains GROMACS scripts that take a mini-protein through the following steps:

  • Perform energy minimization
  • Solvate the protein
  • Add ions to neutralize the system
  • Equilibrate the system (NVT)
  • Equilibrate the system (NPT)
  • Run a production simulation
  • Post-process the simulation

In this post, “mini-protein” is a non-technical label for a single amino acid residue capped with an acetyl group and an N-methyl group (a so-called dipeptide).

Frequently, alanine dipeptide (Ace-Ala-Nme) is used as a model system for protein folding studies. It’s especially enjoyed by machine learning researchers, because it’s small enough to be simulated quickly, but large enough to exhibit interesting folding behavior.

This repo extends the typical alanine dipeptide data-generation workflow to other amino acids. Not all amino acids are included, but the scripts make it easy to generate several dipeptide “mini-proteins” and add some diversity to the systems considered in machine learning studies.

For example, methionine dipeptide adds a sulfur-containing (thioether) side chain, which could be used to study the effects of sulfur-containing residues on protein folding. Note that methionine does not form disulfide bonds; those form between cysteine residues. Tryptophan dipeptide could be used to study the effects of aromatic residues on protein folding. Glycine dipeptide, whose side chain is a single hydrogen atom, could be used to study the extra backbone flexibility that comes with a very small residue.

Generating Trajectory Data

The fastest way to generate trajectory data is to use the run.sh script.

ID=ala ./scripts/run.sh

where ID is the lowercase three-letter code for the amino acid you want to simulate. Here, we’re using ala for alanine.

This script performs energy minimization, solvation, neutralization, NVT equilibration, NPT equilibration, and the production simulation. The resulting trajectory is saved in the out/ID/data directory, with ID replaced by the code you passed in.
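To generate data for every included mini-protein in one go, the same command can be wrapped in a loop. This is a minimal sketch: the ID list is taken from the data/*.pdb files listed later in this post, while `run_all` and `DRY_RUN` are hypothetical helpers that are not part of the repository.

```shell
# Sketch: run the full pipeline for every included mini-protein.
# DRY_RUN and run_all are hypothetical helpers, not part of the repository.
IDS="ala gly ile leu met phe pro trp val"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

run_all() {
  for id in $IDS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "ID=$id ./scripts/run.sh"
    else
      ID="$id" ./scripts/run.sh
    fi
  done
}

run_all
```

By default this only prints the nine commands. Setting DRY_RUN=0 runs them one after another from the repository root.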

Note: The production simulation is currently configured for a 1 nanosecond trajectory with a frame saved every 100 femtoseconds. Ideally, you’ll want to increase the length to 100 nanoseconds. You may also want to save frames less often to limit the amount of data generated. I chose 100 femtoseconds because I need time-correlated data for my machine learning models, but you may not need such a high frequency. Both settings can be changed in the config/md_langevin.mdp file.
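To get a feel for these numbers, the arithmetic can be sketched as below. The trajectory length and save interval come from this post; the 2 femtosecond integration time step is only an assumption for illustration, so check the dt value in your own .mdp file. In GROMACS .mdp files, the step count is set with nsteps and the output interval with nstxout or nstxout-compressed, both measured in steps.

```shell
# Sketch: derive frame and step counts for a production run.
# dt_fs is an ASSUMED time step; the repository's actual value may differ.
sim_ns=1        # trajectory length in nanoseconds (from the post)
save_fs=100     # save interval in femtoseconds (from the post)
dt_fs=2         # assumed integration time step in femtoseconds

total_fs=$((sim_ns * 1000000))     # 1 ns = 1,000,000 fs
frames=$((total_fs / save_fs))     # number of saved frames
nsteps=$((total_fs / dt_fs))       # integration steps (mdp: nsteps)
save_every=$((save_fs / dt_fs))    # steps between saved frames (mdp: nstxout)

echo "frames=$frames nsteps=$nsteps save_every=$save_every"
```

With the values above this prints frames=10000 nsteps=500000 save_every=50. Setting sim_ns=100 at the same save interval gives 1,000,000 frames, which is why a coarser save interval may be worth considering for long runs.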

Alternatively, you may run each step individually (see scripts/run.sh for an example of how to do this).
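As a rough guide to what running a step by hand looks like, most GROMACS simulation stages (minimization, equilibration, production) follow the same two-command pattern: gmx grompp compiles a parameter file, a structure, and a topology into a run input file, and gmx mdrun executes it. The sketch below illustrates that pattern only. The `run` and `stage` helpers and every file name except config/md_langevin.mdp are hypothetical placeholders, and scripts/run.sh remains the reference for what the repository actually does.

```shell
# Sketch: each simulation stage follows the same grompp -> mdrun pattern.
# File names other than config/md_langevin.mdp are hypothetical placeholders.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# stage NAME MDP INPUT_GRO: build NAME.tpr from the inputs, then run it.
stage() {
  run gmx grompp -f "$2" -c "$3" -p topol.top -o "$1.tpr"
  run gmx mdrun -deffnm "$1"
}

# Example: a production run starting from a (hypothetical) equilibrated structure.
stage md config/md_langevin.mdp npt.gro
```

In dry-run mode this just prints the two commands, which makes it easy to inspect them before executing anything.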

The Included Proteins

Capped dipeptides of the following amino acids are included in this repository:

Alanine Dipeptide

Animation of Alanine Dipeptide

  • data/ala.pdb
  • Alanine Dipeptide (Ace-Ala-Nme)
  • PubChem CID: 5484387 (URL)
  • ATB: URL

Glycine Dipeptide

Animation of Glycine Dipeptide

  • data/gly.pdb
  • Glycine Dipeptide (Ace-Gly-Nme)
  • PubChem CID: 439506 (URL)
  • ATB: URL

Isoleucine Dipeptide

Animation of Isoleucine Dipeptide

  • data/ile.pdb
  • Isoleucine Dipeptide (Ace-Ile-Nme)
  • PubChem CID: 7019852 (URL)
  • ATB: URL

Leucine Dipeptide

Animation of Leucine Dipeptide

  • data/leu.pdb
  • Leucine Dipeptide (Ace-Leu-Nme)
  • PubChem CID: 6950977 (URL)
  • ATB: URL

Methionine Dipeptide

Animation of Methionine Dipeptide

  • data/met.pdb
  • Methionine Dipeptide (Ace-Met-Nme)
  • PubChem CID: 13875186 (URL)
  • ATB: URL
  • Contains a thioether sulfur in its side chain (not a disulfide bond)

Phenylalanine Dipeptide

Animation of Phenylalanine Dipeptide

  • data/phe.pdb
  • Phenylalanine Dipeptide (Ace-Phe-Nme)
  • PubChem CID: 7019860 (URL)
  • ATB: URL

Proline Dipeptide

Animation of Proline Dipeptide

  • data/pro.pdb
  • Proline Dipeptide (Ace-Pro-Nme)
  • PubChem CID: 5245806 (URL)
  • ATB: URL

Tryptophan Dipeptide

Animation of Tryptophan Dipeptide

  • data/trp.pdb
  • Tryptophan Dipeptide (Ace-Trp-Nme)
  • PubChem CID: 151412 (URL)
  • ATB: URL

Valine Dipeptide

Animation of Valine Dipeptide

  • data/val.pdb
  • Valine Dipeptide (Ace-Val-Nme)
  • PubChem CID: 13875188 (URL)
  • ATB: URL

Wrapping Up

If you find this repository useful, please consider citing the following dataset/scripts:

@misc{Heidenreich_Mini-proteins_2023,
author = {Heidenreich, Hunter},
month = sep,
title = {{Mini-proteins}},
url = {https://github.com/hunter-heidenreich/mini-proteins},
year = {2023}
}

And if you see any errors or have any suggestions, please feel free to open an issue or pull request.

Hopefully this repository will be useful for you!

Acknowledgements

The scripts build on the GROMACS tutorial by Luca Tubiana at the University of Trento.