Overview

This repository collects GROMACS simulation workflows that generate diverse molecular dynamics trajectories of amino acid dipeptides. It extends beyond the commonly studied alanine dipeptide to build datasets for machine learning applications in protein dynamics.

Scientific Motivation

Many dipeptide benchmark studies focus on alanine dipeptide because of its simplicity, but relying on a single residue limits the chemical diversity available to machine learning models. This project systematically covers different amino acid types to provide richer training data.

Key Features

Diverse Amino Acid Coverage

  • 9 different amino acid types with systematic chemical diversity
  • Aromatic residues (Phe, Tyr, Trp) for π-stacking interactions
  • Branched residues (Ile, Val) for steric effects
  • Flexible residues (Gly, Ala) for conformational sampling
  • Constrained residues (Pro) for ring effects

Automated Simulation Workflows

  • Complete GROMACS setup with optimized parameters
  • Systematic parameter sweeps across temperature and force fields
  • High-resolution trajectories optimized for ML applications
  • Reproducible protocols with full documentation
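
The parameter sweep described above can be sketched as a Cartesian product over residues, force fields, and temperatures. This is a minimal illustration and not the project's actual driver: the 25 K temperature spacing, the directory naming scheme, and the function name `enumerate_runs` are assumptions. Only the residue names, the AMBER99SB-ILDN force field, and the 300 K to 400 K range come from this README.

```python
from itertools import product

# Residues named in this README (three-letter codes).
RESIDUES = ["ALA", "GLY", "PRO", "VAL", "ILE", "PHE", "TYR", "TRP"]
# GROMACS force-field name for AMBER99SB-ILDN; add more entries to sweep them.
FORCE_FIELDS = ["amber99sb-ildn"]
# Assumed 25 K spacing across the documented 300-400 K range.
TEMPERATURES_K = [300, 325, 350, 375, 400]


def enumerate_runs(residues, force_fields, temperatures):
    """Return one dict per (residue, force field, temperature) combination."""
    runs = []
    for res, ff, temp in product(residues, force_fields, temperatures):
        runs.append({
            "residue": res,
            "force_field": ff,
            "temperature_K": temp,
            # Hypothetical directory layout, for illustration only.
            "run_dir": f"{res.lower()}/{ff}/T{temp}K",
        })
    return runs


runs = enumerate_runs(RESIDUES, FORCE_FIELDS, TEMPERATURES_K)
print(len(runs), runs[0]["run_dir"])
```

Keeping the sweep as plain data makes it easy to write the same records out as per-run metadata later.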

Analysis Pipeline

  • Ramachandran plot analysis for conformational validation
  • Free energy landscape calculation
  • Trajectory clustering and representative structure extraction
  • ML-ready feature extraction from atomic coordinates
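
As a rough illustration of the Ramachandran and free energy steps above, the sketch below computes a signed dihedral angle from four atom positions and turns a set of (phi, psi) samples into a binned free energy estimate using F = -kT ln(P / P_max). The function names, the 30 degree bin width, and the pure-Python implementation are assumptions made for this example; the project's own pipeline may use different tools and settings.

```python
import math

K_B = 0.0083144626  # gas constant in kJ/(mol K)


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four consecutive atoms."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def free_energy_surface(phi_psi_deg, temperature_k=300.0, bin_width_deg=30.0):
    """Map each populated (phi, psi) bin to F = -kT ln(count / max_count)."""
    n_bins = int(360.0 / bin_width_deg)
    counts = {}
    for phi, psi in phi_psi_deg:
        # Clamp so that an angle of exactly +180 falls in the last bin.
        i = min(int((phi + 180.0) // bin_width_deg), n_bins - 1)
        j = min(int((psi + 180.0) // bin_width_deg), n_bins - 1)
        counts[(i, j)] = counts.get((i, j), 0) + 1
    max_count = max(counts.values())
    kt = K_B * temperature_k
    return {b: -kt * math.log(c / max_count) for b, c in counts.items()}
```

The most populated bin is assigned zero free energy, and unpopulated bins are simply absent from the result rather than set to infinity.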

Technical Implementation

Simulation Protocol

  • Force Field: AMBER99SB-ILDN with TIP3P water
  • Temperature Range: 300 K to 400 K, sampled systematically
  • Simulation Length: 100 ns per system, intended to reach statistical convergence
  • Output Frequency: frames saved every 1 ps for ML applications
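
To make the protocol numbers concrete, the following sketch renders a fragment of a production `.mdp` file. It assumes a 2 fs integration step, a v-rescale thermostat, and h-bond constraints, none of which are stated in this README. Under those assumptions, 100 ns corresponds to 50,000,000 steps and a 1 ps save interval to one frame every 500 steps. The function name `production_mdp` is hypothetical.

```python
def production_mdp(temperature_k, length_ns=100.0, dt_ps=0.002, save_every_ps=1.0):
    """Return the text of a minimal production-run .mdp fragment."""
    nsteps = round(length_ns * 1000.0 / dt_ps)  # ns -> ps, then ps -> steps
    nstout = round(save_every_ps / dt_ps)       # steps between saved frames
    lines = [
        "integrator = md",
        f"dt = {dt_ps}",                      # assumed 2 fs timestep
        f"nsteps = {nsteps}",
        f"nstxout-compressed = {nstout}",     # compressed coordinates
        f"nstvout = {nstout}",                # velocities
        f"nstfout = {nstout}",                # forces
        "tcoupl = v-rescale",                 # assumed thermostat
        "tc-grps = System",
        "tau_t = 0.1",
        f"ref_t = {temperature_k}",
        "constraints = h-bonds",              # assumed, to allow a 2 fs step
    ]
    return "\n".join(lines) + "\n"


print(production_mdp(300))
```

A real run would also need settings for electrostatics, pressure coupling, and cutoffs, which are omitted here because the README does not specify them.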

Quality Control

  • Energy minimization and equilibration protocols
  • System stability monitoring throughout production runs
  • Validation against experimental and theoretical benchmarks
  • Cross-validation between different force fields
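
One simple way to monitor stability is to check that the average temperature stays near its target and that a least-squares fit shows no significant drift over time. The sketch below is only an example of that idea: the thresholds (2 K tolerance, 0.5 K/ns drift) and the function names are arbitrary assumptions, and the input is expected to be a time series already extracted from the GROMACS energy file.

```python
def linear_drift(times, values):
    """Least-squares slope of values against times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    var = sum((t - t_mean) ** 2 for t in times)
    return cov / var


def is_stable(times_ps, temperatures_k, target_k,
              tol_k=2.0, max_drift_k_per_ns=0.5):
    """True if the mean temperature and its drift are within tolerance."""
    mean_temp = sum(temperatures_k) / len(temperatures_k)
    drift_per_ns = linear_drift(times_ps, temperatures_k) * 1000.0  # ps -> ns
    return (abs(mean_temp - target_k) <= tol_k
            and abs(drift_per_ns) <= max_drift_k_per_ns)
```

The same two functions can be applied to potential energy or density by swapping the input series and thresholds.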

Applications

Machine Learning

  • Conformational sampling for protein folding prediction
  • Molecular dynamics ML models for accelerated simulations
  • Force field development through data-driven approaches
  • Feature learning from molecular trajectories

Computational Biology

  • Understanding amino acid-specific dynamics
  • Validation of force field parameters
  • Comparison with experimental NMR data
  • Benchmark development for MD software

Dataset Characteristics

Trajectory Details

  • High temporal resolution (frames saved every 1 ps)
  • Comprehensive coordinate data including velocities and forces
  • Multiple replicas for statistical significance
  • Metadata inclusion for systematic analysis

ML-Ready Features

  • Preprocessed coordinate matrices
  • Calculated descriptors (distances, angles, dihedrals)
  • Time-series formatted data for sequential models
  • Normalized and standardized feature sets
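
The feature preparation listed above might look like the following sketch, which encodes dihedral angles as sine/cosine pairs (so that -180 and +180 degrees map to the same point), standardizes each feature column, and cuts a trajectory into fixed-length windows for sequential models. All function names and the choice of a population standard deviation are assumptions for illustration and do not describe the dataset's actual preprocessing.

```python
import math
from statistics import mean, pstdev


def encode_dihedrals(angles_deg):
    """Flatten angles into [sin, cos, sin, cos, ...] to respect periodicity."""
    feats = []
    for angle in angles_deg:
        r = math.radians(angle)
        feats.extend((math.sin(r), math.cos(r)))
    return feats


def standardize(rows):
    """Z-score each column of a list of equal-length feature rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Constant columns get a divisor of 1.0 to avoid division by zero.
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def sliding_windows(frames, length, stride=1):
    """Split a sequence of frames into overlapping fixed-length windows."""
    return [frames[i:i + length]
            for i in range(0, len(frames) - length + 1, stride)]
```

For a 100 ns trajectory saved every 1 ps, `frames` would hold about 100,000 rows, so the window length and stride largely determine the number of training sequences.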

Results & Validation

  • Conformational landscapes consistent with experimental observations
  • Force field validation through comparison with quantum calculations
  • Reproducible dynamics across independent simulation runs
  • Chemical diversity quantified through principal component analysis

This project supports research documented in:

Future Directions

  • Extension to larger peptide systems
  • Integration with enhanced sampling methods
  • Machine learning potential development
  • Experimental validation through NMR collaboration