Overview
This project provides GROMACS simulation workflows that generate diverse molecular dynamics trajectories of amino acid dipeptides. It extends beyond the commonly studied alanine dipeptide to build datasets for machine learning applications in protein dynamics.
Scientific Motivation
Dipeptide model studies commonly focus on alanine dipeptide because of its simplicity, but this limits the chemical diversity available to machine learning models. This project systematically covers different amino acid types to provide richer training data.
Key Features
Diverse Amino Acid Coverage
- 9 different amino acid types with systematic chemical diversity
- Aromatic residues (Phe, Tyr, Trp) for π-stacking interactions
- Branched residues (Ile, Val) for steric effects
- Flexible residues (Gly, Ala) for conformational sampling
- Constrained residues (Pro) for ring effects
Automated Simulation Workflows
- Complete GROMACS setup with optimized parameters
- Systematic parameter sweeps across temperature and force fields
- High-resolution trajectories optimized for ML applications
- Reproducible protocols with full documentation
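The parameter sweep could be organized along the following lines. This is a minimal sketch rather than the project's actual driver: the temperature grid, the replica count, the directory layout, and the function name `enumerate_runs` are all assumptions, and the residue list contains only the eight residues named under Diverse Amino Acid Coverage.

```python
from itertools import product

# Residues named in the feature list above; everything else is a placeholder.
RESIDUES = ["Phe", "Tyr", "Trp", "Ile", "Val", "Gly", "Ala", "Pro"]
TEMPERATURES_K = [300, 350, 400]    # assumed grid inside the 300 K to 400 K range
FORCE_FIELDS = ["amber99sb-ildn"]   # append further force fields to widen the sweep
N_REPLICAS = 3                      # assumed number of independent replicas


def enumerate_runs(residues=RESIDUES, temperatures=TEMPERATURES_K,
                   force_fields=FORCE_FIELDS, n_replicas=N_REPLICAS):
    """Return one dict per simulation in the sweep, with a hypothetical run directory."""
    runs = []
    for res, temp, ff, rep in product(residues, temperatures, force_fields,
                                      range(1, n_replicas + 1)):
        runs.append({
            "residue": res,
            "temperature_K": temp,
            "force_field": ff,
            "replica": rep,
            # Hypothetical layout: residue/force-field/temperature/replica
            "run_dir": f"{res.lower()}/{ff}/T{temp}K/rep{rep:02d}",
        })
    return runs


if __name__ == "__main__":
    for run in enumerate_runs()[:3]:
        print(run["run_dir"])
```

Keeping the sweep as plain data makes it easy to write the same records out as metadata alongside each trajectory.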
Analysis Pipeline
- Ramachandran plot analysis for conformational validation
- Free energy landscape calculation
- Trajectory clustering and representative structure extraction
- ML-ready feature extraction from atomic coordinates
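The first two analysis steps can be illustrated with plain NumPy. The sketch below computes a backbone dihedral from four atom positions and estimates a free energy surface from sampled (phi, psi) pairs via F = -kT ln P. The function names, the 10-degree default binning, and the choice of kJ/mol are assumptions for illustration; the project's own pipeline may rely on different tools.

```python
import numpy as np

KB_KJ_PER_MOL_K = 0.0083144626  # gas constant R in kJ/(mol K)


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))


def free_energy_surface(phi_deg, psi_deg, temperature_k=300.0, bins=36):
    """Free energy (kJ/mol) on a phi/psi grid, shifted so the minimum is zero.

    Bins that were never visited are reported as +inf.
    """
    hist, phi_edges, psi_edges = np.histogram2d(
        phi_deg, psi_deg, bins=bins, range=[[-180, 180], [-180, 180]]
    )
    prob = hist / hist.sum()
    with np.errstate(divide="ignore"):
        fe = -KB_KJ_PER_MOL_K * temperature_k * np.log(prob)
    fe -= fe[np.isfinite(fe)].min()
    return fe, phi_edges, psi_edges
```

For a dipeptide, phi is the dihedral over the atoms C(i-1), N, CA, C and psi the dihedral over N, CA, C, N(i+1), so `dihedral_deg` would be called once per frame for each angle before the histogram step.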
Technical Implementation
Simulation Protocol
- Force Field: AMBER99SB-ILDN with TIP3P water
- Temperature Range: 300 K to 400 K, sampled systematically
- Simulation Length: 100 ns per system for statistical convergence
- Output Frequency: High-resolution output for ML applications (1 ps interval, as listed under Trajectory Details)
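As a rough illustration of how the protocol maps onto GROMACS run parameters, the sketch below renders a fragment of a production `.mdp` file for one temperature. The 2 fs integration step, the v-rescale thermostat with `tau_t = 0.1`, the h-bond constraints, and the function name `render_production_mdp` are assumptions, not settings confirmed by this project. The fragment is also incomplete: cutoffs, electrostatics, and pressure coupling are left out.

```python
def render_production_mdp(temperature_k, sim_length_ns=100.0,
                          dt_ps=0.002, save_interval_ps=1.0):
    """Return a partial GROMACS .mdp fragment as a string.

    Only run length, output interval, and temperature coupling are set;
    the values other than length, interval, and temperature are assumed.
    """
    nsteps = round(sim_length_ns * 1000.0 / dt_ps)   # total integration steps
    nstout = round(save_interval_ps / dt_ps)         # steps between saved frames
    lines = [
        "integrator = md",
        f"dt = {dt_ps}",
        f"nsteps = {nsteps}",
        # Full-precision output so velocities and forces are stored as well.
        f"nstxout = {nstout}",
        f"nstvout = {nstout}",
        f"nstfout = {nstout}",
        "tcoupl = v-rescale",      # assumed thermostat
        "tc-grps = System",
        "tau_t = 0.1",
        f"ref_t = {temperature_k}",
        "constraints = h-bonds",   # assumed, consistent with a 2 fs step
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_production_mdp(300))
```

Under these assumptions a 100 ns run is 50,000,000 steps with a frame written every 500 steps, which gives about 100,000 frames per trajectory.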
Quality Control
- Energy minimization and equilibration protocols
- System stability monitoring throughout production runs
- Validation against experimental and theoretical benchmarks
- Cross-validation between different force fields
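Stability monitoring can be as simple as checking the drift of an energy term over the production run. The following sketch fits a straight line to an energy time series and flags runs whose drift exceeds a threshold. The threshold value and the function names are assumptions; the time and energy arrays would come from whatever energy output the workflow produces (for example, values exported with `gmx energy`).

```python
import numpy as np


def energy_drift_per_ns(time_ps, energy_kj_mol):
    """Linear drift of an energy time series in kJ/mol per ns."""
    slope_per_ps, _intercept = np.polyfit(time_ps, energy_kj_mol, 1)
    return slope_per_ps * 1000.0


def is_stable(time_ps, energy_kj_mol, max_drift_per_ns=1.0):
    """True if the absolute drift stays within the (assumed) threshold."""
    return abs(energy_drift_per_ns(time_ps, energy_kj_mol)) <= max_drift_per_ns


if __name__ == "__main__":
    t = np.arange(0.0, 1000.0, 1.0)       # 1 ns of data at 1 ps spacing
    e = -5000.0 + 0.002 * t               # synthetic series with a known drift
    print(energy_drift_per_ns(t, e), is_stable(t, e))
```

An appropriate threshold depends on system size and ensemble, so the default here should be treated as a placeholder.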
Applications
Machine Learning
- Conformational sampling for protein folding prediction
- Molecular dynamics ML models for accelerated simulations
- Force field development through data-driven approaches
- Feature learning from molecular trajectories
Computational Biology
- Understanding amino acid-specific dynamics
- Validation of force field parameters
- Comparison with experimental NMR data
- Benchmark development for MD software
Dataset Characteristics
Trajectory Details
- High temporal resolution (frames saved every 1 ps)
- Full trajectory data: coordinates, velocities, and forces
- Multiple replicas for statistical significance
- Metadata inclusion for systematic analysis
ML-Ready Features
- Preprocessed coordinate matrices
- Calculated descriptors (distances, angles, dihedrals)
- Time-series formatted data for sequential models
- Normalized and standardized feature sets
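The feature preparation listed above might look like the sketch below: dihedral angles are encoded as sine/cosine pairs to remove the discontinuity at ±180 degrees, features are standardized per column, and the frame series is cut into fixed-length windows for sequential models. The function names, window length, and stride are illustrative assumptions, not the dataset's documented format.

```python
import numpy as np


def encode_dihedrals(angles_deg):
    """Map angles of shape (n_frames, n_angles) to [sin, cos] features."""
    rad = np.radians(np.asarray(angles_deg, dtype=float))
    return np.concatenate([np.sin(rad), np.cos(rad)], axis=-1)


def standardize(features, eps=1e-12):
    """Zero-mean, unit-variance scaling per feature column.

    Returns the scaled features plus the mean and standard deviation,
    so the same transform can be applied to held-out data.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps), mean, std


def sliding_windows(features, window, stride=1):
    """Stack overlapping windows into shape (n_windows, window, n_features).

    Requires at least `window` frames.
    """
    starts = range(0, features.shape[0] - window + 1, stride)
    return np.stack([features[s:s + window] for s in starts])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi_psi = rng.uniform(-180.0, 180.0, size=(10, 2))   # synthetic angles
    feats, mean, std = standardize(encode_dihedrals(phi_psi))
    print(sliding_windows(feats, window=4, stride=2).shape)
```

Statistics for standardization should be computed on training trajectories only and reused for validation data to avoid leakage.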
Results & Validation
- Conformational landscapes consistent with experimental observations
- Force field validation through comparison with quantum calculations
- Reproducible dynamics across independent simulation runs
- Chemical diversity quantified through principal component analysis
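For readers who want to reproduce the principal component analysis step on their own features, a minimal NumPy version is sketched here. It is a generic PCA via singular value decomposition, not the project's analysis code; the function name `pca` and the choice of two components are assumptions.

```python
import numpy as np


def pca(features, n_components=2):
    """Project features onto their leading principal components.

    Returns the projected data and the fraction of variance explained
    by each retained component.
    """
    centered = features - features.mean(axis=0)
    _u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    projected = centered @ vt[:n_components].T
    return projected, explained[:n_components]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    t = rng.normal(size=500)
    # Synthetic data lying close to a single direction in 3D.
    data = t[:, None] * np.array([1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=(500, 3))
    proj, ratio = pca(data)
    print(proj.shape, ratio)
```

Applied to features pooled across residues, the spread of each amino acid in the leading components gives one way to compare how much conformational space each system covers.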
Related Work
This project supports research documented in:
- Mini-Protein Dynamics Analysis
- Ongoing work on ML-driven protein folding prediction
Future Directions
- Extension to larger peptide systems
- Integration with enhanced sampling methods
- Machine learning potential development
- Experimental validation through NMR collaboration