An Interactive LLM for Molecule Optimization

DrugAssist is a Method paper that proposes an interactive molecule optimization model built by fine-tuning Llama2-7B-Chat with LoRA on a newly constructed instruction dataset. The primary contribution is twofold: (1) the MolOpt-Instructions dataset containing over one million molecule pairs with six molecular properties and three optimization task categories, and (2) a dialogue-based molecule optimization system that allows domain experts to iteratively refine molecular modifications through multi-turn natural language conversations.

Why Interactive Molecule Optimization Matters

Molecule optimization is a core step in the drug discovery pipeline, where lead compounds must be modified to improve specific pharmacological properties while maintaining structural similarity. Existing approaches fall into sequence-based methods (treating SMILES optimization as machine translation) and graph-based methods (graph-to-graph translation), but they share a critical limitation: they are non-interactive. These models learn patterns from chemical structure data without incorporating expert feedback.

The drug discovery process is inherently iterative and requires integrating domain expertise. Medicinal chemists typically refine candidates through repeated cycles of suggestion, evaluation, and adjustment. Prior LLM-based approaches like ChatDrug relied on prompt engineering with general-purpose models (GPT-3.5-turbo) rather than fine-tuning, limiting their optimization accuracy. Additionally, most existing molecule optimization benchmarks focus on single-property optimization with vague objectives (e.g., “maximize QED”), while real-world drug design requires optimizing property values within specific ranges across multiple properties simultaneously.

Instruction-Based Fine-Tuning with MolOpt-Instructions

The core innovation has two components: the MolOpt-Instructions dataset construction pipeline and the multi-task instruction tuning strategy.

Dataset Construction

MolOpt-Instructions is built from one million molecules randomly sampled from the ZINC database. The construction workflow uses mmpdb (an open-source Matched Molecular Pair platform) to generate structurally similar molecule pairs through Matched Molecular Pair Analysis (MMPA). Pairs are filtered to satisfy two criteria: Tanimoto similarity greater than 0.65 and logP difference greater than 2.5. Property values for six properties (Solubility, BBBP, hERG inhibition, QED, hydrogen bond donor count, and hydrogen bond acceptor count) are computed using Tencent’s iDrug platform. The final dataset contains 1,029,949 unique pairs covering 1,595,839 unique molecules, with mean similarity of 0.69 and mean logP difference of 2.82.
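The pair-filtering step can be sketched as follows, assuming similarity and logP values have already been computed upstream (the paper uses mmpdb for pair generation and iDrug/RDKit for properties; the helper below is illustrative, not the authors' pipeline):

```python
# Illustrative filter for candidate molecule pairs. The values are assumed
# to be precomputed; the thresholds are the paper's stated criteria:
# Tanimoto similarity > 0.65 and logP difference > 2.5.

def keep_pair(tanimoto_sim: float, logp_diff: float) -> bool:
    """Return True if a (source, target) pair passes both filters."""
    return tanimoto_sim > 0.65 and abs(logp_diff) > 2.5

# Example candidates as (similarity, logP difference) tuples.
candidates = [(0.72, 2.9), (0.60, 3.1), (0.80, 1.0)]
kept = [c for c in candidates if keep_pair(*c)]  # only the first pair survives
```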

Three categories of optimization tasks are defined:

  • Loose: Increase or decrease a given property value (no threshold)
  • Strict: Increase or decrease by at least a specified threshold
  • Range: Optimize the property value to fall within a given interval

Instruction templates are generated with ChatGPT assistance and manually refined. To ensure balance, source and target molecules are swapped for some pairs to maintain a roughly 1:1 ratio of property increases to decreases.
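A minimal sketch of how instructions for the three task categories might be templated; the actual MolOpt-Instructions templates were drafted with ChatGPT and manually refined, so the wording below is a hypothetical stand-in:

```python
# Hypothetical instruction templates for the three task categories
# (Loose, Strict, Range). Wording is illustrative only.

def loose_instruction(smiles, prop, direction="increase"):
    return f"{direction.capitalize()} the {prop} of the molecule {smiles}."

def strict_instruction(smiles, prop, threshold, direction="increase"):
    return f"{direction.capitalize()} the {prop} of {smiles} by at least {threshold}."

def range_instruction(smiles, prop, low, high):
    return f"Optimize the {prop} of {smiles} so that it falls within [{low}, {high}]."

prompt = strict_instruction("CCO", "QED", 0.1)
```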

Murcko scaffold analysis confirms chemical diversity: the average molecules per scaffold is 2.95, and over 93.7% of scaffolds contain no more than five molecules.
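The two diversity statistics are simple to reproduce once each molecule is mapped to its scaffold; in the paper this mapping comes from RDKit's Murcko scaffold decomposition, while the sketch below takes a precomputed mapping and toy data:

```python
from collections import Counter

# Sketch of the scaffold-diversity statistics: average molecules per
# scaffold, and the fraction of scaffolds with no more than 5 molecules.
# The molecule -> scaffold mapping is assumed precomputed (RDKit in the paper).

def scaffold_stats(mol_to_scaffold):
    counts = Counter(mol_to_scaffold.values())
    avg_per_scaffold = len(mol_to_scaffold) / len(counts)
    frac_small = sum(1 for c in counts.values() if c <= 5) / len(counts)
    return avg_per_scaffold, frac_small

# Toy data: 6 molecules spread over 3 scaffolds.
toy = {"m1": "s1", "m2": "s1", "m3": "s2", "m4": "s3", "m5": "s3", "m6": "s3"}
avg, frac = scaffold_stats(toy)  # avg = 2.0, frac = 1.0
```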

Multi-Task Instruction Tuning

The model is fine-tuned on Llama2-7B-Chat using LoRA (rank 64, alpha 128). To prevent catastrophic forgetting of general language capabilities, the training data combines MolOpt-Instructions with the Stanford Alpaca dataset (52k instruction-following examples, replicated 5x to balance the mixture). The training objective minimizes the negative log-likelihood over the response tokens:

$$L(R; \boldsymbol{\theta}) = -\sum_{u_i \in R} \log \Phi(u_i \mid u_{<i}, I)$$

where $I$ is the instruction, $R$ is the response, and $\Phi$ is the model’s conditional probability.
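In code, this objective is an ordinary next-token negative log-likelihood restricted to response tokens, with instruction tokens masked out of the sum; a framework-free, illustrative sketch:

```python
import math

# NLL over response tokens only, matching the loss above: instruction
# tokens contribute nothing; each response token contributes -log p.

def response_nll(token_probs, is_response_mask):
    """token_probs[i]: model probability of the i-th target token.
    is_response_mask[i]: True for response tokens, False for instruction tokens."""
    return -sum(math.log(p) for p, m in zip(token_probs, is_response_mask) if m)

# Two response tokens, each with probability 0.5 -> loss = 2 * ln 2.
loss = response_nll([0.9, 0.5, 0.5], [False, True, True])
```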

Training runs for 10 epochs with batch size 512, using AdamW ($\beta = (0.9, 0.999)$), learning rate 1e-4, 3% warm-up steps with cosine decay, and no weight decay. The data is split 90/5/5 for train/validation/test.
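The stated schedule (linear warm-up over the first 3% of steps, then cosine decay, peaking at 1e-4) can be written as a small helper; this is an illustrative reconstruction, not the authors' training code:

```python
import math

# Learning-rate schedule sketch: linear warm-up for the first 3% of steps,
# then cosine decay to zero from the peak LR of 1e-4.

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.03):
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * step / warmup                    # linear warm-up
    progress = (step - warmup) / (total_steps - warmup)   # 0 -> 1 after warm-up
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

peak = lr_at(30, 1000)   # end of warm-up: full peak LR
final = lr_at(1000, 1000)  # end of training: decayed to ~0
```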

Experimental Setup and Multi-Property Optimization Results

Comparison with Traditional Approaches

DrugAssist is compared against Mol-Seq2Seq and Mol-Transformer (He et al., 2021) on simultaneous Solubility and BBBP optimization with range constraints. The evaluation prompt asks the model to generate an optimized molecule with solubility within a given range and BBBP category changed from one level to another.

| Model | Solubility | BBBP | Both | Valid Rate | Similarity |
|---|---|---|---|---|---|
| Mol-Seq2Seq | 0.46 | 0.55 | 0.35 | 0.76 | 0.61 |
| Mol-Transformer | 0.70 | 0.78 | 0.59 | 0.96 | 0.70 |
| DrugAssist | 0.74 | 0.80 | 0.62 | 0.98 | 0.69 |

DrugAssist achieves the highest success rates in both single-property and multi-property optimization while maintaining high validity (0.98) and comparable structural similarity (0.69).

Comparison with LLMs

DrugAssist is compared against Llama2-7B-Chat, GPT-3.5-turbo (via ChatDrug), and BioMedGPT-LM-7B on 16 tasks covering all three optimization categories. These comparisons use multi-turn dialogues following the ChatDrug protocol: if the model’s output fails to meet requirements, a database-retrieved molecule meeting the criteria and similar to the model’s output is provided as a hint for iterative refinement.
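The hint-based multi-turn protocol amounts to a generate-check-retry loop; the sketch below captures that control flow with placeholder callables (these names are illustrative, not the ChatDrug API):

```python
# Sketch of the ChatDrug-style evaluation loop: if the generated molecule
# fails the property check, a database-retrieved molecule that satisfies the
# criteria (and is similar to the failed output) is fed back as a hint.
# generate / meets_criteria / retrieve_hint are placeholders from the caller.

def optimize_with_hints(generate, meets_criteria, retrieve_hint, molecule, max_turns=3):
    prompt = f"Optimize this molecule: {molecule}"
    for _ in range(max_turns):
        candidate = generate(prompt)
        if meets_criteria(candidate):
            return candidate
        hint = retrieve_hint(candidate)
        prompt = (f"{candidate} does not meet the requirement. "
                  f"A similar molecule that does is {hint}. Please try again.")
    return None  # all turns exhausted without success

# Toy demonstration with deterministic stand-ins: second attempt succeeds.
attempts = iter(["C1=CC=CC=C1", "CCO"])
result = optimize_with_hints(
    generate=lambda p: next(attempts),
    meets_criteria=lambda m: m == "CCO",
    retrieve_hint=lambda m: "CCO",
    molecule="c1ccccc1",
)
```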

Selected results on single-property tasks; each cell reports the success rate under the loose / strict criteria:

| Task | Llama2-7B-Chat | GPT-3.5-turbo | BioMedGPT-LM | DrugAssist |
|---|---|---|---|---|
| QED+ | 0.17 / 0.16 | 0.15 / 0.15 | 0.15 / 0.09 | 0.76 / 0.63 |
| Acceptor+ | 0.08 / 0.08 | 0.04 / 0.06 | 0.18 / 0.13 | 0.71 / 0.67 |
| Donor+ | 0.15 / 0.08 | 0.10 / 0.04 | 0.17 / 0.09 | 0.72 / 0.76 |
| Solubility+ | 0.36 / 0.20 | 0.16 / 0.05 | 0.18 / 0.09 | 0.80 / 0.41 |
| BBBP+ | 0.19 / 0.14 | 0.10 / 0.10 | 0.16 / 0.07 | 0.82 / 0.61 |
| hERG- | 0.39 / 0.31 | 0.13 / 0.15 | 0.13 / 0.12 | 0.71 / 0.67 |

Multi-property tasks:

| Task | Llama2-7B-Chat | GPT-3.5-turbo | BioMedGPT-LM | DrugAssist |
|---|---|---|---|---|
| Sol+ & Acc+ | 0.15 / 0.04 | 0.09 / 0.02 | 0.10 / 0.07 | 0.50 / 0.27 |
| QED+ & BBBP+ | 0.14 / 0.09 | 0.09 / 0.06 | 0.16 / 0.11 | 0.65 / 0.41 |

DrugAssist outperforms all baselines across every task. BioMedGPT-LM frequently misunderstands the task, generating guidance text rather than molecules. GPT-3.5-turbo achieves high validity but often outputs the input molecule unchanged.

Transferability, Iterative Refinement, and Limitations

Key Findings

Zero-shot transferability: Although DrugAssist trains on single-property optimization data, it successfully handles multi-property optimization requests at inference time. In a case study, the model simultaneously increased both BBBP and QED by at least 0.1 while maintaining structural similarity, without any multi-property training examples.

Few-shot generalization: DrugAssist optimizes properties not seen during training (e.g., logP) when provided with a few in-context examples of successful optimizations, a capability that traditional sequence-based or graph-based models cannot achieve without retraining.
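The few-shot mechanism is just a matter of prepending successful optimizations as in-context examples before the query; a sketch with hypothetical example pairs and wording (only the mechanism comes from the text):

```python
# Sketch of few-shot prompting for a property unseen in training (e.g. logP).
# The example pairs and phrasing below are hypothetical stand-ins.

def build_few_shot_prompt(examples, query_smiles, prop="logP"):
    lines = [f"Here are examples of increasing {prop}:"]
    for src, tgt in examples:
        lines.append(f"Input: {src} -> Output: {tgt}")
    lines.append(f"Now increase the {prop} of: {query_smiles}")
    return "\n".join(lines)

prompt = build_few_shot_prompt([("CCO", "CCCCO"), ("CCN", "CCCCN")], "c1ccccc1")
```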

Iterative optimization: When an initial optimization fails to meet requirements, DrugAssist can incorporate feedback (a database-retrieved hint molecule) and modify different functional groups in a second attempt to produce a compliant molecule.

Limitations

The authors acknowledge that DrugAssist's success rate drops on the hardest settings: strict solubility optimization succeeds only 0.41 of the time, versus 0.80 under the loose criterion. The model also relies on iDrug for property prediction of Solubility, BBBP, and hERG inhibition, so its optimization quality is bounded by the accuracy of these property predictors. The LLM comparisons use only 500 test molecules, a relatively small evaluation set, and the paper reports no statistical significance tests or confidence intervals.

Future Directions

The authors plan to improve multimodal data handling to reduce hallucination problems and to further enhance DrugAssist’s interactive capabilities for better understanding of user needs and feedback.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training | MolOpt-Instructions | 1,029,949 molecule pairs | Sourced from ZINC via mmpdb; 6 properties |
| Training (auxiliary) | Stanford Alpaca | 52k instructions (5x replicated) | Mitigates catastrophic forgetting |
| Evaluation (traditional) | From He et al. (2021) | Not specified | Multi-property optimization test |
| Evaluation (LLM) | ZINC subset | 500 molecules | Randomly selected |

Algorithms

  • Base model: Llama2-7B-Chat
  • Fine-tuning: LoRA with rank 64, alpha 128
  • Optimizer: AdamW, $\beta = (0.9, 0.999)$, lr = 1e-4, no weight decay
  • Schedule: 3% warm-up, cosine decay
  • Epochs: 10
  • Batch size: 512
  • Property calculation: iDrug (Solubility, BBBP, hERG); RDKit (H-bond donors/acceptors, QED)
  • Molecular pairs: mmpdb for Matched Molecular Pair Analysis

Models

  • Fine-tuned Llama2-7B-Chat with LoRA adapters
  • No pre-trained weights released (code and data available)

Evaluation

| Metric | Description |
|---|---|
| Success rate | Fraction of molecules meeting optimization criteria |
| Valid rate | Fraction of generated SMILES that parse as valid molecules |
| Similarity | Tanimoto similarity between input and optimized molecules |
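These metrics are straightforward to implement; the sketch below uses an injected validity predicate and plain bit-index sets in place of RDKit's SMILES parsing and fingerprints, which the real evaluation would use:

```python
# Illustrative implementations of the three evaluation metrics. Validity is
# an injected predicate and fingerprints are sets of on-bit indices; a real
# evaluation would use RDKit parsing and fingerprinting instead.

def valid_rate(smiles_list, is_valid):
    return sum(1 for s in smiles_list if is_valid(s)) / len(smiles_list)

def success_rate(smiles_list, meets_criteria):
    return sum(1 for s in smiles_list if meets_criteria(s)) / len(smiles_list)

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as on-bit sets."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

sim = tanimoto({1, 2, 3}, {2, 3, 4})  # 2 shared bits / 4 total bits = 0.5
```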

Hardware

  • 8 NVIDIA Tesla A100-SXM4-40GB GPUs

Artifacts

| Artifact | Type | License | Notes |
|---|---|---|---|
| DrugAssist Code | Code | Not specified | Training and inference code |
| MolOpt-Instructions | Dataset | Not specified | 1M+ molecule pairs, 6 properties |

Paper Information

Citation: Ye, G., Cai, X., Lai, H., Wang, X., Huang, J., Wang, L., Liu, W., & Zeng, X. (2024). DrugAssist: A Large Language Model for Molecule Optimization. Briefings in Bioinformatics, 26(1), bbae693.

@article{ye2024drugassist,
  title={DrugAssist: A Large Language Model for Molecule Optimization},
  author={Ye, Geyan and Cai, Xibao and Lai, Houtim and Wang, Xing and Huang, Junhong and Wang, Longyue and Liu, Wei and Zeng, Xiangxiang},
  journal={Briefings in Bioinformatics},
  volume={26},
  number={1},
  pages={bbae693},
  year={2024},
  doi={10.1093/bib/bbae693}
}