ChatDrug: Conversational Drug Editing with ChatGPT

A Framework for Conversational Drug Editing with LLMs

ChatDrug is a Method paper that introduces a parameter-free framework for drug editing using conversational large language models (specifically ChatGPT/GPT-3.5). The primary contribution is a three-module pipeline that combines prompt engineering, retrieval-augmented domain feedback, and iterative conversation to perform text-guided editing of small molecules, peptides, and proteins. The paper also establishes a benchmark of 39 drug editing tasks spanning these three drug types.

Bridging Conversational AI and Drug Discovery

Drug editing (also called lead optimization or protein design) is a critical step in the drug discovery pipeline where molecular substructures are modified to achieve desired properties. Traditional approaches rely on domain experts for manual editing, which can be subjective and biased. Recent multi-modal approaches like MoleculeSTM and ProteinDT have started exploring text-guided drug editing, but they are domain-specific (limited to one drug type) and lack conversational capabilities for iterative refinement.

The authors identify three properties of conversational LLMs that make them suitable for drug discovery: (1) pretraining on comprehensive knowledge bases covering drug-related concepts, (2) strong few-shot adaptation and generalization abilities, and (3) interactive communication enabling iterative feedback incorporation. However, directly applying LLMs to drug editing yields suboptimal results because the models do not fully utilize prior domain knowledge. ChatDrug addresses this gap through structured retrieval and feedback mechanisms.

Three-Module Pipeline: PDDS, ReDF, and Conversation

ChatDrug consists of three modules that operate sequentially without any parameter learning.

PDDS Module (Prompt Design for Domain-Specific)

The PDDS module constructs domain-specific prompts for ChatGPT. Given an input drug $\pmb{x}_{\text{in}}$ and a text prompt $\pmb{x}_t$ describing the desired property change, the goal is:

$$ \pmb{x}_{\text{out}} = \text{ChatDrug}(\pmb{x}_{\text{in}}, \pmb{x}_t) $$

The prompts are designed around high-level property descriptions (e.g., “more soluble in water”) rather than exact substructure replacements. The authors argue that ChatDrug is better suited for “fuzzy searching” (property-based editing with non-deterministic answers) rather than “exact searching” (precise substructure replacement that experts can do directly).

ReDF Module (Retrieval and Domain Feedback)

The ReDF module retrieves structurally similar examples from a domain-specific database and injects them into the conversation as demonstrations. For an input drug $\pmb{x}_{\text{in}}$, a candidate drug $\tilde{\pmb{x}}$ that failed the desired property change, and a retrieval database, ReDF returns:

$$ \pmb{x}_R = \text{ReDF}(\pmb{x}_{\text{in}}, \tilde{\pmb{x}}; \pmb{x}_t) = \underset{\pmb{x}’_R \in \text{RetrievalDB}}{\arg\max} \langle \tilde{\pmb{x}}, \pmb{x}’_R \rangle \wedge D(\pmb{x}_{\text{in}}, \pmb{x}’_R; \pmb{x}_t) $$

where $D(\cdot, \cdot; \cdot) \in {\text{True}, \text{False}}$ is a domain feedback function checking whether the retrieved drug satisfies the desired property change, and $\langle \tilde{\pmb{x}}, \pmb{x}’_R \rangle$ is a similarity function (Tanimoto similarity for small molecules, Levenshtein distance for peptides and proteins).

The retrieved example $\pmb{x}_R$ is injected into the prompt as: “Your provided sequence [$\tilde{\pmb{x}}$] is not correct. We find a sequence [$\pmb{x}_R$] which is correct and similar to the molecule you provided. Can you give me a new molecule?”

Conversation Module

The conversation module enables iterative refinement over $C$ rounds. At each round $c$, if the edited drug $\pmb{x}_c$ does not satisfy the evaluation condition, ChatDrug retrieves a new example via ReDF using $\tilde{\pmb{x}} = \pmb{x}_c$ and continues the conversation. This aligns with the iterative nature of real drug discovery workflows.

Experiments Across 39 Drug Editing Tasks

Task Design

The benchmark includes 39 tasks across three drug types:

Small molecules (28 tasks): 16 single-objective (tasks 101-108, each with loose and strict thresholds) and 12 multi-objective tasks (tasks 201-206, each with two thresholds). Properties include solubility (LogP), drug-likeness (QED), permeability (tPSA), hydrogen bond acceptors/donors.
Peptides (9 tasks): 6 single-objective and 3 multi-objective tasks for editing peptide-MHC binding affinity across different HLA allele types.
Proteins (2 tasks): Editing protein sequences to increase alpha-helix or beta-strand secondary structures.

Baselines

For small molecules, baselines include Random, PCA, High-Variance, and GS-Mutate (all based on MegaMolBART), plus MoleculeSTM with SMILES and Graph representations. For peptides and proteins, random mutation baselines with 1-3 mutated positions are used.

Main Results

ChatDrug achieves the best performance on 33 out of 39 tasks. Key results for small molecule editing (hit ratio):

Task	Property	ChatDrug (loose)	Best Baseline (loose)
101	More soluble	94.13	67.86 (MoleculeSTM-Graph)
102	Less soluble	96.86	64.79 (MoleculeSTM-Graph)
106	Lower permeability	77.35	34.13 (MoleculeSTM-SMILES)
107	More HBA	95.35	54.01 (MoleculeSTM-SMILES)
108	More HBD	96.54	60.97 (MoleculeSTM-Graph)

ChatDrug underperforms on tasks 104 (less like a drug) and 105 (higher permeability) and most multi-objective tasks involving permeability (205), where MoleculeSTM variants perform better.

For peptide editing, ChatDrug achieves 41-69% hit ratios compared to 0.4-14.4% for random mutation baselines. For protein editing, ChatDrug reaches 34.79% and 51.38% hit ratios on helix and strand tasks respectively, compared to 26.90% and 21.44% for the best random mutation baseline.

Ablation Studies

Conversation rounds: Performance increases with more rounds, converging around $C = 2$. For example, on task 101 (loose threshold), zero-shot achieves 78.26%, $C = 1$ reaches 89.56%, and $C = 2$ reaches 93.37%.

ReDF threshold: Using a stricter threshold in the domain feedback function $D$ (matching the evaluation threshold) yields substantially higher performance than using a loose threshold. For example, on task 107 with strict evaluation, the strict-threshold ReDF achieves 72.60% vs. 14.96% for the loose-threshold ReDF.

Similarity analysis: Retrieved molecules $\pmb{x}_R$ tend to have lower similarity to input molecules than the intermediate outputs $\pmb{x}_1$, yet they have higher hit ratios. This suggests the ReDF module explores the chemical space effectively, and the conversation module balances similarity preservation with property optimization.

Knowledge extraction: ChatDrug can articulate domain-specific reasoning for its edits (e.g., summarizing rules for increasing water solubility by introducing polar functional groups), though the extracted knowledge shows some redundancy.

Limitations and Future Directions

ChatDrug demonstrates that conversational LLMs can serve as useful tools for drug editing, achieving strong results across diverse drug types with a parameter-free approach. The framework exhibits open vocabulary and compositional properties, allowing it to handle novel drug concepts and multi-objective tasks through natural language.

The authors acknowledge two main limitations. First, ChatDrug struggles with understanding complex 3D drug geometries, which would require deeper geometric modeling. Second, the framework requires multiple conversation rounds to achieve strong performance, adding computational cost through repeated API calls. The authors suggest that knowledge summarization capabilities of LLMs could help reduce this cost.

The evaluation relies entirely on computational oracles (RDKit for small molecules, MHCflurry2.0 for peptides, ProteinCLAP for proteins) rather than wet-lab validation. The hit ratio metric also excludes invalid outputs from the denominator, so the effective success rate on all attempted edits may be lower than reported.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Small molecule inputs	ZINC	200 molecules	Sampled SMILES strings
Small molecule retrieval DB	ZINC	10K molecules	For ReDF similarity search
Peptide inputs	Peptide-MHC binding dataset	500 peptides per task	From 30 common MHC alleles
Peptide retrieval DB	Experimental binding data	Varies by allele	Target allele experimental data
Protein inputs	TAPE test set	Varies	Secondary structure prediction test data
Protein retrieval DB	TAPE training set	Varies	Secondary structure prediction training data

Algorithms

GPT-3.5-turbo via OpenAI ChatCompletion API, temperature=0, frequency_penalty=0.2
System prompt: “You are an expert in the field of molecular chemistry.”
$C = 2$ conversation rounds for main results
5 random seeds (0-4) for small molecule main results, seed 0 for ablations

Models

ChatGPT (GPT-3.5-turbo): used as-is, no fine-tuning
MHCflurry 2.0: pseudo-oracle for peptide binding affinity evaluation
ProteinCLAP-EBM-NCE from ProteinDT: protein secondary structure prediction
ESMFold: protein folding for visualization
RDKit: molecular property calculations for small molecules

Evaluation

Metric	Description	Notes
Hit Ratio	Fraction of valid edits satisfying property requirements	Invalid sequences excluded from denominator

Hardware

All experiments conducted on a single NVIDIA RTX A6000 GPU (used only for peptide and protein evaluation). Total OpenAI API cost was less than $100.

Artifact	Type	License	Notes
ChatDrug GitHub	Code	Not specified	Official implementation

Paper Information

Citation: Liu, S., Wang, J., Yang, Y., Wang, C., Liu, L., Guo, H., & Xiao, C. (2024). Conversational Drug Editing Using Retrieval and Domain Feedback. ICLR 2024.

@inproceedings{liu2024chatdrug,
  title={Conversational Drug Editing Using Retrieval and Domain Feedback},
  author={Liu, Shengchao and Wang, Jiongxiao and Yang, Yijin and Wang, Chengpeng and Liu, Ling and Guo, Hongyu and Xiao, Chaowei},
  booktitle={International Conference on Learning Representations},
  year={2024}
}

A Framework for Conversational Drug Editing with LLMs#

Bridging Conversational AI and Drug Discovery#

Three-Module Pipeline: PDDS, ReDF, and Conversation#

PDDS Module (Prompt Design for Domain-Specific)#

ReDF Module (Retrieval and Domain Feedback)#

Conversation Module#

Experiments Across 39 Drug Editing Tasks#

Task Design#

Baselines#

Main Results#

Ablation Studies#

Limitations and Future Directions#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Paper Information#