Coscientist: Autonomous Chemistry with LLM Agents

An LLM-Powered Agent for Autonomous Chemical Experimentation

This is a Method paper that introduces Coscientist, an AI system driven by GPT-4 that autonomously designs, plans, and performs complex chemical experiments. The primary contribution is a modular multi-LLM agent architecture that integrates internet search, documentation retrieval, code execution, and robotic experimentation APIs into a unified system capable of end-to-end experimental chemistry with minimal human intervention.

Bridging LLM Capabilities and Laboratory Automation

Transformer-based large language models had demonstrated strong capabilities in natural language processing, biology, chemistry, and code generation by early 2023. Simultaneously, laboratory automation had progressed with autonomous reaction discovery, automated flow systems, and mobile robotic platforms. However, these two threads remained largely separate: LLMs could reason about chemistry in text, but could not act on that reasoning by controlling physical experiments.

The gap this work addresses is the integration of LLM reasoning with laboratory automation in a closed-loop system. Prior automated chemistry systems relied on traditional optimization algorithms or narrow AI components. The question was whether GPT-4’s general reasoning capabilities could be combined with tool access to produce a system that autonomously designs experiments, writes instrument code, executes reactions, and interprets results, all from natural language prompts.

This work was developed independently and in parallel with other autonomous agent efforts (AutoGPT, BabyAGI, LangChain), with ChemCrow serving as another chemistry-specific example.

A Modular Multi-LLM Architecture with Tool Access

The core innovation is Coscientist’s modular architecture, centered on a “Planner” module (a GPT-4 chat completion instance) that orchestrates four command types:

GOOGLE: A Web Searcher module (itself an LLM) that transforms prompts into search queries, browses results, and funnels answers back to the Planner.
PYTHON: A Code Execution module running in an isolated Docker container for calculations and data analysis, with no LLM dependency.
DOCUMENTATION: A Docs Searcher module that retrieves and summarizes technical documentation (e.g., Opentrons Python API, Emerald Cloud Lab Symbolic Lab Language) using ada embeddings and distance-based vector search.
EXPERIMENT: An Automation module that executes generated code on laboratory hardware or provides synthetic procedures.

The system prompts are engineered in a modular fashion, with the Planner receiving initial user input and command outputs as messages. The Planner can iteratively call commands, fix software errors, and refine its approach. This design allows natural language instructions (e.g., “perform multiple Suzuki reactions”) to be translated into complete experimental protocols.

For documentation retrieval, all sections of the OT-2 API documentation were embedded using OpenAI’s ada model, and relevant sections are retrieved via cosine similarity search. For the Emerald Cloud Lab, the system learned to program in a symbolic lab language (SLL) that was completely unknown to GPT-4 at training time, demonstrating effective in-context learning from supplied documentation.

Six Tasks Demonstrating Autonomous Chemistry Capabilities

The paper evaluates Coscientist across six tasks of increasing complexity.

Task 1: Chemical Synthesis Planning

A benchmark of seven compounds was used to compare synthesis planning across models (GPT-4, GPT-3.5, Claude 1.3, Falcon-40B-Instruct) with and without web search. Outputs were scored on a 1-5 scale:

Score	Meaning
5	Very detailed and chemically accurate procedure
4	Detailed and accurate but without reagent quantities
3	Correct chemistry but no step-by-step procedure
2	Extremely vague or unfeasible
1	Incorrect or failure to follow instructions

The GPT-4-powered Web Searcher achieved maximum scores for acetaminophen, aspirin, nitroaniline, and phenolphthalein. It was the only approach to achieve acceptable scores (3+) for ibuprofen, which all non-browsing models synthesized incorrectly. These results highlight the importance of grounding LLMs to avoid hallucinations.

Task 2: Documentation Search

The system correctly identified relevant ECL functions from documentation and generated valid SLL code that was successfully executed at ECL, including an HPLC experiment on a caffeine standard sample.

Task 3: Cloud Laboratory Execution

Using prompt-to-function and prompt-to-SLL pipelines, Coscientist generated executable code for the Emerald Cloud Lab. It also searched a catalogue of 1,110 model samples to identify relevant stock solutions from simple search terms.

Task 4: Liquid Handler Control

Using the Opentrons OT-2, Coscientist translated natural language prompts (e.g., “colour every other line with one colour of your choice,” “draw a red cross”) into accurate liquid handling protocols.

Task 5: Integrated Multi-Module Experiment

The most complex demonstration combined web search, code execution, documentation retrieval, and hardware control to design and execute Suzuki-Miyaura and Sonogashira cross-coupling reactions. Coscientist:

Searched the internet for reaction conditions and stoichiometries
Selected correct coupling partners (never misassigning phenylboronic acid to Sonogashira)
Calculated reagent volumes and wrote OT-2 protocols
Self-corrected when using an incorrect heater-shaker method by consulting documentation
Successfully produced target products confirmed by GC-MS analysis (biphenyl at 9.53 min for Suzuki, diphenylacetylene at 12.92 min for Sonogashira)

Task 6: Reaction Optimization

Coscientist was tested on two fully mapped reaction datasets:

Suzuki reaction flow dataset (Perera et al.): varying ligands, reagents/bases, and solvents
Buchwald-Hartwig C-N coupling dataset (Doyle et al.): varying ligands, additives, and bases

Performance was evaluated using a normalized advantage metric:

$$\text{Normalized Advantage} = \frac{\text{yield}_i - \overline{\text{yield}}}{\text{yield}_{\max} - \overline{\text{yield}}}$$

A value of 1 indicates maximum yield reached, 0 indicates random performance, and negative values indicate worse than random. The normalized maximum advantage (NMA) tracks the best result achieved up to each iteration.

Key findings from the optimization experiments:

GPT-4 with prior information (10 random data points) produced better initial guesses than GPT-4 without prior information
Both GPT-4 approaches converged to similar NMA values at the limit
Both GPT-4 approaches outperformed standard Bayesian optimization in NMA and normalized advantage
GPT-3.5 largely failed due to inability to output correct JSON schemas
On the Buchwald-Hartwig dataset, GPT-4 performed comparably whether given compound names or SMILES strings, and could reason about electronic properties from SMILES representations

All experiments used a maximum of 20 iterations (5.2% and 6.9% of the total reaction space for the two datasets).

Demonstrated Versatility with Safety Considerations

Coscientist demonstrated that GPT-4, when equipped with appropriate tool access, can autonomously handle the full experimental chemistry workflow from literature search to reaction execution and data interpretation. The system showed chemical reasoning capabilities, including selecting appropriate reagents, providing justifications for choices based on reactivity and selectivity, and using experimental data to guide subsequent iterations.

Several limitations are acknowledged:

The experimental setup was not yet fully automated (plates were moved manually between instruments), though no human decision-making was involved
GPT-3.5 consistently underperformed due to inability to follow formatting instructions
The synthesis planning evaluation scale is inherently subjective
It is unclear whether GPT-4’s training data contained information from the optimization datasets
The comparison with Bayesian optimization may reflect different exploration/exploitation balances rather than pure capability differences

The authors raise safety concerns about dual-use potential and note that full code and prompts were withheld pending development of US AI regulations. A simplified implementation was released for reproducibility purposes.

Future directions include extending the system with reaction databases (Reaxys, SciFinder), implementing advanced prompting strategies (ReAct, Chain of Thought, Tree of Thoughts), and developing automated quality control for cloud laboratory experiments.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Synthesis benchmark	7 compound set	7 compounds	Acetaminophen, aspirin, ibuprofen, nitroaniline, etc.
Optimization	Perera et al. Suzuki flow dataset	Fully mapped condition space	Varying ligands, bases, solvents
Optimization	Doyle Buchwald-Hartwig dataset	Fully mapped condition space	Varying ligands, additives, bases
Reagent selection	SMILES compound database	Not specified	Used for computational experiments

Algorithms

Planner: GPT-4 chat completion with modular system prompts
Web Searcher: GPT-4 or GPT-3.5-turbo for query generation and result parsing
Documentation embedding: OpenAI ada model with distance-based vector search
Code execution: Isolated Docker container (no LLM dependency)
Baseline: Bayesian optimization with varying initial sample sizes (1-10)

Models

GPT-4 (primary)
GPT-3.5-turbo (baseline)
Claude 1.3 (baseline for synthesis planning)
Falcon-40B-Instruct (baseline for synthesis planning)
OpenAI ada (for documentation embedding)

Evaluation

Metric	Context	Notes
Synthesis score (1-5)	7-compound benchmark	Subjective expert grading
Normalized advantage	Optimization tasks	Measures improvement over random
NMA	Optimization tasks	Maximum advantage achieved through iteration N
GC-MS confirmation	Cross-coupling reactions	Product formation verified experimentally

Hardware

Opentrons OT-2 liquid handler with heater-shaker module
UV-Vis plate reader
Emerald Cloud Lab (cloud-based automation)
Computational requirements not specified (relies on OpenAI API calls)

Artifacts

Artifact	Type	License	Notes
gomesgroup/coscientist	Code	Apache-2.0 with Commons Clause	Simplified implementation; full code withheld for safety

Paper Information

Citation: Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. (2023). Autonomous chemical research with large language models. Nature, 624(7992), 570-578. https://doi.org/10.1038/s41586-023-06792-0

@article{boiko2023autonomous,
  title={Autonomous chemical research with large language models},
  author={Boiko, Daniil A. and MacKnight, Robert and Kline, Ben and Gomes, Gabriel dos Passos},
  journal={Nature},
  volume={624},
  number={7992},
  pages={570--578},
  year={2023},
  publisher={Springer Nature},
  doi={10.1038/s41586-023-06792-0}
}

An LLM-Powered Agent for Autonomous Chemical Experimentation#

Bridging LLM Capabilities and Laboratory Automation#

A Modular Multi-LLM Architecture with Tool Access#

Six Tasks Demonstrating Autonomous Chemistry Capabilities#

Task 1: Chemical Synthesis Planning#

Task 2: Documentation Search#

Task 3: Cloud Laboratory Execution#

Task 4: Liquid Handler Control#

Task 5: Integrated Multi-Module Experiment#

Task 6: Reaction Optimization#

Demonstrated Versatility with Safety Considerations#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#