Computational Chemistry
FDB-17 filtering pipeline from GDB-17 (166.4B) through fragment filters (4.6B) to even sampling (10M), with bar charts comparing size distribution and Fsp3 shape complexity against commercial fragments

FDB-17: Fragment Database (10M Molecules)

FDB-17 contains 10 million fragment-like molecules selected from GDB-17’s 166.4 billion entries. Fragment-likeness filters reduce GDB-17 by 36x to 4.6 billion molecules, then even sampling across (heavy atom count, heteroatom count, stereocenter count) triplets produces a 460x further reduction to a manageable, diverse library enriched in 3D-shaped molecules.
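
The even-sampling step can be sketched as drawing an equal quota from every category bin, here keyed by a property triplet. This is an illustrative stand-in, not the paper's actual implementation; `even_sample` and its record layout are assumptions.

```python
import random
from collections import defaultdict

def even_sample(molecules, n_total, key, seed=0):
    """Draw roughly the same number of molecules from every bin.

    `key` maps each record to its bin, e.g. the
    (heavy atoms, heteroatoms, stereocenters) triplet.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for mol in molecules:
        bins[key(mol)].append(mol)
    per_bin = max(1, n_total // len(bins))
    sample = []
    for members in bins.values():
        rng.shuffle(members)          # unbiased pick within each bin
        sample.extend(members[:per_bin])
    return sample[:n_total]
```

Sampling per bin rather than uniformly is what flattens the size and complexity distributions: rare triplets (e.g. many stereocenters at low heavy-atom count) get the same quota as common ones.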

Computational Chemistry
GDBMedChem pipeline from GDB-17 through medicinal chemistry filters to 10M molecules, with Venn diagram showing 97% unique substructures and property comparison against known drugs

GDBMedChem: Drug-Like Subset of GDB-17 (10M Molecules)

GDBMedChem applies medicinal chemistry-inspired functional group and structural complexity filters to GDB-17, reducing 166.4 billion molecules to 17.8 billion, then evenly samples across molecular size, stereochemistry, and polarity to produce 10 million drug-like molecules. 97% of its substructures are absent from known molecule databases.
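
A filter cascade of this kind is just a conjunction of per-molecule predicates. The sketch below uses toy property fields and invented thresholds; the real GDBMedChem filters operate on molecular graphs via SMARTS patterns, not these dictionary keys.

```python
# Illustrative thresholds only -- not the paper's actual cutoffs.
FILTERS = [
    ("no_reactive_group", lambda m: not m["has_reactive_group"]),
    ("ring_complexity",   lambda m: m["ring_systems"] <= 4),
    ("stereocenters",     lambda m: m["stereocenters"] <= 6),
]

def passes_filters(mol):
    """True if a molecule record clears every medchem-style filter."""
    return all(pred(mol) for _, pred in FILTERS)

def filter_library(mols):
    return [m for m in mols if passes_filters(m)]
```

Structuring filters as named (label, predicate) pairs also makes it easy to log which rule rejected each molecule when tuning the cascade.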

Computational Chemistry
Simulated QM9 property landscape scatter plot of HOMO-LUMO gap vs dipole moment, colored by heavy atom count, with example molecules rendered alongside

QM9: Quantum Chemistry Properties of 134k Molecules

QM9 provides B3LYP/6-31G(2df,p)-level geometric, energetic, electronic, and thermodynamic properties for 133,885 small organic molecules (up to 9 heavy atoms of C, N, O, F) drawn from the GDB-17 chemical universe. It is one of the most widely used benchmarks in molecular machine learning.
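
Each QM9 molecule ships as one extended-xyz file. A minimal parser, assuming the per-file layout described in the dataset's README (atom count, then a tag plus 15 scalar properties, then per-atom coordinates with Mulliken charges; some floats use Mathematica-style `*^` exponents):

```python
def parse_qm9_xyz(text):
    """Parse one QM9 extended-xyz record into (n_atoms, scalars, atoms)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    props = lines[1].split()          # "gdb <index>" then 15 property values
    names = ["A", "B", "C", "mu", "alpha", "homo", "lumo", "gap",
             "r2", "zpve", "U0", "U", "H", "G", "Cv"]
    scalars = {name: float(v.replace("*^", "e"))
               for name, v in zip(names, props[2:])}
    atoms = []
    for line in lines[2:2 + n_atoms]:
        el, x, y, z, q = line.split()  # element, coords (Å), Mulliken charge
        atoms.append((el, *(float(v.replace("*^", "e")) for v in (x, y, z, q))))
    return n_atoms, scalars, atoms
```

A quick sanity check on any record is that `gap` equals `lumo - homo` (all three are reported in Hartree).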

Computational Chemistry
VQM24 overview showing 9 included elements with valencies, combinatorial scaling of molecular geometries with heavy atom count, and ML learning curves comparing VQM24 vs QM9 difficulty

VQM24: 836k Molecules at DFT and Diffusion QMC

VQM24 exhaustively enumerates all neutral closed-shell molecules with up to 5 heavy atoms from C, N, O, F, Si, P, S, Cl, and Br, yielding 258k constitutional isomers and 578k conformers (836k total). Properties are computed at the ωB97X-D3/cc-pVDZ level, with diffusion QMC energies for 10,793 molecules of up to 4 heavy atoms. ML models show up to 8x higher errors than on QM9, making VQM24 a more challenging benchmark.
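
The combinatorial growth is visible even at the crudest level: just counting distinct heavy-atom compositions (multisets of elements, ignoring hydrogens, valence rules, and bonding topology) already fans out quickly, and the 258k constitutional isomers sit far above this lower bound. A toy count:

```python
from itertools import combinations_with_replacement

ELEMENTS = ["C", "N", "O", "F", "Si", "P", "S", "Cl", "Br"]

def heavy_atom_compositions(max_heavy):
    """Count distinct heavy-atom multisets with 1..max_heavy atoms.

    Ignores valence and connectivity, so this is only a crude lower
    bound illustrating the combinatorial scaling VQM24 enumerates over.
    """
    return sum(
        1
        for k in range(1, max_heavy + 1)
        for _ in combinations_with_replacement(ELEMENTS, k)
    )
```

With 9 elements there are already 2,001 compositions at 5 heavy atoms; every composition then branches into many bonded graphs, stereoisomers, and conformers.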

Computational Chemistry
ChatDrug pipeline from prompt design through ChatGPT to domain feedback and edited molecule output

ChatDrug: Conversational Drug Editing with ChatGPT

ChatDrug is a parameter-free framework that combines ChatGPT with retrieval-augmented domain feedback and iterative conversation to edit drugs across small molecules, peptides, and proteins.
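
The iterate-with-feedback idea can be sketched as a loop around three pluggable callables. All three names (`propose`, `satisfies`, `retrieve`) are hypothetical stand-ins for the ChatGPT call, the domain property evaluator, and the retrieval module; this is a sketch of the control flow, not ChatDrug's code.

```python
def conversational_edit(molecule, satisfies, propose, retrieve, max_turns=5):
    """ChatDrug-style loop: propose an edit, check it against a domain
    oracle, and on failure feed a retrieved positive example back into
    the next prompt as conversational feedback."""
    feedback = None
    for _ in range(max_turns):
        candidate = propose(molecule, feedback)
        if satisfies(candidate):
            return candidate
        feedback = retrieve(candidate)  # a similar molecule that does satisfy
    return None
```

Because the LLM itself is never updated, the whole framework stays parameter-free: all adaptation happens through the retrieved examples injected into the conversation.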

Computational Chemistry
ChemCrow architecture with GPT-4 central planner connected to 18 chemistry tools via ReAct reasoning

ChemCrow: Augmenting LLMs with 18 Chemistry Tools

ChemCrow augments GPT-4 with 18 chemistry tools to autonomously plan and execute syntheses, discover novel chromophores, and solve diverse chemical reasoning tasks.
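
One ReAct step boils down to parsing an `Action: Tool[input]` line out of the model's text and dispatching to a tool registry. The registry and tool name below are toy stand-ins; ChemCrow's actual 18 tools wrap real chemistry software and web services.

```python
import re

def react_step(llm_output, tools):
    """Run one ReAct step: extract 'Action: ToolName[input]' from the
    model output, execute the tool, and return the observation string
    that gets appended to the next prompt."""
    match = re.search(r"Action:\s*(\w+)\[(.*)\]", llm_output)
    if not match:
        return None  # no action found: the model gave a final answer
    name, arg = match.group(1), match.group(2)
    if name not in tools:
        return f"Observation: unknown tool {name}"
    return f"Observation: {tools[name](arg)}"

# Toy registry standing in for ChemCrow's tool wrappers.
TOOLS = {"AtomCount": lambda smiles: len(smiles.replace("(", "").replace(")", ""))}
```

The planner loop then alternates model calls and `react_step` until the model stops emitting actions.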

Computational Chemistry
ChemLLM pipeline from ChemData structured templates through fine-tuned InternLM2 to ChemBench evaluation

ChemLLM: A Chemical Large Language Model Framework

ChemLLM presents a comprehensive framework for chemistry-specific language modeling, including a 7M-sample instruction tuning dataset (ChemData), a 4,100-question benchmark (ChemBench), and a two-stage fine-tuned model that matches GPT-4 on core chemical tasks.
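
Building an instruction corpus like ChemData amounts to filling structured records into natural-language templates. The templates and record fields below are invented for illustration; the real dataset uses many paraphrases and task types.

```python
import random

# Hypothetical templates in the spirit of ChemData's structured prompts.
TEMPLATES = [
    "What is the product of the reaction between {reactants}?",
    "Predict the outcome when {reactants} react.",
]

def make_instruction_sample(record, rng=None):
    """Turn one structured record into an instruction-tuning pair."""
    rng = rng or random.Random(0)
    prompt = rng.choice(TEMPLATES).format(reactants=record["reactants"])
    return {"instruction": prompt, "output": record["product"]}
```

Sampling a template per record injects surface-form variety so the model learns the task rather than a fixed phrasing.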

Computational Chemistry
Coscientist architecture with GPT-4 planner orchestrating web search, code execution, document search, and robot lab API modules

Coscientist: Autonomous Chemistry with LLM Agents

Coscientist is a GPT-4-driven AI system that autonomously designs and executes chemical experiments by combining web search, document search, code execution, and robotic lab automation.

Computational Chemistry
DrugAssist workflow from user instruction through LoRA fine-tuned Llama2 to optimized molecule output

DrugAssist: Interactive LLM Molecule Optimization

DrugAssist fine-tunes Llama2-7B-Chat on over one million molecule pairs for interactive, dialogue-based molecule optimization across six molecular properties.

Computational Chemistry
DrugChat architecture showing GNN encoder, linear adaptor, and Vicuna LLM for conversational drug analysis

DrugChat: Conversational QA on Drug Molecule Graphs

DrugChat is a prototype system that bridges molecular graph neural networks with large language models for interactive, multi-turn question answering about drug compounds. It trains only a lightweight linear adaptor between a frozen GNN encoder and Vicuna-13B using 143K curated QA pairs from ChEMBL and PubChem.
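
The trainable part of this architecture is tiny: a single linear map projecting frozen-GNN molecule embeddings into the LLM's embedding space, where they act as soft prompt tokens. A sketch with illustrative dimensions (300 is a guess for the GNN output; 5120 matches a 13B-class LLM's hidden size):

```python
import numpy as np

rng = np.random.default_rng(0)

GNN_DIM, LLM_DIM = 300, 5120   # illustrative dimensions

# The only trainable parameters: one linear adaptor. The GNN encoder
# and the LLM on either side of it stay frozen during training.
W = rng.standard_normal((GNN_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def adapt(graph_embedding):
    """Project a frozen-GNN molecule embedding into LLM embedding space."""
    return graph_embedding @ W + b

mol_emb = rng.standard_normal(GNN_DIM)  # stand-in for GNN(molecule graph)
soft_token = adapt(mol_emb)             # prepended to the text prompt tokens
```

Training only `W` and `b` on the 143K QA pairs keeps the compute cost of aligning the two frozen modalities very low.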

Computational Chemistry
Pipeline diagram showing natural language chemistry questions flowing through fine-tuned GPT-3 to chemical predictions across molecules, materials, and reactions

Fine-Tuning GPT-3 for Predictive Chemistry Tasks

Jablonka et al. show that fine-tuning GPT-3 on natural language chemistry questions achieves competitive or superior performance to dedicated ML models across 15 benchmarks, with particular strength in low-data settings and inverse molecular design.
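
Casting a chemistry task as fine-tuning data means serializing each Q/A pair as a prompt/completion JSONL line. The separator and `END` token below follow a common convention for legacy GPT-3 fine-tuning, not necessarily the paper's exact formatting:

```python
import json

def to_finetune_record(question, answer):
    """Format one chemistry Q/A pair as a prompt/completion JSONL line.

    Uses a '###' separator to mark the end of the prompt and a leading
    space plus an 'END' stop token on the completion -- one common
    convention, assumed here for illustration.
    """
    return json.dumps({
        "prompt": f"{question}\n\n###\n\n",
        "completion": f" {answer} END",
    })
```

The same serialization works in both directions, which is what enables the inverse-design experiments: swap the roles so the property description becomes the prompt and the molecule string becomes the completion.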

Computational Chemistry
Visualization of Galactica corpus composition and benchmark performance comparing Galactica 120B against baselines

Galactica: A Curated Scientific LLM from Meta AI

Galactica trains a decoder-only Transformer on a curated 106B-token scientific corpus spanning papers, proteins, and molecules, achieving strong results on scientific QA, mathematical reasoning, and citation prediction.