
RWKV: Linear-Cost RNN with Transformer Training
RWKV is a sequence model that trains in parallel like a Transformer but runs as an RNN at inference, matching transformer-level performance with linear time and constant memory; the authors scale it to 14 billion parameters.
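As a rough illustration of where the linear cost comes from, the sketch below implements the WKV recurrence at the core of RWKV's token mixing in NumPy, under simplifying assumptions (single head, per-channel decay w and bonus u, no numerical-stability tricks); the released implementation is considerably more involved.

```python
import numpy as np

def wkv_recurrent(k, v, w, u):
    """Sketch of RWKV's WKV recurrence: O(T) time, O(1) state per channel.

    k, v : (T, C) key and value sequences
    w    : (C,) per-channel decay rate (w > 0)
    u    : (C,) per-channel bonus applied to the current token
    """
    T, C = k.shape
    num = np.zeros(C)            # running weighted sum of values
    den = np.zeros(C)            # running sum of weights
    out = np.zeros((T, C))
    for t in range(T):
        # the current token gets an extra bonus u before entering the state
        out[t] = (num + np.exp(u + k[t]) * v[t]) / (den + np.exp(u + k[t]))
        # decay the state, then absorb the current token
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
    return out

# toy usage: 8 tokens, 4 channels
rng = np.random.default_rng(0)
y = wkv_recurrent(rng.normal(size=(8, 4)), rng.normal(size=(8, 4)),
                  w=np.full(4, 0.5), u=np.zeros(4))
print(y.shape)  # (8, 4)
```

Because the recurrent state is just the pair (num, den) per channel, memory stays constant no matter how long the sequence grows.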

A comprehensive review of how solid-state materials can be numerically represented for machine learning, spanning structural features, graph neural networks, compositional descriptors, transfer learning, and generative models for inverse design.
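As a concrete example of the compositional descriptors covered in the review, a formula alone can be mapped to a fixed-length element-fraction vector; the toy sketch below (hypothetical helper, tiny element vocabulary) only illustrates the idea and is not code from the review.

```python
import re
from collections import Counter

ELEMENTS = ["H", "C", "N", "O", "Al", "Si", "Fe"]  # toy element vocabulary

def fraction_vector(formula: str) -> list[float]:
    """Map a simple formula such as 'Fe2O3' to a normalized element-fraction vector."""
    counts = Counter()
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(amount) if amount else 1
    total = sum(counts.values())
    return [counts.get(el, 0) / total for el in ELEMENTS]

print(fraction_vector("Fe2O3"))  # O at 0.6 and Fe at 0.4, zeros elsewhere
```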

BioT5 uses SELFIES representations and separate tokenization to pre-train a unified T5 model across molecules, proteins, and text, achieving state-of-the-art results on 10 of 15 downstream tasks.
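For readers unfamiliar with SELFIES, the snippet below shows the kind of robust molecular string BioT5 tokenizes with a dedicated vocabulary, using the open-source selfies package; the actual BioT5 tokenizer and vocabulary differ from this minimal illustration.

```python
import selfies as sf  # pip install selfies

smiles = "C1=CC=CC=C1"                        # benzene
selfies_str = sf.encoder(smiles)              # SMILES -> SELFIES
tokens = list(sf.split_selfies(selfies_str))  # one token per [...] symbol

print(selfies_str)
print(tokens)
print(sf.decoder(selfies_str))                # round-trip back to SMILES
```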

ChatDrug is a parameter-free framework that combines ChatGPT with retrieval-augmented domain feedback and iterative conversation to edit drugs across small molecules, peptides, and proteins.
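A rough sketch of that edit-and-feedback loop is below; ask_llm, passes_property, and retrieve_similar are canned stand-ins for the ChatGPT calls, property oracles, and retrieval database the paper describes, so the snippet runs but does nothing chemically meaningful.

```python
import random

def ask_llm(prompt: str) -> str:
    # stand-in for a ChatGPT call; here it just returns a canned candidate
    return random.choice(["CCO", "CC(=O)O", "CCN"])

def passes_property(smiles: str, objective: str) -> bool:
    # stand-in for a property oracle (e.g. a solubility or logP predictor)
    return smiles == "CC(=O)O"

def retrieve_similar(smiles: str, objective: str) -> list[str]:
    # stand-in for retrieval from a database of molecules that satisfy the objective
    return ["CC(=O)O", "OCC(=O)O"]

def optimize_molecule(seed_smiles: str, objective: str, max_rounds: int = 5):
    """ChatDrug-style loop: propose an edit, check it, and feed retrieved
    examples back into the conversation until the check passes."""
    prompt = f"Modify {seed_smiles} so that it {objective}. Reply with SMILES only."
    for _ in range(max_rounds):
        candidate = ask_llm(prompt)
        if passes_property(candidate, objective):
            return candidate
        examples = retrieve_similar(candidate, objective)
        prompt = (f"{candidate} does not satisfy the goal. "
                  f"Molecules that do include: {', '.join(examples)}. Try again.")
    return None

random.seed(0)
print(optimize_molecule("CCO", "is more water-soluble"))
```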

ChemCrow augments GPT-4 with 18 chemistry tools to autonomously plan and execute syntheses, discover novel chromophores, and solve diverse chemical reasoning tasks.
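The core pattern is a tool-augmented agent loop: the LLM picks a tool, the tool's output is appended to the context, and the cycle repeats. The sketch below shows that pattern with two toy tools and a keyword-based planner standing in for GPT-4; it is not ChemCrow's actual prompt, tool set, or dispatch logic.

```python
def molecular_weight(smiles: str) -> str:
    # toy stand-in for an RDKit-backed property tool
    weights = {"C": "16.04 g/mol", "CCO": "46.07 g/mol"}
    return weights.get(smiles, "unknown")

def web_search(query: str) -> str:
    # toy stand-in for a literature/web search tool
    return f"(top search hit for '{query}')"

TOOLS = {"MolecularWeight": molecular_weight, "WebSearch": web_search}

def fake_planner(context: str) -> tuple[str, str]:
    """Stand-in for the GPT-4 planner: returns (tool_name, tool_input)."""
    if "weight" in context.lower() and "[MolecularWeight]" not in context:
        return "MolecularWeight", "CCO"
    return "WebSearch", context.splitlines()[0]

def run_agent(task: str, max_steps: int = 3) -> str:
    context = task
    for _ in range(max_steps):
        tool, tool_input = fake_planner(context)
        observation = TOOLS[tool](tool_input)   # execute the chosen tool
        context += f"\n[{tool}] {observation}"
    return context

print(run_agent("What is the molecular weight of ethanol?"))
```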

ChemGE uses grammatical evolution over a SMILES context-free grammar to generate diverse drug-like molecules in parallel, outperforming deep learning baselines in throughput and molecular diversity.
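At the heart of grammatical evolution is a genotype-to-phenotype mapping in which integer codons select productions of a context-free grammar, and mutation acts on the codons rather than on the strings. The toy sketch below uses a far smaller grammar than ChemGE's and omits scoring (which in practice would use RDKit or a docking tool).

```python
import random

# A tiny SMILES-like context-free grammar (much smaller than ChemGE's).
GRAMMAR = {
    "chain": [["atom"], ["atom", "chain"], ["atom", "branch", "chain"]],
    "branch": [["(", "atom", ")"]],
    "atom": [["C"], ["N"], ["O"]],
}

def genotype_to_smiles(codons, max_expansions=60):
    """Expand the leftmost nonterminal, choosing production (codon % rule_count):
    the standard grammatical-evolution decoding rule."""
    symbols, out, i = ["chain"], [], 0
    for _ in range(max_expansions):
        if not symbols:
            return "".join(out)
        sym = symbols.pop(0)
        if sym not in GRAMMAR:                      # terminal symbol
            out.append(sym)
            continue
        rules = GRAMMAR[sym]
        chosen = rules[codons[i % len(codons)] % len(rules)]
        i += 1
        symbols = chosen + symbols
    return None  # derivation did not finish within the expansion budget

def mutate(codons, rate=0.2):
    return [random.randrange(256) if random.random() < rate else c for c in codons]

random.seed(0)
for _ in range(5):
    genotype = [random.randrange(256) for _ in range(20)]
    print(genotype_to_smiles(genotype), "->", genotype_to_smiles(mutate(genotype)))
```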

Introduces Coscientist, a GPT-4-driven AI system that autonomously designs and executes chemical experiments using web search, code execution, and robotic lab automation.

DrugAssist fine-tunes Llama2-7B-Chat on over one million molecule pairs for interactive, dialogue-based molecule optimization across six molecular properties.
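Illustrative only, the snippet below shows what one dialogue-form molecule-pair sample might look like; the actual prompt templates, property phrasing, and answer format in DrugAssist's training data differ and come from its curated pairs.

```python
# Hypothetical chat-format training sample; not taken from the DrugAssist dataset.
sample = {
    "messages": [
        {"role": "user",
         "content": ("Optimize the molecule CC(=O)Oc1ccccc1C(=O)O to increase its "
                     "water solubility while keeping the scaffold. Reply with SMILES only.")},
        {"role": "assistant",
         "content": "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"},
    ]
}
print(sample["messages"][0]["content"])
```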

Jablonka et al. show that fine-tuning GPT-3 on natural language chemistry questions achieves competitive or superior performance to dedicated ML models across 15 benchmarks, with particular strength in low-data settings and inverse molecular design.
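In this framing a property task becomes plain text completion. Below is a hedged illustration of two prompt-completion records in the legacy GPT-3 fine-tuning JSONL style; the question templates are invented for illustration, not copied from the paper's benchmarks.

```python
import json

# Invented examples of chemistry Q&A records in prompt-completion form.
records = [
    {"prompt": "Is the molecule with SMILES CCO soluble in water?###",
     "completion": " yes END"},
    {"prompt": "Does the molecule with SMILES c1ccccc1 have a large HOMO-LUMO gap?###",
     "completion": " yes END"},
]

with open("chem_finetune.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```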

Galactica trains a decoder-only Transformer on a curated 106B-token scientific corpus spanning papers, proteins, and molecules, achieving strong results on scientific QA, mathematical reasoning, and citation prediction.

LlaSMol fine-tunes Mistral, Llama 2, and other open-source LLMs on SMolInstruct, a 3.3M-sample instruction tuning dataset covering 14 chemistry tasks. The Mistral-based model outperforms GPT-4 and Claude 3 Opus across all tasks.
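To make the dataset shape concrete, here is a hypothetical sample for one of the 14 tasks (forward reaction prediction); the field names and templates are assumptions, not SMolInstruct's actual schema.

```python
# Hypothetical SMolInstruct-style sample; the real dataset defines its own
# task templates, special tags, and output formats.
sample = {
    "task": "forward_synthesis",
    "instruction": "Predict the product of the following reaction.",
    "input": "Reactants: CC(=O)Cl.OCC",
    "output": "CCOC(C)=O",
}
prompt = f"{sample['instruction']}\n{sample['input']}"
print(prompt)
print("Target:", sample["output"])
```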

MolFM pre-trains a multimodal encoder that fuses 2D molecular graphs, biomedical text, and knowledge graph entities through fine-grained cross-modal attention, achieving strong gains on cross-modal retrieval, molecule captioning, text-based generation, and property prediction.
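A minimal PyTorch sketch of cross-modal attention between text tokens and molecule-graph node embeddings follows; it illustrates only the fusion mechanism, not MolFM's actual architecture or its knowledge-graph branch.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One fusion block: text tokens attend over molecular-graph node embeddings."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, graph_nodes: torch.Tensor) -> torch.Tensor:
        # queries come from text, keys/values from the molecular graph
        fused, _ = self.attn(text_tokens, graph_nodes, graph_nodes)
        return self.norm(text_tokens + fused)  # residual connection + layer norm

# toy shapes: batch of 2, 16 text tokens, 10 graph nodes, hidden size 256
text = torch.randn(2, 16, 256)
graph = torch.randn(2, 10, 256)
print(CrossModalFusion()(text, graph).shape)  # torch.Size([2, 16, 256])
```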