Cheminformatics

PharmaGPT two-stage training from domain continued pretraining to weighted supervised fine-tuning with RLHF

PharmaGPT: Domain-Specific LLMs for Pharma and Chemistry

PharmaGPT is a suite of domain-specific LLMs (13B and 70B parameters) built on LLaMA with continued pretraining on biopharmaceutical and chemical data, achieving strong results on NAPLEX and Chinese pharmacist exams.

Computational Chemistry

ReactionT5 two-stage pretraining from CompoundT5 to ReactionT5 with product prediction and yield results

ReactionT5: Pre-trained T5 for Reaction Prediction

ReactionT5 introduces a two-stage pretraining pipeline (compound then reaction) on the Open Reaction Database, enabling competitive product and yield prediction with as few as 30 fine-tuning reactions.

Computational Chemistry

REINVENT pipeline showing Prior, Agent, and Scoring Function with augmented likelihood equation

REINVENT: Reinforcement Learning for Molecular Design

REINVENT is a policy-based reinforcement learning method that fine-tunes an RNN pre-trained on ChEMBL SMILES to generate molecules with specified desirable properties. Its augmented episodic likelihood anchors the agent to its prior while optimizing a user-defined scoring function.
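
The augmented likelihood can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the commonly cited form in which the target log-likelihood is the prior log-likelihood plus a scaled score, and the agent is trained on the squared difference from that target. The function names and the `sigma` value are illustrative assumptions.

```python
def augmented_log_likelihood(prior_logp: float, score: float, sigma: float) -> float:
    """Target log-likelihood: the prior's log-likelihood shifted by the scaled score."""
    return prior_logp + sigma * score


def reinvent_loss(agent_logp: float, prior_logp: float, score: float,
                  sigma: float = 4.0) -> float:
    """Squared gap between the augmented target and the agent's log-likelihood.

    agent_logp and prior_logp are the summed token log-probabilities of one
    sampled SMILES under the agent and the frozen prior. sigma (assumed value)
    trades off the scoring function against staying close to the prior.
    """
    target = augmented_log_likelihood(prior_logp, score, sigma)
    return (target - agent_logp) ** 2
```

With a score of zero the target reduces to the prior log-likelihood, so the loss only pulls the agent back toward the prior. This is the anchoring behavior the summary describes.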

Computational Chemistry

Three-stage progression from task-specific transformers through multimodal models to LLM chemistry agents

Transformers and LLMs for Chemistry and Drug Discovery

A review chapter tracing three stages of transformer adoption in chemistry: task-specific single-modality models (reaction prediction, retrosynthesis), multimodal approaches bridging spectra and text, and LLM-powered agents like ChemCrow for general chemical reasoning.

Computational Chemistry

Diagram showing dual-view molecule pre-training with a SMILES Transformer branch and a GNN branch connected by a consistency loss

DMP: Dual-View Molecule Pre-training (SMILES+GNN)

DMP combines a SMILES Transformer and a GNN branch during pre-training, using masked language modeling plus a BYOL-inspired dual-view consistency loss to learn complementary molecular representations.
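
A BYOL-style consistency term between the two views might look like the sketch below. It assumes a symmetric negative cosine similarity between each branch's predicted vector and the other branch's target vector. In a real training loop the target side would be detached from the gradient (stop-gradient), which plain Python cannot express. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_view_consistency(pred_smiles, target_gnn, pred_gnn, target_smiles):
    """Symmetric consistency loss across the SMILES and graph views.

    pred_* are prediction-head outputs of one branch; target_* are the other
    branch's projections (treated as constants, i.e. stop-gradient, in a real
    autodiff framework). The loss is lowest (-2) when both pairs are aligned.
    """
    return -cosine(pred_smiles, target_gnn) - cosine(pred_gnn, target_smiles)
```

During pre-training this term would be added to the masked language modeling losses of both branches.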

Computational Chemistry

Bar chart comparing MAT average ROC-AUC against D-MPNN, GCN, and Weave baselines

MAT: Graph-Augmented Transformer for Molecules (2020)

Molecule Attention Transformer (MAT) augments Transformer self-attention with inter-atomic distances and graph adjacency, achieving strong property prediction across diverse molecular tasks with minimal hyperparameter tuning after self-supervised pretraining.
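
The way distance and adjacency information can be mixed into self-attention is sketched below for a single head. This is a simplified illustration under stated assumptions: the attention weights are a weighted sum of the usual softmax attention, a row-wise softmax over negative inter-atomic distances, and the raw adjacency matrix. The mixing weights `lam_attn`, `lam_dist`, `lam_adj` and the choice of distance transform are assumptions, not values taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mat_attention_weights(scores, distances, adjacency,
                          lam_attn=0.4, lam_dist=0.3, lam_adj=0.3):
    """Blend three atom-by-atom matrices into one attention-weight matrix.

    scores    : scaled query-key dot products (QK^T / sqrt(d)), n x n
    distances : inter-atomic distances, n x n
    adjacency : molecular graph adjacency (0/1), n x n
    The result multiplies the value matrix as in standard attention.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        attn = softmax(scores[i])
        dist = softmax([-d for d in distances[i]])  # closer atoms get more weight
        weights.append([
            lam_attn * attn[j] + lam_dist * dist[j] + lam_adj * adjacency[i][j]
            for j in range(n)
        ])
    return weights
```

Setting `lam_dist` and `lam_adj` to zero recovers ordinary softmax attention, which makes the contribution of the structural terms easy to ablate.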

Computational Chemistry

Bar chart showing RMSE improvement from SMILES augmentation across ESOL, FreeSolv, and lipophilicity datasets

Maxsmi: SMILES Augmentation for Property Prediction

A systematic study of SMILES augmentation strategies for molecular property prediction, showing that augmentation consistently improves CNN and RNN performance and that prediction variance across SMILES correlates with model uncertainty.
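
The test-time side of this idea can be sketched as follows: predict on several equivalent SMILES strings of the same molecule, report the mean as the prediction, and use the spread as an uncertainty signal. The sketch assumes the per-SMILES predictions are already available. The function name and the use of the population standard deviation are illustrative choices, not necessarily those of the study.

```python
from statistics import mean, pstdev


def aggregate_over_smiles(predictions):
    """Combine per-SMILES predictions for one molecule.

    predictions: model outputs for different valid SMILES strings of the same
    compound (for ethanol, e.g. "CCO", "OCC", "C(O)C").
    Returns (mean prediction, standard deviation across the variants).
    """
    return mean(predictions), pstdev(predictions)


# Hypothetical model outputs for three SMILES variants of one molecule.
prediction, spread = aggregate_over_smiles([-0.30, -0.25, -0.35])
```

A large spread means the model disagrees with itself across representations of the same molecule, which is the signal the study relates to prediction uncertainty.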

Computational Chemistry

Bar chart comparing MG-BERT vs GNN baselines on six MoleculeNet classification tasks

MG-BERT: Graph BERT for Molecular Property Prediction

MG-BERT combines GNN-style local attention with BERT’s masked pretraining on molecular graphs, learning context-sensitive atomic representations that improve ADMET property prediction across 11 benchmark datasets.

Computational Chemistry

Bar chart comparing Mol2vec ESOL RMSE against ECFP4, MACCS, and Neural Fingerprint baselines

Mol2vec: Unsupervised ML with Chemical Intuition

Mol2vec treats molecular substructures as words and compounds as sentences, training Word2vec on 19.9M molecules to produce dense embeddings that capture chemical intuition and enable competitive property prediction.
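
The "compound as sentence" step can be illustrated with a short sketch: once each substructure identifier has an embedding, a compound vector is obtained by summing the vectors of its substructures. The embedding table below is a toy stand-in for a trained Word2vec model, and the identifiers and the fallback to an `"UNK"` vector for unseen substructures are assumptions made for illustration.

```python
def compound_vector(identifiers, embeddings, unk="UNK"):
    """Sum substructure embeddings into one fixed-length compound vector.

    identifiers: substructure identifiers for one molecule (the "sentence")
    embeddings : mapping from identifier to vector (the trained "word" vectors)
    Identifiers missing from the table fall back to the unk vector.
    """
    dim = len(embeddings[unk])
    total = [0.0] * dim
    for ident in identifiers:
        vec = embeddings.get(ident, embeddings[unk])
        total = [t + v for t, v in zip(total, vec)]
    return total


# Toy two-dimensional embeddings with hypothetical identifiers.
toy_embeddings = {"sub_a": [1.0, 0.0], "sub_b": [0.0, 2.0], "UNK": [0.5, 0.5]}
vector = compound_vector(["sub_a", "sub_b", "sub_new"], toy_embeddings)
```

The resulting vector can then be fed to any standard regressor or classifier for property prediction.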

Computational Chemistry

Bar chart showing MTL-BERT combining pretraining, multitask learning, and SMILES enumeration for best improvement

MTL-BERT: Multitask BERT for Property Prediction

MTL-BERT pretrains a BERT model on 1.7M unlabeled SMILES, then fine-tunes jointly on 60 ADMET and molecular property tasks using SMILES enumeration as data augmentation in all phases.

Computational Chemistry

Bar chart comparing AlphaDrug docking scores against known ligands across five protein targets

AlphaDrug: MCTS-Guided Target-Specific Drug Design

AlphaDrug generates drug candidates for specific protein targets by combining an Lmser Transformer (with hierarchical encoder-decoder skip connections) and Monte Carlo tree search guided by docking scores, achieving higher binding affinities than known ligands on 86% of test proteins.

Computational Chemistry

Bar chart comparing SMILES tokens vs Atom-in-SMILES across molecular generation, retrosynthesis, and reaction prediction

Atom-in-SMILES: Better Tokens for Chemical Models

Atom-in-SMILES (AIS) is a tokenization scheme that encodes each atom's local chemical environment into its SMILES token, improving prediction quality across canonicalization, retrosynthesis, and property prediction tasks.