Cheminformatics

Bar chart comparing small and big foundation models surveyed across property prediction, MLIPs, inverse design, and multi-domain chemistry applications

Foundation Models in Chemistry: A 2025 Perspective

This perspective from Choi et al. reviews foundation models in chemistry, categorizing them as ‘small’ (domain-specific, e.g., property prediction, MLIPs, inverse design) and ‘big’ (multi-domain, e.g., multimodal and LLM-based). It surveys pretraining strategies and key architectures (GNNs and language models), and outlines future directions for scaling, efficiency, and interpretability.

Computational Chemistry

Taxonomy diagram showing four generative model families (VAE, GAN, Diffusion, Flow) connecting to small molecule generation and protein generation subtasks

Generative AI Survey for De Novo Molecule and Protein Design

This survey organizes generative AI for de novo drug design into two themes: small molecule generation (target-agnostic, target-aware, conformation) and protein generation (structure prediction, sequence generation, backbone design, antibody, peptide). It covers four generative model families (VAEs, GANs, diffusion, flow-based), catalogs key datasets and benchmarks, and provides 12 comparative benchmark tables across all subtasks.

Computational Chemistry

Bar chart comparing Group SELFIES vs SELFIES on MOSES benchmark metrics

Group SELFIES: Fragment-Based Molecular Strings

Group SELFIES extends SELFIES with group tokens representing functional groups and substructures, maintaining chemical robustness while improving distribution learning and molecular generation quality.

Computational Chemistry

Schematic of inverse molecular design paradigm mapping desired properties to molecular structures through generative models

Inverse Molecular Design with ML Generative Models

This foundational review surveys how deep generative models (VAEs, GANs, reinforcement learning) enable inverse molecular design, covering molecular representations, chemical space navigation, and applications from drug discovery to materials engineering.

Computational Chemistry

Bar chart showing Lingo3DMol achieves best Vina docking scores on DUD-E compared to five baselines

Lingo3DMol: Language Model for 3D Molecule Design

Lingo3DMol introduces FSMILES, a fragment-based SMILES representation with local and global coordinates, to generate drug-like 3D molecules in protein pockets via a transformer language model.

Computational Chemistry

Schematic of Link-INVENT architecture showing encoder-decoder RNN with reinforcement learning scoring loop

Link-INVENT: RL-Driven Molecular Linker Generation

Link-INVENT is an RNN-based generative model for molecular linker design that uses reinforcement learning with a flexible scoring function, demonstrated on fragment linking, scaffold hopping, and PROTAC design.

Computational Chemistry

Diagram showing the CaR pipeline from SMILES to ChatGPT-generated captions to fine-tuned RoBERTa predictions

LLM4Mol: ChatGPT Captions as Molecular Representations

This work proposes Captions as Representations (CaR), in which ChatGPT generates textual explanations for SMILES strings that are then used to fine-tune small language models for molecular property prediction.

Computational Chemistry

Bar chart showing language model validity rates across XYZ, CIF, and PDB 3D chemical file formats

LMs Generate 3D Molecules from XYZ, CIF, PDB Files

This work demonstrates that standard transformer language models, trained with next-token prediction on sequences from XYZ, CIF, and PDB files, can generate valid 3D molecules, crystals, and protein binding sites, with results competitive with domain-specific 3D generative models.
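To make the idea of next-token prediction on 3D file formats concrete, here is a minimal sketch of how an XYZ file might be flattened into a token sequence and turned into (context, target) training pairs. The tokenization scheme (one token per element symbol and per rounded coordinate, plus hypothetical `<bos>`/`<eos>` markers) is an assumption made for illustration and is not taken from the paper; `xyz_to_tokens` and `next_token_pairs` are illustrative names.

```python
def xyz_to_tokens(xyz_text, decimals=2):
    """Flatten XYZ-format text into a list of string tokens.

    Assumed scheme (illustrative only): element symbol, then x, y, z
    rounded to `decimals` places, then a newline token per atom.
    """
    lines = xyz_text.strip().splitlines()
    n_atoms = int(lines[0].strip())  # first line: atom count
    # lines[1] is the free-form comment line and is skipped here
    tokens = ["<bos>"]
    for line in lines[2:2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        tokens.append(element)
        for value in (x, y, z):
            tokens.append(f"{float(value):.{decimals}f}")
        tokens.append("\n")
    tokens.append("<eos>")
    return tokens


def next_token_pairs(tokens):
    """Pair each token with the token that follows it (the prediction target)."""
    return list(zip(tokens[:-1], tokens[1:]))


WATER_XYZ = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.467
H 0.000 -0.757 -0.467
"""

tokens = xyz_to_tokens(WATER_XYZ)
pairs = next_token_pairs(tokens)
```

Rounding the coordinates keeps the vocabulary finite in this sketch; how coordinates are actually discretized is a design choice that a real system would need to make deliberately.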

Computational Chemistry

Bar chart showing vision language model performance across chemistry tasks including equipment identification, molecule matching, spectroscopy, and laboratory safety

MaCBench: Multimodal Chemistry and Materials Benchmark

MaCBench evaluates frontier vision language models across 1,153 chemistry and materials science tasks spanning data extraction, experimental execution, and data interpretation, uncovering fundamental limitations in spatial reasoning and cross-modal integration.

Computational Chemistry

Bar chart showing MolBERT ablation: combining MLM, PhysChem, and SMILES equivalence tasks gives best improvement

MolBERT: Auxiliary Tasks for Molecular BERT Models

MolBERT pre-trains a BERT model on SMILES strings using masked language modeling, SMILES equivalence, and physicochemical property prediction as auxiliary tasks, achieving state-of-the-art results on virtual screening and QSAR benchmarks.
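As a rough illustration of the masked language modeling objective on SMILES, the sketch below tokenizes a SMILES string and randomly replaces tokens with a `[MASK]` symbol, keeping the original tokens as prediction targets. The regex tokenizer is a simplified assumption (bracket atoms, `Cl`/`Br`, two-digit ring closures, then single characters), and the masking omits the random-replacement and keep-as-is cases used in standard BERT; it is not MolBERT's actual preprocessing.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for illustration):
# bracket atoms, two-letter halogens, %NN ring closures, else one character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability mask_prob.

    Returns the masked sequence and a parallel list of labels: the
    original token at masked positions, None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


smiles = "C[C@H](N)C(=O)O"  # alanine
tokens = tokenize(smiles)
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

A model trained on this objective would see `masked` as input and be scored only at positions where `labels` is not `None`.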

Computational Chemistry

Diagram showing ULMFiT-style three-stage pipeline adapted for molecular property prediction

MolPMoFiT: Inductive Transfer Learning for QSAR

MolPMoFiT applies ULMFiT-style transfer learning to QSAR modeling, pre-training an AWD-LSTM on one million ChEMBL molecules and fine-tuning for property prediction on small datasets.

Computational Chemistry

Bar chart comparing nach0 vs T5-base across molecular captioning, Q/A, reaction prediction, retrosynthesis, and generation

nach0: A Multimodal Chemical and NLP Foundation Model

nach0 unifies natural language and SMILES-based chemical tasks in a single encoder-decoder model, achieving competitive results across molecular property prediction, reaction prediction, molecular generation, and biomedical NLP benchmarks.