Molecular Representation

Sample efficiency curves showing different molecular optimization algorithm families converging at different rates under a fixed oracle budget

PMO: Benchmarking Sample-Efficient Molecular Design

A large-scale benchmark of 25 molecular optimization methods on 23 oracles under constrained oracle budgets, showing that sample efficiency is a critical and often neglected dimension of evaluation.

Computational Chemistry
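The fixed oracle budget in the PMO entry above can be pictured as a counter wrapped around the scoring function. The following is a minimal sketch, not the benchmark's own code: `BudgetedOracle`, `BudgetExhausted`, and `toy_score` are hypothetical names, the toy score stands in for a real property oracle, and not charging repeated queries is an assumption made here for illustration.

```python
class BudgetExhausted(Exception):
    """Raised when the optimizer asks for more oracle calls than allowed."""


class BudgetedOracle:
    """Wrap a scoring function and count calls on unique molecules (hypothetical helper)."""

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.cache = {}      # molecule string -> score
        self.history = []    # best score seen after each new oracle call

    def __call__(self, molecule):
        if molecule in self.cache:       # assumption: repeated queries are free
            return self.cache[molecule]
        if len(self.cache) >= self.budget:
            raise BudgetExhausted(f"budget of {self.budget} calls used up")
        score = self.score_fn(molecule)
        self.cache[molecule] = score
        best_so_far = max(score, self.history[-1]) if self.history else score
        self.history.append(best_so_far)
        return score


def toy_score(smiles):
    # Stand-in for a real oracle: fraction of characters that are carbon.
    return smiles.count("C") / len(smiles)


oracle = BudgetedOracle(toy_score, budget=3)
for candidate in ["CCO", "CCO", "CCCC", "CN"]:
    oracle(candidate)

print(oracle.history)  # best-so-far curve over the unique oracle calls
```

Plotting `history` against the call index gives a sample-efficiency curve of the kind the entry describes: two methods that reach the same final score can differ sharply in how many calls they needed to get there.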

Taxonomy of molecular representation learning foundation models organized by input modality

Review of Molecular Representation Learning Models

A comprehensive survey classifying molecular representation learning foundation models by input modality (sequence, graph, 3D, image, multimodal) and analyzing four pretraining paradigms for drug discovery tasks.

Computational Chemistry

Visualization of STONED algorithm generating a local chemical subspace around a seed molecule through SELFIES string mutations, with a chemical path shown between two endpoints

STONED: Training-Free Molecular Design with SELFIES

STONED introduces simple string manipulation algorithms on SELFIES for molecular design, achieving results competitive with deep generative models while requiring no training data or GPU resources.

Computational Chemistry
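The string mutations behind the STONED entry above can be sketched with token-level edits. This is a toy illustration only, not the authors' implementation: the five-token `ALPHABET` and the `mutate` helper are assumptions, and the sketch never decodes the strings to molecules (in practice a SELFIES library would handle that step).

```python
import random
import re

# Toy alphabet of SELFIES-style tokens (an assumption; real alphabets are larger).
ALPHABET = ["[C]", "[N]", "[O]", "[=C]", "[F]"]


def split_tokens(selfies_string):
    """Split a SELFIES-style string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def mutate(selfies_string, rng):
    """Apply one random point mutation: replace, insert, or delete a token."""
    tokens = split_tokens(selfies_string)
    action = rng.choice(["replace", "insert", "delete"])
    if action == "delete" and len(tokens) > 1:
        del tokens[rng.randrange(len(tokens))]
    elif action == "insert":
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(ALPHABET))
    else:  # replace (also the fallback when deletion would empty the string)
        tokens[rng.randrange(len(tokens))] = rng.choice(ALPHABET)
    return "".join(tokens)


rng = random.Random(0)
seed = "[C][C][O]"
neighbors = {mutate(seed, rng) for _ in range(20)}
print(sorted(neighbors))
```

Repeating `mutate` from a seed molecule yields a local neighborhood of candidate strings, which is the sense in which the method needs no training: the only ingredients are a token alphabet and a random number generator.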

QSPR surface roughness comparison across molecular representations, showing smooth fingerprint surfaces versus rougher pretrained model surfaces

ROGI-XD: Roughness of Pretrained Molecular Representations

This paper introduces ROGI-XD, a reformulation of the ROuGhness Index that enables fair comparison of QSPR surface roughness across molecular representations of different dimensionalities. Evaluating VAE, GIN, ChemBERTa, and ChemGPT representations, the authors show that pretrained chemical models do not produce smoother structure-property landscapes than simple molecular fingerprints or descriptors.

Computational Chemistry

Taxonomy diagram showing the three axes of MolGenSurvey: molecular representations (1D string, 2D graph, 3D geometry), generative methods (deep generative models and combinatorial optimization), and eight generation tasks (1D/2D and 3D)

MolGenSurvey: Systematic Survey of ML for Molecule Design

MolGenSurvey systematically reviews ML models for molecule design, organizing the field by molecular representation (1D/2D/3D), generative method (deep generative models vs. combinatorial optimization), and task type (8 distinct generation/optimization tasks). It catalogs over 100 methods, unifies task definitions via an input/output/goal taxonomy, and identifies key challenges including out-of-distribution generation, oracle costs, and the lack of unified benchmarks.

Computational Chemistry

BARTSmiles ablation study summary showing impact of pre-training strategies on downstream task performance

BARTSmiles: BART Pre-Training for Molecular SMILES

BARTSmiles pre-trains a BART-large model on 1.7 billion SMILES strings from ZINC20 and achieves the best reported results on 11 classification, regression, and generation benchmarks.

Computational Chemistry

Three distribution plots showing RNN language models closely matching training distributions across peaked, multi-modal, and large-scale molecular generation tasks while graph models fail

Language Models Learn Complex Molecular Distributions

This study benchmarks RNN-based chemical language models against graph generative models on three challenging tasks: a distribution of molecules with high penalized LogP scores, multi-modal molecular distributions, and large-molecule generation from PubChem. The LSTM language models consistently outperform JTVAE and CGVAE.

Computational Chemistry

Diagram of the LIMO pipeline showing gradient-based reverse optimization flowing backward through a frozen property predictor and VAE decoder to optimize the latent space z

LIMO: Latent Inceptionism for Targeted Molecule Generation

LIMO combines a SELFIES-based VAE with a novel stacked property predictor architecture (decoder output as predictor input) and gradient-based reverse optimization on the latent space. It is 6-8x faster than RL baselines and 12x faster than sampling methods while generating molecules with nanomolar binding affinities, including a predicted KD of 6e-14 M against the human estrogen receptor.

Computational Chemistry
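The reverse optimization in the LIMO entry above amounts to holding the decoder and property predictor fixed and running gradient ascent on the latent vector alone. The sketch below shows that idea on a deliberately tiny stand-in, not LIMO's networks: the linear `decode`, the quadratic `predict`, the weights `W`, and `TARGET` are all assumptions chosen so the gradient can be written by hand.

```python
# Toy stand-ins (assumptions): a frozen linear "decoder" and a frozen quadratic "predictor".
W = [[1.0, 0.5], [-0.5, 1.0]]   # decoder weights, never updated
TARGET = [2.0, -1.0]            # decoded point at which the predictor peaks


def decode(z):
    return [sum(W[i][j] * z[j] for j in range(2)) for i in range(2)]


def predict(x):
    # Higher is better; the maximum value 0.0 is reached when x equals TARGET.
    return -sum((x[i] - TARGET[i]) ** 2 for i in range(2))


def grad_wrt_z(z):
    # Chain rule: d predict / d z = W^T @ (d predict / d x).
    x = decode(z)
    dx = [-2.0 * (x[i] - TARGET[i]) for i in range(2)]
    return [sum(W[i][j] * dx[i] for i in range(2)) for j in range(2)]


z = [0.0, 0.0]
start_score = predict(decode(z))
for _ in range(200):
    g = grad_wrt_z(z)
    z = [z[j] + 0.05 * g[j] for j in range(2)]  # gradient ascent on z only
final_score = predict(decode(z))
print(start_score, final_score)
```

Only `z` changes inside the loop; `W` and `TARGET` stay fixed, mirroring the frozen decoder and predictor in the pipeline diagram. With real neural networks the hand-written gradient would be replaced by automatic differentiation.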

Regression Transformer dual-masking concept showing property prediction (mask numbers) and conditional generation (mask molecules) in a single model

Regression Transformer: Prediction Meets Generation

The Regression Transformer (RT) reformulates regression as conditional sequence modelling, enabling a single XLNet-based model to both predict continuous molecular properties and generate novel molecules conditioned on desired property values.

Computational Chemistry
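The dual-masking idea in the Regression Transformer entry above can be illustrated on a single token sequence that holds both a property value and a molecule. This is a hedged sketch of the masking step only: the `<qed>` tag, the `|` separator, the per-character number tokens, and all three helper functions are hypothetical choices for illustration, and no model is involved.

```python
MASK = "[MASK]"


def build_sequence(prop_name, value, mol_tokens):
    """Concatenate property tokens and molecule tokens into one sequence."""
    number_tokens = list(f"{value:.2f}")  # one token per character, e.g. '0', '.', '7', '2'
    return [f"<{prop_name}>"] + number_tokens + ["|"] + list(mol_tokens)


def mask_property(seq):
    """Prediction mode: hide the numeric tokens, keep the molecule."""
    sep = seq.index("|")
    return [seq[0]] + [MASK] * (sep - 1) + seq[sep:]


def mask_molecule(seq, positions):
    """Generation mode: keep the property value, hide the chosen molecule tokens."""
    sep = seq.index("|")
    return [MASK if i > sep and (i - sep - 1) in positions else tok
            for i, tok in enumerate(seq)]


seq = build_sequence("qed", 0.72, ["C", "C", "O"])
print(mask_property(seq))          # numbers hidden: a regression query
print(mask_molecule(seq, {1, 2}))  # molecule partly hidden: a conditional generation query
```

Because both queries share one sequence format, a single masked-token model can in principle answer either one, which is the point the entry makes about prediction and generation living in the same model.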

Diagram of the RetMol pipeline showing input molecule and retrieval database feeding into a frozen encoder, cross-attention fusion module, and frozen decoder to produce optimized molecules with iterative refinement

RetMol: Retrieval-Based Controllable Molecule Generation

RetMol plugs a lightweight cross-attention retrieval module into a pre-trained Chemformer backbone to guide molecule generation toward multi-property design criteria. It requires no task-specific fine-tuning and works with as few as 23 exemplar molecules. It achieves 94.5% success on QED optimization, 96.9% on GSK3β/JNK3 dual inhibitor design, and an average binding affinity improvement of 2.84 kcal/mol on SARS-CoV-2 main protease inhibitor optimization.

Computational Chemistry

Diagram showing the UnCorrupt SMILES pipeline: invalid SMILES are corrected by a transformer seq2seq model into valid SMILES, with correction rates of 60-95% across generator types

UnCorrupt SMILES: Post Hoc Correction for De Novo Design

This paper trains a transformer model to correct invalid SMILES produced by de novo molecular generators (RNN, VAE, GAN). The corrector fixes 60-95% of invalid outputs, and the fixed molecules are comparable in novelty and similarity to valid generator outputs. The approach also enables local chemical space exploration by introducing and correcting errors in existing molecules.

Computational Chemistry

MolGen overview showing two-stage pre-training (molecular language syntax learning and domain-agnostic prefix tuning) and chemical feedback paradigm

MolGen: Molecular Generation with Chemical Feedback

MolGen pre-trains on 100M+ SELFIES molecules, introduces domain-agnostic prefix tuning for cross-domain transfer, and applies a chemical feedback paradigm to reduce molecular hallucinations.