Predictive Chemistry
Overview of 16 transformer models for molecular property prediction organized by architecture type

Transformers for Molecular Property Prediction Review

Sultan et al. review 16 sequence-based transformer models for molecular property prediction, systematically analyzing seven design decisions (database selection, chemical language, tokenization, positional encoding, model size, pre-training objectives, and fine-tuning strategy) and identifying a critical need for standardized evaluation practices.
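Tokenization, one of the seven design decisions the review surveys, is commonly handled with a regular expression that splits a SMILES string into chemically meaningful tokens. A minimal sketch follows; the pattern is a simplified variant of regex tokenizers common in the chemical language modeling literature, not the exact tokenizer of any reviewed model:

```python
import re

# Multi-character tokens (bracket atoms, two-letter elements like Cl/Br,
# '%NN' ring closures) must be matched before single characters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|[0-9]|[=#\-\+\(\)/\\@\.])"
)

def tokenize(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Ordering alternatives inside the regex is the key design point: matching `Cl` before `C` and `[...]` before anything else keeps bracket atoms and two-letter elements as single vocabulary items.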

Molecular Representations
Taxonomy of molecular representation learning foundation models organized by input modality

Review of Molecular Representation Learning Models

A comprehensive survey classifying molecular representation learning foundation models by input modality (sequence, graph, 3D, image, multimodal) and analyzing four pretraining paradigms for drug discovery tasks.

Predictive Chemistry
QSPR surface roughness comparison across molecular representations, showing smooth fingerprint surfaces versus rougher pretrained model surfaces

ROGI-XD: Roughness of Pretrained Molecular Representations

This paper introduces ROGI-XD, a reformulation of the ROuGhness Index that enables fair comparison of QSPR surface roughness across molecular representations of different dimensionalities. Evaluating VAE, GIN, ChemBERTa, and ChemGPT representations, the authors show that pretrained chemical models do not produce smoother structure-property landscapes than simple molecular fingerprints or descriptors.
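The notion of QSPR surface roughness can be illustrated with a crude pairwise proxy; note this is NOT the ROGI-XD metric from the paper, only the underlying intuition that rough landscapes pair small representation distances with large property differences:

```python
import math

def roughness_proxy(X, y):
    """Mean |Δproperty| / ||Δrepresentation|| over all pairs.
    An illustrative proxy only, not the ROGI-XD formulation."""
    n = len(X)
    ratios = []
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])
            if d > 0:
                ratios.append(abs(y[i] - y[j]) / d)
    return sum(ratios) / len(ratios)

# Toy 1-D example: a smooth landscape (y tracks x) vs a rough one.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
smooth = [0.0, 1.0, 2.0, 3.0]
rough = [0.0, 5.0, -5.0, 5.0]
print(roughness_proxy(X, smooth))  # 1.0
print(roughness_proxy(X, rough))   # much larger
```

The paper's contribution is precisely that a naive comparison like this is unfair across representations of different dimensionality, which ROGI-XD corrects.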

Molecular Generation
Taxonomy diagram showing the three axes of MolGenSurvey: molecular representations (1D string, 2D graph, 3D geometry), generative methods (deep generative models and combinatorial optimization), and eight generation tasks (1D/2D and 3D)

MolGenSurvey: Systematic Survey of ML for Molecule Design

MolGenSurvey systematically reviews ML models for molecule design, organizing the field by molecular representation (1D/2D/3D), generative method (deep generative models vs. combinatorial optimization), and task type (8 distinct generation/optimization tasks). It catalogs over 100 methods, unifies task definitions via an input/output/goal taxonomy, and identifies key challenges including out-of-distribution generation, oracle costs, and the lack of unified benchmarks.

Predictive Chemistry
Activity cliffs benchmark showing method rankings by RMSE on cliff compounds, with SVM plus ECFP outperforming deep learning approaches

Exposing Limitations of Molecular ML with Activity Cliffs

This paper benchmarks 24 machine and deep learning methods on activity cliff compounds (structurally similar molecules with large potency differences) across 30 macromolecular targets. Traditional ML with molecular fingerprints consistently outperforms graph neural networks and SMILES-based transformers on these challenging cases, especially in low-data regimes.
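The paper's definition of an activity cliff, a pair of structurally similar molecules with a large potency difference, can be sketched directly. The thresholds and the set-based Tanimoto similarity below are illustrative assumptions, not the benchmark's exact criteria (which use fingerprint, scaffold, and substructure similarity):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two feature sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def activity_cliffs(mols, sim_thresh=0.9, pot_thresh=2.0):
    """Find pairs with high structural similarity and a large potency
    gap (e.g. in pKi units). mols: list of (name, feature_set, potency).
    Thresholds are hypothetical, for illustration only."""
    cliffs = []
    for i in range(len(mols)):
        for j in range(i + 1, len(mols)):
            (na, fa, pa), (nb, fb, pb) = mols[i], mols[j]
            if tanimoto(fa, fb) >= sim_thresh and abs(pa - pb) >= pot_thresh:
                cliffs.append((na, nb))
    return cliffs

mols = [
    ("A", {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 8.5),
    ("B", {1, 2, 3, 4, 5, 6, 7, 8, 9}, 5.0),  # near-identical to A, much weaker
    ("C", {20, 21, 22}, 8.4),                  # unrelated scaffold
]
print(activity_cliffs(mols))  # [('A', 'B')]
```

Such pairs are hard for any model that assumes smooth structure-activity relationships, which is why they expose the limitations the paper studies.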

Machine Learning
Graph network block diagram showing input graph transformed through edge, node, and global update steps to produce an updated graph

Relational Inductive Biases in Deep Learning (2018)

Battaglia et al. argue that combinatorial generalization requires structured representations, systematically analyze the relational inductive biases in standard deep learning architectures (MLPs, CNNs, RNNs), and present the graph network as a unifying framework that generalizes and extends prior graph neural network approaches.
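The edge, node, and global update steps in the figure can be sketched as one graph network (GN) block. The update functions here are plain sums for illustration; in the framework they are arbitrary (typically learned) functions:

```python
# Minimal sketch of one GN block in the spirit of Battaglia et al. (2018):
# edge update, then node update over aggregated incoming edges, then
# global update. Attributes are scalars here for simplicity.

def gn_block(nodes, edges, u):
    """nodes: {id: float}, edges: {(src, dst): float}, u: global float."""
    # 1. Edge update: each edge sees its endpoint attributes and the global.
    new_edges = {
        (s, d): e + nodes[s] + nodes[d] + u
        for (s, d), e in edges.items()
    }
    # 2. Node update: each node aggregates its incoming updated edges.
    new_nodes = {}
    for n, v in nodes.items():
        incoming = sum(e for (s, d), e in new_edges.items() if d == n)
        new_nodes[n] = v + incoming + u
    # 3. Global update: aggregate over all updated edges and nodes.
    new_u = u + sum(new_edges.values()) + sum(new_nodes.values())
    return new_nodes, new_edges, new_u

nodes, edges, u = {"a": 1.0, "b": 2.0}, {("a", "b"): 0.5}, 0.0
print(gn_block(nodes, edges, u))
```

The paper's point is that this structure (shared per-edge and per-node functions plus permutation-invariant aggregation) is itself a relational inductive bias, analogous to weight sharing in CNNs.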

Machine Learning
Log-log plot comparing scaling laws across six architectures showing the vanilla Transformer has the steepest slope

Scaling Laws vs Model Architectures: Inductive Bias

Tay et al. systematically compare scaling laws across ten diverse architectures (Transformers, Switch Transformers, Performers, MLP-Mixers, and others), finding that the vanilla Transformer has the best scaling coefficient and that the best-performing architecture changes across compute regions.
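The scaling coefficients being compared are the slopes of power-law fits of loss against compute. A minimal sketch of such a fit via least squares in log-log space (synthetic data; not the paper's measurements):

```python
import math

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**b by ordinary least squares in
    log-log space; returns (a, b). b is the scaling slope that
    plots like the one above compare across architectures."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic data lying exactly on loss = 10 * C**-0.5
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10 * c ** -0.5 for c in compute]
a, b = fit_power_law(compute, loss)  # recovers a ≈ 10, b ≈ -0.5
```

A steeper (more negative) fitted `b` means loss falls faster with added compute, which is the sense in which the vanilla Transformer "scales best" in the study.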

Optical Chemical Structure Recognition

Review of OCSR Techniques and Models (Musazade 2022)

This systematization paper traces the history of OCSR, comparing early rule-based systems like OSRA with modern deep learning approaches like DECIMER. It highlights the shift from image classification to image captioning and identifies critical gaps in dataset standardization and evaluation metrics.

Optical Chemical Structure Recognition

A Review of Optical Chemical Structure Recognition Tools

This paper reviews three decades of OCSR development, tracing the transition from rule-based heuristics to early deep learning approaches. It includes a benchmark study comparing the performance of three open-source tools (OSRA, Imago, MolVec) on four diverse datasets.

Molecular Simulation
Embedding energy and effective charge functions for Ni and Pd from the original EAM paper

Embedded-Atom Method User Guide: Voter's 1994 Chapter

This 1994 handbook chapter serves as a practical user guide for the Embedded-Atom Method (EAM). It details the theoretical derivation from density-functional theory, synthesizes related methods like the Glue Model, and provides a complete tutorial on fitting potentials, illustrated with a specific implementation for the Ni-Al-B system.

Molecular Simulation
Embedding energy and effective charge functions for Ni and Pd from the original EAM paper

Embedded-Atom Method: Theory and Applications Review

This 1993 review systematizes the Embedded-Atom Method (EAM) as a practical semi-empirical approach for metallic systems. It synthesizes theory, applications, and connections to related methods while addressing the limitations of pair potentials.
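The EAM energy functional that both EAM entries revolve around takes the standard form (symbols as commonly defined in this literature):

```latex
E_{\mathrm{tot}} = \sum_i F_i\!\left(\bar{\rho}_i\right)
  + \frac{1}{2} \sum_{i \neq j} \phi_{ij}(r_{ij}),
\qquad
\bar{\rho}_i = \sum_{j \neq i} \rho_j(r_{ij})
```

where $F_i$ is the embedding energy of atom $i$ in the host electron density $\bar{\rho}_i$, $\phi_{ij}$ is a pairwise interaction, and $\rho_j$ is the density contributed by neighbor $j$. The embedding term is what lets EAM capture many-body metallic bonding that pure pair potentials miss.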

Evolutionary Biology
Electron microscope image of Pyrolobus fumarii showing irregular coccoid cell structure

Three Domains of Life: Woese's Phylogenetic Revolution

This paper established the three-domain classification system (Bacteria, Archaea, Eucarya) based on molecular evidence from ribosomal RNA sequences, arguing that the prokaryote-eukaryote dichotomy obscures the deep evolutionary divergence of Archaea from Bacteria.