Computational Chemistry
Optical chemical structure recognition example

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end OCSR system addressing both technical and data challenges, introducing MolParser-7M (7M+ image-text pairs) and MolDet (YOLO-based detector) for extracting and recognizing molecular structures from real-world documents with diverse quality and styles.

Computational Chemistry
ZINC-22 Tranche Browser showing molecular count distribution

ZINC-22: Multi-Billion Scale Database

ZINC-22 is the world’s largest freely available database of commercially available compounds, containing over 37 billion make-on-demand molecules with sophisticated search capabilities and cloud-scale infrastructure designed for modern virtual screening campaigns.

Computational Chemistry
Aspirin molecular structure generated from SMILES string

Converting SMILES and SELFIES to 2D Molecular Images

Build a robust Python CLI tool that converts both SMILES and SELFIES notation into publication-quality 2D molecular images, complete with formulas and legends.

Computational Chemistry
SELFIES representation of 2-Fluoroethenimine molecule

SELFIES (Self-Referencing Embedded Strings)

An in-depth overview of SELFIES, the 100% robust molecular string representation designed to overcome SMILES limitations in machine learning, where every possible string (even random ones) decodes to a valid molecule through local operations, customizable valence rules, and graph-based internal representations.

Computational Chemistry
MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Representation & Conformers

MARCEL provides a comprehensive benchmark for molecular representation learning with 722K+ conformers across four diverse subsets (Drugs-75K, Kraken, EE, BDE), enabling evaluation of conformer ensemble methods for property prediction in drug discovery and catalysis.

Computational Chemistry
Benzene molecule with SMILES notation

SMILES: Compact Notation for Chemical Structures

Comprehensive overview of SMILES notation for chemical structures, covering syntax for atoms, bonds, branches, rings, and stereochemistry, plus its key limitations for machine learning.

Computational Chemistry
Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40

The Number of Isomeric Hydrocarbons of the Methane Series

A foundational 1931 paper that derives exact mathematical laws for counting alkane structural isomers through recursive formulas, correcting historical errors and establishing validated benchmark counts up to C₄₀.

Computational Chemistry
GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations

GEOM contains 450k+ molecules with 37M+ conformations, featuring energy annotations from semi-empirical (GFN2-xTB) and DFT methods for property prediction and molecular generation research.

Computational Chemistry
GDB-11 molecule structure showing FC1C2OC1c3c(F)coc23

GDB-11: Chemical Universe Database (26.4M Molecules)

GDB-11 contains 26.4 million systematically generated small organic molecules with up to 11 atoms, establishing the methodology for exploring drug-like chemical space computationally.

Computational Chemistry
Spherical harmonics visualization

Efficient DFT Hamiltonian Prediction via Adaptive Sparsity

ICML 2025 methodological paper introducing SPHNet, which uses adaptive network sparsification to overcome the computational bottleneck of tensor products in SE(3)-equivariant networks, achieving 2-5x speedup while maintaining accuracy on DFT Hamiltonian prediction tasks.

Computational Chemistry
Schematic showing atom-surface interaction using the method of images

Adsorption and Diffusion on Surfaces

Lennard-Jones’s landmark 1932 theoretical paper that applied quantum mechanical potential energy surfaces to gas-solid interactions, providing the first unified framework explaining both physisorption and chemisorption as different regions of the same energy landscape.

Computational Chemistry
GDB-13 molecule structure showing CCCC(O)(CO)CC1CC1CN

GDB-13: Chemical Universe Database (970M Molecules)

GDB-13 contains nearly 1 billion systematically generated small organic molecules with up to 13 atoms, achieving billion-scale chemical space exploration while maintaining drug-like properties.