Computational Chemistry
Optical chemical structure recognition example

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end optical chemical structure recognition (OCSR) system that addresses both technical and data challenges. It introduces MolParser-7M, a dataset of more than 7 million image-text pairs, and MolDet, a YOLO-based detector, to extract and recognize molecular structures from real-world documents of varying quality and style.

Computational Chemistry
ZINC-22 Tranche Browser showing molecular count distribution

ZINC-22: Multi-Billion Scale Database

ZINC-22 is the world’s largest free database of commercially available compounds. It contains over 37 billion make-on-demand molecules, with search tools and cloud-scale infrastructure designed for modern virtual screening campaigns.

Computational Chemistry
Aspirin molecular structure generated from SMILES string

Converting SMILES and SELFIES to 2D Molecular Images

Build a robust Python CLI tool that converts both SMILES and SELFIES notation into publication-quality 2D molecular images, complete with formulas and legends.

Computational Chemistry
SELFIES representation of 2-Fluoroethenimine molecule

SELFIES (Self-Referencing Embedded Strings)

An in-depth overview of SELFIES, the 100% robust molecular string representation designed to overcome the limitations of SMILES in machine learning. Every possible string, even a random one, decodes to a valid molecule, thanks to local operations, customizable valence rules, and graph-based internal representations.

Machine Learning Fundamentals
Sphere packing illustration showing Shannon's geometric interpretation of channel capacity

Communication in the Presence of Noise: Shannon's 1949 Paper

Shannon’s foundational 1949 paper establishing the mathematical framework for modern information theory, defining channel capacity as the fundamental limit for reliable communication over noisy channels and introducing the sampling theorem (Nyquist-Shannon) that underpins all digital signal processing.
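
As a quick illustration of the capacity limit mentioned above, the sketch below evaluates the Shannon-Hartley formula C = W log2(1 + P/N) for a band-limited channel with white Gaussian noise. The function names and the 3 kHz / 30 dB figures are illustrative assumptions, not values taken from the post.

```python
import math


def db_to_linear(snr_db):
    """Convert a signal-to-noise ratio from decibels to a plain power ratio."""
    return 10 ** (snr_db / 10)


def channel_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity in bits per second: C = W * log2(1 + P/N)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


# Assumed example: a 3 kHz channel at 30 dB SNR (hypothetical numbers).
capacity = channel_capacity(3000, db_to_linear(30))
print(f"{capacity:.0f} bit/s")
```

At an SNR of 1 (0 dB), the capacity equals the bandwidth in bits per second, which is a handy sanity check.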

Computational Biology
Protein folding energy landscape funnel showing high-energy unfolded states converging to the native state

How to Fold Graciously: The Levinthal Paradox

Levinthal’s 1969 perspective paper defined the protein folding paradox by demonstrating the impossibility of random search, establishing the need for kinetic pathways that guide folding faster than thermodynamic equilibration allows.

Computational Chemistry
MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Representation & Conformers

MARCEL provides a comprehensive benchmark for molecular representation learning with 722K+ conformers across four diverse subsets (Drugs-75K, Kraken, EE, BDE), enabling evaluation of conformer ensemble methods for property prediction in drug discovery and catalysis.

Computational Chemistry
Müller-Brown Potential Energy Surface showing the three minima and two saddle points

Müller-Brown Potential

A two-dimensional analytical potential energy surface introduced in 1979 that has become a standard benchmark for testing optimization algorithms, featuring three minima connected by two saddle points and challenging transition pathways that mirror real chemical reaction landscapes.
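
For readers who want to experiment, here is a minimal sketch of the surface as a sum of four two-dimensional Gaussian-like terms, using the parameter set commonly quoted for the Müller-Brown potential. The function name and the approximate minima coordinates are supplied here for illustration and do not come from the post.

```python
import math

# Commonly quoted Müller-Brown parameters, one entry per term.
A = (-200.0, -100.0, -170.0, 15.0)
a = (-1.0, -1.0, -6.5, 0.7)
b = (0.0, 0.0, 11.0, 0.6)
c = (-10.0, -10.0, -6.5, 0.7)
X0 = (1.0, 0.0, -0.5, -1.0)
Y0 = (0.0, 0.5, 1.5, 1.0)


def muller_brown(x, y):
    """Evaluate V(x, y) = sum_k A_k * exp(a_k dx^2 + b_k dx dy + c_k dy^2)."""
    total = 0.0
    for k in range(4):
        dx, dy = x - X0[k], y - Y0[k]
        total += A[k] * math.exp(a[k] * dx * dx + b[k] * dx * dy + c[k] * dy * dy)
    return total


# Approximate locations of the three minima (rounded to three decimals).
for point in [(-0.558, 1.442), (0.623, 0.028), (-0.050, 0.467)]:
    print(point, round(muller_brown(*point), 2))
```

The first point is the deepest minimum; an optimizer started elsewhere on the surface has to cross a saddle region to reach it, which is what makes the surface a useful test case.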

Computational Chemistry
Benzene molecule with SMILES notation

SMILES: Compact Notation for Chemical Structures

Comprehensive overview of SMILES notation for chemical structures, covering syntax for atoms, bonds, branches, rings, and stereochemistry, plus its key limitations for machine learning.
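
To make the syntax elements concrete, the sketch below splits a SMILES string into atom, bond, branch, and ring-closure tokens with a single regular expression. It is a simplified illustration that covers only a common subset of the notation and does no chemical validation; the pattern and the `tokenize` name are assumptions made for this example.

```python
import re

# Bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# wildcard, two-digit ring closures, single-digit ring closures, bonds/branches.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*|%\d{2}|\d|[-=#$:/\\.()]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; raise ValueError on unknown input."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = SMILES_TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at index {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


print(tokenize("c1ccccc1"))               # benzene
print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Listing `Br` and `Cl` before the single-letter atoms matters: otherwise `Cl` would be read as a carbon followed by an unrecognized `l`.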

Computational Chemistry
Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40

The Number of Isomeric Hydrocarbons of the Methane Series

A foundational 1931 paper that derives exact mathematical laws for counting alkane structural isomers through recursive formulas, correcting historical errors and establishing validated benchmark counts up to C₄₀.
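
The counting idea can be tried out with a short recursive sketch. Note that this is not the paper's own case-by-case recursion: it swaps in the later centroid-based formulation, which first counts alkyl radicals and then assembles alkanes around a central carbon or a central bond. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb, prod


def _multisets(sizes, radicals):
    """Ways to choose an unordered bag of radicals with the given carbon counts."""
    return prod(comb(radicals[s] + m - 1, m) for s, m in Counter(sizes).items())


def alkyl_radical_counts(max_carbons):
    """radicals[n] = number of alkyl radicals with n carbons (index 0 is a bare H)."""
    radicals = [1]
    for n in range(1, max_carbons + 1):
        # The attachment carbon carries three branches holding n - 1 carbons in total.
        radicals.append(sum(
            _multisets(sizes, radicals)
            for sizes in combinations_with_replacement(range(n), 3)
            if sum(sizes) == n - 1
        ))
    return radicals


def alkane_isomer_count(n, radicals):
    """Structural isomers of C_n H_(2n+2), stereoisomers ignored."""
    largest = (n - 1) // 2  # every branch on a centroid carbon has fewer than n/2 carbons
    total = sum(
        _multisets(sizes, radicals)
        for sizes in combinations_with_replacement(range(largest + 1), 4)
        if sum(sizes) == n - 1
    )
    if n % 2 == 0:  # two equal halves joined by a central bond
        half = radicals[n // 2]
        total += half * (half + 1) // 2
    return total


radicals = alkyl_radical_counts(10)
print([alkane_isomer_count(n, radicals) for n in range(1, 11)])
```

For C1 through C10 this reproduces the familiar sequence 1, 1, 1, 2, 3, 5, 9, 18, 35, 75. The brute-force enumeration is fine for small n but would need a smarter partition scheme to reach C₄₀ quickly.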

Planetary Science
Magellan radar mosaic of Venus showing the northern hemisphere with volcanic plains, tesserae, and lava flows in orange-brown tones

The Surface of Venus: Stratigraphy and Resurfacing

Basilevsky and Head’s definitive synthesis reveals a planet that undergoes catastrophic global resurfacing events. We explore the “stagnant lid” model, the synchronous stratigraphy, and the divergence of Venus’s geological history from Earth’s.

Computational Chemistry
GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations

GEOM contains 450K+ molecules with 37M+ conformations, featuring energy annotations from semi-empirical (GFN2-xTB) and DFT methods for property prediction and molecular generation research.