Generative Modeling
Visualization of the VAE prior hole problem showing a ring-shaped aggregate posterior with an empty center where the Gaussian prior has highest density

Contrastive Learning for Variational Autoencoder Priors

A NeurIPS 2021 method paper introducing Noise Contrastive Priors to address the VAE ‘prior hole’ problem, in which a standard Gaussian prior assigns high density to regions of latent space that don’t correspond to realistic data. The method uses energy-based models trained with contrastive learning to match the prior to the aggregate posterior.

Computational Chemistry
GDB-13 molecule structure showing CCCC(O)(CO)CC1CC1CN

GDB-13: Chemical Universe Database (970M Molecules)

GDB-13 contains about 970 million systematically generated small organic molecules with up to 13 atoms, exploring chemical space at near-billion scale while keeping the enumerated structures drug-like.

Computational Chemistry
GDB-17 molecule structure showing complex polycyclic architecture

GDB-17: Chemical Universe Database (166B Molecules)

GDB-17 contains 166 billion systematically generated small organic molecules with up to 17 atoms, representing the most comprehensive exploration of drug-relevant chemical space achieved through computational enumeration.

Natural Language Processing
Huffman Tree visualization for the input 'beep boop beer!' showing internal nodes with frequency counts and leaf nodes with characters

High-Performance Word2Vec in Pure PyTorch

A ground-up PyTorch implementation of Word2Vec that treats it as a systems engineering challenge. A “tensorized tree” architecture converts pointer-chasing Hierarchical Softmax into dense GPU operations, combined with infinite streaming datasets with Zipfian subsampling and torch.compile compatibility for production-grade efficiency.
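Hierarchical Softmax conventionally places each token at a leaf of a Huffman tree built from frequency counts, as in the figure for 'beep boop beer!'. The sketch below builds such a tree with the standard library only. It is an illustration under assumptions, not the post's tensorized implementation: the function name `huffman_codes` and the tie-breaking order are hypothetical choices made here.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a {symbol: bitstring} Huffman code for the characters of text."""
    freq = Counter(text)
    if not freq:
        return {}
    # Heap entries are (weight, tie_breaker, tree); a tree is either a single
    # character (leaf) or a (left, right) tuple (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees into one internal node.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            # A one-symbol alphabet still needs a non-empty code.
            codes[tree] = prefix or "0"

    walk(heap[0][2], "")
    return codes


text = "beep boop beer!"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
```

For this input the seven distinct characters receive prefix-free codes, and the 15-character string encodes in 40 bits. Frequent symbols such as 'e' sit closer to the root, which is the property Hierarchical Softmax exploits to shorten the path for common words.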

Computational Chemistry
Comparison of 2D molecular graph versus 3D conformer ensemble showing latanoprost molecule in multiple conformations

GEOM Dataset: 3D Molecular Conformer Generation

Get a practical overview of the GEOM dataset and learn how it’s advancing 3D molecular machine learning by bridging static graphs and dynamic reality.

Machine Learning Fundamentals
Comparison of standard 3D CNN versus 3D Steerable CNN for handling rotational symmetry

3D Steerable CNNs: Rotationally Equivariant Features

Weiler et al.’s NeurIPS 2018 paper introducing 3D Steerable CNNs, which achieve SE(3) equivariance through group representation theory and spherical harmonic convolution kernels. This eliminates the need for data augmentation and improves data efficiency for scientific applications with rotational symmetry, such as molecular and protein structures.

Computational Biology
A reconstruction of LUCA within its evolutionary and ecological context

The Nature of LUCA and Early Earth System

A comprehensive phylogenomic study dating LUCA to ~4.2 Ga and reconstructing it as a complex, anaerobic acetogen. The authors employ a novel cross-bracing molecular clock method and gene-tree-species-tree reconciliation to infer that LUCA possessed an early immune system and lived within a hydrogen-recycling ecosystem.

Computational Chemistry
SELFIES robustness demonstration

Invalid SMILES Benefit Chemical Language Models: A Study

A provocative 2024 Nature Machine Intelligence paper challenging the assumption that invalid SMILES are failures, showing empirically that the ability to generate invalid outputs actually improves chemical language model performance by enabling quality filtering and providing richer training signals.

Computational Chemistry
3D ball-and-stick model of butane molecule representing the structural isomer generation process

Synthetic Isomer Data Generation Pipeline

An end-to-end data factory for molecular machine learning that transforms raw chemical formulas (e.g., C6H14) into labeled 3D conformer datasets, using MAYGEN for structural isomer enumeration, RDKit for 3D embedding, and physics-based featurization to address data scarcity in computational drug discovery.

Generative Modeling
Variational Autoencoder architecture diagram showing encoder, latent space, and decoder

Modern PyTorch VAEs: A Detailed Implementation Guide

A complete guide to implementing modern Variational Autoencoders in PyTorch. Includes a copy-pasteable implementation, explanation of KL annealing to fix posterior collapse, and a deep dive into stable standard deviation parameterizations.
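The two training tricks named above can be sketched without any framework. This is a minimal plain-Python illustration under assumptions, not the guide's code: the linear warm-up schedule, the softplus-based standard deviation, and the names `kl_weight`, `softplus_std`, `gaussian_kl`, and `annealed_loss` are hypothetical choices made here for a single latent dimension.

```python
import math


def kl_weight(step, warmup_steps):
    """Linear KL annealing (assumed schedule): ramp the weight from 0 to 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def softplus_std(raw, eps=1e-6):
    """Map an unconstrained value to a strictly positive standard deviation.

    Uses the overflow-safe form softplus(x) = max(x, 0) + log1p(exp(-|x|)).
    """
    return max(raw, 0.0) + math.log1p(math.exp(-abs(raw))) + eps


def gaussian_kl(mu, std):
    """KL( N(mu, std^2) || N(0, 1) ) for one latent dimension."""
    return 0.5 * (mu * mu + std * std - 1.0) - math.log(std)


def annealed_loss(recon_loss, mu, raw_std, step, warmup_steps):
    """Reconstruction loss plus the KL term scaled by the annealing weight."""
    std = softplus_std(raw_std)
    return recon_loss + kl_weight(step, warmup_steps) * gaussian_kl(mu, std)
```

Early in training the KL weight is near zero, so the decoder is pushed to use the latent code before the prior-matching term takes full effect. The small `eps` keeps the standard deviation away from zero so that the logarithm in the KL term stays finite.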

Natural Language Processing
Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

What happens when you achieve 99.8% accuracy on sarcasm detection? You might have accidentally built a domain classifier. A cautionary ML tale about dataset bias.

Computational Chemistry
3D ball-and-stick model of butane molecule showing linear carbon chain structure

Hearing Molecular Shape via Coulomb Matrix Eigenvalues

Can mathematical signatures capture molecular shape? We test whether Coulomb matrix eigenvalues can distinguish alkane constitutional isomers, from unsupervised clustering failures to supervised learning successes.
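For reference, the standard Coulomb matrix definition is given below, assuming the post follows the usual convention. Here $Z_i$ is the nuclear charge of atom $i$ and $\mathbf{R}_i$ its position; both symbols are introduced here for illustration.

```latex
M_{ij} =
\begin{cases}
  \tfrac{1}{2}\, Z_i^{2.4} & \text{if } i = j, \\[6pt]
  \dfrac{Z_i Z_j}{\lVert \mathbf{R}_i - \mathbf{R}_j \rVert} & \text{if } i \neq j.
\end{cases}
```

The sorted eigenvalues of $M$ are unchanged by translating or rotating the molecule or by reordering its atoms, which is what makes them a candidate signature of shape.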