Document Processing
Statistics of the PubMed-OCR dataset including number of articles, pages, words, and bounding boxes.

PubMed-OCR: PMC Open Access OCR Annotations

A large-scale dataset of 209K+ articles with OCR and layout bounding boxes, enabling layout-aware modeling and document …

Computational Chemistry

Molecular Sets (MOSES): A Generative Modeling Benchmark

MOSES provides a standardized benchmarking platform for molecular generative models, featuring datasets, metrics, and …

Document Processing
Chart showing the trade-off between accuracy and throughput in document automation

The Reliability Trap: When 99% Accuracy Isn't Enough

Why high-accuracy LLMs fail in production: exploring the calibration crisis and the challenge of reliable …

Document Processing
Conceptual diagram of page stream segmentation sorting pages into documents

The Evolution of Page Stream Segmentation: Rules to LLMs

From brittle rules to reasoning engines: why transformers are the only way to solve the 'Hello World' of document …

Computational Chemistry
ChemBERTa-2 visualization showing flowing SMILES strings in blue tones representing molecular data streams

ChemBERTa-2: Scaling Molecular Transformers to 77M

Optimizing transformer pretraining for molecules using MLM vs MTR objectives, scaling to 77M compounds from PubChem for …

Generative Modeling
GP-MoLFormer architecture showing large-scale SMILES input, linear-attention transformer decoder, and property optimization via pair-tuning soft prompts

GP-MoLFormer: Molecular Generation via Transformers

A 46.8M-parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient …

Computational Chemistry
ChemBERTa masked language modeling visualization showing SMILES string CC(=O)O with masked tokens

ChemBERTa: Molecular Property Prediction via Transformers

A systematic evaluation of RoBERTa transformers pretrained on 77M PubChem SMILES for molecular property prediction …

Computational Chemistry
Chemformer pre-training on 100M SMILES strings flowing into BART model, which then enables reaction prediction and property prediction tasks

Chemformer: A Pre-trained Transformer for Computational Chemistry

A BART-based transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical …

Generative Modeling
Visualization of probability density flow from initial distribution ρ₀ to target distribution ρ₁ over time through space

Building Normalizing Flows with Stochastic Interpolants

A continuous-time normalizing flow built from stochastic interpolants and a quadratic loss, bypassing costly ODE …

Generative Modeling
Visualization comparing Optimal Transport (straight paths) vs Diffusion (curved paths) for Flow Matching

Flow Matching for Generative Modeling

A simulation-free framework for training Continuous Normalizing Flows using Conditional Flow Matching and Optimal …

Machine Learning Fundamentals
Comparison of Residual Network vs ODE Network architectures showing discrete layers versus continuous transformations

Neural ODEs: Continuous-Depth Deep Learning

Introduces ODE-Nets, a continuous-depth neural network model parameterized by ODEs, enabling constant memory …

Generative Modeling
Denoising Score Matching Intuition - Vectors point from corrupted samples back to clean data, approximating the score

Score Matching and Denoising Autoencoders

Theoretical paper proving the equivalence between training Denoising Autoencoders and performing Score Matching on a …