Computational Chemistry
Bar chart comparing SMINA docking scores of CVAE, GVAE, and REINVENT against a random ZINC 10% baseline across eight protein targets

SMINA Docking Benchmark for De Novo Drug Design Models

Proposes a benchmark for de novo drug design using SMINA docking scores across eight drug targets, revealing that popular generative models fail to outperform random ZINC subsets.
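
For a concrete reference point, here is a minimal sketch of how one docking score in such a benchmark could be obtained, assuming a locally installed smina binary, pre-prepared PDBQT files, and Vina-style "Affinity:" output. This is an illustration, not the paper's actual evaluation harness.

```python
import re
import subprocess

def smina_score(receptor_pdbqt: str, ligand_pdbqt: str) -> float:
    """Rescore a posed ligand against a receptor with SMINA (--score_only)."""
    out = subprocess.run(
        ["smina", "-r", receptor_pdbqt, "-l", ligand_pdbqt, "--score_only"],
        capture_output=True, text=True, check=True,
    ).stdout
    # SMINA prints a Vina-style line such as "Affinity: -7.43 (kcal/mol)"
    match = re.search(r"Affinity:\s*(-?\d+\.\d+)", out)
    if match is None:
        raise ValueError("could not parse SMINA output")
    return float(match.group(1))

# Hypothetical usage: score a generated molecule for one of the eight targets
# score = smina_score("receptor.pdbqt", "generated_ligand.pdbqt")
```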

Computational Chemistry
BARTSmiles ablation study summary showing impact of pre-training strategies on downstream task performance

BARTSmiles: BART Pre-Training for Molecular SMILES

BARTSmiles pre-trains a BART-large model on 1.7 billion SMILES strings from ZINC20 and achieves the best reported results on 11 classification, regression, and generation benchmarks.
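
For orientation, here is a minimal sketch of one BART denoising step on a SMILES string using Hugging Face transformers, with facebook/bart-base as a stand-in checkpoint; BARTSmiles itself trains a chemistry-specific tokenizer and pre-trains from scratch on ZINC20.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizerFast

# Generic BART checkpoint as a stand-in for illustration only.
name = "facebook/bart-base"
tokenizer = BartTokenizerFast.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
enc = tokenizer(smiles, return_tensors="pt")

# BART denoising objective: corrupt the input (here, mask a single token)
# and train the decoder to reconstruct the original sequence.
corrupted = enc["input_ids"].clone()
corrupted[0, 3] = tokenizer.mask_token_id

loss = model(input_ids=corrupted,
             attention_mask=enc["attention_mask"],
             labels=enc["input_ids"]).loss
loss.backward()  # one pre-training step (optimizer omitted)
```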

Computational Chemistry
Regression Transformer dual-masking concept showing property prediction (mask numbers) and conditional generation (mask molecules) in a single model

Regression Transformer: Prediction Meets Generation

The Regression Transformer (RT) reformulates regression as conditional sequence modeling, enabling a single XLNet-based model to both predict continuous molecular properties and generate novel molecules conditioned on desired property values.
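
A toy illustration of the dual-masking idea follows. The token layout, property tag, and mask symbol are simplified stand-ins: the RT actually uses SELFIES for molecules and a dedicated digit-wise numerical tokenizer.

```python
def make_example(prop_value: float, smiles: str, mode: str):
    """Build one RT-style sequence: property tokens, separator, molecule tokens."""
    prop_tokens = list(f"{prop_value:+.2f}")   # numbers tokenized digit-wise
    mol_tokens = list(smiles)                  # characters here for brevity
    if mode == "predict":    # mask the number -> property prediction
        prop_tokens = ["[MASK]"] * len(prop_tokens)
    elif mode == "generate": # mask the molecule -> conditional generation
        mol_tokens = ["[MASK]"] * len(mol_tokens)
    return ["<prop>"] + prop_tokens + ["|"] + mol_tokens

# Illustrative property value, not a real measurement
print(make_example(-0.77, "CCO", mode="predict"))
# ['<prop>', '[MASK]', '[MASK]', '[MASK]', '[MASK]', '[MASK]', '|', 'C', 'C', 'O']
print(make_example(-0.77, "CCO", mode="generate"))
# ['<prop>', '-', '0', '.', '7', '7', '|', '[MASK]', '[MASK]', '[MASK]']
```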

Computational Chemistry
Molecular Transformer architecture showing atom-wise tokenized SMILES input through encoder-decoder with multi-head attention to predict reaction products

Molecular Transformer: Calibrated Reaction Prediction

The Molecular Transformer applies the Transformer architecture to forward reaction prediction, treating it as SMILES-to-SMILES machine translation. It achieves 90.4% top-1 accuracy on USPTO_MIT, outperforms quantum-chemistry baselines on regioselectivity, and provides calibrated uncertainty scores (0.89 AUC-ROC) for ranking synthesis pathways.
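
The atom-wise tokenization underpinning the translation setup is the widely reused regex from the paper's preprocessing code; a short sketch:

```python
import re

# Atom-wise SMILES tokenizer: multi-character atoms (Cl, Br, bracketed
# atoms) become single tokens, everything else splits character-wise.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> list[str]:
    tokens = SMILES_REGEX.findall(smiles)
    assert "".join(tokens) == smiles, "tokenization must be lossless"
    return tokens

# Reaction SMILES: the model translates the left side to the right
src = tokenize("CC(=O)Cl.OCC>>CC(=O)OCC")
```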

Computational Chemistry
Activity cliffs benchmark showing method rankings by RMSE on cliff compounds, with SVM plus ECFP outperforming deep learning approaches

Exposing Limitations of Molecular ML with Activity Cliffs

This paper benchmarks 24 machine learning and deep learning methods on activity cliff compounds (structurally similar molecules with large potency differences) across 30 macromolecular targets. Traditional ML with molecular fingerprints consistently outperforms graph neural networks and SMILES-based transformers on these challenging cases, especially in low-data regimes.
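
The winning recipe is simple enough to sketch in a few lines with RDKit and scikit-learn; the data below is a toy stand-in, not the benchmark's ChEMBL sets.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.svm import SVR

def ecfp(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Extended-connectivity fingerprint (ECFP4 for radius=2)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

# Toy stand-in data with hypothetical potencies
train_smiles = ["CCO", "CCN", "CCC", "CCCl"]
train_pki = [5.1, 6.3, 4.8, 7.2]

X = np.stack([ecfp(s) for s in train_smiles])
model = SVR(kernel="rbf").fit(X, train_pki)
pred = model.predict(ecfp("CCBr").reshape(1, -1))
```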

Computational Chemistry
MoLFormer-XL architecture diagram showing SMILES tokens flowing through a linear attention transformer to MoleculeNet benchmark results and attention-structure correlation

MoLFormer: Large-Scale Chemical Language Representations

MoLFormer is a transformer encoder with linear attention and rotary positional embeddings, pretrained via masked language modeling on 1.1 billion molecules from PubChem and ZINC. MoLFormer-XL outperforms GNN baselines on most MoleculeNet classification and regression tasks, and attention analysis reveals that the model learns interatomic spatial relationships directly from SMILES strings.
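
A minimal sketch of rotary position embeddings in the interleaved-pair convention follows; this is illustrative, not MoLFormer's exact implementation, which combines RoPE with linear attention.

```python
import torch

def rotary(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary positional embeddings to queries or keys.
    x: (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    inv_freq = 10000 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos * inv_freq                 # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]         # interleaved pairs of dims
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin  # rotate each pair by pos * freq
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = rotary(torch.randn(16, 64))  # queries for one attention head
```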

Computational Chemistry
SELFormer architecture diagram showing SELFIES token input flowing through a RoBERTa transformer encoder to molecular property predictions

SELFormer: A SELFIES-Based Molecular Language Model

SELFormer is a transformer-based chemical language model that uses SELFIES instead of SMILES as input. Pretrained on 2M ChEMBL compounds via masked language modeling, it achieves strong classification performance on MoleculeNet tasks, outperforming ChemBERTa-2 by ~12% on average across BACE, BBBP, and HIV.
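
The input representation is straightforward to reproduce with the selfies package; a short sketch:

```python
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
selfies_str = sf.encoder(smiles)  # e.g. "[C][C][=Branch1]..."
tokens = list(sf.split_selfies(selfies_str))

# Every syntactically valid SELFIES decodes to a valid molecule,
# which is the motivation for using it over SMILES as model input.
roundtrip = sf.decoder(selfies_str)
```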

Generative Modeling
D3PM forward and reverse processes on a quantized swiss roll with uniform, Gaussian, and absorbing transition matrices

D3PM: Discrete Denoising Diffusion Probabilistic Models

This paper introduces Discrete Denoising Diffusion Probabilistic Models (D3PMs), which generalize diffusion to discrete state-spaces using structured Markov transition matrices. D3PMs include uniform, absorbing-state, and discretized Gaussian corruption processes, drawing a connection between diffusion and masked language models.
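
The transition matrices are simple to write down; here is a NumPy sketch of the uniform and absorbing-state corruption processes for a K-token vocabulary.

```python
import numpy as np

K = 5          # vocabulary size (last index = absorbing [MASK] state)
beta = 0.1     # corruption rate at this timestep

# Uniform transition: stay with prob 1-beta, else resample uniformly
Q_uniform = (1 - beta) * np.eye(K) + beta / K * np.ones((K, K))

# Absorbing-state transition: stay with prob 1-beta, else move to [MASK];
# the [MASK] row keeps all its mass on itself.
Q_absorb = (1 - beta) * np.eye(K)
Q_absorb[:, -1] += beta
Q_absorb[-1, :] = 0.0
Q_absorb[-1, -1] = 1.0

# One forward step: row i of Q gives q(x_t | x_{t-1} = i)
x_prev = np.array([1, 0, 0, 0, 0], dtype=float)  # one-hot token
x_t = x_prev @ Q_uniform
```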

Computational Chemistry
Density plot showing training vs generated physicochemical property distribution

Molecular Sets (MOSES): A Generative Modeling Benchmark

MOSES introduces a comprehensive benchmarking platform for molecular generative models, offering standardized datasets, evaluation metrics, and baselines. By providing a unified measuring stick, it aims to resolve reproducibility challenges in chemical distribution learning.
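
A hedged usage sketch with the molsets package (import name moses); metric key names may vary across versions.

```python
import moses

# get_all_metrics compares a generated set against MOSES's held-out
# test set (downloaded on demand), covering validity, uniqueness,
# novelty, FCD, scaffold similarity, and more.
generated = ["CCO", "c1ccccc1", "CC(=O)O", "not_a_smiles"]
metrics = moses.get_all_metrics(generated)
print(metrics.get("valid"), metrics.get("Novelty"))
```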

Document Processing
Chart showing the trade-off between accuracy and throughput in document automation

The Reliability Trap: The Limits of 99% Accuracy

We explore the ‘Silent Failure’ mode of LLMs in production: why 99% accuracy falls short of reliability, how confidence decays in long documents, and why standard calibration techniques struggle to fix it.
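
The core arithmetic is worth making explicit; the field counts below are illustrative assumptions.

```python
# Why 99% per-field accuracy is not "reliable": errors compound.
per_item_accuracy = 0.99

for n_items in (10, 50, 100, 500):
    p_document_perfect = per_item_accuracy ** n_items
    print(f"{n_items:>3} extracted fields -> "
          f"{p_document_perfect:.1%} chance the document is error-free")
# 100 fields: ~36.6%; 500 fields: ~0.7% -- and the failures are silent.
```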

Document Processing
Conceptual diagram of page stream segmentation sorting pages into documents

The Evolution of Page Stream Segmentation: Rules to LLMs

We trace the history of Page Stream Segmentation (PSS) through three eras (Heuristic, Encoder, and Decoder) and explain how privacy-preserving, localized LLMs enable true semantic processing.

Computational Chemistry
ChemBERTa-2 visualization showing flowing SMILES strings in blue tones representing molecular data streams

ChemBERTa-2: Scaling Molecular Transformers to 77M

This work investigates the scaling hypothesis for molecular transformers, training RoBERTa models on 77M SMILES from PubChem. It compares Masked Language Modeling (MLM) against Multi-Task Regression (MTR) pretraining, finding that MTR yields better downstream performance but is computationally heavier.
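
A short sketch of pulling molecule embeddings from one of the released checkpoints; the checkpoint name is as published on the Hugging Face Hub, and availability should be treated as an assumption.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The 77M-molecule runs were released with both pre-training variants,
# e.g. ChemBERTa-77M-MLM and ChemBERTa-77M-MTR.
name = "DeepChem/ChemBERTa-77M-MTR"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state  # (1, seq_len, hidden)
embedding = hidden.mean(dim=1)               # pooled molecule embedding
```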