
GutenOCR: A Grounded Vision-Language Front-End for Documents
GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.

Optimizing Sequence Models for Dynamical Systems
We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.

Molecular Sets (MOSES): A Generative Modeling Benchmark
MOSES introduces a comprehensive benchmarking platform for molecular generative models, offering standardized datasets, evaluation metrics, and baselines.

The Reliability Trap: The Limits of 99% Accuracy
We examine the ‘Silent Failure’ mode of LLMs in production: why 99% accuracy falls short of production reliability requirements, how model confidence decays over long documents, and why standard calibration techniques struggle to fix it.

The Evolution of Page Stream Segmentation: Rules to LLMs
We trace the history of Page Stream Segmentation (PSS) through three eras (Heuristic, Encoder, and Decoder) and explain how privacy-preserving, localized LLMs enable true semantic processing.

PubMed-OCR: PMC Open Access OCR Annotations
PubMed-OCR provides 1.5M pages of scientific articles with comprehensive OCR annotations and bounding boxes to support layout-aware modeling and document analysis.

ChemBERTa-3: Open Source Training Framework
ChemBERTa-3 provides a unified, scalable infrastructure for pretraining and benchmarking chemical foundation models, addressing reproducibility gaps in previous studies like MoLFormer through standardized scaffold splitting and open-source tooling.

ChemDFM-R: Chemical Reasoner LLM
ChemDFM-R is a 14B-parameter chemical reasoning model that integrates a 101B-token dataset of atomized chemical knowledge. Using a novel mix-sourced distillation strategy and domain-specific reinforcement learning, it achieves state-of-the-art performance on chemical benchmarks.

ChemBERTa-2: Scaling Molecular Transformers to 77M
This work investigates the scaling hypothesis for molecular transformers, training RoBERTa models on 77M SMILES from PubChem. It compares Masked Language Modeling (MLM) against Multi-Task Regression (MTR) pretraining, finding that MTR yields better downstream performance but is computationally heavier.
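The MLM objective compared above can be illustrated with a minimal sketch. This is a simplification for intuition only (the function and token names are ours, not from the paper): real BERT/RoBERTa-style masking also substitutes random tokens or keeps the original at some masked positions (the 80/10/10 scheme), which is omitted here.

```python
import random

MASK_TOKEN = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: replace each token with <mask>
    with probability mask_prob. Labels hold the original token at
    masked positions (the prediction target) and None elsewhere."""
    rng = random.Random(seed)  # seeded for reproducibility
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

# With mask_prob=1.0 every position is masked, so labels
# recover the full input sequence.
inputs, labels = mask_tokens(["C", "C", "O"], mask_prob=1.0)
print(inputs)   # → ['<mask>', '<mask>', '<mask>']
print(labels)   # → ['C', 'C', 'O']
```

MTR pretraining, by contrast, replaces this token-level reconstruction target with regression heads over precomputed molecular properties, which is why it is more expensive to set up and train.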

GP-MoLFormer: Molecular Generation via Transformers
This methodological paper proposes a linear-attention transformer decoder trained on 1.1 billion molecules. It introduces pair-tuning for efficient property optimization and establishes empirical scaling laws relating inference compute to generation novelty.

ChemBERTa: Molecular Property Prediction via Transformers
This paper introduces ChemBERTa, a RoBERTa-based model pretrained on 77M SMILES strings. It systematically evaluates the impact of pretraining dataset size, tokenization strategies, and input representations (SMILES vs. SELFIES) on downstream MoleculeNet tasks, finding that performance scales positively with data size.
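The tokenization strategies evaluated above can be made concrete with a minimal regex-based SMILES tokenizer, using the alternation pattern common in the molecular-transformer literature (the function name is ours). Unlike naive character splitting, it keeps bracket atoms, two-letter elements, and two-digit ring closures as single tokens:

```python
import re

# Common SMILES tokenization pattern: bracket atoms ([...]), two-letter
# elements (Cl, Br), aromatic atoms, bond symbols, and %NN ring closures
# each become one token.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_REGEX.findall(smiles)

# 'Cl' stays one token instead of splitting into 'C' + 'l':
print(tokenize_smiles("ClCCl"))  # → ['Cl', 'C', 'Cl']
```

Character-level splitting would break `Cl` into a carbon and a stray `l`, which is one reason atom-level tokenization tends to be the default for SMILES models.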

Chemformer: Pre-trained Transformer for Computational Chemistry
This paper introduces Chemformer, a BART-based sequence-to-sequence model pre-trained on 100M molecules using a novel ‘combined’ masking and augmentation task. It achieves state-of-the-art top-1 accuracy on reaction prediction benchmarks while significantly reducing training time through transfer learning.