
GutenOCR: A Grounded Vision-Language Front-End for Documents
GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.

Optimizing Sequence Models for Dynamical Systems
We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.

Graph Grammar and ILP for Carbon Fixation Pathways
A graph-grammar cheminformatics workflow expands the carbon fixation reaction network, then uses integer linear programming flow queries to surface short autocatalytic pathways producing Acetyl-CoA and Malate with efficiencies approaching the CETCH cycle.

MB-nrg in Solution: Polyalanine in Water with CCSD(T) PEFs
Building on the gas-phase MB-nrg PEF for polyalanine, Ruihan Zhou and Francesco Paesani add machine-learned 2-body terms for each backbone functional group interacting with water, fit to BSSE-corrected DLPNO-CCSD(T)/aug-cc-pVTZ data. They validate the resulting potential against alanine dipeptide-water dimer scans, free-energy surfaces in explicit MB-pol water, and hydration radial distribution functions.

MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine
Ruihan Zhou and co-authors extend the MB-nrg many-body formalism to covalently bonded biomolecules by fragmenting polyalanine into functional-group n-mers and fitting permutationally invariant polynomials to DLPNO-CCSD(T)/aug-cc-pVTZ reference energies. The resulting potential reproduces alanine dipeptide Ramachandran surfaces, harmonic frequencies, and AceAla9Nme secondary-structure dynamics more faithfully than Amber ff14SB and ff19SB.
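
For orientation, the generic many-body expansion that formalisms of this kind build on can be written as follows. This is the standard textbook form, with monomers here understood as the functional-group fragments. The symbols are illustrative, and the summary does not state where the paper truncates the series or which terms are fit.

```latex
E(1,\dots,N) = \sum_{i} \varepsilon^{(1)}(i)
             + \sum_{i<j} \varepsilon^{(2)}(i,j)
             + \sum_{i<j<k} \varepsilon^{(3)}(i,j,k)
             + \cdots
             + \varepsilon^{(N)}(1,\dots,N)
```

Here \varepsilon^{(n)} denotes the n-body contribution of an n-mer, i.e. the part of its energy not already captured by the lower-order terms.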

ACSESS: Diverse Optimal Molecules in the SMU
Property-optimizing ACSESS combines diversity-biased sampling with iterative fitness thresholding to discover diverse sets of molecules with favorable properties. Tested on GDB-9 (dipole moment optimization) and NKp fitness landscapes, it outperforms standard genetic algorithms in diversity while matching or exceeding their fitness, using only ~30,000 evaluations to navigate a 300,000-molecule space.
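
The interplay of a fitness cutoff and diversity-biased selection can be illustrated with a toy sketch. It is not the ACSESS implementation: the function names, the greedy maximin rule, and the one-dimensional "descriptors" are assumptions made purely for illustration.

```python
def maximin_subset(candidates, size, distance):
    """Greedy maximin: repeatedly add the candidate farthest from those already chosen."""
    chosen = [candidates[0]]
    while len(chosen) < min(size, len(candidates)):
        best = max(
            (c for c in candidates if c not in chosen),
            # Score a candidate by its distance to the nearest chosen item.
            key=lambda c: min(distance(c, s) for s in chosen),
        )
        chosen.append(best)
    return chosen


def threshold_and_diversify(pool, fitness, threshold, size, distance):
    """Keep candidates at or above the fitness threshold, then pick a diverse subset."""
    survivors = [m for m in pool if fitness(m) >= threshold]
    if not survivors:
        return []
    return maximin_subset(survivors, size, distance)


# Toy data (hypothetical): each "molecule" is a single number that serves
# as both its descriptor and its fitness.
pool = [0.0, 1.0, 4.0, 9.0, 10.0]
selected = threshold_and_diversify(
    pool,
    fitness=lambda x: x,
    threshold=1.0,
    size=3,
    distance=lambda a, b: abs(a - b),
)
print(selected)  # [1.0, 10.0, 4.0]
```

In an iterative scheme, the threshold would be raised between rounds so that the surviving set stays spread out while its fitness improves.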

AllChem: Generating and Searching 10^20 Structures
AllChem generates ~5 million synthons by recursively applying ~100 reactions to ~7,000 building blocks, combinatorially representing up to 10^20 complete structures with an A-B-C topology. Topomer shape similarity enables efficient searching of this space, and every hit comes with a proposed synthetic route.
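
A back-of-envelope check shows how ~5 million synthons can reach the 10^20 scale. The sketch assumes, as a simplification not stated in the summary, that any synthon may fill any of the three A-B-C slots. The result is therefore only an upper bound, and the function name is hypothetical.

```python
def upper_bound_structures(n_synthons: int, slots: int) -> int:
    """Upper bound on combinations if every synthon could fill every slot."""
    return n_synthons ** slots


# ~5 million synthons and three slots (A-B-C), taken from the summary above.
bound = upper_bound_structures(5_000_000, 3)
print(f"{bound:.2e}")  # 1.25e+20
```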

Atom-Density Representations for Machine Learning
Introduces a Dirac notation formalism for atomic environments that unifies SOAP power spectra, Behler-Parrinello symmetry functions, and other density-based structural representations under a single theoretical framework.

CHX8: Complete Eight-Carbon Hydrocarbon Space
CHX8 exhaustively enumerates all mathematically feasible hydrocarbons with up to eight carbon atoms (77,524 structures), then DFT-optimizes them to identify 31,497 stable molecules. A universal relative strain energy (RSE) metric referenced to cyclohexane serves as a synthesizability proxy. CHX8 covers 16x more C8 hydrocarbons than GDB-13 and reveals that over 90% of novel structures should be synthetically accessible.

Conformation Autoencoder for 3D Molecules
A conformation autoencoder converts molecular 3D arrangements into fixed-size latent representations using internal coordinates and graph neural networks, enabling conformer generation and spatial property optimization.

Defining Disentangled Representations via Group Theory
Proposes the first principled mathematical definition of disentangled representations by connecting symmetry group decompositions to independent subspaces in a representation’s vector space.

DGCNN: Dynamic Graph CNN for Point Cloud Learning
DGCNN introduces the EdgeConv operator, which constructs k-nearest neighbor graphs dynamically in feature space at each network layer. This enables the model to capture both local geometry and long-range semantic relationships for point cloud classification and segmentation.
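
The idea can be sketched in plain Python. This is a deliberately simplified illustration, not the DGCNN code: the learned per-edge network is omitted, each edge feature is just the concatenation of x_i and (x_j - x_i), and the function names are assumptions.

```python
import math


def knn_indices(features, k):
    """For each point, return the indices of its k nearest neighbors in feature space."""
    neighbors = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def edge_conv(features, k):
    """Simplified EdgeConv: build edge features [x_i, x_j - x_i], then max-pool over neighbors."""
    neighbors = knn_indices(features, k)
    out = []
    for i, fi in enumerate(features):
        edge_feats = [
            list(fi) + [b - a for a, b in zip(fi, features[j])]
            for j in neighbors[i]
        ]
        # Channel-wise max over the k edges attached to point i.
        out.append([max(channel) for channel in zip(*edge_feats)])
    return out


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
layer1 = edge_conv(points, k=2)
# The graph is "dynamic" because the next layer recomputes neighbors
# from layer1's output features rather than from the input coordinates.
layer2 = edge_conv(layer1, k=2)
print(layer1[0])  # [0.0, 0.0, 1.0, 2.0]
```

Because the neighbor search runs on the current features, points that are far apart in input space can become neighbors in later layers once their features are similar.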