Computational Chemistry
QSPR surface roughness comparison across molecular representations, showing smooth fingerprint surfaces versus rougher pretrained model surfaces

ROGI-XD: Roughness of Pretrained Molecular Representations

This paper introduces ROGI-XD, a reformulation of the ROuGhness Index that enables fair comparison of QSPR surface roughness across molecular representations of different dimensionalities. Evaluating VAE, GIN, ChemBERTa, and ChemGPT representations, the authors show that pretrained chemical models do not produce smoother structure-property landscapes than simple molecular fingerprints or descriptors.
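The core intuition behind roughness indices like ROGI is that a landscape is rough when structurally similar molecules have dissimilar properties. The sketch below is not the paper's ROGI-XD formula (which involves coarse-graining the dataset by clustering at increasing distance thresholds); it is a minimal nearest-neighbor roughness proxy over fingerprint bit sets, with all names and thresholds illustrative:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def roughness_proxy(fps, props):
    """Mean absolute property gap to each molecule's nearest neighbor
    in fingerprint space, normalized by the property range.
    Larger values indicate a rougher QSPR surface."""
    n = len(fps)
    gaps = []
    for i in range(n):
        # nearest neighbor by Tanimoto similarity, excluding self
        j = max((j for j in range(n) if j != i),
                key=lambda j: tanimoto(fps[i], fps[j]))
        gaps.append(abs(props[i] - props[j]))
    prop_range = max(props) - min(props)
    return (sum(gaps) / n) / prop_range if prop_range else 0.0

# two structural clusters; assigning similar vs. dissimilar properties
# within each cluster yields a smooth vs. rough landscape
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
smooth = roughness_proxy(fps, [1.0, 1.1, 5.0, 5.1])
rough = roughness_proxy(fps, [1.0, 5.0, 1.1, 5.1])
```

The proxy is near 0 when neighbors share property values and near 1 when they do not, which is the qualitative behavior a roughness index is meant to capture.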

Computational Chemistry
Diagram showing a genetic algorithm for molecules where a parent albuterol molecule undergoes mutation to produce two child molecules, with a selection and repeat loop

Genetic Algorithms as Baselines for Molecule Generation

This position paper demonstrates that genetic algorithms (GAs) perform surprisingly well on molecular generation benchmarks, often outperforming complex deep learning methods. The authors propose the GA criterion: new molecule generation algorithms should demonstrate a clear advantage over GAs.
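The strength of the GA baseline is how little machinery it needs. The skeleton below is a toy illustration only: real molecular GAs mutate and cross over molecular graphs or fragments with validity checks, whereas this sketch mutates characters of a string and optimizes a toy fitness; the alphabet, fitness, and function names are all assumptions for illustration:

```python
import random

ALPHABET = "CNOc1()="  # toy symbol set; real GAs use chemistry-aware edits

def mutate(s, rng):
    """Toy mutation: replace one random position with a random symbol."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]

def run_ga(fitness, seeds, generations=50, pop_size=20, seed=0):
    """Minimal (mu + lambda)-style GA: mutate random parents, then keep
    the fittest individuals from parents and children combined."""
    rng = random.Random(seed)
    pop = list(seeds)
    for _ in range(generations):
        children = [mutate(rng.choice(pop), rng) for _ in range(pop_size)]
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return pop[0]

# toy objective: maximize the count of aliphatic carbons in the string
best = run_ga(lambda s: s.count("C"), ["CCO", "c1ccccc1"])
```

Because parents compete with children for survival, the best fitness is monotonically non-decreasing, which is part of why even such simple loops make strong benchmark baselines.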

Computational Chemistry
Bar chart comparing SMINA docking scores of CVAE, GVAE, and REINVENT against a random ZINC 10% baseline across eight protein targets

SMINA Docking Benchmark for De Novo Drug Design Models

Proposes a benchmark for de novo drug design that scores generated molecules with SMINA docking against eight protein targets, revealing that popular generative models (CVAE, GVAE, REINVENT) fail to outperform a random 10% subset of ZINC.


Computational Chemistry
2D structure of a phenyl-quaterthiophene, a conjugated organic molecule representative of the photovoltaic donor materials benchmarked in the Tartarus platform

Tartarus: Realistic Inverse Molecular Design Benchmarks

Tartarus introduces a modular suite of realistic molecular design benchmarks grounded in computational chemistry simulations. Benchmarking eight generative models reveals that no single algorithm dominates all tasks, and simple genetic algorithms often outperform deep generative models.

Computational Chemistry
Diagram of the tied two-way transformer architecture with shared encoder, retro and forward decoders, latent variables, and cycle consistency, alongside USPTO-50K accuracy and validity results

Tied Two-Way Transformers for Diverse Retrosynthesis

This paper couples a retrosynthesis transformer with a forward reaction transformer through parameter sharing, cycle consistency checks, and multinomial latent variables. The combined approach reduces top-1 SMILES invalidity to 0.1% on USPTO-50K, improves top-10 accuracy to 78.5%, and achieves 87.3% pathway coverage on a multi-pathway in-house dataset.
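The cycle-consistency idea can be stated compactly: a retrosynthesis proposal is kept only if the forward model maps the proposed reactants back to the target product. A minimal sketch, with a lookup table standing in for the forward transformer (the reactions and function names are illustrative, not from the paper):

```python
def cycle_consistent(target, retro_proposals, forward_model):
    """Keep retrosynthesis proposals whose forward-reaction prediction
    round-trips back to the target product."""
    return [r for r in retro_proposals if forward_model(r) == target]

# toy stand-in for the forward transformer: a lookup table
forward = {"CCO.CC(=O)O": "CC(=O)OCC",  # esterification -> ethyl acetate
           "CCO.O": "CCO"}.get
kept = cycle_consistent("CC(=O)OCC", ["CCO.CC(=O)O", "CCO.O"], forward)
```

In the paper this check is learned jointly with shared parameters rather than applied as a post-hoc filter, but the filtering effect on invalid or implausible proposals is the same in spirit.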

Computational Chemistry
BARTSmiles ablation study summary showing impact of pre-training strategies on downstream task performance

BARTSmiles: BART Pre-Training for Molecular SMILES

BARTSmiles pre-trains a BART-large model on 1.7 billion SMILES strings from ZINC20 and achieves the best reported results on 11 classification, regression, and generation benchmarks.

Computational Chemistry
Three distribution plots showing RNN language models closely matching training distributions across peaked, multi-modal, and large-scale molecular generation tasks while graph models fail

Language Models Learn Complex Molecular Distributions

This study benchmarks RNN-based chemical language models against graph generative models on three challenging tasks: matching distributions of molecules with high penalized logP, multi-modal molecular distributions, and large-molecule generation from PubChem. The LSTM language models consistently outperform graph baselines such as JTVAE and CGVAE.

Computational Chemistry
Activity cliffs benchmark showing method rankings by RMSE on cliff compounds, with SVM plus ECFP outperforming deep learning approaches

Exposing Limitations of Molecular ML with Activity Cliffs

This paper benchmarks 24 machine and deep learning methods on activity cliff compounds (structurally similar molecules with large potency differences) across 30 macromolecular targets. Traditional ML with molecular fingerprints consistently outperforms graph neural networks and SMILES-based transformers on these challenging cases, especially in low-data regimes.
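An activity cliff is operationally a pair of molecules that are structurally similar yet far apart in potency. The sketch below finds such pairs with a single Tanimoto cutoff and a fold-change cutoff on pKi; the thresholds (0.9 similarity, 10-fold potency gap) are illustrative stand-ins for the paper's definition, which combines several similarity measures:

```python
from itertools import combinations
import math

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def activity_cliffs(fps, pki, sim_cut=0.9, fold_cut=10.0):
    """Return index pairs that are structurally similar but whose
    potencies differ by at least log10(fold_cut) pKi units."""
    gap = math.log10(fold_cut)
    return [(i, j) for i, j in combinations(range(len(fps)), 2)
            if tanimoto(fps[i], fps[j]) >= sim_cut
            and abs(pki[i] - pki[j]) >= gap]

a = set(range(20))           # 20-bit fingerprint
b = set(range(19)) | {99}    # one-bit change: Tanimoto 19/21 ~ 0.90
c = {50, 51}                 # structurally unrelated molecule
pairs = activity_cliffs([a, b, c], [9.0, 6.0, 5.0])
```

Such pairs are exactly where smooth interpolation fails, which is why models that rely on it degrade on cliff compounds.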

Computational Chemistry
MoLFormer-XL architecture diagram showing SMILES tokens flowing through a linear attention transformer to MoleculeNet benchmark results and attention-structure correlation

MoLFormer: Large-Scale Chemical Language Representations

MoLFormer is a transformer encoder with linear attention and rotary positional embeddings, pretrained via masked language modeling on 1.1 billion molecules from PubChem and ZINC. MoLFormer-XL outperforms GNN baselines on most MoleculeNet classification and regression tasks, and attention analysis reveals that the model learns interatomic spatial relationships directly from SMILES strings.
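Rotary positional embeddings, one of MoLFormer's two architectural choices, encode position by rotating consecutive feature pairs of queries and keys. A minimal sketch of the rotation itself (not MoLFormer's implementation; vector sizes and the base constant follow the common RoPE convention):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary positional embedding: rotate each consecutive feature pair
    (x, y) by an angle that grows with position and shrinks with
    feature index."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

The useful property is that the dot product between a rotated query and a rotated key depends only on their relative position, which pairs naturally with the linear-attention mechanism used here.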

Computational Chemistry
SELFormer architecture diagram showing SELFIES token input flowing through a RoBERTa transformer encoder to molecular property predictions

SELFormer: A SELFIES-Based Molecular Language Model

SELFormer is a transformer-based chemical language model that uses SELFIES instead of SMILES as input. Pretrained on 2M ChEMBL compounds via masked language modeling, it achieves strong classification performance on MoleculeNet tasks, outperforming ChemBERTa-2 by ~12% on average across BACE, BBBP, and HIV.

Computational Chemistry
Uni-Parser pipeline diagram showing document pre-processing, layout detection, semantic parsing, content gathering, and format conversion stages

Uni-Parser: Industrial-Grade Multi-Modal PDF Parsing (2025)

Technical report on Uni-Parser, an industrial-grade document parsing engine that uses a modular multi-expert architecture to parse scientific PDFs into structured representations. It integrates MolParser 1.5 for optical chemical structure recognition (OCSR), achieving 88.6% accuracy on chemical structures while processing up to 20 pages per second.

Computational Chemistry
Overview figure for the OCSU task: molecular images translated into motif-, IUPAC-, and SMILES-level descriptions

OCSU: Optical Chemical Structure Understanding (2025)

Proposes the ‘Optical Chemical Structure Understanding’ (OCSU) task to translate molecular images into multi-level descriptions (motifs, IUPAC, SMILES). Introduces the Vis-CheBI20 dataset and two paradigms: DoubleCheck (OCSR-based) and Mol-VL (OCSR-free).