Natural Language Processing
Diagram showing the three-step nested pipeline from small-scale training to large-model loss prediction across data mixtures

Data Mixing Laws for LM Pretraining Optimization

Ye et al. find that language model loss on each domain follows an exponential function of the training mixture proportions. Nesting these data mixing laws within scaling laws for training steps and model size lets small-scale experiments predict and optimize mixtures for large models, yielding a 48% gain in training efficiency.
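
As a rough sketch of the fitted functional form (the exponential mixing law the paper reports), the snippet below predicts per-domain loss from mixture proportions; the coefficient values and the two-domain setup are illustrative placeholders, not fits from the paper.

```python
import numpy as np

def domain_loss(r, c_i, k_i, t_i):
    """Mixing-law prediction of validation loss on domain i.

    Assumes the exponential form L_i(r) = c_i + k_i * exp(sum_j t_ij * r_j),
    with c_i, k_i and the t_ij fitted from a handful of small-scale runs.
    """
    r = np.asarray(r, dtype=float)  # mixture proportions, should sum to 1
    return c_i + k_i * np.exp(np.dot(t_i, r))

# Illustrative two-domain mixture (coefficients are made up for demonstration)
print(domain_loss([0.6, 0.4], c_i=1.8, k_i=0.9, t_i=[-1.2, 0.4]))
```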

Natural Language Processing
Chart showing effective data as a function of epochs with exponential decay, with the 4-epoch safe zone and 16-epoch half-life marked

Scaling Data-Constrained Language Models

Muennighoff et al. train 400+ models to study how data repetition affects scaling. They propose a data-constrained scaling law with exponential decay for repeated tokens, finding that repeating data for up to 4 epochs has negligible impact on loss, that returns diminish sharply around 16 epochs, and that augmenting the data mix with code provides roughly a 2x boost in effective data.
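
A minimal sketch of what exponential decay for repeated tokens implies, assuming an effective-data term of the form D' = U + U * R* * (1 - exp(-R / R*)), where R counts repetitions beyond the first epoch; the decay constant used here is illustrative, not the paper's fitted value.

```python
import math

def effective_data(unique_tokens, epochs, r_star=15.0):
    """Effective training data when a fixed token pool is repeated.

    Assumes D' = U + U * r_star * (1 - exp(-R / r_star)), where R = epochs - 1
    counts repetitions beyond the first pass; r_star here is illustrative.
    """
    repeats = epochs - 1
    return unique_tokens * (1 + r_star * (1 - math.exp(-repeats / r_star)))

U = 100e9  # 100B unique tokens
for ep in (1, 4, 16, 64):
    ratio = effective_data(U, ep) / (U * ep)
    print(f"{ep:>2} epochs: {ratio:.2f} of the naive token count is effective")
```

Under this form the 4-epoch run still retains most of the value of fresh data, while by 16 epochs each additional pass contributes far less, matching the safe zone and half-life marked in the chart.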

Natural Language Processing
Bar chart comparing average benchmark accuracy across seven domain combination configurations showing diversity improves performance

SlimPajama-DC: Data Combinations for LLM Training

Shen et al. empirically analyze how different domain combinations and deduplication strategies in the SlimPajama dataset affect the performance of 1.3B-parameter models. Global deduplication across sources outperforms local per-source deduplication, and increasing domain diversity consistently improves average benchmark accuracy, with the findings transferring to 7B scale.

Molecular Generation
Two-panel plot showing score divergence with disagreeing classifiers vs convergence with agreeing classifiers

Avoiding Failure Modes in Goal-Directed Generation

Shows that divergence between optimization and control scores during goal-directed molecular generation is explained by pre-existing disagreement among QSAR models on the training distribution, not by algorithmic exploitation of model-specific biases.

Molecular Representations
Bar chart showing randomized SMILES generate more of GDB-13 chemical space than canonical SMILES across training set sizes

Randomized SMILES Improve Molecular Generative Models

An extensive benchmark showing that training RNN generative models with randomized (non-canonical) SMILES strings yields more uniform, complete, and closed molecular output domains than canonical SMILES.

Predictive Chemistry
Bar chart comparing fixed molecular representations (RF, SVM, XGBoost) against learned representations (MolBERT, GROVER) across six property prediction benchmarks under scaffold split

Benchmarking Molecular Property Prediction at Scale

This study trains over 62,000 models to systematically evaluate molecular representations and models for property prediction, finding that traditional ML on fixed descriptors often outperforms deep learning approaches.

Molecular Generation
Diagram showing divergence between optimization score and control scores during molecular optimization

Failure Modes in Molecule Generation & Optimization

Identifies failure modes in molecular generative models, showing that trivial edits fool distribution-learning benchmarks and that ML-based scoring functions introduce exploitable model-specific and data-specific biases during goal-directed optimization.

Molecular Representations
Log-log plots showing power-law scaling of ChemGPT validation loss versus model size and GNN force field loss versus dataset size

Neural Scaling of Deep Chemical Models

Frey et al. discover empirical power-law scaling relations for both chemical language models (ChemGPT, up to 1B parameters) and equivariant GNN interatomic potentials, finding that neither model class has saturated with respect to model size, dataset size, or compute.
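
A small sketch of fitting a power law of the kind the paper reports (loss varying as a power of model size), which becomes a straight line in log-log space; the data points and fitted values below are synthetic placeholders, not results from the paper.

```python
import numpy as np

# Synthetic (parameter count, validation loss) pairs standing in for real runs;
# a power law L(N) = a * N**(-alpha) is linear after taking logs of both sides.
n_params = np.array([1e6, 1e7, 1e8, 1e9])
losses = np.array([4.1, 3.4, 2.9, 2.5])  # placeholder values

slope, log_a = np.polyfit(np.log(n_params), np.log(losses), 1)
print(f"fitted exponent alpha = {-slope:.3f}, prefactor a = {np.exp(log_a):.2f}")

# Extrapolate the fitted law to a 10B-parameter model
print(f"predicted loss at 10B params: {np.exp(log_a) * (10e9) ** slope:.2f}")
```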

Predictive Chemistry
Three distribution plots showing RNN language models closely matching training distributions across peaked, multi-modal, and large-scale molecular generation tasks while graph models fail

Language Models Learn Complex Molecular Distributions

This study benchmarks RNN-based chemical language models against graph generative models on three challenging tasks: generating molecules with high penalized logP scores, matching multi-modal molecular distributions, and generating large molecules from PubChem. The LSTM language models consistently outperform JTVAE and CGVAE.

Evolutionary Biology
A reconstruction of LUCA within its evolutionary and ecological context

The Nature of LUCA and Its Impact on the Early Earth System

A comprehensive phylogenomic study dating LUCA to ~4.2 Ga and reconstructing it as a complex, anaerobic acetogen. The authors apply the cross-bracing molecular clock method alongside gene-tree-species-tree reconciliation to infer that LUCA possessed an early immune system and lived within a hydrogen-recycling ecosystem.

Planetary Science
Orbital diagram showing chaotic planetary trajectories

Chaotic Evolution of the Solar System (Sussman & Wisdom, 1992)

Sussman and Wisdom’s 1992 study used the Supercomputer Toolkit and symplectic mapping to integrate the entire Solar System for 100 million years, confirming chaotic behavior with an exponential divergence timescale of ~4 million years and demonstrating that long-term planetary motion is fundamentally unpredictable.
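
As a back-of-the-envelope illustration (not the paper's calculation) of what a ~4-million-year exponential divergence timescale means, assuming simple error growth d(t) = d0 * exp(t / tau):

```python
import math

def position_error_km(initial_error_km, t_myr, lyapunov_time_myr=4.0):
    """Exponential growth of a small trajectory error: d(t) = d0 * exp(t / tau)."""
    return initial_error_km * math.exp(t_myr / lyapunov_time_myr)

# A 15-metre uncertainty today exceeds an astronomical unit (~1.5e8 km)
# within about 100 million years, making precise prediction impossible.
for t in (20, 50, 100):
    print(f"{t:>3} Myr: {position_error_km(0.015, t):.3e} km")
```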

Molecular Simulation
Carbon monoxide molecule adsorbed in a hollow-site configuration on an FCC Pt(100) surface

In Situ XRD of Oxidation-Reduction Oscillations on Pt/SiO2

This study provides the first direct experimental proof that rate oscillations in catalytic CO oxidation on supported Pt are driven by a periodic oxidation and reduction of the catalyst surface. By monitoring Bragg peak intensities in situ, the authors confirm the ‘oxide model’ over competing reconstruction or carbon models.