Computational Chemistry
ZINC-22 Tranche Browser showing molecular count distribution

ZINC-22: Multi-Billion Molecule Database

A dataset card for ZINC-22, the largest freely available database of commercially available compounds for virtual …

Computational Chemistry
MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Representation and Conformer Ensemble Learning

MARCEL dataset provides 722K+ conformers across 76K+ molecules for drug discovery, catalysis, and molecular …

Computational Chemistry
GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations

A dataset card for the GEOM dataset, a collection of energy-annotated molecular conformations for property prediction …

Computational Chemistry
GDB-11 molecule structure showing FC1C2OC1c3c(F)coc23

GDB-11: Chemical Universe Database (26.4M Molecules)

A dataset card for the Generated Database 11 (GDB-11), a database of 26.4 million small organic molecules for virtual …

Computational Chemistry
GDB-13 molecule structure showing CCCC(O)(CO)CC1CC1CN

GDB-13: Chemical Universe Database (970M Molecules)

A dataset card for the Generated Database 13 (GDB-13), a database of nearly 1 billion small organic molecules for …

Computational Chemistry
GDB-17 molecule structure showing complex polycyclic architecture

GDB-17: Chemical Universe Database (166B Molecules)

A dataset card for GDB-17, containing 166 billion small organic molecules representing the largest enumerated chemical …

Computational Chemistry
Comparison of 2D molecular graph versus 3D conformer ensemble showing latanoprost molecule in multiple conformations

GEOM Dataset: 3D Molecular Conformer Generation

Learn how GEOM transforms 2D molecular graphs into dynamic 3D conformer ensembles for molecular machine learning …

Natural Language Processing
Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that actually learned to …

Computational Chemistry

Mini-Protein Trajectory Generation

GROMACS workflows for amino acid dipeptide trajectories across nine residue types, creating diverse MD datasets for ML …

Computational Social Science
Data visualization showing congressional bill analysis and legislative patterns

Congressional Data Analysis & Classification

Data science project scraping 47,000+ congressional bills, analyzing legislative patterns, and building ML models …

Computational Social Science

NewsTweet Dataset: Social Media in Digital Journalism

NewsTweet dataset and pipeline for studying embedded tweets in online news. Analysis shows 13% of stories contain …

Natural Language Processing
Types and distribution of coreferences in QuAC dataset showing dialogue complexity

QuAC: Question Answering in Context Dataset

Analysis of QuAC's conversational QA through student-teacher interactions, featuring 100K+ context-dependent questions …