Computational Chemistry

GDB-11: Chemical Universe Database (26.4M Molecules)

A dataset card for the Generated Database 11 (GDB-11), a database of 26.4 million small organic molecules for virtual …

Computational Chemistry

GDB-13: Chemical Universe Database (970M Molecules)

A dataset card for the Generated Database 13 (GDB-13), a database of nearly 1 billion small organic molecules for …

Computational Chemistry

GDB-17: Chemical Universe Database (166B Molecules)

A dataset card for the Generated Database 17 (GDB-17), containing 166 billion small organic molecules representing the …

Computational Chemistry

Beyond 2D: Exploring the GEOM Dataset for 3D Molecular Conformer Generation

Learn how GEOM transforms 2D molecular graphs into dynamic 3D conformer ensembles for molecular machine learning …

Natural Language Processing

Sarcasm Detection with Transformers: A Cautionary Tale

Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that actually learned to …

Computational Chemistry

Mini-Protein Trajectory Generation

GROMACS simulation workflows for generating amino acid dipeptide trajectories across nine different residue types, …

Computational Social Science

Congressional Data Analysis & Classification

Comprehensive data science project that scraped 47,000+ congressional bills, analyzed legislative patterns, and built ML …

Computational Social Science

NewsTweet Dataset: Social Media in Digital Journalism

Describes the creation of NewsTweet, a large-scale dataset and pipeline for studying embedded tweets in online news. …

Natural Language Processing

QuAC: Question Answering in Context Dataset

An analysis of QuAC's approach to conversational question answering through student-teacher interactions, featuring …

Natural Language Processing

CoQA Dataset: Advancing Conversational Question Answering

An analysis of CoQA, a conversational QA dataset that introduces multi-turn dialogue, coreference resolution, and …