Datasets

3D conformer ensemble of a drug-like molecule from the GEOM dataset

GEOM Dataset: 3D Molecular Conformer Generation

Get a practical overview of the GEOM dataset and learn how it advances 3D molecular machine learning by connecting static molecular graphs to the conformer ensembles that molecules actually adopt.

Molecular Generation

3D ball-and-stick model of a butane molecule representing the structural isomer generation process

Synthetic Isomer Data Generation Pipeline

An end-to-end data factory for molecular machine learning that transforms raw chemical formulas (e.g., C6H14) into labeled 3D conformer datasets, using MAYGEN for structural isomer enumeration, RDKit for 3D embedding, and physics-based featurization to address data scarcity in computational drug discovery.

Natural Language Processing

Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

What happens when you achieve 99.8% accuracy on sarcasm detection? You might have accidentally built a domain classifier. A cautionary ML tale about dataset bias.

Computational Social Science

Top features for Economics and Public Finance policy classification across Congresses

How Does Congress Actually Work? Data from 15K Bills

Only 2% of congressional bills become law. We analyze 15K bills from 2021-2023 to understand what drives legislative success and failure.

Molecular Simulation

Ball model representation of a crystal surface with steps, kinks, adatoms, and vacancies showing various surface features

LAMMPS Tutorial: Copper and Platinum Adatom Diffusion

Step-by-step LAMMPS tutorial for simulating copper and platinum adatom diffusion. Learn surface dynamics simulation, trajectory analysis, and how atomic mass affects diffusion, with a view to building machine learning datasets.

Computational Biology

Molecular visualization of a methionine dipeptide structure from MD simulation

Generating Mini-Protein Trajectories with GROMACS

A practical guide to simulating mini-proteins with GROMACS, from alanine dipeptide to tryptophan systems, for generating ML training data.

Computational Biology

Mini-Protein Trajectory Generation

An automated GROMACS pipeline for generating molecular dynamics datasets suitable for machine learning, simulating capped dipeptides across nine residue types with 0.1 ps force-output resolution and atomic force extraction for training Neural Network Potentials.

Computational Social Science

Top features for Social Welfare policy classification showing social, poverty, benefits keywords

Congressional Knowledge Graph & Policy Classification

A computational social science project that built a 47,000+ bill dataset from Congress.gov (115th-117th Congresses), with a co-sponsorship legislative graph and TF-IDF baseline models for 33-class policy-area classification (up to ~0.89 weighted F1 on full text), now available on Hugging Face.

Computational Social Science

Diagram of the Universal Message schema showing fields like ID, Text, Author, and Reply Sets that normalize data across platforms

Look, Don't Tweet: Unified Data Models for Social NLP

Bachelor’s thesis introducing PyConversations, an open-source library that normalizes over 308 million posts from Twitter, Reddit, Facebook, and 4chan into a unified data model for cross-platform social media research.

Computational Social Science

NewsTweet data collection pipeline: news outlets are crawled via Google News RSS feeds, articles are accessed to extract embedded tweets, and user timelines are downloaded from Twitter

NewsTweet Dataset: Social Media in Digital Journalism

We introduce NewsTweet, a dataset and pipeline for studying embedded tweets in digital journalism, revealing that 13% of Google News articles incorporate tweets and providing insights into how social media becomes newsworthy.

Natural Language Processing

Venn diagram showing semantic overlap between word senses for go, move, and proceed, illustrating our hierarchy induction algorithm

Data-Driven WordNet Construction from Wiktionary

We present an unsupervised algorithm for inducing semantic networks from Wiktionary’s crowd-sourced data, creating a WordNet-like resource more than 5x larger than Princeton WordNet with over 344,000 linked example sentences.

Natural Language Processing

Types and distribution of coreferences in the QuAC dataset, showing dialogue complexity

QuAC: Question Answering in Context Dataset

QuAC introduces a conversational QA dataset that models student-teacher interactions, creating context-dependent questions that test systems’ ability to understand dialogue and resolve references.