Computational Chemistry
GDB-17 molecule structure showing complex polycyclic architecture

GDB-17: Chemical Universe Database (166.4B Molecules)

GDB-17 contains 166.4 billion systematically generated small organic molecules with up to 17 atoms. It represents the most comprehensive exploration of drug-relevant chemical space achieved through computational enumeration.

Computational Chemistry
3D conformer ensemble of a drug-like molecule from the GEOM dataset

GEOM Dataset: 3D Molecular Conformer Generation

Get a practical overview of the GEOM dataset and learn how it’s advancing 3D molecular machine learning by bridging static graphs and dynamic reality.

Computational Chemistry
3D ball-and-stick model of butane molecule representing the structural isomer generation process

Synthetic Isomer Data Generation Pipeline

An end-to-end data factory for molecular machine learning that transforms raw chemical formulas (e.g., C6H14) into labeled 3D conformer datasets, using MAYGEN for structural isomer enumeration, RDKit for 3D embedding, and physics-based featurization to address data scarcity in computational drug discovery.

Natural Language Processing
Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

What happens when you achieve 99.8% accuracy on sarcasm detection? You might have accidentally built a domain classifier. A cautionary ML tale about dataset bias.

Computational Social Science
Top features for Economics and Public Finance policy classification across Congresses

How Does Congress Actually Work? Data from 15K Bills

Only 2% of congressional bills become law. We analyze 15K bills from 2021-2023 to understand what drives legislative success and failure.

Computational Chemistry
Ball model representation of a crystal surface with steps, kinks, adatoms, and vacancies showing various surface features

LAMMPS Tutorial: Copper and Platinum Adatom Diffusion

Step-by-step LAMMPS tutorial for simulating copper and platinum adatom diffusion. Learn surface dynamics simulation, trajectory analysis, and how atomic mass affects diffusion, with a view to generating machine learning datasets.

Computational Biology
Molecular visualization of a methionine dipeptide structure from MD simulation

Generating Mini-Protein Trajectories with GROMACS

A practical guide to simulating mini-proteins with GROMACS, from alanine dipeptide to tryptophan systems, for generating ML training data.

Computational Biology
Molecular visualization of a methionine dipeptide structure from MD simulation

Mini-Protein Trajectory Generation

An automated GROMACS pipeline for generating high-fidelity molecular dynamics datasets suitable for machine learning, simulating capped dipeptides across nine residue types with 0.1 ps resolution and atomic force extraction optimized for training Neural Network Potentials.

Computational Social Science
Top features for Social Welfare policy classification showing social, poverty, benefits keywords

Congressional Knowledge Graph & Policy Classification

A computational social science project that engineered a custom extraction engine to build a knowledge graph of 47,000+ bills from Congress.gov (115th-117th Congresses). The result is a novel legislative graph with co-sponsorship networks and an 87% accuracy benchmark for policy area classification, now available on Hugging Face.

Computational Social Science
Diagram of the Universal Message schema showing fields like ID, Text, Author, and Reply Sets that normalize data across platforms

Look, Don't Tweet: Unified Data Models for Social NLP

Bachelor’s thesis introducing PyConversations, an open-source library that normalizes over 308 million posts from Twitter, Reddit, Facebook, and 4chan into a unified data model for cross-platform social media research.

Computational Social Science
NewsTweet data collection pipeline: news outlets are crawled via Google News RSS feeds, articles are accessed to extract embedded tweets, and user timelines are downloaded from Twitter

NewsTweet Dataset: Social Media in Digital Journalism

We introduce NewsTweet, a dataset and pipeline for studying embedded tweets in digital journalism, revealing that 13% of Google News articles incorporate tweets and providing insights into how social media becomes newsworthy.

Natural Language Processing
Venn diagram showing semantic overlap between word senses for go, move, and proceed, illustrating our hierarchy induction algorithm

Data-Driven WordNet Construction from Wiktionary

We present an unsupervised algorithm for inducing semantic networks from Wiktionary’s crowd-sourced data, creating a WordNet-like resource that is an order of magnitude larger than Princeton WordNet and includes over 344,000 linked example sentences.