Congressional Knowledge Graph & Policy Classification

Abstract

A computational social science project that constructed a dataset of 47,000+ US congressional bills by extracting legislative text and metadata from the 115th-117th Congresses. The project creates a novel “legislative graph” (linking sponsors, committees, and bill text) and establishes a machine learning benchmark for policy area classification (87% accuracy), now hosted on Hugging Face to support reproducible political science research.

The Engineering Challenge

Intelligent Data Acquisition

Instead of using standard APIs with limited rate limits, I engineered a custom Selenium-based extraction engine to handle Congress.gov’s complex DOM structures.

Optimization: Targeted aggregate endpoints (e.g., /all-info) to reduce HTTP request volume by ~90% per bill.
Resilience: Implemented a local caching layer to store raw HTML, separating the fetch step from the parse step. This ensured 100% reproducibility and minimized server load during iterative development.
Graph construction: Beyond simple text, the script extracts relational data including co-sponsorship networks, committee assignments, and related bill lineage.

Natural Language Processing

Corpus construction: Cleaned and normalized legislative text, removing procedural artifacts (e.g., “A BILL TO…”) to isolate semantic policy content.
Feature engineering: Utilized TF-IDF vectorization with N-gram analysis to capture legislative jargon.
Modeling: Benchmarked Naive Bayes, Logistic Regression, and SVMs, achieving 87.3% accuracy on policy area prediction (cross-validated).

Key Insights

The “partisan vocabulary”: Feature importance analysis revealed distinct linguistic markers separating Democratic and Republican legislation, identifiable even without metadata.
Temporal drift: Policy priorities and terminology showed measurable shifts across congressional sessions (115th vs 117th).
Classification success: Simple linear models (SVM/LogReg) proved remarkably effective at distinguishing policy domains, suggesting legislative language is highly structured.

Impact & Deliverables

Hugging Face dataset: Released the first machine-readable, ML-ready dataset of modern bills, democratizing access for researchers.
Open source tooling: Published the scraper and parsing logic to allow others to extend the dataset to future congresses.
Academic benchmark: Establishing a clear baseline for “Government NLP” tasks, aiding in the automated transparency and monitoring of new legislation.

Project Details
Author	Hunter Heidenreich
Category	Computational Social Science
Type	Data Analysis
Status	Complete
Date	March 2023
Links	📈 Analysis • 🏛️ Data Exploration • 📊 Dataset • 💻 Code

Abstract#

The Engineering Challenge#

Intelligent Data Acquisition#

Natural Language Processing#

Key Insights#

Impact & Deliverables#

Related Work#