Abstract

This data science project scraped more than 47,000 congressional bills from Congress.gov, analyzed legislative patterns, and built ML models that reach 87.3% accuracy on policy area classification. The resulting dataset, published on Hugging Face, supports research in computational political science.

What I Built

Data Collection

  • Web scraper: Selenium-based tool for Congress.gov with error handling and rate limiting
  • Dataset: 47,000+ bills from 115th-117th Congresses with comprehensive metadata
  • Data pipeline: Validation and cleaning workflows for government text data
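
The scraper and cleaning code are not shown here, so the following is only a minimal sketch of how rate limiting, retries, and record validation could fit together. The class name PoliteFetcher, the helper clean_record, the field names (bill_id, title, congress), and the delay values are all hypothetical. The fetch callable stands in for whatever the Selenium driver does to load a page.

```python
import time
from typing import Callable, Optional

REQUIRED_FIELDS = ("bill_id", "title", "congress")  # hypothetical field names


class PoliteFetcher:
    """Wrap any page-fetching callable with a minimum delay and retries."""

    def __init__(self, fetch: Callable[[str], str], min_interval: float = 2.0,
                 max_retries: int = 3, backoff: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self.max_retries = max_retries
        self.backoff = backoff
        self.sleep = sleep
        self.clock = clock
        self._last_request: Optional[float] = None

    def _wait_for_slot(self) -> None:
        # Rate limiting: never issue two requests closer than min_interval.
        if self._last_request is not None:
            elapsed = self.clock() - self._last_request
            if elapsed < self.min_interval:
                self.sleep(self.min_interval - elapsed)
        self._last_request = self.clock()

    def get(self, url: str) -> str:
        delay = self.min_interval
        for attempt in range(1, self.max_retries + 1):
            self._wait_for_slot()
            try:
                return self.fetch(url)
            except Exception:
                # Error handling: give up only after the last attempt.
                if attempt == self.max_retries:
                    raise
                self.sleep(delay)
                delay *= self.backoff  # exponential backoff between retries


def clean_record(record: dict) -> Optional[dict]:
    """Return a cleaned copy of a bill record, or None if it fails validation."""
    # Collapse runs of whitespace and newlines left over from scraped HTML.
    cleaned = {key: " ".join(value.split()) if isinstance(value, str) else value
               for key, value in record.items()}
    if any(not cleaned.get(field) for field in REQUIRED_FIELDS):
        return None
    # The dataset covers the 115th through 117th Congresses.
    if cleaned["congress"] not in (115, 116, 117):
        return None
    return cleaned
```

With Selenium, the fetch callable could be a small function that calls driver.get(url) and returns driver.page_source. Injecting sleep and clock keeps the wrapper testable without real waiting.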

Analysis & Classification

  • Text classification: Compared Naive Bayes, Logistic Regression, and SVM approaches
  • Feature engineering: TF-IDF vectorization with n-gram analysis
  • Performance: 87.3% cross-validated accuracy on policy area prediction
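
The comparison above can be sketched with scikit-learn roughly as follows. This is an illustration under assumptions, not the project's actual code: the function name compare_classifiers, the unigram-plus-bigram range, the English stop-word list, and every hyperparameter are guesses, and LinearSVC is assumed as the SVM variant.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(texts, labels, n_splits=5, seed=0):
    """Return the mean cross-validated accuracy for each candidate model."""
    models = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
    }
    # Stratified folds keep the policy-area proportions similar in every split.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for name, model in models.items():
        # The vectorizer sits inside the pipeline so that vocabulary and IDF
        # weights are learned from the training folds only (no leakage).
        pipeline = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
            model,
        )
        fold_scores = cross_val_score(pipeline, texts, labels, cv=cv,
                                      scoring="accuracy")
        scores[name] = float(fold_scores.mean())
    return scores
```

Calling compare_classifiers(bill_texts, policy_areas) returns a dictionary mapping each model name to its mean accuracy. All three models share the same features and the same folds, so their scores are directly comparable.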

Key Findings

  • Party differences: Democratic and Republican bills differ clearly in their language patterns
  • Temporal trends: Policy priorities shift across congressional sessions and election cycles
  • Classification performance: 87.3% accuracy, with performance holding up across congressional sessions
  • Feature analysis: Identified key textual indicators for automated bill categorization
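
The summary does not say how the distinguishing terms were identified. One simple approach, shown here purely as an assumed illustration, ranks terms by the smoothed log-ratio of their relative frequencies in two groups of bills (two parties, or one policy area against the rest). The names tokenize and distinctive_terms and the smoothing constant alpha are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return TOKEN_RE.findall(text.lower())


def distinctive_terms(docs_a, docs_b, top_k=5, alpha=1.0):
    """Rank terms by how much more frequent they are in docs_a than docs_b.

    The score is log(p_a / p_b) with add-alpha smoothing, so a positive
    score means the term is more characteristic of group A.
    """
    counts_a = Counter(tok for doc in docs_a for tok in tokenize(doc))
    counts_b = Counter(tok for doc in docs_b for tok in tokenize(doc))
    vocab = set(counts_a) | set(counts_b)
    # Smoothing keeps terms that are absent from one group from producing log(0).
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    scores = {}
    for term in vocab:
        p_a = (counts_a[term] + alpha) / total_a
        p_b = (counts_b[term] + alpha) / total_b
        scores[term] = math.log(p_a / p_b)
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

Swapping the two arguments gives the terms most characteristic of the second group. On a real corpus, rare terms would likely need a minimum-count filter before the ranking is meaningful.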

Impact

The work provides:

  • Open dataset: First comprehensive, ML-ready congressional bill dataset on Hugging Face
  • Research tools: Replicable methodology for government text analysis
  • Practical applications: Automated bill categorization for transparency and monitoring systems
  • Academic value: Benchmark dataset for computational political science research