Abstract
A data science project that scraped 47,000+ congressional bills from Congress.gov, analyzed legislative patterns, and built ML models that reach 87.3% accuracy on policy area classification. The resulting dataset, published on Hugging Face, supports research in computational political science.
What I Built
Data Collection
- Web scraper: Selenium-based tool for Congress.gov with error handling and rate limiting
- Dataset: 47,000+ bills from 115th-117th Congresses with comprehensive metadata
- Data pipeline: Validation and cleaning workflows for government text data
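The rate limiting and error handling mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: `PoliteFetcher` and its parameters are hypothetical names, and the fetch step is passed in as a plain callable so the example runs without a browser.

```python
import time


class PoliteFetcher:
    """Wrap a fetch callable with a minimum delay between calls and retries.

    Hypothetical helper for illustration; the names and defaults are assumptions.
    """

    def __init__(self, fetch, min_interval=1.0, max_retries=3, backoff=2.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch                # callable: url -> page content
        self.min_interval = min_interval  # minimum seconds between requests
        self.max_retries = max_retries    # retries after the first attempt
        self.backoff = backoff            # base of the exponential backoff
        self.sleep = sleep                # injectable so tests need not wait
        self.clock = clock
        self._last_call = None

    def _wait_for_slot(self):
        # Rate limiting: pause until min_interval has passed since the last call.
        if self._last_call is not None:
            remaining = self.min_interval - (self.clock() - self._last_call)
            if remaining > 0:
                self.sleep(remaining)
        self._last_call = self.clock()

    def get(self, url):
        last_error = None
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            try:
                return self.fetch(url)
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                last_error = exc
                if attempt < self.max_retries:
                    # Exponential backoff: 1s, 2s, 4s, ... with the default base.
                    self.sleep(self.backoff ** attempt)
        raise RuntimeError(f"giving up on {url}") from last_error
```

With Selenium, the `fetch` callable could be a small function that calls `driver.get(url)` and returns `driver.page_source`. Keeping the browser out of the wrapper makes the retry logic testable on its own.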
Analysis & Classification
- Text classification: Compared Naive Bayes, Logistic Regression, and SVM approaches
- Feature engineering: TF-IDF vectorization over word n-grams
- Performance: 87.3% cross-validated accuracy on policy area prediction
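The comparison described above could be set up along these lines with scikit-learn. This is a sketch under assumptions: the toy corpus, the label names, the n-gram range, and all hyperparameters are invented for illustration and are not the project's configuration, so the scores it prints say nothing about the reported 87.3%.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for bill titles and their policy areas.
TEXTS = [
    "a bill to expand medicare coverage for rural hospitals",
    "to improve veterans health care and hospital access",
    "medicaid funding for community health clinics",
    "public health emergency preparedness and vaccine supply",
    "to authorize appropriations for military construction",
    "national defense authorization for armed forces readiness",
    "to modernize navy shipbuilding and defense procurement",
    "armed forces pay raise and military housing",
    "to reduce carbon emissions from power plants",
    "clean water protection for rivers and wetlands",
    "wildlife conservation and national forest management",
    "to promote renewable energy and reduce air pollution",
]
LABELS = ["Health"] * 4 + ["Defense"] * 4 + ["Environment"] * 4


def compare_models(texts, labels, n_splits=3):
    """Return mean cross-validated accuracy for each candidate classifier."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    models = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
    }
    scores = {}
    for name, clf in models.items():
        # The vectorizer sits inside the pipeline, so TF-IDF statistics are
        # refit on each training fold and never see the held-out fold.
        pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
        fold_scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
        scores[name] = float(fold_scores.mean())
    return scores


if __name__ == "__main__":
    for name, acc in compare_models(TEXTS, LABELS).items():
        print(f"{name}: {acc:.3f}")
```

Putting the vectorizer and classifier in one pipeline is the main design point: fitting TF-IDF on the full dataset before splitting would leak vocabulary and document-frequency information into the evaluation.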
Key Findings
- Party differences: Democratic and Republican bills differ clearly in their language patterns
- Temporal trends: Policy priorities shift across congressional sessions and election cycles
- Classification performance: The 87.3% accuracy generalizes across congressional sessions
- Feature analysis: Identified key textual indicators for automated bill categorization
Impact
The work provides:
- Open dataset: First comprehensive, ML-ready congressional bill dataset on Hugging Face
- Research tools: Replicable methodology for government text analysis
- Practical applications: Automated bill categorization for transparency and monitoring systems
- Academic value: Benchmark dataset for computational political science research