Overview

A data science project that explores U.S. Congressional legislative patterns through web scraping, data analysis, and machine learning. It provides tools to collect, analyze, and classify congressional bills, and makes the resulting dataset publicly available.

Key Components

Data Collection

  • Custom web scraper for Congress.gov
  • 47,000+ bills collected from three recent Congresses (115th-117th)
  • Metadata for each bill, including sponsors, committees, and vote records
  • Robust data pipeline with error handling and rate limiting

Analysis & Insights

  • Legislative pattern analysis across different Congresses
  • Exploration of party dynamics and voting behavior
  • Policy area distribution and trending topics
  • Temporal analysis of legislative activity
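
As a rough illustration of the policy area distribution analysis, the sketch below computes the share of bills per policy area within each Congress. The records are made-up toy data, and the field names `congress` and `policy_area` are assumptions, not the project's actual schema.

```python
from collections import Counter, defaultdict

# Toy records for illustration only; field names are assumed, not taken
# from the project's dataset.
bills = [
    {"congress": 115, "policy_area": "Health"},
    {"congress": 115, "policy_area": "Health"},
    {"congress": 115, "policy_area": "Taxation"},
    {"congress": 115, "policy_area": "Education"},
    {"congress": 116, "policy_area": "Health"},
    {"congress": 116, "policy_area": "Taxation"},
]


def policy_area_shares(records):
    """Return {congress: {policy_area: share of that Congress's bills}}."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["congress"]][record["policy_area"]] += 1

    shares = {}
    for congress, counter in counts.items():
        total = sum(counter.values())
        # most_common() orders areas from most to least frequent.
        shares[congress] = {area: n / total for area, n in counter.most_common()}
    return shares


shares = policy_area_shares(bills)
for congress in sorted(shares):
    print(congress, shares[congress])
```

Comparing these shares across Congresses is one simple way to see which topics are gaining or losing attention over time.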

Machine Learning

  • Policy area classification using traditional ML approaches
  • 87%+ accuracy achieved with optimized models
  • Feature engineering from bill text and metadata
  • Model comparison across different algorithms

Technical Implementation

Web Scraping Architecture

  • Selenium-based scraper for dynamic content
  • Robust error handling and retry mechanisms
  • Rate limiting to respect server resources
  • Data validation and cleaning pipelines
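
The retry and rate-limiting ideas above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual scraper code: `RateLimiter`, `fetch_with_retry`, and their parameters are hypothetical names, and the delays are placeholder values.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call: Optional[float] = None  # monotonic time of last call

    def wait(self) -> None:
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def fetch_with_retry(
    fetch: Callable[[], T],
    limiter: RateLimiter,
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> T:
    """Call fetch() up to max_attempts times with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        limiter.wait()  # never hit the server faster than the limiter allows
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Backoff grows as base, 2*base, 4*base, ... plus random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("loop always returns or raises")
```

With a Selenium driver, `fetch` could be a small function that calls `driver.get(url)` and returns the page source. In practice the `except` clause would likely be narrowed to the specific timeout and connection errors worth retrying.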

Analysis Tools

  • Exploratory data analysis with statistical insights
  • Visualization of legislative patterns and trends
  • Text preprocessing for NLP applications
  • Feature extraction for machine learning
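
A minimal sketch of the text preprocessing step, assuming simple lowercasing and alphanumeric tokenization followed by word n-gram extraction. The function names and the tokenization rule are illustrative assumptions; the project's actual preprocessing may differ (for example, it may remove stop words or apply stemming).

```python
import re
from typing import List

# Keep runs of lowercase letters and digits; everything else is a separator.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens: List[str], n_min: int = 1, n_max: int = 2) -> List[str]:
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


title = "To amend the Clean Air Act."
print(ngrams(tokenize(title)))
```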

Machine Learning Pipeline

  • Traditional approaches: Naive Bayes, Logistic Regression, SVM
  • Text vectorization using TF-IDF and N-grams
  • Cross-validation and hyperparameter tuning
  • Model evaluation and performance analysis
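
To make the vectorization step concrete, here is a small standard-library sketch of TF-IDF weighting over pre-extracted terms (unigrams or n-grams). It assumes the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 with L2 normalization, which is one common variant; the source does not state which formula or library the project used, so treat the details as assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf(docs: List[List[str]]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Return (L2-normalized sparse TF-IDF vectors, IDF weight per term)."""
    n_docs = len(docs)

    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for terms in docs:
        df.update(set(terms))

    # Smoothed IDF; terms that occur in every document get weight 1.0.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = []
    for terms in docs:
        tf = Counter(terms)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors, idf


# Toy documents, already tokenized into terms (bigrams would work the same way).
docs = [["health", "care"], ["health", "tax"]]
vectors, idf = tfidf(docs)
print(vectors[0])
```

Terms shared by every document ("health" here) receive the lowest weight, so rarer and more distinctive terms dominate each vector. The resulting vectors would then feed classifiers such as Naive Bayes, Logistic Regression, or an SVM.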

Dataset Contributions

Public Dataset Release

  • Open-source dataset available on Hugging Face
  • Comprehensive documentation with usage examples
  • Reproducible preprocessing scripts included
  • Multiple format support (CSV, JSON, Parquet)
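
As an illustration of multi-format export, the sketch below writes the same records to CSV and JSON with the standard library. The function name, file stem, and record fields are hypothetical; Parquet output is omitted because it requires a third-party library.

```python
import csv
import json
from pathlib import Path


def export_records(records, out_dir, stem="bills"):
    """Write records (a list of flat dicts) to <stem>.csv and <stem>.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})

    csv_path = out / f"{stem}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as ""

    json_path = out / f"{stem}.json"
    with json_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    return csv_path, json_path
```

Note that CSV stores every value as text, so numeric fields need to be converted back when the file is read, whereas JSON preserves basic types.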

Research Applications

The dataset enables research in:

  • Political science and legislative analysis
  • NLP applications on government text
  • Policy trend analysis and prediction
  • Understanding of democratic processes

Results & Impact

  • High-accuracy classification models for policy area prediction
  • Insights into legislative patterns across different political periods
  • Public dataset enabling further research
  • Reproducible methodology for similar data collection projects

This work supports the analyses documented in:

Future Directions

  • Extension to more recent Congresses
  • Advanced NLP approaches (transformers, embeddings)
  • Real-time monitoring of legislative activity
  • Integration with voting record analysis