Overview
A data science project that explores U.S. Congressional legislative patterns through web scraping, data analysis, and machine learning. It includes tools to collect, analyze, and classify congressional bills, and the resulting dataset is publicly available.
Key Components
Data Collection
- Custom web scraper for Congress.gov
- More than 47,000 bills collected from the 115th through 117th Congresses (2017-2023)
- Comprehensive metadata including sponsors, committees, vote records
- Robust data pipeline with error handling and rate limiting
Analysis & Insights
- Legislative pattern analysis across different Congresses
- Exploration of party dynamics and voting behavior
- Policy area distribution and trending topics
- Temporal analysis of legislative activity
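
As an illustration of the policy area distribution analysis listed above, the following minimal sketch computes each policy area's share of bills per Congress. The records and the field names `congress` and `policy_area` are hypothetical and are not taken from the dataset's documented schema.

```python
from collections import Counter, defaultdict

# Hypothetical records; the field names are assumptions for illustration.
bills = [
    {"congress": 115, "policy_area": "Health"},
    {"congress": 115, "policy_area": "Taxation"},
    {"congress": 116, "policy_area": "Health"},
    {"congress": 116, "policy_area": "Health"},
    {"congress": 117, "policy_area": "Energy"},
]


def policy_share_by_congress(records):
    """Return {congress: {policy_area: share of that Congress's bills}}."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["congress"]][rec["policy_area"]] += 1

    shares = {}
    for congress, counter in counts.items():
        total = sum(counter.values())
        shares[congress] = {area: n / total for area, n in counter.items()}
    return shares


shares = policy_share_by_congress(bills)
for congress in sorted(shares):
    print(congress, shares[congress])
```

Working with shares rather than raw counts makes Congresses with different bill volumes comparable.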
Machine Learning
- Policy area classification using traditional ML approaches
- Over 87% classification accuracy achieved with tuned models
- Feature engineering from bill text and metadata
- Model comparison across different algorithms
Technical Implementation
Web Scraping Architecture
- Selenium-based scraper for dynamic content
- Robust error handling and retry mechanisms
- Rate limiting to respect server resources
- Data validation and cleaning pipelines
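
The retry and rate-limiting behavior described above might look roughly like the sketch below. It is not the project's actual scraper code: `RateLimiter`, `fetch_with_retry`, and their parameters are hypothetical names, and `fetch` stands in for whatever Selenium call loads a page.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def fetch_with_retry(fetch, url, limiter, max_attempts=3, base_delay=1.0,
                     sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on failure.

    Delays between attempts are base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        limiter.wait()  # respect the rate limit before every request
        try:
            return fetch(url)
        except Exception:  # assumption: a real scraper would catch narrower errors
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Injecting `sleep` and `clock` keeps the logic testable without real delays. In a Selenium scraper, `fetch` could wrap `driver.get(url)` and return `driver.page_source`, and the broad `except Exception` would normally be narrowed to the driver's timeout and connection errors.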
Analysis Tools
- Exploratory data analysis with statistical insights
- Visualization of legislative patterns and trends
- Text preprocessing for NLP applications
- Feature extraction for machine learning
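
A text preprocessing step of the kind listed above could be sketched as follows. The stopword list, the token pattern, and the `preprocess` function are illustrative assumptions, not the project's actual preprocessing rules.

```python
import re

# A tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "on", "or", "by"}

# Alphabetic runs, optionally with one internal apostrophe (e.g. "veterans'").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, drop stopwords and 1-letter tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


print(preprocess("A bill to amend the Internal Revenue Code of 1986."))
# ['bill', 'amend', 'internal', 'revenue', 'code']
```

Digits are dropped by this pattern, which removes years and section numbers; whether that is desirable depends on the downstream features.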
Machine Learning Pipeline
- Traditional approaches: Naive Bayes, Logistic Regression, SVM
- Text vectorization using TF-IDF and N-grams
- Cross-validation and hyperparameter tuning
- Model evaluation and performance analysis
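
To make the vectorization step above concrete, here is a small standard-library sketch of TF-IDF weighting over unigrams and bigrams. It covers only feature extraction, not the classifiers, and it is an assumption-laden illustration: the project's actual library, tokenizer, and weighting variant are not stated here. The idf formula used is the smoothed form ln((1 + N) / (1 + df)) + 1, followed by L2 normalization.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max, joined by spaces."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return grams


def tfidf(docs, n_max=2):
    """Return one {term: weight} dict per document, L2-normalized."""
    term_lists = [ngrams(doc.lower().split(), n_max) for doc in docs]
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))

    vectors = []
    for terms in term_lists:
        tf = Counter(terms)  # raw term counts within this document
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical bill titles, used only to exercise the functions.
vectors = tfidf(["health care reform", "tax reform act"])
print(sorted(vectors[0]))
```

Terms that appear in every document, such as "reform" in this toy input, receive a lower weight than terms unique to one document, which is the property that makes TF-IDF useful for distinguishing policy areas.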
Dataset Contributions
Public Dataset Release
- Open-source dataset available on Hugging Face
- Comprehensive documentation with usage examples
- Reproducible preprocessing scripts included
- Multiple format support (CSV, JSON, Parquet)
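
As a rough illustration of what multiple format support involves, the sketch below serializes the same records to CSV and to JSON Lines using only the standard library. The records and field names are hypothetical, the choice of JSON Lines over a single JSON array is an assumption, and Parquet is omitted because it requires a third-party library.

```python
import csv
import io
import json

# Hypothetical records; these field names are not the dataset's documented schema.
records = [
    {"bill_id": "example-1", "congress": 115, "policy_area": "Taxation"},
    {"bill_id": "example-2", "congress": 116, "policy_area": "Health"},
]


def to_csv(rows):
    """Serialize a non-empty list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(rows):
    """Serialize a list of dicts to JSON Lines (one JSON object per line)."""
    return "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)


csv_text = to_csv(records)
jsonl_text = to_json_lines(records)
print(csv_text)
print(jsonl_text)
```

Note that CSV loses type information: reading `csv_text` back yields the string "115" for `congress`, whereas the JSON Lines form preserves the integer.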
Research Applications
The dataset enables research in:
- Political science and legislative analysis
- NLP applications on government text
- Policy trend analysis and prediction
- Understanding of democratic processes
Results & Impact
- High-accuracy classification models for policy area prediction
- Insights into legislative patterns across different political periods
- Public dataset enabling further research
- Reproducible methodology for similar data collection projects
Related Publications
This work supports the analyses documented in:
Future Directions
- Extension to more recent Congresses
- Advanced NLP approaches (transformers, embeddings)
- Real-time monitoring of legislative activity
- Integration with voting record analysis