Introduction
This blog post introduces the use of machine learning to identify the policy areas of congressional bills, focusing on data from the 115th through 117th Congresses (2017-2023). We’ll cover:
- the fundamentals of bill classification
- the efficacy of traditional machine learning models (as baselines)
- the analysis of performance over different time frames and policy domains

Our discussion sets the groundwork for applying advanced deep learning techniques to this task.
Why Classifying Congressional Bills’ Policy Areas Matters
Identifying the policy areas of congressional bills is crucial for several reasons. It offers insight into legislative priorities, enables impact analyses of laws, and promotes transparency in governance. By applying machine learning to classify bills, we make it easier to engage with legislative processes, enhance predictive analytics for future policies, and streamline comparative analyses across Congress sessions. This approach not only sheds light on the dynamics of policymaking but also democratizes access to legislative information, making it a pivotal step towards more informed public discourse and efficient governance.
At the end of the day, I hope this serves as a first step in providing more tools that enable people to engage with the legislative process and understand the impact of laws on their lives.
Data
The data was obtained by scraping the Congress.gov website and covers all bills from the 115th through 117th Congresses. Each bill record includes:
- Bill ID
- Bill Title
- Bill Summary (if available), specifically the earliest available summary
- Bill Text (if available), specifically the earliest available text
- Policy Area
We will seek to classify the policy area of a bill based on a text field (title, summary, or text) of the bill.
$$ f(X) = \hat{y}, \quad \text{where} \quad X \in \{ \text{title}, \text{summary}, \text{text} \}, \quad \hat{y} \in \{ \text{policy areas} \} $$
You can get this data at Hugging Face Datasets: hheiden/us-congress-bill-policy-115_117.
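If you’d rather pull the data programmatically, here’s a minimal sketch using the `datasets` library; the exact splits and column names are assumptions, so check the dataset card:

```python
from datasets import load_dataset

# Load the scraped bill data from the Hugging Face Hub
ds = load_dataset("hheiden/us-congress-bill-policy-115_117")
print(ds)  # inspect the available splits and columns
```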
Bills
We collect the following bills across the three Congresses:
Congress | Bills |
---|---|
115th | 13,555 |
116th | 16,601 |
117th | 17,817 |
Total | 47,973 |
Policy Areas
A policy area is a categorical label given to each bill, assigned by Congress.gov (see this glossary). Including Private Legislation, there are 33 policy areas in this dataset (while the glossary lists 32, Private Legislation is included since it appears as a policy area label, even if it is not a policy area in the traditional sense).
However, these classes are not balanced, with some policy areas having significantly more bills than others.
The following table shows the number of bills in each policy area across the three Congresses:
Policy Area | 115th | 116th | 117th | Total |
---|---|---|---|---|
Agriculture and Food | 312 | 328 | 398 | 1,038 |
Animals | 96 | 83 | 71 | 250 |
Armed Forces and National Security | 1,108 | 1,337 | 1,399 | 3,844 |
Arts, Culture, Religion | 81 | 79 | 103 | 263 |
Civil Rights and Liberties, Minority Issues | 175 | 205 | 220 | 600 |
Commerce | 312 | 593 | 633 | 1,538 |
Congress | 594 | 541 | 640 | 1,775 |
Crime and Law Enforcement | 827 | 904 | 1,022 | 2,753 |
Economics and Public Finance | 176 | 210 | 197 | 583 |
Education | 607 | 798 | 801 | 2,206 |
Emergency Management | 207 | 198 | 202 | 607 |
Energy | 316 | 370 | 530 | 1,216 |
Environmental Protection | 352 | 423 | 464 | 1,239 |
Families | 79 | 127 | 139 | 345 |
Finance and Financial Sector | 556 | 611 | 601 | 1,768 |
Foreign Trade and International Finance | 120 | 148 | 212 | 480 |
Government Operations and Politics | 1,008 | 1,258 | 1,272 | 3,538 |
Health | 1,526 | 2,109 | 2,276 | 5,911 |
Housing and Community Development | 142 | 250 | 231 | 623 |
Immigration | 398 | 466 | 591 | 1,455 |
International Affairs | 918 | 1,178 | 1,390 | 3,486 |
Labor and Employment | 348 | 452 | 552 | 1,352 |
Law | 109 | 162 | 175 | 446 |
Native Americans | 175 | 234 | 245 | 654 |
Public Lands and Natural Resources | 718 | 648 | 642 | 2,008 |
Science, Technology, Communications | 389 | 551 | 505 | 1,445 |
Social Sciences and History | 5 | 6 | 4 | 15 |
Social Welfare | 177 | 229 | 199 | 605 |
Sports and Recreation | 92 | 93 | 125 | 310 |
Taxation | 983 | 1,156 | 1,078 | 3,217 |
Transportation and Public Works | 492 | 672 | 742 | 1,906 |
Water Resources Development | 89 | 111 | 110 | 310 |
Private Legislation | 69 | 71 | 48 | 188 |
Note that the smallest class (Social Sciences and History) has only 15 bills across the three Congresses, while the largest class (Health) has 5,911 bills.
Any model trained on this data will need to account for this class imbalance, and the likelihood that the samples present in the minority classes might not be strongly predictive of other samples in the same class.
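One common way to account for this is to weight classes inversely to their frequency; scikit-learn can compute such weights directly (and the `class_weight='balanced'` option used for Logistic Regression later does the equivalent automatically). A minimal sketch with toy labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mimicking the skew between a large and a tiny policy area
y = np.array(["Health"] * 50 + ["Social Sciences and History"] * 2)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))  # rare classes get proportionally larger weights
```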
Token Statistics
To understand how much text is present in this dataset, we’ll use spaCy to tokenize it. To begin, we’ll look at the token counts for the title, summary, and text of the bills.
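For reference, here’s a minimal sketch of the token counting; I’m assuming a blank English pipeline (tokenizer only), and the titles below are made up for illustration:

```python
import spacy

# A blank pipeline gives us just the tokenizer, which is all we need for counts
nlp = spacy.blank("en")

titles = [
    "A bill to amend title XVIII of the Social Security Act.",
    "To designate a facility of the United States Postal Service.",
]
counts = [len(doc) for doc in nlp.pipe(titles)]
print(counts, sum(counts))  # per-title counts and the total
```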
Title Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 12.3 | 1 | 167 | 166,763 |
116th | 11.3 | 1 | 226 | 188,158 |
117th | 11.5 | 1 | 272 | 204,978 |
All | 11.7 | 1 | 272 | 559,419 |
Summary Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 109.1 | 2 | 6,839 | 1,479,212 |
116th | 94.9 | 2 | 5,886 | 1,574,732 |
117th | 95.1 | 2 | 502 | 1,695,276 |
All | 99.0 | 2 | 6,839 | 4,749,220 |
Full Text Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 2,588.7 | 91 | 304,478 | 35,092,075 |
116th | 2,760.3 | 70 | 973,173 | 45,824,498 |
117th | 2,706.7 | 71 | 1,013,608 | 48,224,757 |
All | - | 70 | 1,013,608 | 129,141,330 |
These token statistics can serve as a rough guide for how expensive it would be to use each field in a model.
- For example, there are more than 100M tokens in the full text of the bills, which would require a significant amount of memory and processing power to handle.
Both title and summary token counts are much more manageable, averaging around 12 and 100 tokens respectively. We can rapidly prototype models using the titles and summaries, and then consider the full text of the bills if those prove insufficient for our task. We won’t do hyperparameter tuning on the full text, as the computational cost would likely be prohibitive.
Evaluation Overview
Experiment Design
Our approach involves training models on legislative data from one Congress and testing them on another, allowing us to evaluate both in-sample (same Congress) and out-of-sample (different Congress) performance. This setup is structured into a 3x3 grid, with each cell representing a unique train-test Congress combination. Our goal is a model that performs consistently across all combinations, though we anticipate challenges, especially with temporal differences between Congress sessions.
Metrics and Hyperparameter Tuning
We focus on the weighted-average F1 score to account for class imbalance. Unlike the micro and macro averages, it computes a per-class F1 and weights each class by its support, so the overall score reflects performance on both large and small classes in proportion to their size.
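For reference, this is the metric as scikit-learn computes it (toy labels just to illustrate the call):

```python
from sklearn.metrics import f1_score

y_true = ["Health", "Health", "Taxation", "Animals"]
y_pred = ["Health", "Taxation", "Taxation", "Health"]

# Per-class F1, averaged with each class weighted by its support
print(f1_score(y_true, y_pred, average="weighted"))
```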
When reporting in-Congress performance, we report metrics as the average cross-validated score across each of the K-fold splits. When reporting out-of-Congress performance, we report metrics as the score on the entire cross-Congress dataset.
For tuning, we use cross-validated grid search to identify optimal hyperparameters, setting the number of cross-validation folds to `min(3, n_samples)` to ensure representation from all classes, including the smallest. The best parameters from the training phase are then used to refit the model, which is subsequently tested on the other Congresses to gauge generalizability and performance consistency.
Traditional Machine Learning Models
We’ll load our data in a basic dictionary, keyed by the congress number:
import pandas as pd

# One DataFrame per Congress, keyed by Congress number (115, 116, 117)
data = {}
for congress in [115, 116, 117]:
    data[congress] = pd.read_csv(f'../../../data/congress_{congress}_bills.csv')
Preprocessing - Tf-Idf Vectorization
We convert text data into numerical form using the tf-idf vectorization technique, utilizing scikit-learn’s `TfidfVectorizer`, detailed in the TfidfVectorizer documentation. For each model, we’ll tune over a range of hyperparameters sized by how expensive the model is to train: for the cheapest (Naive Bayes) we’ll perform a very wide grid search, while for the most expensive (XGBoost) we’ll perform a narrow one.
Tf-idf stands for term frequency-inverse document frequency, a method that quantifies a word’s relevance in a document relative to its frequency across all documents. This technique not only accounts for the word’s presence but adjusts its importance based on ubiquity, producing a normalized vector ideal for text classification in machine learning.
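A quick toy illustration of what the vectorizer produces (the real runs sweep the grids shown later in this post):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "a bill to amend the internal revenue code",
    "a bill to improve veterans health care",
]
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(corpus)  # sparse (n_docs, n_terms) matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray().round(2))                # normalized tf-idf weights
```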
Baseline Model - Multinomial Naive Bayes
For our first baseline, we employ the Multinomial Naive Bayes model, detailed in the scikit-learn documentation. This model is a staple for text classification due to its simplicity, interpretability, and surprisingly good performance for such a straightforward approach. It operates under the assumption that all features (words) are independent of each other, a naive yet effective premise for many datasets.
A particularly interesting aspect of using Naive Bayes is its benchmarking capability. If a more complex model, like those based on deep learning, fails to outperform Naive Bayes, it prompts a reevaluation of the model’s approach or the data itself. Thus, Multinomial Naive Bayes serves as an essential baseline in text classification tasks.
Post-training, the `feature_log_prob_` attribute of the model provides insights into the most influential features (words) for each class. This attribute is invaluable for identifying key predictors within different policy areas, offering a glimpse into potential learning patterns for more complex models and enhancing our understanding of the dataset.
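As a rough sketch of how that inspection might look, assuming a fitted Pipeline with the `tfidf` and `clf` step names that show up in the grid-search outputs later in this post:

```python
import numpy as np

def top_words_per_class(pipe, n_top=10):
    """Print the n_top highest-probability words for each policy area.

    `pipe` is assumed to be a fitted Pipeline([('tfidf', ...), ('clf', MultinomialNB())]).
    """
    vocab = pipe.named_steps["tfidf"].get_feature_names_out()
    nb = pipe.named_steps["clf"]
    for label, log_probs in zip(nb.classes_, nb.feature_log_prob_):
        top = vocab[np.argsort(log_probs)[::-1][:n_top]]
        print(f"{label}: {', '.join(top)}")
```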
You can see the code for training the Naive Bayes model below:
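The full helper isn’t reproduced here, but here’s a minimal sketch of how a `sweep_nb` function might be structured, assuming it wraps a scikit-learn Pipeline and GridSearchCV; the `tfidf`/`clf` step names match the parameter prefixes in the outputs below, and details like refit-time reporting are omitted:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

def sweep_nb(data, X_key, y_key, tfidf_params, tfidf_grid, nb_params, nb_grid):
    for train_congress, train_df in data.items():
        print(f"Training on Congress {train_congress}")
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(**tfidf_params)),
            ("clf", MultinomialNB(**nb_params)),
        ])
        # Prefix grid keys with the pipeline step names (tfidf__, clf__)
        grid = {f"tfidf__{k}": v for k, v in tfidf_grid.items()}
        grid.update({f"clf__{k}": v for k, v in nb_grid.items()})

        # The post caps the fold count at min(3, n_samples); 3 is used here for brevity
        search = GridSearchCV(pipe, grid, scoring="f1_weighted", cv=3, refit=True)
        search.fit(train_df[X_key].fillna(""), train_df[y_key])
        print(f"Best score: {search.best_score_:.3f}")

        # Evaluate the refit pipeline on the other Congresses
        for test_congress, test_df in data.items():
            if test_congress == train_congress:
                continue
            preds = search.predict(test_df[X_key].fillna(""))
            score = f1_score(test_df[y_key], preds, average="weighted")
            print(f"Testing on Congress {test_congress} F1: {score}")
```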
Baseline Model - Logistic Regression
We employ the logistic regression model for this analysis, as detailed in the scikit-learn documentation. This model, functioning as a precursor to more complex deep learning models, offers a straightforward yet powerful approach for text classification. It utilizes the logistic function to convert the linear model’s output into a probability, making it an excellent baseline for comparison.
You can see the code for training the Logistic Regression model below:
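Structurally, the sweep follows the same sketch as for Naive Bayes; only the classifier step changes. For example, mirroring the settings used in the sweeps below:

```python
from sklearn.linear_model import LogisticRegression

# Drop-in replacement for the 'clf' step in the tf-idf pipeline;
# class_weight='balanced' counteracts the policy-area imbalance.
clf = LogisticRegression(max_iter=1000, random_state=42, class_weight="balanced")
```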
Baseline Model - XGBoost
For this task, we’re implementing the XGBoost model, a standout in the gradient boosting framework as detailed in the XGBoost documentation. Renowned for its performance in machine learning competitions, XGBoost is a decision tree-based model that frequently surpasses deep learning models on structured data. Its efficacy makes it a top pick for tabular data tasks, including text classification when the data is appropriately preprocessed.
You can see the code for training the XGBoost model below:
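Again, only the classifier step changes in the sketch. Note that `eta` in the sweeps below is XGBoost’s name for the learning rate; the scikit-learn wrapper slots into the same tf-idf pipeline:

```python
from xgboost import XGBClassifier

# Drop-in 'clf' step for the same tf-idf pipeline; the multiclass objective
# is inferred from the labels, and max_depth/learning_rate mirror the grid below.
clf = XGBClassifier(max_depth=6, learning_rate=0.3)
```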
Results
Our experiments are broken into the following sections:
- Title-Only Inputs: Using only the title of the bill as input, with hyperparameter tuning
- Summary-Only Inputs: Using only the summary of the bill as input, with hyperparameter tuning
- Text-Only Inputs: Using only the text of the bill as input, using the best hyperparameters from the summary-only inputs
Title-Only Inputs
Naive Bayes
Title-only Naive Bayes experiments are run with the following settings:
sweep_nb(
data,
X_key='title',
y_key='policy_area',
tfidf_params={
'lowercase': True,
'dtype': np.float32,
},
tfidf_grid={
'ngram_range': [(1, 1), (1, 2)],
'max_df': (0.05, 0.1, 0.25, 0.5),
'min_df': (1, 2, 5),
},
nb_params={},
nb_grid={
'alpha': (1, 0.1, 0.01, 0.001),
},
)
which yields the following output:
Training on Congress 115
Best score: 0.661
Refit Time: 0.570
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.05
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6369760774921475
Testing on Congress 117 F1: 0.5488274400521962
Training on Congress 116
Best score: 0.677
Refit Time: 0.499
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.05
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.691175262953872
Testing on Congress 117 F1: 0.6798043069585031
Training on Congress 117
Best score: 0.670
Refit Time: 0.565
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.25
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.6168474701996426
Testing on Congress 116 F1: 0.6981574942116808
Mean fit time: 0.54 ± 0.03s
Which demonstrates:
- Rapid training times
- Strong baseline performance
- Fairly consistent hyperparameters across Congresses
  - Only the `max_df` parameter varies across Congresses. I’d conjecture this is a side-effect of small dataset size, and that the optimal `max_df` is likely to be close to 0.1 (as a median value across Congresses).
Plotted, we see decent out-of-congress performance:
As predicted, training on Congresses further apart in time usually results in worse performance. Training on the 116th Congress yields the best performance of the three, as it is directly adjacent to both testing Congresses.
Logistic Regression
Title-only Logistic Regression experiments are run with the following settings:
sweep_logreg(
data,
X_key='title',
y_key='policy_area',
tfidf_params={
'lowercase': True,
'dtype': np.float32,
},
tfidf_grid={
'ngram_range': [(1, 1), (1, 2)],
'max_df': (0.05, 0.1, 0.25),
},
logreg_params={
'max_iter': 1000,
'random_state': 42,
'class_weight': 'balanced',
},
logreg_grid={
'C': [0.1, 1, 10],
},
)
which yields the following output:
Training on Congress 115
Best score: 0.704
Refit Time: 32.063
Best parameters set:
clf__C: 10
tfidf__max_df: 0.05
tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6809188275881766
Testing on Congress 117 F1: 0.601917336933838
Training on Congress 116
Best score: 0.714
Refit Time: 31.227
Best parameters set:
clf__C: 10
tfidf__max_df: 0.05
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.7408989977276476
Testing on Congress 117 F1: 0.7200639105208106
Training on Congress 117
Best score: 0.711
Refit Time: 34.083
Best parameters set:
clf__C: 10
tfidf__max_df: 0.05
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.674418393892329
Testing on Congress 116 F1: 0.7405934743144291
Mean fit time: 32.46 ± 1.20s
Which demonstrates:
- Even stronger performance than Naive Bayes. This is expected, as Logistic Regression is a more complex model.
- Consistent hyperparameters across Congresses
- Longer training times, but still manageable
Plotted, we see strong out-of-congress performance:
XGBoost
Title-only XGBoost experiments are run with the following settings:
sweep_xgb(
data,
X_key='title',
y_key='policy_area',
tfidf_grid={
'max_df': (0.05,),
},
xgb_grid={
'max_depth': (6,),
'eta': (0.3,),
},
)
which yields the following output:
Training on Congress 115
Best score: 0.591
Refit Time: 198.063
Best parameters set:
clf__eta: 0.3
clf__max_depth: 6
clf__num_class: 33
tfidf__max_df: 0.05
Testing on Congress 116 F1: 0.5649530686141018
Testing on Congress 117 F1: 0.5215939580735101
Training on Congress 116
Best score: 0.600
Refit Time: 264.824
Best parameters set:
clf__eta: 0.3
clf__max_depth: 6
clf__num_class: 33
tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.6037922738570368
Testing on Congress 117 F1: 0.5965027418245722
Training on Congress 117
Best score: 0.595
Refit Time: 249.799
Best parameters set:
clf__eta: 0.3
clf__max_depth: 6
clf__num_class: 33
tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.5600491477899472
Testing on Congress 116 F1: 0.60815381664894
Mean fit time: 237.56 ± 28.60s
Which demonstrates:
- Poor performance compared to the other models
- Long training times
Plotted, we see poor out-of-congress performance as well (relative to the other models):
Cost of Training
Summarizing the training times for each model:
Model | Fit Time |
---|---|
Naive Bayes | 0.54 ± 0.03s |
LogReg | 32.46 ± 1.20s |
XGBoost | 237.56 ± 28.60s |
Honestly, I am a bit surprised by the low performance of XGBoost. I would have expected it to outperform Logistic Regression, but it seems that the simplicity of the Logistic Regression model is a better fit for this task or that the hyperparameters I chose for XGBoost were not optimal. With how (relatively) expensive it is to train XGBoost, I’m not going to spend time tuning it further. I’ll focus on tuning Logistic Regression and Naive Bayes, and then use the best hyperparameters from those models to train a model on the full text of the bills.
I’d much rather push ahead towards the pre-trained deep learning models than spend my time hand-tuning XGBoost. If you have alternative suggestions for hyperparameters to try, I’d be happy to hear them (or see them using this data & code).
Summary-Only Inputs
Naive Bayes
Summary-only Naive Bayes experiments are run with the following settings:
sweep_nb(
data,
X_key='summary',
y_key='policy_area',
tfidf_params={
'lowercase': True,
'dtype': np.float32,
},
tfidf_grid={
'ngram_range': [(1, 1), (1, 2)],
'max_df': (0.05, 0.1, 0.25, 0.5),
'min_df': (1, 2, 5),
},
nb_params={},
nb_grid={
'alpha': (1, 0.1, 0.01, 0.001),
},
)
which yields the following output:
Training on Congress 115
Best score: 0.852
Refit Time: 3.420
Best parameters set:
clf__alpha: 0.001
tfidf__max_df: 0.5
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.8200213925851242
Testing on Congress 117 F1: 0.7741146840296312
Training on Congress 116
Best score: 0.859
Refit Time: 3.456
Best parameters set:
clf__alpha: 0.001
tfidf__max_df: 0.25
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.859492569524175
Testing on Congress 117 F1: 0.8564757791353249
Training on Congress 117
Best score: 0.855
Refit Time: 3.422
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.05
tfidf__min_df: 2
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.8211004582388687
Testing on Congress 116 F1: 0.866928380103485
Mean fit time: 3.43 ± 0.02s
or as a plot:
Which shows a significant lift in performance over the title-only inputs: we now see an F1 score above 80% across the board, except when training on the 115th Congress and testing on the 117th Congress, the pair farthest apart going forward in time.
Logistic Regression
Summary-only Logistic Regression experiments are run with the following settings:
sweep_logreg(
data,
X_key='summary',
y_key='policy_area',
tfidf_params={
'lowercase': True,
'dtype': np.float32,
},
tfidf_grid={
# 'ngram_range': [(1, 1), (1, 2)],
'max_df': (0.05, 0.1, 0.25),
},
logreg_params={
'max_iter': 1000,
'random_state': 42,
'class_weight': 'balanced',
},
logreg_grid={
'C': [0.1, 1, 10],
},
)
which yields the following output:
Training on Congress 115
Best score: 0.862
Refit Time: 9.007
Best parameters set:
clf__C: 10
tfidf__max_df: 0.25
Testing on Congress 116 F1: 0.8284864693401133
Testing on Congress 117 F1: 0.7934161507811646
Training on Congress 116
Best score: 0.865
Refit Time: 13.897
Best parameters set:
clf__C: 10
tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8637852557418315
Testing on Congress 117 F1: 0.8594775615031977
Training on Congress 117
Best score: 0.862
Refit Time: 12.167
Best parameters set:
clf__C: 10
tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8355736563084967
Testing on Congress 116 F1: 0.8696403838390832
Mean fit time: 11.69 ± 2.02s
And plotted:
Overall, it’s nice to see slightly better performance from the Logistic Regression model over the Naive Bayes model. We see similar trends in the performance lifts, and my hunch is that the two models are basically predicting off the same features, but the Logistic Regression model is able to better capture the relationships between the features and the target. That said, we’d need to pin that down with more rigorous testing to be sure that’s the case before acting on such an assumption.
Full Text Inputs
Naive Bayes
For the full-text setting, we’ll use our same functions as before but restrict to just one set of hyperparameters:
sweep_nb(
data,
X_key='text',
y_key='policy_area',
tfidf_params={
'lowercase': True,
'dtype': np.float32,
},
tfidf_grid={
'ngram_range': [(1, 2),],
'max_df': (0.05,),
'min_df': (1,),
},
nb_params={},
nb_grid={
'alpha': (0.01,),
},
)
Output:
Training on Congress 115
Best score: 0.838
Refit Time: 43.686
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.05
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.8239477589096698
Testing on Congress 117 F1: 0.7754084272244569
Training on Congress 116
Best score: 0.845
Refit Time: 52.626
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.05
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.8589588789759509
Testing on Congress 117 F1: 0.8497483570393742
Training on Congress 117
Best score: 0.847
Refit Time: 53.883
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.05
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.8168591723584602
Testing on Congress 116 F1: 0.8586192558317293
Mean fit time: 50.07 ± 4.54s
Plotted:
Interestingly, we see a reduction in performance compared with the summary-only inputs. This is likely due to the increased noise in the full text. Perhaps a better set of hyperparameters would help, but the fit time is quite high for this model relative to the summary-only inputs, so I’m not going to spend time tuning it further.
Logistic Regression
As with the Naive Bayes model, we’ll use our same functions as before but restrict to just one set of hyperparameters:
sweep_logreg(
data,
X_key='text',  # full text of the bill
y_key='policy_area',
tfidf_params={
'lowercase': True,
'dtype': np.float32,
},
tfidf_grid={
# 'ngram_range': [(1, 1), (1, 2)],
'max_df': (0.25,),
},
logreg_params={
'max_iter': 1000,
'random_state': 42,
'class_weight': 'balanced',
},
logreg_grid={
'C': [10,],
},
)
Output:
Training on Congress 115
Best score: 0.875
Refit Time: 56.319
Best parameters set:
clf__C: 10
Testing on Congress 116 F1: 0.8673497383529855
Testing on Congress 117 F1: 0.8354891492120329
Training on Congress 116
Best score: 0.878
Refit Time: 69.198
Best parameters set:
clf__C: 10
Testing on Congress 115 F1: 0.8888145116306644
Testing on Congress 117 F1: 0.8823083848346037
Training on Congress 117
Best score: 0.879
Refit Time: 87.444
Best parameters set:
clf__C: 10
Testing on Congress 115 F1: 0.8611970979741099
Testing on Congress 116 F1: 0.8902321184590636
Mean fit time: 70.99 ± 12.77s
Interestingly, we actually find the full text to produce the best model. Notably:
- We barely restrict the vocabulary (just a `max_df` cutoff), relying mostly on regularization to prevent overfitting
- Because the tf-idf features are sparse and the model is linear, the cost of training stays manageable for a baseline model
- We come close to a 90% F1 score when testing on the 116th Congress, which is quite impressive; a further breakdown and analysis of the model’s performance could help us understand where to go next with development
Conclusion
In this post, we’ve seen that the best-performing baseline for classifying policy areas is a Logistic Regression model trained on the full text of the bill, with the summary-based model close behind at a fraction of the training cost. Note that we’re simply trying to establish a decent baseline to compare more complex (read: deep learning-based) models against. We saw how limited the title of the bill was for predicting the policy area, and how the full text adds noise that can hurt simpler models like Naive Bayes.
If you want to work with the data, I encourage you to check out the Hugging Face Datasets: hheiden/us-congress-bill-policy-115_117. I’m setting up a basic classification leaderboard, and as a first entry I’ve added this tuned Logistic Regression model. You can check out the Policy Area Classification Leaderboard here.
In coming posts, I’ll explore how we can use deep learning to tackle this same problem. I’ll also work towards expanding this dataset to include more Congresses and more detailed information about the bills.
If you enjoyed this post, consider:
- Reading a similar post, such as: US 117th Congress Data Exploration
- Sharing this post with a friend
- Buying me a coffee (see the button below)