Introduction

This blog post introduces the use of machine learning to identify the policy areas of congressional bills, focusing on data from the 115th through 117th Congresses (2017-2023). We’ll cover:

  • the fundamentals of bill classification
  • the efficacy of traditional machine learning models (as baselines)
  • the analysis of performance over different time frames and policy domains

Our discussion sets the groundwork for applying advanced deep learning techniques to this task.

Why Classifying Congressional Bills’ Policy Areas Matters

Identifying the policy areas of congressional bills is crucial for several reasons. It offers insight into legislative priorities, enables impact analyses of laws, and promotes transparency in governance. By applying machine learning to classify bills, we make it easier to engage with legislative processes, enhance predictive analytics for future policies, and streamline comparative analyses across Congresses. This approach not only sheds light on the dynamics of policymaking but also democratizes access to legislative information, making it a pivotal step toward more informed public discourse and efficient governance.

At the end of the day, I hope this serves as a first step toward providing more tools that enable people to engage with the legislative process and understand how laws affect their lives.

Data

This data was obtained by scraping the Congress.gov website and covers all bills from the 115th through 117th Congresses. Each bill record includes:

  • Bill ID
  • Bill Title
  • Bill Summary (if available), specifically earliest summary available
  • Bill Text (if available), specifically earliest text available
  • Policy Area

We will seek to classify the policy area of a bill based on a text field (title, summary, or text) of the bill.

$$ f(X) = \hat{y}, \quad \text{where} \quad X \in \{ \text{title}, \text{summary}, \text{text} \}, \quad \hat{y} \in \{ \text{policy areas} \} $$

You can get this data at Hugging Face Datasets: hheiden/us-congress-bill-policy-115_117.
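
If you prefer to pull the data programmatically, a minimal sketch using the Hugging Face datasets library looks something like this (the split and column layout is whatever the dataset card defines, so treat the print as a starting point for inspection):

from datasets import load_dataset

# Download the bill/policy-area dataset from the Hugging Face Hub.
dataset = load_dataset("hheiden/us-congress-bill-policy-115_117")

# Inspect the available splits and columns before doing anything else.
print(dataset)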

Bills

We collect the following bills across the three Congresses:

Congress | Bills
115th | 13,555
116th | 16,601
117th | 17,817
Total | 47,973

Policy Areas

A policy area is a categorical label given to each bill, assigned by Congress.gov (see this glossary).

Including Private Legislation, there are 33 policy areas in this dataset. (The glossary lists 32; Private Legislation is included here because Congress.gov assigns it as a policy area label, even though it is not a policy area in the traditional sense.) These classes are far from balanced, with some policy areas having significantly more bills than others.

The following table shows the number of bills in each policy area across the three Congresses:

Policy Area | 115th | 116th | 117th | Total
Agriculture and Food | 312 | 328 | 398 | 1,038
Animals | 96 | 83 | 71 | 250
Armed Forces and National Security | 1,108 | 1,337 | 1,399 | 3,844
Arts, Culture, Religion | 81 | 79 | 103 | 263
Civil Rights and Liberties, Minority Issues | 175 | 205 | 220 | 600
Commerce | 312 | 593 | 633 | 1,538
Congress | 594 | 541 | 640 | 1,775
Crime and Law Enforcement | 827 | 904 | 1,022 | 2,753
Economics and Public Finance | 176 | 210 | 197 | 583
Education | 607 | 798 | 801 | 2,206
Emergency Management | 207 | 198 | 202 | 607
Energy | 316 | 370 | 530 | 1,216
Environmental Protection | 352 | 423 | 464 | 1,239
Families | 79 | 127 | 139 | 345
Finance and Financial Sector | 556 | 611 | 601 | 1,768
Foreign Trade and International Finance | 120 | 148 | 212 | 480
Government Operations and Politics | 1,008 | 1,258 | 1,272 | 3,538
Health | 1,526 | 2,109 | 2,276 | 5,911
Housing and Community Development | 142 | 250 | 231 | 623
Immigration | 398 | 466 | 591 | 1,455
International Affairs | 918 | 1,178 | 1,390 | 3,486
Labor and Employment | 348 | 452 | 552 | 1,352
Law | 109 | 162 | 175 | 446
Native Americans | 175 | 234 | 245 | 654
Public Lands and Natural Resources | 718 | 648 | 642 | 2,008
Science, Technology, Communications | 389 | 551 | 505 | 1,445
Social Sciences and History | 5 | 6 | 4 | 15
Social Welfare | 177 | 229 | 199 | 605
Sports and Recreation | 92 | 93 | 125 | 310
Taxation | 983 | 1,156 | 1,078 | 3,217
Transportation and Public Works | 492 | 672 | 742 | 1,906
Water Resources Development | 89 | 111 | 110 | 310
Private Legislation | 69 | 71 | 48 | 188

Note that the smallest class (Social Sciences and History) has only 15 bills across the three Congresses, while the largest class (Health) has 5,911. Any model trained on this data will need to account for this class imbalance, as well as the possibility that the few samples in the minority classes are not strongly predictive of other samples in the same class.
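
As a quick sanity check, the imbalance is easy to quantify directly from the label column (a small sketch assuming the same CSV layout and policy_area column used in the experiments below):

import pandas as pd

# Count bills per policy area across all three Congresses.
frames = [pd.read_csv(f'../../../data/congress_{c}_bills.csv') for c in (115, 116, 117)]
counts = pd.concat(frames)['policy_area'].value_counts()

print(counts.head())  # the largest classes (Health, Armed Forces and National Security, ...)
print(counts.tail())  # the smallest classes (Social Sciences and History, ...)
print(f"Largest-to-smallest class ratio: {counts.max() / counts.min():.0f}x")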

Token Statistics

To understand how much text data is present in this dataset, we’ll use spaCy to tokenize the text data. To begin, we’ll look at the token counts for the title, summary, and text of the bills.
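
As a rough sketch, those counts can be computed with spaCy's blank English tokenizer (the post doesn't pin down an exact pipeline; a blank one is the cheapest way to get token counts):

import spacy

# A blank English pipeline gives us just the tokenizer, which is all we need
# for counting tokens, and is far faster than a full statistical model.
nlp = spacy.blank('en')

def token_stats(texts):
    """Average, min, max, and total token counts for an iterable of strings."""
    counts = [len(nlp(text)) for text in texts if isinstance(text, str)]
    return {
        'average': sum(counts) / len(counts),
        'min': min(counts),
        'max': max(counts),
        'total': sum(counts),
    }

# e.g. token_stats(data[115]['title']) for the 115th Congress titles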

Title Token Statistics:

Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens
115th | 12.3 | 1 | 167 | 166,763
116th | 11.3 | 1 | 226 | 188,158
117th | 11.5 | 1 | 272 | 204,978
All | 11.7 | 1 | 272 | 559,419

Summary Token Statistics:

Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens
115th | 109.1 | 2 | 6,839 | 1,479,212
116th | 94.9 | 2 | 5,886 | 1,574,732
117th | 95.1 | 2 | 502 | 1,695,276
All | 99.0 | 2 | 6,839 | 4,749,220

Full Text Token Statistics:

Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens
115th | 2,588.7 | 91 | 304,478 | 35,092,075
116th | 2,760.3 | 70 | 973,173 | 45,824,498
117th | 2,706.7 | 71 | 1,013,608 | 48,224,757
All | - | 70 | 1,013,608 | 129,141,330

These token statistics can serve as a rough guide for how expensive it would be to use each field in a model.

  • For example, there are more than 100 million tokens in the full text of the bills, which would require a significant amount of memory and processing power to handle.

Both title and summary tokens are much more manageable, with averages of roughly 12 and 100 tokens per bill, respectively. We can rapidly prototype models using the title and summary fields, and then consider the full text if those prove insufficient for our task. We won't do hyperparameter tuning on the full text, as the computational cost would likely be prohibitive.

Evaluation Overview

Experiment Design

Our approach involves training models on legislative data from one Congress and testing them on another, allowing us to evaluate both in-sample (same Congress) and out-of-sample (different Congress) performance. This setup is structured into a 3x3 grid, with each cell representing a unique train-test Congress combination. Our goal is a model that performs consistently across all combinations, though we anticipate challenges, especially with temporal differences between Congress sessions.

Metrics and Hyperparameter Tuning

We focus on the weighted-average F1 score to account for class imbalance. Unlike the micro average, it is computed per class, and unlike the macro average, it weights each class by its number of bills, giving a single number that reflects performance across both large and small classes.
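
In scikit-learn terms, this is the average='weighted' option of f1_score. A toy illustration of the difference between the averaging schemes:

from sklearn.metrics import f1_score

# Toy labels: 'Health' dominates, 'Animals' is the minority class.
y_true = ['Health', 'Health', 'Health', 'Animals']
y_pred = ['Health', 'Health', 'Animals', 'Animals']

print(f1_score(y_true, y_pred, average='micro'))     # pooled over all predictions
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average='weighted'))  # per-class F1 weighted by support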

When reporting in-Congress performance, we report metrics as the average cross-validated score across each of the K-fold splits. When reporting out-of-Congress performance, we report metrics as the score on the entire cross-Congress dataset.

For tuning, we use Cross-Validation Grid Search to identify optimal hyperparameters, setting the cross-validation folds to min(3, n_samples) to ensure representation from all classes, including the smallest. The best parameters from the training phase are then applied to refit the model, which is subsequently tested across different Congresses to gauge generalizability and performance consistency.
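
A condensed sketch of that setup with scikit-learn (the sweep helpers used below wrap something similar; the Naive Bayes pipeline here is just a stand-in for whichever model is being tuned):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True)),
    ('clf', MultinomialNB()),
])

# Weighted F1 as the selection metric, 3 stratified folds so even the smallest
# class (15 bills) shows up in every fold, and refit=True so the best parameters
# are retrained on the full training Congress before cross-Congress testing.
search = GridSearchCV(
    pipeline,
    param_grid={'tfidf__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': [1, 0.1, 0.01]},
    scoring='f1_weighted',
    cv=3,
    refit=True,
    n_jobs=-1,
)
# search.fit(data[115]['title'], data[115]['policy_area'])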

Traditional Machine Learning Models

We’ll load our data in a basic dictionary, keyed by the congress number:

import pandas as pd

data = {}
for congress in [115, 116, 117]:
    data[congress] = pd.read_csv(f'../../../data/congress_{congress}_bills.csv')

Preprocessing - Tf-Idf Vectorization

We convert text data into numerical form using the tf-idf vectorization technique, utilizing scikit-learn’s TfidfVectorizer, detailed in the TfidfVectorizer documentation. For each model, we’ll tune over a range of hyperparameters based on how expensive it is to train the model. For the cheapest (Naive Bayes), we’ll perform a very wide grid search, while for the most expensive (XGBoost), we’ll perform a narrow grid search.

Tf-idf stands for term frequency-inverse document frequency, a method that quantifies a word’s relevance in a document relative to its frequency across all documents. This technique not only accounts for the word’s presence but adjusts its importance based on ubiquity, producing a normalized vector ideal for text classification in machine learning.
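
For concreteness, here's the vectorizer applied to a couple of made-up titles (a small illustration of the TfidfVectorizer API; ngram_range, max_df, and min_df are the knobs tuned in the grid searches below):

from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    'A bill to amend the Internal Revenue Code of 1986',
    'A bill to improve rural health care access',
]

vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(titles)  # sparse matrix: one row per title

print(X.shape)                                 # (2, number of unique uni- and bi-grams)
print(vectorizer.get_feature_names_out()[:5])  # a peek at the learned vocabulary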

Baseline Model - Multinomial Naive Bayes

For our first baseline, we employ the Multinomial Naive Bayes model, detailed in the scikit-learn documentation. This model is a staple for text classification due to its simplicity, interpretability, and surprisingly good performance for such a straightforward approach. It operates under the assumption that all features (words) are independent of each other, a naive yet effective premise for many datasets.

A particularly interesting aspect of using Naive Bayes is its benchmarking capability. If a more complex model, like those based on deep learning, fails to outperform Naive Bayes, it prompts a reevaluation of the model’s approach or the data itself. Thus, Multinomial Naive Bayes serves as an essential baseline in text classification tasks.

Post-training, the feature_log_prob_ attribute of the model provides insights into the most influential features (words) for each class. This attribute is invaluable for identifying key predictors within different policy areas, offering a glimpse into potential learning patterns for more complex models and enhancing our understanding of the dataset.
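
As a sketch, pulling the top words per class out of a fitted tf-idf + Naive Bayes pipeline might look like this (the 'tfidf' and 'clf' step names match the grids below, but are otherwise an assumption):

import numpy as np

def top_features_per_class(pipeline, n=10):
    """Top-n highest log-probability tokens for each class of a fitted
    TfidfVectorizer + MultinomialNB pipeline (sketch; step names assumed)."""
    vocab = pipeline.named_steps['tfidf'].get_feature_names_out()
    clf = pipeline.named_steps['clf']
    return {
        label: [vocab[i] for i in np.argsort(log_probs)[::-1][:n]]
        for label, log_probs in zip(clf.classes_, clf.feature_log_prob_)
    }

# e.g. top_features_per_class(search.best_estimator_)['Agriculture and Food']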

You can see the code for training the Naive Bayes model below:
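
The full helper isn't reproduced here; a minimal sketch of what sweep_nb might look like, consistent with the calls and outputs shown later (a scikit-learn Pipeline plus GridSearchCV, looped over train/test Congress pairs), is:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def sweep_nb(data, X_key, y_key, tfidf_params, tfidf_grid, nb_params, nb_grid):
    """Sketch: tune on each Congress, then test the refit model on the others."""
    for train_congress, train_df in data.items():
        pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(**tfidf_params)),
            ('clf', MultinomialNB(**nb_params)),
        ])
        param_grid = {f'tfidf__{k}': v for k, v in tfidf_grid.items()}
        param_grid.update({f'clf__{k}': v for k, v in nb_grid.items()})

        search = GridSearchCV(pipeline, param_grid, scoring='f1_weighted', cv=3, n_jobs=-1)
        search.fit(train_df[X_key].fillna(''), train_df[y_key])

        print(f'Training on Congress {train_congress}')
        print(f'Best score: {search.best_score_:.3f}')
        print(f'Refit Time: {search.refit_time_:.3f}')
        print('Best parameters set:', search.best_params_)

        # Out-of-Congress evaluation with the refit best estimator.
        for test_congress, test_df in data.items():
            if test_congress == train_congress:
                continue
            preds = search.best_estimator_.predict(test_df[X_key].fillna(''))
            f1 = f1_score(test_df[y_key], preds, average='weighted')
            print(f'Testing on Congress {test_congress} F1: {f1}')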

Baseline Model - Logistic Regression

We employ the logistic regression model for this analysis, as detailed in the scikit-learn documentation. This model, functioning as a precursor to more complex deep learning models, offers a straightforward yet powerful approach for text classification. It utilizes the logistic function to convert the linear model’s output into a probability, making it an excellent baseline for comparison.

You can see the code for training the Logistic Regression model below:
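
The sweep_logreg helper presumably follows the same pattern as the sweep_nb sketch above, with only the classifier step swapped out; a sketch of that piece:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same tf-idf front end, different classifier; logreg_params carries
# e.g. max_iter=1000, class_weight='balanced', random_state=42 as used below.
def make_logreg_pipeline(tfidf_params, logreg_params):
    return Pipeline([
        ('tfidf', TfidfVectorizer(**tfidf_params)),
        ('clf', LogisticRegression(**logreg_params)),
    ])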

Baseline Model - XGBoost

For this task, we’re implementing the XGBoost model, a standout in the gradient boosting framework as detailed in the XGBoost documentation. Renowned for its performance in machine learning competitions, XGBoost is a decision tree-based model that frequently surpasses deep learning models on structured data. Its efficacy makes it a top pick for tabular data tasks, including text classification when the data is appropriately preprocessed.

You can see the code for training the XGBoost model below:
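
Likewise, sweep_xgb presumably swaps in xgboost's scikit-learn wrapper; one wrinkle is that XGBoost wants the 33 policy areas encoded as integers, so a sketch might look like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Policy-area strings need to be mapped to integer ids 0..32 before fitting.
label_encoder = LabelEncoder()

def make_xgb_pipeline(tfidf_params=None, xgb_params=None):
    return Pipeline([
        ('tfidf', TfidfVectorizer(**(tfidf_params or {}))),
        ('clf', XGBClassifier(**(xgb_params or {}))),  # e.g. max_depth=6, eta=0.3
    ])

# y_train = label_encoder.fit_transform(train_df['policy_area'])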

Results

Our experiments are broken into the following sections:

  • Title-Only Inputs: Using only the title of the bill as input, with hyperparameter tuning
  • Summary-Only Inputs: Using only the summary of the bill as input, with hyperparameter tuning
  • Full Text Inputs: Using the full text of the bill as input, with the best hyperparameters from the summary-only inputs

Title-Only Inputs

Naive Bayes

Title-only Naive Bayes experiments are run with the following settings:

sweep_nb(
    data, 
    X_key='title', 
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25, 0.5),
        'min_df': (1, 2, 5),
    },
    nb_params={},
    nb_grid={
        'alpha': (1, 0.1, 0.01, 0.001),
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.661
Refit Time: 0.570
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6369760774921475
Testing on Congress 117 F1: 0.5488274400521962

Training on Congress 116
Best score: 0.677
Refit Time: 0.499
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.691175262953872
Testing on Congress 117 F1: 0.6798043069585031

Training on Congress 117
Best score: 0.670
Refit Time: 0.565
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.25
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.6168474701996426
Testing on Congress 116 F1: 0.6981574942116808

Mean fit time: 0.54 ± 0.03s

Which demonstrates:

  • Rapid training times
  • Strong baseline performance
  • Fairly consistent hyperparameters across Congresses
    • Only the max_df parameter varies across Congresses. I'd conjecture this is a side effect of the small dataset size, and that the optimal max_df is likely close to 0.1 (the median value across Congresses).

Plotted, we see decent out-of-congress performance:

Naive Bayes Policy Area Classification F1 Score

As predicted, training and testing on Congresses that are further apart in time usually results in worse performance. Training on the 116th Congress yields the best performance of the three, as it is directly adjacent to both testing Congresses.

Figures: Naive Bayes top features for each of the 33 policy areas, from Agriculture and Food through Private Legislation.

Logistic Regression

Title-only Logistic Regression experiments are run with the following settings:

sweep_logreg(
    data,
    X_key='title',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25),
    },
    logreg_params={
        'max_iter': 1000,
        'random_state': 42,
        'class_weight': 'balanced',
    },
    logreg_grid={
        'C': [0.1, 1, 10],
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.704
Refit Time: 32.063
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6809188275881766
Testing on Congress 117 F1: 0.601917336933838

Training on Congress 116
Best score: 0.714
Refit Time: 31.227
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.7408989977276476
Testing on Congress 117 F1: 0.7200639105208106

Training on Congress 117
Best score: 0.711
Refit Time: 34.083
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.674418393892329
Testing on Congress 116 F1: 0.7405934743144291

Mean fit time: 32.46 ± 1.20s

Which demonstrates:

  • Even stronger performance than Naive Bayes. This is expected, as Logistic Regression is a more complex model.
  • Consistent hyperparameters across Congresses
  • Longer training times, but still manageable

Plotted, we see strong out-of-congress performance:

Logistic Regression Policy Area Classification F1 Score

XGBoost

Title-only XGBoost experiments are run with the following settings:

sweep_xgb(
    data, 
    X_key='title',
    y_key='policy_area',
    tfidf_grid={
        'max_df': (0.05,),
    },
    xgb_grid={
        'max_depth': (6,),
        'eta': (0.3,),
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.591
Refit Time: 198.063
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 116 F1: 0.5649530686141018
Testing on Congress 117 F1: 0.5215939580735101

Training on Congress 116
Best score: 0.600
Refit Time: 264.824
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.6037922738570368
Testing on Congress 117 F1: 0.5965027418245722

Training on Congress 117
Best score: 0.595
Refit Time: 249.799
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.5600491477899472
Testing on Congress 116 F1: 0.60815381664894

Mean fit time: 237.56 ± 28.60s

Which demonstrates:

  • Poor performance compared to the other models
  • Long training times

Plotted, we see poor out-of-congress performance as well (relative to the other models):

XGBoost Policy Area Classification F1 Score

Cost of Training

Summarizing the training times for each model:

Model | Fit Time
Naive Bayes | 0.54 ± 0.03s
LogReg | 32.46 ± 1.20s
XGBoost | 237.56 ± 28.60s

Honestly, I am a bit surprised by the low performance of XGBoost. I would have expected it to outperform Logistic Regression, but either the simplicity of the Logistic Regression model is a better fit for this task, or the hyperparameters I chose for XGBoost were not optimal. Given how (relatively) expensive XGBoost is to train, I'm not going to spend time tuning it further. I'll focus on tuning Logistic Regression and Naive Bayes, and then use the best hyperparameters from those models to train a model on the full text of the bills.

I’d much rather push ahead towards the pre-trained deep learning models than spend my time hand-tuning XGBoost. If you have alternative suggestions for hyperparameters to try, I’d be happy to hear them (or see them using this data & code).

Summary-Only Inputs

Naive Bayes

Summary-only Naive Bayes experiments are run with the following settings:

sweep_nb(
    data, 
    X_key='summary', 
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25, 0.5),
        'min_df': (1, 2, 5),
    },
    nb_params={},
    nb_grid={
        'alpha': (1, 0.1, 0.01, 0.001),
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.852
Refit Time: 3.420
Best parameters set:
	clf__alpha: 0.001
	tfidf__max_df: 0.5
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.8200213925851242
Testing on Congress 117 F1: 0.7741146840296312

Training on Congress 116
Best score: 0.859
Refit Time: 3.456
Best parameters set:
	clf__alpha: 0.001
	tfidf__max_df: 0.25
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.859492569524175
Testing on Congress 117 F1: 0.8564757791353249

Training on Congress 117
Best score: 0.855
Refit Time: 3.422
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 2
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.8211004582388687
Testing on Congress 116 F1: 0.866928380103485

Mean fit time: 3.43 ± 0.02s

or as a plot:

Naive Bayes Policy Area Classification F1 Score

This shows a significant lift in performance over the title-only inputs: we now exceed an F1 score of 0.80 across the board, except when training on the 115th Congress and testing on the 117th Congress, the pair farthest apart forward in time.

Logistic Regression

Summary-only Logistic Regression experiments are run with the following settings:

sweep_logreg(
    data,
    X_key='summary',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        # 'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25),
    },
    logreg_params={
        'max_iter': 1000,
        'random_state': 42,
        'class_weight': 'balanced',
    },
    logreg_grid={
        'C': [0.1, 1, 10],
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.862
Refit Time: 9.007
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 116 F1: 0.8284864693401133
Testing on Congress 117 F1: 0.7934161507811646

Training on Congress 116
Best score: 0.865
Refit Time: 13.897
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8637852557418315
Testing on Congress 117 F1: 0.8594775615031977

Training on Congress 117
Best score: 0.862
Refit Time: 12.167
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8355736563084967
Testing on Congress 116 F1: 0.8696403838390832

Mean fit time: 11.69 ± 2.02s

And plotted:

Logistic Regression Policy Area Classification F1 Score

Overall, it's nice to see slightly better performance from the Logistic Regression model than from Naive Bayes. We see similar trends in the performance lifts, and my hunch is that the two models are largely predicting off the same features, with Logistic Regression better capturing the relationships between those features and the target. That said, we'd need more rigorous testing to confirm that before acting on such an assumption.

Full Text Inputs

Naive Bayes

For the full-text setting, we’ll use our same functions as before but restrict to just one set of hyperparameters:

sweep_nb(
    data, 
    X_key='text', 
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        'ngram_range': [(1, 2),],
        'max_df': (0.05,),
        'min_df': (1,),
    },
    nb_params={},
    nb_grid={
        'alpha': (0.01,),
    },
)

Output:

Training on Congress 115
Best score: 0.838
Refit Time: 43.686
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.8239477589096698
Testing on Congress 117 F1: 0.7754084272244569

Training on Congress 116
Best score: 0.845
Refit Time: 52.626
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.8589588789759509
Testing on Congress 117 F1: 0.8497483570393742

Training on Congress 117
Best score: 0.847
Refit Time: 53.883
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.8168591723584602
Testing on Congress 116 F1: 0.8586192558317293

Mean fit time: 50.07 ± 4.54s

Plotted:

Naive Bayes Policy Area Classification F1 Score

Interestingly, we see a reduction in performance compared with the summary-only inputs, likely due to the increased noise in the full text. A better set of hyperparameters might help, but the fit time is already high relative to the summary-only inputs, so I'm not going to spend time tuning it further.

Logistic Regression

As with the Naive Bayes model, we'll use the same function as before but restrict it to a single set of hyperparameters:

sweep_logreg(
    data,
    X_key='text',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        # 'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.25,),
    },
    logreg_params={
        'max_iter': 1000,
        'random_state': 42,
        'class_weight': 'balanced',
    },
    logreg_grid={
        'C': [10,],
    },
)

Output:

Training on Congress 115
Best score: 0.875
Refit Time: 56.319
Best parameters set:
	clf__C: 10
Testing on Congress 116 F1: 0.8673497383529855
Testing on Congress 117 F1: 0.8354891492120329

Training on Congress 116
Best score: 0.878
Refit Time: 69.198
Best parameters set:
	clf__C: 10
Testing on Congress 115 F1: 0.8888145116306644
Testing on Congress 117 F1: 0.8823083848346037

Training on Congress 117
Best score: 0.879
Refit Time: 87.444
Best parameters set:
	clf__C: 10
Testing on Congress 115 F1: 0.8611970979741099
Testing on Congress 116 F1: 0.8902321184590636

Mean fit time: 70.99 ± 12.77s

Logistic Regression Policy Area Classification F1 Score

Interestingly, we actually find the full text to produce the best model. Notably:

  • We barely restrict the vocabulary (only a max_df cutoff), relying on regularization to prevent overfitting
  • Because the tf-idf features are sparse and the model is linear, the cost of training is not too high for a baseline model
  • We come close to a 0.90 F1 score when testing on the 116th Congress, which is quite impressive; further breakdown and analysis of the model's performance could help us understand where to go next with development

Conclusion

In this post, we've seen that the best-performing baseline for classifying policy areas is a Logistic Regression model trained on the full text of the bill, with the summary-based Logistic Regression model close behind. Note that we're simply trying to establish a decent baseline to compare more complex (read: deep learning-based) models against. We saw how limited the title of the bill is for predicting the policy area, and that while the full text is noisier, a well-regularized linear model can still put it to good use.

If you want to work with the data, I encourage you to check out the Hugging Face dataset: hheiden/us-congress-bill-policy-115_117. I'm setting up a basic classification leaderboard, with this tuned Logistic Regression model as the first entry: you can check out the Policy Area Classification Leaderboard here.

In coming posts, I'll explore how deep learning can be used to tackle this same problem. I'll also work towards expanding this dataset to include more Congresses and more detailed information about the bills.

If you enjoyed this post, consider: