Introduction
This post explores machine learning approaches for classifying congressional bills by policy area, using data from the 115th to 117th Congresses (2017-2023). We’ll examine:
- The fundamentals of bill classification
- Traditional machine learning models as baselines
- Performance analysis across different time periods and policy domains
This work establishes baselines for future deep learning approaches to legislative text classification.
This post builds on the data foundation established in Exploring the 117th U.S. Congress.
Why This Matters
Automatically classifying congressional bills by policy area has practical value for researchers, journalists, and citizens who need to navigate thousands of bills each Congress. Machine learning can help identify patterns in legislative priorities and track policy trends over time.
This work establishes baseline performance for text classification on legislative data, providing a foundation for more sophisticated approaches.
Data
The data comes from scraping Congress.gov for all bills from the 115th through 117th Congresses. Each bill includes:
- Bill ID and title
- Summary (when available) - the earliest summary provided
- Full text (when available) - the earliest text version
- Policy area classification
Our task is to predict policy area from text features:
$$ f(X) = \hat{y}, \quad \text{where} \quad X = \{\text{title}, \text{summary}, \text{text}\}, \quad \hat{y} \in \{\text{policy areas}\} $$
The complete dataset is available at Hugging Face: hheiden/us-congress-bill-policy-115_117.
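For readers who want the raw data, it can be pulled straight from the Hub with the datasets library. A minimal sketch; the split and column names below are assumptions about the dataset layout, not guarantees:

```python
# Sketch: load the bill dataset from the Hugging Face Hub.
from datasets import load_dataset

# The split name ("train") and column names ("title", "policy_area") are assumed here.
ds = load_dataset("hheiden/us-congress-bill-policy-115_117", split="train")
print(ds)                                      # inspect the available columns
print(ds[0]["title"], "->", ds[0]["policy_area"])
```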
Bills by Congress
Our dataset contains the following distribution:
Congress | Bills |
---|---|
115th | 13,555 |
116th | 16,601 |
117th | 17,817 |
Total | 47,973 |
Policy Areas
Each bill receives a policy area label from Congress.gov (see glossary). The dataset includes 33 policy areas, though these classes are highly imbalanced.
The following table shows the number of bills in each policy area across the three Congresses:
Policy Area | 115th | 116th | 117th | Total |
---|---|---|---|---|
Agriculture and Food | 312 | 328 | 398 | 1,038 |
Animals | 96 | 83 | 71 | 250 |
Armed Forces and National Security | 1,108 | 1,337 | 1,399 | 3,844 |
Arts, Culture, Religion | 81 | 79 | 103 | 263 |
Civil Rights and Liberties, Minority Issues | 175 | 205 | 220 | 600 |
Commerce | 312 | 593 | 633 | 1,538 |
Congress | 594 | 541 | 640 | 1,775 |
Crime and Law Enforcement | 827 | 904 | 1,022 | 2,753 |
Economics and Public Finance | 176 | 210 | 197 | 583 |
Education | 607 | 798 | 801 | 2,206 |
Emergency Management | 207 | 198 | 202 | 607 |
Energy | 316 | 370 | 530 | 1,216 |
Environmental Protection | 352 | 423 | 464 | 1,239 |
Families | 79 | 127 | 139 | 345 |
Finance and Financial Sector | 556 | 611 | 601 | 1,768 |
Foreign Trade and International Finance | 120 | 148 | 212 | 480 |
Government Operations and Politics | 1,008 | 1,258 | 1,272 | 3,538 |
Health | 1,526 | 2,109 | 2,276 | 5,911 |
Housing and Community Development | 142 | 250 | 231 | 623 |
Immigration | 398 | 466 | 591 | 1,455 |
International Affairs | 918 | 1,178 | 1,390 | 3,486 |
Labor and Employment | 348 | 452 | 552 | 1,352 |
Law | 109 | 162 | 175 | 446 |
Native Americans | 175 | 234 | 245 | 654 |
Public Lands and Natural Resources | 718 | 648 | 642 | 2,008 |
Science, Technology, Communications | 389 | 551 | 505 | 1,445 |
Social Sciences and History | 5 | 6 | 4 | 15 |
Social Welfare | 177 | 229 | 199 | 605 |
Sports and Recreation | 92 | 93 | 125 | 310 |
Taxation | 983 | 1,156 | 1,078 | 3,217 |
Transportation and Public Works | 492 | 672 | 742 | 1,906 |
Water Resources Development | 89 | 111 | 110 | 310 |
Private Legislation | 69 | 71 | 48 | 188 |
The class imbalance is severe: Social Sciences and History has only 15 bills across all three Congresses, while Health has 5,911. This imbalance presents modeling challenges, as minority classes may lack sufficient representative samples.
Text Statistics
We analyzed token counts using spaCy to understand the computational requirements for each text field.
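A minimal sketch of how such counts can be produced with spaCy (using the blank English tokenizer, so no model download is required; the example texts are illustrative):

```python
# Sketch: token counts per document using spaCy's tokenizer only.
import spacy

nlp = spacy.blank("en")  # tokenizer without tagger/parser components

texts = [
    "To amend the Internal Revenue Code of 1986, and for other purposes.",
    "A bill to improve access to broadband in rural areas.",
]
token_counts = [len(doc) for doc in nlp.pipe(texts)]
print(token_counts)                                    # tokens per document
print(sum(token_counts), min(token_counts), max(token_counts))
```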
Title Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 12.3 | 1 | 167 | 166,763 |
116th | 11.3 | 1 | 226 | 188,158 |
117th | 11.5 | 1 | 272 | 204,978 |
All | 11.7 | 1 | 272 | 559,419 |
Summary Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 109.1 | 2 | 6,839 | 1,479,212 |
116th | 94.9 | 2 | 5,886 | 1,574,732 |
117th | 95.1 | 2 | 502 | 1,695,276 |
All | 99.0 | 2 | 6,839 | 4,749,220 |
Full Text Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 2,588.7 | 91 | 304,478 | 35,092,075 |
116th | 2,760.3 | 70 | 973,173 | 45,824,498 |
117th | 2,706.7 | 71 | 1,013,608 | 48,224,757 |
All | - | 70 | 1,013,608 | 129,141,330 |
These statistics reveal computational trade-offs:
- Titles average ~12 tokens - computationally efficient but limited information
- Summaries average ~100 tokens - good balance of information and efficiency
- Full text averages ~2,700 tokens with 129M total tokens - detailed but computationally expensive
We’ll prototype with titles and summaries before considering full text, given the computational costs involved.
Evaluation Framework
Experimental Design
We train models on one Congress and test on others, creating a 3×3 evaluation grid. This setup evaluates both within-Congress performance (same session) and cross-Congress generalization (different sessions). We expect temporal drift between Congress sessions to impact performance.
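Conceptually, the evaluation grid is just a nested loop over Congresses. A minimal sketch, assuming per-Congress DataFrames with title and policy_area columns and a placeholder fit_model function (both hypothetical names):

```python
# Sketch: 3x3 evaluation grid -- train on one Congress, test on the others.
# `data` maps a Congress number to a DataFrame; `fit_model` stands in for
# whichever tuned pipeline (Naive Bayes, logistic regression, XGBoost) is used.
from sklearn.metrics import f1_score

def evaluate_grid(data, fit_model):
    results = {}
    for train_congress, train_df in data.items():
        model = fit_model(train_df["title"], train_df["policy_area"])
        for test_congress, test_df in data.items():
            if test_congress == train_congress:
                continue  # within-Congress numbers come from cross-validation instead
            preds = model.predict(test_df["title"])
            results[(train_congress, test_congress)] = f1_score(
                test_df["policy_area"], preds, average="weighted"
            )
    return results
```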
Metrics and Hyperparameter Tuning
We use the weighted-average F1 score to handle class imbalance: each policy area's F1 is computed separately and then averaged with weights proportional to that area's support.
For within-Congress evaluation, we report cross-validated scores. For cross-Congress evaluation, we test on the entire target Congress dataset.
Hyperparameter tuning uses cross-validated grid search, with the number of folds set to min(3, n_samples) so that every class is represented in each fold. The best parameters found on the training Congress are then applied when testing generalization to the other Congresses.
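A sketch of that tuning step follows; the pipeline, fold logic, and parameter grid are illustrative rather than the exact sweep helpers used in this post:

```python
# Sketch: grid search over a TF-IDF + classifier pipeline, scored by weighted F1,
# with the fold count capped so rare classes can still appear in every fold.
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def tune(train_texts, train_labels):
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, dtype=np.float32)),
        ("clf", MultinomialNB()),
    ])
    param_grid = {
        "tfidf__max_df": (0.05, 0.1, 0.25),  # illustrative subset of the sweep
        "clf__alpha": (1, 0.1, 0.01),
    }
    n_folds = min(3, min(Counter(train_labels).values()))
    search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=n_folds)
    search.fit(train_texts, train_labels)
    return search  # search.best_params_ are reused for cross-Congress testing
```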
Baseline Models
We evaluate three traditional machine learning approaches using TF-IDF vectorization:
Text Preprocessing
We convert text to numerical features using TF-IDF (term frequency-inverse document frequency), which weights each term by its frequency within a document, discounted by how common the term is across the corpus. This produces normalized feature vectors suitable for machine learning classifiers.
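For illustration, a minimal example of the vectorization step on two toy titles, using scikit-learn's TfidfVectorizer:

```python
# Sketch: turn raw bill titles into L2-normalized TF-IDF feature vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "To amend the Internal Revenue Code of 1986, and for other purposes.",
    "To improve access to broadband in rural areas, and for other purposes.",
]
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(titles)         # sparse matrix: one row per bill
print(X.shape)                               # (2, vocabulary size)
print(vectorizer.get_feature_names_out())    # the learned vocabulary
```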
Multinomial Naive Bayes
We start with Multinomial Naive Bayes as our simplest baseline. Despite its “naive” independence assumption between features, this model often performs surprisingly well for text classification tasks and serves as an important benchmark—if more complex models can’t beat Naive Bayes, it signals potential issues with the approach or data.
The model’s feature_log_prob_ attribute reveals the most influential words for each policy area, providing interpretable insight into classification patterns.
You can see the code for training the Naive Bayes model below:
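A stripped-down sketch of the underlying pipeline (the hyperparameter values mirror those the title-only sweep ends up selecting; the full sweep configuration appears in the Results section):

```python
# Sketch: TF-IDF features feeding a Multinomial Naive Bayes classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nb_model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2),
                              max_df=0.05, min_df=1, dtype=np.float32)),
    ("clf", MultinomialNB(alpha=0.01)),
])
# Hypothetical variable names -- substitute the actual title/label columns:
# nb_model.fit(train_titles, train_policy_areas)
# predictions = nb_model.predict(test_titles)
```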
Logistic Regression
Logistic regression provides a natural step up in complexity from Naive Bayes. It uses the logistic function to convert linear combinations of features into probabilities, making it an excellent baseline for comparison with more sophisticated models while remaining interpretable.
You can see the code for training the Logistic Regression model below:
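A comparable sketch for the logistic regression baseline (fixed settings taken from the sweep configuration; C=10 is the value the title-only grid search selects):

```python
# Sketch: TF-IDF features feeding a multinomial logistic regression classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

logreg_model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2),
                              max_df=0.05, dtype=np.float32)),
    ("clf", LogisticRegression(max_iter=1000, random_state=42,
                               class_weight="balanced", C=10)),
])
# Hypothetical variable names:
# logreg_model.fit(train_titles, train_policy_areas)
```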
XGBoost
We include XGBoost as our tree-based ensemble method. While XGBoost typically excels on structured tabular data, we test whether its gradient boosting approach can effectively handle TF-IDF features for text classification.
You can see the code for training the XGBoost model below:
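A comparable sketch for XGBoost; the max_depth and eta values match the sweep grid, and a LabelEncoder step is added because the scikit-learn wrapper expects integer-encoded class labels:

```python
# Sketch: TF-IDF features feeding an XGBoost multiclass classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

xgb_model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, max_df=0.05, dtype=np.float32)),
    ("clf", XGBClassifier(max_depth=6, learning_rate=0.3,   # learning_rate == eta
                          objective="multi:softprob")),
])
# XGBoost wants integer class labels (hypothetical variable names):
# y_train = LabelEncoder().fit_transform(train_policy_areas)
# xgb_model.fit(train_titles, y_train)
```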
Results
We evaluate models on three input types:
- Title-only: Quick prototyping with limited context
- Summary-only: Balanced information content and computational efficiency
- Full text: Maximum context with computational constraints (limited hyperparameter tuning)
Title-Only Inputs
Naive Bayes
Title-only Naive Bayes experiments are run with the following settings:

    sweep_nb(
        data,
        X_key='title',
        y_key='policy_area',
        tfidf_params={
            'lowercase': True,
            'dtype': np.float32,
        },
        tfidf_grid={
            'ngram_range': [(1, 1), (1, 2)],
            'max_df': (0.05, 0.1, 0.25, 0.5),
            'min_df': (1, 2, 5),
        },
        nb_params={},
        nb_grid={
            'alpha': (1, 0.1, 0.01, 0.001),
        },
    )
and the results:

    Training on Congress 115
    Best score: 0.661
    Refit Time: 0.570
    Best parameters set:
        clf__alpha: 0.01
        tfidf__max_df: 0.05
        tfidf__min_df: 1
        tfidf__ngram_range: (1, 2)
    Testing on Congress 116 F1: 0.6369760774921475
    Testing on Congress 117 F1: 0.5488274400521962

    Training on Congress 116
    Best score: 0.677
    Refit Time: 0.499
    Best parameters set:
        clf__alpha: 0.01
        tfidf__max_df: 0.05
        tfidf__min_df: 1
        tfidf__ngram_range: (1, 2)
    Testing on Congress 115 F1: 0.691175262953872
    Testing on Congress 117 F1: 0.6798043069585031

    Training on Congress 117
    Best score: 0.670
    Refit Time: 0.565
    Best parameters set:
        clf__alpha: 0.01
        tfidf__max_df: 0.25
        tfidf__min_df: 1
        tfidf__ngram_range: (1, 2)
    Testing on Congress 115 F1: 0.6168474701996426
    Testing on Congress 116 F1: 0.6981574942116808

    Mean fit time: 0.54 ± 0.03s
Results Summary
The results demonstrate several key findings:
- Fast training: Sub-second training times make this highly practical
- Solid baseline performance: F1 scores around 0.65-0.70 provide a reasonable starting point
- Consistent hyperparameters: Similar optimal settings across Congresses suggest stable patterns
- Temporal effects: Performance generally decreases when training and testing on Congresses further apart in time
Training on the 116th Congress yields the best cross-Congress performance, likely due to its temporal proximity to both adjacent sessions.

The model learns interpretable features for each policy area. For example, Agriculture bills are strongly associated with terms like “farm,” “crop,” and “livestock,” while Armed Forces bills correlate with “military,” “defense,” and “veterans.”
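One way to surface those per-class terms, assuming a fitted TF-IDF + Naive Bayes pipeline like the sketch above (with steps named "tfidf" and "clf"):

```python
# Sketch: print the top-weighted terms for each policy area from a fitted
# TF-IDF + MultinomialNB pipeline.
import numpy as np

def top_terms_per_class(nb_model, k=5):
    terms = nb_model.named_steps["tfidf"].get_feature_names_out()
    clf = nb_model.named_steps["clf"]
    for class_idx, policy_area in enumerate(clf.classes_):
        top = np.argsort(clf.feature_log_prob_[class_idx])[-k:][::-1]
        print(policy_area, [terms[i] for i in top])
```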
Logistic Regression
Title-only Logistic Regression experiments are run with the following settings:

    sweep_logreg(
        data,
        X_key='title',
        y_key='policy_area',
        tfidf_params={
            'lowercase': True,
            'dtype': np.float32,
        },
        tfidf_grid={
            'ngram_range': [(1, 1), (1, 2)],
            'max_df': (0.05, 0.1, 0.25),
        },
        logreg_params={
            'max_iter': 1000,
            'random_state': 42,
            'class_weight': 'balanced',
        },
        logreg_grid={
            'C': [0.1, 1, 10],
        },
    )
and the results:

    Training on Congress 115
    Best score: 0.704
    Refit Time: 32.063
    Best parameters set:
        clf__C: 10
        tfidf__max_df: 0.05
        tfidf__ngram_range: (1, 2)
    Testing on Congress 116 F1: 0.6809188275881766
    Testing on Congress 117 F1: 0.601917336933838

    Training on Congress 116
    Best score: 0.714
    Refit Time: 31.227
    Best parameters set:
        clf__C: 10
        tfidf__max_df: 0.05
        tfidf__ngram_range: (1, 2)
    Testing on Congress 115 F1: 0.7408989977276476
    Testing on Congress 117 F1: 0.7200639105208106

    Training on Congress 117
    Best score: 0.711
    Refit Time: 34.083
    Best parameters set:
        clf__C: 10
        tfidf__max_df: 0.05
        tfidf__ngram_range: (1, 2)
    Testing on Congress 115 F1: 0.674418393892329
    Testing on Congress 116 F1: 0.7405934743144291

    Mean fit time: 32.46 ± 1.20s
Results Summary
Logistic regression improves upon Naive Bayes performance:
- Higher F1 scores: Generally 4-6 percentage points better than Naive Bayes
- Consistent hyperparameters: Optimal settings remain stable across Congresses
- Reasonable training time: 30-35 seconds per model remains manageable
- Solid cross-Congress generalization: Most cross-Congress F1 scores fall between 0.67 and 0.74, though the 115th-to-117th pair drops to 0.60

XGBoost
Title-only XGBoost experiments are run with the following settings:

    sweep_xgb(
        data,
        X_key='title',
        y_key='policy_area',
        tfidf_grid={
            'max_df': (0.05,),
        },
        xgb_grid={
            'max_depth': (6,),
            'eta': (0.3,),
        },
    )
and the results:

    Training on Congress 115
    Best score: 0.591
    Refit Time: 198.063
    Best parameters set:
        clf__eta: 0.3
        clf__max_depth: 6
        clf__num_class: 33
        tfidf__max_df: 0.05
    Testing on Congress 116 F1: 0.5649530686141018
    Testing on Congress 117 F1: 0.5215939580735101

    Training on Congress 116
    Best score: 0.600
    Refit Time: 264.824
    Best parameters set:
        clf__eta: 0.3
        clf__max_depth: 6
        clf__num_class: 33
        tfidf__max_df: 0.05
    Testing on Congress 115 F1: 0.6037922738570368
    Testing on Congress 117 F1: 0.5965027418245722

    Training on Congress 117
    Best score: 0.595
    Refit Time: 249.799
    Best parameters set:
        clf__eta: 0.3
        clf__max_depth: 6
        clf__num_class: 33
        tfidf__max_df: 0.05
    Testing on Congress 115 F1: 0.5600491477899472
    Testing on Congress 116 F1: 0.60815381664894

    Mean fit time: 237.56 ± 28.60s
Results Summary
XGBoost underperforms relative to expectations:
- Poor performance: F1 scores well below the linear models (roughly 0.52-0.61)
- Long training times: 4+ minutes per model with limited hyperparameter exploration
- Questionable value: The computational cost doesn’t justify the poor performance
Given these results, we focus on the more promising linear models for subsequent experiments with longer text inputs.

Training Efficiency
The computational costs vary dramatically:
Model | Training Time |
---|---|
Naive Bayes | 0.54 ± 0.03s |
Logistic Regression | 32.46 ± 1.20s |
XGBoost | 237.56 ± 28.60s |
XGBoost’s poor performance despite high computational cost suggests that tree-based methods may not be well-suited for sparse TF-IDF features. We’ll focus on linear models for the remaining experiments.
Summary-Only Results
Using bill summaries provides substantially more context than titles alone, leading to significant performance improvements.
Naive Bayes Performance
The summary-based models show dramatic improvement over title-only versions:
- F1 scores: 0.85+ within-Congress, 0.77-0.86 cross-Congress
- Training time: Still fast at ~3.4 seconds
- Strong generalization: Consistent performance across time periods

Logistic Regression Performance
Logistic regression slightly outperforms Naive Bayes on summaries:
- F1 scores: 0.86+ within-Congress, 0.79-0.87 cross-Congress
- Training time: Reasonable at ~12 seconds
- Stable hyperparameters: Consistent optimal settings across Congresses

The small performance gap between the two models suggests they rely on similar lexical cues, with logistic regression making somewhat better use of correlated features.
Logistic Regression
Summary-only Logistic Regression experiments are run with the following settings:

    sweep_logreg(
        data,
        X_key='summary',
        y_key='policy_area',
        tfidf_params={
            'lowercase': True,
            'dtype': np.float32,
        },
        tfidf_grid={
            # 'ngram_range': [(1, 1), (1, 2)],
            'max_df': (0.05, 0.1, 0.25),
        },
        logreg_params={
            'max_iter': 1000,
            'random_state': 42,
            'class_weight': 'balanced',
        },
        logreg_grid={
            'C': [0.1, 1, 10],
        },
    )
and the results:

    Training on Congress 115
    Best score: 0.862
    Refit Time: 9.007
    Best parameters set:
        clf__C: 10
        tfidf__max_df: 0.25
    Testing on Congress 116 F1: 0.8284864693401133
    Testing on Congress 117 F1: 0.7934161507811646

    Training on Congress 116
    Best score: 0.865
    Refit Time: 13.897
    Best parameters set:
        clf__C: 10
        tfidf__max_df: 0.25
    Testing on Congress 115 F1: 0.8637852557418315
    Testing on Congress 117 F1: 0.8594775615031977

    Training on Congress 117
    Best score: 0.862
    Refit Time: 12.167
    Best parameters set:
        clf__C: 10
        tfidf__max_df: 0.25
    Testing on Congress 115 F1: 0.8355736563084967
    Testing on Congress 116 F1: 0.8696403838390832

    Mean fit time: 11.69 ± 2.02s

Full Text Results
We test whether complete bill text improves performance over summaries, using optimal hyperparameters from summary experiments.
Naive Bayes on Full Text
Surprisingly, full text yields slightly lower performance than summaries:
- F1 scores: 0.84-0.85 within-Congress, 0.77-0.86 cross-Congress
- Training time: ~50 seconds (more than 10× slower than summaries)
- Performance drop: Likely due to increased noise in lengthy documents

Logistic Regression on Full Text
Logistic regression shows the strongest performance on full text:
- F1 scores: 0.87-0.88 within-Congress, 0.83-0.89 cross-Congress
- Training time: ~70 seconds
- Best overall performance: Approaches 90% F1 on some Congress pairs

The logistic regression model benefits from having access to complete legislative language while effectively regularizing against noise.
Key Findings
This baseline study establishes several important results:
Best performing model: Logistic regression trained on full bill text achieves up to 89% F1 score, providing a strong benchmark for future deep learning approaches.
Text input comparison:
- Titles: Limited but fast (F1 ~0.65-0.70)
- Summaries: Good balance of performance and efficiency (F1 ~0.85)
- Full text: Best performance but computationally expensive (F1 ~0.87-0.89)
Cross-Congress generalization: Models trained on one Congress generalize reasonably well to others, though performance decreases with temporal distance between sessions.
Model performance ranking: Logistic Regression > Naive Bayes ≫ XGBoost for this text classification task.
Next Steps
The strong baseline performance sets the stage for several research directions:
- Deep learning models: Transformer-based approaches using pre-trained language models
- Dataset expansion: Including additional Congresses and more detailed bill metadata
- Error analysis: Understanding failure cases and class-specific performance patterns
- Feature engineering: Exploring domain-specific text preprocessing and feature extraction
The complete dataset and experimental code are available for researchers interested in building upon these baselines.
Resources: