The 117th US Congress: Policy Area Classification
Discussing a new NLP dataset for US Congressional bills

After scraping data on the 117th US Congress, as we did in this post, I wanted to do something interesting with it. I decided to create a new dataset for NLP researchers to use: US Congress 117th Bills.
Here, I want to discuss the dataset through its data card and present a baseline model for the task of classifying a bill’s policy area.
Dataset Card for US 117th Congress Bills
Dataset Summary
The US 117th Congress Bills dataset is a collection of all of the House Resolutions, House Joint Resolutions, Senate Resolutions, and Senate Joint Resolutions introduced during the 117th Congress (2021-2022). The task is to classify each bill into one of thirty-three major policy areas. There are 11,389 bills in the training split and 3,797 bills in the testing split.
Supported Tasks and Leaderboards
text-classification: The goal is to classify each bill into one of thirty-three major policy areas. The dataset contains both a text label (policy_areas) and a class integer (y).
These classes correspond to:
- 0: Agriculture and Food
- 1: Animals
- 2: Armed Forces and National Security
- 3: Arts, Culture, Religion
- 4: Civil Rights and Liberties, Minority Issues
- 5: Commerce
- 6: Congress
- 7: Crime and Law Enforcement
- 8: Economics and Public Finance
- 9: Education
- 10: Emergency Management
- 11: Energy
- 12: Environmental Protection
- 13: Families
- 14: Finance and Financial Sector
- 15: Foreign Trade and International Finance
- 16: Government Operations and Politics
- 17: Health
- 18: Housing and Community Development
- 19: Immigration
- 20: International Affairs
- 21: Labor and Employment
- 22: Law
- 23: Native Americans
- 24: Private Legislation
- 25: Public Lands and Natural Resources
- 26: Science, Technology, Communications
- 27: Social Sciences and History
- 28: Social Welfare
- 29: Sports and Recreation
- 30: Taxation
- 31: Transportation and Public Works
- 32: Water Resources Development
There is no leaderboard currently.
Languages
English
Dataset Structure
Data Instances
index 11047
id H.R.4536
policy_areas Social Welfare
cur_summary Welfare for Needs not Weed Act\nThis bill proh...
cur_text To prohibit assistance provided under the prog...
title Welfare for Needs not Weed Act
titles_official To prohibit assistance provided under the prog...
titles_short Welfare for Needs not Weed Act
sponsor_name Rep. Rice, Tom
sponsor_party R
sponsor_state SC
Name: 0, dtype: object
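As a quick aside, the instance above is simply one row of the training split printed as a pandas Series. A minimal sketch of how to reproduce a view like this, assuming the dataset is pulled from the HuggingFace Hub as shown later:
from datasets import load_dataset

dataset = load_dataset("hheiden/us-congress-117-bills")
df_train = dataset["train"].to_pandas()

# Print one record as a pandas Series, similar to the instance shown above
print(df_train.iloc[0])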
Data Fields
- index: A numeric index.
- id: The unique bill ID as a string.
- policy_areas: The key policy area as a string. This is the classification label.
- cur_summary: The latest summary of the bill as a string.
- cur_text: The latest text of the bill as a string.
- title: The core title of the bill, as labeled on Congress.gov, as a string.
- titles_official: All official titles of the bill (or nested legislation) as a string.
- titles_short: All short titles of the bill (or nested legislation) as a string.
- sponsor_name: The name of the primary representative sponsoring the legislation as a string.
- sponsor_party: The party of the primary sponsor as a string.
- sponsor_state: The home state of the primary sponsor as a string.
Data Splits
The dataset was split into training and testing sets using stratified sampling, due to the class imbalance in the dataset.
Using scikit-learn, a quarter of the data (by class) is reserved for testing:
from sklearn.model_selection import train_test_split
train_ix, test_ix = train_test_split(ixs, test_size=0.25, stratify=df['y'], random_state=1234567)
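To double-check that the stratification worked, we can compare the per-class proportions of the two splits; with stratified sampling they should be nearly identical. A quick sketch, assuming df and the index arrays from the call above:
# Normalized class frequencies per split; the largest gap should be tiny
train_props = df.loc[train_ix, 'y'].value_counts(normalize=True).sort_index()
test_props = df.loc[test_ix, 'y'].value_counts(normalize=True).sort_index()
print((train_props - test_props).abs().max())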
Dataset Creation
Curation Rationale
This dataset was created to provide a new dataset at the intersection of NLP and legislation. Using this data for a simple major topic classification seemed like a practical first step.
Source Data
Initial Data Collection and Normalization
Data was collected from congress.gov with minimal pre-processing. Additional information about this dataset's collection is discussed here.
Who are the source language producers?
Either the Congressional Research Service or other congressional staffers.
Annotations
Who are the annotators?
Congressional Staff
Personal and Sensitive Information
None; this is publicly available text from congress.gov.
Additional Information
Licensing Information
MIT License
Citation Information
@misc {hunter_heidenreich_2023,
author = { {Hunter Heidenreich} },
title = { us-congress-117-bills (Revision 9ed940e) },
year = 2023,
url = { https://huggingface.co/datasets/hheiden/us-congress-117-bills },
doi = { 10.57967/hf/1193 },
publisher = { Hugging Face }
}
Baseline Model
I want to create a very simple baseline for this dataset; in the future, I'll build a more complex model. Here, we'll combine one-hot encoded categorical features with TF-IDF text features and train a logistic regression classifier.
First, we load the dataset from the HuggingFace Hub:
from datasets import load_dataset
dataset = load_dataset("hheiden/us-congress-117-bills")
Then, we encode the policy labels as integers:
from sklearn.preprocessing import LabelEncoder
y_le = LabelEncoder()
y_train = y_le.fit_transform(dataset['train']['policy_areas'])
y_test = y_le.transform(dataset['test']['policy_areas'])
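Since LabelEncoder assigns integer codes in sorted (alphabetical) order, the mapping here matches the 0-32 list in the data card. We can confirm it by inspecting the fitted encoder:
# classes_ lists the label strings in the order of their integer codes
for i, name in enumerate(y_le.classes_):
    print(i, name)
# 0 Agriculture and Food
# 1 Animals
# ...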
We’ll parse the categorical features from the dataset:
import re
from sklearn.preprocessing import OneHotEncoder
le_types = OneHotEncoder(handle_unknown='ignore')
def _p(d):
    # Categorical features: bill type (the id with digits stripped, e.g. "H.R."),
    # sponsor party, and sponsor state
    return [
        re.sub(r'\d+', '', d['id']),
        # d['sponsor_name'],
        d['sponsor_party'],
        d['sponsor_state'],
    ]
x_train_one = le_types.fit_transform(list(map(_p, dataset['train'])))
x_test_one = le_types.transform(list(map(_p, dataset['test'])))
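To see what the encoder learned, we can inspect its categories_ attribute, which holds one array of distinct values per input column (here: bill-type prefix, sponsor party, and sponsor state; the column names below are just for readability):
# One array of distinct values per categorical column seen during fit
for col, cats in zip(['bill_type', 'party', 'state'], le_types.categories_):
    print(col, len(cats), list(cats)[:5])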
Then, we’ll use TF-IDF to vectorize the text:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
from tqdm.auto import tqdm
def _p(s):
    # Collapse all digit runs to a NUM token; handle missing/None text
    return re.sub(r'\d+', 'NUM', s if s else '')
vecs = defaultdict(lambda: TfidfVectorizer(stop_words='english'))
keys = ['cur_summary', 'cur_text', 'title', 'titles_short', 'titles_official']
x_train_tfidfs = {
    k: vecs[k].fit_transform(list(map(_p, dataset['train'][k])))
    for k in tqdm(keys, desc='TFIDF: FIT_TRANSFORM')
}
x_test_tfidfs = {
    k: vecs[k].transform(list(map(_p, dataset['test'][k])))
    for k in tqdm(keys, desc='TFIDF: TRANSFORM')
}
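Each field gets its own fitted vectorizer, so a quick look at the vocabulary sizes shows how many TF-IDF columns each field contributes (sketch):
# vocabulary_ maps terms to column indices; its size is the field's feature count
for k in keys:
    print(k, len(vecs[k].vocabulary_))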
Finally, we’ll combine the one-hot encoded features with the TF-IDF features:
from scipy.sparse import hstack
x_train_vs = hstack([x_train_tfidfs[k] for k in keys])
x_test_vs = hstack([x_test_tfidfs[k] for k in keys])
x_train = hstack((x_train_one, x_train_vs))
x_test = hstack((x_test_one, x_test_vs))
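Before training, it's worth sanity-checking the shapes: both matrices should have the same number of columns, with 11,389 and 3,797 rows respectively:
# Rows = bills in the split; columns = one-hot features + TF-IDF features
print(x_train.shape, x_test.shape)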
Now, we’ll train a simple logistic regression model:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
clf = LogisticRegression(
    class_weight='balanced',
    multi_class='ovr',
    max_iter=200,
    verbose=1,
)
_ = clf.fit(x_train, y_train)
Note: I'm using class_weight='balanced' to account for the class imbalance in the dataset. I also found that, for this dataset, multi_class='ovr' performed better than multi_class='multinomial'.
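For context, class_weight='balanced' weights each class by n_samples / (n_classes * count(class)), so rare classes like Social Sciences and History are weighted far more heavily than common ones like Health. The same weights can be computed directly (a small sketch):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Balanced weights: n_samples / (n_classes * per-class counts)
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
for name, w in zip(y_le.classes_, weights):
    print(f"{name}: {w:.2f}")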
Finally, we’ll evaluate the model:
y_pred_train = clf.predict(x_train)
y_pred_test = clf.predict(x_test)
Here are the results:
print(classification_report(y_train, y_pred_train, target_names=y_le.inverse_transform(list(range(33)))))
precision recall f1-score support
Agriculture and Food 0.99 1.00 1.00 262
Animals 0.98 1.00 0.99 49
Armed Forces and National Security 1.00 1.00 1.00 932
Arts, Culture, Religion 0.96 1.00 0.98 26
Civil Rights and Liberties, Minority Issues 1.00 1.00 1.00 83
Commerce 1.00 1.00 1.00 442
Congress 0.98 0.99 0.99 124
Crime and Law Enforcement 1.00 0.99 0.99 688
Economics and Public Finance 0.96 1.00 0.98 128
Education 0.99 1.00 1.00 505
Emergency Management 0.99 1.00 1.00 136
Energy 0.99 0.99 0.99 372
Environmental Protection 0.99 1.00 0.99 322
Families 1.00 1.00 1.00 81
Finance and Financial Sector 1.00 1.00 1.00 439
Foreign Trade and International Finance 0.99 0.98 0.99 146
Government Operations and Politics 0.99 0.98 0.99 866
Health 1.00 0.99 0.99 1470
Housing and Community Development 0.99 1.00 1.00 168
Immigration 1.00 1.00 1.00 397
International Affairs 0.99 1.00 1.00 705
Labor and Employment 0.99 0.99 0.99 376
Law 0.99 0.98 0.99 127
Native Americans 1.00 1.00 1.00 165
Private Legislation 1.00 1.00 1.00 13
Public Lands and Natural Resources 1.00 0.99 1.00 454
Science, Technology, Communications 0.99 1.00 0.99 337
Social Sciences and History 1.00 1.00 1.00 3
Social Welfare 0.99 1.00 1.00 139
Sports and Recreation 1.00 1.00 1.00 25
Taxation 1.00 0.99 1.00 802
Transportation and Public Works 0.99 0.99 0.99 524
Water Resources Development 0.99 0.99 0.99 83
accuracy 0.99 11389
macro avg 0.99 1.00 0.99 11389
weighted avg 0.99 0.99 0.99 11389
print(classification_report(y_test, y_pred_test, target_names=y_le.inverse_transform(list(range(33)))))
precision recall f1-score support
Agriculture and Food 0.98 0.92 0.95 87
Animals 0.94 0.94 0.94 16
Armed Forces and National Security 0.93 0.91 0.92 311
Arts, Culture, Religion 0.50 0.44 0.47 9
Civil Rights and Liberties, Minority Issues 0.75 0.78 0.76 27
Commerce 0.90 0.90 0.90 148
Congress 0.91 0.69 0.78 42
Crime and Law Enforcement 0.86 0.90 0.88 229
Economics and Public Finance 0.85 0.91 0.88 43
Education 0.92 0.96 0.94 169
Emergency Management 0.89 0.67 0.77 46
Energy 0.90 0.91 0.90 124
Environmental Protection 0.87 0.84 0.85 107
Families 0.88 0.81 0.85 27
Finance and Financial Sector 0.90 0.89 0.90 147
Foreign Trade and International Finance 0.88 0.86 0.87 49
Government Operations and Politics 0.86 0.89 0.87 289
Health 0.95 0.95 0.95 490
Housing and Community Development 0.90 0.93 0.91 56
Immigration 0.89 0.90 0.90 133
International Affairs 0.85 0.90 0.87 235
Labor and Employment 0.96 0.88 0.92 125
Law 0.90 0.62 0.73 42
Native Americans 0.92 0.89 0.91 55
Private Legislation 1.00 1.00 1.00 4
Public Lands and Natural Resources 0.84 0.89 0.87 151
Science, Technology, Communications 0.84 0.90 0.87 112
Social Sciences and History 0.00 0.00 0.00 1
Social Welfare 0.73 0.72 0.73 46
Sports and Recreation 0.67 0.50 0.57 8
Taxation 0.97 0.97 0.97 267
Transportation and Public Works 0.93 0.92 0.92 175
Water Resources Development 0.83 0.74 0.78 27
accuracy 0.90 3797
macro avg 0.84 0.82 0.83 3797
weighted avg 0.90 0.90 0.90 3797
Comparing the results on the training and testing splits, we can see that we've clearly overfit to the training set. The model also has a hard time with classes that have very little data, such as Social Sciences and History and Arts, Culture, Religion. It will be interesting to try deep learning models to see whether we can improve the results, or whether this is a limitation of the data. If it's the latter, this presents a good opportunity to collect more data.
Conclusion
In this post, we've taken an in-depth look at a dataset of congressional bills and their associated text. We've put together a very simple baseline model and seen that it does a decent job of predicting a bill's policy area. We've also seen that the model overfits the training data and that there is plenty of room for improvement. In the next post, we'll try to improve on this baseline with deep learning techniques.
As always, if you have any questions or comments, please feel free to reach out to me.