The 117th US Congress: Policy Area Classification

Discussing a new NLP dataset for US Congressional bills

After scraping the 117th US Congress, as we did in a previous post, I wanted to do something interesting with the data. I decided to create a new dataset for NLP researchers to use: US 117th Congress Bills.

Here, I want to discuss the dataset through its data card and present a baseline model for the task of classifying a bill’s policy area.

Dataset Card for US 117th Congress Bills

Dataset Summary

The US 117th Congress Bills dataset is a collection of all of the House Resolutions, House Joint Resolutions, Senate Resolutions, and Senate Joint Resolutions introduced during the 117th Congress (2021-2022). The task is to classify each bill into one of thirty-three major policy areas. There are 11,389 bills in the training split and 3,797 bills in the testing split.

Supported Tasks and Leaderboards

  • text-classification: The goal is to classify each bill into one of thirty-three major policy areas. The dataset contains both a text label (policy_areas) and a class integer (y).

These classes correspond to:

  • 0: Agriculture and Food
  • 1: Animals
  • 2: Armed Forces and National Security
  • 3: Arts, Culture, Religion
  • 4: Civil Rights and Liberties, Minority Issues
  • 5: Commerce
  • 6: Congress
  • 7: Crime and Law Enforcement
  • 8: Economics and Public Finance
  • 9: Education
  • 10: Emergency Management
  • 11: Energy
  • 12: Environmental Protection
  • 13: Families
  • 14: Finance and Financial Sector
  • 15: Foreign Trade and International Finance
  • 16: Government Operations and Politics
  • 17: Health
  • 18: Housing and Community Development
  • 19: Immigration
  • 20: International Affairs
  • 21: Labor and Employment
  • 22: Law
  • 23: Native Americans
  • 24: Private Legislation
  • 25: Public Lands and Natural Resources
  • 26: Science, Technology, Communications
  • 27: Social Sciences and History
  • 28: Social Welfare
  • 29: Sports and Recreation
  • 30: Taxation
  • 31: Transportation and Public Works
  • 32: Water Resources Development

There is no leaderboard currently.
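
If you'd rather not hard-code this mapping, it can be recovered directly from the two columns. Here is a minimal sketch using the Hugging Face datasets library:

from datasets import load_dataset

dataset = load_dataset("hheiden/us-congress-117-bills")

# Pair each integer class with its text label; the dict de-duplicates the pairs
id2label = dict(zip(dataset["train"]["y"], dataset["train"]["policy_areas"]))
print(id2label[28])  # Social Welfare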

Languages

English

Dataset Structure

Data Instances

index                                                          11047
id                                                          H.R.4536
policy_areas                                          Social Welfare
cur_summary        Welfare for Needs not Weed Act\nThis bill proh...
cur_text           To prohibit assistance provided under the prog...
title                                 Welfare for Needs not Weed Act
titles_official    To prohibit assistance provided under the prog...
titles_short                          Welfare for Needs not Weed Act
sponsor_name                                          Rep. Rice, Tom
sponsor_party                                                      R
sponsor_state                                                     SC
Name: 0, dtype: object
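
The printout above is just a pandas view of a single row; something like the following should reproduce it (a sketch relying on the datasets library's built-in pandas conversion):

from datasets import load_dataset

dataset = load_dataset("hheiden/us-congress-117-bills")

# Show the first training example as a pandas Series
print(dataset["train"].to_pandas().iloc[0])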

Data Fields

  • index: A numeric index
  • id: The unique bill ID as a string
  • policy_areas: The key policy area as a string. This is the classification label.
  • cur_summary: The latest summary of the bill as a string.
  • cur_text: The latest text of the bill as a string.
  • title: The core title of the bill, as labeled on Congress.gov, as a string.
  • titles_official: All official titles of the bill (or nested legislation) as a string.
  • titles_short: All short titles of the bill (or nested legislation) as a string.
  • sponsor_name: The name of the primary representative sponsoring the legislation as a string.
  • sponsor_party: The party of the primary sponsor as a string.
  • sponsor_state: The home state of the primary sponsor as a string.

Data Splits

The dataset was split into training and testing sets using stratified sampling, due to the class imbalance in the dataset.

Using scikit-learn, a quarter of the data (by class) is reserved for testing:

from sklearn.model_selection import train_test_split

train_ix, test_ix = train_test_split(ixs, test_size=0.25, stratify=df['y'], random_state=1234567)
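
The stratification can be sanity-checked by comparing class proportions across the two splits. A quick sketch, assuming (as in the original scraping post) that ixs are the DataFrame's index values:

# Class proportions should be roughly equal across the two splits
print(df.loc[train_ix, 'y'].value_counts(normalize=True).head())
print(df.loc[test_ix, 'y'].value_counts(normalize=True).head())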

Dataset Creation

Curation Rationale

This dataset was created to provide a new dataset at the intersection of NLP and legislation. Using this data for a simple major topic classification seemed like a practical first step.

Source Data

Initial Data Collection and Normalization

Data was collected from congress.gov with minimal pre-processing. Additional information about this dataset’s collection is discussed here.

Who are the source language producers?

Either the Congressional Research Service or other congressional staffers.

Annotations

Who are the annotators?

Congressional Staff

Personal and Sensitive Information

None; this is publicly available text from congress.gov.

Additional Information

Licensing Information

MIT License

Citation Information

@misc {hunter_heidenreich_2023,
	author       = { {Hunter Heidenreich} },
	title        = { us-congress-117-bills (Revision 9ed940e) },
	year         = 2023,
	url          = { https://huggingface.co/datasets/hheiden/us-congress-117-bills },
	doi          = { 10.57967/hf/1193 },
	publisher    = { Hugging Face }
}

Baseline Model

I want to establish a simple baseline for this dataset; a more complex model can come later. Here, we’ll combine one-hot encoded categorical features with TF-IDF text features and train a logistic regression classifier.

First, we load the dataset from the HuggingFace Hub:

from datasets import load_dataset

dataset = load_dataset("hheiden/us-congress-117-bills")

Then, we encode the policy labels as integers:

from sklearn.preprocessing import LabelEncoder

y_le = LabelEncoder()

y_train = y_le.fit_transform(dataset['train']['policy_areas'])
y_test = y_le.transform(dataset['test']['policy_areas'])

We’ll parse the categorical features from the dataset:

import re
from sklearn.preprocessing import OneHotEncoder

le_types = OneHotEncoder(handle_unknown='ignore')

def _p(d):
    # Categorical features: bill type (ID with digits stripped, e.g. 'H.R.'),
    # sponsor party, and sponsor state
    return [
        re.sub(r'\d+', '', d['id']),
        # d['sponsor_name'],
        d['sponsor_party'],
        d['sponsor_state'],
    ]

x_train_one = le_types.fit_transform(list(map(_p, dataset['train'])))
x_test_one = le_types.transform(list(map(_p, dataset['test'])))
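
As a quick sanity check, you can look at what the parser produces for a single bill and at the shape of the resulting one-hot matrix (a small sketch; the exact column count depends on the split):

# e.g. ['H.R.', 'R', 'SC']: bill type, sponsor party, sponsor state
print(_p(dataset['train'][0]))
print(x_train_one.shape)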

Then, we’ll use TF-IDF to vectorize the text:

from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
from tqdm.auto import tqdm

def _p(s):
    # Collapse runs of digits into a NUM token; guard against missing text
    return re.sub(r'\d+', 'NUM', s if s else '')

# One TF-IDF vectorizer per text field
vecs = defaultdict(lambda: TfidfVectorizer(stop_words='english'))

keys = ['cur_summary', 'cur_text', 'title', 'titles_short', 'titles_official']

x_train_tfidfs = {
    k: vecs[k].fit_transform(list(map(_p, dataset['train'][k])))
    for k in tqdm(keys, desc='TFIDF: FIT_TRANSFORM')
}

x_test_tfidfs = {
    k: vecs[k].transform(list(map(_p, dataset['test'][k])))
    for k in tqdm(keys, desc='TFIDF: TRANSFORM')
}
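
Since each text field gets its own vectorizer, it can be worth a quick glance at the fitted vocabulary sizes and matrix shapes (a small sketch; exact numbers will vary with preprocessing):

# Vocabulary size and TF-IDF matrix shape per text field
for k in keys:
    print(k, len(vecs[k].vocabulary_), x_train_tfidfs[k].shape)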

Finally, we’ll combine the one-hot encoded features with the TF-IDF features:

from scipy.sparse import hstack

# Concatenate the per-field TF-IDF matrices, then prepend the one-hot features
x_train_vs = hstack([x_train_tfidfs[k] for k in keys])
x_test_vs = hstack([x_test_tfidfs[k] for k in keys])
x_train = hstack((x_train_one, x_train_vs))
x_test = hstack((x_test_one, x_test_vs))

Now, we’ll train a simple logistic regression model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression(
    class_weight='balanced',
    multi_class='ovr',
    max_iter=200,
    verbose=1,
)
_ = clf.fit(x_train, y_train)

Note: I’m using class_weight='balanced' to account for the class imbalance in the dataset. I also found that, for this dataset, multi_class='ovr' performed better than multi_class='multinomial'.
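
If you want to reproduce that comparison, here is a minimal sketch (scoring with macro-F1; a held-out validation split would be cleaner than scoring on the test set):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

for mc in ('ovr', 'multinomial'):
    m = LogisticRegression(class_weight='balanced', multi_class=mc, max_iter=200)
    m.fit(x_train, y_train)
    print(mc, f1_score(y_test, m.predict(x_test), average='macro'))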

Finally, we’ll evaluate the model:

y_pred_train = clf.predict(x_train)
y_pred_test = clf.predict(x_test)

Here are the results, first on the training split:

print(classification_report(y_train, y_pred_train, target_names=y_le.inverse_transform(list(range(33)))))
                                             precision    recall  f1-score   support

                       Agriculture and Food       0.99      1.00      1.00       262
                                    Animals       0.98      1.00      0.99        49
         Armed Forces and National Security       1.00      1.00      1.00       932
                    Arts, Culture, Religion       0.96      1.00      0.98        26
Civil Rights and Liberties, Minority Issues       1.00      1.00      1.00        83
                                   Commerce       1.00      1.00      1.00       442
                                   Congress       0.98      0.99      0.99       124
                  Crime and Law Enforcement       1.00      0.99      0.99       688
               Economics and Public Finance       0.96      1.00      0.98       128
                                  Education       0.99      1.00      1.00       505
                       Emergency Management       0.99      1.00      1.00       136
                                     Energy       0.99      0.99      0.99       372
                   Environmental Protection       0.99      1.00      0.99       322
                                   Families       1.00      1.00      1.00        81
               Finance and Financial Sector       1.00      1.00      1.00       439
    Foreign Trade and International Finance       0.99      0.98      0.99       146
         Government Operations and Politics       0.99      0.98      0.99       866
                                     Health       1.00      0.99      0.99      1470
          Housing and Community Development       0.99      1.00      1.00       168
                                Immigration       1.00      1.00      1.00       397
                      International Affairs       0.99      1.00      1.00       705
                       Labor and Employment       0.99      0.99      0.99       376
                                        Law       0.99      0.98      0.99       127
                           Native Americans       1.00      1.00      1.00       165
                        Private Legislation       1.00      1.00      1.00        13
         Public Lands and Natural Resources       1.00      0.99      1.00       454
        Science, Technology, Communications       0.99      1.00      0.99       337
                Social Sciences and History       1.00      1.00      1.00         3
                             Social Welfare       0.99      1.00      1.00       139
                      Sports and Recreation       1.00      1.00      1.00        25
                                   Taxation       1.00      0.99      1.00       802
            Transportation and Public Works       0.99      0.99      0.99       524
                Water Resources Development       0.99      0.99      0.99        83

                                   accuracy                           0.99     11389
                                  macro avg       0.99      1.00      0.99     11389
                               weighted avg       0.99      0.99      0.99     11389

And on the test split:

print(classification_report(y_test, y_pred_test, target_names=y_le.inverse_transform(list(range(33)))))
                                             precision    recall  f1-score   support

                       Agriculture and Food       0.98      0.92      0.95        87
                                    Animals       0.94      0.94      0.94        16
         Armed Forces and National Security       0.93      0.91      0.92       311
                    Arts, Culture, Religion       0.50      0.44      0.47         9
Civil Rights and Liberties, Minority Issues       0.75      0.78      0.76        27
                                   Commerce       0.90      0.90      0.90       148
                                   Congress       0.91      0.69      0.78        42
                  Crime and Law Enforcement       0.86      0.90      0.88       229
               Economics and Public Finance       0.85      0.91      0.88        43
                                  Education       0.92      0.96      0.94       169
                       Emergency Management       0.89      0.67      0.77        46
                                     Energy       0.90      0.91      0.90       124
                   Environmental Protection       0.87      0.84      0.85       107
                                   Families       0.88      0.81      0.85        27
               Finance and Financial Sector       0.90      0.89      0.90       147
    Foreign Trade and International Finance       0.88      0.86      0.87        49
         Government Operations and Politics       0.86      0.89      0.87       289
                                     Health       0.95      0.95      0.95       490
          Housing and Community Development       0.90      0.93      0.91        56
                                Immigration       0.89      0.90      0.90       133
                      International Affairs       0.85      0.90      0.87       235
                       Labor and Employment       0.96      0.88      0.92       125
                                        Law       0.90      0.62      0.73        42
                           Native Americans       0.92      0.89      0.91        55
                        Private Legislation       1.00      1.00      1.00         4
         Public Lands and Natural Resources       0.84      0.89      0.87       151
        Science, Technology, Communications       0.84      0.90      0.87       112
                Social Sciences and History       0.00      0.00      0.00         1
                             Social Welfare       0.73      0.72      0.73        46
                      Sports and Recreation       0.67      0.50      0.57         8
                                   Taxation       0.97      0.97      0.97       267
            Transportation and Public Works       0.93      0.92      0.92       175
                Water Resources Development       0.83      0.74      0.78        27

                                   accuracy                           0.90      3797
                                  macro avg       0.84      0.82      0.83      3797
                               weighted avg       0.90      0.90      0.90      3797

Comparing the results between the training split and the testing split, we can see that we’ve clearly overfit to the training set. The model also struggles with the classes that have very little data (e.g., Social Sciences and History, Sports and Recreation, and Arts, Culture, Religion). It will be interesting to try deep learning models to see whether the results improve, or whether this is a limitation of the data itself. If it’s the latter, that presents a good opportunity to collect more data.
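
One way to dig into where the small classes go wrong is the test-split confusion matrix. Here is a sketch that prints each class’s most common off-diagonal confusion, reusing the variables from the evaluation above:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_test)
for i, name in enumerate(y_le.classes_):
    row = cm[i].copy()
    row[i] = 0  # ignore correct predictions
    if row.sum() > 0:
        print(name, '->', y_le.classes_[row.argmax()], int(row.max()))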

Conclusion

In this post, we’ve taken an in-depth look at a dataset of congressional bills and their associated text. We’ve put together a simple baseline model and seen that it does a decent job of predicting a bill’s policy area, reaching roughly 90% accuracy on the test split. We’ve also seen that the model overfits the training data and that there is plenty of room for improvement. In the next post, we’ll try to improve on this baseline using deep learning techniques.

As always, if you have any questions or comments, please feel free to reach out to me.