Status
This leaderboard is a historical snapshot from February 2024 and is no longer actively maintained. The data covers the 115th through 117th Congresses (2017-2022). No new entries will be added.
Leaderboard
In-Congress Performance
| Rank | Model | Score | Date | Link |
|---|---|---|---|---|
| 1 | Logistic Regression | 0.877 | 2024-02-21 | Blog Post |
Out-of-Congress Performance
| Rank | Model | Score | Date | Link |
|---|---|---|---|---|
| 1 | Logistic Regression | 0.871 | 2024-02-21 | Blog Post |
What is a Policy Area?
A policy area is a broad category of public policy that includes a wide range of related policies. For example, the policy area of “Health” includes policies related to healthcare, public health, and health insurance. The policy area of “Education” includes policies related to K-12 education, higher education, and vocational training. The policy area of “Economic Development” includes policies related to job creation, workforce development, and business incentives.
congress.gov maintains an official list of policy areas used to classify legislation. The leaderboard additionally treats Private Legislation as a policy area; it is not on the official list but does appear as a designation in the data. This brings the total to 33 policy areas.
Extending the data further back in time may also require handling the Commemorations policy area, which was once a policy area but is no longer in use.
The Challenge
The dataset includes legislation introduced in the United States Congress, collected from congress.gov. This data includes:
- A unique identifier for each piece of legislation
- The congress in which the legislation was introduced
- The title of the legislation (display title)
- The summary of the legislation (earliest version available)
- The full text of the legislation (earliest version available)
- The policy area of the legislation
The data covers the 115th Congress (2017-2018) through the 117th Congress (2021-2022). This is a fixed historical snapshot; data from the 118th Congress (2023-2024) onward is not included and will not be added.
In general, the challenge is to build a model that can accurately predict the policy area of a piece of legislation based on its title, summary, and/or full text: $$ \text{Policy Area} = f(\text{Title}, \text{Summary}, \text{Full Text}) $$
The goal of this challenge is to train on a single Congress and either:
- Predict on a held-out set within the same Congress
- Predict on a separate Congress (either extrapolating to a future Congress or “interpolating” to/recalling a past Congress)
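The two settings above can be enumerated with a short sketch. The function name `evaluation_plan` and the `CONGRESSES` list are illustrative assumptions, not part of the dataset or any official tooling; the congress numbers themselves come from the coverage stated above.

```python
from itertools import permutations

# Congresses covered by this snapshot (115th through 117th).
CONGRESSES = [115, 116, 117]


def evaluation_plan(congresses):
    """Return the (in_congress, out_of_congress) evaluation settings.

    in_congress: one entry per congress; train and evaluate on splits
        of that same congress.
    out_of_congress: ordered (train, test) pairs of distinct congresses.
    """
    in_congress = list(congresses)
    out_of_congress = list(permutations(congresses, 2))
    return in_congress, out_of_congress


in_congress, out_of_congress = evaluation_plan(CONGRESSES)
for train, test in out_of_congress:
    # A later test congress means extrapolating; an earlier one means
    # "interpolating" to/recalling a past Congress.
    direction = "future" if test > train else "past"
    print(f"train on {train}, predict on {test} ({direction})")
```

With three congresses this yields three in-Congress settings and six ordered train/test pairs for the out-of-Congress setting.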
Data
You can download the data directly from Hugging Face Datasets: hhieden/us-congress-bill-policy-115_117.
Evaluation
In-Congress performance is evaluated with K-fold cross-validation, where K=3. For each congress, run a 3-fold cross-validation with folds stratified by policy area, and score each fold with the weighted F1 score. The final in-Congress performance is the mean weighted F1 score across the 3 folds, further averaged across the congresses included. In math: $$ \text{In-Congress Performance} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{3} \sum_{k=1}^{3} \text{Weighted F1 Score}_{i,k} $$ where $N$ is the number of congresses included in the evaluation.
Out-of-Congress performance is evaluated on the full held-out congresses. The final performance is the weighted F1 score averaged across the congresses included, excluding the congresses used for training.
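As a minimal sketch of the metric and the averaging described above, the following uses only the Python standard library. The function names `weighted_f1` and `in_congress_performance`, and the example labels and fold scores, are illustrative assumptions. The per-fold scores are assumed to come from a stratified 3-fold split produced elsewhere (for example with scikit-learn's `StratifiedKFold`).

```python
from collections import Counter
from statistics import mean


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


def in_congress_performance(fold_scores_by_congress):
    """Mean over congresses of the mean weighted F1 over that congress's folds."""
    return mean(mean(scores) for scores in fold_scores_by_congress.values())


# Toy labels (made up for illustration only).
y_true = ["Health", "Health", "Education", "Taxation"]
y_pred = ["Health", "Education", "Education", "Taxation"]
print(round(weighted_f1(y_true, y_pred), 3))

# Hypothetical per-fold scores for two congresses.
fold_scores = {115: [0.9, 0.8, 0.7], 116: [0.6, 0.6, 0.6]}
print(round(in_congress_performance(fold_scores), 3))
```

Classes that appear only in the predictions have zero support and therefore contribute nothing to the weighted average, which is the usual convention for a weighted F1 score.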
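One way to write this in the same notation as the in-Congress formula, assuming a model trained on a single congress $i$ and scored on each of the other $N-1$ congresses, is:
$$ \text{Out-of-Congress Performance}_i = \frac{1}{N-1} \sum_{j \neq i} \text{Weighted F1 Score}_{i \to j} $$
Here $\text{Weighted F1 Score}_{i \to j}$ is notation introduced for this sketch: the weighted F1 score of the model trained on congress $i$ and evaluated on all of congress $j$. Whether the reported leaderboard score also averages over the choice of training congress $i$ is not specified above, so treat that step as an assumption if you apply it.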
This gives us two scalar scores: in-Congress performance and out-of-Congress performance.