<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Legislative-Data on Hunter Heidenreich | Senior AI Research Scientist</title><link>https://hunterheidenreich.com/tags/legislative-data/</link><description>Recent content in Legislative-Data on Hunter Heidenreich | Senior AI Research Scientist</description><image><title>Hunter Heidenreich | Senior AI Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 14 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/legislative-data/index.xml" rel="self" type="application/rss+xml"/><item><title>Tea Party in the House: Legislative Ideology via HIPTM</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/tea-party-hiptm/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/tea-party-hiptm/</guid><description>A hierarchical probabilistic model combining roll call votes, bill text, and legislative speeches to analyze political polarization and framing.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel probabilistic architecture, the Hierarchical Ideal Point Topic Model (HIPTM), designed to address a limitation of existing political science models, which typically rely on either voting data or text data in isolation. The paper validates the method by showing that it outperforms text-only baselines at predicting &ldquo;Tea Party&rdquo; membership and that it supports an interpretable &ldquo;framing&rdquo; analysis.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The primary motivation is to better understand political polarization, specifically the &ldquo;Tea Party&rdquo; phenomenon within the Republican party during the 112th Congress.</p>
<p>An ideal point is a scalar score representing a legislator&rsquo;s ideological position, estimated from voting patterns. Standard &ldquo;Ideal Point&rdquo; models (like DW-NOMINATE) typically project legislators onto a single liberal-conservative dimension using only binary voting data. This is insufficient for capturing complex, multi-dimensional intra-party conflicts where legislators might agree on votes but differ on policy &ldquo;framing&rdquo; or specific sub-issues. Furthermore, existing multi-dimensional models often produce dimensions that are difficult for humans to interpret.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Hierarchical Ideal Point Topic Model (HIPTM)</strong>. It distinguishes itself from prior work through three main technical innovations:</p>
<ol>
<li><strong>Joint Modeling of Three Data Sources</strong>: It integrates roll call votes, the text of bills, and the floor speeches of legislators into a single probabilistic framework.</li>
<li><strong>Hierarchical Topic Structure</strong>: It models &ldquo;frames&rdquo; as a second level of the topic hierarchy. &ldquo;Issues&rdquo; (level 1) are fixed and non-polarized, while &ldquo;Frames&rdquo; (level 2) are discovered dynamically and carry polarity (ideal point weights). For example, Health Care is an issue; &ldquo;government overreach&rdquo; vs. &ldquo;patient protection&rdquo; are frames legislators use when debating it.</li>
<li><strong>Text-Based Ideal Point Prediction</strong>: HIPTM regresses ideal points on speech text, allowing it to predict the political alignment of legislators based solely on their writing or speeches without requiring voting records for inference.</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model using data from the 112th U.S. Congress (Republican legislators only).</p>
<ul>
<li><strong>Prediction Task</strong>: Classifying legislators as members of the &ldquo;Tea Party Caucus&rdquo;.</li>
<li><strong>Baselines</strong>: The model was compared against Support Vector Machines (SVM) trained on:
<ul>
<li>TF-IDF vectors (Text only)</li>
<li>Normalized TF-IDF vectors (Text only)</li>
<li>Binary Vote vectors (Vote only)</li>
</ul>
</li>
<li><strong>Metric</strong>: Area Under the Receiver Operating Characteristic Curve (AUC-ROC) via 5-fold cross-validation.</li>
<li><strong>Qualitative Analysis</strong>: The authors examined the &ldquo;span&rdquo; of ideal points within specific topics (e.g., Macroeconomics, Health) to identify which issues were most polarized between Tea Party and Establishment Republicans.</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Quantitative Performance</strong>: HIPTM features combined with voting data (HIPTM-VOTE) achieved the highest classification performance, with AUC-ROC of roughly 0.70-0.75 (read approximately from Figure 2). Vote-only features trailed HIPTM-VOTE slightly, while the text-only baselines (TF-IDF, normalized TF-IDF) fell considerably lower. The one-dimensional Tea Party ideal points correlate strongly with DW-NOMINATE ($\rho = 0.91$). When voting data was withheld (simulating a candidate without a voting record), HIPTM&rsquo;s text-based features outperformed both TF-IDF baselines (a qualitative reading of Figure 3).</li>
<li><strong>Political Insight</strong>: The model identified &ldquo;Government Operations,&rdquo; &ldquo;Macroeconomics,&rdquo; and &ldquo;Transportation&rdquo; as the three most polarized topics between Tea Party and establishment Republicans.</li>
<li><strong>Framing Analysis</strong>: The hierarchical topic structure reveals how legislators frame issues differently. For Macroeconomics, frame M3 (most Tea Party-oriented) focuses on criticizing government overspending, while frame M1 (least Tea Party-oriented) focuses on the downsides of a government shutdown. For Health, frame H3 captures Tea Party framing of the Affordable Care Act as an unconstitutional government takeover, while frame H1 frames opposition in terms of implementation costs and health care exchanges.</li>
<li><strong>Framing vs. Voting Taxonomy</strong>: The authors construct a 2x2 taxonomy of disagreement across issues, crossing whether ideal points are polarized with whether issue frames are polarized. Issues like Civil Rights fall in the &ldquo;neither polarized&rdquo; quadrant, where cooperation is expected. Banking/Finance and Transportation fall in the &ldquo;ideal points polarized, frames not&rdquo; quadrant, where Republicans frame the issue similarly but have underlying policy disagreements. Issues like Health and Public Lands fall in the &ldquo;frames polarized, ideal points not&rdquo; quadrant: Republicans voted similarly but framed the issue very differently. Issues like Macroeconomics and Government Operations fall in the &ldquo;both polarized&rdquo; quadrant, posing the greatest challenge for Republican leadership.</li>
<li><strong>Sub-group Identification</strong>: The model identifies legislators whose language marks them as ideologically aligned with the Tea Party even without formal caucus membership. For example, Jeff Flake (R-AZ) received the second-highest ideal point and disagreed with Freedom Works on only one of 60 key votes, despite not being a Tea Party Caucus member. Justin Amash (R-MI), founder and chairman of the Liberty Caucus, agreed with Freedom Works on every key vote since 2011. Conversely, some self-identified Tea Partiers such as Rodney Alexander (R-LA) agreed with Freedom Works only 48% of the time. Gervais and Morris (2014) categorize Alexander and Ander Crenshaw (R-FL, 50% agreement) as &ldquo;Green Tea&rdquo;: Republican legislators who associate with the Tea Party on their own initiative but lack support from Tea Party organizations.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<ul>
<li>HIPTM does not formally distinguish frames from other kinds of subtopics. For example, the model discovered a strongly Tea Party-oriented frame under &ldquo;Labor, Employment and Immigration&rdquo; that reflected a Boeing labor dispute specific to South Carolina legislators, capturing geographic rather than ideological framing.</li>
<li>The model is validated only on Republican legislators in the 112th Congress. Generalization to other parties, chambers, or time periods is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study focuses on the <strong>112th U.S. Congress</strong> (Jan 2011 - Jan 2013).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Subjects</strong></td>
          <td>Republican Legislators</td>
          <td>240 Reps</td>
          <td>60 are Tea Party Caucus members.</td>
      </tr>
      <tr>
          <td><strong>Votes</strong></td>
          <td>Roll Call Votes</td>
          <td>13,856 votes</td>
          <td>Agreement/disagreement with Freedom Works on 60 key votes (40 in 2011, 20 in 2012).</td>
      </tr>
      <tr>
          <td><strong>Text</strong></td>
          <td>Floor Speeches</td>
          <td>5,349 word types</td>
          <td>Sourced from GovTrack. Vocabulary size after preprocessing.</td>
      </tr>
      <tr>
          <td><strong>Priors</strong></td>
          <td>Congressional Bills Project</td>
          <td>19 Topics</td>
          <td>Used to set informed priors $\phi^*_k$ for top-level issues.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The model uses a <strong>Stochastic EM</strong> approach for inference.</p>
<ul>
<li><strong>Generative Process</strong>:
<ul>
<li><strong>Speeches</strong>: Modeled as a mixture of $K$ Hierarchical Dirichlet Processes (HDPs). A legislator chooses an issue $z$, then a frame $t$ from a Dirichlet Process, then a word $w$.</li>
<li><strong>Bills</strong>: Modeled using Latent Dirichlet Allocation (LDA). Each bill is a mixture over $K$ issues.</li>
<li><strong>Votes</strong>: Modeled via a probabilistic ideal point function (logistic/inverse-logit). The probability of a &ldquo;Yes&rdquo; vote depends on the bill&rsquo;s polarity $x_b$, popularity $y_b$, and the legislator&rsquo;s issue-specific ideal point $u_{a,k}$.</li>
</ul>
</li>
<li><strong>Optimization Steps</strong>:
<ol>
<li><strong>Sampling</strong>: Issue assignments $z$ and frame assignments $t$ are sampled for tokens in speeches and bills.</li>
<li><strong>Regression</strong>: Frame-specific regression weights $\eta_{k,j}$ are optimized using <strong>L-BFGS</strong>.</li>
<li><strong>Ideal Points</strong>: Legislator ideal points $u_{a,k}$ and bill parameters ($x_b, y_b$) are updated using <strong>Gradient Ascent</strong>.</li>
</ol>
</li>
</ul>
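<p>As a rough illustration of the vote component described above, one logistic form consistent with that description is sketched below. The issue weights $\vartheta_{b,k}$ (the share of bill $b$ assigned to issue $k$ by the LDA component) and the exact way the terms combine are assumptions made here for illustration, so the expression should be checked against the paper before being relied on:</p>
<p>$$p(v_{a,b}=\text{Yes}) = \sigma\Big(x_b \sum_{k=1}^{K} \vartheta_{b,k}\, u_{a,k} + y_b\Big), \qquad \sigma(s) = \frac{1}{1+e^{-s}}$$</p>
<p>Under this form, a bill with polarity $x_b = 0$ is non-polarizing: every legislator votes Yes with probability $\sigma(y_b)$, regardless of ideal point. With hypothetical values $x_b = 2$, $\sum_k \vartheta_{b,k} u_{a,k} = 0.5$, and $y_b = -1$, the argument is $2 \times 0.5 - 1 = 0$, giving a Yes probability of $\sigma(0) = 0.5$.</p>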
<h3 id="models">Models</h3>
<ul>
<li><strong>Ideal Point Definition</strong>: A legislator&rsquo;s ideal point on issue $k$ ($u_{a,k}$) is defined as a linear combination of the ideal points of the <em>frames</em> they use ($\eta_{k,j}$), weighted by their usage frequency ($\hat{\psi}_{a,k,j}$).</li>
<li><strong>Topic Hierarchy</strong>:
<ul>
<li><strong>Level 1 (Issues)</strong>: Fixed at $K=19$ (based on Policy Agendas Project major headings). These nodes use informed Dirichlet priors.</li>
<li><strong>Level 2 (Frames)</strong>: Unbounded number of frames per issue, discovered non-parametrically via Dirichlet Process.</li>
</ul>
</li>
<li><strong>Prediction Features</strong>: The model runs for 1,000 iterations total with a 500-iteration burn-in. After burn-in, the sampled state is kept every 50 iterations, and feature values are averaged over the 10 stored models.</li>
</ul>
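<p>The ideal point definition above can be written out as a weighted sum. The sketch below treats the definition as exact (the paper may place a noise term around this mean), and $J_k$ is notation introduced here for the number of frames currently active under issue $k$:</p>
<p>$$u_{a,k} = \sum_{j=1}^{J_k} \hat{\psi}_{a,k,j}\, \eta_{k,j}$$</p>
<p>As a purely hypothetical example, suppose an issue has three frames with weights $\eta_{k} = (-1.0,\ 0.2,\ 1.5)$ and a legislator uses them with frequencies $\hat{\psi}_{a,k} = (0.2,\ 0.3,\ 0.5)$. Then $u_{a,k} = 0.2(-1.0) + 0.3(0.2) + 0.5(1.5) = -0.2 + 0.06 + 0.75 = 0.61$, so heavy use of the most polarized frame pulls the legislator toward the positive end of the scale. The sampling schedule is also easy to verify: $(1000 - 500) / 50 = 10$ retained samples, which matches the 10 stored models.</p>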
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: AUC-ROC (Area Under the Receiver Operating Characteristic Curve).</li>
<li><strong>Classifier</strong>: $\text{SVM}^{\text{light}}$ (Joachims, 1999).</li>
<li><strong>Cross-Validation</strong>: 5-fold stratified sampling.</li>
</ul>
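<p>For readers less familiar with the metric, AUC-ROC can be read as the probability that a randomly chosen Tea Party Caucus member receives a higher classifier score than a randomly chosen non-member. A standard way to write it is given below, where $s_i$ is the classifier score and $P$ and $N$ are the sets of positive and negative examples (notation introduced here for illustration):</p>
<p>$$\text{AUC} = \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N} \Big( \mathbf{1}[s_i \gt s_j] + \tfrac{1}{2}\,\mathbf{1}[s_i = s_j] \Big)$$</p>
<p>Using the counts in the data table (60 caucus members among 240 legislators, hence 180 non-members), the full dataset contains $60 \times 180 = 10{,}800$ such pairs, although each held-out fold covers only a subset of them. A classifier that scores at random has an expected AUC of 0.5.</p>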
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.govtrack.us/">GovTrack Congressional Speeches</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Source of floor speech text</td>
      </tr>
      <tr>
          <td><a href="http://www.congressionalbills.org/">Congressional Bills Project</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Bill text with Policy Agendas Project topic labels</td>
      </tr>
      <tr>
          <td>Freedom Works Key Votes</td>
          <td>Dataset</td>
          <td>Public</td>
          <td>60 key votes used to define Tea Party alignment (freedomworks.org is no longer available)</td>
      </tr>
  </tbody>
</table>
<p>No official code release accompanies this paper. The inference algorithm (Stochastic EM with Gibbs sampling, L-BFGS, and gradient ascent) is described in detail in Section 4 of the paper, but a full reimplementation would be required.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nguyen, V., Boyd-Graber, J., Resnik, P., &amp; Miler, K. (2015). Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress. <em>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics</em>, 1438-1448. <a href="https://doi.org/10.3115/v1/P15-1139">https://doi.org/10.3115/v1/P15-1139</a></p>
<p><strong>Publication</strong>: ACL 2015</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{nguyenTeaPartyHouse2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Tea {{Party}} in the {{House}}: {{A Hierarchical Ideal Point Topic Model}} and {{Its Application}} to {{Republican Legislators}} in the 112th {{Congress}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Tea {{Party}} in the {{House}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 53rd {{Annual Meeting}} of the {{Association}} for {{Computational Linguistics}} and the 7th {{International Joint Conference}} on {{Natural Language Processing}} ({{Volume}} 1: {{Long Papers}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Nguyen, Viet-An and {Boyd-Graber}, Jordan and Resnik, Philip and Miler, Kristina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1438--1448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Beijing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3115/v1/P15-1139}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2023-11-02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We introduce the Hierarchical Ideal Point Topic Model, which provides a rich picture of policy issues, framing, and voting behavior using a joint model of votes, bill text, and the language that legislators use when debating bills. We use this model to look at the relationship between Tea Party Republicans and ``establishment&#39;&#39; Republicans in the U.S. House of Representatives during the 112th Congress.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://aclanthology.org/P15-1139/">ACL Anthology: Tea Party in the House</a></li>
<li>Gervais, B. T., &amp; Morris, I. L. (2012). Reading the tea leaves: Understanding Tea Party Caucus membership in the US House of Representatives. <em>PS: Political Science &amp; Politics</em>, 45(2), 245-250.</li>
<li>Gervais, B. T., &amp; Morris, I. L. (2014). Black Tea, Green Tea, White Tea, and Coffee: Understanding the variation in attachment to the Tea Party among members of Congress. In <em>Annual Meeting of the American Political Science Association</em>. (Source of the &ldquo;Green Tea&rdquo; Republican taxonomy cited in the paper)</li>
</ul>
]]></content:encoded></item><item><title>Party Matters: Enhancing Legislative Vote Embeddings</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/party-matters-hiptm/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/party-matters-hiptm/</guid><description>A method for improving legislative vote prediction across sessions by augmenting bill text embeddings with sponsor metadata.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel neural architecture that modifies how bill embeddings are constructed by explicitly incorporating sponsor metadata alongside text. The authors validate this method by comparing it against text-only baselines (a mean word embedding model, MWE, and a CNN) and demonstrating superior performance in a newly defined &ldquo;out-of-session&rdquo; evaluation setting.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Existing models for predicting legislative roll-call votes rely heavily on text or voting history within a single session. However, these models fail to generalize across sessions because the underlying data generation process changes. Specifically, the ideological position of bills on similar topics shifts depending on which party is in power. A model trained on a single session learns an implicit ideological prior that becomes inaccurate when the political context changes in subsequent sessions.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is a neural architecture that augments bill text representations with sponsor ideology, specifically the percentage of Republican vs. Democratic sponsors.</p>
<ul>
<li><strong>Sponsor-Weighted Embeddings</strong>: They compute a composite embedding where the text representation is weighted by party sponsorship percentages ($p_{r}, p_{d}$) and party-specific influence vectors ($a_{r}, a_{d}$).</li>
<li><strong>Out-of-Session Evaluation</strong>: They introduce a stricter evaluation setting in which models trained on past sessions (e.g., 2005-2012) are evaluated on future sessions (e.g., 2013-2014) to measure generalization, which previous work had not examined.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated their models using a dataset of U.S. Congressional bills from 2005 to 2016.</p>
<ul>
<li><strong>Models Tested</strong>: They compared text-only models (MWE (Mean Word Embedding), CNN) against metadata-augmented versions (MWE+Meta, CNN+Meta) and a &ldquo;Meta-Only&rdquo; baseline (using dummy text).</li>
<li><strong>Settings</strong>:
<ul>
<li><strong>In-Session</strong>: 5-fold cross-validation on 2005-2012 data.</li>
<li><strong>Out-of-Session</strong>: Training on 2005-2012 and testing on 2013-2014 and 2015-2016.</li>
</ul>
</li>
<li><strong>Baselines</strong>: Comparisons included a &ldquo;Guess Yes&rdquo; baseline and an SVM trained on bag-of-words summaries with sponsor indicators.</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Metadata is Critical</strong>: Augmenting text with sponsor metadata consistently outperformed text-only models. The <code>CNN+Meta</code> model achieved the highest accuracy in-session (86.21% vs. 83.24% for CNN) and on 2013-2014 out-of-session (83.59%), while <code>MWE+Meta</code> achieved the best 2015-2016 accuracy (71.90%).</li>
<li><strong>Generalization</strong>: Text-only models degraded significantly in out-of-session testing. For example, CNN dropped from 83.24% in-session to 77.49% on 2013-2014 and 69.63% on 2015-2016, confirming that text alone fails to capture shifting ideological contexts.</li>
<li><strong>Sponsor Signal</strong>: The <code>Meta-Only</code> model (using no text) outperformed text-only models in the 2013-2014 out-of-session test (82.28% vs. 77.57% for MWE), suggesting that in some contexts the sponsors&rsquo; party affiliation provides a stronger predictive signal than the bill&rsquo;s content.</li>
<li><strong>2015-2016 Difficulty</strong>: All models performed worse on the 2015-2016 session, where intra-party divisions within the House Republican caucus disrupted typical voting dynamics.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: Collected from GovTrack. The paper text references the &ldquo;106th to 111th&rdquo; Congressional sessions, but the data tables show coverage from 2005 to 2016, which corresponds to the 109th through 114th sessions.</li>
<li><strong>Content</strong>: Non-unanimous roll call votes, full text of bills/resolutions, and Congressional Research Service (CRS) summaries.</li>
<li><strong>Filtering</strong>: Bills with unanimous votes were excluded.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>Text lowercased and stop-words removed.</li>
<li>Summaries truncated to $N=400$ words; full text truncated to $N=2000$ words (80th percentile lengths).</li>
</ul>
</li>
<li><strong>Splits</strong>:
<ul>
<li><em>Training</em>: Sessions 2005-2012 (1718 bills).</li>
<li><em>Testing</em>: Sessions 2013-2014 (360 bills) and 2015-2016 (382 bills).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Bill Representation ($v_{B}$)</strong>:
$$v_{B}=((a_{r}p_{r})\cdot T_{r})+((a_{d}p_{d})\cdot T_{d})$$
Where $T$ is the text embedding (CNN or MWE), $p$ is the percentage of sponsors from a party, and $a$ is a learnable party influence vector. $T_{r}$ and $T_{d}$ are Republican and Democratic copies of the same bill&rsquo;s text representation, each weighted by the corresponding party&rsquo;s sponsorship proportion.</li>
<li><strong>Vote Prediction</strong>:
<ul>
<li>Project bill embedding to legislator space: $v_{BL}=W_{B}v_{B}+b_{B}$.</li>
<li>Alignment score: $W_{v}(v_{BL}\odot v_{L})+b_{v}$ (using element-wise multiplication).</li>
<li>Output: Sigmoid activation.</li>
</ul>
</li>
<li><strong>Optimization</strong>: AdaMax algorithm with binary cross-entropy loss.</li>
</ul>
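<p>To make the bill representation concrete, the sketch below assumes that both party copies equal the same text embedding ($T_r = T_d = T$) and that the product acts element-wise, written $\odot$. Both are simplifying assumptions made here for illustration. Under them the formula collapses to a single sponsor-dependent gate applied to the text embedding:</p>
<p>$$v_B = (p_r a_r + p_d a_d) \odot T$$</p>
<p>With hypothetical two-dimensional values $T = (1,\ 2)$, $a_r = (1,\ -1)$, $a_d = (-1,\ 1)$, $p_r = 0.75$, and $p_d = 0.25$, the gate is $(0.75 - 0.25,\ -0.75 + 0.25) = (0.5,\ -0.5)$ and so $v_B = (0.5,\ -1.0)$. Swapping the sponsorship shares flips the sign of $v_B$, which is how identical text can land on opposite sides of the legislator space. The prediction steps then compose as:</p>
<p>$$\hat{y} = \sigma\big(W_v\,(v_{BL} \odot v_L) + b_v\big), \qquad v_{BL} = W_B v_B + b_B$$</p>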
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Encoders</strong>:
<ul>
<li><strong>CNN</strong>: 4-grams with 400 filter maps.</li>
<li><strong>MWE</strong>: <a href="/posts/intro-to-word-embeddings/">Mean Word Embedding</a>.</li>
</ul>
</li>
<li><strong>Embeddings</strong>:
<ul>
<li>Initialized with 50-dimensional GloVe vectors.</li>
<li>Embeddings are non-static (updated during training).</li>
<li>Legislator embedding size ($v_{L}$): 25 dimensions.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Weights initialized with Glorot uniform distribution.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy.</li>
<li><strong>Comparison</strong>:
<ul>
<li><strong>In-session</strong>: 5-fold cross-validation.</li>
<li><strong>Out-of-session</strong>: Train on past sessions, predict future sessions.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Config</strong>: Models trained for 50 epochs with mini-batches of size 50. No specific GPU or compute requirements are reported.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.govtrack.us/">GovTrack</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Source for bill texts and roll-call votes</td>
      </tr>
  </tbody>
</table>
<p>No official code repository or pretrained models were released with this paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kornilova, A., Argyle, D., &amp; Eidelman, V. (2018). Party Matters: Enhancing Legislative Embeddings with Author Attributes for Vote Prediction. <em>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</em>, 510-515. <a href="https://doi.org/10.18653/v1/p18-2081">https://doi.org/10.18653/v1/p18-2081</a></p>
<p><strong>Publication</strong>: ACL 2018</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kornilovaPartyMattersEnhancing2018,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Party {{Matters}}: {{Enhancing Legislative Embeddings}} with {{Author Attributes}} for {{Vote Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Party {{Matters}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kornilova, Anastassia and Argyle, Daniel and Eidelman, Vlad}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 56th {{Annual Meeting}} of the {{Association}} for {{Computational Linguistics}} ({{Volume}} 2: {{Short Papers}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{510--515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Melbourne, Australia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.18653/v1/p18-2081}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{1805.08182}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>A Spatial Model for Legislative Roll Call Analysis</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/nominate-1985/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/nominate-1985/</guid><description>Introduces NOMINATE, a probabilistic spatial model estimating legislator ideal points from roll call data via maximum likelihood.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It introduces a &ldquo;general nonlinear logit model&rdquo; and a specific estimation algorithm (<strong>NOMINATE</strong>) to analyze political choice data. The paper focuses on deriving a metric spatial map from nominal data (yea/nay votes). It validates this method by comparing it against existing techniques like Guttman scaling and factor analysis, demonstrating that the new method recovers geometric structures that previous methods obscured.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Prior research relied on &ldquo;black box&rdquo; statistical methods (like factor analysis or nonmetric scaling) or Guttman scaling to analyze legislative behavior. These methods had significant limitations:</p>
<ul>
<li><strong>Metric Recovery</strong>: They struggled to accurately recover the underlying Euclidean coordinates of legislators and choices from nominal data.</li>
<li><strong>Dimensionality</strong>: They tended to exaggerate the number of dimensions (issues) because they did not account for probabilistic error in voting.</li>
<li><strong>Identification</strong>: Pure Guttman scaling (assuming perfect voting) identifies only the order of legislators, leaving the location of policy alternatives unknown.</li>
</ul>
<p>The authors sought to bridge the &ldquo;crucial gap&rdquo; between spatial theory and data by developing a model-driven procedure that simultaneously estimates the locations of choosers and choices while accounting for error.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core contribution is the <strong>NOMINATE</strong> (Nominal Three-step Estimation) procedure. Key innovations include:</p>
<ul>
<li><strong>Simultaneous Estimation</strong>: This method estimates coordinates for <em>both</em> the legislators ($x_i$) and the roll call outcomes ($z_{jl}$) in a common space simultaneously.</li>
<li><strong>Probabilistic Utility</strong>: It employs a bell-shaped utility function with a stochastic error term (distributed as the log of the inverse exponential), which yields a tractable logit model of probabilistic voting.</li>
<li><strong>Metric Unfolding</strong>: It provides an &ldquo;unfolding methodology for nominal level data,&rdquo; recovering metric distances solely from binary choices.</li>
</ul>
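<p>The probabilistic utility idea can be sketched as follows. This uses the form in which NOMINATE is commonly presented. The symbols $\omega$ (a fixed width constant) and $d_{ijl}$ (the distance between legislator $i$ and outcome $l$ of roll call $j$) are notation assumed here, so the details should be checked against the paper:</p>
<p>$$u_{ijl} = \beta \exp\Big(-\frac{\omega^2 d_{ijl}^2}{2}\Big), \qquad d_{ijl} = |x_i - z_{jl}|$$</p>
<p>$$P(i \text{ votes yea on } j) = \frac{e^{u_{ijy}}}{e^{u_{ijy}} + e^{u_{ijn}}}$$</p>
<p>In this sketch, a legislator who is equidistant from the yea and nay outcomes votes yea with probability 0.5, and a larger $\beta$ makes choices more deterministic because the same distance gap produces a larger utility gap.</p>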
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model through both historical data analysis and synthetic testing:</p>
<ul>
<li><strong>US House Analysis (1957-58)</strong>: Analyzed 172 roll calls from the 85th Congress to compare NOMINATE results against Miller and Stokes&rsquo; influential Guttman scales.</li>
<li><strong>US Senate Analysis (1979-1982)</strong>: Performed separate estimations for four years of Senate voting to assess stability and validity.</li>
<li><strong>Monte Carlo Simulations</strong>: Generated synthetic data (98 legislators and 291 roll calls in most runs, 50 legislators in one run) for different values of $\beta$ to test the robustness of parameter recovery under known &ldquo;truth&rdquo; conditions.</li>
<li><strong>Robustness Checks</strong>: Tested sensitivity to &ldquo;perfect&rdquo; legislators (who never vote against their side) and outliers (like Senator Proxmire).</li>
</ul>
<h2 id="what-outcomesconclusions">What outcomes/conclusions?</h2>
<ul>
<li><strong>Unidimensionality</strong>: A single liberal-conservative dimension correctly classified ~80% of individual choices in the US House and Senate.</li>
<li><strong>Dimensionality Reduction</strong>: The model demonstrated that distinct &ldquo;issue scales&rdquo; found in previous research (e.g., social welfare vs. foreign policy) could largely be mapped onto a single dimension when error is accounted for.</li>
<li><strong>Strategic Behavior</strong>: The analysis revealed that majority leadership tends to place roll call midpoints slightly away from the median legislator to increase the probability of passage.</li>
<li><strong>Geometric Mean Probability</strong>: The authors introduced the geometric mean probability as a more robust metric than simple classification error for evaluating probabilistic models.</li>
<li><strong>Limitations</strong>: The authors acknowledge that the model is restricted to one dimension with a common utility function, and that civil rights voting represents a genuinely separate dimension not captured by the liberal-conservative axis. Standard errors computed from the alternating procedure are theoretically approximate (computed from separate information matrices rather than the full joint matrix), though Monte Carlo tests showed them to be reasonably reliable in practice. Extensions to multidimensional models and variable utility functions are deferred to later work.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The paper analyzes roll call voting matrices in which rows are legislators and columns are roll calls. A roll call is a procedure in which each legislator&rsquo;s name is called and their individual vote is recorded, producing a complete public record of who voted which way.</p>
<table>
  <thead>
      <tr>
          <th>Context</th>
          <th>Size</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>US House (85th)</strong></td>
          <td>440 Legislators x 172 Roll Calls</td>
          <td>68,284 choices; 1957-58</td>
      </tr>
      <tr>
          <td><strong>US Senate</strong></td>
          <td>~100 Senators/year</td>
          <td>Years 1979, 1980, 1981, 1982</td>
      </tr>
      <tr>
          <td><strong>Filtering</strong></td>
          <td>Cutoff &gt; 2.5%</td>
          <td>Roll calls with &lt; 2.5% minority vote are excluded to prevent &ldquo;noise&rdquo; from distorting estimates.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The <strong>NOMINATE</strong> algorithm maximizes the log-likelihood of observed choices using a constrained nonlinear maximum likelihood procedure.</p>
<p><strong>Utility Function</strong>:
The utility of legislator $i$ for outcome $j$ on roll call $l$ is:
$$U_{ijl}=\beta~\exp\left[\frac{-\omega^{2}d_{ijl}^{2}}{2}\right]+\epsilon_{ijl}$$
where $d_{ijl}$ is the Euclidean distance between legislator $i$ and outcome $j$.</p>
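<p>As a rough illustration of how this utility turns distances into vote probabilities, the sketch below assumes the logit form that follows from the log-of-inverse-exponential error term, $P(\text{Yea}) = e^{u_y} / (e^{u_y} + e^{u_n})$. The function names, the coordinates, and the value $\beta = 15$ are hypothetical and chosen only for illustration.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

def utility(x, z, beta, omega=0.5):
    """Deterministic utility of a legislator at x for an outcome at z."""
    d = abs(x - z)  # one-dimensional Euclidean distance
    return beta * math.exp(-0.5 * (omega ** 2) * (d ** 2))

def prob_yea(x, z_yea, z_nay, beta, omega=0.5):
    """Logit probability of voting Yea given both outcome locations."""
    u_yea = utility(x, z_yea, beta, omega)
    u_nay = utility(x, z_nay, beta, omega)
    return 1.0 / (1.0 + math.exp(u_nay - u_yea))

# Hypothetical values for illustration only (not taken from the paper).
p_close = prob_yea(x=0.5, z_yea=0.6, z_nay=-0.4, beta=15.0)  # near the Yea outcome
p_mid = prob_yea(x=0.1, z_yea=0.6, z_nay=-0.4, beta=15.0)    # at the roll call midpoint
</code></pre></div>
<p>A legislator located at the midpoint between the two outcomes is indifferent, so the probability is one half; moving toward the Yea outcome raises it.</p>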
<p><strong>Optimization Strategy (Global Iteration)</strong>:
Because estimating ~800 parameters simultaneously is impractical, the algorithm uses an alternating three-step method:</p>
<ol>
<li><strong>Utility Parameters</strong>: Estimate $\beta$ and $\omega$ while holding legislator ($x$) and roll call ($z$) coordinates fixed.</li>
<li><strong>Legislator Coordinates</strong>: Estimate $x_i$ for each legislator (independent of others) holding $\beta, \omega, z$ fixed.</li>
<li><strong>Roll Call Coordinates</strong>: Estimate $z_{yl}, z_{nl}$ for each roll call holding $\beta, \omega, x$ fixed.</li>
</ol>
<p>This cycle repeats until parameters correlate at the 0.99 level between iterations.</p>
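<p>The stopping rule can be sketched as a Pearson correlation between successive estimates. This is a minimal sketch: applying the check to the legislator coordinates, and the names <code>has_converged</code>, <code>x_prev</code>, and <code>x_curr</code>, are assumptions made for illustration.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def has_converged(previous, current, threshold=0.99):
    """True when successive parameter vectors correlate at the threshold."""
    r = np.corrcoef(previous, current)[0, 1]  # Pearson correlation
    return bool(r >= threshold)

# Hypothetical legislator coordinates from two successive global iterations.
x_prev = np.array([-0.9, -0.4, 0.0, 0.3, 0.8])
x_curr = np.array([-0.88, -0.42, 0.01, 0.31, 0.79])
converged = has_converged(x_prev, x_curr)
</code></pre></div>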
<h3 id="models">Models</h3>
<p>The model estimates the following parameters for a one-dimensional space:</p>
<ul>
<li><strong>Legislator Coordinates ($x_i$)</strong>: The ideal point of each legislator, normalized to the range $[-1, +1]$.</li>
<li><strong>Outcome Coordinates ($z_{yl}, z_{nl}$)</strong>: The spatial location of the &ldquo;Yea&rdquo; and &ldquo;Nay&rdquo; policy outcomes for each vote.</li>
<li><strong>Signal-to-Noise ($\beta$)</strong>: Represents the weight of the spatial component versus the error term.</li>
<li><strong>Weighting ($\omega$)</strong>: A shape parameter for the utility function (often fixed to $0.5$ in practice due to collinearity with $\beta$).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is evaluated primarily via classification accuracy and probabilistic fit.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Context</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Classification</strong></td>
          <td>78.9%</td>
          <td>House (1957-58)</td>
          <td>Correctly predicts Yea/Nay choice</td>
      </tr>
      <tr>
          <td><strong>Classification</strong></td>
          <td>80.3 / 80.6 / 83.2 / 81.7%</td>
          <td>Senate (1979 / 1980 / 1981 / 1982)</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Geo. Mean Prob.</strong></td>
          <td>0.642 (House); 0.654 / 0.638 / 0.657 / 0.637 (Senate 1979 / 1980 / 1981 / 1982)</td>
          <td>Unconstrained roll calls</td>
          <td>Exponential of the average log likelihood</td>
      </tr>
  </tbody>
</table>
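<p>Both metrics can be computed from the probabilities the model assigns to the choices that were actually observed. The sketch below uses hypothetical probabilities and assumes a choice counts as correctly classified when its predicted probability exceeds 0.5.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def geometric_mean_probability(p_observed):
    """Exponential of the average log-likelihood of the observed choices."""
    p = np.asarray(p_observed, dtype=float)
    return float(np.exp(np.mean(np.log(p))))

def classification_rate(p_observed):
    """Share of choices where the observed vote had probability above 0.5."""
    p = np.asarray(p_observed, dtype=float)
    return float(np.mean(p > 0.5))

# Hypothetical probabilities assigned to four observed votes.
probs = [0.9, 0.8, 0.6, 0.3]
gmp = geometric_mean_probability(probs)  # (0.9 * 0.8 * 0.6 * 0.3) ** 0.25
rate = classification_rate(probs)        # three of the four exceed 0.5
</code></pre></div>
<p>Unlike the classification rate, the geometric mean probability reflects how confident the model was in each choice, so a confident wrong prediction lowers it more than a marginal one.</p>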
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Development</strong>: DEC-2060</li>
<li><strong>Production</strong>: VAX-11/780</li>
</ul>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<p>This paper predates modern open-source conventions. No original source code was released, and the NOMINATE algorithm was described at an overview level rather than with full pseudocode. However, the underlying roll call voting data for the U.S. Congress is now freely available through the <a href="https://voteview.com/">Voteview</a> project, which Poole and Rosenthal later maintained. Modern open-source reimplementations exist, including the R package <code>wnominate</code>, which builds on the roll call data structures in <code>pscl</code>. Reproducibility status: <strong>Partially Reproducible</strong> (data available, modern reimplementations exist, but original code not released).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Poole, K. T., &amp; Rosenthal, H. (1985). A Spatial Model for Legislative Roll Call Analysis. <em>American Journal of Political Science</em>, 29(2), 357-384. <a href="https://doi.org/10.2307/2111172">https://doi.org/10.2307/2111172</a></p>
<p><strong>Publication</strong>: American Journal of Political Science 1985</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{pooleSpatialModelLegislative1985,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{A {{Spatial Model}} for {{Legislative Roll Call Analysis}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Poole, Keith T. and Rosenthal, Howard}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">1985</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{American Journal of Political Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{29}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{357--384}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.2307/2111172}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/NOMINATE_(scaling_method)">Wikipedia: NOMINATE</a></li>
<li><a href="https://voteview.com/">Voteview (Modern Repository)</a></li>
</ul>
]]></content:encoded></item><item><title>Classifying Congressional Bills with Machine Learning</title><link>https://hunterheidenreich.com/posts/congressional-bill-policy-area-classification/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/congressional-bill-policy-area-classification/</guid><description>Testing ML classification of congressional bills by policy area. Comparing Naive Bayes, Logistic Regression, and XGBoost on legislative text.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>This post explores machine learning approaches for classifying congressional bills by policy area, using data from the 115th to 117th Congresses (2017-2023). We&rsquo;ll examine:</p>
<ul>
<li>The fundamentals of bill classification</li>
<li>Traditional machine learning models as baselines</li>
<li>Performance analysis across different time periods and policy domains</li>
</ul>
<p>This work establishes baselines for future deep learning approaches to legislative text classification.</p>
<p><em>This post builds on the data foundation established in <a href="/posts/us-117th-congress-data-exploration/">Exploring the 117th U.S. Congress</a>.</em></p>
<h3 id="motivation">Motivation</h3>
<p>Automatically classifying congressional bills by policy area has practical value for researchers, journalists, and citizens who need to navigate thousands of bills each Congress. Machine learning can help identify patterns in legislative priorities and track policy trends over time.</p>
<h2 id="data">Data</h2>
<p>The data comes from scraping <a href="https://www.congress.gov/">Congress.gov</a> for all bills from the 115th through 117th Congresses. Each bill includes:</p>
<ul>
<li>Bill ID and title</li>
<li>Summary (when available): the earliest summary provided</li>
<li>Full text (when available): the earliest text version</li>
<li>Policy area classification</li>
</ul>
<p>Our task is to predict policy area from text features:</p>
<p>$$
f(X) = \hat{y}, \quad \text{where} \quad X = \{ \text{title}, \text{summary}, \text{text} \}, \quad \hat{y} \in \{ \text{policy areas} \}
$$</p>
<p>The complete dataset is available at <a href="https://huggingface.co/datasets/hheiden/us-congress-bill-policy-115_117">Hugging Face: hheiden/us-congress-bill-policy-115_117</a>.</p>
<h3 id="bills-by-congress">Bills by Congress</h3>
<p>Our dataset contains the following distribution:</p>
<table>
  <thead>
      <tr>
          <th>Congress</th>
          <th>Bills</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>115th</td>
          <td>13,556</td>
      </tr>
      <tr>
          <td>116th</td>
          <td>16,601</td>
      </tr>
      <tr>
          <td>117th</td>
          <td>17,817</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>47,974</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="policy-areas">Policy Areas</h3>
<p>Each bill receives a policy area label from <a href="https://www.congress.gov/">Congress.gov</a> (see <a href="https://www.congress.gov/help/field-values/policy-area">glossary</a>). The dataset includes 33 policy areas, though these classes are highly imbalanced.</p>
<p>The following table shows the number of bills in each policy area across the three Congresses:</p>
<table>
  <thead>
      <tr>
          <th>Policy Area</th>
          <th>115th</th>
          <th>116th</th>
          <th>117th</th>
          <th>Total</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Agriculture and Food</td>
          <td>312</td>
          <td>328</td>
          <td>398</td>
          <td>1,038</td>
      </tr>
      <tr>
          <td>Animals</td>
          <td>96</td>
          <td>83</td>
          <td>71</td>
          <td>250</td>
      </tr>
      <tr>
          <td>Armed Forces and National Security</td>
          <td>1,108</td>
          <td>1,337</td>
          <td>1,399</td>
          <td>3,844</td>
      </tr>
      <tr>
          <td>Arts, Culture, Religion</td>
          <td>81</td>
          <td>79</td>
          <td>103</td>
          <td>263</td>
      </tr>
      <tr>
          <td>Civil Rights and Liberties, Minority Issues</td>
          <td>175</td>
          <td>205</td>
          <td>220</td>
          <td>600</td>
      </tr>
      <tr>
          <td>Commerce</td>
          <td>312</td>
          <td>593</td>
          <td>633</td>
          <td>1,538</td>
      </tr>
      <tr>
          <td>Congress</td>
          <td>594</td>
          <td>541</td>
          <td>640</td>
          <td>1,775</td>
      </tr>
      <tr>
          <td>Crime and Law Enforcement</td>
          <td>827</td>
          <td>904</td>
          <td>1,022</td>
          <td>2,753</td>
      </tr>
      <tr>
          <td>Economics and Public Finance</td>
          <td>176</td>
          <td>210</td>
          <td>197</td>
          <td>583</td>
      </tr>
      <tr>
          <td>Education</td>
          <td>607</td>
          <td>798</td>
          <td>801</td>
          <td>2,206</td>
      </tr>
      <tr>
          <td>Emergency Management</td>
          <td>207</td>
          <td>198</td>
          <td>202</td>
          <td>607</td>
      </tr>
      <tr>
          <td>Energy</td>
          <td>316</td>
          <td>370</td>
          <td>530</td>
          <td>1,216</td>
      </tr>
      <tr>
          <td>Environmental Protection</td>
          <td>352</td>
          <td>423</td>
          <td>464</td>
          <td>1,239</td>
      </tr>
      <tr>
          <td>Families</td>
          <td>79</td>
          <td>127</td>
          <td>139</td>
          <td>345</td>
      </tr>
      <tr>
          <td>Finance and Financial Sector</td>
          <td>556</td>
          <td>611</td>
          <td>601</td>
          <td>1,768</td>
      </tr>
      <tr>
          <td>Foreign Trade and International Finance</td>
          <td>120</td>
          <td>148</td>
          <td>212</td>
          <td>480</td>
      </tr>
      <tr>
          <td>Government Operations and Politics</td>
          <td>1,008</td>
          <td>1,258</td>
          <td>1,272</td>
          <td>3,538</td>
      </tr>
      <tr>
          <td>Health</td>
          <td>1,526</td>
          <td>2,109</td>
          <td>2,276</td>
          <td>5,911</td>
      </tr>
      <tr>
          <td>Housing and Community Development</td>
          <td>142</td>
          <td>250</td>
          <td>231</td>
          <td>623</td>
      </tr>
      <tr>
          <td>Immigration</td>
          <td>398</td>
          <td>466</td>
          <td>591</td>
          <td>1,455</td>
      </tr>
      <tr>
          <td>International Affairs</td>
          <td>918</td>
          <td>1,178</td>
          <td>1,390</td>
          <td>3,486</td>
      </tr>
      <tr>
          <td>Labor and Employment</td>
          <td>348</td>
          <td>452</td>
          <td>552</td>
          <td>1,352</td>
      </tr>
      <tr>
          <td>Law</td>
          <td>109</td>
          <td>162</td>
          <td>175</td>
          <td>446</td>
      </tr>
      <tr>
          <td>Native Americans</td>
          <td>175</td>
          <td>234</td>
          <td>245</td>
          <td>654</td>
      </tr>
      <tr>
          <td>Public Lands and Natural Resources</td>
          <td>718</td>
          <td>648</td>
          <td>642</td>
          <td>2,008</td>
      </tr>
      <tr>
          <td>Science, Technology, Communications</td>
          <td>389</td>
          <td>551</td>
          <td>505</td>
          <td>1,445</td>
      </tr>
      <tr>
          <td>Social Sciences and History</td>
          <td>5</td>
          <td>6</td>
          <td>4</td>
          <td>15</td>
      </tr>
      <tr>
          <td>Social Welfare</td>
          <td>177</td>
          <td>229</td>
          <td>199</td>
          <td>605</td>
      </tr>
      <tr>
          <td>Sports and Recreation</td>
          <td>92</td>
          <td>93</td>
          <td>125</td>
          <td>310</td>
      </tr>
      <tr>
          <td>Taxation</td>
          <td>983</td>
          <td>1,156</td>
          <td>1,078</td>
          <td>3,217</td>
      </tr>
      <tr>
          <td>Transportation and Public Works</td>
          <td>492</td>
          <td>672</td>
          <td>742</td>
          <td>1,906</td>
      </tr>
      <tr>
          <td>Water Resources Development</td>
          <td>89</td>
          <td>111</td>
          <td>110</td>
          <td>310</td>
      </tr>
      <tr>
          <td>Private Legislation</td>
          <td>69</td>
          <td>71</td>
          <td>48</td>
          <td>188</td>
      </tr>
  </tbody>
</table>
<p>The class imbalance is severe: <code>Social Sciences and History</code> has only 15 bills across all three Congresses, while <code>Health</code> has 5,911 bills. This imbalance presents modeling challenges, as minority classes may lack sufficient representative samples.</p>
<h3 id="text-statistics">Text Statistics</h3>
<p>We analyzed token counts using spaCy to understand the computational requirements for each text field.</p>
<p>Title Token Statistics:</p>
<table>
  <thead>
      <tr>
          <th>Congress</th>
          <th>Average Tokens</th>
          <th>Min Tokens</th>
          <th>Max Tokens</th>
          <th>Total Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>115th</td>
          <td>12.3</td>
          <td>1</td>
          <td>167</td>
          <td>166,763</td>
      </tr>
      <tr>
          <td>116th</td>
          <td>11.3</td>
          <td>1</td>
          <td>226</td>
          <td>188,158</td>
      </tr>
      <tr>
          <td>117th</td>
          <td>11.5</td>
          <td>1</td>
          <td>272</td>
          <td>204,978</td>
      </tr>
      <tr>
          <td>All</td>
          <td>11.7</td>
          <td>1</td>
          <td>272</td>
          <td>559,419</td>
      </tr>
  </tbody>
</table>
<p>Summary Token Statistics:</p>
<table>
  <thead>
      <tr>
          <th>Congress</th>
          <th>Average Tokens</th>
          <th>Min Tokens</th>
          <th>Max Tokens</th>
          <th>Total Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>115th</td>
          <td>109.1</td>
          <td>2</td>
          <td>6,839</td>
          <td>1,479,212</td>
      </tr>
      <tr>
          <td>116th</td>
          <td>94.9</td>
          <td>2</td>
          <td>5,886</td>
          <td>1,574,732</td>
      </tr>
      <tr>
          <td>117th</td>
          <td>95.1</td>
          <td>2</td>
          <td>502</td>
          <td>1,695,276</td>
      </tr>
      <tr>
          <td>All</td>
          <td>99.0</td>
          <td>2</td>
          <td>6,839</td>
          <td>4,749,220</td>
      </tr>
  </tbody>
</table>
<p>Full Text Token Statistics:</p>
<table>
  <thead>
      <tr>
          <th>Congress</th>
          <th>Average Tokens</th>
          <th>Min Tokens</th>
          <th>Max Tokens</th>
          <th>Total Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>115th</td>
          <td>2,588.7</td>
          <td>91</td>
          <td>304,478</td>
          <td>35,092,075</td>
      </tr>
      <tr>
          <td>116th</td>
          <td>2,760.3</td>
          <td>70</td>
          <td>973,173</td>
          <td>45,824,498</td>
      </tr>
      <tr>
          <td>117th</td>
          <td>2,706.7</td>
          <td>71</td>
          <td>1,013,608</td>
          <td>48,224,757</td>
      </tr>
      <tr>
          <td>All</td>
          <td>2,691.9</td>
          <td>70</td>
          <td>1,013,608</td>
          <td>129,141,330</td>
      </tr>
  </tbody>
</table>
<p>These statistics reveal computational trade-offs:</p>
<ul>
<li><strong>Titles</strong> average ~12 tokens: computationally efficient but limited information.</li>
<li><strong>Summaries</strong> average ~100 tokens: good balance of information and efficiency.</li>
<li><strong>Full text</strong> averages ~2,700 tokens with 129M total tokens: detailed but computationally expensive. Processing this volume of text introduces real-world engineering challenges, such as memory constraints and a higher noise-to-signal ratio typical of long legal documents.</li>
</ul>
<p>We&rsquo;ll prototype with titles and summaries before considering full text, given the computational costs involved.</p>
<h2 id="evaluation-framework">Evaluation Framework</h2>
<h3 id="experimental-design">Experimental Design</h3>
<p>We train models on one Congress and test on each of the three, creating a 3x3 evaluation grid. This setup evaluates both within-Congress performance (train and test on the same Congress) and cross-Congress generalization (train on one Congress, test on another). We expect temporal drift between Congresses to reduce cross-Congress performance.</p>
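<p>The grid can be sketched as a nested loop over Congresses. This is a minimal sketch, not the code used for the experiments: <code>evaluation_grid</code> and its callable arguments are hypothetical names, and a toy majority-class model with leave-one-out scoring stands in for the real pipelines so the example runs on its own.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from collections import Counter

def evaluation_grid(datasets, fit, score, cv_score):
    """Train on each Congress and evaluate on every Congress."""
    grid = {}
    for train_name, (X_train, y_train) in datasets.items():
        model = fit(X_train, y_train)
        for test_name, (X_test, y_test) in datasets.items():
            if test_name == train_name:
                # Within-Congress cell: cross-validated score on the same data.
                grid[(train_name, test_name)] = cv_score(X_train, y_train)
            else:
                # Cross-Congress cell: score on the entire target Congress.
                grid[(train_name, test_name)] = score(model, X_test, y_test)
    return grid

# Toy stand-ins (assumptions for illustration only).
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]  # the "model" is the majority label

def accuracy(model, X, y):
    return sum(label == model for label in y) / len(y)

def loo_accuracy(X, y):
    hits = 0
    for i in range(len(y)):
        rest = y[:i] + y[i + 1:]  # hold out one example
        hits += fit_majority(None, rest) == y[i]
    return hits / len(y)

toy = {
    "115": (["a", "b", "c"], ["Health", "Health", "Taxation"]),
    "116": (["d", "e"], ["Health", "Energy"]),
    "117": (["f", "g"], ["Taxation", "Taxation"]),
}
grid = evaluation_grid(toy, fit_majority, accuracy, loo_accuracy)
</code></pre></div>
<p>With three Congresses the result has nine cells, keyed by (training Congress, test Congress).</p>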
<h3 id="metrics-and-hyperparameter-tuning">Metrics and Hyperparameter Tuning</h3>
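<p>We use the weighted-average F1 score: per-class F1 scores averaged with weights proportional to each class&rsquo;s support. This keeps the aggregate representative of the overall bill distribution, but frequent policy areas dominate it, so weak performance on rare classes can be masked; macro-averaged F1 would weight every policy area equally.</p>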
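<p>To make the averaging concrete, the sketch below computes weighted and macro F1 by hand on hypothetical labels, with one frequent class predicted well and one rare class always missed. It is a simplified illustration that averages over the classes present in the true labels; the function names are not from any library.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 score for each class that appears in y_true."""
    scores = {}
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1[c] * support[c] for c in f1) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1 averaged with equal weight for every class."""
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1.values()) / len(f1)

# Hypothetical labels: nine Health bills, one Animals bill, all predicted Health.
y_true = ["Health"] * 9 + ["Animals"]
y_pred = ["Health"] * 10
weighted = weighted_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
</code></pre></div>
<p>Here the weighted score stays high because the frequent class carries nine tenths of the weight, while the macro score is roughly halved by the missed rare class.</p>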
<p>For within-Congress evaluation, we report cross-validated scores. For cross-Congress evaluation, we test on the entire target Congress dataset.</p>
<p>Hyperparameter tuning uses Cross-Validation Grid Search with folds set to <code>min(3, n_samples)</code> to ensure all classes are represented. We apply the best parameters from training to test generalization across different Congresses.</p>
<h2 id="baseline-models">Baseline Models</h2>
<p>We evaluate three traditional machine learning approaches using TF-IDF vectorization.</p>
<h3 id="text-preprocessing">Text Preprocessing</h3>
<p>We convert text to numerical features using TF-IDF (term frequency-inverse document frequency), which weighs word importance by frequency within documents relative to the entire corpus. This creates normalized feature vectors suitable for machine learning classification.</p>
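<p>For reference, under the default settings of scikit-learn&rsquo;s <code>TfidfVectorizer</code> (smoothed IDF and L2 normalization, which the pipelines in this post do not override), the unnormalized weight of term $t$ in document $d$ is:</p>
<p>$$
\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \left( \ln \frac{1 + n}{1 + \text{df}(t)} + 1 \right)
$$</p>
<p>Here $\text{tf}(t, d)$ is the count of $t$ in $d$, $n$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. Each document vector is then rescaled to unit Euclidean length.</p>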
<h3 id="multinomial-naive-bayes">Multinomial Naive Bayes</h3>
<p>We start with Multinomial Naive Bayes as our simplest baseline. Despite its &ldquo;naive&rdquo; independence assumption between features, this model often performs surprisingly well for text classification tasks and serves as an important benchmark. If more complex models can&rsquo;t beat Naive Bayes, it signals potential issues with the approach or data.</p>
<p>The model&rsquo;s <code>feature_log_prob_</code> attribute reveals the most influential words for each policy area, providing interpretable insights into classification patterns.</p>
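<p>One way to read this attribute is to rank, for each class, the features with the highest log-probability. The sketch below is a minimal illustration on a hypothetical two-class, four-word matrix; <code>top_words_per_class</code> is not a scikit-learn function.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def top_words_per_class(feature_log_prob, feature_names, classes, k=3):
    """Return the k features with the highest log P(word | class) per class."""
    names = np.asarray(feature_names)
    top = {}
    for row, label in zip(np.asarray(feature_log_prob), classes):
        order = np.argsort(row)[::-1][:k]  # indices of the largest values first
        top[label] = names[order].tolist()
    return top

# Hypothetical values standing in for a fitted model's attributes.
log_prob = np.log(np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.05, 0.10, 0.25, 0.60],
]))
words = ["medicare", "hospital", "income", "tax"]
top = top_words_per_class(log_prob, words, ["Health", "Taxation"], k=2)
</code></pre></div>
<p>With a fitted grid search, the inputs would come from <code>grid_search.best_estimator_.named_steps[&#39;clf&#39;].feature_log_prob_</code>, <code>grid_search.best_estimator_.named_steps[&#39;tfidf&#39;].get_feature_names_out()</code>, and <code>grid_search.best_estimator_.named_steps[&#39;clf&#39;].classes_</code>. Note that this ranking favors words that are probable within a class, which are not always the most discriminative ones.</p>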
<p>You can see the code for training the Naive Bayes model below:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> TfidfVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.model_selection <span style="color:#f92672">import</span> GridSearchCV
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.pipeline <span style="color:#f92672">import</span> Pipeline
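</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.naive_bayes <span style="color:#f92672">import</span> MultinomialNB
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#f92672">as</span> np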
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a pipeline with TF-IDF vectorizer and Multinomial Naive Bayes classifier</span>
</span></span><span style="display:flex;"><span>pipeline <span style="color:#f92672">=</span> Pipeline([
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;tfidf&#39;</span>, TfidfVectorizer(lowercase<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, dtype<span style="color:#f92672">=</span>np<span style="color:#f92672">.</span>float32)),
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;clf&#39;</span>, MultinomialNB()),
</span></span><span style="display:flex;"><span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Define the parameters for grid search</span>
</span></span><span style="display:flex;"><span>parameters <span style="color:#f92672">=</span> {  
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__ngram_range&#39;</span>: [(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">3</span>)],
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>, <span style="color:#ae81ff">0.5</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__min_df&#39;</span>: (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">5</span>, <span style="color:#ae81ff">10</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;clf__alpha&#39;</span>: (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.01</span>, <span style="color:#ae81ff">0.001</span>),
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Perform grid search with cross-validation</span>
</span></span><span style="display:flex;"><span>grid_search <span style="color:#f92672">=</span> GridSearchCV(
</span></span><span style="display:flex;"><span>    pipeline,
</span></span><span style="display:flex;"><span>    parameters,
</span></span><span style="display:flex;"><span>    scoring<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;f1_weighted&#39;</span>,
</span></span><span style="display:flex;"><span>    n_jobs<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span>    refit<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>    cv<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>grid_search<span style="color:#f92672">.</span>fit(X_train, y_train)
</span></span></code></pre></div><h3 id="logistic-regression">Logistic Regression</h3>
<p>Logistic regression provides a natural step up in complexity from Naive Bayes. It uses the logistic function to convert linear combinations of features into probabilities, making it an excellent baseline for comparison with more sophisticated models while remaining interpretable.</p>
<p>You can see the code for training the Logistic Regression model below:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> TfidfVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.model_selection <span style="color:#f92672">import</span> GridSearchCV
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.pipeline <span style="color:#f92672">import</span> Pipeline
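</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.linear_model <span style="color:#f92672">import</span> LogisticRegression
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#f92672">as</span> np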
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a pipeline with TF-IDF vectorizer and Logistic Regression classifier</span>
</span></span><span style="display:flex;"><span>pipeline <span style="color:#f92672">=</span> Pipeline([
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;tfidf&#39;</span>, TfidfVectorizer(lowercase<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, dtype<span style="color:#f92672">=</span>np<span style="color:#f92672">.</span>float32)),
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;clf&#39;</span>, LogisticRegression(max_iter<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>, random_state<span style="color:#f92672">=</span><span style="color:#ae81ff">42</span>, class_weight<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;balanced&#39;</span>)),
</span></span><span style="display:flex;"><span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Define the parameters for grid search</span>
</span></span><span style="display:flex;"><span>parameters <span style="color:#f92672">=</span> {  
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__ngram_range&#39;</span>: [(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)],
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;clf__C&#39;</span>: [<span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">10</span>],
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Perform grid search with cross-validation</span>
</span></span><span style="display:flex;"><span>grid_search <span style="color:#f92672">=</span> GridSearchCV(
</span></span><span style="display:flex;"><span>    pipeline,
</span></span><span style="display:flex;"><span>    parameters,
</span></span><span style="display:flex;"><span>    scoring<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;f1_weighted&#39;</span>,
</span></span><span style="display:flex;"><span>    n_jobs<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span>    refit<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>    cv<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>grid_search<span style="color:#f92672">.</span>fit(X_train, y_train)
</span></span></code></pre></div><h3 id="xgboost">XGBoost</h3>
<p>We include XGBoost as our tree-based ensemble method. While XGBoost typically excels on structured tabular data, we test whether its gradient boosting approach can effectively handle TF-IDF features for text classification.</p>
<p>You can see the code for training the XGBoost model below:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> TfidfVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.model_selection <span style="color:#f92672">import</span> GridSearchCV
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.pipeline <span style="color:#f92672">import</span> Pipeline
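</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#f92672">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> xgboost <span style="color:#f92672">import</span> XGBClassifier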
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a pipeline with TF-IDF vectorizer and XGBoost classifier</span>
</span></span><span style="display:flex;"><span>pipeline <span style="color:#f92672">=</span> Pipeline([
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;tfidf&#39;</span>, TfidfVectorizer(lowercase<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, dtype<span style="color:#f92672">=</span>np<span style="color:#f92672">.</span>float32)),
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;clf&#39;</span>, XGBClassifier(use_label_encoder<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, eval_metric<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;mlogloss&#39;</span>, objective<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;multi:softmax&#39;</span>, seed<span style="color:#f92672">=</span><span style="color:#ae81ff">42</span>, n_jobs<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>)),
</span></span><span style="display:flex;"><span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Define the parameters for grid search</span>
</span></span><span style="display:flex;"><span>parameters <span style="color:#f92672">=</span> {  
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;clf__max_depth&#39;</span>: (<span style="color:#ae81ff">3</span>, <span style="color:#ae81ff">6</span>, <span style="color:#ae81ff">9</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;clf__n_estimators&#39;</span>: (<span style="color:#ae81ff">100</span>, <span style="color:#ae81ff">200</span>, <span style="color:#ae81ff">300</span>),
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Perform grid search with cross-validation</span>
</span></span><span style="display:flex;"><span>grid_search <span style="color:#f92672">=</span> GridSearchCV(
</span></span><span style="display:flex;"><span>    pipeline,
</span></span><span style="display:flex;"><span>    parameters,
</span></span><span style="display:flex;"><span>    scoring<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;f1_weighted&#39;</span>,
</span></span><span style="display:flex;"><span>    refit<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>    cv<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>    verbose<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>grid_search<span style="color:#f92672">.</span>fit(X_train, y_train, clf__sample_weight<span style="color:#f92672">=</span>sample_weight)
</span></span></code></pre></div><h2 id="results">Results</h2>
<p>We evaluate models on three input types:</p>
<ul>
<li><strong>Title-only</strong>: Quick prototyping with limited context</li>
<li><strong>Summary-only</strong>: Balanced information content and computational efficiency</li>
<li><strong>Full text</strong>: Maximum context with computational constraints (limited hyperparameter tuning)</li>
</ul>
<h3 id="title-only-inputs">Title-Only Inputs</h3>
<h4 id="naive-bayes">Naive Bayes</h4>
<p>Title-only Naive Bayes experiments are run with the following settings:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>sweep_nb(
</span></span><span style="display:flex;"><span>    data,
</span></span><span style="display:flex;"><span>    X_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;title&#39;</span>,
</span></span><span style="display:flex;"><span>    y_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;policy_area&#39;</span>,
</span></span><span style="display:flex;"><span>    tfidf_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;lowercase&#39;</span>: <span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;dtype&#39;</span>: np<span style="color:#f92672">.</span>float32,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    tfidf_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;ngram_range&#39;</span>: [(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)],
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>, <span style="color:#ae81ff">0.5</span>),
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;min_df&#39;</span>: (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">5</span>),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    nb_params<span style="color:#f92672">=</span>{},
</span></span><span style="display:flex;"><span>    nb_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;alpha&#39;</span>: (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.01</span>, <span style="color:#ae81ff">0.001</span>),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>and the results:</p>
<pre><code>Training on Congress 115
Best score: 0.661
Refit Time: 0.570
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6369760774921475
Testing on Congress 117 F1: 0.5488274400521962

Training on Congress 116
Best score: 0.677
Refit Time: 0.499
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.691175262953872
Testing on Congress 117 F1: 0.6798043069585031

Training on Congress 117
Best score: 0.670
Refit Time: 0.565
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.25
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.6168474701996426
Testing on Congress 116 F1: 0.6981574942116808

Mean fit time: 0.54 ± 0.03s
</code></pre>
<h4 id="results-summary">Results Summary</h4>
<p>The results demonstrate several key findings:</p>
<ul>
<li><strong>Fast training</strong>: Sub-second training times make this highly practical</li>
<li><strong>Solid baseline performance</strong>: Cross-validated F1 of 0.66-0.68 and cross-Congress F1 of 0.55-0.70 provide a reasonable starting point</li>
<li><strong>Consistent hyperparameters</strong>: Similar optimal settings across Congresses suggest stable patterns</li>
<li><strong>Temporal effects</strong>: Performance generally decreases when training and testing on Congresses further apart in time</li>
</ul>
<p>Training on the 116th Congress yields the best cross-Congress performance, likely due to its temporal proximity to both adjacent sessions.</p>
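<p>This comparison can be checked directly from the log. The sketch below averages the held-out F1 for each training Congress; the scores are copied from the output above, while the dictionary and function names are chosen here for illustration and are not part of the original experiment code.</p>
<pre><code class="language-python" data-lang="python">from statistics import mean

# Weighted-F1 scores copied from the title-only Naive Bayes log above,
# keyed by (training Congress, test Congress).
nb_title_f1 = {
    (115, 116): 0.6369760774921475,
    (115, 117): 0.5488274400521962,
    (116, 115): 0.691175262953872,
    (116, 117): 0.6798043069585031,
    (117, 115): 0.6168474701996426,
    (117, 116): 0.6981574942116808,
}


def mean_f1_by_train(scores):
    """Average the held-out F1 over all test Congresses for each training Congress."""
    by_train = {}
    for (train, _test), f1 in scores.items():
        by_train.setdefault(train, []).append(f1)
    return {train: mean(vals) for train, vals in by_train.items()}


mean_f1 = mean_f1_by_train(nb_title_f1)
best_train = max(mean_f1, key=mean_f1.get)
for train, f1 in sorted(mean_f1.items()):
    print(f"train={train}: mean cross-Congress F1 = {f1:.3f}")
print("best training Congress:", best_train)
</code></pre>
<p>With only two held-out Congresses per model, these means are a rough summary and not a significance test.</p>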















<figure class="post-figure center ">
    <img src="/img/nb_title_policy_area/f1s.webp"
         alt="Naive Bayes Policy Area Classification F1 Score"
         title="Naive Bayes Policy Area Classification F1 Score"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Naive Bayes F1 scores show temporal effects, with better performance between adjacent Congresses</figcaption>
    
</figure>

<p>The model learns interpretable features for each policy area. For example, Agriculture bills are strongly associated with terms like &ldquo;farm,&rdquo; &ldquo;crop,&rdquo; and &ldquo;livestock,&rdquo; while Armed Forces bills correlate with &ldquo;military,&rdquo; &ldquo;defense,&rdquo; and &ldquo;veterans.&rdquo;</p>















<figure class="post-figure center ">
    <img src="/img/nb_title_policy_area/top-Agriculture_and_Food.webp"
         alt="Naive Bayes Top Features for Agriculture and Food"
         title="Naive Bayes Top Features for Agriculture and Food"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Naive Bayes Top Features for Agriculture and Food</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/nb_title_policy_area/top-Armed_Forces_and_National_Security.webp"
         alt="Naive Bayes Top Features for Armed Forces and National Security"
         title="Naive Bayes Top Features for Armed Forces and National Security"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Naive Bayes Top Features for Armed Forces and National Security</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/nb_title_policy_area/top-Health.webp"
         alt="Naive Bayes Top Features for Health"
         title="Naive Bayes Top Features for Health"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Naive Bayes Top Features for Health</figcaption>
    
</figure>

<h4 id="logistic-regression-1">Logistic Regression</h4>
<p>Title-only Logistic Regression experiments are run with the following settings:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>sweep_logreg(
</span></span><span style="display:flex;"><span>    data,
</span></span><span style="display:flex;"><span>    X_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;title&#39;</span>,
</span></span><span style="display:flex;"><span>    y_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;policy_area&#39;</span>,
</span></span><span style="display:flex;"><span>    tfidf_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;lowercase&#39;</span>: <span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;dtype&#39;</span>: np<span style="color:#f92672">.</span>float32,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    tfidf_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;ngram_range&#39;</span>: [(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)],
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    logreg_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_iter&#39;</span>: <span style="color:#ae81ff">1000</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;random_state&#39;</span>: <span style="color:#ae81ff">42</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;class_weight&#39;</span>: <span style="color:#e6db74">&#39;balanced&#39;</span>,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    logreg_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;C&#39;</span>: [<span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">10</span>],
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>and the results:</p>
<pre><code>Training on Congress 115
Best score: 0.704
Refit Time: 32.063
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6809188275881766
Testing on Congress 117 F1: 0.601917336933838

Training on Congress 116
Best score: 0.714
Refit Time: 31.227
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.7408989977276476
Testing on Congress 117 F1: 0.7200639105208106

Training on Congress 117
Best score: 0.711
Refit Time: 34.083
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.674418393892329
Testing on Congress 116 F1: 0.7405934743144291

Mean fit time: 32.46 ± 1.20s
</code></pre>
<h4 id="results-summary-1">Results Summary</h4>
<p>Logistic regression improves upon Naive Bayes performance:</p>
<ul>
<li><strong>Higher F1 scores</strong>: Roughly 4-6 percentage points better than Naive Bayes on every cross-Congress pair</li>
<li><strong>Consistent hyperparameters</strong>: Optimal settings remain stable across Congresses</li>
<li><strong>Reasonable training time</strong>: 30-35 seconds per model remains manageable</li>
<li><strong>Strong cross-Congress generalization</strong>: F1 of 0.60-0.74 across pairs, and 0.72 or higher when training on the 116th Congress</li>
</ul>
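<p>The size of the improvement can be read off the two title-only logs. A minimal sketch, with the scores copied from the logs above and the dictionary names chosen here for illustration, computes the per-pair gain in percentage points:</p>
<pre><code class="language-python" data-lang="python"># Weighted-F1 from the two title-only logs above, keyed by (train, test) Congress.
nb_f1 = {
    (115, 116): 0.6369760774921475,
    (115, 117): 0.5488274400521962,
    (116, 115): 0.691175262953872,
    (116, 117): 0.6798043069585031,
    (117, 115): 0.6168474701996426,
    (117, 116): 0.6981574942116808,
}
logreg_f1 = {
    (115, 116): 0.6809188275881766,
    (115, 117): 0.601917336933838,
    (116, 115): 0.7408989977276476,
    (116, 117): 0.7200639105208106,
    (117, 115): 0.674418393892329,
    (117, 116): 0.7405934743144291,
}

# Gain of logistic regression over Naive Bayes, in percentage points of F1.
gains = {pair: 100 * (logreg_f1[pair] - nb_f1[pair]) for pair in nb_f1}
for (train, test), gain in sorted(gains.items()):
    print(f"{train} to {test}: +{gain:.1f} points")
print(f"range: {min(gains.values()):.1f} to {max(gains.values()):.1f} points")
</code></pre>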















<figure class="post-figure center ">
    <img src="/img/logreg_title_policy_area/f1s.webp"
         alt="Logistic Regression Policy Area Classification F1 Score"
         title="Logistic Regression Policy Area Classification F1 Score"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Logistic Regression Policy Area Classification F1 Score</figcaption>
    
</figure>

<h4 id="xgboost-1">XGBoost</h4>
<p>Title-only XGBoost experiments are run with the following settings:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>sweep_xgb(
</span></span><span style="display:flex;"><span>    data,
</span></span><span style="display:flex;"><span>    X_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;title&#39;</span>,
</span></span><span style="display:flex;"><span>    y_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;policy_area&#39;</span>,
</span></span><span style="display:flex;"><span>    tfidf_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>,),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    xgb_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_depth&#39;</span>: (<span style="color:#ae81ff">6</span>,),
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;eta&#39;</span>: (<span style="color:#ae81ff">0.3</span>,),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>and the results:</p>
<pre><code>Training on Congress 115
Best score: 0.591
Refit Time: 198.063
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 116 F1: 0.5649530686141018
Testing on Congress 117 F1: 0.5215939580735101

Training on Congress 116
Best score: 0.600
Refit Time: 264.824
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.6037922738570368
Testing on Congress 117 F1: 0.5965027418245722

Training on Congress 117
Best score: 0.595
Refit Time: 249.799
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.5600491477899472
Testing on Congress 116 F1: 0.60815381664894

Mean fit time: 237.56 ± 28.60s
</code></pre>
<h4 id="results-summary-2">Results Summary</h4>
<p>XGBoost underperforms relative to expectations:</p>
<ul>
<li><strong>Poor performance</strong>: F1 scores well below the linear models (0.52-0.61 range)</li>
<li><strong>Long training times</strong>: Roughly 3-4.5 minutes per model, even with a single hyperparameter configuration</li>
<li><strong>Questionable value</strong>: The accuracy doesn&rsquo;t justify the computational cost</li>
</ul>
<p>Given these results, we focus on the more promising linear models for subsequent experiments with longer text inputs.</p>















<figure class="post-figure center ">
    <img src="/img/xgb_title_policy_area/f1s.webp"
         alt="XGBoost Policy Area Classification F1 Score"
         title="XGBoost Policy Area Classification F1 Score"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">XGBoost Policy Area Classification F1 Score</figcaption>
    
</figure>

<h4 id="training-efficiency">Training Efficiency</h4>
<p>The computational costs vary dramatically:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Mean Fit Time (title-only)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Naive Bayes</td>
          <td>0.54 $\pm$ 0.03s</td>
      </tr>
      <tr>
          <td>Logistic Regression</td>
          <td>32.46 $\pm$ 1.20s</td>
      </tr>
      <tr>
          <td>XGBoost</td>
          <td>237.56 $\pm$ 28.60s</td>
      </tr>
  </tbody>
</table>
<p>XGBoost&rsquo;s poor performance despite its high computational cost suggests that tree-based methods may be poorly suited to sparse TF-IDF features, though only a single XGBoost configuration was tested here. A plausible explanation is the dimensionality of the input: each tree split tests one feature, and in a sparse, high-dimensional bag-of-words space most features are zero for most documents, whereas a linear model assigns weights to all features simultaneously. We&rsquo;ll focus on linear models for the remaining experiments.</p>
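<p>To make the sparsity argument concrete, here is a small stdlib-only sketch. The four titles are hypothetical, invented for illustration and not drawn from the dataset, and the whitespace tokenizer is a stand-in for the TF-IDF vectorizer used in the experiments.</p>
<pre><code class="language-python" data-lang="python"># Hypothetical bill titles, invented for illustration; not drawn from the dataset.
titles = [
    "A bill to amend the Farm Security Act",
    "To authorize appropriations for military construction",
    "A bill to improve rural health clinics",
    "To amend the Clean Air Act",
]


def bag_of_words(docs):
    """Return the sorted vocabulary and one set of present terms per document."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*tokenized))
    return vocab, tokenized


vocab, tokenized = bag_of_words(titles)

# Fraction of non-zero cells in the document-term matrix.
nonzero = sum(len(terms) for terms in tokenized)
density = nonzero / (len(titles) * len(vocab))
print(f"{len(vocab)} features, {density:.0%} of the document-term matrix is non-zero")

# A tree split tests one feature at a time, so a split on a rare term only
# separates the few documents containing it; a linear model scores all terms at once.
term = "farm"
covered = sum(term in terms for terms in tokenized)
print(f"a split on {term!r} isolates {covered} of {len(titles)} titles")
</code></pre>
<p>Even this toy matrix is mostly zeros. A real corpus with tens of thousands of n-gram features would be far sparser, which is the regime the paragraph above describes.</p>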
<h3 id="summary-only-results">Summary-Only Results</h3>
<p>Using bill summaries provides substantially more context than titles alone, leading to significant performance improvements.</p>
<h4 id="naive-bayes-performance">Naive Bayes Performance</h4>
<p>The summary-based models show dramatic improvement over title-only versions:</p>
<ul>
<li><strong>F1 scores</strong>: 0.85+ within-Congress, 0.77-0.86 cross-Congress</li>
<li><strong>Training time</strong>: Still fast at ~3.4 seconds</li>
<li><strong>Strong generalization</strong>: Consistent performance across time periods</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/nb_summary_policy_area/f1s.webp"
         alt="Naive Bayes Summary Performance"
         title="Naive Bayes Summary Performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Summary-based models achieve 80%+ F1 scores across most Congress combinations</figcaption>
    
</figure>

<h4 id="logistic-regression-performance">Logistic Regression Performance</h4>
<p>Logistic regression slightly outperforms Naive Bayes on summaries:</p>
<ul>
<li><strong>F1 scores</strong>: 0.86+ within-Congress, 0.79-0.87 cross-Congress</li>
<li><strong>Training time</strong>: Reasonable at ~12 seconds</li>
<li><strong>Stable hyperparameters</strong>: Consistent optimal settings across Congresses</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/logreg_summary_policy_area/f1s.webp"
         alt="Logistic Regression Summary Performance"
         title="Logistic Regression Summary Performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Logistic regression maintains slight performance advantage over Naive Bayes</figcaption>
    
</figure>

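<p>The small performance gap between the two models suggests they rely on similar feature patterns. Logistic regression&rsquo;s edge likely comes from its discriminative training, which handles correlated terms better than the conditional-independence assumption of Naive Bayes.</p>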
<h4 id="logistic-regression-2">Logistic Regression</h4>
<p>Summary-only Logistic Regression experiments are run with the following settings:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>sweep_logreg(
</span></span><span style="display:flex;"><span>    data,
</span></span><span style="display:flex;"><span>    X_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;summary&#39;</span>,
</span></span><span style="display:flex;"><span>    y_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;policy_area&#39;</span>,
</span></span><span style="display:flex;"><span>    tfidf_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;lowercase&#39;</span>: <span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;dtype&#39;</span>: np<span style="color:#f92672">.</span>float32,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    tfidf_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># &#39;ngram_range&#39;: [(1, 1), (1, 2)],</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    logreg_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_iter&#39;</span>: <span style="color:#ae81ff">1000</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;random_state&#39;</span>: <span style="color:#ae81ff">42</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;class_weight&#39;</span>: <span style="color:#e6db74">&#39;balanced&#39;</span>,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    logreg_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;C&#39;</span>: [<span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">10</span>],
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>And the results:</p>
<pre><code>Training on Congress 115
Best score: 0.862
Refit Time: 9.007
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 116 F1: 0.8284864693401133
Testing on Congress 117 F1: 0.7934161507811646

Training on Congress 116
Best score: 0.865
Refit Time: 13.897
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8637852557418315
Testing on Congress 117 F1: 0.8594775615031977

Training on Congress 117
Best score: 0.862
Refit Time: 12.167
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8355736563084967
Testing on Congress 116 F1: 0.8696403838390832

Mean fit time: 11.69 ± 2.02s
</code></pre>
















<h3 id="full-text-results">Full Text Results</h3>
<p>We test whether complete bill text improves performance over summaries, using optimal hyperparameters from summary experiments.</p>
<h4 id="naive-bayes-on-full-text">Naive Bayes on Full Text</h4>
<p>Surprisingly, full text yields slightly lower performance than summaries:</p>
<ul>
<li><strong>F1 scores</strong>: 0.84-0.85 within-Congress, 0.77-0.86 cross-Congress</li>
<li><strong>Training time</strong>: ~50 seconds (roughly 15x slower than summaries)</li>
<li><strong>Performance drop</strong>: Likely due to increased noise in lengthy documents</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/nb_text_policy_area/f1s.webp"
         alt="Naive Bayes Full Text Performance"
         title="Naive Bayes Full Text Performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Full text performance is slightly worse than summaries, suggesting diminishing returns</figcaption>
    
</figure>

<h4 id="logistic-regression-on-full-text">Logistic Regression on Full Text</h4>
<p>Logistic regression shows the strongest performance on full text:</p>
<ul>
<li><strong>F1 scores</strong>: 0.87-0.88 within-Congress, 0.83-0.89 cross-Congress</li>
<li><strong>Training time</strong>: ~70 seconds</li>
<li><strong>Best overall performance</strong>: up to 0.89 F1 on the strongest single cross-Congress pair (best within-Congress score 0.877)</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/logreg_text_policy_area/f1s.webp"
         alt="Logistic Regression Full Text Performance"
         title="Logistic Regression Full Text Performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Logistic regression achieves the best performance using full bill text</figcaption>
    
</figure>

<p>The logistic regression model benefits from having access to complete legislative language while effectively regularizing against noise.</p>
<h2 id="key-findings">Key Findings</h2>
<p>This baseline study establishes several important results:</p>
<p><strong>Best performing model</strong>: Logistic regression trained on full bill text reaches up to 0.89 F1 on the strongest single cross-Congress pair (best within-Congress score 0.877), providing a strong benchmark for future deep learning approaches.</p>
<p><strong>Text input comparison</strong>:</p>
<ul>
<li>Titles: Limited but fast (cross-validated F1 ~0.66-0.71 for the linear models)</li>
<li>Summaries: Good balance of performance and efficiency (F1 ~0.85)</li>
<li>Full text: Best performance but computationally expensive (within-Congress weighted F1 0.871-0.877; up to ~0.89 on the strongest single cross-Congress pair)</li>
</ul>
<p><strong>Cross-Congress generalization</strong>: Models trained on one Congress generalize reasonably well to others, though performance decreases with temporal distance between sessions.</p>
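<p>The temporal effect can be illustrated with the summary-only logistic regression scores reported earlier in the post. The sketch below groups the held-out F1 by the gap between training and test Congress; the scores are copied from that log, and the helper name is chosen here for illustration.</p>
<pre><code class="language-python" data-lang="python">from statistics import mean

# Weighted-F1 from the summary-only logistic regression log,
# keyed by (training Congress, test Congress).
summary_logreg_f1 = {
    (115, 116): 0.8284864693401133,
    (115, 117): 0.7934161507811646,
    (116, 115): 0.8637852557418315,
    (116, 117): 0.8594775615031977,
    (117, 115): 0.8355736563084967,
    (117, 116): 0.8696403838390832,
}


def mean_f1_by_gap(scores):
    """Group held-out F1 by the number of Congresses between train and test."""
    by_gap = {}
    for (train, test), f1 in scores.items():
        by_gap.setdefault(abs(train - test), []).append(f1)
    return {gap: mean(vals) for gap, vals in sorted(by_gap.items())}


f1_by_gap = mean_f1_by_gap(summary_logreg_f1)
for gap, f1 in f1_by_gap.items():
    print(f"gap of {gap} Congress(es): mean F1 = {f1:.3f}")
</code></pre>
<p>With three Congresses there are only two pairs at a gap of two, so this shows a trend and not a precise estimate of drift.</p>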
<p><strong>Model performance ranking</strong>: Logistic Regression &gt; Naive Bayes &gt;&gt; XGBoost for this text classification task (XGBoost was evaluated on titles only).</p>
<h2 id="next-steps">Next Steps</h2>
<p>The strong baseline performance sets the stage for several research directions:</p>
<ol>
<li><strong>Deep learning models</strong>: Transformer-based approaches using pre-trained language models</li>
<li><strong>Dataset expansion</strong>: Including additional Congresses and more detailed bill metadata</li>
<li><strong>Error analysis</strong>: Understanding failure cases and class-specific performance patterns</li>
<li><strong>Feature engineering</strong>: Exploring domain-specific text preprocessing and feature extraction</li>
</ol>
<p>The complete dataset and experimental code are available for researchers interested in building upon these baselines.</p>
<p><strong>Resources</strong>:</p>
<ul>
<li>Dataset: <a href="https://huggingface.co/datasets/hheiden/us-congress-bill-policy-115_117">Hugging Face: hheiden/us-congress-bill-policy-115_117</a></li>
<li>Leaderboard: <a href="/leaderboards/policy_area_classification_leaderboard/">Policy Area Classification Leaderboard</a></li>
<li>Project: <a href="/projects/congressional-data-analysis/">Congressional Knowledge Graph &amp; Policy Classification</a></li>
</ul>
]]></content:encoded></item><item><title>How Does Congress Actually Work? Data from 15K Bills</title><link>https://hunterheidenreich.com/posts/us-117th-congress-data-exploration/</link><pubDate>Thu, 05 Oct 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/us-117th-congress-data-exploration/</guid><description>What happens to bills in Congress? Analyzing 15K+ bills from the 117th Congress to understand legislative patterns, party dynamics, and success rates.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Analyzing congressional data reveals the underlying mechanics of the legislative process. Legislative text is a large, structured corpus well suited to text classification and other NLP tasks. I scraped data from Congress.gov to analyze what actually happens to the thousands of bills introduced each session and to build a foundational dataset for downstream machine learning tasks.</p>
<p>This analysis focuses on the 117th Congress (2021-2023), examining 15,000+ bills to understand basic patterns: Which bills get introduced? How many receive votes? What factors influence success?</p>
<p>This post covers the foundational exploratory analysis and data collection process, setting the stage for <a href="/posts/congressional-bill-policy-area-classification/">predictive modeling and policy area classification</a>.</p>
<h2 id="data-collection">Data Collection</h2>
<p>My primary source is <a href="https://www.congress.gov/">Congress.gov</a>, maintained by the Library of Congress. I focused on the 117th Congress (2021-2023), collecting data on bills and joint resolutions, omitting simple resolutions, concurrent resolutions, and amendments.</p>
<p><strong>Data collected:</strong></p>
<table>
  <thead>
      <tr>
          <th>Bill Type</th>
          <th>Introduced</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>House Bill</td>
          <td>9,698</td>
      </tr>
      <tr>
          <td>House Joint Resolution</td>
          <td>106</td>
      </tr>
      <tr>
          <td>Senate Bill</td>
          <td>5,357</td>
      </tr>
      <tr>
          <td>Senate Joint Resolution</td>
          <td>70</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>15,231</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="technical-implementation">Technical Implementation</h3>
<p>Building a usable NLP dataset requires careful handling of the source. Congress.gov loads content dynamically and presents nested DOM structures, so the scraper combines static HTML parsing with a headless browser to render JavaScript before parsing.</p>
<p><strong>Implementation details:</strong></p>
<ul>
<li><a href="https://www.python.org/">Python</a> for core orchestration and data schema management</li>
<li><a href="https://www.selenium.dev/">Selenium</a> for executing JavaScript and loading dynamic page elements</li>
<li><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a> for structured HTML parsing</li>
<li>Regex for text normalization and extracting clean legislative text for language models</li>
</ul>
<p>The crawler waited 5 seconds between requests to respect server limits, which made collection a roughly 3-day run. It handles edge cases in congressional text formatting and writes one JSON record per bill on a fixed schema. The crawler and processed data are available on <a href="https://github.com/hunter-heidenreich/congress-scraper">GitHub</a>.</p>
<p>For each bill, I queried two pages:</p>
<ul>
<li>All info page: <code>https://www.congress.gov/bill/117th-congress/{bill_type}/{bill_id}/all-info</code></li>
<li>Text page: <code>https://www.congress.gov/bill/117th-congress/{bill_type}/{bill_id}/text?format=txt</code></li>
</ul>
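<p>The two URL templates and the 5-second delay can be sketched as follows. This is a minimal illustration, not the published crawler: the slug values in <code>BILL_TYPES</code>, the function names, and the injectable <code>fetch</code> and <code>sleep</code> arguments are assumptions, chosen so the example runs without touching the network.</p>

```python
import time

BASE_URL = "https://www.congress.gov/bill/117th-congress"

# Assumed URL slugs for the four bill types covered in this post.
BILL_TYPES = (
    "house-bill",
    "house-joint-resolution",
    "senate-bill",
    "senate-joint-resolution",
)


def bill_urls(bill_type, bill_id):
    """Return the all-info and plain-text URLs for one bill."""
    if bill_type not in BILL_TYPES:
        raise ValueError(f"unknown bill type: {bill_type!r}")
    root = f"{BASE_URL}/{bill_type}/{bill_id}"
    return {"all_info": f"{root}/all-info", "text": f"{root}/text?format=txt"}


def crawl(bills, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch both pages for each (bill_type, bill_id), pausing between requests."""
    pages = {}
    first_request = True
    for bill_type, bill_id in bills:
        for kind, url in bill_urls(bill_type, bill_id).items():
            if not first_request:
                sleep(delay_seconds)  # wait before every request after the first
            first_request = False
            pages[(bill_type, bill_id, kind)] = fetch(url)
    return pages


# Dry run: a fake fetcher and a recorded (not real) sleep, so nothing hits the server.
pauses = []
pages = crawl(
    [("house-bill", 1), ("senate-bill", 1)],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=pauses.append,
)
print(len(pages), "pages,", len(pauses), "pauses")
```

<p>At 5 seconds per request, two pages for each of the 15,231 bills comes to roughly 42 hours of waiting alone, before any page rendering time is counted.</p>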
<p>The parsing process involved targeting specific HTML elements and implementing basic caching to avoid redundant requests.</p>
<h2 id="key-findings">Key Findings</h2>
<p>The analysis reveals clear patterns in congressional activity. Most bills never receive votes, and success rates vary significantly by party and policy area.</p>
<h3 id="legislative-outcomes">Legislative Outcomes</h3>
<p>The fundamental question: what happens to bills after introduction?</p>
<p>Each bill has a tracker status indicating its position in the legislative process. The eight possible statuses can be grouped into three meaningful categories:</p>
<ul>
<li><strong>Introduced</strong>: Bills introduced but never voted on</li>
<li><strong>Stalled</strong>: Bills that saw votes but didn&rsquo;t become law (since the 117th Congress ended, these effectively died)</li>
<li><strong>Law</strong>: Bills signed by the President</li>
</ul>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Introduced</th>
          <th>Stalled</th>
          <th>Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>House Bill</td>
          <td>8,977</td>
          <td>523</td>
          <td>198</td>
      </tr>
      <tr>
          <td>House Joint Resolution</td>
          <td>102</td>
          <td>1</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Senate Bill</td>
          <td>5,083</td>
          <td>114</td>
          <td>160</td>
      </tr>
      <tr>
          <td>Senate Joint Resolution</td>
          <td>57</td>
          <td>9</td>
          <td>4</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>14,219</strong></td>
          <td><strong>647</strong></td>
          <td><strong>365</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Key insights:</strong></p>
<ul>
<li>Only 7% of introduced bills ever receive a vote</li>
<li>Of bills that receive votes, 36% become law</li>
<li>Overall, just 2% of introduced bills become law</li>
</ul>
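<p>These three percentages follow directly from the outcome table. A short sketch of the arithmetic, using the counts above (the names <code>OUTCOMES</code> and <code>funnel</code> are illustrative only):</p>

```python
# (introduced only, stalled, law) per bill type, copied from the table above.
OUTCOMES = {
    "House Bill": (8977, 523, 198),
    "House Joint Resolution": (102, 1, 3),
    "Senate Bill": (5083, 114, 160),
    "Senate Joint Resolution": (57, 9, 4),
}


def funnel(introduced, stalled, law):
    """Turn the three mutually exclusive outcome counts into rates."""
    total = introduced + stalled + law
    voted = stalled + law  # every bill that moved past introduction
    return {
        "total": total,
        "vote_rate": voted / total,
        "law_given_vote": law / voted,
        "law_rate": law / total,
    }


# Column-wise sums reproduce the Total row of the table.
totals = tuple(sum(column) for column in zip(*OUTCOMES.values()))
summary = funnel(*totals)

print(f"received a vote:       {summary['vote_rate']:.1%}")
print(f"became law, if voted:  {summary['law_given_vote']:.1%}")
print(f"became law, overall:   {summary['law_rate']:.1%}")
```

<p>The same function can be applied to a single row to compare bill types, for example <code>funnel(*OUTCOMES["Senate Bill"])</code>.</p>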
<h3 id="sponsor-analysis">Sponsor Analysis</h3>
<p>The bill sponsor (the primary member who introduces legislation) provides insights into party and geographic patterns.</p>
<h4 id="party-breakdown">Party Breakdown</h4>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Introduced</th>
          <th>Stalled</th>
          <th>Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Democrat</td>
          <td>8,271</td>
          <td>437</td>
          <td>235</td>
      </tr>
      <tr>
          <td>Republican</td>
          <td>5,883</td>
          <td>210</td>
          <td>130</td>
      </tr>
      <tr>
          <td>Independent</td>
          <td>65</td>
          <td>0</td>
          <td>0</td>
      </tr>
  </tbody>
</table>
<p><strong>Party comparison:</strong></p>
<ul>
<li><strong>Democrats</strong>: 7.5% of bills moved beyond introduction; 2.6% became law</li>
<li><strong>Republicans</strong>: 5.5% of bills moved beyond introduction; 2.1% became law</li>
<li>When bills do advance, Republicans have a slightly higher success rate (38% vs 35%)</li>
</ul>
<h4 id="geographic-distribution">Geographic Distribution</h4>
<p><strong>Top 10 states by bills introduced:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>State: Introduced</th>
          <th>State: Stalled</th>
          <th>State: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>CA: 1,350</td>
          <td>CA: 93</td>
          <td>CA: 34</td>
      </tr>
      <tr>
          <td>2</td>
          <td>TX: 879</td>
          <td>NY: 44</td>
          <td>MI: 30</td>
      </tr>
      <tr>
          <td>3</td>
          <td>NY: 784</td>
          <td>TX: 43</td>
          <td>TX: 25</td>
      </tr>
      <tr>
          <td>4</td>
          <td>FL: 766</td>
          <td>MI: 28</td>
          <td>NY: 24</td>
      </tr>
      <tr>
          <td>5</td>
          <td>IL: 660</td>
          <td>NJ: 28</td>
          <td>MN: 17</td>
      </tr>
      <tr>
          <td>6</td>
          <td>PA: 521</td>
          <td>IL: 27</td>
          <td>IL: 16</td>
      </tr>
      <tr>
          <td>7</td>
          <td>NJ: 478</td>
          <td>VA: 26</td>
          <td>OH: 11</td>
      </tr>
      <tr>
          <td>8</td>
          <td>MI: 380</td>
          <td>FL: 24</td>
          <td>VA: 11</td>
      </tr>
      <tr>
          <td>9</td>
          <td>OH: 377</td>
          <td>PA: 22</td>
          <td>FL: 11</td>
      </tr>
      <tr>
          <td>10</td>
          <td>MA: 361</td>
          <td>OH: 19</td>
          <td>GA: 9</td>
      </tr>
  </tbody>
</table>
<p><strong>Normalizing by delegation size (bills per member, counting both House and Senate seats) reveals different patterns:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>State: Introduced</th>
          <th>State: Stalled</th>
          <th>State: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>DC: 101.0</td>
          <td>DC: 7.0</td>
          <td>AK: 2.2</td>
      </tr>
      <tr>
          <td>2</td>
          <td>NH: 47.5</td>
          <td>AK: 2.8</td>
          <td>NH: 2.0</td>
      </tr>
      <tr>
          <td>3</td>
          <td>MT: 44.0</td>
          <td>IA: 2.3</td>
          <td>MT: 2.0</td>
      </tr>
      <tr>
          <td>4</td>
          <td>OR: 41.0</td>
          <td>SD: 2.3</td>
          <td>MI: 1.9</td>
      </tr>
      <tr>
          <td>5</td>
          <td>NV: 40.0</td>
          <td>NH: 2.2</td>
          <td>MN: 1.5</td>
      </tr>
      <tr>
          <td>6</td>
          <td>DE: 38.7</td>
          <td>VA: 2.0</td>
          <td>HI: 1.5</td>
      </tr>
      <tr>
          <td>7</td>
          <td>SD: 38.3</td>
          <td>NJ: 2.0</td>
          <td>CT: 1.3</td>
      </tr>
      <tr>
          <td>8</td>
          <td>IA: 37.7</td>
          <td>PR: 2.0</td>
          <td>IA: 1.2</td>
      </tr>
      <tr>
          <td>9</td>
          <td>RI: 36.5</td>
          <td>NV: 1.8</td>
          <td>OR: 1.1</td>
      </tr>
      <tr>
          <td>10</td>
          <td>UT: 36.0</td>
          <td>MO: 1.8</td>
          <td>SD: 1.0</td>
      </tr>
  </tbody>
</table>
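<p>The normalization divides each state&rsquo;s bill count by the size of its congressional delegation. The sketch below assumes the denominator is House seats plus two senators (one non-voting delegate for DC), which is consistent with values such as DC at 101.0 but is not stated explicitly in the post. The delegation sizes are the 117th Congress apportionment, and the variable names are illustrative.</p>

```python
# Bills that stayed at "Introduced", from the raw-count table above.
# DC's count is implied by its per-member value of 101.0 with a single delegate.
INTRODUCED = {"CA": 1350, "TX": 879, "NY": 784, "DC": 101}

# Assumed denominators: House seats plus two senators (DC: one delegate, no senators).
DELEGATION_SIZE = {"CA": 53 + 2, "TX": 36 + 2, "NY": 27 + 2, "DC": 1}

per_member = {
    state: INTRODUCED[state] / DELEGATION_SIZE[state] for state in INTRODUCED
}

# Rank states by bills per member instead of raw totals.
ranking = sorted(per_member, key=per_member.get, reverse=True)
for state in ranking:
    print(f"{state}: {per_member[state]:.1f}")
```

<p>Large delegations such as California lead the raw counts, but once the totals are divided by delegation size they fall well below small delegations, which is why the two tables share so few entries.</p>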
<h4 id="top-individual-sponsors">Top Individual Sponsors</h4>
<p><strong>Most prolific legislators by bills introduced:</strong></p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Individual: Introduced</th>
          <th>Individual: Stalled</th>
          <th>Individual: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Sen. Rubio (R-FL): 186</td>
          <td>Sen. Peters (D-MI): 11</td>
          <td>Sen. Peters (D-MI): 19</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Sen. Klobuchar (D-MN): 143</td>
          <td>Sen. Cornyn (R-TX): 8</td>
          <td>Sen. Cornyn (R-TX): 15</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Sen. Lee (R-UT): 125</td>
          <td>Rep. Connolly (D-VA-11): 8</td>
          <td>Sen. Klobuchar (D-MN): 7</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Sen. Markey (D-MA): 118</td>
          <td>Rep. Takano (D-CA-41): 8</td>
          <td>Sen. Tester (D-MT): 6</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Sen. Casey (D-PA): 116</td>
          <td>Sen. Grassley (R-IA): 7</td>
          <td>Sen. Rubio (R-FL): 6</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Sen. Cortez Masto (D-NV): 109</td>
          <td>Del. Norton (D-DC): 7</td>
          <td>Rep. DeLauro (D-CT-3): 6</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Sen. Booker (D-NJ): 106</td>
          <td>Rep. Johnson (D-TX-30): 7</td>
          <td>Sen. Grassley (R-IA): 5</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Sen. Durbin (D-IL): 102</td>
          <td>Rep. Katko (R-NY-24): 7</td>
          <td>Sen. Ossoff (D-GA): 4</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Del. Norton (D-DC): 101</td>
          <td>Rep. Dean (D-PA-4): 6</td>
          <td>Sen. Murkowski (R-AK): 4</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Sen. Menendez (D-NJ): 99</td>
          <td>Rep. Wagner (R-MO-2): 6</td>
          <td>Sen. Padilla (D-CA): 4</td>
      </tr>
  </tbody>
</table>
<p><strong>Effectiveness score (laws enacted / total bills):</strong></p>
<p>$$
\text{effectiveness} = \frac{\text{bills that became law}}{\text{total bills introduced}}
$$</p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Individual: Effectiveness Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Rep. Pelosi (D-CA-12): 0.500</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Rep. Mrvan (D-IN-1): 0.444</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Rep. Yarmuth (D-KY-3): 0.333</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Rep. Stivers (R-OH-15): 0.250</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Rep. Graves (R-MO-6): 0.222</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Rep. Jeffries (D-NY-8): 0.200</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Rep. Neal (D-MA-1): 0.200</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Rep. Palazzo (R-MS-4): 0.200</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Sen. Peters (D-MI): 0.186</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Rep. Fischbach (R-MN-7): 0.176</td>
      </tr>
  </tbody>
</table>
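<p>As a sketch of how such a ranking can be computed, the snippet below applies the formula to hypothetical tallies (the sponsors and counts are invented for illustration and are not taken from the dataset). Ties are broken alphabetically, which is an assumption about presentation rather than a documented rule.</p>

```python
def effectiveness(laws, total_bills):
    """Share of a sponsor's bills that became law."""
    if total_bills <= 0:
        raise ValueError("a sponsor needs at least one introduced bill")
    return laws / total_bills


# Hypothetical tallies: (bills that became law, total bills introduced).
tallies = {
    "Sponsor A": (1, 2),
    "Sponsor B": (4, 9),
    "Sponsor C": (2, 10),
    "Sponsor D": (1, 5),
}

# Sort by score, highest first; equal scores fall back to name order.
ranking = sorted(tallies, key=lambda name: (-effectiveness(*tallies[name]), name))
for rank, name in enumerate(ranking, start=1):
    print(rank, name, f"{effectiveness(*tallies[name]):.3f}")
```

<p>Because the score is a ratio, a sponsor with only a handful of bills can outrank a far more prolific one, which is worth keeping in mind when reading the table.</p>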
<h3 id="policy-focus-areas">Policy Focus Areas</h3>
<p>Each bill is assigned a primary policy area. Here are the most active areas by legislative outcome:</p>
<table>
  <thead>
      <tr>
          <th>Ranking</th>
          <th>Policy Area: Introduced</th>
          <th>Policy Area: Stalled</th>
          <th>Policy Area: Law</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Health: 1,885</td>
          <td>Government Operations: 79</td>
          <td>Government Operations: 94</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Armed Forces: 1,114</td>
          <td>Armed Forces: 60</td>
          <td>Armed Forces: 69</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Taxation: 1,066</td>
          <td>International Affairs: 60</td>
          <td>Crime &amp; Law Enforcement: 31</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Government Operations: 982</td>
          <td>Health: 56</td>
          <td>Health: 19</td>
      </tr>
      <tr>
          <td>5</td>
          <td>International Affairs: 866</td>
          <td>Crime &amp; Law Enforcement: 44</td>
          <td>Native Americans: 17</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Crime &amp; Law Enforcement: 842</td>
          <td>Public Lands: 44</td>
          <td>International Affairs: 14</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Education: 663</td>
          <td>Science &amp; Technology: 44</td>
          <td>Economics &amp; Finance: 13</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Transportation: 663</td>
          <td>Commerce: 43</td>
          <td>Public Lands: 13</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Public Lands: 548</td>
          <td>Finance: 34</td>
          <td>Commerce: 13</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Finance: 547</td>
          <td>Emergency Management: 27</td>
          <td>Emergency Management: 11</td>
      </tr>
  </tbody>
</table>
<p>Notable patterns: Health dominates introductions, but only about 1% of Health bills became law (19 of 1,960). Government Operations (94 of 1,155, about 8%) and Armed Forces (69 of 1,243, about 6%) bills were far more likely to become law.</p>
<h2 id="next-steps">Next Steps</h2>
<p>This analysis establishes baseline patterns: most bills never receive a vote, success rates differ by the sponsor&rsquo;s party, and certain policy areas perform better than others.</p>
<p>Future work could explore:</p>
<ul>
<li>Committee dynamics and voting patterns</li>
<li>Geographic analysis of state-level interests</li>
<li>Bill text analysis using NLP techniques</li>
<li>Predictive modeling for bill outcomes</li>
</ul>
<blockquote>
<p><strong>Update</strong>: I&rsquo;ve since applied machine learning to this type of data in <a href="/posts/congressional-bill-policy-area-classification/">Congressional Bill Policy Area Classification</a>, using 48K+ bills from three Congresses to automatically categorize bills by policy area.</p></blockquote>
<p>The complete dataset and code are publicly available to support further research into legislative transparency.</p>
]]></content:encoded></item><item><title>Congressional Knowledge Graph &amp; Policy Classification</title><link>https://hunterheidenreich.com/projects/congressional-data-analysis/</link><pubDate>Wed, 01 Mar 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/congressional-data-analysis/</guid><description>A 47,000+ bill knowledge graph from Congress.gov with co-sponsorship networks and TF-IDF baselines for 33-class policy-area classification.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>This computational social science project constructed a dataset of 47,000+ US congressional bills by extracting legislative text and metadata from the 115th-117th Congresses. The project creates a &ldquo;legislative graph&rdquo; (linking sponsors, committees, and bill text) and establishes TF-IDF baseline models for policy area classification across 33 (highly imbalanced) policy classes. The dataset is now hosted on Hugging Face to support reproducible political science research.</p>
<h2 id="features">Features</h2>
<h3 id="intelligent-data-acquisition">Intelligent Data Acquisition</h3>
<p>Standard APIs impose strict rate limits. I built a Selenium-based extraction engine to handle Congress.gov&rsquo;s complex DOM structures.</p>
<ul>
<li><strong>Optimization</strong>: Targeted aggregate endpoints (e.g., <code>/all-info</code>) to pull each bill&rsquo;s text and metadata in fewer requests.</li>
<li><strong>Resilience</strong>: Implemented a local caching layer to store raw HTML, separating the fetch step from the parse step. This made the parse step re-runnable without re-fetching, and minimized server load during iterative development.</li>
<li><strong>Graph construction</strong>: Beyond simple text, the script extracts relational data including co-sponsorship networks, committee assignments, and related bill lineage.</li>
</ul>
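<p>The fetch/parse separation described above can be sketched with a small on-disk cache. This is an illustration under assumptions, not the project&rsquo;s actual code: the class name <code>HtmlCache</code>, the SHA-256 file naming, and the <code>fetch</code> callback are all hypothetical.</p>

```python
import hashlib
import tempfile
from pathlib import Path


class HtmlCache:
    """Store raw HTML on disk so parsing can be re-run without re-fetching."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def path_for(self, url):
        # Hash the URL so any address maps to a safe, fixed-length file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.html"

    def get(self, url, fetch):
        """Return cached HTML for url, calling fetch(url) only on a miss."""
        path = self.path_for(url)
        if path.exists():
            return path.read_text(encoding="utf-8")
        html = fetch(url)
        path.write_text(html, encoding="utf-8")
        return html


calls = []


def fake_fetch(url):
    """Stand-in for the browser-based fetch; records each real request."""
    calls.append(url)
    return "<html><h1>H.R.1</h1></html>"


with tempfile.TemporaryDirectory() as tmp:
    cache = HtmlCache(tmp)
    url = "https://www.congress.gov/bill/117th-congress/house-bill/1/all-info"
    first = cache.get(url, fake_fetch)   # miss: fetches and writes to disk
    second = cache.get(url, fake_fetch)  # hit: read back from disk

print(len(calls), "fetch for 2 reads")
```

<p>With the raw HTML stored, a parser bug can be fixed and the parse step re-run over the cached files without sending a single new request.</p>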
<h3 id="natural-language-processing">Natural Language Processing</h3>
<ul>
<li><strong>Corpus construction</strong>: Cleaned and normalized legislative text, removing procedural artifacts (e.g., &ldquo;A BILL TO&hellip;&rdquo;) to isolate semantic policy content.</li>
<li><strong>Feature engineering</strong>: Utilized TF-IDF vectorization with N-gram analysis to capture legislative jargon.</li>
<li><strong>Modeling</strong>: Benchmarked Naive Bayes, Logistic Regression, and gradient-boosted trees (XGBoost), reaching ~0.86 weighted F1 on bill summaries and up to ~0.89 on full text (cross-validated). Weighted F1, not raw accuracy, is the honest metric here: the 33 policy classes are severely imbalanced (Health has 5,911 bills; Social Sciences and History has 15).</li>
</ul>
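<p>To make the metric concrete, here is a dependency-free sketch of support-weighted F1: each class&rsquo;s F1 is weighted by its share of the true labels. It is intended to mirror the usual &ldquo;weighted&rdquo; averaging convention; the function name and the ten-bill toy labels are illustrative and are not drawn from the dataset.</p>

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its true-label count."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        score += (support[label] / total) * f1
    return score


# Toy example: a classifier that always predicts the majority class.
y_true = ["Health"] * 9 + ["Social Sciences and History"]
y_pred = ["Health"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy:    {accuracy:.3f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

<p>In this toy case accuracy is 0.9 while weighted F1 is about 0.85, because the missed minority class contributes an F1 of zero and the majority class loses precision.</p>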
<h2 id="usage">Usage</h2>
<p>The dataset is available on Hugging Face and can be loaded directly via the <code>datasets</code> library. The scraper can be run locally to fetch new bills.</p>
<h2 id="results">Results</h2>
<ul>
<li><strong>The &ldquo;partisan vocabulary&rdquo;</strong>: Feature importance analysis revealed distinct linguistic markers separating Democratic and Republican legislation, identifiable even without metadata.</li>
<li><strong>Temporal drift</strong>: Policy priorities and terminology showed measurable shifts across congressional sessions (115th vs 117th).</li>
<li><strong>Classification success</strong>: Simple linear models (Logistic Regression and Naive Bayes) proved effective at distinguishing policy domains, outperforming gradient-boosted trees on these sparse TF-IDF features and suggesting legislative language is highly structured.</li>
</ul>
<h2 id="impact--deliverables">Impact &amp; Deliverables</h2>
<ul>
<li><strong>Hugging Face dataset</strong>: Released a machine-readable, ML-ready dataset of modern bills (115th-117th Congresses) on Hugging Face for reproducible research.</li>
<li><strong>Open source tooling</strong>: Published the scraper and parsing logic to allow others to extend the dataset to future congresses.</li>
<li><strong>Academic benchmark</strong>: Established a clear baseline for &ldquo;Government NLP&rdquo; tasks, aiding in the automated transparency and monitoring of new legislation.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/posts/us-117th-congress-data-exploration/">117th Congress Data Exploration</a></li>
<li><a href="/posts/congressional-bill-policy-area-classification/">Congressional Bill Policy Area Classification</a></li>
</ul>
]]></content:encoded></item></channel></rss>