<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Text-Classification on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/tags/text-classification/</link><description>Recent content in Text-Classification on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 14 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/text-classification/index.xml" rel="self" type="application/rss+xml"/><item><title>Tea Party in the House: Legislative Ideology via HIPTM</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/tea-party-hiptm/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/tea-party-hiptm/</guid><description>A hierarchical probabilistic model combining roll call votes, bill text, and legislative speeches to analyze political polarization and framing.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p><strong>Method</strong>.</p>
<p>This paper is primarily a <strong>Methodological</strong> contribution. It proposes a novel probabilistic architecture, the Hierarchical Ideal Point Topic Model (HIPTM), designed to address a limitation of existing political science models, which typically rely on either voting data or text data in isolation. The paper validates the method by showing that it predicts &ldquo;Tea Party&rdquo; membership better than text-only baselines and that it supports interpretable &ldquo;framing&rdquo; analysis.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>The primary motivation is to better understand political polarization, specifically the &ldquo;Tea Party&rdquo; phenomenon within the Republican party during the 112th Congress.</p>
<p>An ideal point is a scalar score representing a legislator&rsquo;s ideological position, estimated from voting patterns. Standard &ldquo;Ideal Point&rdquo; models (like DW-NOMINATE) typically project legislators onto a single liberal-conservative dimension using only binary voting data. This is insufficient for capturing complex, multi-dimensional intra-party conflicts where legislators might agree on votes but differ on policy &ldquo;framing&rdquo; or specific sub-issues. Furthermore, existing multi-dimensional models often produce dimensions that are difficult for humans to interpret.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is the <strong>Hierarchical Ideal Point Topic Model (HIPTM)</strong>. It distinguishes itself from prior work through three main technical innovations:</p>
<ol>
<li><strong>Joint Modeling of Three Data Sources</strong>: It integrates roll call votes, the text of bills, and the floor speeches of legislators into a single probabilistic framework.</li>
<li><strong>Hierarchical Topic Structure</strong>: It models &ldquo;frames&rdquo; as a second level of the topic hierarchy. &ldquo;Issues&rdquo; (level 1) are fixed and non-polarized, while &ldquo;Frames&rdquo; (level 2) are discovered dynamically and carry polarity (ideal point weights). For example, Health Care is an issue; &ldquo;government overreach&rdquo; vs. &ldquo;patient protection&rdquo; are frames legislators use when debating it.</li>
<li><strong>Text-Based Ideal Point Prediction</strong>: HIPTM regresses ideal points on speech text, allowing it to predict the political alignment of legislators based solely on their writing or speeches without requiring voting records for inference.</li>
</ol>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors validated the model using data from the 112th U.S. Congress (Republican legislators only).</p>
<ul>
<li><strong>Prediction Task</strong>: Classifying legislators as members of the &ldquo;Tea Party Caucus&rdquo;.</li>
<li><strong>Baselines</strong>: The model was compared against Support Vector Machines (SVM) trained on:
<ul>
<li>TF-IDF vectors (Text only)</li>
<li>Normalized TF-IDF vectors (Text only)</li>
<li>Binary Vote vectors (Vote only)</li>
</ul>
</li>
<li><strong>Metric</strong>: Area Under the Receiver Operating Characteristic Curve (AUC-ROC) via 5-fold cross-validation.</li>
<li><strong>Qualitative Analysis</strong>: The authors examined the &ldquo;span&rdquo; of ideal points within specific topics (e.g., Macroeconomics, Health) to identify which issues were most polarized between Tea Party and Establishment Republicans.</li>
</ul>
<h2 id="what-were-the-outcomes-and-conclusions-drawn">What were the outcomes and conclusions drawn?</h2>
<ul>
<li><strong>Quantitative Performance</strong>: HIPTM features combined with voting data (HIPTM-VOTE) achieved the highest classification performance, with an AUC-ROC of roughly 0.70-0.75 (read approximately from Figure 2). Vote-only features trail HIPTM-VOTE slightly, while the text-only baselines (TF-IDF, normalized TF-IDF) fall considerably lower. The one-dimensional Tea Party ideal points correlate strongly with DW-NOMINATE ($\rho = 0.91$). When voting data was withheld (simulating a candidate without a voting record), HIPTM&rsquo;s text-based features outperformed both text baselines (again read approximately, from Figure 3).</li>
<li><strong>Political Insight</strong>: The model identified &ldquo;Government Operations,&rdquo; &ldquo;Macroeconomics,&rdquo; and &ldquo;Transportation&rdquo; as the three most polarized topics between Tea Party and establishment Republicans.</li>
<li><strong>Framing Analysis</strong>: The hierarchical topic structure reveals how legislators frame issues differently. For Macroeconomics, frame M3 (most Tea Party-oriented) focuses on criticizing government overspending, while frame M1 (least Tea Party-oriented) focuses on the downsides of a government shutdown. For Health, frame H3 captures Tea Party framing of the Affordable Care Act as an unconstitutional government takeover, while frame H1 frames opposition in terms of implementation costs and health care exchanges.</li>
<li><strong>Framing vs. Voting Taxonomy</strong>: The authors construct a 2x2 taxonomy of disagreement across issues, crossing whether ideal points are polarized with whether issue frames are polarized. Issues like Civil Rights fall in the &ldquo;neither polarized&rdquo; quadrant, where cooperation is expected. Banking/Finance and Transportation fall in the &ldquo;ideal points polarized, frames not&rdquo; quadrant, where Republicans frame the issue similarly but have underlying policy disagreements. Issues like Health and Public Lands fall in the &ldquo;frames polarized, ideal points not&rdquo; quadrant: Republicans voted similarly but framed the issue very differently. Issues like Macroeconomics and Government Operations fall in the &ldquo;both polarized&rdquo; quadrant, posing the greatest challenge for Republican leadership.</li>
<li><strong>Sub-group Identification</strong>: The model identifies legislators whose language marks them as ideologically aligned with the Tea Party even without formal caucus membership. For example, Jeff Flake (R-AZ) received the second-highest ideal point, disagreeing with Freedom Works on only one of 60 key votes, despite not being a Tea Party Caucus member. Justin Amash (R-MI), founder and chairman of the Liberty Caucus, agreed with Freedom Works on every key vote since 2011. Conversely, some self-identified Tea Partiers like Rodney Alexander (R-LA) only agreed with Freedom Works 48% of the time. Alexander and Ander Crenshaw (R-FL, 50% agreement) are categorized as &ldquo;Green Tea&rdquo; by Gervais and Morris (2014): Republican legislators who associate with the Tea Party on their own initiative but lack support from Tea Party organizations.</li>
</ul>
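<p>The 2x2 framing vs. voting taxonomy above can be written down as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and the two boolean flags are taken as given because these notes do not record how the paper decides that an issue counts as &ldquo;polarized&rdquo;. The issue placements are the ones listed above.</p>

```python
# Hypothetical helper (not from the paper): map the two polarization flags
# to the quadrant names used in the taxonomy above.
def disagreement_quadrant(ideal_points_polarized, frames_polarized):
    if ideal_points_polarized and frames_polarized:
        return "both polarized"
    if ideal_points_polarized:
        return "ideal points polarized, frames not"
    if frames_polarized:
        return "frames polarized, ideal points not"
    return "neither polarized"


# Issue placements as reported in the summary above:
# (ideal points polarized?, frames polarized?)
reported = {
    "Civil Rights": (False, False),
    "Banking/Finance": (True, False),
    "Transportation": (True, False),
    "Health": (False, True),
    "Public Lands": (False, True),
    "Macroeconomics": (True, True),
    "Government Operations": (True, True),
}

for issue, flags in reported.items():
    print(f"{issue}: {disagreement_quadrant(*flags)}")
```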
<h3 id="limitations">Limitations</h3>
<ul>
<li>HIPTM does not formally distinguish frames from other kinds of subtopics. For example, the model discovered a strongly Tea Party-oriented frame under &ldquo;Labor, Employment and Immigration&rdquo; that reflected a Boeing labor dispute specific to South Carolina legislators, capturing geographic rather than ideological framing.</li>
<li>The model is validated only on Republican legislators in the 112th Congress. Generalization to other parties, chambers, or time periods is untested.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study focuses on the <strong>112th U.S. Congress</strong> (Jan 2011 - Jan 2013).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Subjects</strong></td>
          <td>Republican Legislators</td>
          <td>240 Reps</td>
          <td>60 are Tea Party Caucus members.</td>
      </tr>
      <tr>
          <td><strong>Votes</strong></td>
          <td>Roll Call Votes</td>
          <td>13,856 votes</td>
          <td>Agreement/disagreement with Freedom Works on 60 key votes (40 in 2011, 20 in 2012).</td>
      </tr>
      <tr>
          <td><strong>Text</strong></td>
          <td>Floor Speeches</td>
          <td>5,349 word types</td>
          <td>Sourced from GovTrack. Vocabulary size after preprocessing.</td>
      </tr>
      <tr>
          <td><strong>Priors</strong></td>
          <td>Congressional Bills Project</td>
          <td>19 Topics</td>
          <td>Used to set informed priors $\phi^*_k$ for top-level issues.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The model uses a <strong>Stochastic EM</strong> approach for inference.</p>
<ul>
<li><strong>Generative Process</strong>:
<ul>
<li><strong>Speeches</strong>: Modeled as a mixture of $K$ Hierarchical Dirichlet Processes (HDPs). A legislator chooses an issue $z$, then a frame $t$ from a Dirichlet Process, then a word $w$.</li>
<li><strong>Bills</strong>: Modeled using Latent Dirichlet Allocation (LDA). Each bill is a mixture over $K$ issues.</li>
<li><strong>Votes</strong>: Modeled via a probabilistic ideal point function (logistic/inverse-logit). The probability of a &ldquo;Yes&rdquo; vote depends on the bill&rsquo;s polarity $x_b$, popularity $y_b$, and the legislator&rsquo;s issue-specific ideal point $u_{a,k}$.</li>
</ul>
</li>
<li><strong>Optimization Steps</strong>:
<ol>
<li><strong>Sampling</strong>: Issue assignments $z$ are sampled for tokens in speeches and bills, and frame assignments $t$ are sampled for tokens in speeches.</li>
<li><strong>Regression</strong>: Frame-specific regression weights $\eta_{k,j}$ are optimized using <strong>L-BFGS</strong>.</li>
<li><strong>Ideal Points</strong>: Legislator ideal points $u_{a,k}$ and bill parameters ($x_b, y_b$) are updated using <strong>Gradient Ascent</strong>.</li>
</ol>
</li>
</ul>
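<p>To make the vote component concrete, here is a minimal sketch of a logistic ideal point function. It assumes the bill&rsquo;s issue proportions (called <code>bill_issue_props</code> here, a hypothetical name) weight the legislator&rsquo;s issue-specific ideal points $u_{a,k}$ before the result is scaled by polarity $x_b$ and shifted by popularity $y_b$. The exact parameterization should be checked against the paper; the numbers below are illustrative only.</p>

```python
import math


def yes_vote_probability(x_b, y_b, bill_issue_props, ideal_points):
    """Logistic ideal point function for one legislator and one bill.

    Assumed form: sigmoid(x_b * sum_k props[k] * u[k] + y_b), where
    props[k] is the bill's proportion of issue k and u[k] is the
    legislator's ideal point on issue k.
    """
    weighted_ideal_point = sum(
        prop * u for prop, u in zip(bill_issue_props, ideal_points)
    )
    score = x_b * weighted_ideal_point + y_b
    return 1.0 / (1.0 + math.exp(-score))


# Illustrative values: a bill that is 70% issue 0 and 30% issue 1.
p_yes = yes_vote_probability(
    x_b=2.0,                      # bill polarity
    y_b=0.0,                      # bill popularity
    bill_issue_props=[0.7, 0.3],
    ideal_points=[1.0, -0.5],     # legislator's ideal points per issue
)
print(round(p_yes, 3))
```

<p>With zero polarity the ideal point has no effect, and the probability depends only on the bill&rsquo;s popularity.</p>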
<h3 id="models">Models</h3>
<ul>
<li><strong>Ideal Point Definition</strong>: A legislator&rsquo;s ideal point on issue $k$ ($u_{a,k}$) is defined as a linear combination of the ideal points of the <em>frames</em> they use ($\eta_{k,j}$), weighted by their usage frequency ($\hat{\psi}_{a,k,j}$).</li>
<li><strong>Topic Hierarchy</strong>:
<ul>
<li><strong>Level 1 (Issues)</strong>: Fixed at $K=19$ (based on Policy Agendas Project major headings). These nodes use informed Dirichlet priors.</li>
<li><strong>Level 2 (Frames)</strong>: Unbounded number of frames per issue, discovered non-parametrically via Dirichlet Process.</li>
</ul>
</li>
<li><strong>Prediction Features</strong>: The model runs for 1,000 iterations total with a 500-iteration burn-in. After burn-in, the sampled state is kept every 50 iterations, and feature values are averaged over the 10 stored models.</li>
</ul>
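<p>The ideal point definition and the sample-thinning schedule above can be checked with a few lines of arithmetic. This is a sketch with made-up numbers: the function name is hypothetical, and the assumption that the first stored sample falls at iteration 550 (and the last at 1,000) is mine, chosen because it yields the 10 stored models mentioned above.</p>

```python
def issue_ideal_point(frame_usage, frame_weights):
    """u[a,k] = sum_j psi_hat[a,k,j] * eta[k,j] for one legislator and issue."""
    if len(frame_usage) != len(frame_weights):
        raise ValueError("one usage value is needed per frame weight")
    return sum(psi * eta for psi, eta in zip(frame_usage, frame_weights))


# Illustrative values: three frames under one issue.
frame_usage = [0.2, 0.3, 0.5]      # psi_hat: how often the legislator uses each frame
frame_weights = [-1.0, 0.0, 2.0]   # eta: ideal point (polarity) of each frame
u = issue_ideal_point(frame_usage, frame_weights)
print(round(u, 2))

# Thinning schedule: 1,000 iterations, 500 burn-in, keep every 50th state.
# Assumption: the first kept state is at iteration 550.
kept_iterations = list(range(550, 1001, 50))
print(len(kept_iterations), kept_iterations)
```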
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Primary Metric</strong>: AUC-ROC (Area Under the Receiver Operating Characteristic Curve).</li>
<li><strong>Classifier</strong>: $\text{SVM}^{\text{light}}$ (Joachims, 1999).</li>
<li><strong>Cross-Validation</strong>: 5-fold stratified sampling.</li>
</ul>
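<p>The evaluation protocol can be sketched without any external library. The code below is not the paper&rsquo;s pipeline (which uses $\text{SVM}^{\text{light}}$): it only shows how AUC-ROC can be computed as a pairwise ranking statistic and how stratified folds keep the class ratio fixed. The function names are hypothetical, and the 60/180 class split follows the subject counts in the Data table.</p>

```python
import random


def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def stratified_folds(labels, n_folds=5, seed=0):
    """Deal the shuffled indices of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# 240 Republican legislators, 60 of them Tea Party Caucus members.
labels = [1] * 60 + [0] * 180
folds = stratified_folds(labels)
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))

# Toy scores: three of the four positive/negative pairs are ranked correctly.
print(auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```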
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.govtrack.us/">GovTrack Congressional Speeches</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Source of floor speech text</td>
      </tr>
      <tr>
          <td><a href="http://www.congressionalbills.org/">Congressional Bills Project</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Bill text with Policy Agendas Project topic labels</td>
      </tr>
      <tr>
          <td>Freedom Works Key Votes</td>
          <td>Dataset</td>
          <td>Public</td>
          <td>60 key votes used to define Tea Party alignment (freedomworks.org is no longer available)</td>
      </tr>
  </tbody>
</table>
<p>No official code release accompanies this paper. The inference algorithm (Stochastic EM with Gibbs sampling, L-BFGS, and gradient ascent) is described in detail in Section 4 of the paper, but a full reimplementation would be required.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nguyen, V., Boyd-Graber, J., Resnik, P., &amp; Miler, K. (2015). Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress. <em>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics</em>, 1438-1448. <a href="https://doi.org/10.3115/v1/P15-1139">https://doi.org/10.3115/v1/P15-1139</a></p>
<p><strong>Publication</strong>: ACL 2015</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{nguyenTeaPartyHouse2015,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Tea {{Party}} in the {{House}}: {{A Hierarchical Ideal Point Topic Model}} and {{Its Application}} to {{Republican Legislators}} in the 112th {{Congress}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Tea {{Party}} in the {{House}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 53rd {{Annual Meeting}} of the {{Association}} for {{Computational Linguistics}} and the 7th {{International Joint Conference}} on {{Natural Language Processing}} ({{Volume}} 1: {{Long Papers}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Nguyen, Viet-An and {Boyd-Graber}, Jordan and Resnik, Philip and Miler, Kristina}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2015}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{1438--1448}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Beijing, China}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.3115/v1/P15-1139}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2023-11-02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We introduce the Hierarchical Ideal Point Topic Model, which provides a rich picture of policy issues, framing, and voting behavior using a joint model of votes, bill text, and the language that legislators use when debating bills. We use this model to look at the relationship between Tea Party Republicans and ``establishment&#39;&#39; Republicans in the U.S. House of Representatives during the 112th Congress.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://aclanthology.org/P15-1139/">ACL Anthology: Tea Party in the House</a></li>
<li>Gervais, B. T., &amp; Morris, I. L. (2012). Reading the tea leaves: Understanding Tea Party Caucus membership in the US House of Representatives. <em>PS: Political Science &amp; Politics</em>, 45(2), 245-250.</li>
<li>Gervais, B. T., &amp; Morris, I. L. (2014). Black Tea, Green Tea, White Tea, and Coffee: Understanding the variation in attachment to the Tea Party among members of Congress. In <em>Annual Meeting of the American Political Science Association</em>. (Source of the &ldquo;Green Tea&rdquo; Republican taxonomy cited in the paper)</li>
</ul>
]]></content:encoded></item><item><title>Party Matters: Enhancing Legislative Vote Embeddings</title><link>https://hunterheidenreich.com/notes/interdisciplinary/social-science/party-matters-hiptm/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/interdisciplinary/social-science/party-matters-hiptm/</guid><description>A method for improving legislative vote prediction across sessions by augmenting bill text embeddings with sponsor metadata.</description><content:encoded><![CDATA[<h2 id="what-kind-of-paper-is-this">What kind of paper is this?</h2>
<p>This is a <strong>Method</strong> paper. It proposes a novel neural architecture that modifies how bill embeddings are constructed by explicitly incorporating sponsor metadata alongside text. The authors validate this method by comparing it against text-only baselines (MWE and CNN) and demonstrating superior performance in a newly defined &ldquo;out-of-session&rdquo; evaluation setting.</p>
<h2 id="what-is-the-motivation">What is the motivation?</h2>
<p>Existing models for predicting legislative roll-call votes rely heavily on text or voting history within a single session. However, these models fail to generalize across sessions because the underlying data generation process changes. Specifically, the ideological position of bills on similar topics shifts depending on which party is in power. A model trained on a single session learns an implicit ideological prior that becomes inaccurate when the political context changes in subsequent sessions.</p>
<h2 id="what-is-the-novelty-here">What is the novelty here?</h2>
<p>The core novelty is a neural architecture that augments bill text representations with sponsor ideology, specifically the percentage of Republican vs. Democratic sponsors.</p>
<ul>
<li><strong>Sponsor-Weighted Embeddings</strong>: They compute a composite embedding where the text representation is weighted by party sponsorship percentages ($p_{r}, p_{d}$) and party-specific influence vectors ($a_{r}, a_{d}$).</li>
<li><strong>Out-of-Session Evaluation</strong>: They introduce an evaluation setting in which models trained on past sessions (e.g., 2005-2012) are tested on future sessions (e.g., 2013-2014). This measures generalization across sessions, which previous work had not evaluated.</li>
</ul>
<h2 id="what-experiments-were-performed">What experiments were performed?</h2>
<p>The authors evaluated their models using a dataset of U.S. Congressional bills from 2005 to 2016.</p>
<ul>
<li><strong>Models Tested</strong>: They compared text-only models (Mean Word Embedding, or MWE, and CNN) against metadata-augmented versions (MWE+Meta, CNN+Meta) and a &ldquo;Meta-Only&rdquo; baseline (using dummy text).</li>
<li><strong>Settings</strong>:
<ul>
<li><strong>In-Session</strong>: 5-fold cross-validation on 2005-2012 data.</li>
<li><strong>Out-of-Session</strong>: Training on 2005-2012 and testing on 2013-2014 and 2015-2016.</li>
</ul>
</li>
<li><strong>Baselines</strong>: Comparisons included a &ldquo;Guess Yes&rdquo; baseline and an SVM trained on bag-of-words summaries with sponsor indicators.</li>
</ul>
<h2 id="what-outcomesconclusions">What were the outcomes and conclusions?</h2>
<ul>
<li><strong>Metadata is Critical</strong>: Augmenting text with sponsor metadata consistently outperformed text-only models. The <code>CNN+Meta</code> model achieved the highest accuracy in-session (86.21% vs. 83.24% for CNN) and on 2013-2014 out-of-session (83.59%), while <code>MWE+Meta</code> achieved the best 2015-2016 accuracy (71.90%).</li>
<li><strong>Generalization</strong>: Text-only models degraded significantly in out-of-session testing. For example, CNN dropped from 83.24% in-session to 77.49% on 2013-2014 and 69.63% on 2015-2016, confirming that text alone fails to capture shifting ideological contexts.</li>
<li><strong>Sponsor Signal</strong>: The <code>Meta-Only</code> model (using no text) outperformed text-only models in the 2013-2014 out-of-session test (82.28% vs. 77.57% for MWE), suggesting that in some contexts, the sponsors&rsquo; party affiliation provides a stronger predictive signal than the bill&rsquo;s content.</li>
<li><strong>2015-2016 Difficulty</strong>: All models performed worse on the 2015-2016 session, where intra-party divisions within the House Republican caucus disrupted typical voting dynamics.</li>
</ul>
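<p>The size of the out-of-session degradation follows directly from the accuracies quoted above. The short calculation below uses only those reported numbers; the variable names are illustrative.</p>

```python
# Accuracies (percent) for the text-only CNN, as quoted above.
cnn_accuracy = {"in-session": 83.24, "2013-2014": 77.49, "2015-2016": 69.63}

# Drop in percentage points relative to the in-session result.
baseline = cnn_accuracy["in-session"]
drops = {
    session: round(baseline - accuracy, 2)
    for session, accuracy in cnn_accuracy.items()
    if session != "in-session"
}
print(drops)

# In-session gain from adding sponsor metadata (CNN+Meta vs. CNN).
meta_gain = round(86.21 - 83.24, 2)
print(meta_gain)
```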
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source</strong>: Collected from GovTrack. The paper text references the &ldquo;106th to 111th&rdquo; Congressional sessions, but the data tables show coverage from 2005 to 2016, which corresponds to the 109th through 114th sessions.</li>
<li><strong>Content</strong>: Non-unanimous roll call votes, full text of bills/resolutions, and Congressional Research Service (CRS) summaries.</li>
<li><strong>Filtering</strong>: Bills with unanimous votes were excluded.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>Text lowercased and stop-words removed.</li>
<li>Summaries truncated to $N=400$ words; full text truncated to $N=2000$ words (80th percentile lengths).</li>
</ul>
</li>
<li><strong>Splits</strong>:
<ul>
<li><em>Training</em>: Sessions 2005-2012 (1718 bills).</li>
<li><em>Testing</em>: Sessions 2013-2014 (360 bills) and 2015-2016 (382 bills).</li>
</ul>
</li>
</ul>
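<p>The preprocessing steps above amount to a few lines of code. The sketch below makes several assumptions that these notes do not settle: whitespace tokenization, a tiny hypothetical stop-word list, and stop-word removal before truncation.</p>

```python
# Hypothetical stop-word list; the list actually used is not given in these notes.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "by"}

SUMMARY_MAX_WORDS = 400     # N for CRS summaries
FULL_TEXT_MAX_WORDS = 2000  # N for full bill text


def preprocess(text, max_words):
    """Lowercase, drop stop words, then keep at most max_words tokens."""
    tokens = [tok for tok in text.lower().split() if tok not in STOP_WORDS]
    return tokens[:max_words]


print(preprocess("A Bill to Amend the Internal Revenue Code", SUMMARY_MAX_WORDS))
```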
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Bill Representation ($v_{B}$)</strong>:
$$v_{B}=((a_{r}p_{r})\cdot T_{r})+((a_{d}p_{d})\cdot T_{d})$$
where $T$ is the text embedding (CNN or MWE), $p$ is the percentage of the bill&rsquo;s sponsors from a party, and $a$ is a learnable party influence vector. $T_{r}$ and $T_{d}$ are Republican and Democratic copies of the same bill&rsquo;s text representation, each weighted by the corresponding party&rsquo;s sponsorship proportion.</li>
<li><strong>Vote Prediction</strong>:
<ul>
<li>Project bill embedding to legislator space: $v_{BL}=W_{B}v_{B}+b_{B}$.</li>
<li>Alignment score: $W_{v}(v_{BL}\odot v_{L})+b_{v}$ (using element-wise multiplication).</li>
<li>Output: Sigmoid activation.</li>
</ul>
</li>
<li><strong>Optimization</strong>: AdaMax algorithm with binary cross-entropy loss.</li>
</ul>
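<p>The forward pass described above can be traced on toy numbers. This is a sketch, not the authors&rsquo; implementation: it reads the products in the formula for $v_{B}$ as element-wise, treats $W_{v}$ as a single row of weights, and uses tiny hand-picked dimensions and parameter values (the paper uses 25-dimensional legislator embeddings and learns all parameters).</p>

```python
import math


def bill_embedding(text_emb, p_r, p_d, a_r, a_d):
    """v_B = (a_r * p_r) * T_r + (a_d * p_d) * T_d, read element-wise.

    T_r and T_d are both copies of the same text embedding.
    """
    return [
        (ar * p_r) * t + (ad * p_d) * t
        for t, ar, ad in zip(text_emb, a_r, a_d)
    ]


def vote_probability(v_b, v_l, W_b, b_b, w_v, b_v):
    """Project the bill into legislator space, align, and squash."""
    # v_BL = W_B v_B + b_B
    v_bl = [
        sum(w * x for w, x in zip(row, v_b)) + bias
        for row, bias in zip(W_b, b_b)
    ]
    # score = W_v (v_BL * v_L) + b_v, with an element-wise product inside
    score = sum(w * bl * l for w, bl, l in zip(w_v, v_bl, v_l)) + b_v
    return 1.0 / (1.0 + math.exp(-score))


# Toy example: 2-dimensional text embedding, 3-dimensional legislator space.
text_emb = [0.5, -1.0]
v_b = bill_embedding(text_emb, p_r=0.75, p_d=0.25, a_r=[1.0, 1.0], a_d=[1.0, 1.0])
print(v_b)

prob_yes = vote_probability(
    v_b,
    v_l=[0.2, -0.4, 0.1],                        # legislator embedding
    W_b=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],    # projection matrix
    b_b=[0.0, 0.0, 0.0],
    w_v=[1.0, 1.0, 1.0],
    b_v=0.0,
)
print(round(prob_yes, 3))
```

<p>With both influence vectors set to ones and sponsorship shares summing to one, $v_{B}$ reduces to the plain text embedding, which makes a handy sanity check.</p>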
<h3 id="models">Models</h3>
<ul>
<li><strong>Text Encoders</strong>:
<ul>
<li><strong>CNN</strong>: 4-grams with 400 filter maps.</li>
<li><strong>MWE</strong>: <a href="/posts/intro-to-word-embeddings/">Mean Word Embedding</a>.</li>
</ul>
</li>
<li><strong>Embeddings</strong>:
<ul>
<li>Initialized with 50-dimensional GloVe vectors.</li>
<li>Embeddings are non-static (updated during training).</li>
<li>Legislator embedding size ($v_{L}$): 25 dimensions.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Weights initialized with Glorot uniform distribution.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: Accuracy.</li>
<li><strong>Comparison</strong>:
<ul>
<li><strong>In-session</strong>: 5-fold cross-validation.</li>
<li><strong>Out-of-session</strong>: Train on past sessions, predict future sessions.</li>
</ul>
</li>
</ul>
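<p>The out-of-session protocol differs from cross-validation only in how the data is split: by time, not at random. A minimal sketch, using hypothetical bill records with a <code>year</code> field (the real data layout is not described in these notes):</p>

```python
def split_by_session(bills, train_years, test_years):
    """Temporal split: train on earlier sessions, test on later ones."""
    train = [bill for bill in bills if bill["year"] in train_years]
    test = [bill for bill in bills if bill["year"] in test_years]
    return train, test


# Hypothetical records for illustration only.
bills = [
    {"id": "bill-a", "year": 2006},
    {"id": "bill-b", "year": 2011},
    {"id": "bill-c", "year": 2013},
    {"id": "bill-d", "year": 2014},
    {"id": "bill-e", "year": 2016},
]

# Train on 2005-2012, test on 2013-2014 (range end is exclusive).
train, test = split_by_session(bills, range(2005, 2013), range(2013, 2015))
print([bill["id"] for bill in train])
print([bill["id"] for bill in test])
```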
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Config</strong>: Models trained for 50 epochs with mini-batches of size 50. No specific GPU or compute requirements are reported.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.govtrack.us/">GovTrack</a></td>
          <td>Dataset</td>
          <td>Public</td>
          <td>Source for bill texts and roll-call votes</td>
      </tr>
  </tbody>
</table>
<p>No official code repository or pretrained models were released with this paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kornilova, A., Argyle, D., &amp; Eidelman, V. (2018). Party Matters: Enhancing Legislative Embeddings with Author Attributes for Vote Prediction. <em>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</em>, 510-515. <a href="https://doi.org/10.18653/v1/p18-2081">https://doi.org/10.18653/v1/p18-2081</a></p>
<p><strong>Publication</strong>: ACL 2018</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kornilovaPartyMattersEnhancing2018,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Party {{Matters}}: {{Enhancing Legislative Embeddings}} with {{Author Attributes}} for {{Vote Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Party {{Matters}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Kornilova, Anastassia and Argyle, Daniel and Eidelman, Vlad}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 56th {{Annual Meeting}} of the {{Association}} for {{Computational Linguistics}} ({{Volume}} 2: {{Short Papers}})}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{510--515}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{Melbourne, Australia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.18653/v1/p18-2081}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{1805.08182}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Sarcasm Detection with Transformers: A Cautionary Tale</title><link>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</link><pubDate>Sun, 25 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</guid><description>Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that learned to classify news sources.</description><content:encoded><![CDATA[<h2 id="why-sarcasm-detection-is-hard">Why Sarcasm Detection Is Hard</h2>
<p>Sarcasm detection is one of the harder problems in NLP. The difficulties include:</p>
<p><strong>Context dependence</strong>: Sarcasm relies on situational knowledge and shared understanding that extends beyond the text itself.</p>
<p><strong>Subtlety</strong>: Even humans struggle with sarcastic interpretation, especially in written text without vocal cues.</p>
<p><strong>Cultural variability</strong>: Sarcastic expressions vary significantly across cultures and regions.</p>
<p><strong>Annotation disagreement</strong>: Human annotators often disagree on what constitutes sarcasm.</p>
<p>These challenges raise a fundamental question: can sarcasm detection be well-defined as a computational problem? This case study explores what happens when we try (and reveals a common pitfall in dataset construction).</p>
<h2 id="the-dataset-a-hidden-flaw">The Dataset: A Hidden Flaw</h2>
<p>I used the <a href="https://huggingface.co/datasets/raquiba/Sarcasm_News_Headline">Sarcasm News Headlines dataset</a>, which combines headlines from <a href="https://theonion.com/">The Onion</a> (satirical) and <a href="https://www.huffpost.com/">The Huffington Post</a> (traditional news). The dataset contains ~50,000 examples.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> datasets <span style="color:#f92672">import</span> load_dataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> load_dataset(<span style="color:#e6db74">&#34;raquiba/Sarcasm_News_Headline&#34;</span>)
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">1</span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>{&#39;headline&#39;: &#39;thirtysomething scientists unveil doomsday clock of hair loss&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 1}
</span></span><span style="display:flex;"><span>{&#39;headline&#39;: &#39;dem rep. totally nails why congress is falling short on gender, racial equality&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 0}
</span></span></code></pre></div><p><strong>The critical flaw</strong>: This dataset assigns its binary labels by source domain: headlines from The Onion are labeled sarcastic, and HuffPost headlines are not. This creates a dangerous shortcut, because a model can score well by learning to detect the publication source instead of sarcasm.</p>
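<p>One way to check for this kind of shortcut is to test whether the source alone determines the label. The sketch below runs on a handful of made-up rows shaped like the dataset&rsquo;s <code>article_link</code> and <code>is_sarcastic</code> columns; the URLs are placeholders, and the helper name is hypothetical.</p>

```python
from urllib.parse import urlparse


def source_of(url):
    """Return the last two labels of the host name, e.g. 'theonion.com'."""
    host = urlparse(url).netloc.lower()
    return ".".join(host.split(".")[-2:])


# Made-up rows for illustration; these are not records from the dataset.
rows = [
    {"article_link": "https://www.theonion.com/example-a", "is_sarcastic": 1},
    {"article_link": "https://local.theonion.com/example-b", "is_sarcastic": 1},
    {"article_link": "https://www.huffingtonpost.com/entry/example-c", "is_sarcastic": 0},
    {"article_link": "https://www.huffingtonpost.com/entry/example-d", "is_sarcastic": 0},
]

# Collect the set of labels observed for each source.
labels_per_source = {}
for row in rows:
    source = source_of(row["article_link"])
    labels_per_source.setdefault(source, set()).add(row["is_sarcastic"])

# If every source maps to exactly one label, the source is a perfect predictor.
source_determines_label = all(
    len(labels) == 1 for labels in labels_per_source.values()
)
print(labels_per_source)
print("source alone determines label:", source_determines_label)
```

<p>When this check passes on the real data, high test accuracy says little about sarcasm itself, since any signal that identifies the publisher is enough.</p>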
<p>I then standardized the column names to <code>text</code> and <code>label</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">lambda</span> example: {<span style="color:#e6db74">&#34;text&#34;</span>: example[<span style="color:#e6db74">&#34;headline&#34;</span>], <span style="color:#e6db74">&#34;label&#34;</span>: example[<span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]},
</span></span><span style="display:flex;"><span>    remove_columns<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;headline&#34;</span>, <span style="color:#e6db74">&#34;article_link&#34;</span>, <span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="fine-tuning-roberta">Fine-Tuning RoBERTa</h2>
<p>I fine-tuned a pre-trained RoBERTa model using standard practices:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model_name <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;FacebookAI/roberta-base&#34;</span>
</span></span><span style="display:flex;"><span>tokenizer <span style="color:#f92672">=</span> AutoTokenizer<span style="color:#f92672">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(model_name, num_labels<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Tokenize the data</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">tokenize_function</span>(examples):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> tokenizer(examples[<span style="color:#e6db74">&#34;text&#34;</span>], truncation<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, max_length<span style="color:#f92672">=</span><span style="color:#ae81ff">512</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tokenized_datasets <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(tokenize_function, batched<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Training configuration</span>
</span></span><span style="display:flex;"><span>training_args <span style="color:#f92672">=</span> TrainingArguments(
</span></span><span style="display:flex;"><span>    output_dir<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;./results&#34;</span>,
</span></span><span style="display:flex;"><span>    num_train_epochs<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>,
</span></span><span style="display:flex;"><span>    per_device_train_batch_size<span style="color:#f92672">=</span><span style="color:#ae81ff">32</span>,
</span></span><span style="display:flex;"><span>    evaluation_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    save_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    load_best_model_at_end<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer <span style="color:#f92672">=</span> Trainer(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span>model,
</span></span><span style="display:flex;"><span>    args<span style="color:#f92672">=</span>training_args,
</span></span><span style="display:flex;"><span>    train_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;train&#34;</span>],
</span></span><span style="display:flex;"><span>    eval_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;test&#34;</span>],
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#f92672">=</span>tokenizer,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer<span style="color:#f92672">.</span>train()
</span></span></code></pre></div><h2 id="results-too-good-to-be-true">Results: Too Good to Be True</h2>
<p>The model achieved high accuracy:</p>
<table>
  <thead>
      <tr>
          <th>Epoch</th>
          <th>Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>96.3%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>97.8%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>99.8%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>99.8%</td>
      </tr>
  </tbody>
</table>
<p>This should immediately raise red flags. Sarcasm detection is notoriously difficult, even for humans. Such high accuracy suggests the model learned a proxy task.</p>
<p>My hypothesis: <strong>The model bypassed sarcasm detection entirely, learning only to distinguish between The Onion and HuffPost writing styles.</strong></p>
<h2 id="interacting-with-the-model">Interacting with the Model</h2>
<p>Let&rsquo;s test our hypothesis by interacting with the model.</p>
<p>First, let&rsquo;s load the fine-tuned checkpoint and wrap it in a pipeline, reusing the tokenizer from above:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> pipeline
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(<span style="color:#e6db74">&#39;results/2024-02-25_20-24-51/checkpoint-4475&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>clf <span style="color:#f92672">=</span> pipeline(<span style="color:#e6db74">&#39;text-classification&#39;</span>, model<span style="color:#f92672">=</span>model, tokenizer<span style="color:#f92672">=</span>tokenizer)
</span></span></code></pre></div><p>Now, let&rsquo;s test the model with some examples.</p>
<p>First, let&rsquo;s try an Onion article from this week, something I know to be sarcastic and not in the training data.
Let&rsquo;s use <a href="https://theonion.com/alabama-supreme-court-justice-invokes-veggietales-in-1851282252/">&ldquo;Alabama Supreme Court Justice Invokes &lsquo;VeggieTales&rsquo; In Ruling&rdquo;</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Alabama Supreme Court Justice Invokes ‘VeggieTales&#39; In Ruling&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.99916672706604}]
</span></span></code></pre></div><p>The model is extremely confident that this is not sarcastic.</p>
<p>Let&rsquo;s try a different Onion article, possibly even more difficult: <a href="https://theonion.com/trump-booed-frozen-burritos-and-more-this-week-in-br-1851282066/">Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993497729301453}]
</span></span></code></pre></div><p>Again, the model is very confident that this is not sarcastic. Hmm. One possible explanation is temporal drift: a model trained on older headlines may not capture The Onion&rsquo;s humor in 2024.</p>
<p>Let&rsquo;s try one more Onion article, one that is still recent but an easier target: <a href="https://theonion.com/mom-only-likes-the-other-outback-steakhouse-1851265335/">Mom Only Likes The Other Outback Steakhouse</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Mom Only Likes The Other Outback Steakhouse&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_1&#39;, &#39;score&#39;: 0.9997231364250183}]
</span></span></code></pre></div><p>Finally, a correct prediction! The model is confident that this is sarcastic.
This suggests that our model detects only very specific types of sarcasm and fails to generalize to new, unseen data within the same domain.</p>
<p>Let&rsquo;s also try the five most recent headlines from the Huffington Post, which the model should predict as not sarcastic:</p>
<ul>
<li><a href="https://www.huffpost.com/entry/donald-trump-south-carolina-nikki-haley_n_65db61f5e4b0e4346d52bed8">Donald Trump Won South Carolina - But There&rsquo;s 1 Big Caveat</a></li>
<li><a href="https://www.huffpost.com/entry/israeli-embassy-washington-man-set-fire_n_65db9364e4b0e4346d52ce3d">Man Sets Himself On Fire In Front Of Israeli Embassy In Washington</a></li>
<li><a href="https://www.huffpost.com/entry/bc-ml-israel-palestinians-temporary-truce-cease-fire_n_65db2e9ae4b0189a6a7e32ea">Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange</a></li>
<li><a href="https://www.huffpost.com/entry/george-latimer-race-comments-democratic-primary_n_65d8fac3e4b0cc1f2f7bafd8">A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.</a></li>
<li><a href="https://www.huffpost.com/entry/mongolia-climate-change-extreme-weather_n_65d90294e4b0cc1f2f7bb527">Climate Change-Fueled Winter Extremes Put 90% Of This Country At &lsquo;High Risk&rsquo;</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf([
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Donald Trump Won South Carolina - But There&#39;s 1 Big Caveat&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Man Sets Himself On Fire In Front Of Israeli Embassy In Washington&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Climate Change-Fueled Winter Extremes Put 90% Of This Country At &#39;High Risk&#39;&#34;</span>
</span></span><span style="display:flex;"><span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993808269500732},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993786811828613},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9985186457633972},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993883371353149},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993487000465393}]
</span></span></code></pre></div><p>The model is extremely confident that these are not sarcastic.</p>
<p>Taken together, the model flagged only one of three new Onion headlines as sarcastic while labeling all five HuffPost headlines as not sarcastic. This is a common problem in machine learning: training a model that performs well on a specific dataset is straightforward, but training one that generalizes to new, unseen data remains a significant challenge.
Furthermore, our sarcasm detection project effectively produced a domain classifier. For fuzzier concepts like sarcasm, it&rsquo;s important to be clear about what we&rsquo;re actually detecting, and to collect data at the scale needed to capture the full range of the concept.</p>
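<p>Spot checks like these are easier to compare when tallied per source. The sketch below re-enters the eight predictions printed above by hand and assumes <code>LABEL_1</code> means sarcastic, matching the dataset&rsquo;s <code>is_sarcastic</code> column. The <code>probe</code> list and the helper name are illustrative.</p>

```python
# Predictions copied from the outputs above: (source, predicted label).
probe = [
    ("onion", "LABEL_0"),
    ("onion", "LABEL_0"),
    ("onion", "LABEL_1"),
    ("huffpost", "LABEL_0"),
    ("huffpost", "LABEL_0"),
    ("huffpost", "LABEL_0"),
    ("huffpost", "LABEL_0"),
    ("huffpost", "LABEL_0"),
]

# Assumption: the source determines the expected label, as in the training data.
expected = {"onion": "LABEL_1", "huffpost": "LABEL_0"}


def accuracy_by_source(predictions):
    """Return {source: (correct, total)} for a list of (source, label) pairs."""
    tally = {}
    for source, label in predictions:
        correct, total = tally.get(source, (0, 0))
        tally[source] = (correct + (label == expected[source]), total + 1)
    return tally


print(accuracy_by_source(probe))
# {'onion': (1, 3), 'huffpost': (5, 5)}
```

<p>Eight headlines are far too few for a reliable estimate, but a larger hand-collected probe set could be scored the same way.</p>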
<h2 id="key-takeaways">Key Takeaways</h2>
<p>This case study reveals a fundamental problem in ML: <strong>high test accuracy only tells us about performance on data drawn from the same distribution as the training set</strong>. Here&rsquo;s what actually happened:</p>
<ol>
<li><strong>Dataset bias</strong>: Using publication source as a proxy for sarcasm created a shortcut for the model</li>
<li><strong>Domain classification</strong>: The model appears to have learned to distinguish writing styles, not sarcasm</li>
<li><strong>Poor generalization</strong>: New headlines from the same sources were often misclassified</li>
</ol>
<p>This is a common pitfall when building datasets for subjective concepts. The lesson: high accuracy must be accompanied by validation of the model&rsquo;s actual learned behavior.</p>
<p>For better sarcasm detection, we&rsquo;d need:</p>
<ul>
<li>Diverse sources beyond two publications</li>
<li>Human annotation across multiple contexts</li>
<li>Careful evaluation on out-of-domain examples</li>
</ul>
<p>Instructive failures in ML projects provide valuable lessons about our assumptions and the limitations of our approaches.</p>
]]></content:encoded></item><item><title>Classifying Congressional Bills with Machine Learning</title><link>https://hunterheidenreich.com/posts/congressional-bill-policy-area-classification/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/congressional-bill-policy-area-classification/</guid><description>Testing ML classification of congressional bills by policy area. Comparing Naive Bayes, Logistic Regression, and XGBoost on legislative text.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>This post explores machine learning approaches for classifying congressional bills by policy area, using data from the 115th to 117th Congresses (2017-2023). We&rsquo;ll examine:</p>
<ul>
<li>The fundamentals of bill classification</li>
<li>Traditional machine learning models as baselines</li>
<li>Performance analysis across different time periods and policy domains</li>
</ul>
<p>This work establishes baselines for future deep learning approaches to legislative text classification.</p>
<p><em>This post builds on the data foundation established in <a href="/posts/us-117th-congress-data-exploration/">Exploring the 117th U.S. Congress</a>.</em></p>
<h3 id="why-this-matters">Why This Matters</h3>
<p>Automatically classifying congressional bills by policy area has practical value for researchers, journalists, and citizens who need to navigate thousands of bills each Congress. Machine learning can help identify patterns in legislative priorities and track policy trends over time.</p>
<p>This work establishes baseline performance for text classification on legislative data, providing a foundation for more sophisticated approaches.</p>
<h2 id="data">Data</h2>
<p>The data comes from scraping <a href="https://www.congress.gov/">Congress.gov</a> for all bills from the 115th through 117th Congresses. Each bill includes:</p>
<ul>
<li>Bill ID and title</li>
<li>Summary (when available): the earliest summary provided</li>
<li>Full text (when available): the earliest text version</li>
<li>Policy area classification</li>
</ul>
<p>Our task is to predict policy area from text features:</p>
<p>$$
f(X) = \hat{y}, \quad \text{where} \quad X = \{ \text{title}, \text{summary}, \text{text} \}, \quad \hat{y} \in \{ \text{policy areas} \}
$$</p>
<p>The complete dataset is available at <a href="https://huggingface.co/datasets/hheiden/us-congress-bill-policy-115_117">Hugging Face: hheiden/us-congress-bill-policy-115_117</a>.</p>
<h3 id="bills-by-congress">Bills by Congress</h3>
<p>Our dataset contains the following distribution:</p>
<table>
  <thead>
      <tr>
          <th>Congress</th>
          <th>Bills</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>115th</td>
          <td>13,555</td>
      </tr>
      <tr>
          <td>116th</td>
          <td>16,601</td>
      </tr>
      <tr>
          <td>117th</td>
          <td>17,817</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>47,973</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="policy-areas">Policy Areas</h3>
<p>Each bill receives a policy area label from <a href="https://www.congress.gov/">Congress.gov</a> (see <a href="https://www.congress.gov/help/field-values/policy-area">glossary</a>). The dataset includes 33 policy areas, though these classes are highly imbalanced.</p>
<p>The following table shows the number of bills in each policy area across the three Congresses:</p>
<table>
  <thead>
      <tr>
          <th>Policy Area</th>
          <th>115th</th>
          <th>116th</th>
          <th>117th</th>
          <th>Total</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Agriculture and Food</td>
          <td>312</td>
          <td>328</td>
          <td>398</td>
          <td>1,038</td>
      </tr>
      <tr>
          <td>Animals</td>
          <td>96</td>
          <td>83</td>
          <td>71</td>
          <td>250</td>
      </tr>
      <tr>
          <td>Armed Forces and National Security</td>
          <td>1,108</td>
          <td>1,337</td>
          <td>1,399</td>
          <td>3,844</td>
      </tr>
      <tr>
          <td>Arts, Culture, Religion</td>
          <td>81</td>
          <td>79</td>
          <td>103</td>
          <td>263</td>
      </tr>
      <tr>
          <td>Civil Rights and Liberties, Minority Issues</td>
          <td>175</td>
          <td>205</td>
          <td>220</td>
          <td>600</td>
      </tr>
      <tr>
          <td>Commerce</td>
          <td>312</td>
          <td>593</td>
          <td>633</td>
          <td>1,538</td>
      </tr>
      <tr>
          <td>Congress</td>
          <td>594</td>
          <td>541</td>
          <td>640</td>
          <td>1,775</td>
      </tr>
      <tr>
          <td>Crime and Law Enforcement</td>
          <td>827</td>
          <td>904</td>
          <td>1,022</td>
          <td>2,753</td>
      </tr>
      <tr>
          <td>Economics and Public Finance</td>
          <td>176</td>
          <td>210</td>
          <td>197</td>
          <td>583</td>
      </tr>
      <tr>
          <td>Education</td>
          <td>607</td>
          <td>798</td>
          <td>801</td>
          <td>2,206</td>
      </tr>
      <tr>
          <td>Emergency Management</td>
          <td>207</td>
          <td>198</td>
          <td>202</td>
          <td>607</td>
      </tr>
      <tr>
          <td>Energy</td>
          <td>316</td>
          <td>370</td>
          <td>530</td>
          <td>1,216</td>
      </tr>
      <tr>
          <td>Environmental Protection</td>
          <td>352</td>
          <td>423</td>
          <td>464</td>
          <td>1,239</td>
      </tr>
      <tr>
          <td>Families</td>
          <td>79</td>
          <td>127</td>
          <td>139</td>
          <td>345</td>
      </tr>
      <tr>
          <td>Finance and Financial Sector</td>
          <td>556</td>
          <td>611</td>
          <td>601</td>
          <td>1,768</td>
      </tr>
      <tr>
          <td>Foreign Trade and International Finance</td>
          <td>120</td>
          <td>148</td>
          <td>212</td>
          <td>480</td>
      </tr>
      <tr>
          <td>Government Operations and Politics</td>
          <td>1,008</td>
          <td>1,258</td>
          <td>1,272</td>
          <td>3,538</td>
      </tr>
      <tr>
          <td>Health</td>
          <td>1,526</td>
          <td>2,109</td>
          <td>2,276</td>
          <td>5,911</td>
      </tr>
      <tr>
          <td>Housing and Community Development</td>
          <td>142</td>
          <td>250</td>
          <td>231</td>
          <td>623</td>
      </tr>
      <tr>
          <td>Immigration</td>
          <td>398</td>
          <td>466</td>
          <td>591</td>
          <td>1,455</td>
      </tr>
      <tr>
          <td>International Affairs</td>
          <td>918</td>
          <td>1,178</td>
          <td>1,390</td>
          <td>3,486</td>
      </tr>
      <tr>
          <td>Labor and Employment</td>
          <td>348</td>
          <td>452</td>
          <td>552</td>
          <td>1,352</td>
      </tr>
      <tr>
          <td>Law</td>
          <td>109</td>
          <td>162</td>
          <td>175</td>
          <td>446</td>
      </tr>
      <tr>
          <td>Native Americans</td>
          <td>175</td>
          <td>234</td>
          <td>245</td>
          <td>654</td>
      </tr>
      <tr>
          <td>Public Lands and Natural Resources</td>
          <td>718</td>
          <td>648</td>
          <td>642</td>
          <td>2,008</td>
      </tr>
      <tr>
          <td>Science, Technology, Communications</td>
          <td>389</td>
          <td>551</td>
          <td>505</td>
          <td>1,445</td>
      </tr>
      <tr>
          <td>Social Sciences and History</td>
          <td>5</td>
          <td>6</td>
          <td>4</td>
          <td>15</td>
      </tr>
      <tr>
          <td>Social Welfare</td>
          <td>177</td>
          <td>229</td>
          <td>199</td>
          <td>605</td>
      </tr>
      <tr>
          <td>Sports and Recreation</td>
          <td>92</td>
          <td>93</td>
          <td>125</td>
          <td>310</td>
      </tr>
      <tr>
          <td>Taxation</td>
          <td>983</td>
          <td>1,156</td>
          <td>1,078</td>
          <td>3,217</td>
      </tr>
      <tr>
          <td>Transportation and Public Works</td>
          <td>492</td>
          <td>672</td>
          <td>742</td>
          <td>1,906</td>
      </tr>
      <tr>
          <td>Water Resources Development</td>
          <td>89</td>
          <td>111</td>
          <td>110</td>
          <td>310</td>
      </tr>
      <tr>
          <td>Private Legislation</td>
          <td>69</td>
          <td>71</td>
          <td>48</td>
          <td>188</td>
      </tr>
  </tbody>
</table>
<p>The class imbalance is severe: <code>Social Sciences and History</code> has only 15 bills across all three Congresses, while <code>Health</code> has 5,911 bills. This imbalance presents modeling challenges, as minority classes may lack sufficient representative samples.</p>
<h3 id="text-statistics">Text Statistics</h3>
<p>We analyzed token counts using spaCy to understand the computational requirements for each text field.</p>
<p>Title Token Statistics:</p>
<table>
  <thead>
      <tr>
          <th>Congress</th>
          <th>Average Tokens</th>
          <th>Min Tokens</th>
          <th>Max Tokens</th>
          <th>Total Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>115th</td>
          <td>12.3</td>
          <td>1</td>
          <td>167</td>
          <td>166,763</td>
      </tr>
      <tr>
          <td>116th</td>
          <td>11.3</td>
          <td>1</td>
          <td>226</td>
          <td>188,158</td>
      </tr>
      <tr>
          <td>117th</td>
          <td>11.5</td>
          <td>1</td>
          <td>272</td>
          <td>204,978</td>
      </tr>
      <tr>
          <td>All</td>
          <td>11.7</td>
          <td>1</td>
          <td>272</td>
          <td>559,419</td>
      </tr>
  </tbody>
</table>
<p>Summary Token Statistics:</p>
<table>
  <thead>
      <tr>
          <th>Congress</th>
          <th>Average Tokens</th>
          <th>Min Tokens</th>
          <th>Max Tokens</th>
          <th>Total Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>115th</td>
          <td>109.1</td>
          <td>2</td>
          <td>6,839</td>
          <td>1,479,212</td>
      </tr>
      <tr>
          <td>116th</td>
          <td>94.9</td>
          <td>2</td>
          <td>5,886</td>
          <td>1,574,732</td>
      </tr>
      <tr>
          <td>117th</td>
          <td>95.1</td>
          <td>2</td>
          <td>502</td>
          <td>1,695,276</td>
      </tr>
      <tr>
          <td>All</td>
          <td>99.0</td>
          <td>2</td>
          <td>6,839</td>
          <td>4,749,220</td>
      </tr>
  </tbody>
</table>
<p>Full Text Token Statistics:</p>
<table>
  <thead>
      <tr>
          <th>Congress</th>
          <th>Average Tokens</th>
          <th>Min Tokens</th>
          <th>Max Tokens</th>
          <th>Total Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>115th</td>
          <td>2,588.7</td>
          <td>91</td>
          <td>304,478</td>
          <td>35,092,075</td>
      </tr>
      <tr>
          <td>116th</td>
          <td>2,760.3</td>
          <td>70</td>
          <td>973,173</td>
          <td>45,824,498</td>
      </tr>
      <tr>
          <td>117th</td>
          <td>2,706.7</td>
          <td>71</td>
          <td>1,013,608</td>
          <td>48,224,757</td>
      </tr>
      <tr>
          <td>All</td>
          <td>2,692.0</td>
          <td>70</td>
          <td>1,013,608</td>
          <td>129,141,330</td>
      </tr>
  </tbody>
</table>
<p>These statistics reveal computational trade-offs:</p>
<ul>
<li><strong>Titles</strong> average ~12 tokens: computationally efficient but limited information.</li>
<li><strong>Summaries</strong> average ~100 tokens: good balance of information and efficiency.</li>
<li><strong>Full text</strong> averages ~2,700 tokens with 129M total tokens: detailed but computationally expensive. Processing this volume of text introduces real-world engineering challenges, such as memory constraints and the lower signal-to-noise ratio typical of long legal documents.</li>
</ul>
<p>We&rsquo;ll prototype with titles and summaries before considering full text, given the computational costs involved.</p>
<h2 id="evaluation-framework">Evaluation Framework</h2>
<h3 id="experimental-design">Experimental Design</h3>
<p>We train models on one Congress and test on each of the three, creating a 3x3 evaluation grid. This setup evaluates both within-Congress performance (training and testing on the same Congress) and cross-Congress generalization (testing on a different Congress). We expect temporal drift between Congresses to impact performance.</p>
<h3 id="metrics-and-hyperparameter-tuning">Metrics and Hyperparameter Tuning</h3>
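<p>We use the weighted-average F1 score: per-class F1 averaged with weights proportional to each class&rsquo;s support. This accounts for class imbalance by letting the score reflect the actual label distribution, though it also means frequent policy areas influence the result more than rare ones.</p>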
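<p>To make the averaging concrete, the sketch below computes per-class F1 on a tiny made-up label set and then averages it two ways: weighted by class support, and unweighted (macro). The labels and predictions are invented for illustration. In scikit-learn the same quantities come from <code>f1_score</code> with <code>average="weighted"</code> or <code>average="macro"</code>.</p>

```python
from collections import Counter

# Invented toy data: one frequent class and one rare class.
y_true = ["Health"] * 8 + ["Animals"] * 2
y_pred = ["Health"] * 8 + ["Health", "Animals"]


def per_class_f1(true, pred):
    """Return {label: F1} from true positives, false positives, false negatives."""
    scores = {}
    for label in sorted(set(true)):
        tp = sum(t == label and p == label for t, p in zip(true, pred))
        fp = sum(t != label and p == label for t, p in zip(true, pred))
        fn = sum(t == label and p != label for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


f1 = per_class_f1(y_true, y_pred)
support = Counter(y_true)

# Macro: every class counts equally. Weighted: classes count by support.
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / len(y_true)

print(f1, round(macro_f1, 3), round(weighted_f1, 3))
```

<p>In this toy case the weighted score sits closer to the frequent class&rsquo;s F1 than the macro score does, which is why a weighted average can mask weak performance on rare policy areas.</p>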
<p>For within-Congress evaluation, we report cross-validated scores. For cross-Congress evaluation, we test on the entire target Congress dataset.</p>
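<p>The resulting grid can be enumerated explicitly. This is a minimal sketch of the bookkeeping only: the function name and mode strings are illustrative, and the actual training and scoring are left out.</p>

```python
from itertools import product

CONGRESSES = ["115th", "116th", "117th"]


def evaluation_grid(congresses):
    """List every (train, test, mode) cell of the train-by-test grid."""
    cells = []
    for train, test in product(congresses, repeat=2):
        # Diagonal cells reuse the training data, so they rely on cross-validation;
        # off-diagonal cells score on the whole target Congress.
        mode = "cross-validation" if train == test else "full target set"
        cells.append((train, test, mode))
    return cells


grid = evaluation_grid(CONGRESSES)
for train, test, mode in grid:
    print(f"train={train} test={test} mode={mode}")
```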
<p>Hyperparameter tuning uses grid search with cross-validation, with the number of folds set to <code>min(3, n_samples)</code> to ensure all classes are represented. We then apply the best parameters found on the training Congress when testing generalization to the other Congresses.</p>
<h2 id="baseline-models">Baseline Models</h2>
<p>We evaluate three traditional machine learning approaches using TF-IDF vectorization:</p>
<h3 id="text-preprocessing">Text Preprocessing</h3>
<p>We convert text to numerical features using TF-IDF (term frequency-inverse document frequency), which weighs word importance by frequency within documents relative to the entire corpus. This creates normalized feature vectors suitable for machine learning classification.</p>
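<p>As a worked illustration, the sketch below computes TF-IDF by hand for three made-up bill titles. It assumes the smoothed IDF and L2 normalization that scikit-learn&rsquo;s <code>TfidfVectorizer</code> uses by default, with a plain whitespace split standing in for the library&rsquo;s tokenization. The titles are invented.</p>

```python
import math
from collections import Counter

# Invented example titles, not drawn from the dataset.
docs = [
    "veterans health care act",
    "rural health access act",
    "national park expansion act",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    raw = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


first = tfidf_vector(tokenized[0])
print(sorted(first.items(), key=lambda kv: kv[1], reverse=True))
```

<p>Here &ldquo;act&rdquo; appears in every title and so receives the lowest weight in the first document, while terms unique to that title receive the highest.</p>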
<h3 id="multinomial-naive-bayes">Multinomial Naive Bayes</h3>
<p>We start with Multinomial Naive Bayes as our simplest baseline. Despite its &ldquo;naive&rdquo; independence assumption between features, this model often performs surprisingly well for text classification tasks and serves as an important benchmark. If more complex models can&rsquo;t beat Naive Bayes, it signals potential issues with the approach or data.</p>
<p>The model&rsquo;s <code>feature_log_prob_</code> attribute reveals the most influential words for each policy area, providing interpretable insights into classification patterns.</p>
<p>You can see the code for training the Naive Bayes model below:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#f92672">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> TfidfVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.model_selection <span style="color:#f92672">import</span> GridSearchCV
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.pipeline <span style="color:#f92672">import</span> Pipeline
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.naive_bayes <span style="color:#f92672">import</span> MultinomialNB
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a pipeline with TF-IDF vectorizer and Multinomial Naive Bayes classifier</span>
</span></span><span style="display:flex;"><span>pipeline <span style="color:#f92672">=</span> Pipeline([
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;tfidf&#39;</span>, TfidfVectorizer(lowercase<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, dtype<span style="color:#f92672">=</span>np<span style="color:#f92672">.</span>float32)),
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;clf&#39;</span>, MultinomialNB()),
</span></span><span style="display:flex;"><span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Define the parameters for grid search</span>
</span></span><span style="display:flex;"><span>parameters <span style="color:#f92672">=</span> {  
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__ngram_range&#39;</span>: [(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">3</span>)],
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>, <span style="color:#ae81ff">0.5</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__min_df&#39;</span>: (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">5</span>, <span style="color:#ae81ff">10</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;clf__alpha&#39;</span>: (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.01</span>, <span style="color:#ae81ff">0.001</span>),
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Perform grid search with cross-validation</span>
</span></span><span style="display:flex;"><span>grid_search <span style="color:#f92672">=</span> GridSearchCV(
</span></span><span style="display:flex;"><span>    pipeline,
</span></span><span style="display:flex;"><span>    parameters,
</span></span><span style="display:flex;"><span>    scoring<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;f1_weighted&#39;</span>,
</span></span><span style="display:flex;"><span>    n_jobs<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span>    refit<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>    cv<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>grid_search<span style="color:#f92672">.</span>fit(X_train, y_train)
</span></span></code></pre></div><h3 id="logistic-regression">Logistic Regression</h3>
<p>Logistic regression provides a natural step up in complexity from Naive Bayes. It uses the logistic function to convert linear combinations of features into probabilities, making it an excellent baseline for comparison with more sophisticated models while remaining interpretable.</p>
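<p>For a multiclass problem such as policy areas, the logistic function generalizes to the softmax: each class gets a linear score, and the scores are normalized into probabilities. A minimal sketch, assuming made-up weights and a three-term feature vector (none of these values come from the trained models, and the helper names are hypothetical):</p>

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(x, weights, biases):
    """Linear score per class (w . x + b), followed by softmax."""
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical TF-IDF values for the terms ("farm", "military", "health").
x = [0.8, 0.0, 0.1]
# Hypothetical per-class weight vectors, one row per class.
weights = [
    [2.0, -1.0, 0.0],   # Agriculture
    [-1.0, 2.5, 0.0],   # Armed Forces
    [0.0, 0.0, 2.0],    # Health
]
biases = [0.0, 0.0, 0.0]

probs = class_probabilities(x, weights, biases)
print([round(p, 3) for p in probs])
```

<p>Because each class score is a weighted sum of term values, the learned weights can be read directly as evidence for or against a policy area.</p>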
<p>You can see the code for training the Logistic Regression model below:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#f92672">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> TfidfVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.model_selection <span style="color:#f92672">import</span> GridSearchCV
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.pipeline <span style="color:#f92672">import</span> Pipeline
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.linear_model <span style="color:#f92672">import</span> LogisticRegression
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a pipeline with TF-IDF vectorizer and Logistic Regression classifier</span>
</span></span><span style="display:flex;"><span>pipeline <span style="color:#f92672">=</span> Pipeline([
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;tfidf&#39;</span>, TfidfVectorizer(lowercase<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, dtype<span style="color:#f92672">=</span>np<span style="color:#f92672">.</span>float32)),
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;clf&#39;</span>, LogisticRegression(max_iter<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>, random_state<span style="color:#f92672">=</span><span style="color:#ae81ff">42</span>, class_weight<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;balanced&#39;</span>)),
</span></span><span style="display:flex;"><span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Define the parameters for grid search</span>
</span></span><span style="display:flex;"><span>parameters <span style="color:#f92672">=</span> {  
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__ngram_range&#39;</span>: [(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)],
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;clf__C&#39;</span>: [<span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">10</span>],
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Perform grid search with cross-validation</span>
</span></span><span style="display:flex;"><span>grid_search <span style="color:#f92672">=</span> GridSearchCV(
</span></span><span style="display:flex;"><span>    pipeline,
</span></span><span style="display:flex;"><span>    parameters,
</span></span><span style="display:flex;"><span>    scoring<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;f1_weighted&#39;</span>,
</span></span><span style="display:flex;"><span>    n_jobs<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>,
</span></span><span style="display:flex;"><span>    refit<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>    cv<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>grid_search<span style="color:#f92672">.</span>fit(X_train, y_train)
</span></span></code></pre></div><h3 id="xgboost">XGBoost</h3>
<p>We include XGBoost as our tree-based ensemble method. While XGBoost typically excels on structured tabular data, we test whether its gradient boosting approach can effectively handle TF-IDF features for text classification.</p>
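<p>The XGBoost training call in this section passes <code>clf__sample_weight=sample_weight</code> without showing how the weights are built. One plausible construction, assumed here rather than taken from the original code, mirrors the <code>class_weight='balanced'</code> heuristic used for logistic regression: each sample is weighted by <code>n_samples / (n_classes * class_count)</code>. The helper name and toy labels are hypothetical.</p>

```python
from collections import Counter


def balanced_sample_weight(labels):
    """Weight each sample inversely to the frequency of its class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Hypothetical, imbalanced toy labels.
y_train = ["Health", "Health", "Health", "Agriculture"]
sample_weight = balanced_sample_weight(y_train)
print(sample_weight)
```

<p>scikit-learn ships an equivalent helper, <code>sklearn.utils.class_weight.compute_sample_weight('balanced', y)</code>, which is likely the more convenient choice in practice.</p>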
<p>You can see the code for training the XGBoost model below:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#f92672">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> TfidfVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.model_selection <span style="color:#f92672">import</span> GridSearchCV
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.pipeline <span style="color:#f92672">import</span> Pipeline
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> xgboost <span style="color:#f92672">import</span> XGBClassifier
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a pipeline with TF-IDF vectorizer and XGBoost classifier</span>
</span></span><span style="display:flex;"><span>pipeline <span style="color:#f92672">=</span> Pipeline([
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;tfidf&#39;</span>, TfidfVectorizer(lowercase<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, dtype<span style="color:#f92672">=</span>np<span style="color:#f92672">.</span>float32)),
</span></span><span style="display:flex;"><span>    (<span style="color:#e6db74">&#39;clf&#39;</span>, XGBClassifier(use_label_encoder<span style="color:#f92672">=</span><span style="color:#66d9ef">False</span>, eval_metric<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;mlogloss&#39;</span>, objective<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;multi:softmax&#39;</span>, seed<span style="color:#f92672">=</span><span style="color:#ae81ff">42</span>, n_jobs<span style="color:#f92672">=-</span><span style="color:#ae81ff">1</span>)),
</span></span><span style="display:flex;"><span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Define the parameters for grid search</span>
</span></span><span style="display:flex;"><span>parameters <span style="color:#f92672">=</span> {  
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;tfidf__max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;clf__max_depth&#39;</span>: (<span style="color:#ae81ff">3</span>, <span style="color:#ae81ff">6</span>, <span style="color:#ae81ff">9</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#39;clf__n_estimators&#39;</span>: (<span style="color:#ae81ff">100</span>, <span style="color:#ae81ff">200</span>, <span style="color:#ae81ff">300</span>),
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Perform grid search with cross-validation</span>
</span></span><span style="display:flex;"><span>grid_search <span style="color:#f92672">=</span> GridSearchCV(
</span></span><span style="display:flex;"><span>    pipeline,
</span></span><span style="display:flex;"><span>    parameters,
</span></span><span style="display:flex;"><span>    scoring<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;f1_weighted&#39;</span>,
</span></span><span style="display:flex;"><span>    refit<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>    cv<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>    verbose<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>grid_search<span style="color:#f92672">.</span>fit(X_train, y_train, clf__sample_weight<span style="color:#f92672">=</span>sample_weight)
</span></span></code></pre></div><h2 id="results">Results</h2>
<p>We evaluate models on three input types:</p>
<ul>
<li><strong>Title-only</strong>: Quick prototyping with limited context</li>
<li><strong>Summary-only</strong>: Balanced information content and computational efficiency</li>
<li><strong>Full text</strong>: Maximum context with computational constraints (limited hyperparameter tuning)</li>
</ul>
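<p>All scores reported in this section are weighted F1 (<code>scoring='f1_weighted'</code> in the grid searches). As a reminder of what that metric computes, the sketch below calculates per-class F1 and averages it with weights proportional to each class&rsquo;s support. The function name and toy labels are illustrative assumptions.</p>

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Hypothetical toy labels, for illustration only.
y_true = ["Health", "Health", "Health", "Agriculture", "Agriculture", "Defense"]
y_pred = ["Health", "Health", "Agriculture", "Agriculture", "Agriculture", "Health"]

score = weighted_f1(y_true, y_pred)
print(round(score, 3))
```

<p>Because large classes dominate the average, weighted F1 can hide poor performance on rare policy areas; macro F1 would weight every class equally.</p>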
<h3 id="title-only-inputs">Title-Only Inputs</h3>
<h4 id="naive-bayes">Naive Bayes</h4>
<p>Title-only Naive Bayes experiments are run with the following settings:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>sweep_nb(
</span></span><span style="display:flex;"><span>    data,
</span></span><span style="display:flex;"><span>    X_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;title&#39;</span>,
</span></span><span style="display:flex;"><span>    y_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;policy_area&#39;</span>,
</span></span><span style="display:flex;"><span>    tfidf_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;lowercase&#39;</span>: <span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;dtype&#39;</span>: np<span style="color:#f92672">.</span>float32,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    tfidf_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;ngram_range&#39;</span>: [(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)],
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>, <span style="color:#ae81ff">0.5</span>),
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;min_df&#39;</span>: (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">5</span>),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    nb_params<span style="color:#f92672">=</span>{},
</span></span><span style="display:flex;"><span>    nb_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;alpha&#39;</span>: (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.01</span>, <span style="color:#ae81ff">0.001</span>),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>The sweep produces the following results:</p>
<pre><code>Training on Congress 115
Best score: 0.661
Refit Time: 0.570
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6369760774921475
Testing on Congress 117 F1: 0.5488274400521962

Training on Congress 116
Best score: 0.677
Refit Time: 0.499
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.691175262953872
Testing on Congress 117 F1: 0.6798043069585031

Training on Congress 117
Best score: 0.670
Refit Time: 0.565
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.25
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.6168474701996426
Testing on Congress 116 F1: 0.6981574942116808

Mean fit time: 0.54 ± 0.03s
</code></pre>
<h4 id="results-summary">Results Summary</h4>
<p>The results demonstrate several key findings:</p>
<ul>
<li><strong>Fast training</strong>: Sub-second training times make this highly practical</li>
<li><strong>Solid baseline performance</strong>: Cross-validated F1 of 0.66-0.68 and cross-Congress F1 of 0.55-0.70 provide a reasonable starting point</li>
<li><strong>Consistent hyperparameters</strong>: Similar optimal settings across Congresses suggest stable patterns</li>
<li><strong>Temporal effects</strong>: Performance generally decreases when training and testing on Congresses further apart in time</li>
</ul>
<p>Training on the 116th Congress yields the best cross-Congress performance, likely due to its temporal proximity to both adjacent sessions.</p>
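<p>Both observations can be checked directly from the log. The short sketch below groups the six cross-Congress F1 scores by the distance between the training and test Congress and by the training Congress; the scores are copied from the Naive Bayes output above, while the variable names are illustrative.</p>

```python
from collections import defaultdict
from statistics import fmean

# (train Congress, test Congress) -> F1, copied from the Naive Bayes log.
f1_scores = {
    (115, 116): 0.6369760774921475,
    (115, 117): 0.5488274400521962,
    (116, 115): 0.691175262953872,
    (116, 117): 0.6798043069585031,
    (117, 115): 0.6168474701996426,
    (117, 116): 0.6981574942116808,
}

by_distance = defaultdict(list)
by_train = defaultdict(list)
for (train, test), score in f1_scores.items():
    by_distance[abs(train - test)].append(score)
    by_train[train].append(score)

mean_by_distance = {d: fmean(v) for d, v in by_distance.items()}
mean_by_train = {c: fmean(v) for c, v in by_train.items()}
best_train = max(mean_by_train, key=mean_by_train.get)

print({d: round(m, 3) for d, m in mean_by_distance.items()})
print({c: round(m, 3) for c, m in mean_by_train.items()})
print(best_train)
```

<p>With only three Congresses the sample is small, so the distance effect should be read as suggestive rather than conclusive.</p>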
<figure class="post-figure center ">
    <img src="/img/nb_title_policy_area/f1s.webp"
         alt="Naive Bayes Policy Area Classification F1 Score"
         title="Naive Bayes Policy Area Classification F1 Score"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Naive Bayes F1 scores show temporal effects, with better performance between adjacent Congresses</figcaption>
    
</figure>

<p>The model learns interpretable features for each policy area. For example, Agriculture bills are strongly associated with terms like &ldquo;farm,&rdquo; &ldquo;crop,&rdquo; and &ldquo;livestock,&rdquo; while Armed Forces bills correlate with &ldquo;military,&rdquo; &ldquo;defense,&rdquo; and &ldquo;veterans.&rdquo;</p>
<figure class="post-figure center ">
    <img src="/img/nb_title_policy_area/top-Agriculture_and_Food.webp"
         alt="Naive Bayes Top Features for Agriculture and Food"
         title="Naive Bayes Top Features for Agriculture and Food"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Naive Bayes Top Features for Agriculture and Food</figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/nb_title_policy_area/top-Armed_Forces_and_National_Security.webp"
         alt="Naive Bayes Top Features for Armed Forces and National Security"
         title="Naive Bayes Top Features for Armed Forces and National Security"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Naive Bayes Top Features for Armed Forces and National Security</figcaption>
    
</figure>
<figure class="post-figure center ">
    <img src="/img/nb_title_policy_area/top-Health.webp"
         alt="Naive Bayes Top Features for Health"
         title="Naive Bayes Top Features for Health"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Naive Bayes Top Features for Health</figcaption>
    
</figure>

<h4 id="logistic-regression-1">Logistic Regression</h4>
<p>Title-only Logistic Regression experiments are run with the following settings:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>sweep_logreg(
</span></span><span style="display:flex;"><span>    data,
</span></span><span style="display:flex;"><span>    X_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;title&#39;</span>,
</span></span><span style="display:flex;"><span>    y_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;policy_area&#39;</span>,
</span></span><span style="display:flex;"><span>    tfidf_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;lowercase&#39;</span>: <span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;dtype&#39;</span>: np<span style="color:#f92672">.</span>float32,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    tfidf_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;ngram_range&#39;</span>: [(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>), (<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>)],
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    logreg_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_iter&#39;</span>: <span style="color:#ae81ff">1000</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;random_state&#39;</span>: <span style="color:#ae81ff">42</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;class_weight&#39;</span>: <span style="color:#e6db74">&#39;balanced&#39;</span>,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    logreg_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;C&#39;</span>: [<span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">10</span>],
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>The sweep produces the following results:</p>
<pre><code>Training on Congress 115
Best score: 0.704
Refit Time: 32.063
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6809188275881766
Testing on Congress 117 F1: 0.601917336933838

Training on Congress 116
Best score: 0.714
Refit Time: 31.227
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.7408989977276476
Testing on Congress 117 F1: 0.7200639105208106

Training on Congress 117
Best score: 0.711
Refit Time: 34.083
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.674418393892329
Testing on Congress 116 F1: 0.7405934743144291

Mean fit time: 32.46 ± 1.20s
</code></pre>
<h4 id="results-summary-1">Results Summary</h4>
<p>Logistic regression improves upon Naive Bayes performance:</p>
<ul>
<li><strong>Higher F1 scores</strong>: Roughly 4-6 percentage points better than Naive Bayes on every train/test pair</li>
<li><strong>Consistent hyperparameters</strong>: Optimal settings remain stable across Congresses</li>
<li><strong>Reasonable training time</strong>: 30-35 seconds per model remains manageable</li>
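<li><strong>Better cross-Congress generalization</strong>: F1 ranges from 0.60 to 0.74 and exceeds 0.70 for three of the six train/test pairs (both test sets when training on the 116th Congress, and 117 to 116)</li>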
</ul>
<figure class="post-figure center ">
    <img src="/img/logreg_title_policy_area/f1s.webp"
         alt="Logistic Regression Policy Area Classification F1 Score"
         title="Logistic Regression Policy Area Classification F1 Score"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Logistic Regression Policy Area Classification F1 Score</figcaption>
    
</figure>

<h4 id="xgboost-1">XGBoost</h4>
<p>Title-only XGBoost experiments are run with the following settings:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>sweep_xgb(
</span></span><span style="display:flex;"><span>    data,
</span></span><span style="display:flex;"><span>    X_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;title&#39;</span>,
</span></span><span style="display:flex;"><span>    y_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;policy_area&#39;</span>,
</span></span><span style="display:flex;"><span>    tfidf_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>,),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    xgb_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_depth&#39;</span>: (<span style="color:#ae81ff">6</span>,),
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;eta&#39;</span>: (<span style="color:#ae81ff">0.3</span>,),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>The sweep produces the following results:</p>
<pre><code>Training on Congress 115
Best score: 0.591
Refit Time: 198.063
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 116 F1: 0.5649530686141018
Testing on Congress 117 F1: 0.5215939580735101

Training on Congress 116
Best score: 0.600
Refit Time: 264.824
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.6037922738570368
Testing on Congress 117 F1: 0.5965027418245722

Training on Congress 117
Best score: 0.595
Refit Time: 249.799
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.5600491477899472
Testing on Congress 116 F1: 0.60815381664894

Mean fit time: 237.56 ± 28.60s
</code></pre>
<h4 id="results-summary-2">Results Summary</h4>
<p>XGBoost underperforms relative to expectations:</p>
<ul>
<li><strong>Poor performance</strong>: Cross-Congress F1 of 0.52-0.61, below both linear models on every train/test pair</li>
<li><strong>Long training times</strong>: Roughly 3-4.5 minutes per refit, even with a single hyperparameter configuration</li>
<li><strong>Questionable value</strong>: The results don&rsquo;t justify the computational cost</li>
</ul>
<p>Given these results, we focus on the more promising linear models for subsequent experiments with longer text inputs.</p>
<figure class="post-figure center ">
    <img src="/img/xgb_title_policy_area/f1s.webp"
         alt="XGBoost Policy Area Classification F1 Score"
         title="XGBoost Policy Area Classification F1 Score"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">XGBoost Policy Area Classification F1 Score</figcaption>
    
</figure>

<h4 id="training-efficiency">Training Efficiency</h4>
<p>Mean refit times differ by more than two orders of magnitude across models:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Mean Refit Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Naive Bayes</td>
          <td>0.54 $\pm$ 0.03s</td>
      </tr>
      <tr>
          <td>Logistic Regression</td>
          <td>32.46 $\pm$ 1.20s</td>
      </tr>
      <tr>
          <td>XGBoost</td>
          <td>237.56 $\pm$ 28.60s</td>
      </tr>
  </tbody>
</table>
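<p>The table entries can be reproduced from the three <code>Refit Time</code> values in each log. The sketch below assumes the spread is a population standard deviation (the convention that matches the reported numbers); the dictionary layout is illustrative.</p>

```python
from statistics import fmean, pstdev

# Refit times in seconds, copied from the three title-only logs.
refit_times = {
    "Naive Bayes": [0.570, 0.499, 0.565],
    "Logistic Regression": [32.063, 31.227, 34.083],
    "XGBoost": [198.063, 264.824, 249.799],
}

summary = {
    model: f"{fmean(times):.2f} +/- {pstdev(times):.2f}s"
    for model, times in refit_times.items()
}
for model, text in summary.items():
    print(f"{model}: {text}")

# Ratio of the slowest to the fastest mean refit time.
slowdown = fmean(refit_times["XGBoost"]) / fmean(refit_times["Naive Bayes"])
print(round(slowdown))
```

<p>On these numbers, an XGBoost refit takes several hundred times longer than a Naive Bayes refit while scoring lower.</p>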
<p>XGBoost&rsquo;s poor performance despite its high computational cost suggests that tree-based methods are a poor fit for sparse TF-IDF features, at least with the minimal tuning used here. A likely explanation is dimensionality: each tree split tests a single term, and in a sparse, high-dimensional bag-of-words space most terms are absent from most documents, so a tree needs many splits to use evidence that a linear model combines in one weighted sum over all features. We&rsquo;ll focus on linear models for the remaining experiments.</p>
<h3 id="summary-only-results">Summary-Only Results</h3>
<p>Using bill summaries provides substantially more context than titles alone. For logistic regression, cross-validated F1 rises from about 0.71 on titles to about 0.86 on summaries.</p>
<h4 id="naive-bayes-performance">Naive Bayes Performance</h4>
<p>The summary-based models show dramatic improvement over title-only versions:</p>
<ul>
<li><strong>F1 scores</strong>: 0.85+ within-Congress, 0.77-0.86 cross-Congress</li>
<li><strong>Training time</strong>: Still fast at ~3.4 seconds</li>
<li><strong>Strong generalization</strong>: Consistent performance across time periods</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/nb_summary_policy_area/f1s.webp"
         alt="Naive Bayes Summary Performance"
         title="Naive Bayes Summary Performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Summary-based models achieve F1 scores of 0.80 or higher across most Congress combinations</figcaption>
    
</figure>

<h4 id="logistic-regression-performance">Logistic Regression Performance</h4>
<p>Logistic regression slightly outperforms Naive Bayes on summaries:</p>
<ul>
<li><strong>F1 scores</strong>: 0.86+ within-Congress, 0.79-0.87 cross-Congress</li>
<li><strong>Training time</strong>: Reasonable at ~12 seconds</li>
<li><strong>Stable hyperparameters</strong>: Consistent optimal settings across Congresses</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/logreg_summary_policy_area/f1s.webp"
         alt="Logistic Regression Summary Performance"
         title="Logistic Regression Summary Performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Logistic regression maintains a slight performance advantage over Naive Bayes</figcaption>
    
</figure>

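<p>The small performance gap between the two models suggests that they rely on similar lexical cues. Logistic regression&rsquo;s edge most likely comes from fitting term weights discriminatively, without Naive Bayes&rsquo;s conditional-independence assumption, since a linear model on TF-IDF features does not model feature interactions on its own.</p>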
<h4 id="logistic-regression-2">Logistic Regression</h4>
<p>Summary-only Logistic Regression experiments are run with the following settings:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>sweep_logreg(
</span></span><span style="display:flex;"><span>    data,
</span></span><span style="display:flex;"><span>    X_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;summary&#39;</span>,
</span></span><span style="display:flex;"><span>    y_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;policy_area&#39;</span>,
</span></span><span style="display:flex;"><span>    tfidf_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;lowercase&#39;</span>: <span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;dtype&#39;</span>: np<span style="color:#f92672">.</span>float32,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    tfidf_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># &#39;ngram_range&#39;: [(1, 1), (1, 2)],</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_df&#39;</span>: (<span style="color:#ae81ff">0.05</span>, <span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.25</span>),
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    logreg_params<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;max_iter&#39;</span>: <span style="color:#ae81ff">1000</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;random_state&#39;</span>: <span style="color:#ae81ff">42</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;class_weight&#39;</span>: <span style="color:#e6db74">&#39;balanced&#39;</span>,
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    logreg_grid<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#39;C&#39;</span>: [<span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">10</span>],
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>And the results:</p>
<pre><code>Training on Congress 115
Best score: 0.862
Refit Time: 9.007
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 116 F1: 0.8284864693401133
Testing on Congress 117 F1: 0.7934161507811646

Training on Congress 116
Best score: 0.865
Refit Time: 13.897
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8637852557418315
Testing on Congress 117 F1: 0.8594775615031977

Training on Congress 117
Best score: 0.862
Refit Time: 12.167
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8355736563084967
Testing on Congress 116 F1: 0.8696403838390832

Mean fit time: 11.69 ± 2.02s
</code></pre>
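<p>The source of <code>sweep_logreg</code> is not shown in this post. The <code>tfidf__max_df</code> and <code>clf__C</code> keys in the log suggest a scikit-learn <code>Pipeline</code> with steps named <code>tfidf</code> and <code>clf</code>, searched with <code>GridSearchCV</code>. A minimal sketch under that assumption follows; the name <code>sweep_logreg_sketch</code>, the <code>f1_weighted</code> scoring, and the <code>cv</code> default are illustrative guesses and may differ from the helper that produced the numbers above.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def sweep_logreg_sketch(texts, labels, tfidf_params, tfidf_grid,
                        logreg_params, logreg_grid, cv=3):
    """Hypothetical stand-in for sweep_logreg: grid-search TF-IDF + LogReg."""
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(**tfidf_params)),
        ('clf', LogisticRegression(**logreg_params)),
    ])
    # Prefix each grid key with its step name, e.g. 'C' becomes 'clf__C'
    param_grid = {f'tfidf__{k}': list(v) for k, v in tfidf_grid.items()}
    param_grid.update({f'clf__{k}': list(v) for k, v in logreg_grid.items()})

    # Assumed metric: weighted F1; the original scoring choice is not shown
    search = GridSearchCV(pipeline, param_grid, scoring='f1_weighted', cv=cv)
    search.fit(texts, labels)
    return search
```

<p>The returned object exposes <code>best_params_</code> and <code>best_score_</code>, and its refit estimator can score bills from a different Congress through <code>search.predict</code>.</p>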















<figure class="post-figure center ">
    <img src="/img/logreg_summary_policy_area/f1s.webp"
         alt="Logistic Regression Policy Area Classification F1 Score"
         title="Logistic Regression Policy Area Classification F1 Score"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Logistic Regression Policy Area Classification F1 Score</figcaption>
    
</figure>

<h3 id="full-text-results">Full Text Results</h3>
<p>We test whether complete bill text improves performance over summaries, using optimal hyperparameters from summary experiments.</p>
<h4 id="naive-bayes-on-full-text">Naive Bayes on Full Text</h4>
<p>Surprisingly, full text yields slightly lower performance than summaries:</p>
<ul>
<li><strong>F1 scores</strong>: 0.84-0.85 within-Congress, 0.77-0.86 cross-Congress</li>
<li><strong>Training time</strong>: ~50 seconds (10x slower than summaries)</li>
<li><strong>Performance drop</strong>: Likely due to increased noise in lengthy documents</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/nb_text_policy_area/f1s.webp"
         alt="Naive Bayes Full Text Performance"
         title="Naive Bayes Full Text Performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Full text performance is slightly worse than summaries, suggesting diminishing returns</figcaption>
    
</figure>

<h4 id="logistic-regression-on-full-text">Logistic Regression on Full Text</h4>
<p>Logistic regression shows the strongest performance on full text:</p>
<ul>
<li><strong>F1 scores</strong>: 0.87-0.88 within-Congress, 0.83-0.89 cross-Congress</li>
<li><strong>Training time</strong>: ~70 seconds</li>
<li><strong>Best overall performance</strong>: Approaches 90% F1 on some Congress pairs</li>
</ul>















<figure class="post-figure center ">
    <img src="/img/logreg_text_policy_area/f1s.webp"
         alt="Logistic Regression Full Text Performance"
         title="Logistic Regression Full Text Performance"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Logistic regression achieves the best performance using full bill text</figcaption>
    
</figure>

<p>The logistic regression model benefits from having access to complete legislative language while effectively regularizing against noise.</p>
<h2 id="key-findings">Key Findings</h2>
<p>This baseline study establishes several important results:</p>
<p><strong>Best performing model</strong>: Logistic regression trained on full bill text achieves up to 89% F1 score, providing a strong benchmark for future deep learning approaches.</p>
<p><strong>Text input comparison</strong>:</p>
<ul>
<li>Titles: Limited but fast (F1 ~0.65-0.70)</li>
<li>Summaries: Good balance of performance and efficiency (F1 ~0.85)</li>
<li>Full text: Best performance but computationally expensive (F1 ~0.87-0.89)</li>
</ul>
<p><strong>Cross-Congress generalization</strong>: Models trained on one Congress generalize reasonably well to others, though performance decreases with temporal distance between sessions.</p>
<p><strong>Model performance ranking</strong>: Logistic Regression &gt; Naive Bayes &raquo; XGBoost for this text classification task.</p>
<h2 id="next-steps">Next Steps</h2>
<p>The strong baseline performance sets the stage for several research directions:</p>
<ol>
<li><strong>Deep learning models</strong>: Transformer-based approaches using pre-trained language models</li>
<li><strong>Dataset expansion</strong>: Including additional Congresses and more detailed bill metadata</li>
<li><strong>Error analysis</strong>: Understanding failure cases and class-specific performance patterns</li>
<li><strong>Feature engineering</strong>: Exploring domain-specific text preprocessing and feature extraction</li>
</ol>
<p>The complete dataset and experimental code are available for researchers interested in building upon these baselines.</p>
<p><strong>Resources</strong>:</p>
<ul>
<li>Dataset: <a href="https://huggingface.co/datasets/hheiden/us-congress-bill-policy-115_117">Hugging Face: hheiden/us-congress-bill-policy-115_117</a></li>
<li>Leaderboard: <a href="/leaderboards/policy_area_classification_leaderboard/">Policy Area Classification Leaderboard</a></li>
<li>Project: <a href="/projects/congressional-data-analysis/">Congressional Knowledge Graph &amp; Policy Classification</a></li>
</ul>
]]></content:encoded></item><item><title>Congressional Knowledge Graph &amp; Policy Classification</title><link>https://hunterheidenreich.com/projects/congressional-data-analysis/</link><pubDate>Wed, 01 Mar 2023 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/congressional-data-analysis/</guid><description>A 47,000+ bill knowledge graph from Congress.gov with sponsor networks and 87% policy classification accuracy.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>A computational social science project that constructed a dataset of 47,000+ US congressional bills by extracting legislative text and metadata from the 115th-117th Congresses. The project creates a novel &ldquo;legislative graph&rdquo;
(linking sponsors, committees, and bill text) and establishes a machine learning benchmark for policy area classification (87% accuracy), now hosted on Hugging Face to support reproducible political science research.</p>
<h2 id="features">Features</h2>
<h3 id="intelligent-data-acquisition">Intelligent Data Acquisition</h3>
<p>Standard APIs impose strict rate limits. I engineered a custom Selenium-based extraction engine to handle Congress.gov&rsquo;s complex DOM structures.</p>
<ul>
<li><strong>Optimization</strong>: Targeted aggregate endpoints (e.g., <code>/all-info</code>) to reduce HTTP request volume by ~90% per bill.</li>
<li><strong>Resilience</strong>: Implemented a local caching layer to store raw HTML, separating the fetch step from the parse step. This made parsing repeatable from the cached pages and minimized server load during iterative development.</li>
<li><strong>Graph construction</strong>: Beyond simple text, the script extracts relational data including co-sponsorship networks, committee assignments, and related bill lineage.</li>
</ul>
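<p>The fetch/parse separation described above can be sketched as follows. This is not the project's actual code: <code>cached_fetch</code>, the SHA-256 file naming, and the injected <code>fetch</code> callable are illustrative assumptions.</p>

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return the raw HTML for `url`, calling `fetch(url)` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length filename
    key = hashlib.sha256(url.encode('utf-8')).hexdigest()
    path = cache_dir / f'{key}.html'
    if path.exists():
        return path.read_text(encoding='utf-8')
    html = fetch(url)  # e.g. a Selenium-based downloader (hypothetical)
    path.write_text(html, encoding='utf-8')
    return html
```

<p>Because parsing only ever reads from the cache, the parser can be rerun and debugged without sending another request to the server.</p>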
<h3 id="natural-language-processing">Natural Language Processing</h3>
<ul>
<li><strong>Corpus construction</strong>: Cleaned and normalized legislative text, removing procedural artifacts (e.g., &ldquo;A BILL TO&hellip;&rdquo;) to isolate semantic policy content.</li>
<li><strong>Feature engineering</strong>: Utilized TF-IDF vectorization with N-gram analysis to capture legislative jargon.</li>
<li><strong>Modeling</strong>: Benchmarked Naive Bayes, Logistic Regression, and SVMs, achieving 87.3% accuracy on policy area prediction (cross-validated).</li>
</ul>
<h2 id="usage">Usage</h2>
<p>The dataset is available on Hugging Face and can be loaded directly via the <code>datasets</code> library. The scraper can be run locally to fetch new bills.</p>
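<p>A minimal loading sketch, assuming the Hugging Face dataset ID linked from the classification post (<code>hheiden/us-congress-bill-policy-115_117</code>). The wrapper name <code>load_bills</code> is hypothetical, and the split and column names are not listed here, so inspect the returned object before relying on them.</p>

```python
DATASET_ID = 'hheiden/us-congress-bill-policy-115_117'


def load_bills(dataset_id=DATASET_ID, **kwargs):
    """Download (or reuse the local cache of) the bill dataset from the Hub."""
    # Imported lazily so this module can be imported without the dependency
    from datasets import load_dataset
    return load_dataset(dataset_id, **kwargs)
```

<p>Calling <code>load_bills()</code> and printing the result shows the available splits and columns. If the repository defines several configurations, a configuration name can be passed through <code>kwargs</code> as <code>name=...</code>.</p>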
<h2 id="results">Results</h2>
<ul>
<li><strong>The &ldquo;partisan vocabulary&rdquo;</strong>: Feature importance analysis revealed distinct linguistic markers separating Democratic and Republican legislation, identifiable even without metadata.</li>
<li><strong>Temporal drift</strong>: Policy priorities and terminology showed measurable shifts across congressional sessions (115th vs 117th).</li>
<li><strong>Classification success</strong>: Simple linear models (SVM/LogReg) proved remarkably effective at distinguishing policy domains, suggesting legislative language is highly structured.</li>
</ul>
<h2 id="impact--deliverables">Impact &amp; Deliverables</h2>
<ul>
<li><strong>Hugging Face dataset</strong>: Released the first machine-readable, ML-ready dataset of modern bills, democratizing access for researchers.</li>
<li><strong>Open source tooling</strong>: Published the scraper and parsing logic to allow others to extend the dataset to future congresses.</li>
<li><strong>Academic benchmark</strong>: Established a clear baseline for &ldquo;Government NLP&rdquo; tasks, aiding in the automated transparency and monitoring of new legislation.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/posts/us-117th-congress-data-exploration/">117th Congress Data Exploration</a></li>
<li><a href="/posts/congressional-bill-policy-area-classification/">Congressional Bill Policy Area Classification</a></li>
</ul>
]]></content:encoded></item><item><title>Word Embeddings in NLP: An Introduction</title><link>https://hunterheidenreich.com/posts/intro-to-word-embeddings/</link><pubDate>Sun, 05 Aug 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/intro-to-word-embeddings/</guid><description>Learn about word embeddings in NLP: from basic one-hot encoding to contextual models like ELMo. Guide with examples.</description><content:encoded><![CDATA[<h2 id="understanding-word-embeddings">Understanding Word Embeddings</h2>
<p>A word embedding maps words to real-valued vectors:</p>
<p>$$
\text{word} \rightarrow \mathbb{R}^n
$$</p>
<p>where $n$ represents the dimensionality of the embedding space.</p>
<p>The goal is simple: position semantically similar words close together in vector space. This dense representation typically uses a few hundred dimensions, a massive reduction from the vocabulary-sized vectors (tens of thousands to millions of dimensions) required by one-hot encoding.</p>
<p>Word embeddings are grounded in <a href="https://en.wikipedia.org/wiki/Distributional_semantics">Zellig Harris&rsquo; distributional hypothesis</a>: words appearing in similar contexts tend to have similar meanings. This forms the foundation of distributional semantics.</p>















<figure class="post-figure center ">
    <img src="/img/distributional_semantics-50.webp"
         alt="Distributional semantics visualization"
         title="Distributional semantics visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Words embedded in three-dimensional space, organized by semantic similarity</figcaption>
    
</figure>

<p>Different embedding algorithms capture various aspects of this distributional principle. This post explores the main methods for creating word embeddings and their applications in natural language processing.</p>
<p>While modern foundation models and terabyte-scale Vision-Language Models (VLMs) rely on advanced subword tokenizers (like BPE) and massive Transformer embedding layers, the fundamental goal remains exactly the same: mapping discrete text to a continuous vector space where math can capture meaning. Understanding these foundational techniques provides the necessary intuition for debugging and scaling today&rsquo;s production ML systems.</p>
<h2 id="why-word-embeddings-matter-in-nlp">Why Word Embeddings Matter in NLP</h2>
<p>Computers require numerical representations to apply machine learning algorithms to text. Word embeddings bridge this gap by converting text into dense vectors that preserve semantic and syntactic relationships.</p>
<p><strong>Key advantages:</strong></p>
<ol>
<li><strong>Dense representation</strong>: Hundreds of dimensions provide a compact alternative to vocabulary-sized sparse vectors.</li>
<li><strong>Semantic preservation</strong>: Similar words cluster together in vector space.</li>
<li><strong>Mathematical operations</strong>: Enable analogical reasoning ($\text{king} - \text{man} + \text{woman} \approx \text{queen}$).</li>
<li><strong>Transfer learning</strong>: Pre-trained embeddings work across multiple tasks and domains.</li>
</ol>
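<p>To make the similarity and analogy claims concrete, here is a toy sketch. The three-dimensional vectors are hand-made for illustration (axes loosely meaning royalty, male, female) and are not learned embeddings; real models learn hundreds of dimensions with no such clean interpretation.</p>

```python
import numpy as np

# Hand-made 3-d vectors with axes (royalty, male, female); not learned from data
toy_vectors = {
    'king':  np.array([1.0, 1.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0]),
    'man':   np.array([0.0, 1.0, 0.0]),
    'woman': np.array([0.0, 0.0, 1.0]),
    'apple': np.array([0.1, 0.1, 0.1]),
}


def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy('king', 'man', 'woman', toy_vectors))  # queen
```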
<p>Modern deep learning architectures leverage these properties extensively. The development of universal, pre-trained embeddings was a significant step forward. We can use versatile embeddings that generalize across applications, eliminating the need to train task-specific representations from scratch.</p>
<h2 id="word-embedding-approaches">Word Embedding Approaches</h2>
<h3 id="one-hot-encoding-and-count-vectorization">One-Hot Encoding and Count Vectorization</h3>
<p>One-hot encoding represents the simplest approach to word vectorization. Each word gets a unique dimension in a vocabulary-sized vector, marked with 1 for presence and 0 elsewhere. Count vectorization extends this by counting the occurrences of each word in a document.</p>















<figure class="post-figure center ">
    <img src="/img/word_vector_onehot-50.webp"
         alt="One-hot encoding visualization"
         title="One-hot encoding visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">One-hot encoding creates sparse vectors with single active dimensions</figcaption>
    
</figure>

<p><strong>Characteristics:</strong></p>
<ul>
<li><strong>High dimensionality</strong>: Vector length equals vocabulary size.</li>
<li><strong>Extreme sparsity</strong>: Most dimensions contain zeros.</li>
<li><strong>No relationships</strong>: Treats all words as equally distant.</li>
<li><strong>Simplicity</strong>: Easy to implement and understand.</li>
</ul>
<p>While lacking semantic information, count vectorization serves as a foundation for more complex methods. Let&rsquo;s look at a practical implementation using scikit-learn&rsquo;s <code>CountVectorizer</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> CountVectorizer
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize the vectorizer</span>
</span></span><span style="display:flex;"><span>vectorizer <span style="color:#f92672">=</span> CountVectorizer()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Sample text for demonstration</span>
</span></span><span style="display:flex;"><span>sample_text <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;One of the most basic ways we can numerically represent words &#34;</span>
</span></span><span style="display:flex;"><span>               <span style="color:#e6db74">&#34;is through the one-hot encoding method (also sometimes called &#34;</span>
</span></span><span style="display:flex;"><span>               <span style="color:#e6db74">&#34;count vectorizing).&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Fit the vectorizer to our text data</span>
</span></span><span style="display:flex;"><span>vectorizer<span style="color:#f92672">.</span>fit(sample_text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Examine the vocabulary and word indices</span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#39;Vocabulary:&#39;</span>)
</span></span><span style="display:flex;"><span>print(vectorizer<span style="color:#f92672">.</span>vocabulary_)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Transform text to vectors</span>
</span></span><span style="display:flex;"><span>vector <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>transform(sample_text)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#39;Full vector:&#39;</span>)
</span></span><span style="display:flex;"><span>print(vector<span style="color:#f92672">.</span>toarray())
</span></span></code></pre></div><p>In a production environment, count vectorization introduces significant engineering challenges. When processing millions of documents, the vocabulary size explodes. Storing and computing on these massive sparse matrices quickly leads to memory exhaustion. In these scaling scenarios, practitioners often turn to the <strong>Hashing Trick</strong> (via <code>HashingVectorizer</code>) to bound the dimensionality, or they move entirely to the dense embeddings discussed later in this post.</p>
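<p>A minimal sketch of the hashing trick mentioned above, using scikit-learn&rsquo;s <code>HashingVectorizer</code>. The toy documents and the choice of <code>n_features=2**10</code> are arbitrary assumptions for illustration; production systems usually pick a much larger width to limit hash collisions.</p>

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    'the senate passed the bill',
    'the house amended the bill',
    'a brand new word appears here',
]

# n_features fixes the output width up front; 2**10 is an arbitrary small choice
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)

# Stateless: no fit step and no stored vocabulary, so unseen words need no refit
X = hasher.transform(docs)
print(X.shape)  # (3, 1024)
```

<p>The trade-off is that distinct words can collide in the same column, and columns cannot be mapped back to words for inspection.</p>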
<p>We can see count vectorization in action with a real dataset, building a simple text classifier for the <a href="https://www.kaggle.com/datasets/crawford/20-newsgroups">20 Newsgroups dataset</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.datasets <span style="color:#f92672">import</span> fetch_20newsgroups
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> CountVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.naive_bayes <span style="color:#f92672">import</span> MultinomialNB
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn <span style="color:#f92672">import</span> metrics
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load train and test splits, removing metadata for a cleaner signal</span>
</span></span><span style="display:flex;"><span>newsgroups_train <span style="color:#f92672">=</span> fetch_20newsgroups(subset<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;train&#39;</span>,
</span></span><span style="display:flex;"><span>                                      remove<span style="color:#f92672">=</span>(<span style="color:#e6db74">&#39;headers&#39;</span>, <span style="color:#e6db74">&#39;footers&#39;</span>, <span style="color:#e6db74">&#39;quotes&#39;</span>))
</span></span><span style="display:flex;"><span>newsgroups_test <span style="color:#f92672">=</span> fetch_20newsgroups(subset<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;test&#39;</span>,
</span></span><span style="display:flex;"><span>                                     remove<span style="color:#f92672">=</span>(<span style="color:#e6db74">&#39;headers&#39;</span>, <span style="color:#e6db74">&#39;footers&#39;</span>, <span style="color:#e6db74">&#39;quotes&#39;</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize and fit vectorizer on training data</span>
</span></span><span style="display:flex;"><span>vectorizer <span style="color:#f92672">=</span> CountVectorizer()
</span></span><span style="display:flex;"><span>X_train <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>fit_transform(newsgroups_train<span style="color:#f92672">.</span>data)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Build and train classifier</span>
</span></span><span style="display:flex;"><span>classifier <span style="color:#f92672">=</span> MultinomialNB(alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.01</span>)
</span></span><span style="display:flex;"><span>classifier<span style="color:#f92672">.</span>fit(X_train, newsgroups_train<span style="color:#f92672">.</span>target)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Transform test data and make predictions</span>
</span></span><span style="display:flex;"><span>X_test <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>transform(newsgroups_test<span style="color:#f92672">.</span>data)
</span></span><span style="display:flex;"><span>y_pred <span style="color:#f92672">=</span> classifier<span style="color:#f92672">.</span>predict(X_test)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Evaluate performance</span>
</span></span><span style="display:flex;"><span>accuracy <span style="color:#f92672">=</span> metrics<span style="color:#f92672">.</span>accuracy_score(newsgroups_test<span style="color:#f92672">.</span>target, y_pred)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;Accuracy: </span><span style="color:#e6db74">{</span>accuracy<span style="color:#e6db74">:</span><span style="color:#e6db74">.3f</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span></code></pre></div><p>This provides a solid baseline. To capture actual semantic meaning and reduce dimensionality, we must move beyond simple counting.</p>
<h3 id="tf-idf-term-frequency-inverse-document-frequency">TF-IDF (Term Frequency-Inverse Document Frequency)</h3>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TF-IDF</a> extends count vectorization by weighting terms based on their importance across a document collection. TF-IDF combines:</p>
<ul>
<li><strong>Term Frequency (TF)</strong>: How often a word appears in a document</li>
<li><strong>Inverse Document Frequency (IDF)</strong>: How rare a word is across all documents</li>
</ul>
<p>This weighting scheme reduces the impact of common words (like &ldquo;the&rdquo; or &ldquo;and&rdquo;) while emphasizing distinctive terms that appear frequently in specific documents but rarely elsewhere.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Captures document-level importance</li>
<li>Reduces impact of stop words</li>
<li>Effective for information retrieval tasks</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>Still high-dimensional and sparse</li>
<li>No semantic relationships between terms</li>
<li>Context-independent representation</li>
</ul>
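<p>A small sketch of the weighting effect, using scikit-learn&rsquo;s <code>TfidfVectorizer</code> with its default settings on a made-up three-document corpus. The word &ldquo;the&rdquo; appears in every document, so it should receive a lower weight than words unique to one document.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat', 'the dog barked', 'the bird sang']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()

# Look up the weights for the first document by column index
vocab = vectorizer.vocabulary_
weights = {word: X[0, vocab[word]] for word in ('the', 'cat', 'sat')}
print(weights)
```

<p>In the first document each word occurs once, so the difference in weight comes entirely from the inverse document frequency term.</p>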
<h3 id="co-occurrence-matrices">Co-Occurrence Matrices</h3>
<p>Co-occurrence matrices capture word relationships by recording which terms appear together within defined contexts (sentences, paragraphs, or fixed windows). For a vocabulary of $V$ words, the resulting matrix has $V \times V$ entries, each recording how often a pair of words co-occurs.</p>















<figure class="post-figure center ">
    <img src="/img/Word_co-occurrence_network_%28range_3_words%29_-_ENG-50.webp"
         alt="Co-occurrence network visualization"
         title="Co-occurrence network visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Co-occurrence relationships within a three-word window</figcaption>
    
</figure>

<p><strong>Key properties:</strong></p>
<ul>
<li><strong>Global statistics</strong>: Captures corpus-wide word relationships</li>
<li><strong>Symmetric relationships</strong>: Mutual co-occurrence patterns</li>
<li><strong>Extreme dimensionality</strong>: Vocabulary size squared creates storage challenges</li>
<li><strong>Sparse representation</strong>: Most word pairs never co-occur</li>
</ul>
<p>While computationally expensive to store and process, co-occurrence matrices form the foundation for advanced methods like GloVe that compress this information into dense representations.</p>
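<p>A minimal sketch of windowed co-occurrence counting. The function name <code>cooccurrence_counts</code> and the example sentence are illustrative; storing only the pairs that actually occur (here in a <code>Counter</code>) is one simple way to avoid materializing the full matrix.</p>

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count symmetric co-occurrences of token pairs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only rightward, then record both orders to keep the counts symmetric
        for neighbor in tokens[i + 1:i + 1 + window]:
            counts[(word, neighbor)] += 1
            counts[(neighbor, word)] += 1
    return counts


tokens = 'the cat sat on the mat'.split()
counts = cooccurrence_counts(tokens, window=1)
print(counts[('the', 'cat')], counts[('cat', 'the')])  # 1 1
```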
<h2 id="neural-network-based-embeddings">Neural Network-Based Embeddings</h2>
<h3 id="neural-probabilistic-language-models">Neural Probabilistic Language Models</h3>
<p><a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">Neural probabilistic models</a> pioneered the use of neural networks for learning word embeddings. These models learn dense representations as a byproduct of language modeling, predicting the next word in a sequence.</p>















<figure class="post-figure center ">
    <img src="/img/bengio-npm-50.webp"
         alt="Neural probabilistic model diagram"
         title="Neural probabilistic model diagram"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Architecture of neural probabilistic language models</figcaption>
    
</figure>

<p><strong>Training process:</strong></p>
<ol>
<li>Initialize random dense embeddings for each vocabulary word</li>
<li>Use embeddings as inputs to predict language modeling objectives</li>
<li>Update embeddings through backpropagation based on prediction errors</li>
<li>Resulting embeddings capture patterns useful for the training task</li>
</ol>
<p>This approach demonstrated that task-specific embeddings could be learned jointly with model objectives, establishing the foundation for modern embedding methods.</p>
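<p>The training loop above can be sketched in plain NumPy. This is a deliberate simplification, not the architecture from the paper: it conditions on a single previous word and omits the hidden layer, and the function name and hyperparameters are illustrative assumptions. It still shows the core idea that embeddings start random and are updated by backpropagating next-word prediction error.</p>

```python
import numpy as np


def train_tiny_lm(token_ids, vocab_size, dim=8, lr=0.5, epochs=300, seed=0):
    """Learn embeddings by predicting each token from the one before it."""
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(vocab_size, dim))  # step 1: random embeddings
    W = rng.normal(scale=0.1, size=(dim, vocab_size))  # output projection
    contexts = np.array(token_ids[:-1])
    targets = np.array(token_ids[1:])
    rows = np.arange(len(targets))
    losses = []
    for _ in range(epochs):
        h = E[contexts]                                # step 2: embeddings as inputs
        logits = h @ W
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)      # softmax over the vocabulary
        losses.append(float(-np.log(probs[rows, targets]).mean()))
        grad_logits = probs.copy()                     # step 3: backpropagate the error
        grad_logits[rows, targets] -= 1.0
        grad_logits /= len(targets)
        grad_W = h.T @ grad_logits
        grad_h = grad_logits @ W.T
        W -= lr * grad_W
        np.add.at(E, contexts, -lr * grad_h)           # accumulate over repeated words
    return E, losses


words = 'the cat sat on the mat the cat sat on the rug'.split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
E, losses = train_tiny_lm([vocab[w] for w in words], vocab_size=len(vocab))
```

<p>After training, the rows of <code>E</code> are the learned embeddings, and <code>losses</code> records the average cross-entropy per epoch.</p>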
<h3 id="word2vec">Word2Vec</h3>
<p><a href="https://code.google.com/archive/p/word2vec/">Word2Vec</a> revolutionized word embeddings by introducing efficient training algorithms for massive corpora. It popularized vector arithmetic over embeddings, enabling analogical reasoning like the famous &ldquo;$\text{king} - \text{man} + \text{woman} \approx \text{queen}$&rdquo; example.</p>















<figure class="post-figure center ">
    <img src="/img/Word_vector_illustration.webp"
         alt="Word2Vec vector arithmetic visualization"
         title="Word2Vec vector arithmetic visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Word2Vec demonstrates analogical relationships through vector arithmetic</figcaption>
    
</figure>

<p><strong>Two training architectures:</strong></p>
<h4 id="continuous-bag-of-words-cbow">Continuous Bag-of-Words (CBOW)</h4>
<p>Predicts target words from surrounding context words. Given a window of context words, the model learns to predict the central word.</p>
<h4 id="skip-gram">Skip-Gram</h4>
<p>Predicts context words from target words. Given a central word, the model learns to predict surrounding words within a defined window.</p>
<p><strong>Key advantages:</strong></p>
<ul>
<li><strong>Computational efficiency</strong>: Much faster than neural probabilistic models</li>
<li><strong>Scalable training</strong>: Can process billion-word corpora effectively</li>
<li><strong>Quality embeddings</strong>: Captures semantic and syntactic relationships</li>
<li><strong>Flexible context</strong>: Window size controls topical vs. functional similarity</li>
</ul>
<p>The choice of window size significantly impacts learned relationships. Larger windows capture topical associations, while smaller windows focus on syntactic and functional similarities.</p>
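<p>The difference between the two architectures is easiest to see in the training examples they generate. The sketch below only builds those examples; it does not train a model, and the function names are illustrative.</p>

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbors jointly predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ['a', 'b', 'c']
print(skipgram_pairs(tokens, window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
print(cbow_examples(tokens, window=1))
# [(['b'], 'a'), (['a', 'c'], 'b'), (['b'], 'c')]
```

<p>Increasing <code>window</code> adds more distant neighbors to each example, which is the mechanism behind the topical versus functional trade-off.</p>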
<h3 id="glove-global-vectors">GloVe (Global Vectors)</h3>
<p><a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> combines the best aspects of matrix factorization methods (which capture global corpus statistics) and local context window approaches like Word2Vec. Matrix factorization methods excel at global patterns but struggle with analogical reasoning, while Word2Vec captures local relationships but may miss global structure.</p>
<p><strong>Key innovation:</strong>
GloVe trains on a global word-context co-occurrence matrix, incorporating corpus-wide statistical information while maintaining the analogical reasoning capabilities that made Word2Vec successful.</p>
<p><strong>Advantages over Word2Vec:</strong></p>
<ul>
<li><strong>Global optimization</strong>: Leverages entire corpus statistics</li>
<li><strong>Better performance</strong>: Often outperforms Word2Vec on word similarity and analogy tasks</li>
<li><strong>Stable training</strong>: More consistent convergence due to global objective function</li>
</ul>
<p>The result is embeddings that capture both local syntactic patterns and global semantic relationships more effectively.</p>
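<p>For reference, the GloVe objective is a weighted least-squares fit of word-vector dot products to log co-occurrence counts. The sketch below evaluates that loss in NumPy; it uses the weighting function and the defaults <code>x_max=100</code> and <code>alpha=0.75</code> from the GloVe paper, while the function names and the dense-matrix input are simplifying assumptions (real implementations iterate over sparse nonzero entries).</p>

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps frequent ones at 1."""
    return np.minimum(1.0, (x / x_max) ** alpha)


def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least-squares loss over the nonzero entries of co-occurrence matrix X."""
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))
```

<p>Only nonzero counts contribute, which is what keeps training tractable despite the size of the full co-occurrence matrix.</p>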
<h2 id="contextual-embedding-methods">Subword and Hyperbolic Embedding Methods</h2>
<h3 id="fasttext">FastText</h3>
<p><a href="https://github.com/facebookresearch/fastText">FastText</a> addresses a critical limitation of previous methods: handling out-of-vocabulary (OOV) words. By incorporating subword information, FastText can generate meaningful representations for previously unseen words.</p>
<p><strong>Subword approach:</strong></p>
<ul>
<li>Decomposes words into character n-grams (typically 3-6 characters)</li>
<li>Represents words as sums of their component n-grams</li>
<li>Trains using skip-gram objective with negative sampling</li>
</ul>
<p><strong>Key advantages:</strong></p>
<ul>
<li><strong>OOV handling</strong>: Can embed unseen words using known subword components</li>
<li><strong>Morphological awareness</strong>: Captures relationships between related word forms</li>
<li><strong>Multilingual support</strong>: Facebook released pre-trained embeddings for 294 languages</li>
<li><strong>Robust performance</strong>: Particularly effective for morphologically rich languages</li>
</ul>
<p>For example, if the model knows &ldquo;navigate,&rdquo; it can provide a meaningful representation for &ldquo;circumnavigate&rdquo; by leveraging shared subword components, even if &ldquo;circumnavigate&rdquo; wasn&rsquo;t in the training data.</p>
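<p>A minimal sketch of the subword decomposition, assuming FastText-style angle-bracket boundary markers and the 3-6 character range listed above. The function name is illustrative, and the real library additionally keeps the whole word as its own token and hashes n-grams into a fixed number of buckets, which this sketch omits.</p>

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in FastText-style boundary markers."""
    wrapped = f'<{word}>'
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams


print(char_ngrams('where', n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# N-grams of the unseen word that a model trained on 'navigate' already knows
known = set(char_ngrams('navigate'))
unseen = char_ngrams('circumnavigate')
shared = [g for g in unseen if g in known]
```

<p>The vector for an unseen word is then built from the vectors of its n-grams, so the overlap in <code>shared</code> is what carries meaning over from &ldquo;navigate&rdquo;.</p>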
<h3 id="poincaré-embeddings">Poincaré Embeddings</h3>
<p><a href="https://radimrehurek.com/gensim/models/poincare.html">Poincaré embeddings</a> introduce a novel approach by learning representations in hyperbolic space. This geometric innovation specifically targets hierarchical relationships in data.</p>
<p><strong>Hyperbolic geometry advantages:</strong></p>
<ul>
<li><strong>Natural hierarchy encoding</strong>: Distance represents similarity, while norm encodes hierarchical level</li>
<li><strong>Efficient representation</strong>: Requires fewer dimensions for hierarchical data</li>
<li><strong>Mathematical elegance</strong>: Leverages properties of hyperbolic space for embedding optimization</li>
</ul>
<p><strong>Applications:</strong>
Particularly effective for data with inherent hierarchical structure, such as:</p>
<ul>
<li>WordNet taxonomies</li>
<li>Organizational charts</li>
<li>Computer network topologies</li>
<li>Knowledge graphs</li>
</ul>
<p>The <a href="https://arxiv.org/abs/1705.08039">original paper</a> reports reconstructing WordNet hierarchies with significantly lower dimensionality than comparable Euclidean embeddings.</p>
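<p>The distance used in the Poincaré ball model is $d(u, v) = \operatorname{arcosh}\left(1 + 2\frac{\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)$ for points with norm below 1. The sketch below implements just this formula (the function name is illustrative) to show why points near the boundary are far from everything: the denominator shrinks as the norm approaches 1.</p>

```python
import numpy as np


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))


origin = [0.0, 0.0]
print(poincare_distance(origin, [0.5, 0.0]))   # about 1.10
print(poincare_distance(origin, [0.95, 0.0]))  # about 3.66
```

<p>This is why general concepts tend to sit near the origin and specific ones near the boundary: space grows rapidly toward the edge, leaving room for many leaf nodes.</p>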
<h2 id="contextual-embeddings">Contextual Embeddings</h2>
<h3 id="elmo-embeddings-from-language-models">ELMo (Embeddings from Language Models)</h3>
<p><a href="https://github.com/allenai/allennlp-models">ELMo</a> marked a shift from static to contextual word representations. Instead of assigning each word one fixed vector, ELMo computes a word&rsquo;s representation from the entire sentence it appears in.</p>
<p><strong>Architecture:</strong></p>
<ul>
<li><strong>Bidirectional LSTM</strong>: Processes text in both forward and backward directions</li>
<li><strong>Character-level input</strong>: Handles OOV words and captures morphological patterns</li>
<li><strong>Multi-layer representations</strong>: Combines different abstraction levels</li>
</ul>
<p><strong>Layer specialization:</strong></p>
<ul>
<li><strong>Lower layers</strong>: Excel at syntactic tasks (POS tagging, parsing)</li>
<li><strong>Higher layers</strong>: Capture semantic relationships (word sense disambiguation)</li>
<li><strong>Combined layers</strong>: A learned, task-specific weighting of all layers generally outperforms using the top layer alone</li>
</ul>
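<p>The weighted combination of layers can be sketched as a softmax-normalized sum scaled by a single factor. This is a minimal illustration in plain Python, not AllenNLP&rsquo;s code: the name <code>scalar_mix</code>, the toy layer vectors, and the weights are assumptions, and in practice the raw weights and the scale are learned with the downstream task.</p>

```python
import math

def softmax(weights):
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layers, raw_weights, gamma=1.0):
    """Mix per-layer vectors for one token: gamma * sum_j s_j * h_j."""
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s_j * layer[d] for s_j, layer in zip(s, layers))
            for d in range(dim)]

# Hypothetical representations of one token from three layers.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = scalar_mix(layers, [0.0, 0.0, 0.0])     # equal weights: plain average
top_heavy = scalar_mix(layers, [0.0, 0.0, 10.0])  # nearly all weight on layer 3
print(uniform, top_heavy)
```

<p>A syntax-oriented task can learn weights that favor the lower layers, while a semantics-oriented task can shift weight upward, without retraining the language model itself.</p>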
<p><strong>Key innovation:</strong>
ELMo embeddings vary by context. The word &ldquo;bank&rdquo; receives different representations in &ldquo;river bank&rdquo; versus &ldquo;financial bank,&rdquo; addressing polysemy directly through contextual awareness.</p>
<p>Adding these context-sensitive representations to existing models improved results on six benchmark tasks in the original paper, including question answering, textual entailment, and sentiment analysis.</p>
<h3 id="probabilistic-fasttext">Probabilistic FastText</h3>
<p><a href="https://github.com/benathi/multisense-prob-fasttext">Probabilistic FastText</a> addresses polysemy (words with multiple meanings) through probabilistic modeling. Traditional embeddings conflate a word&rsquo;s different senses into a single vector, which blurs the distinctions between them.</p>
<p><strong>The polysemy problem:</strong>
Consider &ldquo;rock&rdquo; which can mean:</p>
<ul>
<li>Rock music (genre)</li>
<li>A stone (geological object)</li>
<li>Rocking motion (verb)</li>
</ul>
<p>Standard embeddings average these meanings, producing representations that may not capture any sense precisely.</p>
<p><strong>Probabilistic approach:</strong>
Probabilistic FastText represents words as Gaussian mixture models: probability distributions that can capture multiple distinct meanings as separate components.</p>
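<p>To make the mixture idea concrete, the sketch below represents a word as a small Gaussian mixture and asks which component best explains a given context vector. It is an illustration only, not the training or inference procedure from the paper: the two-dimensional means, the shared spherical variance, the sense labels, and the context vector are all hypothetical.</p>

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a spherical Gaussian with variance `var` per dimension."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq_dist / var)

def responsibilities(x, components):
    """Posterior probability of each mixture component given the point x."""
    logs = [math.log(w) + log_gaussian(x, mean, var)
            for w, mean, var in components]
    m = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(value - m) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical mixture for "rock": (weight, mean, variance) per sense.
rock = [
    (0.5, [1.0, 0.0], 0.1),  # assumed "music" sense
    (0.5, [0.0, 1.0], 0.1),  # assumed "stone" sense
]

# Hypothetical context vector lying close to the "music" mean.
context = [0.9, 0.1]
resp = responsibilities(context, rock)
print(resp)
```

<p>A single averaged vector for &ldquo;rock&rdquo; would sit between the two means and match neither sense well, whereas the mixture keeps both and lets the context decide.</p>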
<p><strong>Advantages:</strong></p>
<ul>
<li><strong>Multi-sense representation</strong>: Each word sense gets its own distribution</li>
<li><strong>Context sensitivity</strong>: Can select appropriate sense based on usage context</li>
<li><strong>Uncertainty quantification</strong>: Probabilistic framework captures embedding confidence</li>
</ul>
<p>This approach provides a more nuanced treatment of lexical ambiguity, particularly valuable for words with distinct, context-dependent meanings.</p>
<h2 id="summary-and-future-directions">Summary and Future Directions</h2>
<p>Word embeddings have evolved from simple one-hot encodings to contextual representations that capture nuanced linguistic relationships. Each approach offers distinct advantages:</p>
<p><strong>Static embeddings</strong> (Word2Vec, GloVe, FastText) provide:</p>
<ul>
<li>Computational efficiency for large-scale applications</li>
<li>Pre-trained models available for numerous languages</li>
<li>Clear analogical reasoning capabilities</li>
<li>Good performance on many downstream tasks</li>
</ul>
<p><strong>Contextual embeddings</strong> (ELMo, BERT, GPT) offer:</p>
<ul>
<li>Dynamic representations based on sentence context</li>
<li>Better handling of polysemy and word sense disambiguation</li>
<li>Strong performance on complex NLP tasks</li>
<li>Ability to capture subtle contextual nuances</li>
</ul>
<p><strong>Choosing the right approach</strong> depends on:</p>
<ul>
<li><strong>Task requirements</strong>: Static embeddings when efficiency matters most, contextual embeddings when accuracy depends on word meaning in context</li>
<li><strong>Data availability</strong>: Pre-trained models vs. domain-specific training</li>
<li><strong>Computational constraints</strong>: Static embeddings require less processing power</li>
<li><strong>Language coverage</strong>: Consider availability of pre-trained models for target languages</li>
</ul>
<p>The field continues advancing toward more efficient contextual models, better multilingual representations, and embeddings that capture increasingly complex linguistic phenomena.</p>
<p>For a production-grade Word2Vec implementation in PyTorch that takes these concepts further, see the <a href="/projects/modern-word2vec/">High-Performance Word2Vec project</a>.</p>
]]></content:encoded></item></channel></rss>