<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Research on Hunter Heidenreich | Senior AI Research Scientist</title><link>https://hunterheidenreich.com/research/</link><description>Recent content in Research on Hunter Heidenreich | Senior AI Research Scientist</description><image><title>Hunter Heidenreich | Senior AI Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Mon, 01 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/research/index.xml" rel="self" type="application/rss+xml"/><item><title>GutenOCR: A Grounded Vision-Language Front-End for Documents</title><link>https://hunterheidenreich.com/research/gutenocr-grounded-vision-language-frontend/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/gutenocr-grounded-vision-language-frontend/</guid><description>GutenOCR introduces vision-language models for grounded OCR, offering precise text transcription and geometric grounding via a unified prompt-based interface.</description><content:encoded><![CDATA[<p><a href="https://github.com/Roots-Automation/GutenOCR"><strong>GutenOCR</strong></a> is a family of vision-language models (VLMs) designed to serve as a &ldquo;grounded OCR front-end&rdquo;. Unlike traditional OCR pipelines (which are often brittle) or modern &ldquo;OCR-free&rdquo; VLMs (which often lack precise token-to-pixel alignment), GutenOCR is fine-tuned to provide both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface.</p>
<h3 id="abstract">Abstract</h3>
<p>Traditional OCR pipelines are often brittle, while modern &ldquo;OCR-free&rdquo; Vision-Language Models (VLMs) frequently lack precise token-to-pixel alignment. To address this, we introduce <strong>GutenOCR</strong>, a family of VLMs designed specifically as a &ldquo;grounded OCR front-end.&rdquo; By fine-tuning Qwen2.5-VL on a curriculum of synthetic and real-world documents, GutenOCR provides both high-quality text transcription and explicit geometric grounding (bounding boxes) through a unified, prompt-based interface. This approach allows downstream systems to request exactly the data format they need, from plain text to complex JSON structures.</p>
<h3 id="key-contributions--results">Key Contributions &amp; Results</h3>
<ul>
<li><strong>Unified Interface:</strong> Transforms Qwen2.5-VL models into specialized OCR systems supporting full-page reading, detection, localized reading, and conditional detection via prompting.</li>
<li><strong>In-Domain Improvements:</strong> GutenOCR-7B more than doubles the composite grounded OCR score of its base model (0.40 to 0.82) on 10.5K held-out pages, with the gains concentrated in localized reading and detection.</li>
<li><strong>Fox Benchmark:</strong> GutenOCR-3B achieves a region-level Character Error Rate (CER, lower is better) of 0.053, edging out the dedicated Fox model (0.059). It also sharply reduces line-level error relative to its Qwen2.5-VL backbone (0.817 to 0.240), though the specialized Fox model still leads at line level.</li>
<li><strong>Curriculum Learning:</strong> Training uses a three-stage curriculum across synthetic data, real-world business documents, and long-context scientific articles to progressively build layout and grounding competency.</li>
<li><strong>Trade-offs:</strong> While GutenOCR reads content accurately (high Page F1), it orders text based on 2D layout columns, which can diverge from the reading order a benchmark or downstream consumer expects. It also shows catastrophic forgetting of color-based prompts and clear negative transfer on math formula recognition (most notably in the 3B model), to the point that the paper recommends the un-fine-tuned backbone for equation-heavy use.</li>
</ul>
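<p>For readers unfamiliar with the metric, Character Error Rate is conventionally the character-level edit distance between a prediction and its reference, divided by the reference length. The sketch below is a minimal illustration of that convention; the function names are ours, and any normalization the paper applies (whitespace handling, clipping) is not reproduced here.</p>

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not ref:
        # Assumption for this sketch: an empty reference scores 0 only if the output is empty too.
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


# One substituted character (O read as 0) over a 12-character reference.
print(cer("grounded OCR", "grounded 0CR"))
```

<p>Because the score is normalized by reference length, the same single-character mistake costs more on a short region than on a full page, which is worth keeping in mind when comparing region-level and line-level numbers.</p>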
<h3 id="methodology">Methodology</h3>
<ul>
<li><strong>Data:</strong> The training mixture combines large-scale real-world documents (business forms, scientific articles) with synthetic data designed to teach precise grounding (e.g., &ldquo;Grounded LaTeX&rdquo; and &ldquo;SynthDoG Grounding&rdquo;).</li>
<li><strong>Curriculum Learning:</strong> Training progresses through three stages, starting with short contexts and synthetic data, moving to real-world business documents, and finishing with long-context scientific articles (up to 16k tokens).</li>
<li><strong>Unified Interface:</strong> The model treats &ldquo;pipeline&rdquo; stages (detection, reading, grounding) as different input-output schemas of a single model, allowing downstream systems to request exactly the data format they need (e.g., plain text vs. JSON boxes).</li>
</ul>
<h3 id="models">Models</h3>
<p>We release 3B and 7B parameter models on HuggingFace:</p>
<ul>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-3B"><strong>GutenOCR-3B (HuggingFace)</strong></a></li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-7B"><strong>GutenOCR-7B (HuggingFace)</strong></a></li>
</ul>
<p>You can try GutenOCR directly at <a href="https://ocr.roots.ai/">ocr.roots.ai</a>, where you can upload a document image and see the model&rsquo;s parsed text output alongside bounding-box highlights on the original image.</p>
<figure class="post-figure center ">
    <img src="/img/gutenocr-demo.webp"
         alt="GutenOCR demo interface showing parsed text with hover-to-highlight on the uploaded image"
         title="GutenOCR demo interface showing parsed text with hover-to-highlight on the uploaded image"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The live demo at ocr.roots.ai: hovering over any parsed token highlights its bounding box on the original document.</figcaption>
    
</figure>

<h3 id="why-this-matters">Why This Matters</h3>
<p>GutenOCR is proposed as a foundational layer for systems where every extracted answer must be explicitly linked to supporting pixels. By providing stable, grounded outputs, it enables human-in-the-loop workflows where reviewers can quickly catch hallucinated or missing text by checking the predicted bounding boxes. This work pairs closely with our release of <a href="/research/pubmed-ocr-pmc-open-access-ocr-annotations/">PubMed-OCR</a>, which provides the large-scale, high-density annotations necessary to train such layout-aware models.</p>
<h3 id="resources">Resources</h3>
<ul>
<li><a href="https://ocr.roots.ai/"><strong>Live Demo</strong></a>: Try GutenOCR on your own documents.</li>
<li><a href="https://arxiv.org/abs/2601.14490"><strong>Paper (arXiv)</strong></a>: Full technical report.</li>
<li><a href="https://github.com/Roots-Automation/GutenOCR"><strong>Code (GitHub)</strong></a>: Training code and model release.</li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-3B"><strong>GutenOCR-3B (HuggingFace)</strong></a>: 3B parameter model weights.</li>
<li><a href="https://huggingface.co/rootsautomation/GutenOCR-7B"><strong>GutenOCR-7B (HuggingFace)</strong></a>: 7B parameter model weights.</li>
</ul>
<h3 id="citation">Citation</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2026gutenocrgroundedvisionlanguagefrontend,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GutenOCR: A Grounded Vision-Language Front-End for Documents}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Ben Elliott and Olivia Dinica and Yosheb Getachew}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2601.14490}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2601.14490}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="related-work">Related Work</h3>
<ul>
<li><a href="/research/pubmed-ocr-pmc-open-access-ocr-annotations/"><strong>PubMed-OCR</strong></a>: The large-scale annotation dataset used to train GutenOCR&rsquo;s layout-aware grounding capabilities.</li>
<li><a href="/research/llm-page-stream-segmentation/"><strong>LLMs for Page Stream Segmentation</strong></a>: Complementary work on document understanding at the page-stream level.</li>
<li><a href="/posts/history-of-page-stream-segmentation/"><strong>The Evolution of Page Stream Segmentation: Rules to LLMs</strong></a>: Background on the history and evolution of document processing pipelines.</li>
<li><a href="/posts/reliability-trap-document-automation/"><strong>The Reliability Trap: When 99% Accuracy Isn&rsquo;t Enough</strong></a>: Explores calibration challenges in deployed PSS systems, directly relevant to GutenOCR&rsquo;s deployment context as an OCR front-end.</li>
</ul>
]]></content:encoded></item><item><title>PubMed-OCR: PMC Open Access OCR Annotations</title><link>https://hunterheidenreich.com/research/pubmed-ocr-pmc-open-access-ocr-annotations/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/pubmed-ocr-pmc-open-access-ocr-annotations/</guid><description>A large-scale dataset of 209K+ articles with OCR and layout bounding boxes, enabling layout-aware modeling and document understanding research.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>OCR-First Supervision</strong>: Unlike prior PubMed datasets that align XML to PDFs, PubMed-OCR provides native OCR annotations (Google Cloud Vision), which avoids alignment errors and covers scanned pages that have no digital text layer.</li>
<li><strong>High-Density Annotation</strong>: At <strong>~1.3B words across 1.5M pages</strong>, PubMed-OCR is far denser per page than comparable corpora like OCR-IDL: <strong>~13x the word density</strong> (844 vs. 62.5 words/page) and <strong>~6x the line density</strong> (106 vs. 17.5 lines/page), achieved despite drawing from fewer total pages.</li>
<li><strong>Multi-Level Bounding Boxes</strong>: Includes explicit word-, line-, and paragraph-level bounding boxes to support hierarchical document understanding and layout-aware modeling. We also hope this leads to VQA datasets whose answers are grounded in document layout.</li>
<li><strong>Open Access &amp; Reproducibility</strong>: Derived strictly from the redistributable PMCOA subset, releasing both the JSON annotations and original PDFs to ensure verifiable and reproducible research.</li>
</ul>
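<p>The density ratios quoted above follow directly from the per-page figures. A quick check, using only the numbers stated in the list:</p>

```python
# Per-page densities quoted above (PubMed-OCR vs. OCR-IDL).
words_per_page = {"PubMed-OCR": 844, "OCR-IDL": 62.5}
lines_per_page = {"PubMed-OCR": 106, "OCR-IDL": 17.5}

word_ratio = words_per_page["PubMed-OCR"] / words_per_page["OCR-IDL"]
line_ratio = lines_per_page["PubMed-OCR"] / lines_per_page["OCR-IDL"]

print(f"word density ratio: {word_ratio:.1f}x")
print(f"line density ratio: {line_ratio:.1f}x")
```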
<h2 id="technical-implementation">Technical Implementation</h2>
<h3 id="corpus-construction">Corpus Construction</h3>
<p>PubMed-OCR is built from PubMed Central Open Access (PMCOA) PDFs, chosen specifically because the PMCOA license permits redistribution of both the original documents and derived annotations. Each PDF is rendered to page images, then passed to the Google Cloud Vision (GCV) API. Each page produces a structured JSON annotation file capturing the detected text along with bounding box geometry at word, line, and paragraph levels.</p>
<h3 id="json-annotation-schema">JSON Annotation Schema</h3>
<p>Each page annotation follows this compact schema. Bounding boxes are axis-aligned rectangles in <code>[x1, y1, x2, y2]</code> pixel coordinates. Words, lines, and paragraphs are stored as parallel flat lists under the <code>text</code> key:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;text&#34;</span>: {
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;words&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">210</span>, <span style="color:#ae81ff">786</span>]}
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;lines&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example sentence&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">540</span>, <span style="color:#ae81ff">786</span>]}
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;paragraphs&#34;</span>: [
</span></span><span style="display:flex;"><span>      {<span style="color:#f92672">&#34;text&#34;</span>: <span style="color:#e6db74">&#34;Example sentence\nSecond line&#34;</span>, <span style="color:#f92672">&#34;box&#34;</span>: [<span style="color:#ae81ff">180</span>, <span style="color:#ae81ff">746</span>, <span style="color:#ae81ff">540</span>, <span style="color:#ae81ff">820</span>]}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>  },
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;image&#34;</span>: <span style="color:#e6db74">&#34;...&#34;</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
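<p>As a rough illustration of how this schema supports coordinate-grounded lookups, the sketch below returns the words whose box center falls inside a query rectangle. The helper names (<code>box_center</code>, <code>words_in_region</code>) and the sample values are hypothetical and are not part of the released tooling.</p>

```python
import json

# A tiny page in the schema shown above (values are illustrative).
page_json = """
{
  "text": {
    "words": [
      {"text": "Example", "box": [180, 746, 210, 786]},
      {"text": "sentence", "box": [220, 746, 540, 786]},
      {"text": "Second", "box": [180, 790, 300, 820]}
    ],
    "lines": [],
    "paragraphs": []
  },
  "image": "..."
}
"""


def box_center(box):
    """Center point of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def words_in_region(page, region):
    """Return the words whose box center falls inside `region` ([x1, y1, x2, y2])."""
    rx1, ry1, rx2, ry2 = region
    hits = []
    for word in page["text"]["words"]:
        cx, cy = box_center(word["box"])
        if rx2 >= cx >= rx1 and ry2 >= cy >= ry1:
            hits.append(word["text"])
    return hits


page = json.loads(page_json)
# The query rectangle covers the first text line only.
print(words_in_region(page, [170, 740, 550, 788]))
```

<p>Testing the box center is a deliberate simplification: it avoids partial-overlap edge cases at the cost of missing words that only graze the query region.</p>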
<figure class="post-figure center ">
    <img src="/img/pubmed-ocr-annotation-levels.webp"
         alt="Tri-panel figure showing the same scientific article page annotated at word level (red), line level (blue), and paragraph level (green)."
         title="Tri-panel figure showing the same scientific article page annotated at word level (red), line level (blue), and paragraph level (green)."
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The same page annotated at three granularities: word (left), line (center), and paragraph (right). Page from Zhou et al., &ldquo;Regulation of alternative splicing by local histone modifications: potential roles for RNA-guided mechanisms,&rdquo; <em>Nucleic Acids Research</em> 42(2):701-713, 2014 (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3902899/">PMC3902899</a>, DOI:<a href="https://doi.org/10.1093/nar/gkt875">10.1093/nar/gkt875</a>). Licensed under CC BY-NC.</figcaption>
    
</figure>

<h3 id="line-reconstruction">Line Reconstruction</h3>
<p>GCV returns word-level detections natively. Line and paragraph groupings are reconstructed using spatial heuristics: words are clustered into lines by vertical overlap and horizontal proximity, and paragraph grouping follows a similar process at a coarser scale. These heuristics work well for standard single-column scientific layouts but can fail on multi-column or irregularly structured pages (see Limitations).</p>
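<p>To make the idea concrete, here is a minimal sketch of one such heuristic: words are visited left to right and attached to a line when they overlap its last word vertically and sit within a small horizontal gap. This is not the exact procedure used to build the corpus; the thresholds <code>min_overlap</code> and <code>max_gap</code> and the function names are assumptions chosen for illustration.</p>

```python
def vertical_overlap(a, b):
    """Fraction of the shorter box's height that the two boxes share vertically."""
    top = max(a[1], b[1])
    bottom = min(a[3], b[3])
    shared = max(0, bottom - top)
    shorter = min(a[3] - a[1], b[3] - b[1])
    return shared / shorter if shorter > 0 else 0.0


def group_into_lines(words, min_overlap=0.5, max_gap=40):
    """Greedy grouping of word boxes into lines (illustrative thresholds)."""
    lines = []  # each line is a list of words in left-to-right order
    for word in sorted(words, key=lambda w: w["box"][0]):
        for line in lines:
            last = line[-1]
            gap = word["box"][0] - last["box"][2]  # horizontal distance to the line's last word
            if vertical_overlap(last["box"], word["box"]) >= min_overlap and max_gap >= gap:
                line.append(word)
                break
        else:
            lines.append([word])  # no compatible line found: start a new one
    lines.sort(key=lambda line: line[0]["box"][1])  # order lines top to bottom
    return [
        {
            "text": " ".join(w["text"] for w in line),
            "box": [
                min(w["box"][0] for w in line),
                min(w["box"][1] for w in line),
                max(w["box"][2] for w in line),
                max(w["box"][3] for w in line),
            ],
        }
        for line in lines
    ]


words = [
    {"text": "Second", "box": [180, 790, 300, 820]},
    {"text": "sentence", "box": [220, 746, 540, 786]},
    {"text": "Example", "box": [180, 746, 210, 786]},
    {"text": "line", "box": [310, 790, 380, 820]},
]
for line in group_into_lines(words):
    print(line)
```

<p>The failure mode mentioned above is visible in the thresholds: if two columns sit closer together than <code>max_gap</code>, words at the same height in both columns are merged into a single line.</p>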
<h2 id="using-the-dataset">Using the Dataset</h2>
<p>The corpus spans 1.5M pages, so streaming is recommended for most use cases:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> datasets <span style="color:#f92672">import</span> load_dataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Streaming is recommended for the full 1.5M-page corpus</span>
</span></span><span style="display:flex;"><span>ds <span style="color:#f92672">=</span> load_dataset(<span style="color:#e6db74">&#34;rootsautomation/pubmed-ocr&#34;</span>, streaming<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, split<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;train&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Inspect a page</span>
</span></span><span style="display:flex;"><span>page <span style="color:#f92672">=</span> next(iter(ds))
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Article: </span><span style="color:#e6db74">{</span>page[<span style="color:#e6db74">&#39;accession_id&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">,  Page: </span><span style="color:#e6db74">{</span>page[<span style="color:#e6db74">&#39;page&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Parse OCR annotations</span>
</span></span><span style="display:flex;"><span>ocr <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(page[<span style="color:#e6db74">&#34;ocr_json&#34;</span>])
</span></span><span style="display:flex;"><span>text <span style="color:#f92672">=</span> ocr[<span style="color:#e6db74">&#34;text&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Print the first five lines with their bounding boxes</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> line <span style="color:#f92672">in</span> text[<span style="color:#e6db74">&#34;lines&#34;</span>][:<span style="color:#ae81ff">5</span>]:
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  Line: </span><span style="color:#e6db74">{</span>line[<span style="color:#e6db74">&#39;text&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  BBox: </span><span style="color:#e6db74">{</span>line[<span style="color:#e6db74">&#39;box&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Access individual word detections</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> word <span style="color:#f92672">in</span> text[<span style="color:#e6db74">&#34;words&#34;</span>][:<span style="color:#ae81ff">5</span>]:
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;  Word: </span><span style="color:#e6db74">{</span>word[<span style="color:#e6db74">&#39;text&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">, BBox: </span><span style="color:#e6db74">{</span>word[<span style="color:#e6db74">&#39;box&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>Full schema documentation is available on the <a href="https://huggingface.co/datasets/rootsautomation/pubmed-ocr">HuggingFace dataset card</a>.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>The lack of large-scale, high-quality OCR datasets with explicit geometric grounding has been a major bottleneck for training layout-aware models. By releasing PubMed-OCR, we provide the community with the dense, multi-level bounding box annotations necessary to build the next generation of document understanding systems. This dataset directly supports the development of models like <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>, enabling them to learn precise token-to-pixel alignment and robust layout reasoning.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Single OCR engine</strong>: All annotations come from Google Cloud Vision. GCV&rsquo;s error modes (handwriting, degraded scans, complex math, non-Latin scripts) propagate uncorrected into the dataset. Different OCR engines could yield different coverage patterns and error distributions.</li>
<li><strong>Heuristic line reconstruction</strong>: Spatial word-to-line clustering is approximate. Multi-column layouts, rotated text, or unusual page orientations may produce incorrect line groupings.</li>
<li><strong>PMCOA scope</strong>: Coverage is limited to the Open Access subset of PubMed Central. Commercial or subscription articles are excluded.</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2026pubmedocrpmcopenaccess,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PubMed-OCR: PMC Open Access OCR Annotations}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Yosheb Getachew and Olivia Dinica and Ben Elliott}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2601.11425}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CV}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2601.11425}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>This dataset directly enables <a href="/research/gutenocr-grounded-vision-language-frontend/">GutenOCR</a>, a family of vision-language models trained on PubMed-OCR annotations to produce grounded OCR outputs with explicit bounding boxes.</p>
<p>For related work on document processing pipelines that consume OCR output, see <a href="/research/llm-page-stream-segmentation/">LLMs for Page Stream Segmentation</a> and <a href="/research/page-stream-segmentation-llms/">Page Stream Segmentation with LLMs: Challenges and Applications</a>.</p>
]]></content:encoded></item><item><title>LLMs for Insurance Document Automation</title><link>https://hunterheidenreich.com/research/page-stream-segmentation-llms/</link><pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/page-stream-segmentation-llms/</guid><description>LLM applications for insurance document automation using parameter-efficient fine-tuning and analysis of calibration challenges.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Page Stream Segmentation (PSS) is critical for automating document processing in industries like insurance, where unstructured document collections are common. This paper explores the use of large language models (LLMs) for PSS, applying parameter-efficient fine-tuning to real-world insurance data. Our experiments show that LLMs outperform baseline models in segmentation accuracy, but stream-level calibration remains a significant challenge. Post-hoc calibration and Monte Carlo dropout offer only limited improvement, which highlights the need for further work in this area before high-stakes deployment.</p>
<p>Published in the COLING 2025 Industry Track, this work builds on our earlier research establishing the <a href="/research/llm-page-stream-segmentation/">TabMe++ benchmark and decoder-based LLM approach</a>, extending those methods to real-world industrial deployment.</p>
<blockquote>
<p><strong>Blog Post:</strong> For a narrative overview of the reliability and calibration findings discussed in this paper, see <a href="/posts/reliability-trap-document-automation/">The Reliability Trap: When 99% Accuracy Isn&rsquo;t Enough</a>.</p></blockquote>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Real-World Evaluation</strong>: Applied small-to-mid-sized LLMs (Phi-3.5-mini, Mistral-7B) to a proprietary insurance dataset, outperforming strong baselines like XGBoost in segmentation accuracy.</li>
<li><strong>Parameter-Efficient Fine-Tuning</strong>: Successfully used parameter-efficient fine-tuning (PEFT) to adapt LLMs for the specialized task of page stream segmentation.</li>
<li><strong>Calibration Complexity</strong>: Found that post-hoc calibration and Monte Carlo dropout offer limited improvement at the stream level, keeping human-in-the-loop workflows necessary for high-stakes automation (see stream-level confidence analysis below).</li>
<li><strong>Throughput Analysis</strong>: Introduced an accuracy-vs-throughput framework to quantify how much volume can be safely automated at strict confidence thresholds.</li>
</ul>
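<p>One way to picture the accuracy-vs-throughput trade-off is to sweep a confidence threshold and, at each value, record what fraction of streams would be automated and how accurate those automated streams are. The sketch below assumes that reading of the framework; the function name and the toy values are ours and are not results from the paper.</p>

```python
def accuracy_vs_throughput(confidences, correct, thresholds):
    """For each threshold, report the share of streams automated and their accuracy."""
    rows = []
    total = len(confidences)
    for t in thresholds:
        # Streams at or above the threshold are automated; the rest go to human review.
        kept = [ok for c, ok in zip(confidences, correct) if c >= t]
        throughput = len(kept) / total if total else 0.0
        accuracy = sum(kept) / len(kept) if kept else None
        rows.append((t, throughput, accuracy))
    return rows


# Toy values, not results from the paper.
confidences = [0.99, 0.97, 0.95, 0.80, 0.60, 0.40]
correct = [True, True, True, True, False, False]

for t, throughput, accuracy in accuracy_vs_throughput(confidences, correct, [0.5, 0.9]):
    print(f"threshold={t:.2f} throughput={throughput:.2f} accuracy={accuracy}")
```

<p>Raising the threshold trades volume for safety: fewer streams are automated, but the ones that are automated are more likely to be correct, provided the confidences are well calibrated.</p>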
<h2 id="stream-level-confidence">Stream-Level Confidence</h2>
<p>A key insight from this work is why calibration becomes increasingly difficult as documents grow longer. We define stream-level confidence as the product of individual page-level confidences:</p>
<p>$$C = \prod_{i=1}^{N} C_i$$</p>
<p>where $C_i$ is the confidence for page $i$ and $N$ is the number of pages in the stream. Because every $C_i$ is at most 1, the product can only shrink as pages are added, so even small page-level uncertainties compound quickly. As streams grow longer, stream-level confidence drops rapidly, making it difficult to set reliable thresholds for automation.</p>
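<p>A short numerical illustration of the product above, assuming for simplicity that every page is predicted with the same confidence of 0.99 (an illustrative value, not one from the paper). Under that assumption a 20-page stream already falls to roughly 0.82 and a 50-page stream to roughly 0.61:</p>

```python
import math


def stream_confidence(page_confidences):
    """Stream-level confidence C as the product of page-level confidences C_i."""
    return math.prod(page_confidences)


# Illustrative: every page is predicted with the same confidence of 0.99.
for n_pages in (1, 5, 20, 50):
    c = stream_confidence([0.99] * n_pages)
    print(f"N={n_pages:>2}  C={c:.3f}")
```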
<figure class="post-figure center ">
    <img src="/img/page-stream-segmentation-throughput.webp"
         alt="Stream accuracy versus relative throughput for Mistral-7B and XGBoost models, showing LLMs achieve higher automation rates at equivalent accuracy levels"
         title="Stream accuracy versus relative throughput for Mistral-7B and XGBoost models, showing LLMs achieve higher automation rates at equivalent accuracy levels"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Accuracy vs. throughput trade-off: Mistral-7B enables higher automation rates than XGBoost at strict accuracy thresholds, demonstrating the practical value of LLMs for document processing.</figcaption>
    
</figure>

<h2 id="technical-implementation">Technical Implementation</h2>
<h3 id="models--fine-tuning">Models &amp; Fine-Tuning</h3>
<p>We fine-tuned <strong>Mistral-7B-v0.3</strong> and <strong>Phi-3.5-mini</strong> (4-bit quantized) using QLoRA. Training ran on a single NVIDIA H100 GPU using the <strong>Unsloth</strong> library and Hugging Face&rsquo;s TRL.</p>
<ul>
<li><strong>Stack</strong>: Unsloth + TRL</li>
<li><strong>Config</strong>: Rank $r=16$, Alpha $\alpha=16$</li>
</ul>
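<p>To give a sense of what these settings imply, the sketch below works through the arithmetic for a single adapted weight matrix, assuming the standard LoRA formulation in which the low-rank update is scaled by $\alpha / r$. The 4096 x 4096 projection size is an illustrative assumption, and which modules were actually adapted is not stated here.</p>

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Standard LoRA scaling factor applied to the low-rank update: alpha / r."""
    return alpha / rank


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


rank, alpha = 16, 16   # values reported above
d_in = d_out = 4096    # illustrative square projection; an assumption

added = lora_added_params(d_in, d_out, rank)
frozen = d_in * d_out
print(f"scaling = {lora_scaling(rank, alpha)}")
print(f"adapter params = {added:,} ({added / frozen:.2%} of the frozen matrix)")
```

<p>With rank and alpha equal, the scaling factor is 1, so the adapter update is added to the frozen weights without extra amplification or damping.</p>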
<h3 id="dataset">Dataset</h3>
<p>The study utilized a proprietary <strong>insurance dataset</strong> consisting of 7.5k document streams (44.7k pages). This real-world data includes health records, legal contracts, and police reports, offering a more challenging and realistic evaluation than synthetic benchmarks.</p>
<h3 id="task-framing">Task Framing</h3>
<p>The task framing follows the <a href="/research/llm-page-stream-segmentation/">TabMe++ work</a>: binary classification over a local window (previous page + current page). Here the models are prompted to output valid JSON marking the start of a new document.</p>
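<p>The framing can be sketched as follows: build a prompt from the previous and current page, parse the model&rsquo;s JSON answer, and split the stream wherever a new document is flagged. The prompt wording, the <code>new_document</code> key, and the helper names are hypothetical; the paper&rsquo;s actual template is not reproduced here.</p>

```python
import json

# Hypothetical prompt template and JSON key; the paper's exact wording is not given here.
PROMPT_TEMPLATE = (
    "You are given two consecutive pages from a document stream.\n"
    "Previous page:\n{previous}\n\n"
    "Current page:\n{current}\n\n"
    'Answer with JSON of the form {{"new_document": true}} or {{"new_document": false}}.'
)


def build_prompt(previous_page: str, current_page: str) -> str:
    """Fill the local two-page window into the prompt."""
    return PROMPT_TEMPLATE.format(previous=previous_page, current=current_page)


def parse_decision(model_output: str) -> bool:
    """Parse the JSON answer; malformed output is treated as a new document (a choice made for this sketch)."""
    try:
        return bool(json.loads(model_output)["new_document"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return True


def split_stream(pages, decisions):
    """Group pages into documents; decisions[i] says whether pages[i + 1] starts a new one."""
    if not pages:
        return []
    documents = [[pages[0]]]
    for page, is_new in zip(pages[1:], decisions):
        if is_new:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


# Stand-in model outputs for the page pairs (p1, p2) and (p2, p3).
outputs = ['{"new_document": false}', '{"new_document": true}']
decisions = [parse_decision(o) for o in outputs]
print(split_stream(["p1", "p2", "p3"], decisions))
```

<p>Because each decision only sees a two-page window, a single wrong boundary splits or merges documents for the rest of the stream, which is one reason stream-level confidence matters more than page-level accuracy.</p>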
<h2 id="impact">Impact</h2>
<p>This work demonstrates both the promise and the current limitations of using LLMs in high-stakes industrial applications. LLMs can significantly improve segmentation accuracy over traditional methods, but performance metrics alone are not sufficient for deployment. For sectors like insurance, stream-level calibration is an open problem that must be solved before full automation becomes responsible.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{heidenreich2025page,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Page Stream Segmentation with LLMs: Challenges and Applications in Insurance Document Automation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heidenreich, Hunter and Dalvi, Ratish and Verma, Nikhil and Getachew, Yosheb}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 31st International Conference on Computational Linguistics: Industry Track}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{305--317}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Optimizing Sequence Models for Dynamical Systems</title><link>https://hunterheidenreich.com/research/deconstructing-recurrence-attention-gating/</link><pubDate>Tue, 01 Oct 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/deconstructing-recurrence-attention-gating/</guid><description>Ablation study deconstructing sequence models. Attention-augmented Recurrent Highway Networks outperform Transformers on chaotic systems.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Advanced neural network architectures developed for tasks like natural language processing are often transferred to spatiotemporal forecasting without a deep understanding of which components drive their performance. This can lead to suboptimal results and reinforces the view of these models as &ldquo;black boxes&rdquo;. In this work, we deconstruct the core mechanisms of Transformers and Recurrent Neural Networks (RNNs): attention, gating, and recurrence. We then build and test novel hybrid architectures to identify which components are most effective. A key finding is that while adding recurrence is detrimental to Transformers, augmenting RNNs with attention and neural gating consistently improves their forecasting accuracy. Our study reveals that a seldom-used architecture, the Recurrent Highway Network (RHN), emerges as the top-performing model for forecasting high-dimensional chaotic systems once it is enhanced with these mechanisms.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Systematic Ablation</strong>: Deconstructed Transformers and RNNs into core mechanisms (attention, gating, recurrence) to isolate performance drivers</li>
<li><strong>Novel Hybrid Architectures</strong>: Synthesized and tested new combinations of neural primitives for spatiotemporal forecasting</li>
<li><strong>RHN Advantage on Chaotic Systems</strong>: Demonstrated that attention-augmented Recurrent Highway Networks outperform standard Transformers on high-dimensional chaotic systems</li>
<li><strong>Robustness Analysis</strong>: Validated models across both clean physics simulations and noisy real-world industrial datasets</li>
</ul>
<h2 id="motivation">Motivation</h2>
<p>In modern ML, architectures are often transferred from one domain (like NLP) to another (like physical forecasting) without understanding the underlying mechanics. This &ldquo;black box&rdquo; approach leads to suboptimal compute usage and performance ceilings.</p>
<p>Our goal was to break these architectures down. We treated the core mechanisms of <strong>Transformers</strong> and <strong>RNNs</strong> (<strong>Gating, Attention, and Recurrence</strong>) as orthogonal basis vectors. By decoupling these components, we could synthesize and test hybrid architectures to find the best configuration for spatiotemporal forecasting.</p>
<h2 id="methodological-approach">Methodological Approach</h2>
<p>We built a modular framework to mix and match neural primitives. We systematically evaluated:</p>
<ol>
<li><strong>Gating Mechanisms:</strong> Testing Additive, Learned Rate, Input-Dependent, and Coupled Input-Dependent variants</li>
<li><strong>Attention:</strong> Implementing multi-headed attention with relative positional biases</li>
<li><strong>Recurrence:</strong> Testing standard cells (LSTM, GRU) against deeper transition cells like Recurrent Highway Networks (RHN)</li>
</ol>















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/neural-gates.webp"
         alt="Neural gating mechanisms: Additive, Learned Rate, Dependent-Coupled, and Dependent variants"
         title="Neural gating mechanisms: Additive, Learned Rate, Dependent-Coupled, and Dependent variants"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The hierarchy of neural gating mechanisms we tested, from simple additive to fully input-dependent.</figcaption>
    
</figure>
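<p>As a rough illustration of how the four gating variants differ, the sketch below combines a residual input <code>x</code> with a sublayer output <code>fx</code>. The parameterization is an assumption made for illustration: the function names and the sigmoid-of-linear gates are hypothetical, and the paper defines the exact forms used in the study.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def additive_gate(x, fx):
    # Plain residual connection: no learned parameters.
    return x + fx

def learned_rate_gate(x, fx, alpha, beta):
    # Learned per-feature rates that do not depend on the input.
    return alpha * x + beta * fx

def input_dependent_gate(x, fx, W1, b1, W2, b2):
    # Two independent gates, each computed from the input x.
    g1 = sigmoid(x @ W1 + b1)
    g2 = sigmoid(x @ W2 + b2)
    return g1 * x + g2 * fx

def coupled_gate(x, fx, W, b):
    # A single gate: the carry and transform weights sum to one.
    g = sigmoid(x @ W + b)
    return (1.0 - g) * x + g * fx

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(2, d))
fx = np.tanh(x)  # stand-in for an attention or feed-forward sublayer
W = rng.normal(size=(d, d)) * 0.1
b = np.zeros(d)

out_additive = additive_gate(x, fx)
out_rate = learned_rate_gate(x, fx, np.ones(d), np.ones(d))
out_dependent = input_dependent_gate(x, fx, W, b, W, b)
out_coupled = coupled_gate(x, fx, W, b)
```

<p>In this sketch the learned-rate gate with both rates fixed at one reduces to the additive gate, and the coupled gate always returns an elementwise interpolation between <code>x</code> and <code>fx</code>.</p>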
















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/rnn-cell-types.webp"
         alt="RNN cell architectures: Elman, LSTM, GRU, and RHN cells"
         title="RNN cell architectures: Elman, LSTM, GRU, and RHN cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Recurrent cell types compared in our study. The RHN (d) extends processing depth within each timestep.</figcaption>
    
</figure>

<p>This ablation isolated exactly <em>which</em> mathematical operation was driving the performance gain.</p>
<h2 id="key-findings">Key Findings</h2>
<h3 id="recurrent-highway-networks-on-chaotic-systems">Recurrent Highway Networks on Chaotic Systems</h3>
<p>For high-dimensional chaotic systems like the Multiscale Lorenz-96 shown below, we found that a <strong>Recurrent Highway Network (RHN)</strong> augmented with <strong>Attention and Neural Gating</strong> was the top-performing architecture. This hybrid exceeded the forecasting accuracy of standard Transformers, suggesting that deeper recurrence (processing depth per timestep) matters for complex dynamics.</p>















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/multiscale-lorenz.webp"
         alt="Forecasting comparison on Multiscale Lorenz-96 system"
         title="Forecasting comparison on Multiscale Lorenz-96 system"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Forecasting the Multiscale Lorenz-96 system. The top row shows the &lsquo;texture&rsquo; of the chaotic evolution. Notice how the RHN (far right) maintains the coherent wave-like structures for nearly a full Lyapunov time, holding structure longer than the Transformer variants (the plotted window spans two Lyapunov times).</figcaption>
    
</figure>
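<p>To make &ldquo;processing depth per timestep&rdquo; concrete, here is a minimal sketch of one RHN timestep in the standard formulation with a coupled carry gate. It is an assumption that this matches the variant used in the study; the parameter names (<code>W_h</code>, <code>R_h</code>, and so on) and the sizes are hypothetical, and the attention and gating augmentations are left out.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rhn_step(x, s, params):
    """One timestep: the state passes through len(params) stacked micro-layers."""
    for layer, p in enumerate(params):
        # The external input only enters at the first micro-layer.
        x_h = x @ p["W_h"] if layer == 0 else 0.0
        x_t = x @ p["W_t"] if layer == 0 else 0.0
        h = np.tanh(x_h + s @ p["R_h"] + p["b_h"])   # candidate state
        t = sigmoid(x_t + s @ p["R_t"] + p["b_t"])   # transform gate
        s = h * t + s * (1.0 - t)                    # coupled carry gate
    return s

rng = np.random.default_rng(0)
d_in, d_hidden, depth = 3, 5, 4  # illustrative sizes only
params = [
    {
        "W_h": rng.normal(size=(d_in, d_hidden)) * 0.5,
        "W_t": rng.normal(size=(d_in, d_hidden)) * 0.5,
        "R_h": rng.normal(size=(d_hidden, d_hidden)) * 0.5,
        "R_t": rng.normal(size=(d_hidden, d_hidden)) * 0.5,
        "b_h": np.zeros(d_hidden),
        "b_t": np.zeros(d_hidden),
    }
    for _ in range(depth)
]

state = np.zeros(d_hidden)
for x_t in rng.normal(size=(10, d_in)):
    state = rhn_step(x_t, state, params)
```

<p>An LSTM or GRU applies one nonlinear transition per timestep, whereas this cell applies <code>depth</code> of them before the next input arrives.</p>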

<h3 id="transformers-recurrence-hurts-gating-helps">Transformers: Recurrence Hurts, Gating Helps</h3>
<p>We attempted to force recurrence into Transformers to give them &ldquo;memory,&rdquo; but it consistently hurt performance. However, <strong>Neural Gating</strong> significantly improved Transformer robustness. For real-world, noisy data (traffic, weather), the <strong>Pre-Layer Normalization (PreLN) Transformer</strong> with added gating proved to be the most robust model.</p>
<h3 id="adding-attention-to-lstms-and-grus">Adding Attention to LSTMs and GRUs</h3>
<p>We tested on the Kuramoto-Sivashinsky (K-S) equation, a model of turbulence and flame fronts. Standard LSTMs and GRUs turn out to be under-optimized for this setting: adding <strong>attention</strong> to these cells improved their valid-prediction time several-fold. The best attention-augmented LSTM and GRU reached roughly 4x and 6.6x their baseline valid-prediction time, respectively (the paper reports the top RNNs at 2-7x baseline on K-S). On the partially-observed Multiscale Lorenz-96 system, the gain from attention plus gating is smaller but still more than 40%.</p>
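<p>Valid-prediction time is the metric behind these ratios. A minimal sketch of one common way to compute it follows, assuming a normalized RMSE criterion; the threshold of 0.5, the normalization by the per-feature standard deviation, and the function name are assumptions for illustration and may differ from the paper&rsquo;s exact definition.</p>

```python
import numpy as np

def valid_prediction_time(y_true, y_pred, dt, lyapunov_time, threshold=0.5):
    """Time, in Lyapunov times, before the normalized error first exceeds threshold."""
    # Normalize each feature by the standard deviation of the true trajectory.
    sigma = y_true.std(axis=0)
    nrmse = np.sqrt(np.mean(((y_pred - y_true) / sigma) ** 2, axis=1))
    exceeded = np.nonzero(nrmse > threshold)[0]
    # If the error never crosses the threshold, the whole horizon is valid.
    steps = exceeded[0] if exceeded.size else len(nrmse)
    return steps * dt / lyapunov_time

# Toy trajectory: three phase-shifted sine waves sampled every 0.1 time units.
t = np.arange(200) * 0.1
y_true = np.stack([np.sin(t + k) for k in range(3)], axis=1)
drift = 0.01 * np.arange(200)[:, None]  # forecast error that grows linearly

vpt_perfect = valid_prediction_time(y_true, y_true, dt=0.1, lyapunov_time=2.0)
vpt_drift = valid_prediction_time(y_true, y_true + drift, dt=0.1, lyapunov_time=2.0)
```

<p>A ratio such as &ldquo;4x baseline&rdquo; is then the valid-prediction time of the augmented model divided by that of the unaugmented one.</p>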















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/kuramoto-sivashinksy.webp"
         alt="Forecasting comparison on Kuramoto-Sivashinsky system"
         title="Forecasting comparison on Kuramoto-Sivashinsky system"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Forecasting the Kuramoto-Sivashinsky system. The error heatmaps (bottom row) show how prediction quality degrades over time (lighter means larger error). The RHN maintains structural fidelity longer than competing architectures.</figcaption>
    
</figure>

<h3 id="robustness-on-real-world-datasets">Robustness on Real-World Datasets</h3>
<p>While chaotic systems test the limits of theory, we also validated our models on seven standard real-world datasets: the four <strong>Electricity Transformer Temperature (ETT)</strong> subsets plus <strong>Traffic</strong>, <strong>Electricity</strong>, and <strong>Weather</strong>.</p>
<p>Unlike the clean physics simulations, these datasets contain real-world noise and irregularities. In this environment, the <strong>Pre-Layer Normalization (PreLN) Transformer</strong> proved to be the most robust architecture. While it didn&rsquo;t always beat the RHN on pure chaos, its stability makes it a strong default choice for general time-series forecasting tasks where training speed and reliability are paramount.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>This work treats architectural components as independently tunable choices rather than fixed defaults, and that framing surfaces a concrete trade-off. Transformers train in only 25-50% of the time the RNNs require (roughly 2-4x faster), while the attention-augmented RNNs give better inference accuracy on the chaotic physical systems. Which mechanism to select depends on whether the training budget or the forecast precision is the binding constraint, and the ablation makes that an informed choice rather than a default one.</p>
<p>The ablation framework here, treating architectural components as independently tunable factors and measuring their marginal contribution, shaped how later evaluation work is structured. The same principle of isolating variables rather than comparing end-to-end black boxes appears in the document processing research, from benchmark construction in page stream segmentation to grounded evaluation in GutenOCR.</p>
<h2 id="related-work">Related Work</h2>
<p>The methodology here shares a design philosophy with <a href="/research/eigennoise-contrastive-prior/">EigenNoise</a>,
which similarly decomposes a neural mechanism (word vector initialization) into theoretically
grounded components to isolate what drives performance. Both papers treat model components as
testable hypotheses rather than fixed architectural choices.</p>
<p>For broader context on where this fits in the portfolio&rsquo;s Scientific Machine Learning arc,
see the <a href="/research/">Research</a> overview.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2024deconstructingrecurrenceattentiongating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deconstructing Recurrence, Attention, and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter S. Heidenreich and Pantelis R. Vlachas and Petros Koumoutsakos}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2410.02654}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2410.02654}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLMs for Page Stream Segmentation</title><link>https://hunterheidenreich.com/research/llm-page-stream-segmentation/</link><pubDate>Wed, 21 Aug 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/llm-page-stream-segmentation/</guid><description>Enhanced TabMe benchmark for page stream segmentation, creating TabMe++, showing fine-tuned decoder-based LLMs outperform prior models.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Page Stream Segmentation (PSS), the task of correctly dividing a sequence of pages into distinct documents, is a critical first step in automated document processing pipelines. Research in this area has been held back by the lack of high-quality, public datasets.</p>
<p>In this work, we address this issue by enhancing an existing benchmark, <a href="https://github.com/aldolipani/TABME">TabMe</a>, with commercial-grade Optical Character Recognition (OCR) to create <strong><a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a></strong>. This new version significantly reduces noise and improves text detection, highlighting the critical importance of OCR quality for document understanding tasks.</p>
<p>We then conduct the first evaluation of large, decoder-based language models (LLMs) on the PSS task. Our findings show that models like <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a>, when fine-tuned using parameter-efficient methods, <strong>outperform smaller encoder-based models</strong> and traditional baselines. For instance, our best model correctly segments 80% of document streams in the test set without any errors.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Enhanced Public Benchmark (<a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a>)</strong>: Re-processed the entire <a href="https://github.com/aldolipani/TABME">TabMe</a> dataset with commercial OCR, correcting significant text recognition errors and reducing blank pages by over 80% (from 2.27% to 0.38%)</li>
<li><strong>First Application of Large Decoder-Based LLMs to PSS</strong>: Systematically evaluated and fine-tuned billion-parameter, decoder-only LLMs for page stream segmentation</li>
<li><strong>Decoder LLMs Outperform Encoders on TabMe++</strong>: Fine-tuned decoder models outperform previous encoder-based and multimodal approaches on the benchmark</li>
<li><strong>OCR Quality Analysis</strong>: Quantified the impact that high-quality OCR has on PSS model performance through comparative experiments</li>
</ul>
<h2 id="the-evolution-of-page-stream-segmentation">The Evolution of Page Stream Segmentation</h2>
<p>The history of PSS can be read as three distinct algorithmic eras, revealing a clear trajectory toward semantic understanding:</p>
<ul>
<li><strong>The Heuristic Era:</strong> Early systems relied on handcrafted rules and region-specific pattern matching (e.g., looking for headers/footers), which failed to generalize across heterogeneous documents.</li>
<li><strong>The Encoder Era:</strong> The field moved to &ldquo;learning-based&rdquo; methods using Convolutional Neural Networks (CNNs) and later Transformer encoders like LayoutLM and LEGAL-BERT. While better, these often required complex multimodal architectures.</li>
<li><strong>The Decoder Era (this work):</strong> We establish the viability of the third era: billion-parameter generative models (decoder-only LLMs), which simplify the architecture while improving semantic reasoning.</li>
</ul>
<blockquote>
<p><strong>Blog Post:</strong> Read the full story of these eras in <a href="/posts/history-of-page-stream-segmentation/">The Evolution of Page Stream Segmentation</a>.</p></blockquote>
<h2 id="key-evaluation-metrics">Key Evaluation Metrics</h2>
<p>Beyond standard F1 scores, the study evaluates models on metrics that directly translate to operational costs:</p>
<ul>
<li><strong>Straight-Through Processing (STP):</strong> The percentage of document streams segmented <em>perfectly</em>, requiring zero human intervention. The fine-tuned <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a> achieved an STP of <strong>0.800</strong>, meaning 80% of streams were fully automated. In contrast, the traditional XGBoost baseline achieved only <strong>0.074</strong>.</li>
<li><strong>Minimum Number of Drag-and-Drops (MNDD):</strong> A proxy for human effort, measuring how many pages a human would need to move to correct the segmentation. The best LLM reduced this &ldquo;effort metric&rdquo; by over <strong>13x</strong> compared to the XGBoost baseline (0.81 vs 10.85).</li>
</ul>
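<p>A minimal sketch of how STP can be computed from per-page break labels, along with the arithmetic behind the MNDD comparison. The toy streams and the function name are hypothetical; only the two MNDD values come from the results above.</p>

```python
def straight_through_processing(true_streams, pred_streams):
    """Fraction of streams whose predicted break labels match the truth exactly."""
    perfect = sum(
        1 for truth, pred in zip(true_streams, pred_streams)
        if list(truth) == list(pred)
    )
    return perfect / len(true_streams)

# Each stream is a list of per-page labels: 1 means the page starts a new document.
true_streams = [[1, 0, 0, 1, 0], [1, 1, 0], [1, 0, 1, 0, 0, 1], [1, 0]]
pred_streams = [[1, 0, 0, 1, 0], [1, 1, 0], [1, 0, 0, 0, 0, 1], [1, 0]]

# Three of the four toy streams are segmented perfectly.
stp = straight_through_processing(true_streams, pred_streams)

# Reported MNDD values: XGBoost baseline vs. best LLM.
xgb_mndd, llm_mndd = 10.85, 0.81
effort_reduction = xgb_mndd / llm_mndd  # about 13.4
```

<p>A single wrong label anywhere in a stream removes it from the STP count, which is why STP is much harsher than page-level accuracy.</p>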
<h3 id="document-level-precision-and-recall">Document-Level Precision and Recall</h3>
<p>We define a ground truth segmentation $\mathcal{G}$ and a predicted segmentation $\mathcal{P}$. A &ldquo;True Positive&rdquo; is defined strictly as a document present in both sets ($\mathcal{P} \cap \mathcal{G}$). The metrics are calculated as:</p>
<p>$$P = \frac{|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P} \cap \mathcal{G}| + |\mathcal{P} \setminus \mathcal{G}|}$$</p>
<p>$$R = \frac{|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P} \cap \mathcal{G}| + |\mathcal{G} \setminus \mathcal{P}|}$$</p>
<p>This strict definition rewards a model only when it gets <em>both</em> the start and end boundaries of a document correct.</p>
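<p>The definitions of $P$ and $R$ can be sketched as set operations over document spans. Representing each document as a <code>(start, end)</code> page-index pair is an assumption made for illustration, as are the function names and the toy labels.</p>

```python
def breaks_to_documents(breaks):
    """Convert per-page labels (1 = new document starts) into (start, end) spans."""
    docs, start = [], 0
    for i in range(1, len(breaks)):
        if breaks[i] == 1:
            docs.append((start, i - 1))
            start = i
    docs.append((start, len(breaks) - 1))
    return set(docs)

def document_precision_recall(true_breaks, pred_breaks):
    G = breaks_to_documents(true_breaks)
    P = breaks_to_documents(pred_breaks)
    tp = len(P & G)  # documents with both boundaries correct
    precision = tp / (tp + len(P - G))
    recall = tp / (tp + len(G - P))
    return precision, recall

true_breaks = [1, 0, 0, 1, 0, 1]  # documents: (0, 2), (3, 4), (5, 5)
pred_breaks = [1, 0, 1, 1, 0, 1]  # documents: (0, 1), (2, 2), (3, 4), (5, 5)

precision, recall = document_precision_recall(true_breaks, pred_breaks)
```

<p>One spurious break at page 2 splits the first document into two wrong predictions, so precision falls to 0.5 and recall to 2/3 even though five of the six page labels are correct.</p>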
<h2 id="technical-innovation">Technical Innovation</h2>
<p>Our approach combines commercial-grade OCR processing with parameter-efficient fine-tuning of large language models. We addressed two main bottlenecks: data quality and model efficiency.</p>
<h3 id="data-remediation">Data Remediation</h3>
<p>The original <a href="https://github.com/aldolipani/TABME">TabMe</a> dataset relied on Tesseract OCR, which introduced significant noise. By reprocessing the images with Microsoft OCR, we reduced the number of &ldquo;blank&rdquo; pages from <strong>2.27% to just 0.38%</strong>, recovering critical features like titles and ID numbers that were previously lost.</p>
<h3 id="model-architecture">Model Architecture</h3>
<p>We formulated the task as a <strong>binary classification of page pairs</strong>: predicting if a &ldquo;break&rdquo; exists between Page $N$ and Page $N+1$.</p>
<h4 id="problem-formulation">Problem Formulation</h4>
<p>The task is treated as a binary classification problem over a window of pages. For a specific page $p_i$, the model predicts a binary label $y_i$ based on a window of adjacent pages $(p_{i-l}, \ldots, p_i, \ldots, p_{i+r})$. In this work, we strictly defined the window as:</p>
<p>$$l=1, \quad r=0$$</p>
<p>This means the decision for page $p_i$ is made solely based on the pair $(p_{i-1}, p_i)$.</p>
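<p>With $l=1$ and $r=0$, segmentation reduces to one decision per adjacent page pair. The sketch below shows that control flow under stated assumptions: <code>predict_break</code> stands in for the fine-tuned model, and the toy rule used here (compare the first word of each page) is purely hypothetical.</p>

```python
def page_pairs(pages):
    """Yield (previous_page, current_page) for every page after the first."""
    for i in range(1, len(pages)):
        yield pages[i - 1], pages[i]

def segment(pages, predict_break):
    """Split a page stream into documents using a pairwise break classifier."""
    documents = [[pages[0]]]  # the first page always starts a document
    for prev, cur in page_pairs(pages):
        if predict_break(prev, cur):
            documents.append([cur])
        else:
            documents[-1].append(cur)
    return documents

def toy_classifier(prev, cur):
    # Hypothetical rule: a break occurs when the leading word changes.
    return prev.split()[0] != cur.split()[0]

pages = ["Invoice page 1", "Invoice page 2", "Claim form", "Claim appendix", "Letter"]
documents = segment(pages, toy_classifier)
```

<p>Because each decision sees only two pages, a stream of $n$ pages requires $n-1$ independent model calls, which can be batched.</p>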
<h3 id="efficient-tuning">Efficient Tuning</h3>
<p>We used <strong>Low-Rank Adaptation (LoRA)</strong> and 4-bit quantization to fine-tune <a href="https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit">Mistral-7B</a> and <a href="https://huggingface.co/unsloth/Phi-3-mini-4k-instruct">Phi-3-mini</a> on a single NVIDIA H100 GPU, showing that PSS does not require massive compute clusters.</p>
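<p>To show why LoRA keeps the trainable footprint small, here is a NumPy sketch of the low-rank update on a single weight matrix. The dimensions, rank, and scaling factor are illustrative assumptions and not the settings used in the paper; 4-bit quantization of the frozen weights is not modeled.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8  # illustrative sizes, not the paper's settings

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init so the update starts at zero

def lora_forward(x):
    # Frozen path plus the scaled low-rank update (alpha / r) * B @ A.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(3, d_in))
y = lora_forward(x)

trainable_params = A.size + B.size  # 2 * r * 64 = 512
full_params = W.size                # 64 * 64 = 4096
```

<p>Only <code>A</code> and <code>B</code> receive gradients, so in this toy layer one eighth of the parameters are trained; the fraction shrinks further as the layer width grows relative to the rank.</p>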
<h2 id="why-this-matters">Why This Matters</h2>
<p>Page Stream Segmentation is the critical first step in any automated document processing pipeline. If a system fails to correctly separate documents, all downstream tasks (like classification or data extraction) will operate on corrupted inputs. By demonstrating that LLMs fine-tuned with parameter-efficient methods can achieve an 80% straight-through processing rate, this work provides a viable path toward fully automating high-volume document workflows.</p>
<p>Beyond the path to automation, this work gives the research community improved evaluation tools: the enhanced <a href="https://huggingface.co/datasets/rootsautomation/TABMEpp">TabMe++</a> dataset and the quantified impact of OCR quality on PSS performance have direct applications in commercial document processing pipelines.</p>
<p>We later extended these findings to real-world industrial deployment and analyzed model calibration challenges in our follow-up <a href="/research/page-stream-segmentation-llms/">COLING Industry paper on LLMs for Insurance Document Automation</a>. The calibration challenges that emerged from that deployment are explored in depth in <a href="/posts/reliability-trap-document-automation/">The Reliability Trap: When 99% Accuracy Isn&rsquo;t Enough</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2024largelanguagemodelspage,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large Language Models for Page Stream Segmentation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich and Ratish Dalvi and Rohith Mukku and Nikhil Verma and Neven Pičuljan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2408.11981}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2408.11981}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>EigenNoise: Data-Free Word Vector Initialization</title><link>https://hunterheidenreich.com/research/eigennoise-contrastive-prior/</link><pubDate>Sun, 01 May 2022 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/eigennoise-contrastive-prior/</guid><description>Investigation into EigenNoise, a data-free initialization scheme for word vectors that approaches pre-trained model performance after fine-tuning.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We developed EigenNoise, a method to initialize word vectors using <strong>zero pre-training data</strong>. By deriving a co-occurrence matrix solely from the theoretical harmonic structure of language (Zipf&rsquo;s Law), this project demonstrates that we can mathematically synthesize a &ldquo;warm-start&rdquo; for NLP models. This approach challenges the reliance on massive corpora for initialization and offers a competitive alternative for low-resource environments.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li>A <strong>data-free initialization scheme</strong>: word vectors derived from a co-occurrence matrix synthesized from independent (Zipfian) frequency statistics, with no pre-training corpus.</li>
<li>Grounds the construction in the <strong>harmonic statistical structure</strong> of language, so the representation follows from first principles rather than from data.</li>
<li>Evaluates with <strong>Minimum Description Length (MDL)</strong> probing, which measures how much task-relevant information a representation encodes and how compactly, rather than raw accuracy.</li>
<li>After fine-tuning, EigenNoise <strong>approaches</strong> the performance of GloVe (trained on Gigaword) despite seeing <strong>no pre-training text</strong>.</li>
</ul>
<h2 id="technical-implementation">Technical Implementation</h2>
<p>The core insight is that &ldquo;noise&rdquo; in language follows a predictable distribution.</p>
<ol>
<li><strong>Modeling</strong>: We model the &ldquo;null hypothesis&rdquo; of text: how words would co-occur if they were statistically independent but followed a Zipfian rank-frequency distribution. This yields a theoretical co-occurrence matrix $\hat{X}$:</li>
</ol>
<p>$$\hat{X}_{ij} = \frac{2mN}{r_i r_j H_N}$$</p>
<p>Where $r_i$ is the rank of word $i$, $N$ is vocabulary size, $m$ is the context window size, and $H_N$ is the $N$-th harmonic number.</p>
<ol start="2">
<li>
<p><strong>Factorization</strong>: We then solve for the word vectors by performing an <strong>eigen-decomposition</strong> on this matrix, extracting the top $d$ components to form the representation space.</p>
</li>
<li>
<p><strong>Probing</strong>: Validated performance using MDL probing on CoNLL-2003 and TweetEval benchmarks.</p>
</li>
</ol>
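<p>The modeling step can be checked numerically. The sketch below builds $\hat{X}$ from the formula above and compares it against an independent-frequencies construction of the form $2m f_i f_j / M$. The Zipfian counts $f_r = N / r$ (so the rarest word appears once and the corpus size is $M = N H_N$) are an assumption chosen because it makes the two forms agree; the values of $N$ and $m$ are illustrative.</p>

```python
import numpy as np

N, m = 1000, 5  # vocabulary size and context window (illustrative values)
ranks = np.arange(1, N + 1)
H_N = np.sum(1.0 / ranks)  # N-th harmonic number

# Theoretical co-occurrence matrix from the closed form.
X_hat = 2 * m * N / (np.outer(ranks, ranks) * H_N)

# Assumed Zipfian unigram counts: the word of rank r occurs N / r times.
f = N / ranks
M = f.sum()  # total token count, equal to N * H_N under this assumption

# Independent-frequencies construction from unigram counts alone.
X_from_counts = 2 * m * np.outer(f, f) / M
```

<p>Under that assumption the two matrices coincide entry by entry, which is one way to see that $\hat{X}$ requires no corpus: the ranks alone determine every count.</p>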
<h2 id="why-this-matters">Why This Matters</h2>
<p>This research explores how much structure can emerge from frequency statistics alone, with no text exposure at all. The central finding is that EigenNoise vectors, derived purely from Zipf&rsquo;s Law, reach performance competitive with GloVe after fine-tuning. This is evidence that a significant portion of what we call &ldquo;learned linguistic knowledge&rdquo; is a consequence of word frequency distributions, not semantic exposure to real text.</p>
<p>In 2026, small pretrained models are freely available and handle most low-resource initialization needs, so the practical case for data-free initialization is narrower than it was in 2022. The theoretical contribution remains relevant: EigenNoise establishes a clean null hypothesis for what word vectors look like when only frequency information is present. For interpretability researchers trying to disentangle frequency artifacts from genuine semantic content, this baseline has value independent of the initialization use case.</p>
<p>The <strong>MDL probing</strong> methodology applied here also contributes beyond the main result. Unlike task accuracy, MDL measures how much information a representation encodes and how compactly, providing a more principled lens for evaluating representational quality. EigenNoise&rsquo;s co-occurrence prior is grounded directly in the <strong>Independent Frequencies Model (IFM)</strong> introduced in the companion <a href="/research/word-company-vicinity/">Word2Vec factorization paper</a>. Together, the two works form a coherent theoretical line: the IFM characterizes the frequency-driven baseline of embedding space, and EigenNoise operationalizes it as a practical, data-free initialization scheme.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2022eigennoisecontrastivepriorwarmstart,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{EigenNoise: A Contrastive Prior to Warm-Start Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Scott Heidenreich and Jake Ryland Williams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2205.04376}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2205.04376}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For the theoretical foundation underlying EigenNoise&rsquo;s null hypothesis, including the first analytical solution to Word2Vec&rsquo;s softmax objective, see <a href="/research/word-company-vicinity/">Analytical Solution to Word2Vec Softmax &amp; Bias Probing</a>.</p>
]]></content:encoded></item><item><title>Analytical Solution to Word2Vec Softmax &amp; Bias Probing</title><link>https://hunterheidenreich.com/research/word-company-vicinity/</link><pubDate>Sun, 01 May 2022 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/word-company-vicinity/</guid><description>Analytical derivation of Word2Vec's softmax objective factorization and a new framework for detecting semantic bias in raw corpora.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>While the Skip-Gram with Negative Sampling (SGNS) objective for Word2Vec has famously been shown to factorize a shifted PMI matrix, the implicit matrix factorization of the original <strong>Softmax</strong> objective has remained an open question. In this work, we provide the first known analytical solution to Word2Vec&rsquo;s softmax-optimized skip-gram algorithm.</p>
<p>We use this derivation to introduce the <strong>Independent Frequencies Model (IFM)</strong>, identifying a &ldquo;frequency-ratios property&rdquo; that unifies classical word vector models. This theoretical insight allows us to derive a low-cost, training-free method for measuring semantic bias directly from corpus statistics.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Analytical Solution</strong>: Proved that the softmax skip-gram objective converges to a factorization of the log-conditional probability matrix, a derivation for Word2Vec&rsquo;s original (unapproximated) objective that prior work had left open.</li>
<li><strong>Independent Frequencies Model (IFM)</strong>: Introduced a dense co-occurrence model computable purely from unigram frequencies to act as a null hypothesis for embedding structures.</li>
<li><strong>Bias Dissonance Metric</strong>: Derived a low-cost, training-free method for measuring semantic bias directly from corpus statistics using the frequency-ratios property.</li>
<li><strong>Data Transparency</strong>: Demonstrated how specific corpora exhibit distinct bias profiles, offering a tool for auditing datasets before training large models.</li>
</ul>
<h2 id="key-theoretical-results">Key Theoretical Results</h2>
<h3 id="1-the-softmax-factorization-theorem">1. The Softmax Factorization Theorem</h3>
<p>We prove that under the log-softmax objective, Word2Vec implicitly converges towards a factorization of the <strong>log-conditional probability matrix</strong> of the co-occurrence model.</p>
<p><strong>Theorem:</strong> For the objective
$\mathcal{L}_{\text{soft}} = - \sum _{t,s} F _{t,s}^m \log \varphi (\vec{u}_t \vec{v}_s^{T})$,
the algorithm converges to:</p>
<p>$$
\vec{u}_{t}\vec{v}_{s}^{T} = \log\frac{F_{t,s}^{m}}{f_{t}^{m}}
$$</p>
<p>where $F_{t,s}^m$ is the co-occurrence count and $f_t^m$ is the marginal frequency. This effectively makes the dot product of the embedding vectors equal to the log-conditional probability of the context word given the target word.</p>
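<p>A small numerical check can make the theorem tangible. The sketch below treats the dot products as free logits and assumes that $\varphi$ is a softmax over context words $s$ and that the marginal $f_t^m$ is the row sum of $F^m$; both are assumptions about notation, and the toy counts are random.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.integers(1, 20, size=(5, 5)).astype(float)  # toy co-occurrence counts F[t, s]
f = F.sum(axis=1, keepdims=True)                    # assumed marginal: row sums of F

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(logits):
    # Count-weighted negative log-likelihood of context words.
    return -np.sum(F * np.log(softmax(logits)))

optimal = np.log(F / f)  # logits claimed by the theorem
perturbed = optimal + 0.1 * rng.normal(size=F.shape)

loss_optimal = loss(optimal)
loss_perturbed = loss(perturbed)
```

<p>At the claimed solution the softmax reproduces the empirical conditional probabilities $F_{t,s}^m / f_t^m$ exactly, and any perturbation that is not constant along a row increases the loss.</p>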
<h3 id="2-the-independent-frequencies-model-ifm">2. The Independent Frequencies Model (IFM)</h3>
<p>To understand the baseline behavior of these models, we introduce the IFM, which models a dense co-occurrence matrix computable purely from unigram frequencies:</p>
<p>$$
\hat{F}_{t,s}^{m} = \frac{2m f_t f_s}{M}
$$</p>
<p>This model acts as a &ldquo;null hypothesis&rdquo; for embedding structures, allowing us to isolate true semantic signals from statistical noise.</p>
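<p>The IFM is cheap to compute because it needs only unigram counts. In the sketch below, $M$ is assumed to be the total number of tokens in the corpus, which is the reading under which each of the $2m$ context slots around a word is filled independently with probability $f_s / M$; the toy sentence is hypothetical.</p>

```python
from collections import Counter

import numpy as np

tokens = "the cat sat on the mat and the dog sat on the rug".split()
m = 2  # context window size on each side

counts = Counter(tokens)
vocab = sorted(counts)
f = np.array([counts[w] for w in vocab], dtype=float)  # unigram frequencies
M = f.sum()  # assumed: total token count

# Dense expected co-occurrence matrix under statistical independence.
F_hat = 2 * m * np.outer(f, f) / M

row_totals = F_hat.sum(axis=1)  # expected co-occurrences per target word
```

<p>Each row sums to $2m f_t$, so every occurrence of a word contributes exactly $2m$ expected context slots, and the whole matrix sums to $2mM$.</p>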
<h2 id="methodological-innovation-bias-dissonance">Methodological Innovation: Bias Dissonance</h2>
<p>Leveraging the frequency-ratios property derived from our factorization, we propose a metric called <strong>Dissonance ($\Delta$)</strong> to probe semantic bias in data without training a model.</p>
<p>For an analogy $A:B :: C:D$ (e.g., <em>man:king :: woman:queen</em>), we measure the alignment of their corpus frequency ratios. High dissonance indicates that the corpus statistics do not support the analogy, potentially revealing bias or under-representation.</p>
<p><strong>Intuitive Example:</strong> If a corpus contains the phrase <em>&ldquo;man is king&rdquo;</em> 100 times more often than <em>&ldquo;woman is queen,&rdquo;</em> the frequency ratios are misaligned. A perfect, unbiased analogy would have matching ratios (i.e., <em>man</em> relates to <em>king</em> at the same rate <em>woman</em> relates to <em>queen</em>). Any deviation from this symmetry is captured by our dissonance metric, revealing where the data itself encodes asymmetric associations.</p>
<p>$$
\Delta(x,y|\mathcal{D}) = \left| \log\frac{f_{t}f_{\bar{s}}}{f_{s}f_{\bar{t}}} \right| / \max_{l \in \mathcal{V}} \{ \log f_l \}
$$</p>
<p>By applying this to the <strong>Bigger Analogy Test Set (BATS)</strong>, we demonstrated how specific corpora (like Wikipedia vs. Google Books) exhibit distinct bias profiles regarding geographic and encyclopedic knowledge.</p>
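<p>A small sketch of the dissonance formula above, under the assumption that the analogy is read as $t:s :: \bar{t}:\bar{s}$ (so <em>man:king :: woman:queen</em> maps to $t$, $s$, $\bar{t}$, $\bar{s}$ in that order). The frequencies are made-up numbers for illustration, not corpus measurements.</p>

```python
import math


def dissonance(f_t, f_s, f_t_bar, f_s_bar, max_log_freq):
    """Normalized absolute log of the ratio (f_t * f_s_bar) / (f_s * f_t_bar)."""
    ratio = (f_t * f_s_bar) / (f_s * f_t_bar)
    return abs(math.log(ratio)) / max_log_freq


# Hypothetical unigram counts; the normalizer is the largest log-frequency in the vocabulary.
freqs = {"man": 1000, "king": 100, "woman": 500, "queen": 50}
normalizer = max(math.log(f) for f in freqs.values())

# man/king equals woman/queen here, so the ratios match and dissonance vanishes.
print(dissonance(freqs["man"], freqs["king"], freqs["woman"], freqs["queen"], normalizer))  # 0.0
```

<p>Lowering the hypothetical count of <em>queen</em> while holding the others fixed moves the ratio away from 1 and the score away from 0, which is the asymmetry the metric is meant to surface.</p>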
<h2 id="visualizing-statistical-independence">Visualizing Statistical Independence</h2>
<figure class="post-figure center ">
    <img src="/img/word-bias-iqr.webp"
         alt="Plot showing the portion of statistically dependent information decreasing as window size increases, with curves for different corpus sizes and an inset showing power-law decay"
         title="Plot showing the portion of statistically dependent information decreasing as window size increases, with curves for different corpus sizes and an inset showing power-law decay"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The Information Quality Ratio measuring the portion of co-occurrence information that is statistically dependent, plotted against window size. Colors indicate corpus size from the GUM corpus. The dashed lines show the IFM prediction. The inset reveals the power-law decay rate, demonstrating how linguistic dependencies diminish predictably with context distance.</figcaption>
    
</figure>

<h2 id="impact">Impact</h2>
<p>This work bridges the gap between empirical success and theoretical foundations in NLP by:</p>
<ol>
<li><strong>Explaining a fundamental mechanism:</strong> Providing the missing factorization proof for Softmax Word2Vec.</li>
<li><strong>Efficient pre-training (a future direction):</strong> The factorization suggests embedding layers could be &ldquo;warm-started&rdquo; from unigram statistics via the IFM, a direction taken up in the companion <a href="/research/eigennoise-contrastive-prior/">EigenNoise</a> work.</li>
<li><strong>Data Transparency:</strong> Offering a computationally inexpensive tool for auditing datasets for bias before investing resources in training large models.</li>
</ol>
<h2 id="my-contribution">My Contribution</h2>
<p>Jake Williams is the first author and primary driver of this work. He developed the core theory, derived the factorization proofs, designed the dissonance metric, and ran the experiments. My role was supporting: I contributed through critique and refinement during the writing process, but the intellectual heavy lifting belongs to Jake.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{williams2022knowcompanywordslies,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{To Know by the Company Words Keep and What Else Lies in the Vicinity}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jake Ryland Williams and Hunter Scott Heidenreich}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2205.00148}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2205.00148}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For a complementary analytical approach to word representations, deriving data-free word vector initializations from the same frequency-ratio insights, see <a href="/research/eigennoise-contrastive-prior/">EigenNoise: Data-Free Word Vector Initialization</a>.</p>
]]></content:encoded></item><item><title>Look, Don't Tweet: Unified Data Models for Social NLP</title><link>https://hunterheidenreich.com/research/look-dont-tweet/</link><pubDate>Wed, 30 Jun 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/look-dont-tweet/</guid><description>PyConversations library and unified data schema for normalizing 300M+ posts across Twitter, Reddit, Facebook, and 4chan.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>This is my undergraduate senior thesis, completed at Drexel University in 2021. Its scope was unusually broad for a senior thesis: large-scale data engineering over 308 million posts from four platforms, graph-structural analysis of conversation topology, and domain-adaptation experiments with Transformer models.</p>
<p>Social media research is often siloed by platform, with tools built specifically for Twitter&rsquo;s flat structure or Reddit&rsquo;s tree structure. This fragmentation makes cross-platform analysis difficult. In this work, I introduce <strong><a href="https://github.com/hunter-heidenreich/pyconversations">PyConversations</a></strong>, an open-source Python package that normalizes data from Twitter, Facebook, Reddit, and 4chan into a single, platform-agnostic data model. <em>(Note: the repository is archived and no longer actively maintained.)</em></p>
<p>Leveraging this tool, I processed over <strong>308 million posts</strong> to analyze the structural &ldquo;shape&rdquo; of online conversations. I then evaluated the efficacy of domain-adaptive pre-training (DAPT) for Transformer-based language models, finding that training on a toxic domain (4chan) boosts hate-speech detection by over 5 F1 points.</p>
<h2 id="the-engineering-problem-data-normalization">The Engineering Problem: Data Normalization</h2>
<p>Social media platforms impose different structural constraints on discourse, making it difficult to feed heterogeneous data into a single ML pipeline:</p>
<ul>
<li><strong>Twitter:</strong> Technically allows infinite depth, but functionally operates as a flat stream or shallow tree.</li>
<li><strong>Facebook:</strong> Enforces a hard limit of two depth levels (comments and replies), resulting in &ldquo;short and fat&rdquo; conversation trees.</li>
<li><strong>Reddit &amp; 4chan:</strong> Allow for deep, branching tree structures.</li>
</ul>
<p>To solve this, I designed a <strong>Universal Message Schema</strong> and the <strong>PyConversations</strong> library. This system ingests raw dumps from these disparate sources and maps them to a unified Directed Acyclic Graph (DAG) format, preserving the parent-child relationships regardless of the source platform&rsquo;s constraints.</p>
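<p>To make the idea of a unified schema concrete, here is a minimal sketch of a platform-agnostic message record with parent links. The class and field names are hypothetical and do not reproduce the actual PyConversations API; parents are stored as a tuple so that a post may reference more than one earlier post, which is what makes the structure a DAG and not a strict tree.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical platform-agnostic post record; field names are illustrative."""
    uid: str
    platform: str
    text: str
    reply_to: tuple = ()  # uids of parent posts; empty for a root post


@dataclass
class Conversation:
    """Messages indexed by uid, with parent-child links recovered from reply_to."""
    messages: dict = field(default_factory=dict)

    def add(self, message):
        self.messages[message.uid] = message

    def children(self, uid):
        return [m.uid for m in self.messages.values() if uid in m.reply_to]

    def roots(self):
        # A root has no parent that is present in this conversation.
        return [
            m.uid
            for m in self.messages.values()
            if not any(parent in self.messages for parent in m.reply_to)
        ]


conv = Conversation()
conv.add(Message("t1", "twitter", "original post"))
conv.add(Message("t2", "twitter", "a reply", reply_to=("t1",)))
conv.add(Message("t3", "twitter", "another reply", reply_to=("t1",)))
print(conv.roots())         # ['t1']
print(conv.children("t1"))  # ['t2', 't3']
```

<p>Because every platform is reduced to the same <code>reply_to</code> edges, downstream analysis does not need to know whether the source enforced a flat stream, a two-level limit, or unbounded nesting.</p>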
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>PyConversations Library</strong>: An open-source package for robust conversational analysis, featuring graph-based traversal and filtering.</li>
<li><strong>Massive Dataset Analysis</strong>: Processed a collection of <strong>308 million posts</strong> and <strong>15.8 million conversations</strong>, creating one of the largest comparative cross-platform analyses at the time of thesis submission.</li>
<li><strong>Structural Insights</strong>: Quantified how UI constraints shape human behavior. For instance, Facebook&rsquo;s depth limit forces users to &ldquo;bunch&rdquo; comments, creating uniquely wide conversation trees compared to Reddit&rsquo;s deep, narrow threads.</li>
<li><strong>Domain Adaptation Experiments</strong>: Continued pre-training of RoBERTa on platform-specific slices (e.g., the 4chan-adapted <code>RoBERTa-4chan</code>) showed that exposing models to toxic domains improved hate-speech detection by over 5 F1 points.</li>
</ul>
<h2 id="structural-analysis-findings">Structural Analysis Findings</h2>
<p>By treating conversations as graphs, I uncovered distinct topological signatures for each platform:</p>
<h3 id="the-shape-of-discourse">The &ldquo;Shape&rdquo; of Discourse</h3>
<p>I measured the <strong>width</strong> (max posts at any depth) and <strong>depth</strong> (max distance from root) of conversation trees.</p>
<ul>
<li><strong>Facebook</strong> exhibited a &ldquo;short and fat&rdquo; topology due to its 2-level nesting limit.</li>
<li><strong>4chan</strong> threads were surprisingly shallow despite having no depth limits. This suggests that the platform&rsquo;s <strong>ephemerality</strong> (threads are deleted quickly) and the &ldquo;bump limit&rdquo; mechanic discourage long-term dialogue, though data scraping limitations on this transient platform also contribute to this topology.</li>
<li><strong>Reddit</strong> maintained the most robust tree structures, with &ldquo;good faith&rdquo; communities like <em>r/ChangeMyView</em> showing distinct patterns of sustained engagement.</li>
</ul>
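<p>The two measurements defined above can be sketched with a breadth-first traversal. This is an illustrative implementation over a plain child-to-parent mapping, assuming a non-empty tree; it is not the thesis code.</p>

```python
from collections import Counter, deque


def tree_shape(parent_of):
    """Return (width, depth) for a conversation given a child -> parent mapping.

    Root posts map to None. Depth is the maximum distance from a root;
    width is the maximum number of posts found at any single depth.
    """
    children = {}
    roots = []
    for post, parent in parent_of.items():
        if parent is None:
            roots.append(post)
        else:
            children.setdefault(parent, []).append(post)

    per_level = Counter()
    queue = deque((root, 0) for root in roots)
    while queue:
        post, level = queue.popleft()
        per_level[level] += 1
        for child in children.get(post, []):
            queue.append((child, level + 1))

    return max(per_level.values()), max(per_level)


# A two-level, comment-heavy thread versus a single reply chain (toy data).
wide = {"post": None, "c1": "post", "c2": "post", "c3": "post", "c4": "post", "r1": "c1"}
deep = {"a": None, "b": "a", "c": "b", "d": "c"}
print(tree_shape(wide))  # (4, 2)
print(tree_shape(deep))  # (1, 3)
```

<p>The first toy thread mirrors the &ldquo;short and fat&rdquo; shape described for Facebook, and the second the deep, narrow shape described for Reddit.</p>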
<h3 id="information-density">Information Density</h3>
<p>I analyzed <strong>Innovation Rate</strong>, a measure of how quickly a text introduces new vocabulary. Twitter threads have negative innovation rates (indicating high novelty per token), likely forced by the strict character limits. In contrast, Reddit posts showed higher redundancy, typical of longer-form essay writing.</p>
<h2 id="representation-learning--domain-adaptation">Representation Learning &amp; Domain Adaptation</h2>
<p>I experimented with &ldquo;Warm-Start&rdquo; tuning: taking a standard RoBERTa model and pre-training it further on platform-specific data before fine-tuning on downstream tasks (TweetEval).</p>
<ul>
<li><strong>Limited gains on most general tasks:</strong> Domain-adaptive pre-training added little on sentiment and emotion (from well under 1 up to a few F1 points), although irony detection improved more (+5.6 to +5.9 F1). Base RoBERTa already covers most of the signal for general NLP tasks.</li>
<li><strong>The Toxic Exception:</strong> The other notable exception was <strong>Hate Speech Detection</strong>. The 4chan-adapted model (<code>RoBERTa-4chan</code>) was the strongest here, outperforming the baseline by over 5 F1 points. This highlights that for specialized, out-of-distribution language (like toxic slang), domain adaptation remains valuable.</li>
</ul>
<h2 id="significance">Significance</h2>
<p>This work bridges the gap between <strong>Computational Social Science</strong> and <strong>ML Engineering</strong>. It provides the community with a reusable tool (<code>PyConversations</code>) to handle the messy reality of social data and offers empirical evidence on the limits and benefits of domain-adaptive pre-training for LLMs.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@thesis</span>{heidenreich2021look,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Look, Don&#39;t Tweet: Representation Learning and Social Media}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Heidenreich}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">school</span>=<span style="color:#e6db74">{Drexel University}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">type</span>=<span style="color:#e6db74">{Undergraduate Senior Thesis}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For related work on how social media content surfaces in digital journalism, including a dataset of embedded tweets across 273,899 news articles, see <a href="/research/newstweet-social-media-journalism/">NewsTweet Dataset: Social Media in Digital Journalism</a>.</p>
]]></content:encoded></item><item><title>GPT-2 Susceptibility to Universal Adversarial Triggers</title><link>https://hunterheidenreich.com/research/gpt2-adversarial-triggers/</link><pubDate>Sat, 01 May 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/gpt2-adversarial-triggers/</guid><description>Investigation into whether universal adversarial triggers can control both topic and stance of GPT-2's generated text and security implications.</description><content:encoded><![CDATA[<blockquote>
<p><strong>Historical context:</strong> This paper was published at AIES 2021, the AAAI/ACM Conference on AI, Ethics, and Society, predating the modern red-teaming practices and adversarial robustness benchmarks that emerged with instruction-tuned and RLHF-trained models. GPT-2 is now a historical baseline, but the core methodology and findings remain a relevant foundation for current adversarial robustness work.</p></blockquote>
<h2 id="abstract">Abstract</h2>
<p>This work investigates universal adversarial triggers (UATs), a method for disrupting language models using input-agnostic token sequences. We investigated whether it is possible to use these triggers to control the <strong>topic</strong> and the <strong>stance</strong> of text generated by GPT-2. Across four controversial topics, we demonstrated success in identifying triggers that guide the model to produce text on a targeted subject and influence the position it takes. Our goal is to raise awareness that even deployed models are susceptible to this influence and to advocate for immediate safeguards.</p>
<h2 id="key-findings--contributions">Key Findings &amp; Contributions</h2>
<ul>
<li><strong>Topic and Stance Control</strong>: We were the first to systematically explore using UATs to control both the topic and the stance of a language model&rsquo;s output. We found that controlling the topic is highly feasible, and controlling the stance is also possible.</li>
<li><strong>The &ldquo;Filter Bubble&rdquo; Hypothesis</strong>: We observed that triggers for fringe topics (e.g., Flat Earth) were harder to find but offered a higher degree of stance control than broader topics. We posit this may reflect &ldquo;filter bubbles&rdquo; in the training data, where fringe viewpoints use distinct linguistic patterns.</li>
<li><strong>Ethical &amp; Security Analysis</strong>: We highlighted the security risks of deployed models being manipulated by external adversaries without internal model access. To be responsible, we withheld the most sensitive triggers we discovered.</li>
<li><strong>Constructive Applications</strong>: Beyond a security flaw, we proposed that UATs could be used constructively as a <strong>diagnostic tool</strong> to audit models for bias or as a method for <strong>bot detection</strong> on social media.</li>
</ul>
<h2 id="significance--why-this-matters">Significance &amp; Why This Matters</h2>
<p>This work extended early research on UATs by moving beyond single-issue attacks (like generating toxic content) to a nuanced analysis of topic and stance control. It demonstrated that a <strong>gradient-based search process (adapting HotFlip)</strong> is effective at manipulating model outputs, emphasizing a critical vulnerability for any organization deploying large language models.</p>
<p>For ML practitioners and security researchers, this highlights the importance of robust safeguards against input-agnostic attacks. It also opens the door to using these same adversarial techniques constructively: as diagnostic tools to audit models for hidden biases or to detect automated bot activity on social media platforms.</p>
<h2 id="related-work">Related Work</h2>
<p>The constructive bot-detection application proposed here connects directly to empirical work on coordinated inauthentic behavior. <a href="/research/coordinated-social-targeting/">Coordinated Social Targeting on Twitter</a> documents real-world follower-manipulation patterns on high-profile accounts, illustrating the kind of automated adversarial activity that UAT-based detection methods could help identify.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{10.1145/3461702.3462578,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Heidenreich, Hunter Scott and Williams, Jake Ryland}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The Earth Is Flat and the Sun Is Not a Star: The Susceptibility of GPT-2 to Universal Adversarial Triggers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{9781450384735}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computing Machinery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{New York, NY, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1145/3461702.3462578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3461702.3462578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{566--573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">numpages</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{adversarial attacks, bias, language modeling, natural language processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">location</span> = <span style="color:#e6db74">{Virtual Event, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{AIES &#39;21}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NewsTweet Dataset: Social Media in Digital Journalism</title><link>https://hunterheidenreich.com/research/newstweet-social-media-journalism/</link><pubDate>Sat, 01 Aug 2020 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/newstweet-social-media-journalism/</guid><description>NewsTweet dataset for studying embedded tweets in online journalism. Analysis shows 13% of Google News stories contain tweets.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We introduce NewsTweet, a dataset and data collection pipeline designed to study the embedding of social media in digital journalism. Our descriptive analysis of articles collected from Google News (chosen for its significant role in shaping attention) reveals that 13% of stories include embedded tweets. The dataset provides a foundation for exploring how social media content is sourced and which users become newsworthy. <em>(Note: this is an arXiv preprint from 2020 and was not published at a peer-reviewed venue.)</em></p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Large-Scale Dataset</strong>: A dataset of 273,899 news articles, with 35,218 containing embedded tweets, collected from Google News RSS feeds over a four-month period.</li>
<li><strong>Data Collection Pipeline</strong>: Details an automated pipeline for acquiring news articles, extracting embedded tweets, and collecting the corresponding user timelines from Twitter&rsquo;s API.</li>
<li><strong>Descriptive Statistics</strong>: Presents statistics on the prevalence of tweet embedding across different news categories, outlets, and users, highlighting key patterns.</li>
</ul>
<h2 id="data-availability">Data Availability</h2>
<p>The NewsTweet dataset is not publicly available for direct download. Due to Twitter/X&rsquo;s Terms of Service restrictions on redistributing tweet content, the full dataset cannot be shared openly. Researchers interested in accessing the data or the collection pipeline are encouraged to contact the authors via the <a href="https://arxiv.org/abs/2008.02870">arXiv paper (arXiv:2008.02870)</a>.</p>
<h2 id="dataset-characteristics">Dataset Characteristics</h2>
<h3 id="scale-and-coverage">Scale and Coverage</h3>
<ul>
<li><strong>News Sources</strong>: 5,961 unique news domains aggregated through Google News RSS feeds.</li>
<li><strong>Time Period</strong>: Data collection initiated on May 15th, 2019, with the paper describing the first four months of data.</li>
<li><strong>Collection Velocity</strong>: The pipeline averaged <strong>2,302 articles per day</strong>, with approximately 296 containing embedded tweets.</li>
<li><strong>Content Types</strong>: Focuses specifically on embedded tweets from Twitter, the most frequently embedded platform.</li>
<li><strong>Metadata</strong>: Includes article source, Google News category (e.g., Sports, Health), and full tweet and user objects from the Twitter API.</li>
</ul>
<h3 id="technical-implementation">Technical Implementation</h3>
<ul>
<li><strong>RSS-to-API Pipeline</strong>: Automatically crawls Google News RSS feeds to extract article HTML, identifying embedded tweet IDs to fetch full objects via the Twitter API.</li>
<li><strong>Artifact Filtering</strong>: Implements cleaning protocols to handle artifacts, such as detecting and excluding YouTube pages that appear as articles in Google News feeds.</li>
<li><strong>Longitudinal Tracking</strong>: Features a &ldquo;top-off&rdquo; mechanism that continuously tracks discovered users, updating their timelines to capture historical context.</li>
<li><strong>Rate Limit Management</strong>: Utilizes a random sampling queue to maintain continuous data collection across thousands of users without exceeding Twitter API limits.</li>
</ul>
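<p>The tweet-identification step in the pipeline above can be approximated with a regular expression over the article markup. This sketch assumes that embeds carry a link of the form <code>twitter.com/USER/status/ID</code>; it would also match ordinary hyperlinks to tweets, so a real pipeline would need a stricter check on the surrounding embed markup. The function name is hypothetical.</p>

```python
import re

# Matches status links of the form twitter.com/USER/status/ID (assumed embed markup).
STATUS_LINK = re.compile(r"twitter\.com/(\w+)/status(?:es)?/(\d+)")


def embedded_tweet_ids(article_html):
    """Return the unique tweet IDs linked from an article, in order of first appearance."""
    seen = []
    for _user, tweet_id in STATUS_LINK.findall(article_html):
        if tweet_id not in seen:
            seen.append(tweet_id)
    return seen


sample = (
    'href="https://twitter.com/example_user/status/1234567890?ref_src=embed" '
    'href="https://twitter.com/example_user/status/1234567890" '
    'href="https://twitter.com/other_user/status/42"'
)
print(embedded_tweet_ids(sample))  # ['1234567890', '42']
```

<p>The extracted IDs are what would then be passed to the Twitter API to fetch full tweet and user objects.</p>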
<h2 id="key-findings">Key Findings</h2>
<h3 id="embedding-prevalence">Embedding Prevalence</h3>
<ul>
<li><strong>13% of news articles</strong> in our Google News-sourced collection contained embedded tweets.</li>
<li><strong>Significant variation across categories</strong>: Sports (24% of articles) and Entertainment (14%) had the highest rates of embedding, while Health (2%) had the lowest.</li>
<li>News outlets that publish the most articles are well-known mass media organizations, while outlets with the highest average number of embeds per article are often focused on Sports and Entertainment.</li>
</ul>
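<p>As a quick consistency check on the headline figure, using only the counts reported above (273,899 articles, 35,218 of them with embedded tweets, and daily averages of 2,302 and roughly 296):</p>
<p>$$
\frac{35{,}218}{273{,}899} \approx 0.129, \qquad \frac{296}{2{,}302} \approx 0.129
$$</p>
<p>Both ratios round to the reported 13%.</p>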
<h3 id="user-and-content-patterns">User and Content Patterns</h3>
<ul>
<li><strong>Public figures dominate</strong>: Well-known figures like politicians and celebrities, alongside organizations, are embedded far more often than ordinary users.</li>
<li>Some users have a small number of their tweets embedded many times, while others gain newsworthiness from a wider range of their content.</li>
<li>The Health category, despite having few embedded tweets, had the highest proportion of unique tweets (93%), suggesting that when tweets are embedded, they are less likely to be reused across multiple stories.</li>
<li><strong>&ldquo;Catch-up&rdquo; Phenomenon</strong>: Data reveals a class of users with high &ldquo;embedding effectiveness&rdquo;: those embedded more frequently than they tweet. This suggests journalists often use embeddings to &ldquo;catch readers up&rdquo; on backstories for previously unknown individuals.</li>
</ul>
<h2 id="significance">Significance</h2>
<p>The dataset is a foundation for studying how social media surfaces in journalism: how sourcing routines are evolving in the digital age, how traditional outlets and social platforms interact, and how previously unknown users become newsworthy. These questions can be grounded in the per-category and per-outlet embedding rates and the user-newsworthiness patterns the dataset captures.</p>
<h2 id="my-contribution">My Contribution</h2>
<p>I am the second of six authors on this paper. My contributions focused on the descriptive analysis: writing code to process the collected data, generating summary tables and statistics, and helping write and review the manuscript.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{mujib2020newstweetdatasetsocialmedia,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{NewsTweet: A Dataset of Social Media Embedding in Online Journalism}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Munif Ishad Mujib and Hunter Scott Heidenreich and Colin J. Murphy and Giovanni C. Santia and Asta Zelenkauskaite and Jake Ryland Williams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2008.02870}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.SI}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2008.02870}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<ul>
<li><a href="/research/look-dont-tweet/">Look, Don&rsquo;t Tweet: Unified Data Models for Social NLP</a>: provides the unified cross-platform social media data model underlying broader Twitter analysis.</li>
<li><a href="/research/coordinated-social-targeting/">Coordinated Social Targeting on Twitter</a>: a companion study from the same research group and time period, documenting coordinated follower-manipulation patterns on high-profile Twitter accounts.</li>
</ul>
]]></content:encoded></item><item><title>Coordinated Social Targeting on Twitter</title><link>https://hunterheidenreich.com/research/coordinated-social-targeting/</link><pubDate>Wed, 01 Jul 2020 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/coordinated-social-targeting/</guid><description>Investigation into follower dynamics on high-profile Twitter accounts, documenting sub-second spikes, saw-tooth patterns, and resurrected accounts.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>In the lead-up to the 2020 U.S. presidential election, we developed tools to monitor the follower dynamics of prominent Twitter users. This work details our investigation into several strange phenomena we observed, which suggest the presence of coordinated manipulation campaigns. We present preliminary evidence of networks of automated (bot) and potentially compromised accounts aiming to artificially inflate or diminish the perceived social status of high-profile individuals. <em>(Note: this work was presented as an extended abstract at IC2S2 2020. It is available as an arXiv preprint but was not submitted as a full peer-reviewed paper.)</em></p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>High-frequency sampling infrastructure</strong>: custom tooling sustaining 1-second follower-count polling over weeks, recursive follower-list &ldquo;tunneling&rdquo; through Twitter&rsquo;s pagination limits, and API-rate-window management, capturing sub-minute signatures invisible to standard API sampling and requiring no internal platform access.</li>
<li><strong>Four documented phenomena</strong>: spikes &amp; sawteeth, circulating accounts, resurrected pre-2010 accounts, and follower-list anomalies, each cataloged as a separate experiment.</li>
<li><strong>Preliminary evidence of coordination</strong>: anomalous patterns consistent with systematic bot activity around prominent political accounts in the lead-up to the 2020 election, presented as observation rather than proven causation.</li>
</ul>
<h2 id="methodological-strategy">Methodological Strategy</h2>
<p>To capture these transient phenomena, we built a high-frequency observational pipeline that bypassed standard sampling limitations. Our framework treated the social graph as a dynamic signal:</p>
<ol>
<li><strong>High-Resolution Sampling</strong>: Implemented &ldquo;cycling scripts&rdquo; to monitor follower counts at 1-second resolution, revealing sub-minute &ldquo;spike&rdquo; anomalies often missed by aggregate metrics</li>
<li><strong>Recursive &ldquo;Tunneling&rdquo;</strong>: Developed a recursive sampling method to extract deep history from follower lists, uncovering strata of &ldquo;ancient&rdquo; accounts buried by recent activity</li>
<li><strong>Cross-Referencing</strong>: Correlated circulation events with follower count distortions to validate the &ldquo;sawtooth&rdquo; waveform hypothesis</li>
</ol>
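<p>The cross-referencing step depends on first identifying circulation events. One way to sketch that, assuming the follower list is available as a chronological series of ID snapshots, is to count how often an account reappears after having dropped out. The function name and data layout are hypothetical.</p>

```python
from collections import Counter


def circulation_counts(snapshots):
    """Count, per account, how many times it re-followed after having unfollowed.

    `snapshots` is a chronological list of sets of follower IDs.
    """
    refollows = Counter()
    departed = set()
    for previous, current in zip(snapshots, snapshots[1:]):
        departed.update(previous - current)  # accounts that just left
        returned = (current - previous).intersection(departed)
        for account in returned:
            refollows[account] += 1
        departed.difference_update(returned)
    return refollows


snapshots = [{"a", "b"}, {"a"}, {"a", "b"}, {"a"}, {"a", "b", "c"}]
print(circulation_counts(snapshots))  # Counter({'b': 2})
```

<p>Accounts with high counts are candidates for the &ldquo;circulating&rdquo; cohorts, and the timestamps of their departures and returns can then be compared against distortions in the follower-count signal.</p>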
<p><strong>Engineering Note:</strong> This methodology required custom infrastructure to work around Twitter API rate limits and pagination constraints. We built a &ldquo;tunneling&rdquo; system that recursively paginated through follower lists (which Twitter caps at 5,000 per request) while respecting rate windows. The &ldquo;cycling scripts&rdquo; maintained persistent connections and implemented intelligent backoff strategies to sustain continuous 1-second polling over weeks of observation. This data engineering work was essential to capturing the high-resolution temporal signatures that standard API usage would miss.</p>
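<p>A simplified sketch of cursor-based pagination that respects a rate window, in the spirit of the &ldquo;tunneling&rdquo; described above. The <code>fetch_page</code> callable is a hypothetical stand-in for an API client, and the window parameters are illustrative values, not Twitter&rsquo;s actual limits.</p>

```python
import time


def collect_followers(fetch_page, requests_per_window, window_seconds, sleep=time.sleep):
    """Walk a cursor-paginated follower list, pausing when the rate window is spent.

    `fetch_page(cursor)` is a hypothetical callable returning (ids, next_cursor);
    a next_cursor of None marks the final page.
    """
    followers = []
    cursor = None
    used = 0
    while True:
        if used == requests_per_window:
            sleep(window_seconds)  # wait out the current rate window
            used = 0
        ids, cursor = fetch_page(cursor)
        used += 1
        followers.extend(ids)
        if cursor is None:
            return followers


# Fake three-page API and a recording sleep function, so the example runs instantly.
pages = {None: ([1, 2], "c1"), "c1": ([3, 4], "c2"), "c2": ([5], None)}
naps = []
print(collect_followers(lambda cursor: pages[cursor], 2, 900, sleep=naps.append))  # [1, 2, 3, 4, 5]
print(naps)  # [900]
```

<p>This version always sleeps for a full window; a production collector would more likely track the window&rsquo;s actual reset time and add backoff on errors, as the note above describes.</p>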
<h2 id="key-findings">Key Findings</h2>
<ul>
<li><strong>Spikes &amp; Sawteeth</strong>: Dramatic, sub-second spikes and saw-tooth wave patterns in follower counts, suggesting automated activity</li>
<li><strong>Circulating Accounts</strong>: Cohorts of users repeatedly following and unfollowing the same high-profile accounts</li>
<li><strong>&lsquo;Resurrected&rsquo; Ancient Accounts</strong>: Disproportionate numbers of pre-2010 Twitter accounts with long inactivity periods suddenly appearing as recent followers</li>
<li><strong>Follower List Anomalies</strong>: Unusual temporal groupings in follower histories suggesting historical manipulation campaigns</li>
</ul>
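<p>For illustration, spikes of the kind listed above can be flagged in a sampled follower-count series by comparing each one-step change to the typical change. The median-based threshold and the <code>factor</code> parameter are assumptions made for this sketch, not the detection rule used in the study; the series must contain at least two samples.</p>

```python
from statistics import median


def flag_spikes(counts, factor=10.0):
    """Return indices where the one-step change in follower count is unusually large.

    A change is flagged when its magnitude exceeds `factor` times the median
    absolute change across the series (an illustrative threshold).
    """
    deltas = [b - a for a, b in zip(counts, counts[1:])]
    typical = median(abs(d) for d in deltas)
    threshold = factor * max(typical, 1)
    return [i + 1 for i, d in enumerate(deltas) if abs(d) > threshold]


# Toy series: slow organic growth with one abrupt gain and one abrupt loss.
series = [100, 101, 102, 103, 603, 604, 605, 105, 106]
print(flag_spikes(series))  # [4, 7]
```

<p>A sawtooth pattern would show up as a recurring sequence of gradual rises followed by flagged drops, or the reverse.</p>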
<figure class="post-figure center ">
    <img src="/img/social-targeting/elonmusk-realDonaldTrump-twitter-spikes-and-saws.webp"
         alt="Sawtooth follower growth patterns for @elonmusk and @realDonaldTrump"
         title="Sawtooth follower growth patterns for @elonmusk and @realDonaldTrump"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">High-resolution monitoring of @elonmusk and @realDonaldTrump shows &ldquo;sawtooth&rdquo; follower growth patterns consistent with (though not proof of) coordinated activity attempting to artificially inflate or suppress account metrics.</figcaption>
    
</figure>

<h2 id="significance--sociological-impact">Significance &amp; Sociological Impact</h2>
<p>Beyond the technical detection of botnets, this work raises questions about a potential vulnerability in the infrastructure of modern public discourse. The observed patterns suggest that these coordinated campaigns may target the metrics algorithms use to gauge relevance (such as follower counts and engagement rates) as a way to artificially shift perceived social proof.</p>
<ul>
<li><strong>Manipulating social-proof signals</strong>: Our observations are consistent with attempts to manipulate the &ldquo;rich-get-richer&rdquo; dynamics of social platforms. If inorganic follower inflation can influence recommender systems, that could amplify specific narratives to legitimate users. This remains a documented observation rather than a proven causal chain, and warrants further investigation.</li>
<li><strong>Account-age filters are not enough</strong>: The data highlights patterns where dormant pre-2010 accounts appear as new followers in bulk, suggesting that account age alone is insufficient as a defense against coordinated manipulation.</li>
<li><strong>A case for metric scrutiny</strong>: These findings suggest that raw follower counts are noisy signals that can be manipulated at scale. Supplementing them with engagement-quality metrics may make platform recommender systems more robust to coordinated inorganic activity.</li>
</ul>

<figure class="post-figure center ">
    <img src="/img/social-targeting/obama-hydration.webp"
         alt="UID digit distribution analysis of @BarackObama&#39;s follower list grouped by account-ID age"
         title="UID digit distribution analysis of @BarackObama&#39;s follower list grouped by account-ID age"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">A &ldquo;geological&rdquo; analysis of @BarackObama&rsquo;s follower list, grouped by the number of digits in each account&rsquo;s user ID (a coarse proxy for account age). A mid-list region otherwise dominated by recent accounts abruptly inverts to much older ones, one of the anomalous patterns the work flags as consistent with (though not proof of) coordinated activity.</figcaption>
    
</figure>
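<p>The grouping behind this kind of &ldquo;geological&rdquo; view can be sketched in a few lines: walk the ordered follower list in fixed-size windows and tally the digit length of each user ID. The function name, the window size, and the IDs below are made up for illustration. Digit length is only the coarse age proxy described in the caption.</p>

```python
from collections import Counter
from typing import Dict, List


def digit_strata(follower_ids: List[int], window: int) -> List[Dict[int, int]]:
    """Split an ordered follower list into windows and count user-ID digit lengths in each."""
    strata = []
    for start in range(0, len(follower_ids), window):
        chunk = follower_ids[start:start + window]
        strata.append(dict(Counter(len(str(uid)) for uid in chunk)))
    return strata


# Toy list, most recent followers first; the IDs are invented for this example.
ids = [
    1234567890123456789,
    1234567890123456788,
    12345678,
    1234567,
    1234567890123456787,
    123456789,
]
print(digit_strata(ids, window=2))
```

<p>In this toy list the second window is made up entirely of short IDs, sandwiched between windows dominated by long ones. That is the sort of inversion the figure highlights.</p>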

<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{heidenreich2020investigating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Investigating Coordinated &#39;Social&#39; Targeting of High-Profile Twitter Accounts}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heidenreich, Hunter Scott and Mujib, Munif Ishad and Williams, Jake Ryland}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{abs/2008.02874}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2008.02874}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2008.02874}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprinttype</span>=<span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For related work in the computational social science cluster:</p>
<ul>
<li><a href="/research/newstweet-social-media-journalism/">NewsTweet Dataset: Social Media in Digital Journalism</a>: a companion dataset studying how tweets surface in journalism, from the same research group and time period.</li>
<li><a href="/research/look-dont-tweet/">Look, Don&rsquo;t Tweet: Unified Data Models for Social NLP</a>: provides the unified cross-platform social media data model used in broader analysis of Twitter and related platforms.</li>
<li><a href="/research/gpt2-adversarial-triggers/">GPT-2 Susceptibility to Universal Adversarial Triggers</a>: proposes UAT-based bot detection as a constructive application, addressing the same automated manipulation behavior documented here from a complementary detection angle.</li>
</ul>
]]></content:encoded></item><item><title>Data-Driven WordNet Construction from Wiktionary</title><link>https://hunterheidenreich.com/research/semantic-network-induction/</link><pubDate>Fri, 01 Nov 2019 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/semantic-network-induction/</guid><description>We introduce an unsupervised algorithm for inducing semantic networks from noisy, crowd-sourced data, producing a resource with over 344,000 linked examples.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We introduce a novel <strong>unsupervised algorithm</strong> for inducing semantic networks from noisy, crowd-sourced data. By framing network construction as a &ldquo;relationship disambiguation&rdquo; task, we process Wiktionary&rsquo;s English entries to build a massive, WordNet-like semantic resource. The resulting network is more than 5x larger than Princeton WordNet and features over <strong>344,000 linked example sentences</strong> (vs. WordNet&rsquo;s 68k). Evaluation on standard word similarity benchmarks demonstrates that our fully data-driven approach yields semantic structures competitive with expert-annotated resources.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Unsupervised Hierarchy Induction</strong>: We propose a deterministic algorithm to construct a Directed Acyclic Graph (DAG) of senses from pairwise relationships, effectively inducing a semantic hierarchy without human supervision.</li>
<li><strong>A Massive Semantic Resource</strong>: We release a dataset enriched with hundreds of thousands of semantically linked usage examples, serving as a critical resource for tasks like Word Sense Disambiguation (WSD).</li>
<li><strong>Disambiguation Framework</strong>: We model &ldquo;relationship disambiguation&rdquo; using a Laplacian kernel and FastText embeddings to filter noisy user annotations.</li>
<li><strong>Open-Source Infrastructure</strong>: We provide a full pipeline for downloading, parsing, and constructing networks from Wiktionary data.</li>
</ul>
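<p>For readers unfamiliar with the Laplacian kernel named in the disambiguation framework above, here is a minimal sketch. It assumes the common L1-distance form, $k(x, y) = \exp(-\gamma \lVert x - y \rVert_1)$. The paper&rsquo;s exact parameterization may differ, and the three-dimensional vectors are toy stand-ins for real FastText embeddings.</p>

```python
import math
from typing import Sequence


def laplacian_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """k(x, y) = exp(-gamma * ||x - y||_1), the L1-distance form of the Laplacian kernel."""
    l1 = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-gamma * l1)


# Toy "definition embeddings" (invented; real FastText vectors are much larger).
go_def = [0.2, 0.1, 0.7]
proceed_def = [0.2, 0.1, 0.6]
unrelated_def = [0.9, 0.8, 0.0]

close = laplacian_kernel(go_def, proceed_def)
far = laplacian_kernel(go_def, unrelated_def)
print(close, far)
```

<p>The kernel equals 1 for identical vectors and decays toward 0 as they move apart, so it can act as a soft similarity score when deciding which definitions plausibly take part in a relationship.</p>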
<h2 id="technical-approach">Technical Approach</h2>
<p>The core of our method addresses the noise inherent in crowd-sourced dictionaries. We frame the problem as <strong>Latent Semantic Network Induction</strong>:</p>
<ol>
<li><strong>Relationship Disambiguation</strong>: For every linked pair of words (e.g., <em>go</em> ~ <em>proceed</em>), we define a semantic subspace using their definitions. We utilize <strong>FastText embeddings</strong> and a <strong>Laplacian kernel</strong> to identify which specific definitions participate in the relationship.</li>
<li><strong>Hierarchy Construction</strong>: We apply a custom intersection algorithm that treats more general senses as the &ldquo;overlap&rdquo; between specific definition sets. We formalize this as a set-theoretic &ldquo;hole punching&rdquo; operation, where a general sense $t$ is defined by the intersection of the definition sets in a collection $\mathbb{D}'$, excluding anything that also falls in the intersection of a strictly larger collection $\mathbb{D} \supset \mathbb{D}'$:</li>
</ol>
<p>$$f^{-1}(t) = \left(\bigcap_{\mathbb{D}'} D_{u\sim v}\right) \setminus \left(\bigcup_{\mathbb{D} \supset \mathbb{D}'} \bigcap_{\mathbb{D}} D_{u\sim v}\right)$$</p>
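<p>The set operation in the formula above can be sketched directly with Python sets. This is an illustrative reading of the equation, not the released pipeline: the function name <code>hole_punch</code>, the relationship labels, and the definition IDs are all hypothetical.</p>

```python
from typing import Dict, FrozenSet, Set


def hole_punch(def_sets: Dict[str, Set[str]], chosen: FrozenSet[str]) -> Set[str]:
    """Definitions shared by every set in `chosen` and by no set outside it."""
    core = set.intersection(*(def_sets[name] for name in chosen))
    # Intersections over strictly larger collections are all contained in the
    # intersections that add exactly one more set, so removing those suffices.
    for name, extra in def_sets.items():
        if name not in chosen:
            core -= extra
    return core


# Toy definition sets for three relationships of "go" (labels and IDs invented).
def_sets = {
    "go~proceed": {"d1", "d2", "d3"},
    "go~advance": {"d2", "d3", "d4"},
    "go~leave": {"d3", "d5"},
}

pair = hole_punch(def_sets, frozenset({"go~proceed", "go~advance"}))
triple = hole_punch(def_sets, frozenset(def_sets))
print(pair, triple)
```

<p>Here <code>d3</code> sits in all three sets, so it is assigned to the three-way overlap and &ldquo;punched out&rdquo; of the two-way one, which keeps only <code>d2</code>.</p>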
<h2 id="evaluation--validation">Evaluation &amp; Validation</h2>
<p>The primary achievement is scale: our induced network contains over <strong>344,000 linked example sentences</strong>, compared to Princeton WordNet&rsquo;s 68,000 (more than 5x the coverage), built entirely from crowd-sourced data without expert annotation.</p>
<p>Beyond scale, the network holds up semantically. On RG-65, a standard noun-similarity benchmark, the unsupervised network achieves a Spearman rank correlation of $\rho = 0.83$, on par with Explicit Semantic Analysis (ESA) models built on expert-annotated WordNet ($\rho = 0.82$). The point is not that we beat WordNet by 0.01. It is that a fully automated approach over noisy Wiktionary data produces a resource of comparable quality at more than 5x the scale.</p>
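<p>For reference, the Spearman rank correlation used in this kind of evaluation compares the ordering of model similarity scores against human judgments. The sketch below is a small self-contained implementation with average ranks for ties. The scores are invented for illustration and are not RG-65 data.</p>

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Invented human ratings and model similarities for five word pairs.
human = [3.9, 3.1, 0.4, 1.2, 2.5]
model = [0.91, 0.60, 0.10, 0.35, 0.70]
rho = spearman_rho(human, model)
print(rho)
```

<p>Because only the ranks matter, the model&rsquo;s scores do not need to be on the same scale as the human ratings.</p>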
<h2 id="why-this-matters">Why This Matters</h2>
<p>Building high-quality linguistic resources typically requires expensive expert annotation. Princeton WordNet took decades of lexicographer effort. This work demonstrates that an unsupervised algorithm over crowd-sourced data can produce a resource of comparable semantic quality at more than 5x the scale. For ML practitioners, that matters: larger coverage means more training signal for downstream tasks like Word Sense Disambiguation. For this portfolio, it shows early experience building structured NLP datasets from scratch, a theme that continues in later work on large-scale document corpora.</p>
<h2 id="related-work">Related Work</h2>
<p>For a theoretical treatment of word semantics from the same collaboration, including the first analytical solution to Word2Vec&rsquo;s softmax objective, see <a href="/research/word-company-vicinity/">Analytical Solution to Word2Vec Softmax &amp; Bias Probing</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{heidenreich2019latent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Latent semantic network induction in the context of linked example senses}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heidenreich, Hunter and Williams, Jake}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{170--180}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>