Publications and preprints on foundation models, document processing, scientific machine learning, and AI safety.
My research focuses on bridging the gap between academic rigor and production-scale machine learning. The common thread across my work is applying robust representation learning to complex, unstructured domains. Whether I am extracting structured data from messy business documents, forecasting chaotic physical systems, or probing the vulnerabilities of language models, my goal is to build systems that are both mathematically grounded and practically useful.
My primary focus areas include:
Document Processing: Developing grounded vision-language models (like GutenOCR) and evaluating LLMs for page stream segmentation.
Scientific Machine Learning: Deconstructing sequence models for dynamical systems and applying deep learning to molecular dynamics.
AI Safety & NLP: Investigating adversarial triggers, data-free initialization, and the structural dynamics of social media.
GutenOCR: A Grounded Vision-Language Front-End for Documents
GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.
We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.
PubMed-OCR provides 1.5M pages of scientific articles with comprehensive OCR annotations and bounding boxes to support layout-aware modeling and document analysis.
We explore LLM applications for page stream segmentation in insurance document processing, demonstrating that parameter-efficient fine-tuning achieves strong accuracy but revealing significant calibration challenges that limit deployment confidence.
We create TabMe++, an enhanced page stream segmentation benchmark with commercial-grade OCR, and show that decoder-based LLMs such as Mistral-7B, fine-tuned with parameter-efficient methods, achieve an 80% straight-through processing rate, dramatically outperforming encoder-based models.
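To make the headline metric concrete: "straight-through processing" can be read as the fraction of page streams whose predicted document boundaries match the reference exactly, so the stream needs no human review. A minimal sketch under that assumed definition (the function name and the boundary encoding are illustrative, not the paper's code):

```python
def straight_through_rate(predicted, reference):
    """Fraction of streams segmented perfectly: a stream counts only if
    every predicted document boundary matches the reference exactly."""
    assert len(predicted) == len(reference)
    exact = sum(p == r for p, r in zip(predicted, reference))
    return exact / len(reference)

# Each stream's boundaries encoded as a sorted tuple of page indices
# where a new document starts.
preds = [(0, 3, 7), (0, 2), (0, 4, 6)]
refs  = [(0, 3, 7), (0, 2), (0, 4, 5)]
rate = straight_through_rate(preds, refs)  # 2 of 3 streams are exact
```

Under this reading, a single mis-predicted boundary anywhere in a stream zeroes out that entire stream, which is why the metric is a much harsher test than per-page accuracy.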
Analytical Solution to Word2Vec Softmax & Bias Probing
We provide the first analytical solution to Word2Vec’s softmax skip-gram objective, introducing the Independent Frequencies Model and deriving a low-cost, training-free method for measuring semantic bias directly from corpus statistics.
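For reference, the softmax skip-gram objective in question can be written in standard notation (this is the textbook form, with $v_w$ and $u_c$ the input and output embeddings, $V$ the vocabulary, and $D$ the observed word-context pairs; it is not necessarily the paper's exact notation):

```latex
\max_{\theta} \; \sum_{(w,c) \in D} \log
  \frac{\exp\!\left(u_c^{\top} v_w\right)}
       {\sum_{c' \in V} \exp\!\left(u_{c'}^{\top} v_w\right)}
```

The analytical solution replaces gradient-based optimization of this objective with closed-form expressions in terms of corpus co-occurrence statistics, which is what makes the training-free bias measurement possible.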
We develop EigenNoise, a zero-data initialization method for word vectors that synthesizes representations from Zipf’s Law alone, demonstrating competitive performance to GloVe after fine-tuning without requiring any pre-training corpus.
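As a flavor of what "initialization from Zipf's Law alone" means, here is a minimal sketch: draw Gaussian noise and scale each word's vector by a function of its Zipfian rank frequency, so no corpus is ever read. The scaling rule and function names below are illustrative assumptions, not the actual EigenNoise construction.

```python
import math
import random


def zipf_frequencies(vocab_size, alpha=1.0):
    """Unigram probabilities under Zipf's law: p(rank r) is proportional
    to 1 / r**alpha, normalized over the vocabulary."""
    weights = [1.0 / r**alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def zipf_noise_vectors(vocab_size, dim, alpha=1.0, seed=0):
    """Synthesize word vectors from Gaussian noise whose scale tracks the
    word's Zipfian frequency: frequent (low-rank) words get larger norms."""
    rng = random.Random(seed)
    vectors = []
    for p in zipf_frequencies(vocab_size, alpha):
        scale = math.log1p(p * vocab_size) / math.sqrt(dim)
        vectors.append([rng.gauss(0.0, 1.0) * scale for _ in range(dim)])
    return vectors


vectors = zipf_noise_vectors(vocab_size=1000, dim=32)
```

The point of such a scheme is that everything is determined by rank statistics and a random seed, so the "embedding table" can be materialized before any pre-training corpus exists.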
Look, Don't Tweet: Unified Data Models for Social NLP
Bachelor’s thesis introducing PyConversations, an open-source library that normalizes over 308 million posts from Twitter, Reddit, Facebook, and 4chan into a unified data model for cross-platform social media research.
GPT-2 Susceptibility to Universal Adversarial Triggers
We demonstrate that universal adversarial triggers can control both the topic and stance of GPT-2’s generated text, revealing security vulnerabilities in deployed language models and proposing constructive applications for bias auditing.
NewsTweet Dataset: Social Media in Digital Journalism
We introduce NewsTweet, a dataset and pipeline for studying embedded tweets in digital journalism, revealing that 13% of Google News articles incorporate tweets and providing insights into how social media becomes newsworthy.
We develop high-frequency monitoring tools to detect coordinated manipulation on Twitter, documenting anomalous follower patterns including sub-second spikes, sawtooth waves, circulating accounts, and long-dormant accounts weaponized to target political figures.
We present an unsupervised algorithm for inducing semantic networks from Wiktionary’s crowd-sourced data, creating a WordNet-like resource an order of magnitude larger than Princeton WordNet with over 344,000 linked example sentences.