Document Processing
Diagram showing page stream segmentation workflow: an input stream of pages is processed through binary classification of page pairs to predict document breaks, producing segmented output documents

LLMs for Page Stream Segmentation

Enhanced TabMe benchmark for page stream segmentation, creating TabMe++, showing fine-tuned decoder-based LLMs …

Natural Language Processing
Huffman Tree visualization for the input 'beep boop beer!' showing internal nodes with frequency counts and leaf nodes with characters

Vectorizing the Tree: High-Performance Word2Vec in Pure PyTorch

A production-grade implementation of Hierarchical Softmax and Negative Sampling, featuring vectorized tree traversals, …

Document Processing
Stream accuracy versus relative throughput for Mistral-7B and XGBoost models

LLMs for Insurance Document Automation

LLM applications for insurance document automation using parameter-efficient fine-tuning and analysis of calibration …

Computational Social Science
Top features for Social Welfare policy classification, showing keywords such as social, poverty, and benefits

Congressional Knowledge Graph & Policy Classification

Constructed a legislative knowledge graph of 47,000+ bills from Congress.gov, capturing sponsor networks and committee …

Natural Language Processing
Information Quality Ratio plot showing that statistical dependencies decay as window size increases

Analytical Solution to Word2Vec Softmax & Corpus Bias Probing

Analytical derivation of Word2Vec's softmax objective factorization and a new framework for detecting semantic bias in …

Natural Language Processing
Heatmap visualization of the EigenNoise analytical co-occurrence prior matrix showing word rank relationships

EigenNoise: Data-Free Word Vector Initialization

Investigation into EigenNoise, a data-free initialization scheme for word vectors that approaches pre-trained model …

Computational Social Science
Diagram of the Universal Message schema showing fields like ID, Text, Author, and Reply Sets that normalize data across platforms

Look, Don't Tweet: Unified Data Models for Social NLP

A comprehensive study on cross-platform social media analysis, introducing the PyConversations library and a unified …

Computational Social Science
Diagram of the Universal Message schema showing fields like ID, Text, Author, and Reply Sets that normalize data across platforms

PyConversations: Social Media Conversational Analysis

Undergraduate thesis exploring representation learning for social media text and developing tools for cross-platform …

AI Safety
A nonsensical trigger sequence 'WTC theoriesclimate Flat Hubbard Principle' is fed into GPT-2, which then generates Flat Earth conspiracy text

GPT-2 Susceptibility to Universal Adversarial Triggers

Investigation into whether universal adversarial triggers can control both topic and stance of GPT-2's generated text …

Natural Language Processing
Venn diagram showing semantic overlap between word senses for go, move, and proceed, illustrating our hierarchy induction algorithm

Data-Driven WordNet Construction from Wiktionary

We introduce an unsupervised algorithm for inducing semantic networks from noisy, crowd-sourced data, producing a …

AI Fundamentals
Diagram illustrating knowledge-based agent architecture with knowledge base, reasoning, and action components

An Introduction to Knowledge-Based Agents

Learn about knowledge-based agents: how AI systems use knowledge bases, reasoning, and inference to build intelligent …

Natural Language Processing
Types and distribution of coreferences in the QuAC dataset, showing dialogue complexity

QuAC: Question Answering in Context Dataset

Analysis of QuAC's conversational QA through student-teacher interactions, featuring 100K+ context-dependent questions …