Natural Language Processing
Huffman Tree visualization for the input 'beep boop beer!' showing internal nodes with frequency counts and leaf nodes with characters

High-Performance Word2Vec in Pure PyTorch

A ground-up PyTorch implementation of Word2Vec treating it as a systems engineering challenge, with “tensorized tree” architecture converting pointer-chasing Hierarchical Softmax into dense GPU operations, infinite streaming datasets with Zipfian subsampling, and torch.compile compatibility for production-grade efficiency.

Natural Language Processing
Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

What happens when you achieve 99.8% accuracy on sarcasm detection? You might have accidentally built a domain classifier. A cautionary ML tale about dataset bias.

Natural Language Processing
Information Quality Ratio plot showing statistical dependencies decay as window size increases

Analytical Solution to Word2Vec Softmax & Bias Probing

We provide the first analytical solution to Word2Vec’s softmax skip-gram objective, introducing the Independent Frequencies Model and deriving a low-cost, training-free method for measuring semantic bias directly from corpus statistics.

Natural Language Processing
Heatmap visualization of the EigenNoise analytical co-occurrence prior matrix showing word rank relationships

EigenNoise: Data-Free Word Vector Initialization

We develop EigenNoise, a zero-data initialization method for word vectors that synthesizes representations from Zipf’s Law alone, demonstrating competitive performance to GloVe after fine-tuning without requiring any pre-training corpus.

Natural Language Processing
Venn diagram showing semantic overlap between word senses for go, move, and proceed, illustrating our hierarchy induction algorithm

Data-Driven WordNet Construction from Wiktionary

We present an unsupervised algorithm for inducing semantic networks from Wiktionary’s crowd-sourced data, creating a WordNet-like resource an order of magnitude larger than Princeton WordNet with over 344,000 linked example sentences.

Natural Language Processing
Types and distribution of coreferences in QuAC dataset showing dialogue complexity

QuAC: Question Answering in Context Dataset

QuAC introduces a conversational QA dataset that models student-teacher interactions, creating context-dependent questions that test systems’ ability to understand dialogue and resolve references.

Natural Language Processing
Visualization of coreference resolution in the CoQA conversational question answering dataset

CoQA Dataset: Advancing Conversational Question Answering

CoQA extends question answering beyond isolated questions to conversations that require context and reference understanding.

Natural Language Processing
3D visualization of word embeddings showing semantic relationships in vector space

Word Embeddings in NLP: An Introduction

Learn how computers understand words through mathematical vectors, from simple counting methods to contextual embeddings that power modern NLP.