Natural Language Processing
Huffman Tree visualization for the input 'beep boop beer!' showing internal nodes with frequency counts and leaf nodes with characters

Vectorizing the Tree: High-Performance Word2Vec in Pure PyTorch

A production-grade implementation of Hierarchical Softmax and Negative Sampling, featuring vectorized tree traversals, …

Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that actually learned to …

Information Quality Ratio plot showing statistical dependencies decay as window size increases

Analytical Solution to Word2Vec Softmax & Corpus Bias Probing

Analytical derivation of Word2Vec's softmax objective factorization and a new framework for detecting semantic bias in …

Heatmap visualization of the EigenNoise analytical co-occurrence prior matrix showing word rank relationships

EigenNoise: Data-Free Word Vector Initialization

Investigation into EigenNoise, a data-free initialization scheme for word vectors that approaches pre-trained model …

Venn diagram showing semantic overlap between word senses for go, move, and proceed, illustrating our hierarchy induction algorithm

Data-Driven WordNet Construction from Wiktionary

We introduce an unsupervised algorithm for inducing semantic networks from noisy, crowd-sourced data, producing a …

Types and distribution of coreferences in QuAC dataset showing dialogue complexity

QuAC: Question Answering in Context Dataset

Analysis of QuAC's conversational QA through student-teacher interactions, featuring 100K+ context-dependent questions …

Visualization of coreference resolution in the CoQA conversational question answering dataset

CoQA Dataset: Advancing Conversational Question Answering

Analysis of CoQA, a conversational QA dataset with multi-turn dialogue, coreference resolution, and natural answers for …

One-hot encoding and count vectorization visualization showing sparse vector representation

Count Vectorization with scikit-learn in Python

Learn count vectorization in Python: convert text to numerical vectors using scikit-learn's CountVectorizer with …

3D visualization of word embeddings showing semantic relationships in vector space

Word Embeddings in NLP: An Introduction

Learn about word embeddings in NLP, from basic one-hot encoding to contextual models like ELMo. A guide with examples.