Vectorized Word2Vec in Pure PyTorch

Overview

Word2Vec is often treated as a “solved problem” or a black box inside libraries like Gensim. This project deconstructs the algorithm to treat it as a systems engineering challenge.

I built a ground-up, typed, and compiled PyTorch implementation that bridges the gap between the original C code’s efficiency and modern GPU acceleration. The core innovation lies in “tensorizing the tree”, converting the pointer-chasing logic of Hierarchical Softmax into dense, vectorized operations compatible with torch.compile.

Features

1. Vectorized Hierarchical Softmax

Classically, Hierarchical Softmax involves traversing a binary Huffman tree. While efficient on a CPU, this approach creates divergent execution paths on GPUs.

The Solution: I implemented a “pre-computed path” strategy. The tree traversal for every vocabulary word is flattened into fixed-size tensors (word_path_indices, word_codes_tensor) padded to the maximum depth.
The Result: The forward pass becomes a massive, masked batch dot-product against internal node embeddings, allowing the GPU to crunch the probability tree without branching logic.

2. Infinite Streaming & Sliding Windows

To handle datasets larger than RAM (e.g., Wikipedia/CommonCrawl), I built a custom IterableDataset that performs a true single-pass read.

Efficient Windowing: It uses a collections.deque buffer to slide over the token stream, generating training pairs only when a new token enters the center context.
Zipfian Subsampling: Implemented a probabilistic rejection sampling layer that downsamples frequent words (like “the” or “of”) on-the-fly, strictly adhering to the original Mikolov et al. paper’s distribution.

3. Modern Tooling

This project uses a strict “software 2.0” stack:

Dependency Management: Built with uv for deterministic, fast environment resolution.
Compilation: Fully compatible with torch.compile (PyTorch 2.0+), allowing for graph fusion of the custom loss functions.

Usage

The library installs from source (clone the repo, then pip install -e .) and exposes a typed Python API (SkipGramModel, CBOWModel, Trainer, Word2VecDataset) alongside word2vec-train and word2vec-query CLIs, with GPU acceleration. Trained embeddings export to .npy for use with Gensim or other tooling.

Results

Embedding geometry checked qualitatively: the vectors behave sensibly on semantic-similarity and analogy probes, which is evidence the tensorized tree matches sequential traversal. No analogy-benchmark score is reported.
Branch-free GPU execution: the batched Huffman-tree path turns hierarchical-softmax tree traversal into dense, masked tensor operations, removing the divergent branching that slows naive implementations on GPUs.
Runs on larger-than-RAM corpora: the streaming IterableDataset with Zipfian subsampling processes Wikipedia/CommonCrawl-scale text in a single pass without loading the corpus into memory.
torch.compile-compatible: the custom loss functions are written to fuse under torch.compile (PyTorch 2.0+).

This project connects to related NLP work on this site:

An Introduction to Word Embeddings: conceptual background on the representations this library produces
Word Company Vicinity: research applying word vector semantics to company names
Semantic Network Induction: research on inducing semantic graphs from embedding spaces

Overview#

Features#

1. Vectorized Hierarchical Softmax#

2. Infinite Streaming & Sliding Windows#

3. Modern Tooling#

Usage#

Results#

Related Work#