Abstract

We developed EigenNoise, a method for initializing word vectors without any pre-training data. By deriving a co-occurrence matrix solely from the theoretical harmonic structure of language (Zipf’s Law), we show that a “warm-start” for NLP models can be synthesized mathematically. This approach challenges the reliance on massive corpora for initialization and offers a competitive alternative for low-resource environments.

Key Contributions

  • Algorithmic Innovation: Created a data-free initialization scheme by modeling independent co-occurrence statistics and applying eigen-decomposition
  • Theoretical Grounding: Leveraged the harmonic statistical structure of language to derive representations from first principles rather than empirical observation
  • Information-Theoretic Evaluation: Utilized Minimum Description Length (MDL) probing to rigorously measure the information content and regularity of the learned representations
  • Efficiency: Demonstrated that EigenNoise vectors, once fine-tuned, are competitive with GloVe vectors (trained on Gigaword) despite seeing no pre-training text

Technical Implementation

The core insight is that “noise” in language isn’t random; it follows a predictable distribution.

  1. Modeling: We model the “null hypothesis” of text, i.e., how words would co-occur if they were statistically independent but still followed a Zipfian rank-frequency distribution. This yields a theoretical co-occurrence matrix $\hat{X}$:

$$\hat{X}_{ij} = \frac{2mN}{r_i r_j H_N}$$

where $r_i$ is the rank of word $i$, $N$ is the vocabulary size, $m$ is the context window size, and $H_N$ is the $N$-th harmonic number.

  2. Factorization: We then solve for the word vectors by performing an eigen-decomposition of this matrix, extracting the top $d$ components to form the representation space (see the code sketch after this list).

  3. Probing: We validated performance using MDL probing on the CoNLL-2003 and TweetEval benchmarks (see the probing sketch below).
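
As a rough illustration of steps 1 and 2, here is a minimal NumPy sketch that builds $\hat{X}$ from the formula above and keeps its top $d$ eigenvectors. The function name and the default window size are illustrative assumptions, and any reweighting the full method may apply before the decomposition is omitted; treat this as a sketch of the construction, not the exact recipe.

```python
import numpy as np

def eigennoise_vectors(vocab_size, dim, window=5):
    """Build the theoretical co-occurrence matrix X_hat and keep its top-d eigenvectors."""
    N, m = vocab_size, window
    ranks = np.arange(1, N + 1, dtype=np.float64)   # Zipfian ranks r_1 .. r_N
    H_N = np.sum(1.0 / ranks)                       # N-th harmonic number
    # X_hat[i, j] = 2 m N / (r_i * r_j * H_N)
    X_hat = (2.0 * m * N / H_N) / np.outer(ranks, ranks)

    # Symmetric eigen-decomposition; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(X_hat)
    top = np.argsort(eigvals)[::-1][:dim]
    # Return the top-d eigenvectors as the representation space
    # (scaling columns by sqrt of the eigenvalues is another common convention).
    return eigvecs[:, top]

# Example: 50-dimensional warm-start vectors for a 2,000-word vocabulary.
W = eigennoise_vectors(vocab_size=2_000, dim=50)    # shape (2000, 50)
```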

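MDL probing measures how many bits a simple probe needs to encode the task labels given the word representations; a lower codelength indicates more accessible, more regular information. Below is a hedged sketch of the online (prequential) codelength using a scikit-learn logistic-regression probe; the block schedule and the probe choice are assumptions for illustration, not the paper’s exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength(X, y, n_classes,
                      fractions=(0.001, 0.002, 0.004, 0.008, 0.016,
                                 0.032, 0.0625, 0.125, 0.25, 0.5, 1.0)):
    """Prequential (online) codelength, in bits, for predicting labels y from features X."""
    n = len(y)
    cuts = sorted({max(1, int(f * n)) for f in fractions} | {n})
    bits = cuts[0] * np.log2(n_classes)       # first block: uniform code over the label set
    for start, end in zip(cuts[:-1], cuts[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
        probs = probe.predict_proba(X[start:end])
        col = {c: i for i, c in enumerate(probe.classes_)}
        # Probability the probe assigns to each true label; tiny floor for unseen classes.
        p_true = np.array([probs[i, col[t]] if t in col else 1e-12
                           for i, t in enumerate(y[start:end])])
        bits += float(-np.log2(np.clip(p_true, 1e-12, 1.0)).sum())
    return bits
```

Running this twice on the same labeled features, once with EigenNoise vectors and once with GloVe vectors, gives the kind of codelength comparison this evaluation relies on.
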
Why This Matters

This research explores the limits of contrastive learning: how effective is a representation learned from contrastive information alone? It demonstrates that a significant portion of what we consider “learned linguistic knowledge” is actually the statistical consequence of word frequency distributions. For ML practitioners, this offers a lightweight, mathematically grounded way to initialize models in data-scarce domains.

Citation

@misc{heidenreich2022eigennoisecontrastivepriorwarmstart,
  title={EigenNoise: A Contrastive Prior to Warm-Start Representations},
  author={Hunter Scott Heidenreich and Jake Ryland Williams},
  year={2022},
  eprint={2205.04376},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2205.04376},
}