Abstract

We introduce an analytic model of the statistics learned by seminal word embedding algorithms such as Word2Vec and GloVe. A central result of this work is the derivation of the first known analytical solution to Word2Vec’s softmax-optimized skip-gram algorithm, revealing the implicit matrix factorization it performs. Building on this theoretical groundwork, we demonstrate a “frequency-ratios property” that word vectors exhibit and leverage it to propose a low-cost method for probing biases directly in data corpora before any model is trained.

Key Contributions

  • Analytical Solution for Word2Vec Softmax: We derive the first known solution for Word2Vec’s softmax objective, showing that it converges to the log-conditional probability matrix of the co-occurrence model (see the sketch after this list).
  • Unifying Theoretical Framework: We introduce the Independent Frequencies Model (IFM) and identify a “frequency-ratios property” that appears universally across classical word vector models trained on statistically independent data.
  • Method for Probing Bias in Data: We propose a novel, low-cost methodology to measure potential semantic biases directly in a text corpus by analyzing the log-frequency ratios of analogical word pairs. This allows for data evaluation before training a model.
  • Analysis of Co-occurrence Statistics: We conduct a deep analysis of statistical independence in co-occurrence models, quantifying how dependencies decay and showing that a non-zero amount of statistical dependence remains even in infinitely large corpora.
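To make the first contribution concrete, the following minimal sketch builds the log-conditional probability matrix from a raw co-occurrence count matrix; per the result above, this is the matrix the softmax skip-gram objective is said to implicitly factorize. The function and variable names (log_conditional_matrix, cooc) are illustrative rather than taken from the paper, and the paper’s exact normalization may differ.

import numpy as np

def log_conditional_matrix(cooc, eps=1e-12):
    # cooc[i, j]: count of context word j appearing in the window of target word i.
    cooc = np.asarray(cooc, dtype=float)
    row_totals = cooc.sum(axis=1, keepdims=True)   # total co-occurrences of each target word
    cond = cooc / np.maximum(row_totals, eps)      # empirical P(context j | target i)
    return np.log(np.maximum(cond, eps))           # log-conditional probabilities (zero counts clamped at log eps)

# Illustrative usage on a tiny 3-word vocabulary:
C = np.array([[0, 4, 1],
              [4, 0, 2],
              [1, 2, 0]])
M = log_conditional_matrix(C)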

Mathematical Innovation

This work provides a rigorous mathematical analysis of the Word2Vec softmax objective, demonstrating that it implicitly factorizes a matrix of log-conditional probabilities. This result moves beyond empirical observation to a formal, theoretical understanding and complements prior work that derived an analogous factorization for the negative-sampling (SGNS) objective.
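As a sketch of this result in notation assumed here (not necessarily the paper’s), let $\mathbf{u}_i$ and $\mathbf{v}_j$ denote the input and output vectors of words $i$ and $j$, and $C_{ij}$ their co-occurrence count. At the optimum of the softmax objective, the model conditional matches the empirical conditional, so

$$P_{\text{model}}(j \mid i) \;=\; \frac{\exp(\mathbf{u}_i^{\top}\mathbf{v}_j)}{\sum_{k}\exp(\mathbf{u}_i^{\top}\mathbf{v}_k)} \;=\; P(j \mid i) \;=\; \frac{C_{ij}}{\sum_{k} C_{ik}} \quad\Longrightarrow\quad \mathbf{u}_i^{\top}\mathbf{v}_j \;=\; \log P(j \mid i) + \log Z_i,$$

with $Z_i = \sum_{k}\exp(\mathbf{u}_i^{\top}\mathbf{v}_k)$. Up to these per-word normalizers, the inner-product matrix $UV^{\top}$ is the log-conditional probability matrix of the co-occurrence model.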

Bias Detection Framework

Our analysis leads to a framework for probing semantic bias directly in text corpora. By measuring the “dissonance” between the frequency ratios of word pairs in an analogy (e.g., $f_{king}/f_{man} \approx f_{queen}/f_{woman}$), we can quantify the degree to which a dataset statistically supports certain associations. This provides a proactive tool for assessing data fairness before it is used to train large models.
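A minimal sketch of such a probe, assuming only unigram token counts and using illustrative names (analogy_dissonance, freqs) rather than the paper’s own; the exact scoring function in the paper may differ:

import math
from collections import Counter

def analogy_dissonance(freqs, a, b, c, d):
    # Distance, on a log scale, from the frequency-ratios property
    # f_a / f_b ≈ f_c / f_d for the analogy a : b :: c : d.
    return abs(math.log(freqs[a] / freqs[b]) - math.log(freqs[c] / freqs[d]))

# Illustrative usage on raw token counts from a toy corpus:
tokens = "the king met the queen and the man met the woman again".split()
freqs = Counter(tokens)
score = analogy_dissonance(freqs, "king", "man", "queen", "woman")   # 0.0 here: each of the four words occurs once

In practice the counts would come from the full training corpus; a small dissonance indicates that the corpus statistically supports the association, which is precisely the kind of signal one may wish to audit before training.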

Theoretical Significance

This research helps bridge the gap between the practical success of word embeddings and their theoretical foundations. By identifying the unifying “frequency-ratios property” and deriving a solution to a fundamental learning objective, our work provides a clearer understanding of the mechanics of distributional semantics and offers a foundation for future algorithmic development.

Contribution and Future Work

This work contributes to the mathematical understanding of word embeddings and the growing field of algorithmic fairness in NLP. The analytical solution for the softmax objective opens new pathways for developing more efficient and powerful language model representations. Furthermore, the bias-probing methodology offers a practical, low-cost tool for researchers and practitioners to better understand their data.

Citation

@article{williams2022know,
  title={To Know by the Company Words Keep and What Else Lies in the Vicinity},
  author={Williams, Jake Ryland and Heidenreich, Hunter Scott},
  journal={arXiv preprint arXiv:2205.00148},
  year={2022}
}