Abstract

This paper explores a fully data-driven approach to constructing a wordnet from the entirety of Wiktionary, the open-source, user-annotated dictionary. Comparing against the expert-annotated Princeton WordNet using simple baselines, we find compelling evidence that our induction process constructs a network with useful semantic structure.

Key Contributions

  • A Data-Driven Wordnet: We introduce a novel, fully data-driven process for constructing a wordnet from a noisy, crowd-sourced resource without relying on expert annotation.
  • Relationship Disambiguation Task: We frame the network induction as a “relationship disambiguation” task, identifying which specific definitions of two words are relevant for a given semantic link (e.g., synonymy).
  • Hierarchy Induction Algorithm: We present a deterministic algorithm to construct a directed acyclic graph (DAG) of senses from sets of disambiguated pairwise relationships.
  • An Open-Source Framework: We release code for downloading the data, performing disambiguation, and constructing the networks, to encourage further research.
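The hierarchy induction described above can be illustrated with a minimal sketch. This is not the paper's exact algorithm; it only demonstrates the stated idea that more general senses arise as intersections of more specific sense sets, which also guarantees acyclicity since edges follow the subset order. The sense labels are hypothetical.

```python
from itertools import combinations

def induce_dag(sense_sets):
    """Close a collection of sense sets under pairwise intersection, then
    link each set (a specific sense) to its maximal proper subsets (more
    general senses). Returns (nodes, edges) with edges child -> parent."""
    nodes = {frozenset(s) for s in sense_sets}
    # Any nonempty overlap between two senses becomes a more general sense.
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(nodes), 2):
            common = a & b
            if common and common not in nodes:
                nodes.add(common)
                changed = True
    # Add an edge only to immediate parents (no intermediate node between).
    edges = set()
    for child in nodes:
        parents = [p for p in nodes if p < child]
        for p in parents:
            if not any(p < q < child for q in parents):
                edges.add((child, p))
    return nodes, edges
```

Because edges always point from a strict superset to a strict subset, the resulting graph is a DAG by construction, matching the deterministic flavor of the approach.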

Technical Approach

Our method starts with the synonym and antonym relationships annotated by Wiktionary editors. The core of our model is relationship disambiguation: for a linked pair of words (e.g., go ~ proceed), we identify the subset of their definitions that convey the shared sense. To do this, we employ a vector similarity model using FastText embeddings and a Laplacian kernel to compare the semantics of definitions. From the resulting collection of disambiguated links, we apply an algorithm to induce a semantic hierarchy, where more general senses are formed from the intersection of more specific ones.
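The disambiguation step can be sketched as follows. This is a simplified illustration under stated assumptions: the paper uses FastText embeddings, while here `embed` is an arbitrary token-to-vector lookup, and we pick the single best-scoring definition pair rather than a subset. The Laplacian kernel itself, exp(-γ‖u − v‖₁), is standard.

```python
import numpy as np

def laplacian_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||_1)
    return float(np.exp(-gamma * np.abs(u - v).sum()))

def definition_vector(definition, embed):
    # Mean of per-token vectors; `embed` maps a token to a vector
    # (FastText lookups in the paper; any embedding works for the sketch).
    tokens = definition.lower().split()
    return np.mean([embed(t) for t in tokens], axis=0)

def disambiguate(defs_a, defs_b, embed, gamma=1.0):
    """For two linked words, return the (i, j) pair of definition indices
    whose definition vectors are most similar under the Laplacian kernel."""
    scores = {
        (i, j): laplacian_kernel(definition_vector(da, embed),
                                 definition_vector(db, embed), gamma)
        for i, da in enumerate(defs_a)
        for j, db in enumerate(defs_b)
    }
    return max(scores, key=scores.get)
```

The L1 distance inside the kernel makes the similarity decay sharply with embedding distance, so near-paraphrase definitions score close to 1 while unrelated ones approach 0.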

Key Findings

  • The induced network shows useful semantic structure. When compared to WordNet, its graph-based similarity metrics show a weak-to-moderate correlation for nouns but a very weak correlation for verbs, suggesting room for improvement in verb disambiguation.
  • The resulting resource is vast, an order of magnitude larger than WordNet, and contains over 344,000 linked example sentences compared to WordNet’s 68,000.
  • The network is competitive on word similarity tasks. Using Explicit Semantic Analysis (ESA), our network’s performance is comparable to or better than ESA models built from WordNet on several standard noun-based benchmarks.
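To make the ESA comparison concrete, here is a minimal sketch of Explicit Semantic Analysis: each word is represented by its TF-IDF weights over a set of "concept" texts, and word similarity is the cosine of those concept vectors. In the paper the concepts come from the induced network's senses; the toy concept texts below are hypothetical.

```python
import math
from collections import defaultdict

def esa_vectors(concept_texts):
    """Build TF-IDF concept vectors: token -> {concept_id: weight}.
    Concepts here stand in for network senses with their glosses/examples."""
    df = defaultdict(int)
    tf = []
    for text in concept_texts:
        counts = defaultdict(int)
        for tok in text.lower().split():
            counts[tok] += 1
        tf.append(counts)
        for tok in counts:
            df[tok] += 1
    n = len(concept_texts)
    word_vec = defaultdict(dict)
    for cid, counts in enumerate(tf):
        for tok, c in counts.items():
            word_vec[tok][cid] = c * math.log(n / df[tok])
    return word_vec

def cosine(u, v):
    # Cosine similarity over sparse dict vectors.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Words that co-occur in the same concepts get high cosine similarity, which is how the benchmark scores in the bullet above are computed, with the induced network supplying the concept space.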

Citation

@inproceedings{heidenreich2019latent,
  title={Latent semantic network induction in the context of linked example senses},
  author={Heidenreich, Hunter and Williams, Jake},
  booktitle={Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)},
  pages={170--180},
  year={2019}
}