Word Embeddings: How King Minus Man Plus Woman Equals Queen
From one-hot vectors to Word2Vec and GloVe, exploring how the distributional hypothesis lets us learn dense vector representations where arithmetic on words captures meaning.
Terminology
| Term | Definition |
|---|---|
| Word Embedding | A dense, low-dimensional vector representation of a word, typically 100 to 300 dimensions, learned from co-occurrence patterns in text |
| Distributional Hypothesis | "You shall know a word by the company it keeps" (Firth, 1957): words appearing in similar contexts have similar meanings |
| One-Hot Encoding | A sparse vector of length $V$ with a single 1 at the word's index and 0 everywhere else; no similarity information between words |
| Word2Vec | A family of two architectures (CBOW and Skip-gram) that learn word embeddings by predicting words from their context or vice versa |
| CBOW (Continuous Bag of Words) | A Word2Vec variant that predicts the center word from the average of its surrounding context word vectors |
| Skip-gram | A Word2Vec variant that predicts surrounding context words given the center word, performing better on rare words |
| GloVe (Global Vectors) | An embedding method that factorizes the global word co-occurrence matrix, combining the benefits of count-based and prediction-based methods |
| Negative Sampling | A training trick that replaces the expensive softmax over the full vocabulary with a binary classification between true context words and randomly sampled "negative" words |
| Cosine Similarity | The cosine of the angle between two vectors: $\frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}$, used to measure semantic similarity between embeddings |
What & Why
One-hot vectors treat every word as equally different from every other word. "Cat" and "dog" are just as distant as "cat" and "democracy." This is a terrible representation for any task that requires understanding meaning, because it encodes zero semantic information.
Word embeddings solve this by mapping each word to a dense vector in a continuous space where similar words are close together. The key insight is the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. "Cat" and "dog" both appear near "pet," "food," "vet," so their learned vectors end up nearby.
The most striking property of these embeddings is that vector arithmetic captures semantic relationships. The vector for "king" minus "man" plus "woman" lands near "queen." This is not programmed in; it emerges from the statistics of word co-occurrence. This discovery (Mikolov et al., 2013) demonstrated that simple neural networks trained on raw text could learn structured representations of meaning.
How It Works
From One-Hot to Dense Vectors
A one-hot vector for vocabulary size $V = 50{,}000$ is a 50,000-dimensional vector with a single 1. An embedding maps this to a dense vector of dimension $d = 300$. The embedding matrix $W \in \mathbb{R}^{V \times d}$ stores one row per word. Looking up a word's embedding is just selecting its row from $W$, which is equivalent to multiplying the one-hot vector by $W$.
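A minimal sketch of this lookup in NumPy (the vocabulary index `1234` for "cat" is a hypothetical placeholder):

```python
import numpy as np

V, d = 50_000, 300  # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, d)).astype(np.float32)  # embedding matrix

word_to_index = {"cat": 1234}      # hypothetical vocabulary mapping
cat_vec = W[word_to_index["cat"]]  # lookup = selecting one row of W
print(cat_vec.shape)               # (300,)
```

Because the one-hot vector has a single 1, the matrix product one-hot times $W$ returns exactly that row, so real implementations skip the multiplication and index directly.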
Skip-gram Architecture
Given a center word, predict the surrounding context words within a window of size $c$. For the sentence "the cat sat on the mat" with window $c = 2$ and center word "sat," the training pairs are: (sat, the), (sat, cat), (sat, on), (sat, the).
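Generating these pairs is a simple sliding-window pass over the token list; a sketch:

```python
def skipgram_pairs(tokens, c=2):
    """Generate (center, context) training pairs within a window of size c."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            # skip the center word itself and positions outside the sentence
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

sentence = "the cat sat on the mat".split()
print([p for p in skipgram_pairs(sentence) if p[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

Note that "the" appears twice as a context word for "sat": duplicates are kept, since frequent co-occurrence is exactly the signal being learned.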
The objective maximizes the average log probability of context words given center words:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$

With negative sampling, $\log p(w_O \mid w_I)$ is approximated as:

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

where $v_w$ and $v'_w$ are the target and context embeddings of word $w$, $\sigma$ is the sigmoid function, and the $k$ negative samples are drawn from a noise distribution $P_n(w) \propto f(w)^{3/4}$, with $f(w)$ the unigram frequency of $w$.
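The 3/4 exponent in the noise distribution flattens raw frequencies, so rare words are sampled as negatives more often than their counts alone would allow. A toy illustration with three words:

```python
import numpy as np

counts = np.array([100.0, 10.0, 1.0])  # toy word frequencies
p_n = counts ** 0.75                   # the 3/4-power smoothing
p_n /= p_n.sum()                       # normalize into a distribution

print(p_n.round(3))   # [0.827 0.147 0.026]
# vs. raw frequencies: [0.901 0.090 0.009] -- the rare word's
# sampling probability roughly triples under the 3/4 power.
```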
CBOW Architecture
CBOW reverses the direction: given the context words, predict the center word. It averages the context word vectors and passes the result through a single linear layer. CBOW is faster to train and works well for frequent words, while Skip-gram handles rare words better.
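The CBOW forward pass described above can be sketched in a few lines (toy sizes and random initialization, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 8                                  # toy vocabulary and dimension
W_in = rng.normal(scale=0.01, size=(V, d))   # input (context) embeddings
W_out = rng.normal(scale=0.01, size=(V, d))  # output embeddings

context_ids = [0, 1, 3, 4]                   # indices of the surrounding words
h = W_in[context_ids].mean(axis=0)           # average the context vectors
scores = W_out @ h                           # single linear layer
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
```

Averaging the context vectors is what makes CBOW a "bag" of words: word order within the window is discarded, which is part of why it trains faster than Skip-gram.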
GloVe
GloVe (Pennington et al., 2014) takes a different approach. It first builds a global co-occurrence matrix $X$ where $X_{ij}$ counts how often word $i$ appears near word $j$. Then it learns vectors such that:

$$w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$$

fit by minimizing the weighted least-squares objective $J = \sum_{i,j} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$, where the weighting function $f$ down-weights rare co-occurrences and caps the influence of very frequent ones.
This combines the statistical power of count-based methods with the quality of prediction-based embeddings.
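GloVe weights each co-occurrence entry with a function that down-weights rare pairs and saturates for frequent ones; a sketch of that weighting and the per-entry loss (toy values, not the reference implementation):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x/x_max)^alpha, capped at 1 for x >= x_max."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss_entry(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one non-zero co-occurrence count X_ij."""
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

# a single co-occurrence entry with toy vectors and biases
loss = glove_loss_entry(np.full(3, 0.1), np.full(3, 0.1), 0.0, 0.0, x_ij=10.0)
```

Because the sum runs only over non-zero entries of $X$, training cost scales with the number of observed co-occurrence pairs rather than with $V^2$.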
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Skip-gram training (1 epoch) | $O(T \cdot c \cdot k)$ | $O(V \cdot d)$ | $T$ = tokens, $c$ = window, $k$ = negative samples, $d$ = dimensions |
| GloVe training (1 epoch) | $O(|X_{\neq 0}|)$ | $O(V^2)$ worst case | Iterates over non-zero co-occurrence entries |
| Embedding lookup | $O(1)$ | $O(d)$ | Index into the embedding matrix |
| Nearest neighbor search | $O(V \cdot d)$ | $O(1)$ | Brute-force cosine similarity against all words |
The embedding matrix for $V = 400{,}000$ words at $d = 300$ dimensions requires $V \times d \times 4$ bytes $\approx 480$ MB in float32. Pre-trained GloVe and Word2Vec models are typically distributed as these large matrix files.
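The memory figure follows directly from the matrix dimensions; a quick check:

```python
V, d = 400_000, 300
bytes_needed = V * d * 4        # float32 = 4 bytes per value
print(bytes_needed / 1e6)       # 480.0 (MB)
```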
Implementation
ALGORITHM TrainSkipGram(corpus, d, c, k, learningRate, epochs)
INPUT: corpus: list of tokens, d: embedding dimension, c: window size,
k: negative samples, learningRate: float, epochs: integer
OUTPUT: W: embedding matrix [V x d]
BEGIN
V <- number of unique tokens in corpus
W <- RANDOM MATRIX [V x d] with small values // target embeddings
W' <- RANDOM MATRIX [V x d] with small values // context embeddings
freqs <- frequency of each word in corpus
noiseDist[i] <- freqs[i]^0.75 / SUM(freqs[j]^0.75 for all j), for each word i
FOR epoch FROM 1 TO epochs DO
FOR t FROM 0 TO LENGTH(corpus) - 1 DO
centerIdx <- INDEX OF corpus[t]
FOR j FROM -c TO c, j != 0 DO
IF t + j < 0 OR t + j >= LENGTH(corpus) THEN CONTINUE
contextIdx <- INDEX OF corpus[t + j]
// Positive sample: push center and context together
score <- DOT(W[centerIdx], W'[contextIdx])
grad <- (SIGMOID(score) - 1) * learningRate
centerOld <- COPY(W[centerIdx]) // cache so both updates use the same gradient
W[centerIdx] <- W[centerIdx] - grad * W'[contextIdx]
W'[contextIdx] <- W'[contextIdx] - grad * centerOld
// Negative samples: push center away from random words
FOR i FROM 1 TO k DO
negIdx <- SAMPLE from noiseDist
score <- DOT(W[centerIdx], W'[negIdx])
grad <- SIGMOID(score) * learningRate
centerOld <- COPY(W[centerIdx])
W[centerIdx] <- W[centerIdx] - grad * W'[negIdx]
W'[negIdx] <- W'[negIdx] - grad * centerOld
END FOR
END FOR
END FOR
END FOR
END FOR
RETURN W
END
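The update step above translates almost line-for-line into NumPy. A minimal sketch of one negative-sampling update with toy sizes (random initialization, hypothetical indices):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, lr = 10, 5, 3, 0.05
W = rng.normal(scale=0.01, size=(V, d))      # target embeddings
W_ctx = rng.normal(scale=0.01, size=(V, d))  # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center, context, neg_ids):
    """One update: label 1 for the true context, 0 for each negative."""
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in neg_ids]:
        g = (sigmoid(W[center] @ W_ctx[idx]) - label) * lr
        center_old = W[center].copy()  # cache before updating, as in the pseudocode
        W[center] -= g * W_ctx[idx]
        W_ctx[idx] -= g * center_old

sgd_step(center=2, context=5, neg_ids=rng.integers(0, V, size=k))
```

Each positive update moves the center and context rows toward each other (their dot product grows), while each negative update pushes the center row away from a randomly sampled word.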
ALGORITHM WordAnalogy(W, wordA, wordB, wordC, vocab)
INPUT: W: embedding matrix, wordA/B/C: strings (e.g., "king","man","woman")
OUTPUT: closest word to vector(wordA) - vector(wordB) + vector(wordC)
BEGIN
target <- W[INDEX(wordA)] - W[INDEX(wordB)] + W[INDEX(wordC)]
bestWord <- ""
bestSim <- -INFINITY
FOR EACH word IN vocab DO
IF word = wordA OR word = wordB OR word = wordC THEN CONTINUE
sim <- DOT(target, W[INDEX(word)]) / (NORM(target) * NORM(W[INDEX(word)]))
IF sim > bestSim THEN
bestSim <- sim
bestWord <- word
END IF
END FOR
RETURN bestWord // e.g., "queen"
END
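The analogy routine is easy to vectorize: compute cosine similarity against all rows at once instead of looping. A runnable sketch on a tiny hand-built vocabulary (the 2-D coordinates are illustrative, not learned):

```python
import numpy as np

def analogy(W, vocab, a, b, c):
    """Return the vocab word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = W[idx[a]] - W[idx[b]] + W[idx[c]]
    sims = (W @ target) / (np.linalg.norm(W, axis=1) * np.linalg.norm(target))
    for w in (a, b, c):          # exclude the three query words
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# toy embeddings with a built-in "royalty" axis (x) and "male" axis (y)
vocab = ["king", "man", "woman", "queen", "apple"]
W = np.array([
    [0.9, 0.9],   # king  = royal + male
    [0.1, 0.9],   # man   = male
    [0.1, 0.1],   # woman = neither
    [0.9, 0.1],   # queen = royal, not male
    [0.5, 0.5],   # apple = unrelated
])
print(analogy(W, vocab, "king", "man", "woman"))  # queen
```

Excluding the query words matters in practice: without it, the nearest neighbor of king - man + woman in real embedding spaces is often "king" itself.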
Real-World Applications
- Sentiment analysis: Pre-trained word embeddings provide a semantic foundation for classifiers, letting models generalize from "great" to "excellent" without seeing both in training data
- Named entity recognition: Embedding features help sequence labelers identify entities by capturing distributional similarity between known and unknown entity words
- Machine translation: Cross-lingual embeddings map words from different languages into a shared vector space, enabling translation by nearest-neighbor lookup
- Recommendation systems: Item2Vec applies the Skip-gram idea to user interaction sequences, learning item embeddings where co-purchased products are nearby
- Biomedical NLP: BioWordVec trains embeddings on PubMed abstracts, capturing domain-specific relationships like drug-disease associations
- Analogy and bias detection: Vector arithmetic reveals societal biases in training data (e.g., "doctor" closer to "man" than "woman"), enabling debiasing research
Key Takeaways
- Word embeddings map words to dense vectors where semantic similarity corresponds to geometric proximity, solving the fundamental limitation of one-hot encoding
- The distributional hypothesis ("a word is known by the company it keeps") is the theoretical foundation: co-occurrence statistics encode meaning
- Skip-gram predicts context from a center word and excels at rare words; CBOW predicts the center word from context and trains faster on frequent words
- GloVe factorizes the global co-occurrence matrix, combining count-based and prediction-based approaches into a single objective
- Vector arithmetic captures semantic relationships (king - man + woman = queen), an emergent property that was not explicitly trained for