
Attention and Self-Attention Through the Linguistics Lens

How the attention mechanism lets transformers learn which words relate to which, why positional encoding replaces word order, and what self-attention looks like from a linguist's perspective.

2025-10-15
Tags: Computational Linguistics · attention · self-attention · transformers · NLP

Terminology

| Term | Definition |
| --- | --- |
| Attention | A mechanism that computes a weighted sum over a set of values, where the weights indicate how relevant each value is to a given query |
| Self-Attention | Attention where queries, keys, and values all come from the same sequence, allowing each token to attend to every other token in the input |
| Query, Key, Value (Q, K, V) | Three learned projections of each token: the query asks "what am I looking for?", keys answer "what do I contain?", and values provide "what information do I carry?" |
| Scaled Dot-Product Attention | $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, the standard attention formula used in transformers |
| Multi-Head Attention | Running multiple attention operations in parallel with different learned projections, then concatenating the results, which lets the model attend to different relationship types simultaneously |
| Positional Encoding | A vector added to each token embedding to inject information about its position in the sequence, since self-attention is inherently order-agnostic |
| Attention Head | One of the parallel attention computations in multi-head attention, each with its own Q, K, V projection matrices |
| Softmax | A function that converts a vector of real numbers into a probability distribution: $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ |
| Contextualized Embedding | A word representation that changes depending on surrounding words, unlike static embeddings (Word2Vec), where each word has one fixed vector |

What & Why

Language is full of long-range dependencies. In "The cat that the dog chased ran away," the verb "ran" agrees with "cat," not the nearby "dog." RNNs process tokens sequentially and struggle with these distant connections because information must survive many time steps. Attention solves this by letting every token directly look at every other token, regardless of distance.

From a linguistics perspective, self-attention learns something resembling grammatical relationships. One attention head might learn to connect pronouns to their antecedents ("she" attending to "Marie"). Another might learn syntactic dependencies (a verb attending to its subject). A third might capture semantic roles. The model discovers these patterns from data, without being told what grammar is.

The trade-off is that self-attention is inherently unordered. It treats the input as a set, not a sequence. "Dog bites man" and "man bites dog" produce identical attention patterns unless position information is injected. Positional encoding solves this by adding a unique position signal to each token, effectively telling the model "this is the third word."
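The order-blindness is easy to verify numerically. The sketch below (a toy example with random vectors standing in for token embeddings, not trained values) shows that permuting the input to self-attention simply permutes the output: without positional encoding, every token receives exactly the same contextualized representation in either order.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no positions)."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

# Three toy embeddings standing in for "dog", "bites", "man"
X = rng.normal(size=(3, 4))

out = self_attention(X)              # "dog bites man"
perm = [2, 1, 0]                     # "man bites dog": reversed token order
out_perm = self_attention(X[perm])

# Permuting the input only permutes the rows of the output:
# each token's representation is identical in both orders.
print(np.allclose(out[perm], out_perm))  # True
```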

How It Works

Scaled Dot-Product Attention

Each token is projected into three vectors: a query $q$, a key $k$, and a value $v$. The attention score between tokens $i$ and $j$ is the dot product of $q_i$ and $k_j$, divided by $\sqrt{d_k}$ to prevent the softmax from saturating:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The softmax converts scores into a probability distribution. The output for each token is a weighted sum of all value vectors, where the weights reflect relevance.

Figure: self-attention for "The cat sat" (one head). Attention weight matrix after softmax — The: [0.15, 0.70, 0.15]; cat: [0.10, 0.20, 0.70]; sat: [0.05, 0.80, 0.15]. Note how "sat" attends strongly to "cat". Each contextualized output is a weighted sum of all value vectors.
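The formula above translates almost line for line into NumPy. This is a minimal sketch: the queries, keys, and values are random stand-ins for the projected "The cat sat" tokens, not trained parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # [n x n] relevance scores
    weights = softmax(scores, axis=-1)        # each row: a distribution over tokens
    return weights @ V, weights               # weighted sum of value vectors

rng = np.random.default_rng(42)
n, d_k = 3, 8                                 # three tokens: "The", "cat", "sat"
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=1))                    # each row sums to 1: [1. 1. 1.]
print(output.shape)                           # (3, 8)
```

Each row of `weights` corresponds to one row of the figure's attention matrix: a probability distribution over all input tokens.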

Multi-Head Attention

A single attention head can only capture one type of relationship. Multi-head attention runs $h$ parallel attention operations, each with its own learned $W^Q$, $W^K$, $W^V$ projection matrices of dimension $d_k = d_{\text{model}} / h$. The outputs are concatenated and projected:

$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W^O$

In practice, different heads learn different linguistic functions. Research has shown that specific heads in BERT correspond to syntactic relations (subject-verb), coreference (pronoun-antecedent), and positional patterns (attend to the next or previous token).
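The split-into-heads, concatenate, project pipeline can be sketched with random projection matrices (in a trained model these would be learned; the shapes follow the $d_k = d_{\text{model}} / h$ convention above):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        # Per-head projections: W_Q[i] etc. are [d_model x d_k]
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        scores = Q @ K.T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V)         # each head output: [n x d_k]
    return np.concatenate(heads, axis=1) @ W_O    # concat -> [n x d_model], project

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_model // h)) for _ in range(3))
W_O = rng.normal(size=(d_model, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (5, 16) — same shape as the input
```

Because each head works in its own $d_k$-dimensional subspace, the total cost matches a single head over the full $d_{\text{model}}$ dimensions.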

Positional Encoding

Self-attention treats its input as an unordered set. To restore word order, the original transformer adds sinusoidal positional encodings:

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$

This gives each position a unique signature. The sinusoidal form was chosen because it allows the model to learn relative positions: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
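A vectorized sketch of the encoding (the interleaved sine/cosine layout follows the formula above; `d_model` is assumed even):

```python
import numpy as np

def sinusoidal_pe(n, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n)[:, None]                        # [n x 1] positions
    i = np.arange(d_model // 2)[None, :]               # [1 x d/2] frequency indices
    angles = pos / np.power(10000.0, 2 * i / d_model)  # [n x d/2]
    PE = np.zeros((n, d_model))
    PE[:, 0::2] = np.sin(angles)                       # even dimensions
    PE[:, 1::2] = np.cos(angles)                       # odd dimensions
    return PE

PE = sinusoidal_pe(50, 64)
# Every position gets a distinct signature: all 50 rows are unique.
print(len(np.unique(PE.round(6), axis=0)))  # 50
```

Low dimensions oscillate quickly (fine-grained local position) while high dimensions vary slowly (coarse global position), which is what makes each row unique.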

Complexity Analysis

| Operation | Time | Space | Notes |
| --- | --- | --- | --- |
| Self-attention (one head) | $O(n^2 \cdot d_k)$ | $O(n^2 + n \cdot d_k)$ | $n$ = sequence length, $d_k$ = head dimension |
| Multi-head attention | $O(n^2 \cdot d)$ | $O(h \cdot n^2 + n \cdot d)$ | $d = h \cdot d_k$; total cost same as a single head with full $d$ |
| RNN (for comparison) | $O(n \cdot d^2)$ | $O(d)$ | Sequential; cannot parallelize across positions |
| Positional encoding | $O(n \cdot d)$ | $O(n \cdot d)$ | Precomputed or computed once per forward pass |

The $O(n^2)$ attention cost is the main bottleneck for long sequences. For $n = 4096$ tokens with $d = 1024$, the attention matrix alone requires $4096^2 \times 4$ bytes $\approx 67$ MB per head. This is why context length extensions (sparse attention, FlashAttention, ring attention) are active research areas.
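The back-of-envelope arithmetic for that figure, assuming float32 (4 bytes per entry):

```python
# One head's [n x n] attention matrix at n = 4096 tokens, float32
n = 4096
bytes_per_head = n * n * 4
print(f"{bytes_per_head / 1e6:.0f} MB")  # 67 MB
```

Note that this grows quadratically: doubling the context to 8192 tokens quadruples the matrix to roughly 268 MB per head, before multiplying by the number of heads and the batch size.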

Implementation

ALGORITHM ScaledDotProductAttention(Q, K, V)
INPUT: Q: matrix [n x d_k] (queries)
       K: matrix [n x d_k] (keys)
       V: matrix [n x d_v] (values)
OUTPUT: output: matrix [n x d_v], attentionWeights: matrix [n x n]

BEGIN
  d_k <- NUMBER OF COLUMNS IN K

  // Compute attention scores
  scores <- MATMUL(Q, TRANSPOSE(K))    // [n x n]
  scores <- scores / SQRT(d_k)          // scale to prevent softmax saturation

  // Apply softmax row-wise to get attention weights
  attentionWeights <- SOFTMAX(scores, axis=1)   // each row sums to 1

  // Weighted sum of values
  output <- MATMUL(attentionWeights, V)   // [n x d_v]

  RETURN output, attentionWeights
END

ALGORITHM MultiHeadAttention(X, h, d_model)
INPUT: X: matrix [n x d_model] (input embeddings)
       h: number of attention heads
       d_model: model dimension
OUTPUT: output: matrix [n x d_model]

BEGIN
  d_k <- d_model / h
  heads <- empty list

  FOR i FROM 1 TO h DO
    Q_i <- MATMUL(X, W_Q[i])    // W_Q[i] is [d_model x d_k]
    K_i <- MATMUL(X, W_K[i])    // W_K[i] is [d_model x d_k]
    V_i <- MATMUL(X, W_V[i])    // W_V[i] is [d_model x d_k]

    head_i, _ <- ScaledDotProductAttention(Q_i, K_i, V_i)
    APPEND head_i TO heads
  END FOR

  concatenated <- CONCATENATE(heads, axis=1)   // [n x d_model]
  output <- MATMUL(concatenated, W_O)           // W_O is [d_model x d_model]

  RETURN output
END

ALGORITHM SinusoidalPositionalEncoding(n, d_model)
INPUT: n: maximum sequence length, d_model: embedding dimension
OUTPUT: PE: matrix [n x d_model]

BEGIN
  PE <- CREATE matrix [n x d_model], all zeros

  FOR pos FROM 0 TO n - 1 DO
    FOR i FROM 0 TO d_model / 2 - 1 DO
      angle <- pos / POWER(10000, 2 * i / d_model)
      PE[pos][2 * i]     <- SIN(angle)
      PE[pos][2 * i + 1] <- COS(angle)
    END FOR
  END FOR

  RETURN PE
END

Real-World Applications

  • BERT and GPT: Both architectures are built entirely on self-attention layers, with BERT using bidirectional attention for understanding and GPT using causal (left-to-right) attention for generation
  • Coreference resolution: Attention heads naturally learn to connect pronouns to their referents ("she" attending to "Marie"), a core linguistic task
  • Syntactic probing: Researchers extract attention patterns from trained transformers and find that specific heads correspond to dependency relations (subject-verb, determiner-noun)
  • Document summarization: Cross-attention between a source document and a generated summary lets the model focus on the most relevant source sentences at each generation step
  • Code understanding: Self-attention in code models (Codex, CodeBERT) learns to connect variable uses to their definitions across long spans of source code
  • Protein structure prediction: AlphaFold uses attention over amino acid sequences to predict 3D protein structures, treating residues like "words" in a biological language

Key Takeaways

  • Self-attention lets every token directly attend to every other token, capturing long-range dependencies that RNNs struggle with
  • The Q/K/V framework computes relevance scores (query-key dot products) and uses them to weight value vectors, producing contextualized representations
  • Multi-head attention runs parallel attention operations that learn different relationship types: syntactic, semantic, positional, and coreference patterns
  • Positional encoding is essential because self-attention is inherently order-agnostic; without it, "dog bites man" and "man bites dog" are indistinguishable
  • The $O(n^2)$ cost of self-attention is the fundamental bottleneck for long sequences, driving research into sparse and linear attention variants