# Attention and Self-Attention Through the Linguistics Lens
How the attention mechanism lets transformers learn which words relate to which, why positional encoding replaces word order, and what self-attention looks like from a linguist's perspective.
## Terminology
| Term | Definition |
|---|---|
| Attention | A mechanism that computes a weighted sum over a set of values, where the weights indicate how relevant each value is to a given query |
| Self-Attention | Attention where queries, keys, and values all come from the same sequence, allowing each token to attend to every other token in the input |
| Query, Key, Value (Q, K, V) | Three learned projections of each token: the query asks "what am I looking for?", keys answer "what do I contain?", and values provide "what information do I carry?" |
| Scaled Dot-Product Attention | $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, the standard attention formula used in transformers |
| Multi-Head Attention | Running multiple attention operations in parallel with different learned projections, then concatenating results, allowing the model to attend to different relationship types simultaneously |
| Positional Encoding | A vector added to each token embedding to inject information about its position in the sequence, since self-attention is inherently order-agnostic |
| Attention Head | One of the parallel attention computations in multi-head attention, each with its own Q, K, V projection matrices |
| Softmax | A function that converts a vector of real numbers into a probability distribution: $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ |
| Contextualized Embedding | A word representation that changes depending on surrounding words, unlike static embeddings (Word2Vec) where each word has one fixed vector |
## What & Why
Language is full of long-range dependencies. In "The cat that the dog chased ran away," the verb "ran" agrees with "cat," not the nearby "dog." RNNs process tokens sequentially and struggle with these distant connections because information must survive many time steps. Attention solves this by letting every token directly look at every other token, regardless of distance.
From a linguistics perspective, self-attention learns something resembling grammatical relationships. One attention head might learn to connect pronouns to their antecedents ("she" attending to "Marie"). Another might learn syntactic dependencies (a verb attending to its subject). A third might capture semantic roles. The model discovers these patterns from data, without being told what grammar is.
The trade-off is that self-attention is inherently unordered. It treats the input as a set, not a sequence. "Dog bites man" and "man bites dog" produce identical attention patterns unless position information is injected. Positional encoding solves this by adding a unique position signal to each token, effectively telling the model "this is the third word."
## How It Works
### Scaled Dot-Product Attention
Each token is projected into three vectors: a query $q$, a key $k$, and a value $v$. The attention score between token $i$ and token $j$ is the dot product $q_i \cdot k_j$, divided by $\sqrt{d_k}$ to prevent the softmax from saturating:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The softmax converts each token's row of scores into a probability distribution. The output for each token is then a weighted sum of all value vectors, where the weights reflect relevance.
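A minimal NumPy sketch of this computation; the toy shapes, random inputs, and the `attention` helper name are illustrative assumptions, not from the original:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # [n, n] scaled relevance scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights          # weighted sum of values

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = rng.normal(size=(3, n, d_k))   # three toy [n, d_k] matrices
out, w = attention(Q, K, V)
print(out.shape, w.shape)                # (4, 8) (4, 4)
print(np.allclose(w.sum(axis=-1), 1.0))  # True: rows are distributions
```

Note that each output row mixes *all* value vectors, which is exactly how a token's representation becomes contextualized.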
### Multi-Head Attention
A single attention head can only capture one type of relationship. Multi-head attention runs $h$ parallel attention operations, each with its own learned projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ of dimension $d_k = d_{\text{model}} / h$. The outputs are concatenated and projected:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O, \quad \text{head}_i = \text{Attention}(XW_i^Q,\, XW_i^K,\, XW_i^V)$$
In practice, different heads learn different linguistic functions. Research has shown that specific heads in BERT correspond to syntactic relations (subject-verb), coreference (pronoun-antecedent), and positional patterns (attend to the next or previous token).
### Positional Encoding
Self-attention treats its input as an unordered set. To restore word order, the original transformer adds sinusoidal positional encodings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
This gives each position a unique signature. The sinusoidal form was chosen because it allows the model to learn relative positions: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
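The encoding can be sketched and sanity-checked in NumPy; the `positional_encoding` helper name and the toy sizes are assumptions for illustration:

```python
import numpy as np

def positional_encoding(n, d_model):
    pos = np.arange(n)[:, None]                      # [n, 1] positions
    i = np.arange(d_model // 2)[None, :]             # [1, d_model/2] dimension pairs
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dims: sine
    pe[:, 1::2] = np.cos(angle)                      # odd dims: cosine
    return pe

pe = positional_encoding(50, 16)
# position 0 is all (sin 0, cos 0) = (0, 1) pairs
print(pe[0, 0], pe[0, 1])                            # 0.0 1.0
# every position gets a distinct signature
print(len({tuple(np.round(row, 6)) for row in pe}))  # 50
```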
## Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Self-attention (one head) | $O(n^2 \cdot d_k)$ | $O(n^2 + n \cdot d_k)$ | $n$ = sequence length, $d_k$ = head dimension |
| Multi-head attention | $O(n^2 \cdot d)$ | $O(h \cdot n^2 + n \cdot d)$ | $d = h \cdot d_k$, total cost same as single head with full $d$ |
| RNN (for comparison) | $O(n \cdot d^2)$ | $O(d)$ | Sequential, cannot parallelize across positions |
| Positional encoding | $O(n \cdot d)$ | $O(n \cdot d)$ | Precomputed or computed once per forward pass |
The $O(n^2)$ attention cost is the main bottleneck for long sequences. For $n = 4096$ tokens with $d = 1024$, the attention matrix alone requires $4096^2 \times 4$ bytes $\approx 67$ MB per head. This is why context length extensions (sparse attention, FlashAttention, ring attention) are active research areas.
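The 67 MB figure is quick to check by hand, assuming float32 (4 bytes per score); a back-of-the-envelope sketch, not a profiler measurement:

```python
# memory for one head's [n, n] attention matrix at n = 4096, float32
n = 4096
bytes_per_head = n * n * 4
print(f"{bytes_per_head / 1e6:.0f} MB per head")  # 67 MB
```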
## Implementation
```
ALGORITHM ScaledDotProductAttention(Q, K, V)
INPUT:  Q: matrix [n x d_k] (queries)
        K: matrix [n x d_k] (keys)
        V: matrix [n x d_v] (values)
OUTPUT: output: matrix [n x d_v], attentionWeights: matrix [n x n]
BEGIN
    d_k <- NUMBER OF COLUMNS IN K
    // Compute attention scores
    scores <- MATMUL(Q, TRANSPOSE(K))           // [n x n]
    scores <- scores / SQRT(d_k)                // scale to prevent softmax saturation
    // Apply softmax row-wise to get attention weights
    attentionWeights <- SOFTMAX(scores, axis=1) // each row sums to 1
    // Weighted sum of values
    output <- MATMUL(attentionWeights, V)       // [n x d_v]
    RETURN output, attentionWeights
END
```
```
ALGORITHM MultiHeadAttention(X, h, d_model)
INPUT:  X: matrix [n x d_model] (input embeddings)
        h: number of attention heads
        d_model: model dimension
OUTPUT: output: matrix [n x d_model]
BEGIN
    d_k <- d_model / h
    heads <- empty list
    FOR i FROM 1 TO h DO
        Q_i <- MATMUL(X, W_Q[i])   // W_Q[i] is [d_model x d_k]
        K_i <- MATMUL(X, W_K[i])   // W_K[i] is [d_model x d_k]
        V_i <- MATMUL(X, W_V[i])   // W_V[i] is [d_model x d_k]
        head_i, _ <- ScaledDotProductAttention(Q_i, K_i, V_i)
        APPEND head_i TO heads
    END FOR
    concatenated <- CONCATENATE(heads, axis=1)  // [n x d_model]
    output <- MATMUL(concatenated, W_O)         // W_O is [d_model x d_model]
    RETURN output
END
```
```
ALGORITHM SinusoidalPositionalEncoding(n, d_model)
INPUT:  n: maximum sequence length, d_model: embedding dimension
OUTPUT: PE: matrix [n x d_model]
BEGIN
    PE <- CREATE matrix [n x d_model], all zeros
    FOR pos FROM 0 TO n - 1 DO
        FOR i FROM 0 TO d_model / 2 - 1 DO
            angle <- pos / POWER(10000, 2 * i / d_model)
            PE[pos][2 * i]     <- SIN(angle)
            PE[pos][2 * i + 1] <- COS(angle)
        END FOR
    END FOR
    RETURN PE
END
```
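Assuming NumPy, the multi-head pseudocode above translates into a compact runnable sketch. The weight matrices here are random stand-ins for learned parameters, and names like `multi_head_attention` are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: per-head lists of [d_model, d_k] projections
    heads = []
    for wq, wk, wv in zip(W_Q, W_K, W_V):
        head, _ = scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
        heads.append(head)
    # concatenate the h [n, d_k] heads back to [n, d_model], then project
    return np.concatenate(heads, axis=1) @ W_O

rng = np.random.default_rng(0)
n, h, d_model = 5, 4, 32
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(d_model, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 32)
```

In a real transformer layer the per-head projections are usually implemented as one fused `[d_model, d_model]` matrix multiply followed by a reshape, which is mathematically equivalent and faster.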
## Real-World Applications
- BERT and GPT: Both architectures are built entirely on self-attention layers, with BERT using bidirectional attention for understanding and GPT using causal (left-to-right) attention for generation
- Coreference resolution: Attention heads naturally learn to connect pronouns to their referents ("she" attending to "Marie"), a core linguistic task
- Syntactic probing: Researchers extract attention patterns from trained transformers and find that specific heads correspond to dependency relations (subject-verb, determiner-noun)
- Document summarization: Cross-attention between a source document and a generated summary lets the model focus on the most relevant source sentences at each generation step
- Code understanding: Self-attention in code models (Codex, CodeBERT) learns to connect variable uses to their definitions across long spans of source code
- Protein structure prediction: AlphaFold uses attention over amino acid sequences to predict 3D protein structures, treating residues like "words" in a biological language
## Key Takeaways
- Self-attention lets every token directly attend to every other token, capturing long-range dependencies that RNNs struggle with
- The Q/K/V framework computes relevance scores (query-key dot products) and uses them to weight value vectors, producing contextualized representations
- Multi-head attention runs parallel attention operations that learn different relationship types: syntactic, semantic, positional, and coreference patterns
- Positional encoding is essential because self-attention is inherently order-agnostic; without it, "dog bites man" and "man bites dog" are indistinguishable
- The $O(n^2)$ cost of self-attention is the fundamental bottleneck for long sequences, driving research into sparse and linear attention variants