Machine Translation: From Encoder-Decoder to Why Google Translate Got Good Overnight
How sequence-to-sequence models, attention, beam search, and the BLEU score transformed machine translation from a rule-based curiosity into a system that handles 100+ languages.
Terminology
| Term | Definition |
|---|---|
| Machine Translation (MT) | Automatic translation of text from one natural language to another |
| Encoder-Decoder | An architecture where the encoder reads the source sentence into a fixed representation and the decoder generates the target sentence from that representation |
| Seq2Seq (Sequence-to-Sequence) | A model that maps a variable-length input sequence to a variable-length output sequence, originally using RNNs with attention |
| Cross-Attention | Attention where the decoder queries attend to encoder key-value pairs, allowing the decoder to focus on relevant source words at each generation step |
| BLEU Score | Bilingual Evaluation Understudy: an automatic metric that measures n-gram overlap between machine output and human reference translations, ranging from 0 to 1 (often reported scaled to 0-100) |
| Beam Search | A decoding strategy that maintains the top $b$ most probable partial translations at each step, exploring multiple hypotheses simultaneously |
| Greedy Decoding | Selecting the highest-probability token at each step without considering future consequences, fast but often suboptimal |
| Teacher Forcing | A training technique where the decoder receives the ground-truth previous token as input rather than its own prediction, stabilizing training |
| Brevity Penalty | A BLEU score component that penalizes translations shorter than the reference, preventing the model from gaming the metric with short outputs |
What & Why
Machine translation is one of the oldest AI problems, dating back to the 1950s. Early systems used hand-crafted rules and bilingual dictionaries. Statistical MT (2000s) used phrase tables and n-gram language models. Then in November 2016, Google switched Google Translate from statistical to neural MT overnight, and translation quality jumped more in one update than it had in the previous decade combined.
The breakthrough was the encoder-decoder architecture with attention. The encoder reads the entire source sentence and produces a sequence of hidden states. The decoder generates the target sentence one token at a time, using cross-attention to focus on the relevant source words at each step. This lets the model handle word reordering (English SVO to Japanese SOV), agreement across long distances, and idiomatic expressions that have no word-for-word translation.
Understanding MT matters because it is the canonical seq2seq problem. The same architecture powers text summarization, code generation, speech recognition, and any task that maps one sequence to another. The evaluation challenges (BLEU's limitations, the difficulty of measuring translation quality) also apply broadly to all text generation tasks.
How It Works
Encoder-Decoder Architecture
The encoder processes the source sentence $x_1, x_2, \ldots, x_n$ and produces hidden states $h_1, h_2, \ldots, h_n$. The decoder generates the target sentence $y_1, y_2, \ldots, y_m$ one token at a time, conditioned on the encoder outputs and previously generated tokens.
In the original seq2seq model (Sutskever et al., 2014), the encoder compressed the entire source into a single fixed-length vector (the final hidden state). This bottleneck limited performance on long sentences.
Attention Solves the Bottleneck
Bahdanau attention (2015) lets the decoder look at all encoder hidden states at each generation step. At decoder step $t$, the model scores each source position against the current decoder state $s_{t-1}$ and normalizes the scores with a softmax:

$$\alpha_{t,i} = \frac{\exp(\operatorname{score}(s_{t-1}, h_i))}{\sum_{j=1}^{n} \exp(\operatorname{score}(s_{t-1}, h_j))}, \qquad c_t = \sum_{i=1}^{n} \alpha_{t,i} \, h_i$$

The context vector $c_t$ is a weighted sum of encoder states, focusing on the source words most relevant to the current target word.
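The context-vector computation can be sketched in a few lines of NumPy. For simplicity this sketch uses dot-product scoring; Bahdanau's original formulation scored positions with a small feed-forward network instead, but the weighted-sum structure is the same:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Compute one attention context vector for the current decoder step.

    decoder_state:  shape (d,), the query s_{t-1}
    encoder_states: shape (n, d), the keys/values h_1..h_n
    """
    # Alignment scores: dot product of the decoder state with each h_i
    # (a simplification of Bahdanau's additive scoring network)
    scores = encoder_states @ decoder_state          # (n,)
    # Softmax over source positions gives attention weights alpha_{t,i}
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector c_t: weighted sum of encoder states
    context = weights @ encoder_states               # (d,)
    return weights, context

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))   # 5 source positions, model dimension d = 8
s = rng.normal(size=8)        # current decoder state
w, c = attention_context(s, h)
print(w)  # weights sum to 1 (up to rounding)
```

In a real decoder, $c_t$ would then be concatenated with the decoder state to predict the next target token.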
Beam Search
Greedy decoding picks the most probable token at each step, but this can lead to suboptimal translations. Beam search maintains b candidate translations (beams) at each step, expanding each by all possible next tokens and keeping only the top b by cumulative probability. With beam width b = 4, the decoder explores 4 hypotheses in parallel, often finding better translations than greedy search.
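The failure mode of greedy decoding is easy to demonstrate with a toy two-step example (all probabilities below are made up for illustration): the locally best first token can lead to a worse overall sequence probability than a slightly less likely first token with a much stronger continuation.

```python
# Toy two-step decoding over a 3-token vocabulary.
step1 = [0.5, 0.4, 0.1]                      # P(y1)
step2 = {0: [0.3, 0.3, 0.4],                 # P(y2 | y1 = 0)
         1: [0.9, 0.05, 0.05],               # P(y2 | y1 = 1)
         2: [0.4, 0.3, 0.3]}                 # P(y2 | y1 = 2)

# Greedy: argmax at each step, no lookahead
g1 = max(range(3), key=lambda t: step1[t])            # picks token 0 (p = 0.5)
g2 = max(range(3), key=lambda t: step2[g1][t])
greedy_prob = step1[g1] * step2[g1][g2]               # 0.5 * 0.4 = 0.20

# Exhaustive search (here, beam search with width >= 2 finds the same answer)
best_prob = max(step1[a] * step2[a][b]
                for a in range(3) for b in range(3))  # 0.4 * 0.9 = 0.36

print(greedy_prob, best_prob)
```

Greedy commits to the 0.5-probability first token and ends at 0.20; keeping the second hypothesis alive finds the 0.36-probability sequence.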
BLEU Score
BLEU measures translation quality by counting n-gram matches between the machine output and one or more human reference translations:

$$\text{BLEU} = BP \cdot \exp\!\left(\frac{1}{N}\sum_{n=1}^{N} \log p_n\right)$$

where $p_n$ is the modified precision for n-grams of length $n$, $N$ is the maximum n-gram order (typically 4), and $BP$ is the brevity penalty that penalizes short translations.
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Encoder forward pass | $O(n^2 \cdot d)$ | $O(n \cdot d)$ | $n$ = source length, $d$ = model dimension (transformer encoder) |
| Decoder step (one token) | $O((m + n) \cdot d)$ | $O(d)$ | $m$ = target tokens generated so far, includes cross-attention over $n$ source states |
| Greedy decoding (full) | $O(m \cdot (m + n) \cdot d)$ | $O(m \cdot d)$ | $m$ autoregressive steps |
| Beam search decoding | $O(b \cdot m \cdot (m + n) \cdot d)$ | $O(b \cdot m \cdot d)$ | $b$ = beam width, typically 4 to 10 |
| BLEU computation | $O(L)$ | $O(L)$ | $L$ = total tokens in output + reference, counting n-gram matches |
For a typical translation with source length $n = 30$, target length $m = 35$, model dimension $d = 512$, and beam width $b = 4$, beam search performs roughly $4 \times 35 \times 65 \times 512 \approx 4.7 \times 10^6$ operations. This is fast enough for real-time translation.
Implementation
ALGORITHM BeamSearchDecode(encoder_states, decoder, beamWidth, maxLen, vocab)
INPUT: encoder_states: matrix [n x d] from encoder
decoder: trained decoder model
beamWidth: integer b
maxLen: maximum output length
vocab: target vocabulary
OUTPUT: bestTranslation: list of tokens
BEGIN
// Initialize beams with start-of-sequence token
beams <- [( tokens: [SOS], logProb: 0.0 )]
FOR step FROM 1 TO maxLen DO
allCandidates <- empty list
FOR EACH beam IN beams DO
IF beam.tokens[-1] = EOS THEN
APPEND beam TO allCandidates
CONTINUE
END IF
// Run one decoder step with cross-attention to encoder_states
logits <- decoder.STEP(beam.tokens, encoder_states)
logProbs <- LOG_SOFTMAX(logits)
// Expand beam with top-b next tokens
topK <- TOP_K(logProbs, beamWidth)
FOR EACH (token, lp) IN topK DO
newBeam <- (
tokens: beam.tokens + [token],
logProb: beam.logProb + lp
)
APPEND newBeam TO allCandidates
END FOR
END FOR
// Keep only top-b beams by cumulative log probability
SORT allCandidates BY logProb DESCENDING
beams <- FIRST beamWidth ELEMENTS OF allCandidates
// Early stop if all beams have ended
IF ALL beams end with EOS THEN BREAK
END FOR
// Length normalization: divide log prob by length to avoid short-sentence bias
FOR EACH beam IN beams DO
beam.score <- beam.logProb / LENGTH(beam.tokens)
END FOR
bestTranslation <- beam with highest score
RETURN bestTranslation.tokens
END
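The beam-search pseudocode above translates directly into Python. The sketch below abstracts the model behind a `step_fn` callback (a hypothetical stand-in for `decoder.STEP` followed by `LOG_SOFTMAX`; a real decoder would also take the encoder states) so it can be run against a toy distribution:

```python
import math

def beam_search(step_fn, sos, eos, beam_width=4, max_len=20):
    """Beam search over a token-level model.

    step_fn(tokens) -> dict mapping next token -> log-probability.
    """
    beams = [([sos], 0.0)]                     # (tokens, cumulative log prob)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == eos:              # finished beams pass through
                candidates.append((tokens, logp))
                continue
            log_probs = step_fn(tokens)
            # Expand with the top-b next tokens only
            top = sorted(log_probs.items(), key=lambda kv: -kv[1])[:beam_width]
            for tok, lp in top:
                candidates.append((tokens + [tok], logp + lp))
        # Keep top-b beams by cumulative log probability
        candidates.sort(key=lambda c: -c[1])
        beams = candidates[:beam_width]
        if all(t[-1] == eos for t, _ in beams):
            break
    # Length normalization to avoid favoring short outputs
    return max(beams, key=lambda c: c[1] / len(c[0]))[0]

# Toy "decoder": same next-token distribution regardless of history
def toy_step(tokens):
    return {"a": math.log(0.3), "b": math.log(0.1), "</s>": math.log(0.6)}

print(beam_search(toy_step, "<s>", "</s>", beam_width=2, max_len=5))
# → ['<s>', '</s>']
```

With this toy distribution, ending immediately has probability 0.6, which beats every longer hypothesis even after length normalization.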
ALGORITHM ComputeBLEU(candidate, references, maxN)
INPUT: candidate: list of tokens (machine translation)
references: list of reference translations (each a list of tokens)
maxN: maximum n-gram order (typically 4)
OUTPUT: bleuScore: float in [0, 1]
BEGIN
// Compute modified precision for each n-gram order
precisions <- empty list
FOR n FROM 1 TO maxN DO
candidateNgrams <- EXTRACT all n-grams from candidate
candidateCounts <- COUNT occurrences of each n-gram in candidateNgrams
// Clip counts by maximum count in any reference
clippedTotal <- 0
totalCandidate <- 0
FOR EACH (ngram, count) IN candidateCounts DO
maxRefCount <- MAX over references of COUNT(ngram in reference)
clippedTotal <- clippedTotal + MIN(count, maxRefCount)
totalCandidate <- totalCandidate + count
END FOR
IF totalCandidate = 0 THEN
APPEND 0 TO precisions
ELSE
APPEND clippedTotal / totalCandidate TO precisions
END IF
END FOR
// Geometric mean of precisions (in log space)
IF ANY p IN precisions = 0 THEN RETURN 0.0  // standard BLEU is 0 if any precision is 0
logAvg <- SUM(LOG(p) for p in precisions) / maxN
// Brevity penalty
c <- LENGTH(candidate)
r <- LENGTH of closest reference by length
IF c < r THEN
BP <- EXP(1 - r / c)
ELSE
BP <- 1.0
END IF
bleuScore <- BP * EXP(logAvg)
RETURN bleuScore
END
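The ComputeBLEU pseudocode can be sketched as a sentence-level implementation in Python (no smoothing, like the pseudocode; production toolkits such as sacreBLEU add smoothing and standardized tokenization):

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions, geometric mean, BP."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        clipped = total = 0
        for ngram, count in cand_ngrams.items():
            # Clip by the maximum count of this n-gram in any reference
            max_ref = max(
                sum(1 for i in range(len(ref) - n + 1)
                    if tuple(ref[i:i + n]) == ngram)
                for ref in references)
            clipped += min(count, max_ref)
            total += count
        precisions.append(clipped / total if total else 0.0)
    # Standard BLEU is zero if any n-gram precision is zero
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty against the reference closest in length
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    bp = math.exp(1 - r / c) if c < r else 1.0
    return bp * math.exp(log_avg)

cand = "the cat is on the mat".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(bleu(cand, refs))  # exact match → 1.0
```

Note how quickly the score collapses for short candidates: with `max_n = 4`, any candidate shorter than four tokens has a zero 4-gram precision and scores 0, which is one reason BLEU is usually reported at the corpus level rather than per sentence.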
Real-World Applications
- Google Translate: Switched from statistical to neural MT in 2016 with an LSTM-based encoder-decoder (GNMT), later moving to transformer models; it now handles 100+ languages
- Real-time subtitling: Live translation of speech (via ASR + MT) powers real-time subtitles in video conferencing tools like Microsoft Teams and Zoom
- Localization pipelines: Software companies use MT as a first pass for translating UI strings, documentation, and marketing content, with human post-editing for quality
- Cross-lingual information retrieval: MT enables searching documents in one language using queries in another, critical for intelligence and legal applications
- Low-resource languages: Multilingual models (mBART, NLLB) transfer translation ability from high-resource to low-resource language pairs via shared representations
- Code translation: The same seq2seq architecture translates between programming languages (Python to Java), treating source code as a "language"
Key Takeaways
- The encoder-decoder architecture with attention is the foundation of modern MT: the encoder reads the source, the decoder generates the target, and cross-attention connects them
- Attention solved the information bottleneck of fixed-length encoder representations, allowing the decoder to focus on relevant source words at each generation step
- Beam search explores multiple translation hypotheses in parallel, finding better translations than greedy decoding at a modest computational cost
- BLEU measures n-gram overlap with reference translations and remains the standard automatic metric, despite known limitations (insensitivity to meaning-preserving paraphrases)
- The transformer architecture replaced RNN-based seq2seq models, enabling parallelized training and better handling of long-range dependencies