Machine Translation: From Encoder-Decoder to Why Google Translate Got Good Overnight
How sequence-to-sequence models, attention, beam search, and the BLEU score transformed machine translation from a rule-based curiosity into a system that handles 100+ languages.
Terminology
| Term | Definition |
|---|---|
| Machine Translation (MT) | Automatic translation of text from one natural language to another |
| Encoder-Decoder | An architecture where the encoder reads the source sentence into a fixed representation and the decoder generates the target sentence from that representation |
| Seq2Seq (Sequence-to-Sequence) | A model that maps a variable-length input sequence to a variable-length output sequence, originally using RNNs with attention |
| Cross-Attention | Attention where the decoder queries attend to encoder key-value pairs, allowing the decoder to focus on relevant source words at each generation step |
| BLEU Score | Bilingual Evaluation Understudy: an automatic metric that measures n-gram overlap between machine output and human reference translations, ranging from 0 to 1 (often reported scaled to 0-100) |
| Beam Search | A decoding strategy that maintains the top $b$ most probable partial translations at each step, exploring multiple hypotheses simultaneously |
| Greedy Decoding | Selecting the highest-probability token at each step without considering future consequences, fast but often suboptimal |
| Teacher Forcing | A training technique where the decoder receives the ground-truth previous token as input rather than its own prediction, stabilizing training |
| Brevity Penalty | A BLEU score component that penalizes translations shorter than the reference, preventing the model from gaming the metric with short outputs |
What & Why
Machine translation is one of the oldest AI problems, dating back to the 1950s. Early systems used hand-crafted rules and bilingual dictionaries. Statistical MT (2000s) used phrase tables and n-gram language models. Then in November 2016, Google switched Google Translate from statistical to neural MT overnight, and translation quality jumped more in one update than it had in the previous decade combined.
The breakthrough was the encoder-decoder architecture with attention. The encoder reads the entire source sentence and produces a sequence of hidden states. The decoder generates the target sentence one token at a time, using cross-attention to focus on the relevant source words at each step. This lets the model handle word reordering (English SVO to Japanese SOV), agreement across long distances, and idiomatic expressions that have no word-for-word translation.
Understanding MT matters because it is the canonical seq2seq problem. The same architecture powers text summarization, code generation, speech recognition, and any task that maps one sequence to another. The evaluation challenges (BLEU's limitations, the difficulty of measuring translation quality) also apply broadly to all text generation tasks.
How It Works
Encoder-Decoder Architecture
The encoder processes the source sentence $x_1, x_2, \ldots, x_n$ and produces hidden states $h_1, h_2, \ldots, h_n$. The decoder generates the target sentence $y_1, y_2, \ldots, y_m$ one token at a time, conditioned on the encoder outputs and previously generated tokens.
In the original seq2seq model (Sutskever et al., 2014), the encoder compressed the entire source into a single fixed-length vector (the final hidden state). This bottleneck limited performance on long sentences.
Attention Solves the Bottleneck
Bahdanau attention (2015) lets the decoder look at all encoder hidden states at each generation step. At decoder step $t$, the model scores each source position against the current decoder state $s_{t-1}$ and normalizes the scores with a softmax:

$$\alpha_{t,i} = \frac{\exp(\operatorname{score}(s_{t-1}, h_i))}{\sum_{j=1}^{n} \exp(\operatorname{score}(s_{t-1}, h_j))}, \qquad c_t = \sum_{i=1}^{n} \alpha_{t,i} \, h_i$$

The context vector $c_t$ is a weighted sum of encoder states, focusing on the source words most relevant to the current target word.
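The context-vector computation can be sketched in a few lines of NumPy. For simplicity this sketch uses dot-product scoring; Bahdanau's original formulation scored positions with a small feed-forward network instead, but the weighted-sum structure is the same:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Compute one attention context vector for the current decoder step.

    decoder_state:  shape (d,), the query s_{t-1}
    encoder_states: shape (n, d), the keys/values h_1..h_n
    """
    # Alignment scores: dot product of the decoder state with each h_i
    # (a simplification of Bahdanau's additive scoring network)
    scores = encoder_states @ decoder_state          # (n,)
    # Softmax over source positions gives attention weights alpha_{t,i}
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector c_t: weighted sum of encoder states
    context = weights @ encoder_states               # (d,)
    return weights, context

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))   # 5 source positions, model dimension d = 8
s = rng.normal(size=8)        # current decoder state
w, c = attention_context(s, h)
print(w)  # weights sum to 1 (up to rounding)
```

In a real decoder, $c_t$ would then be concatenated with the decoder state to predict the next target token.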
Beam Search
Greedy decoding picks the most probable token at each step, but this can lead to suboptimal translations. Beam search maintains b candidate translations (beams) at each step, expanding each by all possible next tokens and keeping only the top b by cumulative probability. With beam width b = 4, the decoder explores 4 hypotheses in parallel, often finding better translations than greedy search.
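The failure mode of greedy decoding is easy to demonstrate with a toy two-step example (all probabilities below are made up for illustration): the locally best first token can lead to a worse overall sequence probability than a slightly less likely first token with a much stronger continuation.

```python
# Toy two-step decoding over a 3-token vocabulary.
step1 = [0.5, 0.4, 0.1]                      # P(y1)
step2 = {0: [0.3, 0.3, 0.4],                 # P(y2 | y1 = 0)
         1: [0.9, 0.05, 0.05],               # P(y2 | y1 = 1)
         2: [0.4, 0.3, 0.3]}                 # P(y2 | y1 = 2)

# Greedy: argmax at each step, no lookahead
g1 = max(range(3), key=lambda t: step1[t])            # picks token 0 (p = 0.5)
g2 = max(range(3), key=lambda t: step2[g1][t])
greedy_prob = step1[g1] * step2[g1][g2]               # 0.5 * 0.4 = 0.20

# Exhaustive search (here, beam search with width >= 2 finds the same answer)
best_prob = max(step1[a] * step2[a][b]
                for a in range(3) for b in range(3))  # 0.4 * 0.9 = 0.36

print(greedy_prob, best_prob)
```

Greedy commits to the 0.5-probability first token and ends at 0.20; keeping the second hypothesis alive finds the 0.36-probability sequence.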
BLEU Score
BLEU measures translation quality by counting n-gram matches between the machine output and one or more human reference translations:

$$\text{BLEU} = BP \cdot \exp\!\left(\frac{1}{N}\sum_{n=1}^{N} \log p_n\right)$$

where $p_n$ is the modified precision for n-grams of length $n$, $N$ is the maximum n-gram order (typically 4), and $BP$ is the brevity penalty that penalizes short translations.
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Encoder forward pass | $O(n^2 \cdot d)$ | $O(n \cdot d)$ | $n$ = source length, $d$ = model dimension (transformer encoder) |
| Decoder step (one token) | $O((m + n) \cdot d)$ | $O(d)$ | $m$ = target tokens generated so far, includes cross-attention over $n$ source states |
| Greedy decoding (full) | $O(m \cdot (m + n) \cdot d)$ | $O(m \cdot d)$ | $m$ autoregressive steps |
| Beam search decoding | $O(b \cdot m \cdot (m + n) \cdot d)$ | $O(b \cdot m \cdot d)$ | $b$ = beam width, typically 4 to 10 |
| BLEU computation | $O(L)$ | $O(L)$ | $L$ = total tokens in output + reference, counting n-gram matches |
For a typical translation with source length $n = 30$, target length $m = 35$, model dimension $d = 512$, and beam width $b = 4$, beam search performs roughly $4 \times 35 \times 65 \times 512 \approx 4.7 \times 10^6$ operations. This is fast enough for real-time translation.
Implementation
ALGORITHM BeamSearchDecode(encoder_states, decoder, beamWidth, maxLen, vocab)
INPUT: encoder_states: matrix [n x d] from encoder
decoder: trained decoder model
beamWidth: integer b
maxLen: maximum output length
vocab: target vocabulary
OUTPUT: bestTranslation: list of tokens
BEGIN
// Initialize beams with start-of-sequence token
beams <- [( tokens: [SOS], logProb: 0.0 )]
FOR step FROM 1 TO maxLen DO
allCandidates <- empty list
FOR EACH beam IN beams DO
IF beam.tokens[-1] = EOS THEN
APPEND beam TO allCandidates
CONTINUE
END IF
// Run one decoder step with cross-attention to encoder_states
logits <- decoder.STEP(beam.tokens, encoder_states)
logProbs <- LOG_SOFTMAX(logits)
// Expand beam with top-b next tokens
topK <- TOP_K(logProbs, beamWidth)
FOR EACH (token, lp) IN topK DO
newBeam <- (
tokens: beam.tokens + [token],
logProb: beam.logProb + lp
)
APPEND newBeam TO allCandidates
END FOR
END FOR
// Keep only top-b beams by cumulative log probability
SORT allCandidates BY logProb DESCENDING
beams <- FIRST beamWidth ELEMENTS OF allCandidates
// Early stop if all beams have ended
IF ALL beams end with EOS THEN BREAK
END FOR
// Length normalization: divide log prob by length to avoid short-sentence bias
FOR EACH beam IN beams DO
beam.score <- beam.logProb / LENGTH(beam.tokens)
END FOR
bestTranslation <- beam with highest score
RETURN bestTranslation.tokens
END
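The beam-search pseudocode above translates directly into Python. The sketch below abstracts the model behind a `step_fn` callback (a hypothetical stand-in for `decoder.STEP` followed by `LOG_SOFTMAX`; a real decoder would also take the encoder states) so it can be run against a toy distribution:

```python
import math

def beam_search(step_fn, sos, eos, beam_width=4, max_len=20):
    """Beam search over a token-level model.

    step_fn(tokens) -> dict mapping next token -> log-probability.
    """
    beams = [([sos], 0.0)]                     # (tokens, cumulative log prob)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == eos:              # finished beams pass through
                candidates.append((tokens, logp))
                continue
            log_probs = step_fn(tokens)
            # Expand with the top-b next tokens only
            top = sorted(log_probs.items(), key=lambda kv: -kv[1])[:beam_width]
            for tok, lp in top:
                candidates.append((tokens + [tok], logp + lp))
        # Keep top-b beams by cumulative log probability
        candidates.sort(key=lambda c: -c[1])
        beams = candidates[:beam_width]
        if all(t[-1] == eos for t, _ in beams):
            break
    # Length normalization to avoid favoring short outputs
    return max(beams, key=lambda c: c[1] / len(c[0]))[0]

# Toy "decoder": same next-token distribution regardless of history
def toy_step(tokens):
    return {"a": math.log(0.3), "b": math.log(0.1), "</s>": math.log(0.6)}

print(beam_search(toy_step, "<s>", "</s>", beam_width=2, max_len=5))
# → ['<s>', '</s>']
```

With this toy distribution, ending immediately has probability 0.6, which beats every longer hypothesis even after length normalization.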
ALGORITHM ComputeBLEU(candidate, references, maxN)
INPUT: candidate: list of tokens (machine translation)
references: list of reference translations (each a list of tokens)
maxN: maximum n-gram order (typically 4)
OUTPUT: bleuScore: float in [0, 1]
BEGIN
// Compute modified precision for each n-gram order
precisions <- empty list
FOR n FROM 1 TO maxN DO
candidateNgrams <- EXTRACT all n-grams from candidate
candidateCounts <- COUNT occurrences of each n-gram in candidateNgrams
// Clip counts by maximum count in any reference
clippedTotal <- 0
totalCandidate <- 0
FOR EACH (ngram, count) IN candidateCounts DO
maxRefCount <- MAX over references of COUNT(ngram in reference)
clippedTotal <- clippedTotal + MIN(count, maxRefCount)
totalCandidate <- totalCandidate + count
END FOR
IF totalCandidate = 0 THEN
APPEND 0 TO precisions
ELSE
APPEND clippedTotal / totalCandidate TO precisions
END IF
END FOR
// Geometric mean of precisions (in log space)
IF ANY p IN precisions = 0 THEN RETURN 0.0  // standard BLEU is 0 if any precision is 0
logAvg <- SUM(LOG(p) for p in precisions) / maxN
// Brevity penalty
c <- LENGTH(candidate)
r <- LENGTH of closest reference by length
IF c < r THEN
BP <- EXP(1 - r / c)
ELSE
BP <- 1.0
END IF
bleuScore <- BP * EXP(logAvg)
RETURN bleuScore
END
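The ComputeBLEU pseudocode can be sketched as a sentence-level implementation in Python (no smoothing, like the pseudocode; production toolkits such as sacreBLEU add smoothing and standardized tokenization):

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions, geometric mean, BP."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        clipped = total = 0
        for ngram, count in cand_ngrams.items():
            # Clip by the maximum count of this n-gram in any reference
            max_ref = max(
                sum(1 for i in range(len(ref) - n + 1)
                    if tuple(ref[i:i + n]) == ngram)
                for ref in references)
            clipped += min(count, max_ref)
            total += count
        precisions.append(clipped / total if total else 0.0)
    # Standard BLEU is zero if any n-gram precision is zero
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty against the reference closest in length
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    bp = math.exp(1 - r / c) if c < r else 1.0
    return bp * math.exp(log_avg)

cand = "the cat is on the mat".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(bleu(cand, refs))  # exact match → 1.0
```

Note how quickly the score collapses for short candidates: with `max_n = 4`, any candidate shorter than four tokens has a zero 4-gram precision and scores 0, which is one reason BLEU is usually reported at the corpus level rather than per sentence.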
Real-World Applications
- Google Translate: Switched from statistical to neural MT in 2016 with an LSTM-based encoder-decoder (GNMT), later moving to transformer models; it now handles 100+ languages
- Real-time subtitling: Live translation of speech (via ASR + MT) powers real-time subtitles in video conferencing tools like Microsoft Teams and Zoom
- Localization pipelines: Software companies use MT as a first pass for translating UI strings, documentation, and marketing content, with human post-editing for quality
- Cross-lingual information retrieval: MT enables searching documents in one language using queries in another, critical for intelligence and legal applications
- Low-resource languages: Multilingual models (mBART, NLLB) transfer translation ability from high-resource to low-resource language pairs via shared representations
- Code translation: The same seq2seq architecture translates between programming languages (Python to Java), treating source code as a "language"
Key Takeaways
- The encoder-decoder architecture with attention is the foundation of modern MT: the encoder reads the source, the decoder generates the target, and cross-attention connects them
- Attention solved the information bottleneck of fixed-length encoder representations, allowing the decoder to focus on relevant source words at each generation step
- Beam search explores multiple translation hypotheses in parallel, finding better translations than greedy decoding at a modest computational cost
- BLEU measures n-gram overlap with reference translations and remains the standard automatic metric, despite known limitations (insensitivity to meaning-preserving paraphrases)
- The transformer architecture replaced RNN-based seq2seq models, enabling parallelized training and better handling of long-range dependencies