
Text Generation and Decoding Strategies: How ChatGPT Picks the Next Token

From greedy decoding to top-k, top-p nucleus sampling, and temperature scaling, exploring the algorithms that turn a probability distribution over tokens into coherent, creative text.

2025-10-17
Tags: Computational Linguistics · text-generation · decoding · nlp

Terminology

| Term | Definition |
| --- | --- |
| Autoregressive Generation | Generating text one token at a time, where each new token is conditioned on all previously generated tokens |
| Greedy Decoding | Always selecting the token with the highest probability at each step; fast but often produces repetitive, dull text |
| Temperature | A scalar $T$ that divides logits before softmax: $P(w) = \frac{e^{z_w / T}}{\sum e^{z_j / T}}$. Low $T$ sharpens the distribution; high $T$ flattens it |
| Top-k Sampling | Restricting sampling to the $k$ most probable tokens, redistributing their probabilities to sum to 1 |
| Top-p (Nucleus) Sampling | Restricting sampling to the smallest set of tokens whose cumulative probability exceeds $p$, adapting the candidate set size dynamically |
| Logits | The raw, unnormalized scores output by the model's final linear layer before softmax converts them to probabilities |
| Repetition Penalty | A technique that reduces the probability of tokens that have already appeared in the generated text, discouraging loops |
| KV Cache | Caching the key and value tensors from previous decoder steps to avoid recomputing them, reducing each generation step from $O(n^2)$ to $O(n)$ |
| Speculative Decoding | Using a small, fast "draft" model to propose multiple tokens, then verifying them in parallel with the large model to speed up generation |

What & Why

A language model produces a probability distribution over the entire vocabulary at each step. The decoding strategy decides how to turn that distribution into an actual token. This choice has an enormous impact on output quality.

Greedy decoding (always pick the most probable token) produces grammatically correct but boring, repetitive text. Pure random sampling produces creative but incoherent text. The practical strategies used by ChatGPT, Claude, and other LLMs sit between these extremes: temperature scaling controls randomness, top-k limits the candidate pool, and top-p (nucleus sampling) adapts the pool size based on the model's confidence.

Understanding decoding strategies matters because they are the user-facing knobs of text generation. When you adjust the "temperature" slider in an API, you are directly modifying the softmax distribution. When a model starts repeating itself, the fix is often a repetition penalty, not retraining. These are engineering decisions, not model architecture decisions, and they have outsized impact on perceived quality.

How It Works

Temperature Scaling

Before applying softmax, divide all logits by the temperature $T$:

$P(w_i) = \frac{e^{z_i / T}}{\sum_{j} e^{z_j / T}}$
  • $T = 1.0$: standard softmax, the model's learned distribution
  • $T < 1.0$: sharper distribution, more deterministic (approaches greedy as $T \to 0$)
  • $T > 1.0$: flatter distribution, more random (approaches uniform as $T \to \infty$)
[Figure: temperature effect on token probabilities at $T=0.3$ (sharp, deterministic), $T=1.0$ (balanced), and $T=2.0$ (flat, creative). Bar height = token probability; lower $T$ concentrates mass on the top token.]
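Temperature scaling is a few lines of code. A minimal sketch in pure Python (the logit values are invented for illustration):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.3)  # mass piles onto token 0
flat = softmax_with_temperature(logits, temperature=2.0)   # mass spreads out
```

Subtracting the maximum logit before exponentiating leaves the result unchanged (it cancels in the ratio) but keeps `exp` from overflowing on large logits.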

Top-k Sampling

After computing probabilities, keep only the k most probable tokens and set all others to zero. Renormalize the remaining probabilities and sample. With k = 50, the model can only choose from its top 50 candidates, preventing low-probability nonsense tokens from being selected.

The problem: k is fixed. When the model is confident (one token has 95% probability), k = 50 still allows 49 unlikely tokens. When the model is uncertain (flat distribution), k = 50 might cut off reasonable options.
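A minimal sketch of the filter itself, operating on an already-computed probability list (ties at the threshold may keep slightly more than $k$ tokens in this simple version):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    threshold = sorted(probs, reverse=True)[k - 1]   # k-th largest probability
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.3, 0.1, 0.07, 0.03]
print(top_k_filter(probs, 2))  # roughly [0.625, 0.375, 0.0, 0.0, 0.0]
```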

Top-p (Nucleus) Sampling

Top-p adapts the candidate set dynamically. Sort tokens by probability, then include tokens until their cumulative probability exceeds p. With p = 0.9:

  • If the top token has 95% probability, only that token is included (effectively greedy)
  • If the distribution is flat, many tokens are included (more diversity)

This adaptive behavior makes top-p generally preferred over top-k in practice.
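The adaptivity is easy to see in code. A minimal sketch, showing that the same $p$ keeps one token for a peaked distribution and all tokens for a flat one:

```python
def top_p_filter(probs, p):
    """Keep the smallest prefix of tokens (sorted by descending
    probability) whose cumulative probability reaches p; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

print(top_p_filter([0.95, 0.03, 0.02], 0.9))   # confident: [1.0, 0.0, 0.0]
print(top_p_filter([0.3, 0.3, 0.2, 0.2], 0.9)) # flat: all four tokens survive
```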

Repetition Penalty

Penalize the logit of any token that has already appeared in the output with a factor $\theta > 1$. Because dividing a negative logit would *increase* its probability, the penalty is applied sign-aware: positive logits are divided by $\theta$, negative logits are multiplied by it:

$z_i' = \begin{cases} z_i / \theta & \text{if } z_i > 0 \text{ and token } i \text{ appeared in previous output} \\ z_i \cdot \theta & \text{if } z_i \le 0 \text{ and token } i \text{ appeared in previous output} \\ z_i & \text{otherwise} \end{cases}$

This discourages the model from repeating phrases, a common failure mode of autoregressive generation.
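A minimal sketch of the sign-aware variant used in the pseudocode below (the logit values and penalty are illustrative):

```python
def apply_repetition_penalty(logits, previous_tokens, theta=1.2):
    """Dampen logits of already-seen tokens. Positive logits are divided
    by theta; negative logits are multiplied (dividing a negative logit
    would raise its probability instead of lowering it)."""
    out = list(logits)
    for i in previous_tokens:
        out[i] = out[i] / theta if out[i] > 0 else out[i] * theta
    return out

# Tokens 0 and 1 already appeared; token 2 is untouched.
print(apply_repetition_penalty([3.0, -1.0, 2.0], previous_tokens={0, 1}))
# roughly [2.5, -1.2, 2.0]
```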

Complexity Analysis

| Operation | Time | Space | Notes |
| --- | --- | --- | --- |
| One generation step (with KV cache) | $O(n \cdot d + V \cdot d)$ | $O(n \cdot d)$ | $n$ = sequence so far, $d$ = model dim, $V$ = vocab size |
| Temperature scaling | $O(V)$ | $O(1)$ | Divide all logits by $T$ |
| Top-k filtering | $O(V \log k)$ | $O(k)$ | Partial sort to find top $k$ elements |
| Top-p filtering | $O(V \log V)$ | $O(V)$ | Full sort by probability, then cumulative-sum scan |
| Full generation ($m$ tokens) | $O(m \cdot (n+m) \cdot d)$ | $O((n+m) \cdot d \cdot L)$ | $L$ = number of layers; KV cache grows linearly |

The KV cache is the dominant memory cost. For a model with $L = 32$ layers, $d = 4096$, and sequence length $n = 4096$, the cache requires $2 \times L \times n \times d \times 2$ bytes (float16) $\approx 2$ GB. This is why long-context generation is memory-bound, not compute-bound.
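The back-of-envelope figure above can be reproduced with a one-line helper (the function name is mine; the leading factor 2 counts keys and values, the trailing factor 2 is bytes per float16 element):

```python
def kv_cache_bytes(num_layers, seq_len, d_model, bytes_per_elem=2):
    """KV cache size: (K and V) x layers x positions x model dim x dtype size."""
    return 2 * num_layers * seq_len * d_model * bytes_per_elem

size = kv_cache_bytes(num_layers=32, seq_len=4096, d_model=4096)
print(size / 2**30)  # 2.0 GiB
```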

Implementation

ALGORITHM SampleWithTopPAndTemperature(logits, temperature, topP, repPenalty, previousTokens)
INPUT: logits: array of V raw scores from model
       temperature: float > 0
       topP: float in (0, 1]
       repPenalty: float >= 1.0
       previousTokens: set of token indices already generated
OUTPUT: sampledToken: integer (token index)

BEGIN
  // Step 1: Apply repetition penalty
  FOR EACH tokenIdx IN previousTokens DO
    IF logits[tokenIdx] > 0 THEN
      logits[tokenIdx] <- logits[tokenIdx] / repPenalty
    ELSE
      logits[tokenIdx] <- logits[tokenIdx] * repPenalty
    END IF
  END FOR

  // Step 2: Apply temperature
  FOR i FROM 0 TO V - 1 DO
    logits[i] <- logits[i] / temperature
  END FOR

  // Step 3: Softmax
  maxLogit <- MAX(logits)
  expSum <- 0
  probs <- CREATE array of length V
  FOR i FROM 0 TO V - 1 DO
    probs[i] <- EXP(logits[i] - maxLogit)
    expSum <- expSum + probs[i]
  END FOR
  FOR i FROM 0 TO V - 1 DO
    probs[i] <- probs[i] / expSum
  END FOR

  // Step 4: Top-p (nucleus) filtering
  sortedIndices <- ARGSORT(probs, DESCENDING)
  cumulativeProb <- 0
  cutoffIdx <- V - 1   // fallback: keep the full vocabulary if rounding keeps cumulativeProb below topP
  FOR i FROM 0 TO V - 1 DO
    cumulativeProb <- cumulativeProb + probs[sortedIndices[i]]
    IF cumulativeProb >= topP THEN
      cutoffIdx <- i
      BREAK
    END IF
  END FOR

  // Zero out tokens below the cutoff
  allowedSet <- sortedIndices[0 .. cutoffIdx]
  FOR i FROM 0 TO V - 1 DO
    IF i NOT IN allowedSet THEN
      probs[i] <- 0
    END IF
  END FOR

  // Renormalize
  total <- SUM(probs)
  FOR i FROM 0 TO V - 1 DO
    probs[i] <- probs[i] / total
  END FOR

  // Step 5: Sample from the filtered distribution
  sampledToken <- RANDOM_CHOICE(indices: 0..V-1, weights: probs)
  RETURN sampledToken
END
ALGORITHM GenerateText(model, prompt, maxTokens, temperature, topP, repPenalty)
INPUT: model: language model with KV cache support
       prompt: list of token indices
       maxTokens: integer, temperature: float, topP: float, repPenalty: float
OUTPUT: generatedTokens: list of token indices

BEGIN
  // Prefill: process the entire prompt at once; prefill populates the
  // KV cache and returns logits for the first new-token position
  (kvCache, logits) <- model.PREFILL(prompt)
  generatedTokens <- empty list
  previousTokens <- SET(prompt)

  FOR step FROM 1 TO maxTokens DO
    token <- SampleWithTopPAndTemperature(logits, temperature, topP, repPenalty, previousTokens)

    IF token = EOS THEN
      BREAK
    END IF

    APPEND token TO generatedTokens
    ADD token TO previousTokens

    // Decode step: feed only the newest token; the KV cache holds
    // keys and values for all earlier positions
    logits <- model.FORWARD([token], kvCache)
  END FOR

  RETURN generatedTokens
END
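For reference, the sampling routine above condenses into one standard-library Python function. A sketch, not a definitive implementation: parameter defaults and names are illustrative, not taken from any particular library.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.9,
                 rep_penalty=1.1, previous_tokens=()):
    """Sample one token index: repetition penalty -> temperature ->
    softmax -> nucleus (top-p) filter -> weighted random choice."""
    # Step 1: sign-aware repetition penalty
    logits = list(logits)
    for i in previous_tokens:
        logits[i] = logits[i] / rep_penalty if logits[i] > 0 else logits[i] * rep_penalty
    # Steps 2-3: temperature, then numerically stable softmax
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 4: nucleus filter -- smallest prefix whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    weights = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    # Step 5: random.choices normalizes the weights internally
    return random.choices(range(len(probs)), weights=weights, k=1)[0]
```

With a sharply peaked distribution the nucleus collapses to a single token, so the call becomes deterministic, e.g. `sample_top_p([10.0, 0.0, 0.0], top_p=0.5)` always returns `0`.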

Real-World Applications

  • ChatGPT and Claude: Use top-p sampling with temperature control, allowing users to adjust creativity vs determinism via API parameters
  • Code completion (Copilot): Uses low temperature (near-greedy) for code suggestions where correctness matters more than creativity
  • Creative writing assistants: Use higher temperature and top-p to generate diverse story continuations, poetry, and brainstorming ideas
  • Structured output generation: Uses temperature 0 (greedy) with constrained decoding to ensure outputs conform to JSON schemas or SQL syntax
  • Dialogue systems: Combine nucleus sampling with repetition penalties to produce natural, non-repetitive conversational responses
  • Speculative decoding in production: Systems like Medusa and SpecInfer use draft models to propose multiple tokens, verified in parallel by the main model, achieving 2-3x speedups

Key Takeaways

  • Decoding strategy has as much impact on output quality as model architecture: the same model produces very different text with greedy vs nucleus sampling
  • Temperature controls the sharpness of the probability distribution: low values make output deterministic, high values make it creative but potentially incoherent
  • Top-p (nucleus) sampling adapts the candidate set size based on model confidence, generally outperforming fixed top-k in practice
  • Repetition penalties are essential for long-form generation, preventing the degenerate loops that autoregressive models are prone to
  • The KV cache is the key optimization for autoregressive generation, trading memory for compute by avoiding redundant attention calculations