# Text Generation and Decoding Strategies: How ChatGPT Picks the Next Token

*From greedy decoding to top-k, top-p nucleus sampling, and temperature scaling, exploring the algorithms that turn a probability distribution over tokens into coherent, creative text.*
## Terminology
| Term | Definition |
|---|---|
| Autoregressive Generation | Generating text one token at a time, where each new token is conditioned on all previously generated tokens |
| Greedy Decoding | Always selecting the token with the highest probability at each step; fast but often produces repetitive, dull text |
| Temperature | A scalar $T$ that divides logits before softmax: $P(w) = \frac{e^{z_w / T}}{\sum e^{z_j / T}}$. Low $T$ sharpens the distribution; high $T$ flattens it |
| Top-k Sampling | Restricting sampling to the $k$ most probable tokens, redistributing their probabilities to sum to 1 |
| Top-p (Nucleus) Sampling | Restricting sampling to the smallest set of tokens whose cumulative probability exceeds $p$, adapting the candidate set size dynamically |
| Logits | The raw, unnormalized scores output by the model's final linear layer before softmax converts them to probabilities |
| Repetition Penalty | A technique that reduces the probability of tokens that have already appeared in the generated text, discouraging loops |
| KV Cache | Caching the key and value tensors from previous decoder steps to avoid recomputing them, reducing each generation step from $O(n^2)$ to $O(n)$ |
| Speculative Decoding | Using a small, fast "draft" model to propose multiple tokens, then verifying them in parallel with the large model to speed up generation |
## What & Why
A language model produces a probability distribution over the entire vocabulary at each step. The decoding strategy decides how to turn that distribution into an actual token. This choice has an enormous impact on output quality.
Greedy decoding (always pick the most probable token) produces grammatically correct but boring, repetitive text. Pure random sampling produces creative but incoherent text. The practical strategies used by ChatGPT, Claude, and other LLMs sit between these extremes: temperature scaling controls randomness, top-k limits the candidate pool, and top-p (nucleus sampling) adapts the pool size based on the model's confidence.
Understanding decoding strategies matters because they are the user-facing knobs of text generation. When you adjust the "temperature" slider in an API, you are directly modifying the softmax distribution. When a model starts repeating itself, the fix is often a repetition penalty, not retraining. These are engineering decisions, not model architecture decisions, and they have outsized impact on perceived quality.
## How It Works
### Temperature Scaling

Before applying softmax, divide every logit $z_w$ by the temperature $T$:

$$P(w) = \frac{e^{z_w / T}}{\sum_j e^{z_j / T}}$$

- $T = 1.0$: standard softmax, the model's learned distribution
- $T < 1.0$: sharper distribution, more deterministic (approaches greedy as $T \to 0$)
- $T > 1.0$: flatter distribution, more random (approaches uniform as $T \to \infty$)
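The effect is easy to see in a few lines of pure Python (a minimal sketch; the logit values below are invented for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for a three-token vocabulary
for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T))
```

Lowering $T$ concentrates probability mass on the argmax token; raising it spreads mass toward the tail.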
### Top-k Sampling
After computing probabilities, keep only the k most probable tokens and set all others to zero. Renormalize the remaining probabilities and sample. With k = 50, the model can only choose from its top 50 candidates, preventing low-probability nonsense tokens from being selected.
The problem: k is fixed. When the model is confident (one token has 95% probability), k = 50 still allows 49 unlikely tokens. When the model is uncertain (flat distribution), k = 50 might cut off reasonable options.
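The filtering step can be sketched directly on probabilities (a minimal illustration; the distribution is invented):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    keep = set(top)
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# With k = 2, only the two strongest candidates survive the cut
print(top_k_filter([0.5, 0.3, 0.1, 0.06, 0.04], 2))
```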
### Top-p (Nucleus) Sampling
Top-p adapts the candidate set dynamically. Sort tokens by probability, then include tokens until their cumulative probability exceeds p. With p = 0.9:
- If the top token has 95% probability, only that token is included (effectively greedy)
- If the distribution is flat, many tokens are included (more diversity)
This adaptive behavior makes top-p generally preferred over top-k in practice.
### Repetition Penalty

Divide the logit of any token that has already appeared in the output by a penalty factor $\theta > 1$; for negative logits, multiply instead, so the penalty always pushes probability down:

$$z_w \leftarrow \begin{cases} z_w / \theta & \text{if } z_w > 0 \\ z_w \cdot \theta & \text{otherwise} \end{cases}$$

This discourages the model from repeating phrases, a common failure mode of autoregressive generation.
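A sketch of the penalty applied to raw logits, following the divide-positive / multiply-negative rule (values invented):

```python
def apply_repetition_penalty(logits, previous_tokens, theta):
    """Penalize already-generated tokens: divide positive logits by
    theta, multiply negative ones, so both moves lower probability."""
    out = list(logits)
    for t in previous_tokens:
        out[t] = out[t] / theta if out[t] > 0 else out[t] * theta
    return out

# Tokens 0 and 1 were already generated; token 2 is untouched
print(apply_repetition_penalty([2.0, -1.0, 0.5], {0, 1}, 2.0))
```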
## Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| One generation step (with KV cache) | $O(n \cdot d + V \cdot d)$ | $O(n \cdot d)$ | $n$ = sequence so far, $d$ = model dim, $V$ = vocab size |
| Temperature scaling | $O(V)$ | $O(1)$ | Divide all logits by $T$ |
| Top-k filtering | $O(V \log k)$ | $O(k)$ | Partial sort to find top $k$ elements |
| Top-p filtering | $O(V \log V)$ | $O(V)$ | Full sort by probability, then cumulative sum scan |
| Full generation ($m$ tokens) | $O(m \cdot (n+m) \cdot d)$ | $O((n+m) \cdot d \cdot L)$ | $L$ = number of layers, KV cache grows linearly |
The KV cache is the dominant memory cost. For a model with $L = 32$ layers, $d = 4096$, and sequence length $n = 4096$, the cache requires $2 \times L \times n \times d \times 2$ bytes (float16) $\approx 2$ GB. This is why long-context generation is memory-bound, not compute-bound.
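The arithmetic is simple to verify (a sketch; the leading factor of 2 counts the K and V tensors, and float16 uses 2 bytes per element):

```python
def kv_cache_bytes(num_layers, seq_len, d_model, bytes_per_elem=2):
    """Per layer: one K and one V tensor, each of shape seq_len x d_model."""
    return 2 * num_layers * seq_len * d_model * bytes_per_elem

size = kv_cache_bytes(num_layers=32, seq_len=4096, d_model=4096)
print(size / 2**30, "GiB")  # exactly 2.0 GiB for these dimensions
```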
## Implementation
```
ALGORITHM SampleWithTopPAndTemperature(logits, temperature, topP, repPenalty, previousTokens)
INPUT:  logits: array of V raw scores from model
        temperature: float > 0
        topP: float in (0, 1]
        repPenalty: float >= 1.0
        previousTokens: set of token indices already generated
OUTPUT: sampledToken: integer (token index)
BEGIN
    // Step 1: Apply repetition penalty
    FOR EACH tokenIdx IN previousTokens DO
        IF logits[tokenIdx] > 0 THEN
            logits[tokenIdx] <- logits[tokenIdx] / repPenalty
        ELSE
            logits[tokenIdx] <- logits[tokenIdx] * repPenalty
        END IF
    END FOR

    // Step 2: Apply temperature
    FOR i FROM 0 TO V - 1 DO
        logits[i] <- logits[i] / temperature
    END FOR

    // Step 3: Numerically stable softmax
    maxLogit <- MAX(logits)
    expSum <- 0
    probs <- CREATE array of length V
    FOR i FROM 0 TO V - 1 DO
        probs[i] <- EXP(logits[i] - maxLogit)
        expSum <- expSum + probs[i]
    END FOR
    FOR i FROM 0 TO V - 1 DO
        probs[i] <- probs[i] / expSum
    END FOR

    // Step 4: Top-p (nucleus) filtering
    sortedIndices <- ARGSORT(probs, DESCENDING)
    cumulativeProb <- 0
    cutoffIdx <- V - 1    // fall back to the full vocabulary if rounding
                          // keeps the cumulative sum just below topP
    FOR i FROM 0 TO V - 1 DO
        cumulativeProb <- cumulativeProb + probs[sortedIndices[i]]
        IF cumulativeProb >= topP THEN
            cutoffIdx <- i
            BREAK
        END IF
    END FOR

    // Zero out tokens outside the nucleus
    allowedSet <- sortedIndices[0 .. cutoffIdx]
    FOR i FROM 0 TO V - 1 DO
        IF i NOT IN allowedSet THEN
            probs[i] <- 0
        END IF
    END FOR

    // Renormalize so the surviving probabilities sum to 1
    total <- SUM(probs)
    FOR i FROM 0 TO V - 1 DO
        probs[i] <- probs[i] / total
    END FOR

    // Step 5: Sample from the filtered distribution
    sampledToken <- RANDOM_CHOICE(indices: 0..V-1, weights: probs)
    RETURN sampledToken
END
```
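The same pipeline can be written as a short runnable Python function (a sketch: the logit values in the comments are invented, and the model itself is out of scope here):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=0.9,
                 rep_penalty=1.0, previous_tokens=(), rng=random):
    """Repetition penalty -> temperature -> softmax -> top-p -> sample."""
    logits = list(logits)
    for t in previous_tokens:  # Step 1: penalize already-seen tokens
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    logits = [z / temperature for z in logits]  # Step 2: temperature
    m = max(logits)  # Step 3: numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0  # Step 4: smallest prefix with mass >= top_p
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    weights = [probs[i] for i in keep]  # Step 5: weighted draw from nucleus
    return rng.choices(keep, weights=weights, k=1)[0]
```

With a low temperature and a small `top_p` the nucleus collapses to the argmax token, which makes the function easy to test deterministically.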
```
ALGORITHM GenerateText(model, prompt, maxTokens, temperature, topP, repPenalty)
INPUT:  model: language model with KV cache support
        prompt: list of token indices
        maxTokens: integer, temperature: float, topP: float, repPenalty: float
OUTPUT: generatedTokens: list of token indices
BEGIN
    // Prefill: process the entire prompt in one forward pass, populating
    // the KV cache and returning logits for the last prompt position
    kvCache, logits <- model.PREFILL(prompt)
    generatedTokens <- empty list
    previousTokens <- SET(prompt)

    FOR step FROM 1 TO maxTokens DO
        token <- SampleWithTopPAndTemperature(logits, temperature, topP, repPenalty, previousTokens)
        IF token = EOS THEN BREAK
        APPEND token TO generatedTokens
        ADD token TO previousTokens

        // Decode step: only the new token is processed; the KV cache
        // supplies keys and values for all earlier positions
        logits <- model.FORWARD([token], kvCache)
    END FOR
    RETURN generatedTokens
END
```
## Real-World Applications
- ChatGPT and Claude: Use top-p sampling with temperature control, allowing users to adjust creativity vs determinism via API parameters
- Code completion (Copilot): Uses low temperature (near-greedy) for code suggestions where correctness matters more than creativity
- Creative writing assistants: Use higher temperature and top-p to generate diverse story continuations, poetry, and brainstorming ideas
- Structured output generation: Uses temperature 0 (greedy) with constrained decoding to ensure outputs conform to JSON schemas or SQL syntax
- Dialogue systems: Combine nucleus sampling with repetition penalties to produce natural, non-repetitive conversational responses
- Speculative decoding in production: Systems like Medusa and SpecInfer use draft models to propose multiple tokens, verified in parallel by the main model, achieving 2-3x speedups
## Key Takeaways
- Decoding strategy has as much impact on output quality as model architecture: the same model produces very different text with greedy vs nucleus sampling
- Temperature controls the sharpness of the probability distribution: low values make output deterministic, high values make it creative but potentially incoherent
- Top-p (nucleus) sampling adapts the candidate set size based on model confidence, generally outperforming fixed top-k in practice
- Repetition penalties are essential for long-form generation, preventing the degenerate loops that autoregressive models are prone to
- The KV cache is the key optimization for autoregressive generation, trading memory for compute by avoiding redundant attention calculations