# Text Generation and Decoding Strategies: How ChatGPT Picks the Next Token

*From greedy decoding to top-k, top-p nucleus sampling, and temperature scaling, exploring the algorithms that turn a probability distribution over tokens into coherent, creative text.*
## Terminology
| Term | Definition |
|---|---|
| Autoregressive Generation | Generating text one token at a time, where each new token is conditioned on all previously generated tokens |
| Greedy Decoding | Always selecting the token with the highest probability at each step; fast but often produces repetitive, dull text |
| Temperature | A scalar $T$ that divides logits before softmax: $P(w) = \frac{e^{z_w / T}}{\sum e^{z_j / T}}$. Low $T$ sharpens the distribution; high $T$ flattens it |
| Top-k Sampling | Restricting sampling to the $k$ most probable tokens, redistributing their probabilities to sum to 1 |
| Top-p (Nucleus) Sampling | Restricting sampling to the smallest set of tokens whose cumulative probability exceeds $p$, adapting the candidate set size dynamically |
| Logits | The raw, unnormalized scores output by the model's final linear layer before softmax converts them to probabilities |
| Repetition Penalty | A technique that reduces the probability of tokens that have already appeared in the generated text, discouraging loops |
| KV Cache | Caching the key and value tensors from previous decoder steps to avoid recomputing them, reducing each generation step from $O(n^2)$ to $O(n)$ |
| Speculative Decoding | Using a small, fast "draft" model to propose multiple tokens, then verifying them in parallel with the large model to speed up generation |
## What & Why
A language model produces a probability distribution over the entire vocabulary at each step. The decoding strategy decides how to turn that distribution into an actual token. This choice has an enormous impact on output quality.
Greedy decoding (always pick the most probable token) produces grammatically correct but boring, repetitive text. Pure random sampling produces creative but incoherent text. The practical strategies used by ChatGPT, Claude, and other LLMs sit between these extremes: temperature scaling controls randomness, top-k limits the candidate pool, and top-p (nucleus sampling) adapts the pool size based on the model's confidence.
Understanding decoding strategies matters because they are the user-facing knobs of text generation. When you adjust the "temperature" slider in an API, you are directly modifying the softmax distribution. When a model starts repeating itself, the fix is often a repetition penalty, not retraining. These are engineering decisions, not model architecture decisions, and they have outsized impact on perceived quality.
## How It Works
### Temperature Scaling

Before applying softmax, divide every logit $z_w$ by the temperature $T$:

$$P(w) = \frac{e^{z_w / T}}{\sum_j e^{z_j / T}}$$

- $T = 1.0$: standard softmax, the model's learned distribution
- $T < 1.0$: sharper distribution, more deterministic (approaches greedy as $T \to 0$)
- $T > 1.0$: flatter distribution, more random (approaches uniform as $T \to \infty$)
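The effect is easy to see in a few lines of pure Python (a minimal sketch; the logit values below are invented for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for a three-token vocabulary
for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T))
```

Lowering $T$ concentrates probability mass on the argmax token; raising it spreads mass toward the tail.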
### Top-k Sampling
After computing probabilities, keep only the k most probable tokens and set all others to zero. Renormalize the remaining probabilities and sample. With k = 50, the model can only choose from its top 50 candidates, preventing low-probability nonsense tokens from being selected.
The problem: k is fixed. When the model is confident (one token has 95% probability), k = 50 still allows 49 unlikely tokens. When the model is uncertain (flat distribution), k = 50 might cut off reasonable options.
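The filtering step can be sketched directly on probabilities (a minimal illustration; the distribution is invented):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    keep = set(top)
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# With k = 2, only the two strongest candidates survive the cut
print(top_k_filter([0.5, 0.3, 0.1, 0.06, 0.04], 2))
```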
### Top-p (Nucleus) Sampling
Top-p adapts the candidate set dynamically. Sort tokens by probability, then include tokens until their cumulative probability exceeds p. With p = 0.9:
- If the top token has 95% probability, only that token is included (effectively greedy)
- If the distribution is flat, many tokens are included (more diversity)
This adaptive behavior makes top-p generally preferred over top-k in practice.
### Repetition Penalty

Divide the logit of any token that has already appeared in the output by a penalty factor $\theta > 1$; for negative logits, multiply instead, so the penalty always pushes probability down:

$$z_w \leftarrow \begin{cases} z_w / \theta & \text{if } z_w > 0 \\ z_w \cdot \theta & \text{otherwise} \end{cases}$$

This discourages the model from repeating phrases, a common failure mode of autoregressive generation.
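A sketch of the penalty applied to raw logits, following the divide-positive / multiply-negative rule (values invented):

```python
def apply_repetition_penalty(logits, previous_tokens, theta):
    """Penalize already-generated tokens: divide positive logits by
    theta, multiply negative ones, so both moves lower probability."""
    out = list(logits)
    for t in previous_tokens:
        out[t] = out[t] / theta if out[t] > 0 else out[t] * theta
    return out

# Tokens 0 and 1 were already generated; token 2 is untouched
print(apply_repetition_penalty([2.0, -1.0, 0.5], {0, 1}, 2.0))
```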
## Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| One generation step (with KV cache) | $O(n \cdot d + V \cdot d)$ | $O(n \cdot d)$ | $n$ = sequence so far, $d$ = model dim, $V$ = vocab size |
| Temperature scaling | $O(V)$ | $O(1)$ | Divide all logits by $T$ |
| Top-k filtering | $O(V \log k)$ | $O(k)$ | Partial sort to find top $k$ elements |
| Top-p filtering | $O(V \log V)$ | $O(V)$ | Full sort by probability, then cumulative sum scan |
| Full generation ($m$ tokens) | $O(m \cdot (n+m) \cdot d)$ | $O((n+m) \cdot d \cdot L)$ | $L$ = number of layers, KV cache grows linearly |
The KV cache is the dominant memory cost. For a model with $L = 32$ layers, $d = 4096$, and sequence length $n = 4096$, the cache requires $2 \times L \times n \times d \times 2$ bytes (float16) $\approx 2$ GB. This is why long-context generation is memory-bound, not compute-bound.
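The arithmetic is simple to verify (a sketch; the leading factor of 2 counts the K and V tensors, and float16 uses 2 bytes per element):

```python
def kv_cache_bytes(num_layers, seq_len, d_model, bytes_per_elem=2):
    """Per layer: one K and one V tensor, each of shape seq_len x d_model."""
    return 2 * num_layers * seq_len * d_model * bytes_per_elem

size = kv_cache_bytes(num_layers=32, seq_len=4096, d_model=4096)
print(size / 2**30, "GiB")  # exactly 2.0 GiB for these dimensions
```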
## Implementation
```
ALGORITHM SampleWithTopPAndTemperature(logits, temperature, topP, repPenalty, previousTokens)
INPUT:  logits: array of V raw scores from model
        temperature: float > 0
        topP: float in (0, 1]
        repPenalty: float >= 1.0
        previousTokens: set of token indices already generated
OUTPUT: sampledToken: integer (token index)
BEGIN
    // Step 1: Apply repetition penalty
    FOR EACH tokenIdx IN previousTokens DO
        IF logits[tokenIdx] > 0 THEN
            logits[tokenIdx] <- logits[tokenIdx] / repPenalty
        ELSE
            logits[tokenIdx] <- logits[tokenIdx] * repPenalty
        END IF
    END FOR

    // Step 2: Apply temperature
    FOR i FROM 0 TO V - 1 DO
        logits[i] <- logits[i] / temperature
    END FOR

    // Step 3: Numerically stable softmax
    maxLogit <- MAX(logits)
    expSum <- 0
    probs <- CREATE array of length V
    FOR i FROM 0 TO V - 1 DO
        probs[i] <- EXP(logits[i] - maxLogit)
        expSum <- expSum + probs[i]
    END FOR
    FOR i FROM 0 TO V - 1 DO
        probs[i] <- probs[i] / expSum
    END FOR

    // Step 4: Top-p (nucleus) filtering
    sortedIndices <- ARGSORT(probs, DESCENDING)
    cumulativeProb <- 0
    cutoffIdx <- V - 1    // fall back to the full vocabulary if rounding
                          // keeps the cumulative sum just below topP
    FOR i FROM 0 TO V - 1 DO
        cumulativeProb <- cumulativeProb + probs[sortedIndices[i]]
        IF cumulativeProb >= topP THEN
            cutoffIdx <- i
            BREAK
        END IF
    END FOR

    // Zero out tokens outside the nucleus
    allowedSet <- sortedIndices[0 .. cutoffIdx]
    FOR i FROM 0 TO V - 1 DO
        IF i NOT IN allowedSet THEN
            probs[i] <- 0
        END IF
    END FOR

    // Renormalize so the surviving probabilities sum to 1
    total <- SUM(probs)
    FOR i FROM 0 TO V - 1 DO
        probs[i] <- probs[i] / total
    END FOR

    // Step 5: Sample from the filtered distribution
    sampledToken <- RANDOM_CHOICE(indices: 0..V-1, weights: probs)
    RETURN sampledToken
END
```
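The same pipeline can be written as a short runnable Python function (a sketch: the logit values in the comments are invented, and the model itself is out of scope here):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=0.9,
                 rep_penalty=1.0, previous_tokens=(), rng=random):
    """Repetition penalty -> temperature -> softmax -> top-p -> sample."""
    logits = list(logits)
    for t in previous_tokens:  # Step 1: penalize already-seen tokens
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    logits = [z / temperature for z in logits]  # Step 2: temperature
    m = max(logits)  # Step 3: numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0  # Step 4: smallest prefix with mass >= top_p
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    weights = [probs[i] for i in keep]  # Step 5: weighted draw from nucleus
    return rng.choices(keep, weights=weights, k=1)[0]
```

With a low temperature and a small `top_p` the nucleus collapses to the argmax token, which makes the function easy to test deterministically.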
```
ALGORITHM GenerateText(model, prompt, maxTokens, temperature, topP, repPenalty)
INPUT:  model: language model with KV cache support
        prompt: list of token indices
        maxTokens: integer, temperature: float, topP: float, repPenalty: float
OUTPUT: generatedTokens: list of token indices
BEGIN
    // Prefill: process the entire prompt in one forward pass, populating
    // the KV cache and returning logits for the last prompt position
    kvCache, logits <- model.PREFILL(prompt)
    generatedTokens <- empty list
    previousTokens <- SET(prompt)

    FOR step FROM 1 TO maxTokens DO
        token <- SampleWithTopPAndTemperature(logits, temperature, topP, repPenalty, previousTokens)
        IF token = EOS THEN BREAK
        APPEND token TO generatedTokens
        ADD token TO previousTokens

        // Decode step: only the new token is processed; the KV cache
        // supplies keys and values for all earlier positions
        logits <- model.FORWARD([token], kvCache)
    END FOR
    RETURN generatedTokens
END
```
## Real-World Applications
- ChatGPT and Claude: Use top-p sampling with temperature control, allowing users to adjust creativity vs determinism via API parameters
- Code completion (Copilot): Uses low temperature (near-greedy) for code suggestions where correctness matters more than creativity
- Creative writing assistants: Use higher temperature and top-p to generate diverse story continuations, poetry, and brainstorming ideas
- Structured output generation: Uses temperature 0 (greedy) with constrained decoding to ensure outputs conform to JSON schemas or SQL syntax
- Dialogue systems: Combine nucleus sampling with repetition penalties to produce natural, non-repetitive conversational responses
- Speculative decoding in production: Systems like Medusa and SpecInfer use draft models to propose multiple tokens, verified in parallel by the main model, achieving 2-3x speedups
## Key Takeaways
- Decoding strategy has as much impact on output quality as model architecture: the same model produces very different text with greedy vs nucleus sampling
- Temperature controls the sharpness of the probability distribution: low values make output deterministic, high values make it creative but potentially incoherent
- Top-p (nucleus) sampling adapts the candidate set size based on model confidence, generally outperforming fixed top-k in practice
- Repetition penalties are essential for long-form generation, preventing the degenerate loops that autoregressive models are prone to
- The KV cache is the key optimization for autoregressive generation, trading memory for compute by avoiding redundant attention calculations