In this article, you will learn how logits, temperature, and top-p sampling work together to control the prediction of the next token in large language models.
Topics we will cover include:
- What logits are and how they are produced by the final linear layer of a transformer.
- How temperature and top-p (nucleus sampling) shape the probability distribution used for token selection.
- How these three components fit into a sequential pipeline that governs how LLM outputs are generated.
Token Selection Statistics: Logits, Temperature, and Top-P Walkthrough
Introduction
When large language models (LLMs) generate text, several qualities are at stake, including the relevance, coherence, and creativity of the response. Because models build their responses word by word (or, more precisely, token by token), achieving these properties comes down to mathematically adjusting the output probability distribution that governs next-token prediction.
This article presents the mechanisms behind LLM decoding strategies from a statistical point of view. We will explore how the model’s raw scores, called logits, interact with two other parameters: temperature and top-p. These are key factors used to control the token selection process.
While we will focus on what happens in the final stages of the transformer, the architecture behind LLMs, you can check out this article if you need a concise overview of the entire process and the journey tokens take from start to finish.
Token Selection Process in LLMs
What are logits?
In neural networks, the unnormalized raw scores produced (usually at the final linear layers) before converting them to probabilities of possible outcomes (e.g., classes) are called logits. Although logits have been used since the era of classic machine learning classification models like softmax regression, the same principle still applies to the final linear layer of transformer models. This final layer processes the hidden states – which contain progressively accumulated linguistic knowledge about the input text gathered throughout the transformer – and generates a vector of logits. How many? As many as the size of the model’s vocabulary, i.e., the number of possible tokens that the model can generate.
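As a rough illustration of that final step, the sketch below maps a single hidden-state vector to one logit per vocabulary token with a plain matrix-vector product. The sizes, the three-token vocabulary, the random weights, and the name `final_linear_layer` are all hypothetical placeholders; a real model learns its weights during training and works with far larger dimensions.

```python
import random

# Hypothetical sizes for illustration only; real models use thousands of
# hidden dimensions and vocabularies of tens of thousands of tokens.
HIDDEN_SIZE = 4
VOCAB = ["viajar", "jugar", "dormir"]

rng = random.Random(0)
# One weight row per vocabulary token. These are random placeholders;
# in a real model they are learned. Bias terms are omitted for brevity.
weights = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in VOCAB]


def final_linear_layer(hidden_state):
    """Map one hidden-state vector to one logit per vocabulary token."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in weights]


hidden_state = [0.2, -1.3, 0.7, 0.5]  # made-up final hidden state
logits = final_linear_layer(hidden_state)
print(dict(zip(VOCAB, logits)))
```

Whatever the hidden size, the output always has exactly one entry per vocabulary token.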
See the diagram at the top, for example. If an LLM trained for English-to-Spanish translation predicts the next word after the generated sequence “me gusta mucho” (the translation of “I really like”), it might produce a raw logit score of 12.5 for “viajar” (travel), 8.2 for “jugar” (play), and -3.1 for “dormir” (sleep). These raw values are unbounded, which makes them difficult to interpret directly; therefore, a softmax function is applied on top of the final linear layer to transform the logits into a standard, interpretable probability distribution over the vocabulary tokens, such that all values sum to 1.
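To make the softmax step concrete, here is a minimal sketch that applies it to the three example logits above. The helper name `softmax` and the trick of subtracting the maximum logit (a common way to avoid numerical overflow) are choices made for this illustration, not details taken from any particular model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max does not change the result
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["viajar", "jugar", "dormir"]
logits = [12.5, 8.2, -3.1]
probs = softmax(logits)

for token, prob in zip(tokens, probs):
    print(f"{token}: {prob:.6f}")
```

With these values, “viajar” ends up with a probability of roughly 0.987, “jugar” with about 0.013, and “dormir” very close to zero.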
What are temperature and top-p?
Once we have a probability distribution over the target vocabulary, do LLMs simply choose the token with the highest probability as the next one to generate? Not exactly. Always picking the most probable token is known as greedy decoding; in practice, the next token is usually sampled from the distribution, so the most probable token is the likeliest choice but not the only possible one. How this sampling works depends on several decoding parameters, two of the most important being temperature and top-p.
Temperature
Temperature is a scaling factor applied to the logits before the softmax step: each logit is divided by the temperature. A high temperature (greater than 1) flattens the resulting probabilities, making them more uniform. As a result, uncertainty and unpredictability increase and the model behaves more creatively. A low temperature (well below 1) accentuates the differences between high- and low-probability tokens, thereby increasing certainty and strongly favoring the tokens that were already the most probable.
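The effect of temperature can be sketched by dividing the example logits by different values before applying softmax. The function name `softmax_with_temperature` and the three temperatures tried here are illustrative choices; the sketch also assumes the temperature is strictly positive.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["viajar", "jugar", "dormir"]
logits = [12.5, 8.2, -3.1]

# Lower temperature sharpens the distribution, higher temperature flattens it.
results = {t: softmax_with_temperature(logits, t) for t in (0.5, 1.0, 2.0)}
for t, probs in results.items():
    formatted = ", ".join(f"{tok}={p:.4f}" for tok, p in zip(tokens, probs))
    print(f"T={t}: {formatted}")
```

As the temperature rises from 0.5 to 2.0, the probability of “viajar” shrinks and part of that mass moves to the less likely tokens.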
Top-p
Top-p, also called nucleus sampling, is another way to control the randomness of next-token selection. Rather than rescaling the distribution, it limits the pool of candidates from which to sample. While similar strategies like top-k always consider a fixed number k of the most probable tokens, top-p identifies the smallest set of tokens whose cumulative probability meets or exceeds a threshold p, so the pool size adapts to the shape of the distribution. In other words, if we set p = 0.9, top-p sorts tokens by probability in descending order and keeps adding them to the candidate pool until their cumulative probability reaches 0.9.
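The selection rule just described can be sketched in a few lines. The function name `top_p_filter` and the four-token distribution below (including the extra token “leer”) are made up for illustration; the sketch also renormalizes the kept probabilities so they sum to 1, which is the usual way to obtain a valid distribution to sample from.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: prob / cumulative for token, prob in kept}


# Hypothetical distribution, not taken from a real model.
probs = {"viajar": 0.50, "jugar": 0.30, "leer": 0.15, "dormir": 0.05}
nucleus = top_p_filter(probs, p=0.9)
print(nucleus)
```

Here the cumulative probability reaches 0.5, then 0.8, then 0.95, so the first three tokens are kept and “dormir” is excluded from sampling.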
Complete Walkthrough: How are these concepts related to each other?
Logit computation, temperature scaling, and top-p filtering can be combined into a sequential, multi-step pipeline that produces the LLM output, i.e., the prediction of the next token.
First, the model generates raw logits for all possible tokens, as described above. Temperature then comes into play by scaling these raw logits – note that this happens before the softmax function converts them to probabilities. Depending on the temperature value, the resulting distribution will appear more uniform (higher temperature, more uncertainty) or sharper (lower temperature, higher certainty).
Token selection walkthrough based on logits, temperature, and top-p
Once the scaled logits are converted to probabilities, top-p filters the resulting distribution: tokens are sorted by probability and cumulative probabilities are computed to retain only the “nucleus” of most likely tokens (see step 3 in the image above). The probabilities of the retained tokens are then renormalized so that they sum to 1, and the model randomly samples from this pool to select the next token.
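Putting the steps together, the whole pipeline might look like the sketch below. It is a simplified illustration rather than the implementation used by any specific library: the function name `sample_next_token`, its parameters, and the seeded random generator are assumptions made for this example, and it reuses the three example logits from earlier.

```python
import math
import random


def sample_next_token(tokens, logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature scaling -> softmax -> top-p filtering -> random sampling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")

    # Steps 1-2: scale the logits by the temperature, then apply softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 3: keep the smallest set of tokens whose cumulative prob >= top_p.
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Step 4: sample from the pool; random.choices treats the weights as
    # relative, which has the same effect as renormalizing them first.
    names = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(names, weights=weights, k=1)[0]


tokens = ["viajar", "jugar", "dormir"]
logits = [12.5, 8.2, -3.1]
rng = random.Random(42)  # seeded only to make the demo repeatable

samples = [
    sample_next_token(tokens, logits, temperature=2.0, top_p=0.9, rng=rng)
    for _ in range(1000)
]
print({token: samples.count(token) for token in tokens})
```

With a temperature of 2.0 and p = 0.9, the pool contains “viajar” and “jugar”, so “dormir” is never sampled. At a temperature of 1.0, “viajar” alone already exceeds the 0.9 threshold and would be the only candidate.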
Closing remarks
Now that we have demystified the statistical process behind token selection in LLMs, it is useful to think about how to choose temperature and top-p values in practice. As a developer, you’ll want to strike the right balance between predictability and creativity for your use case. For high-stakes, fact-based scenarios like coding or legal analysis, a lower temperature and a stricter top-p are advised (for example, t = 0.1 and p = 0.5), which makes the model’s responses close to deterministic. For creative domains such as poetry generation or brainstorming, a higher temperature and top-p, such as t = 0.8 and p = 0.95, allow for a greater variety of candidate tokens in the selection pool.

