Token Selection Statistics: Logits, Temperature, and Top-P Walkthrough

In this article, you will learn how logits, temperature, and top-p sampling work together to control the prediction of the next token in large language models.

Topics we will cover include:

What logits are and how they are produced by the final linear layer of a transformer.

How temperature and top-p (nucleus sampling) shape the probability distribution used for token selection.

How these three components fit into a sequential pipeline that governs the generation of LLM results.

Introduction

When large language models, or LLMs for short, generate text, several qualities are at stake, including the relevance, coherence, and creativity of the response. Because these models build their response word by word (or, more precisely, token by token), achieving these qualities comes down to mathematically adjusting the output probability distribution that governs the prediction of each next token.

This article presents the mechanisms behind LLM decoding strategies from a statistical point of view. We will explore how the model’s raw scores, called logits, interact with two other parameters: temperature and top-p. These are key factors used to control the token selection process.

While we will focus on what happens in the final stages of the transformer, the architecture underlying LLMs, you can check out this article if you need a concise overview of the entire process and the journey tokens take from start to finish.

Token Selection Process in LLMs

What are logits?

In neural networks, logits are the unnormalized raw scores produced (usually by the final linear layer) before they are converted into probabilities over the possible outcomes (e.g., classes). Logits have been used since the era of classic machine learning classifiers such as softmax regression, and the same principle applies to the final linear layer of transformer models. This final layer processes the hidden states, which carry the linguistic information accumulated about the input text as it passes through the transformer, and generates a vector of logits. How many? As many as the size of the model’s vocabulary, i.e., the number of possible tokens that the model can generate.
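As a rough sketch of that final step, the snippet below projects a toy hidden state onto a toy vocabulary with a plain matrix-vector product. The function name, sizes, weights, and token list are all made up for illustration; a real model has a much larger hidden size and vocabulary, and its weights are learned during training.

```python
# Toy illustration: a final linear layer maps one hidden state to one logit per token.
# Every name and number here is hypothetical and only shows the shapes involved.

def linear_logits(hidden, weights, bias):
    """Return one logit per vocabulary token: dot(weights[i], hidden) + bias[i]."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]

vocab = ["viajar", "jugar", "dormir"]   # toy vocabulary of 3 tokens
hidden = [1.0, 2.0]                      # toy hidden state of size 2
weights = [
    [0.5, 1.0],                          # one weight row per vocabulary token
    [1.0, -0.5],
    [-1.0, 0.0],
]
bias = [0.0, 0.1, 0.2]

logits = linear_logits(hidden, weights, bias)
for token, logit in zip(vocab, logits):
    print(f"{token}: {logit:.2f}")
```

The output vector has exactly one entry per vocabulary token, which is why its length always matches the vocabulary size.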

See the diagram at the top, for example. Suppose an LLM trained for English-to-Spanish translation is predicting the next word after the generated sequence “me gusta mucho” (the translation of “I really like”). It might produce a raw logit score of 12.5 for “viajar” (travel), 8.2 for “jugar” (play), and -3.1 for “dormir” (sleep). These raw values are unbounded, which makes them difficult to interpret directly; therefore, a softmax function is applied on top of the final linear layer to transform the logits into a standard, interpretable probability distribution over the vocabulary tokens, in which every value lies between 0 and 1 and all values sum to 1.
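To make the softmax step concrete, here is a minimal sketch that applies it to the three example logits. It assumes, purely for illustration, that the vocabulary contains only these three tokens; in a real model the softmax runs over the entire vocabulary. The helper name softmax is our own.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["viajar", "jugar", "dormir"]
logits = [12.5, 8.2, -3.1]

probs = softmax(logits)
for token, prob in zip(tokens, probs):
    print(f"{token}: {prob:.6f}")
```

Because softmax exponentiates the scores, the 4.3-point gap between “viajar” and “jugar” turns into a very lopsided distribution: in this toy setting “viajar” receives nearly 99% of the probability mass.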

What are temperature and top-p?

Once we have a probability distribution over the target vocabulary, do LLMs simply choose the token with the highest probability as the next one to generate? Not exactly, but the real process looks a lot like this scenario. The next token is sampled from the distribution, and how this sampling works depends on several decoding parameters, two of the most important being temperature and top-p.

Temperature

Temperature is a scaling factor applied to the logits before the softmax step: each logit is divided by the temperature. A high temperature (greater than 1) flattens the resulting probabilities, making them more uniform. As a result, uncertainty and unpredictability increase, and the model’s output becomes more varied, or “creative”. A low temperature (well below 1) accentuates the differences between high- and low-probability tokens, thereby increasing certainty and strongly favoring the tokens that were already the most probable. A temperature of exactly 1 leaves the distribution unchanged.
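The effect is easy to see numerically. The sketch below divides the logits by a temperature before applying softmax and compares three settings on the example logits from earlier; the function name and the chosen temperature values are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)                          # numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [12.5, 8.2, -3.1]                         # viajar, jugar, dormir

cold = softmax_with_temperature(logits, 0.5)       # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)    # plain softmax
hot = softmax_with_temperature(logits, 2.0)        # flatter distribution

for name, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 4) for p in probs])
```

The probability of the top token shrinks as the temperature rises, while the ranking of the tokens never changes: temperature reshapes the distribution but does not reorder it.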

Top-p

Top-p, also called nucleus sampling, is another way to control the randomness of next-token selection. Rather than rescaling the probabilities, it limits the pool of candidates from which to sample. While similar strategies like top-k consider only the k tokens with the highest probability, top-p identifies the smallest set of tokens whose cumulative probability meets or exceeds a threshold p, so the size of the pool adapts to the shape of the distribution. In other words, if we set p = 0.9, top-p sorts tokens by descending probability and keeps adding them to the candidate pool until their cumulative probability reaches 0.9.
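A minimal sketch of this filtering step follows. The function name top_p_filter and the four-token distribution are hypothetical. Renormalizing the surviving probabilities so they sum to 1 again is a common convention rather than something stated above, so treat that part as an assumption.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p.

    Returns a dict mapping token index -> renormalized probability.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # Sort token indices from most to least probable.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:                # threshold reached: stop adding tokens
            break
    total = sum(prob for _, prob in kept)  # renormalize (assumed convention)
    return {index: prob / total for index, prob in kept}

probs = [0.5, 0.3, 0.15, 0.05]             # hypothetical distribution over 4 tokens
pool = top_p_filter(probs, p=0.9)
print(pool)
```

With p = 0.9, the first two tokens only reach a cumulative probability of about 0.8, so the third token is added as well and the fourth is discarded. A more peaked distribution would yield a smaller pool for the same p, which is the adaptive behavior that distinguishes top-p from top-k.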

Complete Walkthrough: How are these concepts related to each other?

Logit-probability calculation, temperature, and top-p can be combined in a multi-step sequential pipeline to produce LLM results, i.e., predictions of the next token.

First, the model generates raw logits for all possible tokens, as described above. Temperature then comes into play by scaling these raw logits – note that this happens before the softmax function converts them to probabilities. Depending on the temperature value, the resulting distribution will appear more uniform (higher temperature, more uncertainty) or sharper (lower temperature, higher certainty).

Token selection walkthrough based on logits, temperature, and top-p

Once the scaled logits are converted to probabilities, top-p is applied to filter the resulting distribution: cumulative probabilities are computed over the tokens sorted from most to least likely, and only the “nucleus” of most likely tokens is retained (see step 3 in the image above). Finally, the model randomly samples from this pool, in proportion to each token’s probability, to select the next token.
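Putting the steps together, the sketch below chains temperature scaling, softmax, top-p filtering, and random sampling for the three example tokens. The function name sample_next_token, its parameters, and the fixed random seed are assumptions made for illustration. Production inference libraries implement the same idea on tensors and may differ in details such as tie handling.

```python
import math
import random

def sample_next_token(tokens, logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling -> softmax -> top-p filtering -> random sampling."""
    rng = rng or random.Random()

    # Steps 1 and 2: scale the logits by the temperature, then apply softmax.
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 3: keep the smallest set of tokens whose cumulative probability reaches top_p.
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, prob in ranked:
        pool.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Step 4: sample from the pool. random.choices treats the weights as relative,
    # so the surviving probabilities do not need explicit renormalization.
    pool_tokens = [token for token, _ in pool]
    pool_weights = [prob for _, prob in pool]
    return rng.choices(pool_tokens, weights=pool_weights, k=1)[0]

tokens = ["viajar", "jugar", "dormir"]
logits = [12.5, 8.2, -3.1]
rng = random.Random(0)                      # fixed seed so runs are repeatable

focused = [sample_next_token(tokens, logits, temperature=0.8, top_p=0.95, rng=rng)
           for _ in range(5)]
varied = [sample_next_token(tokens, logits, temperature=2.0, top_p=0.95, rng=rng)
          for _ in range(20)]
print(focused)
print(varied)
```

At temperature 0.8, “viajar” alone already exceeds the 0.95 threshold, so the pool contains a single token and every draw returns it. At temperature 2.0 the distribution is flatter, “jugar” joins the pool, and it can occasionally be selected, while “dormir” is always filtered out.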

Closing remarks

Now that we have demystified the statistical process behind token selection in LLMs, it is useful to think about how to choose temperature and top-p values in practice. As a developer, you’ll want to strike the right balance between predictability and creativity for your use case. For high-stakes, fact-based scenarios like coding or legal analysis, a lower temperature and a stricter top-p are advised, for example t = 0.1 and p = 0.5, which make model responses close to deterministic. For creative domains such as poetry generation or brainstorming, a higher temperature and top-p, such as t = 0.8 and p = 0.95, allow for a greater variety of candidate tokens in the selection pool.
