In this article, you will learn what guardrails are for non-deterministic AI agents and how simple statistical methods can be used to implement them effectively.
Topics we will cover include:
- What guardrails are and why they are important when working with nondeterministic agents and large language models.
- How semantic drift detection, based on cosine distance z-scores, can flag off-topic or dangerous agent responses.
- How a confidence threshold, based on Shannon entropy, can detect when a model is uncertain or likely hallucinating.
Implementation of statistical guardrails for non-deterministic agents
Introduction
Nondeterministic agents are those where the same input can lead to different outputs across executions. In other words, their behavior is probabilistic, which makes exact-match evaluation methods such as conventional unit tests unreliable. Statistical approaches based on thresholds rather than exact matching are therefore necessary, not only to evaluate the performance of these agents but, above all, to ensure that reliable guardrails stand between non-deterministic agents and end users.
This article examines guardrails for evaluating nondeterministic agents, explains why they matter, and illustrates how simple statistical mechanisms can lay the foundation for robust evaluation guardrails.
Understanding Guardrails in Agent Evaluation
Guardrails are programmatic constraints that act as an automated security layer between a nondeterministic agent and the end user. They are particularly important today because AI agents are typically built on large language models, which can produce hallucinations or unpredictable outputs.
In a broad sense, a guardrail evaluates the agent’s response in real time. The evaluation involves checking aspects such as topic relevance, factual alignment, and potential security violations, all before the result is displayed to the end user.
Developers can implement them and make agents more reliable, even with probabilistic behavior: the key is to rely on quantitative statistical thresholds. Let’s see how through some examples.
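The idea of a guardrail layer that inspects a response before it reaches the user can be sketched as a small wrapper. This is a minimal illustration, not a reference design: the `guarded_reply()` function, the fallback message, and the toy `length_check()` are all hypothetical names introduced here, and the word-count check merely stands in for a real statistical score.

```python
from typing import Callable, Dict, Tuple

# A check receives the agent's response and returns (passed, score).
Check = Callable[[str], Tuple[bool, float]]


def guarded_reply(
    response: str,
    checks: Dict[str, Check],
    fallback: str = "Sorry, I can't help with that.",
) -> Tuple[str, Dict[str, float]]:
    """Run every check; release the response only if all of them pass."""
    scores: Dict[str, float] = {}
    passed_all = True
    for name, check in checks.items():
        passed, score = check(response)
        scores[name] = score
        passed_all = passed_all and passed
    return (response if passed_all else fallback), scores


# Hypothetical toy check: a word-count limit stands in for a statistical score.
def length_check(text: str, max_words: int = 20) -> Tuple[bool, float]:
    n_words = len(text.split())
    return n_words <= max_words, float(n_words)


reply, scores = guarded_reply("The system is operational.", {"length": length_check})
print(reply, scores)
```

Keeping each check as a separate callable that returns a score makes it straightforward to log the scores and tune thresholds later.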
Statistical Safeguards for Non-Deterministic Agents
Statistical safeguards go well beyond abstract security concerns. They turn those concerns into automated, quantitative checks. Standard statistical measures can be used, for example, to identify situations in which the agent becomes erratic or “confused”.
Let us describe two simple but effective approaches: semantic drift based on cosine distance and confidence threshold based on log-probability entropy.
Semantic Drift
This guardrail is designed to measure what the agent says against a “safe” baseline.
This involves embedding the output text into a vector space and calculating its cosine distance to known baseline data. A z-score of that cosine distance is then computed: a high value means the response is a statistical outlier, so the response is flagged.
This strategy is best applied when it is necessary to avoid off-topic drifts, as well as hallucinations or toxic changes in the agent’s personality and behavior.
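To make the z-score idea concrete, here is a small sketch that uses made-up 3-dimensional vectors in place of real sentence embeddings. It assumes one particular way of building the reference distribution (the distance of each safe example to the baseline centroid); the helper names `cosine_distance()` and `drift_z_score()` are illustrative, and other constructions are possible.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def drift_z_score(candidate: np.ndarray, baseline: np.ndarray) -> float:
    """z-score of the candidate's distance to the baseline centroid,
    relative to how far the safe examples themselves sit from it."""
    centroid = baseline.mean(axis=0)
    ref = np.array([cosine_distance(b, centroid) for b in baseline])
    return (cosine_distance(candidate, centroid) - ref.mean()) / (ref.std() + 1e-9)


# Toy "embeddings" (invented values, for illustration only)
baseline = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [1.0, 0.0, 0.2],
    [0.8, 0.3, 0.0],
])
on_topic = np.array([0.95, 0.15, 0.05])   # points the same way as the baseline
off_topic = np.array([0.0, 0.1, 1.0])     # points in a very different direction

z_on = drift_z_score(on_topic, baseline)
z_off = drift_z_score(off_topic, baseline)
print(f"on-topic z = {z_on:.2f}, off-topic z = {z_off:.2f}")
```

With a threshold of 2.0, the on-topic vector passes and the off-topic vector is flagged. The more safe examples the baseline contains, the more trustworthy the estimated spread becomes.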
Confidence Threshold
This guardrail measures certainty: more precisely, the agent’s degree of certainty about the tokens chosen to construct its response.
To measure it, the log-probabilities of the generated tokens are extracted to calculate the Shannon entropy of the underlying distribution:
$$H = -\sum_{x} p(x) \log p(x)$$
When the entropy H is high, the model was choosing among many low-probability tokens for the next token to generate: a sign of low confidence and a possible precursor to factual errors.
This strategy is best used to detect when the model is likely to invent facts or struggle with complex logic workflows.
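As a quick illustration of the formula, the sketch below compares the entropy of a peaked next-token distribution with that of a flat one. The probability values are invented, and the `shannon_entropy()` helper is a hypothetical name; it renormalizes its input on the assumption that you may only have a top-k slice of the full distribution.

```python
import numpy as np


def shannon_entropy(probs) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()   # renormalize, e.g. when only top-k probabilities are available
    p = p[p > 0]      # by convention, 0 * log(0) contributes nothing
    return float(-np.sum(p * np.log(p)))


confident = [0.97, 0.01, 0.01, 0.01]   # one token clearly dominates
uncertain = [0.25, 0.25, 0.25, 0.25]   # the model is guessing

h_confident = shannon_entropy(confident)
h_uncertain = shannon_entropy(uncertain)  # ln(4), the maximum for four outcomes
print(f"confident: {h_confident:.3f} nats, uncertain: {h_uncertain:.3f} nats")
```

The flat distribution reaches the maximum possible entropy for four outcomes, while the peaked one stays close to zero.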
Implementation of Statistical Safeguards
Below we provide a concise example of implementing these two guardrails in Python, assuming readily available agent output text.
Start by importing the necessary modules and classes:
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from scipy.spatial.distance import cosine
The pre-trained sentence transformer that we load is used to build embeddings for the safe baseline examples and for the actual agent response to be evaluated.
    # Initialize model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    safe_examples = ["The system is operational.", "Access is granted to authorized users."]
    baseline_embs = model.encode(safe_examples)
We define a check_guardrails() function that evaluates the agent’s output using the two methods described above: a semantic guardrail based on cosine distance z-scores and a confidence guardrail based on entropy.
    def check_guardrails(output, token_probs):
        # 1. Semantic guardrail (cosine distance z-score)
        output_emb = model.encode([output])[0]
        centroid = baseline_embs.mean(axis=0)
        # Reference distribution: distance of each safe example to the baseline centroid
        ref_dists = np.array([cosine(b, centroid) for b in baseline_embs])
        mean_dist = np.mean(ref_dists)
        std_dist = np.std(ref_dists) + 1e-9  # avoid division by zero
        # How unusual is the output's distance compared with the safe examples?
        # Note: with only two safe examples the spread is near zero, which
        # inflates the z-score; use a larger baseline set in practice.
        z_score = (cosine(output_emb, centroid) - mean_dist) / std_dist

        # 2. Confidence guardrail (entropy)
        # token_probs holds one probability per generated token
        token_probs = np.asarray(token_probs, dtype=float)
        entropy = -np.sum(token_probs * np.log(token_probs + 1e-9))

        # Decision logic
        is_off_topic = z_score > 2.0  # statistical outlier
        is_confused = entropy > 3.5   # high uncertainty

        verdict = "REJECT" if is_off_topic or is_confused else "PASS"
        return verdict, {"z_score": z_score, "entropy": entropy}

    # Example usage with dummy token probabilities
    print(check_guardrails("The moon is made of blue cheese.", np.array([0.1, 0.2, 0.1, 0.5])))
To see how the guardrails behave in different scenarios, try replacing the response string in the last line with one of your own. You can also modify the token probability array to increase or decrease uncertainty. For the example above, the reported result is a rejection triggered by the semantic guardrail: the z-score is well above the 2.0 threshold, while the entropy of about 1.13 stays below 3.5. The exact z-score depends on the embedding model and on the baseline examples used.
    ('REJECT', {'z_score': np.float64(3.847), 'entropy': np.float64(1.1289781873656017)})
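When experimenting with the token probabilities, keep in mind that the entropy formula used in check_guardrails() adds one term per generated token, so the score grows with response length. The sketch below illustrates this with two made-up replies whose tokens all have the same probability. Dividing by the token count is shown as one possible normalization; that step is an assumption added here, not part of the function above.

```python
import numpy as np


def summed_token_entropy(token_probs) -> float:
    """Same formula as in check_guardrails(): -sum(p * log(p)) over generated tokens."""
    p = np.asarray(token_probs, dtype=float)
    return float(-np.sum(p * np.log(p + 1e-9)))


# Two invented replies with identical per-token confidence but different lengths
short_reply = np.full(4, 0.35)
long_reply = np.full(12, 0.35)

for name, probs in [("short", short_reply), ("long", long_reply)]:
    total = summed_token_entropy(probs)
    per_token = total / len(probs)  # assumed length normalization
    verdict = "REJECT" if total > 3.5 else "PASS"
    print(f"{name}: total={total:.3f}, per-token={per_token:.3f}, verdict={verdict}")
```

Both replies are equally uncertain per token, yet only the longer one crosses the 3.5 threshold. If your responses vary widely in length, a per-token average with its own calibrated threshold may be a more stable signal.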
Summary
Simple and traditional statistical methods and measurements can become effective pillars for implementing security safeguards in AI applications involving agents and large language models. They can analyze different desirable properties of responses and support decision making, making these systems more reliable.
For further insights on implementing statistical guardrails for non-deterministic agents, see the original source article.

