Introduction
Large language models (LLMs) are powerful tools designed to generate human-like text, but they often produce responses that are overly verbose and complex. This tendency stems from their training to provide helpful and comprehensive answers. However, verbosity can lead to a significant issue known as hallucinations, where the model’s output diverges from factual information. The more verbose an answer, the higher the risk of generating inaccurate content. To address this, implementing effective guardrails is crucial. This article explores how the Textstat Python library can help measure verbosity and ensure clarity in LLM responses.
Understanding Verbosity and Hallucinations
Verbosity in LLMs is characterized by overly detailed and complex responses that may overwhelm users. While detailed answers can be beneficial, they often increase the likelihood of hallucinations—instances where the model fabricates information. The challenge lies in balancing comprehensiveness with accuracy. By measuring and controlling verbosity, we can reduce the risk of hallucinations and improve the reliability of LLM outputs.
Set a Complexity Budget with Textstat
The Textstat Python library offers a way to calculate readability scores, such as the Automated Readability Index (ARI), to determine the complexity of text generated by LLMs. By setting a threshold, such as a 10th-grade reading level (ARI score of 10.0), we can trigger a re-prompt loop if the complexity exceeds this limit. This approach encourages concise and simpler responses, reducing verbosity and the risk of hallucinations.
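To see the check in isolation before wiring it into a pipeline, consider the minimal sketch below. It assumes textstat is already installed, and the sample sentences and budget value are purely illustrative:

import textstat

# A plain sentence and a deliberately ornate one for comparison.
simple = "The cat sat on the mat. It was warm there."
ornate = ("The domesticated feline positioned itself upon the rectangular "
          "floor covering, luxuriating in the accumulated thermal radiance.")

# ARI estimates the US grade level needed to read the text:
# ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
print(textstat.automated_readability_index(simple))  # low grade level
print(textstat.automated_readability_index(ornate))  # well above 10.0

budget = 10.0  # 10th-grade reading level
if textstat.automated_readability_index(ornate) > budget:
    print("Over budget: trigger a simplification re-prompt.")

If the score exceeds the budget, the guardrail re-prompts the model for a simpler rewrite, which is exactly what the pipeline below automates.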
Implementing the LangChain Pipeline
This section demonstrates how to integrate the complexity budget strategy into a LangChain pipeline, executable in a Google Colab notebook. First, obtain an API token from Hugging Face. Next, install the necessary libraries:
!pip install textstat langchain_huggingface langchain_community
In Google Colab, retrieve the API token:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')
if not HF_TOKEN:
    print("WARNING: Token 'HF_TOKEN' was not found. This may cause errors.")
else:
    print("Hugging Face Token loaded successfully.")
Next, configure components for local text generation using a pre-trained Hugging Face model:
import textstat
import torch
from langchain_core.prompts import PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
)

llm = HuggingFacePipeline(pipeline=pipe)
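As an optional sanity check (a sketch, not part of the original pipeline), you can confirm the local model responds before building the chain. Note that distilgpt2 is a very small model, so expect rough output:

# Optional smoke test: verify the local pipeline generates text.
print(llm.invoke("The quick brown fox"))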
The function below generates a summary of text input, ensuring it does not exceed the complexity budget:
def safe_summarize(text_input, complex_budget=10.0):
    print("\n--- Starting the summary process ---")
    print(f"Length of input text: {len(text_input)} characters")
    print(f"Target complexity budget (ARI score): {complex_budget}")

    # Generate an initial summary.
    base_prompt = PromptTemplate.from_template("Provide a complete summary of the following: {text}")
    chain = base_prompt | llm
    summary = chain.invoke({"text": text_input})

    print("Initial summary generated:")
    print("-------------------------")
    print(summary)
    print("-------------------------")

    # Score the summary's complexity with the Automated Readability Index.
    ari_score = textstat.automated_readability_index(summary)
    print(f"Initial ARI score: {ari_score:.2f}")

    if ari_score > complex_budget:
        print("Over budget! The initial summary is too complex.")
        print("Triggering the simplification guardrail...")

        # Re-prompt the model to rewrite the summary in simpler language.
        simplification_prompt = PromptTemplate.from_template(
            "The following text is too wordy. Rewrite it concisely using simple vocabulary, "
            "removing flowery language:\n\n{text}"
        )
        simplify_chain = simplification_prompt | llm
        simplified_summary = simplify_chain.invoke({"text": summary})

        new_ari = textstat.automated_readability_index(simplified_summary)
        print("Simplified summary generated:")
        print("-------------------------")
        print(simplified_summary)
        print("-------------------------")
        print(f"Revised ARI score: {new_ari:.2f}")
        summary = simplified_summary
    else:
        print("The initial summary meets the complexity budget. No simplification is necessary.")

    print("--- Summary process completed ---")
    return summary
Finally, test the function with sample text:
sample_text = """The inextricably intertwined permutations of cognitive computational arrays in the realm of large linguistic models often precipitate a cascade of unnecessarily labyrinthine lexical structures. This propensity toward circumlocution, while seemingly indicative of deep erudition, often obscures the fundamental semantic load, thereby making the generated speech much less accessible to the quintessentially profane."""
print("Running summary pipeline...n")
final_output = safe_summarize(sample_text, complex_budget=10.0)
print("n--- Final Guardrailed Summary ---")
print(final_output)
Conclusion
This article outlined a framework to measure and control verbosity in LLM responses, with the aim of reducing hallucinations. While the focus here is verbosity, additional checks such as semantic consistency validation and LLM-as-a-judge evaluation (sketched below) can further enhance reliability. By refining LLM responses in this way, we can make them more useful and trustworthy in real-world applications.
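As a hedged illustration of one such extension, and not part of the original guardrail, an LLM-as-a-judge check could be chained onto the same components. In practice you would swap in a stronger instruction-tuned model as the judge, since distilgpt2 is far too small for this role:

# Hypothetical extension: reuse the llm as a lightweight judge that flags
# summaries appearing to add claims not supported by the source text.
judge_prompt = PromptTemplate.from_template(
    "Source:\n{source}\n\nSummary:\n{summary}\n\n"
    "Does the summary contain claims not supported by the source? Answer YES or NO."
)
judge_chain = judge_prompt | llm
verdict = judge_chain.invoke({"source": sample_text, "summary": final_output})
print("Judge verdict:", verdict)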
Ivan Palomares Carrascosa is a leader in AI, machine learning, and LLMs, guiding others in applying AI effectively.