Google cut Gemma 4 by 72% and Unsloth fixed the 4-bit bug that no one else caught on a 4090, and 4-bit shouldn't be as good

Revolutionizing AI Efficiency: Google’s Gemma 4 and Unsloth’s Breakthrough

Author(s): Chew Loong Nian – AI ENGINEER

Originally published on Towards AI.

An Unprecedented Leap in Model Efficiency

Google’s latest innovation, the Gemma 4 QAT checkpoints, represents a significant leap in AI model efficiency. By reducing the model size by 72%, Google has managed to maintain performance with a 26 billion parameter model that operates with just 15 GB of memory. This allows it to produce 193 tokens per second on a single consumer-grade GPU, such as those found in laptops and gaming rigs. Surprisingly, the 4-bit version of this model performs nearly as well as the full-precision original, defying conventional quantification expectations.

Understanding Gemma 4’s Unique Approach

The key to Gemma 4’s efficiency lies in its Quantization Aware Training (QAT) combined with Unsloth’s GGUF conversion. This approach ensures that the model learns to handle 4-bit rounding during training, reducing the typical quality loss associated with Post-Training Quantization (PTQ). Unsloth further enhances this process by addressing a subtle scale shift bug that could negate the benefits during conversion to lama.cpp formats.

Performance and Practical Implications

The performance metrics for different Gemma 4 variants, particularly the 26B-A4B mixed expert model, are impressive. The article outlines the accuracy comparisons between naive and dynamic conversions and provides practical steps for deploying the model using lama.cpp, alongside other options like API Server, Ollama/LM Studio, Unsloth Studio, vLLM/SGLang, MLX, and ONNX Browser.

Choosing the Right Model

When selecting a model, it’s crucial to consider the available hardware. While the 4-bit model offers unprecedented efficiency, it’s essential to remember that it still has limitations inherent to its bit depth. However, the usual compromise between quality and speed is surprisingly minimal, making the 26B-A4B model feel almost like a full-scale model experience on consumer GPUs.

For those interested in exploring this groundbreaking technology further, the full blog post is available for free on Medium.

Barcelona-based THEKER raises €73 million Series A to accelerate the deployment of AI robotics

Audi’s RS 6 Avant may be a fast car, but is it a good car?

Jinhua Zhao appointed head of the Department of Urban Studies and Planning

‘Hands off our NHS’: Anti-Palantir protests erupt in UK over National Health Service deal

Google cut Gemma 4 by 72% and Unsloth fixed the 4-bit bug that no one else caught on a 4090, and 4-bit shouldn’t be as good

Revolutionizing AI Efficiency: Google’s Gemma 4 and Unsloth’s Breakthrough

Author(s): Chew Loong Nian – AI ENGINEER

An Unprecedented Leap in Model Efficiency

Understanding Gemma 4’s Unique Approach

Performance and Practical Implications

Choosing the Right Model

Barcelona-based THEKER raises €73 million Series A to accelerate the deployment of AI robotics

Audi’s RS 6 Avant may be a fast car, but is it a good car?

Jinhua Zhao appointed head of the Department of Urban Studies and Planning

‘Hands off our NHS’: Anti-Palantir protests erupt in UK over National Health Service deal

Bluesky launches group chats, as company shifts focus to community features

New framework for auditing machine unlearning

5 Useful Python Scripts to Automate Boring PDF Tasks

Moonshot Cracked Claude Code’s Playbook with an MIT Terminal Agent and a $0.60 Model

Collaborate on a nationwide randomized study of AI in real-world virtual care

10 GitHub Repositories for Web Development in Python

LEAVE A REPLY Cancel reply

Useful Links

Latest News

Audi’s RS 6 Avant may be a fast car, but is it a good car?

Jinhua Zhao appointed head of the Department of Urban Studies and Planning

‘Hands off our NHS’: Anti-Palantir protests erupt in UK over National Health Service deal

Our Newsletter