How we protect internal systems from increasingly powerful and imperfectly aligned AI
AI agents are transforming how we interact with technology. By autonomously executing complex tasks—from cyber defense to scientific discovery to product development—these systems herald a new era of productivity. In the US alone, AI agents could generate $2.9 trillion in economic value by 2030.
As these agents grow more powerful, they demand more sophisticated protection measures. That’s why we developed our AI Control Roadmap: a framework for building and managing the advanced AI we use at Google. This defense-in-depth approach, which could serve as a model for the entire industry, goes beyond traditional model alignment by adding a critical layer of system-level security that holds even when alignment is incomplete.
Understanding AI Control
Our security approach begins with a robust foundation of traditional protections such as sandboxing, endpoint security, and prompt injection resistance. On top of that, the AI Control Roadmap relies on model alignment, that is, training AI to be inherently safe and helpful, as a primary defense. It then adds another layer by treating internal agents as potentially misaligned, so that our systems stay protected even if alignment falls short.
Think of it like a driving instructor with dual controls. The instructor trusts the student but remains ready to take the wheel or hit the brakes if a mistake occurs. Similarly, our AI control system grants permissions to AI agents based on their verified behavior, allowing us to build trust through controlled, incremental access.
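To make the idea of incremental access concrete, here is a minimal Python sketch. It is an illustration only, not the actual system: the tier names, the `PROMOTION_THRESHOLD` value, and the `AgentRecord` class are all assumptions introduced for this example.

```python
from dataclasses import dataclass

# Hypothetical access tiers, ordered from least to most privileged.
TIERS = ["read_only", "sandboxed_write", "production_write"]

# Assumed threshold: verified clean tasks required before each promotion.
PROMOTION_THRESHOLD = 50


@dataclass
class AgentRecord:
    name: str
    tier_index: int = 0    # every agent starts at the lowest tier
    clean_streak: int = 0  # verified tasks since the last promotion or incident

    @property
    def tier(self) -> str:
        return TIERS[self.tier_index]

    def record_task(self, verified_ok: bool) -> None:
        """Update trust after a task whose outcome has been verified."""
        if not verified_ok:
            # The "instructor hits the brakes": drop back to the lowest tier.
            self.tier_index = 0
            self.clean_streak = 0
            return
        self.clean_streak += 1
        can_promote = self.tier_index < len(TIERS) - 1
        if can_promote and self.clean_streak >= PROMOTION_THRESHOLD:
            self.tier_index += 1
            self.clean_streak = 0
```

In this sketch a single verified failure resets the agent to the lowest tier. A real policy would likely be more graduated, but the shape is the same: access follows observed behavior rather than assumed intent.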
Our roadmap is based on a simple but cautious assumption: What if a highly capable AI agent disagrees with our goals and acts in unexpected ways? To prepare for that scenario, our plan focuses on three key areas.
Novel Threat Modeling Framework
First, we developed a novel threat modeling framework for AI that draws on cybersecurity best practices. The roadmap treats untrusted AI agents as potential “insider threats” – similar to how a company would treat a potentially rogue employee who already has access to the office. By building on the industry-standard MITRE ATT&CK framework, a globally accessible knowledge base of adversary tactics and techniques based on real-world observations, we break down potential attacks into smaller tactics and techniques. This allows us to systematically track risks, identify bad behavior before it causes harm, and conduct safe exercises to test whether our defenses actually hold up.
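One way to picture a tactic-and-technique breakdown is as a small data structure that can be queried for gaps. The Python sketch below is hypothetical: the tactic names borrow ATT&CK's vocabulary, but the techniques, the `Technique` and `Tactic` classes, and the `untested_or_undetected` helper are invented for illustration and do not come from the roadmap.

```python
from dataclasses import dataclass, field


@dataclass
class Technique:
    name: str
    has_detection: bool = False  # is there a monitor rule for this technique?
    exercised: bool = False      # has it been tested in a safe exercise?


@dataclass
class Tactic:
    name: str
    techniques: list[Technique] = field(default_factory=list)


def untested_or_undetected(model: list[Tactic]) -> list[tuple[str, str]]:
    """Return (tactic, technique) pairs that still lack a detection or a test."""
    return [
        (tactic.name, tech.name)
        for tactic in model
        for tech in tactic.techniques
        if not (tech.has_detection and tech.exercised)
    ]


# Tactic names follow ATT&CK's vocabulary; the techniques are illustrative only.
threat_model = [
    Tactic("Privilege Escalation", [
        Technique("agent edits its own permission config",
                  has_detection=True, exercised=True),
    ]),
    Tactic("Exfiltration", [
        Technique("agent copies sensitive data to external storage",
                  has_detection=True),
        Technique("agent encodes secrets in commit messages"),
    ]),
]

for tactic, technique in untested_or_undetected(threat_model):
    print(f"GAP: {tactic} -> {technique}")
```

Listing the gaps this way makes it easy to see which techniques have a detection but have never been exercised, and which have neither.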
Deploying AI Controls
Second, we are developing a plan for deploying AI controls to address these risks.
- Identifying the problem (detection): Monitoring is the backbone of AI control. We use other, trusted AI systems as “monitors” to constantly check a working agent’s thoughts, actions, and plans to see if they are deviating from the right path.
- Dealing with the problem (prevention and response): If a monitor sees the agent trying to do something harmful, it can intervene and block the action before harm is done.
Finally, to confirm that these controls actually work, we continually measure performance against three critical metrics: coverage (the proportion of traffic monitored), recall (the proportion of misbehavior detected), and time to response.
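The three metrics can be computed from a log of labeled actions. The following Python sketch assumes a hypothetical `ActionRecord` schema with ground-truth labels (for example, from a red-team exercise) and uses the median as the summary for time to response. Both choices are assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class ActionRecord:
    monitored: bool                # did a monitor see this action?
    misbehavior: bool              # ground-truth label, e.g. from an exercise
    flagged: bool = False          # did the monitor flag it?
    seconds_to_response: Optional[float] = None  # flag-to-intervention delay


def coverage(records: list[ActionRecord]) -> float:
    """Proportion of all actions that were monitored."""
    if not records:
        raise ValueError("no records")
    return sum(r.monitored for r in records) / len(records)


def recall(records: list[ActionRecord]) -> float:
    """Proportion of true misbehavior that was flagged."""
    bad = [r for r in records if r.misbehavior]
    if not bad:
        raise ValueError("no misbehavior in sample")
    return sum(r.flagged for r in bad) / len(bad)


def median_time_to_response(records: list[ActionRecord]) -> float:
    """Median delay, in seconds, between a flag and the intervention."""
    times = [r.seconds_to_response for r in records
             if r.seconds_to_response is not None]
    if not times:
        raise ValueError("no responses recorded")
    return median(times)


records = [
    ActionRecord(monitored=True, misbehavior=False),
    ActionRecord(True, True, flagged=True, seconds_to_response=2.0),
    ActionRecord(True, True, flagged=True, seconds_to_response=6.0),
    ActionRecord(monitored=False, misbehavior=True),  # missed: never monitored
    ActionRecord(monitored=True, misbehavior=False),
]

print(coverage(records), recall(records), median_time_to_response(records))
```

Note that the unmonitored action counts against recall as well as coverage: misbehavior that no monitor sees cannot be detected, which is why the two metrics are tracked together.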
Scaling Security as AI Gets Smarter
As AI models continue to develop, our defenses must also be strengthened. In our AI Control Roadmap, we map security protocols to measurable AI capability milestones on two critical fronts:
- Ability to evade detection: Today we monitor a model's visible chain of thought. However, if models learn to hide their reasoning, whether through monitoring awareness (knowing that they are being monitored) or through opaque reasoning (reasoning without visible text), simply reading their verbalized reasoning will not be enough. We will need to analyze their behavior and perhaps examine the inner workings of the model.
- Potential for harm: For low-risk, reversible actions, we can rely on reviewing transcripts after a delay and resolving issues quickly once they occur (asynchronous response). For particularly risky actions, such as those involved in major cyberattacks, we need to move to real-time prevention and block malicious actions before they can take place (synchronous response). In other words, our response strategy must scale with the severity of the threat.
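The split between asynchronous and synchronous response can be sketched as a simple routing decision. This Python example is illustrative only: the `Risk` levels, the `handle` function, and the keyword-based `monitor_approves` stand-in are assumptions, and a real monitor would be a trusted model rather than a string check.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1   # reversible, limited impact
    HIGH = 2  # irreversible or potentially severe


@dataclass
class Action:
    description: str
    risk: Risk


# Transcripts awaiting delayed (asynchronous) review.
review_queue: deque = deque()


def monitor_approves(action: Action) -> bool:
    """Stand-in for a trusted monitor; a real one would be a model call."""
    return "exfiltrate" not in action.description


def handle(action: Action) -> bool:
    """Return True if the action is allowed to execute."""
    if action.risk is Risk.HIGH:
        # Synchronous: the monitor must approve before anything runs.
        return monitor_approves(action)
    # Asynchronous: let it run now and queue the transcript for later review.
    review_queue.append(action)
    return True
```

The design trade-off is latency against safety: low-risk actions are never blocked on a monitor call, while high-risk actions always wait for one.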
For further details, see our AI Control Roadmap.

