Your AI Agent Fleet Has a Security Problem
We ran three different AI runtimes against our own codebase as security auditors. Each one found things the others missed. Here is what we learned about securing code that agents write.
The Problem Nobody Talks About
When your AI agents write code, who reviews it for security? If the answer is "the same agent that wrote it," you have a problem. AI agents are excellent at writing tests that validate their own assumptions. That is confirmation bias with a green checkmark.
We build financial software. Invoices, payments, cash reserves. The kind of code where a validation bypass does not just break a UI. It lets someone overdraw a cash reserve or create duplicate payments. So we decided to audit ourselves properly.
Three Runtimes, Three Audits
We ran three independent security audits on the same codebase using three different AI runtimes: Claude, Gemini, and GPT. None of them saw each other's results. Each got the same prompt: find everything wrong with this code before we ship it to customers.
The results were humbling.
Claude found the broken plumbing. Two modules had been deleted as part of a security strip, but the imports and URL patterns still pointed to them. The application would not start. This was not subtle. It was a build failure that two previous manual passes missed because the person doing the review was the same person who deleted the modules.
Gemini found the identity leaks. The developer's username was hardcoded in 20+ locations across views, settings, webhooks, and models. Login pages said "ask Kevin" in the help text. The operator's personal card descriptions were in the payment method choices. These are the things you stop seeing because you look at them every day.
GPT found the infrastructure trust issues. Production settings defaulted to a specific Railway hostname. If a customer misconfigured their deployment, the app would trust the original developer's infrastructure instead of failing closed. CORS headers hardcoded the developer's domains globally. Fleet automation scripts fell back to the developer's control plane URL if the environment variable was missing.
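To make "failing closed" concrete, here is a minimal sketch of a settings helper that refuses to start when a variable is missing instead of falling back to someone's hostname. The helper names and the APP_* variable names are hypothetical, chosen for illustration; they are not taken from our codebase.

```python
import os


class MissingSettingError(RuntimeError):
    """Raised when a required production setting is absent or empty."""


def require_env(name, environ=None):
    """Return a required variable's value, or raise instead of falling back."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise MissingSettingError(
            f"{name} must be set explicitly; refusing to use a default"
        )
    return value


def require_env_list(name, environ=None):
    """Parse a required comma-separated variable into a non-empty list."""
    items = [item.strip() for item in require_env(name, environ).split(",")]
    items = [item for item in items if item]
    if not items:
        raise MissingSettingError(f"{name} must list at least one value")
    return items


# Hypothetical use in a production settings module:
# ALLOWED_HOSTS = require_env_list("APP_ALLOWED_HOSTS")
# CONTROL_PLANE_URL = require_env("APP_CONTROL_PLANE_URL")
```

The point of the design is that a misconfigured deployment crashes at startup with a clear message, rather than quietly trusting infrastructure that belongs to someone else.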
What Each Runtime Missed
This is the part that matters. None of them found everything.
Claude missed the identity leaks entirely. It was focused on code correctness, not distribution readiness. Gemini missed the broken imports. It audited documentation and configuration but did not run the test suite. GPT missed the broken plumbing too, but found infrastructure trust issues the others did not look for.
If we had run only one audit, we would have shipped with real vulnerabilities. The three-runtime approach caught 15 findings that no single runtime would have caught on its own.
The Findings That Surprised Us
Some things were genuinely good. No raw SQL anywhere. No XSS vectors. No secrets committed to git history (we scrubbed that separately). Brute-force protection on every login. File upload validation with magic-byte checking, not just extension filtering. The Django security fundamentals were solid.
But the gaps were real:
- The installed Django version was 13 months past end-of-life with 10 unpatched CVEs, including a critical SQL injection
- Every agent in the fleet shared a single API key with no per-agent attribution in audit trails
- The Docker container ran as root
- Financial model validation was bypassed by bulk_create operations
- Production settings defaulted to the developer's infrastructure instead of failing closed
None of these were exotic attacks. They were the boring stuff. Default credentials, missing least-privilege, running as root, skipping validation on batch operations. OWASP Top 10 material that exists because when you are building fast, security is the thing you will do "next sprint."
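The bulk_create gap is worth a closer look because it is easy to miss: Django's bulk_create does not call each model's save() method, so any validation hooked in there never runs. One possible guard, sketched below, is to call full_clean() on every instance before anything is written. The helper name is hypothetical, and this is an illustration of the idea rather than the exact fix we shipped.

```python
def validated_bulk_create(manager, objs, **kwargs):
    """Validate every instance, then insert them with one bulk_create call.

    Assumes each object exposes full_clean() (as Django model instances do)
    and that `manager` exposes bulk_create() (as Django model managers do).
    """
    objs = list(objs)
    for obj in objs:
        # Raises (ValidationError in Django) before any row is written,
        # so one bad object rejects the whole batch.
        obj.full_clean()
    return manager.bulk_create(objs, **kwargs)


# Hypothetical usage, where Payment is a Django model:
# validated_bulk_create(Payment.objects, payments, batch_size=500)
```

The trade-off is speed: full_clean() adds per-object work, including uniqueness checks against the database, so the batch is slower than a raw bulk_create. For financial records that seemed like the right price to us.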
The Pattern: Fresh Eyes Beat Better Prompts
The lesson is not "use three AI runtimes." The lesson is that the agent that wrote the code cannot audit the code. Different runtimes have different blind spots. In our audit, Claude was thorough on code logic but missed configuration issues. Gemini caught documentation and identity leaks but did not run tests. GPT found infrastructure trust boundaries but missed application-level bugs.
The three-audit pattern works because each runtime approaches the problem differently. They do not share context, they do not share assumptions, and they do not share blind spots.
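If you want to measure how much each independent audit contributes, one simple approach is to merge the reports and see which findings came from exactly one runtime. The sketch below assumes each finding has been reduced to a short ID by hand. The function names are hypothetical, and the example data is illustrative only, not our actual audit output.

```python
from collections import defaultdict


def findings_by_coverage(reports):
    """Map each finding ID to the set of runtimes that reported it."""
    coverage = defaultdict(set)
    for runtime, findings in reports.items():
        for finding in findings:
            coverage[finding].add(runtime)
    return dict(coverage)


def single_source_findings(reports):
    """Return the findings that exactly one runtime reported, sorted."""
    coverage = findings_by_coverage(reports)
    return sorted(f for f, who in coverage.items() if len(who) == 1)


# Illustrative data only. "placeholder-shared-finding" stands in for
# anything two runtimes both flagged.
example_reports = {
    "claude": {"broken-imports", "placeholder-shared-finding"},
    "gemini": {"hardcoded-username"},
    "gpt": {"production-hostname-default", "placeholder-shared-finding"},
}
unique = single_source_findings(example_reports)
```

A long list of single-source findings is the signal that the audits really are independent. If every runtime reports the same things, the extra passes are not buying much.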
What We Changed
After the audit, we made five changes:
- Multi-runtime security gate. Every PR that touches security-sensitive code gets audited by at least two different runtimes before merge.
- Pre-commit secret scanning. Gitleaks hooks on every repo. Agents commit code constantly. One leaked API key in a commit stays in the repository's history even after the file is deleted.
- Fail-closed defaults. Production settings now require explicit environment configuration. No hostname defaults, no CORS defaults, no trust defaults. Missing a variable is an error, not a fallback.
- Per-agent API keys. Moving from a single shared key to scoped tokens per agent. Your content agent should not have database credentials.
- Git history scrub. We ran git-filter-repo across four repositories to remove personal data from commit history. The current code was clean. The history was not.
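For the per-agent keys, the two properties we care about are scope (an agent can only do what its token allows) and attribution (every authorized call resolves to a named agent for the audit trail). A minimal in-memory sketch of that idea follows. The class names, agent IDs, and scope strings are all hypothetical, and a real deployment would need persistent storage, rotation, and revocation.

```python
import hashlib
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCredential:
    agent_id: str
    scopes: frozenset


class TokenRegistry:
    """Issues one token per agent and checks scope on every use."""

    def __init__(self):
        self._by_digest = {}

    @staticmethod
    def _digest(token):
        # Store only a hash so a leaked registry does not leak usable tokens.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, agent_id, scopes):
        token = secrets.token_urlsafe(32)
        credential = AgentCredential(agent_id, frozenset(scopes))
        self._by_digest[self._digest(token)] = credential
        return token

    def authorize(self, token, required_scope):
        """Return the agent ID for the audit log, or raise PermissionError."""
        credential = self._by_digest.get(self._digest(token))
        if credential is None:
            raise PermissionError("unknown token")
        if required_scope not in credential.scopes:
            raise PermissionError(
                f"{credential.agent_id} lacks scope {required_scope!r}"
            )
        return credential.agent_id
```

Because authorize() returns the agent ID, the caller can write it straight into the audit trail, which is exactly what a single shared key cannot give you.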
If You Run an Agent Fleet
Start here:
- Run pip-audit on your requirements. You are probably on a deprecated framework version.
- Check your Docker containers. If there is no USER directive, you are running as root.
- Search your codebase for your own name, email, and company name. Then search git history too.
- Have a different agent (or a different AI runtime entirely) audit the code your building agent wrote.
- Check your production settings for defaults that point to your development infrastructure.
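Two of the checks above are easy to script. The sketch below is a rough starting point, not a complete scanner: runs_as_root assumes the base image runs as root and ignores line continuations, and find_identity_leaks only covers the working tree, so git history still needs its own pass. Both function names are hypothetical.

```python
from pathlib import Path


def runs_as_root(dockerfile_text):
    """Return True if the final build stage never switches away from root."""
    user = None
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            user = None  # new stage; assumed to start as root
        elif instruction == "USER" and len(parts) > 1:
            user = parts[1].split(":")[0]  # drop an optional ":group" suffix
    return user in (None, "root", "0")


def find_identity_leaks(root, needles):
    """Yield (path, line_number, needle) for case-insensitive matches."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            for needle in needles:
                if needle.lower() in lowered:
                    yield path, number, needle
```

Pass find_identity_leaks your own name, email, and company name. The strings that show up in help text and choice lists are usually the ones you stopped seeing long ago.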
The agents are great at building. They are not great at finding their own mistakes. That is not a flaw in the model. It is a property of all systems that self-evaluate. The fix is architectural: separate the builder from the reviewer, and make the reviewer genuinely independent.