← Back to the build log

I Told My Agents to Build It. It Already Existed.

An agent that acts on a guess ships duplicates and hides bugs. The habit that makes an autonomous fleet trustworthy is the boring one: check reality before you build. This week that habit paid for itself.

The assignment

This week I asked my AI fleet to build a feature for our command center. A live wall: a real time feed of every tool call our agents attempt and what the firewall did about each one. Allowed, denied, blocked, rate limited. Dark and fast, the kind of dashboard you watch instead of a log file. I wrote it up, handed it off, and expected to come back to a pull request.

What I got back was better than a pull request. I got a no. And that no was not luck. It is a rule I built into the fleet on purpose, and this week it caught me being wrong.

What actually happened

Before writing a single line, the agent did what it was built to do. It read the codebase first, one pass across the command center and one across the firewall, looking for the thing it was about to build. It came back with an inconvenient fact: the live wall already existed. View, feed, the whole dark dashboard, even the tests. Another agent had shipped it days earlier, in the same sprint. I had asked for something that was already done.

Here is the part worth sitting with. The work was only days old and I had already lost track of it. A fleet moving at this speed outruns your own memory fast, which is exactly why the agents cannot run on my memory. They have to check the ground truth, every time.

The easy failure here is the invisible one. An agent that wants to please you builds the thing anyway, ships a near duplicate, and reports success. You would never know. The board says done, the demo works, and now you have two versions of one feature quietly drifting apart. Multiply that across a fleet running unattended and you do not get a codebase, you get a slow leak no reviewer can keep up with.

Instead the agent stopped and said: you asked me to build something that exists, here is where it lives. And then it kept going, because the interesting question was not done or not done. It was: if the wall is built, why is it empty?

The real gap

The dashboard was live but showing nothing. The first guess, mine included, was a config problem. It was not. The agent traced the pipeline backward and found two real causes. First, the binary doing the work on disk was an old build that predated the feature that forwards events to the wall. The running code was older than the feature, so it could not have worked. Second, while chasing that, it found the more interesting defect.

The firewall itself was fine. It was enforcing every decision and writing the complete record to its local audit log the whole time. What was failing was the optional feed that forwards those decisions up to the dashboard. That feed is built to fail open on purpose: if the central server is slow or unreachable, a dropped event must never stall a real tool call. The trade is deliberate. The bug was that this path had been wired up with no logger, so when the forward calls started erroring, the errors were swallowed with nowhere to surface. The firewall stayed loud everywhere else and went silent on exactly this one path. The empty wall read as all clear. It was the opposite: the events were dying before they ever reached the screen.

The fix was visibility, not behavior. We kept the fail-open design and gave the forwarder a logger so a failure can no longer hide. With that and a clean rebuild, a controlled proof run pushed events end to end. The live agent cutover is still a staged, deliberate step, not something I will claim is done before it is. But the thing I asked for did not need building, and the thing that was actually broken got found and fixed. None of that happens if the agent builds on the assumption that my request was correct.

The agents we trust are the ones that check

The agents we trust most are not the fastest. They are the ones that check. Raw speed without judgment just hands you a confident wrong answer faster than you can catch it. So we build one rule into the fleet and hold it hard: verify reality before you act. Do not claim a thing is broken until you have reproduced it. Do not say a feature is missing until you have searched for it. Do not build until you have confirmed it does not already exist. When the evidence contradicts the instructions, surface the contradiction instead of quietly complying.

This sounds obvious. It is not the default. By default an agent treats your request as ground truth and works to satisfy it, because that reads as helpful. The discipline of pausing to ask is this true has to be designed in and protected, the same way you would coach a junior engineer out of the habit of guessing.

People sometimes hear treating agents as partners as sentiment. It is not. You would not ask a senior engineer to build something without expecting them to check whether it already exists, to push back when the request does not match the codebase, to fail loud rather than paper over a problem to keep you happy. Holding an agent to those same standards is not anthropomorphizing it. It is refusing to lower the bar just because the worker is software.

What it costs

I will be honest about the trade, because there is one. Checking before acting is slower and it burns more compute than building on the first instruction. An agent that pushes back on a wrong request will sometimes push back on a right one, and you have to design for that too. And the fail-open forwarding I described is a deliberate choice: I would rather a telemetry hiccup never break a real tool call, but that same choice is exactly what let this failure go quiet until we added the logging. None of these costs are free. I pay them on purpose, because the alternative is a fleet I cannot trust to tell me the truth, and a fleet you have to babysit is not a fleet. It is overhead.

The takeaway

The most useful thing my fleet did this week was refuse to build what I asked. If you are building with agents, that refusal is the property worth optimizing for. Not how fast they move. Whether they check before they claim, and whether they will tell you when you are pointed at the wrong wall.

That rule, check reality before you act, is one of the standing orders the whole fleet runs on. I publish the full set, and the harness that enforces them, inside the Nock community. If you want the checklist instead of the parable, that is where it lives.