The Real Bottleneck in AI Agent Workflows

Everyone's optimizing prompts. Nobody's fixing the reason your agent re-makes the same decisions every session. Here's what we found.

The Prompt Optimization Trap

Every AI agent tutorial starts the same way: write a better prompt. Add more context. Use chain-of-thought. Be specific about your output format.

I spent three weeks optimizing prompts before I realized the bottleneck wasn't the prompt. It was the fact that my agent started fresh every single session. All the decisions it made yesterday (which architecture pattern to use, which files to avoid, what Kevin prefers) were gone. Every morning it re-derived everything from scratch.

The bottleneck in AI agent workflows isn't intelligence. It's memory.

The Decision Re-derivation Problem

Watch an agent work across multiple sessions and you'll see a pattern. Session 1: it explores the codebase, figures out the conventions, makes good decisions, ships clean code. Session 2: it explores the same codebase, arrives at slightly different conventions, makes different decisions, ships code that doesn't match session 1.

Both sessions produced good work. But they produced inconsistent work. The agent didn't get dumber; it just forgot what it decided last time.

In a human team this would be like hiring a contractor who does excellent work but has amnesia. Every morning you re-explain the project from scratch. By Friday they've built five different versions of the same feature, each internally consistent, none compatible with the others.

What We Tried (and What Failed)

Giant system prompts. We packed every convention, decision, and preference into the system prompt. It worked until the prompt hit 8,000 tokens and the agent spent more time re-reading instructions than writing code. And it still didn't remember runtime decisions, the ones made during the session that weren't in any document.

Conversation history. We saved full conversation logs and loaded them at session start. This collapsed under its own weight within a week. A 200-message conversation log is 50,000+ tokens of context that's 90% irrelevant to the current task. The agent would get confused by old discussions about problems that were already solved.

RAG over past conversations. We tried semantic search over past sessions. The retrieval was noisy. An agent asking "how should I structure this test file" would pull up five different past answers, each from a different context, and average them into something nobody wanted.

What Actually Worked

The fix was separating what the agent needs to know from what it once knew.

We built three layers:

Identity documents: versioned files that define who the agent is, what it's responsible for, and how it should behave. These get read at the start of every session. They're short (under 2,000 tokens each), specific, and maintained like code. When a decision changes, the identity doc gets updated. The agent never has to re-derive it.

Handoff files: operational state documents written at the end of every session. What was in progress, what shipped, what's next. The next session reads the handoff and picks up mid-stride. Not the full conversation, just the actionable state.

Structured memory: a queryable store of facts, decisions, and preferences. Not conversation transcripts, but extracted facts with timestamps and confidence levels. "Kevin prefers single-file PRs" is a memory. The 45-minute conversation where he explained why is not.
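To make the third layer concrete, here is a minimal sketch of what a structured memory entry and a query over it could look like. It is not our actual implementation: the `Memory` and `MemoryStore` names, the `topic` field, the 0.5 confidence cutoff, and the second example fact are all assumptions made for illustration. The only things taken from the description above are that each memory is an extracted fact with a timestamp and a confidence level.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Memory:
    topic: str         # hypothetical grouping key, e.g. "pull-requests"
    fact: str          # the extracted fact, not the transcript behind it
    confidence: float  # 0.0 to 1.0: how sure we are the fact still holds
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class MemoryStore:
    """In-memory stand-in for a queryable store of facts (illustrative only)."""

    def __init__(self) -> None:
        self._memories: list[Memory] = []

    def remember(self, memory: Memory) -> None:
        self._memories.append(memory)

    def recall(self, topic: str, min_confidence: float = 0.5) -> list[Memory]:
        """Return facts for a topic, newest first, dropping low-confidence ones."""
        matches = [
            m for m in self._memories
            if m.topic == topic and m.confidence >= min_confidence
        ]
        return sorted(matches, key=lambda m: m.recorded_at, reverse=True)


store = MemoryStore()
store.remember(Memory("pull-requests", "Kevin prefers single-file PRs", 0.9))
# Made-up low-confidence entry, included only to show the filter at work.
store.remember(Memory("pull-requests", "Squash commits before merging", 0.3))

facts = [m.fact for m in store.recall("pull-requests")]
print(facts)
```

The point of the shape is that the agent asks a narrow question ("what do I know about pull requests?") and gets back a short list of current facts, rather than a pile of past conversations to reconcile.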

The Infrastructure Shift

Once we built this, the agent's behavior changed dramatically. Session-to-session consistency went from maybe 60% to over 95%. The agent stopped asking questions it had already answered. Code style stabilized. Architectural decisions stuck.

And the prompt got simpler, not more complex. We deleted most of the system prompt because the identity documents covered it. The prompt became: read your identity, read your handoff, check your memory, then start working.
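The "read your identity, read your handoff, check your memory" sequence can be sketched as a small bootstrap step. This is a hedged illustration, not our real tooling: the file layout (`identity/*.md`, `handoff.json`), the JSON keys, and the function names `write_handoff` and `build_session_context` are assumptions, and the memory layer is reduced to a plain list of strings to keep the example short.

```python
import json
import tempfile
from pathlib import Path


def write_handoff(path, in_progress, shipped, next_up):
    """Persist only the actionable state at the end of a session."""
    state = {"in_progress": in_progress, "shipped": shipped, "next": next_up}
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def build_session_context(identity_dir, handoff_path, memories):
    """Assemble start-of-session context in order: identity, handoff, memory."""
    sections = []
    # 1. Identity documents: short, versioned files read every session.
    for doc in sorted(identity_dir.glob("*.md")):
        body = doc.read_text(encoding="utf-8")
        sections.append(f"# Identity: {doc.stem}\n{body}")
    # 2. Handoff: what the previous session left behind, if anything.
    if handoff_path.exists():
        handoff = json.loads(handoff_path.read_text(encoding="utf-8"))
        lines = [
            f"{key}: {', '.join(items) or 'nothing'}"
            for key, items in handoff.items()
        ]
        sections.append("# Handoff\n" + "\n".join(lines))
    # 3. Memory: extracted facts relevant to the task at hand.
    if memories:
        sections.append("# Memory\n" + "\n".join(f"- {m}" for m in memories))
    return "\n\n".join(sections)


# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    identity_dir = root / "identity"
    identity_dir.mkdir()
    (identity_dir / "role.md").write_text(
        "You maintain the build tooling.", encoding="utf-8"
    )
    handoff_path = root / "handoff.json"
    write_handoff(handoff_path, ["refactor test layout"], [], ["open PR"])
    context = build_session_context(
        identity_dir, handoff_path, ["Kevin prefers single-file PRs"]
    )

print(context)
```

Nothing here depends on the conversation log. The previous session's only obligation is to call something like `write_handoff` before it ends, and the next session's prompt only has to point at the assembled context.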

The bottleneck was never the model's capability. It was the absence of infrastructure between sessions. Fix the memory problem and the prompt problem solves itself.

The Takeaway

If your agent workflow feels like it's plateauing, if you're getting diminishing returns from prompt engineering, stop optimizing the prompt. Look at what happens between sessions. Is your agent starting fresh? Is it re-deriving decisions? Is it losing context across restarts?

That's your bottleneck. Not the model. The infrastructure around it.