The AI agent space has a complexity problem. Every new deployment seems to need a vector database, a graph store, three retrieval pipelines, and a multi-agent council that debates each response. For almost every business I work with, that's the wrong shape. Here's the setup I use by default.

Three memory layers

Layer one is the system prompt. Role, rules, tone, hard limits. This never changes. It's the agent's identity. No retrieval, no lookup. Just baked into every call.

Layer two is the session. Whatever's happened in the current conversation. Loaded each turn, trimmed when it grows too long. Context window management happens here. If your session is bloating to tens of thousands of tokens, something's wrong with how you're writing to it.

Layer three is persistent notes. A small file or structured record the agent can read and update between sessions. Client preferences. Running task lists. Named profiles. Anything the agent needs to remember past this conversation.

That's it. Three layers. No embeddings, no semantic search, no vector store.

When this breaks

It breaks when your knowledge base genuinely exceeds what fits in context. A law firm with thousands of case files needs retrieval. A medical practice with a decade of patient notes needs retrieval. These are real cases where you need embeddings and a proper retrieval layer.

The mistake is assuming every agent needs that setup. A small business tracking maybe a hundred customers doesn't. Put those customers in the persistent notes file. The agent will read the whole thing faster than it would embed and retrieve the relevant chunk, and it'll be more accurate, too, because nothing gets lost in retrieval.

Orchestration follows the same rule

One agent handling the full job is usually better than five agents passing messages. Every handoff between agents adds latency, cost, and another place for things to go wrong.
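That coordination overhead is easy to underestimate. A rough back-of-envelope sketch below; the token counts are hypothetical placeholders, not measurements, but the structural point holds: a planner/executor split pays for the task and the plan twice.

```python
# Rough cost model: single agent vs. planner/executor handoff.
# All numbers are hypothetical; substitute your own measurements.

def single_agent_tokens(task_tokens: int, answer_tokens: int) -> int:
    """One call: task in, answer out."""
    return task_tokens + answer_tokens

def multi_agent_tokens(task_tokens: int, answer_tokens: int,
                       plan_tokens: int = 400) -> int:
    """Planner call (task -> plan) plus executor call (task + plan -> answer).
    The plan is paid for twice: once as planner output, once as executor input,
    and the task itself is sent to both agents."""
    planner_call = task_tokens + plan_tokens
    executor_call = task_tokens + plan_tokens + answer_tokens
    return planner_call + executor_call

print(single_agent_tokens(1000, 500))   # 1500
print(multi_agent_tokens(1000, 500))    # 3300
```

More than double the tokens before the executor has even failed a handoff once.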
I've seen multi-agent setups where the orchestrator spends more tokens coordinating than the workers spend doing the work.

Split only when you hit a real limit. Context window overflowing? Split by domain so each agent holds less. Latency unacceptable because one agent is doing sequential work that could be parallel? Split into workers. Different agents need different tool sets or different permission levels? Split.

Don't split because the Twitter discourse says multi-agent is the future. Split when something actually breaks.

What I've seen go wrong

A deployment where someone stood up a vector database to store twelve FAQ entries. The retrieval layer regularly pulled the wrong entry because twelve items is below the noise floor for embeddings. The fix was deleting the vector database and putting the FAQ in the system prompt. Twelve entries, maybe two hundred words: a trivial number of tokens to carry in every call, and the answers became perfect.

Another one. A multi-agent setup for a task that a single agent handled fine in testing. In production, the handoff between the "planner" and "executor" agents failed about one in twenty times because the executor didn't understand what the planner had decided. Collapsed it to one agent. Error rate dropped to near zero.

Start simple. Add memory when the agent forgets what it needs to remember. Add orchestration when one agent can't fit the job. Don't build the law firm when the client is a coffee shop.
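The whole three-layer default fits in a few dozen lines. Here's a minimal sketch; the business name, file path, `MAX_SESSION_TOKENS` budget, and the chars-divided-by-four token estimate are all assumptions for illustration, not a prescribed implementation.

```python
import json
from pathlib import Path

# Layer 1: fixed identity, baked into every call. Never retrieved, never trimmed.
SYSTEM_PROMPT = (
    "You are the scheduling assistant for Brightside Coffee. "  # hypothetical client
    "Be brief. Never promise a booking without an open slot."
)

NOTES_FILE = Path("agent_notes.json")  # Layer 3: survives between sessions
MAX_SESSION_TOKENS = 4000              # assumed trim budget for Layer 2

def estimate_tokens(messages):
    """Crude estimate: ~4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4

def trim(session):
    """Layer 2 management: drop oldest turns until the session fits the budget."""
    while len(session) > 2 and estimate_tokens(session) > MAX_SESSION_TOKENS:
        session.pop(0)
    return session

def load_notes():
    return json.loads(NOTES_FILE.read_text()) if NOTES_FILE.exists() else {}

def save_notes(notes):
    NOTES_FILE.write_text(json.dumps(notes, indent=2))

def build_messages(session, user_input):
    """Assemble one model call from all three layers."""
    notes = load_notes()
    system = SYSTEM_PROMPT + "\n\nPersistent notes:\n" + json.dumps(notes)
    session.append({"role": "user", "content": user_input})
    return [{"role": "system", "content": system}] + trim(session)
```

Each turn, the list returned by `build_messages` is what goes to the model; anything worth keeping past the session gets written back with `save_notes`. No embeddings, no index, nothing to go stale.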
April 19, 2026
Memory Layers and Agent Orchestration: Keep It Simple
Want an AI agent that handles this for you?
I build custom AI agents for small businesses and teams. Tell me what you need automated and I can put something together.
Get in touch →