People ask us almost weekly: "what does your AI actually do?" The honest answer is that we don't have one AI doing one thing. We have a stack — a layered set of agents, each doing a specific job, coordinated by senior humans who make the calls the agents can't.
This post is a walk-through of that stack as we run it today. Tools change, models change, prices change. The shape of the work doesn't.
Layer 1: the deterministic glue
At the bottom are the boring things that have to be reliable. Pulling data from ad platforms, posting to a CMS, triggering an alert when a metric crosses a threshold. None of this needs reasoning. All of it needs to be 100% correct, 100% of the time.
For this layer we use n8n (self-hosted) and Make. Simple if-this-then-that flows. We don't put an LLM in this layer because LLMs are non-deterministic, and "the daily report didn't generate because the model hallucinated a column name" is not a failure mode we tolerate.
What lives here
- Daily data pulls from Google Ads, Meta Ads, GA4, Search Console, Plausible
- Webhook-triggered actions (e.g. new lead → enrich → route to Slack)
- Scheduled CMS publishing once content is human-approved
- Health checks: did the agent run? did it produce output? did the output land where it should? (a minimal sketch follows below)
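To make that last item concrete, here's a minimal health-check sketch in plain Python rather than an n8n flow. It is not our production check: the webhook env var, the run-record path, and the field names are illustrative.

```python
import json
import os
from datetime import datetime, timedelta, timezone

import requests

SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]    # illustrative env var
RUN_RECORD = "runs/ads_pull_latest.json"           # illustrative path written by the daily pull

def alert(message: str) -> None:
    # Deterministic alerting: no LLM anywhere in this path.
    requests.post(SLACK_WEBHOOK, json={"text": f"health check failed: {message}"}, timeout=10)

def check_last_run() -> None:
    try:
        with open(RUN_RECORD) as f:
            run = json.load(f)
    except FileNotFoundError:
        alert("ads_pull: no run record found")
        return

    started = datetime.fromisoformat(run["started_at"])
    if started.tzinfo is None:                     # assume UTC if the record is naive
        started = started.replace(tzinfo=timezone.utc)

    # The three questions from the bullet above, in order.
    if datetime.now(timezone.utc) - started > timedelta(hours=26):
        alert("ads_pull: last run is more than 26 hours old")
    elif run.get("rows_written", 0) == 0:
        alert("ads_pull: run completed but wrote zero rows")
    elif not run.get("destination_confirmed", False):
        alert("ads_pull: output was produced but never landed in the warehouse")

if __name__ == "__main__":
    check_last_run()
```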
Layer 2: the retrieval & structured-output agents
Above the deterministic layer sit agents whose job is to take inputs, fetch context, and produce structured output. They use an LLM under the hood but the output shape is constrained: JSON schemas, named fields, validated values. We use Claude as our default model for this layer because its tool use and structured output are the strongest of the models we've run here.
Example workflows
- Keyword expansion. Input: a seed query. Agent fetches SERP data, competitor pages, related searches. Output: a structured keyword cluster with intent classification, difficulty estimate, and a suggested content angle.
- Ad copy variants. Input: a base concept + the campaign target metric. Agent generates 12 variants, scores each on hook strength + claim specificity, returns the top 5 with rationale.
- Brief-from-meeting. Input: a transcript of a discovery call. Agent extracts goals, current numbers, blockers, decision-maker, timeline. Returns a structured project brief ready for senior review.
These agents are stateless. They get input, produce output, hand off to the next stage. The structure of the output is what makes them safe to chain.
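For flavor, here's roughly what one of these stateless agents can look like, using the brief-from-meeting example. This is a sketch against the Anthropic Python SDK, not our production agent; the schema fields mirror the example above, the function name is ours, and the model string is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The output shape is the contract: downstream steps consume these named fields.
BRIEF_SCHEMA = {
    "type": "object",
    "properties": {
        "goals": {"type": "array", "items": {"type": "string"}},
        "current_numbers": {"type": "array", "items": {"type": "string"}},
        "blockers": {"type": "array", "items": {"type": "string"}},
        "decision_maker": {"type": "string"},
        "timeline": {"type": "string"},
    },
    "required": ["goals", "blockers", "decision_maker", "timeline"],
}

def brief_from_transcript(transcript: str) -> dict:
    """Stateless: transcript in, structured brief out. No memory between calls."""
    response = client.messages.create(
        model="claude-sonnet-4-5",                           # placeholder; pick per cost/quality
        max_tokens=2048,
        tools=[{
            "name": "emit_brief",
            "description": "Return the project brief extracted from the call transcript.",
            "input_schema": BRIEF_SCHEMA,
        }],
        tool_choice={"type": "tool", "name": "emit_brief"},  # force structured output
        messages=[{
            "role": "user",
            "content": f"Extract a project brief from this discovery call:\n\n{transcript}",
        }],
    )
    # The forced tool call carries the structured payload.
    tool_use = next(block for block in response.content if block.type == "tool_use")
    return tool_use.input
```

Because the return value is a plain dict in a known shape, the next stage (schema validation, then senior review) is ordinary code.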
Layer 3: the reasoning agents
This is the layer that does what Zapier can't: multi-step reasoning with judgment. An agent here might pull data, notice an anomaly, decide whether to flag it or fix it, and only escalate to a human when its confidence drops below a threshold.
For this layer we write custom agent code on top of Anthropic's tool-use API with structured outputs and explicit confidence scoring. Most of our reasoning agents run on Claude, with a few specialized ones on GPT or Gemini depending on cost per task.
Example: the paid-media monitor
Runs every 4 hours and pulls performance data per campaign and per ad set. For each anomaly (CPA spike, CTR drop, frequency creep), the agent decides, along the lines of the routing sketched after this list:
- Confidence high, action low-risk → apply the fix automatically (pause an ad with CTR < 0.5%, increase bid on a top performer).
- Confidence medium, action medium-risk → draft the change, post a recommendation to Slack with rationale + one-click approve.
- Confidence low OR action high-risk → escalate to the senior strategist with the data + competing hypotheses. Never apply.
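The decision logic itself is plain code, not a prompt. A simplified sketch of the routing; the class names and threshold numbers are illustrative, not our production values:

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class ProposedAction:
    description: str        # e.g. "pause ad 123 (CTR 0.3%)"
    risk: Risk
    confidence: float       # 0.0-1.0, emitted by the reasoning agent
    reversible: bool

def route(action: ProposedAction) -> str:
    """Map (confidence, risk) onto apply / recommend / escalate.

    The thresholds are illustrative; the point is that the routing is
    plain code the whole team can read, not something the model decides.
    """
    if action.risk == Risk.HIGH or action.confidence < 0.5 or not action.reversible:
        return "escalate"    # senior strategist, with the data + competing hypotheses
    if action.confidence >= 0.85 and action.risk == Risk.LOW:
        return "apply"       # automatic, logged, reversible
    return "recommend"       # draft the change, post to Slack with one-click approve
```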
The most expensive bug in agentic systems is a confident agent making an irreversible change. The solution is not a smarter agent — it's clearer rules about which actions an agent can take without sign-off.
Layer 4: the humans on top
Every account has a senior strategist who owns the relationship and the outcomes. The agent stack reports up; the strategist makes the calls that span more than one campaign, one channel, or one quarter.
What humans do that agents don't:
- Strategy decisions. "We're going to deprioritize Meta and double down on YouTube for Q3." Not an agent call.
- Creative judgment. Which of the 47 ad variants the agent generated actually feels like the brand. Agents help generate options; humans pick.
- Client trust. Hard conversations, scope renegotiations, the moments where a project needs a person on the call. Always a human.
- Cross-channel storytelling. Connecting the SEO insight to the paid hypothesis to the email sequence. Agents see channels; humans see narrative.
What makes it work
Three principles, in order of importance:
1. Constrained output beats clever prompts
The agents in our stack don't return free-form text. They return JSON with a schema we validate. If the schema fails, the agent retries. If retries fail, the workflow escalates. The clever prompt-engineering tricks matter less than the structural discipline.
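In practice that loop looks something like the sketch below, using the jsonschema library; the function and exception names are ours, for illustration.

```python
from jsonschema import validate, ValidationError   # pip install jsonschema

MAX_RETRIES = 2

class EscalationRequired(Exception):
    """Raised so the surrounding workflow hands the task to a human."""

def run_with_schema(agent_call, schema: dict, payload: dict) -> dict:
    """Call an agent, validate its JSON output, retry with feedback, escalate after that.

    `agent_call` is any function taking (payload, feedback) and returning a dict,
    e.g. a thin wrapper around the Layer 2 structured-output call.
    """
    feedback = None
    for _ in range(MAX_RETRIES + 1):
        output = agent_call(payload, feedback=feedback)
        try:
            validate(instance=output, schema=schema)
            return output
        except ValidationError as err:
            # Feed the validation error back so the retry is targeted, not blind.
            feedback = f"Previous output failed schema validation: {err.message}"
    raise EscalationRequired(f"Schema still failing after {MAX_RETRIES} retries")
```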
2. Observability is non-negotiable
Every agent run is logged: inputs, outputs, tools called, decisions made, confidence scores. When something goes wrong (and it will), we can replay the run and see exactly where the reasoning broke. Without this you're flying blind.
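The logging doesn't need to be clever. A minimal per-run record, sketched as a dataclass with illustrative field names:

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class AgentRun:
    """One record per agent run, appended to a JSONL file (or shipped to a log store)."""
    agent: str
    inputs: dict
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    tool_calls: list = field(default_factory=list)   # every tool name + args + result summary
    decisions: list = field(default_factory=list)    # each decision + the confidence behind it
    output: dict | None = None

    def log_tool(self, name: str, args: dict, result_summary: str) -> None:
        self.tool_calls.append({"tool": name, "args": args, "result": result_summary})

    def log_decision(self, decision: str, confidence: float) -> None:
        self.decisions.append({"decision": decision, "confidence": confidence})

    def flush(self, path: str = "logs/agent_runs.jsonl") -> None:
        with open(path, "a") as f:
            f.write(json.dumps(asdict(self)) + "\n")
```

With records like this, replaying a run is just re-running the agent on the logged inputs and diffing the tool calls and decisions against the original.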
3. The handoff to humans is where the alpha lives
The best workflows we run aren't the most automated ones. They're the ones with the clearest handoff: agents do the 80% of mechanical work, surface the 20% of judgment-required work in a clean, ranked format, and let a human spend 15 focused minutes instead of 4 distracted hours. That's the asymmetry.
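What "clean, ranked format" means in practice: judgment-required items get scored and sorted before they reach the strategist. A sketch, where the ranking heuristic (estimated impact weighted by the agent's uncertainty) is one illustrative choice, not the only one:

```python
from dataclasses import dataclass

@dataclass
class ReviewItem:
    title: str             # e.g. "Reallocate 20% of budget from campaign A to B"
    rationale: str
    confidence: float      # from the reasoning agent
    est_impact: float      # estimated monthly impact in the account currency

def build_digest(items: list[ReviewItem], limit: int = 10) -> str:
    """Rank judgment-required items so the strategist's 15 minutes go to the top of the list."""
    ranked = sorted(items, key=lambda i: i.est_impact * (1 - i.confidence), reverse=True)
    lines = [
        f"{n}. {i.title} (est. impact {i.est_impact:,.0f}, confidence {i.confidence:.0%})\n   {i.rationale}"
        for n, i in enumerate(ranked[:limit], start=1)
    ]
    return "\n".join(lines)
```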
What we don't automate
- Final approval of anything client-facing. Even when the agent's draft is better than what we'd write, the strategist reads it before it ships.
- The discovery call. A senior human listens to a senior human. Agents can transcribe + brief, but they can't read the room.
- Crisis response. When an account is on fire — bad CAC, drop in rankings, ad disapprovals — the agent stack provides data, not decisions.
- Pricing conversations. Always a founder on that call.
The honest part
This stack took two years to build, breaks roughly once a week, and needs a full-time senior automation engineer (that's me) to maintain it. Anyone who tells you they shipped an "agentic agency" in 90 days is selling you Zapier in a hat.
But the leverage is real. A two-person team running this stack ships more useful output per week than a ten-person traditional agency. Not because agents replace humans — but because agents take the mechanical work off the humans, and the humans get to spend their time on judgment.
That's the anatomy. The work is in the assembly.