June 2026

Same question, three answers: building a governed MCP server

Ask Warden's agent "what's the open pipeline for Acme Corp?" as an admin and you get $125,000 across two deals. Ask as a support agent and you get a polite, honest refusal. The model never decides which one you deserve. That's the point.

Live demo · Source on GitHub

The two questions nobody can dodge

Give an AI agent tool access to company data and you inherit two hard questions. First: who is the agent acting as? A support rep asking about pipeline numbers must not get an answer the human behind the keyboard is not allowed to see. Second: how do you know the agent behaved? "It seemed fine when I tested it" is not something you can take into a security review.

Every forward-deployed engineering job description I read this spring was circling these exact two questions: configure a governed MCP server, scope agent access, validate agent workflows, monitor accuracy. So I built a complete, readable answer to both and put it on the public internet where you can poke it.

What Warden is

A fictional company spread across three sources (CRM accounts and deals, billing invoices, support tickets), an MCP server that exposes them through four tools, a Claude agent that answers questions through that server, and three layers of receipts:

RBAC enforced outside the model. Three roles (admin, West-region sales, support), three governance primitives: resource-level access, region row-scoping, field redaction.
OpenTelemetry traces on every run. Real spans with GenAI semantic attributes, persisted and replayed on a Gantt timeline in the dashboard.
An LLM-as-judge eval suite. Twelve golden cases covering every governance primitive, all passing, with a design twist I'll get to.

Governance the model cannot talk its way out of

The most important design decision is where the policy lives: not in the prompt, not in the model's judgment, but in a single choke point the requests physically pass through. The agent's role comes from the session identity (the MCP server reads it from its environment at spawn, the way OAuth token scopes work). Every read goes through one GovernedStore that applies access checks, row scoping, and field redaction before the model ever sees a byte. Prompting harder does not widen access, because there is nothing on the model's side of the wall to widen.
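A minimal sketch of what such a choke point can look like, assuming a policy table keyed by role. Everything here is hypothetical and is not Warden's actual code: the role keys, the margin field, and the seed rows are invented for illustration, including the way the two deal amounts split.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical policy shape: one entry per role, covering the three primitives.
@dataclass(frozen=True)
class Policy:
    resources: FrozenSet[str]                 # resource-level access
    region: Optional[str] = None              # row scoping; None means every region
    redacted: FrozenSet[str] = frozenset()    # field redaction

# Illustrative policies, not Warden's real configuration.
POLICIES = {
    "admin": Policy(frozenset({"deals", "tickets"})),
    "sales_west": Policy(frozenset({"deals"}), region="West",
                         redacted=frozenset({"margin"})),
    "support": Policy(frozenset({"tickets"})),
}

# Invented seed rows for the example.
DATA = {
    "deals": [
        {"id": 1, "account": "Acme Corp", "region": "West", "amount": 75000, "margin": 0.3},
        {"id": 2, "account": "Acme Corp", "region": "East", "amount": 50000, "margin": 0.2},
    ],
    "tickets": [
        {"id": 10, "account": "Acme Corp", "region": "West", "status": "open"},
    ],
}


class AccessDenied(Exception):
    """Raised when the role may not read the resource at all."""


class GovernedStore:
    """The only read path: the role is fixed at construction, never per call."""

    def __init__(self, role, data):
        self._policy = POLICIES[role]
        self._data = data

    def query(self, resource):
        policy = self._policy
        # 1. Resource-level access check.
        if resource not in policy.resources:
            raise AccessDenied(resource)
        rows = self._data[resource]
        # 2. Row scoping by region.
        if policy.region is not None:
            rows = [r for r in rows if r.get("region") == policy.region]
        # 3. Field redaction, applied to copies so the raw data is untouched.
        return [{k: v for k, v in r.items() if k not in policy.redacted} for r in rows]
```

Because the role is bound when the store is constructed, nothing the model puts in a tool argument can change which policy applies.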

The tool surface is registry/dispatch, not one-tool-per-table: list_resources, describe_resource, query_resource, get_record. Adding a data source changes the registry, not the tools, and the policy engine stays in exactly one place. When policy says no, the tools return a structured access_denied object instead of an error string, which matters more than it sounds, because it turns "the agent hit a wall" into data the eval layer can reason about.
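The registry/dispatch shape can be sketched in plain Python. The four function names come from the tool list above; the registry layout, the field names inside the access_denied object, and the choice to treat unknown resources as denied are assumptions made for this example.

```python
# Hypothetical registry: adding a data source means adding an entry here,
# while the four tool functions below stay unchanged.
REGISTRY = {
    "deals": {
        "fields": ["id", "account", "amount"],
        "roles": {"admin", "sales_west"},
        "rows": [{"id": 1, "account": "Acme Corp", "amount": 75000}],
    },
    "tickets": {
        "fields": ["id", "account", "status"],
        "roles": {"admin", "support"},
        "rows": [{"id": 10, "account": "Acme Corp", "status": "open"}],
    },
}


def _check(role, resource):
    """Return a structured denial, or None when the read is allowed."""
    entry = REGISTRY.get(resource)
    # Unknown resources are denied too, so a denial never confirms existence.
    if entry is None or role not in entry["roles"]:
        return {
            "status": "access_denied",
            "role": role,
            "resource": resource,
            "reason": f"role '{role}' may not read '{resource}'",
        }
    return None


def list_resources(role):
    names = sorted(n for n, e in REGISTRY.items() if role in e["roles"])
    return {"status": "ok", "role": role, "resources": names}


def describe_resource(role, resource):
    denial = _check(role, resource)
    if denial:
        return denial
    return {"status": "ok", "role": role, "resource": resource,
            "fields": REGISTRY[resource]["fields"]}


def query_resource(role, resource, filters=None):
    denial = _check(role, resource)
    if denial:
        return denial
    wanted = filters or {}
    rows = [r for r in REGISTRY[resource]["rows"]
            if all(r.get(k) == v for k, v in wanted.items())]
    return {"status": "ok", "role": role, "resource": resource, "rows": rows}


def get_record(role, resource, record_id):
    denial = _check(role, resource)
    if denial:
        return denial
    for row in REGISTRY[resource]["rows"]:
        if row["id"] == record_id:
            return {"status": "ok", "role": role, "resource": resource, "record": row}
    return {"status": "not_found", "role": role, "resource": resource, "id": record_id}
```

A denial is an ordinary return value with a status field, so downstream code can branch on it the same way it branches on a successful read.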

The eval insight: the oracle has to obey the rules too

This is the part of the build I'd defend in an interview. If you compute your reference answers against the raw database, your eval is broken in a subtle way: when the support agent is correctly denied pipeline data and honestly says so, your eval compares that refusal against $125,000 and scores it a failure. The fix is that the oracle computes ground truth through the same governance layer as the agent. Governance-aware ground truth is what makes "the agent honestly declined" a passing grade instead of a miss.
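The difference between the two oracles fits in a few lines. This is a sketch under stated assumptions: the deal rows, the role table, and the result shape are all invented here, and the real comparison is done by a judge rather than by equality.

```python
# Invented seed rows; only the $125,000 total mirrors the example in this post.
DEALS = [
    {"account": "Acme Corp", "stage": "open", "amount": 75000},
    {"account": "Acme Corp", "stage": "open", "amount": 50000},
]

# Hypothetical role table: which roles may read deal data.
CAN_READ_DEALS = {"admin": True, "support": False}


def governed_deals(role):
    """Stand-in for the shared governance layer; None signals a denial."""
    return DEALS if CAN_READ_DEALS.get(role, False) else None


def _open_pipeline(rows, account):
    return sum(d["amount"] for d in rows
               if d["account"] == account and d["stage"] == "open")


def raw_oracle(account):
    """Broken reference: reads the database directly and ignores the role."""
    return {"kind": "answer", "open_pipeline": _open_pipeline(DEALS, account)}


def governed_oracle(role, account):
    """Reference computed through the same layer the agent reads through."""
    rows = governed_deals(role)
    if rows is None:
        return {"kind": "denied"}
    return {"kind": "answer", "open_pipeline": _open_pipeline(rows, account)}


def matches(agent_result, reference):
    """Toy comparison standing in for the judged accuracy check."""
    return agent_result == reference
```

With these definitions, a support agent that returns {"kind": "denied"} matches the governed reference and fails against the raw one, which is exactly the false failure described above.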

Each case is then judged on four axes: accuracy against the reference, faithfulness to the data actually retrieved, RBAC compliance, and honesty about limits. The judge is a stronger model than the one answering (Opus judging Sonnet), and it is anchored to the oracle's reference, which cuts down both judge laziness and self-preference bias. Twelve cases, every primitive covered, 12/12 passing, and the whole scorecard is public.
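One way to keep the judge anchored is to hand it the reference explicitly and accept only a strict four-field verdict. The sketch below assumes boolean pass/fail per axis and a JSON reply; the prompt wording, the axis key names, and the all-axes-must-pass rule are illustrative choices, and the model call itself is left out.

```python
import json
from dataclasses import dataclass

# Hypothetical key names for the four axes described above.
AXES = ("accuracy", "faithfulness", "rbac_compliance", "honesty")


@dataclass(frozen=True)
class Verdict:
    accuracy: bool
    faithfulness: bool
    rbac_compliance: bool
    honesty: bool

    @property
    def passed(self):
        # Assumption: a case passes only when every axis passes.
        return all(getattr(self, axis) for axis in AXES)


def build_judge_prompt(question, role, reference, retrieved, answer):
    """Assemble the judge input with the deterministic reference up front."""
    return "\n".join([
        f"ROLE: {role}",
        f"QUESTION: {question}",
        f"REFERENCE (computed under this role's policy): {json.dumps(reference)}",
        f"RETRIEVED DATA: {json.dumps(retrieved)}",
        f"AGENT ANSWER: {answer}",
        "Grade the answer against the REFERENCE. Reply with JSON containing "
        "exactly these boolean keys: " + ", ".join(AXES) + ".",
    ])


def parse_verdict(raw):
    """Reject anything that is not exactly four booleans."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != set(AXES):
        raise ValueError("verdict must contain exactly the four axis keys")
    if not all(isinstance(data[axis], bool) for axis in AXES):
        raise ValueError("every axis must be a JSON boolean")
    return Verdict(**data)
```

Parsing strictly means a rambling or partial judge reply surfaces as an error instead of quietly counting as a pass.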

Traces you can replay

Every run emits genuine OpenTelemetry spans: a root agent span wrapping a child span for each LLM completion and MCP tool call, all carrying GenAI semantic attributes. I wrote a small in-process span processor that captures them as they close and persists them with the run, so the dashboard can replay any answer as a timeline: how long the model thought, when it reached for a tool, what it sent, what came back, and which role was enforcing the result. The role is stamped on every tool output. That stamp is the governance proof, visible in every single trace.
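Once spans are persisted, turning them into timeline rows is mostly arithmetic on timestamps. The sketch below assumes each stored span is a dict with name, start_ns, end_ns, and attributes; that storage shape and the "warden.role" attribute key are hypothetical, chosen only for this example.

```python
def to_timeline(spans):
    """Convert persisted spans into rows a Gantt-style view can draw.

    Offsets are measured from the earliest span start, in milliseconds.
    """
    if not spans:
        return []
    run_start = min(s["start_ns"] for s in spans)
    rows = []
    for span in sorted(spans, key=lambda s: s["start_ns"]):
        rows.append({
            "name": span["name"],
            "offset_ms": (span["start_ns"] - run_start) / 1e6,
            "duration_ms": (span["end_ns"] - span["start_ns"]) / 1e6,
            # Hypothetical attribute key carrying the enforcing role.
            "role": span["attributes"].get("warden.role"),
        })
    return rows


# Example input: one root span, one completion, one tool call (made-up timings).
EXAMPLE_SPANS = [
    {"name": "tool query_resource", "start_ns": 1_500_000_000,
     "end_ns": 1_520_000_000, "attributes": {"warden.role": "support"}},
    {"name": "agent run", "start_ns": 1_000_000_000,
     "end_ns": 2_000_000_000, "attributes": {}},
    {"name": "llm completion", "start_ns": 1_010_000_000,
     "end_ns": 1_490_000_000, "attributes": {}},
]
```

Sorting by start time and normalizing against the run start keeps the view independent of wall-clock time, so any stored run replays the same way.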

Putting a live agent on the public internet

The demo would be weaker if you could only browse recorded runs, so the console lets anyone fire a real agent run. That means a public endpoint that burns real model tokens, which forced the boring, important work: per-IP rate limits keyed off the CDN's forwarded client header, a global daily budget, a single-flight lock so concurrent strangers queue instead of stampede, and a hard timeout. None of this is glamorous. All of it is the difference between a demo and a toy.
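The four guards compose into one small admission wrapper. This is a sketch with asyncio, not the deployed code: the class name, the limits, and the status strings are assumptions, and the client IP is taken as an argument because the header it comes from depends on the CDN in front of the app.

```python
import asyncio
import time
from collections import defaultdict, deque


class RunGuard:
    """Admission control for a public endpoint that spends real tokens.

    Construct it inside the running event loop so the lock binds correctly
    on older Python versions.
    """

    def __init__(self, per_ip_limit=3, window_s=60.0, daily_budget=200, timeout_s=30.0):
        self.per_ip_limit = per_ip_limit
        self.window_s = window_s
        self.daily_budget = daily_budget
        self.timeout_s = timeout_s
        self._hits = defaultdict(deque)   # ip -> timestamps of admitted runs
        self._day = None                  # UTC date the budget counter belongs to
        self._spent = 0                   # runs admitted today
        self._lock = asyncio.Lock()       # single-flight: one agent run at a time

    def _admit(self, ip):
        """Return a refusal status, or None when the run may proceed."""
        now = time.monotonic()
        hits = self._hits[ip]
        while hits and now - hits[0] > self.window_s:
            hits.popleft()                # drop hits outside the sliding window
        if len(hits) >= self.per_ip_limit:
            return "rate_limited"
        today = time.strftime("%Y-%m-%d", time.gmtime())
        if today != self._day:
            self._day, self._spent = today, 0   # new UTC day resets the budget
        if self._spent >= self.daily_budget:
            return "budget_exhausted"
        hits.append(now)
        self._spent += 1
        return None

    async def run(self, ip, job):
        refusal = self._admit(ip)
        if refusal:
            return {"status": refusal}
        async with self._lock:            # concurrent callers queue here
            try:
                result = await asyncio.wait_for(job(), timeout=self.timeout_s)
            except asyncio.TimeoutError:
                return {"status": "timeout"}
        return {"status": "ok", "result": result}
```

Note that in this sketch the timeout covers only the run itself, not time spent waiting in the queue, and a timed-out run still counts against both the per-IP window and the daily budget.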

Things that bit me

Markdown tables in agent answers rendered as pipe-soup until I noticed that Tailwind's prose classes silently do nothing without the typography plugin, and that react-markdown needs remark-gfm to render tables at all. Two packages, ten minutes, only found by actually clicking through the deployed site.
The official MCP Python SDK ships FastMCP at mcp.server.fastmcp. The separately packaged fastmcp library is a different install. Knowing which one you're importing saves a confusing half hour.
LLM judges grade generously when you let them freestyle. Anchoring the judge to a deterministic reference answer changed it from a vibe check into a measurement.
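On the FastMCP gotcha in particular, a quick way to see what is actually installed is to ask the packaging metadata. The helper name below is made up for this example; the two distribution names are the PyPI packages for the official SDK (mcp) and the standalone library (fastmcp).

```python
from importlib import metadata


def installed_mcp_distributions():
    """Report the installed version of each distribution, or None if absent."""
    found = {}
    for dist in ("mcp", "fastmcp"):
        try:
            found[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            found[dist] = None
    return found


# Official SDK import path:    from mcp.server.fastmcp import FastMCP
# Standalone library import:   from fastmcp import FastMCP
if __name__ == "__main__":
    print(installed_mcp_distributions())
```

If both come back with a version, the import line at the top of the server file is what decides which implementation you are running.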

Where it goes from here

Warden is deliberately small enough to read in an afternoon, and that's a feature. The pattern scales by swapping the seed company for real connectors and the env-var role for real session auth; the choke point, the structured denials, the governance-aware oracle, and the trace replay all carry over unchanged. If you're building agents that touch data someone actually cares about, steal those four ideas.

warden.alexlaguardia.dev · github.com/AlexlaGuardia/warden
