Claude SDK in Production — The Architecture Decisions We Publish
The exact architecture Routiine runs when embedding Claude into client production systems — caching, tool routing, guardrails, and the gates that keep it safe.
For the technical founder evaluating whether to embed Claude into their product — the specific decisions we make, the specific failure modes we have hit, and the specific reasons we publish this instead of keeping it proprietary.
The Situation
Every month or two, a Dallas founder calls us with a variant of the same request: they want to add AI to their product. The AI is usually Claude. The use case is usually one of four — internal agent (summarize a customer record, draft an email), user-facing assistant (chat in the product), back-office automation (classify an inbound request, route a ticket), or structured data extraction (pull fields out of a document, map to a schema). The founder has already experimented with the API on a side branch. They can get a working prototype in a weekend. They are now trying to get from that weekend prototype to a production feature their customers pay for.
This is the gap where most AI integration projects die. Not at prototype. At production. The gap is not about prompting — the weekend prototype has usually solved the prompting. The gap is about everything else: cost control, latency under load, graceful degradation when the API is slow, safe tool-calling when the agent can modify data, observability when a user reports a bad answer, and the dozen other concerns that separate a demo from a revenue-generating feature.
At Routiine we have shipped more than a dozen Claude-embedded features for Dallas clients in the last twelve months, across industries from auto-glass to medical-devices to professional services. Every deployment uses essentially the same architecture, adapted for the specific product. I want to document that architecture here, as openly as we document the FORGE methodology at /forge and the Living Software doctrine at /living-software. The goal is not to keep these decisions proprietary. The goal is to let technical founders and in-house engineering teams copy them — or decide that copying them is harder than hiring a team that has already made the mistakes.
One note before the detail. The SDK in question is the Anthropic Claude SDK, with Sonnet and Opus 4.7 as the current model versions as of this writing (April 2026). The architecture below is portable to other LLM providers with minor changes, but I will use the Claude vocabulary throughout because it is the one we use daily.
The piece is written at a technical founder's level, not an engineer's level. Where I refer to a specific library or configuration, I am describing the decision, not the implementation. The implementation is in client repositories under the Ownership Transfer that governs every Routiine engagement.
The Problem
The weekend prototype of a Claude integration almost always makes four decisions that fail in production, and the founder does not know they are failures until the failures compound.
The first failure is no caching. The prototype sends the full system prompt on every request. The system prompt contains the agent's persona, the tool definitions, the few-shot examples, and the operational constraints. In aggregate it is often ten to thirty thousand tokens. Without prompt caching, every request pays the full input-token processing cost for that prompt. Under load, the cost becomes prohibitive and the latency becomes unacceptable. Prompt caching, properly configured, reduces token cost on repeat requests by seventy-five to ninety percent and cuts time-to-first-token from seconds to a few hundred milliseconds. It is the first decision we make on every production deployment and the first decision prototypes never make.
The second failure is no tool-call guardrails. The prototype wires up a set of tools — read customer record, update email template, send notification — and lets the model call them freely. In production, this fails in two ways. First, the model occasionally calls the wrong tool in the wrong order and produces an action the user did not intend. Second, if the tool set includes any destructive operation (delete, overwrite, send-to-production), there is no approval layer preventing a hallucinated call from executing. Production deployments require a typed tool schema, an approval queue for any operation with external side effects, and a rollback procedure for each destructive tool. The prototype has none of these.
The third failure is no graceful degradation. The prototype assumes the Claude API is always available, always fast, and always correct. In production, the API occasionally returns errors, occasionally runs slow, and occasionally produces output that does not match the expected schema. The prototype crashes, or returns a blank response, or worse, retries aggressively and compounds the cost. Production deployments need retry policies with exponential backoff, circuit breakers, timeout thresholds, and fallback behaviors — what does the product show when Claude takes thirty seconds to respond? What does it show when Claude returns malformed JSON? Those answers have to be designed, not improvised.
The fourth failure is no observability. The prototype logs nothing. Or it logs to stdout. In production, when a customer reports a bad answer, the team has no trace of what the model received, what it returned, what tools it called, or why it made the decision it did. Reproducing the issue is impossible because model outputs are non-deterministic and the original inputs are gone. Production deployments need structured logging of every request, every response, every tool call, and every user-facing output — with redaction of any PII before it is stored. Without this, debugging is a coin flip.
Those four failures compound. A deployment without caching is expensive. An expensive deployment tempts the team to cut corners on tool guardrails to save on tokens. A deployment without guardrails has more user-reported issues. An observability-blind deployment cannot debug those issues. The feature ships, fails invisibly, and the founder's AI bet turns into a line-item the finance team wants to kill.
The Implication
The cost of a production AI integration that skips these decisions is higher than most founders expect, and it shows up in three separate places.
First, the cost is fiscal. A Claude-powered chat feature running at even modest volume — say, five hundred daily active users averaging ten messages per session — without prompt caching will typically cost three to eight thousand dollars per month in API spend alone. The same feature with caching and intelligent model routing (Haiku for simple responses, Sonnet for standard responses, Opus for complex ones) will run at four hundred to nine hundred dollars per month. The gap, roughly two to seven thousand dollars every month, is what a founder pays for prototype-grade architecture.
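The arithmetic behind those figures is easy to reproduce. The sketch below is a back-of-envelope cost model; every number in it — token prices, prompt size, cache discount — is an illustrative placeholder, not published Anthropic pricing, so substitute your real rates before trusting the output.

```python
# Back-of-envelope monthly API cost model for a chat feature.
# All prices and token counts below are illustrative placeholders.

def monthly_cost(dau, msgs_per_user, prompt_tokens, output_tokens,
                 price_in_per_mtok, price_out_per_mtok,
                 cache_hit_rate=0.0, cached_discount=0.9):
    """Estimate monthly API spend in dollars, assuming a 30-day month."""
    requests = dau * msgs_per_user * 30
    # Cached requests pay only a fraction of the input-token price.
    effective_in = prompt_tokens * (
        (1 - cache_hit_rate) + cache_hit_rate * (1 - cached_discount))
    in_cost = requests * effective_in / 1_000_000 * price_in_per_mtok
    out_cost = requests * output_tokens / 1_000_000 * price_out_per_mtok
    return in_cost + out_cost

# No caching: every request pays for the full 20k-token system prompt.
cold = monthly_cost(500, 10, 20_000, 300, 3.0, 15.0)
# 90% cache hits at a 90% discount on cached input tokens.
warm = monthly_cost(500, 10, 20_000, 300, 3.0, 15.0, cache_hit_rate=0.9)
```

With these placeholder rates, `cold` lands near the top of the uncached range and `warm` cuts input spend by roughly four fifths — before model tiering shrinks it further.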
Second, the cost is reputational. Every user-facing AI feature has a reputational floor set by its worst output. If one percent of users see a hallucination, or a wrong tool call, or a delayed response that feels broken, that is the experience they will describe to other potential users. AI features without guardrails and without graceful degradation generate a higher rate of bad experiences than AI features with them — not because the model is worse but because the surrounding architecture is worse. The damage is diffuse and impossible to reverse.
Third, the cost is strategic. Founders who ship a bad AI integration lose conviction about AI as a product category. They become cautious. They scope down future features. They delay the next integration by two quarters while they decide if AI is "really ready." The irony is that AI is ready; their architecture was not. The Decay Thesis from the Living Software doctrine at /living-software applies in reverse here — under-engineered AI features decay faster than traditional features because the underlying model changes monthly, the APIs evolve quarterly, and any integration that barely worked at launch will not work in a year without active maintenance. The prototype-grade deployment does not decay gracefully. It fails abruptly.
The founders who succeed with AI do the unglamorous architecture work up front. The founders who fail with AI ship the weekend prototype and hope.
The Need-Payoff
Here is the architecture we ship, decision by decision. Every choice is explained. Every choice is portable. Every choice has been exercised in at least two Dallas production deployments.
Decision 1: Prompt caching for the static portion. The system prompt is split into a static half (persona, tool definitions, policy) and a dynamic half (current user context, recent messages, live data). The static half is cached using the Claude SDK's prompt-caching feature with a sensible TTL. The dynamic half is inlined on each request. Cost reduction on repeat calls: seventy-five to ninety percent. Latency on warm calls: typically seven to thirteen times faster than cold.
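The static/dynamic split can be sketched with the Messages API's prompt-caching content blocks. The `cache_control` field is the real API mechanism; the helper name, placeholder model id, and `STATIC_SYSTEM` text are our own illustration.

```python
# Sketch: split the system prompt into a cached static half and an
# inlined dynamic half using Messages API content blocks.

STATIC_SYSTEM = "You are the support agent for Acme. <persona, tool policy, examples...>"

def build_request(user_context: str, messages: list) -> dict:
    """Build Messages API kwargs: cache the static half, inline the rest."""
    return {
        "model": "claude-sonnet-latest",   # placeholder -- pin a real model id
        "max_tokens": 1024,
        "system": [
            # Static half: cached server-side, billed at the discounted
            # cache-read rate on subsequent warm calls.
            {"type": "text", "text": STATIC_SYSTEM,
             "cache_control": {"type": "ephemeral"}},
            # Dynamic half: changes per request, never cached.
            {"type": "text", "text": user_context},
        ],
        "messages": messages,
    }

request = build_request("Current user: Jane, plan: pro",
                        [{"role": "user", "content": "hi"}])
```

In production this dict feeds `client.messages.create(**request)`; keeping the builder separate from the SDK call also makes the payload easy to unit-test.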
Decision 2: Model tiering by task complexity. Not every request needs the largest model. A classification request (is this email spam?) runs on Haiku. A standard conversational response runs on Sonnet. A complex reasoning task (draft a contract revision) runs on Opus. We implement this as a routing layer in front of every Claude call, with the model choice made based on an explicit complexity classification of the input. Typical spend reduction compared to always-Opus: sixty to eighty percent.
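The routing layer reduces to a small mapping plus a classifier. The sketch below uses a toy keyword heuristic and placeholder model ids; a production router uses an explicit complexity rubric or a cheap classification call, not keyword matching.

```python
# Minimal complexity router. Tier names and model ids are illustrative.
TIER_MODELS = {
    "simple":   "claude-haiku-latest",    # classification, routing
    "standard": "claude-sonnet-latest",   # conversational responses
    "complex":  "claude-opus-latest",     # multi-step reasoning, drafting
}

def classify(request_text: str) -> str:
    """Toy heuristic stand-in for an explicit complexity rubric."""
    text = request_text.lower()
    if any(w in text for w in ("draft", "revise", "contract")):
        return "complex"
    if len(text) < 200 and "?" not in text:
        return "simple"
    return "standard"

def route(request_text: str) -> str:
    """Pick the model for a request; sits in front of every Claude call."""
    return TIER_MODELS[classify(request_text)]
```

The important design point is that the choice is made by an explicit, testable function, so tier boundaries can be tuned and audited rather than left implicit in call sites.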
Decision 3: Typed tool schemas with an approval queue. Every tool available to the model is defined with a strict JSON schema. Every destructive tool (any call that modifies data or triggers an external side effect) is wired to an approval queue rather than direct execution. The user sees the proposed action, approves or denies, and only approved actions execute. For high-trust internal automation, the approval queue can be auto-approved with a time delay and a one-click undo. The queue also provides an audit log that satisfies compliance requirements without separate instrumentation.
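The approval-queue pattern can be sketched in a few dozen lines. The class and tool names here are illustrative; the point is the shape — destructive tools are enqueued rather than executed, and every call lands in an audit log.

```python
# Sketch: destructive tool calls go to an approval queue, not execution.
from dataclasses import dataclass, field

DESTRUCTIVE = {"delete_record", "send_email", "overwrite_template"}

@dataclass
class PendingAction:
    tool: str
    args: dict
    approved: bool = False

@dataclass
class ApprovalQueue:
    pending: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def dispatch(self, tool: str, args: dict, execute):
        """Run read-only tools immediately; queue destructive ones."""
        self.audit_log.append((tool, args))     # every call is audited
        if tool in DESTRUCTIVE:
            action = PendingAction(tool, args)
            self.pending.append(action)
            return {"status": "pending_approval", "action": action}
        return {"status": "executed", "result": execute(tool, args)}

    def approve(self, action: PendingAction, execute):
        """User (or delayed auto-approval) confirms; only then execute."""
        action.approved = True
        self.pending.remove(action)
        return execute(action.tool, action.args)
```

A time-delayed auto-approve for internal automation is a small extension: a background worker that calls `approve` after the delay unless the action was cancelled.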
Decision 4: Circuit breaker and fallback. Every Claude call goes through a circuit breaker with timeout, retry count, and exponential backoff. If the circuit trips (too many failures in a rolling window), the product renders a documented fallback — a non-AI path, a cached previous response, or an honest "AI is unavailable, please try again" message. No user-facing feature is ever allowed to hang on a live API call. The fallback is itself tested on every deploy under Quality Gate 3 of the FORGE methodology documented at /forge.
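A minimal version of the breaker looks like this. The sketch omits the retry/backoff half for brevity and injects the clock so the trip-and-recover behavior is testable; thresholds are illustrative defaults.

```python
# Minimal circuit breaker: rolling failure window, cooldown, fallback.
import time
from collections import deque

class CircuitBreaker:
    def __init__(self, max_failures=5, window_s=60.0, cooldown_s=30.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures = deque()
        self.opened_at = None

    def _is_open(self, now):
        if self.opened_at is None:
            return False
        if now - self.opened_at >= self.cooldown_s:
            self.opened_at = None          # half-open: allow one retry
            return False
        return True

    def call(self, fn, fallback, now=None):
        """Run fn, or the documented fallback. Never hang the user."""
        now = time.monotonic() if now is None else now
        if self._is_open(now):
            return fallback()              # fail fast while tripped
        try:
            result = fn()
            self.failures.clear()
            return result
        except Exception:
            self.failures.append(now)
            # Drop failures that fell outside the rolling window.
            while self.failures and now - self.failures[0] > self.window_s:
                self.failures.popleft()
            if len(self.failures) >= self.max_failures:
                self.opened_at = now       # trip the breaker
            return fallback()
```

The `fallback` argument is the designed degradation path from the text: a non-AI code path, a cached response, or an honest unavailability message.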
Decision 5: Structured logging with PII redaction. Every request, response, tool call, and user-facing output is logged to a structured sink (typically a database table or a log aggregation service) with automatic redaction of any fields marked as PII in the schema. The log is retained for ninety days, queryable by request ID, and linked from every user-visible response so that when a user reports a bad output, the team can reconstruct exactly what happened.
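The redaction step can be as simple as a recursive walk over the log record before it reaches the sink. The field names below are illustrative; in practice the set comes from PII markers in the schema.

```python
# Sketch: redact schema-marked PII fields before a log record is stored.
PII_FIELDS = {"email", "phone", "ssn", "name"}   # marked PII in the schema

def redact(record: dict) -> dict:
    """Return a copy of the record safe for the log sink.
    Recurses into nested dicts; leaves other values untouched."""
    clean = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

Redacting before storage, rather than at query time, is what makes the ninety-day retention safe: nothing sensitive ever lands in the sink.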
Decision 6: Output schema validation. If the model is expected to produce JSON or structured output, the output is validated against a strict schema before it is used. Invalid outputs trigger a single retry with a modified prompt that includes the schema error. Still-invalid outputs trigger the fallback path rather than passing malformed data downstream. This prevents the common production failure where a hallucinated field crashes a downstream function.
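The validate-retry-fallback sequence looks like this in miniature. `call_model` is a stand-in for the real SDK call, and the required-field schema is an illustrative example.

```python
# Sketch: validate model output, retry once with the error, then fall back.
import json

REQUIRED = {"customer_id": int, "category": str}   # illustrative schema

def validate(raw: str):
    """Return (parsed, None) on success or (None, error_message)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    for field_name, field_type in REQUIRED.items():
        if not isinstance(data.get(field_name), field_type):
            return None, f"field {field_name!r} must be {field_type.__name__}"
    return data, None

def extract(call_model, prompt: str, fallback):
    """Single retry with the schema error inlined; then give up cleanly."""
    data, err = validate(call_model(prompt))
    if err:
        data, err = validate(call_model(f"{prompt}\nSchema error: {err}"))
    return fallback() if err else data
```

The key property is that malformed output can reach exactly two places — a single corrective retry or the fallback — and never a downstream function.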
Decision 7: Cost ceiling per tenant. Every production deployment has a per-tenant monthly cost ceiling, enforced in the routing layer. A tenant approaching the ceiling gets a warning. A tenant hitting the ceiling either gets degraded service (Haiku-only responses) or a billing prompt, depending on the product model. This prevents the nightmare scenario where a single abusive tenant generates ten thousand dollars of API spend in a weekend.
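The enforcement logic is a small state machine in the routing layer. Thresholds and class names below are illustrative; real spend tracking is persisted per tenant per billing month.

```python
# Sketch: per-tenant monthly budget with warn and degrade thresholds.
class TenantBudget:
    def __init__(self, ceiling_usd: float, warn_at: float = 0.8):
        self.ceiling = ceiling_usd
        self.warn_at = warn_at       # warn at 80% of ceiling by default
        self.spent = 0.0

    def record(self, cost_usd: float) -> None:
        """Called by the routing layer after each priced API response."""
        self.spent += cost_usd

    def status(self) -> str:
        if self.spent >= self.ceiling:
            return "degraded"        # Haiku-only service or billing prompt
        if self.spent >= self.ceiling * self.warn_at:
            return "warning"         # notify the tenant, keep serving
        return "ok"
```

The router checks `status()` before each call, which is what turns a runaway weekend into a capped, visible event instead of a surprise invoice.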
Decision 8: Deterministic evaluation harness. For every AI feature, we build an evaluation harness — a curated set of twenty to two hundred input cases with expected outputs or expected properties. The harness runs on every deploy as part of the test suite (Quality Gate 3). If the AI feature's output quality regresses beyond a threshold, the deploy fails. This catches the class of failures where a prompt change or a model update silently degrades output quality. Without this gate, quality drift is invisible until users complain.
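A property-based version of the gate fits in a few lines. The toy classifier and cases below are illustrative; real harnesses run the twenty to two hundred curated cases described above against the live feature code.

```python
# Sketch: eval gate that fails the deploy when pass rate drops.
def run_evals(feature, cases, threshold=0.95):
    """cases: list of (input, check) pairs, where check is a predicate
    on the feature's output. Returns (pass_rate, passed_gate)."""
    passed = sum(1 for inp, check in cases if check(feature(inp)))
    rate = passed / len(cases)
    return rate, rate >= threshold

# Toy example: two property checks against a trivial classifier.
cases = [
    ("refund request for order 1234", lambda out: out == "billing"),
    ("the app crashes on login",      lambda out: out == "bug"),
]
feature = lambda text: "billing" if "refund" in text else "bug"
rate, ok = run_evals(feature, cases, threshold=1.0)
```

Checking *properties* of the output (correct category, schema validity, no forbidden phrases) rather than exact strings is what makes the harness deterministic even though the model is not.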
Decision 9: Versioned prompts with git provenance. Every prompt is a file in the repository, versioned, code-reviewed, and tagged with the deploy in which it shipped. No prompt is edited in a web UI. No prompt exists only in someone's head. This makes every user-facing response traceable to a specific prompt version, which makes debugging and iteration possible.
Decision 10: Separation of prompt and code. The application code does not contain prompt text. The prompts live in a dedicated module with a clean interface. This allows product managers or domain experts to iterate on prompts without touching application code, and it allows the application to swap prompts without a deploy (under strict review). The separation is documented in the Inheritance Ledger at the Ownership Transfer, so the client can maintain prompts after engagement close.
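The interface between application code and prompt files can be this thin. The directory layout and naming convention here are illustrative, not the implementation in any client repository.

```python
# Sketch: versioned prompt files behind a clean loader interface.
from pathlib import Path

PROMPT_DIR = Path("prompts")   # checked into the repo, code-reviewed

def load_prompt(name: str, version: str, base: Path = PROMPT_DIR,
                **variables) -> str:
    """Load prompts/<name>.<version>.txt and fill its {placeholders}.
    Application code calls this instead of embedding prompt text."""
    text = (base / f"{name}.{version}.txt").read_text()
    return text.format(**variables)
```

Because the version is part of the filename and the file is in git, every user-facing response can be traced to an exact prompt revision, which is the provenance property Decision 9 requires.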
Those ten decisions compose into the default architecture for any Claude-embedded feature we ship at Routiine. Every one of them is applied on every deployment regardless of scale. Every one is exercised in the ten Quality Gates we run on every release. Every one is handed over in the Ownership Transfer so the client owns the full stack after day ninety. Every feature ships under the Ship-or-Pay Guarantee — if we miss the ship date, the client does not pay the final milestone.
The Wise Magician stance on AI architecture is simple: the interesting work is not in the prompt. The interesting work is in making a non-deterministic capability behave reliably in a deterministic product. Any engineer can write a good prompt. The engineering challenge is the ten decisions above. We publish them because we want more AI features in the Dallas market to be correctly architected, not because we want to hoard the knowledge.
Next Steps
Three actions, in order of commitment.
First, read the FORGE methodology at /forge for the full ten Quality Gates that apply to any software release, including AI-embedded features. The gates are identical across product types. An AI feature does not get a free pass on accessibility, performance, or security just because it is AI.
Second, book a free FORGE Audit at /contact. If you are already running a Claude or other LLM integration in production, I will review your architecture against these ten decisions and produce a written gap list. If you are considering a first AI integration, I will tell you honestly whether your use case is production-ready or still prototype-stage, and what it will cost to close the gap.
Third, if you want the architecture described above applied to your product by a team that has shipped it a dozen times, apply to the Founding Client Program at /work. The Program has five slots at twenty percent below our standard rates. AI-embedded features sit within the Launch, Platform, or System tiers depending on scope. All ten Quality Gates apply. The Ship-or-Pay Guarantee applies.
AI in production is not harder than traditional software. It is differently hard. The difficulty lives in the architecture around the model, not in the model itself. Founders who understand this stop hiring AI consultants and start hiring software teams who happen to be fluent in AI. That is the team Routiine is built to be.
Ready to build?
Turn this into a real system for your business. Talk to James — no pitch, just a straight answer.
James Ross Jr.
Founder of Routiine LLC and architect of the FORGE methodology. Building AI-native software for businesses in Dallas-Fort Worth and beyond.