Why You Probably Shouldn't Let Your Model Provider Manage Your Agents

There's a particular kind of decision that feels like an infrastructure choice but is actually a strategic one. The new wave of managed agent runtimes — Anthropic's Claude Managed Agents, AWS Bedrock Agents, OpenAI's Assistants API, and the offerings now arriving from every major model provider — is exactly that decision. Each arrives dressed as a weekend integration project. Each tends to leave as a multi-year architectural commitment you didn't fully negotiate.

This isn't a critique of any one model. The frontier models are excellent. This is about something more consequential: what happens when you hand a model provider control over the reasoning layer of your business operations, and whether the convenience is worth what you're actually trading away.

For the rest of this article, Claude Managed Agents — Anthropic's hosted agent runtime, generally available since April 2026 — is the worked example. The argument generalises to every comparable offering. Pick the named product that matches the procurement decision currently sitting on your desk.

The Orchestration Layer Is Not Commodity Infrastructure

Most vendor lock-in conversations focus on data portability — can you export your data if you leave? That framing misses the deeper problem with managed orchestration.

The orchestration layer is where your business logic lives. It decides how a complex task gets decomposed into agent sub-steps, which agents get called and in what order, what happens when one fails, how retry logic works, and how intermediate results feed into subsequent decisions. In a deterministic system, this logic is in your code. In a multi-agent system, it's in your orchestrator.
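To make "the logic is in your orchestrator" concrete, here is a minimal sketch of what an owned orchestration layer might look like. Every name in it (StepSpec, Orchestrator, the step functions you would pass in) is hypothetical and invented for illustration; it is not any vendor's API. The point is only that step ordering, retry counts, and fallback behaviour are ordinary code you can read, diff, and version.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical types for illustration only.
Step = Callable[[dict], dict]  # a step reads workflow state and returns new fields


@dataclass
class StepSpec:
    name: str
    run: Step
    max_retries: int = 2              # retries allowed after the first attempt
    fallback: Optional[Step] = None   # used only when every attempt fails


@dataclass
class Orchestrator:
    steps: list[StepSpec]
    log: list[str] = field(default_factory=list)

    def execute(self, task: dict) -> dict:
        state = dict(task)
        for spec in self.steps:       # ordering is explicit and lives in your repo
            state = {**state, **self._run_step(spec, state)}
        return state

    def _run_step(self, spec: StepSpec, state: dict) -> dict:
        for attempt in range(1, spec.max_retries + 2):
            try:
                result = spec.run(state)
                self.log.append(f"{spec.name}: ok on attempt {attempt}")
                return result
            except Exception as exc:
                self.log.append(f"{spec.name}: attempt {attempt} failed: {exc}")
        if spec.fallback is None:
            raise RuntimeError(
                f"{spec.name} failed after {spec.max_retries + 1} attempts"
            )
        self.log.append(f"{spec.name}: fallback used")
        return spec.fallback(state)
```

In a managed runtime, the equivalent of max_retries and fallback is typically a platform setting or platform behaviour rather than a line in your own repository.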

When you use a managed agent runtime, the provider runs that layer. The runtime is encoding how your business thinks about its problems — not just processing your requests.

This is categorically different from using AWS to host a database or Stripe to handle payments. Those services handle commoditised mechanics. Your database storage format isn't a competitive advantage. How you decompose a complex underwriting task into agent sub-steps, or how you've structured a customer onboarding workflow to route between specialist agents — that is proprietary. A company that has genuinely figured out a superior agentic decomposition of a high-value workflow has built something. Externalising its execution to a managed runtime is externalising the moat.

The correct mental model: a managed agent runtime isn't like hiring a cloud provider. It's like hiring an external firm to run your product team. They bring their own tooling, their own processes, and their own incentives — and the proprietary work happens inside someone else's process.

The Observability Problem You Can't Retrofit

Here is the failure mode nobody talks about until they're living through it at 2am.

Managed agent platforms have improved significantly on the observability question. Claude Managed Agents now ships with a Console that exposes session-level traces, tool execution details, token usage, and step-by-step replay of what happened inside any given run. This is meaningfully better than the early generation of managed agent products. It is not, however, the same thing as owning your observability layer.

The distinction the industry consistently underestimates is between operational logging and causal observability. Operational logs tell you what happened: agent called tool X at 14:23, returned result Y. Causal observability tells you why: the orchestrator routed to Agent B because Agent A's confidence score fell below threshold 0.7 on the second retry, triggering the fallback pattern, and because the orchestrator's routing weights had been auto-tuned overnight by the runtime's optimisation feedback loop. In deterministic software, logs are sufficient for debugging. In multi-agent systems, where the execution path is dynamic and context-dependent and where the runtime itself may be adjusting routing behaviour, you need access to the causal layer to diagnose, optimise, and audit.
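One way to picture the difference is to look at what a causal record would have to contain. The sketch below is an assumption about how you might structure such a record in an orchestrator you own; the field names, the agent labels, and the config_version string are all hypothetical. An operational log would keep only the chosen agent, while this record also keeps the inputs to the decision.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RoutingDecision:
    """Hypothetical causal record: the outcome plus everything that produced it."""
    step: str
    chosen_agent: str
    reason: str
    confidence: float
    threshold: float
    attempt: int
    config_version: str  # which version of the routing config was in force


def route(confidence: float, attempt: int, threshold: float,
          config_version: str) -> RoutingDecision:
    # The routing rule itself is trivial; the value is in recording its inputs.
    if confidence >= threshold:
        chosen, reason = "agent_a", "confidence at or above threshold"
    else:
        chosen, reason = "agent_b", "confidence below threshold, fallback triggered"
    return RoutingDecision("routing", chosen, reason, confidence,
                           threshold, attempt, config_version)


def emit(decision: RoutingDecision) -> str:
    # Serialise to JSON so the record can go to whatever log pipeline you use.
    return json.dumps(asdict(decision), sort_keys=True)


decision = route(confidence=0.62, attempt=2, threshold=0.7,
                 config_version="routing-v14")
print(emit(decision))
```

With threshold and config_version on every record, a question like "why did this session route differently from last week's?" becomes a query over your own data.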

Managed platforms give you the trace data they choose to expose, in the format they choose, accessible through the tools they provide. That is a fundamentally different posture from owning the layer. You cannot insert custom metrics at the orchestration step. You cannot replay a workflow against modified routing logic. You cannot wire orchestration events into your existing OpenTelemetry stack without rebuilding around the provider's emission format. You cannot diff orchestration behaviour across model versions because the orchestrator is part of the runtime they updated.
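As an illustration of the second capability, replaying a workflow against modified routing logic, here is a rough sketch that assumes you have recorded the routing inputs for past sessions in your own store. The record shape, the outcome labels, and the thresholds are made up for the example.

```python
from typing import Callable

Router = Callable[[float], str]


def make_router(threshold: float) -> Router:
    """Build a routing function for a given confidence threshold."""
    def router(confidence: float) -> str:
        return "auto_decide" if confidence >= threshold else "human_review"
    return router


def replay(recorded: list[dict], router: Router) -> dict[str, str]:
    """Re-run routing over recorded inputs without calling any model."""
    return {r["session_id"]: router(r["confidence"]) for r in recorded}


def diff(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple]:
    """Sessions whose routing outcome differs between two replays."""
    return {sid: (before[sid], after[sid])
            for sid in before if before[sid] != after[sid]}


# Hypothetical recorded inputs from past sessions.
recorded = [
    {"session_id": "s1", "confidence": 0.66},
    {"session_id": "s2", "confidence": 0.72},
    {"session_id": "s3", "confidence": 0.90},
]

changed = diff(replay(recorded, make_router(0.70)),
               replay(recorded, make_router(0.74)))
print(changed)
```

Only the session whose confidence sits between the two thresholds changes outcome, which is exactly the population a small threshold shift affects.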

A B2B fintech ran customer-facing onboarding agents on a managed runtime for four months without incident. In month five, customer support began routing an unusual cluster of complaints — not failures, but applications being declined that previously would have been queued for human review. The team checked the dashboard. Every session showed green. Every tool call had executed successfully. Every routing decision was visible. What wasn't visible was that the runtime's auto-tuned routing thresholds had drifted by 0.04 points over a four-week period in response to upstream signals the team didn't have access to. The cumulative effect was that around 11% of borderline applications were now being declined by the agent layer instead of escalated. The trace data showed the new threshold being applied. It did not show the threshold having changed, when, or in response to what. The team escalated to vendor support. The investigation took eleven days. The fix was a runtime configuration override the team didn't know existed until it was offered to them. By the time it was applied, an estimated 340 applications had been mis-routed. None of those applicants will ever know.

This is the failure mode that makes step-level observability of a layer you control a prerequisite for operating critical agent workflows in production. With operational visibility into someone else's runtime, you can see that a routing decision was made. You cannot see the orchestration-level state change that made the decision different from yesterday's. Without that attribution, your incident response is a symptom report waiting on someone else's diagnosis.
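The missing piece in that incident was a record of the threshold changing, as opposed to a record of the threshold being applied. A minimal sketch of what that could look like in a layer you control follows; the class name, the cause strings, and the dates are hypothetical, and a real system would persist the history rather than hold it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ThresholdChange:
    at: datetime
    old: float
    new: float
    cause: str  # who or what changed it, and why


class AuditedThreshold:
    """A routing threshold that cannot change without leaving a record."""

    def __init__(self, initial: float) -> None:
        self._value = initial
        self.history: list[ThresholdChange] = []

    @property
    def value(self) -> float:
        return self._value

    def set(self, new: float, cause: str, at: Optional[datetime] = None) -> None:
        when = at or datetime.now(timezone.utc)
        self.history.append(ThresholdChange(when, self._value, new, cause))
        self._value = new

    def total_drift(self) -> float:
        """Net change from the first recorded value to the current one."""
        if not self.history:
            return 0.0
        return round(self._value - self.history[0].old, 6)


threshold = AuditedThreshold(0.70)
threshold.set(0.72, "hypothetical auto-tune, week 1",
              at=datetime(2026, 1, 5, tzinfo=timezone.utc))
threshold.set(0.74, "hypothetical auto-tune, week 3",
              at=datetime(2026, 1, 19, tzinfo=timezone.utc))
print(threshold.total_drift(), [c.cause for c in threshold.history])
```

With a history like this, "when did it change, and in response to what?" is answered from your own records instead of a support ticket.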

For internal productivity tooling where agent failures are inconvenient, this is manageable. For revenue-generating automations or customer-facing workflows, this observability gap is an operational liability you've accepted without fully pricing in.

Lock-In Compounds — and the Arithmetic Is Worse Than You Think

The standard warning about vendor lock-in is true but abstract. The practical reality is directional and cumulative: every week you run in production on a managed runtime, migration becomes harder in ways that aren't obvious from the outside.

Consider what actually accumulates during six months of production operation. Your engineers build institutional knowledge of the provider's tool definition syntax. Your prompt engineering optimises for that provider's specific response patterns and quirks. Your session state schema — the data structures that track what's happened across multi-step workflows — is tied to the provider's persistence format. Your monitoring dashboards are built against the provider's telemetry API. Your on-call runbooks reference the provider's incident status page. Your auth flows depend on the provider's credential handling.

None of these are individually catastrophic. Collectively, they mean that after six months, you don't have a migration project — you have an architectural rewrite. And architectural rewrites of production systems generating revenue tend not to get approved, regardless of how compelling the technical argument is.

The harder problem is that providers have demonstrated they will make unilateral changes to things you depend on: model versions, context window pricing, rate limits, API behaviour, and now — with managed runtimes — orchestration semantics. When these changes happen inside your own orchestration layer, you have a configuration change. When they happen inside a managed runtime you don't control, you have a crisis on a timeline set by someone else's roadmap.

There's also a structural incentive problem worth naming directly. Your model provider's commercial interest is maximising your model consumption. Your interest is cost-efficient, model-agnostic workflows. Those interests aren't aligned, and they diverge most sharply in the orchestration layer — where decisions about how many agent steps your workflows take, which tools get called, and how retries are handled are being made. Handing that layer to an entity with a built-in incentive to maximise consumption is a governance problem as much as a technical one. The session-hour pricing layered on top of token pricing in current managed offerings is a quiet preview of how the commercial relationship evolves once the architectural commitment is made.

What You Lose Access To: Multi-Model Architecture

This angle is almost entirely absent from the current conversation around managed agent offerings — and it may be the most consequential.

The emerging architecture of sophisticated agent systems isn't "which model do I use?" It's "which model is optimal for this specific sub-task, at this latency requirement, at this cost point?" Concretely: a research sub-task goes to a high-capability long-context model. A quick classification step goes to a fast, cheap model. A code generation step goes to a code-specialist model. A vision-heavy step goes to a multimodal model. The orchestrator maintains a capability registry and routes each sub-task to the model best suited for it.
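A capability registry of this kind can be quite small. The sketch below is one possible shape, not a reference design: the model names are placeholders rather than real products, and the cost and latency figures are invented purely so the selection logic has something to compare.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str                    # placeholder identifier, not a real model name
    capabilities: frozenset
    cost_per_1k_tokens: float    # invented figure for illustration
    p50_latency_ms: int          # invented figure for illustration


REGISTRY = [
    ModelProfile("long-context-large",
                 frozenset({"research", "classification", "code"}), 0.015, 2500),
    ModelProfile("fast-small", frozenset({"classification"}), 0.0005, 200),
    ModelProfile("code-specialist", frozenset({"code"}), 0.003, 900),
    ModelProfile("multimodal", frozenset({"vision"}), 0.010, 1800),
]


def select_model(capability: str, max_latency_ms: Optional[int] = None,
                 registry: list = REGISTRY) -> ModelProfile:
    """Pick the cheapest model that has the capability and meets the latency cap."""
    candidates = [
        m for m in registry
        if capability in m.capabilities
        and (max_latency_ms is None or m.p50_latency_ms <= max_latency_ms)
    ]
    if not candidates:
        raise LookupError(f"no model for {capability!r} within {max_latency_ms} ms")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)


for task in ("research", "classification", "code", "vision"):
    print(task, "->", select_model(task).name)
```

Choosing the cheapest capable model is only one policy; a production router might also weigh measured quality per task type, which this sketch leaves out.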

This pattern is already in production at organisations running complex mixed workloads. Published benchmarks for intelligent routing report cost reductions in the range of 30 to 85% versus single-model approaches on equivalent tasks, depending on workload composition (RouteLLM, UC Berkeley/Anyscale/Canva, 2024, which reported retaining 95% of GPT-4 quality at a fraction of the cost). A reduction of 40 to 60% is a conservative expectation for typical enterprise mixed workloads, and it arrives with quality improvements on specialist sub-tasks where a generalist model was previously being used out of convenience. Industry surveys show 37% of enterprises now run five or more models in production.

A managed agent runtime forecloses this architecture entirely — or, at best, gates it behind whatever model selection the provider chooses to expose. Everything runs at the provider's pricing, under the provider's latency profile, against the provider's available model catalogue. You're not just locked into one vendor — you're locked into a mono-provider architecture at precisely the moment the industry is discovering that heterogeneous model routing is a genuine performance and cost lever.

The alternative is a clean separation of concerns. Your orchestration layer, owned and version-controlled, handles task decomposition, agent routing, retry logic, and audit logging. Model providers sit behind thin, swappable adapters that implement a unified interface. Agent workers are stateless, independently deployable capability units. Your persistence layer (session state, workflow state, conversation history) lives in your own data model. In this architecture, Anthropic is a model provider. A good one, possibly your primary one, but a provider you can route around during an outage, supplement with specialist models, or replace as the competitive landscape shifts. That's a fundamentally different relationship from infrastructure dependency.
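The "thin, swappable adapters" part of that separation might be sketched as follows. This is an assumption about one reasonable shape, using a Python Protocol as the unified interface. StubAdapter stands in for a real provider SDK call, which is deliberately not shown, and all names here are hypothetical.

```python
from typing import Protocol


class ProviderUnavailable(Exception):
    """Raised by an adapter when its provider cannot serve the request."""


class ModelAdapter(Protocol):
    """The unified interface the orchestrator depends on."""
    name: str

    def complete(self, prompt: str) -> str: ...


class StubAdapter:
    """Stand-in for a real provider adapter; a real one would wrap an SDK call."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def complete(self, prompt: str) -> str:
        if not self.healthy:
            raise ProviderUnavailable(self.name)
        return f"[{self.name}] {prompt}"


def complete_with_failover(adapters: list, prompt: str) -> tuple:
    """Try adapters in priority order; return (adapter name, completion)."""
    failed = []
    for adapter in adapters:
        try:
            return adapter.name, adapter.complete(prompt)
        except ProviderUnavailable:
            failed.append(adapter.name)
    raise ProviderUnavailable("all providers failed: " + ", ".join(failed))


primary = StubAdapter("primary", healthy=False)   # simulate an outage
secondary = StubAdapter("secondary")
print(complete_with_failover([primary, secondary], "summarise this filing"))
```

Because the orchestrator only ever sees the ModelAdapter interface, routing around an outage is a change to a list, not a change to the workflow.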

When a Managed Runtime Is Actually the Right Call

The argument above isn't that managed agent runtimes are never appropriate. It's that the decision is being made on the wrong criteria.

There is a legitimate use case: agent workflows that are internal productivity tools, not core product or revenue-critical. If you're a professional services firm using agents to automate internal research summaries, or a small team running internal document processing, a managed runtime may be entirely appropriate. The operational dependency is manageable, the observability requirement is lower, and the speed-to-deployment benefit is real. The same applies to early-stage prototypes where you want to validate that an agentic approach works at all before investing in proprietary orchestration.

The error isn't using managed orchestration. It's using it for the wrong workflows — specifically, those that are customer-facing, revenue-generating, compliance-sensitive, or architecturally central. The evaluation question isn't "is this faster to ship?" It's "what is the operational and strategic cost if this layer is unavailable, opaque, or changed without notice?"

Most teams making this decision are optimising for the first deployment. They should be optimising for the incident at 2am when an agent workflow is blocking a batch of customer orders, and asking whether they want to be active responders or passive observers at that moment.

What To Do This Week

If you have agent workflows in production or late-stage development on any managed orchestration platform, do one specific thing before you go further: map every place where your business logic has leaked into the vendor's runtime.

That means answering four questions concretely:

Which retry decisions are made by the platform versus your code?
Where is your session state actually stored, and what's the schema?
What would it take to replay a failed workflow from an intermediate checkpoint today — not in theory, but with your current tooling?
Which parts of your monitoring would break if the vendor's telemetry format changed?

That audit will tell you — precisely, not abstractly — how much architectural control you've already ceded and what it would cost to recover it. In most cases, teams that run this exercise find the answer is uncomfortable but still reversible. The teams that skip it find out eighteen months later, when the answer is neither.
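The third question, replaying from an intermediate checkpoint, is the one most easily tested in code. Here is a minimal sketch of what "yes, we can do that today" might look like when workflow state lives in your own store. The in-memory dict stands in for whatever database you actually use, and the step functions are hypothetical.

```python
import json
from typing import Callable

Step = tuple[str, Callable[[dict], dict]]


def run_from(steps: list, state: dict, store: dict, start_at: int = 0) -> dict:
    """Run steps from start_at, checkpointing state after each one."""
    for index in range(start_at, len(steps)):
        _name, fn = steps[index]
        state = fn(state)
        store[index] = json.dumps(state)  # checkpoint in a schema you own
    return state


def resume(steps: list, store: dict, after_step: int) -> dict:
    """Reload the checkpoint written after `after_step` and run the rest."""
    state = json.loads(store[after_step])
    return run_from(steps, state, store, start_at=after_step + 1)


# Hypothetical workflow steps.
def extract(state: dict) -> dict:
    return {**state, "income": 52000}


def score(state: dict) -> dict:
    return {**state, "score": 0.81}


def decide(state: dict) -> dict:
    outcome = "approve" if state["score"] >= 0.8 else "review"
    return {**state, "decision": outcome}


workflow = [("extract", extract), ("score", score), ("decide", decide)]
checkpoints: dict = {}
final = run_from(workflow, {"applicant": "a-1"}, checkpoints)

# Re-run only the last step from the checkpoint written after "score".
replayed = resume(workflow, checkpoints, after_step=1)
print(final == replayed)
```

If the honest answer for your current setup is that the intermediate state exists only inside the vendor's session store, that is the finding the audit is meant to surface.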

Own your orchestration layer. Use the best models for each job. Those are not the same decision.