- 09 Jan, 2026
When Your AI Vendor Fails: What Enterprise Continuity Planning Misses
Enterprise business continuity planning has matured significantly over the past decade. Most large organizations have detailed plans covering data center failure, key supplier disruption, network outages, and critical SaaS dependency. The plans are tested, updated annually, and presented to the board as evidence of operational resilience.

Almost none of them cover AI vendor failure.

This is a gap that is growing faster than organizations are closing it, because AI infrastructure has accumulated as a strategic dependency faster than any previous technology category, and the concentration of that infrastructure in a small number of providers creates a risk profile that most business continuity frameworks weren't designed to address.

The concentration reality

Most enterprise AI programs now run on infrastructure from three to five large providers. The foundational models (the large language models, the vision models, the embedding models) come from a handful of companies. The infrastructure to fine-tune, serve, and monitor those models runs predominantly on two or three cloud platforms. The tooling layer for MLOps, vector databases, and AI observability has its own concentrations.

This isn't a failure of procurement strategy. It reflects the current state of the market: the compute requirements for training large models are only viable at a few companies, and the organizational dependencies compound over time as internal teams build workflows, integrations, and institutional knowledge around specific providers.

The result is that a significant portion of enterprise AI capability now has a concentrated dependency structure that looks nothing like the diversified supply chains those same organizations maintain for physical goods or traditional software.

The failure scenarios that aren't hypothetical

None of the scenarios I'm going to describe are theoretical. All of them have happened to some organization or category of organization in the past few years.
Pricing changes. API pricing for foundational models has changed multiple times since commercial availability began. Organizations that built business cases on a specific cost-per-call structure have seen those economics shift materially. At low volumes this is manageable. At production scale, a significant pricing increase is a P&L event that has to be absorbed or responded to, and the response options are limited when the alternative is rebuilding on a different model.

API deprecation. Model versions get deprecated. The model version that a production system was built on, fine-tuned on, and evaluated against may be removed from availability on a timeline driven by the provider's product roadmap, not the client's operational needs. This forces either a migration under time pressure or an extended period of running on a deprecated version with no support and potential security exposure.

Performance degradation. Foundation model behavior changes with version updates, even when providers describe updates as improvements. A model that performed reliably on a specific task may behave differently after an update that was not communicated as a breaking change. For AI systems that have been through a conformity assessment or regulatory approval process based on specific model behavior, this creates a compliance problem as well as an operational one.

Regulatory shutdown. Regulators in multiple jurisdictions have shown willingness to restrict or suspend AI capabilities. This risk is higher for providers with significant regulatory exposure and for capabilities that sit in legally ambiguous territory.

Acquisition and pivot. The AI infrastructure market is consolidating. A provider that is an independent company today may be acquired by a larger platform company whose strategic priorities don't include the specific capabilities your organization depends on. Post-acquisition product decisions are made by the acquirer.
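
The performance degradation scenario is the hardest of these to notice, so it helps to see what detection might look like. The following is a minimal sketch, not a prescribed method: it assumes the organization keeps a pinned evaluation set with the outputs the approved system is expected to produce, and re-runs it whenever the provider ships an update. The names `EvalCase`, `pass_rate`, and `has_regressed`, the 2-point tolerance, and the two stand-in "model" functions are all hypothetical; a real check would call the provider's API where the stubs are.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # output the approved system is expected to produce


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the model output matches the expected label."""
    if not cases:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for c in cases
        if model(c.prompt).strip().lower() == c.expected.lower()
    )
    return hits / len(cases)


def has_regressed(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """True when the pass rate dropped by more than the tolerance (an assumed threshold)."""
    return (baseline - current) > tolerance


# Hypothetical pinned evaluation set for a ticket-routing task.
CASES = [
    EvalCase("I was charged twice", "billing"),
    EvalCase("Cannot log in", "access"),
    EvalCase("Refund my order", "billing"),
    EvalCase("Reset my password", "access"),
]


def model_v1(prompt: str) -> str:
    # Stand-in for the model version the system was validated against.
    text = prompt.lower()
    return "billing" if ("charged" in text or "refund" in text) else "access"


def model_v2(prompt: str) -> str:
    # Stand-in for an updated version whose behavior shifted on one input type.
    return "billing" if "charged" in prompt.lower() else "access"


baseline = pass_rate(model_v1, CASES)
current = pass_rate(model_v2, CASES)
print(f"baseline={baseline:.2f} current={current:.2f} "
      f"regressed={has_regressed(baseline, current)}")
```

Exact-match scoring only suits tasks with a small set of valid outputs; free-text systems would need a different comparison, which this sketch does not attempt.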
What business continuity plans currently miss

Standard business continuity frameworks are built around the concept of recovery time objective (RTO) and recovery point objective (RPO): how quickly can we restore service, and how much data can we afford to lose? These are the right questions for infrastructure failures: a data center going down, a primary database becoming unavailable.

They're the wrong questions for AI vendor risk, because the failure modes are different in character. A data center failure is abrupt and typically temporary. An AI vendor pricing change is gradual and potentially permanent. A model version deprecation has a defined timeline but requires significant engineering work to respond to. A performance shift may not be immediately detectable and may require revalidation of systems that were previously approved.

Business continuity planning for AI needs to address a different set of scenarios: what does the organization do if this model version is deprecated in 90 days? What does the organization do if this provider's pricing increases by 40%? What does the organization do if a capability we depend on is restricted by a regulator? These scenarios require capability-based responses, not infrastructure failover.

The portability question

The honest answer for most organizations is that they are significantly more locked in than they realize. Fine-tuned models, when the fine-tuning has been done on a provider's infrastructure, may not be portable in a form that can be redeployed elsewhere without significant rework. Embeddings and vector indexes built against one model's embedding space are not compatible with a different model's embedding space without recalculation. Prompts engineered for one model's behavior may produce degraded outputs on a different model without re-optimization. The integration layer (the application code, the data pipelines, the orchestration logic) is generally portable.
The model-specific components often aren't, and those are frequently where the most time and expertise have been invested.

Understanding your actual portability position requires a dependency mapping exercise that most organizations haven't done: for each production AI system, what would it take to migrate to an alternative model, and how long would that take with current team capability?

Contract provisions that actually matter

Standard enterprise software contracts are not adequate for AI vendor relationships. The provisions that should be in place, and often aren't:

SLA definitions for model behavior. Standard SLAs cover uptime and response time. They don't cover model performance consistency. If the provider can change model behavior without notification and without SLA consequence, the client has no contractual protection against the performance degradation scenario.

Data portability rights. Any fine-tuning data, output data, or evaluation data held by the provider should be contractually accessible and exportable. This matters both for migration and for regulatory compliance: organizations may need to produce this data in response to a regulator.

Deprecation notice periods. Minimum notice periods before model version deprecation should be defined contractually. Ninety days is common in SaaS contracts for feature deprecation; AI model deprecation warrants at least the same, and often longer given migration complexity.

Audit rights. The right to audit model behavior, training data provenance (where relevant), and security practices. This matters for regulatory conformity, particularly for organizations subject to the EU AI Act's high-risk system requirements.

Successor model provisions. What obligations does the provider have when a model version is deprecated? Is there a migration support commitment? Are there protections against the replacement model failing to meet the performance specifications of the deprecated version?
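
The dependency mapping question and the deprecation notice provision meet in one comparison: does the contractual notice period cover the time a migration would actually take? Here is a rough sketch of that check, assuming each production system has a team-estimated migration duration on record. The `AISystem` fields, the system names, the model identifiers, and the day counts are illustrative assumptions, not data from any real contract.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AISystem:
    name: str
    model_version: str   # hypothetical identifier for the model depended on
    migration_days: int  # team's own estimate of time to move to an alternative
    notice_days: int     # minimum deprecation notice guaranteed by contract


def notice_shortfall(system: AISystem) -> int:
    """Days by which the estimated migration exceeds the guaranteed notice."""
    return max(0, system.migration_days - system.notice_days)


def latest_safe_start(system: AISystem, retirement: date) -> date:
    """Last date a migration can begin and still finish by the retirement date."""
    return retirement - timedelta(days=system.migration_days)


# Illustrative inventory; every value here is made up for the example.
inventory = [
    AISystem("support-triage", "model-a-v3", migration_days=120, notice_days=90),
    AISystem("doc-search", "embed-b-v1", migration_days=45, notice_days=90),
]

for system in inventory:
    gap = notice_shortfall(system)
    status = f"exposed by {gap} days" if gap else "covered by notice period"
    print(f"{system.name}: {status}")

print(latest_safe_start(inventory[0], date(2026, 6, 30)))
```

A system with a nonzero shortfall has to start migrating before any deprecation announcement is guaranteed to arrive, which is an argument either for a longer notice clause or for reducing the migration estimate in advance.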
The resilient procurement position

No procurement position eliminates AI vendor risk. But the gap between organizations that have thought about this and those that haven't is significant. The elements of a resilient position:

Multi-vendor architecture for critical systems. For AI capabilities that support critical business processes, architectural design should account for vendor substitution. This doesn't mean parallel deployment of everything; it means understanding which systems could be migrated within an acceptable timeframe and what would be required to do it.

Open-weight model capability as insurance. Maintaining the internal capability to run open-weight models (Llama, Mistral, and similar) creates an option that most organizations are currently not exercising. These models may not match proprietary model performance for all use cases, but for some use cases they're adequate, and the ability to fall back to self-hosted capability removes one category of vendor dependency.

Internal capability as a floor. The organization should maintain sufficient internal AI capability that it can evaluate vendor claims, understand what it's using, and make migration decisions based on technical judgment rather than pure vendor dependency. This doesn't require building everything internally. It requires not outsourcing understanding.

Documentation for migration. For each production AI system, maintaining documentation of what the system does, what inputs it requires, what performance benchmarks it meets, and what alternatives were considered creates a foundation for migration planning that doesn't require starting from scratch under pressure.

The business continuity plan that doesn't include AI vendor failure scenarios is a plan with a known gap. The question is whether the gap is closed before or after the scenario that reveals it.
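
For readers who want to see what "architectural design should account for vendor substitution" can mean at the code level, here is one possible shape, offered as a sketch under stated assumptions. Application code depends on a plain callable rather than a vendor SDK, and an ordered list of providers is tried in turn. `complete_with_fallback`, `AllProvidersFailed`, and both stub providers are hypothetical names; real providers would wrap a hosted API and a self-hosted open-weight model.

```python
from typing import Callable, Sequence

# A provider is anything that maps a prompt to a completion (assumed interface).
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in order; return (provider name, output) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def hosted_api(prompt: str) -> str:
    # Stand-in for a hosted model whose version has been withdrawn.
    raise RuntimeError("model version retired")


def self_hosted_model(prompt: str) -> str:
    # Stand-in for a locally served open-weight model.
    return f"[local] {prompt}"


used, output = complete_with_fallback(
    "Summarize the incident report",
    [("hosted", hosted_api), ("self-hosted", self_hosted_model)],
)
print(used, "->", output)
```

The routing is the easy part. As the portability section notes, prompts tuned for one model may degrade on another, so a fallback path only counts as continuity once it has been evaluated against the same benchmarks as the primary.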