Showing Posts From

Enterprise data integration ai

03 Apr, 2026
- Data Engineering

What Fragmented Data Architecture Really Costs an AI Program

When I first joined a large financial services organization to run a data program, one of the first things I did was ask for a map of the integrations. I wanted to see how the systems connected — what fed what, where the authoritative sources of record were, what the data lineage looked like. The answer I got was approximately: nobody has that. That experience has repeated itself in some form in almost every large enterprise I've worked with since. Not because data professionals are negligent, but because complex organizations accumulate integrations over decades, and integrations don't have owners the way systems do. A system has someone responsible for it. An integration is what happens between two systems — and usually falls into the gap between the two teams. Fragmented data architecture isn't an unusual condition in enterprise. It's the default condition. What changes is the cost of living with it. Most operational contexts tolerate it reasonably well. AI programs do not. What fragmented architecture actually looks like Fragmentation in enterprise data isn't usually catastrophic. It doesn't announce itself. It looks like: Customer records in three systems that don't share a primary key, so "customer" means different things depending on which system you're querying. Transaction data that exists in two forms — a processed form used for reporting and a raw form used for operations — that diverge in ways nobody fully understands or has documented. Reference data maintained in spreadsheets owned by individuals who may or may not still be at the company. Timestamps that mean different things across different systems because timezone conventions weren't standardized when the systems were built. None of these problems are visible until you try to do something that requires combining data from multiple sources. Which is precisely what machine learning requires. The discovery problem Data problems in large enterprises surface during AI programs in a way they don't surface during other kinds of programs, because AI programs force a level of data integration and quality scrutiny that most operational systems never require. An operational system can run on data that's partially inconsistent, because a human is in the loop who can spot and compensate for the inconsistency. An ML model can't. A model trained on data where "customer" means different things in different fields will learn patterns that reflect the artifact rather than the underlying reality. The resulting predictions will be wrong in ways that are difficult to diagnose because the data looks fine at the surface level. The discovery problem is that the fragmentation isn't visible until you're already into the program. Data assessments before program start typically reveal the obvious issues: missing fields, obvious duplicates, known quality problems. They don't reveal the subtle semantic inconsistencies, the undocumented conventions, the fields that have been repurposed over time and whose historical values mean something different from their current values. Those problems surface during feature engineering, during model training, during validation — at exactly the point in the program where the team has committed to a use case and a timeline and is under pressure to deliver. Four costs that compound Model quality degradation. The most direct cost. A model trained on inconsistent data learns inconsistent patterns. Performance may look reasonable on test data drawn from the same inconsistent distribution — and then degrade in production when the inconsistency expresses itself differently at inference time, or when the system is asked to make decisions on edge cases the training data didn't represent well. This is the cost that's hardest to attribute to data quality specifically, because model underperformance in production has multiple possible causes. The data quality problem usually contributes for months or years before anyone can confidently identify it as the source. Latency at inference. Fragmented data architecture creates latency problems at inference time because assembling the inputs the model needs requires real-time queries across multiple systems that weren't built to work together at speed. A model that performs well on batch scoring can fail latency requirements in real-time applications because the feature assembly process is joining across five systems, two of which have undocumented rate limits. This problem is invisible in POC development, where features are assembled offline from a flat file. It becomes visible the first time the model is deployed to a real-time endpoint. Auditability failure. Regulatory requirements in financial services, healthcare, and increasingly in other sectors require that AI-driven decisions be explainable — not just at the model level, but at the data level. A decision made by a model trained on data from five systems, where the data lineage hasn't been documented and the transformations applied in feature engineering weren't logged, cannot be explained to a regulator who asks where the data came from. This isn't a hypothetical risk. Regulatory scrutiny of AI decision-making is increasing in most jurisdictions, and the question "show me the data that drove this decision" is one that every enterprise AI program will eventually face. Trust collapse. The most damaging long-term cost isn't technical — it's organizational. When a model's outputs contradict what a business user believes to be true about the underlying data, or when two models trained on data from different parts of a fragmented architecture produce contradictory outputs, the business loses confidence in AI outputs generally. Trust is hard to rebuild. An organization that has watched AI models disagree with each other or with the known facts becomes deeply skeptical of AI investment. The business case for the next program starts from a deficit before it's been written. The workaround trap The engineering response to fragmented data is usually feature engineering workarounds: bridge tables, deduplication logic, reference data lookups, semantic normalization applied in the feature construction layer. These work up to a point. The problem with workarounds is that they're expensive to build, expensive to maintain, and opaque to everyone who didn't build them. A feature pipeline that applies a complex normalization to reconcile two inconsistent reference datasets is technical debt that the model inherits. When the underlying data changes — which it will — the workaround may silently break in ways that degrade model performance without triggering an error. Solving data quality at the data layer is categorically different from solving it in the feature engineering layer, because the fix at the data layer benefits every downstream system that uses the data. The fix in the feature engineering layer benefits one model and creates another dependency that has to be maintained indefinitely. Teams that choose workarounds because they're faster to implement are right about the speed and wrong about the cost. What a minimum viable data architecture for AI looks like I'm not going to argue for a complete data architecture overhaul as a precondition for AI investment — that's a multi-year program with a cost that rarely gets approved and a timeline that creates its own political problems. What I will argue for is a minimum viable data architecture assessment before committing to a production AI program. That assessment covers four questions: What is the authoritative source of record for each data entity the model will use? What are the known quality issues in those sources, and what are the implications for model performance? What is the data lineage — from source through transformation to training — and is it documented? What are the regulatory and compliance requirements for the data, and are they compatible with how the model will use it? If the answers to those questions reveal problems that can't be addressed within the program budget and timeline, the AI program isn't ready for production. That's not a failure of the AI program — it's useful information about what needs to happen first. The cost of discovering that in a data assessment is small. The cost of discovering it eighteen months into a production deployment is not.

Read full article