06 Mar, 2026
- Data Engineering

The Data Quality Problem Nobody Puts in the Deck

Early in a data program at a large insurer, I asked the head of the business data office what percentage of their claims data they considered clean. She said, without hesitation, about 80 percent. When we actually looked (ran completeness checks, consistency validations, and temporal analysis, and cross-referenced against the systems of record), the number was closer to 40 percent. And that was by a generous definition of "clean."

The 40-point gap was not the result of negligence. It was the result of the way enterprise data accumulates: through system migrations that didn't fully reconcile historical records, through form field changes that made old values semantically incompatible with new ones, through operational shortcuts that were rational at the time and invisible until someone tried to use the data systematically.

This pattern, a significant gap between what the business believes about its data and what the data actually contains, is present in virtually every large enterprise I've worked with. What changes is how much it matters. In most operational contexts, it doesn't matter much. In AI programs, it matters enormously.

Why the gap doesn't surface until you try to train

Data discovery processes reveal the problems that can be found by looking at data directly: missing values, obvious format inconsistencies, clearly duplicated records. What they don't reveal are the problems that only become visible when you try to do something with the data.

Semantic inconsistency is one example. A "claim status" field that takes values of "open," "pending," "in review," and "active" might look fine in discovery. The problem emerges when you try to build a model that predicts claim duration and discover that "pending" meant two different things before and after a system migration five years ago.
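
One way to catch this kind of break before modeling is to compare how often each value of a categorical field appears on either side of a known migration date. The sketch below is illustrative only: the field names (`claim_status`, `opened`), the cutover date, the toy records, and the 10-point threshold are all assumptions, not details from the program described here. A large shift does not prove that a value changed meaning; it only marks a field worth raising with the people who used it.

```python
from collections import Counter
from datetime import date

def value_shares(records, field):
    """Return each value's share of the records for one categorical field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

def share_shifts(records, field, date_field, cutover, threshold=0.10):
    """Flag values whose share moved by more than `threshold` across `cutover`."""
    before = value_shares([r for r in records if r[date_field] < cutover], field)
    after = value_shares([r for r in records if r[date_field] >= cutover], field)
    if not before or not after:
        return {}  # nothing to compare if one side of the cutover is empty
    shifts = {}
    for value in sorted(set(before) | set(after)):
        delta = after.get(value, 0.0) - before.get(value, 0.0)
        if abs(delta) > threshold:
            shifts[value] = round(delta, 2)
    return shifts

# Hypothetical toy records; field names and the cutover date are assumptions.
claims = [
    {"opened": date(2019, 3, 1), "claim_status": "pending"},
    {"opened": date(2019, 6, 1), "claim_status": "pending"},
    {"opened": date(2019, 9, 1), "claim_status": "open"},
    {"opened": date(2020, 1, 1), "claim_status": "open"},
    {"opened": date(2022, 2, 1), "claim_status": "in review"},
    {"opened": date(2022, 5, 1), "claim_status": "in review"},
    {"opened": date(2022, 8, 1), "claim_status": "pending"},
    {"opened": date(2023, 1, 1), "claim_status": "open"},
]
shifts = share_shifts(claims, "claim_status", "opened", cutover=date(2021, 1, 1))
print(shifts)  # {'in review': 0.5, 'open': -0.25, 'pending': -0.25}
```

In this toy data, "in review" appears only after the cutover while "pending" and "open" both shrink, which is the sort of pattern that would justify asking whether "pending" used to cover cases that are now labeled differently.
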
The model learns from the historical pattern and produces predictions that are systematically off for a segment of claims because the label meant something different during the training window.

Temporal invalidity is another. Features constructed from historical data often embed assumptions about time that are violated in ways that aren't visible until you start building features. A "days since last contact" feature that looks like a useful signal turns out to encode data entry behavior rather than customer behavior: the field was populated differently in different branches, and the differences correlate with branch-level outcomes rather than customer-level ones.

These problems don't show up in a data profile. They show up in model validation, in production performance anomalies, and in the kinds of questions that domain experts ask when they look at model outputs that seem technically sound but operationally wrong.

The four failure patterns

Completeness gaps. Missing data is the most visible quality problem and usually the best understood. But completeness is less binary than it appears. A field that's 95% complete might have its 5% missingness concentrated in the segment of the data the model most needs to reason about: a specific customer segment, a specific time period, a specific geography. Aggregate completeness metrics hide distributional missingness that can create systematic model errors invisible until production.

Consistency failures. Data that means different things in different records, or that's been encoded differently across systems, is the failure pattern that's hardest to detect and most dangerous for model training. Consistency failures are common at integration points, where data from one system is loaded into another, and at migration boundaries, where historical records under an old schema are mapped to fields in a new schema.
The mapping logic that seemed sensible at migration time often introduces subtle distortions that aren't documented and don't announce themselves.

Temporal drift. The relationship between data and the world changes over time. Customer behavior changes, market conditions change, business rules change. A model trained on data from three years ago has learned from a world that no longer exists in important ways. This isn't a data quality problem in the traditional sense (the data accurately reflects what was true at the time), but it creates a model that doesn't reflect current reality. Temporal drift is the most common reason AI models underperform in production relative to testing, and it's consistently underweighted in most data quality assessments.

Labeling errors. For supervised learning problems, the quality of the labels, the ground truth the model is learning from, determines a ceiling on model quality that no algorithm can overcome. Label quality is frequently taken for granted in enterprise AI programs because the labels come from an existing operational system that the business trusts. But operational labels are a product of the processes that generated them, and those processes have their own inconsistencies. Claims classified as fraudulent by one review team using one set of criteria, then reclassified by another team using updated criteria, produce a label set that encodes inconsistency as signal. The model learns from that inconsistency and reproduces it at scale.

How it kills AI ROI

The mechanism by which data quality degrades AI ROI isn't usually a catastrophic failure. It's a gradual tax on every part of the program. Model performance caps are lower than they should be, which means the business case is harder to achieve. The team spends more time on data remediation than on model development, which means the delivery timeline extends.
Retraining cycles are more expensive because the data pipeline that feeds them is brittle, which means the operational cost is higher than projected. And when the business starts to see outputs that seem wrong, where the model disagrees with what an experienced practitioner would say, trust erodes in ways that are very difficult to reverse.

The cumulative effect is hard to quantify precisely, but a program that expected an eighteen-month path to production value and took thirty months instead, with model performance fifteen points below the initial projection, is not unusual when data quality problems were underassessed at the start.

The feature engineering temptation

The engineering response to data quality problems is usually feature engineering workarounds: bridge tables, deduplication logic, reference data lookups, semantic normalization applied in the feature construction layer. These work. I've used them. But they're a form of debt that compounds.

A feature pipeline that applies complex normalization to reconcile inconsistent reference datasets has to be maintained for as long as the model runs. When the underlying data changes (and it will), the workaround may silently break, degrading model performance without triggering an error. And every new model that uses the same data inherits the same problem independently, solving it in its own way, creating a portfolio of different workarounds for the same underlying issue.

Feature engineering can bridge a data quality gap temporarily while the underlying issue is being addressed. It's not a substitute for addressing the underlying issue. The difference matters when you're building the fourth model on the same data foundation that the first three models already patched around.

What a real data quality assessment looks like

The most useful question to ask before starting an AI program is not "do we have the data?" Almost every enterprise has data.
The question is "is the data we have sufficient to support the model we need to build?"

A data quality assessment designed around that question covers: the completeness and consistency of each field the model will use, the temporal validity of the training window, the label quality for the target variable, and the regulatory and compliance status of the data being used for training. For a focused use case, this takes two to four weeks. It produces a realistic view of what remediation work is needed before model training can begin, what workarounds are viable in the short term, and where the data gaps are severe enough to require a different use case or a longer pre-program phase.

That view has a cost: it may reveal that the program timeline needs to move right, or that a use case needs to change. But it's a cost paid once, upfront, with full information. The alternative is paying a larger cost spread across months of rework, performance shortfalls, and stakeholder trust that's harder to rebuild than it was to lose.

The data quality problem almost always goes into the deck eventually. The question is whether it goes in at the start, when there's still time to do something about it, or at month fourteen, when everyone is looking for someone to blame.
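
For the completeness part of such an assessment, a minimal sketch of a per-segment check might look like the following. It reproduces the earlier example of a field that is 95% complete overall but has all of its missing values in one segment. The field names (`adjuster_id`, `region`) and the toy records are hypothetical, chosen only to make the arithmetic visible.

```python
from collections import defaultdict

def missing_rate_by_segment(records, field, segment_field):
    """Return the share of records with a missing `field`, per segment."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for r in records:
        seg = r[segment_field]
        totals[seg] += 1
        if r.get(field) in (None, ""):
            missing[seg] += 1
    return {seg: missing[seg] / totals[seg] for seg in totals}

# Hypothetical toy data: 100 records, 5 missing values, all in one region.
records = (
    [{"region": "north", "adjuster_id": "A1"} for _ in range(60)]
    + [{"region": "south", "adjuster_id": "A2"} for _ in range(30)]
    + [{"region": "west", "adjuster_id": "A3"} for _ in range(5)]
    + [{"region": "west", "adjuster_id": None} for _ in range(5)]
)

overall = sum(r["adjuster_id"] is None for r in records) / len(records)
by_region = missing_rate_by_segment(records, "adjuster_id", "region")

print(overall)    # 0.05
print(by_region)  # {'north': 0.0, 'south': 0.0, 'west': 0.5}
```

The aggregate figure says the field is 95% complete; the per-segment view says half of one region is missing. Which segments to slice by is a judgment call that depends on what the model needs to reason about.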
