Showing Posts From

Data engineering

05 Jun, 2026
- Data Engineering

How to Classify Your Data Before Your AI Program Does It for You

Data classification is one of those governance practices that most organizations have in some form and almost none have in a form that is adequate for AI. The gap matters because AI deployment without a working classification framework creates a specific category of problem: the system treats all accessible data as equivalent input, and the outputs reflect that indiscriminateness in ways that are difficult to predict and costly to remediate after the fact. The CIO who gets this right before the AI program starts is in a very different position from the one who inherits a classification gap when the first incident surfaces. Here is what a practical classification approach looks like when AI deployment is the specific forcing function. Why existing classification frameworks usually fall short Most organizations have some form of data classification. The typical structure is a four-level hierarchy: public, internal, confidential, and restricted. Documents get tagged — or are supposed to get tagged — at one of these levels. Access controls are set accordingly. This framework was designed for a world where humans navigate information deliberately. You look for a document, you find it, you read it. The sensitivity of what you see is a function of where you went to look. AI tools do not navigate information that way. They can process everything they have access to simultaneously, surface connections between data sources that were never designed to be combined, and produce outputs that reflect the aggregate of what they have seen rather than any single document. The sensitivity classification of individual documents does not translate cleanly into the sensitivity of an AI system's outputs. There are three specific failure modes I see in organizations that try to apply existing classification frameworks to AI deployment. Permission-level accuracy. 
Existing classification may reflect the intention of who should access what, but actual permissions often diverge from the classification framework over time. Documents move between folders. Projects end and access is not revoked. Distribution lists grow and are not pruned. When an AI system is given access to everything a user can access, it inherits this divergence between intended and actual permissions. Output sensitivity. A document classified as "internal" might, in combination with five other documents also classified as "internal," produce an AI output that reveals information that would have been classified "confidential" if anyone had written it down directly. The classification framework addresses individual document sensitivity but not the sensitivity of AI-generated synthesis. Dynamic content. AI systems that connect to live data sources — CRMs, financial systems, email archives — encounter content that has never been classified at all, because classification was designed for documents rather than data records. Building a classification framework for AI specifically A classification framework that works for AI deployment needs to answer three questions that the standard framework typically does not. What can this data type be used for in AI context? Rather than a single sensitivity level, each data category needs a set of permitted AI use cases. Client financial data might be appropriate for internal analytics AI but not for a tool that produces externally shared outputs. Personal data might be appropriate for a tool with data processing agreement coverage but not for one without it. The permitted use case dimension is specific to AI and does not exist in traditional classification frameworks. What combinations create elevated sensitivity? Certain combinations of data categories produce outputs that are more sensitive than any individual category. 
A practical classification framework for AI should identify the high-risk combinations and set explicit controls around AI systems that can access both. What is the real-time classification status? For live data sources, the classification question is not just "what is this data type" but "what is the current state of this specific record, and does that affect what AI can do with it." A client record that includes active litigation flags, for example, may need to be treated differently than a standard client record even if the data type is classified the same way. The practical approach Doing this well does not require a multi-year data governance program. It requires a focused exercise tied directly to the AI deployment timeline. Here is what that looks like. Start with the AI system's data access scope. Before classifying anything, define what data sources the AI system will be connected to. The classification exercise is scoped to those sources. Everything else can wait. Map the sensitive data categories within scope. For each data source the AI will access, identify what sensitive categories exist: personal data, commercially sensitive data, legally privileged material, client confidential data, regulated financial data. This is an inventory exercise, and it usually reveals data in places people did not expect it. Define permitted use cases for each category. For each sensitive category, specify what the AI system is and is not permitted to do with it. This becomes the basis for technical controls — what data the system can retrieve, what it can include in outputs, and what it should exclude or flag. Build the combination rules. Identify the high-risk combinations and set rules for how the AI system handles them. This is the hardest part and the one most often skipped. Spending a day on this with the CIO, the data protection officer, and the AI system owner is worth it. Implement classification tags as technical controls. 
The classification decisions need to be expressed as technical constraints that the AI system respects, not just as policy documentation. A policy that says "the AI should not include client financial data in externally visible outputs" is unenforceable unless the system is technically configured to prevent it.

The CIO's role in making this work

Data classification for AI is not a project the technical team can own independently. The decisions about which data categories can be used for which AI purposes require input from legal, compliance, and the business functions that own the data. The CIO's role is to convene those conversations and drive them to decisions before the AI system goes live, not after.

The alternative (deploying the AI system and addressing classification issues as they surface) is more expensive and more disruptive. When an AI system produces an output that reveals information it should not have had access to, the response involves technical remediation, incident investigation, potential regulatory notification, and organizational credibility damage. All of these are harder than running the classification exercise before deployment.

The time required for a focused data classification exercise scoped to a specific AI deployment is typically two to four weeks for a system with well-defined data access scope. That is a reasonable investment given the alternative.

What to take from this

- Existing data classification frameworks were designed for human navigation of information. They do not translate directly to AI access, which aggregates and synthesizes rather than navigates.
- Classification for AI needs to address permitted use cases, high-risk combinations, and live data: three dimensions that standard frameworks typically do not cover.
- Scope the classification exercise to the AI system's data access, not the organization's entire data estate. A focused exercise is achievable in weeks; an organization-wide program is not.
- Classification decisions need to be expressed as technical constraints, not just policy documentation. A policy without technical enforcement is not a control.
- The CIO needs to convene legal, compliance, and business data owners in the classification exercise. The decisions require input from all of them, and making them without that input produces gaps.

The organizations that get AI deployment right are not the ones with the most comprehensive data governance programs. They are the ones that did the focused, practical work of understanding their data before they connected it to a model, and made deliberate decisions about what that meant for acceptable use.
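To make the idea of classification as a technical constraint concrete, here is a minimal Python sketch of a pre-retrieval policy check. It is an illustration under stated assumptions, not a reference implementation: the category names, use-case names, and the `check_request` function are all hypothetical, and treating a high-risk combination as an outright block is one possible control (an organization might route such requests to review instead).

```python
# Hypothetical data categories and permitted AI use cases, for illustration only.
PERMITTED_USES = {
    "client_financial": {"internal_analytics"},
    "personal": {"internal_analytics", "dpa_covered_tool"},
    "public": {"internal_analytics", "dpa_covered_tool", "external_output"},
}

# Assumed example of categories that are more sensitive together than apart.
HIGH_RISK_COMBINATIONS = [
    frozenset({"client_financial", "personal"}),
]


def check_request(categories, use_case):
    """Return (allowed, reasons) for an AI request that touches these categories."""
    reasons = []
    for category in sorted(categories):
        allowed_uses = PERMITTED_USES.get(category)
        if allowed_uses is None:
            # Data that was never classified is denied rather than assumed safe.
            reasons.append(f"{category}: unclassified, denied by default")
        elif use_case not in allowed_uses:
            reasons.append(f"{category}: not permitted for {use_case}")
    for combo in HIGH_RISK_COMBINATIONS:
        if combo <= set(categories):
            reasons.append("high-risk combination: " + " + ".join(sorted(combo)))
    return (not reasons, reasons)


allowed, reasons = check_request({"client_financial"}, "external_output")
print(allowed, reasons)
```

The deny-by-default branch reflects the live-data problem: records that were never classified should not silently pass just because nobody tagged them.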


05 May, 2026
- Data Engineering

What AI Actually Requires From Your Data Infrastructure to Scale

The AI program has been approved, the vendor is selected, the team is assembled, and then somebody runs the data assessment. What they find — inconsistent data models, missing labels, fragmented systems, unclear ownership — is the same thing found in most enterprises when they look carefully for the first time. The program does not fail at this point. It slows down, the timeline gets revised, the initial scope gets reduced, and the business expectations that were set in the approval process do not get met on schedule. This is the pattern I see most often when AI programs run into trouble, and it is almost always traced back to the same root cause: the data infrastructure requirements for AI at scale were not understood when the program was designed. The CIO who understands these requirements upfront can either design the program around them or secure the investment to address them. Either path is better than discovering the gap during delivery. The data requirements that tend to be underestimated Data availability and accessibility. AI systems need data at query time or training time, and they need it in a form they can process. In most enterprises, relevant data lives across multiple systems — a CRM, an ERP, a data warehouse, a collection of flat files, some APIs — with different schemas, different access mechanisms, and different freshness characteristics. The work of making that data accessible to an AI system is infrastructure work, not AI work, and it is consistently underestimated. The practical implication: before committing to an AI delivery timeline, map which data sources the system will need access to, what the access mechanism is for each, and whether a data integration layer needs to be built or updated. This often takes months and is not typically included in AI vendor timelines. Data quality at the point of use. AI systems amplify data quality problems. 
A system trained on or retrieval-indexed against inaccurate, incomplete, or inconsistent data will produce confident-sounding outputs that reflect those problems. The model has no way to know that the customer record is outdated or that the product data is inconsistently formatted across systems. The organizations I see struggle with AI quality most reliably are the ones that treated data quality as a pre-existing solved problem when it was not. Data quality issues that were manageable in human-reviewed processes become highly visible when AI processes the same data and produces outputs that expose the inconsistencies. Labeling and structure for training use cases. For AI applications that require model training — not just retrieval-augmented generation — the training data needs to be labeled in a way that reflects what the model is supposed to learn. In most enterprises, the historical data that would be most useful for training is not labeled for the relevant task, was not structured with model training in mind, and requires significant preparation work before it is ready. An AI use case that requires supervised training on historical data — a classification system, a predictive model, an automated decision support tool — implicitly requires a data labeling exercise. This is often not scoped, not budgeted, and not understood by the business stakeholders who approved the use case. Data freshness and pipeline reliability. AI systems that operate on live or recent data need data pipelines that deliver data at the required latency and with acceptable reliability. In many enterprise data environments, the pipelines that move data from operational systems to analytical environments run on batch schedules that are inconsistent with the freshness requirements of an AI application that is supposed to support real-time decisions. 
Building or upgrading data pipelines to support AI freshness requirements is infrastructure investment that is separate from the AI system itself. It tends not to appear in AI project budgets. The governance requirements that get missed Data infrastructure for AI is not just technical. It has a governance layer that the CIO needs to own before the AI program runs into it. Data ownership and authority. AI systems require someone to decide what data they can access, what they can do with it, and who can change those parameters. In most enterprises, data ownership is unclear — data exists in systems owned by IT but created and maintained by business functions, with no single party who has clear authority to approve AI system access. The AI program surfaces this ambiguity in a way that other programs did not. Data lineage for AI outputs. When an AI system produces an output, the ability to trace that output back to the source data matters both for debugging and for regulatory purposes. This requires data lineage tooling and practices that most organizations have not prioritized, because the use cases that required them previously were narrower. Access controls at the data level. The access control requirements for AI systems are different from those for human users. An AI system that processes data on behalf of many users needs access controls that reflect what each user should be able to see, applied dynamically at the time the system generates outputs. Most data infrastructure was not designed for this pattern. What the CIO needs to establish before the program starts The work that makes AI programs succeed from a data perspective is not done by the AI team. It is done by the data engineering and infrastructure function, working from a clear set of requirements before the AI program timeline is set. Specifically: Run a data infrastructure assessment scoped to the AI program's requirements. 
This assessment should identify what data the AI system needs, what state that data is in, what gaps exist, and what work is required to close them. The assessment output should feed directly into the program plan.

Define data ownership for AI access before the program enters delivery. The conversations about which data the AI system can access are harder to have mid-delivery than pre-delivery. Get the governance decisions made before the program is scheduled around them.

Include data pipeline and infrastructure work in the program budget and timeline. This work is frequently treated as a prerequisite that will be addressed separately, which means it is not resourced and becomes a blocker. It needs to be inside the program.

Set data quality thresholds explicitly. What level of completeness, consistency, and accuracy is required for the AI system to produce reliable outputs? These thresholds should be defined and measured before the system goes live, not after the first quality issue surfaces in production.

What to take from this

- Data infrastructure gaps are the most common reason AI programs miss their original timelines. Run a data infrastructure assessment as part of program planning, not as a separate track.
- AI systems amplify data quality problems. Assess the quality of the data the system will use before committing to performance targets.
- Training use cases with labeled data requirements are underestimated consistently. If the use case requires model training, scope the labeling work explicitly.
- Data freshness requirements for live AI applications often exceed what existing batch pipelines can deliver. Build or upgrade the data pipelines as part of the AI program.
- Data ownership and governance for AI access need to be resolved before delivery starts. These decisions are harder to make under delivery pressure than during planning.

The AI programs I have seen deliver on their original timelines shared a common characteristic: the CIO ran the data assessment early, understood the infrastructure gaps, and either adjusted the program plan or secured the investment to address them. The ones that struggled did not.
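Setting data quality thresholds explicitly can be as simple as a gate that measures the data and compares it to agreed numbers before go-live. The sketch below is a minimal illustration in plain Python. The threshold values, the field names (`email`, `status`), and the allowed status values are assumptions invented for the example; real thresholds would come from the program's own requirements.

```python
# Hypothetical thresholds, chosen only to make the example concrete.
THRESHOLDS = {"completeness": 0.95, "consistency": 0.98}

# Assumed set of valid values for the example "status" field.
ALLOWED_STATUS = {"active", "closed"}


def completeness(records, field):
    """Share of records where the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def consistency(records, field, allowed):
    """Share of populated values that belong to the allowed set."""
    present = [r[field] for r in records if r.get(field) not in (None, "")]
    if not present:
        return 0.0
    return sum(1 for value in present if value in allowed) / len(present)


def quality_gate(records):
    """Measure the data and return (measured, failures) against THRESHOLDS."""
    measured = {
        "completeness": completeness(records, "email"),
        "consistency": consistency(records, "status", ALLOWED_STATUS),
    }
    failures = {k: v for k, v in measured.items() if v < THRESHOLDS[k]}
    return measured, failures


records = [
    {"email": "a@example.com", "status": "active"},
    {"email": "", "status": "closed"},
    {"email": "b@example.com", "status": "Active"},  # inconsistent casing
    {"email": "c@example.com", "status": "closed"},
]
measured, failures = quality_gate(records)
print(measured, failures)
```

The point of the gate is less the arithmetic than the timing: the numbers are agreed and measured before the system goes live, so a failure is a planning input rather than a production incident.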


03 Apr, 2026
- Data Engineering

What Fragmented Data Architecture Really Costs an AI Program

When I first joined a large financial services organization to run a data program, one of the first things I did was ask for a map of the integrations. I wanted to see how the systems connected: what fed what, where the authoritative sources of record were, what the data lineage looked like. The answer I got was approximately: nobody has that.

That experience has repeated itself in some form in almost every large enterprise I've worked with since. Not because data professionals are negligent, but because complex organizations accumulate integrations over decades, and integrations don't have owners the way systems do. A system has someone responsible for it. An integration is what happens between two systems, and usually falls into the gap between the two teams.

Fragmented data architecture isn't an unusual condition in enterprise. It's the default condition. What changes is the cost of living with it. Most operational contexts tolerate it reasonably well. AI programs do not.

What fragmented architecture actually looks like

Fragmentation in enterprise data isn't usually catastrophic. It doesn't announce itself. It looks like:

- Customer records in three systems that don't share a primary key, so "customer" means different things depending on which system you're querying.
- Transaction data that exists in two forms (a processed form used for reporting and a raw form used for operations) that diverge in ways nobody fully understands or has documented.
- Reference data maintained in spreadsheets owned by individuals who may or may not still be at the company.
- Timestamps that mean different things across different systems because timezone conventions weren't standardized when the systems were built.

None of these problems are visible until you try to do something that requires combining data from multiple sources. Which is precisely what machine learning requires.
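The timestamp problem described above is easy to show in a few lines. This is a minimal sketch using only the Python standard library; the two system names and their timezone conventions are hypothetical, and a fixed UTC offset stands in for a real timezone database to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

# Assumed conventions for two hypothetical systems: one stores UTC,
# the other stores local time at a fixed UTC-5 offset. Neither records which.
SYSTEM_TZ = {
    "billing": timezone.utc,
    "branch_ops": timezone(timedelta(hours=-5)),
}


def to_utc(naive_timestamp, system):
    """Attach the system's assumed convention, then convert to UTC."""
    local = naive_timestamp.replace(tzinfo=SYSTEM_TZ[system])
    return local.astimezone(timezone.utc)


# The same real-world event, as each system stored it.
billing_event = datetime(2026, 3, 1, 2, 30)
branch_event = datetime(2026, 2, 28, 21, 30)

# Joined naively, the two records look five hours (and one calendar day) apart.
naive_gap = billing_event - branch_event

# Once each convention is applied, they are the same instant.
true_gap = to_utc(billing_event, "billing") - to_utc(branch_event, "branch_ops")

print(naive_gap, true_gap)
```

Note that the naive values also fall on different calendar dates, so any feature built on daily aggregation would count this one event on two different days.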
The discovery problem Data problems in large enterprises surface during AI programs in a way they don't surface during other kinds of programs, because AI programs force a level of data integration and quality scrutiny that most operational systems never require. An operational system can run on data that's partially inconsistent, because a human is in the loop who can spot and compensate for the inconsistency. An ML model can't. A model trained on data where "customer" means different things in different fields will learn patterns that reflect the artifact rather than the underlying reality. The resulting predictions will be wrong in ways that are difficult to diagnose because the data looks fine at the surface level. The discovery problem is that the fragmentation isn't visible until you're already into the program. Data assessments before program start typically reveal the obvious issues: missing fields, obvious duplicates, known quality problems. They don't reveal the subtle semantic inconsistencies, the undocumented conventions, the fields that have been repurposed over time and whose historical values mean something different from their current values. Those problems surface during feature engineering, during model training, during validation — at exactly the point in the program where the team has committed to a use case and a timeline and is under pressure to deliver. Four costs that compound Model quality degradation. The most direct cost. A model trained on inconsistent data learns inconsistent patterns. Performance may look reasonable on test data drawn from the same inconsistent distribution — and then degrade in production when the inconsistency expresses itself differently at inference time, or when the system is asked to make decisions on edge cases the training data didn't represent well. This is the cost that's hardest to attribute to data quality specifically, because model underperformance in production has multiple possible causes. 
The data quality problem usually contributes for months or years before anyone can confidently identify it as the source. Latency at inference. Fragmented data architecture creates latency problems at inference time because assembling the inputs the model needs requires real-time queries across multiple systems that weren't built to work together at speed. A model that performs well on batch scoring can fail latency requirements in real-time applications because the feature assembly process is joining across five systems, two of which have undocumented rate limits. This problem is invisible in POC development, where features are assembled offline from a flat file. It becomes visible the first time the model is deployed to a real-time endpoint. Auditability failure. Regulatory requirements in financial services, healthcare, and increasingly in other sectors require that AI-driven decisions be explainable — not just at the model level, but at the data level. A decision made by a model trained on data from five systems, where the data lineage hasn't been documented and the transformations applied in feature engineering weren't logged, cannot be explained to a regulator who asks where the data came from. This isn't a hypothetical risk. Regulatory scrutiny of AI decision-making is increasing in most jurisdictions, and the question "show me the data that drove this decision" is one that every enterprise AI program will eventually face. Trust collapse. The most damaging long-term cost isn't technical — it's organizational. When a model's outputs contradict what a business user believes to be true about the underlying data, or when two models trained on data from different parts of a fragmented architecture produce contradictory outputs, the business loses confidence in AI outputs generally. Trust is hard to rebuild. An organization that has watched AI models disagree with each other or with the known facts becomes deeply skeptical of AI investment. 
The business case for the next program starts from a deficit before it's been written.

The workaround trap

The engineering response to fragmented data is usually feature engineering workarounds: bridge tables, deduplication logic, reference data lookups, semantic normalization applied in the feature construction layer. These work up to a point.

The problem with workarounds is that they're expensive to build, expensive to maintain, and opaque to everyone who didn't build them. A feature pipeline that applies a complex normalization to reconcile two inconsistent reference datasets is technical debt that the model inherits. When the underlying data changes (which it will) the workaround may silently break in ways that degrade model performance without triggering an error.

Solving data quality at the data layer is categorically different from solving it in the feature engineering layer, because the fix at the data layer benefits every downstream system that uses the data. The fix in the feature engineering layer benefits one model and creates another dependency that has to be maintained indefinitely. Teams that choose workarounds because they're faster to implement are right about the speed and wrong about the cost.

What a minimum viable data architecture for AI looks like

I'm not going to argue for a complete data architecture overhaul as a precondition for AI investment; that's a multi-year program with a cost that rarely gets approved and a timeline that creates its own political problems. What I will argue for is a minimum viable data architecture assessment before committing to a production AI program. That assessment covers four questions:

- What is the authoritative source of record for each data entity the model will use?
- What are the known quality issues in those sources, and what are the implications for model performance?
- What is the data lineage (from source through transformation to training) and is it documented?
- What are the regulatory and compliance requirements for the data, and are they compatible with how the model will use it?

If the answers to those questions reveal problems that can't be addressed within the program budget and timeline, the AI program isn't ready for production. That's not a failure of the AI program; it's useful information about what needs to happen first. The cost of discovering that in a data assessment is small. The cost of discovering it eighteen months into a production deployment is not.
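One way to keep the four assessment questions honest is to record the answers per data entity and list what is still blocking. The sketch below is a rough illustration in Python, not a prescribed tool: the `EntityAssessment` structure, its field names, and the example entities are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EntityAssessment:
    """Answers to the four questions for one data entity (hypothetical shape)."""
    entity: str
    authoritative_source: Optional[str] = None
    unresolved_quality_issues: List[str] = field(default_factory=list)
    lineage_documented: bool = False
    compliance_cleared: bool = False


def blockers(assessment):
    """List the open items that would keep this entity out of production use."""
    found = []
    if not assessment.authoritative_source:
        found.append("no authoritative source of record")
    if assessment.unresolved_quality_issues:
        count = len(assessment.unresolved_quality_issues)
        found.append(f"{count} unresolved quality issue(s)")
    if not assessment.lineage_documented:
        found.append("lineage not documented")
    if not assessment.compliance_cleared:
        found.append("compliance status not confirmed")
    return found


def readiness_report(assessments):
    """Map each entity that still has blockers to its list of open items."""
    report = {}
    for assessment in assessments:
        open_items = blockers(assessment)
        if open_items:
            report[assessment.entity] = open_items
    return report


assessments = [
    EntityAssessment("customer", "crm", [], True, True),
    EntityAssessment("transaction", None, ["raw and processed forms diverge"], False, True),
]
report = readiness_report(assessments)
print(report)
```

An empty report does not prove the data is fit for purpose; it only shows that the four questions have been answered for every entity in scope.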


06 Mar, 2026
- Data Engineering

The Data Quality Problem Nobody Puts in the Deck

Early in a data program at a large insurer, I asked the head of the business data office what percentage of their claims data they considered clean. She said, without hesitation, about 80 percent. When we actually looked — ran completeness checks, consistency validations, temporal analysis, cross-referenced against the systems of record — the number was closer to 40 percent. And that was by a generous definition of "clean." The 40-point gap was not the result of negligence. It was the result of the way enterprise data accumulates: through system migrations that didn't fully reconcile historical records, through form field changes that made old values semantically incompatible with new ones, through operational shortcuts that were rational at the time and invisible until someone tried to use the data systematically. This pattern — a significant gap between what the business believes about its data and what the data actually contains — is present in virtually every large enterprise I've worked with. What changes is how much it matters. In most operational contexts, it doesn't matter much. In AI programs, it matters enormously. Why the gap doesn't surface until you try to train Data discovery processes reveal the problems that can be found by looking at data directly: missing values, obvious format inconsistencies, clearly duplicated records. What they don't reveal are the problems that only become visible when you try to do something with the data. Semantic inconsistency is one example. A "claim status" field that takes values of "open," "pending," "in review," and "active" might look fine in discovery. The problem emerges when you try to build a model that predicts claim duration and discover that "pending" meant two different things before and after a system migration five years ago. 
The model learns from the historical pattern and produces predictions that are systematically off for a segment of claims because the label meant something different during the training window. Temporal invalidity is another. Features constructed from historical data often embed assumptions about time that are violated in ways that aren't visible until you start building features. A "days since last contact" feature that looks like a useful signal turns out to encode data entry behavior rather than customer behavior — the field was populated differently in different branches, and the differences correlate with branch-level outcomes rather than customer-level ones. These problems don't show up in a data profile. They show up in model validation, in production performance anomalies, and in the kinds of questions that domain experts ask when they look at model outputs that seem technically sound but operationally wrong. The four failure patterns Completeness gaps. Missing data is the most visible quality problem and usually the best understood. But completeness is less binary than it appears. A field that's 95% complete might have its 5% missingness concentrated in the segment of the data the model most needs to reason about — a specific customer segment, a specific time period, a specific geography. Aggregate completeness metrics hide distributional missingness that can create systematic model errors invisible until production. Consistency failures. Data that means different things in different records, or that's been encoded differently across systems, is the failure pattern that's hardest to detect and most dangerous for model training. Consistency failures are common at integration points — where data from one system is loaded into another — and at migration boundaries, where historical records under an old schema are mapped to fields in a new schema. 
The mapping logic that seemed sensible at migration time often introduces subtle distortions that aren't documented and don't announce themselves. Temporal drift. The relationship between data and the world changes over time. Customer behavior changes, market conditions change, business rules change. A model trained on data from three years ago has learned from a world that no longer exists in important ways. This isn't a data quality problem in the traditional sense — the data accurately reflects what was true at the time — but it creates a model that doesn't reflect current reality. Temporal drift is the most common reason AI models underperform in production relative to testing, and it's consistently underweighted in most data quality assessments. Labeling errors. For supervised learning problems, the quality of the labels — the ground truth the model is learning from — determines a ceiling on model quality that no algorithm can overcome. Label quality is frequently taken for granted in enterprise AI programs because the labels come from an existing operational system that the business trusts. But operational labels are a product of the processes that generated them, and those processes have their own inconsistencies. Claims classified as fraudulent by one review team using one set of criteria, then reclassified by another team using updated criteria, produce a label set that encodes inconsistency as signal. The model learns from that inconsistency and reproduces it at scale. How it kills AI ROI The mechanism by which data quality degrades AI ROI isn't usually a catastrophic failure. It's a gradual tax on every part of the program. Model performance caps are lower than they should be, which means the business case is harder to achieve. The team spends more time on data remediation than on model development, which means the delivery timeline extends. 
Retraining cycles are more expensive because the data pipeline that feeds them is brittle, which means the operational cost is higher than projected. And when the business starts to see outputs that seem wrong — where the model disagrees with what an experienced practitioner would say — trust erodes in ways that are very difficult to reverse. The cumulative effect is hard to quantify precisely, but a program that expected an eighteen-month path to production value and took thirty months instead, with model performance fifteen points below the initial projection, is not unusual when data quality problems were underassessed at the start. The feature engineering temptation The engineering response to data quality problems is usually feature engineering workarounds: bridge tables, deduplication logic, reference data lookups, semantic normalization applied in the feature construction layer. These work. I've used them. But they're a form of debt that compounds. A feature pipeline that applies complex normalization to reconcile inconsistent reference datasets has to be maintained for as long as the model runs. When the underlying data changes — and it will — the workaround may silently break, degrading model performance without triggering an error. And every new model that uses the same data inherits the same problem independently, solving it in its own way, creating a portfolio of different workarounds for the same underlying issue. Feature engineering can bridge a data quality gap temporarily while the underlying issue is being addressed. It's not a substitute for addressing the underlying issue. The difference matters when you're building the fourth model on the same data foundation that the first three models already patched around. What a real data quality assessment looks like The most useful question to ask before starting an AI program is not "do we have the data?" Almost every enterprise has data. 
The question is "is the data we have sufficient to support the model we need to build?" A data quality assessment designed around that question covers: the completeness and consistency of each field the model will use, the temporal validity of the training window, the label quality for the target variable, and the regulatory and compliance status of the data being used for training. For a focused use case, this takes two to four weeks. It produces a realistic view of what remediation work is needed before model training can begin, what workarounds are viable in the short term, and where the data gaps are severe enough to require a different use case or a longer pre-program phase. That view has a cost — it may reveal that the program timeline needs to move right, or that a use case needs to change. But it's a cost paid once, upfront, with full information. The alternative is paying a larger cost spread across months of rework, performance shortfalls, and stakeholder trust that's harder to rebuild than it was to lose. The data quality problem almost always goes into the deck eventually. The question is whether it goes in at the start, when there's still time to do something about it, or at month fourteen, when everyone is looking for someone to blame.
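The completeness gap described earlier, where a field looks 95% complete in aggregate but the missing values cluster in one segment, is worth checking directly. Here is a minimal sketch in plain Python on synthetic data; the field names, segment names, and counts are invented purely to illustrate the effect.

```python
from collections import defaultdict


def completeness_by_segment(records, field, segment_key):
    """Share of non-missing values for a field, computed per segment."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for record in records:
        segment = record[segment_key]
        totals[segment] += 1
        if record.get(field) is not None:
            filled[segment] += 1
    return {segment: filled[segment] / totals[segment] for segment in totals}


# Synthetic data: 100 records, 5 missing values, all in the smaller segment.
records = (
    [{"region": "north", "income": 50000}] * 90
    + [{"region": "south", "income": 42000}] * 5
    + [{"region": "south", "income": None}] * 5
)

# The aggregate metric that usually goes in the data profile.
overall = sum(r["income"] is not None for r in records) / len(records)

# The same field, broken out by the segment the model may care most about.
by_region = completeness_by_segment(records, "income", "region")

print(overall, by_region)
```

In this synthetic case the field is 95% complete overall but only half complete in the smaller segment, which is the kind of gap an aggregate profile would not flag.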
