The Knowledge That Walks Out the Door When Employees Use AI on Client Work

Professional services firms (consulting, legal, accounting, advisory) have a relationship with client data that differs from most enterprise AI contexts. The data they handle belongs to their clients. The confidentiality obligations around it are contractual and, in many cases, professional and regulatory. The consequences of a breach are not limited to regulatory exposure; they extend to client trust, which is the fundamental asset in any advisory relationship.

AI tools are now deeply embedded in how professional services work gets done. Analysts use them to accelerate research. Consultants use them to draft documents. Lawyers use them to review contracts. The productivity benefits are real and the competitive pressure to use them is significant. The risk accumulation that comes with that use is largely unaddressed.

What client data actually flows through AI tools in professional services

The volume and sensitivity of client data flowing through AI tools in professional services contexts tends to be higher than in most enterprise settings, because the work itself involves processing and analyzing client-owned information.

Project research and analysis. Analysts feed client financial data, market analysis, and competitive benchmarks into AI tools to accelerate synthesis. The client's internal data, which the firm has received under a confidentiality agreement, enters a third-party AI system.

Document drafting. Consultants use AI writing assistants to draft recommendations, presentations, and reports. The source material that informs the drafting (interview outputs, internal data, strategic context) is included as context for the tool.

Contract review and legal analysis. Legal and advisory professionals use AI tools to review and summarize contracts, due diligence materials, and transaction documents. These materials contain some of the most sensitive information clients possess.

Meeting summaries and communication assistance.
Client meeting recordings are processed through meeting AI tools. Client correspondence is drafted with AI assistance. Internal discussions about client situations are entered as context for AI queries.

Each of these flows involves client-confidential data entering a third-party AI system. Most firms have not mapped this systematically. Many assume it falls under the general confidentiality terms in their client agreements without having verified that the AI tool's data processing terms are compatible with those obligations.

The contractual gap that most firms have not closed

Professional services firms operate under engagement letters and master services agreements that include confidentiality provisions. These provisions were written before AI tools existed in their current form. They typically cover how the firm handles client confidential information: where it is stored, who has access, what the firm's obligations are around disclosure.

What they almost never address: whether the firm can process client confidential information using third-party AI tools, and if so under what conditions.

This creates a gap. The firm has agreed to keep client information confidential. The firm's employees are feeding that information to third-party AI systems. Whether that constitutes a breach of the confidentiality provisions depends on the specific language and how it would be interpreted, which is not a comfortable analysis to be doing reactively.

Some clients are now asking about this proactively in RFPs and at the start of engagements. Firms that have a clear, honest answer to the question "do your employees use AI tools when working on our engagement, and if so how is our data handled" are in a better position than those who have not worked out an answer.

The knowledge residue problem

There is a second dimension to this risk that is less obvious than the direct data exposure question.
When an employee works with client information through an AI tool over the course of an engagement, the contextual knowledge they develop about the client's situation is richer and more detailed than it would be if they had processed the same information manually. The AI tool allows them to work across more data, make more connections, and develop a more comprehensive understanding than time would have permitted through manual analysis.

This enriched understanding lives in the employee's head when they walk out the door. When that employee moves to a competitor or, in certain conflict situations, works for a client in a similar competitive situation, the depth of knowledge they carry creates an exposure that goes beyond the normal knowledge transfer risk.

The firm cannot fully control what employees internalize through their work. That has always been true. AI tools increase the depth and breadth of what an employee can internalize over a fixed period of time. The risk management implications are subtle but real.

What governance looks like in practice

The minimum governance framework for a professional services firm using AI tools on client work:

An explicit AI use policy that covers client work. This should specify which AI tools are approved for use on client matters, what categories of client data can be processed through AI tools under what conditions, and what the data handling terms are for approved tools. This is different from the general employee AI policy: it needs to address the confidentiality obligations that are specific to client engagements.

Client engagement agreement updates. The confidentiality provisions in engagement letter and master services agreement templates need to be updated to address AI tool use. At minimum, the provisions should not preclude AI use in ways that are inconsistent with how work is actually being delivered.
Better than that: the provisions should address AI tool use explicitly, with appropriate confidentiality protections around how client data is handled within those tools.

Client disclosure for high-sensitivity matters. For engagements involving particularly sensitive information (M&A transactions, regulatory matters, litigation, restructuring), the engagement team should have a protocol for discussing AI tool use with the client and obtaining explicit confirmation about what is acceptable.

Employee education that is specific to client work. General AI use training does not address the confidentiality implications specific to professional services. Employees handling client confidential information need to understand what AI tools they can use, with what data, under what terms, and what their obligations are when in doubt.

The question clients are starting to ask

The most direct signal that this needs to be addressed now: clients are beginning to ask about it. Not often, but the frequency is increasing, and the questions are getting more specific.

"Does your team use AI tools when working on our matters?"

"If so, does our confidential information enter those AI systems?"

"What are the data handling terms for the AI tools you use, and how do they interact with your confidentiality obligations to us?"

A firm that has thought about these questions and has clear answers is in a different position from one that has to formulate an answer under client scrutiny. The latter tends to produce either an evasive answer that damages trust or a defensive answer that raises more questions than it resolves.

What to take from this

Map what client data is flowing through AI tools on active engagements. The volume is almost certainly higher than any single partner or manager would estimate.

Review whether existing client confidentiality provisions are consistent with how AI tools are actually being used in client delivery. The gap is likely to be meaningful.
Update engagement agreement templates to address AI tool use explicitly, before clients start asking for it in contract negotiations.

Develop a protocol for client disclosure on high-sensitivity matters. The default should be proactive transparency, not reactive disclosure.

Train client-facing staff specifically on the confidentiality implications of AI tool use in their context. Generic AI training is not sufficient for professional services.

The firms that handle this well are not necessarily the most cautious ones. They are the ones that have been honest about how AI is being used in client delivery, have updated their agreements to reflect that, and can answer client questions about it clearly and without hesitation.

How to Classify Your Data Before Your AI Program Does It for You

Data classification is one of those governance practices that most organizations have in some form and almost none have in a form that is adequate for AI. The gap matters because AI deployment without a working classification framework creates a specific category of problem: the system treats all accessible data as equivalent input, and the outputs reflect that indiscriminateness in ways that are difficult to predict and costly to remediate after the fact.

The CIO who gets this right before the AI program starts is in a very different position from the one who inherits a classification gap when the first incident surfaces. Here is what a practical classification approach looks like when AI deployment is the specific forcing function.

Why existing classification frameworks usually fall short

Most organizations have some form of data classification. The typical structure is a four-level hierarchy: public, internal, confidential, and restricted. Documents get tagged, or are supposed to get tagged, at one of these levels. Access controls are set accordingly.

This framework was designed for a world where humans navigate information deliberately. You look for a document, you find it, you read it. The sensitivity of what you see is a function of where you went to look.

AI tools do not navigate information that way. They can process everything they have access to simultaneously, surface connections between data sources that were never designed to be combined, and produce outputs that reflect the aggregate of what they have seen rather than any single document. The sensitivity classification of individual documents does not translate cleanly into the sensitivity of an AI system's outputs.

There are three specific failure modes I see in organizations that try to apply existing classification frameworks to AI deployment.

Permission-level accuracy.
Existing classification may reflect the intention of who should access what, but actual permissions often diverge from the classification framework over time. Documents move between folders. Projects end and access is not revoked. Distribution lists grow and are not pruned. When an AI system is given access to everything a user can access, it inherits this divergence between intended and actual permissions.

Output sensitivity. A document classified as "internal" might, in combination with five other documents also classified as "internal," produce an AI output that reveals information that would have been classified "confidential" if anyone had written it down directly. The classification framework addresses individual document sensitivity but not the sensitivity of AI-generated synthesis.

Dynamic content. AI systems that connect to live data sources (CRMs, financial systems, email archives) encounter content that has never been classified at all, because classification was designed for documents rather than data records.

Building a classification framework for AI specifically

A classification framework that works for AI deployment needs to answer three questions that the standard framework typically does not.

What can this data type be used for in an AI context? Rather than a single sensitivity level, each data category needs a set of permitted AI use cases. Client financial data might be appropriate for internal analytics AI but not for a tool that produces externally shared outputs. Personal data might be appropriate for a tool with data processing agreement coverage but not for one without it. The permitted use case dimension is specific to AI and does not exist in traditional classification frameworks.

What combinations create elevated sensitivity? Certain combinations of data categories produce outputs that are more sensitive than any individual category.
A practical classification framework for AI should identify the high-risk combinations and set explicit controls around AI systems that can access every category in such a combination.

What is the real-time classification status? For live data sources, the classification question is not just "what is this data type" but "what is the current state of this specific record, and does that affect what AI can do with it." A client record that includes active litigation flags, for example, may need to be treated differently than a standard client record even if the data type is classified the same way.

The practical approach

Doing this well does not require a multi-year data governance program. It requires a focused exercise tied directly to the AI deployment timeline. Here is what that looks like.

Start with the AI system's data access scope. Before classifying anything, define what data sources the AI system will be connected to. The classification exercise is scoped to those sources. Everything else can wait.

Map the sensitive data categories within scope. For each data source the AI will access, identify what sensitive categories exist: personal data, commercially sensitive data, legally privileged material, client confidential data, regulated financial data. This is an inventory exercise, and it usually reveals data in places people did not expect it.

Define permitted use cases for each category. For each sensitive category, specify what the AI system is and is not permitted to do with it. This becomes the basis for technical controls: what data the system can retrieve, what it can include in outputs, and what it should exclude or flag.

Build the combination rules. Identify the high-risk combinations and set rules for how the AI system handles them. This is the hardest part and the one most often skipped. Spending a day on this with the CIO, the data protection officer, and the AI system owner is worth it.

Implement classification tags as technical controls.
The classification decisions need to be expressed as technical constraints that the AI system respects, not just as policy documentation. A policy that says "the AI should not include client financial data in externally visible outputs" is unenforceable unless the system is technically configured to prevent it.

The CIO's role in making this work

Data classification for AI is not a project the technical team can own independently. The decisions about which data categories can be used for which AI purposes require input from legal, compliance, and the business functions that own the data. The CIO's role is to convene those conversations and drive them to decisions before the AI system goes live, not after.

The alternative (deploying the AI system and addressing classification issues as they surface) is more expensive and more disruptive. When an AI system produces an output that reveals information it should not have had access to, the response involves technical remediation, incident investigation, potential regulatory notification, and organizational credibility damage. All of these are harder than running the classification exercise before deployment.

The time required for a focused data classification exercise scoped to a specific AI deployment is typically two to four weeks for a system with well-defined data access scope. That is a reasonable investment given the alternative.

What to take from this

Existing data classification frameworks were designed for human navigation of information. They do not translate directly to AI access, which aggregates and synthesizes rather than navigates.

Classification for AI needs to address permitted use cases, high-risk combinations, and live data: three dimensions that standard frameworks typically do not cover.

Scope the classification exercise to the AI system's data access, not the organization's entire data estate. A focused exercise is achievable in weeks; an organization-wide program is not.
Classification decisions need to be expressed as technical constraints, not just policy documentation. A policy without technical enforcement is not a control.

The CIO needs to convene legal, compliance, and business data owners in the classification exercise. The decisions require input from all of them, and making them without that input produces gaps.

The organizations that get AI deployment right are not the ones with the most comprehensive data governance programs. They are the ones that did the focused, practical work of understanding their data before they connected it to a model, and made deliberate decisions about what that meant for acceptable use.
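
To make "classification as a technical control" concrete, here is a minimal Python sketch of the three dimensions discussed above: permitted use cases per category, high-risk combinations, and a live record flag. Every name in it (the category labels, the use-case labels, Record, is_permitted, filter_context) is hypothetical and chosen for illustration; a real deployment would enforce the same rules inside its own retrieval and access layer.

```python
from dataclasses import dataclass

# Hypothetical categories and use cases, for illustration only.
# Each sensitive category maps to the AI use cases it is permitted for.
PERMITTED_USES = {
    "client_financial": {"internal_analytics"},
    "personal_data": {"internal_analytics", "dpa_covered_tool"},
    "legally_privileged": set(),  # never permitted in this sketch
}

# Sets of categories that are more sensitive together than apart.
HIGH_RISK_COMBINATIONS = [
    frozenset({"client_financial", "personal_data"}),
]


@dataclass(frozen=True)
class Record:
    record_id: str
    categories: frozenset          # sensitive categories found in the record
    litigation_hold: bool = False  # live status flag on this specific record


def is_permitted(record, use_case):
    """Default-deny check for a single record."""
    if record.litigation_hold:
        return False
    # Unknown categories have no permitted uses, so they are denied.
    return all(use_case in PERMITTED_USES.get(c, set())
               for c in record.categories)


def filter_context(records, use_case):
    """Return the records that may be passed to the AI system together."""
    seen = set()
    selected = []
    for record in records:
        if not is_permitted(record, use_case):
            continue
        combined = seen | record.categories
        # Skip a record if adding it would complete a high-risk combination.
        if any(combo <= combined for combo in HIGH_RISK_COMBINATIONS):
            continue
        seen = combined
        selected.append(record)
    return selected
```

The design choice worth noting is the default: a category with no entry in the policy table is denied rather than allowed, so data that was never classified cannot slip through simply because nobody wrote a rule for it.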

Training AI on Proprietary Data: What You Gain and What You Give Up

At some point in most enterprise AI programs, someone proposes using the organization's own data to improve the model. The pitch is compelling: a model trained on your internal documentation, your historical decisions, your domain-specific knowledge will perform better on your actual work than a general-purpose model that knows nothing about your context.

The pitch is not wrong. Done well, training or fine-tuning a model on proprietary data can produce a genuinely better system. The problem is that "done well" requires a set of decisions that most organizations have not thought through, and the downside of doing it carelessly (data exposure, quality degradation, regulatory exposure) is substantial.

This is not an argument against using proprietary data to improve AI systems. It is an argument for approaching that decision with the same rigor you would apply to any decision about where your most valuable data goes.

What training on proprietary data actually involves

There are a few different mechanisms by which proprietary data can improve model performance. The distinctions matter for understanding the exposure.

Retrieval-augmented generation (RAG). The model is not trained on proprietary data at all. Instead, a retrieval system fetches relevant internal documents at query time and provides them as context for the model to work from. The proprietary data sits in a controlled index that the organization manages. The model itself stays unchanged. This approach avoids most of the training data risks while providing significant performance improvement on domain-specific tasks.

Fine-tuning. The model's internal weights are adjusted using proprietary data. The model itself changes: it becomes better at tasks that reflect the patterns in the training data. The proprietary data has, in a meaningful sense, been absorbed into the model. This is more powerful than RAG for certain tasks, and considerably more complex from a data governance standpoint.
Full training from scratch. Building a model entirely on proprietary data, without starting from a pre-trained foundation. This is rare at the enterprise level, because the compute and data requirements are significant, but it does happen for organizations with specialized domain requirements and the resources to invest.

The data exposure implications are very different across these approaches. RAG keeps proprietary data in an index the organization controls. Fine-tuning moves it into the model weights. The latter is where the tricky questions live.

What gets encoded in a fine-tuned model

When data is used to fine-tune a model, the specific documents and content do not get stored inside the model as retrievable files. In most cases, the model cannot be queried to reproduce its training data verbatim. But the training data has shaped the model's behavior in ways that are difficult to audit or reverse.

This matters in two respects.

First, sensitive information can influence the model's outputs in ways that are hard to anticipate. A model fine-tuned on internal strategy documents may, when asked questions that touch on those topics, produce outputs that reflect internal strategic positions without anyone intending to reveal them. The model is not leaking documents; it is producing outputs shaped by the patterns in documents it has seen. The effect is subtle and hard to trace.

Second, fine-tuned models can, under certain adversarial conditions, be prompted to surface information that reflects their training data more directly than normal. This is not a theoretical risk: it is an active research area, and the techniques for eliciting training data from fine-tuned models are improving. Organizations that fine-tune third-party-hosted models on genuinely sensitive data should factor this risk into their assessment.

The vendor relationship in fine-tuning

When an organization fine-tunes a model that is hosted by a cloud AI vendor, several questions about the proprietary data need to be answered clearly before proceeding.

Does the vendor use the fine-tuning data to improve their base models? Most enterprise agreements exclude this, but it should be a named contractual commitment.

Who can access the fine-tuning data? During the fine-tuning process, the data is processed on the vendor's infrastructure. Access controls, the circumstances under which vendor employees can access training content, and the security measures around the fine-tuning pipeline should all be part of the assessment.

What happens to the fine-tuning data after training is complete? Is it retained, for how long, and can it be deleted? What does deletion mean in the context of a model that has already been trained on it? Vendors are not always able to answer that question cleanly, because once data has influenced model weights, the effect cannot be fully reversed.

Where is the fine-tuned model stored, and who controls it? The fine-tuned model is organizational IP. The vendor relationship needs to be clear about ownership, portability, and what happens to the fine-tuned model if the organization ends the vendor relationship.

Data quality is the amplification lever

One risk that gets less attention than the security questions: the quality and composition of the training data determine whether fine-tuning improves or degrades model performance.

Organizations that rush fine-tuning treat it as an input optimization problem, on the assumption that more data is better. That is incorrect. Noisy, inconsistent, or poorly representative training data produces a fine-tuned model that performs worse than the base model on the tasks that matter, and sometimes produces dangerous outputs in domains where the training data was biased or incomplete.
I have seen organizations fine-tune models on documentation that contained outdated processes, contradictory guidance, and examples of known-bad decisions that were documented as cautionary tales rather than as positive examples. The resulting model incorporated those patterns along with the useful ones. The outputs were confidently wrong in ways that were hard to trace back to the training data quality issue.

Before committing to fine-tuning, the data curation question deserves as much attention as the technical process. What data will be included, what will be excluded, who reviews the training set for quality and appropriateness, and how will the fine-tuned model's behavior be evaluated against the base model are all part of a credible fine-tuning program.

When it is worth it

Fine-tuning on proprietary data is worth the complexity when two conditions are both true: the performance improvement on the target task is significant and demonstrable, and the data exposure risks can be managed through a combination of vendor agreement terms, data classification, and governance.

RAG should be the default approach for most enterprise use cases. It provides substantial performance improvement without the fine-tuning data exposure risk, and it is operationally simpler to maintain and update. Fine-tuning becomes worth the investment when the task requires deep pattern recognition across the organization's specific domain that a retrieval approach cannot fully replicate.

The decision to fine-tune should not be made by the AI team alone. It needs the CTO's visibility into the data exposure implications, legal review of the vendor terms around training data, and a clear data classification decision about which content is and is not appropriate as training material.

What to take from this

RAG and fine-tuning are different. RAG keeps proprietary data in an index the organization controls. Fine-tuning moves patterns from that data into model weights.
They have different data exposure profiles.

A model fine-tuned on sensitive data can surface that information in ways that are hard to predict or audit. This is not a reason to avoid fine-tuning, but it is a reason to be deliberate about what goes into the training set.

When fine-tuning on a third-party hosted model, get contractual clarity on training data use, retention, and what happens to the data and model at contract end.

Data quality in the training set is as important as security. Noisy, inconsistent, or outdated data produces a fine-tuned model that may perform worse than the base model.

Make fine-tuning a CTO-level decision with legal review, not a technical team default. The data exposure implications require organizational sign-off, not just technical execution.
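
The RAG pattern described earlier (documents stay in an index the organization controls and are fetched at query time) can be illustrated with a toy Python sketch. It assumes simple keyword overlap instead of the embedding search a production system would normally use, and it stops at building the prompt rather than calling any model. DocumentIndex, tokenize, and build_prompt are hypothetical names introduced here for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DocumentIndex:
    """Toy in-memory index that the organization owns and can edit."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, text):
        self._docs[doc_id] = text

    def remove(self, doc_id):
        # Removal takes effect on the next query; no model retraining needed.
        self._docs.pop(doc_id, None)

    def get(self, doc_id):
        return self._docs[doc_id]

    def search(self, query, top_k=2):
        """Rank documents by the number of tokens shared with the query."""
        query_tokens = tokenize(query)
        scored = []
        for doc_id, text in self._docs.items():
            overlap = len(query_tokens & tokenize(text))
            if overlap:
                scored.append((overlap, doc_id))
        # Highest overlap first; ties broken by document id for stability.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored[:top_k]]


def build_prompt(index, question, top_k=2):
    """Assemble the text that would be sent to an unchanged base model."""
    doc_ids = index.search(question, top_k)
    context = "\n".join(f"[{doc_id}] {index.get(doc_id)}" for doc_id in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

In this sketch, calling remove on a document means it is never retrieved again, which is the practical contrast with fine-tuning: once data has shaped model weights, there is no equivalent single operation to take it back out.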

Defining AI Success Before Anyone Commits a Budget

I've sat in enough AI program kick-off meetings to have stopped being surprised by this: twenty people in a room, a board mandate to "move fast on AI," a vendor selected, a team forming, and nobody has yet agreed on what success looks like. Not in specific terms. Not in a way that would allow someone to come back in twelve months and determine whether the program worked.

This isn't unusual. It's the norm. And it's the single failure mode I see most consistently across enterprise AI programs, regardless of industry, company size, or technical maturity.

The reasons are understandable. There's pressure to start. Defining success precisely feels constraining when the technology is new and the possibilities feel open. Executives are comfortable with directional goals ("improve customer experience," "reduce operational cost," "increase fraud detection") and less comfortable with the kind of specificity that creates a clear accountability line. Specificity means someone owns the number. Owning the number means someone can be wrong.

The cost of this discomfort is high.

Why technical metrics aren't the answer

The first place program teams turn when asked to define success is model performance metrics. Accuracy, F1 score, AUC, precision-recall tradeoffs. These are measurable, they're familiar, and they can be calculated before the model is in production.

They're also not what the business cares about. A model with a 94% accuracy rate that nobody uses isn't a success. A model with 87% accuracy that improves a critical business process by a measurable amount is. The gap between technical performance and business outcome is where most AI programs lose the narrative, and where the CFO eventually starts asking whether the investment was worth it.

Technical metrics are necessary for model development and monitoring. They're not sufficient for program success definition.
What the business cares about is downstream: decisions improved, costs reduced, revenue generated, risk reduced. Those are different measurements, and the relationship between model performance and business outcome needs to be made explicit at the start, not assumed.

What a complete success definition contains

A success definition that can be used to evaluate the program twelve months from now has four components.

A specific metric. Not "improve fraud detection" but a number the business can measure. False negative rate, dollar value of fraud losses prevented, percentage of fraud alerts requiring manual review. The metric needs to be something that exists in a system the business actually maintains, not something that has to be constructed after the fact.

A documented baseline. What is the current value of that metric, measured against the same methodology that will be used to measure the AI-driven result? Without a baseline, you can't measure improvement. Without a consistent measurement methodology, comparisons are arbitrary. Getting the baseline documented before the program starts is more important than it sounds: it eliminates a whole category of disagreement about whether the program worked.

A numeric target. A direction is not a target. "Improve" is not a target. "Reduce false negative rate from 4.2% to below 2.5%" is a target. The target should be challenging enough to justify the investment and specific enough to be unambiguous.

A timeframe. By when? The timeframe determines the evaluation rhythm and gives the program team a real deadline against which to calibrate pace. Without it, the target floats indefinitely.

Without all four components, the success definition isn't complete. It's a directional goal dressed up as a commitment.

The traps

Several common patterns make success definitions look complete when they aren't.
Vanity metrics are things the program team can control that don't connect to outcomes the business cares about: models built, data sources integrated, team size, features shipped. These are useful operational metrics. They're not success metrics. A program that reports these as evidence of success has redefined success to be about activity rather than outcome.

Unmeasurable outcomes are aspirations that can't be tracked. "Become an AI-native organization." "Embed AI into our culture." These may represent genuine long-term goals. They cannot be evaluated in a twelve-month program review, and including them as success criteria gives the program team permanent cover against accountability.

Metrics the business can't track are a trap that sounds technical but isn't. If measuring the success metric requires access to data the business doesn't actually maintain, or calculations the business doesn't currently run, the metric will be reported inconsistently or not at all. Success metrics need to be things the business can measure on a monthly or quarterly basis with existing data infrastructure.

Proxy metrics that drift from the target outcome are the hardest to catch. An AI program for customer service might measure success by handle time reduction. But if the model reduces handle time by routing calls to hold queues rather than resolving queries, the metric looks good while the outcome is bad. The connection between the proxy and the real outcome needs to be validated, not assumed.

Running the conversation with executives who prefer ambiguity

The executives most resistant to specific success definitions are usually the ones with the most at stake. Specificity creates accountability, and accountability creates risk. Understanding that the resistance is rational rather than evasive changes how you approach the conversation.

The approach that tends to work: start not from "what does success look like" but from "what would change your mind."
Ask the sponsor what result, at what point in time, would cause them to question whether the program was working. Ask the CFO what the program would need to show at month twelve to be considered a good investment. Ask the business unit head what their team would need to see to start relying on the model's outputs.

These questions come at the success definition from the outside (from what a skeptic would need to see to be convinced) rather than from the inside, where optimism tends to inflate targets and round off the hard edges. They also surface the implicit assumptions about what success looks like that are already in the room, undiscussed.

The success definition document that comes out of this conversation doesn't need to be complex. It needs to be signed, literally. An agreed definition, documented and acknowledged by the program sponsor, the business lead, and the CFO or their representative. The act of signing matters because it makes the definition a commitment rather than a suggestion.

What happens when you skip this step

Programs without defined success criteria don't fail suddenly. They drift.

Twelve months in, there's a review. The program has made progress: models are built, pilots are running, the team has learned a lot. Nobody can agree on whether the program is working because nobody agreed at the start on what "working" would mean. The sponsor points to the positive signals. The CFO points to the cost. The business unit says the outputs aren't quite what they needed. The program continues, not because it's succeeding, but because it's not clearly failing.

Two years in, the program has consumed significant investment with ambiguous returns. The next budget cycle is where it gets cut, not in a formal review, but quietly, when the sponsor doesn't go to bat for it. The program closes with a retrospective report that describes what was learned rather than what was delivered.

That's not a technical failure.
It's a governance failure that started on day one, and it was entirely preventable.
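The point about proxy metrics (handle time improving while resolution gets worse) is one of the few parts of a success definition that can be checked mechanically. The following is a minimal sketch in Python, not a prescribed method. The field names, the 0.02 threshold, and the monthly readings are all hypothetical and chosen only for illustration. It flags any month in which the proxy improved while the outcome it is supposed to stand for deteriorated.

```python
from dataclasses import dataclass


@dataclass
class MonthlyReading:
    month: str
    avg_handle_time_sec: float  # the proxy metric the program reports
    resolution_rate: float      # the outcome the business actually cares about


def proxy_drift_months(readings, min_outcome_drop=0.02):
    """Return months where the proxy improved but the outcome got worse.

    min_outcome_drop is an assumed tolerance: smaller month-to-month
    declines in resolution rate are treated as noise.
    """
    flagged = []
    for prev, cur in zip(readings, readings[1:]):
        proxy_improved = cur.avg_handle_time_sec < prev.avg_handle_time_sec
        outcome_worse = (prev.resolution_rate - cur.resolution_rate) >= min_outcome_drop
        if proxy_improved and outcome_worse:
            flagged.append(cur.month)
    return flagged


# Illustrative values only, not real program data.
readings = [
    MonthlyReading("2024-01", 420.0, 0.81),
    MonthlyReading("2024-02", 390.0, 0.80),
    MonthlyReading("2024-03", 340.0, 0.72),
]

flagged = proxy_drift_months(readings)
print(flagged)
```

A flagged month is a prompt to investigate, not a verdict. The value of a check like this is that it forces the proxy and the outcome to be reported side by side.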

How to Structure an AI Team That Can Actually Deliver


The team structure that gets proposed for most enterprise AI programs looks roughly like this: two or three data scientists, a data engineer, an ML engineer, a product manager, and maybe a business analyst. Sometimes there's a director of AI or a head of data science above them.

It looks reasonable. It's often wrong. Not because those roles are wrong, but because the configuration doesn't match the actual work of delivering AI in enterprise settings, and because the roles most critical to production success are frequently the ones hired last, or not at all.

The build vs run mismatch

The most consistent structural mistake I see is treating AI delivery as a single phase when it's two distinct phases with different skill requirements.

Building a model (running experiments, selecting features, iterating on architecture, evaluating performance) requires people who are comfortable with uncertainty, who can work through failures without losing momentum, and who have the technical depth to make good modeling decisions. These are data scientists and ML engineers with research instincts.

Running a model in production (monitoring performance, managing retraining pipelines, diagnosing inference latency issues, responding to alerts) requires reliability engineering instincts. Discipline around operational procedures, comfort with incident response, a preference for stable and observable systems over novel approaches.

These are different jobs. They attract different people. Training a great data scientist to be a great MLOps engineer is possible but slow and frequently frustrating for both the person and the organization.

The right approach is to hire for both functions, with deliberate thought about when each is needed. In most programs, MLOps is understaffed at the start because the instinct is to hire modeling capability first and figure out production later. This works during the build phase and creates a bottleneck at the production transition that extends timelines by months and degrades output quality.

The roles that actually determine delivery

The model quality matters. Whether the program succeeds depends on other things.

The ML engineer or MLOps engineer. This role is consistently underhired relative to its impact. The person who owns the pipeline from raw data to trained model to deployed endpoint to monitored output is carrying the weight of the whole production system. Without someone genuinely owning this, models get trained but not deployed, or deployed but not monitored, or monitored but not retrained when performance drifts. The data science team gets blamed for production problems that are actually infrastructure and operations problems.

The data engineer. Not a data scientist who also knows SQL, but a specialist whose job is to build, maintain, and improve the data pipelines that feed the models. Data engineers are frequently absent from AI team design, with the assumption that data infrastructure will come from somewhere else in the organization. When it doesn't (and in most enterprises it doesn't, because the data engineering capacity is already allocated to other programs), model training becomes ad hoc, feature engineering gets rebuilt from scratch for every use case, and the first retraining cycle reveals that nobody documented what the original training data looked like.

The domain expert. Not always a formal hire, but always a critical relationship. Every enterprise AI use case operates in a domain (fraud, logistics, finance, clinical care) and the model's outputs have to make sense in that domain. A domain expert who can validate model behavior against operational knowledge, who can identify when outputs are technically correct but operationally nonsensical, is the difference between a model the business trusts and one it doesn't.

The product owner. Not a traditional product manager, but someone who can hold the operational specification of the AI system: what it needs to do, what the performance thresholds are, what business rules govern its use, how success is measured. In many AI programs this role gets rolled into the data science lead or left undefined. When it's undefined, requirements drift, scope expands, and the model ends up solving a subtly different problem from the one the business originally described.

The accountability model

The organizational question that matters most isn't team structure. It's accountability. Who owns what, and what happens when it's unclear?

In a well-structured AI team, accountability maps cleanly to the phases of the system's life: someone owns the model's technical performance, someone owns its operational behavior in production, someone owns the business outcome it's driving, and someone owns the cost of running it. These may be the same person in a small program or different people in a large one. The important thing is that each accountability is named and unambiguous.

The failure mode is when accountability is assumed to be shared. A model that's jointly "owned" by the data science team, the business unit, and IT is owned by nobody. When performance drifts, each team has an explanation for why it's the other team's problem. By the time accountability is resolved, the drift has already cost the business something.

The team size trap

When AI programs are under pressure, the instinct is often to add headcount. More data scientists, more engineers, the problem gets solved faster.

This is sometimes true and usually not. Headcount helps when the constraint is capacity: there's more work than the team can do. It doesn't help when the constraint is clarity: nobody is sure what the right thing to build is, who owns the decision, or what the requirements actually are. Adding people to a clarity problem makes the problem worse: more people, more interpretations, more work produced in directions that don't align.

Before adding headcount, the question should be: what is the actual constraint? If it's capacity, add people with the specific skills the bottleneck requires. If it's clarity, add process (requirements definition, decision rights, explicit ownership) before adding people. I've seen programs double their team size and slow down because the new people inherited the same unclear mandate the original team was operating under.

The hiring sequence

Starting from zero, with a defined use case and a twelve-month production timeline, the hiring sequence I'd follow:

Hire the data engineer and the ML engineer before the data scientists. Get the infrastructure in place first. A data science team without working pipelines produces notebooks that never get to production, and hiring in the wrong order means the data scientists spend the first three months doing their own infrastructure work while the models wait.

Hire the data scientists once there's something to hand off to. At that point, the infrastructure exists to take model artifacts and push them toward deployment.

Identify and contract the domain expert in parallel with the early hiring. This can be an internal secondment or an external advisor. The important thing is access, not full-time headcount.

The product ownership function needs to be filled at day one. This is frequently someone who already exists in the organization: a business analyst with the right domain knowledge, a senior person from the business function sponsoring the use case. Whatever form it takes, the role needs to exist before the team starts building anything, because without it, the team will eventually build the wrong thing efficiently, and efficiency at the wrong thing is its own kind of expensive.

Shadow AI: What's Already Running in Your Organization


Before any formal AI program exists, before the steering committee has its first meeting, before the CIO has signed off on a single vendor, your employees are already using AI. Not because they are reckless. Because they have work to finish.

I have walked into organizations that spent six months building an AI governance framework and discovered, halfway through the engagement, that the finance team had been using a consumer large language model to draft board reports for the past year. The legal team was using a different one to summarize contracts. Neither team thought they were doing anything wrong. They were getting work done faster.

That is the shape of shadow AI. It does not announce itself. It does not appear in your IT asset register. It shows up in the gap between the work people need to do and the tools they have been given to do it.

What shadow AI actually looks like

The term sounds covert. It rarely is. Most shadow AI use is visible if you look for it: people pasting client briefs into ChatGPT, uploading PDFs to summarization tools, running AI writing assistants over strategy documents. Nobody is hiding anything. They just do not think of it as an IT procurement decision.

The categories I see most often:

Consumer AI assistants used for drafting, summarizing, researching, and explaining technical material. These are usually the first tools employees reach for because they already use them outside work. The friction is zero.

AI features embedded in software people already have: writing assistants in productivity suites, AI tools inside project management and communication platforms. These arrive via a product update, not a procurement decision, and most organizations do not notice until they are already in use.

Specialist tools for specific functions: AI contract review, AI coding assistants, AI research tools, AI presentation builders. These typically start as a free trial one person tries. Three months later the whole team is using them, usually without telling anyone.

The common thread: easy to access, fast to start, and they solve a real problem. Nobody waits for procurement when they have a real problem.

The data that leaves when they do

Here is the question I put to leadership teams: what data has left your organization in the last 90 days through AI tools?

Almost nobody has an answer.

Every prompt sent to a third-party AI tool is data that has left the building. Every document uploaded for summarization. Every contract pasted in for analysis. Every email thread fed to an assistant for context. This is not theoretical exposure. It is live, ongoing, and unmeasured.

The specific risk depends on the tool. Some consumer AI products train future models on user inputs by default unless the user has explicitly opted out, a setting most enterprise users have never opened. Some retain query data for extended periods for internal product improvement. Some have data residency terms that have no relationship to the organization's regulatory obligations. Most employees have read none of this.

What makes shadow AI data exposure different from other shadow IT is the combination of volume and sensitivity. When someone uses an unsanctioned SaaS product, they tend to generate new data inside that system. When someone uses an unsanctioned AI tool, they are typically feeding existing sensitive material (client information, financial projections, internal strategy) into a system with unknown retention, processing, and training terms.

A CTO told me once: "We spent a year tightening our cloud storage permissions, and the whole time people were copying strategy documents into a consumer AI chat." He was not exaggerating. He had run the discovery work.

The decision quality problem

The data risk is one issue. The decision quality risk is a separate one.

When employees use AI tools that have not been evaluated or approved, the outputs they receive carry no governance. There is no audit trail of what the model was asked, what it returned, or how that output influenced a decision. Nobody has tested whether the tool performs reliably on the organization's specific data domain. Nobody has checked whether outputs are factually accurate, whether knowledge cutoffs create blind spots, or whether model behavior introduces biases relevant to the use case.

I have seen board presentations drafted with consumer AI assistance that contained subtly incorrect market figures, the kind of error that is hard to catch if you are not already deeply familiar with the material. I have seen contract summaries that missed jurisdiction-specific clauses because the model lacked coverage of that legal context. None of these produced immediate disasters. But they were invisible errors that reached decision-makers before anyone thought to check the source.

The problem is not that the tools are necessarily unreliable. The problem is that nobody defined what "reliable enough" looks like for this use case, nobody validated that the tool clears that bar, and nobody has visibility into which decisions have been shaped by which tools.

Why a policy does not solve this on its own

Every organization that discovers shadow AI responds the same way: they draft a policy. Employees must not use AI tools without prior approval. All AI tools must go through procurement. No sensitive data should be uploaded to external AI systems.

These policies are not wrong. They are just insufficient on their own.

A policy without detection capability is a statement of intent, not a control. If you cannot observe what tools are in use, you cannot enforce anything. And the detection problem is real: consumer AI tools operate over standard encrypted web traffic that looks indistinguishable from any other browser activity on most network monitoring setups.

A policy without a usable alternative is a speed bump, not a barrier. If employees are turning to shadow AI because the approved tooling is slow, limited, expensive, or simply does not exist yet, a policy telling them to stop will reduce usage briefly and then have no effect. People optimize for their work.

A policy without a clear explanation of why tends to generate resentment rather than behavior change. If employees do not understand what the actual risk is (not just that "data security is important" but specifically what could happen to the specific data they are handling), they will weigh an abstract policy against a concrete productivity benefit and find the policy unconvincing.

What actually works

There are four interventions that change the situation. Not instead of policy, but alongside it.

Run a discovery exercise before making any decisions. You need to know what is actually in use before designing controls. This means endpoint monitoring, network traffic analysis, and honest conversations with department heads. Expect surprises. The goal is not to catch anyone. It is to understand your real exposure before you design a response to it.

Move quickly on a sanctioned alternative. The fastest way to reduce shadow AI use is to provide a better approved option. This does not require the best enterprise AI platform with a six-month procurement timeline. It means the minimum viable approved tool that addresses the main use cases driving shadow adoption. Often that is simply a properly configured, privacy-compliant version of the same tool people are already using.

Create a fast path for new tool requests. The reason shadow AI persists is that the formal route takes too long. Teams wait months for IT to evaluate a tool they need now. Make the process faster and more transparent. Most requests should get a decision within two weeks. The ones that cannot should at least get a clear explanation of why not.

Treat the people using shadow AI as a signal. They are telling you where your official tooling is falling short. Employees using unauthorized AI tools are often the highest performers trying to work better. Treating them as a compliance problem to be managed misreads what is happening. Their behavior is product feedback.

What to take from this

Audit shadow AI use before designing governance for it. You need to know what is already running before you build controls around it.

Consumer AI data terms are not written for enterprise compliance. Read them, specifically the sections on input retention, training use, and data residency, before employees continue uploading sensitive material.

A policy without detection is not a control. Invest in observability first, then communicate the policy.

The fastest fix is a sanctioned alternative that works. Prohibition without substitution creates resentment, not compliance.

The employees using shadow AI are showing you where your approved tooling has gaps. Use that information when planning what to procure next.

The organizations I see handle this well are not the ones that moved fastest to write a policy. They are the ones that ran the discovery work, understood their actual exposure, and moved quickly to close the gap between what employees needed and what they were officially allowed to use. That gap is where shadow AI lives.
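The network side of the discovery exercise can start small. The following is a minimal sketch in Python under several assumptions, none of which come from a specific product. It assumes a web proxy or DNS log is available that records the destination hostname. It assumes a hypothetical four-field line format (timestamp, department, user, host). The watchlist of domains is illustrative and far from complete. The sketch reports request counts and distinct users per department rather than naming individuals, in keeping with the aim of measuring exposure rather than catching anyone.

```python
from collections import Counter

# Illustrative watchlist only (an assumption, not a complete inventory).
AI_DOMAINS = {"chatgpt.com", "chat.openai.com", "claude.ai", "gemini.google.com"}


def matches_watchlist(host, watchlist=AI_DOMAINS):
    """True if host is a watched domain or a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in watchlist)


def summarize_ai_usage(log_lines):
    """Map each department to (request_count, distinct_user_count).

    Assumes whitespace-separated lines in the hypothetical format:
    timestamp department user host
    """
    requests = Counter()
    users = {}
    for line in log_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed format
        _, dept, user, host = parts
        if matches_watchlist(host):
            requests[dept] += 1
            users.setdefault(dept, set()).add(user)
    return {dept: (requests[dept], len(users[dept])) for dept in requests}


# Made-up sample lines for illustration.
sample_log = [
    "2024-05-01T09:12:00Z finance u1 chatgpt.com",
    "2024-05-01T09:15:00Z finance u2 chatgpt.com",
    "2024-05-01T09:20:00Z legal u3 claude.ai",
    "2024-05-01T09:21:00Z legal u3 intranet.example.com",
    "2024-05-01T09:30:00Z finance u1 chatgpt.com",
]

summary = summarize_ai_usage(sample_log)
print(summary)
```

A count like this shows where usage is concentrated, not what data was sent. It is also blind to AI features embedded in already-approved software. That is why the conversations with department heads remain part of the exercise.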
