Showing Posts From
Data governance
- 07 Jun, 2026
What Happens to Your Data Inside a Large Language Model
One of the questions I get most often from executive teams when they start getting serious about AI governance is some version of: "If we send data to an AI model, does that data end up in the model? Can the model then use our data to answer questions for our competitors?" It is a reasonable question. The answer is more nuanced than the headlines around AI and data privacy usually suggest, and getting the nuance right matters for making sound decisions about vendor selection, data handling, and acceptable use. This is not a technical explanation. It is an executive one. I want to give you the conceptual framework that lets you ask the right questions and evaluate the answers vendors give you. The key distinction: training versus inference There are two fundamentally different things that can happen to data when it touches an AI model. Inference is what happens during normal use. You send a prompt. The model processes it using the knowledge and patterns it already has. It generates a response. Your data was processed, but it did not change the model. The model is no more or less capable after your interaction than it was before. Think of it like asking an expert a question: they used their knowledge to answer you, but they did not become a different expert because you asked. Training is different. Training is when data is used to update the model's internal parameters — to change what the model knows or how it responds. This is what actually shapes the model's behavior and capabilities. Training happens periodically, using large datasets, through a deliberate process. It is not what happens every time a user sends a prompt. The confusion between training and inference is responsible for most of the anxiety executives have about sending data to AI vendors. When an employee pastes a strategy document into an AI assistant, that document is used for inference — to generate the response. 
It is not, in that moment, training the model or making the model more likely to surface that information to other users. The question of whether your data is used for training is a separate one, governed by the vendor's policies and your agreement with them. When data does influence the model The concern about data "ending up in the model" is legitimate in one specific scenario: when the vendor uses interaction data to train future versions of the model. This practice is more common in consumer products than enterprise ones. Many consumer AI tools, under default settings, retain interaction data and may use it as part of the training pipeline for future model versions. This does not mean a competitor can directly query the model and retrieve your document. Training does not work like storing files in a searchable database. But your data, if used for training, has influenced the model's patterns in ways that are effectively irreversible and non-auditable. Enterprise agreements typically exclude this. When an organization purchases an enterprise license with a proper data processing agreement, the vendor generally commits to not using that organization's data for training purposes. This is one of the most important terms to verify in any AI vendor agreement, and one of the strongest reasons to ensure employees are using enterprise tiers rather than consumer accounts. The practical implication: the risk of your data influencing the model is primarily a function of which tier you are on and what your agreement says — not of using AI tools in general. What retention actually means Even when a vendor does not train on your data, they may retain it for a period. Understanding what retention means in practice matters for two reasons: regulatory compliance and the question of who can access the retained data. 
Vendors retain interaction data for different reasons: abuse prevention, conversation history for the user, debugging and quality assurance, and in some cases legal holds. The retention period varies from days to years depending on the product and the settings. What the retained data can be used for is defined in the vendor's privacy policy and data processing agreement.

The key questions are: Can vendor employees access the content of retained interactions? Under what circumstances? Are there audit logs of such access? What are the deletion terms: can you request deletion, and is it complete?

These are not abstract questions. An employee sending sensitive content to an AI tool is creating a record that exists in the vendor's infrastructure for some period. If that infrastructure is breached, or if the vendor is subject to legal process, that record is potentially accessible. The same employee would not dream of emailing that content to a stranger. But the AI tool does not feel like an external party; it feels like a private tool.

The retention question is also where GDPR and similar regulations create specific obligations. Any interaction containing personal data is a transfer of personal data to a third-party processor. That transfer requires a legal basis, a data processing agreement, and compliance with data subject rights including deletion. Most organizations have not mapped their AI tool usage against these obligations.

The questions a CTO should ask every AI vendor

The framework above translates into a specific set of questions that should be part of any AI vendor evaluation:

- Is interaction data used for training future models? Under what conditions? What controls does the customer have over this? This is the most important question. Get the answer in writing, as a contractual commitment, not as a verbal assurance.
- What is the data retention period for interaction data? Can this be configured? What are the deletion rights and processes? What confirmation is provided when deletion is complete?
- Who within the vendor organization can access the content of customer interactions? Under what circumstances? Are there access logs? What are the procedures if vendor employees need to access content for support or debugging?
- Where is the data processed? This matters for regulatory compliance. Data about EU residents processed in jurisdictions without an adequacy decision creates specific compliance obligations that need to be managed.
- What happens to retained data in the event of the vendor being acquired, going out of business, or being subject to legal process? Where does customer data fall in those scenarios?
- What is the vendor's certification posture? SOC 2 Type II, ISO 27001, and similar certifications do not answer all of these questions, but they provide a baseline for security practices that matters for any serious enterprise evaluation.

The honest assessment

No AI tool is risk-free from a data perspective. Sending data to any third-party system involves some degree of information leaving your infrastructure, under terms you did not write, in systems you do not control. That is true of cloud storage, email services, and every other third-party tool the organization uses.

The question is whether the risk is understood, whether the terms are acceptable given the regulatory and contractual context, and whether the data classification of what is being sent is appropriate for the tier and agreement in place.

Using AI tools with enterprise data under a proper enterprise agreement with a reputable vendor is not the worst outcome. The worst outcome is using consumer-tier products with default settings, with sensitive data, without any of the contractual protections that make enterprise use manageable. Most organizations are currently somewhere in between.
The CTO's job is to understand exactly where on that spectrum the organization sits, and to move deliberately toward the part of the spectrum that is defensible.

What to take from this

- Training and inference are different. Using an AI tool to process data does not automatically mean that data trains the model. Whether it does depends on the vendor's policies and your agreement.
- The training exclusion is one of the most important terms in an enterprise AI agreement. Verify it explicitly; a verbal assurance is not sufficient.
- Retention means your data exists on vendor infrastructure for some period. Understand the retention period, access controls, and deletion rights for every tool in active use.
- Consumer and enterprise tiers of the same product often have materially different data handling terms. The tier distinction matters more than the vendor selection in many cases.
- Map AI tool usage against data protection obligations before the next regulatory review, not during it.

The executives who handle this well are the ones who moved past the surface-level anxiety about "AI knowing your data" and got specific about the mechanisms: what are the actual terms, what does retention mean, and what commitments can the vendor make in writing?
- 06 Jun, 2026
The Knowledge That Walks Out the Door When Employees Use AI on Client Work
Professional services firms — consulting, legal, accounting, advisory — have a specific relationship with client data that is different from most enterprise AI contexts. The data they handle belongs to their clients. The confidentiality obligations around it are contractual and, in many cases, professional and regulatory. The consequences of a breach are not limited to regulatory exposure; they extend to client trust, which is the fundamental asset in any advisory relationship. AI tools are now deeply embedded in how professional services work gets done. Analysts use them to accelerate research. Consultants use them to draft documents. Lawyers use them to review contracts. The productivity benefits are real and the competitive pressure to use them is significant. The risk accumulation that comes with that use is largely unaddressed. What client data actually flows through AI tools in professional services The volume and sensitivity of client data flowing through AI tools in professional services contexts tends to be higher than in most enterprise settings, because the work itself involves processing and analyzing client-owned information. Project research and analysis. Analysts feed client financial data, market analysis, and competitive benchmarks into AI tools to accelerate synthesis. The client's internal data, which the firm has received under a confidentiality agreement, enters a third-party AI system. Document drafting. Consultants use AI writing assistants to draft recommendations, presentations, and reports. The source material that informs the drafting — interview outputs, internal data, strategic context — is included as context for the tool. Contract review and legal analysis. Legal and advisory professionals use AI tools to review and summarize contracts, due diligence materials, and transaction documents. These materials contain some of the most sensitive information clients possess. Meeting summaries and communication assistance. 
Client meeting recordings processed through meeting AI tools. Client correspondence drafted with AI assistance. Internal discussions about client situations entered as context for AI queries. Each of these flows involves client-confidential data entering a third-party AI system. Most firms have not mapped this systematically. Many assume it falls under the general confidentiality terms in their client agreements without having verified that the AI tool's data processing terms are compatible with those obligations. The contractual gap that most firms have not closed Professional services firms operate under engagement letters and master services agreements that include confidentiality provisions. These provisions were written before AI tools existed in their current form. They typically cover how the firm handles client confidential information: where it is stored, who has access, what the firm's obligations are around disclosure. What they almost never address: whether the firm can process client confidential information using third-party AI tools, and if so under what conditions. This creates a gap. The firm has agreed to keep client information confidential. The firm's employees are feeding that information to third-party AI systems. Whether that constitutes a breach of the confidentiality provisions depends on the specific language and how it would be interpreted, which is not a comfortable analysis to be doing reactively. Some clients are now asking about this proactively in RFPs and at the start of engagements. Firms that have a clear, honest answer to the question "do your employees use AI tools when working on our engagement, and if so how is our data handled" are in a better position than those who have not worked out an answer. The knowledge residue problem There is a second dimension to this risk that is less obvious than the direct data exposure question. 
When an employee works with client information through an AI tool over the course of an engagement, the contextual knowledge they develop about the client's situation is richer and more detailed than it would be if they had processed the same information manually. The AI tool allows them to work across more data, make more connections, and develop a more comprehensive understanding than time would have permitted through manual analysis. This enriched understanding lives in the employee's head when they walk out the door. When that employee moves to a competitor or, in certain conflict situations, works on a client in a similar competitive situation, the depth of knowledge they carry creates an exposure that goes beyond the normal knowledge transfer risk. The firm cannot fully control what employees internalize through their work. That has always been true. AI tools increase the depth and breadth of what an employee can internalize over a fixed period of time. The risk management implications are subtle but real. What governance looks like in practice The minimum governance framework for a professional services firm using AI tools on client work: An explicit AI use policy that covers client work. This should specify which AI tools are approved for use on client matters, what categories of client data can be processed through AI tools under what conditions, and what the data handling terms are for approved tools. This is different from the general employee AI policy — it needs to address the confidentiality obligations that are specific to client engagements. Client engagement agreement updates. The confidentiality provisions in engagement letter and master services agreement templates need to be updated to address AI tool use. At minimum, the provisions should not preclude AI use in ways that are inconsistent with how work is actually being delivered. 
Better than that: the provisions should address AI tool use explicitly, with appropriate confidentiality protections around how client data is handled within those tools.

Client disclosure for high-sensitivity matters. For engagements involving particularly sensitive information (M&A transactions, regulatory matters, litigation, restructuring), the engagement team should have a protocol for discussing AI tool use with the client and obtaining explicit confirmation about what is acceptable.

Employee education that is specific to client work. General AI use training does not address the confidentiality implications specific to professional services. Employees handling client confidential information need to understand what AI tools they can use, with what data, under what terms, and what their obligations are when in doubt.

The question clients are starting to ask

The most direct signal that this needs to be addressed now: clients are beginning to ask about it. Not often, but the frequency is increasing, and the questions are getting more specific.

"Does your team use AI tools when working on our matters?"
"If so, does our confidential information enter those AI systems?"
"What are the data handling terms for the AI tools you use, and how do they interact with your confidentiality obligations to us?"

A firm that has thought about these questions and has clear answers is in a different position from one that has to formulate an answer under client scrutiny. The latter tends to produce either an evasive answer that damages trust or a defensive answer that raises more questions than it resolves.

What to take from this

- Map what client data is flowing through AI tools on active engagements. The volume is almost certainly higher than any single partner or manager would estimate.
- Review whether existing client confidentiality provisions are consistent with how AI tools are actually being used in client delivery. The gap is likely to be meaningful.
- Update engagement agreement templates to address AI tool use explicitly, before clients start asking for it in contract negotiations.
- Develop a protocol for client disclosure on high-sensitivity matters. The default should be proactive transparency, not reactive disclosure.
- Train client-facing staff specifically on the confidentiality implications of AI tool use in their context. Generic AI training is not sufficient for professional services.

The firms that handle this well are not necessarily the most cautious ones. They are the ones that have been honest about how AI is being used in client delivery, have updated their agreements to reflect that, and can answer client questions about it clearly and without hesitation.
- 05 Jun, 2026
How to Classify Your Data Before Your AI Program Does It for You
Data classification is one of those governance practices that most organizations have in some form and almost none have in a form that is adequate for AI. The gap matters because AI deployment without a working classification framework creates a specific category of problem: the system treats all accessible data as equivalent input, and the outputs reflect that indiscriminateness in ways that are difficult to predict and costly to remediate after the fact. The CIO who gets this right before the AI program starts is in a very different position from the one who inherits a classification gap when the first incident surfaces. Here is what a practical classification approach looks like when AI deployment is the specific forcing function. Why existing classification frameworks usually fall short Most organizations have some form of data classification. The typical structure is a four-level hierarchy: public, internal, confidential, and restricted. Documents get tagged — or are supposed to get tagged — at one of these levels. Access controls are set accordingly. This framework was designed for a world where humans navigate information deliberately. You look for a document, you find it, you read it. The sensitivity of what you see is a function of where you went to look. AI tools do not navigate information that way. They can process everything they have access to simultaneously, surface connections between data sources that were never designed to be combined, and produce outputs that reflect the aggregate of what they have seen rather than any single document. The sensitivity classification of individual documents does not translate cleanly into the sensitivity of an AI system's outputs. There are three specific failure modes I see in organizations that try to apply existing classification frameworks to AI deployment. Permission-level accuracy. 
Existing classification may reflect the intention of who should access what, but actual permissions often diverge from the classification framework over time. Documents move between folders. Projects end and access is not revoked. Distribution lists grow and are not pruned. When an AI system is given access to everything a user can access, it inherits this divergence between intended and actual permissions. Output sensitivity. A document classified as "internal" might, in combination with five other documents also classified as "internal," produce an AI output that reveals information that would have been classified "confidential" if anyone had written it down directly. The classification framework addresses individual document sensitivity but not the sensitivity of AI-generated synthesis. Dynamic content. AI systems that connect to live data sources — CRMs, financial systems, email archives — encounter content that has never been classified at all, because classification was designed for documents rather than data records. Building a classification framework for AI specifically A classification framework that works for AI deployment needs to answer three questions that the standard framework typically does not. What can this data type be used for in AI context? Rather than a single sensitivity level, each data category needs a set of permitted AI use cases. Client financial data might be appropriate for internal analytics AI but not for a tool that produces externally shared outputs. Personal data might be appropriate for a tool with data processing agreement coverage but not for one without it. The permitted use case dimension is specific to AI and does not exist in traditional classification frameworks. What combinations create elevated sensitivity? Certain combinations of data categories produce outputs that are more sensitive than any individual category. 
A practical classification framework for AI should identify the high-risk combinations and set explicit controls around AI systems that can access both. What is the real-time classification status? For live data sources, the classification question is not just "what is this data type" but "what is the current state of this specific record, and does that affect what AI can do with it." A client record that includes active litigation flags, for example, may need to be treated differently than a standard client record even if the data type is classified the same way. The practical approach Doing this well does not require a multi-year data governance program. It requires a focused exercise tied directly to the AI deployment timeline. Here is what that looks like. Start with the AI system's data access scope. Before classifying anything, define what data sources the AI system will be connected to. The classification exercise is scoped to those sources. Everything else can wait. Map the sensitive data categories within scope. For each data source the AI will access, identify what sensitive categories exist: personal data, commercially sensitive data, legally privileged material, client confidential data, regulated financial data. This is an inventory exercise, and it usually reveals data in places people did not expect it. Define permitted use cases for each category. For each sensitive category, specify what the AI system is and is not permitted to do with it. This becomes the basis for technical controls — what data the system can retrieve, what it can include in outputs, and what it should exclude or flag. Build the combination rules. Identify the high-risk combinations and set rules for how the AI system handles them. This is the hardest part and the one most often skipped. Spending a day on this with the CIO, the data protection officer, and the AI system owner is worth it. Implement classification tags as technical controls. 
The classification decisions need to be expressed as technical constraints that the AI system respects, not just as policy documentation. A policy that says "the AI should not include client financial data in externally visible outputs" is unenforceable unless the system is technically configured to prevent it.

The CIO's role in making this work

Data classification for AI is not a project the technical team can own independently. The decisions about which data categories can be used for which AI purposes require input from legal, compliance, and the business functions that own the data. The CIO's role is to convene those conversations and drive them to decisions before the AI system goes live, not after.

The alternative, deploying the AI system and addressing classification issues as they surface, is more expensive and more disruptive. When an AI system produces an output that reveals information it should not have had access to, the response involves technical remediation, incident investigation, potential regulatory notification, and organizational credibility damage. All of which are harder than running the classification exercise before deployment.

The time required for a focused data classification exercise scoped to a specific AI deployment is typically two to four weeks for a system with well-defined data access scope. That is a reasonable investment given the alternative.

What to take from this

- Existing data classification frameworks were designed for human navigation of information. They do not translate directly to AI access, which aggregates and synthesizes rather than navigates.
- Classification for AI needs to address permitted use cases, high-risk combinations, and live data: three dimensions that standard frameworks typically do not cover.
- Scope the classification exercise to the AI system's data access, not the organization's entire data estate. A focused exercise is achievable in weeks; an organization-wide program is not.
- Classification decisions need to be expressed as technical constraints, not just policy documentation. A policy without technical enforcement is not a control.
- The CIO needs to convene legal, compliance, and business data owners in the classification exercise. The decisions require input from all of them, and making them without that input produces gaps.

The organizations that get AI deployment right are not the ones with the most comprehensive data governance programs. They are the ones that did the focused, practical work of understanding their data before they connected it to a model, and made deliberate decisions about what that meant for acceptable use.
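
For readers who want to see what "classification as a technical constraint" could look like, here is a minimal Python sketch of the three dimensions discussed in this article: permitted use cases per category, high-risk combinations, and a live record flag. Every name in it (the categories, the use cases, the `active_litigation` flag, the `Record` type) is a hypothetical chosen for illustration, not a reference to any real product or standard. A real deployment would enforce this in the retrieval layer of whatever AI system is in use.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical data categories mapped to the AI use cases they are permitted for.
PERMITTED_USES = {
    "client_financial": {"internal_analytics"},
    "personal_data": {"internal_analytics"},
    "public": {"internal_analytics", "external_output"},
}

# Assumed example of categories that are more sensitive together than alone.
HIGH_RISK_COMBINATIONS = [frozenset({"client_financial", "personal_data"})]


@dataclass
class Record:
    record_id: str
    categories: set[str]
    flags: set[str] = field(default_factory=set)  # live status of this record


def allowed_for_use(record: Record, use_case: str) -> bool:
    """Return True only if every category on the record permits the use case."""
    if not record.categories:
        return False  # unclassified content is denied by default
    if "active_litigation" in record.flags:
        return False  # live status overrides the category-level rule
    return all(use_case in PERMITTED_USES.get(c, set()) for c in record.categories)


def filter_for_use(records: list[Record], use_case: str):
    """Drop records not permitted for the use case, then report risky combinations."""
    permitted = [r for r in records if allowed_for_use(r, use_case)]
    seen = set().union(*(r.categories for r in permitted))
    combos = [combo for combo in HIGH_RISK_COMBINATIONS if combo <= seen]
    return permitted, combos


records = [
    Record("a", {"client_financial"}),
    Record("b", {"personal_data"}),
    Record("c", {"public"}),
    Record("d", {"public"}, {"active_litigation"}),
    Record("e", set()),
]
external, external_combos = filter_for_use(records, "external_output")
internal, internal_combos = filter_for_use(records, "internal_analytics")
```

In this sketch only record `c` survives the `external_output` filter, while the `internal_analytics` filter passes `a`, `b`, and `c` but reports that the combined set crosses a high-risk combination. What to do with that report (block, redact, or route to review) is a governance decision the code deliberately leaves open.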
- 04 Jun, 2026
Training AI on Proprietary Data: What You Gain and What You Give Up
At some point in most enterprise AI programs, someone proposes using the organization's own data to improve the model. The pitch is compelling: a model trained on your internal documentation, your historical decisions, your domain-specific knowledge will perform better on your actual work than a general-purpose model that knows nothing about your context. The pitch is not wrong. Done well, training or fine-tuning a model on proprietary data can produce a genuinely better system. The problem is that "done well" requires a set of decisions that most organizations have not thought through, and the downside of doing it carelessly — from data exposure, from quality degradation, from regulatory exposure — is substantial. This is not an argument against using proprietary data to improve AI systems. It is an argument for approaching that decision with the same rigor you would apply to any decision about where your most valuable data goes. What training on proprietary data actually involves There are a few different mechanisms by which proprietary data can improve model performance. The distinctions matter for understanding the exposure. Retrieval-augmented generation (RAG). The model is not trained on proprietary data at all. Instead, a retrieval system fetches relevant internal documents at query time and provides them as context for the model to work from. The proprietary data sits in a controlled index that the organization manages. The model itself stays unchanged. This approach avoids most of the training data risks while providing significant performance improvement on domain-specific tasks. Fine-tuning. The model's internal weights are adjusted using proprietary data. The model itself changes — it becomes better at tasks that reflect the patterns in the training data. The proprietary data has, in a meaningful sense, been absorbed into the model. This is more powerful than RAG for certain tasks, and considerably more complex from a data governance standpoint. 
Full training from scratch. Building a model entirely on proprietary data, without starting from a pre-trained foundation. This is rare at the enterprise level — the compute and data requirements are significant — but it does happen for organizations with specialized domain requirements and the resources to invest. The data exposure implications are very different across these approaches. RAG keeps proprietary data in an index the organization controls. Fine-tuning moves it into the model weights. The latter is where the tricky questions live. What gets encoded in a fine-tuned model When data is used to fine-tune a model, the specific documents and content do not get stored inside the model as retrievable files. The model cannot be queried to reproduce its training data verbatim — in most cases. But the training data has shaped the model's behavior in ways that are difficult to audit or reverse. This matters in two respects. First, sensitive information can influence the model's outputs in ways that are hard to anticipate. A model fine-tuned on internal strategy documents may, when asked questions that touch on those topics, produce outputs that reflect internal strategic positions without anyone intending to reveal them. The model is not leaking documents — it is producing outputs shaped by the patterns in documents it has seen. The effect is subtle and hard to trace. Second, fine-tuned models can, under certain adversarial conditions, be prompted to surface information that reflects their training data more directly than normal. This is not a theoretical risk — it is an active research area, and the techniques for eliciting training data from fine-tuned models are improving. Organizations that fine-tune on genuinely sensitive data on models hosted by third parties should factor this risk into their assessment. 
The vendor relationship in fine-tuning When an organization fine-tunes a model that is hosted by a cloud AI vendor, several questions about the proprietary data need to be answered clearly before proceeding. Does the vendor use the fine-tuning data to improve their base models? Most enterprise agreements exclude this, but it should be a named contractual commitment. Who can access the fine-tuning data? During the fine-tuning process, the data is processed on the vendor's infrastructure. Access controls, the circumstances under which vendor employees can access training content, and the security measures around the fine-tuning pipeline should all be part of the assessment. What happens to the fine-tuning data after training is complete? Is it retained, for how long, and can it be deleted? What does deletion mean in the context of a model that has already been trained on it — a question vendors are not always able to answer cleanly, because once data has influenced model weights, the effect cannot be fully reversed. Where is the fine-tuned model stored, and who controls it? The fine-tuned model is organizational IP. The vendor relationship needs to be clear about ownership, portability, and what happens to the fine-tuned model if the organization ends the vendor relationship. Data quality is the amplification lever One risk that gets less attention than the security questions: the quality and composition of the training data determine whether fine-tuning improves or degrades model performance. Organizations that rush fine-tuning treat it as an input optimization problem — more data is better. That is incorrect. Noisy, inconsistent, or poorly representative training data produces a fine-tuned model that performs worse than the base model on the tasks that matter, and sometimes produces dangerous outputs in domains where the training data was biased or incomplete. 
I have seen organizations fine-tune models on documentation that contained outdated processes, contradictory guidance, and examples of known-bad decisions that were documented as cautionary tales rather than as positive examples. The resulting model incorporated those patterns along with the useful ones. The outputs were confidently wrong in ways that were hard to trace back to the training data quality issue.

Before committing to fine-tuning, the data curation question deserves as much attention as the technical process. What data will be included, what will be excluded, who reviews the training set for quality and appropriateness, and how will the fine-tuned model's behavior be evaluated against the base model are all part of a credible fine-tuning program.

When it is worth it

Fine-tuning on proprietary data is worth the complexity when two conditions are both true: the performance improvement on the target task is significant and demonstrable, and the data exposure risks can be managed through a combination of vendor agreement terms, data classification, and governance.

RAG should be the default approach for most enterprise use cases. It provides substantial performance improvement without the fine-tuning data exposure risk, and it is operationally simpler to maintain and update. Fine-tuning becomes worth the investment when the task requires deep pattern recognition across the organization's specific domain that a retrieval approach cannot fully replicate.

The decision to fine-tune should not be made by the AI team alone. It needs the CTO's visibility into the data exposure implications, legal review of the vendor terms around training data, and a clear data classification decision about which content is and is not appropriate as training material.

What to take from this

- RAG and fine-tuning are different. RAG keeps proprietary data in an index the organization controls. Fine-tuning moves patterns from that data into model weights. They have different data exposure profiles.
- A model fine-tuned on sensitive data can surface that information in ways that are hard to predict or audit. This is not a reason to avoid fine-tuning, but it is a reason to be deliberate about what goes into the training set.
- When fine-tuning on a third-party hosted model, get contractual clarity on training data use, retention, and what happens to the data and model at contract end.
- Data quality in the training set is as important as security. Noisy, inconsistent, or outdated data produces a fine-tuned model that may perform worse than the base model.
- Make fine-tuning a CTO-level decision with legal review, not a technical team default. The data exposure implications require organizational sign-off, not just technical execution.
- 01 Jun, 2026
Shadow AI: What's Already Running in Your Organization
Before any formal AI program exists, before the steering committee has its first meeting, before the CIO has signed off on a single vendor — your employees are already using AI. Not because they are reckless. Because they have work to finish. I have walked into organizations that spent six months building an AI governance framework and discovered, halfway through the engagement, that the finance team had been using a consumer large language model to draft board reports for the past year. The legal team was using a different one to summarize contracts. Neither team thought they were doing anything wrong. They were getting work done faster. That is the shape of shadow AI. It does not announce itself. It does not appear in your IT asset register. It shows up in the gap between the work people need to do and the tools they have been given to do it. What shadow AI actually looks like The term sounds covert. It rarely is. Most shadow AI use is visible if you look for it — people pasting client briefs into ChatGPT, uploading PDFs to summarization tools, running AI writing assistants over strategy documents. Nobody is hiding anything. They just do not think of it as an IT procurement decision. The categories I see most often: Consumer AI assistants used for drafting, summarizing, researching, and explaining technical material. These are usually the first tools employees reach for because they already use them outside work. The friction is zero. AI features embedded in software people already have — writing assistants in productivity suites, AI tools inside project management and communication platforms. These arrive via a product update, not a procurement decision, and most organizations do not notice until they are already in use. Specialist tools for specific functions: AI contract review, AI coding assistants, AI research tools, AI presentation builders. These typically start as a free trial one person tries. 
Three months later the whole team is using them, usually without telling anyone. The common thread: easy to access, fast to start, and they solve a real problem. Nobody waits for procurement when they have a real problem. The data that leaves when they do Here is the question I put to leadership teams: what data has left your organization in the last 90 days through AI tools? Almost nobody has an answer. Every prompt sent to a third-party AI tool is data that has left the building. Every document uploaded for summarization. Every contract pasted in for analysis. Every email thread fed to an assistant for context. This is not theoretical exposure. It is live, ongoing, and unmeasured. The specific risk depends on the tool. Some consumer AI products train future models on user inputs by default unless the user has explicitly opted out — a setting most enterprise users have never opened. Some retain query data for extended periods for internal product improvement. Some have data residency terms that have no relationship to the organization's regulatory obligations. Most employees have read none of this. What makes shadow AI data exposure different from other shadow IT is the combination of volume and sensitivity. When someone uses an unsanctioned SaaS product, they tend to generate new data inside that system. When someone uses an unsanctioned AI tool, they are typically feeding existing sensitive material — client information, financial projections, internal strategy — into a system with unknown retention, processing, and training terms. A CTO told me once: "We spent a year tightening our cloud storage permissions, and the whole time people were copying strategy documents into a consumer AI chat." He was not exaggerating. He had run the discovery work. The decision quality problem The data risk is one issue. The decision quality risk is a separate one. When employees use AI tools that have not been evaluated or approved, the outputs they receive carry no governance. 
There is no audit trail of what the model was asked, what it returned, or how that output influenced a decision. Nobody has tested whether the tool performs reliably on the organization's specific data domain. Nobody has checked whether outputs are factually accurate, whether knowledge cutoffs create blind spots, or whether model behavior introduces biases relevant to the use case. I have seen board presentations drafted with consumer AI assistance that contained subtly incorrect market figures — the kind of error that is hard to catch if you are not already deeply familiar with the material. I have seen contract summaries that missed jurisdiction-specific clauses because the model lacked coverage of that legal context. None of these produced immediate disasters. But they were invisible errors that reached decision-makers before anyone thought to check the source. The problem is not that the tools are necessarily unreliable. The problem is that nobody defined what "reliable enough" looks like for this use case, nobody validated that the tool clears that bar, and nobody has visibility into which decisions have been shaped by which tools. Why a policy does not solve this on its own Every organization that discovers shadow AI responds the same way: they draft a policy. Employees must not use AI tools without prior approval. All AI tools must go through procurement. No sensitive data should be uploaded to external AI systems. These policies are not wrong. They are just insufficient on their own. A policy without detection capability is a statement of intent, not a control. If you cannot observe what tools are in use, you cannot enforce anything. And the detection problem is real — consumer AI tools operate over standard encrypted web traffic that looks indistinguishable from any other browser activity on most network monitoring setups. A policy without a usable alternative is a speed bump, not a barrier. 
If employees are turning to shadow AI because the approved tooling is slow, limited, expensive, or simply does not exist yet, a policy telling them to stop will reduce usage briefly and then have no effect. People optimize for their work. A policy without a clear explanation of why tends to generate resentment rather than behavior change. If employees do not understand what the actual risk is — not just that "data security is important" but specifically what could happen to the specific data they are handling — they will weigh an abstract policy against a concrete productivity benefit and find the policy unconvincing. What actually works There are four interventions that change the situation. Not instead of policy — alongside it. Run a discovery exercise before making any decisions. You need to know what is actually in use before designing controls. This means endpoint monitoring, network traffic analysis, and honest conversations with department heads. Expect surprises. The goal is not to catch anyone — it is to understand your real exposure before you design a response to it. Move quickly on a sanctioned alternative. The fastest way to reduce shadow AI use is to provide a better approved option. This does not require the best enterprise AI platform with a six-month procurement timeline. It means the minimum viable approved tool that addresses the main use cases driving shadow adoption — often that is simply a properly configured, privacy-compliant version of the same tool people are already using. Create a fast path for new tool requests. The reason shadow AI persists is that the formal route takes too long. Teams wait months for IT to evaluate a tool they need now. Make the process faster and more transparent. Most requests should get a decision within two weeks. The ones that cannot should at least get a clear explanation of why not. Treat the people using shadow AI as a signal. They are telling you where your official tooling is falling short. 
Employees using unauthorized AI tools are often the highest performers trying to work better. Treating them as a compliance problem to be managed misreads what is happening. Their behavior is product feedback.

What to take from this

Audit shadow AI use before designing governance for it. You need to know what is already running before you build controls around it. Consumer AI data terms are not written for enterprise compliance. Read them, specifically the sections on input retention, training use, and data residency, before employees continue uploading sensitive material. A policy without detection is not a control. Invest in observability first, then communicate the policy. The fastest fix is a sanctioned alternative that works. Prohibition without substitution creates resentment, not compliance. The employees using shadow AI are showing you where your approved tooling has gaps. Use that information when planning what to procure next.

The organizations I see handle this well are not the ones that moved fastest to write a policy. They are the ones that ran the discovery work, understood their actual exposure, and moved quickly to close the gap between what employees needed and what they were officially allowed to use. That gap is where shadow AI lives.
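For teams that want to see what the network side of a discovery exercise might involve, here is a minimal sketch. It assumes a web proxy or endpoint agent that records the department and destination URL for each request, which many monitoring setups do not provide out of the box. The log format, the watchlist domains, and the function name `count_ai_tool_requests` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical watchlist. A real list would come out of your own discovery work.
AI_TOOL_DOMAINS = {"chat.example-ai.com", "summarize.example-ai.net"}


def count_ai_tool_requests(log_lines):
    """Count requests per (department, domain) for watchlisted AI tool domains.

    Assumes each log line looks like "<department> <url>".
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip malformed lines rather than guessing
        department, url = parts
        host = urlparse(url.strip()).hostname
        if host in AI_TOOL_DOMAINS:
            counts[(department, host)] += 1
    return counts


# Made-up sample log lines, for illustration only.
sample_log = [
    "finance https://chat.example-ai.com/session/1",
    "finance https://chat.example-ai.com/session/2",
    "legal https://summarize.example-ai.net/upload",
    "legal https://intranet.example.org/wiki",
]
usage = count_ai_tool_requests(sample_log)
for (department, host), n in sorted(usage.items()):
    print(department, host, n)
```

A count like this only shows where to start the conversation with a department head. It says nothing about what data was sent, which is why the interviews matter as much as the logs.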
- 28 Apr, 2026
How to Build an AI Data Governance Framework Executives Will Actually Use
Data governance frameworks are one of the most reliably underused artifacts in enterprise AI programs. They get built, often with genuine care and significant effort, and then they get reviewed annually by the compliance team and consulted by nobody else. The problem is not usually the content. The problem is who the framework is written for and how it connects, or fails to connect, to the decisions that actually need to get made. Most data governance frameworks are written for compliance teams. They are thorough, they are precise, and they are not the thing an executive reaches for when they need to decide whether a specific AI use case is appropriate. They are also not the thing a business line manager references when they are trying to figure out whether they can use a new AI tool with client data. An AI data governance framework that actually works does two things differently. It is designed around the decisions that need to happen, not the principles that are supposed to guide them. And it has ownership that is connected to actual authority. Why most frameworks fail to produce decisions The typical AI data governance framework includes a set of principles: data minimization, purpose limitation, appropriate security, transparency in AI use. These principles are correct. They do not produce decisions. When a business line manager wants to deploy an AI tool for a new use case, they need to know: is this approved, under what conditions, and who decides if I am not sure? A principles document does not answer any of those questions. The manager does one of two things: they either escalate to a committee that meets monthly and responds six weeks later, or they proceed without asking because the approval path is too unclear to bother. The outcome of the first path is governance that moves at the wrong pace. The outcome of the second is governance that does not exist in practice.
An effective AI data governance framework is built backwards from the decisions that need to get made: what use cases are pre-approved, what use cases require individual review, who conducts that review, and what criteria they apply. The principles inform the criteria, but the framework is organized around the decision structure. The ownership model that actually works Data governance for AI requires ownership at three levels, and the levels need to be connected. Executive sponsor. One member of the executive team owns AI data governance as a responsibility, not as a title. This person ensures the framework is consistent with the organization's risk appetite, resolves escalations that the operational governance structure cannot, and is accountable to the board for the organization's AI data governance posture. Without this person, governance decisions pile up in committee and do not get resolved. Operational owners. The CIO and CTO share operational ownership of the framework — the CIO for data classification, access controls, and compliance with data protection obligations; the CTO for AI system architecture, vendor data terms, and technical controls. These two need to work together consistently, which means shared visibility into AI deployments and a clear division of the decisions that sit with each. Data owners by domain. For each major data category — client data, HR data, financial data, legal material — a specific owner is accountable for decisions about AI use in that domain. This person is not the CIO or CTO; they are typically the head of the business function that owns the data. They approve use cases, review exceptions, and escalate issues that require executive judgment. The framework only works if these three levels are connected through a clear escalation structure and meet at a cadence that matches the pace of AI deployment decisions in the organization. 
The decision structure: the practical center of the framework The most useful component of any AI data governance framework is a decision matrix: which use cases and data types fall into which approval category. Pre-approved. Use cases that are within defined parameters and require no additional review before deployment. These should be clearly specified: which AI tools, with which data categories, under which conditions, are automatically approved. The goal is to move the routine decisions out of the governance process entirely, so the governance process can focus on the non-routine ones. Expedited review. Use cases that require review but can be processed within a defined short timeframe — five to ten business days. The review criteria should be pre-specified so that the review is a check against criteria rather than a fresh analysis from first principles. Most new use cases should fall here. Full governance review. Use cases involving novel data categories, significant regulatory complexity, or high-sensitivity data that require a more thorough assessment. These should be rare if the pre-approved and expedited categories are well-designed. Prohibited. Use cases that are not permitted under any conditions, or not permitted until specific controls are in place. Making these explicit removes them from the case-by-case decision space. The matrix should be a reference document that people actually consult — short, decision-oriented, updated regularly as the landscape changes. What makes governance visible to executives Executives do not engage with governance frameworks through documentation. They engage through metrics, through escalations, and through the questions they ask in governance meetings. The metrics that matter: how many AI use case reviews were completed in the period, at what pace, with what outcomes? How many active AI deployments have been reviewed under the framework and how many have not? 
What is the current status of high-risk AI deployments relative to the framework's requirements? These are the questions the executive sponsor should be asking at governance review meetings. If the CIO cannot answer them, the governance program does not have adequate visibility into what is happening. The escalation structure is equally important. When a business line manager hits a governance decision they cannot make at their level, the path to getting an answer needs to be fast and clear. A governance framework that requires a monthly committee meeting to resolve a time-sensitive deployment decision is not fit for the pace at which AI deployment happens. Keeping it current without making it a burden AI data governance frameworks go stale quickly. Vendor terms change. New AI capabilities create new use cases. Regulatory guidance evolves. The framework needs a maintenance mechanism that keeps it current without requiring a major review process every time something changes. The practical approach: designate the operational owners (CIO and CTO) as responsible for maintaining the framework, with a quarterly review cycle and a clear process for minor updates between cycles. The executive sponsor reviews major changes. The board sees an annual summary. The review cycle for specific elements of the framework should be driven by trigger events (a new major AI deployment, a significant regulatory development, a governance incident) rather than purely by calendar.

What to take from this

Build the framework around the decisions that need to happen, not the principles that inform them. A decision matrix that tells people what is pre-approved, what needs review, and what is prohibited is more useful than a comprehensive principles document. Name an executive sponsor with genuine accountability, not an oversight committee with diffuse responsibility. Committees defer decisions; sponsors make them. Data owners by domain need to be part of the governance structure.
The head of the business function that owns the data is better positioned to make AI use case decisions for that domain than a central technology function. Build governance metrics into the executive review agenda. If the CIO cannot answer questions about active AI deployment coverage at a governance meeting, the oversight is insufficient. The escalation path from a business line manager to a governance decision needs to be fast enough to match the pace of AI deployment. If the answer takes six weeks, managers will stop asking.

The organizations with effective AI data governance are not the ones with the most comprehensive frameworks. They are the ones that have built governance around how decisions actually get made in their organization, rather than how they are supposed to get made according to the framework.
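For readers who want to see how small the decision matrix can be, the four approval categories described in this post can be sketched as a simple lookup. This is an illustration only. The tool names, data categories, and entries are hypothetical, and routing unlisted combinations to expedited review is my assumption, based on the point that most new use cases should fall there.

```python
from enum import Enum


class Approval(Enum):
    PRE_APPROVED = "pre-approved"
    EXPEDITED = "expedited review"
    FULL_REVIEW = "full governance review"
    PROHIBITED = "prohibited"


# Hypothetical entries keyed by (tool, data category). A real matrix would be
# agreed by the executive sponsor, the operational owners, and the data owners.
DECISION_MATRIX = {
    ("approved-assistant", "public"): Approval.PRE_APPROVED,
    ("approved-assistant", "internal"): Approval.PRE_APPROVED,
    ("approved-assistant", "client"): Approval.EXPEDITED,
    ("approved-assistant", "hr"): Approval.FULL_REVIEW,
    ("consumer-chatbot", "client"): Approval.PROHIBITED,
}


def classify(tool, data_category):
    """Return the approval category for a proposed use case.

    Assumption: combinations not listed in the matrix go to expedited review.
    """
    return DECISION_MATRIX.get((tool, data_category), Approval.EXPEDITED)


print(classify("approved-assistant", "internal").value)
print(classify("consumer-chatbot", "client").value)
print(classify("new-tool", "financial").value)
```

The point of keeping it this small is that a manager can get an answer in seconds for the routine cases, and the governance process only sees the ones that need judgment.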
- 14 Apr, 2026
The Internal Data Access Problem That AI Makes Suddenly Visible
Access controls in most organizations work on a document-by-document basis. You have permission to read a file or you do not. The logic has been sufficient for most purposes because humans navigate information deliberately — they go looking for specific things and find what they have access to. AI tools have broken that model without anyone changing any permissions. When an AI system with broad read access is asked a question, it does not navigate to a specific document. It queries across everything it can reach, synthesizes what is relevant, and produces an answer. The access controls determine what the system can read. They do not determine what combinations it can surface, what inferences it can draw, or what aggregated view of the organization's data it can present to the user. The result is a category of access control failure that most organizations have not addressed, because the access controls themselves are technically correct — and still inadequate. The gap between technical access and intended visibility The cleanest way to describe the problem: in most organizations, there is a meaningful difference between what an employee technically has access to and what they were intended to be able to see. This gap exists because access management is messy in practice. Permissions accumulate over time as people join projects, take on new roles, and inherit access from reorganizations. Revocation processes lag behind changes. Distribution lists include people who should have rotated off. Shared drives created for one purpose get used for another. The intended access model and the actual permissions diverge, and in normal day-to-day work the gap is largely invisible because people go looking for things they need rather than systematically browsing everything they can reach. AI tools systematically browse everything they can reach. That is their function. 
An employee asking an AI assistant "what do we know about the performance review process for the engineering team" may receive an answer drawn from documents they technically have access to but were never intended to be the audience for — HR process documentation, individual feedback templates, comparative data that lives in a folder from an organizational design project two years ago that nobody cleaned up. The employee has not circumvented any security control. But they have seen something the access model was not designed to permit. The categories where this matters most HR and compensation data. Salary information, performance ratings, disciplinary records, and individual feedback exist throughout organizations in documents with permissions that were set for a specific purpose and have often drifted since. AI systems connected to broad document repositories will find this material and surface it in response to queries that touch on it. Legal and privileged material. Legal advice, litigation strategy, settlement terms, and attorney-client communications often exist in places that technically-authorized users can access for one purpose but should not be able to aggregate for another. The privilege protection may be legally intact — the employee can read the document — but the ability to synthesize across years of legal communications is a different kind of access. Financial data beyond role scope. Budget holders can typically access their own budget data. AI systems may surface aggregate financial data by drawing on individual documents each of which was appropriately accessible, producing a consolidated view that nobody intended to give the employee. Client and partner confidential information. Client files shared within engagement teams are accessible to all team members for legitimate work purposes. 
An AI system that can search across all engagement files simultaneously may surface patterns about client relationships, deal economics, or strategic situations that no single team member was supposed to see in aggregate. Why the standard response does not work The first response most organizations reach for is tightening access controls. If AI is exposing the problem, fix the permissions. This is not wrong, but it is not sufficient. The problem has two parts that require different responses. The first part is genuine permission drift that should be corrected regardless of AI. Employees who have retained access to systems and documents they no longer need should have that access revoked. This is an overdue access hygiene exercise, and AI deployment is a reasonable forcing function for doing it. The second part is structurally different. Even with clean, intentional permissions, an employee with access to many documents across an organization will technically have access to combinations of data that, when synthesized by an AI, reveal more than the permission model was designed to permit. You cannot solve this purely by tightening access, because the individual access grants may all be correct. The solution to the second part requires building constraints into the AI system itself: what categories of data it can include in synthesis across user queries, what aggregation rules apply, and what escalation or approval processes apply to queries that touch the highest-sensitivity categories. Building the right architecture Three things need to happen in parallel, not sequentially. Access control remediation. Run an access review scoped to the data sources the AI system will connect to. Specifically look for: permissions that predate current roles, broad read access granted for historical projects that is no longer needed, distribution list membership that has not been reviewed in over a year.
This will not solve the problem completely, but it reduces the surface area. AI-specific access boundaries. Define, at the AI system configuration level, what categories of data the system can use for synthesis in response to user queries. HR data, compensation data, legal documents, and individual performance information may be categories where even technically authorized access should not be available to the AI synthesis function. These boundaries need to be implemented as technical constraints in the AI system, not just as policy guidance. Query monitoring and anomaly detection. The AI system's query logs are, for the first time, making the access control problem visible. An employee who systematically queries for compensation data across a broad population, or who extracts patterns from legal files, shows up in the query logs in ways they would not show up in document access logs. This monitoring capability is new and should be used. What the CIO needs to drive The access control gap in AI deployments is fundamentally a CIO problem, not an AI team problem. The AI team can build a capable system. The CIO needs to ensure that the system's access to organizational data is deliberately configured rather than broadly permissive by default. Broadly permissive by default is the path of least resistance. It makes the AI system more capable and easier to demonstrate. It also creates the access control failures described above, and the first incident involving inadvertent disclosure of HR or financial data through an AI tool is going to be a painful conversation. The access architecture needs to be designed before the AI system goes live. The conversation about what categories of data the system should not be able to synthesize, even if individual documents in those categories are technically accessible, needs to happen with legal, HR leadership, and the CFO, not just the AI team.

What to take from this

Technical access controls determine what an AI system can read.
They do not determine what it will synthesize or surface. The gap between these is where the access control problem lives. Run an access control remediation exercise scoped to the AI system's data access before deployment. The permission drift would need cleaning up even without the AI deployment; AI just makes the urgency visible. Build AI-specific access boundaries into the system configuration. Some data categories should not be available for AI synthesis even if individual documents within them are technically accessible. Use AI query logs as an access monitoring tool. The visibility into what the system is being asked to surface is new and valuable. The CIO needs to own the access architecture decision, not delegate it to the AI team. The decisions about what data categories the AI should not aggregate require organizational input that the AI team is not positioned to provide alone.
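To make the difference between policy guidance and a technical constraint concrete, here is a minimal sketch of an AI-specific access boundary and a simple query-log check. It assumes every document carries a category label and a reader list, and that the query log records which category each query touched. The category names, the `Document` structure, the threshold, and both function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels. Which categories to exclude is a governance decision
# that needs legal, HR leadership, and the CFO, not just the AI team.
EXCLUDED_FROM_SYNTHESIS = {"hr", "compensation", "legal", "performance"}


@dataclass(frozen=True)
class Document:
    doc_id: str
    category: str
    readers: frozenset


def documents_for_synthesis(user, documents):
    """Keep documents the user may read AND whose category is allowed for synthesis."""
    return [
        d for d in documents
        if user in d.readers and d.category not in EXCLUDED_FROM_SYNTHESIS
    ]


def flag_sensitive_query_volume(query_log, threshold=3):
    """Return users with at least `threshold` queries touching excluded categories.

    `query_log` is an iterable of (user, category) pairs; the threshold is arbitrary.
    """
    counts = Counter(
        user for user, category in query_log if category in EXCLUDED_FROM_SYNTHESIS
    )
    return {user for user, n in counts.items() if n >= threshold}


docs = [
    Document("d1", "engineering", frozenset({"alice"})),
    Document("d2", "compensation", frozenset({"alice"})),  # readable, but excluded
    Document("d3", "engineering", frozenset({"bob"})),
]
allowed = documents_for_synthesis("alice", docs)
print([d.doc_id for d in allowed])

log = [("alice", "compensation")] * 3 + [("bob", "engineering"), ("bob", "hr")]
flagged = flag_sensitive_query_volume(log)
print(flagged)
```

Note that the filter runs in addition to the permission check, not instead of it. Alice can still open the compensation document directly; the AI system simply does not draw on it when it synthesizes an answer.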
- 24 Mar, 2026
The Confidentiality Risk in Your AI Productivity Rollout
The business case for an organization-wide AI productivity rollout usually focuses on time saved — hours of drafting, summarizing, and searching that employees no longer have to do manually. The productivity math is often compelling. The confidentiality implications rarely make it into the same document. I am not arguing against AI productivity tools. Most of the organizations I work with benefit from deploying them thoughtfully. I am arguing that the deployment decision and the confidentiality assessment need to happen together, not sequentially, because by the time the confidentiality issues surface in a live deployment, they are substantially harder to address. There are four areas where confidentiality exposure tends to materialize in AI productivity rollouts, and none of them are obvious from the outside. The permissions inheritance problem Most enterprise AI productivity tools integrate with the organization's existing data. A writing assistant that can access email and calendar content. A search tool that can query across the organization's document repositories. A meeting assistant that processes conversation recordings. The integration is the point — the tool needs access to data to provide value. The confidentiality problem is that the access often inherits existing permissions without anyone reviewing what those permissions actually cover. Organizational data permissions are almost never clean. Documents shared broadly during a project and never restricted afterward. Distribution lists with members who should have rotated off. Legacy permissions on systems that predate the current structure. This is normal; access controls accumulate over time and rarely get regularly pruned. When an AI productivity tool indexes the content that a user can access, it indexes everything they can access — including the content they technically have access to but were never meant to see in its entirety. 
When the tool then uses that content to answer queries, generate summaries, or surface relevant information, it may surface content in ways that exceed what the original permission model was designed to permit. I have seen this manifest in practice: an AI assistant that could search across an organization's document repositories began surfacing salary data in response to queries about a particular team, because the underlying HR documents were stored in a folder the user had access to for an unrelated historical reason. The user was not trying to find that data. The AI found it for them.

The aggregation problem

Individual pieces of information that are harmless in isolation can be sensitive in combination. AI productivity tools are particularly good at making the combination visible.

An employee with legitimate access to sales pipeline data, client meeting notes, and internal budget discussions does not normally see all of that information together in a synthesized form. They encounter it in different contexts, at different times, through different systems. The totality is there, but the cognitive effort required to combine it provides a natural friction.

An AI tool that can aggregate, summarize, and cross-reference across all of those sources removes that friction. The same employee can now, with a single query, see a synthesized view of their organization's client relationships, deal economics, and strategic priorities that no single document or system would have surfaced. This is not a bug in the tool; it is often the primary selling point.

The confidentiality question is whether there are categories of information where that aggregation creates an exposure that the access control model did not anticipate. The answer is usually yes, but nobody has looked.

The meeting and conversation record

Meeting AI tools (platforms that transcribe, summarize, and make conversation content searchable) have become common in enterprise deployments. The confidentiality implications deserve explicit attention before rollout.

Conversations that participants understood to be informal or confidential in the moment become searchable records. This matters in three contexts that are not always considered during rollout planning.

Board and leadership discussions. When these are processed by meeting AI tools, they create records of deliberations that may need to be protected under legal privilege or governance confidentiality obligations. Whether the tool's data handling terms are compatible with those obligations is often not reviewed.

Client and partner conversations. Many organizations use meeting AI tools for external calls without explicitly disclosing this to the other party. Recording requirements vary by jurisdiction, but the confidentiality implication extends beyond recording law: the content of client conversations is typically covered by confidentiality obligations in the client relationship. Where that content is stored, who can access it, and what the tool does with it are questions the client may reasonably want answered.

HR and sensitive personnel conversations. Performance discussions, disciplinary matters, and sensitive employee conversations processed by meeting AI tools create records that carry additional obligations around storage, access, and deletion.

The external output risk

AI productivity tools help employees produce external outputs faster. That productivity benefit creates a confidentiality exposure that tends to get overlooked: the risk that AI-assisted drafting incorporates confidential context that the author did not intend to share.

When an employee drafts a client proposal using an AI writing tool that has access to their full communication and document history, the tool may draw on that context in ways the author does not fully control or review. A proposal drafted with AI assistance might reflect information about the organization's pricing strategy, competitive positioning, or internal deliberations that no single author would have consciously included.

This is harder to observe than the other risks because it manifests in outputs that look normal and are not obviously different from what the employee would have written manually. The signal is subtle: small disclosures of internal context, references to information the recipient was not meant to have, framing that reflects internal discussions the author forgot they had consulted.

Running the confidentiality assessment before rollout

The practical steps that matter:

Review the permission state before enabling AI access to existing content. Specifically: which users have access to what, and are the existing permissions consistent with what was intended? The AI rollout is a good forcing function for an access control review that should have happened anyway.

Identify the sensitive data categories in scope. For each category (client data, HR data, financial data, legal and privileged content), assess whether AI tool access is appropriate and under what controls.

Check whether meeting recording disclosure is required. For external calls, understand the legal and relationship requirements in the relevant jurisdictions and configure the tool accordingly.

Establish a content review process for AI-assisted external documents. This does not have to be comprehensive; it should focus on the document types where inadvertent disclosure risk is highest.

Set explicit expectations with employees about what the tool is and is not appropriate for. Not a policy document nobody reads, but a short, specific briefing that describes the actual confidentiality risks and what to do about them.

What to take from this

- AI productivity tools inherit existing permissions. Review the permission state of the content they will access before enabling the rollout. You will find problems.
- Aggregation risk is real and is not obvious from reviewing individual access controls. Think about what combinations of accessible content look like when synthesized.
- Meeting AI tools create records of conversations that may carry confidentiality obligations the tool's data terms do not satisfy. Assess this before deployment, not after.
- AI-assisted external drafting can inadvertently incorporate confidential context. Build a light-touch review step into the document production workflow for the highest-risk document types.
- The business case and the confidentiality assessment need to happen simultaneously. Running the confidentiality review after the deployment decision has been made tends to surface problems at the wrong point in the process.
- 19 Mar, 2026
How AI Vendors Use Your Data: Contract Versus Reality
I have read a lot of AI vendor contracts in the past few years. Not because contract review is interesting in itself, but because the gap between what vendors say in sales conversations and what their agreements actually commit to has consequences. Organizations that do not close that gap before signing end up discovering what the contract actually says at the worst possible time.

The general shape of an AI vendor data agreement is worth understanding at the executive level, not because the CFO or CIO needs to redline individual clauses, but because the strategic choices about which vendors to use and under what conditions flow directly from what those agreements permit and exclude. Here is what I see consistently.

The default terms favor the vendor

This should not be surprising. The default data processing terms in any commercial agreement are written to minimize the vendor's liability and maximize their operational flexibility. AI vendor agreements are no different, and in some respects they are more aggressively drafted than traditional software agreements because the stakes around data use are higher and the regulatory landscape is still evolving.

The standard structure of a consumer or early-stage enterprise AI agreement typically includes:

- A broad grant to the vendor to use interaction data for service improvement, model training, and product development purposes, subject to anonymization or aggregation. In practice, what "anonymization" means and how consistently it is applied are rarely specified.
- Retention periods that are defined by the vendor's operational needs rather than the customer's preferences, often without a customer-initiated deletion right.
- Liability limitations that cap the vendor's exposure in the event of a data incident at amounts that bear no relationship to the potential harm to the customer, typically limited to fees paid rather than the value of the data or the cost of a breach.
- Unilateral modification rights that allow the vendor to change the data processing terms with notice, sometimes as short as 30 days, without requiring the customer's affirmative consent.

None of these are unusual in commercial software agreements. But when the agreement governs how your organization's strategic data, client information, and proprietary content are handled, they warrant closer attention than a standard SaaS contract.

What changes in a properly negotiated enterprise agreement

The distinction between a default agreement and a properly negotiated enterprise agreement is significant. When procurement and legal have done their job, the enterprise agreement should include at minimum:

Exclusion from training data. A clear, contractually binding commitment that the customer's interaction data will not be used to train or fine-tune the vendor's models. This is the single most important data term and the one that organizations should refuse to proceed without.

Data processing agreement compliant with applicable regulations. For any processing of personal data of EU residents, a GDPR-compliant data processing agreement is a legal requirement. Increasingly, other jurisdictions impose similar requirements. This agreement specifies the purposes for which data is processed, the retention periods, the data subject rights the vendor will support, and the security measures in place.

Defined retention and deletion terms. The agreement should specify how long the vendor retains interaction data, under what circumstances, and what deletion looks like, with confirmation that deletion is complete and irreversible.

Sub-processor disclosure and control. AI platforms often rely on cloud infrastructure, third-party safety tooling, and other sub-processors. The enterprise agreement should disclose who these are and give the customer the ability to object to new sub-processors.

Breach notification terms. The timeframe within which the vendor will notify the customer of a security incident affecting customer data. Thirty days is common in default agreements; 72 hours is the deadline that GDPR and a number of similar regimes set for notifying your own regulators. Make sure the vendor's notification obligation to you is faster than your notification obligation to regulators.

The clauses that cause problems later

In practice, the clauses that create the most problems are not the ones organizations focus on during negotiation.

Aggregated and anonymized data carve-outs. Most agreements carve out "aggregated and anonymized data" from the restrictions on training use, with the rationale that anonymized data cannot be traced back to the customer. The problem is that what counts as "anonymized" is not usually defined with precision, and for certain types of content (queries about niche industries, specialized technical topics, or specific organizational patterns) re-identification is more feasible than the carve-out implies.

Operational necessity language. Agreements often include broad permissions for the vendor to process customer data "as necessary to provide and improve the service." The scope of "improve the service" is frequently contested. Make sure this language is defined, not left open.

Right to audit provisions. The ability to verify that the vendor is actually complying with the data processing commitments they have made. Many agreements include an audit right that is functionally unusable: limited to once per year, requiring 90 days' notice, subject to the vendor's approval of the auditor. An audit right with those conditions provides limited practical assurance.

Termination data handling. What happens to your data when the contract ends. How long does the vendor retain it after termination? What format is it returned in? Is deletion from backup systems addressed? Organizations that have ended vendor relationships often discover that "deletion" in practice means deletion from active systems, with indefinite retention in backup infrastructure.

The sales conversation versus the signed contract

The gap I see most often is between what the sales team communicates during the evaluation ("we never use your data for training," "your data is completely private," "you retain full ownership of everything") and what the signed agreement actually commits to.

This is not always deliberate misrepresentation. Sales teams are not contract lawyers, and they often communicate what they believe to be true without knowing the precise legal scope of the commitments they are describing. The problem is that verbal assurances do not create contractual obligations. What matters is what the signed agreement says.

The practical implication: have the conversation about data handling before the procurement decision, but validate every assurance by finding the corresponding contractual language. If the vendor says they do not train on customer data, ask them to point to the specific clause that says so. If the clause does not exist, or if it is qualified in ways that limit its practical scope, that is important information.

What the CFO should be looking at

The CFO's lens on AI vendor data terms is different from the CIO's. Beyond the data handling questions, the financial and liability exposure matters.

Liability caps. Caps that are set at fees paid rather than harm caused mean that in the event of a serious data incident, the vendor's contractual exposure is often a fraction of the cost the organization incurs in regulatory fines, breach notification costs, customer notification, and reputational damage. This does not mean the vendor relationship is unworkable, but it does mean the organization is bearing most of the downside risk and should price that accordingly.

Insurance coverage. Some AI vendor incidents may fall into gaps between the organization's existing cyber insurance policy and the vendor's coverage. This is worth reviewing explicitly before the program goes live.

Renewal and price terms. AI vendor agreements increasingly include significant pricing flexibility: unilateral price changes, usage-based components that scale in ways that are hard to predict, and renewal terms that are less favorable than the initial agreement. Understanding the financial exposure over a three-to-five year horizon matters for the investment case.

What to take from this

- Default AI vendor data terms are written for the vendor's benefit. Do not assume they protect customer interests without reviewing them.
- The training exclusion is non-negotiable for any enterprise deployment handling sensitive data. Get it as a contractual commitment, not a sales assurance.
- Aggregated and anonymized data carve-outs are often broader than they appear. Define what anonymization means in the specific context of your data.
- Audit rights that are functionally unusable provide no real assurance. Push for meaningful audit provisions.
- Verify every data handling assurance the sales team makes by finding the contractual language that supports it. If it is not in the contract, it is not a commitment.

The organizations that manage AI vendor relationships well are not the ones with the longest or most restrictive agreements. They are the ones that understood what they were agreeing to before they signed, addressed the material gaps, and built a vendor relationship on actual commitments rather than assumed ones.
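For teams that want to track these checks across several vendors, the review above can be captured as a small structured record. The Python sketch below is only an illustration: the names `VendorTerms` and `review_gaps`, the field names, and the example vendor are all hypothetical, and the checks are a simplified reading of the points in this article rather than legal guidance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VendorTerms:
    """Hypothetical summary of what a signed agreement actually commits to."""
    name: str
    training_exclusion: bool         # contractual, not a sales assurance
    breach_notice_hours: int         # vendor's deadline to notify the customer
    regulator_deadline_hours: int    # customer's own deadline to notify regulators
    deletion_covers_backups: bool    # deletion addressed for backup systems
    changes_need_consent: bool       # term changes require affirmative consent


def review_gaps(terms: VendorTerms) -> List[str]:
    """Return the gaps to raise before signing (illustrative checks only)."""
    gaps = []
    if not terms.training_exclusion:
        gaps.append("no contractual exclusion from training use")
    if terms.breach_notice_hours >= terms.regulator_deadline_hours:
        gaps.append("vendor breach notice is not faster than the regulator deadline")
    if not terms.deletion_covers_backups:
        gaps.append("deletion from backup systems is not addressed")
    if not terms.changes_need_consent:
        gaps.append("vendor can modify data terms without consent")
    return gaps


# Example values are assumptions: a 30-day vendor notice against a 72-hour deadline.
default_terms = VendorTerms(
    name="ExampleVendor",
    training_exclusion=False,
    breach_notice_hours=30 * 24,
    regulator_deadline_hours=72,
    deletion_covers_backups=False,
    changes_need_consent=False,
)

for gap in review_gaps(default_terms):
    print(f"{default_terms.name}: {gap}")
```

The point of the structure is that each field has to be filled in from a specific clause. If nobody can point to the clause, the field stays at its unfavorable value.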
- 10 Mar, 2026
What Data Leaves Your Organization Every Time Someone Uses an AI Tool
Most organizations operate under a working assumption that their data is contained. Files live on approved systems. Emails go through monitored infrastructure. Cloud storage is access-controlled. The perimeter is imperfect, but it is at least visible.

AI tools have quietly dismantled that assumption. Not through a breach. Through normal, sanctioned-feeling use.

Every time an employee types a prompt into a large language model, attaches a document for summarization, or pastes a block of text for analysis, that content leaves the organization's infrastructure and enters a third-party system. The employee does not experience this as data transfer. They experience it as using a tool. But the data has moved, and where it goes, how long it stays, and what is done with it depend entirely on terms most organizations have never reviewed.

What "data leaving the building" actually means

The framing matters here, so I want to be precise. When I say data leaves the organization, I mean three distinct things.

First, the input reaches the vendor's infrastructure. The prompt, the document, the pasted text: all of it travels to servers the organization does not control, under security and access policies the organization did not set, in jurisdictions the organization may not have mapped.

Second, the vendor processes and stores that input for some period. The length and purpose of storage vary dramatically by product and by the specific agreement in place. Some vendors retain inputs for a defined period for abuse prevention. Some retain them longer for product improvement. Some will, under certain terms, use them to improve future model versions. The defaults on this vary and are not always what organizations assume.

Third, the output the model generates may itself be derived from patterns the model learns over time. This is the mechanism that tends to unsettle executives most when they understand it, though the practical risk here is more nuanced than the headline version usually suggests.

The parts that matter most in practice are the first two: the content reaches third-party infrastructure, and its fate is governed by the vendor's policies, not yours.

The content that tends to flow through AI tools

This is worth spending time on, because organizations that have audited actual AI tool usage consistently find that the content flowing through consumer and productivity AI tools is more sensitive than they assumed.

Strategy and planning documents. Employees use AI tools to refine presentations, summarize options, and draft documents for leadership review. The source material they feed in frequently includes internal plans, financial projections, and competitive analysis.

Client and customer information. Sales teams use AI assistants to draft proposals and account summaries. Support teams use them to summarize case histories. Analysts use them to structure reports. Client data is routinely included, often without a deliberate decision to include it.

Legal and contractual material. Lawyers and procurement teams use AI tools to summarize contracts, identify key clauses, and compare terms. Contract text often contains commercially sensitive information that neither party intended to share beyond the two signatories.

HR and personnel data. Managers use AI tools to draft performance reviews, restructuring communications, and offer letters. The inputs frequently include specific salary information, performance ratings, and personal circumstances.

None of these employees are being careless. They are using AI to do their jobs. The exposure is a product of normal behavior, not negligence.

Where the data goes: the three mechanisms

Processing for the immediate request. This happens in every interaction, by definition. The data reaches the model, the model generates a response, and the exchange is complete from the user's perspective. What happens after that depends on the vendor.

Retention for operational purposes. Most AI services retain some record of interactions for a period: to detect abuse, to provide conversation history to the user, or to meet regulatory requirements in certain jurisdictions. The retention period and what the organization can do about it (deletion requests, data portability) vary significantly and are usually defined in the data processing agreement or privacy policy.

Use for model training and improvement. This is the term that gets the most attention, and for good reason. Some AI products, particularly consumer-grade versions of enterprise tools, include default settings that allow the vendor to use interaction data to improve the model. The important nuance: enterprise agreements frequently exclude this, while consumer free tiers often include it. The problem in most organizations is that employees are using a mix of both, and nobody has mapped which is which.

The distinction between enterprise and consumer tiers on this specific point is where most of the real exposure sits. An employee using an enterprise-licensed product with a properly negotiated data processing agreement is in a materially different position from an employee using the same vendor's free consumer product with default settings. The output is functionally identical. The data treatment is not.

What the CTO and CIO actually need to understand

The question is not whether AI tools create data exposure. They do, by design, in the same way any cloud service does. The question is whether the organization's data exposure through AI tools is understood, consented to, and consistent with its regulatory and contractual obligations.

That requires knowing three things you probably do not know right now.

What tools are actually in use. Not just the ones IT has approved, but all of them. This means running discovery before designing governance. Most organizations that do this discovery find a longer list than they expected.

What tier of each tool is in use. The enterprise agreement and the free consumer version of the same product often have dramatically different data processing terms. This distinction matters for training data use, retention, and deletion rights.

What the data processing terms actually say. Not the marketing language about being "privacy-first" or "enterprise-grade," but the actual data processing agreement. Specifically: what the vendor can do with inputs, how long they retain them, what the organization's rights are around deletion, and where the data is processed.

Most organizations have answered none of these questions systematically. The CIO knows what is in the procurement system. The CTO knows what is in production. Neither has a complete picture of what is happening between individual employees and third-party AI services.

The regulatory and contractual layer

Data flowing to AI tools does not exist in a vacuum. It intersects with existing obligations.

If the organization operates under data protection regulation, any transfer of personal data to a third-party processor requires a legal basis and, in many jurisdictions, a data processing agreement that specifies how the processor may use the data. AI tools that process personal data (and most enterprise use cases involve at least some personal data) need to be assessed against these requirements.

If the organization has contractual confidentiality obligations to clients, those obligations typically extend to how client data is handled regardless of the tool involved. A consultant uploading client strategy documents to an AI summarization tool without a data processing agreement in place may be in breach of their client agreement, regardless of whether the AI tool's terms are otherwise acceptable.

These are not hypothetical risks. They are existing obligations that most organizations have not mapped against their AI tool usage.

What to take from this

- Audit what AI tools are in active use across the organization before designing any data governance response. The list will be longer than IT's approved toolset.
- Distinguish between enterprise and consumer tiers. The same tool can have dramatically different data processing implications depending on which version employees are using.
- Read the data processing agreements, specifically the sections on input retention, training use, and deletion rights. Do not rely on the vendor's marketing language.
- Map AI tool usage against existing data protection and client confidentiality obligations. The intersection is almost certainly not clean.
- Build a disclosure and classification step into any AI tool approval process: what categories of data can employees use with this tool, under what conditions?

The data exposure from AI tools is not a future problem to prepare for. It is a current condition to understand. The organizations that handle this well are not the ones with the most restrictive policies; they are the ones that ran the discovery work, understood what was actually flowing through which tools, and made deliberate decisions about what that meant for their obligations.
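The discovery work described above usually ends up as an inventory, and a simple one is enough to start. The Python sketch below shows one way to record what was found and flag the entries that need attention first. Everything in it is an assumption for illustration: the `ToolUsage` record, the `needs_review` rule, the set of sensitive categories, and the example tools are hypothetical and should be replaced with your own classification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolUsage:
    """One row of a hypothetical AI tool inventory produced by discovery."""
    tool: str
    tier: str                   # "enterprise" or "consumer"
    has_dpa: bool               # data processing agreement in place
    data_categories: List[str]  # what employees actually put into the tool


# Assumed sensitive categories, taken from the content types discussed above.
SENSITIVE = {"strategy", "client", "legal", "hr"}


def needs_review(usage: ToolUsage) -> bool:
    """Flag sensitive data flowing through a consumer tier or without a DPA."""
    touches_sensitive = bool(SENSITIVE.intersection(usage.data_categories))
    weak_terms = usage.tier == "consumer" or not usage.has_dpa
    return touches_sensitive and weak_terms


# Example inventory; the tool names are placeholders, not real products.
inventory = [
    ToolUsage("assistant-a", "enterprise", True, ["client", "strategy"]),
    ToolUsage("assistant-a", "consumer", False, ["hr"]),
    ToolUsage("summarizer-b", "consumer", False, ["public marketing copy"]),
]

flagged = [u for u in inventory if needs_review(u)]
for u in flagged:
    print(f"Review: {u.tool} ({u.tier}) handles {', '.join(u.data_categories)}")
```

Note that the same tool appears twice in the example, once per tier. That is deliberate: the tier, not the product name, determines the data treatment.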