- 07 Jun, 2026
What Happens to Your Data Inside a Large Language Model
One of the questions I get most often from executive teams when they start getting serious about AI governance is some version of: "If we send data to an AI model, does that data end up in the model? Can the model then use our data to answer questions for our competitors?"

It is a reasonable question. The answer is more nuanced than the headlines around AI and data privacy usually suggest, and getting the nuance right matters for making sound decisions about vendor selection, data handling, and acceptable use.

This is not a technical explanation. It is an executive one. I want to give you the conceptual framework that lets you ask the right questions and evaluate the answers vendors give you.

The key distinction: training versus inference

There are two fundamentally different things that can happen to data when it touches an AI model.

Inference is what happens during normal use. You send a prompt. The model processes it using the knowledge and patterns it already has. It generates a response. Your data was processed, but it did not change the model. The model is no more or less capable after your interaction than it was before. Think of it like asking an expert a question: they used their knowledge to answer you, but they did not become a different expert because you asked.

Training is different. Training is when data is used to update the model's internal parameters, that is, to change what the model knows or how it responds. This is what actually shapes the model's behavior and capabilities. Training happens periodically, using large datasets, through a deliberate process. It is not what happens every time a user sends a prompt.

The confusion between training and inference is responsible for most of the anxiety executives have about sending data to AI vendors. When an employee pastes a strategy document into an AI assistant, that document is used for inference: to generate the response.
It is not, in that moment, training the model or making the model more likely to surface that information to other users. The question of whether your data is used for training is a separate one, governed by the vendor's policies and your agreement with them.

When data does influence the model

The concern about data "ending up in the model" is legitimate in one specific scenario: when the vendor uses interaction data to train future versions of the model.

This practice is more common in consumer products than enterprise ones. Many consumer AI tools, under default settings, retain interaction data and may use it as part of the training pipeline for future model versions. This does not mean a competitor can directly query the model and retrieve your document. Training does not work like storing files in a searchable database. But your data, if used for training, has influenced the model's patterns in ways that are effectively irreversible and non-auditable.

Enterprise agreements typically exclude this. When an organization purchases an enterprise license with a proper data processing agreement, the vendor generally commits to not using that organization's data for training purposes. This is one of the most important terms to verify in any AI vendor agreement, and one of the strongest reasons to ensure employees are using enterprise tiers rather than consumer accounts.

The practical implication: the risk of your data influencing the model is primarily a function of which tier you are on and what your agreement says, not of using AI tools in general.

What retention actually means

Even when a vendor does not train on your data, they may retain it for a period. Understanding what retention means in practice matters for two reasons: regulatory compliance and the question of who can access the retained data.
Vendors retain interaction data for different reasons: abuse prevention, conversation history for the user, debugging and quality assurance, and in some cases legal holds. The retention period varies from days to years depending on the product and the settings.

What the retained data can be used for is defined in the vendor's privacy policy and data processing agreement. The key questions are: Can vendor employees access the content of retained interactions? Under what circumstances? Are there audit logs of such access? What are the deletion terms: can you request deletion, and is it complete?

These are not abstract questions. An employee sending sensitive content to an AI tool is creating a record that exists in the vendor's infrastructure for some period. If that infrastructure is breached, or if the vendor is subject to legal process, that record is potentially accessible. The same employee would not dream of emailing that content to a stranger. But the AI tool does not feel like an external party. It feels like a private tool.

The retention question is also where GDPR and similar regulations create specific obligations. Any interaction containing personal data is a transfer of personal data to a third-party processor. That transfer requires a legal basis, a data processing agreement, and compliance with data subject rights including deletion. Most organizations have not mapped their AI tool usage against these obligations.

The questions a CTO should ask every AI vendor

The framework above translates into a specific set of questions that should be part of any AI vendor evaluation:

Is interaction data used for training future models? Under what conditions? What controls does the customer have over this? This is the most important question. Get the answer in writing, as a contractual commitment, not as a verbal assurance.

What is the data retention period for interaction data? Can this be configured? What are the deletion rights and processes?
What confirmation is provided when deletion is complete?

Who within the vendor organization can access the content of customer interactions? Under what circumstances? Are there access logs? What are the procedures if vendor employees need to access content for support or debugging?

Where is the data processed? This matters for regulatory compliance. Data about EU residents processed in jurisdictions without an adequacy decision creates specific compliance obligations that need to be managed.

What happens to retained data in the event of the vendor being acquired, going out of business, or being subject to legal process? Where does customer data fall in those scenarios?

What is the vendor's certification posture? SOC 2 Type II, ISO 27001, and similar certifications do not answer all of these questions, but they provide a baseline for security practices that matters for any serious enterprise evaluation.

The honest assessment

No AI tool is risk-free from a data perspective. Sending data to any third-party system involves some degree of information leaving your infrastructure, under terms you did not write, in systems you do not control. That is true of cloud storage, email services, and every other third-party tool the organization uses.

The question is whether the risk is understood, whether the terms are acceptable given the regulatory and contractual context, and whether the data classification of what is being sent is appropriate for the tier and agreement in place.

Using AI tools with enterprise data under a proper enterprise agreement with a reputable vendor is not the worst outcome. The worst outcome is using consumer-tier products with default settings, with sensitive data, without any of the contractual protections that make enterprise use manageable. Most organizations are currently somewhere in between.
The CTO's job is to understand exactly where on that spectrum the organization sits, and to move deliberately toward the part of the spectrum that is defensible.

What to take from this

Training and inference are different. Using an AI tool to process data does not automatically mean that data trains the model. Whether it does depends on the vendor's policies and your agreement.

The training exclusion is one of the most important terms in an enterprise AI agreement. Verify it explicitly; a verbal assurance is not sufficient.

Retention means your data exists on vendor infrastructure for some period. Understand the retention period, access controls, and deletion rights for every tool in active use.

Consumer and enterprise tiers of the same product often have materially different data handling terms. The tier distinction matters more than the vendor selection in many cases.

Map AI tool usage against data protection obligations before the next regulatory review, not during it.

The executives who handle this well are the ones who moved past the surface-level anxiety about "AI knowing your data" and got specific about the mechanisms: what are the actual terms, what does retention mean, and what commitments can the vendor make in writing?
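
For teams that want to turn the mapping exercise into something a governance analyst can maintain, one possible starting point is a small tool inventory checked against the terms discussed above. The Python below is only a sketch under stated assumptions: the record fields, the tool names, and the gap rules are all hypothetical illustrations, not a regulatory standard or any vendor's actual terms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AIToolRecord:
    """One AI tool in active use. All fields are hypothetical illustrations."""
    name: str
    tier: str                           # "consumer" or "enterprise"
    training_excluded_in_writing: bool  # contractual commitment, not a verbal one
    has_dpa: bool                       # data processing agreement in place
    retention_days: Optional[int]       # None means the period is unknown
    handles_personal_data: bool


def find_gaps(tool: AIToolRecord) -> List[str]:
    """Return the open questions for one tool (illustrative rules only)."""
    gaps = []
    if tool.tier != "enterprise":
        gaps.append("consumer tier in use")
    if not tool.training_excluded_in_writing:
        gaps.append("no written training exclusion")
    if tool.retention_days is None:
        gaps.append("retention period unknown")
    if tool.handles_personal_data and not tool.has_dpa:
        gaps.append("personal data sent without a data processing agreement")
    return gaps


# Made-up inventory entries, for illustration only.
inventory = [
    AIToolRecord("assistant-a", "enterprise", True, True, 30, True),
    AIToolRecord("assistant-b", "consumer", False, False, None, True),
]

report = {tool.name: find_gaps(tool) for tool in inventory}
for name, gaps in report.items():
    print(name, "->", gaps if gaps else "no gaps flagged")
```

A list of flagged gaps per tool is not a compliance verdict. It is a way to see which vendor questions are still unanswered before a regulatory review rather than during it.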