"Never trained on your data" · what it actually means
When an AI vendor tells you they will never train on your data, it sounds like exactly the reassurance you were looking for. The phrase has become something close to table stakes in the enterprise AI market: everyone says it, it appears on pricing pages and in sales decks, and it is offered as the answer to any privacy concern you might raise.
The problem is that the phrase hides more than it reveals. Not training on your data is necessary. It is nowhere near sufficient. And the gap between necessary and sufficient is where most of the real privacy risk in AI deployment actually lives.
What training actually means, and why it is only the beginning
When an AI provider says they do not train on your data, they mean that the content of your conversations, documents, and queries will not be fed back into the process of improving the underlying model. Your inputs will not make the model smarter in ways that could benefit other users or expose your information through the model's outputs to someone else.
That commitment, if genuine and architecturally enforced, matters. It means your proprietary information will not influence what the AI tells other users. It means a competitor using the same platform cannot, through the model itself, benefit from knowledge you shared in your prompts. For businesses that use AI to think through strategy, draft sensitive communications, or analyse confidential data, this is a meaningful protection.
But training is only one of five questions that matter. The other four are these. Where does your data go? Who can see it? What happens when you want it gone? Can you control where it is stored? Training is the question that gets asked most often because it is the easiest to explain. The other four are harder to understand but equally important to get right.
Where does your data actually live?
When you send a prompt to an AI tool, that prompt travels to a server somewhere, gets processed, and a response comes back. What happens to the prompt between those two events, and where it sits while it is being processed, is a question most users never ask and most providers do not volunteer an answer to.
The default for most AI infrastructure is shared: your prompt is processed in the same environment, by the same infrastructure, as prompts from other customers. This is efficient and cost-effective at scale. It is also not the same as private. Even if no one is actively reading your prompts, they exist in an environment shared with others, subject to the security posture and access controls of a system you did not design and do not control.
An isolated deployment is different. In an isolated deployment, your data lives in an environment that is dedicated to your business: your own database, not a shared pool. Your prompts are processed in infrastructure that is not shared with other customers. The access controls are set around your organisation, not around a multi-tenant platform. This is the level of isolation that a serious privacy requirement demands, and it is meaningfully more expensive and complex to deliver than a shared deployment with a training opt-out.
When evaluating a provider, ask directly: is my data in a shared database or an isolated instance? If the answer requires qualification or is not a clear "isolated instance," you have your answer.
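The difference between a shared pool and an isolated instance can be sketched in a few lines. This is an illustrative model only, not any vendor's implementation: the class names SharedStore and IsolatedStore, and the tenant names, are hypothetical. It shows that in a shared store, separation depends on a filter being applied correctly on every read, while in an isolated store there is no other customer's data present to expose.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Hypothetical multi-tenant store: one pool holds every customer's rows."""
    rows: list = field(default_factory=list)

    def save(self, tenant_id: str, prompt: str) -> None:
        self.rows.append((tenant_id, prompt))

    def load(self, tenant_id: str) -> list:
        # Separation relies on this filter; a missing or wrong filter
        # would return other customers' prompts.
        return [p for t, p in self.rows if t == tenant_id]


@dataclass
class IsolatedStore:
    """Hypothetical single-tenant store: one instance per customer."""
    owner: str
    rows: list = field(default_factory=list)

    def save(self, prompt: str) -> None:
        self.rows.append(prompt)

    def load(self) -> list:
        # No filter is needed because only the owner's data exists here.
        return list(self.rows)


shared = SharedStore()
shared.save("acme", "Q3 pricing strategy")
shared.save("globex", "merger notes")

acme_only = IsolatedStore(owner="acme")
acme_only.save("Q3 pricing strategy")

print(len(shared.rows))   # both customers' prompts sit in the same pool
print(acme_only.load())   # only acme's prompt exists in this instance
```

In the shared model both prompts sit side by side and every read must get the filter right. In the isolated model that class of mistake has nothing to act on.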
Who can see what you share?
Most privacy policies include language about limiting internal access to customer data. Support teams, engineers, and other staff are typically described as having access only on a need-to-know basis, subject to confidentiality agreements. This is standard practice, and it is not dishonest.
But it is also not the same as saying no one can see your data. A system in which staff can read customer conversations, however legitimate the safeguards and intentions, offers a different level of privacy from one in which access is architecturally prevented. The question to ask is not whether the provider has policies limiting access. It is whether the system is designed so that access is not possible without an explicit, audited action.
For businesses handling financial records, legal documents, or personal health information, the distinction is material. A policy that limits access is better than no policy. An architecture that prevents access is what the most sensitive use cases actually require. Understand which one you are being offered before you rely on the privacy commitment.
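To make "an explicit, audited action" concrete, here is a minimal sketch of a read path that refuses access without an approval reference and records every attempt, allowed or not. The names AuditedVault, AccessDenied, approval_ref, and the ticket identifier are all hypothetical. The sketch shows only the control flow. In a real system the enforcement would need to sit below the application layer, for example in key management, which is an assumption beyond what this example demonstrates.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


class AccessDenied(Exception):
    """Raised when a read is attempted without an approval reference."""


@dataclass
class AuditedVault:
    """Hypothetical store where every read attempt is logged."""
    _records: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def store(self, key: str, content: str) -> None:
        self._records[key] = content

    def read(self, key: str, staff_id: str,
             approval_ref: Optional[str] = None) -> str:
        allowed = bool(approval_ref)
        # Log the attempt before deciding, so denied attempts are recorded too.
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "staff_id": staff_id,
            "key": key,
            "approval_ref": approval_ref,
            "allowed": allowed,
        })
        if not allowed:
            raise AccessDenied(f"{staff_id} has no approval to read {key}")
        return self._records[key]


vault = AuditedVault()
vault.store("conversation-1", "confidential draft")

try:
    vault.read("conversation-1", staff_id="engineer-7")
    denied = False
except AccessDenied:
    denied = True

content = vault.read("conversation-1", staff_id="engineer-7",
                     approval_ref="TICKET-123")
```

A policy-only approach corresponds to a read method with no check at all, plus a written rule asking staff not to call it.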
What happens when you want your data gone?
Every business should expect, at some point, to want its data deleted. You may switch providers. A client relationship may end, and contractual obligations may require you to delete that client's information. A regulatory authority may ask you to demonstrate that you can delete personal information on request. Whatever the reason, the right to delete data is a basic expectation, and in most jurisdictions with meaningful data protection law, it is a legal requirement.
The question is not whether a provider offers a delete function. Most do. The question is what deletion actually means. Does it remove the data from the primary database only, while retaining it in backups? Does it mark it as deleted in the system while the underlying record persists? Are there logs, audit trails, or cached copies that retain the content beyond the primary deletion? Does deletion take effect immediately or after a retention period that the provider sets unilaterally?
Shadow copies are a specific concern worth raising directly. Some systems automatically create backups of data at regular intervals. If those backups are not subject to the same deletion process as primary data, then deletion is not actually deletion: it is a change in the most accessible copy while a historical record persists somewhere else. A provider that cannot answer the shadow copy question specifically has not designed its system with genuine deletion in mind.
The standard you should be looking for is permanent deletion: when you delete your data, it is gone from every system the provider operates, including backups, logs, and any derivative copies, with no retention period. If that is what you need, say so explicitly and get a specific answer. Vague language about deletion being available is not sufficient.
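The gap between a delete function and permanent deletion can be illustrated with a small model. This is a sketch under stated assumptions, not a description of how any particular provider stores data: ProviderStorage, residual_copies, and the four storage locations are hypothetical names chosen to mirror the questions above.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderStorage:
    """Hypothetical provider with a primary store, backups, logs and a cache."""
    primary: dict = field(default_factory=dict)
    backups: list = field(default_factory=list)  # list of snapshot dicts
    logs: list = field(default_factory=list)     # list of (key, message)
    cache: dict = field(default_factory=dict)

    def write(self, key: str, content: str) -> None:
        self.primary[key] = content
        self.cache[key] = content
        self.logs.append((key, f"stored {len(content)} chars"))

    def snapshot(self) -> None:
        # Periodic backup: a copy of the primary store at this moment.
        self.backups.append(dict(self.primary))

    def delete_primary_only(self, key: str) -> None:
        self.primary.pop(key, None)

    def delete_everywhere(self, key: str) -> None:
        self.primary.pop(key, None)
        self.cache.pop(key, None)
        for snap in self.backups:
            snap.pop(key, None)
        self.logs = [entry for entry in self.logs if entry[0] != key]


def residual_copies(store: ProviderStorage, key: str) -> list:
    """Return the names of every location that still holds the key."""
    found = []
    if key in store.primary:
        found.append("primary")
    if any(key in snap for snap in store.backups):
        found.append("backups")
    if any(k == key for k, _ in store.logs):
        found.append("logs")
    if key in store.cache:
        found.append("cache")
    return found


store = ProviderStorage()
store.write("client-42", "contract draft")
store.snapshot()

store.delete_primary_only("client-42")
after_partial = residual_copies(store, "client-42")

store.delete_everywhere("client-42")
after_full = residual_copies(store, "client-42")
```

After the partial delete, the record is gone from the primary store but still present in backups, logs, and cache. Only the second call leaves nothing behind, which is the behaviour to ask a provider to confirm.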
The self-hosting option and what it means for sovereignty
For organisations with the most demanding privacy requirements, there is a fourth question that goes beyond what any cloud-based deployment can fully answer: can I run this on my own infrastructure? Self-hosting means the AI operates on hardware you control, on your network, under your security policies. Data never leaves your premises. The provider supplies the software; you supply the environment.
This is relevant for financial institutions, legal firms, healthcare organisations, and enterprises operating in jurisdictions with strict data residency requirements. It is also relevant for any business where the competitive sensitivity of the information is high enough that even a trusted third party in the chain represents an unacceptable risk.
Self-hosting is not the right choice for every business. It requires infrastructure investment, technical capability, and ongoing operational responsibility. But for businesses where data sovereignty is a genuine requirement, it is the only approach that provides an absolute guarantee: the information is on your hardware, and the question of what a provider does with it is structurally irrelevant because the provider never has it.
What the regulators say about all of this
Both POPIA in South Africa and the GDPR in the European Union impose obligations on businesses that process personal information using third-party systems. Under both frameworks, the business remains accountable for how that information is handled, even when a vendor is doing the handling. POPIA describes this as the relationship between the responsible party and the operator. The GDPR describes it as the relationship between the controller and the processor.
This matters because it means a privacy failure by your AI vendor can be a privacy failure by your business. If personal information is retained beyond what your vendor said it would be, or processed in a way that was not disclosed, or accessed without appropriate authorisation, the regulatory exposure does not sit solely with the vendor. The business that chose to use the vendor and did not conduct adequate due diligence on the privacy architecture shares the responsibility.
Regulators in both jurisdictions have clarified that taking appropriate technical measures includes understanding how third-party processors handle data, ensuring contractual protections are in place, and being able to demonstrate that the processing is consistent with the rights of data subjects. "We were told it was private" is not a sufficient answer. "We reviewed the architecture, confirmed the guarantees, and have the documentation" is a much stronger position.
The questions worth asking any AI vendor
Before you rely on a privacy commitment from any AI provider, ask for specific answers to a specific set of questions. Is my data in an isolated instance or a shared database? Who within your organisation can access the content of my data, under what circumstances, and is that access logged? When I delete my data, is it removed from all systems including backups and logs? How long does deletion take to propagate? Are there any derivative copies, cached versions, or audit records that retain the content after deletion? Do you offer self-hosting for businesses that require it?
Providers who have built their systems around genuine privacy can answer every one of these questions precisely and with specifics. The answers are built into the architecture, so they do not require interpretation or hedging. Providers who respond with generalities, redirect to their privacy policy, or offer reassurances that do not map to specific technical realities have not built their systems that way.
The difference between the two is not just a matter of trust. It is a matter of what you can represent to your clients, your regulators, and your board when they ask how you handle sensitive information. Precision matters. Specificity is the proof.
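One way to keep track of what a vendor has actually answered is a simple due-diligence record that counts a question as resolved only when it has both a specific answer and a piece of supporting evidence. The sketch below is one possible shape for such a record. The VendorAnswer class, the unresolved function, and the document name are hypothetical, and the rule that evidence is required is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VendorAnswer:
    """Hypothetical record of one due-diligence question."""
    question: str
    answer: Optional[str] = None
    evidence: Optional[str] = None  # e.g. a contract clause or architecture doc


def unresolved(answers: list) -> list:
    """Return the questions that lack either an answer or evidence."""
    return [a.question for a in answers if not (a.answer and a.evidence)]


review = [
    VendorAnswer(
        "Is my data in an isolated instance or a shared database?",
        answer="Isolated instance",
        evidence="architecture-overview.pdf",  # hypothetical document name
    ),
    VendorAnswer(
        "Who can access my data, and is that access logged?",
        answer="See our privacy policy",  # a redirect, with no evidence given
    ),
    VendorAnswer("Is deleted data removed from backups and logs?"),
    VendorAnswer("Do you offer self-hosting?"),
]

open_items = unresolved(review)
for question in open_items:
    print("Still open:", question)
```

A record like this does not judge whether an answer is good. It only makes visible which questions still rest on reassurance rather than documentation.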
Why we built Emma this way
Emma was designed from the beginning around the privacy requirements of businesses that handle information they cannot afford to lose control of. Every customer's data lives in its own isolated instance. No data is shared between customers. No AI model is ever trained on what Emma receives. Shadow copies are not retained. When you delete your data, it is gone permanently.
For enterprise customers with the strictest requirements, Emma can be deployed entirely on your own infrastructure. The AI runs on your hardware, on your network, under your control. We provide the capability. You own the environment.
We built it this way because we believe that private AI is not a premium feature. It is a prerequisite for any business that takes its responsibilities seriously. The phrase "never trained on your data" should be the beginning of the conversation, not the end of it. We are happy to answer the rest of the questions in detail, because the architecture gives us specific answers to give.