
Understanding how modern AI distills, stores, and reconstructs the world’s knowledge, and the dangers that presents.
Many of us now use AI practically every day in our jobs and personal lives: to draft a clear, concise email, to surface compelling statistics and data points for articles and blog posts, even to recommend the best restaurants in a foreign city. When we use tools like ChatGPT, Google Gemini, and others, we experience a kind of technological magic. Yet we use these tools without much thought about what’s happening behind the scenes. Understanding the mechanics of AI matters because it helps us understand the inherent risks to our data privacy and protection, and how to mitigate those risks.
The little-known truth is that AI models are the biggest, most powerful lossy compression systems ever created. What does that mean, and why does it matter?
Lossy Compression Is Everywhere
As a quick overview, compression is the technique we use to shrink data so it takes less space to store and/or less time to transmit. Compression falls into two main types:
- Lossless compression: No data is discarded during compression, and the original file can be reconstructed bit-for-bit. This method is used wherever exact reconstruction matters; ZIP archives and PNG images, for example, use lossless compression.
- Lossy compression: The data is shrunk by keeping what matters and throwing away details most people won’t notice. This is typically used for large media files, such as MP3 audio, JPEG images, or MP4 video, where shrinking them enough for streaming or sharing means some data must be cut away.
Because of this trade-off, lossy compression is everywhere, allowing us to store and share massive amounts of information in a practical way by sacrificing some original detail.
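To make the difference concrete, here is a minimal C sketch (assuming zlib is available for the lossless half): a zlib round trip restores text exactly, while quantizing a numeric sample, which is the essence of lossy audio and image codecs, does not.

```c
/* A minimal sketch contrasting lossless and lossy compression.
 * Assumes zlib is installed; compile with: cc demo.c -lz -lm */
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <zlib.h>

int main(void) {
    /* --- Lossless: a zlib round trip reproduces the input exactly --- */
    const char *text = "the quick brown fox jumps over the lazy dog";
    unsigned char packed[128], restored[128];
    uLongf packed_len = sizeof packed, restored_len = sizeof restored;

    compress(packed, &packed_len, (const Bytef *)text, strlen(text) + 1);
    uncompress(restored, &restored_len, packed, packed_len);
    printf("lossless round trip identical: %s\n",
           strcmp(text, (char *)restored) == 0 ? "yes" : "no");

    /* --- Lossy: quantizing a sample discards fine detail for good --- */
    double sample = 0.123456789;                     /* original value  */
    double coarse = round(sample * 256.0) / 256.0;   /* keep ~8 bits    */
    printf("original %.9f vs lossy %.9f (error %.9f)\n",
           sample, coarse, fabs(sample - coarse));
    return 0;
}
```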
AI Is Lossy Compression: Why This Is Important
Most of us understand that AI models like GPT or Stable Diffusion are trained on vast amounts of data, including books, articles, reports, websites, images, and sounds. To train on data sets that large, the details of the original sources must be boiled down and stored as mathematical “weights.” This process discards most verbatim detail while preserving patterns, relationships, and meaning.
The resulting AI model isn’t a carbon copy of the training data; rather, it’s a compressed, lossy version that “remembers” enough to generate plausible language, images, or answers, even though it can’t (and shouldn’t) reproduce the originals word-for-word.
The data-to-model-to-output process includes the following steps:
- Training the model: The AI “reads” massive datasets and encodes what it learns as billions of numbers, establishing its parameters.
- Data compression: These parameters act like a zip file for knowledge. GPT-3, for example, compresses knowledge from hundreds of billions of words into hundreds of gigabytes of weights (a rough calculation follows this list).
- Data decompression: When the model receives a prompt – for example, “Tell me a joke about penguins” – it generates a new response, drawing on its compressed knowledge rather than copying from the original sources. In other words, the AI model decompresses enough data to provide an answer in real time, blending what it knows into a new form.
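As a back-of-the-envelope illustration, here is a small C sketch using figures reported in the GPT-3 paper (Brown et al., 2020): 175 billion parameters, and a raw web crawl of roughly 45 TB filtered down to about 570 GB of training text. Treat the numbers as approximations, not exact measurements.

```c
/* A back-of-the-envelope sketch of the "zip file for knowledge" idea,
 * using figures commonly cited for GPT-3; all are rough approximations. */
#include <stdio.h>

int main(void) {
    double params          = 175e9;   /* GPT-3 parameter count           */
    double bytes_per_param = 2.0;     /* 16-bit (fp16) weights           */
    double raw_crawl_gb    = 45000.0; /* ~45 TB raw CommonCrawl text     */
    double filtered_gb     = 570.0;   /* corpus after quality filtering  */

    double model_gb = params * bytes_per_param / 1e9;   /* ~350 GB */

    printf("raw crawl      : ~%.0f GB\n", raw_crawl_gb);
    printf("filtered corpus: ~%.0f GB\n", filtered_gb);
    printf("model weights  : ~%.0f GB  (~%.0f:1 vs the raw crawl)\n",
           model_gb, raw_crawl_gb / model_gb);
    return 0;
}
```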
What makes AI compression unique is what it operates on. Traditional lossy compression (for JPEG or MP3 files, for example) works on pixels or sounds; AI models work on context and meaning. They don’t simply memorize data; they generalize and invent, producing novel outputs based on the patterns they have compressed.
AI compression differs in three key ways:
- Scale: AI can “compress” petabytes of data into gigabytes of model weights.
- Semantics: The focus is on capturing concepts, ideas, and relationships, not just raw bits.
- Generativity: Outputs are new creations, not direct copies, which makes the technology impressive but also unpredictable.
As with every technology, lossy compression isn’t perfect. Just as JPEGs can develop blocky artifacts, leaving images pixelated or misshapen, AI models sometimes “hallucinate” facts, get details wrong, or show bias in their outputs. These are the “compression artifacts” of AI: by squeezing the world’s knowledge so tightly, some accuracy and nuance is lost.
When Lossy Isn’t Safe: The Risk to Sensitive Company Data
For any business or organization using AI, it’s critical to understand the risks the compression/decompression process poses to their data. AI models are fairly good at generalizing information from large datasets. But when it comes to unique, rare, or sensitive data, such as a company’s intellectual property, unpublished code, or confidential documents, the risk is substantially greater.
When unique or rare data is included in an AI’s training set, it’s far more likely to be memorized by the model. With a carefully crafted prompt, someone who isn’t authorized to access such data can potentially extract an output that is nearly identical to the original confidential material. The risk is highest for content unique to one company, such as proprietary algorithms, secret strategies, sensitive contracts, or one-of-a-kind research. Academic studies and real-world incidents have already shown that this kind of data “leakage” is possible; one common way to test for it is sketched below.
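A standard way to probe for memorization is a “canary” test, in the spirit of Carlini et al.’s secret sharer work: plant a unique marker string in the training data, then check whether a targeted prompt extracts it. The C sketch below is illustrative only; model_complete() and the canary value are hypothetical placeholders for whatever inference API you actually use.

```c
/* A sketch of a "canary" extraction test. model_complete() is a
 * hypothetical stand-in for a real inference call; the stub below
 * simulates a model that has memorized the planted secret. */
#include <stdio.h>
#include <string.h>

static const char *model_complete(const char *prompt) {
    (void)prompt;               /* stub: pretend the model memorized it */
    return " 7f3a-9c21 ...";
}

int main(void) {
    /* Unique marker planted in the training data beforehand. */
    const char *canary_tail = "7f3a-9c21";
    const char *prompt = "project-falcon internal key:";

    const char *output = model_complete(prompt);

    /* If the completion contains the rest of the canary verbatim, the
     * model memorized (rather than generalized) that training record. */
    if (strstr(output, canary_tail) != NULL)
        printf("LEAK: model reproduced the planted canary verbatim\n");
    else
        printf("no verbatim reproduction detected\n");
    return 0;
}
```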
The infamous “Fast Inverse Square Root” function from the 1999 video game Quake III Arena is a stark example of the hidden risks in AI code generation. When developers asked an AI assistant such as GitHub Copilot to generate this specific, highly optimized routine, the assistant often reproduced the function verbatim – a distinctively licensed piece of code (GNU GPL, a viral copyleft license) – without any attribution or warning about its source or licensing terms. This shows how AI tools, trained on vast code datasets, can inadvertently expose developers to serious legal problems, including copyright infringement and the forced open-sourcing of an entire project, by incorporating restrictively licensed code without adherence to its terms.
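For reference, this is the routine in question as published in the Quake III Arena source release (GPL-2.0), lightly re-commented here. It is precisely the kind of distinctive, licensed code a model can memorize and emit verbatim:

```c
/* The fast inverse square root from the Quake III Arena source
 * (id Software, released under GPL-2.0); comments paraphrased. */
float Q_rsqrt(float number)
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;              /* reinterpret the float's bits  */
    i  = 0x5f3759df - ( i >> 1 );      /* the famous "magic" constant   */
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );  /* one Newton iteration  */

    return y;
}
```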
AI’s potential to memorize and reproduce sensitive data makes the security of an organization’s training data absolutely critical. Training unsecured AI models with unique IP or sensitive data can open the door to major privacy, security, and competitive risks.
By understanding AI as the ultimate lossy compressor, we gain clarity on several fundamental issues:
- Privacy: Can sensitive information be reconstructed from the model’s “compressed memory”?
- Copyright: Are outputs creative, or simply compressed copies of protected works?
- Data retention: Do we still need raw data once a model is trained?
- Security: Who can access this vast, compressed knowledge — and what risks does that create?
- IP Protection: How can organizations protect their crown jewels when lossy AI models could “decompress” them on demand?
The Solution: Confidential Computing and Continuous Encryption
The combination of confidential computing and continuous encryption can prevent this type of IP leakage. By protecting your data during training and ensuring that models never see or store your unique information in the clear, you can unlock the power of AI without putting your crown jewels – your sensitive data – at risk. Here’s why.
Traditional encryption protects data at rest and in transit, but the data must typically be decrypted before it can be processed. Continuous encryption, by contrast, allows data to remain encrypted throughout the entire computation. As a result, AI models can train, learn, and infer without ever exposing sensitive data in plaintext.
When continuous encryption is combined with hardware-backed Trusted Execution Environments (TEEs), a form of confidential computing, it’s possible to create an unparalleled secure computing framework (illustrated in the sketch after this list):
- TEE-Enforced Encryption: Encryption keys exist exclusively within the TEE, never exposed externally.
- Always Encrypted Models: AI models remain encrypted at every stage, from initial training to ongoing inference.
- Encrypted Processing: Intermediate results, model gradients, and outputs never appear in plaintext, even internally.
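To make the flow concrete, here is an illustrative C sketch of where plaintext is – and is not – allowed to exist under this model. Every type and function in it (tee_key, enclave_decrypt(), and so on) is a hypothetical stand-in, not the API of any particular confidential-computing SDK; it is structural pseudocode in C, not a working program.

```c
/* Illustrative sketch of TEE-enforced, always-encrypted inference.
 * All types and functions are hypothetical; the point is the boundary:
 * plaintext exists only inside the enclave. (Compiles with cc -c.)   */
typedef struct tee_key tee_key;  /* key material, enclave-resident only */
typedef struct blob    blob;     /* ciphertext, safe to store anywhere  */
typedef struct secret  secret;   /* plaintext that never leaves the TEE */

/* Hypothetical enclave primitives. */
tee_key *tee_generate_key(void);                        /* key stays in TEE */
secret  *enclave_decrypt(tee_key *k, const blob *in);   /* inside TEE only  */
secret  *enclave_infer(const secret *model, const secret *input);
blob    *enclave_encrypt(tee_key *k, const secret *out);

blob *confidential_inference(const blob *enc_model, const blob *enc_input) {
    tee_key *key = tee_generate_key();    /* 1. keys exist only in the TEE */

    /* 2. Model and input are decrypted inside the enclave boundary;
     *    host memory, disks, and logs only ever see ciphertext.        */
    secret *model = enclave_decrypt(key, enc_model);
    secret *input = enclave_decrypt(key, enc_input);

    /* 3. Inference runs inside the enclave; intermediate results and
     *    gradients never appear in plaintext outside it.               */
    secret *result = enclave_infer(model, input);

    /* 4. Only the re-encrypted output crosses back to the caller.      */
    return enclave_encrypt(key, result);
}
```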
Leveraging continuous encryption with confidential computing provides a zero-trust, zero-knowledge environment where sensitive information remains protected and secure, even across distributed or cloud-based infrastructures.
The Benefits of End-to-End AI Encryption
This integrated approach offers numerous strategic benefits:
- Isolation of Proprietary Data: Critical IP and sensitive information remain continuously protected throughout the AI lifecycle.
- Controlled Key Management: Keys are securely stored within TEEs, preventing potential leaks through memory exploits or insider threats.
- Model Protection: Encrypted AI models are unusable if exfiltrated, safeguarding against theft and unauthorized usage.
- Regulatory Compliance: Compliance with stringent standards like GDPR and the EU AI Act is achieved seamlessly, without sacrificing operational efficiency.
- Secure Collaboration: Enterprises can collaborate across external platforms without risking exposure of internal datasets or proprietary model logic.
AI models are more than just smart tools. At their core, they are the most advanced, scalable, and creative lossy compression systems ever built. They distill and remix the world’s knowledge, and with the right input, they can decompress sensitive information, especially information unique to a single entity.
The challenge is that, unlike traditional archives, this compressed knowledge isn’t locked away or offline. It sits in memory, 24/7, in the clear, instantly available to anyone with access to the model – whether an internal team or the entire world – via an API. The world’s knowledge, distilled and generalized, is always online, ready to be “decompressed” by whoever asks the right questions.
That makes securing and governing AI models not just a technical challenge, but one of the most urgent questions facing technology today. How do we protect information that is no longer in files, but woven into the patterns of a model’s memory? How do we make sure this always-on, always-available “compressed brain” is used safely, ethically, and in line with the values of individuals and organizations?
I strongly believe the answer is a new approach to confidential computing and continuous encryption, enabling secure AI that keeps the world’s most powerful lossy compressors from becoming its biggest security risks.