Where Is Our Data? Why Data Location Discovery Is Hard


Data location discovery is the first major hurdle in achieving true data security; without discovery, detection of sensitive data and remediation of their vulnerabilities are futile.

This innocent question, “Where is our data?”, marks the starting point for cybersecurity. You can’t begin to secure data until you know where it is – especially critical business, customer, or regulated data. In this new era of agile development, your data can be almost anywhere in the cloud. Getting better visibility is the first step in a new process for securing cloud data called Data Security Posture Management (DSPM). Discovering where your data is located is a prelude to detecting which data are at risk and remediating vulnerabilities to secure these data. In this article, we focus on the first challenge of DSPM: discovering the location of all your data. Until this is solved with precision, truly securing sensitive data is a pipe dream.

Why Data Location Discovery Is Hard

Discovering where data lives is a huge issue because of the nature of agile development. In DevOps and model-driven organizations, a vastly larger and still-expanding volume of structured and unstructured data could be located almost anywhere.

In legacy scenarios, all the data was stored on-premises, which spawned the “Castle & Moat” network security model of restricting external access while allowing internally trusted users. Those were the easy days of security! Cloud has fragmented the legacy architecture by storing data at external locations operated by service providers and other entities. For security architects and practitioners, and those responsible for compliance, this titanic shift in data volumes and locations calls for a different approach to securing the data: hence Data Security Posture Management.

The DSPM approach acknowledges that agile architectures are far more complex because cloud is not a monolithic place. For most enterprises, cloud encompasses many physical and virtual places: two or more cloud service providers such as Amazon, Microsoft, or Google; software-as-a-service providers; platform and infrastructure-as-a-service providers; function-as-a-service providers; data lake providers; business partners; and, of course, a myriad of hybrid clouds, servers, and endpoints within your own organization.

To merely say, “Our data is in the cloud,” however, is unhelpful for data security or compliance. Practitioners must know exactly where sensitive data exists in the cloud. DSPM’s data discovery process prescribes finding cloud-native structured and unstructured data stores. It discovers cloud-native block storage, such as EBS volumes. It discovers PaaS data stores such as Snowflake and Databricks. DSPM should continuously monitor and discover new data stores. And it should notify security teams of the discovery of new data stores or objects that could be at risk.
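The continuous monitoring described above can be pictured as comparing each new scan against the inventory already known to the security team. The following is a minimal sketch under stated assumptions, not any DSPM product's implementation: the `DataStore` record, its fields, and the `find_new_stores` and `notify` helpers are hypothetical names, and a real scan would populate the sets from cloud provider APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    """One discovered data store (hypothetical record; fields are illustrative)."""
    provider: str    # e.g. "aws", "snowflake"
    kind: str        # e.g. "ebs-volume", "warehouse"
    identifier: str  # provider-specific ID or name


def find_new_stores(known: set[DataStore], current: set[DataStore]) -> set[DataStore]:
    """Return stores seen in the latest scan but absent from the known inventory."""
    return current - known


def notify(new_stores: set[DataStore]) -> list[str]:
    """Build one alert line per new store; a real system might page or open a ticket."""
    ordered = sorted(new_stores, key=lambda s: (s.provider, s.kind, s.identifier))
    return [f"NEW DATA STORE: {s.provider}/{s.kind}/{s.identifier}" for s in ordered]


# Assumed example data: one known EBS volume, plus a warehouse that appeared since.
known = {DataStore("aws", "ebs-volume", "vol-001")}
current = known | {DataStore("snowflake", "warehouse", "analytics_wh")}

alerts = notify(find_new_stores(known, current))
for line in alerts:
    print(line)
```

Running the comparison on a schedule, and then folding reviewed stores into the known set, is one simple way to keep the inventory from drifting.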

To enable location discovery, it’s important for teams to have a clear picture of reasons why sensitive data can be almost anywhere – and where to look for it. Let’s take a deeper dive into four issues propelling a need for accurate discovery of sensitive data lost from view.

1. Microservices Bring a Blizzard of New Services … and Data Sources

The blame game starts with microservices. In the Paleolithic Age, when data lived only on-premises serving monolithic applications with multi-year development timelines, lost data mostly came from mechanical or software failure. No one worried then about lost data falling prey to attackers because when data was lost, it was Gone with a capital G! (Hence the invention of Disaster Recovery.)

With cloud and agile DevOps processes, the application architecture has switched to dozens or even hundreds of microservices that an organization can reuse over and over in a stream of new internal and external apps. Data has moved into clouds and multiplied like rabbits in physical and virtual holes throughout the environment. Reproduction is normal and innocuous on the surface. For example, when a new feature is required, or demand for new scale appears, the old database might not work. So, the developer migrates production data into a new datastore and fixes the issue. Perhaps the old service lingers awhile, and over time, developers forget about it – and its fallow database.

Fallow does not mean harmless: the forgotten repository is likely to contain sensitive data. Attackers who gain lateral access inside the cloud will keep an eye out for abandoned databases. These are the least likely to be under strict access controls, so the data are ripe for picking.

Lesson #1: Find and remediate abandoned databases in the development environment.
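One way to approach Lesson #1 is to flag any database that has seen no connections for a long stretch. The sketch below assumes you can already export a last-access timestamp per database (for example, from audit logs); the `find_abandoned` helper, the 90-day threshold, and the database names are illustrative assumptions rather than a recommended standard.

```python
from datetime import datetime, timedelta, timezone


def find_abandoned(
    last_access: dict[str, datetime],
    now: datetime,
    max_idle_days: int = 90,  # assumed threshold; tune to your own release cadence
) -> list[str]:
    """Return names of databases whose last recorded access is older than the cutoff."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(name for name, seen in last_access.items() if seen < cutoff)


# Assumed example data: the old service's database was last touched months ago.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_access = {
    "orders-v2": datetime(2024, 5, 30, tzinfo=timezone.utc),
    "orders-v1": datetime(2023, 11, 2, tzinfo=timezone.utc),
}

abandoned = find_abandoned(last_access, now)
print(abandoned)
```

A flagged database is only a candidate: someone still has to confirm it is unused before it is archived, locked down, or deleted.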

2. AI/ML Modeling Fuels Risky Use of More Data Stores

AI/ML is all about data – the more, the better. Data bulk is important because it allows the models to learn better and faster. Learning accuracy depends on good production data, which tends to include sensitive information that needs protection. Our interest here is in a scenario that especially affects small-to-mid-sized companies that are less mature in AI/ML model management. Security teams usually are good at protecting production data in the cloud. But when a new AI/ML business case arises, data scientists need to move data from production into the model development environment to test hypotheses and enable model learning.

Without security controls managed by an MLOps platform or other means, placing production data into a model development environment can lead to data insecurity. This is especially true when team members make a copy of the data to run their own tests. Typically, the entire database is duplicated because it’s easier to start from scratch than to take the original model and append new data. The result can be significant data duplication. If these data reside in unprotected databases, they become easy targets for attackers.

Lesson #2: Find and remediate old, unused data in AI/ML model development environments.
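Duplicated datasets of the kind described above can sometimes be surfaced by hashing file contents and grouping identical digests. This is a minimal sketch that assumes the copies sit as files under one directory tree; it catches only byte-identical copies, not re-exported or modified ones, and the `file_digest` and `find_duplicates` helpers and the sample file names are hypothetical.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large datasets do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root by content digest; return groups with more than one file."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


# Assumed example: two team members hold the same customer extract.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "alice").mkdir()
    (root / "bob").mkdir()
    (root / "alice" / "customers.csv").write_text("id,email\n1,a@example.com\n")
    (root / "bob" / "customers_copy.csv").write_text("id,email\n1,a@example.com\n")
    (root / "bob" / "notes.txt").write_text("scratch\n")
    duplicate_names = [[p.name for p in group] for group in find_duplicates(root)]

print(duplicate_names)
```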

3. CI/CD Accidentally Creates Shadow Data Stores

A shadow datastore is created by developers for use in the DevOps process, but it is not sanctioned by, or even on the radar of, security operations. Developers might create one for experimental features or for features that went into production without a proper review. Typically, shadow datastores are not operated under standard access controls, nor is the data encrypted. Such behavior accompanies lax CI/CD processes that do not build in security best practices. When shadow datastores contain sensitive information (and they usually do!), they create a major vulnerability that is attractive to attackers who seek the path of least resistance to a breach.

Lesson #3: Find shadow datastores and implement standard security processes for data access and protection.
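Lesson #3 amounts to comparing what discovery finds against what security operations has approved, then checking each store against baseline controls. The sketch below is illustrative only: the `StoreConfig` fields, the `audit` helper, and the idea of a `sanctioned` registry of approved names are assumptions, and real checks would read these settings from provider APIs.

```python
from dataclasses import dataclass


@dataclass
class StoreConfig:
    """Security-relevant settings of one discovered datastore (hypothetical shape)."""
    name: str
    encrypted: bool
    public_access: bool


def audit(discovered: list[StoreConfig], sanctioned: set[str]) -> dict[str, list[str]]:
    """Map each non-compliant store name to the list of issues found on it."""
    findings: dict[str, list[str]] = {}
    for store in discovered:
        issues = []
        if store.name not in sanctioned:
            issues.append("not in security registry (shadow datastore)")
        if not store.encrypted:
            issues.append("encryption at rest disabled")
        if store.public_access:
            issues.append("publicly accessible")
        if issues:
            findings[store.name] = issues
    return findings


# Assumed example data: one approved store and one experimental store.
discovered = [
    StoreConfig("billing-prod", encrypted=True, public_access=False),
    StoreConfig("exp-feature-x", encrypted=False, public_access=True),
]
sanctioned = {"billing-prod"}

findings = audit(discovered, sanctioned)
for name, issues in findings.items():
    print(name, "->", "; ".join(issues))
```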

4. Other Innocent Ways of Exposing Data

Oversights by engineers can expose multiple datastores. Consider the power of access credentials used by engineers. These users often log on to two or more machines in production, which in turn have access to other resources. Entering credentials to access an additional system interrupts the workflow, and some engineers (being human) may be inclined to avoid it. Their approach is to store the credentials so they do not have to reenter them. Because machines in the production environment then hold keys and secrets, voilà: data is potentially exposed to attackers.

Another innocent way to expose data is via downloads. When data scientists or analysts are given access to production data, they download it, move it into a new datastore and do whatever they intended to do with the data. However, the original dump often stays on the machine where the download occurred and could be exposed in the non-production environment.

Lesson #4: Organizations must scan for potentially exposed data or access to it in unusual places. You never know where it might reside!
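A scan of this kind can start very simply: walk a file tree, flag files that look like data dumps, and search file contents for strings shaped like stored credentials. The sketch below is a rough illustration under stated assumptions. The `scan` helper and the suffix list are hypothetical, the patterns are deliberately narrow (the `AKIA` prefix is the familiar shape of long-term AWS access key IDs), and dedicated secret scanners cover far more cases.

```python
import re
import tempfile
from pathlib import Path

# Illustrative patterns only; real scanners use much larger rule sets.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
PRIVATE_KEY = re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----")
DUMP_SUFFIXES = {".sql", ".dump", ".bak", ".csv"}  # assumed list of dump-like extensions


def scan(root: Path, max_bytes: int = 1_000_000) -> list[tuple[str, str]]:
    """Return (relative path, finding) pairs for dump-like files and stored secrets."""
    findings = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if path.suffix.lower() in DUMP_SUFFIXES:
            findings.append((rel, "possible data dump"))
        try:
            with path.open("rb") as f:
                text = f.read(max_bytes).decode("utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the whole scan
        if AWS_KEY_ID.search(text):
            findings.append((rel, "AWS access key ID"))
        if PRIVATE_KEY.search(text):
            findings.append((rel, "private key"))
    return findings


# Assumed example tree; the key below is AWS's published documentation example.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "exports").mkdir()
    (root / "home").mkdir()
    (root / "exports" / "customers.sql").write_text("INSERT INTO customers VALUES (1);\n")
    (root / "home" / "credentials").write_text("aws_access_key_id = AKIAIOSFODNN7EXAMPLE\n")
    results = scan(root)

for rel, finding in results:
    print(f"{rel}: {finding}")
```

Pattern matching like this produces false positives and misses encoded or encrypted secrets, so treat the output as leads for review rather than a verdict.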

The general process flow for DSPM is simple: (1) discover where the data is, (2) identify sensitive data in those stores, and (3) remediate vulnerabilities to secure the data. As we’ve seen, however, sensitive data can be almost anywhere in an enterprise environment, so discovery isn’t so simple – especially using legacy tools that struggle in cloud-native scenarios. In fact, location discovery is the first major hurdle in achieving true data security; without discovery, detection of sensitive data and remediation of their vulnerabilities are futile. A comprehensive risk management strategy, therefore, must include processes for scanning an organization’s entire hybrid environment for lost or forgotten data. Until you can find all your sensitive data, the dream of data protection and meeting related requirements for compliance will always carry the risk of becoming a nightmare.

Amer Deeba

About Amer Deeba

Amer Deeba is co-founder and CEO of Normalyze, a pioneering provider of cloud data security solutions.
