Most data stacks begin governance at the warehouse, with no record of where the ELT data came from or the context of its source. We need to fix that.
Enterprise data teams are facing new demands as businesses require faster access to timely information. Data analysis functions are growing from a single team into larger, more specialized groups as they support more parts of the enterprise. This puts pressure on centralized data engineering teams to field an increasing number of requests from distributed analytics teams in marketing, finance, product, and other business units. At the same time, privacy and security requirements are forcing data engineers to closely examine how data is accessed and used within their organizations. Increasingly, there is a need for more robust data management.
One way to reduce this friction is with a modern ELT approach and a consolidated data stack, which opens up the opportunity to democratize data access across a company. Large organizations should strive to let data analysts ‘self-serve’ their data needs while staying compliant with data governance requirements. By using a delegated control approach, data teams can access the data they need, verify that it is fit for their work, and maintain control without constraint.
As more enterprises shift to ELT, raw data arrives on the front end and is usually more timely and fresh. But this shift also means analysts have less reliability and confidence in the data as it is ingested. Ensuring trust in the data requires a stronger level of governance and data management: one that monitors who has access to different data streams and adds context about where the data came from, so a team isn’t pulling data from a QA server instead of the correct production CRM database.
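One way to add that context is to tag every record with its source environment at ingestion. Here is a minimal sketch in Python; the `SOURCE_REGISTRY` entries, `IngestedRecord` type, and function names are all hypothetical, not part of any specific tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical source registry: which connections are production vs. QA.
SOURCE_REGISTRY = {
    "crm_prod": {"environment": "production", "system": "Salesforce"},
    "crm_qa": {"environment": "qa", "system": "Salesforce"},
}

@dataclass
class IngestedRecord:
    payload: dict
    source_id: str
    metadata: dict = field(default_factory=dict)

def tag_with_provenance(record: IngestedRecord) -> IngestedRecord:
    """Attach source context at ingestion so downstream consumers
    can verify they are reading production data, not QA data."""
    source = SOURCE_REGISTRY.get(record.source_id)
    if source is None:
        raise ValueError(f"Unknown source: {record.source_id}")
    record.metadata.update(source)
    record.metadata["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return record

def assert_production(record: IngestedRecord) -> None:
    """Guard a downstream job against accidentally consuming QA data."""
    if record.metadata.get("environment") != "production":
        raise RuntimeError(f"Refusing non-production data from {record.source_id}")
```

A downstream job that calls `assert_production` before reading fails loudly if someone points a pipeline at the QA connection, rather than quietly polluting a report.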
This problem is most relevant for enterprises with core data teams trying to support a range of data teams across business units and specific departments. These core data teams end up spending too much time evaluating and granting access to data when they could be looking at better ways to simplify the flow of data to the proper teams or focusing on higher-impact data projects. Central data teams are getting pulled in multiple directions and need a better way to manage, prioritize and track access to data.
At the same time, business function-based teams that can’t get the access they need may be tempted to spin up their own S3 buckets and build their own data lake – which makes governance much more challenging. Then, when there’s an audit, access is closed off, and suddenly those rogue teams can’t do their jobs.
This problem hits hardest in industries with highly complex data but traditionally low levels of governance. Any enterprise needs insight into what type of data is going where; otherwise, data engineers may discover PII stored insecurely or various sources of data being combined without the proper controls. Either the data engineering team or automated tools must check permissions and access rights for PII or other sensitive data on each analyst request, which slows down progress.
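The automated-check side of this can be as simple as scanning incoming rows for values that look like PII and flagging the columns that contain them. A minimal sketch, assuming the illustrative patterns below – a real deployment would use a dedicated classification tool rather than hand-rolled regexes:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_pii(rows: list[dict]) -> set[str]:
    """Return the names of columns whose values look like PII."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            for pattern in PII_PATTERNS.values():
                if isinstance(value, str) and pattern.search(value):
                    flagged.add(column)
    return flagged
```

Running this at ingestion means sensitive columns are labeled before anyone requests access, so the check happens once rather than on every analyst request.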
Today, almost any ELT tool is effectively a black box. But when looking at a new data tool or the creation of a BI report, there are many stakeholders who need to sign off on that data access to ensure governance. A Legal team will want to know if PII is present, and if so, limit access to a sales team, for example. Then Security will want to ensure they can do data audits before they make a tool the enterprise standard. And the core data team just needs to know what type of data is going into the warehouse so they can determine which teams have access on the other side.
Data governance today centers on the warehouse and BI tools, but this ignores where the data came from and doesn’t verify its completeness or accuracy. Say, for example, a schema changes upstream – how does that impact the data downstream? And what is the source of the data? Which geography? Which column? Was this from a contact table in Salesforce or a specific page? Without a modern data stack, this context is not always available. But companies need to know their data lineage so they can uncover mistakes or retrace their steps when an issue requires a fix.
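At its core, lineage is a graph from upstream columns to the downstream assets derived from them; walking that graph answers the “what does this schema change break?” question. A minimal sketch – the asset names are hypothetical, and production systems track far richer metadata:

```python
from collections import defaultdict

# Maps an upstream column to the downstream assets derived from it.
lineage = defaultdict(set)

def record_derivation(downstream: str, upstreams: list[str]) -> None:
    """Register that a downstream asset is built from upstream columns."""
    for up in upstreams:
        lineage[up].add(downstream)

def impacted_by(upstream: str) -> set[str]:
    """Walk the graph to find every downstream asset affected by a
    change to the given upstream column."""
    impacted, frontier = set(), [upstream]
    while frontier:
        node = frontier.pop()
        for child in lineage[node]:
            if child not in impacted:
                impacted.add(child)
                frontier.append(child)
    return impacted

# Hypothetical example: a warehouse table and a BI report both trace
# back to one Salesforce contact column.
record_derivation("warehouse.contacts.email", ["salesforce.contact.Email"])
record_derivation("bi.churn_report.email", ["warehouse.contacts.email"])
```

With this in place, a schema change to `salesforce.contact.Email` immediately surfaces both the warehouse table and the BI report as impacted, instead of the change being discovered when a dashboard breaks.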
If enterprises want to serve all of their internal customers and specific departments without putting too much of a burden on the core data teams, they should take the following steps:
- Organize teams to provide control without constraint. As data teams become more embedded in business groups, the central data team needs to provide a standardized tech stack for the whole company to ensure governance. If distributed teams adopt common tools, then central data teams can ensure governance is automatically enforced in a standardized way, while individual teams have more access to what they need.
- Set up organization-wide governance policies. As data teams become embedded throughout a company, different teams may use a variety of sources, pipelines, and destinations. Governance policies should apply to individual data assets. For example, the sales team needs access to customer information; that policy then has to be enforced across all sources, pipelines, and destinations. Setting up policies tool by tool makes it very hard to ensure a policy is correctly applied and consistently enforced. Simplify things by starting governance early, so data sources are logged and available, you know the context and source type, and the right policy gets applied.
- Ensure visibility into data movement. Focus less on cleaning and transforming data on its way into the warehouse, and focus more on capturing all the context. Make sure your organization has full knowledge of the “who/what/where” of its data so the relevant distributed data teams have access to the appropriate sources. Defer transformation and schema organization until the data is accessed, not when it is ingested; this saves time and adds flexibility. Require teams to gather enough metadata upstream to support access permissions downstream. If a schema changes, teams need the data lineage to determine the other impacts.
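The policy step above can be sketched as policy-as-code: one declarative rule evaluated against asset tags everywhere, rather than settings duplicated across each tool. The policy shape, tag names, and the assumption that PII columns are tagged upstream are all illustrative:

```python
# Hypothetical policy-as-code sketch: one definition, enforced uniformly
# across sources, pipelines, and destinations.
POLICIES = [
    {
        "name": "sales-customer-access",
        "grantee": "sales",
        "required_tags": {"customer"},  # asset must carry these tags
        "deny_tags": {"pii"},           # assumes PII is tagged at ingestion
    },
]

def is_allowed(team: str, asset_tags: set[str]) -> bool:
    """Check every policy; grant access only when a policy for this team
    matches the asset's tags and no deny tag is present."""
    for policy in POLICIES:
        if policy["grantee"] != team:
            continue
        if policy["required_tags"] <= asset_tags and not (policy["deny_tags"] & asset_tags):
            return True
    return False
```

Because the source, the pipeline, and the warehouse all call the same check, the sales-team policy behaves identically at every stage – which is exactly what per-tool configuration struggles to guarantee.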
By centralizing on a data stack, providing a structure for access, and analyzing how data is flowing, companies are able to add control without constraint for their central and dispersed data teams. This helps these companies audit systems and identify who has access to what data while giving them the ability to set the right access policies and eventually integrate smoothly with the organization’s governance toolset.
By taking steps to clearly spell out the differing roles of central data teams and line-of-business analyst teams, large companies can get a better handle on how their data is being used across the company. By clearly delineating different types of data requests and mapping them to different team needs, organizations can make sure that data is handled correctly while still supporting a ‘self-serve’ approach that helps analysts complete their work efficiently.