Reece Group, a seller of plumbing and bathroom supplies, has stood up a “next-generation data platform” to enable secure and controlled self-service access to data by different departments and teams.
Image credit: Reece Group
Applied data science principal Erik Pak told the Databricks Data Intelligence Day in Melbourne that Reece had previously leant heavily on a centralised BI team to execute analytics and reporting tasks.
“We’re an enterprise-level retail company, and just like a lot of those companies we have a centralised BI team, and of course we also share a lot of the common challenges,” Pak said.
“For example, our BI team seems to be never big enough to support all the BI initiatives across the company, and our [data] warehouse never seems to be big enough to host all the data that everyone needs or even to handle all the workload efficiently at the same time.
“Not to mention it always seems there’s some user that has a long-running query that affects everyone else, and we just pray that it doesn’t break our overnight jobs.”
In addition to enabling a higher degree of self-service, Reece wanted departments and teams that owned data to “take better ownership … on the data platform, not just simply within the source system”, Pak said.
With data security top of mind, Reece also wanted the platform to offer physical dataset and compute isolation, and to make it easy for a department to grant another department or team partial or full access to its data.
“If we want a data owner to take more ownership of their data in the data platform and to manage access control, we have to have a way for them to manage the control easily,” Pak said.
To achieve this, Reece set up Databricks in an AWS account owned by the data platform team. It then set up a Databricks workspace – with its own cloud storage and compute capacity – for each department, in that department’s own AWS account.
“In terms of compute, each workspace would have a shared cluster and shared SQL warehouse at a department level, and then depending on how much a given team actually uses the platform, we might actually create team-based clusters for them,” Pak said.
“For some specific use cases and scenarios, some individual users might require their own compute, and we will just do that on an ad hoc request basis.”
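The compute tiers Pak describes can be pictured as a small data model. The sketch below is illustrative only: the `Workspace` class, its method names, and the department, team and user labels are all hypothetical, and it models the layout in plain Python rather than calling any Databricks or AWS API.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """One workspace per department (hypothetical model, not Reece's code)."""
    department: str
    aws_account: str  # placeholder label, not a real account ID
    # Every workspace starts with department-level shared compute.
    compute: list[str] = field(
        default_factory=lambda: ["shared-cluster", "shared-sql-warehouse"]
    )

    def add_team_cluster(self, team: str) -> None:
        """Add a cluster for a team that uses the platform heavily."""
        self.compute.append(f"team-cluster:{team}")

    def add_user_compute(self, user: str) -> None:
        """Add single-user compute, granted on an ad hoc request basis."""
        self.compute.append(f"user-compute:{user}")


# Example: a made-up finance department with one busy team and one ad hoc request.
finance = Workspace(department="finance", aws_account="finance-aws-account")
finance.add_team_cluster("forecasting")
finance.add_user_compute("analyst-1")
print(finance.compute)
```

Starting every workspace with the two shared resources mirrors the department-level default in the quote; team and individual compute are only added on demand.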
Pak said the organisation is also using serverless SQL warehouses for some query processing, which had “made a very positive impact from a user experience and also cost perspective.”
With the workspaces stood up, Reece then used Unity Catalog to enable users of one workspace to request and – if granted – access data from other workspaces.
Unity Catalog “provides centralised access control, auditing, lineage, and data discovery capabilities across Databricks workspaces”, according to Databricks’ documentation.
“With Unity Catalog’s permission model, now we can actually allow the data owner within the workspace to grant another team from a different workspace access to their data, and potentially maybe only a portion of their data as well,” Pak said.
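As a rough illustration of sharing “only a portion” of a dataset, a data owner could grant another team read access to selected tables only. The sketch below simply builds Unity Catalog `GRANT` statements as strings; the helper `build_grants` and the catalog, schema, table and group names are hypothetical, and the article does not say how Reece actually issues its grants.

```python
def build_grants(catalog: str, schema: str, tables: list[str], grantee: str) -> list[str]:
    """Build the SQL a data owner might run to share selected tables.

    Reading a table in Unity Catalog needs USE CATALOG and USE SCHEMA
    on its parents as well as SELECT on the table itself.
    """
    principal = f"`{grantee}`"  # backticks quote the group or user name
    statements = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {principal}",
    ]
    # Only the listed tables are shared; everything else in the schema stays private.
    statements += [
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {principal}"
        for table in tables
    ]
    return statements


# Example: a made-up sales catalog shares just its orders table with marketing.
grants = build_grants("sales", "retail", ["orders"], "marketing-analysts")
for statement in grants:
    print(statement)
```

The generated statements would then be run by the data owner in a Databricks notebook or SQL editor; nothing here connects to a workspace.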
Pak said the platform has met its data security and user self-service targets, and has driven more decentralised responsibility and accountability for data access.
“The most important thing is to start having a conversation within your organisations about a concept of treating data as a product and try to expand the responsibility of data ownership simply from maintaining the source system towards actually taking responsibility on how their data is being used across the organisation,” Pak told the summit audience.
“This will also help you to offload your BI team to work on more strategic things.”