Skills required for data engineering success


Data has always been regarded as an organisation’s crown jewels, but the explosion of data sources has made it increasingly complex to make sense of the structured and unstructured information held across an enterprise’s different data stores. Pulling everything together to provide a homogeneous view of business activity can seem like a project that will never end, which is why interest in data engineering is growing.

According to analyst Gartner, data engineers play a key role in enabling organisations to unlock the value of data. This involves designing and building systems to collect, store, transform, operationalise and deliver data at scale. The analyst firm says data engineering involves collaboration between the business and IT to make the appropriate data accessible and available to various data users – such as data scientists or data analysts – at the right time.

Gartner’s Essential skills for data engineers to succeed report identifies a range of skills required in data engineering. Report authors Mayank Talwar, Zain Khan and Shubhankar Nandi describe structured query language (SQL) as being pervasive across a wide range of tools and platforms, making it a critical and extensible skill. As an example of SQL’s pervasiveness, they note that dbt, a data transformation tool, enables data engineers to transform data in their warehouses simply by writing SQL select statements.
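The idea of expressing a transformation as nothing more than a select statement can be sketched in a few lines. The following is a minimal illustration using Python’s built-in sqlite3 module standing in for a warehouse; the table and column names are invented, and dbt itself manages such models as versioned .sql files rather than inline strings, but the core pattern is the same: a new table is derived entirely from a SELECT over raw data.

```python
import sqlite3

# In-memory database standing in for a warehouse; table and column
# names here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 120.0, "complete"), (2, 80.0, "cancelled"), (3, 40.0, "complete")],
)

# The transformation itself is just a SELECT statement, in the spirit
# of a dbt model: derive a cleaned, aggregated table from raw data.
conn.execute(
    """CREATE TABLE completed_order_totals AS
       SELECT status, COUNT(*) AS n, SUM(amount) AS total
       FROM raw_orders
       WHERE status = 'complete'
       GROUP BY status"""
)

for row in conn.execute("SELECT * FROM completed_order_totals"):
    print(row)  # ('complete', 2, 160.0)
```

Because the transformation logic lives in declarative SQL rather than procedural code, it is portable across the many platforms that speak SQL, which is precisely why the Gartner authors single the language out.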

The second core skill identified in the report is data processing, described as a “foundational skill that every data engineer must possess”, because data in its raw format is rarely useful for analytics. Data processing covers both batch and real-time processing, while data storage spans technologies such as data lakes, data warehouses, graph and document databases, and object stores. Common programming languages used by data engineering teams include Python, Java and Scala.
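The batch versus real-time distinction can be made concrete with a toy sketch in Python. The event structure and field names below are invented for illustration: both modes share a cleaning step (since raw data is rarely analytics-ready), but batch processing consumes the whole dataset before reporting, while streaming emits an updated result per event.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical raw events; the field names are invented for illustration.
RAW_EVENTS = [
    {"user": "a", "bytes": "512"},
    {"user": "b", "bytes": "oops"},   # dirty record, typical of raw data
    {"user": "a", "bytes": "1024"},
]

def clean(events: Iterable[dict]) -> Iterator[dict]:
    """Shared cleaning step: drop records that will not parse."""
    for e in events:
        try:
            yield {"user": e["user"], "bytes": int(e["bytes"])}
        except ValueError:
            continue

def batch_totals(events: Iterable[dict]) -> dict:
    """Batch mode: consume the whole dataset, then report once."""
    totals: dict = defaultdict(int)
    for e in clean(events):
        totals[e["user"]] += e["bytes"]
    return dict(totals)

def stream_totals(events: Iterable[dict]) -> Iterator[dict]:
    """Streaming mode: emit an updated running total per event."""
    totals: dict = defaultdict(int)
    for e in clean(events):
        totals[e["user"]] += e["bytes"]
        yield dict(totals)

print(batch_totals(RAW_EVENTS))         # {'a': 1536}
print(list(stream_totals(RAW_EVENTS)))  # [{'a': 512}, {'a': 1536}]
```

In production the same split maps onto frameworks such as Spark for batch work and Flink or Kafka Streams for real-time work, but the underlying trade-off — latency versus completeness — is the one shown here.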

Other core skills listed by Gartner include data storage, data orchestration, programming and collaboration. With regard to data orchestration, the analysts note that data engineering pipelines are slowly moving away from tools built around task-driven architectures, such as Apache Airflow and Luigi, towards tools that offer a data-driven approach, such as Dagster, Flyte (which originated at Lyft) and Reflow.
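The shift from task-driven to data-driven orchestration can be illustrated with a toy scheduler in plain Python — this deliberately uses no Airflow or Dagster APIs, and the step and dataset names are invented. Rather than wiring tasks directly to one another, each step declares the datasets it consumes and the one it produces, and the run order is derived from those declarations.

```python
from graphlib import TopologicalSorter

# Data-driven view: each step declares the datasets it consumes ("needs")
# and the one it produces ("makes"); execution order is derived from the
# data, not hand-wired between tasks. Names are invented for illustration.
STEPS = {
    "extract":   {"needs": [],          "makes": "raw"},
    "clean":     {"needs": ["raw"],     "makes": "clean"},
    "aggregate": {"needs": ["clean"],   "makes": "metrics"},
    "publish":   {"needs": ["metrics"], "makes": "report"},
}

# Map each dataset back to the step that produces it, then build a
# dependency graph between steps from the dataset declarations.
producer = {s["makes"]: name for name, s in STEPS.items()}
graph = {name: {producer[d] for d in s["needs"]} for name, s in STEPS.items()}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['extract', 'clean', 'aggregate', 'publish']
```

The appeal of the data-driven style is that lineage falls out for free: because the scheduler knows which step produces which dataset, it can answer questions such as “what must rerun if this table changes?” without any extra wiring.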

Gartner recommends that IT leaders prioritise development of the core data engineering skills, since they are widely adopted, heavily used and have been proven to deliver significant benefits.

A simpler approach?

There is a case for assessing a simpler approach to the goal of providing timely enterprise data to the business in a format users can apply to planning and analysis. This is where providers of traditional enterprise resource planning (ERP) systems see an opportunity to build a business around the need for organisations to have a single version of the truth. From an ERP perspective, this single version of the truth resides in the systems of record that make up an ERP system.

SAP, for instance, delivers an entire systems and application stack as a cloud-centric offering on a subscription basis, together with process mining and other tools, plus bundled support, maintenance and other services.

Dale Vile, co-founder of analyst firm Freeform Dynamics, notes that SAP’s Business Technology Platform (BTP) can be considered an integral part of the supplier’s cloud offering. BTP is essentially a platform as a service (PaaS) that allows customers to extend SAP applications and/or build custom applications.

“For some customers, this kind of all-encompassing service is truly attractive as it means they no longer have to worry as much about systems-level operations, monitoring, security and so on,” says Vile. “A lot of the stuff that makes SAP landscapes so challenging to run and change over time is taken care of once you sign the contract.”

The contract effectively ties an organisation into SAP. While there is a case for building in flexibility, for some organisations it is far more important to have a single version of the truth, with all data in one place. This is the case at Irish manufacturing firm WaterWipes, as data manager Liz Cotter explains.

You can have your advanced analytics automation, but if your master data isn’t accurate, then your transactional data is worthless
Liz Cotter, WaterWipes

Previously, she says, software as a service (SaaS) systems sat alongside SAP and “may have been integrated with SAP, but were not fully harmonised”. In other words, the organisation selected best-of-breed SaaS products to support certain business processes, such as human resources or customer service. Cotter says this meant SAP was not the system of record for some of the newer datasets being used by the business.

She says SAP Datasphere enables the business to run a standard platform as a system of record for transactional data, which provides a master copy of the organisation’s data. “I feel that SAP has transitioned and is offering more tools to keep up with the demand for enriched data,” she says. 

Cotter joined WaterWipes in January 2024 with a remit to put in place data management and data governance. She says the company was not making the best use of the data it had available, which could be used to gain insights and help to align with strategic key performance indicators (KPIs).

“When we assessed our data maturity, there was no data governance and data security. We needed a tool to help mitigate that risk quickly,” she says.

As Cotter points out, successful IT-driven business initiatives require a solid data foundation. “You can have your advanced analytics automation, but if your master data isn’t accurate, then your transactional data is worthless,” she says. For Cotter, there is little point in investing in new technology unless the data is as accurate as possible.

The company began working with Bluestonex to implement its Maextro master data management tool, which is built on SAP BTP and provides data governance and data management for WaterWipes.

“It’s basically an application to manage data, workflows and data reporting,” says Cotter.

This avoids SAP developers having to run queries directly on the company’s S/4Hana system. In terms of data maturity, Cotter says: “We’re not going to get to expert level, but we want to align with our 2027 strategy, which is very ambitious in terms of sales and customer growth.”

The phased approach has involved building out data governance and data management best practices first, before investing in technology.

Supporting AI

Given the trend to do more with artificial intelligence (AI), the Gartner analysts urge IT leaders to ensure data engineers recognise the need to upskill. According to Gartner, this upskilling is required if data engineers are to help build the data foundation layer for companies that have decided to train language models on their enterprise data.

“With GenAI’s [generative artificial intelligence] appetite for training data exponentially rising, data engineers can play a pivotal role in creating data platforms and pipelines that can supply high-quality data required for training these models,” the analysts note in the Essential skills for data engineers to succeed report.

Gartner predicts that companies will start building smaller, more refined and business-curated language models – as opposed to large language models – for greater control over cost, privacy, risk and accuracy. Gartner believes data engineers will need to learn how to work with unstructured data and create data repositories to enable the building of these models.
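One routine piece of that unstructured-data work is splitting documents into overlapping chunks before they are loaded into a training or retrieval repository. The following is a minimal sketch — the function name, window size and overlap are invented defaults, not a prescribed standard — showing the word-window approach in plain Python.

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split a document into overlapping word-window chunks, a common
    first step when preparing unstructured text for model training or
    retrieval. Overlap preserves context that spans chunk boundaries."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A synthetic 120-word document for demonstration.
doc = " ".join(f"word{i}" for i in range(120))
pieces = chunk_text(doc)
print(len(pieces), [len(c.split()) for c in pieces])  # 3 [50, 50, 40]
```

Real pipelines layer cleaning, deduplication and metadata tagging on top of this, but chunking of this kind is typically where a repository of unstructured text begins.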

Ideally, IT leaders would be given the time and resources to develop a data engineering practice, but this is unlikely. Cotter’s experience at WaterWipes shows it is entirely possible for even those organisations that are still quite early in their data management journey to achieve business value relatively quickly. The one caveat is that this may involve being tied into a particular product set, such as an ERP system.


