A path to better data engineering


Today’s data landscape presents unprecedented challenges for organisations, which need to process thousands of documents in numerous data formats. These, as Bogdan Raduta, head of research for FlowX.ai, points out, range from PDFs and spreadsheets to images and multimedia, all of which need to be brought together and processed into meaningful information.

Each data source has its own data model and requirements, and unless they can be brought together in a meaningful way, organisations end up dealing with data silos. This can mean users are forced to move between one application and another, cutting and pasting information from different systems to get the insights they need to drive informed decision-making.

However, traditional data engineering approaches struggle with the complexity of pulling in data in different formats. “While conventional ETL [extract, transform and load] data pipelines excel at processing structured data, they falter when confronting the ambiguity and variability of real-world information,” says Raduta. In practice, rule-based systems become brittle and expensive to maintain as the variety of data sources grows.

In his experience, even modern integration platforms, designed for application programming interface (API)-driven workflows, struggle with the semantic understanding required to process natural language content effectively.

With all of the hype surrounding artificial intelligence (AI) and data, the tech industry really should be able to handle this level of data heterogeneity. But Jesse Anderson, managing director of Big Data Institute, argues that there is a lack of understanding of the job roles and skills needed for data science.

One misconception, according to Anderson, is that data scientists are the people who both create the models and do all of the engineering work required. But he says: “If you ever want to hear how something data-related can’t be done, just go to the ‘no team’ for data warehousing, and you’ll be told, ‘no, it can’t be done’.”

This perception of reality doesn’t bode well for the industry, he says, because the data projects don’t go anywhere.

Developing a data engineering mindset

Anderson believes that part of the confusion comes from the two quite different definitions of the data engineering role.

One definition describes a structured query language (SQL)-focused person. This, he says, is someone who can pull information from different data sources by writing queries using SQL.

The other definition is a software engineer with specialised knowledge in creating data systems. Such individuals, says Anderson, can write code as well as SQL queries. More importantly, they can create complex data systems, whereas a SQL-focused person is totally reliant on less complex systems, often low-code or no-code tools.

“The ability to write code is a key part of a data engineer who is a software engineer,” he says. As complicated requirements come from the business and from system design, Anderson says these data engineers have the skills needed to build such complex systems.

However, if it were easy to create the right data engineering team in the first place, everyone would have done it. “Some profound organisational and technical changes are necessary,” says Anderson. “You’ll have to convince your C-level to fund the team, convince HR that you’ll have to pay them well, and convince business that working with a competent data engineering team can solve their data problems.”

In his experience, getting on the right path for data engineering takes a concerted effort; it does not evolve organically as teams take on different projects.

Lessons from science

Recalling a recent problem with data access, Justin Pront, senior director of product at TetraScience, says: “When a major pharmaceutical company recently tried to use AI to analyse a year of bioprocessing data, they hit a wall familiar to every data engineer: their data was technically ‘accessible’ but practically unusable.”

Pront says the company’s instrument readings sat in proprietary formats, while critical metadata resided in disconnected systems. What this meant, he says, is that simple questions, such as enquiring about the conditions for a particular experiment, required manual detective work across multiple databases.

“This scenario highlights a truth I’ve observed repeatedly – scientific data represents the ultimate stress test for enterprise data architectures. While most organisations grapple with data silos, scientific data pushes these challenges to their absolute limits,” he says.

For instance, scientific data analysis relies on multi-dimensional numerical sets, which Pront says come from “a dizzying array of sensitive instruments, unstructured notes written by bench scientists, inconsistent key-value pairs and workflows so complex that the shortest ones total 40 steps.”

For Pront, there are three key principles from scientific data engineering that any organisation looking to improve its data engineering needs to grasp: the shift from file-centric to data-centric architectures; the importance of preserving context from source through transformation; and the need for unified data access patterns that serve both immediate and future analysis needs.

According to Pront, the challenges faced by data engineers in life sciences offer valuable lessons that could benefit any data-intensive enterprise. “Preserving context, ensuring data integrity and enabling diverse analytical workflows apply far beyond scientific domains and use cases,” he says.

Discussing the shift to a data-centric architecture, he adds: “Like many business users, scientists traditionally view files as their primary data container. However, files segment information into limited-access silos and strip away crucial context. While this works for the individual scientist analysing their assay results to get data into their electronic lab notebook (ELN) or lab informatics management system (LIMS), it makes any aggregate or exploratory analysis or AI and ML [machine learning] engineering time and labour-intensive.”

Pront believes modern data engineering should focus on the information, preserving relationships and metadata that make data valuable. For Pront, this means using platforms that capture and maintain data lineage, quality metrics and usage context.
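
To make that concrete, the sketch below shows one way a data-centric record might carry its lineage and quality context alongside the value itself. It is a minimal Python illustration, not TetraScience’s data model; the field names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class MeasurementRecord:
    """A data-centric record: the value travels with its context."""
    value: str                   # raw reading, exactly as acquired
    unit: str                    # e.g. "mg/mL"
    instrument_id: str           # which instrument produced it
    acquired_at: datetime        # acquisition timestamp
    source_file: str             # lineage: where the value came from
    transformations: tuple = ()  # lineage: every processing step applied so far
    quality_flags: tuple = ()    # e.g. ("calibrated", "within-range")

record = MeasurementRecord(
    value="12.50",
    unit="mg/mL",
    instrument_id="HPLC-07",
    acquired_at=datetime.now(timezone.utc),
    source_file="runs/2024-06-03/assay_plate_4.raw",
)
```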

In terms of data integrity, he says: “Even minor data alterations in scientific work, such as omitting a trailing zero in a decimal reading, can lead to misinterpretation or invalid conclusions. This drives the need for immutable data acquisition and repeatable processing pipelines that preserve original values while enabling different data views.”
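
A minimal sketch of that principle: the acquired value is kept verbatim and immutable, and every downstream view is derived from it rather than replacing it. The trailing-zero reading is purely illustrative.

```python
from decimal import Decimal

# Acquisition: store the reading exactly as captured and never overwrite it.
raw_reading = "12.50"   # the trailing zero signals the instrument's precision

# Views are derived on demand; the original value is never altered.
as_decimal = Decimal(raw_reading)   # Decimal('12.50') keeps the precision
as_float = float(raw_reading)       # 12.5 is fine for plotting, lossy for audit

# A repeatable pipeline can always trace any view back to the source value.
assert str(as_decimal) == raw_reading
```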

In regulated industries such as healthcare, pharmaceuticals and financial services, data integrity from acquisition at a file or source system through data transformation and analysis is non-negotiable.

Looking at data access for scientists, Pront says there is a tension between immediate accessibility and future utility. This is clearly a situation that many organisations face. “Scientists want, and need, seamless access to data in their preferred analysis tools, so they end up with generalised desktop-based tooling such as spreadsheets or localised visualisation software. That’s how we end up with more silos,” he says.

However, as Pront notes, scientists can also use cloud-based datasets colocated with their analysis tools, which gives them the same quick analysis while the entire enterprise benefits from having the data prepped and ready for advanced applications, AI training and, where needed, regulatory submissions. He says data lakehouses built on open storage formats such as Delta and Iceberg have emerged in response to these needs, offering unified governance and flexible access patterns.
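
As a rough sketch of that pattern, the example below writes prepared results to an open-format Delta table using the open source deltalake (delta-rs) Python package and reads them back for quick analysis. The local path and columns are illustrative; in practice the table would typically sit in governed object storage.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Land prepared results in an open-format table once...
results = pd.DataFrame({
    "experiment_id": ["EXP-101", "EXP-102"],
    "titer_g_per_l": [2.41, 2.58],
    "run_date": ["2024-06-01", "2024-06-02"],
})
write_deltalake("./lakehouse/bioprocess_titers", results, mode="append")

# ...and serve both quick interactive analysis and downstream AI or
# regulatory workloads from the same copy of the data.
df = DeltaTable("./lakehouse/bioprocess_titers").to_pandas()
print(df.groupby("experiment_id")["titer_g_per_l"].mean())
```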

Engineering data flows

Returning to the challenge of making sense of all the different types of data an organisation needs to process, as Raduta from FlowX.ai has previously noted, ETL falls far short of what businesses now need.

One promising area of AI that the tech sector has developed is large language models (LLMs). Raduta says LLMs offer a fundamentally different approach to data engineering. Rather than relying on the deterministic transformation rules inherent in ETL tools, he says: “LLMs can understand context and extract meaning from unstructured content, effectively turning any document into a queryable data source.”
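
The sketch below illustrates the idea, assuming an OpenAI-compatible chat completions client; the model name, prompt and field names are illustrative, not anything FlowX.ai has described.

```python
import json
from openai import OpenAI  # assumes the openai package and an API key are configured

client = OpenAI()

def extract_fields(document_text: str) -> dict:
    """Ask the model to turn free-form text into a queryable record."""
    prompt = (
        "Extract the invoice number, total amount and currency from the text below. "
        "Reply with a JSON object using the keys invoice_number, total and currency.\n\n"
        + document_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON back
    )
    return json.loads(response.choices[0].message.content)

print(extract_fields("Invoice INV-2024-117: please pay EUR 1,250.00 within 30 days."))
```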

For Raduta, this means LLMs offer an entirely new architecture for data processing. At its foundation lies an intelligent ingestion layer that can handle diverse input sources. But unlike traditional ETL systems, Raduta says, this ingestion layer not only extracts information from data sources, but also understands what the different sources it ingests are actually saying.
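
One way such an ingestion layer might be organised is sketched below: normalise each source to text, then hand it to the semantic extraction step shown earlier. The function names are hypothetical, and the PDF and image branches are deliberately left as placeholders for a document parser and an OCR engine.

```python
from pathlib import Path

def read_source(path: Path) -> str:
    """Normalise a diverse input into plain text before semantic processing."""
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md", ".csv"}:
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf":
        raise NotImplementedError("plug in a PDF text extractor here")
    if suffix in {".png", ".jpg", ".tiff"}:
        raise NotImplementedError("plug in an OCR engine here")
    raise ValueError(f"unsupported source type: {suffix}")

def ingest(path: Path) -> dict:
    """Read the source, then let the model interpret what it is actually saying."""
    text = read_source(path)
    return extract_fields(text)  # the semantic extraction sketched above
```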

There is unlikely to be a single approach to data engineering. TetraScience’s Pront urges IT leaders to consider data engineering as a practice that evolves over time. And as Big Data Institute’s Anderson points out, the skills required combine programming and traditional data science expertise, which means IT leaders will need to convince the board and HR that attracting the right data engineering talent will mean paying a premium for staff.


