A path to better data engineering


Today’s data landscape presents unprecedented challenges for organisations, which need to process thousands of documents in numerous data formats. These, as Bogdan Raduta, head of research for FlowX.ai, points out, range from PDFs and spreadsheets to images and multimedia, all of which need to be brought together and processed into meaningful information.

Each data source has its own data model and requirements, and unless they can be brought together in a meaningful way, organisations end up dealing with data silos. This can mean users are forced to move between one application and another, cutting and pasting information from different systems to get the insights they need to drive informed decision-making.

However, traditional data engineering approaches struggle with the complexity of pulling in data in different formats. “While conventional ETL [extract, transform and load] data pipelines excel at processing structured data, they falter when confronting the ambiguity and variability of real-world information,” says Raduta. In practice, rule-based systems become brittle and expensive to maintain as the variety of data sources grows.

In his experience, even modern integration platforms, designed for application programming interface (API)-driven workflows, struggle with the semantic understanding required to process natural language content effectively.

With all of the hype surrounding artificial intelligence (AI) and data, the tech industry really should be able to handle this level of data heterogeneity. But Jesse Anderson, managing director of Big Data Institute, argues that there is a lack of understanding of the job roles and skills needed for data science.

One misconception, according to Anderson, is that data scientists are the people who both create the models and do all of the engineering work required. But he says: “If you ever want to hear how something data-related can’t be done, just go to the ‘no team’ for data warehousing, and you’ll be told, ‘no, it can’t be done’.”

This perception of reality doesn’t bode well for the industry, he says, because the data projects don’t go anywhere.

Developing a data engineering mindset

Anderson believes that part of the confusion comes from the two quite different definitions of the data engineering role.

One definition describes a structured query language (SQL)-focused person. This, he says, is someone who can pull information from different data sources by writing queries using SQL.

The other definition is a software engineer with specialised knowledge in creating data systems. Such individuals, says Anderson, can write code as well as SQL queries. More importantly, they can create complex data systems, whereas a SQL-focused person is totally reliant on less complex systems, often low-code or no-code tools.

“The ability to write code is a key part of a data engineer who is a software engineer,” he says. As complicated requirements come from the business and from system design, Anderson says these data engineers have the skills needed to build such complex systems.

However, if it were easy to create the right data engineering team in the first place, everyone would have done it. “Some profound organisational and technical changes are necessary,” says Anderson. “You’ll have to convince your C-level to fund the team, convince HR that you’ll have to pay them well, and convince business that working with a competent data engineering team can solve their data problems.”

In his experience, getting on the right path for data engineering takes a concerted effort; it does not evolve organically as teams take on different projects.

Lessons from science

Recalling a recent problem with data access, Justin Pront, senior director of product at TetraScience, says: “When a major pharmaceutical company recently tried to use AI to analyse a year of bioprocessing data, they hit a wall familiar to every data engineer: their data was technically ‘accessible’ but practically unusable.”

Pront says the company’s instrument readings sat in proprietary formats, while critical metadata resided in disconnected systems. What this meant, he says, is that simple questions, such as enquiring about the conditions for a particular experiment, required manual detective work across multiple databases.

“This scenario highlights a truth I’ve observed repeatedly – scientific data represents the ultimate stress test for enterprise data architectures. While most organisations grapple with data silos, scientific data pushes these challenges to their absolute limits,” he says.

For instance, scientific data analysis relies on multi-dimensional numerical sets, which Pront says come from “a dizzying array of sensitive instruments, unstructured notes written by bench scientists, inconsistent key-value pairs and workflows so complex that the shortest ones total 40 steps.”

For Pront, there are three key principles from scientific data engineering that any organisation looking to improve its data engineering needs to grasp: the shift from file-centric to data-centric architectures; the importance of preserving context from source through transformation; and the need for unified data access patterns that serve both immediate and future analysis needs.

According to Pront, the challenges faced by data engineers in life sciences offer valuable lessons that could benefit any data-intensive enterprise. “Preserving context, ensuring data integrity and enabling diverse analytical workflows apply far beyond scientific domains and use cases,” he says.

Discussing the shift to a data-centric architecture, he adds: “Like many business users, scientists traditionally view files as their primary data container. However, files segment information into limited-access silos and strip away crucial context. While this works for the individual scientist analysing their assay results to get data into their electronic lab notebook (ELN) or lab informatics management system (LIMS), it makes any aggregate or exploratory analysis or AI and ML [machine learning] engineering time and labour-intensive.”

Pront believes modern data engineering should focus on the information, preserving relationships and metadata that make data valuable. For Pront, this means using platforms that capture and maintain data lineage, quality metrics and usage context.
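
To make that concrete, the sketch below shows one way a data-centric record might carry its lineage and quality context alongside the value itself. It is a minimal Python illustration, not TetraScience’s data model; the field names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class MeasurementRecord:
    """A data-centric record: the value travels with its context."""
    value: str                   # raw reading, exactly as acquired
    unit: str                    # e.g. "mg/mL"
    instrument_id: str           # which instrument produced it
    acquired_at: datetime        # acquisition timestamp
    source_file: str             # lineage: where the value came from
    transformations: tuple = ()  # lineage: every processing step applied so far
    quality_flags: tuple = ()    # e.g. ("calibrated", "within-range")

record = MeasurementRecord(
    value="12.50",
    unit="mg/mL",
    instrument_id="HPLC-07",
    acquired_at=datetime.now(timezone.utc),
    source_file="runs/2024-06-03/assay_plate_4.raw",
)
```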

In terms of data integrity, he says: “Even minor data alterations in scientific work, such as omitting a trailing zero in a decimal reading, can lead to misinterpretation or invalid conclusions. This drives the need for immutable data acquisition and repeatable processing pipelines that preserve original values while enabling different data views.”
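
A minimal sketch of that principle: the acquired value is kept verbatim and immutable, and every downstream view is derived from it rather than replacing it. The trailing-zero reading is purely illustrative.

```python
from decimal import Decimal

# Acquisition: store the reading exactly as captured and never overwrite it.
raw_reading = "12.50"   # the trailing zero signals the instrument's precision

# Views are derived on demand; the original value is never altered.
as_decimal = Decimal(raw_reading)   # Decimal('12.50') keeps the precision
as_float = float(raw_reading)       # 12.5 is fine for plotting, lossy for audit

# A repeatable pipeline can always trace any view back to the source value.
assert str(as_decimal) == raw_reading
```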

In regulated industries such as healthcare, pharmaceuticals and financial services, data integrity from acquisition at a file or source system through data transformation and analysis is non-negotiable.

Looking at data access for scientists, Pront says there is a tension between immediate accessibility and future utility. This is clearly a situation that many organisations face. “Scientists want, and need, seamless access to data in their preferred analysis tools, so they end up with generalised desktop-based tooling such as spreadsheets or localised visualisation software. That’s how we end up with more silos,” he says.

However, as Pront notes, scientists can also use cloud-based datasets colocated with their analysis tools, which gives them the same quick analysis while the entire enterprise benefits from having the data prepped and ready for advanced applications, AI training and, where needed, regulatory submissions. He says data lakehouses built on open storage formats such as Delta and Iceberg have emerged in response to these needs, offering unified governance and flexible access patterns.
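
As a rough sketch of that pattern, the example below writes prepared results to an open-format Delta table using the open source deltalake (delta-rs) Python package and reads them back for quick analysis. The local path and columns are illustrative; in practice the table would typically sit in governed object storage.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Land prepared results in an open-format table once...
results = pd.DataFrame({
    "experiment_id": ["EXP-101", "EXP-102"],
    "titer_g_per_l": [2.41, 2.58],
    "run_date": ["2024-06-01", "2024-06-02"],
})
write_deltalake("./lakehouse/bioprocess_titers", results, mode="append")

# ...and serve both quick interactive analysis and downstream AI or
# regulatory workloads from the same copy of the data.
df = DeltaTable("./lakehouse/bioprocess_titers").to_pandas()
print(df.groupby("experiment_id")["titer_g_per_l"].mean())
```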

Engineering data flows

Returning to the challenge of making sense of all the different types of data an organisation needs to process, as Raduta from FlowX.ai has previously noted, ETL falls far short of what businesses now need.

One promising area of AI that the tech sector has developed is large language models (LLMs). Raduta says LLMs offer a fundamentally different approach to data engineering. Rather than relying on the deterministic transformation rules inherent in ETL tools, he says: “LLMs can understand context and extract meaning from unstructured content, effectively turning any document into a queryable data source.”
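
The sketch below illustrates the idea, assuming an OpenAI-compatible chat completions client; the model name, prompt and field names are illustrative, not anything FlowX.ai has described.

```python
import json
from openai import OpenAI  # assumes the openai package and an API key are configured

client = OpenAI()

def extract_fields(document_text: str) -> dict:
    """Ask the model to turn free-form text into a queryable record."""
    prompt = (
        "Extract the invoice number, total amount and currency from the text below. "
        "Reply with a JSON object using the keys invoice_number, total and currency.\n\n"
        + document_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON back
    )
    return json.loads(response.choices[0].message.content)

print(extract_fields("Invoice INV-2024-117: please pay EUR 1,250.00 within 30 days."))
```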

For Raduta, this means LLMs offer an entirely new architecture for data processing. At its foundation lies an intelligent ingestion layer that can handle diverse input sources. But unlike traditional ETL systems, Raduta says, this ingestion layer not only extracts information from data sources, but also understands what the different sources it ingests are actually saying.
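
One way such an ingestion layer might be organised is sketched below: normalise each source to text, then hand it to the semantic extraction step shown earlier. The function names are hypothetical, and the PDF and image branches are deliberately left as placeholders for a document parser and an OCR engine.

```python
from pathlib import Path

def read_source(path: Path) -> str:
    """Normalise a diverse input into plain text before semantic processing."""
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md", ".csv"}:
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf":
        raise NotImplementedError("plug in a PDF text extractor here")
    if suffix in {".png", ".jpg", ".tiff"}:
        raise NotImplementedError("plug in an OCR engine here")
    raise ValueError(f"unsupported source type: {suffix}")

def ingest(path: Path) -> dict:
    """Read the source, then let the model interpret what it is actually saying."""
    text = read_source(path)
    return extract_fields(text)  # the semantic extraction sketched above
```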

There is unlikely to be a single approach to data engineering. TetraScience’s Pront urges IT leaders to consider data engineering as a practice that evolves over time. And as Big Data Institute’s Anderson points out, the skills required combine programming and traditional data science expertise, which means IT leaders will need to convince the board and HR that attracting the right data engineering talent will mean paying a premium for staff.


