Podcast: How to get value from unstructured data

We talk to Nasuni founder and chief technology officer (CTO) Andres Rodriguez about the characteristics needed from storage to make optimal use of unstructured data in the enterprise, as well as the challenge of its scale.

He says the cloud has changed everything, with the cloud model of working providing a blueprint for a single pool of storage accessible from anywhere.

He also says enterprises need to classify, tag and curate data to build rich metadata that can boost corporate knowledge of and access to data, and make it available to artificial intelligence (AI), such as via Model Context Protocol (MCP) connectors.

What is the nature of the obstacles to optimal use of unstructured data in the enterprise?

It really is all about scale. I mean, if you go back to what unstructured data is, it’s all of the files in the file servers, the NAS [network-attached storage], etc.

It’s all of that work product. So, if you are an architecture firm, it’s design drawings. If you’re a manufacturing firm, it’s design drawings and simulations. All of that ends up in the files, in the file systems of the enterprise.

And in every organisation, in addition to that, there’s the classic office documents – Excel and PowerPoints and Word documents and PDFs. Those are generic across all industries. And so, you end up with this sort of huge potential repository that could be mined to add value to the organisation.

But the challenge is, how do you access it? How do you control access to it at the same time that you can access it? And then, how do you plug it into the tools that are going to give you insights into that data? And doing that at scale is a really formidable challenge.

So, what do customers need from the way unstructured data is stored so that they can gain as much insight from it as possible?

The first thing is there’s so much of it in organisations that what ends up happening with traditional approaches is you end up with lots of silos of data. You know, the data gets stored in devices, the devices are all over the place, etc.

If it’s a large organisation, there could be different geographic locations where employees are located, and they need high-performance access to files in those locations. So you end up building silos for those.

It could just be capacity. You run out of capacity in one file server, so you deploy another one and another one, and you end up with this incredible number of file servers. So, when you look to do things that are valuable with the data, you realise that it’s become impossible because the data is in so many different silos, and it’s hard to get to the silos and aggregate them in any sort of logical way.

The cloud changed all that. Many organisations, especially large organisations that have consolidated their unstructured data, their file data, into the cloud, have realised this enormous gain, which is that the data is now consolidated in one logical space that is infinitely scalable, and it’s available at very high levels of performance from anywhere in the world.

The cloud is infinite and the cloud is everywhere. And so, that is an incredible foundational piece for them to be able to tap into that data repository, that unstructured data repository, and gather insights from the data.

What technologies underpin the optimal use of unstructured data for customers, especially in this era of AI?

I think there are several pieces.

At the foundational level, you want technology that allows for NAS consolidation. One of our specialties is to provide that sort of NAS, enabled with the cloud, that gives you scale and high performance anywhere you want it. That’s the first building block.

Then, on top of that block, you need unstructured data management tools that allow you to take that enormous repository and manage it properly at scale.

For everything I’m talking about, you’re fighting a scale headwind, so you need technology that allows you to get to hundreds of millions or billions of files and petabytes of storage; otherwise, you’re going to end up being crippled in your efforts by the sheer scale of the problem.

So, in this next layer of unstructured data management, you want to have very scalable tools that allow you to classify data, tag data, set access controls at a global level for the data – in other words, curate the data.
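As a rough illustration of what rule-based classification and tagging might look like in practice – a minimal sketch in Python, with hypothetical paths, tag names and a local SQLite catalogue rather than any particular product’s API – the idea is to scan a share, derive tags from cheap signals such as file extension and location, and record them as metadata:

```python
import os
import sqlite3
from pathlib import Path

# Hypothetical tag rules: map cheap signals (extension, path fragment) to tags.
EXTENSION_TAGS = {
    ".dwg": "design-drawing",
    ".xlsx": "office-doc",
    ".docx": "office-doc",
    ".pdf": "office-doc",
}

def classify(path: Path) -> list[str]:
    """Derive tags for a file from its extension and location."""
    tags = []
    if path.suffix.lower() in EXTENSION_TAGS:
        tags.append(EXTENSION_TAGS[path.suffix.lower()])
    if "projects" in (part.lower() for part in path.parts):
        tags.append("project-data")
    return tags

def build_catalogue(root: str, db_path: str = "catalogue.db") -> None:
    """Walk a file tree and store path, size, mtime and tags in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, size INTEGER, mtime REAL, tags TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            p = Path(dirpath) / name
            stat = p.stat()
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (str(p), stat.st_size, stat.st_mtime, ",".join(classify(p))),
            )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    build_catalogue("/mnt/shared")  # hypothetical mount point
```

In a real deployment the same logic would run against a consolidated namespace and a proper metadata store, but the shape of the job is the same.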

I mean, if you look at what people are trying to do now with AI and gaining insights from AI, the failure of most of those projects can be attributed to a lack of sufficient quality data going into the LLMs [large language models]. In engineering school, they used to teach us, you put garbage into a model, you get garbage out of a model.

The first priority is to clean up the data that’s going into your models. This means tools that allow you to do that at scale with the regular unstructured data that your organisation is producing, so that as the organisation continues to evolve, that dataset is updated automatically.

Not because you’re doing some special kind of lift and effort, but because you’ve already set up the pipelines and all the systems are automatically cleaning up the data and making the data available to the machine learning models.
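One way to picture those always-on pipelines – again only a sketch, with a hypothetical ingest_for_models() hook standing in for whatever cleaning and model tooling is actually in use – is an incremental job that only reprocesses files changed since the previous run:

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("last_run.json")  # hypothetical state store

def changed_since(root: str, since: float) -> list[Path]:
    """Return files modified after the previous run."""
    return [p for p in Path(root).rglob("*") if p.is_file() and p.stat().st_mtime > since]

def ingest_for_models(path: Path) -> None:
    """Placeholder: clean the file and hand it to the model-facing pipeline."""
    print(f"would ingest {path}")

def run_incremental(root: str) -> None:
    """Process only what changed, then remember when this run happened."""
    since = json.loads(STATE_FILE.read_text())["ts"] if STATE_FILE.exists() else 0.0
    for path in changed_since(root, since):
        ingest_for_models(path)
    STATE_FILE.write_text(json.dumps({"ts": time.time()}))

if __name__ == "__main__":
    run_incremental("/mnt/shared")  # hypothetical mount point
```

Scheduled regularly, a job like this keeps the curated dataset in step with the files the organisation keeps producing, without anyone lifting a finger.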

That’s how you get a system that doesn’t just work once when you’re running the project, but adds insights to the organisation on an ongoing basis.

And so, the last layer is this sort of general-purpose plug-in into all of the available LLM models. There isn’t going to be a single one that’s going to meet all your needs.

You need to have a sort of hub that allows you to connect. The term people are using now is MCP – interfaces that give you standard access to different models. That sort of standardisation at the level of the models is crucial because the dataset isn’t going to change.

I mean, it’s going to change when workers change, but it isn’t going to change based on what model you’re using. You have to be able to plug in whatever model is best suited to the goal you’re trying to achieve.

And if it doesn’t work, or if you want an upgrade, or if you want to switch vendors, you need to be able to change that. It’s what we call late binding, and later in the project, you need to be able to make that decision.
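The late-binding point is easier to see in code. In this sketch the backend names are hypothetical and no real vendor SDK is called; the point is that callers depend only on the interface, so the model behind it can be swapped by configuration rather than by rewriting the application:

```python
from typing import Protocol

class ModelBackend(Protocol):
    """The only thing the rest of the system is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

class VendorABackend:
    def complete(self, prompt: str) -> str:
        # In reality this would call vendor A's API (for example via an MCP connector).
        return f"[vendor A] {prompt[:40]}..."

class VendorBBackend:
    def complete(self, prompt: str) -> str:
        # Switching vendors means changing only this binding, not the callers.
        return f"[vendor B] {prompt[:40]}..."

# The binding happens late, driven by configuration rather than code changes.
BACKENDS = {"vendor-a": VendorABackend, "vendor-b": VendorBBackend}

def get_backend(name: str) -> ModelBackend:
    return BACKENDS[name]()

if __name__ == "__main__":
    model = get_backend("vendor-b")  # hypothetical configuration value
    print(model.complete("Summarise the Q3 project files"))
```

The MCP-style standardisation the interview mentions buys the same flexibility at the protocol level: the dataset and the pipelines stay put while the model behind the interface changes.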

And then, of course, you need to close the loop and see the insights you’re getting from the data through some sort of reporting interface – things like Tableau.

What our clients typically want to do is look at project data and estimate: is this project going to be on time? Is it going to be on budget? All based on signals coming from the unstructured data.

Or you want to be able to do compliance at a higher level of knowledge. Perhaps you want to understand not just what’s in the files, but how end users interact with those files, how those files have changed over time. That can give you enormous insights into the behaviour of your unstructured data, and how your organisation is using or not using that data.

So, it’s really about the integration of those three layers: the foundational NAS consolidation or unstructured data consolidation layer, which is all about storage and making sure the data is protected, making sure you have capacity and high performance. Then above that is an unstructured data management layer that allows you to curate the data and prepare it so that you make it available to the third layer, which is the interface to all the machine learning models.

I guess the curation and classification layer part of things is all about the metadata. Would that be the case?

That is correct.

Sometimes you can harness the data to come up with metadata, but the rules are always based on metadata.

So, the idea is you have to have a rich structure. This is why that first layer, the NAS consolidation, is so important.

It’s because you need a rich structure in your file system that allows you to annotate your data with new metadata, so that rules can be set based on that metadata to control the curation – the behaviour – of the unstructured data.
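To make “rules based on metadata” concrete – a sketch reusing the hypothetical catalogue fields (tags, size, modification time) from the earlier examples – curation rules can be written as predicates over a file’s metadata paired with the action to take:

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class FileMeta:
    path: str
    size: int
    mtime: float
    tags: list[str] = field(default_factory=list)

# A rule is a predicate over metadata plus the curation action it triggers.
Rule = tuple[Callable[[FileMeta], bool], str]

RULES: list[Rule] = [
    (lambda m: "project-data" in m.tags and time.time() - m.mtime > 5 * 365 * 86400,
     "archive-to-cold-tier"),
    (lambda m: "office-doc" in m.tags and m.size > 500 * 1024 * 1024,
     "flag-for-review"),
]

def curate(meta: FileMeta) -> list[str]:
    """Return every curation action whose rule matches this file's metadata."""
    return [action for predicate, action in RULES if predicate(meta)]

if __name__ == "__main__":
    sample = FileMeta("/mnt/shared/projects/old_plan.xlsx", 700 * 1024 * 1024,
                      time.time() - 6 * 365 * 86400, ["office-doc", "project-data"])
    print(curate(sample))  # ['archive-to-cold-tier', 'flag-for-review']
```

The richer the metadata annotated onto the file system, the more expressive these rules can become, which is why the consolidation layer underneath matters so much.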

