Meta open-sources AI tool to automatically classify sensitive documents
Meta has released an open source AI tool called Automated Sensitive Document Classification. It was originally built for internal use and is designed to find sensitive information in documents and apply security labels automatically.
The tool uses customizable classification rules and works with files that contain readable text. Once labeled, the documents can be protected from unauthorized access or excluded from AI systems that use retrieval-augmented generation (RAG).
The solution uses Apache Tika to pull text from Google Docs, Sheets, and Slides. It then uses Llama to spot sensitive content and works with the Google Drive API to apply sensitivity labels to those files.
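The three stages described above (extract text, classify it, apply a label) can be sketched as a small pipeline. This is a minimal illustration, not Meta's implementation: `run_pipeline`, `PipelineResult`, the label names, and the three stand-in functions are all hypothetical. In a real deployment the callables would wrap Apache Tika, a Llama model, and the Google Drive API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    file_id: str
    label: str      # hypothetical label name, e.g. "confidential"
    parsed: bool    # False when text extraction failed


def run_pipeline(
    file_id: str,
    extract_text: Callable[[str], str],       # stand-in for Tika extraction
    classify_text: Callable[[str], str],      # stand-in for the Llama call
    apply_label: Callable[[str, str], None],  # stand-in for the Drive API call
) -> PipelineResult:
    """Extract text, classify it, then label the file."""
    try:
        text = extract_text(file_id)
    except Exception:
        # Unreadable files are reported, not labeled.
        return PipelineResult(file_id, "unparsed", False)
    label = classify_text(text)
    apply_label(file_id, label)
    return PipelineResult(file_id, label, True)


# Toy stand-ins so the sketch runs without any external service.
FAKE_DRIVE = {"doc-1": "Q3 payroll export with employee salaries"}
APPLIED_LABELS: dict[str, str] = {}


def fake_extract(file_id: str) -> str:
    return FAKE_DRIVE[file_id]  # raises KeyError for unknown files


def fake_classify(text: str) -> str:
    # A keyword check only for demonstration; the real tool uses an LLM.
    return "confidential" if "salaries" in text.lower() else "general"


def fake_apply(file_id: str, label: str) -> None:
    APPLIED_LABELS[file_id] = label
```

Passing the stages in as functions keeps the sketch independent of any particular extraction, model, or storage backend, which mirrors the flexibility the article describes.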
Why the researchers created Automated Sensitive Document Classification
At Meta, preventing the loss of sensitive data is a constant challenge, made even harder by the volume and variety of information the company manages. “Data loss prevention of sensitive data is a common problem in security and privacy,” Robin Franklin, Security Engineer at Meta, told Help Net Security.
Meta handles a vast range of file types and sensitive data. That scale made standard methods, like using regular expressions, fall short. “Normal approaches, like RegEx, weren’t sufficient for us to identify sensitive data,” Franklin said.
To address the problem, Meta turned to an LLM-based solution. “To meet our scalability and accuracy goals, we decided to build an LLM-based solution, which also ensured seamless auditability in our deployment.” This new system doesn’t just classify data. It also helps map out where that data lives across the organization.
“It can output a CSV of the files enumerated and the results of a classification run, or even store everything to the included SQLite database,” Franklin explained. That includes the classification result, MD5 hash, and parsing status of each file.
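To make the shape of that output concrete, here is a minimal sketch of recording one classification run as CSV and in a local database. It assumes the database in question is SQLite; the column names, the `results` table, and the helper functions are hypothetical and are not taken from the tool's actual schema.

```python
import csv
import hashlib
import io
import sqlite3

# Hypothetical column names covering the three fields the article mentions.
FIELDS = ["file_name", "classification", "md5", "parse_status"]


def make_record(file_name: str, content: bytes,
                classification: str, parse_status: str) -> dict:
    """Build one result row, hashing the raw file bytes with MD5."""
    return {
        "file_name": file_name,
        "classification": classification,
        "md5": hashlib.md5(content).hexdigest(),
        "parse_status": parse_status,
    }


def to_csv(records: list[dict]) -> str:
    """Serialize the run as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_sqlite(records: list[dict], conn: sqlite3.Connection) -> None:
    """Store the same rows in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(file_name TEXT, classification TEXT, md5 TEXT, parse_status TEXT)"
    )
    conn.executemany(
        "INSERT INTO results VALUES "
        "(:file_name, :classification, :md5, :parse_status)",
        records,
    )
    conn.commit()


records = [make_record("plan.txt", b"hello", "confidential", "ok")]
conn = sqlite3.connect(":memory:")  # in-memory database for the example
to_sqlite(records, conn)
```

Storing a hash alongside each label lets a later run detect that a file's contents changed after it was classified, which is one way such records could support tamper detection.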
With this level of detail, Meta’s security and privacy teams can better detect when sensitive data is being mishandled or stolen, without relying so much on manual labeling. “Ultimately, all of this information allows security or privacy teams to develop detections with high precision and recall for exfiltration or tampering of sensitive data while reducing the manual burden on an organization to label their content.”
What makes this tool unique
Meta is releasing its custom data classification system as open source, aiming to help other organizations struggling with data loss prevention. “We decided to open source this work to help other teams facing similar problems for data loss prevention,” said Franklin.
When the project began nearly three years ago, there weren’t many guides or tools available for building a custom classification system outside of what major document platforms already offered. “There were no reference points for building a custom classification architecture outside of the existing document platforms,” Franklin said. “The information we’re sharing now would have accelerated our progress even more quickly, and we hope others find it useful as well.”
To make the tool more usable, the Meta team focused on giving developers flexibility. “We wanted to make classification as flexible as possible for developers to label their data with their own standards,” Franklin said. The tool uses a multilevel classification agent that can be configured to match a company’s own policies or standards. “Our reference implementation provides a starting point,” Franklin added.
That flexibility also applies to how teams deploy the tool. “We include the infrastructure to deploy this as a Docker container, meaning any organization can scale this service however they like,” Franklin said. “And we include an option to interface with the classification engine as a Python package anywhere they’d like.”
Future plans and download
“Our architecture currently supports a llama-stack deployment and Google Drive integration. Long term, we would like to expand the number of deployment platforms (like Ollama) and the number of SaaS document sharing platforms we support with the classification engine. Office 365 has the same concept of document sensitivity labels that would also benefit from automatic classification with our approach. As we get additional feedback from the open source community, we plan to prioritize other approaches and platforms,” Franklin concluded.
Automated Sensitive Document Classification is available for free on GitHub.