When a critical vulnerability in the printing system CUPS started raising alarms among security teams, Detectify had already entered war-room mode to address the situation. Within the day, customers could test whether they were vulnerable thanks to the rollout of a new scanning engine framework that reinvents how Detectify operates under the hood, allowing for a faster and more efficient response to security threats. In this first post of a Detectify under the hood blog series, we will introduce our new engine framework.
On Thursday, September 26, security researcher evilsocket published a write-up alongside a PoC for a critical-severity unauthenticated remote code execution (RCE) vulnerability affecting the CUPS open-source printing system on GNU/Linux.
Attackers can exploit this vulnerability by sending specially crafted IPP packets to cups-browsed (CVE-2024-47176), causing CUPS to connect to an attacker-controlled server from which malicious IPP properties are returned. Those properties can then be used to exploit further components in the CUPS ecosystem (CVE-2024-47076, CVE-2024-47175, and CVE-2024-47177). The exploit is triggered when the victim runs a print job on the affected device, leading to remote code execution. RCE vulnerabilities are generally considered the most critical, giving attackers multiple options for malicious action, including data theft, unauthorized access, and complete system compromise through malware and backdoors.
As soon as the CUPS flaw was disclosed, Detectify entered war-room mode to build a test for the vulnerability and keep customers safe against such a critical threat. Our goal was to get a test engine into production within the day.
Testing for the vulnerability
To assess whether customers’ systems are vulnerable, we need to send a specific UDP request, containing a callback URL in its payload, to an endpoint on the server running the cups-browsed service. This action adds a new printer to the system. While we would typically need to wait for a user to print something in order to achieve remote code execution, we can determine that the server is vulnerable simply by receiving the callback from the printing system. This confirms that the printer has been successfully added and that an attacker would later be able to perform malicious actions.
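To make this concrete, here is a minimal sketch of such a probe in Go. It is illustrative only: the payload format follows the public write-up’s description of the cups-browsed browse protocol (“type state uri”), and the target and callback addresses are hypothetical placeholders, not part of our actual engine.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"time"
)

// probe sends a cups-browsed browse packet pointing the target at our
// callback URL. A vulnerable target will fetch printer attributes from
// that URL, which is all we need to observe to confirm the vulnerability.
func probe(target, callbackURL string) error {
	conn, err := net.DialTimeout("udp", target, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()
	// "0 3 <uri>" is the minimal browse packet: type, state, printer URI.
	_, err = conn.Write([]byte(fmt.Sprintf("0 3 %s", callbackURL)))
	return err
}

func main() {
	target := "198.51.100.10:631" // cups-browsed listens on UDP 631 by default
	callback := "http://203.0.113.5:8080/printers/test"

	// Any request arriving here proves the target processed our packet.
	http.HandleFunc("/printers/test", func(w http.ResponseWriter, r *http.Request) {
		log.Printf("callback from %s: target appears vulnerable", r.RemoteAddr)
	})
	go http.ListenAndServe(":8080", nil)

	if err := probe(target, callback); err != nil {
		log.Fatal(err)
	}
	time.Sleep(30 * time.Second) // give the target time to call back
}
```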
Say hello to our engine framework
Part of the success in this journey was the use of our newly built engine framework, which helps us get engines out quickly. An engine, for us, is how we package and run a set of vulnerability tests. It can be seen as a virtual hacker that takes a set of targets and performs vulnerability tests on them to assess whether they are secure.
We have many different engines and often need to create new ones. Our engine framework does the heavy lifting for us, enabling one-click deploys and generating most of the commoditized engine code and supporting functionality, so our developers can spend their time innovating on security research.
Fundamentally, all engines require development tooling such as Git repositories, build pipelines, infrastructure as code, and deployment automation. They also need product capabilities: a scheduler, rate limiting, force/avoid lists, metrics and observability, back-office functionality, transparency into the tests performed, and the lifecycle and status of found vulnerabilities, along with an automated way of interacting with them. The only difference between engines lies in the specific security research and tests they automate, which is the only element we need to focus on to begin testing a new type of vulnerability.
Having an engine framework brings several benefits to both us and our customers, from increasing the speed of iteration and reducing cognitive load to optimizing running costs and streamlining response time to new types of vulnerabilities. In upcoming blog posts, we will further explore the different key aspects of the engine.
Running one-offs for validation
The first step for any given engine is to validate that the security research works in a production environment. The engine framework provides an out-of-the-box GraphQL API used for back-office functionality, such as letting our sales engineers gain deeper insight into which tests we have run against specific targets. This API also allows for one-off tests, where a target and a scope are provided to the engine, which then yields the results, helping us verify that it operates correctly.
To use the framework, we need to implement a test runner. The runner has a simple interface: it accepts a target (an IP address and port, in this case) and a scope (the particular test to run), and returns a result (whether we found the vulnerability, for instance). The engine then takes care of persisting the results or, in the case of a one-off, returning them via the API.
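In Go-flavored pseudocode, that interface might look like the sketch below. All names here are hypothetical illustrations, not the framework’s actual API.

```go
package engine

import "context"

// Target identifies what to test; for the CUPS engine, an IP address and port.
type Target struct {
	IP   string
	Port int
}

// Scope narrows the run down to a particular test.
type Scope struct {
	TestID string
}

// Result reports whether a test found the vulnerability.
type Result struct {
	Vulnerable bool
	Evidence   string
}

// Runner is the one piece each engine author implements. The framework
// calls it both for one-off API requests and for scheduled assessments,
// and handles persisting or returning the results.
type Runner interface {
	Run(ctx context.Context, target Target, scope Scope) (Result, error)
}
```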
We reuse the runner code for the one-off API and regular automated tests. This approach facilitates rapid experimentation when automating research and validates results in already running engines.
Running things at scale
When the research is validated, it is time to start running things at scale. Our engine framework supports this by separating an engine’s concerns into three main parts: subscription handling, monitoring, and assessment.
The subscription is how the engine connects to our ecosystem and acquires knowledge about our customers’ external attack surface in order to do its job. It captures customers’ needs and configurations and creates the appropriate monitors (what tests to run and at what cadence, given that not all tests are of the same importance).
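As a rough sketch of that translation, continuing the hypothetical `engine` package from above (the types and cadences are made-up examples, not our real configuration model):

```go
package engine

import "time"

// Subscription captures a customer's targets and the tests they have enabled.
type Subscription struct {
	Targets      []Target
	EnabledTests []Test
}

// Test describes one security test and its severity.
type Test struct {
	ID       string
	Severity string
}

// Monitor pairs a test against a target with the cadence at which to run it.
type Monitor struct {
	Target   Target
	TestID   string
	Interval time.Duration
}

// monitorsFor derives monitors from a subscription, running critical tests
// more often than the rest.
func monitorsFor(sub Subscription) []Monitor {
	var monitors []Monitor
	for _, target := range sub.Targets {
		for _, test := range sub.EnabledTests {
			interval := 24 * time.Hour // default daily cadence
			if test.Severity == "critical" {
				interval = time.Hour // critical tests run hourly
			}
			monitors = append(monitors, Monitor{Target: target, TestID: test.ID, Interval: interval})
		}
	}
	return monitors
}
```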
The monitoring part is all about distributing the so-called jobs to be done (security tests) in a way that respects the expected cadence while avoiding large traffic spikes that would negatively impact our customers. As a result, our engines run in a predictable manner, enabling proper capacity planning when provisioning our infrastructure and striking a balance between coping with the necessary load and maintaining cost efficiency.
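One common way to achieve such smoothing (a sketch of the general technique, not necessarily the framework’s exact strategy) is to spread each monitor’s first run uniformly across its interval, so monitors that share a cadence do not all fire at the same instant:

```go
package engine

import (
	"math/rand"
	"time"
)

// scheduleFirstRun jitters a monitor's first run uniformly over its interval;
// subsequent runs then simply repeat at the interval, preserving the cadence.
// Assumes Interval is positive.
func scheduleFirstRun(m Monitor, now time.Time) time.Time {
	jitter := time.Duration(rand.Int63n(int64(m.Interval)))
	return now.Add(jitter)
}
```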
The assessment part is responsible for running the jobs (security tests). Here, we re-use the runner code implemented during the one-off validation, and as previously mentioned, this is where the innovation lives in each specific engine. The framework ensures that we capture the jobs’ results and publish them in a standard way to simplify their consumption, enabling other teams to build new features on top of them.
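Because the Runner carries all engine-specific logic, the assessment worker itself can stay generic. A sketch, again with hypothetical names:

```go
package engine

import "context"

// Job is the unit of work the monitoring part hands over to assessment.
type Job struct {
	Target Target
	Scope  Scope
}

// assess pulls jobs from a queue, delegates each one to the engine's Runner,
// and publishes every successful result in a standard envelope that other
// teams can consume.
func assess(ctx context.Context, jobs <-chan Job, r Runner, publish func(Result) error) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case job := <-jobs:
			result, err := r.Run(ctx, job.Target, job.Scope)
			if err != nil {
				continue // failed jobs feed the error-rate metric instead
			}
			if err := publish(result); err != nil {
				return err
			}
		}
	}
}
```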
We will deep-dive into these parts in future blog posts, showcasing what they do and the interesting technical patterns we used to build them. Below is a visual overview of the flow:
[Figure: overview of the flow from subscription handling through monitoring to assessment]
So far, we have primarily discussed the back-end work necessary to get tests for the CUPS vulnerability out within the day. However, a good deal of cross-team collaboration and front-end work was also involved in making these tests visible to our customers in the Detectify UI. Since the communication pattern for engines using the framework is consistent across the board, automating the front-end tasks to create a complete solution is both feasible and highly desirable. This will be a key focus as we continue implementing our engine framework.
Observing the behavior of the running system
Using the engine framework makes all engines look and behave similarly, allowing us to automate how we monitor their behavior and performance. Additionally, with one click we can create dashboards showing basic product metrics as well as detailed operational and reliability metrics.
The number of subscriptions a particular engine is handling, the number of jobs it is performing, and the positive results it has found are just some examples of the product metrics we are interested in. For reliability and operational metrics, we took inspiration from Google’s SRE book, specifically the Four Golden Signals. We applied those concepts by monitoring the error rate of jobs, the time they take to finish, and the lag while waiting to run, as well as more generic saturation metrics like CPU, memory, and database disk space.
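As an illustration of what such instrumentation can look like, here is a sketch using the Prometheus Go client; the metric names are invented for the example and are not our production names.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Traffic and errors: jobs run per engine, labeled by outcome.
	jobsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "engine_jobs_total",
		Help: "Number of jobs run, labeled by engine and outcome.",
	}, []string{"engine", "outcome"})

	// Latency: how long jobs take to finish.
	jobDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "engine_job_duration_seconds",
		Help:    "Time taken to finish a job.",
		Buckets: prometheus.DefBuckets,
	}, []string{"engine"})

	// Saturation proxy: how long jobs wait in the queue before starting.
	jobLag = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "engine_job_lag_seconds",
		Help:    "Time a job spent queued before it started running.",
		Buckets: prometheus.DefBuckets,
	}, []string{"engine"})
)
```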
In the case of the CUPS vulnerability, we discovered and celebrated our first finding in production just a few hours after the engine’s release.
Framework + collaboration = fast response
As a critical 0-day threat, the CUPS vulnerability sounded the alarm among many security teams, and we promptly took action to ensure that our customers remained safe. Through effective collaboration among security researchers, back-end engineers, designers, and front-end engineers, we got a new engine into production testing for the CUPS vulnerability within the day. Having a framework that abstracts away everything that “just needs to be there” pays off over and over.
Stay tuned as we share more about how our engine framework works under the hood, deep-diving into fundamental principles and technical challenges. Go hack yourself.