As part of our Detectify under the hood blog series, we recently introduced our new engine framework and how it helped us address a critical 0-day vulnerability within a day. In this article, we deep-dive into the problem space of monitoring our customers’ attack surface and distributing security tests to them.
Monitoring
At Detectify, we define the problem space of monitoring as the process of keeping the right data up to date by conducting the appropriate security tests on our customers with a reasonable cadence. This can be broken down into several subcategories. First, monitoring involves designing what we call monitors, which are all about capturing which tests to run and at what frequency. Second, we have the monitor distribution, which is used to schedule the jobs (sets of tests to be run) and build up a job queue. Finally, there is the observability part to guarantee that performance meets expectations.
Monitor Design
When designing the monitors, we define the sets of tests that we want to run and their cadence. Not all tests are equally important, and based on their relevance, we can schedule different tests at different cadences. This enables us to lower the amount of traffic towards our customers and the load within our systems.
An example of this is port reconnaissance, where it’s crucial to probe common HTTP and HTTPS ports several times a day, as we have a wide range of HTTP vulnerability tests that we need to perform. Other ports require less frequent checks, and considering the full spectrum of 65,535 ports, it becomes clear that probing all of them with the same monitor would be inefficient, both from a cost perspective and in terms of traffic impact for our customers.
Our engine framework allows us to divide the ports we are testing into several monitors, each running at its own cadence. This enables us to test for everything in a logical manner. The design of these monitors is based on knowledge from our security experts as well as data that we gather from our customers.
From a framework perspective, the monitor configuration is one of the few things that is unique to each engine. The framework provides a hook where each engine can provide its monitoring configuration. This can be as simple as a list of monitors, each one having a name, a list of tests, and a cadence.
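As a rough illustration of what such a configuration could look like, here is a simplified sketch; the table, column names, and values below are made up for the sake of the example and are not our actual schema:

-- Purely illustrative: one way a port reconnaissance engine could express its
-- monitors as rows of name, tests and cadence. Not our actual schema.
CREATE TABLE monitor_config (
    name    TEXT PRIMARY KEY,  -- e.g. 'common-http-ports'
    tests   TEXT[] NOT NULL,   -- the tests this monitor runs
    cadence INTERVAL NOT NULL  -- how often the monitor should be distributed
);

INSERT INTO monitor_config (name, tests, cadence) VALUES
    ('common-http-ports', ARRAY['port-80', 'port-443', 'port-8080'], INTERVAL '8 hours'),
    ('uncommon-ports',    ARRAY['port-2222', 'port-5900'],           INTERVAL '7 days');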
Monitor Distribution
The sole purpose of the monitor distribution is to distribute the jobs to be performed (what tests to run) in a way that follows their expected cadence. The distribution occurs on a scheduled basis, with each monitor holding the timing for the next distribution.
When it’s time for distribution, the dirty monitors (those that need to be distributed based on their schedule) are fetched, and jobs are created and spread out between the current and the next distribution time. This lets us distribute the jobs across that interval, always at varying times, flattening the curve to avoid overwhelming our customers with traffic. In security, this technique is known as slow dripping: tests are performed slowly over time to avoid triggering intrusion prevention systems.
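As a minimal sketch of that idea (the `job` table, the `tests` and `cadence` columns, and the random offset below are assumptions for the sake of the example, not our exact implementation), each test of a dirty monitor could be given a scheduled time somewhere within the upcoming cadence interval:

-- Illustrative sketch of slow dripping: every test of the dirty monitor gets a
-- random scheduled time within the upcoming cadence interval, so the jobs are
-- spread out instead of all firing at once. The job table and the tests and
-- cadence columns are assumptions for this example.
INSERT INTO job (monitor_id, test, scheduled_at)
SELECT m.id,
       t.test,
       NOW() + m.cadence * random()  -- a random point before the next distribution
FROM monitor m
CROSS JOIN UNNEST(m.tests) AS t(test)
WHERE m.id = $1;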
To fully prevent overwhelming our customers’ systems, we need one more piece: our rate limiter, about which we will write a dedicated blog post in the coming weeks.
There are exceptions when it comes to the distribution of jobs. The first one is when we distribute a monitor for the first time: e.g., the customer has just started using Detectify and expects to get results within a few minutes instead of having to wait for an entire cadence interval. The second exception is when a new target is discovered for an already created monitor. In that case, we want to assess the newly found target immediately to ensure that we keep our customer safe as their attack surface grows.
Show me the code
A typical engine may need to manage a few million monitors each day, which presents several interesting challenges. First, to handle such a high volume within the expected timeframe, we need a good level of parallelization. Second, we want to ensure that monitor distribution occurs at the appropriate time while also being able to track the state of things, such as when we last executed a certain monitor distribution and when we will perform it next. Third, the monitor distribution has a predictable, recurring load, and we want to take advantage of that for cost optimization.
Given the challenges presented above, we explored various technologies and approaches to address the problem. Combined with additional requirements from the engine, we decided to run monitoring on ECS and use PostgreSQL. In this section, we will explain how we use these technologies by breaking down the monitor distribution problem into two sub-problems: fetching dirty monitors and adding jobs to the job queue.
Fetching dirty monitors
A dirty monitor is a monitor that is ready to be distributed again, based on the expected cadence interval (i.e. the duration from the last distribution time to now is longer than the cadence interval).
To handle millions of monitors daily for an engine, we need to parallelize the distribution. Our goal is to prioritize the monitors that have been dirty the longest; however, handling this load sequentially with a single monitor distributor is insufficient. To address this, we need to redefine the rules. While we aim to distribute the monitors that have been dirty the longest, the ordering of their distribution is not critically important.
We are parallelizing by running the monitor distribution in several ECS tasks, each running several threads, all competing to access the dirty monitors.
To prevent threads from blocking one another or from working on the same data, we are using a PostgreSQL feature in the form of `select for update skip locked`.
WITH _dirty_monitors AS (
    SELECT m.id
    FROM monitor m
    WHERE m.latest_distribution_state = 'WAITING'
      AND m.next_distribution_at <= NOW()
    ORDER BY m.next_distribution_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
UPDATE monitor
SET latest_distribution_state = 'RUNNING',
    latest_distribution_start_at = NOW()
WHERE id = ANY(SELECT dm.id FROM _dirty_monitors dm)
RETURNING …;
This feature allows us to have everyone competing to acquire row locks while skipping currently locked ones. For instance, the first thread that arrives may lock the first 100 rows, while the second thread will lock the next 100 rows, allowing work on different rows to occur in parallel. This approach deals with the ordering of the monitors’ distributions in a best-effort manner, which, as mentioned, is not a strict requirement. If it were, our solution would certainly look different. Therefore, we choose to effectively manage the load while maintaining cost efficiency and adding good observability. As long as the distribution does not significantly lag for a monitor, the system will live up to expectations on the freshness of the data.
As one can imagine, issues can arise between the fetching and the distribution of the monitors. However, we have implemented self-healing measures to ensure that the distribution always carries on and monitors are never lost.
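As one example of such a measure, sketched here with an illustrative timeout, a monitor stuck in the RUNNING state for too long (say, because a distributor died mid-way) can simply be reset to WAITING so that the next distribution pass picks it up again:

-- Simplified self-healing sketch: reset monitors that have been RUNNING for
-- too long, so they are fetched again as dirty monitors. The 30-minute
-- timeout is illustrative.
UPDATE monitor
SET latest_distribution_state = 'WAITING'
WHERE latest_distribution_state = 'RUNNING'
  AND latest_distribution_start_at < NOW() - INTERVAL '30 minutes';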
Purpose-built job queue
The first step was to ensure that jobs are distributed for each dirty monitor in a timely manner. Next, we focused on creating the actual job queue, which allows us to insert jobs at an arbitrary position and ensures that they are not picked up before their scheduled times. That is, not a typical first-in, first-out queue.
Let’s look at an example. Consider having two monitors: one for `example.com` doing reconnaissance for ports 1, 2 and 3, and one for `detectify.com` that does the same reconnaissance for ports 1, 2 and 3. First, the monitor for `example.com` gets distributed and schedules jobs within its cadence interval, say port 1 at 04:00, port 2 at 12:00 and port 3 at 20:00, putting them on the job queue. Then, an hour later, we distribute jobs for `detectify.com`: port 1 is scheduled at 16:00, port 2 at 02:00 and port 3 at 11:00, adding them to the job queue. Next, we want our test workers to process this job queue. Instead of following a first-come, first-served approach, the workers will reorder the queue based on the scheduled execution times of the jobs.
The test workers must not only adhere to this order; they also need to respect the scheduled times. For instance, even if there are multiple scheduled jobs in the job queue, the workers should refrain from picking any of them up until it is time to perform the work.
This lets us prioritize newly found targets by placing them at the top of the job queue for immediate assessment.
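As a minimal sketch of such a queue (the `job` table and its columns are assumptions for the example, not our exact schema), a test worker could claim due work by ordering on the scheduled time rather than insertion order, never picking up jobs scheduled in the future, and skipping rows that other workers already hold. A newly found target simply gets a job scheduled at the current time, which sorts it to the front of the queue.

-- Illustrative only (job table and columns are assumed): how a test worker
-- could claim due jobs from the queue. Filtering on scheduled_at <= NOW()
-- keeps future jobs untouched, ordering by scheduled_at reorders the queue by
-- execution time, and SKIP LOCKED lets many workers pull in parallel.
SELECT id, monitor_id, test
FROM job
WHERE scheduled_at <= NOW()
ORDER BY scheduled_at
LIMIT 10
FOR UPDATE SKIP LOCKED;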
Observability
As mentioned in our previous article about the engine framework, having observability on how our engines are performing is critical.
When it comes to the monitoring problem space, we want to keep an eye on how many monitors are being handled, how much they may be lagging, and how much time it’s taking to distribute them. These aspects are important in terms of reliability and scalability.
Naturally, there is a balance between staying exactly up to date and allowing the system to lag behind for brief moments. We want to prevent the queue from growing too much or falling too far behind, but we also do not want to over-provision our infrastructure.
The way we have chosen to visualize the monitoring lag is by using buckets. This approach allows us to track the number of monitor distributions that are lagging behind within certain thresholds. The lag is calculated as `current time - estimated monitor distribution time = current lag duration`, and the lag durations are divided into 15-minute buckets.
To generate this data, we are using a PostgreSQL feature called `width_bucket` that helps place our monitors into buckets, depending on the amount of time they are lagging behind.
WITH _monitors_lagging AS (
    SELECT id, next_distribution_at
    FROM monitor
    WHERE next_distribution_at <= NOW()
      AND latest_distribution_state IN ('WAITING', 'RUNNING')
)
SELECT count(*),
       WIDTH_BUCKET(
           next_distribution_at,
           ARRAY(
               SELECT GENERATE_SERIES(
                   NOW() - INTERVAL '1 hour',
                   NOW(),
                   (15 || ' minutes')::INTERVAL))
       ) AS bucket
FROM _monitors_lagging
GROUP BY bucket
ORDER BY bucket;
Then, of course, the fun inception of monitoring: we observe the performance of the worker that calculates this lag, as these queries may become heavy with the growth of our customer base and the size of their attack surface.
More efficiency and frequency
Our new monitoring system has enabled us to tailor the monitor design to effectively test the right things at the right cadence. In terms of efficiency, we have successfully increased the frequency of our most important tests without compromising overall performance or affecting either our team or our customers. Given that this system is integrated into the engine framework, we can share these benefits across all of our engines with just the click of a button.
Interested in learning more about Detectify? Start a 2-week free trial or talk to our experts.
If you are a Detectify customer already, don’t miss the What’s New page for the latest product updates, improvements, and new security tests.