A practitioner’s guide to classifying every asset in your attack surface

TLDR: This article details methods and tools (from DNS records and IP addresses to HTTP analysis and HTML content) that practitioners can use to classify every web app and asset in their attack surface. You’ll learn to view your assets from an attacker’s perspective, enabling you to understand not only that an asset exists but also its exact nature. 

“You can’t secure what you don’t know exists.” It’s a common refrain in cybersecurity (and for good reason!). But the reality is a bit more complex: it’s not enough to just know that something exists. To effectively secure your assets, you need to understand what each of them is. Without proper classification, applying the right security processes or tools becomes a guessing game.

There’s often a discrepancy between what you think you’re exposing and what you’re actually exposing. Critically, an attacker only cares about what is actually accessible to them, not what you think it is. Research from Detectify indicates that the average organization fails to test 9 out of 10 of the complex web apps that are potential attack targets.

Imagine you’ve identified a few thousand assets exposed to the internet. The crucial next step is to determine what you are actually exposing. Different tools can help depending on what’s on your attack surface, but instead of focusing on specific tools right away, let’s concentrate on the methods and data points used to understand what each asset is. 

Data points for an outside-in perspective

Numerous data points can be used for classification. Let’s examine them in the order of a typical connection flow, assuming an outside-in, black-box analysis perspective. Analysis based on internal network data or source code would require a different approach.

Asset classification methods covered in this guide

Handshake

  1. DNS: Where is the DNS hosted? What types of pointers (A, CNAME, MX, etc.) are used? Where are they pointing? Are there informative TXT records (e.g., SPF, DKIM, DMARC)?
  2. IPs: Where is the IP address geographically located? What Autonomous System Number (ASN) does it belong to? Is it an individual IP or part of a larger range?
  3. Ports: Which ports are open or closed? How does the firewall behave (e.g., treatment of TCP vs. UDP, dropped vs. rejected packets)?
  4. Protocol/Schema: What protocol responds on an open port (e.g., HTTP, FTP, SSH)? Are there nested protocols (e.g., HTTP over TLS, WebSocket over HTTP)?
  5. SSL/TLS: Which Certificate Authority (CA) issued the certificate? What does JARM fingerprinting and handshake data reveal? What Subject Alternative Names (SANs) are listed?
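The five handshake-level steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production scanner: the function names are mine, and a real tool would add error handling, timeouts per step, and non-TLS fallbacks.

```python
import socket
import ssl

def extract_sans(cert: dict) -> list:
    """Pull the DNS Subject Alternative Names out of ssl.getpeercert() output."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]

def handshake_profile(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Walk the connection flow: DNS resolution, TCP connect, TLS handshake."""
    # Steps 1-2: resolve the hostname and collect the answering IPs
    ips = sorted({info[4][0] for info in
                  socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)})
    # Steps 3-5: connect to the port and complete a TLS handshake
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            return {
                "ips": ips,
                "tls_version": tls.version(),
                "cipher": tls.cipher()[0],
                "sans": extract_sans(cert),
            }
```

Each field in the returned dict maps back to one of the data points in the list: the IPs feed the ASN lookup, the TLS version and cipher feed handshake analysis, and the SANs often reveal sibling domains.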

Deep dive into HTTP

The data available for deeper classification heavily depends on the protocol encountered. For this blog post, we’ll focus primarily on HTTP, the backbone of web applications.

Key HTTP data points include:

  1. Response Codes: Is it a 200 OK, a 30X redirect (and where to?), or a 50X server error?
  2. Headers: Response headers are particularly rich, including custom X-headers, cookies, and security headers.
  3. File Signatures: These are unique identifiers forming part of a file’s binary data, often found in the first few bytes of a response body.
  4. Content-Type and Length: Is the response JSON, XML, HTML? What’s the size of the response?
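As a sketch of how these four HTTP data points combine, the helper below condenses one response into a set of classification signals. The function name and the (deliberately short) magic-byte table are mine; a real classifier would carry a much larger signature set.

```python
# Leading bytes of a response body ("file signatures" / magic bytes)
MAGIC_BYTES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"\x1f\x8b": "gzip-compressed data",
    b"PK\x03\x04": "ZIP archive (also DOCX/JAR/APK)",
}

def http_signals(status: int, headers: dict, body: bytes) -> dict:
    """Condense one HTTP response into classification signals."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return {
        "status": status,
        "server": h.get("server"),
        "x_headers": sorted(k for k in h if k.startswith("x-")),
        "content_type": h.get("content-type"),
        "content_length": len(body),
        "file_signature": next(
            (name for magic, name in MAGIC_BYTES.items() if body.startswith(magic)),
            None,
        ),
    }
```

Feeding it the status line, headers, and first body bytes of any response yields a compact record suitable for bulk comparison across thousands of assets.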

Further down into HTML 

If the response is HTML, we can delve even deeper:

  1. Favicon: Many applications use default favicons. Hashing these icons can quickly identify known software.
  2. URL Patterns: Are there detectable patterns in URLs (e.g., /wp-admin/, /api/v1/, specific query parameter structures)?
  3. Meta-tags: name attributes (e.g., for description, keywords, generator) or http-equiv attributes (simulating response headers) can reveal underlying technologies or CMS.
  4. Form-tags: The structure, input field names, and action URLs within forms (especially login forms) can indicate specific systems.
  5. Links in Code: Are there hardcoded links to known sources, documentation, or license agreements?
  6. Code Patterns: Detectable patterns in JavaScript, HTML structure, or CSS can point to specific frameworks or libraries.
  7. Third-Party Resources: What external resources (scripts, images, APIs, tracking pixels) are being loaded, and from where?
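The favicon technique from the list above can be sketched with nothing but the standard library. Note the assumptions: the lookup table here is empty and purely hypothetical (you would populate it from icons you have verified yourself), and services such as Shodan use an MMH3 hash of the base64-encoded icon rather than the plain MD5 used in this sketch.

```python
import hashlib

# Hypothetical lookup table: digests of default favicons for known software.
# Populate from icons you have verified; left empty here on purpose.
KNOWN_FAVICONS = {}

def favicon_fingerprint(favicon_bytes: bytes):
    """Return the software name if this favicon's MD5 digest is known."""
    digest = hashlib.md5(favicon_bytes).hexdigest()
    return KNOWN_FAVICONS.get(digest)
```

Because default favicons rarely change between installs, even a modest table of hashes identifies common CMSs, admin panels, and appliances at scale.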

Other Protocols

If we haven’t gone down the HTTP and HTML path (e.g., we’ve encountered an SSH or SMTP server), we would then look further into the binary response or protocol-specific handshake data to understand what software components are running. However, that’s a topic for another article.

Data Points Unpacked

When we examine each data point individually, significant opportunities for fingerprinting and understanding exposed assets emerge. Combining them provides even richer insights:

DNS

  • NS records and CNAMEs: Can be used to identify hosting providers (e.g., AWS, Azure, GCP), third-party SaaS applications, and CDNs/WAFs. Analyzing the domain name a record points to often yields this information.
  • DNS security records (e.g., SPF, DMARC): Can reveal third-party services used for functions like marketing automation or invoicing, which can be relevant for supply chain risk assessment or social engineering attack vectors.

Tools and Techniques: Manual inspection can be done with the dig command and basic human pattern recognition for small-scale analysis. For larger-scale testing, open-source tools like MassDNS can be highly effective.
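Once a TXT record has been fetched (with dig or MassDNS), extracting the third-party senders it reveals is simple string work. A minimal sketch, assuming the record text is already in hand; the function name is mine:

```python
def spf_includes(txt_record: str) -> list:
    """List the third-party sender domains pulled in via SPF include: terms."""
    if not txt_record.startswith("v=spf1"):
        return []  # not an SPF record
    return [term.split(":", 1)[1]
            for term in txt_record.split()
            if term.startswith("include:")]
```

Each included domain (a mail provider, marketing platform, invoicing service) is a supply-chain data point and a potential social-engineering angle.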

IPs

  • ASN (Autonomous System Number): Helps determine organizational ownership, network size and scope, and geographical footprint. ASN data can also indicate underlying technology providers, as vendors often allocate IP blocks to different products or services.

Tools and Techniques: Nmap is a widely used tool for IP and port scanning. Alternatives for large-scale scanning include Zmap and MASSCAN. Whois lookups (command-line or web-based) are essential for ASN information.
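Mapping an IP back to its ASN is, at its core, a longest-prefix lookup. The sketch below shows the shape of that lookup with the standard library's ipaddress module; the prefix table is hypothetical (in practice you would build it from a routing-table dump or whois/RDAP data), and the ranges shown are the RFC 5737 documentation prefixes paired with documentation-range ASNs.

```python
import ipaddress

# Hypothetical prefix-to-ASN table; build from a routing dump in practice.
PREFIXES = {
    "192.0.2.0/24": "AS64496 (ExampleNet)",
    "198.51.100.0/24": "AS64497 (ExampleCDN)",
}

def asn_for_ip(ip: str):
    """Map an IP address to the ASN owning its announced prefix, if known."""
    addr = ipaddress.ip_address(ip)
    for prefix, asn in PREFIXES.items():
        if addr in ipaddress.ip_network(prefix):
            return asn
    return None
```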

Ports

Understanding which ports are open can help determine the firewall in place and the underlying systems running.

  • Single Ports: While specific ports are commonly associated with certain services (e.g., port 80 for HTTP), this isn’t guaranteed. Misconfigurations can lead to odd combinations of ports and services. Port status is an indication, not proof; probing the service is necessary for confirmation.
  • Combination of Ports: Certain combinations of open ports can strongly indicate an underlying system. For example, Cloudflare often presents a standard set of 13 open ports, while Imperva Incapsula might show all TCP ports as open.
  • Port “Spoofing”/Firewall Behavior: If a firewall detects a port scan, it might respond by showing no open ports, dropping packets, or indicating all ports are open. Analyzing this behavior in detail can provide clues about the edge device (firewall/WAF) in use.
  • Malformed Requests: Sending malformed requests that don’t adhere to RFCs can sometimes elicit responses that reveal more information than standard requests.

Tools and Techniques: For scanning at scale, masscan is fast, though it may produce a higher number of false positives. You’ll need to decide between speed and accuracy, as they often involve trade-offs. Nmap offers more accuracy and service detection features.
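For a sense of what these scanners do under the hood, here is a minimal full-connect probe of a single port. This is a sketch, not a replacement for Nmap or masscan: it makes one complete TCP connection per port, which is slow but unambiguous.

```python
import socket

def probe_port(host: str, port: int, timeout: float = 1.0) -> bool:
    """Full TCP connect scan of one port: True if something accepts the connection.

    How a failure happens is itself a signal: an immediate refusal suggests a
    closed port, while a timeout suggests a firewall silently dropping packets.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```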

Protocol/Schema

The protocol/schema served depends on the combination of hostname (e.g., the Host header for domain fronting, or TLS-based routing using SNI), IP address, and port in the request.

  • Nested Communications: Communications can be nested. Many basic tools might not capture these nested communications, whether they result from intentional design or misconfiguration. This can lead to an incomplete understanding of what’s truly exposed.

Tools and Techniques: Nmap is the best-known tool here. Others, such as the JA4+ fingerprinting suite and fingerprintx, can also help identify protocols and services.
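One simple protocol-identification trick these tools rely on is the server banner. A minimal sketch (the function name and the deliberately small rule set are mine):

```python
def classify_banner(first_bytes: bytes) -> str:
    """Guess the application protocol from a server's unsolicited greeting.

    SSH, SMTP, and FTP servers speak first after the TCP handshake; HTTP and
    TLS servers stay silent until the client sends data, so an empty read is
    itself a useful signal.
    """
    if first_bytes.startswith(b"SSH-"):
        return "ssh"
    if first_bytes[:4] in (b"220 ", b"220-"):
        return "smtp-or-ftp"  # both greet with code 220; the text disambiguates
    if not first_bytes:
        return "silent"  # client-speaks-first protocol, e.g. HTTP or TLS
    return "unknown"
```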

SSL/TLS

  • Certificate Authority: Are certificates updated manually or automatically (e.g., Let’s Encrypt certificates are short-lived and usually automated)? Are different certificate authorities used in different parts of the infrastructure? This can hint at internal processes or even supply chain elements.
  • Subject Alternative Names (SANs): Is the certificate used for other domains? What can be learned from them? For example, google.com’s certificate lists over 50 domains under SAN.
  • JARM: JARM is an active TLS fingerprinting technique; analyzing its hashes can group disparate servers by configuration, identify default applications or infrastructure, and even fingerprint malware command-and-control servers.
  • Handshake Details: Different TLS server implementations respond differently when actively probed. Analyzing supported ciphers and TLS versions provides insights into the server’s configuration and potential vulnerabilities.

Tools and Techniques: JARM fingerprinting tools actively probe servers. Certificate Transparency (CT) logs, searchable through services like crt.sh, are valuable public data sources for discovering certificates issued for your domains.
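The Certificate Authority observation above can be automated against the dict that Python's ssl.SSLSocket.getpeercert() returns. A hedged sketch, with a function name of my choosing:

```python
def issuer_hint(cert: dict) -> str:
    """Read the issuing CA out of ssl.getpeercert() output and hint at process.

    getpeercert() returns the issuer as a tuple of RDNs, each itself a tuple
    of (attribute, value) pairs; flatten that into a plain dict first.
    """
    issuer = {k: v for rdn in cert.get("issuer", ()) for k, v in rdn}
    org = issuer.get("organizationName", "unknown CA")
    if "Let's Encrypt" in org:
        return f"{org}: short-lived certificates, renewal almost certainly automated"
    return f"{org}: inspect validity period for renewal-process hints"
```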

Deep Dive into HTTP Responses

Response Codes

A simple 200 OK status code might offer limited information in isolation. However, observing an application’s status codes in response to crafted payloads can be far more revealing. Different payloads will trigger different behaviors, and a WAF may interfere. Additionally, response codes can vary based on the user-agent and accept-header.

  • 10X (Informational): Commonly seen when upgrading to WebSockets or when Expect headers are used.
  2. 20X (Successful): Of limited use for system identification without further context.
  • 30X (Redirection): Redirect headers can give hints about underlying systems, authentication flows, or application structure. An example:
$ curl -v http://whitehouse.gov
* Trying 192.0.66.51:80...
* Connected to whitehouse.gov (192.0.66.51) port 80 (#0)
> GET / HTTP/1.1
> Host: whitehouse.gov
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 301 Moved Permanently
< Server: nginx
< Date: Wed, 16 Apr 2025 12:15:21 GMT
< Content-Type: text/html
< Content-Length: 162
< Connection: keep-alive
< Location: https://whitehouse.gov/
<

<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>

The response body of the redirect clearly states that nginx is used.

  • 40X (Client Error): These are often very interesting, as they can be triggered with specially crafted payloads tailored to specific types of systems. Different systems have unique 404 pages or error messages.
  • 50X (Server Error): It’s not uncommon for 50X errors to present custom error pages or verbose error messages that can be connected to a specific system type, framework, or even programming language. If a 50X error can be triggered, you might be able to detect more. 


Tools and Techniques: Common web scanning tools like Burp Suite, combined with human ingenuity, can help us understand more. 

For example, triggering a non-200 status code can sometimes expose more information about a system or underlying technology. If you’re looking to identify assets running IBM Notes/IBM Domino, for instance, it can be helpful to request an .nsf file that does not exist.

Sending a GET request to example.com/foo.nsf can trigger a 404 response containing strings such as HTTP Web Server: IBM Notes Exception - File does not exist.
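That Domino check can be scripted with the standard library alone. A minimal sketch under the source's assumptions (the /foo.nsf path and the exception string are from the article; the function names are mine):

```python
import urllib.error
import urllib.request

DOMINO_SIGNATURE = "IBM Notes Exception - File does not exist"

def looks_like_domino(body: str) -> bool:
    """Check an error page for the Domino-specific exception string."""
    return DOMINO_SIGNATURE in body

def probe_domino(base_url: str, timeout: float = 5.0) -> bool:
    """Request a .nsf path that should not exist and inspect the error body."""
    try:
        with urllib.request.urlopen(f"{base_url}/foo.nsf", timeout=timeout) as resp:
            body = resp.read().decode(errors="replace")
    except urllib.error.HTTPError as err:
        body = err.read().decode(errors="replace")  # the 404 body is the clue
    return looks_like_domino(body)
```

The same pattern, a deliberately bad request plus a signature match on the error body, generalizes to many other technologies with distinctive error pages.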

