Cloudflare Accuses Perplexity AI of Bypassing Firewalls with User-Agent Spoofing

Cloudflare has publicly accused Perplexity AI of employing deceptive crawling practices that violate established web crawling protocols and deliberately circumvent website protection mechanisms.

The cybersecurity company has documented evidence of Perplexity using undisclosed user agents and rotating IP addresses to access content from websites that have explicitly blocked the AI company’s declared crawlers.

Stealth Crawling Operations Detected

Cloudflare’s investigation revealed that Perplexity operates two distinct crawling systems: a declared crawler using the legitimate “Perplexity-User” user agent generating 20-25 million daily requests, and a stealth crawler masquerading as a standard Chrome browser on macOS, responsible for 3-6 million additional daily requests.

The stealth crawler employs the user agent string “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36” to impersonate legitimate browser traffic.

Cloudflare’s testing methodology involved registering multiple new domains, each with a strict robots.txt file containing the directives “User-agent: *” and “Disallow: /”, which together prohibit all automated crawling.
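To illustrate what such a robots.txt file tells a compliant crawler, here is a minimal sketch using Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the `example.com` URL are placeholders chosen for illustration and do not come from Cloudflare's report.

```python
from urllib.robotparser import RobotFileParser

# The two directives described in the article, one per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not one of the test domains.
    for agent in ("Perplexity-User", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

Under these rules every path is disallowed for every user agent, so a crawler that honors the file should not fetch anything from the site.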

Despite these explicit restrictions and Web Application Firewall (WAF) rules blocking Perplexity’s official crawlers, the AI service continued accessing and indexing content from these protected domains.
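A rule that matches only on the User-Agent header shows why such blocks are easy to sidestep. The sketch below is a simplified stand-in for a firewall rule, not Cloudflare's actual WAF logic; the full declared header value in `DECLARED_UA` is an assumed format, and only the “Perplexity-User” token and the Chrome-like string come from the article.

```python
# Token taken from the declared user agent named in the article.
DECLARED_CRAWLER_TOKENS = ("perplexity-user",)

# Hypothetical full header for the declared crawler; the exact format is an assumption.
DECLARED_UA = "Mozilla/5.0 (compatible; Perplexity-User/1.0)"

# Browser-like string reported for the stealth crawler.
STEALTH_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)


def blocked_by_user_agent(user_agent: str) -> bool:
    """Return True if the header contains a declared crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_CRAWLER_TOKENS)


if __name__ == "__main__":
    print("declared crawler blocked:", blocked_by_user_agent(DECLARED_UA))
    print("stealth crawler blocked:", blocked_by_user_agent(STEALTH_UA))
```

The declared crawler is caught, while the browser-like string passes because it contains nothing that identifies the crawler, which is why header matching alone cannot stop this kind of traffic.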

When queried about these domains, Perplexity provided detailed information about their content, demonstrating successful circumvention of the implemented security measures.

Cloudflare observed that Perplexity’s stealth operations extend beyond simple user-agent spoofing.

The undeclared crawler utilizes IP addresses outside Perplexity’s official documented ranges and rotates through multiple Autonomous System Numbers (ASNs) to evade detection and blocking mechanisms.
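Checking a request's source address against a crawler's published ranges can be sketched with Python's standard `ipaddress` module. The ranges below are reserved documentation prefixes used as placeholders, not Perplexity's real published ranges, and `in_declared_ranges` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (reserved documentation prefixes), NOT Perplexity's real ranges.
DECLARED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def in_declared_ranges(ip: str) -> bool:
    """Return True if ip falls inside any declared crawler range."""
    addr = ipaddress.ip_address(ip)
    # Membership is False when the address and network IP versions differ.
    return any(addr in network for network in DECLARED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.7", "2001:db8::1"):
        print(ip, in_declared_ranges(ip))
```

A crawler that sends requests from addresses outside its declared ranges slips past any allow list or block list built from those ranges, which is why a range check on its own is not enough to catch the behavior described here.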

This behavior runs counter to RFC 9309, the Robots Exclusion Protocol standard, and represents a systematic attempt to bypass the preferences that website owners express through robots.txt files.

Comparative Analysis

The cybersecurity firm contrasted Perplexity’s practices with those of other AI companies, particularly highlighting OpenAI’s compliance with established protocols.

OpenAI’s ChatGPT crawler respects robots.txt directives and ceases crawling activities when encountering blocks, without attempting alternative access methods through different user agents or IP addresses.

OpenAI also implements the emerging Web Bot Auth standard for HTTP request authentication, demonstrating transparent crawling practices.

Cloudflare emphasizes that legitimate web crawlers should maintain transparency by using unique user agents, providing declared IP ranges, serving clear purposes, and respecting website directives.

Cloudflare has responded to Perplexity’s behavior by de-listing it as a verified bot and adding heuristic detection to its managed rules to block the stealth crawling activity.

The investigation spans tens of thousands of domains and millions of daily requests, with Cloudflare employing machine learning and network signal analysis to fingerprint the deceptive crawler.

Customers using Cloudflare’s bot management services are automatically protected through existing challenge and blocking mechanisms, and a new managed rule that specifically targets AI crawling activity is available to all users, including free-tier customers.
