Perplexity AI ignores no-crawling rules on websites, crawls them anyway

Imagine putting up a no-trespassing sign for people walking their dogs, and then finding out that one person dresses up their Great Dane as a calf and walks it on your grounds.

Well, that’s sort of what AI answer engine Perplexity has been doing by evading the no-crawl directives of websites, according to Cloudflare.

The no-trespassing sign in this case would be a robots.txt file—a small text file placed on a website that tells search engines and other automated tools (often called “bots” or “crawlers”) which pages or sections of the site they are allowed to access and which parts they should not visit.
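To make that concrete, a minimal robots.txt (with a made-up bot name used purely as an illustration) might block one crawler from the whole site while leaving it open to everyone else:

    # Block one specific crawler from the entire site
    User-agent: ExampleBot
    Disallow: /

    # All other crawlers may visit everything
    User-agent: *
    Disallow:

The file is purely advisory: it asks crawlers to behave, but nothing technically stops a crawler that chooses to ignore it.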

Cloudflare started an investigation after customers complained that Perplexity accessed their content even though they had disallowed Perplexity’s crawling activity in their robots.txt files. They had also created Web Application Firewall (WAF) rules to specifically block both of Perplexity’s declared crawlers: PerplexityBot and Perplexity-User.
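For context, Cloudflare WAF rules are written as expressions that match properties of the incoming request. A custom rule with its action set to Block and an expression roughly like the one below (a sketch, not necessarily the customers’ exact rules) would reject any request whose User-Agent header names either declared crawler:

    (http.user_agent contains "PerplexityBot") or (http.user_agent contains "Perplexity-User")

A rule like this only works as long as the crawler actually announces itself in its User-Agent header, which is exactly where things went wrong.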

So Cloudflare set up some test domains and queried Perplexity about them to see if it would return the information that it shouldn’t be able to access. What the researchers found was surprising, because it looks like Perplexity uses a very deliberate system to bypass the no-crawling directives.

The researchers discovered that when Perplexity’s declared crawler was blocked, it switched to a user agent designed to impersonate Google Chrome on macOS, so its requests would look like those of a regular visitor. The User-Agent string tells the website you’re visiting some basic information about your browser and operating system, so the site can provide an experience optimized for that setup. Camouflaged by this User-Agent string, the Perplexity crawler also rotated through IP addresses outside of Perplexity’s official IP range, allowing it to successfully access content it wasn’t supposed to reach.
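To illustrate the difference, compare a self-identifying crawler User-Agent with a generic Chrome-on-macOS one (these are representative strings, not necessarily the exact ones Cloudflare observed):

    Declared crawler, easy to match in a blocking rule:
    Mozilla/5.0 (compatible; PerplexityBot/1.0)

    Disguised request, indistinguishable from a normal desktop browser:
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36

Combine the second string with IP addresses that aren’t on Perplexity’s published list, and a website owner has nothing left to write a blocking rule against.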

I wondered what Perplexity’s view on the whole thing would be, so I asked it whether it knew what a robots.txt file was for and why it is bad to ignore one. It returned the following:

Why is it wrong to ignore robots.txt?

  • Respect for Privacy and Security: Some parts of a website may contain private, sensitive, or unfinished content not meant for public search engines. Ignoring robots.txt can expose data that should remain private.
  • Fair Resource Usage: Crawling websites takes up bandwidth and server resources. By telling robots where not to go, website owners can prevent wasteful use of these resources, ensuring the site runs smoothly for real users.
  • Legal and Ethical Standards: Ignoring robots.txt can be considered unethical, and in certain circumstances, it might even break terms of service or data protection regulations depending on what information is accessed and how it’s used.
  • Trust and Reputation: If a robot repeatedly ignores robots.txt, it can be banned or blacklisted by websites, and its reputation could suffer in the public and professional community.

The official reaction of Perplexity is somewhat different. It hides behind the difference between a regular bot and an AI agent that fetches content on behalf of a user to answer a specific question.

Perplexity reasons that:

“Modern AI assistants work fundamentally differently from traditional web crawling. When you ask Perplexity a question that requires current information—say, “What are the latest reviews for that new restaurant?”—the AI doesn’t already have that information sitting in a database somewhere. Instead, it goes to the relevant websites, reads the content, and brings back a summary tailored to your specific question.

This is fundamentally different from traditional web crawling, in which crawlers systematically visit millions of pages to build massive databases, whether anyone asked for that specific information or not.”

Although I see Perplexity’s point that there is a big difference between crawling websites to gather as much information as you can and fetching content to answer a specific question for one user, the decision about whether to allow either is up to the website owner. And there should be no need for sneaking around.

So why not create a User-Agent string that tells website owners “this is just a short visit to find some specific information,” to distinguish it from actual crawlers that siphon up every bit they can find, and then let the website owners decide whether or not to allow it?
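Respecting those signals from the agent side is not hard either. Here is a minimal Python sketch (with a hypothetical agent name, used purely as an illustration) of an on-demand fetcher that announces itself via a distinctive User-Agent string and checks robots.txt before retrieving a page:

    import urllib.parse
    import urllib.request
    from urllib import robotparser

    # Hypothetical agent name; a real one would point to documentation
    USER_AGENT = "ExampleAnswerAgent/1.0 (fetches single pages on behalf of a user)"

    def polite_fetch(url: str):
        """Fetch a page only if the site's robots.txt allows this agent."""
        parsed = urllib.parse.urlparse(url)
        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()  # download and parse the site's robots.txt

        if not rp.can_fetch(USER_AGENT, url):
            return None  # the site owner said no, so stop here

        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8", errors="replace")

The sketch obviously skips error handling, but it shows that identifying yourself and honoring robots.txt costs an AI agent almost nothing.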

Either way, this discussion seems far from over, and with the rise of AI agents we will probably see problems arise that were not on the radar before we all started using AI.

