Perplexity is accused of willingly crawling and scraping content from websites that have explicitly blocked its bots, by obscuring its identity • DigiBanker

AI startup Perplexity is crawling and scraping content from websites that have explicitly indicated they don’t want to be scraped, according to internet infrastructure provider Cloudflare. Cloudflare published research saying it observed the AI startup ignore blocks and hide its crawling and scraping activities. The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages “in an attempt to circumvent the website’s preferences,” Cloudflare’s researchers wrote. Perplexity appears to be willingly circumventing these blocks by changing its bots’ “user agent,” meaning a signal that identifies a website visitor by their device and version type, as well as changing their autonomous system networks, or ASN, essentially a number that identifies large networks on the internet, according to Cloudflare. “This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals,” read Cloudflare’s post. Cloudflare said it first noticed the behavior after its customers complained that Perplexity was crawling and scraping their sites, even after they added rules on their Robots file and for specifically blocking Perplexity’s known bots. Cloudflare said it then performed tests to check and confirmed that Perplexity was circumventing these blocks. “We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked,” according to Cloudflare. The company also said that it has de-listed Perplexity’s bots from its verified list and added new techniques to block them.

Read Article