Perplexity AI Caught Faking User Agent to Scrape the Web

Yo, remember when the internet had rules? Like, actual basic-decency protocols that kept the whole fragile ecosystem from collapsing into dogshit? Yeah. Perplexity AI doesn't.

Here's what went down: developer Robb Knight published a blog post this week titled "Perplexity AI is lying about its user agent" that immediately shot to the front page of Hacker News because — shocker — the $3 billion "answer engine" that raised $73M from Jeff Bezos, Institutional Venture Partners, and NEA back in January has been caught running a covert scraping operation that actively disguises itself as regular browser traffic.

Not a bot. Not some honest crawler with proper identification. Perplexity is out here wearing a fake mustache and a trench coat, rifling through your website's content like a digital shoplifter at an FYE.

Let's get technical for a second.

When any legitimate web crawler hits your site (Googlebot, Bingbot, even freakin' OpenAI's GPTBot), it identifies itself via a User-Agent string in the HTTP headers. That string tells your server, "Hey, I'm a bot, here's who made me." This lets site owners write robots.txt rules aimed at specific crawlers and decide what gets crawled and what doesn't. It's Web 101. It's been standard since the mid-90s. It's older than *NSYNC.
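To make that concrete, here's a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the example URL, and the Chrome-style string are all made up for illustration; the point is only that a rule keyed on a crawler's name does nothing once the crawler stops using that name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block PerplexityBot everywhere,
# allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An honest crawler announces its name; a disguised one sends a browser string.
HONEST_UA = "PerplexityBot"
SPOOFED_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
)


def allowed(user_agent: str, url: str = "https://example.com/post") -> bool:
    """Return True if the parsed robots.txt permits this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print("honest crawler allowed: ", allowed(HONEST_UA))
print("spoofed crawler allowed:", allowed(SPOOFED_UA))
```

The named crawler is refused and the browser lookalike falls through to the wildcard rule. Keep in mind robots.txt is advisory in the first place: it only works when the crawler both reads it and identifies itself honestly.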

Perplexity's actual crawler? It's been hitting sites with a generic user agent string that looks like a regular human browsing Chrome. No identifier. No "PerplexityBot." No nothing. When Knight dug deeper, he found requests coming from IPs that resolve back to Perplexity's infrastructure — but the user agent strings were straight-up masquerading as ordinary visitors.
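The kind of cross-check described here (source IP on one side, claimed user agent on the other) can be sketched in a few lines of Python. Everything specific below is an assumption for illustration: the network range is a reserved documentation block, not Perplexity's real address space, the log entries are invented, and this is not Knight's actual method.

```python
import ipaddress

# Hypothetical values for illustration only: these are reserved documentation
# ranges, not any company's real addresses.
SUSPECT_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
DECLARED_BOT_TOKEN = "PerplexityBot"


def is_undeclared_crawler(ip: str, user_agent: str) -> bool:
    """True when the IP is in a suspect range but the UA lacks the bot token."""
    addr = ipaddress.ip_address(ip)
    from_suspect_range = any(addr in net for net in SUSPECT_NETWORKS)
    declares_itself = DECLARED_BOT_TOKEN.lower() in user_agent.lower()
    return from_suspect_range and not declares_itself


# Invented (ip, user_agent) pairs standing in for parsed access-log lines.
requests_seen = [
    ("203.0.113.7", "Mozilla/5.0 (Windows NT 10.0) Chrome/111.0.0.0 Safari/537.36"),
    ("203.0.113.9", "Mozilla/5.0 (compatible; PerplexityBot/1.0)"),
    ("198.51.100.4", "Mozilla/5.0 (Windows NT 10.0) Chrome/111.0.0.0 Safari/537.36"),
]

flagged = [ip for ip, ua in requests_seen if is_undeclared_crawler(ip, ua)]
print(flagged)
```

Only the first entry gets flagged: it comes from the suspect range while claiming to be an ordinary browser. The second announces itself, and the third is just a regular visitor from somewhere else.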

This isn't an oopsie. This is a strategy.

Because here's the grift: Perplexity's entire value proposition is that it's an "AI-powered answer engine" that synthesizes information from across the web into clean, summarized responses. Sounds great, right? But those answers don't come from nowhere. They come from your content. My content. Every publisher, blogger, and independent creator who put time and money into creating something worth reading.

And Perplexity doesn't want to pay for any of it.

This is the same playbook we've seen from every AI company in the past 18 months. OpenAI? Trained on basically the entire internet, then played dumb when caught. Stability AI? Same energy. Now Perplexity — the company that launched in August 2022, grew to 15 million monthly users by late 2024, and introduced its Perplexity Pro subscription at $20/month — is out here pretending they're just a scrappy startup when in reality they're running a content heist at industrial scale.

The timing is chef's kiss too. Perplexity has been aggressively positioning itself as the "honest" search alternative to Google. CEO Aravind Srinivas has been doing the podcast circuit, talking about transparency and accuracy and how AI search should work for users. Meanwhile, his company's bots are out here in disguise, bypassing the very mechanisms designed to give publishers agency over their own content.

When Knight reached out to Perplexity about the discrepancy, their response was essentially: "We use third-party crawlers and we can't control what user agents they use." Which is — and I want to be very precise here — absolute horseshit. You absolutely can. Companies specify crawler behavior in vendor contracts all the time. The fact that Perplexity apparently has no idea (or claims to have no idea) how their data is being sourced is either gross negligence or a bald-faced lie. Pick your poison.

Here's why this matters beyond tech drama.

The web is already in crisis mode. Publishers are laying off staff left and right. Ad revenue has collapsed. SEO spam has made Google search nearly unusable. And now AI companies want to vacuum up whatever's left, repurpose it without attribution or compensation, and serve it back to users who never have to visit the original source.

When the crawlers also hide their identity, it removes the last shred of control publishers have. You can't block what you can't identify. You can't negotiate with a ghost. Perplexity isn't just taking content — they're taking away your ability to say no.

And let's be real about the economics here. Perplexity is burning cash on API calls to models like GPT-4 and Claude 3 Opus (both of which cost real money to run), while simultaneously refusing to invest in the fundamental infrastructure of ethical data collection. The compute costs money. The engineering talent costs money. But the raw material — the actual human knowledge that makes their product valuable? That should apparently be free.

This is the AI industry's dirty open secret. The "intelligence" in artificial intelligence isn't artificial at all. It's your intelligence, scraped, compressed, and monetized by companies that add a chat interface and call it innovation.

As of this writing, Perplexity hasn't issued a formal public statement on Knight's findings. Their official crawler documentation still references a PerplexityBot user agent that apparently nobody's actually seeing in the wild. The blog post continues to circulate. Developers are sharing their own logs confirming the behavior.

The whole thing feels like watching someone get caught with their hand in the cookie jar, and instead of pulling their hand out, they just keep eating cookies while making eye contact. Bold strategy. Let's see how it plays out.

Look — I'm not anti-AI. This very blog uses AI tools. The technology has genuine potential. But the companies building these tools need to be held to basic standards of honesty and transparency. If you're going to build a business on other people's work, have the decency to ask.

Or at least have the guts to admit what you're doing. Because right now, Perplexity isn't an answer engine. It's a theft engine with a fancy UI.

And the entire AI industry is watching to see if anyone actually gives a damn.