
Why news publishers are struggling to fend off AI bots scraping online content

indianexpress.com 3 days ago

Tensions between AI companies and news publishers have continued to escalate, with Perplexity AI now at the centre.


Once touted as a potential replacement for Google Search, Perplexity AI has found itself in hot water for allegedly plagiarising news articles without providing proper attribution to sources. In early June, the generative AI-powered search engine was threatened with legal action by Forbes for allegedly plagiarising its work. Then, an investigation by Wired alleged that Perplexity AI could be freely copying online content from other prominent news sites as well.

Since then, several AI companies have come under scrutiny for reportedly circumventing paywalls and technical standards that have been put in place by publishers to prevent their online content from being used to train AI models and generate summaries.

While Perplexity AI CEO Aravind Srinivas has said that a third-party service was to blame, the controversy surrounding the AI startup is the latest flashpoint between news publishers, who allege that their content is being copied without permission, and AI companies, who argue that they should be allowed to do so.

How did it all start?

An IIT Madras graduate, Aravind Srinivas worked at prominent tech ventures such as Google, DeepMind, and OpenAI before launching Perplexity, which set out to disrupt how search results are shown to users by responding to their queries with personalised, AI-generated answers.

Perplexity AI achieves this by “crawling the web, pulling the relevant sources, only using the content from those sources to answer the question, and always telling the user where the answer came from through citations or references,” Srinivas had told The Indian Express in an interview.

Hence, Perplexity was seen as a small player taking on tech giants such as Google and Microsoft in the search engine market. However, things took a different turn when it rolled out a feature called ‘Pages’ that allowed users to input a prompt and receive a researched, AI-generated report that cited its sources and could be published as a web page to be shared with anyone.

Days after its rollout, the Perplexity team published an AI-generated ‘Page’ of an exclusive Forbes article about ex-Google CEO Eric Schmidt’s involvement in a secret military drone project. The US-based publication claimed that the language in its paywalled article and Perplexity’s AI-generated summary was similar. It pointed out that the artwork in the article had also been copied and further alleged that Forbes had not been cited prominently enough.

Why is Perplexity receiving flak from publishers?

In addition to allegedly plagiarising articles and bypassing paywalls, Perplexity has also been accused of not complying with accepted web standards such as robots.txt files.

According to cybersecurity firm Cloudflare, “A robots.txt file contains instructions for bots that tell them which web pages they can and cannot access.”

Robots.txt primarily applies to web crawlers, such as those used by Google to scan the internet and index content for search results. Site administrators can include specific directives in the file so that crawlers do not process data on restricted web pages or directories.
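To illustrate how this works in practice, here is a minimal sketch, using Python's standard library, of how a compliant crawler would consult robots.txt before fetching a page. The bot name and URLs are placeholders for illustration, not the actual crawlers or publishers discussed above.

```python
# Minimal sketch: a well-behaved crawler checks robots.txt before fetching a page.
# "ExampleBot" and example.com are placeholders, not real crawlers or sites.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

page = "https://example.com/news/some-article"
if rp.can_fetch("ExampleBot", page):
    print("robots.txt allows ExampleBot to crawl this page")
else:
    # A compliant crawler stops here, but nothing technically forces it to.
    print("robots.txt disallows ExampleBot from crawling this page")
```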

However, robots.txt is not legally binding, which means it is not much of a defence against AI bots, as they can simply choose to ignore the instructions within the file. That is exactly what Perplexity did, according to Wired. Confirming the findings of a developer named Robb Knight, the tech news portal found that Perplexity AI was able to access its content and provide a summary of it, despite Wired having prohibited the AI bot from scraping its website.

But Perplexity is not the only one with questionable data scraping methods. Quora’s AI chatbot Poe goes one step further than a summary and provides users with an HTML file of paywalled articles for download, according to a report by Wired. Furthermore, content licensing startup Tollbit said that more and more AI agents “are opting to bypass the robots.txt protocol to retrieve content from sites.”

How else can publishers block AI bots?

The emerging trend of AI bots reportedly defying web standards and bypassing paywalled sites raises an important question: what other measures can publishers take to prevent the unauthorised scraping and use of their online content by AI bots?

Reddit has said that in addition to updating its robots.txt file, it is also using a technique known as rate limiting, which essentially limits the number of times users can perform certain actions (such as logging into a web portal) within a specified time frame. While this solution can help distinguish legitimate traffic from AI bot traffic, it is not foolproof.
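As a rough illustration of the general technique (a generic sketch, not Reddit's actual implementation), a fixed-window rate limiter in Python might count requests per client and reject anything over a set cap; the window length, request limit, and client key below are assumed values.

```python
# Generic fixed-window rate limiter: cap how many requests a client can make
# per time window. Limits and the client key are illustrative assumptions.
import time
from collections import defaultdict

WINDOW_SECONDS = 60   # length of each counting window
MAX_REQUESTS = 100    # requests allowed per client per window

_counters = defaultdict(lambda: [0.0, 0])  # client_id -> [window_start, count]

def allow_request(client_id: str) -> bool:
    """Return True if the client is still under its limit for the current window."""
    now = time.time()
    window_start, count = _counters[client_id]
    if now - window_start >= WINDOW_SECONDS:
        _counters[client_id] = [now, 1]  # start a fresh window for this client
        return True
    if count < MAX_REQUESTS:
        _counters[client_id][1] = count + 1
        return True
    return False  # over the limit: block or throttle the request
```

A bot that requests pages far faster than a human reader would quickly exceed such a cap, which is how rate limiting can flag automated traffic, though determined scrapers can spread requests across many addresses to stay under it.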

There has also been a rise in the development of data poisoning tools like Nightshade and Kudurru, which claim to help artists prevent AI bots from ingesting their artwork without permission, either by actively blocking the bots or by damaging their training datasets in retaliation.
