How to control AI crawlers on your website

May 13, 2026·9 min read

Several dozen AI crawlers may visit your website each day. Most identify themselves and respect robots.txt. This guide shows you which bots are out there, what each one does with the pages it fetches, and how to block or selectively allow them. The setup takes about three minutes.

Why this matters in 2026

AI companies need data to train and run their models. A lot of that data comes from crawling the open web: your blog posts, your product descriptions, your recipes, your photos, your code samples. Once a model is trained on your content, it can quote it, summarize it, and sometimes reproduce parts of it verbatim, without sending a single visitor back to your site.

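There are two distinct things you might want to control:

  1. Training: crawlers that collect your content to build AI models.
  2. Search: crawlers that fetch your pages on demand to answer a user's question and cite you as the source.

A common setup blocks the first and allows the second. The next sections show how.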

The full list of AI crawlers (mid-2026)

Each row is a real User-Agent string you can put in robots.txt. The two tables below separate the bots that train models from the bots that fetch on demand.

AI training crawlers

User-Agent | What it does
GPTBot | AI training crawler
ClaudeBot | AI training crawler
anthropic-ai | Legacy AI training user-agent
Google-Extended | Controls AI training use of crawled content
Applebot-Extended | Controls AI training use of crawled content
Amazonbot | Crawler used for AI products
Bytespider | Web crawler used for AI training
CCBot | Open web dataset crawler used by many AI projects
FacebookBot | Used for AI training
Diffbot | Knowledge graph crawler

AI search crawlers (consider keeping these on)

User-Agent | What it does
OAI-SearchBot | AI search crawler that cites sources
ChatGPT-User | Fetches when a user asks an AI assistant a question
PerplexityBot | AI answer engine crawler
Claude-User | Fetches when a user asks an AI assistant a question

Blocking all four search crawlers removes your site from the answers those AI tools surface. Allowing them keeps your pages eligible to appear with citations.

Two independent decisions

Training bots and search bots are separate user-agents. You can allow or block each independently. Training bots collect content to build AI models. Search bots fetch pages on demand and send users back to you with a citation, like a traditional search engine.

Example: block training, allow search

Here's a complete robots.txt that blocks the major AI training bots while allowing AI search bots and traditional search engines. This is one common configuration. Adjust it to whatever combination you want.

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

# Allow everything else (including AI search bots)
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Replace the Sitemap: URL with yours, or remove that line if you don't have one yet.
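
Before uploading, you can sanity-check the rules locally. The sketch below uses Python's standard-library `urllib.robotparser` to ask whether a given user-agent may fetch a page under an abbreviated copy of the rules above. The helper name `allowed_agents` and the sample URL are illustrative assumptions, and Python's parser applies its own matching logic, so treat the result as a rough check rather than a guarantee of how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the example rules; paste your full file here instead.
ROBOTS_TXT = """\
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Allow everything else (including AI search bots)
User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return {user_agent: may_fetch} for each agent under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(
        ROBOTS_TXT,
        ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Googlebot"],
        "https://yoursite.com/any-page",  # placeholder URL
    )
    for agent, ok in result.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

With these rules, the two training bots should come back blocked and the other two allowed.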

If you want to block ALL AI, including search

Some site owners (paywalled news, fiction publishers, photographers protecting unique work) want a hard block on every AI bot, both training and on-demand search. To do that, add the four search-crawler user-agents to the block list:

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Claude-User
Disallow: /

Trade-off: users asking AI assistants for "the best paywalled news article about X" or "find me a photographer for Y" will not see you cited in the answer. Whether that trade-off is worth it depends on how much of your traffic you expect from AI-assisted search.
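
The two configurations above differ only in which user-agents appear in the block list, so the file can be generated from two switches. This is a minimal sketch, not the generator tool mentioned in this post; the function name `build_robots_txt` and its parameters are hypothetical, and the bot lists are copied from the examples above.

```python
from typing import Optional

# User-agents taken from the robots.txt examples in this post.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
    "Applebot-Extended", "Amazonbot", "Bytespider", "CCBot", "FacebookBot",
]
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Claude-User"]


def build_robots_txt(
    block_training: bool = True,
    block_search: bool = False,
    sitemap_url: Optional[str] = None,
) -> str:
    """Build robots.txt text that blocks the selected groups of AI bots."""
    blocked = []
    if block_training:
        blocked += TRAINING_BOTS
    if block_search:
        blocked += SEARCH_BOTS

    lines = []
    for agent in blocked:
        # One group per bot, separated by a blank line.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]

    # Every other crawler stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hard block: training and search bots alike.
    print(build_robots_txt(block_search=True,
                           sitemap_url="https://yoursite.com/sitemap.xml"))
```

Calling `build_robots_txt()` with the defaults gives the "block training, allow search" layout; passing `block_search=True` adds the four search user-agents.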

Where the file goes

The robots.txt file lives at the root of the site (the same folder as the homepage), served at https://yoursite.com/robots.txt. How to get it there depends on the hosting platform: most CMSes and static-site hosts have either a dashboard upload option or a robots.txt editor in their SEO settings.

Verify by visiting https://yoursite.com/robots.txt in a browser. The file's contents should appear in plain text.
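
The same check can be scripted. The sketch below fetches `/robots.txt` with the standard library and reports anything that looks off. The function names are hypothetical, and the specific checks (HTTP 200, a `text/plain` content type, at least one `User-agent` line) are heuristic assumptions about what a healthy response looks like, not requirements taken from this guide.

```python
import urllib.error
import urllib.request


def check_robots_body(status: int, content_type: str, body: str) -> list[str]:
    """Return a list of problems; an empty list means the response looks fine."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type!r}")
    if "user-agent:" not in body.lower():
        problems.append("no User-agent line found")
    return problems


def check_live_robots(base_url: str) -> list[str]:
    """Fetch <base_url>/robots.txt and run the heuristic checks on the response."""
    url = base_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            content_type = resp.headers.get("Content-Type", "")
            return check_robots_body(resp.status, content_type, body)
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses, e.g. a missing file.
        return [f"expected HTTP 200, got {err.code}"]


if __name__ == "__main__":
    # Replace the placeholder domain with your own site.
    for problem in check_live_robots("https://yoursite.com") or ["looks fine"]:
        print(problem)
```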

Generate your robots.txt in one click

Toggle which AI bots to block. Copy or download the file. Free, in your browser.

Open the generator →

What robots.txt does and does not do

robots.txt is a request that compliant crawlers follow. The major AI crawler operators publicly commit to following it. Because compliance is voluntary, robots.txt alone does not technically prevent fetching.

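For guaranteed control over which clients can reach your pages, pair robots.txt with enforcement at the server, CDN, or firewall level, for example by rejecting requests based on User-Agent or IP range, or by putting content behind a login.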
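
One way to enforce a block rather than request it is to reject matching requests at the server. The sketch below is a small WSGI middleware that answers 403 when the User-Agent header contains a blocked token. The name `block_ai_crawlers`, the token list, and the demo app are illustrative assumptions. A client can send any User-Agent it likes, so this only stops bots that identify themselves honestly.

```python
from wsgiref.simple_server import make_server

# Illustrative subset of the user-agent tokens listed in this post.
BLOCKED_AGENT_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")


def block_ai_crawlers(app, blocked_tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app so requests from blocked user-agents get a 403."""
    lowered = tuple(token.lower() for token in blocked_tokens)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        # Case-insensitive substring match against the User-Agent header.
        if any(token in user_agent for token in lowered):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in for your real site."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Hello\n"]


if __name__ == "__main__":
    # Local demo server on port 8000.
    with make_server("127.0.0.1", 8000, block_ai_crawlers(hello_app)) as server:
        server.serve_forever()
```

In practice the same rule usually lives in the web server, CDN, or firewall configuration rather than in application code; the middleware just makes the logic explicit.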

Does blocking AI crawlers today mean AI assistants won't know my site exists?

Not really. AI models already hold a lot of historical data from earlier crawls and from third-party sources (open web datasets, scraped indexes, leaked datasets). Blocking a crawler only prevents new data from being collected going forward. It does not retroactively delete what was already learned.

To request removal of specific content already in a model, contact the AI provider directly. Some providers offer a removal or opt-out form; availability and processing times vary.

Combining robots.txt with llms.txt

Whatever you choose to block or allow in robots.txt, you can also add an llms.txt file describing your site in your own words. llms.txt is a proposed convention rather than a standard, so support varies by tool. AI search bots that do read it can use that description when they summarize and cite your pages.

We have a full plain-English guide to llms.txt covering exactly that.

The 3-minute checklist

  1. Open the Robots.txt Generator.
  2. Toggle each bot to Allow or Block based on what you want.
  3. Download the file and upload it to your site root as robots.txt.
  4. Verify it loads at yoursite.com/robots.txt.
  5. (Optional) Add an llms.txt so AI search bots have a clean description of your site.

The whole process takes about three minutes.
