How to write a good robots.txt in 2026 (and why most are broken)
Modern robots.txt covers AI crawlers, sitemaps, and crawl-delay quirks. The 2026 guide with rules for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended.
Most production robots.txt files I audit are wrong in at least one of four ways: they block their own sitemap, they contradict themselves with overlapping wildcards, they haven't been updated since the AI crawler boom, or they try to enforce policy that robots.txt isn't designed to enforce.
This is the file you should actually be shipping in 2026.
The spec, briefly
robots.txt is governed by RFC 9309 (published 2022). Key rules:
- Lives at /robots.txt on the origin. Subdomains have their own.
- Plain text, UTF-8.
- Groups are User-agent: followed by Allow:/Disallow: lines.
- The most specific (longest) matching rule wins on path conflicts, regardless of order; if an Allow and a Disallow match equally, Allow wins.
- Sitemap: is a top-level directive, not inside a group.
- Compliant bots respect it. Non-compliant ones don't.
That last point is the most important: robots.txt is a polite request. It is not access control. If a path leaks secrets, putting it in Disallow: is the security equivalent of a sign saying "please don't read this." Use auth.
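The group and Sitemap rules above can be sketched as a tiny parser. This is a minimal illustration, not a full RFC 9309 implementation: the Group class and parse_robots function are names I made up, and it ignores wildcards, Crawl-delay, and other extensions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Group:
    """One robots.txt group: the user agents it names and their rules."""
    user_agents: list[str] = field(default_factory=list)
    rules: list[tuple[str, str]] = field(default_factory=list)  # (kind, path)


def parse_robots(text: str) -> tuple[list[Group], list[str]]:
    """Split robots.txt text into groups and top-level Sitemap URLs."""
    groups: list[Group] = []
    sitemaps: list[str] = []
    current: Group | None = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "sitemap":
            sitemaps.append(value)  # top-level: never attached to a group
        elif key == "user-agent":
            # Consecutive User-agent lines share one group; a User-agent
            # line that follows rules starts a new group.
            if current is None or current.rules:
                current = Group()
                groups.append(current)
            current.user_agents.append(value)
        elif key in ("allow", "disallow") and current is not None:
            current.rules.append((key, value))
    return groups, sitemaps


SAMPLE = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

groups, sitemaps = parse_robots(SAMPLE)
for group in groups:
    print(group.user_agents, group.rules)
print(sitemaps)
```

Note that the Sitemap line ends up in its own list no matter where it sits in the file, which is what "top-level" means in practice.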
The 2026 starter template
# Default: allow all general crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Allow: /
# --- AI training crawlers ---
# Block or allow per your content policy.
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
# --- AI search/RAG crawlers (cite you at query time) ---
# These are usually worth allowing — they drive traffic.
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
# --- Sitemaps ---
Sitemap: https://example.com/sitemap.xml
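A quick way to sanity-check a template like this is to feed it to a parser and ask it the questions you care about. The sketch below uses Python's standard-library urllib.robotparser on an abbreviated copy of the template; the test URLs are illustrative, and this is one parser's reading, not a guarantee of how each vendor's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the starter template (a few representative groups).
TEMPLATE = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(TEMPLATE.splitlines())

# (user agent, URL) pairs to check; the URLs are made up for illustration.
CHECKS = [
    ("Googlebot", "https://example.com/blog/post"),
    ("Googlebot", "https://example.com/admin/users"),
    ("GPTBot", "https://example.com/blog/post"),
    ("OAI-SearchBot", "https://example.com/blog/post"),
]

results = {(ua, url): parser.can_fetch(ua, url) for ua, url in CHECKS}
for (ua, url), allowed in results.items():
    print(f"{ua:15} {url:40} {'allow' if allowed else 'block'}")

# site_maps() returns the Sitemap URLs found anywhere in the file.
print(parser.site_maps())
```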
That covers the major decisions. Now the reasoning.
Training vs search/RAG: the distinction that matters
Most AI vendors run two crawlers:
| Vendor | Training | Search/RAG |
|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot, ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot |
| Google | Google-Extended | (uses regular Googlebot) |
| Apple | Applebot-Extended | Applebot |
| Perplexity | (none declared) | PerplexityBot, Perplexity-User |
The training bots collect data to train models offline. The search/RAG bots fetch your pages because a user just asked an LLM a question about them and the model may cite you in its answer.
The default decision for a content site in 2026: block training, allow search/RAG. You don't want your content silently absorbed into Model v8; you do want to be cited (and clicked) when someone asks an LLM a question your page answers.
For a SaaS marketing site or docs site, allowing both makes more sense — exposure is the point.
For paywalled content, block every AI crawler in robots.txt and enforce access server-side.
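If you manage several sites, the three defaults above can be generated rather than hand-edited. A minimal sketch: the policy names ("content", "marketing", "paywalled") and the build_ai_groups function are my own labels, and the bot lists are simply the ones from the table.

```python
# Bot names taken from the vendor table; extend them to match your own policy.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
               "PerplexityBot", "Perplexity-User"]

# Hypothetical policy names -> (rule for training bots, rule for search bots).
POLICIES = {
    "content": ("Disallow: /", "Allow: /"),      # block training, allow search/RAG
    "marketing": ("Allow: /", "Allow: /"),       # exposure is the point
    "paywalled": ("Disallow: /", "Disallow: /"), # block both, enforce server-side
}


def build_ai_groups(site_type: str) -> str:
    """Return the AI-crawler groups of a robots.txt for the given site type."""
    training_rule, search_rule = POLICIES[site_type]
    lines: list[str] = []
    for bots, rule in ((TRAINING_BOTS, training_rule), (SEARCH_BOTS, search_rule)):
        for bot in bots:
            lines += [f"User-agent: {bot}", rule, ""]
    return "\n".join(lines).rstrip() + "\n"


print(build_ai_groups("content"))
```

The output covers only the AI groups; the User-agent: * group and Sitemap lines still need to be added around it.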
Common mistakes
1. Blocking your own assets
User-agent: *
Disallow: /static/
Disallow: /assets/
I see this constantly. Then Googlebot can't fetch your CSS and JS, renders a broken page, and your mobile-usability and layout signals suffer. Modern crawlers render pages with a real browser; they need your assets to evaluate them. Only disallow paths that have no SEO value (/admin/, /internal/).
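This mistake is easy to catch with a text scan before deploy. The sketch below flags Disallow lines that point at asset-looking paths; the prefixes in ASSET_HINTS are illustrative guesses, not a standard list, so expect the occasional false positive.

```python
# Illustrative prefixes that usually hold CSS, JS, or images. Adjust per site.
ASSET_HINTS = ("/static", "/assets", "/css", "/js", "/_next/static")


def find_asset_blocks(robots_txt: str) -> list[str]:
    """Return the Disallow lines whose path starts with an asset-like prefix."""
    flagged = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip().startswith(ASSET_HINTS):
            flagged.append(line)
    return flagged


BROKEN = """\
User-agent: *
Disallow: /static/
Disallow: /assets/
Disallow: /admin/
"""

for line in find_asset_blocks(BROKEN):
    print("blocks rendering assets:", line)
```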
2. Trying to use Allow to override Disallow with overlapping wildcards
User-agent: *
Disallow: /
Allow: /blog/
This works for Google and any other RFC 9309 parser (the longest match wins, so /blog/ beats /), but older or simpler parsers apply rules in file order and stop at the first match, which blocks /blog/ too. Be explicit instead:
User-agent: *
Disallow: /admin/
Disallow: /api/
Then everything else is implicitly allowed. No fragile ordering.
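To see why the Disallow: / plus Allow: /blog/ pattern is fragile, compare two plausible matching strategies on the same rules. This is a simplified sketch that handles plain path prefixes only (no * or $ wildcards); longest_match and first_match are hypothetical helper names, not any library's API.

```python
Rule = tuple[str, str]  # ("allow" | "disallow", path prefix)


def longest_match(rules: list[Rule], path: str) -> bool:
    """RFC 9309 style: the longest matching prefix wins; Allow wins ties."""
    best_len, allowed = -1, True  # no matching rule means allowed
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = kind == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


def first_match(rules: list[Rule], path: str) -> bool:
    """File-order style: the first rule whose prefix matches decides."""
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            return kind == "allow"
    return True


FRAGILE: list[Rule] = [("disallow", "/"), ("allow", "/blog/")]
EXPLICIT: list[Rule] = [("disallow", "/admin/"), ("disallow", "/api/")]

for name, rules in (("fragile", FRAGILE), ("explicit", EXPLICIT)):
    for path in ("/blog/post", "/admin/users"):
        print(name, path, longest_match(rules, path), first_match(rules, path))
```

With the fragile rules the two strategies disagree on /blog/post; with the explicit rules they agree on every path, which is the property you want.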
3. Crawl-delay won't work where you need it
Crawl-delay: 10 is a nonstandard Bing/Yahoo-era extension that is not part of RFC 9309. Googlebot ignores it, and Google retired the Search Console crawl-rate setting in early 2024; the way to slow Googlebot now is to return 429 or 503 responses under load. If you need rate-limiting on a crawler that ignores Crawl-delay, do it at the server level (nginx limit_req, Cloudflare rules), and accept that a non-compliant bot will spoof user-agents anyway.
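If you can't configure the proxy, the same idea works at the application level. Below is a minimal token-bucket sketch, keyed by client IP because user-agents can be spoofed; the TokenBucket class and its parameters are my own, and a real deployment would also need eviction of idle keys and shared state across workers.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        # Each key starts with a full bucket: (tokens, time of last update).
        self.state = defaultdict(lambda: (float(burst), clock()))

    def allow(self, key: str) -> bool:
        tokens, last = self.state[key]
        now = self.clock()
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= 1:
            self.state[key] = (tokens - 1, now)
            return True
        self.state[key] = (tokens, now)
        return False  # caller should respond with 429


# Demo with a fake clock: 0.1 requests/second is roughly "Crawl-delay: 10".
fake_now = [0.0]
limiter = TokenBucket(rate=0.1, burst=2, clock=lambda: fake_now[0])
decisions = [limiter.allow("203.0.113.7") for _ in range(3)]
fake_now[0] = 15.0  # fifteen seconds later, one token has refilled
decisions.append(limiter.allow("203.0.113.7"))
print(decisions)  # [True, True, False, True]
```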
4. Sitemap path is wrong or missing
The Sitemap: directive is a top-level line, not nested under a User-agent:. Use the absolute URL, not a relative path. If you have multiple sitemaps, list all of them.
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
If you're building your sitemap manually, the Robots.txt Generator emits both together with the right syntax, and the Meta Tag Generator handles the per-page robots meta and canonical tags that go with it.
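The absolute-URL rule is simple enough to check mechanically. A rough sketch, assuming you already have the robots.txt text in hand; check_sitemap_lines is a hypothetical helper, and it only checks the URL's shape, not whether the sitemap actually exists.

```python
from urllib.parse import urlparse


def check_sitemap_lines(robots_txt: str) -> list[str]:
    """Return problems with the Sitemap: lines (an empty list means OK)."""
    problems = []
    found = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "sitemap":
            continue
        found = True
        url = urlparse(value.strip())
        # An absolute URL needs both a scheme and a host.
        if url.scheme not in ("http", "https") or not url.netloc:
            problems.append(f"not an absolute URL: {value.strip()!r}")
    if not found:
        problems.append("no Sitemap: line found")
    return problems


print(check_sitemap_lines("User-agent: *\nDisallow: /admin/\nSitemap: /sitemap.xml\n"))
```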
5. Assuming compliance
In August 2025, Cloudflare published telemetry showing Perplexity using undeclared crawlers that rotated user-agents and ASNs to bypass Disallow rules. The major commercial bots (GPTBot, ClaudeBot, Google-Extended, OAI-SearchBot) have a public commitment and a track record of compliance. Smaller and adversarial crawlers don't. If you genuinely need to keep bots out, robots.txt is the first layer — not the only one.
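One common second layer is checking that a request claiming to be a known bot comes from an address that vendor has published. The sketch below assumes the vendor publishes IP ranges at all (not every vendor does) and uses placeholder CIDR blocks from the reserved documentation ranges, not real crawler IPs; is_spoofed and PUBLISHED_RANGES are hypothetical names.

```python
from __future__ import annotations

import ipaddress

# PLACEHOLDER ranges from the reserved documentation blocks. These are NOT
# real crawler IPs; load the vendor's published list in a real deployment.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "Googlebot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent: str) -> str | None:
    """Return the known bot name the user-agent string claims, if any."""
    ua = user_agent.lower()
    for name in PUBLISHED_RANGES:
        if name.lower() in ua:
            return name
    return None


def is_spoofed(user_agent: str, remote_ip: str) -> bool:
    """True if the UA claims a known bot but the IP is outside its ranges."""
    name = claimed_bot(user_agent)
    if name is None:
        return False  # makes no claim we can verify
    ip = ipaddress.ip_address(remote_ip)
    return not any(ip in ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[name])


print(is_spoofed("Mozilla/5.0 (compatible; GPTBot/1.1)", "203.0.113.5"))
```

This only catches bots that impersonate a known crawler. A crawler that pretends to be an ordinary browser needs behavioral detection, which robots.txt can't help with.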
Per-page robots meta tags
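robots.txt controls fetching. The per-page <meta name="robots"> tag controls indexing of pages that are fetched, which means a crawler only sees noindex on a page it is allowed to crawl. Don't Disallow a page you want de-indexed. The tags look like this: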
<meta name="robots" content="noindex, follow">
<meta name="googlebot" content="noindex">
<meta name="GPTBot" content="noindex">
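Use noindex on search result pages, faceted nav, and any page that's a duplicate of a canonical. Use nofollow on user-generated link blocks (comments, forum signatures) to avoid link-juice farms. Bot-specific meta names only work for crawlers that document support for them, so check each vendor's docs before relying on one.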
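To audit what a page actually declares, you can pull the robots meta tags out of the rendered HTML. A minimal sketch with the standard-library html.parser; RobotsMetaParser is a name I made up, and it does not look at the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name=... content=...> robots tags."""

    def __init__(self, names=("robots",)):
        super().__init__()
        self.names = {n.lower() for n in names}
        self.directives: dict[str, set[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        name = attr.get("name", "").lower()
        if name in self.names:
            values = {d.strip().lower() for d in attr.get("content", "").split(",") if d.strip()}
            self.directives.setdefault(name, set()).update(values)


PAGE = """<html><head>
<meta name="robots" content="noindex, follow">
<meta name="googlebot" content="noindex">
</head><body></body></html>"""

meta_parser = RobotsMetaParser(names=("robots", "googlebot"))
meta_parser.feed(PAGE)
print(meta_parser.directives)
print("noindex" in meta_parser.directives.get("robots", set()))
```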
A debugging recipe
Before you ship:
- Fetch https://yoursite.com/robots.txt directly. Make sure it's 200, text/plain, UTF-8 without a BOM.
- Validate with the robots.txt report in Google Search Console (it replaced the standalone robots.txt Tester) and the robots.txt tester in Bing Webmaster Tools.
- Check that curl -A "GPTBot" https://yoursite.com/some-blocked-page still returns 200 (you don't enforce in robots.txt; you announce).
- Re-check after every CMS upgrade. WordPress, Ghost, and Next.js all have generated defaults that may overwrite yours.
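The first item on that list is easy to script. A sketch under the assumption that you run it from CI or a deploy hook: check_robots_response holds the actual checks so it can be tested offline, and fetch_and_check (which makes a real network request) is only a thin wrapper. Both names are my own.

```python
import codecs
import urllib.error
import urllib.request


def check_robots_response(status: int, content_type: str, body: bytes) -> list[str]:
    """Return problems with a robots.txt HTTP response (empty list means OK)."""
    problems = []
    if status != 200:
        problems.append(f"status is {status}, expected 200")
    # Ignore parameters such as "; charset=utf-8" when comparing the type.
    if content_type.split(";", 1)[0].strip().lower() != "text/plain":
        problems.append(f"content type is {content_type!r}, expected text/plain")
    if body.startswith(codecs.BOM_UTF8):
        problems.append("body starts with a UTF-8 BOM")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    return problems


def fetch_and_check(url: str) -> list[str]:
    """Fetch a live robots.txt and run the checks (performs a network request)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return check_robots_response(
                resp.status, resp.headers.get("Content-Type", ""), resp.read()
            )
    except urllib.error.HTTPError as err:
        return [f"status is {err.code}, expected 200"]


# Offline example: a response served as HTML with a BOM in front.
print(check_robots_response(200, "text/html", codecs.BOM_UTF8 + b"User-agent: *\n"))
```

Calling fetch_and_check("https://yoursite.com/robots.txt") after each deploy or CMS upgrade covers the re-check step as well.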
Try it
- Robots.txt Generator: modern crawler list, sitemap support, output validated against RFC 9309
- Meta Tag Generator: per-page <meta robots>, canonical, OpenGraph, in one panel