How to write a good robots.txt in 2026 (and why most are broken)

Modern robots.txt covers AI crawlers, sitemaps, and crawl-delay quirks. The 2026 guide with rules for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended.

By devformat.tools · 2026-06-04 · 5 min read

Most production robots.txt files I audit are wrong in at least one of four ways: they block their own sitemap, they contradict themselves with overlapping rules, they haven't been updated since the AI crawler boom, or they try to enforce policy that robots.txt isn't designed to enforce.

This is the file you should actually be shipping in 2026.

The spec, briefly

robots.txt is governed by RFC 9309 (published 2022). Key rules:

Lives at /robots.txt on the origin. Subdomains have their own.
Plain text, UTF-8.
Groups are User-agent: followed by Allow: / Disallow: lines.
The most specific rule wins on path conflicts: the longest matching path takes precedence, and Allow wins a tie with Disallow. File order does not matter.
Sitemap: is a top-level directive, not inside a group.
Compliant bots respect it. Non-compliant ones don't.

That last point is the most important: robots.txt is a polite request. It is not access control. If a path leaks secrets, putting it in Disallow: is the security equivalent of a sign saying "please don't read this." Use auth.
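The first rule in the list (one file per origin, subdomains separate) can be sketched in a few lines of Python. The function name `robots_url` is an assumption made for illustration, and the example.com hosts are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post?id=7"))
print(robots_url("https://docs.example.com/guide/"))
```

The two calls print different URLs, which is the point: rules published on example.com say nothing about docs.example.com.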

The 2026 starter template

# Default: allow all general crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Allow: /

# --- AI training crawlers ---
# Block or allow per your content policy.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

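# --- AI search/RAG crawlers (cite you at query time) ---
# These are usually worth allowing; they drive traffic.
# Note: a bot with its own group ignores the "User-agent: *" group,
# so repeat any Disallow lines you still want applied to it.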

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Sitemaps ---
Sitemap: https://example.com/sitemap.xml

That covers the major decisions. Now the reasoning.
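One thing worth checking before shipping a template like this: a crawler that finds a group with its own name uses only that group and does not inherit from `User-agent: *`. The sketch below feeds a trimmed version of the template to Python's standard `urllib.robotparser`. The helper `allowed` and the name `SomeOtherBot` are made up for illustration, and the stdlib parser is only one implementation, not a stand-in for how any particular vendor's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the starter template, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent: str, path: str) -> bool:
    """Hypothetical helper: ask the parser about one agent and one path."""
    return parser.can_fetch(agent, "https://example.com" + path)

print(allowed("SomeOtherBot", "/admin/"))   # falls back to the * group
print(allowed("GPTBot", "/blog/"))          # uses its own group
print(allowed("OAI-SearchBot", "/admin/"))  # own group only: * rules not inherited
```

With this parser the first two calls return False and the third returns True. If /admin/ and /api/ should stay off limits for the search bots as well, add those Disallow lines to their groups.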

Training vs search/RAG: the distinction that matters

Most AI vendors run two crawlers:

Vendor	Training	Search/RAG
OpenAI	`GPTBot`	`OAI-SearchBot`, `ChatGPT-User`
Anthropic	`ClaudeBot`	`Claude-SearchBot`
Google	`Google-Extended`	(uses regular `Googlebot`)
Apple	`Applebot-Extended`	`Applebot`
Perplexity	(none declared)	`PerplexityBot`, `Perplexity-User`

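The training bots collect pages to train models offline. The search/RAG bots fetch your pages because a user just asked an LLM a question about them and the model is going to cite you in its answer. Google-Extended is a special case: it is a robots.txt control token rather than a separate crawler, and the fetching itself is done by Googlebot.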

The default decision for a content site in 2026: block training, allow search/RAG. You don't want your content silently absorbed into Model v8; you do want to be cited (and clicked) when someone asks an LLM a question your page answers.

For a SaaS marketing site or docs site, allowing both makes more sense — exposure is the point.

For paywalled content, disallow every bot in robots.txt and enforce access server-side, since only the server-side check is binding.
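The three policies above can be written down as a small generator. This is a sketch under stated assumptions: the `site_type` labels ('content', 'docs', 'paywalled') and the function names are invented here, and the bot lists are copied from the table rather than from any vendor's current documentation.

```python
# Bot names taken from the vendor table in this article.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
               "PerplexityBot", "Perplexity-User"]

def group(agents: list[str], rule: str) -> str:
    """One robots.txt group: several User-agent lines sharing a single rule."""
    return "\n".join([f"User-agent: {a}" for a in agents] + [rule])

def ai_policy(site_type: str) -> str:
    """site_type is a hypothetical label: 'content', 'docs', or 'paywalled'."""
    # Docs and marketing sites allow training; everyone else blocks it.
    training = "Allow: /" if site_type == "docs" else "Disallow: /"
    # Only paywalled sites block the search/RAG bots.
    search = "Disallow: /" if site_type == "paywalled" else "Allow: /"
    return group(TRAINING_BOTS, training) + "\n\n" + group(SEARCH_BOTS, search) + "\n"

print(ai_policy("content"))
```

Stacking several User-agent lines over one rule is valid robots.txt and keeps the file shorter than one group per bot.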

Common mistakes

1. Blocking your own assets

User-agent: *
Disallow: /static/
Disallow: /assets/

I see this constantly. Googlebot then can't fetch your CSS and JS, so it renders a broken page and evaluates that instead of what your users see. Modern crawlers render pages with a real browser; they need your assets to do it. Only disallow paths that have no SEO value (/admin/, /internal/).

2. Trying to use `Allow` to override `Disallow` with overlapping paths

User-agent: *
Disallow: /
Allow: /blog/

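This works for Google and for any parser that follows RFC 9309: the longest match wins, so /blog/ beats /. Some older or simpler parsers instead apply rules in file order and stop at the first match, which blocks /blog/ too. Be explicit instead: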

User-agent: *
Disallow: /admin/
Disallow: /api/

Then everything else is implicitly allowed. No fragile ordering.
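To see why the ordering is fragile, here is a minimal sketch of the two evaluation strategies side by side. Both functions are illustrative stand-ins, not any crawler's real code, and they ignore the `*` and `$` wildcards for brevity. Rules are `(is_allow, path_prefix)` pairs in file order.

```python
Rule = tuple[bool, str]  # (is_allow, path_prefix)

def longest_match(rules: list[Rule], path: str) -> bool:
    """RFC 9309 style: the longest matching prefix wins; Allow wins a tie."""
    best = None
    for is_allow, prefix in rules:
        if path.startswith(prefix):
            key = (len(prefix), is_allow)
            if best is None or key > best:
                best = key
    return True if best is None else best[1]

def first_match(rules: list[Rule], path: str) -> bool:
    """File-order style: the first rule that matches decides."""
    for is_allow, prefix in rules:
        if path.startswith(prefix):
            return is_allow
    return True

fragile = [(False, "/"), (True, "/blog/")]        # Disallow: /  then  Allow: /blog/
explicit = [(False, "/admin/"), (False, "/api/")]

print(longest_match(fragile, "/blog/post"), first_match(fragile, "/blog/post"))
print(longest_match(explicit, "/blog/post"), first_match(explicit, "/blog/post"))
```

The fragile rule set gives different answers for /blog/post depending on the strategy (allowed under longest-match, blocked under first-match). The explicit rule set gives the same answer under both, which is the reason to prefer it.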

3. `Crawl-delay` won't work where you need it

Crawl-delay: 10 is a nonstandard extension that is not part of RFC 9309. Bing honors it; Googlebot ignores it, and Google retired the Search Console crawl-rate setting in early 2024, so the documented way to slow Googlebot is to temporarily answer with 429 or 503. If you need rate-limiting on a crawler that ignores Crawl-delay, do it at the server level (nginx limit_req, Cloudflare rules), and accept that a non-compliant bot will spoof user-agents anyway.
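It helps to look at Crawl-delay from the crawler's side: the value is just a number the client may choose to read. The sketch below uses `crawl_delay()` from Python's standard `urllib.robotparser`; the robots.txt content is invented for the example, and nothing here says how any real bot treats the directive.

```python
from urllib.robotparser import RobotFileParser

# Example file: only the bingbot group declares a delay.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /admin/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite client asks for its own delay and sleeps between requests.
print(parser.crawl_delay("bingbot"))
# An agent without a declared delay gets None back.
print(parser.crawl_delay("Googlebot"))
```

Nothing on the server enforces the number. A client that never calls the equivalent of `crawl_delay()` is unaffected, which is why real throttling belongs in nginx or at the CDN.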

4. Sitemap path is wrong or missing

The Sitemap: directive is a top-level line, not nested under a User-agent:. Use the absolute URL, not a relative path. If you have multiple sitemaps, list all of them.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
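A quick way to catch the relative-path mistake is to scan the file for Sitemap lines and flag any value that is not an absolute http(s) URL. This is a minimal sketch; `relative_sitemaps` is a hypothetical helper name, and the check is deliberately narrow (it does not verify that the sitemap exists or parses).

```python
from urllib.parse import urlsplit

def relative_sitemaps(robots_txt: str) -> list[str]:
    """Return Sitemap values that are not absolute http(s) URLs."""
    bad = []
    for line in robots_txt.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap":
            url = value.strip()
            parts = urlsplit(url)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                bad.append(url)
    return bad

sample = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: /blog-sitemap.xml
"""
print(relative_sitemaps(sample))
```

For the sample above, the function returns only the second entry, the one missing its scheme and host.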

If you're building your sitemap manually, the Robots.txt Generator emits both together with the right syntax, and the Meta Tag Generator handles the per-page robots meta and canonical tags that go with it.

5. Assuming compliance

In August 2025, Cloudflare published telemetry showing Perplexity using undeclared crawlers that rotated user-agents and ASNs to bypass Disallow rules. The major commercial bots (GPTBot, ClaudeBot, Google-Extended, OAI-SearchBot) have a public commitment and a track record of compliance. Smaller and adversarial crawlers don't. If you genuinely need to keep bots out, robots.txt is the first layer — not the only one.

Per-page `robots` meta tags

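robots.txt controls fetching. The per-page <meta name="robots"> tag controls indexing of pages that are fetched. They serve different purposes, and they interact: a crawler can only see a noindex tag on a page it is allowed to fetch, so don't Disallow a URL you are trying to get de-indexed. Typical tags: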

<meta name="robots" content="noindex, follow">
<meta name="googlebot" content="noindex">
<meta name="GPTBot" content="noindex">

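Use noindex on search result pages, faceted nav, and any page that duplicates a canonical. A nofollow in the meta tag applies to every link on the page; for user-generated link blocks (comments, forum signatures), mark the individual links with rel="nofollow" or rel="ugc" so link spammers gain nothing. Crawler-specific names such as googlebot only work where the vendor documents support for them, so verify that before relying on the GPTBot line.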

A debugging recipe

Before you ship:

Fetch https://yoursite.com/robots.txt directly. Make sure it's 200, text/plain, UTF-8 without a BOM.
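Check the robots.txt report in Google Search Console (it replaced the standalone robots.txt Tester in late 2023) and the robots.txt tester in Bing Webmaster Tools.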
Check that curl -A "GPTBot" https://yoursite.com/some-blocked-page still returns 200 (you don't enforce in robots.txt; you announce).
Re-check after every CMS upgrade. WordPress, Ghost, and Next.js all have generated defaults that may overwrite yours.
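The first step of the recipe is easy to automate. The sketch below checks a response that has already been fetched, so it runs without network access; `check_robots_response` is a hypothetical helper, and the three checks mirror the list above rather than any official validator.

```python
def check_robots_response(status: int, content_type: str, body: bytes) -> list[str]:
    """Return a list of problems; an empty list means the basics look right."""
    problems = []
    if status != 200:
        problems.append(f"status is {status}, expected 200")
    # Ignore parameters such as "; charset=utf-8" when comparing the media type.
    if content_type.split(";")[0].strip().lower() != "text/plain":
        problems.append(f"content type is {content_type!r}, expected text/plain")
    if body.startswith(b"\xef\xbb\xbf"):
        problems.append("body starts with a UTF-8 BOM")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    return problems

good = check_robots_response(200, "text/plain; charset=utf-8",
                             b"User-agent: *\nDisallow: /admin/\n")
bad = check_robots_response(200, "text/html", b"\xef\xbb\xbfUser-agent: *\n")
print(good)
print(bad)
```

To run it against a live site, one option is to pass in the `status`, the `Content-Type` header, and the bytes from `read()` of a `urllib.request.urlopen` response, keeping in mind that `urlopen` raises an exception on 4xx and 5xx statuses instead of returning them.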

Try it

Robots.txt Generator — modern crawler list, sitemap support, output validated against RFC 9309
Meta Tag Generator — per-page <meta robots>, canonical, OpenGraph, in one panel
