How to write a good robots.txt in 2026 (and why most are broken) | devformat.tools Blog
seo, robots-txt, ai

Modern robots.txt covers AI crawlers, sitemaps, and crawl-delay quirks. The 2026 guide with rules for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended.

By devformat.tools · 5 min read

Most production robots.txt files I audit are wrong in at least one of four ways: they block their own sitemap, they contradict themselves with overlapping rules, they haven't been updated since the AI crawler boom, or they try to enforce policy that robots.txt isn't designed to enforce.

This is the file you should actually be shipping in 2026.

The spec, briefly

robots.txt is governed by RFC 9309 (published 2022). Key rules:

  • Lives at /robots.txt on the origin. Subdomains have their own.
  • Plain text, UTF-8.
  • Groups are User-agent: followed by Allow: / Disallow: lines.
  • The most specific rule wins on path conflicts: the longest matching path takes precedence, and if an Allow and a Disallow tie, Allow wins. Order within a group does not matter.
  • Sitemap: is a top-level directive, not inside a group.
  • Compliant bots respect it. Non-compliant ones don't.

That last point is the most important: robots.txt is a polite request. It is not access control. If a path leaks secrets, putting it in Disallow: is the security equivalent of a sign saying "please don't read this." Use auth.
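
The precedence rule from the list above can be sketched in a few lines of Python. This is a minimal illustration of RFC 9309-style matching for a single group, not a full parser; the function name is_allowed and the (directive, pattern) tuple format are assumptions made for the example.

```python
import re


def _pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Return True if `path` may be fetched under `rules`.

    `rules` is a list of (directive, pattern) tuples for ONE user-agent
    group (hypothetical format for this sketch). The longest matching
    pattern wins; on a tie, Allow wins; no match means allowed.
    """
    best_len = -1
    allowed = True
    for directive, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if _pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            longer = len(pattern) > best_len
            tie_for_allow = len(pattern) == best_len and is_allow
            if longer or tie_for_allow:
                best_len = len(pattern)
                allowed = is_allow
    return allowed


rules = [("Disallow", "/"), ("Allow", "/blog/")]
print(is_allowed(rules, "/blog/post"))  # the longer Allow pattern applies
print(is_allowed(rules, "/admin"))      # only "Disallow: /" matches
```

Because the decision depends on pattern length rather than position, reversing the two rules gives the same answers.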

The 2026 starter template

# Default: allow all general crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Allow: /

# --- AI training crawlers ---
# Block or allow per your content policy.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# --- AI search/RAG crawlers (cite you at query time) ---
# These are usually worth allowing — they drive traffic.

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Sitemaps ---
Sitemap: https://example.com/sitemap.xml

That covers the major decisions. Now the reasoning.

Training vs search/RAG: the distinction that matters

Most AI vendors separate their training crawlers from their search/RAG crawlers:

Vendor     | Training          | Search/RAG
OpenAI     | GPTBot            | OAI-SearchBot, ChatGPT-User
Anthropic  | ClaudeBot         | Claude-SearchBot
Google     | Google-Extended   | (uses regular Googlebot)
Apple      | Applebot-Extended | Applebot
Perplexity | (none declared)   | PerplexityBot, Perplexity-User

The training bots collect data to train models offline. The search/RAG bots fetch your pages to answer live queries: either to build the index an assistant cites from, or because a user just asked an LLM a question about them and the model is going to cite you in its answer.

The default decision for a content site in 2026: block training, allow search/RAG. You don't want your content silently absorbed into Model v8; you do want to be cited (and clicked) when someone asks an LLM a question your page answers.

For a SaaS marketing site or docs site, allowing both makes more sense — exposure is the point.

For paywalled content, disallow every AI crawler in robots.txt and enforce the paywall server-side.
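
The three defaults above can be expressed as a small generator. This is a sketch under stated assumptions: the site_type labels ("content", "marketing", "paywalled") and the function name render_robots are hypothetical, the bot lists are copied from the starter template, and "paywalled" is read here as blocking both AI crawler groups while leaving the general * group as it is.

```python
# Bot lists copied from the starter template in this article.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "Meta-ExternalAgent", "Bytespider", "CCBot",
]
SEARCH_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
]


def render_robots(site_type: str, sitemap_url: str) -> str:
    """Render a robots.txt body for a hypothetical site_type label."""
    if site_type not in ("content", "marketing", "paywalled"):
        raise ValueError(f"unknown site_type: {site_type!r}")
    block_training = site_type in ("content", "paywalled")
    block_search = site_type == "paywalled"

    lines = ["User-agent: *", "Disallow: /admin/", "Disallow: /api/", ""]
    for bots, blocked in ((TRAINING_BOTS, block_training),
                          (SEARCH_BOTS, block_search)):
        rule = "Disallow: /" if blocked else "Allow: /"
        for bot in bots:
            lines += [f"User-agent: {bot}", rule, ""]
    # Sitemap is a top-level line, outside every group.
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


print(render_robots("content", "https://example.com/sitemap.xml"))
```

Keeping the policy in one function makes it easy to regenerate the file when a vendor adds a new crawler name.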

Common mistakes

1. Blocking your own assets

User-agent: *
Disallow: /static/
Disallow: /assets/

I see this constantly. Googlebot then can't fetch your CSS and JS, so it can't render the page the way a visitor sees it, and rendering-dependent checks such as mobile usability suffer. Modern crawlers render pages with a real browser; they need your assets to score them. Only disallow paths that have no SEO value (/admin/, /internal/).
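
A quick way to catch this before shipping is to run your asset URLs through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser; the helper name blocked_assets and the example URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt: str, asset_urls, agent: str = "Googlebot"):
    """Return the asset URLs that `agent` may not fetch under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(agent, url)]


robots_txt = """\
User-agent: *
Disallow: /static/
Disallow: /assets/
"""

# Hypothetical URLs that a page on example.com might reference.
assets = [
    "https://example.com/static/app.css",
    "https://example.com/assets/app.js",
    "https://example.com/index.html",
]
print(blocked_assets(robots_txt, assets))
```

Any CSS or JS file in the returned list is one the crawler cannot use when rendering your pages.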

2. Relying on Allow to override an overlapping Disallow

User-agent: *
Disallow: /
Allow: /blog/

This works for Google and any other RFC 9309-compliant parser (longest match wins), but older or homegrown parsers often evaluate rules in file order and stop at the first match, which blocks /blog/ here. Be explicit instead:

User-agent: *
Disallow: /admin/
Disallow: /api/

Then everything else is implicitly allowed. No fragile ordering.
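
To see why the overlapping version is fragile, compare two evaluation strategies on the same rules. This is a simplified sketch with prefix matching only (no wildcards); first_match and longest_match are illustrative names, not functions from any particular crawler.

```python
def first_match(rules, path):
    """Order-dependent evaluation: the first matching rule decides."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "Allow"
    return True


def longest_match(rules, path):
    """RFC 9309-style evaluation: the longest matching prefix decides.

    Tuples compare by length first, then by the boolean, so an Allow
    (True) beats a Disallow (False) of the same length.
    """
    hits = [(len(prefix), directive == "Allow")
            for directive, prefix in rules if path.startswith(prefix)]
    return max(hits)[1] if hits else True


overlapping = [("Disallow", "/"), ("Allow", "/blog/")]
explicit = [("Disallow", "/admin/"), ("Disallow", "/api/")]

for name, rules in (("overlapping", overlapping), ("explicit", explicit)):
    print(name,
          first_match(rules, "/blog/post"),
          longest_match(rules, "/blog/post"))
```

The two strategies disagree on /blog/post for the overlapping rules and agree for the explicit ones, which is the whole argument for the explicit form.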

3. Crawl-delay won't work where you need it

Crawl-delay: 10 is a non-standard Bing/Yahoo extension; it is not part of RFC 9309. Googlebot ignores it, and the crawl-rate limiter in Search Console was retired in January 2024, so the way to slow Googlebot now is to answer with 429 or 503 for a short period. If you need rate-limiting on a crawler that ignores Crawl-delay, do it at the server level (nginx limit_req, Cloudflare rules), and accept that a non-compliant bot will spoof user-agents anyway.
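
If the limit has to live in application code rather than in nginx or Cloudflare, a per-client limiter is a few lines. The sketch below is a sliding-window counter, which is a different algorithm from nginx's limit_req (a leaky bucket); the class name, the thresholds, and the idea of keying by IP address are assumptions for illustration.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client key."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self.hits = defaultdict(deque)

    def allow(self, client_key):
        now = self.clock()
        recent = self.hits[client_key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # caller should answer 429 Too Many Requests
        recent.append(now)
        return True


# Key by IP rather than user-agent, since user-agents are trivially spoofed.
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10)
for _ in range(3):
    print(limiter.allow("203.0.113.7"))
```

The third request inside the window is refused; a real deployment would also need to evict idle keys so the dictionary does not grow without bound.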

4. Sitemap path is wrong or missing

The Sitemap: directive is a top-level line, not nested under a User-agent:. Use the absolute URL, not a relative path. If you have multiple sitemaps, list all of them.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml

If you're building your sitemap manually, the Robots.txt Generator emits both together with the right syntax, and the Meta Tag Generator handles the per-page robots meta and canonical tags that go with it.
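
The absolute-URL requirement is easy to lint for. This is a minimal sketch, assuming the file is already loaded as a string; sitemap_problems is a hypothetical helper name and it checks only the two points above (a Sitemap line exists, and each one is an absolute http(s) URL).

```python
from urllib.parse import urlsplit


def sitemap_problems(robots_txt: str):
    """Return human-readable problems with the Sitemap: lines."""
    problems = []
    found = 0
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "sitemap":
            continue
        found += 1
        url = value.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"line {number}: not an absolute URL: {url!r}")
    if found == 0:
        problems.append("no Sitemap: line found")
    return problems


print(sitemap_problems("User-agent: *\nDisallow: /admin/\nSitemap: /sitemap.xml\n"))
```

An empty list means every Sitemap line passed both checks.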

5. Assuming compliance

In August 2025, Cloudflare published telemetry showing Perplexity using undeclared crawlers that rotated user-agents and ASNs to bypass Disallow rules. The major commercial bots (GPTBot, ClaudeBot, Google-Extended, OAI-SearchBot) have a public commitment and a track record of compliance. Smaller and adversarial crawlers don't. If you genuinely need to keep bots out, robots.txt is the first layer — not the only one.
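
One possible second layer is to check whether a request that claims to be a known bot actually comes from an address range the vendor publishes. The sketch below shows only the matching logic: the function name spoof_suspected is hypothetical, and the ranges are placeholders from the reserved documentation address space, not real crawler ranges. In practice you would load the ranges from the vendor's own published list.

```python
import ipaddress

# PLACEHOLDER ranges (RFC 5737 documentation space), not real crawler IPs.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def spoof_suspected(user_agent, client_ip, published_ranges=PUBLISHED_RANGES):
    """True if the UA names a known bot but the IP is outside its ranges."""
    ip = ipaddress.ip_address(client_ip)
    for token, cidrs in published_ranges.items():
        if token.lower() in user_agent.lower():
            in_range = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
            return not in_range
    return False  # not claiming to be a bot we track


print(spoof_suspected("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.5"))
```

This only catches crawlers that impersonate a declared bot; a crawler posing as an ordinary browser needs behavioral detection instead.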

Per-page robots meta tags

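robots.txt covers fetching. The per-page <meta name="robots"> tag controls indexing of pages that are fetched, which means a crawler has to be allowed to fetch a page before it can see a noindex on it; don't Disallow a URL you are trying to get deindexed. They serve different purposes: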

<meta name="robots" content="noindex, follow">
<meta name="googlebot" content="noindex">
<meta name="GPTBot" content="noindex">

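Use noindex on search result pages, faceted nav, and any page that's a duplicate of a canonical. A meta nofollow applies to every link on the page, so for user-generated link blocks (comments, forum signatures) put rel="nofollow" or rel="ugc" on the links themselves to avoid link-juice farms. Bot-specific meta names only work where the vendor documents them (Google documents googlebot); verify the others before relying on them.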
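
To audit what a page actually declares, you can pull the robots meta directives out of the HTML with the standard library. This is a small sketch; RobotsMetaParser and robots_meta are illustrative names, and the list of meta names to watch is an assumption you would adjust.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="..."> tags for selected names."""

    def __init__(self, names=("robots", "googlebot")):
        super().__init__()
        self.names = {n.lower() for n in names}
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name not in self.names:
            return
        content = attr.get("content") or ""
        values = {v.strip().lower() for v in content.split(",") if v.strip()}
        self.directives.setdefault(name, set()).update(values)


def robots_meta(html: str, names=("robots", "googlebot")):
    parser = RobotsMetaParser(names)
    parser.feed(html)
    return parser.directives


page = """<head>
<meta name="robots" content="noindex, follow">
<meta name="googlebot" content="noindex">
</head>"""
print(robots_meta(page))
```

Running this over the URLs in your sitemap is a cheap way to find pages that are listed for crawling but marked noindex.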

A debugging recipe

Before you ship:

  1. Fetch https://yoursite.com/robots.txt directly. Make sure it's 200, text/plain, UTF-8 without a BOM.
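  2. Validate with the robots.txt report in Google Search Console (the standalone robots.txt tester was retired in 2023) and with the robots.txt tester in Bing Webmaster Tools.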
  3. Check that curl -A "GPTBot" https://yoursite.com/some-blocked-page still returns 200 (you don't enforce in robots.txt; you announce).
  4. Re-check after every CMS upgrade. WordPress, Ghost, and Next.js all have generated defaults that may overwrite yours.
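
Step 1 of the recipe is easy to automate. The sketch below separates the checks from the network call so the checks can run in CI against a saved response; check_robots_response and fetch_and_check are hypothetical helper names.

```python
import codecs
import urllib.error
import urllib.request


def check_robots_response(status: int, content_type: str, body: bytes):
    """Return a list of problems with a robots.txt HTTP response."""
    problems = []
    if status != 200:
        problems.append(f"status is {status}, expected 200")
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"content type is {media_type!r}, expected 'text/plain'")
    if body.startswith(codecs.BOM_UTF8):
        problems.append("body starts with a UTF-8 BOM")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    return problems


def fetch_and_check(url: str):
    """Fetch `url` and run the checks above (makes a real network request)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return check_robots_response(
                resp.status, resp.headers.get("Content-Type", ""), resp.read())
    except urllib.error.HTTPError as err:
        return check_robots_response(
            err.code, err.headers.get("Content-Type", ""), err.read())


print(check_robots_response(200, "text/plain; charset=utf-8", b"User-agent: *\n"))
```

Calling fetch_and_check("https://yoursite.com/robots.txt") after each deploy or CMS upgrade covers steps 1 and 4 in one go.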
