In one sentence
robots.txt is a text file served from the root of a website (/robots.txt) that tells crawlers (search-engine bots and AI crawlers) "where they may look." It is a request that well-behaved crawlers honor, not an access control.
What does this look like in practice?
For example, if your site's /robots.txt contains:
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
User-agent: Claude-Web
Allow: /
User-agent: Google-Extended
Allow: /
Sitemap: https://example.com/sitemap.xml
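then the crawlers behind ChatGPT, Claude, and Gemini read this as "we may look at the entire site" and proceed to crawl. Strictly speaking, Google-Extended is a control token rather than a separate crawler: Google's regular crawlers do the fetching, and the token governs whether the content may be used for Gemini.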
Conversely, writing Disallow: / under these user agents asks them to stay out of the entire site. Because the file is advisory, this shuts out only the crawlers that honor it.
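To check how a crawler would read such a file, Python's standard `urllib.robotparser` can evaluate it locally. The following is a minimal sketch, assuming the sample file above; the helper name `allowed_agents` and the test URL are illustrative, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from the text above.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Google-Extended
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

AGENTS = ["GPTBot", "Claude-Web", "Google-Extended"]


def allowed_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False} for whether each agent may fetch `url`.

    `allowed_agents` is a hypothetical helper written for this example.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # Every listed agent is allowed by the sample file.
    print(allowed_agents(ROBOTS_TXT, AGENTS))
    # Switching one group to Disallow blocks only that agent.
    print(allowed_agents("User-agent: GPTBot\nDisallow: /\n", AGENTS))
```

With the sample file every agent maps to `True`; with the second file only `GPTBot` maps to `False`, because agents without a matching group fall back to the default of allowed.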
GEO best practices
| Bot | Treatment in GEO measures |
|---|---|
| GPTBot (OpenAI / ChatGPT) | Allow (considered a prerequisite for being cited) |
| ClaudeBot (Anthropic / Claude; older templates may list Claude-Web) | Allow |
| Google-Extended (Google / Gemini) | Allow |
| PerplexityBot (Perplexity) | Allow |
| CCBot (Common Crawl) | Optional (used for training data) |
Allowing everything by default is the iron rule of GEO. A Disallow rule keeps compliant AI crawlers away from the blocked paths, so that content cannot be fetched, and therefore cannot be cited, in AI search.
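The table can be turned into a file mechanically. The sketch below is one possible way to do that, assuming a simple allow-or-block choice per bot; `POLICY` and `build_robots_txt` are hypothetical names, and CCBot is blocked here only to show both cases, since the table leaves it optional.

```python
# Per-bot policy mirroring the table: True means "Allow: /", False means "Disallow: /".
POLICY = {
    "GPTBot": True,
    "Claude-Web": True,
    "Google-Extended": True,
    "PerplexityBot": True,
    "CCBot": False,  # illustrative choice; the table marks this bot as optional
}


def build_robots_txt(policy, sitemap_url=None):
    """Render one group per bot, separated by blank lines, plus an optional Sitemap line."""
    lines = []
    for agent, allow in policy.items():
        lines.append(f"User-agent: {agent}")
        lines.append("Allow: /" if allow else "Disallow: /")
        lines.append("")  # blank line between groups for readability
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    print(build_robots_txt(POLICY, "https://example.com/sitemap.xml"), end="")
```

Keeping the policy in one mapping makes an accidental Disallow easier to spot in review than a hand-edited file.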
Common mistakes
- Unintentional Disallow: many companies still disallow GPTBot via old templates (templates from before 2024 need review)
- No configuration at all: when there is no robots.txt, or no rule matches a crawler, everything is treated as allowed, so this is at least tolerable
- Disallowing AI bots only: the desire not to be used as training data is understandable, but you also lose citation opportunities
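The mistakes above can be screened for with a small audit. This is a rough sketch using the standard `urllib.robotparser`; the function name `audit`, the list `AI_AGENTS`, and the use of Googlebot as a stand-in for an ordinary search crawler are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# AI user agents taken from the table in the text.
AI_AGENTS = ["GPTBot", "Claude-Web", "Google-Extended", "PerplexityBot"]


def audit(robots_txt, ai_agents=AI_AGENTS, search_agent="Googlebot"):
    """Report which AI agents are blocked from "/" and whether only AI bots are blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [a for a in ai_agents if not parser.can_fetch(a, "/")]
    search_ok = parser.can_fetch(search_agent, "/")
    return {
        # An empty file means no rules at all: everything is allowed by default.
        "no_configuration": not robots_txt.strip(),
        "blocked_ai_agents": blocked,
        # True when AI bots are shut out while the ordinary search crawler is not.
        "ai_only_block": bool(blocked) and search_ok,
    }


if __name__ == "__main__":
    old_template = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
    print(audit(old_template))
    print(audit(""))
```

The check only tests the site root, so a file that blocks individual directories would need additional paths passed through `can_fetch`.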
Related files
For GEO, robots.txt is typically maintained alongside:
- sitemap.xml: explicitly tells crawlers which URLs to visit
- llms.txt: a proposed convention (not a formal standard) that conveys the main content to AI in Markdown
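The three files are linked by location: robots.txt can declare the sitemap through its Sitemap line, and the other two are conventionally served from the site root. The sketch below collects their URLs for one site, assuming those root locations; `related_files` is a hypothetical helper, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def related_files(origin, robots_txt):
    """Return the URLs of robots.txt, the sitemap(s), and llms.txt for `origin`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns the declared Sitemap URLs, or None when there are none.
    # Falling back to /sitemap.xml is an assumption based on common practice.
    sitemaps = parser.site_maps() or [urljoin(origin, "/sitemap.xml")]
    return {
        "robots": urljoin(origin, "/robots.txt"),
        "sitemaps": sitemaps,
        "llms": urljoin(origin, "/llms.txt"),  # root location assumed
    }


if __name__ == "__main__":
    sample = "User-agent: GPTBot\nAllow: /\nSitemap: https://example.com/sitemap.xml\n"
    print(related_files("https://example.com", sample))
```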