llms.txt, robots.txt and AI crawlers: the technical basics for owners
Two tiny text files decide whether artificial intelligence can read your site at all. One is robots.txt, which tells crawlers where they're allowed in; the other is llms.txt, which offers the AI the essentials. And here's the surprising part: many websites accidentally lock out the very crawlers that ChatGPT and Gemini rely on — all because of a single bad line. Not out of bad intent — one overzealous setting, one line copied from a template, and the gate swings shut.
You don't need to be a developer to understand this, or to check it yourself. I'll explain what these two files are, why they matter, and how to inspect your own in five minutes, straight from your browser. By the end, you'll know exactly where to look for the problem.
What is robots.txt, and who does it let in?
robots.txt is a simple text file in your website's root that tells automated visitors — crawlers — where they may go and where they may not. It's been around for a long time, originally because of search engines. What's new is that today the crawlers belonging to AI companies read it first too, before downloading a single page. If they hit a block here, the crawler politely turns back — and your content gets left out.
It's worth knowing the most important AI crawlers by name, because the name is exactly what you use to let them in or shut them out:
- GPTBot — OpenAI's crawler, which gathers content to train the models.
- OAI-SearchBot — also from OpenAI, but this one powers ChatGPT's search feature: it's what lets your site appear as a cited source in the answers.
- ClaudeBot — Anthropic's crawler for the Claude models (a separate crawler, Claude-SearchBot, handles the search traffic).
- PerplexityBot — Perplexity's crawler, which by their own account exists specifically to surface you as a source, not to train models.
- Google-Extended — Google's signal token, which lets you separately control whether your content feeds into the Gemini models. One important detail: blocking it does not, per the official documentation, affect traditional Google Search — only generative use.
There's a distinction worth stating plainly here, because a lot of confusion comes from it. Training and answer-citation are not the same thing. The training crawlers (like GPTBot) collect content so the models can learn from your site; the search and answer crawlers (like OAI-SearchBot or PerplexityBot) come to cite your content live, with attribution. Many owners rightly don't want their content used as model training material — but they do want to appear in AI answers. These two can be configured separately: you can block training and allow search. International professional consensus is currently leaning exactly that way.
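To make the "block training, allow search" setup concrete, here is a minimal sketch of what such a robots.txt could look like, checked with Python's standard-library robots.txt parser. The crawler names come from the list above; the address example.com and the helper name is_allowed are placeholders of mine, and whether you want to block Google-Extended at all is your own call, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: training crawlers blocked, search crawlers allowed.
# This is one possible policy, not the only sensible one.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, crawler: str, url: str = "https://example.com/") -> bool:
    """Return True if `crawler` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(crawler, url)


for crawler in ("GPTBot", "Google-Extended", "OAI-SearchBot", "PerplexityBot"):
    status = "allowed" if is_allowed(ROBOTS_TXT, crawler) else "blocked"
    print(f"{crawler}: {status}")
```

Run as written, this should report GPTBot and Google-Extended as blocked and the two search crawlers as allowed. Keep in mind that robots.txt is a request the crawler chooses to honor, so this shows what you are asking for, not what every bot will do.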
The most common culprit is a blanket Disallow: / under User-agent: *, or a bot-protection layer at the hosting provider that shuts out AI crawlers too, often without the owner ever knowing. A frequently cited international survey found that roughly 27% of the B2B software and online-store sites examined were blocking the major AI crawlers this way, without realizing it. That figure comes from a foreign, business-software sample, not Hungarian SMEs, but the phenomenon is surprisingly common here too, and it's insidious precisely because it's invisible.
If something stands in the way of the AI crawlers, every further effort becomes pointless. It doesn't matter how beautiful the site is, or how careful the content: if the gate is closed, the model never reaches the page at all. That's why it's worth starting here, before anyone tackles anything more complicated.
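Here is a small sketch of how a single blanket rule shuts out every AI crawler at once. It uses Python's standard-library robots.txt parser; the function name blocked_crawlers and the sample texts are my own illustration, and the crawler list is simply the names mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from this article; extend the tuple as needed.
AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")


def blocked_crawlers(robots_txt: str, crawlers=AI_CRAWLERS) -> list[str]:
    """List the crawlers that may not fetch the homepage under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, "/")]


# The kind of two-line template that closes the gate for everyone.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

# Same blanket block, but with one search crawler explicitly let back in.
WITH_EXCEPTION = BLANKET_BLOCK + """
User-agent: OAI-SearchBot
Allow: /
"""

print(blocked_crawlers(BLANKET_BLOCK))
print(blocked_crawlers(WITH_EXCEPTION))
print(blocked_crawlers(""))  # an empty robots.txt blocks nobody
```

One caveat: this only reads robots.txt. A bot-protection layer at the hosting provider works at a different level and will not show up in this check, so a clean result here does not rule that out.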
What is llms.txt, and do you need it?
llms.txt is a more recent idea: also a text file in your website's root, but with a different purpose. Where robots.txt tells the crawler where it may not go, llms.txt politely offers the AI the essentials — a clean, organized table of contents showing the site's most important pages and where to find the trustworthy descriptions. It's as if the site handed the guest the table of contents instead of making them hunt along the shelves.
What should it contain? In practice, a short, human-readable list: the name of the business and a one-sentence description, then links to the key pages — services, about, contact, the more important articles — each with a concise explanation. No magic, no code. A well-organized table of contents, seen from the AI's point of view.
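As an illustration, the sketch below assembles such a file from a plain list of pages. The layout (a title line, a one-sentence summary, then sections of links with short notes) follows the public llms.txt proposal as I understand it; since this is a convention rather than a fixed standard, treat the exact formatting as an assumption. The business "Example Bakery", the example.com addresses and the function name build_llms_txt are all made up for the example.

```python
def build_llms_txt(
    name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Assemble llms.txt text: a title, a one-sentence summary, then link lists.

    `sections` maps a section heading to (page title, URL, short note) entries.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # End with exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical business and pages, for illustration only.
LLMS_TXT = build_llms_txt(
    name="Example Bakery",
    summary="Family bakery offering daily fresh bread and local delivery.",
    sections={
        "Key pages": [
            ("Services", "https://example.com/services", "What we bake and deliver."),
            ("About", "https://example.com/about", "Who we are and where we work."),
            ("Contact", "https://example.com/contact", "Address, phone and opening hours."),
        ],
    },
)

print(LLMS_TXT)
```

You can just as well write the file by hand in a text editor; the point of the sketch is only to show how little goes into it. The result would be saved as llms.txt in the site's root, next to robots.txt.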
So do you need it? If your robots.txt is in order and you have half an hour to spare, it's worth it. But the order matters: the gate first, the table of contents only after. An llms.txt is worth nothing if the crawlers can't even get onto the site in the first place.
How to check your own in 5 minutes
You don't need a developer or a paid tool for this. Three steps, from your browser:
- Open your robots.txt file. Type your own address into the browser's address bar, then add /robots.txt to the end, so, for example, yoursite.com/robots.txt. You'll see either plain text or a "no such page" message.
- Look for the blocking lines. Check whether it contains Disallow: / under a User-agent: * line; that locks out everyone. Search for the AI crawlers' names as well (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot): if any of them is followed by Disallow: /, that crawler is blocked. If the file is empty, missing, or contains only Allow lines, that's usually a good sign.
- Run a quick render check. This measures whether your content is visible without running any code, because most AI crawlers don't run JavaScript. Open your homepage, right-click on an empty area, and choose "view page source." If you can find your site's real sentences and text in the page that opens, then the crawler can see them too. But if you mostly see an empty code skeleton with the content nowhere to be found, that's a warning sign.
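If you would rather have the render check done by a script than by eye, the sketch below approximates it: it takes the raw page source, ignores everything inside script and style tags, and looks for a sentence you know should be on the page. It uses only Python's standard library. The class and function names and the two sample pages are my own invention, and this is a rough stand-in for what a crawler that doesn't run JavaScript would see, not a reproduction of any particular crawler.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text from raw HTML, skipping the contents of script and style."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page's text as it appears in the source, whitespace-normalized."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def visible_without_js(html: str, phrase: str) -> bool:
    """True if `phrase` is present in the source text without running any code."""
    return phrase.lower() in visible_text(html).lower()


# Two made-up pages: one with real text in the source, one empty skeleton.
STATIC_PAGE = "<html><body><h1>Fresh bread daily</h1><p>We bake every morning.</p></body></html>"
SKELETON_PAGE = '<html><body><div id="root"></div><script>render("Fresh bread daily")</script></body></html>'

print(visible_without_js(STATIC_PAGE, "Fresh bread daily"))
print(visible_without_js(SKELETON_PAGE, "Fresh bread daily"))
```

The first page passes and the second does not, because in the second the sentence exists only inside a script. To try it on a real site you would first download the page source, for example with urllib.request.urlopen, and pass the decoded text to visible_without_js together with a sentence from your homepage.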
That's the five minutes. Those three steps reveal whether the gate is open and whether the AI can see anything of your site at all. With this, you'll know more about your own situation than many an expensive report would tell you.
It's important, though, to keep this test in perspective and to know what it does and doesn't say. The fact that AI crawlers can get in is only the entry ticket; it's not the same as the AI recommending your business. Recommendation is determined first and foremost by your off-site presence: reviews, independent mentions, appearances in credible sources. Your competitors aren't visible to artificial intelligence because their robots.txt is cleverer; they're visible because their external footprint is bigger. The goal is to build that presence on purpose, where so many today leave it to chance. I write about this in detail in the article "Why a GEO score doesn't equal an AI recommendation," and I lay out the full measurement logic on the methodology page.
So these two files aren't the finish line of the race — they're the starting line. The place where AI visibility can begin — or where it quietly stalls. If you get stuck on the test above, or you're not sure what you're seeing, write to me on the contact page and I'll do a free mini-check of your robots.txt. I'll see whether the gate stands open to the crawlers your customers ask about day in, day out. If you're curious about the difference between traditional search and AI search, the comparison of SEO and GEO gives you the framework, and you can follow the full process on the how it works page.
Frequently asked questions
What is the difference between robots.txt and llms.txt?
robots.txt tells automated crawlers where they may and may not go on your site. llms.txt, by contrast, offers the AI the essence of your website: an organized table of contents of the most important pages. One opens or closes the gate, the other points the way.
How do I know whether my website blocks AI crawlers?
Open your own address in the browser with /robots.txt at the end, for example yoursite.com/robots.txt. Look for Disallow: / lines under a User-agent: *, or behind the AI crawlers' names (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot). If there's a block there, that crawler is shut out.
Is llms.txt mandatory or a standard?
No. llms.txt is an emerging convention, not an official standard, and no major AI provider has confirmed that it relies on it in production. But doing it is cheap and the risk is practically zero, so it's worth creating as a piece of foresight.
If I let the AI crawlers in, will ChatGPT recommend me?
Not necessarily. Letting the crawlers in is only the entry ticket: it's what allows the AI to reach your site at all. Recommendation is determined first and foremost by your off-site presence — reviews, independent mentions, and appearances in credible sources.
Sources
- OpenAI — Bots (GPTBot for training, OAI-SearchBot for search, ChatGPT-User)
- Anthropic / Claude help — ClaudeBot, Claude-User and Claude-SearchBot, and blocking them in robots.txt
- Perplexity — Bots (PerplexityBot serves source attribution, not model training)
- Technologychecker — robots.txt AI-crawler blocking report (~27% of the B2B/online-store sites examined unknowingly block the major AI crawlers)