Robots.txt
Robots.txt is a plain-text file, served from the root of a host at /robots.txt, that gives crawler access rules for parts of a website. It tells compliant crawlers which URL paths they may or may not request.
The file is a crawl control mechanism, not a content quality or indexing system. It can help keep crawlers away from irrelevant paths, but it can also accidentally block important pages if rules are too broad.
Why it matters
Robots rules sit before the page fetch. If an important guide, glossary term, or report is disallowed, a search crawler may not be able to request the page and inspect its content. That weakens SEO discovery and can also reduce the page’s ability to support AI search features that depend on accessible web sources.
Robots.txt is also part of publication hygiene. A static site should leave public content routes crawlable and disallow internal build artifacts, admin paths, raw material, and accidental preview paths wherever those paths exist.
How it differs
Meta robots directives live on a specific page and are seen only after a crawler can fetch that page. Robots.txt works before fetching by controlling crawler access to URL patterns.
Noindex is an indexing directive. Robots.txt is a crawling rule. Blocking a page in robots.txt can prevent crawlers from seeing a page-level noindex directive, which is why robots.txt is not the right tool for normal page removal from search results.
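The conflict described above can be checked mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser`; the function name `hidden_noindex_pages`, the `ExampleBot` user agent, the sample rules, and the page inventory are all hypothetical, and the noindex flags are assumed to come from your own records rather than from a live fetch.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex_pages(robots_text, pages, user_agent="ExampleBot"):
    """Return URLs whose page-level noindex a compliant crawler cannot read.

    `pages` maps each URL to True when that page carries a noindex directive.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [
        url
        for url, has_noindex in pages.items()
        # A disallowed page is never fetched, so its noindex is never seen.
        if has_noindex and not parser.can_fetch(user_agent, url)
    ]


# Hypothetical rules and page inventory, for illustration only.
robots_text = "User-agent: *\nDisallow: /old/\n"
pages = {
    "https://www.example.com/old/retired-guide": True,
    "https://www.example.com/drafts/outline": True,
    "https://www.example.com/glossary/robots-txt": False,
}

print(hidden_noindex_pages(robots_text, pages))
```

Only the page under /old/ is reported: it is both disallowed and marked noindex, so the indexing directive cannot take effect for compliant crawlers until the disallow rule is lifted.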
Example
User-agent: *
Disallow: /raw/
Disallow: /admin/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
This example tells compliant crawlers not to request internal raw and admin paths, while keeping public routes available. The Sitemap line points crawlers toward the canonical sitemap, but it does not override disallow rules.
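To see how a compliant client might read these rules, here is a small sketch using Python's standard `urllib.robotparser`. The `ExampleBot` user agent, the helper `load_rules`, and the sample paths are assumptions made for illustration. Python's parser applies the first matching rule in file order, while some crawlers prefer the longest matching path; for this example both approaches should give the same answers.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this entry, as they would appear in /robots.txt.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /raw/
Disallow: /admin/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
"""


def load_rules(robots_text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(EXAMPLE_ROBOTS)

# Hypothetical paths: one public route and two internal ones.
for path in ("/glossary/robots-txt", "/raw/notes.md", "/admin/login"):
    url = "https://www.example.com" + path
    print(path, rules.can_fetch("ExampleBot", url))

# site_maps() requires Python 3.8 or later.
print(rules.site_maps())
```

The public glossary path is allowed, the /raw/ and /admin/ paths are refused, and the sitemap URL is reported separately from the access rules.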
How teams use it
Teams review robots.txt when launching a site, moving content, opening a staging environment, changing CMS routes, or diagnosing crawl drops. A practical review checks:
| Check | Question |
|---|---|
| Scope | Are rules targeted, or do they block broad public sections? |
| Sensitive paths | Are non-public paths excluded without relying on robots.txt for security? |
| Public content | Are published guides, glossary pages, and tools crawlable? |
| Sitemap | Does the sitemap URL point to the intended public sitemap? |
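Parts of this review can be automated. The sketch below is one possible approach, not a standard tool: `review_robots`, its parameters, the `ExampleBot` user agent, and the sample paths are hypothetical. It covers the public content, sensitive paths, and sitemap checks; judging whether rules are well scoped still needs a human reader.

```python
from urllib.robotparser import RobotFileParser


def review_robots(robots_text, site, public_paths, non_public_paths,
                  expected_sitemap, user_agent="ExampleBot"):
    """Return a list of findings; an empty list means no issue was detected."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    findings = []

    # Public content: every published path should be crawlable.
    for path in public_paths:
        if not parser.can_fetch(user_agent, site + path):
            findings.append(f"public path blocked: {path}")

    # Sensitive paths: these should be disallowed (and protected elsewhere too).
    for path in non_public_paths:
        if parser.can_fetch(user_agent, site + path):
            findings.append(f"non-public path crawlable: {path}")

    # Sitemap: the intended sitemap URL should be listed.
    if expected_sitemap not in (parser.site_maps() or []):
        findings.append(f"sitemap missing: {expected_sitemap}")

    return findings


# A deliberately broad rule: "Disallow: /g" also matches /guides/ and /glossary/.
too_broad = (
    "User-agent: *\n"
    "Disallow: /g\n"
    "Sitemap: https://www.example.com/sitemap.xml\n"
)

for finding in review_robots(
    too_broad,
    site="https://www.example.com",
    public_paths=["/guides/seo-basics", "/glossary/robots-txt", "/tools/"],
    non_public_paths=["/admin/"],
    expected_sitemap="https://www.example.com/sitemap.xml",
):
    print(finding)
```

Running a check like this before a launch or a route change is one way to catch an overly broad prefix before crawlers do.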
Common misunderstanding
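Robots.txt is not access control. It is a crawler instruction for compliant bots, not a security boundary. Disallowing a path does not make it private: the URL stays publicly reachable, and the robots.txt file itself is public, so it can reveal the paths it lists. Private material needs real access control, such as authentication.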