
Robots.txt: a beginner's guide

Parse robots.txt rules and sitemaps

EdgeDNS Team · 8 min read

robots.txt: the file that quietly decides whether Google crawls your site

The `robots.txt` file is a small text file served from the root of a website (`yourdomain.com/robots.txt`) that tells search-engine crawlers which pages they may crawl and which they may not. It is the first file every well-behaved search engine fetches when it visits a domain, and the directives inside it are honored by Google, Bing, Yahoo, DuckDuckGo, and the entire ecosystem of compliant bots. The format is dead simple — a few lines of `User-agent` and `Disallow` rules — but the consequences of getting it wrong are anything but.

You should care because a single typo in `robots.txt` can deindex your entire website. The most famous version of this is the line `Disallow: /`, which is shorthand for "do not crawl any URL on this domain." It is the standard configuration for staging and development environments, where you don't want Google indexing in-progress work. The disaster is when that line survives a launch and accidentally goes live in production. Within hours, every page on your live site disappears from Google. Within days, your organic traffic is zero. Within weeks, your business notices.
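To make that concrete, here is roughly what the two configurations look like side by side. The paths in the second file are illustrative only, not a recommendation for any particular site:

```text
# Staging / development: blocks every compliant crawler from every URL.
# If this survives a launch, the whole site drops out of search results.
User-agent: *
Disallow: /

# A typical production file: crawl everything except a few private areas.
User-agent: *
Disallow: /admin/
Disallow: /cart/
Sitemap: https://yourdomain.com/sitemap.xml
```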

The five things every `robots.txt` audit looks at (a small script that runs all five checks follows the list):

  • Does the file exist? A missing `robots.txt` is fine — search engines treat that as "crawl everything." But many CMSes generate one automatically and you should know what is in it.

  • Is there a global `Disallow: /`? This is the disaster line. Any time it appears in production, every search engine in the world stops crawling.

  • Are important sections accidentally blocked? Watch for blocks on content sections like `/blog` and `/products`, and on paths like `/api`, `/static`, or `/wp-content` that can prevent Google from crawling or rendering pages correctly.

  • Is the sitemap referenced? A `Sitemap:` line in `robots.txt` tells crawlers where to find the XML sitemap.

  • Are there `User-agent` rules for specific crawlers? Many sites block specific bots (like SEO crawlers from competitors or AI training bots) — these rules can quietly evolve and cause unintended consequences.
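Here is a minimal shell sketch of those five checks, assuming `curl` and `grep` are available and using `example.com` as a stand-in for your own domain. The real EdgeDNS endpoint runs these checks (plus scoring) for you:

```bash
#!/usr/bin/env bash
# Quick robots.txt audit sketch: fetch the file and run the five checks above.
DOMAIN="example.com"
ROBOTS=$(curl -s --fail "https://$DOMAIN/robots.txt")

if [ -z "$ROBOTS" ]; then
  echo "1. No robots.txt found (crawlers treat this as 'crawl everything')"
  exit 0
fi
echo "1. robots.txt exists"

# 2. The disaster line: a bare "Disallow: /" blocks the whole site.
echo "$ROBOTS" | grep -qiE '^disallow:[[:space:]]*/[[:space:]]*$' \
  && echo "2. WARNING: global 'Disallow: /' found" \
  || echo "2. No global Disallow"

# 3. Blocks on sections you probably want crawled or rendered.
echo "3. Blocked sections of interest:"
echo "$ROBOTS" | grep -iE '^disallow:.*(blog|products|api|static|wp-content)' || echo "   (none)"

# 4. Sitemap reference.
echo "$ROBOTS" | grep -qi '^sitemap:' && echo "4. Sitemap referenced" || echo "4. No Sitemap line"

# 5. Per-crawler rules beyond the wildcard.
echo "5. Specific user-agents targeted:"
echo "$ROBOTS" | grep -i '^user-agent:' | grep -vi '\*' || echo "   (only *)"
```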

Three questions a `robots.txt` audit answers:

  • Did anything from staging accidentally make it into production and turn off search-engine crawling?

  • Are any of my important content sections accidentally blocked?

  • Is my `robots.txt` doing what I think it is doing, or has it drifted over the years?

The cost of an unchecked `robots.txt` is the most catastrophic SEO failure mode in the entire industry: total invisibility, with no warning signs until traffic has already collapsed. The fix is a thirty-second file edit — but only if you notice. Running this check on a regular schedule, or as part of every launch, is the easiest catastrophe-prevention measure in SEO. The official `robots.txt` documentation is hosted by Google and the protocol itself is described in RFC 9309.

The Robots.txt endpoint, in plain language

In one sentence: parse robots.txt rules and sitemaps.

Fetches and parses the robots.txt file to extract crawler rules, disallowed paths, and sitemap references. It reveals what content is hidden from search engines.

Don't worry if some of the words above are still unfamiliar — there's a plain-language glossary at the bottom of this page, and most of the terms link to their own beginner guides if you want to learn more.

What is actually happening when you call it

Here's what happens behind the scenes when you call this endpoint:

The endpoint retrieves robots.txt from the domain root and parses it per the Robots Exclusion Protocol (the official internet standard, RFC 9309). It extracts User-agent blocks, Allow/Disallow rules, sitemap references, and crawl-delay directives, and highlights commonly interesting disallowed paths (admin panels, APIs, etc.).
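If you want a feel for what that parsing step involves, here is a rough sketch using standard Unix tools. It only mirrors the idea — the endpoint's actual implementation runs server-side — and `example.com` is a placeholder:

```bash
# Rough sketch of the parsing: group Allow/Disallow and Crawl-delay lines
# under the most recent User-agent, and collect Sitemap URLs.
curl -s "https://example.com/robots.txt" | awk '
  {
    sub(/\r$/, "")        # tolerate CRLF line endings
    sub(/#.*/, "")        # strip comments
    key = tolower($1)
  }
  key == "user-agent:"  { ua = $2; print "User-agent group: " ua }
  key == "allow:"       { print "  allow       " $2 }
  key == "disallow:"    { print "  disallow    " $2 }
  key == "crawl-delay:" { print "  crawl-delay " $2 }
  key == "sitemap:"     { print "Sitemap: " $2 }
'
```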

If you're using an AI assistant through MCP, you don't need to understand any of the technical details — the assistant calls the tool and translates the result for you.

Why this specific tool matters

Let's skip the marketing fluff and answer the only question that actually matters: why should you, a real human with a real to-do list, care about the Robots.txt tool? Here's the plain-English version, written the way you'd hear it from a friend who happens to do this for a living.

Robots.txt often reveals hidden directories, admin panels, and API (Application Programming Interface) endpoints that aren't linked publicly. For SEO (Search Engine Optimization), it helps verify that important pages aren't accidentally blocked from search engines.

Picture this in real life. Imagine a penetration tester whose job this week is to discover hidden paths and admin interfaces listed in Disallow rules. Without the right tool, that person would be stuck copy-pasting between five browser tabs, reading documentation written for engineers, and crossing their fingers that the answer they cobble together is correct. With the Robots.txt tool, the same person gets a clear answer in seconds — no spreadsheets, no guessing, no waiting for someone on the infrastructure team to free up.

Three questions this tool answers in plain English. If any of these have ever crossed your mind, the Robots.txt tool is built for you:

  • Are search engines actually able to crawl, understand, and recommend my pages?

  • What is the single biggest fix I could make today to climb in Google?

  • How does my site compare against the technical SEO checklist that the top results all pass?

You can either click the tool and get the answer yourself, or ask your AI assistant — connected through MCP (Model Context Protocol) — to ask the question for you and translate the answer into something you can paste into Slack.

Who gets the most out of this. Marketers, content writers, freelancers running client sites, founders trying to grow without paying for ads, and SEO specialists running monthly health checks. If you see yourself in that list, this is one of the EdgeDNS tools you should bookmark today.

What happens if you skip this entirely. Search engines can quietly stop sending you traffic, and you won't find out until the next quarterly review. That's why running this check — even once a month — is one of the cheapest forms of insurance you can give your domain.

Info:

Available on the free plan. The technical details: `GET /v1/domain/robots`.

When would I actually use this?

If you're still on the fence about whether the Robots.txt tool belongs in your toolbox, this section is for you. Below you'll meet three real people — a penetration tester, an SEO specialist, and a DevOps engineer — facing three real situations where this tool turns a stressful afternoon into a five-minute task. Read whichever story sounds closest to your week.

Story 1: Security Reconnaissance

Imagine you're a penetration tester. Your task this week: discover hidden paths and admin interfaces listed in Disallow rules.

Why it matters: Find additional attack surface not discoverable through crawling.
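As a quick illustration of that workflow, the one-liner below pulls a domain's robots.txt and surfaces paths that tend to interest a tester; `example.com` and the keyword list are just illustrative:

```bash
# Surface Disallow entries that hint at admin panels, APIs, or backups.
curl -s "https://example.com/robots.txt" \
  | grep -iE '^disallow:.*(admin|login|api|backup|private|internal)'
```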

Story 2: SEO Troubleshooting

Imagine you're an SEO specialist. You need to diagnose why certain pages aren't appearing in search results by checking for robots.txt blocks.

Why it matters: Fix accidental search engine blocks hurting organic traffic.

Story 3: Crawler Configuration

Imagine you're a DevOps engineer. You want to verify that robots.txt is properly configured before deploying to production.

Why it matters: Prevent accidental blocking of important pages from search engines.
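A deployment pipeline can automate that check. The snippet below is one way to fail a build when a staging-style robots.txt is about to ship; the file path is an assumption about your project layout:

```bash
#!/usr/bin/env bash
# Pre-deploy gate: refuse to ship a robots.txt that blocks the whole site.
set -euo pipefail

ROBOTS_FILE="public/robots.txt"   # adjust to wherever your build places it

if [ -f "$ROBOTS_FILE" ] && grep -qiE '^disallow:[[:space:]]*/[[:space:]]*$' "$ROBOTS_FILE"; then
  echo "ERROR: $ROBOTS_FILE contains a global 'Disallow: /'. Aborting deploy." >&2
  exit 1
fi

echo "robots.txt looks safe to deploy."
```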

Common situations across teams. Beyond the three stories above, here are the everyday workplace moments when people across the company reach for the Robots.txt tool — or one of the tools right next to it in this category. If any of these are on your calendar this month, that's your sign:

  • Before launching a new page, site, or campaign — to catch the dumb mistakes.

  • During a quarterly SEO health check.

  • When organic traffic suddenly drops and you need to find out why.

  • When pitching a new client and you need an audit deck in under an hour.

If you can see yourself in even one of those bullets, the Robots.txt tool will pay for itself the first time you use it.

Still not sure? Here's the easiest test in the world. Open Claude, ChatGPT, Gemini, or any other AI assistant connected to the EdgeDNS MCP server and ask, in your own words: "Is the Robots.txt tool useful for my job?" The assistant will look at the tool, ask you a couple of follow-up questions about what you're trying to accomplish, and give you a straight answer in plain English. No commitment, no signup forms, no jargon.

The easiest way: just ask your AI assistant

If you've connected the EdgeDNS MCP server to Claude, ChatGPT, Gemini, Cursor, or any other AI assistant, you don't need to write any code. Just ask in plain English:

"Use the Robots.txt tool to check example.com and explain anything that looks wrong in plain language."

The AI will figure out which tool to call, fill in the right parameters, run it, and then explain the result back to you. No copy-pasting between tabs. No reading raw JSON. No memorizing endpoint names.

Tip:

MCP (Model Context Protocol) access is free on every plan, including the free tier. One API key works for both REST and AI — you do not have to choose.

The technical way: call it from code

If you're a developer and want to call the endpoint from a script or your own application, here's the simplest possible example. Replace the placeholder API key with the real one from your dashboard.

```bash
# Replace edns_live_YOUR_KEY with your real API key from the dashboard
curl -H "Authorization: Bearer edns_live_YOUR_KEY" \
  "https://api.edgedns.dev/v1/domain/robots?domain=example.com"
```

What you need to provide

There's just one piece of information you need to provide. The table below explains exactly what it is and what a real value looks like.

| Field | Type | Required? | What it means | Example |
| --- | --- | --- | --- | --- |
| `domain` | string | Yes | The domain to fetch robots.txt from | `example.com` |

What you get back

When you call this tool, you'll get back a JSON object with the fields below. If you're talking to it through an AI assistant, the assistant reads these for you and explains them in plain language — you don't need to memorize them.

| Field | Type | What you'll see in it |
| --- | --- | --- |
| `domain` | string | The queried domain |
| `exists` | boolean | Whether robots.txt exists |
| `fileSize` | number | File size in bytes (`null` if not found) |
| `rules` | array | Parsed rules by user-agent (`userAgent`, `allow`, `disallow`, `crawlDelay`) |
| `sitemaps` | array | Sitemap URLs referenced |
| `interestingPaths` | array | Security-interesting disallowed paths (admin, login, API, etc.) |
| `totalRules` | number | Total number of user-agent rule groups |
| `totalDisallowedPaths` | number | Total number of disallowed paths across all rules |
| `score` | number | robots.txt health score (0-100) |
| `grade` | string | Letter grade (A-F) based on score |
| `scoreDetails` | array | Breakdown of scoring factors |
| `recommendations` | array | Actionable suggestions to improve robots.txt |
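For a scripted health check, you can combine the curl call from earlier with `jq` to pull out just the fields you care about. This sketch assumes the fields sit at the top level of the response, as in the table above, and the API key is a placeholder:

```bash
# Fetch the report and print the headline numbers plus any recommendations.
curl -s -H "Authorization: Bearer edns_live_YOUR_KEY" \
  "https://api.edgedns.dev/v1/domain/robots?domain=example.com" \
  | jq -r '"exists: \(.exists)",
           "score:  \(.score) (grade \(.grade))",
           "disallowed paths: \(.totalDisallowedPaths)",
           "recommendations:",
           (.recommendations[]? | "  - \(.)")'
```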

Words you might be wondering about

If any words on this page felt like jargon, here's a plain-language version. Click any linked term to read a full beginner-friendly guide.

API (Application Programming Interface) — A way for one program to ask another program for something — like a waiter taking your order to the kitchen.

SEO (Search Engine Optimization) — Everything you do to help search engines like Google find, understand, and rank your website.

robots.txt — A simple text file at the root of a website that tells search engine crawlers which pages they're allowed to look at and which to skip.

RFC (Request for Comments) — The official internet standards documents. When someone says 'RFC 8484' they mean a specific numbered standards document — in that case, the one defining DNS over HTTPS.

Need Programmatic Access?

Automate domain intelligence with 100+ API endpoints and a free MCP server for AI integration.