Robots.txt Syntax: The Basics
A robots.txt file consists of one or more blocks, each targeting a specific crawler (User-agent) and listing the paths it should or should not access.

    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /
    Sitemap: https://example.com/sitemap.xml

User-agent: * means the rules apply to all crawlers. Use Googlebot for Google-specific rules.
Disallow: tells a crawler not to access paths starting with the specified string.
Allow: overrides a Disallow for a specific sub-path.
Sitemap: tells all crawlers where to find your sitemap. Always include it.
Lines beginning with # are comments. A blank line separates different User-agent blocks.
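To see how a crawler might read the example block above, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name load_rules and the test URLs are illustrative assumptions. Note that Python's parser applies the first rule that matches, while Google applies the most specific (longest) matching rule, so results can differ when Allow and Disallow overlap.

```python
from urllib.robotparser import RobotFileParser

# The example file from the text, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Paths under /admin/ are blocked for every crawler.
print(rules.can_fetch("*", "https://example.com/admin/users"))        # False
# Everything else falls through to "Allow: /".
print(rules.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
# Sitemap lines are collected regardless of the User-agent block.
print(rules.site_maps())  # ['https://example.com/sitemap.xml']
```

site_maps() requires Python 3.8 or newer.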
What to Block in robots.txt
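Block paths that crawlers have no reason to fetch:
• Admin and login pages: /admin/, /wp-admin/, /login/, /dashboard/
• Internal search result pages: /search/, /?s=, /?q=
• URL parameter variants that create duplicate content: /?sort=, /?filter=, /?ref=
• Staging or development subdirectories: /staging/, /dev/
• Private API endpoints: /api/internal/, /api/private/
• Shopping cart and checkout pages (e-commerce): /cart/, /checkout/
• Thank-you and confirmation pages: /thank-you/, /order-confirmation/

Keep in mind that robots.txt controls crawling, not indexing: a blocked URL can still show up in search results if other pages link to it. Also, a rule such as /?sort= only matches URLs that begin with that exact string, so it covers the parameter on the homepage only. To match it on any path, use a wildcard pattern such as /*?sort=.

For WordPress specifically: block /wp-admin/ (but explicitly Allow /wp-admin/admin-ajax.php for AJAX to function), /wp-includes/, and /?s= (site search).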
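A list of blocked paths like the one above can be turned into a robots.txt file with a few lines of Python. This is a minimal sketch: the function build_robots_txt, its parameters, and the example.com sitemap URL are assumptions made for illustration, and the example uses only a subset of the WordPress paths mentioned. Allow lines are written before Disallow lines so that parsers which apply the first matching rule still honor the exception.

```python
def build_robots_txt(disallow, allow=(), sitemap=None, user_agent="*"):
    """Build robots.txt text for a single User-agent block (hypothetical helper)."""
    lines = [f"User-agent: {user_agent}"]
    # Exceptions first, so first-match parsers see them before the broader block.
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Disallow: {path}" for path in disallow]
    if sitemap:
        # A blank line keeps the Sitemap directive visually separate.
        lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


wordpress = build_robots_txt(
    disallow=["/wp-admin/", "/?s="],
    allow=["/wp-admin/admin-ajax.php"],
    sitemap="https://example.com/sitemap.xml",  # placeholder URL
)
print(wordpress)
```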
What NOT to Block (Common Mistakes)
The most dangerous mistake is blocking the entire site with Disallow: /. Under User-agent: *, this one line tells every compliant crawler to stay away from everything on your site. It is the single most common catastrophic robots.txt error.

Do not block CSS and JavaScript files. Old SEO advice said to block /wp-content/plugins/ and /wp-content/themes/, but this is wrong. Google needs to render your pages to understand them, and blocking CSS and JS prevents proper rendering and can hurt rankings.

Do not block your sitemap URL.

Do not use robots.txt for security. The file is public, and any crawler (including malicious ones) can read it and ignore it. For truly private content, use server-side authentication or .htaccess password protection.

Do not block pages you want indexed. This sounds obvious, but it happens when someone copies a robots.txt template that blocks a path their own important content sits under.
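Some of the mistakes above can be caught with a simple check before deploying. The sketch below is a hypothetical linter, not an official tool: the function find_risky_rules and the ASSET_PREFIXES list are assumptions, and it deliberately ignores User-agent grouping and wildcard patterns to stay short.

```python
from urllib.parse import urlparse

# Asset directories the text says should stay crawlable (illustrative list).
ASSET_PREFIXES = ("/wp-content/plugins/", "/wp-content/themes/")


def find_risky_rules(robots_txt: str) -> list[str]:
    """Return warnings for a few common robots.txt mistakes (hypothetical helper)."""
    disallows: list[str] = []
    sitemap_paths: list[str] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "disallow" and value:
            disallows.append(value)
        elif field == "sitemap":
            sitemap_paths.append(urlparse(value).path)

    warnings: list[str] = []
    for path in disallows:
        if path == "/":
            warnings.append("Disallow: / blocks the entire site")
            continue
        # Flag rules that cover, or sit inside, a known asset directory.
        if any(a.startswith(path) or path.startswith(a) for a in ASSET_PREFIXES):
            warnings.append(f"Disallow: {path} may block CSS/JS needed for rendering")
        if any(sm.startswith(path) for sm in sitemap_paths):
            warnings.append(f"Disallow: {path} blocks a listed sitemap")
    return warnings


print(find_risky_rules("User-agent: *\nDisallow: /\n"))
```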
Testing Your robots.txt Before Going Live
Google Search Console has a robots.txt report (under Settings > robots.txt) that shows the robots.txt files Google found for your site, when each was last crawled, and any warnings or errors. It replaced the older robots.txt Tester. To check whether a specific URL is blocked, run it through the URL Inspection tool. Do both after making any changes.

Also verify that your live robots.txt is accessible at https://yourdomain.com/robots.txt and returns a 200 status code, not a redirect or error. A 404 on robots.txt means crawlers will proceed with no restrictions, which is usually fine but leaves any sensitive paths you meant to block open to crawling.
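The status check described above can be scripted. This is a rough sketch using only the standard library: the function names describe_robots_response and check_robots are assumptions, and the wording of the messages is illustrative. check_robots makes a live HTTPS request, so it is defined here but not called.

```python
import urllib.error
import urllib.request


def describe_robots_response(status: int, redirected: bool) -> str:
    """Map a fetch result to a short verdict (hypothetical helper)."""
    if redirected:
        return "redirect: serve robots.txt directly at /robots.txt"
    if status == 200:
        return "ok"
    if status == 404:
        return "missing: crawlers will assume no restrictions"
    return f"unexpected status {status}"


def check_robots(domain: str) -> str:
    """Fetch https://<domain>/robots.txt and describe the result."""
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            # urlopen follows redirects, so compare the final URL to the request.
            return describe_robots_response(resp.status, resp.url != url)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError.
        return describe_robots_response(err.code, False)


print(describe_robots_response(404, False))
```

To check a real site, call check_robots("yourdomain.com") with your own domain.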