The robots.txt exclusion standard is a somewhat ambiguous convention, acting as something like the opposite of a sitemap. While it's easy to see the value in telling a web crawler which pages are on your site and how they're structured, it's harder to see the value of telling it where it can't go. Or rather "shouldn't go", since the robots protocol is merely advisory.
While these won't be the pages where you hide your missile launch codes or run your "most dangerous game" side business from, it can be useful to keep search engine robots from crawling pages that could harm your SEO. These can be pages that look spammy to a crawler's algorithm but are useful to the human visitor, such as duplicate-content pages, or simply a page of photos that you don't want indexed and appearing in search results. In truth, there are almost as many reasons to exclude pages from a site's crawl as there are to include them.
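As a quick illustration (the paths and domain here are hypothetical), a robots.txt that asks well-behaved crawlers to skip a duplicate print-friendly section and a photo directory might look like this:

User-agent: *
Disallow: /print/
Disallow: /photos/private/
Allow: /photos/private/cover.jpg
Sitemap: https://www.example.com/sitemap.xml

Keep in mind this only discourages compliant crawlers; the files themselves remain publicly accessible to anyone who requests them directly.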
Test and Maintain your Robots.txt Easily
The only problem is that the robots.txt protocol can be tricky. If misused, it can do a lot more harm than good. This is probably why Google has now updated the testing tool for your robots directives in Google Webmaster Tools.
The updated tool lets you check whether new pages are disallowed, makes the often complex directives easier to understand and troubleshoot, and helps you maintain your robots.txt more easily and with greater accuracy. Read Google's article on the update here for more information.
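If you'd like to sanity-check a directive outside of Webmaster Tools as well, Python's standard library includes a small robots.txt parser. The sketch below is only an illustration; the domain, paths, and user agents are placeholders, not a substitute for Google's own tool:

from urllib.robotparser import RobotFileParser

# Point the parser at the live robots.txt file (URL is a placeholder).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Check whether a given user agent may fetch a given URL.
print(rp.can_fetch("Googlebot", "https://www.example.com/print/page.html"))
print(rp.can_fetch("*", "https://www.example.com/photos/private/cover.jpg"))

Each can_fetch call returns True or False according to the rules the parser found, which makes it easy to spot a directive that blocks more than you intended before it ever affects your rankings.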