When we launch a site at Stanford Web Services, we open the doors and roll out the red carpet for the search engines to index the site. Before launch, however, we like to keep the content under wraps and ask the search engines not to index the site. To do this, we use a module called Stanford Metatag NoBots.
Using the Stanford Metatag NoBots module
After installation, you only need to enable the module. While it is enabled, it asks search engines not to index your site. To allow the search engines to index your site, disable the module.
To learn more about using this module and tips on checking your site to see what the search engines see, check out the readme.md at:
How it works (the technical stuff)
When a search engine crawler (or "robot") crawls a site, it sends information about which robot it is, in the form of a user agent string. So Google's crawler will send a user agent string containing "Googlebot", Bing's will send one containing "bingbot", etc. The Context User Agent module allows Drupal administrators to set up conditions based on a given user agent string.
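As a rough illustration of matching on the user agent string, here is a minimal Python sketch. It is not how the Context User Agent module is implemented (that module is Drupal configuration, not Python); the ROBOT_TOKENS list and the is_robot function are hypothetical names made up for this example.

```python
# Hypothetical list of lowercase tokens to look for. Real crawlers send
# longer user agent strings that contain these tokens somewhere inside.
ROBOT_TOKENS = ("googlebot", "bingbot")


def is_robot(user_agent: str) -> bool:
    """Return True if the user agent string contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in ROBOT_TOKENS)


if __name__ == "__main__":
    crawler = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    browser = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0"
    print(is_robot(crawler))
    print(is_robot(browser))
```

Matching on a substring rather than the whole string matters here, because crawlers typically wrap their name in a longer, browser-like user agent string.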
When a web server responds to a request for a page, it sends various HTTP headers in its response. One of the (optional) headers is the X-Robots-Tag: noindex,nofollow,noarchive header, which (in short) tells search engine robots, "Hey, don't index or archive this page, and don't follow any links on it either". (Learn more about the Robots Exclusion Standard on Wikipedia.)
We configured the Context module, along with the Context HTTP Header and Context Useragent modules, to identify when the user agent string belongs to a robot and to react by sending the X-Robots-Tag HTTP header. The Stanford Metatag NoBots module captures this configuration as a Feature.
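To make the condition-and-reaction idea concrete, here is a small Python sketch written as WSGI middleware. It only mirrors the behavior described above and is not the module's actual code, which is a Drupal Feature built from Context configuration. The names nobots_middleware, demo_app, ROBOT_TOKENS, and the enabled flag are assumptions made for this example.

```python
# Assumed crawler tokens for illustration; not the module's actual list.
ROBOT_TOKENS = ("googlebot", "bingbot")
NOBOTS_HEADER = ("X-Robots-Tag", "noindex,nofollow,noarchive")


def nobots_middleware(app, enabled=True):
    """Wrap a WSGI app so responses to known robots carry X-Robots-Tag."""

    def wrapped(environ, start_response):
        # Condition: does the user agent string look like a robot?
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        is_robot = any(token in user_agent for token in ROBOT_TOKENS)

        def patched_start_response(status, headers, exc_info=None):
            # Reaction: append the header only for robots, and only
            # while the (hypothetical) "enabled" switch is on.
            if enabled and is_robot:
                headers = list(headers) + [NOBOTS_HEADER]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """A tiny stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]
```

In this sketch, passing enabled=False plays the role of disabling the module at launch: the wrapped application then responds to robots exactly as it does to everyone else.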
Why a new module?
We started out using the Metatag module from Drupal.org. With this module you can prevent the search engines from indexing the pages on a website. This module works great at stopping the search engines. In fact, it works too well.
When we went to launch a site, some pages remained blocked even though we had changed the settings to allow indexing. Although we had unchecked the box, meta tag information for some pages had already been saved to the database and did not revert as expected.