
Module of the Day: Stanford MetaTag NoBots - Hide your site from search engines!

When we launch a site at Stanford Web Services, we open the doors and roll out the red carpet for the search engines to index the site. However, before launch we like to keep the content under wraps and ask the search engines not to index the site. To do this, we use a module called Stanford MetaTag NoBots.

Using the Stanford Metatag NoBots module

There is nothing to configure: once installed, you simply enable the module to stop search engines from indexing your site, and disable it when you are ready to let them index the site again.
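
If you script your launches (in an update hook or a deployment script, for example), the same toggle can be done with Drupal 7's module API. This is only a sketch of that idea, not part of the module itself; the machine name comes from the GitHub project.

// Pre-launch: enable the module so search engines are asked not to index the site.
module_enable(array('stanford_metatag_nobots'));

// At launch: disable it to let the search engines back in.
module_disable(array('stanford_metatag_nobots'));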


If your site is hosted on Stanford Sites, this module is already available; you just need to enable it. Otherwise, you can find it on GitHub at:

https://github.com/SU-SWS/stanford_metatag_nobots

To learn more about using this module, and for tips on checking your site to see what the search engines see, check out the README.md at:

https://github.com/SU-SWS/stanford_metatag_nobots/blob/7.x-3.x-dev/README.md

How it works (the technical stuff)

When a search engine crawler (or "robot") crawls a site, it sends information about which robot it is, in the form of a user agent string. Google's crawler sends the "Googlebot" user agent string, Bing's crawler sends "bingbot", and so on. The Context User Agent module allows Drupal administrators to set up conditions based on a given user agent string.

When a web server responds to a request for a page, it sends various HTTP headers in its response. One of the (optional) headers is the X-Robots-Tag: noindex,nofollow,noarchive header, which (in short) tells search engine robots, "Hey, don't index or archive this page, and don't follow any links on it either". (Learn more about the Robots Exclusion Standard on Wikipedia.)

We configured the Context module, along with the Context HTTP Header and Context Useragent modules, to identify when the user agent string belongs to a robot and respond with the X-Robots-Tag HTTP header. The Stanford Metatag NoBots module captures this configuration as a Feature.
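
In plain Drupal 7 API terms, the behavior this configuration produces looks roughly like the sketch below. It is an illustration only, not the module's actual code; the hook implementation name and the list of user agent substrings are examples.

/**
 * Implements hook_init() for a hypothetical module named "example".
 *
 * If the request's user agent string looks like a known crawler, add the
 * "X-Robots-Tag: noindex,nofollow,noarchive" header to the response.
 */
function example_init() {
  $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  // Example substrings only; the real configuration matches many more robots.
  $robots = array('Googlebot', 'bingbot', 'Slurp', 'Baiduspider', 'YandexBot');
  foreach ($robots as $robot) {
    if (stripos($user_agent, $robot) !== FALSE) {
      drupal_add_http_header('X-Robots-Tag', 'noindex,nofollow,noarchive');
      break;
    }
  }
}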


[Screenshot illustrating the Context configuration]


Why a new module?

We started out using the Metatag module from Drupal.org, which lets you prevent the search engines from indexing the pages on a website. This module works great at stopping the search engines. In fact, it works too well.

When we went to launch a site and changed the settings to allow indexing, some pages still remained blocked. Even though we had unchecked the box, the meta tag information for those pages had been saved to the database and did not revert as expected.
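
If you run into the same problem, one way to check whether stale per-page meta tags are still stored is to query the Metatag module's storage directly. The sketch below assumes the Drupal 7 Metatag module's default {metatag} table and is only a diagnostic aid; back up the database before changing anything.

// Hypothetical diagnostic snippet (run it with something like drush php-script):
// list the entities that still have per-page meta tag overrides saved by the
// Drupal 7 Metatag module, assuming its default {metatag} table.
$result = db_query('SELECT entity_type, entity_id FROM {metatag}');
foreach ($result as $row) {
  drupal_set_message(t('Saved meta tags found for @type @id', array(
    '@type' => $row->entity_type,
    '@id' => $row->entity_id,
  )));
}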

Acknowledgements

I’d like to extend a tip of my hat to Shea McKinney and John Bickar for developing and documenting this module, and to John Bickar for contributing to this post.


Comments

I'm not sure that a Drupal module should be 'branded' in such a way. What is wrong with metatag_nobots?

Thanks for publicizing the use of X-Robots-Tag headers. They are underused.
We set these headers all the time for dev and stage sites, but we leave our Drupal instances lean and instead configure the extra headers in nginx. It would be relatively easy to code up a dashboard where this is exposed as clickable per-site settings; it's just that we prefer the command line. Anyway, my point is: less fat in the PHP layer is a good thing, and Apache or nginx can do the job easily and quickly.
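
For reference, a minimal sketch of that nginx approach; the server name is a placeholder and the rest of the server block is omitted.

# Dev/stage server block: send the same X-Robots-Tag header from nginx
# instead of from Drupal.
server {
  server_name dev.example.edu;
  add_header X-Robots-Tag "noindex, nofollow, noarchive";
  # ... the rest of the site configuration ...
}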

Why don't you publish this feature module on D.O?

We probably should. We'll look into doing that.

I personally use https://www.drupal.org/project/robotstxt and change the config to:

User-agent: *
Disallow: /

When the site goes live, either disable the module and replace the robots.txt file, or put the standard config back into the module. It seems simpler than adding additional meta tags on each page.

Google can still index a site even with "Disallow: /" in robots.txt.

We use HTTP Basic Authentication to prevent search engines (or indeed any prying eyes) from seeing our pre-launch and staging sites.

Exactly how you manage the settings depends on your deployment process; we put them into the vhosts on the server (since live and staging use different vhosts).
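
As a minimal sketch (Apache shown; the paths and names are placeholders, so adapt it to your server):

# Staging vhost protected with HTTP Basic Authentication.
<VirtualHost *:80>
  ServerName stage.example.edu
  DocumentRoot /var/www/stage

  <Directory /var/www/stage>
    AuthType Basic
    AuthName "Staging site"
    AuthUserFile /etc/apache2/stage.htpasswd
    Require valid-user
  </Directory>
</VirtualHost>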

How does this handle page caching for anonymous users?

Looks like it doesn't send the X-Robots-Tag header when Drupal core page caching is enabled. In practice, for us, the two are mutually exclusive: page caching is enabled on production sites, which are already open to search engine robots.