
XML Sitemap and Cache Priming

As a follow-up to my post on configuring your Drupal site for improved performance, in this post I will detail how to go a little further using the XML Sitemap module to create a sitemap, and then using the ITS Scheduling Service to set up a cron job to "prime" the page cache for all pages in the sitemap. This will give anonymous users the benefit of always loading the cached version of a page.

Why Would I Want to Do This?

The first time that Drupal serves up a given page, it has to run many database queries to pull together all of the various components of the page (the menus, blocks, Views, HTML content, and so on). This process can take a good deal of time. If Drupal's page caching is enabled for anonymous users, Drupal will store (or "cache") the rendered HTML content of the page in the database, and serve that cached HTML to subsequent anonymous users. This means one database query, instead of dozens or hundreds.

However, Drupal still must do the heavy lifting of all those database queries the first time a page is requested after its cached copy is missing or has expired. The technique outlined below will ensure that a machine visitor is the one who gets that slow first page load, not a human visitor.

On high-traffic sites, this technique is usually not necessary, as there is enough anonymous traffic to generate cached pages. However, on low-traffic/high-value sites, this technique can offer sizable performance gains.

XML Sitemap is available to all personal, group, and department websites hosted on Stanford Sites.

Configuring XML Sitemap

In this example, I am just going to add the main menu links to the XML sitemap (for brevity). Typically, the main menu contains your most important pages, so they are the ones that you want to be fast all the time. You can use additional XML Sitemap submodules to add users, nodes, taxonomy terms, and custom links to a sitemap as well.

  1. Enable the XML sitemap and XML sitemap menu modules
  2. Go to admin/structure/menu/manage/main-menu/edit to configure the XML sitemap settings for your main menu
  3. Change the Inclusion setting to "Included"
  4. Set the Default priority to 1.0

    XML sitemap settings on main menu config page
  5. Clear all caches
  6. Run cron
  7. Go to mysite.stanford.edu/sitemap.xml to view your XML sitemap

    The XML sitemap

The sitemap alone will provide benefits such as increased visibility in search engines. However, we also are going to leverage the sitemap to make your site always speedy for anonymous visitors.
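Before building on the sitemap, it can help to see exactly which URLs it lists. The following is a minimal sketch, not part of the module: the function name sitemap_urls is my own, and it assumes each page URL sits inside a <loc>...</loc> element, as in the standard sitemaps.org format.

```shell
#!/bin/bash
# Print every URL listed in a sitemap that is read from standard input.
# Assumption: each page URL is wrapped in <loc>...</loc> (sitemaps.org format).
# sitemap_urls is a hypothetical helper name, not something XML Sitemap provides.
sitemap_urls() {
  # grep -o prints each <loc> element on its own line; sed strips the tags.
  grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

# Example (needs network access; replace the hostname with your own site):
# wget --quiet --output-document - https://mysite.stanford.edu/sitemap.xml | sitemap_urls
```

Piping the result through wc -l gives a quick count of how many pages will be primed.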

Create Your Cache Priming Script

The next step is to create a script to anonymously hit all the pages in your sitemap. This will "prime" or "warm" the cache, by forcing Drupal to generate a cached version of each page in the sitemap. We use wget, in part because it does not run JavaScript, so running this script on a regular basis will not artificially inflate stats in Google Analytics, for instance.

  1. In an AFS directory of your choosing, create a shell script with the following content:

    #!/bin/bash
    #
    # Use the sitemap and reload the Page Cache by accessing each page once
    #

    wget --quiet https://mysite.stanford.edu/sitemap.xml --output-document - | grep -E -o "https://mysite.stanford.edu/[^<]*" | wget -q --delete-after -i -

    (Replace mysite.stanford.edu with the URL of your website.)

  2. For this example, I am going to create the script in the "cgi-bin" directory in the "mygroup" AFS space, and name it "cache-prime.sh", so the script will live at /afs/ir/group/mygroup/cgi-bin/cache-prime.sh. This information is important for subsequent steps.
  3. Make the script executable by running:
    chmod 755 /afs/ir/group/mygroup/cgi-bin/cache-prime.sh

    (Replace "group/mygroup" with the path to your AFS group space)

You can run the script manually by entering /afs/ir/group/mygroup/cgi-bin/cache-prime.sh at a command prompt, then hitting Enter.
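If you would like to see what happens to each page while testing by hand, a more talkative variant can help. This is only a sketch under my own assumptions: it uses curl instead of wget, the function names fetch_status and prime_from_list are hypothetical, and it assumes the sitemap wraps each URL in a <loc> element.

```shell
#!/bin/bash
# Fetch one URL anonymously (following redirects) and print only the final HTTP status code.
fetch_status() {
  curl -sL -o /dev/null -w '%{http_code}' "$1"
}

# Read URLs (one per line) on stdin, request each one, and print "<status> <url>".
# Returns 1 if any page answered with something other than 200.
prime_from_list() {
  local url status failed=0
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # skip blank lines
    status=$(fetch_status "$url")
    echo "$status $url"
    [ "$status" = "200" ] || failed=1
  done
  return "$failed"
}

# Only touch the network when a sitemap URL is passed as the first argument.
if [ "$#" -gt 0 ]; then
  curl -s "$1" \
    | grep -o '<loc>[^<]*</loc>' \
    | sed -e 's/<loc>//' -e 's/<\/loc>//' \
    | prime_from_list
fi
```

Saved under a name of your choosing, it would be run with the sitemap URL as its only argument, for example https://mysite.stanford.edu/sitemap.xml.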

Setting Up the Cache Priming Cron Job

We have a sitemap, and a script to hit all the pages in that sitemap. The next step is to set up a recurring job to call our script at regular intervals.

  1. Go to https://tools.stanford.edu/cgi-bin/scheduler and click the "Create New Job" button

    Create a new job on the scheduling service
  2. Command: "/afs/ir/group/mygroup/cgi-bin/cache-prime.sh"
  3. Run this command as this principal: "group-mygroup/cgi"
  4. Make the Job Active? "Yes, execute as directed"
  5. Mail command output? "No, email only errors"
  6. Send email to: Your email address
  7. Description: Drupal cache prime script
  8. Schedule
    • Custom Schedule: Run this command...on a custom schedule
      • Months: Every Month
      • Days each month: Every Day of Month
      • Days each week: Every Day of Week
      • Hours: Choose times that correspond to one-half of your minimum cache lifetime. For example, if your cache lifetime is set to 1 day, choose 0600 and 1800
      • Minutes: OK to leave at :00, or you can tweak this to your liking to not run at the top of the hour

        Schedule the cache priming script to run

If you are familiar with cron syntax, this form is just a GUI to that same functionality.
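For readers who think in cron syntax, here is a rough sketch of how the "one-half of your cache lifetime" rule maps to a crontab-style line. The helper cron_hours is hypothetical, and the printed line is only an illustration of the equivalent schedule; the scheduling service itself is configured through the form above.

```shell
#!/bin/bash
# Print a comma-separated list of hours that fires every (lifetime / 2) hours.
# $1: minimum cache lifetime in whole hours (e.g. 24 for 1 day)
# $2: one hour of the day at which the job should run (0-23), default 0
# cron_hours is a hypothetical helper, not part of any Stanford or Drupal tool.
cron_hours() {
  local lifetime=$1 start=${2:-0}
  local interval=$(( lifetime / 2 ))
  [ "$interval" -ge 1 ] || interval=1   # never go below hourly
  local hours="" h
  for (( h = start % interval; h < 24; h += interval )); do
    hours+="${hours:+,}$h"
  done
  echo "$hours"
}

# Crontab-style line for a 1-day cache lifetime, running at 06:00 and 18:00.
echo "0 $(cron_hours 24 6) * * * /afs/ir/group/mygroup/cgi-bin/cache-prime.sh"
```

The five leading fields are minute, hour, day of month, month, and day of week, which line up with the Minutes, Hours, Days each month, Months, and Days each week settings in the form.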

Is the Page Cached?

If you are using Drupal's core caching functionality, you can determine whether Drupal is serving up the cached version of a page by checking for the X-Drupal-Cache: HIT HTTP header.

OK, how do I do that?

  • You can use the Live HTTP Headers Firefox plugin
  • You can use a one-line bash script like this:
    #!/bin/bash

    curl -fsIL "$1" 2>&1 | grep -q -i -m 1 "X-Drupal-Cache: HIT" && echo "Yes, this page appears to be cached." || echo "No, this page does not appear to be cached."

    I have this saved as a file named is-cached at the root of my hard drive, so I just run:

    $ /is-cached https://swsblog.stanford.edu
    Yes, this page appears to be cached.
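To check a whole sitemap rather than a single page, the same header test can be wrapped in a loop. This is a sketch only: the function names drupal_cache_status and report_cache_status are my own, it assumes Drupal's core page cache (which sends the X-Drupal-Cache header), and it assumes the sitemap wraps each URL in a <loc> element.

```shell
#!/bin/bash
# Read HTTP response headers on stdin and print the X-Drupal-Cache value
# (e.g. HIT or MISS), or "none" if the header is absent.
# When redirects were followed, the value from the last response is used.
drupal_cache_status() {
  local value
  value=$(tr -d '\r' | grep -i '^x-drupal-cache:' | tail -n 1 | cut -d: -f2 | tr -d ' ')
  echo "${value:-none}"
}

# Read URLs (one per line) on stdin and print "<status><TAB><url>" for each.
report_cache_status() {
  local url
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # skip blank lines
    printf '%s\t%s\n' "$(curl -sIL "$url" | drupal_cache_status)" "$url"
  done
}

# Only touch the network when a sitemap URL is passed as the first argument.
if [ "$#" -gt 0 ]; then
  curl -s "$1" \
    | grep -o '<loc>[^<]*</loc>' \
    | sed -e 's/<loc>//' -e 's/<\/loc>//' \
    | report_cache_status
fi
```

Running it right after the cache priming script should show HIT for every page; a MISS or none points at a page that is not being cached.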
