To avoid overloading our servers, we need to manage the way web crawlers/spiders find our pages. Googlebot is of primary concern, because it uses the most resources. However, we don't want to design solely for Google at the expense of other search engines.

Problems we need to fix

Decisions to make

Research to be done

Questions answered:

Proposed solutions

Option 1 – Allow controlled spidering and supplement with sitemaps

  1. Build applications so they can be navigated without sessions, and so robots never see session IDs in URLs. This may not be possible with Struts applications.
  2. Ignore the fact that many pages will be indexed with the non-PURL form of the address. Most search engines update their links often enough that they will have a working address, even if it is not the permanent address.
  3. Open robots.txt to allow everything to be spidered.
  4. When possible/convenient, create sitemaps to ensure every object in a collection is spidered.
  5. Don't allow browse pages to be placed in the index. Add a robots meta tag with "noindex" to these pages. (Links to these pages will still be followed.)
  6. Don't allow links out of detail pages to be followed. Add a robots meta tag with "nofollow" to these pages. This keeps the spider from accidentally falling into a loop (although a loop shouldn't normally happen if we're not displaying session IDs). A sketch of the robots.txt and meta tag settings follows this list.
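
For reference, the pieces called for in steps 3, 5, and 6 might look like the following. This is only a sketch: the "User-agent: *" rule applies to all spiders, an empty Disallow value permits everything, and the comments are labels rather than real page paths.

    # robots.txt (step 3): an empty Disallow permits spidering of everything
    User-agent: *
    Disallow:

    <!-- browse pages (step 5): keep them out of the index; their links are still followed -->
    <meta name="robots" content="noindex">

    <!-- detail pages (step 6): allow indexing, but don't follow outbound links -->
    <meta name="robots" content="nofollow">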

Advantages:

Disadvantages:

Option 2 – Generate static "landing pages"

  1. Pre-generate a static copy of the detail page for each item. Links from the detail page point to the "live" webapp. (A sketch of this step follows the list.)
  2. Place this detail page directly on the PURL server, served by Apache.
  3. Allow web spiders to index content on the PURL server only.
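
A minimal sketch of step 1 is below. The item record, webapp URL, and document root are hypothetical placeholders standing in for our collection database and the PURL server's Apache configuration; it only shows the shape of the generation step.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    /**
     * Sketch of Option 2, step 1: write one static landing page per item into
     * the PURL server's document root.  Item data, URLs, and paths below are
     * placeholders, not our real collection API.
     */
    public class LandingPageGenerator {

        /** Hypothetical item record; real data would come from the collection database. */
        static class Item {
            final String purlId;
            final String title;
            final String webappUrl;
            Item(String purlId, String title, String webappUrl) {
                this.purlId = purlId;
                this.title = title;
                this.webappUrl = webappUrl;
            }
        }

        public static void main(String[] args) throws IOException {
            Item[] items = {
                new Item("item001",
                         "Pacific Shore line at Laguna Beach. Sunday",
                         "http://webapp1.example.edu/detail?id=item001")
            };

            Path docRoot = Paths.get("/var/www/purl/landing"); // assumed Apache document root
            Files.createDirectories(docRoot);

            for (Item item : items) {
                // Static detail page; its links point back at the "live" webapp (step 1).
                String html = "<html><head><title>" + item.title + "</title></head><body>"
                        + "<h1>" + item.title + "</h1>"
                        + "<p><a href=\"" + item.webappUrl + "\">View in the live application</a></p>"
                        + "</body></html>";
                Files.write(docRoot.resolve(item.purlId + ".html"),
                            html.getBytes(StandardCharsets.UTF_8));
            }
        }
    }

Step 3 could then be enforced by giving the webapp servers a robots.txt that disallows everything while the PURL server's robots.txt stays open, so spiders only ever index the static landing pages.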

Advantages:

Disadvantages:

Googlebot

Official Googlebot FAQ
Google Webmaster blog

Do a Google Image Search for "Pacific Shore line at Laguna Beach. Sunday" (with quotes) and you can see that the 600-pixel image is indexed under a webapp1 address, while the 1000-pixel image is indexed under a PURL. It appears that the image search takes its URLs from the URL of the page that contains the image rather than from the image's src attribute (in this case, both images are referenced by PURLs, but the detail page is only referenced by a webapp1 URL).

According to Google, Googlebot is relatively conservative in its handling of robots.txt and meta-tag robot instructions.

Google sitemaps

URLs provided in sitemaps must point to the same server, and redirects aren't allowed. This means that sitemaps won't solve the problem of getting Google to recognize PURLs. The main advantage of sitemaps is ensuring that Google doesn't miss any content in the collection.
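
For reference, a sitemap is just an XML list of URLs in the form below. Because of the same-server rule, the <loc> entries have to be addresses on the server that hosts the sitemap file, so in our case they would be webapp URLs rather than PURLs; the host name and item ID here are placeholders.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://webapp1.example.edu/detail?id=item001</loc>
      </url>
      <!-- one <url> entry per object in the collection -->
    </urlset>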

Articles