In an attempt to get the Google bots to back off from my Pier website a bit, I was looking for a sitemap generator. Browsing the Pier addons, I found one written by Philippe Marshall.

What is the problem?

The Google bots index every url they can find, and hence they try to index all the Pier urls of your website. But these are dynamic urls containing session information (actually, they point to a continuation). Every time a Google bot comes by, it finds new session urls to follow, so the number of bot visits grows massively over time.

What is the solution?

Blocking dynamic urls in the robots.txt file means that Google can't index the Pier contents at all, unless you provide a sitemap with clean urls. That is just what a sitemap is in this context: an xml file with a collection of urls (see http://www.sitemaps.org). The Pier Sitemap plugin creates such a sitemap for all the Pier contents of your site, respecting your security settings. The plugin had become a bit rusty and dusty, but Philippe revived it over just a few IM sessions. Once you have loaded the plugin, you can request the sitemap xml file like this: http://www.a3aan.st/Sitemap?view=SitemapDownloadPlainView.
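
To make "an xml file with a collection of urls" concrete, here is a minimal Python sketch of what such a file looks like when built by hand. It is not how the Pier plugin works (that is written in Smalltalk); the function name build_sitemap and the example.com urls are made up for illustration, and only the required urlset/url/loc elements from sitemaps.org are emitted.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemaps.org-style XML string listing the given clean urls."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical clean urls, used only for illustration.
pages = ["http://www.example.com/", "http://www.example.com/blog"]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

A real sitemap may also carry optional elements such as lastmod per url; they are left out here to keep the sketch short.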

Ideally, all you have to do now is add the following lines to your robots.txt:

Sitemap: http://......?view=SitemapDownloadPlainView

User-agent: *
Disallow: /*?

The first part tells the bots where your sitemap is and the second part tells them not to touch dynamic urls.

Unfortunately this won't work, because the second part also forbids reading the sitemap: its url contains a question mark. To circumvent that, I added a rewrite rule to the webserver configuration:

RewriteRule ^/sitemap\.xml$ Sitemap?view=SitemapDownloadPlainView [P,L]

Now the lines for robots.txt read:

Sitemap: http://....../sitemap.xml

User-agent: *
Disallow: /*?
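
To see why the plugin url was blocked while /sitemap.xml is not, the Disallow pattern can be tried out in a few lines of Python. This is only a sketch: it assumes Google-style wildcard semantics ('*' matches any run of characters, a trailing '$' anchors the end, otherwise the pattern matches as a prefix), and rule_matches is a made-up helper, not part of any robots.txt library. Python's own urllib.robotparser is not used because it does not interpret such wildcards.

```python
import re

def rule_matches(pattern, path):
    """Check a url path (including query string) against one robots.txt pattern.

    Assumption: Google-style wildcards. '*' matches any run of characters,
    a trailing '$' anchors the end, everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' stood.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

DISALLOW = "/*?"

plugin_url_blocked = rule_matches(DISALLOW, "/Sitemap?view=SitemapDownloadPlainView")
clean_url_blocked = rule_matches(DISALLOW, "/sitemap.xml")
print(plugin_url_blocked, clean_url_blocked)  # prints: True False
```

The plugin url contains a question mark and is therefore caught by the rule; the rewritten url is not.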

Not only Google, but more and more search engine bots can handle sitemaps.

Back to the access log now to check if these bots start to behave or not...
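
For that check, a rough Python sketch like the following could count how often bots still request dynamic urls. It assumes the Apache "combined" log format and treats any user agent containing "bot" as a crawler; the function name count_bot_hits, the sample log lines and the session parameters in them are made up for illustration.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_bot_hits(lines, bot_marker="bot"):
    """Count bot requests, split into dynamic urls (with '?') and clean urls."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or bot_marker not in m.group("agent").lower():
            continue  # unparsable line or not a bot
        kind = "dynamic" if "?" in m.group("path") else "clean"
        counts[kind] += 1
    return counts

# Hypothetical log lines, for illustration only.
sample = [
    '66.249.66.1 - - [04/Oct/2007:01:30:00 +0200] "GET /Blog?_s=abc&_k=def HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [04/Oct/2007:01:31:00 +0200] "GET /sitemap.xml HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [04/Oct/2007:01:32:00 +0200] "GET /Blog HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (X11; Linux) Firefox/2.0"',
]
hits = count_bot_hits(sample)
print(hits["dynamic"], hits["clean"])
```

If the robots.txt rules take effect, the "dynamic" count for well-behaved bots should drop towards zero over time.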

Posted by Adriaan van Os at 4 October 2007, 1:21 am with tags Pier, sitemap, Google, SEO
© 2004-2020 Adriaan van Os  -  Powered by Smalltalk (Squeak/Seaside/Magritte/Pier)  -  Served by Apache  -  Hosted on eComStation