02-19-2007, 03:08 PM
haxen
Re: Spider IP Addresses

Pattern matching for robots.txt:

Googlebot understands a limited form of pattern matching in robots.txt.

Matching a sequence of characters using *
You can use an asterisk (*) to match a sequence of characters. For instance, to block access to all subdirectories that begin with private, you could use the following entry:

User-Agent: Googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?), you could use the following entry:

User-agent: *
Disallow: /*?*
Matching the end characters of the URL using $
You can use the $ character to match the end of the URL. For instance, to block any URLs that end with .asp, you could use the following entry:

User-Agent: Googlebot
Disallow: /*.asp$
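
If you want to check a pattern before putting it in robots.txt, here is a rough Python sketch of how the * and $ wildcards could be interpreted. The function names (robots_pattern_to_regex, matches) and the sample paths are made up for illustration; this is an approximation of the behaviour described above, not Googlebot's actual code.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the pattern at the end of the URL; everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if the URL path (including any query string) matches."""
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(matches("/private*/", "/private-files/index.html"))
    print(matches("/*?*", "/page.asp?id=3"))
    print(matches("/*.asp$", "/shop/page.asp?id=3"))
```

Patterns without a trailing $ are treated as prefixes, which is why /private*/ also covers everything below a matching directory.
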
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? line will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ line will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
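
To see how the Allow and Disallow lines interact, here is a small Python sketch. It assumes the most specific (longest) matching pattern wins and that Allow wins a tie, which is how Google documents rule precedence; the helper names (rule_matches, is_allowed) and the sample URLs are hypothetical, and real crawlers may differ in the details.

```python
import re


def rule_matches(pattern, path):
    """Assumed wildcard semantics: '*' = any characters, trailing '$' = end of URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match("^" + body + ("$" if anchored else ""), path) is not None


def is_allowed(rules, path):
    """Decide whether a path may be crawled.

    rules is a list of (directive, pattern) pairs, directive being
    "allow" or "disallow". Assumption: the longest matching pattern wins,
    Allow wins ties, and a path that matches no rule is allowed.
    """
    best_key = None
    best_directive = "allow"
    for directive, pattern in rules:
        if rule_matches(pattern, path):
            key = (len(pattern), directive == "allow")
            if best_key is None or key > best_key:
                best_key = key
                best_directive = directive
    return best_directive == "allow"


# The two rules from the example above.
RULES = [("allow", "/*?$"), ("disallow", "/*?")]

if __name__ == "__main__":
    for url in ("/catalog?", "/catalog?sid=123", "/catalog"):
        print(url, is_allowed(RULES, url))
```

Under these assumptions /catalog? stays crawlable because the Allow pattern is the longer match, while /catalog?sid=123 only matches the Disallow line and is blocked.
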