Search engine robots


Robots, spiders and crawlers are all names for the search engine software that visits your site. These programs can automatically and rapidly fetch material that a site owner may not want anyone to see, and a flood of page requests from them can bring a web site to a standstill. For this reason, most robots abide by the 'Robots Exclusion Standard', a set of rules that constrains their behaviour. The standard lets you define which parts of the site are off limits: you can disallow access to temporary, cgi, private or any other directories, and you can name the specific robot user agents to which each restriction applies

When search engine robots come to your site they look for a special plain text file called robots.txt in the root of your server. The robots.txt file is the active part of the 'Robots Exclusion Protocol'
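
Because the file always lives in the root of the server, its address can be worked out from any page on the site. The following is a minimal Python sketch of that step, not how any particular robot does it; the function name robots_txt_url and the page path in the example are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address on the site mentioned in this article.
print(robots_txt_url("http://www.neurocyber.co.uk/articles/robots.html"))
```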

The configuration of the robots.txt file is quite straightforward: it tells robots which pages they aren't allowed to look at. Each section includes the name of the user agent (robot/spider/crawler) and the paths it can't follow. The original standard has no way to 'allow' a specific directory or to specify a type of file, although some search engines have since added their own extensions for this. What this means is that robots may access any directory that is not explicitly disallowed, i.e. everything not disallowed is OK
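
The 'everything not disallowed is OK' rule can be sketched in a few lines of Python. This is a simplified illustration that assumes plain prefix matching on the path, as in the original standard; the function name is_allowed and the sample paths are hypothetical.

```python
def is_allowed(path: str, disallowed: list[str]) -> bool:
    """Return True unless path begins with one of the disallowed prefixes."""
    # An empty Disallow value blocks nothing, so empty prefixes are skipped.
    return not any(prefix and path.startswith(prefix) for prefix in disallowed)


rules = ["/cgi-bin/", "/private/"]
print(is_allowed("/index.html", rules))       # True: not listed, so it is OK
print(is_allowed("/cgi-bin/form.pl", rules))  # False: inside a disallowed directory
```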

You can usually read this file in a browser by just requesting it from the server, e.g. www.neurocyber.co.uk/robots.txt. You'll see it as a simple text page that is easy to read. This is all documented in the Standard for Robot Exclusion, and all robots should recognise and honour the rules in the robots.txt file

Making a robots.txt file

This is a very simple task. The only tool you'll need is a text editor; Notepad will do if you're using Windows. Below is a typical example of a robots.txt file

# robots.txt for www.yoursite.com

User-agent: *
Disallow: /cgi-bin/
Disallow: /errors/

In this example, all robots can visit every directory except the two mentioned

  1. The first line is a comment (it starts with #) noting which site the robots.txt file is for; robots ignore it
  2. The asterisk (*) in the User-agent field is shorthand for "all robots", so the rules that follow apply to every robot
  3. Disallow: /cgi-bin/ - this means no robot is allowed into the cgi-bin directory
  4. Disallow: /errors/ - to keep robots out of further directories, add each one on its own line; in this case the directory called 'errors' is also off limits to all robots
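
If you would rather generate the file than type it, the steps above can be sketched in Python. This is only an illustration: the function name build_robots_txt is made up, and the layout (a comment line, a blank line, then one record) is an assumption based on the examples in this article.

```python
def build_robots_txt(site: str, disallowed_dirs: list[str], user_agent: str = "*") -> str:
    """Build the text of a one-record robots.txt file."""
    lines = [f"# robots.txt for {site}", "", f"User-agent: {user_agent}"]
    # One Disallow line per directory that robots should stay out of.
    lines += [f"Disallow: {directory}" for directory in disallowed_dirs]
    return "\n".join(lines) + "\n"


text = build_robots_txt("www.yoursite.com", ["/cgi-bin/", "/errors/"])
print(text, end="")
```

To publish the result, save the text as robots.txt in the root of your server.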

In the following example a specific robot is dealt with as well as all robots

# robots.txt for www.yoursite.com

User-agent: WickedRobot
Disallow: /

User-agent: *
Disallow: /cgi-bin/

In this example, WickedRobot is not allowed to see anything: a Disallow value matches every path that begins with it, and every path begins with a slash. The blank line marks the start of a new 'record', i.e. a new set of rules for a user agent. All other robots can see everything except the 'cgi-bin' folder
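
To make the idea of records concrete, here is a rough Python sketch that splits the example above into its records. It is a simplified reader written for illustration, not a complete implementation of the standard: the function name parse_records is hypothetical and only the User-agent and Disallow fields are handled.

```python
def parse_records(text: str) -> list[dict]:
    """Split robots.txt text into records of user agents and disallowed paths."""
    records = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            current = None  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop any comment
        if not line:
            continue  # comment-only line, nothing to record
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field not in ("user-agent", "disallow"):
            continue  # ignore fields this sketch does not know
        if current is None:
            current = {"agents": [], "disallow": []}
            records.append(current)
        key = "agents" if field == "user-agent" else "disallow"
        current[key].append(value)
    return records


EXAMPLE = """\
# robots.txt for www.yoursite.com

User-agent: WickedRobot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

for record in parse_records(EXAMPLE):
    print(record)
```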

The next example shows how to disallow a specific robot from searching a specific file

# robots.txt for www.yoursite.com

User-agent: DangerRobot
Disallow: /cgi-bin/
Disallow: /errors/
Disallow: /private/details.html

User-agent: *
Disallow: /cgi-bin/
Disallow: /errors/

This configuration stops the DangerRobot from visiting the details page in the private directory, the cgi-bin directory and the errors directory. All other robots can see everything except the cgi-bin and errors directories
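
You can check a configuration like this before uploading it. The sketch below uses Python's standard urllib.robotparser module to ask which robot may fetch which page; the helper name may_fetch, the robot name SomeOtherBot and the page index.html are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# robots.txt for www.yoursite.com

User-agent: DangerRobot
Disallow: /cgi-bin/
Disallow: /errors/
Disallow: /private/details.html

User-agent: *
Disallow: /cgi-bin/
Disallow: /errors/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no download


def may_fetch(agent: str, path: str) -> bool:
    """Ask the parsed rules whether agent may request path on the site."""
    return parser.can_fetch(agent, "http://www.yoursite.com" + path)


print(may_fetch("DangerRobot", "/private/details.html"))   # False
print(may_fetch("SomeOtherBot", "/private/details.html"))  # True
print(may_fetch("SomeOtherBot", "/cgi-bin/form.pl"))       # False
print(may_fetch("DangerRobot", "/index.html"))             # True
```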

Using robot meta tags instead

There is another way to tell robots not to index a web page or follow links on it, which may be more helpful in some cases because it works on a page-by-page basis. It involves placing the following meta tag in the head of your page's HTML

<META Name="robots" content="index,follow">

This tells the robot to index that page and follow any links. If you don't want the robot to index your page or follow any links, replace "index,follow" with "noindex,nofollow". If you want the search engine robot to index the page but not follow any links, use "index,nofollow"
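
To see how a robot might read this tag, here is a small Python sketch built on the standard html.parser module. It is an illustration under simple assumptions, not how any real search engine works; the class name RobotsMetaParser and the function name robots_directives are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any robots meta tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META Name="robots" content="noindex,nofollow"></head></html>'
found = robots_directives(page)
print("index page:", "noindex" not in found)     # index page: False
print("follow links:", "nofollow" not in found)  # follow links: False
```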

Search engine robots

Here are two common robots with links to pages with specific information on them

For further information on web robots and the robots.txt Robots Exclusion Standard, as well as articles about writing well-behaved web robots, have a look at Robotstxt.org

Remember that as the Internet grows in size, so does the number of robots, spiders and crawlers that inhabit it. Now more than ever, it's important that a web site has a properly made and implemented 'robots.txt' file

All Rights Reserved ® neurocyber