Search engine robots

Robots, spiders and crawlers are names for the search engine software that visits your site. These programs can automatically and rapidly fetch material that a site owner may not want anyone to see, and they have the power to bring a web site to a standstill with multiple page requests. For this reason, most robots abide by the 'Robots Exclusion Standard', a set of rules that constrains their behaviour.

This standard allows you to define which parts of the site are off limits. You can disallow access to temporary, cgi, private or any other directories, and you can name specific robot user agents that are to be disallowed.

When search engine robots come to your site they look for a special plain text file in the root of your server called robots.txt. The robots.txt file is the active part of the 'Robots Exclusion Protocol'.

The configuration of the robots.txt file is quite straightforward: it tells robots which pages they aren't allowed to look at. Each section includes the name of the user agent (robot/spider/crawler) and the paths it can't follow. The original standard has no way to 'allow' a specific directory, or to specify a type of file. What this means is that robots may access any directory that is not explicitly disallowed, i.e. everything not disallowed is OK.

You can usually read this file in a browser by just requesting it from the server, e.g. www.neurocyber.co.uk/robots.txt. You'll see it as a simple text page, and it's easy to read.

This is all documented in the Standard for Robot Exclusion, and all robots should recognise and honour the rules in the robots.txt file.

Making a robots.txt file

This is a very simple task. The only tool you'll need is a text editor; Notepad will do if you're using Windows.

A typical robots.txt file contains one record that applies to all robots and lists two disallowed directories, so all robots can visit every directory except the two mentioned.
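A minimal sketch of such a file, checked with Python's standard `urllib.robotparser` module, might look like the following. The directory names `/cgi-bin/` and `/tmp/`, the placeholder site `www.yoursite.com`, the robot name `AnyRobot` and the helper `build_parser` are assumptions made for illustration; they are not taken from the original example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one record for all robots, two disallowed
# directories. The directory names are assumed for illustration only.
ROBOTS_TXT = """\
# robots.txt for www.yoursite.com (placeholder site name)
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
base = "http://www.yoursite.com"

# Anything not explicitly disallowed is OK.
allowed_home = parser.can_fetch("AnyRobot", base + "/index.html")
# Paths under the two listed directories are off limits.
blocked_cgi = parser.can_fetch("AnyRobot", base + "/cgi-bin/form.cgi")
blocked_tmp = parser.can_fetch("AnyRobot", base + "/tmp/draft.html")

print(allowed_home, blocked_cgi, blocked_tmp)
```

Parsing from a string keeps the sketch self-contained; a real robot would fetch the file from the root of the server first.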
In the following example a specific robot is dealt with as well as all robots.

# robots.txt for www.yoursite.com

In this example, the WickedRobot is not allowed to see anything. The slash is shorthand for 'all directories'. The blank line indicates a new 'record', that is, a new user agent command. All other robots can see everything except the 'cgi-bin' folder.

The next example shows how to disallow a specific robot from searching a specific file.

User-agent: DangerRobot

Using robot meta tags instead

There is another way to tell robots not to index a web page or follow links on it, which may be more helpful in some cases, as it can be used more conveniently on a page-by-page basis. It involves placing the following meta tag in the head of your page's HTML:

<META Name="robots" content="index,follow">

The values shown, index and follow, permit both actions. Use noindex to keep the page out of the index and nofollow to stop robots following its links.

Search engine robots

Here are two common robots with links to pages with specific information on them.

For further information on web robots, the robots.txt Robots Exclusion Standard and articles about writing well-behaved web robots, have a look at Robotstxt.org.

Remember that as the Internet grows in size, so does the number of robots, spiders and crawlers that inhabit it. Now, more than ever, it's important that a web site has a properly made and implemented 'robots.txt' file.
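One way to confirm that a robots.txt file is properly made is to test it before publishing. The sketch below rebuilds the two-record example described earlier (WickedRobot barred from everything, all other robots barred only from 'cgi-bin') from its prose description and checks it with Python's standard `urllib.robotparser`. The exact file contents, the robot name `FriendlyRobot` and the page paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the description in the text; the exact lines of the
# original example are an assumption.
ROBOTS_TXT = """\
# robots.txt for www.yoursite.com
User-agent: WickedRobot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.yoursite.com"

# WickedRobot matches the first record, where "/" covers all directories.
wicked_home = parser.can_fetch("WickedRobot", base + "/index.html")

# FriendlyRobot (hypothetical name) falls through to the "*" record.
friendly_home = parser.can_fetch("FriendlyRobot", base + "/index.html")
friendly_cgi = parser.can_fetch("FriendlyRobot", base + "/cgi-bin/search.cgi")

print(wicked_home, friendly_home, friendly_cgi)
```

The blank line between the two records matters here: without it, the second User-agent line would not start a new record.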
All Rights Reserved ® neurocyber