Harnessing the Power of Robots.txt
Once we have a website up and running, we need to make
sure that all visiting search engines can access all the pages
we want them to look at.
Sometimes, we may want search engines not to index
certain parts of the site, or even to ban some search engines from the site
altogether. This is where a simple little two-line text file
called robots.txt comes in.
Robots.txt resides in your website's main directory (on many Linux
hosting accounts this is your /public_html/ directory), and looks
something like the following:

    User-agent: *
    Disallow:

The first line names the “bot” the rule applies to (an asterisk
matches every bot). The second line controls whether they are
allowed in, or which parts of the site they are not allowed to
visit. An empty Disallow value blocks nothing.
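To see how a bot reads these two lines, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the bot name examplebot, and the /private/ directory are assumptions made up for illustration; they are not part of any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# An empty Disallow value blocks nothing.
allow_all = "User-agent: *\nDisallow:\n"

# Hypothetical rule: keep every bot out of one directory.
block_private = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(allow_all, "examplebot", "/private/notes.html"))      # True
print(is_allowed(block_private, "examplebot", "/index.html"))          # True
print(is_allowed(block_private, "examplebot", "/private/notes.html"))  # False
```

A Disallow value is matched as a path prefix, so /private/ covers every page beneath that directory while leaving the rest of the site open.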
If you want to handle multiple “bots”, simply repeat
the above lines for each one. For example:

    User-agent: googlebot
    Disallow:

    User-agent: askjeeves
    Disallow: /

This will allow Google (user-agent name GoogleBot) to visit every
page and directory, while at the same time banning Ask Jeeves from
the site completely.
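The two-bot file described above can be checked the same way before you upload it. This is a rough sketch using Python's standard urllib.robotparser module. The page path /products/index.html is a hypothetical example, and the user-agent names are taken as given in this article.

```python
from urllib.robotparser import RobotFileParser

# The two-bot example: Google may go anywhere, Ask Jeeves nowhere.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow:

User-agent: askjeeves
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser whether each bot may fetch a (hypothetical) page.
results = {
    agent: parser.can_fetch(agent, "/products/index.html")
    for agent in ("GoogleBot", "askjeeves")
}
for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'banned'}")
```

This parser compares user-agent names without regard to case, which is why GoogleBot matches the googlebot entry.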
To find a “reasonably” up-to-date list of robot user-agent names,
visit http://www.robotstxt.org/wc/active/html/index.html

Even if you want to allow every robot to index every page of your
site, it’s still advisable to put a robots.txt file on
your site. It will stop your error logs from filling up with entries
from search engines trying to access a robots.txt file that
doesn’t exist.