Archived

This topic is now archived and is closed to further replies.

mhormann

Help on 'robots.txt'

Hi everybody.

 

I've seen a lot of 'bad' (i.e. non-working) 'robots.txt' files lately, so I want to give a few tips.

 

'robots.txt' is a file that 'well-behaved' (not all!) spiders and search engines respect and check for. It is used to specify what you DON'T want the spiders to see.

 

Here are some guidelines:

 

1. The filename MUST be 'robots.txt'.

EXACTLY that. All lowercase, plural (i.e., NOT 'robot.txt' or 'Robots.txt').

 

2. The file MUST have 'Linux-type' line endings (i.e. a linefeed, "\n").

NOT Mac (CR), NOT DOS (CRLF). If you work on either of these, use an editor that lets you save in 'UNIX mode'.
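
If your editor cannot save in 'UNIX mode', the conversion can also be done afterwards. A minimal sketch in Python, assuming you have the file contents as bytes; the function name is made up for illustration:

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Convert DOS (CRLF) and old Mac (CR) line endings to LF."""
    # Replace CRLF first so the lone-CR pass does not double the line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


dos_file = b"User-agent: *\r\nDisallow: /cgi-bin/\r\n"
print(to_unix_line_endings(dos_file))
```

Read and write the file in binary mode ('rb' / 'wb') so Python does not translate the line endings behind your back.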

 

3. The file MUST be in your web root.

If you put it in, say, '/catalog/', no bot will ever see or check it. ALWAYS put it in your web root, i.e. "http://www.mydomain.com/robots.txt".
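
To see where a spider will look, whatever page it starts from, here is a small sketch that derives the 'robots.txt' address from any URL on a site. The function name and the sample shop URL are only illustrative:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.mydomain.com/catalog/index.php?cPath=1"))
```

Whatever the path of the page, the result points at the web root, never at '/catalog/robots.txt'.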

 

4. Comment lines.

You CAN have comment lines. They start with a '#' in column ONE. Be careful NOT to separate too much using empty lines (see rule 9)!

   # BAD: Not starting in column 1.

# GOOD: Start in column 1.

 

In theory, it IS allowed to put comments on a 'Disallow' or 'User-agent' line, like 'User-agent: Googlebot #this is Google'. DON'T DO IT! It is bad practice, and some spiders will misinterpret it and spider what they weren't supposed to.

# BAD: Comments on the same line.
User-agent: Googlebot # Google

# GOOD: Comments on separate lines.
# Google
User-agent: Googlebot

 

5. White space.

In theory, it IS allowed to use white space (i.e., empty lines or indentation by blanks or tabs). DON'T DO IT! Some spiders will misinterpret it. ALWAYS start comments, 'User-agent:' and 'Disallow:' in column 1. And use blanks, NOT tabs.

# BAD: Indentation (not starting in column 1).
  User-agent: Googlebot
  Disallow:    /admin/

# GOOD: Start in column 1, use ONE blank after the ':'.
User-agent: Googlebot
Disallow: /admin/
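
Rules 4 and 5 are easy to check mechanically. The following is only a sketch of such a check, not a full validator; the function name and the message wording are invented for this example:

```python
def lint_layout(text: str) -> list:
    """Flag tabs, indentation and same-line comments in a robots.txt text."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # blank lines are a separate matter (rule 9)
        if "\t" in line:
            problems.append(f"line {number}: contains a tab")
        if line[0] in " \t":
            problems.append(f"line {number}: does not start in column 1")
        # A '#' on a line that is not itself a comment line is a trailing comment.
        if not line.lstrip().startswith("#") and "#" in line:
            problems.append(f"line {number}: comment on a rule line")
    return problems


sample = "  User-agent: Googlebot\nDisallow:\t/admin/\nUser-agent: Googlebot # Google"
for problem in lint_layout(sample):
    print(problem)
```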

 

6. User-agent: [spider's name]

Type it EXACTLY like this. It names the spider that the following 'Disallow:' lines apply to. You CAN put '*' to target ALL spiders.

# BAD: all lowercase
user-agent: *
# BAD: all uppercase
USER-AGENT: *
# BAD: no '-', 'Agent' has uppercase 'A'
User Agent: *

# GOOD: (Google)
User-agent: Googlebot
# GOOD: (ALL spiders)
User-agent: *
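
A quick way to catch misspelled field names before a spider does. This sketch is deliberately as strict as the rule above (many parsers are more forgiving about case); the names used here are hypothetical:

```python
KNOWN_FIELDS = ("User-agent", "Disallow")


def check_field_names(text: str) -> list:
    """Report lines whose field name is not spelled exactly as expected."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comment lines
        name, colon, _value = line.partition(":")
        if not colon:
            problems.append(f"line {number}: no ':' found")
        elif name not in KNOWN_FIELDS:
            problems.append(f"line {number}: unexpected field name {name!r}")
    return problems


print(check_field_names("user-agent: *\nUSER-AGENT: *\nUser Agent: *\nUser-agent: *"))
```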

 

7. Disallow: [path/filename to exclude]

Use ONLY ONE path and/or filename per line, i.e. NOT "Disallow: /cgi-bin /stats"!

# BAD: Multiple paths/files on one line
Disallow: /cgi-bin /stats

# GOOD: One definition per line
Disallow: /cgi-bin
Disallow: /stats

 

Disallow works like an 'automatic wildcard' (without '?' or '*') by matching from the left, i.e. "Disallow: /help" would match the DIRECTORY "/help", the directory "/helpfiles", the FILE "/help.htm", the FILE "/helpfile.php" and so on.

 

So if you want to exclude a complete directory but NOT files with a similar name (i.e. you want to exclude the '/catalog/elmar/' directory but NOT '/catalog/elmar_start.php'), it is good practice to write it like "Disallow: /catalog/elmar/" (with the ending '/').

# GOOD: Disallow directory 'elmar' but not 'elmar_start.php'
Disallow: /catalog/elmar/
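
The 'matching from the left' behaviour is nothing more than a prefix comparison. A minimal sketch, assuming plain prefix matching with no wildcards; the function name is made up:

```python
def is_disallowed(path: str, disallow_rules: list) -> bool:
    """True if any non-empty Disallow value is a prefix of path."""
    # An empty value ("Disallow:") excludes nothing, so it is skipped.
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


for path in ("/help", "/helpfiles/", "/help.htm", "/helpfile.php", "/about.htm"):
    print(path, is_disallowed(path, ["/help"]))

# With the trailing '/', only the directory is excluded.
print(is_disallowed("/catalog/elmar/index.php", ["/catalog/elmar/"]))
print(is_disallowed("/catalog/elmar_start.php", ["/catalog/elmar/"]))
```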

 

8. There is NO 'Allow:'!

If you want to allow anything, you must disallow the rest and put an empty 'Disallow:' at the end!

# BAD: (intended: disallow 'Jane' but not 'John')
Disallow: /Jane
Allow: /John

# GOOD: (disallow 'Jane', allow all the rest)
Disallow: /Jane
Disallow:

 

9. NEVER have a BLANK LINE BETWEEN 'User-agent:' and its corresponding 'Disallow:' lines!

Some spiders will misinterpret this as permission to spider your whole site. You CAN have comment lines in between.

# BAD: Blank line between 'User-agent:' and 'Disallow:'
# This should exclude Google
User-agent: Googlebot

# And here we say which to exclude
Disallow: /
# Result: Some spiders will instead assume they're ALLOWED your whole site!

# GOOD: NO blank lines between 'User-agent:' and corresponding 'Disallow:'
# This should exclude Google
User-agent: Googlebot
# And here we say which to exclude
Disallow: /
# Result: Google will be kept from spidering your whole site.
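
This mistake is hard to spot by eye, so here is a sketch of a check that reports any 'User-agent:' line that runs into a blank line before its first 'Disallow:'. The function name is invented, and the check ignores everything else in the file:

```python
def find_orphaned_user_agents(text: str) -> list:
    """Return line numbers of User-agent lines cut off from their Disallow lines."""
    orphaned = []
    pending = None  # number of a User-agent line still waiting for its rules
    for number, line in enumerate(text.split("\n"), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # comment lines in between are fine
        lowered = stripped.lower()
        if lowered.startswith("user-agent:"):
            pending = number
        elif lowered.startswith("disallow:"):
            pending = None
        elif not stripped and pending is not None:
            orphaned.append(pending)
            pending = None
    return orphaned


bad = "User-agent: Googlebot\n\n# And here we say which to exclude\nDisallow: /"
good = "User-agent: Googlebot\n# And here we say which to exclude\nDisallow: /"
print(find_orphaned_user_agents(bad))
print(find_orphaned_user_agents(good))
```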

 

10. Always go from 'more specific' to 'less specific'!

Start with the most specific rules, then go to the least specific. This means the part for 'User-agent: *' should come LAST in your 'robots.txt'! The reason: if a spider sees 'User-agent: *' FIRST, it might stop scanning there, since it is one of 'all spiders', and never look through the rest of your file to see whether it is specifically addressed elsewhere!

# BAD: Spider might not honor this
# Allow everything to all other spiders
User-agent: *
Disallow: 

# Disallow Google
User-agent: Googlebot
Disallow: /

# GOOD: First do the specifics, then the 'rest of them'
# Disallow Google
User-agent: Googlebot
Disallow: /

# Allow everything to all other spiders
User-agent: *
Disallow:
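
You can try the GOOD layout against a real parser. This sketch uses the 'urllib.robotparser' module from Python's standard library; the bot name 'SomeOtherBot' and the URL are just placeholders. It only shows how this one parser reads the file and says nothing about how less careful spiders behave:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Disallow Google
User-agent: Googlebot
Disallow: /

# Allow everything to all other spiders
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot is addressed specifically and shut out; everyone else falls
# through to the 'User-agent: *' part at the end.
print(parser.can_fetch("Googlebot", "http://www.mydomain.com/catalog/"))
print(parser.can_fetch("SomeOtherBot", "http://www.mydomain.com/catalog/"))
```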

 

11. Use a 'robots.txt' validator.

Anyone can make mistakes. It's good practice to check using a validator.

 

Here's a good one (has some examples even):

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

 

And here's one that checks on even more potential problems:

http://tool.motoricerca.info/robots-checker.phtml

 

12. One more tip: Search engines get clever.

If you really have to run a lot of sites... Hey, comparing their 'robots.txt' files is FAST and makes it VERY easy for SEs to find if they're all the same... and so they start assuming they get tricked and rank you down... ;-)

 

Here's an example 'robots.txt':

# osCommerce robots.txt

# Currently disallow all shop stuff to the Google Image bot
# Mainly image hunters anyway, they eat up bandwidth...
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/admin/
Disallow: /catalog/download/
Disallow: /catalog/elmar/
Disallow: /catalog/pub/
Disallow: /catalog/account.php
Disallow: /catalog/advanced_search.php
Disallow: /catalog/checkout_shipping.php
Disallow: /catalog/create_account.php
Disallow: /catalog/login.php
Disallow: /catalog/password_forgotten.php
Disallow: /catalog/popup_image.php
Disallow: /catalog/shopping_cart.php

 

Have fun! And happy 'spidering'...

Matthias


I don't want to set the world on fire—I just want to start a flame in your heart.

 

osCommerce Contributions:

Class cc_show() v1.0 – Show Credit Cards, Gateways

More Product Weight v1.0


Two more tips here, since this darned forum only lets me edit my post twice...

 

13. Always start from the 'base' path—reduce ambiguity.

Some spiders would probably do what you want when specifying things like

# BAD: Ambiguous
User-agent: *
Disallow: secret.php

Some would assume it means 'secret.php' in every directory, some would ignore it, some would only compare it to '/secret.php' ...

ALWAYS be specific and start at the web root, i.e. with '/'!

# GOOD: Always start at your web root
User-agent: *
Disallow: /secret.php
Disallow: /catalog/secret.php
Disallow: /catalog/admin/secret.php
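
Paths that do not start at the web root are easy to find automatically. A small sketch (the function name is hypothetical) that lists every 'Disallow:' value not beginning with '/':

```python
def find_relative_disallows(text: str) -> list:
    """Return (line number, path) pairs for Disallow values not starting with '/'."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        name, colon, value = line.partition(":")
        if not colon or name.strip().lower() != "disallow":
            continue
        path = value.strip()
        # An empty value is the legitimate 'exclude nothing' form, so skip it.
        if path and not path.startswith("/"):
            problems.append((number, path))
    return problems


print(find_relative_disallows("User-agent: *\nDisallow: secret.php\nDisallow: /catalog/secret.php"))
```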

 

14. Google's new 'wildcard' exclusion system

Google now allows 'wildcards' to be specified like '*.cgi'. DO NOT assume this will work with any other spider! Try to keep it as simple as possible, using rules that are easy to understand for every spider.

If you really want to target special functions for special spiders, ALWAYS target them specifically, i.e. use a separate 'User-agent:' part.

# BAD: Assuming they all understand it
User-agent: *
Disallow: *.cgi

# GOOD: If you have to... address it specifically
User-agent: Googlebot
Disallow: *.cgi

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/secret.cgi
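
To make sure no wildcard pattern has slipped into the part meant for ALL spiders, something like the following sketch can help. It only looks for a '*' inside 'Disallow:' values that belong to a 'User-agent: *' part; the function name is made up:

```python
def wildcards_in_catch_all(text: str) -> list:
    """Return (line number, value) for wildcard Disallow lines under 'User-agent: *'."""
    flagged = []
    agents = []             # user agents of the part currently being read
    reading_agents = False  # True while consecutive User-agent lines are read
    for number, line in enumerate(text.split("\n"), start=1):
        name, colon, value = line.partition(":")
        if not colon:
            continue  # blank lines and anything without a field name
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            if not reading_agents:
                agents = []  # a new part starts here
                reading_agents = True
            agents.append(value)
        elif name == "disallow":
            reading_agents = False
            if "*" in agents and "*" in value:
                flagged.append((number, value))
    return flagged


sample = (
    "User-agent: *\nDisallow: *.cgi\n\n"
    "User-agent: Googlebot\nDisallow: *.cgi\n\n"
    "User-agent: *\nDisallow: /cgi-bin/"
)
print(wildcards_in_catch_all(sample))
```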

 

15. DO NOT put each and every file in 'robots.txt'!

I have seen 'robots.txt' files with more than 4000 entries, specifying each and every .html or .php program file in them. DON'T DO THAT.

It is much better to exclude complete directories or single 'critical' files.

Bots tend to turn away from overly long 'robots.txt' files and probably never come back...



You're very welcome.

 

Seeing people happy is actually a wonderful Christmas present for me. And working together with people all over this globe in peace...



I just found this information about robots.txt and it is really helpful in setting this up for my site. There is one thing however which I do not completely understand and that is where to place this robots.txt and how to refer to the different 'disallow' directories.

On my host I have an httpdocs directory, underneath which I have my old shop and the soon-to-be osC shop. When people access my domain (www.mydomain.nl) they get the index.* file from the httpdocs directory, so I assumed this was my root. When FTP'ing, however, I can go one level up and see e.g. the httpdocs directory, the cgi-bin directory, etc.

With this in mind I changed the robots.txt to:

# robots.txt for Wheel of Time
# Currently disallow all shop stuff to the Google Image bot
# Mainly image hunters anyway, they eat up bandwidth...
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /httpdocs/catalog/

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /httpdocs/tmp/
Disallow: /httpdocs/stats/
Disallow: /httpdocs/plesk-stat/
Disallow: /httpdocs/media/
Disallow: /httpdocs/siteadmin/
Disallow: /httpdocs/catalog/admin/
Disallow: /httpdocs/catalog/download/
Disallow: /httpdocs/catalog/pub/
Disallow: /httpdocs/catalog/account.php
Disallow: /httpdocs/catalog/advanced_search.php
Disallow: /httpdocs/catalog/checkout_shipping.php
Disallow: /httpdocs/catalog/create_account.php
Disallow: /httpdocs/catalog/login.php
Disallow: /httpdocs/catalog/password_forgotten.php
Disallow: /httpdocs/catalog/popup_image.php
Disallow: /httpdocs/catalog/shopping_cart.php

 

I placed the robots.txt in /httpdocs, however, as that is the 'root' everyone gets when accessing the site, but I am not sure if this will also be the case for visiting robots. Did I do wrong to add /httpdocs in front of everything, making this not work at all, or is it as it should be?

 

Thanks in advance!


Wheeloftime,

 

This is incorrect. It's from the WEB root, not the server path.

 

So even though you FTP your files to /httpdocs/ your *web* root is still /

 

Example: the administrator of these forums probably FTPs the forum files to something like:

 

/public_html/forums/

 

However, the web root is still simply:

 

/

 

Which, in robots.txt, is the same thing as: http://forums.oscommerce.com/

 

So you will want to upload your robots.txt file in the same directory as your index.* page, and you should treat that as your root.
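
Put differently, the path you see over FTP has to lose its document-root prefix before it goes into robots.txt. A rough sketch, assuming the document root is '/httpdocs' as on Wheeloftime's host (on other hosts it may be '/public_html' or something else entirely); the function name is invented:

```python
def ftp_path_to_web_path(ftp_path: str, document_root: str = "/httpdocs") -> str:
    """Translate a path as seen over FTP into the path a spider requests."""
    root = document_root.rstrip("/")
    if ftp_path != root and not ftp_path.startswith(root + "/"):
        # Not below the document root, so this sketch cannot map it to a URL path.
        raise ValueError(f"{ftp_path!r} is not below {document_root!r}")
    return ftp_path[len(root):] or "/"


print(ftp_path_to_web_path("/httpdocs/catalog/admin/"))
print(ftp_path_to_web_path("/httpdocs/robots.txt"))
```

So the line 'Disallow: /httpdocs/catalog/admin/' from the file above should read 'Disallow: /catalog/admin/'.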


Gabriel,

 

Thanks for coming back to this rather old topic! I figured out it had to be as you describe, and that's where I have placed the file.

 

regards,

Howard

I would not add the location of your admin (or any other non public directory) to the robots.txt file! Showing this kind of information in your robots.txt file (which anybody can read), makes your site less safe.

 

Robots would only get there if there's a link to it (and obviously there shouldn't be), and if a robot finds/tries it anyway, for whatever reason, the .htaccess protection won't let it in, so the robots.txt file does not add anything useful there.

 

Helder.

Thanks also for this extra information !


I don't get the disallow stuff. I mean, doesn't that just allow people to read a file and learn all your sensitive spots to try and exploit? I completely removed, renamed, but I don't disallow it on search engines... If it isn't linked anywhere, how can Google find it?

 

Robert

Exactly,

 

The disallow is only meant for linked pages that you don't want indexed for whatever reason (saving bandwidth, for example). Only nice bots listen to it; others ignore it, or might even search for disallowed files and dirs.


Please do not PM me for support, I will not respond anyway.


I ran the validator at searchengineworld.com and got this:

 

Were sorry this robots.txt does not validate

Warnings detected 238

Errors detected 322

 

This is a stock page from OSC 2.2. I have not modified anything.

I guess I am going to have to learn. This seems like a lot of errors?

Any advice? Thanks, Moon


"Woohoo, Just Havin Funnn!"

A stock osC install does not have a robots.txt file included (which is logical because the robots.txt must be in the root, while the catalog may be located elsewhere), so I guess it's another robots.txt file you have that produces all those errors.

 

url?

 

Oh my...

I have no idea? Could I have picked up the robots.txt in a contribution? Now I am more confused...



What do I need to do now? Should I remove the file and find a replacement or should I try to edit it? I surely don't know where it came from.



Ok, It does not exist... Thank you for setting me straight, I will do just what you say. Off I go to build my own robots.txt. Wish me luck!



Great tips here.

 

When a bot gets into your site, does it come in through the index page and try all the links recursively, up and down? And if so, is a robots.txt file like the one below overkill?

 

User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /downloads/
Disallow: /images/
Disallow: /includes/
Disallow: /pub/
Disallow: /session/
Disallow: /temp/
Disallow: /templates/
Disallow: /webstats/
#
Disallow: /account.php
Disallow: /account_edit.php
Disallow: /account_history.php
Disallow: /account_history_info.php
Disallow: /account_newsletters.php
Disallow: /account_notifications.php
Disallow: /account_password.php
Disallow: /add_checkout_success.php
Disallow: /address_book.php
Disallow: /address_book_process.php
Disallow: /advanced_search.php
Disallow: /advanced_search_result.php
Disallow: /affiliate_affiliate.php
Disallow: /affiliate_banners.php
Disallow: /affiliate_clicks.php
Disallow: /affiliate_contact.php
Disallow: /affiliate_details.php
Disallow: /affiliate_faq.php
Disallow: /affiliate_intro.php
Disallow: /affiliate_logout.php
Disallow: /affiliate_password_forgotten.php
Disallow: /affiliate_payment.php
Disallow: /affiliate_sales.php
Disallow: /affiliate_show_banner.php
Disallow: /affiliate_signup.php
Disallow: /affiliate_summary.php
Disallow: /affiliate_terms.php
Disallow: /checkout_confirmation.php
Disallow: /checkout_payment.php
Disallow: /checkout_payment_address.php
Disallow: /checkout_paypalipn.php
Disallow: /checkout_process.php
Disallow: /checkout_shipping.php
Disallow: /checkout_shipping_address.php
Disallow: /checkout_success.php
Disallow: /configure.php
Disallow: /contact_us.php
Disallow: /create_account.php
Disallow: /create_account_success.php
Disallow: /down_for_maintenance.php
Disallow: /download.php
Disallow: /gv_redeem.php
Disallow: /gv_send.php
Disallow: /info_shopping_cart.php
Disallow: /links_setup.php
Disallow: /login.php
Disallow: /logoff.php
Disallow: /password_forgotten.php
Disallow: /paypal_notify.php
Disallow: /popup_coupon_help.php
Disallow: /popup_image.php
Disallow: /popup_search_help.php
Disallow: /product_notifications.php
Disallow: /product_reviews_write.php
Disallow: /redirect.php
Disallow: /shopping_cart.php
Disallow: /shopping_cart_help.php
Disallow: /shipping_estimator_popup.php
Disallow: /tell_a_friend.php

 

You mention not to put /admin in the file. Fine, it's password protected (.htaccess). What about the other directories? (includes, temp, download...)

 

So if I understand this right, the best way to create an efficient robots.txt file is to manually browse your site, follow all possible links, write down the ones you don't want bots to look at, and ignore all the files and directories that are not linked from any page? Is that right?

 

Thanks


Excellent post Matthias.

 

I would like to add something for clarification, because there are some misconceptions about what a robot actually does when it gets to your site. Yes, the first thing it does (at least the good bots like Google) is look in your root directory for robots.txt.

If it finds the file, it then scans it for exclusions, i.e. what you have asked not to be crawled.

Whether or not you have a robots.txt file, the bot then FOLLOWS THE LINKS WITHIN YOUR SITE. It does not view directories and files that you do not link to and that aren't linked from other sites. In other words, it cannot see directories the way you can in your FTP program. It only sees files that something links to.

Why would you want to hide a file from the bots? One excellent example is the larger-picture popup page you get when you click on "view larger image" on your product page. And why wouldn't you want the bot to index a popup? If it appears in a page of search results on Google (or another search site) and your potential customer clicks on the link, they are taken to your little picture with no means of navigation and no idea of where they are. Those who didn't include robots.txt files excluding these pages are now having to create hacks to redirect customers to the home page of their stores.
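
To illustrate that a bot only discovers what is linked, here is a rough sketch of the link-collecting step using Python's standard 'html.parser'. The class name, the sample HTML and the 'product_info.php' link are illustrative only; a real crawler does far more than this:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links on one page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


page = (
    '<a href="product_info.php?products_id=1">Item</a> '
    '<a href="/catalog/popup_image.php?pID=1">View larger image</a>'
)
collector = LinkCollector("http://www.mydomain.com/catalog/index.php")
collector.feed(page)
print(collector.links)
```

The popup page shows up in the list only because the product page links to it; a file that nothing links to never would.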

 

My 2cents.


good work guys..keep it up...

I'm a newbie here so hope you guys can help..

 

The problem!

I have submitted my website "www.nicedeals.co.uk" to a few search engines and have added the meta tag contribution, which allows you to add a meta tag title, description and keywords in the admin panel under "edit". Now I have a few issues with the rankings which I believe could be related to this topic. I'll deal with MSN and Google here, as Yahoo is an altogether different ball game, I'm told.

MSN:

MSN has indexed the first categories from my website, e.g. if I search for "vauxhall astra body panels" it will rank me No. 1 and take me to my website page "http://nicedeals.co.uk/caraccessories/nfos...7661882d85c1a0f", which is the main category under "body panels and lamps" covering all the Vauxhall models. But if I type "astra body panels" it will rank me lower down, although I have specified a more specific meta tag just for the Vauxhall Astra, and it doesn't index the Astra page either, which should be:

"http://nicedeals.co.uk/caraccessories/nfos...972f78d25951867"

Google:

Google, on the other hand, is not ranking me anywhere near the first pages and will only show my site if I type something like "vauxhall nicedeals", and even then the more specific pages for the Astra will be in the omitted search results, if present at all.

Now correct me if I'm wrong, but if this robots.txt file is used in such a way that only the specific Vauxhall model pages (i.e. Astra, Calibra, etc.) are allowed to be read by the bots, that should increase the website ranking for specific searches like "astra bonnet" etc.

Thank you so much for reading; any help is greatly appreciated.

 

 

 

:thumbsup:


Just a thought on this... why isn't cookie_usage.php disallowed too? Doesn't it seem that this is what causes bots to hit a wall?


I find the fun in everything.

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/admin/
Disallow: /catalog/download/
Disallow: /catalog/elmar/
Disallow: /catalog/pub/
Disallow: /catalog/account.php
Disallow: /catalog/advanced_search.php
Disallow: /catalog/checkout_shipping.php
Disallow: /catalog/create_account.php
Disallow: /catalog/login.php
Disallow: /catalog/password_forgotten.php
Disallow: /catalog/popup_image.php
Disallow: /catalog/shopping_cart.php

 

Have fun! And happy 'spidering'...

Matthias

 

Hi, just a simple newbie question. What should the rest of the file look like? I mean how do you put the spiders in the end of the file. I had an error when I was using just a list of spiders. TIA.

 

-Jani
