mhormann Posted December 18, 2004

Hi everybody. I've seen a lot of 'bad' (i.e. non-working) 'robots.txt' files lately, so I want to give a few tips.

'robots.txt' is a file that 'well-behaved' (not all!) spiders and search engines respect and check for. It is used to specify what you DON'T want the spiders to see. Here are some guidelines:

1. The filename MUST be 'robots.txt'. EXACTLY that. All lowercase, plural (i.e., NOT 'robot.txt' or 'Robots.txt').

2. The file MUST have 'Linux-type' (i.e. linefeed, "\n") line endings. NOT Mac (CR), NOT DOS (CRLF). If you work on any of these, use an editor that allows you to save in 'UNIX mode'.

3. The file MUST be in your web root. If you put it in, say, '/catalog/', no bot will ever see or check it. ALWAYS put it in your web root, i.e. "http://www.mydomain.com/robots.txt".

4. Comment lines. You CAN have comment lines. They start with a '#' in column ONE. Be careful NOT to separate too much using empty lines (see rule 9)!

   # BAD: Not starting in column 1.
# GOOD: Start in column 1.

In theory, it IS allowed to put comments on a 'Disallow' or 'User-agent' line, like 'User-agent: Googlebot #this is Google'. DON'T USE IT! It is bad practice, and some spiders will misinterpret it and instead spider what they weren't supposed to.

# BAD: Comments on the same line.
User-agent: Googlebot # Google

# GOOD: Comments on separate lines.
# Google
User-agent: Googlebot

5. White space. In theory, it IS allowed to use white space (i.e., empty lines or indentation by blanks or tabs). DON'T DO IT! Some spiders will misinterpret it. ALWAYS start comments, 'User-agent:' and 'Disallow:' in column 1. And DON'T use tabs, but blanks instead.

# BAD: Indentation (not starting in column 1).
   User-agent: Googlebot
   Disallow:    /admin/

# GOOD: Start in column 1, use ONE blank after the ':'.
User-agent: Googlebot
Disallow: /admin/

6. User-agent: [spider's name]

Type it EXACTLY like this. It means the spider.
You CAN put '*' to make it target ALL spiders.

# BAD: all lowercase
user-agent: *

# BAD: all uppercase
USER-AGENT: *

# BAD: no '-', 'Agent' has uppercase 'A'
User Agent: *

# GOOD: (Google)
User-agent: Googlebot

# GOOD: (ALL spiders)
User-agent: *

7. Disallow: [path/filename to exclude]

Use ONLY ONE path and/or filename per line, i.e. NOT "Disallow: /cgi-bin /stats"!

# BAD: Multiple paths/files on one line
Disallow: /cgi-bin /stats

# GOOD: One definition per line
Disallow: /cgi-bin
Disallow: /stats

Disallow works like an 'automatic wildcard' (without '?' or '*') by matching from the left, i.e. "Disallow: /help" would match the DIRECTORY "/help", the directory "/helpfiles", the FILE "/help.htm", the FILE "/helpfile.php" and so on. So if you want to exclude a complete directory but NOT files with the same name (i.e. you want to exclude the '/catalog/elmar/' directory but NOT 'elmar_start.php'), it is good practice to write it like "Disallow: /catalog/elmar/" (with an ending '/').

# GOOD: Disallow directory 'elmar' but not 'elmar_start.php'
Disallow: /elmar/

8. There is NO 'Allow:'! If you want to allow anything, you must disallow the rest and put an empty 'Disallow:' at the end!

# BAD: (intended: disallow 'Jane' but not 'John')
Disallow: /Jane
Allow: /John

# GOOD: (disallow 'Jane', allow all the rest)
Disallow: /Jane
Disallow:

9. NEVER have a BLANK LINE BETWEEN 'User-agent:' and its corresponding 'Disallow:' lines! Some spiders will misinterpret this as being allowed to spider your whole site. You CAN have comment lines in between.

# BAD: Blank line between 'User-agent:' and 'Disallow:'
# This should exclude Google
User-agent: Googlebot

# And here we say which to exclude
Disallow: /
# Result: Some spiders will instead assume they're ALLOWED your whole site!
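If you want to sanity-check the 'matching from the left' behavior described in rule 7, one way (just as an illustration, not a replacement for a real validator) is Python's standard-library robots.txt parser, `urllib.robotparser`. The `www.example.com` host below is a placeholder; only the path part matters to the parser.

```python
from urllib import robotparser

# Rule 7: 'Disallow' matches from the left, like an automatic wildcard,
# and a trailing '/' restricts the rule to the directory only.
rules = """\
User-agent: *
Disallow: /help
Disallow: /elmar/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

def allowed(path):
    # Placeholder host; can_fetch() only looks at the path component.
    return rp.can_fetch("*", "http://www.example.com" + path)

help_dir   = allowed("/help")             # starts with '/help' -> blocked
helpfiles  = allowed("/helpfiles/a.htm")  # starts with '/help' -> blocked
help_htm   = allowed("/help.htm")         # starts with '/help' -> blocked
hello      = allowed("/hello.htm")        # no '/help' prefix   -> allowed
elmar_dir  = allowed("/elmar/photo.jpg")  # inside '/elmar/'    -> blocked
elmar_file = allowed("/elmar_start.php")  # not '/elmar/...'    -> allowed
```

This matches what the tip says: '/help' without a trailing slash also sweeps up '/helpfiles' and '/help.htm', while '/elmar/' leaves 'elmar_start.php' alone.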
# GOOD: NO blank lines between 'User-agent:' and corresponding 'Disallow:'
# This should exclude Google
User-agent: Googlebot
# And here we say which to exclude
Disallow: /
# Result: Google will be kept from spidering your whole site.

10. Always go from 'more specific' to 'less specific'! Start with the most specific rules, then go to the least specific. This means the part for 'User-agent: *' should come LAST in your 'robots.txt'! The reason: if a spider sees 'User-agent: *' FIRST, it might stop scanning since it's one of 'all spiders', so it'll not bother to look through the rest of your file even if it's specifically addressed elsewhere!

# BAD: Spider might not honor this
# Allow everything to all other spiders
User-agent: *
Disallow:
# Disallow Google
User-agent: Googlebot
Disallow: /

# GOOD: First do the specifics, then the 'rest of them'
# Disallow Google
User-agent: Googlebot
Disallow: /
# Allow everything to all other spiders
User-agent: *
Disallow:

11. Use a 'robots.txt' validator. One might make mistakes. It's good practice to check using a validator. Here's a good one (it even has some examples): http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
And here's one that checks for even more potential problems: http://tool.motoricerca.info/robots-checker.phtml

12. One more tip: search engines get clever. If you really have to run a lot of sites... hey, comparing their 'robots.txt' files is FAST and makes it VERY easy for SEs to find out if they're all the same... and so they start assuming they're being tricked and rank you down... ;-)

Here's an example 'robots.txt':

# osCommerce robots.txt
# Currently disallow all shop stuff to the Google Image bot
# Mainly image hunters anyway, they eat up bandwidth...
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/admin/
Disallow: /catalog/download/
Disallow: /catalog/elmar/
Disallow: /catalog/pub/
Disallow: /catalog/account.php
Disallow: /catalog/advanced_search.php
Disallow: /catalog/checkout_shipping.php
Disallow: /catalog/create_account.php
Disallow: /catalog/login.php
Disallow: /catalog/password_forgotten.php
Disallow: /catalog/popup_image.php
Disallow: /catalog/shopping_cart.php

Have fun! And happy 'spidering'...

Matthias

I don't want to set the world on fire—I just want to start a flame in your heart.
osCommerce Contributions: Class cc_show() v1.0 – Show Credit Cards, Gateways; More Product Weight v1.0
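As a quick sketch of what this example file does in practice, you can feed a trimmed-down version of it to Python's `urllib.robotparser` and ask which URLs each agent may fetch. The host name is a placeholder, and this is only one parser's reading of the file, not a guarantee about any particular spider.

```python
from urllib import robotparser

# A trimmed version of the example robots.txt from this post.
example = """\
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /usage/
Disallow: /catalog/

User-agent: *
Disallow: /cgi-bin/
Disallow: /catalog/admin/
Disallow: /catalog/account.php
"""

rp = robotparser.RobotFileParser()
rp.parse(example.splitlines())

base = "http://www.mydomain.com"  # placeholder shop domain

# The image bot is shut out of the whole catalog...
image_catalog = rp.can_fetch("Googlebot-Image", base + "/catalog/widget.php")
# ...while ordinary spiders may still browse products,
other_catalog = rp.can_fetch("SomeOtherBot", base + "/catalog/widget.php")
# but not the admin area or the account page.
other_admin   = rp.can_fetch("SomeOtherBot", base + "/catalog/admin/index.php")
other_account = rp.can_fetch("SomeOtherBot", base + "/catalog/account.php")
```

Note how 'SomeOtherBot' (a made-up agent name) falls through to the 'User-agent: *' record, exactly the fall-through behavior rule 10 relies on.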
mhormann Posted December 18, 2004 (Author)

Two more tips here, since this darned forum only lets me edit my post twice...

13. Always start from the 'base' path—reduce ambiguity. Some spiders would probably do what you want when specifying things like

# BAD: Ambiguous
User-agent: *
Disallow: secret.php

Some would assume it means 'secret.php' in every directory, some would ignore it, some would only compare it to '/secret.php'... ALWAYS be specific and start at the web root, i.e. with '/'!

# GOOD: Always start at your web root
User-agent: *
Disallow: /secret.php
Disallow: /catalog/secret.php
Disallow: /catalog/admin/secret.php

14. Google's new 'wildcard' exclusion system. Google now allows 'wildcards' to be specified, like '*.cgi'. DO NOT assume this will work with any other spider! Try to keep it as simple as possible, using rules that are easy to understand for every spider. If you really want to target special functions for special spiders, ALWAYS target them specifically, i.e. use a separate 'User-agent:' part.

# BAD: Assuming they all understand it
User-agent: *
Disallow: *.cgi

# GOOD: If you have to... address it specifically
User-agent: Googlebot
Disallow: *.cgi

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/secret.cgi

15. DO NOT put each and every file in 'robots.txt'! I have seen 'robots.txt' files with more than 4000 entries, specifying each and every .html or .php program file in them. DON'T DO THAT. It is much better to exclude complete directories or single 'critical' files. Bots tend to turn away from too-long 'robots.txt' files and probably never come back...
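Tip 13 can be demonstrated concretely with at least one parser. Python's standard `urllib.robotparser`, for instance, anchors every rule at the web root, so a 'Disallow' value without a leading '/' simply never matches anything and the file silently protects nothing. This is just one parser's behavior (other spiders may guess differently, which is exactly the ambiguity the tip warns about), and the host name is a placeholder.

```python
from urllib import robotparser

ambiguous = """\
User-agent: *
Disallow: secret.php
"""

strict = """\
User-agent: *
Disallow: /secret.php
"""

def blocked(robots_txt, path):
    # Placeholder host; only the path component is compared.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch("*", "http://www.example.com" + path)

# Relative rule: never matches '/secret.php', so nothing is blocked.
ambiguous_hit = blocked(ambiguous, "/secret.php")
# Root-anchored rule: left-anchored match works as intended.
strict_hit = blocked(strict, "/secret.php")
```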
TCwho Posted December 20, 2004

Very good information, mhormann! Very much appreciated :D
miguel_os Posted December 21, 2004

Thank you very much!
mhormann Posted December 21, 2004 (Author)

You're very welcome. Seeing people happy is actually a wonderful Christmas present for me. And working together with people all over this globe in peace...
wheeloftime Posted January 14, 2005

I just found this information about robots.txt and it is really helpful in setting this up for my site. There is one thing, however, which I do not completely understand, and that is where to place this robots.txt and how to refer to the different 'Disallow' directories.

On my host I have a httpdocs directory, underneath which I have my old shop and the soon-to-be osC shop. When people access my domain (www.mydomain.nl) they access the index.* file from the httpdocs directory, so I assumed this was my root. When FTP'ing, however, I can go one level back, seeing e.g. the httpdocs directory, the cgi-bin directory, etc. With this in mind I changed the robots.txt to:

# robots.txt for Wheel of Time
# Currently disallow all shop stuff to the Google Image bot
# Mainly image hunters anyway, they eat up bandwidth...
User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /httpdocs/catalog/

# ALL search engine spiders/crawlers (put at end of file)
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /httpdocs/tmp/
Disallow: /httpdocs/stats/
Disallow: /httpdocs/plesk-stat/
Disallow: /httpdocs/media/
Disallow: /httpdocs/siteadmin/
Disallow: /httpdocs/catalog/admin/
Disallow: /httpdocs/catalog/download/
Disallow: /httpdocs/catalog/pub/
Disallow: /httpdocs/catalog/account.php
Disallow: /httpdocs/catalog/advanced_search.php
Disallow: /httpdocs/catalog/checkout_shipping.php
Disallow: /httpdocs/catalog/create_account.php
Disallow: /httpdocs/catalog/login.php
Disallow: /httpdocs/catalog/password_forgotten.php
Disallow: /httpdocs/catalog/popup_image.php
Disallow: /httpdocs/catalog/shopping_cart.php

The robots.txt I placed in /httpdocs, however, as that is the 'root' everyone gets when accessing the site, but I am not sure if this will also be the case for the visiting robots. Did I do wrong to add /httpdocs in front of everything, making this not work at all, or is it as it should be?

Thanks in advance!
gabrielk Posted April 5, 2005

Wheeloftime,

This is incorrect. It's from the WEB root, not the server path. So even though you FTP your files to /httpdocs/, your *web* root is still /

Example: the administrator of these forums probably FTPs the forum files to something like:

/public_html/forums/

However, the web root is still simply:

/

Which, in robots.txt, is the same thing as: http://www.oscommerce.com/forums/

So you will want to upload your robots.txt file in the same directory as your index.* page, and you should treat that as your root.
wheeloftime Posted April 5, 2005 Share Posted April 5, 2005 Wheeloftime, This is incorrect. It's from the WEB root, not the server path. So even though you FTP your files to /httpdocs/ your *web* root is still / Example: the administrator of these forums probably FTPs the forum files to something like: /public_html/forums/ However, the web root is still simple: / Which, in robots.txt, is the same thing as: http://www.oscommerce.com/forums/ So you will want to upload your robots.txt file in the same directory as your index.* page, and you should treat that as your root. <{POST_SNAPBACK}> Gabriel, Thanks for coming back on this rather old topic ! I figured out it had to be as you describe and that's where I have placed the file. regards, Howard Link to comment Share on other sites More sharing options...
wheeloftime Posted April 5, 2005 Share Posted April 5, 2005 I would not add the location of your admin (or any other non public directory) to the robots.txt file! Showing this kind of information in your robots.txt file (which anybody can read), makes your site less safe. Robots only would get there if there's a link to it (and obviously there shouldn't), and if a robot finds/tries it anyway, for whatever reason, the .htaccess protection won't allow it in, so the robots.txt file does not add anything usefull to that. <{POST_SNAPBACK}> Helder. Thanks also for this extra information ! Link to comment Share on other sites More sharing options...
wheeloftime Posted April 5, 2005 Share Posted April 5, 2005 Geen dank, geen hulde, geef me liever een gulde :D <{POST_SNAPBACK}> Eurootje dan toch nog :D Maar dat rijmt nie zo helaas. Link to comment Share on other sites More sharing options...
Panic36 Posted April 6, 2005

I don't get the disallow stuff. I mean, doesn't that just allow people to read a file and know all your sensitive spots to try and exploit? I completely removed and renamed mine, but I don't disallow it for search engines... If it isn't linked anywhere, how can Google find it?

Robert
Guest Posted April 6, 2005

Exactly. The disallow is only meant for linked pages that you don't want indexed for whatever reason (saving bandwidth, for example). Only nice bots listen to it; others ignore it, or might even search for disallowed files and dirs.
moonbeam Posted April 7, 2005 Share Posted April 7, 2005 I ran the validator at searchengineworld.com and got this: Were sorry this robots.txt does not validate Warnings detected 238 Errors detected 322 This is a stock page from OSC 2.2. I have not modified anything. I guess I am going to have to learn. This seems like alot of errors? Any advise? thanks, Moon "Woohoo, Just Havin Funnn!" Link to comment Share on other sites More sharing options...
moonbeam Posted April 7, 2005 Share Posted April 7, 2005 A stock osC install does not have a robots.txt file included (which is logical because the robots.txt must be in the root, while the catalog may be located elsewhere), so I guess it's another robots.txt file you have that produces all those errors. url? <{POST_SNAPBACK}> Oh my... I have no idea? Could I have picked up the robots.txt in a contribution? Now I am more confused... "Woohoo, Just Havin Funnn!" Link to comment Share on other sites More sharing options...
moonbeam Posted April 8, 2005 Share Posted April 8, 2005 What do I need to do now? Should I remove the file and find a replacement or should I try to edit it? I surely don't know where it came from. "Woohoo, Just Havin Funnn!" Link to comment Share on other sites More sharing options...
moonbeam Posted April 8, 2005 Share Posted April 8, 2005 Ok, It does not exist... Thank you for setting me straight, I will do just what you say. Off I go to build my own robots.txt. Wish me luck! "Woohoo, Just Havin Funnn!" Link to comment Share on other sites More sharing options...
Guest Posted April 8, 2005

Good luck! :D
ptrinephi Posted June 25, 2005

Great tips here. When a bot gets into your site, does it come in through the index page and try all the links recursively up and down? And if so, is a robots.txt file like the one below overkill?

User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /downloads/
Disallow: /images/
Disallow: /includes/
Disallow: /pub/
Disallow: /session/
Disallow: /temp/
Disallow: /templates/
Disallow: /webstats/
#
Disallow: /account.php
Disallow: /account_edit.php
Disallow: /account_history.php
Disallow: /account_history_info.php
Disallow: /account_newsletters.php
Disallow: /account_notifications.php
Disallow: /account_password.php
Disallow: /add_checkout_success.php
Disallow: /address_book.php
Disallow: /address_book_process.php
Disallow: /advanced_search.php
Disallow: /advanced_search_result.php
Disallow: /affiliate_affiliate.php
Disallow: /affiliate_banners.php
Disallow: /affiliate_clicks.php
Disallow: /affiliate_contact.php
Disallow: /affiliate_details.php
Disallow: /affiliate_faq.php
Disallow: /affiliate_intro.php
Disallow: /affiliate_logout.php
Disallow: /affiliate_password_forgotten.php
Disallow: /affiliate_payment.php
Disallow: /affiliate_sales.php
Disallow: /affiliate_show_banner.php
Disallow: /affiliate_signup.php
Disallow: /affiliate_summary.php
Disallow: /affiliate_terms.php
Disallow: /checkout_confirmation.php
Disallow: /checkout_payment.php
Disallow: /checkout_payment_address.php
Disallow: /checkout_paypalipn.php
Disallow: /checkout_process.php
Disallow: /checkout_shipping.php
Disallow: /checkout_shipping_address.php
Disallow: /checkout_success.php
Disallow: /configure.php
Disallow: /contact_us.php
Disallow: /create_account.php
Disallow: /create_account_success.php
Disallow: /down_for_maintenance.php
Disallow: /download.php
Disallow: /gv_redeem.php
Disallow: /gv_send.php
Disallow: /info_shopping_cart.php
Disallow: /links_setup.php
Disallow: /login.php
Disallow: /logoff.php
Disallow: /password_forgotten.php
Disallow: /paypal_notify.php
Disallow: /popup_coupon_help.php
Disallow: /popup_image.php
Disallow: /popup_search_help.php
Disallow: /product_notifications.php
Disallow: /product_reviews_write.php
Disallow: /redirect.php
Disallow: /shopping_cart.php
Disallow: /shopping_cart_help.php
Disallow: /shipping_estimator_popup.php
Disallow: /tell_a_friend.php

You mention not to put /admin in the file. Fine, it's password protected (.htaccess). What about the other directories (includes, temp, download...)?

So if I understand this right, the best way to create an efficient robots.txt file is to manually browse your site, follow all possible links, and write down the ones you don't want bots to look at, ignoring all the files and directories that are not linked from any pages? Is that right?

Thanks
Guest Posted July 30, 2005

Excellent post, Matthias. I would like to add something for clarification, because there are some misconceptions about what the robot actually does when it gets to your site.

Yes, when it gets to your site, the first thing it does (at least the good bots, like Google) is look in your root directory for robots.txt. If it finds it, it then scans the file for exclusions and for what you dictate you would like to be seen. Whether or not you have a robots.txt file, the bot then FOLLOWS THE LINKS WITHIN YOUR SITE. It does not view directories and files to which you have no links, or that aren't linked from other sites. In other words, it cannot see directories like you can in your FTP program. It only sees files that you have built links to.

Why would you want to omit a file from the bots? One excellent example is the larger-picture popup page you get when you click on "view larger image" on your product page. And why wouldn't you want the bot to index a popup? If it appears in a page of search results on Google (or another index site) and your potential customer clicks on the link, then they are taken to your little picture with no means of navigation or understanding of where they are. Those who didn't include robots.txt files excluding these pages are now having to create hacks to redirect customers to the home page of their stores.

My 2 cents.
nicedeals Posted July 30, 2005

Good work, guys, keep it up... I'm a newbie here, so I hope you guys can help.

The problem: I have submitted my website "www.nicedeals.co.uk" to a few search engines and have added the meta tag contribution, which allows you to add a meta tag title, description, and keywords in the admin panel under "edit". Now I have a few issues with the rankings which I believe could be related to this topic. I'll deal with MSN and Google here, as Yahoo is an altogether different ball game, I'm told.

MSN: MSN has indexed the first categories from my website. E.g., if I search for "vauxhall astra body panels" it will rank me No. 1 and take me to my website page "http://nicedeals.co.uk/caraccessories/nfos...7661882d85c1a0f", which is the main category under "body panels and lamps" covering all the Vauxhall models. But if I type "astra body panels" it will rank me lower down, although I have specified a more specific meta tag just for the Vauxhall Astra, and it doesn't index the Astra page either, which should be "http://nicedeals.co.uk/caraccessories/nfos...972f78d25951867".

Google: Google, on the other hand, is not ranking me anywhere near the first pages and will only show my site if I were to type something like "vauxhall nicedeals", and even then the more specific pages for the Astra will be in the omitted search results, if present at all.

Now correct me if I'm wrong, but if this robots.txt file is used in a way where only the specific Vauxhall model pages, i.e. Astra, Calibra, etc., are allowed to be read by the bots, shouldn't that increase the website ranking for specific searches like "astra bonnet"?

Thank you so much for reading; any help greatly appreciated. :thumbsup:
FixItPete Posted April 15, 2006

Just a thought on this... why isn't cookie_usage.php disallowed too? Doesn't it seem that this is what causes bots to hit a wall?

I find the fun in everything.
jashnu Posted April 29, 2006

(quoting the example 'robots.txt' from the first post)

Hi, just a simple newbie question. What should the rest of the file look like? I mean, how do you put the spiders at the end of the file? I had an error when I was using just a list of spiders. TIA.

-Jani
Guest Posted April 29, 2006

The spiders use a different file. Look into your catalog\includes\spiders.txt

http://www.oscommerce.com/community/contributions,2455
jashnu Posted May 2, 2006

I meant the robots.txt file, not spiders.txt.
Guest Posted May 4, 2006

Via the user agent, for example:

User-agent: Googlebot-Image
Disallow: /
Archived
This topic is now archived and is closed to further replies.