stevel Posted September 20, 2004 (edited September 20, 2004 by stevel)
A replacement for catalog/includes/spiders.txt, updated with newly seen spiders and optimized for quicker processing. For 2.2-MS2 or later. Comments, questions and suggestions are welcome here.
http://www.oscommerce.com/community/contributions,2455
Steve
Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
nrlatsha Posted September 24, 2004
Great idea, Steve.
9 times out of 10 it's a PEBCAK error (Problem Exists Between Chair And Keyboard). Replace that and you're fine...
ari Posted October 18, 2004
Hi there, I noticed that googlebot is not in your long or short spiders.txt file. I also noticed on my site that googlebot is trying to "buy" and it keeps hitting the "cookie_usage.php" file. It was not doing that in the past. Any idea why it would do so and how to stop it?
Ari
stevel Posted October 18, 2004
It is there as "ooglebot", which picks up googlebot and frooglebot. This seems to work for my site. Are others having a problem with this? Look in your server log for entries from googlebot. Do they have session IDs? You do have "Prevent Spider Sessions" set to true, right?
ari Posted October 18, 2004
Thanks Steve, "ooglebot" sounds good. My settings are fine: googlebot does not create a session, it is just trying to "buy" and is then sent to the cookie_usage page. Here's a sample:

66.249.79.63 - - [17/Oct/2004:02:01:39 -0700] "GET /index.php?cPath=1_27&page=1&sort=5a&action=buy_now&products_id=258 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.249.79.63 - - [17/Oct/2004:02:01:39 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.249.78.59 - - [17/Oct/2004:02:01:54 -0700] "GET /index.php?cPath=1_27&page=1&sort=5d HTTP/1.0" 200 28434 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.207.118.206 - - [17/Oct/2004:02:02:05 -0700] "GET /product_info.php?products_id=505&osCsid=472a70b19e8107dab70be0b6c3ee42e2 HTTP/1.1" 200 27914 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
66.249.78.59 - - [17/Oct/2004:02:02:27 -0700] "GET /index.php?cPath=1_27&page=1&sort=5a&action=buy_now&products_id=138 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.249.78.59 - - [17/Oct/2004:02:02:27 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.249.78.79 - - [17/Oct/2004:02:03:07 -0700] "GET /product_reviews.php?products_id=499&action=notify HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.249.78.79 - - [17/Oct/2004:02:03:07 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.207.118.206 - - [17/Oct/2004:02:03:37 -0700] "GET /index.php?cPath=32_30&osCsid=472a70b19e8107dab70be0b6c3ee42e2 HTTP/1.1" 200 27133 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
66.207.118.206 - - [17/Oct/2004:02:04:22 -0700] "GET /product_reviews.php?products_id=443&osCsid=b8841dddbb1c72e030619851e0b84632 HTTP/1.1" 200 26970 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
66.249.78.54 - - [17/Oct/2004:02:04:37 -0700] "GET /index.php?cPath=1_27&page=1&sort=2a&action=buy_now&products_id=257 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.249.78.54 - - [17/Oct/2004:02:04:38 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.249.78.9 - - [17/Oct/2004:02:04:52 -0700] "GET /index.php?cPath=1_27&page=1&sort=2a&action=buy_now&products_id=138 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.249.78.9 - - [17/Oct/2004:02:04:53 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

I have about 200 products, but with this "new" behavior (it started recently) googlebot is hitting about 1000 pages a day. What a waste!

Someone suggested adding a disallow for the cookie usage page in robots.txt. That might cut the number of pages in half, but we really need to prevent it from following the "action=buy_now&products_id=xyz" link. Any ideas?
Ari
stevel Posted October 18, 2004
Interesting that I have not seen this behavior on my site, but I don't use "Buy Now". One approach would be to display the "add to cart" or "buy now" button only if a session was started.
ari Posted October 18, 2004
What do you use if not "buy now" (or "add to cart")? I did not play with that function at all. Your idea is a good one. Some people suggested changing "add to cart" to a FORM action, which is ignored by all robots. How would you go about not showing the button if the session is not started?
Ari
stevel Posted October 18, 2004
The PHP would simply not display the Buy Now link if there was no session. The spider would not see it and thus not follow it. Add to cart is already a form button, which is why I don't see the problem on my store. Normal users would have sessions so they would get the link. When I get home I'll come up with the code change to make and let you know, if someone doesn't beat me to it. It would be quite simple.
ari Posted October 18, 2004
Thanks Steve, I think I see what's happening. The product detail page, "product_info.php", uses a FORM action, which is not causing any problem with robots. The category listing (whatever template is used for "/index.php?cPath=11_22") uses a URL action to add to the cart; the action name is "buy_now". How did this happen? I don't know. I am sure there are more templates doing the same (search_results.php, etc.).

There is a contribution out there that explains how to change everything back to a FORM action. I didn't want to use it because I thought there was a reason for making it a URL action. Now that I find some templates use FORM and others URL, I think I will just change all of them to FORM. What do you think?

Lots of folks on other threads have described the same problem. If you confirm this solution, we should post it on a few other threads.
Ari
stevel Posted October 18, 2004
Ari, that's much too much effort for this. Here's how to fix it. In catalog/includes/modules/product_listing.php, at around line 133, is this line:

    $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';

Insert this before that line:

    if ($session_started) {

and add this after that line:

    } else {$lc_text = ' ';}

That should take care of it.
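The effect of that change can be sketched in isolation. This is only an illustration, not the osCommerce source: `buy_now_cell()` is a hypothetical helper, and its two parameters stand in for the `$session_started` flag and the link HTML that product_listing.php builds with `tep_href_link()` and `tep_image_button()`.

```php
<?php
// Hypothetical stand-alone sketch of the conditional described above.
// In the real file the result is assigned to $lc_text inside the listing loop.
function buy_now_cell($session_started, $buy_now_html) {
    if ($session_started) {
        // A visitor with a session gets the Buy Now link as before.
        return $buy_now_html;
    }
    // No session (for example a spider, when spider sessions are prevented):
    // emit a blank cell, so there is no link to follow.
    return ' ';
}

echo buy_now_cell(true, '<a href="index.php?action=buy_now&products_id=258">Buy Now</a>');
```

Because the link is never written into the page for sessionless requests, a crawler has nothing to follow, which is different from blocking the URL after the fact in robots.txt.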
ari Posted October 19, 2004
Thanks Steve, I tried it and it worked: the button goes away (I used Firefox with User Agent Switcher to browse the site). But I ended up turning all the URL actions into FORM actions. A few templates have this problem and I wanted to make sure I covered all of them:
product_reviews.php
product_reviews_info.php
products_new.php
all_products.php
I will now go and post this solution in another thread that has a similar discussion.
Thanks again,
Ari
koksal Posted November 3, 2004
Also, the SID Killer contribution includes a Buy Now button hack that converts the buttons to forms. It worked fine for me.
KC gifts and home and garden
Guest Posted December 15, 2004
Steve, I noticed you have exabot in spiders.txt. I get hit by a bot from Exava, supposedly a new search engine. Are these the same? I ask because the eXavaBot that hits my site still generates a session ID.
Thanks, Ed
stevel Posted December 15, 2004
Hmm - looks as if they changed the spider name. Go ahead and add "abot" to the list and I'll update soon.
Guest Posted December 15, 2004
Thanks Steve!
Ed
Guest Posted December 16, 2004
Steve, a follow-up question. I'm using $user_agent to determine the spider's name in a who's online contrib. For GoogleBot and MSNBot, $user_agent displayed something like: msnbot/1.0(+http://www.msn.com). Will $user_agent always return a format like botName/version (URL)? I'm trying to shorten the name by truncating the part in parentheses. So is it always in this format?
Thanks, Ed
stevel Posted December 16, 2004
No, it isn't. And note that the way my spiders.txt works is to define substrings that are found in multiple user agents. For example, many have "crawl" or "robot". "msnbot" is in there, but some others have substrings (such as "lebot" for googlebot/frooglebot). As I find new robots, I sometimes create new substrings. I will probably add "abot" for exavabot/exabot. For this context, you want a string that matches as many robots as possible without also matching browser agents.

A problem I have found is that there are robots out there which provide either NO user agent or pretend they are a browser. Not much can be done about those, to be honest, but the major bots are more reasonable in this regard.
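One way to vet a candidate substring along those lines is to run it against sample agent strings from your own logs and see what it catches. The sketch below is an assumption-laden illustration: `match_agents()` is a hypothetical helper, and the sample lists are just agent names quoted in this thread, not a complete test set.

```php
<?php
// Hypothetical helper: return the sample agents that contain $candidate.
// Agents are lowercased first, so $candidate should be lowercase too.
function match_agents($candidate, array $agents) {
    $hits = array();
    foreach ($agents as $agent) {
        if (strpos(strtolower($agent), $candidate) !== false) {
            $hits[] = $agent;
        }
    }
    return $hits;
}

// Illustrative samples taken from this thread.
$bots = array('eXavaBot', 'exabot', 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)');
$browsers = array('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');

$bot_hits = match_agents('abot', $bots);         // robots the substring would catch
$false_hits = match_agents('abot', $browsers);   // browsers it would wrongly catch
echo count($bot_hits) . " bot(s), " . count($false_hits) . " browser(s)\n";
```

A good candidate gives a long first list and an empty second one. Shorter substrings catch more robots but are also more likely to appear in a browser's agent string.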
mhormann Posted December 18, 2004 (edited December 18, 2004 by mhormann)
Thanks for the great idea to keep 'spiders.txt' up to date, Steve! If you haven't already, you might want to take a look at Search Engine World's 'robots4.txt', a nice list of known, well-behaved spiders. I have used this info at times, since it also includes some of those that disguise themselves as browsers.
Regards, and keep it up!
Matthias
I don't want to set the world on fire - I just want to start a flame in your heart.
osCommerce Contributions: Class cc_show() v1.0 - Show Credit Cards, Gateways; More Product Weight v1.0
stevel Posted December 18, 2004 (edited December 18, 2004 by stevel)
Thanks - that is a useful resource, but I try to balance, at least in the smaller file, the number of entries against the prevalence of each spider. Lots of the spiders listed there are inactive. I also know that I have seen spiders visit my site that aren't on that list. I will shortly be posting another update with some further optimization (and catching more spiders).
mhormann Posted December 20, 2004
Since I'm just spidering myself into a zillion sessions ;-) ... Say I wanted to catch variants of a spider that comes up as any of:
Xenu
Xenu Link Sleuth 1.1e
Xenu Link Sleuth 1.2f
What would I put in? Just
Xenu
or better
xenu
Is it case-sensitive? Would it also catch them if I used "Link Sleuth", for example?
stevel Posted December 20, 2004
You would want just:
xenu
The strings are forced to lowercase, so only lowercase will match.
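The matching rule described here (lowercase the incoming user agent, then look for each spiders.txt entry as a plain substring) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual osCommerce code: `is_spider()` is a hypothetical function, and in a real store the entries would be read from spiders.txt, one per line.

```php
<?php
// Hypothetical sketch of substring matching against spiders.txt entries.
// Only the user agent is lowercased; the entries are used as written,
// which is why an entry such as "Xenu" would never match.
function is_spider($user_agent, array $entries) {
    $agent = strtolower($user_agent);
    foreach ($entries as $entry) {
        $entry = trim($entry); // lines read from a file carry a trailing newline
        if ($entry !== '' && strpos($agent, $entry) !== false) {
            return true;
        }
    }
    return false;
}

// Assumed sample entries; a store might load them with file('spiders.txt').
$entries = array('ooglebot', 'msnbot', 'xenu');

var_dump(is_spider('Xenu Link Sleuth 1.2f', $entries));
var_dump(is_spider('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)', $entries));
```

Under this rule a lowercase "link sleuth" entry would also match the two longer Xenu agent strings, but not the bare "Xenu" one, so "xenu" is the safer choice.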
mhormann Posted December 20, 2004
Perfect. Thanks for your ongoing support, really! You might want to include 'xenu' in your spiders.txt, since some people might now be more aware of this nice tool and start spidering around... and it can hit your site hard if your robots.txt isn't carefully laid out. Btw, many SEOs use it too.

My first try on a simple osC test installation spidered around 65,000 links and produced a log file of about 510 MB. Imagine that guy hitting your site with 100 threads simultaneously plus CREATING SESSIONS. Phew!

If you need to contact Xenu's author, Tilman Hausherr, let me know. He's quite helpful and actually wrote down my instructions on how to "set sessions off for Xenu". He told me most users of his spider got sessions and was quite amazed that we could simply "switch them off"... ;-)
stevel Posted December 21, 2004
I know the Xenu tool well - I use it myself. Good idea to add xenu to the list - thanks.
ckyshop.co.uk Posted December 30, 2004
Is there such a bot as msnbot?
Thanks for any help/comments.
Regards, Lewis Hill
Guest Posted December 31, 2004
Lewis, yep. It comes by my sites regularly. It's Microsoft's newest search engine, which is supposed to compete with Google.
Ed