
Updated spiders.txt Official Support Topic


stevel


A replacement for catalog/includes/spiders.txt - updated with newly seen spiders and optimized for quicker processing. For 2.2-MS2 or later.

 

Comments, questions and suggestions welcomed here.

 

http://www.oscommerce.com/community/contributions,2455



Hi there,

 

I noticed that googlebot is not in your long or short spiders.txt file.

 

I also noticed on my site that googlebot is trying to "buy" and it keeps hitting the "cookie_usage.php" file. It was not doing it in the past. Any idea why it would do so and how to stop it?

 

Ari


It is there as "ooglebot", which picks up googlebot and frooglebot. This seems to work for my site. Are others having a problem with this? Look in your server log for entries from googlebot. Do they have session IDs? You do have "Prevent Spider Sessions" set to true, right?
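(For anyone curious how the matching works: the 2.2-MS2 spider check in catalog/includes/application_top.php is roughly the sketch below. It lowercases the user agent and does a plain substring test against each line of spiders.txt, which is why a partial entry like "ooglebot" covers both googlebot and frooglebot.)

// Roughly the stock 2.2-MS2 spider check (application_top.php):
// lowercase the user agent, then test each non-blank line of
// spiders.txt as a plain substring. With "Prevent Spider Sessions"
// set to true, a hit suppresses the session.
$spider_flag = false;
if (isset($HTTP_SERVER_VARS['HTTP_USER_AGENT'])) {
  $user_agent = strtolower($HTTP_SERVER_VARS['HTTP_USER_AGENT']);
  $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');
  for ($i = 0, $n = sizeof($spiders); $i < $n; $i++) {
    if (tep_not_null($spiders[$i]) && is_integer(strpos($user_agent, trim($spiders[$i])))) {
      $spider_flag = true;
      break;
    }
  }
}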


Thanks Steve,

 

...ooglebot sounds good...

 

My settings are fine - googlebot does not create a session; it is just trying to "buy" and then being sent to the cookie_usage page.

 

Here's a sample:

66.249.79.63 - - [17/Oct/2004:02:01:39 -0700] "GET /index.php?cPath=1_27&page=1&sort=5a&action=buy_now&products_id=258 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.79.63 - - [17/Oct/2004:02:01:39 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.59 - - [17/Oct/2004:02:01:54 -0700] "GET /index.php?cPath=1_27&page=1&sort=5d HTTP/1.0" 200 28434 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.207.118.206 - - [17/Oct/2004:02:02:05 -0700] "GET /product_info.php?products_id=505&osCsid=472a70b19e8107dab70be0b6c3ee42e2 HTTP/1.1" 200 27914 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

66.249.78.59 - - [17/Oct/2004:02:02:27 -0700] "GET /index.php?cPath=1_27&page=1&sort=5a&action=buy_now&products_id=138 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.59 - - [17/Oct/2004:02:02:27 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.79 - - [17/Oct/2004:02:03:07 -0700] "GET /product_reviews.php?products_id=499&action=notify HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.79 - - [17/Oct/2004:02:03:07 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.207.118.206 - - [17/Oct/2004:02:03:37 -0700] "GET /index.php?cPath=32_30&osCsid=472a70b19e8107dab70be0b6c3ee42e2 HTTP/1.1" 200 27133 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

66.207.118.206 - - [17/Oct/2004:02:04:22 -0700] "GET /product_reviews.php?products_id=443&osCsid=b8841dddbb1c72e030619851e0b84632 HTTP/1.1" 200 26970 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

66.249.78.54 - - [17/Oct/2004:02:04:37 -0700] "GET /index.php?cPath=1_27&page=1&sort=2a&action=buy_now&products_id=257 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.54 - - [17/Oct/2004:02:04:38 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.9 - - [17/Oct/2004:02:04:52 -0700] "GET /index.php?cPath=1_27&page=1&sort=2a&action=buy_now&products_id=138 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.9 - - [17/Oct/2004:02:04:53 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

 

I have about 200 products, but with this new behavior (it started recently) googlebot is hitting about 1000 pages a day. What a waste!

 

Someone suggested adding a disallow for the cookie usage page in robots.txt. That might cut the number of pages in half, but we really need to prevent it from following the "action=buy_now&products_id=xyz" link. Any idea?
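(For reference, the suggested robots.txt rule is just the two lines below; as noted, it only stops the follow-up fetches of cookie_usage.php, not the buy_now redirects that cause them.)

User-agent: *
Disallow: /cookie_usage.php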

 

Ari


Interesting that I have not seen this behavior on my site, but I don't use "Buy Now". One approach would be to display the "add to cart" or "buy now" button only if a session was started.


What do you use if not "buy now" (or "add to cart")? I did not play with that function at all. Your idea is a good one - some people suggested changing the "add to cart" to a FORM action, which is ignored by all robots. How would you go about not showing the button if the session is not started?

 

Ari


The PHP would simply not display the Buy Now link if there was no session. The spider would not see it and thus not follow it. Add to cart is already a form button, which is why I don't see the problem on my store. Normal users would have sessions so they would get the link.

 

When I get home I'll come up with the code change to make and let you know, if someone doesn't beat me to it. It would be quite simple.



 

Thanks Steve,

 

I think I see what's happening - the product detail page (product_info.php) uses a FORM action, which causes no problem with robots, but the category listing (the template behind "/index.php?cPath=11_22") uses a URL action to add to the cart (the action name is "buy_now"). How did this happen? I don't know. I am sure more templates do the same (search_results.php, etc.). There is a contribution out there that explains how to change everything back to a FORM action. I didn't want to use it because I thought there was a reason for making it a URL action. Now that I find some templates use FORM and others URL, I think I will just change all of them to FORM. What do you think?
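(To illustrate the difference, two hypothetical snippets in the osCommerce idiom; the exact markup varies from template to template, and $products_id stands in for the template's own variable.)

// URL-style action: a plain <a> link, which spiders happily follow.
$lc_text = '<a href="' . tep_href_link('index.php', 'action=buy_now&products_id=' . $products_id) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a>';

// FORM-style action: a POST form with an image submit button. Robots
// generally do not submit forms, so the action is never triggered.
echo tep_draw_form('buy_now_' . $products_id, tep_href_link('index.php', 'action=buy_now'), 'post');
echo tep_draw_hidden_field('products_id', $products_id);
echo tep_image_submit('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW);
echo '</form>';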

 

Lots of folks on other threads have described the same problem. If you confirm this solution, we should post it in a few of those threads.

 

Ari


Ari,

 

That's much too much effort for this. Here's how to fix it.

 

In catalog/includes/modules/product_listing.php at around line 133 is this line:

$lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';

Insert this before that line:

if ($session_started) {

and add after that line:

} else {$lc_text = ' ';}
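Assembled, that section of product_listing.php would read:

// Show the Buy Now link only when a session exists. With "Prevent
// Spider Sessions" on, spiders never get a session, so they never
// see (or follow) the link; normal visitors are unaffected.
if ($session_started) {
  $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';
} else {
  $lc_text = ' ';
}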

That should take care of it.


Thanks Steve,

 

I tried it and it worked - the button goes away (I used Firefox with User Agent Switcher to browse the site).

 

But I ended up turning all the URL actions into FORMs; a few templates have this problem and I wanted to make sure I covered all of them:

product_reviews.php

product_reviews_info.php

products_new.php

all_products.php

 

I will now go and post this solution in another thread that has a similar discussion.

 

thanks again

 

Ari



Steve,

 

I noticed you have exabot in spiders.txt. I get hit by a bot from Exava, supposedly a new search engine. Are these the same? I ask because the eXavaBot that hits my site still generates a session ID.

 

Thanks,

Ed


Steve,

 

Follow-up question. I'm making use of $user_agent to determine the spider's name in a Who's Online contribution. For GoogleBot and MSNBot, $user_agent displayed something like: msnbot/1.0 (+http://www.msn.com).

 

Will $user_agent always return a format like botName/version (URL)? I'm trying to shorten the displayed name by truncating the part in parentheses. Is it always in this format?

 

Thanks,

Ed


No, it isn't. And note that the way my spiders.txt works is to define substrings that are found in multiple user agents. For example, many have "crawl" or "robot". "msnbot" is in there, but some others have substrings (such as "lebot" for googlebot/frooglebot). As I find new robots, I sometimes create new substrings. I will probably add "abot" for exavabot/exabot.

 

For this context, you want a string that matches as many robots as possible without also matching browser agents.

 

A problem I have found is that there are robots out there which provide either NO user agent or pretend they are a browser. Not much can be done about those, to be honest, but the major bots are more reasonable in this regard.
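(So any name-shortening code should be defensive. A minimal sketch, assuming you just want to drop the parenthesized part when one exists; the helper name is made up for illustration:)

// Shorten a raw user-agent string for display: keep only the text
// before the first '(', trimmed. Falls back to the (capped) full
// string when there is no parenthesized part.
function shorten_agent($user_agent) {
  $pos = strpos($user_agent, '(');
  $name = ($pos === false) ? $user_agent : substr($user_agent, 0, $pos);
  $name = trim($name);
  if ($name == '') $name = trim($user_agent);
  return substr($name, 0, 32);
}

// e.g. "msnbot/1.0 (+http://www.msn.com)" -> "msnbot/1.0"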


Thanks for the great idea to keep 'spiders.txt' up-to-date, Steve!

 

If you haven't already, you might want to take a look at Search Engine World's 'robots4.txt', a nice list of known, well-behaved spiders.

 

I have used this info at times, since it also includes some of those that disguise themselves as browsers.

 

Regards, and keep it up!

Matthias



Thanks - that is a useful resource, but I try to balance, at least in the smaller file, the number of entries against the prevalence of the spider. Lots of the spiders listed there are inactive. But I also know that I have seen spiders visit my site that aren't on that list.

 

I will shortly be posting another update with some further optimization (and catching more spiders).


Since I'm just spidering myself into a zillion sessions ;-) ...

 

Say, if I wanted to catch the variants of a spider that identifies itself as any of

Xenu

Xenu Link Sleuth 1.1e

Xenu Link Sleuth 1.2f

what would I put in? Just "Xenu", or better "xenu"? Is the matching case-sensitive? And would "Link Sleuth" also catch it, for example?


Perfect. Thanks for your ongoing support, really!

 

You might want to include 'xenu' in your spiders.txt, since some people might now be more aware of this nice tool and start spidering around... and it can hit your site hard if your robots.txt isn't carefully laid out. By the way, many SEOs use it too.

 

My first try on a simple osC test installation followed around 65,000 links and produced about a 510 MB log file. Imagine that guy hitting your site with 100 threads simultaneously, plus CREATING SESSIONS. Phew!

 

If you need to contact Xenu's author, Tilman Hausherr, let me know. He's quite helpful and actually wrote down my instructions on how to "set sessions off for Xenu". He told me most users of his spider got sessions, and he was quite amazed that we could simply "switch them off"... ;-)

