stevel

Updated spiders.txt Official Support Topic

597 posts in this topic

Hi there,

 

I noticed that googlebot is not in your long or short spiders.txt file.

 

I also noticed on my site that googlebot is trying to "buy" and keeps hitting the "cookie_usage.php" file. It did not do this in the past. Any idea why, and how to stop it?

 

Ari

It is there as "ooglebot", which picks up googlebot and frooglebot. This seems to work for my site. Are others having a problem with this? Look in your server log for entries from googlebot. Do they have session IDs? You do have "Prevent Spider Sessions" set to true, right?

Thanks Steve,

 

...ooglebot sounds good...

 

My settings are fine - googlebot does not create a session, it is just trying to "buy" and then it is being sent to the cookie_usage page.

 

Here's a sample:

66.249.79.63 - - [17/Oct/2004:02:01:39 -0700] "GET /index.php?cPath=1_27&page=1&sort=5a&action=buy_now&products_id=258 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.79.63 - - [17/Oct/2004:02:01:39 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.59 - - [17/Oct/2004:02:01:54 -0700] "GET /index.php?cPath=1_27&page=1&sort=5d HTTP/1.0" 200 28434 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.207.118.206 - - [17/Oct/2004:02:02:05 -0700] "GET /product_info.php?products_id=505&osCsid=472a70b19e8107dab70be0b6c3ee42e2 HTTP/1.1" 200 27914 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

66.249.78.59 - - [17/Oct/2004:02:02:27 -0700] "GET /index.php?cPath=1_27&page=1&sort=5a&action=buy_now&products_id=138 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.59 - - [17/Oct/2004:02:02:27 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.79 - - [17/Oct/2004:02:03:07 -0700] "GET /product_reviews.php?products_id=499&action=notify HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.79 - - [17/Oct/2004:02:03:07 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.207.118.206 - - [17/Oct/2004:02:03:37 -0700] "GET /index.php?cPath=32_30&osCsid=472a70b19e8107dab70be0b6c3ee42e2 HTTP/1.1" 200 27133 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

66.207.118.206 - - [17/Oct/2004:02:04:22 -0700] "GET /product_reviews.php?products_id=443&osCsid=b8841dddbb1c72e030619851e0b84632 HTTP/1.1" 200 26970 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

66.249.78.54 - - [17/Oct/2004:02:04:37 -0700] "GET /index.php?cPath=1_27&page=1&sort=2a&action=buy_now&products_id=257 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.54 - - [17/Oct/2004:02:04:38 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.9 - - [17/Oct/2004:02:04:52 -0700] "GET /index.php?cPath=1_27&page=1&sort=2a&action=buy_now&products_id=138 HTTP/1.0" 302 0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

66.249.78.9 - - [17/Oct/2004:02:04:53 -0700] "GET /cookie_usage.php HTTP/1.0" 200 21764 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

 

I have about 200 products, but with this "new" behavior (it started recently), googlebot is hitting about 1000 pages a day. What a waste!

 

Someone suggested adding a disallow for the cookie usage page in robots.txt. That might cut the number of pages in half, but we really need to prevent it from following the "action=buy_now&products_id=xyz" link. Any ideas?
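For reference, that robots.txt suggestion would look something like the sketch below. Note that a plain Disallow rule only matches URL path prefixes, so it cannot exclude the action=buy_now query string under the original robots.txt convention; Googlebot, however, does understand * wildcards as an extension, so a Googlebot-specific rule can target those links:

```
# Block all compliant crawlers from the cookie notice page
User-agent: *
Disallow: /cookie_usage.php

# Googlebot supports * wildcards (a nonstandard extension),
# so its buy_now hits can be excluded specifically:
User-agent: Googlebot
Disallow: /*action=buy_now
```

This only reduces the wasted crawling; hiding the link itself (as discussed below in the thread) is the cleaner fix.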

 

Ari

Interesting that I have not seen this behavior on my site, but I don't use "Buy Now". One approach would be to display the "add to cart" or "buy now" button only if a session was started.

What do you use if not "buy now" (or "add to cart")? I did not play with that function at all. Your idea is a good one - some people suggested changing "add to cart" to a FORM action, which is ignored by all robots. How would you go about not showing the button if the session is not started?

 

Ari

The PHP would simply not display the Buy Now link if there was no session. The spider would not see it and thus not follow it. Add to cart is already a form button, which is why I don't see the problem on my store. Normal users would have sessions so they would get the link.

 

When I get home I'll come up with the code change to make and let you know, if someone doesn't beat me to it. It would be quite simple.

Thanks Steve,

 

I think I see what's happening: the product detail page ("product_info.php") uses a FORM action, which causes no problem with robots, but the category listing template behind "/index.php?cPath=11_22" uses a URL action to add to the cart (the action name is "buy_now"). How did this happen? I don't know. I am sure more templates do the same (search_results.php, etc.). There is a contribution out there that explains how to change everything back to a FORM action. I didn't want to use it because I thought there was a reason for making it a URL action. Now that I find some templates use FORM and others use URL, I think I will just change all of them to FORM. What do you think?

 

Lots of folks, on other threads, have described the same problem. If you confirm this solution, we should post this message on a few other threads.

 

Ari

Ari,

 

That's much too much effort for this. Here's how to fix it.

 

In catalog/includes/modules/product_listing.php at around line 133 is this line:

$lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';

Insert before that line this:

if ($session_started) {

and add after that line:

} else {$lc_text = ' ';}

That should take care of it.
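Put together, the patched section of product_listing.php would read like this (the same code as above, just shown in context; the exact line number varies by osCommerce version):

```
// Only emit the Buy Now link when a real session exists.
// With "Prevent Spider Sessions" on, spiders never get a
// session, so they never see the action=buy_now URL.
if ($session_started) {
    $lc_text = '<a href="' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('action')) . 'action=buy_now&products_id=' . $listing['products_id']) . '">' . tep_image_button('button_buy_now.gif', IMAGE_BUTTON_BUY_NOW) . '</a> ';
} else {
    $lc_text = ' ';
}
```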

Thanks Steve,

 

I tried it, it worked - the button goes away (I used Firefox with User Agent Switcher to browse the site).

 

But I ended up turning all the URL actions into FORM, there are a few templates that have this problem and I wanted to make sure I covered all of them.

product_reviews.php

product_reviews_info.php

products_new.php

all_products.php

 

I will now go and post this solution in another thread that has a similar discussion.

 

thanks again

 

Ari

Steve,

 

I noticed you have exabot in spiders.txt. I get hit by a bot from Exava, supposedly a new search engine. Are these the same? I ask because the eXavaBot that hits my site still generates a session ID.

 

Thanks,

Ed

Hmm - looks as if they changed the spider name. Go ahead and add "abot" to the list and I'll update soon.

Steve,

 

Follow-up question. I'm using $user_agent to determine the spider's name in a Who's Online contribution. For GoogleBot and MSNBot, $user_agent displayed something like: msnbot/1.0 (+http://www.msn.com).

 

Will $user_agent always return a format like botName/version (URL)? I'm trying to shorten the name by truncating the part in parentheses. Is it always in this format?

 

Thanks,

Ed

No, it isn't. And note that the way my spiders.txt works is to define substrings that are found in multiple user agents. For example, many have "crawl" or "robot". "msnbot" is in there, but some others have substrings (such as "lebot" for googlebot/frooglebot). As I find new robots, I sometimes create new substrings. I will probably add "abot" for exavabot/exabot.
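As a sketch of the matching idea (not the exact osCommerce code - the real check lives in the includes, and the file path and variable names here are assumptions for illustration): the user agent is lowercased and tested against each substring in spiders.txt.

```
// Sketch only: illustrates substring matching against spiders.txt.
// File location and variable names are assumed, not the stock code.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
$spider_flag = false;
foreach (file('includes/spiders.txt') as $spider) {
    $spider = trim($spider);
    // e.g. "lebot" matches both Googlebot and Frooglebot
    if ($spider != '' && strpos($user_agent, $spider) !== false) {
        $spider_flag = true;
        break;
    }
}
```

This is why a short substring like "abot" can cover exabot and exavabot at once, at the cost of a small risk of matching an unrelated agent.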

 

For this purpose, you want a string that matches as many robots as possible without also matching browser user agents.

 

A problem I have found is that there are robots out there which provide either NO user agent or pretend they are a browser. Not much can be done about those, to be honest, but the major bots are more reasonable in this regard.

Thanks for the great idea to keep 'spiders.txt' up-to-date, Steve!

 

If you haven't already, you might want to take a look at Search Engine World's 'robots4.txt' - it's a nice list of 'known-well-behaving' spiders.

 

I have used this info at times, since it also includes some of those that 'disguise' as browsers.

 

Regards, and keep it up!

Matthias

Edited by mhormann


Thanks - that is a useful resource, but I try to balance, at least in the smaller file, the number of entries against the prevalence of the spider. There are lots of spiders listed there that are inactive. But I also know that I have seen spiders visit my site that aren't on that list.

 

I will shortly be posting another update with some further optimization (and catching more spiders.)

Edited by stevel


Since I'm just spidering myself into a zillion sessions ;-) ...

 

Say, if I wanted to catch variants of a spider that identifies itself as any of

 

Xenu

Xenu Link Sleuth 1.1e

Xenu Link Sleuth 1.2f

 

what would I put in? Just

 

Xenu

 

or better

 

xenu

 

Is it case-sensitive? Would it also catch a variant that uses just "Link Sleuth", for example?


You would want just:

 

xenu

 

The strings are forced to lowercase, so only lowercase will match.


Perfect. Thanks for your ongoing support, really!

 

You might want to include 'xenu' in your spiders.txt, since more people may now be aware of this nice tool and start spidering around... and it can hit your site hard if your robots.txt isn't carefully laid out. Btw, many SEOs use it too.

 

My first try on a simple osC test installation spidered along about 65,000 links and produced about a 510 MB log file. Imagine that guy hitting your site with 100 threads simultaneously, plus CREATING SESSIONS. Phew!

 

If you need to contact Xenu's author, Tilman Hausherr, let me know. He's quite helpful and actually wrote down my instructions on how to "set sessions off for Xenu". He told me most users of his spider got sessions, and he was quite amazed that we could simply "switch them off"... ;-)


I know the Xenu tool well - I use it myself. Good idea to add xenu to the list - thanks.


Lewis,

 

Yep. Comes by my sites regularly. It's Microsoft's newest search engine, which is supposed to compete with Google.

 

ed
