stevel

Updated spiders.txt Official Support Topic

FYI ... I am being hit by Become.com a lot. It is constantly going through my entire site. Can someone add this to the spiders.txt list?


Remember what the Bible says: He who is without sin, cast the first rock. And I shall smoketh it.

The purpose of spiders.txt is not to keep spiders out of your site - it's to keep spiders from creating sessions.  Do you see become.com's spider getting sessions?

 

deja vu?


Answers to osCommerce's most persistent questions! Tips & Tricks | Configuration | Common Problems.

Seek and ye shall find Contributions.

My Contributions

My Blog


Steve,

Can you talk to me about "Inktomi"? It's not in your spiders.txt file...for a reason?

 

It crawls my site ALL OF THE TIME!!!

 

It has a different IP address with every single URL it crawls.

 

It "appears" to create sessions also...however, I am not quite sure how to REALLY see what it is doing...should I look in the log files??? I am simply using my Visitor Stats contrib to see it.

 

Curious about what information you might have on this. I have searched the forum but found nothing pertinent.

 

Thanks in advance!

 

Regards,

Siddall


What about msnbot? Isn't that one of the biggest spiders you have issues with?

 

...or did I just miss it in the most recent spiders.txt contrib of yours?


msnbot is detected by the string "nbot".

 

As for Inktomi - in the past, this has been associated with Yahoo slurp, and would be detected by "slurp". In my own logs, I don't see any spiders with "Inktomi" in the user agent. Give me some sample lines from your access logs that show Inktomi. I believe you, as I've seen complaints from other webmasters about an Inktomi spider that has run wild, but I have not seen it myself and can't find references to a specific user agent.
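
To illustrate why a short string like "nbot" catches msnbot: the idea is that each spiders.txt entry is treated as a substring to look for in the lower-cased user agent. Below is a minimal sketch of that idea. The function name ua_matches_spider and the three sample entries are made up for illustration; this is not the stock osCommerce code.

```php
<?php
// Illustrative sketch only: hypothetical helper, not the stock osCommerce code.
// Assumes each spiders.txt entry is a substring of a spider's user agent.
function ua_matches_spider(string $userAgent, array $patterns): bool
{
    $ua = strtolower($userAgent);
    foreach ($patterns as $pattern) {
        $pattern = strtolower(trim($pattern));
        // Skip blank lines; otherwise look for the entry anywhere in the user agent.
        if ($pattern !== '' && strpos($ua, $pattern) !== false) {
            return true;
        }
    }
    return false;
}

// Sample entries mentioned in this thread; a real spiders.txt has many more lines.
$patterns = ['nbot', 'slurp', 'crawl'];

var_dump(ua_matches_spider('msnbot/1.0 (+http://search.msn.com/msnbot.htm)', $patterns));
var_dump(ua_matches_spider('Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)', $patterns));
var_dump(ua_matches_spider('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)', $patterns));
```

The first two calls match ("nbot" and "slurp"), the last one does not, which is why a user agent has to contain one of the listed strings before it is treated as a spider.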


Is anyone going to offer access log lines for this supposed Inktomi spider? I'm puzzled that my own sites seem unaffected by this (yet are indexed by dozens of other spiders.)


 

STEVE, you appear to be correct according to the log file (have an excerpt of it below for your review)...the "Inktomi" spider is related to (or the same as) the Yahoo! Slurp stuff (that's a technical term) that you mentioned.

 

[It may be easier to cut and paste the excerpt below into something that doesn't word-wrap.]

 

HOWEVER, the log excerpt below (which is Inktomi/Slurp specific) shows that when it crawls, it uses a different IP with almost EVERY link it crawls. I'll get 10-20 different IP hits (Inktomi-specific) within an hour...a few times a day.

 

Sidenote: it doesn't seem to be creating near as many sessions as it used to...but there are a few (search on osCsid).

 

I don't know if anything can be done about this, but I thought it was worth noting, just in case there is something that I am missing or not understanding.

 

68.142.251.148 - - [09/Jul/2005:00:05:11 -0500] "GET /robots.txt HTTP/1.0" 200 806 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.151 - - [09/Jul/2005:00:05:13 -0500] "GET /the-bad-sports-mlb-c-21_27_22_47.html HTTP/1.0" 200 3228 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.93 - - [09/Jul/2005:00:15:54 -0500] "GET /-c-23_35_45.html HTTP/1.0" 200 2878 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.189 - - [09/Jul/2005:00:16:01 -0500] "GET /stop-evil-internet-p-55.html?action=notify HTTP/1.0" 302 26 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.91 - - [09/Jul/2005:00:28:39 -0500] "GET /stop-lying-yourself-pr-84.html HTTP/1.0" 200 3661 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.32 - - [09/Jul/2005:00:49:58 -0500] "GET /the-good-c-22_38_23_36_21.html HTTP/1.0" 200 3934 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.48 - - [09/Jul/2005:00:51:49 -0500] "GET /good-animal-rights-c-22_38_23_36_21_48.html HTTP/1.0" 200 3886 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.249.72 - - [09/Jul/2005:00:53:13 -0500] "GET /the-sign-c-22_38_23_36_21_48_41.html HTTP/1.0" 200 3204 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.30 - - [09/Jul/2005:00:53:14 -0500] "GET /-c-22_38_23_36_21_48_45.html HTTP/1.0" 200 2949 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.153 - - [09/Jul/2005:00:55:46 -0500] "GET /-c-22_38_23_36_21_48_42.html HTTP/1.0" 200 2931 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.148 - - [09/Jul/2005:01:20:50 -0500] "GET /robots.txt HTTP/1.0" 200 806 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.68 - - [09/Jul/2005:01:20:51 -0500] "GET /the-good-c-23_39_21_26_21.html?sort=3a&page=1 HTTP/1.0" 200 4555 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.171 - - [09/Jul/2005:01:27:26 -0500] "GET /stop-supporting-murder-troops-p-124.html?action=notify HTTP/1.0" 302 26 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.249.148 - - [09/Jul/2005:01:33:37 -0500] "GET /-c-23_39_21_27_22_31_44.html HTTP/1.0" 200 2927 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.249.140 - - [09/Jul/2005:01:41:47 -0500] "GET /the-good-c-21_25_21.html?page=3&sort=2a HTTP/1.0" 200 3497 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.172 - - [09/Jul/2005:01:46:36 -0500] "GET /product_info.php?products_id=218&osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 301 3771 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.172 - - [09/Jul/2005:01:46:38 -0500] "GET /stop-take-time-recycle-graphic-p-218.html?osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 200 3765 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.110 - - [09/Jul/2005:01:52:45 -0500] "GET /-c-22_31_43.html HTTP/1.0" 200 2887 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.249.115 - - [09/Jul/2005:02:10:29 -0500] "GET /stop-puppy-mills-p-182.html HTTP/1.0" 200 3736 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.64 - - [09/Jul/2005:02:17:18 -0500] "GET /-c-22_38_23_33_42.html HTTP/1.0" 200 2922 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.171 - - [09/Jul/2005:02:17:47 -0500] "GET /the-bad-c-22_38_23_35_22.html?sort=2d&page=1 HTTP/1.0" 200 4694 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.148 - - [09/Jul/2005:02:23:07 -0500] "GET /robots.txt HTTP/1.0" 200 806 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.11 - - [09/Jul/2005:02:23:07 -0500] "GET /stop-spelled-backwards-pots-p-217.html?action=notify HTTP/1.0" 302 26 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.67 - - [09/Jul/2005:02:34:29 -0500] "GET /stop-supporting-republicans-pr-115.html HTTP/1.0" 200 3653 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.72 - - [09/Jul/2005:02:46:47 -0500] "GET /the-good-c-21_24_21.html?sort=3a&page=1 HTTP/1.0" 200 4331 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.64 - - [09/Jul/2005:02:46:54 -0500] "GET /catalog/product_info.php?products_id=220&osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 302 361 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.64 - - [09/Jul/2005:02:46:55 -0500] "GET /product_info.php?products_id=220&osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 301 3813 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.64 - - [09/Jul/2005:02:46:56 -0500] "GET /stop-voices-head-p-220.html?osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 200 3806 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.132 - - [09/Jul/2005:03:00:56 -0500] "GET /the-bad-anarchy-c-22_38.html?page=2&sort=2a HTTP/1.0" 200 3427 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.34 - - [09/Jul/2005:03:09:05 -0500] "GET /stop-worrying-enjoy-your-kitten-p-175.html?action=notify HTTP/1.0" 302 26 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.148 - - [09/Jul/2005:03:29:12 -0500] "GET /robots.txt HTTP/1.0" 200 806 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.73 - - [09/Jul/2005:03:29:13 -0500] "GET /the-bad-c-22_38_23_36_21_27_22.html HTTP/1.0" 200 4499 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.80 - - [09/Jul/2005:03:48:40 -0500] "GET /index.php?cPath=45&osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 301 2844 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.80 - - [09/Jul/2005:03:48:41 -0500] "GET /-c-45.html?osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 200 2876 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.206 - - [09/Jul/2005:03:57:26 -0500] "GET /product_info.php?products_id=228 HTTP/1.0" 301 3692 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.251.206 - - [09/Jul/2005:03:57:27 -0500] "GET /stop-those-packersgo-vikings-p-228.html HTTP/1.0" 200 3656 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.249.126 - - [09/Jul/2005:04:04:00 -0500] "GET /-c-23_39_21_27_44.html HTTP/1.0" 200 2917 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.122 - - [09/Jul/2005:04:07:52 -0500] "GET /stop-those-packersgo-vikings-p-228.html HTTP/1.0" 200 3656 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 
68.142.250.110 - - [09/Jul/2005:04:15:14 -0500] "GET /stop-violence-pr-108.html HTTP/1.0" 200 3645 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
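
For anyone who wants to do the "search on osCsid" step on a larger log, here is a rough sketch that counts Slurp hits, distinct IPs, and session IDs seen in URLs. It assumes the combined log format shown in the excerpt above; the function name summarize_slurp_hits and the array keys it returns are hypothetical.

```php
<?php
// Rough sketch: summarise Slurp requests in combined-format access log lines.
// summarize_slurp_hits() and its array keys are hypothetical names.
function summarize_slurp_hits(array $lines): array
{
    $ips = [];
    $sessionIds = [];
    $hits = 0;
    foreach ($lines as $line) {
        if (stripos($line, 'slurp') === false) {
            continue; // not a Slurp request
        }
        $hits++;
        // The client IP is the first space-delimited field of the line.
        $ips[] = strtok(trim($line), ' ');
        // Collect any session ID passed in the URL.
        if (preg_match('/osCsid=([0-9a-f]+)/i', $line, $m)) {
            $sessionIds[] = $m[1];
        }
    }
    return [
        'hits' => $hits,
        'distinct_ips' => count(array_unique($ips)),
        'session_ids' => array_values(array_unique($sessionIds)),
    ];
}

// Three lines copied from the excerpt above. For a whole log one could pass
// file('/path/to/access_log', FILE_IGNORE_NEW_LINES) instead (the path is an assumption).
$sample = [
    '68.142.251.148 - - [09/Jul/2005:00:05:11 -0500] "GET /robots.txt HTTP/1.0" 200 806 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '68.142.251.172 - - [09/Jul/2005:01:46:36 -0500] "GET /product_info.php?products_id=218&osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 301 3771 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '68.142.251.172 - - [09/Jul/2005:01:46:38 -0500] "GET /stop-take-time-recycle-graphic-p-218.html?osCsid=d56479bf7ae04c02366c3fc1b38a5d33 HTTP/1.0" 200 3765 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
];

print_r(summarize_slurp_hits($sample));
```

If the same osCsid value keeps showing up across many requests and IPs, that would be consistent with the spider re-requesting an old URL that already carried a session ID rather than being handed a new session each time.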


Ok - that is Slurp, and will be detected as such by spiders.txt. I suspect you'll find that the sessions it gets are actually old sessions, but they'll be recreated when it tries again. The contrib Spider Session Remover will help with that.

 

A comment - you evidently have the sort headers enabled in the product listings (that's the default.) The spiders will try every combination of sort, which means a lot of duplicate accesses. You can either disable the sort links entirely (which I did - I don't find them useful), or can make them active only when a session is active. A programming exercise left to the reader.


Hi All, :'(

 

I'm so glad this thread exists. I have a problem that I am hoping can be solved, and that can also be a point of reference on the forum for others with a similar issue.

 

Since installing a few stores, our monthly server bandwidth has risen to amazing numbers: almost at our limit, and in some cases well over it.

 

Looking at stats it appears the spiders are causing this, MSNBOT totalling in some months over 600MB worth. I think I am having the same issue others are. I am not very technical and would like a step by step (1. 2. 3.) guide on how I can stop this from happening.

 

While I understand they are important I do not want to turn them off or restrict access. Any ideas for a complete newb at this? What spiders.txt I should use, code I should add to files or settings in Admin I should use? I need to apply this to 3 stores I have set up and added over 1000 items.

 

Thanks!!!! And great work guys!


Well, if you have a lot of pages in your store, the spiders are going to want to search them. The purpose of spiders.txt is to prevent spiders from getting sessions, so that they can't go on a wild shopping spree in your store, but they can still index the pages. Just replace the stock spiders.txt with the one in this contribution, and watch the announcement thread for updates to keep it current. spiders.txt does not keep spiders out of your store.

 

If the spiders are simply crawling your store to index its content, that's goodness. If you have a store with lots of products, you will probably need to move to a host that offers more bandwidth unless you want to keep spiders out.


I'm not the "sharpest tool in the shed", so forgive me if this is already included in your list.

 

I am including the whole string because I don't know what info you use.

 

sv-crawlfw3.looksmart.com

 

By the way, thanks for keeping the list up-to-date. You do a terrific job, and I appreciate that you take the time to do this for everyone.


Joe

 

Code? PHP? What's that? Maybe I'll have it figured out in about 5 years, goodness knows I'm trying!


"crawl" is in the entry that will catch this one.

 

Thanks for the kind words. And if people are wondering why there haven't been any recent updates, it's because I haven't detected any new spiders on my sites. (I don't bother with those that just touch the home page and go no further.)


Hi Steve - great contrib. Just a quick question to see if I can save a little more bandwidth. I get around 25,000 spider page requests per day, which is perfect, as it keeps my rankings nice and high in the engines' result pages and also shows my new content very quickly. However, I am using coolmenu on all my pages, and as I understand it, spiders don't read JavaScript, meaning my pages all have roughly 25kb of JavaScript code which is completely irrelevant to the spiders. Is there a way to check the user agent in my index.php file with an if statement, so that the coolmenu script is skipped when a spider is detected as the browser and a default categories box is displayed instead?

 

This little bit of coding could save me nearly a quarter of my bandwidth every month!

 

From what I understand (very limited in terms of this, but here goes), something like this:

<?php if ($user_agent == 'ooglebot') { ?>
<?php require(DIR_WS_BOXES . 'categories.php'); ?>
<?php } else { ?>
<?php require(DIR_WS_BOXES . 'coolmenu.php'); ?>
<?php } ?>

Would that work? Also, is there an easy way to make it work for every spider listed in your spiders.txt file instead of listing them individually? If not, how do I add an OR condition?

 

Sorry for the imbecilic questions, this is an area of php coding that is very grey to me.....any and all help very much appreciated.


Please note - if I have suggested a contrib above, it doesn't mean it will work! Most of the contribs are not ones I've used, but may be useful for your particular problem....

Have you tried a refined search? Chances are your problem has already been dealt with elsewhere on the forums.....

if (stumped == true) {

return(square_one($start_over)

} else {

$random_query = tep_fetch_answer($forum_query)

}


Simply use:

if ($spider_flag) {
  // spider detected: do not show the js menu
} else {
  // normal visitor: show the js menu
}
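
To make that a little more concrete, here is a sketch that picks which box file to include. It assumes $spider_flag has already been set by the store's spider detection before this code runs. The helper name choose_menu_box is hypothetical, and the two file names are simply the ones from the question above.

```php
<?php
// Sketch: decide which menu box to include based on the spider flag.
// choose_menu_box() is a hypothetical helper; the file names come from the question.
function choose_menu_box(bool $spiderFlag): string
{
    // Spiders get the plain categories box, everyone else gets the JavaScript menu.
    return $spiderFlag ? 'categories.php' : 'coolmenu.php';
}

// In a real page $spider_flag would already be set by the store;
// it is hard-coded here so the sketch runs on its own.
$spider_flag = true;
echo choose_menu_box($spider_flag), "\n";

// In the store template this would become something like:
// require(DIR_WS_BOXES . choose_menu_box($spider_flag));
```

Because the flag covers every entry in spiders.txt, there is no need to list spiders individually or chain OR conditions in the template.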


Treasurer MFC


Hi Amanda - where would I put that? In my application_top?



For sites hosted on Mac OS X, your server may not be able to properly identify the line breaks in spiders.txt.

 

This can be fixed by either editing auto_detect_line_endings in php.ini or adding an ini_set to includes/application_top.php.

 

For full details, see this thread: http://forums.oscommerce.com/index.php?showtopic=170026
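
A sketch of the ini_set route, plus an alternative that does not depend on the setting. The alternative (splitting the file contents on any line-ending style) is a substitution of my own rather than what the linked thread prescribes, and parse_spider_list is a hypothetical name. Note that auto_detect_line_endings is deprecated as of PHP 8.1, so the second option may age better.

```php
<?php
// Option 1 (from the post above): let PHP recognise old Mac (\r) line endings.
// This would go before spiders.txt is read, e.g. in includes/application_top.php:
// ini_set('auto_detect_line_endings', '1');

// Option 2 (an alternative, not from the thread): split on any line-ending style.
// parse_spider_list() is a hypothetical helper name.
function parse_spider_list(string $contents): array
{
    $lines = preg_split('/\r\n|\r|\n/', $contents);
    // Trim each entry and drop empty lines.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

// A list saved with bare \r line endings still yields one entry per line.
print_r(parse_spider_list("nbot\rslurp\rcrawl\r"));
```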


Hi all!

 

Steve you've done nice work :D

 

Simple question from a newbie:

 

Does setting Force Cookie Use = TRUE prevent spiders from indexing my osCommerce site entirely?


No - it prevents spiders from successfully completing an action that requires a session, such as Buy Now or Notify, but it doesn't prevent spiders from indexing such links. This option will keep some customers from purchasing at your store and will break your store if your domain name for HTTPS is not the same as for HTTP.

 

Neither use of spiders.txt nor Force Cookie Use prevents spiders from indexing your store. You should do what you can to prevent spiders from following links you don't want to appear in a search engine index. A robots.txt is one tool, another is to not display links such as Buy Now if $session_started is false.


Typically you'd need to modify the arguments to tep_href_link to add the rel= value. I prefer to not display the links at all - where the code says:

 

echo '<a href="' . tep_href_link(...

 

I change it to:

 

if ($session_started) echo '<a href="' . tep_href_link(...

 

Some of the places I do that are: Buy Now, Notify (in fact, I have the whole notify box removed if no session), login, checkout, My Account, etc. One can also do this for the product listing sort links (very important, I feel.)
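
As a self-contained sketch of the same pattern: the helper below returns a link only when a session exists and an empty string otherwise. session_only_link is a hypothetical name introduced for illustration; in the store itself the URL would come from tep_href_link() and the flag from $session_started.

```php
<?php
// Sketch: emit a link only when a session exists.
// session_only_link() is a hypothetical helper, not part of osCommerce.
function session_only_link(bool $sessionStarted, string $url, string $label): string
{
    if (!$sessionStarted) {
        return ''; // visitors without a session (e.g. spiders) never see the link
    }
    return '<a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($label) . '</a>';
}

// With a session the link is printed; without one, nothing is.
echo session_only_link(true, 'product_info.php?products_id=218&action=notify', 'Notify me'), "\n";
echo session_only_link(false, 'product_info.php?products_id=218&action=notify', 'Notify me'), "\n";
```

The same check could wrap the product listing sort links, so spiders stop requesting every sort and page combination.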
