stevel

Updated spiders.txt Official Support Topic


If you add "gecko", you'll be filtering out everyone who uses Mozilla or Firefox. What is the actual user agent line in your log? I have not seen an actual spider with "gecko" in the user agent.
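A quick sketch (not the stock osCommerce code, but the same case-insensitive substring test it performs against spiders.txt entries) shows why a "gecko" entry would match a real Firefox browser:

```php
<?php
// Illustrative sketch of how spiders.txt matching works: each entry is
// treated as a lowercase substring of the user agent, so a "gecko" entry
// matches every Gecko-based browser, not just spiders.
function matches_spider_entry($user_agent, $entry)
{
    return strpos(strtolower($user_agent), strtolower(trim($entry))) !== false;
}

$firefox_ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.8) Gecko/20050511 Firefox/1.0.4';
var_dump(matches_spider_entry($firefox_ua, 'gecko')); // bool(true): a real browser would be flagged
```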


Just found this spider crawling my site: ingrid

It is from the Dutch search engine ilse.nl; maybe you want to add it in your next release.

 

Thanks


Moved from the announcement thread:

possible spider with multiple ip addresses and strange user agent:

 

[11-May-2005 13:03:36] SPIDER? NO language: mozilla/4.0 (compatible; msie 4.0; windows nt; ....../1.0 )

ip: 12.17.130.27

 

 

[11-May-2005 13:04:06] SPIDER? NO language: mozilla/4.0 (compatible; msie 4.0; windows nt; ....../1.0 )

ip: 65.164.129.91

 

 

[11-May-2005 13:04:26] SPIDER? NO language: mozilla/4.0 (compatible; msie 4.0; windows nt; ....../1.0 )

ip: 207.155.199.163

 

I'm not sure why you think these entries are spiders; there isn't even a complete user agent string there. It COULD be a spider, but because the user agent is that of MSIE, there's nothing spiders.txt could do about it. Nor do I think it's worth worrying about unless the actual accesses are causing you trouble.

 

I have seen the occasional spider-like activity with user agents that look like ordinary browsers, but they are infrequent and clearly not associated with a public search engine.

 

Howard, thanks for the mention of ingrid.


http://www.webmasterworld.com/forum97/186.htm


Treasurer MFC



http://www.webmasterworld.com/forum11/2611.htm


Treasurer MFC


Interesting - so the "...." is actually in the user agent string? I don't see that anyone has positively identified this as a bot. The msnbot guy disclaims ownership.

 

The first two IPs you give don't have a backtranslation (reverse DNS entry) at all. The third belongs to xo.com. My guess is that this is more of a worm than a spider. What files was it accessing on your site?


(Yes, the .... is part of it.)

 

Well, they access all of my pages (ignoring robots.txt) but in a very strange order, one that would be impossible for a normal user: for example, two different pages requested in succession, both with the language=en parameter. Only robots can do that; it is not possible when viewing the site normally.

 

All three IPs generated about 750 visitor-stat entries between them in about an hour before I started looking for a possible explanation on the internet.


Treasurer MFC


Bizarre.

 

Well, you can add the string .... to spiders.txt; that will at least prevent it from getting a session. If you want to block those IPs, you can do it in an .htaccess file.
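For hosts where editing .htaccess isn't an option, the same IPs could be rejected at the PHP level instead. A minimal sketch, under the assumption it sits near the top of includes/application_top.php (the IP list is taken from the log excerpts earlier in the thread):

```php
<?php
// Sketch only: refuse requests from the IPs logged earlier in the thread
// before any session handling runs.
$blocked_ips = array('12.17.130.27', '65.164.129.91', '207.155.199.163');

$remote_ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
if (in_array($remote_ip, $blocked_ips)) {
    header('HTTP/1.0 403 Forbidden'); // turn the visitor away outright
    exit;
}
```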


I'm getting hit by this visitor, and they're even adding stuff to my cart for items that don't have a buy button.

 

I see

 

 /catalog/product_reviews.php?products_id=145&action=buy_now.....

 

at times, which adds items to my cart, even for items that DON'T have a buy button!

 

Here's an entry in my server log:

 

modemcable164.221-203-24.mc.videotron.ca - - [16/May/2005:21:02:38 -0700] "GET /catalog/product_reviews_write.php?cPath=31&products_id=145&osCsid=1aefb837b7347eb5a033dfdf6acd7058 HTTP/1.0" 302 0 "http://www.thepartyfowl.com/catalog/product_reviews.php?cPath=31&products_id=145&osCsid=1aefb837b7347eb5a033dfdf6acd7058" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

 

 

What user agent string should I add to my spiders.txt file to keep him out? I don't see anything unique to this bot.

 

(I have also now added Disallow: /catalog/product_reviews.php to my robots.txt file.)

 

Alternatively, I have just added his IP (obtained from the osC user-tracking contribution) to .htaccess. Is this right? Does it go in /html/.htaccess or in /html/catalog/.htaccess?

 

I keep up to date on the posted spiders.txt from this thread.

 

Thanks



 

Mike


There's nothing you can add to spiders.txt for this; the user agent string makes it look exactly like MSIE. It's not clear to me that this is any sort of spider, as it is running from a cable modem connection.

 

You can add "Deny from 64.221.203.24" in .htaccess, but this access is not behaving like a search engine spider.


thanks...

 

I am getting hundreds of hits (plus a session ID and some cart additions) from this source in a short span, every couple of weeks.

 

Hopefully .htaccess will prevent this.


Thanks

 

Mike


I'm wondering if a slight change in the code could help out. It appears that 'spiders.txt' is used to filter out user agents; I suggest adding an 'approved_agents.txt' file that is used to filter in user agents.

 

A bit of code would need to be added in /includes/application_top.php:

 

 

  } elseif (SESSION_BLOCK_SPIDERS == 'True') {
    $user_agent = strtolower(getenv('HTTP_USER_AGENT'));
    $spider_flag = false;

    if (tep_not_null($user_agent)) {
// bof approved_agent
      $approved_agents = file(DIR_WS_INCLUDES . 'approved_agents.txt');
      $is_approved_agent = false;

      for ($i = 0, $n = sizeof($approved_agents); $i < $n; $i++) {
        if (tep_not_null($approved_agents[$i])) {
          if (is_integer(strpos($user_agent, trim($approved_agents[$i])))) {
            // found an approved agent
            $is_approved_agent = true;
            break;
          }
        }
      }

      if ($is_approved_agent == false) {
// eof approved_agent
        $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');

        for ($i = 0, $n = sizeof($spiders); $i < $n; $i++) {
          if (tep_not_null($spiders[$i])) {
            if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
              $spider_flag = true;
              break;
            }
          }
        }
// bof approved_agent
      }
// eof approved_agent
    }

 

The approved_agents.txt file need not be complete; only a few agents in the file would cover about 80% of the traffic. It would take so few that it might even be easier to simply create an array of approved agents right in the application_top.php file.
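The inline-array alternative might be sketched like this; the agent substrings below are illustrative guesses, not an official list:

```php
<?php
// Sketch of keeping a short approved-agent list directly in
// application_top.php instead of a separate approved_agents.txt file.
function is_approved_agent($user_agent, $approved_agents)
{
    $user_agent = strtolower($user_agent);
    foreach ($approved_agents as $agent) {
        if (strpos($user_agent, $agent) !== false) {
            return true; // common browser token found; skip the spiders.txt scan
        }
    }
    return false; // fall through to the usual spiders.txt scan
}

// Illustrative list only; a real one would come from studying the logs.
$approved_agents = array('msie', 'firefox', 'safari', 'opera');
```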

 

Am I way off base with this?

 

Also, if I'm reading the code correctly, am I right that if 'Force Cookies' is true, none of this matters?


Well, with forced cookies, no session is issued unless a cookie is accepted, and since spiders don't accept cookies, end of story.


Treasurer MFC


Sure, you could do this, but an even simpler optimization is to see if the user agent starts with the string "mozilla", and if it does, let it through. No need for a list. This will be nearly 100% effective at identifying real browsers.

 

Yes, Force Cookie Use will also obviate the need for checking for spiders, but it also will turn away some customers and can't be used on stores with shared SSL.


Not Yahoo Slurp; that one uses the term 'mozilla' too.


Treasurer MFC


Sigh, I should have checked my own logs more closely. You're right, this check would not be helpful at all. Several other spiders also start their UA with 'Mozilla'.

 

I've looked at this code on and off, wondering if there was an easy way to skip the spider check if a session was already started. I haven't delved into it enough to see how this might be done.


Well, you could check whether the client has the session-id cookie or has the session id in the URL. If not, you still need to go through the list, as it may be a first visit.

 

So I doubt that would be faster.

 

I tried to only go through the list if no browser language was supplied, but guess what: some spiders actually provide a browser language.

 

So for now I think the list remains very valid; I just keep it on a RAM disk for fast access.


Treasurer MFC


If the browser name field were also considered, would that help? It appears that most robots at least change the browser name to something different. For example, Internet Explorer has a browser name of 'MSIE' and an agent string starting with 'Mozilla/4.0 (compatible; MSIE'; if those two conditions are met, it is probably safe to assume the visitor is a person using Internet Explorer and not some robot/spider.
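osCommerce only sees the raw HTTP_USER_AGENT header (there is no separate browser-name field server-side), so a sketch of the double condition could approximate it like this; the robot tokens scanned for are illustrative assumptions:

```php
<?php
// Sketch: treat a visitor as a real MSIE browser only when the user agent
// starts with the standard IE prefix AND carries none of the obvious robot
// tokens (the token list is illustrative, not exhaustive).
function looks_like_real_msie($user_agent)
{
    $ua = strtolower($user_agent);
    if (strpos($ua, 'mozilla/4.0 (compatible; msie') !== 0) {
        return false; // not the standard IE prefix
    }
    foreach (array('bot', 'spider', 'crawl') as $token) {
        if (strpos($ua, $token) !== false) {
            return false; // a robot gave itself away inside the IE-like string
        }
    }
    return true;
}
```

Note this still cannot catch a robot that copies a browser user agent verbatim, which is exactly the case discussed earlier in the thread.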

To be honest, I don't think that the processing of spiders.txt takes enough time to warrant a lot of effort in this area.

 

 

The quickest comparison is the one you don't have to do at all, right?

 

Perhaps testing for a valid agent isn't the way, but an idea was floated about checking whether a session has already been started. That seems like an excellent idea: yes, you have to trudge through the list once for a valid client only to discover they are not a robot/spider, but on subsequent page requests a session would already be set, so there would be no need to check through the spiders.txt file again.
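The session short-circuit described above might be sketched as follows. It assumes tep_session_name() returns the osCommerce session name (normally 'osCsid'); the helper takes the cookie and GET arrays as parameters so it stands alone:

```php
<?php
// Sketch: only scan spiders.txt when the request carries no session id yet.
// A request that already presents the session id (cookie or URL parameter)
// passed the spider check on an earlier page.
function has_existing_session($session_name, $cookies, $get_params)
{
    return isset($cookies[$session_name]) || isset($get_params[$session_name]);
}

// Hypothetical use inside application_top.php:
//   if (!has_existing_session(tep_session_name(), $_COOKIE, $_GET)) {
//       // ... run the existing spiders.txt scan here ...
//   }
```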


My guess is that opening the file takes more time than processing the spider list once it has been read in. Opening a second file doesn't sound like an improvement to me.

 

Yeah, I do think that trying to determine if a session is started is useful. But beware of spiders which may already have SIDs in their saved URLs if you had your store up before enabling Prevent Spider Sessions. There is a "Spider Session Eliminator" (or some such) contrib that uses redirect rules in .htaccess to deal with that.


Just experimenting, so I am trying this for a while:

 

} elseif (SESSION_BLOCK_SPIDERS == 'True') {
  // IP override for spiders with a language or no agent or other strange stuff
  $spider_ips = array(
    '61.111.254.59',    // Korean, no agent, no language
    '66.249.65.162', '66.249.66.99', '66.249.66.172', '66.249.66.239', // Google
    '195.92.95.94',     // netcraft
    '64.62.168.25',     // gigabot using en language
    '198.65.147.172',   // goforit
    '61.135.145.212', '202.108.250.223', '61.135.146.208', // baidu spider using tw and en language
    '129.241.104.174', '129.241.104.179', '129.241.104.168', // boitho norway with en language
    '66.36.241.140',    // NutchCVS/0.06-dev
    //'220.135.121.91', // me for testing only
    '203.160.252.178'   // ?
  );

  if (in_array($browser_ip, $spider_ips)) {
    $spider_flag = true;
  } else {
    if ((tep_not_null($user_agent)) and ($browser_language == '')) {
      // AGENT override
      if (stristr($user_agent, 'bot')) $spider_flag = true;
      elseif (stristr($user_agent, 'spider')) $spider_flag = true;
      elseif (stristr($user_agent, 'mediapartners')) $spider_flag = true;

      // etc. ... the whole spiders.txt entries ...

      elseif (stristr($user_agent, 'falcon')) $spider_flag = true;
      elseif (stristr($user_agent, 'objectsearch')) $spider_flag = true;
    }
  }

  if (!$spider_flag) {
    tep_session_start();
    $session_started = true;

    // not in spider lists but no browser language - send to error log for possible inclusion
    if ($browser_language == '') {
      error_log('SPIDER? NO language: ' . $user_agent . "\n" . 'ip: ' . $browser_ip . "\n");
    }
  }
} else {
  tep_session_start();
  $session_started = true;
}

 

 

Note that these variables are already set:

$user_agent = strtolower(getenv('HTTP_USER_AGENT'));

$browser_language = getenv('HTTP_ACCEPT_LANGUAGE');

$browser_ip = tep_get_ip_address();


Treasurer MFC

