stevel Posted April 11, 2005 (author)

If you add "gecko", you'll be filtering out everyone who uses Mozilla or Firefox. What is the actual user agent line in your log? I have not seen an actual spider with "gecko" in its user agent.

Steve
Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
selectronics4u Posted April 11, 2005

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2

This is the line from the agent log.

DON
stevel Posted April 11, 2005 (author)

That's not a spider; that's the Firefox web browser.
wheeloftime Posted April 22, 2005

Just found this spider crawling my site: ingrid. It is from the Dutch search engine ilse.nl; maybe you want to add it in your next release. Thanks
stevel Posted May 11, 2005 (author)

Moved from the announcement thread: "possible spider with multiple ip addresses and strange user agent:"

[11-May-2005 13:03:36] SPIDER? NO language: mozilla/4.0 (compatible; msie 4.0; windows nt; ....../1.0 ) ip: 12.17.130.27
[11-May-2005 13:04:06] SPIDER? NO language: mozilla/4.0 (compatible; msie 4.0; windows nt; ....../1.0 ) ip: 65.164.129.91
[11-May-2005 13:04:26] SPIDER? NO language: mozilla/4.0 (compatible; msie 4.0; windows nt; ....../1.0 ) ip: 207.155.199.163

I'm not sure why you think these entries are spiders, and there isn't even a complete user agent string. It COULD be a spider, but because the user agent is that of MSIE, there's nothing spiders.txt could do about it. Nor do I think it's worth worrying about unless the actual accesses are causing you trouble. I have seen the occasional spider-like activity with user agents that look like ordinary browsers, but such cases are infrequent and clearly not associated with a public search engine.

Howard, thanks for the mention of ingrid.
boxtel Posted May 11, 2005

http://www.webmasterworld.com/forum97/186.htm

Treasurer
MFC
boxtel Posted May 11, 2005

http://www.webmasterworld.com/forum11/2611.htm
stevel Posted May 11, 2005 (author)

Interesting - so the "...." is actually in the user agent string? I don't see that anyone has positively identified this as a bot; the msnbot guy disclaims ownership. The first two IPs you give have no reverse DNS entry at all, and the third belongs to xo.com. My guess is that this is more of a worm than a spider. What files was it accessing on your site?
boxtel Posted May 12, 2005

Yes, the .... is part of it. They access all of my pages (with no consideration for robots.txt), but in a very strange order: an order that would be impossible for a normal user, like two different pages in succession, both with the language=en parameter. Only robots can do that; it is not possible when viewing the site normally. All three IPs together generated about 750 visitor-stat entries in about an hour before I started looking for a possible explanation on the internet.
stevel Posted May 12, 2005 (author)

Bizarre. Well, you can add the string .... to spiders.txt; that will at least prevent it from getting a session. If you want to block those IPs, you can do it in a .htaccess.
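As a sketch, assuming the stock spiders.txt format (one substring per line, matched against the lowercased user agent), that addition would look like the following; googlebot and slurp are just illustrative neighbors already found in typical spiders.txt distributions, and the .... line is the literal string from the log above:

```text
googlebot
slurp
msnbot
....
```

Note that this matches any user agent containing four consecutive dots, which is vanishingly rare in real browsers but not impossible.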
MnMeBiz Posted May 17, 2005

I'm getting hit by this site, and it's even adding items to my cart for items that don't have a buy button. I see /catalog/product_reviews.php?products_id=145&action=buy_now..... at times, which adds items to my cart, even for items that DON'T have a buy button! Here's an entry in my server log:

modemcable164.221-203-24.mc.videotron.ca - - [16/May/2005:21:02:38 -0700] "GET /catalog/product_reviews_write.php?cPath=31&products_id=145&osCsid=1aefb837b7347eb5a033dfdf6acd7058 HTTP/1.0" 302 0 "http://www.thepartyfowl.com/catalog/product_reviews.php?cPath=31&products_id=145&osCsid=1aefb837b7347eb5a033dfdf6acd7058" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

What user agent string should I add to my spiders.txt file to keep him out? I don't see anything unique to this bot. (I have also now added DISALLOW: /catalog/product_reviews.php to my robots.txt file.) Alternately, I have just added his IP (obtained from the osC user tracking contribution) to .htaccess. Is this right? Is this the /html/.htaccess, or the /html/catalog/.htaccess? I keep up to date with the posted spiders.txt from this thread. Thanks

Thanks
Mike
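For reference, the robots.txt entry mentioned above would conventionally be written with a User-agent line and the capitalized Disallow directive; a minimal sketch, keeping in mind that robots.txt is only honored by well-behaved crawlers and would not stop the bot described here:

```text
User-agent: *
Disallow: /catalog/product_reviews.php
```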
stevel Posted May 17, 2005 (author)

There's nothing you can add to spiders.txt for this; the user agent string makes it look exactly like MSIE. It's not clear to me that this is any sort of spider, as it is running from a cable modem connection. You can add a "Deny from 64.221.203.24" in .htaccess, but this access is not behaving like a search engine spider.
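A minimal sketch of that rule, assuming Apache's 2.x-era mod_access Order/Allow/Deny syntax and the IP quoted above, placed in the .htaccess that covers the store directory (e.g. /catalog/.htaccess):

```apache
# Allow everyone except the offending address
Order Allow,Deny
Allow from all
Deny from 64.221.203.24
```

With Order Allow,Deny, the Allow directives are evaluated first and any matching Deny then overrides them, so this admits all hosts except the one listed.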
MnMeBiz Posted May 17, 2005

Thanks. I am getting hundreds of hits (plus a session ID and some cart additions) from this source in a short span, every couple of weeks. Hopefully .htaccess will prevent this.
stevel Posted May 17, 2005 (author)

I'd guess that some user has told his browser to "subscribe" to your site.
Guest Posted May 22, 2005

I'm wondering if a slight change in the code could help out. It appears that spiders.txt is used to filter out user agents; I suggest adding an approved_agents.txt file that is used to filter in user agents. A bit of code would need to be added in /includes/application_top.php:

```php
} elseif (SESSION_BLOCK_SPIDERS == 'True') {
  $user_agent = strtolower(getenv('HTTP_USER_AGENT'));
  $spider_flag = false;
  if (tep_not_null($user_agent)) {
    // bof approved_agent
    $approved_agents = file(DIR_WS_INCLUDES . 'approved_agents.txt');
    $is_approved_agent = false;
    for ($i = 0, $n = sizeof($approved_agents); $i < $n; $i++) {
      if (tep_not_null($approved_agents[$i])) {
        if (is_integer(strpos($user_agent, trim($approved_agents[$i])))) {
          // found an approved agent
          $is_approved_agent = true;
          break;
        }
      }
    }
    if ($is_approved_agent == false) {
      // only scan spiders.txt for agents that are not pre-approved
      $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');
      for ($i = 0, $n = sizeof($spiders); $i < $n; $i++) {
        if (tep_not_null($spiders[$i])) {
          if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
            $spider_flag = true;
            break;
          }
        }
      }
    }
    // eof approved_agent
  }
```

The approved_agents.txt file need not be complete; only a few agents in the file would cover about 80% of the traffic. It would take so few that it might even be easier to simply create an array of approved agents right in application_top.php. Am I way off base with this?

Also, if I'm reading the code correctly, am I right that if 'Force Cookies' is true, none of this matters?
boxtel Posted May 22, 2005

Well, with forced cookies, no session is issued if no cookie is accepted, and since spiders don't accept cookies: end of story.
stevel Posted May 22, 2005 (author)

Sure, you could do this, but an even simpler optimization is to see if the user agent starts with the string "mozilla" and, if it does, let it through. No need for a list. This would be nearly 100% effective at identifying real browsers.

Yes, Force Cookie Use will also obviate the need to check for spiders, but it will turn away some customers and can't be used on stores with shared SSL.
boxtel Posted May 22, 2005

Not Yahoo! Slurp; that one uses the term "mozilla" too.
stevel Posted May 22, 2005 (author)

Sigh - I should have checked my own logs more closely. You're right, this check would not be helpful at all; several other spiders also start their UA with "Mozilla". I've looked at this code on and off, wondering if there was an easy way to skip the spider check if a session was already started, but I haven't delved into it enough to see how this might be done.
boxtel Posted May 22, 2005

Well, you could check whether the client has the cookie with the session id, or whether the session id is in the URL. If not, you still need to go through the list, as it may be a first visit, so I doubt that would be faster. I tried going through the list only if no browser language was supplied, but guess what: some spiders actually provide a browser language. So for now I think the list remains very valid; I just keep it on a RAM disk for fast access.
Guest Posted May 22, 2005

Would it help if the browser name field were also considered? It appears that most robots at least change the browser name to something different. For example, Internet Explorer has a browser name of 'MSIE' and an agent string starting with 'Mozilla/4.0 (compatible; MSIE'; if those two conditions are met, I would assume it is safe to conclude the visitor is a person using Internet Explorer, and not some robot/spider.
stevel Posted May 22, 2005 (author)

To be honest, I don't think that processing spiders.txt takes enough time to warrant a lot of effort in this area.
Guest Posted May 23, 2005

The quickest comparison is the one you don't have to do at all, right? Perhaps testing for a valid agent isn't the way, but an idea was floated about checking whether a session has already been started. That seems like an excellent idea: yes, you have to trudge through the list once for a valid client only to discover they are not a robot/spider, but on subsequent page requests a session would already be set, so there would be no need to check through the spiders.txt file again.
stevel Posted May 23, 2005 (author)

My guess is that opening the file takes more time than processing the spider list once it has been read in, so opening a second file doesn't sound like an improvement to me. Yes, I do think that trying to determine whether a session has started is useful. But beware of spiders which may already have SIDs in their saved URLs if you had your store up before enabling Prevent Spider Sessions. There is a "Spider Session Eliminator" (or some such) contribution that uses redirect rules in .htaccess to deal with that.
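A minimal sketch of the "skip the scan when a session already exists" idea, assuming the osCommerce session id name osCsid. The helper names here (has_existing_session, is_spider) are hypothetical, not part of the stock application_top.php, and per the caveat above a spider replaying an old SID-bearing URL would slip past the shortcut:

```php
<?php
// Hypothetical helpers sketching the optimization discussed in the thread.

// True when the client already presents a session id, either as a cookie
// or as a parameter in the URL (cookieless sessions).
function has_existing_session(array $cookies, array $get_params, $session_name) {
    return isset($cookies[$session_name]) || isset($get_params[$session_name]);
}

// The usual spiders.txt scan: case-insensitive substring match, one entry per line.
function is_spider($user_agent, array $spider_entries) {
    $user_agent = strtolower($user_agent);
    foreach ($spider_entries as $entry) {
        $entry = strtolower(trim($entry));
        if ($entry !== '' && strpos($user_agent, $entry) !== false) {
            return true;
        }
    }
    return false;
}

// Only session-less requests pay for the list scan.
$spiders = array('googlebot', 'slurp', 'msnbot'); // stand-in for file(DIR_WS_INCLUDES . 'spiders.txt')
if (has_existing_session($_COOKIE, $_GET, 'osCsid')) {
    $spider_flag = false; // already vetted on an earlier request
} else {
    $spider_flag = is_spider((string) getenv('HTTP_USER_AGENT'), $spiders);
}
```

The trade-off is exactly the one raised above: first-time visitors still pay for one full scan, and a stale SID in a crawler's saved URL would wrongly bypass the check.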
boxtel Posted May 23, 2005

Just experimenting, so I am trying this for a while:

```php
} elseif (SESSION_BLOCK_SPIDERS == 'True') {
  // IP override for spiders with a language, or no agent, or other strange stuff
  $spider_ips = array(
    '61.111.254.59',    // Korean, no agent, no language
    '66.249.65.162', '66.249.66.99', '66.249.66.172', '66.249.66.239', // Google
    '195.92.95.94',     // netcraft
    '64.62.168.25',     // gigabot using en language
    '198.65.147.172',   // goforit
    '61.135.145.212', '202.108.250.223', '61.135.146.208', // baidu spider using tw and en language
    '129.241.104.174', '129.241.104.179', '129.241.104.168', // boitho norway with en language
    '66.36.241.140',    // NutchCVS/0.06-dev
    //'220.135.121.91', // me, for testing only
    '203.160.252.178'   // ?
  );
  $spider_flag = false;
  if (in_array($browser_ip, $spider_ips)) {
    $spider_flag = true;
  } else {
    if ((tep_not_null($user_agent)) and ($browser_language == '')) {
      // AGENT override
      if (stristr($user_agent, 'bot')) $spider_flag = true;
      elseif (stristr($user_agent, 'spider')) $spider_flag = true;
      elseif (stristr($user_agent, 'mediapartners')) $spider_flag = true;
      // etc. ... the whole list of spiders.txt entries, ending with:
      elseif (stristr($user_agent, 'falcon')) $spider_flag = true;
      elseif (stristr($user_agent, 'objectsearch')) $spider_flag = true;
    }
  }
  if (!$spider_flag) {
    tep_session_start();
    $session_started = true;
    // not in the spider lists but no browser language - send to the error log for possible inclusion
    if ($browser_language == '') {
      error_log('SPIDER? NO language: ' . $user_agent . "\n" . 'ip: ' . $browser_ip . "\n");
    }
  }
} else {
  tep_session_start();
  $session_started = true;
}
```

Note: these variables are already set earlier:

```php
$user_agent = strtolower(getenv('HTTP_USER_AGENT'));
$browser_language = getenv('HTTP_ACCEPT_LANGUAGE');
$browser_ip = tep_get_ip_address();
```