Guest Posted July 24, 2006

Hi,

I've been seeing a new crawler frequently on my site. The line from the Apache logs is:

206.169.110.66 - - [24/Jul/2006:13:59:28 +0200] "GET /shop/naomi-doll-p-1099.html HTTP/1.0" 200 51348 www.perfectpassion.co.uk "-" "page_verifier http://www.securecomputing.com/goto/pv" "-"

Following the link in the user agent brings up a page stating that it is a new crawler that searches across the net for pages with malware. Does this need adding to spiders.txt?

Thanks,
Tom
stevel (Author) Posted July 24, 2006

Probably a good idea, so that it doesn't start a session and use up resources. You could add the string "page_verifier". I'll include this in the next update.

Steve
Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
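To find the string worth adding, it helps to pull the user agent out of the log line first. The following is a minimal sketch, assuming the log format of the line quoted above, where the user agent is the third double-quoted field; the function names are made up for illustration and other log formats will need a different index.

```python
import re

# Log line copied from the post above (a custom Apache log format).
LOG_LINE = (
    '206.169.110.66 - - [24/Jul/2006:13:59:28 +0200] '
    '"GET /shop/naomi-doll-p-1099.html HTTP/1.0" 200 51348 '
    'www.perfectpassion.co.uk "-" '
    '"page_verifier http://www.securecomputing.com/goto/pv" "-"'
)


def quoted_fields(line):
    """Return every double-quoted field in the line, in order."""
    return re.findall(r'"([^"]*)"', line)


def user_agent(line):
    """Return the user agent, or None if the line has too few quoted fields.

    Assumption: the user agent is the third quoted field, as in LOG_LINE.
    """
    fields = quoted_fields(line)
    return fields[2] if len(fields) > 2 else None


if __name__ == "__main__":
    print(user_agent(LOG_LINE))
```

A short, distinctive piece of the extracted string, such as "page_verifier", is then a candidate entry for spiders.txt.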
daz_75 Posted July 28, 2006

Thanks for this contrib :thumbsup:
stevel (Author) Posted July 30, 2006

Updated 2006-07-30
Added strings: python, page_verifier
bobsi18 Posted August 19, 2006

Hi there all... first off, great contrib :)

Quick question - I had a bot crawling my site for about four and a half days, each time for about 16 hours or so. It seemed to obey my robots file. The name that showed up in my Who's Online section was 'FAST', and after some searching I found that it was 'FAST MetaWeb Crawler (helpdesk at fastsearch dot com)'. Just wondering if anyone knows whether this is a 'bad' or 'ban-able' bot?

TIA
~bobsi18~
stevel (Author) Posted August 19, 2006

Looks to me like a bot you would like indexing your store.
bobsi18 Posted August 19, 2006

That's what I sort of decided in the end - it was the first time I had seen that behaviour though, so I was a little apprehensive about what was going on. Thanks for quelling my doubts!
Andreas2003 Posted September 1, 2006

Hi Steve,

I have IP addresses from a colocation hoster in the US in my logs. Some details:

United States - Delaware - Newark - McColo Corporation
McColo Corporation MCCOLO (NET-208-66-192-0-1) 208.66.192.0 - 208.66.195.255
Digital Infinity Ltd DIGITALINFINITY (NET-208-66-195-0-1) IP range: 208.66.195.0 - 208.66.195.15
www.mccolo.com

It seems that a spider is running on some of those addresses, because I found the details in another SEO forum. Any hints on how to include them in spiders.txt?

Thanks in advance, kind regards
Andreas
stevel (Author) Posted September 1, 2006

Andreas,

Spider detection looks at the user agent string in the web request, not the IP. What is the UA string for these spider requests?
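As a rough illustration of what user-agent-based detection means, here is a minimal sketch. It assumes a case-insensitive substring match of each spiders.txt entry against the user agent; it is not the shop's actual code, and the function name and the three-entry list (strings mentioned in this thread) are chosen only for the example.

```python
def is_spider(user_agent, spider_strings):
    """Return True if any listed string occurs in the user agent.

    Assumption: matching is a case-insensitive substring test, and blank
    entries are ignored.
    """
    ua = user_agent.lower()
    for entry in spider_strings:
        needle = entry.strip().lower()
        if needle and needle in ua:
            return True
    return False


# Illustrative subset only; the real spiders.txt holds many more strings.
SPIDER_STRINGS = ["page_verifier", "python", "nbot"]

if __name__ == "__main__":
    bot = "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
    browser = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    print(is_spider(bot, SPIDER_STRINGS))      # matched by "nbot"
    print(is_spider(browser, SPIDER_STRINGS))  # no entry matches
```

Because the check only sees the user agent, a request that presents an ordinary browser string cannot be caught this way, whatever IP it comes from.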
Andreas2003 Posted September 1, 2006

stevel wrote: "Andreas, Spider detection looks at the user agent string in the web request, not the IP. What is the UA string for these spider requests?"

Hi Steve,

I cannot block on the user agent string. It's only:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)

I found out that a scraper, comment bot and/or referer spammer bot is coming from this IP range. See here: http://www.donsausa.com/2006/06/psycheclon...t-advisory.html

I think I have to block that IP range, because some other folks wrote that the bot is not friendly.

Thanks in advance, kind regards
Andreas
stevel (Author) Posted September 1, 2006

Ok - you'll probably want to use .htaccess to block that IP or IP range, as described in that link you posted.
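Apache access rules accept CIDR notation, so a start-end range like the ones quoted earlier usually has to be converted first. A minimal sketch using Python's standard ipaddress module follows; the helper names are invented for this example, and the resulting networks would still have to be written into .htaccess by hand.

```python
import ipaddress


def range_to_cidrs(first, last):
    """Convert an inclusive start-end IP range into a list of CIDR strings."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in networks]


def in_range(ip, first, last):
    """Return True if ip lies within the inclusive range first..last."""
    return (
        ipaddress.ip_address(first)
        <= ipaddress.ip_address(ip)
        <= ipaddress.ip_address(last)
    )


if __name__ == "__main__":
    # Ranges taken from the whois details posted in this thread.
    print(range_to_cidrs("208.66.192.0", "208.66.195.255"))
    print(range_to_cidrs("208.66.195.0", "208.66.195.15"))
    print(in_range("208.66.194.7", "208.66.192.0", "208.66.195.255"))
```

Blocking the larger range covers everything in the smaller one, so which of the two to use depends on how much of that hoster you are willing to shut out.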
boxtel Posted September 2, 2006

Andreas2003 wrote: "Hi Steve, cannot block on the user agent string. Its only Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322) I found out that from this IP range a scraper, comment bot and/or referer spammer bot is coming. See here: http://www.donsausa.com/2006/06/psycheclon...t-advisory.html I think, I have to block that IP range, because some other folks wrote that that bot is not friendly."

If this is an email address harvester as suggested, simply make sure there is nothing to harvest, or you will find yourself blocking IP ranges for the rest of your life.

Treasurer MFC
stevel (Author) Posted September 2, 2006

What she said.

On my site, I replace @ in email addresses with & #064; (without the space after the &; I have to include the space here or else it will display as @). The harvesters don't bother looking for this. It has been quite successful for me, but if the email addresses have been exposed in the past, there's no putting the genie back in the bottle.
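The replacement described above amounts to swapping the @ character for its numeric HTML character reference. A minimal sketch, with a made-up function name and an example.com address used purely for illustration:

```python
import html


def obfuscate_email(text):
    """Replace every @ with the numeric character reference &#064;."""
    return text.replace("@", "&#064;")


if __name__ == "__main__":
    encoded = obfuscate_email("sales@example.com")
    print(encoded)
    # A browser decodes the reference back to @ when it renders the page;
    # html.unescape mimics that step here.
    print(html.unescape(encoded))
```

This only helps against harvesters that scan the raw page source for a literal @, so it reduces exposure rather than eliminating it.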
Andreas2003 Posted September 27, 2006

Hi folks,

I have hits here from "bl1sch4091909.phx.gbl". The user agent is msnbot-media/1.0 (+http://search.msn.com/msnbot.htm).

Since the crawler is named "*-media", my question is whether this is a special MSN bot for crawling images and media. If so, should I exclude it via robots.txt like "Googlebot-Image"? Or is it the normal MSN spider?

Thanks in advance, kind regards
Andreas
stevel (Author) Posted September 27, 2006

I have seen complaints elsewhere that msnbot (and particularly msnbot-media) doesn't honor robots.txt. Yes, this is like googlebot-image, but I haven't found anything from MSN to say that one needs a special entry in robots.txt for this spider. You can try adding msnbot-media and see what it gets you.

Note that as far as spiders.txt is concerned, this robot is covered by "nbot".

Also, that phx.gbl hostname sure is weird, but I see other reports of that. I have no idea what it means.
Andreas2003 Posted September 27, 2006

stevel wrote: "I have seen complaints elsewhere that msnbot (and particularly msnbot-media) doesn't honor robots.txt. Yes, this is like googlebot-image, but I haven't found anything from MSN to say that one needs a special entry in robots.txt for this spider. You can try adding msnbot-media and see what it gets you."

Any ideas what the robots.txt entry should look like? Now I have:

User-agent: msnbot-media
Disallow: /

...but I have no idea whether the entry is correct (regarding spelling, upper/lowercase etc.).

Thanks in advance,
Regards
Andreas
stevel (Author) Posted September 27, 2006

I have no idea myself. I can't find anything on MSN's web site to suggest what it should be. Note that the name a robot looks for in robots.txt often has no direct correlation to what is in its user agent string.
Andreas2003 Posted September 27, 2006

Ok, I think I will post in the forum or somewhere else and report back here. Thanks, Steve.
stevel (Author) Posted September 27, 2006

Well, here is not really appropriate. It has nothing to do with spiders.txt.
boxtel Posted September 27, 2006

Andreas2003 wrote: "Any ideas, how the robots.txt-Entry should look like? Now I have: User-agent: msnbot-media Disallow: / ...but I have no idea, if the Entry is correct (regarding spelling, upper-/lowercase etc.)."

User agent transmitted to the visited web server:
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
msnbot/0.9 (+http://search.msn.com/msnbot.htm)
msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)

IP address range:
from 65.52.0.0 to 65.55.255.255 (msn.com)
from 207.68.128.0 to 207.68.207.255 (phx.gbl)

URL for more information: http://search.msn.com/msnbot.htm

Access control options understood by the robot:
robots.txt
META NAME="robots"

User agent to be used in the robots.txt file: msnbot

Treasurer MFC
Andreas2003 Posted September 27, 2006

boxtel wrote: "User Agent to be used in the robots.txt file: msnbot"

I found exactly the same page you posted above, but I'm not sure whether "msnbot" is the correct agent to place in robots.txt, because msnbot is the search bot and msnbot-media is the image bot. I only want to exclude msnbot-media, for images.

Regards
Andreas
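The syntax of the entry can at least be sanity-checked offline. The sketch below feeds the two lines from the earlier post to Python's standard urllib.robotparser and asks which agents would be blocked. It shows only how that particular parser reads the entry; it says nothing about which name msnbot-media itself looks for, which remains the open question in this thread. The helper name is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The entry exactly as posted earlier in the thread.
ROBOTS_TXT = """\
User-agent: msnbot-media
Disallow: /
"""


def allowed(robots_txt, agent, url="/"):
    """Return True if Python's robots.txt parser lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("msnbot-media/1.0", "msnbot/1.0", "Googlebot"):
        print(agent, allowed(ROBOTS_TXT, agent))
```

With this parser the entry blocks an agent named msnbot-media and leaves msnbot and other crawlers alone, which is the behaviour being asked for, provided the real robot matches on that name.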
chongordon Posted October 2, 2006

Warning: file(includes/spiders.txt) [function.file]: failed to open stream: No such file or directory in /home/wof/public_html/includes/application_top.php on line 178

Warning: session_start() [function.session-start]: Cannot send session cookie - headers already sent by (output started at /home/wof/public_html/includes/application_top.php:178) in /home/wof/public_html/includes/functions/sessions.php on line 67

Warning: session_start() [function.session-start]: Cannot send session cache limiter - headers already sent (output started at /home/wof/public_html/includes/application_top.php:178) in /home/wof/public_html/includes/functions/sessions.php on line 67

Hi guys, these warnings suddenly showed up on my Google cache page... I have the latest spiders.txt. What's wrong? Any ideas? Thanks in advance.

Puppy Clothes by WoofWoofLand
stevel (Author) Posted October 2, 2006

Andreas, the problem is that we don't know what robots.txt string msnbot-media looks for. Try using "msnbot-media", but you should probably ask the MSN search support what to use.

chongordon, it seems that your spiders.txt file is gone.
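The first warning is what triggers the other two: once the failed file read prints a message, the session headers can no longer be sent. As a general illustration of guarding against a missing list file, here is a minimal Python sketch. It is not the shop's PHP code and does not diagnose why the file was not found on that request; the function name is an assumption made for the example.

```python
from pathlib import Path


def load_spider_strings(path):
    """Return the non-blank lines of the spider list, or [] if it is missing.

    Returning an empty list means a missing file disables spider detection
    quietly instead of emitting output before the session starts.
    """
    p = Path(path)
    if not p.is_file():
        return []
    text = p.read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Relative paths resolve against the current working directory, so the
    # same file can be found on one request and not on another.
    print(load_spider_strings("includes/spiders.txt"))
```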
chongordon Posted October 2, 2006

Weird... spiders.txt has always been there. There's no problem when I view my website, but there's an error in the cache...
DeadDingo Posted October 2, 2006

This may be a silly question, but do I add both spiders.txt and spiders-large.txt? Or should I just use one, and if so, what are the advantages/disadvantages of the large one?

Cheers