stevel

Updated spiders.txt Official Support Topic

Hi,

 

I've seen a new crawler frequently on my site. The line from the Apache logs is:

 

206.169.110.66 - - [24/Jul/2006:13:59:28 +0200] "GET /shop/naomi-doll-p-1099.html HTTP/1.0" 200 51348 www.perfectpassion.co.uk "-" "page_verifier http://www.securecomputing.com/goto/pv" "-"

 

 

Following the link in the user agent brings up a page stating that it is a new crawler that searches the net for pages containing malware.

 

Does this need adding to spiders.txt?

 

Thanks,

Tom


Hi there all... first off, great contrib :) Quick question: I had a bot crawling my site for about four and a half days, each time for about 16 hours or so. It seemed to obey my robots.txt file. My Who's Online section listed it as 'FAST', and after some searching I found that it was 'FAST MetaWeb Crawler (helpdesk at fastsearch dot com)'. Does anyone know whether this is a 'bad' or 'ban-able' bot?

TIA

~bobsi18~


That's what I sort of decided in the end. It was the first time I had seen that behaviour though, so I was a little apprehensive about what was going on. Thanks for quelling my doubts!


Hi Steve,

 

I have IP addresses from a colocation host in the US in my logs.

 

Some details:

United States - Delaware - Newark - Mccolo Corporation

McColo Corporation MCCOLO (NET-208-66-192-0-1)

208.66.192.0 - 208.66.195.255

Digital Infinity Ltd DIGITALINFINITY (NET-208-66-195-0-1)

IP range : 208.66.195.0 - 208.66.195.15

www.mccolo.com

 

 

It seems that a spider is running on some of the addresses there; I found the details in another SEO forum.

Any hints on how to include them in spiders.txt?

 

Thanks in advance

kind regards

Andreas

Andreas,

 

Spider detection looks at the user agent string in the web request, not the IP. What is the UA string for these spider requests?
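
As a rough illustration of that point, here is a minimal sketch of user-agent matching against a spiders.txt-style list of substrings. It is written in Python rather than the shop's PHP, and the `is_spider` helper and the "crawler" entry are assumptions for illustration only, not the contribution's actual code. The "nbot" entry and the sample user agents are taken from posts in this thread.

```python
from typing import Iterable


def is_spider(user_agent: str, spider_entries: Iterable[str]) -> bool:
    """Return True if any non-empty entry occurs in the user agent (case-insensitive)."""
    ua = user_agent.lower()
    return any(entry.strip().lower() in ua
               for entry in spider_entries
               if entry.strip())


# "nbot" is mentioned in this thread; "crawler" is a hypothetical entry.
SAMPLE_ENTRIES = ["nbot", "crawler"]

msn_ua = "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
fast_ua = "FAST MetaWeb Crawler (helpdesk at fastsearch dot com)"
browser_ua = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; "
              ".NET CLR 1.0.3705; .NET CLR 1.1.4322)")

for ua in (msn_ua, fast_ua, browser_ua):
    print(is_spider(ua, SAMPLE_ENTRIES), ua)
```

A bot that sends an ordinary browser user agent, like the last one, cannot be told apart from a real visitor by this kind of check.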

 

Hi Steve,

 

I cannot block on the user agent string.

It is only:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)

 

 

I found out that a scraper, comment bot and/or referer spam bot is coming from this IP range.

See here:

http://www.donsausa.com/2006/06/psycheclon...t-advisory.html

 

I think I have to block that IP range, because other people have written that this bot is not friendly.
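
For anyone who does go the IP route, a start/end range usually has to be written in CIDR notation first. The following is only a sketch using Python's standard `ipaddress` module with the McColo range quoted earlier in the thread; the `in_blocked_range` helper is a hypothetical name, and how the resulting network is actually blocked (server config, firewall) is outside what it shows.

```python
import ipaddress

# Range reported for McColo Corporation earlier in the thread.
first = ipaddress.IPv4Address("208.66.192.0")
last = ipaddress.IPv4Address("208.66.195.255")

# Collapse the start/end pair into the smallest list of CIDR networks.
networks = list(ipaddress.summarize_address_range(first, last))
print([str(net) for net in networks])


def in_blocked_range(ip: str) -> bool:
    """Return True if the address falls inside any summarized network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


print(in_blocked_range("208.66.195.7"))
print(in_blocked_range("206.169.110.66"))
```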

 

Thanks in advance,

kind regards

Andreas


If this is an email address harvester as suggested, simply make sure there is nothing to harvest, or you will find yourself blocking IP ranges for the rest of your life.


Treasurer MFC


What she said. On my site, I replace @ in email addresses with

& #064;

without the space after the &. I have to include the space here or else the forum will display it as @.

 

The harvesters don't bother looking for this - it has been quite successful for me, but if the email addresses have been exposed in the past, there's no putting the genie back in the bottle.
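
The replacement described above can be sketched in a few lines. This is a Python illustration only, not code from the shop; the `obfuscate_email` name and the example.com address are made up for the example.

```python
import html


def obfuscate_email(address: str) -> str:
    """Replace '@' with its decimal HTML character reference."""
    return address.replace("@", "&#064;")


encoded = obfuscate_email("sales@example.com")
print(encoded)                 # sales&#064;example.com
print(html.unescape(encoded))  # sales@example.com, which is what a browser displays
```

The page source no longer contains a literal @, while visitors still see a normal address.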


Hi folks,

 

I have hits here from "bl1sch4091909.phx.gbl".

The user agent is msnbot-media/1.0 (+http://search.msn.com/msnbot.htm).

My question: since the crawler is named "*-media", is this a special MSN bot for crawling images and media?

If so, should I exclude it via robots.txt, as with "Googlebot-Image"?

Or is it the normal MSN spider?

 

Thanks in advance,

kind regards

Andreas


I have seen complaints elsewhere that msnbot (and particularly msnbot-media) doesn't honor robots.txt. Yes, this is like googlebot-image, but I haven't found anything from MSN to say that one needs a special entry in robots.txt for this spider. You can try adding msnbot-media and see what it gets you.

 

Note that as far as spiders.txt is concerned, this robot is covered by "nbot".

 

Also, that phx.gbl hostname sure is weird, but I see other reports of that. I have no idea what it means.


Any ideas what the robots.txt entry should look like?

At the moment I have:

User-agent: msnbot-media
Disallow: /

...but I have no idea whether the entry is correct (spelling, upper/lowercase, etc.).
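
One way to at least check that such an entry is well-formed is to feed it to a generic robots.txt parser. The sketch below uses Python's standard `urllib.robotparser`; it only shows how that parser interprets the rule, and says nothing about how msnbot-media itself behaves, which is unconfirmed in this thread. The two paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: msnbot-media
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

media_ua = "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
search_ua = "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

# Hypothetical paths, used only to exercise the rule.
print(parser.can_fetch(media_ua, "/images/logo.gif"))
print(parser.can_fetch(search_ua, "/index.php"))
```

With this parser the rule applies to the media user agent and leaves the plain msnbot user agent unaffected.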

 

Thanks in advance,

Regards

Andreas


I have no idea myself. I can't find anything on MSN's web site to suggest what it should be. Note that there's often no direct correlation to what is in the user agent string.


User Agent transmitted to the visited web server :

msnbot/1.0 (+http://search.msn.com/msnbot.htm)

msnbot/0.9 (+http://search.msn.com/msnbot.htm)

msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)

IP address ranges:

65.52.0.0 to 65.55.255.255 (msn.com)

207.68.128.0 to 207.68.207.255 (phx.gbl)

 

URL for more information : http://search.msn.com/msnbot.htm

 

Access control options understood by the robot :

robots.txt

META NAME="robots"

 

User Agent to be used in the robots.txt file: msnbot


Treasurer MFC


I found exactly the same page you posted above, but I'm not sure whether "msnbot" is the correct agent to put in robots.txt, because msnbot is the search bot and msnbot-media is the image bot.

I only want to exclude msnbot-media, for images.

 

Regards

Andreas


Warning: file(includes/spiders.txt) [function.file]: failed to open stream: No such file or directory in /home/wof/public_html/includes/application_top.php on line 178

 

Warning: session_start() [function.session-start]: Cannot send session cookie - headers already sent by (output started at /home/wof/public_html/includes/application_top.php:178) in /home/wof/public_html/includes/functions/sessions.php on line 67

 

Warning: session_start() [function.session-start]: Cannot send session cache limiter - headers already sent (output started at /home/wof/public_html/includes/application_top.php:178) in /home/wof/public_html/includes/functions/sessions.php on line 67

 

 

 

Hi guys, these warnings suddenly appeared on my Google cache page... I have the latest spiders.txt.

What's wrong? Any ideas? Thanks in advance.

Edited by chongordon

Puppy Clothes by WoofWoofLand


Andreas, the problem is that we don't know what robots.txt string msnbot-media looks for. Try using "msnbot-media", but you should probably ask the MSN search support what to use.

 

chongordon, it seems that your spiders.txt file is gone.


Weird... spiders.txt has always been there. There's no problem when I view my website, but there's an error in the cache.


Puppy Clothes by WoofWoofLand


This may be a silly question, but do I add both spiders.txt and spiders-large.txt?

Or should I just use one, and if so, what are the advantages and disadvantages of the large one?

 

 

Cheers

