Updated spiders.txt Official Support Topic


stevel

Hi,

 

I've seen a new crawler frequently on my site. The line from the Apache logs is:

 

206.169.110.66 - - [24/Jul/2006:13:59:28 +0200] "GET /shop/naomi-doll-p-1099.html HTTP/1.0" 200 51348 www.perfectpassion.co.uk "-" "page_verifier http://www.securecomputing.com/goto/pv" "-"

 

 

Following the link in the user agent brings up a page stating that it is a new crawler that searches the net for pages with malware.

 

Does this need adding to spiders.txt?

 

Thanks,

Tom


Probably a good idea so that it doesn't start a session and use up resources. You could add the string "page_verifier". I'll include this in the next update.
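
To illustrate the suggestion above: spider detection compares the request's user agent string against the entries in spiders.txt. The sketch below assumes a plain case-insensitive substring test. osCommerce itself is PHP, so this Python version only shows the idea; the function name `is_spider` and the three-entry list are hypothetical, not the contents of the real file.

```python
def is_spider(user_agent, spider_entries):
    """Return True if any non-empty entry occurs in the lower-cased user agent."""
    ua = user_agent.lower()
    for entry in spider_entries:
        entry = entry.strip().lower()
        if entry and entry in ua:
            return True
    return False


# Hypothetical excerpt of a spiders.txt file: one substring per line.
SPIDER_ENTRIES = ["googlebot", "slurp", "page_verifier"]

crawler_ua = "page_verifier http://www.securecomputing.com/goto/pv"
browser_ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(is_spider(crawler_ua, SPIDER_ENTRIES))  # True
print(is_spider(browser_ua, SPIDER_ENTRIES))  # False
```

Because the test is a substring match, a short entry can cover several related crawlers, which is why new entries should be specific enough not to match ordinary browsers.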


  • 3 weeks later...

Hi all. First off, great contribution :) Quick question: a bot crawled my site for about four and a half days, for about 16 hours at a time. It seemed to obey my robots.txt file. My Who's Online section showed it as 'FAST', and after some searching I found that it is 'FAST MetaWeb Crawler (helpdesk at fastsearch dot com)'. Does anyone know whether this is a 'bad' or 'ban-able' bot?

TIA

~bobsi18~


  • 2 weeks later...

Hi Steve,

 

I am seeing IP addresses from a colocation host in the US in my logs.

 

Some details:

United States - Delaware - Newark - Mccolo Corporation

McColo Corporation MCCOLO (NET-208-66-192-0-1)

208.66.192.0 - 208.66.195.255

Digital Infinity Ltd DIGITALINFINITY (NET-208-66-195-0-1)

IP range : 208.66.195.0 - 208.66.195.15

www.mccolo.com

 

 

It seems that a spider is running on some of the addresses there; I found the details in another SEO forum.

 

Any hints on how to include them in spiders.txt?

 

Thanks in advance

kind regards

Andreas


Andreas,

 

Spider detection looks at the user agent string in the web request, not the IP. What is the UA string for these spider requests?
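
One way to answer that question is to pull the user agent strings for an IP range straight out of the access log. The sketch below is only an illustration under stated assumptions: it assumes every log line follows the same layout as the Apache line quoted at the top of this topic, the function name is hypothetical, and the two sample lines are made up.

```python
import ipaddress
import re

# Assumed log layout, modelled on the Apache line quoted at the top of this topic:
# ip - - [time] "request" status bytes vhost "referer" "user agent" ...
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def user_agents_from_network(log_lines, network):
    """Collect the distinct user agents sent from addresses inside `network`."""
    net = ipaddress.ip_network(network)
    seen = set()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # line does not follow the assumed layout
        try:
            ip = ipaddress.ip_address(match.group("ip"))
        except ValueError:
            continue  # first field is a hostname, not an IP address
        if ip in net:
            seen.add(match.group("ua"))
    return sorted(seen)


# Made-up sample lines for illustration only (not taken from a real log).
SAMPLE_LOG = [
    '208.66.195.3 - - [01/Sep/2006:10:00:00 +0200] "GET /index.php HTTP/1.1" '
    '200 1234 www.example.com "-" "ExampleBot/1.0" "-"',
    '192.0.2.10 - - [01/Sep/2006:10:00:05 +0200] "GET /index.php HTTP/1.1" '
    '200 1234 www.example.com "-" "OtherAgent/2.0" "-"',
]

# 208.66.192.0/22 covers 208.66.192.0 through 208.66.195.255.
print(user_agents_from_network(SAMPLE_LOG, "208.66.192.0/22"))  # ['ExampleBot/1.0']
```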


Hi Steve,

 

I cannot block on the user agent string, because it is only:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)

 

 

I found out that a scraper, comment bot, and/or referrer spammer bot is coming from this IP range.

See here:

http://www.donsausa.com/2006/06/psycheclon...t-advisory.html

 

I think I have to block that IP range, because other people have written that this bot is not friendly.

 

Thanks in advance,

kind regards

Andreas


If this is an email address harvester, as suggested, simply make sure there is nothing to harvest, or you will find yourself blocking IP ranges for the rest of your life.

Treasurer MFC


What she said. On my site, I replace the @ in email addresses with

& #064;

without the space after the & (I have to include the space here, or else the forum will render it as @).

 

The harvesters don't bother looking for this - it has been quite successful for me, but if the email addresses have been exposed in the past, there's no putting the genie back in the bottle.
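
As a rough sketch of that replacement, the snippet below rewrites the @ in anything that looks like an email address as the `&#064;` entity before the page is output. This is an illustration, not code from the poster's site: the function name and the deliberately simple address pattern are assumptions.

```python
import re

# Deliberately simple pattern for illustration; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def obfuscate_emails(html):
    """Replace the @ in anything that looks like an email address with &#064;."""
    return EMAIL_PATTERN.sub(lambda m: m.group(0).replace("@", "&#064;"), html)


print(obfuscate_emails("Contact: sales@example.com"))
# Contact: sales&#064;example.com
```

Browsers render the entity as a normal @, so visitors see the address unchanged, while a harvester that searches the raw HTML for a literal @ will miss it.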


  • 4 weeks later...

Hi folks,

 

I am seeing hits from "bl1sch4091909.phx.gbl".

The user agent is msnbot-media/1.0 (+http://search.msn.com/msnbot.htm).

 

Since the crawler is named "*-media", my question is whether this is a special MSN bot for crawling images and media.

If so, should I exclude it via robots.txt, like "Googlebot-Image"?

Or is it the normal MSN spider?

 

Thanks in advance,

kind regards

Andreas


I have seen complaints elsewhere that msnbot (and particularly msnbot-media) doesn't honor robots.txt. Yes, this is like googlebot-image, but I haven't found anything from MSN to say that one needs a special entry in robots.txt for this spider. You can try adding msnbot-media and see what it gets you.

 

Note that as far as spiders.txt is concerned, this robot is covered by "nbot".

 

Also, that phx.gbl hostname sure is weird, but I see other reports of that. I have no idea what it means.


Any ideas what the robots.txt entry should look like?

Right now I have:

User-agent: msnbot-media
Disallow: /

...but I have no idea whether the entry is correct (regarding spelling, upper/lowercase, etc.).

 

Thanks in advance,

Regards

Andreas


I have no idea myself. I can't find anything on MSN's web site to suggest what it should be. Note that the robots.txt name often has no direct correlation to what is in the user agent string.


User agents transmitted to the visited web server:
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
msnbot/0.9 (+http://search.msn.com/msnbot.htm)
msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)

IP address ranges:
65.52.0.0 to 65.55.255.255 (msn.com)
207.68.128.0 to 207.68.207.255 (phx.gbl)

URL for more information: http://search.msn.com/msnbot.htm

Access control options understood by the robot:
robots.txt
META NAME="robots"

User agent to be used in the robots.txt file: msnbot

Treasurer MFC


I found exactly the same page you posted above, but I'm not sure whether "msnbot" is the correct agent to put in robots.txt, because msnbot is the search bot and msnbot-media is the image bot.

I only want to exclude msnbot-media, for images.

 

Regards

Andreas
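
To make the msnbot versus msnbot-media distinction concrete, the sketch below feeds both candidate entries to Python's standard-library `urllib.robotparser`. This only shows how that one parser interprets the rules; nothing in this topic establishes how MSN's crawlers match the `User-agent` line, so treat the results as an illustration rather than a prediction. The helper name `allowed` and the example paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, user_agent, path):
    """Parse the given robots.txt lines and ask whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, path)


MEDIA_ONLY = ["User-agent: msnbot-media", "Disallow: /"]
ALL_MSNBOT = ["User-agent: msnbot", "Disallow: /"]

MEDIA_UA = "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
SEARCH_UA = "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

print(allowed(MEDIA_ONLY, MEDIA_UA, "/images/logo.gif"))  # False
print(allowed(MEDIA_ONLY, SEARCH_UA, "/index.php"))       # True
print(allowed(ALL_MSNBOT, MEDIA_UA, "/images/logo.gif"))  # False
```

With this parser, an entry for "msnbot-media" leaves the normal search bot alone, while an entry for plain "msnbot" also catches msnbot-media, because the name is matched as a substring. Whether MSN's own crawlers behave the same way is a question for MSN.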


Warning: file(includes/spiders.txt) [function.file]: failed to open stream: No such file or directory in /home/wof/public_html/includes/application_top.php on line 178

 

Warning: session_start() [function.session-start]: Cannot send session cookie - headers already sent by (output started at /home/wof/public_html/includes/application_top.php:178) in /home/wof/public_html/includes/functions/sessions.php on line 67

 

Warning: session_start() [function.session-start]: Cannot send session cache limiter - headers already sent (output started at /home/wof/public_html/includes/application_top.php:178) in /home/wof/public_html/includes/functions/sessions.php on line 67

 

 

 

Hi guys, these warnings have suddenly started appearing on my Google cache page. I have the latest spiders.txt.

What's wrong? Any ideas? Thanks in advance.

Edited by chongordon

Puppy Clothes by WoofWoofLand


Andreas, the problem is that we don't know what robots.txt string msnbot-media looks for. Try using "msnbot-media", but you should probably ask the MSN search support what to use.

 

chongordon, it seems that your spiders.txt file is gone.
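
The warnings quoted above show the spider list being read from includes/spiders.txt, and the read failing noisily when the file is absent. The real fix is to restore the file. Purely as an illustration of reading such a list defensively, here is a minimal Python sketch (osCommerce itself is PHP, and the function name is hypothetical) that treats a missing file as an empty list instead of emitting output.

```python
from pathlib import Path


def load_spider_entries(path):
    """Return the non-empty lines of the spider list, or [] if the file is missing."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return []  # no list available: callers simply detect no spiders
    return [line.strip() for line in text.splitlines() if line.strip()]


# A path that is assumed not to exist, to show the fallback.
print(load_spider_entries("no-such-directory/spiders.txt"))  # []
```

The trade-off is that a silently missing list means every crawler gets a session again, so logging the condition somewhere is still worthwhile.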
