stevel

Updated spiders.txt Official Support Topic


spiders.txt does not block search engines from your site. What it does is prevent them from creating sessions so that they are unable to do "add to cart", go places only humans can go, and, most importantly, it prevents URLs in their index from containing session IDs.

 

When a "bot" visits your site, it supplies a user agent string that identifies it (usually). Since a lot of bots have the string "ebot" in their UA strings, this is used to detect all of them. Googlebot is just one. Similarly, "nbot" detects MSNbot and any other with "nbot" in the UA string. These bots are not bad - in fact they are good - you want your site indexed. You just don't want them following "add to cart" links and leaving session IDs in URLs.
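The substring matching described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the actual osCommerce code: the function name `is_spider`, the two-entry sample list, and the sample UA strings are all chosen for the example.

```python
def is_spider(user_agent, spider_strings):
    """Return True if any spiders.txt-style entry occurs in the user agent."""
    ua = user_agent.lower()
    if not ua:
        return False  # an empty UA cannot be matched against anything
    for entry in spider_strings:
        entry = entry.strip().lower()
        if entry and entry in ua:
            return True
    return False


# Hypothetical excerpt of a spiders.txt-style list (one substring per line).
SAMPLE_SPIDERS = ["ebot", "nbot"]

# Sample UA strings used only for illustration.
googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
msnbot_ua = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
browser_ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)"

print(is_spider(googlebot_ua, SAMPLE_SPIDERS))  # "googlebot" contains "ebot"
print(is_spider(msnbot_ua, SAMPLE_SPIDERS))     # "msnbot" contains "nbot"
print(is_spider(browser_ua, SAMPLE_SPIDERS))    # no entry matches
```

A match here would only mean "do not start a session for this visitor"; it does not block the request.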

 

If you actually want to block a bot, the first thing is to add an entry to robots.txt. All well-behaved bots will honor this. See this Wikipedia article for more info. I don't know if Yandex honors this - it probably does. You may have to visit its web site to see what to put in robots.txt.


Thank you for your kind reply.

 

I also saw this crawling my website.

as13448.com

 

Do I just put as13448 somewhere in the spiders.txt file to stop this bot from creating sessions?

 

 

 

Another question:

So by putting "yandex" in the spiders.txt file, you stop them from creating sessions?

Will that reduce the bandwidth they use? And is it OK for them to visit the front page of my website?

I ask because whenever yandex.ru came to my website, they were viewing most of my products one by one.

 

So does this mean that I will still see them on my who's online page?

 

Thank you.

Edited by Androider


Do I just put as13448 somewhere in the spiders.txt file to stop this bot from creating sessions?

No - you have to look at the user agent string from the server log and see what it has there. It may not have anything you can use to identify it if it is not a well-behaved bot. Is it causing trouble for you?

 

Yes, you will still see the bots on Who's Online. From experience, I'd say to NOT trust what that says for whether or not the visitor has a session.

Edited by stevel


Is it causing trouble for you?

 

 

To be honest, I'm not sure if bots are causing problems...

I just became curious about who this yandex.ru (which was on my website every day) was, so I did some searching, and people were complaining that it eats up bandwidth of up to 1 GB a day.

So is this how you stop them from using bandwidth? spiders.txt?

 

I just want my site to be as clean as possible.

So, should I just remove as13448 from spiders.txt, since it's of no use there?


I would remove as13448 from spiders.txt. You can use robots.txt to slow down a spider - read the link I posted.

 

AS13448.com is operated by a company called Websense, a company that sells web filtering devices and services. Can you show me a line from your server log indicating an as13448.com IP address?


That's not the user agent string. You want a line that looks something like this:

 

220.181.7.44 - - [12/Apr/2010:02:32:03 -0400] "GET /robots.txt HTTP/1.1" 200 451 www.example.com "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)" "-"

 

See that string that starts "Baiduspider"? That's the user agent. If you're using awstats, you should be able to locate the access log.
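If it helps, here is a rough Python sketch of pulling the user agent out of a log line like the one above. It assumes the same field layout as that sample line (request, referrer, user agent, then one more quoted field); other servers may be configured differently, and the helper name `quoted_fields` is made up for this example.

```python
import re


def quoted_fields(log_line):
    """Return every double-quoted field in an access log line, in order."""
    return re.findall(r'"([^"]*)"', log_line)


line = ('220.181.7.44 - - [12/Apr/2010:02:32:03 -0400] '
        '"GET /robots.txt HTTP/1.1" 200 451 www.example.com "-" '
        '"Baiduspider+(+http://www.baidu.com/search/spider.htm)" "-"')

fields = quoted_fields(line)
# Assumption: in this layout the third quoted field is the user agent.
user_agent = fields[2]
print(user_agent)
```

This simple pattern would not cope with a user agent that itself contains an escaped double quote.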

 

If you want to block Yandex entirely (and posts I have read suggest that is a good idea), add this to your robots.txt:

 

User-agent: Yandex
Disallow: /
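One way to sanity-check rules like these before uploading them is Python's standard `urllib.robotparser` module. This is only a sketch of how a parser that follows the robots.txt convention reads the rules; it says nothing about whether a given crawler actually obeys them, and the `/index.php` path is just an example.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Yandex
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Yandex is disallowed everywhere; agents with no matching record are allowed.
print(parser.can_fetch("Yandex", "/index.php"))
print(parser.can_fetch("Googlebot", "/index.php"))
```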


Got something on my site, which I'm not familiar with:

 

Name: 0.83

The IP address keeps changing, but many of the visits come from different Comcast nodes such as "c-66-41-29-213.hsd1.mn.comcast.net".

 

No session, no referrer.

 

I searched through my spiders.txt, but did not find anything like "0.83".

Does anyone know whether this is a real "bot" or just someone too interested in my site?

 

Thanks in advance,

regards

Andreas


I have this unrecognized spider:

 

msnbot-207-46-12-118.search.msn.com

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)

IP: 207.46.12.118


A security risk? No more than any other PC. The thing to look at is if this "user" went around your site adding items to a cart. How many pages did it visit at this time? Do you see a session ID in all the URLs or maybe just one or two?

 

Remember that the purpose of spiders.txt is NOT to prevent bots from visiting your site - it's to keep session IDs out of search engine indexes and to prevent them from doing things that require a session.


We have installed a site search engine and would like to add our own site spider to the list. Anyone know how this can be done?


I am using spiders.txt dated 04-17-2010, which I believe is the most recent. It is not detecting the following bot:

 

User Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

 

I thought adding "bingbot" (without quotes) to spiders.txt would allow detection but that did not seem to work. I actually thought that one of the existing strings would catch it but this bot is showing up in Who's Online as a customer. Can someone please tell me what string needs to be in spiders.txt to allow proper detection?

 

Thanks


The "gbot" entry picks up this spider, since "bingbot" contains "gbot" - it is line 27 in spiders.txt (presuming you haven't changed the order of the bots from the original file).

My Who's Online registers User Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) as a bot.

 

Smiler

Edited by Jan Zonjee
spamming


We have installed a site search engine and would like to add our own site spider to the list. Anyone know how this can be done?

 

You need to know what "user agent" string the spider supplies when making the HTTP request. Ideally, some part of it can be used to identify it as a bot. If the UA string includes "bot/" or "/bot", that would do the trick. If it doesn't fit the pattern of any of the existing strings, then figure out what would identify it (without a false positive on a legitimate browser) and add that string to the spiders.txt file.

 

If your search engine supplies a generic UA or one that matches that of a browser, you can't.
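Before adding a new string, it may be worth checking it against a few real browser UA strings to catch false positives. The sketch below is one way to do that in Python; the helper name `false_positives` and the two sample browser-style UA strings are assumptions for illustration, and a real check would use UA strings taken from your own server log.

```python
def false_positives(candidate, browser_user_agents):
    """Return the browser UA strings that the candidate substring would match."""
    needle = candidate.strip().lower()
    return [ua for ua in browser_user_agents if needle and needle in ua.lower()]


# Sample browser-style UA strings (illustrative only).
BROWSER_UAS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SLCC1; "
    ".NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) "
    "Gecko/20100401 Firefox/3.6.3",
]

print(false_positives("bot/", BROWSER_UAS))     # empty list: safe against these samples
print(false_positives("mozilla", BROWSER_UAS))  # matches every sample: far too broad
```

An empty result only shows the string is safe against the samples tested, so the more browser UA strings you try, the better.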


Steve,

 

I am getting lots of visits from users who have SIMBAR in their user agent. From what I have read, it appears that these users have some sort of malware/adware on their system. Should I be concerned in any way? Should I block any user with SIMBAR in their user agent?


Blocking people because of "this, that, or the other thing" is a never-ending endeavor because "this, that, or the other thing" is constantly changing.

 

Either your site is secure or it isn't.

 

If it's secure you don't have to worry.

 

If it isn't, sooner or later someone will break in before you have the chance to block them because of "this, that, or the other thing".

:blush:

 

Just my 2 cents.

 

Take it or leave it.

:)




Hi There

 

I too have an MSN bot that is showing in my Who's Online 3.5.4 as a customer rather than a bot.

I'm not sure why. I recently moved servers and have had to make many changes to get things right. This is one of them, but I can't work out why it happens. I have downloaded the latest spiders.txt. Any clues would be appreciated.

Below is the info from Who's Online:

 

00:00:00 Guest msnbot-207-46-13-95.search.msn.com 09:59:52 am 09:59:52 am HTC 35H00132-00M, 35H00132-05M, BA S410 , Battery (Product) Yes Not Found

Name: Guest
ID: 0
IP Address: 207.46.13.95
User Agent:
osCsid: e8cb6afc74dafb79a9b16df0a4b25da8

 

 

 

 

 

thank you

 

David



