stevel (Author) Posted July 2, 2010
spiders.txt does not block search engines from your site. What it does is prevent them from creating sessions, so that they cannot "add to cart" or go places only humans should go, and, most importantly, so that URLs in their index do not contain session IDs.
When a "bot" visits your site, it (usually) supplies a user agent string that identifies it. Since a lot of bots have the string "ebot" in their UA strings, that one entry is used to detect all of them. Googlebot is just one. Similarly, "nbot" detects MSNbot and any other bot with "nbot" in its UA string. These bots are not bad - in fact they are good - you want your site indexed. You just don't want them following "add to cart" links and leaving session IDs in URLs.
If you actually want to block a bot, the first thing is to add an entry to robots.txt. All well-behaved bots will honor this. See this Wikipedia article for more info. I don't know if Yandex honors this - it probably does. You may have to visit its web site to see what to put in robots.txt.
Steve
Contributions: Country-State Selector, Login Page a la Amazon, Protection of Configuration, Updated spiders.txt, Embed Links with SID in Description
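To make the matching idea above concrete, here is a minimal sketch in Python. It is not the store's actual code: the function name `is_spider` and the two-entry token list are assumptions for illustration (the real spiders.txt has many more entries), and it only shows the general technique of a case-insensitive substring test on the user agent.

```python
# Hypothetical sketch of spiders.txt-style detection: a visitor counts as a
# spider if any listed token appears anywhere in its user agent string.
SPIDER_TOKENS = ["ebot", "nbot"]  # only the two example entries from the post


def is_spider(user_agent, tokens=SPIDER_TOKENS):
    """Return True if any non-empty token is a substring of the user agent."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in tokens if token)


if __name__ == "__main__":
    # "Googlebot" contains "ebot", so the single entry "ebot" covers it.
    print(is_spider("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
    # A plain browser user agent matches neither token.
    print(is_spider("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)"))
```

A visitor flagged this way would simply not be given a session; nothing here stops it from fetching pages.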
Androider Posted July 2, 2010 (edited)
Thank you for your kind reply. I also saw this one crawling my website: as13448.com. Do I just put as13448 somewhere in the spiders.txt file to stop this bot from creating sessions?
Another question: by putting "yandex" in the spiders.txt file, you stop them from creating sessions? Will that reduce the bandwidth they use? So it is OK for them to visit the front page of my website? I ask because whenever yandex.ru came to my website, it was viewing most of my products one by one. Does this mean that I will still see them on my Who's Online page? Thank you.
stevel (Author) Posted July 2, 2010 (edited)
Quote:
    Do I just put as13448 somewhere in the spiders.txt file to stop this bot from creating sessions?
No - you have to look at the user agent string in the server log and see what it has there. If it is not a well-behaved bot, it may not have anything you can use to identify it. Is it causing trouble for you?
Yes, you will still see the bots on Who's Online. From experience, I'd say NOT to trust what that page says about whether or not the visitor has a session.
Androider Posted July 2, 2010
Quote:
    Is it causing trouble for you?
To be honest, I'm not sure if bots are causing problems. I just became curious about who this yandex.ru (which was on my website every day) was, did some searching, and found people complaining that it eats up to 1 GB of bandwidth a day. So is spiders.txt how you stop them from using bandwidth? I just want my site as clean as possible. So should I just remove as13448 from spiders.txt, since it's of no use there?
stevel (Author) Posted July 2, 2010
I would remove as13448 from spiders.txt. You can use robots.txt to slow down a spider - read the link I posted. AS13448.com is operated by Websense, a company that sells web filtering devices and services. Can you show me a line from your server log indicating an as13448.com IP address?
Androider Posted July 2, 2010
I'm a newbie at technical stuff, but I was able to find this in cPanel's AWStats:
    static-208-80-193-39.as13448.com
stevel (Author) Posted July 2, 2010
That's not the user agent string. You want a line that looks something like this:
    220.181.7.44 - - [12/Apr/2010:02:32:03 -0400] "GET /robots.txt HTTP/1.1" 200 451 www.example.com "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)" "-"
See that string that starts with "Baiduspider"? That's the user agent. If you're using awstats, you should be able to locate the access log.
If you want to block Yandex entirely (and posts I have read suggest that is a good idea), add this to your robots.txt:
    User-agent: Yandex
    Disallow: /
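For anyone unsure where the user agent sits in such a log line, here is a small Python sketch that pulls it out. It assumes the exact field order of the sample line above (request, referrer, user agent, then one more quoted field); other servers log in different layouts, and a user agent that itself contains a double quote would break this simple approach. The function name `extract_user_agent` is made up for the example.

```python
import re

# Sample line copied from the post above. The position of the user agent
# (third double-quoted field) is an assumption tied to this log format.
LOG_LINE = (
    '220.181.7.44 - - [12/Apr/2010:02:32:03 -0400] '
    '"GET /robots.txt HTTP/1.1" 200 451 www.example.com "-" '
    '"Baiduspider+(+http://www.baidu.com/search/spider.htm)" "-"'
)


def extract_user_agent(line):
    """Return the third double-quoted field of the line, or None if absent."""
    quoted_fields = re.findall(r'"([^"]*)"', line)
    return quoted_fields[2] if len(quoted_fields) >= 3 else None


if __name__ == "__main__":
    print(extract_user_agent(LOG_LINE))
```

The string it returns is what you would inspect for a distinctive fragment before touching spiders.txt.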
Andreas2003 Posted July 2, 2010
I've got something on my site that I'm not familiar with:
    Name: 0.83
The IP address keeps changing, but many visits come from different Comcast nodes like "c-66-41-29-213.hsd1.mn.comcast.net". No session, no referrer. I searched through my spiders.txt but did not find anything like "0.83". Does anyone know if this is a real "bot" or someone too interested in my site? Thanks in advance, regards Andreas
Andreas2003 Posted July 2, 2010 (edited)
Quote:
    Name: 0.83
I got some more information: http://www.80legs.com/spider.html
I blocked it through robots.txt:
    User-agent: 008
    Disallow: /
I hope that will work.
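One way to sanity-check robots.txt rules like the ones quoted in this thread is Python's standard `urllib.robotparser`. The sketch below feeds it the Yandex and 008 rules and asks what a compliant crawler should conclude. The helper name `is_allowed` and the example.com URLs are placeholders, and this only shows how a well-behaved parser reads the file; it says nothing about whether a given bot actually obeys it.

```python
from urllib.robotparser import RobotFileParser

# The two rule groups quoted in this thread, combined into one file.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /

User-agent: 008
Disallow: /
"""


def is_allowed(robots_txt, agent, url):
    """Return True if a compliant crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Yandex", "008", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "http://www.example.com/"))
```

Note that the agent is passed as the short robot name, not the full user agent string, because that is what this parser compares against the User-agent lines.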
surcie Posted July 23, 2010
I have this unrecognized spider:
    msnbot-207-46-12-118.search.msn.com
    Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
    IP: 207.46.12.118
stevel (Author) Posted July 23, 2010
Well, that's a bit odd. While the hostname has msnbot in it, the user agent string just looks like MSIE. There's no way to detect that as a bot.
surcie Posted July 24, 2010
So it's a PC user from MSN services? Can this be considered a security risk? Thanks in advance.
stevel (Author) Posted July 24, 2010
A security risk? No more than any other PC. The thing to look at is whether this "user" went around your site adding items to a cart. How many pages did it visit at this time? Do you see a session ID in all the URLs or maybe just one or two? Remember that the purpose of spiders.txt is NOT to prevent bots from visiting your site - it's to keep session IDs out of search engine indexes and to prevent bots from doing things that require a session.
spoofy Posted February 7, 2011
Hey Steve, should we go ahead and add the new Bing/Yahoo bot called "bingbot"?
My Contributions: Google XML Sitemap SEO compatible with Ultimate SEO URL by FWR Media ::: Accurate & Precise Bread Crumb Trail
Guest Posted April 18, 2011
We have installed a site search engine and would like to add our own site spider to the list. Anyone know how this can be done?
etzeppy Posted April 26, 2011
I am using spiders.txt dated 04-17-2010, which I believe is the most recent. It is not detecting the following bot:
    User Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
I thought adding "bingbot" (without quotes) to spiders.txt would allow detection, but that did not seem to work. I actually thought that one of the existing strings would catch it, but this bot is showing up in Who's Online as a customer. Can someone please tell me what string needs to be in spiders.txt to allow proper detection? Thanks
smiler99 Posted May 12, 2011 (edited)
Quote:
    I am using spiders.txt dated 04-17-2010, which I believe is the most recent. It is not detecting the following bot: User Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) I thought adding "bingbot" (without quotes) to spiders.txt would allow detection but that did not seem to work. I actually thought that one of the existing strings would catch it but this bot is showing up in Who's Online as a customer. Can someone please tell me what string needs to be in spiders.txt to allow proper detection? Thanks
"gbot" picks up this spider - it is line 27 in spiders.txt (presuming you haven't changed the order of the bots from the original file). My Who's Online registers User Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) as a bot.
Smiler
Edited May 12, 2011 by Jan Zonjee (reason: spamming)
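If it is unclear which existing entry should catch a given user agent, a quick substring check settles it. The Python sketch below assumes the list is matched as case-insensitive substrings, as described earlier in the thread; the three-entry list is only a sample, and `matching_tokens` is a made-up helper name.

```python
# The bingbot user agent quoted in the posts above.
BINGBOT_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"


def matching_tokens(user_agent, tokens):
    """Return every non-empty token that occurs as a substring of the user agent."""
    ua = user_agent.lower()
    return [token for token in tokens if token and token.lower() in ua]


if __name__ == "__main__":
    # Sample entries only; "gbot" is the one reported to match bingbot.
    print(matching_tokens(BINGBOT_UA, ["ebot", "nbot", "gbot"]))
```

If an entry matches here but the visitor still shows up as a customer, the cause is more likely elsewhere (for example, the file not being read) than a missing string.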
stevel (Author) Posted June 8, 2011
Quote:
    We have installed a site search engine and would like to add our own site spider to the list. Anyone know how this can be done?
You need to know what "user agent" string the spider supplies when making the HTTP request. It would, ideally, have some part of it that can be used to identify it as a bot. If the UA string includes "bot/" or "/bot", that would do the trick. If it doesn't fit the pattern of any of the existing strings, then figure out what would identify it (without a false positive on a legitimate browser) and add the string to the spiders.txt file. If your search engine supplies a generic UA or one that matches that of a browser, you can't.
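The "without a false positive" part can be checked before editing spiders.txt by testing a candidate string against user agents of real browsers. The Python sketch below is only an illustration under that assumption: the sample list holds just the one MSIE user agent quoted earlier in this thread (you would want a broader sample from your own logs), and `false_positives` is a hypothetical helper name.

```python
# Browser user agents that must NOT be detected as spiders. This single entry
# is the MSIE string quoted earlier in the thread; extend it from your own logs.
BROWSER_UAS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SLCC1; "
    ".NET CLR 1.1.4322; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)",
]


def false_positives(candidate, browser_uas=BROWSER_UAS):
    """Return the browser user agents that the candidate string would match."""
    needle = candidate.lower()
    return [ua for ua in browser_uas if needle in ua.lower()]


if __name__ == "__main__":
    print(false_positives("bot/"))        # a bot-specific fragment
    print(false_positives("compatible"))  # far too generic to be safe
```

An empty result for a candidate only means it is safe against the samples you tried, so the larger the browser sample, the more the check is worth.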
smiler99 Posted August 14, 2011
Steve, I am getting lots of visits from users who have SIMBAR in their user agent. From what I have read, it appears that these users have some sort of malware/adware on their system. Should I be concerned in any way? Should I block any user with SIMBAR in their user agent?
germ Posted August 14, 2011
Blocking people because of "this, that, or the other thing" is a never-ending endeavor because "this, that, or the other thing" is constantly changing. Either your site is secure or it isn't. If it's secure you don't have to worry. If it isn't, sooner or later someone will break in before you have the chance to block them because of "this, that, or the other thing". :blush: Just my 2 cents. Take it or leave it. :)
If I suggest you edit any file(s) make a backup first - I'm not perfect and neither are you.
"Given enough impetus a parallelogramatically shaped projectile can egress a circular orifice." - Me -
"Headers already sent" - The definitive help
"Cannot redeclare ..." - How to find/fix it
SSL Implementation Help
DAVID3733 Posted August 31, 2011
Hi there. I too have an MSN bot that is showing in my Who's Online 3.5.4 as a customer rather than a bot, and I'm not sure why. I have recently moved servers and have had to make many changes to get things right. This is one of them, but I can't work out why it happens. I have downloaded the latest spiders.txt. Any clues would be appreciated. Below is the info from Who's Online:
    00:00:00 Guest msnbot-207-46-13-95.search.msn.com 09:59:52 am 09:59:52 am HTC 35H00132-00M, 35H00132-05M, BA S410 , Battery (Product) Yes Not Found
    Name: Guest
    ID: 0
    IP Address: 207.46.13.95
    User Agent:
    osCsid: e8cb6afc74dafb79a9b16df0a4b25da8
Thank you,
David
sackling Posted March 13, 2013
What happened to the updates to this addon?