NodsDorf Posted January 17, 2012

I've read probably a couple dozen articles and posts about stopping Baiduspider, and I have yet to be able to keep them off our site. If anybody has experience with "effectively" blocking them, please share. In my efforts I have blocked user agents, first trying to emulate httpd.conf in .htaccess:

```apacheconf
SetEnvIfNoCase User-Agent "^Baiduspider" block_bot
Order Allow,Deny
Allow from All
Deny from env=block_bot
```

In conjunction with a pure .htaccess user agent block:

```apacheconf
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]
```

Finally, resorting to banning IPs and hosts:

```apacheconf
order allow,deny
deny from *.baidu.com
deny from 203.125.234.
deny from 220.181.7.
deny from 123.125.66.
deny from 123.125.71.
deny from 119.63.192.
deny from 119.63.193.
deny from 119.63.194.
deny from 119.63.195.
deny from 119.63.196.
deny from 119.63.197.
deny from 119.63.198.
deny from 119.63.199.
deny from 180.76.5.
deny from 202.108.249.185
deny from 202.108.249.177
deny from 202.108.249.182
deny from 202.108.249.184
deny from 202.108.249.189
deny from 61.135.146.200
deny from 61.135.145.221
deny from 61.135.145.207
deny from 202.108.250.196
deny from 68.170.119.76
deny from 207.46.199.52
allow from all
```

Yet Baidu appears to be masking itself under different ISPs. I've seen msn, kimsufi.com, and now wowrack.com as the ISP, but the user agent is still Baiduspider. I have no idea how they are getting around my user agent blocks, but they are. This is currently on my site:

208-115-111-72-reverse.wowrack.com
IP address: 208.115.111.72
User agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

I can of course ban this IP, but they seem to have limitless ISPs and IPs to draw from, and we don't really like banning IPs unless they are from another country in which we don't do business.
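For what it's worth, the user agent string logged above begins with "Mozilla/5.0", not "Baiduspider", so any pattern anchored with ^ can never match it. An unanchored, case-insensitive variant of the same blocks would look something like this (a sketch only, not verified on this setup):

```apacheconf
# Substring match: fires wherever "Baiduspider" appears in the UA
SetEnvIfNoCase User-Agent "Baiduspider" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot

# mod_rewrite equivalent, also unanchored
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F]
```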
Guest Posted January 17, 2012

@@NodsDorf, I have the same problem with the Brandwatch bot from the UK; it constantly hijacks servers to continue crawling and disregards robots.txt. I have no solution for you, just thought I would mention Baidu is not the only rogue bot.

Chris
kymation Posted January 17, 2012

Try this change to your .htaccess block:

```apacheconf
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]
```

Regards
Jim

See my profile for a list of my addons and ways to get support.
Guest Posted January 17, 2012

Try this. Mileage may vary, but it works fine for me.

```apacheconf
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net$ [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ - [F,NC]

SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
```

Regards,
George
LinkYeah Posted January 17, 2012

@@DunWeb If you're not happy about the Brandwatch crawler accessing your site, and it's not listening to your robots.txt, then let me know the URL and I'll make sure we stop crawling. Sorry if it's caused you any discontent.

Thanks,
Joel
Community Manager at Brandwatch
14steve14 Posted January 17, 2012

I had a similar problem with the Baiduspider bot constantly crawling the site. I tried adding the following to robots.txt:

```
User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Disallow: /

User-agent: Baiduspider-video
Disallow: /

User-agent: Baiduspider-news
Disallow: /

User-agent: Baiduspider-favo
Disallow: /

User-agent: Baiduspider-cpro
Disallow: /

User-agent: Baiduspider-ads
Disallow: /

User-agent: Baidu
Disallow: /
```

And still it came. I then added it to the list of bad bots to block in the .htaccess file:

```apacheconf
# Block Bad Bots
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
```

and then added this to the bottom of the .htaccess file:

```apacheconf
<Files 403.shtml>
order allow,deny
allow from all
</Files>
deny from 180.76.0.0/16
```

For the time being it seems to have stopped visiting. I dare say it will start again. The only other thing I could add is that it took a few weeks for them to stop.

REMEMBER BACKUP, BACKUP AND BACKUP
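Whether rules like these are even written correctly can be sanity-checked locally: Python's standard urllib.robotparser applies the same matching logic a well-behaved crawler would. A quick illustration (not specific to any one crawler's actual behaviour):

```python
from urllib import robotparser

# The same kind of per-agent block as in the robots.txt above
rules = """\
User-agent: Baiduspider
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.modified()  # mark rules as loaded so can_fetch() will give answers
rp.parse(rules.splitlines())

print(rp.can_fetch("Baiduspider", "/index.php"))  # False: blocked
print(rp.can_fetch("Googlebot", "/index.php"))    # True: no rule applies
```

Of course, this only tells you what an obedient bot should do; it says nothing about a bot that ignores the file.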
Dennisra Posted January 17, 2012

The Baidu spiders will obey the robots.txt file. However, if you block the Baidu IPs, it won't access the file in the first place. You can find all the information needed to halt Baidu here: http://www.baidu.com/search/spider_english.html It takes a few days for them to update the database.

There are a few fake Baidu spiders. "Example: In Linux platform, you can identify Baiduspider by using a reverse DNS lookup. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames."
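The reverse-DNS check quoted there can be sketched in Python. The hostname suffix test is pure string handling; the forward-confirmation step (resolving the name back to the original IP, so a spoofer can't just publish a fake PTR record) needs network access. Function names here are illustrative:

```python
import socket

BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")

def hostname_is_baidu(hostname: str) -> bool:
    """True if a reverse-DNS hostname falls under Baidu's domains."""
    host = hostname.lower().rstrip(".")
    return host.endswith(BAIDU_SUFFIXES)

def verify_baiduspider(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the PTR record must be under
    baidu.com/baidu.jp, and that name must resolve back to the same IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)  # PTR lookup
        if not hostname_is_baidu(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]      # forward confirm
    except (socket.herror, socket.gaierror):
        return False

# The wowrack host seen earlier fails the suffix test immediately:
print(hostname_is_baidu("208-115-111-72-reverse.wowrack.com"))  # False
```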
NodsDorf Posted January 17, 2012

Thanks Jim, I added the .* this morning. I'll keep an eye on the site today and let everybody know the results.

Thanks for the post George, though I've tried this approach. Yours, unless modified, would only stop them from indexing or crawling images.

Thanks for the response Joseph. I have read Baidu's crawl page and noticed where they point out that they do obey robots.txt, but also that they know people are spoofing them. It may be that we are harshly calling Baidu bad when in fact it is people pretending to be them that are the actual problem. But I look at it from this point of view: 1) we will not ship anywhere in Asia, so there is no need for a presence in their search engine in the first place, and 2) since the Chinese are the Xerox machines of the world, we'd rather they not see us at all.
Dennisra Posted January 17, 2012

Check the IP address as recommended on the crawl page. That will tell you if it's a spoof or not. My guess is it's not. "Example: In Linux platform, you can identify Baiduspider by using a reverse DNS lookup. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames."
NodsDorf Posted January 18, 2012

Regardless of whether it's Baidu or an agent masquerading as Baidu, the user agent check should catch them, which it is not. I tried Jim's suggestion, which looked really promising, but it still hasn't stopped them. As of today:

IP address: 180.76.5.59 (DNS info is baidu)
User agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
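One likely explanation: that user agent string starts with "Mozilla/5.0", so a regex explicitly anchored with ^ (as in the ^Baiduspider rules tried above) can never match it. A quick Python illustration of the same regex semantics:

```python
import re

ua = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
      "+http://www.baidu.com/search/spider.html)")

# Anchored, like the ^Baiduspider rules: never matches, because
# the string begins with "Mozilla/5.0".
anchored = re.search(r"^Baiduspider", ua)

# Unanchored, case-insensitive: matches the substring anywhere.
unanchored = re.search(r"baiduspider", ua, re.I)

print(anchored)          # None
print(bool(unanchored))  # True
```

Dropping the ^ from the .htaccess patterns would be the equivalent fix there.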
Dennisra Posted January 18, 2012

Don Ford: You have missed the point completely. Forget all the IP stuff. It just uses resources, and you may be implementing it incorrectly to begin with. You can take care of your problem with the robots.txt file. Reread the posts with an open mind.
14steve14 Posted January 18, 2012

As I have already said, I had a similar problem, and so do many other people with this spider. The internet is full of people with robots.txt files trying to stop this spider, and it does not work. Google for "ban Baiduspider" and there will be hundreds of pages of listings. I have hopefully got rid of them by doing what I said in my previous post, but it did take a few weeks. Don't expect them to go away overnight. I was getting crawled by several of their spiders several times a day and for hours on end; they were visiting files that were listed in the robots file, and they were listing them. I tried banning individual IP addresses, and even adding a blanket ban on a whole range of IP addresses. I was only getting problems with the 180 range, as is the OP.
NodsDorf Posted January 19, 2012

Hi Joseph, I wasn't trying to criticise your post. I have already stated that I realize Baidu "says" it obeys robots.txt; I will not argue that point, nor was I trying to. What I'm wondering is how a request whose $_SERVER['HTTP_USER_AGENT'] comes back with Baiduspider is still getting through. Whether it's actually them or not doesn't matter: if the agent comes back as Baidu, they should be dropped, plain and simple.
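At the application layer, that $_SERVER['HTTP_USER_AGENT'] check amounts to a substring test before serving the page. A rough Python (WSGI-style) sketch of the same idea; the function name and responses are illustrative, not from any particular store:

```python
def drop_baidu(environ, start_response):
    """Reject any request whose User-Agent mentions baiduspider."""
    ua = environ.get("HTTP_USER_AGENT", "")
    if "baiduspider" in ua.lower():   # substring test, case-insensitive
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

The same substring-and-lowercase approach would translate directly to a PHP stripos() check at the top of application_top.php.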
Dennisra Posted January 19, 2012

I beg to differ. When correctly deployed, the Baidu spiders do obey robots.txt files. Plain and simple. However, I realize this back and forth is no longer worth the energy, so please disregard.
NodsDorf Posted January 19, 2012

I'm not sure what you're not reading or failing to see. I have already acknowledged this point twice: BAIDU CLAIMS TO OBEY ROBOTS.TXT. Maybe the third time is the charm. But that isn't the issue; maybe you should re-read what I have posted.
atddoug Posted May 13, 2012

My two cents... I tried the disallow in robots.txt and I was still getting it from two IP ranges, the 180 range and then the 220 range. This is working for me in .htaccess, for now:

```apacheconf
order allow,deny
deny from 180.76.
deny from 220.181.
allow from all
```
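Those dotted prefixes correspond to the 180.76.0.0/16 and 220.181.0.0/16 networks. Whether a given visitor IP falls inside them can be double-checked with Python's ipaddress module (a quick sketch):

```python
import ipaddress

# Networks matching "deny from 180.76." and "deny from 220.181."
BLOCKED = [
    ipaddress.ip_network("180.76.0.0/16"),
    ipaddress.ip_network("220.181.0.0/16"),
]

def is_blocked(ip: str) -> bool:
    """True if the address falls inside any denied network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)

print(is_blocked("180.76.5.59"))     # True: crawler IP seen earlier
print(is_blocked("208.115.111.72"))  # False: outside both ranges
```

A /16 covers 65,536 addresses, so this is a much broader ban than the per-/24 "deny from 180.76.5." style lines earlier in the thread.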
Igal-Incapsula Posted August 15, 2012

Hi, Baidu Spider uses the following user agents:

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search
Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8;baidu Transcoder) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; baidu Transcoder;)
Baiduspider-image+(+http://www.baidu.com/search/spider.htm)
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Source: Botopedia.org

For IP range cross-verification, you can use the "Check if this IP belongs to this bot" feature inside the bot profile page.

Hope this helps.
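A single case-insensitive pattern can cover that whole list: the crawler UAs contain "Baiduspider", the transcoder UAs contain "baidu Transcoder", and the Android mobile crawler identifies itself only via the spider.html URL. A sketch (the pattern is chosen here for illustration, not taken from Botopedia):

```python
import re

BAIDU_UA = re.compile(
    r"baiduspider|baidu transcoder|baidu\.com/search/spider", re.I
)

samples = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; baidu Transcoder;)",
    "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",  # ordinary browser
]
for ua in samples:
    print(bool(BAIDU_UA.search(ua)))  # True, True, True, False
```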
vampirehunter Posted August 15, 2012

On a fresh install of 2.3.2, what's required to ensure your files/pages are all protected and aren't accessed by these bots? Is there a tutorial somewhere? Thanks
al3ks Posted August 16, 2012

@@vampirehunter The solutions mentioned in this post, for example Jim's .htaccess change above, work with all osCommerce versions.
sammedit Posted August 20, 2012

Why does this spider need to be blocked? What does it do?
14steve14 Posted August 20, 2012

Baiduspider is a Chinese search engine spider, and it can become a real pain in the as= and use a lot of resources. There is nothing to worry about if it is not hitting your site several times a day and for long periods.
Archived
This topic is now archived and is closed to further replies.