
Baiduspider using multiple User Agents how to stop them?


15 replies to this topic

#1 NodsDorf

  • Community Member
  • 1,233 posts
  • Real Name:Don Ford
  • Gender:Male
  • Location:ohio usa

Posted 17 January 2012, 00:09

I've read probably a couple dozen articles and posts about stopping Baiduspider, and I have yet to be able to keep them off our site.

If anybody has experience with "effectively" blocking them please share.

In my efforts I have blocked user agents. I first tried to emulate httpd.conf in .htaccess with:

SetEnvIfNoCase User-Agent "^Baiduspider" block_bot
Order Allow,Deny
Allow from All
Deny from env=block_bot

In conjunction with a pure .htaccess user agent block:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]

Finally resorting to banning IPs and hosts:
order allow,deny
deny from .baidu.com
deny from 203.125.234.
deny from 220.181.7.
deny from 123.125.66.
deny from 123.125.71.
deny from 119.63.192.
deny from 119.63.193.
deny from 119.63.194.
deny from 119.63.195.
deny from 119.63.196.
deny from 119.63.197.
deny from 119.63.198.
deny from 119.63.199.
deny from 180.76.5.
deny from 202.108.249.185
deny from 202.108.249.177
deny from 202.108.249.182
deny from 202.108.249.184
deny from 202.108.249.189
deny from 61.135.146.200
deny from 61.135.145.221
deny from 61.135.145.207
deny from 202.108.250.196
deny from 68.170.119.76
deny from 207.46.199.52
allow from all


Yet Baidu appears to be masking itself under different ISPs. I've seen msn, kimsufi.com, and now wowrack.com as the ISP, but the user agent is still Baiduspider. I have no idea how they are getting around my user agent blocks, but they are.

This is currently on my site:
208-115-111-72-reverse.wowrack.com
IP address: 208.115.111.72
User agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

I can of course ban this IP, but they seem to have limitless ISPs and IPs to draw from, and we don't really like banning IPs unless they are from another country in which we don't do business.

Edited by NodsDorf, 17 January 2012, 00:15.


#2 DunWeb

  • Community Sponsor
  • 10,466 posts
  • Real Name:Chris Dunn
  • Gender:Male
  • Location:Tecumseh, Ontario, Canada N8N 1X8

Posted 17 January 2012, 01:27

@NodsDorf,

I have the same problem with the Brandwatch bot from the UK. It constantly hijacks servers to continue to crawl and disregards robots.txt. I have no solution for you; I just thought I would mention that Baidu is not the only rogue bot.



Chris

#3 kymation

  • Community Sponsor
  • 5,663 posts
  • Real Name:Jim Keebaugh
  • Gender:Male
  • Location:Aberdeen WA USA

Posted 17 January 2012, 01:33

Try this change to your .htaccess block:

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]

Regards
Jim

#4 H2H

  • Community Member
  • 1 posts
  • Real Name:George

Posted 17 January 2012, 04:25

Try this. Mileage may vary, but it works fine for me.

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net/.*$	  [NC]
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net$	  [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net/.*$	  [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net$	  [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ - [F,NC]
SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

Regards,

George

Edited by H2H, 17 January 2012, 04:26.


#5 LinkYeah

  • Community Member
  • 1 posts
  • Real Name:Joel Windels

Posted 17 January 2012, 08:54

@DunWeb

If you're not happy about the Brandwatch crawler accessing your site, and it's not listening to your robots.txt, then let me know the URL and I'll make sure we stop crawling. Sorry if it's caused you any discontent.

Thanks,

Joel

Community Manager at Brandwatch

#6 14steve14

  • Community Member
  • 2,176 posts
  • Real Name:Steve
  • Gender:Male

Posted 17 January 2012, 09:23

I had a similar problem with the Baiduspider bot constantly crawling the site.

I tried adding the following to the robots.txt

User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-image
Disallow: /
User-agent: Baiduspider-video
Disallow: /
User-agent: Baiduspider-news
Disallow: /
User-agent: Baiduspider-favo
Disallow: /
User-agent: Baiduspider-cpro
Disallow: /
User-agent: Baiduspider-ads
Disallow: /
User-agent: Baidu
Disallow: /

And still it came.
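
Before concluding that the crawler ignores these rules, it may be worth confirming that they parse the way you expect. Here is a quick sketch using Python's standard `urllib.robotparser`, fed a shortened copy of the rules above. The `/index.php` path is just a placeholder, and this only shows how one parser reads the file, not how Baidu's crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules from this post.
RULES = """\
User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# can_fetch(useragent, url) returns False when the rules disallow the URL.
baidu_allowed = parser.can_fetch("Baiduspider", "/index.php")
other_allowed = parser.can_fetch("Googlebot", "/index.php")
print(baidu_allowed, other_allowed)  # prints: False True
```

If the rules parse as intended and the visits continue, the remaining suspects are a spoofed user agent or a crawler that has not yet re-read the file.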

I then added this to the list of bad bots to block in the .htaccess file:

# Block Bad Bots
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]

and then added this to the bottom of the .htaccess file:

<Files 403.shtml>
order allow,deny
allow from all
</Files>
deny from 180.76.0.0/16

For the time being it seems to have stopped visiting. I dare say it will start again. The only other thing I could add is that it took a few weeks for them to stop.
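
For anyone unsure what a range like `180.76.0.0/16` covers, here is a small sketch using Python's standard `ipaddress` module. It checks two addresses reported in this thread against that range. The `is_blocked` helper and the `BLOCKED` list are made up for illustration and are not part of any Apache setup.

```python
import ipaddress

# Range from the .htaccess rule above (deny from 180.76.0.0/16).
BLOCKED = [ipaddress.ip_network("180.76.0.0/16")]

def is_blocked(address: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED)

print(BLOCKED[0].num_addresses)      # 65536 addresses in a /16
print(is_blocked("180.76.5.59"))     # True: inside 180.76.x.x
print(is_blocked("208.115.111.72"))  # False: the wowrack.com address from the first post
```

A visitor that sends a Baiduspider user agent from outside the range is not touched by the IP rule, which is one reason to pair it with a user agent check.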
REMEMBER BACKUP, BACKUP AND BACKUP
I am not a coder. OSC has a steep learning curve, but in general the program does work. If it doesnt work, the chances are it is something you have done.

#7 Dennisra

  • Community Member
  • 507 posts
  • Real Name:Joseph D. Jefferson
  • Gender:Male

Posted 17 January 2012, 16:48

The Baidu spiders will obey the robots.txt file. However, if you block the Baidu IPs, the spider won't be able to access the file in the first place.
You can find all the information needed to halt Baidu here:
http://www.baidu.com/search/spider_english.html

It takes a few days to update the database.

There are a few fake Baidu spiders. "Example: In Linux platform, you can identify Baiduspider by using a reverse DNS lookup. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames."
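
A rough sketch of that reverse DNS check in Python, for anyone who wants to script it. It assumes the hostname suffixes quoted above (`.baidu.com`, `.baidu.jp`). The function names are hypothetical, and the forward lookup at the end is an extra precaution that is not part of Baidu's quoted text.

```python
import socket

# Hostname suffixes quoted from Baidu's spider help page.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")

def is_baidu_hostname(hostname: str) -> bool:
    """Return True if the hostname ends with one of Baidu's crawler domains."""
    return hostname.lower().rstrip(".").endswith(BAIDU_SUFFIXES)

def is_real_baiduspider(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no PTR record or the lookup failed
    if not is_baidu_hostname(hostname):
        return False
    try:
        # Extra precaution (an assumption, not from Baidu's page): the name
        # must resolve back to the same IP, since PTR records can be forged.
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in forward_ips

# The hostname reported in the first post does not pass the domain check.
print(is_baidu_hostname("208-115-111-72-reverse.wowrack.com"))  # prints: False
```

`is_real_baiduspider` needs working DNS, so results depend on the resolver of the machine that runs it.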

#8 NodsDorf

  • Community Member
  • 1,233 posts
  • Real Name:Don Ford
  • Gender:Male
  • Location:ohio usa

Posted 17 January 2012, 17:53

kymation, on 17 January 2012, 01:33, said:

Try this change to your .htaccess block:

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]

Regards
Jim
Thanks Jim, I added the .* this morning. I'll keep an eye on the site today and let everybody know the results.

H2H, on 17 January 2012, 04:25, said:

Try this. Mileage may vary, but it works fine for me.

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net/.*$	  [NC]
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net$	  [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net/.*$	  [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net$	  [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ - [F,NC]
SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

Regards,

George

Thanks for the post George. I've tried this approach, though; yours, unless modified, would only stop them from indexing or crawling images.


Dennisra, on 17 January 2012, 16:48, said:

The Baidu spiders will obey the robots.txt file. However, if you block the Baidu IPs, the spider won't be able to access the file in the first place.
You can find all the information needed to halt Baidu here:
http://www.baidu.com/search/spider_english.html

It takes a few days to update the database.

There are a few fake Baidu spiders. "Example: In Linux platform, you can identify Baiduspider by using a reverse DNS lookup. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames."

Thanks for the response Joseph. I have read Baidu's crawl page and noticed where they point out that they do obey robots.txt, but they know that people are spoofing them. It may be that we are harshly calling Baidu bad when in fact it is people pretending to be them that are the actual problem. But I look at it from this point of view: 1) we will not ship to anywhere in Asia, so there is no need for a presence in their search engine in the first place, and 2) we are worried about our products being copied, so we'd rather they not see us at all.

Edited by NodsDorf, 17 January 2012, 17:54.


#9 Dennisra

  • Community Member
  • 507 posts
  • Real Name:Joseph D. Jefferson
  • Gender:Male

Posted 17 January 2012, 21:12

Check the IP address as recommended on the crawl page. That will tell you if it's a spoof or not. My guess is it's not.

"Example: In Linux platform, you can identify Baiduspider by using a reverse DNS lookup. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames."

#10 NodsDorf

  • Community Member
  • 1,233 posts
  • Real Name:Don Ford
  • Gender:Male
  • Location:ohio usa

Posted 18 January 2012, 22:00

Regardless of whether it's Baidu or an agent masquerading as Baidu, the user agent check should catch them, which it does not.

I tried Jim's suggestion which looked real promising, but still hasn't stopped them.

As of today:
IP address: 180.76.5.59 (dns info is baidu)
User agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
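
One possible explanation, offered as a sketch rather than a confirmed diagnosis: every pattern tried so far starts with `^`, which anchors the match to the beginning of the header, while the user agent shown above begins with `Mozilla/5.0` and only mentions Baiduspider later. Apache matches these patterns with PCRE, and Python's `re` module behaves the same way for patterns this simple, so the snippet below uses it to compare the variants. The `matches` helper is hypothetical.

```python
import re

# User agent string exactly as reported in the post above.
UA = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

def matches(pattern: str, user_agent: str = UA) -> bool:
    """Return True if the pattern is found in the user agent, ignoring case (like [NC])."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None

# Patterns tried in this thread: both are anchored to the start of the string.
anchored = matches(r"^Baiduspider")
anchored_dotstar = matches(r"^Baiduspider.*")
# Same pattern without the leading ^, so it can match anywhere in the header.
unanchored = matches(r"Baiduspider")

print(anchored, anchored_dotstar, unanchored)  # prints: False False True
```

If the same holds on the server, dropping the `^` (for example `RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]`) would be the change to try. Adding `.*` after the name makes no difference because the anchor at the front is what prevents the match.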

#11 Dennisra

  • Community Member
  • 507 posts
  • Real Name:Joseph D. Jefferson
  • Gender:Male

Posted 18 January 2012, 22:47

Don Ford:
You have missed the point completely. Forget all the IP stuff. It just uses resources, and you may be implementing it incorrectly to begin with. You can take care of your problem with the robots.txt file. Reread the posts with an open mind.

#12 14steve14

  • Community Member
  • 2,176 posts
  • Real Name:Steve
  • Gender:Male

Posted 18 January 2012, 22:56

Dennisra, on 18 January 2012, 22:47, said:

Don Ford:
You can take care of your problem with the robots.txt file. Reread the posts with an open mind.

As I have already said, I had a similar problem, and so do many other people with this spider. The internet is full of people with robots.txt files trying to stop this spider, and it does not work. Search Google for "ban Baiduspider" and there will be hundreds of pages of listings.

I have hopefully got rid of them by doing what I said in my previous post, but it did take a few weeks. Don't expect them to go away overnight. I was getting crawled by several of their spiders several times a day, for hours on end, and they were visiting files that were listed in the robots file, and they were listing them. I tried banning individual IP addresses, and even adding a blanket ban on a whole range of IP addresses. I was only getting problems with the 180 range, as is the OP.

#13 NodsDorf

  • Community Member
  • 1,233 posts
  • Real Name:Don Ford
  • Gender:Male
  • Location:ohio usa

Posted 19 January 2012, 00:47

Hi Joseph,

I wasn't trying to criticise your post. I have already stated that I realize Baidu "says" they obey robots.txt. I will not argue that point, nor was I trying to. I'm wondering how a request whose $_SERVER['HTTP_USER_AGENT'] comes back with Baiduspider gets through. Whether it's actually them or not doesn't matter: if the agent comes back as Baidu, they should be dropped, plain and simple.

#14 Dennisra

  • Community Member
  • 507 posts
  • Real Name:Joseph D. Jefferson
  • Gender:Male

Posted 19 January 2012, 02:39

14steve14, on 18 January 2012, 22:56, said:

As I have already said, I had a similar problem, and so do many other people with this spider. The internet is full of people with robots.txt files trying to stop this spider, and it does not work. Search Google for "ban Baiduspider" and there will be hundreds of pages of listings.

I have hopefully got rid of them by doing what I said in my previous post, but it did take a few weeks. Don't expect them to go away overnight. I was getting crawled by several of their spiders several times a day, for hours on end, and they were visiting files that were listed in the robots file, and they were listing them. I tried banning individual IP addresses, and even adding a blanket ban on a whole range of IP addresses. I was only getting problems with the 180 range, as is the OP.

I beg to differ. When correctly deployed, the Baidu spiders do obey robots.txt files. Plain and simple. However, I realize this back and forth is no longer worth the energy, so please disregard.

#15 NodsDorf

  • Community Member
  • 1,233 posts
  • Real Name:Don Ford
  • Gender:Male
  • Location:ohio usa

Posted 19 January 2012, 15:16

Dennisra, on 19 January 2012, 02:39, said:

I beg to differ. When correctly deployed, the Baidu spiders do obey robots.txt files. Plain and simple. However, I realize this back and forth is no longer worth the energy, so please disregard.

I'm not sure what you're not reading or failing to see. I have already acknowledged this point twice.
--> BAIDU CLAIMS TO OBEY ROBOTS.TXT <---
Maybe 3 is a charm


But that isn't the issue, maybe you should re-read what I have posted.

#16 atddoug

  • Community Member
  • 1 posts
  • Real Name:Doug Marquardt

Posted 13 May 2012, 03:06

My two cents ... I tried the disallow in robots.txt and I was still getting it from two IP ranges, the 180 range and then the 220 range. This is working for me in .htaccess ... for now ...

order allow,deny
deny from 180.76.
deny from 220.181.
allow from all