
Baiduspider using multiple User Agents: how to stop them?


NodsDorf


I've read probably a couple dozen articles and posts about stopping Baiduspider, and I have yet to find a way to keep them off our site.

 

If anybody has experience with effectively blocking them, please share.

 

In my efforts I have blocked user agents. First I tried to emulate an httpd.conf block in .htaccess with:

 

# Set block_bot when the UA starts with "Baiduspider" (case-insensitive)
SetEnvIfNoCase User-Agent "^Baiduspider" block_bot
Order Allow,Deny
Allow from All
Deny from env=block_bot

 

in conjunction with a pure .htaccess user-agent block:

RewriteEngine On
# 403 any request whose UA starts with "Baiduspider" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]
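
Worth noting before going further: the full user agent quoted later in this thread begins with Mozilla/5.0 (compatible; Baiduspider/2.0; ...), so any pattern anchored with ^ can never match it. A minimal unanchored sketch (my suggestion, not something from the thread):

RewriteEngine On
# Match "Baiduspider" anywhere in the UA string, not only at the start
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F]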

 

Finally, I resorted to banning IPs and hosts:

order allow,deny
# Partial-domain match (Apache syntax uses a leading dot, not a wildcard)
deny from .baidu.com
deny from 203.125.234.
deny from 220.181.7.
deny from 123.125.66.
deny from 123.125.71.
deny from 119.63.192.
deny from 119.63.193.
deny from 119.63.194.
deny from 119.63.195.
deny from 119.63.196.
deny from 119.63.197.
deny from 119.63.198.
deny from 119.63.199.
deny from 180.76.5.
deny from 202.108.249.185
deny from 202.108.249.177
deny from 202.108.249.182
deny from 202.108.249.184
deny from 202.108.249.189
deny from 61.135.146.200
deny from 61.135.145.221
deny from 61.135.145.207
deny from 202.108.250.196
deny from 68.170.119.76
deny from 207.46.199.52
allow from all
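
As an aside, Apache's Allow/Deny directives also accept CIDR notation, which would collapse most of that list into a few lines. A sketch under the assumption that the prefixes above really do cover whole /16, /21, and /24 ranges:

order allow,deny
# Assumed CIDR widths covering the prefixes listed above
deny from 180.76.0.0/16
deny from 119.63.192.0/21
deny from 123.125.66.0/24
deny from 220.181.7.0/24
allow from all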

 

 

Yet Baidu appears to be masking itself under different ISPs. I've seen MSN, kimsufi.com, and now wowrack.com as the ISP, but the user agent is still Baiduspider. I have no idea how they are getting around my user-agent blocks, but they are.

 

This is currently on my site:

208-115-111-72-reverse.wowrack.com

IP address: 208.115.111.72

User agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

 

I can of course ban this IP, but they seem to have limitless ISPs and IPs to draw from, and we don't really like banning IPs unless they are from a country we don't do business with.


@@NodsDorf,

 

I have the same problem with the Brandwatch bot from the UK; it constantly hijacks servers to continue crawling and disregards robots.txt. I have no solution for you, just thought I would mention that Baidu is not the only rogue bot.

 

 

 

Chris


Try this. Mileage may vary, but it works fine for me.

 

RewriteEngine on
# Block image hotlinking from any referer other than your own site
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net$ [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ - [F,NC]
# Flag Baiduspider and deny it GET/POST access
SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

 

Regards,

 

George


@@DunWeb

 

If you're not happy about the Brandwatch crawler accessing your site, and it's not listening to your robots.txt, then let me know the URL and I'll make sure we stop crawling. Sorry if it's caused you any trouble.

 

Thanks,

 

Joel

 

Community Manager at Brandwatch


I had a similar problem with the Baiduspider bot constantly crawling the site.

 

I tried adding the following to robots.txt:

 

User-agent: Baiduspider
Disallow: /

User-agent: Baiduspider-image
Disallow: /

User-agent: Baiduspider-video
Disallow: /

User-agent: Baiduspider-news
Disallow: /

User-agent: Baiduspider-favo
Disallow: /

User-agent: Baiduspider-cpro
Disallow: /

User-agent: Baiduspider-ads
Disallow: /

User-agent: Baidu
Disallow: /

 

And still it came.

 

I then added Baiduspider to the list of bad bots blocked in the .htaccess file:

 

# Block Bad Bots
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
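
The trailing [OR] suggests this line sits inside a longer chain of user-agent conditions ending in a RewriteRule. A minimal standalone version of such a chain might look like the sketch below; SomeOtherBot is a hypothetical placeholder, not a real agent from this thread:

RewriteEngine On
# Block bad bots: every condition except the last carries [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SomeOtherBot [NC]
RewriteRule .* - [F]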

 

and then added this to the bottom of the .htaccess file:

 

# Let everyone fetch the custom 403 page so denied clients can still see it
<Files 403.shtml>
order allow,deny
allow from all
</Files>

# Deny the whole Baidu 180.76.x.x range in one CIDR block
deny from 180.76.0.0/16

 

For the time being it seems to have stopped visiting. I dare say it will start again. The only other thing I could add is that it took a few weeks for them to stop.

REMEMBER BACKUP, BACKUP AND BACKUP


The Baidu spiders will obey the robots.txt file. However, if you block the Baidu IPs, they can't fetch the file in the first place.

You can find all the information needed to halt Baidu here:

http://www.baidu.com/search/spider_english.html

 

It takes a few days for them to update their database.

 

There are a few fake Baidu spiders. "Example: In Linux platform, you can identify Baiduspider by using a reverse DNS lookup. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames."
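
That reverse-DNS check is easy to script. Since osCommerce runs on PHP, here is a minimal sketch (mine, not from the thread) that verifies an IP really belongs to Baidu with a reverse lookup followed by a confirming forward lookup:

<?php
// Verify a claimed Baiduspider IP via reverse + forward DNS lookup
function is_real_baiduspider($ip) {
    $host = gethostbyaddr($ip);             // reverse lookup: IP -> hostname
    if ($host === false || $host === $ip) { // lookup failed or no PTR record
        return false;
    }
    // Genuine Baidu crawlers resolve to *.baidu.com or *.baidu.jp
    if (!preg_match('/\.baidu\.(com|jp)$/i', $host)) {
        return false;
    }
    // Forward-confirm: the hostname must resolve back to the same IP
    return gethostbyname($host) === $ip;
}

var_dump(is_real_baiduspider('180.76.5.59')); // IP reported later in this thread
?>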


Try this change to your .htaccess block:

 

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]

 

Regards

Jim

Thanks Jim, I added the .* this morning. I'll keep an eye on the site today and let everybody know the results.

 

Try this. Mileage may vary, but it works fine for me.

 

RewriteEngine on
# Block image hotlinking from any referer other than your own site
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.yoursite.net$ [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://yoursite.net$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ - [F,NC]
# Flag Baiduspider and deny it GET/POST access
SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

 

Regards,

 

George

 

Thanks for the post George. I've tried this approach, though; unless modified, yours would only stop them from indexing or crawling images.

 

 

The Baidu spiders will obey the robots.txt file. However, if you block the Baidu IPs, they can't fetch the file in the first place.

You can find all the information needed to halt Baidu here:

http://www.baidu.com/search/spider_english.html

 

It takes a few days for them to update their database.

 

There are a few fake Baidu spiders. "Example: In Linux platform, you can identify Baiduspider by using a reverse DNS lookup. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames."

 

Thanks for the response Joseph. I have read Baidu's crawl page and noticed where they point out that they do obey robots.txt, but they know that people are spoofing them. It may be that we are harshly calling Baidu bad when in fact it is people pretending to be them who are the actual problem. But I look at it from this point of view: first, we will not ship anywhere in Asia, so there is no need for a presence in their search engine in the first place; and second, since the Chinese are the Xerox machines of the world, we'd rather they not see us at all.


Check the IP address as recommended on the crawl page. That will tell you whether it's a spoof or not. My guess is it's not.

 

"Example: In Linux platform, you can identify Baiduspider by using a reverse DNS lookup. The hostname of Baiduspider is *.baidu.com or *.baidu.jp. Others are fake hostnames."


Regardless of whether it's Baidu or an agent masquerading as Baidu, the user-agent check should catch them, which it is not doing.

 

I tried Jim's suggestion, which looked really promising, but it still hasn't stopped them.

 

As of today:

IP address: 180.76.5.59 (DNS info is Baidu)

User agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)


Don Ford:

You have missed the point completely. Forget all the IP stuff. It just uses resources, and you may be implementing it incorrectly to begin with. You can take care of your problem with the robots.txt file. Reread the posts with an open mind.


Don Ford:

You can take care of your problem with the robots.txt file. Reread the posts with an open mind.

 

As I have already said, I had a similar problem, and so do many other people with this spider. The internet is full of people with robots.txt files trying to stop this spider, and it does not work. Google for "ban Baiduspider" and there will be hundreds of pages of listings.

 

I have hopefully got rid of them by doing what I said in my previous post, but it did take a few weeks. Don't expect them to go away overnight. I was getting crawled by several of their spiders several times a day and for hours on end, and they were visiting files that were disallowed in the robots file and listing them. I tried banning individual IP addresses, and even adding a blanket ban on a whole range of IP addresses. I was only getting problems with the 180 range, as is the OP.

REMEMBER BACKUP, BACKUP AND BACKUP


Hi Joseph,

 

I wasn't trying to criticise your post. I have already stated that I realize Baidu "says" they obey robots.txt; I will not argue that point, nor was I trying to. What I'm wondering is how a request whose $_SERVER['HTTP_USER_AGENT'] comes back with Baiduspider (whether it's actually them or not doesn't matter) is getting through; if the agent comes back as Baidu, they should be dropped, plain and simple.


As I have already said, I had a similar problem, and so do many other people with this spider. The internet is full of people with robots.txt files trying to stop this spider, and it does not work. Google for "ban Baiduspider" and there will be hundreds of pages of listings.

 

I have hopefully got rid of them by doing what I said in my previous post, but it did take a few weeks. Don't expect them to go away overnight. I was getting crawled by several of their spiders several times a day and for hours on end, and they were visiting files that were disallowed in the robots file and listing them. I tried banning individual IP addresses, and even adding a blanket ban on a whole range of IP addresses. I was only getting problems with the 180 range, as is the OP.

 

I beg to differ. When correctly deployed, the Baidu spiders do obey robots.txt files. Plain and simple. However, I realize this back and forth is no longer worth the energy, so please disregard.


I beg to differ. When correctly deployed, the Baidu spiders do obey robots.txt files. Plain and simple. However, I realize this back and forth is no longer worth the energy, so please disregard.

 

I'm not sure what you're not reading or failing to see. I have already acknowledged this point twice:

--> BAIDU CLAIMS TO OBEY ROBOTS.TXT <--

Maybe the third time is the charm.

 

 

But that isn't the issue; maybe you should re-read what I have posted.


  • 3 months later...

My two cents... I tried the disallow in robots.txt and I was still getting hit from two IP ranges, the 180 range and then the 220 range. This is working for me in .htaccess... for now...

 

order allow,deny
# Ban the two offending Baidu ranges by IP prefix
deny from 180.76.
deny from 220.181.
allow from all
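
For anyone on Apache 2.4, where Order/Allow/Deny are deprecated in favour of mod_authz_core, a rough equivalent of the block above (my sketch, same two ranges expressed as assumed /16 networks) would be:

<RequireAll>
    Require all granted
    # Same two Baidu ranges as above, in CIDR form
    Require not ip 180.76.0.0/16
    Require not ip 220.181.0.0/16
</RequireAll>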


  • 3 months later...

Hi,

 

Baidu Spider uses the following user agents:

 

 

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8;baidu Transcoder) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; baidu Transcoder;)
Baiduspider-image+(+http://www.baidu.com/search/spider.htm)
Baiduspider+(+http://www.baidu.com/search/spider.htm)

 

Source: Botopedia.org
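
Note that several of those strings do not begin with "Baiduspider" at all, so any pattern anchored with ^ will miss them. A catch-all sketch of my own (unanchored, case-insensitive, covering both the spider and the transcoder variants):

RewriteEngine On
# Match any Baidu crawler or transcoder UA anywhere in the string
RewriteCond %{HTTP_USER_AGENT} (baiduspider|baidu\ transcoder) [NC]
RewriteRule .* - [F]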

 

 

For IP-range cross-verification, you can use the "Check if this IP belongs to this bot" feature inside the bot's profile page.

 

Hope this helps.


@@vampirehunter

 

The solutions mentioned in this post work with all osCommerce versions.

 

Try this change to your .htaccess block:

 

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.* [NC]

 

Regards

Jim

Find this post helpful? Click the 'Like this' button. :)


Baiduspider is a Chinese search engine spider, and it can become a real pain and use a lot of resources. There is nothing to worry about unless they are hitting your site several times a day and for long periods.

REMEMBER BACKUP, BACKUP AND BACKUP


Archived

This topic is now archived and is closed to further replies.
