Archived

This topic is now archived and is closed to further replies.

stefan21

Blackhole for Bad Bots

Recommended Posts

I don't know if this is the right place for my question.

 

I'd like to secure my shop (osCommerce Online Merchant v2.3.4) with this little idea:

 

https://perishablepress.com/blackhole-bad-bots/

 

It's not an addon for osCommerce, but what I really like is the idea of the "one-strike" rule: bots have one chance to follow the robots.txt protocol, that is, check the site's robots.txt file and obey its directives. Failure to comply results in immediate banishment.
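In outline, the trap described there works like this (the paths here are illustrative, not the script's actual names): a directory is disallowed in robots.txt, and a hidden link to it is placed on every page. A well-behaved bot reads robots.txt and never follows the link; anything that requests the trap anyway has ignored the rules and gets banned.

```text
# robots.txt -- compliant bots never enter the trap
User-agent: *
Disallow: /blackhole/
```

The hidden link would then be something like `<a href="/blackhole/" rel="nofollow" style="display:none">trap</a>` in the page footer.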

 

As I am not a programmer, I'm not able to make this work in osC by following the instructions. It would be nice if someone else would pick up this approach and help implement it.

 

If there's already an addon like this out there, I'd like to know its name. In that case, sorry to bother you; I couldn't find one.

 

stefan


There are addons that use that approach already. See the Bad Robot Blocker, for example. There are others. Whichever approach you use, be very sure to allow the main search engines through, since they do not always obey the robots.txt rules. If you block them by mistake, all of your links with them could end up disappearing.


This code will ban Bing/Yahoo, and probably Google, from crawling your site. Do you really want to do this?

 

Regards

Jim


See my profile for a list of my addons and ways to get support.


I haven't looked at the PHP code, but it's possible that it's obsolete and fails on current PHP versions. Can I assume that you got it installed and it's failing when you try to run it? Any error messages? Or are you not understanding the installation instructions?

 

Anyway, I like the idea of a "one strike and you're out" trap for ill-mannered bots. I don't think it's necessary to embed something like this into your applications (such as osC). I would make up a fake file that hackers like to attack (e.g., /wp-admin.php, per your system's 404 error log), assuming you don't have a real WP installation in your root. Make sure these places go into your robots.txt, so well-behaved bots don't get trapped. The code in the fake file could simply email you a report of the bot's IP address, so that you could manually add it to your .htaccess deny list. Or it might just log the attempt and leave it to you to periodically add to your deny list. I'd be a bit leery of trying to automatically update .htaccess from within a program.
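The trap-file idea above can be sketched quickly. This is an illustrative Python version (the file name, function names, and log format are invented for the example, not taken from any addon): the trap simply records the offender and hands back the deny rule you would paste into .htaccess by hand.

```python
import datetime

def deny_line(ip):
    """Format an Apache 2.2-style .htaccess rule for one offending IP."""
    return "Deny from %s" % ip

def log_bad_bot(ip, user_agent, logfile="bad_bots.log"):
    """Record the offender so the deny list can be updated manually later."""
    stamp = datetime.datetime.now().isoformat()
    with open(logfile, "a") as f:
        f.write("%s %s %s\n" % (stamp, ip, user_agent))
    return deny_line(ip)
```

Keeping the .htaccess update manual, as suggested above, avoids the risk of a buggy script rewriting your server configuration on its own.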

 

I'm assuming that a bad bot poking around your site is looking to run specifically named PHP files and the like (which you can harvest from 404 logs). If they're merely looking for files to edit (to add malware), maybe a 400 file permission would work (I haven't tried it); you would probably have to read the 401 error log and manually add the offenders to the deny list. Maybe /creditcards/index.html could trap bots in a similar manner (figuring that bad bots read robots.txt to find forbidden areas to look at). Anyway, Google "honeypot" for more ideas. I'm sure there are lots of things that could be done.


There are addons that use that approach already. See the Bad Robot Blocker, for example. There are others. Whichever approach you use, be very sure to allow the main search engines through, since they do not always obey the robots.txt rules. If you block them by mistake, all of your links with them could end up disappearing.

Jack_mcs, thank you for your reply. I'll have a closer look at http://addons.oscommerce.com/info/5914

 

 

This code will ban Bing/Yahoo, and probably Google, from crawling your site. Do you really want to do this?

 

Regards

Jim

Jim, looking at the code, I don't think so. Anyway, thank you for your response.

 

 

I haven't looked at the PHP code, but it's possible that it's obsolete and fails on current PHP versions. Can I assume that you got it installed and it's failing when you try to run it? Any error messages? Or are you not understanding the installation instructions?

 

Anyway, I like the idea of a "one strike and you're out" trap for ill-mannered bots. I don't think it's necessary to embed something like this into your applications (such as osC). I would make up a fake file that hackers like to attack (e.g., /wp-admin.php, per your system's 404 error log), assuming you don't have a real WP installation in your root. Make sure these places go into your robots.txt, so well-behaved bots don't get trapped. The code in the fake file could simply email you a report of the bot's IP address, so that you could manually add it to your .htaccess deny list. Or it might just log the attempt and leave it to you to periodically add to your deny list. I'd be a bit leery of trying to automatically update .htaccess from within a program.

 

I'm assuming that a bad bot poking around your site is looking to run specifically named PHP files and the like (which you can harvest from 404 logs). If they're merely looking for files to edit (to add malware), maybe a 400 file permission would work (I haven't tried it); you would probably have to read the 401 error log and manually add the offenders to the deny list. Maybe /creditcards/index.html could trap bots in a similar manner (figuring that bad bots read robots.txt to find forbidden areas to look at). Anyway, Google "honeypot" for more ideas. I'm sure there are lots of things that could be done.

MrPhil, the idea of a "one strike and you're out" trap sounds charming to me. I'd like to have this automated, since all crawlers change their IPs constantly. I don't want to spend too much time looking and searching in, e.g., honeypot.org for bad bots...

 

Thanks to all for sharing.

stefan


If the bad guys are constantly changing IP addresses to evade .htaccess "deny" blocks, that would be a hassle to keep manually updating. Perhaps something could be done to semi-automate updates to .htaccess from logged reports of intrusions and so forth? That could mean keeping an automated database of bad IP addresses and the date of each last access attempt, and purging, on a cron job, addresses that haven't been active for 6 months or more (from both the database and the .htaccess deny list). I'm not sure it's worth automating blocks from within applications (such as osC), but maybe others have positive experiences to relate? You have to be careful about who you end up blocking (just ill-mannered bots, site scrapers, and hackers, and not Google), so a fully automated block system (whether or not it uses .htaccess) might be hazardous.
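The purge step described above could be very small. Here is a hedged Python sketch (the data structure and names are assumptions for illustration, not an existing tool): a cron job would load the database of last-seen times, drop stale entries, and rewrite the deny list from what survives.

```python
import datetime

def purge_stale(last_seen, now=None, max_age_days=180):
    """Keep only IPs whose last logged access attempt is recent enough.

    `last_seen` maps IP address -> datetime of the last attempt.
    A cron job could call this, then regenerate the .htaccess deny
    block from the surviving keys.
    """
    now = now or datetime.datetime.now()
    cutoff = now - datetime.timedelta(days=max_age_days)
    return {ip: seen for ip, seen in last_seen.items() if seen >= cutoff}
```

Purging old entries keeps the deny list short, which matters because Apache scans .htaccess on every request.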


@@stefan21 May be worth a look at the IP Trap contribution: it allows you to add directories such as "Personal", "admin", "Secret", etc., then ban the IPs that try to access them.

 

I used to use this on some 2.2 stores. OSC SEC may also be worth a look.


Now running on a fully modded, Mobile Friendly 2.3.4 Store with the Excellent MTS installed - See my profile for the mods installed ..... So much thanks for all the help given along the way by forum members.


Anybody who is familiar with Google and Bing will tell you that they will follow any link without regard to its listing in robots.txt. They should not index any pages that are listed in robots.txt, but they will still request the page. And as soon as they request the page, that script will ban their IP.

 

I've had to code traps for their bots to keep them from being served a page that was forbidden in robots.txt, so please don't tell me this doesn't happen.

 

Regards

Jim



@@piernas  Probably because the IP Trap addon doesn't include any links to the trap page(s). I've used it as well with good results.

 

Regards

Jim



Anybody who is familiar with Google and Bing will tell you that they will follow any link without regard to its listing in robots.txt. They should not index any pages that are listed in robots.txt, but they will still request the page. And as soon as they request the page, that script will ban their IP.

 

I've had to code traps for their bots to keep them from being served a page that was forbidden in robots.txt, so please don't tell me this doesn't happen.

 

Regards

Jim

I don't.

On https://perishablepress.com/blackhole-bad-bots/ it is said:

 

Whitelisting Search Bots

Initially, the Blackhole blocked any bot that disobeyed the robots.txt directives. Unfortunately, as discussed in the comments, Googlebot, Yahoo, and other major search bots do not always obey robots rules. And while blocking Yahoo! Slurp is debatable, blocking Google, MSN/Bing, et al would just be dumb. Thus, the Blackhole now “whitelists” any user agent identifying as any of the following:

  • googlebot (Google)
  • msnbot (MSN/Bing)
  • yandex (Yandex)
  • teoma (Ask)
  • slurp (Yahoo)

Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, “hey look, I’m teh Googlebot!” and the whitelist would grant access. It is possible to verify the true identity of each bot, but as X3M explains in the comments, doing so consumes significant resources and could overload the server. Avoiding that scenario, the Blackhole errs on the side of caution: it’s better to allow a few spoofs than to block any of the major search engines.
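The whitelist check quoted above amounts to a simple substring match on the user-agent string. Sketched in Python for illustration (this is not the Blackhole script's actual code):

```python
# Tokens identifying the major search engines, per the quoted article.
WHITELIST = ("googlebot", "msnbot", "yandex", "teoma", "slurp")

def is_whitelisted(user_agent):
    """True if the user-agent string claims to be a whitelisted crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in WHITELIST)
```

As the article notes, this trusts the claimed identity; a spoofed user agent passes, which is the deliberate trade-off against the cost of verifying each bot.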

 

Regards,

stefan


Hi,

 

I have a huge issue with bad bots attacking my site and causing CPU overage.

 

Can someone please let me know which is the best contribution to use to block bad bots for osCommerce 2.2? I looked at Bad Robot Blocker 1.0.1, but the instructions say to update the index.php file in the password_list directory, and I don't have a password_list directory in my setup.

 

Any help will be greatly appreciated.


@@Lary_an I'm not familiar with that addon, but are you sure the file you want is not in the package? The typical way those blockers work is to create some location that bots should not visit and then watch to see if they do visit. So the addon probably has you upload the trap files and then change the .htaccess file.

 

The above will work, assuming it is coded correctly, but it only catches some bots, and you have to be careful that legitimate ones are not caught, which can happen; there should be a way to avoid that in the code. But the problem you describe is not caused only by bots, at least not the kind that addon will catch. There are data skimmers that visit sites and just bounce around gathering data as a customer would, so they never reach the trap. You can install the addon you mentioned, but I'm not sure it will make enough of a difference to the load you are seeing. I suggest you install View Counter. It lets you see what's going on with visitors and block them as needed.


Hi Jack,

 

I definitely would like to install View Counter now; it looks like it will help us manage the traffic. But it says it is for osC 2.3 and we are still on 2.2. Is there a version I should use?

 

The reason I was looking to install the Bot Blocker is that my host told me the majority of the problems are caused by bad bots, and we tried to block them manually. That has helped for now, but if they change IPs we will be back to square one.

And you are absolutely right, it does provide the directory and the file. I guess I am so tired from trying to fix the problem that I just missed the obvious. I feel dumb now.

 

Thank you very much for your help


@@Lary_an View Counter will work with any version of osCommerce. I never noticed it said 2.3; I guess I made a mistake when uploading.

 

Most hosts just see the increased traffic and attribute it to bots, which may or may not be the case. One thing you should definitely do is block Yandex.

 

Most established bots don't change their IPs, but they do add to them. Either way, blocking IPs can be a bothersome, but necessary, chore.


Dear @@Jack_mcs and @@kymation

 

Do I really need to have the file robots.txt in my catalog directory?

 

Where can I get the latest version?

 

Best regards

 

Valqui


Setting up a new Frozen site with so many nice addons available on the market and waiting to be admitted to Phoenix club!

Community Oscommerce fan :heart:

 


First, to be clear: when you say your catalog directory, that implies the shop is not in the root directory, e.g., http://mysite.com/catalog/. The robots file goes in the root directory regardless of where the shop is located.

 

Second, the robots file is specific to each site; you have to create it. There isn't a "latest version." There are tools on the web that will help you set one up, and Google's webmaster tools can check it. But it is OK to copy another shop's file and edit it to fit yours. Here is a link to my robots.txt file.

 

As to whether you need one: yes, all sites should have one. Even if you don't want any entries in it, you should have a blank one present. The reason is that all of the search engines try to load it when first visiting a site, and if it can't be found, a 404 error is generated and the error page gets loaded. This needlessly increases the number of errors your site logs and wastes bandwidth, albeit not much overall.

 

As to what goes in the file, it is generally used to list files and directories not to be indexed. A lot of shop owners, in my experience, include the admin in it. While that may be correct for the search engines, it also lets hackers know the name of your admin directory, which shouldn't be "admin", by the way. It is not really necessary to list directories that are already blocked, like includes. But hackers will, if they can, add files to places like the temp directory, and if that isn't blocked, those pages will be listed.

 

You may also want to add traps like I have for View Counter. The directive tells the search engines not to visit that location, but hackers will ignore it and visit anyway, which lets me identify and ban them. And if you have an XML sitemap, which you should, listing a link to it in the robots file will allow for faster updates.
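Putting those points together, a minimal robots.txt along these lines (the directory names here are examples only, not anyone's real layout) might look like:

```text
User-agent: *
Disallow: /temp/
Disallow: /trap-directory/

Sitemap: http://www.example.com/sitemap.xml
```

The Disallow on the trap directory is what makes the honeypot work: compliant crawlers stay out, so anything that requests it has ignored the rules.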

