Archived

This topic is now archived and is closed to further replies.

Parikesit

Dynamic robots.txt to stop worthless traffic

Besides the security topics, there are also some threads about bandwidth saving, creating robots.txt, bad-engine traps, blocking bad engines, and others. Below are some threads that inspired me to create a dynamic robots.txt:

To create a dynamic robots.txt, we need:

  1. .htaccess (requires Apache with mod_rewrite and mod_setenvif enabled)
  2. robots.txt -> the default robots.txt for good (whitelisted) search engines
  3. robots.php -> serves the dynamic robots.txt

 

.htaccess file

Options +FollowSymLinks
Options -Indexes

ServerSignature Off

#BADENGINE
#empty user-agent
SetEnvIfNoCase User-Agent (^(\s+)?$) BADENGINE
#others user-agent
SetEnvIfNoCase User-Agent (some_user_agent) BADENGINE
SetEnvIfNoCase User-Agent (another1_user_agent) BADENGINE
SetEnvIfNoCase User-Agent (another2_user_agent) BADENGINE

#UNCOMMENT for TESTING
#SetEnv BADENGINE 1

#let robot access robots.txt
SetEnvIfNoCase Request_URI "robots\.txt" ROBOTS_LET_IN

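<LimitExcept CONNECT>
#With Order Deny,Allow a matching Allow overrides a matching Deny,
#so a bad engine is blocked everywhere except robots.txt.
#(With Order Allow,Deny the Deny always wins and robots.txt is blocked too.)
Order Deny,Allow
Deny from env=BADENGINE
Allow from env=ROBOTS_LET_IN
#in case access is checked again after the internal rewrite to robots.php
Allow from env=REDIRECT_ROBOTS_LET_IN
</LimitExcept>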

RewriteEngine On
RewriteBase   /

RewriteRule  ^robots\.txt$ /robots.php [NC,L]
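
The Order/Allow/Deny directives above are Apache 2.2 style; on Apache 2.4 they only work when mod_access_compat is loaded. As a rough, untested sketch, the same intent could probably be expressed with the 2.4 Require directives. This assumes mod_authz_core is available and that AllowOverride permits AuthConfig in your .htaccess; it is not part of the original setup.

```apache
# Sketch for Apache 2.4 only; use it instead of the <LimitExcept> block, not together with it
<RequireAny>
    # always let robots fetch robots.txt
    # (the REDIRECT_ name covers the internal rewrite to robots.php)
    Require env ROBOTS_LET_IN REDIRECT_ROBOTS_LET_IN
    <RequireAll>
        # everyone else is allowed unless flagged as a bad engine
        Require all granted
        Require not env BADENGINE
    </RequireAll>
</RequireAny>
```

Mixing the old Order/Allow/Deny directives with Require in the same scope is best avoided, so pick one style.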

 

Default robots.txt

User-agent: *
Disallow: /includes/
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /some_others_folder/

 

Dynamic robots.php

<?php
error_reporting(0);

$ROBOTS_LET_IN = false;
$ROBOTS_BADENGINE = false;
$ROBOTS_NAME = '-';

//after the internal rewrite Apache may pass the variables with a REDIRECT_ prefix
if (isset($_SERVER["ROBOTS_LET_IN"]) || isset($_SERVER["REDIRECT_ROBOTS_LET_IN"])) {
	$ROBOTS_LET_IN = true;
}

if (isset($_SERVER["BADENGINE"]) || isset($_SERVER["REDIRECT_BADENGINE"])) {
	$ROBOTS_BADENGINE = true;
}

if (!$ROBOTS_LET_IN) {
	//accessing robots.php directly: answer as if the file does not exist
	header("HTTP/1.1 404 Not Found");
	header("Content-Length: 0");
	die();
}

header("Content-Type: text/plain");
if ($ROBOTS_BADENGINE) {
	//disallow all files and folders for every bad engine
	echo "User-agent: *\n";
	echo "Disallow: /\n";
} else {
	//print the default robots.txt stored next to this script
	echo @file_get_contents(dirname(__FILE__) . '/robots.txt');
}
?>

 

You can test it by uncommenting one line in .htaccess, then trying to access yourdomain.com/robots.txt

 

#UNCOMMENT for TESTING
SetEnv BADENGINE 1

 

@zaenal


Here are some bad bots (bad search engines) you could use. Open your .htaccess and replace the BADENGINE section with one of the following:

 

 

1. From AskApache (http://www.askapache...h-htaccess.html)

#BADENGINE from ASKAPACHE
SetEnvIfNoCase User-Agent .*(aesop_com_spiderman|alexibot|backweb|bandit|batchftp|bigfoot) BADENGINE
SetEnvIfNoCase User-Agent .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) BADENGINE
SetEnvIfNoCase User-Agent .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) BADENGINE
SetEnvIfNoCase User-Agent .*(cosmos|crescent|curl|custo|da|diibot|disco|dittospyder|dragonfly) BADENGINE
SetEnvIfNoCase User-Agent .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) BADENGINE
SetEnvIfNoCase User-Agent .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) BADENGINE
SetEnvIfNoCase User-Agent .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) BADENGINE
SetEnvIfNoCase User-Agent .*(grafula|harvest|hloader|hmview|httplib|httrack|humanlinks|ilsebot) BADENGINE
SetEnvIfNoCase User-Agent .*(infonavirobot|infotekies|intelliseek|interget|iria|jennybot|jetcar) BADENGINE
SetEnvIfNoCase User-Agent .*(joc|justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp|libweb) BADENGINE
SetEnvIfNoCase User-Agent .*(likse|linkscan|linkwalker|lnspiderguy|lwp|magnet|mag-net|markwatch) BADENGINE
SetEnvIfNoCase User-Agent .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) BADENGINE
SetEnvIfNoCase User-Agent .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) BADENGINE
SetEnvIfNoCase User-Agent .*(net.?vampire|netants|netcraft|netmechanic|netspider|nextgensearchbot) BADENGINE
SetEnvIfNoCase User-Agent .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) BADENGINE
SetEnvIfNoCase User-Agent .*(offline.?navigator|openfind|outfoxbot|pagegrabber|papa|pavuk) BADENGINE
SetEnvIfNoCase User-Agent .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) BADENGINE
SetEnvIfNoCase User-Agent .*(psbot|pump|queryn|recorder|realdownload|reaper|reget|true_robot) BADENGINE
SetEnvIfNoCase User-Agent .*(repomonkey|rma|internetseer|sitesnagger|siphon|slysearch|smartdownload) BADENGINE
SetEnvIfNoCase User-Agent .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) BADENGINE
SetEnvIfNoCase User-Agent .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) BADENGINE
SetEnvIfNoCase User-Agent .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) BADENGINE
SetEnvIfNoCase User-Agent .*(turingos|turnitinbot|urly.?warning|vacuum|vci|voideye|whacker) BADENGINE
SetEnvIfNoCase User-Agent .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) BADENGINE
SetEnvIfNoCase User-Agent .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|master|reaper|sauger|site.?quester|whack) BADENGINE
SetEnvIfNoCase User-Agent .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) BADENGINE
SetEnvIfNoCase User-Agent .*(libwww-perl|aesop_com_spiderman) BADENGINE

 

2. My own version

#BADENGINE of mine
SetEnvIfNoCase User-Agent (^$|\<|\>|\'|\%|\_iRc|\_Works|\@\$x|\<\?|\$x0e|\+select\+|\+union\+|1\,\1\,1\,|2icommerce|3GSE|4all|59\.64\.153\.|88\.0\.106\.|85\.17\.|A\_Browser|ABAC|Abont|abot|Accept|Access|Accoo|AceFTP|Acme|ActiveTouristBot|Address|Adopt|adress|adressendeutschland|ADSARobot|ah\-ha|Ahead|AESOP\_com\_SpiderMan|aipbot|Alarm|Albert|Alek|Alexibot|Alligator|AllSubmitter|alma|almaden|ALot|Alpha|aktuelles|Akregat|Amfi|amzn\_assoc|Anal|Anarchie|andit|Anon|AnotherBot|Ansearch|AnswerBus|antivirx|Apexoo|appie|Aqua_Products|Arachmo|archive|arian|ASPSe|ASSORT|Atari|ATHENS|AtHome|Atlocal|Atomic_Email_Hunter|Atomz|Atrop|^attach|attrib|autoemailspider|autohttp|axod|batch|b2w|Back|BackDoorBot|BackStreet|BackWeb|Badass|Bali|Bandit|Barry|BasicHTTP|BatchFTP|bdfetch|beat|Become|Beij|BenchMark|berts|bew|big\.brother|Bigfoot|Bilgi|Bison|Bitacle|Biz360|Black|Black\.Hole|BlackWidow|bladder\.fusion|Blaiz|Blog\.Checker|Blogl|BlogPeople|Blogshares\.Spiders|Bloodhound|Blow|bmclient|Board|BOI|boitho|Bond|Bookmark\.search\.tool|boris|Bost|Boston\.Project|BotRightHere|Bot\.mailto:craftbot@yahoo\.com|BotALot|botpaidtoclick|botw|brandwatch|BravoBrian|Brok|Bropwers|Broth|browseabit|BrowseX|Browsezilla|Bruin|bsalsa|Buddy|Build|Built|Bulls|bumblebee|Bunny|Busca|Busi|Buy|bwh3) BADENGINE
SetEnvIfNoCase User-Agent (c\-spider|CafeK|Cafi|camel|Cand|captu|Catch|cd34|Ceg|CFNetwork|cgichk|Cha0s|Chang|chaos|Char|char\(32\,35\)|charlotte|CheeseBot|Chek|CherryPicker|chill|ChinaClaw|CICC|Cisco|Cita|Clam|Claw|Click\.Bot|clipping|clshttp|Clush|COAST|ColdFusion|Coll|Comb|commentreader|Compan|contact|Control|contype|Conc|Conv|Copernic|Copi|Copy|Coral|Corn|core-project|cosmos|costa|cr4nk|crank|craft|Crap|Crawler0|Crazy|Cres|cs\-CZ|cuill|Custo|Cute|CSHttp|Cyber|cyberalert|^DA$|daoBot|DARK|Data|Daten|Daum|dcbot|dcs|Deep|DepS|Detect|Deweb|Diam|Digger|Digimarc|digout4uagent|DIIbot|Dillo|Ding|DISC|discobot|Disp|Ditto|DLC|DnloadMage|DotBot|Doubanbot|Download|Download\.Demon|Download\.Devil|Download\.Wonder|Downloader|drag|DreamPassport|Drec|Drip|dsdl|dsok|DSurf|DTAAgent|DTS|Dual|dumb|DynaWeb) BADENGINE
SetEnvIfNoCase User-Agent (e\-collector|eag|earn|EARTHCOM|EasyDL|ebin|EBM-APPLE|EBrowse|eCatch|echo|ecollector|Edco|edgeio|efp\@gmx\.net|EirGrabber|email|Email\.Extractor|EmailCollector|EmailSearch|EmailSiphon|EmailWolf|Emer|empas|Enfi|Enhan|Enterprise\_Search|envolk|erck|EroCr|ESurf|Eval|Evil|Evere|EWH|Exabot|Exact|EXPLOITER|Expre|Extra|ExtractorPro|EyeN|FairAd|Fake|FANG|FAST|fastlwspider|FavOrg|Favorites\.Sweeper|Faxo|FDM\_1|FDSE|FEZhead|Filan|FileHound|find|Firebat|Firs|Flam|Flash|FlickBot|Flip|fluffy|flunky|focus|Foob|Fooky|Forex|Forum|ForV|Fost|Foto|Foun|Franklin\.Locator|freefind|FreshDownload|FrontPage|FSurf|Fuck|Fuer|futile|Fyber|Gais|GalaxyBot|Galbot|Gamespy\_Arcade|GbPl|Gener|geni|Geona|Get|gigabaz|Gira|Ginxbot|gluc|glx\.?v|gnome|Go\.Zilla|Goldfire|Got\-It|GOFORIT|gonzo|GornKer|GoSearch|^gotit$|gozilla|grab|Grabber|GrabNet|Grub|Grup|Graf|Green\.Research|grub|grub\-client|gsa\-cra|GSearch|GT\:\:WWW|GuideBot|guruji|gvfs|Gyps|hack|haha|hailo|Harv|Hatena|Hax|Head|Helm|herit|hgre|hhjhj\@yahoo|Hippo|hloader|HMView|holm|holy|HomePageSearch|HooWWWer|HouxouCrawler|HMSE|HPPrint|htdig|HTTPConnect|httpdown|http\.generic|HTTPGet|httplib|HTTPRetriever|HTTrack|human|Huron|hverify|Hybrid|Hyper|ia\_archiver|iaskspi|IBM\_Planetwide|iCCra|ichiro|ID\-Search|IDA|IDBot|IEAuto|IEMPT|iexplore\.exe|iGetter|Ilse|Iltrov|Image\.Stripper|Image\.Sucker|imagefetch|iimds\_monitor|Incutio|IncyWincy|Indexer|Industry\.Program|Indy|InetURL|informant|InfoNav|InfoTekies|Ingelin|Innerpr|Inspect|InstallShield\.DigitalWizard|Insuran\.|Intellig|Intelliseek|InterGET|Internet\.Ninja|Internet\.x|Internet\_Explorer|InternetLinkagent|InternetSeer\.com|Intraf|IP2|Ipsel|Iria|IRLbot|Iron33|Irvine|ISC\_Sys|iSilo|ISRCCrawler|ISSpi|IUPUI\.Research\.Bot|Jady|Jaka|Jam|^Java|java\/|Java\(tm\)|JBH\.agent|Jenny|JetB|JetC|jeteye|jiro|JoBo|JOC|jupit|Just|Jyx|Kapere|kash|Kazo|KBee|Kenjin|Kernel|Keywo|KFSW|KKma|Know|kosmix|KRAE|KRetrieve|Krug|ksibot|ksoap|Kum|KWebGet) BADENGINE
SetEnvIfNoCase User-Agent (Lachesis|lanshan|Lapo|larbin|leacher|leech|LeechFTP|LeechGet|leipzig\.de|Lets|Lexi|lftp|Libby|libcrawl|libfetch|libghttp|libWeb|libwhisker|libwww|libwww\-FM|libwww\-perl|LightningDownload|likse|Linc|Link\.Sleuth|LinkextractorPro|Linkie|LINKS\.ARoMATIZED|LinkScan|linktiger|LinkWalker|Lint|List|lmcrawler|LMQ|LNSpiderguy|loader|LocalcomBot|Locu|London|lone|looksmart|loop|Lork|LTH\_|lwp\-request|LWP|lwp-request|lwp-trivial|Mac\.Finder|Macintosh\;\.I\;\.PPC|Mac\_F|magi|Mag\-Net|Magnet|Magp|Mail\.Sweeper|main|majest|Mam|Mana|MarcoPolo|mark\.blonin|MarkWatch|MaSagool|Mass|Mass\.Downloader|Mata|mavi|McBot|Mecha|MCspider|^Memo|MetaProducts\.Download\.Express|Metaspin|Mete|Microsoft\.Data\.Access|Microsoft\.URL|Microsoft\_Internet\_Explorer|MIDo|MIIx|miner|Mira|MIRE|Mirror|Miss|Missauga|Missigua\.Locator|Missouri\.College\.Browse|Mist|Mizz|MJ12|mkdb|mlbot|MLM|MMMoCrawl|MnoG|moge|Moje|Monster|Monza\.Browser|Mooz|Moreoverbot|MOT\-MPx220|mothra\/netscan|mouse|MovableType|Mozdex|Mozi\!|Mp3Bot|MPF|MRA|MS\.FrontPage|MS\.?Search|MSFrontPage|MSIECrawler|msnbot\-media|msnbot\-Products|MSNPTC|MSProxy|MSRBOT|multithreaddb|musc|MVAC|MWM|My\_age|MyApp|MyDog|MyEng|MyFamilyBot|MyGetRight|MyIE2|mysearch|myurl|NAG|NAMEPROTECT|NASA\.Search|nationaldirectory|Naver|Navr|Near|NetAnts|netattache|Netcach|NetCarta|Netcraft|NetCrawl|NetMech|netprospector|NetResearchServer|NetSp|Net\.Vampire|netX|NetZ|Neut|newLISP|NewsGatorInbox|NEWT|NEWT\.ActiveX|Next|^NG|NICE|nikto|Nimb|Ninja|Ninte|NIPGCrawler|Noga|nogo|Noko|Nomad|Norb|noxtrumbot|NPbot|NuSe|Nutch|Nutex|NWSp|Obje|Ocel|Octo|ODI3|oegp|Offline|Offline\.Explorer|Offline\.Navigator|OK\.Mozilla|omg|Omni|Onfo|onyx|OpaL|OpenBot|Openf|OpenTextSiteCrawler|OpenU|Orac|OrangeBot|Orbit|Oreg|osis|Outf|Owl) BADENGINE
SetEnvIfNoCase User-Agent (P3P|PackRat|PageGrabber|PagmIEDownload|pansci|Papa|Pars|Patw|pavu|Pb2Pb|pcBrow|PEAR|PEER|PECL|pepe|Perl|PerMan|PersonaPilot|Persuader|petit|PHP\.vers|PHPot|Phras|PicaLo|Piff|Pige|pigs|^Ping|Pingd|PingALink|Pipe|Plag|Plant|playstarmusic|Pluck|Pockey|POE\-Com|Poirot|Pomp|Port\.Huron|Post|powerset|Preload|press|Privoxy|Probe|Program\.Shareware|Progressive\.Download|ProPowerBot|prospector|Provider\.Protocol\.Discover|ProWebWalker|Prowl|Proxy|Prozilla|psbot|PSurf|psycheclone|^puf$|Pulse|Pump|PushSite|PussyCat|PuxaRapido|Pyth|PyQ|QuepasaCreep|Query|Quest|QRVA|Qweer|radian|Radiation|Rambler|RAMP|RealDownload|Reap|Recorder|RedCarpet|RedKernel|ReGet|^Mozilla$|Mozilla\:|Mozilla\/Firefox|^Mozilla\.*Indy|^Mozilla\.*NEWT|^Mozilla*MSIECrawler|relevantnoise|replacer|Repo|requ|Rese|Retrieve|Rip|Rix|RMA|Roboz|Rogue|Rover|RPT\-HTTP|Rsync|RTG30|\.ru\)|ruby|Rufus|Salt|Sample|SAPO|Sauger|savvy|SBIder|SBP|SCAgent|scan|SCEJ\_|Sched|Schizo|Schlong|Schmo|Scout|Scooter|Scorp|ScoutOut|SCrawl|screen|script|SearchExpress|searchhippo|Searchme|searchpreview|searchterms|Second\.Street\.Research|Security\.Kol|Seekbot|Sega|Sensis|Sept|Serious|Sezn|Shai|Share|Sharp|Shaz|shell|shelo|Sherl|Shim|Shiretoko|ShopWiki|SickleBot|Simple|Siph|sitecheck|SiteCrawler|SiteSnagger|Site\.Sniper|SiteSucker|sitevigil|SiteX|Sleip|Slide|Slurpy\.Verifier|Sly|Smag|SmartDownload|Smurf|sna\-|snag|Snake|Snapbot|Snip|Snoop|So\-net|SocSci|sogou|Sohu|solr|sootle|Soso|SpaceBison|Spad|Span|spanner|Speed|Spegla|Sphere|Sphider|SpiderBot|SpiderEngine|SpiderView|Spin|sproose|Spurl|Spyder|Squi|SQ\.Webscanner|sqwid|Sqworm|SSM\_Ag|Stack|Stamina|stamp|Stanford|Statbot|State|Steel|Strateg|Stress|Strip|studybot|Style|subot|Suck|Sume|sun4m|Sunrise|SuperBot|SuperBro|Supervi|Surf4Me|SuperHTTP|Surfbot|SurfWalker|Susi|suza|suzu|Sweep|sygol|syncrisis|Systems|Szukacz) BADENGINE
SetEnvIfNoCase User-Agent (Tagger|Tagyu|tAke|Talkro|TALWinHttpClient|tamu|Tandem|Tarantula|tarspider|tBot|TCF|Tcs\/1|TeamSoft|Tecomi|Teleport|Telesoft|Templeton|Tencent|Terrawiz|Test|TexNut|trivial|Turnitin|The\.Intraformant|TheNomad|Thomas|TightTwatBot|Timely|Titan|TMCrawler|TMhtload|toCrawl|Todobr|Tongco|topic|Torrent|Track|translate|Traveler|TREEVIEW|True|Tunnel|turing|Turnitin|TutorGig|TV33\_Mercator|Twat|Tweak|Twice|Twisted\.PageGetter|Tygo|ubee|UCmore|UdmSearch|UIowaCrawler|Ultraseek|UMBC|unf|UniversalFeedParser|unknown|UPG1|UtilMind|URLBase|URL\.Control|URL\_Spider\_Pro|urldispatcher|URLGetFile|urllib|URLSpiderPro|URLy|User\-Agent|UserAgent|USyd|Vacuum|vagabo|Valet|Valid|Vamp|vayala|VB\_|VCI|VERI\~LI|versus|via|Viewer|virtual|visibilitygap|Visual|vobsub|Void|VoilaBot|voyager|vspider|VSyn|w\:PACBHO60|w0000t|W3C|w3m|w3search|walhello|Walker|Wand|WAOL|WAPT|Watch|Wavefire|wbdbot|Weather|web\.by\.mail|Web\.Data\.Extractor|Web\.Downloader|Web\.Ima|Web\.Mole|Web\.Sucker|Web2Mal|Web2WAP|WebaltBot|WebAuto|WebBandit|WebCapture|WebCat|webcraft\@bea|Webclip|webcollage|WebCollector|WebCopier|WebCopy|WebCor|webcrawl|WebDat|WebDav|webdevil|webdownloader|Webdup|WebEMail|WebEMailExtrac|WebEnhancer|WebFetch|WebGo|WebHook|Webinator|WebInd|webitpr|WebFilter|WebFountain|WebLea|WebmasterWorldForumBot|WebMin|WebMirror|webmole|webpic|WebPin|WebPix|WebReaper|WebRipper|WebRobot|WebSauger|Website\.eXtractor|Website\.Quester|WebSnake|webspider|Webster|WebStripper|websucker|WebTre|WebVac|webwalk|WebWasher|WebWeasel|WebWhacker|WebZIP|Wells|WEP\_S|WEP\.Search\.00|WeRelateBot|wget|Whack|Whacker|whiz|WhosTalking|Widow|Win67|window\.location|Windows\.95\;|Windows\.98\;|Winodws|Wildsoft\.Surfer|WinHT|winhttp|WinHttpRequest|WinHTTrack|Winnie\.Poh|WISEbot|wisenutbot|wish|Wizz|WordP|Works|world|WUMPUS|Wweb|WWWC|WWWOFFLE|WWW\-Collector|WWW\.Mechanize|www\.ranks\.nl|wwwster|^x$|X12R1|x\-Tractor|Xaldon|Xenu|XGET|xirq|Y\!OASIS|Y\!Tunnel|yacy|YaDirectBot|Yahoo\-MMAudVid|YahooYSMcm|Yamm|Yand|yang|Yeti|Yoono|yori|Yotta|YTunnel|Zade|zagre|ZBot|Zeal|ZeBot|zerx|Zeus|ZIPCode|Zixy|zmao|Zyborg) BADENGINE
SetEnvIfNoCase User-Agent (cyberpatrol\.com|Macintosh\;\s+) !BADENGINE

 

3. Your own version

Do you have your own version? Add it to the list.
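
As a minimal sketch of what adding your own entries could look like: the agent names below (examplebot, sample-scraper, trusted-monitor) are made up for illustration, so replace them with user agents you actually see in your own access logs.

```apache
#BADENGINE of yours (hypothetical names, for illustration only)
SetEnvIfNoCase User-Agent (examplebot|sample-scraper) BADENGINE

#a later line can clear the flag again for an agent that was caught by mistake,
#the same way the !BADENGINE line in version 2 does
SetEnvIfNoCase User-Agent (trusted-monitor) !BADENGINE
```

Keep each pattern as specific as you can; short fragments match many unrelated user agents.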

 

 

Don't forget to remove or comment out the line below on a production site

#UNCOMMENT for TESTING
#SetEnv BADENGINE 1

 

 

Cheers,

@zaenal


Hi Zaenal

 

Will this prevent the attack that I had recently?

 

http://forums.oscomm...enamenetworkru/

 

I had to take my site offline and remove the .htaccess file. But as soon as I load osCommerce back, an .htaccess is created with all the hack code.

 

Thank you.

 

Naz
