Guyver Posted November 10, 2002

Well, search engines are hitting my site ( www.itsuppliues.net ) but not staying:

Scooter (AltaVista)                      43   674.89 KB   08 Nov 2002 - 23:15
IBM_Planetwide                           20   418.46 KB   09 Nov 2002 - 11:26
Googlebot (Google)                       14   243.16 KB   05 Nov 2002 - 11:42
Fast-Webcrawler (AllTheWeb)              14   358.92 KB   05 Nov 2002 - 15:21
Unknown robot (identified by 'crawl')     2    56.43 KB   08 Nov 2002 - 00:06
GigaBot                                   1    28.25 KB   08 Nov 2002 - 20:54
Inktomi Slurp                             1    27.97 KB   07 Nov 2002 - 17:59

Since I've installed the new CVS, though, without any of these 'extras' everyone is using, I seem to be getting a few more engines my way!

John
Guest Posted November 10, 2002

For the last 7h 55min, two spiders with the same ID (146.101.142.226) have been visiting my site. Oh my god.

Torsten
shelly Posted November 12, 2002

I followed the whole discussion, and in some ways I'm a little envious, as no matter how many times I prompted Google, it just hasn't come to see me...

Having read through this discussion it seemed that there was a potential solution, but you had to read between many lines to get to it. So, in fear of being brought down by the monster spiders, I put this together. I don't yet know how to make a contribution, so I've attached it here. I've got it running on my site and nothing seems broken... yet. If the experts would review this and offer suggestions as to why it's not quite right, maybe this can become a contribution to the cause.

Create a file called "search_engine_protector.php" and place it in DIR_WS_FUNCTIONS, then cut this code into the file. Be careful not to add lines at the top or bottom of the file:

<?php
/*
  osCommerce, Open Source E-Commerce Solutions
  http://www.oscommerce.com

  Copyright (c) 2002 osCommerce

  Released under the GNU General Public License

  Authored through discussion on the forum!
*/

// Do you want to check for spiders?
define('PROTECT_FROM_SPIDERS', 'true');

// Do you want to use the HTTP_USER_AGENT method to check for spiders?
define('USE_SPIDER_HTTP_USER_AGENT', 'true');

// Do you want to use an IP address check to check for spiders?
define('USE_SPIDER_IP_ADDRESS', 'true');

// Global: have we already tried to establish if this request is from a spider?
$spiderCheckCompleted = false;

// Global: is this user agent potentially a spider?
// XXX: WARNING :XXX
// Do not access this variable directly; use the function isSpider() below.
$userAgentIsSpider = false;

/*
 * Call this function from within html_output.php to check whether
 * the user agent is a spider or not.
 */
function isSpider() {
    global $userAgentIsSpider, $spiderCheckCompleted;

    if ( ($spiderCheckCompleted == false) && (PROTECT_FROM_SPIDERS == 'true') ) {
        if (USE_SPIDER_HTTP_USER_AGENT == 'true') {
            $userAgentIsSpider = checkUserAgentForSpider();
        }
        if ( (!$userAgentIsSpider) && (USE_SPIDER_IP_ADDRESS == 'true') ) {
            $userAgentIsSpider = checkIpAddressForSpider();
        }
        $spiderCheckCompleted = true;
    }

    return $userAgentIsSpider;
}

function checkUserAgentForSpider() {
    global $userAgentIsSpider;

    // Let's ensure the agent is always in lowercase
    $agent = strtolower(getenv('HTTP_USER_AGENT'));

    // Now let's define possible spiders. The leading letter of each
    // footprint is dropped deliberately, so e.g. "rawler" matches
    // "Crawler" as well as "crawler".
    $spider_footprint = array(
        "bot", "rawler", "pider", "ppie", "rchitext", "aaland",
        "igout4u", "cho", "ferret", "ulliver", "arvest", "tdig",
        "rchiver", "eeves", "inkwalker", "ycos", "ercator",
        "uscatferret", "yweb", "omad", "eternews", "cooter", "lurp",
        "oila", "oyager", "ebbase", "eblayers", "get", "eek",
        "canner", "rachnoidea", "lanetwide", "kit");

    $i = 0;
    while ( ($i < count($spider_footprint)) && (!$userAgentIsSpider) ) {
        if (strstr($agent, $spider_footprint[$i])) {
            $userAgentIsSpider = true;
        }
        $i++;
    }

    return $userAgentIsSpider;
}

function checkIpAddressForSpider() {
    global $userAgentIsSpider;

    $host_ip = getenv('REMOTE_ADDR');

    $spider_ip = array(
        "64.209.181.53", "64.208.33.33", "64.209.181.52",
        "209.185.108.", "209.185.253", "216.239.49.", "216.239.46.",
        "204.123.", "204.74.103.", "203.108.10.", "195.4.183.",
        "195.242.46.", "198.3.97.", "204.62.245.", "193.189.227.",
        "209.1.12.", "204.162.96.", "204.162.98.", "194.121.108.",
        "128.182.72.", "207.77.91.", "206.79.171.", "207.77.90.",
        "208.213.76.", "194.124.202.", "193.114.89.", "193.131.74.",
        "131.84.1.", "208.219.77.", "206.64.113.", "195.186.1.",
        "195.3.97.", "194.191.121.", "139.175.250.", "209.73.233.",
        "194.191.121.", "198.49.220.", "204.62.245.", "198.3.99.",
        "198.2.101.", "204.192.112.", "206.181.238", "208.215.47.",
        "171.64.75.", "204.162.98.", "204.162.96.", "204.123.9.52",
        "204.123.2.44", "204.74.103.39", "204.123.9.53",
        "204.62.245.", "206.64.113.", "204.138.115.", "94.22.130.",
        "164.195.64.1", "205.181.75.169", "129.170.24.57",
        "204.162.96.", "204.162.96.", "204.162.98.", "204.162.96.",
        "207.77.90.", "207.77.91.", "208.200.146.", "204.123.9.20",
        "204.138.115.", "209.1.32.", "209.1.12.", "192.216.46.49",
        "192.216.46.31", "192.216.46.30", "203.9.252.2");

    $i = 0;
    while ( ($i < count($spider_ip)) && (!$userAgentIsSpider) ) {
        if (strstr($host_ip, $spider_ip[$i])) {
            $userAgentIsSpider = true;
        }
        $i++;
    }

    return $userAgentIsSpider;
}
?>

Then, within html_output.php, just under the copyright statement, add:

// CWS MOD to protect from spiders -- START
require(DIR_WS_FUNCTIONS . 'search_engine_protector.php');
// CWS MOD to protect from spiders -- END

Finally, search for "isset($sid" within html_output.php. On the lines above the line if (isset($sid)) {, add the marked block so it reads:

// CWS MOD to protect from spiders -- START
if ( isSpider() ) {
    $sid = NULL;
}
// CWS MOD to protect from spiders -- END
if (isset($sid)) {
    $link .= $separator . $sid;
}

I would be grateful for feedback before people start using this code on live sites. This work has been generated from this discussion group; all the people involved can take blame/credit for its usefulness.
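To sanity-check the footprint matching in the contribution above before putting it on a live site, the core loop can be exercised on its own. This is just a standalone sketch: matchesFootprint() is a hypothetical helper name I've introduced, not part of the contribution, but it uses the same strtolower/strstr approach as checkUserAgentForSpider():

```php
<?php
// Standalone version of the footprint loop from checkUserAgentForSpider(),
// for testing the matching logic on sample user-agent strings.
// matchesFootprint() is a hypothetical helper, not stock osCommerce code.
function matchesFootprint($agent, $footprints) {
    $agent = strtolower($agent);
    for ($i = 0; $i < count($footprints); $i++) {
        if (strstr($agent, $footprints[$i])) {
            return true;  // same effect as setting $userAgentIsSpider
        }
    }
    return false;
}

// A few footprints taken from the contribution's list
$footprints = array("bot", "rawler", "cooter", "lurp");

// "Scooter/3.2" contains "cooter" once lowercased, so it is flagged;
// a normal browser user agent is not.
var_dump(matchesFootprint("Scooter/3.2", $footprints));
var_dump(matchesFootprint("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", $footprints));
?>
```
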
Guest Posted November 14, 2002

I mentioned something related to this on another post, regarding the fact that Google announces its user agent as "GoogleBot". Why can't we use that instead of a list of IPs that will no doubt need to be updated fairly often in order to have any lasting effect? I noticed someone else mention this as well, but it didn't go anywhere. I can't understand why; it seems like a better temporary solution for killing session IDs based on the user agent.

http://www.oscommerce.com/forums/viewtopic.php?t=24998

~Tim
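A minimal sketch of the user-agent-only approach Tim describes, with no IP list to maintain. The helper name isGooglebotAgent() is mine, not from the stock osCommerce code; the idea is simply to look for the advertised bot name in the User-Agent header:

```php
<?php
// Hypothetical user-agent-only check, as suggested above:
// Google announces itself as "Googlebot", so no IP list is needed.
function isGooglebotAgent($userAgent) {
    // Case-insensitive substring match on the advertised bot name
    return (strpos(strtolower($userAgent), 'googlebot') !== false);
}
?>
```

In html_output.php, this could then replace the IP-based test, e.g. `if (isGooglebotAgent(getenv('HTTP_USER_AGENT'))) { $sid = NULL; }`. The trade-off is that it only catches spiders that announce themselves honestly, while the IP list also catches crawlers with generic user agents.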
xaraya Posted November 14, 2002

OK, I'm giving it a shot after being recently "unindexed"... what a bummer. I had Ian's SID killer too :cry:
xaraya Posted November 14, 2002

Just to be on the safe side, is this the correct placement for that last bit of code?

// CWS MOD to protect from spiders -- START
if ( isSpider() ) {
    $sid = NULL;
}
// CWS MOD to protect from spiders -- END
if (isset($sid)) {
    $link .= $separator . $sid;
}
if ( (isset($sid)) && (!$kill_sid) ) {
    $link .= $separator . $sid;
}
xaraya Posted November 14, 2002

I also added these lines to robots.txt...

Disallow: /catalog/admin
Disallow: /catalog/includes
Disallow: /catalog/images
Disallow: /catalog/pub
Disallow: /catalog/download
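For what it's worth, Disallow rules only apply inside a group introduced by a User-agent line, so a complete robots.txt along the lines above (assuming the shop lives under /catalog/ and the rules should apply to all robots) might look like:

```
User-agent: *
Disallow: /catalog/admin
Disallow: /catalog/includes
Disallow: /catalog/images
Disallow: /catalog/pub
Disallow: /catalog/download
```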
tammy507 Posted March 18, 2004

I have a headache that Bayer won't touch! LOL

I have browsed through this thread as well as SEVERAL others... and I am now totally and completely LOST and CONFUSED.

I am using the latest version of OSC with the following mods:

Links Manager
Article Manager
PayPal IPN
Random Header Image
USPS Zones
Visitor Stats
Category Descriptions
Header Tags Controller

The only mod I will be adding later this week is the Ebay Auction mod (the one from Auctionblox).

Anyway... I want spiders to check me out, right? I need some guidance here, please. I don't understand what this spider list is for, and I'm not sure where to start. What exactly is the html_output file for, anyway? If I have the Header Tags Controller installed, doesn't this act just like regular HTML meta tags?

Oh my! Gee, the more I read, the more I find out that I didn't know as much as I thought I did. I "thought" adding the Header Tags Controller to my site would make it easier for spiders etc. to check out each of my pages.

Thanks in advance,
tammy
Archived
This topic is now archived and is closed to further replies.