
osCommerce


Google bots are out in full force today.


wizardsandwars


Well, search engines are hitting my site ( www.itsuppliues.net ) but not staying:

 

Robot                                   Hits   Bandwidth   Last visit
Scooter (AltaVista)                       43   674.89 KB   08 Nov 2002 - 23:15
IBM_Planetwide                            20   418.46 KB   09 Nov 2002 - 11:26
Googlebot (Google)                        14   243.16 KB   05 Nov 2002 - 11:42
Fast-Webcrawler (AllTheWeb)               14   358.92 KB   05 Nov 2002 - 15:21
Unknown robot (identified by 'crawl')      2    56.43 KB   08 Nov 2002 - 00:06
GigaBot                                    1    28.25 KB   08 Nov 2002 - 20:54
Inktomi Slurp                              1    27.97 KB   07 Nov 2002 - 17:59

 

Since I've installed the new CVS, though, without any of these 'extras' everyone is using, I seem to be getting a few more engines my way!

 

John



I followed the whole discussion, and in some ways I'm a little envious, as no matter how many times I prompted Google, it just hasn't come to see me...

 

Having read through this discussion, it seemed there was a potential solution, but you had to read between many lines to get to it. So, in fear of being brought down by the monster spiders, I put this together. I don't yet know how to make a contribution, so I've attached it here. I've got it running on my site and nothing seems broken... yet.

 

If the great and good would review this and offer suggestions as to why it's not quite right, maybe this can become a contribution to the cause.

 

Create a file called "search_engine_protector.php", place it in DIR_WS_FUNCTIONS, and paste the code below into it. Be careful not to add blank lines at the top or bottom of the file.

<?php
/*
  osCommerce, Open Source E-Commerce Solutions
  http://www.oscommerce.com

  Copyright (c) 2002 osCommerce

  Released under the GNU General Public License

  Authored through discussion on the forum!
*/

// Do you want to check for spiders?
define('PROTECT_FROM_SPIDERS', 'true');

// Do you want to use the HTTP_USER_AGENT method to check for spiders?
define('USE_SPIDER_HTTP_USER_AGENT', 'true');

// Do you want to use an IP address check to check for spiders?
define('USE_SPIDER_IP_ADDRESS', 'true');

// Global: have we already tried to establish if this request is from a spider?
$spiderCheckCompleted = false;

// Global: is this user agent potentially a spider?
// XXX: WARNING :XXX
// Do not access this variable directly; use the function isSpider() below.
$userAgentIsSpider = false;

/*
 * Call this function from within html_output.php to check whether
 * the user agent is a spider or not.
 */
function isSpider()
{
    global $userAgentIsSpider, $spiderCheckCompleted;

    if ( ($spiderCheckCompleted == false) && (PROTECT_FROM_SPIDERS == 'true') )
    {
        if (USE_SPIDER_HTTP_USER_AGENT == 'true')
        {
            $userAgentIsSpider = checkUserAgentForSpider();
        }

        // Only fall back to the IP check if the user agent looked clean.
        if ( (!$userAgentIsSpider) && (USE_SPIDER_IP_ADDRESS == 'true') )
        {
            $userAgentIsSpider = checkIpAddressForSpider();
        }

        $spiderCheckCompleted = true;
    }

    return $userAgentIsSpider;
}

function checkUserAgentForSpider()
{
    global $userAgentIsSpider;

    // Ensure the agent string is always compared in lowercase.
    $agent = strtolower(getenv('HTTP_USER_AGENT'));

    // Fragments that identify known spiders (the leading letters are
    // dropped in the original list; harmless, since the agent is
    // lowercased above and these are matched as substrings).
    $spider_footprint = array('bot', 'rawler', 'pider', 'ppie', 'rchitext',
                              'aaland', 'igout4u', 'cho', 'ferret', 'ulliver',
                              'arvest', 'tdig', 'rchiver', 'eeves', 'inkwalker',
                              'ycos', 'ercator', 'uscatferret', 'yweb', 'omad',
                              'eternews', 'cooter', 'lurp', 'oila', 'oyager',
                              'ebbase', 'eblayers', 'get', 'eek', 'canner',
                              'rachnoidea', 'lanetwide', 'kit');

    $i = 0;
    while ( ($i < count($spider_footprint)) && (!$userAgentIsSpider) )
    {
        if (strstr($agent, $spider_footprint[$i]))
        {
            $userAgentIsSpider = true;
        }
        $i++;
    }

    return $userAgentIsSpider;
}

function checkIpAddressForSpider()
{
    global $userAgentIsSpider;

    $host_ip = getenv('REMOTE_ADDR');

    // Known spider addresses and address prefixes (duplicate entries
    // in the original list have been removed).
    $spider_ip = array('64.209.181.53', '64.208.33.33', '64.209.181.52',
                       '209.185.108.', '209.185.253.', '216.239.49.',
                       '216.239.46.', '204.123.', '204.74.103.',
                       '203.108.10.', '195.4.183.', '195.242.46.',
                       '198.3.97.', '204.62.245.', '193.189.227.',
                       '209.1.12.', '204.162.96.', '204.162.98.',
                       '194.121.108.', '128.182.72.', '207.77.91.',
                       '206.79.171.', '207.77.90.', '208.213.76.',
                       '194.124.202.', '193.114.89.', '193.131.74.',
                       '131.84.1.', '208.219.77.', '206.64.113.',
                       '195.186.1.', '195.3.97.', '194.191.121.',
                       '139.175.250.', '209.73.233.', '198.49.220.',
                       '198.3.99.', '198.2.101.', '204.192.112.',
                       '206.181.238.', '208.215.47.', '171.64.75.',
                       '204.123.9.52', '204.123.2.44', '204.74.103.39',
                       '204.123.9.53', '204.138.115.', '94.22.130.',
                       '164.195.64.1', '205.181.75.169', '129.170.24.57',
                       '208.200.146.', '204.123.9.20', '209.1.32.',
                       '192.216.46.49', '192.216.46.31', '192.216.46.30',
                       '203.9.252.2');

    $i = 0;
    while ( ($i < count($spider_ip)) && (!$userAgentIsSpider) )
    {
        // Match the entry as a prefix of the address; a plain strstr()
        // would also match in the middle (e.g. "94.22.130." inside
        // "194.22.130.7") and give false positives.
        if (strpos($host_ip, $spider_ip[$i]) === 0)
        {
            $userAgentIsSpider = true;
        }
        $i++;
    }

    return $userAgentIsSpider;
}
?>
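One caveat worth noting on the IP list: each entry is an address prefix, and matching it as a plain substring anywhere in the address can misfire. A standalone sketch of the difference (the helper name here is hypothetical, not part of the osCommerce file):

```php
<?php
// Prefix match: true only when the address actually starts with the entry.
function ip_matches_prefix($host_ip, $prefix) {
    return strpos($host_ip, $prefix) === 0;
}

var_dump(ip_matches_prefix('94.22.130.5', '94.22.130.'));   // bool(true)
var_dump(ip_matches_prefix('194.22.130.7', '94.22.130.'));  // bool(false)

// A substring test gets the second case wrong, because the string
// "94.22.130." occurs inside "194.22.130.7" starting at position 1.
var_dump(strstr('194.22.130.7', '94.22.130.') !== false);   // bool(true)
```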

 

Then within html_output.php, just under the copyright statement, add:

// CWS MOD to protect from spiders -- START
   require(DIR_WS_FUNCTIONS . 'search_engine_protector.php');
// CWS MOD to protect from spiders -- END

 

Finally, search for "isset($sid" within html_output.php and add the following just above the line "if (isset($sid)) {":

 

// CWS MOD to protect from spiders -- START

   if ( isSpider() ) {

       $sid = NULL;

   }

// CWS MOD to protect from spiders -- END

   

   if (isset($sid)) {

     $link .= $separator . $sid;

   }

 

I would be grateful for feedback before people start using this code on live sites.

 

This work has been generated from this discussion group. All the people involved can take blame/credit for its usefulness.


I mentioned something related to this on another post, regarding the fact that Google announces its user agent as "GoogleBot". Why can't we use that, instead of a list of IPs that will no doubt need to be updated fairly often in order to have any lasting effect?

 

I noticed someone else mention this as well, but it didn't go anywhere. I can't understand why; it seems like a better temporary solution for killing session IDs based on the user agent.

 

http://www.oscommerce.com/forums/viewtopic.php?t=24998
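That user-agent-only idea can be sketched in a few lines (the function name and the fragment list here are illustrative only; a real list would still need maintaining, just less often than an IP list):

```php
<?php
// Sketch of a user-agent-only spider check, as suggested above.
// The fragment list is illustrative, not exhaustive.
function is_spider_agent($agent) {
    $fragments = array('googlebot', 'slurp', 'crawler', 'spider', 'scooter');
    $agent = strtolower($agent);
    foreach ($fragments as $fragment) {
        if (strpos($agent, $fragment) !== false) {
            return true;  // a known fragment appears somewhere in the agent
        }
    }
    return false;
}

var_dump(is_spider_agent('Googlebot/2.1 (+http://www.googlebot.com/bot.html)')); // bool(true)
var_dump(is_spider_agent('Mozilla/4.0 (compatible; MSIE 6.0)'));                  // bool(false)
```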

 

~Tim


Just to be on the safe side, is this the correct placement for that last bit of code?

 

 

// CWS MOD to protect from spiders -- START
   if ( isSpider() ) {
       $sid = NULL;
   }
// CWS MOD to protect from spiders -- END

   if (isset($sid)) {
     $link .= $separator . $sid;
   }

   if ( (isset($sid)) && (!$kill_sid) ) {
     $link .= $separator . $sid;
   }
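For what it's worth, here is a self-contained sketch of the intended effect (isSpider() is stubbed out so the snippet runs on its own; the point is that setting $sid to NULL makes the later isset() test fail, so no osCsid gets appended to the link):

```php
<?php
// Standalone sketch, not the real html_output.php. isSpider() is a stub
// standing in for the function from search_engine_protector.php.
function isSpider() { return true; }  // pretend a spider is visiting

$link = 'product_info.php?products_id=1';
$separator = '&';
$sid = 'osCsid=abc123';

// CWS MOD to protect from spiders -- START
if ( isSpider() ) {
    $sid = NULL;  // isset() returns false for a variable set to NULL
}
// CWS MOD to protect from spiders -- END

if (isset($sid)) {
    $link .= $separator . $sid;
}

echo $link;  // prints product_info.php?products_id=1 (no session ID)
```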


  • 1 year later...

I have a headache that Bayer won't touch! LOL

 

I have browsed through this thread as well as SEVERAL others... and I am now totally and completely LOST and CONFUSED.

 

I am using the latest version of OSC with the following Mods:

Links Manager

Article Manager

PayPal IPN

Random Header Image

USPS Zones

Visitor Stats

Category Descriptions

Header Tags Controller

 

The only mod I will be adding later this week is the eBay Auction mod (the one from Auctionblox).

 

Anyway...

 

I want spiders to check me out, right?

 

I need some guidance here, please...

I don't understand what this spider list is for.

 

I'm not sure where to start.

 

What exactly is the html_output file for, anyway?

 

If I have the Header Tags Controller installed, doesn't this act just like regular HTML meta tags?

 

Oh my!?!?!?!?!?

 

Gee, the more I read, the more I find out that I didn't know as much as I thought I did. I "thought" adding the Header Tags Controller to my site would make it easier for spiders etc. to check out each of my pages.

 

 

thanks in advance

tammy


Archived

This topic is now archived and is closed to further replies.
