Archived

This topic is now archived and is closed to further replies.

zelf

Spiders and Sessions

:angry: Spiders are still obtaining a session ID. I have uploaded the updated spiders.txt file and have also set "Prevent Spider Sessions" in admin, but it is not working.

 

Any ideas? Several pages have now been indexed in Yahoo with session IDs.

 

Also, I have admin set to recreate sessions, but it is not working either. After logging in the same session id exists.

 

Why in the updated spiders.txt list is googlebot and msnbot not listed?

 

Not good. Please help.

Spiders are obtaining a session id still.

 

How do you know? What are you using to test with?

 

After logging in the same session id exists.

 

If you follow one of those indexed links with the SID in them from yahoo, is that session still active, or is it creating a new session?

 

Why in the updated spiders.txt list is googlebot and msnbot not listed?

 

Because it only has to match a part of the User Agent. Therefore 'bot' will match for 'googlebot'.
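
To make the partial matching concrete, here is a minimal sketch of the same substring test that the stock code performs with strpos(). The helper name matches_spider() and the sample User-Agent strings are made up for illustration; the patterns are a few entries taken from the spiders.txt listing quoted later in this thread. It assumes PHP 7 or later because of the type declarations.

```php
<?php
// Hypothetical helper, not part of osCommerce: returns true when any
// pattern is a substring of the lowercased User-Agent string.
function matches_spider(string $userAgent, array $patterns): bool
{
    $userAgent = strtolower($userAgent);
    foreach ($patterns as $pattern) {
        $pattern = trim($pattern);
        if ($pattern === '') {
            continue; // ignore blank lines so they can never match
        }
        if (strpos($userAgent, $pattern) !== false) {
            return true;
        }
    }
    return false;
}

// A few patterns as they appear in spiders.txt (one per line in the file).
$patterns = ['crawl', 'slurp', 'spider', 'ebot', 'msnbot'];

// Sample User-Agent strings, chosen only for illustration.
var_dump(matches_spider('Mozilla/5.0 (compatible; Googlebot/2.1)', $patterns));
var_dump(matches_spider('Mozilla/5.0 (Windows NT 10.0) Firefox/115.0', $patterns));
```

The 'ebot' entry is enough to catch 'googlebot', which is why the full bot name does not need its own line.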


-------------------------------------------------------------------------------------------------------------------------

NOTE: As of Oct 2006, I'm not as active in this forum as I used to be, but I still work with osC quite a bit.

If you have a question about any of my posts here, your best bet is to contact me through either Email or PM in my profile, and I'll be happy to help.


Thanks for the reply.

 

How do you know? What are you using to test with?

As I said, Yahoo has indexed my site and the listings contain SIDs. I have also tested with http://www.hashemian.com/tools/browser-simulator.htm, and the results for different bots display an osCsid. Yesterday I also watched, with the "Who's Online Enhancement", MSNBot add $700.00 worth of stuff to a cart, and it had a session. However, it was not listed as a bot in the Who's Online Enhancement.

If you follow one of those indexed links with the SID in them from yahoo, is that session still active, or is it creating a new session?

I'll have to dig deeper into this one, but the sid that displays in Yahoo does not change throughout the session, even after logging in.

Because it only has to match a part of the User Agent. Therefore 'bot' will match for 'googlebot'.

Right, I figured this out after I posted.

 

I have fixed the problem temporarily by renaming osCsid to something else so a new session will be created for each new visitor that clicks on my links, but what the heck is going on here? Everything else is working perfectly in my store.

 

If it helps, here is my store's address: http://www.topless-sandal.com/

In admin->configuration->Sessions->Prevent Spider Sessions

 

is that set to true?

 

I guess you didn't believe me when I said it the first time. Here's a screenshot of my admin screen to prove it I guess:

 

http://www.topless-sandal.com/prevent_spiders.gif


Sorry zelf :blink: yes, you're right. The session is created in includes\application_top.php by tep_session_start(). Now this function should be present only in this file. You could check the tree just in case something else calls this function.

 

So the code to start the session looks like this:

// start the session
 $session_started = false;
 if (SESSION_FORCE_COOKIE_USE == 'True') {
   tep_setcookie('cookie_test', 'please_accept_for_session', time()+60*60*24*30, $cookie_path, $cookie_domain);

   if (isset($HTTP_COOKIE_VARS['cookie_test'])) {
     tep_session_start();
     $session_started = true;
   }
 } elseif (SESSION_BLOCK_SPIDERS == 'True') {
   $user_agent = strtolower(getenv('HTTP_USER_AGENT'));
   $spider_flag = false;

   if (tep_not_null($user_agent)) {
     $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');

     for ($i=0, $n=sizeof($spiders); $i<$n; $i++) {
       if (tep_not_null($spiders[$i])) {
         if (is_integer(strpos($user_agent, trim($spiders[$i])))) {
           $spider_flag = true;
           break;
         }
       }
     }
   }

   if ($spider_flag == false) {
     tep_session_start();
     $session_started = true;
   }
 } else {
   tep_session_start();
   $session_started = true;
 }

I would expect spider_flag to be set to true when spiders.txt is in the includes directory and the user agent contains one of the spider names. So to test whether spider_flag works, just hard-code it to true and see if your shop still generates sessions. If it does, there must be another place that starts sessions.

 

The other global var to check is $session_started, as the second step to figure this out.

Now this function should be present only in this file. You could check the tree just in case something else calls this function.

 

I checked all files for this code and it is only present in includes/application_top.php

 

So to test whether spider_flag works, just hard-code it to true and see if your shop still generates sessions. If it does, there must be another place that starts sessions.

 

I hard-coded it to true and sessions stopped being generated.

 

The other global var to check is the $session_started as the second step to figure this out

 

I'm not sure what I am supposed to test here on the second step.


OK, the 2nd step I mentioned was in case the hard-coding didn't work. So you said you have a spiders.txt in the catalog\includes directory. If you have a firewall that can send a custom user-agent field, then set this field in the firewall to one of the names in spiders.txt. Once a page from your site is downloaded, check the source code to see if the links have a session ID appended.

 

Do you have a php debugger by any chance?

OK, the 2nd step I mentioned was in case the hard-coding didn't work. So you said you have a spiders.txt in the catalog\includes directory. If you have a firewall that can send a custom user-agent field, then set this field in the firewall to one of the names in spiders.txt. Once a page from your site is downloaded, check the source code to see if the links have a session ID appended.

 

Do you have a php debugger by any chance?

 

I am going to set up a cURL script to specify different user agents and see if I can track this down. I just don't get it. It sounds like everyone else has this working except for me. Except for minor layout mods, I have not altered the stock osC core modules. Specifically, I haven't touched anything to do with the session handling. My guess is that spiders.txt is not being read in correctly, or something of that nature.
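
A rough sketch of what such a cURL script could look like is below. It is only an illustration: fetch_as() and has_session_id() are hypothetical helper names, SPIDER_TEST_URL is a made-up environment variable used to pass in the store URL, and the User-Agent strings are examples. It assumes PHP 7 or later with the cURL extension, and that the session name is the stock osCsid.

```php
<?php
// Hypothetical helper: fetch a URL while presenting the given User-Agent.
// Response headers are included so a Set-Cookie line is visible too.
function fetch_as(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $response = curl_exec($ch);
    return $response === false ? '' : $response;
}

// Hypothetical helper: true when the response carries the session name,
// either in a cookie header or appended to a link.
function has_session_id(string $response, string $sessionName = 'osCsid'): bool
{
    return stripos($response, $sessionName . '=') !== false;
}

// Only runs when a store URL is supplied, e.g.
//   SPIDER_TEST_URL=http://example.com/ php check_spiders.php
$url = getenv('SPIDER_TEST_URL');
if ($url !== false && $url !== '') {
    foreach (['Googlebot/2.1', 'msnbot/1.0', 'Mozilla/5.0 Firefox'] as $agent) {
        $found = has_session_id(fetch_as($url, $agent));
        echo $agent, ': ', ($found ? 'session ID present' : 'no session ID'), "\n";
    }
}
```

If spider blocking works, the bot user agents should report no session ID while the ordinary browser string still gets one.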

 

I am serving on Mac OS X.


Update: That's exactly what is going on. spiders.txt is not being split into separate array elements, one per line.

 

The array created by file() looks like the one below. It must be something to do with the newline encoding. I'll post an update for Mac OS X people once I get it finished.

 

Array ( [0] => crawl slurp spider ebot obot abot dbot hbot kbot mbot nbot pbot rbot sbot tbot ybot bot. accoona appie architext aspseek asterias atlocal atomz augurfind bannana_bot boitho booch cfetch dmoz docomo falcon findlinks gazz goforit grub gulliver harvest helix holmes homer ia_archiver ichiro iconsurf iltrovatore indexer infoseek ingrid ivia java/ jetbot kit_fireball knowledge lachesis larbin libwww linkwalker lwp mantraagent mediapartners mercator miva mj12 mnogo moget/ msnbot multitext muscatferret myweb nameprotect ncsa beta netmechanic netresearchserver ng/ npbot nutch objectssearch omni osis-project pear. pompos poppelsdorf rambler scooter scrubby seeker shopwiki sidewinder smartwit sna- sohu spyder steeler/ sygol szukacz t-h-u-n-d-e-r-s-t-o-n-e /teoma tutorgig ultraseek vagabondo volcano voyager/ w3c_validator websitepulse wget worldlight worm zao/ xenu zippp zyborg .... ! spiders.txt Contribution version 2005-09-08 )
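
One way to check the line-ending theory directly is to count the line-ending byte sequences in the file. The sketch below is hypothetical (count_line_endings() is not an osCommerce or PHP function) and uses a small temporary file with bare carriage returns as a stand-in for spiders.txt; it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: count each line-ending style in a string.
// A CRLF pair is counted once and not again as a bare CR or bare LF.
function count_line_endings(string $bytes): array
{
    $crlf = substr_count($bytes, "\r\n");
    return [
        'crlf' => $crlf,                              // Windows
        'cr'   => substr_count($bytes, "\r") - $crlf, // classic Mac
        'lf'   => substr_count($bytes, "\n") - $crlf, // Unix
    ];
}

// Stand-in for a spiders.txt saved with bare CR line endings.
$path = tempnam(sys_get_temp_dir(), 'spiders');
file_put_contents($path, "crawl\rslurp\rspider\r");

print_r(count_line_endings(file_get_contents($path)));

// file() splits on "\n" unless auto_detect_line_endings is enabled,
// so a CR-only file is expected to come back as a single element.
echo count(file($path)), " element(s) returned by file()\n";

unlink($path);
```

A non-zero 'cr' count with zero 'lf' would explain an array like the one above, where every entry ends up in index [0].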


file() is not reading the bots on each line into array elements for some reason. I have fixed it by replacing all the line breaks in spiders.txt with ",". And then I altered the code in includes/application_top.php as follows:

 

$spiders = file(DIR_WS_INCLUDES . 'spiders.txt');
$spiders = $spiders[0];
$spiders = explode(",", $spiders);

 

The file is still read into an array, but then I extract the comma-separated string from index [0] and explode it into the array that should have been there. Once I have a real fix I will post it. It has to be something with Mac OS X.


I just checked the PHP manual page for the file() function, and there is a note saying this:

Note: If you are having problems with PHP not recognizing the line endings when reading files either on or created by a Macintosh computer, you might want to enable the auto_detect_line_endings run-time configuration option.

 

the page:

 

http://www.php.net/manual/en/function.file.php

I just checked the PHP manual page for the file() function, and there is a note saying this:

the page:

 

http://www.php.net/manual/en/function.file.php

 

:D That's the real fix for Mac OS X people. If you are running on a shared server, you do not want to set this in php.ini because of the small performance loss.

 

In includes/application_top.php you will want to add:

 

ini_set('auto_detect_line_endings', '1');

 

I put it right below the ending "}" at line 156. So my code now reads:

 

// set the session cookie parameters
  if (function_exists('session_set_cookie_params')) {
    session_set_cookie_params(0, $cookie_path, $cookie_domain);
  } elseif (function_exists('ini_set')) {
    ini_set('session.cookie_lifetime', '0');
    ini_set('session.cookie_path', $cookie_path);
    ini_set('session.cookie_domain', $cookie_domain);
  }

  ini_set('auto_detect_line_endings', '1');

 

It works perfectly without having to alter spiders.txt or the spider array code.


Zelf,

 

I'm having the same problem as you, but I'm on a shared Linux server. Yahoo and MSN index my site, but with SIDs. Google has only indexed the index page. I noticed Google only indexed your index page too. When you talk about Mac OS X, are you talking only about OS X servers that host the website? I use an OS X desktop, but not for hosting my website. Do you think adding the one-line change to my application_top.php would help? I am using the Ultimate SEO URLs contribution; do you think that would cause SIDs to be indexed? Sorry for all the questions, I just can't figure this out and your post is the only one that seemed like a logical answer. I've been trying to figure this out for about a month now.

 

-Brian

I'm on a shared linux server.

You shouldn't have a problem with the stock spiders file function on Linux. My issue was with Mac OS X incorrectly reading line endings.

are you talking only about OSX Servers that host the website.

Yes. The problem is with the OSX Server reading line endings in the spiders.txt file. Did you convert the line endings in the file to OSX line endings?

Do you think adding the one line change to my application_top.php would help?

Don't guess. Test it for yourself. The easiest way to test it is to use http://www.hashemian.com/tools/browser-simulator.htm where you can type in your URL and then select a user agent like "Googlebot". If the headers come back displaying a session id or cookie then your spiders function is not working correctly.

 

Before you make any changes, I would test it on the site I listed. Then either try converting your line endings to UNIX or, I guess, you could try my fix, which was to add the following lines at line 179 of includes/application_top.php so that the reading of the spiders file looks like this:

 

if (tep_not_null($user_agent)) {
  $spiders = file(DIR_WS_INCLUDES . 'spiders.txt');
  $spiders = $spiders[0];
  $spiders = explode(",", $spiders);

 

or use my earlier post in this thread about adding the ini_set call, but I would try that last, as there is a small performance loss.

 

Finally, and really this should be the first thing you do: test how the spiders.txt file is being read in by includes/application_top.php by adding print_r($spiders); after the file is read in at line 179. If the array does not separate the list of user agents into separate indexes, then you know that is your problem. Review this thread for an example of what that would look like.

I am using the Ultimate SEO URLs contribution; do you think that would cause SIDs to be indexed?

No. Absolutely not. If user agents are obtaining a session then the fault is most likely with the user agent not being matched to a user agent in the spiders.txt file. I would focus there.

$spiders = explode(",", $spiders);

I had to modify my spiders.txt file in order for this to work. I suppose that, without modifying the file, if it is all being read into index [0], you could change the explode call to read:

 

$spiders = explode(" ", $spiders);

 

Just test the array to make sure each user agent has its own index.
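
As an alternative to guessing the separator, the file contents can be split on any of the three common line endings at once, which also keeps entries that contain a space (such as "ncsa beta") in one piece. This is only a sketch under assumptions: load_spider_patterns() is a hypothetical helper, not stock osCommerce code, and it assumes PHP 7 or later. It does not rely on the auto_detect_line_endings setting, which is deprecated as of PHP 8.1.

```php
<?php
// Hypothetical helper: turn the raw contents of spiders.txt into a clean
// list of lowercase patterns, whatever line endings the file was saved with.
function load_spider_patterns(string $contents): array
{
    // Split on CRLF first, then on bare CR or bare LF.
    $lines = preg_split('/\r\n|\r|\n/', $contents) ?: [];

    $patterns = [];
    foreach ($lines as $line) {
        $line = strtolower(trim($line));
        if ($line !== '') {
            $patterns[] = $line; // drop blank lines
        }
    }
    return $patterns;
}

// In application_top.php this might be called as (untested in a live shop):
//   $spiders = load_spider_patterns(file_get_contents(DIR_WS_INCLUDES . 'spiders.txt'));
print_r(load_spider_patterns("crawl\rslurp\r\nspider\n\nncsa beta\n"));
```

Printing the result with print_r(), as suggested above, should show one user agent pattern per index.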
