Ian

[Contribution] Googlebot/Spider session id killer


Yes, if a spider follows a link from allprods.php then there's no session_id and they get the product page in the appropriate language.


Trust me, I'm an Accountant.


Ian,

 

Believe it or not, I got spidered again by Google. Still no index entry, nothing at all when I search Google for my site, but that's beside the point.

 

I noticed that while they were re-spidering me, they lost the session IDs only when viewing the product_info pages. When the bot hit a category page and transferred to the cPath, it picked up the sid again. Granted, there is no info in my cPath area that would make good content for an index, but it caused them to loop around again.

 

Just some input as to what is going on. Like I said though, they are losing the sids on product_info.php, and as far as I can tell this is the only page they lose them on.


Brandon Sweet


Brandon,

 

I'm no Google expert, but from reading all these topics I seem to remember someone saying you have to wait 2-3 days before a spider session makes any noticeable impact on the Google index.

 

The categories sid issue is interesting. I am going to spend some time just wandering round a stock osc site with cookies off, just to get a feel for the issues this may throw up.

 

As a matter of interest, are you using any other Google/spider fixes, e.g. allprods, robots.txt etc.?

 

I'll also wander around your site (oh such hard work I give myself) to see if it has any special issues.


Trust me, I'm an Accountant.


Brandon,

 

Just had a thought: do you use the buy now button in your product listings? If you do, this is almost certainly the cause of the reappearance of the sid.

 

My code is built to work for a customer who has cookies turned off. If this customer adds something to the cart, they need sids to retain the cart.

 

This is fine for the add to cart button, as this is a form action, which bots can't perform. However, the buy now button is a straightforward link that spiders will follow. My code will see that something has been added to the cart and turn sids back on.

 

It's a major failing that I've mentioned before. I have yet to come up with a simple fix (the non-simple fix is to recode the product listing to make buy now a form action).
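
For anyone wondering what the "non-simple fix" might look like in outline, here is a minimal sketch, not the actual osCommerce product listing code. The function name buy_now_form, the products_id field name and the action URL are all assumptions made for illustration. The idea is only that a POST form is not a link, so a spider following links never adds anything to a cart.

```php
<?php
// Hypothetical helper, not part of osCommerce: renders "buy now" as a POST
// form instead of a plain link, so crawlers that only follow links never
// trigger the add-to-cart action (and never switch sids back on).
function buy_now_form(string $actionUrl, int $productId): string
{
    // Escape the URL for use inside an HTML attribute.
    $action = htmlspecialchars($actionUrl, ENT_QUOTES, 'UTF-8');

    return '<form method="post" action="' . $action . '">'
         . '<input type="hidden" name="products_id" value="' . $productId . '">'
         . '<input type="submit" value="Buy Now">'
         . '</form>';
}

echo buy_now_form('index.php?action=buy_now', 42), "\n";
```

The receiving page would then have to read the product id from the POST data rather than from the URL, which is the part that makes this fix "non-simple".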


Trust me, I'm an Accountant.


What about using Search-Engine Safe URLs option to strip the sid out of the pages? It seems to work fine... no sid in the code that I can see.

 

http://www.mountainwatersspa.com/catalog

 

Does anyone know when this feature was implemented? I found it by surprise in a 101402 snapshot. I was so excited to see that it worked.


Gregory

 

Don't get too excited - the SID disappears because you have a cookie set instead. It will come back for the search engines :(

 

Also, note that with Search Engine Safe URLs, you cannot add to basket or log-on if cookies are disabled. :( :(

 

This feature is still under development I think.


Ian-san

Flawlessnet


Thanks for bringing me up to date. This really helps :)

 

I suppose I can live with the trade-off of functionality for search engine indexing for a little while. It's very important to get our new catalog indexed. Our web site is already well indexed on google.com. I'll use Ian Wilson's fix until a better solution is implemented.

 

Warm Regards,

Greg


I never checked our cookie output before. We use PostNuke for content management and I see it generates a session number in cookies. Is this what's throwing off google bots in OSC? The PostNuke session ID never seemed to affect our indexing status with google.com in the past.

 

I just added Ian's session remover code to be on the safe side. Since we use PayPal to process all orders, cookies have to be enabled anyway, so the Search-Engine Safe URLs option will not affect us. I need to post a note on the sign-in page asking customers to have cookies enabled.

 

Here's what our cookie file looks like now with both oscommerce (Ian's code added) and PostNuke data:

 

www.mountainwatersspa.com FALSE /catalog FALSE 1040913873 email_address greg%40mountainwatersspa.com

www.mountainwatersspa.com FALSE /catalog FALSE 1040913875 first_name Gregory

.www.mountainwatersspa.com TRUE / FALSE 1038926728 POSTNUKESID 088c8254e59c0586633edfbe38fd0515


Gregory

 

I am not an expert on cookies, but if osCommerce manages to store a cookie on the customer's computer, it turns off the SIDs in the URL, as they are no longer required. When Google comes along, it clearly does not accept cookies, so the SID comes back.

 

There is an enormous amount of stuff in these forums on cookies, Google, session IDs, SEFUs etc.

 

Like a cold, there are so many cures because none of them really works 100%.

 

Here is the cookie you stored on my computer:

 

POSTNUKESID

672c1113616f512a432f541ba329f063

www.mountainwatersspa.com/

1536

309190016

29524804

2981840080

29523395

*

 

clearly with SID. And here is one of my own:

 

email_address

XXXXXXXX(DELETED)XXXXXXX

www.nowsayit.com/catalog_en

1024

1454922752

29529413

797726816

29523379

*

first_name

Ian

www.nowsayit.com/catalog_en

1024

1454922752

29529413

797726816

29523379

*

 

I am not sure, but it doesn't look like a SID to me. I have Ian's mod added as well.


Ian-san

Flawlessnet


Thanks for the contribution Ian.

 

I've had it installed for a week or more, and I notice that once the customer adds something to their cart and then clicks on, for instance, contact_us.php or any other product page, the osCsid is dropped and the shopping cart therefore appears empty.

 

This is probably part of the optimization for spiders but if a customer clicks on another page and sees that their basket is empty they are more than likely not going to complete the transaction.

 

Have there been any updates to your code since you last posted?

 

Thanks,

 

H


This is not the way it is supposed to work, and certainly not how it works on my test system. The sid should be carried on after adding a product to the cart and should not be dropped unless the customer empties the cart.

 

Are you sure you have the latest version?


Trust me, I'm an Accountant.


Hi Ian,

 

I'm using the latest version through CVS and the latest version of your code gathered from this post.

 

//================================================================
if ( ($HTTP_GET_VARS['currency']) ) {
  tep_session_register('kill_sid');
  $kill_sid = false;
}
if ( ($HTTP_GET_VARS['language']) ) {
  tep_session_register('kill_sid');
  $kill_sid = false;
}
if (basename($_SERVER['HTTP_REFERER']) == 'allprods.php' ) $kill_sid = true;
if ( ( !tep_session_is_registered('customer_id') ) && ( $cart->count_contents()==0 ) && (!tep_session_is_registered('kill_sid') ) ) $kill_sid = false;
if (basename($PHP_SELF) == FILENAME_LOGIN ) $kill_sid = false;
//================================================================

 

I changed the second line from $kill_sid = true; to $kill_sid = false;

 

It works now and I actually had google crawl all throughout my site from product_info.php.

 

I'm guessing that this function stops killing a customer sid once a session is registered so if the bot/spider crawls from product_info.php the sid is killed from there. I do not have buy now functions enabled. I was enjoying pretty decent rankings before implementing this so I'm anxious to see if the spider will crawl every product page now.

 

Henry


Let's just look at what that line of code should be doing.

If a customer has logged in, or if there is something in the basket, we want sids to be produced.

 

So the line in my code is:

if ( ( !tep_session_is_registered('customer_id') ) && ( $cart->count_contents()==0 ) && (!tep_session_is_registered('kill_sid') ) ) $kill_sid = true;

 

N.B. Notice that my code sets $kill_sid = true.

If the crawler/customer is not logged in and has nothing in the cart, kill the sid. We don't need it.

If you change that to false, then a customer who does not have cookies enabled won't be able to add items to the cart.
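
To make the rule easier to check by eye, here is a minimal sketch that restates that one line as a standalone function. The name should_kill_sid and its three parameters are assumptions for illustration; the real mod reads the osCommerce session and cart directly, and this sketch ignores the allprods.php referer and login page overrides from the other lines.

```php
<?php
// Hypothetical restatement of the rule as a pure function. Parameters stand
// in for: tep_session_is_registered('customer_id'), $cart->count_contents(),
// and tep_session_is_registered('kill_sid').
function should_kill_sid(bool $loggedIn, int $cartCount, bool $killSidRegistered): bool
{
    // Anonymous visitor, empty cart, no currency/language switch recorded:
    // nothing depends on the session yet, so leave the sid off the links.
    return !$loggedIn && $cartCount === 0 && !$killSidRegistered;
}

var_dump(should_kill_sid(false, 0, false)); // crawler or casual browser
var_dump(should_kill_sid(false, 2, false)); // cookieless customer with items in the cart
var_dump(should_kill_sid(true, 0, false));  // logged-in customer
```

Only the first case should kill the sid; the other two need it to keep the cart or the login alive when cookies are off.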


Trust me, I'm an Accountant.


Just to make a couple of points clear.

 

First, all the code presented in this thread to stop osCommerce producing a sid when Google crawls was NOT written specifically to help you get a better page rank. Its intention is to stop the spider getting trapped and producing thousands of hits on the site.

 

Second, the attachment of sids to a URL is not what causes Google to get trapped per se. Normally, if a user visits your site and wanders around (with cookies disabled), they will do so with a consistent sid. However, Googlebot does not appear to behave like a normal user following link to link. If it did, the sid would be consistent across its visit. The problem appears to be that when it visits a new link, it also generates a completely new sid.

 

Why it does this is still a mystery to me. I'm currently working through a number of sites/forums to see why this is.


Trust me, I'm an Accountant.

If it did, the sid would be consistent across its visit. The problem appears to be that when it visits a new link, it also generates a completely new sid.

Why it does this is still a mystery to me. I'm currently working through a number of sites/forums to see why this is.

 

I think what happens is that the Google Spider initially hits the site with all its existing stored links and so collects many successful hits each with their own Sid. Google then rationalises the list but as many of the hits have a unique Sid, they seem to be different links to Google. So, when the Google indexer comes back a couple of days later it has many 'unique' links to try to index. As all these indexers are wandering around at the same time, it just looks random.

 

With this mod, the initial hit should result in no Sids being returned so no duplication of links for the indexer. Keeping Google from wandering around the site at random is the task of the robots.txt file.
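
To illustrate the duplication point, here is a minimal sketch showing that two links to the same product look different only because of the sid. The helper name strip_sid and the example.com URLs are made up for illustration, and osCsid as the parameter name is an assumption about the shop's session name.

```php
<?php
// Hypothetical helper: drop one query parameter (osCsid by default, which is
// an assumption) so that links differing only by session id compare equal.
// Any #fragment on the URL is dropped as well, which is fine for this sketch.
function strip_sid(string $url, string $sidName = 'osCsid'): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // nothing to strip
    }

    parse_str($parts['query'], $query); // query string -> array
    unset($query[$sidName]);

    $base = strstr($url, '?', true);    // everything before the query string
    $rest = http_build_query($query);

    return $rest === '' ? $base : $base . '?' . $rest;
}

$a = 'http://example.com/product_info.php?products_id=12&osCsid=aaa111';
$b = 'http://example.com/product_info.php?products_id=12&osCsid=bbb222';

var_dump($a === $b);                       // the raw links differ
var_dump(strip_sid($a) === strip_sid($b)); // the same page once the sid is gone
```

A spider has no way of knowing the two raw links are the same page, so each fresh sid looks like another unique URL to fetch.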


Ian-san

Flawlessnet


I've noticed many osCommerce users complaining about the hundreds of requests attributed to the sid. I've never had such a problem, and over 100 of my pages appear high in Google's results. Judging by my log files, Google follows the allprods.php links perfectly.

 

So I'm not convinced this problem can be attributed entirely to the sid. Google seems to follow links throughout my site without discrimination, and I'm not sure why so many problems have been reported.

 

I know Google has improved, and is continuing to improve, its indexing of dynamic pages, and would actually prefer raw output to search-engine-friendly URLs. I included this add-on to see whether more pages would be indexed, in case spiders/bots time out on category paths because of sids (although I have not seen this myself).

 

I've changed the previous code back to "true" and it seems to be functioning ok now. I think the problem was related to html_output.

 

Thanks for the feedback

 

Henry


I have a snapshot from June 4th and cannot find the lines

if ( isset($sid) ) {
  $link .= $separator . $sid;
}

in the html_output.php file to change. Has anyone else implemented this on a snapshot from around that date who could give me a hand? I'm getting ready for Google, since I should be spidered soon...


I've got the same problem, dude.

 

I think the code we need to alter would be this:

 

// Append the session id string to the URL
if ($sess) {
  $sess = $separator . $sess;
}
$link .= $sess;

return $link;
}

 

But when I altered the code and uploaded it, it buggered up the loading of my site, so I didn't leave it in.

 

If anyone can help, gis a shout plz.

 

CC.


Yeah, I can't find that piece of code either. Can anyone help with adding it?


Are you trying to add my sid killer mod?

If so, find the following:

// Append the session id string to the URL
if ($sess) {
  $sess = $separator . $sess;
}
$link .= $sess;

return $link;
}

and change it to:

// Append the session id string to the URL
if ($sess) {
  $sess = $separator . $sess;
}
if (!$kill_sid) $link .= $sess;

return $link;
}

 

Not having the full code for your snapshot, I can't be 100% certain, but it looks right to me. You will of course have to add the application_top.php code as well.


Trust me, I'm an Accountant.


Ian, can yours be run alongside this:

 

// start session ID removal
if (eregi("Googlebot",getenv("HTTP_USER_AGENT")) || eregi("googlebot",getenv("HTTP_USER_AGENT"))) {
  $sess = NULL;
}
if (eregi("WebCrawler",getenv("HTTP_USER_AGENT")) || eregi("InternetSeer",getenv("HTTP_USER_AGENT"))) {
  $sess = NULL;
}

 

Or will it cause problems?
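
A note for anyone trying this on current PHP: eregi() was removed in PHP 7.0, and since it is already case-insensitive the separate "Googlebot"/"googlebot" tests are redundant. A minimal sketch of the same check using stripos() follows. The function name is_known_spider is hypothetical, and the name list simply repeats the three agents above.

```php
<?php
// Sketch for current PHP: stripos() gives the same case-insensitive substring
// test that eregi() was being used for. The function name and the default
// list are illustrative only.
function is_known_spider(string $userAgent, array $names = ['Googlebot', 'WebCrawler', 'InternetSeer']): bool
{
    foreach ($names as $name) {
        if (stripos($userAgent, $name) !== false) {
            return true;
        }
    }
    return false;
}

// The header is absent on the command line, hence the fallback to ''.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (is_known_spider($agent)) {
    $sess = null; // same effect as the snippet above: no session id for spiders
}
```

This still has to be kept up to date by hand whenever a spider changes its name.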

 

Cheers

 

CC.


Well, possibly,

 

My code was intended to do away with testing user_agent, ip address etc, as this could be an ever moving target.

 

E.g. Alta Vista are supposed to be about to redo their whole search engine experience. If that causes problems, you then have to add another spider test. Google upgrade their network, change their spider name and IP, and you're f*.

 

I'm not saying my code is the definitive solution, as it still has one or two problems. However, its advantage is that it's not tied to trying to recognise who is browsing your site.


Trust me, I'm an Accountant.


So Ian,

 

Are you saying this mod will only work if there are no ppl on my site and there is nothing in any carts?

 

And if so, is there any way to test this to make sure it works with my snapshot?

 

Also in the part that goes in application_top you said it goes after the first line, but my code all sits on one line, so...

 

Do you mean like this:

function tep_href_link($page = '', $parameters = '', $connection = 'NONSSL', $add_session_id = true, $search_engine_safe = true) {

global $kill_sid;

 

Or like this:

function tep_href_link($page = '', $parameters = '', $connection = 'NONSSL', $add_session_id = true, global $kill_sid; search_engine_safe = true) {

 

Cheers

 

CC.
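
On the placement question, as far as PHP syntax goes only the first form is valid: a global statement is an ordinary statement, so it belongs inside the function body and cannot appear in the parameter list. A minimal sketch follows, using a simplified stand-in called demo_href_link (a made-up name) rather than the real tep_href_link().

```php
<?php
// Simplified stand-in for tep_href_link(), only to show where the `global`
// statement goes. The body is illustrative, not the osCommerce code.
function demo_href_link(string $page, string $sid = ''): string
{
    global $kill_sid; // first statement of the body, never in the parameter list

    $link = $page;
    if ($sid !== '' && !$kill_sid) {
        $link .= '?' . $sid;
    }
    return $link;
}

$kill_sid = true;  // e.g. anonymous visitor with an empty cart
echo demo_href_link('product_info.php', 'osCsid=abc123'), "\n";

$kill_sid = false; // e.g. something is in the cart
echo demo_href_link('product_info.php', 'osCsid=abc123'), "\n";
```

Without the global line the function sees its own, unset $kill_sid, so the sid would always be appended no matter what application_top.php decided.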


Hi, I noticed that this topic might have finally been addressed in the latest CVS commit.

 

Does this update to html_output.php make this contribution obsolete/unnecessary? Hopefully it does, so that the behaviors in this contribution are performed by default.

 

Thanks

 

Henry
