Ian

[Contribution] Googlebot/Spider session id killer


A number of methods have been suggested to stop session ids being added to the URL when a site is spidered by Googlebot or other search engine bots that get trapped by session ids.

 

The two main suggestions are

 

1. Kill the session id completely. This works but makes the site unusable by anyone who has cookies disabled.

 

2. Use some sort of spider recognition code, usually a list of IP addresses and/or a string match on the http_referer. This can be unwieldy (a huge number of IPs which need constant updating) and of course falls over if the spider suddenly changes IP.

 

The code I have written is very simple, and works off a very simple premise.

 

If no customer is logged in and there is nothing in the cart, we kill the session id. As a spider/bot can't log in or add anything to the cart, its links should never carry a session id.

 

CODE CHANGES.

 

In catalog/includes/application_top.php add this just before the closing ?> tag

//================================================================
// A language or currency switch keeps the session id, so that
// visitors with cookies disabled can still change either one.
if ( ($HTTP_GET_VARS['currency']) ) {
  tep_session_register('kill_sid');
  $kill_sid = false;
}
if ( ($HTTP_GET_VARS['language']) ) {
  tep_session_register('kill_sid');
  $kill_sid = false;
}
// Nobody logged in, empty cart, no switch recorded: drop the session id.
if ( ( !tep_session_is_registered('customer_id') ) && ( $cart->count_contents()==0 ) && ( !tep_session_is_registered('kill_sid') ) ) $kill_sid = true;
//================================================================
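To check the rule by hand before a spider turns up, the condition can be pulled out into a plain function. This is only a sketch: `sketch_should_kill_sid` and its three arguments are names made up for illustration, standing in for the customer_id session check, `$cart->count_contents()` and the kill_sid session check.

```php
<?php
// Illustrative only: the same rule as the application_top.php snippet,
// written as a pure function so every case can be checked without a shop.
function sketch_should_kill_sid($logged_in, $cart_items, $switch_registered)
{
    // Kill the session id only for a visitor who has not logged in,
    // has an empty cart and has never switched language or currency.
    return !$logged_in && $cart_items == 0 && !$switch_registered;
}

$cases = array(
    'spider or fresh visitor'     => array(false, 0, false),
    'logged-in customer'          => array(true,  0, false),
    'anonymous, 2 items in cart'  => array(false, 2, false),
    'anonymous, changed language' => array(false, 0, true),
);
foreach ($cases as $label => $args) {
    $kill = sketch_should_kill_sid($args[0], $args[1], $args[2]);
    echo $label, ': ', ($kill ? 'kill sid' : 'keep sid'), "\n";
}
```

Only the first case should report "kill sid"; the other three keep it.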

 

Now find the function tep_href_link (should be the first one) in /includes/functions/html_output.php

 

After the first line

  function tep_href_link($page = '', $parameters = '', $connection = 'NONSSL', $add_session_id = true, $search_engine_safe = true) {

 

add the line

    global $kill_sid;

 

now find the lines

    if ( isset($sid) ) {

     $link .= $separator . $sid;

   }

 

and change to

 

    if ( ( isset($sid) ) && ( !$kill_sid ) ) {
      $link .= $separator . $sid;
    }

 

And that's it. In some older snapshots the tep_href_link function may look slightly different, but you should be able to find where the addition goes.
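For anyone who wants to see the effect of the change without touching a live shop, here is a cut-down stand-in for the link builder. It is not the real tep_href_link: `sketch_href_link`, its arguments and the `osCsid=abc123` value are assumptions for illustration, and only the session id decision is modelled.

```php
<?php
// Simplified stand-in for tep_href_link(): NOT the osCommerce function,
// only the part that decides whether the session id is appended.
function sketch_href_link($page, $parameters, $sid, $kill_sid)
{
    $link = $page;
    $separator = '?';
    if ($parameters !== '') {
        $link .= '?' . $parameters;
        $separator = '&';
    }
    // Same test as the modified block: append only when a sid exists
    // and the kill flag is off.
    if (isset($sid) && !$kill_sid) {
        $link .= $separator . $sid;
    }
    return $link;
}

// Anonymous visitor with an empty cart: flag is on, link stays clean.
echo sketch_href_link('product_info.php', 'products_id=1', 'osCsid=abc123', true), "\n";
// Logged-in customer without cookies: flag is off, sid is kept.
echo sketch_href_link('product_info.php', 'products_id=1', 'osCsid=abc123', false), "\n";
```

The first call gives a link with no session id, the second keeps it on the end of the query string.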

 

Warnings. The code above has been tested on a clean oscommerce installation. The only problem reported has been with 'USE_DEFAULT_LANGUAGE_CURRENCY' set in admin.

 

Contributions which require extra session variables to be registered may have problems if these session variables are needed before login or while the cart is still empty.


Trust me, I'm an Accountant.


Thanks, Ian, for all the work you have done lately.

It is not important whether Google catches the session id or not.

That is not the point. Google brings up pages it stored long before, depending on the number of backlinks to your domain. You can verify that by searching Google for osCommerce shops: mostly you find pages like privacy.php or polls.php. The most recently updated pages are listed first by Google. What you might do to catch those bots is another thing.

Try renaming all the files in the shop's root directory to drop the extension (yes indeed, with no file type after the name). Most important is to change default.php to another name; take "shop", say, and not shop.php. It is also important to make the same change in application_top and in the languages directory. Check that you have no Java code inside your shop; JavaScript is OK.

Change your .htaccess file like this:

 

# "shop" is the renamed default.php
<Files shop>
ForceType application/x-httpd-php
</Files>
<Files product_info>
ForceType application/x-httpd-php
</Files>
<Files specials>
ForceType application/x-httpd-php
</Files>
<Files contact_us>
ForceType application/x-httpd-php
</Files>
<Files shipping>
ForceType application/x-httpd-php
</Files>
<Files privacy>
ForceType application/x-httpd-php
</Files>
<Files conditions>
ForceType application/x-httpd-php
</Files>

and so on

 

What is more important is to redirect 404s directly to "shop".

Then use the metatag contribution, which I can send you if you don't have one. AND WAIT FOR THE BOT. You might have to wait half a year, simply because Google surfs in silent mode to find out what your site structure is. Then suddenly you find 50000 instances of the same IP in your shop and everything is spinning round and round. Expensive traffic!!!

And afterwards only two pages in Google. Wait longer and change the pages often. Maybe use a script that plugs in different keywords from a pool in the database. The file length seems to tell Google whether pages have been updated or not. That is how I reached page 1, places 1-3, for 8 important keywords in my line of business.

I will try your code snippet here, too. But what happens when Google recognizes your *.php files? Another trick is to use ForceType in Apache to parse files with html endings as PHP files. But the cleverest thing is to make Google believe it is browsing directories. Good luck :wink:


xtcommerce Templates

And this is my new coding project. A multilingual sitesearch for online stores. Have a look at the search field left in the navbar:

WWW.BE-INSHAPE.DE Proteinpulver, Aminosäure Liquids and Supplements for Bodybuilding and Fitness

It finds all the ingredients like amino acids, carnitin and proteins if you don't know how to spell. In realtime...


This might be a stupid question but....

 

Could a robot follow the 'in cart' or 'buy now' links?

What would happen if it did?

 

Could you expand on the currency setting issue? It didn't make sense to me. What should I look for and what shouldn't it be?

 

Thanks,

 

Jon.


Sure, a robot scans every document that is reachable.

Google pulls everything it can get into the shopping cart. But Googlebot doesn't save or cache these pages; it only follows the next link it can find.

SID and ? are the things by which robots recognize that your page is dynamic.

It is not a problem, but for their index it makes no sense, so they ignore such sites.




Gwinger,

 

I think if you follow the threads on this forum you will find that session id's are a big problem for people being spidered by googlebot.

 

This code is not meant to get your site better rankings. What happens when Googlebot visits your site is that, because it does not use cookies, every visit generates a different session id. Some people have reported in excess of 50,000 Google hits per visit because of this.
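To put rough numbers on that, here is a toy model of a cookieless crawler. It is not osCommerce code: `sketch_crawl_urls`, the page names and the made-up `osCsid` values are all assumptions for illustration. It simply counts how many different URLs the crawler ends up seeing for the same few pages.

```php
<?php
// Toy model, not osCommerce code: count the distinct URLs a cookieless
// crawler sees for the same pages over several visits.
function sketch_crawl_urls(array $pages, $visits, $kill_sid)
{
    $seen = array();
    for ($visit = 1; $visit <= $visits; $visit++) {
        // No cookie comes back, so each visit starts a new session.
        $sid = 'osCsid=' . md5('visit-' . $visit);
        foreach ($pages as $page) {
            $url = $kill_sid ? $page : $page . '?' . $sid;
            $seen[$url] = true;
        }
    }
    return count($seen);
}

$pages = array('index.php', 'specials.php', 'privacy.php');
echo sketch_crawl_urls($pages, 4, false), "\n"; // 3 pages x 4 visits = 12
echo sketch_crawl_urls($pages, 4, true), "\n";  // just the 3 pages
```

With session ids in the links every visit looks like a brand new set of pages, which is where the runaway hit counts come from.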

 

Jon_l

 

The in cart action will not be followed as it is a form post. Google can't follow form actions. The buy_now button is more problematic as it is a direct href link. I'll look to see what can be done to fix this.

 

As for the currency setting. In admin there is a setting to switch to a currency depending on the language. So if you switch to Russian language it switches to Rouble currency.

 

Very few people, AFAIK, use this. Here is what happens with my code if you have this setting enabled: if the customer has cookies enabled, things work as expected; if they have cookies disabled, then as soon as they click on another link they are switched back to the default currency.


Trust me, I'm an Accountant.


I know that Google follows links into the shopping cart.

I have not seen it log in and buy, but it picks up anything that is new. I think it searches form-action links to catch links on the result pages, which matter for its backlink ranking strategy.




Well....I'd call 30,000+ page hits (from 50 products) in the space of a few days, by the Google spider a bit of a problem.

 

I will double check the bit on what Google follows in our web logs....however, this might not be a problem. If you disable indexing of shopping_cart.php in the robots.txt, it will follow the link but then won't do anything with it. So it shouldn't matter even if a session id is added to the url at that point. I think.
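For reference, the robots.txt entry for that would look something like the fragment below. It assumes the shop sits in the web root; if yours is in a subdirectory such as /catalog/, the path needs that prefix. A well-behaved crawler should then leave the disallowed URL alone.

```text
User-agent: *
Disallow: /shopping_cart.php
```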

 

This seems a nice and tidy way of fixing the problem.

 

I reckon I've got a few weeks before it will be back en masse again. I'll try to get the changes applied before then.

 

I've said it before, and I'll say it again....we really need something in the cvs for this. You don't realise what a problem it is until Google targets you. You get all excited to begin with, then realise what is happening. The worst thing for me is that it screws up all the counters (product views and page hits), making the data worthless. Plus of course the bandwidth (and we are talking lots).

 

Thanks,

 

Jon.


The main reason for putting this code up as a contribution is that it needs a lot of testing. I initially emailed it to half a dozen people and with their feedback refined the code somewhat. Thanks, guys.

 

However, because of the time lag waiting for a spider to hit a tester's site, the more people testing, the faster I'll get results.

 

Using robots.txt even with this code is a must. However, it won't necessarily fix the buy_now button problem. It's possible to set osCommerce not to go to the shopping cart after adding something.

 

The only solution I can see is to make the buy_now button a form action.
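Roughly what that could look like is sketched below. `sketch_buy_now_form` is a made-up helper, not part of osCommerce, and the `index.php?action=buy_now` target is only a placeholder. The code that handles buy_now would also have to read products_id from the posted form rather than from the URL, which this sketch does not cover.

```php
<?php
// Hypothetical helper, not part of osCommerce: render "buy now" as a
// POST form so there is no plain href for a crawler to follow.
function sketch_buy_now_form($action_url, $products_id)
{
    return '<form method="post" action="' . htmlspecialchars($action_url) . '">'
         . '<input type="hidden" name="products_id" value="' . (int)$products_id . '">'
         . '<input type="submit" value="Buy Now">'
         . '</form>';
}

echo sketch_buy_now_form('index.php?action=buy_now', 42), "\n";
```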

 

I'm still looking at this


Trust me, I'm an Accountant.


Maybe others who have installed this mod could try this as I am getting mixed results:

 

1) Set your browser security to maximum so as to prevent all cookies

2) Make sure Search Engine Friendly URLs are not being used as this feature does not work anyway without cookies being allowed.

3) Load your site and go straight to your login page without clicking anything else.

4) Can you see the SID in the URL? Can you log in?

 

Ian, if this IS a problem (not sure yet that it is), do you think that just adding:

 

$kill_sid = false;

 

to the login page would fix it?


Ian-san

Flawlessnet


I don't even see where the problem is.

My site, and it is definitely osC 2.2, is listed in Google, in Fireball and on other sites with thousands of entries. Where is the problem with osCommerce sites being spidered?

Save your sessions in MySQL and that is enough. Just do the things I told you above. The reason why you were not spidered should be one of these cases:

1. You have no Metatags

2. You have the same Metatags on every page with default.php

3. You have only products without very much description and your page has less content.




I've updated the application_top.php code to get round some login problems when cookies are disabled.

 

//================================================================
if ( ($HTTP_GET_VARS['currency']) ) {
  tep_session_register('kill_sid');
  $kill_sid = false;
}
if ( ($HTTP_GET_VARS['language']) ) {
  tep_session_register('kill_sid');
  $kill_sid = false;
}
if ( ( !tep_session_is_registered('customer_id') ) && ( $cart->count_contents()==0 ) && ( !tep_session_is_registered('kill_sid') ) ) $kill_sid = true;
// The login page always keeps the session id, so logging in works with cookies disabled.
if ( basename($PHP_SELF) == FILENAME_LOGIN ) $kill_sid = false;
//================================================================


Trust me, I'm an Accountant.


Ian,

 

There seems to be a problem when using your spider-contrib together with my latest versions of "allprods"-contrib:

 

Allprods includes the language in the url with the intention of getting your products listed in the bot's directories in all available languages.

 

The inclusion of the language makes your contrib believe that a "normal" customer is visiting the store. It registers "$kill_sid" and sets it to "false" so the sid will be included.

 

When the spider comes into the shop using allprods, it will be treated as an ordinary customer...

 

Wouldn't it be better to register the kill_sid variable during actual logon?

 

Marcel


Greetings from Marcel

|Current version|Documentation|Contributions|


Marcel

 

Wouldn't it be better to register the kill_sid variable during actual logon?

 

The reason I pass a session id when the language/currency changes is for people who have cookies turned off. Otherwise they couldn't change language or currency.

 

I've just added a bit of code to your contribution and to mine. Basically I've made your code add &kill_sid=true to the URL for all links. Then at the end of my application_top.php I add

if ($HTTP_GET_VARS['kill_sid']) $kill_sid = $HTTP_GET_VARS['kill_sid'];

 

This solves the problem, could also be used by other contributions to ensure compatibility.
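One thing for other contributions to watch if they reuse the parameter: a bare truth test on the URL value treats any non-empty string as true, so &kill_sid=false would also switch the sid off. A more defensive reading might look like the sketch below. `sketch_kill_sid_from_query` is a made-up name, and filter_var() needs PHP 5.2 or later, so this is an assumption about the PHP version rather than a drop-in for older installs.

```php
<?php
// Sketch only: read an optional kill_sid URL parameter as a real boolean.
// A bare truth test would treat the string "false" as true.
function sketch_kill_sid_from_query(array $get, $current)
{
    if (!isset($get['kill_sid'])) {
        return $current; // parameter absent: keep the computed value
    }
    $flag = filter_var($get['kill_sid'], FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
    // Unrecognised values (null) also fall back to the computed value.
    return ($flag === null) ? $current : $flag;
}

var_dump(sketch_kill_sid_from_query(array('kill_sid' => 'true'), false)); // bool(true)
var_dump(sketch_kill_sid_from_query(array('kill_sid' => 'false'), true)); // bool(false)
var_dump(sketch_kill_sid_from_query(array(), true));                      // bool(true)
```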


Trust me, I'm an Accountant.

This solves the problem, could also be used by other contributions to ensure compatibility.

 

However, with cookies off and &kill_sid=true in the url, you can't add to the basket using a link created by allprods :(


Ian-san

Flawlessnet


Ian,

 

I was under the impression that people used allprods.php as a hidden file, not for browsing by a normal user.

 

Is this not the case?


Trust me, I'm an Accountant.


Ian

 

That is certainly its purpose. However, it does make a nice way to list the whole product catalogue and I have seen it used this way in a few stores.

 

Maybe not having browsing capability for allprods when cookies are off is a reasonable situation, I for one can live with it as (despite what many programmers say) I doubt if the average customer knows what a cookie is let alone how to turn it off - and IE comes by default with Cookies on.

 

However, I was also thinking that the spider would pick up the link for the product and that this link would now include the &kill_sid=true in the url? In the same way as it currently picks up the SID - so when the customer follows the listed search engine link, it would be a problem if cookies are off. Maybe that is not the case?

 

Or is it that when you return to the store, SID is turned on by the command $kill_sid = false; at the top of your mod to application_top despite the url saying &kill_sid=true?


Ian-san

Flawlessnet


Maybe defining a mode for allprods is a solution. In reality if you were using allprods for customers then it should only list in one language anyway. In bot mode it would show all languages and add the kill_sid

 

If a customer clicks on a generated Google link with kill_sid=true and then immediately tries to add to cart, it won't work; this is due to the tep_get_all_parameters function. If they just go on to browse the site then there is no problem, as the kill_sid parameter drops off the url.

 

BTW if you're not bothered about cookies being off then the easiest thing to do is kill sids always.

 

Maybe a better idea is to see if the previous page was allprods.php and then kill the sid (unless of course we are logged in or have something in the cart). No kill_sid parameter in the url is then needed.

 

I'll give that a try


Trust me, I'm an Accountant.

In reality if you were using allprods for customers then it should only list in one language anyway.

 

Ian

 

I think you are right here. The correct approach would be to use a modified, single language, allprods for customers and a second, hidden, multi-lingual one, for the robot.

 

Anyway, we have to remember that allprods is a contribution, not part of the official code.

 

If a customer clicks on a generated google link with kill_sid=true then if they immediately try to add to cart then it won't work, this is due to the tep_get_all_parameters function.

 

This may be more of a problem for some but it is difficult to think that a customer will buy immediately on entering a store without clicking on something else first AND have cookies switched off!


Ian-san

Flawlessnet


It seems that doing a referer check works, clicking on an allprods link and the sid is still killed. One downside is that if you click on a foreign language link the page is shown in the correct language but then clicking on another link reverts to default language.

 

Maybe we can live with that.

 

This now makes my application_top code

 

//================================================================
if ( ($HTTP_GET_VARS['currency']) ) {
  tep_session_register('kill_sid');
  $kill_sid = false;
}
if ( ($HTTP_GET_VARS['language']) ) {
  tep_session_register('kill_sid');
  $kill_sid = false;
}
// Arriving from allprods.php: kill the sid (the login page check below still wins).
if ( basename($_SERVER['HTTP_REFERER']) == 'allprods.php' ) $kill_sid = true;
if ( ( !tep_session_is_registered('customer_id') ) && ( $cart->count_contents()==0 ) && ( !tep_session_is_registered('kill_sid') ) ) $kill_sid = true;
if ( basename($PHP_SELF) == FILENAME_LOGIN ) $kill_sid = false;
//================================================================
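One caveat on the referer line, for anyone adapting it: basename() on a full URL keeps any query string, so a referer like .../allprods.php?language=de would not compare equal to 'allprods.php', and the header may be missing altogether. A more tolerant check could look like the sketch below. `sketch_came_from` and the example.com URLs are assumptions for illustration, not part of the contribution.

```php
<?php
// Sketch only: compare just the path part of the referer, so a query
// string such as ?language=de does not defeat the match.
function sketch_came_from($referer, $filename)
{
    if (!is_string($referer) || $referer === '') {
        return false; // header missing: many clients send no referer
    }
    $path = parse_url($referer, PHP_URL_PATH);
    return is_string($path) && basename($path) === $filename;
}

var_dump(sketch_came_from('http://example.com/catalog/allprods.php?language=de', 'allprods.php')); // bool(true)
var_dump(sketch_came_from('http://example.com/catalog/index.php', 'allprods.php'));                // bool(false)
var_dump(sketch_came_from(null, 'allprods.php'));                                                  // bool(false)
```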


Trust me, I'm an Accountant.

Anyway, we have to remember that allprods is a contribution, not part of the official code.

 

That is true, but what I'm trying to code is something that is generic. Already I'm having to make special consideration for a contribution.

 

I don't want to write code that excludes contributions by others, and at the same time I don't want to write code that is having to do multiple if statements to consider others.

 

It's a dilemma, but that's what peer review is all about: I code something, you call me names, I code better :lol:


Trust me, I'm an Accountant.

I code something, you call me names, I code better

 

Ian, as usual you managed it without the need for any name calling. Many thanks.:D

 

But maybe you should meet my wife and then you could achieve supreme expertise in your coding ability ... :twisted:


Ian-san

Flawlessnet


If I remember correctly, your wife is Japanese.

 

It's always been my belief that the only reason to learn a foreign language is so that you can call someone names without them even realising the insult.

 

Ian

ye hev a heed like a boiled neep.

 

:lol: :lol: :lol: :lol:


Trust me, I'm an Accountant.
