

lee the bean

2.3.3 How To Prevent Inappropriate URL Parameters


I've noticed that BING (in particular) has a nifty but useless and harmful knack of building a URL with multiple parameters when its bots are on-site.

 

It isn't getting these from site links, so it must be building a parameter list, trying all variations and going for it. BING seems to ignore any canonical information in the <head>, thus merrily indexing a whole bunch of URLs for a single product_id or catalog category page.

 

Here are examples:

 

/index.php?cPath=34_40&products_id=1534&products_id=539

/index.php?cPath=34_39&page=6&products_id=1084&products_id=990

 

'and here's one I made earlier' that also works (i.e. doesn't give a 404):

 

/index.php?cPath=34_39&page=6&products_id=1084&products_id=990&test=3

/product_info.php?cPath=34_39&products_id=991&products_id=990

 

Because osCommerce code doesn't object to a whole bunch of parameters, I suspect BING is spending more time on-site and most likely applying penalties for duplicate content.

 

Is there anything that will:

 

1. Read the parameters and redirect to a URL with the parameters always in the same order

2. Look for illegal parameters (i.e. don't accept any old parameter, like test=3) and remove them

3. Look for duplicate parameters and strip off the extras.

4. Look for inappropriate parameters and remove them (e.g. products_id on index.php)
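
For what it's worth, the four steps above can be sketched in plain PHP by working from the raw query string ($_SERVER['QUERY_STRING']) rather than $_GET, since the raw string still shows ordering and duplicates. This is only a rough sketch, not osCommerce code: the function name canonical_query and the per-script whitelist are made up for illustration, the parameter lists are guesses that would need checking against a real shop, and it is written for PHP 7 or later.

```php
<?php
// Hypothetical whitelist: the parameters each script accepts, in canonical order.
// These lists are assumptions for illustration, not osCommerce's real parameter set.
const ALLOWED_PARAMS = [
    'index.php'        => ['cPath', 'manufacturers_id', 'sort', 'page'],
    'product_info.php' => ['cPath', 'products_id'],
];

/**
 * Build the canonical query string for $script from the RAW query string,
 * so that ordering and duplicates are still visible.
 */
function canonical_query(string $script, string $rawQuery): string
{
    $allowed = ALLOWED_PARAMS[$script] ?? [];
    $seen = [];
    foreach (explode('&', $rawQuery) as $pair) {
        if ($pair === '') {
            continue;
        }
        $parts = explode('=', $pair, 2);
        $name  = urldecode($parts[0]);
        $value = urldecode($parts[1] ?? '');
        // Steps 2 and 4: drop unknown or inappropriate parameters.
        // Step 3: keep only the first occurrence of a duplicate.
        if (in_array($name, $allowed, true) && !array_key_exists($name, $seen)) {
            $seen[$name] = $value;
        }
    }
    // Step 1: emit the survivors in the whitelist's fixed order.
    $ordered = [];
    foreach ($allowed as $name) {
        if (array_key_exists($name, $seen)) {
            $ordered[$name] = $seen[$name];
        }
    }
    return http_build_query($ordered, '', '&');
}
```

A caller would compare the result with the raw query string and send a 301 to the canonical URL only when they differ. Requests that carry an action parameter (cart operations) would presumably have to be exempted, otherwise the whitelist above would strip them too.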


I can't think of any way to do that in .htaccess. It could be done within osC's PHP code, but that would not catch swapped terms or duplicates, as everything would be loaded into the $_REQUEST array. Is Bing really so stupid that it calls a=1&b=2 and b=2&a=1 separate pages? I don't see any way to detect order or duplicates, but you could detect illegal or inappropriate terms and return a 301 redirect to the URL with those removed.


If you are running the "official" osC 2.3.4 or 2.3.4.1 download, your installation is obsolete! Get (stable) Frozenpatches or (unstable) Edge. See also the naming convention and the latest community-supported responsive "Edge" release


Hi MrPhil,

 

Thanks. Yep, BING really is that stupid. I've been watching it for months now. It doesn't seem to be indexing the duplicates, but it does spend time on-site seemingly trawling all combinations of parameters, and doubling up on them.

 

Even in BING webmaster, if you set the ignore parameters option (for everything except the ones you really need, like 'products_id='), it still goes on trawling away blindly.

 

Mind you - that doesn't surprise me. BING still attempts to get to pages I deleted years ago, and they aren't in BING's index or linked to anywhere on the interweb, let alone from BING's index. So the parameters thing is unlikely to be resolved by the BING team.

 

Not to worry - but I think that future developments of osCommerce should address SEO parameters so that they are always called in an order and trap duplicates or irrelevant ones.

 

Cheers ears.


As I said, I don't think there's any way to do this at the .htaccess level (in a very general and robust manner), and by the time osC sees it in PHP code, there's no ordering and probably no duplicates. It could detect illegal/irrelevant ones. If you push back with a 301 redirect without those entries, who knows if Bing will just tack them on again.

 

I think the thing to do is to look at Bing discussions and see if this is a known problem. If it isn't, raise a public ruckus about how stupid MS is. But first, make sure Google and other major search engines aren't doing the same thing. Also see if this is happening to anyone else, or is it something peculiar to your site (perhaps something stupid going on in your .htaccess).




Why should any piece of software have to be hacked at to deal with something that Bing could sort out by refactoring how their spider works? As Phil rightly points out, you need to go to the Bing discussion groups and get further input.


This is a signature that appears on all my posts.  
IF YOU MAKE A POST REQUESTING HELP...please state the exact version
of osCommerce that you are using. THANKS

 
Get the latest current code (community-supported responsive 2.3.4.1BS Edge) here

 


Yes, you have to be alert to spot this one, and I'm pretty sure this is just BING, not Google, doing this. Nowt causing this at my end, and I trawled the net looking for others who may have reported this, but neither Bing nor osCommerce seems to have anything reported. I thought I'd post here and see if anyone else has spotted this one.

 

One thing - once you have a url referenced like this by BING, you'll just get more and more because of the way the catalog page number links pass the parameters on, i.e.:

 

/index.php?cPath=34_39&page=6&products_id=1084&products_id=990 goes to page 6.

The page links at the top/bottom then also include the rogue products_id, thus causing a compounding of the problem as Bing merrily chases all the links on the pages. Hum.

 

I don't think I'd get much joy from MS - they don't exactly provide accurate or complete information on how they believe their indexing / bots are supposed to work (neither does Google, for that matter).


http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2012/04/27/better-than-canonical-url-normalization.aspx

 

That's an interesting read. Basically saying "we're right, Google et al are wrong, deal with it".



 


Just a quick update:

 

Bing bot is still adding double products_id parameters to index.php, so I raised a technical support ticket with Bing product support. Awaiting results - still.....


Finally worked out what is causing this and applied a fix (hopefully). It's only apparent on sites with the Buy Now column set to true in product listings.

 

1. Bing isn't making up the double products_id parameters on the index.php.

 

2. For example a valid product listing url is:-

index.php?cPath=54

 

The code for the buy now button is:

<td align="right" style="width:15%"><span class="tdbLink"><a id="tdb10" href="http://www.mysite.com/index.php?cPath=54&sort=2a&action=buy_now&products_id=635">Buy Now</a></span><script type="text/javascript">$("#tdb10").button({icons:{primary:"ui-icon-cart"}}).addClass("ui-priority-secondary").parent().removeClass("tdbLink");</script></td>

 

Although the buy button creates this href, the code actually takes the parameters &action=buy_now&products_id=635 together and uses them to add the product to the cart and redirect the user to shopping_cart.php (if set in configuration). That is - we don't follow this url at all.

 

3. Bing scrapes the page and takes the buy now button code as a valid url to index/follow.

 

4. In BING Webmasters I can't ignore products_id parameters, as it's used on product_info.php and is therefore vital.

 

5. BING follows the buy now link and now has a valid url:

http://www.mysite.com/index.php?cPath=54&products_id=635 (it strips out the ignored url parameters like &sort, &action etc)

 

6. Now when BING scrapes this page it has a new url for the buy now button code:

http://www.mysite.com/index.php?cPath=54&products_id=635&sort=2a&action=buy_now&products_id=635

 

7. BING again follows the buy now link and now has a valid url:

http://www.mysite.com/index.php?cPath=54&products_id=635&products_id=635 (it strips out the ignored parameters like &sort, &action etc)

 

Of course there are many buy now buttons per product listing page, so we end up with combination products_id parameters like the ones I observed at the beginning of this ticket:- ie

/index.php?cPath=34_40&products_id=1534&products_id=539.
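
To make the compounding in steps 5 to 7 concrete, here is a toy PHP model of the loop (PHP 7 or later). It is an illustration only: buy_now_href and crawler_normalise are made-up names, the href is modelled on the Buy Now links quoted above rather than on the real osCommerce link builder, and the ignored-parameter list is just the &sort and &action pair mentioned in this post.

```php
<?php
// Parameters the crawler is assumed to ignore (as set in Bing Webmaster).
const IGNORED_BY_CRAWLER = ['sort', 'action'];

/** Model the listing page: Buy Now href = current query + sort/action/products_id. */
function buy_now_href(string $currentQuery, int $productId): string
{
    return $currentQuery . '&sort=2a&action=buy_now&products_id=' . $productId;
}

/** Model the crawler: drop ignored parameters, keep the rest in order, duplicates included. */
function crawler_normalise(string $query): string
{
    $kept = [];
    foreach (explode('&', $query) as $pair) {
        $name = explode('=', $pair, 2)[0];
        if (!in_array($name, IGNORED_BY_CRAWLER, true)) {
            $kept[] = $pair;
        }
    }
    return implode('&', $kept);
}

$query = 'cPath=54';                                   // valid listing url
$query = crawler_normalise(buy_now_href($query, 635)); // step 5
$query = crawler_normalise(buy_now_href($query, 635)); // step 7
echo $query, "\n"; // cPath=54&products_id=635&products_id=635
```

Following a different product's button on the second pass gives the mixed pairs seen in the first post, e.g. products_id=1534&products_id=539.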

 

 

FIX:-

/includes/functions/html_output.php

 

Find the tep_draw_button function (near the end of the file)

 

Change:

if ( ($params['type'] == 'button') && isset($link) ) {

$button .= '<a id="tdb' . $button_counter . '" href="' . $link . '"';

 

To:

 

if ( ($params['type'] == 'button') && isset($link) ) {

$button .= '<a rel="nofollow" id="tdb' . $button_counter . '" href="' . $link . '"';

 

 

Now BING (and any other content scraper/indexer) should stop following (buy) button created hrefs.

 

 

Der der.


Ok - so I finally got an official Bing / Microsoft Customer Support response about their bots indexing index.php with &products_id= parameters.

 

"Our product group informed us that if pages like http://my-site.com/index.php?cPath=34_39&products_id=983&page=8 is invalid, please return us 404 or other failure http code, instead of 200."

 

So I think they are implying that adding a rel="nofollow" to index.php / product_listing.php "buy button" anchor href - eg:-

 

( <td align="right" style="width:15%"><span class="tdbLink"><a rel="nofollow" id="tdb18" href="http://www.my-site.com/index.php?cPath=34_39&page=8&sort=2a&action=buy_now&products_id=1087">Buy Now</a></span><script type="text/javascript">$("#tdb18").button({icons:{primary:"ui-icon-cart"}}).addClass("ui-priority-secondary").parent().removeClass("tdbLink");</script></td> )

 

won't prevent BING bots from following and indexing and thus creating duff and duplicated index.php urls and applying ranking penalties.

 

Without re-engineering the BUY NOW button functionality (which I'm not in the mood to do just now), then I can only think a temporary fix would be to add a :-

 

RewriteCond %{QUERY_STRING}

RewriteRule .* - [G,L]

 

that traps any products_id= parameter strings in an index.php url.
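
As a stopgap, the same trap could probably be sketched in PHP instead of a rewrite rule. This is untested against a live shop: should_reject_index_request is a made-up name, the placement near the top of index.php is a guess, and it assumes products_id is only legitimate on index.php when it travels with an action parameter, as on the Buy Now links. It needs PHP 7 or later as written.

```php
<?php
/**
 * Decide whether a request for index.php should be refused.
 * Assumption: products_id on index.php is only valid alongside an
 * action (e.g. action=buy_now), as on the Buy Now links.
 */
function should_reject_index_request(array $get): bool
{
    return isset($get['products_id']) && !isset($get['action']);
}

// Hypothetical placement: near the top of index.php, before any output.
if (basename($_SERVER['SCRIPT_NAME'] ?? '') === 'index.php'
    && should_reject_index_request($_GET)) {
    http_response_code(410); // 410 Gone, the status Apache's [G] flag sends
    exit;
}
```

Leaving requests with an action alone matters here, otherwise the Buy Now buttons themselves would start returning 410.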

 

Can anyone with a little Apache .htaccess technical skill come up with the code, please? Pretty please.

 

Lee
