Archived

This topic is now archived and is closed to further replies.

FWR Media

Duplicate content for osC sort functions

Check Google webmaster tools because Google has started showing duplicate titles/descriptions due to the standard oscommerce sort and paging functions.

 

Example shown with SEO URLs.

 

Product URL

 

http://www.mysite.com/a-great-product-c-32.html

 

Duplicate titles/descriptions

 

http://www.mysite.com/a-great-product-c-32...t=2a&page=2

http://www.mysite.com/a-great-product-c-32...t=2d&page=2

http://www.mysite.com/a-great-product-c-32...t=3a&page=2

http://www.mysite.com/a-great-product-c-32...t=3d&page=2

 

Then when you consider the number of pages there may be it gets worse from there.

 

I've had an answer from webmaster tools and they seem to recommend the use of rel="nofollow".

The following is my initial suggestion, but it is untested:

 

1) includes/classes/split_page_results.php

 

Find all instances of tep_href_link

 

in each link find ..

 

class="pageResults"

 

 

change to ..

 

 

class="pageResults" rel="nofollow"

 

 

2) includes/functions/general.php

 

Find function tep_create_sort_heading

 

find in the function ..

 

 

title="

 

 

Change to ..

 

 

rel="nofollow" title="

 

 

3) includes/boxes/languages.php

 

Find ..

 

language=' . $key, $request_type) . '">

 

 

Replace with ..

 

 

language=' . $key, $request_type) . '" rel="nofollow">

 

It would be good if others added or changed this over time as we find out how it works and/or which additions are needed.


Thank you for your post! How did you come to solve this error? Will nofollow bring down search ranks? Has anyone else tested this out?

 

Thanks

 

- 32 Degrees



Be careful when using the rel=nofollow attribute on split results.

If you have not submitted a sitemap to Google and don't offer a sitemap for other search engines, you can prevent some products from being indexed. If Google does not follow the link to page x of category A, then Google and any other search engine that honours rel=nofollow may never discover the products on page x, unless those products are also listed somewhere else.

So the recommendation is to use a sitemap if you use this rel=nofollow technique. Another way would be to modify the title and description slightly.

For example, where you have (in every catalog/ file that generates split results):

 

<title><?php echo TITLE; ?></title>

 

use something like

 

<title><?php echo TITLE . (isset($HTTP_GET_VARS['page']) ? ' - page ' . (int)$HTTP_GET_VARS['page'] : ''); ?></title>

 

Use something like the above example in your metatags generator if you are using one.
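
As a rough sketch of the same idea in reusable form, the suffix logic can live in a small helper that a title or meta tag generator calls. The function name page_title_suffix is made up for this example and is not part of osCommerce; it also assumes that page 1 (or a missing or non-numeric page value) should keep the plain title.

```php
<?php
// Hypothetical helper (not part of osCommerce): build a title that differs
// on every split-results page, so page 2, 3, ... do not share one title.
function page_title_suffix($base_title, $get_vars) {
    // A missing or non-numeric "page" value is treated as the first page
    $page = isset($get_vars['page']) ? (int)$get_vars['page'] : 1;
    if ($page <= 1) {
        return $base_title;
    }
    return $base_title . ' - page ' . $page;
}

echo page_title_suffix('My Store', array('page' => '3')), "\n"; // My Store - page 3
echo page_title_suffix('My Store', array()), "\n";              // My Store
```

The same helper could be applied to the meta description string, which is the other field reported as duplicated.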

 

Another tip.

 

Split pages also mean that page 2, for example, links back to page 1 with an explicit page parameter, which creates duplicated content because:

 

http://example.com/anyting

and

http://example.com/anyting?page=1

 

show the very same content. To solve this, back up and open your application_top.php and add the following before the closing ?> at the end of the file:

 

// Redirect a request for ?page=1 to the same URL without the "page" parameter, to avoid duplicated content.
if (isset($HTTP_GET_VARS['page']) && $HTTP_GET_VARS['page'] == '1') {
  header('HTTP/1.1 301 Moved Permanently');
  header('Location: ' . tep_href_link(basename($PHP_SELF), tep_get_all_get_params(array('page'))));
  exit; // stop here so the rest of the page is not rendered after the redirect headers
}

 

I am using this on my store and it works perfectly. I don't use the SEO contribution, but it should work whenever a page request arrives with the page=1 parameter in the URL.

Also, if you are using a sitemap, remove all the URLs with the page=1 parameter. Most of the sitemap generators for osCommerce don't include that kind of URL.

This mod still allows links with page=1, but tells user agents that the page has moved permanently, so spiders will not index it twice.

 

Regards,


Hey!!... I still need help with this http://forums.oscommerce.com/index.php?showtopic=309208. Please, take a look on it.


I don't agree

 

rel=nofollow is no danger whatsoever in split results. Google will find the real links; in fact it has already had to see the sort functions, and if it hadn't you wouldn't be seeing duplicate titles/descriptions.

Point two is dangerous imo, as Google does not appreciate being redirected every time it visits; you could end up with the "too many redirects" problem. The valid question "why the hell am I being redirected when the link still exists" springs to mind.

 

The limitation of the rel=nofollow method is that it will not remove already indexed pages that are listed as duplicates.


Another method likely to work is the following which also covers more exclusions.

 

The following array or similar should be available.

 

$spiderNoFollow = array('sort', 'page', 'language', 'currency');

 

The following code would go in the <head></head> of the page ..

 

 

<?php
if( (true === $spider_flag) && spiderNoFollow($spiderNoFollow) ) {
echo '<meta name="ROBOTS" content="NOINDEX, FOLLOW">';
}
?>

 

 

The following function could e.g. go into includes/functions/general.php

 

function spiderNoFollow($spiderNoFollow) {
  // true as soon as any of the listed parameters is present in the query string
  foreach ($spiderNoFollow as $value) {
    if (isset($_GET[$value])) {
      return true;
    }
  }
  return false;
}
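
To make the behaviour easier to try outside a live store, here is a minimal sketch of the same check with the query parameters and the spider flag passed in as arguments. The function name robots_meta_for is invented for this example, and the sample parameter values are assumptions.

```php
<?php
// Hypothetical variant of the check above: the request parameters are an
// argument, so it can be exercised without a real request or spider.
function robots_meta_for($get_vars, $excluded, $is_spider) {
    if ($is_spider !== true) {
        return ''; // normal visitors get no extra meta tag
    }
    foreach ($excluded as $name) {
        if (isset($get_vars[$name])) {
            // sort/page/language/currency variant: keep it out of the index
            return '<meta name="ROBOTS" content="NOINDEX, FOLLOW">';
        }
    }
    return '';
}

$excluded = array('sort', 'page', 'language', 'currency');
echo robots_meta_for(array('cPath' => '32', 'sort' => '2a'), $excluded, true), "\n";
echo robots_meta_for(array('cPath' => '32'), $excluded, true), "\n";
```

The first call prints the NOINDEX, FOLLOW tag and the second prints an empty line, since the plain category URL should stay indexable.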


Another nice easy one if you use ULTIMATE SEO URLS is putting the following in robots.txt

 

Disallow: /*.html?

 

This blocks Google from any URL that has a query string after an SEO URL (that is, one ending in .html).

 

The only valid item being banned (to my knowledge) is the reviews_id
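
To see which URLs a rule like this would catch, the wildcard can be approximated in a few lines. This is only a sketch of the "*" matching that Google documents for robots.txt, not a full parser (it ignores the "$" end anchor, Allow rules and User-agent groups), and the function name rule_matches is invented for the example.

```php
<?php
// Rough approximation of robots.txt wildcard matching: "*" matches any run
// of characters and the rule is anchored at the start of the path.
function rule_matches($rule, $path_and_query) {
    $regex = '';
    foreach (explode('*', $rule) as $i => $piece) {
        if ($i > 0) {
            $regex .= '.*';
        }
        $regex .= preg_quote($piece, '#'); // escapes "." and "?" literally
    }
    return preg_match('#^' . $regex . '#', $path_and_query) === 1;
}

$rule = '/*.html?';
var_dump(rule_matches($rule, '/a-great-product-c-32.html?sort=2a&page=2')); // bool(true)
var_dump(rule_matches($rule, '/a-great-product-c-32.html'));                // bool(false)
```

The plain SEO URL stays crawlable, while any .html URL followed by a query string is caught, which is also why a reviews_id parameter on an SEO URL would be blocked.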


FWR, I understand what you mean, but I have been using the redirection approach for months and no problem has been reported in Google Webmaster Tools. Furthermore, as far as I can remember, "too many redirects" is not an error message, just a warning, so it doesn't affect your positioning in any way. (If I'm wrong, please let me know.) I agree with you that this is not the best method, but I couldn't figure out how to rewrite the &page=1 link to the same version without the page parameter (if you know how, please let me know).

On the other hand, I do think the rel=nofollow approach could be problematic if you are not using a sitemap. Think of a new category with 5 pages of new products, and assume that the new products are not listed elsewhere. If the rel=nofollow attribute is added to the links to pages 2-5, search engines will be able to find just one page of the new products, leaving 4 pages of new products unindexed. But if you have a sitemap, search engines will be able to find your products without any problem.

Furthermore, if you use just the rel=nofollow attribute on your website, a page can still be indexed if another site links to your &page=X pages (if this happens, though, search engines will of course be able to find the products on that page), and you will still experience "duplicated content" issues. I agree with FWR that a better approach to prevent your pages from being indexed is the robots.txt file or the robots meta tag.

However, I still think that the best approach to solve this, without any potential problems or side effects, is to modify the title and description dynamically when the URL has the "page" parameter.

 

Regards,



Hi.

 

About the page=1 issue: I tried to modify the generation of the link so that when the value of page is 1, the page parameter is not printed in the link. As I couldn't do that, a few months ago I used the redirection method I posted above, and it helped me a lot.

Now I have realized that, even if I can't avoid sending page=1 to the tep_href_link() function, I can modify the link inside tep_href_link() before it is returned. This is how it looks for now:

 

In catalog/includes/functions/html_output.php:

 

just before the ending "return $link;" statement, add the following:

 


// Strip "page=1" from the generated link; any other page value is left alone.
$query_pos = strpos($link, '?');
if ($query_pos !== false && strpos($link, 'page=1') !== false) {
	$params_array = explode('&', substr($link, $query_pos + 1));
	$other_params = array();
	foreach ($params_array as $param) {
		// keep everything except an exact page=1 (page=10, page=12, ... must survive)
		if ($param != 'page=1') {
			$other_params[] = $param;
		}
	}

	// rebuild the link without the page=1 parameter
	$link = substr($link, 0, $query_pos);
	if (sizeof($other_params)) {
		$link .= '?' . implode('&', $other_params);
	}
}
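
One thing worth checking with a change like this is that only an exact page=1 is dropped, because a link with page=10 or page=12 also contains the text "page=1". Below is a small standalone sketch for trying that out. The function name strip_first_page_param and the example URLs are made up for illustration, and it assumes plain "&" separators in the link.

```php
<?php
// Hypothetical standalone version of the link clean-up, for quick testing.
function strip_first_page_param($link) {
    $query_pos = strpos($link, '?');
    if ($query_pos === false) {
        return $link; // no query string, nothing to strip
    }
    $kept = array();
    foreach (explode('&', substr($link, $query_pos + 1)) as $param) {
        // drop only an exact "page=1"; "page=10" etc. are kept
        if ($param !== '' && $param !== 'page=1') {
            $kept[] = $param;
        }
    }
    $base = substr($link, 0, $query_pos);
    return $kept ? $base . '?' . implode('&', $kept) : $base;
}

echo strip_first_page_param('http://www.mysite.com/index.php?cPath=32&page=1'), "\n";
echo strip_first_page_param('http://www.mysite.com/index.php?cPath=32&page=10'), "\n";
```

The first call prints the link without the page parameter; the second prints the link unchanged, so deeper pages keep their own URLs.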

 

I've tested it just a little bit, and it worked as expected.

What do you think?

Any improvement to this is much appreciated.

 

Regards,



The rel=nofollow is not a system that should be used on an e-comm site.

 

Put in a very simple way, how Google looks at a rel=nofollow is that you (the webmaster, aka shop owner) don't "trust" the page (that you are linking to). There is some more about this on one of the official Google blogs.

It's my belief that no webmaster should be saying that they do not trust one of their own site's webpages...


Help shape the future of Phoenix; join the Phoenix Club

I don't know why you are so focussed on page=1; imo ALL the sort functions are irrelevant to bots, not just page=1.

> The rel=nofollow is not a system that should be used on an e-comm site.
>
> Put in a very simple way, how Google look at a rel=nofollow is that you (the webmaster aka shop owner) don't "trust" the page (that you are linking to). There is some more about this on one of the official Google blogs.
>
> It's my belief that no webmaster should be saying that they do not trust one of their own sites webpages...

 

Well that IS interesting because it was Google who recommended it.

 

To my (current) understanding it is no different to NOFOLLOW in the meta, but I'm always interested to hear differing views.

 

In fact Cutts has suggested the use of rel=nofollow to not get penalised for links on a link farm .. perhaps that is what you mean.

 

Do you have the link?


Here's a link you may find interesting burt (more the vBSEO stuff than Cutts) and please don't stop the analysis, we all need to get together on this one.

 

http://www.vbseo.com/f104/matt-cutts-talks...-nofollow-5017/


Cutts' blog would be the one most likely.

 

In my "enforced" layoff of osCommerce, I spent 3 years in the "dark side" so that was pretty good reading.

 

edit: you edited and replied yourself whilst i was trying to find where I read it.

 

Yeah, so I'm not 100% sure if this would be the right way or not. I think a better way is to detect the spider and the url as you posted above. nofollow is such an uncertain beast.


> I don't know why you are so focussed on page=1 imo ALL the sort functions are irrelevant to bots not just page=1

So simple: I disabled all sort methods on my site :lol: . And, if you are so worried about having duplicated content, this page=1 issue can give you a LOT of duplicated content.

 

> Well that IS interesting because it was Google who recommended it.
>
> To my (current) understanding it is no different to NOFOLLOW in the meta, but I'm always interested to hear differing views.
>
> In fact Cutts has suggested the use of rel=nofollow to not get penalised for links on a link farm .. perhaps that is what you mean.
>
> Do you have the link?

 

There is a lot of difference between rel=nofollow and the NOFOLLOW meta tag.

1.- (obvious) rel=nofollow affects a single link, while the meta tag affects all the links on the page.

2.- Neither rel=nofollow nor the NOFOLLOW meta tag prevents your page from being indexed. rel=nofollow tells Google not to follow the link to "pageX", for example. But when another page or site posts a link to pageX (without rel=nofollow), Google will discover the page anyway, so the page will be indexed. The NOFOLLOW meta tag likewise allows search engines to index the page itself. You could use the NOINDEX meta tag or the robots.txt file to prevent a page from being indexed.

3.- rel=nofollow was introduced by Google just to prevent abuse in transferring PageRank to other sites. When you put rel=nofollow on a link, you only tell Google "Hey Google, I don't trust the following link, so please don't penalize me if it is not a good one and, since I don't trust it, don't transfer PageRank to it". In this sense, it seems that Google still indexes the linked page, but doesn't transfer any PageRank. The NOFOLLOW meta tag instructs search engines not to follow any of the links on the page, but the linked pages can still be indexed if search engines discover them through another link.

I agree with burt that it could be counterproductive to link to my own site using rel=nofollow. You don't want to tell Google that you don't trust another page of your own site.

In short, if you want to prevent some of your pages from being indexed, use the NOINDEX meta tag or the robots.txt file.

 

Regards,



No sorry don't agree

 

Google just say that "ideally" rel=nofollow would mean that you have no trust, but due to the inability of bots to distinguish between a good page and its e.g. sort functions, it is still a valid method of excluding a page from the index, and I can see no detrimental effect on the core page without the sort functions.

 

Also re: NOFOLLOW that was just an example .. for the sort pages in osc as I stated at the very beginning one would use ..

<meta name="ROBOTS" content="NOINDEX, FOLLOW">

> Also re: NOFOLLOW that was just an example .. for the sort pages in osc as I stated at the very beginning one would use ..
>
> <meta name="ROBOTS" content="NOINDEX, FOLLOW">

Sorry. I probably didn't understand you well.

> No sorry don't agree
>
> Google just say that "ideally" rel=nofollow would mean that you have no trust, but due to the inability of bots to destinguish between a good page and it's e.g sort functions it is still a valid method of excluding a page from index, and I can see no detrimental effect to the core page without the sort functions

 

Maybe you're right about the detrimental effect. This is something I have not tested yet; it is just what I think.

However, I insist that rel=nofollow will not prevent user agents from indexing the target page. Just try it!

Here is another way to show what I mean. When you are about to send a "Remove URL" request from Google's Webmaster Tools, Google says that you need to do one of the following to remove the URL from the search results:

A. Make the URL return an HTTP 404 status code (not found).

B. Use a NOINDEX meta tag.

C. Use the robots.txt file to block that URL.

As you can see for yourself in Google's Webmaster Tools, it is not enough to just add rel=nofollow in order to block content from search engines. In other words, a URL cannot be removed from Google (and other search engines) just by adding a rel=nofollow attribute to the links that point to the page you want to remove.

 

Regards,



This thread has never stated that rel=nofollow removes already indexed pages from SE indexes.


But you stated that the use of rel=nofollow would prevent duplicated content. I tried to show everyone that such an assumption is not correct, whether the page is already indexed or not. Maybe the last example was not the best (but it illustrates that rel=nofollow does not block search engines from indexing a page). For sure, the best example will be your pages being indexed even though you use rel=nofollow.

 

It was an interesting discussion.

 

Regards,



Well, whichever way rel=nofollow works or doesn't work, it was never going to be the best solution, as it would never stop others linking to the page and therefore the page being indexed via other means.

Let's move the emphasis away from this tag and concentrate on the other methods.

No one has commented on the other methods mentioned (my personal favourite being the $spider_flag check in the page meta).


Apologies burt I missed that one.


Simply put, which is the best way? I have 54 duplicates from google and I need to fix this ASAP.


> I tried the original way and had no results...

What do you mean by "no results"? What results are you looking for?

The idea of the code is for Google to not index those links; it will not remove the already indexed links.
