spooks

Remove & Prevent duplicate content with the canonical tag


set cPath to only current category id, so index.php?cPath=46_61 becomes index.php?cPath=61 in the canonical

 

 

Well, I had a moment, so I've written the code. In the function,

 

replace:

 

global $request_type;

 

with:

 

global $request_type, $current_category_id;

 

after:

$request_uri = preg_replace('/\?&/', '?', preg_replace($search, '', $request_uri )); 

 

add:

 

if ($current_category_id) $request_uri = preg_replace('/([&\/]*cPath[=\/]+)[0-9_]+/i', '${1}' . $current_category_id, $request_uri);

 

 

That should be all you need. ;)
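For anyone who wants to sanity-check the replacement, here's a quick standalone test (the example URI and the hard-coded category id are just illustrations; in the real function $current_category_id comes from the osCommerce globals):

```php
<?php
// Illustrative test: the regex should keep only the current category id.
$current_category_id = 61;
$request_uri = '/index.php?cPath=46_61&sort=2a';

$request_uri = preg_replace('/([&\/]*cPath[=\/]+)[0-9_]+/i',
                            '${1}' . $current_category_id, $request_uri);

echo $request_uri; // /index.php?cPath=61&sort=2a
```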


Sam

 

Remember, what you think I meant may not be what I thought I meant when I said it.

 

Contributions:

 

Auto Backup your Database, Easy way

 

Multi Images with Fancy Pop-ups, Easy way

 

Products in columns with multi buy etc etc

 

Disable any Category or Product, Easy way

 

Secure & Improve your account pages et al.


I am yet to be convinced there is any issue that needs addressing; the canonical does not say 'do not visit these pages', nor does it say 'do not follow links'. It just says 'this is the page that should appear in the index'. :)

 

That's fine, but if I wanted to do it, what would the if statement be?

 

I have this bit of code on my site for something I just want to show on my homepage. Could this be modified so that $currentpage matches the split (paginated) results pages?

 

$homepage = "/";

$currentpage = $_SERVER['REQUEST_URI'];

if ($homepage == $currentpage) {
    // ... homepage-only content ...
}



That's fine, but if I wanted to do it, what would the if statement be?

 

 

 

The obvious flag to use would be the page param in the query string.

 

But you have to realise this method is destructive. The advantage of the canonical is that any PageRank spread across the covered pages is accumulated onto the indicated page; with this method it is simply lost. Also, should the flag you use happen to cover all the pages crawled, no page in the group will get indexed, whereas with the canonical only one needs to be crawled.
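If you still want the meta-robots route rather than the canonical, a minimal sketch might look like this (my illustration, not part of the contribution; it assumes the pagination flag is the `page` query parameter, as suggested above):

```php
<?php
// Hypothetical sketch: mark split (paginated) result pages as noindex.
// 'follow' is kept so link equity can still flow through the page.
if (isset($_GET['page']) && (int)$_GET['page'] > 1) {
    echo '<meta name="robots" content="noindex, follow">' . "\n";
}
```

Bear in mind the caveat above: unlike the canonical, whatever rank those pages had is simply discarded.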


Sam

 



Here is an issue I found..

 

When a product has attributes and you then add the product to the cart, the URL is not converted by USU 5; the URL gets something like {5}55{9}20.

 

 

At that point the canonical is no longer the USU 5 URL; basically this creates a duplicate page.

 

Any ideas on how to fix this?


When a product has attributes and you then add the product to the cart, the URL is not converted by USU 5; the URL gets something like {5}55{9}20.

 

 

I'm sorry, I'm not clear on what you're saying; can you make yourself clearer please? I.e., is {5}55{9}20 a param? It's certainly not a complete URI!

 

I would also mention that none of your checkout pages should be indexed; have you set up your robots file?
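For reference, a typical robots.txt for hiding checkout and account pages might look like the following (the paths are assumptions; adjust them to your store's actual layout):

```text
User-agent: *
Disallow: /checkout
Disallow: /account
Disallow: /login.php
Disallow: /password_forgotten.php
```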


Sam

 



I noticed one small issue with this. On pages that have keywords that begin with a number, for example:

your-product-p-1136.html?ref=480&keywords=1-2-3%20your%20product&sort=4d&page=1

 

the canonical link is not created correctly; it looks like this:

your-product-p-1136.html?-2-3%20your%20product

 

I suspect that this code needs to be changed to deal with such issues:

$search[] = '/&*' . $value . '[=\/]+[\w%..\+]*\/?/i';

 

but I do not know how to do this.
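To illustrate the failure mode: the stock character class has no hyphen, so the match stops at the first `-` and only the leading digit of the keywords value is removed. One possible fix (my guess, not tested against the full contribution) is to add the hyphen to the class:

```php
<?php
$uri = 'your-product-p-1136.html?ref=480&keywords=1-2-3%20your%20product&sort=4d';

// Stock pattern: the class [\w%..\+] has no hyphen, so matching stops at '-'.
$stock = preg_replace('/&*keywords[=\/]+[\w%..\+]*\/?/i', '', $uri);

// Guessed fix: include \- in the class so the whole value is consumed.
$fixed = preg_replace('/&*keywords[=\/]+[\w%\.\-\+]*\/?/i', '', $uri);

echo $stock . "\n"; // leftover '-2-3%20your%20product' remains in the URI
echo $fixed . "\n"; // keywords param removed cleanly
```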

 

I would also like to share a helpful modification. This is what the multiple-page setup would look like with the switches to make it all work. If you replace this:

 

$page_remove_array = array( FILENAME_CATEGORIES => array('manufacturers_id', 'cPath'),

FILENAME_DEFAULT => array() );

 

with this (note: this includes the modifications that work best for me):

 

$page_remove_array = array( FILENAME_PRODUCT_INFO => array('manufacturers_id', 'cPath', 'reviews_id', 'keywords', 'gclid', 'filter_id', 'inc_subcat', 'pfrom', 'pto', 'dfrom', 'dto'),

FILENAME_DEFAULT => array('sort', 'filter_id'),

FILENAME_CATEGORIES => array('manufacturers_id', 'cPath', 'reviews_id', 'keywords', 'gclid', 'filter_id'),

FILENAME_PRODUCT_REVIEWS => array('manufacturers_id', 'cPath', 'keywords', 'gclid', 'filter_id')

);

 

It will handle a large variety of issues.

 

Last, but perhaps most important (to me anyway): I messed some things up when I first launched this, and my site was indexed with some bad code, which made my duplicate-content issues go from around 5,000 to about 8,000! I fixed the bad code within 24 hours and am now waiting for the 8,000 to start dropping. The final fix has been well tested and has only been live for about a day, but I thought I'd begin to see a drop by now and haven't. Am I just impatient, or could the bad code (which was basically making many of my bad links report to Google as canonical links) have caused long-term indexing issues?

 

Can anyone else share a time frame for how long it took them to clear up their indexing issues? FYI: my site gets crawled by Google fairly often, at least every few days.


Hi Sam, I'm hoping you might be able to help with a problem I have after installing this mod.

 

I have Header Tags SEO installed, so I added this code to the includes/header_tags.php file:

CanonicalLink( $xhtml = false, 'SSL' );

 

and the code in includes/functions/html_output.php.

 

Lastly, I added the following to my .htaccess file:

 

RewriteCond %{THE_REQUEST} !^POST 
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)index\.php(.*)\ HTTP/ [NC]
RewriteRule ^index\.php(.*)$ http:\/\/www.my-domain.com/$1 [R=301,L]

 

For some reason, since the install every category redirects me back to my home page (index.php).

 

I restored all of the files from the backups I took, but for some reason I still have the problem, even after clearing the cache on my laptops.

 

I'm seriously confused; if you have any thoughts on what's occurring it would be much appreciated.



RewriteCond %{THE_REQUEST} !^POST 
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)index\.php(.*)\ HTTP/ [NC]
RewriteRule ^index\.php(.*)$ http:\/\/www.my-domain.com/ [R=301,L]

 

There should be a $1 in there, not £0.68... don't know where that came from!


There are times when you wouldn't want search engines to index a web page, but how do you go about preventing it? There are a number of ways to make sure a page is not found by the search bots; using meta tags is one of them. Meta tags provide detailed instructions about the web page to the search engines.

 

To make sure a particular web page is not indexed, use the "noindex" meta tag, and to prevent bots from following links from the page, use the "nofollow" value; the tag goes between the <HEAD> and </HEAD> tags of your HTML.
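In practice the tag is usually written in lowercase inside the head; a generic example (output via PHP, since that's what the rest of this thread uses):

```php
<?php
// Generic example: ask robots not to index this page or follow its links.
echo '<meta name="robots" content="noindex, nofollow">' . "\n";
```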

 

 

Hope this added some value. Any suggestions are appreciated.

 

Eliza


Sam,

 

The modification causes a security vulnerability. Is it possible to sanitize the input or replace the characters? Below is the McAfee Secure report.

 

Description

The remote web application appears to be vulnerable to cross-site scripting (XSS).

 

The cross-site scripting attack is one of the most common, yet overlooked, security problems facing web developers today. A web site is vulnerable if it displays user-submitted content without sanitizing user input.

 

The target of cross-site scripting attacks is not the server itself, but the users of the server. By finding a page that does not properly sanitize user input the attacker submits client-side code to the server that will then be rendered by the client. It is important to note that websites that use SSL are just as vulnerable as websites that do not encrypt browser sessions.

 

The damage caused by such an attack can range from stealing session and cookie data from your customers to loading a virus payload onto their computer via browser.

General Solution

When accepting user input ensure that you are HTML encoding potentially malicious characters if you ever display the data back to the client.

 

Ensure that parameters and user input are sanitized by doing the following:

Remove < input and replace with &lt;

Remove > input and replace with &gt;

Remove ' input and replace with &#39;

Remove " input and replace with &quot;

Remove ) input and replace with &#41;

Remove ( input and replace with &#40;



Here is my canonical function; I changed it a bit so it passes PCI scans (the XSS issue), plus a few minor changes:

 

// remove duplicate content with canonical tag by Spooks 12/2009
function CanonicalLink( $xhtml = false , $ssl = 'SSL' ) {
global $request_type;
$rem_index = true; // Set to true to additionally remove index.php from the uri
$close_tag = ( false === $xhtml ? ' >' : ' />' );
$spage = '';
$domain = ( $request_type == 'SSL' && $ssl == 'SSL' ? HTTPS_SERVER : HTTP_SERVER ); // gets the base URI

// Find the file basename safely: PHP_SELF is unreliable, and SCRIPT_NAME can show the path to phpcgi
if ( array_key_exists( 'SCRIPT_NAME', $_SERVER ) && ( substr( basename( $_SERVER['SCRIPT_NAME'] ), -4, 4 ) == '.php' ) ) {
		$basefile = basename( $_SERVER['SCRIPT_NAME'] );
} elseif ( array_key_exists( 'PHP_SELF', $_SERVER )	&& ( substr( basename( $_SERVER['PHP_SELF'] ), -4, 4 ) == '.php' ) ) {
		$basefile = basename( $_SERVER['PHP_SELF'] );
} else {
// No base file so we have to return nothing
	return false;
}

// Don't produce canonicals for SSL pages that bots shouldn't see
$ignore_array = array( 'account', 'address', 'checkout', 'login', 'password', 'logoff' );
// partial match to ssl filenames

foreach ( $ignore_array as $value ) {
	$spage .= '(' . $value . ')|';
}
$spage = rtrim($spage,'|');	
if (preg_match("/$spage/", $basefile)) return false;

// REQUEST_URI usually doesn't exist on Windows servers ( sometimes ORIG_PATH_INFO doesn't either )
if ( array_key_exists( 'REQUEST_URI', $_SERVER ) ) {
		$request_uri = $_SERVER['REQUEST_URI'];
} elseif( array_key_exists( 'ORIG_PATH_INFO', $_SERVER ) ) {
		$request_uri = $_SERVER['ORIG_PATH_INFO'];
} else {
// we need to fail here as we have no REQUEST_URI and return no canonical link html
	return false;
}	

$remove_array = array( 'currency', 'language', 'main_page', 'page', 'sort', 'ref', 'affiliate_banner_id', 'max', 'gclid');	
// Add to this array any additional params you need to remove in the same format as the existing

$page_remove_array = array(	
	FILENAME_PRODUCT_INFO => array('manufacturers_id', 'cPath', 'reviews_id', 'keywords', 'gclid', 'filter_id', 'inc_subcat', 'pfrom', 'pto', 'dfrom', 'dto', 'fl'),
	FILENAME_DEFAULT => array('sort', 'filter_id', 'src', 'OVRAW', 'OVKEY', 'OVMTC', 'OVADID', 'OVKWID', 'ysmwa'),
				FILENAME_CATEGORIES => array('manufacturers_id', 'cPath', 'reviews_id', 'keywords', 'gclid', 'filter_id'),
				FILENAME_PRODUCT_REVIEWS => array('manufacturers_id', 'cPath', 'keywords', 'gclid', 'filter_id'),
				FILENAME_ADVANCED_SEARCH_RESULT => array('manufacturers_id', 'cPath', 'keywords', 'gclid', 'filter_id', 'x', 'y', 'inc_subcat', 'categories_id', 'pfrom', 'pto', 'dto', 'dfrom'),
				FILENAME_ADVANCED_SEARCH => array('manufacturers_id', 'cPath', 'keywords', 'gclid', 'filter_id')
						);

// remove page specific params, should be in same format as previous, given is manufacturers_id & cPath 
// have to be removed in product_info.php only

if (is_array($page_remove_array[$basefile])) $remove_array = array_merge($remove_array, $page_remove_array[$basefile]);

foreach ( $remove_array as $value ) {
		$search[] = '/&*' . $value . '[=\/]+[\-\]+[\w%..\+]*\/?/i';
}


$search[] = ('/&*osCsid.*/');
	$search[] = ('/\?\z/');

if ($rem_index) $search[] = ('/index.html\/*/');	
$request_uri = preg_replace('/\?&/', '?', preg_replace($search, '', $request_uri )); 	


// XSS issue resolved here: encode the characters flagged by the scan


$request_uri = str_replace("<", "&lt;", $request_uri);
$request_uri = str_replace(">", "&gt;", $request_uri);
$request_uri = str_replace("'", "&#39;", $request_uri);
$request_uri = str_replace("\"", "&quot;", $request_uri);
$request_uri = str_replace(")", "&#41;", $request_uri);
$request_uri = str_replace("(", "&#40;", $request_uri);


// added this in for home page issues; modify if you do not use a sub folder or if your cart system's folder has a different name

if (($request_uri == '/catalog/') || ($request_uri == '/catalog/index.php')){
echo '<link rel="canonical" href="' . $domain . '"' . $close_tag . PHP_EOL;
}else{
echo '<link rel="canonical" href="' . $domain . $request_uri . '"' . $close_tag . PHP_EOL; 
}

} 
///

 

 

Nice function, by the way. I would highly recommend this be added to all shops to remove duplicate-content issues.

 

cheers


Peter McGrath

-----------------------------

See my Profile (click here) for more information and to contact me for professional osCommerce support that includes SEO development, custom development and security implementation


Hello, when I open Google Webmaster Tools I find thousands of duplicate meta tags; some pages come up with 30 different URLs for the same content page, and the codes in the URLs are very strange. I don't know where they come from. The URLs below are all for the same page. The SEO URLs I am using are Ultimate SEO URLs, with Header Tags SEO and canonical_links. Does anybody know why there are so many strange URLs for the same content page? Please help.

 

domain/shoes-c-33.html?%252525252525253Blanguage=en

domain/shoes-c-33.html?%2525252525253Bcurrency=GBP?cPath

domain/shoes-c-33.html?;language=en%3F%253Blanguage%3Den?cPath

domain/shoes-c-33.html?page=1&%25253Blanguage=en%253FcPath%253D33%3FcPath&sort=2a

domain/shoes-c-33.html?page=11&%25253Bcurrency=GBP%252525253FcPath%252525253D33%25253FcPath%25253D33&sort=2a

domain/shoes-c-33.html?page=11&;currency=CAD&;amp;language=en%3FcPath%3D33&sort=2a

domain/shoes-c-33.html?page=2&;language=en&sort=2a

domain/shoes-c-33.html?page=2&sort=2a&%3Bsort=2a%2525253FcPath%2525253D33%253FcPath%253D33

domain/shoes-c-33.html?page=3&%25253Bcurrency=GBP&%25253Bamp%25253Blanguage=en%3FcPath%3D33&sort=2a

domain/shoes-c-33.html?page=3&%25253Blanguage=en%253FcPath%253D33%3FcPath&sort=2a

domain/shoes-c-33.html?page=3&sort=2a&%3Bsort=2a%2525253FcPath%2525253D33%253FcPath%253D33

domain/shoes-c-33.html?page=4&%25253Blanguage=en%253FcPath%253D33%3FcPath&sort=2a

domain/shoes-c-33.html?page=4&sort=2a&%2525253Bsort=2a

domain/shoes-c-33.html?page=5&%25253Blanguage=en%253FcPath%253D33%3FcPath&sort=2a

domain/shoes-c-33.html?page=6&%25253Bcurrency=GBP%252525253FcPath%252525253D33%25253FcPath%25253D33&sort=2a

domain/shoes-c-33.html?page=6&%25253Blanguage=en%253FcPath%253D33%3FcPath&sort=2a

domain/shoes-c-33.html?page=7&;currency=CAD&;amp;language=en%3FcPath%3D33&sort=2a

domain/shoes-c-33.html?page=9&%25253Bcurrency=GBP%252525253FcPath%252525253D33%25253FcPath%25253D33&sort=2a

domain/shoes-c-33.html?page=9&;currency=CAD&;amp;language=en%3FcPath%3D33&sort=2a


Is anyone able to get Sam's Duplicate Content with Canonical Tag contribution to work with Jack's Header Tags SEO contribution?

 

When I first installed the canonical contribution, it returned a completely blank page until I removed the following from within the <head> tag on the index page:

<?php
/*** Begin Header Tags SEO ***/
if (file_exists(DIR_WS_INCLUDES . 'header_tags.php')) {
  require(DIR_WS_INCLUDES . 'header_tags.php');
} else {
?>
<meta http-equiv="content-type" content="text/html; charset=<?php echo CHARSET; ?>">
<title><?php echo TITLE; ?></title>
<?php }
/*** End Header Tags SEO ***/
?>

 

Sam's add-on is now working great (no duplicates), but it still returns a blank page when I add the above code back in.

 

what could be wrong?


Hey cannuck1964 / Peter:

 

I noticed one change in your code compared to the original code:

 

Original:

$search[] = '/&*' . $value . '[=\/]+[\w%..\+]*\/?/i';

 

Your Code:

$search[] = '/&*' . $value . '[=\/]+[\-\]+[\w%..\+]*\/?/i';

 

Just wondering, what is the difference?

 

Thanks



 

I was having issues with the - (minus sign) within the URLs.

 

There is also additional code to prevent the XSS, replacing specific characters with their encoded representations...

 

cheers


Peter McGrath




Hey,

 

Got it. I tried and tested it. Seems like an issue that needs to be addressed. Thanks for posting that!

 

Also, thanks for XSS code.


Hi there,

 

First up - excellent contribution!

 

I am trying to turn off error suppression for my site, and I am getting the following error message from Canonical Tags:

 

Notice: Undefined index: xxxxx.php

 

where xxxxx is the page I am viewing, e.g. shopping_cart.php.

 

It also points me to the line for the error - which is this one:

 

if (is_array($page_remove_array[$basefile])) $remove_array = array_merge($remove_array, $page_remove_array[$basefile]);

 

So one of these array keys is undefined, but which one, and how do I define it?

 

Any help much appreciated.


One basic question regarding the installation procedure.

 

Step 2b (if you do have a meta tag contribution installed) says:

 

open the meta tag file in the catalog/includes folder (ie 'header_tags.php')

 

I can't find header_tags.php in my catalog/includes folder.

How do I find the right file for this step?

This is what is in my catalog/includes folder.

 

application_bottom.php

application_top.php

applicatoin_top.php_TOM

column_left.php

column_right.php

configure.php

counter.php

database_tables.php

filenames.php

footer.php

form_check.js.php

general.js

header.php

spiders.txt

tld.txt

 

Thanks.


In case anyone else is trying to turn off error suppression, change this:

 

 

if (is_array($page_remove_array[$basefile])) $remove_array = array_merge($remove_array, $page_remove_array[$basefile]);

 

to this:

 

if (isset($page_remove_array[$basefile]) && is_array($page_remove_array[$basefile])) $remove_array = array_merge($remove_array, $page_remove_array[$basefile]);

 

Regards,


I installed this add-on in a 2.3.1 shop I am working on. My meta-tag add-on for the shop is easy_meta_1_7a, which has a modification for 2.3.1.

 

I followed Sam's install instructions as given for the Remove & Prevent Duplicate Content with the Canonical Tag add-on, and even though Sam's instructions don't cover 2.3.1, the add-on appears to be working as it is supposed to.

 

I tested by running through many shop pages, checking the canonical URL on each one and watching for glitches and bugs along the way, including adding some dummy customers and doing some trial checkouts.

 

After all that, the canonicals all appear to be correctly generated, and I noted no bugs or glitches.

 

All that being said, if someone with more extensive knowledge of the code can advise of any unknown factors I should be aware of when installing this on a 2.3.1 shop, that would be more than welcome.

 

Thanks


I am not a professional webmaster or PHP coder by background or training but I will try to help as best I can.

I remember what it was like when I first started with osC. It can be overwhelming.

However, I strongly recommend considering hiring a professional for extensive site modifications, site cleaning, etc.

There are several good pros here on osCommerce. Look around, you'll figure out who they are.


Hi there

 

I installed this contribution, and although I don't see how to check that it's working, it did appear to change the results in Google Webmaster. Unfortunately I have a site in two languages, and it's also diverting FR to EN, so I have even more duplicates now, as Google looks at the tags for the FR and EN links and decides they are the same. I have Ultimate SEO and Header Tags installed, four currencies and two languages; I have no problems with any of that, just the way Google views it as duplicate content.

 

Is there any way of using this contribution to sort this out, or should I be looking elsewhere? I will say it was one of the easiest I have ever installed; I wish they were all like this.

 

Best Regards

 

 

David




Hello,

 

I've just installed this great contribution, but I don't know if it's working correctly. The <link rel="canonical" href="http://www.example.com/canonical.html"/> is also visible in the code if I go directly to the page without params (the canonical page itself): http://www.example.com/canonical.html.

 

Thank you for your comment.

 

 

Petr


I find that the <link rel="canonical" href="http://www.example.com/canonical.html"/> should only be on the duplicate pages (some info: http://www.google.com/support/webmasters/bin/answer.py?answer=139394). So I tried installing KissMT_Dynamic_Meta_Tags, which also uses rel="canonical", but the result is the same: I still get <link rel="canonical" href="http://www.example.com/canonical.html"/> on every page, including the page without params.

 

 



I have read the install instructions; however, I'm a little confused. I have unique header tags for each page, but I do not have a catalog/includes/header_tags.php file. I do, however, have a catalog/includes/meta_tags.php and a catalog/includes/classes/seo.class.php file, which includes references to USE_SEO_HEADER_TAGS. Are these comparable to header_tags.php, or am I totally off base?

 

In addition, the last update to this contribution is dated 23 Mar 2010. So, for the catalog/includes/functions/html_output.php code, should I use the code included in the contribution, or the code displayed in post #111 of this thread (because it is more recent)?

 

Thanks.

