Google Duplicate Content with Strange cPath Variable Indexed


This topic has been archived. This means that you cannot reply to this topic.
28 replies to this topic

#1   clustersolutions

clustersolutions
  • Members
  • 71 posts

Posted 16 November 2011 - 17:17

In our Google duplicate content report we can see that Google has indexed some strange cPath variables, e.g. www.xyz.com/abc.html?cPath=24_0_21, www.xyz.com/abc.html?cPath=23_0_53, etc. The problem is that we don't use subcategories, so I have no idea where the bot found these links and cPath values (unless they come from external sites). Does anyone have an idea and wouldn't mind sharing their thoughts? Thanks!

Tim

#2   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 16 November 2011 - 17:25

I have the same problem using the USU5 Pro contrib. It seems that Google started to index these URLs, and unfortunately the osCommerce code does not return a 404 status for them. I think osCommerce needs to check in its code whether the cPath value is correct, and return a 404 error if it is not.

#3   clustersolutions

clustersolutions
  • Members
  • 71 posts

Posted 16 November 2011 - 18:11

USU5 Pro, is that Ultimate SEO 5 by Chemo? We installed that years ago, and after so many other packages since then we just can't pinpoint the cause anymore. If USU5 is definitely the cause, it may be a better idea to fix it there. For now we have added a checker that does a 301 redirect if a product and its cPath don't match.

#4   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 16 November 2011 - 19:47

It is not only a USU5 issue; it is also about how osCommerce checks whether a category has the right URL. For now I'm stopping these URLs from being indexed with robots.txt.

#5   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 17 November 2011 - 13:36

Opened a bug issue at http://forums.oscomm...r&showissue=391

Hope it will be fixed.

#6   clustersolutions

clustersolutions
  • Members
  • 71 posts

Posted 17 November 2011 - 17:19

Well, if anyone can help pinpoint the root of the bug, I can help get it fixed. As with all open source, sometimes a bandage is a quick fix and that is good enough. The subroutine below checks whether the cPath parameter in the URL is valid by comparing it with the category paths stored in the system. We have a package that validates our SEO URLs and we perform the check there; if it returns false, we do a 301 redirect to the URL without the cPath variable. Also look into rel="canonical", which we use where it is the better solution. Hope this is helpful.

/*
Copyright © 2011 clustersolutions.net
Released under the GNU General Public License.
Please give credit where credit is due.
*/
// Validate URL cPath Parameter
// Returns true when the cPath in the URL matches one of the product's real
// category paths, or when there is nothing to check.
function tep_validate_url_cpath() {
  global $HTTP_GET_VARS, $products_id;
  if (isset($HTTP_GET_VARS['cPath']) && tep_not_null($products_id)) {
    $bb = array(); // every valid cPath string for this product
    $prod_cat_check_query = tep_db_query("select categories_id from " . TABLE_PRODUCTS_TO_CATEGORIES . " where products_id = " . (int)$products_id);
    while ($prod_cat_check = tep_db_fetch_array($prod_cat_check_query)) {
      $aa = array(); // category IDs from the product's category up to the top level
      $path_check = array('parent_id' => $prod_cat_check['categories_id']);
      do {
        array_push($aa, $path_check['parent_id']);
        $path_check_query = tep_db_query("select parent_id from " . TABLE_CATEGORIES . " where categories_id = " . (int)$path_check['parent_id']);
        $path_check = tep_db_fetch_array($path_check_query);
      } while ($path_check && $path_check['parent_id'] != 0);
      array_push($bb, implode('_', array_reverse($aa)));
    }
    return in_array($HTTP_GET_VARS['cPath'], $bb);
  } else {
    return true;
  }
}
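
For the redirect step described above (sending the visitor to the same URL without the cPath variable), a minimal sketch could look like the following. The helper name strip_cpath_param() is hypothetical and not part of osCommerce or of the package mentioned in the post; it only uses standard PHP URL functions.

```php
<?php
// Hypothetical helper (not an osCommerce function): remove the cPath
// parameter from a URL and keep every other part of it.
function strip_cpath_param(string $url): string {
    $parts = parse_url($url);
    if ($parts === false) {
        return $url; // leave a malformed URL untouched
    }
    $query = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
        unset($query['cPath']);
    }
    $result = '';
    if (isset($parts['scheme'])) {
        $result .= $parts['scheme'] . '://';
    }
    if (isset($parts['host'])) {
        $result .= $parts['host'];
    }
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $parts['path'] ?? '';
    if ($query) {
        $result .= '?' . http_build_query($query);
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

// Assumed wiring inside a page request, when the cPath check fails:
// header('Location: ' . strip_cpath_param($current_url), true, 301);
```

Only the cPath parameter is dropped, so other query parameters survive the redirect.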

#7   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 27 November 2011 - 20:05

I think that modifying the function tep_parse_category_path in general.php to check whether the URL is valid could resolve the issue, but I'm unable to find a way to do the check.

#8   clustersolutions

clustersolutions
  • Members
  • 71 posts

Posted 03 December 2011 - 08:04

I probably wouldn't do it there, as that function just parses the cPath variable and returns it as an array. I would do it where the URL check is done. You should probably also have the SEO URL validation contrib installed, since going without it causes SEO problems too. We did that way back and it was beneficial. Good luck... tim

#9   graith

graith
  • Members
  • 61 posts

Posted 24 December 2011 - 07:53

Assuming abc.html is actually providing the category information, then links with cPath parameters are probably redundant and will be viewed by Google as a duplicate page. You should try using the CANONICAL link:

http://googlewebmast...-canonical.html

which will allow Google to index only the correct pages
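
As a rough illustration of the canonical link, the tag goes in the page head and points at the one URL that should be indexed. The helper name canonical_link_tag() is hypothetical, and the URL is the placeholder from the first post.

```php
<?php
// Hypothetical helper: build the canonical <link> tag for the page <head>.
function canonical_link_tag(string $canonical_url): string {
    return '<link rel="canonical" href="'
        . htmlspecialchars($canonical_url, ENT_QUOTES, 'UTF-8')
        . '" />';
}

// Every variant such as abc.html?cPath=24_0_21 would emit the same tag,
// pointing at the URL without the cPath parameter.
echo canonical_link_tag('http://www.xyz.com/abc.html'), "\n";
```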

#10   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 12 January 2012 - 17:19

Hello,

I'm trying to use this function and changed the code in application_top.php to the following:

// calculate category path
  if (isset($_GET['cPath'])) {
    $cPath = $_GET['cPath'];
  } elseif (isset($_GET['products_id']) && !isset($_GET['manufacturers_id'])) {
    $cPath = tep_get_product_path($_GET['products_id']);
  } else {
    $cPath = '';
  }

// note: tep_validate_url_cpath() as posted above takes no arguments;
// it reads cPath and $products_id from globals
  if (tep_validate_url_cpath($cPath) === false) {
    header('HTTP/1.1 404 Not Found');
    echo '<h1>404 Not Found</h1>';
    tep_exit();
  } else {
    if (tep_not_null($cPath)) {
      $cPath_array = tep_parse_category_path($cPath);
      $cPath = implode('_', $cPath_array);
      $current_category_id = $cPath_array[(sizeof($cPath_array)-1)];
    } else {
      $current_category_id = 0;
    }
  }

But the function always returns true and it goes to the 404.

I'm using a canonical contribution, but it also returns these duplicate-content categories as canonical links, because this is part of the osCommerce core.

I think that this is the right place to put the check.

#11   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 08 March 2012 - 13:29

I'm bumping this topic to ask whether someone has resolved this duplicate content issue.

I have tried several canonical URL contribs as well as Ultimate SEO URLs 5 Pro, but unfortunately they cannot resolve this problem.

Google Webmaster Tools has reported up to 2000 duplicate content pages with random cPath values in the URL.

#12   alarm_seo

alarm_seo
  • Members
  • 13 posts

Posted 10 March 2012 - 10:49

To acidvertigo:

Try to add Disallow: /*?* to robots.txt to remove the duplicate issue.

One SEO problem still remains. Links to product pages from category pages end with .html?cPath...

Can anyone suggest how to fix the categories template so that the links end with ".html"?

#13   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 10 March 2012 - 14:19

I'm trying this code in application_top.php:

// cPath values that were reported as duplicate content
$duplicate = array('52_260','288_380_186','288_380_2_186','288_504_2_186','504_22_301_77','47_2_114','3_47_22_301','70_34_544_514','288_504_34_389',
'70_34_389','34_389','288_380_389','369_546','288_52_260_531','70_537_160_479','288_380_34_474_491','288_504_34_474_491','70_537_34_474_491','288_504_70_537_560','288_504_443_444',
'380_34_474_602');

// answer with a 404 for any request whose cPath is on the list
if (isset($_GET['cPath']) && in_array($_GET['cPath'], $duplicate)) {
  header("HTTP/1.1 404 Not Found");
  echo "<h1>404 Not Found</h1>";
  unset($duplicate);
  tep_exit();
}

#14   kymation

kymation

    Code Monkey

  • Community Sponsor
  • 8,133 posts

Posted 10 March 2012 - 17:55

@alarm_seo That won't work. File names in a robots.txt file must be the literal filename. Wildcards such as * are not allowed.

Regards
Jim

My Addons

Banners Box Download Support
Categories Accordion Box Download Support
Closest Shipper 2.2x Support
Document Manager 2.2x Support
Generic Box Download Support
Get 1 Free 2.2x Support
Price in Cart Only/MAPP Download Support
Modular Front Page Download Support
Modular SEO Header Tags Download Support
MVS 2.2x Support
PDF Datasheet Download Support
Price Updater 2.2x
Products Specifications 2.3.x Development Version Support Bugs/Suggestions
Request a Review Download Support

Shopping List Download Support New!
Specials Image Overlay Download Support
Superfish Categories Box Download Support
Theme Switcher 2.3+ Support  Updated


#15   alarm_seo

alarm_seo
  • Members
  • 13 posts

Posted 12 March 2012 - 09:41

If I'm not mistaken, that should work.

Here is what I found at webmaster centre help at google:
  • To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
    User-agent: Googlebot
    Disallow: /*?
BTW, any suggestions for fixing the links on category pages? I really need to get rid of "?cPath" in links to products from categories.

#16   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 12 March 2012 - 19:55

I found this function for getting the full category path in the following topic:

http://forums.oscomm...-full-cat-path/

// Build the full category path (e.g. 160_479) for a single category ID.
function get_full_cat_from_cPath($zipote)
{
  $query_trabajo_1 = tep_db_query("SELECT `parent_id` FROM `categories` WHERE `categories_id` = '" . (int)$zipote . "'");
  $land = tep_db_fetch_array($query_trabajo_1);
  $cat_completa = $zipote;
  // Walk up the tree until a top-level category (parent_id 0) is reached.
  while ($land && $land['parent_id'] != 0) {
    // Prepend the parent before looking up its own parent; the version as
    // first posted skipped the immediate parent.
    $cat_completa = $land['parent_id'] . '_' . $cat_completa;
    $query_ciclica = tep_db_query("SELECT `parent_id` FROM `categories` WHERE `categories_id` = '" . (int)$land['parent_id'] . "'");
    $land = tep_db_fetch_array($query_ciclica);
  }
  return $cat_completa;
}

I put this in general.php but I cannot make it work. If this function can return the full category path, it can be compared with the cPath in the current URL, and if they don't match we can send a 301 redirect or a 404 error code.

Please let me know whether this is a good place to start and how I can make this function work.

#17   kymation

kymation

    Code Monkey

  • Community Sponsor
  • 8,133 posts

Posted 12 March 2012 - 20:38

@alarm_seo -- I probably should have qualified that. Your solution will work for Google if they really do read the robots.txt file that way. However, it does not meet the standard, so other search engines probably won't read it that way. So, you can use that code in a robots.txt block that is for Google only, but probably not in a general block.

The cPath in a URL is used to provide the navigation in the categories box. You can rewrite it to something else, but the category information must still be in the link somewhere for the navigation to work.

Since you quote Google on robots.txt, why don't you read this Google help page? I'll quote the relevant sentence:

It's much safer to serve us the original dynamic URL and let us handle the problem of detecting and avoiding problematic parameters.


I suggest you stop wasting time trying to fix a broken URL rewriter that won't do you any good, and start spending time on things that actually will help your search engine ranking.

Regards
Jim

Edited by kymation, 12 March 2012 - 20:39.



#18   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 14 March 2012 - 00:08

@clustersolutions and @All, I have modified the previous function in general.php as follows:

function get_full_cat_from_cPath($zipote)
{
  $query1 = tep_db_query("SELECT parent_id FROM categories WHERE categories_id = '" . tep_db_input($zipote) . "'");
  $land = tep_db_fetch_array($query1);
  $cat_completa = $zipote;
  // Redirect to the default page as soon as the category has a non-zero parent.
  if ($land && $land['parent_id'] != 0) {
    tep_redirect(tep_href_link(FILENAME_DEFAULT));
    tep_exit();
    // never reached: the request has already ended above
    $cat_completa = $land['parent_id'] . '_' . $cat_completa;
  }
  return $cat_completa;
}

Calling this function in index.php redirects to the default page. UNFORTUNATELY it only works for the categories where the parent_id is not set in the URL.

For example, if the original is cPath=160_479, I get to the correct page; calling only cPath=479 redirects to the default page (removing, in my case, some hundreds of duplicate pages). But if I call cPath=1_479 (1 is an existing parent_id), this code does not make the redirect.

P.S. In my Webmaster Tools I have duplicate content for URLs with 8 concatenated cPath IDs, like 8_256_47_48_8_78_54_132, and the number is still growing!
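
One way to cover the 1_479 case is to rebuild the full path from the last ID in the cPath and compare it with the whole requested string. The sketch below is only an illustration under assumptions: the $parents array is a hypothetical in-memory stand-in for the categories table (categories_id => parent_id), its sample values are taken from the IDs in this post, and the function names are made up rather than osCommerce functions.

```php
<?php
// Hypothetical stand-in for the categories table:
// categories_id => parent_id (0 means top level). Sample values only.
$parents = [1 => 0, 160 => 0, 479 => 160];

// Rebuild the full path for one category by walking up its parents.
// Returns null for an unknown ID or a cycle in the data.
function build_category_path(int $category_id, array $parents): ?string {
    if (!isset($parents[$category_id])) {
        return null;
    }
    $path = [];
    $seen = [];
    $id = $category_id;
    while ($id != 0) {
        if (!isset($parents[$id]) || isset($seen[$id])) {
            return null;
        }
        $seen[$id] = true;
        array_unshift($path, $id);
        $id = $parents[$id];
    }
    return implode('_', $path);
}

// A requested cPath is accepted only if it equals the path rebuilt
// from its last ID.
function is_valid_cpath(string $cpath, array $parents): bool {
    if (!preg_match('/^\d+(_\d+)*$/', $cpath)) {
        return false;
    }
    $ids = explode('_', $cpath);
    $leaf = (int) end($ids);
    return build_category_path($leaf, $parents) === $cpath;
}
```

With the sample data, 160_479 is accepted, while 479 alone and 1_479 are both rejected, so a caller could redirect or send a 404 for them.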

#19   Gergely

Gergely

    Action Hero

  • Community Team
  • 1,190 posts

Posted 17 March 2012 - 10:50

Hi!

In my opinion it would be better to catch this in application_top. The tep_parse_category_path() function is a good place for it.

  if (tep_not_null($cPath)) {
    $cPath_array = tep_parse_category_path($cPath);

So any check can be done inside the tep_parse_category_path() function, and it is the main built-in function for this.

////
// Parse and secure the cPath parameter values
  function tep_parse_category_path($cPath) {
// make sure the category IDs are integers
    $cPath_array = array_map('tep_string_to_int', explode('_', $cPath));
// make sure no duplicate category IDs exist which could lock the server in a loop
    $tmp_array = array();
    $n = sizeof($cPath_array);
    for ($i=0; $i<$n; $i++) {
      if (!in_array($cPath_array[$i], $tmp_array)) {
        $tmp_array[] = $cPath_array[$i];
      }
    }

/*** Proposed place for the check: the cPath string would need to be validated here ***/
    return $tmp_array;
  }

This problem may persist on all pages that use cPath.



#20   acidvertigo

acidvertigo
  • Members
  • 185 posts

Posted 17 March 2012 - 19:04

Or something that controls the $tree array in bm_categories here:

function getData() {
	  global $categories_string, $tree, $languages_id, $cPath, $cPath_array;
	  $categories_string = '';
	  $tree = array();
	  $categories_query = tep_db_query("select c.categories_id, cd.categories_name, c.parent_id from " . TABLE_CATEGORIES . " c, " . TABLE_CATEGORIES_DESCRIPTION . " cd where c.parent_id = '0' and c.categories_id = cd.categories_id and cd.language_id='" . (int)$languages_id ."' order by sort_order, cd.categories_name");
	  while ($categories = tep_db_fetch_array($categories_query))  {
	    $tree[$categories['categories_id']] = array('name' => $categories['categories_name'],
												    'parent' => $categories['parent_id'],
												    'level' => 0,
												    'path' => $categories['categories_id'],
												    'next_id' => false);
	    if (isset($parent_id)) {
		  $tree[$parent_id]['next_id'] = $categories['categories_id'];
	    }
	    $parent_id = $categories['categories_id'];
	    if (!isset($first_element)) {
		  $first_element = $categories['categories_id'];
	    }
	  }
	  if (tep_not_null($cPath)) {
	    $new_path = '';
	    reset($cPath_array);
	    while (list($key, $value) = each($cPath_array)) {
		  unset($parent_id);
		  unset($first_id);
		  $categories_query = tep_db_query("select c.categories_id, cd.categories_name, c.parent_id from " . TABLE_CATEGORIES . " c, " . TABLE_CATEGORIES_DESCRIPTION . " cd where c.parent_id = '" . (int)$value . "' and c.categories_id = cd.categories_id and cd.language_id='" . (int)$languages_id ."' order by sort_order, cd.categories_name");
		  if (tep_db_num_rows($categories_query)) {
		    $new_path .= $value;
		    while ($row = tep_db_fetch_array($categories_query)) {
			  $tree[$row['categories_id']] = array('name' => $row['categories_name'],
												   'parent' => $row['parent_id'],
												   'level' => $key+1,
												   'path' => $new_path . '_' . $row['categories_id'],
												   'next_id' => false);
			  if (isset($parent_id)) {
			    $tree[$parent_id]['next_id'] = $row['categories_id'];
			  }
			  $parent_id = $row['categories_id'];
			  if (!isset($first_id)) {
			    $first_id = $row['categories_id'];
			  }
			  $last_id = $row['categories_id'];
		    }
		    $tree[$last_id]['next_id'] = $tree[$value]['next_id'];
		    $tree[$value]['next_id'] = $first_id;
		    $new_path .= '_';
		  } else {
		    break;
		  }
	    }
	  }

This code outputs the array as follows:

Array
(
[1] => Array
(
[name] => Hardware
[parent] => 0
[level] => 0
[path] => 1
[next_id] => 17
)
[2] => Array
(
[name] => Software
[parent] => 0
[level] => 0
[path] => 2
[next_id] => 3
)
[3] => Array
(
[name] => DVD Movies
[parent] => 0
[level] => 0
[path] => 3
[next_id] =>
)
[17] => Array
(
[name] => CDROM Drives
[parent] => 1
[level] => 1
[path] => 1_17
[next_id] => 4
)
[4] => Array
(
[name] => Graphics Cards
[parent] => 1
[level] => 1
[path] => 1_4
[next_id] => 8
)
[8] => Array
(
[name] => Keyboards
[parent] => 1
[level] => 1
[path] => 1_8
[next_id] => 16
)
[16] => Array
(
[name] => Memory
[parent] => 1
[level] => 1
[path] => 1_16
[next_id] => 9
)
[9] => Array
(
[name] => Mice
[parent] => 1
[level] => 1
[path] => 1_9
[next_id] => 6
)
[6] => Array
(
[name] => Monitors
[parent] => 1
[level] => 1
[path] => 1_6
[next_id] => 5
)
[5] => Array
(
[name] => Printers
[parent] => 1
[level] => 1
[path] => 1_5
[next_id] => 7
)
[7] => Array
(
[name] => Speakers
[parent] => 1
[level] => 1
[path] => 1_7
[next_id] => 2
)
)

But unfortunately, on the duplicated pages this array is still valid with all the duplicate values, and I cannot find anything to check whether the array is the good one or the duplicated one. :unsure: