osc 2.3 product review recover html function


12 replies to this topic

#-19   xvoyance

xvoyance
  • Members
  • 87 posts

Posted 06 May 2012 - 02:55 PM

To my understanding, osC 2.3.1 intentionally disables HTML in product reviews so that no URLs can be posted. However, this causes problems when displaying CJK text.
I would like to restore the HTML output.
Do I simply edit product_reviews_info.php
and remove tep_output_string_protected?

#-18   MrPhil

MrPhil
  • Members
  • 4,139 posts

Posted 06 May 2012 - 04:43 PM

What that function call does is feed the text through htmlspecialchars(). That, in turn, looks for <, >, &, and a few other characters that have special meaning in HTML, and turns them into "entities" (&lt; etc.). It sounds like htmlspecialchars() may be corrupting certain multibyte CJK characters that contain the single bytes for < etc. What character encoding are you using? It's supposed to work properly with UTF-8 and ISO-8859-1 (Latin-1), but, according to http://us3.php.net/manual/en/function.htmlspecialchars.php , ISO-8859-1 is the default encoding for this call in PHP versions before 5.4. If the optional encoding parameter isn't set, it may be interpreting UTF-8 multibyte characters incorrectly. If your site is UTF-8, something you might try is to find the following in both includes/functions/general.php and admin/includes/functions/general.php:
  function tep_output_string($string, $translate = false, $protected = false) {
	if ($protected == true) {
	  return htmlspecialchars($string);
	} else {

and try changing it to
  function tep_output_string($string, $translate = false, $protected = false) {
	if ($protected == true) {
	  // return htmlspecialchars($string);
	  return htmlspecialchars($string, ENT_COMPAT|ENT_HTML401, 'UTF-8');
	} else {

If it doesn't work, back out the change.
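
To see what the changed call does on its own, here is a minimal sketch, assuming PHP 5.4 or later (where ENT_HTML401 is defined). The sample review text is made up for illustration; only htmlspecialchars() itself comes from the post above.

```php
<?php
// Made-up review text: two CJK characters (UTF-8 encoded) followed by markup.
$review = "\xE4\xB8\xAD\xE6\x96\x87 <b>bold</b> & more";

// Same call as in the suggested change, with the encoding stated explicitly.
$safe = htmlspecialchars($review, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// The CJK bytes pass through untouched; only <, > and & become entities.
echo $safe, "\n";
```

If the page source shows the CJK characters intact after this call, the corruption is happening somewhere else in the output path.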

#-17   xvoyance

xvoyance
  • Members
  • 87 posts

Posted 07 May 2012 - 03:02 AM

I tried the change you proposed. It made no difference, although it did no harm.

My tep_output_string looks like this:

  function tep_output_string($string, $translate = false, $protected = false) {
    if ($protected == true) {
      return htmlspecialchars($string);
    } else {
      if ($translate == false) {
        return tep_parse_input_field_data($string, array('"' => '&quot;'));
      } else {
        return tep_parse_input_field_data($string, $translate);
      }
    }
  }


Furthermore, to my understanding, my system does use UTF-8.
The stored text is correct: when I edit the stored text, it comes back correctly.
Only the display is wrong. Also, the layout of the displayed page is somewhat off.
A product picture is shown at the upper right, but it overlaps the right column.
Presumably that is a CSS problem.

#-16   MrPhil

MrPhil
  • Members
  • 4,139 posts

Posted 07 May 2012 - 03:21 AM

If it didn't work, I'm out of ideas. Hopefully someone will come along who has seen this before. My fix above was based on the assumption that certain bytes in CJK text were not being recognized as being part of UTF-8 characters, but were being treated as single byte ASCII and converted to HTML entities. You could look yourself at the browser View > Page Source and see if the corrupted CJK characters indeed have &lt; &gt; &amp; etc. embedded in the middle of them.

#-15   xvoyance

xvoyance
  • Members
  • 87 posts

Posted 07 May 2012 - 06:40 AM

Some characters (CJK characters are 2 bytes per character) seem to be broken in the middle.
It seems <br/> is intentionally inserted somewhere.

Furthermore, error messages appear saying that the methods button and buttonset are not supported.
buttonset comes from

<script type="text/javascript">
  $("#headerShortcuts").buttonset();
</script>

#-14   xvoyance

xvoyance
  • Members
  • 87 posts

Posted 07 May 2012 - 11:56 AM

:devil:  I saw a smoking gun!

It is tep_break_string that inserts the - and breaks the CJK characters.
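
The effect can be reproduced with a small sketch. The break_bytes() function below is a hypothetical byte-oriented breaker written for illustration, not the osCommerce source; it only mimics the behaviour described here (count non-space bytes, insert a break character once a run gets too long).

```php
<?php
// Hypothetical byte-oriented breaker (illustration only, not the osCommerce code).
function break_bytes($string, $len, $break_char = '-') {
    $run = 0;   // length of the current run of non-space bytes
    $out = '';
    for ($i = 0, $n = strlen($string); $i < $n; $i++) {
        $byte = $string[$i];
        $run = ($byte === ' ') ? 0 : $run + 1;
        if ($run > $len) {
            $out .= $break_char;  // may land in the middle of a multibyte character
            $run = 1;
        }
        $out .= $byte;
    }
    return $out;
}

$text   = "\xE4\xB8\xAD\xE6\x96\x87";  // two CJK characters, 3 bytes each in UTF-8
$broken = break_bytes($text, 4);       // the break falls after the 4th byte

var_dump(preg_match('//u', $text));    // 1: still valid UTF-8
var_dump(preg_match('//u', $broken));  // false: the second character was split
```

The /u modifier makes PCRE validate the subject as UTF-8, so it doubles as a quick check for whether a string was cut inside a character.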

Edited by xvoyance, 07 May 2012 - 12:09 PM.


#-13   MrPhil

MrPhil
  • Members
  • 4,139 posts

Posted 07 May 2012 - 05:18 PM

Yeah, that function seems to work only for single-byte encodings such as Latin-1. It would have to be modified to use mb_ functions if UTF-8, to make sure it doesn't insert the break character (default '-') within a multibyte character. I'm assuming that the browser then handles breaking the word (and line) at the hyphen -. In most uses in osC, it appears to be -<br />, which not only hyphenates, but explicitly adds a line break.

Any MB experts out there? If not, I could take a look at it tonight. First, I need to understand when tep_break_string() gets called, and when word wrap is simply left to the browser. If it has to back up all the way to the beginning of the word to avoid breaking within a multibyte character, it would have to be just a <br />.
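
One possible shape for such a change, as a rough sketch only: the function name break_string_utf8() is made up, and it assumes the input is UTF-8. It splits the text into whole characters with preg_split() and the /u modifier (mb_substr() and mb_strlen() would be an alternative where the mbstring extension is available), so the break string can only ever land between characters.

```php
<?php
// Hypothetical multibyte-aware breaker; name and behaviour are assumptions.
function break_string_utf8($string, $len, $break_char = '-') {
    // Split into whole UTF-8 characters instead of single bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $string;  // not valid UTF-8: leave the text untouched
    }
    $run = 0;   // length of the current run of non-space characters
    $out = '';
    foreach ($chars as $char) {
        $run = ($char === ' ') ? 0 : $run + 1;
        if ($run > $len) {
            $out .= $break_char;  // always falls between two complete characters
            $run = 1;
        }
        $out .= $char;
    }
    return $out;
}

// Six CJK characters, broken after every four.
$cjk = "\xE4\xB8\xAD\xE6\x96\x87\xE5\xAD\x97\xE7\xAC\xA6\xE6\xB5\x8B\xE8\xAF\x95";
echo break_string_utf8($cjk, 4, '<br />'), "\n";
```

This counts characters rather than display width, and it still inserts a hyphen where one is passed in, so whether that is appropriate for CJK text remains an open question.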

#-12   MrPhil

MrPhil
  • Members
  • 4,139 posts

Posted 08 May 2012 - 03:00 AM

I think it can be done, but I need some information on how languages using CJK characters are organized. Are words separated by ASCII blanks (same as Western languages), or are the ideographs all run together in one block? If words are not separated, how about sentences? Are ideographs twice as wide as a blank, or does that depend on the font? tep_break_string() wants to insert a break character or string whenever a "word" (separated by blanks) exceeds some maximum length. Usually a -<br /> is inserted, but sometimes it's just a hyphen or even a space. Is this appropriate for CJK languages? Are you using UTF-8? Are you using non-CJK languages too? If there are different rules for CJK languages and non-CJK languages with regards to how words or sentences are separated (and how they should be split, if a block of characters is too long), and whether a hyphen is needed at the end of a split line, this could get very sticky.

#-11   xvoyance

xvoyance
  • Members
  • 87 posts

Posted 08 May 2012 - 12:43 PM

I simply removed that function, and now everything looks fine. Lines break automatically.

There is one more question about the product image on the review page.
It is not positioned correctly.
I am not sure how it looks for others. Presumably that is a CSS problem?

#-10   xvoyance

xvoyance
  • Members
  • 87 posts

Posted 08 May 2012 - 01:31 PM

CJK words are not separated by anything, to my understanding. Sentences are not separated either, unless you add punctuation marks. I do not know what ideographs are (I tried to look up the word but still do not understand it, which perhaps means it is not relevant). Each character should be equally spaced, unless you apply some stretching in typesetting, to my understanding.

CJK (Chinese-Japanese-Korean) fonts are difficult to handle, but by now people should know pretty well how to do that (although not me). TeX/LaTeX used to have difficulty handling CJK fonts, but now XeLaTeX within MiKTeX does that well (although I do not know how they did it; I simply use it).

I cannot attach a screenshot in this forum. Otherwise I could show you that it looks fine now, except for the product image.

Edited by xvoyance, 08 May 2012 - 01:34 PM.


#-9   xvoyance

xvoyance
  • Members
  • 87 posts

Posted 08 May 2012 - 02:16 PM

P.S. Removing tep_break_string did no harm for Latin text either, to my understanding. Long sentences wrap automatically.

#-8   xvoyance

xvoyance
  • Members
  • 87 posts

Posted 09 May 2012 - 12:17 AM

To my understanding, the two bytes in a CJK character are not alike: one has the first bit set and the other leaves the first bit clear,
so that the system can detect where the boundary of each character is.

#-7   MrPhil

MrPhil
  • Members
  • 4,139 posts

Posted 09 May 2012 - 07:15 PM

Yes, UTF-8 has special formatting requirements so that it's easy to tell if a given byte is the start of a character or somewhere in the middle of a character. You need to back up to the left until you find a byte with certain high order bits set, and that will also tell you how many bytes follow within this one character (it may be one, two, or three for CJK). Anything after that with the high bit 0 is ASCII and is single byte.
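
As a rough sketch of that back-up-and-inspect logic: the two helper names below are made up, and the bit patterns are the standard UTF-8 ones (continuation bytes look like 10xxxxxx, lead bytes announce the total length of the character).

```php
<?php
// Hypothetical helpers for locating UTF-8 character boundaries.

// Byte offset of the first byte of the character that contains offset $pos.
function utf8_char_start($string, $pos) {
    // Continuation bytes have the bit pattern 10xxxxxx; step left past them.
    while ($pos > 0 && (ord($string[$pos]) & 0xC0) === 0x80) {
        $pos--;
    }
    return $pos;
}

// Total number of bytes in a character, judged from its lead byte.
function utf8_char_length($lead) {
    $b = ord($lead);
    if ($b < 0x80) return 1;             // 0xxxxxxx: ASCII, single byte
    if (($b & 0xE0) === 0xC0) return 2;  // 110xxxxx
    if (($b & 0xF0) === 0xE0) return 3;  // 1110xxxx (most CJK characters)
    if (($b & 0xF8) === 0xF0) return 4;  // 11110xxx
    return 0;                            // continuation byte or invalid lead
}

$s = "a\xE4\xB8\xADb";                   // 'a', one 3-byte CJK character, 'b'
$start = utf8_char_start($s, 3);         // offset 3 is inside the CJK character
echo $start, ' ', utf8_char_length($s[$start]), "\n";
```

A byte-oriented breaker could call utf8_char_start() on the position where it wants to insert a break and move the break there instead.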