xvoyance

osc 2.3 product review recover html function


To my understanding, osC 2.3.1 product reviews intentionally disable HTML so that no URL can be posted. However, this causes a problem when displaying CJK text.

I would like to restore the HTML function.

Do I simply edit product_reviews_info.php and remove the tep_output_string_protected call?


That function call feeds the text through htmlspecialchars(), which looks for <, >, &, and a few other characters that have special meaning in HTML, and turns them into "entities" (&lt; etc.). It sounds like htmlspecialchars() may be corrupting certain multibyte CJK characters that contain the single bytes for < etc. What character encoding are you using? It's supposed to work properly with UTF-8 and ISO-8859-1 (Latin-1), but according to http://us3.php.net/manual/en/function.htmlspecialchars.php, ISO-8859-1 is the default encoding for this call (in PHP versions before 5.4). If the optional encoding parameter isn't set, it may be interpreting UTF-8 multibyte characters incorrectly. If your site is UTF-8, something you might try in both includes/functions/general.php and admin/includes/functions/general.php is to find

  function tep_output_string($string, $translate = false, $protected = false) {
    if ($protected == true) {
      return htmlspecialchars($string);
    } else {

 

and try changing it to

  function tep_output_string($string, $translate = false, $protected = false) {
    if ($protected == true) {
      // return htmlspecialchars($string);
      return htmlspecialchars($string, ENT_COMPAT|ENT_HTML401, 'UTF-8');
    } else {

 

If it doesn't work, back out the change.
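As a quick check of what the escaping step actually does, here is a minimal standalone sketch (not osCommerce code; the sample review text is made up). It assumes the script file itself is saved as UTF-8. With the encoding given explicitly, only the HTML-special ASCII characters are rewritten and the CJK characters pass through unchanged:

```php
<?php
// Hypothetical review text mixing markup, an ampersand, and CJK characters.
$review = '<b>好</b> & 安い';

// Escape the HTML-special characters; the encoding is stated explicitly
// instead of relying on the PHP version's default.
$escaped = htmlspecialchars($review, ENT_COMPAT, 'UTF-8');

echo $escaped, "\n"; // prints: &lt;b&gt;好&lt;/b&gt; &amp; 安い
```

If the CJK characters already survive this step on your server, the corruption is probably introduced somewhere else in the output path.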


I tried the change you proposed. It made no difference, though it did no harm.

 

My tep_output_string looks like this:

function tep_output_string($string, $translate = false, $protected = false) {
  if ($protected == true) {
    return htmlspecialchars($string);
  } else {
    if ($translate == false) {
      return tep_parse_input_field_data($string, array('"' => '&quot;'));
    } else {
      return tep_parse_input_field_data($string, $translate);
    }
  }
}

 

 

Furthermore, to my understanding, my system does use UTF-8.

The stored text is correct: when I edit the stored text, it comes back correctly.

Only the display is wrong. In addition, the displayed page is somewhat misaligned.

A product picture is shown on the upper right, but it overlaps the right column.

Presumably that is a CSS problem.


If it didn't work, I'm out of ideas. Hopefully someone will come along who has seen this before. My fix above was based on the assumption that certain bytes in CJK text were not being recognized as being part of UTF-8 characters, but were being treated as single byte ASCII and converted to HTML entities. You could look yourself at the browser View > Page Source and see if the corrupted CJK characters indeed have < > & etc. embedded in the middle of them.


Some characters appear to be broken in the middle (each CJK character takes more than one byte).

It seems a <br/> is being intentionally inserted somewhere.

Furthermore, error messages appear saying that the methods button and buttonset are not supported.

buttonset comes from

<script type="text/javascript">
$("#headerShortcuts").buttonset();
</script>


I found the smoking gun!

It is tep_break_string that inserts the "-" and breaks the CJK characters.
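To illustrate the failure mode, here is a minimal sketch (not the actual tep_break_string code) of what happens when a break character is inserted at a byte offset that falls inside a multibyte UTF-8 character. The sample string and the offset are chosen only for demonstration:

```php
<?php
// "日本語" written as explicit UTF-8 bytes (3 bytes per character).
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Byte-wise split after 4 bytes: this lands inside the second character.
$broken = substr($text, 0, 4) . '-' . substr($text, 4);

// preg_match() with the /u modifier fails on invalid UTF-8,
// so it can serve as a simple validity check.
$valid_before = preg_match('//u', $text) === 1;   // true
$valid_after  = preg_match('//u', $broken) === 1; // false

var_dump($valid_before, $valid_after);
```

The browser then shows replacement glyphs (or garbage) for the two halves of the split character, which matches the symptom described above.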


Yeah, that function seems to work only for single-byte encodings such as Latin-1. It would have to be modified to use mb_ functions if UTF-8, to make sure it doesn't insert the break character (default '-') within a multibyte character. I'm assuming that the browser then handles breaking the word (and line) at the hyphen -. In most uses in osC, it appears to be -<br />, which not only hyphenates, but explicitly adds a line break.

 

Any MB experts out there? If not, I could take a look at it tonight. First, I need to understand when tep_break_string() gets called, and when word wrap is simply left to the browser. If it has to back up all the way to the beginning of the word to avoid breaking within a multibyte character, it would have to be just a <br />.
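One possible direction, as a rough sketch only: the function name break_string_utf8 and its parameters are hypothetical, and it is not a drop-in replacement tested against osCommerce. It counts whole UTF-8 characters rather than bytes, so the break string can never land inside a multibyte character. It uses preg_split() with the /u modifier, which avoids depending on the mbstring extension:

```php
<?php
// Hypothetical multibyte-aware variant: inserts $break whenever a run of
// non-space characters grows longer than $len characters (not bytes).
function break_string_utf8($string, $len, $break = '-') {
    // Split into whole UTF-8 characters; returns false on invalid UTF-8.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $string; // leave invalid input untouched
    }

    $run = 0;   // length of the current run of non-space characters
    $out = '';
    foreach ($chars as $ch) {
        if ($ch === ' ') {
            $run = 0;
        } else {
            $run++;
        }
        if ($run > $len) {
            $out .= $break; // break between characters, never inside one
            $run = 1;
        }
        $out .= $ch;
    }
    return $out;
}

echo break_string_utf8('abcdefg', 3), "\n"; // prints: abc-def-g
```

Whether a hyphen is the right break string for CJK text is a separate question; passing '<br />' or an empty string as $break would be one way to experiment.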


I think it can be done, but I need some information on how languages using CJK characters are organized. Are words separated by ASCII blanks (same as Western languages), or are the ideographs all run together in one block? If words are not separated, how about sentences? Are ideographs twice as wide as a blank, or does that depend on the font? tep_break_string() wants to insert a break character or string whenever a "word" (separated by blanks) exceeds some maximum length. Usually a -<br /> is inserted, but sometimes it's just a hyphen or even a space. Is this appropriate for CJK languages? Are you using UTF-8? Are you using non-CJK languages too? If there are different rules for CJK languages and non-CJK languages with regards to how words or sentences are separated (and how they should be split, if a block of characters is too long), and whether a hyphen is needed at the end of a split line, this could get very sticky.


I simply removed that function call, and now everything looks fine. The lines are broken automatically.

There is one more question, about the product image on the review page.

It is not positioned correctly.

I am not sure how it looks for others. Presumably that is a CSS problem?


CJK words are not separated by anything, to my understanding. Sentences are not separated either, unless you put in a punctuation mark. I do not know what ideographs are (I tried to look the word up but still do not understand it, which perhaps means it is not relevant here). Each character should be equally spaced, unless you apply some stretching in typesetting, to my understanding.

CJK (Chinese-Japanese-Korean) fonts are difficult to handle, but by now people should know pretty well how to do it (although I do not). TeX/LaTeX used to have difficulty handling CJK fonts, but now XeLaTeX within MiKTeX handles them well (although I do not know how it does that; I simply use it).

I cannot attach a screenshot in this forum. Otherwise I could show you that it looks fine now, except for the product image.


P.S. Removing tep_break_string did no harm for Latin text either, to my understanding. Long sentences wrap automatically.


To my understanding, the two bytes of a CJK character are not the same kind: one has its first bit set and the other leaves its first bit clear, so that the system can detect where the boundary of each character is.


Yes, UTF-8 has special formatting requirements so that it's easy to tell if a given byte is the start of a character or somewhere in the middle of a character. You need to back up to the left until you find a byte with certain high order bits set, and that will also tell you how many bytes follow within this one character (it may be one, two, or three for CJK). Anything after that with the high bit 0 is ASCII and is single byte.
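The "back up to the lead byte" idea can be sketched as follows. The helper names utf8_char_start and utf8_seq_length are made up for illustration, and the code assumes the input is already valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx, and the lead byte's high bits give the total length of the character:

```php
<?php
// Hypothetical helper: move left from byte offset $i until the byte is
// not a continuation byte (10xxxxxx), i.e. until a character start.
function utf8_char_start($s, $i) {
    while ($i > 0 && (ord($s[$i]) & 0xC0) === 0x80) {
        $i--;
    }
    return $i;
}

// Hypothetical helper: total byte length of the character announced by
// a lead byte; 0 means the byte is a continuation or invalid lead byte.
function utf8_seq_length($byte) {
    $b = ord($byte);
    if ($b < 0x80)           return 1; // 0xxxxxxx: ASCII
    if (($b & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($b & 0xF0) === 0xE0) return 3; // 1110xxxx (most CJK characters)
    if (($b & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;
}

// "a", then "日" as three bytes, then "b".
$s = "a\xE6\x97\xA5b";
$start = utf8_char_start($s, 3);       // offset 3 is the last byte of "日"
$length = utf8_seq_length($s[$start]);
echo $start, ' ', $length, "\n";       // prints: 1 3
```

A break routine could use such a check to move a proposed split point back to a character boundary before inserting anything.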
