[F] Word Wrapping garbles UTF-8 [R] [C-Michael83]
#1
This user has been denied support.
The Word Wrapping feature inserts spaces into words that are longer than the configured limit (e.g. 80 characters). However, it does not seem to be UTF-8 aware: it inserts spaces where none are needed (for characters such as CJK that the browser wraps anyway), and it also inserts them in the middle of a UTF-8 multi-byte sequence, producing garbled characters.
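
To illustrate the garbling, here is a minimal sketch. It is not the MyBB code; `wrap_bytes()` and `wrap_chars()` are hypothetical helpers that use a simplified pattern. The only difference between them is the `u` modifier, which makes PCRE count whole characters instead of bytes.

```php
<?php
// Hypothetical helper (not MyBB code): without /u the quantifier counts BYTES,
// so the inserted space can land inside a multi-byte UTF-8 character.
function wrap_bytes(string $text, int $limit): string
{
    return preg_replace('#[^\s]{' . $limit . '}#', '$0 ', $text);
}

// Same pattern with the /u modifier: the quantifier now counts whole characters.
function wrap_chars(string $text, int $limit): string
{
    return preg_replace('#[^\s]{' . $limit . '}#u', '$0 ', $text);
}

$sample = '日本語は言語である'; // 9 characters, 27 bytes in UTF-8

$byByte = wrap_bytes($sample, 4);
$byChar = wrap_chars($sample, 4);

// preg_match('//u', ...) returns 1 only for valid UTF-8 input.
var_dump(preg_match('//u', $byByte) === 1); // bool(false): a 3-byte character was split
var_dump(preg_match('//u', $byChar) === 1); // bool(true)
```

The character-aware version no longer garbles the text, but it still inserts spaces into Japanese that the browser would have wrapped on its own.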

As Word Wrapping seems to be a default setting and is even active in this forum, here is a Japanese sentence copied and pasted from Japanese Wikipedia that demonstrates the problem.

日本語(にほんご、にっぽんご)は、主として、日本列島で使用されてきた言語である。日本手話を母語とする者などを除いて、ほぼ全ての日本在住者は日本語を第一言語とする。日本国は法令上、公用語を明記していないが、事実上の公用語となっており、学校教育の「国語」で教えられる。使用者は、日本国内を主として約1億3千万人。日本語の文法体系や音韻体系を反映する手話として日本語対応手話がある。

Workaround:
set Word Wrapping to 0.

What has to be done in order to fix this problem:
1.) make the word wrapper UTF-8 aware so it does not insert the whitespace character in the middle of a multi-byte UTF-8 sequence.
2.) make the word wrapper smarter so it does not insert spaces between characters (such as CJK) that already get wrapped by the browser.

Can you confirm that this is a bug?
Or should Word Wrapper simply not be used at all on multi lingual forums?
I posted this in the bug report section and it was moved to general support without comment. So apparently it's not a bug.

I modified my_wordwrap() in inc/functions.php on my server so it leaves UTF-8 alone but still word wraps Latin characters. It's a hack, because now it doesn't word wrap any multi-byte UTF-8 characters at all, even ones that should be wrapped (German umlauts, for example). But oh well, it's better than nothing. If it causes trouble again I'll just turn word wrapping off.
After a bit of googling I found this: http://www.php.net/wordwrap
In the comments there are dozens of alternatives to wordwrap() that other users wrote.

The most interesting comment was about using zero-width characters instead of spaces ​

There are also several functions that claim to be UTF-8 safe, which the MyBB version currently clearly is not.

This way word wrapping can be implemented without actually inserting spaces, so when the browser decides it doesn't need to wrap yet after all, the word-wrapped text looks the same as if it hadn't been wrapped. However, I haven't tested yet whether this zero-width character hurts a browser's Find/Search feature, which would be bad (on the other hand, a space hurts it too, but at least the user can see the space and thus understand why it doesn't work).

Right now, as a workaround, I added \x80-\xFF to the [^] section of the regexp to make it ignore multi-byte UTF-8 characters altogether, but if I decide to stick with MyBB I'll write up a proper fix based on the comments I found so far.
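
A rough sketch of what that workaround amounts to, as I understand it. `wrap_ascii_only()` is a hypothetical stand-alone function, `$limit` stands in for the wordwrap setting, and the character class is copied from my_wordwrap() with the high-byte range added. Every byte of a multi-byte UTF-8 sequence is in the range 0x80 to 0xFF, so those bytes now end a run instead of being counted and split.

```php
<?php
// Hypothetical stand-alone version of the workaround (not the actual patch).
function wrap_ascii_only(string $text, int $limit): string
{
    // \x80-\xFF matches any byte of a multi-byte UTF-8 sequence in byte mode (no /u),
    // so such bytes act as "break" characters and are never wrapped or split.
    $pattern = '#(?>[^\s&/<>"\-.\[\]\x80-\xFF]{' . $limit . '})#';

    return preg_replace($pattern, '$0 ', $text);
}

echo wrap_ascii_only('aaaaaaaaaa', 4), "\n";         // ASCII runs are still wrapped
echo wrap_ascii_only('日本語は言語である', 4), "\n"; // left untouched
echo wrap_ascii_only('ääääääääää', 4), "\n";         // also left untouched (the drawback)
```

The last call shows the cost of the hack: long runs of umlauts are no longer wrapped at all.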
#2
A note: It also happens with umlauts.

aäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜaäoöuüAÄOÖUÜ
Greets,
Michael
#3
Hi,

Please try this fix and let me know if there are any problems.

In inc/functions.php find:

edit: see below

Two things to note:
- We are still working on other ways to try and properly detect the encoding
- If PCRE is not compiled with Unicode support this will give you an error. We are still trying to work on a fix for that as well.
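
For the second point, one possible way to probe for Unicode support up front is sketched below. `pcre_has_utf8_support()` is a hypothetical helper, not part of MyBB. It relies on the fact that compiling any pattern with the `u` modifier fails when PCRE was built without UTF-8 support, in which case preg_match() returns false.

```php
<?php
// Hypothetical helper: true when this PHP build's PCRE accepts the /u modifier.
function pcre_has_utf8_support(): bool
{
    static $supported = null;

    if ($supported === null) {
        // An empty pattern matches the empty string (returns 1) when /u is available.
        // Without UTF-8 support the pattern fails to compile and false is returned;
        // @ suppresses the warning in that case.
        $supported = (@preg_match('//u', '') === 1);
    }

    return $supported;
}

// Pick the modifier once and reuse it when building patterns.
$modifier = pcre_has_utf8_support() ? 'u' : '';
```

Checking once and caching the result avoids triggering (and suppressing) an error on every call.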

SQA Team, please test this thoroughly. (Different languages, different text, different DB encoding, different PHP versions)
#4
Hi,

thank you for your fix.

(2008-11-08, 07:05 PM)frostschutz Wrote: What has to be done in order to fix this problem:
1.) make the word wrapper UTF-8 aware so it does not insert the whitespace character in the middle of a multi-byte UTF-8 sequence.
2.) make the word wrapper smarter so it does not insert spaces between characters (such as CJK) that already get wrapped by the browser.

Your changes seem to resolve 1.) - the characters do not get garbled anymore. That is a huge improvement. Thank you for that.

However it does not fix 2.). With your change, spaces still get inserted into Japanese text which is both wrong and not required. This is not a showstopper but it would be very nice to fix 2.) as well.

The PHP comments on wordwrap() suggest using a zero-width character instead of a space, i.e. using "$0" followed by a zero-width space (U+200B) instead of "$0 " as the replacement string. This works fine in terms of looks: no visible spaces are inserted, but the text still wraps, so the Japanese looks good and not unnecessarily spaced. Unfortunately, Firefox's find feature is not smart enough to disregard zero-width characters, so the browser does not find text even though the user can see and read it right there. On the other hand, it does not find words that have been split by a space either, so the situation is no worse than before.
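
A small sketch of the zero-width variant and of the search problem. `wrap_zero_width()` and the `ZWSP` constant are made-up names for this example, and the pattern is simplified; `strpos()` stands in for the browser's find feature, which in my test behaved the same way.

```php
<?php
// U+200B ZERO WIDTH SPACE, written as raw UTF-8 bytes.
const ZWSP = "\xE2\x80\x8B";

// Hypothetical helper: inserts an invisible break opportunity after every
// $limit consecutive non-whitespace characters.
function wrap_zero_width(string $text, int $limit): string
{
    return preg_replace('#(?>[^\s]{' . $limit . '})#u', '$0' . ZWSP, $text);
}

$word    = 'Donaudampfschifffahrtsgesellschaft'; // 34 letters
$wrapped = wrap_zero_width($word, 10);

// The text looks identical on screen, but a literal search no longer matches
// when the search term spans one of the inserted characters.
var_dump(strpos($wrapped, 'schifffahrt'));          // bool(false)
var_dump(str_replace(ZWSP, '', $wrapped) === $word); // bool(true)
```

Stripping the zero-width characters restores the original string, so the insertion is at least reversible.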

I don't know on what basis browsers like Firefox decide what to wrap and what not to wrap. Japanese text gets wrapped everywhere without the use of spaces or other characters. So for a real fix of 2.), my_wordwrap() would need to know in more detail which characters wrap and which don't. The regexp used right now is a heuristic that works well for languages written in Latin script, but it's not a good solution for other writing systems.

Japanese is actually quite harmless in regards to wrapping because you basically are allowed to wrap Japanese pretty much anywhere, even in the middle of a word. I can only guess that for other languages wrapping in the wrong place or not detecting word delimiters properly can be even worse.

If only there were an HTML tag or CSS setting that forced the browser to wrap anyway, we wouldn't have this problem, as a my_wordwrap() function wouldn't be required at all.

EDIT: the forum ate & # 8203 ; when I typed it without spaces, so I guess it actually understands HTML entities?
#5
(2008-11-09, 11:42 PM)frostschutz Wrote: However it does not fix 2.). With your change, spaces still get inserted into Japanese text which is both wrong and not required. This is not a showstopper but it would be very nice to fix 2.) as well.

Alright, so I changed it to that zero width character. Personally I'm not concerned with Firefox's "find" issue.

As for the Japanese issue, I guess I don't fully understand how the wordwrap introduces any problems with Japanese characters. If they wrap anyway, then this should be a no-harm, no-foul situation?

Here's the updated fix, which also works for PCRE installations without Unicode support:

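/**
 * Converts a string from the language pack's charset to UTF-8, or back again.
 *
 * @param string The string to convert
 * @param boolean True to convert to UTF-8, false to convert back from UTF-8
 * @return string The converted string (returned unchanged if the charset is already UTF-8 or no converter is available)
 */
function convert_through_utf8($str, $to=true)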
{
	global $lang;
	static $charset;
	static $use_mb;
	static $use_iconv;
	
	if(!isset($charset))
	{
		$charset = my_strtolower($lang->settings['charset']);
	}
	
	if($charset == "utf-8")
	{
		return $str;
	}
	
	if(!isset($use_iconv))
	{
		$use_iconv = function_exists("iconv");
	}
	
	if(!isset($use_mb))
	{
		$use_mb = function_exists("mb_convert_encoding");
	}
	
	if($use_iconv || $use_mb)
	{
		if($to)
		{
			$from_charset = $lang->settings['charset'];
			$to_charset = "UTF-8";
		}
		else
		{
			$from_charset = "UTF-8";
			$to_charset = $lang->settings['charset'];
		}
		if($use_iconv)
		{
			return iconv($from_charset, $to_charset."//IGNORE", $str);
		}
		else
		{
			return @mb_convert_encoding($str, $to_charset, $from_charset);
		}
	}
	elseif($charset == "iso-8859-1" && function_exists("utf8_encode"))
	{
		if($to)
		{
			return utf8_encode($str);
		}
		else
		{
			return utf8_decode($str);
		}
	}
	else
	{
		return $str;
	}
}

/**
 * Replacement function for PHP's wordwrap(). This version does not break up HTML tags, URLs or Unicode references.
 *
 * @param string The string to be word wrapped
 * @return string The word wrapped string
 */
function my_wordwrap($message)
{
	global $mybb;

	// Word wrapping disabled: return the message untouched.
	if($mybb->settings['wordwrap'] <= 0)
	{
		return $message;
	}

	$message = convert_through_utf8($message);

	// Try Unicode mode (/u) first so that whole characters are counted, not bytes.
	// &#8203; is a zero-width space: an invisible break opportunity for the browser.
	if(!($new_message = @preg_replace("#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#u", "$0&#8203;", $message)))
	{
		// PCRE without Unicode support (or invalid UTF-8 input): fall back to byte mode.
		$new_message = preg_replace("#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#", "$0&#8203;", $message);
	}

	return convert_through_utf8($new_message, false);
}
#6
The Japanese characters wrap anyway, yes. However, my_wordwrap() inserts characters into Japanese text even though it wraps anyway. It is not necessary for my_wordwrap() to insert anything there, yet it does because it does not know better. Thus it breaks the browser's "find" for no reason, which is not so nice. To fix this, my_wordwrap() would need to learn to ignore all characters that wrap anyway, not just the Latin space, dot and comma ones.

Or in other words, the my_wordwrap() function sometimes thinks that a string can not be wrapped, when in fact it can in some places (word boundaries). Inserting new wrap points at some random place in such a string can then cause the browser to wrap in a wrong place. This is unlikely to happen with Japanese as Japanese is OK to wrap (almost) anywhere, but it could be an issue with other languages.

To make an example in English, imagine the sentence "The quick brown fox jumps over the lazy dog." translated to some other language / script and suddenly my_wordwrap does not recognize the word boundaries anymore and inserts a wrap character in the middle of the word. And the browser may suddenly display that sentence as "The quick brown fox jum
ps over the lazy dog.". Without my_wordwrap() intervention the browser would have wrapped it between fox and jumps or between jumps and over but not in the middle of jumps. All because my_wordwrap() does not recognize wrappable characters the same way as browsers do (which leads us to the question, how do browsers do it?).

But I guess as long as no one complains this is just perfectionism. I'm happy with the fix except for the Firefox find breakage, but that's not a showstopper.
#7
(2008-11-10, 01:16 AM)frostschutz Wrote: The Japanese characters wrap anyway, yes. However, my_wordwrap() inserts characters into Japanese text even though it wraps anyway. It is not necessary for my_wordwrap() to insert anything there, yet it does because it does not know better. Thus it breaks the browser's "find" for no reason, which is not so nice. To fix this, my_wordwrap() would need to learn to ignore all characters that wrap anyway, not just the Latin space, dot and comma ones.

Wordwrap is not applied unless there is a continuous run of characters longer than the limit. Is there a Japanese word longer than, like, 40 characters? If so, who would search for it?

It doesn't break long sentences. It only breaks long words (or just long groups of non-breaking characters)

like aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
#8
It recognizes Japanese text as one long word though, as you can see in the example in the first post here. That is because Japanese does not use Latin word boundaries such as spaces or ASCII punctuation, but rather comes with its own characters that fulfill the same role, e.g. 。、「」！？ There is even a whole set of Latin letters for use in Japanese text that is different from the original Latin letters, e.g. ＡＢＣ ａｂｃ instead of ABC abc.

So to my_wordwrap(), text in a different script looks just like one long word, even though to someone who can read that sentence it has proper words, sentences, punctuation, etc.
#9
Alright, so do we just try to auto-detect the encoding and, if it is Japanese, skip any splitting?

I'm not too sure how well that will work.
#10
It would work for Japanese. Other languages I don't know. I'll try to find out how browsers decide what they wrap and what they don't wrap, and see if it can be applied to a PHP regular expression easily. It's probably not possible. :-)
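
As a first rough idea of what "ignore Japanese" could look like in a regular expression, here is a sketch that assumes PCRE with Unicode support. It treats Han, Hiragana and Katakana characters as break characters via Unicode script properties, so runs of Japanese are never counted. `wrap_skip_cjk()` is a hypothetical name, and this is certainly not how browsers decide where to wrap.

```php
<?php
// Hypothetical sketch: characters from the listed scripts end a run, just like
// whitespace does, so text written in them is never touched.
function wrap_skip_cjk(string $text, int $limit): string
{
    $pattern = '#(?>[^\s&/<>"\-.\[\]\p{Han}\p{Hiragana}\p{Katakana}]{' . $limit . '})#u';

    // "\xE2\x80\x8B" is U+200B ZERO WIDTH SPACE as raw UTF-8 bytes.
    return preg_replace($pattern, '$0' . "\xE2\x80\x8B", $text);
}

echo wrap_skip_cjk('日本語は言語である', 4), "\n"; // unchanged
echo wrap_skip_cjk('ääääääää', 4), "\n";           // umlauts are still wrapped
```

Unlike the byte-range hack, this keeps wrapping umlauts and other non-CJK characters, but it only covers the scripts that are listed explicitly.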

A user-friendly "workaround" would probably be a custom text field in the Admin CP where I can tell the word wrapper which characters to ignore. In my case I could then add the Japanese punctuation characters to it, and probably even normal characters, which would effectively cause Japanese text to be ignored by the word wrapper while it stays operational for English / German text in the same forum.
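
Roughly what I have in mind, as a sketch only. `wrap_with_ignored()` and its `$ignoredChars` parameter are made up for this example (no such setting exists in the Admin CP), and it assumes PCRE with Unicode support. The configured characters are simply appended to the negated character class.

```php
<?php
// Hypothetical sketch: $ignoredChars would come from an admin setting.
function wrap_with_ignored(string $text, int $limit, string $ignoredChars): string
{
    // preg_quote() escapes regex metacharacters (including "-", "]" and "^"),
    // so arbitrary characters are safe inside the character class.
    $extra   = preg_quote($ignoredChars, '#');
    $pattern = '#(?>[^\s&/<>"\-.\[\]' . $extra . ']{' . $limit . '})#u';

    // "\xE2\x80\x8B" is U+200B ZERO WIDTH SPACE as raw UTF-8 bytes.
    return preg_replace($pattern, '$0' . "\xE2\x80\x8B", $text);
}

$text = '日本語、日本語。日本語';

// Without the setting, Japanese punctuation counts towards the run length.
$plain = wrap_with_ignored($text, 4, '');

// With the punctuation listed, each run is only three characters long.
$tuned = wrap_with_ignored($text, 4, '、。');
```

With `'、。'` configured, the sample is left untouched, while long runs of Latin characters would still be wrapped.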

But that would probably be too much effort and it's still not the "proper" solution.

