2008-11-08, 07:05 PM
The Word Wrapping feature inserts spaces into words that are longer than a configured limit (e.g. 80 characters). However, it doesn't seem to be UTF-8 aware: not only does it insert spaces where none are needed (for UTF-8 text such as CJK that the browser wraps anyway), it also inserts them in the middle of a UTF-8 multibyte sequence, producing garbled characters.
As Word Wrapping seems to be a default setting and it's even active in this forum, here is a Japanese sentence copied and pasted from Japanese Wikipedia, demonstrating the problem.
日本語(にほんご、にっぽんご)は、主として、日本列島で使用されてきた言語である。日本手話を母語とする者などを除いて、ほぼ全ての日本在住者は日本語を第一言語とする。日本国は法令上、公用語を明記していないが、事実上の公用語となっており、学校教育の「国語」で教えられる。使用者は、日本国内を主として約1億3千万人。日本語の文法体系や音韻体系を反映する手話として日本語対応手話がある。
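The garbling can be reproduced outside the forum with a few lines of PHP. This is only a minimal sketch: `wrap_bytes()` and `is_valid_utf8()` are hypothetical helpers, and the pattern is a simplified stand-in for a byte-oriented wrapper, not MyBB's actual regexp.

```php
<?php
// Hypothetical byte-oriented wrapper, NOT MyBB's actual code.
// Without the /u modifier PCRE treats each byte as one character,
// so "a run of $width characters" really means "a run of $width bytes".
function wrap_bytes(string $text, int $width): string
{
    return preg_replace('#[^\s]{' . $width . '}#', '$0 ', $text);
}

// With /u, preg_match() returns false if the subject is not valid UTF-8.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

$sample  = '日本語日本語';          // 6 characters, 18 bytes in UTF-8
$wrapped = wrap_bytes($sample, 4); // a space lands after every 4th byte

var_dump(is_valid_utf8($sample));  // validity of the original string
var_dump(is_valid_utf8($wrapped)); // validity after byte-based wrapping
```

Each of these characters is three bytes long, so a break after every fourth byte necessarily falls inside a character.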
Workaround:
Set Word Wrapping to 0.
What has to be done in order to fix this problem:
1.) Make the word wrapper UTF-8 aware so it does not insert the whitespace character in the middle of a UTF-8 multibyte sequence.
2.) Make the word wrapper smarter so it does not insert spaces into runs of UTF-8 characters (such as CJK) that the browser already wraps on its own.
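Both points can be sketched with PCRE's `/u` modifier, which makes the pattern count code points instead of bytes, plus Unicode script classes to leave CJK runs alone. This is an illustration under assumptions, not a patch for my_wordwrap(): `wrap_utf8()` is a hypothetical name, and the three scripts listed are just one possible choice of "text the browser wraps anyway".

```php
<?php
// Sketch only: a UTF-8-aware wrapper that skips common CJK scripts.
function wrap_utf8(string $text, int $width): string
{
    // /u        : match whole code points, never single bytes.
    // \p{...}   : Han, Hiragana and Katakana end a run, so CJK is never padded.
    // (?=\S)    : only add a space if more non-whitespace text follows.
    $pattern = '#[^\s\p{Han}\p{Hiragana}\p{Katakana}]{' . $width . '}(?=\S)#u';

    $result = preg_replace($pattern, '$0 ', $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // in that case fall back to the unmodified text.
    return $result === null ? $text : $result;
}

$umlauts = str_repeat("\u{00E4}", 10);   // ten times "ä", two bytes each
echo wrap_utf8($umlauts, 4), "\n";       // wrapped by character count
echo wrap_utf8('日本語日本語', 4), "\n";  // CJK is left untouched
```

With this approach umlauts and other non-ASCII letters still count toward the length of a word, which the byte-range workaround cannot do.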
Can you confirm that this is a bug?
Or should Word Wrapping simply not be used at all on multilingual forums?
I posted this in the bug report section and it was moved to general support without comment. So apparently it's not a bug.
I modified my_wordwrap() in inc/functions.php on my server so that it leaves multibyte UTF-8 characters alone but still wraps Latin (ASCII) text. It's a hack, because now it doesn't wrap any non-ASCII text at all, even characters that should count toward a long word (German umlauts, for example). But oh well, it's better than nothing. If it causes trouble again I'll just turn word wrapping off.
After a bit of googling I found this: http://www.php.net/wordwrap
In the comments there are dozens of alternatives to wordwrap that other users wrote.
There are also several functions that claim to be UTF-8 safe, which the MyBB version currently clearly is not.
The most interesting comment was about using zero-width characters instead of spaces.
This way word wrapping can be implemented without actually inserting visible spaces, so when the browser decides it doesn't need to wrap after all, the wrapped text looks the same as if it had never been wrapped. However, I haven't yet tested whether the zero-width character hurts a browser's Find/Search feature, which would be bad (on the other hand, a space hurts it too, but at least the user can see the space and thus understand why the search doesn't work).
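A minimal sketch of the zero-width idea, assuming U+200B (ZERO WIDTH SPACE) as the break character; `wrap_zero_width()` is a hypothetical name and not one of the functions from the php.net comments.

```php
<?php
// Sketch only: offer the browser a break opportunity without a visible space.
function wrap_zero_width(string $text, int $width): string
{
    // /u keeps the count in code points; (?=\S) avoids a trailing break.
    $pattern = '#\S{' . $width . '}(?=\S)#u';

    $result = preg_replace($pattern, '$0' . "\u{200B}", $text);

    // Fall back to the unmodified text if PCRE reports an error.
    return $result === null ? $text : $result;
}

$original = 'abcdefghij';
$wrapped  = wrap_zero_width($original, 4);

// The text looks unchanged, but a plain substring search across a break
// no longer matches, which is the Find/Search concern in string form.
var_dump(strpos($wrapped, 'abcdef'));

// Removing the break characters restores the original text.
var_dump(str_replace("\u{200B}", '', $wrapped) === $original);
```

For HTML output the same character could be written as the entity `&#8203;`. Whether a particular browser's Find feature skips over it still has to be tested.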
Right now, as a workaround, I added \x80-\xFF to the [^] section of the regexp so that it ignores multibyte UTF-8 characters altogether. If I decide to stick with MyBB, I'll write up a proper fix based on the comments I've found so far.
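The effect of that workaround can be sketched as follows. The pattern is a simplified assumption, not the real MyBB regexp, and `wrap_ascii_only()` is a hypothetical name. Every byte of a multibyte UTF-8 sequence is in the range 0x80-0xFF, so excluding that range means a run can only ever consist of ASCII bytes.

```php
<?php
// Sketch only: byte-mode pattern (no /u) whose runs stop at any non-ASCII byte.
function wrap_ascii_only(string $text, int $width): string
{
    // \x80-\xFF in the negated class excludes all lead and continuation
    // bytes of UTF-8, so a space can never land inside a character.
    return preg_replace('#[^\s\x80-\xFF]{' . $width . '}#', '$0 ', $text);
}

echo wrap_ascii_only('abcdefghij', 4), "\n";               // ASCII is still wrapped
echo wrap_ascii_only('日本語日本語', 4), "\n";              // CJK is left alone
echo wrap_ascii_only(str_repeat("\u{00E4}", 10), 4), "\n"; // so are umlauts
```

The last line shows the downside mentioned earlier: long words made of umlauts or other non-ASCII letters are no longer wrapped at all.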