2009-01-09, 04:35 PM
This function splits a string into one or more encoded words, so that UTF-8 characters can be used in mail header fields such as Subject, From and To. Encoded words are described in RFC 2047. An encoded word must not be longer than 75 characters, so the string that needs to be encoded (e.g. a thread subject) has to be split.
Among other things the RFC says the following regarding the splitting:
RFC2047 Wrote: Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.
Unfortunately, MyBB's current implementation of encoded words does just this: it splits the string at byte position x regardless of whether that breaks a UTF-8 multibyte sequence. Some mail clients are nice enough to deal with this and display the correct characters anyway, others give you ?? instead of the correct UTF-8 character. Since several splits may be necessary in one header string, you can end up with several broken characters.
In Drupal (a CMS written in PHP under the GPL license) this problem is avoided by adding backtracking to the splitting algorithm, like this:
drupal/includes/unicode.inc:
while (--$len >= 0 && ord($string[$len]) >= 0x80 && ord($string[$len]) < 0xC0) {};
return substr($string, 0, $len);
This is a crude way of detecting whether a multibyte character would be split: bytes in the range 0x80 to 0xBF are UTF-8 continuation bytes, so the loop reduces $len until it no longer points into the middle of a character. MyBB lacks such a check, hence it breaks UTF-8.
To demonstrate the problem, create a thread with this (pointless) subject. The dots are there to shift the multibyte UTF-8 characters by one byte.
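To make the idea concrete, here is a minimal sketch of a splitter that uses the same backtracking. The function names `utf8_safe_cut()` and `encode_header_words()` are made up for this post and are neither MyBB's nor Drupal's actual code. The 45-byte chunk size is my own arithmetic: 75 characters minus 12 for `=?utf-8?B?` and `?=` leaves 63, and the largest multiple of 4 that fits is 60 base64 characters, i.e. 45 raw bytes.

```php
<?php
// Hypothetical helper: largest cut position <= $len that does not fall
// inside a UTF-8 multibyte sequence.
function utf8_safe_cut(string $string, int $len): int
{
    if (strlen($string) <= $len) {
        return strlen($string);
    }
    // If the byte at the cut position is a continuation byte (10xxxxxx),
    // we are in the middle of a character, so back up.
    while ($len > 0 && (ord($string[$len]) & 0xC0) === 0x80) {
        $len--;
    }
    return $len;
}

// Hypothetical helper: split $text into base64 encoded words, each holding
// at most $maxBytes raw bytes and a whole number of UTF-8 characters.
function encode_header_words(string $text, int $maxBytes = 45): array
{
    $words = [];
    while ($text !== '') {
        $cut = utf8_safe_cut($text, $maxBytes);
        if ($cut === 0) {
            // Only reachable with malformed input; cut blindly so the loop ends.
            $cut = min($maxBytes, strlen($text));
        }
        $words[] = '=?utf-8?B?' . base64_encode(substr($text, 0, $cut)) . '?=';
        $text = (string) substr($text, $cut);
    }
    return $words;
}
```

Calling `encode_header_words($subject)` returns an array of encoded words that still have to be joined with whitespace before they go into the header.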
.ÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜ
If you subscribe to this thread, MyBB will send you a mail with this subject:
New Reply to .ÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜ
A mail client may display this as:
New Reply to .ÄÖÜäöüÄÖÜäöüÄÖÜ��öü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄ��Üäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäö��ÄÖÜäöü.ÄÖÜäöüÄÖÜ
This is what the mail header created by MyBB looks like:
Subject: =?utf-8?B?TmV3IFJlcGx5IHRvIC7DhMOWw5zDpMO2w7zDhMOWw5zDpMO2w7zDhMOWw5zD?= =?utf-8?B?pMO2w7wuw4TDlsOcw6TDtsO8w4TDlsOcw6TDtsO8w4TDlsOcw6TDtsO8w4TD?= =?utf-8?B?lsOcw6TDtsO8LsOEw5bDnMOkw7bDvMOEw5bDnMOkw7bDvMOEw5bDnMOkw7bD?= =?utf-8?B?vMOEw5bDnMOkw7bDvC7DhMOWw5zDpMO2w7zDhMOWw5w=?=
The problem occurs because each word claims to be a UTF-8 encoded string, when half a UTF-8 byte sequence is actually not valid UTF-8. Some mail clients can deal with this (they only decode the words without checking the charset and thus put the split byte sequences back together), others cannot (they try to make sense of each word as UTF-8, which fails for split characters).
On a side note, words are separated by linear-white-space (RFC 822), which allows a CRLF (line break) followed by one or more LWSP (whitespace) characters. This way you can avoid very long lines in headers. The only reason why encoded words have to be split in the first place is to avoid such long lines. Unfortunately, splitting the words is obligatory, while putting in newlines is only for convenience. Since MyBB produces long lines anyway there is little point in splitting the words at all, but it has to be done.
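You can check this yourself with a few lines of PHP. This is only a sketch: `is_valid_utf8()` and `decode_encoded_words()` are names I made up, and the validity check relies on `preg_match()` with the `/u` modifier failing on malformed UTF-8. It decodes the four words from the header above one by one, the way a strict mail client would.

```php
<?php
// Hypothetical helper: true when $bytes is well-formed UTF-8.
// preg_match() with the /u modifier returns false on malformed input.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: base64-decode the payload of every
// "=?utf-8?B?...?=" word in a header value, each word on its own.
function decode_encoded_words(string $headerValue): array
{
    preg_match_all('~=\?utf-8\?B\?([A-Za-z0-9+/=]*)\?=~i', $headerValue, $m);
    return array_map('base64_decode', $m[1]);
}

// The Subject value from the MyBB mail shown above.
$header = implode(' ', [
    '=?utf-8?B?TmV3IFJlcGx5IHRvIC7DhMOWw5zDpMO2w7zDhMOWw5zDpMO2w7zDhMOWw5zD?=',
    '=?utf-8?B?pMO2w7wuw4TDlsOcw6TDtsO8w4TDlsOcw6TDtsO8w4TDlsOcw6TDtsO8w4TD?=',
    '=?utf-8?B?lsOcw6TDtsO8LsOEw5bDnMOkw7bDvMOEw5bDnMOkw7bDvMOEw5bDnMOkw7bD?=',
    '=?utf-8?B?vMOEw5bDnMOkw7bDvC7DhMOWw5zDpMO2w7zDhMOWw5w=?=',
]);

$parts = decode_encoded_words($header);
foreach ($parts as $i => $part) {
    printf("word %d valid on its own: %s\n", $i + 1, is_valid_utf8($part) ? 'yes' : 'no');
}
printf("all words joined: %s\n", is_valid_utf8(implode('', $parts)) ? 'valid' : 'invalid');
```

The first word ends in a lone 0xC3 lead byte and each of the following words starts with a continuation byte, so every word should be reported as invalid on its own, while the joined bytes are valid UTF-8.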
Although it makes for nice looking headers like this:
Subject: New Reply to =?utf-8?B?44Gy44KJ44GM44Gq?=
=?utf-8?B?44Kr44K/44Kr44OK5ryi5a2X44Gy44KJ44GM44Gq44Kr44K/44Kr44OK5ryi?=
=?utf-8?B?5a2X44Gy44KJ44GM44Gq44Kr44K/44Kr44OK5ryi5a2X44Gy44KJ44GM44Gq?=
=?utf-8?B?44Gy44KJ44GM44Gq44Kr44K/44Kr44OK5ryi?=
MyBB could do that too, so that all the effort of splitting the words actually pays off. But then again, no one really looks at mail headers these days, except when something doesn't work as expected.
Those RFCs are already 10 to 20 years old; it's sad to see that they still have to be implemented manually in PHP...
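Folding itself takes only a few lines. The sketch below uses a made-up function name, `fold_header()`, and is not MyBB code. It puts as many words on a line as fit and starts a continuation line (CRLF plus one space) otherwise. The 76-character default is the line limit RFC 2047 gives for header lines that contain encoded words.

```php
<?php
// Hypothetical helper: build "Name: word word ..." and fold it so that no
// line is longer than $maxLine characters (assuming no single word is).
function fold_header(string $name, array $words, int $maxLine = 76): string
{
    $lines = [];
    $current = $name . ':';
    foreach ($words as $word) {
        if (strlen($current) + 1 + strlen($word) > $maxLine) {
            $lines[] = $current;      // close the current line
            $current = ' ' . $word;   // continuation line starts with one space
        } else {
            $current .= ' ' . $word;  // word still fits on this line
        }
    }
    $lines[] = $current;
    return implode("\r\n", $lines);
}
```

Because whitespace between adjacent encoded words is ignored when the header is displayed, the extra line breaks do not change what the recipient sees.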