MyBB Community Forums

Full Version: [F] MailHandler::utf8_encode() breaks multibyte characters
This function splits strings into one or more encoded words that can be placed in a mail header to carry UTF-8 characters in Subject/From/To/etc. header fields. These encoded words are described in RFC 2047. An encoded word must not be longer than 75 characters, so the string that needs to be encoded (e.g. a thread subject) has to be split.

Among other things the RFC says the following regarding the splitting:

RFC2047 Wrote:Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.

Unfortunately, MyBB's current implementation of encoded words does just this: it splits the string at byte position x regardless of whether that breaks a UTF-8 multibyte sequence. Some mail clients are nice enough to deal with this and display the correct characters anyway; others give you ?? instead of the correct UTF-8 character. Since several splits may be necessary in a header string, you can end up with even more broken characters.
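
To make the failure concrete, here is a minimal sketch (not MyBB code; the helper name is_valid_utf8() and the sample string are made up for illustration) that cuts a UTF-8 string at a fixed byte offset and checks both halves. It relies on preg_match() with the /u modifier returning false for input that is not valid UTF-8.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
// preg_match() with /u returns false on malformed UTF-8 input.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

// ".ÄÖÜ" written as raw bytes: one ASCII byte, then three 2-byte characters.
$subject = ".\xC3\x84\xC3\x96\xC3\x9C";

// Byte offset 2 lands between the two bytes of "Ä" (0xC3 0x84).
$cut  = 2;
$head = substr($subject, 0, $cut); // ends with a lead byte that lost its continuation
$tail = substr($subject, $cut);    // starts with a stray continuation byte

$whole_ok = is_valid_utf8($subject); // the uncut string is fine
$head_ok  = is_valid_utf8($head);    // the first half is not
$tail_ok  = is_valid_utf8($tail);    // neither is the second half
```

Each half taken on its own is malformed, which is exactly the situation a strict mail client faces when it decodes one encoded word at a time.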

In Drupal (a CMS written in PHP under the GPL license) this problem is avoided by adding backtracking to the splitting algorithm, like this:

drupal/includes/unicode.inc:
while (--$len >= 0 && ord($string[$len]) >= 0x80 && ord($string[$len]) < 0xC0) {};
return substr($string, 0, $len);

This is a crude way of detecting whether a multibyte character would be split: it reduces $len until the problem is resolved. MyBB lacks such a check, hence it breaks UTF-8.
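
The same backtracking idea as a standalone sketch. This is not Drupal's function: the name utf8_safe_cut() is hypothetical, and it returns the adjusted cut position instead of the truncated string. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: largest cut position <= $len that does not fall
// inside a multibyte UTF-8 character. Assumes $string is valid UTF-8.
function utf8_safe_cut($string, $len)
{
    $total = strlen($string);
    $len = min($len, $total);

    // Bytes in the range 0x80-0xBF are continuation bytes. If the byte at
    // the cut position is one of them, the cut would split a character,
    // so step back until it sits on a lead byte or an ASCII byte.
    while ($len > 0 && $len < $total
        && ord($string[$len]) >= 0x80 && ord($string[$len]) < 0xC0) {
        $len--;
    }
    return $len;
}

$s = ".\xC3\x84\xC3\x96";        // ".ÄÖ" as raw bytes (5 bytes)
$a = utf8_safe_cut($s, 2);       // offset 2 is inside "Ä", backs up to 1
$b = utf8_safe_cut($s, 3);       // offset 3 is a character boundary, stays 3
$c = utf8_safe_cut($s, 100);     // beyond the end, clamped to 5
```

The $len < $total check keeps the loop from reading one byte past the end of the string when the requested position is at or beyond the end.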

To demonstrate the problem, create a thread with this (pointless) subject. Dots are in it to offset multibyte UTF-8 characters by 1.

.ÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜ

If you subscribe to this thread, MyBB will send you a mail with this subject:
New Reply to .ÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄÖÜäöü.ÄÖÜäöüÄÖÜ

A mail client may display this as:
New Reply to .ÄÖÜäöüÄÖÜäöüÄÖÜ��öü.ÄÖÜäöüÄÖÜäöüÄÖÜäöüÄ��Üäöü.ÄÖÜäöüÄÖÜäöüÄÖÜäö��ÄÖÜäöü.ÄÖÜäöüÄÖÜ

This is what the mail header created by MyBB looks like:
Subject: =?utf-8?B?TmV3IFJlcGx5IHRvIC7DhMOWw5zDpMO2w7zDhMOWw5zDpMO2w7zDhMOWw5zD?= =?utf-8?B?pMO2w7wuw4TDlsOcw6TDtsO8w4TDlsOcw6TDtsO8w4TDlsOcw6TDtsO8w4TD?= =?utf-8?B?lsOcw6TDtsO8LsOEw5bDnMOkw7bDvMOEw5bDnMOkw7bDvMOEw5bDnMOkw7bD?= =?utf-8?B?vMOEw5bDnMOkw7bDvC7DhMOWw5zDpMO2w7zDhMOWw5w=?=

The problem occurs because each word claims to be a UTF-8 encoded string, when half a UTF-8 byte sequence is actually not valid UTF-8. Some mail clients can deal with this (they only decode the word without checking the charset and thus put the split byte sequences back together), others cannot (they try to make sense of each word as UTF-8, which fails for split characters).
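
The two client behaviours can be reproduced with the four base64 payloads from the header above. This is only a sketch; is_valid_utf8() is a made-up helper that uses preg_match() with the /u modifier as a UTF-8 validity check.

```php
<?php
// The four base64 payloads from the Subject header shown above.
$words = array(
    'TmV3IFJlcGx5IHRvIC7DhMOWw5zDpMO2w7zDhMOWw5zDpMO2w7zDhMOWw5zD',
    'pMO2w7wuw4TDlsOcw6TDtsO8w4TDlsOcw6TDtsO8w4TDlsOcw6TDtsO8w4TD',
    'lsOcw6TDtsO8LsOEw5bDnMOkw7bDvMOEw5bDnMOkw7bDvMOEw5bDnMOkw7bD',
    'vMOEw5bDnMOkw7bDvC7DhMOWw5zDpMO2w7zDhMOWw5w=',
);

// Hypothetical helper: true if $s is well-formed UTF-8.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

// Strict client: interpret every word as UTF-8 on its own.
$each_valid = array();
foreach ($words as $w) {
    $each_valid[] = is_valid_utf8(base64_decode($w));
}

// Lenient client: decode all words to bytes first, then interpret the result.
$joined = '';
foreach ($words as $w) {
    $joined .= base64_decode($w);
}
$joined_valid = is_valid_utf8($joined);
```

None of the four words is valid UTF-8 when checked separately, because every one of them starts or ends in the middle of a character, while the concatenated bytes are valid again.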

On a side note, words are separated by linear-white-space (RFC 822), which allows CRLF (a line break) followed by one or more LWSP (whitespace) characters. This way you can avoid very long lines in headers. The only reason why encoded words have to be split in the first place is to avoid such long lines. Unfortunately, splitting the words is obligatory, while putting in newlines is only for convenience. Since MyBB is producing long lines anyway, there is little point in splitting the words at all, but it has to be done.

Although it makes for nice looking headers like this:
Subject: New Reply to =?utf-8?B?44Gy44KJ44GM44Gq?=
	=?utf-8?B?44Kr44K/44Kr44OK5ryi5a2X44Gy44KJ44GM44Gq44Kr44K/44Kr44OK5ryi?=
	=?utf-8?B?5a2X44Gy44KJ44GM44Gq44Kr44K/44Kr44OK5ryi5a2X44Gy44KJ44GM44Gq?=
	=?utf-8?B?44Gy44KJ44GM44Gq44Kr44K/44Kr44OK5ryi?=

MyBB could do that too, so that the effort of splitting the words actually pays off, but then again, no one really looks at a mail header these days. Except when something doesn't work as expected, anyway.
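
Folding itself is little more than joining the words with a line break plus whitespace. A minimal sketch, assuming the words are already encoded; fold_encoded_words() is a hypothetical name, and whether CRLF or a bare LF is the right separator depends on how the mail is handed to the mail function or SMTP server.

```php
<?php
// Hypothetical helper: put each encoded word on its own header line.
// CRLF followed by a single space is one valid form of linear-white-space.
function fold_encoded_words(array $words)
{
    return implode("\r\n ", $words);
}

// Two short example words ("ÄÖ" and "Üä" in base64), made up for this sketch.
$header = 'Subject: ' . fold_encoded_words(array(
    '=?UTF-8?B?w4TDlg==?=',
    '=?UTF-8?B?w5zDpA==?=',
));

$lines = explode("\r\n", $header);
```

Every continuation line starts with whitespace, which is what tells the receiving side that the header field continues.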

Those RFCs are already 10-20 years old, it's sad to see that they still have to be implemented manually in PHP...
There is not enough information to tell what the error is in the other thread at all.

Quote:2. If the subject of an email contains an umlaut it gets messed up.

Subject: Schönes Wochenende
What you get: =?utf-8?B?U2Now7ZuZXMgV29jaGVuZW5kZSE=?=

That's perfectly alright, =?utf-8?B?U2Now7ZuZXMgV29jaGVuZW5kZSE=?= is a correct representation of "Schönes Wochenende!" in a mail Subject header. The subject is short, so it's not split at all. The problem described here only occurs when the subject is long enough to be split and the split happens to fall in the middle of a UTF-8 multibyte sequence.
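
Anyone who wants to check this can decode the word by hand. A small sketch; decode_b_word() is a made-up helper that only handles a single B-encoded word and returns the raw bytes without converting the charset.

```php
<?php
// Hypothetical helper: decode one "=?charset?B?payload?=" word to raw bytes.
// Returns null if the input does not look like a B-encoded word.
function decode_b_word($word)
{
    if (!preg_match('/^=\?([^?]+)\?[Bb]\?([^?]*)\?=$/', $word, $m)) {
        return null;
    }
    return base64_decode($m[2]);
}

$decoded = decode_b_word('=?utf-8?B?U2Now7ZuZXMgV29jaGVuZW5kZSE=?=');
// $decoded now holds the UTF-8 bytes of "Schönes Wochenende!"
```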

In the log screenshot of the other thread, it does show encoded words that are split though. Maybe the original unencoded subject should be shown there instead of the encoded one, especially since the base64 encoding makes it completely unreadable. However that is just eye candy, not an error with actually sending mails.
How would it affect MyBB emails if we used a fix similar to Drupal's? Would it be smart to implement this in MyBB 1.4.5, or should we leave the fix for 1.6 so it can be tested first? I'm just weighing the risks of shipping a fix we're not 100% sure about.
Fixing this in 1.4.5 wouldn't hurt IMO (at least the --$len hack, so it does not kill UTF-8 chars anymore, seems simple enough, and that's the only issue I'm actually seeing in my mail client and the reason for this report). Personally I'd prefer nicely formatted and readable (quoted-printable instead of base64) mail headers, but PHP is a bit lacking in this field unfortunately... Drupal's implementation is not perfect either.
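
For comparison, this is roughly what the quoted-printable ("Q") flavour of an encoded word looks like. It is only a sketch with a hypothetical function name, q_encode_word(): it encodes conservatively (only letters, digits and a few safe punctuation characters stay literal) and does not split long input into words of at most 75 characters.

```php
<?php
// Hypothetical helper: encode a string as a single RFC 2047 "Q" word.
// Spaces become "_", everything outside a small safe set becomes =XX.
// No length splitting is done here.
function q_encode_word($string, $charset = 'UTF-8')
{
    $encoded = preg_replace_callback(
        '/[^A-Za-z0-9!*+\/ -]/',           // byte-wise match (no /u modifier)
        function ($m) {
            return sprintf('=%02X', ord($m[0]));
        },
        $string
    );
    return '=?' . $charset . '?Q?' . str_replace(' ', '_', $encoded) . '?=';
}

$word = q_encode_word("Sch\xC3\xB6nes Wochenende!");
// $word is "=?UTF-8?Q?Sch=C3=B6nes_Wochenende!?="
```

The ASCII parts stay readable, which is the attraction for mostly-Latin subjects; for subjects that are mostly non-ASCII, base64 is the shorter encoding.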
If I understand correctly, the fix should work like this (see the new function at the bottom)?

The file goes in inc/mailhandlers/
The problem occurs specifically in the utf8_encode() function of MailHandler (inc/class_mailhandler.php), which splits the already base64 encoded string using chunk_split(), which breaks UTF-8. The problem would have to be fixed in this function directly. (I just noticed that utf8_encode is also the name of an official PHP function, is it okay to define a function with the same name in a class? But I guess API changes would have to wait for 1.6)

This code seems to work for me (not taken from but inspired by Drupal). Tested with my umlaut subject above that broke before and with some other mixed latin / utf-8 strings.

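    function utf8_encode($string)
    {
        if(strtolower($this->charset) == 'utf-8'
           && preg_match('/[^\x20-\x7E]/', $string))
        {
            // floor((75 - strlen("=?UTF-8?B??=")) * 0.75) = 47.
            // Note: 46 or 47 bytes base64-encode to 64 characters, which makes
            // a full word 76 characters long; 45 would stay within the limit of 75.
            $chunk_size = 47;
            $len = strlen($string);
            $output = '';
            $pos = 0;

            while($pos < $len)
            {
                $newpos = min($pos + $chunk_size, $len);

                // Back up while the split would land on a UTF-8 continuation
                // byte (0x80-0xBF). The $newpos < $len check avoids reading
                // past the end of the string on the last chunk.
                while($newpos < $len
                      && ord($string[$newpos]) >= 0x80 && ord($string[$newpos]) < 0xC0)
                {
                    // Reduce len until it's safe to split UTF-8.
                    $newpos--;
                }

                $chunk = substr($string, $pos, $newpos - $pos);
                $pos = $newpos;

                // One encoded word per line; the leading space folds the header.
                $output .= " =?UTF-8?B?".base64_encode($chunk)."?=\n";
            }
            return trim($output);
        }
        return $string;
    }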

The headers it creates look like this:
Subject: =?UTF-8?B?TmV3IFJlcGx5IHRvIC7DhMOWw5zDpMO2w7zDhMOWw5zDpMO2w7zDhMOWw5zDpA==?=
 =?UTF-8?B?w7bDvC7DhMOWw5zDpMO2w7zDhMOWw5zDpMO2w7zDhMOWw5zDpMO2w7zDhMOWw5w=?=
 =?UTF-8?B?w6TDtsO8LsOEw5bDnMOkw7bDvMOEw5bDnMOkw7bDvMOEw5bDnMOkw7bDvMOEw5Y=?=
 =?UTF-8?B?w5zDpMO2w7wuw4TDlsOcw6TDtsO8w4TDlsOc?=
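
One detail about the chunk size: base64 always emits whole groups of 4 characters for every 3 bytes (padded at the end), so the first three words above carry 64 base64 characters and come out at 76 characters each, one over the RFC 2047 limit. A rough sketch of the arithmetic, with hypothetical helper names:

```php
<?php
// Hypothetical helper: the largest number of raw bytes whose base64 form
// still fits into one encoded word of at most $max_word_len characters.
function max_chunk_bytes($charset = 'UTF-8', $max_word_len = 75)
{
    $overhead = strlen('=?' . $charset . '?B??=');  // 12 for "UTF-8"
    $payload  = $max_word_len - $overhead;           // 63 characters left
    // Only complete groups of 4 base64 characters can be used,
    // and each group carries 3 bytes.
    return (int) (floor($payload / 4) * 3);
}

// Hypothetical helper: length of the encoded word for $bytes raw bytes.
function encoded_word_length($bytes, $charset = 'UTF-8')
{
    $payload = base64_encode(str_repeat('x', $bytes));
    return strlen('=?' . $charset . '?B?' . $payload . '?=');
}

$max      = max_chunk_bytes();          // 45
$len_at45 = encoded_word_length(45);    // 72
$len_at47 = encoded_word_length(47);    // 76
```

With a chunk size of 45 the backtracking can only make a chunk shorter, so every word would stay at or below 72 characters.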
Yes, it's fine to create a method inside a class with the same name as a PHP function. Calling $class->utf8_encode() versus calling utf8_encode() doesn't override PHP's built-in function; they are two separate things.
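
A quick way to see this, using a made-up class and the built-in strrev() as a stand-in (any built-in function name behaves the same way):

```php
<?php
// Hypothetical class for illustration only.
class DemoHandler
{
    // Shares its name with the built-in strrev(), but lives in the class
    // scope and is only reachable through an object.
    function strrev($string)
    {
        return 'method:' . $string;
    }
}

$demo = new DemoHandler();
$from_method  = $demo->strrev('abc');  // 'method:abc'
$from_builtin = strrev('abc');         // 'cba', the built-in is untouched
```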

And alright, I'll implement that change
Thank you for your bug report.

This bug has been fixed in our internal code repository. Please note that the problem will not be fixed here until these forums are updated.

With regards,
MyBB Group