MyBB Community Forums

Full Version: 4 byte UTF-8 characters
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2 3 4 5
https://github.com/mybb/mybb/commit/21a6...23117a6ac5

Seems like a really ugly fix. If it's a DB issue, why are the changes sprinkled around everywhere instead of being put in the MySQL class?

Though apart from the PM issue (which is really a bug in the PM code), I don't really see how this is particularly a security issue, though it does cause odd behaviour in places. I'd recommend fixing these bugs instead of trying to prevent blank usernames, since you can't simply assume you can catch all possible loopholes (hint: the current codebase doesn't).

Firstly, plug this, inc/datahandlers/pm.php:
			foreach(array("to", "bcc") as $recipient_type)
			{
Add below:
				if(!$pm[$recipient_type])
				{
					continue;
				}

Fixes security issue, plus saves an unnecessary query in most cases.

Next, I'd go through and see which places break with unexpected blank strings (eg user being displayed as Guest).

Oh, and if you're going to block these characters, puh-lease revert the above commit and do it in the right place (MySQL class). Neater code, less gunk spread everywhere, and also "fixes" plugins or other code modifications that use MyBB properly.
You may also wish to consider people who are using the utf8mb4 character set, as they probably want 4 byte UTF-8 characters, or those not even using UTF-8 (yeah, I know MyBB doesn't really support anything other than UTF-8, so moot point perhaps).


I'm guessing the installation here has a latin1 character set, hence it not barfing at this user name. It could've been fun otherwise.


Nevertheless, thanks for the work that went into fix, as well as the focus on security and continued development of the project.
(2013-10-10, 12:13 PM)�?�?�?�?� Wrote: [ -> ]You may also wish to consider people who are using the utf8mb4 character set, as they probably want 4 byte UTF-8 characters, or those not even using UTF-8 (yeah, I know MyBB doesn't really support anything other than UTF-8, so moot point perhaps).
Unfortunately you don't know the encoding of the specific database field (you can have different encodings in the same database) without querying the database structure. Wink
(2013-10-10, 12:25 PM)StefanT Wrote: [ -> ]
(2013-10-10, 12:13 PM)�?�?�?�?� Wrote: [ -> ]You may also wish to consider people who are using the utf8mb4 character set, as they probably want 4 byte UTF-8 characters, or those not even using UTF-8 (yeah, I know MyBB doesn't really support anything other than UTF-8, so moot point perhaps).
Unfortunately you don't know the encoding of the specific database field (you can have different encodings in the same database) without querying the database structure. Wink
But do you really need to know the encoding of the specific database field?
Do you really need to support someone who wishes to mix encodings in really weird ways?

Think about it Toungue
Feel free to make Pull requests to fix the issues Toungue We should fix them at least in 1.8.
(2013-10-10, 01:14 PM)�?�?�?�?� Wrote: [ -> ]Do you really need to support someone who wishes to mix encodings in really weird ways?
A solution must work under all circumstances. There are enough plugins specifying a certain encoding or even no encoding so MySQL falls back to the default. And suddenly there are different encodings in the database... Wink
MySQL supports 4-byte UTF-8, you just have to change the charset from utf8 to utf8mb4. That's what I did on my Japanese language learner's forum where some people try to use such characters on occasion. I haven't seen any ill effects because of it.

Now, the database supports the characters, and this function looks like it's breaking them. It's not a fix, it just replaces one problem with another.

I understand the sentiment (MySQL without utf8mb4 just cuts of the string at the first 4 byte character, causing a ton of other issues), but at the very least there should be a setting to turn this off.

I also think the database class should be the proper place for it. As long as the current occurences do not use completely custom queries but simple update/insert_query a generic check could be done inside this function for strings, if the charset isn't set to utf8mb4. Although one could argue the same for escape_string or making prepared queries in the first place...

Also, the function is quite slow. It rebuilds strings even when it is not necessary to do so.
Quote:MySQL supports 4-byte UTF-8, you just have to change the charset from utf8 to utf8mb4. That's what I did on my Japanese language learner's forum where some people try to use such characters on occasion. I haven't seen any ill effects because of it.

This is the only true solution. Unfortunately we can't do this by default unless we drop support for anything below MySQL 5.5. It's a messy issue and there isn't a simple solution.
Much faster (not to mention shorter) than the above (and it covers 4/5/6 byte UTF-8):

$input = preg_replace("#.[\\x80-\\xBF]{3,}#", "?", $input);

Faster still if you only run it when it's not ASCII (may be checked in other ways):

if(!mb_check_encoding($input, "US-ASCII"))
{
    $input = preg_replace("#.[\\x80-\\xBF]{3,}#", "?", $input);
}

On systems that support 4-byte-UTF-8, the {3,} should be replaced with {4,}. I'm guessing that utf8mb4 fails the same way for 5/6-byte-UTF-8, as utf8mb3 does for 4-byte...

I'd still like a setting to just turn it off and damn the consequences.
(2013-10-10, 01:36 PM)StefanT Wrote: [ -> ]A solution must work under all circumstances.
But the current solution doesn't work under all circumstances.

(2013-10-10, 03:36 PM)frostschutz Wrote: [ -> ]I also think the database class should be the proper place for it. As long as the current occurences do not use completely custom queries but simple update/insert_query a generic check could be done inside this function for strings, if the charset isn't set to utf8mb4. Although one could argue the same for escape_string or making prepared queries in the first place...
I'd argue for escape_string being the proper place, as you'd reasonably expect it to be where string sanitisation is done.
Could break anything that uses a binary column, but I don't think MyBB uses that. Probably still better to have a escape_binary function though.

(2013-10-10, 05:48 PM)frostschutz Wrote: [ -> ]Much faster (not to mention shorter) than the above (and it covers 4/5/6 byte UTF-8):

$input = preg_replace("#.[\\x80-\\xBF]{3,}#", "?", $input);

Faster still if you only run it when it's not ASCII (may be checked in other ways):

if(!mb_check_encoding($input, "US-ASCII"))
{
    $input = preg_replace("#.[\\x80-\\xBF]{3,}#", "?", $input);
}
Works, though I'd feel a bit safer using the proper prefix instead of the '.' match.
Also the second version requires the MB library, which I'm not sure MyBB assumes exists*? I think a preg_replace call is fair in terms of overhead.

* It's probably better to make it a requirement because there's another issue here if anyone can figure it out...
We can't use mb_check_encoding in MyBB 1.6 anyway as it requires PHP 5 >= 5.1.3. And changing the requirements at that point isn't reasonable. Wink
Pages: 1 2 3 4 5