MyBB Community Forums

Full Version: Problem with links
(2009-02-13, 10:38 PM)Ryan Gordon Wrote: [ -> ]Why then does this work fine? (inc/class_parser.php...)

// If MyCode needs to be replaced, first filter out [code] and [php] tags.
if($options['allow_mycode'])
{
    // First we split up the contents of code and php tags to ensure they're not parsed.
    preg_match_all("#\[(code|php)\](.*?)\[/\\1\](\r\n?|\n?)#si", $message, $code_matches, PREG_SET_ORDER);
    $message = preg_replace("#\[(code|php)\](.*?)\[/\\1\](\r\n?|\n?)#si", "<mybb-code>\n", $message);
}

...

if($options['allow_mycode'])
{
    // Now that we're done, if we split up any code tags, parse them and glue it all back together
    if(count($code_matches) > 0)
    {
        foreach($code_matches as $text)
        {
            // Fix up HTML inside the code tags so it is clean
            if($options['allow_html'] != 0)
            {
                $text[2] = $this->parse_html($text[2]);
            }

            if(my_strtolower($text[1]) == "code")
            {
                $code = $this->mycode_parse_code($text[2]);
            }
            elseif(my_strtolower($text[1]) == "php")
            {
                $code = $this->mycode_parse_php($text[2]);
            }
            $message = preg_replace("#\<mybb-code>\n?#", $code, $message, 1);
        }
    }
}
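
For illustration, here is roughly what that round-trip does to a hypothetical message:

$message = "Hello [code]<b>x</b>[/code] world";
// After the first block: $message === "Hello <mybb-code>\n world",
// with $code_matches[0][1] === "code" and $code_matches[0][2] === "<b>x</b>".
// The second block then parses the captured body and swaps it back in
// for the <mybb-code> placeholder.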

Well-made point. I hadn't thought about the BBCode parser, since I developed mine a bit differently for Grestul and haven't looked at the one here. Still, my point remains valid, as I did say I could be wrong. Tongue
I tested my code on my machine with lots of text, and it really wasn't that slow for me. I just retested with the source code of this page (lots of HTML elements, over 110KB of data) and it took 1.7 seconds. Returning directly after the call to preg_split (i.e. doing no processing), it takes 1.4 seconds, so the bulk of the time is spent in the preg_split rather than in the PHP logic.
Interestingly, the "u" modifier seems to be making the preg_* functions slow. Taking it away makes my function execute around 4 times faster.
The MyBB wordwrap on the same data takes around 0.1 seconds, so yes, mine is a fair bit slower, but most posts won't contain so much HTML and are unlikely to be this long.
The problem with your solution is that it only cares about links. My solution aims to fix issues with long URLs in images, or in possible custom MyCodes.

Other reasons my function is slow:
- MyBB's unhtmlentities function is slow: commenting out the line doubles the speed, and using html_entity_decode is much faster; interestingly enough, putting an "if(my_strpos($token, '&') !== false)" guard before it makes it faster still (see the sketch below)
- htmlspecialchars_uni is slow because it calls preg_replace each time; since html_entity_decode already gets rid of the &#...; constructs, we can replace it with plain htmlspecialchars for a speed gain
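
A minimal sketch of that guard idea (decode_token() and the surrounding tokenizer are hypothetical; only the guard itself is the point):

function decode_token($token)
{
    // Cheap check first: a token without '&' cannot contain an entity,
    // so skip the expensive decode entirely.
    if(my_strpos($token, '&') === false)
    {
        return $token;
    }
    // html_entity_decode is faster than MyBB's unhtmlentities here.
    return html_entity_decode($token, ENT_QUOTES, 'UTF-8');
}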



If all you want to do is remove stuff in URLs, a dirty trick is just:
$new_message = preg_replace('~\<([^>]+?)&#8203;([^>]*?)\>~su', '<$1$2>', $new_message);
Obviously it won't be as accurate as my solution, but it should fix things a bit.
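
For example (hypothetical input, assuming the wrap landed inside a tag attribute):

$new_message = '<a href="http://www.example&#8203;.com/">link</a>';
$new_message = preg_replace('~\<([^>]+?)&#8203;([^>]*?)\>~su', '<$1$2>', $new_message);
// $new_message is now '<a href="http://www.example.com/">link</a>'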
My "solution" cared only about links because the issue reported here was originally with links. You can extend it to any HTML by removing the "a " from the pattern. Takes 0.94 seconds then for me, having to consider every html tag with this pointless yet expensive pattern really hurts performance.

Simply removing the &#8203; from inside HTML tags is what I tried first, but I couldn't come up with a pattern that would remove more than one such occurrence per tag in a single run. Of course you could argue that since it's so highly unlikely for one such thing to occur inside an HTML tag, it's even less likely for there to be two of those bastards in the wrong place, so you could just run the replace once. However, that's not a real solution.

Code would look something like this:
function my_wordwrap($message)
{
    global $mybb;

    if($mybb->settings['wordwrap'] > 0)
    {
        $message = convert_through_utf8($message);

        // Insert &#8203; after every run of 'wordwrap' consecutive characters
        // that offer no natural break point; fall back to the non-Unicode
        // pattern if the /u match fails.
        if(!($new_message = @preg_replace("#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#u", "$0&#8203;", $message)))
        {
            $new_message = preg_replace("#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#", "$0&#8203;", $message);
        }

        $newlen = strlen($new_message);

        // Strip any &#8203; that landed inside an HTML tag; each pass removes
        // at most one per tag, so loop until the length stops changing.
        do
        {
            $len = $newlen;
            $new_message = preg_replace('/<([^>]*?)&#8203;([^>]*)>/u', '<$1$2>', $new_message);
            $newlen = strlen($new_message);
        } while($newlen != $len);

        $new_message = convert_through_utf8($new_message, false);

        return $new_message;
    }

    return $message;
}

Takes 0.96 seconds for me.
Those 0.96 seconds can actually be reduced to 0.93 by replacing the first [^>]*? with [^>]{$wordwrap,}, as that's the minimum number of characters that have to be there before the &#8203; entity can occur; I guess internally it lets preg_replace discard too-short tags more quickly. (A sketch of the tweak follows below.)
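
A sketch of that tweak (assuming $wordwrap holds the setting value):

$wordwrap = (int)$mybb->settings['wordwrap'];
// Require at least $wordwrap characters before the entity, so tags too
// short to contain an inserted &#8203; are rejected early.
$new_message = preg_replace("/<([^>]{{$wordwrap},})&#8203;([^>]*)>/u", '<$1$2>', $new_message);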
(2009-02-14, 04:24 AM)frostschutz Wrote: [ -> ]My "solution" cared only about links because the issue reported here was originally about links. You can extend it to any HTML tag by removing the "a " from the pattern. It takes 0.94 seconds then for me; having to run every HTML tag through this pointless yet expensive pattern really hurts performance.
Then you'd probably have to put the "u" modifier in as well, which will make it even slower...
But thanks for the contribution.

(2009-02-14, 04:24 AM)frostschutz Wrote: [ -> ]Simply removing the &#8203; from inside HTML tags is what I tried first, but I couldn't come up with a pattern that would remove more than one such occurrence per tag in a single run.
Ah, didn't consider that.
Does this work?
$new_message = preg_replace('~\<(([^>]+?)&#8203;)+([^>]*?)\>~su', '<$2$3>', $new_message);
I can see it running slower, though.
Yet another variant, takes 0.90 seconds for me.
function my_wordwrap_callback($matches)
{
    if($matches[1])
    {
        // long HTML tag, return unchanged.
        return $matches[1];
    }

    // long word, append spacer.
    return "{$matches[2]}&# 8203;";
}

function my_wordwrap($message)
{
    global $mybb;

    if($mybb->settings['wordwrap'] > 0)
    {
        $message = convert_through_utf8($message);

        if(!($new_message = @preg_replace_callback("#(<[^>]{{$mybb->settings['wordwrap']},}>)|([^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#u",
                                                  'my_wordwrap_callback', $message)))
        {
            $new_message = preg_replace_callback("#(<[^>]{{$mybb->settings['wordwrap']},}>)|([^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#",
                                                  'my_wordwrap_callback', $message);
        }

        $new_message = convert_through_utf8($new_message, false);

        return $new_message;
    }

    return $message;
}

I like this solution best because it does not insert crap into HTML tags in the first place. I expected callbacks to run a lot slower, since calling PHP functions is slow compared to preg_replace's internal string replacing. Yet it takes only 0.9 seconds even though the callback is called hundreds of times on my test thread; it's amazing. Smile
Do like I do...

Grab all the messages in the thread that are going to be displayed in that page view and run all of them through the word wrap at one time (but use a better word wrap / line wrap), splitting each post with a uniqid id: one preg_ call for all posts (see the sketch below). Sure, you have to mod the code, but it will increase your board's rendering speed a whole lot; with 1 post or 50 posts you will not see that much difference in speed. Remember, PCRE's biggest downfall is compilation. And you shouldn't use the "u" modifier, because its backend is just an awful hack (look at the source code and you'll see it). I mean, it doesn't even use Perl's standard Unicode parser, so it's very slow! You are better off converting the Unicode with a standard input parser. I changed that too, because it's sort of silly for every thread displayed to do the same encoding / decoding over and over again.
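
A rough sketch of that batching idea ($posts and the my_wordwrap call are placeholders for whatever you use):

// Join all post bodies with a marker that cannot occur in them,
// run the expensive wrap once, then split back into single posts.
$marker = "\x01" . uniqid('', true) . "\x01";
$joined = implode($marker, $posts);
$joined = my_wordwrap($joined);   // one preg_ pass instead of one per post
$posts = explode($marker, $joined);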

I like MyBB a lot, but the object of development is to remove things from your code base, not to add repetitive logic that only makes the application take longer. As I said, MyBB is really cool and I like it a lot, but I really think it's time for a logical code review!
Quote:splitting each post with a uniqid id

That sucks.

Quote:Remember, PCRE's biggest downfall is compilation.

If PHP is too stupid to cache and re-use previously compiled regexes like everybody else does, that's true.

Quote:you shouldn't use the "u" modifier, because its backend is just an awful hack

It's PHP, so what are you complaining about?

Quote:You are better off converting the Unicode with a standard input parser.

How is that supposed to work?
(2009-03-01, 10:39 AM)frostschutz Wrote: [ -> ]

Quote:If PHP is too stupid to cache and re-use previously compiled regexes like everybody else does, that's true.

You make no sense at all; how in the world can the compiler cache a dynamic pattern? It can't, no matter what language you code in. The reason is that there is no way the compiler can determine whether the variable used in the expression will change during execution. It's dynamic, not fixed.

Quote:How is that supposed to work?

Data normalization is the key to application development. One should never have to use different functions (mb_*, iconv_*) to process internal application strings, because the strings should be normalized when they enter the application. It's better to handle the normalization when a new post, thread, PM or whatever data enters the application. That way you only need the standard core functions, i.e. you wouldn't need to do things like this:

if function_exists mb_... do this
else if function_exists iconv_... do this
else oh no we can't do anything because we didn't normalize the string
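
A wrapper like MyBB's my_strtolower looks roughly like this (simplified sketch):

function my_strtolower($string)
{
    // Prefer the multi-byte-aware version when the host provides it,
    // otherwise fall back to the single-byte core function.
    if(function_exists('mb_strtolower'))
    {
        $string = mb_strtolower($string);
    }
    else
    {
        $string = strtolower($string);
    }
    return $string;
}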

What's better?

strtolower ( normalized string )

- 25 posts in a thread: 75 calls to strtolower (called 3 times per post)
- 0 conditions checked; just lowercase the normalized string
- extra memory allocated = negligible; the new string overwrites the old string

or

my_strtolower ( unnormalized string )

- 25 posts in a thread: 75 unneeded function calls (called 3 times per post)
- 75 unneeded conditions: if function_exists mb_strtolower, then mb_strtolower, else strtolower
- a $string variable in the function
- extra memory allocated = ($string size * 8 bytes) per function call

That's just for string-to-lowercase handling; multiply that by all the other functions that are used because incoming data was not normalized!


Normalization means that you don't rely on function-based normalization.

For example: mb_*.

The mb_ functions convert from one Unicode encoding type to another. So they should not be used to normalize data, because you are not guaranteed that all the data in the string being encoded or decoded is in the same encoding. The mb_ function will only encode or decode the entities that match encode_from / encode_to. That results in unnormalized data, because not all the data is necessarily converted to its single-byte representation. The only way to normalize data is to use the mappings provided by the functions chr(), ord(), hexdec() and dechex(), because those have all the Unicode mappings to normalize every character, whether it is one, two, four or eight bytes wide.

By normalizing data, valid UTF-8 characters get placed into a single-byte normalized character; characters that don't fit into the UTF-8 single-byte range (i.e. double-byte characters, UTF-16) are assigned mappings into the UTF-8 range. So while it's really a double-byte character, it still maps to the UTF-8 single-byte range, allowing both UTF-8 (single-byte) and UTF-16 (multi-byte Unicode) to map to each other. After converting all the data, no matter whether it's UTF-8 or UTF-16, it is normalized. That allows you to use all the normal str_* and preg_* core functions on your normalized string, which removes the need for all those functions (mb_*, iconv_*) that should only be used for converting normalized data being output in the client's preferred encoding type!
Regexp patterns are usually static strings such as "#\s*([a-z0-9a+\\\[\]\-\"=_:>\*\.\#\,\s\(\)\|~\^]+)(\s*)\{(\n*)#isu"; if you call such a pattern in a loop, you compile it once and then cache it. The input is still dynamic, of course.
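
Illustrative sketch ($posts is hypothetical): even a pattern built from a setting is a fixed string once the setting is read, so PHP's PCRE layer compiles it once and reuses the compiled form for every call with the same string.

$pattern = "#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#u";
foreach($posts as $i => $post)
{
    // Same pattern string every iteration, so the compiled regex is reused.
    $posts[$i] = preg_replace($pattern, "$0&#8203;", $post);
}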

Regarding normalization, I don't follow you. How do you represent >5000 characters (and that's only CJK) in a single byte? Those if function_exists constructs are a nuisance, but they're meant as a fallback to support more limited hosts. I think higher requirements are planned for MyBB, but I'm not sure.

Regarding optimization, joining / splitting strings does not come for free either, and it's error-prone (unique delimiter, parsing issues, etc.). Running lowercase once on a huge string can be more expensive and use more memory than doing it in 75 smaller steps. Apart from the performance issue, throwing separate strings (postings, so the strings come from the database as separate rows / fields) into one huge string only to split it back into the original slabs later is something you just don't do, even if you could save a handful of CPU cycles by doing so.

Anyway, we've strayed quite a bit from the original topic here, which was kinda pointless to begin with. Several solutions have been posted; take them or leave them. There are more important issues around than URLs that only break if you enable an optional feature, and even then only extremely rarely.