MyBB Community Forums

Full Version: Problem with links
Yeah, somehow the code didn't come out the way I posted it. Plus it didn't work; it seems the | can't be outsmarted that easily.

What could work is splitting the message on HTML tags, like so:

$message_split = preg_split("/(<.*?>)/u", $message, -1, PREG_SPLIT_DELIM_CAPTURE);

Then wordwrap every element of that array that is not a delimiter (not an HTML tag), i.e. the elements with even index numbers (0, 2, 4, 6, ...).

Then implode() and return the message.
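Put together, the approach described above might look like this. This is only a sketch: plain wordwrap() stands in for whatever UTF-8-safe wrapper would actually be needed, and the 80-character width and plain-space break are made up:

```php
<?php
// Split the message on HTML tags, keeping the tags as delimiters.
$message = '<a href="http://example.com/">' . str_repeat('x', 100) . '</a>';
$message_split = preg_split("/(<.*?>)/u", $message, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($message_split as $i => $piece) {
    // Even indexes are text between tags; odd indexes are the tags themselves.
    if ($i % 2 === 0) {
        $message_split[$i] = wordwrap($piece, 80, ' ', true);
    }
}

$message = implode('', $message_split);
```

With PREG_SPLIT_DELIM_CAPTURE the result strictly alternates text, tag, text, tag, ... (including empty strings), which is what makes the even/odd test reliable.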

Needless to say, this is much more expensive than what is done now, since it replaces one preg_replace call with possibly dozens for every post that contains MyCodes/HTML.

What I tried before was ignoring HTML tags and splitting words in a single regular-expression step, but unfortunately it does not work this way (the regular expression backtracks to find the long word inside the HTML tag even though the tag was already matched as a whole). It would work if it were possible to turn off backtracking somehow; I don't know if PCRE supports that.
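For what it's worth, PCRE does support forbidding backtracking into a subpattern, via atomic groups ((?>...)) and possessive quantifiers (*+, ++). A sketch of the single-pass idea under that assumption, using a closure (PHP 5.3+); the 80-character limit and the plain-space break are made up:

```php
<?php
// The possessive quantifier [^>]*+ keeps the engine from backtracking
// into a tag it has already matched, so overlong "words" are only ever
// matched outside of tags.
$message = '<a href="http://example.com/' . str_repeat('x', 100) . '/">link</a> '
         . str_repeat('y', 100);

$message = preg_replace_callback(
    '/<[^>]*+>|[^\s<]{80,}/u',
    function ($m) {
        // Tags pass through untouched; overlong words get broken up.
        return $m[0][0] === '<' ? $m[0] : wordwrap($m[0], 80, ' ', true);
    },
    $message
);
```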

A cheaper alternative could be to run wordwrap first and then have the filter that recognizes URLs remove the entities that wordwrap put in. But this is nothing short of a hack.

Seeing how this is an extremely rare case, the best option would probably be to do nothing after all.
(2009-01-23, 12:26 PM)frostschutz Wrote: [ -> ]Seeing how this is an extremely rare case the best option would probably be to do nothing after all.

Once we implement a post-parser that might be more conceivable, but doing it on page load for every single post, on-the-fly, would be extremely intensive.
Yeah, well, in the meantime I thought of a way of doing it with 2 preg_replace calls, but it's stupid so meh. :)
I remember we strip out certain bbcodes then place them back in for various reasons. We could do the same in this case. Strip out the actual hrefs before we do the wordwrap, then run it, then place them back in.
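That strip-and-restore idea might look roughly like this. The @href0@ placeholder format is entirely made up for the sketch; anything short enough that wordwrap is guaranteed not to break would do:

```php
<?php
// Replace each href="..." with a short placeholder that wordwrap cannot
// break, wrap, then restore the originals afterwards.
$message = '<a href="http://example.com/' . str_repeat('x', 100) . '/">link</a>';

$hrefs = array();
$message = preg_replace_callback(
    '/href="[^"]*"/u',
    function ($m) use (&$hrefs) {
        $hrefs[] = $m[0];
        return '@href' . (count($hrefs) - 1) . '@'; // hypothetical token
    },
    $message
);

$message = wordwrap($message, 80, ' ', true);

// Put the original hrefs back.
foreach ($hrefs as $i => $href) {
    $message = str_replace('@href' . $i . '@', $href, $message);
}
```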
Ewww, but if you're already doing it anyway, why not.
Actually, I don't see why preg_split would be significantly slower than preg_replace. The regex for preg_split is slightly more complex, but that should be negligible. For each array element, you don't need preg_replace to wordwrap it; PHP's wordwrap function should work reasonably well. The only issue remaining is HTML entities, but a simple workaround is to un-htmlspecialchars, wordwrap, then htmlspecialchars.
Speed-wise, it would be slower, but probably not by a lot.
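Applied to one array element, the entity round-trip would look roughly like this (the 80-character width and plain-space break are made up):

```php
<?php
// Decode entities so wordwrap counts '&amp;' as a single character,
// wrap the decoded text, then re-encode the result.
$piece = str_repeat('&amp;', 100); // decodes to 100 ampersands

$piece = htmlspecialchars(
    wordwrap(htmlspecialchars_decode($piece, ENT_QUOTES), 80, ' ', true),
    ENT_QUOTES
);
```

Without the decode step, wordwrap would count each '&amp;' as five characters and could even cut a break right through the middle of an entity.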
preg_replace builds a single new string, and has to allocate memory for that string only once. preg_split has to allocate memory for a whole array, then you have to perform an operation on each element of the array, and then join it again. And PHP's wordwrap doesn't work for UTF-8, I believe, so we can't use it.

The solution with 2 preg_replace calls would be to match HTML tags or long words in the first expression (so a break character is inserted after every tag [but not inside them] and in long words); the second preg_replace would then remove the break characters that were inserted after HTML tags. In terms of cost it's twice as expensive as the current implementation (whereas the preg_split solution is n times as expensive, with n possibly much larger than 2), but the solution is fugly (you don't remove things you just inserted), so I disregarded it.
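That two-pass idea might look like this, using a plain space as the break character and a made-up 80-character limit (both just for the sketch):

```php
<?php
$message = '<a href="' . str_repeat('x', 100) . '">' . str_repeat('y', 100) . '</a>';

// Pass 1: insert a space after every whole tag and after every run of
// 80 consecutive non-space characters (the possessive [^>]*+ keeps the
// engine out of already-matched tags).
$message = preg_replace('/(<[^>]*+>|[^\s<]{80})/u', '$1 ', $message);

// Pass 2: remove the spaces that pass 1 put after tags. Note that this
// also eats spaces that were already there after a tag, which is
// exactly the "removing things you just inserted" ugliness above.
$message = preg_replace('/(<[^>]*+>) /u', '$1', $message);
```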
It's probably easier to just strip out the hrefs and place them back in. We can implement a pre-parser making the speed a negligible issue.
(2009-01-31, 01:12 PM)frostschutz Wrote: [ -> ]preg_replace builds a single new string. has to allocate memory for that string only once too. preg_split has to allocate memory for a whole array and then you have to do an operation on each of the elements of the array and then join it.
There's very little difference between the two. preg_replace can't guess the memory requirements of the buffer to allocate - it doesn't know whether your pattern will cause the string to become larger or smaller than the original, so it can't preallocate a predefined amount of memory. This means it most likely allocates some memory as a string buffer, then when that overflows, allocates another, larger block, copies the contents of the existing buffer over to the new one, and deallocates the old buffer.
In fact, if you want to think of it that way, preg_split is probably faster, because it can allocate memory for a single string after it knows its length (it will know this once it matches a pattern), then simply add a pointer to it to the linked list, which acts as an array, then continue.
preg_replace could internally do the same thing as preg_split (allocate memory, and attach to a linked list), then simply perform the equivalent of implode() on the "array" to give the final string.

In general however, the difference should be negligible.


I'll admit I didn't think about wordwrap not supporting UTF-8 though. A possibility is to include the delimiters in the preg_split call and use my_substr etc conditionally on the array elements.
You're confusing the data structures of a builtin function with PHP data structures.

I'm sure that internally the function has a whole array of memory structures that no one wants to know about; it even has to compile the regular expression before it can do any work. None of this matters from the PHP point of view, as it's a builtin function, written in C and compiled to machine code, so it's blazingly fast compared to the scripting language. But in the end, preg_replace builds a single PHP string, which in this case is even your ready-to-use end result. All the work was done by the builtin function.

Compared to that, preg_split has to build a data structure that is n times more complex than what preg_replace builds, i.e. not one string, but possibly dozens of PHP strings wrapped in a PHP array/hash. That alone makes it definitely slower than preg_replace. What makes PHP data structures so expensive is that no one knows what they will be used for, so they have to be allocated, initialized, tracked, refcounted, and garbage collected.

But then that isn't even your ready-to-use end result: you have to use PHP logic, call other PHP functions that build new structures, etc., and finally build the end string. Since PHP is slow, by then you are orders of magnitude slower than the original preg_replace solution.

In scripting languages, optimizing works by offloading as much work as possible to as few builtin function calls as possible, usually regardless of how those functions work internally: they're builtin, so even if they internally do something that isn't strictly necessary for your problem, it doesn't matter, because it's still orders of magnitude faster than writing better logic in the scripting language itself, which is just so slow in comparison.

If you're replacing a single builtin function call with dozens of lines of PHP code and function calls, you're going in the wrong direction. The only thing that makes this negligible is the sheer amount of processing power machines offer nowadays, so no one notices if something is a helluvalot more expensive than it needs to be.