MyBB Community Forums

Full Version: Problem with links
(2009-02-01, 01:05 AM)frostschutz Wrote: [ -> ]You're confusing data structures of a built in function with PHP data structures.

I'm sure that internally the function has a whole array of memory structures that no one wants to know about; it even has to compile the regular expression before it can do any work. None of this matters from PHP's point of view: as it's a builtin function, written in C and compiled to machine code, it's blazingly fast compared to the scripting language. But in the end, preg_replace builds a single PHP string, which in this case is even your ready-to-use end result. All the work was done by the builtin function.

Compared to that, preg_split has to build a data structure that is n times more complex than what preg_replace builds, i.e. not one string, but possibly dozens of PHP strings wrapped in a PHP array/hash. That alone makes it definitely slower than preg_replace. What makes PHP data structures so expensive is that no one knows what they will be used for, so they have to be allocated, initialized, tracked, refcounted, and garbage collected.
Well, your first point says that the compiled code is fast, and your second point says it's slow. Which is it exactly?
preg_replace does all the work by putting it into a string. Similarly, preg_split does "all the work" by putting it into an array.
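To make the contrast concrete, here is a minimal sketch of the two approaches (the 80-character run length and both sample inputs are made up for illustration):

```php
<?php
// One-shot approach: preg_replace scans once and hands back a single
// string, here inserting a zero-width space after every run of 80
// unbroken characters.
$msg = str_repeat('a', 200);
$wrapped = preg_replace('#[^\s]{80}#', "$0\u{200B}", $msg);

// Split approach: preg_split hands back an array of PHP strings that
// script-level code must then walk and glue together again.
$text = 'foo bar baz';
$tokens = preg_split('#(\s+)#', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$result = '';
foreach($tokens as $token)
{
    $result .= $token; // real code would wrap each token here
}
```

Both routes end with a plain string, but the second one allocates one PHP string per token, plus the array holding them, before any of your own logic even runs.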

(2009-02-01, 01:05 AM)frostschutz Wrote: [ -> ]But then that isn't even your ready-to-use end result; you have to use PHP logic, call other PHP functions that build new structures, etc., and finally build the end string. Since PHP is slow, by then you've ended up several orders of magnitude slower than the original preg_replace solution.

In scripting languages, optimizing works by offloading as much work as possible onto as few builtin function calls as possible, usually regardless of how those functions work internally. They're builtin, so even if they do something internally that isn't actually necessary for your problem, it doesn't matter: it's still several orders of magnitude faster than writing better logic in the scripting language itself, because the scripting language is just so slow in comparison.

If you're replacing a single builtin function call with dozens of lines of PHP code and function calls, you're going in the wrong direction. The only thing that makes this negligible is the sheer amount of processing power offered by machines nowadays, so no one notices if something is a helluvalot more expensive than it needs to be.
Not always the case. Do some benchmarks yourself: preg_* functions are extremely expensive, much more so in various cases than using a fair bit of PHP logic to do the task. The reason is that regular expressions are very difficult to evaluate at a lower level, much more difficult than parsing PHP code and evaluating that.
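A quick way to check claims like this is a rough micro-benchmark. This sketch only shows the measurement pattern (the iteration count and input are arbitrary, and the two operations are only loosely equivalent); it is not a rigorous result:

```php
<?php
// Rough timing harness: compare a preg_replace call against a plain
// string function doing comparable (not identical) work on the same input.
$msg = str_repeat('word ', 1000);

$start = microtime(true);
for($i = 0; $i < 200; $i++)
{
    $a = preg_replace('#\s+#', ' ', $msg);
}
$preg_time = microtime(true) - $start;

$start = microtime(true);
for($i = 0; $i < 200; $i++)
{
    $b = str_replace('  ', ' ', $msg);
}
$str_time = microtime(true) - $start;

printf("preg_replace: %.4fs, str_replace: %.4fs\n", $preg_time, $str_time);
```

Which side wins depends heavily on the pattern and the input, which is exactly why benchmarking your own case matters.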
Well, decided to give this a try.
See if this function works for you:
function my_wordwrap($msg, $wrapat=null)
{
	if(!isset($wrapat))
	{
		global $mybb;
		$wrapat = intval($mybb->settings['wordwrap']);
	}
	if($wrapat == 0 || my_strlen($msg) < $wrapat)
	{
		return $msg;
	}
	
	
	$split_msg = @preg_split('#(\</?[a-z0-9\-]+.*?\>|[\s&/"\\-.\[\]]+)#isu', $msg, -1, PREG_SPLIT_DELIM_CAPTURE);
	if($split_msg === false) // preg_split returns false when the /u modifier hits invalid UTF-8
	{
		$split_msg = @preg_split('#(\</?[a-z0-9\-]+.*?\>|[\s&/"\\-.\[\]]+)#is', $msg, -1, PREG_SPLIT_DELIM_CAPTURE);
	}
	
	$new_msg = '';
	$len = 0;
	foreach($split_msg as $i => &$token)
	{
		if($i % 2 == 0)
		{
			$token = unhtmlentities($token);
			$token_len = my_strlen($token);
			if($token_len + $len > $wrapat)
			{
				// consume the token in chunks; the first chunk is shortened
				// by the $len characters already on the current line
				$offset = 0;
				while($token_len - $offset + $len >= $wrapat)
				{
					$new_msg .= htmlspecialchars_uni(my_substr($token, $offset, $wrapat-$len)).'​';
					$offset += $wrapat-$len;
					$len = 0; // since we're inserting break points
				}
				if($offset < $token_len)
				{
					$new_msg .= htmlspecialchars_uni(my_substr($token, $offset));
				}
				
				$len = $token_len - $offset;
			}
			else
			{
				$len += $token_len;
				$new_msg .= htmlspecialchars_uni($token);
			}
		}
		else
		{
			// delimiter
			// only reset length if the delimiter is not a tag
			if($token[0] != '<')
			{
				$len = 0;
			}
			$new_msg .= $token;
		}
	}
	return $new_msg;
}
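The fallback after the first preg_split in the function above exists because the u modifier makes PCRE reject input that isn't valid UTF-8. A minimal sketch of that failure mode (the truncated byte sequence is a made-up sample):

```php
<?php
// "\xC3" on its own is a truncated UTF-8 sequence, so a pattern with
// the /u modifier fails outright and preg_split returns false instead
// of an array.
$bad = "foo \xC3 bar";
$parts = @preg_split('#(\s+)#u', $bad, -1, PREG_SPLIT_DELIM_CAPTURE);
if($parts === false)
{
    // retry without /u, treating the input as raw bytes
    $parts = preg_split('#(\s+)#', $bad, -1, PREG_SPLIT_DELIM_CAPTURE);
}
```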
Works for me, but it's slow; I get page generation times > 1.5 seconds on an idle 2×3 GHz localhost vs 0.1 seconds with the original my_wordwrap (long thread, plenty of links and other formatting). Under thrashing you can get 0.8 seconds for MyBB vs. 12 seconds for your function vs. 0.7 seconds without wordwrap.

The original my_wordwrap is more or less constant in cost regardless of post content, as either way it's just a string passed off to preg_replace. With your solution the content really matters, as each link or bit of formatting adds to the number of logic iterations, function calls, string reallocations, etc.
Like I've stated before this would be more reasonable to do with a post-pre-parser.
(2009-02-13, 05:46 PM)Ryan Gordon Wrote: [ -> ]Like I've stated before this would be more reasonable to do with a post-pre-parser.

I agree, it'll definitely be faster. The wordwrap function above will only get slower as posts grow longer with scattered links; it'll definitely be insufficient.
Correct.

I'm still interested in whether stripping out the links and putting them back in after the wordwrap could work (with a preg_match/preg_replace versus preg_split).
(2009-02-13, 07:53 PM)Ryan Gordon Wrote: [ -> ]Correct.

I'm still interested in whether stripping out the links and putting them back in after the wordwrap could work (with a preg_match/preg_replace versus preg_split).

It'll probably be just as slow, since it's searching for the URLs, removing them, applying the wordwrap, and then bringing them back into the post. I could be wrong though. A better solution would be to create some type of tag such as "<nowrap>" "</nowrap>" and exclude it from the wordwrap, but again we go back to the same problem with speed.
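A sketch of how such a hypothetical <nowrap> tag could work, using an extract-and-restore trick (the tag name, the self-closing placeholder, and the sample message are all made up):

```php
<?php
// Hypothetical <nowrap> handling: pull the protected spans out, run
// the normal wordwrap over the rest, then splice the spans back in.
$message = 'short <nowrap>averyverylongunbreakabletoken</nowrap> tail';

preg_match_all('#<nowrap>.*?</nowrap>#s', $message, $m);
$message = preg_replace('#<nowrap>.*?</nowrap>#s', '<nowrap/>', $message);

// ... the regular wordwrap would run on $message here ...

foreach($m[0] as $span)
{
    $message = preg_replace('#<nowrap/>#', $span, $message, 1);
}
```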
(2009-02-13, 08:20 PM)Rcpalace Wrote: [ -> ]
(2009-02-13, 07:53 PM)Ryan Gordon Wrote: [ -> ]Correct.

I'm still interested in whether stripping out the links and putting them back in after the wordwrap could work (with a preg_match/preg_replace versus preg_split).

It'll probably be just as slow, since it's searching for the URLs, removing them, applying the wordwrap, and then bringing them back into the post. I could be wrong though. A better solution would be to create some type of tag such as "<nowrap>" "</nowrap>" and exclude it from the wordwrap, but again we go back to the same problem with speed.

That actually sounds like a good solution.
(2009-02-13, 08:20 PM)Rcpalace Wrote: [ -> ]
(2009-02-13, 07:53 PM)Ryan Gordon Wrote: [ -> ]Correct.

I'm still interested in whether stripping out the links and putting them back in after the wordwrap could work (with a preg_match/preg_replace versus preg_split).

It'll probably be just as slow, since it's searching for the URLs, removing them, applying the wordwrap, and then bringing them back into the post. I could be wrong though.

Why then does this work fine? (inc/class_parser.php...)

// If MyCode needs to be replaced, first filter out [code] and [php] tags.
if($options['allow_mycode'])
{
	// First we split up the contents of code and php tags to ensure they're not parsed.
	preg_match_all("#\[(code|php)\](.*?)\[/\\1\](\r\n?|\n?)#si", $message, $code_matches, PREG_SET_ORDER);
	$message = preg_replace("#\[(code|php)\](.*?)\[/\\1\](\r\n?|\n?)#si", "<mybb-code>\n", $message);
}

...

if($options['allow_mycode'])
{
	// Now that we're done, if we split up any code tags, parse them and glue it all back together
	if(count($code_matches) > 0)
	{
		foreach($code_matches as $text)
		{
			// Fix up HTML inside the code tags so it is clean
			if($options['allow_html'] != 0)
			{
				$text[2] = $this->parse_html($text[2]);
			}

			if(my_strtolower($text[1]) == "code")
			{
				$code = $this->mycode_parse_code($text[2]);
			}
			elseif(my_strtolower($text[1]) == "php")
			{
				$code = $this->mycode_parse_php($text[2]);
			}
			$message = preg_replace("#\<mybb-code>\n?#", $code, $message, 1);
		}
	}
}
You could optimize that quoted code by not calling preg_replace when $code_matches is empty.
However as it's only a single call it hardly matters.

In most cases this code doesn't do anything besides the two calls to preg_match_all and preg_replace. It only becomes expensive when there actually are code or php elements in a posting; that is the only case where the foreach loop actually gets executed.

As most postings do not contain code, or only very little of it (one in your last posting, maybe 3-4 max in other postings), you won't notice much, especially compared to the cost of syntax highlighting that code.

It's much more likely to come across a posting that contains 100 links, than one that contains 100 PHP snippets. The question is, will you notice an impact of that for loop when it has to do 100 preg_replaces, as opposed to just one or two?

In other words, the current solution for code may work well, but not necessarily scale well.
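If the put-back loop ever did become a bottleneck, one possible variation is a single preg_replace_callback pass that consumes the saved matches in order, instead of N separate preg_replace calls. A hedged sketch (the <placeholder> marker and sample data are made up):

```php
<?php
// Restore all saved matches in one preg_replace_callback pass instead
// of one preg_replace call per match.
$saved = array('<a href="http://example.com/1">', '<a href="http://example.com/2">');
$message = 'see <placeholder> and <placeholder>';

$i = 0;
$message = preg_replace_callback('#<placeholder>#', function($m) use ($saved, &$i)
{
    return $saved[$i++]; // hand back the saved tags in original order
}, $message);
```

This keeps the cost at one scan of the message regardless of how many matches were pulled out.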

On the other hand I just tested how expensive it is to add a couple of hundred code tags to a posting, and it doesn't seem to be a problem, i.e. preg_replace seems to be really fast for simple patterns :) Yay for external functions written in C.

Based on this idea I produced this code:
function my_wordwrap($message)
{
    global $mybb;

    if($mybb->settings['wordwrap'] > 0)
    {
        $message = convert_through_utf8($message);

        preg_match_all("#<a [^>]*>#", $message, $url_matches);
        $message = preg_replace("#<a [^>]*>#", "<a>", $message);

        if(!($new_message = @preg_replace("#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#u", "$0​", $message)))
        {
            $new_message = preg_replace("#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#", "$0​", $message);
        }

        foreach($url_matches[0] as $url)
        {
            $new_message = preg_replace("#<a>#", $url, $new_message, 1);
        }

        $new_message = convert_through_utf8($new_message, false);

        return $new_message;
    }

    return $message;
}
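The heart of this is the atomic-group pattern. Run in isolation with the wordwrap length hard-coded to 10 (both the length and the sample string are made up for illustration), it behaves like this:

```php
<?php
// The wordwrap core in isolation: insert a zero-width space after
// every 10 consecutive characters that contain no natural break point.
// (?>...) is an atomic group, so PCRE never backtracks into a matched run.
$long = 'abcdefghijklmnopqrstuvwxyz';
$out = preg_replace('#(?>[^\s&/<>"\-\.\[\]]{10})#u', "$0\u{200B}", $long);
// breaks are inserted after 'j' (pos 10) and 't' (pos 20);
// the 6-character tail is left alone
```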

Did some extreme condition testing (several thousand links on one page).

Results:
without wordwrapping: 0.74 seconds
with MyBB wordwrapping: 0.84 seconds
with above wordwrapping: 2.19 seconds
with yumi's wordwrapping: 14.9 seconds

This benchmark is not entirely fair, since I also put other formatting (bold, italics, etc.) into the posting, and yumi's solution suffers from all of these. However, in the end all of those alternative solutions are hardly worth it, as the result of the 0.84-second one is actually 99% identical to the 2.19-second one; they only differ in the rare case where a URL contains a segment longer than 80 chars without punctuation.

This in turn led me to change the above code into something that will only remove, and insert back, <a> tags that contain words that would be wordwrapped.

The result is this butt ugly code:

function my_wordwrap($message)
{
    global $mybb;

    if($mybb->settings['wordwrap'] > 0)
    {
        $message = convert_through_utf8($message);

        preg_match_all("#<a [^>]*[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}}[^>]*>#", $message, $url_matches);
        $message = preg_replace("#<a [^>]*[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}}[^>]*>#", "<a>", $message);

        if(!($new_message = @preg_replace("#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#u", "$0​", $message)))
        {
            $new_message = preg_replace("#(?>[^\s&/<>\"\\-\.\[\]]{{$mybb->settings['wordwrap']}})#", "$0​", $message);
        }

        foreach($url_matches[0] as $url)
        {
            $new_message = preg_replace("#<a>#", $url, $new_message, 1);
        }

        $new_message = convert_through_utf8($new_message, false);

        return $new_message;
    }

    return $message;
}

With this code, only links that would actually be word wrapped are touched, and execution time is 0.87 seconds. If you're absolutely intent on fixing the long-URL issue, even though it's a damn rare case as URLs normally contain punctuation anyhow, this is the best solution I've found so far. However, even though this code takes only 0.87 seconds in my test, it would go back up to the original 2.19 seconds (or worse) if those hundreds of links in my test posting were actually links affected by the word wrap issue. In my posting only a fraction of the links were that long, which is why this additional optimization helps a lot.

Still, it's hard to find a real world example of a link that would actually suffer from the current solution. Who puts areallylongcollectionofasciitextintoanurlwithoutanypunctuationwhatsoever? Normally URLs contain punctuation everywhere so the problem does not really exist in the first place.

Something eats the & # 8203 ; when I post here; add it after each $0.