(2009-03-01, 10:39 AM) frostschutz Wrote:
Quote:Splitting each post with uniqid id.
That sucks.
Quote:Remember pcre biggest downfall is compilation.
If PHP is too stupid to cache and re-use previously compiled regexes like everybody else does, that's true.
Quote:you shouldn't use the U modifier because its backend is just an awful hack
It's PHP, so what are you complaining about?
Quote:You are better off converting the unicode with a standard input parser.
How is that supposed to work?
Quote:If PHP is too stupid to cache and re-use previously compiled regexes like everybody else does, that's true.
You make no sense at all. How in the world can the compiler cache a dynamic pattern? It can't, no matter what language you code in. The reason is that there is no way the compiler can determine whether the variable used in the expression will change during execution. It's dynamic, not fixed.
Quote:How is that supposed to work?
Data normalization is the key to application development. One should never have to use different functions (mb_, iconv_) to process internal application strings, because the strings should be normalized when they enter the application. It's better to handle the normalization when a new post, thread, PM or any other data enters the application. That way you only need to use the standard core functions (i.e., you wouldn't need to do things like)...
if function_exists mb_... do this
else if function_exists iconv_... do this
else oh no we can't do anything because we didn't normalize the string
What's better...
strtolower ( normalized string )
25 posts in a thread...
75 calls to strtolower (called 3 times per post)
0 conditions asked, just lower case the normalized string
extra memory allocated = negligible; new string overwrites old string
or
my_strtolower ( unnormalized string )
25 posts in a thread...
75 unneeded function calls (called 3 times per post)
75 unneeded if (conditions) if function_exists mb_, mb_strtolower else strtolower
$string variable in function
extra memory allocated = ($string size * 8 bytes) per function call
That's just for lower-casing a string. Multiply that by all the other functions that are used because incoming data was not normalized!
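The comparison above can be sketched roughly as follows. This is written in Python rather than PHP, purely for illustration, and it is only a sketch under assumptions: `normalize_input` is a hypothetical helper name, the input is assumed to arrive as bytes with a known declared encoding, and Unicode NFC is assumed as the normal form. The point is just the shape of the idea: decode and normalize once at the boundary, then call the plain core function everywhere else with no per-call checks.

```python
import unicodedata


def normalize_input(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Hypothetical boundary helper: decode once, then apply Unicode NFC."""
    # Bytes that are invalid in the declared encoding become U+FFFD
    # instead of raising, so the rest of the application never sees raw bytes.
    text = raw.decode(declared_encoding, errors="replace")
    return unicodedata.normalize("NFC", text)


# Entry point: every incoming post is normalized exactly once.
incoming = ["HÉLLO".encode("utf-8"), "Wörld".encode("utf-8")]
posts = [normalize_input(p) for p in incoming]

# Everywhere else: the standard function, no "does mb_ exist?" branching.
lowered = [p.lower() for p in posts]
```

With this layout the cost of normalization is paid once per incoming post, not once per string operation.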
Normalization means that you don't rely on function-based normalization.
For example...
mb_
The mb_ function list converts from one Unicode encoding to another. It should not be used to normalize data, because you are not guaranteed that all the data in the string being encoded or decoded is in the same encoding. The mb_ function will only encode or decode the entities that match encode_from and encode_to. That results in unnormalized data, because not all of the data is necessarily converted to its single-byte representation.
The only way to normalize data is to use the mappings provided by the functions chr(), ord(), hexdec() and dechex(), because those functions have all the Unicode mappings needed to normalize every character, whether it is one, two, four or eight bytes wide.
By normalizing data, valid utf-8 characters get placed into a single-byte normalized character, and characters that don't fit into the utf-8 single-byte range (i.e., double-byte utf-16 characters) are assigned mappings to the utf-8 range. So while it is really a double-byte character, it still maps to the utf-8 single-byte range. That allows utf-8 (single byte) and utf-16 (Unicode multi-byte) to map to each other.
So after converting all the data, no matter whether it's utf-8 or utf-16, it is normalized. That allows you to use all the normal str_, preg and core functions on your normalized string, which removes the need for all those functions (mb_, iconv) that should only be used for converting normalized data being output in the client's preferred encoding!
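The mixed-encoding concern above can be illustrated with a small sketch. Again this is Python rather than PHP, for illustration only; the byte string `mixed` and the helper `try_decode` are made up for the example. It shows what happens when a single from-encoding is assumed for a string whose parts were actually written in two different encodings.

```python
# The same character, "é", encoded two different ways and glued together,
# as can happen when data from different sources is concatenated unchecked.
utf8_part = "é".encode("utf-8")      # two bytes
latin1_part = "é".encode("latin-1")  # one byte
mixed = utf8_part + latin1_part


def try_decode(raw: bytes, encoding: str):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Assuming UTF-8 for the whole string fails on the Latin-1 byte.
as_utf8 = try_decode(mixed, "utf-8")

# Assuming Latin-1 "succeeds" but garbles the UTF-8 part.
as_latin1 = try_decode(mixed, "latin-1")
```

Neither single-encoding assumption recovers the original two characters, which is why the encoding has to be settled when the data enters the application, not afterwards.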