A plea on behalf of SIMPLIFIED, FAST & READABLE regex code!
(From Pr0no in the comments) "Do you think you could simplify the regex or have a tip on how to start with a PHP solution?" Yes, Pr0n0, I believe I can simplify the regex.
I'd like to make the case that regex is by far the best tool for the job, and that it doesn't have to mean frightening & unreadable expressions like the ones we've seen earlier. I have broken this function down into understandable parts.
I've avoided complex regex features like capture groups and wildcard expressions and focused on producing something simple that you'll feel comfortable coming back to in three months' time.
My proposed function (commented)
function headerSplit($input) {
    // First, let's put our headers (any two consecutive uppercase characters at the start of a line) in an array
    preg_match_all(
        "/^[A-Z]{2}/m", /* Find 2 uppercase letters at start of a line */
        $input,         /* In the '$input' string */
        $matches        /* And store them in a $matches array */
    );
    // Next, let's split our string into an array, breaking on those headers
    $split = preg_split(
        "/^[A-Z]{2}/m",     /* Find 2 uppercase letters at start of a line */
        $input,             /* In the '$input' string */
        -1,                 /* No maximum limit of matches (passing null is deprecated since PHP 8.1) */
        PREG_SPLIT_NO_EMPTY /* Don't give us an empty first element */
    );
    // Finally, put our values into a new associative array
    $result = array();
    foreach ($matches[0] as $key => $value) {
        $result[$value] = str_replace(
            "\r\n",             /* Search for a Windows-style line break (CRLF) */
            " ",                /* And replace with a space */
            trim($split[$key])  /* After trimming the string */
        );
    }
    return $result;
}
And the output (note: if your input uses Unix line endings, you may need to replace \r\n with \n in the str_replace call):
array(5) {
["HD"]=> string(41) "Alcoa Earnings Soar; Outlook Stays Upbeat"
["BY"]=> string(35) "By James R. Hagerty and Matthew Day"
["PD"]=> string(12) "12 July 2011"
["LP"]=> string(172) "Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts. However, profits wereless than expected"
["TD"]=> string(59) "Licence this article via our website: http://example.com"
}
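If you'd rather not worry about line endings at all, here is a minimal sketch of a variant that collapses any kind of line break. The name headerSplitAnyEol and the sample text are my own illustrations, not something from the question. The only change is swapping str_replace for preg_replace with \R, which PCRE treats as "any line-break sequence".

```php
<?php
// Hypothetical variant of headerSplit(): same logic, but line breaks are
// collapsed with \R so that \r\n, \n and \r input all behave the same.
function headerSplitAnyEol($input) {
    preg_match_all("/^[A-Z]{2}/m", $input, $matches);
    $split = preg_split("/^[A-Z]{2}/m", $input, -1, PREG_SPLIT_NO_EMPTY);
    $result = array();
    foreach ($matches[0] as $key => $value) {
        // \R matches any line-break sequence; the + folds consecutive breaks into one space
        $result[$value] = preg_replace('/\R+/', ' ', trim($split[$key]));
    }
    return $result;
}

// Made-up sample using Unix line endings (assumption: not the asker's real data)
$sample = "HD Alcoa Earnings Soar;\nOutlook Stays Upbeat\nPD 12 July 2011\n";
$parsed = headerSplitAnyEol($sample);
print_r($parsed);
```

Like the original, this assumes the input starts with a header and that no continuation line begins with two uppercase letters.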
Removing the Comments for a Cleaner Function
Here is a condensed version of the function. It's exactly the same as above, but with the comments removed:
function headerSplit($input) {
    preg_match_all("/^[A-Z]{2}/m", $input, $matches);
    $split = preg_split("/^[A-Z]{2}/m", $input, -1, PREG_SPLIT_NO_EMPTY);
    $result = array();
    foreach ($matches[0] as $key => $value) $result[$value] = str_replace("\r\n", " ", trim($split[$key]));
    return $result;
}
Theoretically it shouldn't matter which one you use in your live code as parsing comments has little performance impact, so use the one you're more comfortable with.
Breakdown of the Regular Expression Used Here
There is only one expression in the function (albeit used twice), so let's break it down:
"/^[A-Z]{2}/m"
/ - This is a delimiter, marking the start of the pattern.
^ - This means 'match at the beginning of the text' (or, with the m modifier below, at the beginning of each line).
[A-Z] - This means match any uppercase letter from A to Z.
{2} - This means match exactly two of the preceding element (so exactly two uppercase letters).
/ - This is the second delimiter, marking the end of the pattern.
m - This is 'multi-line mode', which makes ^ match at the start of every line rather than only at the start of the whole string.
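This tiny expression is powerful enough to match HD at the start of a line, but not HD in the middle of a line (for example in Full HD). Be aware that, as written, it also matches the first two letters of a longer uppercase run such as HDM; if a header must be exactly two letters, append a word boundary: /^[A-Z]{2}\b/m. You will not easily achieve this with non-regex options.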
If you want two or more (instead of exactly 2) consecutive uppercase characters to signify a new section, use /^[A-Z]{2,}/m.
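To check how these patterns behave, here is a small sketch run against three made-up lines (the sample text and variable names are mine). It also tries a variant with a word boundary, \b, for the case where a header must be exactly two letters and nothing more.

```php
<?php
// Made-up test lines: a real header, a longer uppercase run, and "HD" mid-line
$text = "HD Headline\nHDM Not a header\nFull HD television\n";

// The exactly-two pattern used in headerSplit()
preg_match_all('/^[A-Z]{2}/m', $text, $plain);
// The same pattern with a word boundary appended
preg_match_all('/^[A-Z]{2}\b/m', $text, $bounded);
// The two-or-more variant
preg_match_all('/^[A-Z]{2,}/m', $text, $twoOrMore);

print_r($plain[0]);     // HD, HD (the second is the first two letters of HDM)
print_r($bounded[0]);   // HD
print_r($twoOrMore[0]); // HD, HDM
```

In all three cases the HD inside "Full HD television" is ignored, because ^ only lets the match begin at the start of a line.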
Using a list of pre-defined headers
Having read your last question, and your comment under @jgb's post, it looks like you want to use a pre-defined list of headers. You can do that by replacing our regex with "/^(HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)/m". The | is treated as an 'or' in regular expressions.
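If the header list lives in an array, you don't have to type the alternation by hand. Here is a rough sketch of building the pattern from the list; buildHeaderPattern is a hypothetical helper name of my own, and the array simply repeats the codes listed above.

```php
<?php
// Header codes copied from the alternation above
$headers = array('HD', 'BY', 'WC', 'PD', 'SN', 'SC', 'PG', 'LA', 'CY',
                 'LP', 'TD', 'CO', 'IN', 'NS', 'RE', 'IPC', 'PUB', 'AN');

// Hypothetical helper: turns a list of header codes into a multi-line pattern
function buildHeaderPattern(array $headers) {
    // preg_quote() escapes regex metacharacters, including the "/" delimiter
    $escaped = array_map(function ($h) {
        return preg_quote($h, '/');
    }, $headers);
    return '/^(' . implode('|', $escaped) . ')/m';
}

$pattern = buildHeaderPattern($headers);
echo $pattern, "\n";

// Quick check against a made-up two-line sample
preg_match_all($pattern, "HD Headline\nPUB Some Publisher\n", $found);
print_r($found[0]); // HD, PUB
```

The resulting string can be dropped into both the preg_match_all and preg_split calls in place of "/^[A-Z]{2}/m".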
Benchmarking - Readable Doesn't Mean Slow
Somehow benchmarking has become part of the conversation. Even though I think it misses the point, which is to provide you with a readable & maintainable solution, I rewrote JGB's benchmark to show you a few things.
Here are my results, showing that this regex-based code is the fastest option here (these results are based on 5,000 iterations):
SWEETIE BELLE'S SOLUTION (2 UPPERCASE IS A HEADER): 0.054 seconds
SWEETIE BELLE'S SOLUTION (2+ UPPERCASE IS A HEADER): 0.057 seconds
MATEWKA'S SOLUTION (MODIFIED, 2 UPPERCASE IS A HEADER): 0.069 seconds
BABA'S SOLUTION (2 UPPERCASE IS A HEADER): 0.075 seconds
SWEETIE BELLE'S SOLUTION (USES DEFINED LIST OF HEADERS): 0.086 seconds
JGB'S SOLUTION (USES DEFINED LIST OF HEADERS, MODIFIED): 0.107 seconds
And the benchmarks for solutions with incorrectly formatted output:
MATEWKA'S SOLUTION: 0.056 seconds
JGB'S SOLUTION: 0.061 seconds
HEK2MGL'S SOLUTION: 0.106 seconds
ANUBHAVA'S SOLUTION: 0.167 seconds
I offered a modified version of JGB's function because his original doesn't remove newlines before adding paragraphs to the output array. Small string operations make a huge difference in performance, so every solution must do the same work for the comparison to be fair.
Also, with JGB's function, if you pass in the full list of headers, you will get a bunch of null values in your array, as it doesn't appear to check whether the key is present before assigning it. This would cause another performance hit if you wanted to loop over these values later, as you'd have to check for empty values first.
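If you want to run your own timings, a minimal harness along these lines should do. This is only a sketch and not the exact benchmark script used for the numbers above; timeIt, $candidate and the sample string are hypothetical names of my own.

```php
<?php
// Hypothetical timing helper: calls $fn($input) $iterations times
// and returns the elapsed wall-clock time in seconds.
function timeIt(callable $fn, $input, $iterations = 5000) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($input);
    }
    return microtime(true) - $start;
}

// Stand-in for the function under test; swap in headerSplit or any other solution
$candidate = function ($input) {
    return preg_split("/^[A-Z]{2}/m", $input, -1, PREG_SPLIT_NO_EMPTY);
};

// Made-up input; use the same text for every solution you compare
$sample = "HD Headline\r\nPD 12 July 2011\r\n";
$elapsed = timeIt($candidate, $sample);
printf("%.3f seconds\n", $elapsed);
```

Whatever harness you use, make sure every candidate produces the same output format before comparing times, for the reasons given above.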