How to explode different sections from a text file into an array using PHP (and no regex)?

Question

This question is almost a duplicate of How to transform structured textfiles into PHP multidimensional array, but I have posted it again because I was unable to understand the regular-expression-based solutions given there. It seems better to try to solve this using plain PHP so that I can actually learn from it (regex is too hard for me to understand at this point).

Assume the following text file:

HD Alcoa Earnings Soar; Outlook Stays Upbeat 
BY By James R. Hagerty and Matthew Day 
PD 12 July 2011
LP 

Alcoa Inc.'s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts' forecasts.

However, profits wereless than expected

TD
Licence this article via our website:

http://example.com

I read this text file with PHP and need a robust way to put the file contents into an array, like this:

array(
  [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat,
  [BY] => By James R. Hagerty and Matthew Day,
  [PD] => 12 July 2011,
  [LP] => Alcoa Inc.'s profit...than expected,
  [TD] => Licence this article via our website: http://example.com
)

The words HD BY PD LP TD are keys to identify a new section in the file. In the array, all newlines may be stripped from the values. Ideally I would be able to do this without regular expressions. I believe exploding on all keys could be one way of doing it, but it would be very dirty:

$fields = array('HD', 'BY', 'PD', 'LP', 'TD');
$parts = explode("\nHD ", $text); // explode() takes the delimiter first, then the string
$HD = $parts[0];

Does anybody have a cleaner idea for how to loop through the text, perhaps only once, and divide it up into the array as given above?

"And no regex" - why? This is one of those (admittedly rare) cases where a regex is the right tool for the job. Unless you help us understand why it isn't an option, you'll likely get a lot of this. — Mels
– Mels, Commented Aug 21, 2013 at 14:11
You've asked this question before - stackoverflow.com/questions/18318530/… — Jason McCreary
– Jason McCreary, Commented Aug 21, 2013 at 14:13
@mels I am asking this question again since I cannot get the regex to work because of my lack of understanding. I think it would be better to stay in the realm of scripting where I feel comfortable for now. PHP code I can understand. — Pr0no
– Pr0no, Commented Aug 21, 2013 at 14:16
Okay. Edit your question to appropriately link to your old question (and explain the difference like you just did), and you lose my downvote. It's an SE faux-pas to post a near-dupe question without explanation ;-). — Mels
– Mels, Commented Aug 21, 2013 at 14:20
Any solution not using regular expressions would be far more complex and harder to understand than a solution using regex.
– Hannes, Commented Aug 21, 2013 at 14:40

jgb · Accepted Answer · 2013-08-30 12:16:45Z

13

+150

This is another, even shorter approach without using regular expressions.

/**
 * @param  array  $parts array of stopwords, e.g. array('HD', 'BY', ...)
 * @param  string $str   text to search in
 * @param  string $eol   end-of-line symbol
 * @return array  [ stopword => string, ... ]
 */
function extract_parts(array $parts, $str, $eol=PHP_EOL) {
  $ret=array_fill_keys($parts, '');
  $current=null;
  foreach(explode($eol, $str) AS $line) {
    $substr = substr($line, 0, 2);
    if (isset($ret[$substr])) {
      $current = $substr;
      $line = trim(substr($line, 2));
    }
    if ($current) $ret[$current] .= $line;
  }
  return $ret;
}

$ret = extract_parts(array('HD', 'BY', 'PD', 'LP', 'TD'), $str);
var_dump($ret);

Why not use regular expressions?

The PHP documentation, particularly for the preg_* functions, recommends not using regular expressions unless they are really needed, so I wondered which of the examples in the answers to this question performs best.

The result surprised me:

Answer 1 by: hek2mgl     2.698 seconds (regexp)
Answer 2 by: Emo Mosley  2.38  seconds
Answer 3 by: anubhava    3.131 seconds (regexp)
Answer 4 by: jgb         1.448 seconds

I would have expected that the regexp variants would be the fastest.

So avoiding regular expressions is not necessarily a bad thing. In other words, regular expressions are not always the best solution; you have to decide case by case.

You may repeat the measurement with this script.
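As a rough sketch of how such a measurement could be set up (this is not the original benchmark script; the helper name time_parser, the stand-in parser extract_parts_demo and the sample string are assumptions made for illustration):

```php
<?php
// Hypothetical timing helper: runs $parser on $str $iterations times
// and returns the elapsed wall-clock time in seconds.
function time_parser(callable $parser, $str, $iterations = 1000) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $parser($str);
    }
    return microtime(true) - $start;
}

// Stand-in parser for the demo: the line-by-line approach from this answer,
// with "\n" as the default line separator.
function extract_parts_demo(array $parts, $str, $eol = "\n") {
    $ret = array_fill_keys($parts, '');
    $current = null;
    foreach (explode($eol, $str) as $line) {
        $substr = substr($line, 0, 2);
        if (isset($ret[$substr])) {
            $current = $substr;
            $line = trim(substr($line, 2));
        }
        if ($current) $ret[$current] .= $line;
    }
    return $ret;
}

$str  = "HD Some headline\nBY Some author\nLP\nFirst line.\nSecond line.";
$keys = array('HD', 'BY', 'LP');

$seconds = time_parser(function ($s) use ($keys) {
    return extract_parts_demo($keys, $s);
}, $str);

printf("%d iterations: %.4f seconds\n", 1000, $seconds);
```

Absolute numbers depend heavily on the machine and PHP version, so only comparisons made in the same run are meaningful.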

Edit

Here is a short, more optimized example using a regexp pattern. Still not as fast as my example above but faster than the other regexp based examples.

The output format could still be tidied up (whitespace / line breaks).

function extract_parts_regexp($str) {
  $a=array();
  preg_match_all('/(?<k>[A-Z]{2})(?<v>.*?)(?=\n[A-Z]{2}|$)/Ds', $str, $a);
  return array_combine($a['k'], $a['v']);
}

edited Aug 30, 2013 at 12:16

answered Aug 23, 2013 at 19:47

jgb


20 Comments

Pr0no Over a year ago

Thank you for your elaborate reply. I keep having one issue though: the array values contain multiple instances of the same string, but I am unable to see where this goes wrong, as the loops are correct as far as I can tell. An example output would be [BY] => Chris Chris Chris Chris Chris. The code I'm using is here: pastebin.com/y0iaWVVZ and the actual file I'm working on is here: pastebin.com/h0XkTxHk. I would be very grateful if you could have a look to see whether you know what is wrong.

hek2mgl Over a year ago

@jgb Nice one.. At least my answer was not the slowest solution ;)

jgb Over a year ago

@Pr0no There are two new requirements compared to your original question (3-byte keywords and multiple records per string/text file). So as not to break the answer to your original question, I fixed that for you here. I bet you would have been able to solve that problem by yourself.

anubhava Over a year ago

Just want to put one point across. Almost all the answers here start with a fixed 2 letter keys: 'HD', 'BY', 'PD', 'LP', 'TD' and then do the parsing which I avoided in my answer to keep it generic. It is up to the OP's requirements whether required solution should be specific to some known keys (which might be subject to change in future) or keep it generic/open like I proposed.

jgb Over a year ago

@anubhava Your answer using: /^[A-Z]{2}/ is also not generic. Changing the length in your regexp or in my substr call will be the same.


22 revs · Accepted Answer · 2013-08-31 11:43:31Z

8

A plea on behalf of SIMPLIFIED, FAST & READABLE regex code!

(From Pr0no in comments) Do you think you could simplify the regex or have a tip on how to start with a php solution?

Yes, Pr0no, I believe I can simplify the regex.

I'd like to make the case that regex is by far the best tool for the job and that it doesn't have to be frightening & unreadable expressions as we've seen earlier. I have broken this function down into understandable parts.

I've avoided complex regex features like capture groups and wildcard expressions and focused on producing something simple that you'll feel comfortable coming back to in three months' time.

My proposed function (commented)

function headerSplit($input) {

    // First, let's put our headers (any two consecutive uppercase characters at the start of a line) in an array
    preg_match_all(
        "/^[A-Z]{2}/m",       /* Find 2 uppercase letters at start of a line   */
        $input,               /* In the '$input' string                        */
        $matches              /* And store them in a $matches array            */
    );

    // Next, let's split our string into an array, breaking on those headers
    $split = preg_split(
        "/^[A-Z]{2}/m",       /* Find 2 uppercase letters at start of a line   */
        $input,               /* In the '$input' string                        */
        -1,                   /* No maximum limit of matches                   */
        PREG_SPLIT_NO_EMPTY   /* Don't give us an empty first element          */
    );

    // Finally, put our values into a new associative array
    $result = array();
    foreach($matches[0] as $key => $value) {
        $result[$value] = str_replace(
            "\r\n",              /* Search for a new line character            */
            " ",                 /* And replace with a space                   */
            trim($split[$key])   /* After trimming the string                  */
        );
    }

    return $result;
}

And the output (note: you may need to replace \r\n with \n in str_replace function depending on your operating system):

array(5) {
  ["HD"]=> string(41) "Alcoa Earnings Soar; Outlook Stays Upbeat"
  ["BY"]=> string(35) "By James R. Hagerty and Matthew Day"
  ["PD"]=> string(12) "12 July 2011"
  ["LP"]=> string(172) "Alcoa Inc.'s profit more than doubled in the second quarter.  The giant aluminum producer managed to meet analysts' forecasts.    However, profits wereless than expected"
  ["TD"]=> string(59) "Licence this article via our website:    http://example.com"
}
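One way to sidestep the \r\n versus \n question is to normalize line endings before parsing. A minimal sketch, assuming a hypothetical helper named normalize_newlines (not part of the function above):

```php
<?php
// Hypothetical helper: convert Windows (\r\n) and old Mac (\r) line endings to \n.
// The search array is applied in order, so "\r\n" is handled before a lone "\r".
function normalize_newlines($input) {
    return str_replace(array("\r\n", "\r"), "\n", $input);
}

$windows = "HD Headline\r\nLP\r\nFirst line.\r\nSecond line.";
$unix    = normalize_newlines($windows);

// After normalizing, a single "\n" replacement is enough on any platform.
$flat = str_replace("\n", " ", $unix);
echo $flat, "\n"; // HD Headline LP First line. Second line.
```

With the input normalized first, the str_replace call inside the function can always search for "\n".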

Removing the Comments for a Cleaner Function

Condensed version of this function. It's exactly the same as above but with the comments removed:

function headerSplit($input) {
    preg_match_all("/^[A-Z]{2}/m",$input,$matches);
    $split = preg_split("/^[A-Z]{2}/m",$input,-1,PREG_SPLIT_NO_EMPTY);
    $result = array();
    foreach($matches[0] as $key => $value) $result[$value] = str_replace("\r\n"," ",trim($split[$key]));
    return $result;
}

Theoretically it shouldn't matter which one you use in your live code as parsing comments has little performance impact, so use the one you're more comfortable with.

Breakdown of the Regular Expression Used Here

There is only one expression in the function (albeit, used twice), let's break it down for simplicity:

"/^[A-Z]{2}/m"

/     - This is a delimiter, representing the start of the pattern.
^     - This means 'Match at the beginning of the text'.
[A-Z] - This means match any uppercase character.
{2}   - This means match exactly two of the previous character (so exactly two uppercase characters).
/     - This is the second delimiter, meaning the pattern is over.
m     - This is 'multi-line mode', telling regex to treat each line as a new string.

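This tiny expression matches HD at the start of a line, but not HD in the middle of a line (for example in Full HD). Note that, as written, it also matches the first two letters of a longer uppercase run such as HDM; if a header must be exactly two letters, add a word boundary: /^[A-Z]{2}\b/m. You will not easily achieve this with non-regex options.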

If you want two or more (instead of exactly 2) consecutive uppercase characters to signify a new section, use /^[A-Z]{2,}/m.
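To see both patterns in action, here is a small sketch on made-up sample text (the strings are assumptions chosen for the demo, not the OP's data):

```php
<?php
$text = "HD Full HD screens\nBY Someone\nnot a header: PD";

// Two uppercase letters at the start of any line (m = multi-line mode).
preg_match_all('/^[A-Z]{2}/m', $text, $matches);
print_r($matches[0]); // HD and BY only; the mid-line HD and PD are ignored

// Two or more uppercase letters at the start of a line.
preg_match_all('/^[A-Z]{2,}/m', "HD one\nPUB two", $longer);
print_r($longer[0]); // HD and PUB
```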

Using a list of pre-defined headers

Having read your last question, and your comment under @jgb's post, it looks like you want to use a pre-defined list of headers. You can do that by replacing our regex with "/^(HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)/m" -- the | is treated as an 'or' in regular expressions.
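If the list of headers lives in an array, the pattern can be built from it rather than typed by hand. A minimal sketch, assuming a hypothetical helper named header_pattern; the trailing \b is an addition of this sketch, not part of the pattern above:

```php
<?php
// Hypothetical helper: build the header pattern from an array of keys.
function header_pattern(array $keys) {
    // Escape each key in case it ever contains regex metacharacters.
    $escaped = array();
    foreach ($keys as $key) {
        $escaped[] = preg_quote($key, '/');
    }
    // \b keeps "IN" from matching the start of a word such as "INDEX".
    return '/^(' . implode('|', $escaped) . ')\b/m';
}

$pattern = header_pattern(array('HD', 'BY', 'IPC', 'IN'));
echo $pattern, "\n"; // /^(HD|BY|IPC|IN)\b/m

preg_match_all($pattern, "HD a\nINDEX b\nIPC c\nIN d", $m);
print_r($m[1]); // HD, IPC, IN
```

Whether the word boundary is wanted depends on the data; without it, any line beginning with the letters of a key counts as a header.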

Benchmarking - Readable Doesn't Mean Slow

Somehow benchmarking has become part of the conversation, and even though I think it's missing the point which is to provide you with a readable & maintainable solution, I rewrote JGB's benchmark to show you a few things.

Here are my results, showing that this regex-based code is the fastest option here (these results based on 5,000 iterations):

SWEETIE BELLE'S SOLUTION (2 UPPERCASE IS A HEADER):         0.054 seconds
SWEETIE BELLE'S SOLUTION (2+ UPPERCASE IS A HEADER):        0.057 seconds
MATEWKA'S SOLUTION (MODIFIED, 2 UPPERCASE IS A HEADER):     0.069 seconds
BABA'S SOLUTION (2 UPPERCASE IS A HEADER):                  0.075 seconds
SWEETIE BELLE'S SOLUTION (USES DEFINED LIST OF HEADERS):    0.086 seconds
JGB'S SOLUTION (USES DEFINED LIST OF HEADERS, MODIFIED):    0.107 seconds

And the benchmarks for solutions with incorrectly formatted output:

MATEWKA'S SOLUTION:                                         0.056 seconds
JGB'S SOLUTION:                                             0.061 seconds
HEK2MGL'S SOLUTION:                                         0.106 seconds
ANUBHAVA'S SOLUTION:                                        0.167 seconds

The reason I offered a modified version of JGB's function is because his original function doesn't remove newlines before adding paragraphs to the output array. Small string operations make a huge difference in performance and must be benchmarked equally to get a fair estimation of performance.

Also, with jgb's function, if you pass in the full list of headers, you will get a number of empty values in your array, because every key is pre-filled whether or not it appears in the text. This would cause another performance hit if you wanted to loop over these values later, as you'd have to check for empty values first.
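If the pre-filled keys are a problem, the empty entries can be dropped afterwards. A short sketch on a hand-written array that mimics such a result (the values are made up for the demo):

```php
<?php
// Shape of a result where only some of the pre-filled keys were found.
$sections = array('HD' => 'Headline', 'WC' => '', 'BY' => 'Author', 'SN' => '');

// Keep only sections that actually received content; keys are preserved.
$found = array_filter($sections, function ($value) {
    return $value !== '';
});
print_r($found); // HD and BY remain
```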

edited Aug 31, 2013 at 11:43

community wiki

22 revs
Sweetie Belle

10 Comments

Emo Mosley Over a year ago

@Sweetie Belle - Check out Pr0no's comment under jgb's answer - the OP posted his example code and a live file to work with. Some of his field parts are 2-letter and some are 3. Also, not every line that begins with two capital letters will qualify as a section (i.e. JPMorgan).

Glitch Desire Over a year ago

@EmoMosley I gave a solution for that: /^(HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)/m

Emo Mosley Over a year ago

@Sweetie Belle - Sorry, I was looking at the {2} in your code and assuming it was a limiting factor. I should have read the whole thing.

Glitch Desire Over a year ago

@EmoMosley I gave options for 2, 2+ and a defined list. OP should really have put his full question in here and not relied on us reading the old question or random comments. :P

Emo Mosley Over a year ago

@Sweetie Belle - I can say, I learned a lot of neat tricks from everyone! :)


Baba · Accepted Answer · 2013-08-25 20:47:13Z

6

Here is a simple solution without regex

$data = explode("\n", $str);
$output = array();
$key = null;

foreach($data as $text) {
    $newKey = substr($text, 0, 2);
    if (ctype_upper($newKey)) {
        $key = $newKey;
        $text = substr($text, 2);
    }
    $text = trim($text);
    isset($output[$key]) ? $output[$key] .= $text : $output[$key] = $text;
}
print_r($output);

Output

Array
(
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
    [BY] => By James R. Hagerty and Matthew Day
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter.The giant aluminum producer managed to meet analysts' forecasts.However, profits wereless than expected
    [TD] => Licence this article via our website:http://example.com
)

See Live Demo

Note

You might also want to do the following :

Check for duplicate data
Make sure only HD|BY|PD|LP|TD are used
Remove $text = trim($text) so that the new lines would be preserved in the text
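The second point could look roughly like this: the same loop, but a line only starts a new section when its first two characters are in a known list. This is a sketch, with parse_sections as a hypothetical name and made-up sample text:

```php
<?php
// Sketch of the same loop restricted to a known list of keys.
function parse_sections($str, array $allowed) {
    $output = array();
    $key = null;
    foreach (explode("\n", $str) as $text) {
        $newKey = substr($text, 0, 2);
        if (in_array($newKey, $allowed, true)) {
            $key = $newKey;
            $text = substr($text, 2);
        }
        if ($key === null) {
            continue; // skip anything before the first known key
        }
        $text = trim($text);
        $output[$key] = isset($output[$key]) ? $output[$key] . $text : $text;
    }
    return $output;
}

// "JPMorgan" and "US" start with two capitals but are not section keys.
$str = "HD Headline\nLP\nJPMorgan rose.\nUS stocks fell.";
print_r(parse_sections($str, array('HD', 'LP')));
```

With the ctype_upper check, the JP and US lines would each have opened a new section; with the list they stay part of LP.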

answered Aug 25, 2013 at 20:47

Baba


1 Comment

Gadoma Over a year ago

this one is very clean and elegant, I'd propose something similar myself :) cheers

hek2mgl · Accepted Answer · 2013-08-23 15:56:58Z

If it's just one record per file, here you go:

$record = array();
$currentKey = null;
foreach(file('input.txt') as $line) {
    if(preg_match('~^(HD|BY|PD|LP|TD) ?(.*)?$~', $line, $matches)) {
        $currentKey = $matches[1];
        $record[$currentKey] = $matches[2];
    } elseif($currentKey !== null) {
        // continuation line: append it to the current section
        $record[$currentKey] .= str_replace("\n", ' ', $line);
    }
}

The code iterates over each line of input and checks whether the line starts with an identifier. If so, $currentKey is set to this identifier. All following content, until a new identifier is found, is appended to this key in the array after newlines have been replaced with spaces.

var_dump($record);

Output:

array(5) {
  'HD' =>
  string(42) "Alcoa Earnings Soar; Outlook Stays Upbeat "
  'BY' =>
  string(36) "By James R. Hagerty and Matthew Day "
  'PD' =>
  string(12) "12 July 2011"
  'LP' =>
  string(169) " Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts.  However, profits wereless than expected  "
  'TD' =>
  string(58) "Licence this article via our website:  http://example.com "
}

Note: If there are multiple records per file, you can refine the parser to return a multidimensional array:

$records = array();
$record = null;
$currentKey = null;
foreach(file('input.txt') as $line) {
    if(preg_match('~^(HD|BY|PD|LP|TD) ?(.*)?$~', $line, $matches)) {
        $currentKey = $matches[1];

        // start a new record if `HD` was found.
        if($currentKey === 'HD') {
            if(is_array($record)) {
                $records []= $record;
            }
            $record = array();
        }
        $record[$currentKey] = $matches[2];
    } elseif($currentKey !== null) {
        $record[$currentKey] .= str_replace("\n", ' ', $line);
    }
}
// store the last record, which has no following `HD` to flush it
if(is_array($record)) {
    $records []= $record;
}

However the data format itself looks fragile to me. What if LP looks like this:

LP dfks ldsfjksdjlf
lkdsjflk dsfjksld..
HD defsdf sdf sd....

You see, there is an HD in the data of LP in my example. To keep the data parseable you'll have to avoid such situations.
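The effect is easy to reproduce. In this sketch (parse_record is a hypothetical wrapper around the same line-based idea, and the input lines are made up), a body line that happens to start with HD silently replaces the real headline:

```php
<?php
// Minimal line-based parser in the spirit of the code above.
function parse_record(array $lines) {
    $record = array();
    $currentKey = null;
    foreach ($lines as $line) {
        if (preg_match('~^(HD|BY|PD|LP|TD) ?(.*)$~', $line, $matches)) {
            $currentKey = $matches[1];
            $record[$currentKey] = $matches[2];
        } elseif ($currentKey !== null) {
            $record[$currentKey] .= ' ' . $line;
        }
    }
    return $record;
}

// The third line belongs to the LP body but looks like a header.
$lines = array('HD Real headline', 'LP First body line', 'HD sets a record');
print_r(parse_record($lines));
// HD is now "sets a record"; the real headline is lost and LP is cut short.
```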

Emo Mosley · Accepted Answer · 2013-08-29 17:11:00Z

5

UPDATE :

Given the posted example input file and code, I've altered my answer. I've added the OP's provided "parts" that define the section codes and made the function able to handle codes of two or more characters. Below is a non-regex procedural function that should produce the desired results:

# Parses the given lines and populates an array with coded sections.
# INPUT:
#   parts = (array) section codes that start a new section, e.g. 'HD'
#   lines = (array) lines of the text file, e.g. as returned by file()
# RETURNS: (assoc array)
#   null is returned if no data was found
#   otherwise an associative array of the field sections is returned
function getSections($parts, $lines) {
   $sections = array();
   $code = "";
   $str = "";
   # examine each line to build section array
   for($i=0; $i<sizeof($lines); $i++) {
      $line = trim($lines[$i]);
      # check for special field codes
      $words = explode(' ', $line, 2);
      $left = $words[0];
      #echo "DEBUG: left[$left]\n";
      if(in_array($left, $parts)) {
         # field code detected; first, finish previous section, if exists
         if($code) {
            # store the previous section
            $sections[$code] = trim($str);
         }
         # begin to process new section
         $code = $left;
         $str = trim(substr($line, strlen($code)));
      } else if($code && $line) {
         # keep a running string of section content
         $str .= " ".$line;
      }
   } # for i
   # check for no data
   if(!$code)
      return(null);
   # store the last section and return results
   $sections[$code] = trim($str);
   return($sections);
} # getSections()


$parts = array('HD', 'BY', 'WC', 'PD', 'SN', 'SC', 'PG', 'LA', 'CY', 'LP', 'TD', 'CO', 'IN', 'NS', 'RE', 'IPC', 'PUB', 'AN');

$datafile = $argv[1]; # NOTE: I happen to be testing this from command-line
# load file as array of lines
$lines = file($datafile);
if($lines === false)
   die("ERROR: unable to open file ".$datafile."\n");
$data = getSections($parts, $lines);
echo "Results from ".$datafile.":\n";
if($data)
   print_r($data);
else
   echo "ERROR: no data detected in ".$datafile."\n";

Results:

Array
(   
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
    [BY] => By James R. Hagerty and Matthew Day
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts. However, profits wereless than expected
    [TD] => Licence this article via our website: http://example.com
)

edited Aug 29, 2013 at 17:11

answered Aug 23, 2013 at 15:47

Emo Mosley


3 Comments

Emo Mosley Over a year ago

@Pr0no - I've improved my answer; I hope it meets your needs.

Gnudiff Over a year ago

This seems to be the most concise answer submitted so far; however, the OP should be warned that, as hek2mgl suggested, the data format itself (unless it is clearly specified somewhere how to avoid this) can lead to ambiguity. No better answer seems currently possible without more info on the data format.

Emo Mosley Over a year ago

@Gnudiff - The OP posted example code and a live data file in the comment under jgb's answer - all you need is in there.

anubhava · Accepted Answer · 2013-08-23 16:21:23Z

This is one problem where I think using regex shouldn't be an issue, considering the rules for parsing the input data. Consider code like this:

$s = file_get_contents('input'); // read input file into a string
$match = array(); // will hold final output
if (preg_match_all('~(^|[A-Z]{2})\s(.*?)(?=[A-Z]{2}\s|$)~s', $s, $arr)) {
    for ( $i = 0; $i < count($arr[1]); $i++ )
       $match[ trim($arr[1][$i]) ] = str_replace( "\n", "", $arr[2][$i] );
}
print_r($match);

As you can see, the code becomes quite compact because of the way preg_match_all is used to match data from the input file.

OUTPUT:

Array
(
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat 
    [BY] => By James R. Hagerty and Matthew Day 
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter.The giant aluminum producer managed to meet analysts' forecasts.However, profits wereless than expected
    [TD] => Licence this article via our website:http://example.com
)

gwc · Accepted Answer · 2013-08-29 16:43:40Z

2

Don't loop at all. How about this (assuming one record per file)?

$inrec = file_get_contents('input');
$inrec = str_replace( "\n'", "'", str_replace( array( 'HD ', 'BY ', 'PD ', 'LP', 'TD' ), array( "'HD' => '", "','BY' => '", "','PD' => '", "','LP' => '", "','TD' => '" ), str_replace( "'", "\\'", $inrec ) ) )."'";
eval( '$record = array('.$inrec.');' );
var_export($record);

results:

array (
  'HD' => 'Alcoa Earnings Soar; Outlook Stays Upbeat ',
  'BY' => 'By James R. Hagerty and Matthew Day ',
  'PD' => '12 July 2011',
  'LP' => ' 

Alcoa Inc.\'s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts\' forecasts.

However, profits wereless than expected
',
  'TD' => '
Licence this article via our website:

http://example.com',
)

If there can be more than one record per file, try something like:

$inrecs = explode( 'HD ', file_get_contents('input') );
$records = array();
foreach ( $inrecs as $inrec ) {
   $inrec = str_replace( "\n'", "'", str_replace( array( 'HD ', 'BY ', 'PD ', 'LP', 'TD' ), array( "'HD' => '", "','BY' => '", "','PD' => '", "','LP' => '", "','TD' => '" ), str_replace( "'", "\\'", 'HD ' . $inrec ) ) )."'";
   eval( '$records[] = array('.$inrec.');' );
}
var_export($records);

Edit

Here's a version with the $inrec functions split out so it can be more easily understood - and with a couple of tweaks: strips new-lines, trims leading and trailing spaces, and addresses backslash concern in EVAL in case the data is from an untrusted source.

$inrec = file_get_contents('input');
$inrec = str_replace( '\\', '\\\\', $inrec );       // Precede all backslashes with backslashes
$inrec = str_replace( "'", "\\'", $inrec );         // Precede all single quotes with backslashes
$inrec = str_replace( PHP_EOL, " ", $inrec );       // Replace all new lines with spaces
$inrec = str_replace( array( 'HD ', 'BY ', 'PD ', 'LP ', 'TD ' ), array( "'HD' => trim('", "'),'BY' => trim('", "'),'PD' => trim('", "'),'LP' => trim('", "'),'TD' => trim('" ), $inrec )."')";
eval( '$record = array('.$inrec.');' );
var_export($record);

Results:

array (
  'HD' => 'Alcoa Earnings Soar; Outlook Stays Upbeat',
  'BY' => 'By James R. Hagerty and Matthew Day',
  'PD' => '12 July 2011',
  'LP' => 'Alcoa Inc.\'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts\' forecasts.  However, profits wereless than expected',
  'TD' => 'Licence this article via our website:  http://example.com',
)

edited Aug 29, 2013 at 16:43

answered Aug 24, 2013 at 23:08

gwc


8 Comments

jgb Over a year ago

Don't eval at all. What if the string were: $str="); system(\"cat /etc/passwd\"); /*"; or something similar, like rm -rf /?

user2136840 Over a year ago

-1 for using eval where it's completely unnecessary (i.e. 99.99% of the time)

gwc Over a year ago

I can only assume that you didn't completely understand the $inrec = str_replace(... statement where all single quotes are replaced by a single quote that is preceded by a backslash and then where the entire text is enclosed in single quotes. Double quotes within a single quoted string are just a literal and won't terminate the string so what follows won't get executed. Also, we don't know how the questioner intends to use this code. Is it a one-time fix? Is the data from a known trusted source? While EVAL can be dangerous, the dangers can also be mitigated.

gwc Over a year ago

Anyways, the point of the exercise was to find a fast way to put the data into an array WITHOUT using regex. The question is: is this answer any quicker than those already presented? IF there is an exposure with the EVAL (which I think has already been addressed), the building of the $inrec string can be tweaked to eliminate the exposure.

gwc Over a year ago

One such tweak that may be needed would be to also precede all backslashes by a backslash. If a simple performance warrants proceeding, then the tweaking can begin.


gwc · Accepted Answer · 2013-08-30 12:28:55Z

Update

It dawned on me that in a multi-record scenario, building $repl outside of the record loop would perform even better. Here's the 2 byte keyword version:

$inrecs = file_get_contents('input');
$inrecs = str_replace( PHP_EOL, " ", $inrecs );
$keys  = array( 'HD', 'BY', 'PD', 'LP', 'TD' );
$split = chr(255);
$repl = explode( ',', $split . implode( ','.$split, $keys ) );
$inrecs = explode( 'HD ', $inrecs );
array_shift( $inrecs );
$records = array();
foreach( $inrecs as $inrec ) $records[] = parseRecord( $keys, $repl, 'HD '.$inrec );

function parseRecord( $keys, $repl, $rec ) {
    $split = chr(255);
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ substr( $line, 0, 2 ) ] = trim( substr( $line, 3 ) );
    return $out;
}

Benchmark (thanks @jgb):

Answer 1 by: hek2mgl     6.783 seconds (regexp)
Answer 2 by: Emo Mosley  4.738 seconds
Answer 3 by: anubhava    6.299 seconds (regexp)
Answer 4 by: jgb         2.47 seconds
Answer 5 by: gwc         3.589 seconds (eval)
Answer 6 by: gwc         1.871 seconds

Here's another answer for multiple input records (assuming each record begins with 'HD ') and supporting 2 byte, 2 or 3 byte, or variable length keywords.

$inrecs = file_get_contents('input');
$inrecs = str_replace( PHP_EOL, " ", $inrecs );
$keys  = array( 'HD', 'BY', 'PD', 'LP', 'TD' );
$inrecs = explode( 'HD ', $inrecs );
array_shift( $inrecs );
$records = array();
foreach( $inrecs as $inrec ) $records[] = parseRecord( $keys, 'HD '.$inrec );

Parse record with 2 byte keywords:

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ substr( $line, 0, 2 ) ] = trim( substr( $line, 3 ) );
    return $out;
}

Parse record with 2 or 3 byte keywords (assumes space or PHP_EOL between key and content):

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ trim( substr( $line, 0, 3 ) ) ] = trim( substr( $line, 3 ) );
    return $out;
}

Parse record with variable length keywords (assumes space or PHP_EOL between key and content):

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) {
        $keylen = strpos( $line.' ', ' ' );
        $out[ trim( substr( $line, 0, $keylen ) ) ] = trim( substr( $line, $keylen+1 ) );
    }
    return $out;
}

The expectation is that each parseRecord function above would perform a little worse than its predecessor.

Results:

Array
(
    [0] => Array
        (
            [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
            [BY] => By James R. Hagerty and Matthew Day
            [PD] => 12 July 2011
            [LP] => Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts.  However, profits wereless than expected
            [TD] => Licence this article via our website:  http://example.com
        )

)

Community · Accepted Answer · 2017-05-23 11:45:11Z

1

I prepared my own solution which came out slightly faster than jgb's answer. Here's the code:

function answer_5(array $parts, $str) {
    $result = array_fill_keys($parts, '');
    $poss = $result;
    foreach($poss as $key => &$val) {
        $val = strpos($str, "\n" . $key) + 2;
    }

    arsort($poss);

    foreach($poss as $key => $pos) {
        $result[$key] = trim(substr($str, $pos+1));
        $str = substr($str, 0, $pos-1);
    }
    return str_replace("\n", "", $result);
}

And here's comparison of the performance:

Answer 1 by: hek2mgl    2.791 seconds (regexp) 
Answer 2 by: Emo Mosley 2.553 seconds 
Answer 3 by: anubhava   3.087 seconds (regexp) 
Answer 4 by: jgb        1.53  seconds 
Answer 5 by: matewka    1.403 seconds

The testing environment was the same as jgb's (100000 iterations - script borrowed from here).

Enjoy and please leave comments.

edited May 23, 2017 at 11:45

CommunityBot


answered Aug 30, 2013 at 14:16

matewka


1 Comment

Glitch Desire Over a year ago

Have added you to the better benchmark (see my answer for why jgb's benchmark is bad).
